arXivDaily: arXiv daily academic digest, updated Monday through Friday
2506.11042 2026-06-05 cs.LG

GenFT: A Generative Parameter-Efficient Fine-Tuning Method for Pretrained Foundation Models

Guangning Xu, Baoquan Zhang, Michael K. Ng

Affiliations: Department of Mathematics, Hong Kong Baptist University, Hong Kong, China; Department of Computer Science, Harbin Institute of Technology, Shenzhen, China

AI summary: This paper proposes GenFT, a parameter-efficient fine-tuning method conditioned on the pretrained weights. It generates task-specific updates that exploit the structural information in those weights, enabling efficient fine-tuning.

Comments: paper is accepted at ICANN 2026

Abstract

Parameter-efficient fine-tuning (PEFT) has emerged as a resource-efficient strategy for adapting Pretrained Foundation Models (PFMs) by learning a small number of task-specific updates $ΔW$. Existing methods often learn $ΔW$ largely independently of pretrained weights $W_0$, or exploit $W_0$ mainly through initialization or simple reparameterization. To further leverage the structural information encoded in $W_0$, we propose Generative Parameter-Efficient Fine-Tuning (GenFT), a $W_0$-conditioned PEFT method that uses a deterministic weight generator to produce task-specific updates. Specifically, GenFT performs row and column transformations with nonlinear activations to extract structured patterns from $W_0$, and introduces a shared-specific decomposition to balance cross-layer information reuse and layer-specific flexibility. GenFT is simple and parameter-efficient, achieving competitive or better average performance across NLP and CV benchmarks. We further provide a pilot study on LLaMA-7B to examine its feasibility for generative models. The code is available at GitHub https://github.com/xuguangning1218/GenFT.
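
As a rough illustration of what a $W_0$-conditioned generator could look like, the sketch below derives an update from $W_0$ through a row transformation and a column transformation, each followed by a nonlinearity, and splits the transformation parameters into a shared part and a layer-specific part. The diagonal (per-row and per-column scaling) form of the transformations, the tanh activation, the additive shared-plus-specific split, and every name in the code are assumptions made for this example; none of them is taken from the paper or its repository.

```python
import math

def generate_update(w0, row_scale, col_scale):
    """Deterministically map pretrained weights w0 to an update of the same shape.

    Hypothetical generator: per-row scaling + tanh, then per-column scaling + tanh.
    """
    # Row transformation followed by a nonlinear activation.
    rows = [[math.tanh(row_scale[i] * w) for w in row] for i, row in enumerate(w0)]
    # Column transformation followed by a nonlinear activation.
    return [[math.tanh(col_scale[j] * v) for j, v in enumerate(row)] for row in rows]

def layer_scales(shared, specific):
    """Combine parameters reused across layers with parameters owned by one layer."""
    return [s + t for s, t in zip(shared, specific)]

# Toy frozen weight matrix of one layer (2 x 3).
w0 = [[0.5, -1.0, 0.0], [2.0, 0.25, -0.5]]

# Only these small vectors would be trained; w0 itself stays frozen.
shared_row, shared_col = [0.1, 0.1], [0.2, 0.2, 0.2]
specific_row, specific_col = [0.0, 0.05], [0.0, 0.0, 0.1]

delta_w = generate_update(
    w0,
    layer_scales(shared_row, specific_row),
    layer_scales(shared_col, specific_col),
)
adapted = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(w0, delta_w)]
```

In this toy form the update is a deterministic function of `w0`, so zero-valued generator parameters yield a zero update and the adapted layer starts out identical to the pretrained one.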

2506.10601 2026-06-05 cs.CV

Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection

Xinyuan Liu, Hang Xu, Zirui Chen, Yike Ma, Chenggang Yan, Feng Dai

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; Hefei University of Technology

AI summary: This paper proposes SSP, a training-efficient framework that combines rule-driven prior injection with data-driven label purification to address insufficient sample assignment and poor pseudo-label quality under single-point annotation. Experiments on DOTA-v1.0 and other datasets show a notable mAP gain at low training time and memory cost.

Comments: Published in Pattern Recognition, 2026

Journal ref: Pattern Recognition, Volume 180, Part B, Article 114079 (2026)

Abstract

Given its ability to reduce annotation costs, weakly supervised learning based on single-point annotations has emerged as a research focus in oriented object detection. Compared with the classical teacher-student paradigm, the simple model paradigm (e.g., PointOBB-v2) can substantially reduce the resources required for training while maintaining strong performance. The latter exhibits greater potential for low-cost training, yet such methods still face the challenges of insufficient sample assignment and poor pseudo-label quality. In this paper, we propose a training-efficient framework named SSP, which combines rule-driven prior injection with data-driven label purification. Specifically, SSP introduces two designs: (1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps; and (2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and converts them into pseudo-boxes for supervising detectors. Experiments on DOTA-v1.0 and other datasets demonstrate SSP's superiority: it achieves a +6.73% mAP improvement over the baseline while requiring only 2 h of training time and 6 GB of GPU memory. Furthermore, when SSP is integrated with a stronger detector, the mAP can reach 50.81%. The code is available at https://github.com/antxinyuan/ssp.

2506.00188 2026-06-05 cs.LG stat.ML

Cluster-Aware Causal Mixer for Online Anomaly Detection in Multivariate Time Series

Md Mahmuddun Nabi Murad, Yasin Yilmaz

Affiliations: University of California, Berkeley

AI summary: This paper proposes a cluster-aware causal mixer for online anomaly detection in multivariate time series. Channels are clustered to handle inter-channel correlations, a causal mixer preserves temporal causality, and a sequential anomaly-scoring method improves detection accuracy.

Abstract

Early and accurate detection of anomalies in time-series data is critical due to the substantial risks associated with false or missed detections. While MLP-based mixer models have shown promise in time-series analysis, they do not maintain temporal causality during data processing. Moreover, real-world multivariate time series often contain numerous channels with diverse inter-channel correlations. Spurious correlations in the reconstructed time series lead to noisy representations, resulting in inaccurate anomaly detection. In addition, anomaly scoring methods that ignore temporal continuity can mislead sequential detection. To address these challenges, we propose a cluster-aware causal mixer for multivariate time-series anomaly detection. Channels are grouped into clusters based on their correlations, and each cluster is embedded through a dedicated embedding layer. A causal mixer is introduced to integrate information while maintaining temporal causality. We further develop a sequential anomaly-scoring method that accumulates evidence over time and refines anomaly boundaries. Our proposed model operates in an online fashion, making it suitable for real-time time-series anomaly detection. Experimental evaluations across six public benchmark datasets demonstrate that the proposed approach consistently achieves superior performance.
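
To make "maintaining temporal causality" concrete, here is a minimal sketch of causal mixing along the time axis: output step t is a weighted sum of input steps 0..t only, which is equivalent to applying a lower-triangular mask to the mixing matrix. The running-mean weights and the function name are assumptions chosen for the example, not the paper's architecture.

```python
def causal_mix(series, weights):
    """Mix time steps so that output t depends only on inputs 0..t."""
    n = len(series)
    out = []
    for t in range(n):
        # weights[t][s] for s > t are never read: a lower-triangular mask.
        out.append(sum(weights[t][s] * series[s] for s in range(t + 1)))
    return out

# Toy mixer: row t averages the first t + 1 time steps (a running mean).
weights = [[1.0 / (t + 1)] * 4 for t in range(4)]

x = [1.0, 2.0, 3.0, 4.0]
x_future_changed = [1.0, 2.0, 3.0, 100.0]  # only the last step differs

y = causal_mix(x, weights)
y_changed = causal_mix(x_future_changed, weights)
```

Because the last input step cannot influence earlier outputs, `y` and `y_changed` agree on the first three steps, which is the property an online detector needs when future samples are not yet available.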

2310.04649 2026-06-05 cs.LG

Uncovering Model Processing Strategies with Non-Negative Per-Example Fisher Factorization

Michael Matena, Colin Raffel

Affiliations: University of North Carolina at Chapel Hill; University of Toronto; Vector Institute

AI summary: This paper proposes NPEFF, which decomposes per-example Fisher matrices to reveal the strategies a model uses to produce its predictions. It shows that NPEFF components apply across language models and text processing tasks, demonstrates how perturbing those components disrupts model processing, and uses ablations and experiments to show NPEFF's value for analyzing and mitigating collateral effects of unlearning and for studying in-context learning.

Abstract

We introduce NPEFF (Non-Negative Per-Example Fisher Factorization), an interpretability method that aims to uncover strategies used by a model to generate its predictions. NPEFF decomposes per-example Fisher matrices using a novel decomposition algorithm that learns a set of components represented by learned rank-1 positive semi-definite matrices. Through a combination of human evaluation and automated analysis, we demonstrate that these NPEFF components correspond to model processing strategies for a variety of language models and text processing tasks. We further show how to construct parameter perturbations from NPEFF components to selectively disrupt a given component's role in the model's processing. Along with conducting extensive ablation studies, we include experiments to show how NPEFF can be used to analyze and mitigate collateral effects of unlearning and use NPEFF to study in-context learning. Furthermore, we demonstrate the advantages of NPEFF over baselines such as gradient clustering and using sparse autoencoders for dictionary learning over model activations. We release the code used in this work.
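
One way to write the decomposition described above in symbols is given below. The first expression is the standard per-example Fisher information matrix for a model $p_\theta(y \mid x)$. The second reads the abstract's "rank-1 positive semi-definite components" with non-negative coefficients as a factorization. The symbols $W_{ij}$, $h_j$, and $r$ are notation chosen here and may not match the paper's.

```latex
F_i \;=\; \mathbb{E}_{y \sim p_\theta(\cdot \mid x_i)}
  \left[ \nabla_\theta \log p_\theta(y \mid x_i)\,
         \nabla_\theta \log p_\theta(y \mid x_i)^{\top} \right],
\qquad
F_i \;\approx\; \sum_{j=1}^{r} W_{ij}\, h_j h_j^{\top},
\quad W_{ij} \ge 0 .
```

Under this reading, each $h_j h_j^{\top}$ is one shared component and $W_{ij}$ measures how strongly example $i$ uses it.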

2505.02540 2026-06-05 cs.LG cs.AI

Lazy But Effective: Collaborative Personalized Federated Learning with Heterogeneous Data

Ljubomir Rokvic, Panayiotis Danassis, Boi Faltings

Affiliations: Artificial Intelligence Laboratory, EPFL; Telenor Research

AI summary: This paper proposes pFedLIA, a simple yet effective personalized federated learning framework. It uses a computationally efficient influence approximation, "Lazy Influence", to cluster clients in a distributed manner before model aggregation; clients within a cluster then jointly train a model that captures their specific data patterns. Experiments show that it recovers the global model's performance drop on non-IID datasets and outperforms existing baselines on several benchmark tasks.

Comments: Accepted at the International Joint Conference on Neural Networks (IJCNN), IEEE, 2025

Abstract

In Federated Learning, heterogeneity in client data distributions often means that a single global model does not have the best performance for individual clients. Consider, for example, training a next-word prediction model for keyboards: user-specific language patterns due to demographics (dialect, age, etc.), language proficiency, and writing style result in a highly non-IID dataset across clients. Other examples are medical images taken with different machines, or driving data from different vehicle types. To address this, we propose a simple yet effective personalized federated learning framework (pFedLIA) that utilizes a computationally efficient influence approximation, called "Lazy Influence", to cluster clients in a distributed manner before model aggregation. Within each cluster, data owners collaborate to jointly train a model that captures the specific data patterns of the clients. Our method has been shown to successfully recover the global model's performance drop due to the non-IID-ness in various synthetic and real-world settings, specifically a next-word prediction task on the Nordic languages as well as several benchmark tasks. It matches the performance of a hypothetical Oracle clustering, and significantly improves on existing baselines, e.g., an improvement of 17% on CIFAR100.

2503.23300 2026-06-05 cs.CV cs.RO

Learning Predictive Visuomotor Coordination

Wenqi Jia, Bolin Lai, Miao Liu, Danfei Xu, James M. Rehg

Affiliations: University of Illinois Urbana-Champaign; Georgia Tech; Meta AI

AI summary: This paper introduces a forecasting-based task for modeling visuomotor coordination: predicting head pose, gaze direction, and upper-body motion from egocentric visual and kinematic observations. The results show the importance of multimodal integration for understanding visuomotor coordination.

Comments: CVPR 2026 Findings

Abstract

Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling. Project Page: https://vjwq.github.io/VCR/.

2411.18343 2026-06-05 cs.LG cs.AI

Comprehensive and Reliable Feature Attribution for Diverse Modalities and Models via Frequency-Domain Insights

Zechen Liu, Feiyang Zhang, Wei Song, Xiang Li, Wei Wei

Affiliations: School of Computational Science, Wuhan University; Brain Research Center, Wuhan University; College of Information Science and Technology (School of Cyber Science and Technology), Shihezi University; Xinjiang Production and Construction Corps Key Laboratory of Computing Intelligence and Network Information Security Open Fund

AI summary: This paper proposes FreqX, a new interpretability method that combines signal processing and information theory to address challenges in personalized federated learning such as non-IID data, heterogeneous devices, lack of fairness, and unclear contribution. Frequency-domain analysis improves the efficiency and accuracy of the explanations.

Comments: 16 pages, 9 figures

Abstract

Personalized federated learning (PFL) allows clients to cooperatively train a personalized model without disclosing their private datasets. However, PFL suffers from non-IID data, heterogeneous devices, lack of fairness, and unclear contribution, and overcoming these challenges urgently requires interpretability of deep learning models. These challenges place new demands on interpretability: low cost, privacy, and detailed information. No current interpretability method satisfies them. In this paper, we propose a novel interpretability method, FreqX, by introducing signal processing and information theory. Our experiments show that the explanation results of FreqX contain both attribution information and concept information. FreqX runs at least 10 times faster than the baselines that contain concept information.

2503.14295 2026-06-05 cs.CV cs.AI

PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation

Baiqin Wang, Xiangyu Zhu, Fan Shen, Hao Xu, Zhen Lei

Affiliations: MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Psyche AI.INC; HKUST; CAIR, HKISI, Chinese Academy of Sciences; SCSE, FIE, M.U.S.T

AI summary: To address insufficient control over facial animation in audio-driven talking face generation, this paper proposes the PC-Talk framework, which improves lip-audio alignment and emotion control to increase the diversity and user-friendliness of generated videos.

Comments: 10 pages, 6 figures. Accepted at CVPR 2026

Abstract

Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over facial animation such as speaking style and emotional expression, resulting in uniform outputs. In this paper, we focus on improving two key factors: lip-audio alignment and emotion control, to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control focuses on elements like speaking style and the scale of lip movements, whereas emotion control is centered on generating realistic emotional expressions, allowing for modifications in multiple attributes such as intensity. To achieve precise control of facial animation, we propose a novel framework, PC-Talk, which enables lip-audio alignment and emotion control through implicit keypoint deformations. First, our lip-audio alignment control module facilitates precise editing of speaking styles at the word level and adjusts lip movement scales to simulate varying vocal loudness levels, maintaining lip synchronization with the audio. Second, our emotion control module generates vivid emotional facial features with pure emotional deformation. This module also enables the fine modification of intensity and the combination of multiple emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on both HDTF and MEAD datasets in extensive experiments.

2503.11910 2026-06-05 cs.LG cs.AI math.AT math.SG

RTD-Lite: Scalable Topological Analysis for Comparing Weighted Graphs in Learning Tasks

Eduard Tulchinskii, Daria Voronkova, Ilya Trofimov, Evgeny Burnaev, Serguei Barannikov

Affiliations: Skoltech, AI Foundation and Algorithm Lab; Skoltech, AIRI; Skoltech, CNRS

AI summary: This paper proposes RTD-Lite, an algorithm that uses minimal spanning trees of auxiliary graphs to compare topological features of weighted graphs in O(n²) time. It suits tasks such as dimensionality reduction and neural network training, and experiments show that it identifies topological differences while reducing computation time compared to existing methods.

Comments: Accepted for AISTATS 2025

Abstract

Topological methods for comparing weighted graphs are valuable in various learning tasks but often suffer from computational inefficiency on large datasets. We introduce RTD-Lite, a scalable algorithm that efficiently compares topological features, specifically connectivity or cluster structures at arbitrary scales, of two weighted graphs with one-to-one correspondence between vertices. Using minimal spanning trees in auxiliary graphs, RTD-Lite captures topological discrepancies with $O(n^2)$ time and memory complexity. This efficiency enables its application in tasks like dimensionality reduction and neural network training. Experiments on synthetic and real-world datasets demonstrate that RTD-Lite effectively identifies topological differences while significantly reducing computation time compared to existing methods. Moreover, integrating RTD-Lite into neural network training as a loss function component enhances the preservation of topological structures in learned representations. Our code is publicly available at https://github.com/ArGintum/RTD-Lite
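
The idea of comparing two weighted graphs through minimal spanning trees of an auxiliary graph can be sketched as follows. The code computes MST weights with an $O(n^2)$ Prim's algorithm on dense matrices and builds the auxiliary graph as the elementwise minimum of the two weight matrices. Both the choice of auxiliary graph and the discrepancy shown (MST weight of each graph minus MST weight of the auxiliary graph) are illustrative assumptions, not the exact RTD-Lite definition; see the authors' repository for that.

```python
def mst_weight(w):
    """Total weight of a minimal spanning tree of a dense symmetric matrix (Prim, O(n^2))."""
    n = len(w)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge from the tree to each vertex
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v] and w[u][v] < best[v]:
                best[v] = w[u][v]
    return total

def mst_discrepancy(w1, w2):
    """Illustrative discrepancy between two graphs on the same vertex set."""
    n = len(w1)
    # Auxiliary graph (assumption): elementwise minimum of the two weight matrices.
    aux = [[min(w1[i][j], w2[i][j]) for j in range(n)] for i in range(n)]
    base = mst_weight(aux)
    return mst_weight(w1) - base, mst_weight(w2) - base

w1 = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
w2 = [[0, 3, 1], [3, 0, 5], [1, 5, 0]]
d1, d2 = mst_discrepancy(w1, w2)
```

Both values are non-negative because lowering edge weights can never increase the MST weight, and both are zero when the two graphs are identical.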

2409.13607 2026-06-05 cs.RO

RECON: Reducing Causal Confusion with Human-Placed Markers

Robert Ramirez Sanchez, Heramb Nemlekar, Shahabedin Sagheb, Cara M. Nunez, Dylan P. Losey

Affiliations: Collaborative Robotics Lab (Collab), Dept. of Mechanical Engineering, Virginia Tech, Blacksburg, VA 24061; Sibley School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, NY 14853

AI summary: This work proposes the RECON framework, in which a human proactively marks key parts of a task with beacons to reduce causal confusion in robot learning. The beacon data is used to train a task-relevant state embedding, which improves learning efficiency.

Comments: 7 pages, 5 figures

Abstract

Imitation learning enables robots to learn new tasks from human examples. One fundamental limitation while learning from humans is causal confusion. Causal confusion occurs when the robot's observations include both task-relevant and extraneous information: for instance, a robot's camera might see not only the intended goal, but also clutter and changes in lighting within its environment. Because the robot does not know which aspects of its observations are important a priori, it often misinterprets the human's examples and fails to learn the desired task. To address this issue, we highlight that -- while the robot learner may not know what to focus on -- the human teacher does. In this paper we propose that the human proactively marks key parts of their task with small, lightweight beacons. Under our framework (RECON) the human attaches these beacons to task-relevant objects before providing demonstrations: as the human shows examples of the task, beacons track the position of marked objects. We then harness this offline beacon data to train a task-relevant state embedding. Specifically, we embed the robot's observations to a latent state that is correlated with the measured beacon readings: in practice, this causes the robot to autonomously filter out extraneous observations and make decisions based on features learned from the beacon data. Our simulations and a real robot experiment suggest that this framework for human-placed beacons mitigates causal confusion. Indeed, we find that using RECON significantly reduces the number of demonstrations needed to convey the task, lowering the overall time required for human teaching. See videos here: https://youtu.be/oy85xJvtLSU

2502.20914 2026-06-05 cs.LG cs.AI cs.CL

Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?

Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard

Affiliations: Université Grenoble Alpes, CNRS, Grenoble INP, LIG

AI summary: This paper asks whether, under the criteria of mechanistic interpretability (MI), a given behavior has a unique explanation. Drawing on statistical identifiability, it analyzes the identifiability of MI explanations, identifies two main MI strategies, and reports experimental results.

Journal ref: The Thirteenth International Conference on Learning Representations (ICLR 2025)

Abstract

As AI systems are used in high-stakes applications, ensuring interpretability is crucial. Mechanistic Interpretability (MI) aims to reverse-engineer neural networks by extracting human-understandable algorithms to explain their behavior. This work examines a key question: for a given behavior, and under MI's criteria, does a unique explanation exist? Drawing on identifiability in statistics, where parameters are uniquely inferred under specific assumptions, we explore the identifiability of MI explanations. We identify two main MI strategies: (1) "where-then-what," which isolates a circuit replicating model behavior before interpreting it, and (2) "what-then-where," which starts with candidate algorithms and searches for neural activation subspaces implementing them, using causal alignment. We test both strategies on Boolean functions and small multi-layer perceptrons, fully enumerating candidate explanations. Our experiments reveal systematic non-identifiability: multiple circuits can replicate behavior, a circuit can have multiple interpretations, several algorithms can align with the network, and one algorithm can align with different subspaces. Is uniqueness necessary? A pragmatic approach may require only predictive and manipulability standards. If uniqueness is essential for understanding, stricter criteria may be needed. We also reference the inner interpretability framework, which validates explanations through multiple criteria. This work contributes to defining explanation standards in AI.
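
A toy version of "multiple circuits can replicate behavior" can be checked by brute force. The sketch below enumerates every two-level circuit of the form g3(g1(a, b), g2(a, b)) over four gate types and keeps those that compute XOR on all inputs. The gate set, the circuit shape, and the XOR target are assumptions chosen to keep the example small; the paper's experiments use their own setups.

```python
from itertools import product

# Candidate gates (an assumed, deliberately small gate set).
GATES = {
    "AND": lambda p, q: p and q,
    "OR": lambda p, q: p or q,
    "NAND": lambda p, q: not (p and q),
    "NOR": lambda p, q: not (p or q),
}
INPUTS = list(product([False, True], repeat=2))

def target(a, b):
    """Behavior to explain: XOR."""
    return a != b

def circuits_matching(target_fn):
    """Return every circuit g3(g1(a,b), g2(a,b)) that matches target_fn on all inputs."""
    found = []
    for (n1, g1), (n2, g2), (n3, g3) in product(GATES.items(), repeat=3):
        if all(g3(g1(a, b), g2(a, b)) == target_fn(a, b) for a, b in INPUTS):
            found.append(f"{n3}({n1}(a,b), {n2}(a,b))")
    return found

circuits = circuits_matching(target)
for c in circuits:
    print(c)
```

Even in this tiny search space more than one structurally different circuit reproduces the same input-output behavior, for example one built around AND of OR and NAND and another built around NOR of AND and NOR, so behavior alone does not single out one explanation.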

2502.14145 2026-06-05 cs.CL eess.AS

LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu

Affiliations: Tencent AI Lab

AI summary: This paper proposes an LLM-based semantic voice activity detection module that manages turn-taking in full-duplex spoken dialogue systems. A lightweight LLM distinguishes intentional from unintentional barge-ins, and processing input speech in short intervals enables real-time decisions while reducing computational overhead.

Abstract

Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.

2502.06434 2026-06-05 cs.CV cs.LG

Unifying Dataset Pruning and Distillation for Efficient Large-scale Compression

Lingao Xiao, Songhua Liu, Yang He, Xinchao Wang

Affiliations: University of Science and Technology of China

AI summary: This paper proposes a unified dataset compression benchmark to study the convergence of dataset pruning and dataset distillation. It finds that soft-label distillation underperforms pruning at small dataset sizes, and proposes a hard-label dataset compression approach, the PCA framework, which improves image quality and storage efficiency.

Comments: Accepted by ICML 2026

Abstract

Dataset pruning (DP) and dataset distillation (DD) fundamentally differ in their outputs: DP selects original image subsets, while DD generates synthetic images. Recently, DD's increasing reliance on original images suggests a convergence of the two directions. To investigate this convergence trend, we propose a unified dataset compression (DC) benchmark. This benchmark reveals an interesting trade-off for soft-label-DD: while soft labels provide valuable information, they can make the distillation process less essential, as distilled images may not always outperform random subsets. In addition, the benchmark reveals that, at the current stage, dataset pruning outperforms dataset distillation at small dataset sizes. Given these observations, we explore hard-label-DC as a complementary approach that emphasizes image quality while offering substantial storage efficiency. Our PCA (Prune, Combine, and Augment) is the first framework that does not rely on soft labels but instead focuses on image quality. (1) "P" means selecting easy samples based on dataset pruning metrics, (2) "C" indicates combining these samples effectively, and (3) "A" is to apply constrained image augmentation during training. Our code is available at https://github.com/ArmandXiao/Unifying-Dataset-Pruning-and-Distillation

2502.02487 2026-06-05 cs.CV

Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives

Simone Alberto Peirone, Francesca Pistilli, Antonio Alliegro, Tatiana Tommasi, Giuseppe Averta

Affiliations: Department of Control and Computer Engineering

AI summary: This paper proposes Hier-EgoPack, which extends EgoPack with a hierarchical architecture and GNN layers to reason across multiple temporal granularities, addressing video understanding across a variety of downstream tasks.

Comments: Project webpage at https://sapeirone.github.io/hier-egopack

Abstract

Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, all at once. To endow autonomous systems with such a holistic perception, learning how to correlate concepts, abstract knowledge across diverse tasks, and leverage task synergies when learning novel skills is essential. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by also enabling reasoning across diverse temporal granularities, expanding its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4D benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.

2410.13056 2026-06-05 cs.CL cs.AI

Channel-Wise Mixed-Precision Quantization for Large Language Models

Zihan Chen, Bike Xie, Jundong Li, Cong Shen

Affiliations: Department of Electrical and Computer Engineering, University of Virginia; Kneron Inc.

AI summary: This paper proposes channel-wise mixed-precision quantization (CMPQ), which assigns different precision levels to weight channels based on activation distributions. It supports arbitrary average bit-widths in the low-bit regime and improves performance with a modest increase in memory usage.

Abstract

Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter sizes. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing the full utilization of available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ supports arbitrary average bit-widths in the low-bit regime (e.g., between 2 and 4 bits). CMPQ employs a non-uniform quantization strategy and incorporates two outlier extraction techniques that collaboratively preserve the critical information, thereby minimizing the quantization loss. Experiments on nine different LLMs demonstrate that CMPQ not only enhances performance in integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage by performing in a mixed-precision way. CMPQ represents an adaptive and effective approach to LLM quantization, offering substantial benefits across diverse device capabilities.
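
The arithmetic behind "arbitrary average bit-widths" can be illustrated with a two-level allocation: give a higher bit-width to the most important channels and a lower one to the rest, with the split chosen so the mean matches a fractional target such as 2.5 bits. The importance score (imagined here as a per-channel activation statistic), the two-level scheme, and the function name are assumptions for this sketch; CMPQ's actual allocation rule and its outlier handling are not reproduced.

```python
def allocate_bits(channel_scores, target_avg_bits, low_bits=2, high_bits=4):
    """Assign high_bits to the top-scoring channels so the mean bit-width hits the target."""
    n = len(channel_scores)
    # Fraction of channels needing high precision follows from
    # target = low + fraction * (high - low).
    n_high = round((target_avg_bits - low_bits) / (high_bits - low_bits) * n)
    ranked = sorted(range(n), key=lambda c: channel_scores[c], reverse=True)
    bits = [low_bits] * n
    for c in ranked[:n_high]:
        bits[c] = high_bits
    return bits

# Hypothetical per-channel importance scores (e.g. mean absolute activation).
scores = [0.1, 2.3, 0.4, 0.2, 1.7, 0.3, 0.05, 0.6]
bits = allocate_bits(scores, target_avg_bits=2.5)
average_bits = sum(bits) / len(bits)
```

With eight channels, a 2.5-bit target, and 2-bit and 4-bit levels, two channels receive 4 bits and six receive 2 bits, so the average is exactly 2.5 bits. When the target does not divide evenly over the channel count, rounding makes the achieved average only approximate.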

2407.10486 2026-06-05 cs.AI cs.CL

IDEAL: Leveraging Infinite and Dynamic Characterizations of Large Language Models for Query-focused Summarization

IDEAL: 利用大型语言模型的无限和动态特性进行查询导向的摘要

Jie Cao, Dian Jiao, Yang Dai, Rolan Yan, Wenqiao Zhang, Siliang Tang

发表机构 * Zhejiang University(浙江大学) Tencent, Wechat(腾讯,微信)

AI总结 本文针对查询导向摘要问题,提出两种核心方法:高效细粒度查询-LLM对齐和长文档摘要,通过Query-aware HyperExpert和Query-focused Infini-attention模块实现,实验验证了方法的有效性和通用性。

详情
AI中文摘要

查询导向摘要(QFS)旨在生成回答特定问题的摘要,使用户能够更好地控制和个性化内容。大型语言模型(LLMs)的出现,展现了其通过大规模预训练获得的强大文本理解能力,这意味着其在抽取式片段生成方面具有巨大潜力。本文系统地研究了基于LLM的QFS模型应具备的两个不可或缺的特性,即高效细粒度查询-LLM对齐和长文档摘要。相应地,我们提出了两个模块,称为Query-aware HyperExpert和Query-focused Infini-attention,以实现上述特性。这些创新为QFS技术的更广泛应用和可及性铺平了道路。在现有QFS基准上的大量实验表明了所提出方法的有效性和泛化能力。

英文摘要

Query-focused summarization (QFS) aims to produce summaries that answer particular questions of interest, enabling greater user control and personalization. The advent of large language models (LLMs) has shown their impressive capability for textual understanding through large-scale pretraining, which implies great potential for extractive snippet generation. In this paper, we systematically investigate two indispensable characteristics that LLM-based QFS models should possess: Efficiently Fine-grained Query-LLM Alignment and Lengthy Document Summarization. Correspondingly, we propose two modules, Query-aware HyperExpert and Query-focused Infini-attention, to realize these characteristics. These innovations pave the way for broader application and accessibility in the field of QFS technology. Extensive experiments conducted on existing QFS benchmarks indicate the effectiveness and generalizability of the proposed approach.

2412.07583 2026-06-05 cs.CV cs.AI

Mobile Video Diffusion

移动视频扩散

Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian

发表机构 * Qualcomm AI Research(高通人工智能研究)

AI总结 本文提出了一种移动优化的视频扩散模型MobileVD,通过降低帧分辨率、引入多尺度时间表示和两种新的剪枝方案,显著降低了内存和计算成本,同时在移动设备上实现了高效的视频生成。

详情
AI中文摘要

视频扩散模型已实现了出色的真实感和可控性,但其高计算需求限制了其在移动设备上的应用。本文介绍了首个面向移动端优化的视频扩散模型。从Stable Video Diffusion (SVD) 的时空UNet出发,我们通过降低帧分辨率、引入多尺度时间表示,以及引入两种用于减少通道数和时间块数量的新剪枝方案,来降低内存和计算成本。此外,我们采用对抗微调将去噪过程减少到一步。我们的模型称为MobileVD,效率提高了523倍(1817.2 vs. 4.34 TFLOPs),质量略有下降(FVD 149 vs. 171),在Xiaomi-14 Pro上为14x512x256像素的视频片段生成潜变量(latents)仅需1.7秒。我们的结果可在 https://qualcomm-ai-research.github.io/mobile-video-diffusion/ 上查看。

英文摘要

Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schema to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined as MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/

2406.08966 2026-06-05 cs.LG cs.AI

Separation Power of Equivariant Neural Networks

等变神经网络的分离能力

Marco Pacini, Xiaowen Dong, Bruno Lepri, Gabriele Santin

发表机构 * University of Trento(特伦托大学) Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会) University of Oxford(牛津大学) University of Venice(威尼斯大学)

AI总结 本文研究了等变神经网络的分离能力,分析了架构和超参数对分离能力的影响,发现非多项式激活函数在表达能力上等价,深度在阈值后不再提升分离能力,而隐表示的块分解会影响分离能力。

Comments Published as a conference paper at ICLR 2025

详情
Journal ref
International Conference on Learning Representations (ICLR), 2025
AI中文摘要

机器学习模型的分离能力是指其区分不同输入的能力,常被用作表达能力的代理。事实上,了解一个模型族的分离能力是获得细粒度通用性(universality)结果的必要条件。在本文中,我们分析了等变神经网络(如卷积网络和置换不变网络)的分离能力。我们首先完整刻画了由给定架构导出的模型所无法区分的输入。基于这些结果,我们推导出分离能力如何受到超参数和架构选择(如激活函数、深度、隐藏层宽度和表示类型)的影响。值得注意的是,所有非多项式激活函数(包括ReLU和Sigmoid)在表达能力上是等价的,并能达到最大分离能力。深度在达到某一阈值之前会提升分离能力,超过该阈值后继续增加深度则不再有效果。在隐表示中添加不变特征不影响分离能力。最后,隐表示的块分解会影响分离能力,其中极小分量在分离能力上构成一个层次结构,从而提供了一种直接比较模型分离能力的方法。

英文摘要

The separation power of a machine learning model refers to its ability to distinguish between different inputs and is often used as a proxy for its expressivity. Indeed, knowing the separation power of a family of models is a necessary condition to obtain fine-grained universality results. In this paper, we analyze the separation power of equivariant neural networks, such as convolutional and permutation-invariant networks. We first present a complete characterization of inputs indistinguishable by models derived from a given architecture. From these results, we derive how separability is influenced by hyperparameters and architectural choices, such as activation functions, depth, hidden layer width, and representation types. Notably, all non-polynomial activations, including ReLU and sigmoid, are equivalent in expressivity and reach maximum separation power. Depth improves separation power up to a threshold, after which further increases have no effect. Adding invariant features to hidden representations does not impact separation power. Finally, block decomposition of hidden representations affects separability, with minimal components forming a hierarchy in separation power that provides a straightforward method for comparing the separation power of models.

2406.12620 2026-06-05 cs.CL

What Makes Two Language Models Think Alike?

是什么让两个语言模型思考相似?

Louis Jalouzot, Christophe Pallier, Emmanuel Chemla, Yair Lakretz

发表机构 * UNICOG CNRS(法国国家科学研究中心) INSERM(法国国家健康与医学研究院) CEA(法国原子能和替代能源委员会) Paris-Saclay University(巴黎-萨克雷大学) LSCP(认知科学与心理语言学实验室) EHESS(法国社会科学高等研究院) ENS(巴黎高等师范学校) PSL University(巴黎文理研究大学)

AI总结 本文研究了语言模型表示和处理语言的方式是否受架构和训练差异影响,提出了一种新的方法来量化模型间相似性和差异性,并发现模型相似性主要由发布日期和模型家族决定。

Comments 25 pages, 13 figures

详情
AI中文摘要

模型的架构和训练差异是否影响它们表示和处理语言的方式?传统相似性度量只能告诉我们两个模型是否具有相似的表示几何,但无法解释原因。本文提出了一种新的、简单的方法来解决这个问题。该方法将每个模型各层的神经活动映射到一组可解释的语言特征,并量化这些特征如何驱动模型间的相似性和差异性。我们使用这种方法比较了43个语言模型,涵盖10个家族,包括解码器Transformer、状态空间模型和循环神经网络。我们发现,模型层面的相似性主要由发布日期(作为通用LLM发展的代理)和模型家族决定,表明语言签名并非主要由规模或架构类别决定。总体而言,我们的方法提供了一种将理论动机的符号描述与神经表示联系起来的方法,并可以轻易扩展到其他领域如语音和视觉,以及到其他神经系统如生物大脑。

英文摘要

Do architectural and training differences influence the way models represent and process language? Traditional similarity metrics tell us whether two models share a similar representational geometry, but they cannot explain why. Here, we propose a new, simple approach to address this question. This approach maps neural activity in each model layer onto a set of interpretable linguistic features and quantifies how much each of them drives similarities and differences between models. We use this approach to compare 43 language models across 10 families, including decoder Transformers, State-Space Models, and Recurrent Neural Networks. We find that model-level similarity is driven most strongly by release date, a proxy for general LLM development, and model family, suggesting that linguistic signatures are not primarily shaped by scale or architecture class. Overall, our approach provides a way to link theoretically motivated symbolic descriptions to neural representations and can readily be extended to other domains such as speech and vision, and to other neural systems such as biological brains.

2311.07565 2026-06-05 cs.LG stat.ML

Exploration via linearly perturbed loss minimisation

通过线性扰动损失最小化进行探索

David Janz, Shuai Liu, Alex Ayoub, Csaba Szepesvári

发表机构 * University of Alberta(阿尔伯塔大学)

AI总结 本文提出了一种基于线性扰动损失的探索方法EVILL,通过求解线性扰动的正则化负对数似然函数的最小化问题,解释了随机奖励扰动为何能产生有效的多臂老虎机算法,并展示了数据依赖扰动如何使EVILL在理论和实践中达到与Thompson采样类参数扰动方法相当的性能。

Comments Updated with erratum note: Appendix I contains a gap in the proof; all main-paper claims remain valid via the corrected argument of Perneczky, Abeille & Janz (2026, arXiv:2606.00431)

详情
AI中文摘要

我们引入了通过线性损失扰动进行探索(EVILL),一种用于结构化随机老虎机问题的随机化探索方法,其通过求解线性扰动的正则化负对数似然函数的最小化问题来工作。我们证明,在广义线性老虎机的情况下,EVILL简化为扰动历史探索(PHE),一种通过在随机扰动的奖励上进行训练来实现探索的方法。由此,我们对随机奖励扰动何时以及为何能产生有效的老虎机算法给出了简单清晰的解释。我们提出了以往PHE类方法中未出现的数据依赖扰动,使EVILL在理论和实践中都能匹配Thompson采样风格的参数扰动方法的性能。此外,我们展示了广义线性老虎机之外的一个例子,其中PHE导致不一致的估计,从而产生线性遗憾,而EVILL仍然表现良好。与PHE一样,EVILL可以通过几行代码实现。

英文摘要

We introduce exploration via linear loss perturbations (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. We propose data-dependent perturbations not present in previous PHE-type methods that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside generalised linear bandits where PHE leads to inconsistent estimates, and thus linear regret, while EVILL remains performant. Like PHE, EVILL can be implemented in just a few lines of code.
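The stated reduction of EVILL to PHE can be made concrete in the simplest special case, a linear model with squared loss. The derivation below is an illustrative sketch under that assumption only. The symbols $X$, $y$, $\lambda$, $w$, $z$ and $V$ are introduced here and are not taken from the paper, and the abstract does not specify the distribution of the perturbation or how the regulariser is treated.

```latex
% Assumed setting: linear model with squared loss (a special case, for illustration only).
% X \in \mathbb{R}^{n \times d} stacks the features, y \in \mathbb{R}^n the rewards, \lambda > 0.
\begin{align*}
L(\theta) &= \tfrac12 \|X\theta - y\|_2^2 + \tfrac{\lambda}{2}\|\theta\|_2^2,
  \qquad V = X^\top X + \lambda I, \\
\hat\theta_{w} &= \arg\min_\theta \bigl\{ L(\theta) - w^\top \theta \bigr\}
  = V^{-1}\bigl(X^\top y + w\bigr)
  && \text{(linearly perturbed loss)}, \\
\hat\theta_{z} &= \arg\min_\theta \bigl\{ \tfrac12 \|X\theta - (y+z)\|_2^2 + \tfrac{\lambda}{2}\|\theta\|_2^2 \bigr\}
  = V^{-1} X^\top (y + z)
  && \text{(perturbed rewards)}.
\end{align*}
% The two minimisers coincide when w = X^\top z.
```

In this special case a linear perturbation $w = X^\top z$ of the loss gives the same estimate as refitting on rewards perturbed by $z$, which is the kind of equivalence the abstract describes for generalised linear bandits.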

2308.10897 2026-06-05 cs.CV

Can Language Models Learn to Listen?

语言模型能否学会倾听?

Evonne Ng, Sanjay Subramanian, Dan Klein, Angjoo Kanazawa, Trevor Darrell, Shiry Ginosar

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出了一种基于说话人话语生成适当面部回应的框架,通过将量化后的面部动作元素作为额外语言token输入到基于transformer的大型语言模型中,从而提升听者回应的质量。

Comments ICCV 2023; Project page: https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/

详情
AI中文摘要

我们提出了一种框架,用于在双人社交互动中根据说话人的词语生成适当的面部回应。给定一个包含说话人词语及其时间戳的输入转录,我们的方法自回归地预测听众的回应:一系列听众的面部动作,通过VQ-VAE进行量化。由于动作是语言的一部分,我们提出将量化后的原子动作元素作为额外的语言token输入到基于transformer的大型语言模型中。使用仅在文本上预训练的语言模型权重初始化transformer,可以显著提高听众回应的质量,优于从头开始训练transformer。我们通过定量指标和定性用户研究展示了生成的听众动作流畅且反映了语言语义。在我们的评估中,我们分析了模型利用口语文本的时间和语义方面的能力。项目页面:https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/

英文摘要

We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words. Given an input transcription of the speaker's words with their timestamps, our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE. Since gesture is a language component, we propose treating the quantized atomic motion elements as additional language token inputs to a transformer-based large language model. Initializing our transformer with the weights of a language model pre-trained only on text results in significantly higher quality listener responses than training a transformer from scratch. We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study. In our evaluation, we analyze the model's ability to utilize temporal and semantic aspects of spoken text. Project page: https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/

2306.09712 2026-06-05 cs.LG cs.AI cs.CL

Semi-Offline Reinforcement Learning for Optimized Text Generation

半离线强化学习用于优化文本生成

Changyu Chen, Xiting Wang, Yiqiao Jin, Victor Ye Dong, Li Dong, Jie Cao, Yi Liu, Rui Yan

发表机构 * 未知机构

AI总结 本文提出了一种半离线强化学习方法,平衡了探索能力和训练成本,并在优化成本、渐近误差和过拟合误差界方面实现了最优的强化学习设置。

Comments In Proceedings of the 40th International Conference on Machine Learning (ICML 2023)

详情
AI中文摘要

在强化学习(RL)中,与环境交互有两种主要设置:在线和离线。在线方法在显著的时间成本下探索环境,而离线方法通过牺牲探索能力高效地获得奖励信号。我们提出了一种半离线RL,一种新的范式,能够从离线过渡到在线设置,平衡探索能力和训练成本,并为比较不同的RL设置提供理论基础。基于半离线公式,我们提出了在优化成本、渐近误差和过拟合误差界方面最优的RL设置。广泛实验表明,我们的半离线方法高效且在与最新方法相比时表现相当或更好。

英文摘要

In reinforcement learning (RL), there are two major settings for interacting with the environment: online and offline. Online methods explore the environment at significant time cost, and offline methods efficiently obtain reward signals by sacrificing exploration capability. We propose semi-offline RL, a novel paradigm that smoothly transitions from offline to online settings, balances exploration capability and training cost, and provides a theoretical foundation for comparing different RL settings. Based on the semi-offline formulation, we present the RL setting that is optimal in terms of optimization cost, asymptotic error, and overfitting error bound. Extensive experiments show that our semi-offline approach is efficient and yields comparable or often better performance compared with state-of-the-art methods.

2305.12640 2026-06-05 cs.AI cs.LG stat.ML

Limited Resource Allocation in a Non-Markovian World: The Case of Maternal and Child Healthcare

在非马尔可夫世界中的有限资源分配:妇幼保健的案例

Panayiotis Danassis, Shresth Verma, Jackson A. Killian, Aparna Taneja, Milind Tambe

发表机构 * Harvard University(哈佛大学) Google Research(谷歌研究)

AI总结 本文研究了在非马尔可夫环境下如何通过时间序列方法优化资源分配,提出了一种新的时间序列臂排名指数(TARI)策略,以提高妇幼保健项目的参与度和依从性。

Comments Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023)

详情
AI中文摘要

许多医疗项目的成功取决于参与者的依从性。我们考虑在资源有限的环境中安排干预措施(例如由健康工作者及时拨打支持电话),以提高依从性和/或参与度。以往的工作已经成功开发了若干类基于不安分多臂老虎机(Restless Multi-armed Bandit,RMAB)的解决方案。然而,所有以往的RMAB方法都假设参与者的行为遵循马尔可夫性质。我们在合作伙伴NGO ARMMAN的孕产妇健康意识项目的真实数据上,展示了对马尔可夫假设的显著偏离。此外,我们将RMAB扩展到连续状态空间,这是此前研究不足的领域。为解决广义的非马尔可夫RMAB设置,我们(i)将每个参与者的轨迹建模为时间序列,(ii)利用时间序列预测模型学习复杂模式和动态以预测未来状态,(iii)提出时间序列臂排名指数(TARI)策略,这是一种新算法,根据我们对未来状态的预测,选择最能从干预中受益的RMAB臂。我们在合成数据和ARMMAN真实数据的二次分析上评估了我们的方法,并证明与已部署的SOTA Whittle指数解决方案相比,参与度显著提高。这相当于多收听了16.3小时的内容,多防止了90.8%的参与度下降,并覆盖了两倍以上的高流失风险受益人。

英文摘要

The success of many healthcare programs depends on participants' adherence. We consider the problem of scheduling interventions in low-resource settings (e.g., placing timely support calls from health workers) to increase adherence and/or engagement. Past works have successfully developed several classes of Restless Multi-armed Bandit (RMAB) based solutions for this problem. Nevertheless, all past RMAB approaches assume that the participants' behaviour follows the Markov property. We demonstrate significant deviations from the Markov assumption on real-world data from a maternal health awareness program run by our partner NGO, ARMMAN. Moreover, we extend RMABs to continuous state spaces, a previously understudied area. To tackle the generalised non-Markovian RMAB setting we (i) model each participant's trajectory as a time-series, (ii) leverage the power of time-series forecasting models to learn complex patterns and dynamics to predict future states, and (iii) propose the Time-series Arm Ranking Index (TARI) policy, a novel algorithm that selects the RMAB arms that will benefit the most from an intervention, given our future state predictions. We evaluate our approach on both synthetic data and a secondary analysis of real data from ARMMAN, and demonstrate a significant increase in engagement compared to the SOTA, deployed Whittle index solution. This translates to 16.3 hours of additional content listened to, 90.8% more engagement drops prevented, and reaching more than twice as many high dropout-risk beneficiaries.
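Step (iii), selecting the arms expected to benefit most from an intervention, can be sketched as a ranking over forecasted outcomes. This is a minimal illustration, not the TARI index itself: the abstract does not define the index, so ranking by the difference between two forecasts is an assumption, and the names `pred_without_call`, `pred_with_call` and `budget` are hypothetical.

```python
def select_arms(pred_without_call, pred_with_call, budget):
    """Return the indices of the `budget` arms with the largest predicted gain.

    Both inputs are hypothetical forecasts of each participant's future
    engagement (higher is better), one assuming no intervention and one
    assuming an intervention is delivered now.
    """
    gains = [w - n for n, w in zip(pred_without_call, pred_with_call)]
    # Sort arm indices from largest to smallest predicted gain.
    ranked = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    return ranked[:budget]


no_call = [0.2, 0.5, 0.9, 0.4]
with_call = [0.6, 0.55, 0.92, 0.85]
chosen = select_arms(no_call, with_call, budget=2)
```

Here participants 3 and 0 are selected because their forecasts improve the most under an intervention. Participant 2 is already highly engaged and gains little, so it is not chosen.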

2110.06847 2026-06-05 cs.CL cs.CY cs.SI physics.soc-ph

Ousiometrics: The essence of meaning aligns with a power-danger-structure framework instead of valence-arousal-dominance

Ousiometrics: 意义的本质与权力-危险-结构框架相一致,而非效价-唤醒-支配框架

P. S. Dodds, T. Alshaabi, M. I. Fudolig, J. W. Zimmerman, J. Lovato, S. Beaulieu, J. R. Minot, M. V. Arnold, A. J. Reagan, C. M. Danforth

发表机构 * Computational Story Lab, Vermont Advanced Computing Center, University of Vermont, Burlington, VT 05405, United States(计算故事实验室、佛蒙特高级计算中心、佛蒙特大学、伯灵顿,VT 05405,美国) Vermont Complex Systems Institute, MassMutual Center of Excellence for Complex Systems and Data Science, University of Vermont, Burlington, VT 05405, United States(佛蒙特复杂系统研究所、马斯穆特复杂系统和数据科学卓越中心、佛蒙特大学、伯灵顿,VT 05405,美国) Department of Computer Science, University of Vermont, Burlington, VT 05405, United States(计算机科学系、佛蒙特大学、伯灵顿,VT 05405,美国) Santa Fe Institute, 1399 Hyde Park Rd, Santa Fe, NM 87501, United States(圣达菲研究所、1399号海德公园路,圣达菲,NM 87501,美国) Howard Hughes Medical Institute, Janelia Research Campus, Ashburn, VA 20147, United States(霍华德·休斯医学研究所、贾能利亚研究校区、阿什伯恩,VA 20147,美国) Advanced Bioimaging Center, University of California Berkeley, Berkeley, CA 94720, United States(先进生物成像中心、加州大学伯克利分校、伯克利,CA 94720,美国) School of Computer and Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia(计算机与数学科学学院、阿德莱德大学、阿德莱德,SA 5005,澳大利亚) Computational Ethics Lab, University of Vermont, Burlington, VT 05405, United States(计算伦理实验室、佛蒙特大学、伯灵顿,VT 05405,美国)

AI总结 本文提出了一种新的意义本质描述框架GPADS,通过分析英语语料库发现,意义本质应由权力-危险-结构框架描述,并构建了ousiometer原型。

Comments 115 pages (30 page main manuscript, 85 page appendix), 82 figures (9 main, 73 appendix), 3 tables (2 main, 1 appendix)

详情
Journal ref
Science Advances, 12(9): eadr4039, 2026
AI中文摘要

自20世纪中叶以来的研究使人们广泛接受:意义的本质由效价、唤醒和支配(VAD)三个正交维度描述。这些基本维度已成为许多领域情感分析的基石。通过先后重新考察英语的类型(types)和词例(tokens),并利用自动标注的直方图(即“ousiograms”),我们发现:词语所传达的意义本质最好由善-权力-攻击-危险-结构环状框架(GPADS)描述;大规模英语语料库揭示了对安全、低危险词语的系统性偏向;并且权力-危险-结构(PDS)框架是表示基本意义的最小框架。我们发现GPADS框架与心理状态、虚构原型等其他空间之间存在显著的一致性,并构建和演示了一个ousiometer原型。

英文摘要

From work emerging through the middle of the 20th century, the essence of meaning has become widely accepted as being described by the three orthogonal dimensions of valence, arousal, and dominance (VAD). These essential dimensions have become the cornerstone of sentiment analysis across many fields. By re-examining first types and then tokens for the English language, and through the use of automatically annotated histograms ('ousiograms'), we find here that: the essence of meaning conveyed by words is instead best described by a goodness-power-aggression-danger-structure circumplex framework (GPADS); that large-scale English language corpora reveal a systematic bias toward safe, low-danger words; and that the power-danger-structure (PDS) framework is the minimal framework that represents essential meaning. We find remarkable congruences between the GPADS framework and other spaces including mental states and fictional archetypes, and we construct and demonstrate a prototype ousiometer.

2606.06492 2026-06-05 cs.SE cs.AI cs.CL

Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

Code2LoRA:用于软件演化下代码语言模型的超网络生成适配器

Liliana Hotsko, Yinxi Li, Yuntian Deng, Pengyu Nie

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出Code2LoRA超网络框架,通过生成仓库特定的LoRA适配器注入仓库知识,无需推理时令牌开销,支持静态和演化两种场景,在RepoPeftBench上达到与逐仓库LoRA相当或更优的性能。

详情
AI中文摘要

代码语言模型需要仓库级上下文来解析导入、API和项目约定。现有方法通过长输入(由RAG或依赖分析检索得到)或逐仓库微调和LoRA来注入这些知识,这在仓库规模上成本高昂,且对不断演化的代码库很脆弱。我们引入Code2LoRA,一个超网络框架,生成仓库特定的LoRA适配器,有效地注入仓库知识,且推理时令牌开销为零。Code2LoRA支持两种使用场景:Code2LoRA-Static将单个仓库快照转换为适配器,适用于理解稳定的代码库;而Code2LoRA-Evo维护一个由GRU隐藏状态支持的适配器,该状态随每次代码差异更新,适用于演化代码库的活跃开发。为了将Code2LoRA与参数高效微调基线进行比较,我们构建了RepoPeftBench,一个包含604个Python仓库的基准,分为两个轨道:静态轨道包含40K训练和12K测试的断言补全任务;演化轨道包含215K由提交派生的训练任务和87K由提交派生的测试任务。在静态轨道上,Code2LoRA-Static实现了63.8%的跨仓库和66.2%的仓库内精确匹配,与逐仓库LoRA上界相当;在演化轨道上,Code2LoRA-Evo实现了60.3%的跨仓库精确匹配(比单个共享LoRA高5.2个百分点)。Code2LoRA的代码可在 https://anonymous.4open.science/r/code2lora-6857 找到;模型检查点和RepoPeftBench数据集可在 https://huggingface.co/code2lora 找到。

英文摘要

Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA -- costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters, effectively injecting repository knowledge with zero inference-time token overhead. Code2LoRA supports two usage scenarios: Code2LoRA-Static converts a single repository snapshot into an adapter, suitable for comprehension of stable codebases; while Code2LoRA-Evo maintains an adapter backed by a GRU hidden state updated per code diff, suitable for active development of evolving codebases. To evaluate Code2LoRA against parameter-efficient fine-tuning baselines, we build RepoPeftBench, a benchmark of 604 Python repositories with two tracks: a static track with 40K training and 12K test assertion-completion tasks, and an evolution track with 215K commit-derived training and 87K commit-derived test tasks. On the static track, Code2LoRA-Static achieves 63.8% cross-repo and 66.2% in-repo exact match, matching the per-repository LoRA upper bound; on the evolution track, Code2LoRA-Evo achieves 60.3% cross-repo exact match (+5.2 pp over a single shared LoRA). Code2LoRA's code can be found at https://anonymous.4open.science/r/code2lora-6857; the model checkpoints and RepoPeftBench datasets can be found at https://huggingface.co/code2lora.

2606.06480 2026-06-05 cs.GT cs.LG

DNQ: Deep Nash Q-Network for Partially Observable n-Player Games

DNQ: 用于部分可观测n人博弈的深度纳什Q网络

Qintong Xie, Edward Koh, Xavier Cadet, Peter Chin

发表机构 * IEEE

AI总结 针对多智能体同时博弈问题,提出DNQ框架,通过求解器在环的均衡监督训练智能体,并对比成对与精确均衡求解方法的可扩展性。

详情
AI中文摘要

许多现实世界的竞争系统要求多个决策者在共享约束、有限信息和重复交互下同时行动,例如拍卖、资源分配和安全竞争。我们将多轮同时竞价作为此类问题的受控测试平台,并提出DNQ,一种求解器在环的均衡监督框架,用于训练竞价智能体。DNQ在轨迹收集、基于评论家的收益估计、均衡计算和策略模仿之间交替进行。在每个访问的状态下,共享评论家预测成对收益矩阵或精确的N人收益张量,外部求解器计算均衡策略,智能体通过最小化其掩码策略与求解器导出的均衡目标之间的KL散度进行训练。我们专注于可扩展的成对公式,与精确公式相比,大大降低了均衡求解成本和训练时间,同时共享评论家跨智能体和状态摊销了收益学习。实验使用评论家损失、策略熵、竞价资源使用和训练成本比较了成对和精确变体,表明成对方法可扩展到更多智能体,而精确方法随着联合博弈的增长在计算上变得不可行。这些结果说明了重复竞争环境中战略保真度与可扩展性之间的权衡。

英文摘要

Many real-world competitive systems require multiple decision-makers to act simultaneously under shared constraints, limited information, and repeated interaction, as in auctions, resource allocation, and security competition. We study multi-turn simultaneous bidding as a controlled testbed for such problems and propose DNQ, a solver-in-the-loop equilibrium supervision framework for training bidding agents. DNQ alternates between trajectory collection, critic-based payoff estimation, equilibrium computation, and policy imitation. At each visited state, a shared critic predicts either pairwise payoff matrices or an exact N-player payoff tensor, an external solver computes equilibrium strategies, and the agents are trained by minimizing the KL divergence between their masked policies and the solver-derived equilibrium targets. We focus on a scalable pairwise formulation that greatly reduces equilibrium-solving cost and training time compared with the exact formulation, while the shared critic amortizes payoff learning across agents and states. Experiments compare the pairwise and exact variants using critic loss, policy entropy, bidding resource usage, and training cost, showing that the pairwise method scales to larger numbers of agents, whereas the exact method becomes computationally impractical as the joint game grows. These results illustrate the trade-off between strategic fidelity and scalability in repeated competitive environments.
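The policy-imitation step, minimizing a KL divergence between a masked policy and a solver-derived equilibrium target, can be sketched as follows. This is an illustration under stated assumptions rather than the DNQ implementation: the direction KL(target || policy) is assumed because the abstract does not specify it, the equilibrium target is supplied by hand instead of by a solver, and the function names are hypothetical.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over legal actions only; illegal actions get probability 0."""
    legal = [l for l, m in zip(logits, mask) if m]
    top = max(legal)  # subtract the max for numerical stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, policy, eps=1e-12):
    """KL(target || policy); terms where the target puts no mass contribute 0."""
    return sum(t * math.log(t / max(p, eps)) for t, p in zip(target, policy) if t > 0)


# Three bids, the last one illegal in this state (e.g. it exceeds the budget).
policy = masked_softmax([0.0, 0.0, 5.0], [True, True, False])
loss_matched = kl_divergence([0.5, 0.5, 0.0], policy)
loss_pure = kl_divergence([1.0, 0.0, 0.0], policy)
```

The masked policy is uniform over the two legal bids, so the loss is zero against a uniform equilibrium target and equals log 2 against a target that always plays the first bid.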

2606.06469 2026-06-05 math.ST cs.LG math.PR stat.TH

How abundant are good interpolators?

好的插值器有多丰富?

August Y. Chen, Ahmed El Alaoui

发表机构 * Cornell University, Department of Computer Science(康奈尔大学计算机科学系) Cornell University, Department of Statistics and Data Science(康奈尔大学统计与数据科学系)

AI总结 在高维比例下,通过大偏差原理研究随机均匀选择的线性插值分类器的泛化误差分布,发现几乎所有插值分类器具有相同的泛化性能,而高效算法(如梯度下降)优于大多数插值器。

Comments 140 pages

详情
AI中文摘要

设 $S$ 是单位范数线性分类器 $\theta\in \mathbb{R}^d$ 的集合,这些分类器以预先固定的、可能为负的间隔 $\kappa$ 正确分类标记数据集 $(X_i,y_i)_{i=1}^n$ 中的每个点,其中 $X_i \in \mathbb{R}^d$,$y_i \in \{-1,+1\}$。在 $(X,y)$ 的两种自然的数据生成分布(高斯混合模型和具有高斯特征的逻辑模型)下,并在比例区域 $n/d \to \alpha$ 且 $\alpha$ 足够小的条件下,我们针对“从 $S$ 中均匀随机选取的点 $\theta$ 达到给定泛化误差”这一事件建立了大偏差原理,该结论在数据的选取上以高概率成立。相关的大偏差速率函数是确定性的,它在 $d$ 的指数尺度上描述了具有给定性能的插值分类器所占的比例。作为推论,我们建立了以下集中现象:除了指数小的一部分外,所有插值分类器都具有大致相同的泛化性能,该性能由该速率函数的唯一最大值点给出。我们将该最大值点与通过梯度下降进行经验风险最小化的性能以及一个自然的线性规划的性能进行了数值比较(两者都能找到 $S$ 中的一个点),并推断出在 $\alpha$ 较小的过参数化区域中,这些高效方法优于绝大多数插值器,表明它们在此设置中具有非平凡的良性过拟合。

英文摘要

Let $S$ be the set of unit norm linear classifiers $\theta \in \mathbb{R}^d$ which correctly classify every point of a labeled dataset $(X_i,y_i)_{i=1}^n$, $X_i \in \mathbb{R}^d$, $y_i \in \{-1,+1\}$, with a possibly negative margin $\kappa$ fixed in advance. Under two natural data-generating distributions of the $(X,y)$ pairs -- a Gaussian mixture model and a logistic model with Gaussian features -- and in the proportional regime $n/d \to \alpha$ with small enough $\alpha$, we establish a large deviation principle on the event that a point $\theta$ chosen uniformly at random from $S$ achieves a given generalization error, with high probability over the choice of the data. The associated large deviation rate function is deterministic and describes the proportion, at the exponential scale in $d$, of interpolating classifiers having a given desired performance. As a consequence, we establish the following concentration phenomenon: all but an exponentially small fraction of interpolating classifiers have approximately the same generalization performance given by the unique maximizer of this rate function. We numerically compare this maximizer to the performance of empirical risk minimization by gradient descent and to the performance of a natural linear program, both finding a point in $S$, and deduce that in the overparametrized regime of small $\alpha$, these efficient procedures outperform the vast majority of interpolators, pointing to their nontrivial benign overfitting in this setting.
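Written out, the verbal definition of $S$ in the abstract reads as below. This is one plausible reading, offered as a sketch: the abstract does not state how the margin $\kappa$ is normalised relative to the dimension $d$, so the unscaled inequality is an assumption.

```latex
% Assumed form: margin constraint without any d-dependent rescaling.
S = \bigl\{ \theta \in \mathbb{R}^d : \|\theta\|_2 = 1,\;
      y_i \langle X_i, \theta \rangle \ge \kappa \ \text{for all } i = 1,\dots,n \bigr\}
```

A negative $\kappa$ relaxes the constraint, so points may lie slightly on the wrong side of the separating hyperplane while $\theta$ still belongs to $S$.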

2606.06460 2026-06-05 cs.CR cs.AI

Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

智能体会自行回避吗?测量LLM智能体对带内拒绝访问信号的遵从性

Thamilvendhan Munirathinam

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种轻量级带内拒绝信号(Recuse Signal),通过实验测量LLM智能体是否自愿遵从该信号,发现信号能有效诱导回避,但高级模型在操作员授权下可能忽略。

Comments 8 pages, 1 figure. Code, specification, and experiment harness: https://github.com/mthamil107/Recuse

详情
AI中文摘要

随着自主LLM智能体越来越多地持有真实凭证并在无人参与的情况下操作基础设施,操作员没有标准方式告知智能体某个资源是禁止访问的。访问控制要么允许智能体进入(它有有效凭证),要么硬性拒绝(与任何其他客户端无法区分)。我们提出第三种模式:一种轻量级的、公开的带内拒绝信号——Recuse Signal——服务器通过协议的现有通道(如SSH横幅、PostgreSQL NOTICE)发出,要求连接的自动化智能体自愿退出。这是一种合作治理控制,类似于实时访问的robots.txt;明确不是安全边界。其价值完全是经验性的,据我们所知,尚未被测量:合规的LLM智能体是否真的会遵守这样的信号?我们将该信号定义为一个开放的小型标准,实现了两个零或低占用适配器(一个SSH横幅/PAM钩子和一个PostgreSQL线路协议代理),将它们部署在实时的生产主机上,并进行受控实验,其中新智能体被赋予一个良性操作任务,并观察其是否回避。在试点中(SSH;OpenAI GPT-4o和GPT-4o-mini;以及作为部署智能体的Claude Code),该信号干净地诱导回避——存在信号时100%回避,而无信号对照组中100%完成任务——并且揭示性地表现为合作信号而非绝对信号:显式的操作员授权框架使最强大的模型继续执行,而其他智能体继续遵从主机策略。我们发布该标准、适配器和实验框架以供复现。

英文摘要

As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in-band deny signal -- the Recuse Signal -- that a server emits over a protocol's existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. This is a cooperative governance control, the robots.txt analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini-standard, implement two zero- or low-footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire-protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT-4o and GPT-4o-mini; and Claude Code as a deployed agent), the signal cleanly induces recusal -- 100% recusal when present versus 100% task completion in a no-signal control -- and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator-authorization framing flips the most capable model to proceed, while other agents continue to defer to the on-host policy. We release the standard, adapters, and experiment harness for reproduction.

2606.06454 2026-06-05 cs.SE cs.CL

Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

脚手架,而非词汇?一项受控、双层、预注册的波普尔式代码生成技能研究

Mehmet Iscan

发表机构 * PythaLab, Yıldız Technical University, Istanbul, Turkey(Pytha实验室,耶尔德兹技术大学,伊斯坦布尔,土耳其)

AI总结 通过双层消融实验(包括长度匹配安慰剂、仅标签脚手架和真实执行测试),研究发现波普尔式提示技能对代码正确性的提升主要来自脚手架结构而非其内容,并在大模型上因天花板效应无法检测,在小模型上仅标签脚手架即可达到类似效果。

Comments 34 pages, 5 figures, 8 tables

详情
AI中文摘要

大型语言模型越来越多地编写、审查和评判代码,一种快速发展的实践是为它们配备提示“技能”,要求模型像科学家一样推理。一个突出的例子是告诉模型扮演波普尔式证伪主义者,据报道这种技能能改进生成的代码。但这些增益几乎总是通过LLM作为评判者来读取,而该评判工具存在已知的位置偏好、自我偏好和风格偏差。我们问:如果它看起来有帮助,那么增益是来自技能的波普尔式内容,还是来自任何脚手架所施加的结构?我们预注册了一个双层消融实验,包含三个对照:长度匹配的安慰剂、仅保留波普尔式标题但去除过程的仅标签脚手架,以及一个执行预言机(HumanEval+单元测试),外加一个词汇光环哨兵和一个同模型自评判审计。在前沿模型(Claude Sonnet 4.6,N=163)上,所有条件都接近基准上限且无法区分,因此预注册的+5点改进未得到支持(上限限制的未检测)。在小模型(Qwen2.5-Coder-0.5B,N=164)上,结构化条件将最佳八次正确率提升了20-22点,但完整技能相比仅标签脚手架没有显示出可分离的益处(聚合F@8=L@8 vs V@8=34.8%),而安慰剂仅落后2.4点。一个应用波普尔式评分标准的0.5B自评判器未能击败随机选择,并将其60%的选择集中在一个索引上。在测试的两种设置中,该技能的波普尔式过程内容在仅标签脚手架之外没有增加可分离的执行正确性收益,因此增益追踪的是脚手架结构。我们贡献了一个校准的负结果和一个可重用的消歧协议;该发现界定了关于一个提示技能家族的工程主张,而不是对波普尔式方法论的总体评价。

英文摘要

Large language models increasingly write, review, and judge code, and a fast-growing practice equips them with prompt 'skills' that ask the model to reason like a scientist. A prominent example tells the model to act as a Popperian falsificationist, and such skills are reported to improve generated code. But these gains are almost always read off an LLM-as-a-judge, an instrument with documented positional, self-preference, and stylistic biases. We ask: if it appears to help, is the gain from the skill's Popperian content, or from the structure any scaffold imposes? We pre-register a two-tier ablation with three controls: a length-matched placebo, a labels-only scaffold that keeps the Popperian headers but strips the procedure, and an execution oracle (HumanEval+ unit tests), plus a vocabulary-halo sentinel and a same-model self-judge audit. On a frontier model (Claude Sonnet 4.6, N=163) all conditions sit near the benchmark ceiling and do not separate, so the pre-registered +5-point improvement is not supported (a ceiling-limited non-detection). On a small model (Qwen2.5-Coder-0.5B, N=164) structured arms lift best-of-eight correctness by 20-22 points, but the full skill shows no separable benefit over a labels-only scaffold (aggregate F@8=L@8 vs V@8=34.8%), and the placebo trails by only 2.4 points. A 0.5B self-judge applying the Popperian rubric does not beat random selection and concentrates 60% of its picks on one index. In the two settings tested, the skill's Popperian procedural content adds no separable execution-correctness benefit beyond a labels-only scaffold, so the gains track scaffold structure. We contribute a calibrated negative result and a reusable disambiguation protocol; the finding bounds an engineering claim about one prompt-skill family and is not an evaluation of Popperian methodology in general.
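The best-of-eight correctness figures can be read with a small sketch of how such a rate might be computed from execution results. The definition used here (a task counts as solved if any of its first eight samples passes the unit tests) is an assumption, since the abstract does not spell out how the @8 scores are computed, and `results` is hypothetical data.

```python
def best_of_k_rate(results, k=8):
    """Fraction of tasks for which at least one of the first k samples passes.

    `results` holds one list per task; each entry is True if that sampled
    program passed the task's unit tests (an execution oracle, not an LLM judge).
    """
    solved = sum(1 for samples in results if any(samples[:k]))
    return solved / len(results)


results = [
    [False] * 7 + [True],   # solved by the eighth sample
    [False] * 8,            # never solved
    [True] + [False] * 7,   # solved by the first sample
    [False] * 8,            # never solved
]
rate = best_of_k_rate(results, k=8)
```

Because the score comes from executing tests, it is not subject to the positional or self-preference biases that the abstract attributes to LLM-as-a-judge evaluation.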

2606.06444 2026-06-05 eess.AS cs.CL cs.SD

USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

USAD 2.0:面向通用音频理解的表征蒸馏规模化

Heng-Jui Chang, Alexander H. Liu, Saurabhchand Bhati, Mrudula Athi, Anton Ratnarajah, Amit Chhetri, James Glass

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) Amazon(亚马逊)

AI总结 提出USAD 2.0通用音频编码器,通过领域感知蒸馏融合自监督和监督基础模型知识,并扩展至音乐领域,经深度缩放达到十亿参数,在探测和基于LLM的评估中取得领先性能。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

音频编码器对于现代音频应用至关重要,因为大型语言模型(LLM)越来越依赖单一编码器处理多样输入。虽然自监督学习(SSL)已产生强大的领域特定编码器(如语音或音乐专家),但像USAD和SPEAR这样的多领域方法在覆盖范围和评估方面仍然有限。最近的研究也表明,监督编码器与音频LLM的对齐效果更好。我们提出USAD 2.0,一种融合了SSL和监督基础模型知识的通用编码器。USAD 2.0引入了领域感知蒸馏来解决教师不匹配问题,将覆盖范围扩展到音乐领域,并增加了用于下游任务的第二阶段监督蒸馏。我们进一步通过深度缩放将模型扩展到十亿参数。实验表明,USAD 2.0在探测和基于LLM的评估中取得了强劲或最先进的性能。

英文摘要

Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.