arXivDaily: arXiv daily academic digest, updated Monday through Friday
2510.04500 2026-06-05 cs.LG

Expand Neurons, Not Parameters

Linghao Kong, Inimai Subramanian, Yonadav Shavit, Micah Adler, Dan Alistarh, Nir Shavit

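Affiliations: University of Washington; Microsoft Research

AI summary: Increasing the number of neurons without increasing the total number of non-zero parameters reduces feature interference and improves network performance; the effect is validated across several model types.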

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). 9 pages, 6 figures. Code available at https://github.com/Shavit-Lab/Expand-Neurons

Abstract

This work demonstrates how increasing the number of neurons in a network without increasing its total number of non-zero parameters improves performance. We show that this gain corresponds with a decrease in interference between multiple features that would otherwise share the same neurons. On symbolic Boolean tasks, splitting each neuron into sparser sub-neurons with knowledge of the clauses systematically reduces polysemanticity metrics and yields higher task accuracy. Notably, even random splits of neuron weights approximate these gains, indicating that reduced collisions, not precise assignment, are a primary driver. Consistent with the superposition hypothesis, the benefits of this framework grow with increasing interference: when polysemantic load is high, accuracy improvements are the largest. Transferring these insights to more realistic models, including classifiers over CLIP embeddings, convolutional neural networks, and deeper multilayer networks, we find that widening networks while maintaining a constant non-zero parameter count consistently increases accuracy. These results identify an interpretability-grounded mechanism to leverage width against superposition, improving performance without increasing the number of non-zero parameters. Such a direction is well matched to modern accelerators, where memory movement of non-zero parameters, rather than raw compute, is often a dominant bottleneck.
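
The random-split result mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single linear neuron stored as a plain list of weights; `split_neuron` and `nonzero_count` are hypothetical helper names, not the authors' code. Each non-zero weight is assigned to exactly one of `k` sub-neurons, so the layer gets wider and each sub-neuron gets sparser while the total number of non-zero parameters stays the same.

```python
import random


def split_neuron(weights, k, seed=0):
    """Randomly partition one neuron's non-zero weights across k sparser sub-neurons."""
    rng = random.Random(seed)
    subs = [[0.0] * len(weights) for _ in range(k)]
    for i, w in enumerate(weights):
        if w != 0.0:
            # Each non-zero weight lands in exactly one sub-neuron.
            subs[rng.randrange(k)][i] = w
    return subs


def nonzero_count(rows):
    """Count non-zero entries over a list of weight vectors."""
    return sum(1 for row in rows for w in row if w != 0.0)


weights = [0.5, -1.2, 0.0, 0.8, 0.3, -0.7]  # illustrative values only
subs = split_neuron(weights, k=3)
print(nonzero_count([weights]), nonzero_count(subs))  # the two counts are equal
```

Summing the sub-neurons' pre-activations reproduces the original neuron's pre-activation. Once each sub-neuron has its own nonlinearity, however, the widened layer computes a different function, which is where any change in accuracy would have to come from.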

2508.09697 2026-06-05 cs.LG cs.CV

Towards Label-Noise Resistant Learning via Optimal Brain Damage Masking

Xinlei Zhang, Fan Liu, Chuanyi Zhang, Fan Cheng, Qian Li, Yuhui Zheng

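Affiliations: Hohai University

AI summary: This paper proposes a label-noise-resistant learning method based on Optimal Brain Damage theory, masking redundant connections to reduce the propagation of noisy gradients and improve model robustness.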

Abstract

Noisy labels are inevitable in real-world scenarios. Due to the strong capacity of deep neural networks to memorize corrupted labels, these noisy labels cause significant performance degradation. Existing noise-robust methods have mainly focused on robust loss functions and sample selection, with comparatively limited exploration of dynamic architectural adaptation. In this paper, we rethink the role of model connectivity in the presence of label noise. Intuitively, performance degradation caused by noisy labels stems from the backpropagation of noisy gradients. Since the final classifier layer acts as the primary gateway for this error propagation, directly discarding redundant connections within the classifier can structurally intercept noisy gradients at the root. Consequently, to identify these redundant connections, we leverage the seminal Optimal Brain Damage (OBD) theory from model compression, which posits that parameters causing negligible loss perturbation can be safely removed without impairing performance. Guided by this principle, we reveal that masking low-activation edges maintains the network's normal fitting capacity while effectively reducing the risk of backpropagating noisy gradients. To bridge this theoretical insight with practical training, we propose a novel Selective Edge Masking (SEM) mechanism for the widely-adopted fully connected (FC) layer to enhance model robustness against noisy labels. It can adaptively preserve only the most critical edges for information propagation while suppressing gradient errors caused by noisy labels. As a plug-and-play component, SEM can be seamlessly integrated into various noise-robust methods, including robust loss functions and sample selection. Extensive evaluations on both synthetic and real-world benchmarks demonstrate that our OBD-driven approach consistently outperforms state-of-the-art methods.
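
The idea of masking low-activation edges in a fully connected classifier can be sketched as follows. This is an illustration under stated assumptions, not the paper's SEM mechanism: it assumes an edge's activation is measured as |w_ij * x_j| and that a fixed number of edges (`keep`) is retained per output unit. The function name `masked_fc_forward` and all values are hypothetical.

```python
def masked_fc_forward(W, x, keep):
    """Compute logits using only the `keep` largest-|w*x| edges of each output unit."""
    logits, masks = [], []
    for row in W:
        contrib = [w * xj for w, xj in zip(row, x)]
        # Indices of the `keep` largest-magnitude edge contributions (assumed criterion).
        top = sorted(range(len(contrib)), key=lambda j: abs(contrib[j]), reverse=True)[:keep]
        mask = [1 if j in top else 0 for j in range(len(contrib))]
        logits.append(sum(c * m for c, m in zip(contrib, mask)))
        masks.append(mask)
    return logits, masks


W = [[0.9, -0.1, 0.4], [0.2, 0.8, -0.05]]  # 2 classes, 3 input features
x = [1.0, 0.5, 2.0]
logits, masks = masked_fc_forward(W, x, keep=2)
print(masks)
```

If the mask is treated as a constant during backpropagation, a masked edge contributes nothing to the logit, so its weight receives no gradient from that sample. In this sketch, that is how a noisy label's gradient would be kept away from the dropped edges.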

2509.22015 2026-06-05 cs.LG

Concept-SAE: A Controllable and Invertible Concept Interface for Sparse Autoencoders

Jianrong Ding, Muxi Chen, Chenchen Zhao, Qiang Xu

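Affiliations: The Chinese University of Hong Kong

AI summary: This paper proposes Concept-SAE, a framework that probes user-defined concepts through a structured, controllable interface. By decomposing the activation subspace into Concept Tokens and Free Tokens, it yields high-fidelity, well-localized, and disentangled concept representations that outperform existing methods.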

Comments Accepted by ECML PKDD 2026, the project can be found at https://github.com/RafaDD/Concept-SAE

Abstract

Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model's learned features, providing a powerful lens for passive feature discovery. However, this passive nature makes it difficult to systematically evaluate or analyze concepts that users explicitly care about. We introduce Concept-SAE, a framework that augments SAEs with a structured and controllable interface for probing user-defined concepts. Concept-SAE decomposes an activation subspace into two orthogonal components: Concept Tokens, which are aligned to externally specified semantics through dual supervision on both concept existence and spatial localization, and Free Tokens, which operate like standard SAEs to capture all remaining information. This hybrid disentanglement strategy ensures that Concept Tokens are faithful, spatially grounded, and cleanly separated from the residual subspace while preserving the ability of SAEs for open-ended concept discovery. We conduct extensive experiments demonstrating that Concept-SAE yields high-fidelity, well-localized, and strongly disentangled concept representations, outperforming alternatives in interface quality. Finally, we validate the utility of this conceptual interface through three diagnostic evaluations: a detection test on classifying adversarial image samples, a controllability test focusing on controlled counterfactual editing and a stability test using adversarial perturbations. Together, these results show that Concept-SAE equips SAEs with a reliable mechanism for evaluating, probing, and diagnosing user-defined concepts.

2504.10823 2026-06-05 cs.CL cs.AI

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

Ayoung Lee, Ryan Sungmo Kwon, Peter Railton, Lu Wang

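Affiliations: Department of Computer Science and Engineering; Department of Philosophy; University of Michigan, Ann Arbor

AI summary: This paper introduces the CLASH dataset for studying value-based decision-making and finds that language models struggle with ambivalent decisions and with perspectives involving value shifts, although they predict psychological discomfort reasonably well.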

Comments Published as a conference paper at ICLR 2026

Abstract

Navigating dilemmas involving conflicting values is challenging even for humans in high-stakes domains, let alone for AI, yet prior work has been limited to everyday scenarios. To close this gap, we introduce CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. CLASH enables the study of critical yet underexplored aspects of value-based decision-making processes, including understanding of decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in the perspectives of characters. By benchmarking 14 non-thinking and thinking models, we uncover several key findings. (1) Even strong proprietary models, such as GPT-5 and Claude-4-Sonnet, struggle with ambivalent decisions, achieving only 24.06 and 51.01 accuracy. (2) Although LLMs reasonably predict psychological discomfort, they do not adequately comprehend perspectives involving value shifts. (3) Cognitive behaviors that are effective in the math-solving and game strategy domains do not transfer to value reasoning. Instead, new failure patterns emerge, including early commitment and overcommitment. (4) The steerability of LLMs towards a given value is significantly correlated with their value preferences. (5) Finally, LLMs exhibit greater steerability when reasoning from a third-party perspective, although certain values (e.g., safety) benefit uniquely from first-person framing.

2509.15061 2026-06-05 cs.RO cs.CV

Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

Xingyao Lin, Xinghao Zhu, Tianyi Lu, Sicheng Xie, Hui Zhang, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

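Affiliations: College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China; Shanghai Innovation Institute, Shanghai, China; Mechanical Systems Control Lab, UC Berkeley, California, USA

AI summary: This paper proposes the Ask-to-Clarify framework, which resolves instruction ambiguity through multi-turn dialogue. It combines a vision-language model with a diffusion model and is trained with a two-stage knowledge-insulation strategy, yielding a more effective collaborative embodied agent across multiple tasks.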

Comments 9 pages, 4 figures, 7 tables

Abstract

The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recently, advances in VLAs have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue. Then it generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components: a VLM for collaboration and a diffusion model for action. We also introduce a connection module that generates conditions for the diffusion model based on the output of the VLM. This module adjusts the observation according to the instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component using ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration component. This preserves the interaction abilities while fine-tuning the diffusion model to generate actions. The training strategy guarantees that our framework first asks questions and then generates actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework on 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.

2307.05284 2026-06-05 cs.LG cs.AI

Rethinking Distribution Shifts: Empirical Analysis and Modeling for Tabular Data

Tianyu Wang, Jiashuo Liu, Peng Cui, Hongseok Namkoong

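Affiliations: Department of Industrial Engineering and Operations Research; Department of Computer Science and Technology; Decision, Risk, and Operations Division; Columbia University; Tsinghua University

AI summary: Through empirical analysis and modeling, this paper revisits distribution shift and finds that Y|X-shifts are the most prevalent in tabular data, in sharp contrast to the ML literature's focus on X (covariate) shifts; it also finds that robust algorithms perform no better than vanilla methods.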

Comments Forthcoming at Management Science. Conference version appeared in NeurIPS 2023, previously titled "On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets"

Abstract

Different distribution shifts require different interventions, and algorithms must be grounded in the specific shifts they address. However, methodological development for robust algorithms typically relies on structural assumptions that lack empirical validation. Advocating for an empirically grounded data-driven approach to algorithm development, we build an empirical testbed comprising natural shifts across 8 tabular datasets, 172 distribution pairs over 45 methods and 90,000 method configurations encompassing empirical risk minimization and distributionally robust optimization (DRO) methods. We find $Y|X$-shifts are most prevalent in our testbed, in stark contrast to the heavy focus on $X$ (covariate)-shifts in the ML literature, and that the performance of robust algorithms is no better than that of vanilla methods. To understand why, we conduct an in-depth empirical analysis of DRO methods and find that overlooked implementation details, such as the choice of underlying model class (e.g., LightGBM) and hyperparameter selection, have a bigger impact on performance than the ambiguity set or its radius. We illustrate via case studies how a data-driven, inductive understanding of distribution shifts can provide a new approach to algorithm development.
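
For readers less familiar with the terminology, the two shift types named in the abstract can be stated through the standard factorization of the joint distribution. The subscripts $S$ (source) and $T$ (target) are notation introduced here for illustration and do not come from the abstract.

```latex
\begin{align*}
P(X, Y) &= P(X)\,P(Y \mid X) \\
\text{covariate ($X$) shift:}\quad & P_S(X) \neq P_T(X), \quad P_S(Y \mid X) = P_T(Y \mid X) \\
\text{$Y|X$-shift:}\quad & P_S(Y \mid X) \neq P_T(Y \mid X)
\end{align*}
```

Under a pure covariate shift the labeling rule is unchanged and only the input population moves, whereas under a $Y|X$-shift the same inputs map to different outcome distributions.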

2508.15851 2026-06-05 cs.CL

DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections

Jiwon Park, Seohyun Pyeon, Jinwoo Kim, Rina Carines Cabal, Zhenyuan He, Yihao Ding, Soyeon Caren Han

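Affiliations: Pohang University of Science and Technology; The University of Sydney; The University of Western Australia; The University of Melbourne

AI summary: This paper introduces the DocHop-QA benchmark, which evaluates multimodal evidence synthesis through multimodal, multi-document, multi-hop scientific question answering and exposes the limitations of current models under long-context and multi-evidence demands.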

Abstract

Despite rapid progress in large language models (LLMs), current QA benchmarks still overlook the core challenge of real-world scientific information seeking: synthesizing multimodal evidence scattered across multiple documents and structural formats. Existing QA benchmarks remain narrow in scope, relying on unimodal text and short-span reasoning that fail to capture the complexity of real information seeking. We introduce DocHop-QA, a benchmark of 11,379 instances for evaluating multimodal, multi-document, multi-hop scientific QA. Built from publicly available PubMed articles, DocHop-QA incorporates textual passages, tables, and layout cues, enabling cross-document inference without explicit hyperlinks. To scale realistic QA construction, we develop an LLM-driven generation pipeline grounded in 11 scientific reasoning concepts, producing diverse and coherent question-answer pairs. To highlight the utility and versatility of the dataset, we propose a task-driven evaluation framework spanning four settings, including generative answering, multimodal evidence integration, and structured index prediction. Experiments show that current models struggle with the long-context and multi-evidence demands of DocHop-QA, establishing it as a rigorous testbed for advancing next-generation scientific QA systems.

2508.00537 2026-06-05 cs.CL

The Prosody of Emojis

Giulio Zhou, Tsz Kin Lam, Alexandra Birch, Barry Haddow

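Affiliations: University of Edinburgh; NatWest Aveni

AI summary: The study examines how emojis influence prosody in speech and how listeners recover emoji meanings from prosodic cues. It finds that larger semantic differences between emojis correspond to greater prosodic divergence, suggesting that emojis carry prosodic intent and bridge digital text and spoken production.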

Comments ACL 26

Abstract

Prosodic features such as pitch, timing, and intonation are central to spoken communication, conveying emotion, intent, and discourse structure. In text-based settings, where these cues are absent, emojis act as visual surrogates that add affective and pragmatic nuance. This study examines how emojis influence prosodic realisation in speech and how listeners interpret prosodic cues to recover emoji meanings. Unlike previous work, we directly link prosody and emojis by analysing human speech data collected through a controlled elicited production task. Using Bayesian multilevel modelling, we show that speakers systematically adapt their prosody based on emoji cues, and that listeners can recover intended meanings significantly above chance. Furthermore, our results reveal a clear hierarchy in prosodic shifts: greater semantic differences between emojis correspond to increased prosodic divergence. These findings suggest that emojis are meaningful carriers of prosodic intent that bridge the gap between digital text and spoken production.

2503.22929 2026-06-05 cs.CV

Self-supervised Feature Disentanglement and Augmentation Network for One-class Face Anti-spoofing

Pei-Kai Huang, Jun-Xiong Chong, Ming-Tsung Hsu, Fang-Yu Hsu, Yi-Ting Lin, Kai-Heng Chien, Hao-Chiang Shao, Chiou-Ting Hsu

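Affiliations: National Tsing Hua University

AI summary: This paper proposes an unsupervised feature disentanglement and augmentation network (UFDANet) that improves the generalizability of one-class face anti-spoofing by disentangling liveness and domain features. Experiments show that it outperforms previous one-class methods and is comparable to state-of-the-art two-class methods.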

Abstract

Face anti-spoofing (FAS) techniques aim to enhance the security of facial identity authentication by distinguishing authentic live faces from deceptive attempts. While two-class FAS methods risk overfitting to training attacks to achieve better performance, one-class FAS approaches handle unseen attacks well but are less robust to domain information entangled within the liveness features. To address this, we propose an Unsupervised Feature Disentanglement and Augmentation Network (UFDANet), a one-class FAS technique that enhances generalizability by augmenting face images via disentangled features. UFDANet employs a novel unsupervised feature disentangling method to separate the liveness and domain features, facilitating discriminative feature learning. It integrates an out-of-distribution liveness feature augmentation scheme to synthesize new liveness features of unseen spoof classes, which deviate from the live class, thus enhancing the representability and discriminability of liveness features. Additionally, UFDANet incorporates a domain feature augmentation routine to synthesize unseen domain features, thereby achieving better generalizability. Extensive experiments demonstrate that the proposed UFDANet outperforms previous one-class FAS methods and achieves comparable performance to state-of-the-art two-class FAS methods.

2507.15736 2026-06-05 cs.CL

IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research

Yuanhao Shen, Daniel Xavier de Sousa, Ricardo Marçal, Hongyu Guo, Xiaodan Zhu

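Affiliations: GitHub

AI summary: This paper studies the capability of large language models in interdisciplinary research. It introduces the IDRBench framework, which evaluates how well different models integrate knowledge across disciplines through three tasks, and establishes baselines for future research.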

Abstract

Innovation is a key driving force of human civilization. As the body of knowledge has grown considerably, bridging knowledge across different disciplines, where significant innovation often emerges, has become increasingly challenging. The recent advancements in machine learning models, particularly Large Language Models (LLMs), have provided effective access to extensive knowledge sources and shown impressive abilities in reasoning, rendering significant opportunities for interdisciplinary discovery. Our research aims to understand the capabilities of state-of-the-art LLMs in integrating knowledge from different fields for interdisciplinary research (IDR). To address this fundamental problem, we introduce IDRBench, a pioneering framework that includes both datasets and evaluation tasks: (1) IDR Paper Identification, (2) IDR Idea Integration, and (3) IDR Idea Recommendation. Our study on ten mainstream LLMs provides a comprehensive analysis of their behavior and establishes benchmarks and baselines for future research. To the best of our knowledge, IDRBench is the first to provide a comprehensive investigation of LLMs' IDR capability.

2507.12336 2026-06-05 cs.CV

Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors

Subin Jeon, In Cho, Junyoung Hong, Woong Oh Cho, Seon Joo Kim

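Affiliations: Yonsei University

AI summary: This paper proposes KeyDiff3D, a framework that accurately predicts 3D keypoints from a single image. It exploits geometric priors in a pretrained multi-view diffusion model and converts implicit 3D priors into explicit 3D feature volumes, enabling keypoint estimation and 3D object manipulation.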

Comments Accepted at CVPR 2026. Project page: https://subin6.github.io/keydiff3d-project/

Abstract

Most existing 3D keypoint estimation methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect. This paper introduces KeyDiff3D, a framework that can accurately predict 3D keypoints from a single image, thus eliminating the need for such expensive data acquisitions. To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, the diffusion model generates multi-view images from a single image, serving as supervision signals to provide 3D geometric cues to our model. We also introduce a 3D feature extractor that transforms implicit 3D priors embedded in the diffusion features into explicit 3D feature volumes. Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results on diverse datasets, including Human3.6M, CUB-200-2011, Stanford Dogs, and several in-the-wild and out-of-domain inputs, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.

2507.06219 2026-06-05 cs.RO cs.AI cs.LG

Is Diversity All You Need for Scalable Robotic Manipulation?

Modi Shi, Li Chen, Jin Chen, Yuxiang Lu, Chiming Liu, Guanghui Ren, Ping Luo, Di Huang, Maoqing Yao, Hongyang Li

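Affiliations: University of Science and Technology of China; Tsinghua University; University of California, Berkeley

AI summary: This paper studies the role of data diversity in robot learning. It finds that task diversity matters more than per-task demonstration quantity, that multi-embodiment pre-training data is optional for cross-embodiment transfer, and that expert diversity can confound policy learning; it proposes a distribution debiasing method to improve performance.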

Comments Code is available at https://github.com/OpenDriveLab/AgiBot-World

Abstract

Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions: task (what to do), embodiment (which robot to use), and expert (who demonstrates). Our findings challenge the conventional intuition that "more diverse is better". Through extensive experiments on various robot platforms, we reveal that (1) task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer: models trained on high-quality single-embodiment data can efficiently transfer to different platforms, showing a more desirable scaling property during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity; the resulting GO-1-Pro achieves a substantial performance gain of 15%, equivalent to using 2.5 times as much pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.

2506.22078 2026-06-05 cs.CV

Towards Accurate Heart Rate Measurement from Ultra-Short Video Clips via Periodicity-Guided rPPG Estimation and Signal Reconstruction

Pei-Kai Huang, Ya-Ting Chan, Kuan-Wen Chen, Chiou-Ting Hsu, Xiaoding Wang, Md. Jalil Piran

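Affiliations: National Tsing Hua University; Fujian Normal University; Sungkyunkwan University

AI summary: To measure heart rate from ultra-short video clips, this paper proposes a periodicity-guided rPPG estimation method and a signal reconstruction technique, and validates the approach on multiple benchmark datasets.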

Abstract

Many remote Heart Rate (HR) measurement methods focus on estimating remote photoplethysmography (rPPG) signals from video clips lasting around 10 seconds but often overlook the need for HR estimation from ultra-short video clips. In this paper, we aim to accurately measure HR from ultra-short 2-second video clips by specifically addressing two key challenges. First, to overcome the limited number of heartbeat cycles in ultra-short video clips, we propose an effective periodicity-guided rPPG estimation method that enforces consistent periodicity between rPPG signals estimated from ultra-short clips and their much longer ground truth signals. Next, to mitigate estimation inaccuracies due to spectral leakage, we propose including a generator to reconstruct longer rPPG signals from ultra-short ones while preserving their periodic consistency to enable more accurate HR measurement. Extensive experiments on four rPPG estimation benchmark datasets demonstrate that our proposed method not only accurately measures HR from ultra-short video clips but also outperforms previous rPPG estimation techniques to achieve state-of-the-art performance.
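
A short worked example may help show why clip length matters for spectral HR estimation. A discrete Fourier transform over a clip of T seconds has bins spaced 1/T Hz apart, that is 60/T bpm: 30 bpm for a 2-second clip versus 6 bpm for a 10-second clip. The sketch below is not the paper's method. It assumes a noise-free synthetic sinusoid at a hypothetical 72 bpm and 30 fps, and simply reads the heart rate off the largest DFT bin; `dft_peak_bpm` and `synthetic_pulse` are illustrative names.

```python
import cmath
import math


def dft_peak_bpm(signal, fps):
    """Return the heart rate (bpm) of the largest-magnitude DFT bin, ignoring DC."""
    n = len(signal)
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(signal))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    # Bin k corresponds to k * fps / n Hz; multiply by 60 for beats per minute.
    return best_k * fps / n * 60.0


def synthetic_pulse(bpm, seconds, fps):
    """Noise-free sinusoid standing in for an rPPG signal (an assumption for illustration)."""
    n = int(seconds * fps)
    return [math.sin(2 * math.pi * (bpm / 60.0) * t / fps) for t in range(n)]


fps = 30
true_bpm = 72
short_est = dft_peak_bpm(synthetic_pulse(true_bpm, 2, fps), fps)   # 2-second clip
long_est = dft_peak_bpm(synthetic_pulse(true_bpm, 10, fps), fps)   # 10-second clip
bin_width_short = 60.0 / 2    # 30 bpm per bin
bin_width_long = 60.0 / 10    # 6 bpm per bin
print(short_est, long_est)
```

On this synthetic input, the 10-second clip contains a whole number of cycles and its peak falls on the 72 bpm bin. The 2-second clip contains 2.4 cycles, so its energy leaks across bins and the peak lands on the neighbouring 60 bpm bin. Reconstructing a longer signal before the spectral step, as the abstract proposes, addresses this kind of coarse, leaky spectrum.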

2501.14291 2026-06-05 cs.LG stat.ML

Advances in Temporal Point Processes: Bayesian, Neural, and LLM Approaches

Feng Zhou, Quyu Kong, Jie Qiao, Cheng Wan, Yixuan Zhang, Ruichu Cai

Affiliations: Center for Applied Statistics and School of Statistics, Renmin University of China; Independent Researcher; School of Computer Science, Guangdong University of Technology; School of Statistics and Data Science, Southeast University

AI summary: This survey reviews recent research on temporal point processes from three perspectives (Bayesian, deep learning, and large language models), covering model design, parameter estimation, and classic application areas, and outlines future research challenges and directions.

Abstract

Temporal point processes (TPPs) are stochastic process models used to characterize event sequences occurring in continuous time. Traditional statistical TPPs have a long-standing history, with numerous models proposed and successfully applied across diverse domains. In recent years, advances in deep learning have spurred the development of neural TPPs, enabling greater flexibility and expressiveness in capturing complex temporal dynamics. The emergence of large language models (LLMs) has further sparked excitement, offering new possibilities for modeling and analyzing event sequences by leveraging their rich contextual understanding. This survey presents a comprehensive review of recent research on TPPs from three perspectives: Bayesian, deep learning, and LLM approaches. We begin with a review of the fundamental concepts of TPPs, followed by an in-depth discussion of model design and parameter estimation techniques in these three frameworks. We also revisit classic application areas of TPPs to highlight their practical relevance. Finally, we outline challenges and promising directions for future research.
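
As a concrete instance of the fundamental concepts the survey refers to, the sketch below evaluates the conditional intensity and log-likelihood of a Hawkes process with an exponential kernel. This classic TPP is chosen here purely for illustration; the abstract does not single it out. The intensity is λ(t) = μ + Σ_{t_i < t} α·exp(−β(t − t_i)), and the log-likelihood on [0, T] is Σ_i log λ(t_i) − ∫_0^T λ(s) ds. The function names, parameter values, and event times are hypothetical.

```python
import math


def hawkes_intensity(t, history, mu, alpha, beta):
    """Conditional intensity at time t given strictly earlier events."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in history if ti < t)


def hawkes_log_likelihood(events, T, mu, alpha, beta):
    """Log-likelihood of `events` observed on [0, T] under an exponential-kernel Hawkes model."""
    log_term = sum(math.log(hawkes_intensity(ti, events, mu, alpha, beta)) for ti in events)
    # Closed-form integral of the intensity over [0, T] (the compensator).
    compensator = mu * T + (alpha / beta) * sum(1.0 - math.exp(-beta * (T - ti)) for ti in events)
    return log_term - compensator


events = [0.5, 1.2, 1.3, 3.0]  # illustrative event times
T = 4.0
ll = hawkes_log_likelihood(events, T, mu=0.5, alpha=0.8, beta=1.0)
print(ll)
```

Setting `alpha=0` recovers a homogeneous Poisson process with rate `mu`, which is a convenient sanity check. Neural TPPs typically replace this hand-specified intensity with a learned function while keeping the same likelihood structure.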

2506.20263 2026-06-05 cs.CV

Hierarchical Mask-Enhanced Dual Reconstruction Network for Few-Shot Fine-Grained Image Classification

层次化掩码增强双重建网络用于少样本细粒度图像分类

Ning Luo, Meiyin Hu, Huan Wan, Yanyan Yang, Zhuohang Jiang, Xin Wei

发表机构 * Nanjing University(南京大学)

AI总结 本文提出层次化掩码增强双重建网络(HMDRN),通过双层特征重建与掩码增强特征处理,解决少样本细粒度图像分类中区分视觉相似子类的问题,实验显示其在三种细粒度数据集上均优于现有方法。

详情
AI中文摘要

少样本细粒度图像分类(FS-FGIC)具有挑战性,因为它需要在标记示例极少的情况下区分视觉相似的子类。现有方法存在关键限制:基于度量的方法丢失空间信息并导致局部特征错位,而基于重建的方法未充分利用层次特征信息,且缺乏对判别性关键区域的选择性关注。我们提出层次化掩码增强双重建网络(HMDRN),将双层特征重建与掩码增强特征处理相结合。HMDRN通过可学习权重利用不同网络层次的互补视觉信息,平衡高层语义表示与中层结构细节。它包含一个空间二值掩码增强的Transformer模块,可选择性地增强判别区域并过滤背景噪声。在三个细粒度数据集上,HMDRN在Conv-4和ResNet-12两种骨干网络上均优于现有最先进方法。消融研究验证了每个组件的有效性,表明双层重建增强了类间判别能力,而掩码增强变换减少了类内差异。

英文摘要

Few-shot fine-grained image classification (FS-FGIC) is challenging as it requires distinguishing visually similar subclasses with extremely limited labeled examples. Existing methods suffer from critical limitations: metric-based methods lose spatial information and misalign local features, while reconstruction-based methods underuse hierarchical feature information and lack selective focus on discriminative key regions. We propose the Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN), integrating dual-layer feature reconstruction with mask-enhanced feature processing. HMDRN leverages complementary visual information from different network hierarchies via learnable weights, balancing high-level semantic representations with mid-level structural details. It incorporates a spatial binary mask-enhanced transformer module that selectively enhances discriminative regions while filtering background noise. On three fine-grained datasets, HMDRN consistently outperforms state-of-the-art methods with both Conv-4 and ResNet-12 backbones. Ablation studies validate each component's effectiveness, showing dual-layer reconstruction enhances inter-class discrimination while mask-enhanced transformation reduces intra-class variations.

2506.10145 2026-06-05 cs.CV

RoCA: Robust Cross-Domain End-to-End Autonomous Driving

RoCA: 面向鲁棒跨域端到端自动驾驶的框架

Rajeev Yasarla, Shizhong Han, Hsin-Pai Cheng, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Yunxiao Shi, Risheek Garrepalli, Hong Cai, Fatih Porikli

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of California, San Diego(加州大学圣地亚哥分校) University of California, Los Angeles(加州大学洛杉矶分校) University of California, Davis(加州大学戴维斯分校)

AI总结 本文提出RoCA框架,通过对端到端自动驾驶管道中编码自车(ego)和周围车辆信息的token的联合概率分布进行建模,提升跨域自动驾驶的泛化能力和鲁棒性,且无需额外推理计算。

Comments accepted for ICML 2026

详情
AI中文摘要

端到端(E2E)自动驾驶最近作为一种新范式出现,具有显著潜力。然而,很少有研究探讨跨域(例如跨城市)部署的实际挑战。尽管一些工作引入了大语言模型(LLMs)以利用其开放世界知识,但LLMs无法保证跨域驾驶性能,且在域适应过程中可能产生过高的重训练成本。本文提出RoCA,一种用于鲁棒跨域端到端自动驾驶的新颖框架。RoCA对E2E管道中编码自车(ego)和周围车辆信息的token的联合概率分布进行建模。通过高斯过程(GP)实例化,RoCA学习一组带有相应轨迹的基底token,这些token覆盖了多样化的驾驶场景。然后,给定任意驾驶场景,它能够以概率方式推断未来轨迹。通过在源域训练中将RoCA与基础E2E模型结合使用,我们提升了基础模型的泛化能力,而无需额外的推理计算。此外,RoCA能够在新的目标域上实现鲁棒适应,显著优于直接微调。我们在各种跨域场景中对RoCA进行了广泛评估,结果表明它具有强大的域泛化和域适应性能。

英文摘要

End-to-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohibitive retraining costs during domain adaptation. In this paper, we propose RoCA, a novel framework for robust cross-domain E2E autonomous driving. RoCA formulates the joint probabilistic distribution over the tokens that encode ego and surrounding vehicle information in the E2E pipeline. Instantiating with a Gaussian process (GP), RoCA learns a set of basis tokens with corresponding trajectories, which span diverse driving scenarios. Then, given any driving scene, it is able to probabilistically infer the future trajectory. By using RoCA together with a base E2E model in source-domain training, we improve the generalizability of the base model, without requiring extra inference computation. In addition, RoCA enables robust adaptation on new target domains, significantly outperforming direct finetuning. We extensively evaluate RoCA on various cross-domain scenarios and show that it achieves strong domain generalization and adaptation performance.
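
To make the Gaussian process step more tangible, here is a minimal sketch of standard GP regression that maps a scene token to a trajectory given a set of basis tokens and their trajectories. The RBF kernel, the unit prior variance, the noise level, and the names `rbf_kernel` and `gp_predict` are all assumptions for illustration; the abstract does not specify RoCA's kernel or training procedure.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """RBF kernel matrix between rows of a (m, d) and rows of b (n, d)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_predict(basis_tokens, basis_trajs, query_tokens, noise=1e-6, length_scale=1.0):
    """GP posterior mean and variance of trajectories for the query tokens.

    basis_tokens: (n, d) token embeddings (hypothetical stand-ins for learned basis tokens)
    basis_trajs:  (n, k) flattened trajectories paired with the basis tokens
    query_tokens: (m, d) tokens of new driving scenes
    """
    K = rbf_kernel(basis_tokens, basis_tokens, length_scale)
    K = K + noise * np.eye(len(basis_tokens))
    K_star = rbf_kernel(query_tokens, basis_tokens, length_scale)
    mean = K_star @ np.linalg.solve(K, basis_trajs)
    # Prior variance of an RBF kernel is 1; subtract the part explained by the basis set.
    var = 1.0 - np.sum(K_star * np.linalg.solve(K, K_star.T).T, axis=1)
    return mean, var
```

The posterior variance grows toward the prior value for scenes far from every basis token, which is one way a probabilistic model can signal that a scene lies outside the source domain.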

2506.11973 2026-06-05 cs.LG

Self-Regulating Cars: Automating Traffic Control in Free Flow Road Networks

自调节汽车:自动化自由流道路网络的交通控制

Ankit Bhardwaj, Rohail Asim, Sachin Chauhan, Yasir Zaki, Lakshminarayanan Subramanian

发表机构 * Department of Computer Science(计算机科学系) New York University(纽约大学) Indian Institute of Technology Delhi(德里印度理工学院)

AI总结 本文提出了一种基于强化学习的自调节汽车方法,通过动态调节车辆速度来优化通行能力和防止拥堵,无需新基础设施,结合经典交通流理论和微观模拟,在高保真度的PTV Vissim模拟器上实现了提高通行能力、减少延误和停车次数的改进。

详情
AI中文摘要

自由流道路网络,如郊区高速公路,由于通勤流量增加和基础设施有限,正越来越多地经历交通拥堵。传统控制机制,如交通信号或局部启发式方法,在这些高速、无信号的环境中效果不佳或不可行。我们引入了自调节汽车,一种基于强化学习的交通控制协议,通过动态调节车辆速度来优化通行能力和防止拥堵,而无需新的物理基础设施。我们的方法将经典交通流理论、间隙接受模型和微观模拟整合到一个物理指导的强化学习框架中。通过将道路抽象为超段,智能体捕捉到涌现的流量动态,并从即时交通观测中学习稳健的速度调节策略。在高保真度的PTV Vissim模拟器上,我们的方法在真实世界高速公路网络中实现了比无控制设置提高5%的总通行能力,减少13%的平均延误,以及减少3%的总停车次数。它还实现了更平滑、抗拥堵的流量,同时在各种交通模式中泛化,展示了其在可扩展的ML驱动交通管理中的潜力。

英文摘要

Free-flow road networks, such as suburban highways, are increasingly experiencing traffic congestion due to growing commuter inflow and limited infrastructure. Traditional control mechanisms, such as traffic signals or local heuristics, are ineffective or infeasible in these high-speed, signal-free environments. We introduce self-regulating cars, a reinforcement learning-based traffic control protocol that dynamically modulates vehicle speeds to optimize throughput and prevent congestion, without requiring new physical infrastructure. Our approach integrates classical traffic flow theory, gap acceptance models, and microscopic simulation into a physics-informed RL framework. By abstracting roads into super-segments, the agent captures emergent flow dynamics and learns robust speed modulation policies from instantaneous traffic observations. Evaluated in the high-fidelity PTV Vissim simulator on a real-world highway network, our method improves total throughput by 5%, reduces average delay by 13%, and decreases total stops by 3% compared to the no-control setting. It also achieves smoother, congestion-resistant flow while generalizing across varied traffic patterns, demonstrating its potential for scalable, ML-driven traffic management.

2506.11042 2026-06-05 cs.LG

GenFT: A Generative Parameter-Efficient Fine-Tuning Method for Pretrained Foundation Models

GenFT:一种用于预训练基础模型的生成性参数高效微调方法

Guangning Xu, Baoquan Zhang, Michael K. Ng

发表机构 * Department of Mathematics, Hong Kong Baptist University, Hong Kong, China(香港浸会大学数学系,香港,中国) Department of Computer Science, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)计算机科学系,中国)

AI总结 本文提出GenFT,一种基于预训练权重的参数高效微调方法,通过生成任务特定的更新来利用预训练权重中的结构信息,实现高效的模型微调。

Comments paper is accepted at ICANN 2026

详情
AI中文摘要

参数高效微调(PEFT)已成为一种资源高效的策略,通过学习少量任务特定的更新ΔW来适配预训练基础模型(PFMs)。现有方法往往在很大程度上独立于预训练权重W₀来学习ΔW,或主要通过初始化或简单的重参数化来利用W₀。为了进一步利用W₀中编码的结构信息,我们提出生成性参数高效微调(GenFT),一种以W₀为条件的PEFT方法,使用确定性权重生成器生成任务特定的更新。具体而言,GenFT通过带非线性激活的行变换和列变换从W₀中提取结构化模式,并引入共享-特定分解,以平衡跨层信息复用和层特定的灵活性。GenFT简单且参数高效,在NLP和CV基准上取得了有竞争力或更优的平均性能。我们进一步在LLaMA-7B上进行了试点研究,以检验其在生成模型中的可行性。代码可在GitHub https://github.com/xuguangning1218/GenFT 上获得。

英文摘要

Parameter-efficient fine-tuning (PEFT) has emerged as a resource-efficient strategy for adapting Pretrained Foundation Models (PFMs) by learning a small number of task-specific updates $ΔW$. Existing methods often learn $ΔW$ largely independently of pretrained weights $W_0$, or exploit $W_0$ mainly through initialization or simple reparameterization. To further leverage the structural information encoded in $W_0$, we propose Generative Parameter-Efficient Fine-Tuning (GenFT), a $W_0$-conditioned PEFT method that uses a deterministic weight generator to produce task-specific updates. Specifically, GenFT performs row and column transformations with nonlinear activations to extract structured patterns from $W_0$, and introduces a shared-specific decomposition to balance cross-layer information reuse and layer-specific flexibility. GenFT is simple and parameter-efficient, achieving competitive or better average performance across NLP and CV benchmarks. We further provide a pilot study on LLaMA-7B to examine its feasibility for generative models. The code is available at GitHub https://github.com/xuguangning1218/GenFT.

2506.10601 2026-06-05 cs.CV

Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection

语义解耦的空间分区引导的点监督定向物体检测

Xinyuan Liu, Hang Xu, Zirui Chen, Yike Ma, Chenggang Yan, Feng Dai

发表机构 * Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) Hefei University of Technology(合肥工业大学)

AI总结 本文提出了一种训练高效的框架SSP,通过规则驱动的先验注入和数据驱动的标签净化,解决了样本分配不足和伪标签质量差的问题,实验表明SSP在DOTA-v1.0和其他数据集上取得了显著的mAP提升,且训练时间和内存占用较低。

Comments Published in Pattern Recognition, 2026

详情
Journal ref
Pattern Recognition, Volume 180, Part B, Article 114079 (2026)
AI中文摘要

鉴于其降低标注成本的能力,基于单点标注的弱监督学习已成为定向物体检测的研究焦点。与经典的教师-学生范式相比,简单模型范式(如PointOBB-v2)可以在保证强大性能的同时,进一步大幅减少训练所需的资源。后者在低成本训练中具有更大的潜力,但此类方法仍面临样本分配不足和伪标签质量差的挑战。在本文中,我们提出了一种训练高效的框架SSP,该框架结合了规则驱动的先验注入和数据驱动的标签净化。具体而言,SSP引入了两种设计:(1)基于像素级空间分区的样本分配,通过对像素图进行空间分区,紧凑地估计物体尺度的上下界,并挖掘高质量的正样本和困难负样本;(2)基于语义空间分区的框提取,从由语义图调制的空间分区中推导实例,并将其转换为伪框以监督检测器。在DOTA-v1.0和其他数据集上的实验证明了SSP的优越性:与基线相比,SSP实现了+6.73%的mAP提升,同时仅需2小时的训练时间和6GB的GPU内存。此外,当SSP与更强的检测器结合时,mAP可以达到50.81%。代码可在 https://github.com/antxinyuan/ssp 上获得。

英文摘要

Given its ability to reduce annotation costs, weakly supervised learning based on single-point annotations has emerged as a research focus in oriented object detection. Compared with the classical teacher-student paradigm, the simple model paradigm (e.g., PointOBB-v2) can substantially reduce the resources required for training while ensuring strong performance. The latter exhibits greater potential for low-cost training, yet such methods still face challenges of insufficient sample assignment and poor pseudo-label quality. In this paper, we propose a training-efficient framework named SSP, which synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two designs: (1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps; and (2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and converts them into pseudo-boxes for supervising detectors. Experiments on DOTA-v1.0 and other datasets demonstrate SSP's superiority: it achieves a +6.73% mAP improvement compared with the baseline, while requiring only 2 h of training time and 6 GB of GPU memory. Furthermore, when SSP is integrated with a stronger detector, the mAP can reach 50.81%. The code is available at https://github.com/antxinyuan/ssp.

2506.00188 2026-06-05 cs.LG stat.ML

Cluster-Aware Causal Mixer for Online Anomaly Detection in Multivariate Time Series

基于聚类的因果混合器用于多变量时间序列的在线异常检测

Md Mahmuddun Nabi Murad, Yasin Yilmaz

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出了一种基于聚类的因果混合器,用于多变量时间序列的在线异常检测,通过聚类处理通道间的相关性,结合因果混合器保持时间因果性,并开发了序列异常评分方法以提高检测准确性。

详情
AI中文摘要

在时间序列数据中早期和准确地检测异常至关重要,因为假阳性和漏检带来的风险很大。虽然基于MLP的混合模型在时间序列分析中显示出潜力,但它们在数据处理过程中不维护时间因果性。此外,现实中的多变量时间序列通常包含众多通道,具有多样的通道间相关性。重构时间序列中的虚假相关性导致表示噪声,从而导致检测不准确。此外,忽略时间连续性的异常评分方法可能会误导连续检测。为了解决这些挑战,我们提出了一种多变量时间序列异常检测的基于聚类的因果混合器。根据相关性将通道分组为集群,并通过专用嵌入层对每个集群进行嵌入。引入因果混合器以在保持时间因果性的同时整合信息。我们进一步开发了一种序列异常评分方法,该方法在时间上累积证据并细化异常边界。我们提出的模型以在线方式运行,使其适合实时时间序列异常检测。在六个公开基准数据集上的实验评估表明,所提出的方法在性能上始终优于其他方法。

英文摘要

Early and accurate detection of anomalies in time-series data is critical due to the substantial risks associated with false or missed detections. While MLP-based mixer models have shown promise in time-series analysis, they do not maintain temporal causality during data processing. Moreover, real-world multivariate time series often contain numerous channels with diverse inter-channel correlations. Spurious correlations in the reconstructed time series lead to noisy representations, resulting in inaccurate anomaly detection. In addition, anomaly scoring methods that ignore temporal continuity can mislead sequential detection. To address these challenges, we propose a cluster-aware causal mixer for multivariate time-series anomaly detection. Channels are grouped into clusters based on their correlations, and each cluster is embedded through a dedicated embedding layer. A causal mixer is introduced to integrate information while maintaining temporal causality. We further develop a sequential anomaly-scoring method that accumulates evidence over time and refines anomaly boundaries. Our proposed model operates in an online fashion, making it suitable for real-time time-series anomaly detection. Experimental evaluations across six public benchmark datasets demonstrate that the proposed approach consistently achieves superior performance.
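
The notion of "maintaining temporal causality" in a mixer can be illustrated with a masked time-mixing layer: a lower-triangular mask ensures that the output at step t only depends on inputs at steps up to t. This is a minimal sketch of that general idea, not the authors' architecture; `causal_time_mix` and its arguments are hypothetical names.

```python
import numpy as np

def causal_time_mix(x, weights):
    """Mix information across time without looking into the future.

    x:       (T, C) window of T time steps and C channels
    weights: (T, T) time-mixing matrix; entry [t, s] weights input step s for output step t
    """
    # Lower-triangular mask keeps only entries with source step s <= target step t.
    mask = np.tril(np.ones_like(weights))
    return (weights * mask) @ x
```

An unmasked MLP-mixer layer would use `weights @ x` directly, so a change in a later time step could alter earlier outputs, which is what the mask rules out.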

2310.04649 2026-06-05 cs.LG

Uncovering Model Processing Strategies with Non-Negative Per-Example Fisher Factorization

通过非负每例费舍尔分解揭示模型处理策略

Michael Matena, Colin Raffel

发表机构 * University of North Carolina Chapel Hill(北卡罗来纳大学教堂山分校) University of Toronto(多伦多大学) Vector Institute(向量研究所)

AI总结 本文提出NPEFF方法,通过分解每例费舍尔矩阵揭示模型生成预测所用的策略,展示了NPEFF组件在语言模型和文本处理任务中的应用,并展示了如何通过扰动这些组件来干扰模型处理,同时通过消融研究和实验验证了NPEFF在分析和缓解去学习的副作用以及研究上下文学习中的优势。

详情
AI中文摘要

我们引入NPEFF(非负每例费舍尔分解),一种可解释性方法,旨在揭示模型生成预测所使用的策略。NPEFF使用一种新颖的分解算法分解每例费舍尔矩阵,该算法学习了一组由学习得到的秩-1半正定矩阵表示的组件。通过结合人类评估和自动化分析,我们证明这些NPEFF组件对应于各种语言模型和文本处理任务中的模型处理策略。我们进一步展示了如何从NPEFF组件构建参数扰动,以选择性地干扰给定组件在模型处理中的作用。除了进行广泛的消融研究外,我们还包括实验,展示了NPEFF如何用于分析和缓解去学习的副作用,并用NPEFF研究上下文学习。此外,我们展示了NPEFF相对于梯度聚类和使用稀疏自编码器进行字典学习等基线方法的优势。我们发布了本工作的代码。

英文摘要

We introduce NPEFF (Non-Negative Per-Example Fisher Factorization), an interpretability method that aims to uncover strategies used by a model to generate its predictions. NPEFF decomposes per-example Fisher matrices using a novel decomposition algorithm that learns a set of components represented by learned rank-1 positive semi-definite matrices. Through a combination of human evaluation and automated analysis, we demonstrate that these NPEFF components correspond to model processing strategies for a variety of language models and text processing tasks. We further show how to construct parameter perturbations from NPEFF components to selectively disrupt a given component's role in the model's processing. Along with conducting extensive ablation studies, we include experiments to show how NPEFF can be used to analyze and mitigate collateral effects of unlearning and use NPEFF to study in-context learning. Furthermore, we demonstrate the advantages of NPEFF over baselines such as gradient clustering and using sparse autoencoders for dictionary learning over model activations. We release the code used in this work.

2505.02540 2026-06-05 cs.LG cs.AI

Lazy But Effective: Collaborative Personalized Federated Learning with Heterogeneous Data

懒惰但有效:基于异构数据的协同个性化联邦学习

Ljubomir Rokvic, Panayiotis Danassis, Boi Faltings

发表机构 * Artificial Intelligence Laboratory EPFL(洛桑联邦理工学院人工智能实验室) Telenor Research(Telenor研究)

AI总结 本文提出了一种简单有效的个性化联邦学习框架pFedLIA,利用计算高效的影响近似方法“Lazy Influence”,在模型聚合前以分布式方式对客户端进行聚类,并在每个聚类内协同训练模型以捕捉客户端特定的数据模式;实验表明,它能有效弥补非IID数据导致的全局模型性能下降,并在多个基准任务中优于现有基线方法。

Comments Accepted at the International Joint Conference on Neural Networks (IJCNN), IEEE, 2025

详情
AI中文摘要

在联邦学习中,客户端数据分布的异质性往往意味着单一全局模型无法为各个客户端提供最佳性能。例如,在为键盘训练下一个词预测模型时,由人口统计学特征(方言、年龄等)、语言能力和书写风格造成的用户特定语言模式,会导致客户端之间的数据集高度非IID。其他例子包括使用不同机器拍摄的医学图像,或来自不同车辆类型的驾驶数据。为了解决这一问题,我们提出了一种简单但有效的个性化联邦学习框架(pFedLIA),该框架利用一种计算高效的影响近似方法“Lazy Influence”,在模型聚合前以分布式方式对客户端进行聚类。在每个聚类中,数据所有者协同训练一个模型,以捕捉客户端特定的数据模式。在各种合成和真实世界设置中,特别是在北欧语言的下一个词预测任务以及多个基准任务中,我们的方法成功弥补了非IID性导致的全局模型性能下降。它的性能与假设的Oracle聚类相当,并显著优于现有基线方法,例如在CIFAR100上提高了17%。

英文摘要

In Federated Learning, heterogeneity in client data distributions often means that a single global model does not have the best performance for individual clients. Consider for example training a next-word prediction model for keyboards: user-specific language patterns due to demographics (dialect, age, etc.), language proficiency, and writing style result in a highly non-IID dataset across clients. Other examples are medical images taken with different machines, or driving data from different vehicle types. To address this, we propose a simple yet effective personalized federated learning framework (pFedLIA) that utilizes a computationally efficient influence approximation, called 'Lazy Influence', to cluster clients in a distributed manner before model aggregation. Within each cluster, data owners collaborate to jointly train a model that captures the specific data patterns of the clients. Our method has been shown to successfully recover the global model's performance drop due to the non-IID-ness in various synthetic and real-world settings, specifically a next-word prediction task on the Nordic languages as well as several benchmark tasks. It matches the performance of a hypothetical Oracle clustering, and significantly improves on existing baselines, e.g., an improvement of 17% on CIFAR100.

2503.23300 2026-06-05 cs.CV cs.RO

Learning Predictive Visuomotor Coordination

学习预测性视觉-运动协调

Wenqi Jia, Bolin Lai, Miao Liu, Danfei Xu, James M. Rehg

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Georgia Tech(佐治亚理工学院) Meta AI

AI总结 本文提出了一种基于预测的视觉-运动协调建模任务,通过结合第一人称视觉和运动学观测预测头部姿态、目光方向和上半身运动,展示了多模态整合在理解视觉-运动协调中的重要性。

Comments CVPR 2026 Findings

详情
AI中文摘要

理解并预测人类视觉-运动协调对于机器人学、人机交互和辅助技术的应用至关重要。本文介绍了一种基于预测的视觉-运动协调建模任务,目标是从第一人称视觉和运动学观测中预测头部姿态、目光方向和上半身运动。我们提出了一种视觉-运动协调表示(VCR),学习这些多模态信号之间的结构时间依赖性。我们扩展了基于扩散的运动建模框架,整合了第一人称视觉和运动学序列,实现了时间一致且准确的视觉-运动预测。我们的方法在大规模EgoExo4D数据集上进行了评估,展示了在多样化现实活动中的强大泛化能力。我们的结果强调了多模态整合在理解视觉-运动协调中的重要性,为视觉-运动学习和人类行为建模的研究做出了贡献。项目页面:https://vjwq.github.io/VCR/.

英文摘要

Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling. Project Page: https://vjwq.github.io/VCR/.

2411.18343 2026-06-05 cs.LG cs.AI

Comprehensive and Reliable Feature Attribution for Diverse Modalities and Models via Frequency-Domain Insights

通过频域见解实现多样化模态和模型的全面可靠特征归因

Zechen Liu, Feiyang Zhang, Wei Song, Xiang Li, Wei Wei

发表机构 * School of Computational Science, Wuhan University(武汉大学计算科学学院) Brain Research Center, Wuhan University(武汉大学脑科学研究中心) College of Information Science and Technology (School of Cyber Science and Technology), Shihezi University(石河子大学信息科学学院(网络安全科学与技术学院)) Xinjiang Production and Construction Corps Key Laboratory of Computing Intelligence and Network Information Security Open Fund(新疆生产建设兵团计算智能与网络信息安全重点实验室开放基金)

AI总结 本文提出了一种新的可解释性方法FreqX,结合信号处理和信息理论,以解决个性化联邦学习中非IID数据、异构设备、缺乏公平性和贡献不明确等问题,通过频域分析提高解释性效率和准确性。

Comments 16 pages, 9 figures

详情
AI中文摘要

个性化联邦学习(PFL)允许客户端在不披露其私有数据集的情况下协作训练个性化模型。然而,PFL面临非IID、异构设备、缺乏公平性和贡献不明确等挑战,亟需深度学习模型的可解释性来克服这些问题。这些挑战提出了新的可解释性需求,包括低成本、隐私性和详细信息。目前没有现有的可解释性方法能满足这些需求。在本文中,我们提出了一种新的可解释性方法FreqX,通过引入信号处理和信息理论。我们的实验表明,FreqX的解释结果包含属性信息和概念信息。FreqX的运行速度至少比包含概念信息的基线方法快10倍。

英文摘要

Personalized Federated Learning (PFL) allows clients to cooperatively train a personalized model without disclosing their private datasets. However, PFL suffers from non-IID data, heterogeneous devices, lack of fairness, and unclear contribution, challenges that urgently call for interpretability of deep learning models. These challenges place new demands on interpretability: low cost, privacy, and detailed information. No current interpretability method satisfies all of them. In this paper, we propose a novel interpretability method, FreqX, by introducing Signal Processing and Information Theory. Our experiments show that the explanation results of FreqX contain both attribution information and concept information. FreqX runs at least 10 times faster than the baselines that contain concept information.

2503.14295 2026-06-05 cs.CV cs.AI

PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation

PC-Talk: 用于音频驱动说话面部生成的精确面部动画控制

Baiqin Wang, Xiangyu Zhu, Fan Shen, Hao Xu, Zhen Lei

发表机构 * MAIS, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所MAIS部) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Psyche AI.INC(Psyche AI公司) HKUST(香港科技大学) CAIR, HKISI, Chinese Academy of Sciences(中国科学院香港创新研究院人工智能与机器人创新中心) SCSE, FIE, M.U.S.T(澳门科技大学创新工程学院计算机科学与工程学院)

AI总结 本文针对音频驱动说话面部生成中面部动画控制不足的问题,提出PC-Talk框架,通过改进唇音对齐和情感控制来提升生成视频的多样性和用户友好性。

Comments 10 pages, 6 figures. Accepted at CVPR 2026

详情
AI中文摘要

近年来,音频驱动说话面部生成在唇同步方面取得了显著进展。然而,当前方法往往缺乏对面部动画(如说话风格和情绪表达)的充分控制,导致输出结果单一。本文聚焦于改进两个关键因素:唇音对齐和情感控制,以增强说话视频的多样性和易用性。唇音对齐控制关注说话风格和唇部运动幅度等元素,而情感控制则专注于生成逼真的情绪表达,允许对强度等多属性进行修改。为实现精确的面部动画控制,我们提出了一种新的框架PC-Talk,通过隐式关键点变形实现唇音对齐和情感控制。首先,我们的唇音对齐控制模块实现了对说话风格的精确编辑,并调整唇部运动幅度以模拟不同语音音量水平,保持与音频的同步。其次,我们的情感控制模块生成生动的情绪面部特征,通过纯粹的情绪变形实现。该模块还允许对强度进行精细修改,并在不同面部区域组合多种情绪。我们的方法在广泛的实验中展示了出色的控制能力,并在HDTF和MEAD数据集上取得了最先进的性能。

英文摘要

Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over facial animation such as speaking style and emotional expression, resulting in uniform outputs. In this paper, we focus on improving two key factors: lip-audio alignment and emotion control, to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control focuses on elements like speaking style and the scale of lip movements, whereas emotion control is centered on generating realistic emotional expressions, allowing for modifications in multiple attributes such as intensity. To achieve precise control of facial animation, we propose a novel framework, PC-Talk, which enables lip-audio alignment and emotion control through implicit keypoint deformations. First, our lip-audio alignment control module facilitates precise editing of speaking styles at the word level and adjusts lip movement scales to simulate varying vocal loudness levels, maintaining lip synchronization with the audio. Second, our emotion control module generates vivid emotional facial features with pure emotional deformation. This module also enables the fine modification of intensity and the combination of multiple emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on both HDTF and MEAD datasets in extensive experiments.

2503.11910 2026-06-05 cs.LG cs.AI math.AT math.SG

RTD-Lite: Scalable Topological Analysis for Comparing Weighted Graphs in Learning Tasks

RTD-Lite:用于学习任务中比较加权图拓扑结构的可扩展分析

Eduard Tulchinskii, Daria Voronkova, Ilya Trofimov, Evgeny Burnaev, Serguei Barannikov

发表机构 * Skoltech, AI Foundation and Algorithm Lab(斯科尔科沃科学技术学院,人工智能基础与算法实验室) Skoltech, AIRI(斯科尔科沃科学技术学院,人工智能研究院) Skoltech, CNRS(斯科尔科沃科学技术学院,法国国家科学研究中心)

AI总结 本文提出RTD-Lite算法,借助辅助图中的最小生成树,以O(n²)的时间和内存复杂度高效比较加权图的拓扑特征,适用于降维和神经网络训练等任务;实验表明,它能有效识别拓扑差异,且计算时间明显少于现有方法。

Comments Accepted for AISTATS 2025

详情
AI中文摘要

用于比较加权图的拓扑方法在各种学习任务中很有价值,但在大规模数据集上往往计算效率低下。我们提出RTD-Lite,一种可扩展算法,能够高效比较两个顶点一一对应的加权图的拓扑特征,具体而言是任意尺度下的连通性或聚类结构。借助辅助图中的最小生成树,RTD-Lite以O(n²)的时间和内存复杂度捕捉拓扑差异。这种效率使其适用于降维和神经网络训练等任务。在合成和真实世界数据集上的实验表明,与现有方法相比,RTD-Lite能够有效识别拓扑差异,同时显著减少计算时间。此外,将RTD-Lite作为损失函数的组成部分整合到神经网络训练中,可以增强所学表示对拓扑结构的保持。我们的代码在 https://github.com/ArGintum/RTD-Lite 上公开可用。

英文摘要

Topological methods for comparing weighted graphs are valuable in various learning tasks but often suffer from computational inefficiency on large datasets. We introduce RTD-Lite, a scalable algorithm that efficiently compares topological features, specifically connectivity or cluster structures at arbitrary scales, of two weighted graphs with one-to-one correspondence between vertices. Using minimal spanning trees in auxiliary graphs, RTD-Lite captures topological discrepancies with $O(n^2)$ time and memory complexity. This efficiency enables its application in tasks like dimensionality reduction and neural network training. Experiments on synthetic and real-world datasets demonstrate that RTD-Lite effectively identifies topological differences while significantly reducing computation time compared to existing methods. Moreover, integrating RTD-Lite into neural network training as a loss function component enhances the preservation of topological structures in learned representations. Our code is publicly available at https://github.com/ArGintum/RTD-Lite
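
The O(n²) figure is consistent with computing a minimal spanning tree on a dense weighted graph using the array-based form of Prim's algorithm. The sketch below shows only that building block, under the assumption of a symmetric, fully connected weight matrix; it is not the RTD-Lite algorithm itself, and `mst_weight_dense` is a hypothetical helper name.

```python
import math

def mst_weight_dense(w):
    """Total weight of a minimal spanning tree of a dense graph in O(n^2) time.

    w: symmetric n x n matrix (list of lists) of edge weights; the graph is
       assumed to be connected (every pair of vertices has a finite weight).
    """
    n = len(w)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge linking each vertex to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the cheapest vertex not yet in the tree: O(n) scan, no heap needed.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        # Relax edges from u to the remaining vertices: another O(n) pass.
        for v in range(n):
            if not in_tree[v] and w[u][v] < best[v]:
                best[v] = w[u][v]
    return total
```

Both the time and the memory are quadratic in the number of vertices, which is the same order as storing the weight matrix, so no heap or edge sorting is needed for dense inputs.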

2409.13607 2026-06-05 cs.RO

RECON: Reducing Causal Confusion with Human-Placed Markers

RECON: 通过人类放置的标记减少因果混淆

Robert Ramirez Sanchez, Heramb Nemlekar, Shahabedin Sagheb, Cara M. Nunez, Dylan P. Losey

发表机构 * Collaborative Robotics Lab ( Collab ), Dept. of Mechanical Engineering, Virginia Tech, Blacksburg, VA 24061(协作机器人实验室(Collab),机械工程系,弗吉尼亚理工学院,布莱克斯堡,VA 24061) Sibley School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, NY 14853(西伯利机械与航空航天工程学院,康奈尔大学,伊萨卡,NY 14853)

AI总结 该研究提出RECON框架,通过人类主动标记任务关键部分来减少机器人学习中的因果混淆,利用标记物数据训练任务相关状态嵌入,从而提高学习效率。

Comments 7 pages, 5 figures

详情
AI中文摘要

模仿学习使机器人能够从人类示例中学习新任务。然而,从人类学习时的一个根本限制是因果混淆。因果混淆发生在机器人观察到的任务相关和无关信息同时存在时:例如,机器人的摄像头可能不仅看到目标,还看到环境中的杂物和光照变化。由于机器人事先不知道哪些观察方面是重要的,它经常误解人类的例子,无法学习所需任务。为了解决这个问题,我们指出——尽管机器人学习者可能不知道该关注什么,但人类教师知道。在本文中,我们提出人类应主动用小型轻量的标记物标记任务关键部分。在我们的框架(RECON)中,人类在提供演示前将这些标记物附着在任务相关对象上:当人类展示任务示例时,标记物跟踪标记对象的位置。我们随后利用这些离线标记数据来训练任务相关状态嵌入。具体来说,我们将机器人的观察嵌入到一个与测量标记读数相关的潜在状态中:在实践中,这使机器人能够自动过滤掉无关观察,并基于从标记数据中学习的特征做出决策。我们的模拟和一个真实机器人实验表明,这种人类放置标记的框架可以缓解因果混淆。确实,我们发现使用RECON显著减少了传达任务所需的演示次数,从而降低人类教学的总体时间。见此处视频:https://youtu.be/oy85xJvtLSU

英文摘要

Imitation learning enables robots to learn new tasks from human examples. One fundamental limitation while learning from humans is causal confusion. Causal confusion occurs when the robot's observations include both task-relevant and extraneous information: for instance, a robot's camera might see not only the intended goal, but also clutter and changes in lighting within its environment. Because the robot does not know which aspects of its observations are important a priori, it often misinterprets the human's examples and fails to learn the desired task. To address this issue, we highlight that -- while the robot learner may not know what to focus on -- the human teacher does. In this paper we propose that the human proactively marks key parts of their task with small, lightweight beacons. Under our framework (RECON) the human attaches these beacons to task-relevant objects before providing demonstrations: as the human shows examples of the task, beacons track the position of marked objects. We then harness this offline beacon data to train a task-relevant state embedding. Specifically, we embed the robot's observations to a latent state that is correlated with the measured beacon readings: in practice, this causes the robot to autonomously filter out extraneous observations and make decisions based on features learned from the beacon data. Our simulations and a real robot experiment suggest that this framework for human-placed beacons mitigates causal confusion. Indeed, we find that using RECON significantly reduces the number of demonstrations needed to convey the task, lowering the overall time required for human teaching. See videos here: https://youtu.be/oy85xJvtLSU

2502.20914 2026-06-05 cs.LG cs.AI cs.CL

Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?

Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard

发表机构 * Université Grenoble Alpes, CNRS, Grenoble INP, LIG(格勒诺布尔阿尔卑斯大学、国家科学研究中心、格勒诺布尔INP、实验室LIG)

AI总结 本文探讨了在机械可解释性(MI)框架下,给定行为是否具有唯一解释的问题,通过统计可识别性理论分析了MI解释的可识别性,并提出了两种主要策略及实验结果。

详情
Journal ref
The Thirteenth International Conference on Learning Representations (ICLR 2025)
AI中文摘要

随着AI系统应用于高风险领域,确保可解释性至关重要。机械可解释性(MI)旨在通过提取人类可理解的算法来解释神经网络的行为。本文探讨了一个关键问题:在给定行为下,根据MI的标准,是否存在唯一的解释?借鉴统计学中的可识别性,其中参数在特定假设下可以唯一推断,我们探索了MI解释的可识别性。我们识别出两种主要的MI策略:(1)“where-then-what”,通过隔离复制模型行为的电路并在之后解释它;(2)“what-then-where”,从候选算法开始,通过因果对齐搜索实现它们的神经激活子空间。我们对布尔函数和小型多层感知机测试了这两种策略,完全枚举了候选解释。实验揭示了系统性的不可识别性:多个电路可以复制行为,一个电路可以有多种解释,多个算法可以与网络对齐,一个算法可以与不同的子空间对齐。是否需要唯一性?一种务实的方法可能只需要预测性和可操作性标准。如果唯一性对理解至关重要,可能需要更严格的条件。我们还参考了内部可解释性框架,该框架通过多种标准验证解释。本文为定义AI中的解释标准做出了贡献。

英文摘要

As AI systems are used in high-stakes applications, ensuring interpretability is crucial. Mechanistic Interpretability (MI) aims to reverse-engineer neural networks by extracting human-understandable algorithms to explain their behavior. This work examines a key question: for a given behavior, and under MI's criteria, does a unique explanation exist? Drawing on identifiability in statistics, where parameters are uniquely inferred under specific assumptions, we explore the identifiability of MI explanations. We identify two main MI strategies: (1) "where-then-what," which isolates a circuit replicating model behavior before interpreting it, and (2) "what-then-where," which starts with candidate algorithms and searches for neural activation subspaces implementing them, using causal alignment. We test both strategies on Boolean functions and small multi-layer perceptrons, fully enumerating candidate explanations. Our experiments reveal systematic non-identifiability: multiple circuits can replicate behavior, a circuit can have multiple interpretations, several algorithms can align with the network, and one algorithm can align with different subspaces. Is uniqueness necessary? A pragmatic approach may require only predictive and manipulability standards. If uniqueness is essential for understanding, stricter criteria may be needed. We also reference the inner interpretability framework, which validates explanations through multiple criteria. This work contributes to defining explanation standards in AI.
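
A toy version of the non-identifiability result can be seen by enumerating truth tables: two structurally different Boolean circuits can produce exactly the same behavior, so behavior alone cannot single out one explanation. The circuits and helper names below are illustrative choices, not examples taken from the paper.

```python
from itertools import product

def circuit_a(x, y):
    # XOR as (x OR y) AND NOT (x AND y)
    return (x or y) and not (x and y)

def circuit_b(x, y):
    # XOR as (x AND NOT y) OR (NOT x AND y)
    return (x and not y) or (not x and y)

def truth_table(f, n_inputs=2):
    """Outputs of f on every Boolean input, in lexicographic order (False before True)."""
    return tuple(bool(f(*bits)) for bits in product([False, True], repeat=n_inputs))

# Both circuits yield the same table even though their internal structure differs.
same_behavior = truth_table(circuit_a) == truth_table(circuit_b)
```

Exhaustive enumeration of this kind is feasible only for small input sizes, which matches the abstract's focus on Boolean functions and small multi-layer perceptrons.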

2502.14145 2026-06-05 cs.CL eess.AS

LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

基于大语言模型的全双工语音对话系统对话管理

Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu

发表机构 * Tencent AI Lab(腾讯人工智能实验室)

AI总结 本文提出一种基于大语言模型的语义语音活动检测模块,用于高效管理全双工语音对话系统中的话轮转换;该模块由轻量级大语言模型实现,可区分有意和无意打断,并以短间隔处理输入语音来实现实时决策,同时减少计算开销。

详情
AI中文摘要

在语音对话系统(SDS)中实现全双工通信,需要实时协调听、说和思考。本文提出一个语义语音活动检测(VAD)模块作为对话管理器(DM),用于高效管理全双工SDS中的话轮转换。该语义VAD实现为一个在全双工对话数据上微调的轻量级(0.5B)大语言模型,预测四个控制标记以调节话轮切换和话轮保持,区分有意和无意打断,同时检测查询是否完成,以处理用户的停顿和犹豫。通过以短间隔处理输入语音,语义VAD实现了实时决策,而核心对话引擎(CDE)仅在生成响应时才被激活,从而减少计算开销。这种设计允许独立优化DM而无需重新训练CDE,在交互准确性和推理效率之间取得平衡,以实现可扩展的下一代全双工SDS。

英文摘要

Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.

2502.06434 2026-06-05 cs.CV cs.LG

Unifying Dataset Pruning and Distillation for Efficient Large-scale Compression

Unifying Dataset Pruning and Distillation to Achieve Efficient Large-Scale Compression

Lingao Xiao, Songhua Liu, Yang He, Xinchao Wang

Affiliations * University of Science and Technology of China

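AI summary This paper proposes a unified dataset compression benchmark to study the convergence of dataset pruning and distillation. It finds that soft-label distillation underperforms pruning at small dataset sizes and proposes a hard-label dataset compression method, the PCA framework, that improves image quality and storage efficiency.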

Comments Accepted by ICML 2026

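Details
AI Chinese abstract (English translation)

Dataset pruning (DP) and dataset distillation (DD) differ fundamentally in their outputs: DP selects a subset of the original images, while DD generates synthetic images. Recently, DD's growing reliance on original images suggests that the two approaches are converging. To study this convergence, we propose a unified dataset compression (DC) benchmark. The benchmark reveals an interesting trade-off for soft-label DD: although soft labels provide valuable information, they can make the distillation process unnecessary, because distilled images may not always outperform random subsets. The benchmark also shows that, at the current stage, dataset pruning outperforms dataset distillation at small dataset sizes. Given these observations, we explore hard-label DC as a complementary approach that emphasizes image quality while offering substantial storage efficiency. Our PCA (Prune, Combine, and Augment) is the first framework that focuses on image quality instead of relying on soft labels. (1)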

English abstract

Dataset pruning (DP) and dataset distillation (DD) fundamentally differ in their outputs: DP selects original image subsets, while DD generates synthetic images. Recently, DD's increasing reliance on original images suggests a convergence of the two directions. To investigate this convergence trend, we propose a unified dataset compression (DC) benchmark. This benchmark reveals an interesting trade-off for soft-label-DD: while soft labels provide valuable information, they can make the distillation process less essential, as distilled images may not always outperform random subsets. In addition, the benchmark reveals that in current stages, dataset pruning outperforms dataset distillation at small dataset sizes. Given these observations, we explore hard-label-DC as a complementary approach that emphasizes image quality while offering substantial storage efficiency. Our PCA (Prune, Combine, and Augment) is the first framework that does not rely on soft labels but instead focuses on image quality. (1) "P" means selecting easy samples based on dataset pruning metrics, (2) "C" indicates combining these samples effectively, and (3) "A" is to apply constrained image augmentation during training. Our code is available at https://github.com/ArmandXiao/Unifying-Dataset-Pruning-and-Distillation
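The three PCA steps can be sketched roughly as follows. The abstract does not say how samples are scored, combined, or augmented, so every concrete choice here is an assumption made for illustration: lower score means easier, combining is a 2x2 mosaic of downsampled images, and the constraint on augmentation is that flips stay inside each tile. The function names are hypothetical and are not taken from the linked repository.

```python
import numpy as np


def prune_easiest(scores, labels, per_class):
    """P: keep the per_class lowest-score samples of each class (assumes low score = easy)."""
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        order = idx[np.argsort(scores[idx], kind="stable")]
        keep.extend(order[:per_class].tolist())
    return np.array(keep)


def combine_2x2(images):
    """C (assumed form): downsample four HxWxC images by 2 and tile them into one HxWxC image."""
    assert images.shape[0] == 4, "expects exactly four images"
    small = images[:, ::2, ::2, :]  # naive stride-2 downsampling, H and W must be even
    top = np.concatenate([small[0], small[1]], axis=1)
    bottom = np.concatenate([small[2], small[3]], axis=1)
    return np.concatenate([top, bottom], axis=0)


def augment_tiles(mosaic, rng):
    """A (assumed form): flip each tile independently so augmentation never mixes tiles."""
    h, w = mosaic.shape[0] // 2, mosaic.shape[1] // 2
    out = mosaic.copy()
    for r in (0, 1):
        for c in (0, 1):
            if rng.random() < 0.5:
                tile = out[r * h:(r + 1) * h, c * w:(c + 1) * w].copy()
                out[r * h:(r + 1) * h, c * w:(c + 1) * w] = tile[:, ::-1]
    return out
```

In this sketch one stored image carries four pruned samples, which is one way the storage saving mentioned in the abstract could arise without soft labels; the actual combination scheme may differ.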