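arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.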

2510.26236 2026-06-05 cs.RO

PHUMA: Physically Reliable Humanoid Locomotion Dataset


Kyungmin Lee, Sibeen Kim, Youngdo Lee, Minho Park, Hyunseung Kim, Dongyoon Hwang, Donghu Kim, Hojoon Lee, Jaegul Choo

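Affiliations: KAIST (Korea Advanced Institute of Science and Technology)

AI summary: This paper presents the PHUMA dataset, which combines physics-aware curation with physics-constrained retargeting to merge motion capture and web video into physically reliable humanoid motion data, improving the stability and generalization of humanoid locomotion.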

Abstract

Motion imitation is a promising approach for humanoid locomotion, enabling agents to acquire humanlike behaviors. Existing methods typically rely on high-quality motion capture datasets such as AMASS, but these are scarce and expensive, limiting scalability and diversity. Recent studies attempt to scale data collection by converting large-scale internet videos, exemplified by Humanoid-X. However, they often suffer from physical artifacts such as floating, penetration, and foot skating, which hinder stable imitation. To address this, we introduce PHUMA, a Physically Reliable HUMAnoid locomotion dataset produced by a two-stage pipeline combining physics-aware curation and physics-constrained retargeting, aggregating both motion capture and internet video into a physically reliable, 73-hour corpus. On motion tracking benchmarks, PHUMA-trained policies achieve higher success rates than those trained on AMASS and Humanoid-X, and successfully transfer zero-shot to a real Unitree G1. The code is available at https://davian-robotics.github.io/PHUMA.


2510.23497 2026-06-05 cs.CV

VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation


Walid Bousselham, Hilde Kuehne, Cordelia Schmid

Affiliations: Tuebingen AI Center; University of Tuebingen; MIT-IBM Watson AI Lab; Inria, École Normale Supérieure, CNRS, PSL Research University

AI summary: This paper proposes VOLD, a framework that transfers reasoning ability from text-only models to vision-language models by combining Group Relative Policy Optimization with on-policy distillation, improving reasoning performance and showing that cold-start alignment matters for the online training phase.

Comments www.walidbousselham.com/VOLD/

Abstract

Training vision-language models (VLMs) for complex reasoning remains a challenging task, among other reasons because high-quality image-text reasoning data is scarce. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
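
As a rough illustration of the group-relative part of GRPO named in the abstract, the sketch below normalizes each reward against the other responses sampled for the same prompt. The function name, the reward values, and the use of the population standard deviation are assumptions made for this example; the distillation term and VOLD's actual objective are not shown, since the abstract does not specify them.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses sampled for the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group's average get a positive advantage and the rest a negative one, so no separate value model is needed to form the baseline.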


2505.15405 2026-06-05 cs.LG

HOPSE: Scalable Higher-Order Positional and Structural Encoder for Combinatorial Representations


Guillermo Bernárdez, Marco Montagna, Louis Van Langendonck, Martin Carrasco, Amirreza Akbari, Louisa Cornelis, Mathilde Papillon, Pere Barlet-Ros, Nina Miolane, Lev Telyatnikov

Affiliations: University of California, Santa Barbara; Sapienza University of Rome; Universitat Politècnica de Catalunya; University of Fribourg; Aalto University; Intelligent Maintenance and Operations Systems, EPFL

AI summary: This paper proposes HOPSE, a framework free of message-passing layers that uses Hasse graph decompositions to derive efficient and expressive encodings over arbitrary higher-order domains. It scales linearly with the size of the combinatorial representation while preserving the expressive power and permutation equivariance of HOMP methods, and experiments on molecular and topological benchmarks show strong performance at higher speed.

Abstract

While Graph Neural Networks (GNNs) have proven highly effective at modeling relational data, pairwise connections cannot fully capture multi-way relationships naturally present in complex real-world systems. In response to this, Topological Deep Learning (TDL) leverages more general combinatorial representations--such as simplicial or cellular complexes--to accommodate higher-order interactions. Existing TDL methods often extend GNNs through Higher-Order Message Passing (HOMP), but face critical scalability challenges due to the steep complexity overhead of propagating messages through combinatorial structures. To overcome this limitation, we propose HOPSE (Higher-Order Positional and Structural Encoder), a framework free of message passing layers that uses Hasse graph decompositions to derive efficient and expressive encodings over arbitrary higher-order domains. Notably, HOPSE scales linearly with the size of combinatorial representations while preserving the expressive power and permutation equivariance of the HOMP approaches. Experiments on molecular and topological benchmarks show that it matches or surpasses state-of-the-art performance while consistently achieving speedups over HOMP-based models, opening a new path for scalable TDL. The code is available at https://github.com/geometric-intelligence/topobench.git.


2510.22768 2026-06-05 cs.CL

Seeing is Believing? Evaluating Vision-Language Model Susceptibility in Agent-to-Agent Multimodal Persuasion


Haoyi Qiu, Yilun Zhou, Pranav Narayanan Venkit, Kung-Hsiang Huang, Jiaxin Zhang, Nanyun Peng, Chien-Sheng Wu

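Affiliations: University of California, Los Angeles; Salesforce AI Research

AI summary: This paper studies how susceptible vision-language models are to multimodal content in agent-to-agent persuasion. It introduces the MMPersuade framework and dataset, and its experiments show that multimodal inputs persuade more effectively than text alone, that persuadee susceptibility depends on domain and format, and that the effect of psychological strategies varies with context and model architecture.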

Abstract

As autonomous agents increasingly interact, they inevitably attempt to influence one another. While prior work in text-only settings has explored the dynamics of Agent-to-Agent (A2A) persuasion, the rise of Vision-Language Models (VLMs) introduces a more complex challenge: multimodal content conveys richer information while integrating subtle, hard-to-detect persuasive cues. To study this vulnerability, we present MMPersuade, a unified framework and dataset for A2A multimodal persuasion. We model interactions between a persuader agent, which leverages images and psychological strategies, and a persuadee VLM. Our benchmark spans commercial, subjective and behavioral, and adversarial contexts, and evaluates persuasion via function calling that captures behavioral shifts beyond verbal responses. Experiments on six VLMs reveal three findings: (1) multimodal inputs consistently outperform text-only persuasion, with raw visual signals uniquely increasing susceptibility in adversarial settings by bypassing text-activated safety defenses; (2) persuadee vulnerability is highly domain- and format-dependent, with realistic and community-style formats driving susceptibility in commercial settings while different formats dominate in adversarial ones; and (3) psychological strategy efficacy varies with context and model architecture, as more capable models resist benign persuasion yet become more susceptible under adversarial multimodal inputs. Our framework provides a foundation for building more robust and aligned VLMs in multi-agent environments.


2510.17256 2026-06-05 cs.CL

Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations


Shahin Atakishiyev, Housam K. B. Babiker, Jiayi Dai, Nawshad Farruque, Teruaki Hayashi, Nafisa Sadaf Hriti, Md Abed Rahman, Iain Smith, Mi-Young Kim, Osmar R. Zaïane, Randy Goebel

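Affiliations: University of Alberta; University of Tokyo

AI summary: This paper examines the explainability of large language models, reviews local explainability and mechanistic interpretability methods, reports experimental studies in two critical domains (healthcare and autonomous driving), and summarizes open problems and future directions in the field.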

Abstract

Large language models have exhibited impressive performance across a broad range of downstream tasks in natural language processing. However, how a language model predicts the next token and generates content is not generally understandable by humans. Furthermore, these models often make errors in prediction and reasoning, known as hallucinations. These errors underscore the urgent need to better understand and interpret the intricate inner workings of language models and how they generate predictive outputs. Motivated by this gap, this paper investigates local explainability and mechanistic interpretability within Transformer-based large language models to foster trust in such models. In this regard, our paper aims to make three key contributions. First, we present a review of local explainability and mechanistic interpretability approaches and insights from relevant studies in the literature. Furthermore, we describe experimental studies on explainability and reasoning with large language models in two critical domains -- healthcare and autonomous driving -- and analyze the trust implications of such explanations for explanation receivers. Finally, we summarize current unaddressed issues in the evolving landscape of LLM explainability and outline the opportunities, critical challenges, and future directions toward generating human-aligned, trustworthy LLM explanations.


2510.09061 2026-06-05 cs.SD eess.AS

O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion


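Huu Tuong Tu, Huan Vu, Cuong Tien Nguyen, Dien Hy Ngo, Nguyen Thi Thu Trang

Affiliations: VNPT AI, VNPT Group; Hanoi University of Science and Technology; Business AI Lab, National Economics University

AI summary: This paper proposes a synthetic-data-driven approach to any-to-any voice conversion. It trains on synthetic speech pairs generated by a high-quality pretrained multispeaker text-to-speech model to learn a direct mapping from source to target voice, capturing speaker-specific characteristics while preserving linguistic content and improving adaptability and performance in zero-shot scenarios.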

Comments EMNLP 2025


DOI: 10.18653/v1/2025.findings-emnlp.879
Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2025

Abstract

Traditional voice conversion (VC) methods typically attempt to separate speaker identity and linguistic information into distinct representations, which are then combined to reconstruct the audio. However, effectively disentangling these factors remains challenging, often leading to information loss during training. In this paper, we propose a new approach that leverages synthetic speech data generated by a high-quality, pretrained multispeaker text-to-speech (TTS) model. Specifically, synthetic data pairs that share the same linguistic content but differ in speaker identity are used as input-output pairs to train the voice conversion model. This enables the model to learn a direct mapping between source and target voices, effectively capturing speaker-specific characteristics while preserving linguistic content. Additionally, we introduce a flexible training strategy for any-to-any voice conversion that generalizes well to unseen speakers and new languages, enhancing adaptability and performance in zero-shot scenarios. Our experiments show that our proposed method achieves a 16.35% relative reduction in word error rate and a 5.91% improvement in speaker cosine similarity, outperforming several state-of-the-art methods. Voice conversion samples can be accessed at: https://oovc-emnlp-2025.github.io/


2510.05544 2026-06-05 cs.CL cs.LG

Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM


Ryan Solgi, Parsa Madinei, Jiayi Tian, Rupak Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang

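Affiliations: University of California-Santa Barbara; Amazon

AI summary: This paper proposes an activation-informed, Pareto-guided low-rank compression method that, through theoretical analysis and algorithm design, improves the compression efficiency and inference speed of LLMs and VLMs while maintaining model accuracy.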

2504.10020 2026-06-05 cs.CL cs.AI cs.CV

The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?


Hao Yin, Guangzong Si, Zilei Wang

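Affiliations: University of Science and Technology of China; Eastern Institute of Technology, Ningbo

AI summary: This paper examines whether contrastive decoding mitigates object hallucinations in multimodal large language models (MLLMs) and finds that the reported gains stem mainly from two misleading factors, calling the effectiveness of contrastive decoding strategies into question.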

Abstract

Contrastive decoding strategies are widely used to reduce object hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on POPE Benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model's output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.
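
To make the second factor concrete, here is a minimal sketch of one common formulation of contrastive decoding with an adaptive plausibility constraint. It is not taken from the paper: the function names, the weights `alpha` and `beta`, and the toy logits are all hypothetical. Tokens are kept only if their original probability is at least `beta` times the top probability, and the survivors are scored by amplifying the original logits against the contrastive ones.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_candidates(logits, logits_contrast, alpha=1.0, beta=0.1):
    """Return {token_index: contrastive_score} for tokens passing the plausibility cutoff."""
    probs = softmax(logits)
    cutoff = beta * max(probs)  # adaptive plausibility threshold
    return {
        i: (1 + alpha) * l - alpha * lc
        for i, (l, lc, p) in enumerate(zip(logits, logits_contrast, probs))
        if p >= cutoff
    }

# Toy logits from the original input and from a distorted (contrastive) input.
logits = [4.0, 2.0, 0.5]
logits_contrast = [3.5, 2.5, 0.5]
loose = contrastive_candidates(logits, logits_contrast, beta=0.1)   # keeps tokens 0 and 1
strict = contrastive_candidates(logits, logits_contrast, beta=0.9)  # keeps only token 0
```

With a strict cutoff only the top-probability token survives, so any sampling over the survivors coincides with greedy search, which is the kind of reduction the abstract describes.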


2510.04500 2026-06-05 cs.LG

Expand Neurons, Not Parameters


Linghao Kong, Inimai Subramanian, Yonadav Shavit, Micah Adler, Dan Alistarh, Nir Shavit

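Affiliations: University of Washington; Microsoft Research

AI summary: Increasing the number of neurons without increasing the total number of non-zero parameters reduces feature interference and improves network performance; the effect is validated across several model types.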

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). 9 pages, 6 figures. Code available at https://github.com/Shavit-Lab/Expand-Neurons

Abstract

This work demonstrates how increasing the number of neurons in a network without increasing its total number of non-zero parameters improves performance. We show that this gain corresponds with a decrease in interference between multiple features that would otherwise share the same neurons. On symbolic Boolean tasks, splitting each neuron into sparser sub-neurons with knowledge of the clauses systematically reduces polysemanticity metrics and yields higher task accuracy. Notably, even random splits of neuron weights approximate these gains, indicating that reduced collisions, not precise assignment, are a primary driver. Consistent with the superposition hypothesis, the benefits of this framework grow with increasing interference: when polysemantic load is high, accuracy improvements are the largest. Transferring these insights to more realistic models, including classifiers over CLIP embeddings, convolutional neural networks, and deeper multilayer networks, we find that widening networks while maintaining a constant non-zero parameter count consistently increases accuracy. These results identify an interpretability-grounded mechanism to leverage width against superposition, improving performance without increasing the number of non-zero parameters. Such a direction is well matched to modern accelerators, where memory movement of non-zero parameters, rather than raw compute, is often a dominant bottleneck.
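
The random-split idea can be sketched as follows, assuming a neuron is represented only by its incoming weight vector. The helper names and the toy weights are hypothetical, and the handling of biases and outgoing weights is left out because the abstract does not describe it. Each input weight is assigned to exactly one of `k` sub-neurons, so the sub-neurons are sparser while the number of non-zero incoming weights stays the same.

```python
import random

def split_neuron(weights, k, rng):
    """Randomly partition one neuron's input weights across k sparser sub-neurons."""
    subs = [[0.0] * len(weights) for _ in range(k)]
    for j, w in enumerate(weights):
        subs[rng.randrange(k)][j] = w  # each weight lands in exactly one sub-neuron
    return subs

def count_nonzero(rows):
    """Count non-zero entries across a list of weight vectors."""
    return sum(1 for row in rows for w in row if w != 0.0)

rng = random.Random(0)
neuron = [0.5, -1.2, 0.0, 0.8, 2.0, -0.3]  # toy incoming weights
sub_neurons = split_neuron(neuron, k=2, rng=rng)
```

Because the supports are disjoint, summing the sub-neuron weight vectors recovers the original vector, and `count_nonzero(sub_neurons)` equals `count_nonzero([neuron])`.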


2508.09697 2026-06-05 cs.LG cs.CV

Towards Label-Noise Resistant Learning via Optimal Brain Damage Masking


Xinlei Zhang, Fan Liu, Chuanyi Zhang, Fan Cheng, Qian Li, Yuhui Zheng

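Affiliations: Hohai University

AI summary: This paper proposes a label-noise-resistant learning method based on Optimal Brain Damage theory that masks redundant connections to reduce the propagation of noisy gradients and improve model robustness.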

Abstract

Noisy labels are inevitable in real-world scenarios. Due to the strong capacity of deep neural networks to memorize corrupted labels, these noisy labels cause significant performance degradation. Existing noise-robust methods have mainly focused on robust loss functions and sample selection, with comparatively limited exploration of dynamic architectural adaptation. In this paper, we rethink the role of model connectivity in the presence of label noise. Intuitively, performance degradation caused by noisy labels stems from the backpropagation of noisy gradients. Since the final classifier layer acts as the primary gateway for this error propagation, directly discarding redundant connections within the classifier can structurally intercept noisy gradients at the root. Consequently, to identify these redundant connections, we leverage the seminal Optimal Brain Damage (OBD) theory from model compression, which posits that parameters causing negligible loss perturbation can be safely removed without impairing performance. Guided by this principle, we reveal that masking low-activation edges maintains the network's normal fitting capacity while effectively reducing the risk of backpropagating noisy gradients. To bridge this theoretical insight with practical training, we propose a novel Selective Edge Masking (SEM) mechanism for the widely-adopted fully connected (FC) layer to enhance model robustness against noisy labels. It can adaptively preserve only the most critical edges for information propagation while suppressing gradient errors caused by noisy labels. As a plug-and-play component, SEM can be seamlessly integrated into various noise-robust methods, including robust loss functions and sample selection. Extensive evaluations on both synthetic and real-world benchmarks demonstrate that our OBD-driven approach consistently outperforms state-of-the-art methods.
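
One way to picture edge masking in a fully connected layer is sketched below. It rests on an assumption: an edge's "activation" is taken to be the magnitude of its contribution `|w * x|`, which the abstract does not define, and the function name, the `keep` parameter, and the toy values are hypothetical. For each output unit only the `keep` strongest edges contribute to the forward pass; in an autograd framework the masked edges would then receive no gradient. Biases are omitted.

```python
def masked_fc_forward(x, weights, keep):
    """FC forward pass keeping, per output unit, only the `keep` largest |w * x| edges."""
    outputs = []
    for row in weights:
        contributions = [w * xi for w, xi in zip(row, x)]
        # Indices of the strongest edges for this output unit.
        top = sorted(
            range(len(contributions)),
            key=lambda j: abs(contributions[j]),
            reverse=True,
        )[:keep]
        outputs.append(sum(contributions[j] for j in top))
    return outputs

features = [1.0, 0.1, 2.0]                       # toy input features
weights = [[0.5, 0.2, -1.0], [0.01, 3.0, 0.4]]   # one row per output unit
logits = masked_fc_forward(features, weights, keep=2)
```

Setting `keep` to the input dimension recovers the ordinary dense layer, which makes the mask easy to compare against an unmasked baseline.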


2509.22015 2026-06-05 cs.LG

Concept-SAE: A Controllable and Invertible Concept Interface for Sparse Autoencoders


Jianrong Ding, Muxi Chen, Chenchen Zhao, Qiang Xu

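Affiliations: The Chinese University of Hong Kong

AI summary: This paper proposes Concept-SAE, a framework for probing user-defined concepts through a structured, controllable interface. By decomposing an activation subspace into Concept Tokens and Free Tokens, it yields high-fidelity, well-localized, and disentangled concept representations that outperform existing alternatives.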

Comments Accepted by ECML PKDD 2026, the project can be found at https://github.com/RafaDD/Concept-SAE

Abstract

Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model's learned features, providing a powerful lens for passive feature discovery. However, this passive nature makes it difficult to systematically evaluate or analyze concepts that users explicitly care about. We introduce Concept-SAE, a framework that augments SAEs with a structured and controllable interface for probing user-defined concepts. Concept-SAE decomposes an activation subspace into two orthogonal components: Concept Tokens, which are aligned to externally specified semantics through dual supervision on both concept existence and spatial localization, and Free Tokens, which operate like standard SAEs to capture all remaining information. This hybrid disentanglement strategy ensures that Concept Tokens are faithful, spatially grounded, and cleanly separated from the residual subspace while preserving the ability of SAEs for open-ended concept discovery. We conduct extensive experiments demonstrating that Concept-SAE yields high-fidelity, well-localized, and strongly disentangled concept representations, outperforming alternatives in interface quality. Finally, we validate the utility of this conceptual interface through three diagnostic evaluations: a detection test on classifying adversarial image samples, a controllability test focusing on controlled counterfactual editing and a stability test using adversarial perturbations. Together, these results show that Concept-SAE equips SAEs with a reliable mechanism for evaluating, probing, and diagnosing user-defined concepts.


2504.10823 2026-06-05 cs.CL cs.AI

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives


Ayoung Lee, Ryan Sungmo Kwon, Peter Railton, Lu Wang

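Affiliations: Department of Computer Science and Engineering; Department of Philosophy; University of Michigan, Ann Arbor

AI summary: This paper introduces the CLASH dataset for studying value-based decision-making and finds that language models struggle with ambivalent decisions and with perspectives involving value shifts, even though they predict psychological discomfort reasonably well.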

Comments Published as a conference paper at ICLR 2026

Abstract

Navigating dilemmas involving conflicting values is challenging even for humans in high-stakes domains, let alone for AI, yet prior work has been limited to everyday scenarios. To close this gap, we introduce CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. CLASH enables the study of critical yet underexplored aspects of value-based decision-making processes, including understanding of decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in the perspectives of characters. By benchmarking 14 non-thinking and thinking models, we uncover several key findings. (1) Even strong proprietary models, such as GPT-5 and Claude-4-Sonnet, struggle with ambivalent decisions, achieving only 24.06 and 51.01 accuracy. (2) Although LLMs reasonably predict psychological discomfort, they do not adequately comprehend perspectives involving value shifts. (3) Cognitive behaviors that are effective in the math-solving and game strategy domains do not transfer to value reasoning. Instead, new failure patterns emerge, including early commitment and overcommitment. (4) The steerability of LLMs towards a given value is significantly correlated with their value preferences. (5) Finally, LLMs exhibit greater steerability when reasoning from a third-party perspective, although certain values (e.g., safety) benefit uniquely from first-person framing.


2509.15061 2026-06-05 cs.RO cs.CV

Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue


Xingyao Lin, Xinghao Zhu, Tianyi Lu, Sicheng Xie, Hui Zhang, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

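Affiliations: College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China; Shanghai Innovation Institute, Shanghai, China; Mechanical Systems Control Lab, UC Berkeley, California, USA

AI summary: This paper proposes the Ask-to-Clarify framework, which resolves ambiguous instructions through multi-turn dialogue. It combines a vision-language model with a diffusion model and is trained with a two-stage knowledge-insulation strategy, yielding a more effective collaborative embodied agent across multiple tasks.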

Comments 9 pages, 4 figures, 7 tables

Abstract

The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recently, advances in VLAs have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue. Then it generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components, one VLM for collaboration and one diffusion for action. We also introduce a connection module that generates conditions for the diffusion based on the output of the VLM. This module adjusts the observation by instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component using ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration one. This preserves the interaction abilities while fine-tuning the diffusion to generate actions. The training strategy guarantees our framework can first ask questions, then generate actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework in 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.


2307.05284 2026-06-05 cs.LG cs.AI

Rethinking Distribution Shifts: Empirical Analysis and Modeling for Tabular Data


Tianyu Wang, Jiashuo Liu, Peng Cui, Hongseok Namkoong

Affiliations: Department of Industrial Engineering and Operations Research; Department of Computer Science and Technology; Decision, Risk, and Operations Division; Columbia University; Tsinghua University

AI summary: Through empirical analysis and modeling, this paper revisits distribution shift and finds that Y|X-shifts are the most prevalent in tabular data, in stark contrast to the ML literature's focus on X (covariate) shifts, and that robust algorithms perform no better than vanilla methods.

Comments Forthcoming at Management Science. Conference version appeared in NeurIPS 2023, previously titled "On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets"

Abstract

Different distribution shifts require different interventions, and algorithms must be grounded in the specific shifts they address. However, methodological development for robust algorithms typically relies on structural assumptions that lack empirical validation. Advocating for an empirically grounded data-driven approach to algorithm development, we build an empirical testbed comprising natural shifts across 8 tabular datasets, 172 distribution pairs over 45 methods and 90,000 method configurations encompassing empirical risk minimization and distributionally robust optimization (DRO) methods. We find $Y|X$-shifts are most prevalent in our testbed, in stark contrast to the heavy focus on $X$ (covariate)-shifts in the ML literature, and that the performance of robust algorithms is no better than that of vanilla methods. To understand why, we conduct an in-depth empirical analysis of DRO methods and find that underlooked implementation details -- such as the choice of underlying model class (e.g., LightGBM) and hyperparameter selection -- have a bigger impact on performance than the ambiguity set or its radius. We illustrate via case studies how a data-driven, inductive understanding of distribution shifts can provide a new approach to algorithm development.

2508.15851 2026-06-05 cs.CL

DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections

DocHop-QA：迈向多模态文档集合上的多跳推理

Jiwon Park, Seohyun Pyeon, Jinwoo Kim, Rina Carines Cabal, Zhenyuan He, Yihao Ding, Soyeon Caren Han

发表机构 * Pohang University of Science and Technology（浦项科技大学）； The University of Sydney（悉尼大学）； The University of Western Australia（西澳大学）； The University of Melbourne（墨尔本大学）

AI总结本文提出DocHop-QA基准，通过多模态、多文档、多跳科学问答评估多模态证据综合能力，揭示当前模型在长上下文和多证据需求下的局限性。

AI中文摘要

尽管大语言模型（LLMs）在快速进步，当前QA基准仍忽视了现实世界科学信息检索的核心挑战：合成散落在多个文档和结构格式中的多模态证据。现有的QA基准范围狭窄，依赖单模态文本和短跨度推理，无法捕捉真实信息检索的复杂性。我们引入DocHop-QA，一个包含11,379个实例的基准，用于评估多模态、多文档、多跳科学QA。该基准基于公开可用的PubMed文章构建，包含文本段落、表格和布局线索，能够在没有显式超链接的情况下实现跨文档推理。为了扩展现实QA的构建，我们开发了一个基于11个科学推理概念的LLM驱动生成管道，生成多样且连贯的问题-答案对。为了突出数据集的实用性和多功能性，我们提出一个任务驱动的评估框架，涵盖四个设置，包括生成回答、多模态证据整合和结构化索引预测。实验表明，当前模型在DocHop-QA的长上下文和多证据需求下表现不佳，确立了其作为推进下一代科学QA系统严格测试平台的地位。

英文摘要

Despite rapid progress in large language models (LLMs), current QA benchmarks still overlook the core challenge of real-world scientific information seeking: synthesizing multimodal evidence scattered across multiple documents and structural formats. Existing QA benchmarks remain narrow in scope, relying on unimodal text and short-span reasoning that fail to capture the complexity of real information seeking. We introduce DocHop-QA, a benchmark of 11,379 instances for evaluating multimodal, multi-document, multi-hop scientific QA. Built from publicly available PubMed articles, DocHop-QA incorporates textual passages, tables, and layout cues, enabling cross-document inference without explicit hyperlinks. To scale realistic QA construction, we develop an LLM-driven generation pipeline grounded in 11 scientific reasoning concepts, producing diverse and coherent question-answer pairs. To highlight the utility and versatility of the dataset, we propose a task-driven evaluation framework spanning four settings, including generative answering, multimodal evidence integration, and structured index prediction. Experiments show that current models struggle with the long-context and multi-evidence demands of DocHop-QA, establishing it as a rigorous testbed for advancing next-generation scientific QA systems.

2508.00537 2026-06-05 cs.CL

The Prosody of Emojis

表情符号的韵律

Giulio Zhou, Tsz Kin Lam, Alexandra Birch, Barry Haddow

发表机构 * University of Edinburgh（爱丁堡大学）； NatWest ； Aveni

AI总结研究探讨了表情符号如何影响语音中的韵律实现，并揭示听众如何通过韵律线索还原表情符号的含义，发现语义差异越大，韵律差异越明显，表明表情符号是连接数字文本和口语表达的韵律意图载体。

Comments ACL 26

AI中文摘要

音高、时序和语调等韵律特征是口语交流的核心，传达情感、意图和话语结构。在基于文本的环境中，这些线索缺失，表情符号充当视觉替代品，增添情感和语用上的细微差别。本研究考察表情符号如何影响语音中的韵律实现，以及听众如何解读韵律线索以还原表情符号的含义。与以往研究不同，我们通过分析在受控诱发产出任务中收集的人类语音数据，直接将韵律与表情符号联系起来。使用贝叶斯多层模型，我们表明说话者会根据表情符号线索系统地调整韵律，且听众还原预期含义的准确率显著高于随机水平。此外，结果揭示了韵律变化的清晰层级：表情符号之间的语义差异越大，韵律差异越明显。这些发现表明，表情符号是承载韵律意图的有意义载体，弥合了数字文本与口语产出之间的差距。

英文摘要

Prosodic features such as pitch, timing, and intonation are central to spoken communication, conveying emotion, intent, and discourse structure. In text-based settings, where these cues are absent, emojis act as visual surrogates that add affective and pragmatic nuance. This study examines how emojis influence prosodic realisation in speech and how listeners interpret prosodic cues to recover emoji meanings. Unlike previous work, we directly link prosody and emojis by analysing human speech data collected through a controlled elicited production task. Using Bayesian multilevel modelling, we show that speakers systematically adapt their prosody based on emoji cues, and that listeners can recover intended meanings significantly above chance. Furthermore, our results reveal a clear hierarchy in prosodic shifts: greater semantic differences between emojis correspond to increased prosodic divergence. These findings suggest that emojis are meaningful carriers of prosodic intent that bridge the gap between digital text and spoken production.

2503.22929 2026-06-05 cs.CV

Self-supervised Feature Disentanglement and Augmentation Network for One-class Face Anti-spoofing

自监督特征解耦与增强网络用于单类面部反伪装

Pei-Kai Huang, Jun-Xiong Chong, Ming-Tsung Hsu, Fang-Yu Hsu, Yi-Ting Lin, Kai-Heng Chien, Hao-Chiang Shao, Chiou-Ting Hsu

发表机构 * National Tsing Hua University（国立清华大学）

AI总结本文提出了一种自监督特征解耦与增强网络（UFDANet），通过解耦活体特征和领域特征，提升单类面部反伪装的泛化能力，实验表明其优于现有单类方法并可与双类方法媲美。

AI中文摘要

面部反伪装（FAS）技术旨在通过区分真实活体面部与欺骗性尝试来增强面部身份认证的安全性。双类FAS方法为追求更好的性能，存在对训练攻击过拟合的风险；单类FAS方法能较好地处理未见过的攻击，但对混杂在活体特征中的领域信息不够鲁棒。为此，我们提出了一种无监督特征解耦与增强网络（UFDANet），这是一种单类FAS技术，通过解耦特征增强面部图像以提升泛化能力。UFDANet采用新颖的无监督特征解耦方法分离活体特征和领域特征，促进判别性特征学习。它整合了分布外活体特征增强方案，合成偏离活体类别的、未见欺骗类别的新活体特征，从而增强活体特征的表示能力和判别能力。此外，UFDANet还整合了领域特征增强流程以合成未见过的领域特征，从而实现更好的泛化能力。大量实验表明，所提出的UFDANet优于以往的单类FAS方法，并取得了与最先进的双类FAS方法相当的性能。

英文摘要

Face anti-spoofing (FAS) techniques aim to enhance the security of facial identity authentication by distinguishing authentic live faces from deceptive attempts. While two-class FAS methods risk overfitting to training attacks to achieve better performance, one-class FAS approaches handle unseen attacks well but are less robust to domain information entangled within the liveness features. To address this, we propose an Unsupervised Feature Disentanglement and Augmentation Network (UFDANet), a one-class FAS technique that enhances generalizability by augmenting face images via disentangled features. UFDANet employs a novel unsupervised feature disentangling method to separate the liveness and domain features, facilitating discriminative feature learning. It integrates an out-of-distribution liveness feature augmentation scheme to synthesize new liveness features of unseen spoof classes, which deviate from the live class, thus enhancing the representability and discriminability of liveness features. Additionally, UFDANet incorporates a domain feature augmentation routine to synthesize unseen domain features, thereby achieving better generalizability. Extensive experiments demonstrate that the proposed UFDANet outperforms previous one-class FAS methods and achieves comparable performance to state-of-the-art two-class FAS methods.

2507.15736 2026-06-05 cs.CL

IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research

IDRBench: 理解大型语言模型在跨学科研究中的能力

Yuanhao Shen, Daniel Xavier de Sousa, Ricardo Marçal, Hongyu Guo, Xiaodan Zhu

发表机构 * GitHub

AI总结本文研究了大型语言模型在跨学科研究中的能力，提出IDRBench框架，通过三个任务评估不同模型的跨学科知识整合能力，并为未来研究建立基准。

AI中文摘要

创新是推动人类文明的重要驱动力。随着知识体系的不断扩展，跨学科领域中创新的产生变得愈发具有挑战性。最近机器学习模型，特别是大型语言模型（LLMs）的进步，为访问广泛的知识源提供了有效途径，并在推理方面展现出显著的能力，为跨学科发现提供了重要机会。我们的研究旨在理解最先进的LLMs在整合不同领域知识以进行跨学科研究（IDR）方面的能力。为了解决这一根本问题，我们引入了IDRBench，一个开创性的框架，包括数据集和评估任务：（1）跨学科论文识别，（2）跨学科思想整合，（3）跨学科思想推荐。我们对十种主流LLMs的研究提供了对其行为的全面分析，并为未来研究建立了基准和基线。据我们所知，IDRBench是首个全面调查LLMs跨学科能力的框架。

英文摘要

Innovation is a key driving force of human civilization. As the body of knowledge has grown considerably, bridging knowledge across different disciplines, where significant innovation often emerges, has become increasingly challenging. The recent advancements in machine learning models, particularly Large Language Models (LLMs), have provided effective access to extensive knowledge sources and shown impressive abilities in reasoning, rendering significant opportunities for interdisciplinary discovery. Our research aims to understand the capabilities of state-of-the-art LLMs in integrating knowledge from different fields for interdisciplinary research (IDR). To address this fundamental problem, we introduce IDRBench, a pioneering framework that includes both datasets and evaluation tasks: (1) IDR Paper Identification, (2) IDR Idea Integration, and (3) IDR Idea Recommendation. Our study on ten mainstream LLMs provides a comprehensive analysis of their behavior and establishes benchmarks and baselines for future research. To the best of our knowledge, IDRBench is the first to provide a comprehensive investigation of LLMs' IDR capability.

2507.12336 2026-06-05 cs.CV

Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors

基于多视图扩散先验的无监督单目3D关键点发现

Subin Jeon, In Cho, Junyoung Hong, Woong Oh Cho, Seon Joo Kim

发表机构 * Yonsei University（延世大学）

AI总结本文提出KeyDiff3D框架，通过单张图像准确预测3D关键点，利用预训练的多视图扩散模型中的几何先验，将隐式3D先验转化为显式3D特征体，实现关键点估计和3D对象操控。

Comments Accepted at CVPR 2026. Project page: https://subin6.github.io/keydiff3d-project/

AI中文摘要

大多数现有的3D关键点估计方法依赖于手动标注或经过标定的多视角图像，二者的采集成本都很高。本文引入KeyDiff3D框架，该框架能够从单张图像准确预测3D关键点，从而无需进行此类昂贵的数据采集。为此，我们利用预训练的多视角扩散模型中嵌入的强大几何先验。在我们的框架中，扩散模型从单张图像生成多视角图像，作为监督信号，为模型提供3D几何线索。我们还引入了3D特征提取器，将扩散特征中隐含的3D先验转换为显式的3D特征体。除了准确的关键点估计外，我们还引入了一条流程，用于操控由扩散模型生成的3D对象。在Human3.6M、CUB-200-2011、Stanford Dogs以及若干真实场景和域外输入等多样化数据上的实验结果表明，我们的方法在准确性、泛化能力以及对由单张图像经扩散模型生成的3D对象进行操控方面均十分有效。

英文摘要

Most existing 3D keypoint estimation methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect. This paper introduces KeyDiff3D, a framework that can accurately predict 3D keypoints from a single image, thus eliminating the need for such expensive data acquisitions. To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, the diffusion model generates multi-view images from a single image, serving as supervision signals to provide 3D geometric cues to our model. We also introduce a 3D feature extractor that transforms implicit 3D priors embedded in the diffusion features into explicit 3D feature volumes. Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results on diverse datasets, including Human3.6M, CUB-200-2011, Stanford Dogs, and several in-the-wild and out-of-domain inputs, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.

2507.06219 2026-06-05 cs.RO cs.AI cs.LG

Is Diversity All You Need for Scalable Robotic Manipulation?

多样性是否是可扩展机器人操作的全部需求？

Modi Shi, Li Chen, Jin Chen, Yuxiang Lu, Chiming Liu, Guanghui Ren, Ping Luo, Di Huang, Maoqing Yao, Hongyang Li

发表机构 * University of Science and Technology of China（中国科学技术大学）； Tsinghua University（清华大学）； University of California, Berkeley（加州大学伯克利分校）

AI总结本文研究了数据多样性在机器人学习中的作用，发现任务多样性比单任务演示量更重要，多身体预训练数据在跨身体转移中可选，专家多样性可能对策略学习产生干扰，提出分布去偏方法提升性能。

Comments Code is available at https://github.com/OpenDriveLab/AgiBot-World

AI中文摘要

数据扩展在自然语言处理和计算机视觉的基础模型中取得了显著成功，但机器人操作中有效数据扩展的原则仍不够清楚。本文通过研究机器人学习中数据多样性的细微作用，探讨了三个关键维度：任务（做什么）、身体（使用哪种机器人）和专家（谁演示）。通过在各种机器人平台上进行广泛实验，我们发现：（1）任务多样性比单任务演示数量更重要，有助于从多样预训练任务转移到新下游场景；（2）多身体预训练数据在跨身体转移中是可选的，高质量单身体预训练模型可以高效地转移到不同平台，在微调过程中表现出比多身体预训练模型更优的扩展特性；（3）专家多样性源于个体操作偏好和人类演示中的随机变化，可能对策略学习产生干扰，速度多模态成为关键贡献因素。基于这一洞察，我们提出了一种分布去偏方法以缓解速度模糊性，所提出的GO-1-Pro方法实现了15%的性能提升，相当于使用2.5倍的预训练数据。这些发现提供了新的视角，并为如何有效扩展机器人操作数据集提供了实用指导。

英文摘要

Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions: task (what to do), embodiment (which robot to use), and expert (who demonstrates), challenging the conventional intuition of "more diverse is better". Through extensive experiments on various robot platforms, we reveal that (1) task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer: models trained on high-quality single-embodiment data can efficiently transfer to different platforms, showing a more desirable scaling property during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity; the resulting GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times the pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.

2506.22078 2026-06-05 cs.CV

Towards Accurate Heart Rate Measurement from Ultra-Short Video Clips via Periodicity-Guided rPPG Estimation and Signal Reconstruction

通过周期性引导的rPPG估计与信号重建实现从超短视频片段中准确的心率测量

Pei-Kai Huang, Ya-Ting Chan, Kuan-Wen Chen, Chiou-Ting Hsu, Xiaoding Wang, Md. Jalil Piran

发表机构 * National Tsing Hua University（国立清华大学）； Fujian Normal University（福建师范大学）； Sungkyunkwan University（成均馆大学）

AI总结本文针对超短视频片段中心率测量问题，提出周期性引导的rPPG估计方法和信号重建技术，以提高从超短视频中准确测量心率的能力，并在多个基准数据集上验证了方法的有效性。

AI中文摘要

许多远程心率（HR）测量方法专注于从持续约10秒的视频片段中估计远程光电容积脉搏波描记（rPPG）信号，但常常忽略了从超短视频片段中估计心率的需求。在本文中，我们针对两个关键挑战，力求从仅2秒的超短视频片段中准确测量心率。首先，为了克服超短视频片段中心跳周期数量有限的问题，我们提出了一种有效的周期性引导的rPPG估计方法，强制要求从超短片段估计出的rPPG信号与其长得多的真实信号在周期性上保持一致。其次，为了减轻频谱泄漏导致的估计误差，我们提出引入一个生成器，从超短rPPG信号重建更长的rPPG信号，同时保持其周期一致性，以实现更准确的心率测量。在四个rPPG估计基准数据集上的大量实验表明，所提出的方法不仅能够从超短视频片段中准确测量心率，而且优于以往的rPPG估计技术，达到了最先进的性能。

英文摘要

Many remote Heart Rate (HR) measurement methods focus on estimating remote photoplethysmography (rPPG) signals from video clips lasting around 10 seconds but often overlook the need for HR estimation from ultra-short video clips. In this paper, we aim to accurately measure HR from ultra-short 2-second video clips by specifically addressing two key challenges. First, to overcome the limited number of heartbeat cycles in ultra-short video clips, we propose an effective periodicity-guided rPPG estimation method that enforces consistent periodicity between rPPG signals estimated from ultra-short clips and their much longer ground truth signals. Next, to mitigate estimation inaccuracies due to spectral leakage, we propose including a generator to reconstruct longer rPPG signals from ultra-short ones while preserving their periodic consistency to enable more accurate HR measurement. Extensive experiments on four rPPG estimation benchmark datasets demonstrate that our proposed method not only accurately measures HR from ultra-short video clips but also outperforms previous rPPG estimation techniques to achieve state-of-the-art performance.
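A small calculation shows why 2-second clips are hard. A discrete Fourier transform of a clip of duration T seconds has frequency bins spaced 1/T Hz apart, i.e. 60/T beats per minute. The sketch below assumes a 30 fps camera and a clean 72 bpm sinusoid standing in for an rPPG signal; both are illustrative assumptions, and the function names are hypothetical rather than part of the paper's method.

```python
import cmath
import math


def hr_resolution_bpm(clip_seconds):
    """Spacing between adjacent DFT bins, expressed in beats per minute."""
    return 60.0 / clip_seconds


def dft_peak_bpm(signal, fps):
    """Heart-rate estimate taken from the strongest non-DC DFT bin (naive O(n^2) DFT)."""
    n = len(signal)
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(signal))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return 60.0 * best_k * fps / n


def synthetic_pulse(bpm, seconds, fps):
    """Idealised pulse wave: a pure sinusoid at the given heart rate."""
    return [math.sin(2 * math.pi * (bpm / 60.0) * i / fps) for i in range(int(seconds * fps))]


FPS = 30  # assumed camera frame rate
short_estimate = dft_peak_bpm(synthetic_pulse(72, 2, FPS), FPS)
long_estimate = dft_peak_bpm(synthetic_pulse(72, 10, FPS), FPS)
print(hr_resolution_bpm(2), hr_resolution_bpm(10))
print(short_estimate, long_estimate)
```

With a 2-second window the bins sit 30 bpm apart, so the 72 bpm pulse leaks into neighbouring bins and the peak lands on the 60 bpm bin. The 10-second window resolves 6 bpm and recovers 72 bpm. This is the gap that reconstructing a longer signal from the short one is meant to close.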

2501.14291 2026-06-05 cs.LG stat.ML

Advances in Temporal Point Processes: Bayesian, Neural, and LLM Approaches

时间点过程的进展：贝叶斯、神经网络和大语言模型方法

Feng Zhou, Quyu Kong, Jie Qiao, Cheng Wan, Yixuan Zhang, Ruichu Cai

发表机构 * Center for Applied Statistics and School of Statistics, Renmin University of China（中国人民大学应用统计科学研究中心与统计学院）； Independent Researcher（独立研究者）； School of Computer Science, Guangdong University of Technology（广东工业大学计算机学院）； School of Statistics and Data Science, Southeast University（东南大学统计与数据科学学院）

AI总结本文综述了时间点过程的最新研究，从贝叶斯、深度学习和大语言模型三个角度探讨了模型设计、参数估计以及经典应用领域，并展望了未来的研究挑战和方向。

AI中文摘要

时间点过程（TPPs）是用于表征连续时间中事件序列的随机过程模型。传统统计TPPs已有长久的历史，众多模型已被提出并在不同领域中成功应用。近年来，深度学习的进步推动了神经TPPs的发展，使捕捉复杂时间动态变得更加灵活和表达性更强。大语言模型（LLMs）的出现进一步引发了关注，通过利用其丰富的上下文理解能力，为事件序列建模和分析提供了新的可能性。本文从贝叶斯、深度学习和LLM三个视角全面回顾了最近关于TPPs的研究。我们首先回顾了TPPs的基本概念，随后深入讨论了这三种框架中的模型设计和参数估计技术。我们还回顾了TPPs的经典应用领域，以突出其实际相关性。最后，我们概述了TPPs面临的挑战和未来研究的有前景方向。

英文摘要

Temporal point processes (TPPs) are stochastic process models used to characterize event sequences occurring in continuous time. Traditional statistical TPPs have a long-standing history, with numerous models proposed and successfully applied across diverse domains. In recent years, advances in deep learning have spurred the development of neural TPPs, enabling greater flexibility and expressiveness in capturing complex temporal dynamics. The emergence of large language models (LLMs) has further sparked excitement, offering new possibilities for modeling and analyzing event sequences by leveraging their rich contextual understanding. This survey presents a comprehensive review of recent research on TPPs from three perspectives: Bayesian, deep learning, and LLM approaches. We begin with a review of the fundamental concepts of TPPs, followed by an in-depth discussion of model design and parameter estimation techniques in these three frameworks. We also revisit classic application areas of TPPs to highlight their practical relevance. Finally, we outline challenges and promising directions for future research.
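As a concrete instance of a traditional statistical TPP, the sketch below evaluates the conditional intensity of a Hawkes process with an exponential kernel, $\lambda(t) = \mu + \sum_{t_i < t} \alpha e^{-\beta (t - t_i)}$. The abstract does not name a specific model, so the choice of the Hawkes process and the parameter values are illustrative assumptions.

```python
import math


def hawkes_intensity(t, history, mu=0.5, alpha=0.8, beta=1.0):
    """Conditional intensity of an exponential-kernel Hawkes process at time t.

    mu    : baseline event rate
    alpha : jump in intensity caused by each past event
    beta  : exponential decay rate of that excitation
    Parameter values are hypothetical defaults chosen for illustration.
    """
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in history if ti < t)


events = [1.0, 1.5, 4.0]  # hypothetical event times
for t in (0.5, 1.6, 3.0, 4.1):
    print(t, round(hawkes_intensity(t, events), 4))
```

The intensity sits at the baseline before any event, spikes just after a cluster of events, and decays back toward the baseline in between. That self-exciting behaviour is the part neural TPPs replace with a learned function of the event history.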

2506.20263 2026-06-05 cs.CV

Hierarchical Mask-Enhanced Dual Reconstruction Network for Few-Shot Fine-Grained Image Classification

层次化掩码增强双重建网络用于少样本细粒度图像分类

Ning Luo, Meiyin Hu, Huan Wan, Yanyan Yang, Zhuohang Jiang, Xin Wei

发表机构 * Nanjing University（南京大学）

AI总结本文提出层次化掩码增强双重建网络（HMDRN），通过双层特征重建与掩码增强特征处理，解决少样本细粒度图像分类中区分视觉相似子类的问题，实验显示其在三种细粒度数据集上均优于现有方法。

AI中文摘要

少样本细粒度图像分类（FS-FGIC）具有挑战性，因为它需要在标注样本极少的情况下区分视觉上相似的子类。现有方法存在关键局限：基于度量的方法会丢失空间信息并造成局部特征错位，而基于重建的方法未能充分利用层次化特征信息，且缺乏对判别性关键区域的选择性关注。我们提出层次化掩码增强双重建网络（HMDRN），将双层特征重建与掩码增强特征处理相结合。HMDRN通过可学习权重利用来自不同网络层级的互补视觉信息，在高层语义表示与中层结构细节之间取得平衡。它包含一个空间二值掩码增强的Transformer模块，有选择地增强判别性区域并过滤背景噪声。在三个细粒度数据集上，HMDRN在Conv-4和ResNet-12两种骨干网络上均持续优于现有最先进方法。消融研究验证了各组件的有效性，表明双层重建增强了类间判别能力，而掩码增强变换减少了类内差异。

英文摘要

Few-shot fine-grained image classification (FS-FGIC) is challenging as it requires distinguishing visually similar subclasses with extremely limited labeled examples. Existing methods suffer from critical limitations: metric-based methods lose spatial information and misalign local features, while reconstruction-based methods underuse hierarchical feature information and lack selective focus on discriminative key regions. We propose the Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN), integrating dual-layer feature reconstruction with mask-enhanced feature processing. HMDRN leverages complementary visual information from different network hierarchies via learnable weights, balancing high-level semantic representations with mid-level structural details. It incorporates a spatial binary mask-enhanced transformer module that selectively enhances discriminative regions while filtering background noise. On three fine-grained datasets, HMDRN consistently outperforms state-of-the-art methods with both Conv-4 and ResNet-12 backbones. Ablation studies validate each component's effectiveness, showing dual-layer reconstruction enhances inter-class discrimination while mask-enhanced transformation reduces intra-class variations.

2506.10145 2026-06-05 cs.CV

RoCA: Robust Cross-Domain End-to-End Autonomous Driving

RoCA: 面向鲁棒跨域端到端自动驾驶的框架

Rajeev Yasarla, Shizhong Han, Hsin-Pai Cheng, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Yunxiao Shi, Risheek Garrepalli, Hong Cai, Fatih Porikli

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Texas at Austin（德克萨斯大学奥斯汀分校）； University of California, San Diego（加州大学圣地亚哥分校）； University of California, Los Angeles（加州大学洛杉矶分校）； University of California, Davis（加州大学戴维斯分校）

AI总结本文提出RoCA框架，通过对端到端自动驾驶流程中自车和周围车辆信息的联合概率分布建模，提升跨域自动驾驶的泛化能力和鲁棒性，且无需额外推理计算。

Comments accepted for ICML 2026

AI中文摘要

端到端（E2E）自动驾驶最近作为一种新范式出现，具有显著潜力。然而，很少有研究探讨跨域（例如跨城市）部署这一实际挑战。尽管一些工作引入大型语言模型（LLMs）以利用其开放世界知识，但LLMs无法保证跨域驾驶性能，且在域适应过程中可能带来过高的再训练成本。本文提出RoCA，一种用于鲁棒跨域端到端自动驾驶的新颖框架。RoCA对E2E流程中编码自车和周围车辆信息的token的联合概率分布进行建模。通过高斯过程（GP）实例化，RoCA学习一组带有相应轨迹的基底token，这些token覆盖多样化的驾驶场景。然后，给定任意驾驶场景，它能够以概率方式推断未来轨迹。在源域训练中将RoCA与基础E2E模型结合使用，可以提升基础模型的泛化能力，且无需额外的推理计算。此外，RoCA能够在新的目标域上实现鲁棒适应，显著优于直接微调。我们在各种跨域场景中对RoCA进行了广泛评估，结果表明它具有很强的域泛化和域适应性能。

英文摘要

End-to-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohibitive retraining costs during domain adaptation. In this paper, we propose RoCA, a novel framework for robust cross-domain E2E autonomous driving. RoCA formulates the joint probabilistic distribution over the tokens that encode ego and surrounding vehicle information in the E2E pipeline. Instantiating with a Gaussian process (GP), RoCA learns a set of basis tokens with corresponding trajectories, which span diverse driving scenarios. Then, given any driving scene, it is able to probabilistically infer the future trajectory. By using RoCA together with a base E2E model in source-domain training, we improve the generalizability of the base model, without requiring extra inference computation. In addition, RoCA enables robust adaptation on new target domains, significantly outperforming direct finetuning. We extensively evaluate RoCA on various cross-domain scenarios and show that it achieves strong domain generalization and adaptation performance.

2506.11973 2026-06-05 cs.LG

Self-Regulating Cars: Automating Traffic Control in Free Flow Road Networks

自调节汽车：自动化自由流道路网络的交通控制

Ankit Bhardwaj, Rohail Asim, Sachin Chauhan, Yasir Zaki, Lakshminarayanan Subramanian

发表机构 * Department of Computer Science（计算机科学系）； New York University（纽约大学）； Indian Institute of Technology Delhi（德里印度理工学院）

AI总结本文提出了一种基于强化学习的自调节汽车方法，通过动态调节车辆速度来优化通行能力和防止拥堵，无需新基础设施，结合经典交通流理论和微观模拟，在高保真度的PTV Vissim模拟器上实现了提高通行能力、减少延误和停车次数的改进。

DOI: 10.1609/aaai.v40i45.41163

AI中文摘要

自由流道路网络，如郊区高速公路，由于通勤流量增加和基础设施有限，正越来越多地经历交通拥堵。传统控制机制，如交通信号或局部启发式方法，在这些高速、无信号的环境中效果不佳或不可行。我们引入了自调节汽车，一种基于强化学习的交通控制协议，通过动态调节车辆速度来优化通行能力和防止拥堵，而无需新的物理基础设施。我们的方法将经典交通流理论、间隙接受模型和微观模拟整合到一个物理指导的强化学习框架中。通过将道路抽象为超段，智能体捕捉到涌现的流量动态，并从即时交通观测中学习稳健的速度调节策略。在高保真度的PTV Vissim模拟器上，我们的方法在真实世界高速公路网络中实现了比无控制设置提高5%的总通行能力，减少13%的平均延误，以及减少3%的总停车次数。它还实现了更平滑、抗拥堵的流量，同时在各种交通模式中泛化，展示了其在可扩展的ML驱动交通管理中的潜力。

英文摘要

Free-flow road networks, such as suburban highways, are increasingly experiencing traffic congestion due to growing commuter inflow and limited infrastructure. Traditional control mechanisms, such as traffic signals or local heuristics, are ineffective or infeasible in these high-speed, signal-free environments. We introduce self-regulating cars, a reinforcement learning-based traffic control protocol that dynamically modulates vehicle speeds to optimize throughput and prevent congestion, without requiring new physical infrastructure. Our approach integrates classical traffic flow theory, gap acceptance models, and microscopic simulation into a physics-informed RL framework. By abstracting roads into super-segments, the agent captures emergent flow dynamics and learns robust speed modulation policies from instantaneous traffic observations. Evaluated in the high-fidelity PTV Vissim simulator on a real-world highway network, our method improves total throughput by 5%, reduces average delay by 13%, and decreases total stops by 3% compared to the no-control setting. It also achieves smoother, congestion-resistant flow while generalizing across varied traffic patterns, demonstrating its potential for scalable, ML-driven traffic management.
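To see why modulating speed can raise throughput, consider a textbook flow-density relation. The abstract only says "classical traffic flow theory", so the Greenshields model used below, along with the free-flow speed and jam density values, is an illustrative assumption and not necessarily what the paper uses.

```python
def greenshields_flow(density, free_flow_speed=100.0, jam_density=120.0):
    """Flow (veh/h) under the Greenshields model: speed falls linearly with density.

    density         : vehicles per km
    free_flow_speed : km/h on an empty road (assumed value)
    jam_density     : vehicles per km at standstill (assumed value)
    """
    speed = free_flow_speed * (1.0 - density / jam_density)
    return density * speed


for k in (0.0, 30.0, 60.0, 90.0, 120.0):
    print(k, greenshields_flow(k))
```

Flow peaks at half the jam density and falls on either side, so once a segment passes its critical density, admitting more vehicles lowers throughput. A controller that slows upstream traffic to hold density near the peak is one way to read the speed-modulation idea.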

2506.11042 2026-06-05 cs.LG

GenFT: A Generative Parameter-Efficient Fine-Tuning Method for Pretrained Foundation Models

GenFT：一种用于预训练基础模型的生成性参数高效微调方法

Guangning Xu, Baoquan Zhang, Michael. K. Ng

发表机构 * Department of Mathematics, Hong Kong Baptist University, Hong Kong, China（香港浸会大学数学系，中国香港）； Department of Computer Science, Harbin Institute of Technology, Shenzhen, China（哈尔滨工业大学（深圳）计算机科学系，中国）

AI总结本文提出GenFT，一种基于预训练权重的参数高效微调方法，通过生成任务特定的更新来利用预训练权重中的结构信息，实现高效的模型微调。

Comments paper is accepted at ICANN 2026

AI中文摘要

参数高效微调（PEFT）已作为一种资源高效的策略，通过学习少量任务特定的更新ΔW来适应预训练基础模型（PFMs）。现有方法往往在很大程度上独立于预训练权重W₀，或主要通过初始化或简单的重参数化来利用W₀。为了进一步利用W₀中编码的结构信息，我们提出生成性参数高效微调（GenFT），一种基于W₀的PEFT方法，使用确定性权重生成器生成任务特定的更新。具体而言，GenFT通过行和列变换与非线性激活来从W₀中提取结构化模式，并引入共享-特定分解以平衡跨层信息重用和层特定的灵活性。GenFT简单且参数高效，在NLP和CV基准上实现了竞争性或更优的平均性能。我们进一步在LLaMA-7B上进行试点研究，以检验其在生成模型中的可行性。代码可在GitHub https://github.com/xuguangning1218/GenFT 上获得。

英文摘要

Parameter-efficient fine-tuning (PEFT) has emerged as a resource-efficient strategy for adapting Pretrained Foundation Models (PFMs) by learning a small number of task-specific updates $\Delta W$. Existing methods often learn $\Delta W$ largely independently of pretrained weights $W_0$, or exploit $W_0$ mainly through initialization or simple reparameterization. To further leverage the structural information encoded in $W_0$, we propose Generative Parameter-Efficient Fine-Tuning (GenFT), a $W_0$-conditioned PEFT method that uses a deterministic weight generator to produce task-specific updates. Specifically, GenFT performs row and column transformations with nonlinear activations to extract structured patterns from $W_0$, and introduces a shared-specific decomposition to balance cross-layer information reuse and layer-specific flexibility. GenFT is simple and parameter-efficient, achieving competitive or better average performance across NLP and CV benchmarks. We further provide a pilot study on LLaMA-7B to examine its feasibility for generative models. The code is available at GitHub https://github.com/xuguangning1218/GenFT.

2506.10601 2026-06-05 cs.CV

Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection

语义解耦的空间分区引导的点监督定向物体检测

Xinyuan Liu, Hang Xu, Zirui Chen, Yike Ma, Chenggang Yan, Feng Dai

发表机构 * Institute of Computing Technology, Chinese Academy of Sciences（中国科学院计算技术研究所）； Hefei University of Technology（合肥工业大学）

AI总结本文提出了一种训练高效的框架SSP，通过规则驱动的先验注入和数据驱动的标签净化，解决了单点监督下样本分配不足和伪标签质量差的问题，实验表明SSP在DOTA-v1.0和其他数据集上取得了显著的mAP提升，且训练时间和内存占用较低。

Comments Published in Pattern Recognition, 2026

DOI: 10.1016/j.patcog.2026.114079
Journal ref: Pattern Recognition, Volume 180, Part B, Article 114079 (2026)

AI中文摘要

由于能够降低标注成本，基于单点标注的弱监督学习已成为定向物体检测的研究热点。与经典的教师-学生范式相比，简单模型范式（如PointOBB-v2）可以在保证强大性能的同时进一步大幅减少训练所需的资源。后者在低成本训练方面具有更大潜力，但此类方法仍面临样本分配不足和伪标签质量差的挑战。在本文中，我们提出了一种训练高效的框架SSP，将规则驱动的先验注入与数据驱动的标签净化相结合。具体而言，SSP引入了两项设计：（1）基于像素级空间分区的样本分配，通过对像素图进行空间分区，紧凑地估计物体尺度的上下界，并挖掘高质量的正样本和困难负样本；（2）基于语义空间分区的框提取，从经语义图调制的空间分区中导出实例，并将其转换为伪框以监督检测器。在DOTA-v1.0和其他数据集上的实验证明了SSP的优越性：与基线相比，SSP实现了+6.73%的mAP提升，同时仅需2小时的训练时间和6GB的GPU内存。此外，当SSP与更强的检测器结合时，mAP可以达到50.81%。代码可在https://github.com/antxinyuan/ssp获取。

英文摘要

Given its ability to reduce annotation costs, weakly supervised learning based on single-point annotations has emerged as a research focus in oriented object detection. Compared with the classical teacher-student paradigm, the simple model paradigm (e.g., PointOBB-v2) can further and substantially reduce the resources required for training while ensuring strong performance. The latter exhibits greater potential for low-cost training, yet such methods still face challenges of insufficient sample assignment and poor pseudo-label quality. In this paper, we propose a training-efficient framework named SSP, which synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two designs: (1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps; and (2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and converts them into pseudo-boxes for supervising detectors. Experiments on DOTA-v1.0 and other datasets demonstrate SSP's superiority: it achieves a +6.73% mAP improvement compared with the baseline, while requiring only 2 h of training time and 6 GB of GPU memory. Furthermore, when SSP is integrated with a stronger detector, the mAP can reach 50.81%. The code is available at https://github.com/antxinyuan/ssp.

2506.00188 2026-06-05 cs.LG stat.ML

Cluster-Aware Causal Mixer for Online Anomaly Detection in Multivariate Time Series

基于聚类的因果混合器用于多变量时间序列的在线异常检测

Md Mahmuddun Nabi Murad, Yasin Yilmaz

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结本文提出了一种基于聚类的因果混合器，用于多变量时间序列的在线异常检测，通过聚类处理通道间的相关性，结合因果混合器保持时间因果性，并开发了序列异常评分方法以提高检测准确性。

AI中文摘要

在时间序列数据中早期和准确地检测异常至关重要，因为假阳性和漏检带来的风险很大。虽然基于MLP的混合模型在时间序列分析中显示出潜力，但它们在数据处理过程中不维护时间因果性。此外，现实中的多变量时间序列通常包含众多通道，具有多样的通道间相关性。重构时间序列中的虚假相关性导致表示噪声，从而导致检测不准确。此外，忽略时间连续性的异常评分方法可能会误导连续检测。为了解决这些挑战，我们提出了一种多变量时间序列异常检测的基于聚类的因果混合器。根据相关性将通道分组为集群，并通过专用嵌入层对每个集群进行嵌入。引入因果混合器以在保持时间因果性的同时整合信息。我们进一步开发了一种序列异常评分方法，该方法在时间上累积证据并细化异常边界。我们提出的模型以在线方式运行，使其适合实时时间序列异常检测。在六个公开基准数据集上的实验评估表明，所提出的方法在性能上始终优于其他方法。

英文摘要

Early and accurate detection of anomalies in time-series data is critical due to the substantial risks associated with false or missed detections. While MLP-based mixer models have shown promise in time-series analysis, they do not maintain temporal causality during data processing. Moreover, real-world multivariate time series often contain numerous channels with diverse inter-channel correlations. Spurious correlations in the reconstructed time series lead to noisy representations, resulting in inaccurate anomaly detection. In addition, anomaly scoring methods that ignore temporal continuity can mislead sequential detection. To address these challenges, we propose a cluster-aware causal mixer for multivariate time-series anomaly detection. Channels are grouped into clusters based on their correlations, and each cluster is embedded through a dedicated embedding layer. A causal mixer is introduced to integrate information while maintaining temporal causality. We further develop a sequential anomaly-scoring method that accumulates evidence over time and refines anomaly boundaries. Our proposed model operates in an online fashion, making it suitable for real-time time-series anomaly detection. Experimental evaluations across six public benchmark datasets demonstrate that the proposed approach consistently achieves superior performance.
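Temporal causality here means that the output at time step t may depend only on inputs at steps up to t. The sketch below illustrates that property with a mixing step restricted to the lower triangle of a weight matrix. It is a minimal illustration under that assumption, with hypothetical names and weights, and not the paper's actual mixer architecture.

```python
def causal_mix(series, weights):
    """Mix a 1-D series over time while ignoring future steps.

    out[t] = sum over s <= t of weights[t][s] * series[s];
    entries of `weights` with s > t are never read.
    """
    n_steps = len(series)
    return [
        sum(weights[t][s] * series[s] for s in range(t + 1))
        for t in range(n_steps)
    ]


ones = [[1.0] * 3 for _ in range(3)]        # hypothetical mixing weights
original = causal_mix([1.0, 2.0, 3.0], ones)
perturbed = causal_mix([1.0, 2.0, 100.0], ones)  # only the last (future) value changes
print(original, perturbed)
```

Changing the final observation alters only the final output, so earlier outputs can be emitted as soon as their inputs arrive. An unrestricted MLP mixer over the time axis would not have this property, which is why it is awkward for online detection.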

2310.04649 2026-06-05 cs.LG

Uncovering Model Processing Strategies with Non-Negative Per-Example Fisher Factorization

通过非负每例费舍尔分解揭示模型处理策略

Michael Matena, Colin Raffel

发表机构 * University of North Carolina Chapel Hill（北卡罗来纳大学教堂山分校）； University of Toronto（多伦多大学）； Vector Institute（向量研究所）

AI总结本文提出NPEFF方法，通过分解每例费舍尔矩阵揭示模型生成预测所用的策略，展示了NPEFF组件在语言模型和文本处理任务中的应用，并展示了如何通过扰动这些组件来干扰模型处理，同时通过消融研究和实验验证了NPEFF在分析和缓解去学习的副作用以及研究上下文学习中的优势。

AI中文摘要

我们引入NPEFF（非负每例费舍尔分解），一种可解释性方法，旨在揭示模型生成预测所使用的策略。NPEFF使用一种新颖的分解算法分解每例费舍尔矩阵，该算法学习了一组由学习得到的秩-1半正定矩阵表示的组件。通过结合人类评估和自动化分析，我们证明这些NPEFF组件对应于各种语言模型和文本处理任务中的模型处理策略。我们进一步展示了如何从NPEFF组件构建参数扰动，以选择性地干扰给定组件在模型处理中的作用。除了进行广泛的消融研究外，我们还包括实验，展示了NPEFF如何用于分析和缓解去学习的副作用，并用NPEFF研究上下文学习。此外，我们展示了NPEFF相对于梯度聚类和使用稀疏自编码器进行字典学习等基线方法的优势。我们发布了本工作的代码。

英文摘要

We introduce NPEFF (Non-Negative Per-Example Fisher Factorization), an interpretability method that aims to uncover strategies used by a model to generate its predictions. NPEFF decomposes per-example Fisher matrices using a novel decomposition algorithm that learns a set of components represented by learned rank-1 positive semi-definite matrices. Through a combination of human evaluation and automated analysis, we demonstrate that these NPEFF components correspond to model processing strategies for a variety of language models and text processing tasks. We further show how to construct parameter perturbations from NPEFF components to selectively disrupt a given component's role in the model's processing. Along with conducting extensive ablation studies, we include experiments to show how NPEFF can be used to analyze and mitigate collateral effects of unlearning and use NPEFF to study in-context learning. Furthermore, we demonstrate the advantages of NPEFF over baselines such as gradient clustering and using sparse autoencoders for dictionary learning over model activations. We release the code used in this work.
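The decomposition described above can be written out as follows. This is a sketch reconstructed from the abstract alone: the symbols $W_{ij}$, $h_j$, and $k$ are notation introduced here, and the Frobenius-norm objective is an assumed choice of reconstruction loss, not a detail confirmed by the source.

```latex
% Per-example Fisher matrix for input x_i under model parameters \theta
F_i = \mathbb{E}_{y \sim p_\theta(y \mid x_i)}
      \left[ \nabla_\theta \log p_\theta(y \mid x_i)\,
             \nabla_\theta \log p_\theta(y \mid x_i)^{\top} \right]

% Approximation by a non-negative combination of k shared rank-1 PSD components
F_i \approx \sum_{j=1}^{k} W_{ij}\, h_j h_j^{\top}, \qquad W_{ij} \ge 0

% One possible fitting objective (assumed here): Frobenius reconstruction error
\min_{W \ge 0,\; h_1, \dots, h_k} \;
\sum_{i} \Big\| F_i - \sum_{j=1}^{k} W_{ij}\, h_j h_j^{\top} \Big\|_F^2
```

Each $h_j h_j^{\top}$ is positive semi-definite by construction, and the non-negative coefficient $W_{ij}$ can be read as how strongly component $j$ is active on example $i$.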


2505.02540 2026-06-05 cs.LG cs.AI

Lazy But Effective: Collaborative Personalized Federated Learning with Heterogeneous Data

Ljubomir Rokvic, Panayiotis Danassis, Boi Faltings

Affiliations: Artificial Intelligence Laboratory, EPFL; Telenor Research

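AI summary: This paper proposes pFedLIA, a simple and effective personalized federated learning framework. It uses a computationally efficient influence approximation, 'Lazy Influence', to cluster clients in a distributed manner before model aggregation, so that the clients in each cluster jointly train a model that captures their specific data patterns. Experiments show that it recovers the global model's performance drop on non-IID datasets and outperforms existing baselines on several benchmark tasks.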

Comments: Accepted at the International Joint Conference on Neural Networks (IJCNN), IEEE, 2025

DOI: 10.1109/IJCNN64981.2025.11228646

Abstract

In Federated Learning, heterogeneity in client data distributions often means that a single global model does not have the best performance for individual clients. Consider for example training a next-word prediction model for keyboards: user-specific language patterns due to demographics (dialect, age, etc.), language proficiency, and writing style result in a highly non-IID dataset across clients. Other examples are medical images taken with different machines, or driving data from different vehicle types. To address this, we propose a simple yet effective personalized federated learning framework (pFedLIA) that utilizes a computationally efficient influence approximation, called 'Lazy Influence', to cluster clients in a distributed manner before model aggregation. Within each cluster, data owners collaborate to jointly train a model that captures the specific data patterns of the clients. Our method has been shown to successfully recover the global model's performance drop due to the non-IID-ness in various synthetic and real-world settings, specifically a next-word prediction task on the Nordic languages as well as several benchmark tasks. It matches the performance of a hypothetical Oracle clustering, and significantly improves on existing baselines, e.g., an improvement of 17% on CIFAR100.
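
The "cluster first, then aggregate within each cluster" structure described above can be sketched as follows. The cluster labels are assumed to be given; the Lazy Influence computation that would produce them is not shown. The data-size-weighted averaging is a FedAvg-style assumption that the abstract does not confirm, and the function name `aggregate_per_cluster` is hypothetical.

```python
import numpy as np

def aggregate_per_cluster(client_params, client_sizes, cluster_ids):
    """Average client parameter vectors within each cluster, weighted by data size.

    client_params: (N, P) array, one flattened parameter vector per client.
    client_sizes:  (N,) number of local training examples per client.
    cluster_ids:   (N,) cluster label per client (assumed given here).
    Returns a dict mapping cluster label -> (P,) aggregated parameters.
    """
    client_params = np.asarray(client_params, dtype=float)
    client_sizes = np.asarray(client_sizes, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    models = {}
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        models[int(c)] = np.average(
            client_params[members], axis=0, weights=client_sizes[members]
        )
    return models

# Three clients: the first two share a cluster, the third is on its own.
params = [[1.0, 1.0], [3.0, 3.0], [10.0, 0.0]]
sizes = [1, 3, 5]
clusters = [0, 0, 1]
models = aggregate_per_cluster(params, sizes, clusters)
```

Each cluster ends up with its own model, so a client whose data differs sharply from the rest (the third client here) is not pulled toward an average that fits nobody well.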
