arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 2367
2605.29847 2026-05-29 cs.CL

EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation

Xin Guan, Xiaomeng Hu, Shen Huang, Zhenyi Wang, Bo Zhang, Zijian Li, Pengjun Xie, Bo Liu, Jiuxin Cao

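Affiliations: Tongyi Lab, Alibaba Group

AI summary: Proposes EvoRubric, a single-policy co-evolutionary reinforcement learning framework that dynamically alternates between generating responses and generating rubrics and adds a multi-level verification mechanism to address the lack of definitive rewards in open-ended generation; it outperforms traditional static and external-LLM-driven methods in the medical, writing, and science domains.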

Abstract

Reinforcement Learning (RL) has significantly advanced Large Language Models (LLMs) in verifiable domains, but aligning models for open-ended generation remains profoundly challenging due to the lack of definitive rewards. Current rubric-based RL methods mitigate this by employing explicit criteria; however, they rely heavily on static, human-annotated rubrics that inevitably cause policy lag, or expensive external proprietary models for dynamic updates. In this paper, we propose EvoRubric, a novel single-policy co-evolutionary RL framework that eliminates the reliance on static criteria and on external rubric generators. By unifying response generation and rubric generation under a single parameterized policy, EvoRubric dynamically alternates between a Reasoner and a Rubric Generator. To prevent reward hacking and ensure the reliability of generated signals, we introduce a multi-level verification pipeline featuring a meta-verifier, zero-variance pruning, and a Leave-One-Out peer consensus mechanism. Validated criteria are dynamically archived into a memory pool, yielding dense, multi-objective rewards to continuously co-optimize both roles. Extensive experiments across Medical, Writing, and Science domains demonstrate that EvoRubric consistently outperforms traditional static and external-LLM-driven alignment methods. Notably, our framework is compatible with human-expert priors. When initialized with expert-annotated rubrics, EvoRubric can further uncover novel, discriminative dimensions, achieving better performance than relying solely on static expert annotations.
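As a rough illustration of one verification step named in the abstract, zero-variance pruning, the Python sketch below drops any rubric criterion whose scores are identical across a group of sampled responses, since such a criterion cannot separate better responses from worse ones. The abstract does not describe the procedure in detail; the function name, data layout, and criterion names are hypothetical.

```python
from statistics import pvariance

def prune_zero_variance(scores_by_criterion):
    """Keep only criteria whose scores vary across the sampled responses.

    scores_by_criterion maps a criterion name to one score per response
    (hypothetical layout, not taken from the paper).
    """
    return {
        name: scores
        for name, scores in scores_by_criterion.items()
        if pvariance(scores) > 0
    }

# Hypothetical scores for four sampled responses to one prompt.
scores = {
    "cites_evidence": [1, 0, 1, 0],   # discriminative: kept
    "is_in_english": [1, 1, 1, 1],    # every response passes: pruned
    "mentions_dosage": [0, 0, 0, 0],  # every response fails: pruned
}
kept = prune_zero_variance(scores)
```

Only `cites_evidence` survives here, because it is the only criterion that distinguishes between the sampled responses.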

2605.29843 2026-05-29 cs.LG cs.AI

HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization

Artur Zagitov, Gleb Molodtsov, Aleksandr Beznosikov

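Affiliations: BRAIn Lab

AI summary: Proposes HARP, a learnable structured two-sided orthogonal processor that replaces the fixed randomized Hadamard transform and adapts the rotation basis to mitigate activation outliers and anisotropic weight curvature in extreme low-bit quantization; it improves perplexity and zero-shot accuracy in 2-4 bit settings while preserving deployment efficiency.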

Abstract

Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantization remains highly sensitive to activation outliers and anisotropic weight curvature. Existing incoherence-based PTQ methods mitigate this issue with fixed randomized Hadamard transforms (RHTs), which improve quantization robustness but cannot adapt the rotated basis to the layer, calibration distribution, or quantizer. We introduce HARP (Hadamard-preconditioned Adaptive Rotation Processor), a learnable structured two-sided orthogonal processor that replaces fixed Hadamard mixing while preserving exact full-precision equivalence. HARP represents each rotation as a product of sparse butterfly-like block-orthogonal stages, supports non-power-of-two dimensions via Mixed-Radix schedules, and initializes to the RHT processor up to a fixed permutation. Fitted only on calibration data, HARP adapts the quantization basis to each layer and backend. Across 2-4 bit settings on models ranging from 1B to 70B parameters, HARP improves perplexity and zero-shot accuracy over fixed RHT. Importantly, HARP preserves deployment efficiency, reaching 128 tok/s versus 61 tok/s for FP16.
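The "exact full-precision equivalence" mentioned above follows from orthogonality: rotating the weights by Q and the activations by Q^T leaves the layer output unchanged. The Python sketch below checks this for a fixed randomized Hadamard transform applied on one side of a toy linear layer. It is a minimal illustration under stated assumptions, not HARP itself: there are no learnable butterfly stages and no quantizer, and all helper names are hypothetical.

```python
import random

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def randomized_hadamard(n, rng):
    """Q = H D / sqrt(n) with D a random diagonal sign matrix; Q is orthogonal."""
    H = hadamard(n)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    scale = n ** -0.5
    return [[H[i][j] * signs[j] * scale for j in range(n)] for i in range(n)]

rng = random.Random(0)
n = 8
W = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]  # 4 x n weights
x = [[rng.gauss(0, 1)] for _ in range(n)]                    # n x 1 activation
Q = randomized_hadamard(n, rng)

W_rot = matmul(W, Q)             # rotated weights (the part a quantizer would see)
x_rot = matmul(transpose(Q), x)  # rotated activations
y_ref = matmul(W, x)
y_rot = matmul(W_rot, x_rot)     # (W Q)(Q^T x) = W x in exact arithmetic
max_err = max(abs(a[0] - b[0]) for a, b in zip(y_ref, y_rot))
```

Here `max_err` should be at floating-point rounding level, since Q Q^T is the identity. Any accuracy gain or loss comes only from quantizing the rotated tensors.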

2605.29836 2026-05-29 cs.LG cs.AI stat.ML

CB-SLICE: Concept-Based Interpretable Error Slice Discovery

Yael Konforti, Mateo Espinosa Zarlenga, Elaf Almahmoud, Mateja Jamnik

Affiliations: Department of Computer Science and Technology, University of Cambridge, Cambridge, UK; Trinity College, University of Oxford, Oxford, UK; Cambridge Institute for Technology and Humanity, Cambridge, UK

AI summary: Proposes CB-SLICE, which uses concept prediction failures in concept bottleneck models to discover error slices and explains each failure mode through keyword concepts, outperforming existing methods.

Comments 20 pages, 7 figures, 12 tables, to be published at Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Despite strong average-case performance, deep learning models often exhibit systematic errors on specific population groups, known as error slices. Identifying these groups and the root causes of their failures is critical for model debugging and bias mitigation. However, existing error Slice Discovery Methods (SDMs) typically generate explanations disconnected from the model's inference process; these explanations only approximate the underlying error source and may be inaccurate. We address this limitation by leveraging Concept Bottleneck Models (CBMs), whose predictions are directly dependent on human-understandable semantic concepts. Since downstream task failures in CBMs commonly arise from concept mispredictions, concept representations provide a strong candidate for error slice identification, offering fine-grained explanations directly linked to the error source. Building on this insight, we introduce CB-SLICE, a concept-based SDM that groups samples with shared concept prediction failures and identifies the keyword concepts most responsible for each slice's failure mode. Across multiple benchmarks, we show that CB-SLICE outperforms state-of-the-art methods in uncovering well-known biases while providing richer and more faithful explanations of model errors.

2605.29834 2026-05-29 cs.LG

Open World Autoencoding Drift Detection with Novel Class Recognition in Tabular Non-stationary Data Streams

Joanna Komorniczak

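Affiliations: Department of Systems and Computer Networks; Wrocław University of Science and Technology

AI summary: Proposes an unsupervised concept drift detection method based on autoencoder reconstruction error that recognizes novel-class samples through density estimation and uses mirrored autoencoders to adapt independently and incrementally to changing distributions; experiments on synthetic tabular data streams show it is competitive with state-of-the-art methods.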

Abstract

Data stream processing has become a landmark in modern machine learning applications, with concept drifts and novel class appearances posing the primary challenges faced by sophisticated recognition methods. This work proposes an unsupervised concept drift detection method that identifies shifts in known class distributions based on the reconstruction errors of an autoencoder, while also enabling the recognition of novel class samples through density estimation of a proxy representation of samples. Using mirrored autoencoders allows for independent incremental adaptation to changing problem distributions for the two considered tasks, resulting in continuous adjustment to evolving concepts and reliable recognition of unknown samples. Conducted experiments used a diverse set of synthetic tabular data streams, where both concept drifts and the emergence of novelties were observed. The results show that the proposed approach is competitive with current state-of-the-art unsupervised drift detectors and novelty classifiers.

2605.29829 2026-05-29 cs.AI cs.LG

OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation

Haochen Yang, Ke Zhao, Mengyuan Ma, Xingyu Lu, Xiangfeng Wang, Hong Qian

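Affiliations: East China Normal University; Shanghai Innovation Institute; Ant Group

AI summary: Proposes OptSkills, a system that clusters problems by archetype, distills successful trajectories into reusable workflow-level skills, and dynamically expands its skill library to improve in-distribution and out-of-distribution generalization in optimization modeling and solving.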

Comments 22 pages, 10 figures, project: https://github.com/fujiwaranoM0kou/OptSkills

Abstract

Leveraging Large Language Models (LLMs) to automatically formulate and solve optimization problems from natural language has emerged as an efficient paradigm for automated optimization. However, existing methods still exhibit limited generalization: they are sensitive to superficial narrative variations, reuse experience mainly at the case level, and struggle to adapt to shifted or emerging problem types. We propose OptSkills, an archetype-centric skill learning and reasoning agent system for optimization modeling and solving. To improve robust generalization, our system clusters problems by their underlying archetypes rather than surface narratives. To improve in-distribution generalization, it explores diverse modeling paradigms and solver configurations within each cluster, then distills successful trajectories into reusable workflow-level skills. To improve out-of-distribution generalization, it refines existing skills or expands the skill library using newly obtained trajectories. Our system achieves a state-of-the-art micro-averaged accuracy of 68.27% on datasets encompassing diverse problem types and scenarios. In addition, on MIPLIB-NL, a highly challenging large-scale and high-dimensional benchmark, it achieves 26.91% accuracy, outperforming DeepSeek-V3.2-Thinking by 4.53%. After skill learning on Nano-CO, it reaches 72.79% on the OOD NLCO benchmark. Code and skills are available at https://github.com/fujiwaranoM0kou/OptSkills.

2605.29828 2026-05-29 cs.LG

When Do Graph Foundation Models Transfer? A Data-Centric Theory

Jiajun Zhu, Ying Chen, Peihao Wang, Yixuan He, Pan Li, Aditya Akella, Zhangyang Wang

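Affiliations: University of Texas at Austin; Arizona State University; Georgia Institute of Technology

AI summary: Using a graphon-based continuous limit, the paper decomposes cross-domain output shift into finite-sample approximation terms and an intrinsic domain discrepancy capturing structural mismatch, and validates the effect of positional-encoding stability on transfer.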

Comments 21 pages, including appendix. Accepted at ICML 2026

Abstract

Graph foundation models (GFMs) aim to reuse a single backbone across diverse graph domains, yet their transfer is often uneven and can exhibit negative transfer. While most prior work improves transfer through architectural or adaptation choices, we ask a data-centric question: which properties of two graph domains determine how much a fixed representation model changes its outputs? Using a graphon-based continuous limit for dense graphs, we show that for both set-based and message-passing tokenizations, any Lipschitz backbone admits an explicit decomposition of cross-domain output shift into (i) graph-specific finite-sample approximation terms and (ii) an intrinsic, relabeling-invariant domain discrepancy capturing structural mismatch. A key ingredient is positional-encoding (PE) stability: we establish stability guarantees for spectral PEs and highlight contrasting behaviors of eigenvector- versus subspace-based PEs. Experiments on synthetic and real graphs validate the theory and translate the decomposition into guidance for data curation in GFM transfer.

2605.29827 2026-05-29 cs.CV

Fairness Beyond Demographics: Optimizing Performance Across Appearance-Based Hidden Cohorts in Medical Imaging

Milad Masroor, Cuong Nguyen, Kevin Wells, Gustavo Carneiro

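Affiliations: Centre for Vision, Speech and Signal Processing

AI summary: Proposes the label-free hidden-cohort fairness (LHCF) training paradigm, which clusters images into appearance-based cohorts and optimizes fairness across them to address performance disparities of medical imaging models across demographic attributes.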

Comments Pre-review version submitted to MICCAI 2026. 10 pages, 5 figures

Abstract

Medical image analysis models can exhibit performance disparities across patient subgroups, threatening clinical safety and fairness. Existing methods typically address this issue by optimizing accuracy and fairness metrics for visible demographic attributes (e.g., sex or age) considered in isolation. This strategy not only overlooks potentially more informative latent stratifications, which may reveal deeper sources of model error and inequity, but also fails to scale when multiple demographic attributes are considered simultaneously, because training data within each subgroup becomes sparse. We address these issues by introducing the label-free hidden-cohort fairness (LHCF) training paradigm, which, instead of maximizing fairness over visible demographic attributes, optimizes fairness across latent subpopulations discovered from image appearance. By clustering images into K appearance-based cohorts and applying fairness optimization over them, LHCF uncovers underlying sources of model error and avoids the combinatorial sparsity of multi-demographic attributes, reducing disparities across both single and multiple demographic attributes. We demonstrate on our proposed fairness benchmark, HIDFairBench, that LHCF provides state-of-the-art fairness results on single and multiple demographic attributes, despite never using demographic labels for training. Our results position hidden-cohort fairness as a practical, scalable, and robust alternative to demographic-based fairness optimization for trustworthy medical image analysis.

2605.29826 2026-05-29 cs.CL cs.AI

Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models

Leijiang Gu, Zhen Zeng, Feng Li, Xinjian Gao, Zenglin Shi

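Affiliations: Hefei University of Technology; Tongji University

AI summary: To address causal misalignment and feature entanglement in multimodal knowledge editing, proposes the LDKE framework, which uses fast localization of critical layers and a disentanglement classifier to achieve precise, generalizable edits while maintaining high locality.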

Abstract

Existing methods in Multimodal Knowledge Editing (MKE) have advanced the ability to correct outdated or inaccurate knowledge in Multimodal Large Language Models (MLLMs). However, they exhibit a critical limitation: while effectively modifying target factual pairs, they fail to generalize edits to logically related queries and often cause unintended alterations to unrelated but visually or semantically linked information. We identify and formalize two underlying failure modes causing this issue: Causal Misalignment, which confines edits to the specific sample, and Feature Entanglement, which causes unintended alterations to coupled but irrelevant information. To address these issues, we propose Localized and Disentangled Knowledge Editing (LDKE), a new framework that achieves precise and generalized editing by localizing fact-specific model layers and disentangling target-relevant inputs from irrelevant ones. Our approach introduces a Fast Localization module to identify and update critical layers efficiently, along with a Disentanglement Classifier that routes inputs appropriately to preserve unrelated knowledge. Extensive experiments across various benchmarks and MLLMs demonstrate that LDKE achieves superior performance in propagating edits to related contexts while maintaining high locality.

2605.29819 2026-05-29 cs.LG

The Interplay Between Interpolation and Aggregation in Regression: Optimal Sample Complexity

Mikael Møller Høgsgaard, Kasper Green Larsen, Liang-Yu Zou

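Affiliations: Department of Computer Science, Aarhus University; Department of Statistics, University of Oxford

AI summary: The paper theoretically studies the interplay between interpolation and aggregation in regression, proves that the γ-graph dimension characterizes learnability for a broad class of natural aggregation procedures, and finds that a simple procedure that aggregates three interpolating hypotheses via the median is optimal among all such procedures and strictly more powerful than proper learning.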

Abstract

This work theoretically investigates the interplay between interpolation and aggregation in regression. We establish that the $\gamma$-graph dimension characterizes learnability for a broad class of natural aggregation procedures. Furthermore, we prove that an extremely simple aggregation procedure, combining three interpolating hypotheses via the median, is optimal among all these aggregation procedures, and is strictly more powerful than proper learning. Finally, we show that some hypothesis classes are learnable only by aggregating infinitely many hypotheses or by using non-interpolating aggregation rules (which may predict outside the range of their inputs), whereas any finite interpolating aggregation fails to achieve even trivial performance.
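The median-of-three aggregation rule can be sketched in a few lines of Python. The three hypotheses below are hypothetical stand-ins, and the sketch does not reproduce how the paper obtains them. It only shows the aggregation step, and that the median always lies within the range of its inputs, which is the sense in which the abstract calls a rule interpolating.

```python
from statistics import median

def median_of_three(h1, h2, h3):
    """Return a predictor that outputs the median of three hypotheses' predictions."""
    def aggregated(x):
        return median((h1(x), h2(x), h3(x)))
    return aggregated

# Hypothetical hypotheses: two that roughly agree and one far off.
h_a = lambda x: x
h_b = lambda x: x + 1
h_c = lambda x: 100

predict = median_of_three(h_a, h_b, h_c)
value = predict(3)  # inputs to the median are 3, 4, 100
```

A single badly wrong hypothesis cannot move the median past the other two, which is the intuition behind using it instead of the mean.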

2605.29816 2026-05-29 cs.AI

Harnessing non-adversarial robustness in large language models

Qinghua Zhou, Ellina Aleshina, Andrey Lovyagin, Oleg Somov, Mikhail Seleznyov, Alexander Panchenko, Ivan Oseledets, Elena Tutubalina, Ivan Y. Tyukin

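Affiliations: Applied AI Institute, Moscow, Russia; King's College London, London, UK; International Joint Laboratory of AI for Industry, QUST, Qingdao, China

AI summary: Through theoretical analysis and experiments, the paper proposes a debiasing-based fine-tuning method that improves the robustness of large language models to semantically similar but textually different prompts and provides certification guarantees.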

Abstract

The work presents an approach for addressing the challenge of robustness in Large Language Models (LLMs) to alterations and potential errors caused by semantically similar but textually different prompts. Recent works have shown that these kinds of prompt variations can significantly impact the performance of LLMs on tasks. The central question is: can LLMs' robustness to semantically-neutral prompt alterations be acquired without expensive retraining of the entire model? We address this question both theoretically and through experiments. Our theoretical analysis reveals a crucial factor impacting model robustness - a systematic expected shift or perturbation-induced bias in neural network module outputs. Motivated by this analysis, we show that robustness can be achieved via a simple fine-tuning process: debiasing for robustness. We identify conditions when debiasing helps and when it does not, and demonstrate, through both theory and extensive experiments, that debiasing for robustness may indeed be a quick and efficient tool to enhance robustness and provide certification against random prompt perturbations.

2605.29815 2026-05-29 cs.AI cs.CL

PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing

Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński, Tomasz Jan Kajdanowicz

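Affiliations: Department of Artificial Intelligence

AI summary: Proposes the PRAIB framework, which defines metrics for review specificity, style, and engagement behavior and, through experiments comparing 11,000 machine-generated reviews with human reviews, reveals systematic differences between LLM and human reviews in ratings, cross-referencing, and weakness identification.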

Abstract

The growing number of submitted papers has motivated the exploration of Large Language Models (LLMs) as a means to support and augment the peer review process, particularly in terms of improving its speed and scalability. Yet, it remains unknown whether LLMs engage with scientific manuscripts in the same manner as human reviewers, or whether they merely produce review-looking text. To address this, we introduce the Peer Review AI Benchmark (PRAIB), a novel framework comprising thoroughly defined metrics that measure review specificity, style, and behavior of engagement. To complement the PRAIB framework, we conduct a large-scale empirical study leveraging a dataset of 11,000 reviews generated by five proprietary and open-source models for 1,000 ICLR and NeurIPS papers. Spanning the 2021-2025 period, these machine-generated reviews are compared against original human feedback across diverse prompting strategies to identify systematic behavioral divergences. Our analysis reveals that the generated reviews diverge significantly from feedback provided by human reviewers: LLM ratings are less variable, positively biased, and overconfident, and their cross-reference patterns are model-dependent and distinct from human norms. Furthermore, when evaluated through PRAIB, we observe that LLMs tend to generate longer, more complex reviews, yet frequently overlook the atomic weaknesses noted by human reviewers. By characterizing where and how LLM reviewing behavior departs from human norms, PRAIB provides the community with a diagnostic tool for identifying which aspects of the review process LLMs can reliably support today and which require further development before deployment.

2605.29812 2026-05-29 cs.CV

Not All Inputs Are Valid: Towards Open-Set Video Moment Retrieval Using Language

Xiang Fang, Wanlong Fang, Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Renfu Li, Zichuan Xu, Lixing Chen, Panpan Zheng, Yu Cheng

Affiliations: Huazhong University of Science and Technology; Peking University; Zhejiang Gongshang University; Dalian University of Technology; Shanghai Jiao Tong University; Xinjiang University; The Chinese University of Hong Kong

AI summary: To address erroneous retrieval caused by irrelevant queries in open-set video moment retrieval, proposes OpenVMR, a normalizing-flow-based open-set model that retrieves precise moments for in-distribution queries and rejects out-of-distribution queries.

Comments Published in ACM MM 2024

Abstract

Video Moment Retrieval (VMR) aims to retrieve the specific moment corresponding to a sentence query from an untrimmed video. Although recent works have made remarkable progress on this task, they implicitly rely on the closed-set assumption that all given queries are video-relevant (in this paper, a "video-relevant query" is treated as an "in-distribution (ID) query" and a "video-irrelevant query" as an "out-of-distribution (OOD) query"). Given an OOD query in open-set scenarios, they still use it and retrieve a wrong moment, which might lead to irrecoverable losses in high-risk scenarios, e.g., criminal activity detection. To this end, we explore a new VMR setting termed Open-Set Video Moment Retrieval (OS-VMR), where we should not only retrieve the precise moments based on ID queries, but also reject OOD queries. In this paper, we make the first attempt toward OS-VMR and propose a novel model, OpenVMR, which first distinguishes ID and OOD queries using normalizing flows and then conducts moment retrieval based on ID queries. Specifically, we first learn the ID distribution by constructing a normalizing flow, assuming that the ID query distribution follows a multivariate Gaussian distribution. Then, we introduce an uncertainty score to search for the ID-OOD separating boundary. After that, we refine the ID-OOD boundary by pulling ID query features together. In addition, video-query matching and frame-query matching are designed for coarse-grained and fine-grained cross-modal interaction, respectively. Finally, a positive-unlabeled learning module is introduced for moment retrieval. Experimental results on three VMR datasets show the effectiveness of OpenVMR.

2605.29807 2026-05-29 cs.CL cs.AI cs.LG

Data filtering methods for training language models

Egor Shevchenko, Elena Bruches

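Affiliations: Novosibirsk State University; A. P. Ershov Institute of Informatics Systems SB RAS

AI summary: The paper compares two automatic label error detection methods, Confident Learning and Dataset Cartography, on Russian text classification tasks and finds that their effectiveness depends on dataset characteristics; on small, noisy datasets Confident Learning significantly improves F1-macro.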

Comments AINL-2026

Abstract

Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization. In this work, we conduct a comparative analysis of two automatic label error detection methods - Confident Learning and Dataset Cartography - on three Russian text classification corpora of varying size, number of classes, and domain: ru_emotion_e-culture (49,123 examples, emotion classification), RuCoLA (8,524 examples, linguistic acceptability), and TERRa (2,337 examples, textual entailment recognition). We use the pre-trained rubert-base-cased model fine-tuned on each corpus. To verify the meaningfulness of filtering, we conduct control experiments with random removal of an equivalent number of examples. Results show that the effectiveness of both methods depends strongly on dataset characteristics: on large corpora with low noise levels, filtering does not improve performance, while on small datasets with high noise, Confident Learning achieves a significant F1-macro improvement. Dataset Cartography demonstrates more conservative behavior, removing fewer examples. Across all corpora, targeted removal by both methods outperforms random removal, confirming the meaningfulness of the approaches.
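For readers unfamiliar with Confident Learning, the Python sketch below shows a simplified version of its core idea. Each class gets a threshold equal to the mean predicted probability of that class over the examples labeled with it, and an example is flagged when the class it confidently belongs to differs from its given label. This illustrates the general technique under stated assumptions, not the authors' pipeline: the function name and toy data are hypothetical, every class is assumed to have at least one labeled example, and in practice the probabilities would be out-of-sample predictions.

```python
def find_label_issues(probs, labels):
    """Flag indices whose given label disagrees with a confidently predicted class.

    probs[i][j] is the predicted probability of class j for example i;
    labels[i] is the (possibly noisy) given label.
    """
    n_classes = len(probs[0])
    # Per-class threshold: mean probability of class j among examples labeled j.
    thresholds = []
    for j in range(n_classes):
        own = [p[j] for p, y in zip(probs, labels) if y == j]
        thresholds.append(sum(own) / len(own))
    flagged = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Classes whose probability clears that class's own threshold.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if confident:
            guess = max(confident, key=lambda j: p[j])
            if guess != y:
                flagged.append(i)
    return flagged

# Toy binary example: the last item is labeled 1, but the model is sure it is 0.
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.95, 0.05]]
labels = [0, 0, 1, 1, 1]
issues = find_label_issues(probs, labels)
```

With this toy data only index 4 is flagged. Removing flagged examples before fine-tuning is the filtering step that the abstract compares against random removal.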

2605.29803 2026-05-29 cs.LG

Gated Graph Attention Networks with Learnable Temperature

Zhongtian Ma, Hao Wu, Yexin Zhang, Qiaosheng Zhang, Zhen Wang

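Affiliations: School of Cybersecurity, Northwestern Polytechnical University; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes gated graph attention and a learnable temperature mechanism that filter unreliable feature dimensions and dynamically adjust the sharpness of the attention coefficient distribution, improving graph attention networks on homogeneous and heterophilic heterogeneous benchmarks.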

Abstract

Graph attention networks learn neighbor importance through data-dependent coefficients, but standard layers lack explicit control over unreliable feature dimensions and use fixed sharpness of attention coefficient distributions. This paper proposes gated graph attention and learnable temperature for common graph attention mechanisms. Gated graph attention filters feature or message responses to reduce the influence of unreliable dimensions, while learnable temperature dynamically adjusts the sharpness of the attention coefficient distribution. Experiments on homogeneous and heterophilic heterogeneous benchmarks show that the proposed variants consistently improve the corresponding graph attention backbones, and controlled noise studies further verify their behavior under feature perturbations. Theoretical analysis explains these results by showing that gating improves robustness when only part of the feature coordinates are reliable, while temperature is beneficial when global noise weakens the discriminability of node features.
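The two mechanisms can be illustrated with a small Python sketch: a softmax over neighbor scores divided by a temperature, and an elementwise sigmoid gate on features. This is a generic illustration with hypothetical function names. In the paper the temperature and gates are learned and integrated into specific graph attention backbones, whereas here they are plain arguments.

```python
import math

def attention_weights(scores, temperature=1.0):
    """Softmax over neighbor scores divided by a positive temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def gate(features, gate_logits):
    """Scale each feature dimension by sigmoid(gate_logit)."""
    return [f / (1.0 + math.exp(-g)) for f, g in zip(features, gate_logits)]

neighbor_scores = [2.0, 1.0, 0.0]
sharp = attention_weights(neighbor_scores, temperature=0.5)  # concentrates on the top neighbor
flat = attention_weights(neighbor_scores, temperature=5.0)   # closer to uniform

# A strongly negative gate logit nearly removes the second dimension.
gated = gate([1.0, 1.0], [10.0, -10.0])
```

A lower temperature sharpens the distribution and a higher one flattens it, which matches the abstract's description of adjusting the sharpness of the attention coefficients.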

2605.29801 2026-05-29 cs.AI cs.CL cs.CR cs.CV cs.LG

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu

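Affiliations: Shanghai Artificial Intelligence Laboratory

AI summary: To address emerging safety risks of open-world agents, proposes a lightweight, scalable safety alignment framework that updates the safety taxonomy, builds a data engine, and trains small models (0.8B-8B parameters), achieving performance comparable to closed-source models and reducing deployment overhead by two orders of magnitude.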

Comments 44 pages, 12 Figures, 9 Tables

Abstract

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.

2605.29800 2026-05-29 cs.CL

Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels

九位评委,两张有效选票:相关错误削弱了LLM评估小组

Guneet Kohli

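Affiliations * Apple

AI summary By analyzing the votes of 9 frontier LLMs on natural language inference tasks, the paper finds that highly correlated errors across models leave an evaluation panel with only about 2 independent votes' worth of information. Actual accuracy is 8-22 percentage points below the independent-voting ideal, and neither adding judges nor improving the aggregation algorithm closes the gap.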

Comments 14 pages, 5 figures, 12 tables

详情
AI中文摘要

LLM作为评委的小组通过聚合多个模型的投票来评估,期望不同模型能提供更可靠的评估。我们开发了一个框架来衡量此类小组的真实信息价值,并量化其可靠性距离独立投票理想状态的差距。在来自7个模型家族的9个前沿LLM小组上,对三个自然语言推理数据集(每个项目有100个人工标注)进行测试,我们发现9位评委实际上仅提供约2个独立投票的信息量。小组名义独立性的约四分之三因模型在相同项目上犯相同错误而丧失。后果是显著的:小组的实际准确率比独立投票所能达到的低8-22个百分点,且最佳单个评委在所有条件下均匹配或超越整个小组。增加评委或使用更智能的聚合算法均无济于事——即使能访问正确答案,现有方法最多只能缩小这一差距的11%。我们使用Kish有效样本量(n_eff)和Condorcet零模型量化这些发现,并显示该缺陷在提示变体、温度、思维链推理以及成对偏好任务(RewardBench)中均稳健存在。瓶颈在于评委之间的相关性,而非聚合算法,这意味着扩大小组规模无法替代真正独立的评估。

英文摘要

LLM-as-a-judge panels aggregate votes from multiple models, with the expectation that diverse models yield more reliable evaluations. We develop a framework to measure the true informational value of such panels and quantify how far their reliability falls short of the independent-voting ideal. Testing a panel of 9 frontier LLMs from 7 model families on three natural language inference datasets (each with 100 human annotations per item), we find that the 9 judges effectively provide only about 2 independent votes' worth of information. Roughly three-quarters of the panel's nominal independence is lost because the models make the same mistakes on the same items. The consequences are stark: the panel's actual accuracy falls 8-22 percentage points short of what independent voting would achieve, and the best single judge matches or outperforms the full panel across all conditions. Neither adding more judges nor using smarter aggregation algorithms helps -- established methods close at most 11% of this gap, even with access to the correct answers. We quantify these findings using the Kish effective sample size (n_eff) and a Condorcet null model, and show the deficit is robust across prompt variants, temperatures, chain-of-thought reasoning, and a pairwise preference task (RewardBench). The bottleneck is correlated judges, not the aggregation algorithm, implying that scaling up panels cannot substitute for genuinely independent evaluation.
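The two quantities named in the abstract can be illustrated with a rough sketch. It assumes the equicorrelated design-effect form of the Kish effective sample size, n_eff = n / (1 + (n - 1) * rho), and a simplified Condorcet null in which each judge is independently right or wrong. The function names and the example correlation are illustrative assumptions, not values or code from the paper.

```python
from math import comb


def kish_n_eff(n_judges: int, mean_corr: float) -> float:
    """Effective number of independent votes under equal pairwise correlation.

    Assumption: design-effect form n / (1 + (n - 1) * rho); the paper may
    estimate n_eff differently.
    """
    return n_judges / (1.0 + (n_judges - 1) * mean_corr)


def condorcet_majority_accuracy(n_judges: int, p_correct: float) -> float:
    """Probability that a strict majority is correct if judges err independently.

    Simplification: each judge is treated as right/wrong with probability
    p_correct, and n_judges is assumed odd.
    """
    majority = n_judges // 2 + 1
    return sum(
        comb(n_judges, k) * p_correct**k * (1.0 - p_correct) ** (n_judges - k)
        for k in range(majority, n_judges + 1)
    )


# Hypothetical numbers: 9 judges with mean pairwise error correlation 0.4375.
print(kish_n_eff(9, 0.4375))                # 2.0
print(condorcet_majority_accuracy(9, 0.8))  # independent-voting ideal for p = 0.8
```

Under this assumed form, a mean correlation near 0.44 is enough to shrink nine judges to two effective votes, which is why adding further correlated judges changes little.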

2605.29798 2026-05-29 cs.CV cond-mat.mtrl-sci eess.IV

Low-Magnification SEM May Suffice: Interpretable Deep Learning for Multi-Scale Fracture-Cause Classification in Zirconia-Toughened Alumina

低倍率SEM可能足够:用于氧化锆增韧氧化铝多尺度断裂原因分类的可解释深度学习

Julian Schmid, Pawel Astankow, Tom Vater, Julius Beck, Robert Cichon, Danny Krautz

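Affiliations * CeramTec GmbH; School of Life Sciences, University of Applied Sciences and Arts Northwestern Switzerland

AI summary Proposes an interpretable vision-transformer workflow that automatically classifies fracture causes in alumina matrix composite implants from low-magnification SEM images, reaching accuracy comparable to that obtained at high magnification.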

详情
AI中文摘要

可靠识别氧化铝基复合材料髋关节和膝关节植入物的断裂起源对于质量保证和患者安全至关重要,然而当前的断口分析工作流程耗时、部分主观且依赖高倍率扫描电子显微镜(SEM)。我们提出了一种可解释的视觉变换器(ViT)工作流,用于对广泛用于全关节置换的氧化铝基复合材料(BIOLOX delta, CeramTec GmbH)的断裂原因进行自动分类。从五年的生产爆破和验证测试中整理了8,493张SEM图像(50倍至10,000倍)的数据集,并按照制造链定义的三个缺陷类别(生坯、硬加工和材料缺陷)进行标注。在严重的类别不平衡下,微调后的ViT在分层五折交叉验证中达到了0.907的准确率和0.888的宏F1分数,两阶段感知哈希/SSIM泄漏审计确认了样本重叠可忽略。值得注意的是,低倍率(50倍)下的性能与高倍率(1k-10k倍)相当,表明宏观特征——镜面几何和羽状纹线场——已经编码了足够的诊断信号。Grad-CAM归因一致地定位在经典的断口线索(镜面、羽状纹、孔隙、加工痕迹)上,与既定的断口分析标准一致。这些结果共同将可解释ViT定位为陶瓷植入物质量保证的补充工具,能够实现低倍率预筛选并减少对耗时的高倍率检查的依赖。

英文摘要

Reliable identification of fracture origins in alumina matrix composite hip and knee implants is critical for quality assurance and patient safety, yet current fractographic workflows are time-consuming, partly subjective, and reliant on high-magnification scanning electron microscopy (SEM). We present an interpretable vision-transformer (ViT) workflow for automated classification of fracture causes in an alumina matrix composite (BIOLOX delta, CeramTec GmbH) widely used in total joint replacements. A dataset of 8,493 SEM images (50x-10,000x) was curated from five years of in-production burst and proof tests and annotated into three defect categories defined along the manufacturing chain: green body, hard machining, and material defects. Under severe class imbalance, the fine-tuned ViT reached an accuracy of 0.907 and a macro-F1 of 0.888 in stratified five-fold cross-validation, with a two-stage perceptual-hash/SSIM leakage audit confirming negligible specimen overlap. Notably, performance at low magnification (50x) was comparable to that at high magnification (1k-10kx), indicating that macro-scale features - mirror geometry and hackle line fields - already encode sufficient diagnostic signal. Grad-CAM attributions consistently localised on canonical fractographic cues (mirrors, hackles, pores, machining marks), aligning with established fractographic criteria. Together, these results position interpretable ViTs as a complementary tool for ceramic-implant quality assurance, enabling low-magnification pre-screening and reducing reliance on time-intensive high-magnification inspection.
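The abstract reports both accuracy and macro-F1 because the classes are severely imbalanced. The sketch below shows how the two can diverge, using the standard macro-F1 definition (unweighted mean of per-class F1). The toy labels are invented for illustration and are not the paper's data.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced toy set: a classifier that always predicts the majority class.
classes = ["green_body", "hard_machining", "material"]
y_true = ["green_body"] * 8 + ["hard_machining", "material"]
y_pred = ["green_body"] * 10

print(accuracy(y_true, y_pred))           # high despite ignoring two classes
print(macro_f1(y_true, y_pred, classes))  # much lower
```

A macro-F1 close to the accuracy, as reported in the abstract, therefore suggests that the minority defect classes are also being recognised.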

2605.29797 2026-05-29 cs.CL

Metric-Dependent Annotation Saturation for Learning from Label Distributions

基于度量依赖的标注饱和:从标签分布中学习

Guneet Kohli

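Affiliations * Apple

AI summary Studies how the number of annotators needed when learning from label distributions depends on the evaluation metric: entropy correlation needs 20-50 annotators to converge, whereas KL divergence saturates at about 10 annotators, and soft labels outperform label smoothing.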

Comments 16 pages, 3 figures, 14 tables

详情
AI中文摘要

当标注者对标签存在分歧时,分歧本身携带信号——而捕获该信号所需的标注者数量取决于评估度量。我们在从ChaosNLI(每个项目提供100个独立标注者判断的数据集)子采样的标签分布上微调NLI模型,并识别出度量依赖的饱和。在我们的3类NLI设置中,熵相关——模型是否识别出哪些项目引发分歧——需要N~20-50个标注者才能收敛,而分布匹配(KL散度)在N~10时饱和(五个模型种子中达到改进的87-95%)。这一发现基于先前的观察:软标签携带标签平滑无法复现的项目特定信号。在五种平滑强度下,熵相关聚类在r~0.45-0.49,而软标签达到r=0.643(p<0.001);逐项分析将这一差距归因于平滑无法区分模糊项目与清晰项目。软标签优势在两种架构(DeBERTa、RoBERTa)、非NLI预训练基线以及内容安全探索性跨领域评估中得以复现。这些结果表明,标注预算应根据目标评估度量而非统一设定。

英文摘要

When annotators disagree on a label, the disagreement itself carries signal -- and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from ChaosNLI, a dataset providing 100 independent annotator judgments per item, and identify metric-dependent saturation. In our 3-class NLI setting, entropy correlation -- whether the model identifies which items elicit disagreement -- requires N ~ 20-50 annotators to converge, while distributional match (KL divergence) saturates by N ~ 10 (87-95% of improvement across five model seeds). This finding rests on a prior observation: soft labels carry item-specific signal that label smoothing cannot replicate. Across five smoothing intensities, entropy correlation clusters at r ~ 0.45-0.49, while soft labels reach r = 0.643 (p < 0.001); per-item analysis traces this gap to smoothing's inability to distinguish ambiguous items from clear ones. The soft-label advantage replicates across two architectures (DeBERTa, RoBERTa), a non-NLI-pretrained baseline, and an exploratory cross-domain evaluation on content safety. These results suggest that annotation budgets should be informed by the target evaluation metric rather than set uniformly.
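The claim that label smoothing cannot separate ambiguous items from clear ones follows directly from how the two targets are built. A minimal sketch, with invented annotator counts and an assumed smoothing strength of 0.1:

```python
from math import log


def entropy(dist):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * log(p) for p in dist if p > 0)


def soft_label(counts):
    """Normalise per-class annotator counts into a label distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def smoothed_label(counts, eps=0.1):
    """Label smoothing: one-hot on the majority class mixed with a uniform distribution."""
    k = len(counts)
    majority = counts.index(max(counts))
    return [(1 - eps) * (i == majority) + eps / k for i in range(k)]


# Hypothetical 3-class NLI items with 100 annotators each.
clear_item = [95, 3, 2]
ambiguous_item = [40, 35, 25]

for counts in (clear_item, ambiguous_item):
    # Soft-label entropy varies per item; smoothed entropy is the same for both.
    print(round(entropy(soft_label(counts)), 3), round(entropy(smoothed_label(counts)), 3))
```

Because the smoothed target depends only on the majority class and `eps`, its entropy is constant across items, so a model trained on it has no item-level disagreement signal to learn from.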

2605.29795 2026-05-29 cs.AI

MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains

MEMENTO: 利用网络作为低数据领域的学习信号

Ashutosh Ojha, Vinay Aggarwal, Ashutosh Srivastava, Siddharth Yedlapati, Yaman K Singla, Jitendra Ajmera

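Affiliations * Adobe, Media & Data Science Research Lab

AI summary Proposes MEMENTO, a framework that uses the web as a learning signal through an Adaptive Exploration Tree and dual-channel memory, with marked gains in two low-data professional domains (sales automation and legal research).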

详情
AI中文摘要

现实世界的任务通常缺乏大规模标注数据集,这激发了在低数据场景下学习的广泛研究。然而,现有方法如少样本提示、指令调优和合成数据生成,仍将标注或伪标注数据作为主要学习信号。相比之下,人类从业者通过反复、自主地与开放网络交互来获取专业知识,逐步完善领域知识和搜索策略。我们提出MEMENTO,一个将网络视为学习信号而非无状态检索接口的框架。MEMENTO在两个层面运作:在每个会话内,它通过自适应探索树(AET)进行迭代式网络探索,将任务分解为演化中的问题并反思中间发现;跨会话间,它通过双通道记忆积累经验,将陈述性知识(事实)与程序性知识(搜索策略)分离。这种设计使智能体能够从网络交互轨迹中学习可重用的研究策略和领域专业知识,而无需额外的模型训练。我们在两个低数据专业领域(销售自动化和法律研究)上评估MEMENTO。实验结果显示,与基于ReAct的基线相比,性能持续提升(销售自动化+25.6%,法律研究+36.5%),表明网络可以作为在数据稀缺场景下获取任务特定专业知识的可扩展学习源。

英文摘要

Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data regimes. However, existing approaches such as few-shot prompting, instruction tuning, and synthetic data generation continue to treat labeled or pseudo-labeled data as the primary learning signal. In contrast, human practitioners acquire expertise through repeated, self-directed interaction with the open web, progressively refining both domain knowledge and search strategies. We propose MEMENTO, a framework that treats the web as a learning signal rather than a stateless retrieval interface. MEMENTO operates at two levels: within each session, it conducts iterative web exploration via an Adaptive Exploration Tree (AET) that decomposes tasks into evolving questions and reflects on intermediate findings; across sessions, it accumulates experience through dual-channel memory, separating declarative knowledge (facts) from procedural knowledge (search strategies). This design enables agents to learn reusable research strategies and domain expertise from trajectories of web interaction without additional model training. We evaluate MEMENTO on two low-data professional domains: sales automation and legal research. Our empirical results show consistent improvements in performance over ReAct-based baselines (+25.6% on sales automation and +36.5% on legal research), demonstrating that the web can serve as a scalable learning source for acquiring task-specific expertise in data-scarce settings.

2605.29794 2026-05-29 cs.AI

SkillsInjector: Dynamic Skill Context Construction for LLM Agents

SkillsInjector: 面向LLM智能体的动态技能上下文构建

Yanchao Li, Wanhao Liu, Ben Gao, Jiaqing Xie, Zhehong Ai, Na Zou, Yuqiang Li, Tianfan Fu

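Affiliations * Nanjing University; Shanghai AI Lab

AI summary To counter the performance drop caused by static skill injection, proposes SkillsInjector, a two-stage adaptive method in which a context planner learns skill preferences and sets an adaptive budget, and a set-aware renderer optimizes how descriptions are presented. It gains 3.9, 6.1, and 7.3 percentage points on three benchmarks.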

详情
AI中文摘要

LLM智能体现在依赖不断增长的技能库来处理复杂任务。然而,注入更多技能并不总能提高任务完成度,甚至可能降低性能。现有方法仍将技能注入视为静态步骤,使用固定标准选择技能,预先设定预算,并保持描述不变。我们认为这种静态处理会削弱技能的效用,因为暴露哪些技能、包含多少技能以及如何呈现它们都会影响下游性能。我们提出SkillsInjector,一种两阶段自适应方法,共同解决这些决策。首先,上下文规划器学习基于执行的技能偏好,并为每个任务自适应地确定技能数量。然后,集合感知渲染器根据共注入的邻居定制所选描述的呈现方式。在tau2-bench、SkillsBench和ALFWorld上,SkillsInjector取得了最高分数,分别比最强基线提高了3.9、6.1和7.3个百分点。消融研究表明,技能选择、自适应预算和集合感知渲染各自对性能提升有贡献。这些结果表明,技能增强型智能体受益于优化注入的上下文本身。代码将在发表后发布。

英文摘要

LLM agents now draw on growing skill libraries to handle complex tasks. However, injecting more skills does not always improve task completion and can even degrade it. Existing methods still treat skill injection as a static step, selecting skills with fixed criteria, fixing the budget in advance, and leaving descriptions unchanged. We argue that this static treatment can undermine the utility of skills, because which skills are exposed, how many are included, and how they are presented all affect downstream performance. We propose SkillsInjector, a two-stage adaptive method that jointly addresses these decisions. First, a context planner learns execution-grounded skill preferences and admits an adaptive number of skills for each task. A set-aware renderer then tailors how selected descriptions are presented relative to their co-injected neighbors. Across tau2-bench, SkillsBench, and ALFWorld, SkillsInjector achieves the highest score, improving over the strongest baseline by 3.9, 6.1, and 7.3 percentage points, respectively. Ablation studies show that skill selection, adaptive budgeting, and set-aware rendering each contribute to the gain. These results show that skill-augmented agents benefit from optimizing the injected context itself. Code will be released upon publication.

2605.29793 2026-05-29 cs.CV

Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language

更少步骤,更优性能:基于语言的高效跨模态视频片段修剪用于视频时刻检索

Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Zichuan Xu, Wenzheng Xu, Junyang Chen, Renfu Li

Affiliations * Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology; Peking University; Henan University; Dalian University of Technology; Sichuan University; Shenzhen University; Huazhong University of Science and Technology

AI summary Proposes SpotVMR, which uses a learnable clip search model and low-cost semantic indexing features to efficiently trim query-relevant video clips, and which works as a plug-and-play module that improves the efficiency and performance of existing VMR methods.

Comments Published in AAAI 2024

详情
AI中文摘要

给定一个未修剪的视频和一个句子查询,基于语言的视频时刻检索(VMR)旨在定位目标查询相关的时刻。由于未修剪的视频过长,几乎所有现有的VMR方法首先将每个未修剪的视频稀疏下采样为多个固定长度的视频片段,然后与查询特征和昂贵的片段特征进行多模态交互以进行推理,这对于跨越数小时的长真实世界视频是不可行的。由于视频被下采样为固定长度的片段,一些与查询相关的帧可能被过滤掉,这将模糊目标时刻的特定边界,将相邻的不相关帧作为新边界,容易导致跨模态错位,并引入边界偏差和推理偏差。为此,在本文中,我们提出了一种高效的方法SpotVMR,用于修剪与查询相关的片段。此外,我们提出的SpotVMR可以作为即插即用模块,在保持良好检索性能的同时提高最先进VMR方法的效率。特别地,我们首先设计了一个新颖的片段搜索模型,该模型学习根据语言查询识别有希望的视频区域进行搜索。然后,我们引入一组低成本的语义索引特征来捕获对象和交互的上下文,这些上下文提示在哪里搜索查询相关的时刻。此外,利用蒸馏损失来解决片段选择器和VMR模型端到端联合训练中出现的优化问题。在三个具有挑战性的数据集上的大量实验证明了其有效性。

英文摘要

Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate a target query-relevant moment. Since the untrimmed video is overlong, almost all existing VMR methods first sparsely downsample each untrimmed video into multiple fixed-length video clips and then conduct multi-modal interactions with the query feature and expensive clip features for reasoning, which is infeasible for long real-world videos that span hours. Since the video is downsampled into fixed-length clips, some query-related frames may be filtered out, which blurs the boundary of the target moment and treats adjacent irrelevant frames as new boundaries, easily leading to cross-modal misalignment and introducing both boundary bias and reasoning bias. To this end, we propose an efficient approach, SpotVMR, to trim the query-relevant clip. Moreover, SpotVMR can serve as a plug-and-play module that improves the efficiency of state-of-the-art VMR methods while maintaining good retrieval performance. Specifically, we first design a novel clip search model that learns to identify promising video regions to search, conditioned on the language query. Then, we introduce a set of low-cost semantic indexing features to capture the context of objects and interactions that suggest where to search for the query-relevant moment. In addition, a distillation loss is used to address the optimization issues arising from end-to-end joint training of the clip selector and the VMR model. Extensive experiments on three challenging datasets demonstrate its effectiveness.

2605.29791 2026-05-29 cs.CL

ActTraitBench: Quantifying the Knowledge-Decision Gap in Large Language Models via Human-Grounded Behavioral Validation

ActTraitBench: 通过人类行为验证量化大型语言模型中的知识-决策差距

Yutong Yang, Chenxi Miao, Weikang Li, Yunfang Wu

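Affiliations * Peking University; Baidu Inc

AI summary Proposes ActTraitBench, a framework that uses human data to establish one-to-one mappings between psychometric facets and behavioral paradigms and calibrates LLM score distributions via quantile mapping. It reveals a knowledge-decision gap between LLM self-reports and behavioral decisions and introduces the CoCA intervention to mitigate it.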

详情
AI中文摘要

虽然大型语言模型(LLM)在显式自我报告中能够令人信服地模拟人格,但它们在隐式行为决策中常常出现偏差,揭示了显著的知识-决策差距($G_{\text{KD}}$)。现有的基准由于结构效度有限、多维度纠缠以及基于LLM评估中的分布偏差,难以衡量这种不对称性。为了解决这些问题,我们提出了ActTraitBench,一个基于人类数据的评估框架,用于衡量LLM中的人格一致性。基于经验人类数据,ActTraitBench建立了心理测量方面与行为范式之间的一一映射,并应用通过分位数映射的分布校准程序,使LLM评判者的分数分布与人类规范对齐。在14个主流LLM上的实验揭示了普遍的知识-决策不对称性,其中更大、能力更强的模型尽管自我报告高度一致,但往往表现出更强的行为分歧。为了缓解这一差距,我们进一步引入了认知对齐链(CoCA),一种即插即用的推理时干预措施,可改善具有推理能力的前沿模型的对齐,同时暴露出较小架构中明显的能力限制。

英文摘要

While Large Language Models (LLMs) can convincingly simulate personas in explicit self-reports, they often deviate in implicit behavioral decisions, revealing a substantial Knowledge-Decision Gap ($G_{\text{KD}}$). Existing benchmarks struggle to measure this asymmetry due to limited construct validity, multi-dimensional entanglement, and distributional biases in LLM-based evaluation. To address these issues, we propose ActTraitBench, a human-grounded evaluation framework for measuring personality consistency in LLMs. Grounded in empirical human data, ActTraitBench establishes one-to-one mappings between psychometric facets and behavioral paradigms, and applies a Distributional Calibration via Quantile Mapping procedure to align LLM-judge score distributions with human norms. Experiments on 14 mainstream LLMs reveal a pervasive knowledge-decision asymmetry, where larger and more capable models often exhibit stronger behavioral divergence despite highly consistent self-reports. To mitigate this gap, we further introduce the Chain of Cognitive Alignment (CoCA), a plug-and-play inference-time intervention that improves alignment in reasoning-capable frontier models while exposing clear capability limitations in smaller architectures.
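Quantile mapping in general replaces a score by the value that sits at the same quantile of a reference distribution. The sketch below is one plain empirical-CDF version with linear interpolation. It is an assumption about how such a calibration could look, since the abstract does not describe the paper's exact procedure, and the score samples are invented.

```python
from bisect import bisect_right


def quantile_map(score, judge_scores, human_scores):
    """Map an LLM-judge score onto the human scale by matching empirical quantiles.

    Assumption: empirical-CDF matching with linear interpolation between
    neighbouring human order statistics.
    """
    judge_sorted = sorted(judge_scores)
    human_sorted = sorted(human_scores)
    # Empirical CDF of the score within the judge distribution, in [0, 1].
    q = bisect_right(judge_sorted, score) / len(judge_sorted)
    # Position of that quantile in the human sample, interpolated linearly.
    pos = q * (len(human_sorted) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(human_sorted) - 1)
    frac = pos - lo
    return human_sorted[lo] * (1 - frac) + human_sorted[hi] * frac


# Hypothetical samples: the judge clusters at the top of a 1-5 scale, humans do not.
judge = [4, 4, 5, 5, 5]
human = [1, 2, 3, 4, 5]
print(quantile_map(4, judge, human))  # a judge "4" lands well below 4 on the human scale
```

The mapping is monotone, so it preserves the judge's ranking of responses while removing the compression of its score distribution.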

2605.29788 2026-05-29 cs.AI cs.LG

Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

嵌套因果赌博机的认证策略优化:基于PAC-Bayes风险

Tim Woydt, Paul-David Zuercher

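Affiliations * ProdAxon

AI summary Proposes the Nested Causal Thompson Sampling (NCTS) algorithm and a PAC-Bayes excess-risk bound that certifies deployment policies from historic data, off-policy and anytime, addressing causal coupling across timescales in hierarchical causal bandits.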

详情
AI中文摘要

关键序列决策很少是单时间尺度的:一个战略决策因果地塑造了每个后续战术选择所处的环境;标准赌博机和强化学习理论并未捕捉时间尺度之间的这种因果耦合。我们将问题类别形式化为嵌套上下文因果赌博机(NCCBs),这是一个分层SCM,其中每个层次的动作设置下一层次的上下文分布,并提出了嵌套因果汤普森采样(NCTS),该算法每轮抽取一个机制因子化的信念,并在其下递归地行动。我们的主要理论结果是一个因果PAC-Bayesian超额风险界,它仅从历史数据中认证任何候选部署策略,离线且任意时刻,回答了部署问题:我们能否在此处信任该智能体,风险如何?在分层SCM上的实验表明,相对于同一函数类上的匹配RFF-GP联合回归,因子化的SCM机制后验在外生分布偏移下零样本迁移显著更好,递归的元到内层提交在分布上显著优于联合提交替代方案,并且随着离线数据积累,认证显著收缩。结合这些结果,我们建立了渐进式认证交接,一种安全部署方法:每个时间尺度在收益可被认证时从传统控制器切换到NCTS,独立于其他时间尺度。

英文摘要

Critical sequential decisions are rarely single-timescale: a strategic decision causally shapes the context in which every subsequent tactical choice is made; standard bandit and reinforcement-learning theory does not capture this causal coupling between timescales. We formalise the problem class as Nested Contextual Causal Bandits (NCCBs), a hierarchical SCM where each level's action sets the next level's context distribution, and propose Nested Causal Thompson Sampling (NCTS), which draws one mechanism-factorised belief per episode and acts recursively under it. Our main theoretical result is a causal PAC-Bayesian excess-risk bound that certifies any candidate deployment policy from historic data alone, off-policy and anytime, answering the deployment question: can we trust this agent here, and at what risk? Experiments on a hierarchical SCM show that, against a matched RFF-GP joint regression on the same function class, the factorised SCM-mechanism posterior transfers significantly better zero-shot under exogenous distribution shifts, the recursive meta-to-inner commit significantly dominates the joint-commit alternative in distribution, and the certificate significantly contracts as offline data accumulates. Combining these results, we establish progressive certified handover, a safe-deployment method: each timescale flips from a legacy controller to NCTS when gains can be certified, independently of the others.
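For orientation only, the classical PAC-Bayes bound (McAllester's bound in Maurer's form) for losses in [0, 1] is reproduced below. It is a standard textbook result for n i.i.d. samples with n >= 8. It is not the causal, off-policy, anytime excess-risk bound of the paper, whose exact form the abstract does not give. Here \pi is a prior fixed before seeing the data, \rho is any posterior over hypotheses, L is the true risk, \hat{L}_n the empirical risk, and \delta the failure probability.

```latex
% With probability at least 1 - \delta over the sample, simultaneously for all posteriors \rho:
\[
  \mathbb{E}_{h \sim \rho}\bigl[L(h)\bigr]
  \;\le\;
  \mathbb{E}_{h \sim \rho}\bigl[\hat{L}_n(h)\bigr]
  + \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
\]
```

The second term shrinks as n grows, which matches the abstract's qualitative observation that the certificate contracts as offline data accumulates.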

2605.29786 2026-05-29 cs.AI

Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations

Croissant Tasks:一种用于可重复机器学习评估的元数据格式

Omar Benjelloun, Leonardo Martins Bianco, Isabelle Guyon, Thanh Gia Hieu Khuong, Jonathan Lebensold, Sebastian Lobentanzer, Luis Oala, Benedictus Kent Rachmat, Ihsan Ullah, Peyman Vahidi, Joaquin Vanschoren

Affiliations * Google DeepMind; ChaLearn; Université Paris-Saclay; Jetty; Mila, Quebec AI Institute; Inst. of Computational Biology, Helmholtz Munich; German Center for Diabetes Research; School of CIT, TUM; Helmholtz AI; Brickroad; Eindhoven Univ. of Technology

AI summary Proposes Croissant Tasks, a metadata format whose declarative specification decouples the task problem from its solution. Combined with an automated LLM pipeline, it enables conceptual reproducibility, letting autonomous agents generate reproducible evaluation pipelines from scratch.

Comments 10 pages, 4 figures

详情
AI中文摘要

可重复性是科学方法的基础,但在机器学习中仍然是一个关键挑战。导致这一问题的因素包括不明确的执行细节和脆弱的软件环境。以人为中心的补救措施(如检查清单和手动验证)有所帮助,但需要大量精力且难以扩展。为了解决这个问题,我们引入了Croissant Tasks:一种声明式的、机器可操作的元数据格式,将低层实现细节抽象为高层规范。这种格式实现了概念可重复性:通过独立的、由代理生成的实现来验证声明,而不是脆弱的源代码复制。我们贡献了:(1) Croissant Tasks规范,正式将任务问题与解决方案解耦;(2) 一个自动化的LLM流水线,将现有基准测试改造为此格式;(3) 实证验证表明,自主代理可以摄取这些规范,从零开始生成功能准确的可重复流水线。我们设想这种格式将成为机器学习中自动化和概念可重复性的新基础。

英文摘要

Reproducibility is fundamental to the scientific method, yet remains a critical challenge in machine learning. Contributing factors include underspecified execution details and brittle software environments. Human-centric remedies, such as checklists and manual verification, help but require intensive effort and fail to scale. To address this, we introduce Croissant Tasks: a declarative, machine-actionable metadata format that abstracts low-level implementation details into high-level specifications. This format enables conceptual reproducibility: verifying claims via independent, agent-generated implementations rather than brittle source code replication. We contribute: (1) the Croissant Tasks specification, formally decoupling task problem from solution; (2) an automated LLM pipeline that retrofits existing benchmarks into this format; and (3) empirical validation showing autonomous agents can ingest these specifications to generate functional, accurate reproduction pipelines from scratch. We envision this format as a new foundation for automated and conceptual reproducibility in machine learning.

2605.29782 2026-05-29 cs.LG cs.AI cs.CL

Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

Hista 和 Numca:为 LLM 强化学习有效估计状态值

Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang, Yongqiang Chen, Zhitang Chen, James Cheng

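Affiliations * Department of Computer Science and Engineering, The Chinese University of Hong Kong; Huawei Technologies Ltd

AI summary To address inaccurate state value estimation in RL for LLMs, proposes Numca (numerical spans as gradable milestones) and Hista (hidden states used to compute a weighted average over disjoint rollouts and their returns), which markedly improve estimation accuracy and training performance.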

Comments Accepted at ICML 2026

详情
AI中文摘要

强化学习(RL)通过奖励信号直接优化模型行为来改进大型语言模型(LLMs)。虽然在经典RL中准确的状态值估计对于稳定训练至关重要,但在LLM后训练中这仍是一个未被充分探索的挑战。在这项工作中,我们引入了状态值估计基准(SVEB)来评估现有RL框架中的状态估计,并展示了像PPO这样的标准方法中的评论家会退化为粗糙的组平均基线。为了解决这个问题,我们提出了两种技术:Numca,它利用数值跨度作为可分级里程碑进行状态值估计;以及Hista,一个使用LLM的隐藏状态作为表示来加权平均不连续轨迹及其回报的框架。大量实验表明,这两种方法都能产生更准确的状态值估计,并在不同的RL算法和模型大小上提升训练性能,而不会产生显著的计算开销。

英文摘要

Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In this work, we introduce the State Value Estimation Benchmark (SVEB) to assess state value estimation within existing RL frameworks and show that critics in standard approaches like PPO collapse to a coarse group-average baseline. To address this, we propose two techniques: Numca, which leverages numerical spans as gradable milestones for state value estimation, and Hista, a framework that uses the LLM's hidden states as representations to compute a weighted average over disjoint rollouts and their returns. Extensive experiments demonstrate that both methods yield more accurate state value estimates and enhance training performance across different RL algorithms and model sizes without incurring significant computational overhead.
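The contrast between a group-average baseline and a representation-weighted value estimate can be sketched as follows. The weighting scheme (softmax over cosine similarity with a temperature) and all names are assumptions chosen for illustration. The abstract does not specify how Hista computes its weights.

```python
from math import exp, sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def weighted_value(query_state, rollout_states, rollout_returns, temperature=0.1):
    """Estimate V(query) as a softmax-weighted average of rollout returns.

    Assumption: weights are a softmax over cosine similarity between
    hidden-state vectors; this is a hypothetical stand-in, not the paper's rule.
    """
    sims = [cosine(query_state, s) / temperature for s in rollout_states]
    peak = max(sims)  # subtract the max for numerical stability
    weights = [exp(s - peak) for s in sims]
    total = sum(weights)
    return sum(w * g for w, g in zip(weights, rollout_returns)) / total


def group_average(rollout_returns):
    """Coarse baseline: every state receives the same mean return."""
    return sum(rollout_returns) / len(rollout_returns)


# Toy 2-d "hidden states" for two rollouts with returns 1.0 and 0.0.
states = [[1.0, 0.0], [0.0, 1.0]]
returns = [1.0, 0.0]
print(group_average(returns))                        # 0.5
print(weighted_value([1.0, 0.1], states, returns))   # close to the first rollout's return
```

The group average assigns 0.5 to every state, while the weighted estimate moves toward the return of the rollout whose representation the query state resembles.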

2605.29776 2026-05-29 cs.CV

Improving CLIP Adaptation by Breaking Tail Alignment for Source-Free Cross-Domain Few-Shot Learning

通过打破尾部对齐改进CLIP适应:用于源无关跨域小样本学习

Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li

Affiliations * School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China; Institute of Artificial Intelligence, Huazhong University of Science and Technology, Wuhan, China

AI summary To address CLIP's performance drop in cross-domain few-shot learning, proposes Adaptive Tail-Head Alignment (ATHA), which selectively weakens the alignment of low-similarity image tokens to reduce overfitting and achieves the best results on four benchmarks.

Comments Accepted by ICML 2026

详情
AI中文摘要

视觉语言模型(如CLIP)展现出强大的零样本泛化能力,但在目标域训练数据稀缺的跨域场景(跨域小样本学习,CDFSL)中性能显著下降。本文聚焦于基于CLIP的CDFSL任务中的目标域小样本微调。现有的微调范式将所有图像块令牌与其对应的文本嵌入统一对齐。然而,我们发现一个反直觉的现象:主动将某些低相似度图像令牌(称为“尾部令牌”)推离其文本嵌入能持续提升目标域性能。我们深入探究这一现象并给出新的解释:在巨大的域偏移和稀缺的训练数据下,模型难以从视觉输入中提取语义信息;因此,常见的对齐信念仅对已包含足够语义信息的令牌有效;对于尾部令牌,强制对齐会导致对稀缺训练的过度过拟合,而打破对齐则更有用。受此启发,我们提出自适应尾头对齐(ATHA),一种新颖的CLIP微调策略,将传统的统一对齐范式转变为自适应对齐范式,同时包含对齐增强和削弱。在四个具有挑战性的CDFSL基准上的大量实验验证了我们的最先进性能。我们的代码可在 https://github.com/shuaiyi308/ATHA 获取。

英文摘要

Vision-Language Models (VLMs) such as CLIP demonstrate strong zero-shot generalization, but their performance significantly degrades in cross-domain scenarios with scarce target-domain training data (Cross-Domain Few-Shot Learning, CDFSL). In this paper, we focus on target-domain few-shot finetuning in the CLIP-based CDFSL task. Prevailing finetuning paradigms uniformly align all image patch tokens with their corresponding textual embeddings. However, we find a counterintuitive phenomenon: actively pushing away certain low-similarity image tokens, termed "tail tokens", from their textual embeddings consistently improves target-domain performance. We delve into this phenomenon and provide a novel interpretation: under large domain shifts and scarce training data, the model can hardly extract semantic information from visual inputs; therefore, the common belief in alignment is valid only for tokens already containing sufficient semantic information; for tail tokens, forcing the alignment would lead to excessive overfitting to the scarce training data, while breaking the alignment is more useful. Motivated by this, we propose Adaptive Tail-Head Alignment (ATHA), a novel fine-tuning strategy for CLIP that transforms the conventional uniform alignment paradigm into an adaptive alignment paradigm, with both alignment strengthening and weakening. Extensive experiments on four challenging CDFSL benchmarks validate our state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/ATHA.

2605.29773 2026-05-29 cs.CV cs.AI cs.RO

Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation

能量感知NECO:用于语义分割中单次逐像素分布外检测

Boyuan Zhang, Huanshan Huang, Yifei Cao

Affiliations * Ecole Polytechnique, Institut Polytechnique de Paris; CIAD, UTBM, Université Marie et Louis Pasteur; U2IS, ENSTA, Institut Polytechnique de Paris

AI summary Proposes a hybrid method that combines a NECO geometric ratio with an Energy score for pixel-wise out-of-distribution detection in a single forward pass, reaching an AUROC of 0.8539 on the miniMUAD dataset and outperforming NECO or Energy alone.

Comments 7 pages, 6 figures. Accepted at the ICRA 2026 Workshop on Long-term Deployments in the Wild (LoWi 2026)

详情
AI中文摘要

移动机器人的可靠语义分割需要准确的密集预测和分布偏移下的鲁棒不确定性估计。强不确定性基线如蒙特卡洛Dropout通常需要重复的随机前向传播,难以在边缘平台上部署。我们提出能量感知NECO,一种用于语义分割的单次逐像素分布外(OOD)检测器。该方法将从解码器特征计算的居中NECO风格几何比率与基于logit的能量分数相结合。两个分量均使用在纯分布内验证集上拟合的统计量进行标准化,并通过凸组合融合。我们在miniMUAD子集上使用真实像素级OOD标签评估该方法。所提出的混合分数达到0.8539的AUROC,优于仅NECO(0.8280)、仅能量(0.8171)和集成预测熵基线(0.8124)。额外的定性和操作点分析表明,混合检测器在保持单次设计效率优势的同时,提高了整体排名性能。代码可在https://github.com/boyuan-zhangx/Energy-Aware_NECO获取。

英文摘要

Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy-Aware NECO, a single-pass pixel-wise out-of-distribution (OOD) detector for semantic segmentation. The method combines a centered NECO-style geometric ratio computed from decoder features with a logit-based Energy score. Both components are standardized using statistics fitted on a pure in-distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel-level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO-only (0.8280), Energy-only (0.8171), and an ensemble predictive-entropy baseline (0.8124). Additional qualitative and operating-point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single-pass design. Code is available at https://github.com/boyuan-zhangx/Energy-Aware_NECO
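The fusion step described above (standardize both scores with in-distribution statistics, then take a convex combination) can be sketched for a single pixel. The Energy score uses the standard free-energy definition, -T * logsumexp(logits / T). The NECO-style ratio is treated as a precomputed input, and its orientation, the mixing weight `alpha`, and the statistics are placeholder assumptions rather than values from the paper.

```python
from math import exp, log


def energy_score(logits, temperature=1.0):
    """Free energy -T * logsumexp(logits / T); larger values suggest OOD."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    return -temperature * (peak + log(sum(exp(z - peak) for z in scaled)))


def standardize(value, mean, std):
    """z-score using statistics fitted on an in-distribution validation split."""
    return (value - mean) / std


def fused_score(neco_value, logits, neco_stats, energy_stats, alpha=0.5):
    """Convex combination of standardized NECO-style and Energy scores.

    Assumptions: `neco_value` is a precomputed per-pixel ratio already oriented
    so that larger means more OOD; `alpha` and both (mean, std) pairs are
    hypothetical placeholders.
    """
    z_neco = standardize(neco_value, *neco_stats)
    z_energy = standardize(energy_score(logits), *energy_stats)
    return alpha * z_neco + (1.0 - alpha) * z_energy


# Toy pixel: confident logits give lower energy than flat logits.
print(energy_score([10.0, 0.0]) < energy_score([0.0, 0.0]))  # True
print(fused_score(2.0, [0.0, 0.0], (1.0, 0.5), (-5.0, 2.0), alpha=0.5))
```

Standardizing before mixing keeps either component from dominating simply because of its raw scale, and both terms come from the same forward pass.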

2605.29771 2026-05-29 cs.RO

Joint Angle Estimation with Customized Wristband Based on Online Incremental Learning

基于在线增量学习的定制腕带关节角度估计

Shuo Wang, Xiaobin Chen, Xiaoming Tao

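Affiliations * Research Institute for Intelligent Wearable Systems, The Hong Kong Polytechnic University

AI summary Proposes a customized wristband system based on online incremental learning that estimates wrist joint angle in two stages (online model updating, then model-based estimation), adapting to data drift across wearing configurations with an error of about 15 degrees.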

详情
AI中文摘要

智能可穿戴技术在人机交互、运动和健康监测中扮演着越来越重要的角色。为了确保使用的舒适性和实用性,运动监测的一种常见形式是利用软体可穿戴传感器。然而,许多关于可穿戴传感器的研究应用过于简单,难以适应不同情况。本研究提出了一种基于在线增量学习方法的定制腕带系统,用于估计腕关节角度。这是一种两阶段估计方法:第一阶段根据佩戴者的手腕运动特征,利用在线学习更新模型,并集成来自IMU的实时数据作为真实值。第二阶段仅使用腕带利用更新后的模型进行腕关节角度估计。换句话说,模型训练在数据采集过程中完成,使得训练好的模型可用于后续的角度估计。该方法在适应由不同测试配置引起的数据漂移方面具有优势,例如同一受试者的左右手腕、同一手腕上佩戴位置的偏差,甚至不同受试者之间的差异。结果表明,传感器在应变变化下表现出良好的性能,所提系统在不同场景下的腕关节轨迹估计误差约为15度。

英文摘要

Intelligent wearable technology plays an increasingly important role in human-computer interaction, motion, and health monitoring. To ensure comfort and practicality of use, one common form for motion monitoring is to utilize soft wearable sensors. However, many research applications regarding wearable sensors are simplistic and difficult to adapt to different situations. This study proposes a system for estimating the angle of the wrist joint using a customized wristband based on an online incremental learning approach. It is a two-stage estimation method: the first stage updates the model based on the wearer's wrist movement characteristics using online learning, integrating real-time data from an IMU as ground truth. The second stage utilizes the updated model to estimate the wrist joint angle with the wristband alone. In other words, model training is completed during data acquisition, allowing the trained model to be used for subsequent angle estimation. This method offers advantages in adapting to data drift caused by variations in testing configurations, such as the left and right wrists of the same subject, deviations in the wearing position on the same wrist, and even differences among subjects. The results indicate that the sensors exhibit good performance under strain variations, and the wrist joint trajectory estimation of the proposed system has an error of approximately 15 degrees in different scenarios.

2605.29768 2026-05-29 cs.AI

From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks

从XXLTraffic到EvoXXLTraffic:将交通预测扩展到传感器演化的网络

Du Yin, Hao Xue, Arian Prabowo, Shuang Ao, Flora Salim

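Affiliations * University of New South Wales

AI summary Because existing traffic forecasting benchmarks assume a fixed sensor set, the paper introduces the XXLTraffic dataset family, spanning up to 27 years, and its sensor-evolving version EvoXXLTraffic, defines a yearly streaming forecasting protocol, and benchmarks a range of baselines. It finds that the ultra-large evolving dataset is closer to reality and that many existing SOTA results no longer hold.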

Comments Under Review

详情
AI中文摘要

现有的交通预测基准假设固定的传感器集,但实际道路传感器网络随着道路网络逐年变化而持续增长。我们引入了XXLTraffic数据集系列,涵盖长达27年的加州PeMS和新南威尔士州交通数据。XXLTraffic的固定传感器子集支持多年间隔的极长周期预测以及标准的每小时/每日长时预测。我们将其扩展为EvoXXLTraffic,这是一个传感器演化的重组版本,暴露了每年活跃的传感器、年度交通流矩阵以及九个PeMS区域的年度图快照,增长率从+305%到超过+10,000%。我们在EvoXXLTraffic上定义了一个年度流式预测协议,其中每个日历年是一个持续任务,并评估了来自静态时空GNN、朴素在线方案、演化图持续方法以及检索/测试时方法的各种代表性基线。我们发现,我们的超大规模演化数据集更好地反映了现实世界,许多最先进(SOTA)结果不再有效。我们的数据集通过支持在超长演化道路网络下更现实的预测,补充了现有的基准。

英文摘要

Existing traffic forecasting benchmarks assume a fixed sensor set, but real road-sensor networks grow continuously as the road network changes year by year. We introduce the XXLTraffic dataset family, which spans up to 27 years of California PeMS and Transport for NSW data. The fixed-sensor subsets of XXLTraffic support extremely long-range forecasting with multi-year gaps and standard hourly / daily long-horizon forecasting. We extend it to EvoXXLTraffic, a sensor-evolving reorganization that exposes per-year active sensors, yearly traffic-flow matrices, and yearly graph snapshots across nine PeMS districts, with growth ratios ranging from +305% to over +10,000%. We define a yearly streaming forecasting protocol on EvoXXLTraffic in which each calendar year is a continual task, and benchmark a wide range of representative baselines drawn from static spatio-temporal GNNs, naïve online schemes, evolving-graph continual methods, and retrieval / test-time methods. We find that our ultra-large evolutionary dataset better reflects the real world, and many state-of-the-art (SOTA) results no longer hold. Our dataset complements existing benchmarks by enabling more realistic forecasting under ultra-long evolutionary road networks.

2605.29766 2026-05-29 cs.RO

MARS Policy: Multimodality Only When It Matters

MARS策略:仅在必要时使用多模态

Jindou Jia, Tuo An, Yuxuan Hu, Gen Li, Jingliang Li, Bohan Hou, Xiangyu Chen, Jiaqi Bai, Bofan Lyu, Jianfei Yang

发表机构 * MARS Lab, Nanyang Technological University(MARS实验室,南洋理工大学)

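AI Summary Proposes the MARS policy, which adaptively introduces stochasticity only when needed and uses deterministic learning during single-modal phases, balancing the multimodal capability of generative policies against the efficiency of deterministic models and improving success rate and inference speed on simulated and real-world tasks.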

Comments 13 figures, 17 pages

Details
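AI Abstract (translated from Chinese)

Imitation learning has become a cornerstone for solving complex robotic manipulation tasks. Multimodality in particular lets robots capture diverse yet valid behavior patterns, and it has driven the rapid rise of generative policies as the dominant paradigm in robot learning. Achieving this multimodality, however, usually depends on random-noise initialization and iterative denoising, which leads to high training complexity and low inference efficiency. At the same time, not every phase of a robotic task inherently requires behavioral diversity. Motivated by this, we propose the Modality-Adaptive Robot Sampling (MARS) policy, which adaptively invokes tailored stochasticity only when it is truly beneficial and reverts to efficient deterministic learning during single-modal phases. In other words, the right amount of noise is injected only at the right time. By selectively activating multimodal generation, the MARS policy bridges the gap between the multimodal capability of generative policies and the superior training and inference efficiency of deterministic models. Empirical studies on 8 simulated and 4 real-world tasks show that MARS offers robust multimodal expressivity and high efficiency, improving the success rate by 16.67% and reducing inference latency by 83.20% in real-world tests. Counterintuitively, MARS also surpasses deterministic policies in training efficiency on near-deterministic tasks, because it models subtle action diversity more effectively.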

English Abstract

Imitation learning has become a cornerstone for solving complex robotic manipulation tasks. In particular, multimodality, which enables robots to capture diverse yet valid behavioral patterns, has driven the rapid emergence of generative policies as a dominant paradigm in robot learning. However, achieving such multimodality typically relies on stochastic noise initialization and iterative denoising procedures, resulting in substantial training complexity and low inference efficiency. Meanwhile, not all phases of a robotic task inherently require behavioral diversity. Motivated by this insight, we propose the Modality-Adaptive Robot Sampling (MARS) policy, which adaptively invokes tailored stochasticity only when it is truly beneficial, while reverting to an efficient deterministic learning during single-modal phases. In other words, the proper amount of noise is injected only at the proper time. By selectively activating multimodal generation, MARS policy bridges the gap between the multimodal capability of generative policies and the superior training and inference efficiency of deterministic models. Empirical studies across 8 simulated and 4 real-world tasks demonstrate that MARS exhibits robust multimodal expressivity and high efficiency, with a 16.67% success rate improvement and an 83.20% inference latency reduction in real-world tests. Counterintuitively, MARS also outpaces deterministic policies in training efficiency on near-deterministic tasks by more effectively modeling nuanced action diversity.