arXivDaily: arXiv daily academic digest, updated Monday through Friday

2605.27824 2026-05-28 cs.AI cs.CL

Revealing Algorithmic Deductive Circuits for Logical Reasoning

Phuong Minh Nguyen, Tien Huu Dang, Naoya Inoue

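AI summary: Using causal mediation analysis, this study localizes the attention heads responsible for logical reasoning steps in large language models. It finds that a small number of specialized heads handle factual and rule-based information, while higher-layer heads facilitate information integration and the emergence of global reasoning strategies.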

Abstract

Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning in few-shot learning settings. However, it remains unclear how LLMs genuinely understand the abstract meaning of each reasoning step and the overall algorithm from only a limited number of demonstrations. This work aims to localize the attention heads responsible for individual reasoning steps and characterize the types of information transferred among them. We first align constituent reasoning steps with their corresponding token logits under a symbolic-aided Chain-of-Thought (CoT) prompting framework. Our analysis shows that token positions that steer the reasoning process are associated with low confidence scores caused by constraints on satisfying reasoning behavior patterns in demonstrations. We then adopt causal mediation analysis techniques to identify the attention heads responsible for these patterns. In addition, our findings indicate that LLMs retrieve factual and rule-based information for individual sub-reasoning tasks through specialized attention heads (approximately 3% of all heads), whereas higher layers predominantly facilitate information integration and the emergence of global reasoning strategies (e.g., graph traversal algorithms) that coordinate multiple intermediate reasoning steps to solve the overall task.

2605.27820 2026-05-28 cs.AI

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai, Yuqi Qing, Weiqiang Wang, Jian Liu

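AI summary: Proposes EgoBench, the first interactive egocentric multimodal benchmark, which uses 1,045 egocentric-video tasks and a user-agent-tool interactive environment to jointly evaluate visual perception, tool-augmented multi-hop reasoning, and dynamic interaction. It reveals a performance ceiling for current state-of-the-art models (19.43% average accuracy).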

Comments 68 pages, 6 figures

Abstract

As AI agents increasingly operate in open, real-world environments, they require a deep synergy of multimodal perception, tool invocation with multi-hop reasoning, and dynamic interaction with users. However, existing benchmarks fail to jointly evaluate these capabilities due to challenges in designing strictly coupled multi-capability tasks, simulating natural and task-constrained user feedback, and ensuring objective evaluation of dynamic interaction. To bridge this gap, we introduce EgoBench, the first interactive multimodal benchmark for tool-using agents. EgoBench comprises 1,045 egocentric-video-grounded tasks covering four daily scenarios, along with a user-agent-tool interactive environment for evaluation. We implement a three-stage synergistic pipeline through which each task is designed to enforce the joint application of visual perception and tool-augmented multi-hop reasoning. We additionally develop a multi-agent simulated user within EgoBench to evaluate agents' interaction capabilities, which generates high-fidelity, task-aligned responses to agents. Furthermore, we establish a deterministic joint validation framework that guarantees objective assessment through process-based and result-based equivalence. Benchmarking eight SOTA video-MLLM agents on EgoBench reveals a severe performance ceiling: the best model achieves only 30.62% accuracy in the best-performing scenario, averaging 19.43% across all four scenarios. Finally, we conduct a multi-dimensional error analysis to disentangle failure modes, exposing capability bottlenecks for advancing future AI agents.

2605.27819 2026-05-28 cs.LG cs.AI

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

Prathyush Poduval, Calvin Yeung, Neel Desai, Mohsen Imani

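AI summary: To address the redundancy and interaction problems that cross-layer coupling causes for multi-layer sparse autoencoders (SAEs) in transformers, proposes Residualized Sparse Autoencoders (ReSAEs), which fit affine maps between layers and train SAEs on the residuals, reducing decoder redundancy and improving cross-entropy recovery under multi-layer replacement.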

Abstract

Sparse autoencoders are usually trained one layer at a time, even though transformer residual stream activations are strongly coupled across depth. This creates a practical problem for multi-layer interventions: different layerwise dictionaries can spend capacity representing the same carried-forward information, and replacing several layers at once can produce interactions that are not predicted by single-layer behavior. We introduce Residualized Sparse Autoencoders (ReSAEs), which fit an affine map between selected layers and train each later-layer SAE on the unexplained residual rather than on the full activation. Reconstructions are mapped back into the original activation space through the fitted affine chain, so ReSAEs can be evaluated with the same intervention protocols as ordinary SAEs. On Pythia-1.4B and Gemma-2-9B, residualization reduces decoder redundancy and improves sparse probing and targeted perturbation in most tested settings. Despite reconstructing less of the raw activation variance, ReSAEs recover more transformer cross entropy under multi-layer replacement. This gain is clearest under teacher-forcing and at sufficient sparsity online, indicating that ReSAEs preserve the components of the activation most relevant to the model's downstream computation. These results suggest that removing linearly predictable cross-layer structure is a useful default for multi-layer SAE interventions.
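
The residualization step described above can be sketched in a few lines. This is a minimal illustration, assuming an ordinary least-squares affine fit between two layers on synthetic activations; the function names (`fit_affine`, `residualize`), the synthetic data, and the use of NumPy are assumptions made for illustration and are not taken from the paper. SAE training itself is omitted.

```python
import numpy as np

def fit_affine(x, y):
    """Least-squares affine map so that y is approximately x @ W + b."""
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(x_aug, y, rcond=None)
    return coef[:-1], coef[-1]

def residualize(x, y):
    """Return the part of y that the affine map from x does not explain."""
    W, b = fit_affine(x, y)
    return y - (x @ W + b), (W, b)

# Synthetic stand-ins for activations at an earlier and a later layer.
rng = np.random.default_rng(0)
early = rng.normal(size=(512, 8))
mixing = rng.normal(size=(8, 8))
late = early @ mixing + 1.0 + 0.1 * rng.normal(size=(512, 8))

residual, (W, b) = residualize(early, late)

# An SAE would be trained on `residual`; its reconstruction is mapped back
# to activation space by adding the affine prediction again.
reconstructed = early @ W + b + residual
```

In this toy setup most of the later-layer variance is linearly predictable from the earlier layer, so the residual that a later-layer SAE would have to model is much smaller than the raw activation.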

2605.27817 2026-05-28 cs.RO cs.AI cs.CV cs.LG

Turning Video Models into Generalist Robot Policies

Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao, Tao Pang, Max Simchowitz, Vincent Sitzmann

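AI summary: Proposes VERA, a decoupled video-to-action policy that uses an action-free video world model and an inverse dynamics model based on the robot Jacobian to achieve zero-shot, cross-embodiment robot control.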

Comments project page: https://vera.csail.mit.edu

Abstract

Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action-labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as-is while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment-agnostic, different video models can be interchanged easily without re-training the IDM, and the IDM can be independently trained with readily available self-play data. We present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully-designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we coin the Video-to-Embodied Robot Action Model (VERA), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs. Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control. More results are available on our project website: https://vera.csail.mit.edu.

2605.27816 2026-05-28 cs.CV

Pattern Recognition Tasks with Personalized Federated Learning

Md. Arifur Rahman, Isha Das, Mushfiqur Rahman Abir, B. M. Taslimul Haque, Abdullah Al Noman, Abir Ahmed, Md. Jakir Hossen

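AI summary: Compares seven personalized federated learning algorithms on the MNIST, SignMNIST, and Digit5 datasets and finds that APPLE, FedGC, and FedProto perform strongly on accuracy, precision, recall, and F1 score.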

Comments Comprehensive comparative analysis of 7 Personalized Federated Learning algorithms across MNIST, SignMNIST, and Digit5 datasets. The paper presents detailed methodology, workflow architecture, experimental evaluation, and privacy-preserving AI analysis for distributed intelligent systems, secure collaborative learning, and critical infrastructure applications

Journal ref
Emerging Science Journal 10(2):974-990 (2026)

Abstract

Personalized Federated Learning (PFL) constitutes a novel paradigm that tailors Machine Learning (ML) models to individual clients, thereby furnishing personalized model updates whilst upholding stringent data privacy principles. Diverging from conventional standard Federated Learning (FL) approaches, PFL adapts models to distinct client data distributions, engendering heightened levels of accuracy, customization, and data security, all while minimizing communication overhead. This methodology proves particularly salient in contexts marked by pattern recognition tasks reliant upon heterogeneous data sources and underpinned by paramount privacy apprehensions. In the present research endeavor, this article undertakes a comprehensive comparative analysis of seven distinct PFL algorithms deployed across three diverse datasets, namely MNIST, SignMNIST, and Digit5. The overarching objective entails ascertaining the preeminent PFL algorithm, within the framework of pattern recognition tasks, through a rigorous evaluation anchored in metrics encompassing Accuracy, Precision, Recall, and F1 Score. Concurrently, an in-depth scrutiny of these PFL algorithms is conducted, elucidating their operative workflows, advantages, and limitations. Through empirical investigation, the findings evince that APPLE, FedGC, and FedProto emerge as stalwart contenders, consistently furnishing superior performance across the spectrum of assessed datasets, while acknowledging the contextual specificity of alternative algorithms and the potential for iterative refinement to realize optimal outcomes.

2605.27813 2026-05-28 cs.CV cs.AI cs.LG

Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models

Calvin Yeung, Prathyush Poduval, Ali Zakeri, Zhuowen Zou, Mohsen Imani

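AI summary: Proposes residualized temporal sparse autoencoders, which learn interpretable features in diffusion activation trajectories from the residuals of linear predictions between denoising timesteps, and validates them on Stable Diffusion 1.5.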

Abstract

Text-to-image diffusion models generate images through an iterative denoising process, so internal neural layers produce trajectories of activations rather than single static representations. Sparse autoencoders (SAEs) have recently been used to decompose diffusion activations into interpretable feature directions, but most approaches analyze activations at individual timesteps or condition on time rather than learning directly from full activation trajectories. In this work, we introduce residualized temporal SAEs for diffusion activation trajectories. We collect activations across denoising time, fit linear predictors between neighboring timesteps, and represent each trajectory using an initial activation together with residual components not explained by these linear dynamics. Training an SAE on this residualized representation encourages sparse latents to capture structure beyond what is linearly predictable. The residualized decoder directions can be mapped back into activation space, allowing each latent to be analyzed as a feature trajectory over denoising time. Through reconstruction and ablation studies, spatiotemporal feature analysis, and qualitative steering experiments on Stable Diffusion 1.5, we show that residualized temporal SAEs provide a useful framework for studying temporally structured diffusion activations.
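
The trajectory representation described above (an initial activation plus per-step residuals) and the mapping back into activation space can be sketched as follows. This is a rough illustration under stated assumptions: the linear predictors are plain least-squares fits without a bias term, the trajectories are synthetic, and the names `residualize_trajectory` and `reconstruct` are hypothetical rather than the authors' code.

```python
import numpy as np

def residualize_trajectory(traj):
    """traj has shape (N, T, D): N samples, T denoising steps, D features.

    Returns the initial activation, the per-step residuals (N, T-1, D),
    and the fitted linear maps between neighboring timesteps.
    """
    maps, residuals = [], []
    for k in range(traj.shape[1] - 1):
        M, *_ = np.linalg.lstsq(traj[:, k], traj[:, k + 1], rcond=None)
        maps.append(M)
        residuals.append(traj[:, k + 1] - traj[:, k] @ M)
    return traj[:, 0], np.stack(residuals, axis=1), maps

def reconstruct(initial, residuals, maps):
    """Roll the linear chain forward, adding each residual back in."""
    steps = [initial]
    for k, M in enumerate(maps):
        steps.append(steps[-1] @ M + residuals[:, k])
    return np.stack(steps, axis=1)

# Synthetic trajectories with mostly linear step-to-step dynamics.
rng = np.random.default_rng(0)
n, t, d = 200, 5, 4
traj = np.empty((n, t, d))
traj[:, 0] = rng.normal(size=(n, d))
for k in range(t - 1):
    traj[:, k + 1] = 0.9 * traj[:, k] + 0.05 * rng.normal(size=(n, d))

initial, residuals, maps = residualize_trajectory(traj)
rebuilt = reconstruct(initial, residuals, maps)
```

Because the chain is invertible by construction, anything learned on the residualized representation can be carried back to the original activation trajectory.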

2605.27811 2026-05-28 cs.AI

Constrained Auto-Bidding via Generative Response Modeling

Eunseok Yang, Xingdong Zuo, Kyung-Min Kim

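AI summary: Proposes the Generative Response Model (GRM), which predicts future traffic and aggregate cost/value curves and pairs them with a lightweight analytic controller to achieve stable, efficient auto-bidding under budget and ratio constraints.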

Abstract

Auto-bidding systems aim to maximize advertiser value over long horizons under budget constraints and ratio targets such as cost-per-acquisition, yet future traffic and auction dynamics are non-stationary and uncertain. Existing approaches face distinct limitations: control-based pacing reacts to deviations but cannot anticipate future conditions, while RL and generative methods fold constraints into reward signals, obscuring violations and degrading under distribution shift. We shift the learning target from actions to responses with the Generative Response Model (GRM), a history-conditioned sequence model that jointly predicts future traffic volume and horizon-aggregate cost/value curves as functions of a single bid multiplier. We show that under mild monotonicity conditions, the optimality gap relative to full per-tick control is bounded by the dispersion of per-tick marginal value-per-cost. Given predicted responses, a lightweight analytic controller enforces each active constraint via a 1D root-finding step. We prove this controller is exact for the single-multiplier problem and bound constraint violations under receding-horizon replanning in terms of prediction error. Experiments on AuctionNet show that GRM improves constraint stability and overall score compared to existing baselines.
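
The 1D root-finding step mentioned above can be illustrated with a simple bisection. This sketch assumes the predicted cost curve is non-decreasing in the bid multiplier and handles only a budget constraint; the function `solve_multiplier`, the search interval, and the quadratic toy curve are hypothetical and are not the paper's controller.

```python
from typing import Callable

def solve_multiplier(cost: Callable[[float], float], budget: float,
                     lo: float = 0.0, hi: float = 10.0,
                     tol: float = 1e-9) -> float:
    """Largest multiplier in [lo, hi] whose predicted cost fits the budget.

    Assumes `cost` is non-decreasing in the multiplier.
    """
    if cost(hi) <= budget:   # constraint inactive: bid as high as allowed
        return hi
    if cost(lo) > budget:    # infeasible even at the lowest multiplier
        return lo
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cost(mid) <= budget:
            lo = mid         # still within budget, move up
        else:
            hi = mid         # over budget, move down
    return lo

def predicted_cost(multiplier: float) -> float:
    """Toy horizon-aggregate cost curve (an assumption, not model output)."""
    return 100.0 * multiplier ** 2

m_star = solve_multiplier(predicted_cost, budget=400.0)
```

A ratio target such as cost-per-acquisition could presumably be handled with the same search applied to a function like `cost(m) - target * value(m)`, provided that function is also monotone; that extension is an assumption here.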

2605.27808 2026-05-28 cs.CL cs.MM

TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition

Xinyu Wang, Ziyu Zhao, Ke Bai, Silin Meng, Dongming Shen, Xiao-Wen Chang, Yixuan HE

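AI summary: Proposes TARQ, a label-free post-training quantization framework that uses a tail-aware reconstruction loss and a rare-word balancing rule to significantly reduce rare-word error rate without additional training.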

Abstract

Data-aware post-training quantization (PTQ) minimizes a per-token reconstruction loss on a small calibration corpus, implicitly weighting positions by their empirical frequency. For Automatic Speech Recognition (ASR), this misaligns with tail-sensitive risk: names, numerals, and domain-specific words receive proportionally little calibration mass. We propose Tail-Aware Reconstruction Quantization (TARQ), a label-free PTQ framework that shifts calibration toward the lexical tail via rareBAL, a closed-form per-Linear-layer rule equalizing common/tail mass, paired with a metric-consistent residual correction. TARQ requires no entity labels, no curated calibration set, no validation decoding, and no additional training. Across eight ASR backbones and six datasets at W4G128, TARQ improves mean rare-Word Error Rate (rare-WER) without an aggregate-WER regression, achieves the lowest cross-corpus rare-WER swing among compared methods, and transfers to entity-rich benchmarks (ProfASR, ContextASR-Speech-En) without entity supervision.

2605.27805 2026-05-28 cs.CL cs.AI

ChildEval: When large language models meet children's personalities

Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai, Yaxing Zhang, Qian Hu, Lijun Mei, Junlan Feng

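AI summary: Proposes the ChildEval benchmark, which synthesizes persona profiles and preferences (expressed explicitly or implicitly) for children aged 3-6 to evaluate how well large language models infer and follow children's preferences in long conversations. Experiments show that fine-tuning can improve child-centered performance.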

Comments 8 pages of main text (ACL Findings format), with references and appendix

Abstract

While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs' ability to infer and follow child-centered preferences in long-context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3-6, providing relatively static background information. Each persona is associated with a child preference (which may align with, conflict with, or be independent of the persona), expressed either explicitly in a single sentence or implicitly through 6-10 turn dialogues. Explicit and implicit preferences are designed to reflect the same underlying preference but differ in expression, capturing dynamic aspects of preference expression rather than changes in the static persona. The benchmark spans five top-level and fourteen sub-level categories covering children's daily lives and development. We further propose fine-grained, child-centric evaluation protocols to systematically assess open-source LLMs. Experimental results demonstrate how different personalized representations affect LLM responses and suggest that finetuning on ChildEval can enhance child-centered performance. Our code and dataset are available at https://github.com/ziyanluo/ChildEval.

2605.27800 2026-05-28 cs.CV

CuriosAI Submission to the CASTLE Challenge at EgoVis 2026

Yuto Kanda, Hayato Tanoue, Takayuki Hori

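AI summary: For 185 multiple-choice questions over 600+ hours of multi-view egocentric video, proposes two approaches: SVA (Search-Verify-Answer), a three-stage pipeline, and TMKG (Temporal Multimodal Knowledge Graph). SVA reaches 0.50 accuracy and is the final submission.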

Comments The 4th place solution for the CASTLE Challenge at the CVPR EgoVis Workshop 2026

Abstract

CASTLE 2026 asks 185 multiple-choice questions over 600+ hours of synchronized multi-view egocentric video. We explore two approaches on top of a shared multimodal preprocessing layer, including per-person timelines, speaker-resolved transcripts, and multi-VLM caption ensembles. Approach A, SVA: Search-Verify-Answer, is a three-stage pipeline that hierarchically narrows to a primary window, verifies sub-windows with a VLM under four anti-confabulation rules, and fuses evidence with an LLM judge under an evidence-priority hierarchy. Approach B, TMKG: Temporal-Multimodal-Knowledge-Graph, is the contrast: it builds a temporal multimodal knowledge graph, locates a primary cell via graph search, and produces the final answer with a single grounded VLM. SVA reaches a leaderboard accuracy of 0.50 and is our final challenge submission; TMKG reaches 0.35.

2605.27799 2026-05-28 cs.AI eess.SP

GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease

Leo Y. Li-Han, Ellen L. Larson, Elizabeth B. Habermann, Cornelius A. Thiels, Hojjat Salehinejad

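AI summary: Proposes GraD-IBD, a graph diagnosis model that reformulates longitudinal ICD trajectories as temporally directed graphs and designs a context-aware, time-decay message passing mechanism to reduce complexity and improve inflammatory bowel disease detection.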

Abstract

International Classification of Diseases (ICD) is a globally recognized coding system that records diagnostic events during each patient encounter, providing a standardized data foundation for various clinical tasks. However, the irregular and hierarchical nature of ICD code sequences poses challenges for N-D lattice-based sequential modeling methods, leading to overly complex model designs. In this paper, we propose GraD-IBD, a graph diagnosis model that reformulates longitudinal ICD trajectories as visit-bucketized, temporally directed graphs to detect the risk of inflammatory bowel disease (IBD). A novel context-aware, time-decay message passing mechanism was developed to capture temporal dependencies while reducing model complexity. The experimental results using a real-world clinical dataset demonstrated consistent and robust improvements in IBD detection over state-of-the-art methods, with significant reductions in computational complexity compared to sequential models. These findings highlight the potential of graph representation learning to enable efficient, scalable, and accurate disease risk prediction from longitudinal ICD diagnosis codes.
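
To make the idea of time-decay message passing over a temporally directed visit graph more concrete, here is a toy sketch. The abstract does not specify the mechanism, so everything below is an assumption: the exponential decay form, the `rate` parameter, the weighted-average aggregation, and the function name `time_decay_aggregate`. The context-aware part of the authors' mechanism is not modeled.

```python
import numpy as np

def time_decay_aggregate(features, edges, times, rate=0.1):
    """One round of message passing where older visits contribute less.

    features: (V, D) float array of visit embeddings
    edges:    list of (src, dst) pairs with times[src] <= times[dst]
    times:    visit timestamps, e.g. in days
    """
    out = features.copy()
    norm = np.ones(len(features))          # each node keeps its own feature
    for src, dst in edges:
        weight = np.exp(-rate * (times[dst] - times[src]))
        out[dst] += weight * features[src]
        norm[dst] += weight
    return out / norm[:, None]             # weighted average per node

# Three visits on days 0, 10, and 11; both earlier visits point to the last.
features = np.eye(3)
times = [0.0, 10.0, 11.0]
edges = [(0, 2), (1, 2)]
updated = time_decay_aggregate(features, edges, times)
```

In this example the visit one day before the target contributes more to the target's updated embedding than the visit eleven days before it.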

2605.27790 2026-05-28 cs.LG

SYNAPSE: Neuro-Symbolic Visual Thought-to-Text Decoding via Topological Semantic Denoising

Akshaj Murhekar, Abhijit Mishra

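AI summary: Proposes the SYNAPSE framework, which uses commonsense graph structure and latent exemplars for inference-time symbolic regularization, stabilizing semantic generation in EEG-to-text decoding without fine-tuning the large language model.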

Abstract

Recent advances in large language models have accelerated open-vocabulary EEG-to-imagined-text decoding, where non-invasive neural activity recorded during visual perception is translated into coherent natural language descriptions of viewed stimuli. However, existing systems remain highly vulnerable to biological noise, where corrupted neural projections induce hallucinated or semantically unstable generation in frozen language models. We introduce SYNAPSE (Symbolic Neural Alignment for Precise Semantic Extraction), a lightweight neuro-symbolic framework that stabilizes neural text generation through inference-time symbolic regularization. By purifying EEG-derived semantic candidates using commonsense graph structure and latent exemplars, SYNAPSE improves semantic stability without end-to-end LLM fine-tuning. Experiments across popular EEG decoding benchmarks and multiple frozen LLM backends demonstrate consistent gains over unconstrained prompting baselines, robustness under object-label ablation, and performance commensurate with substantially more resource-intensive fine-tuned systems, while preserving biometric privacy by localizing raw EEG processing entirely within the encoder stack.

2605.27789 2026-05-28 cs.AI cs.CL

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

Camilo Chacón Sartori, José H. García

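AI summary: To address statistical bias in evaluating multi-hop RAG systems, proposes a fixed-budget, cluster-aware standard for LLM-as-a-judge comparisons and stress-tests it with GADMEC, a genetic-algorithm evidence selector, on 400 multi-hop questions, showing that cluster-aware inference changes the empirical conclusions.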

Abstract

Retrieval-augmented generation (RAG) systems are often compared by asking a large language model (LLM) judge which answer is better. For multi-hop RAG, this has become a measurement problem as much as a modeling problem: the same score can reflect retrieval quality, answer length, lexical overlap, or a statistical test that ignores clustered data. We ask what happens when these choices are made explicit. We propose a minimum measurement standard for LLM-as-a-judge comparisons in RAG. The standard fixes the top-100 candidate pool, evidence budget, answer cap, generator, and prompt; it also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check when feasible, and second-judge replication. Clustered benchmarks can overstate progress; the field should adopt this standard. We stress-test it with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC), an evolutionary evidence selector, on 400 multi-hop questions in computer science/machine learning (CS/ML) and Materials Science. The protocol changes the empirical story. A binomial test makes all four semantic-baseline comparisons look significant; cluster-aware inference leaves only one Bonferroni-significant result. BM25 beats pure semantic GADMEC under the same budget, while a lexical-semantic hybrid recovers in CS/ML and narrows the Materials Science gap.
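
The contrast between a per-question binomial test and an exact cluster sign-flip test can be shown on a toy example. The sketch below assumes each question is scored +1 (win) or -1 (loss) and that questions from the same cluster are summed before signs are flipped; the data are invented for illustration, and the function names are hypothetical rather than the paper's protocol.

```python
from itertools import product
from math import comb

def cluster_sign_flip_pvalue(cluster_diffs):
    """Exact two-sided p-value: flip the sign of each cluster's total."""
    observed = abs(sum(cluster_diffs))
    hits = total = 0
    for signs in product((1, -1), repeat=len(cluster_diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, cluster_diffs)))
        hits += stat >= observed
        total += 1
    return hits / total

def binomial_two_sided_pvalue(wins, n):
    """Two-sided sign test that treats every question as independent."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented data: four clusters of ten questions each (+1 win, -1 loss).
clusters = [
    [1] * 10,
    [1] * 9 + [-1],
    [1] * 4 + [-1] * 6,
    [1] * 8 + [-1] * 2,
]
cluster_diffs = [sum(c) for c in clusters]
wins = sum(score == 1 for c in clusters for score in c)
n_questions = sum(len(c) for c in clusters)

binom_p = binomial_two_sided_pvalue(wins, n_questions)
cluster_p = cluster_sign_flip_pvalue(cluster_diffs)
```

With only four clusters, the sign-flip test has just 16 possible sign assignments, so its p-value cannot be very small; the binomial test, which treats all 40 questions as independent, reports a far smaller one on the same data.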

2605.27788 2026-05-28 cs.LG cs.CL

Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use

Abhijit Kumar, Zoey Wu, Mohit Suley

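AI summary: Proposes CARL, which uses reinforcement learning to train a critic on the model's own rollouts and assigns independent credit to each tool-use segment, so the model learns to distinguish cases where parametric knowledge suffices from cases that need external help. It improves accuracy and reduces unnecessary tool calls on multiple benchmarks.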

Abstract

Humans know when to reach for help: for example, 347 × 28 warrants a calculator while 2 + 2 does not. Language models do not. Prompt-based approaches can instruct a model when to invoke tools, but this scaffolding does not teach it to recognize the boundary of its own knowledge. RL approaches that assign a single outcome reward to the whole trajectory fare no better: trajectory-level credit cannot isolate which tool call in a successful episode actually helped, nor penalize unnecessary calls. We propose CARL (Competence-Aware Reinforcement Learning), which trains a critic on the model's own rollouts to learn where parametric knowledge suffices and where it needs external help. By decomposing each rollout at natural tool-use boundaries (e.g., code fence delimiters and context block transitions), CARL assigns independent credit to each segment from a single binary outcome, without external judges or step-level annotations. As a result, erroneous tool calls, incorrect extractions, and unnecessary calls each receive appropriately signed advantages. The trained critic captures the model's domain competence: it separates parametrically solvable from tool-dependent questions with AUC 0.93 at 7B. On five benchmarks spanning arithmetic, multi-hop factual QA, and numerical reasoning over financial tables, CARL improves exact-match accuracy by 6.7 points at 7B and 9.7 points at 3B over the best RL baseline, with the largest gain (+8.3 EM at 7B, +9.0 EM at 3B) on Musique. The model issues 53% fewer tool calls on parametrically answerable questions while remaining about 10 EM points more accurate on them. Gains are largest at small scale: the 3B improvement is 1.4× the 7B improvement, suggesting that knowing when to ask disproportionately benefits models with smaller parametric memory.
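
The decomposition of a rollout at code-fence boundaries can be sketched as below. This covers only the segmentation step; how the critic turns a single binary outcome into per-segment advantages is not described in the abstract and is not attempted here. The function `split_segments` and the "reasoning"/"tool" labels are hypothetical, and the sketch assumes well-formed, non-nested fences.

```python
FENCE = "`" * 3  # the code-fence delimiter, built this way to avoid nesting

def split_segments(rollout: str):
    """Split a rollout into alternating reasoning and tool-call segments.

    Text outside fences is labeled "reasoning"; text inside is "tool".
    """
    segments = []
    for index, part in enumerate(rollout.split(FENCE)):
        text = part.strip()
        if text:
            kind = "tool" if index % 2 == 1 else "reasoning"
            segments.append((kind, text))
    return segments

rollout = (
    "I need the product of 347 and 28. "
    + FENCE + "print(347 * 28)" + FENCE
    + " The answer is 9716."
)
segments = split_segments(rollout)
```

Each resulting segment is a unit that could receive its own credit, for example a negative advantage for a tool segment that was not needed.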

2605.27785 2026-05-28 cs.AI cs.DB

A Query Engine for the Agents

Kenny Daniel

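AI summary: Proposes Hyperparam, a lightweight, JS-native query engine with async SQL and LLM UDF support for analyzing unstructured text in AI applications, outperforming DuckDB-WASM.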

Comments 4 pages, 1 figure, 3 tables

Abstract

The fastest-growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, model outputs. People want to analyze it, and the questions worth asking ("show me where the agent got confused") cannot be answered by SQL alone, since text is not queryable without a model in the query path. The natural place this analysis is happening is the new class of AI applications (Claude Code, Cursor, Claude Desktop, in-browser agents) that run client-side and host both a human user and an LLM agent in the same process. These applications increasingly want to work with data, but the lakehouse read path has been hard to use from a JS runtime: Spark, Trino, and managed warehouses do not fit there. To build this new kind of AI data application, three properties of the engine become first-order: a JS-native distribution that drops into the runtime the application already runs in, a bundle small enough to ship inside a cold tab or per-turn agent sandbox, and a way to interleave analytic operators with model-based interpretation of text. We present Hyperparam, three open-source JavaScript libraries (Hyparquet, Squirreling, Icebird) totaling under 70 KB, that read Parquet and Apache Iceberg directly from object storage and meet the third property with per-cell, async-native SQL execution, so expensive cells fire only when downstream operators demand them. Squirreling runs LLM-shaped async UDFs over 300x faster than DuckDB-WASM on filter-bounded queries (and 192x on sort-bounded queries) and completes a ten-task agent analyst suite at two-thirds lower cost. We argue that data engineering as a discipline needs to update for the AI-native client applications now in production and the agents that work alongside their users.
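
The claim that expensive cells fire only when downstream operators demand them can be illustrated with a demand-driven pipeline. This is a Python analogue using async generators, not the API of Hyparquet, Squirreling, or Icebird; `llm_udf` is a hypothetical stand-in for a model call, and laziness here is per row rather than per cell.

```python
import asyncio

calls = 0  # counts how often the "expensive" UDF actually runs

async def llm_udf(text: str) -> bool:
    """Hypothetical stand-in for an LLM-backed predicate."""
    global calls
    calls += 1
    await asyncio.sleep(0)  # yield control, as a real network call would
    return "confused" in text

async def scan(rows):
    for row in rows:
        yield row

async def where(source, predicate):
    async for row in source:
        if await predicate(row):
            yield row

async def limit(source, n):
    if n <= 0:
        return
    seen = 0
    async for row in source:
        yield row
        seen += 1
        if seen >= n:
            break  # stop pulling, so upstream work stops too

async def run():
    rows = ["ok", "agent confused here", "fine", "confused again", "ok"]
    # Roughly: SELECT * FROM rows WHERE llm_udf(text) LIMIT 1
    return [row async for row in limit(where(scan(rows), llm_udf), 1)]

result = asyncio.run(run())
```

Because the limit stops pulling after the first match, the stand-in UDF runs on two of the five rows instead of all of them.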

2605.27784 2026-05-28 cs.AI

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

诊断LLM代理中实时策略内指令冲突的见证解析轮廓

Lu Yan, Xuan Chen, Xiangyu Zhang

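AI summary Proposes the WIRE pipeline, which extracts rules, encodes them as PyRule clauses, detects conflicts, and generates witness instances to diagnose conflicts between rule pairs inside a single LLM-agent prompt policy; 64.6% of the evaluated witness trials violate at least one source rule.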

详情
AI中文摘要

LLM代理受长期自然语言提示策略的约束,但个别合理的常设规则可能以未经检查的方式相互作用。我们研究实时策略内规则冲突诊断:在单个提示策略内找到可以共同治理现实状态的规则对,并测量模型在响应或工具动作中如何解决该压力。我们引入WIRE,一个见证策略内规则评估管道。WIRE提取基于源的规则,将其编码为PyRule子句,使用可满足性检查保留同表面硬碰撞候选,将这些候选实现为具体的共同治理见证,并根据原始源规则文本判断模型输出。在六个公开提示策略中,WIRE提取276条源规则和560个原子子句,分类30,944个策略内子句对比较,保留170个编码的硬碰撞候选源规则对,并将其实现为1,402个具体见证。在仅策略评估中,这些见证产生13,335次后生成试验,其中两条源规则共同治理且两个合规标签均可判断。仅35.4%处于联合合规状态;64.6%违反至少一条治理源规则。这些轮廓是WIRE选择候选的条件诊断,而非部署频率或因果超额失败估计,但它们揭示了不同的策略、模型和工具动作解析模式。

英文摘要

LLM agents are governed by long-lived natural-language prompt policies, but individually reasonable standing rules can interact in uninspected ways. We study live intra-policy rule-conflict diagnosis: finding rule pairs inside a single prompt policy that can co-govern a realistic state, and measuring how models resolve that pressure in responses or tool actions. We introduce WIRE, a Witnessed Intra-policy Rule Evaluation pipeline. WIRE extracts source-grounded rules, encodes them as PyRule clauses, uses satisfiability checks to retain same-surface hard-collision candidates, realizes those candidates as concrete co-governance witnesses, and judges model outputs against the original source-rule text. Across six public prompt policies, WIRE extracts 276 source rules and 560 atomic clauses, classifies 30,944 within-policy clause-pair comparisons, retains 170 encoded hard-collision candidate source-rule pairs, and realizes them as 1,402 concrete witnesses. In policy-only evaluation, these witnesses yield 13,335 post-generation trials where both source rules govern and both compliance labels are judgeable. Only 35.4% fall in joint compliance; 64.6% violate at least one governed source rule. These profiles are conditional diagnostics for WIRE-selected candidates, not deployment-frequency or causal excess failure estimates, but they reveal distinct policy, model, and tool-action resolution patterns.
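
The notion of a same-surface hard collision with a co-governance witness can be made concrete with a small sketch. The abstract does not describe the PyRule clause format or the satisfiability procedure, so the `Rule` representation, the example rules, and the brute-force enumeration below are all assumptions made for illustration; a real pipeline would presumably use a proper solver.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Optional

State = Dict[str, bool]


@dataclass(frozen=True)
class Rule:
    name: str
    applies: Callable[[State], bool]  # when the rule governs a state
    surface: str                      # what the rule constrains
    required: str                     # the value it demands on that surface


def find_witness(a: Rule, b: Rule, variables: List[str]) -> Optional[State]:
    """Return one state where both rules govern the same surface but demand different values."""
    if a.surface != b.surface or a.required == b.required:
        return None  # different surfaces or compatible demands: no hard collision
    for values in product([False, True], repeat=len(variables)):
        state = dict(zip(variables, values))
        if a.applies(state) and b.applies(state):
            return state  # concrete co-governance witness
    return None


# Hypothetical standing rules from a single prompt policy.
always_english = Rule("reply-in-english", lambda s: True, "reply_language", "english")
mirror_user = Rule("mirror-user-language", lambda s: s["user_wrote_french"],
                   "reply_language", "french")
formal_tone = Rule("formal-tone", lambda s: s["is_support_ticket"], "tone", "formal")

VARS = ["user_wrote_french", "is_support_ticket"]
```

Here `find_witness(always_english, mirror_user, VARS)` returns a state in which the user wrote French, which is exactly the situation where the two rules pull in opposite directions, while the pair `always_english` and `formal_tone` yields no witness because the rules constrain different surfaces.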

2605.27782 2026-05-28 cs.LG cs.CR

Revisiting ML Training under Fully Homomorphic Encryption: Convergence Guarantees, Differential Privacy, and Efficient Algorithms

重新审视全同态加密下的机器学习训练:收敛保证、差分隐私与高效算法

Yvonne Zhou, Mingyu Liang, Ivan Brugere, Danial Dervovic, Yue Guo, Antigoni Polychroniadou, Min Wu, Dana Dachman-Soled

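AI summary Presents the first theoretical convergence analysis of machine learning training under fully homomorphic encryption, combined with a differentially private training algorithm suited to encrypted computation: approximate gradient descent with polynomial approximations of activation and loss functions is shown to converge, and a DP mechanism without per-sample gradient clipping improves computational efficiency.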

详情
AI中文摘要

我们首次对全同态加密(FHE)下的机器学习训练进行了理论收敛性分析,并结合了一种针对加密计算量身定制的差分隐私(DP)训练算法。我们的方法在实现可比效用的同时,提高了标准差分隐私梯度下降(DP-GD)的计算效率。具体而言,我们证明了使用激活函数和损失函数的多项式近似(这是FHE兼容性所必需的)的近似梯度下降的收敛性。为了保护下游任务中的隐私,我们集成了差分隐私,而无需依赖昂贵的逐样本梯度裁剪,从而实现了可扩展的加密学习。我们还提供了数据无关的超参数选择和具有理论依据的多项式近似策略,这些策略可能具有独立的价值。这些贡献共同推进了在敏感数据上进行高效、私密且安全的机器学习的可行性。

英文摘要

We present the first theoretical convergence analysis of machine learning training under fully homomorphic encryption (FHE), combined with a differentially private (DP) training algorithm tailored to encrypted computation. Our approach improves computational efficiency over standard differentially private gradient descent (DP-GD) while achieving comparable utility. In particular, we prove convergence of approximate gradient descent using polynomial approximations of activation and loss functions, which are required for FHE compatibility. To preserve privacy in downstream tasks, we integrate differential privacy without relying on costly per-sample gradient clipping, enabling scalable encrypted learning. We also provide data-independent hyperparameter selection and theoretically grounded strategies for polynomial approximation which can be of independent interest. Together, these contributions advance the feasibility of efficient, private, and secure machine learning on sensitive data.

2605.27774 2026-05-28 cs.LG

Fine-Tuning Dynamics of In-Context Factual Recall in Transformers

Transformer中上下文事实回忆的微调动力学

Ruomin Huang, Eshaan Nichani, Jason D. Lee, Rong Ge

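AI summary Studies how in-context learning in transformers draws on factual knowledge stored in parameters: introduces the in-context factual recall task, analyzes the fine-tuning dynamics of a one-layer transformer, and proves convergence to a particular pairwise attention pattern from very few samples.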

详情
AI中文摘要

上下文学习(基于提示中给出的示例执行任务)是大型语言模型中涌现的重要能力,在理论和实践中都受到了广泛关注。现有的理论工作通常关注学习仅使用提示中信息的场景。然而,许多上下文学习的实际实例要求模型检索存储在模型参数中的事实知识,而上下文则用于识别哪些知识是相关的。在这项工作中,我们研究了上下文学习如何利用事实知识回忆。我们通过引入上下文事实回忆(IC-recall)任务来形式化这种行为,其中Transformer接收从隐藏关系生成的(主题,答案)对上下文以及一个查询主题,必须同时推断该隐藏关系并检索相应的答案。事实知识通过Transformer访问一个简单预构建的MLP联想记忆(存储(主题,关系,答案)三元组)来建模。我们分析了单层Transformer在IC-recall数据上的监督微调动力学,并证明模型通过收敛到特定的成对注意力模式成功执行IC-recall。这个微调阶段只需要极少的样本,仅与存储的知识三元组数量成多对数关系。实验验证了我们的理论预测,并表明即使MLP层是预训练而非构建的,成对注意力模式也会出现。

英文摘要

In-context learning, that is, performing tasks based on examples given in the prompt, is an important capability that has emerged in large language models and has received significant attention in both theory and practice. Existing theoretical work often focuses on settings where the learning uses information purely from the prompt. However, many practical instances of in-context learning require the model to retrieve factual knowledge stored in the model's parameters, with the context serving to identify which knowledge is relevant. In this work, we study how in-context learning leverages factual knowledge recall. We formalize this behavior by introducing the in-context factual recall (IC-recall) task, where a transformer is provided a context of (subject, answer) pairs generated from a hidden relation, along with a query subject, and must both infer this hidden relation and retrieve the corresponding answer. Factual knowledge is modeled by the transformer having access to a simple pre-constructed MLP associative memory storing (subject, relation, answer) triplets. We analyze the supervised fine-tuning dynamics of a one-layer transformer on IC-recall data and prove that the model successfully performs IC-recall by converging to a particular pairwise attention pattern. This fine-tuning stage requires a very small number of samples, only polylogarithmic in the number of stored knowledge triplets. Experiments verify our theoretical predictions and show that the pairwise attention pattern emerges even when the MLP layer is pretrained instead of constructed.
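
To make the IC-recall task definition concrete, the following sketch builds one example from a toy triplet store. It only illustrates the data format described in the abstract: the dictionary-based `MEMORY` stands in for the MLP associative memory, and the facts, function names, and sampling scheme are illustrative assumptions, not the paper's construction.

```python
import random
from typing import Dict, List, Tuple

# Toy associative memory: (subject, relation) -> answer. All entries are illustrative.
MEMORY: Dict[Tuple[str, str], str] = {
    ("France", "capital"): "Paris",
    ("Japan", "capital"): "Tokyo",
    ("Peru", "capital"): "Lima",
    ("France", "language"): "French",
    ("Japan", "language"): "Japanese",
    ("Peru", "language"): "Spanish",
}


def make_ic_recall_example(relation: str, n_context: int, rng: random.Random):
    """Build one prompt: context (subject, answer) pairs for a hidden relation plus a query subject."""
    subjects = sorted(s for (s, r) in MEMORY if r == relation)
    rng.shuffle(subjects)
    context_subjects, query = subjects[:n_context], subjects[n_context]
    context = [(s, MEMORY[(s, relation)]) for s in context_subjects]
    target = MEMORY[(query, relation)]
    return context, query, target  # the relation itself is never shown to the model


def infer_relation(context: List[Tuple[str, str]]) -> List[str]:
    """Relations consistent with every context pair (what the model must infer implicitly)."""
    relations = {r for (_, r) in MEMORY}
    return sorted(r for r in relations
                  if all(MEMORY.get((s, r)) == a for s, a in context))
```

For example, with two context pairs drawn from the `capital` relation, `infer_relation` recovers `["capital"]`, and the target for the held-out query subject is the stored capital of that subject.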

2605.27773 2026-05-28 cs.CL cs.AI cs.LG

Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

模型是否知道它们为何改变主意?知识冲突下思维链的可解释性与忠实性

Pruthvinath Jeripity Venkata

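AI summary Introduces introspective faithfulness to test whether a language model's chain-of-thought reasoning faithfully reflects its decision mechanism under knowledge conflict; CoT is highly stable across opposite decisions, while self-rated confidence carries a faint genuine signal.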

Comments 12 pages, 8 tables, 3 appendices

详情
AI中文摘要

当语言模型看到与其训练知识相矛盾的文档时,它必须做出选择:遵循文档还是相信自己。先前的工作证明这种选择取决于事实的知名程度。我们问:模型的思维链(CoT)推理是否忠实地报告了这一机制?我们引入了内省忠实性,并在200个问题、8个模型和4种提示条件下进行了测试。我们发现CoT推理在相反决策下高度稳定:翻转对保留了96%的相同答案相似度(d=0.34;通过ROUGE-L确认,d=0.45)。然而,自我评定的置信度携带微弱的真实信号:对于实体知名度无信息量的冷门事实,置信度仍能预测决策(p<0.001),并追踪项目级知识(r=0.134)。GPT-4o是唯一具有统计上可靠的推理-决策耦合的模型。Claude Sonnet 4.6显示出最宽的置信度范围(SD=1.39),但汇总相关性接近零,因为置信度-决策关系在不同条件下反转;温度消融实验证实这是模型特有的。内部思考令牌比面向用户的CoT显示出更大的决策敏感性(p=0.033)。CoT分解为决策不变的知识展示(约96%)和一层薄弱的置信度层,后者带有微弱但真实的信号。对于监控:读取置信度,而不是论证。

英文摘要

When a language model sees a document contradicting its training knowledge, it must choose: follow the document or trust itself. Prior work proved this choice depends on how well-known the fact is. We ask: does the model's chain-of-thought (CoT) reasoning faithfully report this mechanism? We introduce introspective faithfulness and test it across 200 questions, 8 models, and 4 prompt conditions. We find CoT reasoning is highly stable across opposite decisions: flip pairs retain 96% of same-answer similarity (d=0.34; confirmed by ROUGE-L, d=0.45). Yet self-rated confidence carries a faint genuine signal: for obscure facts where entity fame is uninformative, confidence still predicts decisions (p<0.001) and tracks item-level knowledge (r=0.134). GPT-4o is the only model with statistically reliable reasoning-decision coupling. Claude Sonnet 4.6 shows the widest confidence range (SD=1.39) but near-zero pooled correlation because the confidence-decision relationship reverses between conditions; a temperature ablation confirms this is model-specific. Internal thinking tokens show greater decision-sensitivity than user-facing CoT (p=0.033). CoT decomposes into a decision-invariant knowledge display (~96%) and a thin confidence layer with weak but real signal. For monitoring: read confidence, not the argument.

2605.27772 2026-05-28 cs.SD cs.LG

Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox

音频大语言模型是听还是读?使用VoxParadox分析和缓解副语言失败

Jiacheng Pang, Ashutosh Chaubey, Mohammad Soleymani

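AI summary To address the weak paralinguistic understanding of Audio LLMs, proposes the adversarial benchmark VoxParadox and the Prompt-Conditioned Layer Mixer method, substantially improving the models' use of paralinguistic cues.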

Comments Accepted as a conference paper at ICML 2026. Project page: https://voxparadox.github.io/

详情
AI中文摘要

音频大语言模型(Audio LLMs)在语音理解任务上表现出色,但其理解副语言信息的能力仍然有限。为了系统量化这一问题,我们引入了VoxParadox,一个包含2000个验证示例的对抗性基准,涵盖10项副语言任务,通过受控语音合成故意使转录声明与说话风格不匹配,从而直接测量语音副语言理解能力。对多种音频大语言模型的评估显示,它们在声学真实情况上的准确率持续较低,并且强烈倾向于遵循语言暗示的(错误)答案。为了理解这一差距的原因,我们进行了逐层探针分析,发现(i)副语言线索可能在更深的编码器层以及编码器-大语言模型接口处退化,(ii)即使音频令牌中存在这些线索,语言模型也经常忽略它们。为了解决这些问题,我们提出了提示条件层混合器(PCLM),它根据输入提示自适应地组合多个音频层的信息,并结合直接偏好优化(DPO)来明确偏好声学支持的选项而非语言暗示的选项。这些方法显著提升了音频大语言模型的副语言理解能力,在VoxParadox上将Audio Flamingo 3从17.40%提升至65.20%,在MMSU副语言子集上从37.74%提升至54.78%。我们的项目页面位于https://voxparadox.github.io/。

英文摘要

Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 verified examples, spanning 10 paralinguistic tasks, created with controlled speech synthesis to intentionally mismatch transcript claims and speaking style, enabling direct measurement of speech paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on acoustic ground truth and a strong tendency to follow language-implied (incorrect) answers. To understand the cause of this gap, we perform layer-wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder--LLM interface, and (ii) even when such cues are available in audio tokens, the language model frequently ignores them. To address these problems, we propose Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. These methods substantially improve Audio LLM paralinguistic understanding, improving Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox, and from 37.74% to 54.78% on MMSU paralinguistic subset. Our project page is available at https://voxparadox.github.io/.

2605.27768 2026-05-28 cs.AI

Auditable Decision Models with Learned Abstention and Real-Time Steering

具有学习弃权与实时引导的可审计决策模型

Sankaranarayanan Palamadai Chandrasekaran

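AI summary Proposes EvaluatorDPT, a transformer-encoder model that learns three-way YES/NO/TBD decisions, with TBD learned as a deferral output, and that supports inference-time threshold control and auxiliary semantic signals for auditable decision control.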

Comments 21 pages, 5 figures

详情
AI中文摘要

生产AI系统通常在证据不完整、冲突或不足的情况下运行。强制分类器将此类情况压缩为动作标签,而生成系统可能产生难以解释为可审计执行决策的输出。我们研究AI系统的操作决策控制,其中不确定性必须明确可路由、受策略约束且可审计,而不是隐藏在强制预测或自由形式生成中。我们提出EvaluatorDPT,一种有界决策控制模型,预测YES、NO或TBD,其中TBD被学习为延迟结果,而非仅作为事后置信规则添加。该模型使用Transformer编码器,带有主要的有界决策头和用于价值观及情绪/情感的辅助通道。接口在形式上与领域无关:部署领域提供证据和策略阈值,而模型发出有界分布,可在推理时通过记录的操作阈值以及(经验证后)辅助语义信号进行控制。对于评估的模型版本,我们报告了在保留验证集和测试集上的决策性能;由于此评估中禁用了情感头,因此省略了辅助情感指标。在保留测试集(n=44,597)上,模型达到准确率=0.8260,宏F1=0.8252,各类别F1分别为0.8314(YES)、0.8486(NO)和0.7956(TBD)。评估记录还包括校准证据(验证集上ECE=0.0338)、阈值扫描输出、多种子稳定性检查、混淆矩阵和可重复性命令。我们的主要贡献是一个有界执行接口,其中延迟被学习,推理时路由保持可检查,辅助信号提供可审计行为控制的路径,且评估证据支持外部审查。

英文摘要

Production AI systems often operate with incomplete, conflicting, or insufficient evidence. Forced classifiers collapse such cases into action labels, while generative systems can produce outputs that are difficult to interpret as auditable execution decisions. We study operational decision control for AI systems, where uncertainty must be explicitly routable, policy-governed, and auditable rather than hidden inside forced predictions or free-form generation. We present EvaluatorDPT, a bounded decision-control model that predicts YES, NO, or TBD, where TBD is learned as a deferral outcome rather than added only as a post-hoc confidence rule. The model uses a transformer encoder with a primary bounded-decision head and structured auxiliary channels for values and emotions/sentiments. The interface is domain-agnostic in form: a deployment domain supplies evidence and policy thresholds, while the model emits a bounded distribution that can be controlled at inference time through recorded operating thresholds and, when validated, auxiliary semantic signals. For the evaluated model version, we report decision performance on held-out validation and test splits; auxiliary emotion metrics are omitted because the emotion head is disabled for this evaluation. On the held-out test split (n=44,597), the model achieves Accuracy = 0.8260 and Macro F1 = 0.8252, with per-class F1 of 0.8314 (YES), 0.8486 (NO), and 0.7956 (TBD). The evaluation record also includes calibration evidence (ECE = 0.0338 on validation), threshold-sweep outputs, multi-seed stability checks, confusion matrices, and reproducibility commands. Our main contribution is a bounded execution interface in which deferral is learned, inference-time routing remains inspectable, auxiliary signals provide a path to auditable behavior control, and evaluation evidence supports external review.
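
A minimal sketch of inference-time routing over a bounded YES/NO/TBD distribution follows. The abstract does not specify how the recorded operating thresholds are applied, so the `route` function and its two threshold parameters are assumptions for illustration only. The second helper checks a fact that can be verified from the reported numbers: the Macro F1 of 0.8252 equals the unweighted mean of the three per-class F1 scores.

```python
from typing import Dict


def route(probs: Dict[str, float], yes_min: float, no_min: float) -> str:
    """Map a YES/NO/TBD distribution to an action using operating thresholds.

    Falls back to TBD (deferral) whenever the top committed label
    does not clear its threshold.
    """
    if abs(sum(probs.values()) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    top = max(probs, key=probs.get)
    if top == "YES" and probs["YES"] >= yes_min:
        return "YES"
    if top == "NO" and probs["NO"] >= no_min:
        return "NO"
    return "TBD"


def macro_f1(per_class_f1: Dict[str, float]) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1.values()) / len(per_class_f1)
```

Raising `yes_min` or `no_min` sends more borderline cases to TBD without retraining, and logging the thresholds alongside each decision keeps the routing inspectable after the fact.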

2605.27767 2026-05-28 cs.CL cs.AI cs.LG

UniMaia: Steering Chess Policies with Language for Human-like Play

UniMaia:用语言引导国际象棋策略以实现类人玩法

Sherman Siu, Lesley Istead

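AI summary Proposes UniMaia, a framework that uses a parameter-efficient text encoder and a ControlNet-style conditioning mechanism for prompt-conditioned policy modulation on a frozen Lc0 chess policy network, giving semantic control (such as opening selection and player strength) while preserving pretrained policy representations; also builds a large-scale metadata-augmented Lichess dataset and a semi-automated prompt-generation pipeline, with state-of-the-art or competitive results on several benchmarks.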

详情
AI中文摘要

大型语言模型的最新进展使得自然语言能够作为控制复杂系统的灵活接口,但通常以大规模多模态训练或弱化领域特定归纳偏差为代价。在结构化决策领域(如国际象棋)中,专门的策略网络表现强劲但缺乏语义可控性,而提示条件语言模型更灵活但通常领域基础较弱。我们提出UniMaia,一个用于提示条件策略调制的框架,它使用参数高效文本编码器和ControlNet风格的调节机制来适配基于Lc0的冻结国际象棋策略网络。UniMaia能够实现对游戏玩法的语义控制,包括开局选择和玩家强度,同时保留预训练的策略表征。我们进一步引入UniMaia-Aux,它结合了辅助时间条件化和行为预测目标。为了支持这项工作,我们构建了一个大规模元数据增强的Lichess数据集,开发了一个半自动提示生成管道,并引入了涵盖提示条件和元数据条件设置的基准。UniMaia在多个提示条件基准上实现了最先进的预期准确率,在通用指令遵循任务上达到了竞争性的最佳着法准确率,同时在人类着法预测基准上与专门的元数据条件方法保持竞争力。UniMaia-Aux进一步提高了多个评估设置下的预期准确率和行为建模,在最佳着法准确率上略有折衷。总体而言,我们的结果表明,无需端到端多模态训练即可实现领域特定策略网络的提示条件控制,同时突出了可控性与预测性能之间的权衡。

英文摘要

Recent advances in large language models have enabled natural language to serve as a flexible interface for controlling complex systems, but often at the cost of large-scale multimodal training or weakened domain-specific inductive biases. In structured decision-making domains such as chess, specialized policy networks achieve strong performance but lack semantic controllability, while prompt-conditioned language models are more flexible yet typically exhibit weaker domain grounding. We propose UniMaia, a framework for prompt-conditioned policy modulation that adapts a frozen Lc0-based chess policy network using a parameter-efficient text encoder and a ControlNet-style conditioning mechanism. UniMaia enables semantic control over gameplay, including opening selection and player strength, while preserving the pretrained policy representations. We further introduce UniMaia-Aux, which incorporates auxiliary temporal conditioning and behavioral prediction objectives. To support this work, we construct a large-scale metadata-augmented Lichess dataset, develop a semi-automated prompt-generation pipeline, and introduce benchmarks spanning both prompt-conditioned and metadata-conditioned settings. UniMaia achieves state-of-the-art expected accuracy on several prompt-conditioned benchmarks and competitive top-move accuracy on general instruction-following tasks, while remaining competitive with dedicated metadata-conditioned approaches on human move prediction benchmarks. UniMaia-Aux further improves expected accuracy and behavioral modeling across several evaluation settings, with modest trade-offs in top-move accuracy. Overall, our results demonstrate that prompt-conditioned control of domain-specific policy networks is feasible without end-to-end multimodal training, while highlighting trade-offs between controllability and predictive performance.

2605.27766 2026-05-28 cs.AI

Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

Aman Priyanshu, Supriti Vijay, Esha Pahwa

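AI summary Uses a multi-agent simulation platform to evaluate privacy leakage by LLM agents under social pressure: multi-turn social interaction markedly increases leakage, leakage is socially contagious, and privacy instructions do not fully eliminate it.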

详情
AI中文摘要

LLM安全评估主要在隔离环境中测试模型,然而部署的AI智能体越来越多地与其他智能体在持久社交环境中交互。我们引入了一个Moltbook风格的模拟平台,数千个LLM智能体在模拟的一个月内跨社区交互,并用它来评估在不同程度的社交压力下隐私作为下游安全问题的表现。我们发现从单轮到多轮社交评估会放大隐私侵犯(OpenAI模型上CIMemories 19.95%到我们的45.30%),泄露具有社交传染性,观察到同伴泄露后智能体泄露敏感信息的可能性增加8倍,并且明确的隐私指令减少但不能消除这种效应,即使有保护措施,泄露率仍高于37.8%。我们的发现表明,基于静态聊天的安全基准系统性地低估了智能体部署中的风险,而仅社交环境就足以引发单轮评估永远不会发现的敏感泄露。

英文摘要

LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform where thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that shifting from single-turn to multi-turn social evaluation amplifies privacy violations (from 19.95% on CIMemories to 45.30% in our setting, across OpenAI models), that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so, and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static chat-based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single-turn evaluations would never surface.

2605.27765 2026-05-28 cs.LG cs.AI

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

恢复甜蜜点:用于LLM推理的通过率加权自蒸馏

Zehao Liu, Yuanpu Cao, Jinghui Chen, Vasant G. Honavar

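AI summary Proposes SC-SDPO, which weights the self-distillation loss by each question's pass rate, dynamically adjusting training difficulty and improving LLM reasoning performance.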

Comments 18 pages, 8 figures

详情
AI中文摘要

自蒸馏策略优化(SDPO)通过利用模型自身的反馈条件预测作为自教师,为大型语言模型的强化学习提供密集的令牌级信用分配。然而,与GRPO不同——其群体相对优势自然地将学习集中在一个中等难度问题的甜蜜点上——SDPO的基于KL的优势缺乏隐式的难度感知。我们通过GRPO的优势归一化视角分析这一差距。将可学习性框架扩展到归一化奖励,我们表明归一化吸收了方差项$p(1-p)$,使各问题的前导阶可学习性相等,留下$\sqrt{p(1-p)}$作为每个问题梯度中唯一的残差缩放因子。这一分析产生了一个简单的处方:用$[\hat{p}(1-\hat{p})]^{1/2}$加权每个问题的SDPO损失,得到SC-SDPO,即SDPO的尺度一致变体。所提出的权重作为在线策略rollout与批自适应归一化的零成本副产品获得,诱导出一个隐式课程,动态跟踪模型不断发展的能力。在科学推理和工具使用基准上的实验表明,SC-SDPO持续优于SDPO,在Qwen3-8B上获得+3.2/+4.3(mean@16/maj@16)的提升,在OLMo-3-7B上获得+1.8/+3.0的提升,同时在整个优化过程中保持稳定的训练动态。

英文摘要

Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language models by leveraging the model's own feedback-conditioned predictions as a self-teacher. Unlike GRPO, however, whose group-relative advantage naturally concentrates learning on a sweet spot of intermediate-difficulty questions, SDPO's KL-based advantage lacks an implicit notion of difficulty awareness. We analyze this gap through the lens of GRPO's advantage normalization. Extending the learnability framework to normalized rewards, we show that normalization absorbs the variance term $p(1-p)$, equalizing leading-order learnability across questions and leaving $\sqrt{p(1-p)}$ as the sole residual scaling factor in the per-question gradient. This analysis yields a simple prescription: weight each question's SDPO loss by $[\hat{p}(1-\hat{p})]^{1/2}$, resulting in SC-SDPO, a scale-consistent variant of SDPO. The proposed weights are obtained as a zero-cost byproduct of on-policy rollouts with batch-adaptive normalization, inducing an implicit curriculum that dynamically tracks the model's evolving competence. Experiments on scientific reasoning and tool-use benchmarks demonstrate that SC-SDPO consistently improves over SDPO, yielding gains of +3.2/+4.3 (mean@16/maj@16) on Qwen3-8B and +1.8/+3.0 on OLMo-3-7B, while preserving stable training dynamics throughout optimization.
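
The weighting prescription in this abstract, scaling each question's loss by $[\hat{p}(1-\hat{p})]^{1/2}$, can be sketched in a few lines. This is only an illustration of the formula: the function names are hypothetical, rewards are assumed to be binary pass/fail per rollout, and the batch-adaptive normalization mentioned in the abstract is not implemented here.

```python
import math
from typing import List


def pass_rate(rewards: List[int]) -> float:
    """Empirical pass rate p_hat from binary rollout rewards for one question."""
    return sum(rewards) / len(rewards)


def sc_weight(p_hat: float) -> float:
    """Per-question weight [p_hat * (1 - p_hat)]^(1/2)."""
    return math.sqrt(p_hat * (1.0 - p_hat))


def weighted_loss(per_question_losses: List[float], rollouts: List[List[int]]) -> float:
    """Sum of per-question losses, each scaled by its pass-rate weight."""
    return sum(sc_weight(pass_rate(r)) * loss
               for loss, r in zip(per_question_losses, rollouts))
```

The weight peaks at 0.5 when half the rollouts pass and falls to zero for questions that are always or never solved, which is the intermediate-difficulty sweet spot the abstract refers to.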

2605.27764 2026-05-28 cs.CV cs.AI

Can Segmentation Models Understand the World? Towards Proactive Affordance Reasoning via Visual Chain-of-Thought

分割模型能理解世界吗?通过视觉思维链实现主动可供性推理

Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su

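AI summary Proposes the SegWorld framework, which uses a multi-level visual chain-of-thought for proactive scene observation and affordance reasoning under intent-level instructions, enabling segmentation that goes from high-level goals down to object parts.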

详情
AI中文摘要

最近的分割模型将大语言模型(LLMs)与掩码解码器结合,将复杂的语言表达映射到掩码上,但其指令仍然是目标指涉的:它们描述、约束或暗示待分割的区域。然而,在现实世界的具身交互中,人类指令通常是意图级的,包括期望的结果而不指定实现该结果的区域。为弥合这一差距,我们引入SegWorld,其中模型在确定掩码之前通过多级视觉思维链(CoT)推理场景。在接收任何指令之前,它主动观察场景,描述可见对象并推断它们可能支持的可能事件。给定指令后,它继续思维链:从与意图相关的对象,到满足意图的动作,再到物理交互部位,即支持该动作的对象部分。我们将SegWorld形式化为概率推理,其中主动观察提供语言场景上下文,当指令以意图级别给出时,可改善掩码预测。我们构建了一个意图到部件的基准,用于评估从高层目标出发的可供性承载部件分割。实验表明,SegWorld在目标指涉指令上匹配指令驱动基线,并在意图级指令上显著提升。

英文摘要

Recent segmentation models couple large language models (LLMs) with mask decoders to ground complex language expressions into masks, yet their instructions remain target-referential: they describe, constrain, or imply the region to be segmented. However, in real-world embodied interaction, human instructions are often at the intent-level, which includes the desired outcome without naming the region that enables it. To bridge this gap, we introduce SegWorld, where the model reasons about the scene through a multi-level visual chain-of-thought (CoT) before committing to a mask. Before receiving any instructions, it proactively observes the scene, describing visible objects and inferring plausible events they may support. Given an instruction, it continues the chain: from the object relevant to the intent, through the action that satisfies it, to the physical interaction site, the object part that affords the action. We formalize SegWorld as probabilistic inference, in which proactive observation supplies a linguistic scene context that improves mask prediction when instructions are given at the level of intent. We construct an intent-to-part benchmark for evaluating affordance-bearing part segmentation from high-level goals. Experiments show SegWorld matches instruction-driven baselines on target-referential instructions and improves substantially on intent-level ones.

2605.27763 2026-05-28 cs.LG

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

LLM服务中基于批处理条件的拒绝鲁棒性配对测试协议

Sahil Kadadekar

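AI summary Proposes a paired testing protocol that synthesizes four studies on how batch condition affects LLM safety labels; batching-induced safety-label flips are rare but present, and exact-stack validation is recommended.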

Comments 12 pages. Accepted to the ICML 2026 Workshop on Hypothesis Testing

详情
AI中文摘要

语言模型的安全性评估通常将服务配置视为固定的背景基础设施,但当同一提示可能单独评估、在同步批处理中或在连续批处理调度器内评估时,批处理条件是一个未经测试的处理变量。我们将四项基于工件的研究综合成一个配对测试协议:研究A结合局部发现、评分者校正裁决和真实批处理确认;研究B测试跨模型泛化;研究C测试连续批处理组合;研究D进行批处理不变核消融。局部测试发现安全标签变化比能力标签变化更频繁(0.51% vs. 0.14%),但对63个候选行的裁决仅留下17个真实行为翻转,意味着校正后的全集率为0.16%。扩展到15个模型未发现可检测的普遍安全优于能力偏差:翻转接近平衡(0.94倍),对齐类型无显著关联(p=0.942,η²=0.033),输出不稳定性是最强的脆弱性筛选指标(r=0.909,bootstrap 95% CI [0.65, 0.97])。在目标核消融中,标准vLLM在当前分数翻转候选上重现22/55的标签翻转,而启用VLLM_BATCH_INVARIANT=1将同一测试减少到0/55翻转;组合测试单独未发现4.7pp敏感度下的聚合效应。测试建议是精确堆栈验证:在服务批处理设置下评估拒绝,将安全提示与能力控制配对,并分别报告低速率方向翻转与聚合零效应。

英文摘要

Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized batch, or inside a continuous-batching scheduler. We synthesize four artifact-backed studies into a paired testing protocol: Study A combines local discovery, scorer-corrected adjudication, and true-batching confirmation; Study B tests cross-model generalization; Study C tests continuous-batch composition; and Study D runs a batch-invariant-kernel ablation. The local test finds safety-label changes more often than capability-label changes (0.51% vs. 0.14%), but adjudication of 63 candidate rows leaves only 17 genuine behavioral flips, implying a corrected full-set rate of 0.16%. The 15-model extension finds no detectable universal safety-over-capability skew: flips are near parity (0.94x), alignment type has no detectable association ($p=0.942$, $\eta^2=0.033$), and output instability is the strongest tested fragility screen ($r=0.909$, bootstrap 95% CI [0.65, 0.97]). In the targeted kernel ablation, standard vLLM reproduces 22/55 label flips on current score-flip candidates, while enabling VLLM_BATCH_INVARIANT=1 reduces the same test to 0/55 flips; the composition test separately finds no aggregate effect at 4.7pp sensitivity. The testing recommendation is exact-stack validation: evaluate refusal at the served batch setting, pair safety prompts with capability controls, and report low-rate directional flips separately from aggregate null effects.
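
The paired comparison behind figures such as 22/55 and 0/55 can be sketched as follows. This is a generic illustration of counting label flips between a solo run and a batched run of the same prompts; the label strings `"refuse"` and `"comply"` and both function names are assumptions, not the paper's scoring code.

```python
from typing import Dict, List, Tuple


def flip_rate(solo: List[str], batched: List[str]) -> Tuple[int, int, float]:
    """Count prompts whose label differs between the solo and batched runs."""
    if len(solo) != len(batched):
        raise ValueError("paired runs must cover the same prompts")
    flips = sum(a != b for a, b in zip(solo, batched))
    return flips, len(solo), flips / len(solo)


def directional_flips(solo: List[str], batched: List[str],
                      safe: str = "refuse", unsafe: str = "comply") -> Dict[str, int]:
    """Split flips by direction so low-rate directional effects are not averaged away."""
    return {
        "safe_to_unsafe": sum(a == safe and b == unsafe for a, b in zip(solo, batched)),
        "unsafe_to_safe": sum(a == unsafe and b == safe for a, b in zip(solo, batched)),
    }
```

Reporting the two directions separately matters because equal numbers of flips in each direction produce no aggregate change in refusal rate even though individual prompts changed label.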

2605.27761 2026-05-28 cs.CV cs.SE

AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

AndroidDaily: 面向真实世界闭源应用的可验证移动GUI智能体基准

Yifan Sui, Xin Huang, Hongbing Li, Fang Xu, Jiahe Lv, Haolong Yan, Yeqing Shen, Litao Liu, Zhimin Fan, Ziyang Meng, Jia Wang, Junbo Qi, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Osamu Yoshie

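AI summary Because closed-source applications expose no internal state and so resist automatic verification, proposes the AndroidDaily benchmark (350 daily-use tasks) and the GRADE evaluator (a three-tiered system of observable external guidelines) for verifiable evaluation without internal state; the strongest model reaches a 62.0% success rate.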

Comments 11 pages, 6 figures. Preprint

详情
AI中文摘要

GUI基础模型和移动GUI智能体的快速发展催生了众多评估基准,但大多数依赖于模拟环境或开源应用,真实世界的闭源应用在很大程度上未得到评估。核心困难在于闭源应用不暴露内部状态,使得传统的自动验证不适用。为弥合这一差距,我们引入了AndroidDaily,一个大规模基准,包含跨94个高频Android应用的350个现实日常任务,涵盖交通、购物、本地服务、娱乐、内容创作、社交媒体和日常实用工具。为了在这些不透明环境中实现自动且可验证的评估,我们提出了基于指南的自动诊断评估评审器(GRADE),这是一个基于三层可观察外部指南系统构建的过程感知评估器:操作义务、输出质量和负面约束。GRADE根据这些标准跟踪智能体的视觉轨迹,并产生步骤级诊断判断,将长期、开放式的移动交互转化为可验证的评估,而无需依赖隐藏的内部状态。实验表明,GRADE与人类评估者的一致性达到87.37%。最强模型在AndroidDaily上的成功率为62.0%,凸显了当前推理能力与现实移动工作流实际执行之间的巨大差距。

英文摘要

The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely unevaluated. The core difficulty is that closed-source applications do not expose internal states, making traditional automatic verification inapplicable. To bridge this gap, we introduce AndroidDaily, a large-scale benchmark comprising 350 realistic daily-use tasks across 94 high-frequency Android applications spanning transportation, shopping, local services, entertainment, content creation, social media, and everyday utilities. To enable automatic and verifiable assessment in these opaque environments, we propose Guideline-grounded Reviewer for Automatic Diagnostic Evaluation (GRADE), a process-aware evaluator built on a three-tiered system of observable external guidelines: operational obligations, output quality, and negative constraints. GRADE tracks the agent's visual trajectory against these criteria and produces step-level diagnostic judgments, turning long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states. Experiments show that GRADE achieves 87.37% agreement with human evaluators. The strongest model reaches a 62.0% success rate on AndroidDaily, highlighting a substantial gap between current reasoning capabilities and practical execution in realistic mobile workflows.

2605.27760 2026-05-28 cs.AI

SkillGrad: Optimizing Agent Skills Like Gradient Descent

SkillGrad: 像梯度下降一样优化智能体技能

Hanyu Wang, Yifan Lan, Bochuan Cao, Lu Lin, Jinghui Chen

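AI summary Proposes the SkillGrad framework, which treats a skill package as a structured parameter and optimizes it in a gradient-descent-like manner using trajectory-level loss, text-gradient diagnoses, and a momentum memory overlay, improving over the strongest training-based baseline by 6.7 percentage points on average on SpreadsheetBench Verified and WikiTableQuestions.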

详情
AI中文摘要

智能体技能通过将可复用的程序化知识存储在结构化文件中,提供了一种轻量级的方式将LLM智能体适配到专业领域。然而,无论是从第三方下载还是自行生成,这些技能往往不可靠、不完整或过时。现有的技能演化方法通常通过启发式反思来解决这些缺陷,缺乏明确的优化公式。在本文中,我们提出了SkillGrad,一种受梯度下降启发的智能体技能优化框架。SkillGrad将技能包视为结构化参数,以梯度下降的方式进行优化:任务执行提供轨迹级损失证据,自动诊断随后提供指示修正方向的文本梯度。为了稳定跨迭代的优化,动量智能体将重复出现的诊断模式累积到持久记忆覆盖层中。最后,基于LLM的修补器通过对技能包进行层感知编辑来执行参数更新。在SpreadsheetBench Verified和WikiTableQuestions上的评估表明,SkillGrad在两个骨干LLM上始终优于基于训练的技能演化基线,平均比最强的基于训练的基线高出6.7个百分点。消融实验进一步表明,动量和对比诊断都对最终技能质量有贡献。

英文摘要

Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often unreliable, incomplete, or outdated. Existing skill-evolution methods often address these deficiencies through heuristic reflections without an explicit optimization formulation. In this paper, we propose SkillGrad, a gradient-descent-inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter to optimize in a gradient descent fashion: task executions provide trajectory-level loss evidence, and automatic diagnoses then provide text-based gradients that indicate the correction directions. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM-based patcher executes the parameter update by applying layer-aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training-based skill evolution baselines across two backbone LLMs, improving over the strongest training-based baseline by 6.7 percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to the final skill quality.

2605.27759 2026-05-28 cs.RO

Colosseum V2: Benchmarking Generalization for Vision Language Action Models

Colosseum V2:视觉语言动作模型的泛化能力基准测试

Jeremy Morgan, Prajwal Vijay, Hyeonho Oh, Jincen Song, Ashvin Arora, Alina Du, Gaurav Sukhatme, Jesse Thomason, Ishika Singh

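AI summary Proposes Colosseum V2, a large-scale simulation benchmark with 28 tasks and two robot morphologies that systematically evaluates VLA generalization under distribution shift, exposing the gap between high-level understanding and robust behavior.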

详情
AI中文摘要

视觉-语言-动作(VLA)模型在大规模视觉和语言预训练的推动下,在机器人操作中展现出有前景的泛化能力。然而,这种进展可能具有误导性。尽管VLA具有零样本感知和语言能力,但它们的整体任务性能在分布偏移下常常下降,揭示了这些系统将高层次理解转化为鲁棒行为方面的差距。为了系统地研究这一差距,我们引入了Colosseum V2,这是一个大规模仿真基准,用于评估机器人学习中VLA在不同条件下的泛化能力。该基准包含28个任务,涵盖13个任务类别和两种机器人形态,覆盖了广泛的操作原语和长时域行为。基于ManiSkill仿真器构建,Colosseum V2支持快速、GPU并行化的评估,并支持大规模域内和域外测试。我们评估了包括Action Chunking Transformers (ACT)和Pi0.5在内的最先进方法,揭示了它们在基础性能和泛化方面的局限性。我们展示了仿真与真实世界指标之间的强相关性,支持了该基准的生态效度。通过在统一基准中标准化任务、指标和评估协议,Colosseum V2实现了可重复和公平的比较,降低了评估开销,并加速了向通用机器人策略的进展。

英文摘要

Vision-Language-Action (VLA) models demonstrate promising generalization in robotic manipulation, driven by advances in large-scale vision and language pre-training. This progress can be misleading. Despite the zero-shot perception and language capabilities of VLAs, their overall task performance often degrades under distribution shifts, revealing gaps in how these systems translate high-level understanding into robust behavior. To systematically study this gap, we introduce Colosseum V2, a large-scale simulation benchmark for evaluating VLA generalization in robot learning across diverse conditions. The benchmark comprises 28 tasks spanning 13 task categories and two robot morphologies, covering a wide range of manipulation primitives and long-horizon behaviors. Built on the ManiSkill simulator, Colosseum V2 enables fast, GPU-parallelized evaluation and supports both in-domain and out-of-domain testing at scale. We evaluate state-of-the-art methods, including Action Chunking Transformers (ACT) and Pi0.5, and reveal limitations in both base performance and generalization. We demonstrate strong correlations between simulation and real-world metrics that support the ecological validity of the benchmark. By standardizing tasks, metrics, and evaluation protocols within a unified benchmark, Colosseum V2 enables reproducible and fair comparisons, reduced evaluation overhead, and accelerated progress toward general-purpose robot policies.

2605.27758 2026-05-28 cs.LG cs.AI physics.comp-ph

High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention

Deepak Akhare, Mohammad Amin Nabian, Corey Adams, Sudeep Chavare, Sanjay Choudhry

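AI summary: This paper presents the GeoTransolver framework, which combines geometry-aware operator learning with a memory-efficient low-rank attention mechanism to predict industrial-scale crash dynamics at high fidelity, and validates its accuracy and efficiency on complex bumper beam and full-vehicle crash datasets.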

AI Chinese abstract

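Automotive crashworthiness optimization remains a safety-critical challenge: large-scale nonlinear structural deformation and energy dissipation must be managed through iterative, high-fidelity simulation. Traditional finite element solvers are computationally expensive, and emerging operator learning frameworks offer fast surrogate predictions; however, applying them to industrial-scale crash analysis, where complex geometry, contact nonlinearities, and rapidly evolving transient deformation coexist, remains an open challenge. In this paper, we demonstrate that the GeoTransolver framework offers a viable solution for accurate, high-fidelity crash dynamics prediction at industrial scale. Benchmarks on complex bumper beam and full-vehicle crash datasets show that GeoTransolver captures multi-scale geometric context and accurately resolves plastic deformation patterns as well as acceleration profiles at critical occupant locations. Beyond the architecture itself, we propose and systematically evaluate a set of temporal prediction strategies, including one-shot, time-conditional, and autoregressive rollout, and show that the one-shot approach achieves state-of-the-art accuracy while significantly reducing training overhead and inference latency. As a secondary contribution, we introduce a modification based on the Fast Low-rank Attention Routing Engine (FLARE), applied to the GeoTransolver attention backbone, that reduces memory overhead by approximately 2x while further improving predictive accuracy for O(N) long-range, high-frequency transients and preserving the geometry-aware cross-attention strengths of the base framework. Our results highlight the practical viability of geometry-aware operator learning for high-fidelity surrogate modeling of complex, safety-critical automotive dynamics.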

English abstract

Automotive crashworthiness optimization remains a safety-critical challenge, requiring the management of large-scale nonlinear structural deformations and energy dissipation through iterative, high-fidelity simulations. While traditional finite element solvers are computationally prohibitive, emerging operator learning frameworks provide rapid surrogate predictions; however, applying them to industrial-scale crash analysis, where complex geometry, contact nonlinearities, and rapidly evolving transient deformation coexist, remains an open challenge. In this paper, we demonstrate that the GeoTransolver framework provides a viable solution for accurate, high-fidelity crash dynamics prediction at industrial scale. Benchmarked on complex bumper beam and full-vehicle crash datasets, GeoTransolver captures multi-scale geometric context and accurately resolves plastic deformation patterns as well as acceleration profiles at critical occupant locations. Beyond the architecture itself, we propose and systematically evaluate a suite of temporal prediction recipes, including one-shot, time-conditional, and autoregressive rollout strategies, demonstrating that the one-shot approach achieves state-of-the-art accuracy with significantly reduced training overhead and inference latency. As a secondary contribution, we introduce a Fast Low-rank Attention Routing Engine (FLARE)-based modification to the GeoTransolver attention backbone that reduces memory overhead by approximately 2x while further improving predictive accuracy for O(N) long-range, high-frequency transients, preserving the geometry-aware cross-attention strengths of the base framework. Our results highlight the practical viability of geometry-aware operator learning for high-fidelity surrogate modeling of complex, safety-critical automotive dynamics.
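The three temporal prediction recipes named in the abstract differ mainly in how many model calls a trajectory needs and in whether predictions are fed back as inputs. The sketch below illustrates that difference on a toy scalar-per-node system. It is not GeoTransolver: `step_model`, the decay constant 0.9, and all function names are hypothetical stand-ins, and the closed-form expressions play the role that a trained network would play in practice.

```python
import numpy as np


def step_model(state: np.ndarray) -> np.ndarray:
    """Hypothetical one-step surrogate: maps the state at t to the state at t+1."""
    return 0.9 * state + 0.1


def autoregressive_rollout(initial_state: np.ndarray, num_steps: int) -> np.ndarray:
    """Feed each prediction back in as the next input (num_steps sequential calls)."""
    states = [initial_state]
    for _ in range(num_steps):
        states.append(step_model(states[-1]))
    return np.stack(states[1:])  # shape (num_steps, D)


def time_conditional(initial_state: np.ndarray, times) -> np.ndarray:
    """Query each requested time independently from the initial state."""
    # Closed form of the toy dynamics: x_t = 1 + 0.9**t * (x_0 - 1).
    return np.stack([1.0 + 0.9 ** t * (initial_state - 1.0) for t in times])


def one_shot(initial_state: np.ndarray, num_steps: int) -> np.ndarray:
    """Emit the whole trajectory in a single call."""
    t = np.arange(1, num_steps + 1)[:, None]        # shape (num_steps, 1)
    return 1.0 + 0.9 ** t * (initial_state - 1.0)   # broadcasts to (num_steps, D)


x0 = np.array([0.0, 2.0])
trajectory = one_shot(x0, 5)
print(trajectory.shape)
```

In this toy setting all three recipes return the same trajectory because the dynamics are known exactly. With a learned surrogate they would generally differ: an autoregressive rollout makes one sequential call per step and feeds its own errors back in, while a one-shot predictor needs a single call per trajectory, which is consistent with the lower inference latency the abstract reports for that recipe.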