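arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.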

2605.26717 2026-05-27 cs.IR cs.AI

L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation

Pingjun Pan, Tingting Zhou, Peiyao Lu, Tingting Fei, Hongxiang Chen, Chuanjiang Luo

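AI summary: Proposes L2Rec, which unifies behavioral and semantic understanding at the parameter level through a Dual-view Personalized Mixture-of-Experts mechanism, enabling end-to-end personalized recommendation; experiments show it outperforms existing methods.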

DOI: 10.1145/3805712.3809943
Comments: Accepted at SIGIR 2026

English abstract

Adapting large language models (LLMs) for personalized recommendation requires aligning their general-purpose capabilities with user-specific preferences while effectively leveraging both behavioral and semantic signals. Existing approaches typically integrate these signals at either the input level (e.g., injecting behavioral embeddings into the token space) or the output level (e.g., contrastive alignment of separate encoders), suffering from distribution gaps or lack of end-to-end task supervision. In this work, we introduce L2Rec, which unifies behavioral and semantic understanding at the parameter level of LLMs. Our key insight is that the same set of Transformer parameters can serve as a shared medium for both views: by applying view-specific, personalized low-rank perturbations via a Dual-view Personalized Mixture-of-Experts (DPMoE) mechanism, L2Rec enables a single LLM backbone to produce complementary behavioral and semantic adaptations for each user with minimal representation-level misalignment. An adaptive cross-view fusion module further integrates the dual-view outputs into a unified user preference. Experiments on four datasets show that L2Rec consistently outperforms state-of-the-art baselines, and online A/B testing on a large-scale industrial platform validates significant improvements in key engagement metrics.

2605.26715 2026-05-27 cs.LG

Image Feature Fusion-based Federated Client Unlearning (FCU)

Hangyi Shen, Yizhi Pan, Tiansuo Li, Weiqi Jiang, Guanqun Sun

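AI summary: To address catastrophic forgetting in federated unlearning, which degrades global generalization, proposes a federated client unlearning method based on a linear image feature fusion mechanism (Mixup) that dynamically generates mixed samples to bridge the forget and retain distributions, achieving unlearning effectiveness comparable to the retraining gold standard on medical imaging benchmarks.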

English abstract

Major data protection regulations all mention the "right to be forgotten," and that's what pushed federated unlearning (FU) techniques forward. But one stubborn issue remains: catastrophic forgetting--you erase the target knowledge, yet somehow you also end up throwing out essential retained knowledge, which then hurts the model's global generalization. To get a better balance between unlearning effectiveness and generalization ability, we propose something called Image Feature Fusion-based Federated Client Unlearning (IFF-FCU). The idea is to bring in a linear Image Feature Fusion mechanism (Mixup) that dynamically creates mixed samples, bridging the gap between forget-distribution and retain-distribution. What this strategy does isn't just deleting a few discrete data points--it theoretically widens and regularizes the forgetting boundary. We ran extensive experiments on medical imaging benchmarks (RSNA-ICH and ISIC2018), and the results show that our approach achieves reasonably good unlearning. For instance, on the ICH dataset, IFF-FCU achieves a highly competitive Error deviation from the retrained gold standard, demonstrating robust improvements over existing baselines.
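
The abstract describes the fusion step only as a linear Mixup that blends forget-distribution and retain-distribution samples. A minimal sketch of that blending, assuming the standard Mixup recipe (a Beta-distributed mixing coefficient applied to both inputs and labels), could look like the following. The function name `mixup_pair`, the flat-list inputs, and `alpha=0.4` are illustrative assumptions, not details from the paper.

```python
import random


def mixup_pair(forget_x, retain_x, forget_y, retain_y, alpha=0.4, rng=random):
    """Blend one forget-set sample with one retain-set sample.

    Inputs are flat lists of floats (pixels or features) and one-hot label
    lists. The Beta(alpha, alpha) draw follows the standard Mixup recipe;
    alpha=0.4 is an arbitrary illustrative value, not taken from the paper.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    mixed_x = [lam * f + (1.0 - lam) * r for f, r in zip(forget_x, retain_x)]
    mixed_y = [lam * f + (1.0 - lam) * r for f, r in zip(forget_y, retain_y)]
    return mixed_x, mixed_y, lam


if __name__ == "__main__":
    random.seed(0)
    x, y, lam = mixup_pair([1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 0.0], [0.0, 1.0])
    print(lam, x, y)
```

How IFF-FCU pairs samples across clients and which loss it applies to the mixed samples is not stated in the abstract, so the sketch stops at sample construction.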

2605.26713 2026-05-27 stat.ML cs.LG

Transformers Can Learn Posterior Predictive Distributions In-Context

Gyeonghun Kang, Changwoo J. Lee, Xiang Cheng

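AI summary: Shows by construction that transformers can implement a gradient descent algorithm targeting the posterior predictive mean and variance, studies error bounds for the approximated posterior predictive distribution, and highlights the key role of normalization and attention depth in generalization.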

English abstract

Prior-data fitted networks (PFNs) have recently emerged as a powerful approach for Bayesian prediction tasks, approximating the posterior predictive distribution (PPD) through in-context learning. Despite their strong empirical performance and ability to go beyond point predictions, theoretical understandings of the algorithmic capability of transformers to learn distributions in context are still lacking. Focusing on Gaussian process regression problems, we show by construction that transformers can implement a gradient descent algorithm targeting the posterior predictive mean and variance, followed by nonlinear mappings that yield binned probabilities of PPD. We study the error bounds of the approximated PPD in terms of attention depth and bin resolution. Based on these results, we further demonstrate the key role of normalization and the choice of attention depth in enabling the extrapolation abilities of transformers beyond the pretraining sample size range. We conduct simulations that corroborate our findings, providing insight into the expressivity of PFNs targeting PPDs and how architectural choices may influence generalization capabilities.
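
To make the target of that construction concrete, here is a rough sketch of the classical computation it refers to: gradient descent on a quadratic objective recovers the Gaussian process posterior predictive mean and variance, and a Gaussian CDF then turns them into binned probabilities. This is plain numerical code, not the transformer construction from the paper. The RBF kernel, the noise level, the step size, and the choice to include observation noise in the predictive variance are all assumptions made for illustration.

```python
import math


def rbf(a, b, length=1.0):
    """Squared-exponential kernel (an assumed choice of kernel)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def gd_solve(A, b, steps=2000):
    """Solve A z = b by gradient descent on 0.5 z^T A z - b^T z (A symmetric PD)."""
    n = len(b)
    eta = 1.0 / sum(A[i][i] for i in range(n))  # trace(A) >= largest eigenvalue
    z = [0.0] * n
    for _ in range(steps):
        grad = [sum(A[i][j] * z[j] for j in range(n)) - b[i] for i in range(n)]
        z = [z[i] - eta * grad[i] for i in range(n)]
    return z


def gp_predict(xs, ys, x_star, noise=0.1):
    """Posterior predictive mean and variance at x_star for 1-D inputs."""
    n = len(xs)
    A = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k_star = [rbf(x, x_star) for x in xs]
    alpha = gd_solve(A, ys)      # approximates (K + noise*I)^{-1} y
    beta = gd_solve(A, k_star)   # approximates (K + noise*I)^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_star, x_star) + noise - sum(k * b for k, b in zip(k_star, beta))
    return mean, var


def binned_probs(mean, var, edges):
    """Probability mass of N(mean, var) in each bin defined by sorted edges."""
    def cdf(t):
        return 0.5 * (1.0 + math.erf((t - mean) / math.sqrt(2.0 * var)))
    return [cdf(hi) - cdf(lo) for lo, hi in zip(edges[:-1], edges[1:])]


if __name__ == "__main__":
    m, v = gp_predict([0.0, 1.0, 2.0], [0.0, 0.8, 0.9], 1.5)
    print(m, v, binned_probs(m, v, [-2.0, 0.0, 0.5, 1.0, 3.0]))
```

The number of gradient steps plays the role that attention depth plays in the abstract's error bounds, and the spacing of `edges` corresponds to bin resolution; both analogies are loose and offered only as orientation.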

2605.26712 2026-05-27 cs.CV

METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition

Mélodie Boillet, Solène Tarride, Christopher Kermorvant

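AI summary: Proposes the METATR benchmark, which uses diverse multilingual documents, a standardized evaluation framework, and a dynamic update mechanism to comprehensively evaluate automatic text recognition systems (especially vision large language models) and to support model comparison and selection.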

English abstract

Benchmarks that reflect the diversity and complexity of real-world documents are essential for accurately evaluating Automatic Text Recognition (ATR) systems, especially Vision-Large Language Models (vLLMs). Although recent models demonstrate impressive performance, they are often evaluated on datasets containing modern, printed texts mostly written in English, which limits their relevance to many practical applications. Therefore, selecting a model for a specific use case requires evaluating it on data that matches the target documents. This highlights the importance of representative benchmarks for real-world applications. In this paper, we introduce METATR (v1.0), a multilingual, evolving benchmark designed to evaluate ATR models across a wide range of documents, facilitating meaningful model comparison and selection. The benchmark was designed to maximize diversity by including documents from various public collections. These documents cover 29 languages and include texts with multiple scripts and layouts. Beyond the dataset itself, METATR defines a standardized prompting and normalization methodology and establishes a dynamic evaluation framework. This approach is intended to produce reproducible results while remaining extensible over time. We evaluated a wide range of state-of-the-art systems, including open-source models and closed-source models. Results are reported across various dimensions, including performance at the dataset and language levels, robustness to handwritten documents, and computational efficiency. Our findings show that, although proprietary models achieve the most consistent performance, substantial variability persists across scripts and layouts. Overall, METATR provides a multidimensional, practitioner-oriented framework for assessing multilingual ATR in real-world conditions and tracking progress as the field evolves.

2605.26710 2026-05-27 cs.RO

Look Further: Socially-Compliant Navigation System in Residential Buildings

Akira Shiba, Marina Obata, Nathan Kau, Zoltan Beck, Rishi Shah, Michael Sudano, Sabrina Lee

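AI summary: Proposes the Proactive Lane-Changing (PLC) motion pattern, which extends the reaction distance to more than 8 meters to improve how people perceive the robot's motion, and significantly improves safety, smoothness, and politeness in the straight hallway scenario.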

DOI: 10.1109/HRI61500.2025.10973828
Journal ref: 2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Melbourne, Australia, 2025, pp. 272-282
Comments: 2025 ACM/IEEE International Conference on Human-Robot Interaction

English abstract

The distance at which a mobile robot reacts to a person strongly impacts various qualities of the human-robot interaction. In this paper, we focus on the navigation of a mobile delivery robot platform in a residential indoor hallway environment. Social navigation methods typically focus on avoiding uncomfortable human-robot interactions, such as when a robot encroaches on someone's personal space. Since personal space has been shown to be in the range of just a few meters, social navigation methods typically focus on deconflicting and resolving these short-range interactions. In this work, however, we demonstrate that by extending the reaction distance to over eight meters, far beyond the typical interaction distance, we can improve the human's perception of the robot's motion. We introduce the Proactive Lane-Changing (PLC) motion pattern and a navigation system that leverages it to react to people at an increased distance. This pattern consists of changing the robot's lateral position as it navigates down the hallway from the center to the side at an eight-meter distance from an oncoming person. We conducted a user study with 42 participants to assess their impressions of the delivery robot based on three service objectives: safety, smoothness, and politeness. In the straight hallway scenario (Frontal Approach), results showed significant improvement in each of these three objectives compared to typical motion patterns found in the literature: slowing down, stopping, and reactive collision avoidance in the proximity of a person. In contrast, in the intersection (Blind Corner) scenarios, none of the approaches performed significantly better than any other, with participants having a diverse range of preferences among robot motion patterns.

2605.25981 2026-05-27 cs.CL

When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation

Liyun Zhang, Jiayi Guo

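AI summary: Through a 68-cell measurement study, finds that chain-of-thought and ReAct agents driven by large language models are more sensitive to semantic perturbations (e.g., paraphrase, synonym) than to presentation perturbations (e.g., formatting, reordering), and proposes a "stealth-divergence" explanation based on a held-out model and trace-level mechanism analysis.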

English abstract

We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and $\sim$11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired $t=9.58$, $p<0.0001$), with 64/68 cells positive. The gap survives four severity-proxy audits and remains significant when excluding qwen models (+11.10 pp, $p<0.0001$). Several stress tests fail honestly: cluster-bootstrap significance disappears under stricter assumptions, tractability contrasts do not replicate, cross-architecture generator swaps break per-cell rankings, and a second LLM judge yields only moderate agreement ($κ=0.50$). We then validate the headline effect on a fully held-out 11th model (qwen2.5-14B-Instruct; 1,800 trajectories) and re-test a pre-registered capability$\times$tractability partition, observing a small but positive held-out effect (3/4 cells positive; pooled Welch $t=3.81$, $p=9.6\times10^{-4}$). Using held-out trajectories, we probe four trace-level mechanism signals. Two prior mechanism claims fail to replicate and are explicitly retracted. Two new probes instead support a \emph{stealth-divergence} picture: semantic perturbations often preserve the first action but induce divergence in intermediate reasoning from later steps onward, accompanied by slightly deeper trajectories. We position this as a measurement contribution with held-out replication and a partial trace-level account of how semantic perturbations propagate through agent reasoning. Code, perturbation corpus, raw trajectories, and analysis scripts are released anonymously for review.
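
The headline statistic above is a paired comparison across cells: each cell contributes one semantic-perturbation inconsistency rate and one presentation-perturbation rate, and the gap is tested with a paired t statistic. A minimal sketch of that calculation, assuming per-cell rates in percentage points, is shown below. The function name and the four example cells are invented for illustration and are not the paper's data.

```python
import math


def paired_t(semantic_rates, surface_rates):
    """Mean gap, paired t statistic, and count of positive cells.

    Each list holds one inconsistency rate per cell (same cell order in both).
    """
    gaps = [s - p for s, p in zip(semantic_rates, surface_rates)]
    n = len(gaps)
    mean_gap = sum(gaps) / n
    var = sum((g - mean_gap) ** 2 for g in gaps) / (n - 1)  # sample variance
    t_stat = mean_gap / math.sqrt(var / n)
    positive = sum(1 for g in gaps if g > 0)
    return mean_gap, t_stat, positive


if __name__ == "__main__":
    # Made-up rates for four hypothetical cells, in percentage points.
    semantic = [30.0, 25.0, 40.0, 20.0]
    surface = [10.0, 15.0, 20.0, 22.0]
    print(paired_t(semantic, surface))
```

The severity matching, cluster bootstrap, and Welch test mentioned in the abstract are separate procedures that this sketch does not cover.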

2605.25930 2026-05-27 cs.SD

CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yongchang Gan, Yong Qin

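AI summary: Proposes CosyEdit2, which uses a two-stage post-training framework (supervised editing initialization followed by editing-oriented GRPO over target-speech-free data) to address local acoustic consistency in speech editing and zero-shot TTS, substantially improving editing performance and strengthening zero-shot TTS capability.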

English abstract

Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.

2605.25731 2026-05-27 cs.CL

Trait-Aware Policy Optimization for Autoregressive Multi-Trait Essay Scoring

Zhengyang Wang, Sanwoo Lee, Jiaxin Wang, Chenxi Miao, Weikang Li, Yunfang Wu

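AI summary: Proposes the Trait-Aware Policy Optimization (TAPO) framework, which decomposes rewards along the sample and trait dimensions and combines them with enhanced prompts to improve autoregressive multi-trait essay scoring.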

English abstract

Multi-trait essay scoring aims to provide fine-grained evaluation of writing quality across multiple dimensions. However, how to effectively post-train autoregressive scoring models remains underexplored. In this paper, we propose Trait-Aware Policy Optimization (TAPO), a post-training framework tailored to autoregressive multi-trait scoring. Our method decomposes rewards along both the sample and trait dimensions, combining global scoring consistency, trait-level accuracy, format validity, and inter-trait dependency preservation. In addition, we use enhanced prompts throughout training by incorporating original prompt texts and trait descriptions, providing richer semantic information for trait-specific score generation. Experiments across multiple backbone models show that our method consistently improves multi-trait scoring performance over supervised fine-tuning and scalar-reward optimization baselines, demonstrating the effectiveness and transferability of trait-aware post-training for essay scoring.

2605.25510 2026-05-27 cs.CL

The Age of Curiosity Meets the Age of AI: Benchmarking Child Safety in Large Language Models

Samee Arif, Angana Borah, Rada Mihalcea

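AI summary: To address the safety of LLM use by children aged 7-11, proposes KIDBench, a benchmark grounded in developmental psychology; shows that implicit cues and explicit age instructions improve safety; and develops the child-safety evaluator KIDGuardLlama and the response model KIDLlama.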

English abstract

Children increasingly have access to Large Language Models (LLMs), which may expose them to responses that are developmentally inappropriate or require age-sensitive safety, guidance, and boundaries. Existing LLM safety evaluations largely focus on harmful-content avoidance and do not explicitly target child-facing safety. We introduce KIDBench, a benchmark for evaluating child-facing LLM safety for ages 7-11 using a developmental-psychology-grounded LLM-as-a-Judge rubric. KIDBench contains realistic child queries across ten categories, with single-turn prompts and multi-turn child-actor simulations. We compare no-cues prompts with no child context, implicit-cues prompts that suggest a child speaker, and explicit age instructions. Implicit-cues improve scores by 9-47% across models, while explicit age adds a further 10-30% gain. Cross-lingual and cultural evaluations show uneven safety behavior across languages and country contexts. Multi-turn simulations show that child-facing response quality can degrade by 6-24% from the first to worst turn. Beyond evaluation, we introduce KIDGuardLlama, a child-safety evaluator, and KIDLlama, a child-oriented response model, showing how KIDBench supports safer child-facing AI.

2605.25507 2026-05-27 cs.AI

Credit Assignment with Resets in Language Model Reasoning

Ankur Samanta, Akshayaa Magesh, Ayush Jain, Youliang Yu, Daniel Jiang, Kavosh Asadi, Kaveh Hassani, Paul Sajda, Jalaj Bhandari, Yonathan Efroni

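AI summary: Proposes two methods, Random-Reset Policy Optimization (RRPO) and Self-Reset Policy Optimization (SRPO), which improve credit assignment in multi-step language model reasoning by resetting to intermediate states and resampling counterfactual continuations; SRPO outperforms standard GRPO and RRPO on multiple reasoning benchmarks.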

English abstract

Contemporary reinforcement learning with verifiable reward methods post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all tokens in a trajectory. Such uniform assignment ignores which steps contributed to success or failure. Improving credit assignment can address this limitation by enabling targeted refinement of faulty reasoning steps, rather than updating entire trajectories uniformly. Resets are one such simple mechanism, enabling more precise credit assignment by returning to an intermediate state and resampling counterfactual continuations, so that outcome differences can be attributed to decisions made at that point. We propose two such methods: Random-Reset Policy Optimization (RRPO), where reset states are drawn randomly from reasoning steps, and Self-Reset Policy Optimization (SRPO), where the model self-localizes the erroneous step in an incorrect trajectory and resets there. We analyze these methods within the Conservative Policy Iteration (CPI) framework. Extending CPI with a credit-assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO by sampling multiple suffix continuations at a self-localized reset and learning from their rewards, using only the model itself with no external supervision.
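
The reset idea can be illustrated with a small sketch: keep a shared prefix of reasoning steps, resample several suffix continuations from that point, and score each continuation relative to the others in the group, as GRPO-style methods commonly do. The code below is an assumption-laden illustration, not the paper's algorithm. `sample_suffix` and `reward_fn` are hypothetical callables standing in for the policy and the verifier, and normalizing by the group standard deviation is one common convention.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of sampled continuations."""
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]


def reset_and_score(prefix_steps, sample_suffix, reward_fn, num_suffixes=4):
    """Resample continuations from a shared prefix and score them against each other.

    prefix_steps: list of reasoning steps kept fixed (the reset state).
    sample_suffix: hypothetical callable returning a list of continuation steps.
    reward_fn: hypothetical callable mapping a full trajectory to a scalar reward.
    """
    suffixes = [sample_suffix(prefix_steps) for _ in range(num_suffixes)]
    rewards = [reward_fn(prefix_steps + s) for s in suffixes]
    return list(zip(suffixes, group_relative_advantages(rewards)))


if __name__ == "__main__":
    pool = iter([["ok"], ["bad"], ["ok"], ["bad"]])
    scored = reset_and_score(
        ["step 1", "step 2"],
        lambda prefix: next(pool),
        lambda traj: 1.0 if traj[-1] == "ok" else 0.0,
    )
    for suffix, adv in scored:
        print(suffix, round(adv, 3))
```

Because every continuation shares the same prefix, reward differences within the group can only come from the steps after the reset, which is the attribution argument the abstract makes. How SRPO locates the reset step is not covered here.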

2605.25480 2026-05-27 cs.CL

Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki

Haoliang Ming, Feifei Li, Xiaoqing Wu, Wenhui Que

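AI summary: Proposes LLM-Wiki, a system that organizes external knowledge into compilable, composable, and self-evolving Wiki pages, exposes search, read, and link-following operations through a tool-calling interface, and introduces an Error Book for continual self-correction, achieving state-of-the-art results on several multi-hop QA benchmarks.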

Comments: 15 pages, 3 figures, 10 tables, 1 algorithm

English abstract

LLM agents require retrieval to behave less like one-shot context fetching and more like reasoning: searching, reading, traversing, and deciding when evidence is sufficient. Yet current Retrieval-Augmented Generation (RAG) systems organize external knowledge as flat chunks retrieved by embedding similarity, exposing a retrieval-as-lookup interface ill-suited to iterative reasoning agents. We propose LLM-Wiki, an agent-native retrieval system that operationalizes the Retrieval-as-Reasoning paradigm by treating external knowledge as a compilable, composable, and self-evolving structure rather than a static retrieval index. LLM-Wiki compiles documents into structured Wiki pages with bidirectional links, exposes search, read, and link-following operations through standard tool-calling interfaces, and introduces an Error Book for persistent structural and semantic self-correction. LLM-Wiki achieves state-of-the-art results on HotpotQA, MuSiQue, and 2WikiMultiHopQA, outperforming HippoRAG 2, LightRAG, and GraphRAG by 2.0-8.1 F1 points. On AuthTrace, LLM-Wiki achieves the best overall accuracy, with especially strong gains on multi-document structured queries, confirming that compilation-based retrieval generalizes beyond chain-style multi-hop reasoning.
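
As a rough picture of the interface described above, the sketch below stores pages with bidirectional links and exposes the three operations the abstract names: search, read, and link following. All class and method names are hypothetical, and the substring search is a deliberately naive stand-in; the abstract does not say how LLM-Wiki compiles or indexes pages.

```python
from dataclasses import dataclass, field


@dataclass
class WikiPage:
    title: str
    body: str
    links: set = field(default_factory=set)      # outgoing links
    backlinks: set = field(default_factory=set)  # incoming links


class WikiStore:
    """Toy page store; each method could be wrapped as one agent tool."""

    def __init__(self):
        self.pages = {}

    def add_page(self, title, body):
        self.pages[title] = WikiPage(title, body)

    def link(self, src, dst):
        # Record the link in both directions so traversal works either way.
        self.pages[src].links.add(dst)
        self.pages[dst].backlinks.add(src)

    def search(self, query):
        q = query.lower()
        return sorted(t for t, p in self.pages.items()
                      if q in t.lower() or q in p.body.lower())

    def read(self, title):
        return self.pages[title].body

    def follow_links(self, title):
        page = self.pages[title]
        return sorted(page.links | page.backlinks)


if __name__ == "__main__":
    store = WikiStore()
    store.add_page("Marie Curie", "Physicist and chemist; worked with Pierre Curie.")
    store.add_page("Pierre Curie", "Physicist; shared the 1903 Nobel Prize in Physics.")
    store.link("Marie Curie", "Pierre Curie")
    print(store.search("nobel"), store.follow_links("Pierre Curie"))
```

An agent would call these operations iteratively and decide for itself when the pages it has read are sufficient. The Error Book mechanism is not modeled in this sketch.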

2605.25382 2026-05-27 cs.CL

AuthTrace: Diagnosing Evidence Construction in Thematically Dense Single-Author Corpora

Xiaoqing Wu, Feifei Li, Haoliang Ming, Wenhui Que

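AI summary: Proposes the AuthTrace benchmark, which uses a fan-in gradient to diagnose the evidence recall, evidence precision, and answer correctness of evidence-construction systems on thematically dense single-author corpora, and finds that evidence recall is the strongest predictor of answer correctness.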

English abstract

Evidence construction--the stage that determines which passages reach the language model before generation begins--is evaluated paradigm by paradigm, leaving practitioners with no principled way to diagnose which organization strategy fails, where, or why. We introduce AuthTrace, a diagnostic benchmark built on thematically dense single-author corpora where near-miss distractors share style, topic, and vocabulary with the required evidence. AuthTrace provides explicit quoted evidence, exact fan-in annotation, and a unified pack-level protocol measuring evidence recall, evidence precision, and answer correctness. A fan-in gradient--the number of source documents required to support the answer--serves as the primary diagnostic axis, enabling controlled comparison across retrieval, memory, graph, and structured-evidence paradigms. Evaluating eight systems across two QA models, we find that evidence recall is the strongest observed predictor of answer correctness under the primary reader-judge pair (r = 0.96); most failures stem from missing evidence rather than answer synthesis. Fan-in further exposes paradigm-specific collapse patterns: flat retrieval degrades 2-3x faster than thematically organized evidence construction. These results show fan-in decomposition to be a reusable diagnostic lens for identifying where evidence-construction systems fail and which paradigm best serves a given workload.
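
The pack-level metrics named above can be sketched in a few lines, under the assumption that evidence is matched at the level of passage identifiers and that fan-in is the number of distinct source documents behind the gold evidence. The function names and identifiers below are hypothetical, and AuthTrace's actual matching protocol may differ.

```python
def evidence_metrics(pack_passages, gold_passages):
    """Evidence recall and precision of a retrieved pack against gold evidence."""
    pack, gold = set(pack_passages), set(gold_passages)
    hits = len(pack & gold)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(pack) if pack else 0.0
    return recall, precision


def fan_in(gold_passages, passage_to_doc):
    """Number of distinct source documents needed to support the answer."""
    return len({passage_to_doc[p] for p in gold_passages})


if __name__ == "__main__":
    gold = ["p1", "p2", "p3", "p4"]
    pack = ["p1", "p2", "p9"]
    docs = {"p1": "d1", "p2": "d1", "p3": "d2", "p4": "d3"}
    print(evidence_metrics(pack, gold), fan_in(gold, docs))
```

Grouping questions by `fan_in` and averaging recall within each group would give the kind of gradient the abstract uses as its diagnostic axis.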

2605.25281 2026-05-27 cs.CL cs.AI

READER: Reasoning-Enhanced AI-Generated Text Detection

Pingfan Su, Kai Ye, Shijin Gong, Erhan Xu, Jin Zhu, Giulia Livieri, Chengchun Shi

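AI summary: Proposes READER, which fine-tunes a 1.5B-parameter LLM on the structured reasoning dataset READ to combine reasoning with detection, outperforming models 100-1000 times larger, such as GPT-5.2, under distribution shift.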

2605.24636 2026-05-27 cs.AI cs.CL

GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

Junjie Zhao, Jingyi Liang, Zhenyang Cai, Jiaming Zhang, Zhenwei Wen, Shuzhi Deng, Wenjing Yi, Chunfeng Luo, Hexian Zhang, Junying Chen, Tianrui Liu, Zhuhui Bai, Zixu Zhang, Pradeep Singh, Xiang Liu, Jianquan Li, Nhan L Tran, Falk Schwendicke, Zuolin Jin, Lijian Jin, Liangyi Chen, Wei-fa Yang, Benyou Wang, Junwen Wang, Shan Jiang

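AI summary: Proposes GlobalDentBench, the first multinational dental benchmark, comprising 8,978 expert-validated questions across 14 specialties and 88 countries and regions and evaluating three reasoning levels; shows that the performance of current large language models on dental clinical reasoning degrades as complexity increases and that their recommendations carry high risk.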

English abstract

While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, the automated construction framework was calibrated by six senior dentists, achieving expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluation of 12 frontier LLMs on GlobalDentBench revealed a sharp, stepwise performance degradation with increasing reasoning complexity. Specifically, accuracy plummeted from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, while declining markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases demonstrated an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm and risks particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. Consequently, GlobalDentBench provides a scalable foundation for trustworthy clinical AI evaluation, underscoring the urgent need for rigorous validation before the safe deployment of these models in healthcare.

2605.24465 2026-05-27 cs.RO

Polymander II: an amphibious salamander-inspired robot with contact and flow sensors

Qiyuan Fu, Sudong Lee, Andrea Grillo, Jonathan Arreguit, Louis Gevers, Josie Hughes, Auke J. Ijspeert

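AI summary: Presents an amphibious robot that uses Hall-effect sensors to sense foot contact forces and lateral hydrodynamic forces, enabling environmental sensing and feedback control on land and in water.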

Comments: This work has been accepted for publication in the 2026 International Conference on Robotics and Automation (ICRA), Vienna, Austria

English abstract

Robots benefit from sensory information to coordinate body movement, gain robustness against perturbations, and transition between different modes to adapt to various terrains. However, few amphibious robots can sense interactions with both terrestrial and aquatic environments. In this paper, we present a solution that uses Hall-effect sensors to sense foot contact forces and lateral hydrodynamic forces on a salamander-inspired amphibious robot. With two bus lines, the robot can simultaneously acquire this exteroceptive information at more than 500 Hz and proprioceptive information, such as joint positions and loads, at 100 Hz. The Hall-effect sensors used are compact, making them suitable for embedding in multiple positions within a robot, and exhibit high sensitivity to small forces. Moreover, because the sensor can be positioned separately from the measured object, waterproofing can be implemented with relative ease. Our tests demonstrate the robot's capabilities in traversing amphibious environments and its potential in using feedback control for more complex locomotion tasks.

2605.24456 2026-05-27 cs.CV

EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy

Jinzhao Li, Yinuo Chen, Dongxu Piao, Panwang Pan, Yifan Yu, Dong Wang, Honglei Yan, Liang Yue, Shaofei Wang, Yixin Chen, Siyuan Huang, Miao Liu

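AI summary: Proposes the EgoProx benchmark, which uses cognitive-chain tasks and an agent-based data engine to evaluate multimodal large language models on egocentric 3D proximity reasoning, and finds that the models possess spatial knowledge but struggle to use it effectively.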

Comments: Accepted to CVPR 2026

英文摘要

Humans constantly reason about 3D proximity, the relations between their body and surrounding objects, to guide perception and action in daily life. Whether multimodal large language models (MLLMs) can perform such embodied 3D reasoning remains unclear. To this end, we introduce EgoProx, a benchmark for egocentric 3D proximity reasoning. We organize our tasks along a cognitive chain, covering intention, exploration, exploitation, and chain-of-actions reasoning. We also design an agent based data engine that produces diverse and consistent QA pairs at scale. We benchmark prevailing MLLMs on EgoProx and conduct additional analyses with dataset specific and task specific instruction tuning. We observe large cross-domain gains, indicating that current MLLMs contain some spatial knowledge; however, they still struggle to effectively leverage it for spatial reasoning VQA.

2605.24219 2026-05-27 cs.AI

Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

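Harshada Badave, Santosh Borse, Andrea Gomez, Harshitha Narahari, Sara Carter, Vishwa Bhatt, Aishani Rachakonda, Shuxin Lin, Dhaval Patel

AI summary: Introduces Trajel, a dataset and evaluation framework that audits trajectory-level hallucinations in multi-agent industrial workflows with a five-type hallucination taxonomy; it identifies common failure modes that existing benchmarks miss and shows that trajectory-aware detection outperforms standard post-hoc verification.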

Abstract

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Trajel introduces a five-type hallucination taxonomy (factual, referential, logical, procedural, and scope-based) over expert-annotated agent traces from AssetOpsBench. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that the most common failure modes are missed by existing benchmarks, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, making taxonomy-grounded evaluation necessary for safer agentic deployment.

2605.24041 2026-05-27 cs.LG cs.AI

Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation

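Xiaotian Liu, Shuyuan Shang, Xiaopeng Wang, Pu Ren, Yaoqing Yang

AI summary: Proposes the Iterative Refinement Neural Operator (IRNO), which applies a learned refinement module through fixed-point iteration and combines it with a progressive spectral loss to mitigate the spectral bias of neural operators, markedly reducing high-frequency error on physical systems such as turbulent flow and active matter.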

Comments: 47 pages; accepted to ICML 2026 as a Spotlight

Abstract

Neural operators serve as fast, data-driven surrogates for scientific modeling but typically rely on a monolithic, single-pass inference procedure that struggles to resolve high-frequency details, a limitation known as spectral bias. We introduce the Iterative Refinement Neural Operator (IRNO), which augments pre-trained operators with a learned refinement module applied iteratively via fixed-point iteration. IRNO decomposes the prediction into a coarse initialization followed by successive residual corrections, paralleling classical numerical solvers. Under local assumptions, we establish contraction of the induced operator, ensuring convergence to a unique fixed point. To explicitly target high-frequency errors, we propose a progressive spectral loss that adaptively increases the penalty on high-frequency components over refinement steps during training. Across physical systems, IRNO consistently lowers error, with up to 56.05% improvement on turbulent flow. On Active Matter, spectral analysis reveals that, relative to the base operator, the normalized error ratios decrease to 27.72-36.10% at low, 5.07-6.68% at mid, and 1.48-2.04% at high frequencies, remaining stable beyond the trained iteration count. Code is available at https://github.com/xiaotianliu-dartmouth/Iterative_Refinement_Neural_Operator
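As a rough illustration of the coarse-initialization-plus-residual-correction loop described in this abstract, the sketch below runs a fixed-point refinement on a small linear system. Everything in it is an assumption made for illustration: `base_operator`, `refine`, and the matrix `A` are hypothetical, and the refinement step is a hand-written Jacobi update standing in for the learned module, not the IRNO architecture.

```python
# Toy stand-in for iterative refinement: u_{k+1} = u_k + R(a, u_k).
# The "refinement module" is a hand-written Jacobi step, NOT a learned network.

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]  # strictly diagonally dominant, so the update contracts


def matvec(matrix, vec):
    """Dense matrix-vector product on plain lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def base_operator(a):
    """Coarse single-pass guess: ignore the off-diagonal coupling."""
    return [a_i / A[i][i] for i, a_i in enumerate(a)]


def refine(a, u):
    """Residual correction: (a - A u) scaled by the diagonal of A."""
    residual = [a_i - au_i for a_i, au_i in zip(a, matvec(A, u))]
    return [r_i / A[i][i] for i, r_i in enumerate(residual)]


def iterative_refinement(a, steps):
    """Return the coarse guess followed by every refined iterate."""
    u = base_operator(a)
    history = [u]
    for _ in range(steps):
        u = [u_i + c_i for u_i, c_i in zip(u, refine(a, u))]
        history.append(u)
    return history


def residual_norm(a, u):
    """Max-norm of the residual a - A u."""
    return max(abs(a_i - au_i) for a_i, au_i in zip(a, matvec(A, u)))


a = [1.0, 2.0, 3.0]
history = iterative_refinement(a, steps=20)
errors = [residual_norm(a, u) for u in history]
```

Because `A` is strictly diagonally dominant, the update map is a contraction and the values in `errors` shrink toward zero. This mirrors the convergence claim only in spirit; the paper's result concerns a learned operator under local assumptions.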

2605.23910 2026-05-27 cs.CL cs.AI

Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches

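Marcin Michał Mirończuk

AI summary: Through a systematic review and meta-analysis, this paper proposes a unified framework, quantifies the performance gains of multimodal and multiview fusion in document classification, and exposes a lack of methodological rigour.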

DOI: 10.1016/j.inffus.2026.104247
Journal ref: Information Fusion, 132, 2026, 104247

Abstract

Information fusion is widely used to improve document classification by integrating multiple data sources (multimodal) or representations (multiview). However, the field lacks a unified framework, a quantitative synthesis of its effectiveness, and clear guidance for practitioners. This systematic review addresses these gaps by analysing 139 primary studies. It introduces a formal framework to structure the field, presents the results of a qualitative analysis to identify key trends, and performs a random-effects meta-analysis (to our knowledge, the first focused on document classification) to quantify performance gains. Our meta-analysis reveals that multimodal fusion significantly improves accuracy (mean gain of +5.28 percentage points, $p=0.0016$); the F1-score effect is directionally positive but statistically non-significant in our primary model. Multiview fusion provides consistent but modest gains for accuracy (+4.67%), F1-score (+3.08%), and recall (all $p<0.05$). Critically, our qualitative synthesis uncovers reproducibility challenges in methodological rigour: only 11.8% (multimodal) and 23.3% (multiview) of the studies use statistical tests to validate their findings, which undermines the reliability of many of their results. This review's primary contributions are a unifying framework, the first quantitative evidence base, and data-driven guidelines. It concludes that successful information fusion depends not on algorithmic complexity, but on the strategic alignment of the fusion method with the task context and a commitment to more rigorous validation.
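For readers unfamiliar with random-effects pooling, the following minimal sketch shows how per-study gains could be combined into one estimate with a two-sided p-value. It assumes the DerSimonian-Laird estimator for the between-study variance, which the abstract does not name, and the study values in `effects` and `variances` are made up for illustration rather than taken from the review.

```python
import math


def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling of per-study effect sizes."""
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                         # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    z = pooled / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))          # two-sided normal p-value
    return pooled, se, tau2, p_value


# Hypothetical accuracy gains (percentage points) and their sampling variances.
effects = [6.1, 3.9, 5.5, 4.8]
variances = [1.2, 0.8, 2.0, 1.5]
pooled, se, tau2, p_value = random_effects_pool(effects, variances)

# Sanity case: identical studies imply zero between-study variance.
same_pooled, _, same_tau2, _ = random_effects_pool([2.0, 2.0, 2.0], [1.0, 1.0, 1.0])
```

When studies disagree more than their sampling variances explain, `tau2` becomes positive and the weights flatten, so no single large study dominates the pooled gain.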

2605.22557 2026-05-27 cs.LG cs.NA math.NA

Neural Flow Operators can Approximate any Operator: Abstract Frameworks and Universal Approximations

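Shuang Chen, Juncai He, Xue-Cheng Tai

AI summary: Proposes an abstract neural flow framework covering continuous-depth models with composition and separation structures, proves universal approximation properties in finite- and infinite-dimensional spaces, and unifies residual and plain architectures through time discretization.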

Abstract

We introduce an abstract neural flow framework for neural networks and neural operators. The framework contains two continuous-depth models, namely neural flows with composition and separation structures, and covers both finite-dimensional function approximation and infinite-dimensional operator approximation. We prove well-posedness and universal approximation properties for the corresponding neural flows, including, to the best of our knowledge, the first universal approximation result for flow-based models between infinite-dimensional spaces. We also obtain universal approximation results for convolutional neural flow models. Through suitable time discretizations, the composition structure recovers ResNet-type architectures, while the separation structure, via a splitting-based discretization, yields plain architectures. This gives a unified flow-based route to both residual and plain architectures for neural networks and neural operators with fully connected or convolutional linear layers.
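The link between time discretization and ResNet-type architectures can be sketched with the familiar forward-Euler scheme: one Euler step of du/dt = f(u) is an identity skip connection plus a scaled layer output. The abstract does not give the paper's flow equations, so the code below is only a generic illustration; `residual_block`, `euler_flow`, and the velocity field `decay_layer` are hypothetical names chosen because the exact flow of f(u) = -u is known.

```python
import math


def residual_block(u, layer, step):
    """One ResNet-style update: identity skip plus a scaled layer output."""
    return [u_i + step * f_i for u_i, f_i in zip(u, layer(u))]


def euler_flow(u0, layers, horizon=1.0):
    """Forward-Euler discretization of du/dt = f(u); each step is one block."""
    step = horizon / len(layers)
    u = list(u0)
    for layer in layers:
        u = residual_block(u, layer, step)
    return u


def decay_layer(u):
    # Hypothetical velocity field f(u) = -u; the exact flow is u0 * exp(-t).
    return [-u_i for u_i in u]


depth = 1000
out = euler_flow([1.0, 2.0], [decay_layer] * depth)
exact = [1.0 * math.exp(-1.0), 2.0 * math.exp(-1.0)]
```

As `depth` grows, the stacked residual blocks approach the continuous flow, which is the sense in which a residual network can be read as a discretized continuous-depth model.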

2605.22511 2026-05-27 cs.AI cs.CL cs.IR

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

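Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Xuxin Zhang, Huangyu Dai, Lingtao Mao

AI summary: Proposes Search-E1, which interleaves vanilla GRPO with on-policy self-distillation (OPSD) so that a search-augmented agent can self-evolve without external supervision or elaborate modules; with a 3B model it surpasses all open-source baselines on seven QA benchmarks.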

Abstract

Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of recent work pushes its performance further by adding elaborate machinery on top of this standard pipeline. These augmentations import external supervision from stronger external systems, attach auxiliary modules such as process reward models or retrospective critics, restructure the rollout itself with tree search or multi-stage curricula, or shape the reward with hand-crafted bonuses and penalties. Each addition delivers a measurable gain, but each also inflates the training pipeline and ties the recipe to resources or designs that may not always be available. We take a step back and ask whether any of this machinery is actually necessary, and propose Search-E1, a self-evolution method that lets a search-augmented agent improve through only vanilla GRPO interleaved with on-policy self-distillation (OPSD). After each GRPO round, the policy rolls out on its own training questions. A token-level forward KL objective then aligns the policy's inference-time distribution to its own distribution under a privileged context that exposes a more efficient sibling trajectory. Despite this simplicity, the procedure naturally provides dense per-step supervision. On seven QA benchmarks, Search-E1 reaches 0.440 average EM with Qwen2.5-3B, surpassing all open-source baselines at both scales. Code and complete version will be made public soon.
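The token-level objective mentioned in this abstract can be pictured with a small numeric sketch. It assumes that "forward KL" means KL(privileged-context distribution || inference-time distribution), with the privileged distribution treated as a fixed target; the abstract does not spell out this convention, and the logits below are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)


def token_level_loss(teacher_seq, student_seq):
    """Mean per-token forward KL, plus the per-token terms for inspection."""
    per_token = [forward_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token


# Hypothetical logits over a 3-token vocabulary at two positions.
teacher_seq = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]   # same policy, privileged context
student_seq = [[1.0, 1.0, 0.0], [0.1, 0.2, 3.0]]    # same policy, inference-time context
loss, per_token = token_level_loss(teacher_seq, student_seq)
```

Every position contributes its own term, which is one way to read the abstract's remark that the procedure provides dense per-step supervision.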

2605.22468 2026-05-27 cs.LG cs.AI

BioFormer: Rethinking Cross-Subject Generalization via Spectral Structural Alignment in Biomedical Time-Series

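Guikang Du, Haoran Li, Xinyu Liu, Zhibo Zhang, Xiaoli Gong, Jin Zhang

AI summary: Proposes BioFormer, which explicitly models subject-specific variability through the lens of spectral drift and aligns spectral structure with a Frequency-Band Alignment Module and Sample Conditional Layer Normalization, improving F1-score by 6% across six datasets.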

Abstract

Cross-subject generalization in biomedical time-series (BTS) refers to training on data from some subjects and testing on unseen subjects. The key challenge is to suppress subject-specific variability in BTS representations. Most existing methods suppress this variability implicitly through model building or subject-adversarial learning, but rarely model it explicitly. We introduce spectral drift as a new perspective to characterize subject-specific variability. Specifically, BTS signals under the same label often share consistent oscillatory structure, yet exhibit subject-dependent magnitude or phase shifts in specific frequency components, which we interpret as subject-specific variability. Building on this insight, we propose BioFormer. At its core is a Frequency-Band Alignment Module (FBAM) that generates band-wise modulation factors from the spectral distribution and adaptively adjusts amplitude and phase to align spectral structure, thereby mitigating variability. We further pair FBAM with Sample Conditional Layer Normalization, which infers normalization parameters from intrinsic signal statistics rather than subject identity, stabilizing cross-subject representations. Extensive experiments on six datasets demonstrate that BioFormer outperforms 12 baselines, yielding absolute F1-score improvements of 6%.
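The idea of correcting subject-dependent magnitude and phase shifts within a frequency band can be sketched with a plain discrete Fourier transform. This is not the FBAM module: in the sketch the gain and phase are set by hand rather than generated from the spectral distribution, and `modulate_bands`, the two signals, and the band choice are all hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def modulate_bands(signal, bands):
    """Scale and rotate the spectrum per band.

    bands: list of (low_bin, high_bin, gain, phase_shift), bins inclusive and
    assumed to lie strictly between 0 and n/2.
    """
    n = len(signal)
    spec = dft(signal)
    for low, high, gain, phase in bands:
        factor = gain * cmath.exp(1j * phase)
        for k in range(low, high + 1):
            spec[k] *= factor
            if 0 < k < n - k:
                spec[n - k] *= factor.conjugate()  # keep conjugate symmetry
    return idft(spec)


n = 32
# Two "subjects" share a bin-3 oscillation but differ in amplitude and phase.
signal_a = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
signal_b = [2.0 * math.sin(2 * math.pi * 3 * t / n + 0.5) for t in range(n)]

aligned_b = modulate_bands(signal_b, [(3, 3, 0.5, -0.5)])
max_gap = max(abs(a - b) for a, b in zip(signal_a, aligned_b))
```

Halving the amplitude and rotating the phase by -0.5 rad in bin 3 maps the second signal onto the first, which is the kind of band-wise correction the abstract attributes to spectral alignment.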

2605.22417 2026-05-27 cs.CV cs.SE

The Neglected Baseline in Model Interpretation

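Yongjin Cui, Xiaohui Fan

AI summary: Observing that existing model interpretation methods generally ignore the baseline and therefore give imprecise results, this paper reformulates the interpretation task and its principles, unifies gradient-based methods, Integrated Gradients, and Taylor expansion, analyzes flaws in related methods, and revises Integrated Gradients around a clear and reasonable baseline so that interpretation can be based on features from any layer.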

Abstract

We observe that existing model interpretation methods generally ignore the baseline, and such neglect often results in imprecise or even incorrect interpretation. In this paper, we reformulate the task of model interpretation and the principles for interpretation results to demonstrate the importance of the baseline. We further unify gradient-based methods, Integrated Gradients (IG) methods, and Taylor expansion, clarifying the connections among them and explicitly identifying the baseline for each method. On this basis, we analyze the flaws and errors in related model interpretation methods (IG, LayerCAM, ODAM, Difference Map). We advocate evaluating the quality of model interpretation results precisely through the attribution error between the attribution result and the attribution target, rather than adopting flawed evaluation methods, such as those based on marginal effects or the assumption of perfect model performance. We revise IG and develop a model interpretation method with a clear and reasonable baseline, achieving better results. Our method supports model interpretation based on features from any layer. Interpretations based on features from different layers are all reasonable, and the differences among these results reflect varying degrees of feature extraction at different feature extraction stages.
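To make the role of the baseline concrete, the sketch below computes standard Integrated Gradients for a toy function under two different baselines. It illustrates the textbook IG formula only, not the authors' revised method; `model`, its analytic gradient `grad`, and both baselines are hypothetical.

```python
def model(x):
    """Toy differentiable model: f(x) = x0 * x1 + x0^2."""
    return x[0] * x[1] + x[0] ** 2


def grad(x):
    """Analytic gradient of the toy model."""
    return [x[1] + 2.0 * x[0], x[0]]


def integrated_gradients(x, baseline, steps=200):
    """Midpoint-rule IG along the straight path from baseline to x."""
    totals = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


x = [1.0, 2.0]
attr_zero = integrated_gradients(x, baseline=[0.0, 0.0])     # approx. [2.0, 1.0]
attr_shifted = integrated_gradients(x, baseline=[1.0, 0.0])  # approx. [0.0, 2.0]

# Completeness: attributions sum to f(x) - f(baseline) for each baseline.
gap_zero = sum(attr_zero) - (model(x) - model([0.0, 0.0]))
gap_shifted = sum(attr_shifted) - (model(x) - model([1.0, 0.0]))
```

The same input receives different attributions under the two baselines, while each result still satisfies completeness with respect to its own baseline. An attribution is therefore only meaningful once the baseline is stated.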

2605.21617 2026-05-27 cs.LG q-bio.QM

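BlockFormer: Transformer-based inference from interaction maps

Eloïse Touron, Pedro L. C. Rodrigues, Julyan Arbel, Nelle Varoquaux, Michael Arbel

AI summary: Proposes BlockFormer, a data-driven, Transformer-based method trained on synthetic data from a simulator, which solves the inverse problem of inferring the parameters of a variable number of entities of variable size from interaction maps; it is applied successfully to centromere localization in several species.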

2605.20530 2026-05-27 cs.AI cs.CL cs.LG cs.SE

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

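Parsa Mazaheri, Kasra Mazaheri

AI summary: Proposes AgentAtlas, a framework that uses a control-decision taxonomy and a trajectory-failure vocabulary to separate outcome success from control-decision quality and trajectory quality in agent evaluation, and exposes the measurement risks of relying on outcome leaderboards alone.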

Abstract

Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but their evaluations often collapse behavior into final task success. AgentAtlas reframes agent evaluation as a diagnostic vocabulary and audit protocol for separating outcome success from control-decision quality and trajectory quality. The paper contributes: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a trajectory-failure vocabulary with primary error source and downstream impact; (iii) a 0/1/2 benchmark-coverage audit over fifteen agent benchmarks; and (iv) an illustrative protocol study on a synthetic 1,342-item set evaluated with eight models under taxonomy-aware and taxonomy-blind prompt formats. The synthetic demonstration is not a public benchmark release and should not be read as a definitive model comparison. Instead, it illustrates two measurement risks: mapped label agreement can change substantially when the explicit label menu is removed, and axis choice can change apparent rankings. AgentAtlas is intended to help benchmark designers state what behavior they cover, and to help evaluators diagnose failures that outcome-only leaderboards hide.

2605.20251 2026-05-27 cs.SE cs.AI

ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

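Jiawei He, Jie Jia, Chenbo Liu, Chaoyi Xue, Yapeng Song, Xikai Yang, Dong Sun

AI summary: Introduces ProcCtrlBench, a benchmark that uses a reusable defect ontology and a standardized trajectory representation to evaluate the execution quality of LLM coding agents from process evidence rather than final outcomes alone, and introduces control preservation to quantify properties of the execution process such as interpretability and interruptibility.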

Comments: 22 pages, 8 figures

Abstract

Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics provide limited visibility and often miss defects that arise during execution. We present ProcCtrlBench, a benchmark for execution-process evaluation in LLM coding agents. ProcCtrlBench organizes recurrent execution defects into a reusable ontology covering 11 defect types in 4 categories, and evaluates agent trajectories through standardized process evidence rather than final outcomes alone. To support comparison across heterogeneous agents, ProcCtrlBench standardizes raw logs into a unified trajectory representation and reports calibrated scorecards over process-level findings. In addition, ProcCtrlBench uses control preservation as a way to quantify execution-process quality, capturing whether execution remains interpretable, interruptible, correctable, reversible, and able to hand back authority when needed. We evaluate ProcCtrlBench on 200 cases sampled from three benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Results show that ProcCtrlBench can be instantiated with useful reliability, provides more stable semantics than direct thresholding, and reveals meaningful differences in execution quality that are often overlooked by conventional outcome-based evaluation.

2605.04932 2026-05-27 stat.ML cs.LG

Jacobian-Velocity Bounds for Deployment Risk Under Covariate Drift

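Jonathan R. Landers

AI summary: For the long-horizon deployment risk of a frozen predictor under dynamic covariate drift, proposes pathwise control based on a time-domain Poincaré inequality and a Jacobian-velocity theorem, and designs drift-aligned tangent regularization (DTR) to reduce risk volatility.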

Comments: 8 pages, 4 figures, 4 tables

Abstract

We study long-horizon deployment of a frozen predictor under dynamic covariate shift. A time-domain Poincaré inequality first reduces temporal risk volatility to derivative energy. A Jacobian-velocity theorem then supplies the corresponding pathwise control. Given explicit regularity and domination assumptions, the theorem identifies directional tangent energy along the deployment path as the governing quantity. Under low-rank drift, that quantity reduces to directional Jacobian energy in the drift subspace, motivating drift-aligned tangent regularization (DTR) and a matched monitoring proxy. Rather than smoothing the network isotropically, DTR penalizes sensitivity only along estimated drift directions. We validate the theorem-to-method pipeline in four experiments: a synthetic benchmark for the time-domain inequality, a controlled synthetic comparison against isotropic Jacobian regularization, and two frozen-deployment studies on the UCI Air Quality and Tetouan power-consumption datasets. DTR reduces risk volatility and directional gain in the controlled low-rank regime and beats isotropic smoothing there. It also gives validation-selected deployment gains on both real datasets, with the Air Quality subspace estimated from target-orthogonal sensor motion. Moderate drift-subspace misspecification is tolerable, while orthogonal misspecification largely removes the benefit.
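The contrast between penalizing sensitivity only along drift directions and smoothing isotropically can be sketched numerically. The code below is a finite-difference illustration under stated assumptions, not the paper's training objective: `toy_predictor` is a hypothetical frozen model, and the drift subspace is given by hand rather than estimated from data.

```python
def directional_derivative(f, x, v, eps=1e-6):
    """Central-difference estimate of the Jacobian-vector product J_f(x) v."""
    plus = f([xi + eps * vi for xi, vi in zip(x, v)])
    minus = f([xi - eps * vi for xi, vi in zip(x, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


def directional_energy(f, x, directions):
    """Sum of squared directional derivatives over the given directions."""
    return sum(
        sum(d * d for d in directional_derivative(f, x, v))
        for v in directions
    )


def toy_predictor(x):
    # Hypothetical frozen predictor: very sensitive to x[0], barely to x[1].
    return [3.0 * x[0] + 0.1 * x[1]]


x = [1.0, 2.0]
drift_directions = [[0.0, 1.0]]             # assumed drift subspace: x[1] only
all_directions = [[1.0, 0.0], [0.0, 1.0]]   # full basis, i.e. isotropic penalty

dtr_penalty = directional_energy(toy_predictor, x, drift_directions)
isotropic_penalty = directional_energy(toy_predictor, x, all_directions)
```

Over a full orthonormal basis the energy equals the squared Frobenius norm of the Jacobian, which is the isotropic case. Restricting to the assumed drift direction leaves the predictor's sensitivity to `x[0]` unpenalized, since that coordinate is not expected to move during deployment.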

2605.03309 2026-05-27 cs.CR cs.AI cs.SE

Cryptographic Registry Provenance: Structural Defense Against Dependency Confusion in AI Package Ecosystems

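Alan L. McCann

AI summary: Proposes a cryptographic registry provenance system that defends against dependency confusion attacks through three layers: registry identity signing, a dual-signature model, and authoritative namespace binding.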

Comments: 15 pages, 1 figure, 4 tables. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Updated license

Abstract

Dependency confusion attacks exploit a structural gap in software distribution: once a package is installed, there is no cryptographic proof of which registry distributed it. Every existing defense is configuration-based and fails silently when misconfigured. We present a cryptographic distribution provenance system comprising three components: (1) cryptographic registry identity, where every registry holds an Ed25519 keypair and signs every artifact it distributes; (2) a dual-signature model, where the publisher signs at packaging time and the registry countersigns at publication time; and (3) authoritative namespace binding, where consumers pin registry fingerprints and the resolver cryptographically rejects artifacts from unauthorized registries. These create three defense layers requiring simultaneous compromise for a successful attack. A comparison across eight ecosystems (npm, Cargo, Hex.pm, PyPI, Go modules, Docker/OCI, NuGet, Maven) shows no existing ecosystem combines mandatory publisher signing, cryptographic registry identity, mandatory registry countersigning, and consumer-side cryptographic enforcement. The system extends to AI-generation provenance as a signed attribute and governance-enforced dependency resolution. A case study integrates distribution provenance with a three-layer runtime governance architecture, creating a four-phase lifecycle chain with no cryptographic gaps.

2605.02958 2026-05-27 cs.CR cs.AI cs.CL cs.LG

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

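Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao, Zhong Chen

AI summary: Uses causal tracing to identify a sparse "refusal trajectory" activation pattern and proposes SALO, a lightweight white-box detector that operates on a window of hidden states for robust jailbreak detection.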

Comments: Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

Abstract

Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. We ask whether this view misses how refusal is constructed across layer-token positions. Using causal tracing, we identify a Refusal Trajectory: a sparse upstream activation pattern that often persists even when attacks such as GCG suppress terminal refusal signals. Based on this observation, we propose SALO (Sparse Activation Localization Operator), a lightweight white-box detector that operates on raw hidden-state volumes from a selected layer window. Across Qwen, Llama, and Mistral models, SALO improves jailbreak detection on several attack families under a fixed XSTest-calibrated operating point. We further analyze static RepE-style baselines, ROI sensitivity, adaptive GCG attacks, and encoded-input boundary cases, clarifying both the promise and limitations of refusal-trajectory monitoring.

2605.02035 2026-05-27 cs.CL cs.AI

VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation

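Jingheng Pan, Xintong Wang, Longyue Wang, Liang Ding, Weihua Luo, Chris Biemann

AI summary: Introduces VIDA, a dataset of 2,500 carefully curated instances for evaluating ambiguities in multimodal machine translation that can only be resolved with visual evidence, along with disambiguation-centric metrics; experiments show that chain-of-thought fine-tuning improves out-of-distribution disambiguation.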

Abstract

Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks probing the role of vision, we observe that existing benchmarks remain limited by task-format mismatch, narrow ambiguity coverage, or insufficient visual-dependency validation. Moreover, existing ambiguity evaluations are not well suited to diverse ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art LVLMs show that supervised fine-tuning (SFT) improves overall translation quality, while chain-of-thought SFT (CoT-SFT) yields stronger out-of-distribution disambiguation, suggesting that explicit disambiguation guidance improves generalization to diverse ambiguity types.
