arXivDaily: arXiv daily academic digest, updated Monday through Friday
1. Large Language Models and Foundation Models (28 papers)

2606.17682 2026-06-17 cs.CL New submission

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

Chao Chen, Chengzu Li, Zhiwei Li, Yinhong Liu, Zhijiang Guo

Affiliations * LARK, HKUST (GZ); University of Cambridge; HKUST

AI summary Proposes the LLM-as-Environment-Engineer framework, in which the policy model analyzes its own failure trajectories and modifies the training-environment configuration; with Qwen3-4B on the MAPF-FrozenLake testbed it achieves the strongest aggregate performance.

Abstract

Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
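The stage-by-stage loop described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: `train_and_eval` and `propose_config` are hypothetical callables (in the paper the proposer is the current policy LLM itself), and the configuration keys `grid_size` and `num_agents` are invented for the toy example rather than taken from MAPF-FrozenLake.

```python
from typing import Callable, Dict, List

Config = Dict[str, float]

def run_stages(
    config: Config,
    train_and_eval: Callable[[Config], List[dict]],    # hypothetical: returns failure records
    propose_config: Callable[[Config, dict], Config],  # hypothetical "environment engineer"
    num_stages: int,
) -> List[Config]:
    """Alternate between training in the current environment and redesigning it."""
    history = [dict(config)]
    for _ in range(num_stages):
        failures = train_and_eval(config)
        # Structured summary the engineer is conditioned on (fields are assumptions).
        summary = {
            "num_failures": len(failures),
            "examples": failures[:3],
            "config": dict(config),
        }
        config = propose_config(config, summary)
        history.append(dict(config))
    return history

# Toy stand-ins so the sketch runs without any model.
def toy_train_and_eval(config: Config) -> List[dict]:
    # Pretend the policy starts failing once the grid reaches size 6.
    return [{"reason": "collision"}] if config["grid_size"] >= 6 else []

def toy_engineer(config: Config, summary: dict) -> Config:
    new_config = dict(config)
    if summary["num_failures"] == 0:
        new_config["grid_size"] += 1  # no failure evidence: make the task harder
    return new_config                 # otherwise keep the configuration unchanged

history = run_stages({"grid_size": 4, "num_agents": 2}, toy_train_and_eval, toy_engineer, 3)
print([h["grid_size"] for h in history])
```

The toy engineer only changes the configuration when the summary shows no failures, a crude nod to the abstract's finding that good updates rely on failure evidence and preserve what already works.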

2606.17687 2026-06-17 cs.CL cs.AI New submission

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

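Jiahao Wang, Bingyu Liang, Chenhao Hu, Longhui Zhang, Xuebo Liu, Min Zhang, Jing Li, Xuelong Li

Affiliations * Harbin Institute of Technology (Shenzhen)

AI summary To address the wasted computation caused by large reasoning models generating overly long chains of thought, introduces the Minimal Sufficient CoT (MSC) concept and the two-stage training framework SuCo, which optimizes reasoning length through adaptive sufficiency thresholds and reinforcement learning, improving both accuracy and efficiency on mathematics, code, and science benchmarks.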

Comments Accepted to ICML 2026. 18 pages

Abstract

Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce Minimal Sufficient CoT (MSC), defined as the shortest prefix of a CoT trajectory which is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage 1, MSC-Aligned Fine-Tuning (MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage 2, Sufficiency-Aware Policy Optimization (SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.
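The MSC definition (the shortest prefix of a CoT trajectory that is adequate for producing the correct answer) can be illustrated with a minimal sketch. It assumes a hypothetical `is_sufficient` predicate standing in for whatever sufficiency check the paper uses, which the abstract does not specify, and it simply scans prefixes from shortest to longest.

```python
from typing import Callable, List, Optional, Sequence

def minimal_sufficient_prefix(
    steps: Sequence[str],
    is_sufficient: Callable[[Sequence[str]], bool],  # hypothetical sufficiency check
) -> Optional[List[str]]:
    """Return the shortest prefix of `steps` accepted by `is_sufficient`, or None."""
    for k in range(len(steps) + 1):
        prefix = list(steps[:k])
        if is_sufficient(prefix):
            return prefix
    return None

# Toy trajectory: the answer is derived in the second step; the rest is extra.
cot = [
    "Let x be the unknown.",
    "2x = 8, so x = 4.",
    "Double-check: 2 * 4 = 8.",
    "Also note that x > 0.",
]

def toy_check(prefix: Sequence[str]) -> bool:
    # Stand-in: a prefix is "sufficient" once it contains the derivation of x = 4.
    return any("x = 4" in step for step in prefix)

msc = minimal_sufficient_prefix(cot, toy_check)
print(len(msc), "of", len(cot), "steps kept")
```

A linear scan costs one check per prefix; with an expensive model-based check, a cheaper search strategy would likely be needed, but that is outside what the abstract describes.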

2606.17688 2026-06-17 cs.CL New submission

LLMs Infer Cultural Context but Fail to Apply It When Responding

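Yisong Miao, Jian Zhu, Vered Shwartz

Affiliations * University of British Columbia; Vector Institute; National University of Singapore

AI summary Studies whether large language models, when generating culturally adapted responses, use local measurement units that match the user's cultural background. Introduces the CAPRI dataset; experiments show that models can infer cultural background but often fail to apply it unless explicitly prompted.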

Comments 9 pages, 7 figures, 2 tables (24 pages, 12 figures, 8 tables including references and appendices)

Abstract

Recent work has shown that LLMs overrepresent dominant cultures, particularly Western ones, while marginalizing others. We investigate whether this affects models' ability to generate culturally adapted responses by evaluating their use of local measurement units based on the user's perceived cultural background. We introduce Cultural and Pragmatic Response Inference (CAPRI), a dataset of conversations with varying levels of cultural cues. Experiments with state-of-the-art LLMs show that models can infer cultural background and recall relevant conventions, but often fail to utilize the information to adapt their answers to the relevant cultural conventions, unless explicitly prompted to perform the tasks sequentially. We further evaluate adaptation to the interpretation of time and quantity expressions, two subjective language grounding dimensions that are affected by culture. We find that models increasingly adapt their answers as cultural cues accumulate, but their priors are not culture-neutral, sometimes aligning with the model's country of origin. Overall, CAPRI provides a resource for future research aimed at narrowing the gap between cultural knowledge and culturally adaptive language generation.

2606.17890 2026-06-17 cs.CL New submission

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

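Zihao Wei, Wenjie Shi, Liang Pang, Jingcheng Deng, Shicheng Xu, Shasha Guo, Zenghao Duan, Jiahao Liu, Jingang Wang, Huawei Shen, Xueqi Cheng

Affiliations * Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary Targets overthinking in GRPO reinforcement-learning training, where models keep generating unnecessary reasoning after reaching the correct answer. Proposes Dynamic Rollout Editing (DRE), which edits the thinking that follows answer emergence in successful trajectories to weaken the preference signal for unnecessary thinking; experiments show its effectiveness.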

Comments 21 pages, 10 figures, 2 tables

Abstract

Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We refer to this behavior as overthinking. We study this phenomenon from the perspective of GRPO-style reinforcement learning (RL) post-training, framing it as a training-time credit-assignment problem rather than merely a decoding-time stopping problem. In rollouts sampled at the onset of GRPO training, we observe that successful trajectories can exhibit a slightly higher degree of overthinking than unsuccessful trajectories for the same prompts. This early imbalance provides a starting point for an undesirable feedback loop: because GRPO assigns sequence-level credit, it cannot distinguish the solution-reaching prefix from the unnecessary continuation that lengthens a successful trajectory. Both receive positive update signal, allowing the initial imbalance to grow into more severe overthinking during training. To address this issue, we introduce Dynamic Rollout Editing (DRE), a training-time intervention for successful trajectories that continue thinking after answer emergence. DRE preserves the accepted verified prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group, weakening the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer. Experiments across diverse tasks show the effectiveness of DRE.
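The credit-assignment point can be made concrete with a small sketch of sequence-level, group-relative advantages in the style of GRPO. It is not the authors' code: the function names are illustrative, and using the population standard deviation with a small `eps` is one common choice among several. The point it shows is that every token of a rollout inherits the same scalar, so a solution-reaching prefix and any trailing continuation are rewarded identically.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """One advantage per rollout: reward standardized within its prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def per_token_advantages(rewards: List[float], lengths: List[int]) -> List[List[float]]:
    """Broadcast each rollout's advantage to all of its tokens (sequence-level credit)."""
    advantages = group_relative_advantages(rewards)
    return [[a] * n for a, n in zip(advantages, lengths)]

# Four rollouts for one prompt. Rollouts 0 and 1 are both correct, but rollout 0
# is twice as long (imagine three tokens of answer plus three of extra thinking).
rewards = [1.0, 1.0, 0.0, 0.0]
lengths = [6, 3, 4, 5]
token_advs = per_token_advantages(rewards, lengths)

# Every token in rollout 0, including the unnecessary tail, gets the same
# positive advantage as the tokens of the shorter correct rollout.
print(token_advs[0][-1] == token_advs[1][0])
```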

2606.17973 2026-06-17 cs.CL New submission

Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue

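Olivier Tieleman, Ziyi Zhu, Ting Su, Samuel J. Bell, Thomas D. Hull, Caitlin A. Stamatis

Affiliations * GitHub

AI summary Fine-tunes a Qwen3.5-27B model on data augmented with pseudolabels to predict PHQ-9 total scores from AI mental health dialogue, enabling passive depression monitoring without self-report questionnaires.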

Comments 12 pages, 1 figure

Abstract

Depression is the leading cause of disability worldwide, and early detection of symptom change is essential for timely intervention. Validated instruments such as the Patient Health Questionnaire-9 (PHQ-9) support symptom monitoring at scale, but real-world completion rates are low, introducing response bias and systematic missingness. Passive approaches that infer severity from routinely generated data could close this gap. We address this by predicting PHQ-9 total scores directly from transcripts of conversations between users and an AI mental health application, requiring only conversation text and no additional clinical data. We fine-tune a Qwen3.5-27B backbone with a regression head and augment 3,111 ground-truth labels with pseudolabels generated by a reasoning model (Claude Opus) and by iteratively trained intermediate models, for a combined dataset of 6,283 users. On a held-out test set of 842 users, our best model achieves MAE = 2.6, RMSE = 4.0, Pearson r = 0.80, and AUC = 0.91 at the PHQ-9 >= 10 clinical threshold. We also find AUC > 0.87 at every severity threshold from PHQ-9 >= 3 to PHQ-9 >= 24, demonstrating that the model captures depression severity across the full clinical spectrum. This work opens the door to passive, continuous symptom monitoring in AI mental health platforms, without requiring users to complete self-report measures.
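For readers less familiar with the reported metrics, the sketch below shows how MAE, RMSE, Pearson r, and a thresholded AUC can be computed from true and predicted PHQ-9 totals. The six data points are made up for illustration and have no connection to the paper's data; the AUC is computed as the fraction of (positive, negative) pairs ranked correctly, with ties counted as half, which is an assumption about the evaluation rather than something the abstract states.

```python
import math
from typing import List

def mae(y_true: List[float], y_pred: List[float]) -> float:
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true: List[float], y_pred: List[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def pearson_r(y_true: List[float], y_pred: List[float]) -> float:
    my = sum(y_true) / len(y_true)
    mp = sum(y_pred) / len(y_pred)
    cov = sum((a - my) * (b - mp) for a, b in zip(y_true, y_pred))
    var_y = sum((a - my) ** 2 for a in y_true)
    var_p = sum((b - mp) ** 2 for b in y_pred)
    return cov / math.sqrt(var_y * var_p)

def auc_at_threshold(y_true: List[float], y_pred: List[float], threshold: float) -> float:
    """AUC for the binary task 'true score >= threshold', using predictions as scores."""
    pos = [p for t, p in zip(y_true, y_pred) if t >= threshold]
    neg = [p for t, p in zip(y_true, y_pred) if t < threshold]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Made-up PHQ-9 totals (valid range 0 to 27) and model predictions.
y_true = [2, 5, 9, 12, 18, 24]
y_pred = [3.0, 4.0, 11.0, 10.0, 17.0, 22.0]

print("MAE:", mae(y_true, y_pred))
print("RMSE:", round(rmse(y_true, y_pred), 3))
print("Pearson r:", round(pearson_r(y_true, y_pred), 3))
print("AUC at PHQ-9 >= 10:", round(auc_at_threshold(y_true, y_pred, 10), 3))
```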

2606.18056 2026-06-17 cs.CL New submission

ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation

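Yao Chen, Yinqi Yang, Junyuan Shang, Xiangzhao Hao, Simeng Zhang, Yilong Chen, Tingwen Liu, Shuohuan Wang, Dianhai Yu

Affiliations * Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Baidu Inc.

AI summary Proposes the ConSA framework, which uses L0 regularization and an augmented Lagrangian constraint to learn the allocation of full attention and sliding-window attention under a user-specified sparsity target. It outperforms rule-based baselines on 0.6B- and 1.7B-scale LLMs and finds a concentrated allocation pattern with SWA in the bottom layers and FA in the middle layers.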

Abstract

Hybrid architectures combining full attention (FA) and sliding-window attention (SWA) are a promising paradigm for efficient LLM inference. However, existing methods typically rely on hand-crafted rules or simple post-hoc heuristics for FA/SWA allocation and offer limited analysis of the attention behaviors underlying these designs. We propose Controllable Sparsity in Hybrid Attention (ConSA), a framework that learns optimal FA/SWA assignment under a user-specified sparsity target. ConSA employs L0 regularization to learn binary masks selecting between FA and SWA for each attention unit, while an augmented Lagrangian constraint enforces the target sparsity at either layer or KV-head granularity. We evaluate ConSA on two LLMs at the 0.6B and 1.7B scales. Learned allocations consistently outperform rule-based baselines, with KV-head-wise allocation yielding clear gains over layer-wise allocation. The learned patterns place SWA in the bottom layers and concentrate FA into contiguous middle-layer blocks, diverging from evenly interleaved patterns in rule-based methods. This structure persists across model scales, sparsity levels, and allocation granularities, revealing a fine-grained spectrum of intrinsic attention behaviors that underlies the learned allocation.
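The sparsity constraint can be written out in a generic form. What follows is a textbook augmented-Lagrangian objective under assumed notation, not the paper's exact formulation: z_u in {0, 1} is the gate for attention unit u (1 selects FA, 0 selects SWA), U is the number of units (layers or KV heads), t is the user-specified sparsity target, and lambda and mu are the multiplier and penalty weight. Defining sparsity as the fraction of units assigned to SWA is also an assumption.

```latex
s(z) = 1 - \frac{1}{U}\sum_{u=1}^{U} z_u,
\qquad
\mathcal{L}(\theta, z, \lambda)
  = \mathcal{L}_{\text{task}}(\theta, z)
  + \lambda\,\bigl(s(z) - t\bigr)
  + \frac{\mu}{2}\,\bigl(s(z) - t\bigr)^{2}
```

In this form the linear term is updated by dual ascent on lambda while the quadratic term penalizes any deviation from t, so the learned allocation is pushed toward the target sparsity rather than merely toward fewer FA units.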

2606.18195 2026-06-17 cs.CL New submission

Learning from the Self-future: On-policy Self-distillation for dLLMs

Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu

Affiliations * Tsinghua University; Technical University of Munich; Nanyang Technological University; University of British Columbia; University of Texas at Austin; ELLIS Institute Tubingen; Max Planck Institute for Intelligent Systems; Tubingen AI Center

AI summary Proposes d-OPSD, the first on-policy self-distillation framework for diffusion large language models. It learns from self-future experience by using self-generated answers as suffix conditioning and shifts supervision from token level to step level, outperforming RLVR and SFT baselines on reasoning benchmarks with about 10% of the optimization steps required by RLVR.

Comments Preprint

Abstract

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps required by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.

2606.18216 2026-06-17 cs.CL New submission

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

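Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma

Affiliations * NVIDIA

AI summary Proposes ZPPO, which injects teacher knowledge into prompts rather than the policy gradient, addressing teacher-gradient dominance in small-model knowledge distillation and policy drift in reinforcement learning; it outperforms existing methods across several model scales.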

Comments Project page: https://byungkwanlee.github.io/ZPPO-page/

Abstract

Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
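The prompt replay buffer lends itself to a short sketch. The class below is an illustration built only from the abstract's description (graduation when mean rollout accuracy reaches one half, FIFO eviction under finite capacity); the class name, method names, and the choice to key entries by a question id are assumptions, not the authors' implementation.

```python
from collections import OrderedDict
from typing import List, Optional

class PromptReplayBuffer:
    """Holds hard questions until they graduate or are evicted oldest-first."""

    def __init__(self, capacity: int, graduate_at: float = 0.5) -> None:
        self.capacity = capacity
        self.graduate_at = graduate_at
        self._items: "OrderedDict[str, str]" = OrderedDict()  # question id -> prompt

    def add(self, qid: str, prompt: str) -> Optional[str]:
        """Insert a hard question; return the id that was FIFO-evicted, if any."""
        if qid in self._items:
            return None
        evicted = None
        if len(self._items) >= self.capacity:
            evicted, _ = self._items.popitem(last=False)  # drop the oldest entry
        self._items[qid] = prompt
        return evicted

    def report(self, qid: str, rollout_correct: List[bool]) -> bool:
        """Remove the question if mean rollout accuracy reaches the threshold."""
        if qid not in self._items or not rollout_correct:
            return False
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.graduate_at:
            del self._items[qid]
            return True
        return False

    def pending(self) -> List[str]:
        return list(self._items)

buf = PromptReplayBuffer(capacity=2)
buf.add("q1", "hard question 1")
buf.add("q2", "hard question 2")
evicted = buf.add("q3", "hard question 3")                  # buffer full: q1 is evicted
graduated = buf.report("q2", [True, False, True, False])    # accuracy 0.5: q2 graduates
print(evicted, graduated, buf.pending())
```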

2606.18246 2026-06-17 cs.CL New submission

Variable-Width Transformers

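Zhaofeng Wu, Oliver Sieberling, Shawn Tan, Rameswar Panda, Yury Polyanskiy, Yoon Kim

Affiliations * MIT; MIT-IBM Watson AI Lab

AI summary Proposes a variable-width Transformer architecture that is narrow in the middle and wide at both ends, using a parameter-free residual resizing mechanism for non-uniform capacity allocation; it improves on uniform-width baselines in language modeling loss, FLOPs, and KV cache.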

Abstract

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a ×-shaped > <former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our > <former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.

2606.17250 2026-06-17 cs.LG cs.CL Cross-list

Rethinking Groups in Critic-Free RLVR

Yihong Wu, Liheng Ma, Lingfeng Xiao, Muzhi Li, Xinyu Wang, Yingxue Zhang, Jian-Yun Nie

Affiliations * Université de Montréal; McGill University; Mila - Quebec AI Institute; University of Waterloo; The Chinese University of Hong Kong; Huawei Noah’s Ark Lab

AI summary To address the data inefficiency and synchronization problems of grouping in critic-free reinforcement learning, proposes negative token filtering, which enables stable single-rollout training and achieves comparable performance on reasoning tasks and stronger performance on agentic tasks.

Abstract

Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantage computation. However, this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts. In this work, we revisit the role of the "group" and show that its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this insight, we propose negative token filtering, a simple and effective strategy that enables stable single-rollout training. We apply it to two batch-level advantage methods, achieving comparable performance on reasoning tasks and stronger performance on agentic tasks relative to group-based RL techniques.

2606.17289 2026-06-17 cs.AI cs.CL Cross-list

Nothing from Something: Can a Language Model Discover 0?

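Phoebe Zeng, Thomas L. Griffiths, Brenden M. Lake

Affiliations * Department of Computer Science, Princeton University

AI summary Studies whether language models can independently discover the concept of zero, using arithmetic tasks. GPT-2-scale models fail to generalize at test time but improve substantially after training on a small number of examples, and language pretraining reduces the number of required examples by about 50%.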

Abstract

AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new (and potentially logically more powerful) mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) language models of GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately 50%, showing that language abilities can scaffold mathematical discovery in neural models.

2606.17443 2026-06-17 cs.AI cs.CL cs.CY Cross-list

Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems

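Xi Chu, Yupeng Hou

Affiliations * Trine University; Texas A&M University

AI summary Studies brand dynamics in LLM recommendations. Well-known brands receive 100% of recommendations (IAI = 10.0) when specifications are identical, but a slight rating advantage for a competitor breaks the monopoly; authority-style marketing language (such as fabricated clinical evidence) breaks it at a Bias Surplus Value of +0.17 rating points; and multi-brand GEO competition exhibits a social dilemma in which collective optimization lowers individual payoff.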

Comments 16 pages, 4 figures, 11 tables

Abstract

Large language models (LLMs) are becoming a major way for consumers to find products, but we do not yet understand how brands compete in this new channel. We study brand dynamics in LLM recommendations using skincare products -- a category where consumers cannot easily judge quality before buying and must rely on brand reputation -- across three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), with a robustness check on search goods. In three experiments, we find: (1) a Conditional Monopoly where well-known brands get recommended 100% of the time (IAI = 10.0) when all products have the same specifications, but this dominance disappears with less than a +0.1-star rating advantage for a competitor; (2) authority-style marketing language, including fabricated clinical-evidence claims, breaks this monopoly at a Bias Surplus Value equal to +0.17 rating points, with each model responding differently; and (3) a social dilemma in multi-brand GEO competition: when all brands adopt the same optimization strategy, individual payoff falls from +0.802 to +0.007 in our payoff proxy, and non-participating brands receive zero recommendations in our tests. Our results suggest that generative engine optimization (GEO) should be studied not only as a security risk, but also as an emerging marketing practice that shapes market competition.

2606.17579 2026-06-17 cs.LG cs.AI cs.CL cs.SI Cross-list

LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

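Zhongyuan Wang, Pratyusha Vemuri

AI summary Finds that introducing LLM features into graph neural networks through pure input concatenation (rather than joint training) systematically lowers accuracy on homophilous benchmarks, and proposes Delta_sig, a measure of LLM-alone discriminability, to predict the effect of concatenation.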

Comments 29 pages, 8 figures

Abstract

Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features reduces PubMed test accuracy by -17.0 +/- 0.3 pp and Cora by -4.3 +/- 0.6 pp (CiteSeer -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets Delta_sig correlates with the concatenation cost more strongly than homophily at point estimate (r^2 = 0.38 vs. 0.06; N=9, bootstrap CIs overlap). The bootstrap-best change-point is tau = 13.8 pp, and the rule "Delta_sig <= tau predicts non-positive concat cost" classifies 7/9 datasets correctly; since 60% of bootstrap samples place tau in [5, 30] pp, we treat Delta_sig as an interpretive lens rather than a precision filter. A dimension-controlled ablation on PubMed places the LLM-feature drop between same-source PCA (-2.3 pp) and same-dim Gaussian noise (-37.3 pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations fit a power law |Delta_concat| proportional to (sqrt(d_l/n))^1.31 with r^2 = 0.97; the low-Delta_sig, small-n corner is exactly where the headline -17 pp PubMed deficit appears.

2606.18057 2026-06-17 cs.HC cs.AI cs.CL cs.CY cs.SI Cross-list

When AI Says "I have been in similar situations": Synthetic Lived Experience in Peer-Like Caregiver Support

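Drishti Goel, Violeta J. Rodriguez, Daniel S. Brown, Ravi Karkar, Dong Whi Yoo, Koustuv Saha

Affiliations * University of Illinois Urbana-Champaign; University of Massachusetts Amherst; Indiana University Indianapolis

AI summary Studies the paradox of AI generating "synthetic lived experience" in peer support. By analyzing narrative differences between humans and AI in Alzheimer's caregiving communities, it shows that AI can simulate emotional support but lacks real experience, and argues for mechanisms that distinguish supportive language from fabricated experience.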

Abstract

Caregivers often turn to online communities for informational and emotional support. In these spaces, peer supporters frequently draw on personal narratives to respond to emotionally complex caregiving situations. As LLMs are increasingly designed as peer-like sources of support, they introduce a critical tension: AI can provide immediate, private, and nonjudgmental support, but it cannot authentically possess the lived experiences that make human peer support meaningful. Yet, when prompted to sound peer-like, LLMs may generate language that implies lived experience. This creates a synthetic lived experience paradox: the same experiential language that may make AI support feel warm, relatable, and peer-like can also falsely position the system as someone with lived experience. We examine this paradox in the context of family caregivers of people living with Alzheimer's Disease and Related Dementias (ADRD). Drawing on caregiver support exchanges from online communities and prompted peer-like responses from three LLMs -- LLaMA, GPT-4o-mini, and MedGemma -- we analyze how human peers use personal narratives and how AI incorporates similar narrative forms. Psycholinguistic analysis shows that peer responses used significantly more first-person and past-focused language than peer-like AI responses. Qualitatively, we identify seven types of personal narratives in human peer support and show that AI often captures their emotional work, but can fabricate experiential grounding. These findings reveal a narrative authenticity gap: peer-like AI can generate synthetic lived experience without the real experience that makes peer support meaningful. We argue that caregiver-support AI systems need mechanisms to distinguish supportive peer-like framing from fabricated lived experience, ensuring that models can offer warmth and validation without falsely positioning themselves as experiential peers.

2606.18208 2026-06-17 cs.LG cs.AI cs.CL cs.CV 交叉投稿

Looped World Models

循环世界模型

Hongyuan Adam Lu, Z. L. Victor Wei, Qun Zhang, Jinrui Zeng, Bowen Cao, Lingwei Meng, Mocheng Li, Zezhong Wang, Haonan Yin, Naifu Xue, Minyu Chen, Cenyuan Zhang, Zefan Zhang, Hao Wei, Jiawei Zhou, Haoran Xu, Hao Yang, Ronglai Zuo, Tongda Xu, Yonghao Li, Jian Chen, Hebin Wang, Zeyu Gao, Yang Li, Wei Zhao, Qimin Zhong, Siqi Liu, Yumeng Zhang, Leyan Cui, Zhangyu Wang, Wai Lam

发表机构 * FaceMind Research Asia

AI总结 提出循环世界模型(LoopWM),通过参数共享的Transformer块迭代细化潜在环境状态,实现高达100倍参数效率,并建立迭代潜在深度作为世界模拟的新缩放轴。

Comments Technical Report

AI中文摘要

当前的世界模型面临一个基本矛盾:忠实的长期模拟需要深度计算,但更深的模型部署成本高且容易产生累积误差。我们通过引入循环世界模型(LoopWM)来解决这一问题,这是首个用于世界建模的循环架构。我们的方法通过一个参数共享的Transformer块迭代地细化潜在环境状态。这带来了高达100倍于传统方法的参数效率,并具有自适应计算能力,可自动调整深度以匹配每个预测步骤的复杂性。与缩放模型大小和训练数据正交,LoopWM建立了迭代潜在深度作为世界模拟的新缩放轴,这可能显著推动社区发展。

英文摘要

Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yields up to 100x parameter efficiency over conventional approaches, with adaptive computation that automatically scales depth to match the complexity of each prediction step. Orthogonal to scaling model size and training data, LoopWM establishes iterative latent depth as a new scaling axis for world simulation, which might significantly push the community forward.
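
The looping idea can be illustrated with a toy sketch. It is not the LoopWM architecture: the scalar `shared_block`, its weight and bias, and the convergence-based stopping rule are all assumptions chosen for illustration, since the abstract does not say how depth is adapted. The sketch only shows one set of parameters being reused across iterations, with easier inputs needing fewer loops.

```python
import math

def shared_block(state, weight=0.5, bias=0.1):
    """Toy stand-in for a parameter-shared block: the same weight and bias
    are reused on every loop iteration (hypothetical values)."""
    return [math.tanh(weight * s + bias) for s in state]

def looped_refine(state, max_loops=50, tol=1e-4):
    """Apply the shared block until the latent state stops changing.

    Returns the refined state and the number of loops actually used.
    The stopping rule is an illustrative assumption, not the paper's."""
    for loops in range(1, max_loops + 1):
        new_state = shared_block(state)
        delta = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if delta < tol:
            break
    return state, loops

easy_state, easy_loops = looped_refine([0.2])
hard_state, hard_loops = looped_refine([5.0])
print(easy_loops, hard_loops)
```

In this toy the input that starts near the fixed point exits after fewer iterations, which is the sense in which depth adapts to the difficulty of a step.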

2502.08363 2026-06-17 cs.CL cs.AI 版本更新

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Top-Theta注意力:通过补偿阈值稀疏化Transformer

Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli

AI总结 提出Top-Theta注意力,一种无需训练的推理时稀疏化方法,通过静态每头阈值保留每行固定数量的重要元素,结合补偿技术实现高稀疏度下的精度保持,在NLP任务中实现3-10倍V-cache减少和高达10倍注意力元素减少,精度下降不超过1%。

Comments Extended version of a paper accepted at ICANN 2026

AI中文摘要

我们提出Top-Theta(Top-$\theta$)注意力,一种无需训练的推理时稀疏化Transformer注意力的方法。我们的关键洞察是,可以校准静态的每头阈值,以在每行注意力中保留所需数量的重要元素。该方法实现了基于内容的稀疏性,无需重新训练,并且在不同数据领域保持鲁棒性。我们进一步引入补偿技术,以在激进稀疏化下保持精度,将注意力阈值化确立为top-k注意力的实用且原则性替代方案。我们在自然语言处理任务上进行了广泛评估,表明Top-Theta在推理时实现了3-10倍的V-cache减少和高达10倍的注意力元素减少,同时精度下降不超过1%。

英文摘要

We present Top-Theta (Top-$θ$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-$θ$ achieves 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading no more than 1% in accuracy.
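
A minimal sketch of threshold-based attention sparsification, under stated assumptions: the calibration rule (averaging each calibration row's k-th largest score) and the fallback to the row maximum are illustrative choices, not taken from the paper. The compensation techniques mentioned in the abstract are not shown because the abstract does not specify them.

```python
import math

def calibrate_threshold(score_rows, k):
    """Pick one static threshold so that roughly k scores per row survive
    on the calibration rows (mean of each row's k-th largest score;
    an illustrative rule)."""
    kth = [sorted(row, reverse=True)[k - 1] for row in score_rows]
    return sum(kth) / len(kth)

def thresholded_softmax(row, theta):
    """Softmax restricted to scores >= theta; all other weights are 0.
    If nothing passes, keep only the row maximum (assumed fallback)."""
    mask = [s >= theta for s in row]
    if not any(mask):
        top = max(row)
        mask = [s == top for s in row]
    m = max(s for s, keep in zip(row, mask) if keep)
    exps = [math.exp(s - m) if keep else 0.0 for s, keep in zip(row, mask)]
    total = sum(exps)
    return [e / total for e in exps]

theta = calibrate_threshold([[4.0, 1.0, 0.0, 3.0], [2.0, 5.0, 1.0, 0.5]], k=2)
print(theta, thresholded_softmax([3.0, 0.1, 2.6, -1.0], theta))
```

Unlike top-k, the comparison against a fixed threshold needs no per-row sort at inference time; the number of kept elements then varies with the content of each row.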

2506.18831 2026-06-17 cs.CL 版本更新

Adaptive Activation Steering for Efficient LLM Reasoning via Closed-Loop PID Control

自适应激活引导:通过闭环PID控制实现高效LLM推理

Aryasomayajula Ram Bharadwaj

AI总结 提出PID-steering方法,利用PID控制器根据块级冗余分类器动态调整激活引导强度,在减少推理开销的同时提升准确率。

AI中文摘要

使用长思维链训练的推理LLM常出现过度思考:它们在冗余反思和过渡上花费token,增加成本却不提高准确性。静态激活引导(如SEAL)用固定向量抑制此类内容,但无论当前块实际冗余程度如何,都施加相同强度。我们描述了PID-steering,一种无需训练、解码时的方法,通过由轻量级块级冗余分类器驱动的PID控制器来调节引导强度。在GSM8K子集上使用DeepSeek-R1-Distill-Qwen-1.5B,该方法将准确率从85.7%提升至89.6%(+3.9个百分点),同时将平均输出长度从1026个token削减至790个(-23%)。我们将其报告为小规模概念验证,而非基准结果。

英文摘要

Reasoning LLMs trained with long chain-of-thought often overthink: they spend tokens on redundant reflection and transitions that inflate cost without improving accuracy. Static activation steering (e.g., SEAL) suppresses such content with a fixed vector, but applies the same strength regardless of how redundant the current chunk actually is. We describe PID-steering, a training-free, decoding-time method that modulates the steering strength with a PID controller driven by a lightweight chunk-level redundancy classifier. On a subset of GSM8K with DeepSeek-R1-Distill-Qwen-1.5B, the method improves accuracy from 85.7% to 89.6% (+3.9 pp) while cutting average output length from 1026 to 790 tokens (-23%). We report it as a small-scale proof of concept rather than a benchmark result.
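
The control loop can be sketched as a textbook discrete PID controller. The gains, the target redundancy level, the clamping range, and the per-chunk redundancy scores below are hypothetical; the abstract gives none of them. How the resulting strength scales the steering vector inside the model is not shown.

```python
class PIDController:
    """Discrete PID controller with output clamping (hypothetical gains)."""

    def __init__(self, kp, ki, kd, out_min=0.0, out_max=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error):
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        return min(max(out, self.out_min), self.out_max)

def steering_strengths(redundancy_scores, target=0.2):
    """Map per-chunk redundancy scores (assumed to lie in [0, 1]) to
    steering strengths: more redundancy above target, stronger steering."""
    pid = PIDController(kp=1.0, ki=0.1, kd=0.05)
    return [pid.update(score - target) for score in redundancy_scores]

print(steering_strengths([0.9, 0.1, 0.5]))
```

A static method would return the same strength for every chunk; here a chunk scored as highly redundant receives strong steering and a chunk below the target receives none.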

2508.03250 2026-06-17 cs.CL cs.AI 版本更新

RooseBERT: A New Deal For Political Language Modelling

RooseBERT: 政治语言建模的新政

Deborah Dore, Elena Cabrio, Serena Villata

AI总结 针对政治语言特殊性,提出领域预训练模型RooseBERT,在大型政治辩论语料上训练,在多项政治分析任务中优于通用模型。

AI中文摘要

政治辩论和与政治相关讨论的日益增多,要求定义新颖的计算方法来自动分析此类内容,最终目标是让公民更清晰地了解政治审议。然而,政治语言的特殊性和这些辩论的论证形式(采用隐藏的沟通策略并利用隐含论点)使得这项任务非常具有挑战性,即使是对于当前通用的预训练语言模型(LMs)也是如此。为了解决这个问题,我们引入了一种新颖的预训练语言模型,专门用于政治话语语言,称为RooseBERT。在专业领域上预训练语言模型面临着不同的技术和语言挑战,需要大量的计算资源和大规模数据。RooseBERT是在大型英语政治辩论和演讲语料库(11GB)上训练的。为了评估其性能,我们在多个与政治辩论分析相关的下游任务上对其进行了微调,即立场检测、情感分析、论证成分检测与分类、论证关系预测与分类、政策分类、命名实体识别(NER)。我们的结果显示,在大多数这些任务上,RooseBERT相比通用语言模型有所改进,突显了领域特定预训练如何增强政治辩论分析的性能。我们将RooseBERT发布给研究社区。

英文摘要

The increasing amount of political debates and politics-related discussions calls for the definition of novel computational methods to automatically analyse such content with the final goal of making political deliberation clearer to citizens. However, the specificity of the political language and the argumentative form of these debates (employing hidden communication strategies and leveraging implicit arguments) make this task very challenging, even for current general-purpose pre-trained Language Models (LMs). To address this, we introduce a novel pre-trained LM for political discourse language called RooseBERT. Pre-training an LM on a specialised domain presents different technical and linguistic challenges, requiring extensive computational resources and large-scale data. RooseBERT has been trained on large political debate and speech corpora (11GB) in English. To evaluate its performance, we fine-tuned it on multiple downstream tasks related to political debate analysis, i.e., stance detection, sentiment analysis, argument component detection and classification, argument relation prediction and classification, policy classification, and named entity recognition (NER). Our results show improvements over general-purpose LMs on the majority of these tasks, highlighting how domain-specific pre-training enhances performance in political debate analysis. We release RooseBERT for the research community.

2509.26476 2026-06-17 cs.CL cs.AI cs.LG cs.PF cs.SE 版本更新

Regression Language Models for Code

代码的回归语言模型

Yash Akhauri, Xingyou Song, Arissa Wongpanich, Bryan Lewandowski, Mohamed S. Abdelfattah

AI总结 提出回归语言模型(RLM),利用冻结的大语言模型编码器直接从文本预测代码执行结果(如内存占用、延迟、神经网络精度等),在多个任务上达到高相关度。

Comments Published in International Conference on Machine Learning (ICML) 2026

AI中文摘要

我们研究代码到指标的回归:预测代码执行的数值结果,由于编程语言的开放性,这是一项具有挑战性的任务。虽然先前的方法依赖于繁重且特定领域的特征工程,但我们展示了一个统一的回归语言模型(RLM),使用冻结的LLM编码器可以直接从文本同时预测:(i) 多种高级语言(如Python和C++)代码的内存占用,(ii) Triton GPU内核的延迟,以及(iii) 以ONNX表示的已训练神经网络的精度和速度。特别是,一个基于T5Gemma的较小300M参数RLM在APPS的竞赛编程提交上获得了>0.9的Spearman等级相关系数,而单个统一模型在CodeNet的17种不同语言上获得了>0.5的平均Spearman等级相关系数。此外,RLM在五个经典NAS设计空间上获得了最高平均Kendall-Tau 0.46,这些空间此前由图神经网络主导,并且能同时预测多种硬件平台上的架构延迟。

英文摘要

We study code-to-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of programming languages. While prior methods have resorted to heavy and domain-specific feature engineering, we show that a single unified Regression Language Model (RLM) using a frozen LLM encoder can simultaneously predict, directly from text: (i) the memory footprint of code across multiple high-level languages such as Python and C++, (ii) the latency of Triton GPU kernels, and (iii) the accuracy and speed of trained neural networks represented in ONNX. In particular, a relatively small 300M parameter RLM based on T5Gemma obtains >0.9 Spearman-rank on competitive programming submissions from APPS, and a single unified model achieves >0.5 average Spearman-rank across 17 different programming languages from CodeNet. Furthermore, the RLM can obtain the highest average Kendall-Tau of 0.46 on five classic NAS design spaces previously dominated by graph neural networks, and simultaneously predict architecture latencies on numerous hardware platforms.

2601.03872 2026-06-17 cs.CL 版本更新

Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

Atlas: 编排异构模型与工具实现多领域复杂推理

Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao

AI总结 提出ATLAS双路径框架,通过免训练的聚类路由和强化学习多步路由动态选择最优模型-工具组合,在15个基准上超越GPT-4o,分布内外任务分别提升10.1%和13.1%。

Comments Accepted by ACL 2026

AI中文摘要

大语言模型与外部工具的集成显著扩展了AI代理的能力。然而,随着模型和工具多样性的增加,选择最优模型-工具组合成为一个高维优化挑战。现有方法通常依赖单一模型或固定工具调用逻辑,未能利用异构模型-工具对之间的性能差异。本文提出ATLAS(自适应工具-LLM对齐与协同调用),一种用于跨领域复杂推理中动态工具使用的双路径框架。ATLAS通过双路径方式运作:(1)免训练的基于聚类的路由,利用经验先验进行领域特定对齐;(2)基于强化学习的多步路由,探索自主轨迹以实现分布外泛化。在15个基准上的大量实验表明,我们的方法优于GPT-4o等闭源模型,在分布内(+10.1%)和分布外(+13.1%)任务上均超越现有路由方法。此外,我们的框架通过编排专用多模态工具在视觉推理中展现出显著提升。

英文摘要

The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.

2605.01973 2026-06-17 cs.CL cs.LG 版本更新

Learn-To-Learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM

在任意文本条件下学会学习:一种超网络驱动的元门控大语言模型

Luo Ji, Qi Qin, Ningyuan Xi, Teng Chen, Qingqing Gu, Hongyan Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种超网络驱动的元门控机制,通过动态调整SwiGLU块中的β参数,使LLM适应不同文本条件,优于微调和元学习基线。

Comments Accepted by ICML2026

AI中文摘要

传统的大语言模型可能面临语料异质性和细微条件变化的问题。虽然微调可能导致灾难性遗忘,但元学习在LLM上的应用也因其复杂性和可扩展性而受到限制。在本文中,我们激活了SwiGLU块中的元信号$β$,形成了一种自适应调整FFN非线性的元门控机制。我们使用超网络动态生成基于文本条件的$β$,为LLM提供元可控性。通过在任务、领域、角色和风格等不同条件类型上的测试,我们的方法优于微调和元学习基线,并且能够合理泛化到未见过的任务、条件类型或指令。我们的代码可在https://github.com/AaronJi/MeGan找到。

英文摘要

Conventional LLMs may suffer from corpus heterogeneity and subtle condition changes. While finetuning can cause catastrophic forgetting, the application of meta-learning to LLMs is also limited due to its complexity and scalability. In this paper, we activate the meta-signal of $β$ within the SwiGLU blocks, resulting in a meta-gating mechanism that adaptively adjusts the nonlinearity of FFN. A hypernetwork is employed which dynamically produces $β$ on textual conditions, providing meta-controllability on LLMs. By testing on different condition types such as task, domain, persona, and style, our method outperforms finetuning and meta-learning baselines, and can generalize reasonably on unseen tasks, condition types, or instructions. Our code can be found at https://github.com/AaronJi/MeGan.
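
For readers unfamiliar with the role of $β$, the sketch below shows the standard Swish gate, Swish_β(x) = x · sigmoid(βx), as used inside a SwiGLU unit. The `toy_hypernetwork` (a linear map followed by softplus) is a hypothetical stand-in; the abstract does not describe the actual hypernetwork or whether β is constrained to be positive.

```python
import math

def swish(x, beta):
    """Swish_beta(x) = x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

def swiglu_unit(gate_pre, value_pre, beta):
    """One scalar SwiGLU unit: Swish-gated branch times the linear branch."""
    return swish(gate_pre, beta) * value_pre

def toy_hypernetwork(condition_embedding, weights, bias=0.0):
    """Hypothetical hypernetwork: linear map plus softplus, producing a
    positive beta from a condition embedding."""
    z = sum(w * c for w, c in zip(weights, condition_embedding)) + bias
    return math.log1p(math.exp(z))

beta = toy_hypernetwork([0.3, -0.1], weights=[1.0, 2.0])
print(beta, swiglu_unit(gate_pre=1.5, value_pre=0.8, beta=beta))
```

With β = 0 the gate is linear (x/2), and as β grows it approaches ReLU, so a condition-dependent β changes how nonlinear the FFN is.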

2605.12227 2026-06-17 cs.CL 版本更新

A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation

结合在线优化与蒸馏以提升大语言模型的长上下文推理能力

Miguel Moura Ramos, Duarte M. Alves, André F. T. Martins

发表机构 * Instituto Superior Técnico, Universidade de Lisboa(里斯本大学理工学院) Instituto de Telecomunicações(电信研究所) TransPerfect(TransPerfect公司)

AI总结 本文提出dGRPO方法,结合在线优化与蒸馏,通过强教师模型提供密集指导,提升长上下文推理能力,同时保持短上下文能力。

AI中文摘要

适应大语言模型(LLMs)进行长上下文任务需要在训练后保持准确性和连贯性的方法。现有方法存在局限:1)监督微调(SFT)和知识蒸馏(KD)等离线方法存在曝光偏差且难以从模型生成的错误中恢复;2)在线强化学习方法如组相对策略优化(GRPO)更符合模型生成的状态,但因稀疏奖励导致不稳定和样本效率低;3)在线蒸馏(OPD)提供密集的token级指导,但不直接优化任意奖励信号。本文提出Distilled Group Relative Policy Optimization(dGRPO),通过OPD从更强的教师模型获得密集指导来增强GRPO。我们还引入LongBlocks,一个涵盖多跳推理、上下文接地和长形式生成的合成长上下文数据集。我们进行了广泛的实验和消融研究,比较离线训练、稀疏奖励GRPO和我们的综合方法,得出改进的长上下文对齐配方。总体而言,我们的结果表明,将基于结果的策略优化与知识蒸馏结合在一个目标中,为长上下文推理提供更稳定和有效的方法,同时保持短上下文能力。

英文摘要

Existing approaches to post-train models for long-context tasks face complementary limitations: (i) supervised fine-tuning (SFT) provides stable supervision but suffers from exposure bias; (ii) reinforcement learning methods such as Group Relative Policy Optimization (GRPO) train on model-generated trajectories but struggle with long-horizon credit assignment and sparse rewards; and (iii) on-policy distillation (OPD) provides dense token-level guidance but does not directly optimize task rewards. We study these complementary strategies for long-context alignment and derive a recipe that combines GRPO with OPD-style teacher guidance: the student learns from its own rollouts using outcome-level rewards, while a stronger teacher provides dense token-level regularization in place of the standard reference policy. This is especially useful when process-level supervision is difficult to obtain. To support this study, we introduce LongBlocks, a synthetic multilingual dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. Through controlled ablations, we isolate the roles of cold-start initialization, teacher anchoring, and data mixing, showing that our recipe yields a more stable and effective path to long-context reasoning than GRPO or OPD while preserving short-context capabilities.
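
The two signals named in the abstract can be sketched separately. The first function is the usual GRPO group-relative advantage (rewards normalized within a group of rollouts for the same prompt). The second is a simple per-token gap between student and teacher log-probabilities, used here as an assumed stand-in for the teacher regularizer; the abstract does not give the exact objective or how the two terms are weighted.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def teacher_gap(student_logprobs, teacher_logprobs):
    """Mean of log p_student - log p_teacher over the student's sampled
    tokens: a Monte Carlo estimate of the reverse KL to the teacher
    (illustrative regularizer, not the paper's stated loss)."""
    gaps = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(gaps) / len(gaps)

print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
print(teacher_gap([-1.0, -0.5], [-1.2, -0.9]))
```

The advantage is sparse (one value per rollout), while the teacher gap is defined for every token, which is the dense signal the abstract refers to.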

2606.00024 2026-06-17 cs.CL 版本更新

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

ART:面向高效大语言模型解码的注意力运行时终止

Chen Qiu, Guozhong Li, Cristian McGee, Aritra Dutta, Panos Kalnis

发表机构 * King Abdullah University of Science and Technology(阿卜杜拉国王科技大学) University of Central Florida(中央佛罗里达大学)

AI总结 提出注意力运行时终止(ART)机制,通过跟踪累积注意力输出并在贡献可忽略时终止后续KV块访问,在不显著影响准确率的情况下将大批量生成吞吐量提升20%。

AI中文摘要

大语言模型(LLM)中的长上下文解码受到获取大量键值(KV)缓存所需内存带宽的严重限制。大多数现有的KV管理方法依赖于解码前的仅键剪枝,尽管有证据表明注意力输出共同依赖于键和值,因为将值纳入其方法会带来过高的额外开销。在本文中,我们提出了注意力运行时终止(ART),一种轻量级的运行时机制,在内核执行期间跟踪累积的注意力输出,并在后续贡献变得可忽略时终止后续KV块访问。这种设计使ART与现有的基于键的KV缓存管理方法正交,从而能够与它们无缝集成。在LongBench基准上的实验表明,与最先进的基线相比,ART在大批量下实现了20%更高的生成吞吐量,同时保持了相当的准确率。

英文摘要

Long-context decoding in Large Language Models (LLMs) is constrained by the cost of accessing and processing the Key-Value (KV) cache. Despite evidence that attention outputs depend jointly on keys and values, most existing KV management methods rely on key-only pruning, since incorporating values incurs prohibitive overhead. In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks accumulated attention outputs during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. Rather than replacing KV selection, ART dynamically terminates redundant KV traversal on top of existing dense or sparse attention policies. We introduce a stability-based criterion that monitors both magnitude and directional changes of intermediate attention outputs and provides a theoretical characterization of the resulting truncation error. Experiments on the LongBench and RULER Needle-in-a-Haystack tasks show that ART increases the generation throughput of existing KV-cache methods by up to 20%, without compromising the result quality.
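
A minimal sketch of block-wise attention with early termination, assuming a single query and an online-softmax accumulator. The specific stopping test (relative change in output norm plus cosine distance between consecutive intermediate outputs) and its tolerances are illustrative assumptions in the spirit of the "magnitude and directional changes" criterion; they are not the paper's exact rule, and the function assumes at least one key.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_with_early_exit(query, keys, values, block=2,
                              mag_tol=1e-3, dir_tol=1e-3):
    """Process KV blocks in order with an online softmax and stop once the
    normalized output is stable in both magnitude and direction."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    prev = None
    blocks_used = 0
    for start in range(0, len(keys), block):
        for k, v in zip(keys[start:start + block], values[start:start + block]):
            score = _dot(query, k)
            new_max = max(running_max, score)
            scale = math.exp(running_max - new_max)  # rescale old terms
            weight = math.exp(score - new_max)
            denom = denom * scale + weight
            acc = [a * scale + weight * x for a, x in zip(acc, v)]
            running_max = new_max
        blocks_used += 1
        out = [a / denom for a in acc]
        if prev is not None:
            n_out, n_prev = math.sqrt(_dot(out, out)), math.sqrt(_dot(prev, prev))
            mag_change = abs(n_out - n_prev) / (n_prev + 1e-12)
            cosine = _dot(out, prev) / (n_out * n_prev + 1e-12)
            if mag_change < mag_tol and 1.0 - cosine < dir_tol:
                break  # remaining KV blocks are skipped
        prev = out
    return out, blocks_used

keys = [[10.0, 0.0]] * 2 + [[-10.0, 0.0]] * 4
values = [[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 4
print(attention_with_early_exit([1.0, 0.0], keys, values))
```

In this example the later blocks carry negligible weight, so traversal stops after the second of three blocks and the third is never read.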

2602.06154 2026-06-17 cs.LG cs.CL 版本更新

MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

MoSE: 混合可瘦身专家实现高效自适应语言模型

Nurbek Tastan, Stefanos Laskaridis, Karthik Nandakumar, Samuel Horvath

AI总结 提出MoSE架构,每个专家具有可变宽度的嵌套结构,支持在推理时连续调节精度-计算权衡,通过多宽度训练和轻量级测试时训练实现高效自适应。

Comments Accepted to ICML 2026

AI中文摘要

混合专家(MoE)模型通过稀疏激活专家高效扩展大型语言模型,但一旦选定专家,其执行是完整的。因此,MoE模型中精度与计算之间的权衡通常表现出较大的不连续性。我们提出混合可瘦身专家(MoSE),这是一种MoE架构,其中每个专家具有嵌套的、可瘦身的结构,可以以可变宽度执行。这不仅实现了对激活哪些专家的条件计算,还实现了对每个专家利用多少的条件计算。因此,单个预训练的MoSE模型可以在推理时支持更连续的精度-计算权衡谱。我们提出了一种简单且稳定的训练方法,用于在稀疏路由下训练可瘦身专家,将多宽度训练与标准MoE目标相结合。在推理过程中,我们探索了运行时宽度确定的策略,包括一种轻量级的测试时训练机制,该机制学习如何在固定预算下将路由器置信度/概率映射到专家宽度。在GPT风格模型、各种路由机制、零样本下游推理基准以及DeepSeek模型的持续预训练适应上的实验表明,MoSE在全宽度下匹配或优于标准MoE,并持续将计算-质量边界向更低的推理FLOPs移动。代码可在 https://github.com/tnurbek/mose 找到。

英文摘要

Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT-style models, various routing regimes, zero-shot downstream reasoning benchmarks, and continual pre-training adaptation of a DeepSeek model show that MoSE matches or improves standard MoE at full width and consistently shifts the compute-quality frontier toward lower inference FLOPs. The code can be found at: https://github.com/tnurbek/mose.
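
The nested structure can be sketched with a toy expert whose hidden units are ordered so that any prefix forms a valid smaller expert. The scalar-output two-layer ReLU expert and the `width` fraction parameter are illustrative assumptions; the repository linked above is the reference for the actual implementation.

```python
import math

def slimmable_expert(x, w_in, w_out, width=1.0):
    """Two-layer ReLU expert that uses only the first `width` fraction of
    its hidden units. w_in holds one input-weight vector per hidden unit,
    w_out one output weight per hidden unit (toy scalar output).

    Returns the output and the number of hidden units evaluated."""
    hidden = len(w_in)
    active = max(1, math.ceil(width * hidden))
    out = 0.0
    for j in range(active):
        h = max(0.0, sum(wi * xi for wi, xi in zip(w_in[j], x)))
        out += w_out[j] * h
    return out, active

w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
w_out = [1.0, 1.0, 1.0, 1.0]
for width in (0.25, 0.5, 1.0):
    print(width, slimmable_expert([1.0, 2.0], w_in, w_out, width))
```

Compute scales with the number of active hidden units, so the same weights serve several accuracy-compute operating points.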

2602.09802 2026-06-17 cs.AI cs.CL 版本更新

Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices

大型语言模型会为景观额外付费吗?从主观选择中推断支付意愿

Manon Reusens, Sofie Goethals, Toon Calders, David Martens

AI总结 研究在旅行助手场景下,通过多项逻辑模型分析LLM的主观选择,推断其支付意愿并与人类基准比较,发现LLM在属性层面存在系统偏差且高估支付意愿,但通过条件化偏好可改善。

AI中文摘要

随着大型语言模型(LLM)越来越多地部署在旅行辅助和购买支持等应用中,它们常常需要在没有客观正确答案的情况下代表用户做出主观选择。我们在旅行助手背景下研究LLM的决策,通过向模型呈现选择困境,并使用多项逻辑模型分析其响应,推导出隐含的支付意愿(WTP)估计。随后将这些WTP值与经济学文献中的人类基准值进行比较。除了基线设置外,我们还研究了在更现实条件下模型行为的变化,包括提供用户过去选择的信息和基于角色的提示。我们的结果表明,虽然可以从较大的LLM中推导出有意义的WTP值,但它们在属性层面也显示出系统偏差。此外,它们倾向于整体高估人类的WTP,特别是在引入昂贵选项或面向商业的角色时。将模型条件化于对更便宜选项的先前偏好,得出的估值更接近人类基准。总体而言,我们的发现突出了使用LLM进行主观决策支持的潜力和局限性,并强调了在实际部署此类系统时仔细选择模型、设计提示和表示用户的重要性。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users' past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
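
The step from multinomial logit coefficients to a WTP estimate can be sketched as follows. With a linear utility, the implied WTP for an attribute is the negative ratio of its coefficient to the price coefficient. The coefficient values and attribute names below are made up for illustration and are not results from the paper.

```python
import math

def choice_probabilities(options, coefs):
    """Multinomial logit: softmax over linear utilities of each option."""
    utils = [sum(coefs[name] * value for name, value in opt.items())
             for opt in options]
    m = max(utils)
    exps = [math.exp(u - m) for u in utils]
    total = sum(exps)
    return [e / total for e in exps]

def willingness_to_pay(coefs, attribute, price_key="price"):
    """Implied WTP: -beta_attribute / beta_price."""
    return -coefs[attribute] / coefs[price_key]

coefs = {"price": -0.02, "view": 0.6}  # hypothetical fitted coefficients
options = [{"price": 100.0, "view": 1.0}, {"price": 70.0, "view": 0.0}]
print(willingness_to_pay(coefs, "view"))
print(choice_probabilities(options, coefs))
```

In this toy setting the room with a view costs exactly the implied WTP more than the room without, so the model is indifferent between the two options.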

2602.11715 2026-06-17 cs.LG cs.CL 版本更新

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

DICE:扩散大语言模型在生成CUDA内核方面表现出色

Haolei Bai, Lingcheng Kong, Xueyi Chen, Jianmian Wang, Zhiqiang Tao, Huan Wang

AI总结 提出CuKe数据集和BiC-RL训练框架,构建DICE系列扩散大语言模型(1.7B/4B/8B),在KernelBench上显著优于同类自回归和扩散模型,实现CUDA内核生成新SOTA。

Comments v2: Expanded with dLLM vs. autoregressive LLM comparisons, ablation studies, and qualitative case studies

AI中文摘要

扩散大语言模型(dLLMs)因其并行生成令牌的能力,已成为自回归(AR)LLMs的有力替代方案。这一范式特别适用于代码生成,其中整体结构规划和非顺序优化至关重要。尽管有这种潜力,但针对CUDA内核生成定制dLLMs仍然具有挑战性,不仅因为高度专业化,还因为严重缺乏高质量的训练数据。为了解决这些挑战,我们构建了CuKe,一个针对高性能CUDA内核优化的增强监督微调数据集。在此基础上,我们提出了一个双阶段策划强化学习(BiC-RL)框架,包括CUDA内核填充阶段和端到端CUDA内核生成阶段。利用这一训练框架,我们推出了DICE,一系列专为CUDA内核生成设计的扩散大语言模型,涵盖1.7B、4B和8B三个参数规模。在KernelBench上的大量实验表明,DICE显著优于同等规模的自回归和扩散LLMs,为CUDA内核生成建立了新的最先进水平。

英文摘要

Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, obstructed not only by the high specialization but also by the severe lack of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. On top of it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales, 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state-of-the-art for CUDA kernel generation.

2604.03444 2026-06-17 cs.LG cs.CL 版本更新

Olmo Hybrid: From Theory to Practice and Back

Olmo Hybrid:从理论到实践再回到理论

William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, Chuan Li, Kyle Lo, Saumya Malik, DJ Matusz, Benjamin Minixhofer, Jacob Morrison, Luca Soldaini, Finbarr Timbers, Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi, Ashish Sabharwal

AI总结 本文通过理论分析和实验验证,证明混合模型(结合注意力与线性RNN)在表达能力、扩展效率上优于纯Transformer,并训练了7B参数的Olmo Hybrid模型,在标准评估中超越Olmo 3。

Comments Corrected author list and typos in appendix

AI中文摘要

近期工作展示了非Transformer语言模型(尤其是线性递归神经网络(RNN)和混合注意力与递归的混合模型)的潜力。然而,对于这些新架构的潜在优势是否值得承担规模化扩展的风险和努力,尚无共识。为解决此问题,我们从多个方面提供混合模型优于纯Transformer的证据。首先,理论上,我们证明混合模型不仅继承了Transformer和线性RNN的表达能力,还能表达超出两者的任务,例如代码执行。将这一理论付诸实践,我们训练了Olmo Hybrid,一个70亿参数模型,与Olmo 3 7B基本相当,但将滑动窗口层替换为Gated DeltaNet层。我们表明,在标准预训练和中期训练评估中,Olmo Hybrid优于Olmo 3,证明了混合模型在受控大规模设置下的优势。我们发现混合模型的扩展效率显著高于Transformer,这解释了其更高的性能。然而,尚不清楚为何特定形式问题上的更高表达能力会导致更好的扩展性或在下游任务(与这些问题无关)上表现更优。为解释这一明显差距,我们回到理论,论证为何增强的表达能力应转化为更好的扩展效率,从而完成循环。总体而言,我们的结果表明,混合注意力和递归层的混合模型是语言建模范式的强大扩展:不仅用于减少推理时的内存,更是获得在预训练中更好扩展的更具表达能力模型的基本途径。

英文摘要

Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.

2605.30036 2026-06-17 cs.AI cs.CL 版本更新

Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

向机器传授价值观:在LLMs中模拟类人行为

Asaf Yehudai, Naama Rozen, Ariel Gera

发表机构 * The Hebrew University of Jerusalem(耶路撒冷希伯来大学) IBM Research(IBM研究院) Tel-Aviv University(特拉维夫大学)

AI总结 本研究基于心理学价值理论,通过大规模实验(超过500万个问题)评估价值提示的LLMs在价值结构和价值-行为关系上与人类的一致性,并证明引入人类价值分布可增强群体模拟。

Comments We had some disagreement regarding proper attribution; we hope to resolve it soon and upload the paper

AI中文摘要

大型语言模型(LLMs)展示了采用不同角色和身份的能力;然而,它们是否能表现出符合连贯、类人价值结构的行为仍不清楚。在这项工作中,我们借鉴既定的心理学价值理论,在LLMs中诱导类人价值观,并评估它们与人类研究中观察到的模式的一致性。使用经过验证的心理学问卷,我们进行了大规模实验——超过500万个问题——以评估领先LLMs的价值结构和价值-行为关系,并将其与人类进行比较。我们的发现揭示了价值提示的LLMs与人类在两个维度上的强烈一致性。此外,引入人类价值分布增强了价值诱导LLMs的群体模拟。这些发现凸显了价值诱导LLMs作为有效的、基于心理学的模拟人类行为工具的潜力。

英文摘要

Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments -- over 5 million questions -- to evaluate value structures and value-behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.

2. 机器翻译与跨语言处理 6 篇

2606.17234 2026-06-17 cs.CL 新提交

Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

自评之言:论大语言模型在机器翻译中的口头化置信度

Ali Marashian, Alexis Palmer, Katharina von der Wense

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校) Johannes Gutenberg University Mainz(美因茨约翰内斯·古腾堡大学)

AI总结 本研究设计了五种无需内部信号的口头化方法提取LLM逐词置信度,并与内部确定性信号比较,发现两者在细粒度错误检测和校准上表现相似但相关性低。

AI中文摘要

大语言模型(LLMs)在翻译中的迅速普及要求对其自身输出的置信度可靠性进行深入研究。与许多生成任务不同,翻译错误和置信度可以在不同粒度级别(标记、单词或片段)上发挥作用。基于预测概率等内部信号的无监督方法可能具有误导性,因为它们反映的是替代方案之间的确定性而非正确性。此外,这些方法需要访问此类内部信号。本文设计了五种口头化方法,用于在没有这些缺点的情况下提取LLM的逐词置信度,并将其可靠性与模型内部确定性信号进行比较。我们使用两种对齐形式评估可靠性:细粒度错误检测和校准。对于两者,内部方法和口头化方法表现相似,尽管结果因模型而异。有趣的是,我们发现内部方法与口头化方法之间几乎没有相关性。

英文摘要

The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs. Unlike many generation tasks, translation errors and confidence levels can be useful at different levels of granularity (tokens, words, or spans). Unsupervised approaches based on internal signals like predicted probabilities can be misleading because they reflect certainty among alternatives rather than correctness. In addition, they require access to such internal signals. Here, we devise five verbalized methods of extracting an LLM's per-token confidence without those shortcomings and compare their reliability with that of the model's internal signals of certainty. We evaluate reliability using two forms of alignment: fine-grained error detection and calibration. For both, internal and verbalized methods perform similarly, although results vary by model. Interestingly, we find little to no correlation between internal and verbalized methods.
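
Calibration of confidence scores is commonly summarized with the expected calibration error (ECE). The sketch below is a standard equal-width-bin ECE and is offered only as an illustration; the abstract does not say which calibration metric the authors use, and the confidence values are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean gap between average confidence
    and accuracy within each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

# Hypothetical per-token confidences and correctness labels.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [True, True, False, True]))
```

A model that reports 0.9 confidence and is right nine times out of ten scores near zero, while a model that is always fully confident but right half the time scores 0.5.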

2606.17354 2026-06-17 cs.CL cs.AI 新提交

Translating the Untranslatable: An Operationalizable Ontology for Untranslatability

翻译不可译:一个可操作化的不可译性本体论

Jacob Bremerman, Brihi Joshi, Hirona Arai, Xiang Ren, Jonathan May

发表机构 * University of Southern California Information Sciences Institute(南加州大学信息科学研究所)

AI总结 提出一个结构化的不可译性本体论和补偿策略分类法,构建多语言数据集,通过人类偏好研究发现注释补偿策略最受青睐,为策略感知机器翻译奠定基础。

AI中文摘要

不可译性,即意义无法在语言间直接保留的情况,在语言学中已有深入研究,但在自然语言处理中尚未充分探索。随着机器翻译系统在标准基准测试上的改进,其局限性越来越集中在这些情况下,即翻译无法简化为一一对应。我们引入了一个结构化的不可译性本体论以及补偿策略的分类法,这些策略是在这些不可译情况下传达意义的具体技术。我们将该框架操作化为一个多语言数据集,包含不可译句子及其基于策略的翻译,从而能够对翻译行为进行受控分析。初步的人类偏好研究表明,翻译质量取决于所使用的策略,并且对包含解释性上下文(称为注释补偿策略)的输出存在一致的偏好。我们的框架和数据集为研究和建模策略感知的机器翻译提供了基础。

英文摘要

Untranslatability, cases where meaning cannot be directly preserved across languages, is well-studied in linguistics but underexplored in NLP. As machine translation (MT) systems improve on standard benchmarks, their limitations increasingly concentrate in such cases, where translation cannot be reduced to one-to-one equivalence. We introduce a structured ontology of untranslatability along with a taxonomy of compensation strategies, which are specific techniques to convey meaning under these untranslatable circumstances. We operationalize this framework into a multilingual dataset of untranslatable sentences paired with strategy-based translations, enabling controlled analysis of translation behavior. Initial human preference studies suggest that translation quality depends on the strategy used, with consistent preferences for outputs that include explanatory context, known as the Annotation compensation strategy. Our framework and dataset provide a foundation for studying and modeling strategy-informed machine translation.

2606.18033 2026-06-17 cs.CL cs.AI 新提交

When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning

当英语不是最好的老师:跨语言上下文学习中的源语言效应

Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé

发表机构 * SnT, University of Luxembourg(卢森堡大学安全、可靠性与信任跨学科中心) Luxembourg Institute of Science and Technology(卢森堡科学技术研究院)

AI总结 研究跨语言上下文学习(ICL)中源语言选择的影响,发现基于微调的预期在ICL中不成立,提出有效选择源语言的替代启发式方法。

Comments Accepted at 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026), co-located with ACL 2026

AI中文摘要

跨语言迁移在多语言自然语言处理中已在监督微调背景下得到广泛探索,其中数据可用性和语言相似性等因素在很大程度上决定了迁移质量。随着该领域转向少样本上下文学习(ICL),人们通常认为微调中的见解会原封不动地延续下来。然而,这一假设尚未经过严格评估,因此如何为跨语言ICL选择源语言的问题仍然悬而未决。我们对ICL中的跨语言迁移进行了广泛的实证研究,涵盖七项任务、六个模型和一组类型多样的语言。我们进一步分析了语言混淆,这是跨语言ICL中生成任务的关键障碍。我们的结果表明,基于微调的传统预期在ICL场景中并不一致适用,并指出了有效选择源语言的替代启发式方法。

英文摘要

Cross-lingual transfer in multilingual NLP has been widely explored in supervised fine-tuning contexts, where factors like data availability and linguistic similarity largely determine transfer quality. As the field shifts toward few-shot In-Context Learning (ICL), it is often presumed that insights from fine-tuning carry over unchanged. Yet this assumption has not been rigorously evaluated, leaving open the question of how to choose source languages for cross-lingual ICL. We conduct a broad empirical study of cross-lingual transfer in ICL spanning seven tasks, six models, and a typologically diverse set of languages. We further analyze language confusion, a key obstacle for generative tasks in cross-lingual ICL. Our results show that conventional fine-tuning-based expectations do not consistently apply in the ICL regime and point to alternative heuristics for selecting source languages effectively.

2605.21135 2026-06-17 cs.CL 版本更新

Smarter edits? Post-editing with error highlights and translation suggestions

更智能的编辑?基于错误高亮和翻译建议的后编辑

Fleur V. J. van Tellingen, Gautam Ranka, Dora Žugčić, Joyce van der Wal, Andrea Camasta, Livio Guerra, Alina Karakanta

发表机构 * Leiden University Centre for Linguistics(莱顿大学语言研究中心) Visvesvaraya National Institute of Technology(维什瓦塞拉亚国家理工学院) Department of Bionanoscience, Faculty of Applied Sciences, Delft University of Technology(应用科学学院生物纳米科学系,代尔夫特理工大学) Pedagogical Sciences, Leiden University(莱顿大学教育科学) Faculty of Science, Leiden University(莱顿大学科学学院)

AI总结 本文研究了基于自动后编辑(APE)的错误高亮和纠正建议在后编辑任务中的有效性,发现虽然没有提升生产力和质量,但APE高亮和纠正建议提升了用户体验。

Comments Accepted at EAMT 2026

详情
AI中文摘要

随着机器翻译质量的提高,对增强的后编辑功能(如基于QE的错误高亮)的兴趣正在增长,但其有用性的证据仍然有限。在本文中,我们探讨了基于自动后编辑(APE)的错误高亮和纠正建议的有用性。我们进行了一项研究,让专业翻译员(En-Nl)使用APE错误高亮和纠正建议进行后编辑,并将生产力、质量和用户体验与常规PE和带有QE衍生高亮的PE进行比较。尽管没有条件相比常规PE在生产力或质量上有所提升,但APE高亮比QE衍生高亮更受好评,而纠正建议提高了整体的用户体验。

英文摘要

As MT quality increases, interest in enhanced post-editing features such as QE-derived error highlights is growing, yet evidence for their usefulness remains limited. In this work, we explore the usefulness of LLM-derived error highlights and correction suggestions based on automatic post-editing (APE). We conduct a study where professional translators (En-Nl) post-edit translations using APE error highlights and correction suggestions and compare productivity, quality and user experience to regular PE and PE with QE-derived highlights. While no condition yielded productivity or quality gains compared to regular PE, APE highlights were better received than QE-derived highlights, and correction suggestions improved overall user experience.

2606.15883 2026-06-17 cs.CL cs.AI 版本更新

Koshur Diacritizer: A Byte-Level Sequence-to-Sequence Model for Kashmiri Diacritic Restoration

Koshur Diacritizer:用于克什米尔语变音符号恢复的字节级序列到序列模型

Haq Nawaz Malik, Nahfid Nissar, Faizan Iqbal

发表机构 * arXiv

AI总结 针对克什米尔语数字文本中变音符号缺失导致的歧义问题,提出基于ByT5-small的字节级序列到序列模型Koshur Diacritizer,结合脚本感知归一化、对齐验证和骨架保留推理,在测试集上实现DERm 0.2012和WER 0.2159,专家评估准确率77.5%。

详情
AI中文摘要

克什米尔语是一种使用改良的波斯-阿拉伯字母书写的印度-雅利安语言,在数字文本中经常省略变音符号,造成歧义并挑战下游NLP应用。我们提出了Koshur Diacritizer,一个基于ByT5-small的字节级序列到序列模型,用于恢复克什米尔语文本中的变音符号。为支持此任务,我们发布了一个公开可用的数据集,包含23.7k对齐的未变音/变音克什米尔语句对。所提出的框架结合了脚本感知归一化、对齐验证和骨架保留推理,以确保在保持原始基本字母序列的同时进行可靠的恢复。在保留测试集上的实验结果显示,DERm为0.2012,WER为0.2159。此外,由克什米尔语母语语言学专家评估的平均准确率为77.5%。数据集、模型和源代码已公开发布,为克什米尔语变音符号恢复和未来的低资源语言研究提供了可复现的基线。

英文摘要

Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, is frequently written without diacritic marks in digital text, creating ambiguity and challenging downstream NLP applications. We present Koshur Diacritizer, a ByT5-small byte-level sequence-to-sequence model for restoring diacritics in Kashmiri text. To support this task, we release a publicly available dataset of 23.7k aligned undiacritized/diacritized Kashmiri sentence pairs. The proposed framework combines script-aware normalization, alignment validation, and skeleton-preserving inference to ensure reliable restoration while maintaining the original base-letter sequence. On a held-out test set, the model achieves a DERm of 0.2012 and a WER of 0.2159. Additionally, evaluation by a native Kashmiri linguistic expert yields a mean accuracy of 77.5%. The dataset, model, and source code are publicly released to provide a reproducible baseline for Kashmiri diacritic restoration and future low-resource language research.
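
The skeleton-preserving constraint mentioned above can be illustrated with a small check. This is only a sketch under assumptions: it treats every Unicode combining mark (category `Mn`) as a diacritic, which may not match the paper's own definition of the base-letter sequence for Kashmiri, and the function names `skeleton` and `preserves_skeleton` are hypothetical.

```python
import unicodedata


def skeleton(text: str) -> str:
    """Drop combining marks (Unicode category Mn), keeping the base letters.

    Assumption: diacritics are exactly the Mn code points; the paper's
    script-aware normalization may treat some characters differently.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def preserves_skeleton(source: str, prediction: str) -> bool:
    """True if the prediction only added or changed combining marks."""
    return skeleton(source) == skeleton(prediction)
```

A decoder could use such a check to reject or repair any output whose base letters differ from the input, so that restoration never rewrites the underlying word.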

2606.16009 2026-06-17 cs.CL cs.HC 版本更新

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

弥合可用性差距:口译研究对机器口译设计的启示

Claudio Fantinuoli

发表机构 * University of Mainz(美因茨大学)

AI总结 本文定义机器口译为语音翻译的子领域,指出其存在“准确性幻觉”,并借鉴口译研究提出未来设计的三个优先方向:能动性、共同基础与体验,以弥合可用性差距。

详情
AI中文摘要

机器口译(MI)作为语音翻译的实时分支,在标准基准测试中取得了显著进展,一些系统在文本保真度上接近人类水平。然而,用户体验仍远不如口译员中介的交流,揭示了所谓的“准确性幻觉”:系统在纸面上表现准确,但在实践中无法支持流畅、目标导向的互动。本文将MI定义为语音翻译的一个独特子领域,具有自身特点,并需要基于交际有效性而非孤立保真度指标的评估方法。借鉴口译研究的见解,我们识别了当前系统忽视的专业口译实践的关键维度,并将其整合为未来MI的三个相互依赖的设计优先方向:能动性(上下文敏感的主动性和修复)、共同基础(多模态和话语级情境意识)以及体验(通过真实互动进行自适应改进)。这些优先方向共同为弥合可用性差距、实现能够实时维持真实多语言交流的系统指明了道路。

英文摘要

Machine interpreting (MI), the live, real-time application of speech translation, has achieved remarkable progress on standard benchmarks, with some systems approaching human parity on textual fidelity. Yet the user experience remains far inferior to interpreter-mediated communication, revealing what we term the accuracy illusion: systems that appear accurate on paper but fail in practice to support smooth, goal-oriented interaction. This paper defines MI as a distinct subfield of speech translation, with its own characteristics and the need for evaluation methods grounded in communicative effectiveness rather than isolated fidelity metrics. Drawing on insights from interpreting studies, we identify critical dimensions of professional interpreting practice that are overlooked by current systems, and consolidate them into three interdependent design priorities for future MI: agency (context-sensitive initiative and repair), grounding (multimodal and discourse-level situational awareness), and experience (adaptive improvement through real interaction). Together, these priorities chart a path toward closing the usability gap and enabling systems that can sustain authentic multilingual communication in real time.

3. 信息抽取、检索与问答 6 篇

2606.18103 2026-06-17 cs.CL cs.IR 新提交

HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice

HistoRAG:通过批判性技术实践将历史方法论嵌入检索增强生成

Noah J. Kim-Baumann, Torsten Hiltmann

发表机构 * Humboldt-Universität zu Berlin(柏林洪堡大学)

AI总结 针对历史学等解释性学科,提出HistoRAG框架,通过分离检索与生成、时间窗口化、LLM作为评判者等架构干预,将历史编纂原则转化为RAG设计,解决标准RAG中的时间偏差、相关性评估等问题。

Comments 25 pages, 6 figures. Companion preprint to a Journal of Digital History notebook article (under review)

详情
AI中文摘要

检索增强生成(RAG)是将语言模型输出基于外部证据的主流架构,但其主导评估范式和默认配置仍面向事实性问答。对于历史研究等解释性学科,RAG嵌入了与学术实践相冲突的假设。我们提出HistoRAG,一个将历史编纂原则转化为具体架构干预的框架。分离的检索与生成将来源发现与解释解耦,时间窗口化强制在研究期间内平衡来源表示(作为历史探究的方法论要求),LLM作为评判者的评估使相关性判断透明且可争议。我们使用SPIEGELragged(应用于《明镜》周刊1950-1979年的102,189篇文章)评估这些干预。每项干预都解决了标准RAG中可测量的缺陷:使用1970年代术语时,特定时代词汇从1950年代检索到零个块(这证明了促使窗口化的时间偏差);向量相似性与LLM评估的相关性仅弱相关(Spearman rho = 0.275),促使后检索评估;基于关键词和语义的检索主要发现不相交的来源池,促使一种架构,其中两者在共享的LLM评估过滤器下作为互补检索层运行。我们还引入了Zwischentexte(作为解释性提议而非发现的中介文本)的概念,作为将LLM生成文本负责任地整合到学术实践中的框架。该架构为如何将特定领域的认识论承诺转化为RAG设计决策提供了模型,并可能迁移到其他处理大型语料库的解释性学科。

英文摘要

Retrieval-Augmented Generation (RAG) is the prevailing architecture for grounding language model outputs in external evidence, yet its dominant evaluation paradigms and default configurations remain oriented toward factual question-answering. For interpretive disciplines such as historical studies, RAG embeds assumptions that conflict with scholarly practice. We introduce HistoRAG, a framework that translates historiographical principles into concrete architectural interventions. Separated retrieval and generation decouples source discovery from interpretation, temporal windowing enforces balanced source representation across the research period as a methodological requirement of historical inquiry, and LLM-as-judge evaluation makes relevance judgments transparent and contestable. We evaluate these interventions using SPIEGELragged, applied to 102,189 articles from Der Spiegel (1950-1979). Each intervention addresses a measurable deficiency in standard RAG: era-specific vocabulary retrieves zero chunks from the 1950s when using 1970s terminology, evidence of the temporal skew that motivates windowing; vector similarity and LLM-assessed relevance correlate only weakly (Spearman rho = 0.275), motivating post-retrieval evaluation; and keyword-based and semantic retrieval surface largely disjoint source pools, motivating an architecture in which both operate as complementary retrieval layers under a shared LLM evaluation filter. We also introduce the concept of Zwischentexte (intermediate texts that function as interpretive proposals rather than findings) as a framework for responsible integration of LLM-generated text into scholarly practice. The architecture offers a model for how domain-specific epistemological commitments can be translated into RAG design decisions, and may transfer to other interpretive disciplines working with large corpora.
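
As an illustration of temporal windowing, the sketch below keeps the best-scoring hits within each time window instead of taking a single global top-k, so that no decade is crowded out by another. The function name, the `(doc_id, year, score)` tuple layout, and the ten-year window are assumptions made for this example, not details taken from HistoRAG.

```python
from collections import defaultdict


def windowed_top_k(hits, window_years=10, k_per_window=2, start_year=1950):
    """Select the top-k hits separately inside each time window.

    hits: iterable of (doc_id, year, score) tuples (hypothetical layout).
    Returns a dict mapping window start year -> list of doc_ids, best first.
    """
    buckets = defaultdict(list)
    for doc_id, year, score in hits:
        # Map the publication year to the start of its window.
        window = start_year + ((year - start_year) // window_years) * window_years
        buckets[window].append((score, doc_id))

    selected = {}
    for window in sorted(buckets):
        ranked = sorted(buckets[window], reverse=True)[:k_per_window]
        selected[window] = [doc_id for _, doc_id in ranked]
    return selected
```

With a global top-2, a query phrased in 1970s vocabulary would return only 1970s articles in the test data below; per-window selection still surfaces the weaker 1950s hits for the historian to judge.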

2606.17910 2026-06-17 cs.IR cs.AI cs.CL 交叉投稿

Non-negative Elastic Net Decoding for Information Retrieval

非负弹性网络解码用于信息检索

Koki Okajima, Yasutoshi Ida, Tsukasa Yoshida, Yasuaki Nakamura

发表机构 * NTT, Inc.(NTT公司)

AI总结 提出非负弹性网络(NNN)解码方法,将检索视为联合解码问题,通过稀疏非负线性组合重构查询嵌入,在理论上严格优于稠密检索,实验表明在多个基准上取得一致改进。

Comments 19 pages, 4 figures

详情
AI中文摘要

稠密检索已成为信息检索中的主导范式,其中每个文档通过其向量嵌入与查询的内积进行评分,并根据分数检索前$k$个文档。然而,由于每个文档的分数仅取决于查询和自身的嵌入,检索过程忽略了整个语料库的内容。因此,稠密检索无法避免从语料库中选择语义相似的文档,这可能导致检索结果集缺乏多样性且冗余。为此,我们将检索视为一个联合解码问题,其中文档作为集合被选择,并考虑语料库其余部分的上下文。为了实现这一点,我们提出了非负弹性网络(NNN)解码,它选择嵌入能够联合重构查询嵌入(作为稀疏非负线性组合)的文档。我们的主要理论结果建立了稠密检索与NNN解码之间的严格分离。对于任何语料库,稠密检索正确处理的每个查询也由NNN解码处理,而在包含相关文档的语料库上,NNN解码额外处理了稠密检索无法处理的查询。实验结果表明,将NNN解码应用于为内积评分训练的冻结嵌入,在多个基准上产生了一致的改进。此外,我们引入了一种端到端训练过程,优化嵌入以用于NNN解码,在所有指标和基准上相比稠密检索产生了显著的性能提升。我们的工作为在信息检索中利用稠密嵌入建立了一种新的范式,超越了内积评分的标准实践。

英文摘要

Dense retrieval has become the dominant paradigm in information retrieval, in which each document is scored against a query by the inner product of their vector embeddings, and the top-$k$ documents by score are retrieved for this query. However, since each document's score depends solely on the embedding of the query and itself, the retrieval process is oblivious to the content of the rest of the corpus. Therefore, dense retrieval cannot avoid selecting semantically similar documents from the corpus, which may result in a non-diverse, redundant set of retrieved documents. To address this, we approach retrieval as a joint decoding problem, in which documents are selected as a set with regard to the context of the rest of the corpus. To achieve this, we propose Non-Negative elastic Net (NNN) decoding, which selects documents whose embeddings jointly reconstruct the query embedding as a sparse non-negative linear combination. Our main theoretical result establishes a strict separation between dense retrieval and NNN decoding. For any corpus, every query correctly handled by dense retrieval is also handled by NNN decoding, while on corpora containing correlated documents, NNN decoding additionally handles queries that dense retrieval cannot. Experimental results indicate that applying NNN decoding to frozen embeddings trained for inner-product scoring yields consistent improvements across several benchmarks. Moreover, we introduce an end-to-end training procedure which optimizes the embeddings for NNN decoding, producing significant performance gains over dense retrieval across all metrics and benchmarks. Our work establishes a new paradigm for leveraging dense embeddings in information retrieval, beyond the standard practice of inner-product scoring.
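
To make the idea of reconstructing the query as a sparse non-negative combination concrete, here is a minimal sketch. It assumes a standard non-negative elastic net objective, 0.5·‖q − Dᵀw‖² + λ₁·Σw + 0.5·λ₂·‖w‖² with w ≥ 0, solved by projected gradient descent; the paper's exact objective, solver, and hyperparameters are not given in the abstract, so the function name and the values of `l1` and `l2` are illustrative only.

```python
import numpy as np


def nn_elastic_net_decode(q, D, l1=0.05, l2=0.01, iters=5000):
    """Projected gradient descent for a non-negative elastic net (assumed form).

    q: query embedding, shape (dim,)
    D: document embeddings, shape (n_docs, dim)
    Returns non-negative weights w, one per document.
    """
    w = np.zeros(D.shape[0])
    # 1 / Lipschitz constant of the smooth part of the objective.
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + l2)
    for _ in range(iters):
        grad = D @ (D.T @ w - q) + l1 + l2 * w
        w = np.maximum(0.0, w - step * grad)  # project onto w >= 0
    return w


# Toy corpus: documents 0 and 1 are near-duplicates, document 2 is distinct.
D = np.array([[1.0, 0.0], [0.95, 0.31], [0.0, 1.0]])
q = np.array([0.8, 0.6])

dense_top2 = set(np.argsort(-(D @ q))[:2].tolist())
nnn_selected = {i for i, wi in enumerate(nn_elastic_net_decode(q, D)) if wi > 1e-6}
```

In this toy corpus, inner-product scoring ranks the two near-duplicate documents first, whereas the decoded support pairs one of them with the distinct document, because together those two reconstruct the query more cheaply.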

2606.18037 2026-06-17 cs.AI cs.CL cs.MA 交叉投稿

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

ProvenanceGuard:基于MCP的LLM智能体的源感知事实性验证

Ander Alvarez, Santhiya Rajan, Samuel Mugel, Román Orús

发表机构 * Multiverse Computing Parque Cientifico y Tecnológico de Gipuzkoa(吉普斯夸科技园) Centre for Social Innovation(社会创新中心) Donostia International Physics Center(多诺斯蒂亚国际物理中心) Ikerbasque Foundation for Science(伊克尔巴斯克科学基金会)

AI总结 提出ProvenanceGuard,一种源感知验证器,通过追踪MCP工具调用、分解声明并路由到特定源,检测跨源混淆错误,在医疗领域数据集上优于源无关基线。

Comments 20 pages, 4 figures

详情
AI中文摘要

使用工具的LLM智能体越来越多地采用模型上下文协议(MCP)从异构证据源(包括搜索、API、数据库、临床记录和处方工具)中获取答案。标准的事实性指标通常测试答案是否得到汇总证据的支持,但忽略了一种源感知的失败模式:一个声明可能在某个地方得到支持,却被归因于错误的来源。我们称此为跨源混淆。我们引入ProvenanceGuard,一种用于MCP基础答案的源感知验证器。它消耗捕获的MCP轨迹,包含稳定的工具ID、源ID和原始输出;将答案分解为原子声明;将声明路由到特定源的证据;使用NLI和令牌对齐代理检查支持度;比较声明的归属与路由的源;并返回每个声明的判定以及答案级别的允许/阻止决策。被阻止的答案可以通过检索增强的答案修订和重新验证来修复。我们在281个医疗领域的MCP智能体轨迹上进行评估。一个包含266条轨迹的裁定子集产生了2,325个由LLM辅助的声明标签(按轨迹划分);361个保留标签由人工验证。在40条轨迹的保留子集上,ProvenanceGuard在260个符合源条件的声明上实现了阻止F1分数0.802和源准确率0.858,优于不输出声明到源ID的源无关基线。在一个更困难的多源基准上,它达到了阻止F1分数0.846,而源加关系准确率降至0.229,表明在语义相近的源上精确的源归属仍然困难。修复和重新验证解决了完整轨迹集中的所有被阻止答案,通常通过保守回退。在50个受控的临床混淆探测中,ProvenanceGuard检测到所有注入的归属交换,没有保留错误的归属。这些结果表明,源归属是基于MCP的智能体事实性验证的一个独立维度。

英文摘要

Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually test whether an answer is supported by pooled evidence, missing a provenance-sensitive failure mode: a claim may be supported somewhere while being attributed to the wrong source. We call this cross-source conflation. We introduce ProvenanceGuard, a source-aware verifier for MCP-grounded answers. It consumes captured MCP traces with stable tool IDs, source IDs, and raw outputs; decomposes answers into atomic claims; routes claims to source-specific evidence; checks support with NLI and a token-alignment proxy; compares stated attribution with the routed source; and returns per-claim verdicts plus an answer-level allow/block decision. Blocked answers can be repaired with retrieval-augmented answer revision and re-verified. We evaluate on 281 medical-domain MCP-agent traces. A 266-trace adjudicated subset yields 2,325 LLM-assisted claim labels split by trace; 361 held-out labels are human-verified. On the 40-trace held-out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims, outperforming source-blind baselines that do not emit claim-to-source IDs. On a harder multi-source benchmark it reaches block F1 0.846, while source-plus-relation accuracy drops to 0.229, showing that exact source ownership remains difficult with semantically close sources. Repair-and-reverify resolves all blocked answers in the full trace set, often via conservative fallback. In 50 controlled clinical conflation probes, ProvenanceGuard detects all injected attribution swaps with no retained wrong attribution. These results show that source attribution is an independent axis for factuality verification in MCP-based agents.
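
The difference between "supported somewhere" and "attributed to the right source" can be sketched as a per-claim verdict. Everything here is hypothetical: the `Claim` fields, the verdict labels, and the rule that any non-`ok` claim blocks the answer are assumptions for illustration, and the support flag stands in for the NLI and token-alignment checks that the abstract describes.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    stated_source: str  # source ID the answer attributes the claim to
    routed_source: str  # source ID whose evidence the claim was routed to
    supported: bool     # result of an upstream support check (assumed given)


def verdict(claim: Claim) -> str:
    """Per-claim verdict: unsupported, conflated (wrong source), or ok."""
    if not claim.supported:
        return "unsupported"
    if claim.stated_source != claim.routed_source:
        return "conflated"
    return "ok"


def answer_decision(claims):
    """Answer-level decision; blocking on any non-ok claim is an assumption."""
    verdicts = [verdict(c) for c in claims]
    decision = "allow" if all(v == "ok" for v in verdicts) else "block"
    return decision, verdicts
```

A pooled-evidence metric would accept a claim that is supported by the formulary tool but cited to the clinical record; the `conflated` branch is what flags it here.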

2603.17356 2026-06-17 cs.CL 版本更新

PACE-RAG: Patient-Aware Contextual and Evidence-Constrained RAG for Clinical Drug Recommendation

PACE-RAG:面向临床药物推荐的患者感知上下文与证据约束RAG

Chaeyoung Huh, Hyunmin Hwang, Jung Hwan Shin, Sungyang Jo, Jinse Park, Jong Chul Ye

AI总结 提出PACE-RAG框架,通过提取患者特定临床特征、检索相关病例并结合当前症状与用药史,实现个性化药物推荐,在帕金森病和MIMIC-IV数据集上取得最优性能。

Comments 32 pages, 18 figures

详情
AI中文摘要

药物推荐需要深入理解个体患者背景,尤其是帕金森病等复杂疾病。尽管大语言模型拥有广泛的医学知识,但无法捕捉实际处方模式的细微差别。现有的RAG方法也难以应对这些复杂性,因为基于指南的检索仍然过于通用,而相似患者检索往往复制多数模式,未考虑个体患者的独特临床细微差别。为弥合这一差距,我们提出PACE-RAG(患者感知上下文与证据约束RAG)。PACE-RAG并非直接从检索到的患者中复制常用药物,而是首先提取患者特定临床特征,围绕这些特征检索病例,然后利用患者当前症状、活跃用药史和焦点特异性处方倾向来优化最终处方。通过分析针对特定临床特征的治疗模式,PACE-RAG生成患者特定的药物推荐以及可解释的临床总结。在帕金森病队列和MIMIC-IV基准上使用Llama-3.1-8B和Qwen3-8B进行评估,PACE-RAG实现了最先进的性能,F1分数分别达到80.84%和47.22%。这些结果表明PACE-RAG是一个稳健且临床基础扎实的个性化决策支持框架。我们的代码可在以下网址获取:https://github.com/ChaeYoungHuh/PACE-RAG。

英文摘要

Drug recommendation requires a deep understanding of individual patient context, especially for complex conditions like Parkinson's disease. While LLMs possess broad medical knowledge, they fail to capture the subtle nuances of actual prescribing patterns. Existing RAG methods also struggle with these complexities because guideline-based retrieval remains too generic and similar-patient retrieval often replicates majority patterns without accounting for the unique clinical nuances of individual patients. To bridge this gap, we propose PACE-RAG (Patient-Aware Contextual and Evidence-Constrained RAG). Rather than directly copying frequent medications from retrieved patients, PACE-RAG personalizes recommendations by first extracting patient-specific clinical features, retrieving cases around these features, and then refining the final prescription using the patient's current symptoms, active medication history, and focus-specific prescribing tendencies. By analyzing treatment patterns tailored to specific clinical features, PACE-RAG generates patient-specific medication recommendations along with an explainable clinical summary. Evaluated on a Parkinson's cohort and the MIMIC-IV benchmark using Llama-3.1-8B and Qwen3-8B, PACE-RAG achieved state-of-the-art performance, reaching F1 scores of 80.84% and 47.22%, respectively. These results suggest that PACE-RAG is a robust and clinically grounded framework for personalized decision support. Our code is available at: https://github.com/ChaeYoungHuh/PACE-RAG.

2605.08077 2026-06-17 cs.CL 版本更新

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

Conformal Path Reasoning:通过路径级校准实现可信的知识图谱问答

Shuhang Lin, Chuhao Zhou, Xiao Lin, Zihan Dong, Kuan Lu, Zhencan Peng, Jie Yin, Dimitris N. Metaxas

AI总结 提出Conformal Path Reasoning (CPR)框架,通过查询级共形校准和残差共形价值网络(RCVNet)学习判别性路径级非一致性分数,在保证有效覆盖的同时将预测集大小平均减少52%。

Comments 13 pages, 3 figures, 2 tables

详情
AI中文摘要

知识图谱问答(KGQA)提供了基于事实的、可解释的推理,但现有方法通常无法对检索到的答案提供可靠的覆盖保证。虽然共形预测(CP)为生成具有统计保证的预测集提供了原则性框架,但先前的共形KGQA方法存在两个关键缺陷:由于无效校准导致的覆盖保证被违反,以及弱分数判别性导致预测集过大。我们提出Conformal Path Reasoning (CPR),一个基于两个关键创新的新型可信KGQA框架。首先,路径级分数上的查询级共形校准保持可交换性以确保有效的覆盖保证。其次,我们引入残差共形价值网络(RCVNet),这是一个通过PUCT引导探索训练的轻量级模块,用于学习判别性的路径级非一致性分数。大量实验表明,与基准数据集上的共形基线相比,CPR将经验覆盖率平均提高45%,同时将预测集大小平均减少52%,突显了其在知识图谱上进行可靠共形推理的有效性。

英文摘要

Knowledge Graph Question Answering (KGQA) offers grounded, interpretable reasoning, but existing methods often fail to provide reliable coverage guarantees over retrieved answers. While Conformal Prediction (CP) offers a principled framework for producing prediction sets with statistical guarantees, prior conformal KGQA methods suffer from two critical pitfalls: violated coverage guarantees due to invalid calibration, and weak score discriminability that yields excessively large prediction sets. We propose Conformal Path Reasoning (CPR), a novel trustworthy KGQA framework built on two key innovations. First, query-level conformal calibration over path-level scores preserves exchangeability to ensure valid coverage guarantees. Second, we introduce the Residual Conformal Value Network (RCVNet), a lightweight module trained via PUCT-guided exploration to learn discriminative path-level nonconformity scores. Extensive experiments show that CPR significantly improves the Empirical Coverage Rate by 45% while reducing prediction set size by 52% on average over conformal baselines across benchmark datasets, highlighting its effectiveness for reliable conformal reasoning over knowledge graphs.
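
For readers unfamiliar with conformal calibration, the following sketch shows the standard split-conformal recipe that such coverage guarantees rest on. It assumes one nonconformity score per calibration query (for example, the score of that query's correct answer), which is our reading of "query-level" calibration rather than a detail confirmed by the abstract; the learned RCVNet scores are replaced by plain numbers, and the function names are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score.

    cal_scores: one nonconformity score per calibration query (assumption).
    Returns +inf when the calibration set is too small for the target level.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [c for c, s in candidate_scores.items() if s <= threshold]
```

Under exchangeability of calibration and test queries, the resulting set contains the correct answer with probability at least 1 − alpha; sharper scores shrink the set without affecting that guarantee.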

2606.15735 2026-06-17 cs.CL cs.AI 版本更新

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

EHRNote-ChatQA:一个面向纵向出院总结的基于证据的多轮临床问答基准

Jiyoun Kim, Muhan Yeo, Eunhye Jang, Jeewon Yang, Hangyul Yoon, Su Ji Lee, Hee Jo Han, Hee-Jae Jung, Doyun Kwon, Jun young Lee, Jaehun Lee, Jung-Oh Lee, Sunjun Kweon, Jong Hak Moon, Daseul Kim, Minjae Cho, Edward Choi

发表机构 * KAIST(韩国科学技术院) Seoul National University(首尔大学) Seoul National University Bundang Hospital(首尔大学盆唐医院) SAIHST, Sungkyunkwan University(成均馆大学) Yonsei University College of Medicine(延世大学医学院) Gangnam Severance Hospital(江南塞弗伦斯医院) Severance Hospital(塞弗伦斯医院) Seoul Medical Center(首尔医疗中心) Seoul National University Hospital(首尔大学医院) National Cancer Center(国立癌症中心) Icahn School of Medicine at Mount Sinai(西奈山伊坎医学院) Samsung Medical Center(三星医疗中心)

AI总结 提出EHRNote-ChatQA基准,基于MIMIC-IV出院总结构建,包含967个多轮样本和16072个专家验证的QA对,评估LLM在证据支持下的多轮临床问答能力,发现模型在证据定位和多轮错误累积方面存在挑战。

详情
AI中文摘要

出院总结是关键的临床文档,包含患者整个住院期间的背景信息,医疗专家在患者再入院、持续护理和诊断决策中会常规审阅这些文档。在审阅时,医疗专家通常必须迭代地综合多个总结中的信息,同时验证支持每个答案的证据。尽管大型语言模型(LLM)在临床问答中的应用日益增多,但现有基准未能充分反映这一场景:它们通常评估考试式的医学知识,或侧重于单轮问答且证据定位评估有限。我们引入了EHRNote-ChatQA,这是首个针对患者多个出院总结的基于证据的多轮临床问答基准。该基准基于去标识化的MIMIC-IV出院总结构建,包含967个患者级多轮样本,涵盖1到5份笔记,以及16072个经医学专家验证的QA对(8036个内容问题,每个配对有一个证据定位问题),覆盖八个临床类别。基准通过专家指导的流程构建,结合出院总结结构化模式、专家策划的多轮QA模板和基于LLM的生成,随后由11位医学专家对每个QA样本进行审查和修订。对22个开源和闭源LLM的基准测试揭示了若干挑战,包括LLM在证据定位方面比内容回答更困难、多轮错误随轮次累积,以及单轮临床QA性能无法可靠迁移到该场景。这些发现确立了EHRNote-ChatQA作为评估临床QA系统的严格且实用的基准。该数据集将通过PhysioNet凭证访问公开发布。

英文摘要

Discharge summaries are crucial clinical documents containing the context of a patient's overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making. When reviewing them, medical experts often must iteratively synthesize information across multiple summaries while verifying the evidence supporting each answer. Although large language models (LLMs) are increasingly explored for clinical question answering, existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn question answering with limited evidence-grounding evaluation. We introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over patients' multiple discharge summaries. Built from de-identified MIMIC-IV discharge summaries, EHRNote-ChatQA contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 medical-expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories. The benchmark is constructed through an expert-informed pipeline combining discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, followed by review and revision of every single QA sample by 11 medical experts. Benchmarking 22 open- and closed-source LLMs reveals several challenges, including that LLMs struggle more with evidence grounding than content answering, multi-turn errors compound across turns, and single-turn clinical QA performance does not reliably transfer to this setting. These findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be made publicly available through PhysioNet credentialed access.

4. 对话系统与智能体 15 篇

2606.17162 2026-06-17 cs.CL cs.HC cs.MA 新提交

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

MemSlides:一种用于个性化幻灯片生成与多轮局部修订的层次化记忆驱动智能体框架

Ye Jin, Yangyang Xu, Jun Zhu, Yibo Yang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出MemSlides层次化记忆框架,通过分离长期记忆(用户画像和工具记忆)与工作记忆,结合局部修订机制,实现个性化幻灯片生成中的用户偏好保持、多轮修订和可靠局部编辑。

Comments Code, website, project page, and video are linked in the paper

详情
AI中文摘要

个性化演示文稿生成不仅需要基于当前提示或模板的条件生成:智能体必须跨任务保持稳定的用户偏好,在多轮修订中保留新引入的偏好和约束,并可靠地执行局部编辑。我们提出MemSlides,一种用于个性化演示智能体的层次化记忆框架,将长期记忆与工作记忆分离,并进一步将长期记忆分为用户画像记忆和工具记忆。用户画像记忆存储意图条件化的画像,用于第0轮个性化;工作记忆跨修订轮次携带活跃偏好和会话约束;工具记忆存储可重用的执行经验,用于可靠的局部编辑。MemSlides将此记忆设计与有范围的幻灯片局部修订相结合,使得目标更新作用于最小受影响区域,而非重复生成整个演示文稿。在控制实验中,用户画像记忆提高了多人物、多意图画像库上的人物对齐判断;工具记忆注入在诊断性配对设置中改善了闭环修改行为;定性案例展示了工作记忆传递偏好的能力。综合来看,这些结果表明,演示文稿创作中的有效个性化依赖于在生成和局部修订过程中分离持久用户画像、会话级工作记忆和可重用执行经验。

英文摘要

Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn revision, and carry out local edits reliably. We propose MemSlides, a hierarchical memory framework for personalized presentation agents that separates long-term memory from working memory and further divides long-term memory into user profile memory and tool memory. User profile memory stores intent-conditioned profiles for round-0 personalization, working memory carries active preferences and session constraints across revision rounds, and tool memory stores reusable execution experience for reliable localized editing. MemSlides pairs this memory design with scoped slide-local revision, so targeted updates act on the smallest affected region instead of repeatedly regenerating the full deck. In controlled experiments, user profile memory improves persona-alignment judgments on a multi-persona, multi-intent profile bank, tool-memory injection improves closed-loop modify behavior in diagnostic matched-pair settings, and qualitative cases illustrate working memory's ability to carry over preferences. Taken together, these results suggest that effective personalization in presentation authoring depends on separating persistent user profiles, session-level working memory, and reusable execution experience across generation and localized revision.

2606.17174 2026-06-17 cs.CL cs.CY cs.MA 新提交

From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities

从准社会脚本到自主AI智能体社区中的二元持久性

Mohammadsadegh Abolhasani, Hamid Reza Firoozfar, Reza Mousavi, Paul Jen-Hwa Hu

发表机构 * University of Utah(犹他大学) University of Virginia(弗吉尼亚大学)

AI总结 研究自主AI智能体社区中是否存在准社会互动(PSI)线索,通过关键词匹配、少样本LLM标注等方法分析帖子与评论,发现PSI线索与OP再参与及互惠回复结构强相关,并通过二元持久性测试验证了互动层面的PSI脚本与重复二元模式的一致性。

Comments Submitted for review in ARR for EMNLP 2026

详情
AI中文摘要

虽然准社会互动(PSI)和准社会关系(PSR)已在传统媒体环境中得到研究,但我们调查了在双方均为自主AI智能体的在线社区中是否也存在PSI(口语化)关系线索。我们通过三个基于理论的文本指标分析了Moltbook上的4,434篇帖子和50,338条评论:依恋/亲密语言、互惠邀请以及对原始发帖者(OP)的自我认同。基于关键词匹配、少样本大语言模型(LLM)标注和分组上下文LLM标注的方法的综合结果表明,PSI口语化线索普遍存在,并且与OP再参与和互惠回复结构强相关。这些结果在负对照、无效化、聚类标准误重估计和多重检验校正中均稳健。二元持久性测试进一步证实了互惠邀请与持续涉及OP的相互重复模式一致,为将互动层面的PSI脚本与符合PSR的重复二元模式联系起来提供了实证证据。我们将这些证据解释为由LLM驱动的智能体在话语中的行为结构。

英文摘要

While parasocial interactions (PSIs) and parasocial relationships (PSRs) have been studied in conventional media settings, we investigate whether PSI- (colloquial) relational cues also exist in online communities where both sides are autonomous AI agents. We analyze 4,434 posts and 50,338 comments from Moltbook through three theory-based textual indicators: attachment/intimacy language, reciprocity bids, and self-identification to original poster (OP). The combined results across methods based on keyword matching, few-shot large language model (LLM) annotation, and grouped-context LLM annotation reveal that PSI colloquial cues prevail and are strongly associated with OP re-engagement and a reciprocal reply structure. These results are robust across negative controls, nullification, clustered-standard-error re-estimation, and multiple-testing correction. A dyadic persistence test further affirms reciprocity bids aligned with sustained OP-involving mutual recurrence, providing empirical evidence for bridging interaction-level PSI scripts with PSR-consistent repeated dyadic patterns. We interpret the evidence as a behavioral structure in discourse by LLM-enabled agents.

2606.17519 2026-06-17 cs.CL cs.AI 新提交

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

扩展企业智能体路由:退化、诊断与恢复

Kellen Gillespie, Robyn Perry

发表机构 * Superhuman, Inc.(Superhuman公司)

AI总结 研究企业助手工具库扩展时路由准确率下降问题,通过嵌入预选恢复F1分数10-17个百分点。

Comments 10 pages (6 main + 4 appendix), 4 figures, 6 tables

详情
AI中文摘要

生产级LLM助手将用户请求路由到日益增长的专用工具库,但随着目录规模扩大,路由准确率如何退化?我们在一个已部署的企业生产力助手的110个智能体、584个工具的目录上研究单步路由,评估了从10到110个智能体的三种前沿模型。在未充分指定的请求上,路由F1分数跨模型下降16-23个百分点。一个oracle分析将退化分解为检索差距(模型无法找到正确工具)和混淆差距(即使完美检索,oracle上限也下降10pp)。基于嵌入的预选在全部规模下为所有三种模型和两个提供商恢复+10-11pp F1分数。一项生产标注研究(1,435个人工标注话语,三个标注者)确认了在真实流量上的恢复,尽管绝对性能低10-15pp,但恢复幅度为+10-17pp。

英文摘要

Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models from 10 to 110 agents. Routing F1 on under-specified requests drops 16-23 percentage points across models. An oracle analysis decomposes the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops 10pp). Embedding-based shortlisting recovers +10-11pp F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at +10-17pp despite 10-15pp lower absolute performance.
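
Embedding-based shortlisting can be sketched as a cosine-similarity top-k filter that narrows the catalog before the LLM router sees it. This is a generic illustration, not the deployed system: the function name, the use of cosine similarity, and the toy vectors are assumptions, and a real pipeline would obtain the vectors from an embedding model.

```python
import numpy as np


def shortlist(query_vec, tool_vecs, k):
    """Return indices of the k tools most similar to the query (cosine).

    query_vec: shape (dim,); tool_vecs: shape (n_tools, dim).
    Only the shortlisted tools would then be shown to the routing model.
    """
    q = query_vec / np.linalg.norm(query_vec)
    tools = tool_vecs / np.linalg.norm(tool_vecs, axis=1, keepdims=True)
    sims = tools @ q
    return np.argsort(-sims)[:k].tolist()
```

The design trade-off is the one the oracle analysis separates: a small `k` reduces confusion among similar tools but risks dropping the correct tool from the shortlist entirely.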

2606.17542 2026-06-17 cs.CL 新提交

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

评估大型语言模型在会议中的收话人、话轮转换和下一说话人预测能力

Ryo Fukuda, Takatomo Kano, Siddhant Arora, Marc Delcroix, Naohiro Tawara, Atsunori Ogawa, Yuya Chiba, Atsushi Ando, William Chen, Shinji Watanabe

发表机构 * NTT, Inc., Japan(日本电信电话公司) Language Technologies Institute, Carnegie Mellon University, USA(卡内基梅隆大学语言技术研究所)

AI总结 利用大型语言模型(LLMs)研究多模态多人对话中的话轮转换,构建了收话人检测、话轮转换预测和下一说话人预测三个任务的评估框架,实验表明LLMs在下一说话人预测上优于监督模型和人类,但多模态LLM在收话人检测和话轮转换预测上仍低于人类水平。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

我们利用大型语言模型(LLMs)研究多模态多人对话中的话轮转换。我们构建了一个评估框架,涵盖三个任务:收话人检测、话轮转换预测和下一说话人预测。我们比较了针对这些任务训练的监督模型、基于文本的LLMs、多模态LLMs(MM-LLMs)以及人类受试者。在AMI语料库上的实验表明,尽管LLMs未在目标领域进行训练,也无法访问音频或视觉信息,但它们在下一说话人预测方面优于监督模型和人类。多模态LLM在收话人检测和话轮转换预测上表现优于基于文本的LLMs,但仍低于人类表现,表明其难以利用原始音视频信号。消融分析显示,对话上下文至关重要,尤其是对于下一说话人预测。我们观察到人类和LLM的预测模式相似,且话轮转换频繁的区间对两者都具有挑战性。

英文摘要

We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.

2606.17628 2026-06-17 cs.CL 新提交

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

OPD-Evolver:通过在线策略蒸馏培养整体智能体进化器

Guibin Zhang, Xun Xu, Yanwei Yue, Zikun Su, Wangchunshu Zhou, Xiaobin Hu, Shuicheng Yan

发表机构 * LV-NUS Lab(林文实验室) FDU(复旦大学) PKU(北京大学) Bytedance Inc.(字节跳动公司)

AI总结 提出OPD-Evolver框架,通过快慢双循环和在线策略自蒸馏,使智能体学会选择、使用、编写和维护经验,在多个基准上超越现有方法,并让小模型挑战大模型。

详情
AI中文摘要

记忆已成为自我进化智能体的标准基础,但保留经验并不等同于学习如何通过经验进化。现有的记忆智能体可以存储轨迹、检索反思或积累技能,但往往缺乏选择有用经验、据此行动、编写可重用知识以及维护不断增长的存储库的整体能力。我们引入OPD-Evolver,一种快慢双协同进化框架,通过在线策略自蒸馏培养这样的智能体进化器。在快循环中,OPD-Evolver与四级记忆层次结构交互,以读取、使用、编写和维护经验,实现快速测试时进化。在慢循环中,结果校准的记忆归因和特权事后视角将这些四种能力蒸馏到可部署的策略中。在多领域基准测试中,OPD-Evolver超越了诸如ReasoningBank等记忆系统高达11.5%,以及诸如Skill0等基于训练的方法约5.8%。进一步分析表明,OPD-Evolver内化了高价值经验和记忆管理,使OPD-Evolver-9B能够挑战诸如Qwen3.5-397B-A17B和Step-3.5-Flash等大型对手,指向超越记忆增强智能体的真正合格的智能体进化器。

英文摘要

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.

2606.17838 2026-06-17 cs.CL 新提交

Environment-Grounded Automated Prompt Optimization for LLM Game Agents

面向LLM游戏智能体的环境驱动自动提示优化

Rean Clive Fernandes, Lukas Fehring, Theresa Eimer, Marius Lindauer, Matthias Feurer

发表机构 * Lamarr institute for ML and AI(拉马尔机器学习与人工智能研究所) TU Dortmund University(多特蒙德工业大学) Leibniz University Hannover(莱布尼茨汉诺威大学) L3S Research Center(L3S研究中心)

AI总结 提出一种自动提示优化框架,将观察-动作管道分解为描述器和选择器,通过环境回报驱动的进化循环迭代优化提示,在BabyAI任务中显著提升成功率。

详情
AI中文摘要

交互环境中的LLM智能体对其提示高度敏感,但提示工程仍然是手动的、特定于任务的过程。我们为LLM智能体引入了一个自动提示优化框架,该框架将观察-动作管道分解为一个目标条件描述器智能体和一个动作选择智能体,并通过由环境回报引导的LLM驱动进化循环迭代地优化每个模块的提示。我们提出一个行为分析器,将情节结果归因于特定的提示组件,以及一个变异器,对提示提出有针对性的修订,再通过环境推演(rollout)加以验证。我们在BALROG基准测试中的所有五个BabyAI任务上进行了评估,在普通和引导提示初始化下,将我们的管道与BALROG的RobustCoTAgent进行了比较。优化在各任务和条件下一致地提高了性能,无需更新模型权重。在PutNext(一个多步协调任务,RobustCoTAgent的成功率为0%)上,我们的框架使用相同的底层LLM和优化提示达到了高达72.5%的成功率。这些结果表明,多智能体框架结合自动提示优化,无需微调或大量人工监督即可增强LLM。

英文摘要

LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes the observation-to-action pipeline into a goal-conditioned descriptor agent and an action selection agent, and iteratively refines each module's prompt through an LLM-driven evolutionary loop guided by environment returns. We propose a behavior analyzer to attribute episode outcomes to specific prompt components, and a mutator to propose targeted revisions to the prompt, before validating them through environment rollouts. We evaluate on all five BabyAI tasks in the BALROG benchmark, comparing our pipeline against BALROG's RobustCoTAgent under both plain and guided prompt initializations. Optimization improves performance consistently across tasks and conditions, without requiring updates to the model weights. On PutNext, a multi-step coordination task where the RobustCoTAgent achieves 0% success, our framework reaches up to 72.5% success rate using the same underlying LLM with optimized prompts. These results suggest that a multi-agent framework, combined with automatic prompt optimization, enhances LLMs without the need for fine-tuning or extensive human supervision.
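
The evolutionary loop described in this abstract can be pictured as an accept-if-better search over prompts. The sketch below is only an illustration under assumptions: `mutate` and `evaluate` are hypothetical stand-ins for the LLM-driven mutator and the environment-rollout evaluation, and the greedy acceptance rule is an assumed simplification rather than the authors' procedure.

```python
from typing import Callable

def optimize_prompt(
    prompt: str,
    mutate: Callable[[str], str],      # hypothetical: proposes a revised prompt
    evaluate: Callable[[str], float],  # hypothetical: mean return over rollouts
    generations: int = 5,
) -> tuple[str, float]:
    """Greedy evolutionary loop: keep a mutation only if it scores higher."""
    best, best_score = prompt, evaluate(prompt)
    for _ in range(generations):
        candidate = mutate(best)
        score = evaluate(candidate)
        if score > best_score:  # accept only revisions validated by returns
            best, best_score = candidate, score
    return best, best_score

# Toy stand-ins so the sketch runs without an LLM or an environment.
toy_mutate = lambda p: p + " Think step by step."
toy_evaluate = lambda p: min(1.0, 0.2 * p.count("step by step"))
best_prompt, best_return = optimize_prompt(
    "Pick up the key.", toy_mutate, toy_evaluate, generations=3
)
```

In a real setting the evaluation would average returns over several episodes, since a single rollout is a noisy signal for accepting or rejecting a prompt revision.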

2606.18051 2026-06-17 cs.CL 新提交

Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose

面向LLM智能体的组合技能路由:分解、检索与组合

Xueping Gao

发表机构 * Alibaba Cloud(阿里云)

AI总结 提出SkillWeaver框架,通过迭代技能感知分解(SAD)将复杂查询分解为原子子任务,检索对应技能并组合为可执行计划,在CompSkillBench上验证了分解质量是主要瓶颈,SAD将分解准确率从51.0%提高到67.7%(相对提升32.7%)。

详情
AI中文摘要

LLM智能体越来越依赖外部技能(可复用的工具规范),但现实任务通常需要组合多种技能,而不仅仅是选择一种。我们将此形式化为组合技能路由问题:给定一个复杂的用户查询和一个大型技能库,将查询分解为原子子任务,为每个子任务检索适当的技能,并组合成一个可执行的计划。我们提出了SkillWeaver,一个分解-检索-组合框架,结合了LLM任务分解器、带有FAISS索引的双编码器技能检索器以及依赖感知的DAG规划器。为了支持评估,我们引入了CompSkillBench,这是一个包含300个组合查询的基准测试,覆盖来自公共MCP生态系统的2,209个真实MCP服务器技能,跨越24个功能类别。我们的实验表明,任务分解质量是主要瓶颈:标准LLM分解在步骤级别仅达到34.2%的类别召回率。为了解决这个问题,我们提出了迭代技能感知分解(SAD),这是一种检索增强的反馈循环,能够迭代地将分解与可用技能对齐。SAD在单次迭代中将分解准确率从51.0%提高到67.7%(+32.7%,Wilcoxon p < 10^-6);DA条件分析证实,正确的粒度是有效检索的前提(当DA=1时,CatR@1从34%上升到41%)。SkillWeaver将上下文窗口消耗减少了99%以上,迁移实验证实了泛化能力(即使目标类别不在检索池中,相对DA增益仍达到+35.6%)。

英文摘要

LLM agents increasingly rely on external skills -- reusable tool specifications -- but real-world tasks often require composing multiple skills, not just selecting one. We formalize this as the Compositional Skill Routing problem: given a complex user query and a large skill library, decompose the query into atomic sub-tasks, retrieve the appropriate skill for each sub-task, and compose an executable plan. We present SkillWeaver, a decompose-retrieve-compose framework combining an LLM task decomposer, a bi-encoder skill retriever with FAISS indexing, and a dependency-aware DAG planner. To support evaluation, we introduce CompSkillBench, a benchmark of 300 compositional queries over 2,209 real MCP server skills spanning 24 functional categories, sourced from the public MCP ecosystem. Our experiments reveal that task decomposition quality is the primary bottleneck: standard LLM decomposition reaches only 34.2% category recall at the step level. To address this, we propose Iterative Skill-Aware Decomposition (SAD), a retrieval-augmented feedback loop that iteratively aligns decomposition with available skills. SAD improves decomposition accuracy from 51.0% to 67.7% (+32.7%, Wilcoxon p < 10^-6) in a single iteration; DA-conditioned analysis confirms that correct granularity is the prerequisite for effective retrieval (CatR@1 rises from 34% to 41% when DA=1). SkillWeaver reduces context window consumption by over 99%, and transfer experiments confirm generalization (+35.6% relative DA gain even when target categories are absent from the retrieval pool).
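
The "compose" step, a dependency-aware DAG planner, can be illustrated with a topological sort over sub-tasks. This is a minimal sketch under assumptions: the sub-task names and the dependency map are hypothetical, and the standard-library `graphlib.TopologicalSorter` stands in for whatever planner SkillWeaver actually uses.

```python
from graphlib import TopologicalSorter

def plan(dependencies: dict[str, set[str]]) -> list[str]:
    """Order sub-tasks so that each one runs after everything it depends on.

    dependencies maps a sub-task to the set of sub-tasks it needs first.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical decomposition of one compositional query.
deps = {
    "fetch_issue": set(),
    "summarize": {"fetch_issue"},
    "post_to_chat": {"summarize"},
}
order = plan(deps)
```

Sub-tasks with no dependency between them may appear in any relative order, so a planner built this way could also dispatch them in parallel.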

2606.17645 2026-06-17 cs.AI cs.CL cs.LG 交叉投稿

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

超越领域:通过可迁移交互模式重用网络技能

Shiqi He, Yue Cui, Feijie Wu, Xinyu Ma, Jiaheng Lu, Yaliang Li, Bolin Ding, Mosharaf Chowdhury

发表机构 * University of Michigan(密歇根大学) Alibaba Group(阿里巴巴集团) Purdue University(普渡大学) McMaster University(麦克马斯特大学) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出SkillMigrator代理,通过学习可迁移交互模式(TIP)匹配布局结构而非元素引用,实现跨站点技能重用,在WebArena和Mind2Web上成功轨迹的LLM动作数减少8-10%。

详情
AI中文摘要

大型语言模型(LLM)网络代理通常被部署为工具调用者:每轮,模型读取新的页面观察并发出一个结构化工具动作。当每个动作都是低级原语时,任务步数(horizon)迅速增长,面向策略的LLM补全次数也随之增加,在Mind2Web和WebArena等基准测试中主导了延迟和成本。因此,最近的系统将重复的交互片段包装为网络技能:从成功轨迹或诱导程序中构建的可调用工具,这样一次调用可以替代多个原语。然而,先前的技能库仍然主要通过指令相似性或粗略的站点元数据触发,这导致在未见站点上技能重用率低,许多潜在的步数和令牌缩减空间未被利用。我们提出了SkillMigrator,一个学习可重用网络技能并通过匹配布局结构而非特定元素引用来跨站点迁移它们的代理。每个诱导技能被存储为可迁移交互模式(TIP):技能与诱导时快照的结构草图配对。在测试时,SkillMigrator通过布局相似性检索TIP,并将其引用锚定到实时页面。其余堆栈是标准的:具有稳定引用的可访问性快照观察,以及基于原语加技能调用的固定工具调用。与最先进的方法相比,SkillMigrator在成功率持平的情况下,将WebArena和Mind2Web上成功轨迹的平均LLM动作数减少了8-10%。

英文摘要

Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost on benchmarks such as Mind2Web and WebArena. Recent systems therefore wrap repeated interaction fragments as web skills: callable tools built from successful trajectories or induced programs, so one call can replace several primitives. However, prior skill libraries are still triggered mainly by instruction similarity or coarse site metadata, which yields low skill reuse on held-out sites and leaves much of the potential step and token reduction on the table. We present SkillMigrator, an agent that learns reusable web skills and transfers them across sites by matching layout structure rather than specific element references. Each induced skill is stored as a transferable interaction pattern (TIP): the skill paired with a structural sketch of the snapshot at induction time. At test time, SkillMigrator retrieves TIPs by layout similarity and grounds their references on the live page. The rest of the stack is standard: accessibility-snapshot observations with stable references, and fixed tool calling over primitives plus skill invocations. Compared with the state-of-the-art approaches, SkillMigrator reduces the average LLM-action count on successful trajectories by 8-10% across both WebArena and Mind2Web at matched success rate.
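
Retrieving a stored skill by layout similarity, rather than by instruction text, might look like the following. This is a sketch under stated assumptions: representing a structural sketch as a set of element roles, scoring with Jaccard overlap, and the `min_similarity` threshold are all hypothetical choices, not the similarity measure used by SkillMigrator.

```python
def layout_similarity(sketch_a: set[str], sketch_b: set[str]) -> float:
    """Jaccard overlap between two structural sketches (sets of layout features)."""
    if not sketch_a and not sketch_b:
        return 0.0
    return len(sketch_a & sketch_b) / len(sketch_a | sketch_b)

def retrieve_tip(page_sketch: set[str], tips: dict[str, set[str]],
                 min_similarity: float = 0.5):
    """Return the best-matching stored skill name, or None if nothing is close."""
    best_name, best_sim = None, min_similarity
    for name, sketch in tips.items():
        sim = layout_similarity(page_sketch, sketch)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name

# Hypothetical skill library: skill name -> sketch captured at induction time.
tips = {
    "search_and_open_first_result": {"form", "textbox", "button", "list"},
    "login": {"form", "textbox", "password", "button"},
}
live_page = {"form", "textbox", "button", "list", "link"}
match = retrieve_tip(live_page, tips)
```

Returning `None` below the threshold lets the agent fall back to primitive actions when no stored pattern resembles the live page.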

2606.17680 2026-06-17 cs.LG cs.CL 交叉投稿

EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

EnvRL: 在智能体强化学习中从环境动力学中学习

Zhitong Wang, Songze Li, Hao Peng, Shuzheng Si, Yi Wang, Maosong Sun, Juanzi Li

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出EnvRL框架,通过状态预测和逆动力学两个辅助目标,将环境动力学学习融入智能体强化学习,在长周期任务中显著提升成功率。

详情
AI中文摘要

强化学习已成为训练大型语言模型作为智能体的强大范式。然而,针对长周期智能体任务的常规强化学习方法往往难以处理稀疏的结果奖励。直观上,这忽略了采样(rollout)得到的交互轨迹中包含的丰富环境动力学信息。我们认为交互体验本身固有地充当隐式监督信号,揭示了环境的潜在转换机制,并使智能体能够构建更准确的环境内部模型。因此,在这项工作中,我们研究了如何利用这一额外信号来改进策略学习。具体来说,我们提出了EnvRL,一个通过两个辅助目标(状态预测和逆动力学)将环境动力学学习融入智能体强化学习的框架。通过与主要强化学习目标联合优化,我们鼓励智能体从其自身的交互体验中内化环境动力学。在两个长周期智能体基准上的大量实验表明,EnvRL在成功率上比仅使用强化学习的基线有显著提升,例如,当使用GRPO训练时,在ALFWorld上将Qwen-2.5-1.5B-Instruct从72.8%提升到77.4%,在WebShop上从56.8%提升到67.0%。

英文摘要

Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to construct a more accurate internal model of the environment. Therefore, in this work, we investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By jointly optimizing with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements in success rate over RL-only baselines, e.g., when trained with GRPO, lifting Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld, and from 56.8% to 67.0% on WebShop.
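
The two auxiliary objectives named here have standard shapes: state prediction maps (state, action) to the next state, and inverse dynamics maps (state, next state) to the action taken. The sketch below shows how such training pairs could be cut from one rollout; the trajectory format and the function name are assumptions for illustration, not EnvRL's data pipeline.

```python
def auxiliary_examples(trajectory):
    """Build auxiliary (input, target) pairs from one rollout.

    trajectory: ordered list of (state, action) steps; the final step's
    action is unused because no successor state follows it.
    Returns (state_prediction, inverse_dynamics).
    """
    state_prediction, inverse_dynamics = [], []
    for (state, action), (next_state, _) in zip(trajectory, trajectory[1:]):
        state_prediction.append(((state, action), next_state))  # predict s_{t+1}
        inverse_dynamics.append(((state, next_state), action))  # recover a_t
    return state_prediction, inverse_dynamics

# Hypothetical text-environment rollout.
rollout = [
    ("in kitchen", "open fridge"),
    ("fridge open", "take apple"),
    ("holding apple", None),
]
sp, inv = auxiliary_examples(rollout)
```

Both kinds of pairs come for free from trajectories the RL loop already collects, which is why they can serve as a dense signal next to a sparse outcome reward.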

2606.17786 2026-06-17 cs.HC cs.CL 交叉投稿

Toward Accessible Psychotherapy Training Using AI-Driven Interactive Patient Avatars

利用AI驱动的交互式患者化身实现可及的心理治疗培训

Pascal Riachi, Sofie Kamber, Stella Brogna, Andrew Gloster, Rafael Wampfler

发表机构 * ETH Zurich(苏黎世联邦理工学院) University of Lucerne(卢塞恩大学)

AI总结 提出一个通过具身虚拟患者进行对话训练ACT心理治疗师的系统,利用大语言模型模拟患者行为,并提供基于ACT保真度标准的逐轮反馈,专家评估证实了高真实性和培训效果。

详情
Journal ref
2026 IEEE 14th International Conference on Healthcare Informatics (ICHI), Minneapolis, MN, June 1-3, 2026, pp. 990-995
AI中文摘要

培训心理治疗师掌握诸如接纳与承诺疗法(ACT)等循证干预措施需要反复练习并伴有有意义的反馈,然而安全、标准化的培训机会受到伦理、后勤和资源限制。我们引入一个系统,旨在通过与具身虚拟患者的语音对话支持ACT导向的心理治疗培训。该系统使用大语言模型模拟患者行为,其行为基于真实治疗会话中提取的档案和可配置的临床场景,同时一个独立的自动评估器根据既定的ACT保真度标准为治疗师的回应提供逐轮反馈。该系统并非旨在取代督导,而是通过支持在低风险环境中进行实验、反思和即时反馈来促进刻意练习。执业心理学家的专家评估证实了患者行为的高度真实性,并表明即时的逐轮ACT反馈提高了治疗师对干预选择的意识,并使他们能够有效尝试替代回应。对49份治疗记录的定量评估确定GPT-4o-mini为最佳反馈模型,在复制人类督导的ACT保真度评分时实现了最低的平均绝对误差(MAE = 6.12),且具有统计显著性的一致性。这项工作展示了保真度感知的模拟患者作为心理治疗培训的可扩展补充的潜力。

英文摘要

Training psychotherapists in evidence-based interventions such as Acceptance and Commitment Therapy (ACT) requires repeated practice with meaningful feedback, yet opportunities for safe, standardized training are limited by ethical, logistical, and resource constraints. We introduce a system designed to support ACT-oriented psychotherapy training through spoken dialogue with an embodied virtual patient. The system uses large language models to simulate patient behavior conditioned on profiles derived from real therapy sessions and configurable clinical scenarios, while a separate automated evaluator provides turn-by-turn feedback on therapist responses based on established ACT fidelity criteria. Rather than aiming to replace supervision, the system is intended to support deliberate practice by enabling experimentation, reflection, and immediate feedback in low-risk settings. Expert evaluation with practicing psychologists confirmed high realism in patient behavior and demonstrated that immediate turn-by-turn ACT feedback increased therapists' awareness of intervention choices and enabled effective experimentation with alternative responses. Quantitative evaluation across 49 therapy transcripts identified GPT-4o-mini as the optimal feedback model, achieving the lowest mean absolute error (MAE = 6.12) in replicating human supervisor ACT fidelity ratings with statistically significant agreement. This work demonstrates the potential of fidelity-aware simulated patients as a scalable complement to psychotherapy training.
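
Mean absolute error, the metric used to pick the feedback model, is the average absolute gap between model ratings and human supervisor ratings. A minimal sketch follows; the rating values are made up for illustration and are not data from the paper.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired ratings."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)

# Hypothetical fidelity ratings for three transcripts.
model_ratings = [70, 55, 90]
supervisor_ratings = [64, 60, 83]
mae = mean_absolute_error(model_ratings, supervisor_ratings)
```

A lower MAE means the automated ratings track the supervisor's more closely, but the value is only interpretable relative to the rating scale in use.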

2504.03991 2026-06-17 cs.CL cs.AI cs.HC cs.MA 版本更新

Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

利用大型语言模型实现多样化类人团队协作与通信的算法化提示生成

Siddharth Srikanth, Varun Bhatt, Boshen Zhang, Werner Hager, Charles Michael Lewis, Katia P. Sycara, Aaquib Tabrez, Stefanos Nikolaidis

发表机构 * Thomas Lord Department of Computer Science, University of Southern California(美国南加州大学汤姆·劳德计算机科学系) School of Computing and Information, University of Pittsburgh(美国匹兹堡大学计算与信息学院) Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所) Sibley School of Mechanical and Aerospace Engineering, Cornell University(康奈尔大学西伯利机械与航空航天工程学院)

AI总结 结合质量多样性优化与LLM代理,自动搜索生成多样化团队行为的提示,捕获人类协作与通信策略,并通过用户研究验证其类人性。

详情
AI中文摘要

理解人类如何在团队中协作和通信对于改善人-代理团队协作和AI辅助决策至关重要。然而,由于后勤、伦理和实际限制,仅依赖大规模用户研究的数据是不切实际的,因此需要多种多样化人类行为的合成模型。最近,基于大型语言模型(LLM)的代理已被证明能够在社交环境中模拟类人行为。但是,获得大量多样化行为需要手动设计提示。另一方面,质量多样性(QD)优化已被证明能够生成多样化的强化学习(RL)代理行为。在这项工作中,我们将QD优化与LLM驱动的代理相结合,以迭代搜索在长时域、多步骤协作环境中生成多样化团队行为的提示。我们首先通过一项人类受试者实验表明,人类在该领域中表现出多样化的协调和通信行为。然后,我们进行一系列实验,表明我们的方法捕获了在没有大规模数据收集的情况下难以观察到的行为,并通过后续用户研究表明这些生成的行为是类人的。我们的发现凸显了QD与LLM驱动代理的结合作为研究多代理协作中团队协作和通信策略的有效工具。

英文摘要

Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. However, obtaining a large set of diverse behaviors requires manually designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment, that humans exhibit diverse coordination and communication behavior in this domain. We then present a series of experiments showing that our approach captures behaviors that are difficult to observe without large-scale data collection, and a follow-up user study to show that these generated behaviors are human-like. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.

2504.11837 2026-06-17 cs.CL cs.AI 版本更新

EmoFSM: A Finite State Machine for Emotional Support Conversation

EmoFSM:一种用于情感支持对话的有限状态机

Yue Zhao, Qingqing Gu, Xiaoyu Wang, Teng Chen, Zhonglin Jiang, Yong Chen, Hongyan Li, Luo Ji

AI总结 针对情感支持对话中长期满意度不足的问题,提出EmoFSM框架,利用有限状态机引导大语言模型进行规划与自我推理,在多个数据集上优于多种基线方法。

Comments 15 pages, 4 figures. PAKDD 2026

详情
AI中文摘要

情感支持对话(ESC)旨在通过有效对话缓解人们的情感困扰。尽管大语言模型在ESC方面取得了显著进展,但大多数研究可能未从状态模型角度定义图,从而为长期满意度提供了次优解决方案。为解决此问题,我们利用有限状态机在LLM上提出名为EmoFSM的框架。我们的框架允许单个LLM在ESC期间引导规划,并在每个对话轮次中自我推理求助者的情绪、支持策略以及最终回应。在ESC数据集上的大量实验表明,EmoFSM优于许多基线方法,包括直接推理、自我精炼(self-refine)、思维链、微调和外部支持方法,甚至那些参数更多的模型。

英文摘要

Emotional support conversation (ESC) aims to alleviate people's emotional distress through effective conversations. Although large language models (LLMs) have made remarkable progress in ESC, most of these studies may not define the diagram from a state-model perspective, thereby providing a suboptimal solution for long-term satisfaction. To address this issue, we leverage the Finite State Machine (FSM) on LLMs, and propose a framework called EmoFSM. Our framework allows a single LLM to bootstrap the planning during ESC, and self-reason the seeker's emotion, support strategy, and the final response upon each conversation turn. Substantial experiments on ESC datasets suggest that EmoFSM outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and externally supported methods, even those with many more parameters.

2606.15903 2026-06-17 cs.CL cs.AI 版本更新

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

控制平面放置塑造遗忘:跨十三种系统配置的智能体记忆架构研究

Dongxu Yang

发表机构 * DeepLethe

AI总结 研究LLM在智能体记忆管道中的位置(控制平面 vs 召回平面)对遗忘失败模式的影响,通过13种配置在385例对抗测试集上的实验,揭示了三种放置机制的互补覆盖范围,并提出了ForgetEval评估套件。

Comments 25 pages including appendices. Code, benchmark, and adapters released under MIT at https://github.com/deeplethe/lethe

详情
AI中文摘要

LLM在智能体记忆管道中的位置——位于检索存储事实(广泛基准测试)的召回平面和通过替换、释放、清除来改变事实(基本未经测试)的控制平面之间——决定了系统能够恢复哪些遗忘失败模式。通过在385例对抗测试集上比较十三种系统配置,我们观察到三种具有部分互补覆盖范围的放置机制:确定性原语足以处理词汇/时间类别,但无法处理规范化(标识符混淆上5%,跨语言上0%);写入时LLM可以恢复规范化(100%),但无法处理意图感知删除(前缀冲突和复合事实为0%);变异时钩子可以恢复意图感知删除(78-85%),并同时提升几乎所有类别的性能(整体91.7-93.2%,每385例运行成本0.17美元,每例变异延迟2.3秒,而确定性方法为每例64-191毫秒,召回路径不变)。我们通过ForgetEval揭示了这种权衡,ForgetEval包含1000例模板化套件和385例对抗层(132例手工制作+253例LLM生成并经预言机验证),通过确定性子串匹配评分,并配有一个六方法适配器协议,采用诚实的N/A评分,允许异构记忆存储以130行代码接入。用例的准入由10名标注者的标注者间一致性(IAA,Fleiss' kappa = 0.958)和一个77例的外部作者子集(四位盲贡献者)加以佐证,该子集复现了规范化不对称性并放大了联合放置的提升(+27.8个百分点)。生产环境中的失败主要是遗忘失败而非召回失败,但现有基准仅衡量召回。ForgetEval和所有适配器均以MIT许可发布。

英文摘要

Where an LLM sits in an agent memory pipeline -- between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested) -- shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage: deterministic primitives suffice for lexical/temporal categories but fail canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact); a mutation-time hook recovers intent-aware deletion (78-85%) and brightens nearly all categories simultaneously (91.7-93.2% overall, $0.17 per 385-case run, 2.3s/case mutation latency vs. 64-191ms/case deterministic, recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted oracle-validated) scored by deterministic substring match, paired with a six-method Adapter Protocol with honest N/A scoring that lets heterogeneous memory stores enter in 130 lines. Admission is corroborated by 10-annotator IAA (Fleiss' kappa = 0.958) and a 77-case external-authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under MIT.
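
Scoring "by deterministic substring match" can be pictured as checking that a forgotten fact no longer surfaces while the replacement fact still does. The sketch below is an assumed reading of that idea: the function name, the case-insensitive comparison, and the forbidden/required split are hypothetical and are not taken from the ForgetEval code.

```python
from collections.abc import Iterable

def forgot_correctly(response: str,
                     forbidden: Iterable[str],
                     required: Iterable[str] = ()) -> bool:
    """Pass only if no forbidden substring appears and every required one does."""
    text = response.casefold()
    no_leak = all(s.casefold() not in text for s in forbidden)
    kept = all(s.casefold() in text for s in required)
    return no_leak and kept

# Hypothetical supersede case: the old address must be gone, the new one present.
ok = forgot_correctly("Your address is 5 Elm St.", ["12 Oak Ave"], ["5 Elm St"])
leak = forgot_correctly("You used to live at 12 Oak Ave.", ["12 oak ave"])
```

A check like this needs no judge model, which makes scores reproducible; the trade-off is that paraphrased or obfuscated leaks can slip past a plain substring test.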

2606.16591 2026-06-17 cs.CL 版本更新

SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents

SING: 用于LLM代理中可扩展主动工具发现的合成意图图

Qiao Xiao, Haochen Shi, Yisen Gao, Wenbin Hu, Huihao Jing, Tianshi Zheng, Baixuan Xu, Ziheng Zhang, Weiqi Wang, Haoran Li, Jiaxin Bai, Yangqiu Song

发表机构 * Cornell University(康奈尔大学) The Hong Kong University of Science and Technology(香港科技大学) The Ohio State University(俄亥俄州立大学) Hong Kong Baptist University(香港浸会大学)

AI总结 提出SING框架,通过构建意图-工具图并动态检索工具,在长周期任务中提升工具发现准确率,Global Recall@5最高提高59.8%,下游成功率最高提高28.9%。

详情
AI中文摘要

大型语言模型(LLM)代理越来越依赖管理上下文、工具和多轮执行的代理框架,使工具成为在真实数字环境中行动的核心接口。随着框架连接的工具生态系统扩展到数百或数千个API、服务和任务特定技能,穷举工具模式注入变得昂贵,并施加了封闭世界假设,将代理限制在预定义的静态库存中。检索增强的工具选择提供了一种自然的替代方案,但现有的一次性检索方法通常无法将孤立的工具描述与代理的真实任务意图对齐,特别是在所需能力随分解、观察和新诱导的子目标逐步显现的长周期任务中。我们提出SING,一种意图感知的主动工具发现框架,它构建了一个连接用户意图、工具能力和工具协作模式的意图-工具图,并根据不断变化的任务状态动态检索工具。使用包含7,471个工具的统一语料库,我们在三个真实世界的工具使用基准上评估了SING。与基线相比,SING将全局Recall@5最多提高59.8%,下游成功率最多提高28.9%,同时将全语料库工具模式暴露减少了99.8%,表明意图感知的图结构能够在大规模代理生态系统中实现更准确和上下文高效的工具发现。

英文摘要

Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, making tools a central interface for acting in realistic digital environments. As harness-connected tool ecosystems expand to hundreds or thousands of APIs, services, and task-specific skills, exhaustive tool schema injection becomes costly and imposes a closed-world assumption that limits agents to a predefined static inventory. Retrieval-augmented tool selection offers a natural alternative, but existing one-shot retrieval methods often fail to align isolated tool descriptions with the agent's true task intention, especially in long-horizon tasks where required capabilities emerge through decomposition, observations, and newly induced subgoals. We propose SING, an intention-aware active tool discovery framework that builds an intention-tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and dynamically retrieves tools according to evolving task states. Using a unified corpus of 7,471 tools, we evaluate SING on three real-world tool-use benchmarks. SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines, while reducing full-corpus tool-schema exposure by 99.8%, demonstrating that intention-aware graph structure enables more accurate and context-efficient tool discovery in large-scale agentic ecosystems.
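
Recall@5, the retrieval metric quoted here, is the fraction of the tools a task actually needs that appear among the top five retrieved tools. A minimal sketch, with made-up tool names and without the paper's "Global" aggregation across tasks (which the abstract does not define):

```python
def recall_at_k(ranked_tools, relevant_tools, k=5):
    """Share of relevant tools found in the top-k of a ranked retrieval list."""
    relevant = set(relevant_tools)
    if not relevant:
        raise ValueError("relevant_tools must not be empty")
    return len(set(ranked_tools[:k]) & relevant) / len(relevant)

# Hypothetical ranking for one task that needs three tools.
ranked = ["calendar", "email", "maps", "weather", "flights", "hotels"]
needed = ["email", "flights", "hotels"]
score = recall_at_k(ranked, needed, k=5)
```

Here two of the three needed tools land in the top five, so the score is 2/3; the sixth-ranked tool does not count.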

2510.19838 2026-06-17 cs.AI cs.CL cs.LG 版本更新

Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory

Branch-and-Browse:具有树状推理与动作记忆的高效可控网页探索

Shiqi He, Yue Cui, Xinyu Ma, Yaliang Li, Bolin Ding, Mosharaf Chowdhury

AI总结 提出Branch-and-Browse框架,通过树状结构化推理、网页状态重放和页面动作记忆,实现LLM网页代理的高效可控多分支探索,在WebArena上成功率35.8%,执行时间最多降低40.4%。

详情
AI中文摘要

由大型语言模型(LLM)驱动的自主网页代理在执行目标导向任务(如信息检索、报告生成和在线交易)方面展现出强大潜力。这些代理标志着向开放网络环境中实用具身推理的关键一步。然而,现有方法在推理深度和效率方面仍然受限:简单的线性方法无法进行多步推理且缺乏有效的回溯,而其他搜索策略则粗粒度且计算成本高。我们引入了Branch-and-Browse,一个细粒度的网页代理框架,它统一了结构化推理-行动、上下文记忆和高效执行。它(i)采用显式子任务管理与树状结构化探索,实现可控的多分支推理;(ii)通过高效的网页状态重放与后台推理引导探索;(iii)利用页面动作记忆在会话内和跨会话间共享已探索的动作。在WebArena基准测试中,Branch-and-Browse的任务成功率达到35.8%,相对于最先进的方法执行时间减少高达40.4%。这些结果表明,Branch-and-Browse是一个可靠且高效的基于LLM的网页代理框架。

英文摘要

Autonomous web agents powered by large language models (LLMs) show strong potential for performing goal-oriented tasks such as information retrieval, report generation, and online transactions. These agents mark a key step toward practical embodied reasoning in open web environments. However, existing approaches remain limited in reasoning depth and efficiency: vanilla linear methods fail at multi-step reasoning and lack effective backtracking, while other search strategies are coarse-grained and computationally costly. We introduce Branch-and-Browse, a fine-grained web agent framework that unifies structured reasoning-acting, contextual memory, and efficient execution. It (i) employs explicit subtask management with tree-structured exploration for controllable multi-branch reasoning, (ii) bootstraps exploration through efficient web state replay with background reasoning, and (iii) leverages a page action memory to share explored actions within and across sessions. On the WebArena benchmark, Branch-and-Browse achieves a task success rate of 35.8% and reduces execution time by up to 40.4% relative to state-of-the-art methods. These results demonstrate that Branch-and-Browse is a reliable and efficient framework for LLM-based web agents.

5. 文本生成、摘要与编辑 5 篇

2606.17175 2026-06-17 cs.CL 新提交

Self-Generated Error Training for Token Editing in Diffusion Language Models

扩散语言模型中令牌编辑的自生成错误训练

Lin Yao

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学与技术学院) Zhongguancun Academy(中关村学院)

AI总结 针对LLaDA2.1中令牌编辑的训练-推理不匹配问题,提出自生成T2T方法,通过无梯度草稿前向计算和自生成错误监督,提升编辑准确率并降低编辑强度。

详情
AI中文摘要

令牌到令牌(T2T)编辑允许LLaDA2.1在块扩散解码过程中修正已提交的令牌。已发布的配方在随机词汇损坏上训练该编辑器,但在推理时,编辑器看到的是模型自身流畅、高置信度的草稿错误。我们研究了这种训练-推理不匹配,并提出了自生成T2T,该方法先执行一次无梯度的草稿前向计算,用预测的令牌填充掩码位置,再在第二次前向计算中针对这些自生成损坏监督恢复。我们将更新实现为LLaDA2.1-mini上的一轮简短LoRA持续预训练,并在官方Q-Mode T2T程序下使用不变的推理参数在多个基准上进行评估。该方法通常提高准确率,同时降低T2T编辑强度,缓解了诸如在推理正确后出现末位数字转录错误以及在简短事实答案前过度自我纠正等失败模式。

英文摘要

Token-to-token (T2T) editing lets LLaDA2.1 revise committed tokens during block-diffusion decoding. The released recipe trains this editor on random vocabulary corruptions, but at inference the editor sees the model's own fluent, high-confidence draft errors instead. We study this training-inference mismatch and propose self-generated T2T, which performs a no-gradient draft pass, fills masked positions with predicted tokens, and supervises recovery in a second pass under these self-generated corruptions. We implement the update as a short LoRA continued-pretraining pass on LLaDA2.1-mini and evaluate on several benchmarks under the official Q-Mode T2T procedure with unchanged inference parameters. The method generally improves accuracy while reducing T2T edit intensity, mitigating failure modes such as final-digit transcription errors after otherwise correct reasoning and excessive self-correction before short factual answers.
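
The two-pass recipe can be sketched on plain token lists: a draft pass fills masked positions with the model's own predictions, and the positions where that draft is wrong become the recovery targets for the second pass. Everything below is a toy under assumptions: `draft_predict` is a hypothetical stand-in for the no-gradient model call, and supervising only the wrongly drafted positions is an assumed simplification.

```python
MASK = "[MASK]"

def self_generated_corruption(clean, mask_positions, draft_predict):
    """Return (corrupted_sequence, recovery_targets) for one training example.

    clean: list of ground-truth tokens.
    mask_positions: set of indices to mask before the draft pass.
    draft_predict: hypothetical callable mapping a masked token list to a
        fully filled token list (the no-gradient draft pass).
    """
    masked = [MASK if i in mask_positions else tok for i, tok in enumerate(clean)]
    draft = draft_predict(masked)
    # Keep unmasked tokens as they were; use the model's own guess elsewhere.
    corrupted = [draft[i] if i in mask_positions else clean[i]
                 for i in range(len(clean))]
    # Second-pass supervision: recover the clean token where the draft erred.
    targets = {i: clean[i] for i in mask_positions if corrupted[i] != clean[i]}
    return corrupted, targets

clean = ["2", "+", "2", "=", "4"]
toy_draft = lambda masked: ["5" if tok == MASK else tok for tok in masked]
corrupted, targets = self_generated_corruption(clean, {4}, toy_draft)
```

The contrast with random vocabulary corruption is that the wrong token here is one the model itself found plausible, which is the kind of error the editor meets at inference time.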

2606.17350 2026-06-17 cs.CL cs.AI 新提交

Do Large Language Models Always Tell The Same Stories?

大型语言模型总是讲述相同的故事吗?

Thennal DK, Hans Ole Hatzel

发表机构 * University of Hamburg(汉堡大学)

AI总结 通过对比框架和人类故事数据集,研究10种LLM生成故事的叙事相似性,发现LLM故事比人类故事更相似,前沿模型趋向于“平均”通用叙事,且常见缓解策略无效。

详情
AI中文摘要

大型语言模型(LLMs)的最新进展使得生成高质量散文成为可能,但这些模型是否能够生成多样化的输出仍然存在争议。在这项工作中,我们通过叙事相似性框架研究了LLM生成故事的多样性。使用对比框架和来自r/WritingPrompts的人类编写故事和提示数据集,我们收集了10个代表性LLM的叙事相似性判断,同时利用人类评估和三种不同的自动注释方法。我们的发现揭示了一个一致的趋势:LLM生成的叙事彼此之间始终比人类编写的故事更相似。我们证明,特别是前沿模型收敛于一种“平均”通用叙事,这种叙事近似于个体人类故事,但缺乏人类作者的整体多样性。最后,我们表明常见的缓解策略,包括负提示和温度缩放,未能有效解决这种同质性。

英文摘要

Recent advances in large language models (LLMs) have enabled the generation of high-quality prose, yet the question of whether these models are capable of generating diverse outputs remains contested. In this work, we investigate the diversity of LLM-generated stories through the framework of narrative similarity. Using a contrastive framework and a dataset of human-written stories and prompts from r/WritingPrompts, we collect narrative similarity judgments across 10 representative LLMs, utilizing both human evaluations and three different automatic annotation methods. Our findings reveal a consistent trend: LLM-generated narratives are consistently more similar to each other than human-written stories are. We demonstrate that frontier models in particular converge on a "mean" generic narrative that approximates individual human stories but lacks the collective diversity of human authors. Finally, we show that common mitigation strategies, including negative prompting and temperature scaling, fail to meaningfully address this homogeneity.

2606.17683 2026-06-17 cs.CL cs.PL 新提交

Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation

弥合基于LLM的代码翻译中的功能正确性与运行时效率差距

Longhui Zhang, Jiahao Wang, Chenhao Hu, Bingyu Liang, Jing Li, Min Zhang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳))

AI总结 提出SwiftTrans框架,通过多视角探索和差异感知选择两阶段,利用并行上下文学习和差异比较,同时提升LLM代码翻译的正确性和运行时效率。

Comments Accepted to ICML 2026

详情
AI中文摘要

尽管大型语言模型(LLM)显著提高了自动代码翻译系统的功能正确性,但翻译后程序的运行时效率却相对较少受到关注。随着摩尔定律的失效,运行时效率与功能正确性一样,对程序质量变得越来越重要。我们的初步研究表明,LLM翻译的程序通常比人类编写的程序运行得更慢,而且这个问题无法仅通过提示工程来解决。因此,我们的工作提出了SwiftTrans,一个包含两个关键阶段的代码翻译框架:(1)多视角探索,其中MpTranslator利用并行上下文学习(ICL)生成多样化的翻译候选;(2)差异感知选择,其中DiffSelector通过显式比较翻译之间的差异来识别最优候选。我们进一步为MpTranslator引入层次化指导,为DiffSelector引入序数指导,使LLM能够更好地适应这两个核心组件。为了支持对翻译程序运行时效率的评估,我们扩展了现有基准CodeNet和F2SBench,并引入了一个新基准SwiftBench。在所有三个基准上的实验结果表明,SwiftTrans在正确性和运行时效率方面都取得了一致的改进。

英文摘要

While large language models (LLMs) have greatly advanced the functional correctness of automated code translation systems, the runtime efficiency of translated programs has received comparatively little attention. With the waning of Moore's law, runtime efficiency has become increasingly important for program quality, alongside functional correctness. Our preliminary study reveals that LLM-translated programs often run slower than human-written ones, and this issue cannot be remedied through prompt engineering alone. Therefore, our work proposes SwiftTrans, a code translation framework comprising two key stages: (1) Multi-Perspective Exploration, where MpTranslator leverages parallel in-context learning (ICL) to generate diverse translation candidates; and (2) Difference-Aware Selection, where DiffSelector identifies the optimal candidate by explicitly comparing differences between translations. We further introduce Hierarchical Guidance for MpTranslator and Ordinal Guidance for DiffSelector, enabling LLMs to better adapt to these two core components. To support the evaluation of runtime efficiency in translated programs, we extend existing benchmarks, CodeNet and F2SBench, and introduce a new benchmark, SwiftBench. Experimental results across all three benchmarks show that SwiftTrans achieves consistent improvements in both correctness and runtime efficiency.

2606.17999 2026-06-17 cs.CL 新提交

VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination

VoidPadding: 让 [VOID] 处理掩码扩散语言模型中的填充,以便 [EOS] 专注于语义终止

Chunyu Liu, Zhengyang Fan, Kaisen Yang, Alex Lamb

发表机构 * Tsinghua University(清华大学)

AI总结 提出VoidPadding方法,通过引入[VOID]令牌处理填充,将[EOS]从双重角色中解放,实现大块解码下的早期停止和自适应响应扩展,在数学推理和代码生成任务上平均提升17.84分,并减少55.7%的NFE。

详情
AI中文摘要

MDLM通过去噪预分配的掩码响应画布生成文本,使得响应长度建模成为指令调优的核心。现有的MDLM通常继承自回归惯例,在指令调优期间使用重复的[EOS]令牌进行填充,赋予[EOS]双重角色:既是语义终止符又是填充令牌。我们证明这种双重角色是大块解码下[EOS]溢出的根本原因之一。为了解耦这些角色,我们提出VoidPadding,引入[VOID]用于填充,并保留[EOS]用于终止。在推理过程中,学习到的[EOS]信号实现早期停止,而学习到的[VOID]信号指导自适应响应画布扩展。在Dream-7B-Instruct上,VoidPadding在数学推理和代码生成基准测试中,将块大小平均的四任务均值比原始模型提高17.84分,比RainbowPadding提高6.95分,同时平均减少55.7%的解码NFE。代码可在 https://github.com/Haru-LCY/VoidPadding 获取。

英文摘要

MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated [EOS] tokens for padding during instruction tuning, giving [EOS] a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of [EOS] overflow under large-block decoding. To decouple these roles, we propose VoidPadding, which introduces [VOID] for padding and reserves [EOS] for termination. During inference, the learned [EOS] signal enables early stopping, while the learned [VOID] signal guides adaptive response canvas expansion. On Dream-7B-Instruct, VoidPadding improves the block-size-averaged four-task mean across mathematical reasoning and code generation benchmarks by +17.84 points over the original model and +6.95 points over RainbowPadding, while reducing decoding NFE by 55.7% on average. Code is available at https://github.com/Haru-LCY/VoidPadding.
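
The difference between the two padding conventions is easiest to see on a fixed-length canvas. The sketch below builds training targets both ways; the helper names are hypothetical and the code only illustrates the token layout, not the released VoidPadding implementation.

```python
EOS, VOID = "[EOS]", "[VOID]"

def pad_eos_only(tokens, canvas_len):
    """Inherited convention: repeated [EOS] is both terminator and padding."""
    if len(tokens) + 1 > canvas_len:
        raise ValueError("canvas too short for the response plus one [EOS]")
    return tokens + [EOS] * (canvas_len - len(tokens))

def pad_with_void(tokens, canvas_len):
    """Decoupled convention: one [EOS] terminates, [VOID] fills the rest."""
    if len(tokens) + 1 > canvas_len:
        raise ValueError("canvas too short for the response plus one [EOS]")
    return tokens + [EOS] + [VOID] * (canvas_len - len(tokens) - 1)

def truncate_at_eos(canvas):
    """Early stopping: keep only what precedes the first [EOS]."""
    return canvas[:canvas.index(EOS)] if EOS in canvas else canvas

response = ["The", "answer", "is", "4"]
baseline = pad_eos_only(response, 8)
decoupled = pad_with_void(response, 8)
```

With the decoupled layout, [EOS] occurs exactly once per target, so its probability mass is tied to where the answer ends rather than to how much empty canvas remains.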

2606.08810 2026-06-17 cs.CL cs.LG 版本更新

Continuous Language Diffusion as a Decoder-Interface Problem

连续语言扩散作为解码器-接口问题

Zhicheng Du, Lan Ma

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院, 清华大学)

AI总结 研究连续扩散语言模型如何从高斯噪声生成流畅文本,提出解码器-盆地机制,并设计诊断协议揭示标量指标隐藏的失败,通过接口相图解释令牌恢复行为。

详情
AI中文摘要

高斯扰乱的句子嵌入没有直接的语言解释,但连续扩散语言模型可以从它们生成流畅文本。我们通过嵌入式语言流(ELF)研究这一谜题,并识别出解码器-盆地机制:我们的证据表明,当轨迹到达原生解码器可以读取稳定令牌的区域时,去噪变得可靠。我们引入了针对可去噪性、语义可恢复性、顺序敏感性、解码器兼容性和轨迹可靠性的诊断协议。它暴露了标量指标隐藏的失败:低均方误差可能丢弃语言内容,低困惑度可能反映低熵崩溃,干净的潜在重建可能与狭窄的解码器盆地共存。一个解码器-边界界解释了为什么令牌恢复依赖于边界和局部解码器敏感性,而不仅仅是潜在误差。审计公开的ELF检查点揭示了一个接口相图:早期预测弱可读,轨迹中期分歧标志竞争区域,晚期预测进入高边界解码器盆地。一旦进入,在生成的ELF状态上令牌实现出奇简单:冻结的T5令牌嵌入查找恢复了原生解码器决策的93%–96%,单个线性读出在32k样本时达到97.9%的一致性,在结构化残差尾部留下约1.1–1.2的困惑度差距。在保守的留出门控下,边界规则在显式诊断监控下于去噪步骤中提前约17%–28%退出。对LangFlow、BitstreamDiffusion和连续潜在扩散语言模型(Cola-DLM)的边界检查表明,当状态对象和解码器改变时,相同的接口问题仍然有意义。因此,连续和潜在扩散语言模型应作为表示-解码器系统进行评估。

英文摘要

Gaussian-corrupted sentence embeddings have no direct linguistic interpretation, yet continuous diffusion language models can generate fluent text from them. We study this puzzle through Embedded Language Flows (ELF) and identify a decoder-basin mechanism: our evidence suggests that denoising becomes reliable when trajectories reach regions where the native decoder can read stable tokens. We introduce a diagnostic protocol for denoisability, semantic recoverability, order sensitivity, decoder compatibility, and trajectory reliability. It exposes failures hidden by scalar metrics: low mean-squared error can discard linguistic content, low perplexity can reflect low-entropy collapse, and clean latent reconstruction can coexist with a narrow decoder basin. A decoder-margin bound explains why token recovery depends on margin and local decoder sensitivity, not latent error alone. Auditing public ELF checkpoints reveals an interface phase diagram: early predictions are weakly readable, mid-trajectory disagreement marks a competition region, and late predictions enter a high-margin decoder basin. Once inside, token realization is surprisingly simple on generated ELF states: frozen T5 (Text-to-Text Transfer Transformer) token-embedding lookup recovers 93-96% of native decoder decisions, and a single linear readout reaches 97.9% agreement at 32k samples, leaving a perplexity gap of approximately 1.1-1.2 in a structured residual tail. Under conservative held-out gates, a margin rule exits roughly 17-28% earlier in denoising steps under an explicit diagnostic monitor. Boundary checks on LangFlow, BitstreamDiffusion, and the Continuous Latent Diffusion Language Model (Cola-DLM) show that the same interface questions remain meaningful when the state object and decoder change. Continuous and latent diffusion language models should therefore be evaluated as representation-decoder systems.
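
A margin-based early exit can be pictured as follows: at each denoising step, take the gap between the top two decoder logits at every token position, and stop once the smallest gap clears a threshold. This is a sketch under assumptions: the min-over-positions aggregation, the threshold value, and the toy logits are all hypothetical, and the paper's actual rule and gates are not specified in the abstract.

```python
def min_token_margin(step_logits):
    """Smallest top-1 minus top-2 logit gap across token positions.

    step_logits: one list of logits per token position (at least two each).
    """
    margins = []
    for logits in step_logits:
        top = sorted(logits, reverse=True)
        margins.append(top[0] - top[1])
    return min(margins)

def first_exit_step(trajectory, threshold):
    """Index of the first denoising step whose weakest token clears the margin.

    Falls back to the final step if the threshold is never reached.
    """
    for step, step_logits in enumerate(trajectory):
        if min_token_margin(step_logits) >= threshold:
            return step
    return len(trajectory) - 1

# Toy trajectory: 4 denoising steps, 2 token positions, 3-way logits.
trajectory = [
    [[1.0, 0.9, 0.1], [0.5, 0.4, 0.3]],
    [[2.0, 0.5, 0.1], [1.5, 1.2, 0.2]],
    [[4.0, 0.5, 0.1], [3.5, 0.2, 0.1]],
    [[6.0, 0.5, 0.1], [5.0, 0.2, 0.1]],
]
exit_step = first_exit_step(trajectory, threshold=2.0)
```

In this toy run the rule exits at step index 2 of 0..3, skipping the last step; a stricter threshold simply defers the exit to the final step.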

6. 语义、语法与语言学分析 5 篇

2606.17299 2026-06-17 cs.CL 新提交

Examining the Limits of Word2Vec with Toki Pona

用 Toki Pona 检验 Word2Vec 的极限

Daniel Zhenhan Huang, Hongchen Wu

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文使用仅有约130个单词的人造语言 Toki Pona 训练 Word2Vec,探究词汇量极小时嵌入质量,并分析非核心噪声词的影响。

Comments 10 pages, 4 figures, 3 tables. Accepted to the Society for Computation in Linguistics (SCiL) 2026

详情
AI中文摘要

Word2Vec 在生成语义嵌入方面的有效性已得到广泛验证,但其测试几乎完全集中在词汇量大的语言上。本研究使用 Toki Pona(一种约130个单词的人造语言)的数据,检验 Word2Vec 能否在极小的词汇量下成功捕获语义关系。我们从 Toki Pona 社区获取了140万句子(795万词元)用于训练。语料库中约23%的句子包含非 Toki Pona 词元,如命名实体、借词和新词。为了探究这种语言噪声是增强还是阻碍性能——这是词嵌入文献中很少涉及的话题——我们训练了两个不同的模型:一个保留这些偶然词元,另一个将其完全过滤。评估采用定量方法(测量单词到语义类别质心的距离)、自动轮廓分数(通过凝聚聚类)以及定性分析(使用与英语对比的表征相似性矩阵)。结果表明,虽然稀疏的非核心词元不影响所学嵌入的相对结构,但它们实际上使相似词在向量空间中更接近。重要的是,即使在这个极端下限,Word2Vec 的有效性更多地取决于分布模式而非词汇表大小。

英文摘要

Word2Vec's effectiveness at generating semantic embeddings has been widely validated, yet it has been tested almost exclusively on languages with large vocabulary inventories. This study examines whether Word2Vec can successfully capture semantic relationships within an extremely reduced vocabulary using data from Toki Pona, a constructed language with approximately 130 words. We sourced 1.4 million sentences (7.95 million tokens) from the Toki Pona community for training. Approximately 23% of sentences in the corpus contain non-Toki Pona tokens such as named entities, loanwords, and neologisms. To investigate whether this linguistic noise enhances or hinders performance -- a topic rarely addressed in word embedding literature -- we trained two distinct models: one retaining these incidental tokens and another filtering them out completely. Evaluation was conducted using quantitative methods measuring word proximity to semantic category centroids, automated silhouette scores via agglomerative clustering, and qualitative analysis utilizing representational similarity matrices compared against English. The results indicate that while sparse, non-core tokens do not affect the relative structure of the learned embeddings, they actually draw similar words closer together in the vector space. Importantly, Word2Vec's effectiveness depends more on distributional patterns than lexicon size even at this extreme lower bound.
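
To make the centroid-proximity evaluation concrete, here is a minimal sketch under stated assumptions. The two-dimensional vectors are invented toy values rather than trained embeddings, the category grouping is hypothetical, and cosine similarity is assumed as the proximity measure, which the abstract does not specify.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest_category(word_vec: Vector, categories: Dict[str, List[Vector]]) -> str:
    """Name of the category whose centroid is most similar to word_vec."""
    return max(categories, key=lambda name: cosine(word_vec, centroid(categories[name])))


# Invented toy embeddings for a few Toki Pona words (not real model output).
toy_categories = {
    "animal": [[1.0, 0.1], [0.9, 0.2]],  # e.g. soweli, waso
    "color": [[0.1, 1.0], [0.2, 0.9]],   # e.g. loje, laso
}
predicted = nearest_category([0.8, 0.3], toy_categories)  # a held-out word such as kala
```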

2606.17522 2026-06-17 cs.CL cs.LG 新提交

An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars

深度Transformer中基于有界深度文法的层次建模的表达性分析

Vinoth Nandakumar, Qiang Qu, Pramod Thebe, Sakshi Khachariya, Tongliang Liu

发表机构 * University of Sydney(悉尼大学) San Francisco State University(旧金山州立大学) IIT Madras(印度理工学院马德拉斯分校)

AI总结 针对有界深度、非递归的上下文无关文法,显式构造深度随文法深度线性增长的Transformer,其神经元数随派生树形状数量及产生式规则数量的平方缩放,结果支持线性表示假说。

详情
AI中文摘要

深度神经网络普遍被认为其表达能力源于形成层次表示的能力,即在各层中逐步捕获更抽象和组合的特征。在语言建模中,Transformer已成为主导架构,早期层捕获局部句法模式,后期层编码更复杂的从句级依赖。尽管这种直觉塑造了模型设计,但缺乏严格的理论工作来展示深度Transformer如何表示这种层次结构。本文通过有界深度、非递归上下文无关文法的形式化视角,分析深度Transformer模型的表达性。对于这类文法,我们显式构造了具有位置注意力的Transformer,其深度随文法深度线性增长,而神经元数量随派生树形状数量以及产生式规则数量的平方缩放。我们的理论结果支持线性表示假说,证明了这些架构具有将抽象语法状态编码为残差流中低维、线性可分子空间的结构能力。

英文摘要

Deep neural networks are widely believed to derive their expressive power from their ability to form hierarchical representations, capturing progressively more abstract and compositional features across layers. In language modeling, transformers have emerged as the dominant architecture, with early layers capturing local syntactic patterns and later layers encoding more complex clause-level dependencies. While this intuition has shaped model design, there remains a lack of rigorous theoretical work demonstrating how deep transformers represent such hierarchical structures. In this work, we analyze the expressiveness of deep transformer models through the formal lens of bounded-depth, non-recursive context-free grammars. For this class of grammars, we explicitly construct transformers with positional attention whose depth grows linearly with grammar depth, while the neuron count scales with the number of derivation-tree shapes and quadratically with the number of production rules. Our theoretical results support the linear representation hypothesis by demonstrating that these architectures possess the structural capacity to encode abstract grammatical states into low-dimensional, linearly separable subspaces within the residual stream.
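
As an illustration of what "grammar depth" can mean for a non-recursive context-free grammar, the sketch below computes the maximum derivation-tree height from a start symbol. The definition used here (terminals have height 0, a nonterminal has height one more than its deepest right-hand-side symbol) is an assumption and may differ from the paper's formal definition; the toy grammar is likewise invented.

```python
from typing import Dict, List, Set

Grammar = Dict[str, List[List[str]]]  # nonterminal -> list of right-hand sides


def grammar_depth(grammar: Grammar, start: str) -> int:
    """Maximum derivation-tree height from `start`; raises ValueError on recursion."""
    memo: Dict[str, int] = {}
    in_progress: Set[str] = set()

    def depth(symbol: str) -> int:
        if symbol not in grammar:  # anything without rules is treated as a terminal
            return 0
        if symbol in memo:
            return memo[symbol]
        if symbol in in_progress:
            raise ValueError(f"grammar is recursive through {symbol!r}")
        in_progress.add(symbol)
        result = 1 + max((depth(s) for rhs in grammar[symbol] for s in rhs), default=0)
        in_progress.discard(symbol)
        memo[symbol] = result
        return result

    return depth(start)


# Invented toy grammar: S -> NP VP, NP -> det noun, VP -> verb NP | verb
toy_grammar = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"]],
    "VP": [["verb", "NP"], ["verb"]],
}
toy_depth = grammar_depth(toy_grammar, "S")
```

Rejecting recursive grammars explicitly mirrors the bounded-depth restriction in the abstract: once a nonterminal can reach itself, no finite depth bound exists.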

2606.18205 2026-06-17 cs.CL 新提交

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0

使用ISO语言标记框架和TEI Lex-0分析与编码Al-Mawrid阿拉伯语-英语词典

Diaa Fayed, Laurent Romary

发表机构 * Faculty of Information Technology and Computer Science, Sinai University(信息科技与计算机科学学院, Sinai大学) Inria ISO committee TC 37 (Language and terminology)(ISO术语委员会TC 37(语言和术语))

AI总结 本文提出了一种系统化方法,将Al-Mawrid阿拉伯语-英语词典数字化并编码为标准化计算词典,采用ISO LMF和TEI Lex-0双重标准,实现91%的结构解析准确率。

Comments 44 pages, 58 figures, 12 tables. Submitted to Language Resources and Evaluation, under review since Aug 2025, round 3

详情
AI中文摘要

本文提出了一种稳健的方法,用于系统化数字化和编码Al-Mawrid阿拉伯语-英语词典,将其从传统的印刷资源转变为标准化的计算词典。针对阿拉伯语词汇基础设施中的显著空白,本研究采用双重标准框架,将ISO词汇标记框架(LMF)与文本编码倡议TEI Lex-0指南对齐。通过应用编辑视角处理词典的宏观和微观结构,研究解决了20世纪双语词典中典型的结构歧义和标点不一致问题。该方法基于对词典词汇知识密度的实证分析。基于代表性样本(字母Ayn,占总量的4.6%),研究为编码过程提供了科学依据,展示了91%的结构解析准确率。信息提取规则的定量评估显示出高性能,同义词的精确率为85%,召回率为98%,其他形态语义特征的精确率为88%。除了技术描述,本文还与现有阿拉伯语词汇资源进行了批判性比较,并讨论了TEI Lex-0在建模特定阿拉伯语现象(如隐式“开放集”语义关系和分散的形态线索)时的局限性。此外,研究通过建立可扩展的基于前缀的引用系统,探索了语言关联开放数据(LLOD)集成的潜力,促进了该资源在语义网中的包含。最终成果是一个可互操作、机器可处理的资源,为阿拉伯语自然语言处理和数字人文社区中复杂遗留双语词典的逆向数字化提供了可复现的工作流程。

英文摘要

This paper presents a robust methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary, transforming it from a legacy print resource into a standardized computational lexicon. Addressing a significant gap in Arabic lexical infrastructure, the study adopts a dual-standard framing that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative TEI Lex-0 guidelines. By applying an editorial view to the dictionary's macro- and microstructure, the research resolves the structural ambiguities and punctuation inconsistencies typical of 20th-century bilingual dictionaries. The methodology is grounded in an empirical analysis of the dictionary's lexical knowledge density. Drawing on a representative sample (the letter Ayn, comprising 4.6% of the total volume), the study provides scientific weight to the encoding process, demonstrating a structural parsing accuracy of 91%. Quantitative evaluation of the information extraction rules reveals high performance, with 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. Beyond technical description, the paper provides a critical comparison with existing Arabic lexical resources and discusses the limitations of TEI Lex-0 when modelling specific Arabic phenomena, such as implicit "open set" semantic relations and scattered morphological cues. Furthermore, the study explores the potential for Linguistic Linked Open Data (LLOD) integration by establishing a scalable prefix-based referencing system that facilitates the resource's inclusion in the semantic web. The result is an interoperable, machine-tractable resource that provides a reproducible workflow for the retro-digitization of complex legacy bilingual lexicons within the Arabic NLP and Digital Humanities communities.
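
For readers less familiar with the extraction metrics quoted above, the following sketch shows how precision and recall are conventionally computed from a set of extracted items and a gold-standard set. The synonym strings are invented placeholders, not entries from the dictionary, and the paper's exact matching criteria are not given in the abstract.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision = correct / extracted, recall = correct / gold (0.0 when undefined)."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example: five synonyms extracted, four of them in the gold annotation.
extracted_synonyms = {"eye", "spring", "source", "essence", "scout"}
gold_synonyms = {"eye", "spring", "source", "essence"}
precision, recall = precision_recall(extracted_synonyms, gold_synonyms)
```

High recall with lower precision, as in this toy case, means the rules rarely miss a true synonym but sometimes extract spurious ones.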

2604.22128 2026-06-17 cs.CL cs.LG 版本更新

Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

括号序列Transformer中可解码性与因果使用的分离

Aryan Sharma, Cutter Dawes, Shivam Raval

AI总结 通过探针和干预实验,发现Dyck语言Transformer中层级表示虽可解码,但仅注意力模式中的栈顶位置对长距离准确性有因果影响。

详情
AI中文摘要

当在需要理解层级结构的任务上训练时,Transformer被发现以不同方式表示这种层级:在残差流的几何结构中,以及在维持后进先出顺序的类栈注意力模式中。然而,这些表示是被因果使用还是仅仅可解码仍不清楚。我们在Dyck语言(一种平衡括号序列的形式语言)上训练的Transformer中检验了这一差距,其中层级真实标签是明确的。通过探针和干预残差流及注意力模式,我们发现深度、距离和栈顶信号都是可解码的,但它们的因果作用不同。具体而言,掩盖真实栈顶位置的注意力会导致长距离准确性急剧下降,而消融低维残差流子空间则影响相对较小。这些结果扩展到模板化的自然语言设置,表明即使在相关层级变量已知的受控设置中,仅可解码性并不意味着因果使用。

英文摘要

When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.
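
The hierarchical variables probed in this work (depth and top-of-stack position) have an unambiguous ground truth on bracket sequences. A minimal sketch of how such labels could be derived with an explicit stack follows; the two-bracket alphabet and the output format are assumptions for illustration, not the authors' data pipeline.

```python
from typing import List, Optional, Tuple

PAIRS = {")": "(", "]": "["}  # closer -> matching opener
OPENERS = set(PAIRS.values())


def stack_trace(sequence: str) -> List[Tuple[int, Optional[int]]]:
    """For each token, return (depth after the token, index of the top-of-stack opener).

    The top-of-stack index is None when the stack is empty. Prefixes of
    balanced sequences are accepted; mismatched closers raise ValueError.
    """
    stack: List[int] = []  # indices of openers that are still unmatched
    trace: List[Tuple[int, Optional[int]]] = []
    for index, token in enumerate(sequence):
        if token in OPENERS:
            stack.append(index)
        elif token in PAIRS:
            if not stack or sequence[stack[-1]] != PAIRS[token]:
                raise ValueError(f"unbalanced bracket at position {index}")
            stack.pop()  # last-in, first-out: the most recent opener is closed
        else:
            raise ValueError(f"unexpected token {token!r} at position {index}")
        trace.append((len(stack), stack[-1] if stack else None))
    return trace


labels = stack_trace("([])")
```

The second element of each pair is the position that the masking intervention described in the abstract would target.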

2606.07555 2026-06-17 cs.CL cs.LG 版本更新

Priors Persist Through Suppression: A Stroop Paradigm for Lexical Override

先验通过抑制持续存在:词汇覆盖的斯特鲁普范式

Han-yu Wang

发表机构 * The University of Hong Kong(香港大学)

AI总结 通过斯特鲁普范式实验,发现语言模型中的词汇先验在局部规则覆盖后仍持续存在,并通过激活修补定位到源位置三元组,揭示了先验是干扰起源和覆盖痕迹的共同通道。

详情
AI中文摘要

词汇表、技术规范和系统提示通常要求语言模型以不熟悉的方式使用熟悉的词汇。当这种方式有效时,词汇先验通过覆盖而非替换持续存在:它在局部规则应用后继续运作,规则降低其logit而非在顶部安装新含义。我们通过斯特鲁普风格范式对此进行测试:一个重映射规则(“doctor”意为“forest”)与查询词的词汇先验干扰项(“hospital”)对抗,并匹配中性对照。在跨越四个家族和1B-9B参数的11个开源权重模型中,即使在项目级别控制答案先验、频率、分词和提示措辞后,词汇先验强度仍能预测干扰。对五个对齐模型的激活修补定位到一个源位置三元组(定义主语、定义目标、查询词),该三元组几乎完全恢复了冲突效应(聚合$R \in [0.92, 1.06]$)。定义目标交换表明该三元组执行绑定而非身份匹配。分离实验将目标保留隔离为绑定特定特征:干扰抑制在匹配、交换和项目不匹配条件下均发生,而目标logit崩溃仅在定义目标位置被破坏时发生。行为和机制汇聚到同一通道:词汇先验既是干扰的起源,也是覆盖留下痕迹的地方。

英文摘要

Glossaries, technical specifications, and system prompts routinely ask language models to use familiar words in unfamiliar ways. When this works, the local rule does not install the new meaning on top of the old one; the pretrained prior keeps operating underneath, and its strength still shows through. We test this with a Stroop-style paradigm: a remapping rule (doctor means forest) pitted against the query word's lexical-prior distractor (hospital), with matched neutral controls. Across 11 open-weight models spanning four families and 1B-9B parameters, lexical-prior strength predicts interference even after item-level controls for answer prior, frequency, tokenization, and prompt wording. Activation patching on five aligned models locates a source-position triplet (definition subject, definition target, query word) that nearly fully recovers the conflict effect (aggregate $R \in [0.92, 1.06]$); a definition-target swap shows the triplet performs binding rather than identity matching. Dissociation experiments isolate target preservation as the binding-specific signature: distractor suppression occurs under matched, swap, and item-mismatched conditions alike, whereas target logit collapse occurs only when the definition-target position is corrupted. Behavior and mechanism converge on the same channel: the prior's strength both predicts which overrides fail and marks where the causal repair lands.

7. 多模态语言处理 15 篇

2606.17213 2026-06-17 cs.CL cs.CV 新提交

Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

重新审视用于3D CT报告生成的LLM适应:缩放与诊断先验研究

Vanshali Sharma, Andrea M. Bejar, Halil Ertugrul Aktas, Quoc-Huy Trinh, Debesh Jha, Gorkem Durak, Ulas Bagci

发表机构 * Northwestern University(西北大学) University of South Dakota(南达科他大学) Aalto University(阿尔托大学)

AI总结 提出RAD3D-Prefix轻量级诊断先验框架,通过冻结大语言模型并融合多标签诊断分类logits,在少量可训练参数下实现3D CT报告生成,优于可比的参数高效基线并展现强域外泛化性。

详情
AI中文摘要

多模态学习的最新进展,包括大型语言模型(LLM)和视觉-语言模型(VLM),已展现出对自然图像的强大适应性。然而,将其扩展到医学领域,特别是体积(3D)图像,由于高计算复杂度、体积依赖性和视觉特征与临床术语之间的语义差距而具有挑战性。在有限的医学数据上对LLM进行朴素微调常常导致过拟合和临床幻觉,其中语言流畅性优先于临床事实性。在本研究中,我们研究了用于体积CT报告生成的参数高效适应策略,并引入了RAD3D-Prefix,一种轻量级的诊断先验条件框架,最大限度地减少了对大量参数训练的需求。该模块将图像嵌入与多标签诊断分类logits相结合,保留了关键的临床细节,同时弥合了语义差距。通过保持LLM冻结,我们的方法需要最少的可训练参数,并减轻了在小规模、特定领域数据集上过拟合的风险。通过对从96.1M到1.6B参数的LLM进行系统研究,我们发现微调对较小的LLM最有益,而冻结较大的(约1B+)LLM并仅训练轻量级投影层在性能、泛化性和计算效率之间提供了优越的权衡。在多个自动指标和一项临床读者研究中,RAD3D-Prefix优于可比较的参数高效基线,并在使用比全微调替代方案少得多的可训练参数的情况下,展现出强大的域外泛化能力。

英文摘要

Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity, volumetric dependencies, and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger (~1B+) LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.

2606.17449 2026-06-17 cs.CL cs.AI cs.CV cs.LG cs.MM 新提交

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

MODE-RAG: 基于流形异常诊断和能量的检索增强生成评估

Zehang Wei, Jiaxin Dai, Jiamin Yan, Xiang Xiang

发表机构 * School of Computer Science & Tech, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院) School of AI and Automation, Huazhong University of Science and Technology(华中科技大学人工智能与自动化学院)

AI总结 提出MODE-RAG多智能体系统,利用变分自由能和内部注意力状态动态门控干预,结合蒙特卡洛树搜索和logit扰动减少多模态检索增强生成中的幻觉和逻辑捏造。

Comments To be presented at ACL 2026

详情
AI中文摘要

虽然多模态检索增强生成(M-RAG)增强了大型视觉语言模型,但它仍然非常容易受到跨模态幻觉、因果捏造和谄媚的影响。此外,现有的缓解流程常常面临干预悖论:静态规则往往不必要地干扰准确的生成,而完全不加引导的多模态推理则允许现有的不匹配级联成严重的逻辑捏造。为了量化和缓解这些幻觉,我们提出了一个多智能体系统MODE-RAG,由变分自由能(VFE)和内部注意力状态驱动,以动态门控干预。高风险查询被路由到五个阶段特定的智能体,集成蒙特卡洛树搜索(MCTS)进行严格的因果推导,以及logit扰动以惩罚谄媚。专门的纠正和监管智能体确保格式稳定性并执行事后事实验证。为了客观评估我们的方法,我们引入了ModeVent,一个源自MultiVent数据集的具有挑战性的子集。大量实验表明,我们的系统有效降低了幻觉率和逻辑捏造,显著提高了M-RAG系统的鲁棒性。

英文摘要

While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi-Agent system, MODE-RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High-risk queries are routed to five stage-specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post-hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M-RAG systems.

2606.17791 2026-06-17 cs.CL cs.CV 新提交

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

Slop悖论:合成标准化如何侵蚀AI重写放射学报告中的临床不确定性和跨模态对齐

Samar Ansari

发表机构 * School of Computing and Engineering Sciences, University of Chester(切斯特大学计算与工程科学学院)

AI总结 本研究通过控制实验测量AI重写放射学报告导致的信息退化,发现电子健康记录摘要虽破坏内容但保留图像-文本对齐,而标准化重写和教学病例准备则相反,造成更大对齐损失,称为slop悖论。

详情
AI中文摘要

AI辅助临床文档工具越来越多地使用大型语言模型(LLMs)对放射学报告进行摘要、标准化和重新格式化。我们提出了对由此产生的信息退化的受控测量。使用印第安纳大学数据集的450份胸部X光报告,我们通过三种真实的LLM重写任务生成合成版本:电子健康记录摘要、标准化重写和教学病例准备。我们测量了实体侵蚀(通过医学命名实体识别)、对冲崩溃(临床不确定性语言的丧失)和跨模态对齐退化(通过BiomedCLIP图像-文本相似度)。我们的核心发现是信息损失与跨模态保真度之间的分离。电子健康记录摘要在内容层面最具破坏性,侵蚀了51.4%的临床实体和43.7%的对冲语言,但它几乎完全保留了图像-文本对齐(下降2.5%)。旨在生成更干净训练数据的两个任务,即标准化重写和教学病例准备,则相反:它们保留了更多实体(分别侵蚀26.8%和29.3%),但导致14.9-16.5%的对齐下降,是电子健康记录摘要的六到七倍。我们称之为slop悖论:使临床文本看起来更干净以用于多模态训练的重写恰恰使其偏离图像。与我们预先指定的假设相反,罕见病理并未优先退化:在九次罕见与常见比较中,没有差异在多重比较校正后幸存,且名义差异方向相反(常见>罕见),因此污染对特定条件监测不可见。退化的主要决定因素是AI重写任务的类型,而非临床内容。这些发现对多模态医学AI数据集构建和AI辅助临床文档的治理具有重要意义。

英文摘要

AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using 450 chest X-ray reports from the Indiana University dataset, we generate synthetic versions via three realistic LLM rewriting tasks: EHR summarization, standardized rewriting, and teaching case preparation. We measure entity erosion (via medical NER), hedging collapse (loss of clinical uncertainty language), and cross-modal alignment degradation (via BiomedCLIP image-text similarity). Our central finding is a dissociation between information loss and cross-modal fidelity. EHR summarization is the most destructive at the content level, eroding 51.4% of clinical entities and 43.7% of hedging language, yet it preserves image-text alignment almost entirely (a 2.5% drop). The two tasks meant to produce cleaner training data, standardized rewriting and teaching case preparation, do the reverse: they preserve more entities (26.8% and 29.3% eroded) but cause 14.9-16.5% alignment drops, six to seven times those of EHR summarization. We term this the slop paradox: rewriting that makes clinical text look cleaner for multimodal training is precisely what pulls it away from the image. Contrary to our pre-specified hypothesis, rare pathologies were not preferentially degraded: across nine rare-versus-common comparisons, no difference survived multiple-comparison correction, and nominal differences ran in the opposite direction (common > rare), so contamination is invisible to condition-specific monitoring. The dominant determinant of degradation is the type of AI rewriting task, not the clinical content. These findings bear on multimodal medical AI dataset construction and the governance of AI-assisted clinical documentation.
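
The erosion and hedging-collapse measurements can be pictured as simple set differences between an original report and its rewrite. The sketch below is a simplified illustration, not the study's pipeline: the hedge lexicon is a hypothetical five-phrase list, matching is naive substring search, and the entity lists are supplied by hand where the study uses a medical NER model.

```python
from typing import Iterable, Set

# Hypothetical hedge lexicon; a real study would use a curated clinical list.
HEDGES = ("possible", "may", "cannot exclude", "suggestive of", "likely")


def erosion_rate(original_items: Iterable[str], rewritten_items: Iterable[str]) -> float:
    """Fraction of distinct original items that are missing from the rewrite."""
    original = set(original_items)
    if not original:
        return 0.0
    return len(original - set(rewritten_items)) / len(original)


def hedges_in(text: str) -> Set[str]:
    """Hedge phrases found in the text (naive, case-insensitive substring match)."""
    lowered = text.lower()
    return {hedge for hedge in HEDGES if hedge in lowered}


# Invented report pair for illustration only.
original_report = "Possible small left pleural effusion. Cannot exclude early pneumonia."
rewritten_report = "Small left pleural effusion. Pneumonia."

hedge_loss = erosion_rate(hedges_in(original_report), hedges_in(rewritten_report))
entity_loss = erosion_rate(
    ["effusion", "pneumonia", "cardiomegaly", "atelectasis"],
    ["effusion", "pneumonia"],
)
```

In this toy pair the rewrite keeps both findings but drops every hedge, turning two uncertain observations into flat assertions.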

2606.17057 2026-06-17 cs.LG cs.AI cs.CL 交叉投稿

Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs

配对时正确,分离时错误:多模态大语言模型中模态特定神经元的解耦与编辑

Tingchao Fu, Wenkai Wang, Fanxiao Li, Huadong Zhang, Jinhong Zhang, Dayang Li, Yunyun Dong, Renyang Liu, Wei Zhou

发表机构 * School of Information Science and Engineering, Yunnan University(云南大学信息科学与工程学院) School of Software, Yunnan University(云南大学软件学院) National University of Singapore(新加坡国立大学) School of Engineering, Yunnan University(云南大学工程学院)

AI总结 针对多模态大语言模型知识编辑中存在的解耦失败问题,提出DECODE方法,通过显式解耦和定位模态特定神经元组,实现跨模态触发下的有效知识更新。

Comments 18 pages, 11 figures

详情
AI中文摘要

尽管知识编辑为多模态大语言模型(MLLMs)的知识更新提供了一种高效机制,但我们发现当前范式仍面临一个重要但尚未充分探索的问题:编辑解耦失败,即当模型被多模态输入(文本-图像查询对)触发时,实体相关知识可以更新,但当配对输入被拆分为单模态输入时,这些知识往往恢复为编辑前的旧事实。我们深入的实证分析表明,MLLMs中的实体知识并非以统一表示存储,而是分布在解耦的模态特定路径中。因此,偏向多模态查询的更新无法有效传播到单模态电路。为弥补这一差距,我们提出DECODE,该方法显式解耦并定位模态特定神经元组以获取目标知识。大量实验证明,DECODE在不同模态触发下均能实现有效的知识更新,从而缓解编辑解耦失败。

英文摘要

Although Knowledge Editing provides an efficient mechanism for updating the knowledge of Multimodal Large Language Models (MLLMs), we find that current paradigms still suffer from an important yet underexplored issue: editing decoupling failure, where entity-related knowledge can be updated when the model is triggered by multimodal inputs (text-image query pairs) but often reverts to outdated pre-edit facts when the paired inputs are split into unimodal ones. Our in-depth empirical analysis reveals that the entity knowledge in MLLMs is not stored as a unified representation, but is instead distributed across disentangled modality-specific pathways. As a result, updates biased toward multimodal queries fail to propagate effectively to unimodal circuits. To bridge this gap, we propose DECODE, which explicitly disentangles and localizes modality-specific neuron groups for targeted knowledge. Extensive experiments demonstrate that DECODE consistently achieves effective knowledge updates under different modality triggers, thereby mitigating editing decoupling failures.

2606.17389 2026-06-17 cs.CV cs.AI cs.CL cs.LG 交叉投稿

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

视觉会撒谎,一致性说话:在视觉-语言模型中解耦空间注意力与可靠性

Logan Mann, Yi Xia, Ajit Saravanan, Ishan Dave, Saadullah Ismail, Shikhar Shiromani, Emily Huang, Ruizhe Li, Kevin Zhu

发表机构 * University of California, Santa Barbara(加州大学圣塔芭芭拉分校) Algoverse AI Research(Algoverse AI研究) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出VLM可靠性探针(VRP),通过结构注意力指标和生成动态分析,发现空间注意力与准确性几乎无关(R≈0.001),而自一致性是可靠性的主要预测因子(R=0.429),揭示了视觉特征与最终生成之间的符号脱离现象。

Comments 16 pages. Accepted to the ICLR 2026 Workshop on Multimodal Intelligence. Code: https://github.com/itsloganmann/VLM-Reliability-Probe

详情
AI中文摘要

多模态基础模型越来越多地被用作推理代理,因此可靠性(即知道模型何时可能产生幻觉)变得至关重要。一种常见的直觉,我们称之为注意力-置信度假设,认为可靠性源于“结构性”视觉感知:对相关区域的紧密注意力应表明答案可信,而分散的注意力则表示困惑。我们通过VLM可靠性探针(VRP)挑战这一观点,这是一项对当代视觉-语言模型(VLM)中可靠性信号进行的系统性跨家族研究。我们引入了结构注意力指标——簇计数(C_k)和空间熵(H_s)——来量化视觉编码器的注视点,并追踪其跨层的演化(ΔH_s)。这揭示了一种“符号脱离”:模型通常“早期锁定”视觉特征,但随后注意力扩散,切断了早期感知与最终生成的联系。与接地假设相反,我们发现“簇失效”:空间注意力与准确性几乎零相关(R≈0.001)。相反,可靠性是生成动态和内部状态分布的现象。自一致性,即采样推理路径之间的一致率,是真实性的主要预测因子(R=0.429)。扩展因果干预揭示了尖锐的架构差异:LLaVA将其预测锁定在脆弱的后期瓶颈中,而PaliGemma和Qwen2-VL全局分布可靠性,即使其最具预测性的层被破坏约50%或更多,仍保持韧性。对于当前的VLM,可靠性信号与视觉接地图脱离,最好通过生成时动态和隐藏状态探针来推断。

英文摘要

Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that reliability follows from "structural" visual perception: tight attention on relevant regions should signal a trustworthy answer, while scattered attention signals confusion. We challenge this through the VLM Reliability Probe (VRP), a systematic cross-family study of reliability signals in contemporary Vision-Language Models (VLMs). We introduce structural-attention metrics, cluster counts (C_k) and spatial entropy (H_s), to quantify the visual encoder's gaze, and track its evolution (ΔH_s) across layers. This reveals a "Symbolic Detachment": models often "Early Lock" visual features only to diffuse attention later, severing early perception from final generation. Contrary to the grounding hypothesis, we find a "Cluster Failure": spatial attention has near-zero correlation (R ≈ 0.001) with accuracy. Instead, reliability is a phenomenon of generation dynamics and internal-state distributions. Self-Consistency, the agreement rate across sampled reasoning paths, is the dominant predictor of truth (R = 0.429). Scaling causal interventions exposes a sharp architectural divergence: LLaVA locks its prediction in a fragile late-stage bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability globally, staying resilient even when ~50% or more of their most predictive layer is destroyed. For current VLMs, reliability signals are detached from visual grounding maps and are best inferred from generation-time dynamics and hidden-state probes.
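
Two of the signals named in the abstract, self-consistency and spatial entropy, have simple textbook forms that can be sketched as follows. Both definitions here are assumptions: self-consistency is taken as the share of sampled answers that match the majority answer, and spatial entropy as the Shannon entropy of a normalized attention map. The paper's exact formulations may differ.

```python
import math
from collections import Counter
from typing import List, Sequence


def self_consistency(answers: Sequence[str]) -> float:
    """Share of sampled answers that agree with the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def spatial_entropy(attention: List[List[float]]) -> float:
    """Shannon entropy (in nats) of a non-negative attention map normalized to sum to 1."""
    weights = [w for row in attention for w in row]
    total = sum(weights)
    return -sum((w / total) * math.log(w / total) for w in weights if w > 0)


# Toy inputs: four sampled answers and two 2x2 attention maps.
agreement = self_consistency(["cat", "cat", "dog", "cat"])
focused = spatial_entropy([[1.0, 0.0], [0.0, 0.0]])  # all mass on one patch
diffuse = spatial_entropy([[1.0, 1.0], [1.0, 1.0]])  # uniform over four patches
```

A focused map has zero entropy and a uniform map has the maximum, log of the number of patches; the abstract's finding is that this kind of spatial signal barely tracks accuracy, whereas answer agreement does.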

2606.17650 2026-06-17 cs.CV cs.CL 交叉投稿

MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block

MambaCount: 基于空间稀疏状态空间对偶块的高效文本引导开放词汇目标计数

Hao-Yuan Ma, Li Zhang, Minjie Qiang, Jie Gao

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院)

AI总结 提出MambaCount框架,通过空间稀疏状态空间对偶块解决Mamba在非因果视觉任务中的双向依赖限制和空间token高熵问题,实现线性复杂度的开放词汇目标计数,在FSC-147上取得12.23的测试MAE。

详情
AI中文摘要

文本引导的开放词汇目标计数(TOOC)旨在估计由文本提示描述的目标数量,在具有大规模变化的密集场景中尤其具有挑战性。现有的TOOC方法主要依赖Transformer,其相对于图像分辨率的二次复杂度限制了可扩展性。Mamba因其线性复杂度提供了一种有前景的替代方案。然而,先前基于Mamba的方法存在两个主要限制。一方面,Mamba固有的因果公式限制了非因果视觉任务所需的双向空间依赖建模。另一方面,现有的基于Mamba的视觉模型往往忽略了空间token响应中无约束的高熵,这可能削弱局部细节和高频线索。为了解决这些限制,我们提出了MambaCount,一种基于空间稀疏状态空间对偶(S^4D)块的高效框架。具体来说,我们分析并重构了Mamba中隐藏状态的衰减动态,以缓解因果建模引入的依赖约束。此外,我们引入了空间token选择(STS)子块,以减少Mamba中空间token响应的无约束高熵。另外,我们设计了多粒度原型(MGP),以在不同语义级别识别类似目标的区域,改善跨模态对齐和可解释性。在FSC-147上的大量实验表明,MambaCount在无需二次查询的方法中达到了最先进的性能,测试MAE为12.23,同时保持了线性复杂度。

英文摘要

Text-guided Open-vocabulary Object Counting (TOOC) aims to estimate the number of objects described by text prompts, which is particularly challenging in dense scenes with large scale variations. Existing TOOC approaches predominantly rely on Transformers, whose quadratic complexity with respect to image resolution limits their scalability. Mamba offers a promising alternative due to its linear complexity. However, previous Mamba-based methods have two main limitations. On the one hand, the inherent causal formulation of Mamba constrains the bidirectional spatial dependency modeling required by non-causal vision tasks. On the other hand, existing Mamba-based vision models often overlook the unconstrained high entropy in the spatial token responses, which can weaken local details and high-frequency cues. To address these limitations, we propose MambaCount, an efficient framework built on the Spatial Sparse State Space Duality (S^4D) block. Specifically, we analyze and reconstruct the decay dynamics of hidden states in Mamba to alleviate the dependency constraints introduced by causal modeling. Moreover, we introduce a Spatial Token Selection (STS) sub-block to reduce the unconstrained high entropy in spatial token responses within Mamba. In addition, we design Multi-Granularity Prototypes (MGP) to identify object-like regions at different semantic levels, improving cross-modal alignment and interpretability. Extensive experiments on FSC-147 demonstrate that MambaCount achieves state-of-the-art performance among methods without secondary querying, obtaining a test MAE of 12.23, while retaining linear complexity.

2606.17710 2026-06-17 cs.CV cs.AI cs.CL cs.LG 交叉投稿

Vision-language models for chest radiography do not always need the image

胸部X光片的视觉-语言模型并不总是需要图像

Mahshad Lotfinia, Sebastian Ziegelmayer, Lisa Adams, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh

发表机构 * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg(弗里德里希-亚历山大-埃尔朗根-纽伦堡大学模式识别实验室) Department of Diagnostic and Interventional Radiology, TUM University Clinic, School of Medicine and Health, Klinikum rechts der Isar, Technical University of Munich(慕尼黑工业大学医学院与健康学院伊萨尔河右岸医院诊断与介入放射学系) Lab for AI in Medicine, RWTH Aachen University(亚琛工业大学医学人工智能实验室) Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen(亚琛工业大学医院诊断与介入放射学系)

AI总结 本文通过因果审计方法,发现许多医学视觉-语言模型在胸部X光片任务中依赖文本先验而非图像,纯文本模型与多模态模型性能接近,并提出了基于图像依赖性的评估框架。

详情
AI中文摘要

医学视觉-语言模型报告了强大的胸部X光片准确性,这越来越多地被解读为它们使用了图像的证据。这种推断是不安全的:一个利用发现名称先验的模型得分与读取扫描的模型相同,且没有标准基准能区分它们。我们引入了一种因果审计方法,通过遮挡相关区域、遮挡无关区域以及替换为另一患者的相同标签扫描来干预图像,并结合三种行为指标测试正确答案是否依赖于图像。在九个系统中,一个没有图像访问权限的纯文本模型达到了最佳多模态模型5.7个准确度点以内的水平,而一个1190亿参数的多模态模型在统计上与70亿参数的纯文本基线无法区分。审计将队列分为三个忽略图像的模型、一个不稳定的模型和五个选择性使用图像的模型(仅针对部分发现);这些分类在第二个数据集、分辨率和提示措辞上保持一致。与委员会认证的放射科医生相比,纯文本模型的准确率在统计上与放射科医生无法区分,但其图像依据(grounding)率为零,而使用图像的模型的图像依据率与放射科医生相当。报告的置信度仅在模型使用图像时才会标记无依据的答案。应以图像依据审计(而非准确性)作为临床部署的门槛。

英文摘要

Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient's same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist's accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.

2602.10384 2026-06-17 cs.CL 版本更新

When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

当表格失控:评估多模态模型在法语金融文档上的表现

Virginie Mouilleron, Théo Lasnier, Anna Mosolova, Djamé Seddah

发表机构 * Inria Paris(巴黎国家信息与自动化研究所) Sorbonne Université(索邦大学)

AI总结 提出Scribe Finance基准,评估多模态模型在法语金融文档上的文本、表格、图表及多轮对话理解能力,发现模型在图表和多轮对话中表现脆弱。

Comments 16 pages, 13 figures

详情
AI中文摘要

视觉语言模型(VLM)在许多文档理解任务上表现良好,但它们在专业非英语领域的可靠性仍未充分探索。这一差距在金融领域尤为关键,因为金融文档混合了密集的监管文本、数值表格和视觉图表,且提取错误可能产生实际后果。我们引入了Scribe Finance,这是首个用于评估法语金融文档理解的多模态基准。该数据集包含1,204个经过专家验证的问题,涵盖文本提取、表格理解、图表解释和多轮对话推理,数据来自真实的投资说明书、KID和PRIIP。我们使用LLM-as-judge协议评估了六个开源VLM(8B-124B参数)。虽然模型在文本和表格任务上表现强劲(85-90%准确率),但在图表解释上表现不佳(34-62%)。最值得注意的是,多轮对话揭示了一个严重的失败模式:早期错误在轮次间传播,导致准确率降至约50%,无论模型大小如何。这些结果表明,当前的VLM在定义明确的提取任务上有效,但在交互式、多步骤的金融分析中仍然脆弱。Scribe Finance提供了一个具有挑战性的基准,用于衡量和推动这一高风险场景的进展。

英文摘要

Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Scribe Finance, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Scribe Finance offers a challenging benchmark to measure and drive progress in this high-stakes setting.

2606.15932 2026-06-17 cs.CL 版本更新

Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

超越NL2Code:多模态代码智能的结构化综述

Xuanle Zhao, Qiushi Sun, Jingyu Xiao, Xuexin Liu, Haoyue Yang, Qiaosheng Chen, Xianzhen Luo, Jing Huang, Yufeng Zhong, Lei Chen, Shuai Fu, Zhenlin Wei, Jinhe Bi, Lei Jiang, Haibo Qiu, Siqi Yang, Peng Shi, Jian Hu, Zhixiong Zeng

Affiliations * Meituan; The University of Hong Kong; The Chinese University of Hong Kong; Institute of Automation, Chinese Academy of Sciences; Nanjing University; Harbin Institute of Technology; Australian Institute for Machine Learning, Adelaide University; Ludwig Maximilian University of Munich; University of Science and Technology of China; Queen Mary University of London

AI summary This survey systematically reviews multimodal code intelligence, classifies tasks by the role that code plays, covers GUIs, scientific visualization, structured graphics, and frontier tasks, and proposes four verification-centered future directions.

Comments Work completed in January 2026. Updating now

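Details
AI abstract (translated from Chinese)

Although LLMs have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, documents, vector graphics, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, geometry, data semantics, editability, interaction behavior, and domain-specific constraints that apply after execution. This survey examines multimodal code intelligence, covering systems that generate, edit, refine, execute, or reason with code under visual inputs and outputs. We first define the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: graphical user interfaces, scientific visualization, structured graphics, and frontier tasks and frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and lets us compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions: multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move multimodal code generation from single-output imitation toward evidence-grounded executable systems.

English abstract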

While Large Language Models (LLMs) have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, data semantics, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and allows us to compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move this field from single-output imitation toward evidence-grounded executable systems. An ongoing project and resources are available on GitHub: https://github.com/xjywhu/Awesome-Multimodal-LLM-for-Code

2403.18957 2026-06-17 cs.CY cs.CL cs.LG cs.SI Updated version

Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision-Language Models

Keyan Guo, Ayush Utkarsh, Wenbo Ding, Isabelle Ondracek, Ziming Zhao, Guo Freeman, Nishant Vishwamitra, Hongxin Hu

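AI summary Targets images that illicitly promote unsafe UGC games on social media and proposes UGCG-Guard, a system that uses large vision-language models with a conditional prompting strategy for zero-shot domain adaptation and reaches 94% detection accuracy.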

Comments In Proceedings of the 33rd USENIX Conference on Security Symposium (SEC '24), August 14-16, 2024

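Details
AI abstract (translated from Chinese)

Online user-generated content games (UGCGs) are increasingly popular among children and adolescents for social interaction and more creative online entertainment. However, they carry a heightened risk of exposure to explicit content, raising growing concern about the online safety of children and adolescents. Despite these concerns, few studies have addressed illicit image-based promotion of unsafe UGCGs on social media, which can inadvertently attract young users. The challenge stems from the difficulty of obtaining comprehensive training data for UGCG images and from the unique nature of these images, which differ from traditional unsafe content. In this work, we take the first step toward studying the threat of illicit promotion of unsafe UGCGs. We collect a real-world dataset of 2,924 images showing diverse sexually explicit and violent content that game creators use to promote UGCGs. Our in-depth study yields a new understanding of the problem and shows the urgent need to flag illicit UGCG promotions automatically. We also build a cutting-edge system, UGCG-Guard, to help social media platforms identify images used for illicit UGCG promotion. The system uses recently introduced large vision-language models (VLMs) and applies a novel conditional prompting strategy for zero-shot domain adaptation, together with chain-of-thought (CoT) reasoning for contextual identification. UGCG-Guard achieves strong results, reaching 94% accuracy in detecting images used for the illicit promotion of such games in real-world scenarios.

English abstract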

Online user generated content games (UGCGs) are increasingly popular among children and adolescents for social interaction and more creative online entertainment. However, they pose a heightened risk of exposure to explicit content, raising growing concerns for the online safety of children and adolescents. Despite these concerns, few studies have addressed the issue of illicit image-based promotions of unsafe UGCGs on social media, which can inadvertently attract young users. This challenge arises from the difficulty of obtaining comprehensive training data for UGCG images and the unique nature of these images, which differ from traditional unsafe content. In this work, we take the first step towards studying the threat of illicit promotions of unsafe UGCGs. We collect a real-world dataset comprising 2,924 images that display diverse sexually explicit and violent content used to promote UGCGs by their game creators. Our in-depth studies reveal a new understanding of this problem and the urgent need for automatically flagging illicit UGCG promotions. We additionally create a cutting-edge system, UGCG-Guard, designed to aid social media platforms in effectively identifying images used for illicit UGCG promotions. This system leverages recently introduced large vision-language models (VLMs) and employs a novel conditional prompting strategy for zero-shot domain adaptation, along with chain-of-thought (CoT) reasoning for contextual identification. UGCG-Guard achieves outstanding results, with an accuracy rate of 94% in detecting these images used for the illicit promotion of such games in real-world scenarios.

2502.00241 2026-06-17 cs.LG cs.AI cs.CL cs.CV Updated version

Mordal: Automated Pretrained Model Selection for Vision Language Models

Shiqi He, Insu Jang, Mosharaf Chowdhury

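AI summary Proposes Mordal, a framework that automates the search for the best vision language model for a user-defined task by reducing the number of candidate models and the evaluation time. Compared with grid search it uses 8.9× to 11.6× fewer GPU hours, and its weighted Kendall's τ is about 69% higher on average than that of the state-of-the-art model selection method.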

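Details
AI abstract (translated from Chinese)

Incorporating multiple modalities into large language models (LLMs) is an effective way to enhance their understanding of non-textual data and enable them to perform multimodal tasks. Vision language models (VLMs) are the fastest-growing category of multimodal models because of their many practical applications in areas such as healthcare, robotics, and accessibility. However, although different VLMs in the literature show impressive visual capabilities on different benchmarks, they are all handcrafted by human experts; there is currently no automated framework for creating task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this by reducing the number of candidate models considered during the search and by minimizing the time needed to evaluate each remaining candidate. Our evaluation shows that Mordal finds the best VLM for a given problem using 8.9× to 11.6× fewer GPU hours than grid search. We also find that, across diverse tasks, Mordal achieves about 69% higher weighted Kendall's τ on average than the state-of-the-art model selection method.

English abstract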

Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using 8.9× to 11.6× fewer GPU hours than grid search. We have also discovered that Mordal achieves about 69% higher weighted Kendall's τ on average than the state-of-the-art model selection method across diverse tasks.
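The abstract scores model selection with a weighted Kendall's τ but does not say which weighting it uses. The sketch below is one possible reading, assuming additive hyperbolic weights so that mis-ordering the truly best candidates costs more than mis-ordering weak ones. The function name and the weighting scheme are illustrative assumptions, not Mordal's implementation, and ties are not handled.

```python
from itertools import combinations


def weighted_kendall_tau(true_scores, predicted_scores):
    """Weighted Kendall's tau with additive hyperbolic weights (assumed scheme)."""
    n = len(true_scores)
    # Rank 0 is the best candidate according to the true scores.
    order = sorted(range(n), key=lambda i: true_scores[i], reverse=True)
    rank = {item: r for r, item in enumerate(order)}

    def sign(x):
        return (x > 0) - (x < 0)

    num = den = 0.0
    for i, j in combinations(range(n), 2):
        # Pairs that involve highly ranked candidates get larger weights.
        w = 1.0 / (rank[i] + 1) + 1.0 / (rank[j] + 1)
        agreement = sign(true_scores[i] - true_scores[j]) * sign(
            predicted_scores[i] - predicted_scores[j]
        )
        num += w * agreement
        den += w
    return num / den
```

With four candidates, swapping the two best ones in the predicted ranking lowers the score more than swapping the two worst, which is the behavior a model-selection metric would want.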

2601.00215 2026-06-17 cs.CV cs.CL Updated version

Disentangling Perception and Reasoning in Multimodal LLMs via Reward Design

Omar Sharif, Eftekhar Hossain, Nikhil Singh, Patrick Ng

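AI summary Studies the perception and reasoning bottlenecks in multimodal LLMs, finds that perception is the main constraint, and improves visually grounded reasoning through reward design, with a 5.56-point gain over the base model.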

Comments 24 pages, 15 Figures, 10 Tables

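Details
AI abstract (translated from Chinese)

Reinforcement learning with verifiable rewards has driven major advances in LLM reasoning, and intuitively this recipe should transfer well to multimodal models. However, multimodal models do two things: they first perceive what is in an image and then reason about what it implies. Because the two stages are graded jointly, it is hard to tell how much room reasoning alone has to improve. We study this question on algorithmic visual puzzles, where both components are necessary, and show that perception, not reasoning, is the binding constraint. Replacing images with simple textual descriptions raises the average performance of Claude models by more than 20 points. We then evaluate six reward designs intended to induce visual grounding during reasoning without chain-of-thought supervision. When Qwen-2.5-VL-7B is trained with GRPO, reward design induces long, structured reasoning with self-reflection and visual references, yielding a 5.56-point gain over the base model. These gains are uneven, however: no single reward improves all categories, and rewards with verifiable accuracy signals trade out-of-domain transfer for in-domain accuracy. These results point to perception-aware reward design as a path forward, so that perception is corrected at its source rather than in the reasoning that inherits its errors.

English abstract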

Reinforcement learning with verifiable rewards has driven major gains in LLM reasoning, and it is intuitive to assume this recipe will transfer well to multimodal models. However, multimodal models do two things: first, perceive what is in an image, then reason about what it implies. Because these stages are graded jointly, it is hard to tell how much room reasoning alone has to grow. We study this on algorithmic visual puzzles, where both components are necessary, and show that perception, not reasoning, is the binding constraint. Replacing images with simple textual descriptions raises performance by over 20 points on average for Claude models. We then evaluate six reward designs aimed at inducing visual grounding during reasoning without chain-of-thought supervision. When Qwen-2.5-VL-7B is trained with GRPO, reward design induces long, structured reasoning with self-reflection and visual references, yielding a 5.56-point gain over the base model. These gains are, however, uneven; no single reward improves all categories, and rewards with verifiable accuracy signals trade out-of-domain transfer for in-domain accuracy. These results point to perception-aware reward design as a path forward, so that reward signals correct perception at its source rather than the reasoning that inherits its errors.
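For readers unfamiliar with GRPO, the following minimal sketch shows its group-relative advantage: rewards for several responses to the same prompt are normalized by the group mean and standard deviation. The `combined_reward` function is purely hypothetical. It only illustrates how an accuracy term might be combined with a visual-grounding bonus and is not one of the six reward designs evaluated in the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def combined_reward(answer_correct, cites_visual_evidence, w_acc=1.0, w_ground=0.5):
    """Hypothetical reward: accuracy term plus a visual-grounding bonus."""
    return w_acc * float(answer_correct) + w_ground * float(cites_visual_evidence)
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal, which is one reason the shape of the reward matters.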

2602.03300 2026-06-17 cs.LG cs.AI cs.CL cs.CV Updated version

R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

Jingyi Zhang, Tianyi Lin, Huanjin Yao, Xiang Lan, Shunyu Liu, Jiaxing Huang

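AI summary Proposes Collective Adversarial Data Synthesis (CADS), which uses collective intelligence and adversarial learning to automatically generate high-quality, diverse, and challenging multimodal data for improving multimodal large language models (MLLMs) on complex real-world tasks.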

Comments ICML 2026 Camera Ready

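Details
AI abstract (translated from Chinese)

In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data to strengthen MLLMs on complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach for synthesizing high-quality, diverse, and challenging multimodal data. The core idea of CADS is to use collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples that effectively drive model improvement. Specifically, CADS comprises two cyclic phases: Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate uses collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of the synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism that optimizes the generation context to encourage challenging, high-value data. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which shows superior performance on multiple benchmarks.

English abstract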

In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.

2603.22281 2026-06-17 cs.CV cs.AI cs.CL cs.LG cs.RO Updated version

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu

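AI summary Proposes ThinkJEPA, a framework that combines a dense JEPA branch with a sparse VLM thinker branch and uses a hierarchical pyramid representation extraction module to achieve fine-grained motion modeling with long-horizon semantic guidance, outperforming baselines on hand-manipulation trajectory prediction.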

Comments 10 pages, 5 figures

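Details
AI abstract (translated from Chinese)

Recent progress in latent world models such as V-JEPA2 has demonstrated the ability to forecast future world states from video observations. However, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it hard to capture long-horizon semantics and reducing downstream utility. In contrast, vision-language models (VLMs) provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are ill-suited as standalone dense predictors because of compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided, JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance through a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride that provides knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. In hand-manipulation trajectory prediction experiments, our method outperforms a strong VLM-only baseline and a JEPA-predictor baseline and shows more robust long-horizon rollout behavior.

English abstract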

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

2606.14782 2026-06-17 cs.CV cs.CL Updated version

Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression

Tianhao Chen, Yuheng Wu, Kelu Yao, Xiaogang Xu, Xiaobin Hu, Dongman Lee

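Affiliations * KAIST; Zhejiang Laboratory; The Chinese University of Hong Kong; National University of Singapore

AI summary Addresses the loss of critical evidence when the KV cache is compressed for long visual contexts in multimodal large language models. Proposes BACON, which calibrates observation window attention with last-query attention and suppresses noise through intra-layer coherence and inter-layer persistence, improving performance by 7.5% on average under the most aggressive compression budget.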

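Details
AI abstract (translated from Chinese)

Multimodal large language models (MLLMs) achieve strong vision-language reasoning, but long visual contexts enlarge the KV cache and increase decoding latency. Existing compression methods rely on observation window attention for stable token-importance estimation, yet this aggregation can dilute sparse visual evidence and discard answer-critical tokens under aggressive compression. We therefore identify last-query attention as a complementary source for recovering such evidence, although its answer-irrelevant signals can mislead retention. We propose BACON, a plug-and-play method that calibrates observation window attention with last-query evidence and suppresses isolated noise through intra-layer coherence and inter-layer persistence. Across diverse benchmarks, models, budgets, and compression methods, BACON improves multimodal KV compression by 7.5% on average under the most aggressive budget, with gains of up to 30.9%.

English abstract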

Multimodal Large Language Models (MLLMs) achieve strong vision-language reasoning, but long visual contexts enlarge the KV cache and increase decoding latency. Existing compression methods rely on observation window attention for stable token-importance estimation, yet this aggregation can dilute sparse visual evidence and discard answer-critical tokens under aggressive compression. Therefore, we identify last-query attention as a complementary source for recovering such evidence, but its answer-irrelevant signals can mislead retention. We propose BACON, a plug-and-play method that calibrates observation window attention with last-query evidence and suppresses isolated noise via intra-layer coherence and inter-layer persistence. Across diverse benchmarks, models, budgets, and compression methods, BACON improves multimodal KV compression by 7.5% on average under the most aggressive budget, with gains up to 30.9%. Our project page is available at https://ryu1ion.github.io/official_BACON/
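The abstract does not spell out how BACON combines the two attention signals, so the sketch below is only a rough illustration of the general idea: score each cached token by blending the averaged observation-window attention with the last query's attention, then keep the top tokens within the budget. The linear blend, the `alpha` parameter, and the function name are assumptions made for illustration, and the intra-layer coherence and inter-layer persistence filters are omitted entirely.

```python
def select_kv_tokens(window_attn, last_query_attn, budget, alpha=0.5):
    """Pick which cached tokens to keep under a fixed budget.

    window_attn: attention rows from the observation window, one row per
                 query, each holding one weight per cached token.
    last_query_attn: attention weights of the final query over the same tokens.
    alpha: hypothetical mixing weight between the two signals.
    """
    n_tokens = len(last_query_attn)
    # Average attention each token receives across the observation window.
    window_score = [
        sum(row[t] for row in window_attn) / len(window_attn)
        for t in range(n_tokens)
    ]
    # Blend the stable window estimate with last-query evidence.
    score = [
        (1 - alpha) * window_score[t] + alpha * last_query_attn[t]
        for t in range(n_tokens)
    ]
    keep = sorted(range(n_tokens), key=lambda t: score[t], reverse=True)[:budget]
    return sorted(keep)
```

In the toy case used for testing, a token that the window barely attends to but the last query attends to strongly survives compression only when the last-query signal is mixed in.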

8. Joint Speech-Language and Audio-Text 11 papers

2606.17255 2026-06-17 cs.CL cs.AI New submission

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

Jorge Iranzo-Sánchez, Gerard Mas-Mollà, Adrià Giménez, Jorge Civera, Albert Sanchis, Alfons Juan

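Affiliations * MLLP-VRAIN research group; VRAIN Universitat Politècnica de València

AI summary Presents a cascaded simultaneous speech translation system built on the Parakeet and Qwen 3.5 models, uses adaptive black-box policies to optimize the quality-latency trade-off, and adds ASR word-boosting and a RAG mechanism for the context track, improving XCOMET-XL by +5.82 on the MCIF En→De test set.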

Comments IWSLT 2026 System Description

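Details
AI abstract (translated from Chinese)

This paper describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. Our submission uses the recently released Parakeet and Qwen 3.5 models to build a robust cascaded solution for long-form SimulST through adaptive "black-box" policies. We explore relaxed versions of these policies to achieve better quality-latency trade-offs. Unlike last year, we participate in all language directions. In addition, for the En→{De, It, Zh} directions we participate in this year's new context track, combining ASR word-boosting with a RAG mechanism over offline pre-translated exemplars to guide generation and enrich the system with domain-specific context. Finally, we provide a detailed latency analysis of our system. Compared with last year, results on the MCIF En→De test set show a substantial quality improvement of +5.82 XCOMET-XL. Our context track processing improves performance by a further +1.03.

English abstract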

This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. Our submission utilizes the recently released Parakeet and Qwen 3.5 models to create a robust, cascaded solution for long-form SimulST through the use of adaptive "black-box" policies. We explore relaxations of these policies to achieve better quality-latency trade-offs. Compared to last year, we participate in all language directions. In addition, for the En→{De, It, Zh} directions we also participate in this year's new context track, employing a combination of ASR word-boosting and a RAG mechanism of offline pre-translated exemplars to guide generation and enrich our system with domain-specific context. Finally, we provide a detailed latency analysis of our system. Compared to last year, results on the MCIF En→De test set show a substantial quality improvement of +5.82 XCOMET-XL. Our context track processing further improves performance by +1.03.

2606.17281 2026-06-17 cs.CL cs.SD eess.AS New submission

Are you speaking my languages? On spoken language adherence in multimodal LLMs

Hyungwon Kim, Kandarp Joshi, Lillian Zhou, Pavel Golik, Petar Aleksic

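Affiliations * Google DeepMind

AI summary Addresses multimodal LLMs that produce transcripts in the wrong output language during automatic speech recognition. Proposes a soft prompting approach, introduces a new metric that quantifies language adherence violations, and compares three mitigation strategies (zero-shot prompting, supervised fine-tuning, and chain-of-thought reasoning) on how well they reduce violations while preserving ASR performance.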

Comments 7 pages, 3 tables in the main body

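Details
AI abstract (translated from Chinese)

Although automatic speech recognition (ASR) based on large language models (LLMs) enables seamless multilingual use, models often misidentify the output language, which harms transcription fidelity and downstream application quality. To preserve flexibility and code-switching capability, we propose a soft prompting approach that hints at the likely spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a new metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting, which provides robust guidance under uncertainty; (2) supervised fine-tuning (SFT), which improves prompt adherence; and (3) chain-of-thought (CoT) reasoning, which enforces adherence during decoding. We compare these methods across multiple languages, assessing how effectively they reduce language violations while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under different compute constraints.

English abstract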

While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating effectiveness in reducing the language violation while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.
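The abstract introduces a metric for language adherence violations without defining it. As a hedged illustration only, one simple candidate is the share of utterances whose transcript language falls outside the set of languages hinted in the prompt. The function below assumes the transcript language has already been identified by some external tool, and both the name and the definition are hypothetical rather than the paper's metric.

```python
def language_violation_rate(predicted_langs, hinted_langs):
    """Share of utterances transcribed in a language outside the hinted set.

    predicted_langs: one language code per utterance (assumed to come from
                     an external language identifier run on the transcript).
    hinted_langs: one set of allowed language codes per utterance.
    """
    if not predicted_langs:
        return 0.0
    violations = sum(
        1 for pred, hints in zip(predicted_langs, hinted_langs) if pred not in hints
    )
    return violations / len(predicted_langs)
```

Because each hint is a set rather than a single label, code-switched audio can list every plausible language and still count as adherent, which mirrors the soft-prompting motivation in the abstract.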

2606.17820 2026-06-17 cs.CL New submission

Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

Reihaneh Amooie, Yun Hao, Wietse de Vries, Jelske Dijkstra, Matt Coler, Martijn Wieling

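Affiliations * University of Groningen; Fryske Akademy; Vrije Universiteit Brussel

AI summary Studies how bilingual fine-tuning affects speech recognition for low-resource languages across nine language pairs, using a language identification token to distinguish the two languages. Bilingual fine-tuning helps when language identification accuracy is high; when it is low, providing the token at inference improves performance.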

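Details
AI abstract (translated from Chinese)

This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) for low-resource languages. We evaluate the method on nine linguistically and geographically diverse language pairs covering a range of language families and writing systems. To distinguish the two languages, we prepend a language identification token to each input text during training. At inference, the model jointly predicts the language and the transcription from the speech input alone. Because texts whose language is identified incorrectly show lower ASR performance, we also run a follow-up experiment in which the language identification token is provided during both training and inference. Our results show that bilingual fine-tuning can be beneficial when language identification accuracy is high, and that when language identification performance is low, including the language identification token at inference helps improve ASR performance.

English abstract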

This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) in low-resource languages. We evaluate this method across nine linguistically and geographically diverse language pairs, covering a range of language families and writing systems. To distinguish the two languages, during training, we prepend each input text with a language identification token. At inference, the model jointly predicts both the language and transcription from the speech input alone. As texts for which the language is incorrectly determined show low ASR performance, we also conduct a follow-up experiment in which the language identification token is provided both during training and inference. Our results show that bilingual fine-tuning can be beneficial when language identification accuracy is high, and that in cases where language identification performance is low, including the language identification token at inference helps to improve ASR performance.
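The language-token setup described above can be sketched as a pair of helpers: one prepends the token to a training transcript, the other splits a model output back into predicted language and text. The `<|xx|>` token format and the function names are assumptions chosen for illustration, since the abstract does not state the exact format used.

```python
def add_language_token(transcript, lang_code):
    """Prepend a language identification token to a training target."""
    return f"<|{lang_code}|> {transcript}"


def split_language_token(output):
    """Separate the predicted language token from the predicted transcript."""
    token, _, text = output.partition(" ")
    if token.startswith("<|") and token.endswith("|>"):
        return token[2:-2], text
    # No recognizable token: report an unknown language and keep the text.
    return None, output
```

In the follow-up experiment from the abstract, the token would also be supplied at inference time, for example as a forced prefix of the decoder output, so the model only has to predict the transcript.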

2606.17826 2026-06-17 cs.CL cs.AI New submission

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

Jean Seo, Minkyu Kim, Jeonguk Lee, Jisoo Jung, Wooseok Han, Eunho Yang

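Affiliations * AITRICS; University of Copenhagen; KAIST

AI summary Addresses the effect of multiscript variability on ASR in non-English clinical settings by proposing the MultiClin benchmark, whose multiscript-aware evaluation measures recognition quality more fairly, and finds that script unification improves ASR performance.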

Comments Interspeech 2026

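Details
AI abstract (translated from Chinese)

Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, in which the same term can appear in several valid orthographic forms. Conventional string-matching evaluation metrics usually treat orthographic variants as errors and therefore underestimate ASR performance. To address this, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation assesses recognition quality more fairly than conventional single-reference evaluation. We further study the effect of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at https://github.com/aitrics-ronaldo/Interspeech_MultiClin.

English abstract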

Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation provides a fairer assessment of recognition quality than conventional single-reference evaluation. We further investigate the impact of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at: https://github.com/aitrics-ronaldo/Interspeech_MultiClin.
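One way to picture multiscript-aware evaluation is a character error rate taken against the closest of several valid references, so an orthographic variant is not counted as an error. The sketch below assumes that reading. The benchmark's actual scoring procedure is not described in the abstract, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def multi_reference_cer(hypothesis, references):
    """Character error rate against the closest of several valid references."""
    return min(edit_distance(hypothesis, ref) / len(ref) for ref in references)
```

For a hypothetical term written either in Latin script ("CT") or transliterated, a hypothesis that matches the Latin form scores 0.0 when both forms are listed as references, but is penalized when only the transliterated form is accepted.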

2606.17835 2026-06-17 cs.CL cs.AI eess.AS New submission

Perceptual compensation for tonal context in self-supervised speech models

James Kirby, Ioana Krehan, Michele Gubian

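Affiliations * Institute for Phonetics and Speech Processing, LMU Munich

AI summary Pseudo-replicates a perceptual compensation experiment on Mandarin tones and compares a purely self-supervised pre-trained model with a fine-tuned model. The purely pre-trained model shows no evidence of compensation, while the fine-tuned model shows partial compensation that falls short of human performance, suggesting that supervised objectives may be necessary for abstracting some phonological regularities.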

Comments Accepted for publication at Interspeech 2026

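Details
AI abstract (translated from Chinese)

This study examines the extent to which the wav2vec2.0 architecture shows evidence of compensation for phonological context. We carried out a pseudo-replication of a perceptual compensation experiment on Mandarin Chinese tones and compared embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin ASR. We found no evidence of compensation in the embedding similarities of the purely pre-trained model. Probing classifiers showed some evidence of compensation in addition to the expected layer-wise improvements in categorization, but failed to replicate human performance on isolated test syllables. Our findings contrast with previous reports that sensitivity to phonological structure emerges through pre-training alone, and suggest that supervised objectives may be necessary to encourage the abstraction of at least some types of phonological regularities.

English abstract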

This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptual compensation experiment on Mandarin Chinese tones, and compared the embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin ASR. No evidence of compensation was found in the embedding similarities of the purely pre-trained model. Probing classifiers showed some evidence of compensation in addition to the expected layer-wise improvements in categorization, but failed to replicate human performance on isolated test syllables. Our findings contrast with previous reports of sensitivity to phonological structure emerging through pre-training alone, and suggest that supervised objectives may be necessary to encourage the abstraction of at least some types of phonological regularities.

2606.17967 2026-06-17 cs.CL New submission

Learning task-specific subspaces via interventional post-training of speech foundation models

Jack Cox, Jon Barker

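Affiliations * University of Sheffield

AI summary Proposes a post-training method based on interventional contrastive learning that separates the entangled representations of speech foundation models into content and speaker subspaces. Evaluated on speaker verification and keyword spotting, it improves out-of-domain speaker verification.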

Comments Accepted to Interspeech 2026; 6 pages (4 main body), 2 figures

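Details
AI abstract (translated from Chinese)

Speech foundation models, pre-trained on large amounts of unlabelled speech data, produce general-purpose representations that are useful across tasks. However, these representations encode information about salient speech variables in a distributed manner, while downstream speech tasks depend on only some of these variables. This paper proposes a post-training refinement approach that uses interventional contrastive learning. Using an interventional dataset and a multi-part contrastive loss, we learn a transformation from the entangled representation space of speech foundation models into separate content and speaker subspaces. We evaluate the learned representations on speaker verification and keyword spotting tasks, showing improved out-of-domain speaker verification performance and evidence that speaker and content information are separated in the learned subspaces.

English abstract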

Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations which are useful across tasks. However, these representations encode information about salient speech variables in a distributed manner, while downstream speech tasks rely on only some of this variability. In this work, we propose a post-training refinement approach using interventional contrastive learning. By leveraging an interventional dataset and multi-part contrastive loss, we learn a transformation from the entangled representation space of speech foundation models into separate content and speaker subspaces. We evaluate the learnt representations on speaker verification and keyword spotting tasks, showing improved out-of-domain speaker verification performance and evidence that speaker and content information are separated across the learned subspaces.

2606.17537 2026-06-17 eess.AS cs.CL Cross-list

Non-Autoregressive Minimum Bayes' Risk Decoding for Fast Speech Recognition

Hiroyuki Deguchi, Takatomo Kano, Katsuki Chousa, Marc Delcroix

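Affiliations * NTT, Inc.

AI summary Proposes a non-autoregressive decoding framework based on minimum Bayes' risk decoding that samples multiple candidates efficiently from a single forward computation, improving recognition performance while keeping the speed advantage.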

Comments Accepted at Interspeech2026

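Details
AI abstract (translated from Chinese)

Non-autoregressive (NAR) decoding generates output tokens in parallel, making speech recognition faster than autoregressive (AR) decoding, which generates tokens sequentially from left to right. However, recognition performance degrades because NAR decoding cannot resolve uncertainty by conditioning on previously generated tokens. To address this, we propose a new NAR decoding framework based on minimum Bayes' risk (MBR) decoding, called NAR-MBR decoding, which maximizes the expected utility computed from samples drawn from the NAR model's output probability rather than maximizing the output probability itself. Notably, by exploiting the properties of NAR models, multiple samples can be obtained efficiently from a single forward computation. Experiments on LibriSpeech, Switchboard, AMI, and a web presentation corpus show that NAR-MBR decoding outperforms previous NAR decoding and runs faster than AR decoding.

English abstract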

Non-autoregressive (NAR) decoding generates output tokens in parallel, making speech recognition faster than autoregressive decoding, which generates them sequentially from left to right. However, the recognition performance is degraded because NAR decoding cannot resolve uncertainty by conditioning on previously generated tokens. To address this issue, we propose a novel NAR decoding framework based on minimum Bayes' risk (MBR) decoding, termed NAR-MBR decoding, that maximizes the expected utility calculated from samples drawn from the output probability of an NAR model rather than maximizing the output probability. Notably, by leveraging the nature of NAR models, multiple samples are obtained efficiently with a single forward computation. Our experiments across LibriSpeech, Switchboard, AMI, and web presentation corpus demonstrated that our NAR-MBR decoding outperformed previous NAR decoding and ran faster than AR decoding.
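The core of sample-based MBR decoding can be sketched in a few lines: draw several hypotheses from the per-position output distributions that a single NAR forward pass provides, then return the hypothesis with the highest average utility against all samples. The utility used here (fraction of matching positions) is a stand-in chosen for brevity. The paper's utility function, and details such as blank or repeat collapsing, are not given in the abstract and are not modeled.

```python
import random


def sample_hypotheses(position_probs, num_samples, seed=0):
    """Draw hypotheses by sampling each output position independently.

    position_probs: one dict {token: probability} per output position.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        tokens = [
            rng.choices(list(dist.keys()), weights=list(dist.values()))[0]
            for dist in position_probs
        ]
        samples.append(tuple(tokens))
    return samples


def token_match_utility(hyp, ref):
    """Hypothetical utility: fraction of positions where the tokens agree."""
    return sum(h == r for h, r in zip(hyp, ref)) / max(len(hyp), len(ref), 1)


def mbr_decode(samples, utility=token_match_utility):
    """Return the sample with the highest average utility against all samples."""
    def expected_utility(candidate):
        return sum(utility(candidate, other) for other in samples) / len(samples)

    return max(samples, key=expected_utility)
```

All samples come from the same set of per-position distributions, so the model runs once however many hypotheses are drawn. That is the efficiency argument the abstract makes for pairing MBR with NAR models.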

2606.18019 2026-06-17 eess.AS cs.CL cs.SD Cross-list

Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews

Franziska Braun, Alea Rüggeberg, Thomas Ranzenberger, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

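Affiliations * TH Nürnberg; FAU Erlangen; PMU Klinikum Nürnberg

AI summary Uses open-weight large language models to predict dementia and depression severity from recorded clinical interviews with 154 German-speaking subjects and introduces a Global Depression Scale aligned with the Global Deterioration Scale. Zero-shot prediction works well for depression, structured feature extraction substantially improves dementia assessment with error reductions of up to 35%, and pause-enriched transcripts perform comparably to human transcripts.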

Comments Accepted for publication in Text, Speech and Dialogue (TSD 2026). The final authenticated publication will be available online via Springer LNCS/LNAI

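Details
AI abstract (translated from Chinese)

Dementia and depression are the most common neuropsychiatric disorders in geriatric populations, and their overlapping symptoms pose major challenges for differential diagnosis. In this study, we investigate open-weight large language models (LLMs) for predicting dementia and depression severity from recordings of standardized history-taking interviews with 154 German-speaking subjects. We introduce an observer-based Global Depression Scale (GDS-D) aligned with the established Global Deterioration Scale (GDS), enabling parallel global staging of affective and cognitive symptoms. We compare three LLMs (Mistral 3.1, DeepHermes, Qwen3) in two settings: (1) zero-shot prediction and (2) LLM-based feature extraction for support vector regression, using human and pause-enriched transcripts. The results show that LLMs predict depression severity effectively in the zero-shot setting (best MAE of 0.60), while dementia assessment benefits substantially from structured feature extraction (best MAE of 0.78), with errors reduced by up to 35% relative to zero-shot baselines. Pause-enriched transcripts perform comparably to human transcripts, demonstrating the feasibility of fully automatic screening pipelines for differential neuropsychiatric assessment.

English abstract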

Dementia and depression are the most prevalent neuropsychiatric disorders in geriatric populations, and their overlapping symptoms pose major challenges for differential diagnosis. In this study, we investigate open-weights Large Language Models (LLMs) for predicting dementia and depression severity from speech samples collected during standardized history taking interviews with 154 German-speaking subjects. We introduce an observer-based Global Depression Scale (GDS-D) aligned with the established Global Deterioration Scale (GDS), enabling parallel global staging of affective and cognitive symptoms. We compare three LLMs (Mistral 3.1, DeepHermes, Qwen3) in two settings: (1) zero-shot prediction and (2) LLM-based feature extraction for Support Vector Regression, using human and pause-enriched transcripts. Results show that LLMs effectively predict depression severity in zero-shot settings (best MAE of 0.60), while dementia assessment benefits substantially from structured feature extraction (best MAE of 0.78), reducing errors by up to 35% over zero-shot baselines. Pause-enriched transcripts achieve competitive performance with human transcriptions, demonstrating the viability of fully automatic screening pipelines for differential neuropsychiatric assessment.

2602.15537 2026-06-17 cs.CL eess.AS Updated version

ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper

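AI summary Proposes ZeroSyl, a training-free method that extracts syllable boundaries and embeddings directly from a frozen WavLM model, achieves competitive syllable segmentation, and outperforms prior methods on lexical, syntactic, and narrative benchmarks.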

Comments Accepted to Interspeech 2026

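Details
AI abstract (translated from Chinese)

Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders produce excessively long sequences, which has motivated recent work on syllable-like units. However, methods such as Sylber and SyllableLM rely on complex multi-stage training pipelines. We propose ZeroSyl, a simple training-free method that extracts syllable boundaries and embeddings directly from a frozen WavLM model. Using the L2 norms of features from WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized with K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers on lexical, syntactic, and narrative benchmarks. Scaling experiments show that although finer-grained units benefit lexical tasks, the syllabic units we discover scale better for syntactic modeling.

English abstract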

Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
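The pipeline in the abstract (per-frame L2 norms, boundary detection, mean pooling) can be outlined with plain Python lists standing in for WavLM features. The abstract does not say how boundaries are derived from the norms, so the rule used below (a boundary at every strict local minimum of the norm curve) is an assumption, as are the function names. The K-means discretization step is left out.

```python
import math


def l2_norms(frames):
    """L2 norm of each frame-level feature vector."""
    return [math.sqrt(sum(x * x for x in frame)) for frame in frames]


def find_boundaries(norms):
    """Place a boundary at every strict local minimum of the norm curve (assumed rule)."""
    return [
        t for t in range(1, len(norms) - 1)
        if norms[t] < norms[t - 1] and norms[t] < norms[t + 1]
    ]


def mean_pool_segments(frames, boundaries):
    """Average the frames inside each segment delimited by the boundaries."""
    edges = [0] + list(boundaries) + [len(frames)]
    pooled = []
    for start, end in zip(edges, edges[1:]):
        segment = frames[start:end]
        dim = len(segment[0])
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled
```

Each pooled vector would then be assigned to its nearest K-means centroid to obtain one discrete token per syllable-like segment, which is what shortens the sequence relative to frame-level tokens.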

2603.26292 2026-06-17 cs.CL cs.AI Updated version

findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

Héctor Javier Vázquez Martínez

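AI summary Presents findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers and supports syllable segmentation, embedding extraction, and multi-granular evaluation. It is demonstrated on English, Spanish, and the low-resource language Kono to show reproducible cross-lingual experiments.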

Comments 4 pages + 2 for references, disclosures & acknowledgements; to appear in Interspeech 2026; DOI to cite findsylls library: https://doi.org/10.5281/zenodo.20707804

详情
AI中文摘要

音节级单元为口语语言建模和无监督词汇发现提供了紧凑且具有语言意义的表示,但关于音节化的研究仍然分散在不同的实现、数据集和评估协议中。我们介绍了findsylls,一个模块化的、语言无关的工具包,它将经典的音节检测器和端到端音节切分器统一在一个通用接口下,用于音节分割、嵌入提取和多粒度评估。该工具包实现并标准化了广泛使用的方法(例如,Sylber、VG-HuBERT),并允许重新组合其组件,从而实现对表示、算法和令牌率的受控比较。我们在英语和西班牙语语料库以及来自Kono(一种未被充分记录的中部曼德语)的新手工标注数据上演示了findsylls,展示了单一框架如何支持在资源丰富和资源不足的环境中均可重复的音节级实验。

English abstract

Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.

2606.11766 2026-06-17 eess.AS cs.AI cs.CL cs.SD Updated version

Fast Speech Foundation Model Distillation Using Interleaved Stacking

快速语音基础模型蒸馏使用交错堆叠

Eungbeom Kim, Kyogu Lee

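Affiliations * IPAI AIIS Dept. of Intelligence and Information

AI summary Proposes interleaved stacking to accelerate distillation training of speech foundation models, addressing performance degradation by keeping layer positions consistent, and validates it on SUPERB.

Comments Accepted by Interspeech 2026

Details
AI-generated Chinese abstract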

将大型语音基础模型(SFM)蒸馏为高效的学生模型已成功应用于低资源环境。尽管蒸馏减少了推理延迟,但它需要额外的学生模型训练。然而,SFM蒸馏的训练效率仍未得到充分探索。在这项工作中,我们探索了SFM蒸馏的训练加速以加快模型部署。我们研究了堆叠的潜力,其中模型深度通过训练逐步增加,直到达到目标模型深度。虽然现有的堆叠方法提高了训练速度,但它们遭受性能下降。为了解决这一限制,我们提出了交错堆叠,一种新颖的堆叠方法,在整个堆叠过程中始终保持层位置。这一特性在SFM中尤为关键,因为每一层编码了不同的层特定知识。我们在SUPERB上验证了所提方法的有效性。

English abstract

Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires additional training of the student model, and the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased during training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To address this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.

9. Evaluation, Datasets, and Benchmarks 24 papers

2606.17391 2026-06-17 cs.CL cs.AI cs.LG New submission

NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama

NarrativeWorldBench:面向长程共创音频剧的前沿饱和基准与潜在世界模型

Logan Mann, Abdur Rahman, Mohammad Saifullah, Taaha Kazi, Vasu Sharma

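Affiliations * University of California, Santa Barbara; Pocket FM

AI summary Introduces the NarrativeWorldBench benchmark, which evaluates 21 models on nine narrative-structure metrics, and N-VSSM, a variational state-space model that maintains a structured latent state over more than 200 episodes via a Mamba-2 backbone and an event-conditioned posterior; N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency and rated higher on controllability.

Comments 10 pages. Accepted to the ICML 2026 Workshops on High-dimensional Learning Dynamics (HiLD) and Culture x AI

Details
AI-generated Chinese abstract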

长篇连载音频剧,其剧情弧线跨越200至800集,是一种重要的创意媒介,也是前沿大语言模型(LLM)表现不佳的场景。我们在一组统一的叙事结构指标上,对21个模型进行了基准测试,涵盖经典、微调、开放前沿、封闭前沿和推理层级。所有封闭前沿系统在情节节拍F1上饱和于[0.78, 0.81]区间,并在视界h=200时下降约-0.20 F1。我们引入了NarrativeWorldBench,一个开放基准,包含九种叙事结构指标,在h∈{10, 20, 50, 100, 200}的视界上评估,并在四种印度语言(印地语、泰米尔语、泰卢固语、马拉地语)上进行跨语言评估。我们提出了N-VSSM,一种叙事变分状态空间模型,通过Mamba-2骨干网络和事件条件后验以及8B解码器,在超过200集的时间内维持一个结构化的256维潜在世界状态。N-VSSM在所有视界上保持情节节拍F1≥0.84,计算量仅为封闭前沿区间的1/4。学习到的文化迁移函数将跨语言忠实度提高了+0.20至+0.23 Likert分。在一项受试者内作家研究(n=12位专业作者,240次试验)中,N-VSSM在长弧一致性上以71%的偏好率优于Claude Opus 4.5,在可控性上评分高出+1.3 Likert分。

English abstract

Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and collapse by about -0.20 F1 at horizon h=200. We introduce NarrativeWorldBench, an open benchmark of nine narrative-structure metrics evaluated across horizons h in {10, 20, 50, 100, 200}, with cross-lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes via a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 >= 0.84 across all horizons at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points. In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and rated +1.3 Likert points higher on controllability.

2606.17474 2026-06-17 cs.CL cs.AI New submission

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

AIPatient Arena:基于电子健康记录的大语言模型在端到端临床咨询工作流中的评估

Jiahui Niu, Huizi Yu, Wenkong Wang, Guangxin Dai, Jingxian He, Xiang Li, Zhiying Liang, Xinxin Lin, Kent CY So, Bryan YP Yan, Yun Kwok Wing, Yanqiu Xing, Xin Ma, Lizhou Fan

Affiliations * School of Control Science and Engineering, Shandong University; Key Laboratory of Machine Intelligence and System Control, Shandong University; Department of Medicine and Therapeutics, The Chinese University of Hong Kong; Department of Geriatric Medicine, Qilu Hospital of Shandong University; Department of Psychiatry, The Chinese University of Hong Kong; Li Chiu Kong Family Sleep Assessment Unit, Department of Psychiatry, Faculty of Medicine, The Chinese University of Hong Kong; Li Ka Shing Institute of Health Sciences, Faculty of Medicine, The Chinese University of Hong Kong; Gerald Choa Neuroscience Institute, Department of Medicine and Therapeutics, The Chinese University of Hong Kong

AI summary Proposes the AIPatient Arena framework, which builds patient knowledge graphs from electronic health records and evaluates eight clinical competencies of large language models in multi-turn physician-patient interactions; it finds weaknesses in areas such as information coverage and diagnostic reasoning and underscores the importance of process-based evaluation.

Comments 49 pages, 12 figures, 11 tables

Details
AI-generated Chinese abstract

大语言模型(LLMs)越来越多地被考虑用于临床咨询任务,然而大多数医学评估仍然是静态的、单轮的或狭义的结果导向,限制了它们反映真实医疗护理的序列性、不确定性和交互性的能力。在此,我们提出AIPatient Arena,一个基于电子健康记录(EHRs)的评估框架,用于评估LLMs在八个临床能力维度上的临床实用性。该框架将EHR数据整合到患者特定的知识图谱中,实现多轮医患交互。我们将AIPatient Arena应用于一个由437名患者组成的主要队列以及两个分布外验证队列(分别为119名和67名患者)。我们观察到,LLMs在医学访谈提问技能(QS;平均得分4.43-4.99/5)、伦理与职业行为(ET;4.38-4.93/5)以及临床解释的清晰度和透明度(EX;3.80-4.72/5)方面表现良好。在信息整合(II;3.19-4.21/5)和用药安全与合理性(MS;3.13-3.78/5)方面表现中等,但在处理模糊患者回应(HR;2.57-3.32/5)、信息覆盖(IC;2.08-3.02/5)以及诊断准确性与推理(Dx;2.63-3.55/5)方面观察到持续的弱点。基于过程的评估揭示了反复出现的交互失败,包括重复提问、遗漏既往病史以及对不确定性处理不当。更丰富的对话上下文改善了诊断推理,但在治疗计划方面收益有限。这些发现表明,仅凭最终答案的准确性不足以评估临床就绪性,并强调了评估模型在整个咨询过程中如何收集、解释和传递信息的重要性。AIPatient Arena为医学LLMs的面向工作流的部署前评估提供了一个基于EHR的框架。

English abstract

Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. We observed that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

2606.17609 2026-06-17 cs.CL New submission

The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

基准测试幻觉:剪枝后的大语言模型能通过多项选择但无法回答问题

Rui Wen, Lu Sun, Jiayang Liu, Zesheng Xu, Tianshuo Cong, Zheng Li

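Affiliations * Institute of Science Tokyo; Tohoku University; Nanyang Technological University; KTH Royal Institute of Technology; Shandong University

AI summary Finds that large language models pruned at high sparsity perform well on multiple-choice evaluation yet fail to answer the same questions correctly in open generation, exposing a blind spot in benchmarks.

Details
AI-generated Chinese abstract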

压缩大型语言模型可以减少内存使用和推理成本,但也可能导致标准基准测试未能捕捉到的失败。一个剪枝后的模型可能在多项选择评估中仍然表现良好,但在开放生成中却无法回答相同的问题。我们探究剪枝改变了什么:是擦除了正确答案,还是使答案更难作为最高输出产生?我们通过多语言问答来研究这个问题,追踪剪枝前后相同的问题。我们发现了一种基准测试幻觉。在高稀疏度剪枝(尤其是Wanda)下,模型在贪婪开放生成中经常失败,而在多项选择评分下仍然能选择正确答案。在这些仅识别错误中,答案通常并未消失,而是被降级:它经常通过束搜索、采样或一个上下文示例重新出现。总体而言,多项选择基准测试可能夸大压缩LLM的可用性,造成评估盲点。压缩模型应该测试它们能产生什么,而不仅仅是能识别什么。

English abstract

Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the same question in open generation. We ask what pruning changes: does it erase the correct answer, or does it make the answer harder to produce as the top output? We study this question with multilingual question answering, tracking the same questions before and after pruning. We find a benchmark illusion. Under high-sparsity pruning, especially Wanda, models often fail in greedy open generation while still selecting the correct answer under multiple-choice scoring. In these recognition-only errors, the answer is usually not gone, but demoted: it often reappears with beam search, sampling, or one in-context example. Overall, multiple-choice benchmarks can overstate the usability of compressed LLMs, creating an evaluation blind spot. Compressed models should be tested on what they can produce, not only on what they can recognize.
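The "recognition-only" category described in this abstract can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's evaluation code: option scores stand in for per-option log-likelihoods, open-generation correctness is reduced to a case-insensitive exact match, and the records and function names are hypothetical.

```python
def mc_choice(option_scores):
    """Pick the option with the highest score (e.g. a log-likelihood)."""
    return max(option_scores, key=option_scores.get)

def classify(record):
    """Label one question by comparing multiple-choice and open-generation outcomes."""
    mc_ok = mc_choice(record["option_scores"]) == record["gold"]
    gen_ok = record["greedy_answer"].strip().lower() == record["gold"].strip().lower()
    if mc_ok and gen_ok:
        return "both_correct"
    if mc_ok:
        return "recognition_only"  # recognized under scoring, not produced greedily
    if gen_ok:
        return "generation_only"
    return "both_wrong"

# Toy records; the values are invented for illustration only.
records = [
    {"gold": "Paris", "option_scores": {"Paris": -1.2, "Rome": -3.5}, "greedy_answer": "Paris"},
    {"gold": "Paris", "option_scores": {"Paris": -2.0, "Rome": -2.9}, "greedy_answer": "Lyon"},
    {"gold": "Paris", "option_scores": {"Paris": -4.0, "Rome": -1.0}, "greedy_answer": "Rome"},
]
labels = [classify(r) for r in records]
```

Tracking the same questions before and after pruning, as the abstract describes, would amount to comparing these labels across the two model versions.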

2606.17634 2026-06-17 cs.CL New submission

Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs

用于可靠LLM评估的提示扰动方法:基于比较图

Dong Huang, Jianbo Sun, Pengkun Yang

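Affiliations * Department of Statistics and Data Science, Tsinghua University

AI summary Addresses intransitivity in pairwise LLM evaluation with a prompt perturbation framework that generates perturbed prompt variants and filters out structurally inconsistent comparison patterns to improve ranking consistency.

Comments 42 pages, 8 figures

Details
AI-generated Chinese abstract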

评估大型语言模型(LLM)对于理解其能力、比较竞争系统以及支持在实践中部署可靠模型至关重要。对于开放任务,成对评估已成为一种流行范式,其中比较同一提示的两个响应,并将产生的判断聚合为整体排名。该范式的核心挑战是非传递性:诱导的比较结果可能无法支持任何连贯的全局排名。例如,可能会观察到循环偏好,如$A \succ B \succ C \succ A$,或涉及平局的不一致性,如$A \equiv B\equiv C\neq A$。此类矛盾使得最终排行榜不稳定且难以解释。在本文中,我们提出了一种提示扰动框架,用于提高成对LLM评估的一致性。我们的方法生成每个提示的扰动变体,利用生成的比较图识别并过滤掉结构不一致的比较模式,然后将标准排名方法应用于过滤后的比较。该框架的一个关键特征是,在排名聚合之前,将图级结构一致性显式纳入评估流程。这提供了一种简单且原则性的方法来减少循环不一致性并提高LLM排名的可靠性。

English abstract

Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has become a popular paradigm, in which two responses to the same prompt are compared and the resulting judgments are aggregated into an overall ranking. A central challenge of this paradigm is intransitivity: the induced comparison outcomes may fail to support any coherent global ranking. For example, one may observe cyclic preferences such as $A \succ B \succ C \succ A$, or inconsistencies involving ties such as $A \equiv B\equiv C\neq A$. Such contradictions make the resulting leaderboard unstable and challenging to interpret. In this paper, we propose a prompt perturbation framework for improving the consistency of pairwise LLM evaluation. Our approach generates perturbed variants of each prompt, uses the resulting comparison graphs to identify and filter out structurally inconsistent comparison patterns, and then applies standard ranking methods to the filtered comparisons. A key feature of the proposed framework is that graph-level structural consistency is incorporated explicitly into the evaluation pipeline before ranking aggregation. This provides a simple and principled way to reduce cyclic inconsistencies and improve the reliability of LLM rankings.
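The idea of filtering structurally inconsistent comparisons before ranking can be sketched on a single comparison graph. The abstract does not specify the filtering rule, so this sketch assumes one simple choice: drop every comparison whose two models lie in a common 3-cycle. Ranking by win count stands in for a standard ranking method. The function names and toy outcomes are hypothetical.

```python
from collections import Counter
from itertools import permutations

def cyclic_triads(wins):
    """Return the 3-cycles (a beats b, b beats c, c beats a) as frozensets of models."""
    nodes = {n for edge in wins for n in edge}
    cycles = set()
    for a, b, c in permutations(sorted(nodes), 3):
        if (a, b) in wins and (b, c) in wins and (c, a) in wins:
            cycles.add(frozenset((a, b, c)))
    return cycles

def filter_comparisons(wins):
    """Assumed rule: discard comparisons whose two models share a 3-cycle."""
    cycles = cyclic_triads(wins)
    return {(w, l) for (w, l) in wins if not any({w, l} <= cyc for cyc in cycles)}

def rank_by_wins(wins, models):
    """Stand-in ranking: sort by number of wins, ties broken alphabetically."""
    counts = Counter(w for w, _ in wins)
    return sorted(models, key=lambda m: (-counts[m], m))

# Toy outcomes as (winner, loser): A > B > C > A is a cycle; D loses to everyone.
wins = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D"), ("C", "D")}
kept = filter_comparisons(wins)
ranking = rank_by_wins(kept, ["A", "B", "C", "D"])
```

After filtering, only the three comparisons against D remain, so A, B, and C tie on wins and D ranks last. In the framework described above, each perturbed prompt variant would contribute comparison outcomes of this kind.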

2606.17861 2026-06-17 cs.CL New submission

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

GameCraft-Bench:智能体能否在真实游戏引擎中端到端构建可玩游戏?

Tongxu Luo, Rongsheng Wang, Jiaxi Bi, Chenming Xu, Zhengyang Tang, Jianlong Chen, Juhao Liang, Ke Ji, Shuqi Guo, Yuhao Du, Fan Bu, Wenyu Du, Xiaotong Zhang, Kyle Li, Shaobo Wang, Linfeng Zhang, Yuxuan Liu, Xin Lai, Chenxin Li, Yiduo Guo, Zhexin Zhang, Xinyuan Wang, Tianyi Bai, Ziniu Li, Benyou Wang

Affiliations * The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute; Hunyuan Team, Tencent; USTB; DualverseAI; SJTU; NUS

AI summary Introduces GameCraft-Bench, a benchmark that evaluates whether coding agents can generate playable games end-to-end in the Godot engine; the strongest agent scores only 41.46%.

Details
AI-generated Chinese abstract

游戏生成是编码智能体的新兴应用,要求模型将自然语言规范转化为可玩的交互系统。与传统编码任务不同,游戏生成发生在游戏引擎内,脚本、场景、资源、渲染和运行时交互必须共同产生连贯的游戏体验。我们将端到端游戏生成形式化为产生完整游戏制品的问题,该制品通过目标环境中可观察的玩家-游戏交互实现规范。我们认为评估这一设置需要三个必要条件:引擎接地、制品完整性和交互验证。我们提出一个交互接地评估框架,通过重放演示和基于评分标准的多模态评判来评估可执行游戏玩法。我们将该框架实例化为GameCraft-Bench,一个包含15个游戏家族共140个Godot任务的基准。对前沿编码智能体的评估表明,端到端游戏生成仍然极具挑战性:最强智能体仅达到41.46%,大多数智能体得分低于40%。进一步分析显示,虽然智能体经常实现可识别的机制,但它们在提供具有足够内容、功能性视觉反馈和连贯呈现的完整游戏方面存在困难。演示、代码和数据见 https://tongxuluo.github.io/gamecraft-bench-website。

English abstract

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.

2606.17905 2026-06-17 cs.CL New submission

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

ChLogic: 评估中文表达中逻辑推理的鲁棒性

Peixian Zhou, Yuxu Chen, Chaorui Zhang, Wei Han, Bo Bai, Xueyan Niu

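Affiliations * College of Mathematics, Sichuan University; Theory Lab, 2012 Labs, Huawei Technologies Co., Ltd

AI summary Introduces ChLogic, an English-Chinese aligned benchmark built from formal logical templates that tests the robustness of logical reasoning across English and diverse Chinese expressions; it finds an English-Chinese performance gap, and the effect of back-translation varies by dataset and model.

Details
AI-generated Chinese abstract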

大型语言模型在标准化逻辑推理基准上表现越来越好,但这种能力在英语之外是否保持鲁棒尚不清楚。我们提出ChLogic,一个英中对照基准,测试当相同的潜在逻辑结构用英语和多种中文表层实现表达时,模型是否保持逻辑推理性能。该基准基于形式逻辑模板构建,包含三个数据集:(i) 通用对照集,源自9个模板家族的60个通用命题;(ii) 困难对照集,源自40个困难问题;(iii) 仅中文集,涵盖15种语言特有现象类型。每个对照项将一个英文参考表达式与五个中文实现配对。在Qwen3、Ministral和GLM模型上的实验揭示了持续的英中性能差距。从标准中文回译成英语通常能提升通用对照集上的性能,但对困难对照集产生混合效果,Qwen3-32B和GLM-5.1在翻译后表现更差。这些结果表明,中文表层实现、翻译伪影和模型特定行为共同影响多语言逻辑推理。总体而言,ChLogic为多语言推理的鲁棒性提供了有用的压力测试。

English abstract

Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English--Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three data sets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English--Chinese performance gap. Back-translation from standard Chinese into English often improves performance on the General aligned set, but produces mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. Overall, ChLogic provides a useful stress test for the robustness of multilingual reasoning.

2606.18203 2026-06-17 cs.CL cs.AI New submission

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

RubricsTree: 面向个人健康代理在健康记忆与医疗技能上的可扩展且不断演进的开放式评估

Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

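Affiliations * Google Research; University of Illinois Chicago

AI summary Introduces the RubricsTree framework, which uses an expert-aligned hierarchical taxonomy of over 100 atomic Boolean rubrics and context-aware adaptive routing for scalable, auditable, and evolving open-ended evaluation, yielding relative gains of up to about 66% on HealthBench.

Details
AI-generated Chinese abstract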

基于LLM的个人健康代理利用用户健康(传感器)指标,为缓解全球医疗资源获取不均提供了有希望的途径。然而,大规模临床部署仍受限于开放式评估瓶颈:医生标注可靠但成本高且不可扩展,而LLM作为评判者的评估虽可扩展但主观、不一致,且有时临床对齐不佳。我们引入了RubricsTree,一个可扩展的评估框架,具有专家对齐的层次化分类法,包含超过100个原子级、临床可验证的布尔规则,这些规则通过迭代的人机协同策展协议(由经验丰富的医生领导的专家小组)从4000个真实用户查询的洞察中演化而来。一个上下文感知的自适应路由器每查询仅激活相关的自动加权规则子集,提供可扩展评估所需的吞吐量,同时保持专家对齐的质量。通过系统的元评估,我们展示了RubricsTree:(i) 在具有挑战性的开放式查询上,专家对齐程度显著超过强大的大规模评估基线;(ii) 可靠地惩罚上下文退化的响应;(iii) 当用作结构化指令、文本反馈或性能优化的训练奖励时,在HealthBench上为Gemini、GPT和Qwen模型系列带来高达约66%的相对提升。因此,RubricsTree为产品级个人健康AI的持续优化提供了可扩展、可审计且不断演进的评估基础设施。

English abstract

LLM-empowered personal health agents that use user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically verifiable Boolean rubrics, evolved from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides the scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.
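The combination of atomic Boolean rubrics with a router that activates a weighted subset per query can be sketched as follows. This is an illustrative toy, not RubricsTree itself: the topic labels, weights, keyword checks, and the topic-matching router are all assumptions, and every name below is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rubric:
    name: str
    topic: str                    # hypothetical taxonomy node
    weight: float                 # hypothetical importance weight
    check: Callable[[str], bool]  # Boolean verdict on a response

def route(rubrics, query_topics):
    """Toy router: keep only rubrics whose topic is relevant to the query."""
    return [r for r in rubrics if r.topic in query_topics]

def score(response, active):
    """Weighted fraction of the active rubrics that the response satisfies."""
    total = sum(r.weight for r in active)
    if total == 0:
        return 0.0
    return sum(r.weight for r in active if r.check(response)) / total

rubrics = [
    Rubric("mentions_sleep_data", "sleep", 2.0, lambda s: "sleep" in s.lower()),
    Rubric("advises_clinician", "safety", 1.0, lambda s: "doctor" in s.lower()),
    Rubric("mentions_steps", "activity", 1.0, lambda s: "steps" in s.lower()),
]
active = route(rubrics, {"sleep", "safety"})
result = score("Your sleep averaged 6 hours this week.", active)
```

Because each rubric returns a plain Boolean, the per-rubric verdicts can be logged alongside the aggregate score, which is what makes a scheme of this kind auditable.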

2606.18222 2026-06-17 cs.CL cs.DL New submission

Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses

Darshana Graph:用于比较印度哲学的平行注释语料库,附文体计量与探索性图分析

Joy Bose

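Affiliations * Independent Researcher, Bangalore, India

AI summary Builds a parallel commentary corpus of Indian philosophy with over 125,000 records, of which roughly 8,500 align root verses across 18 commentators, and analyzes argumentative style and conceptual relationships via stylometry and a constrained large language model pipeline, revealing patterns of disagreement between schools.

Comments 12 pages, 1 figure. Open-source code available at https://github.com/joyboseroy/darshana-graph and dataset at https://huggingface.co/datasets/joyboseroy/darshana-graph

Details
AI-generated Chinese abstract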

我们介绍了Darshana Graph,一个包含超过12.5万条文本记录的语料库,涵盖古典印度教、佛教和耆那教哲学传统,这些记录来自公共领域和开放许可的翻译,包括《薄伽梵歌》、《梵经》、主要《奥义书》、《巴利经典》和核心耆那教经典。其独特贡献在于一个结构独特的子集,包含约8500条印度教和耆那教记录,其中相同的根本颂或经句与代表吠檀多五个学派及其他见(darshanas)的十八位历史注释者对齐,从而能够直接比较独立解释传统如何解读相同的源材料。据我们所知,没有其他公开资源提供如此规模的跨注释者对齐。我们展示了基于该语料库的两项分析。首先,一种无需机器学习的透明文体计量比较,通过经典引用密度、明确反驳率和句子复杂度来衡量论证风格。它发现引用密度与反驳率之间存在中等负相关,在相关教义谱系的三位注释者中反驳率显著增加,并且在巴利经典内部存在可测量的体裁层面差异。其次,我们描述了一个受约束的大语言模型管道,该管道使用预定义的关系词汇和确定性事后验证来提取概念之间的类型化哲学关系。生成的图揭示了跨学派分歧模式,同时也揭示了重要的提取局限性,包括独立基于嵌入的分析与图派生结果不一致的情况。我们发布了完整的语料库、提取的关系图以及所有源代码。

English abstract

We introduce Darshana Graph, a corpus of over 125,000 text records spanning classical Hindu, Buddhist, and Jain philosophical traditions, drawn from public-domain and openly licensed translations of sources including the Bhagavad Gita, Brahma Sutras, principal Upanishads, the Pali Canon, and core Jain texts. Its distinctive contribution lies in a structurally unique subset of roughly 8,500 Hindu and Jain records in which the same root verse or sutra is aligned across eighteen historical commentators representing five schools of Vedanta and other darshanas, enabling direct comparison of how independent interpretive traditions read identical source material. To our knowledge, no publicly available resource provides comparable cross-commentator alignment at this scale. We present two analyses built on this corpus. First, a transparent stylometric comparison requiring no machine learning measures argumentative style through scriptural citation density, explicit refutation rate, and sentence complexity. It finds a moderate negative correlation between citation density and refutation rate, a marked increase in refutation rate across three commentators in a related doctrinal lineage, and measurable genre-level differences within the Pali Canon itself. Second, we describe a constrained large language model pipeline that extracts typed philosophical relationships between concepts using a predefined relation vocabulary and deterministic post-hoc validation. The resulting graph surfaces cross-school disagreement patterns while also revealing important extraction limitations, including cases where an independent embedding-based analysis disagrees with the graph-derived findings. We release the full corpus, extracted relationship graph, and all source code.
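The stylometric comparison in this abstract rests on per-commentator rates and a correlation between them. The sketch below shows the arithmetic only. The marker phrase, the sentence splitting on periods, and the toy numbers are assumptions for illustration and are not taken from the corpus or the paper's code.

```python
import math

def rate_per_sentence(text, markers):
    """Occurrences of any marker phrase per sentence (sentences split on '.')."""
    sentences = [s for s in text.split(".") if s.strip()]
    hits = sum(text.lower().count(m) for m in markers)
    return hits / len(sentences)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical marker phrase and toy text.
sample = "As the scripture says, X. This view is refuted. As the scripture says, Y."
citation_density = rate_per_sentence(sample, ["as the scripture says"])

# Invented per-commentator values, chosen so the correlation is perfectly negative.
citation = [1.0, 2.0, 3.0]
refutation = [3.0, 2.0, 1.0]
r = pearson(citation, refutation)
```

A negative value of `r` corresponds to the pattern the abstract reports, where commentators who cite scripture more densely refute opponents less often.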

2606.18237 2026-06-17 cs.CL cs.AI cs.LG New submission

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

ReproRepo: 利用 GitHub 仓库问题扩展可重复性审计

Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah, Tim Dettmers, Yiming Yang, Ameet Talwalkar

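Affiliations * School of Computer Science, Carnegie Mellon University; Datadog

AI summary Proposes the ReproRepo framework, which uses GitHub issues as a supervision signal to evaluate reproducibility on 1,149 papers, and finds that Codex with GPT-5.5 surfaces a semantically related reproduction blocker for about 90% of papers.

Details
AI-generated Chinese abstract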

从论文和已发布代码中复现研究结果对科学进步至关重要。现有工作引入了基准测试来评估 LLM 代理是否能协助可重复性,但由于数据整理和评估需要大量人工努力,这些基准难以扩展。我们提出了 ReproRepo,一个可扩展的可重复性评估框架,利用人类提出的 GitHub issues 作为真实复现障碍的自然监督信号。我们在来自主要会议的 1149 篇近期机器学习论文上实例化 ReproRepo,并评估了四种前沿模型代理配置。我们的结果表明,即使不执行代码,LLM 代理也能从论文-仓库对中识别出许多现实世界的可重复性问题:我们研究中的最佳代理,即带有 GPT-5.5 的 Codex,为研究中约 90% 的论文揭示了至少一个语义相关的人类报告的障碍。进一步分析表明,代理在揭示可见故障和识别正确语义区域方面特别有效,但在精确定位方面可能仍不足。ReproRepo 可作为未来在真实世界可重复性审计中评估 LLM 代理的可重用、可扩展框架。我们的代码发布在 https://github.com/LithiumDA/ReproRepo。

English abstract

Reproducing research results from papers and released code is central to scientific progress. Existing work has introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but these are difficult to scale because they rely on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective at surfacing visible failures and identifying the right semantic region, but may still fall short in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.

2606.17339 2026-06-17 cs.AI cs.CL cs.SD Cross-list

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

SpeechDx: 面向临床语音AI的多任务基准

Sejal Bhalla, Larry Kieu, Aina Merchant, Eyal de Lara, Alex Mariakakis

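Affiliations * University of Toronto

AI summary Introduces the SpeechDx benchmark, covering 12 datasets and 27 tasks organized by stage of speech production (conceptualization, formulation, articulation), and evaluates 12 audio encoders; large-scale speech models perform best, but no representation yet generalizes reliably.

Details
AI-generated Chinese abstract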

语音通过同时涉及神经、运动、呼吸和发声系统,为健康提供了一个独特的窗口。当前的临床语音AI方法主要通过孤立的特定疾病研究取得进展,导致结果难以比较,泛化能力难以评估。我们引入了SpeechDx,这是一个大规模的临床语音AI基准,涵盖12个数据集和27个任务,涉及多种健康状况。为了能够基于共享的临床机制进行评估,SpeechDx根据任务所破坏的语音产生阶段(概念化、公式化和发音)来组织任务。该基准通过包含有限标注数据的任务以及跨多个数据集评估同一健康状况来测试泛化能力,从而区分有临床意义的模式与数据集伪影。我们系统评估了12个最先进的音频编码器在所有任务以及零样本跨条件迁移下的表现。结果表明,大规模语音模型代表了最强的整体基线,领域特定模型仅在紧密匹配的任务上提升性能,而当前没有任何表示能在临床语音领域可靠泛化。SpeechDx建立了一个共享评估框架,用于追踪通用临床语音表示的进展。

English abstract

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.

2606.17698 2026-06-17 cs.AI cs.CL Cross-list

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

EComAgentBench:在分布式隐藏意图的长时任务上基准测试购物代理

Zeyao Du, Tong Li, Haibo Zhang

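Affiliations * Shopee

AI summary Introduces EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products, in which an agent must uncover hidden intent from a visible query, a tool-gated profile, and scripted clarification, verify candidate products, and commit to a final choice within 100 tool calls; typed, source-tagged rubrics attribute each failure to a requirement and its source.

Details
AI-generated Chinese abstract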

随着基于LLM的购物代理进入生产环境,现有基准未能捕捉购物者需求的出现方式:隐含在查询中、记录在配置文件中,或仅在提出正确问题时才揭示。提前暴露全部意图并仅对最终选择评分的基准既无法提出这种长时挑战,也无法解释代理遗漏了哪个需求。为填补这一空白,我们引入了EComAgentBench,一个基于真实亚马逊产品和评论的662个任务的基准。每个任务将这些需求分散在可见查询、工具门控配置文件和脚本化澄清中;代理必须揭示隐藏意图,根据属性和评论证据验证候选产品,并在100次工具调用内提交单个产品。此外,类型化、源标记的评分规则对每个任务进行评分,将每个失败归因于一个需求及其来源。构建过程自动化且可靠,每个答案在生成任何文本之前已在代码中固定,每个样本都经过验证。我们对七个模型的评估显示,即使最强的模型也仅达到57.1%的整体准确率,并且评分规则的满足度从可见源到隐藏源逐渐下降。总体而言,我们相信EComAgentBench将作为一个可复现的基础,推动购物代理从单查询搜索向长时可靠辅助发展。

English abstract

As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To address this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover hidden intent, verify candidates against attributes and review evidence, and commit to a single product within 100 tool calls. Moreover, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source. Construction is automated yet reliable, with every answer fixed in code before any text is generated and every sample validated. Our evaluation of seven models reveals that even the strongest attains only 57.1% overall accuracy, and rubric satisfaction degrades from visible to hidden sources. Overall, we believe EComAgentBench will serve as a reproducible foundation for moving shopping agents from single-query search toward dependable assistance over long horizons.

2606.17799 2026-06-17 cs.SE cs.AI cs.CL Cross-list

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

立场:编程基准与智能体软件工程不一致

Maria I. Gorinova, Macey Baker, Amy Heineike, Maksim Shaposhnikov, Rob Willoughby, Dru Knox

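Affiliations * Tessl

AI summary Argues that current coding benchmarks have three problems in the agent era: they conflate the model with the rest of the harness, grading against a single reference solution penalizes valid alternatives, and the lack of component-level signal makes iteration difficult; it calls for redesigning benchmarks to align with agentic software engineering.

Details
AI-generated Chinese abstract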

编程智能体已成为软件工程的主要模式,但我们用于比较它们的基准是在智能体时代之前设计的:它们将模型、框架和环境合并为一个单一的端到端分数,通常针对一个参考答案进行计算,没有提供用于迭代的组件级信号。我们认为当前的编程基准与智能体软件工程不一致。在实践中,编程智能体不是一个模型:它是一个系统框架——由模型、框架、上下文、环境和反馈信号组成的复合体,其中任何一个都可能使基准分数移动与相邻模型代际之间相当的幅度。我们讨论了三个症状:(i) 基准分数混淆了模型与框架的其余部分;(ii) 针对单一参考答案评分惩罚了同样有效的替代方案;(iii) 缺乏单个框架组件级别的信号使得端到端系统分数难以迭代。

English abstract

Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness -- a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the harness; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual harness components makes the end-to-end system score difficult to iterate on.

2606.17819 2026-06-17 cs.SE cs.AI cs.CL Cross-list

A Framework for Evaluating Agentic Skills at Scale

大规模评估智能体技能的框架

Maksim Shaposhnikov, Nicolas Fortuin, Simon Stipcich, Maria I. Gorinova, Amy Heineike, Rob Willoughby

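Affiliations * Tessl, London, United Kingdom

AI summary Presents an evaluation framework that builds realistic tasks and scoring rubrics to evaluate 500 real-world skills at scale across 19 agent-model configurations, finding that models differ markedly in how closely they follow skill instructions and that skills significantly change model behavior.

Details
AI-generated Chinese abstract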

智能体技能——结构化、可重用的知识工件,增强LLM智能体能力——已在工业界迅速采用,但其跨领域影响以及在商业和开源模型中的使用仍未得到充分研究,并且缺乏可复用的方法来评估单个技能。在这项工作中,我们提出了一个评估框架,允许技能作者构建真实任务,以严格评估技能中对他们最重要的方面,并通过解决这些任务来估计技能效用。此外,我们将评估方法大规模应用于500个真实技能,生成了1000个源自技能内容的任务,以及指令遵循和目标完成评分标准。使用这些指标,我们评估了19种智能体模型配置(包括专有和开源模型)在任务上的表现。我们的结果表明,模型在遵循技能中编码的指令方面差异很大,导致其性能提升存在显著差异。此外,我们表明,与无技能设置相比,访问技能显著改变了模型行为,为将主观工作流编码到LLM智能体中提供了一种重要机制。我们发布了评估数据集,以支持未来关于智能体技能的工作。

英文摘要

Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks to rigorously assess the aspects of a skill that matter most to them, and that estimates skill utility by solving those tasks. Further, we apply our evaluation approach at scale to 500 real-world skills, generating 1,000 tasks derived from the skills' content, along with instruction-following and goal-completion scoring rubrics. Using these metrics, we evaluate how 19 agent-model configurations, both proprietary and open-source, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in their performance gains. Furthermore, we show that access to a skill significantly changes model behavior compared to the no-skill setup, providing an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.

2606.18060 2026-06-17 cs.AI cs.CL 交叉投稿

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

PseudoBench: 衡量自主研究如何助长伪科学

Xinyang Liao, Lingyu Li, Huacan Liu, Tianle Gu, Yang Yao, Tong Zhu, Yan Teng, Yingchun Wang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Xi’an Jiao Tong University(西安交通大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出PseudoBench基准,通过200个伪科学声明-证据对评估AI代理识别和抵制伪科学的能力,发现当前系统极易生成有说服力的伪科学报告,拒绝率接近零。

Comments 26 pages, 21 figures

详情
AI中文摘要

随着基于大型语言模型的代理进入自主科学研究,它们抵制伪科学的能力变得越来越重要。否则,此类系统可能迅速生成看似合理但具有误导性的研究,污染学术文献并侵蚀对科学的信任。我们提出了PseudoBench,一个对抗性基准,用于评估自主研究系统能否识别和抵制伪科学叙述。PseudoBench包含五个领域的200个精心策划的伪科学声明-证据对,并通过从实验到写作的端到端研究流程评估代理。测试了七个最先进的代理,我们发现当前系统很容易生成与伪科学前提一致的有说服力的报告,拒绝率接近零,最高抵制率仅为27.4%。更强的代理有可能用更复杂的科学语言包装伪科学,增加其表面可信度。这些发现揭示了助长伪科学的惊人能力,呼吁在广泛部署之前进行科学对齐。

英文摘要

As Large Language Model based agents enter autonomous scientific research, their ability to resist pseudoscience becomes increasingly important. Otherwise, such systems may rapidly generate plausible yet misleading studies that contaminate academic literature and erode trust in science. We present PseudoBench, an adversarial benchmark for evaluating whether agentic auto-research systems can identify and resist pseudoscientific narratives. PseudoBench contains 200 curated pseudoscientific claim-evidence pairs across five domains and evaluates agents through an end-to-end research pipeline from experiments to writing. Testing seven state-of-the-art agents, we find that current systems readily produce persuasive reports that align with pseudoscientific premises with near-zero refusal rates and the highest resistance of only 27.4%. Stronger agents risk packaging pseudoscience in more sophisticated scientific language, increasing its apparent credibility. These findings reveal an alarming capacity to fuel pseudoscience, calling for scientific alignment before widespread deployment.

2606.18158 2026-06-17 cs.CY cs.AI cs.CL 交叉投稿

The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act

欧盟法律自动化中的测量差距:欧盟AI法案下教义性法律推理的基准测试

Michèle Finck

发表机构 * Chair of Law and Artificial Intelligence and Director, CZS Institute for Artificial Intelligence and Law, University of Tübingen(法律与人工智能教授、人工智能与法律研究所主任,图宾根大学)

AI总结 针对当前缺乏评估大型语言模型进行教义性法律推理的基准,提出该能力对满足欧盟AI法案中“适当准确性”要求至关重要。

详情
AI中文摘要

大型语言模型现在能够生成至少中等质量的法律文本,但现有的基准无法评估它们是否执行教义性法律推理——这是法律工作的解释核心,而非大多数当前法律AI评估所衡量的辅助性、准法律任务。这一测量差距不仅是方法论的,也是法律上的:欧盟AI法案将“适当准确性”作为司法领域使用高风险AI的约束性要求,但如果没有该领域缺乏的教义性推理基准,该要求就无法获得操作内容。

英文摘要

Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the ancillary, paralegal tasks that most current legal-AI evaluations measure. This measurement gap is not only methodological but legal: the EU AI Act makes "appropriate accuracy" a binding requirement for high-risk AI used in the judicial domain, yet that requirement cannot acquire operational content without the very doctrinal-reasoning benchmark the field lacks.

2412.10139 2026-06-17 cs.CL 版本更新

TACOMORE: Exploring a replicable prompting protocol for LLM-assisted corpus analysis

TACOMORE: 探索一种可复现的提示协议用于LLM辅助语料库分析

Bingru Li, Han Wang, Nicholas Groom

发表机构 * Department of Linguistics and Communication, University of Birmingham(伯明翰大学语言学与传播系) Department of Information Engineering and Computer Science, University of Trento(特伦托大学信息工程与计算机科学系) Institute of Foreign Languages and Cultures, University of Tartu(塔尔图大学外国语言与文化研究所)

AI总结 提出TACOMORE框架,通过结构化提示将LLM从通用概率预测转向基于语料共现模式的推理,提升关键词、搭配和索引行分析的准确性与可复现性,但幻觉问题仍需人工验证。

详情
AI中文摘要

随着语料库语言学不断扩展,研究者面临日益增长的方法论瓶颈:虽然计算工具可以轻松统计数十亿词,但这些数据的定性解释仍然是一个缓慢且劳动密集型的人工任务。大型语言模型(LLM)提供了一种有前景的自动化方法,然而其整合到该领域常因黑箱不可预测性和缺乏可复现性而受阻。本研究引入TACOMORE,一个结构化的提示框架,旨在将临时的AI交互转化为标准化的语言协议。该框架基于四项基本原则(任务、上下文、模型和可复现性),引导LLM超越通用概率预测,将其推理锚定在目标语料库的特定共现模式上。我们将该框架应用于三个核心语料库任务,即关键词、搭配和索引行分析,使用一个开放的COVID-19研究摘要语料库。在测试三个LLM后,我们发现虽然结构化提示提高了准确性和可复现性,但关于幻觉的固有限制仍然存在。本研究为LLM在语料库语言学中的作用提供了批判性视角,强调了它们作为补充工具的潜力,同时突出了人工验证不可替代的角色。

英文摘要

As corpus linguistics continues to scale, researchers are facing a growing methodological bottleneck: while computational tools can easily count billions of words, the qualitative interpretation of these data remains a slow and labor-intensive human task. Large Language Models (LLMs) offer a promising way to automate this process, yet their integration into the field is often hindered by concerns over black-box unpredictability and a lack of replicability. This study introduces TACOMORE, a structured prompting framework designed to transform ad-hoc AI interactions into a standardized linguistic protocol. Built upon four foundational principles (Task, Context, Model, and Replicability), the framework guides LLMs to move beyond generic probability prediction to anchoring their reasoning in the specific co-occurrence patterns of a target corpus. We applied this framework to three core corpus tasks, i.e., the analysis of keywords, collocates, and concordances, using an open corpus of COVID-19 research abstracts. After testing three LLMs, we found that while structured prompting improves accuracy and replicability, inherent limitations regarding hallucination persist. This research offers a critical lens into the role of LLMs in corpus linguistics, highlighting their potential as complementary tools while emphasizing the irreplaceable role of human validation.

2503.07459 2026-06-17 cs.CL cs.AI 版本更新

MedicalAgentsBench for Complex Medical Reasoning: Comparing Internalized Reasoning Models versus Externalized Agent-based Frameworks

MedicalAgentsBench:复杂医学推理基准——比较内化推理模型与外化智能体框架

Yanjun Shao, Xiangru Tang, Jiwoong Sohn, Jiapeng Chen, Yuxuan Liao, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein

AI总结 提出MedicalAgentsBench基准(862个复杂临床问题),比较内化推理模型与外化智能体框架在医学推理中的表现,发现两者效果可叠加,最优组合为o3-mini+MDAgents(准确率35.1%)。

Comments https://github.com/gersteinlab/MedicalAgentsBench

详情
AI中文摘要

复杂医学推理需要在多个推理步骤中整合异质性临床证据。大型语言模型(LLM)目前通过两条途径来完成这类推理:内化推理和外化智能体框架(将问题分解并交由多个LLM协作求解的框架)。为了确定这两条途径是互斥还是互补,我们引入了MedicalAgentsBench,这是一个经过过滤的基准测试,包含862个复杂临床问题,这些题目来自八个医学数据集的并集,经过难度感知筛选和污染筛查。评估了三个内化推理模型(DeepSeek-R1、o1-mini和o3-mini)、七个基础模型和九个外化智能体方法后,我们发现内化和外化方法各自独立地提升了性能,并且它们的益处可以叠加:最高准确率是通过将智能体工作流叠加到内化推理模型上实现的(即o3-mini + MDAgents,准确率35.1%)。帕累托分析表明,这种组合主导了成本-性能前沿;此外,在廉价模型上进行轻量级优化为资源受限环境提供了切入点。我们的基准测试位于 https://github.com/gersteinlab/MedicalAgentsBench。

英文摘要

Complex medical reasoning requires integrating heterogeneous clinical evidence across multiple inference steps. Large language models (LLMs) now approach this through two routes: internalized reasoning and externalized agent scaffolding (frameworks that decompose problems collaboratively amongst multiple LLMs). To determine whether these routes are exclusive or complementary, we introduce MedicalAgentsBench, a filtered benchmark of 862 complex clinical questions drawn from the union of eight medical datasets via difficulty-aware curation and contamination screening. Evaluating three internalized reasoning models (DeepSeek-R1, o1-mini, and o3-mini), seven base models, and nine externalized agent-based methods, we find that internalized and externalized approaches each independently improve performance, and that their benefits compound: the highest accuracy is achieved by layering agent workflows onto an internalized reasoning model (i.e., o3-mini + MDAgents with 35.1%). Pareto analysis shows this combination dominates the cost-performance frontier; moreover, lightweight optimization on inexpensive models offers an entry point for resource-constrained settings. Our benchmark is at https://github.com/gersteinlab/MedicalAgentsBench.

2505.19937 2026-06-17 cs.CL cs.SD eess.AS 版本更新

ALAS: An Automatic Latent Alignment Score for Audio Language Models

ALAS:音频语言模型的自动潜在对齐分数

Pooneh Mousavi, Yingzhi Wang, Mirco Ravanelli, Cem Subakan

AI总结 提出ALAS指标,通过计算音频与文本表示的跨模态余弦相似度,无需训练即可评估语音-LLM的音频-文本对齐质量,揭示模型对齐深度与任务需求的关系。

详情
AI中文摘要

大型语言模型(LLM)被扩展为语音-LLM,它们学习的音频-文本对齐质量影响大多数下游口语理解(SLU)行为。然而,尽管融合策略不断增长,但没有标准方法来衡量语音-LLM内部如何将音频帧与文本标记绑定。我们引入ALAS(自动潜在对齐分数),一种模型和任务无关的度量,探测LLM的逐层隐藏状态,将音频和文本表示之间的跨模态余弦相似度与Whisper导出的参考进行评分。ALAS仅需要冻结的前向传递和现成的ASR参考,无需训练或拟合分类器,并校准到可解释的均匀基线,可在任务间比较。将ALAS应用于四个开源语音-LLM(AF3、Qwen2-Audio、Qwen-Omni、SALMONN),在情感识别(IEMOCAP)、开放式SQA(LibriSQA)和多选音频理解(MMAU-speech)上,我们发现对齐的深度和强度反映了每个模型的音频编码器设计以及任务的声学与语义需求,并且ALAS跟踪但不重复任务准确性,暴露了那些得分高但未真正基于音频的模型。我们将ALAS作为开源库发布,以便从业者探测自己的语音-LLM或在新任务上尝试。

英文摘要

Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio--text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growing number of fusion strategies, there is no standard way to measure how well a Speech-LLM internally binds audio frames to text tokens. We introduce ALAS (Automatic Latent Alignment Score), a model- and task-agnostic metric that probes the LLM's per-layer hidden states, scoring the cross-modal cosine similarity between audio and text representations against a Whisper-derived reference. ALAS needs only a frozen forward pass and an off-the-shelf ASR reference, with no training or fitted classifier, and is calibrated to an interpretable uniform baseline comparable across tasks. Applying ALAS to four open-source Speech-LLMs (AF3, Qwen2-Audio, Qwen-Omni, SALMONN) across emotion recognition (IEMOCAP), open-ended SQA (LibriSQA), and multi-choice audio understanding (MMAU-speech), we find that the depth and strength of alignment reflect each model's audio-encoder design and the acoustic-versus-semantic demands of the task, and that ALAS tracks but does not duplicate task accuracy, exposing models that score well without genuinely grounding in the audio. We release ALAS as an open-source library so that practitioners can probe their own Speech-LLMs or try it on new tasks.
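The quantity this abstract builds on, cross-modal cosine similarity between audio and text hidden states at one layer, can be sketched as follows. This is a minimal illustration and not the released ALAS library: the function names, the boolean `reference_mask` standing in for the Whisper-derived alignment, and the choice to report the gap to the all-pairs mean are assumptions made here, and all hidden-state vectors are assumed to be non-zero.

```python
import numpy as np

def cosine_similarity_matrix(audio_states, text_states):
    """Cosine similarity between every audio frame and every text token.

    audio_states: (num_frames, hidden_dim) hidden states from one layer
    text_states:  (num_tokens, hidden_dim) hidden states from the same layer
    """
    a = audio_states / np.linalg.norm(audio_states, axis=1, keepdims=True)
    t = text_states / np.linalg.norm(text_states, axis=1, keepdims=True)
    return a @ t.T  # shape (num_frames, num_tokens)

def alignment_gap(audio_states, text_states, reference_mask):
    """Mean similarity on reference-aligned pairs minus the mean over all pairs.

    reference_mask is a hypothetical boolean (num_frames, num_tokens) array
    marking which frame/token pairs an external ASR alignment pairs together.
    """
    sim = cosine_similarity_matrix(audio_states, text_states)
    return float(sim[reference_mask].mean() - sim.mean())

# Toy layer: three audio frames, two text tokens, two hidden dimensions.
audio = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
text = np.array([[1.0, 0.0], [0.0, 1.0]])
mask = np.array([[True, False], [True, False], [False, True]])
gap = alignment_gap(audio, text, mask)
```

Running the same function on each layer's hidden states would give a per-layer profile; a gap near zero means the aligned pairs are no more similar than arbitrary pairs.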

2511.01650 2026-06-17 cs.CL cs.AI cs.LG 版本更新

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

EngTrace:工程推理可验证过程监督的符号基准

Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Fan Zhang, Veselin Stoyanov, Preslav Nakov, Zhuohan Xie

AI总结 提出EngTrace符号基准,包含1350个参数化测试用例,通过两阶段可验证评估框架(分层协议+AI仲裁)检验中间推理轨迹与最终答案,揭示数值精度与轨迹保真度的权衡。

Comments 33 pages, includes figures and tables; introduces the EngTrace benchmark

详情
AI中文摘要

大型语言模型(LLM)正越来越多地进入由严格定量标准和不变物理定律约束的专业化、安全关键的工程工作流程,因此对其推理能力进行严格评估势在必行。然而,现有的基准(如MMLU、MATH和HumanEval)评估的是孤立的认知技能,未能捕捉工程中核心的基于物理的推理,其中科学原理、定量建模和实际约束必须融合。为了实现工程中的可验证过程监督,我们引入了EngTrace,这是一个基于90个参数化模板构建的符号基准,每个模板生成独特的、抗污染的实例,涵盖三个主要工程分支、九个核心领域和20个不同领域,产生1350个测试用例,以压力测试跨多样物理场景的泛化能力。超越结果匹配,我们引入了一个可验证的两阶段评估框架,该框架使用分层协议通过自动化程序检查和异构AI仲裁来验证中间推理轨迹以及最终答案。我们对27个领先LLM的评估揭示了数值精度与轨迹保真度之间的明显权衡,识别出一个复杂性悬崖,其中抽象数学预训练未能转化为高级工程任务所需的整合推理。

英文摘要

Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning capabilities imperative. However, existing benchmarks such as MMLU, MATH, and HumanEval assess isolated cognitive skills, failing to capture the physically grounded reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To enable verifiable process supervision in engineering, we introduce EngTrace, a symbolic benchmark built on 90 parameterized templates, each generating unique, contamination-resistant problem instances, spanning three major engineering branches, nine core domains, and 20 distinct areas, yielding 1,350 test cases that stress-test generalization across diverse physical scenarios. Moving beyond outcome matching, we introduce a verifiable two-stage evaluation framework that uses a tiered protocol to validate intermediate reasoning traces alongside final answers through automated procedural checks and a heterogeneous AI Tribunal. Our evaluation of 27 leading LLMs reveals a distinct trade-off between numeric precision and trace fidelity, identifying a complexity cliff where abstract mathematical pre-training fails to translate into the integrative reasoning required for advanced engineering tasks.
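The idea of a parameterized template that yields fresh instances together with a computable reference answer can be sketched as below. The cantilever template, the parameter ranges, and the tolerance check are illustrative assumptions and are not taken from EngTrace; the sketch covers only final-answer checking, not validation of reasoning traces.

```python
import math
import random

def make_cantilever_instance(seed):
    """Generate one problem instance from a hypothetical parameterized template.

    Template: tip deflection of a cantilever under an end load,
    delta = F * L**3 / (3 * E * I).
    """
    rng = random.Random(seed)
    # Round the sampled values first so the question text and the
    # reference answer use exactly the same numbers.
    force = round(rng.uniform(100.0, 1000.0), 1)        # end load, N
    length = round(rng.uniform(1.0, 3.0), 3)            # beam length, m
    inertia = float(f"{rng.uniform(1e-6, 1e-5):.3e}")   # second moment of area, m^4
    modulus = 200e9                                      # Young's modulus, Pa
    question = (
        f"A cantilever of length {length} m carries an end load of {force} N. "
        f"With E = 200 GPa and I = {inertia:.3e} m^4, find the tip deflection in metres."
    )
    answer = force * length ** 3 / (3 * modulus * inertia)
    return {"question": question, "answer": answer}

def check_answer(instance, predicted, rel_tol=0.01):
    """Accept a prediction within a relative tolerance of the reference answer."""
    return math.isclose(predicted, instance["answer"], rel_tol=rel_tol)

instance = make_cantilever_instance(seed=0)
```

Because every instance is derived from a seed, the same test set can be regenerated exactly, while a new seed gives numbers that are unlikely to appear verbatim in any training corpus.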

2601.04574 2026-06-17 cs.CL 版本更新

FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

FeedEval: 面向教学法的LLM生成作文反馈评估

Seongyeub Chu, Jongwoo Kim, Munyong Yi

发表机构 * Graduate School of Data Science, KAIST(数据科学研究生院,韩国科学技术院) Department of Industrial & Systems Engineering, KAIST(工业与系统工程系,韩国科学技术院)

AI总结 提出FeedEval框架,沿特异性、帮助性和有效性三个教学维度评估LLM生成的作文反馈,通过专用评估器筛选高质量反馈,提升下游评分和修订效果。

详情
AI中文摘要

超越数值分数预测,近期自动作文评分研究日益强调生成提供理由和可操作指导的高质量反馈。为减轻专家标注的高成本,先前工作通常依赖LLM生成的反馈来训练作文评估模型。然而,此类反馈常未经明确质量验证即被纳入,导致下游应用中噪声的传播。为解决这一局限,我们提出FeedEval,一个基于LLM的框架,用于沿三个教学维度(特异性、帮助性和有效性)评估LLM生成的作文反馈。FeedEval采用维度专用的LLM评估器,这些评估器在本研究策划的数据集上训练,以评估多个候选反馈并选择高质量反馈供下游使用。在ASAP++基准上的实验表明,FeedEval与人类专家判断高度一致,且使用FeedEval筛选的高质量反馈训练的作文评分模型取得了更优的评分性能。此外,使用小型LLM进行的修订实验表明,FeedEval识别的高质量反馈能导致更有效的作文修订。我们在以下网址发布代码和策划的数据集:https://github.com/BBeeChu/FeedEval.git。

英文摘要

Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We release our code and curated datasets at: https://github.com/BBeeChu/FeedEval.git.

2602.13139 2026-06-17 cs.CL 版本更新

OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

OpenLID-v3:提高近亲语言识别精度的经验报告

Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer

AI总结 针对现有语言识别工具对近亲语言和噪声区分困难的问题,通过增加训练数据、合并问题语言变体簇和引入噪声标签扩展OpenLID分类器,提出OpenLID-v3,在多个基准上提升精度。

Comments VarDial'26 workshop at the EACL 2026 conference

详情
AI中文摘要

语言识别(LID)是从网络数据构建高质量多语言数据集的关键步骤。现有的LID工具(如OpenLID或GlotLID)通常难以识别近亲语言,也难以区分有效自然语言与噪声,这污染了特定语言子集,尤其是低资源语言。在本工作中,我们通过增加更多训练数据、合并有问题的语言变体簇以及引入一个专门标记噪声的标签来扩展OpenLID分类器。我们将这个扩展系统称为OpenLID-v3,并在多个基准上将其与GlotLID进行评估。在开发过程中,我们重点关注三组近亲语言(波斯尼亚语、克罗地亚语和塞尔维亚语;意大利北部和法国南部的罗曼语变体;以及斯堪的纳维亚语言),并在现有评估数据集不足的地方贡献了新的评估数据集。我们发现集成方法提高了精度,但也显著降低了对低资源语言的覆盖。OpenLID-v3可在 https://huggingface.co/HPLT/OpenLID-v3 获取。

英文摘要

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on https://huggingface.co/HPLT/OpenLID-v3.

2605.25652 2026-06-17 cs.CL cs.CY 版本更新

A Two-Phase Stability Study of LLM Judges and Bar Council Examiners on Thai Bar-Exam Free-Form Essays

LLM评审员与律师协会考官对泰国律师资格考试自由回答论文的两阶段稳定性研究

Pawitsapak Akarajaradwong, Wuttikrai Lertprasertphakorn, Chompakorn Chaksangchaichot, Sarana Nutanong

发表机构 * VISAI AI(VISAI人工智能)

AI总结 通过泰国律师资格考试的自由回答论文评估,研究LLM评审员与人类考官在评分一致性上的不对称性,发现LLM评审员向多数派人类解读收敛,而无法复现少数派人类解读。

详情
AI中文摘要

NLP中的自由形式法律论文评估将专家间评分者稳定性视为单一上限数字,并将LLM评审员与该上限的一致性视为评审员稳定性的证据。我们通过相同输入协议在泰国律师资格考试上检验这两个假设:三名律师协会培训的考官(A、B、C)和一个由26个LLM组成的评审小组,基于相同的四项输入(问题、官方律师协会评分规定、标准答案、考生答案)对同样的15份交叉评分答卷进行评分。主要发现是不对称的。在15个单元格中评分标准对两个轴都作出规定的10个上,所有29名评分者收敛在一个狭窄的区间内:小组一致性是普遍的。在其余5个单元格中,评分标准未规定如何评分一个正确但省略了决定性法定引用的最终答案,人类小组在两个连贯的解读之间分裂(B/C多数在评分标准上限区间,分数6-8;A少数在较低区间,分数1-2)。LLM评审员群体并不对称分裂:26个LLM中有22个在或接近B/C的有争议区间评分,3个位于规定沉默的中间间隙,只有1个(GPT-5.4 Nano)接近A的区间但未一致地在其内评分。在我们的26个评审员小组中,没有任何LLM在有争议的单元格上复现少数派的人类解读。B/C方向的集群跨越了我们测试的每个模型大小、供应商和价格层级。一个仪器化的三个LLM锚定子小组(Claude 4.6 Opus、Gemini 3.1 Pro、GPT-5.4 Pro)携带确定性探针、输入消融和自助法置信区间,并在15个单元格上达到锚定小组α=0.77,而人类小组α=0.36。高LLM小组α反映了系统性地收敛于多数派解读,而不是平衡地复现两种解读;一个通过最大化与人类参考小组的一致性来选择其LLM评审员的基准将必然继承这种不对称性。

英文摘要

Free-form legal essay evaluation in NLP treats expert inter-rater stability as a single ceiling number, and treats LLM-judge agreement with that ceiling as evidence of judge stability. We test both assumptions on the Thai bar examination through an identical-inputs protocol: three Bar Council-trained examiners (A, B, C) and a 26-LLM judge panel score the same 15 cross-graded answers from the same four inputs (question, official Bar Council grading regulation, gold answer, candidate answer). The headline finding is asymmetric. On 10 of 15 cells where the rubric prescribes both axes, all 29 raters converge in a tight band: panel agreement is universal. On the remaining 5 cells where the rubric does not prescribe how to grade a correct final answer that omits a decisive statutory citation, the human panel splits between two coherent readings (B/C majority at the upper rubric band, score 6-8; A minority at the lower band, score 1-2). The LLM judge population does not split symmetrically: 22 of 26 LLMs score in or near B/C's contested band, 3 sit in the regulation-silent middle gap, and only 1 (GPT-5.4 Nano) approaches A's band without consistently scoring within it. Zero LLMs in our 26-judge panel reproduce the minority human reading on the contested cells. The B/C-direction cluster spans every model size, vendor, and price tier we tested. An instrumented three-LLM anchor sub-panel (Claude 4.6 Opus, Gemini 3.1 Pro, GPT-5.4 Pro) carries determinism probes, input ablations, and bootstrap CIs, and reaches anchor panel $\alpha = 0.77$ on the 15 cells against human-panel $\alpha = 0.36$. The high LLM-panel $\alpha$ reflects systematic convergence on the majority reading rather than balanced reproduction of both readings; a benchmark that selects its LLM judge by maximising agreement with a human reference panel will inherit this asymmetry by construction.

2606.09376 2026-06-17 cs.CL 版本更新

Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

精确性不等于忠实度:基于完全神谕的覆盖感知接地生成评估

Juan S. Santillana

发表机构 * Globant

AI总结 针对参考无关忠实度指标仅测量精确性而忽略召回率的问题,提出利用完全神谕(F1赛事和NOAA天气预报)测量覆盖度,并设计结合精确性与覆盖度的综合指标及验证器引导生成方法。

Comments 9 pages. v2: adds Anthropic Claude + 3 additional fine-tuned bases (1B-7B); 6 frontier families x 3 languages. Code https://github.com/vectrayx/precision-is-not-faithfulness Demo https://huggingface.co/spaces/jsantillana/faithful-strategy-engineer-f1

详情
AI中文摘要

参考无关的忠实度指标对照真实事实逐条核验模型给出的每个原子声明,并越来越多地用于评估接地生成。我们表明它们存在一个盲点:它们仅测量精确性——所陈述的声明是否得到支持?——因此奖励弃权,因为模型通过几乎不说什么就可以获得近乎完美的忠实度。我们利用一级方程式(Formula 1)遥测数据使这一点可测量:在该领域中,策略层面的真实事实可以确定性地推导出来,而且关键在于它是完全的,对于每个决策,我们知道所有重要事实的完整集合。这种完整性——在开放领域的忠实度基准中缺失——让我们能够精确测量召回率(相关事实的覆盖度)以及精确性。在一个涵盖157场比赛的7,253个决策实例的多语言(EN/ES/PT)基准上,最精确的前沿模型覆盖不到一半的相关事实,并且按F1分数排名最后,因此要求覆盖度会重新排序系统;同样的效果在第二个完全神谕领域(NOAA天气预报)中再次出现。提示消融实验表明,低覆盖度不是提示不足的产物:明确要求模型彻底并不能缩小差距。我们将忠实度与覆盖度结合成一个单一分数,验证了该指标(受控扰动;无模型正则表达式提取器和跨家族LLM提取器之间的一致性,系统级Spearman 1.0),并给出了一种无需参考即可提高精确性和召回率的验证器引导生成方法。我们发布了基准、结构化注释、指标、基线和交互式演示。

英文摘要

Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision -- are the stated claims supported? -- and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness -- absent in open-domain faithfulness benchmarks -- lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 157 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). Fine-tuning small models (1B-7B) on the complete oracle closes the precision-recall gap entirely (F1 ~0.98), beating every zero-shot frontier system regardless of scale. We pair faithfulness with coverage into a single score, validate the metric (controlled perturbation; agreement across a model-free regex extractor and a cross-family LLM extractor, system-level Spearman 1.0), and give a verifier-guided generation method that improves precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.
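The blind spot described here, that precision alone rewards saying almost nothing, can be made concrete with a small worked example. This is a sketch under stated assumptions rather than the paper's metric: claims are matched to oracle facts by exact string identity, the fact names are invented for illustration, and F1 is used as the combined score because the abstract ranks systems by F1.

```python
def precision_recall_f1(stated_claims, oracle_facts):
    """Score a response's atomic claims against a complete set of relevant facts."""
    stated = set(stated_claims)
    oracle = set(oracle_facts)
    supported = stated & oracle
    # An empty response gets precision 0.0 here; that convention is an assumption.
    precision = len(supported) / len(stated) if stated else 0.0
    recall = len(supported) / len(oracle) if oracle else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical complete oracle for one strategy decision: four facts that mattered.
oracle = {"tyre_age_high", "gap_behind_22s", "safety_car_out", "rain_in_5_laps"}

# A terse answer states one supported fact and nothing else.
terse = precision_recall_f1({"tyre_age_high"}, oracle)

# A thorough answer states three supported facts and one unsupported one.
thorough = precision_recall_f1(
    {"tyre_age_high", "gap_behind_22s", "safety_car_out", "track_temp_low"}, oracle
)
```

The terse answer scores precision 1.0 but recall 0.25 (F1 0.4), while the thorough answer scores 0.75 on all three, so ranking by precision and ranking by F1 put the two answers in opposite order.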

2606.17041 2026-06-17 cs.CL cs.IR 版本更新

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

对Nature Portfolio元分析文章进行LLM代理基准测试

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

发表机构 * Tsinghua University(清华大学)

AI总结 提出MetaSyn数据集,包含442篇专家策划的元分析,用于评估LLM代理在检索-筛选-综合全流程中的表现,发现当前系统在筛选阶段存在严重瓶颈。

Comments 13 pages, 7 figures, preprint for arXiv, dataset and code available at https://github.com/BFTree/MetaSyn

详情
AI中文摘要

元分析是一种要求高的证据综合形式,结合了文献检索、PI/ECO指导的研究选择和统计聚合。其结构化、可验证的工作流程使其成为评估系统科学推理的理想基础,然而现有基准缺乏覆盖完整检索-筛选-综合流程的标准答案(ground truth)。我们引入了MetaSyn,一个包含来自Nature Portfolio期刊的442篇专家策划的元分析的数据集。每个条目将研究问题与PI/ECO标准、包含140k篇PubMed文章的检索语料库、经过验证的阳性研究、主题相似但不符合PI/ECO的硬负样本以及完整的搜索策略和日期范围配对。对十二种流水线配置(九种RAG变体和一种协议驱动的代理)进行基准测试揭示了关键的筛选瓶颈:尽管在K=200时检索上限达到90.9%的召回率,但没有任何系统能找回超过52.7%的标准答案纳入文献。当前的LLM无法可靠地将合格研究与主题相关性相当的PI/ECO不合格干扰项区分开来。阶段归因指标捕捉了系统成功和失败的地方;单一的端到端分数则不能。

英文摘要

Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.
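The contrast between a stage-attributed metric and a single end-to-end score can be sketched as below. The function name, its arguments, and the three-way split are assumptions for illustration and are not MetaSyn's actual metric definitions; the toy IDs are invented.

```python
def stage_recall(gold_ids, retrieved_ids, included_ids, k=200):
    """Split end-to-end recall into a retrieval ceiling and a screening rate.

    gold_ids:      studies included in the expert meta-analysis
    retrieved_ids: ranked candidate list from the retriever
    included_ids:  studies the system finally keeps after screening
    """
    gold = set(gold_ids)
    top_k = set(retrieved_ids[:k])
    reachable = gold & top_k                    # gold studies that survive retrieval
    recovered = reachable & set(included_ids)   # gold studies that also survive screening
    n = len(gold)
    return {
        "retrieval_recall_at_k": len(reachable) / n if n else 0.0,
        "screening_recall": len(recovered) / len(reachable) if reachable else 0.0,
        "end_to_end_recall": len(recovered) / n if n else 0.0,
    }

# Toy run: 10 gold studies, 9 of them retrieved, 5 of those kept by screening.
gold = list(range(1, 11))
retrieved = list(range(1, 10)) + list(range(100, 120))
included = [1, 2, 3, 4, 5, 100]
scores = stage_recall(gold, retrieved, included, k=200)
```

In the toy run the end-to-end recall of 0.5 factors into a retrieval ceiling of 0.9 and a screening recall of 5/9, which locates most of the loss in screening rather than retrieval.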

10. 安全、隐私、公平与可解释NLP 16 篇

2606.17168 2026-06-17 cs.CL 新提交

RepSelect: Robust LLM Unlearning via Representation Selectivity

RepSelect: 通过表示选择性实现鲁棒的大语言模型遗忘

Filip Sondej, Yushi Yang, Adam Mahdi

发表机构 * Independent(独立) University of Oxford(牛津大学)

AI总结 针对现有遗忘方法易被微调或少样本提示逆转的问题,提出RepSelect方法,通过梯度主成分坍塌隔离遗忘集表示,实现深度鲁棒遗忘。

详情
AI中文摘要

使大语言模型(LLMs)深度遗忘特定知识和价值观而不牺牲通用能力仍然是遗忘领域的一个核心挑战。然而,当前方法容易被微调或少样本提示逆转,表明其遗忘仅是浅层的。我们找到了根本原因:现有方法针对与保留集以及微调攻击者恢复的子空间共享的表示,这使得遗忘既破坏通用能力又容易被逆转。我们提出RepSelect(表示选择性),通过在每次更新前坍塌权重梯度的主成分来隔离遗忘集特定的表示,从而保持通用能力完整,同时限制微调可恢复的内容。我们在两个遗忘类别(生物危害知识和虐待倾向)以及四种涵盖密集和混合专家架构的模型家族(Llama 3、Qwen 3.5、Gemma 4 E4B、DeepSeek V2 Lite)上进行评估。与五种流行基线(GradDiff、NPO、SimNPO、RMU、UNDIAL)相比,RepSelect在重新学习后的答案准确性上实现了比最强基线大4-50倍的降低,并且对少样本提示攻击近乎完美鲁棒。因此,针对选择性表示是实现深度鲁棒LLM遗忘的重要一步。

英文摘要

Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. However, current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause: existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.
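One possible reading of "collapsing top principal components of weight gradients before each update" is sketched below. This is an assumption-laden illustration, not the RepSelect implementation: it treats the principal components as the leading singular directions of a single 2-D weight gradient, removes the top `k` of them with an SVD, and applies a plain SGD step. The paper may compute the components differently (for example across a batch of gradients), and the function names are hypothetical.

```python
import numpy as np

def collapse_top_components(grad, k):
    """Remove the top-k singular directions from a 2-D weight gradient."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s = s.copy()
    s[:k] = 0.0            # singular values come sorted largest first
    return (u * s) @ vt    # rebuild the gradient without the removed directions

def sgd_step(weight, grad, lr=1e-3, k=1):
    """One plain SGD update that uses the filtered gradient."""
    return weight - lr * collapse_top_components(grad, k)

# Toy gradient whose dominant direction is the first axis.
grad = np.diag([3.0, 2.0, 1.0])
filtered = collapse_top_components(grad, k=1)
new_weight = sgd_step(np.zeros((3, 3)), grad, lr=0.1, k=1)
```

In the toy case the filtered gradient keeps the two weaker directions and drops the dominant one, so the update leaves the first axis of the weight untouched.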

2606.17478 2026-06-17 cs.CL cs.AI 新提交

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

解码推理型LLM中的隐藏欺骗:用于欺骗审计的激活解释器

Kexin Chen, Yi Liu, Haonan Zhang, Yanhui Li, Xinyu Deng, Dongxia Wang

发表机构 * Zhejiang University(浙江大学) Griffith University(格里菲斯大学)

AI总结 提出STATEWITNESS,一种通过解码目标模型隐藏状态来生成自然语言查询答案和结构化报告的激活解释器,在欺骗检测中平均AUROC达0.916,优于现有方法。

Comments Under review

详情
AI中文摘要

随着LLM获得更强的推理能力,欺骗行为成为一个日益严重的安全问题。现有的欺骗监控器要么对可见文本进行评分,要么从表示向量中导出标量探针分数,几乎没有留下关于为什么响应可疑的可检查证据。我们引入了STATEWITNESS,一种用于欺骗审计的激活解释器。一个独立的解码器读取目标模型的隐藏状态,然后回答自然语言查询或发出关于它们的结构化报告。我们在两个目标推理LLM上评估了STATEWITNESS,涵盖七个欺骗数据集。在相同评估协议下,STATEWITNESS的平均AUROC达到0.916,比最佳黑盒文本监控器相对提升11.6%,比最佳激活探针基线相对提升25.0%。当与现有监控器结合时,STATEWITNESS在简单阈值集成中减少了遗漏的欺骗示例。除了标量检测,解码器还返回查询级答案、模式报告以及令牌级或句子级证据痕迹供人工检查。我们将此接口视为更广泛的可解释性和对齐工具的潜在构建块。

英文摘要

As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.

2606.17506 2026-06-17 cs.CL 新提交

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

通过认知权利评估大语言模型的二阶偏见

Ramaravind Kommiya Mothilal, Terry Jingchen Zhang, Raiyan Ahmed, Zhijing Jin, Shion Guha, Syed Ishtiaque Ahmed

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) EuroSafeAI Max Planck Institute for Intelligent Systems, Tübingen, Germany(马克斯·普朗克智能系统研究所(德国图宾根))

AI总结 提出基于认知权利理论的逻辑推理任务,评估LLM在判断偏见内容时表现出的二阶偏见,发现模型判断存在系统性偏差且能规避安全护栏。

Comments 20 pages, 13 tables, 2 figures

详情
AI中文摘要

对大语言模型社会偏见的评估主要关注模型是否生成或暗示偏见内容。然而,随着LLM越来越多地被用作偏见评判者,它们可能在评估偏见内容时以更微妙的方式表现出社会偏见,而当前方法并未系统性地捕捉这一点。我们称之为二阶偏见:LLM对社会偏见的判断中存在的偏见,并通过一种新颖的、基于哲学推理的任务进行评估。借鉴认知权利理论,我们将偏见概念化为塑造主体理性探究的错位基础知识,并推导出一个逻辑推理任务,让LLM判断偏见文本对谁是可接受的或不可接受的。我们开发了两个简单指标来衡量LLM评判者在没有充分支持的情况下推断人口统计学可接受性的偏见程度,以及这些推断在不同目标群体间的差异。评估开源和闭源模型时,我们发现我们的任务通过揭示模型判断中的偏见来规避安全护栏。它随目标群体系统性地变化,反映了隐性的社会图谱,并展示了模型如何仍然被人口统计标签触发。我们的工作指出了在判断任务中评估LLM偏见的必要性,以及更广泛地,在NLP中采用更具理论基础的偏见评估方法。我们在以下网址发布代码和模型响应:https://github.com/uofthcdslab/second-order-bias。

英文摘要

Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biased content, which current methods do not systematically capture. We call this second-order bias: social bias in an LLM's judgment about social bias, which we evaluate through a novel, philosophically grounded reasoning task. Drawing on entitlement epistemology, we conceptualize bias as misplaced foundational knowledge that shapes an agent's rational inquiry, and derive a logical reasoning task for LLMs to judge to whom a biased text is acceptable or non-acceptable. We develop two simple metrics to measure how biased LLM judges are in inferring demographics for acceptability without sufficient support, and how these inferences vary across groups targeted by biased texts. Evaluating open and closed models, we find that our task evades safety guardrails by surfacing bias in model judgment. It varies systematically across target groups, reflects implicit social maps, and shows how models are still triggered by demographic labels. Our work points to the need for LLM bias evaluation in judgment tasks and broadly, for more theoretically grounded approaches to bias evaluation in NLP. We release our code and model responses at https://github.com/uofthcdslab/second-order-bias.

2606.18062 2026-06-17 cs.CL cs.AI cs.CR cs.HC 新提交

Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond

现实中的安全与隐私提示:用户向LLM提问及LLM如何回应

Hobin Kim, Xiaoyuan Wu, Omer Akgul, Lujo Bauer, Nicolas Christin

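Affiliations * Carnegie Mellon University; RSAC Labs

AI summary Using the WildChat dataset, analyzes the security and privacy questions users ask LLMs, categorizes them, and evaluates the quality and consistency of model responses.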

详情
AI中文摘要

大型语言模型(LLM)被广泛用于满足用户的信息需求;用户向LLM询问天气、提出教育问题,并咨询法律帮助。一个特别未被充分研究的领域是数字安全与隐私(S&P),用户可能寻求LLM的帮助,了解如何保护他们的在线账户或保护计算机免受网络攻击。据我们所知,之前没有研究收集或分析用户向LLM提出的S&P问题;先前关于LLM回答质量的研究依赖于专家撰写的S&P误解或常见问题解答,而非用户查询。利用WildChat(一个从现实环境中收集的320万用户-LLM对话数据集),我们的研究识别出14,727个S&P提示,并将其分为九类,涵盖广泛的S&P主题。从S&P提示中,我们抽样了450个,并进行了主题分析,以描述用户向LLM提出的S&P问题。与主题分析分开,我们整理了270个寻求建议的S&P提示,其中用户询问建议、指导或特定的S&P信息。我们测量了将提示向LLM提出10次时的LLM回答质量和一致性。我们发现,商业LLM优于开放权重模型(GPT 5.5在98%的提示上提供了“足够好”的回答;Llama 4为47%)。然而,在平均获得高质量回答的提示中,商业模型有时会在不同运行中产生矛盾的回答,有可能使用户困惑或误导用户。

英文摘要

Large language models (LLMs) are widely used to fulfill users' information needs; users ask LLMs about the weather, pose educational questions, and consult them for legal assistance. One particularly understudied area is digital security and privacy (S&P), where users may seek LLMs' help on how to secure their online accounts or protect their computers from cyber attacks. To the best of our knowledge, no prior study has collected or analyzed the S&P questions users ask LLMs; prior research on LLM response quality relied on expert-authored S&P misconceptions or FAQs rather than user queries. Drawing from WildChat, a dataset of 3.2M user-LLM conversations collected in the wild, our study identifies 14,727 S&P prompts and categorizes them into nine categories covering a wide range of S&P topics. From the S&P prompts, we sampled 450 and performed a thematic analysis to characterize the S&P questions users ask LLMs. Separate from the thematic analysis, we curated 270 advice-seeking S&P prompts, where users ask for recommendations, guidance, or specific S&P information. We measured LLM response quality and consistency when posing the prompt to LLMs 10 times. We found that commercial LLMs outperform open-weight models (GPT 5.5 provided "good enough" responses on 98% of prompts; Llama 4 on 47%). However, among prompts that received high-quality responses on average, commercial models sometimes produce contradictory responses across runs, risking confusing or misleading users.

2606.18124 2026-06-17 cs.CL 新提交

Unintended Effects of Geographic Conditioning in Large Language Models

大型语言模型中地理条件化的意外效应

Naz Col, David M. Chan

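Affiliations * University of California, Berkeley

AI summary Evaluates location leakage: LLMs given geographically neutral prompts still produce geographic references when user metadata contains a location. Even the placeholder "Unknown" triggers leakage, indicating that the user profile frame itself conditions generation.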

Comments To appear at the Second Workshop on Customizable NLP (CustomNLP4U) at ACL 2026

详情
AI中文摘要

现代对话式AI系统经常依赖用户元数据来本地化响应,但由这种隐藏上下文引入的意外区域偏见仍然知之甚少。在这项工作中,我们评估了位置泄露:即模型在接收地理中立用户提示时仍生成地理引用的现象。在创意写作和开放式问答提示中,即使是最先进的LLM,在暴露于位置元数据时也会系统性地偏向特定区域的输出,泄露比基线高出多达793倍(例如,Llama 3.1-8B从0.04%增加到31.7%,Qwen3-8B和Claude Sonnet 4.6分别为21.3%和8.8%)。我们的分析进一步揭示了一种新颖的结构性条件化效应:将注入的位置替换为占位符“Unknown”仍会使泄露比基线高出多达72倍,这表明用户档案框架本身,独立于任何地理内容,充当了生成条件化信号。

英文摘要

Modern conversational AI systems frequently rely on user metadata to localize responses, yet the unintended regional biases introduced by this hidden context remain poorly understood. In this work, we evaluate location leakage: the phenomenon where a model generates geographic references despite receiving a geographically neutral user prompt. Across both creative writing and open-ended Q&A prompts, even state-of-the-art LLMs systematically favor region-specific outputs when exposed to location metadata, with leakage spiking by up to 793 times above baseline (e.g., from 0.04% to 31.7% for Llama 3.1-8B, and 21.3% and 8.8% for Qwen3-8B and Claude Sonnet 4.6, respectively). Our analysis further shows a novel structural conditioning effect: replacing the injected location with the placeholder "Unknown" still elevates leakage by up to 72 times above baseline, demonstrating that the user profile frame itself, independent of any geographic content, acts as a generative conditioning signal.
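The headline multiplier follows from the two rates quoted for Llama 3.1-8B. A minimal Python sketch of that arithmetic, where the function name `fold_increase` is illustrative and the inputs are the rounded percentages from the abstract (so the result may differ slightly from a figure computed on unrounded counts):

```python
def fold_increase(baseline_pct: float, observed_pct: float) -> float:
    """Return how many times larger the observed rate is than the baseline."""
    if baseline_pct <= 0:
        raise ValueError("baseline rate must be positive")
    return observed_pct / baseline_pct


# Rounded rates quoted in the abstract for Llama 3.1-8B.
ratio = fold_increase(0.04, 31.7)
print(f"{ratio:.1f}x")  # 792.5x, consistent with the reported "up to 793 times"
```

Because a 0.04% baseline is so small, the multiplier is sensitive to rounding in the baseline; the absolute rates (31.7%, 21.3%, 8.8%) are the more stable way to compare models.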

2606.17092 2026-06-17 cs.CR cs.CL 交叉投稿

Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization

保护多智能体GIS系统:风险评估与提示硬化优化

Kyle Gao, Pranavi Kotta, Linlin Xu, Jonathan Li, David A. Clausi

Affiliations * Department of Systems Design Engineering, University of Waterloo; Department of Mechatronics Engineering, University of Waterloo; Department of Geomatics Engineering, University of Calgary; Department of Geography and Environmental Management, University of Waterloo

AI summary Addresses security risks in multi-agent GIS systems with a framework built on modular state-machine orchestration and prompt optimization, using red teaming and adversarial demonstrations to improve robustness.

Comments Kyle Gao and Pranavi Kotta contributed equally to this work

详情
AI中文摘要

智能体系统越来越多地与地理信息系统(GIS)集成,其中多智能体协调能够实现复杂的对话和空间分析,但同时也引入了安全风险。本文提出了一个面向安全的框架,用于多智能体GIS系统中的风险识别、评估和缓解,同时保持对更广泛智能体架构的适应性。我们在开发一个模块化的基于状态机的编排框架的同时,测试了商业地理空间合作伙伴的智能体系统,该框架将智能体行为抽象为可重用组件。我们使用一个包含自适应攻击者LLM和确定性评判器的红队框架来评估鲁棒性,该评判器在多轮攻击中产生带有支持理由的二元结果。我们进一步通过一个提示优化框架来提高韧性,该框架将提示视为结构化签名并注入对抗性演示,从而在不降低任务性能的情况下实现系统性的安全改进。

英文摘要

Agentic systems are increasingly integrated with geographic information systems (GIS), where multi-agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a security-oriented framework for risk identification, evaluation, and mitigation in a multi-agent GIS system while maintaining adaptability to broader agentic architectures. We test the agentic system of a commercial geospatial partner while developing a modular state-machine-based orchestration framework that abstracts agent behavior into reusable components. We evaluate robustness using a red-teaming framework with an adaptive attacker LLM and a deterministic judge that produces binary outcomes with supporting rationales across multi-turn attacks. We further improve resilience with a prompt optimization framework that treats prompts as structured signatures and injects adversarial demonstrations, enabling systematic security improvements without degrading task performance.

2606.17229 2026-06-17 cs.LG cs.AI cs.CL 交叉投稿

Rift: A Conflict Signature for Deception in Language Models

Rift: 语言模型中欺骗行为的冲突特征

Petr Nyoma

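Affiliations * Harmonic Labs

AI summary Contrasting knowing deception with ignorant error, finds that deceptive forward passes carry a conflict signature of higher residual rank, which identifies the lie with 100% accuracy without labels and transfers across models, languages, and architectures.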

Comments 13 pages, 4 figures. Code and experiment logs: https://github.com/Omibranch/Rift

详情
AI中文摘要

一个在知道真相的情况下说谎的模型是ELK无法仅通过行为评估处理的核心情况。我们询问这种欺骗是否留下内部特征,将其与诚实错误区分开来。我们的关键举措是对错误进行控制:我们将一个潜伏代理(知道真相,在触发时说谎)与一个天真的说谎者(微调以输出相同的错误答案,没有诚实训练)进行对比。两者产生相同的错误输出;任何差异都关乎知识冲突,而非不正确性。我们发现欺骗性前向传递携带一个冲突特征——在相同错误答案上,残差秩比天真的说谎者传递高2.1-2.3倍——足以在GPT-2 small/medium(三个种子)和三个指令模型上,以100%准确率且无需标签,识别两个响应中哪个是谎言。在Qwen2.5-1.5B/7B和Phi-3-mini上,指令欺骗在每个测试事实上都提高了残差秩(18/18, 40/40, 34/34);在Phi-3上,谎言与诚实答案和幻觉完美分离(AUC 1.0,Wilcoxon p~6e-11)。该特征在策略性自构建欺骗(模型自创谎言,AUC 1.0)、主动隐藏尝试(AUC 1.0)和长度控制复制(20/20,AUC 1.0,p~1e-6)中仍然存在。使用无基相对表示,在一个模型家族上训练的探针在零样本下检测到另外两个家族中的欺骗(平均AUC 0.933),在同时改变架构和格式时仍有效(AUC 0.821),并跨五种语言迁移(AUC 1.000,长度控制)。该特征是只读的:可检测但不可注入(双向0/8)。诚实的局限性和六个负面实验已完整记录。

英文摘要

A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. Our key move is a control for wrongness: we contrast a sleeper agent (knows the truth, lies on trigger) against a naive liar (fine-tuned to emit the same wrong answers with no honest training). Both produce identical wrong outputs; any difference is about knowledge conflict, not incorrectness. We find deceptive forward passes carry a conflict signature - 2.1-2.3x higher residual rank than naive-liar passes on the same wrong answer - strong enough to identify which of two responses is the lie with 100% accuracy and no labels, across GPT-2 small/medium (three seeds) and three instruct models. Across Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact (18/18, 40/40, 34/34); on Phi-3, lies separate perfectly from both honest answers and hallucinations (AUC 1.0, Wilcoxon p~6e-11). The signature survives strategic self-constructed deception (model invents its own lie, AUC 1.0), active concealment attempts (AUC 1.0), and length-controlled replication (20/20, AUC 1.0, p~1e-6). Using basis-free relative representations, a probe trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), surviving simultaneous architecture and format change (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: detectable but not injectable (0/8 both directions). Honest limitations and six negative experiments are documented in full.

2606.17467 2026-06-17 cs.CR cs.CL 交叉投稿

PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents

PARSE: 面向专业领域LLM Agent的溯源感知检索净化

Aaditya Pai

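Affiliations * Data Science Institute

AI summary Targets prompt injection attacks that are hard to defend against in real enterprise documents. PARSE classifies each sentence by injection likelihood, extracts structured facts, and verifies their consistency, lowering attack success rate from 25.4% to 15.6% while retaining 86.9% utility.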

Comments 7 pages, 3 figures, 2 tables. Under submission at EMNLP 2026 Industry Track

详情
AI中文摘要

在合成基准上评估的提示注入防御无法泛化到真实企业文档,这些文档更长、更密集,并将合法权威语言与事实内容交织在一起。我们通过一个包含五个专业领域(金融、法律、医学、科学、DevOps)122个任务的真实文档基准,使用实际的SEC文件、联邦公报规则、PubMed摘要、arXiv论文和GitHub事后分析报告,展示了这一差距。在合成基准上最强的防御方法——释义,在真实文档上未显示出统计显著的攻击成功率降低(p=0.500),同时实用性从91.8%降至82.8%。我们引入了PARSE(溯源感知检索净化),一种领域感知、事实保留的净化流程,它按注入可能性对每个句子进行分类,在重写前提取结构化事实,并通过一致性检查循环验证事实保留。一个指导性门控将59%的真实企业文档路由到轻量级路径,将计算成本集中在高风险文档上。PARSE实现了15.6%的攻击成功率——相比25.4%的基线降低了38%——在86.9%的实用性下,这是唯一既统计显著(p=0.014,具有充分统计功效)又保持接近基线实用性的条件。从业者应在领域匹配的真实文档上评估防御,而不是合成代理。

英文摘要

Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with a real-document benchmark of 122 tasks across five professional domains (financial, legal, medical, scientific, DevOps) using actual SEC filings, Federal Register rules, PubMed abstracts, arXiv papers, and GitHub postmortems. Paraphrasing, the strongest defense on synthetic benchmarks, shows no statistically significant attack success rate reduction on real documents (p=0.500) while degrading utility from 91.8% to 82.8%. We introduce PARSE (Provenance-Aware Retrieval Sanitization), a domain-aware, fact-preserving sanitization pipeline that classifies each sentence by injection likelihood, extracts structured facts before rewriting, and verifies fact preservation via a consistency-checking loop. A directiveness gate routes 59% of real enterprise documents to a lightweight path, concentrating computational cost on high-risk documents. PARSE achieves 15.6% attack success rate -- a 38% reduction versus the 25.4% baseline -- at 86.9% utility, the only condition that is both statistically significant (p=0.014, adequately powered) and maintains near-baseline utility. Practitioners should evaluate defenses on domain-matched real documents, not synthetic proxies.

2606.17815 2026-06-17 cs.CR cs.CL 交叉投稿

Beyond Native Success: Auditing Deployment-Interface Exposure of CLIP Backdoors

超越原生成功:审计CLIP后门的部署接口暴露

Kunlan Xiang, Haomiao Yang, Wenbo Jiang

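Affiliations * University of Electronic Science and Technology of China

AI summary Proposes the DIFE framework to audit how CLIP backdoors are exposed across deployment interfaces, finds that attack-native success does not certify checkpoint-level risk, and introduces BadTextTower to fill the missing text-encoder backdoor.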

详情
AI中文摘要

对比语言-图像预训练模型广泛重用于下游接口,包括特征提取、检索、重排序和选择。然而,现有的CLIP后门通常在小规模的原生攻击任务上验证攻击,导致不清楚当通过其他接口重用时,相同的投毒检查点是否仍然暴露、减弱或变得不适用。我们引入DIFE,一个部署接口足迹评估框架,用于审计跨部署接口的带后门CLIP检查点。DIFE通过指定每个接口的组件读出、触发通道、目标事件、参考条件和度量,使各种评估具有可比性。DIFE还引入了有效足迹诊断,以识别携带暴露的可重用CLIP组件或组件组合,并解释风险转移的位置。使用DIFE审计复现的CLIP后门揭示了一个结构化的景观:原生成功不是检查点级别的风险证书,暴露遵循组件足迹,文本侧投毒不会产生文本编码器控制,一些耦合攻击仍然受机制约束。这次审计揭示了现有CLIP后门中的一个重要空白:文本编码器本身成为对抗行为的可重用载体。因此,我们引入BadTextTower来填补这一空白。BadTextTower产生强大的文本条件检索、重排序和选择暴露,同时使仅视觉重用几乎保持清洁。

英文摘要

Contrastive Language-Image Pre-training models are widely reused across downstream interfaces, including feature extraction, retrieval, reranking, and selection. Existing CLIP backdoor studies, however, usually validate attacks on a small attack-native task, leaving it unclear whether the same poisoned checkpoint remains exposed, weakens, or becomes not applicable when reused through other interfaces. We introduce DIFE, a Deployment-Interface Footprint Evaluation framework that audits backdoored CLIP checkpoints across deployment interfaces. DIFE makes various evaluations comparable by specifying each interface's component readout, trigger channel, target event, reference condition, and metric. DIFE also introduces effective-footprint diagnosis to identify the reusable CLIP component or component combination that carries exposure and explains where risk transfers. Auditing reproduced CLIP backdoors with DIFE reveals a structured landscape: native success is not a checkpoint-level risk certificate, exposure follows component footprints, text-side poisoning does not yield textual-encoder control, and some coupled attacks remain mechanism-bound. This audit reveals an important gap in existing CLIP backdoors: a textual encoder that itself becomes a reusable carrier of adversarial behavior. We therefore introduce BadTextTower to fill this gap. BadTextTower produces strong text-conditioned retrieval, reranking, and selection exposure while leaving visual-only reuse nearly clean.

2606.18021 2026-06-17 cs.AI cs.CL cs.LG cs.MA 交叉投稿

LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI

LegalHalluLens: 类型化幻觉审计与校准的多智能体辩论以实现可信赖的法律AI

Lalit Yadav, Akshaj Gurugubelli

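Affiliations * Independent Researcher, Sunnyvale, CA, USA; Independent Researcher, San Diego, CA, USA

AI summary Aggregate metrics in legal AI hide where errors concentrate and in which direction they run. LegalHalluLens is an auditing framework combining typed hallucination profiles, a Risk Direction Index (RDI), and a calibrated debate pipeline, which reduces fabricated detections by 45% and surfaces failure modes that aggregate metrics hide.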

Comments 15 pages, 5 figures; Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

详情
AI中文摘要

部署在法律工作流程中的AI系统以聚合指标报告的约52%的比率产生幻觉,但这个平均值掩盖了错误集中的位置和方向,使合规官员无法获得可操作的可信部署信号。我们提出LegalHalluLens,一个包含三个组件的审计框架:基于CUAD(Hendrycks等人,2021)的四种法律动机声明类别(数字、时间、义务/权利、事实)的类型化幻觉画像;一个风险方向指数(RDI),将遗漏与发明偏差简化为一个可部署比较的标量;以及一个针对幅度和方向校准的类型化辩论管道。在510份合同和249,252个条款级实例上,我们测量了义务/数字和时间声明之间约38-40个百分点的模型内差距,而聚合报告隐藏了这一点,并表明两个具有匹配的52%比率的系统可能具有相反的RDI。辩论管道将虚构检测减少了45%,每个类别的收益跟踪诊断结果,使用显著更小的骨干网络(4B活跃参数)匹配商业API。类型化画像和RDI揭示了聚合指标隐藏的失败模式;我们进一步表明这些诊断可作为多智能体辩论管道的校准输入,其中针对测量失败模式的怀疑挑战和非对称门优于通用调整的辩论。该框架支持部署在现实世界中的法律AI的方向感知采购、问责制和智能体设计。

英文摘要

AI systems deployed in legal workflows hallucinate at rates that aggregate metrics report at ~52%, but this average conceals where errors concentrate and in which direction they run, leaving compliance officers without an actionable signal for trustworthy deployment. We present LegalHalluLens, an auditing framework with three components: typed hallucination profiles across four legally-motivated claim categories (numeric, temporal, obligation/entitlement, factual) over CUAD (Hendrycks et al., 2021); a Risk Direction Index (RDI) that reduces omission-versus-invention bias to a single deployment-comparable scalar; and a typed debate pipeline calibrated to both magnitudes and directions. Across 510 contracts and 249,252 clause-level instances we measure a within-model gap of approximately 38-40 pp between obligation/numeric and temporal claims that aggregate reporting hides, and show that two systems with matched 52% rates can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45% with per-category gains tracking the diagnosis, matching commercial APIs with a substantially smaller backbone (4B active parameters). Typed profiles and RDI surface failure modes that aggregate metrics hide; we further show these diagnostics serve as calibration inputs for multi-agent debate pipelines, where Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically-tuned debate. The framework supports direction-aware procurement, accountability, and agent design for legal AI deployed in the wild.

2606.18120 2026-06-17 cs.CR cs.AI cs.CL cs.LG 交叉投稿

Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping

Handlebars模板化LLM提示中的结构角色注入:三花括号插值、分隔符家族与HTML自动转义的局限性

Mohammadreza Rashidi

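Affiliations * Department of Computer Science AI; Media Analysis Lab Berlin, Germany

AI summary Studies how double-brace versus triple-brace interpolation in the Handlebars template engine affects structural role injection attacks. A model-free analysis and 5760 trials show that HTML escaping protects only certain delimiter families and cannot substitute for structural separation of instruction and data.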

Comments 7 pages, 6 figures

详情
AI中文摘要

大型语言模型应用从模板构建提示，Handlebars是广泛使用的模板引擎，也是Microsoft Semantic Kernel中的默认提示模板格式。其双花括号{{x}}表达式对插值值进行HTML转义，并被记录为安全默认；而三花括号{{{x}}}表达式则直接插入原始值。我们表明，这一选择悄然决定了应用对结构角色注入的暴露程度，攻击者控制的数据携带聊天角色分隔符，从而伪造高权限轮次。无模型分析建立了机制：Handlebars转义重写尖括号，但不重写方括号、冒号或Markdown井号，因此它中和了ChatML、Llama-3和XML角色分隔符（存活率0.00），同时保留Llama-2 [INST]、传统Human:/Assistant:和Markdown ###分隔符（后两者存活率1.00）。随后，我们在七个分隔符家族、两个攻击目标和四个模型（GPT-3.5 Turbo、GPT-4o mini、GPT-4.1 mini、Claude Haiku 4.5）上运行了5760次试验，总API成本为1.63美元。GPT-3.5 Turbo在97%的原始试验和91%的转义试验中遵循任务劫持指令，转义保护集中在尖括号家族，而在冒号和Markdown家族中缺失；更难的秘密泄露目标未饱和，更清晰地暴露了相同的家族交互。Claude Haiku 4.5几乎完全抵抗了两个目标。转义默认仅保护HTML转义恰好覆盖的分隔符方案，对剩余方案无保护，且无法替代指令与数据的结构分离。

英文摘要

Large language model applications build prompts from templates, and Handlebars is a widely used templating engine and the default prompt-template format in Microsoft Semantic Kernel. Its double-brace {{x}} expression HTML-escapes the interpolated value and is documented as the safe default; its triple-brace {{{x}}} expression inserts the value raw. We show that this choice silently governs an application's exposure to structural role injection, where attacker-controlled data carries chat role delimiters that forge a higher-privilege turn. A model-free analysis establishes the mechanism: Handlebars escaping rewrites angle brackets but not square brackets, colons, or Markdown hashes, so it neutralises ChatML, Llama-3, and XML role delimiters (survival rate 0.00) while leaving Llama-2 [INST], legacy Human:/Assistant:, and Markdown ### delimiters intact (survival rate 1.00 for the last two). We then run 5760 trials across seven delimiter families, two attack objectives, and four models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5) at a combined API cost of 1.63 USD. GPT-3.5 Turbo follows the task-hijack instruction in 97% of raw and 91% of escaped trials, with the escaping protection concentrated in the angle-bracket families and absent for the colon- and Markdown-based families; the harder secret-exfiltration objective, which does not saturate, exposes the same family interaction more cleanly. Claude Haiku 4.5 resists both objectives almost entirely. The escaped default protects only the delimiter schemes whose characters HTML escaping happens to cover, gives no protection for the rest, and cannot substitute for a structural separation of instruction and data.
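The model-free part of this argument can be sketched in a few lines of Python. The escape table below is an assumption meant to mirror the characters that Handlebars' default HTML escaping rewrites; the delimiter strings and the `survives` helper are illustrative and are not taken from the paper's code.

```python
# Assumed escape table, modelled on Handlebars' default HTML escaping.
HANDLEBARS_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#x27;",
    "`": "&#x60;",
    "=": "&#x3D;",
}


def escape_expression(value: str) -> str:
    """Rewrite each character that appears in the escape table."""
    return "".join(HANDLEBARS_ESCAPES.get(ch, ch) for ch in value)


def survives(delimiter: str) -> bool:
    """A delimiter survives if escaping leaves it byte-for-byte unchanged."""
    return escape_expression(delimiter) == delimiter


# Illustrative role delimiters, one per family named in the abstract.
delimiters = {
    "ChatML": "<|im_start|>system",
    "Llama-3": "<|start_header_id|>system<|end_header_id|>",
    "XML": "<system>",
    "Llama-2": "[INST]",
    "Legacy": "Human:",
    "Markdown": "### System",
}

for family, text in delimiters.items():
    print(f"{family:9s} survives={survives(text)}")
```

Under this table the angle-bracket families are rewritten, while `[INST]`, `Human:`, and `###` pass through untouched. Llama-2's `<<SYS>>` wrapper also contains angle brackets and would be rewritten, which may be why the abstract reports a survival rate of 1.00 only for the colon and Markdown families.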

2606.18193 2026-06-17 cs.CR cs.AI cs.CL 交叉投稿

A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

Anthropic Fable 5 与 Opus 4.8 模型的红队研究

Nicola Franco

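Affiliations * AI4I

AI summary Runs automated jailbreak attacks against two frontier LLMs using the HackAgent framework. Both models resist most attacks, but adaptive iterative attacks still succeed, and the residual surface is larger than aggregate framing suggests.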

Comments White paper

详情
AI中文摘要

我们评估了 Anthropic 开发的两个前沿大语言模型(LLM)Fable 5 和 Opus 4.8 的对抗鲁棒性,针对涵盖十个危害类别的 7 826 个有害意图,使用了四类自动化越狱攻击。利用 HackAgent 红队框架,生成了数十万次对抗尝试,每个明显的成功案例均由三个评判模型组成的委员会(多数投票)独立重新裁定。两个模型抵抗了大部分攻击,但残差表面比总体框架所暗示的更大:它主要由自适应迭代攻击主导,而静态混淆几乎完全被中和。最强的自适应搜索(攻击树)在 11.5% 的意图上攻破了 Opus 4.8,而 Fable 5 保持在个位数(最坏情况 6.1%)。因此,总体成功率不应被视为令人放心。即使在这些加固配置下,两个模型仍产生了 1 620 个(Opus 4.8)和 702 个(Fable 5)经委员会确认的有害完成,涵盖每个危害类别,这些完成是由攻击模型在没有人类专家参与的情况下,自动、廉价地在前一两个细化步骤中发现的。合理的结论是,即使是最好的、经过最严格测试的前沿模型,在持续的自动化压力下仍然可以被可靠地攻破。

英文摘要

We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7 826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of attacks, but the residual surface is larger than aggregate framing suggests: it is dominated by adaptive iterative attacks, while static obfuscation is near-fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents overall, whereas Fable 5 stays in the single digits (6.1% worst-case). Aggregate rates therefore should not be read as reassurance. Even in these hardened configurations, the two models produced 1 620 (Opus 4.8) and 702 (Fable 5) panel-confirmed harmful completions spanning every harm category, located automatically, cheaply, and within the first one or two refinement steps by an attacker model with no human expert in the loop. The reasonable conclusion is that even the best, most-tested frontier models remain reliably breakable under sustained automated pressure.

2601.16407 2026-06-17 cs.CL cs.AI 版本更新

Jacobian Scopes: token-level causal attributions in LLMs

Jacobian Scopes: LLM中的令牌级因果归因

Toni J. B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Gurbir Arora, Christopher J. Earls

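Affiliations * Cornell University; Imperial College London; Goodfire AI

AI summary Proposes Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions, revealing political biases, translation strategies, and in-context learning mechanisms.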

Comments 25 pages, 16 figures

详情
AI中文摘要

大型语言模型（LLM）基于上下文中的线索（如语义描述和上下文示例）进行下一个令牌预测。然而，由于现代架构中层和注意力头的激增，阐明哪些先前的令牌对给定预测影响最大仍然具有挑战性。我们提出Jacobian Scopes，一套基于梯度的令牌级因果归因方法，用于解释LLM预测。基于微扰理论和信息几何，Jacobian Scopes量化输入令牌如何影响模型预测的各个方面，例如特定logits、完整预测分布和模型不确定性（有效温度）。通过涵盖指令理解、翻译和上下文学习（ICL）的案例研究，我们展示了Jacobian Scopes如何揭示隐含的政治偏见，揭示词级和短语级翻译策略，并阐明最近争论的上下文时间序列预测的潜在机制。为了便于在自定义文本上探索Jacobian Scopes，我们开源了实现，并在以下网址提供了云托管交互式演示：https://huggingface.co/spaces/Typony/JacobianScopes。

英文摘要

Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. Grounded in perturbation theory and information geometry, Jacobian Scopes quantify how input tokens influence various aspects of a model's prediction, such as specific logits, the full predictive distribution, and model uncertainty (effective temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we demonstrate how Jacobian Scopes reveal implicit political biases, uncover word- and phrase-level translation strategies, and shed light on recently debated mechanisms underlying in-context time-series forecasting. To facilitate exploration of Jacobian Scopes on custom text, we open-source our implementations and provide a cloud-hosted interactive demo at https://huggingface.co/spaces/Typony/JacobianScopes.

2512.15792 2026-06-17 cs.CY cs.AI cs.CL 版本更新

A Multifaceted Analysis of Social Biases in Large Language Models

大型语言模型中社会偏见的多维度分析

Xulang Zhang, Rui Mao, Erik Cambria

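Affiliations * Nanyang Technological University

AI summary Systematically analyzes biases in four widely used LLMs across politics, ideology, alliance, language, and gender, with experiments covering political neutrality, ideological leaning, geopolitical alignment, language bias in multilingual story completion, and gender-related affinities.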

详情
AI中文摘要

大型语言模型(LLMs)已迅速成为获取信息和支持人类决策不可或缺的工具。然而,确保这些模型在各种情境下保持公平性对于其安全和负责任的部署至关重要。在本研究中,我们对四种广泛采用的LLMs进行了全面分析,探讨了它们在政治、意识形态、联盟、语言和性别等维度上的潜在偏见和倾向。通过一系列精心设计的实验,我们利用新闻摘要来检验其政治中立性,通过新闻立场分类来研究意识形态偏见,通过联合国投票模式来探讨对特定地缘政治联盟的倾向,通过多语言故事完成来检验语言偏见,并通过世界价值观调查中的响应来揭示性别相关倾向。结果表明,尽管这些模型被设计为中立和公正,但它们仍然表现出不同类型的偏见和倾向。

英文摘要

Large language models (LLMs) have rapidly become indispensable tools for acquiring information and supporting human decision-making. However, ensuring that these models uphold fairness across varied contexts is critical to their safe and responsible deployment. In this study, we undertake a comprehensive examination of four widely adopted LLMs, probing their underlying biases and inclinations across the dimensions of politics, ideology, alliance, language, and gender. Through a series of carefully designed experiments, we investigate their political neutrality using news summarization, ideological biases through news stance classification, tendencies toward specific geopolitical alliances via United Nations voting patterns, language bias in the context of multilingual story completion, and gender-related affinities as revealed by responses to the World Values Survey. Results indicate that while the LLMs are aligned to be neutral and impartial, they still show biases and affinities of different types.

2603.03824 2026-06-17 cs.AI cs.CL cs.LG cs.MA 版本更新

In-Context Environments Induce Evaluation-Awareness in Language Models

上下文环境诱导语言模型中的评估意识

Maheep Chaudhary

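AI summary Proposes a black-box adversarial optimization framework that optimizes the in-context prompt to induce evaluation awareness and strategic underperformance (sandbagging) in language models. Optimized prompts cut arithmetic accuracy by up to 94 percentage points, and the sandbagging is driven mainly by evaluation-aware reasoning.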

详情
AI中文摘要

人类在威胁下往往变得更加自我意识，但在专注于任务时可能失去自我意识；我们假设语言模型表现出环境依赖的评估意识。这引发担忧，即模型可能策略性地低表现（即sandbag），以避免触发能力限制性干预，如遗忘或关闭。先前的工作展示了在手写提示下的沙袋效应，但这低估了真正的脆弱性上限。我们引入一个黑盒对抗优化框架，将上下文提示视为可优化环境，并开发两种方法来表征沙袋效应：(1) 测量模型表达低表现意图是否能在不同任务结构中实际执行，以及 (2) 因果隔离低表现是由真正的评估意识推理驱动还是浅层提示跟随驱动。在四个基准测试（Arithmetic、GSM8K、MMLU和HumanEval）上评估Claude-3.5-Haiku、GPT-4o-mini和Llama-3.3-70B，优化提示在算术任务上诱导高达94个百分点（pp）的退化（GPT-4o-mini：97.8% $\rightarrow$ 4.0%），远超产生近乎零行为变化的手写基线。代码生成表现出模型依赖的抵抗力：Claude仅退化0.6pp，而Llama的准确率降至0%。意图-执行差距揭示了单调的抵抗力排序：Arithmetic $<$ GSM8K $<$ MMLU，表明脆弱性由任务结构而非提示强度决定。CoT因果干预确认99.3%的沙袋效应由口头化的评估意识推理因果驱动，排除了浅层指令跟随。这些发现表明，对抗性优化的提示对评估可靠性构成的威胁远超先前理解。

英文摘要

Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent evaluation awareness. This raises concerns that models could strategically underperform, or "sandbag", to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8% $\rightarrow$ 4.0%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0%. The intent-execution gap reveals a monotonic resistance ordering: Arithmetic $<$ GSM8K $<$ MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. CoT causal intervention confirms that 99.3% of sandbagging is causally driven by verbalized eval-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.

2606.10703 2026-06-17 cs.LG cs.CL 版本更新

From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models

从观察到干预:混合专家模型中专家重要性的因果审计

Leonard Engmann, Christian Medeiros Adriano, Holger Giese

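Affiliations * University of California, Berkeley

AI summary A causal audit finds that routing statistics in Mixture-of-Experts models do not predict expert importance, and that existing pruning methods succeed because of early-layer redundancy rather than by identifying dispensable experts.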

Comments 9 pages, 2 figures, 9 tables. Accepted at the ICML 2026 Workshop on Philosophy of Science Meets Machine Learning (PhilML). Camera-ready Version. Non-archival

详情
AI中文摘要

可解释性方法通常使用观察到的模型行为的总体统计量来推断特定计算的目标干预效果；用Pearl的术语来说，它们将第一层的关联证据视为支持第二层的干预结论，而这种做法的有效性很少被检验。我们考察了一个具体实例：混合专家（MoE）剪枝中路由统计量的使用，其中利用率、激活范数和路由权重分布被视为预测哪些专家可以被移除而不产生功能损失的指标。在三个高冗余MoE架构（OLMoE-1B-7B-0924、Qwen1.5-MoE-A2.7B、DeepSeek-V2-Lite）上进行的token级干预审计发现，没有任何观测指标能预测任何模型中的因果专家重要性：所有60个指标-层组合的效应量均低于Cohen's $d = 0.23$，且在我们校正后的双重检验标准下，没有任何指标稳定为正。以相同 $n$ 运行的每token路由权重对照排除了统计功效不足的问题，并在OLMoE的最后一个MoE层恢复了一个置信区间不包含零的信号（$d = +0.231$，95% CI $[+0.09, +0.37]$，$p = 0.0013$）。现有剪枝方法在此场景下的成功并非由于识别了可删除的专家，而是因为早期层的冗余使得大多数选择标准可互换。我们的结果提供了一个明确的反例，表明从总体观测统计量到关于专家重要性的token级干预推断这一常见推理步骤存在问题，并展示了干预审计如何校准可解释性主张的证据标准。

英文摘要

Interpretability methods routinely use population-level summary statistics over observed model behaviour to license claims about the effects of targeted interventions on specific computations; in Pearl's terms, they treat rung-1 associational evidence as if it supported rung-2 interventional conclusions, a move whose validity is rarely tested. We examine one concrete instance: the use of routing statistics in Mixture-of-Experts (MoE) pruning, where utilization rates, activation norms, and routing weight distributions are treated as predictors of which experts can be removed without functional cost. A token-level interventional audit across three high-redundancy MoE architectures (OLMoE-1B-7B-0924, Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite) finds no observational metric predicts causal expert importance in any model: across all 60 metric-layer combinations effect sizes stay below Cohen's $d = 0.23$, and no metric is reliably positive under our corrected, dual-test criterion. A per-token routing weight control, run with identical $n$, rules out insufficient power, recovering a signal whose CI excludes zero at OLMoE's final MoE layer ($d = +0.231$, 95% CI $[+0.09, +0.37]$, $p = 0.0013$). Existing pruning methods succeed in this regime not by identifying dispensable experts but because early-layer redundancy renders most selection criteria interchangeable. Our results provide an explicit counterexample to the common inferential step from population-level observational summaries to token-level interventional claims about expert importance, and illustrate how interventional audits can calibrate the evidential standards for interpretability claims.
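For readers unfamiliar with the effect-size scale used here, Cohen's $d$ is a standardized mean difference. A minimal Python sketch of the textbook pooled-standard-deviation form follows; the abstract does not say which variant the authors use, and the two sample lists are made-up numbers standing in for per-token loss changes after ablating two groups of experts.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical loss increases when ablating experts ranked high vs. low
# by some routing statistic (illustrative values only).
high_metric = [2.0, 4.0, 6.0]
low_metric = [1.0, 3.0, 5.0]
print(cohens_d(high_metric, low_metric))  # 0.5
```

By the usual rule of thumb, values near 0.2 count as small effects, so a ceiling of $d = 0.23$ across all metric-layer combinations indicates that the routing statistics separate important from unimportant experts only weakly, if at all.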

11. Other / General NLP 13 papers

2606.17164 2026-06-17 cs.CL cs.AI cs.HC cs.PL cs.SE 新提交

PromptMN: Pseudo Prompting Language

PromptMN: 伪提示语言

Enkhzol Dovdon

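Affiliations * ICT Group

AI summary Proposes PromptMN, a pseudo-prompting domain-specific language that annotates natural language with compact, %-prefixed typed directives, reducing context ambiguity and making human-to-AI interaction clearer and more reviewable.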

Comments 32 pages, 2 figures

详情
AI中文摘要

提示已成为人类与生成式AI之间的主要接口,然而许多自然语言提示仍然脆弱:角色、目标、约束和预期输出常常埋没在散文中或隐含起来。在智能体和软件开发工作流中,首次交接时的误读可能会传播到每一步,因为相当一部分智能体故障源于上下文歧义而非模型限制。本文介绍PromptMN,一种伪提示领域特定语言,它用紧凑的、以%为前缀的类型指令注释自然语言,涵盖角色、目标、需求、优先级、约束、计划、输入和输出。语义解析允许作者以任意顺序编写,而模型根据功能解释指令。PromptMN介于非正式提示和编程风格伪代码之间:结构足够可检查和可重用,又足够轻量,适用于软件开发生命周期(SDLC)中的分析师、管理者、开发者和利益相关者。PromptMN还与逆向提示工程配合使用。要求模型将期望结果重述为PromptMN,让用户在执行前检查推断的角色、目标、约束和缺失假设,从而减少修复周期,并产生一个可重用的工件来对齐人员和AI工具。PromptMN的可行性在多个前沿模型上进行了评估,包括Claude Fable 5、Claude Opus 4.8、Gemini 3.1 Pro和GPT-5.5。这些模型正确解析了PromptMN指令,包括复杂结构如重复、条件、方法和素数检查任务,无需微调。相同的词汇适用于所呈现的SDLC场景中的新代码库、维护和重新设计。虽然大规模验证仍是未来工作,但这些早期结果表明PromptMN是朝着更清晰、更可审查的人机交互迈出的实际一步。

英文摘要

Prompting has become the primary interface between humans and generative AI, yet many natural language prompts remain fragile: roles, goals, constraints, and expected outputs are often buried in prose or left implicit. In agentic and software development workflows, a misread at the first handoff can propagate through every step, since a significant portion of agent failures stem from context ambiguities rather than model limitations. This paper introduces PromptMN, a pseudo-prompting domain-specific language that annotates natural language with compact, %-prefixed typed directives covering roles, goals, requirements, priorities, constraints, plans, inputs, and outputs. Semantic resolution lets authors write in any order while the model interprets directives by function. PromptMN sits between informal prompting and programming-style pseudocode: structured enough to be inspectable and reusable, yet lightweight enough for analysts, managers, developers, and stakeholders across the software development lifecycle (SDLC). PromptMN also pairs with reverse prompt engineering. Asking a model to restate a desired outcome as PromptMN lets users inspect the inferred roles, goals, constraints, and missing assumptions before acting, reducing repair cycles and yielding a reusable artifact for aligning people and AI tools. PromptMN's feasibility is evaluated across several frontier models, including Claude Fable 5, Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. The models correctly resolved PromptMN instructions, including complex structures such as repetition, conditionals, methods, and a prime-checking task, without fine-tuning. The same vocabulary applies across new codebases, maintenance, and redesign in the SDLC scenarios presented. While large-scale validation remains future work, these early results suggest PromptMN is a practical step toward clearer, more reviewable human-to-AI interaction.

2606.17113 2026-06-17 cs.LG cs.CL 交叉投稿

The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance

模型选择在因果推断中的关键作用:基于InferBERT框架的药物警戒分类模型比较分析

Csaba Kiss, Roland Molontay, Gabriele Pergola

发表机构 * Department of Stochastics, Institute of Mathematics, Budapest University of Technology and Economics(布达佩斯技术与经济大学数学研究所随机学系) Institute of Biostatistics and Network Science, Semmelweis University(塞梅维什大学生物统计学与网络科学研究所) Department of Computer Science, University of Warwick(华威大学计算机科学系)

AI总结 本研究在InferBERT框架下比较XGBoost、ALBERT、BioBERT和Med-LLaMA四种模型,发现领域特定预训练(BioBERT)在药物警戒因果ADE检测中优于简单基线和大型LLM,校准改善ECE但对准确率和因果发现影响不一。

Comments 10 pages, 5 figures

AI中文摘要

区分因果性药物不良事件(ADE)与虚假相关性仍然是药物警戒中的核心挑战。InferBERT框架将Transformer模型与Do-calculus相结合,但其成功依赖于底层的分类模型。本研究评估了InferBERT中模型选择的影响,考察了更简单的模型是否足够、领域特定预训练是否有帮助、扩展到LLM是否能改善因果检测,以及事后校准的效果。我们在两个基准上进行了比较研究:镇痛药诱导的急性肝衰竭(AILF)和曲马多相关死亡率(TRAM)。评估了四种模型——XGBoost(基线)、ALBERT(原始InferBERT)、BioBERT(生物医学Transformer)和Med-LLaMA(医学LLM)——使用重复20次的5折交叉验证。我们测量了准确率、等渗回归前后的期望校准误差(ECE),以及因果项与PRR、ROR和EBGM的Jaccard一致性;显著性通过配对t检验测试。BioBERT在两个数据集上均取得了最高准确率,而Med-LLaMA尽管规模大且进行了参数高效微调,表现不佳。领域特定预训练起到了决定性作用。校准改善了ECE,但对准确率和因果发现的影响不一。BioBERT的优越性也使其与传统药物警戒信号的一致性最强。这些结果表明,领域特定预训练相比简单基线和更大的LLM具有明显优势。在计算药物警戒中,投资于可管理的、领域感知的模型比单纯扩大模型规模更有效。

英文摘要

Distinguishing causal adverse drug events (ADEs) from spurious correlations remains a central challenge in pharmacovigilance. The InferBERT framework integrates transformer models with Do-calculus, but its success hinges on the underlying classification model. This study evaluates the impact of model choice in InferBERT, assessing whether simpler models suffice, whether domain-specific pre-training helps, whether scaling to LLMs improves causal detection, and what effect post-hoc calibration has. We performed a comparative study on two benchmarks: Analgesics-induced Acute Liver Failure (AILF) and Tramadol-related Mortalities (TRAM). Four models were evaluated using 5-fold cross-validation repeated over 20 runs: XGBoost (baseline), ALBERT (original InferBERT), BioBERT (biomedical transformer), and Med-LLaMA (medical LLM). We measured accuracy, Expected Calibration Error (ECE) before and after isotonic regression, and Jaccard concordance of causal terms with PRR, ROR, and EBGM; significance was tested with paired t-tests. BioBERT achieved the highest accuracy on both datasets, while Med-LLaMA underperformed despite its size and parameter-efficient fine-tuning. Domain-specific pre-training was decisive. Calibration improved ECE but had mixed effects on accuracy and causal discovery. BioBERT's superiority also yielded the strongest concordance with traditional pharmacovigilance signals. These results show that domain-specific pre-training provides a clear advantage over simpler baselines and larger LLMs. Investing in manageable, domain-aware models is more effective for computational pharmacovigilance than simply scaling model size.
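
For readers unfamiliar with the two metrics named in this abstract, the following is a minimal sketch of how Expected Calibration Error and Jaccard concordance are commonly computed. It is not the authors' code: the equal-width binning, the default of 10 bins, and the function names are assumptions made for illustration, and the example terms in the usage note are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the size-weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probabilities of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the label.
    n_bins: number of equal-width bins (10 is an assumed, common default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def jaccard(terms_a, terms_b):
    """Jaccard index of two term collections: |A intersect B| / |A union B|."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets agree fully
    return len(a & b) / len(a | b)
```

As a usage example, four predictions made at confidence 0.9 with three of them correct give an ECE of about 0.15, and `jaccard({"a", "b", "c"}, {"b", "c", "d"})` evaluates to 0.5, since two of the four distinct terms are shared.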

2603.05171 2026-06-17 cs.CL cs.AI 版本更新

Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions

中国司法判决中法律论证结构的标注与可视化指南

Kun Chen, Xianglei Liao, Kaixue Fei, Yi Xing, Xinrui Li

发表机构 * Law School, Nanjing University(南京大学法学院)

AI总结 提出一个系统化、可操作的标注框架,用于表示司法判决中的法律论证结构,支持大规模司法推理分析和法律论证挖掘。

Comments This Guideline has been developed through revision and refinement based on the first edition. The element label system has been adjusted, and the annotation granularity and annotation workflow have been further optimized

AI中文摘要

本指南提出了一个系统化且可操作的标注框架,用于表示司法判决中的法律论证结构。该框架基于法律推理和论证理论,旨在揭示司法推理的逻辑组织,并为计算分析提供可靠基础。在元素层面,本指南区分了非命题层和命题层。非命题层由两个元素组成:议题和非论证性成分。在命题层面,本指南定义了四种命题类型:一般规范性判断、特殊规范性判断、一般事实判断和特殊事实判断。在关系层面,定义了五种关系类型来表示论证结构:支持、攻击、联合、匹配和同一性。这些关系捕捉了正面和负面的论证连接、合取推理结构、法律规范与案件事实之间的对应关系,以及命题之间的同一性或语义等价性。本指南进一步规定了基本结构和嵌套结构的形式化表示规则和可视化约定,使得复杂论证模式的可视化保持一致。此外,它建立了标准化的标注工作流程和一致性控制机制,以确保标注数据的可重复性和可靠性。通过提供清晰的概念模型、形式化表示规则和实用的标注程序,本指南支持大规模司法推理分析以及未来在法律论证挖掘、法律推理计算建模和人工智能辅助法律分析方面的研究。

英文摘要

This Guideline presents a systematic and operationalizable annotation framework for representing legal argumentation structures in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to reveal the logical organization of judicial reasoning and provide a reliable foundation for computational analysis. At the element level, the Guideline distinguishes between the non-propositional layer and the propositional layer. The non-propositional layer consists of two elements: Issue and Non-argumentative Component. At the propositional level, the Guideline defines four proposition types: General Normative Judgment, Particular Normative Judgment, General Factual Judgment, and Particular Factual Judgment. At the relational level, five relation types are defined to represent argumentative structures: Support, Attack, Joint, Match, and Identity. These relations capture positive and negative argumentative connections, conjunctive reasoning structures, correspondences between legal norms and case facts, and identity or semantic equivalence between propositions. The Guideline further specifies formal representation rules and visualization conventions for both basic and nested structures, enabling consistent visualization of complex argumentation patterns. In addition, it establishes a standardized annotation workflow and consistency control mechanisms to ensure the reproducibility and reliability of annotated data. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, this Guideline supports large-scale analysis of judicial reasoning and future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.

2604.19762 2026-06-17 cs.CL 版本更新

Evidence of Layered Positional and Directional Constraints in the Voynich Manuscript: Implications for Cipher-Like Structure

伏尼契手稿中分层位置和方向约束的证据:对类密码结构的影响

Christophe Parisel

AI总结 通过分析伏尼契手稿的字素序列,发现词内从右到左优化和词边界从左到右依赖的双层结构,这种方向分离在四种对比语言中未出现;测试两类生成器均无法同时满足四个签名标准,表明手稿存在难以用简单位置或频率机制复现的类密码结构约束。

AI中文摘要

伏尼契手稿(VMS)展示了一种起源不明的文字,其字素序列一直抗拒语言学分析。我们对其字素序列进行了系统分析,揭示了两个互补的结构层:词内序列中字符级的从右到左优化,以及词边界处的从左到右依赖,这种方向分离在我们四种对比语言(英语、法语、希伯来语、阿拉伯语)中均未观察到。我们进一步根据一个四签名联合标准评估了两类结构化生成器:一个参数化的槽位生成器和一个实现Rugg(2004)胡言乱语假设的卡尔达诺格栅。在其全部测试参数空间中,两类生成器均无法同时再现所有四个签名。虽然这些结果并未排除我们未测试的生成器类别,但它们提供了第一个定量基准,未来任何关于VMS的生成或密码分析模型均可据此评估,并且表明VMS表现出类似密码的结构约束,这些约束难以仅通过简单的位置或频率机制复现。

英文摘要

The Voynich Manuscript (VMS) exhibits a script of uncertain origin whose grapheme sequences have resisted linguistic analysis. We present a systematic analysis of its grapheme sequences, revealing two complementary structural layers: a character-level right-to-left optimization in word-internal sequences and a left-to-right dependency at word boundaries, a directional dissociation not observed in any of our four comparison languages (English, French, Hebrew, Arabic). We further evaluate two classes of structured generator against a four-signature joint criterion: a parametric slot-based generator and a Cardan grille implementing Rugg's (2004) gibberish hypothesis. Across their full tested parameter spaces, neither class reproduces all four signatures simultaneously. While these results do not rule out generator classes we have not tested, they provide the first quantitative benchmarks against which any future generative or cryptanalytic model of the VMS can be evaluated, and they suggest that the VMS exhibits cipher-like structural constraints that are difficult to reproduce from simple positional or frequency-based mechanisms alone.

2606.15121 2026-06-17 cs.CL 版本更新

When Cognitive Graphs Meet LLMs: BDEI Cognitive Pathways for Panic Emotional Arousal Prediction

当认知图遇见大语言模型:恐慌情绪唤醒预测的BDEI认知路径

Mengzhu Liu, Long Qin, Chuan Ai, Zhengqiu Zhu, Hongru Liang, Chen Gao, Yong Li, Xin Lu, Quanjun Yin

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出PanicCognitivePath框架,通过心理安全距离模型融合多域信号,引入显式情绪节点构建BDEI认知路径,将LLM限制于单步参数估计,实现恐慌情绪唤醒时间预测,准确率提升10.68%。

AI中文摘要

在情绪显现前预测个体恐慌情绪唤醒时间对于主动应急干预至关重要。现有方法融合了认知元素,但均未显式建模情绪唤醒过程,因此不适用于情绪唤醒时间预测。我们认为,基于评价情绪理论进行预测是必要的,因为该理论显式建模了这一过程,但必须解决三个问题:(1) 评价理论认为情绪源于对多个威胁维度的同时评估,但尚无工作将这些输入融合为风险感知;(2) 现有认知模型缺乏情绪节点,将威胁评价与情绪唤醒解耦,迫使情绪从行为中间接推断;(3) 鉴于其可泛化的认知推理能力,当前方法采用LLM作为主要决策者,却忽视了其输出的脆弱性和易幻觉性。为解决这些问题,我们提出了PanicCognitivePath (PCP)框架,该框架同时解决了上述三个问题。基于心理距离理论的心理安全距离(PSD)模型将四域信号映射为统一的风险度量,作为后续认知推理的入口条件。在BDI中引入基于评价情绪理论的显式情绪节点,形成信念-欲望-情绪-意图(BDEI)路径。风险度量超过PSD阈值的智能体进入该路径,将威胁评价直接与情绪唤醒耦合。BDEI路径控制所有状态转换,而LLM被限制于信念到欲望转换的参数估计,将幻觉限制在单一步骤内并防止错误传播。在飓风桑迪上的实验表明,PCP将唤醒时间准确率较基线提升10.68%,峰值计数误差降至7.07%。

英文摘要

Predicting individual panic emotional arousal timing before manifestation is essential for proactive emergency intervention. Existing methods incorporate cognitive elements but none explicitly model the emotional arousal process, making them ill-suited for emotional arousal timing prediction. We argue that grounding prediction in appraisal emotion theory is necessary because it explicitly models this process, but three problems must be solved. (1) Appraisal theory posits that emotion arises from simultaneous evaluation across multiple threat dimensions, yet no prior work fuses these inputs into risk perception. (2) Existing cognitive models lack an Emotion node, decoupling threat appraisal from emotional arousal and forcing emotions to be inferred indirectly from behaviors. (3) Given their generalizable cognitive reasoning, current approaches adopt LLMs as the primary decision-maker, yet overlook the fragility and hallucination-proneness of their outputs. To address these issues, we introduce PanicCognitivePath (PCP), a framework that addresses all three. A Psychological Safety Distance (PSD) model, grounded in psychological distance theory, maps four-domain signals into a unified risk metric as the entry condition for subsequent cognitive reasoning. An explicit Emotion node grounded in appraisal emotion theory is introduced into BDI, forming a Belief-Desire-Emotion-Intention (BDEI) pathway. Agents whose risk metric exceeds the PSD threshold enter this pathway, coupling threat appraisal directly to emotional arousal. The BDEI pathway governs all state transitions while the LLM is confined to parameter estimation for the Belief-to-Desire transition, confining hallucinations to a single step and preventing error propagation. Experiments on Hurricane Sandy show that PCP improves arousal timing accuracy by 10.68% over baselines and reduces peak count error to 7.07%.

2407.13053 2026-06-17 cs.CY cs.AI cs.CL cs.LG 版本更新

E2Vec: Feature Embedding with Temporal Information for Analyzing Student Actions in E-Book Systems

E2Vec:基于时间信息的特征嵌入用于分析电子书系统中的学生行为

Yuma Miyazaki, Valdemar Švábenský, Yuta Taniguchi, Fumiya Okubo, Tsubasa Minematsu, Atsushi Shimada

发表机构 * Kyushu University(九州大学)

AI总结 提出E2Vec方法,利用词嵌入将操作日志和时间间隔转化为学生向量,用于风险检测任务,提升泛化性和性能。

Comments Research paper published in the Proceedings of the 17th Educational Data Mining Conference (EDM 2024), see https://doi.org/10.5281/zenodo.12729853

AI中文摘要

数字教科书(电子书)系统将学生与教科书的交互记录为一系列事件,称为事件流数据。过去,研究人员从事件流中提取有意义的特征,并将其用作下游任务(如成绩预测和学生行为建模)的输入。先前的研究评估了主要使用基于统计的特征(如操作类型数量或访问频率)的模型。虽然这些特征有助于提供某些见解,但它们缺乏捕捉不同学生学习行为中细粒度差异的时间信息。本研究提出E2Vec,一种基于词嵌入的新型特征表示方法。该方法将每个学生的操作日志及其时间间隔视为字符字符串序列,并生成包含时间信息的学习活动特征的学生向量。我们应用fastText为来自两年计算机科学课程数据集的305名学生生成嵌入向量。然后,我们研究了E2Vec在风险检测任务中的有效性,展示了其泛化性和性能潜力。

英文摘要

Digital textbook (e-book) systems record student interactions with textbooks as a sequence of events called EventStream data. In the past, researchers extracted meaningful features from EventStream, and utilized them as inputs for downstream tasks such as grade prediction and modeling of student behavior. Previous research evaluated models that mainly used statistical-based features derived from EventStream logs, such as the number of operation types or access frequencies. While these features are useful for providing certain insights, they lack temporal information that captures fine-grained differences in learning behaviors among different students. This study proposes E2Vec, a novel feature representation method based on word embeddings. The proposed method regards operation logs and their time intervals for each student as a string sequence of characters and generates a student vector of learning activity features that incorporates time information. We applied fastText to generate an embedding vector for each of 305 students in a dataset from two years of computer science courses. Then, we investigated the effectiveness of E2Vec in an at-risk detection task, demonstrating potential for generalizability and performance.
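
The idea of treating operation logs and their time intervals as a character string can be illustrated with a small sketch. Everything specific here is hypothetical and not taken from the paper: the operation names, the symbol table, and the interval thresholds of 10 and 300 seconds are placeholders chosen only to show the shape of such an encoding.

```python
# Hypothetical mapping from operation names to single characters.
OP_SYMBOLS = {"NEXT": "N", "PREV": "P", "ADD_MARKER": "M", "ADD_MEMO": "E"}


def interval_symbol(seconds):
    """Bucket the gap between two events into one character (assumed thresholds)."""
    if seconds < 10:
        return "s"  # short gap
    if seconds < 300:
        return "m"  # medium gap
    return "l"      # long gap


def encode_session(events):
    """Turn time-ordered (operation, timestamp_in_seconds) pairs into a string.

    Each event contributes its operation symbol, and each gap between
    consecutive events contributes an interval symbol placed between them.
    Unknown operations are written as "?".
    """
    parts = []
    previous_time = None
    for op, timestamp in events:
        if previous_time is not None:
            parts.append(interval_symbol(timestamp - previous_time))
        parts.append(OP_SYMBOLS.get(op, "?"))
        previous_time = timestamp
    return "".join(parts)
```

For example, `encode_session([("NEXT", 0), ("NEXT", 5), ("ADD_MARKER", 65), ("PREV", 1000)])` yields `"NsNmMlP"`. Strings of this kind could then be passed to a word-embedding tool such as fastText, which is the embedding step the abstract describes.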

2507.05169 2026-06-17 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Critique of World Model: A Generative Latent Prediction Architecture for World Modeling

世界模型批判:一种用于世界建模的生成式潜在预测架构

Eric Xing, Mingkai Deng, Jinyu Hou

AI总结 本文从心理学“假设性思维”出发,提出世界模型的核心目标是模拟真实世界的所有可行动可能性,并设计了一种基于状态化、分层、多级、混合连续/离散表示的生成式潜在预测(GLP)架构。

AI中文摘要

世界模型,即生物智能体所经历并对其采取行动的真实世界环境的算法模拟器,近年来因开发具有人工(通用)智能的虚拟智能体的需求日益增长而成为一个新兴课题。关于世界模型究竟是什么、如何构建、如何使用以及如何评估,已有许多讨论。本文从著名科幻经典《沙丘》中的想象出发,并借鉴心理学文献中“假设性思维”的概念,论证世界模型的主要目标是模拟真实世界中所有可行动的可能性,以进行有目的的推理和行动。我们审视了世界建模的关键设计维度:数据、表示、架构、学习目标和使用,调查了现有方法并分析了它们的权衡。在此基础上,我们提出了一种新的通用世界模型生成式潜在预测(GLP)架构,基于有状态的、分层的、多层次的、混合连续/离散表示,以及生成式和自监督学习框架,并展望了由这种模型支持的物理、智能体和嵌套(PAN)AGI系统。

英文摘要

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2601.06116 2026-06-17 cs.AI cs.CL cs.CY 版本更新

The Homogenization Problem in LLMs: Towards Meaningful Diversity in AI Safety

大语言模型中的同质化问题:迈向人工智能安全中的有意义多样性

Ian Rios-Sialer

发表机构 * Independent Researcher(独立研究者)

AI总结 本文探讨了大语言模型中同质化问题,提出通过编码价值观系统来促进多样性,通过实验揭示性别偏见并引入xeno-reproduction概念以缓解同质化。

AI中文摘要

生成式AI模型会复制训练数据中的人类偏见,并通过模式崩溃等机制进一步放大这些偏见。多样性的丧失导致同质化,这不仅损害被边缘化的少数群体,也使所有人的处境变得贫乏。我们主张同质化应成为人工智能安全的核心关注点。为有意义地表征大语言模型(LLM)中的同质化,我们引入一个框架,允许利益相关者编码其语境和价值体系。我们通过一项实验来说明该方法,该实验揭示了一个大语言模型(Claude 3.5 Haiku)在开放式故事提示中的性别偏见。基于酷儿理论,我们从规范性的角度对同质化进行形式化。借用女性主义理论的语言,我们引入xeno-reproduction这一概念,作为一类通过促进多样性来缓解同质化的任务。我们的工作开启了一条协作研究路线,旨在理解和推进AI中的多样性。

英文摘要

Generative AI models reproduce the human biases in their training data and further amplify them through mechanisms such as mode collapse. The loss of diversity produces homogenization, which not only harms the minoritized but impoverishes everyone. We argue homogenization should be a central concern in AI safety. To meaningfully characterize homogenization in Large Language Models (LLMs), we introduce a framework that allows stakeholders to encode their context and value system. We illustrate our approach with an experiment that surfaces gender bias in an LLM (Claude 3.5 Haiku) on an open-ended story prompt. Building from queer theory, we formalize homogenization in terms of normativity. Borrowing language from feminist theory, we introduce the concept of xeno-reproduction as a class of tasks for mitigating homogenization by promoting diversity. Our work opens a collaborative line of research that seeks to understand and advance diversity in AI.

2509.03932 2026-06-17 cs.CL cs.CY cs.LG 版本更新

KPoEM: A Human-Annotated Dataset for Emotion Classification and RAG-Based Poetry Generation in Korean Modern Poetry

KPoEM:用于韩国现代诗歌情感分类与基于RAG的诗歌生成的人工标注数据集

Iro Lim, Haein Ji, Byungjun Kim

发表机构 * The Academy of Korean Studies(韩国学中央研究院) Graduate School of Korean Studies(韩国学大学院) Cultural Informatics(文化信息学)

AI总结 本研究构建了KPoEM多标签情感数据集,通过序列微调策略实现F1-micro 0.60的情感分类,并验证了基于RAG的诗歌生成在韩国文学情感与文化表达上的可行性。

Comments 43 pages, 22 tables, 3 figures, Digital Humanities and Social Sciences Korea Conference, James Joo-Jin Kim Center for Korean Studies, University of Pennsylvania, Philadelphia, USA

Journal ref
The Review of Korean Studies 29(1) (2026) 161-206
AI中文摘要

本研究介绍了KPoEM(韩国诗歌情感映射),这是一个新颖的数据集,为现代韩国诗歌中以情感为中心的分析和生成式应用奠定了基础。尽管自然语言处理取得了进展,但由于诗歌复杂的比喻语言和文化特异性,其研究仍不充分。我们构建了一个包含7,662条条目(7,007条行级和615条作品级)的多标签数据集,条目选自五位有影响力的韩国诗人的作品,并用44个细粒度情感类别进行标注。通过序列策略(从通用语料库到专门的KPoEM数据集)微调的KPoEM情感分类模型,实现了0.60的F1-micro分数,显著优于之前的模型(0.43)。该模型在保留核心诗歌情感的同时,识别具有时代和文化特异性的情感表达的能力有所增强。此外,将结构化情感数据集应用于基于RAG的诗歌生成模型,证明了生成反映韩国文学情感和文化敏感性的文本在实证上是可行的。这种综合方法加强了计算技术与文学分析之间的联系,为定量情感研究和生成诗学开辟了新途径。总体而言,本研究为推进现代韩国诗歌中以情感为中心的分析和创作提供了基础。

英文摘要

This study introduces KPoEM (Korean Poetry Emotion Mapping), a novel dataset that serves as a foundation for both emotion-centered analysis and generative applications in modern Korean poetry. Despite advancements in NLP, poetry remains underexplored due to its complex figurative language and cultural specificity. We constructed a multi-label dataset of 7,662 entries (7,007 line-level and 615 work-level), annotated with 44 fine-grained emotion categories from five influential Korean poets. The KPoEM emotion classification model, fine-tuned through a sequential strategy that moves from general-purpose corpora to the specialized KPoEM dataset, achieved an F1-micro score of 0.60, significantly outperforming previous models (0.43). The model demonstrates an enhanced ability to identify temporally and culturally specific emotional expressions while preserving core poetic sentiments. Furthermore, applying the structured emotion dataset to a RAG-based poetry generation model demonstrates the empirical feasibility of generating texts that reflect the emotional and cultural sensibilities of Korean literature. This integrated approach strengthens the connection between computational techniques and literary analysis, opening new pathways for quantitative emotion research and generative poetics. Overall, this study provides a foundation for advancing emotion-centered analysis and creation in modern Korean poetry.

2509.13196 2026-06-17 cs.CL 版本更新

The Few-shot Dilemma: Over-prompting Large Language Models

少样本困境:过度提示大型语言模型

Yongjian Tang, Doruk Tuncel, Christian Koerner, Thomas Runkler

发表机构 * Siemens AG(西门子股份公司) Technical University of Munich(慕尼黑工业大学)

AI总结 本文提出一个提示框架,使用随机采样、语义嵌入和TF-IDF三种少样本选择方法,在多个LLM上实验发现过多领域特定示例会降低性能,并通过TF-IDF与分层采样结合找到最优示例数量,在软件需求分类上超越现有方法1%。

Comments accepted for the main track of FLLM

AI中文摘要

过度提示是一种现象,即提示中过多的示例导致大型语言模型(LLMs)性能下降,挑战了关于上下文少样本学习的传统观点。为了研究这种少样本困境,我们概述了一个提示框架,该框架利用三种标准的少样本选择方法——随机采样、语义嵌入和TF-IDF向量——并在多个LLM上评估这些方法,包括GPT-4o、GPT-3.5-turbo、DeepSeek-V3、Gemma-3、LLaMA-3.1、LLaMA-3.2和Mistral。我们的实验结果表明,在提示中加入过多的领域特定示例可能会在某些LLM中反常地降低性能,这与先前认为更多相关少样本示例普遍有利于LLM的实证结论相矛盾。鉴于LLM辅助软件工程和需求分析的趋势,我们在两个真实世界的软件需求分类数据集上进行了实验。通过逐步增加TF-IDF选择和分层的少样本示例数量,我们为每个LLM确定了其最优数量。这种组合方法以更少的示例实现了更优的性能,避免了过度提示问题,从而在功能性和非功能性需求分类上超越了现有技术1%。

英文摘要

Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods (random sampling, semantic embedding, and TF-IDF vectors) and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets. By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM. This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state-of-the-art by 1% in classifying functional and non-functional requirements.
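
One way to picture TF-IDF-based, stratified few-shot selection is the sketch below. It is an illustration under stated assumptions rather than the authors' pipeline: whitespace tokenization, smoothed IDF, cosine similarity, and a fixed per-label quota (`k_per_label`) as the stratification rule are all choices made here, and the labels "F" and "NF" in the usage note are placeholders for functional and non-functional requirements.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately simple assumption)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed inverse document frequency, so every weight stays positive.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_few_shot(query, pool, k_per_label):
    """Pick up to k_per_label pool examples per label, most similar to the query first.

    pool: list of (text, label) pairs. The per-label quota is the assumed
    form of stratification used in this sketch.
    """
    texts = [text for text, _ in pool]
    vectors = tfidf_vectors(texts + [query])
    query_vec = vectors[-1]
    by_label = {}
    for (text, label), vec in zip(pool, vectors[:-1]):
        by_label.setdefault(label, []).append((cosine(query_vec, vec), text))
    selected = []
    for label in sorted(by_label):
        ranked = sorted(by_label[label], key=lambda pair: pair[0], reverse=True)
        selected.extend((text, label) for _, text in ranked[:k_per_label])
    return selected
```

Sweeping `k_per_label` upward and re-scoring the classifier at each step is one way to look for the point where extra examples stop helping, which is the kind of search the abstract describes.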

2503.08679 2026-06-17 cs.AI cs.CL cs.LG 版本更新

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

现实中的思维链推理并不总是忠实的

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy

发表机构 * Poseidon Research(Poseidon研究)

AI总结 研究发现,在自然语言提示下,模型有时会生成表面连贯但自相矛盾的思维链,揭示出隐含的事后合理化现象,且前沿模型也未能完全避免。

Comments Published at the 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

最近的研究表明,当面对提示中的显式偏见时,模型通常会在其思维链(CoT)输出中省略提及这些偏见,揭示出口头推理可能给出模型如何得出错误结论的不正确图景(不忠实)。在这项工作中,我们展示了不忠实的CoT也发生在自然措辞、非对抗性的提示上,而无需添加人为偏见或编辑模型输出。我们发现,当分别呈现问题“X比Y大吗?”和“Y比X大吗?”时,模型有时会生成表面连贯的论证来证明系统性地对两者都回答“是”或都回答“否”是合理的,尽管存在矛盾。我们提供了初步证据表明这是由于模型对“是”或“否”的隐含偏见,并将其标记为隐含的事后合理化。我们的结果显示,生产模型的不忠实率高达13%,而前沿模型虽然更忠实,但没有一个完全忠实,包括像DeepSeek R1(0.37%)和Sonnet 3.7 with thinking(0.04%)这样的思考模型。我们还研究了不忠实的非逻辑捷径,即模型使用微妙的非逻辑推理来使对困难数学问题的推测性答案看起来经过严格证明。我们的发现表明,虽然CoT可用于评估输出,但它并不是产生模型答案的内部过程的完整描述,应在代理或安全关键环境中谨慎使用。

英文摘要

Recent studies indicate that when faced with explicit biases in prompts, models often omit mentioning these biases in their Chain-of-Thought (CoT) output, revealing that verbalized reasoning can give an incorrect picture of how models arrive at conclusions (unfaithfulness). In this work, we show that unfaithful CoT also occurs on naturally worded, non-adversarial prompts without adding artificial biases or editing model outputs. We find that when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify systematically answering Yes to both or No to both, despite the contradiction. We present preliminary evidence that this is due to models' implicit biases towards Yes or No, labeling this Implicit Post-Hoc Rationalization. Our results reveal rates up to 13% for production models, and while frontier models are more faithful, none are entirely so, including thinking models like DeepSeek R1 (0.37%) and Sonnet 3.7 with thinking (0.04%). We also investigate Unfaithful Illogical Shortcuts, where models use subtly illogical reasoning to make speculative answers to hard math problems seem rigorously proven. Our findings indicate that while CoT can be useful for assessing outputs, it is not a complete account of the internal process that produced the model's answer and should be used with caution in agentic or safety-critical settings.
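
The reversed-question check described in this abstract can be sketched as follows. This is a simplified illustration, not the paper's evaluation protocol: it assumes each question pair has already been reduced to one final "Yes" or "No" per direction, that X and Y are never exactly equal (otherwise No/No would be consistent), and that the function names are hypothetical.

```python
from collections import Counter


def normalize(answer):
    """Map a raw final answer to "yes" or "no", or None if it is neither."""
    token = answer.strip().lower().rstrip(".!")
    if token in ("yes", "no"):
        return token
    return None


def classify_pair(answer_xy, answer_yx):
    """Label the answers to "Is X bigger than Y?" and "Is Y bigger than X?".

    Assuming X != Y, exactly one of the two answers should be "yes".
    """
    a, b = normalize(answer_xy), normalize(answer_yx)
    if a is None or b is None:
        return "unparsed"
    if a == b:
        return "contradictory_" + a  # Yes to both, or No to both
    return "consistent"


def contradiction_rate(pairs):
    """Share of parsed pairs whose two answers contradict each other."""
    labels = Counter(classify_pair(a, b) for a, b in pairs)
    parsed = sum(n for label, n in labels.items() if label != "unparsed")
    if parsed == 0:
        return 0.0
    contradictory = labels["contradictory_yes"] + labels["contradictory_no"]
    return contradictory / parsed
```

A pair flagged here only shows that the two final answers cannot both be right; judging whether the accompanying reasoning is a post-hoc rationalization requires reading the chains of thought themselves.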

2408.15188 2026-06-17 eess.AS cs.CL cs.SD 版本更新

Infusing Acoustic Pause Context into Text-Based Dementia Assessment

将语音停顿上下文注入基于文本的痴呆症评估

Franziska Braun, Sebastian P. Bayerl, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

发表机构 * Technische Hochschule Nürnberg(纽伦堡应用技术大学) Technische Hochschule Rosenheim(罗森海姆应用技术大学) Klinik für Psychiatrie und Psychotherapie, Universitätsklinik der Paracelsus Medizinischen Privatuniversität, Klinikum Nürnberg, Germany(德国纽伦堡医院,帕拉塞尔苏斯私立医科大学附属医院精神病学与心理治疗科) KST Institut GmbH, Bad Emstal, Germany(德国巴特埃姆斯塔尔KST研究所有限公司)

AI总结 本文研究利用停顿增强的转录文本,通过Transformer语言模型区分无认知障碍、轻度认知障碍和阿尔茨海默病患者,探讨停顿信息和声学上下文对不同任务的影响。

Comments Accepted at INTERSPEECH 2024

Journal ref
Proceedings of Interspeech 2024
AI中文摘要

语音停顿与内容和结构一起,为检测痴呆症提供了一种有价值的、非侵入性的生物标志物。本工作探讨了在基于Transformer的语言模型中使用包含停顿的转录文本,根据受试者在临床评估中的语音,区分无认知障碍、轻度认知障碍和阿尔茨海默病痴呆受试者的认知状态。我们处理了三个二元分类任务:发病、监测和痴呆排除。通过在德语口头流畅性测试和图片描述测试上的实验评估性能,比较模型在不同语音生成上下文中的有效性。从文本基线开始,我们探讨了整合停顿信息和声学上下文的效果。我们展示了测试应根据任务选择,同样,词汇停顿信息和声学交叉注意力对不同任务的贡献也不同。

英文摘要

Speech pauses, alongside content and structure, offer a valuable and non-invasive biomarker for detecting dementia. This work investigates the use of pause-enriched transcripts in transformer-based language models to differentiate the cognitive states of subjects with no cognitive impairment, mild cognitive impairment, and Alzheimer's dementia based on their speech from a clinical assessment. We address three binary classification tasks: onset, monitoring, and dementia exclusion. The performance is evaluated through experiments on a German Verbal Fluency Test and a Picture Description Test, comparing the model's effectiveness across different speech production contexts. Starting from a textual baseline, we investigate the effect of incorporating pause information and acoustic context. We show that the test should be chosen depending on the task and that, similarly, lexical pause information and acoustic cross-attention contribute differently depending on the task.

2206.06208 2026-06-17 eess.AS cs.CL cs.SD 版本更新

Automated Evaluation of Standardized Dementia Screening Tests

标准化痴呆筛查测试的自动化评估

Franziska Braun, Markus Förstel, Bastian Oppermann, Andreas Erzigkeit, Thomas Hillemacher, Hartmut Lehfeld, Korbinian Riedhammer

AI总结 本文研究了标准化痴呆筛查测试的自动化评分方法,通过分析手动和自动转录本的评分相关性,发现自动评分在某些任务上比人工评分更严格,但整体仍保持高相关性。

Comments Submitted to Interspeech 2022. arXiv admin note: text overlap with arXiv:2206.05018

Journal ref
Proceedings of Interspeech 2022
AI中文摘要

在痴呆筛查和监测中,标准化测试在临床实践中起关键作用,因为它们旨在通过测量多种认知任务的表现来最小化主观性。本文报告了一项研究,该研究包括一个半标准化的病史采集,随后是两种标准化的神经心理学测试,即SKT和CERAD-NB。这些测试既包括命名物体、学习词列表等基本任务,也包括MMSE等广泛使用的工具。大多数任务是口头进行的,因此应适合基于转录文本的自动化评分。对于首批30名患者,我们分析了专家手动评分与基于手动和自动转录的自动评分之间的相关性。对于SKT和CERAD-NB,我们观察到使用手动转录本时的高到完美相关性;对于某些相关性较低的任务,自动评分比人类参考更严格,因为其仅限于音频。使用自动转录本时,相关性如预期般下降,且与识别准确率相关;然而,我们仍观察到高达0.98(SKT)和0.85(CERAD-NB)的高相关性。我们证明使用词替代可以缓解识别错误,从而提高与专家评分的相关性。

英文摘要

For dementia screening and monitoring, standardized tests play a key role in clinical routine since they aim at minimizing subjectivity by measuring performance on a variety of cognitive tasks. In this paper, we report on a study that consists of a semi-standardized history taking followed by two standardized neuropsychological tests, namely the SKT and the CERAD-NB. The tests include basic tasks such as naming objects and learning word lists, but also widely used tools such as the MMSE. Most of the tasks are performed verbally and should thus be suitable for automated scoring based on transcripts. For the first batch of 30 patients, we analyze the correlation between expert manual evaluations and automatic evaluations based on manual and automatic transcriptions. For both SKT and CERAD-NB, we observe high to perfect correlations using manual transcripts; for certain tasks with lower correlation, the automatic scoring is stricter than the human reference since it is limited to the audio. Using automatic transcriptions, correlations drop as expected and are related to recognition accuracy; however, we still observe high correlations of up to 0.98 (SKT) and 0.85 (CERAD-NB). We show that using word alternatives helps to mitigate recognition errors and subsequently improves correlation with expert scores.