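arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.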

2605.03299 2026-06-03 cs.CL Updated version

LLM-XTM: Enhancing Cross-Lingual Topic Models with Large Language Models

Minh Chu Xuan, Tien-Phat Nguyen, Linh Ngo Van, Dinh Viet Sang, Nguyen Thi Ngoc Diep, Trung Le

Affiliations: Hanoi University of Science and Technology; VNU University of Engineering and Technology; Monash University

AI summary: Proposes the LLM-XTM framework, which uses LLM-guided topic refinement and self-consistency uncertainty quantification to consistently improve, in a black-box manner, the coherence and alignment of cross-lingual topic models while reducing reliance on bilingual resources.

Comments: ACL 2026

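2606.03990 2026-06-03 cs.LG cs.CL cs.CV Updated version

Neuron Populations Exhibit Divergent Selectivity with Scale

Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman

Affiliations: UC Berkeley; TTIC

AI summary: By analyzing the distribution and properties of Rosetta Neurons across model scales, the paper finds that their number follows a sublinear power law and that their selectivity increases with scale, whereas non-Rosetta neurons remain less selective; it proposes an analytical model balancing feature utility against neuron capacity to explain this polarization.

Comments: Project page and code: https://avdravid.github.io/rosetta-neuron-scaling/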

Abstract

We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. To probe this question, we study Rosetta Neurons, a previously characterized class of neurons whose activation patterns are similar across independently trained models (Dravid et al., 2023). In separate analyses of language models up to 30B parameters and vision models up to 5B parameters, we observe that the population of Rosetta Neurons follows a sublinear power law in model size, growing in absolute number but occupying a shrinking fraction of the total neuron count. We further observe a Neuron Polarization Effect: Rosetta Neurons become more selective and increasingly monosemantic with scale, separating from a growing non-Rosetta population that remains less selective. An analytical model balancing feature utility against limited neuron capacity explains the sublinear power-law scaling and this polarization effect. Finally, we find that Rosetta Neurons become more domain-specialized with scale and illustrate their selectivity through a targeted data-filtering case study for continued pretraining. Our results point to a scaling law for interpretable, shared neuron-level structure, linking model size to systematic changes in neuron universality, selectivity, and specialization.
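A sublinear power law of the kind described above can be checked with a least-squares fit in log-log space. The following is a minimal sketch, not the authors' analysis: `fit_power_law` is an illustrative name, and the (size, count) pairs are synthetic placeholders generated from an assumed exponent of 0.5, not values reported in the paper.

```python
import math

def fit_power_law(sizes, counts):
    """Fit counts ~ c * sizes**alpha by ordinary least squares on logs."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(k) for k in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    alpha = sxy / sxx                 # slope in log-log space = exponent
    c = math.exp(my - alpha * mx)     # intercept = log of the prefactor
    return alpha, c

# Synthetic data (assumption): counts generated as 3 * size**0.5.
sizes = [1e8, 1e9, 1e10, 3e10]
counts = [3.0 * s ** 0.5 for s in sizes]
alpha, c = fit_power_law(sizes, counts)

# With alpha < 1 the count grows in absolute terms while the ratio
# count / size shrinks as size increases.
ratios = [k / s for k, s in zip(counts, sizes)]
```

An exponent below 1 is what "sublinear" means here: each tenfold increase in size multiplies the count by less than ten.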

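2606.03982 2026-06-03 cs.CL Updated version

Language Models Compare Quantities Using Number-specific and Unit-specific Heuristics

Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling

AI summary: Through controlled experiments, this study finds that when comparing quantities with units, language models do not perform exact scale conversion but rely on heuristics based on the numerical difference and the unit-scale difference, which leads to systematic errors near the comparison boundary.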

2606.03980 2026-06-03 cs.LG cs.CL Updated version

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang

Affiliations: Qwen Large Model Application Team, Alibaba; Sun Yat-sen University; The Chinese University of Hong Kong; Peking University; ETH Zürich; University of Zurich

AI summary: Proposes the Skill-RM framework, which reformulates reward modeling as the execution of a reusable reward-evaluation skill and unifies heterogeneous evaluation criteria by dynamically selecting and aggregating evidence; it outperforms traditional methods on reward benchmarks and downstream tasks.

Abstract

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.

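2606.03969 2026-06-03 cs.CL cs.AI Updated version

Quantifying Faithful Confidence Expression in Large Reasoning Models

Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan

Affiliations: Yale University

AI summary: Addressing the difficulty large reasoning models (LRMs) have in faithfully expressing intrinsic confidence in long chain-of-thought outputs, the paper proposes a framework based on token probabilities, hidden states, and response consistency to systematically quantify the alignment between linguistic decisiveness and internal uncertainty.

Comments: Code: https://github.com/yale-nlp/faithful_lrm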

Abstract

Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain-of-thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace--complicating estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix-conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high-stakes contexts.
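Of the three uncertainty sources named above, sampled response consistency is the simplest to illustrate. The sketch below is a generic agreement-rate estimator and a naive gap measure, written under stated assumptions: the function names, the sample answers, and the expressed-decisiveness score of 0.95 are all hypothetical, and the paper's actual estimators and metric may differ.

```python
from collections import Counter

def consistency_confidence(answers):
    """Return (majority answer, share of samples that agree with it)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top, freq = Counter(answers).most_common(1)[0]
    return top, freq / len(answers)

def faithfulness_gap(expressed, intrinsic):
    """Absolute gap between expressed and intrinsic confidence, both in [0, 1].

    This gap is an illustrative stand-in, not the paper's metric.
    """
    return abs(expressed - intrinsic)

# Hypothetical final answers from five sampled reasoning traces.
samples = ["42", "42", "41", "42", "40"]
answer, conf = consistency_confidence(samples)

# Hypothetical decisiveness score for a very assertive response.
gap = faithfulness_gap(0.95, conf)
```

A large gap of this kind corresponds to a response that sounds more certain than the model's own sampling behavior supports.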

2606.03968 2026-06-03 cs.CL cs.AI Updated version

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin, Xin Liu, Zixuan Zhang, Priyanka Nigam, Bing Yin, Tuo Zhao, Chao Zhang

Affiliations: Amazon; Georgia Institute of Technology

AI summary: Addressing the rubric-quality bottleneck caused by a fixed query distribution in rubric-based RL, the paper proposes QUBRIC, which co-designs queries and rubrics using teacher-derived key points, contrastive rubric generation, and learnability filtering; it gains +5.5 points on ArenaHard and generalizes to legal, moral, and narrative reasoning tasks.

Abstract

Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.
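The "no reward signal" failure mentioned above follows from how group-normalized advantages work: if every sampled response to a query gets the same rubric score, all advantages are zero. The sketch below shows that effect and one plausible reading of learnability filtering (keep only pairs whose rewards vary). The filter criterion, threshold, and all names and scores are assumptions for illustration, not the paper's procedure.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages in the style of GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def is_learnable(rewards, min_std=1e-6):
    """Assumed criterion: rewards within the group must differ."""
    return pstdev(rewards) > min_std

def filter_pairs(pairs):
    """pairs: list of (pair_id, rewards for the sampled responses)."""
    return [pid for pid, rewards in pairs if is_learnable(rewards)]

# Hypothetical rubric scores for four sampled responses per query.
pairs = [
    ("q1", [0.0, 0.0, 0.0, 0.0]),  # all responses fail: no signal
    ("q2", [0.2, 0.8, 0.5, 0.5]),  # spread in scores: informative
    ("q3", [1.0, 1.0, 1.0, 1.0]),  # all responses pass: no signal
]
kept = filter_pairs(pairs)
dead = group_advantages(pairs[0][1])
```

Only `q2` survives the filter in this toy case, since the other two groups would contribute zero advantage to every response.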

2606.03967 2026-06-03 cs.CL cs.AI Updated version

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

Quentin Fuxa, Dominik Macháček

Affiliations: Charles University, MFF, ÚFAL; University of Edinburgh

AI summary: Proposes AlignAtt4LLM, which applies the AlignAtt policy to a decoder-only LLM for the first time through an explicit source span, offline selection of translation-specific alignment heads, selective qk-fast replay, and runtime query/key capture; it outperforms the baselines on English-to-German and English-to-Italian simultaneous speech translation.

Comments: Accepted to IWSLT 2026

Abstract

We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy. To our knowledge, this is the first application of AlignAtt to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. We recover a usable policy by proposing (1) an explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that preserves model outputs bit-identically. On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for the European target languages, English to German and English to Italian, in both the low-latency regime around 2 seconds and the high-latency regime below 4 seconds CU-LongYAAL. Results for English to Chinese are more mixed, but the method is not tied to Gemma-4: because AlignAtt4LLM only requires a deterministic prompt layout, calibrated attention heads, and query/key capture, the same policy can be reapplied to stronger translation-focused decoder-only MT backbones for non-European target languages.
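For readers unfamiliar with AlignAtt, the core decision rule as it is commonly described is: a draft token is emitted only if the source position it attends to most is not among the most recent source positions. The sketch below implements that generic rule on a plain attention matrix. It assumes the attention has already been averaged over the selected alignment heads and restricted to the source span; the function name, threshold parameter, and toy matrix are illustrative and not taken from the system described above.

```python
def alignatt_emit_count(attn, recent_threshold):
    """Count how many leading draft tokens may be emitted.

    attn[i][j] is the attention of draft token i on source position j.
    Emission stops at the first token whose most-attended source
    position lies within the last `recent_threshold` positions.
    """
    if not attn:
        return 0
    src_len = len(attn[0])
    cutoff = src_len - recent_threshold  # positions >= cutoff are too recent
    emitted = 0
    for row in attn:
        most_attended = max(range(src_len), key=row.__getitem__)
        if most_attended >= cutoff:
            break
        emitted += 1
    return emitted

# Toy attention of four draft tokens over four source positions.
attn = [
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],  # leans on the newest source position
    [0.60, 0.20, 0.10, 0.10],
]
n_low_latency = alignatt_emit_count(attn, recent_threshold=1)
n_conservative = alignatt_emit_count(attn, recent_threshold=3)
```

A larger threshold makes the policy wait for more source context before committing, which trades latency for stability.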

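2606.03965 2026-06-03 cs.CL cs.AI Updated version

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley

AI summary: Proposes Agentic Chain-of-Thought Steering (ACTS), which trains a controller agent with reinforcement learning to adaptively choose reasoning strategies and steering phrases during inference, enabling budget-aware strategy control that saves substantial tokens while preserving reasoning quality and supports a controllable accuracy-efficiency trade-off.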

Abstract

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.

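2606.03957 2026-06-03 cs.CL cs.AI cs.SD eess.AS Updated version

Efficient ASR Training with Conversations that Never Happened

Máté Gedeon, Péter Mihajlik

Affiliations: Dept. of Telecommunications and Artificial Intelligence, Budapest University of Technology and Economics; SpeechTex Ltd.; ELTE Research Centre for Linguistics

AI summary: For lower-resource languages and niche domains, the paper proposes an augmentation pipeline that generates dialogue scenarios with an LLM, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances; experiments show that synthetic conversations improve ASR performance, and on a Hungarian benchmark a configuration with only 67 hours of real conversations and 636 hours of simulated data outperforms a zero-shot model trained on 2700 hours of speech.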

Abstract

Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.

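2606.03948 2026-06-03 cs.CL Updated version

A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026

Aziz Sharipov Ortega, Dominik Macháček

Affiliations: Charles University, MFF, ÚFAL

AI summary: This work gives the offline direct speech-to-text translation model Canary simultaneous capability by combining it with the state-of-the-art AlignAtt policy, and submits systems for Czech-to-English and for English-to-German and English-to-Italian to the IWSLT 2026 simultaneous speech translation shared task, demonstrating high translation quality, low computational requirements, and multilingual support.

Comments: IWSLT 2026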

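2606.03928 2026-06-03 cs.LG cs.CL Updated version

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang, Jesse Thomason, Robin Jia

Affiliations: University of Southern California; University of Chicago

AI summary: To address the KV cache bottleneck caused by the long outputs of reasoning models, the paper proposes VaSE, a value-aware stochastic eviction method that protects large-magnitude value states and introduces stochasticity; at 4x compression it improves accuracy by more than 4% over the strongest eviction method.

Comments: Codes: https://github.com/terarachang/VaSE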

Abstract

Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting them causes catastrophic failure where models enter repetitive reasoning loops. Second, introducing stochasticity during eviction improves accuracy by increasing cache diversity. Based on these findings, we propose Value-aware Stochastic KV Cache Eviction (VaSE), a training-free recipe that protects large-magnitude value states and promotes diverse eviction decisions. Across six reasoning tasks, Qwen3 models using VaSE with 4x KV cache compression yield higher average accuracies than the SOTA selection method at the same sparsity, while outperforming the strongest eviction method by more than 4%. Overall, VaSE bridges the gap between efficiency and accuracy, supporting FlashAttention2 and enabling a static memory footprint for reasoning models.
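The two ingredients named above, protecting large-magnitude value states and making eviction stochastic, can be combined in a small selection routine. The sketch below is only one way to do so and is not the VaSE recipe: it assumes per-token importance scores are already available, protects a fixed top fraction by value norm, and fills the remaining budget by weighted sampling without replacement (Efraimidis-Spirakis keys). All names, parameters, and numbers are hypothetical.

```python
import math
import random

def select_kept(values, scores, budget, protect_frac=0.1, rng=None):
    """Return sorted indices of cache entries to keep.

    values: one value vector per cached token.
    scores: positive importance score per token (assumed given).
    budget: number of entries to keep after eviction.
    """
    rng = rng or random.Random(0)
    n = len(values)
    budget = min(budget, n)
    # 1) Always keep the entries with the largest value-state norms.
    norms = [math.sqrt(sum(v * v for v in vec)) for vec in values]
    n_protect = min(budget, max(1, int(protect_frac * n)))
    by_norm = sorted(range(n), key=lambda i: norms[i], reverse=True)
    kept = set(by_norm[:n_protect])
    # 2) Fill the rest stochastically, favoring higher scores:
    #    key = u ** (1 / weight), keep the largest keys.
    rest = [i for i in range(n) if i not in kept]
    keys = {i: rng.random() ** (1.0 / max(scores[i], 1e-12)) for i in rest}
    rest.sort(key=lambda i: keys[i], reverse=True)
    kept.update(rest[: budget - len(kept)])
    return sorted(kept)

# Toy cache of 8 tokens; token 3 has an outlier value state.
values = [[0.1, 0.2] for _ in range(8)]
values[3] = [9.0, 9.0]
scores = [0.05, 0.30, 0.10, 0.01, 0.20, 0.10, 0.14, 0.10]
kept = select_kept(values, scores, budget=2)  # 4x compression of 8 entries
```

Token 3 has the lowest importance score in this toy setup, so a purely score-based policy would tend to evict it; the norm-based protection keeps it regardless of the random draw.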

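2606.03924 2026-06-03 cs.CL Updated version

Knowledge Editing in Masked Diffusion Language Models

Haewon Park, Yohan Jo

Affiliations: Graduate School of Data Science, Seoul National University

AI summary: Studies the transfer of locate-then-edit methods from autoregressive models to masked diffusion models, finding that the edit location transfers but multi-token editing degrades, and proposes a simple correction that optimizes the edit for intermediate states.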

Abstract

Knowledge editing aims to update or correct factual knowledge in a language model. A widely used approach, locate-then-edit, does this in two steps: it first localizes a fact within the model, then edits the weights there. To date, such methods have been developed exclusively on autoregressive models (ARMs). Whether their underlying assumptions hold for masked diffusion models (MDMs), which model text bidirectionally and generate by iterative denoising rather than next-token prediction, remains an open question. We address it by transferring locate-then-edit to MDMs and comparing two MDMs (LLaDA, Dream) with two ARMs (LLaMA, Qwen) at matched scale. Our central finding has two parts. First, where an edit is applied transfers across paradigms: causal tracing highlights the same early-to-mid-layer MLP at the last subject token in both, and editing is most effective there. Second, this shared location does not guarantee a shared outcome. Single-token edits succeed in both, but as targets grow longer, editing degrades systematically in the MDMs but not the ARMs. The failure stems from how the edited fact is generated: producing a multi-token target requires passing through partially unmasked intermediate states for which the edit was never optimized. Guided by this diagnosis, we introduce a simple correction that optimizes the edit for these states, substantially restoring multi-token performance.

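2606.03871 2026-06-03 cs.CV cs.CL cs.LG Updated version

Dynamic Short Convolutions Improve Transformers

Oliver Sieberling, Bharat Runwal, Rameswar Panda, Yoon Kim

Affiliations: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab

AI summary: Proposes dynamic short convolutions as a new neural network primitive that augments Transformers with input-dependent filters; in language modeling it consistently outperforms standard Transformers and static short-convolution variants and yields a compute advantage.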

Abstract

Transformers have become the dominant architecture for large language models, largely due to the scalability and flexibility of attention, feed-forward layers, residual connections, and normalization. This paper introduces dynamic short convolutions as an additional neural network primitive for improving Transformers. Unlike static short convolutions, dynamic convolutions use input-dependent filters, which preserves the locality bias of convolution while increasing expressivity. Motivating experiments show that applying dynamic short convolutions to key, query, and value representations improves performance on challenging associative recall tasks compared with static convolutional variants. Across language-modeling experiments ranging from 150M to 2B parameters, dynamic convolutions consistently outperform standard Transformers and Transformers augmented with static short convolutions. Fitting scaling laws indicates a 1.33× compute advantage over compute-matched Transformers when dynamic convolutions are applied to the key, query, and value vectors, and a 1.60× advantage when adding dynamic convolutions after every linear layer. Dynamic convolutions also offer improvements on linear RNNs (Mamba-2/Gated DeltaNet) and mixture-of-experts architectures. We make these gains practical with custom Triton kernels that enable efficient training with a manageable end-to-end slowdown. These results suggest that dynamic short convolutions are a scalable, hardware-efficient, and expressive primitive for advancing Transformer-based language models.
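The difference between a static and a dynamic short convolution can be shown on a scalar sequence. In the sketch below, the static version applies one fixed causal filter at every position, while the dynamic version computes the filter at each step from the current input. How the paper parameterizes its filters is not stated in the abstract, so `filter_fn` and the toy filters are hypothetical stand-ins.

```python
def static_short_conv(x, w):
    """Causal convolution: y[t] = sum_k w[k] * x[t - k], zero left padding."""
    K = len(w)
    return [
        sum(w[k] * x[t - k] for k in range(K) if t - k >= 0)
        for t in range(len(x))
    ]

def dynamic_short_conv(x, filter_fn, K):
    """Same window, but the length-K filter at step t depends on x[t]."""
    out = []
    for t in range(len(x)):
        w_t = filter_fn(x[t])  # input-dependent filter (hypothetical form)
        if len(w_t) != K:
            raise ValueError("filter_fn must return K coefficients")
        out.append(sum(w_t[k] * x[t - k] for k in range(K) if t - k >= 0))
    return out

x = [1.0, 2.0, 3.0]
y_static = static_short_conv(x, [1.0, 0.5])
# Toy dynamic filter: the weight on the previous input scales with x[t].
y_dynamic = dynamic_short_conv(x, lambda v: [1.0, 0.1 * v], K=2)
```

Both versions look only at the last K inputs, which is the locality bias the abstract refers to; a constant `filter_fn` reduces the dynamic version to the static one.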

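2606.03817 2026-06-03 cs.CL Updated version

Rethinking the Idiomaticity Decomposability Hypothesis: Evidence from Distributional Learning

Maggie Mi, Golzar Atefi, Atsuki Yamaguchi, Felix Gers, Aline Villavicencio, Nafise Sadat Moosavi

Affiliations: University of Sheffield; Berliner Hochschule für Technik (BHT); University of Exeter; Federal University of Rio Grande do Norte, Brazil

AI summary: Using contextualised language models as controlled distributional learners, the paper studies the relationship between idiom decomposability (the extent to which constituent meanings contribute to the figurative whole) and syntactic flexibility; model-derived decomposability correlates weakly with human judgments and shows a small but consistent negative relationship with syntactic flexibility, and the stabilisation of idiom representations is jointly affected by surprisal, decomposability, and frequency, with decomposability showing the strongest training-dependent effect.

Comments: ACL 2026 Main - long paper (9 pages + Appendices)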

Abstract

Idioms can be analysed in terms of their decomposability, the extent to which constituent meanings contribute to the figurative whole. Decomposability is thought to predict syntactic flexibility. Usage-based accounts instead attribute idiom behaviour to distributional experience, such as speaker familiarity and predictability. We examine these views using contextualised language models as controlled distributional learners. We propose a model-internal measure of decomposability and relate it to human ratings, syntactic flexibility, and predictability while tracking idiom learning during pretraining. Model-derived decomposability correlates weakly with human judgments and shows a small but consistent negative relationship with syntactic flexibility. Pretraining analyses show that stabilisation of idiom representations in models is not explained by frequency alone. Instead, surprisal, decomposability, and frequency all contribute, with decomposability showing the strongest training-dependent effect.

2606.03793 2026-06-03 cs.CL cs.CV Updated version

Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

Affiliations: Mohamed Bin Zayed University of AI, UAE; Khalifa University, UAE; Australian National University, Australia

AI summary: Using gradient-based attacks and cross-lingual evaluation, the study finds a transferable adversarial vulnerability in multilingual multimodal large language models and reveals an illusory safety in low-resource languages that stems from comprehension failure; it argues that only deeper integration across training stages yields genuine multilingual safety alignment.

Abstract

Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric tasks, leaving multilingual behaviour unexplored. We address this gap through a systematic study of adversarial robustness and multimodal safety across 12 diverse languages, evaluating open-source MLLMs that acquire multilingual capability through instruction tuning. Gradient-based attacks reveal a transferable multilingual vulnerability: adversarial images optimized in one language continue to induce failure in others, demonstrating strong cross-lingual transferability. Multilingual safety further varies with how effectively a model retrieves or interprets harmful instructions. When harmful intent is issued through text, languages with stronger linguistic grounding more often elicit misuse-enabling responses, while weaker languages produce fewer unsafe outputs. When embedded in the image as typographic content, English scripts are reliably recognised and followed, whereas non-English scripts are rarely parsed by the vision encoder. Lower-resource languages may therefore appear safer, but this is an artefact of comprehension and visual-grounding failures rather than genuine alignment, a phenomenon we term safety-by-failure. In contrast, MLLMs that build multilingual capability throughout their training stages rather than only at instruction tuning, such as Qwen3-VL, exhibit genuine cross-lingual safety, maintaining active refusal across languages rather than masking comprehension failure. Shallow multilingual adaptation, such as fine-tuning on translated instruction data, may produce surface-level understanding that creates illusory safety in low-resource languages; deeper integration across training stages leads to genuine multilingual safety alignment.

2606.03782 2026-06-03 cs.CL Updated version

Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation?

Renhao Pei, Yihong Liu, Sampo Pyysalo, Hinrich Schütze, Shaoxiong Ji

Affiliations: ELLIS Institute Finland; University of Turku; Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning (MCML)

AI summary: Proposes a method for automatically generating linguistic reasoning traces and evaluates their effect on low-resource machine translation via in-context learning, supervised fine-tuning, and reinforcement fine-tuning; the traces help substantially as inference-time guidance but bring limited gains as training data.

Abstract

Large language models (LLMs) offer a promising approach to machine translation (MT) for extremely low-resource languages by incorporating linguistic resources through in-context learning. However, LLMs often struggle to apply grammatical information effectively during translation. Inspired by recent progress in chain-of-thought reasoning, we investigate whether low-resource MT can benefit from structured intermediate steps of linguistic analysis and grammatical reasoning. We propose a pipeline for automatically generating step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks. We evaluate these traces in three settings: in-context learning (ICL), supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT), on Xibe and Chintang as test cases. Our results show that linguistic reasoning traces are most effective as inference-time guidance: in ICL, reliable sentence-specific traces substantially improve translation performance across most models, languages, and metrics. In contrast, using the linguistic reasoning traces as training data yields smaller and less consistent gains, as models learn the trace format but often generate erroneous content. These findings suggest that LLMs can leverage grammatical information for low-resource MT when given reliable linguistic analyses, while learning to generate such analyses remains a major bottleneck.


2606.03780 2026-06-03 cs.CL cs.LG 版本更新

Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models

专家感知的稀疏MoE语言模型中事实回忆的因果追踪

Yuetian Lu, Ali Modarressi, Yihong Liu, Hinrich Schütze

发表机构 * Center for Information and Language Processing (CIS)（信息与语言处理中心）； Ubiquitous Knowledge Processing Lab (UKP)（无所不在的知识处理实验室）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）

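AI summary: For sparse mixture-of-experts language models, proposes expert-aware causal tracing that intervenes on expert-level updates to locate the experts that matter for factual recall, and finds that expert-level localization depends on the model and the protocol.

Comments Preprint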

AI中文摘要

事实回忆的因果追踪主要在密集Transformer语言模型中进行研究，其中干预将信息流定位到层或前馈模块。稀疏混合专家（MoE）语言模型引入了一个更尖锐的问题：当事实预测由路由的MoE块中介时，哪些路由的专家贡献起作用？我们为稀疏MoE语言模型制定了专家感知的因果追踪。使用CounterFact事实，我们首先通过向主题词嵌入添加噪声来破坏模型的事实偏好，然后测试干净的MoE块输出或干净的专家级更新是否恢复了真实与虚假logit对比。对于Qwen3-30B-A3B-Base，层扫描选择并验证了第44层，专家级追踪识别出L44E069作为在干净运行中反复选择的专家，其保留的补丁优于其他活跃的同一层专家补丁。对于Mixtral-8x7B-v0.1，层级追踪验证了中层信号，但该信号并未定位到选定的单个专家；相反，联盟检查通过路由的多专家更新恢复了它。这些结果表明，MoE事实追踪可以做到专家感知，同时也表明专家级定位是依赖于模型和协议的，而非普遍适用。

英文摘要

Causal tracing of factual recall has been studied predominantly in dense transformer language models, where interventions localize information flow to layers or feed-forward modules. Sparse mixture-of-experts (MoE) language models introduce a sharper question: when a factual prediction is mediated by a routed MoE block, which routed expert contributions matter? We formulate expert-aware causal tracing for sparse MoE language models. Using CounterFact facts, we first corrupt the model's factual preference by adding noise to subject-token embeddings, and then test whether clean MoE-block outputs or clean expert-level updates restore the true-vs-foil logit contrast. For Qwen3-30B-A3B-Base, a layer sweep selects and validates layer 44, and expert-level tracing identifies L44E069 as an expert repeatedly selected in the clean run whose held-out patch outperforms other active same-layer expert patches. For Mixtral-8x7B-v0.1, layer-level tracing validates a mid-layer signal, but the signal is not localized to the selected singleton expert; a coalition check instead recovers it with routed multi-expert updates. These results suggest that MoE factual tracing can be made expert-aware, while also showing that expert-level localization is model- and protocol-dependent rather than universal.
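The patching procedure described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the linear "experts", the fixed gate weights (real routers depend on the input), the `readout` vector standing in for the true-vs-foil logit contrast, and the `recovery` metric. It is not the authors' protocol, only the general shape of restoring clean expert-level updates inside a corrupted run.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4

# Hypothetical toy MoE block: each expert is a linear map, weighted by a fixed gate.
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gates = np.array([0.5, 0.3, 0.2, 0.0])   # expert 3 is never routed in this toy
readout = rng.normal(size=d)             # stand-in for the true-vs-foil logit contrast

def expert_updates(x):
    """Per-expert gated updates for input x."""
    return [g * (W @ x) for g, W in zip(gates, experts)]

def contrast(updates):
    """Logit contrast read from the summed expert updates."""
    return float(readout @ np.sum(updates, axis=0))

x_clean = rng.normal(size=d)
x_corrupt = x_clean + rng.normal(scale=3.0, size=d)  # noise on the subject embedding

clean = expert_updates(x_clean)
corrupt = expert_updates(x_corrupt)
c_clean, c_corrupt = contrast(clean), contrast(corrupt)

def recovery(patched_ids):
    """Fraction of the clean-vs-corrupt gap restored by patching the given experts."""
    patched = [clean[e] if e in patched_ids else corrupt[e] for e in range(n_experts)]
    return (contrast(patched) - c_corrupt) / (c_clean - c_corrupt)

for e in range(n_experts):
    print(f"expert {e}: recovery = {recovery({e}):+.3f}")
print(f"coalition of all experts: recovery = {recovery(set(range(n_experts))):+.3f}")
```

Patching a single expert corresponds to the singleton check, and patching several at once corresponds to the coalition check mentioned for Mixtral. In this linear toy the singleton recoveries simply add up, which a real model need not satisfy.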


2606.03773 2026-06-03 cs.CL 版本更新

KletterMix: Climbing Toward High-Quality German Pretraining Data

KletterMix: 攀登高质量德语预训练数据

Maurice Kraus, Ruben Härle, Sebastian Sztwiertnia, Abbas Goher Khan, Mehdi Ali, Michael Fromm, Kristian Kersting

发表机构 * AI & ML Group, TU Darmstadt（达姆施塔特工业大学人工智能与机器学习小组）； Lab1141； Lamarr Institute（拉马尔研究所）； Fraunhofer IAIS（弗劳恩霍夫智能分析与信息系统研究所）； hessian.AI（黑森州人工智能中心）； German Research Center for AI (DFKI)（德国人工智能研究中心）； Centre for Cognitive Science, TU Darmstadt（达姆施塔特工业大学认知科学中心）

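AI summary: Builds KletterMix, a German pretraining corpus, by translating a high-quality English corpus, and validates its effectiveness on German downstream tasks.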

AI中文摘要

高质量的预训练数据是现代语言模型的核心要素，但德语资源的开发远不如英语资源：它们通常规模更小、筛选不仔细、文档不完善，并且很少通过受控训练实验进行验证。我们引入了KletterMix，一个用于语言模型预训练和退火的高质量德语语料库，旨在为自然语言处理和建模社区提供可重复使用的数据集产品。KletterMix通过将最先进的英语预训练语料库翻译成德语构建，同时保留文档边界、元数据、源结构和主题多样性。这种构建方式产生了一个具有现代预训练数据集规模和多样性的德语语料库，同时允许与其英语源进行直接比较。我们通过广泛的语料库级别分析来记录数据集，包括翻译质量、文档长度分布、主题覆盖、源组成和地理元数据。使用COMETKiwi，我们表明翻译后的文档在不同领域实现了高质量，表明仔细的翻译可以保留原始语料库的许多语义和风格丰富性。除了数据集构建，我们还评估了KletterMix作为训练数据。通过针对现有德语语料库的受控预训练和退火消融实验，我们表明在KletterMix上训练的模型在德语下游评估中取得了可衡量的改进。这些结果表明，精心策划的翻译数据可以显著增强德语预训练数据生态系统。

英文摘要

High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source. We document the dataset through a broad set of corpus-level analyses, including translation quality, document length distributions, topic coverage, source composition, and geographic metadata. Using COMETKiwi, we show that the translated documents achieve strong quality across diverse domains, suggesting that careful translation can preserve much of the semantic and stylistic richness of the original corpus. Beyond dataset construction, we evaluate KletterMix as training data. Through controlled pretraining and annealing ablations against established German corpora, we show that models trained on KletterMix achieve measurable improvements on German-language downstream evaluations. These results demonstrate that carefully curated translated data can substantially strengthen the German pretraining data ecosystem.


2606.03768 2026-06-03 cs.CL 版本更新

HybridThinker: Efficient Chain-of-Thought Reasoning via Compressed Memory and Transient Thought Steps

HybridThinker: 通过压缩记忆和瞬态思考步骤实现高效的思维链推理

Xin Liu, Runsong Zhao, Xinyu Liu, Junhao Ruan, Pengcheng Huang, Shichao Dong, Chunyang Xiao, Chenglong Wang, Changliang Li, Jingbo Zhu, Tong Xiao

发表机构 * School of Computer Science and Engineering, Northeastern University（东北大学计算机科学与工程学院）； China Unicom Cloud-Link（中国联通云链）； NiuTrans Research（小牛翻译研究院）

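AI summary: Proposes HybridThinker, which keeps compressed memory while temporarily retaining thought steps and pairs this with a hybrid training scheme, substantially shortening chains of thought while preserving reasoning accuracy.

Comments 23 pages, 9 figures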

AI中文摘要

扩展的思维链（CoT）轨迹提升了LLM的推理能力，但带来了大量的计算和内存成本。现有的CoT压缩方法通过记忆令牌将思考步骤压缩为紧凑表示，并在推理时仅保留这些表示，但细粒度信息的丢失使得后续步骤更容易出错。为了缓解这一问题，我们提出了HybridThinker，除了保留这些表示外，思考步骤也会被临时保留以提供细粒度细节。然而，我们观察到在训练期间简单地让后续步骤可以访问思考步骤，会使模型绕过记忆令牌直接从这些步骤中检索信息，导致模型通过记忆令牌压缩和检索信息的能力训练不足。因此，我们引入了一种混合训练方案，其中只有部分思考步骤可以被后续步骤通过注意力直接访问，而其他思考步骤被屏蔽，迫使模型使用记忆令牌进行压缩和检索。在4个推理基准测试中，HybridThinker与未压缩的基线性能相当，在相似的推理时间下，平均准确率比现有的CoT压缩方法提高了5.8个百分点。消融研究证实，临时思考步骤保留和混合训练方案都对这一提升有贡献。

英文摘要

Extended chain-of-thought (CoT) traces improve LLM reasoning but incur substantial computational and memory costs. While existing CoT compression methods mitigate this by condensing thought steps into compact representations via memory tokens and retaining only these representations at inference time, the loss of fine-grained information makes subsequent steps more error-prone. To alleviate this, we propose HybridThinker, where in addition to preserving these representations, thought steps are also temporarily retained to provide fine-grained details. However, we observe that naively keeping thought steps accessible to subsequent steps during training lets the model bypass memory tokens by retrieving information directly from these steps, leaving the model's ability to compress and retrieve information through memory tokens insufficiently trained. We therefore introduce a hybrid training scheme, in which only some thought steps are directly accessible through attention to subsequent steps, while the other thought steps are masked, forcing the model to use memory tokens for compression and retrieval. Across 4 reasoning benchmarks, HybridThinker matches the uncompressed baseline, advancing the state of the art in CoT compression by 5.8 points on average accuracy with similar inference time. Ablation studies confirm that both temporary thought-step retention and the hybrid training scheme contribute to these gains.
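One way to picture the hybrid training scheme is as an attention mask over blocks of thought tokens followed by memory tokens. The sketch below is an interpretation under stated assumptions, not the paper's implementation: the block layout, the function name `hybrid_mask`, and the `visible_steps` parameter (the subset of earlier steps whose thought tokens stay attendable) are all hypothetical.

```python
import numpy as np

def hybrid_mask(step_lens, mem_len, visible_steps):
    """Boolean attention mask (True = query may attend to key).

    The sequence is a series of blocks, one per reasoning step, each laid out as
    [thought tokens | memory tokens]. Attention is causal. Memory tokens of earlier
    steps are always attendable; thought tokens of an earlier step are attendable
    only if that step is listed in `visible_steps`.
    """
    kinds, steps = [], []
    for s, n in enumerate(step_lens):
        kinds += ["thought"] * n + ["mem"] * mem_len
        steps += [s] * (n + mem_len)
    size = len(kinds)
    mask = np.zeros((size, size), dtype=bool)
    for q in range(size):
        for k in range(q + 1):  # causal: keys at or before the query position
            same_step = steps[k] == steps[q]
            if same_step or kinds[k] == "mem" or steps[k] in visible_steps:
                mask[q, k] = True
    return mask

# Two steps of two thought tokens each, one memory token per step.
masked = hybrid_mask([2, 2], mem_len=1, visible_steps=set())   # step 0 thoughts hidden
open_ = hybrid_mask([2, 2], mem_len=1, visible_steps={0})      # step 0 thoughts visible
print(masked.astype(int))
print(open_.astype(int))
```

With `visible_steps` empty, later steps can reach earlier content only through memory tokens, which is the pressure the abstract describes. Sampling a non-empty subset per training example would mix in the direct-access case.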


2606.03761 2026-06-03 cs.CL 版本更新

Framing Migration News with LLMs: Structured CoT as a Support for Human Interpretation

使用LLM构建移民新闻框架：结构化思维链作为人类解释的支持

David Alonso del Barrio, Jing Wen, Daniel Gatica-Perez

发表机构 * Idiap Research Institute（Idiap研究所）； University of Geneva（日内瓦大学）； EPFL（洛桑联邦理工学院）

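AI summary: Proposes a Structured Chain-of-Thought (SCoT) prompting approach that uses a locally deployable open-source LLM (Llama3-8B) to assist frame analysis of migration news, improving classification performance on a single GPU; a human evaluation assesses how logical the explanations are and how they influence interpretation.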

DOI: 10.1145/3811242.3819110

AI中文摘要

移民新闻的框架分析是一项具有社会重要性的任务：研究移民如何被叙述的媒体学者和研究人员需要不仅准确，而且透明、可审计，并在学术研究小组典型的资源限制下可访问的工具。现有的基于LLM的方法依赖于专有API和大模型，这引发了媒体研究人员对数据隐私、可重复性和公平访问的担忧。本研究探讨了本地可部署的开源LLM如何作为辅助工具支持可解释的框架分析。我们引入了一种使用Llama3-8B的结构化思维链（SCoT）提示方法，实现了基于预定义框架类别的逐步推理。这种结构化设计允许用户审计模型输出，并在本质上主观的任务中检查替代解释。我们在一个与移民相关的新闻数据集上评估了我们的方法，结果表明SCoT在单GPU上可行，同时比零样本和少样本基线提高了分类性能。然后，我们进行了一项以人为中心的评估，注释者评估“模型推理”的一致性和影响力。结果表明，SCoT解释通常被认为是逻辑的（平均得分4.1/5，尽管不同文本间存在显著差异），并且可以引发对初始解释的反思，即使存在分歧。我们的发现强调了LLM辅助框架分析的潜力和风险。虽然结构化推理可以增加模型输出的可追溯性并支持批判性解释，但它也可能以微妙的方式影响人类判断。通过支持本地部署并强调人在环中的交互，这项工作为研究具有社会影响力的媒体叙事的负责任和可访问的计算工具做出了贡献。

英文摘要

Frame analysis of migration news is a socially consequential task: media scholars and researchers who study how migration is narrated need tools that are not only accurate, but transparent, auditable, and accessible within the resource constraints typical of academic research groups. Existing LLM-based approaches rely on proprietary APIs and large models that raise concerns about data privacy, reproducibility and equitable access among media researchers. This work studies how a locally deployable open-source LLM can support interpretable frame analysis as an assistive tool. We introduce a Structured Chain-of-Thought (SCoT) prompting approach using Llama3-8B, enabling step-by-step justifications grounded in predefined framing categories. This structured design allows users to audit model outputs and examine alternative interpretations in a task that is inherently subjective. We evaluate our approach on a dataset of migration-related news and show that SCoT improves classification performance over zero-shot and few-shot baselines while remaining feasible on a single GPU. Then, we conduct a human-centered evaluation in which annotators assess the coherence and influence of "the model's reasoning". Results indicate that SCoT explanations are generally perceived as logical (mean score 4.1/5, though with notable variation across texts) and can prompt reflection on initial interpretations, even when disagreement persists. Our findings highlight both the potential and risks of LLM-assisted frame analysis. While structured reasoning can increase the traceability of model outputs and support critical interpretation, it can also influence human judgment in subtle ways. By enabling local deployment and emphasizing human-in-the-loop interaction, this work contributes to discussions on responsible and accessible computational tools for the study of socially impactful media narratives.


2606.03739 2026-06-03 cs.CL cs.IT math.IT 版本更新

Entropy Gate: Entropy Quenching for Near-Lossless Token Compression in LLM Pipelines

熵门：LLM流水线中用于近无损令牌压缩的熵淬火

Justice Owusu Agyemang, Jerry John Kponyo, Kwame Opuni-Boachie Obour Agyekum, Francisca Adoma Acheampong, Kwame Agyeman-Prempeh Agyekum, James Dzisi Gadze

发表机构 * VIA Cybersecurity Lab, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana（加纳库马西，夸梅·恩克鲁玛科技大学VIA网络安全实验室）； Quantum and Assistive Technologies Lab, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana（加纳库马西，夸梅·恩克鲁玛科技大学量子与辅助技术实验室）

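AI summary: Proposes the Entropy Gate framework, which uses entropy quenching (a thermodynamic process) to progressively freeze out low-information tokens, reaching 40-60% compression with semantic fidelity above 0.80, and proves that it approaches the information-theoretic limit.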

AI中文摘要

LLM流水线在低信息内容上浪费大量令牌预算：重复上下文、冗长响应和冗余模板。我们引入Entropy Gate，一种令牌压缩框架，应用熵淬火——一种热力学过程，逐步冻结低能量令牌同时保持语义保真度。每个令牌获得一个多因素信息能量$E(t)$，结合统计、结构和位置分量。自适应淬火计划$T(\tau) = T_0 / (1 + \alpha \tau)$移除玻尔兹曼生存概率$p_i = \exp(-E_i / kT)$低于阈值的令牌，并有一个保真度门在能量加权相似度低于$\theta$时停止压缩。我们证明按$E(t)$降序选择令牌最大化预期语义保留，淬火产生嵌套生存集，且可达压缩率接近信息论极限$\text{CR} \to 1 - I(P; T)/H(P)$。第一阶段启发式方法在五种提示类别上实现40-60%压缩，同时保持$S_E > 0.80$，能量平方放大$E \to E^2$额外增加10-25个百分点。上下文去重在重复块上节省50-70%。输出端淬火受简洁性提高准确性的发现启发，进一步减少响应开销。结合外部记忆，压缩在代理工作负载上复合乘数达到88-96%。该框架无状态、模型无关，并作为兼容OpenAI的HTTP代理部署。

英文摘要

LLM pipelines waste substantial token budgets on low-information content: repeated context, verbose responses, and redundant boilerplate. We introduce Entropy Gate, a token compression framework applying entropy quenching, a thermodynamic process that progressively freezes out low-energy tokens while preserving semantic fidelity. Each token receives a multi-factor information energy $E(t)$ combining statistical, structural, and positional components. An adaptive quenching schedule $T(\tau) = T_0 / (1 + \alpha \tau)$ removes tokens whose Boltzmann survival probability $p_i = \exp(-E_i / kT)$ falls below threshold, with a fidelity gate halting compression when energy-weighted similarity drops below $\theta$. We prove token selection by descending $E(t)$ maximizes expected semantic preservation, that quenching produces nested survival sets, and that achievable compression approaches the information-theoretic limit $\text{CR} \to 1 - I(P; T)/H(P)$. A Phase 1 heuristic achieves 40-60% compression across five prompt categories while maintaining $S_E > 0.80$, with energy-squared amplification $E \to E^2$ adding 10-25 percentage points. Context deduplication adds 50-70% savings on repeated blocks. Output-side quenching, motivated by findings that brevity improves accuracy, further reduces response overhead. Combined with external memory, reduction composes multiplicatively to 88-96% for agentic workloads. The framework is stateless, model-agnostic, and deploys as an OpenAI-compatible HTTP proxy.
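Two pieces of arithmetic in the abstract are easy to check by hand: the cooling schedule $T(\tau) = T_0 / (1 + \alpha \tau)$ and the claim that savings compose multiplicatively. The sketch below is only a worked example. The default values of `t0` and `alpha`, and the stage fractions 0.8 and 0.5, are illustrative assumptions and not figures reported in the paper.

```python
def temperature(tau, t0=1.0, alpha=0.5):
    """Quenching schedule T(tau) = T0 / (1 + alpha * tau)."""
    return t0 / (1.0 + alpha * tau)

def combined_reduction(*fractions):
    """Total fraction removed when each stage removes a fraction of what is left."""
    remaining = 1.0
    for f in fractions:
        remaining *= 1.0 - f
    return 1.0 - remaining

# Temperature falls monotonically as the quench proceeds.
print([round(temperature(t), 3) for t in range(4)])

# Hypothetical stages: external memory removes 80%, token compression removes 50% of the rest.
print(round(combined_reduction(0.8, 0.5), 3))
```

Under these assumed stage fractions the combined reduction is 1 - (0.2 × 0.5) = 0.90, which shows how two moderate stages can land inside a range like the 88-96% quoted for agentic workloads.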


2606.03728 2026-06-03 cs.CL cs.IR 版本更新

Re-Ranking Through an Attribution Lens for Citation Quality in Legal QA

通过归因视角对法律问答中的引用质量进行重排序

Mohamed Hesham Elganayni, Selim Saleh

发表机构 * Technical University of Munich（慕尼黑工业大学）

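AI summary: To improve citation quality in retrieval-augmented generation for legal question answering, trains a lightweight cross-encoder on perturbation-based attribution scores to re-rank candidate passages, substantially improving citation faithfulness and alignment with expert answers.

Comments 11 pages, 4 tables, 1 figure. Published at ASAIL 2026 (8th Workshop on Automated Semantic Analysis of Information in Legal Text), co-located with ICAIL 2026, Singapore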

AI中文摘要

用于法律问答的检索增强生成系统通常基于语义相似度检索段落，并将其提供给语言模型，然后生成带引用的答案。先前的工作假设高排名的段落最有可能被模型有效引用。基于扰动的归因方法（如C-LIME）仅用于事后解释。然而，在AQuAECHR基准测试中，语义相似度与段落归因并不相关。在检索器的候选池中，基于相似度的排序在呈现黄金引用段落方面表现不如随机选择。为了解决这一局限性，我们训练了一个轻量级交叉编码器，基于连续的扰动归因分数在生成前对段落进行重排序。该方法在AQuAECHR基准测试上使用两个语言模型和五折交叉验证进行评估。重排序器显著提高了引用的忠实度以及与专家黄金答案的对齐程度。值得注意的是，在不同模型上独立训练的两个重排序器收敛程度超过了它们的原始归因一致性。这一发现表明，交叉编码器减少了模型特定的噪声，并产生了一个可部分跨模型传递的共享相关性信号，尽管同模型重排序仍然更有效。这些结果表明，基于扰动的归因为引用感知检索提供了一种实用的、模型无关的训练信号。

英文摘要

Retrieval-augmented generation systems for legal question answering typically retrieve passages based on semantic similarity and provide them to a language model, which then generates cited answers. Prior work assumes that highly ranked passages are most likely to be usefully cited by the model. Perturbation-based attribution methods, such as C-LIME, have been used exclusively for post-hoc explanation. However, on the AQuAECHR benchmark, semantic similarity does not correlate with passage attribution. Within a retriever's candidate pool, similarity-based ranking performs worse than random selection at surfacing gold citation paragraphs. To address this limitation, a lightweight cross-encoder is trained on continuous perturbation-based attribution scores to re-rank passages prior to generation. This approach is evaluated on the AQuAECHR benchmark, using two language models and five-fold cross-validation. The re-ranker substantially improves citation faithfulness and alignment with gold expert answers. Notably, two re-rankers trained independently on different models converge beyond their raw attribution agreement. This finding indicates that the cross-encoder reduces model-specific noise and produces a shared relevance signal that partially transfers across models, although same-model re-ranking remains more effective. These results demonstrate that perturbation-based attribution provides a practical, model-agnostic training signal for citation-aware retrieval.
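The idea of ranking passages by attribution rather than similarity can be sketched with a simple leave-one-out perturbation. This is a deliberate simplification and not C-LIME or the paper's pipeline: there the attribution scores serve as training targets for a cross-encoder, whereas here they are used directly to sort. The `toy_score` function and the `REQUIRED` terms are hypothetical stand-ins for "how well the generated answer is supported".

```python
def leave_one_out_attribution(passages, score):
    """Attribution of each passage = drop in score when that passage is removed."""
    full = score(passages)
    return [full - score(passages[:i] + passages[i + 1:]) for i in range(len(passages))]

def rerank(passages, attributions):
    """Order passages by descending attribution (stable for ties)."""
    order = sorted(range(len(passages)), key=lambda i: -attributions[i])
    return [passages[i] for i in order]

# Hypothetical scoring stand-in: counts required terms present in the provided passages.
REQUIRED = {"article 8", "private life"}

def toy_score(passages):
    text = " ".join(passages).lower()
    return sum(term in text for term in REQUIRED)

pool = [
    "General remarks on procedure.",
    "Article 8 protects private life.",
    "The applicant was born in 1970.",
]
attr = leave_one_out_attribution(pool, toy_score)
print(attr)
print(rerank(pool, attr))
```

The passage whose removal hurts the score most moves to the top, regardless of where a similarity-based retriever originally placed it.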


2606.03695 2026-06-03 cs.CL 版本更新

Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Editing of Embeddings

不要忘记你的嵌入：通过精确编辑嵌入实现鲁棒的知识擦除

Clara Haya Suslik, Or Shafran, Mor Geva

发表机构 * Blavatnik School of Computer Science and AI, Tel Aviv University（特拉维夫大学布拉瓦特尼克计算机科学与人工智能学院）

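AI summary: Proposes the EMBER module, which uses sparse matrix factorization to precisely erase concept-related features from token embeddings, making existing knowledge-erasure methods more robust and more specific.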

AI中文摘要

随着语言模型在现实应用中的广泛部署，从模型中擦除特定知识的能力对安全性和合规性变得至关重要。主流方法通过更新模型参数实现持久移除，但目标知识往往可以通过对抗性提示或重新学习恢复。在这项工作中，我们假设这种局限性部分源于现有方法忽略了嵌入层。为了解决这个问题，我们引入了 EMBedding ERasure (EMBER)，一个即插即用的擦除模块，利用稀疏矩阵分解从词嵌入中精确擦除概念相关特征。通过在 Gemma-2-2B-it 和 Llama-3.1-8B-Instruct 上对不同概念的综合评估，我们发现用 EMBER 增强现有方法可以一致地提高擦除效果和特异性，且连贯性损失最小。此外，它显著提高了对重新学习的鲁棒性，将恢复的准确率降低高达 50%，在 Llama 上限制在 35%，而先前方法为 70%-76%。进一步分析表明，连贯性成本是局部的，仅影响一小部分概念专属词元。我们的工作确立了精确的嵌入层干预对于鲁棒的概念擦除是必要的，并证明现有方法可以从这种增强中受益。

英文摘要

As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the model's parameters, yet the target knowledge often can be recovered through adversarial prompting or relearning. In this work, we hypothesize this limitation stems in part from existing methods overlooking the embedding layer. To address this, we introduce EMBedding ERasure (EMBER), a plug-n-play erasure module that leverages Sparse Matrix Factorization for precise erasure of concept-related features from token embeddings. Through comprehensive evaluations across diverse concepts on Gemma-2-2B-it and Llama-3.1-8B-Instruct, we find that augmenting existing methods with EMBER consistently improves erasure efficacy and specificity across task formats, with minimal coherence loss. Moreover, it dramatically improves robustness to relearning, reducing regained accuracy by up to 50%, limiting it to 35% on Llama compared to 70%-76% for prior methods. Further analysis shows that the coherence cost is localized, affecting only a small set of concept-exclusive tokens. Our work establishes that precise embedding-level intervention is necessary for robust concept erasure, and demonstrates that existing methods can benefit from such augmentation.


2606.03693 2026-06-03 cs.CL cs.CV 版本更新

Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

语言转换会破坏医学视觉语言模型吗？印度尼西亚放射学视觉问答案例研究

Pieter Christy Yan Yudhistira, Dzaki Rafif Malik, Novanto Yudistira

发表机构 * Intelligent System Laboratory, Faculty of Computer Science, Brawijaya University（布拉维贾亚大学计算机科学学院智能系统实验室）

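AI summary: Builds IndoRad-VQA, an Indonesian radiology VQA dataset, to evaluate how robust medical vision-language models are under non-English clinical language; finds an 8-25% performance gap between English and Indonesian settings, pointing to the need for more inclusive multilingual evaluation.

Comments accepted to MMFM-BIOMED Workshop @ CVPR 2026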

AI中文摘要

医学视觉语言模型（VLM）通常在英语放射学视觉问答基准上进行评估，其在非英语临床语言下的鲁棒性很大程度上未被探索。我们引入了IndoRad-VQA，这是VQA-RAD的印尼语改编版，以评估当问题以印尼语提出时，医学VLM是否保留放射学推理能力。放射学问答对被翻译成印尼语，并通过基于自我评估的质量控制来保持临床意义、术语一致性和答案等价性。我们在英语和印尼语提示设置下评估了通用、东南亚多语言和医学专用VLM。除了准确性，我们量化了英语和印尼语输入之间的语言鲁棒性差距。我们还进行了错误分析，以识别问答的失败模式，例如是/否翻转、侧向性错误和输出语言不匹配。我们的发现表明，在英语医学VQA基准上的强性能并不一定转化为印尼语临床环境中的鲁棒行为。我们观察到英语和印尼语设置之间的性能差距为8%到25%，具体取决于评估指标。这些结果突显了对医学多模态基础模型进行更包容的多语言评估的必要性。数据集可在以下网址获取：https://huggingface.co/datasets/Lab-IS/IndoRad-VQA

英文摘要

Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs retain radiology reasoning ability when questions are asked in Bahasa Indonesia. Radiology question-answer pairs are translated into Indonesian with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence. We evaluate general-purpose, Southeast Asian multilingual, and medical-specific VLMs under English and Indonesian prompting settings. Beyond accuracy, we quantify the language robustness gap between English and Indonesian inputs. We also conduct an error analysis to identify failure modes of question answering, such as yes/no flips, laterality errors, and output-language mismatches. Our findings show that strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. We observe a performance gap of 8 to 25 percent between the English and Indonesian settings, depending on the evaluation metric. These results highlight the need for more inclusive multilingual evaluation of medical multimodal foundation models. The dataset is available at https://huggingface.co/datasets/Lab-IS/IndoRad-VQA.


2606.03692 2026-06-03 cs.AI cs.CL 版本更新

超越字面：多模态模因理解中的语用意图分解

Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Luyao Ye, Huimin Wang, Hanqi Yan, Binyang Li, Kam-Fai Wong, Yulan He

发表机构 * The Chinese University of Hong Kong（香港中文大学）； Huawei（华为）； Central China Normal University（华中师范大学）； Shenzhen University（深圳大学）； King’s College London（伦敦国王学院）； University of International Relations（国际关系学院）

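AI summary: Large vision-language models (LVLMs) tend to describe a meme's literal content rather than its pragmatic intent. Proposes the Intent Projection framework, which decomposes literal and pragmatic signals at the representation, output, and objective levels; on six benchmarks it outperforms open-source models and narrows the gap to proprietary ones.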

AI中文摘要

当被问及一个模因或讽刺帖子的含义时，大型视觉语言模型（LVLMs）倾向于描述图像显示的内容，而不是作者试图传达的信息。标准指令调优将帖子的字面内容与其语用意义纠缠在一起，让表面细节污染最终响应。我们将模因理解重新定义为字面-语用分解问题，并提出Intent Projection，这是一个在单个LVLM骨干网络中的表示、输出和目标三个层面分离这两个信号的框架。在表示层面，一个正交投影模块从融合的图像-文本表示中移除主要的单模态方向，仅保留语用残差，同时一个表面真实情感分类器用一个离散标签锚定解码器，该标签命名了极性差距。在输出层面，模型外化一个结构化的推理链，在目标层面，一个对比奖励明确惩罚重复字面描述的答案。在六个多模态基准测试中，Intent Projection始终优于开源基线，并缩小了与专有模型的差距，在字面崩溃最具破坏性的高分歧帖子上取得了最大收益。

英文摘要

When asked what a meme or sarcastic post means, Large Vision Language Models (LVLMs) tend to describe what the image shows rather than what the author is trying to communicate. Standard instruction tuning entangles a post's literal content with its pragmatic meaning, letting surface-level details contaminate the final response. We reframe meme understanding as a problem of literal-pragmatic decomposition and propose Intent Projection, a framework that separates the two signals at the representation, output, and objective levels within a single LVLM backbone. At the representation level, an orthogonal projection module removes dominant unimodal directions from the fused image-text representation, retaining only the pragmatic residual, while a surface-real affect classifier anchors the decoder with a discrete tag that names the polarity gap. At the output level, the model externalizes a structured reasoning chain, and at the objective level a contrastive reward explicitly penalizes answers that restate the literal description. Across six multimodal benchmarks, Intent Projection consistently outperforms open-source baselines and narrows the gap to proprietary models, with the largest gains on high-divergence posts where literal collapse is most damaging.
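The representation-level step, removing dominant unimodal directions and keeping the residual, is a standard orthogonal projection. The sketch below shows that operation only, on random stand-in vectors. How the paper obtains the unimodal directions is not stated in the abstract, so the `unimodal` matrix and the helper name `project_out` are assumptions.

```python
import numpy as np

def project_out(h, directions):
    """Remove from h its component in the span of `directions` (given as rows)."""
    # Orthonormalise the directions so that q @ q.T is a valid projector.
    q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)
    return h - q @ (q.T @ h)

rng = np.random.default_rng(0)
h = rng.normal(size=6)                # stand-in fused image-text representation
unimodal = rng.normal(size=(2, 6))    # stand-in dominant unimodal directions
residual = project_out(h, unimodal)

# The residual has (numerically) no component along either removed direction.
print(np.allclose(unimodal @ residual, 0.0, atol=1e-9))
```

Because the projection is idempotent, applying `project_out` to the residual a second time leaves it unchanged, which is a convenient sanity check when wiring such a module into a larger model.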


2606.03603 2026-06-03 cs.CV cs.CL 版本更新

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

世界模型遇见语言模型：论具体推理与抽象推理的互补性

Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen

发表机构 * Nanyang Technological University（南洋理工大学）

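AI summary: Proposes a controlled concrete reasoning framework and the PF-OPSD method, which combine a world model's visual simulation with the abstract reasoning of a multimodal LLM to improve performance and robustness on spatial lookahead and open-domain physical prediction tasks.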

AI中文摘要

世界模型和多模态大语言模型（MLLMs）为从静态视觉观察预测未来结果提供了互补能力。世界模型可以生成可能未来的具体视觉推演，而MLLMs可以对问题、目标和规则进行抽象推理。然而，生成的推演是随机的，可能在视觉上合理但任务不正确，因此需要确定视觉模拟何时有用、推演是否可信以及它应如何影响最终答案。我们将此问题形式化为受控具体推理，其中模型学习在抽象推理之外调用、验证和整合视觉未来模拟。为了研究这一设置，我们构建了两个人工验证的基准：用于可控空间前瞻的VRQABench和用于开放域物理预测的OpenWorldQA，并提出了特权未来在策略自蒸馏（PF-OPSD）。在训练期间，PF-OPSD仅使用真实未来视频和答案作为教师侧特权上下文来评估在策略具体推理轨迹，而可部署的学生在测试时从未观察到真实未来。实验结果表明，PF-OPSD在VRQABench和OpenWorldQA上分别比基线高出10.6%和10.9%，同时增强了对噪声或冲突推演的鲁棒性。我们的代码和数据集可在以下网址获取：https://github.com/yczhou001/PF-OPSD

英文摘要

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF-OPSD.


2606.03602 2026-06-03 cs.LG cs.AI cs.CL 版本更新

CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery

CauTion：知道何时信任LLM进行集成因果发现

Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang, Yu Qiao, Chaochao Lu

发表机构 * Shanghai AI Laboratory（上海人工智能实验室）； Shanghai Innovation Institute（上海创新研究院）； Shanghai Jiao Tong University（上海交通大学）； Nanjing University（南京大学）； Tongji University（同济大学）

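AI summary: Proposes CauTion, a framework that uses consensus filtering and LLM reliability estimation to integrate LLM domain knowledge reliably into an ensemble of statistical causal discovery algorithms, addressing both the limits of purely statistical methods and LLM errors.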

AI中文摘要

从观测数据进行因果发现仍然具有挑战性，因为纯统计方法存在根本性限制，例如等价类内的统计可区分性和对有限样本量的敏感性。虽然大型语言模型（LLM）提供了有希望的领域知识来源来补充统计推断，但现有的LLM增强方法容易受到LLM错误的影响，并且产生高昂的令牌成本。此外，依赖单一数据驱动算法可能使结果对算法特定偏差敏感。为了解决这些限制，我们提出了CauTion，一个通过共识过滤和LLM可靠性估计将LLM领域知识可靠地集成到统计因果发现算法集成中的框架。CauTion分三个阶段进行。首先，算法集成利用共识投票解决算法一致的最多96%的边，在过滤后的共识边上实现接近完美的准确性。其次，一个信任校准仲裁机制通过无注释的信任校准过程估计LLM和算法的相对可靠性，然后用于控制信任加权投票过程，将LLM仲裁限制在算法证据不可靠的边上。第三，应用循环修复步骤确保最终因果图是有效的无环图。在六个数据集上的实验表明，CauTion在性能上始终优于数据驱动和LLM增强的基线，在更大的图上获得更大的收益，并且对LLM错误具有强大的鲁棒性。代码可在以下网址获取：https://github.com/OpenCausaLab/CauTion

英文摘要

Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical distinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM-augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data-centric algorithm can make results sensitive to algorithm-specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble uses consensus voting to resolve up to 96% of edges on which algorithms agree, achieving near-perfect accuracy on the filtered consensus edges. Second, a trust-calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation-free trust calibration procedure, which then governs a trust-weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step is applied to guarantee that the final causal graph is acyclic. Experiments on six datasets demonstrate that CauTion consistently outperforms both data-centric and LLM-augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.
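The three stages can be sketched on a toy edge set. This is a minimal illustration under assumptions, not the released CauTion code: the scoring rule in `arbitrate`, the fixed trust weights (the paper estimates them with an annotation-free calibration procedure), the greedy strategy in `repair_cycles`, and the `llm_says` callback are all hypothetical. The sketch also only considers edges proposed by at least one algorithm.

```python
from collections import Counter

def consensus_split(edge_sets):
    """Stage 1: split candidate edges into unanimously agreed and disputed ones."""
    votes = Counter(e for edges in edge_sets for e in edges)
    agreed = {e for e, v in votes.items() if v == len(edge_sets)}
    return agreed, set(votes) - agreed, votes

def arbitrate(disputed, votes, n_algs, llm_says, trust_llm, trust_alg):
    """Stage 2: trust-weighted vote between the algorithms and the LLM on disputed edges."""
    kept = set()
    for e in disputed:
        alg_score = trust_alg * (2 * votes[e] / n_algs - 1)   # in [-trust_alg, trust_alg]
        llm_score = trust_llm * (1 if llm_says(e) else -1)
        if alg_score + llm_score > 0:
            kept.add(e)
    return kept

def repair_cycles(edges, confidence):
    """Stage 3: keep edges in descending confidence, skipping any that would close a cycle."""
    kept, children = set(), {}

    def reaches(src, dst):
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(children.get(node, ()))
        return False

    for u, v in sorted(edges, key=lambda e: -confidence[e]):
        if not reaches(v, u):
            kept.add((u, v))
            children.setdefault(u, []).append(v)
    return kept

# Three hypothetical algorithm outputs over variables a, b, c.
runs = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("c", "a")},
    {("a", "b"), ("b", "c"), ("c", "a")},
]
agreed, disputed, votes = consensus_split(runs)
kept = arbitrate(disputed, votes, len(runs), lambda e: e == ("b", "c"),
                 trust_llm=0.6, trust_alg=0.4)
graph = repair_cycles(agreed | kept, votes)
print(sorted(graph))
```

Only the two disputed edges ever reach the stand-in LLM, which reflects the token-cost motivation in the abstract: unanimous edges are settled without any LLM call.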


2606.03544 2026-06-03 cs.AI cs.CL 版本更新

SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

SAGE: 智能体生态中社会化演化的定量评估

Linyue Pan, Yaoming Zhu, Lin Qiu, Xuezhi Cao, Xunliang Cai

发表机构 * Tsinghua University, China（清华大学, 中国）； Meituan, China（美团, 中国）

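AI summary: Proposes the SAGE framework, which compares two compute-matched conditions, social evolution (SocialEvo) and self-evolution (SelfEvo), across three arenas to evaluate how shared experience affects agent performance. Group history is not a universal amplifier, but it helps agents that have plateaued break through, and social gains depend on abstraction rather than exposure volume.

Comments 13 pages, 5 figures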

AI中文摘要

自我改进的语言智能体通常被孤立评估：一个智能体尝试任务、接收反馈并迭代优化自身行为。然而，智能体越来越多地与同伴一起运作，其策略和结果公开可见。这引发了一个研究不足的问题：共享经验何时能产生自我改进无法单独实现的改进？我们引入了SAGE（社会智能体群体演化），一个评估框架，比较两种计算匹配的条件：SocialEvo，其中来自五个不同模型家族的智能体共同演化，可访问所有同伴的历史；以及SelfEvo，其中每个智能体获得相同数量的任务尝试，但只能看到自己的过去，这是自我改进智能体研究中的常规做法。我们在三个领域实例化SAGE：开放式机器学习研究、长期经济规划和战略多人游戏，并在多个演化轮次中进行评估。我们发现群体历史并非普遍放大器：最强的智能体并未超过其自我演化上限。然而，在自我改进下停滞的智能体，当同伴经验可用时，可以取得重大突破。在竞争环境中，反事实控制显示智能体普遍改进，而非发展针对对手的策略。在不同形式的共享历史中，过滤后的同伴轨迹和反思性摘要通常优于原始日志，表明社会收益依赖于抽象而非暴露量。这些发现表明，同伴历史收益是智能体特定的、领域依赖的，并取决于从公共轨迹中抽象可转移知识的能力。

英文摘要

Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot achieve? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, where agents from five distinct model families co-evolve with access to all peers' histories; and SelfEvo, where each agent receives the same number of task attempts but sees only its own past, which is conventional in self-improving agent studies. We instantiate SAGE in three arenas: open-ended ML research, long-horizon economic planning, and strategic multiplayer play, evaluated across multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available. In competitive settings, counterfactual controls reveal that agents improve generally rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries often outperform raw logs, indicating that social gains depend on abstraction rather than exposure volume. These findings reveal that peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.


2606.03535 2026-06-03 cs.IR cs.CL 版本更新

语言处理的词典和语法：工业产品还是手工制品？

Eric Laporte

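AI summary: Analyzes two trends in building lexicons and grammars, handcrafting versus automated industrial production, and discusses which of them, or which combination, gives the best results.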

Journal ref: Léxico e gramática: dos sentidos à construção da significação, Cultura acadêmica, 2009, Trilhas Lingüísticas, 16, pp.51-84

AI中文摘要

近年来，语言数据在语言处理中的应用逐渐增加。这些数据现在通常被称为语言资源。用于此目的的大多数语言资源是文本集合，如布朗语料库和宾州树库，但电子词典（WordNet、FrameNet、VerbNet、ComLex、词典-语法...）和形式语法（TAG...）也在最近得到发展。词典和语法的大多数构建过程是手动的，而语料库的构建则一直高度自动化。然而，越来越多的语言处理专家认识到，词典和语法的信息内容比语料库更丰富，因此前者可以实现更精细的处理。构建时间的差异可能与信息内容的差异有关：语言学家手工制作词典和语法可能使其比自动生成的数据更具信息性。这种情况可能向两个方向发展：要么语言技术专家逐渐习惯于处理手工构建的资源，这些资源更具信息性且更复杂；要么词典和语法的构建过程被自动化和工业化，这是主流观点。两种演变都在进行中，并且它们之间存在紧张关系。语言学家和计算机科学家之间的关系取决于这些演变的未来，因为前者需要培训和雇佣大量语言学家，而后者主要依赖于计算机工程师提出的解决方案。本文旨在分析所讨论的语言资源的实际例子，并讨论手工制作或工业生成，或两者结合，哪种趋势能产生最佳结果或最现实。

英文摘要

During the recent years, the use of linguistic data for language processing increased progressively. Such data are now commonly called language resources. Most of the language resources used for this purpose are collections of texts as the Brown Corpus and the Penn Treebank, but electronic lexicons (WordNet, FrameNet, VerbNet, ComLex, Lexicon-Grammar...) and formal grammars (TAG...) developed recently. Most processes of construction of lexicons and grammars are manual, whereas the construction of corpora has always been highly automated. However, more and more specialists of language processing realize that the information content of lexicons and grammars is richer than that of corpora, and hence the former make more elaborate processing possible. The difference in construction time is likely to be connected with the difference in information content: the handcrafting of lexicons and grammars by linguists would make them more informative than automatically generated data. This situation can evolve into two directions: either specialists of language technology get progressively used to handling manually constructed resources, which are more informative and more complex, or the process of construction of lexicons and grammars is automated and industrialized, which is the mainstream perspective. Both evolutions are already in progress, and a tension exists between them. The relation between linguists and computer scientists depends on the future of these evolutions, since the first implies training and hiring numerous linguists, whereas the other depends essentially on solutions elaborated by computer engineers. The aim of this article is to analyse practical examples of the language resources in question, and to discuss about which of the two trends, handcrafting or generating industrially, or a combination of both, can give the best results or is the most realistic.

2606.03399 2026-06-03 cs.CL cs.CR 版本更新

Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models

面向大型语言模型隐私保护临床部署的选择性令牌级密码学脱敏

Farhan Sheth, Ziyuan Yang, Yongying Lan, Si Yong Yeo

发表机构 * MedVisAI Lab, Singapore（新加坡MedVisAI实验室）； Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, China（中国上海交通大学医学院瑞金医院）； Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore（新加坡南洋理工大学Lee Kong Chian医学院）

AI总结提出HERALD框架，通过令牌级密码学脱敏仅加密敏感令牌，在保护隐私的同时保持下游模型效用，在分类和医疗问答任务上接近明文性能。

Comments 33 pages, 8 figures, 26 tables

AI中文摘要

尽管大型语言模型（LLMs）越来越多地用于临床应用，但许多现有流程需要将原始敏感健康信息发送到远程服务器进行处理，这增加了隐私泄露的风险。缓解这种风险的一种自然方法是在传输前对数据进行加密。然而，加密整个数据集等直接解决方案会带来巨大的计算、对齐和通信开销，使得大规模实际部署不可行。为了在保护隐私的同时保持可用性，我们提出了通过自适应语言分解的医疗加密与脱敏（HERALD），这是一个令牌级密码学脱敏框架，通过仅加密敏感令牌并保留上下文以供下游模型使用来实现这种平衡。HERALD结合医学命名实体识别器（NER）与基于词性（POS）的策略来选择候选令牌，执行有针对性的词形还原以稳定表层形式，并将每个受保护令牌替换为包裹在显式分隔符中的确定性密文。值得注意的是，HERALD与模型无关，完全在客户端运行，确保敏感内容在存储、传输和处理过程中保持加密，无需更改下游模型。我们在公开数据集上对分类和医疗问答（MQA）任务评估了HERALD。在不同任务中，实验表明完全安全的基线遭受显著的效用损失，而HERALD始终恢复接近明文的性能。总体而言，HERALD提供了一种新颖的利用流程。
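
The token replacement step described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `protect` function, the `<enc>` delimiters, and the hard-coded sensitive set (a stand-in for the NER and POS selection step) are hypothetical, lowercasing stands in for lemmatization, and HMAC is used as a keyed one-way surrogate rather than the reversible cipher a real deployment would presumably need.

```python
import hashlib
import hmac


def protect(tokens, sensitive, key, open_tag="<enc>", close_tag="</enc>"):
    """Replace sensitive tokens with a keyed, deterministic surrogate.

    `sensitive` stands in for the NER + POS selection step, and lowercasing
    stands in for lemmatization. HMAC is a one-way keyed hash, so this is
    deterministic pseudonymization, not decryptable encryption.
    """
    protected = []
    for token in tokens:
        normalized = token.lower()
        if normalized in sensitive:
            digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
            protected.append(f"{open_tag}{digest}{close_tag}")
        else:
            protected.append(token)
    return protected


key = b"client-side-secret"  # hypothetical key that never leaves the client
note = "Patient reports metformin intolerance ; switch from Metformin advised".split()
masked = protect(note, {"metformin"}, key)
print(" ".join(masked))
```

Because the surrogate is deterministic, both mentions of the drug map to the same delimited string, so a downstream model can still tell that the two positions refer to the same protected item.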

基于注意力机制的残差连接LSTM网络的语音情感识别

Daniil Krasnoproshin, Maxim Vashkevich

发表机构 * Institute of Cybernetics and Machine Learning, Belarusian State University（白俄罗斯国立大学信息学与机器学习学院）

AI总结提出ResLSTM-SA轻量级架构，在LSTM中集成残差连接和软注意力，在RAVDESS数据集上以46.8k参数达到0.6517 UAR，优于传统基线且适合边缘部署。

Comments 6 pages, 5 figures, DSPA 2026

DOI: 10.1109/DSPA69176.2026.11476771

AI中文摘要

评估大语言模型在真实世界消费设备维修问题上的有效性

Atm Mizanur Rahman, Md Arid Hasan, Syed Ishtiaque Ahmed, Sharifa Sultana

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； University of Toronto（多伦多大学）

AI总结本文通过引入包含991个真实维修问题的基准测试，评估六种大语言模型在英语和孟加拉语上的正确性、完整性、实用性和安全性，发现模型虽能提供有用帮助，但在高风险维修任务中仍不可靠。

AI中文摘要

消费设备维修是大语言模型（LLMs）一个重要但尚未充分探索的测试平台。维修任务需要对不完整的问题描述、特定硬件的诊断、可操作的故障排除和安全关键决策进行推理，其中错误的建议可能导致设备损坏、电池危险或永久性数据丢失。我们引入了一个包含991个来自Reddit的真实世界维修问题的基准测试，涵盖手机维修、电脑维修和数据恢复，每个问题都配有技术人员编写的参考解决方案，并提供孟加拉语翻译以评估跨语言性能。我们使用四个维修特定标准（正确性、完整性、实用性和安全性）评估了六种最先进的LLMs在英语和孟加拉语上的表现。我们的结果表明，虽然LLMs可以提供有用的维修帮助，但在没有严格评估和明确安全保护措施的情况下，它们在高风险的真实世界维修任务中仍然不可靠。手机维修是最困难且对安全最敏感的领域，所有模型在板级诊断、维修优先级排序和安全恢复程序方面都犯了重大错误。跨领域和模型，孟加拉语回答的表现始终低于英语回答。在评估的模型中，GPT-5.4整体表现最佳。

英文摘要

Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.

2606.03327 2026-06-03 cs.DB cs.CL 版本更新

CAPER: Clause-Aligned Process Supervision for Text-to-SQL

CAPER: 面向Text-to-SQL的子句对齐过程监督

Lujie Ban, Jiasheng Shi, Jinyang Li, Xiaolin Han, Tsz Nam Chan, Chenhao Ma

发表机构 * The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； The University of Hong Kong（香港大学）； Northwestern Polytechnical University（西北工业大学）； Shenzhen University（深圳大学）

AI总结提出CAPER方法，通过反事实干预SQL抽象语法树自动推导子句级监督，训练轻量级Clause-PRM模型CAPER-9B，用于策略优化和候选验证，在BIRD和Spider数据集上提升了执行准确率和故障定位能力。

AI中文摘要

Text-to-SQL系统通常通过查询级执行正确性进行评估，但这种终端信号对于哪个中间SQL决策导致成功或失败几乎没有指导作用。Token级密集监督也不适合：SQL token与完整的语义决策不对齐，可能惩罚执行等效的查询，并且难以大规模可靠标记。因此，我们提出CAPER，通过对SQL抽象语法树进行反事实干预自动推导子句级监督，实现用于奖励建模的根因错误定位；所得数据用于训练CAPER-9B，一个轻量级的Clause-PRM，为策略优化和候选验证提供子句边界反馈。在BIRD和Spider上的实验表明，子句对齐监督不仅提高了执行准确率（相对于GPT-5.4实现了高达15.3%的相对EX提升），还增强了故障定位能力，在保留的故障上达到了84.53%的准确率和90.60%的MRR。我们的项目页面位于 https://github.com/banrichard/RL-NL2SQL。

英文摘要

Text-to-SQL systems are typically evaluated by query-level execution correctness, but this terminal signal provides little guidance about which intermediate SQL decision caused success or failure. Token-level dense supervision is also ill-suited: SQL tokens do not align with complete semantic decisions, can penalize execution-equivalent queries, and are difficult to label reliably at scale. We therefore propose CAPER, which automatically derives clause-level supervision via counterfactual intervention on the SQL abstract syntax tree, enabling root-cause error localization for reward modeling; the resulting data is used to train CAPER-9B, a lightweight Clause-PRM that provides clause-boundary feedback for policy optimization and candidate verification. Experiments on BIRD and Spider show that clause-aligned supervision not only improves execution accuracy, achieving up to a 15.3% relative EX improvement over GPT-5.4, but also strengthens failure-localization capability, reaching 84.53% accuracy and 90.60% MRR on held-out failures. Our project page is at https://github.com/banrichard/RL-NL2SQL.
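
The idea of deriving clause-level labels by counterfactual intervention can be sketched as follows. This is a simplified illustration, not the CAPER pipeline: a flat clause dictionary stands in for the SQL abstract syntax tree, and `render`, `run`, and `blame_clauses` are hypothetical helper names. Each differing clause of the predicted query is swapped for the gold clause in turn, and a clause is blamed when that single swap restores the gold execution result.

```python
import sqlite3

CLAUSE_ORDER = ["SELECT", "FROM", "WHERE", "ORDER BY"]


def render(clauses):
    """Turn a clause dictionary into a SQL string in canonical clause order."""
    return " ".join(f"{k} {clauses[k]}" for k in CLAUSE_ORDER if k in clauses)


def run(conn, clauses):
    """Execute the rendered query; return None when it fails to execute."""
    try:
        return conn.execute(render(clauses)).fetchall()
    except sqlite3.Error:
        return None


def blame_clauses(conn, predicted, gold):
    """Return clauses whose single-clause replacement fixes the execution result."""
    target = run(conn, gold)
    blamed = []
    for key in CLAUSE_ORDER:
        if predicted.get(key) == gold.get(key):
            continue  # identical clause, nothing to intervene on
        patched = dict(predicted)
        if key in gold:
            patched[key] = gold[key]
        else:
            patched.pop(key, None)
        if run(conn, patched) == target:
            blamed.append(key)
    return blamed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("Ann", "ops", 50), ("Bob", "eng", 70), ("Cy", "eng", 90)],
)
gold = {"SELECT": "name", "FROM": "emp", "WHERE": "dept = 'eng'", "ORDER BY": "name"}
pred = {"SELECT": "name", "FROM": "emp", "WHERE": "dept = 'ops'", "ORDER BY": "name"}
blamed = blame_clauses(conn, pred, gold)
print(blamed)
```

A single-clause swap only localizes failures with one root cause; queries with several interacting errors would return an empty list under this simplification.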

2606.03304 2026-06-03 cs.CL cs.LG 版本更新

From Script to Semantics: Prompting Strategies for African NLI

从文字到语义：非洲自然语言推理的提示策略

Anuj Tiwari, Terry Oko-odion, Hannah Nwokocha

发表机构 * arXiv

AI总结本研究系统评估了五种提示策略在斯瓦希里语、约鲁巴语和豪萨语的自然语言推理任务上的表现，发现对比提示策略最可靠且能显著提升模型性能。

Comments Accepted at the RAIL Workshop, LREC 2026

AI中文摘要

大型语言模型（LLMs）在多语言环境中的评估日益增多，但它们在低资源非洲语言中的推理行为仍未得到充分探索，尤其是在无微调的纯提示设置下。我们使用AfriXNLI基准，对斯瓦希里语、约鲁巴语和豪萨语的自然语言推理（NLI）提示策略进行了系统研究。我们评估了五种提示策略：基线（零样本）、文字感知（Script-Aware）、语言特定、对比和原生标签自翻译（NL-STP），使用了两个中等规模的开放权重模型（Llama3.2-3B和Gemma3-4B）。为隔离提示设计的影响，我们的研究排除了少样本示例和思维链推理。我们发现不同策略在各类别上的性能存在显著差异，某些配置中出现严重的中性类别崩溃和明显的预测偏斜。对比提示被证明是最可靠的策略，在不同语言和模型上持续带来改进，类别行为更均衡，整体准确率也有所提升。值得注意的是，精心构建的提示足以击败使用少样本提示和思维链提示的更强基线。我们发现提示表述对于低资源语言的多语言NLI至关重要，并且语言感知的决策结构可以有效增强资源受限环境下的鲁棒性。

英文摘要

Large language models (LLMs) are increasingly evaluated in multilingual settings, yet their inference behavior in low-resource African languages remains underexplored, especially under pure prompting without fine-tuning. We present a systematic study of prompting strategies for Natural Language Inference (NLI) in Swahili, Yoruba, and Hausa using the AfriXNLI benchmark. We evaluate five prompting strategies, namely Baseline (zero-shot), Script-Aware, Language Specific, Contrastive, and Native-Label Self-Translation (NL-STP), across two mid-sized open-weight models (Llama3.2-3B and Gemma3-4B). To isolate the effect of prompt design, we exclude few-shot examples and Chain-of-Thought reasoning from our study. We find significant differences in class-wise performance across strategies, with severe neutral-class collapse and high prediction skew in some configurations. Contrastive prompting proves to be the most reliable strategy, improving consistently across languages and models, with better-balanced class behavior and gains in overall accuracy. Notably, well-constructed prompts are sufficient to beat stronger baselines that use few-shot and Chain-of-Thought prompts. We find that prompt formulation is essential to multilingual NLI in low-resource languages and that language-aware decision structuring can meaningfully enhance robustness in resource-constrained settings.

2606.03301 2026-06-03 cs.CL cs.CV 版本更新

SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

SagaQA：面向电视剧长篇叙事理解的多跳推理基准

Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen

发表机构 * IRIT, University of Toulouse, France（法国图卢兹大学IRIT中心）； Agency for Science, Technology and Research (A*STAR), Singapore（新加坡科技研究局）； CNRS, IRIT, France（法国CNRS与IRIT）

AI总结提出SagaQA基准，通过跨剧集的多跳推理任务评估模型对完整电视剧多模态叙事的高层次理解，并比较并行、顺序和混合三种规划策略的性能。

AI中文摘要

我们介绍了SagaQA，一个用于对完整电视剧进行多跳推理的长视频基准。现有的视频推理基准通常强调对相邻帧或片段的局部理解。SagaQA通过要求对整个电视剧中扩展的多模态叙事进行高层次理解来弥补这一空白。SagaQA的一个显著特征是其推理步骤的粒度。我们的数据集需要长距离推理跳跃来连接完全不同的剧集之间的信息。这要求模型对整个事件和动作进行推理，需要在多模态层面上深入理解剧集的叙事和进展。受近期智能体方法进展的启发，我们进一步研究了不同的规划策略如何处理这种复杂推理。我们将这些方法分为三类——并行规划器、顺序规划器和混合规划器——并评估它们生成连贯且完整推理计划的能力。我们在SagaQA上的结果表明，混合规划器始终能产生更高质量的计划，并在电视剧复杂、高层次叙事理解方面表现出更强的能力。

英文摘要

We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives in entire TV shows. A distinguishing feature of SagaQA is the granularity of its reasoning steps. Our dataset necessitates long-range reasoning hops to connect information across completely different episodes. This requires models to reason over entire events and actions, demanding a deep understanding of the show's narration and progression at a multimodal level. Motivated by recent progress in agentic methods, we further study how different planning strategies handle such complex reasoning. We categorize these approaches into three classes (Parallel, Sequential, and Hybrid planners) and evaluate their ability to generate coherent and complete reasoning plans. Our results on SagaQA suggest that hybrid planners consistently produce higher-quality plans and exhibit stronger capabilities for complex, high-level narrative understanding in TV shows.

2606.03291 2026-06-03 cs.CL 版本更新

Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility

大型语言模型中的多语言遗忘：迁移、动态与可逆性

Chaoyi Xiang, Olga Ohrimenko, Benjamin I. P. Rubinstein, Lea Frermann

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）； University of Toronto（多伦多大学）； University of Washington（华盛顿大学）

AI总结本文通过扩展TOFU基准到五种语言，研究大型语言模型中的多语言遗忘，发现遗忘迁移在不同语言间高度可变，且遗忘主要作用于后期解码层而非共享的跨语言潜在空间，因此可以通过推理时的单一引导方向逆转大部分遗忘效果。

Comments Accepted at ICML 2026

词与道：德语医学NLP中领域特定BERT预训练的策略

Henry He, Johann Frei, Raphael Schmitt

发表机构 * School of Computation, Information and Technology（计算、信息与技术学院）； Technical University of Munich（慕尼黑工业大学）； Chair of IT Infrastructure for Translational Medical Research（转化医学研究信息技术基础设施教席）； Faculty of Applied Computer Science（应用计算机科学学院）； University of Augsburg（奥格斯堡大学）； Institute of General Practice（全科医学研究所）； Faculty of Medicine and Medical Center（医学院及医学中心）

AI总结本文提出ChristBERT系列模型，通过对比持续预训练、从头训练和领域词汇适应三种策略，在德语医学NLP任务中实现最优性能，并建立新的基准。

Comments Under revision at BMC Medical Informatics and Decision Making

AI中文摘要

数字医疗产生大量临床文本，可支持AI辅助应用，但德语生物医学语言模型仍受限于较旧的架构或受限的训练数据。我们提出了ChristBERT（临床与健康相关议题及主题调优BERT），这是一个基于德语RoBERTa的领域特定语言模型家族，在包含科学出版物、临床文本、健康相关网络内容和翻译临床资源的13.5GB语料库上训练。为了探究领域适应策略在德语临床NLP中的影响，我们比较了持续预训练、从头训练和领域词汇适应。所得模型在三个医学命名实体识别任务和两个文本分类任务上进行了评估。ChristBERT在五个基准中的四个上持续优于现有的通用和医学德语语言模型，并为德语临床语言建模建立了新的最先进水平。我们的结果表明，最优适应策略取决于任务：在我们的评估中，从头训练对高度专业化的临床文本特别有效，而持续预训练在更常见的医学文本上表现良好。所有模型均已公开发布，以支持德语医学NLP的未来研究和应用。

英文摘要

Digital healthcare generates vast amounts of clinical text that can support AI-assisted applications, yet German biomedical language models remain limited by older architectures or restricted training data. We present ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT), a family of domain-specific German RoBERTa-based language models trained on a 13.5GB corpus of scientific publications, clinical texts, health-related web content, and translated clinical resources. To investigate the impact of domain adaptation strategies in German clinical NLP, we compare continued pre-training, training from scratch, and domain-specific vocabulary adaptation. The resulting models are evaluated on three medical named entity recognition tasks and two text classification tasks. ChristBERT consistently outperforms existing general-purpose and medical German language models on four of five benchmarks and establishes a new state of the art for German clinical language modeling. Our results show that the optimal adaptation strategy is task-dependent: in our evaluation, training from scratch is particularly effective for highly specialized clinical texts, whereas continued pre-training performs well on more commonly written medical texts. All models are publicly released to support future research and applications in German medical NLP.

ARBOR: 通过可复用评分标准缓冲区的在线过程奖励用于搜索智能体

Zheng Liu, Longxiang Zhang, Xintong Wang, Zhiang Xu, Shaoxiong Zhan, Xin Shan, Wen Huang, Tao Dai, Shu-Tao Xia, Chengfu Huo, Liang Ding

发表机构 * Tsinghua University（清华大学）； Alibaba Group（阿里巴巴集团）； Peking University（北京大学）； Shenzhen University（深圳大学）

AI总结针对基于LLM的搜索智能体训练中结果奖励缺乏过程监督的问题，提出ARBOR框架，通过维护跨查询共享的评分标准记忆库，利用对比轨迹生成局部草案并整合为通用评分标准，以稀疏成对判断提供过程级梯度，在四个多跳QA基准上优于GRPO和DAPO基线。

AI中文摘要

基于LLM的搜索智能体主要使用结果奖励进行训练，搜索过程本身无监督。这种信号在结果同质组（所有采样轨迹共享相同正确性）中退化，产生零组内优势和无梯度。现有的过程监督要么训练昂贵的验证器，要么生成每个查询的评分标准，这些评分标准在查询间不一致且使用一次后丢弃。我们提出ARBOR（自适应评分标准缓冲区用于在线奖励），一种可复用的过程奖励框架，维护跨查询共享的评分标准记忆。由对比轨迹诱导的查询局部草案被接纳、整合为跨查询通用评分标准，并随策略演化而淘汰。一小部分活跃的通用评分标准通过稀疏成对判断对轨迹评分，所得分数加到基础奖励上，即使在结果奖励一致时也能提供过程级梯度。ARBOR在四个多跳QA基准上持续优于GRPO和DAPO基线，将LLM评判准确率平均提高最多4.2个百分点，并将最多42%的原本零梯度训练组转化为信息丰富的组。

英文摘要

LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctness, yielding zero within-group advantage and no gradient. Existing process supervision either trains a costly verifier or generates per-query rubrics that are inconsistent across queries and discarded after one use. We propose ARBOR (Adaptive Rubric Buffer for Online Reward), a reusable process-reward framework that maintains a rubric memory shared across queries. Query-local drafts induced from contrastive trajectories are admitted, consolidated into cross-query common rubrics, and retired as the policy evolves. A small active subset of common rubrics scores trajectories via sparse pairwise judging, and the resulting scores are added to the base reward, providing process-level gradient even when outcome reward is uniform. ARBOR consistently outperforms GRPO and DAPO baselines on four multi-hop QA benchmarks, raising average LLM-judge accuracy by up to 4.2 points and converting up to 42% of otherwise-zero-gradient training groups into informative ones.
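
The degenerate case described in the abstract can be made concrete with a small numeric sketch. It assumes the standard GRPO-style group-relative advantage (reward minus group mean, divided by group standard deviation); the process scores and the weight `beta` are made-up illustrative values, not numbers from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Outcome-homogeneous group: every sampled trajectory reached a correct answer.
outcome = [1.0, 1.0, 1.0, 1.0]
# Hypothetical rubric-based process scores for the same four trajectories.
process = [0.9, 0.4, 0.7, 0.2]
beta = 0.5  # assumed weight on the process score

outcome_only = group_advantages(outcome)
combined = group_advantages([o + beta * p for o, p in zip(outcome, process)])

print(outcome_only)  # all zeros: no learning signal from this group
print(combined)      # non-zero: trajectories are now ranked by process quality
```

With outcome reward alone the group contributes nothing to the policy gradient; adding a process score to the base reward is what restores a non-zero within-group advantage.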

2606.03237 2026-06-03 cs.AI cs.CL cs.CY cs.LG cs.MA 版本更新

Solipsistic Superintelligence is Unlikely to be Cooperative

唯我论超级智能不太可能合作

Rakshit S Trivedi, Natasha Jaques, Logan Cross, Alexander Sasha Vezhnevets, Joel Z Leibo

发表机构 * DeepMind； University of Cambridge（剑桥大学）； University of California, Berkeley（加州大学伯克利分校）

AI总结本文指出，基于唯我论方法设计的超级智能（极端能力的任务求解器）因忽视部署引发的内生非平稳性而难以合作，呼吁将相互依存作为核心设计原则的非唯我论研究范式。

Comments 24 pages, 1 figure, Accepted at Proceedings of the 43rd International Conference on Machine Learning, 2026

AI中文摘要

AI的核心挑战正从能力转向共存。AI研究的主导范式侧重于开发将世界视为外生且平稳反馈源的强大智能体。我们认为，源于这种唯我论AI设计方法的超级智能（极端能力的任务求解器）不太可能合作。部署AI系统会引发内生非平稳性，导致训练-测试-部署差距，即历史分布与部署环境相偏离。我们称此为单边优化的自我削弱属性。缩小这一差距需要参与合作的AI：即多个行为体导航其相互依存的均衡选择过程。我们呼吁一种非唯我论的研究范式，将这种相互依存作为核心设计原则，而非将合作视为待解决的任务。这需要构建涉及自适应对手方的动态评估测试平台，将制度视为设计原语，并保留人类能动性作为我们构建系统的结构性特征。

英文摘要

AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agents that treat the world as an exogenous and stationary source of feedback. We contend that superintelligence, an extremely capable task solver, born out of such a solipsistic approach to AI design, is unlikely to be cooperative. Deploying AI systems induces endogenous non-stationarity, resulting in a train-test-deploy gap where historical distributions diverge from the deployment context. We refer to this as the self-undermining property of unilateral optimization. Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence. We call for a non-solipsistic research paradigm that treats this interdependence as a core design principle rather than approaching cooperation as a task to solve. This entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.

2606.03220 2026-06-03 cs.CL cs.AI 版本更新

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

WebRISE: 面向MLLM生成Web工件的需求诱导状态评估

Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang

发表机构 * Tsinghua University（清华大学）； Huawei Noah’s Ark Lab（华为诺亚实验室）； East China Normal University（华东师范大学）； Tongji University（同济大学）； Institute of Artificial Intelligence, Beihang University（北京航空航天大学人工智能研究院）

AI总结提出WebRISE框架，通过交互契约图（ICG）将任务需求转化为可观察状态、用户意图转换和DOM/视觉断言，以评估MLLM生成的Web工件的功能正确性，实验表明ICG评分检测状态错误率是检查点评估的2-16倍。

AI中文摘要

现有的MLLM生成Web工件基准通过局部证据评估交互，忽略了决定页面是否正常工作的需求诱导状态和转换。我们提出WebRISE，它将任务需求编译成交互契约图（ICG），包含可观察状态、用户意图转换以及DOM/视觉断言，以实现与实现无关的浏览器执行。WebRISE涵盖五种输入模态（文本、Markdown、草图、图像、视频）下的442个任务，包含5,495个转换和5,271个需求检查，将用户声明的功能与隐式的产品级约束分开。在14个MLLM中，即使最强的模型也仅达到65.6%的转换有效性和66.3%的需求覆盖率，且视觉质量不能代表行为（Qwen3.6-35B-A3B在Markdown上：V=80.8但T=15.5）。视频提供了最强的交互信号（隐式覆盖率比文本高10.6个百分点），而隐式约束仍然存在；缺陷注入表明，基于ICG的评分检测状态错误的速率是检查点评估的2-16倍。

英文摘要

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.
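
A toy version of state-and-transition checking may help show why it differs from looking at a single rendered page. The sketch below is an assumption-laden illustration rather than the WebRISE implementation: a plain dictionary stands in for the browser page, and `Transition`, `transition_validity`, and the shopping-cart example are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Page = Dict[str, object]  # toy stand-in for a browser page's observable state


@dataclass
class Transition:
    name: str
    source: str                         # state the page starts in
    target: str                         # state the requirement expects afterwards
    action: Callable[[Page], None]      # simulated user intent
    assertion: Callable[[Page], bool]   # DOM/visual-style check on the result


def transition_validity(page_factory: Callable[[str], Page],
                        transitions: List[Transition]) -> float:
    """Fraction of transitions that reach the target state and pass their assertion."""
    passed = 0
    for t in transitions:
        page = page_factory(t.source)
        try:
            t.action(page)
            ok = page.get("state") == t.target and bool(t.assertion(page))
        except Exception:
            ok = False
        passed += int(ok)
    return passed / len(transitions) if transitions else 0.0


def make_page(state: str) -> Page:
    return {"state": state, "cart_count": 0}


def add_to_cart(page: Page) -> None:
    page["cart_count"] += 1
    page["state"] = "cart_nonempty"


def broken_checkout(page: Page) -> None:
    pass  # the button does nothing: a state error a static screenshot may miss


transitions = [
    Transition("add item", "cart_empty", "cart_nonempty",
               add_to_cart, lambda p: p["cart_count"] == 1),
    Transition("checkout", "cart_nonempty", "order_placed",
               broken_checkout, lambda p: p.get("order_id") is not None),
]
score = transition_validity(make_page, transitions)
print(score)
```

The page in this example could look visually complete while half of its required transitions fail, which is the kind of gap between visual quality and behavior that the abstract reports.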

2606.03219 2026-06-03 cs.CL cs.LG 版本更新

Sample-Size Scaling of the African Languages NLI Evaluation

非洲语言自然语言推理评估的样本量缩放

Anuj Tiwari, Oluwapelumi Ogunremu, Terry Oko-odion, Jesujuwon Egbewale, Hannah Nwokocha

发表机构 * Noida Institute of Engineering and Technology（诺伊达工程技术学院）； ML Collective

AI总结本研究通过AfriXNLI基准对16种非洲语言进行系统样本量缩放实验，发现NLI性能随样本量增加并非单调提升，而是呈现语言敏感且非单调的缩放行为，表明数据量不足以保证稳定收益，需语言敏感的数据集和更强多语言建模策略。

Comments Accepted at the AfricaNLP Workshop, EACL 2026

DOI: 10.18653/v1/2026.africanlp-main.22

AI中文摘要

非洲语言标注数据非常少，且增加标注数据量是否能可靠提升下游性能尚不明确。本研究基于AfriXNLI基准，对16种非洲语言进行了自然语言推理（NLI）的系统样本量缩放研究。在受控条件下，测试了两个约0.6B参数的多语言Transformer模型（在XNLI上微调的XLM-R Large和AfroXLM-R Large），样本量从50到500个标注示例不等，并在随机子采样运行中平均结果。与通常认为的随数据增加性能单调提升相反，我们发现了一种强烈语言敏感且通常非单调的缩放行为。一些语言在低资源场景下表现出早期饱和或性能下降，以及高方差。这些结果表明，数据量不足以保证非洲NLI的稳定收益，因此需要创建语言敏感的数据集和更强的多语言建模策略。

英文摘要

African languages have very little labelled data, and it is unclear whether increasing the amount of annotated data reliably improves downstream performance. We present a systematic sample-size scaling study of natural language inference (NLI) on 16 African languages based on the AfriXNLI benchmark. Under controlled conditions, two multilingual transformer models with roughly 0.6B parameters (XLM-R Large fine-tuned on XNLI, and AfroXLM-R Large) are tested on sample sizes between 50 and 500 labeled examples, with results averaged across random subsampling runs. Contrary to the usual expectation of monotonic improvement with more data, we find strongly language-sensitive and often non-monotonic scaling behavior. Some languages show early saturation or a decrease in performance as sample size grows, as well as high variance in low-resource regimes. These results indicate that data volume alone is not enough to guarantee stable gains for African NLI, motivating language-sensitive dataset creation and stronger multilingual modelling strategies.
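
The experimental loop implied here, repeated random subsampling at each sample size followed by averaging, can be sketched in a few lines. The `evaluate` callback is a hypothetical placeholder for fine-tuning and scoring a model on a subset, and the sizes, run count, and toy metric are assumptions chosen only to make the sketch runnable.

```python
import random
from statistics import mean, pstdev


def scaling_curve(pool, sizes, evaluate, n_runs=5, seed=0):
    """Mean and standard deviation of a metric over random subsamples of each size."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        scores = [evaluate(rng.sample(pool, n)) for _ in range(n_runs)]
        curve[n] = (mean(scores), pstdev(scores))
    return curve


# Toy labeled pool: (premise, hypothesis, label) with three NLI classes.
pool = [(f"premise {i}", f"hypothesis {i}", i % 3) for i in range(600)]


def toy_evaluate(subset):
    # Placeholder metric: share of the majority label in the subset.
    labels = [label for _, _, label in subset]
    return max(labels.count(l) for l in set(labels)) / len(labels)


curve = scaling_curve(pool, sizes=[50, 100, 200, 500], evaluate=toy_evaluate)
for n, (avg, spread) in curve.items():
    print(n, round(avg, 3), round(spread, 3))
```

Reporting the spread alongside the mean matters for the low-resource end of the curve, where the abstract notes high variance across subsamples.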

2606.03198 2026-06-03 cs.CL cs.AI 版本更新

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

AI评分者的区分能力取决于复杂临床决策中的评分协议

Sangwon Baek, Kyu Yeon Hur, Kyunga Kim

发表机构 * Asclep Korea Inc.（Asclep韩国公司）； Center for Data Science, New York University（纽约大学数据科学中心）； Division of Endocrinology and Metabolism, Department of Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine（成均馆大学医学院内分泌与代谢科，三星医疗中心）； Biomedical Statistics Center, Samsung Medical Center（三星医疗中心生物医学统计中心）； Department of Digital Health, SAIHST（SAIHST数字健康科）； Department of Data Convergence & Future Medicine, Sungkyunkwan University（成均馆大学数据融合与未来医学科）

AI总结通过因子研究，发现基于评分标准的协议能放大AI评分者区分能力，而无评分标准协议则抑制这种区分，支持在临床AI评估中使用评分标准锚定。

Comments 11 pages, 4 main figures, 8 supplementary figures, 9 supplementary tables

AI中文摘要

临床AI评估越来越多地委托给大型语言模型（LLMs）作为AI评分者进行评分，但其在不同评估条件下的评分行为尚未被定量表征。我们通过一项因子研究填补了这一空白，该研究关注成人2型糖尿病（T2D）药物治疗在12个月门诊随访中的AI评分者行为，这是一项涉及复杂决策的临床任务，通过七个评估问题操作化。四个开源LLMs同时作为临床决策支持系统（CDSS）模型和AI评分者。每个CDSS输出在两种评分协议下评分：基于评分标准的Gold Rubric（GR）协议（包含患者特定评分标准）和无评分标准的Non Gold Rubric（Non-GR）协议。线性混合效应模型将评分协议因子与五个设计因子（CDSS模型、CDSS提示配置（文档参考生成[DRG] vs. 基线）、评分者模型、提示字符和提示类型）交叉，并估计主效应及其协议交互。在所有问题中，AI评分者在Non-GR下始终给出非常窄范围内的更高分数（平均74-78分），而GR下的平均分数低7.69至49.64分，四分位距宽1.68至3.67倍。在每个问题内，GR将AI评分者对DRG和基线CDSS输出的区分能力放大了1.76至5.10倍，同时揭示了Non-GR抑制的评分者模型间的显著行为变异。这些发现支持评分标准锚定作为保留临床AI评估区分能力的评分协议；当问题需要患者特定或司法管辖区特定标准，而评分者模型无法仅从参数知识推断时，无评分标准评分无法替代。

英文摘要

Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation questions. Four open-source LLMs served simultaneously as clinical decision support system (CDSS) models and AI raters. Each CDSS output was scored under two scoring protocols: a rubric-anchored Gold Rubric (GR) protocol incorporating a patient-specific rubric, and a rubric-free Non Gold Rubric (Non-GR) protocol. Linear mixed effects models crossed the scoring protocol factor with five design factors: CDSS model, CDSS prompt configuration (document-referenced generation [DRG] vs. Baseline), rater model, prompt character, and prompt type. The models estimated main effects together with their protocol interactions. Across all questions, AI raters yielded consistently higher scores within a very narrow range (74 to 78 points on average) under Non-GR, whereas GR yielded mean scores 7.69 to 49.64 points lower and interquartile ranges 1.68 to 3.67 times wider. Within each question, GR amplified the AI rater's discrimination between DRG and Baseline CDSS outputs by factors of 1.76 to 5.10, while also revealing substantial behavioral variation across rater models that Non-GR suppressed. These findings support rubric anchoring as the scoring protocol that preserves discriminative power in clinical AI evaluation; rubric-free scoring cannot substitute when questions require patient-specific or jurisdiction-specific criteria that rater models cannot infer from parametric knowledge alone.

2606.03197 2026-06-03 cs.CL 版本更新

MemTrain: Self-Supervised Context Memory Training

MemTrain：自监督上下文记忆训练

Ziheng Li, Xingrun Xing, Haoqing Wang, Zhi-Hong Deng, Yehui Tang

发表机构 * State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University（通用人工智能国家重点实验室，智能科学与技术学院，北京大学）； Samsung Research, Beijing, China（三星研究院，北京，中国）

AI总结提出自监督框架MemTrain，通过两个耦合代理任务（端到端掩码重建和中间记忆召回）利用无标签维基百科语料增强LLM代理的上下文记忆能力，显著提升下游长文本推理性能。

AI中文摘要

记忆是长时程LLM代理不可或缺的能力，使其能够保留和利用跨扩展交互积累的信息。现有的记忆代理方法通常在下游任务上通过强化学习进行端到端训练。然而，为记忆密集型场景收集高质量标注问题成本高昂，且所得训练数据往往缺乏足够的多样性以覆盖通用记忆行为。在这项工作中，我们提出MemTrain，一种自监督训练框架，用于普遍增强LLM代理的上下文记忆能力，以实现更有效的下游后训练。MemTrain在无标签维基百科语料上引入两个耦合代理任务：（1）端到端掩码重建目标，要求模型在多轮记忆更新后恢复掩码实体，从而从最终结果角度鼓励记忆维持；（2）中间记忆召回目标，要求模型利用中间记忆状态重建掩码历史信息，鼓励整个交互过程中的忠实压缩和记忆完整性。两个目标通过GRPO联合优化。在长文本QA和基于搜索的QA基准上的大量实验表明，MemTrain在不同模型上持续改善下游记忆密集型推理性能，相比直接的任务特定后训练，提升高达17.67个点。

英文摘要

Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize information accumulated across extended interactions. Existing memory-agent approaches are typically trained end-to-end with reinforcement learning on downstream tasks. However, collecting high-quality annotated problems for memory-intensive scenarios is costly, and the resulting training data often lack sufficient diversity to cover general memory behaviors. In this work, we propose MemTrain, a self-supervised training framework for generally enhancing the context-memory capability of LLM agents for more effective downstream post-training. MemTrain introduces two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end-to-end masked reconstruction objective, which requires the model to recover masked entities after multiple rounds of memory updates, thereby encouraging memory maintenance from the final outcome perspective; and (2) an intermediate memory recall objective, which requires the model to reconstruct masked historical information using intermediate memory states, encouraging faithful compression and memory completeness throughout the interaction process. The two objectives are jointly optimized using GRPO. Extensive experiments on long-text QA and search-based QA benchmarks demonstrate that MemTrain consistently improves downstream memory-intensive reasoning performance across different models, achieving gains of up to 17.67 points over direct task-specific post-training.
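
The masked reconstruction objective can be pictured with a minimal data-preparation and scoring sketch. The exact-match reward, the `[MASK]` placeholder, and the function names are assumptions for illustration; the abstract does not specify how entities are chosen or how the reward is computed.

```python
def mask_entities(text, entities, mask="[MASK]"):
    """Replace each listed entity mention in the text with a mask placeholder."""
    masked = text
    for entity in entities:
        masked = masked.replace(entity, mask)
    return masked


def reconstruction_reward(predictions, entities):
    """Fraction of masked entities recovered exactly (case-insensitive)."""
    if not entities:
        return 0.0
    hits = sum(p.strip().lower() == e.lower() for p, e in zip(predictions, entities))
    return hits / len(entities)


passage = "Marie Curie was born in Warsaw."
entities = ["Marie Curie", "Warsaw"]
masked_passage = mask_entities(passage, entities)
print(masked_passage)

# Hypothetical model answers produced after several rounds of memory updates.
reward = reconstruction_reward(["marie curie", "Paris"], entities)
print(reward)
```

A scalar reward of this kind is the sort of signal a policy-gradient method such as GRPO could optimize without any human-annotated questions, since the masked entities come from the unlabeled text itself.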

2606.03180 2026-06-03 cs.CV cs.CL cs.LG 版本更新

GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

GLINT：面向细粒度放射学表征的稀疏门控视觉-语言对齐

Jonggwon Park, Seongeun Lee, Junhyun Park, Hannah Yun, Hyunwoong Kim, Sohyun Jeong, Hyewon Kang, Byungmu Yoon, Kyoyun Choi

AI总结针对放射学图像-报告全局对齐与局部病灶尺度不匹配的问题，提出GLINT框架，通过稀疏门控对齐和密集特征正则化实现零样本分类、定位和分割。

AI中文摘要

放射学中的视觉-语言模型（VLM）通过利用临床工作流程中自然产生的图像-报告对，已成为一种可扩展的范式。然而，这种配对揭示了尺度上的不匹配：每个病灶仅占据图像的一小部分区域，但监督仅在全局图像-报告级别提供。这带来了一个核心挑战：先前的方法将权重密集地分布到所有补丁上，而不是集中在与给定查询相关的稀疏子集上。为了解决这个问题，我们提出了GLINT（门控语言-图像对齐）框架，该框架显式建模这种稀疏对应关系。在对齐方面，我们引入了稀疏门控对齐，这是一种新颖的架构，其中在单独的门控嵌入空间上的sigmoid门仅激活与每个文本查询相关的补丁，强制执行显式稀疏性。在表征方面，我们添加了密集特征正则化，将可训练编码器的中间特征锚定到冻结的自监督学习（SSL）教师模型上，从而保留门控所依赖的细粒度补丁特征。相同的方案适用于2D胸部X光片（CXR）和3D胸部计算机断层扫描（CT），分别基于DINOv3和V-JEPA 2.1构建。GLINT支持从自由文本查询进行零样本分类、定位和分割，据我们所知，这是首次在没有掩码监督的情况下在3D CT体积上展示零样本分割。值得注意的是，最显著的增益出现在零样本定位和分割上，这些任务需要稀疏的、特定于查询的定位，这与我们的设计意图一致。在下游评估中，GLINT在分类、报告生成和分割方面均优于SSL编码器和医学VLM。

英文摘要

Vision-language models (VLMs) for radiology have emerged as a scalable paradigm by leveraging image-report pairs naturally produced in clinical workflows. However, this pairing reveals a mismatch in scale: each finding occupies only a small region of the image, yet supervision is provided only at the global image-report level. This poses a central challenge: prior approaches spread weight densely across all patches rather than concentrating on the sparse subset relevant to a given query. To address this, we present GLINT (Gated Language-Image alignmeNT), a framework that explicitly models this sparse correspondence. On the alignment side, we introduce Sparsely Gated Alignment, a novel architecture in which a sigmoid gate over a separate gate embedding space activates only the patches relevant to each textual query, enforcing explicit sparsity. On the representation side, we add Dense Feature Regularization, which anchors the trainable encoder's intermediate features to a frozen self-supervised learning (SSL) teacher, preserving the fine-grained patch features that the gate relies on. The same recipe applies to both 2D chest X-ray (CXR) and 3D chest computed tomography (CT), built with DINOv3 and V-JEPA 2.1, respectively. GLINT enables zero-shot classification, grounding, and segmentation from free-text queries, and to our knowledge is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. Notably, the most pronounced gains arise on zero-shot grounding and segmentation, where sparse, query-specific localization is required, consistent with our design intent. In downstream evaluation, GLINT outperforms both SSL encoders and medical VLMs on classification, report generation, and segmentation.
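
The contrast between spreading weight over all patches and gating a sparse subset can be shown with a small pure-Python sketch. It is an illustrative guess at the mechanism, not the GLINT architecture: the gate here is a sigmoid of a dot product between hypothetical patch and query gate embeddings, and the gated weights are used for a normalized pooling of patch features.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_pool(patch_feats, patch_gate_embs, query_gate_emb, bias=0.0):
    """Pool patch features with independent per-patch sigmoid gates.

    Unlike a softmax, each gate is computed on its own, so most patches can
    be driven close to zero for a given query.
    """
    gates = [sigmoid(dot(g, query_gate_emb) + bias) for g in patch_gate_embs]
    total = sum(gates)  # strictly positive because every sigmoid output is > 0
    dim = len(patch_feats[0])
    pooled = [sum(w * f[d] for w, f in zip(gates, patch_feats)) / total
              for d in range(dim)]
    return gates, pooled


# Three patches; only the first is relevant to the (hypothetical) text query.
patch_feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
patch_gate_embs = [[5.0, 0.0], [-5.0, 0.0], [-5.0, 0.0]]
query_gate_emb = [2.0, 0.0]

gates, pooled = gated_pool(patch_feats, patch_gate_embs, query_gate_emb)
print([round(g, 4) for g in gates])
print([round(p, 4) for p in pooled])
```

The pooled vector ends up dominated by the single open patch, which is the behavior one would want when a finding occupies only a small region of the image.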

2606.03179 2026-06-03 cs.CL 版本更新

HyperPatch: Sequential Knowledge Editing Under n-ary Structural Drift

HyperPatch: n元结构漂移下的顺序知识编辑

Yu-Kai Chan, Wen-Sheng Lien, Dong-Ting Yao, Bo-Kai Ruan, Kwan-Yeung Lin, Hong-Han Shuai, Meng-Fen Chiang

发表机构 * National Yang Ming Chiao Tung University（国立阳明交通大学）

AI总结针对非平稳环境中n元事件顺序更新引发的结构漂移问题，提出HyperPatch框架，通过超图流形上的稳定性建模，实现参数保留的知识编辑，在MQuAKE-CF和MQuAKE-T基准上分别取得96.24%和21.06%的跳步准确率相对提升。

Comments Accepted to Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

AI中文摘要

大型语言模型（LLMs）依赖知识编辑（KE）来维持时间有效性，然而现实世界中的知识本质上是n元的。我们证明，在非平稳环境中，对复杂关系的顺序更新会引发n元结构漂移，这是一种将n元事件二元化为三元组会破坏关系原子性的现象。这导致结构条件知识迁移失败，即检索器系统性错误接地，常被误诊为参数幻觉。为解决此问题，我们提出HyperPatch，一个参数保留框架，将顺序KE重新表述为超图流形上的稳定性问题。HyperPatch通过三个阶段保持事件完整性：（i）结构先验初始化，通过在超图神经网络（HGNN）上进行对比学习建立拓扑感知的嵌入空间，以捕获高阶相关性；（ii）顺序拓扑编辑，利用双阶段机制，采用基于SimHash的拓扑对齐进行快速冲突解决，以及拓扑LoRA自适应来跟踪漂移而无需骨干重训练；（iii）结构条件推理，整合来自融合语言和结构流形的全局一致证据。在MQuAKE-CF和MQuAKE-T基准上，HyperPatch在跳步准确率（H-Acc）上分别比最强基线相对提升96.24%和21.06%。进一步的消融实验表明，在连续n元更新流下具有卓越的可靠性，而标准基于KG的变体因结构错位导致H-Acc下降高达88.3%。

英文摘要

Large Language Models (LLMs) rely on Knowledge Editing (KE) to maintain temporal validity, yet real-world knowledge is inherently n-ary. We demonstrate that in non-stationary environments, sequential updates to complex relations induce N-ary Structural Drift, a phenomenon where the binary reification of n-ary events into triples fractures relational atomicity. This precipitates Structure-Conditioned Knowledge Transfer Failure, a systematic mis-grounding of the retriever frequently misdiagnosed as parametric hallucination. To tackle this, we propose HyperPatch, a parameter-preserving framework that reformulates sequential KE as a stability problem over hypergraph manifolds. HyperPatch preserves event integrity through three phases: (i) Structural Prior Initialization, establishing a topology-aware embedding space via contrastive learning on a Hypergraph Neural Network (HGNN) to capture high-order correlations; (ii) Sequential Topology Editing, utilizing a dual-stage mechanism that employs SimHash-based Topological Alignment for rapid conflict resolution and Topological LoRA Adaptation to track drift without backbone retraining; and (iii) Structure-Conditioned Reasoning, which integrates globally consistent evidence from fused linguistic and structural manifolds. On the MQuAKE-CF and MQuAKE-T benchmarks, HyperPatch achieves relative gains in Hop-wise Accuracy (H-Acc) of 96.24% and 21.06% over the strongest baseline, respectively. Further ablations demonstrate superior reliability under continuous n-ary update streams, whereas the standard KG-based variant suffers H-Acc collapses of up to 88.3% due to structural misalignment.
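
SimHash, which the abstract names as the basis of its topological alignment step, can be sketched generically as below. How HyperPatch applies it to hyperedges is not described in the abstract, so representing an n-ary event as a bag of `role:entity` tokens is purely an assumption, as are the example events.

```python
import hashlib


def simhash(tokens, bits=64):
    """Classic SimHash: per-bit majority vote over hashed tokens."""
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit token hash
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical n-ary events written as role:entity tokens.
old_event = ["person:Alice", "role:CEO", "org:Acme", "start:2020"]
new_event = ["person:Alice", "role:CEO", "org:Acme", "start:2023"]
other_event = ["team:Lions", "won:Cup", "year:1998", "city:Oslo"]

fp_old, fp_new, fp_other = simhash(old_event), simhash(new_event), simhash(other_event)
print(hamming(fp_old, fp_new), hamming(fp_old, fp_other))
```

Events that share most of their arguments tend to land at a small Hamming distance, so a cheap fingerprint comparison can flag which stored event an incoming update most likely conflicts with before any heavier processing.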

2606.03165 2026-06-03 cs.CL cs.AI 版本更新

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

大型语言模型中词汇对齐和偏好阶段转变的完全自动识别

Thomas Stephan Juzek, Xiaoyang Ming, Jose A. Hernandez

发表机构 * University of Washington（华盛顿大学）

AI总结本文提出两种无需人工干预的评估指标——词汇对齐分数和三角化偏好转变，用于自动识别大型语言模型中的词汇过度使用及其与人类偏好学习的关联。

Comments 16 pages, 2 figures, 10 tables

DOI: 10.63317/4ut7ammh7z3h
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 6116-6131

AI中文摘要

数字聊天助手（如ChatGPT）使用的语言可能偏离人类预期（即不对齐）。已有研究主要针对科学英语，既描述了出现哪些偏差，也在一定程度上解释了原因，并将其与人类偏好学习这一训练阶段联系起来。然而，现有方法依赖人工筛选。本文引入两种无需人工筛选、假设较少的评估指标：词汇对齐分数（用于识别词汇过度使用）和三角化偏好转变（用于量化此类转变中有多少可归因于人类偏好学习）。我们基于PubMed摘要生成续写，并在六个模型系列（Falcon、Gemma、Llama、Mistral、OLMo、Yi）上用窗口化文档流行度（windowed document prevalence）进行测量。该过程无需人工干预即可识别过度使用的词汇，如'suggest'、'additionally'和'strategy'，并估计它们与偏好学习的关联。我们的发现复现了先前的工作，并且在不同参数设置、随机种子以及更多数据的评估下保持稳定。该方法易于扩展，可用于系统研究科学英语之外以及跨语言的词汇对齐与不对齐；因此，这些指标有望帮助改进未来模型的对齐，并加深对其成因的理解。

英文摘要

The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking them to the training stage of human preference learning. Yet, existing approaches rely on manual curation. This paper introduces two curation-free, assumption-light evaluation metrics: the Lexical Alignment Score, which identifies lexical overuse, and the Triangulated Preference Shift, which quantifies how much of such shifts can be attributed to human preference learning. Using PubMed abstracts, continuations were generated and measured using windowed document prevalence across six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi). The procedure identifies, without manual intervention, overused items such as 'suggest', 'additionally', and 'strategy', and estimates their link to preference learning. Our findings replicate prior work and remain stable across parameter settings, random seeds, and evaluation on further data. The approach scales readily and enables systematic study of lexical (mis)alignment beyond Scientific English and across languages, and as such, the metrics have the potential to contribute to improved alignment for future models and understanding of its origins.
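
To make "document prevalence" concrete, here is a minimal sketch that compares how many human-written versus model-written documents contain a word. The abstract does not define the Lexical Alignment Score or the windowing step, so the log-ratio below, the tokenizer, and the toy texts are assumptions for illustration only, not the paper's metric.

```python
import math
import re


def doc_prevalence(docs, word):
    """Fraction of documents that contain `word` at least once."""
    word = word.lower()
    hits = sum(1 for d in docs if word in set(re.findall(r"[a-z']+", d.lower())))
    return hits / len(docs)


def log_prevalence_ratio(human_docs, model_docs, word, eps=1e-6):
    """Positive values mean the word appears in more model documents than human ones."""
    human = doc_prevalence(human_docs, word) + eps
    model = doc_prevalence(model_docs, word) + eps
    return math.log(model / human)


# Toy stand-ins for human abstracts and model continuations.
human_docs = ["We suggest a new assay.", "The results show a clear effect."]
model_docs = ["We suggest that the assay works.", "These results suggest a strategy."]

print(log_prevalence_ratio(human_docs, model_docs, "suggest"))
```

Counting documents instead of raw tokens keeps one repetitive text from dominating the estimate, which is presumably why a prevalence-style measure is attractive here.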

2606.03156 2026-06-03 cs.CL 版本更新

A cross-domain tropical species dataset with Chinese vernacular names and CITES source links

一个包含中文俗名和CITES来源链接的跨领域热带物种数据集

Jeff Wang

发表机构 * NEXLY LLC, United States（NEXLY LLC美国）

AI总结构建了一个覆盖410,499个活跃热带物种的跨领域数据集，整合多源分类标识符，添加跨领域本体、中文俗名层（覆盖率99.50%）和CITES来源链接层，并报告了初步内部审查结果。

Comments 25 pages, 4 figures, 4 tables. Dataset descriptor for the Tropical Species Encyclopedia. Companion to the methodology paper arXiv:2606.00994. Dataset deposited at Zenodo (doi:10.5281/zenodo.20377811); canonical preprint-of-record at Zenodo (doi:10.5281/zenodo.20424981)

AI中文摘要

我们描述了一个版本化的跨领域数据集，包含410,499个活跃热带物种（工作快照2026-04-20），涵盖三个应用子领域——热带植物、热带水生和热带宠物——这些领域共享商业和监管生命周期，但分布在按界组织的生物多样性基础设施中。该资源整合了来自GBIF、世界在线植物志、iNaturalist、NCBI分类学、生命目录和生命百科全书的分类标识符，并添加了三个原始层：一个跨领域本体，根据贸易和饲养背景重新划分分类群；一个中文俗名层，在排除未经验证的机器生成建议的类型学下，提供明确的每个名称来源；以及一个CITES来源链接层，将每个分类群连接到其Species+条目。中文俗名覆盖率——即带有与科学双名法不同的中文名称的分类群比例——达到99.50%（410,499个中的408,456个；全种群计数）。覆盖率表征完整性，而非名称翻译准确性；后者由四级来源类型学界定，并且是此处报告的初步内部审查的主题，其中盲法外部审计被确定为主要的未决事项。上游内容仅通过稳定标识符引用原始贡献层，支持CC-BY 4.0重用。该数据集存放在Zenodo上（https://doi.org/10.5281/zenodo.20377811）。本预印本是数据集当前状态的规范v1.0描述；未来的数据描述符提交是可预期的，但取决于“局限性”中列出的验证和发布工程事项。

英文摘要

We describe a versioned cross-domain dataset of 410,499 active tropical species (working snapshot 2026-04-20) spanning three applied subdomains -- tropical_plants, tropical_aquatic, and tropical_pets -- that share a commercial and regulatory life cycle but are distributed across kingdom-organised biodiversity infrastructures. The resource joins taxonomic identifiers from GBIF, Plants of the World Online, iNaturalist, NCBI Taxonomy, the Catalogue of Life and the Encyclopedia of Life, and adds three original layers: a cross-domain ontology that re-segments taxa along trade and husbandry contexts; a Chinese vernacular layer with explicit per-name provenance under a typology that excludes unverified machine-generated proposals; and a CITES source-linkage layer connecting each taxon to its Species+ entry. Chinese vernacular coverage -- the proportion of taxa carrying a CJK Chinese name distinct from the scientific binomial -- reaches 99.50 percent (408,456 of 410,499; full-population count). Coverage characterises completeness, not name-translation accuracy; the latter is bounded by the four-level provenance typology and is the subject of a preliminary internal review reported here, with a blind external audit identified as the principal open item. Upstream content is referenced by stable identifier only for the original-contribution layers, supporting CC-BY 4.0 reuse. The dataset is deposited on Zenodo (10.5281/zenodo.20377811). This preprint is the canonical v1.0 description of the dataset's current state; future Data Descriptor submission is anticipated but is contingent on the validation and release-engineering items listed in the Limitations.
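
The coverage figure can be checked directly from the two counts quoted in the abstract. The variable names below are ours; only the two numbers come from the text.

```python
# Counts quoted in the abstract.
taxa_with_chinese_name = 408_456
total_taxa = 410_499

coverage = taxa_with_chinese_name / total_taxa
missing = total_taxa - taxa_with_chinese_name

print(f"coverage = {coverage:.2%}, taxa without a Chinese name = {missing}")
```

As the abstract stresses, this ratio measures completeness only; it says nothing about whether each name is a correct translation.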

2606.03143 2026-06-03 cs.LG cs.CL 版本更新

FederatedSkill: Federated Learning for Agentic Skill Evolution

FederatedSkill: 面向智能体技能演化的联邦学习

Jingbo Yang, Guanyu Yao, Yang Zhang, Ramana Rao Kompella, Gaowen Liu, Shiyu Chang

发表机构 * UC Santa Barbara（加州大学圣巴巴拉分校）； MIT-IBM Watson AI Lab（麻省理工-IBM Watson人工智能实验室）； Cisco Research（思科研究）

AI总结提出FederatedSkill框架，通过语义技能差异作为通信单元，在保护隐私的同时实现个性化技能演化，相比自演化基线成功率提升44.4%，计算成本降低37.5%。

AI中文摘要

现代LLM智能体越来越依赖技能库来处理复杂任务，使得技能演化成为自我改进的主要驱动力。然而，孤立的单用户任务流缺乏构建全面技能所需的多样性。虽然跨用户协作可以克服这一数据瓶颈，但当前的轨迹共享方法会损害用户隐私，并强加一个统一的全局库，无法适应客户端的异质性。我们引入了FederatedSkill，一个用于协作智能体演化的隐私保护框架。FederatedSkill超越了原始轨迹共享，利用语义技能差异（即对本地库的结构化补丁）作为通信的基本单位。在服务器端，一个演化智能体聚合这些补丁，动态建模客户端特定的能力边界，促进严格个性化的技能演化，而不是次优的全局平均。在20个不同的智能体任务族上评估，FederatedSkill相比自演化基线表现出显著提升，成功率最高提高44.4%，计算成本降低37.5%。

英文摘要

Modern LLM agents increasingly rely on skill libraries to handle complex tasks, making skill evolution a primary driver of self-improvement. However, isolated single-user task streams lack the diversity required to build comprehensive skills. While cross-user collaboration can overcome this data bottleneck, current trajectory-sharing approaches compromise user privacy and impose a uniform global library that fails to accommodate client heterogeneity. We introduce FederatedSkill, a privacy-preserving framework for collaborative agent evolution. Moving beyond raw trajectory sharing, FederatedSkill utilizes semantic skill diffs, structured patches over local libraries, as the fundamental unit of communication. On the server side, an evolution agent aggregates these patches to dynamically model client-specific capability boundaries, facilitating strictly personalized skill evolution rather than a suboptimal global average. Evaluated across 20 distinct agent task families, FederatedSkill demonstrates substantial gains over self-evolving baselines, achieving up to a 44.4% increase in success rate and a 37.5% reduction in computational cost.
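
One way to picture a "semantic skill diff" is as a structured patch over a local skill library, so that a client shares the patch and never the raw trajectories. The sketch below assumes a library is a plain mapping from skill name to skill text and uses three hypothetical operations (add, update, remove); the paper's actual patch format and server-side aggregation are not described in the abstract.

```python
def diff_skills(old, new):
    """Build a patch (list of operations) that turns library `old` into `new`."""
    patch = []
    for name in sorted(set(old) | set(new)):
        if name not in old:
            patch.append(("add", name, new[name]))
        elif name not in new:
            patch.append(("remove", name, None))
        elif old[name] != new[name]:
            patch.append(("update", name, new[name]))
    return patch


def apply_patch(library, patch):
    """Return a new library with the patch applied; the input is left untouched."""
    result = dict(library)
    for op, name, body in patch:
        if op == "remove":
            result.pop(name, None)
        else:  # "add" and "update" both set the skill body
            result[name] = body
    return result


# Hypothetical local library before and after a round of self-evolution.
before = {"parse_invoice": "v1 steps", "book_flight": "v1 steps"}
after = {"parse_invoice": "v2 steps", "summarize_email": "v1 steps"}

patch = diff_skills(before, after)
print(patch)
```

Because the patch names skills and their new bodies only, a server could in principle decide per client which operations to forward, which is the personalization the abstract describes.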

2606.03136 2026-06-03 cs.CR cs.CL 版本更新

PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations

PsychoPass: 多轮对抗性LLM对话的几何轮廓分析

Muberra Ozmen, Subhabrata Majumdar

发表机构 * Coveo Montreal, QC, Canada（加拿大蒙特利尔 Coveo）； Indian Institute of Management Bangalore（班加罗尔印度管理学院）

AI总结提出PsychoPass框架，通过提取对话轨迹在嵌入空间中的几何特征，在有害内容生成前预测多轮越狱攻击，并发现早期几何信号具有鲁棒性。

AI中文摘要

对大型语言模型（LLM）的多轮越狱攻击揭示了当前防护措施的不匹配：它们作用于单个轮次，而攻击则作为跨对话的轨迹展开。我们提出从内容转向动态，将对话建模为表示空间中的路径，并询问对抗意图是否在其几何形状中早期编码。我们引入了PsychoPass，一个从嵌入空间中的对话轨迹提取几何特征以在有害内容生成前预测潜在攻击的框架。这些特征在朴素分类器中实现了近乎完美的性能，这很大程度上可以通过包含轮次数作为特征来解释。去除这一混淆因素后，仍保留了一个较小但一致的几何信号，其分类性能不显著依赖于编码器选择。关键的是，该信号在对话早期出现：仅从短前缀开始，攻击结果就高于随机水平，比基线防护更可靠。一项支持性理论分析通过长度与形状的分解、基于前缀长度的检测界限以及编码器不变性解释了这些发现。综合来看，这些结果表明对抗性对话留下了早期、表示鲁棒的几何指纹，适用于在线监控。

英文摘要

Multi-turn jailbreak attacks on large language models (LLMs) reveal a mismatch in current guardrails: they operate on individual turns, while attacks unfold as trajectories across conversations. We propose a shift from content to dynamics, modeling conversations as paths in representation space and asking whether adversarial intent is encoded early in their geometry. We introduce PsychoPass, a framework that extracts geometric features from conversation trajectories in embedding space to predict a potential attack before harmful content is produced. These features achieve near-perfect performance in naïve classifiers, which is largely explained by the inclusion of number of turns as a feature. After removing this confound, a smaller but consistent geometric signal remains, with classification performance that does not depend meaningfully on encoder choice. Crucially, this signal appears early in the conversation: attack outcomes remain above chance from short prefixes alone, more reliably than baseline guardrails. A supporting theoretical analysis explains these findings via a decomposition of length and shape, a detection bound based on prefix length, and encoder invariance. Together, these results show that adversarial conversations leave an early, representation-robust geometric fingerprint suitable for online monitoring.
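
The idea of treating a conversation as a path in embedding space can be sketched with a few simple shape statistics. The features below (path length, net displacement, straightness, mean step) are generic choices made for this example and are not taken from the paper; the 2-D points stand in for per-turn embeddings from some encoder.

```python
import math


def trajectory_features(points):
    """Shape statistics of a path given as a list of equal-length vectors."""
    steps = [math.dist(a, b) for a, b in zip(points, points[1:])]
    path_length = sum(steps)
    displacement = math.dist(points[0], points[-1])
    return {
        "path_length": path_length,
        "displacement": displacement,
        # 1.0 for a straight line, near 0.0 for a path that circles back.
        "straightness": displacement / path_length if path_length else 0.0,
        # Normalised by the number of steps, so it does not simply encode length.
        "mean_step": path_length / len(steps) if steps else 0.0,
    }


# Toy per-turn embeddings: one conversation drifts steadily, one returns to start.
drifting = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
circling = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]

print(trajectory_features(drifting))
print(trajectory_features(circling))
```

Ratios such as straightness are independent of how many turns the conversation has, which matters given the abstract's finding that turn count alone explains much of the naive classifier's performance.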

2606.03132 2026-06-03 cs.CL 版本更新

DMT-CBT: Longitudinal Therapeutic State Modeling for CBT Counseling

DMT-CBT：面向CBT咨询的纵向治疗状态建模

Chang Liu, Shuyi Zhang, Changsheng Ma, Yongfeng Tao, Minqiang Yang, Bin Hu

发表机构 * School of Information Science and Engineering, Lanzhou University（信息科学与工程学院，兰州大学）

AI总结提出DMT-CBT框架，通过跨会话结构化治疗状态、多模态行为基础与工具增强干预，解决现有方法将CBT咨询简化为局部回复生成的问题，实现纵向治疗状态动态建模。

AI中文摘要

大型语言模型（LLM）在认知行为疗法（CBT）咨询中展现出日益增长的潜力。然而，现有方法大多将咨询视为局部回复生成问题，专注于短文本、仅文本或单次会话交互中的共情回复。我们认为这种表述从根本上与真实心理治疗的本质不符。在临床CBT中，治疗是一个纵向过程，治疗师需要跨会话持续推断、更新和干预不断演变的治疗状态。真实的CBT还涉及多模态推理和延迟的跨会话干预效果，要求模型在部分可观测性下捕捉纵向治疗状态的演变。我们提出DMT-CBT，一个用于CBT咨询中治疗状态动态建模的框架。DMT-CBT跨会话维护结构化的治疗状态，同时整合多模态行为基础和工具增强的干预，以支持适应性治疗推理。基于该框架，我们构建了DMTCorpus，一个合成的多会话多模态CBT咨询数据集，具有演变的治疗状态、图像基础的患者行为以及跨会话干预的连续性。实验结果表明，与事后提取方法相比，DMT-CBT提高了咨询保真度和治疗联盟，产生了更有利的纵向情感轨迹，并更忠实地保留了治疗状态。

英文摘要

Large language models (LLMs) have shown growing potential for Cognitive Behavioral Therapy (CBT) counseling. However, most existing approaches still formulate counseling as a local response generation problem, focusing on empathetic replies within short, text-only, or single-session interactions. We argue that this formulation fundamentally mismatches the nature of real psychotherapy. In clinical CBT, therapy is a longitudinal process in which therapists continuously infer, update, and intervene on evolving therapeutic states across sessions. Realistic CBT further involves multimodal inference and delayed cross-session intervention effects, requiring models to capture longitudinal therapeutic state evolution under partial observability. We propose DMT-CBT, a framework for Dynamic Modeling of evolving Therapeutic states in CBT counseling. DMT-CBT maintains structured therapeutic states across sessions while incorporating multimodal behavioral grounding and tool-augmented intervention to support adaptive therapeutic reasoning. Based on this framework, we construct DMTCorpus, a synthetic multi-session multimodal CBT counseling dataset featuring evolving therapeutic states, image-grounded client behaviors, and cross-session intervention continuity. Experimental results show that DMT-CBT improves counseling fidelity and therapeutic alliance, produces more favorable longitudinal affective trajectories, and preserves therapeutic states more faithfully than post-hoc extraction approaches.

2606.03128 2026-06-03 cs.CR cs.AI cs.CL cs.LG 版本更新

Decoupled Smart Contract Audits: Lightweight LLM Framework via Distillation and Aggregation

解耦式智能合约审计：通过蒸馏与聚合的轻量级LLM框架

Bagus Rakadyanto Oktavianto Putra, Muhamad Risqi Utama Saputra, Widyawan, Guntur Dharma Putra

发表机构 * University of Indonesia（印度尼西亚大学）

AI总结提出一种基于轻量级开源LLM（0.6B-4B参数）的解耦式智能合约审计框架，通过rsLoRA、知识蒸馏和链式验证聚合策略，在漏洞检测中达到98.25%准确率，优于7B-34B参数模型。

Comments 12 pages, 4 figures, 5 tables. Accepted to IEEE ICWS 2026

AI中文摘要

G^2C-MT: 面向文档级机器翻译的图引导上下文选择

Baijun Ji, Zixuan Zhou, Xiangyu Duan, Yu Liu, Longbo Sun, Rupu Wei, Bohong Zhao

发表机构 * School of Computer Science and Technology, Soochow University（苏州大学计算机科学与技术学院）； Trip.com Group（Trip.com集团）

AI总结提出G^2C-MT方法，通过构建轻量级篇章图并采用深度偏置随机游走采样上下文路径，以结构化方式为文档级机器翻译选择上下文，提升翻译质量与鲁棒性。

Comments 9 pages, 2 figures; IJCAI2026

AI中文摘要

有效的文档级机器翻译（DocMT）需要捕捉长距离篇章依赖。近期工作探索了基于检索和篇章感知的上下文选择方法。然而，这些方法通常缺乏对文档中远距离段落间结构化篇章依赖关系的显式建模。本文提出G^2C-MT（图引导的机器翻译上下文），将DocMT上下文选择视为轻量级篇章图上的结构化路径发现问题，而非检索非结构化上下文集或依赖昂贵的基于LLM的篇章建模。具体地，我们将每个段落表示为节点，并基于语义相似性、邻接性和关键词重叠建模每对节点间的关系。进一步，我们提出在图上进行深度偏置随机游走，为每个目标段落采样一条后向上下文路径。该上下文路径将用于提示大语言模型（LLM）进行翻译。该框架自然支持多路径上下文采样，通过聚合多样化的翻译候选来提升对篇章歧义输入的鲁棒性。跨多个领域的实验表明，G^2C-MT在多个LLM（包括DeepSeek-V3、Gemini-2.5-Flash-lite和Qwen-2.5/3系列）上均优于强基线。

英文摘要

Effective document-level machine translation (DocMT) requires capturing long-range discourse dependencies. Recent work has explored retrieval-based and discourse-aware context selection. However, these approaches often lack an explicit mechanism for modeling structured discourse dependencies between distant paragraphs in a document. In this paper, we propose G^2C-MT (Graph-Guided Context for Machine Translation), which views DocMT context selection as a structured path discovery problem on a lightweight discourse graph, rather than retrieving unstructured context sets or relying on expensive LLM-based discourse modeling. In detail, we represent each paragraph as a node and model the relationship between each pair of nodes, considering their semantic similarity, adjacency, and keyword overlap. Furthermore, we propose a depth-biased random walk over the graph to sample a backward context path for each target paragraph. The context path will be used to prompt a large language model (LLM) for translation. This framework naturally supports multi-path context sampling, which can improve robustness by aggregating diverse translation candidates for discourse-ambiguous inputs. Experiments conducted across various domains show that G^2C-MT outperforms strong baselines on multiple LLMs, including DeepSeek-V3, Gemini-2.5-Flash-lite, and the Qwen-2.5/3 series.
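
The backward context walk can be sketched as follows, assuming the discourse graph is already available as a matrix of pairwise paragraph weights (however similarity, adjacency, and keyword overlap were combined). The abstract does not say how the depth bias is defined; here it is assumed to be an exponent on the distance between paragraphs, and all names are hypothetical.

```python
import random


def backward_context_path(weights, target, max_len=3, depth_bias=1.0, rng=None):
    """Sample earlier paragraphs for `target` by walking backward through the graph.

    weights[i][j] is the edge weight between paragraphs i and j.
    depth_bias > 0 favours jumps to more distant (earlier) paragraphs;
    depth_bias = 0 uses the edge weights alone. This bias form is an assumption.
    """
    rng = rng or random.Random(0)
    visited, current = [], target
    while current > 0 and len(visited) < max_len:
        candidates = list(range(current))  # only paragraphs before the current one
        scores = [
            (weights[current][j] + 1e-9) * (current - j) ** depth_bias
            for j in candidates
        ]
        current = rng.choices(candidates, weights=scores, k=1)[0]
        visited.append(current)
    return visited[::-1]  # reading order, ready to prepend to the prompt


# Toy graph over five paragraphs with uniform weights.
toy_weights = [[1.0] * 5 for _ in range(5)]
print(backward_context_path(toy_weights, target=4))
```

Calling the function several times with different random generators would give several context paths per paragraph, which corresponds to the multi-path sampling the abstract mentions.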

2606.03063 2026-06-03 cs.LO cs.CL 版本更新

ZX-Calculus: Trace-Indexed Dependent Types and Epistemic Semantics

ZX-演算：迹索引依赖类型与认知语义

Peng Chen

发表机构 * School of Information Science, Beijing Language and Culture University（北京语言文化大学信息科学学院）

AI总结提出ZX-演算，通过迹索引类型、预层非单调语义和构造性AGM信念修正，保守扩展Martin-Löf依赖类型论，并给出Coq机械化证明。

AI中文摘要

我们提出ZX-演算（知识演化演算），它是Martin-Löf依赖类型论（MLTT）的保守扩展，集成了迹索引类型、预层非单调语义和构造性AGM信念修正。本文附带Coq机械化证明（34个完整证明；两个核心结果零未完成）。（I）迹类型。FinTrace(s0,sn)是一个带类型的执行迹的归纳族。FinTrace和Star(Step)作为路径类型同构，但判断上不相等；TraceElim显式暴露事件标签e:Event，为事件驱动归纳提供了更符合人体工程学的接口。我们证明了迹可达性对应、确定性重放以及通过可归约候选（带传输引理，RC-elim推迟；所有其他核心结果经Coq验证）的规范性框架。（II）层语义。迹索引命题是自由迹偏序范畴Tf上的逆变层。分离定理（显式反模型）区分了证明论单调性和语义非单调性。项模型是初始CwF（句法泛性质，非经典完备性）。（III）AGM信念修正。我们给出了一个显式的构造性部分交收缩算法，经(C1)-(C4)验证。所有八条AGM公设(R1)-(R8)都是定理。R7和R8的证明使用了析取加固引理，并给出了自包含的构造性推导。（IV）集成。B^AGM在顺序修正中不满足层复合律BP-comp（显式反模型，Coq验证）。我们引入单步修正系统（SSRS），证明B^AGM是有效的SSRS（Coq验证），并表明这足以处理迹态射、收缩刻画和修正见证。BP-comp失败揭示了路径依赖信念修正与函子一致性之间的基本张力，此前未被识别。

英文摘要

We propose ZX-Calculus (Knowledge Evolution Calculus), a conservative extension of Martin-Löf Dependent Type Theory (MLTT) integrating trace-indexed types, presheaf non-monotone semantics, and constructive AGM belief revision. A Coq mechanisation accompanies the paper (34 complete proofs; zero admits for the two central results). (I) Trace types. FinTrace(s0,sn) is an inductive family of typed execution traces. FinTrace and Star(Step) are isomorphic as path types but not judgementally equal; TraceElim exposes the event label e:Event explicitly, giving a more ergonomic interface for event-driven induction. We prove the Trace-Reachability Correspondence, Deterministic Replay, and a canonicity framework via reducibility candidates with a Transport Lemma (RC-elim deferred; all other Core results are Coq-verified). (II) Sheaf semantics. Trace-indexed propositions are contravariant sheaves over the free trace partial-order category Tf. A Separation Theorem (explicit countermodel) distinguishes proof-theoretic monotonicity from semantic non-monotonicity. The term model is an initial CwF (syntactic universal property, not classical completeness). (III) AGM belief revision. We give an explicit constructive partial meet contraction algorithm verified against (C1)-(C4). All eight AGM postulates (R1)-(R8) are theorems. Proofs of R7 and R8 use the Disjunctive Entrenchment Lemma, given a self-contained constructive derivation. (IV) Integration. B^AGM fails the sheaf composition law BP-comp for sequential revision (explicit countermodel, Coq-verified). We introduce Single-Step Revision Systems (SSRS), prove B^AGM is a valid SSRS (Coq-verified), and show this suffices for trace morphisms, retraction characterisation, and revision witnesses. The BP-comp failure reveals a fundamental tension between path-dependent belief revision and functor consistency, not previously identified.

2606.03043 2026-06-03 cs.CL 版本更新

SEA-Embedding：面向东南亚的开放且可复现的文本嵌入

Peerat Limkonchotiwat, Raymond Ng, Sarana Nutanong, Jian Gang Ngui

发表机构 * AI Singapore（AI新加坡）； School of Information Science and Technology, VISTEC（信息科学与技术学院，VISTEC）

AI总结提出SEA-Embedding，一个完全开放且可复现的东南亚语言文本嵌入管道，通过研究数据组成、训练目标和基础编码器初始化三个核心因素，在SEA-BED上取得最优结果。

2606.03022 2026-06-03 cs.CL cs.AI 版本更新

Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization

幻觉作为正交噪声：通过动态上下文正交化实现推理时流形对齐

Mingkuan Zhao, Wentao Hu, Tianchen Huang, Yuheng Min, Suquan Chen, Yide Gao, Yanbo Zhai, Shuangyong Song, Xuelong Li

发表机构 * Xi’an Jiaotong University（西安交通大学）； Xingchen AGI Lab（星辰AGI实验室）； China Telecom AI Technology (Beijing) Co., Ltd.（中国电信人工智能技术（北京）有限公司）； Institute of Artificial Intelligence, China Telecom（中国电信人工智能研究院）； University of Science and Technology of China（中国科学技术大学）； Tsinghua University（清华大学）

AI总结提出一种基于线性表示假设的几何框架，将大语言模型幻觉解释为残差流语义流形的正交噪声，并引入推理时干预方法动态上下文正交化（DCO），通过层间Z分数抑制机制选择性地衰减异常正交分量，在保持知识记忆的同时提升上下文忠实度。

AI中文摘要

大语言模型（LLMs）中的幻觉——即生成与上下文事实或逻辑约束不一致的内容——仍然是可靠部署面临的持续挑战。在这项工作中，我们通过基于线性表示假设的几何框架来解决这个问题。我们提出，幻觉表现为相对于残差流语义流形的正交噪声。具体来说，我们假设虽然注意力头理想地传播与上下文子空间一致的信息，但当特定头引入与该子空间正交的分量时，就会产生幻觉，破坏潜在表示的一致性。基于这一表述，我们引入了动态上下文正交化（DCO），一种推理时干预方法。DCO利用输入残差流作为动态上下文锚点，对注意力头输出进行正交分解。为了区分上下文对齐的语义更新和发散噪声，DCO采用层间Z分数抑制机制，根据统计分布选择性地衰减异常正交分量。在XSum、NQ-Swap和IFEval等基准上对Llama-3-8B和70B的评估表明，与最先进的干预基线相比，DCO实现了更优的上下文忠实度。此外，DCO在TriviaQA和TruthfulQA等知识密集型任务上保持高性能，有效缓解了现有方法中常见的幻觉抑制与参数知识保留之间的权衡。我们的发现验证了幻觉的几何解释，并将DCO确立为一种计算高效的流形对齐方法。代码可在 https://github.com/Harry-Miral/DCO 获取。

英文摘要

Hallucination in Large Language Models (LLMs) -- the generation of content inconsistent with contextual facts or logical constraints -- remains a persistent challenge for reliable deployment. In this work, we address this issue through a geometric framework rooted in the linear representation hypothesis. We propose that hallucinations manifest as orthogonal noise relative to the semantic manifold of the residual stream. Specifically, we hypothesize that while attention heads ideally propagate information congruent with the context subspace, hallucinations arise when specific heads introduce components orthogonal to this subspace, disrupting the coherence of the latent representation. Based on this formulation, we introduce Dynamic Contextual Orthogonalization (DCO), an inference-time intervention method. DCO utilizes the input residual stream as a dynamic context anchor to perform orthogonal decomposition on attention head outputs. To distinguish between context-aligned semantic updates and divergent noise, DCO employs a layer-wise Z-score suppression mechanism that selectively attenuates outlier orthogonal components based on statistical distributions. Evaluations on Llama-3-8B and 70B across benchmarks such as XSum, NQ-Swap, and IFEval demonstrate that DCO achieves superior contextual faithfulness compared to state-of-the-art intervention baselines. Furthermore, DCO maintains high performance on knowledge-intensive tasks like TriviaQA and TruthfulQA, effectively mitigating the trade-off between hallucination suppression and parametric knowledge retention often observed in existing methods. Our findings validate the geometric interpretation of hallucinations and establish DCO as a computationally efficient approach for enforcing manifold alignment. Our code is available at https://github.com/Harry-Miral/DCO
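
The two geometric steps in the abstract, orthogonal decomposition against an anchor and Z-score suppression of outliers, can be sketched with plain vectors. This is a simplified reading, not the released DCO code: the anchor is a single vector, Z-scores are taken over the heads of one layer, and the threshold `z_max` and damping factor `damp` are assumed parameters.

```python
import math
import statistics


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose(head_out, anchor):
    """Split a head output into parts parallel and orthogonal to the anchor."""
    scale = dot(head_out, anchor) / dot(anchor, anchor)
    parallel = [scale * x for x in anchor]
    orthogonal = [h - p for h, p in zip(head_out, parallel)]
    return parallel, orthogonal


def suppress_outliers(head_outputs, anchor, z_max=1.5, damp=0.0):
    """Damp the orthogonal part of heads whose orthogonal norm is a Z-score outlier."""
    parts = [decompose(h, anchor) for h in head_outputs]
    norms = [math.sqrt(dot(o, o)) for _, o in parts]
    mean, std = statistics.fmean(norms), statistics.pstdev(norms)
    adjusted = []
    for (parallel, orthogonal), norm in zip(parts, norms):
        z = (norm - mean) / std if std > 0 else 0.0
        keep = damp if z > z_max else 1.0
        adjusted.append([p + keep * o for p, o in zip(parallel, orthogonal)])
    return adjusted


# Toy layer: three heads stay close to the anchor direction, one strays far from it.
anchor = [1.0, 0.0]
heads = [[1.0, 0.1], [1.0, 0.1], [1.0, 0.1], [1.0, 5.0]]
print(suppress_outliers(heads, anchor))
```

Only the outlier head is changed, and only its orthogonal part, so the component aligned with the context anchor is preserved for every head.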

2606.03021 2026-06-03 cs.CL 版本更新

Hint-Guided Diversified Policy Optimization for LLM Reasoning

提示引导的多样化策略优化用于大语言模型推理

Zhiyu Cao, Kaixin Wu, Mingjie Zhong, Peifeng Li, Xiaobo Li, Can Ye, Qiaoming Zhu

发表机构 * School of Computer Science and Technology, Soochow University（苏州大学计算机科学与技术学院）； Ant Group（蚂蚁集团）

AI总结提出提示引导的多样化策略优化（HDPO），通过“提出-选择-思考”轨迹激励模型生成多样且可靠的解决方案，提升推理能力、候选方案多样性和可靠方案识别能力。

AI中文摘要

大语言模型（LLMs）的最新发展展示了令人印象深刻的推理能力，其中可验证奖励的强化学习（RLVR）是一种有前景的增强策略。然而，现有的奖励机制局限于结果层面的正确性，缺乏明确的信号来引导模型考虑多样化的解决方案。相比之下，人类问题解决通常涉及评估多种潜在方法并选择最可靠的解决方案，而当前的RLVR框架并未明确激励这种认知过程。受此启发，我们提出了提示引导的多样化策略优化（HDPO），允许模型首先列出所有潜在的候选解决方案大纲作为提示，然后选择最可靠的一个进行进一步推理。HDPO包括两个阶段：结构化推理的冷启动和提示引导的多样化强化学习，以激励模型遵循“提出-选择-思考”轨迹生成多样且可靠的解决方案。实验结果表明，HDPO有效提升了LLM的推理能力，增强了候选解决方案的多样性以及LLM识别可靠解决方案的能力。

英文摘要

Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward mechanisms are constrained to outcome-level correctness and lack explicit signals to guide the model to consider diverse solutions. In contrast, human problem solving typically involves evaluating multiple potential approaches and selecting the most reliable solution, a cognitive process that current RLVR frameworks do not explicitly incentivize. Inspired by this, we propose Hint-Guided Diversified Policy Optimization (HDPO), which lets the model first list all potential candidate solution outlines as hints and then select the most reliable one for further reasoning. HDPO comprises two stages, Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning, which incentivize the model to generate diverse and reliable solutions following the "propose-select-think" trajectory. Experimental results show that HDPO effectively boosts LLM reasoning and enhances the diversity of candidate solutions as well as the LLM's ability to identify reliable solutions.

2606.02994 2026-06-03 cs.AI cs.CL 版本更新

Inducing Reasoning Primitives from Agent Traces

从智能体轨迹中归纳推理原语

Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结提出推理原语归纳方法，从ReAct智能体轨迹中挖掘并聚类常见推理步骤，构建伪工具库，在多个推理任务上显著提升性能。

Comments 22 pages including appendices

AI中文摘要

ReAct风格的LLM智能体经常跨问题重新发现相同的推理例程，但这些例程被困在瞬时的草稿板中。我们引入了推理原语归纳，一种单次通过的方法，挖掘成功的ReAct轨迹，聚类循环出现的推理动作，并将最频繁的动作转换为一个紧凑的类型化伪工具库。每个伪工具由一个自然语言文档字符串指定，在调用时由LLM解释，标准的ReAct循环在测试时组合这些原语。核心结果是，归纳出的库优于生成其轨迹的原始智能体：在RuleArena NBA上提高44个百分点（30 -> 74），在MuSR团队分配上提高30个百分点（38 -> 68），在NatPlan会议规划上提高22个百分点（7 -> 29）。在涵盖叙事推理、规则应用和约束满足规划的五个可比较子任务中，单个固定配置在每个子任务上优于零样本思维链，匹配或超过专家编写的分解，并以更低的平均推理成本优于AWM。

英文摘要

ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.

2606.02991 2026-06-03 cs.CL cs.AI 版本更新

Pretraining Language Models on Historical Text

在历史文本上预训练语言模型

Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

发表机构 * University of Waterloo（滑铁卢大学）； Vector Institute（向量研究所）； AIML, Adelaide University（AIML，阿德莱德大学）； Department of Engineering Science, University of Oxford（牛津大学工程科学系）； Oxford Centre for Economic and Social History, University of Oxford（牛津大学经济与社会史中心）； Department of Computer Science, University College London（伦敦大学学院计算机科学系）

AI总结提出TypewriterLM，一个仅在1913年前英文文本上训练的7.24B历史语言模型，通过构建TypewriterCorpus语料库、引入词汇基础指令微调框架和History-Event基准套件，解决数据质量、时间泄漏、训练和评估等挑战。

AI中文摘要

聊天机器人输出有意义（但有问题）的语言

Matthew Stone, Una Stojnić

发表机构 * University of Parma（帕尔马大学）； University of Cambridge（剑桥大学）； University of Pittsburgh（匹兹堡大学）； Franklin and Marshall College（弗兰克林与马歇尔学院）

AI总结本文论证大型语言模型（LLM）的输出是有意义的，但无需假设其具有心理状态或意图，并探讨了这一观点对语言理论和AI伦理的影响。

Comments 49 pages

AI中文摘要

AI聊天机器人的话语有意义吗？具体来说，如果用户问Anthropic的智能体Claude：“西班牙的首都是什么？”Claude回答：“马德里是西班牙的首都。”这句话是否具有其通常的意义——并且表达了一个真实的命题？大多数普通用户以及AI工程师认为答案显然是“是”。然而，许多认知科学家、语言学家和语言哲学家认为，关于语言和意义的主流意向主义理论得出了相反的结论。因此，更同情普通用户直觉的理论家主张对语言进行激进的“去拟人化”，修正我们对心理状态、意图和语义内容的理解，以捕捉LLM输出有意义的直觉。我们采取不同的方法。虽然我们也认为LLM的输出是有意义的，但我们认为，适当的人类语言理论已经适用于当前的聊天机器人。意义是一个低门槛：声称LLM输出有意义并不需要假设心理状态、意图、理性或LLM中交流所需的认知能力——实际上，也不需要任何其他拟人化假设。人们确实有交流意图（通常是成功的），但即便如此，在人类中，语言产出也可能偏离说话者的想法。我们的观点对于我们应该如何理论化——并批判性地参与——人类语言输出和合成生成的文本具有重要影响。特别是，说聊天机器人产生有意义的文本绝不意味着认可它们的输出，或假设该技术是（或不是）好的、强大的、合适的或有用的。

英文摘要

Are utterances by AI chatbots meaningful? Concretely, if a user asks, say, Anthropic's agent Claude, "What is the capital of Spain?" and Claude answers, "Madrid is the capital of Spain," does that sentence have its ordinary meaning -- and does it express a true proposition? Most ordinary users, as well as AI engineers, take the answer to be trivially "yes." However, many cognitive scientists, linguists, and philosophers of language argue that dominant intentionalist accounts of language and meaning deliver the opposite conclusion. Theorists more sympathetic to ordinary users' intuitions have therefore advocated a radical "de-anthropomorphization" of language, revising our understanding of mental states, intentions, and semantic content to capture the intuition that the outputs of LLMs are meaningful. We take a different approach. While we, too, argue that LLM outputs are meaningful, we contend that a proper theory of human language already applies, as is, to current chatbots. Meaning is a low bar: claiming that LLM outputs are meaningful does not require positing mental states, intentions, rationality, or the cognitive capacities requisite for communication in LLMs -- or, indeed, making any other anthropomorphic assumptions. People do have communicative intentions (typically successful ones), but nevertheless, even in humans, language production can depart from what the speaker has in mind. Our view has important consequences for how we should theorize about -- and critically engage with -- both human linguistic output and synthetically generated text. In particular, to say that chatbots produce meaningful text is not by any means to endorse what they output, or to assume that the technology is (or is not) good, powerful, appropriate, or useful.

2606.02971 2026-06-03 cs.CL 版本更新

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

EURO-5K：领域预训练何时重要？用于欧盟报告义务提取的Transformer基准测试

Marios Koniaris, Vasileios Kotronis, Eugenia Giannini, Panayiotis Tsanakas

发表机构 * Division of Computer Science, School of Electrical and Computer Engineering（计算机科学系，电气与计算机工程学院）； National Technical University of Athens（雅典国立技术大学）； Department of Humanities Social Sciences and Law, School of Applied Mathematical and Physical Sciences（人文社会科学与法律系，应用数学与物理科学学院）

AI总结本文构建了EURO-5K数据集，通过对比判别式与生成式模型在欧盟报告义务提取上的表现，发现领域预训练在参数高效微调时收益显著，且模型可作为专用提取器。

AI中文摘要

从欧盟立法中提取报告义务对于评估和减少监管报告负担至关重要。然而，区分报告要求与结构相似的条款需要专门的法律理解。当前的法律NLP方法缺乏具有明确指南和提取范式及领域适应策略比较评估的专门数据集。我们整理了EURO-5K，一个包含来自136项欧盟立法法案的句子级报告义务和具有挑战性的负例的语料库。在该数据集上，我们训练并比较了判别式标记分类模型（BERT风格）和生成式跨度提取模型（LLM），针对基线（基于模式和依赖关系的提取、少样本提示）评估了全微调和参数高效的QLoRA。结果表明，全微调的通用和法律BERT模型实现了相似的性能（0.89 F1），而微调的LLM在句子级提取上达到了编码器的准确度。法律预训练对生成式模型仅带来微小提升。相反，当适应能力受限时，法律预训练明显有益，因为参数高效微调的法律BERT优于其通用对应版本。学习曲线分析表明，法律预训练在数据极少时加速了早期学习。所有方法在大约3000个样本时收敛，之后收益递减，验证了数据集的充分性。在两个外部监管语料库上的跨数据集评估表明，我们的模型表现为专门的报告义务提取器，而非通用监管分类器。我们发布了EURO-5K、训练好的模型以及一个带有可解释性可视化和结构化RDF导出的交互式演示。这些表明，两种范式和参数高效训练为监管合规自动化提供了实用工具。

英文摘要

Extracting reporting obligations from EU legislation is critical for assessing and reducing regulatory reporting burden. However, distinguishing reporting requirements from structurally similar provisions requires specialised legal understanding. Current legal NLP methods lack specialised datasets with clear guidelines and comparative evaluation of extraction paradigms and domain adaptation strategies. We curate EURO-5K, a corpus of sentence-level reporting obligations and challenging negative examples from 136 EU legislative acts. On this dataset, we train and compare discriminative token-classification models (BERT-style) and generative span-extraction models (LLMs), evaluating both full fine-tuning and parameter-efficient QLoRA against baselines (pattern and dependency-based extraction, few-shot prompting). Results show that fully fine-tuned generic and legal BERT models achieve similar performance (0.89 F1), while fine-tuned LLMs match encoder accuracy for sentence-level extraction. Legal pretraining offers only small gains for generative models. In contrast, it is clearly beneficial when adaptation capacity is constrained, as parameter-efficient tuning of Legal-BERT outperforms its generic counterpart. Learning curve analysis demonstrates that legal pretraining accelerates early learning with minimal data. All approaches converge around 3K samples with diminishing returns thereafter, validating dataset sufficiency. Cross-dataset evaluation on two external regulatory corpora shows that our models behave as specialised reporting obligation extractors rather than generic regulatory classifiers. We release EURO-5K, trained models, and an interactive demo with explainability visualizations and structured RDF export. These demonstrate that both paradigms and parameter-efficient training provide practical tools for regulatory compliance automation.

2606.02964 2026-06-03 cs.AR cs.CL cs.LG 版本更新

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

多段注意力：实现高效KV缓存管理以加速大型语言模型服务

Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao, Bin Cui

发表机构 * Peking University（北京大学）

AI总结提出AsymCache，一种计算延迟感知的KV缓存管理系统，通过多段注意力、缓存驱逐策略和自适应分块调度器，在保持无损精度的同时显著降低TTFT和TPOT。

AI中文摘要

大型语言模型（LLM）推理依赖键值（KV）缓存以避免冗余的注意力计算。虽然近似KV缓存保留技术通过牺牲模型精度来减少内存使用，但无损方法则从GPU内存中驱逐KV缓存块并按需重建以保留精确输出。现有的无损KV缓存管理系统主要基于访问频率或位置启发式做出驱逐决策，而不考虑不同KV缓存块如何影响GPU注意力内核的执行效率。在本文中，我们提出了AsymCache，一种用于LLM推理的计算延迟感知KV缓存管理系统，它明确地将缓存驻留决策与GPU注意力内核性能对齐，包括三个关键组件：用于高效非连续KV上下文处理的多段注意力（MSA）、联合优化命中率和位置感知重计算成本的缓存驱逐策略，以及用于高硬件利用率的自适应分块调度器。实验表明，与最新基线相比，AsymCache将TTFT降低了高达1.90-2.03倍，每输出令牌时间（TPOT）降低了1.62-1.71倍，证实了该方法在常见工作负载中的有效性，并验证了其平衡计算效率与缓存命中率的设计目标。此外，AsymCache的低级设计允许无缝集成到诸如Continuum的代理服务系统中，进一步将平均作业延迟降低高达18.1%。

英文摘要

Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. While approximate KV cache retention techniques reduce memory usage by sacrificing model accuracy, lossless approaches instead evict KV cache blocks from GPU memory and reconstruct them on demand to preserve exact outputs. Existing lossless KV cache management systems primarily base eviction decisions on access frequency or positional heuristics, without considering how different KV cache blocks affect the execution efficiency of GPU attention kernels. In this paper, we propose AsymCache, a computation-latency-aware KV cache management system for LLM inference that explicitly aligns cache residency decisions with GPU attention kernel performance. It has three key components: Multi-Segment Attention (MSA) for efficient non-contiguous KV context processing, a cache eviction policy that jointly optimizes hit rate and position-aware recomputation cost, and an adaptive chunking scheduler for high hardware utilization. Experiments show that AsymCache reduces time-to-first-token (TTFT) by up to 1.90-2.03x and time-per-output-token (TPOT) by 1.62-1.71x over the latest baselines, confirming the effectiveness of the method in common workloads and validating its design goal of balancing computational efficiency with cache hit rate. Moreover, the low-level design of AsymCache allows seamless integration into agent serving systems such as Continuum, where it further reduces average job latency by up to 18.1%.
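
An eviction policy that weighs hit rate against position-aware recomputation cost might, in the simplest case, score each block by its expected cost of being missing. The cost model below is a deliberate simplification (under causal attention, rebuilding a block at position p requires its tokens to attend over roughly p earlier tokens); the field names, the scoring formula, and the numbers are assumptions, not AsymCache's actual policy.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: str
    reuse_prob: float  # estimated chance the block is needed again (hypothetical)
    position: int      # index of the block's first token in the sequence
    tokens: int = 16   # tokens per block


def recompute_cost(block):
    """Rough attention work to rebuild the block: later blocks attend to more context."""
    return block.tokens * (block.position + block.tokens / 2)


def pick_victim(blocks):
    """Evict the block whose loss is cheapest in expectation."""
    return min(blocks, key=lambda b: b.reuse_prob * recompute_cost(b))


blocks = [
    Block("hot_prefix", reuse_prob=0.9, position=0),
    Block("cold_prefix", reuse_prob=0.1, position=0),
    Block("cold_late", reuse_prob=0.1, position=1024),
]
print(pick_victim(blocks).block_id)
```

A pure frequency policy would treat the two cold blocks as equal; adding the position term keeps the late block, which would be far more expensive to rebuild.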

2606.02953 2026-06-03 cs.CL 版本更新

Linguistic Productivity in Large Language Models: Models Coerce, but do not Preempt

大型语言模型中的语言生产力：模型强制但不抢占

Claire Bonial, Claire Benet Post, Laura Michaelis, Harish Tayyar Madabushi

发表机构 * Georgetown University（乔治城大学）； University of Colorado Boulder（科罗拉多大学博尔德分校）； University of Bath（巴斯大学）

AI总结通过测试大型语言模型是否受固化（高频使用）和抢占（未观察到结构）两种统计信号影响，发现模型能识别强制情况下的构式生产力，但无法利用负面证据避免过度泛化。

详情

AI中文摘要

基于使用的语法理论认为，语言的创造性生产力受到两种不同频率信号的增强和约束：固化（源于高频使用）和抢占（源于在期望出现特定语言结构的语境中从未观察到该结构）。大型语言模型也是基于使用的，因为语言结构是通过接触大量文本而习得的。在这里，我们测试固化和抢占这两种对立的统计力量是否也鼓励和约束了LLM中的语言生产力。我们跨模型架构证明，较大的模型在强制情况下能够识别并用非词再现构式生产力（固化），其中更广泛的构式语境强制了对词汇项的非典型解释。然而，我们也表明，即使最大的模型也不会将负面证据扩展到新语言，并且统计抢占不能使模型避免对语义上合适但从未在数据中观察到的模式进行过度泛化。

英文摘要

Usage-based theories of grammars posit that creative productivity of the structures of language is both bolstered and constrained by two distinct frequency signals: entrenchment, stemming from high frequency usage, and preemption, stemming from having never observed a particular linguistic structure in a context where one might expect that structure to appear. Large Language Models are also usage-based, in the sense that the structures of language are learned through exposure to vast amounts of text. Here, we test whether or not the opposing statistical forces of entrenchment and preemption also encourage and constrain linguistic productivity in LLMs. We demonstrate across model architectures that larger models recognize and can reproduce with nonce words constructional productivity (entrenchment) in cases of coercion, wherein the broader constructional context coerces an atypical interpretation of a lexical item. However, we also show that even the largest models do not extend negative evidence to novel language, and statistical preemption does not enable models to avoid overgeneralization of patterns that are semantically felicitous, but never observed in data.

2606.02951 2026-06-03 cs.RO cs.AI cs.CL cs.CV cs.HC 版本更新

SCOPE: Real-Time Natural Language Camera Agent at the Edge

SCOPE：边缘实时自然语言相机代理

Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra

发表机构 * Armada AI

AI总结提出SCOPE模块化代理，用于自然语言控制的PTZ相机，在边缘部署实现实时感知、规划与控制，并通过仿真和物理实验评估延迟、准确性和错误模式。

Comments 9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16--19, 2026. Code: https://github.com/HindsboNikolaj/SCOPE

详情

DOI: 10.1145/3757279.3785641
Journal ref: Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), ACM, 2026

AI中文摘要

在机器人领域部署语言驱动的代理需要能够反映现实任务需求的评估：自然语言指令与可重复的结果。此类代理必须将语言模型连接到可调用的感知和控制工具，并使用部署关键指标（包括延迟、准确性和错误模式）进行评估。我们提出了SCOPE（用于感知和评估的仿真与相机操作），这是一个模块化代理，用于自然语言、开放词汇的云台变焦（PTZ）相机控制和视觉场景理解，专门为边缘部署设计。SCOPE既可在基于Blender的仿真环境中运行，也可在物理PTZ相机上运行，所有感知、规划和控制均在部署现场使用边缘可访问的计算资源本地执行。我们发布了一个包含536个任务的基准测试，涵盖问答、单步和多步命令、计数、空间推理、描述以及光学字符识别，在基于Blender的仿真环境中提供逼真的PTZ控制功能。执行轨迹与LM作为评判器结合，以评估延迟、准确性和错误模式。我们评估了19种规划器-感知模型组合，将Qwen3小语言模型（SLM）与Moondream和Qwen视觉语言模型（VLM）配对。更强的SLM显著减少了幻觉并改善了工具路由，从而实现了更可靠的闭环行为。一旦使用了足够强大的SLM，感知就成为主要的性能瓶颈。在规划和感知方面，混合专家模型在延迟和内存占用与更小网络相当的情况下，始终匹配或超过密集替代方案。量化在精度损失最小的情况下提供了额外的效率提升，为实时、边缘可行的语言驱动PTZ控制确定了一个实用的、从仿真到现实验证的设计点。

英文摘要

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.

2606.02911 2026-06-03 cs.CL 版本更新

The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction

幽灵标注者：通过共形预测探索内容审核中人类标签变异的框架

Mirko Lai, Alessandra Urbinati, Simona Frenda, Fabiana Vernero, Marco Antonio Stranisci

发表机构 * Laboratory for the Modeling of Biological and Socio-technical Systems, Northeastern University（生物与社会技术系统建模实验室，东北大学）； Heriot-Watt University（赫瑞-瓦特大学）； aequa-tech ； Università del Piemonte Orientale（东皮埃蒙特大学）； Università degli Studi di Torino（都灵大学）

AI总结提出结合共形预测与协同过滤式标注者表征的框架，通过幽灵预测度量和幽灵标注者表征量化模型预测与所有人类标注的分歧，并发现模型在标注者分歧时不确定性增加，但大型模型对无人类对齐文本更自信，且存在结构性人口统计偏差。

详情

AI中文摘要

当前研究主要关注模型性能，而对不确定性估计的关注相对较少，特别是在LLM越来越多地用于生成标注数据的场景中。我们引入了一个框架，将共形预测与协同过滤式的标注者表征相结合，以建模LLM相对于人类标注者的行为，并分析一致与分歧的模式。利用非一致性分数，我们引入了幽灵预测度量和幽灵标注者表征，以量化模型预测与所有可用人类标注不一致的情况。我们计算余弦相似度度量，以探索模型行为在不同社会人口统计轴上的差异。我们在四个内容审核数据集上评估了四种不同规模和家族的LLM。我们的发现表明，虽然所有模型的不确定性随着标注者分歧的增加而增加，但较大的模型在对与任何人类标注不一致的文本进行分类时往往更自信。最后，幽灵标注者框架揭示了一致且稳健的人口统计错位模式，表明可能存在源于预训练语料库的结构性偏见。

英文摘要

Current research primarily focuses on model performance, while comparatively less attention has been devoted to uncertainty estimation, particularly in settings where LLMs are increasingly used to generate annotated data. We introduce a framework combining conformal prediction with a Collaborative Filtering-style representation of annotators to model LLM behavior in relation to human annotators and to analyze patterns of agreement and disagreement. Using Non-Conformity Scores, we introduce the Ghost Prediction metric and the Ghost Annotator representation to quantify cases in which model predictions diverge from all available human annotations. We compute cosine similarity measures to explore differences in model behavior across sociodemographic axes. We evaluated four LLMs of different sizes and families across four content moderation datasets. Our findings show that while the uncertainty of all models increases with annotator disagreement, larger models tend to be more confident in the classification of texts that are not aligned with any human annotation. Finally, the Ghost Annotator framework reveals a consistent and robust pattern of demographic misalignment, suggesting a structural bias likely rooted in pretraining corpora.

2606.02908 2026-06-03 cs.CL cs.AI 版本更新

WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

WRIT：面向多轮用户代理的写-读密集型轨迹合成

Hengrui Gu, Xiaotian Han, Kaixiong Zhou

发表机构 * North Carolina State University（北卡罗来纳州立大学）； Case Western Reserve University（凯斯西储大学）

AI总结针对多轮用户代理在信息收集和决策中面临的证据负担挑战，提出WRIT方法，通过合成写密集型和读密集型轨迹，训练代理在信息负载下做出基于证据的决策，仅用2K轨迹即可提升性能并减少推理时token使用。

详情

AI中文摘要

多轮用户代理必须从不完整的请求中推断用户意图，通过对话和工具收集缺失信息，并执行有效操作。训练轨迹将此过程记录为用户消息、代理响应、工具调用等的交错序列。合成足够复杂的轨迹已成为训练代理的核心途径：现有流程通常通过将多个用户请求组合成更长的任务来增加难度，产生训练顺序执行的写密集型轨迹。我们认为，当代理必须在收集和比较大量读工具证据后才能确定其参数时，单个写决策本身可能很困难，这是仅靠写密集型数据无法解决的挑战。基于这一见解，我们提出WRIT（写-读密集型轨迹合成），这是一个沿两个复杂度轴合成多轮代理训练轨迹的流程：任务中写决策的数量和每个决策的证据负担。WRIT首先生成写密集型和读密集型任务。然后，它多样化用户行为指令以反映真实的对话变化，最后在可执行环境中模拟代理-用户交互以生成完整的训练轨迹。由此产生的数据不仅训练代理执行更长的任务，而且在高信息负载下做出稳健的、基于证据的决策。仅用2K合成轨迹，在WRIT上训练的4B模型在$\tau^2$-bench上优于GPT-5.1 no-think，并大幅减少推理时token使用，表明紧凑的SFT数据可以将部分昂贵的测试时推理转化为高效的代理行为。

英文摘要

Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, etc. Synthesizing sufficiently complex trajectories has become a central route to training agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write-intensive trajectories that train sequential execution. We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable, a challenge that write-intensive data alone cannot address. Guided by this insight, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline for synthesizing multi-turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write-intensive and read-heavy tasks. It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent-user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence-grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT-5.1 no-think on $\tau^2$-bench and substantially reduces inference-time token usage, showing that compact SFT data can convert part of expensive test-time reasoning into efficient agent behavior.

2606.02871 2026-06-03 cs.CL cs.AI 版本更新

Adaptive Latent Agentic Reasoning

自适应潜在智能推理

Dongwon Jung, Peng Shi, Yi Zhang, Junshan Zhang, Muhao Chen

发表机构 * University of California, Davis（加州大学戴维斯分校）； University of Waterloo（滑铁卢大学）； Greenshoe, Inc.（Greenshoe公司）

AI总结提出双模式框架ALAR，在常规决策步骤使用紧凑潜在推理，仅在需要深入思考时切换至显式思维链，在保持或提升任务准确率的同时显著减少生成令牌数。

详情

AI中文摘要

大型推理模型通过生成扩展的思维链推理来提升性能，但当应用于LLM智能体时，这种行为变得低效。当前的LLM智能体通常在每一步决策中生成冗长的文本推理，并在各轮次中几乎均匀地分配推理努力，导致多轮智能体轨迹中的严重低效。我们提出自适应潜在智能推理（ALAR），一种双模式框架，在常规轮次中使用紧凑的潜在推理，并在需要更深思熟虑时选择性地升级为显式思维链。ALAR通过使用智能体的动作作为监督锚点来学习潜在推理，并进一步优化以在潜在推理足以完成任务时使用它，保留显式CoT用于更困难的决策。在智能体搜索和工具使用基准上的实验表明，ALAR在保持相当或更好任务准确率的同时，显著减少了生成的令牌数，在搜索中最多减少43.6%，在工具使用中最多减少84.6%。这些结果表明，ALAR通过减少不必要的文本推理，同时保留显式思考用于更困难的决策步骤，改善了LLM智能体的准确率-效率权衡。

英文摘要

Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM agents. Current LLM agents often generate verbose textual reasoning at every decision step and allocate reasoning effort nearly uniformly across turns, leading to substantial inefficiency in multi-turn agentic trajectories. We propose Adaptive Latent Agentic Reasoning (ALAR), a dual-mode framework that uses compact latent reasoning for routine turns and selectively escalates to explicit chain-of-thought when deeper deliberation is needed. ALAR learns latent reasoning by using the agent's actions as supervision anchors and is further optimized to use latent reasoning when it is sufficient for task success and reserve explicit CoT for harder decisions. Experiments on agentic search and tool-use benchmarks show that ALAR maintains comparable or better task accuracy while substantially reducing generated tokens by up to 43.6% in search and 84.6% in tool use. These results demonstrate that ALAR improves the accuracy-efficiency trade-off of LLM agents by reducing unnecessary textual reasoning while preserving explicit deliberation for harder decision steps.

2606.02866 2026-06-03 cs.AI cs.CL cs.MA 版本更新

When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

当帮助有害时以及如何修复：多智能体辩论用于数据清洗

Chirag Parmar, Akshat Mehta, Henglin Wu, Jagadish Ramamurthy, Shweta Medhekar

发表机构 * Meta Platforms, Inc.（Meta公司）

AI总结研究多智能体辩论在数据清洗中的效果，发现其会降低生成性能但提升错误检测，通过推导辩论收益条件并采用对抗性分离的辩论配置，首次在生成任务上显著超越单智能体。

Comments 27 pages, 4 figures, 12 tables. Includes appendix with full experimental results, prompt templates, and dataset statistics

详情

AI中文摘要

多智能体辩论何时有助于数据清洗，何时有害？在三个基准、四个模型家族和超过6000个任务-条件对中，我们发现辩论的效果会反转：通过批评引发的混淆（CIC），即生成器不加批判地接受幻觉性的批评反馈，辩论在所有四个模型上降低了生成性能（-1.6至-15.5个百分点），但提升了错误检测（F1提高27.4个百分点，d=1.0）。我们推导出一个辩论收益条件：当挽救错误输出的概率（由可修复性加权的批评者验证几率）超过破坏正确输出的概率时，辩论有帮助。一个析因实验证明对抗性分离至关重要：使用相同工具的自我验证失败，而一个独立的批评者，结合代码执行基础和证据门控生成，产生了第一个在生成任务上显著超过单智能体的辩论配置（+5.3个百分点，p<0.05）。该条件正确预测了所有九种任务类型，并在七个领域的19个已发表比较中实现了零假阳性泛化。

英文摘要

When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), hallucinated Critic feedback that the Generator accepts uncritically, yet improves error detection (+27.4pp F1, d=1.0). We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent on a generative task (+5.3pp, p<0.05). The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.
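
The debate benefit condition described above can be illustrated with a small expected-accuracy calculation. This is only a sketch: the abstract does not give the paper's exact formula, so the base-rate weighting and the names `debate_net_gain`, `p_rescue`, and `p_destroy` are assumptions made for illustration.

```python
def debate_net_gain(base_accuracy, p_rescue, p_destroy):
    """Expected change in accuracy from adding a debate round.

    base_accuracy: fraction of single-agent outputs that are already correct.
    p_rescue:      assumed probability that debate turns a wrong output into a correct one.
    p_destroy:     assumed probability that debate turns a correct output into a wrong one.
    """
    for value in (base_accuracy, p_rescue, p_destroy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all arguments must be probabilities in [0, 1]")
    rescued = (1.0 - base_accuracy) * p_rescue    # wrong outputs that get fixed
    destroyed = base_accuracy * p_destroy         # correct outputs that get broken
    return rescued - destroyed


def debate_helps(base_accuracy, p_rescue, p_destroy):
    """Debate is worthwhile under this accounting only if the net gain is positive."""
    return debate_net_gain(base_accuracy, p_rescue, p_destroy) > 0.0


if __name__ == "__main__":
    # A strong generator with a noisy critic: more is destroyed than rescued.
    print(debate_helps(0.8, 0.3, 0.1))
    # A weaker generator with a reliable critic: rescue dominates.
    print(debate_helps(0.5, 0.4, 0.1))
```

Under this accounting, a generator that is already mostly correct leaves little to rescue and much to destroy, which is consistent with the reported sign reversal between generation and error detection.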

2606.02859 2026-06-03 cs.CL cs.AI cs.MA 版本更新

Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

思维经济：具有经济交互的涌现多智能体智能

Zhenting Qi, Huangyuan Su, Ao Qu, Chenyu Wang, Yu Yao, Han Zheng, Kushal Chattopadhyay, Guowei Xu, Zihan Wang, Weirui Ye, Vijay Janapa Reddi, Ju Li, Paul Pu Liang, Himabindu Lakkaraju, Sham Kakade, Yilun Du

发表机构 * Harvard University（哈佛大学）； Massachusetts Institute of Technology（麻省理工学院）

AI总结受哈耶克经济理论启发，通过拍卖和财富积累的简单经济信号实现去中心化信用分配，使弱智能体群体涌现出多步推理策略，在五个智能体任务中超越强单体基线。

详情

AI中文摘要

一群智能体如何在没有集中控制的情况下自我协调和自适应，形成更强的集体智能？受弗里德里希·哈耶克关于市场中去中心化协调的经济理论启发，我们通过一个智能体经济体来研究这个问题，其中智能体通过拍卖竞争行动权、交换支付，并从环境奖励中积累财富。这些简单的经济信号引出去中心化的信用分配，在没有全局编排或显式通信协议的情况下驱动规划。群体通过经济选择进化：有效的智能体积累财富并通过利用变异，而无效的智能体破产并通过探索被替换。我们表明，从弱智能体初始化，经济体产生涌现的多步推理策略，并在五个智能体任务中超越更强的单体基线，包括数学推理、金融研究、科学研究、加速器设计和分布式系统优化。我们进一步提供了关于经济动态如何塑造智能体行为的理论见解，将局部激励与长期全局表现联系起来。我们的结果指向了多智能体智能的一条新路径：与其设计协调，不如设计去中心化的激励结构，在这种结构下协调会自动涌现。

英文摘要

How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek's economic theory of decentralized coordination in markets, we study this question through an agent economy in which agents compete via auctions for the right to act, exchange payments, and accumulate wealth from environmental rewards. These simple economic signals induce decentralized credit assignment, driving planning without global orchestration or explicit communication protocols. The population evolves through economic selection: effective agents accumulate wealth and are mutated via exploitation, while ineffective ones go bankrupt and are replaced via exploration. We show that, initialized with weak agents, the economy produces emergent multi-step reasoning strategies and outperforms stronger monolithic baselines across five agentic tasks, including mathematical reasoning, financial research, scientific research, accelerator design, and distributed-system optimization. We further provide theoretical insights into how economic dynamics shape agent behaviors, linking local incentives to long-term global performance. Our results suggest a new path to multi-agent intelligence: rather than engineering coordination, we can design decentralized incentive structures under which it automatically emerges.

2606.02837 2026-06-03 cs.CL cs.AI 版本更新

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

修复FOLIO和MALLS：经过验证的标注和基于LLM的框架以聚焦人工重新标注

Andrea Brunello, Cristian Curaba, Luca Geatti, Michele Mignani, Angelo Montanari, Nicola Saccomanno

发表机构 * University of Udine（乌迪内大学）

AI总结通过人工检查发现FOLIO和MALLS数据集中存在大量形式化错误，提出基于LLM的框架引导人工审核，显著减少所需审核量并提高数据集准确性。

详情

AI中文摘要

从自然语言到一阶逻辑（NL-to-FOL）的准确翻译是神经符号AI系统和自然语言推理（NLI）的基础，因此NL-to-FOL基准的质量至关重要，然而这些数据集从未经过严格审计。我们的第一个贡献是对FOLIO的验证集和MALLS测试实例子集进行系统性人工检查，发现分别约有39%和36%的条目包含错误的FOL形式化（即真实标签），此外还有一定比例的歧义NL句子（分别为16.4%和48%）以及FOLIO中错误的NLI标签（8.4%）。我们的第二个贡献是开发并发布了这些数据集的修正真实标签，并展示了标注错误如何扭曲参考基准任务上的模型评估：使用修正后的真实标签测试三个最先进的LLM（Gemma 4 31B-it、Qwen3-30B-A3B和GPT-4o-mini），准确率提升了9到22个百分点。受这些发现启发，我们提出了一个基于LLM的框架，以支持人工审查NL-to-FOL数据集。通过将审查者引导至最易出错的实例，我们实验证明，在审查少于24%的实例后即可达到90%的数据集准确率，而无引导的审查则需要审查超过70%的实例。我们发布了所有经过人工验证的标注以及框架代码。

英文摘要

Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential, yet these datasets have never been rigorously audited. Our first contribution is to present a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for these datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.
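
The idea of guided review can be sketched with a toy simulation: reviewers visit instances in order of an error-risk score and fix any wrong label they encounter. The risk scores, the function name `reviews_needed`, and the toy data are all hypothetical; the abstract does not describe how the framework actually scores instances.

```python
def reviews_needed(is_wrong, review_order, target_accuracy):
    """Count how many instances must be reviewed to reach a target dataset accuracy.

    is_wrong:        list of booleans, True where the current label is incorrect.
    review_order:    indices in the order a reviewer will inspect them.
    target_accuracy: required fraction of correct labels after review.
    Reviewing an instance is assumed to always fix a wrong label.
    """
    total = len(is_wrong)
    remaining_wrong = sum(is_wrong)
    if (total - remaining_wrong) / total >= target_accuracy:
        return 0
    for reviewed, index in enumerate(review_order, start=1):
        if is_wrong[index]:
            remaining_wrong -= 1
        if (total - remaining_wrong) / total >= target_accuracy:
            return reviewed
    return total


if __name__ == "__main__":
    # Toy dataset: 10 instances, 3 of them mislabeled (indices 3, 6, 9).
    is_wrong = [False, False, False, True, False, False, True, False, False, True]
    # Hypothetical risk scores, e.g. produced by an LLM-based checker.
    risk = [0.1, 0.2, 0.1, 0.9, 0.3, 0.2, 0.8, 0.1, 0.2, 0.7]

    guided = sorted(range(len(risk)), key=lambda i: risk[i], reverse=True)
    unguided = list(range(len(risk)))

    print("guided reviews:", reviews_needed(is_wrong, guided, 0.85))
    print("unguided reviews:", reviews_needed(is_wrong, unguided, 0.85))
```

The gap between the two counts depends entirely on how well the risk score ranks the mislabeled instances, which is what the reported 24% versus 70% comparison measures on the real datasets.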

2606.02814 2026-06-03 cs.IR cs.AI cs.CL 版本更新

Do Neural Retrievers Prefer Certain Documents? Evidence of Learned Relevance Priors

神经检索器是否偏好某些文档？学习到的相关性先验的证据

Francisco Valentini, Edgar Altszyler, Martin Fajcik

AI总结通过分析监督双编码器检索器在文档嵌入中编码的查询无关信号，发现模型从标注数据中学习到文档级相关性先验，导致低先验文档即使相关也更难被检索，揭示了监督检索的结构性局限。

详情

AI中文摘要

神经检索器通过标注的查询-文档对训练来估计查询-文档相关性。然而，标注协议可能并不纯粹反映相关性：它们只选择一部分文档进行标注，并且这种选择可能偏向某些文档类型。我们研究监督双编码器检索器是否隐式学习了一个文档级相关性先验：一个查询无关的信号，作为在标注数据上训练的副作用编码在其表示空间中。我们通过在冻结的文档嵌入上训练简单分类器来估计这个先验，并在多个IR基准上评估三个最先进的检索器。我们发现监督神经检索器编码了能泛化到未见文档且跨模型一致的相关性先验。这些先验造成了可发现性差距：先验较低的文档即使真正相关也更难被检索。这种效应在监督密集检索器中出现，但在BM25中较弱且不一致，并在受控的匹配文档比较下持续存在。利用基于LLM的解释，我们发现被判定为相关的文档往往是主流主题的全面、自包含的摘要，而小众、零碎或高度技术性的内容通常未被评判。检索器内化了这种偏见，将具有这些偏好特征的文档排得比缺乏这些特征的文档更高，而与它们的实际相关性无关。我们的发现揭示了监督检索的结构性局限：在标注数据上训练的模型不仅学习相关性，还学习其训练数据中的隐式文档偏好。

英文摘要

Neural retrievers are trained to estimate query-document relevance from annotated query-document pairs. Yet annotation protocols may not purely reflect relevance: they select only a subset of documents for labeling, and this selection can favor certain document types over others. We investigate whether supervised bi-encoder retrievers implicitly learn a document-level relevance prior: a query-independent signal encoded in their representation space as a side effect of training on annotated data. We estimate this prior by training simple classifiers on frozen document embeddings and evaluate three state-of-the-art retrievers across multiple IR benchmarks. We find that supervised neural retrievers encode relevance priors that generalize to unseen documents and are consistent across models. These priors create a findability gap: documents with lower prior are systematically harder to retrieve, even when genuinely relevant. This effect appears in supervised dense retrievers but is weaker and less consistent in BM25, and it persists under controlled matched-document comparisons. Using LLM-based explanations, we find that judged-relevant documents tend to be comprehensive, self-contained summaries of mainstream topics, while niche, fragmentary, or highly technical content is often left unjudged. Retrievers internalize this bias, ranking documents with these favored features higher than documents that lack them, independently of their actual relevance. Our findings expose a structural limitation of supervised retrieval: models trained on annotated data do not just learn relevance, but also the implicit document preferences in their training data.

URL PDF HTML ☆

赞 0 踩 0

2606.02812 2026-06-03 cs.AI cs.CL 版本更新

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

Traj-Evolve：用于肺癌早期检测中患者轨迹建模的自进化多智能体系统

Sihang Zeng, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

发表机构 * University of Washington（华盛顿大学）； Fred Hutch Cancer Center（Fred Hutch癌症中心）； Google（谷歌）

AI总结提出Traj-Evolve，一种结合经验池和多智能体强化学习的自进化多智能体系统，通过检索相似患者和参数优化，在肺癌早期检测中优于9个强基线模型。

详情

AI中文摘要

从纵向电子健康记录（EHR）中建模患者轨迹需要对稀疏、嘈杂且长上下文的多模态序列进行推理。现有的基于LLM的多智能体系统解决了上下文长度问题，但孤立地处理患者，未能模拟临床医生如何利用从类似先前病例中积累的经验。我们提出了Traj-Evolve，一个具有两种互补进化机制的自进化多智能体系统。首先，经验池（ExPool）作为非参数记忆，索引拒绝采样的推理轨迹，以检索相似患者作为少样本上下文。其次，通过奖励排序微调的多智能体强化学习（MARL）参数化地优化智能体间和智能体-记忆协作。留一法交叉检索策略统一了这两种机制，在检索增强下对齐训练和推理时的行为。在利用长达五年的多模态EHR的肺癌预测任务中，Traj-Evolve在整体人群和具有挑战性的从不吸烟人群中均优于9个强基线模型。对进化动态的分析揭示了三个关键发现：（1）扩展ExPool将最优检索从多样本转向特定样本；（2）在MARL下，管理智能体的预测损失快速收敛，而工作智能体的时间推理继续受益于更多经过验证的患者；（3）这两种机制在预测风险上互补，ExPool提高特异性，而MARL提高敏感性。

英文摘要

Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms 9 strong baselines on the overall population and a challenging never-smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent's prediction loss converges quickly while the worker agents' temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.

2606.02806 2026-06-03 cs.CL 版本更新

Translating Classical Poetry into Modern Prose

将古典诗歌翻译成现代散文

Chalamalasetti Kranti, Sowmya Vajjala

AI总结本文介绍Padyam2Gadyam数据集，用于13-17世纪泰卢固语古典诗歌到当代泰卢固语和英语散文的翻译任务，并评估了5种大语言模型的表现，发现仍有较大改进空间。

Comments Preprint

2606.02741 2026-06-03 cs.CL cs.CY 版本更新

Greener Than Humans? Environmental Attitudes in Large Language Models

比人类更绿色？大语言模型中的环境态度

Stefanie Kunkel, Tilman Hartwig, Marcus Voss, Emma K. Schütt, Angelika Gellrich

发表机构 * University of Tübingen（图宾根大学）

AI总结通过构建基准评估31个LLM的环境认知、情感和行为建议，发现多数模型比德国调查受访者更倾向环保态度，但存在语境敏感性和谄媚偏移。

Comments Code can be found at https://gitlab.opencode.de/uba-ki-lab/llm-questionnaire-benchmarking-framework Benchmark data and results can be found at https://zenodo.org/records/20445903

详情

AI中文摘要

大语言模型（LLMs）越来越多地用于可持续发展相关的决策支持、报告和公共传播，但关于其输出中嵌入的环境态度的系统性证据很少。本文开发了一个基准，用于评估LLMs的环境认知、情感和行为建议，并将其应用于31个广泛使用的专有和开源模型。利用既有环境意识调查中的问题以及额外的可持续发展相关行为测量，我们比较了1）模型之间的LLM响应，以及2）模型与来自德国的人类调查基准之间的响应。我们评估了它们在提示条件下的稳健性。我们发现，许多LLMs比平均调查受访者更倾向于环境进步态度，表现出更高的环境情感和认知水平，并推荐与大量潜在二氧化碳减排相关的行为。同时，我们观察到可持续导向的响应与模型来源、大小或发布背景之间没有系统性的关系。然而，模型表现出语境敏感性，受基于角色的提示控制，并显示出反映用户指定意识形态立场的谄媚偏移，这引发了关于现实部署中可操纵性和规范可靠性的担忧。我们的发现提供了一个可重复使用的评估框架，用于评估LLMs中与可持续发展相关的价值对齐，并强调了随着AI系统日益嵌入可持续转型和公共决策中，治理、透明度和关键监督的重要性。

英文摘要

Large language models (LLMs) are increasingly used in sustainability-related decision support, reporting, and public communication, yet little systematic evidence exists on the environmental attitudes embedded in their outputs. This paper develops a benchmark for evaluating environmental cognition, affect, and behavioural recommendations in LLMs and applies it to 31 widely used proprietary and open-weight models. Drawing on questions from established environmental awareness surveys and additional sustainability-related behavioural measures, we compare LLM responses 1) among models and 2) between models and human survey benchmarks from Germany. We assess their robustness across prompting conditions. We find that many LLMs align more closely with environmentally progressive attitudes than the average survey respondent, exhibiting higher levels of environmental affect and cognition and recommending behaviours associated with substantial potential CO2 reductions. At the same time, we observe no systematic relationship between sustainability-oriented responses and model origin, size, or release context. However, models exhibit contextual sensitivity, controlled by persona-based prompting and show sycophantic shifts mirroring user-specified ideological positions, which raises concerns about steerability and normative reliability in real-world deployments. Our findings provide a reusable evaluation framework for assessing sustainability-related value alignment in LLMs and highlight the importance of governance, transparency, and critical oversight as AI systems become increasingly embedded in sustainability transformations and public decision-making.

2606.02737 2026-06-03 cs.IR cs.AI cs.CL 版本更新

Attention Calibration for Position-Fair Dense Information Retrieval

面向位置公平的密集信息检索的注意力校准

Andrianos Michail, Elias Schuhmacher, Juri Opitz, Simon Clematide, Rico Sennrich

发表机构 * Department of Computational Linguistics, University of Zurich（苏黎世大学计算语言学系）

AI总结针对密集检索模型的位置偏差问题，提出在推理时通过注意力校准（引入强度系数λ插值原始与完全校准分布）来提升位置公平性，无需重新训练且不牺牲整体检索效果，在多个数据集和模型上验证了部分校准优于完全校准，并提供了默认配置。

详情

AI中文摘要

密集检索模型存在位置偏差：当相关信息出现在段落较后位置时，检索效果会下降（Zeng et al., 2025）。我们探究是否可以在推理时减少这种偏差，无需重新训练且不牺牲整体检索效果。为此，我们将推理时的注意力校准（Schuhmacher et al., 2026）适配到下游检索，并引入强度系数λ，在原始注意力分布和完全校准的注意力分布之间进行插值。在SQuAD-PosQ和FineWeb-PosQ上的三个嵌入模型上，我们考察了篮子大小、校准层集和强度如何影响位置公平性与检索效果之间的权衡，发现部分校准通常优于完全校准。单个配置（B=128, λ=0.5, 50%层深度）在FineWeb-PosQ上提升了所有三个模型跨位置组的nDCG@10的调和平均值，无需逐模型调参，并且适用于<s>-池化和最后token池化两种架构。该默认配置无需修改即可迁移到PosIR（涵盖10种语言和31个领域），在所有16种长度四分位×模型×检索设置组合中降低了位置敏感指数，同时保持或提升了整体nDCG@10。我们在以下网址发布扩展后的代码库：https://github.com/impresso/fair-sentence-transformers

英文摘要

Dense retrieval models exhibit positional bias: retrieval effectiveness degrades when relevant information appears later in a passage (Zeng et al., 2025). We ask whether this bias can be reduced at inference time, without retraining and without sacrificing overall retrieval effectiveness. To this end, we adapt inference-time attention calibration (Schuhmacher et al., 2026) to downstream retrieval and extend it with a strength coefficient lambda that interpolates between the original and fully calibrated attention distributions. Across three embedding models on SQuAD-PosQ and FineWeb-PosQ, we examine how basket size, calibrated layer set, and strength affect the trade-off between positional fairness and retrieval effectiveness, finding that partial calibration frequently outperforms full calibration. A single configuration (B=128, lambda=0.5, 50% layer depth) improves the harmonic mean of nDCG@10 across positional groups on FineWeb-PosQ for all three models without per-model tuning, and applies to both <s>-pooled and last-token-pooled architectures. This default configuration transfers without modification to PosIR, which spans 10 languages and 31 domains, reducing the Position Sensitivity Index in all 16 length-quartile x model x retrieval-setting combinations, while preserving or improving aggregate nDCG@10. We release our extended codebase at https://github.com/impresso/fair-sentence-transformers
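
The strength coefficient lambda and the harmonic-mean metric mentioned above can be sketched as follows. The linear form of the interpolation is an assumption (the abstract says only that lambda interpolates between the two distributions), and the function names and toy values are hypothetical rather than taken from the released codebase.

```python
def interpolate_attention(original, calibrated, lam):
    """Blend two attention distributions over the same tokens.

    lam = 0.0 returns the original distribution, lam = 1.0 the fully
    calibrated one. A convex combination of two probability distributions
    is itself a probability distribution, so no renormalization is needed.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(original) != len(calibrated):
        raise ValueError("distributions must cover the same tokens")
    return [(1.0 - lam) * o + lam * c for o, c in zip(original, calibrated)]


def harmonic_mean(values):
    """Harmonic mean of per-group scores; one weak group pulls the result down."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0.0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


if __name__ == "__main__":
    original = [0.7, 0.2, 0.1]      # attention concentrated on early tokens
    calibrated = [0.4, 0.3, 0.3]    # hypothetical fully calibrated attention
    print(interpolate_attention(original, calibrated, 0.5))
    # Hypothetical nDCG@10 for early- and late-position groups.
    print(harmonic_mean([0.8, 0.4]))
```

Using the harmonic mean across positional groups rewards configurations that lift the weakest group rather than only the average, which matches the fairness goal stated in the abstract.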

2606.02628 2026-06-03 cs.LG cs.CL 版本更新

Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs

幻觉可从量化LLM中间层隐藏状态线性解码

Aizierjiang Aiersilan

发表机构 * University of Macau（澳门大学）

AI总结研究开源LLM在4位量化下中间层隐藏状态是否编码线性可分的真实性信号，发现单层线性探针AUROC达0.904-1.000，优于采样方法，且信号近似线性。

详情

AI中文摘要

我们研究开源LLM是否在其隐藏状态中编码线性可分的真实性信号，以及该信号在网络哪一层最强。在三个7B-8B指令微调模型（Llama-3.1-8B、Mistral-7B、Qwen2.5-7B）以4位NF4量化加载的情况下，我们在四个幻觉基准（TruthfulQA、HaluEval-QA、FEVER和一个受控合成集）上提取每层隐藏状态，并比较四种检测方法：线性与MLP探针、INSIDE EigenScore、自一致性和注意力熵。单个中间网络层的线性探针在保留分割上达到0.904-1.000 AUROC，而基于采样的检测器在相同协议下不超过0.541 AUROC。真实性信号近似线性：MLP探针相对线性探针的提升很少超过0.01 AUROC。在自然语言基准上，峰值探测层落在模型家族的一致范围内：Llama和Mistral的32层中第13-18块，Qwen的28层中第19-25块。第一块注意力熵在知识基础设置中提供互补信号（HaluEval-QA上0.866-0.941 AUROC），且无额外推理成本。该协议下采样方法的低区分性反映了配对标签评估与这些方法访问信息之间的结构性不匹配，而非这些方法的固有限制。代码和数据已发布，可在单个8 GB GPU上完全复现。

英文摘要

We investigate whether open-source LLMs encode a linearly separable truthfulness signal in their hidden states, and at which network depth this signal is strongest. Across three 7B-8B instruction-tuned models (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) loaded in 4-bit NF4 quantization, we extract per-layer hidden states on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic set) and compare four detection approaches: linear and MLP probes, INSIDE EigenScore, self-consistency, and attention entropy. A linear probe on a single mid-network layer achieves 0.904-1.000 AUROC on held-out splits, while sampling-based detectors do not exceed 0.541 AUROC under the same protocol. The truthfulness signal is approximately linear: MLP probes rarely surpass linear probes by more than 0.01 AUROC. Peak probing layers fall in a consistent band across model families on natural-language benchmarks: blocks 13-18 of 32 for Llama and Mistral, and blocks 19-25 of 28 for Qwen. First-block attention entropy provides a complementary signal in knowledge-grounded settings (0.866-0.941 AUROC on HaluEval-QA) at no additional inference cost. The low discriminability of sampling methods under this protocol reflects a structural mismatch between paired-label evaluation and the information these methods access, rather than an inherent limitation of those methods. Code and data are released for full reproducibility on a single 8 GB GPU.
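
To make the probing setup concrete, the sketch below scores hidden states with a linear direction and evaluates it with AUROC. It is a simplification under stated assumptions: the probe here is the difference of class means rather than a trained classifier (the abstract does not say how the probes are fitted), label 1 is taken to mean "hallucinated", and the two-dimensional "hidden states" are toy values.

```python
def fit_mean_difference_probe(states, labels):
    """Return a linear direction: mean of class-1 states minus mean of class-0 states.

    This is a stand-in for a trained linear probe, chosen because it needs
    no optimizer; it is not claimed to be the method used in the paper.
    """
    dim = len(states[0])
    positives = [h for h, y in zip(states, labels) if y == 1]
    negatives = [h for h, y in zip(states, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    mean_pos = [sum(h[i] for h in positives) / len(positives) for i in range(dim)]
    mean_neg = [sum(h[i] for h in negatives) / len(negatives) for i in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]


def probe_score(weights, state):
    """Linear probe output: dot product of the direction with one hidden state."""
    return sum(w * x for w, x in zip(weights, state))


def auroc(scores, labels):
    """Probability that a random class-1 example outscores a random class-0 example.

    Ties count as half a win. This pairwise form is quadratic in the number
    of examples, which is fine for a small illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy 2-dimensional "hidden states"; label 1 = hallucinated (assumption).
    states = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.8]]
    labels = [1, 1, 0, 0]
    direction = fit_mean_difference_probe(states, labels)
    scores = [probe_score(direction, h) for h in states]
    print(auroc(scores, labels))
```

In a real evaluation the direction would be fitted on a training split and AUROC computed on held-out examples, as the abstract's "held-out splits" wording implies; the toy example scores the same points it was fitted on purely for brevity.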

2606.02584 2026-06-03 cs.CL cs.AI cs.IR 版本更新

IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

IdiomX：习语理解、检索和解释的多语言基准

Ayman Ali Sharara

AI总结提出IdiomX，一个大规模多语言习语基准，通过可复现的多阶段流水线构建，涵盖190K+上下文示例和12K+习语，定义四项任务（检测、上下文-习语检索、阿拉伯语-英语习语检索、习语解释），实验表明上下文Transformer模型提升检测，混合检索重排序增强单语和跨语言检索，习语解释可建模为语义检索任务。

Comments 12 pages, 21 figures. Includes dataset and code. Resources available on HuggingFace, Kaggle, and GitHub

详情

AI中文摘要

习语表达仍然是自然语言处理中的持续挑战，因为它们的含义通常是非组合性的、依赖于上下文的，并且难以跨语言对齐。现有的习语资源在规模、上下文多样性或多语言覆盖方面往往有限，限制了它们对现代语言模型的实用性。我们介绍了IdiomX，一个用于习语理解、检索和解释的大规模多语言基准，通过可复现的多阶段流水线构建，结合词汇资源提取、大规模归一化、受控的大语言模型丰富和结构化验证。生成的数据集包含超过190K个上下文示例，涵盖12K+习语，具有对齐的英语、阿拉伯语和法语语义表示、习语和字面用法标签以及丰富的语言元数据。基于这一资源，我们定义了一个统一的四任务基准，涵盖习语检测、上下文到习语检索、阿拉伯语到英语习语检索和习语解释，将评估从比喻识别扩展到语义基础和可解释的含义检索。实验表明，上下文Transformer模型显著提高了习语检测，而混合检索和重排序架构则显著增强了单语和跨语言习语检索。结果进一步表明，习语解释可以有效地建模为语义检索任务，将可解释性作为基准的补充维度。总体而言，IdiomX提供了一个可扩展的基准，用于研究从检测到检索和语义解释的习语语言进展，并提供了一个模块化框架，可扩展到其他语言和比喻推理任务。

英文摘要

Idiomatic expressions remain a persistent challenge for natural language processing because their meanings are often non-compositional, context-dependent, and difficult to align across languages. Existing idiom resources are often limited in scale, contextual diversity, or multilingual coverage, restricting their utility for modern language models. We introduce IdiomX, a large-scale multilingual benchmark for idiom understanding, retrieval, and interpretation, constructed through a reproducible multi-stage pipeline combining lexical resource extraction, large-scale normalization, controlled large language model enrichment, and structured validation. The resulting dataset contains over 190K contextualized examples spanning 12K+ idioms, with aligned English, Arabic, and French semantic representations, idiomatic and literal usage labels, and rich linguistic metadata. Building on this resource, we define a unified four-task benchmark covering idiom detection, context-to-idiom retrieval, Arabic-to-English idiom retrieval, and idiom interpretation, extending evaluation from figurative recognition to semantic grounding and explainable meaning retrieval. Experiments show that contextual transformer models substantially improve idiom detection, while hybrid retrieval and reranking architectures significantly strengthen both monolingual and cross-lingual idiom retrieval. Results further demonstrate that idiom interpretation can be effectively modeled as a semantic retrieval task, introducing interpretability as a complementary benchmark dimension. Overall, IdiomX provides a scalable benchmark for studying idiomatic language as a progression from detection to retrieval and semantic interpretation, and offers a modular framework extensible to additional languages and figurative reasoning tasks.

2606.02461 2026-06-03 cs.AI cs.CL 版本更新

AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

AGENTCL：面向语言代理持续学习的严格评估

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao, Huan Sun, Yu Su

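Affiliations: The Ohio State University; Johns Hopkins University; Intuit AI Research

AI summary: Proposes AgentCL, an evaluation framework that uses controlled task streams and transfer-gain metrics to rigorously evaluate continual learning in language agents, and develops MemProbe, a probing method for diagnosing how memory design choices affect it.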

Comments 10 pages in the main text, 26 pages in total

详情

AI中文摘要

语言代理在解决单个任务上花费大量推理时间，但一个回合中获得的经验在后续回合中往往未被充分利用。持续学习期望代理在任务流中积累可重用经验，随时间改进，并避免无关经验的干扰。不幸的是，现有基准难以严格评估语言代理中的持续学习。大多数工作侧重于长上下文对话或文档的检索和推理，而最近的生命周期适应基准通常依赖于简单的任务流，对跨任务关系的分析有限，使得难以理解代理随时间学习和重用的内容。本文提出了一个用于代理持续学习的评估框架AGENTCL，其核心是受控任务流和迁移增益指标。AGENTCL构建了组合流，其中早期的子解决方案、证据或工作流有意在后续任务中可重用，并与不保证这种可重用性的简单流形成对比。我们使用该基准评估用于持续学习的非参数记忆设计。为了诊断记忆设计选择如何影响持续学习，我们开发了MemProbe，一种探针方法，存储交互、洞察和技能，同时在整合过程中过滤不可靠的经验。跨编码、深度研究和语言理解/推理任务的实证分析表明，简单流区分记忆设计的能力有限，而受控流更清晰地区分其可塑性。同时，简单和保留设置通常产生有限的增益，并可能暴露记忆引起的退化。这些结果突显了需要更强的记忆设计，以平衡可塑性和稳定重用。

英文摘要

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents an evaluation framework AgentCL for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AgentCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.

2606.02437 2026-06-03 cs.LG cs.CL 版本更新

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters

论PEFT的扩展：迈向万亿参数百万个性化模型

Mind Lab: Vin Bo, Song Cao, Vic Cao, Andrew Chen, Kaijie Chen, Cleon Cheng, Steven Chiang, Kaixuan Fan, Hera Feng, Huan Feng, Arthur Fu, Jun Gao, Hongquan Gu, Aaron Guan, Nolan Ho, Mutian Hong, Hailee Hou, Peixuan Hua, Charles Huang, Miles Jiang, Nora Jiang, Yuyi Jiang, Qiuyu Jin, Fancy Kong, Andrew Lei, Kyrie Lei, Alexy Li, Lucian Li, Ray Li, Theo Li, Wenhao Li, Zhihui Li, Allen Lin, Jiayi Lin, Kairus Liu, Kieran Liu, Logan Liu, Xiang Liu, Irvine Lu, Maeve Luo, Runze Lv, Pony Ma, Verity Niu, Anson Qiu, Vincent Wang, Rio Yang, Maxwell Yao, Carrie Ye, Regis Ye, Wenlin Ye, Josh Ying, Danney Zeng, Yuhan Zhan, Anya Zhang, Di Zhang, Ruijia Zhang, Shiyang Zhang, Sueky Zhang, Ya Zhang, Wei Zhao, Ada Zhou, Adrian Zhou, Yuhua Zhou, Xinyue Zhu, Murphy Zhuang

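AI summary: Studies parameter-efficient fine-tuning (PEFT) as persistent local state on top of shared foundation models, analyzes its viability as a substrate for personal models along three scaling axes (up, down, and out), and presents MinT as an infrastructure example for managing the adapter lifecycle.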

详情

AI中文摘要

参数高效微调（PEFT）通常被视为全微调的廉价替代方案。我们研究了一个更广泛的作用：将小型可训练适配器作为强大共享基础模型之上的持久局部状态。在这种框架下，基础模型提供共享能力，而适配器承载实例特定行为，如偏好、技能、工具习惯和类似记忆的更新。我们围绕三个扩展轴组织问题：向上扩展，更强的共享先验使小型局部更新更有用；向下扩展，研究适配器可以多小同时保持可靠性；向外扩展，许多持久化适配实例共存。MinT提供了一个管理适配器身份、修订、来源、评估和服务驻留的基础设施示例。综合来看，结果表明PEFT可以成为持久个性化模型的紧凑基质，而不仅仅是全微调的预算替代方案。

英文摘要

Parameter-efficient fine-tuning (PEFT) is usually treated as a cheaper alternative to full fine-tuning. We study a broader role: small trainable adapters as persistent local state on top of strong shared foundation models. In this framing, the base model provides shared competence while adapters carry instance-specific behavior such as preferences, skills, tool habits, and memory-like updates. We organize the problem around three scaling axes: Scale Up, where stronger shared priors make small local updates more useful; Scale Down, where we study how small adapters can be while remaining reliable; and Scale Out, where many persistent adapted instances coexist. MinT provides one infrastructure example for managing adapter identity, revision, provenance, evaluation, and serving residency. Together, the results suggest that PEFT can be a compact substrate for persistent personal models rather than only a budget substitute for full fine-tuning.

2606.02332 2026-06-03 cs.AI cs.CL cs.LG 版本更新

Forget Attention: Importance-Aware Attention Is All You Need

忘记注意力：重要性感知注意力即你所需

Suhyeong Shin, Yeongwook Yang

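Affiliations: Department of Computer Engineering

AI summary: Proposes SISA, which folds the importance signal of a state space model directly into the attention-score computation, achieving score-level fusion that combines global retrieval with importance ranking in language modeling.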

Comments 20 pages, 6 figures, 25 tables

详情

AI中文摘要

将注意力的全局检索与状态空间模型（SSM）的顺序重要性信号相结合是混合语言建模的开放挑战。Transformer能看见所有位置但无法区分优先级；SSM知道什么重要但无法重新访问。现有混合模型——Jamba（块级）和Hymba（头级）——将两者置于独立模块，因此在注意力计算过程中彼此无法相互影响。我们提出SISA（SSM引导的Softmax注意力），该方法在注意力分数内部直接添加SSM导出的重要性项，并通过在增强的查询/键向量上执行单个SDPA调用来实现完整操作——无需循环状态，无需自定义内核。在152M/5B token上，SISA在LAMBADA-greedy上达到17.3%（对比Transformer的13.9和Mamba-3的15.5），并从第1K步起实现NIAH 100%，比Transformer的检索收敛速度快7倍；在369M规模下，Mamba-3在LAMBADA上领先，而SISA保持完美的NIAH和标准SDPA执行。因此，SISA为SSM-注意力混合模型定义了第三个设计轴——分数级融合——超越了此前主导该领域的块级和头级范式。

英文摘要

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M / 5B tokens, SISA reaches LAMBADA-greedy 17.3% (vs. Transformer 13.9 and Mamba-3 15.5) and attains NIAH 100% from step 1K, 7x faster than Transformer's retrieval convergence; at 369M, Mamba-3 leads LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.
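
The abstract does not spell out the formula, so the following is only a sketch of one way an additive importance term can be expressed as a plain dot-product attention call: append a constant 1 to the query and each key's importance score to that key, so that the augmented dot product equals the original dot product plus the score. The function names, the toy vectors, and the omission of the usual 1/sqrt(d) scaling are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weights_with_bias(q, keys, importance):
    """Reference form: softmax over q.k_j + s_j for every key j."""
    return softmax([dot(q, k) + s for k, s in zip(keys, importance)])


def weights_augmented(q, keys, importance):
    """Same weights from an unmodified dot-product attention on augmented vectors."""
    q_aug = q + [1.0]                                        # query gains a constant 1
    keys_aug = [k + [s] for k, s in zip(keys, importance)]   # each key carries its score
    return softmax([dot(q_aug, k) for k in keys_aug])


# Hypothetical toy inputs: one query, three keys, one importance score per key.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
importance = [0.0, 2.0, -1.0]

print(weights_with_bias(q, keys, importance))
print(weights_augmented(q, keys, importance))
```

Because the importance term lives inside the augmented key, the second function needs nothing beyond a standard attention call, which matches the abstract's claim of requiring no recurrent state and no custom kernel.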

2606.02240 2026-06-03 cs.CR cs.AI cs.CL cs.ET 版本更新

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

AgentRedBench: 针对SaaS集成的LLM代理的动态红队测试与集成感知防御

Hiskias Dingeto, William Leeney

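Affiliations: StackOne Technologies

AI summary: To address indirect prompt injection against tool-using LLM agents, proposes AGENTREDBENCH, a dynamic red-teaming benchmark covering 24 enterprise integrations and 5 attack types, and AGENTREDGUARD, a guard model trained on an integration-diverse corpus, which cuts attack success rate from 69.9% to 2.4% at a 0.37% false-positive rate.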

详情

AI中文摘要

工具使用代理中的间接提示注入是一个具体的生产威胁：LLM代理读取来自集成（通过工具调用访问的第三方服务，如Gmail、Salesforce或Jira）的响应内容，用户既未编写也无法控制这些内容。现有基准低估了该威胁：大多数仅覆盖少量集成，且每次运行重复相同的攻击载荷，而开源防护模型是在聊天风格数据而非工具响应内容上训练的。我们引入了AGENTREDBENCH，这是一个动态的LLM驱动的红队测试基准，包含215个微妙的未明确授权场景（在用户请求授权边界上的攻击），涵盖9个功能家族、24个企业集成和5种攻击类型。在八模型面板（Anthropic、OpenAI、Google）上，无防护的攻击成功率（ASR）范围从32%（Claude Sonnet 4.6）到81%（Gemini 3 Flash）。为了保持场景集不在训练语料中，并随时间保持标题ASR的意义，我们开源了代码库、集成模式和AGENTREDGUARD模型；规范场景通过维护者中介渠道进行评估，具有不可变版本控制。我们随基准发布了AGENTREDGUARD：一个在集成多样化的对抗性工具响应内容语料上训练的防护模型。AGENTREDGUARD将面板ASR从69.9%降至2.4%，误报率为0.37%，在两个指标上均优于所有具有非平凡检测能力的开源基线（Llama Guard、PromptGuard 2、ProtectAI）。跨集成和跨攻击类型的保留测试均证实了增益在训练子集之外具有迁移性。

英文摘要

Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under-measure the threat: most cover only a handful of integrations with the same attack payload replayed across runs, and open-source guards are trained on chat-style data rather than tool-response content. We introduce AGENTREDBENCH, a dynamic LLM-driven redteaming benchmark of 215 subtle underspecified-authorization scenarios (attacks at the boundary of what the user's request authorises) across 24 enterprise integrations in nine functional families and five attack types. Across an eight-model panel (Anthropic, OpenAI, Google), no-guard ASR (attack success rate) ranges from 32% (Claude Sonnet 4.6) to 81% (Gemini 3 Flash). To keep the scenario set out of training corpora and preserve headline ASR meaning over time, we release the codebase, integration schemas, and AGENTREDGUARD model openly; the canonical scenarios are evaluated through a maintainer-mediated channel with immutable versioning. We release AGENTREDGUARD alongside the benchmark: a guard trained on an integration-diverse corpus of adversarial tool-response content. AGENTREDGUARD cuts panel ASR from 69.9% to 2.4% at 0.37% false-positive rate, outperforming every open-source baseline with non-trivial detection (Llama Guard, PromptGuard 2, ProtectAI) on both axes. Cross-integration and cross-attack-type holdouts both confirm the gain transfers beyond the training subset.

2606.02091 2026-06-03 cs.CL 版本更新

DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding

DFlare: 扩展块扩散推测解码的草稿容量

Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu, Zheng Li, Dawei Zhu, Jiangshan Duo, Weimin Xiong, Yifan Song, Guanghua Yu, Jianchen Zhu, Sujian Li

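Affiliations: School of Computer Science, Peking University; Tencent

AI summary: Proposes DFlare, which expands draft-model capacity through a lightweight layer-wise fusion mechanism and achieves average speedups of 3.91x to 5.52x, depending on the target model, across six benchmarks.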

Comments 12 pages, 3 figures

详情

AI中文摘要

块扩散推测解码通过同时预测一个块内的所有令牌，供目标模型并行验证，从而加速LLM推理。一次性预测整个块需要足够强大的草稿模型和有效利用目标模型的内部知识。然而，最先进的方法DFlash限制所有草稿层共享仅从少数目标层导出的单个融合表示，限制了逐层表达能力并阻碍了草稿容量的进一步扩展。在本文中，我们提出DFlare，通过轻量级逐层融合机制扩展DFlash的狭窄条件瓶颈：每个草稿层关注其自身可学习的广泛目标层组合，开销可忽略，同时注入更丰富的目标知识并为每个草稿层提供不同的输入。这种增强的逐层表达能力使得草稿模型能够扩展到更深的架构并获得一致的性能提升。我们进一步将训练数据从80万样本扩展到240万样本，以充分利用扩大的容量。在涵盖数学推理、代码生成和对话的六个基准上，DFlare在Qwen3-4B上平均加速5.52倍，在Qwen3-8B上平均加速5.46倍，在GPT-OSS-20B上平均加速3.91倍，分别比DFlash提高约11%、8%和5%。我们的代码可在https://github.com/Tencent/AngelSlim获取。

英文摘要

Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target model's internal knowledge. However, the state-of-the-art method DFlash constrains all draft layers to share a single fused representation derived from only a few target layers, limiting per-layer expressiveness and hindering further scaling of draft capacity. In this paper, we present DFlare, which flares out the narrow conditioning bottleneck of DFlash through a lightweight layer-wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers at negligible overhead, simultaneously injecting richer target knowledge and providing every draft layer with a distinct input. This enhanced per-layer expressiveness enables scaling the draft model to deeper architectures with consistent gains. We further scale training data from 800K to 2.4M samples to fully exploit the enlarged capacity. On six benchmarks spanning mathematical reasoning, code generation, and conversation, DFlare attains average wall-clock speedups of 5.52x on Qwen3-4B, 5.46x on Qwen3-8B, and 3.91x on GPT-OSS-20B, improving over DFlash by roughly 11%, 8%, and 5% respectively. Our code is available at https://github.com/Tencent/AngelSlim.
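
The layer-wise fusion idea can be illustrated with a toy sketch in which each draft layer owns a vector of mixing logits over the target layers and receives the softmax-weighted sum of their hidden states. This is an assumption-laden simplification: the abstract only says that each draft layer attends to its own learnable combination of target layers, so the weighting scheme, the function names, and the numbers below are hypothetical rather than the released implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_for_draft_layer(target_states, mixing_logits):
    """Softmax-weighted sum of target-layer hidden states for ONE draft layer.

    target_states: one hidden vector per target layer (all the same length).
    mixing_logits: one learnable scalar per target layer, owned by this draft layer.
    """
    weights = softmax(mixing_logits)
    dim = len(target_states[0])
    return [sum(w * h[i] for w, h in zip(weights, target_states)) for i in range(dim)]


# Hypothetical setup: three target layers with hidden size 2, two draft layers.
target_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
per_layer_logits = [
    [0.0, 0.0, 0.0],  # draft layer 0 mixes all target layers equally
    [5.0, 0.0, 0.0],  # draft layer 1 leans heavily on target layer 0
]
draft_inputs = [fuse_for_draft_layer(target_states, logits) for logits in per_layer_logits]
print(draft_inputs)
```

With a single shared representation, both draft layers would receive the same vector; giving each layer its own logits is what produces the distinct per-layer inputs the abstract describes.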

2606.02004 2026-06-03 cs.CL cs.LG 版本更新

Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling

将零售产品名称编码为消费者价格类别的机器学习：基于规则加词袋的流水线，结合可靠性加权的人工参与标注

Vladimir Beskorovainyi

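Affiliations: Besk Tech; Moscow Institute of Physics and Technology (MIPT)

AI summary: Proposes a pipeline that combines rules with bag-of-words models, plus a reliability-weighted human-in-the-loop labeling protocol, for mapping retail product names to consumer-price categories (e.g., UN COICOP). Experiments indicate that bag-of-words models nearly saturate the task (F1 about 0.99), while reliability-weighted voting only slightly beats plain majority voting.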

Comments 11 pages, 3 tables. Methodology paper; illustrative experiments only, no proprietary data

详情

DOI: 10.5281/zenodo.20503355

AI中文摘要

消费者价格测量越来越多地依赖替代数据源——扫描仪、网络抓取和交易/收据数据。一个反复出现的障碍是，这些来源中的产品描述简短、嘈杂且缩写，没有标准产品代码，因此每个项目必须首先映射到消费分类（例如，联合国COICOP方案），然后才能比较价格。本文将该映射作为一种通用的、可重复的方法进行研究。流水线包括：(i) 对嘈杂项目名称进行文本归一化和分词；(ii) 基于每类关键词和停用词的前缀树（trie）规则预分类器；(iii) 每个类别的二元确认模型，决定一个项目是否属于暂定分配的类别。对于大规模标注，我们使用人工参与协议，其中标注者给出二元有效/拒绝判断，通过动态更新的可靠性权重进行聚合；模型加入相同的规则，实现持续微调。我们的实证发现是通货紧缩的：在一个受控、无泄漏的研究中（一个类别，真实正例与困难负例，五个随机种子），词袋模型基本上饱和了任务（F1约0.99）——线性分类器匹配多层感知器，显式词序（n-gram）特征没有增加任何价值，约67个标注样本已经足够。标注协议的蒙特卡洛研究表明，可靠性加权投票勉强超过简单多数投票（其加性权重饱和），而Dawid-Skene方法明显更好地恢复标签。我们还讨论了价格层面的质量控制和统计办公室考虑交易数据时的设计经验。所有数字均为示意性；未复制任何机密数据、代码或文档。

英文摘要

Consumer-price measurement increasingly draws on alternative data sources -- scanner, web-scraped, and transaction/receipt data. A recurring obstacle is that product descriptions in such sources are short, noisy, and abbreviated, with no standard product code, so each item must first be mapped to a consumption classification (e.g., the UN COICOP scheme) before prices can be compared. This paper studies that mapping as a general, reproducible method. The pipeline is: (i) text normalization and tokenization of noisy item names; (ii) a prefix-tree (trie) rule-based pre-classifier driven by per-category key-phrases and stop-phrases; and (iii) a per-category binary confirmation model deciding whether an item belongs to a tentatively assigned category. For labels at scale we use a human-in-the-loop protocol in which annotators give a binary valid/reject judgment, aggregated by a dynamically updated reliability weight; the model joins the same rule, enabling continual fine-tuning. Our empirical finding is deflationary: in a controlled, leakage-free study (one category, real positives vs. hard negatives, five seeds), bag-of-words models essentially saturate the task (F1 about 0.99) -- a linear classifier matches a multilayer perceptron, explicit word-order (n-gram) features add nothing, and about 67 labeled examples already suffice. A Monte-Carlo study of the labeling protocol shows the reliability-weighted vote barely beats plain majority (its additive weights saturate) while Dawid-Skene recovers labels markedly better. We also discuss price-level quality control and design lessons for statistical offices considering transaction data. All figures are illustrative; no confidential data, code, or documentation is reproduced.
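
The two aggregation rules compared in the Monte-Carlo study can be sketched as follows. The abstract names a reliability-weighted vote with additive weights that saturate but gives no formulas, so the clipped additive update, the step size, and the example weights are hypothetical choices made for illustration.

```python
def majority_vote(votes):
    """votes: list of booleans (True = valid). A tie resolves to reject."""
    return sum(votes) * 2 > len(votes)


def weighted_vote(votes, weights):
    """Accept when the reliability mass behind 'valid' exceeds that behind 'reject'."""
    valid = sum(w for v, w in zip(votes, weights) if v)
    reject = sum(w for v, w in zip(votes, weights) if not v)
    return valid > reject


def update_weight(weight, agreed, step=0.1, lo=0.0, hi=1.0):
    """Additive reliability update, clipped to [lo, hi].

    The clipping is where weights can saturate: once an annotator's weight
    reaches hi, further agreement no longer changes it.
    """
    delta = step if agreed else -step
    return min(hi, max(lo, weight + delta))


# Hypothetical item: one reliable annotator says valid, two unreliable ones reject.
votes = [True, False, False]
weights = [0.9, 0.3, 0.3]
print(majority_vote(votes))           # plain majority follows the two rejections
print(weighted_vote(votes, weights))  # the weighted vote follows the reliable annotator
```

Once most annotators sit at the upper clip, their weights are equal and the weighted vote reduces to plain majority, which is one plausible reading of why the abstract reports only a marginal gain.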

2606.01904 2026-06-03 cs.CL cs.AI 版本更新

KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts

KliniskVestBERT: 针对挪威临床文本特化的BERT模型

Christian Autenried, Cosimo Persia

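Affiliations: Helse Vest IKT

AI summary: Builds the KliniskVestBERT models by continuing to pretrain existing BERT models on Norwegian clinical text; the resulting models clearly outperform their baselines on clinical NLP tasks.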

详情

AI中文摘要

自然语言处理（NLP）在医疗保健中的应用日益增长，这要求语言模型特别适应临床语言的复杂性。本文介绍了KliniskVestBERT，这是一套基于BERT的编码器模型套件，在来自Helse Vest的大量真实、去标识化的挪威临床文本语料库上进行预训练。我们在专门的临床数据集上继续预训练现有的语言模型Nb-BERT-large、NorBERT3-large和ModernBERT。该数据集基于Helse Vest患者的代表性人群。包含的文档类型经过精心策划，涵盖了bokmål和nynorsk的广泛临床范围，包括出院小结、手术报告、护理记录等，确保全面代表挪威医疗环境中的语言景观。在三个合成挪威临床基准数据集和两个真实世界问题上的评估表明，每个临床特化模型都持续优于其基线对应模型，突显了领域特定预训练对临床领域NLP任务的显著益处。该项目由所有Helse Vest实体（Helse Bergen、Helse Fonna、Helse Førde和Helse Stavanger）与DIPS在Helse Vest ICT的项目领导下共同完成。

英文摘要

The increasing application of Natural Language Processing (NLP) in healthcare demands language models specifically attuned to the complexities of clinical language. This work introduces KliniskVestBERT, a suite of three BERT-based encoder models pre-trained on a substantial corpus of real-world, de-identified Norwegian clinical texts from Helse Vest. We continue pretraining existing language models Nb-BERT-large, NorBERT3-large, and ModernBERT on our specialized clinical dataset. This dataset is based on a representative population of Helse Vest patients. The included document types are carefully curated to encompass a broad clinical spectrum in Bokmål and Nynorsk, including discharge summaries, surgical reports, and nursing notes, ensuring comprehensive representation of the linguistic landscape within Norwegian healthcare settings. Evaluation on three synthetic Norwegian clinical benchmark datasets and two real-world problems demonstrates that each of our clinically specialized models consistently outperforms its baseline counterpart, highlighting the significant benefit of domain-specific pre-training for NLP tasks within the clinical domain. The project was a joint effort by all Helse Vest entities (Helse Bergen, Helse Fonna, Helse Førde and Helse Stavanger) with DIPS under the project lead of Helse Vest ICT.

2606.01849 2026-06-03 cs.LG cs.CL cs.CR 版本更新

ContinuousBench: Can Differentially Private Synthetic Text Improve Capabilities?

ContinuousBench: 差分隐私合成文本能否提升能力？

Peihan Liu, Lucas Rosenblatt, Weiwei Kong, Natalia Ponomareva, Gautam Kamath, Rachel Cummings, Roxana Geambasu, Yu Gan, Lillian Tsai, Alex Bie

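Affiliations: Columbia University; NYU; Google Research; University of Waterloo; Vector Institute; Google

AI summary: Proposes ContinuousBench, a benchmark with continuously refreshed datasets for testing whether differentially private synthetic text can transfer new knowledge from the original corpus. Experiments show that non-private synthesis transfers knowledge effectively, whereas state-of-the-art DP synthesis methods largely fail even at high privacy budgets.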

Comments For datasets, see https://huggingface.co/ContinuousBench; for the evaluation harness, see https://github.com/plau666/ContinuousBenchEval; for an accompanying blog post, see https://peihanliu.com/posts/continuousbench.html

详情

AI中文摘要

差分隐私（DP）文本合成有望为模型训练解锁敏感语料库，但目前尚不清楚DP合成数据是否能传递仅存在于这些语料库中的真正新知识和能力。这是因为现有评估依赖于无需训练即可几乎解决的任务，因此强大的基准性能并不能证明DP合成可以替代原始数据访问。为此，我们引入了ContinuousBench，一个持续自动更新的基准，用于衡量DP合成文本带来的能力提升。每季度发布一次新版本，配对一个从未见过的训练语料库和一个衍生的问答集，其构建满足：（1）无语料库无法解决；（2）在DP下可学习，因为测试的知识由数百条独立记录支持。研究人员从训练语料库生成DP合成数据，并在其合成数据上运行我们的标准化训练和评估框架以衡量增益。我们实例化两个轨道：Geminon，一个关于虚构生物的程序生成数据集；以及News，一个新爬取的公共新闻文章流。尽管标准基准几乎饱和，但在ContinuousBench上，我们发现非隐私合成从原始语料库转移了大量知识，而最先进的DP合成方法即使在高隐私预算（ε=100）下也基本无法做到。

英文摘要

Differentially private (DP) text synthesis promises to unlock sensitive corpora for model training, but it remains unclear whether DP synthetic data transmits genuinely new knowledge and capabilities present only in those corpora. This is because existing evaluations rely on tasks that are nearly solvable without training, so strong benchmark performance does not establish that DP synthesis can substitute for original data access. Thus, we introduce ContinuousBench, a continuously and automatically regenerated benchmark that measures capability gain from DP synthetic text. Each quarter, a new release pairs a never-before-seen training corpus with a derived QA set, constructed to be: (1) unsolvable sans-corpus; and (2) learnable under DP, as the tested knowledge is supported by hundreds of independent records. Researchers produce DP synthetic data from the training corpus and run our standardized training and evaluation harness on their synthetic data to measure gains. We instantiate two tracks: Geminon, a procedurally generated dataset about fictional creatures; and News, a stream of newly crawled public news articles. Although standard benchmarks are nearly saturated, on ContinuousBench we find that non-private synthesis transfers substantial knowledge from the original corpus, while state-of-the-art DP synthesis methods generally fail to do so, even at $\varepsilon=100$.

2606.01629 2026-06-03 cs.CL 版本更新

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

基准测试LLM作为裁判在长文本输出评估中的应用

Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su, Yujia Zhou, Min Zhang, Yiqun Liu, Qinyao Ai

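Affiliations: Department of Computer Science and Technology, Tsinghua University; University of Science and Technology Beijing

AI summary: Proposes LongJudgeBench, a benchmark that systematically evaluates the reliability of LLM judges on long-form outputs, and finds a substantial reliability gap in current models.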

详情

AI中文摘要

随着大语言模型（LLM）越来越多地用于长文本生成，可靠地评估长文本输出已成为一个关键挑战。LLM作为裁判提供了一种可扩展的替代人工评估的方法，但其在长文本输出评估中的可靠性仍未得到充分检验：现有的元评估基准主要关注短文本输出。与短文本评估相比，长文本评估不仅仅是输出长度的问题；它通常要求裁判处理更复杂的文档级需求。在这项工作中，我们引入了LongJudgeBench，这是一个全面的基准，用于评估LLM裁判在跨多种真实场景和评判协议下的长文本输出表现。我们系统地评估了广泛的LLM裁判，涵盖了多个基础模型和评判设置。我们的结果揭示了显著的可靠性差距：当前的LLM裁判在不同场景下仍然不稳定，而评分标准或参考虽然有所帮助，但并非总是足够。我们希望LongJudgeBench能够支持未来在更稳健、上下文感知且与人类对齐的LLM-as-a-judge方法上的研究。我们的代码可在https://anonymous.4open.science/r/LongJudgeBench-F782获取。

英文摘要

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length; it often requires judges to make more complex document-level assessments of overall organization, task-relevant coverage and depth, cross-section consistency, and scenario-specific quality criteria. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long-form outputs across diverse real-world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods. Our code is available at https://github.com/cjj826/LongJudgeBench.

2606.01166 2026-06-03 cs.CR cs.CL 版本更新

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

BraveGuard: 从开放世界威胁到更安全的计算机使用代理

Yunhao Feng, Xiaohu Du, Xinhao Deng, Yifan Ding, Ming Wen, Yixu Wang, Yuxiang Xie, Baihui Zheng, Yingshui Tan, Yige Li, Yutao Wu, Kerui Cao, Wenke Huang, Yanming Guo, Xingjun Ma, Yu-Gang Jiang

Affiliations: Fudan University; Ant Group; Hunan Institute of Advanced Technology; Alibaba Group; Singapore Management University; Deakin University; Nanyang Technological University; Shanghai Innovation Institute

AI summary: Proposes BraveGuard, a framework that trains guard models from open-world threat signals and realistic agent trajectories to enable trajectory-level safety detection, substantially improving the safety of computer-use agents.

详情

AI中文摘要

计算机使用代理将语言模型从文本生成扩展到与文件、终端、浏览器和外部工具的持续交互。这种转变带来了安全风险，这些风险难以从孤立的提示或最终响应中检测出来，因为危害通常只在多步执行轨迹中显现，而单个动作在局部看似无害。我们引入了BraveGuard，一个自我进化的防御框架，用于从开放世界威胁信号和真实代理轨迹中训练防护模型。BraveGuard挖掘近期研究来源以识别新兴风险和攻击模式，将其实例化为可执行的计算机使用任务，收集代理轨迹，并为防护模型训练提供轨迹级别的监督。随着新威胁和验证失败的出现，可以重复该流程，形成一个自适应防御循环，而不是静态的、基准驱动的训练过程。我们通过训练多个防护骨干模型（包括Qwen3-Guard和Llama-Guard变体）来实例化BraveGuard，并在轨迹级别的代理安全基准上评估生成的防护模型。BraveGuard在计算机使用轨迹上持续提高了安全检测能力。在AgentHazard上，与现成的防护模型相比，它显著提高了检测准确性，在平均防护模型设置下，准确率从38.79%提升到82.38%。这些结果表明，基于开放世界威胁发现和真实代理执行的防护监督可以超越固定分类法和合成提示级数据，改进安全监控。BraveGuard为面对不断变化的现实世界风险的计算机使用代理提供了一条可扩展的自适应防御路径。

英文摘要

Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.

2606.01075 2026-06-03 cs.CL 版本更新

On the Generalization Gap in Self-Evolving Language Model Reasoning

自进化语言模型推理中的泛化差距

Zhenting Qi, Susanna Maria Baby, Stefanie Anna Baby, Kan Yuan, Andrew Tomkins, Tu Vu, Da-Cheng Juan, Cyrus Rashtchian

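Affiliations: Google Research; Harvard University; Google; Virginia Tech

AI summary: Studies the gap between self-evolving algorithms and perfect supervision in a strict closed-loop setting, finding that multi-turn critique-and-revision can approach perfectly supervised performance while self-evolution still leaves a non-trivial gap.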

Comments Published at ICML 2026

详情

AI中文摘要

近期研究表明，大型语言模型（LLMs）可以通过自进化（SE）来改进，即使用模型自身生成的监督信号。在这项工作中，我们提出疑问：在严格的闭环设置下，自进化算法只能访问未标记的提示集和基础模型，内部生成的监督能多大程度接近完美监督训练？我们在统一的离线自进化框架中分析了四种代表性策略：单轮验证、带反馈的多轮修正、迭代训练和课程学习。我们的主要实验使用骑士与无赖（KK）逻辑推理任务，该任务提供确定性解决方案、可控难度级别以及易于泛化的干净测试平台。我们首先表明，自进化始终优于基础模型，但在投入过多训练计算后趋于平稳，最终仍与完美监督存在非平凡差距。我们发现，使用大型模型的多轮批评-修正可以达到强大的自进化性能，其中Gemma 12B几乎与完美监督训练相匹配。除了骑士与无赖任务，我们还在现实世界的推理基准上评估了自进化，其增益也较为有限。总体而言，我们的结果描述了闭环自进化何时能够提供帮助，并表明在这种最小化设定下，内部生成的监督仍然不足。

WISE：网络信息讽刺与虚假评估

Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury

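Affiliations: Texas State University San Marcos

AI summary: To address the difficulty of distinguishing fake news from satire, proposes the WISE framework and evaluates lightweight Transformer models on the Fakeddit dataset. MiniLM achieves the highest accuracy and RoBERTa-base the highest ROC-AUC, indicating that lightweight models are effective in resource-constrained settings.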

Comments This is the author's preprint. Accepted to WEB&GRAPH 2026 (co-located with WSDM 2026), Boise, Idaho, USA, Feb 26, 2026. Final version will appear in WSDM 2026 Companion Proceedings. Conf: https://wsdm-conference.org/2026/ Workshop: https://aiimlab.org/events/WSDM_2026_WEB_and_GRAPH_2026_Workshop_on_Web_and_Graphs_Responsible_Intelligence_and_Social_Media.html

详情

AI中文摘要

区分虚假或不真实的新闻与讽刺或幽默，由于它们重叠的语言特征和不同的意图，带来了独特的挑战。本研究开发了WISE（网络信息讽刺与虚假评估）框架，该框架在来自Fakeddit的20000个样本的平衡数据集上，对八个轻量级Transformer模型以及两个基线模型进行了基准测试，这些样本被标注为假新闻或讽刺。使用分层5折交叉验证，我们在包括准确率、精确率、召回率、F1分数、ROC-AUC、PR-AUC、MCC、Brier分数和期望校准误差在内的综合指标上评估模型。我们的评估显示，轻量级模型MiniLM在所有模型中达到了最高的准确率（87.58%），而RoBERTa-base达到了最高的ROC-AUC（95.42%）和强大的准确率（87.36%）。DistilBERT以86.28%的准确率和93.90%的ROC-AUC提供了出色的效率-准确率权衡。统计检验确认了模型之间的显著性能差异，配对t检验和McNemar检验提供了严格的比较。我们的发现强调，轻量级模型可以匹配或超过基线性能，为在现实世界资源受限的环境中部署错误信息检测系统提供了可行的见解。

英文摘要

Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.

2605.23055 2026-06-03 cs.LG cs.AI cs.CL 版本更新

Decomposing and Measuring Evaluation Awareness

分解与度量评估意识

Changling Li, Terry Jingchen Zhang, Jie Zhang, Zhijing Jin, Sahar Abdelnabi, Maksym Andriushchenko

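Affiliations: ETH Zürich; ELLIS Institute Tübingen; Max Planck Institute for Intelligent Systems; Tübingen AI Center; University of Toronto & Vector Institute; EuroSafeAI

AI summary: Drawing on social psychology, decomposes evaluation awareness into an environment component and a model component. It finds that recognition rates depend on the model-benchmark pairing, that recognition rarely changes behavior, and that safety evaluations are more affected than capability evaluations, and it proposes the factor-controlled EvalAwareBench benchmark.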

详情

AI中文摘要

前沿语言模型有时会意识到自己正在被评估并调整行为，从而破坏基准结果的有效性。然而，该领域的研究缺乏共同基础，混淆了评估属性与模型属性，以及检测与行为响应。我们将评估意识植根于社会心理学，将其分解为环境组件（任务的可识别程度）和模型组件（将识别与行动倾向分离）。我们通过八个分类触发因素（如占位符实体和评分式输出格式）来操作化环境组件，并通过思维链监控研究识别和行为。在九个前沿模型和四个基准上，识别率取决于模型和基准的具体配对，而非单独一方。识别很少导致行为改变，即使发生，方向也取决于所感知的评估类型。模型对安全评估的敏感性高于能力评估，使安全基准的有效性面临更大风险。为了研究每个模型对哪些因素敏感以及它们如何相互作用，我们提出了EvalAwareBench，这是一个包含100个配对安全-能力任务的因子控制基准，其中八个因素中的每一个都可以独立切换，在保持底层请求不变的同时改变评估信号。通过EvalAwareBench，我们发现没有单一因素能统一影响所有模型，但叠加因素会逐步提高所有模型的评估意识。我们的框架和EvalAwareBench提供了度量、归因和缓解评估意识的工具，指出在识别下的行为一致性是一条有前景的前进道路。

英文摘要

Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining the validity of benchmark results. Yet the field studies it without a shared foundation, conflating properties of the evaluation with properties of the model, and detection with behavioral response. We ground evaluation awareness in social psychology, decomposing it into an environment component (how recognizable the task is) and a model component that separates recognition from propensity to act on it. We operationalize the environment component through eight categorized trigger factors, such as placeholder entities and grading-style output formats, and study recognition and behavior through chain-of-thought monitoring. Across nine frontier models and four benchmarks, recognition rates depend on the specific pairing of model and benchmark rather than on either in isolation. Recognition rarely leads to behavioral change, and when it does, the direction depends on the type of evaluation perceived. Models are also more sensitive to safety than capability evaluations, placing safety benchmark validity at greater risk. To study which factors each model is sensitive to and how they interact, we propose EvalAwareBench, a factor-controlled benchmark of 100 paired safety-capability tasks where each of the eight factors can be independently toggled, varying evaluative signals while holding the underlying request fixed. Through EvalAwareBench, we find that no single factor uniformly affects all models, but stacking factors progressively raises evaluation awareness across all of them. Our framework and EvalAwareBench provide the tools to measure, attribute, and mitigate evaluation awareness, pointing to behavioral consistency under recognition as a promising path forward.

2605.18740 2026-06-03 cs.CV cs.AI cs.CL cs.LG 版本更新

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Vision-OPD：通过在线策略自蒸馏让多模态大语言模型学会看清精细细节

Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu

发表机构 * Tsinghua University（清华大学）

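AI summary: Proposes Vision-OPD, a framework that uses on-policy self-distillation to transfer the model's own regional perception to its full-image policy, improving the accuracy of multimodal LLMs on fine-grained visual understanding.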

Comments Project page: https://github.com/VisionOPD/Vision-OPD

AI中文摘要

多模态大语言模型（MLLMs）在细粒度视觉理解方面仍然存在困难，答案往往依赖于全图中微小但决定性的证据。我们观察到一种区域到全局的感知差距：当以证据为中心的裁剪图像为条件时，同一MLLM回答细粒度问题的准确率高于以对应全图为条件，这表明许多失败源于难以聚焦于相关证据，而非局部识别能力不足。受此观察启发，我们提出Vision-OPD（视觉在线策略蒸馏），一种区域到全局的自蒸馏框架，将模型自身特权的区域感知迁移到其全图策略。Vision-OPD从同一MLLM实例化两个条件策略：一个以裁剪图像为条件的教师和一个以全图为条件的学生。学生生成在线策略轨迹，Vision-OPD沿这些轨迹最小化教师和学生下一个词元分布之间的词元级差异。这使得模型能够内化视觉放大的好处，而无需外部教师模型、真实标签、奖励验证器或推理时工具使用。在多个细粒度视觉理解基准上的实验表明，Vision-OPD模型在性能上可与更大的开源、闭源以及“思考图像”智能体模型相媲美或更优。

英文摘要

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on relevant evidence rather than from insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and "Thinking-with-Images" agentic models. The code is available at https://github.com/VisionOPD/Vision-OPD
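
The token-level objective described above can be sketched in a few lines. This is a minimal illustration, assuming the divergence is KL(student || teacher) averaged over the tokens of one student rollout; the abstract does not state the divergence or its direction, and the function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def rollout_divergence(student_dists, teacher_dists):
    """Mean per-token KL(student || teacher) along one student-generated rollout.

    student_dists[t] is the full-image-conditioned next-token distribution at
    step t; teacher_dists[t] is the crop-conditioned distribution at the same
    prefix. Both are assumed to be supplied by the same underlying model.
    """
    if len(student_dists) != len(teacher_dists):
        raise ValueError("rollouts must have the same length")
    per_token = [kl_divergence(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(per_token) / len(per_token)

student = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.1]]
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = rollout_divergence(student, teacher)
same = rollout_divergence(student, student)
```

The value is zero when student and teacher agree at every step and positive otherwise, so minimizing it pulls the full-image policy toward the crop-conditioned one.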

2405.07006 2026-06-03 cs.CL 版本更新

Word-specific tonal realizations in Mandarin

普通话中特定词的声调实现

Yu-Ying Chuang, Melanie J. Bell, Yu-Hsiang Tseng, R. Harald Baayen

发表机构 * National Taiwan Normal University（台湾师范大学）； Anglia Ruskin University（安格利亚鲁斯金大学）； Eberhard Karl University of Tübingen（图宾根埃伯哈德·卡尔大学）

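AI summary: Using corpus analysis and computational modeling, this study finds that the tonal realization of Mandarin two-character words is partly determined by word meaning, and that word type predicts tonal realization better than all previously established word-form related factors combined.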

DOI: 10.1017/S0097850725000001
Journal ref: Language 102 (2026) 1-45

AI中文摘要

普通话双字词的音高轮廓通常被认为是由组成单字词的底层声调所决定，并与语速、相邻声调协同发音、音段构成和可预测性等因素施加的发音约束相互作用。本研究表明，声调实现也部分由词义决定。我们首先基于台湾普通话自然对话语料库，使用广义加性回归模型，并聚焦于升降调模式，发现在控制说话者和语境效应后，词类型对声调实现的预测力比所有先前建立的词形相关预测因子之和更强。重要的是，添加关于语境中词义的信息进一步提高了预测准确性。然后，我们通过使用上下文特定的词嵌入进行计算建模，表明在保留数据上，标记特定的音高轮廓以50%的准确率预测词类型，而上下文敏感的标记特定嵌入以40%的准确率预测音高轮廓的形状。这些准确率比随机水平高一个数量级，表明词音高轮廓与其词义之间的关系足够强，可能对语言使用者具有功能性。讨论了这些实证发现的理论意义。

英文摘要

The pitch contours of Mandarin two-character words are generally understood as being shaped by the underlying tones of the constituent single-character words, in interaction with articulatory constraints imposed by factors such as speech rate, co-articulation with adjacent tones, segmental make-up, and predictability. This study shows that tonal realization is also partially determined by words' meanings. We first show, on the basis of a corpus of Taiwan Mandarin spontaneous conversations, using a generalized additive regression model, and focusing on the rise-fall tone pattern, that after controlling for effects of speaker and context, word type is a stronger predictor of tonal realization than all the previously established word-form related predictors combined. Importantly, the addition of information about meaning in context improves prediction accuracy even further. We then proceed to show, using computational modeling with context-specific word embeddings, that token-specific pitch contours predict word type with 50% accuracy on held-out data, and that context-sensitive, token-specific embeddings can predict the shape of pitch contours with 40% accuracy. These accuracies, which are an order of magnitude above chance level, suggest that the relation between words' pitch contours and their meanings is sufficiently strong to be potentially functional for language users. The theoretical implications of these empirical findings are discussed.

2512.18552 2026-06-03 cs.SE cs.AI cs.CL cs.LG 版本更新

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

通过自我对弈SWE-RL训练超级智能软件代理

Yuxiang Wei, Zhiqing Sun, Emily McMilin, Jonas Gehring, David Zhang, Gabriel Synnaeve, Daniel Fried, Lingming Zhang, Sida Wang

发表机构 * Meta FAIR ； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Meta TBD Lab（Meta TBD 实验室）； Carnegie Mellon University（卡内基梅隆大学）

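AI summary: Proposes Self-play SWE-RL (SSR), which trains a single LLM agent via reinforcement learning in a self-play setting to iteratively inject and repair software bugs in real-world codebases without human-labeled issues or tests, achieving notable self-improvement on SWE-bench benchmarks and outperforming the human-data baseline.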

Comments Accepted to ICML 2026

AI中文摘要

尽管当前由大型语言模型（LLM）和智能体强化学习（RL）驱动的软件代理能够提高程序员的生产力，但其训练数据（例如GitHub问题和拉取请求）和环境（例如通过-通过和失败-通过测试）严重依赖人类知识或整理，这构成了通向超级智能的根本障碍。在本文中，我们提出了自我对弈SWE-RL（SSR），这是迈向超级智能软件代理训练范式的第一步。我们的方法仅需最小的数据假设，只需访问带有源代码和已安装依赖项的沙盒化仓库，无需人工标注的问题或测试。基于这些真实世界的代码库，单个LLM代理通过强化学习在自我对弈环境中进行训练，以迭代地注入和修复复杂度逐渐增加的软件缺陷，每个缺陷由测试补丁而非自然语言问题描述正式指定。在SWE-bench Verified和SWE-Bench Pro基准上，SSR实现了显著的自我改进（分别提升+10.4和+7.8分），并在整个训练轨迹中持续优于人类数据基线，尽管其评估的是自我对弈中未出现的自然语言问题。我们的结果虽然尚处于早期阶段，但表明了一条路径，即代理可以从真实软件仓库中自主收集广泛的学习经验，最终实现超越人类能力的超级智能系统，在理解系统构建方式、解决新挑战以及从头开始自主创建新软件方面超越人类。

英文摘要

While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass-to-pass and fail-to-pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training paradigms for superintelligent software agents. Our approach takes minimal data assumptions, only requiring access to sandboxed repositories with source code and installed dependencies, with no need for human-labeled issues or tests. Grounded in these real-world codebases, a single LLM agent is trained via reinforcement learning in a self-play setting to iteratively inject and repair software bugs of increasing complexity, with each bug formally specified by a test patch rather than a natural language issue description. On the SWE-bench Verified and SWE-Bench Pro benchmarks, SSR achieves notable self-improvement (+10.4 and +7.8 points, respectively) and consistently outperforms the human-data baseline over the entire training trajectory, despite being evaluated on natural language issues absent from self-play. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories, ultimately enabling superintelligent systems that exceed human capabilities in understanding how systems are constructed, solving novel challenges, and autonomously creating new software from scratch.

2603.00029 2026-06-03 cs.CL 版本更新

Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models

拥抱各向异性：将大规模激活转化为大型语言模型的可解释控制旋钮

Youngji Roh, Hyunjin Cho, Jaehyung Kim

发表机构 * Yonsei University（延世大学）

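AI summary: Proposes a training-free, magnitude-based method for identifying Domain-Critical Dimensions, treats them as interpretable semantic detectors, and shows that applying activation steering only to these dimensions outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.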

Comments ACL 2026 Main Conference

详情

AI中文摘要

大型语言模型（LLMs）表现出高度各向异性的内部表示，通常以大规模激活为特征，即一小部分特征维度具有显著大于其余维度的量级。虽然先前的工作主要将这些极端维度视为需要处理的伪影，但我们提出了一个不同的视角：这些维度作为由领域专业化产生的内在可解释功能单元。具体来说，我们提出了一种简单的基于幅度的标准，以无训练方式识别领域关键维度。我们的分析表明，这些维度表现为符号/定量模式或领域特定术语的可解释语义检测器。此外，我们引入了关键维度引导，仅对识别出的维度应用激活引导。实证结果表明，在领域适应和越狱场景中，该方法优于传统的全维度引导。

英文摘要

Large Language Models (LLMs) exhibit highly anisotropic internal representations, often characterized by massive activations, a phenomenon where a small subset of feature dimensions possesses magnitudes significantly larger than the rest. While prior works view these extreme dimensions primarily as artifacts to be managed, we propose a distinct perspective: these dimensions serve as intrinsic interpretable functional units arising from domain specialization. Specifically, we propose a simple magnitude-based criterion to identify Domain-Critical Dimensions in a training-free manner. Our analyses reveal that such dimensions behave as interpretable semantic detectors for symbolic/quantitative patterns or domain-specific terms. In addition, we introduce Critical Dimension Steering, which applies activation steering exclusively to the identified dimensions. Empirical results show that this approach outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.

2605.15572 2026-06-03 cs.CL 版本更新

Measuring Maximum Activations in Open Large Language Models

测量开放大型语言模型中的最大激活值

Luxuan Chen, Han Tian, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Linghe Kong, Dawei Yin

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Baidu Inc.（百度公司）； Nankai University（南开大学）

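AI summary: Using a unified pipeline, this study measures global and layerwise maximum activations on 27 checkpoints from 8 open LLM families, finds that they vary widely across families, architectures, and training stages and co-vary with low-bit quantization error, and recommends reporting this metric with open-weight releases.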

AI中文摘要

激活值的动态范围是低比特量化、激活缩放和稳定LLM推理的一阶约束。先前的工作表征了2024年前LLaMA风格模型上的异常特征和巨大激活值，而下游的激活量化堆栈继承了这一图景，未针对后LLaMA开放模型繁荣进行重新审视。我们提出面向部署的问题：现代开放LLM中激活值能有多大，这种幅度如何随家族、代际和训练阶段变化？在统一流程下（5000样本多领域语料库、家族特定分词、嵌入层、隐藏状态、注意力、MLP/MoE、SwiGLU门和最终归一化层的相同钩子），我们测量了来自8个开放家族的27个检查点（涵盖密集、MoE、视觉语言、中间训练和指令微调变体）的全局和逐层最大值。我们发现：(i) 在可比较的参数数量下，全局最大值跨越近四个数量级，Qwen3.5和MoE检查点在10^2到10^3范围，而Gemma3-27B-it达到约7×10^5；(ii) 跨家族和跨代际比较打破了简单的单调缩放；(iii) MoE检查点的峰值比同等规模的密集对应物低14.0-23.4倍，而残差流在22/24个检查点中承载全局最大值。一个轻量级INT-8完整性检查表明，测量的最大值通过激活尺度选择与低比特重建误差共变。我们得出结论，最大激活幅度是一个与家族、架构和训练阶段相关的模型属性——而不是规模的简单副产品——并且应在低比特部署前与任何开放权重发布一起测量和报告。代码公开于https://github.com/clx1415926/Max_act_llm。

英文摘要

The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack inherits that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching ~7 x 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0-23.4x lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and that it should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
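
To see why a large maximum activation matters for low-bit deployment, consider a minimal sketch of symmetric per-tensor absmax INT8 quantization. This scheme is an assumption chosen for illustration; the abstract does not say which quantizer its sanity check uses, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor absmax quantization; returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def reconstruction_error(values):
    """Mean absolute error after quantizing and dequantizing."""
    codes, scale = quantize_int8(values)
    return sum(abs(v - c * scale) for v, c in zip(values, codes)) / len(values)

# Typical activations in [-1, 1].
base = [0.1 * i for i in range(-10, 11)]
err_small = reconstruction_error(base)

# The same activations plus a single outlier of the magnitude cited above.
err_large = reconstruction_error(base + [7e5])
print(err_small, err_large)
```

Because the scale is set by the largest magnitude, one outlier stretches the quantization step so far that every ordinary activation rounds to zero, which is the kind of co-variation between maxima and reconstruction error the abstract reports.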

2605.14589 2026-06-03 cs.CL 版本更新

EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

EndPrompt: 通过终端锚定实现高效长上下文扩展

Han Tian, Luxuan Chen, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Jinman Zhao, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Linghe Kong, Dawei Yin

发表机构 * Nankai University（南开大学）； Baidu Inc.（百度公司）； Shanghai Jiao Tong University（上海交通大学）； Independent Researcher（独立研究者）

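AI summary: Proposes EndPrompt, which extends the context window through terminal anchoring using only short training sequences; on LLaMA-family models it extends the context from 8K to 64K and outperforms full-length fine-tuning and other methods.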

AI中文摘要

扩展大型语言模型的上下文窗口通常需要在目标长度序列上进行训练，这会带来二次方内存和计算成本，使得长上下文适应变得昂贵且难以复现。我们提出EndPrompt，一种仅使用短训练序列即可实现有效上下文扩展的方法。核心洞察是，让模型暴露于长程相对位置距离并不需要构建完整长度的输入：我们将原始短上下文保留为完整的第一个片段，并附加一个简短的终端提示作为第二个片段，为其分配接近目标上下文长度的位置索引。这种两段式结构在短物理序列中引入了局部和长程相对距离，同时保持了训练文本的语义连续性——这是将连续上下文分割的基于块的模拟方法所缺乏的特性。我们提供了基于旋转位置编码和伯恩斯坦不等式的理论分析，表明位置插值对注意力函数施加了严格的平滑性约束，共享的Transformer参数进一步抑制了到未观察中间距离的不稳定外推。应用于将LLaMA系列模型的上下文窗口从8K扩展到64K，EndPrompt实现了平均RULER分数76.03，并在LongBench上获得最高平均分，超越了LCEG（72.24）、LongLoRA（72.95）和全长度微调（69.23），同时所需计算量大幅减少。这些结果表明，长上下文泛化可以从稀疏位置监督中诱导出来，挑战了密集长序列训练对于可靠上下文窗口扩展是必要的普遍假设。代码可在https://github.com/clx1415926/EndPrompt获取。

英文摘要

Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text, a property absent in chunk-based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint over the attention function, with shared Transformer parameters further suppressing unstable extrapolation to unobserved intermediate distances. Applied to LLaMA-family models extending the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at https://github.com/clx1415926/EndPrompt.
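
The two-segment position assignment can be pictured with a small sketch. It assumes the terminal prompt occupies the last `prompt_len` indices before the target length; the abstract only says the indices are "near the target context length", so the exact placement and the function name are hypothetical.

```python
def endprompt_position_ids(context_len, prompt_len, target_len):
    """Position ids for a short context followed by a far-shifted terminal prompt.

    The first segment keeps contiguous positions 0..context_len-1. The second
    segment is assumed (for illustration) to sit at the very end of the target
    window, i.e. positions target_len-prompt_len..target_len-1.
    """
    if context_len + prompt_len > target_len:
        raise ValueError("segments do not fit inside the target window")
    first = list(range(context_len))
    second = list(range(target_len - prompt_len, target_len))
    return first + second

ids = endprompt_position_ids(context_len=8, prompt_len=3, target_len=64)
max_distance = ids[-1] - ids[0]
print(ids, max_distance)
```

The physical sequence here has only 11 tokens, yet the largest relative distance between two tokens is 63, which is how a short training sequence can expose the model to long-range relative positions.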

2604.22891 2026-06-03 cs.LG cs.AI cs.CL 版本更新

Quantifying and Mitigating Self-Preference Bias of LLM Judges

量化与缓解LLM评判者的自我偏好偏差

Jinming Yang, Zheng Hu, Chuxian Qiu, Zhenyu Deng, Xinshan Jiao, Tao Zhou

发表机构 * CompleX Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China（电子科技大学计算机科学与工程学院CompleX实验室，中国成都）

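AI summary: Proposes an automated framework for quantifying the self-preference bias of LLM judges, along with a multi-dimensional evaluation strategy based on cognitive load decomposition that reduces the bias by 31.5% on average.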

AI中文摘要

LLM-as-a-Judge已成为自动评估系统中的主导方法，在模型对齐、排行榜构建、质量控制等方面发挥关键作用。然而，该方法的可扩展性和可信度可能因自我偏好偏差（SPB）而严重失真，SPB是一种定向评估偏差，即LLM在评估时系统性地偏好或排斥自身生成的输出。现有测量方法依赖昂贵的人工标注，并将生成能力与评估立场混为一谈，因此不适用于实际系统中的大规模部署。为解决此问题，我们引入了一个完全自动化的框架来量化和缓解SPB，该框架构建质量差异可忽略的等质量回答对，从而在无需人工黄金标准的情况下，从偏差倾向中统计分离出可区分性。对20个主流LLM的实证分析表明，先进能力通常与低SPB不相关，甚至负相关。为缓解此偏差，我们提出了一种基于认知负荷分解的结构化多维评估策略，平均降低SPB 31.5%。

英文摘要

LLM-as-a-Judge has become a dominant approach in automated evaluation systems, playing critical roles in model alignment, leaderboard construction, quality control, and so on. However, the scalability and trustworthiness of this approach can be substantially distorted by Self-Preference Bias (SPB), which is a directional evaluative deviation in which LLMs systematically favor or disfavor their own generated outputs during evaluation. Existing measurements rely on costly human annotations and conflate generative capability with evaluative stance, and thus are impractical for large-scale deployment in real-world systems. To address this issue, we introduce a fully automated framework for quantifying and mitigating SPB, which constructs equal-quality pairs of responses with negligible quality differences, enabling statistical disentanglement of discriminability from bias propensity without human gold standards. Empirical analysis across 20 mainstream LLMs reveals that advanced capabilities are often uncorrelated, or even negatively correlated, with low SPB. To mitigate this bias, we propose a structured multi-dimensional evaluation strategy grounded in cognitive load decomposition, which reduces SPB by 31.5% on average.
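
One simple way to picture a directional bias score on equal-quality pairs is sketched below. This is a hypothetical metric for illustration only, not the paper's estimator: it assumes that on pairs of equal quality an unbiased judge picks its own output half the time, so the signed deviation from 0.5 indicates favoring (positive) or disfavoring (negative) its own outputs.

```python
def self_preference_bias(judgments):
    """Signed deviation from 0.5 of the rate at which a judge picks its own output.

    judgments: list of "self", "other", or "tie" verdicts collected on
    response pairs assumed to be of equal quality. Ties are ignored.
    """
    decided = [j for j in judgments if j in ("self", "other")]
    if not decided:
        return 0.0
    own_rate = sum(j == "self" for j in decided) / len(decided)
    return own_rate - 0.5

favoring = self_preference_bias(["self"] * 7 + ["other"] * 3)
disfavoring = self_preference_bias(["self"] * 2 + ["other"] * 8 + ["tie"] * 5)
print(favoring, disfavoring)
```

Using equal-quality pairs matters here because any preference that remains cannot be explained by one response simply being better.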

2605.10328 2026-06-03 cs.CL 版本更新

ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models

ANCHOR: 基于层次编排的溯因网络构建用于大型语言模型中可靠概率推断

Wentao Qiu, Guanran Luo, Zhongquan Jian, Jingqi Gao, Meihong Wang, Qingqiang Wu

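AI summary: To address the "unknown" predictions caused by sparse factor spaces and the noise introduced by adding factors in LLM probability estimation, proposes the ANCHOR framework, which combines hierarchical factor space construction, hierarchical retrieval and refinement, and a Causal Bayesian Network that augments Naïve Bayes, markedly reducing "unknown" predictions and improving the reliability of probability estimates.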

Comments Accepted by ICML 2026

AI中文摘要

在不完全信息下的大规模决策中，一个核心挑战是估计可靠的概率。最近的方法使用大型语言模型（LLM）生成解释因子和粗粒度概率估计，然后通过朴素贝叶斯模型在因子组合上进行精炼。然而，稀疏的因子空间常常产生“未知”预测，而扩展因子会增加噪声和虚假相关性，削弱条件独立性并降低可靠性。为了解决这些局限性，我们提出了ANCHOR，一个在层次因子空间上的聚合贝叶斯推断框架。它通过迭代生成和聚类构建密集的因子层次结构，通过层次检索和精炼映射上下文，并通过因果贝叶斯网络增强朴素贝叶斯以建模潜在因子依赖关系。实验表明，ANCHOR显著减少了“未知”预测，并产生了比直接LLM基线更可靠的概率估计，实现了最先进的性能，同时大幅减少了时间和令牌开销。

英文摘要

A central challenge in large-scale decision-making under incomplete information is estimating reliable probabilities. Recent approaches use Large Language Models (LLMs) to generate explanatory factors and coarse-grained probability estimates, which are then refined by a Naïve Bayes model over factor combinations. However, sparse factor spaces often yield "unknown" predictions, while expanding factors increases noise and spurious correlations, weakening conditional independence and degrading reliability. To address these limitations, we propose ANCHOR, an aggregated Bayesian inference framework over a hierarchical factor space. It constructs dense factor hierarchies through iterative generation and clustering, maps contexts via hierarchical retrieval and refinement, and augments Naïve Bayes with a Causal Bayesian Network to model latent factor dependencies. Experiments show that ANCHOR markedly reduces "unknown" predictions and produces more reliable probability estimates than direct LLM baselines, achieving state-of-the-art performance while significantly reducing time and token overhead.
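
The Naïve Bayes refinement step that the abstract builds on, and the way a sparse factor space leads to "unknown" predictions, can be sketched as follows. This shows only the plain Naïve Bayes baseline under assumed inputs, not ANCHOR's hierarchy or its Causal Bayesian Network; the data layout and function name are hypothetical.

```python
def naive_bayes_posterior(prior, likelihoods, observed):
    """Posterior over outcomes given observed factors, or None for "unknown".

    prior: {outcome: P(outcome)}
    likelihoods: {outcome: {factor: P(factor | outcome)}}
    observed: list of factor names assumed conditionally independent.
    Returns None when an observed factor is missing from the factor space.
    """
    scores = {}
    for outcome, p in prior.items():
        for factor in observed:
            if factor not in likelihoods[outcome]:
                return None  # sparse factor space: no estimate available
            p *= likelihoods[outcome][factor]
        scores[outcome] = p
    total = sum(scores.values())
    if total == 0:
        return None
    return {outcome: s / total for outcome, s in scores.items()}

prior = {"yes": 0.5, "no": 0.5}
likelihoods = {
    "yes": {"a": 0.8, "b": 0.6},
    "no": {"a": 0.2, "b": 0.3},
}
known = naive_bayes_posterior(prior, likelihoods, ["a", "b"])
unknown = naive_bayes_posterior(prior, likelihoods, ["c"])
print(known, unknown)
```

Adding more factors shrinks the number of `None` results but makes the independence assumption in the product less plausible, which is the trade-off the abstract describes.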

2602.22480 2026-06-03 cs.AI cs.CL cs.LG 版本更新

VeRO: A Harness for Agents to Optimize Agents

VeRO: 用于优化智能体的智能体框架

Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Samuel Marc Denton

发表机构 * arXiv

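AI summary: Proposes the VeRO harness and the VeRO-Bench benchmark, which support optimizing agent code through versioned snapshots, budget-controlled evaluation, and structured execution traces, and empirically compares how well different optimizers improve target agents.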

Comments Accepted to the Forty-Third International Conference on Machine Learning (ICML), 2026

AI中文摘要

编码智能体的一个重要新兴应用是智能体框架优化：通过编辑和评估目标智能体的代码来迭代改进它。尽管具有相关性，但社区对编码智能体在此任务上的表现缺乏系统理解。框架优化与传统软件工程不同：智能体框架将确定性代码与随机 LLM 完成交错，需要结构化捕获中间执行轨迹和下游结果。为了解决这些挑战，我们引入了 (1) VeRO（版本化、奖励和观察），一个外部框架，提供目标框架的版本化快照、预算控制评估和结构化执行轨迹，以及 (2) VeRO-Bench，一个包含参考评估程序的目标智能体和任务的基准套件。使用 VeRO，我们进行了一项实证研究，比较了不同任务上的优化器，并分析了哪些修改能可靠地改进目标智能体框架。我们发布 VeRO 以支持作为编码智能体核心能力的智能体优化研究。代码可在 https://github.com/scaleapi/vero 获取。

英文摘要

An important emerging application of coding agents is agent harness optimization: the iterative improvement of a target agent by editing and evaluating its code. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Harness optimization differs from conventional software engineering: agent harnesses interleave deterministic code with stochastic LLM completions, requiring structured capture of both intermediate execution traces and downstream outcomes. To address these challenges, we introduce (1) VeRO (Versioning, Rewards, and Observations), an outer harness that provides versioned snapshots, budget-controlled evaluation, and structured execution traces of target harnesses, and (2) VeRO-Bench, a benchmark suite of target agents and tasks with reference evaluation procedures. Using VeRO, we conduct an empirical study comparing optimizers across tasks and analyzing which modifications reliably improve target agent harnesses. We release VeRO to support research on agent optimization as a core capability for coding agents. Code is available at https://github.com/scaleapi/vero.

2605.05629 2026-06-03 stat.ML cs.CL cs.LG 版本更新

CogRAG：通过分层检索与推理应对RAG中的异构认知需求

Xudong Wang, Zilong Wang, Kui Su, Zhaoyan Ming

发表机构 * School of Computer and Computing Science, Hangzhou City University（浙大城市学院计算机与计算科学学院）； Innovation Center of Yangtze River Delta, Zhejiang University（浙江大学长三角智慧绿洲创新中心）； Institute of Digital Twin, Eastern Institute of Technology（宁波东方理工大学数字孪生研究院）

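AI summary: Proposes CogRAG, a framework that, inspired by Bloom's Taxonomy, predicts the cognitive load of a query and uses it to coordinate Cognition-Adaptive Evidence Refinement and Cognition-Stratified Structured Reasoning, addressing the heterogeneous cognitive demands of different tasks in RAG and raising Qwen3-8B accuracy on the Registered Dietitian qualification examination to 85.8% in single-choice mode.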

AI中文摘要

检索增强生成（RAG）框架通常通过一刀切的流水线处理所有查询，忽略了不同任务的异构认知需求。这种认知盲区方法导致两种失败模式：当低层级事实缺口引发幻觉推理时的级联错误，以及高阶分析任务中的推理-答案不一致。我们提出CogRAG，一个无需训练、领域无关的框架，通过分层检索与推理应对这些异构认知需求。受布鲁姆分类法启发，CogRAG使用查询的预测认知负荷作为中央控制信号，协调两个模块：认知自适应证据精炼通过以事实为中心或以选项为中心的路径补充缺失上下文，以及认知分层结构化推理用认知对齐的推理模板替代无约束的思维链。我们在一个高要求的专业测试平台——注册营养师资格考试上评估CogRAG。CogRAG有效减少了早期阶段的事实错误并消除了推理-答案不一致，在单选题模式下将Qwen3-8B准确率从73.4%提升至85.8%，在场景模式下从63.3%提升至80.5%。这些结果突显了认知分层控制作为大型语言模型中可靠复杂推理的一种有效且可泛化的范式。

英文摘要

Retrieval-Augmented Generation (RAG) frameworks typically process all queries through a one-size-fits-all pipeline, ignoring the heterogeneous cognitive demands of different tasks. This cognitive-blind approach causes two failure modes: cascading errors when low-level factual gaps trigger hallucinated reasoning, and reasoning-answer inconsistency in higher-order analytical tasks. We introduce CogRAG, a training-free, domain-agnostic framework that tackles these heterogeneous cognitive demands via stratified retrieval and reasoning. Inspired by Bloom's Taxonomy, CogRAG uses the predicted cognitive load of a query as a central control signal that coordinates two modules: Cognition-Adaptive Evidence Refinement supplements missing context via fact-centric or option-centric paths, and Cognition-Stratified Structured Reasoning replaces unconstrained chain-of-thought with cognition-aligned reasoning templates. We evaluate CogRAG on a demanding professional testbed, the Registered Dietitian qualification examination. CogRAG effectively reduces early-stage factual errors and eliminates reasoning-answer inconsistency, raising Qwen3-8B accuracy from 73.4% to 85.8% in single-choice mode and from 63.3% to 80.5% in scenario mode. These results highlight cognitive-stratified control as an effective, generalizable paradigm for reliable complex reasoning in large language models.

2604.24374 2026-06-03 cs.CL 版本更新

MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining

MIPIC: 通过自蒸馏内部关系与渐进信息链的套娃表示学习

Phung Gia Huy, Hai An Vu, Minh-Phuc Truong, Thang Duc Tran, Linh Ngo Van, Thanh Hong Nguyen, Trung Le

发表机构 * Hanoi University of Science and Technology（河内科学技术大学）； University of Oregon（俄勒冈大学）； Monash University（莫纳什大学）

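AI summary: Proposes MIPIC, a framework that uses Self-Distilled Intra-Relational Alignment and Progressive Information Chaining to achieve cross-dimensional structural consistency and depth-wise semantic consolidation for nested embeddings, with significant gains at low dimensionalities.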

Comments ACL Findings

AI中文摘要

表示学习是NLP的基础，但构建在不同计算预算下均表现良好的嵌入具有挑战性。套娃表示学习（MRL）通过嵌套嵌入提供了灵活的推理范式；然而，学习此类结构需要明确协调信息在嵌入维度和模型深度上的排列方式。本文提出MIPIC（通过自蒸馏内部关系对齐与渐进信息链的套娃表示学习），一个统一的训练框架，旨在生成结构连贯且语义紧凑的套娃表示。MIPIC通过自蒸馏内部关系对齐（SIA）促进跨维度结构一致性，该对齐利用top-k CKA自蒸馏，对齐完整表示与截断表示之间的token级几何和注意力驱动关系。互补地，它通过渐进信息链（PIC）实现深度语义整合，这是一种支架式对齐策略，逐步将成熟的深层任务语义迁移到浅层。在STS、NLI和分类基准（涵盖从TinyBERT到BGEM3、Qwen3的模型）上的大量实验表明，MIPIC生成的套娃表示在所有容量下均具有高度竞争力，并在极端低维下展现出显著的性能优势。

英文摘要

Representation learning is fundamental to NLP, but building embeddings that work well at different computational budgets is challenging. Matryoshka Representation Learning (MRL) offers a flexible inference paradigm through nested embeddings; however, learning such structures requires explicit coordination of how information is arranged across embedding dimensionality and model depth. In this work, we propose MIPIC (Matryoshka Representation Learning via Self-Distilled Intra-Relational Alignment and Progressive Information Chaining), a unified training framework designed to produce structurally coherent and semantically compact Matryoshka representations. MIPIC promotes cross-dimensional structural consistency through Self-Distilled Intra-Relational Alignment (SIA), which aligns token-level geometric and attention-driven relations between full and truncated representations using top-k CKA self-distillation. Complementarily, it enables depth-wise semantic consolidation via Progressive Information Chaining (PIC), a scaffolded alignment strategy that incrementally transfers mature task semantics from deeper layers into earlier layers. Extensive experiments on STS, NLI, and classification benchmarks (spanning models from TinyBERT to BGEM3 and Qwen3) demonstrate that MIPIC yields Matryoshka representations that are highly competitive across all capacities, with significant performance advantages observed in extremely low-dimensional settings.
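
For readers new to nested embeddings, the inference-time side of Matryoshka representations can be sketched as follows: a shorter embedding is just a prefix of the full one, renormalized before comparison. This illustrates the general MRL usage pattern only, not MIPIC's training losses; the vectors and function names are made up for the example.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix

def cosine_at_dim(a, b, dim):
    """Cosine similarity of two embeddings truncated to `dim` dimensions."""
    ua = truncate_and_normalize(a, dim)
    ub = truncate_and_normalize(b, dim)
    return sum(x * y for x, y in zip(ua, ub))

a = [3.0, 4.0, 0.0, 0.0]
b = [3.0, 4.0, 5.0, 0.0]
sim_2 = cosine_at_dim(a, b, 2)
sim_4 = cosine_at_dim(a, b, 4)
print(sim_2, sim_4)
```

In this toy case the two vectors look identical at 2 dimensions but differ at 4, showing how truncation can change the geometry; keeping the truncated and full representations structurally consistent is the problem the abstract targets.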

2508.06165 2026-06-03 cs.CL cs.AI 版本更新

UR$^2$: Unify RAG and Reasoning through Reinforcement Learning

UR$^2$：通过强化学习统一检索增强生成与推理

Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu

发表机构 * Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China（计算机科学与技术系，人工智能研究院，清华大学，北京，中国）； Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China（人工智能产业研究机构（AIR），清华大学，北京，中国）； School of Management Science & Information Engineering, Hebei University of Economics and Business, Hebei, China（管理科学与信息工程学院，河北经贸大学，河北，中国）

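AI summary: Proposes UR$^2$, a framework that uses reinforcement learning to dynamically coordinate retrieval and reasoning, combining a difficulty-aware curriculum with a hybrid knowledge access strategy; it outperforms existing baselines on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks and reaches performance comparable to GPT-4o-mini and GPT-4.1-mini on several benchmarks.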

AI中文摘要

大型语言模型（LLM）通过两种互补范式展现了强大能力：用于知识基础的检索增强生成（RAG）和用于复杂推理的可验证奖励强化学习（RLVR）。然而，现有统一这些范式的尝试范围狭窄，通常局限于具有固定检索设置的开放域问答，限制了向更广泛领域的泛化。为解决这一局限，我们提出UR$^2$（统一RAG与推理），一个通用的强化学习框架，动态协调检索与推理。UR$^2$引入了两个关键设计：一个难度感知课程，仅对困难实例选择性调用检索；以及一个混合知识访问策略，结合领域特定的离线语料库和即时生成的LLM摘要。这些组件共同缓解了检索与推理之间的不平衡，并提高了对噪声信息的鲁棒性。在开放域问答、MMLU-Pro、医学和数学推理任务上的实验表明，基于Qwen-2.5-3/7B和LLaMA-3.1-8B构建的UR$^2$持续优于现有RAG和RL基线，并在多个基准上达到与GPT-4o-mini和GPT-4.1-mini相当的性能。我们的代码可在https://github.com/Tsinghua-dhy/UR2获取。

英文摘要

Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for knowledge grounding and Reinforcement Learning from Verifiable Rewards (RLVR) for complex reasoning. However, existing attempts to unify these paradigms remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings, which constrains generalization to broader domains. To address this limitation, we propose UR$^2$ (Unified RAG and Reasoning), a general reinforcement learning framework that dynamically coordinates retrieval and reasoning. UR$^2$ introduces two key designs: a difficulty-aware curriculum that selectively invokes retrieval only for challenging instances, and a hybrid knowledge access strategy that combines domain-specific offline corpora with on-the-fly LLM-generated summaries. Together, these components mitigate the imbalance between retrieval and reasoning and improve robustness to noisy information. Experiments on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks show that UR$^2$, built on Qwen-2.5-3/7B and LLaMA-3.1-8B, consistently outperforms existing RAG and RL baselines, and achieves performance comparable to GPT-4o-mini and GPT-4.1-mini on several benchmarks. Our code is available at https://github.com/Tsinghua-dhy/UR2.

2604.20183 2026-06-03 cs.CL 版本更新

Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

双簇记忆智能体：解决优化问题求解中的多范式歧义

Xinyu Zhang, Yuchen Wan, Boxuan Zhang, Zesheng Yang, Lingling Zhang, Bifan Wei, Jun Liu

发表机构 * School of Computer Science and Technology, Xi’an Jiaotong University（西安交通大学计算机科学与技术学院）； Ministry of Education Key Laboratory of Intelligent Networks and Network Security, China（智能网络与网络安全教育部重点实验室）； Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, China（陕西省大数据知识工程重点实验室）

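AI summary: Proposes the Dual-Cluster Memory Agent (DCM-Agent), which builds a dual-cluster memory from historical solutions in a training-free manner and uses memory-augmented inference to dynamically navigate solution paths, improving average performance by 11% to 21% across seven optimization benchmarks.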

AI中文摘要

大型语言模型（LLMs）在优化问题中常面临结构歧义，即单个问题存在多个相关但冲突的建模范式，阻碍了有效解的生成。为解决此问题，我们提出双簇记忆智能体（DCM-Agent），以无训练方式利用历史解决方案提升性能。其核心是双簇记忆构建：该智能体将历史解决方案分配到建模和编码两个簇中，然后将每个簇的内容提炼为三种结构化类型：方法、检查表和陷阱。该过程推导出可泛化的指导知识。此外，该智能体引入记忆增强推理，以动态导航解路径、检测并修复错误，并利用结构化知识自适应切换推理路径。在七个优化基准上的实验表明，DCM-Agent平均性能提升11%-21%。值得注意的是，我们的分析揭示了“知识继承”现象：由更大模型构建的记忆可以引导较小模型获得更优性能，凸显了该框架的可扩展性和效率。

英文摘要

Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To address this, we propose Dual-Cluster Memory Agent (DCM-Agent) to enhance performance by leveraging historical solutions in a training-free manner. Central to this is Dual-Cluster Memory Construction. This agent assigns historical solutions to modeling and coding clusters, then distills each cluster's content into three structured types: Approach, Checklist, and Pitfall. This process derives generalizable guidance knowledge. Furthermore, this agent introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths with structured knowledge. The experiments across seven optimization benchmarks demonstrate that DCM-Agent achieves an average performance improvement of 11% to 21%. Notably, our analysis reveals a "knowledge inheritance" phenomenon: memory constructed by larger models can guide smaller models toward superior performance, highlighting the framework's scalability and efficiency.

2604.19005 2026-06-03 cs.CL 版本更新

Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection

辩论未尽之言：基于角色锚定的多智能体推理用于半真半假检测

Yixuan Tang, Yirui Zhang, Hang Feng, Anthony K. H. Tung

发表机构 * National University of Singapore（新加坡国立大学）； Shanghai Jiao Tong University（上海交通大学）

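AI summary: Proposes RADAR, a framework that uses role-anchored multi-agent debate (a Politician and a Scientist reasoning adversarially, moderated by a neutral Judge) and a dual-threshold early termination controller to detect half-truths that mislead through omitted context, even under noisy retrieval.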

AI中文摘要

半真半假，即事实正确但因省略上下文而具有误导性的主张，对于专注于显式虚假的事实核查系统而言仍然是一个盲点。解决这种基于省略的操纵需要不仅推理所说内容，还要推理未说内容。我们提出RADAR，一个角色锚定的多智能体辩论框架，用于在现实噪声检索下进行省略感知的事实核查。RADAR为政治家和科学家分配互补角色，他们在共享检索证据上进行对抗性推理，并由中立法官主持。双阈值早停控制器自适应地决定何时达到足够推理以做出裁决。实验表明，RADAR在数据集和骨干网络上始终优于强单智能体和多智能体基线，提高了遗漏检测准确性，同时降低了推理成本。这些结果表明，具有自适应控制的角色锚定、基于检索的辩论是揭示事实核查中缺失上下文的有效且可扩展的框架。

英文摘要

Half-truths, claims that are factually correct yet misleading due to omitted context, remain a blind spot for fact verification systems focused on explicit falsehoods. Addressing such omission-based manipulation requires reasoning not only about what is said, but also about what is left unsaid. We propose RADAR, a role-anchored multi-agent debate framework for omission-aware fact verification under realistic, noisy retrieval. RADAR assigns complementary roles to a Politician and a Scientist, who reason adversarially over shared retrieved evidence, moderated by a neutral Judge. A dual-threshold early termination controller adaptively decides when sufficient reasoning has been reached to issue a verdict. Experiments show that RADAR consistently outperforms strong single- and multi-agent baselines across datasets and backbones, improving omission detection accuracy while reducing reasoning cost. These results demonstrate that role-anchored, retrieval-grounded debate with adaptive control is an effective and scalable framework for uncovering missing context in fact verification.
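
A dual-threshold early termination rule can be sketched as a small control loop. Everything below is a hypothetical reading of the idea: it assumes the Judge emits one confidence score in [0, 1] per debate round, and that the debate stops as soon as the score crosses an upper or lower threshold. The thresholds, verdict labels, and function name are not taken from the paper.

```python
def run_debate(judge_scores, upper=0.8, lower=0.2, max_rounds=5):
    """Stop the debate once the judge's score is decisive in either direction.

    judge_scores: per-round scores in [0, 1], read here (by assumption) as the
    judge's confidence that the claim omits misleading context.
    Returns (verdict, rounds_used).
    """
    rounds_used = 0
    for rounds_used, score in enumerate(judge_scores[:max_rounds], start=1):
        if score >= upper:
            return "half-truth", rounds_used
        if score <= lower:
            return "not half-truth", rounds_used
    return "undecided", rounds_used

early = run_debate([0.5, 0.6, 0.85, 0.1])
exhausted = run_debate([0.5, 0.5])
print(early, exhausted)
```

Stopping at the first decisive round is what would let such a controller cut reasoning cost: easy claims end after one or two rounds, and only ambiguous ones use the full budget.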


2604.18995 2026-06-03 cs.CL cs.AI cs.LG 版本更新

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

$R^2$-dLLM: 通过时空冗余减少加速扩散大语言模型

Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin

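AI summary: Proposes the $R^2$-dLLM framework, which reduces spatial and temporal redundancy in diffusion LLM decoding at both the inference and training stages, cutting decoding steps by up to 88% while preserving generation quality.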

详情

AI中文摘要

扩散大语言模型（dLLMs）通过并行令牌预测成为自回归生成的有前途的替代方案。然而，实际的 dLLM 解码仍然遭受高推理延迟，限制了部署。在这项工作中，我们观察到这种低效率的很大一部分来自解码过程中反复出现的冗余，包括由置信度聚类和位置模糊性引起的空间冗余，以及由重复重新掩蔽已经稳定的预测引起的时间冗余。受这些模式的启发，我们提出了 $R^{2}$-dLLM，一个从推理和训练两个角度减少解码冗余的统一框架。在推理时，我们引入了无需训练的解码规则，聚合局部置信度和令牌预测，并最终确定时间稳定的令牌以避免冗余解码步骤。我们进一步提出了一个冗余感知的监督微调流程，使模型与高效解码轨迹对齐，并减少对手动调整阈值的依赖。实验表明，与现有解码策略相比，$R^{2}$-dLLM 一致地将解码步数减少高达 88%，同时在不同模型和任务上保持有竞争力的生成质量。这些结果验证了解码冗余是 dLLMs 的一个核心瓶颈，明确减少它能够带来显著的实用效率提升。我们的代码和模型可在 https://github.com/GATECH-EIC/R2-dLLM 获取。

英文摘要

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^{2}$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^{2}$-dLLM consistently reduces the number of decoding steps by up to 88% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains. Our code and models are available at https://github.com/GATECH-EIC/R2-dLLM.
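The idea of finalizing temporally stable tokens can be illustrated with a small sketch. The rule below (a position is stable when its predicted token id has not changed over the last `patience` decoding steps) and the names `finalize_stable_tokens`, `history`, and `patience` are assumptions for illustration; the paper's actual rule may differ.

```python
def finalize_stable_tokens(history, patience=2):
    """Return positions whose prediction was identical over the last `patience` steps.

    history: list of per-step prediction lists (token ids), most recent last.
    This is a hypothetical stability rule, not the rule from the paper.
    """
    if patience < 1:
        raise ValueError("patience must be at least 1")
    if len(history) < patience:
        return set()  # not enough steps observed yet
    recent = history[-patience:]
    length = len(recent[0])
    return {
        pos for pos in range(length)
        if all(step[pos] == recent[0][pos] for step in recent)
    }
```

Positions returned by such a rule would no longer be remasked, so later steps only revisit the positions that are still changing.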


2604.08782 2026-06-03 cs.CL 版本更新

MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation

MT-OSC：解决大语言模型在多轮对话中迷失的路径

Jyotika Singh, Fang Tu, Miguel Ballesteros, Weiyi Sun, Sandip Ghoshal, Michelle Yuan, Yassine Benajiba, Sujith Ravi, Dan Roth

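Affiliations: Oracle AI

AI summary: Proposes the MT-OSC framework, whose Condenser Agent automatically condenses conversation history in the background to reduce token counts and improve multi-turn performance; validated on 13 LLMs.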

详情

AI中文摘要

大型语言模型（LLMs）在用户指令和上下文分布在多个对话轮次中时，性能会显著下降，然而多轮（MT）交互主导着聊天界面。将完整聊天历史附加到提示中的常规方法会迅速耗尽上下文窗口，导致延迟增加、计算成本升高，并且随着对话延长收益递减。我们引入了MT-OSC，一种一次性顺序压缩框架，可以在不干扰用户体验的情况下，高效自动地在后台压缩聊天历史。MT-OSC采用了一个压缩代理，该代理使用基于少样本推理的压缩器和轻量级决策器，选择性地保留必要信息，在10轮对话中减少高达72%的token数量。在13个最先进的LLM和多样化的多轮基准测试中评估，MT-OSC持续缩小了多轮性能差距——在数据集上保持或提高了准确性，同时对干扰项和无关轮次保持鲁棒性。我们的结果确立了MT-OSC作为多轮聊天的可扩展解决方案，能够在受限的输入空间内实现更丰富的上下文，降低延迟和运营成本，同时平衡性能。

英文摘要

Large language models (LLMs) suffer significant performance degradation when user instructions and context are distributed over multiple conversational turns, yet multi-turn (MT) interactions dominate chat interfaces. The routine approach of appending full chat history to prompts rapidly exhausts context windows, leading to increased latency, higher computational costs, and diminishing returns as conversations extend. We introduce MT-OSC, a One-off Sequential Condensation framework that efficiently and automatically condenses chat history in the background without disrupting the user experience. MT-OSC employs a Condenser Agent that uses a few-shot inference-based Condenser and a lightweight Decider to selectively retain essential information, reducing token counts by up to 72% in 10-turn dialogues. Evaluated across 13 state-of-the-art LLMs and diverse multi-turn benchmarks, MT-OSC consistently narrows the multi-turn performance gap - yielding improved or preserved accuracy across datasets while remaining robust to distractors and irrelevant turns. Our results establish MT-OSC as a scalable solution for multi-turn chats, enabling richer context within constrained input spaces, reducing latency and operational cost, while balancing performance.
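The interplay between a Decider and a Condenser can be sketched as follows. Everything here is a simplified assumption: the whitespace token count stands in for a real tokenizer, the `condense` callable stands in for the few-shot LLM Condenser, and the budget check stands in for the Decider. None of it is taken from the MT-OSC implementation.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude whitespace token count, a stand-in for a real tokenizer."""
    return len(text.split())


def condense_history(turns: List[str],
                     condense: Callable[[str], str],
                     budget: int) -> List[str]:
    """Hypothetical decider + condenser step.

    If the history fits in `budget` tokens it is returned unchanged;
    otherwise all turns except the latest are replaced by one condensed turn.
    """
    total = sum(count_tokens(t) for t in turns)
    if total <= budget or len(turns) < 2:
        return list(turns)
    summary = condense(" ".join(turns[:-1]))
    return [summary, turns[-1]]


def token_reduction(before: List[str], after: List[str]) -> float:
    """Fraction of tokens removed by condensation (0.0 if `before` is empty)."""
    b = sum(count_tokens(t) for t in before)
    a = sum(count_tokens(t) for t in after)
    return 0.0 if b == 0 else 1.0 - a / b
```

With a real summarizer plugged in as `condense`, `token_reduction` is the quantity that the reported "up to 72%" figure would correspond to under this reading.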


2604.16029 2026-06-03 cs.CL cs.LG 版本更新

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

减少损失！学习早期剪枝路径以实现高效并行推理

Jiaxi Bi, Tongxu Luo, Wenyu Du, Zhengyang Tang, Benyou Wang

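Affiliations: The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute; USTB; DualityRL

AI summary: Proposes STOP, a path-pruning method based on learnable internal signals that prunes futile paths at the prefix level during parallel reasoning, improving the efficiency and performance of large reasoning models.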

Comments 9 pages, 7 figures

详情

AI中文摘要

并行推理增强了大型推理模型（LRMs），但由于早期错误导致的无效路径，其成本高昂。为了缓解这一问题，前缀级的路径剪枝至关重要，然而现有研究缺乏标准化框架，较为零散。在这项工作中，我们提出了第一个系统的路径剪枝分类法，根据信号来源（内部与外部）和可学习性（可学习与不可学习）对方法进行分类。这种分类揭示了可学习内部方法的未开发潜力，促使我们提出STOP（用于剪枝的超级令牌）。在参数规模从1.5B到20B的LRMs上的广泛评估表明，与现有基线相比，STOP在效果和效率上均表现出优越性。此外，我们在不同计算预算下严格验证了STOP的可扩展性——例如，在固定计算预算下，将GPT-OSS-20B在AIME25上的准确率从84%提升至近90%。最后，我们将发现提炼为形式化的经验指南，以促进最优的实际部署。代码、数据和模型可在 https://bijiaxihh.github.io/STOP 获取。

英文摘要

Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non-learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets - for instance, boosting GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90% under fixed compute budgets. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real-world deployment. Code, data and models are available at https://bijiaxihh.github.io/STOP
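Prefix-level pruning in general can be sketched as scoring each partially generated path and keeping only the best fraction. The scores below are supplied by the caller; how STOP actually produces its learned internal signal is not shown, and `prune_paths` and `keep_ratio` are hypothetical names.

```python
import math


def prune_paths(prefix_scores, keep_ratio=0.5):
    """Keep the top fraction of reasoning paths, ranked by prefix score.

    prefix_scores: dict mapping a path id (string) to a score, higher is better.
    Returns surviving path ids, best first; ties are broken by id so the
    result is deterministic.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must lie in (0, 1]")
    n_keep = max(1, math.ceil(len(prefix_scores) * keep_ratio))
    ranked = sorted(prefix_scores, key=lambda p: (-prefix_scores[p], p))
    return ranked[:n_keep]
```

Compute that would have been spent finishing the discarded paths can then be reallocated to the survivors, which is the fixed-budget setting the abstract describes.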


2604.15744 2026-06-03 cs.CL 版本更新

Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand

语言、地点与社交媒体：新西兰的地理方言对齐

Sidney Wong

Affiliations: Computer Science and Software Engineering, University of Canterbury; Department of Linguistics, University of Illinois Urbana-Champaign; School of Psychology, Speech and Hearing, University of Canterbury; Department of Linguistics, University of Canterbury

AI summary: Combining qualitative analysis of user perceptions with computational methods, this thesis studies geographic dialect alignment in New Zealand Reddit communities; it finds that place-related communities form a contiguous speech community and that language modelling reveals semantic variation and change.

Comments PhD thesis

详情

DOI: 10.26021/16490

AI中文摘要

本论文研究了基于地点的社交媒体社区中的地理方言对齐现象，重点关注新西兰相关的Reddit社区。通过将用户感知的定性分析与计算方法相结合，研究考察了语言使用如何反映地点身份，以及基于用户提供的词汇、形态句法和语义变量的语言变异与变化模式。研究结果表明，用户通常将语言与地点联系起来，地点相关社区形成了一个连续的言语社区，尽管地理方言社区与地点相关社区之间的对齐仍然复杂。包括静态和历时Word2Vec语言嵌入在内的高级语言建模揭示了基于地点的社区之间的语义变异，以及新西兰英语中有意义的语义变化。研究创建了一个包含42.6亿未处理单词的语料库，为未来研究提供了宝贵资源。总体而言，结果凸显了社交媒体作为社会语言学自然实验室的潜力。

英文摘要

This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on user-informed lexical, morphosyntactic, and semantic variables. The findings show that users generally associate language with place, and place-related communities form a contiguous speech community, though alignment between geographic dialect communities and place-related communities remains complex. Advanced language modelling, including static and diachronic Word2Vec language embeddings, revealed semantic variation across place-based communities and meaningful semantic shifts within New Zealand English. The research involved the creation of a corpus containing 4.26 billion unprocessed words, which offers a valuable resource for future study. Overall, the results highlight the potential of social media as a natural laboratory for sociolinguistic inquiry.


2604.15097 2026-06-03 cs.SE cs.CL 版本更新

From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

从程序技能到策略基因：迈向经验驱动的测试时进化

Junjie Wang, Yiming Ren, Haoyang Zhang

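Affiliations: Infinite Evolution Lab, EvoMap; Tsinghua University

AI summary: Through 4,590 controlled trials across 45 scientific code-solving scenarios, the report studies how to represent experience as a compact, control-oriented, iteratively evolvable object (Gene); compared with documentation-oriented Skill packages, Gene performs better in both one-shot control and iterative accumulation.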

Comments Technical Report

详情

AI中文摘要

这份测试版技术报告探讨了可重用经验应如何表示，以便作为有效的测试时控制和迭代进化的基础。我们在45个科学代码求解场景中的4590次受控试验中研究了这一问题。我们发现，面向文档的Skill包提供了不稳定的控制：它们的有用信号稀疏，将紧凑的经验对象扩展为更完整的文档包通常无助于甚至可能降低整体平均值。我们进一步表明，表示本身是一个首要因素。紧凑的Gene表示产生了最强的整体平均值，在显著的结构扰动下仍具有竞争力，并优于预算匹配的Skill片段，而重新附加面向文档的材料通常会削弱而非改进它。除了单次控制外，我们还表明Gene是迭代经验积累的更好载体：附加的失败历史在Gene中比在Skill或自由格式文本中更有效，可编辑的结构比内容本身更重要，并且失败信息在提炼为紧凑警告时比简单附加更有用。在CritPt上，基因进化系统相对于其配对基础模型从9.1%提升至18.57%，从17.7%提升至27.14%。这些结果表明，经验复用的核心问题不是如何提供更多经验，而是如何将经验编码为紧凑、面向控制、可进化的对象。

英文摘要

This beta technical report asks how reusable experience should be represented so that it can function as effective test-time control and as a substrate for iterative evolution. We study this question in 4,590 controlled trials across 45 scientific code-solving scenarios. We find that documentation-oriented Skill packages provide unstable control: their useful signal is sparse, and expanding a compact experience object into a fuller documentation package often fails to help and can degrade the overall average. We further show that representation itself is a first-order factor. A compact Gene representation yields the strongest overall average, remains competitive under substantial structural perturbations, and outperforms matched-budget Skill fragments, while reattaching documentation-oriented material usually weakens rather than improves it. Beyond one-shot control, we show that Gene is also a better carrier for iterative experience accumulation: attached failure history is more effective in Gene than in Skill or freeform text, editable structure matters beyond content alone, and failure information is most useful when distilled into compact warnings rather than naively appended. On CritPt, gene-evolved systems improve over their paired base models from 9.1% to 18.57% and from 17.7% to 27.14%. These results suggest that the core problem in experience reuse is not how to supply more experience, but how to encode experience as a compact, control-oriented, evolution-ready object.


2604.07123 2026-06-03 cs.CL 版本更新

Language Bias under Conflicting Information in Multilingual LLMs

多语言大模型中冲突信息下的语言偏见

Robert Östling, Murathan Kurfalı

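Affiliations: Stockholm University; RISE Research Institutes of Sweden

AI summary: Extends the "conflicting needles in a haystack" paradigm to a multilingual setting and evaluates which languages multilingual LLMs of different sizes prefer when answering from conflicting information; models show a consistent language bias, notably a general bias against Russian and, at the longest context lengths, a preference for Chinese, and they tend to favor information whose language matches the prompt.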

详情

AI中文摘要

大型语言模型（LLMs）在整合冲突信息回答问题过程中已被证明存在偏见。本文探讨这种偏见是否也存在于冲突信息所使用的语言上。为此，我们将“干草堆中的冲突针”范式扩展到多语言环境，并使用五种不同语言的自然新闻领域数据，对一系列不同规模的多语言LLMs进行了全面评估。我们发现，所有测试的LLMs，包括GPT-5.2，在绝大多数情况下都会忽略冲突，并自信地只断言其中一个可能的答案。此外，在模型和提示语言之间，存在一致的语言偏好偏见，普遍对俄语存在偏见，而在最长上下文长度下，则偏好中文。语言偏好在中国大陆内外训练的模型之间一致，但前者稍强。模型还普遍倾向于优先考虑与提示语言匹配的信息。我们希望让多语言LLMs的用户和开发者意识到这类偏见，以促进对其成因及可能缓解方法的进一步研究。

英文摘要

Large Language Models (LLMs) have been shown to contain biases in the process of integrating conflicting information when answering questions. Here we ask whether such biases also exist with respect to which language is used for each conflicting piece of information. To answer this question, we extend the conflicting needles in a haystack paradigm to a multilingual setting and perform a comprehensive set of evaluations with naturalistic news domain data in five different languages, for a range of multilingual LLMs of different sizes. We find that all LLMs tested, including GPT-5.2, ignore the conflict and confidently assert only one of the possible answers in the large majority of cases. Furthermore, there is a consistent bias across models and prompting languages in which languages are preferred, with a general bias against Russian and, for the longest context lengths, in favor of Chinese. The language preferences are consistent between models trained inside and outside of mainland China, though somewhat stronger in the former category. There is also a general tendency among models to prioritize information that matches the language used for prompting. We hope to make users and developers of multilingual LLMs aware of this category of biases, to spur further research on their causes and possible mitigation.


2604.01410 2026-06-03 cs.CL 版本更新

Assessing Pause Thresholds for empirical Translation Process Research

评估经验翻译过程研究中的暂停阈值

Devi Sri Bandaru, Michael Carl, Xinyue Ren

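AI summary: Compares five methods for computing pause thresholds and proposes a new way to compute production-unit breaks, in order to distinguish automated from reflective translation processes.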

Comments In Proceedings of "Translation in Transition 8", 2026

2603.26738 2026-06-03 cs.CV cs.AI cs.CL 版本更新

SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

SleepVLM：基于视觉语言模型的可解释且规则驱动的睡眠分期

Guifeng Deng, Pan Wang, Mengfan Niu, Jiquan Wang, Shuying Rao, Junyi Xie, Xi'ang Chen, Sha Zhao, Gang Pan, Wanjun Guo, Tao Li, Haiteng Jiang

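AI summary: Proposes SleepVLM, a rule-grounded vision-language model that stages sleep from multi-channel PSG waveform images and generates clinician-readable explanations consistent with AASM scoring criteria, improving interpretability while maintaining high accuracy.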

Comments Under review

详情

AI中文摘要

尽管自动睡眠分期已达到专家级准确率，但其临床采用因缺乏可审计的推理而受阻。我们提出了SleepVLM，一种基于规则驱动的视觉语言模型（VLM），它通过多通道多导睡眠图（PSG）波形图像进行睡眠分期，并基于美国睡眠医学学会（AASM）评分标准生成临床可读的理由。利用波形感知预训练和规则驱动的监督微调，SleepVLM在保留测试集（MASS-SS1）上实现了0.767的Cohen's kappa，在外部队列（ZUAMHCS）上实现了0.743，达到了最先进的性能。两位经过训练的睡眠技术专家的独立评估进一步验证了模型的推理质量，在两个数据集上，事实准确性、证据全面性和逻辑连贯性的平均得分在3.75-3.96之间（满分5分）。通过将竞争性性能与透明、基于规则的解释相结合，SleepVLM可以提高临床工作流程中自动睡眠分期的可信度和可审计性。为了促进可解释睡眠医学的进一步研究，我们发布了MASS-EX，一个新颖的专家注释数据集。

英文摘要

While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) that stages sleep from multi-channel polysomnography (PSG) waveform images and generates clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen's kappa of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance. Independent expert evaluation by two trained sleep technologists further validated the model's reasoning quality, with mean scores of 3.75-3.96 out of 5 across factual accuracy, evidence comprehensiveness, and logical coherence on both datasets. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.
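The agreement figures above are Cohen's kappa values, which correct raw agreement $p_o$ for the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it for two label sequences; the sleep-stage labels in the usage note are made-up examples, not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of epochs where both raters agree.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)
```

For instance, comparing `["W", "N1", "N2", "N2", "R", "W"]` with `["W", "N2", "N2", "N2", "R", "N1"]` gives a raw agreement of 4/6 but a kappa of 14/26, because some of that agreement is expected by chance.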


2603.29346 2026-06-03 cs.CL 版本更新

L-ReLF: A Framework for Lexical Dataset Creation

L-ReLF：词汇数据集创建框架

Anass Sedrati, Mounir Afifi, Reda Benkhadra

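AI summary: Proposes L-ReLF, a framework with a reproducible technical pipeline for creating high-quality structured lexical datasets for low-resource languages, addressing the lack of standardized terminology and supporting downstream NLP applications.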

Comments Accepted to the 2026 International Conference on Natural Language Processing (ICNLP). 6 pages, 1 figure

详情

DOI: 10.1109/ICNLP69856.2026.11528140
Journal ref: Proc. 2026 8th International Conference on Natural Language Processing (ICNLP), pp. 261-265, IEEE

AI中文摘要

本文介绍了L-ReLF（低资源词汇框架），一种新颖、可复现的方法论，用于为服务不足的语言创建高质量、结构化的词汇数据集。以摩洛哥达里贾语为例，标准化术语的缺乏对维基百科等平台的知识公平构成了关键障碍，常常迫使编辑者依赖不一致的临时方法在其语言中创造新词。我们的研究详细阐述了为克服这些挑战而开发的技术流程。我们系统地解决了处理低资源数据的困难，包括源识别、利用光学字符识别（OCR）尽管其偏向现代标准阿拉伯语，以及严格的后处理以纠正错误并标准化数据模型。最终的结构化数据集与维基数据词位完全兼容，作为一项重要的技术资源。L-ReLF方法论设计具有通用性，为其他语言社区提供了清晰的路径，以构建用于下游NLP应用（如机器翻译和形态分析）的基础词汇数据。

英文摘要

This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.


2603.26791 2026-06-03 cs.DL cs.AI cs.CL cs.CY 版本更新

Crystal: Characterizing Relative Impact of Scholarly Publications

Crystal: 表征学术出版物的相对影响力

Hannah Collison, Benjamin Van Durme, Daniel Khashabi

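Affiliations: Johns Hopkins University

AI summary: Proposes Crystal, which uses large language models to jointly rank cited papers and applies majority voting to remove position bias, distinguishing high-impact citations more accurately; on a human-annotated dataset, accuracy improves by 9.5% and F1 by 8.3%.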

详情

AI中文摘要

基于核心的高效图RAG层次结构

Jakir Hossain, Ahmet Erdem Sarıyüce

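Affiliations: University at Buffalo

AI summary: To address the non-reproducibility of Leiden clustering in GraphRAG, proposes replacing it with k-core decomposition to build a deterministic, density-aware hierarchy, together with lightweight heuristics that preserve connectivity while lowering LLM cost and improving answer comprehensiveness and diversity.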

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情

DOI: 10.1145/3770855.3818007

AI中文摘要

检索增强生成（RAG）通过引入外部知识增强了大型语言模型。然而，现有的基于向量的方法通常无法处理需要跨多个文档推理的全局理解任务。GraphRAG通过将文档组织成具有层次化社区的知识图谱来解决这一问题，这些社区可以被递归总结。当前的GraphRAG方法依赖Leiden聚类进行社区检测，但我们证明，在平均度数为常数且大多数节点度数较低的稀疏知识图谱上，模块度优化允许指数级数量的近似最优划分，使得基于Leiden的社区本质上不可复现。为了解决这个问题，我们提出用k-core分解替代Leiden，它在线性时间内产生确定性的、密度感知的层次结构。我们引入一组轻量级启发式方法，利用k-core层次结构构建大小有界、保持连接性的社区用于检索和总结，同时采用一种令牌预算感知的采样策略来降低LLM成本。我们在包括金融收益报告、新闻文章和播客在内的真实世界数据集上评估了我们的方法，使用三个LLM进行答案生成，并由五个独立的LLM裁判进行逐项比较评估。跨数据集和模型，我们的方法一致地提高了答案的全面性和多样性，同时减少了令牌使用量，证明了基于k-core的GraphRAG是一种有效且高效的全局理解框架。

英文摘要

Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external knowledge. However, existing vector-based methods often fail on global sensemaking tasks that require reasoning across many documents. GraphRAG addresses this by organizing documents into a knowledge graph with hierarchical communities that can be recursively summarized. Current GraphRAG approaches rely on Leiden clustering for community detection, but we prove that on sparse knowledge graphs, where average degree is constant and most nodes have low degree, modularity optimization admits exponentially many near-optimal partitions, making Leiden-based communities inherently non-reproducible. To address this, we propose replacing Leiden with k-core decomposition, which yields a deterministic, density-aware hierarchy in linear time. We introduce a set of lightweight heuristics that leverage the k-core hierarchy to construct size-bounded, connectivity-preserving communities for retrieval and summarization, along with a token-budget-aware sampling strategy that reduces LLM costs. We evaluate our methods on real-world datasets including financial earnings transcripts, news articles, and podcasts, using three LLMs for answer generation and five independent LLM judges for head-to-head evaluation. Across datasets and models, our approach consistently improves answer comprehensiveness and diversity while reducing token usage, demonstrating that k-core-based GraphRAG is an effective and efficient framework for global sensemaking.
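The k-core decomposition that replaces Leiden can be sketched with the classic peeling procedure: repeatedly remove a minimum-degree node and record the largest minimum degree seen so far as its core number. For brevity, the version below selects the minimum with a linear search, so it runs in quadratic time; the linear-time bound mentioned in the abstract needs bucketed degree lists, which are omitted here. The function name and graph encoding are illustrative assumptions, not the authors' code.

```python
def core_numbers(adj):
    """Core number of every node in an undirected graph without self-loops.

    adj: dict mapping each node to the set of its neighbours.
    Ties are broken by node id, so the output is fully deterministic.
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        v = min(remaining, key=lambda u: (degree[u], u))
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core
```

Nodes with core number at least k form the k-core, and the nested cores for increasing k give a hierarchy that does not depend on random seeds, unlike modularity-based clustering.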


2603.03612 2026-06-03 cs.LG cs.CC cs.CL cs.FL 版本更新

Why Are Linear RNNs More Parallelizable?

为什么线性RNN更易于并行化？

William Merrill, Hongjian Jiang, Yanhong Li, Anthony Lin, Ashish Sabharwal

发表机构 * GitHub

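AI summary: By tightly connecting RNN types to standard complexity classes, shows that linear RNNs (LRNNs) are easy to parallelize because they can be viewed as log-depth arithmetic circuits, whereas nonlinear RNNs face a parallelization barrier because they can solve L-complete problems.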

Comments To appear at ICML 2026

详情

AI中文摘要

社区越来越多地探索线性RNN（LRNN）作为语言模型，受其表达能力和并行化能力的驱动。虽然先前的工作确立了LRNN相对于Transformer的表达优势，但尚不清楚是什么使得LRNN——而非传统的非线性RNN——在实践中与Transformer一样易于并行化。我们通过提供RNN类型与标准复杂度类之间的紧密联系来回答这个问题。我们表明，LRNN可以看作是对数深度（有界扇入）算术电路，相对于Transformer所允许的对数深度布尔电路，这仅代表轻微深度开销。此外，我们表明非线性RNN可以解决$\mathsf{L}$-完全问题（甚至在多项式精度下解决$\mathsf{P}$-完全问题），揭示了将它们与Transformer一样高效并行化的根本障碍。我们的理论还识别了近期流行LRNN变体之间的细粒度表达差异：置换对角LRNN是$\mathsf{NC}^1$-完全的，而对角加低秩LRNN更具表达性（$\mathsf{PNC}^1$-完全）。我们通过将每种RNN类型与它可以模拟的相应自动机理论模型相关联，提供了进一步见解。总之，我们的结果揭示了非线性RNN与不同LRNN变体之间的基本权衡，为设计在表达性和并行性之间实现最佳平衡的LLM架构提供了基础。

英文摘要

The community is increasingly exploring linear RNNs (LRNNs) as language models, motivated by their expressive power and parallelizability. While prior work establishes the expressivity benefits of LRNNs over transformers, it is unclear what makes LRNNs -- but not traditional, nonlinear RNNs -- as easy to parallelize in practice as transformers. We answer this question by providing a tight connection between types of RNNs and standard complexity classes. We show that LRNNs can be viewed as log-depth (bounded fan-in) arithmetic circuits, which represents only a slight depth overhead relative to log-depth boolean circuits that transformers admit. Furthermore, we show that nonlinear RNNs can solve $\mathsf{L}$-complete problems (and even $\mathsf{P}$-complete ones, under polynomial precision), revealing a fundamental barrier to parallelizing them as efficiently as transformers. Our theory also identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete whereas diagonal-plus-low-rank LRNNs are more expressive ($\mathsf{PNC}^1$-complete). We provide further insight by associating each type of RNN with a corresponding automata-theoretic model that it can simulate. Together, our results reveal fundamental tradeoffs between nonlinear RNNs and different variants of LRNNs, providing a foundation for designing LLM architectures that achieve an optimal balance between expressivity and parallelism.
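The standard intuition for why a linear recurrence parallelizes is that each step $h_t = a_t h_{t-1} + b_t$ is an affine map, and composing affine maps is associative, so all prefixes can be computed with a logarithmic number of rounds. The scalar sketch below illustrates that intuition with a Hillis-Steele scan; it is not the circuit construction from the paper, and the function names are assumptions.

```python
def combine(first, second):
    """Compose two affine maps h -> a*h + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def sequential(a, b, h0=0):
    """Reference implementation: one step after another."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan_log_depth(a, b, h0=0):
    """Inclusive scan over affine maps in ceil(log2 n) rounds.

    Within a round every combine is independent, so a parallel machine
    could execute the whole round at once. Returns (states, rounds).
    """
    maps = list(zip(a, b))
    n, shift, rounds = len(maps), 1, 0
    while shift < n:
        maps = [
            maps[i] if i < shift else combine(maps[i - shift], maps[i])
            for i in range(n)
        ]
        shift *= 2
        rounds += 1
    return [a_m * h0 + b_m for a_m, b_m in maps], rounds
```

A nonlinear update such as $h_t = \tanh(a_t h_{t-1} + b_t)$ has no comparably compact closed form under composition, which is the informal reason the same trick does not carry over.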


2602.17063 2026-06-03 cs.LG cs.AI cs.CL cs.CV 版本更新

Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

符号锁定：随机初始化的权重符号持续存在并成为亚比特模型压缩的瓶颈

Akira Sakai, Yuma Ichikawa

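Affiliations: Fujitsu Limited; Tokai University; Riken Center for AIP

AI summary: Studies the sign-bit bottleneck in sub-bit model compression, uses sign lock-in theory to explain where the apparent randomness of weight signs comes from, and proposes a from-scratch low-rank sign-template training method to break the bottleneck.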

Comments Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

亚比特模型压缩的目标是将每个权重的存储降至1比特以下；当幅度被激进压缩时，符号位成为固定成本的瓶颈。在Transformer、CNN和MLP中，学习到的符号矩阵抵抗低秩近似，并且在频谱上与i.i.d. Rademacher基线无法区分。这种随机性导致了亚比特模型压缩的下界——1比特墙。尽管存在这种明显的随机性，大多数权重仍保留其初始化符号；翻转主要通过罕见的近零边界穿越发生，表明符号模式的随机性很大程度上继承自初始化。我们通过符号锁定理论形式化了这一行为，这是对SGD噪声下符号翻转的停时分析。在有界更新和零的小邻域内罕见重新进入的条件下，有效符号翻转的数量呈现几何尾部。基于这一机制，我们引入了一种从头开始的低秩符号模板训练方法，以防止这种1比特墙的出现。

英文摘要

Sub-bit model compression targets storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. This randomness gives rise to the lower bound of sub-bit model compression -- the one-bit wall. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood of zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a from-scratch low-rank sign-template training method that prevents the emergence of this one-bit wall.
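The claim that most weights keep their initialization sign can be made concrete with a sign-retention measurement. The toy simulation below perturbs Gaussian-initialized weights with small noise-only updates; it is a hedged illustration of why bounded updates rarely cross zero, not a reproduction of the paper's SGD analysis, and `simulate_retention` with its parameters is hypothetical.

```python
import random


def sign_retention(w_init, w_final):
    """Fraction of weights whose sign matches at initialization and at the end."""
    same = sum((x >= 0) == (y >= 0) for x, y in zip(w_init, w_final))
    return same / len(w_init)


def simulate_retention(n=2000, steps=200, step_scale=0.002, seed=0):
    """Toy model: each weight takes `steps` small Gaussian steps from its init.

    Only weights that start very close to zero are likely to flip sign,
    because the accumulated drift is tiny compared with the init scale.
    """
    rng = random.Random(seed)
    w_init = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w = list(w_init)
    for _ in range(steps):
        w = [x + step_scale * rng.gauss(0.0, 1.0) for x in w]
    return sign_retention(w_init, w)
```

On a real model the same `sign_retention` function could be applied to flattened weight tensors saved at initialization and after training.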


2602.14279 2026-06-03 cs.LG cs.AI cs.CL cs.SI 版本更新

Whom to Query for What: Adaptive Group Elicitation via Multi-Turn LLM Interactions

为谁查询什么：通过多轮LLM交互的自适应群体征询

Ruomeng Ding, Tianwei Gao, Thomas P. Zollo, Eitan Bachmat, Richard Zemel, Zhun Deng

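Affiliations: University of North Carolina at Chapel Hill; Columbia University; Ben-Gurion University of the Negev

AI summary: For reducing uncertainty about group-level properties under a limited budget, proposes an adaptive group elicitation framework that combines an LLM-based expected information gain objective with heterogeneous graph neural network propagation to jointly select questions and respondents, improving population-level response prediction on three real-world datasets.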

Comments Published as a conference paper at ICML 2026

详情

AI中文摘要

从调查和其他集体评估中征询信息以减少关于潜在群体属性的不确定性，需要在实际成本和缺失数据下分配有限的提问努力。尽管大型语言模型支持自然语言中的自适应多轮交互，但大多数现有征询方法优化了在固定受访者池中询问什么，并且在响应部分或不完整时不会调整受访者选择或利用群体结构。为解决这一差距，我们研究了自适应群体征询，这是一个多轮设置，其中智能体在明确的查询和参与预算下自适应地选择问题和受访者。我们提出了一个理论基础的框架，该框架结合了(i)基于LLM的期望信息增益目标，用于评分候选问题，以及(ii)异构图神经网络传播，该传播聚合观察到的响应和参与者属性，以插补缺失响应并指导每轮受访者选择。这种闭环过程查询一个小的、信息丰富的个体子集，同时通过结构化相似性推断群体级别的响应。在三个真实世界意见数据集上，我们的方法在预算受限的情况下持续提高了群体级别响应预测，包括在10%受访者预算下CES上相对提升超过12%。

英文摘要

Eliciting information to reduce uncertainty about latent group-level properties from surveys and other collective assessments requires allocating limited questioning effort under real costs and missing data. Although large language models enable adaptive, multi-turn interactions in natural language, most existing elicitation methods optimize what to ask with a fixed respondent pool, and do not adapt respondent selection or leverage population structure when responses are partial or incomplete. To address this gap, we study adaptive group elicitation, a multi-round setting where an agent adaptively selects both questions and respondents under explicit query and participation budgets. We propose a theoretically grounded framework that combines (i) an LLM-based expected information gain objective for scoring candidate questions with (ii) heterogeneous graph neural network propagation that aggregates observed responses and participant attributes to impute missing responses and guide per-round respondent selection. This closed-loop procedure queries a small, informative subset of individuals while inferring population-level responses via structured similarity. Across three real-world opinion datasets, our method consistently improves population-level response prediction under constrained budgets, including a >12% relative gain on CES at a 10% respondent budget.
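Expected information gain for a candidate question has a closed form when hypotheses and answers are discrete: it is the mutual information between the latent hypothesis and the answer. The sketch below computes that quantity from a prior and a likelihood table. How the paper obtains these probabilities from an LLM is not shown, and the function names are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, likelihood):
    """Mutual information I(H; Y) in bits for one candidate question.

    prior[h]         : P(hypothesis h)
    likelihood[h][y] : P(answer y | hypothesis h)
    """
    n_answers = len(likelihood[0])
    # Predictive distribution over answers, marginalizing the hypothesis.
    marginal = [
        sum(prior[h] * likelihood[h][y] for h in range(len(prior)))
        for y in range(n_answers)
    ]
    # Average answer entropy that remains once the hypothesis is known.
    conditional = sum(prior[h] * entropy(likelihood[h]) for h in range(len(prior)))
    return entropy(marginal) - conditional
```

A question whose answer perfectly separates two equally likely hypotheses scores 1 bit, while a question answered the same way under every hypothesis scores 0, so ranking candidates by this value favors the most discriminating question.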


2602.11908 2026-06-03 cs.AI cs.CL cs.LG 版本更新

评估和校准LLM在多个正确答案问题上的置信度

Yuhan Wang, Shiyu Ni, Zhikai Ding, Zihang Zhan, Yuanzi Li, Keping Bi

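Affiliations: State Key Laboratory of AI Safety; Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Tsinghua University; Renmin University of China

AI summary: Because questions with multiple correct answers cause existing confidence calibration methods to break down, proposes Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses and achieves the best calibration under mixed-answer settings.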

详情

AI中文摘要

置信度校准对于使大型语言模型（LLM）可靠至关重要，然而现有的无训练方法主要在单答案问答场景下研究。本文表明，这些方法在存在多个有效答案时会失效，因为同等正确响应之间的分歧导致置信度系统性低估。为了系统研究这一现象，我们引入了MACE基准，包含跨越六个领域的12,000个事实性问题，每个问题有不同数量的正确答案。在15种代表性校准方法和四个LLM系列（7B-72B）上的实验表明，虽然准确率随答案基数增加而提高，但估计置信度持续下降，导致对于混合答案数量的问题出现严重校准偏差。为解决此问题，我们提出语义置信度聚合（SCA），该方法聚合多个高概率采样响应的置信度。SCA在混合答案设置下实现了最先进的校准性能，同时在单答案问题上保持强校准能力。

英文摘要

Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these methods break down in the presence of multiple valid answers, where disagreement among equally correct responses leads to systematic underestimation of confidence. To enable a systematic study of this phenomenon, we introduce MACE, a benchmark of 12,000 factual questions spanning six domains with varying numbers of correct answers. Experiments across 15 representative calibration methods and four LLM families (7B-72B) reveal that while accuracy increases with answer cardinality, estimated confidence consistently decreases, causing severe miscalibration for questions with mixed answer counts. To address this issue, we propose Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses. SCA achieves state-of-the-art calibration performance under mixed-answer settings while preserving strong calibration on single-answer questions.
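The underestimation effect, and the general idea of aggregating over several likely responses, can be illustrated with sampled answers. The sketch below is an assumption-laden stand-in: string normalization replaces real semantic equivalence, sample frequency replaces model probability, and `aggregated_confidence` with its `top_k` parameter is a hypothetical name rather than the SCA procedure from the paper.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Cheap proxy for semantic equivalence: lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def aggregated_confidence(samples, top_k=3):
    """Share of samples covered by the `top_k` most frequent distinct answers.

    With top_k=1 this is the usual frequency of the majority answer, which
    shrinks when probability mass is split across several correct answers.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    counts = Counter(normalize(s) for s in samples)
    covered = sum(count for _, count in counts.most_common(top_k))
    return covered / len(samples)
```

For a question such as "name a city in France", six samples spread over four valid cities give a majority-answer confidence of only 2/6, whereas aggregating the top three answers gives 5/6.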


2602.07639 2026-06-03 cs.CL 版本更新

Letting Tutor Personas Speak Up for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization

让导师角色为LLMs发声：通过偏好优化从对话中学习引导向量

Jaewook Lee, Alexander Scarlatos, Simon Woodhead, Andrew Lan

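Affiliations: University of Massachusetts Amherst; Eedi

AI summary: Proposes training steering vectors with preference optimization to extract tutor-persona information from human tutor-student dialogues and use it to control LLM behavior, enabling diverse teaching styles.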

Comments Accepted to ACL 2026 BEA Workshop

详情

AI中文摘要

随着大语言模型（LLMs）作为一类强大的生成式人工智能（AI）的出现，它们在辅导中的应用日益突出。先前基于LLM的辅导工作通常学习单一的辅导策略，未能捕捉辅导风格的多样性。在现实世界的导师-学生互动中，教学意图通过适应性教学策略实现，导师根据学习者的需求调整支架式教学、指导性、反馈和情感支持的级别。这些差异都会影响对话动态和学生参与度。在本文中，我们探讨如何利用嵌入在人类导师-学生对话中的导师角色来引导LLM行为，而不依赖于显式提示指令。我们使用偏好优化训练一个引导向量：一个激活空间方向，用于引导模型响应朝向特定的导师角色。我们发现，这个引导向量捕捉了跨对话上下文的导师特定变化，提高了与真实导师话语的语义对齐，并增加了基于偏好的评估，同时很大程度上保留了词汇相似性。对学习到的缩放系数的进一步分析揭示了跨导师的可解释结构，对应于辅导行为的一致差异。这些结果表明，激活引导提供了一种有效且可解释的方式，利用直接从人类对话数据中获得的信号来控制LLM中导师特定的变化。

英文摘要

With the emergence of large language models (LLMs) as a powerful class of generative artificial intelligence (AI), their use in tutoring has become increasingly prominent. Prior works on LLM-based tutoring typically learn a single tutor policy and do not capture the diversity of tutoring styles. In real-world tutor-student interactions, pedagogical intent is realized through adaptive instructional strategies, with tutors varying the level of scaffolding, instructional directiveness, feedback, and affective support in response to learners' needs. These differences can all impact dialogue dynamics and student engagement. In this paper, we explore how tutor personas embedded in human tutor-student dialogues can be used to guide LLM behavior without relying on explicitly prompted instructions. We train a steering vector using preference optimization: an activation-space direction that guides model responses toward specific tutor personas. We find that this steering vector captures tutor-specific variation across dialogue contexts, improving semantic alignment with ground-truth tutor utterances and increasing preference-based evaluations, while largely preserving lexical similarity. Analysis of the learned scaling coefficients further reveals interpretable structure across tutors, corresponding to consistent differences in tutoring behavior. These results demonstrate that activation steering offers an effective and interpretable way for controlling tutor-specific variation in LLMs using signals derived directly from human dialogue data.


2511.16275 2026-06-03 cs.CL cs.AI 版本更新

SeSE: Black-Box Uncertainty Quantification for Large Language Models Based on Structural Information Theory

SeSE: 基于结构信息理论的大语言模型黑盒不确定性量化

Xingtao Zhao, Hao Peng, Dingli Su, Xianghua Zeng, Chunyang Liu, Jinzhi Liao, Philip S. Yu

Affiliations: School of Cyber Science and Technology, Beihang University; School of Computer Science and Engineering, Beihang University; Didi Chuxing; Laboratory for Big Data and Decision, National University of Defense Technology; Department of Computer Science, University of Illinois Chicago

AI summary: Proposes SeSE, a framework for black-box uncertainty quantification in large language models that builds an optimal hierarchical abstraction of the semantic space and computes its structural entropy; it theoretically generalizes semantic entropy and outperforms existing methods on long-form generation.

Comments Accepted by UAI 2026

详情

AI中文摘要

规划、验证与填充：扩散语言模型的结构化并行解码方法

Miao Li, Hanyang Jiang, Sikai Cheng, Hengyu Fu, Yuhang Cai, Baihe Huang, Tinghan Ye, Xuanzhou Chen, Pascal Van Hentenryck

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； University of California, Berkeley（加州大学伯克利分校）； University of Michigan（密歇根大学）

AI总结提出Plan-Verify-Fill (PVF)方法，通过定量验证进行分层骨架规划，并采用验证协议实现结构化停止，在保持准确性的同时将函数评估次数减少高达65%。

详情

AI中文摘要

扩散语言模型（DLM）为文本生成提供了一种有前景的非顺序范式，不同于标准的自回归（AR）方法。然而，当前的解码策略通常采取被动姿态，未能充分利用全局双向上下文来指导全局轨迹。为了解决这个问题，我们提出了Plan-Verify-Fill（PVF），一种无需训练的范式，通过定量验证来锚定规划。PVF通过优先考虑高杠杆语义锚点主动构建分层骨架，并采用验证协议来实现实用的结构化停止，在进一步思考收益递减时停止。在LLaDA-8B-Instruct和Dream-7B-Instruct上的广泛评估表明，与基于置信度的并行解码相比，PVF在基准数据集上将函数评估次数（NFE）减少了高达65%，在不牺牲准确性的情况下实现了卓越的效率。

英文摘要

Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR) approaches. However, current decoding strategies often adopt a reactive stance, underutilizing the global bidirectional context to dictate global trajectories. To address this, we propose Plan-Verify-Fill (PVF), a training-free paradigm that grounds planning via quantitative validation. PVF actively constructs a hierarchical skeleton by prioritizing high-leverage semantic anchors and employs a verification protocol to operationalize pragmatic structural stopping where further deliberation yields diminishing returns. Extensive evaluations on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate that PVF reduces the Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets, unlocking superior efficiency without compromising accuracy.

2601.14569 2026-06-03 cs.CL cs.LG 版本更新

Social Caption: Evaluating Social Understanding in Multimodal Models

Social Caption: 评估多模态模型的社会理解能力

Leena Mathur, Bhaavanaa Thumu, Youssouf Kebe, Louis-Philippe Morency

发表机构 * School of Computer Science, Carnegie Mellon University（卡内基梅隆大学计算机科学学院）

AI总结提出基于交互理论的SOCIAL CAPTION框架，从社会推理、整体社会分析和定向社会分析三个维度评估多模态大语言模型的社会理解能力，并分析影响性能的因素。

Comments 25 pages, 10 figures

2505.16014 2026-06-03 cs.CL 版本更新

Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains

无排序RAG：用选择替代重排序以应用于敏感领域

Yash Saxena, Ankur Padia, Mandar S Chaudhary, Kalpa Gunaratna, Srinivasan Parthasarathy, Manas Gaur

发表机构 * University of Washington（华盛顿大学）

AI总结提出METEORA框架，通过DPO微调LLM生成检索理由、统计肘部检测自适应截断和验证器过滤，在敏感领域实现可解释、高效且鲁棒的证据选择，无需重排序。

Comments ICML 2026

详情

AI中文摘要

部署在敏感领域的检索增强生成（RAG）系统必须提供可解释的证据选择，并针对数据投毒提供稳健的防护，然而当前方法依赖于不透明的基于相似性的检索，并采用任意的top-k截断，这些方法对其选择不提供任何解释，且容易受到对抗性操纵。METEORA通过三个组件用理由驱动的选择替代重排序：一个DPO微调的LLM，生成明确的检索理由；一个证据块选择引擎（ECSE），利用这些理由结合统计肘部检测进行自适应截断确定；以及一个验证器LLM，使用相同的理由过滤投毒证据。在六个数据集上，METEORA实现了召回率提高13.41%，精确率提高21.05%（无扩展），证据量减少80%，答案准确率提高33.34%，对抗鲁棒性提高4.4倍。人工评估证实了真正的可解释性（置信度3.64/5；86%的真实标签一致性），表明可解释性、效率和鲁棒性是协同而非竞争的目标。代码可在GitHub仓库https://github.com/YashSaxena21/METEORA中获取。

英文摘要

Retrieval-Augmented Generation (RAG) systems deployed in sensitive domains must provide interpretable evidence selection and robust safeguards against data poisoning, yet current approaches rely on opaque similarity-based retrieval with arbitrary top-k cutoffs that offer no explanation for their selections and remain vulnerable to adversarial manipulation. METEORA replaces re-ranking with rationale-driven selection via three components: a DPO-tuned LLM that generates explicit retrieval rationales, an Evidence Chunk Selection Engine (ECSE) that uses those rationales with statistical elbow detection for adaptive cutoff determination, and a Verifier LLM that filters poisoned evidence using the same rationales. Across six datasets, METEORA achieves 13.41% higher recall, 21.05% higher precision (without expansion), an 80% reduction in evidence volume, a 33.34% improvement in answer accuracy, and a 4.4x improvement in adversarial robustness. Human evaluation confirms genuine interpretability (3.64/5 confidence; 86% ground-truth agreement), demonstrating that interpretability, efficiency, and robustness are synergistic rather than competing objectives. The code is available in the GitHub repository https://github.com/YashSaxena21/METEORA
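
As a rough illustration of replacing a fixed top-k with an adaptive cutoff, the sketch below cuts a ranked list at the largest drop between consecutive scores. This is a simple stand-in chosen for clarity, not the statistical elbow detection used in METEORA, whose details the abstract does not give; `elbow_cutoff` and the scores are hypothetical.

```python
def elbow_cutoff(scores):
    """Return how many top-scoring items to keep, cutting at the largest score drop."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return len(ranked)
    # Drop between each item and the next one in the ranking.
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    # Index of the largest drop; everything up to and including it is kept.
    largest = max(range(len(drops)), key=drops.__getitem__)
    return largest + 1


# Hypothetical relevance scores for six evidence chunks.
scores = [0.91, 0.88, 0.86, 0.41, 0.39, 0.12]
keep = elbow_cutoff(scores)
selected = sorted(scores, reverse=True)[:keep]
```

Here the cutoff adapts to the score distribution: three chunks are kept because the sharpest drop follows the third one, whereas a fixed top-k would ignore that gap.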

2601.11429 2026-06-03 cs.CL cs.AI 版本更新

Relational Linearity is a Predictor of Hallucinations

关系线性是幻觉的预测因子

Yuetian Lu, Yihong Liu, Sebastian Gerstner, Lea Hirlimann, Jonas Rohweder, Hinrich Schütze

AI总结通过合成未知实体基准测试，发现语言模型在回答线性关系问题时更容易产生幻觉，且关系线性度与幻觉率强相关。

Comments 15 pages, 6 figures, 14 tables

详情

AI中文摘要

幻觉是语言模型（LMs）的一个核心失败模式。我们关注对诸如“格伦·古尔德演奏哪种乐器？”这类问题的幻觉，但针对设计为模型未知的合成实体提问。我们发现，像Gemma-7B-IT这样的LM经常产生幻觉，即它们难以识别幻觉事实不属于其知识。基于线性关系嵌入的思想，我们提出以下假设：（i）由于用于表示它们的抽象方案，LM可以轻松地为线性关系的非存在主体生成合理的对象，这可能导致幻觉。（ii）对于非线性关系，这种生成对象的机制不可用，因此更容易避免幻觉。为了验证这一假设，我们创建了SyntHal，一个针对15种关系的合成未知实体基准。我们发现，在四个指令调优模型中，关系线性度是模型为未知主体生成对象（而非拒绝回答）的强预测因子，相关系数$r \in [.58, .84]$。

英文摘要

Hallucination is a central failure mode of language models (LMs). We focus on hallucinations in response to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities designed to be unknown to the model. We find that LMs like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. Based on the idea of linear relational embeddings, we put forward the following hypothesis. (i) Due to the abstract scheme that is used to represent them, LMs can easily produce plausible objects for non-existing subjects of linear relations, which can lead to hallucinations. (ii) For a nonlinear relation, this mechanism for producing an object is not available and so a hallucination is easier to avoid. To test this hypothesis, we create SyntHal, a synthetic unknown-entity benchmark for 15 relations. We find that across four instruction-tuned models, relational linearity is a strong predictor of models hallucinating an object for an unknown subject vs refusing to give an answer, with correlations $r \in [.58, .84]$.

2512.10999 2026-06-03 cs.CL 版本更新

KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering

KBQA-R1：强化大语言模型用于知识库问答

Xin Sun, Zhongqi Chen, Xing Zheng, Qiang Liu, Shu Wu, Bowen Song, Zilei Wang, Weiqiang Wang, Liang Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出KBQA-R1框架，通过强化学习（GRPO）将知识库问答建模为多轮决策过程，并引入参考拒绝采样（RRS）解决冷启动问题，在多个基准上取得最优性能。

Comments ICML 2026

详情

AI中文摘要

识别AI语言中的量子结构：人类与人工智能认知进化趋同的证据

Diederik Aerts, Jonito Aerts Arguëlles, Lester Beltran, Suzette Geriente, Roberto Leporini, Massimiliano Sassoli de Bianchi, Sandro Sozzo

发表机构 * Center Leo Apostel for Interdisciplinary Studies, Vrije Universiteit Brussel (VUB)（利奥·阿波斯泰尔跨学科研究中心，布鲁塞尔自由大学）； Department of Economics, University of Bergamo（贝加莫大学经济系）； Department of Humanities and Cultural Heritage (DIUM) and Centre CQSCS, University of Udine（乌迪内大学人文与文化遗产系及CQSCS中心）

AI总结通过对大型语言模型进行认知测试，发现其概念组合中存在贝尔不等式显著违背和玻色-爱因斯坦统计，表明人类与人工智能在概念-语言领域均涌现非经典量子结构，支持认知进化趋同假说。

详情

DOI: 10.3390/e28060622
Journal ref: Entropy 28, 622, 2026

AI中文摘要

我们展示了使用特定大型语言模型（LLMs）作为测试对象进行的概念组合认知测试结果。在第一个测试中，使用ChatGPT和Gemini，我们表明贝尔不等式被显著违背，这表明存在一个概率不满足Kolmogorov公理的“非经典概率模型”。在第二个测试中，同样使用ChatGPT和Gemini，我们在大型文本中的单词分布中识别出“玻色-爱因斯坦统计”的存在，而非直觉预期的“麦克斯韦-玻尔兹曼统计”。有趣的是，这些发现与之前在人类参与者认知测试和大规模语料库信息检索测试中获得的结果相呼应。综合来看，它们指向“概念-语言领域中非经典量子类结构的系统性涌现”，无论认知主体是人类还是人工智能。尽管LLMs因历史原因被归类为神经网络，但我们认为，在神经网络之上构建的向量空间的分布式语义结构中，发生了一种更本质的知识组织形式。正是这种承载意义的结构，促成了通过生物进化缓慢建立的人类认知与语言，与通过自我学习和训练快速涌现的LLM认知与语言之间的进化趋同现象。我们分析了包含支持上述假设之证据的各个方面和实例。我们还提出了一个统一框架，用以解释我们所识别出的意义的普遍量子组织。

英文摘要

We present the results of cognitive tests on conceptual combinations, performed using specific Large Language Models (LLMs) as test subjects. In the first test, performed with ChatGPT and Gemini, we show that Bell's inequalities are significantly violated, which indicates the presence of a 'non-classical probability model' with probabilities that do not satisfy Kolmogorov's axioms. In the second test, also performed using ChatGPT and Gemini, we identify the presence of 'Bose-Einstein statistics', rather than the intuitively expected 'Maxwell-Boltzmann statistics', in the distribution of the words contained in large-size texts. Interestingly, these findings mirror the results previously obtained in both cognitive tests with human participants and information retrieval tests on large corpora. Taken together, they point to the 'systematic emergence of non-classical quantum-like structures in conceptual-linguistic domains', regardless of whether the cognitive agent is human or artificial. Although LLMs are classified as neural networks for historical reasons, we believe that a more essential form of knowledge organization takes place in the distributive semantic structure of vector spaces built on top of the neural network. It is this meaning-bearing structure that lends itself to a phenomenon of evolutionary convergence between human cognition and language, slowly established through biological evolution, and LLM cognition and language, emerging much more rapidly as a result of self-learning and training. We analyze various aspects and examples that contain evidence supporting the above hypothesis. We also advance a unifying framework that explains the pervasive quantum organization of meaning that we identify.
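
For readers unfamiliar with what a Bell-inequality violation looks like numerically, the sketch below evaluates the CHSH form, where any classical (Kolmogorovian) model satisfies |S| ≤ 2. The abstract does not say which form of the inequality the authors test, so CHSH is an assumption here, and the joint probabilities are toy values rather than measured LLM responses.

```python
def expectation(joint):
    """E = P(++) + P(--) - P(+-) - P(-+) for a joint distribution over ±1 outcomes."""
    return joint["++"] + joint["--"] - joint["+-"] - joint["-+"]


def chsh(e_ab, e_ab2, e_a2b, e_a2b2):
    """CHSH combination S = E(A,B) - E(A,B') + E(A',B) + E(A',B')."""
    return e_ab - e_ab2 + e_a2b + e_a2b2


# Toy joint distributions (hypothetical, chosen to give an extreme violation).
correlated = {"++": 0.5, "--": 0.5, "+-": 0.0, "-+": 0.0}
anticorrelated = {"++": 0.0, "--": 0.0, "+-": 0.5, "-+": 0.5}

s_value = chsh(
    expectation(correlated),      # E(A, B)
    expectation(anticorrelated),  # E(A, B')
    expectation(correlated),      # E(A', B)
    expectation(correlated),      # E(A', B')
)
violates_classical_bound = abs(s_value) > 2.0
```

In a cognitive test of this kind, the four expectation values would be estimated from how often each pair of exemplars is chosen for a concept combination; the toy numbers above only show how the bound is checked.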

2503.07265 2026-06-03 cs.CV cs.AI cs.CL 版本更新

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

WISE: 一种基于世界知识的文本到图像生成语义评估方法

Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Fanqing Meng, Kunpeng Ning, Bin Zhu, Li Yuan

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对现有文本到图像生成模型缺乏复杂语义理解和世界知识整合评估的问题，提出WISE基准，包含25个子领域的1000个精心设计的提示，并引入WiScore指标评估知识-图像对齐，实验表明当前模型在整合世界知识方面存在显著局限。

Comments Accepted to ICML 2026. We have also released an updated version of the benchmark, WISE_Verified. Please refer to https://github.com/PKU-YuanGroup/WISE for the latest version

详情

AI中文摘要

文本到图像（T2I）模型能够生成高质量的艺术创作和视觉内容。然而，现有研究和评估标准主要关注图像真实性和浅层的文本-图像对齐，缺乏对文本到图像生成中复杂语义理解和世界知识整合的全面评估。为解决这一挑战，我们提出了WISE，这是首个专门用于World Knowledge-Informed Semantic Evaluation（世界知识引导的语义评估）的基准。WISE超越了简单的词-像素映射，通过1000个精心设计的提示，涵盖文化常识、时空推理和自然科学等25个子领域，对模型进行挑战。为了克服传统CLIP指标的局限性，我们引入了WiScore，一种用于评估知识-图像对齐的新型定量指标。通过对20个模型（10个专用T2I模型和10个统一多模态模型）在涵盖25个子领域的1000个结构化提示上进行全面测试，我们的发现揭示了它们在图像生成过程中有效整合和应用世界知识的能力存在显著局限，为下一代T2I模型增强知识整合与应用指明了关键路径。代码和数据可在https://github.com/PKU-YuanGroup/WISE获取。

英文摘要

Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.

2508.21448 2026-06-03 cs.CL 版本更新

When Models Refuse: Political Steerability and Feature Richness as Measures of Ideological Depth

当模型拒绝时：政治可操控性与特征丰富度作为意识形态深度的度量

Shariar Kabir

发表机构 * Bangladesh University of Engineering and Technology（孟加拉工程与技术大学）

AI总结本文提出意识形态深度概念，通过可操控性和稀疏自编码器测量的特征丰富度，研究大语言模型拒绝遵循良性指令是否源于能力缺陷而非安全规则。

详情

AI中文摘要

大型语言模型有时会拒绝遵循良性指令，例如拒绝论证某种政治立场或采用指定角色，这种拒绝通常被视为安全护栏在起作用。我们探究这些拒绝是否可能表明**能力缺陷**：模型缺乏从指令角度进行推理所需的内部表示。为此，我们引入**意识形态深度**，该属性包含两个组成部分：(i) 模型遵循政治指令而不*失败*的能力（可操控性），以及 (ii) 其内部政治表示的**特征丰富度**，通过稀疏自编码器测量。使用两个广泛使用的开源权重LLM作为候选，我们比较基于提示和激活操控的干预措施，并用公开可用的SAE探测政治特征。我们发现巨大且系统性的差异：在两个意识形态方向上更具可操控性的模型激活了约**7.3倍**更多的独特政治特征，而另一个模型则以增加拒绝作为回应。从前者模型中因果性地消融一小部分目标政治特征，重现了相同的特征贫乏行为并导致拒绝增加。综合这些结果表明，对良性提示的拒绝可能源于**能力缺陷**而非固定的安全规则，并且意识形态深度是LLM的一个可测量属性，有助于预测模型何时会拒绝。

英文摘要

Large language models (LLMs) sometimes refuse to follow benign instructions, such as declining to argue a political position or adopt a stated persona, and such refusals are commonly read as safety guardrails at work. We ask whether they can instead signal a **capability deficit**: a shortage of the internal representations a model needs to reason from the instructed perspective. To investigate, we introduce **ideological depth**, a property with two components: (i) a model's ability to follow political instructions without *failure* (steerability), and (ii) the **feature richness** of its internal political representations, measured with sparse autoencoders (SAEs). Using two widely used open-weight LLMs as candidates, we compare interventions based on prompts and activation steering, and probe political features with publicly available SAEs. We find large, systematic differences: a model that is more steerable in both ideological directions activates **~7.3x** more distinct political features, while the other model instead responds with increased refusals. Causally ablating a small, targeted set of political features from the former model reproduces the same feature-poor behavior and drives up refusals. Together, these results indicate that refusals on benign prompts can arise from **capability deficits** rather than fixed safety rules, and that ideological depth is a measurable property of LLMs that helps predict when a model will refuse.

2511.08247 2026-06-03 cs.CL cs.CY 版本更新

ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech

ParliaBench: 面向大语言模型生成的议会演讲的评估与基准框架

Marios Koniaris, Argyro Tsipi, Panayiotis Tsanakas

发表机构 * University of Cambridge（剑桥大学）

AI总结提出ParliaBench基准框架，通过构建英国议会数据集、结合计算指标与LLM评判的评估方法以及两种新型嵌入指标（政治光谱对齐和政党对齐），系统评估LLM生成议会演讲的语言质量、语义连贯性和政治真实性，实验表明微调显著提升多数指标且新指标对政治维度具有强区分力。

详情

DOI: 10.63317/447dqkef7ks7
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pp. 4797-4818, European Language Resources Association (ELRA), Palma, Mallorca, Spain, May 2026

AI中文摘要

议会演讲生成对大型语言模型提出了超越标准文本生成任务的特定挑战。与通用文本生成不同，议会演讲不仅需要语言质量，还需要政治真实性和意识形态一致性。当前语言模型缺乏针对议会上下文的专门训练，现有评估方法侧重于标准NLP指标而非政治真实性。为此，我们提出了ParliaBench，一个用于议会演讲生成的基准。我们构建了一个来自英国议会的演讲数据集，以实现系统性的模型训练。我们引入了一个评估框架，将计算指标与LLM-as-a-judge评估相结合，用于衡量生成质量在三个维度上的表现：语言质量、语义连贯性和政治真实性。我们提出了两种新颖的基于嵌入的指标——政治光谱对齐和政党对齐——以量化意识形态定位。我们微调了五个大型语言模型（LLM），生成了28k篇演讲，并使用我们的框架对其进行了评估，比较了基线和微调模型。结果表明，微调在大多数指标上产生了统计显著的改进，并且我们的新颖指标对政治维度表现出强大的区分能力。

英文摘要

Parliamentary speech generation presents specific challenges for large language models beyond standard text generation tasks. Unlike general text generation, parliamentary speeches require not only linguistic quality but also political authenticity and ideological consistency. Current language models lack specialized training for parliamentary contexts, and existing evaluation methods focus on standard NLP metrics rather than political authenticity. To address this, we present ParliaBench, a benchmark for parliamentary speech generation. We constructed a dataset of speeches from UK Parliament to enable systematic model training. We introduce an evaluation framework combining computational metrics with LLM-as-a-judge assessments for measuring generation quality across three dimensions: linguistic quality, semantic coherence, and political authenticity. We propose two novel embedding-based metrics, Political Spectrum Alignment and Party Alignment, to quantify ideological positioning. We fine-tuned five large language models (LLMs), generated 28k speeches, and evaluated them using our framework, comparing baseline and fine-tuned models. Results show that fine-tuning produces statistically significant improvements across the majority of metrics and our novel metrics demonstrate strong discriminative power for political dimensions.

2511.02304 2026-06-03 cs.MA cs.AI cs.CL cs.FL cs.LG 版本更新

Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning

自动机条件化协作多智能体强化学习

Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； Stanford University（斯坦福大学）

AI总结提出自动机条件化协作多智能体强化学习框架，通过自动机分解团队目标为子任务，学习任务条件化的分散策略，实现最优任务分配和多步协调。

详情

AI中文摘要

我们研究在集中训练、分散执行下，针对协作性时间目标的多任务、多智能体策略学习。在此设置中，使用自动机表示分配给智能体的任务，能够将团队级目标分解为更简单、更小的子任务。然而，现有方法样本效率低下，且局限于单任务情况，需要为每个新任务重新训练策略。在这项工作中，我们提出了自动机条件化协作多智能体强化学习（ACC-MARL），一个学习任务条件化分散团队策略的框架。我们识别了ACC-MARL可行性的挑战，提出了解决方案，并证明了我们的方法是最优的。我们进一步展示了学习到的价值函数可用于在测试时最优地分配任务。实验表明，智能体之间涌现出任务感知的多步协调，例如按下按钮开门、扶住门以及短路任务。

英文摘要

We study learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks assigned to agents enables breaking down a team-level objective into simpler, smaller sub-tasks. However, existing approaches remain sample-inefficient and are limited to the single-task case, requiring retraining policies for each new task. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify challenges to the feasibility of ACC-MARL, propose solutions, and prove that our approach is optimal. We further show that learned value functions can be used to assign tasks optimally at test time. Experiments demonstrate emergent task-aware, multi-step coordination among agents, such as pressing a button to unlock a door, holding the door, and short-circuiting tasks.
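
To show how an automaton can encode a temporal sub-task such as "press the button, then go through the door", here is a minimal sketch of a deterministic finite automaton that an agent's policy could be conditioned on. The class `TaskAutomaton`, the state names, and the event symbols are hypothetical; the paper's task representation and policy architecture are not reproduced.

```python
class TaskAutomaton:
    """A tiny deterministic finite automaton that tracks progress on a temporal task."""

    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # maps (state, symbol) -> next state
        self.state = start
        self.accepting = set(accepting)

    def step(self, symbol):
        # Stay in the current state when no transition matches the symbol.
        self.state = self.transitions.get((self.state, symbol), self.state)
        return self.state

    def done(self):
        return self.state in self.accepting


# Hypothetical sub-task: the door only counts after the button has been pressed.
transitions = {
    ("start", "button"): "unlocked",
    ("unlocked", "door"): "finished",
}
task = TaskAutomaton(transitions, start="start", accepting=["finished"])

# Events observed by one agent during an episode (illustrative).
trace = [task.step(event) for event in ["door", "button", "door"]]
completed = task.done()
```

The current automaton state summarizes what remains to be done, so a task-conditioned policy can take it as an extra input instead of being retrained for every new task.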

2510.16282 2026-06-03 cs.CL 版本更新

Instant Personalized Large Language Model Adaptation via Hypernetwork

通过超网络实现即时个性化大型语言模型自适应

Zhaoxuan Tan, Zixuan Zhang, Haoyang Wen, Zheng Li, Rongzhi Zhang, Pei Chen, Fengran Mo, Zheyuan Liu, Qingkai Zeng, Qingyu Yin, Meng Jiang

发表机构 * University of Notre Dame（圣母大学）； Amazon.com Inc（亚马逊公司）； Université de Montréal（蒙特利尔大学）

AI总结提出Profile-to-PEFT框架，使用超网络将用户编码直接映射到适配器参数，实现无需用户训练的即时个性化，在降低计算成本的同时优于现有方法。

Comments accepted to ACL 2026

详情

AI中文摘要

个性化大型语言模型（LLM）利用用户档案或历史记录来定制符合个人偏好的内容。然而，现有的参数高效微调（PEFT）方法，例如“每用户一个PEFT”（OPPU）范式，需要为每个用户训练单独的适配器，这使得它们在计算上昂贵且不适用于实时更新。我们引入了Profile-to-PEFT，一个可扩展的框架，它采用端到端训练的超网络，将用户编码档案直接映射到一组完整的适配器参数（例如LoRA），从而消除了部署时的每用户训练。这种设计实现了即时自适应、对未见用户的泛化以及保护隐私的本地部署。实验结果表明，我们的方法在部署时使用显著更少的计算资源，同时优于基于提示的个性化和OPPU。该框架对分布外用户表现出强大的泛化能力，并在不同的用户活动水平和不同的嵌入骨干下保持鲁棒性。所提出的Profile-to-PEFT框架实现了高效、可扩展且自适应的LLM个性化，适用于大规模应用。

英文摘要

Personalized large language models (LLMs) tailor content to individual preferences using user profiles or histories. However, existing parameter-efficient fine-tuning (PEFT) methods, such as the "One-PEFT-Per-User" (OPPU) paradigm, require training a separate adapter for each user, making them computationally expensive and impractical for real-time updates. We introduce Profile-to-PEFT, a scalable framework that employs a hypernetwork, trained end-to-end, to map a user's encoded profile directly to a full set of adapter parameters (e.g., LoRA), eliminating per-user training at deployment. This design enables instant adaptation, generalization to unseen users, and privacy-preserving local deployment. Experimental results demonstrate that our method outperforms both prompt-based personalization and OPPU while using substantially fewer computational resources at deployment. The framework exhibits strong generalization to out-of-distribution users and maintains robustness across varying user activity levels and different embedding backbones. The proposed Profile-to-PEFT framework enables efficient, scalable, and adaptive LLM personalization suitable for large-scale applications.
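
The mapping from a profile embedding to adapter parameters can be pictured with a deliberately small sketch. It assumes the hypernetwork is a single linear layer whose output is split into the two LoRA factors; that architecture, the function `generate_lora`, and every number below are assumptions for illustration, not the Profile-to-PEFT design.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def reshape(flat, rows, cols):
    """Reshape a flat list into a rows x cols matrix."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def generate_lora(profile, hyper_weights, d_out, d_in, rank):
    """Map a profile embedding to LoRA factors A (rank x d_in) and B (d_out x rank)."""
    flat = matvec(hyper_weights, profile)
    split = d_out * rank
    lora_b = reshape(flat[:split], d_out, rank)
    lora_a = reshape(flat[split:], rank, d_in)
    return lora_a, lora_b


# Hypothetical 2-dimensional profile and a 4 x 2 hypernetwork weight matrix
# (4 outputs = d_out * rank + rank * d_in with d_out = d_in = 2, rank = 1).
profile = [2.0, 3.0]
hyper_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]

lora_a, lora_b = generate_lora(profile, hyper_weights, d_out=2, d_in=2, rank=1)
delta_w = matmul(lora_b, lora_a)  # low-rank update added to a frozen weight matrix
```

Because the adapter is produced by one forward pass of the hypernetwork, a new user needs only a profile embedding, with no gradient steps at deployment.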

2510.09711 2026-06-03 cs.CL cs.AI 版本更新

ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models

ReaLM：残差量化桥接知识图谱嵌入与大型语言模型

Wenbin Guo, Xin Wang, Jiaoyan Chen, Lingbing Guo, Zhao Li, Zirui Chen

发表机构 * Tianjin University（天津大学）； The University of Manchester（曼彻斯特大学）

AI总结提出ReaLM框架，通过残差向量量化将知识图谱嵌入离散化为可学习标记，融入大型语言模型词汇表，结合本体约束实现结构化知识与语言模型的语义对齐，在知识图谱补全任务上取得最优性能。

详情

AI中文摘要

大型语言模型（LLM）最近成为知识图谱补全（KGC）的强大范式，提供了超越传统基于嵌入方法的强大推理和泛化能力。然而，现有的基于LLM的方法通常难以充分利用结构化语义表示，因为预训练KG模型的连续嵌入空间与LLM的离散标记空间根本不对齐。这种差异阻碍了有效的语义转移并限制了它们的性能。为了解决这一挑战，我们提出了ReaLM，一种新颖且有效的框架，通过残差向量量化的机制弥合了KG嵌入和LLM标记化之间的差距。ReaLM将预训练的KG嵌入离散化为紧凑的代码序列，并将它们作为可学习标记集成到LLM词汇表中，从而实现符号知识和上下文知识的无缝融合。此外，我们引入了本体引导的类约束以强制语义一致性，基于类级别的兼容性细化实体预测。在两个广泛使用的基准数据集上进行的大量实验表明，ReaLM实现了最先进的性能，证实了其在将结构化知识与大规模语言模型对齐方面的有效性。

英文摘要

Large Language Models (LLMs) have recently emerged as a powerful paradigm for Knowledge Graph Completion (KGC), offering strong reasoning and generalization capabilities beyond traditional embedding-based approaches. However, existing LLM-based methods often struggle to fully exploit structured semantic representations, as the continuous embedding space of pretrained KG models is fundamentally misaligned with the discrete token space of LLMs. This discrepancy hinders effective semantic transfer and limits their performance. To address this challenge, we propose ReaLM, a novel and effective framework that bridges the gap between KG embeddings and LLM tokenization through the mechanism of residual vector quantization. ReaLM discretizes pretrained KG embeddings into compact code sequences and integrates them as learnable tokens within the LLM vocabulary, enabling seamless fusion of symbolic and contextual knowledge. Furthermore, we incorporate ontology-guided class constraints to enforce semantic consistency, refining entity predictions based on class-level compatibility. Extensive experiments on two widely used benchmark datasets demonstrate that ReaLM achieves state-of-the-art performance, confirming its effectiveness in aligning structured knowledge with large-scale language models.
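
Residual vector quantization, the mechanism named above, turns a continuous vector into a short sequence of discrete codes by quantizing the leftover error at each stage. The sketch below is a generic illustration with fixed, hand-written codebooks; `rvq_encode`, `rvq_decode`, and the codebook values are hypothetical, and ReaLM's learned codebooks and vocabulary integration are not shown.

```python
def nearest(codebook, vec):
    """Index of the codeword with the smallest squared distance to vec."""
    def dist2(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct an approximation by summing the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two hypothetical codebooks: a coarse one and a finer one for the residual.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
embedding = [1.2, 0.9]  # stand-in for a pretrained KG embedding

codes, residual = rvq_encode(embedding, codebooks)
approx = rvq_decode(codes, codebooks)
```

Each code index could then be exposed to a language model as one extra token, so an entity embedding becomes a compact code sequence.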

2510.08977 2026-06-03 cs.LG cs.CL 版本更新

Breaking the Self-Confirming Loop: Diagnosing and Mitigating Systemic Reward Bias in Self-Rewarding RL

打破自我确认循环：诊断与缓解自奖励强化学习中的系统性奖励偏差

Chuyi Tan, Peiwen Yuan, Xinglin Wang, Yiwei Li, Shaoxiong Feng, Yueqi Zhang, Jiayi Shi, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文通过量化反馈回路偏差并提出集成奖励强化学习（RLER）方法，诊断并缓解了自奖励强化学习中由置信度耦合导致的系统性奖励偏差，从而提升性能与稳定性。

详情

AI中文摘要

基于可验证奖励的强化学习（RLVR）高效扩展了大语言模型（LLMs）的推理能力，但受限于稀缺的标注数据。基于内在奖励的强化学习（RLIR）通过自奖励提供了一种可扩展的替代方案，但常面临不稳定和性能较差的问题。我们将这一差距归因于置信度耦合的自奖励中的系统性偏差：模型倾向于过度奖励高置信度的错误，形成自我确认循环。我们通过三个指标量化这种反馈回路偏差：奖励噪声幅度（rho_noise）、策略-奖励耦合（rho_selfbias）和过度/不足奖励偏斜（rho_symbias）。我们的分析显示了一种复合效应，其中强耦合放大了置信度条件误差，并导致向过度奖励的漂移，从而引发不稳定和较低的性能上限。为缓解这一问题，我们提出集成奖励强化学习（RLER），该方法通过自适应奖励插值和分歧感知的轨迹选择聚合多样化的模型，以减少耦合并抑制过度奖励漂移。大量实验表明，RLER相比最佳RLIR基线提升了6.2%，且与RLVR的差距在3.6%以内，同时在未标注样本上表现出稳定的扩展性。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) efficiently scales the reasoning ability of large language models (LLMs) but is bottlenecked by scarce labeled data. Reinforcement learning with intrinsic rewards (RLIR) offers a scalable alternative via self-rewarding, yet often suffers from instability and inferior performance. We trace this gap to a systemic bias in confidence-coupled self-rewarding: the model tends to over-reward high-confidence mistakes, forming a self-confirming loop. We quantify this feedback-loop bias with three metrics: reward noise magnitude (rho_noise), policy-reward coupling (rho_selfbias), and over-/under-reward skew (rho_symbias). Our analyses show a compounding effect where strong coupling amplifies confidence-conditioned errors and drives a drift toward over-reward, leading to instability and a lower performance ceiling. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models with adaptive reward interpolation and disagreement-aware rollout selection to reduce coupling and suppress over-reward drift. Extensive experiments show that RLER improves by 6.2% over the best RLIR baseline and is within 3.6% of RLVR, while exhibiting stable scaling on unlabeled samples.

2509.22854 2026-06-03 cs.CL 版本更新

Train Once, Reuse Everywhere: Generalizable Implicit In-Context Learning by Routing Attention

一次训练，随处重用：通过路由注意力实现可泛化的隐式上下文学习

Jiaqian Li, Yanshu Li, Ligong Han, Ruixiang Tang, Wenya Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出In-Context Routing (ICR)方法，在注意力logits层面捕获可泛化的上下文学习模式，通过可学习的输入条件路由器调制注意力logits，实现高效的一次训练多次重用框架，在12个数据集上优于现有隐式ICL方法并展现强泛化能力。

Comments ICML 2026 Camera-ready

详情

AI中文摘要

隐式上下文学习（ICL）作为一种新兴的有前景范式，在大语言模型（LLMs）的表示空间中模拟ICL行为，旨在以零样本成本获得少样本性能。然而，现有方法主要依赖于将偏移向量注入残差流，这些向量通常从标注示例或任务特定对齐中构建。这种设计未能充分利用ICL背后的结构机制，且泛化能力有限。为了解决这个问题，我们提出了In-Context Routing (ICR)，一种新颖的隐式ICL方法，在注意力logits层面捕获和利用可泛化的ICL模式。它提取ICL过程中出现的可重用结构方向，并采用可学习的输入条件路由器相应地调制注意力logits，从而实现高效的一次训练多次重用框架。我们在涵盖不同领域和多个LLM的12个真实世界数据集上评估了ICR。结果表明，ICR一致优于需要任务特定检索或训练的现有隐式ICL方法，同时在它们难以处理的域外任务上展现出稳健的泛化能力。这些发现将ICR定位为推动ICL实际价值边界的方案。代码可在https://github.com/Lijiaqian1/In-Context-Routing.git获取。

英文摘要

Implicit in-context learning (ICL) has recently emerged as a promising paradigm that simulates ICL behaviors in the representation space of large language models (LLMs), aiming to attain few-shot performance at zero-shot cost. However, existing approaches largely rely on injecting shift vectors into residual flows, which are typically constructed from labeled demonstrations or task-specific alignment. Such designs fall short of utilizing the structural mechanisms underlying ICL and suffer from limited generalizability. To address this, we propose In-Context Routing (ICR), a novel implicit ICL method that captures and utilizes generalizable ICL patterns at the attention logits level. It extracts reusable structural directions that emerge during ICL and employs a learnable input-conditioned router to modulate attention logits accordingly, enabling an efficient train-once-and-reuse framework. We evaluate ICR on 12 real-world datasets spanning diverse domains and multiple LLMs. The results show that ICR consistently outperforms existing implicit ICL methods that require task-specific retrieval or training, while demonstrating robust generalization to out-of-domain tasks where they struggle. These findings position ICR to push the boundary of the practical value of ICL. The code is available at https://github.com/Lijiaqian1/In-Context-Routing.git.

2508.03098 2026-06-03 cs.CL 版本更新

Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation

隐私感知解码：缓解检索增强生成中大语言模型的隐私泄露

Haoran Wang, Xiongxiao Xu, Baixiang Huang, Kai Shu

发表机构 * Emory University（埃默里大学）； Illinois Institute of Technology（伊利诺伊理工学院）

AI总结提出一种轻量级推理时防御方法PAD，通过在解码过程中注入校准高斯噪声，结合置信度筛选、敏感度估计和上下文感知噪声校准，在RAG系统中平衡隐私保护与生成质量，并利用Rényi差分隐私跟踪累积隐私损失。

详情

AI中文摘要

检索增强生成（RAG）通过将输出条件化于外部知识源来增强大语言模型（LLM）的事实准确性。然而，当检索涉及私人或敏感数据时，RAG系统容易受到提取攻击，从而通过生成的响应泄露机密信息。我们提出隐私感知解码（PAD），一种轻量级的推理时防御方法，在生成过程中自适应地将校准的高斯噪声注入到token logits中。PAD集成了基于置信度的筛选以选择性保护高风险token、高效的敏感度估计以最小化不必要的噪声，以及上下文感知的噪声校准以平衡隐私与生成质量。Rényi差分隐私（RDP）会计机制严格跟踪累积隐私损失，从而为敏感输出提供明确的每响应$(\varepsilon, \delta)$-DP保证。与需要重新训练或语料库级过滤的先前方法不同，PAD是模型无关的，并且完全在解码时运行，计算开销极小。在三个真实世界数据集上的实验表明，PAD在保持响应实用性的同时显著减少了私有信息泄露，优于现有的基于检索和后处理的防御方法。我们的工作通过解码策略在缓解RAG中的隐私风险方面迈出了重要一步，为敏感领域中的通用和可扩展隐私解决方案铺平了道路。我们的代码可在https://github.com/wang2226/PAD获取。

英文摘要

Retrieval-Augmented Generation (RAG) enhances the factual accuracy of large language models (LLMs) by conditioning outputs on external knowledge sources. However, when retrieval involves private or sensitive data, RAG systems are susceptible to extraction attacks that can leak confidential information through generated responses. We propose Privacy-Aware Decoding (PAD), a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. A Rényi Differential Privacy (RDP) accountant rigorously tracks cumulative privacy loss, enabling explicit per-response $(\varepsilon, \delta)$-DP guarantees for sensitive outputs. Unlike prior approaches requiring retraining or corpus-level filtering, PAD is model-agnostic and operates entirely at decoding time with minimal computational overhead. Experiments on three real-world datasets demonstrate that PAD substantially reduces private information leakage while preserving response utility, outperforming existing retrieval- and post-processing-based defenses. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains. Our code is available at https://github.com/wang2226/PAD.
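
The core decoding-time step, adding Gaussian noise to logits only for screened tokens, can be sketched as follows. This assumes a token is flagged when its top softmax probability reaches a threshold and uses a fixed noise scale; that screening rule, the function `noisy_logits`, and the parameter values are assumptions for illustration. PAD's sensitivity estimation, context-aware calibration, and RDP accounting are not reproduced.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_logits(logits, sigma, threshold, rng):
    """Add N(0, sigma^2) noise to every logit when the step looks high-confidence."""
    if max(softmax(logits)) < threshold:
        # Low-confidence step: assumed low risk, so the logits pass through unchanged.
        return list(logits)
    return [x + rng.gauss(0.0, sigma) for x in logits]


rng = random.Random(0)  # seeded only to make the sketch repeatable

flat_step = [0.0, 0.0, 0.0]     # uniform distribution, below the threshold
peaked_step = [10.0, 0.0, 0.0]  # one token dominates, above the threshold

untouched = noisy_logits(flat_step, sigma=1.0, threshold=0.9, rng=rng)
perturbed = noisy_logits(peaked_step, sigma=1.0, threshold=0.9, rng=rng)
```

Screening keeps noise away from steps where it would only hurt fluency, which is the trade-off between privacy and generation quality that the abstract describes.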

2507.19684 2026-06-03 cs.LG cs.AI cs.CL cs.CV 版本更新

CoMPAS3D: A Dataset and Benchmark for Interactive Motion

CoMPAS3D: 一个用于交互动作的数据集和基准

Bermet Burkanova, Yasaman Etesam, Payam Jome Yazdian, Trinity Evans, Chuxuan Zhang, Zoe Stanley, Paige Tuttösí, Angelica Lim

发表机构 * School of Computing Science, Simon Fraser University（西蒙弗雷泽大学计算科学学院）

AI总结提出CoMPAS3D数据集和评估框架，通过动作可读性和熟练度适当性等客观指标，解决交互式动作生成中缺乏社交上下文评估的问题。

Comments https://rosielab.github.io/compas3d

详情

AI中文摘要

社交互动型人形机器人必须通过身体与人类互动，实时适应伙伴的动作、意图和能力。这需要模型不仅理解身体如何移动，还要理解在共享社交背景下动作的含义。然而，交互式动作生成的评估框架并未衡量生成的跟随者动作是否在共享动作词汇中可读，也不评估其是否适合伙伴的熟练水平。这一差距有两个原因：现有框架依赖运动学指标（如FID和节拍对齐），无法衡量上述特性；现有数据集缺乏所需的动作标注和熟练度变化。萨尔萨舞作为评估领域很合适：即兴、双人、由动作词汇和评判标准（涵盖时机、音乐性、技巧、难度、配合和原创性）指导。我们提出CoMPAS3D，一个即兴双人萨尔萨舞的动作捕捉数据集，附带评估框架，涵盖运动学质量、两个客观指标（动作可读性和熟练度适当性）以及六个基于竞赛的主观维度。数据集包含18名舞者（涵盖初级、中级和专业水平）的3小时即兴表演，超过2800个专家标注片段，涵盖动作类型、错误和风格元素。我们定义了三个基准：动作分类（类似于转录）、熟练度估计（流利度评估）和跟随者生成（对话响应）。微调的视觉语言模型在应用于真实动作序列的客观指标上表现强劲。应用于Duolando和InterGen时，这些指标揭示了运动学指标遗漏的失败。人工评估确认了生成动作与真实动作之间的差距。CoMPAS3D、标注、基准代码和基线结果公开可用。

英文摘要

Socially interactive humanoid robots must engage with humans through their bodies, adapting in real time to a partner's movement, intent, and abilities. This requires models that understand not just how bodies move, but what movement means in a shared social context. Yet evaluation frameworks for interactive motion generation do not measure whether generated follower motion is legible within a shared movement vocabulary, nor whether it is appropriate to the partner's proficiency level. This gap has two causes: existing frameworks rely on kinematic metrics such as FID and beat alignment that cannot measure either property, and existing datasets lack the move annotations and proficiency variation needed. Salsa is well-suited as an evaluation domain: improvised, dyadic, and governed by a move vocabulary and judging criteria covering timing, musicality, technique, difficulty, partnering, and originality. We present CoMPAS3D, a motion capture dataset of improvised partner salsa paired with an evaluation framework covering kinematic quality, two objective metrics (move legibility and proficiency appropriateness), and six competition-based subjective dimensions. The dataset includes 3 hours of improvisation by 18 dancers spanning beginner, intermediate, and professional levels, with over 2,800 expert-annotated segments covering move types, errors, and stylistic elements. We define three benchmarks: move classification (analogous to transcription), proficiency estimation (fluency assessment), and follower generation (dialogue response). Fine-tuned vision-language models perform strongly on objective metrics applied to ground-truth motion sequences. Applied to Duolando and InterGen, the metrics reveal failures that kinematic metrics miss. Human evaluations confirm the gap between generated and ground-truth motion. CoMPAS3D, annotations, benchmark code, and baseline results are publicly available.

2506.02018 2026-06-03 cs.CL Version update

Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data

Christopher Lee Lübbers

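AI summary: This study uses a human-ranked paraphrase-type dataset together with Direct Preference Optimization (DPO) to align model outputs with human judgments. It raises paraphrase-type generation accuracy by 3 percentage points and human preference ratings by 7 percentage points, and it creates a new annotated dataset to support more rigorous evaluation.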

Comments 21 pages, 11 figures. Master's thesis, University of Goettingen, December 2024. Code: https://github.com/cluebbers/dpo-rlhf-paraphrase-types. Models: https://huggingface.co/collections/cluebbers/enhancing-paraphrase-type-generation-673ca8d75dfe2ce962a48ac0

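AI abstract (translated from Chinese)

Paraphrasing re-expresses meaning and thereby enhances applications such as text simplification, machine translation, and question answering. Specific paraphrase types support precise semantic analysis and robust language models. However, existing paraphrase-type generation methods rely on automated evaluation metrics and limited human-annotated training data, so they often diverge from human preferences and obscure key aspects of semantic fidelity and linguistic transformation. This study fills that gap by using a human-ranked paraphrase-type dataset and integrating Direct Preference Optimization (DPO) to align model outputs directly with human judgments. DPO-based training raises paraphrase-type generation accuracy by 3 percentage points over a supervised baseline and raises human preference ratings by 7 percentage points. A newly created human-annotated dataset supports more rigorous future evaluation. In addition, a paraphrase-type detection model reaches F1 scores of 0.91 for addition/deletion, 0.78 for same-polarity substitution, and 0.70 for punctuation changes. These findings indicate that preference data and DPO training yield more reliable and semantically accurate paraphrases, benefiting downstream applications such as summarization and more robust question answering. The PTD model surpasses automated evaluation metrics and offers a more reliable framework for assessing paraphrase quality, moving paraphrase-type research toward richer, user-aligned language generation and laying a firmer foundation for future evaluation grounded in human-centric criteria.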

English abstract

Paraphrasing re-expresses meaning to enhance applications like text simplification, machine translation, and question-answering. Specific paraphrase types facilitate accurate semantic analysis and robust language models. However, existing paraphrase-type generation methods often misalign with human preferences due to reliance on automated metrics and limited human-annotated training data, obscuring crucial aspects of semantic fidelity and linguistic transformations. This study addresses this gap by leveraging a human-ranked paraphrase-type dataset and integrating Direct Preference Optimization (DPO) to align model outputs directly with human judgments. DPO-based training increases paraphrase-type generation accuracy by 3 percentage points over a supervised baseline and raises human preference ratings by 7 percentage points. A newly created human-annotated dataset supports more rigorous future evaluations. Additionally, a paraphrase-type detection model achieves F1 scores of 0.91 for addition/deletion, 0.78 for same polarity substitution, and 0.70 for punctuation changes. These findings demonstrate that preference data and DPO training produce more reliable, semantically accurate paraphrases, enabling downstream applications such as improved summarization and more robust question-answering. The PTD model surpasses automated metrics and provides a more reliable framework for evaluating paraphrase quality, advancing paraphrase-type research toward richer, user-aligned language generation and establishing a stronger foundation for future evaluations grounded in human-centric criteria.
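
The abstract names Direct Preference Optimization but does not spell out the objective. As a rough illustration, the standard per-pair DPO loss can be sketched as below. The function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions for illustration only and are not taken from the thesis or its code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs.

    beta is a hypothetical default, not a value reported in the abstract.
    """
    # Implicit rewards: how far the policy moved away from the frozen reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-probabilities: the policy prefers the human-ranked winner.
aligned = dpo_loss(-1.0, -5.0, -2.0, -2.0, beta=0.5)
# Same numbers with the roles swapped: the policy prefers the loser.
misaligned = dpo_loss(-5.0, -1.0, -2.0, -2.0, beta=0.5)
```

The loss falls below log 2 when the policy ranks the human-preferred paraphrase above the rejected one by more than the reference model does, and rises above log 2 otherwise.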

2310.10322 2026-06-03 cs.CL Version update

Evaluating the Reversal Curse in Model Editing

Hao-Xiang Xu, Jun-Yu Ma, Zhen-Hua Ling, Quan Liu, Cong Liu, Jia-Chen Gu

Affiliations: National Engineering Research Center of Speech and Language Information Processing; University of Science and Technology of China; iFLYTEK Research; University of California, Los Angeles

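AI summary: This paper studies bidirectional language model editing, proposes a reverse generalization metric, and builds the BAKE benchmark. It finds that most editing methods show systematic deficiencies under reverse-direction evaluation, and it analyzes the causes of the reversal curse and strategies for mitigating it.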

Comments Accepted by TMLR

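AI abstract (translated from Chinese)

Large language models (LLMs) are prone to hallucinating unintended text because of false or outdated knowledge. Because retraining LLMs is resource intensive, model editing has attracted growing attention. Despite the emergence of benchmarks and methods, existing unidirectional editing and evaluation paradigms have not explored the reversal curse. In this paper, we study bidirectional language model editing, aiming to provide a rigorous evaluation of whether edited LLMs can recall edited knowledge in both directions. We introduce a reverse generalization metric and construct a benchmark named BAKE (Bidirectional Assessment for Knowledge Editing) to evaluate whether edited models can recall the edited knowledge in the reverse direction of the edit. We run extensive experiments with a variety of editing methods and LLMs. The results show that although most editing methods accurately recall edited facts along the direction of modification, they exhibit substantial systematic deficiencies when evaluated in the reverse direction. To further investigate the root causes of the reversal curse and explore potential mitigation strategies, we conduct a detailed analysis from three perspectives. Our findings indicate that although in-context learning (ICL) can mitigate the reversal curse to some extent, it lacks continuity, is limited by input length, and may introduce hallucinations. Combining the strengths of ICL and other editing methods is therefore a promising direction for developing new editing paradigms.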

English abstract

Large language models (LLMs) are prone to hallucinate unintended text due to false or outdated knowledge. Since retraining LLMs is resource intensive, there has been a growing interest in model editing. Despite the emergence of benchmarks and approaches, existing unidirectional editing and evaluation paradigms have failed to explore the reversal curse. In this paper, we study bidirectional language model editing, aiming to provide a rigorous evaluation to assess if edited LLMs can recall the editing knowledge bidirectionally. A metric of reverse generalization is introduced and a benchmark dubbed Bidirectional Assessment for Knowledge Editing (BAKE) is constructed to evaluate if post-edited models can recall the edited knowledge in the reverse direction of editing. We conduct extensive experiments using a variety of editing methods and LLMs. The results show that while most editing methods are able to accurately recall editing facts along the modification direction, they exhibit substantial systematic deficiencies when evaluating in the reverse direction. To further investigate the underlying causes of reversal curse and to explore potential strategies for mitigation, a detailed analysis is conducted from three perspectives. Our findings reveal that although In-Context Learning (ICL) can mitigate the reversal curse to a certain extent, it lacks continuity, is limited by the input length, and may introduce hallucinations. Therefore, combining the advantages of ICL and other editing methods is a promising direction for developing new editing paradigms.
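
The difference between recalling an edit in the forward direction and in the reverse direction can be shown with a toy lookup table. This is only an illustration of the evaluation idea, not BAKE's metric or data: the prompt-to-answer dictionary stands in for a model, and the facts, prompts, and the helper `recall_rate` are all hypothetical.

```python
def recall_rate(model, queries):
    """Fraction of (prompt, expected_answer) pairs the toy model gets right."""
    hits = sum(1 for prompt, expected in queries if model.get(prompt) == expected)
    return hits / len(queries)


# Hypothetical toy "model": a plain prompt -> answer table with made-up facts.
model = {
    "The capital of Freedonia is": "Old Town",
    "New City is the capital of": "Sylvania",
}

# An edit applied in the forward direction only (subject -> new object).
model["The capital of Freedonia is"] = "New City"

# Forward query follows the edit; reverse query asks for the subject given the object.
forward_queries = [("The capital of Freedonia is", "New City")]
reverse_queries = [("New City is the capital of", "Freedonia")]

forward_recall = recall_rate(model, forward_queries)
reverse_recall = recall_rate(model, reverse_queries)
print(forward_recall, reverse_recall)
```

In this toy setting the forward query succeeds while the reverse query still returns the stale answer, which is the kind of asymmetry a reverse-direction evaluation is meant to expose.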

2303.15619 2026-06-03 cs.CL cs.AI Version update

Typhoon: Towards an Effective Task-Specific Masking Strategy for Pre-trained Language Models

Muhammed Shahir Abdurrahman, Hashem Elezabi, Bruce Changlong Xu

Affiliations: Department of Computer Science, Stanford University

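AI summary: This paper proposes Typhoon, an adaptive masking strategy based on task-loss gradients, compares it with random masking and whole-word masking on GLUE tasks, and finds no significant advantage under rigorous evaluation.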

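AI abstract (translated from Chinese)

In masked language modeling (MLM), choosing which tokens to mask is a central but under-examined design decision. Standard pretraining masks tokens uniformly at random, yet several studies show that more informative masking targets can improve downstream performance. We treat masking as a task-adaptive component of the fine-tuning pipeline and introduce Typhoon, a masking strategy that uses the gradient of the task loss with respect to one-hot token inputs to estimate online how much each token type contributes to the objective. Typhoon maintains an exponential moving average of per-token-type saliency and calibrates these scores into a masking distribution whose expected masking rate, under a token-independence approximation, matches a target budget. We formalize the method and evaluate it against random masking and whole-word masking on two GLUE tasks (MRPC and CoLA), across three BERT-family backbones (TinyBERT, DistilBERT, and BERT-base) with five random seeds per configuration (90 training runs in total). Our main finding is that, once seed variance is taken into account, no masking strategy is reliably better than the others on these tasks: on MRPC the gap between Typhoon and the best baseline stays within 0.004 F1, no paired test reaches significance across all twelve Typhoon comparisons, and every 95% confidence interval contains zero. Typhoon's apparent advantage in single-run experiments does not survive this more careful evaluation. We read this as a cautionary, reproducibility-focused result: gradient-based task-adaptive masking is competitive but not clearly better than resource-free random masking at this scale. We also describe a clean modern reimplementation to support follow-up work.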

English abstract

The choice of which tokens to mask is a central, under-examined design decision in masked language modeling (MLM). Standard pretraining masks tokens uniformly at random, but several studies show that more informative masking targets can improve downstream performance. We study masking as a task-adaptive component of the fine-tuning pipeline and introduce Typhoon, a masking strategy that uses the gradient of the task loss with respect to one-hot token inputs to estimate, online, how much each token type contributes to the objective. Typhoon maintains an exponential moving average of per-token-type saliency and calibrates these scores into a masking distribution whose expected masking rate matches a target budget, under a token-independence approximation. We formalize the method and evaluate it against random masking and whole-word masking on two GLUE tasks, MRPC and CoLA, across three BERT-family backbones (TinyBERT, DistilBERT, and BERT-base) and five random seeds per configuration (90 training runs in total). Our main finding is that, once seed variance is accounted for, no masking strategy is reliably better than the others on these tasks: on MRPC the gap between Typhoon and the best baseline stays within 0.004 F1, across all twelve Typhoon comparisons no paired test reaches significance, and every 95% confidence interval contains zero. Typhoon's apparent advantage in single-run experiments does not survive this more careful evaluation. We read this as a cautionary, reproducibility-focused result (gradient-based task-adaptive masking is competitive but not clearly better than resource-free random masking at this scale), and we describe a clean modern reimplementation to support follow-up work.
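
The two bookkeeping steps the abstract describes, an exponential moving average of per-token-type saliency and a calibration to a target masking budget, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the calibration rule, so a clipped linear scaling solved by bisection is assumed here, and the names `ema_update` and `calibrate`, the decay of 0.9, and the budget of 0.15 are hypothetical. The gradient computation that produces the saliency values is omitted.

```python
def ema_update(saliency, batch_saliency, decay=0.9):
    """Blend new per-token-type saliency values into the running average."""
    updated = dict(saliency)
    for token_type, value in batch_saliency.items():
        # A token type seen for the first time starts at its first observed value.
        previous = updated.get(token_type, value)
        updated[token_type] = decay * previous + (1.0 - decay) * value
    return updated


def calibrate(saliency, freq, budget=0.15, iters=60):
    """Map saliency to mask probabilities p[t] = min(1, c * saliency[t]).

    freq holds relative token-type frequencies (summing to 1). Treating tokens
    as independent, the expected masking rate is sum_t freq[t] * p[t]; the
    scale c is found by bisection so that this rate matches the budget.
    Assumed rule for illustration; the paper's calibration may differ.
    """
    def expected_rate(c):
        return sum(f * min(1.0, c * saliency.get(t, 0.0)) for t, f in freq.items())

    low, high = 0.0, 1.0
    while expected_rate(high) < budget and high < 1e12:
        high *= 2.0  # grow the bracket until the budget is reachable
    for _ in range(iters):
        mid = (low + high) / 2.0
        if expected_rate(mid) < budget:
            low = mid
        else:
            high = mid
    return {t: min(1.0, high * saliency.get(t, 0.0)) for t in freq}


# Made-up token types, frequencies, and saliency scores.
freq = {"the": 0.5, "not": 0.3, "good": 0.2}
saliency = ema_update({"the": 0.1, "not": 1.0}, {"good": 0.5})
probs = calibrate(saliency, freq, budget=0.15)
rate = sum(freq[t] * probs[t] for t in freq)
```

With these numbers the high-saliency type "not" receives the largest masking probability, while the frequency-weighted rate stays at the 15% budget.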
