2606.02556 2026-06-02 cs.CL (version update)

HERO'S JOURNEY: Testing Complex Rule Induction with Text Games

Anshun Asher Zheng, Kanishka Misra, David I. Beaver, Junyi Jessy Li

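Affiliations: Department of Linguistics; The University of Texas at Austin

AI summary: This paper introduces the HERO'S JOURNEY benchmark, which uses goal-directed text games to evaluate the rule-reasoning ability of large language models on attribute and procedure induction tasks. It finds that models can induce rules, but the ability is limited and uneven: procedure execution is the bottleneck, while surface semantics has little effect.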

Comments 24 pages

2606.02548 2026-06-02 cs.CL (version update)

SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation

Priyaranjan Pattnayak

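Affiliations: Oracle America Inc.

AI summary: Proposes the SN-WER metric, which transliterates reference and hypothesis text into a canonical script before computing WER, addressing WER's overestimation of errors in multi-script settings. Evaluation on Indic languages shows it reduces inflated model gaps by up to 12%.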

Comments Accepted to ACL 2026 MeLLM

Abstract

Word Error Rate (WER) is the dominant metric for automatic speech recognition (ASR), but it can overestimate errors when references and hypotheses encode the same words in different scripts. This issue is common in multilingual settings where ASR models may emit romanized text. We propose Script-Normalized WER (SN-WER), a training-free, evaluation-only scoring method that transliterates both reference and hypothesis text into a language-specific canonical script before computing WER. We evaluate SN-WER on 5 Indic languages, 2 datasets, and 3 ASR models. On curated FLEURS data, SN-WER reduces inflated model gaps by up to 12%, while on noisier Common Voice data the reductions are smaller or inconsistent, indicating genuine recognition weaknesses rather than only script mismatch. Controlled stress tests show a 67% attenuation of artificial romanization-induced WER inflation, while lexical-substitution controls show near-identical sensitivity to semantic errors, with Delta SN-WER / Delta WER approximately 1.09. SN-WER is robust to transliterator choice, normalization changes, and shows low token-collision rates below 0.1% in the evaluated Indic setting. We argue that SN-WER should be reported alongside WER and CER as a companion metric for script-insensitive ASR evaluation, especially when transcripts feed downstream search, indexing, or multilingual LLM pipelines.
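
The scoring procedure described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-entry table `TOY_ROMAN_TO_DEVANAGARI` is a hypothetical stand-in for a real transliterator, and the function names are chosen here for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Plain word error rate over whitespace-separated tokens."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


# Hypothetical toy lookup; a real system would call a transliteration tool.
TOY_ROMAN_TO_DEVANAGARI = {"namaste": "नमस्ते", "duniya": "दुनिया"}


def to_canonical(text):
    """Map each token to the canonical script when the toy table knows it."""
    return " ".join(TOY_ROMAN_TO_DEVANAGARI.get(tok, tok) for tok in text.split())


def sn_wer(reference, hypothesis):
    """Normalize both sides to the canonical script, then compute WER."""
    return wer(to_canonical(reference), to_canonical(hypothesis))


reference = f"{TOY_ROMAN_TO_DEVANAGARI['namaste']} {TOY_ROMAN_TO_DEVANAGARI['duniya']}"
hypothesis = "namaste duniya"   # same words, romanized
print(wer(reference, hypothesis))     # 1.0
print(sn_wer(reference, hypothesis))  # 0.0
```

In this toy case the hypothesis contains the same words in a different script, so plain WER counts every token as a substitution while the normalized score counts none.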

2606.02545 2026-06-02 cs.CL (version update)

Transferable Self-Harm Surveillance from Emergency Department Triage Notes Using an Evidence-Augmented Machine Learning Approach

Liuliu Chen, Gowri Rajaram, Eleanor Bailey, Katrina Witt, Michelle Lamblin, Jo Robinson, Mike Conway, Vlada Rozova

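Affiliations: School of Computing and Information Systems, University of Melbourne; Orygen; Centre for Youth Mental Health, University of Melbourne; Centre for Digital Transformation of Health, University of Melbourne

AI summary: This study proposes a three-stage machine learning approach that combines large language model screening with evidence extraction to detect self-harm in emergency department triage notes, and validates its transferability and fine-grained surveillance capability across three Australian hospitals.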

Abstract

Self-harm is a major public health concern, but current surveillance relying on hospital presentations is inadequate due to the low sensitivity of diagnostic codes. Emergency Department (ED) triage notes, recorded at the initial point of contact, provide a succinct summary of presentations and an opportunity to identify self-harm. We developed a three-stage approach, augmenting traditional machine learning with large language model-based screening and evidence extraction to detect self-harm in ED triage notes. We assessed model transferability across three Australian hospitals. Our approach showed AUPRCs of 0.887 +/- 0.016 and 0.884 +/- 0.012 during internal and external validation. Prospectively, it achieved AUPRC of 0.881 +/- 0.008 at the development site, and 0.879 +/- 0.012 and 0.816 +/- 0.015 at two external sites without site-specific retraining. A key advantage of the approach is that it enables identification of the primary self-harm method with an accuracy of 95%, supporting more granular surveillance beyond binary classification.

2606.02544 2026-06-02 cs.CL cs.AI (version update)

SimSD: Simple Speculative Decoding in Diffusion Language Models

Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo, Jinya Jiang, Haoru Li, Chaojie Ren, Yiming Huang, Kaijie Zhu, Zhongkai Yu, Kun Zhou, Jingbo Shang

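Affiliations: University of California San Diego; University of Illinois Urbana-Champaign; Google; University of California Santa Barbara

AI summary: Because diffusion language models cannot directly use standard speculative decoding, the paper proposes the SimSD algorithm. It introduces reference tokens through a plug-and-play masking strategy and designs an attention mask so that multiple draft tokens can be verified in a single forward pass, raising decoding throughput while preserving the advantages of parallel decoding.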

Comments 13 pages, 4 figures, code available at https://github.com/airevo2/SimSD-release

Abstract

Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most effective acceleration techniques for AR models. In AR decoding, the causal mask preserves temporally valid token-level contexts, enabling a target model to verify multiple drafted tokens in a single forward pass. In contrast, dLLMs rely on mask tokens and bidirectional attention, causing the effective context to change across denoising steps and preventing direct token-level speculative verification. To bridge this gap, we propose a simple but effective speculative decoding algorithm for diffusion language models, named SimSD, which mainly adopts a plug-and-play masking strategy that equips dLLMs with temporally valid token-level contexts for speculative decoding. Our method explicitly introduces reference tokens from draft-model predictions and designs an attention mask that regulates their interaction with current-step tokens, allowing dLLMs to compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The proposed method is training-free and can be flexibly integrated with other acceleration techniques such as KV cache and blockwise decoding. Experiments on SDAR-family dLLMs across four benchmarks show that our method achieves up to 7.46x higher decoding throughput while maintaining and even improving average generation quality.
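
For readers unfamiliar with the verification step that this abstract says causal masking enables, here is a minimal sketch of the standard greedy variant used with AR models. It is not SimSD's masking strategy, which the abstract does not specify in enough detail to reproduce. The function name and the assumption that `target_predictions` comes from one forward pass of the target model are illustrative.

```python
def verify_draft(draft_tokens, target_predictions):
    """Greedy speculative verification.

    target_predictions[i] is assumed to be the target model's greedy token
    at position i given the context plus draft_tokens[:i], so it has one
    more entry than draft_tokens (the extra entry is the "bonus" token).
    """
    if len(target_predictions) != len(draft_tokens) + 1:
        raise ValueError("need exactly one more target prediction than draft tokens")

    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_predictions[i]:
            break                      # first disagreement ends acceptance
        accepted.append(tok)

    # Always emit the target's own token at the first unverified position.
    accepted.append(target_predictions[len(accepted)])
    return accepted


print(verify_draft([5, 7, 9], [5, 7, 8, 1]))  # [5, 7, 8]
print(verify_draft([5, 7], [5, 7, 3]))        # [5, 7, 3]
```

Each call yields at least one token, so decoding never stalls, and yields up to `len(draft_tokens) + 1` tokens when the draft is fully accepted.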

2606.02540 2026-06-02 cs.CL (version update)

CRAM: Centroid Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

Jun-Tao Tang, Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou

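Affiliations: School of Artificial Intelligence, Nanjing University, China; State Key Laboratory of Novel Software Technology, Nanjing University, China

AI summary: Proposes CRAM, which addresses forgetting caused by task competition and poor parameter efficiency in multimodal continual instruction tuning. It isolates task-specific patterns into independent modules, allocates parameters dynamically through adaptive-rank instantiation, activates existing experts through centroid routing, and constrains update directions with an orthogonality penalty.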

Abstract

Multimodal Large Language Models (MLLMs) unify heterogeneous vision-language tasks under a shared generative framework via instruction tuning, yet real-world deployment demands continuous capability expansion, making Multimodal Continual Instruction Tuning (MCIT) essential. Existing methods either update all tasks with a shared parameter set or allocate dedicated modules for each new task. Shared updates force heterogeneous tasks to compete, causing forgetting of learned capabilities. Conversely, isolated expansion prevents interference but severely limits parameter efficiency over long task streams. To address this dilemma, we propose CRAM. Specifically, by isolating task-specific patterns into independent modules, CRAM mitigates catastrophic forgetting across tasks. To further boost parameter efficiency, we utilize adaptive-rank instantiation to identify the capability gap between existing expert capability and new task demands, and dynamically allocate only the necessary parameters. To ensure stable reuse among tasks, centroid-guided routing recognizes and activates existing experts' capabilities, while an orthogonality penalty confines new updates to task-specific directions, preventing re-learning general capability. Extensive experiments across diverse benchmarks consistently demonstrate its superiority over existing methods.
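
One plausible reading of centroid-guided routing is a nearest-centroid lookup over expert feature centroids. The sketch below is only that reading: the cosine similarity measure, the `threshold` value, the expert names, and the two-dimensional features are all assumptions made here for illustration, and the abstract does not confirm any of them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def route(feature, centroids, threshold=0.8):
    """Return the experts whose centroid is similar enough to the input.

    An empty result can be read as "no existing expert covers this input",
    which is where a new module would be needed (hypothetical policy).
    """
    scores = {name: cosine(feature, c) for name, c in centroids.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


# Hypothetical expert centroids in a toy 2-D feature space.
centroids = {"captioning": [1.0, 0.0], "vqa": [0.0, 1.0]}

print(route([0.9, 0.1], centroids))  # ['captioning']
print(route([1.0, 1.0], centroids))  # []
```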

2606.02487 2026-06-02 cs.CL (version update)

Towards Multidisciplinary Summarization of Hospital Stays: Efficient Sentence-Level Clinical Provenance Categorization

Baris Karacan, Vaibhav Bhargava, Barbara Di Eugenio, Natalie Parde, Mary Khetani, Yu-Shan Tseng, Vanessa Barbosa, Julie Vignato, Lindsey Knake, Rajashree Dahal, Emily Spellman, Danielle Hitzel, Janine Petitgout, Kristi Haughey, Amanda Karstens, Brianna Clarahan, Rachel Dawson, Lauren Boyd, Mackenzie Weis, Angie Tipton, Jaewon Bae, Catherine K. Craven, Karen Dunn Lopez, Andrew D. Boyd

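Affiliations: University of Illinois Chicago; University of Iowa; University of Missouri-Columbia

AI summary: This study proposes a clinical provenance categorization pipeline, based on supervised fine-tuning of large language models, for multidisciplinary summarization of hospital stays. In cross-domain transfer, fine-tuning raises the 70B model's Macro F1 by 7%, and the quantized fine-tuned 70B model outperforms its full-precision baseline, indicating that model capacity is critical for semantic flexibility.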

Comments 5 pages. Submitted preprint version of a paper accepted to AIME 2026. This version may differ from the camera-ready manuscript and the final Version of Record. The Version of Record will be available from Springer Nature once published

Abstract

Effective "all-team" summarization in high-complexity settings like the Neonatal Intensive Care Unit (NICU) requires aggregating insights from diverse disciplines (physicians, nurses, therapists) spread across hundreds of clinical free-text notes. Simply pooling heterogeneous text often leads to incoherent outputs. Structured summarization therefore first requires accurate categorization of sentence-level provenance across multi-source notes. This pilot study introduces a clinical provenance categorization pipeline using supervised fine-tuning (SFT) of large language models (LLMs). We adapted two Llama-3 models (8B and 70B) to MedSecId, a corpus of 2,002 MIMIC-III (Adult ICU) notes annotated with clinical provenance headers, achieving in-domain Macro F1 scores above 92% for both models. To evaluate cross-domain generalization, we assessed model capacity (8B vs. 70B) and quantization on a gold-standard dataset of 227 sentence-level spans derived from three multi-disciplinary NICU summaries. Experimental results demonstrate a scale-dependent transfer effect: while SFT produced only marginal changes for the 8B model, it substantially improved the 70B model, increasing Macro F1 by 7%. Notably, the quantized fine-tuned 70B model outperformed its full-precision baseline while substantially reducing computational requirements. These findings suggest that sufficient model capacity is critical for preserving semantic flexibility during cross-domain clinical transfer and that efficient quantized adaptation can enable structured provenance modeling for downstream summarization.

2606.02483 2026-06-02 cs.CR cs.AI cs.CL (version update)

Food Noise and False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries Under Clinician Feedback

Giulia Pucci, Emily Hemendinger, Ruizhe Li, Gavin Abercrombie, Tanvi Dinkar, Arabella Sinclair

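Affiliations: University of Aberdeen; University of Colorado Anschutz; Heriot-Watt University; University College London

AI summary: Working with clinical eating disorder experts, this study systematically evaluates the harm that large language models may cause when they handle queries from users with eating disorders by uncritically accommodating unsafe or self-harming requests.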

2606.02443 2026-06-02 cs.CL cs.AI cs.CV (version update)

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu, Jitian Guo, Yujiu Yang, Pinjia He

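Affiliations: The Chinese University of Hong Kong, Shenzhen; Tsinghua University

AI summary: Proposes the PaSBench-Video benchmark of 740 videos, which evaluates whether multimodal large models can issue timely warnings before a hazard occurs, and finds that existing models perform poorly on timing precision and on keeping false-positive rates low.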

Abstract

Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.
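
The trade-off between recall and false-positive rate described in this abstract can be made concrete with a small scoring sketch. The benchmark's actual metric definitions are not given here, so the rule below (a warning counts only if it falls between the annotated risk onset and the accident frame) and all function names are assumptions for illustration.

```python
def is_timely(warning_frame, onset_frame, accident_frame):
    """A warning counts if it lands inside [onset, accident) (assumed window)."""
    return warning_frame is not None and onset_frame <= warning_frame < accident_frame


def recall_and_fpr(risk_clips, safe_clips):
    """Compute recall on risk clips and false-positive rate on safe clips.

    risk_clips: list of (warning_frame or None, onset_frame, accident_frame)
    safe_clips: list of warning_frame or None (any warning is a false alarm)
    """
    hits = sum(is_timely(w, onset, accident) for w, onset, accident in risk_clips)
    false_alarms = sum(w is not None for w in safe_clips)
    return hits / len(risk_clips), false_alarms / len(safe_clips)


# Toy data: two timely warnings, one too early, one missing.
risk_clips = [(50, 40, 90), (10, 40, 90), (None, 20, 60), (30, 20, 60)]
safe_clips = [None, 12, None, None]

print(recall_and_fpr(risk_clips, safe_clips))  # (0.5, 0.25)
```

Under this scoring, a model that warns on every clip maximizes recall only by also driving the false-positive rate to 1.0, which is the coupling the abstract reports.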

2606.02433 2026-06-02 cs.IR cs.AI cs.CL cs.LG cs.MA (version update)

ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning

Zhensheng Wang, Xiaole Liu, Wenmian Yang, Kun Zhou, Yiquan Zhang, Weijia Jia

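Affiliations: School of Artificial Intelligence, Beijing Normal University; Institute of Artificial Intelligence and Future Networks, Beijing Normal University; Faculty of Arts and Sciences, Beijing Normal University; Beijing Normal-Hong Kong Baptist University

AI summary: Proposes a future forecasting and reasoning task for open-domain tabular question answering and builds the first dataset covering time-series forecasting and forecast-based reasoning. The LLM agent-based TimeFore framework (Retriever, Forecaster, Analyzer) addresses the challenges of historical data retrieval, forecasting limitations, and response standardization.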

Comments This paper has been accepted by Findings of ACL 2026

Abstract

The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numerical prediction. To address this gap, we introduce a novel task, Open-Domain Tabular Question Answering for Future Data Forecasting and Reasoning, and propose the first dataset to cover time-series forecasting and forecast-based reasoning scenarios using real estate data. This task poses challenges in retrieving precise historical data, overcoming the forecasting limitations of LLMs, and standardizing responses for diverse queries. To solve the above challenges, we propose TimeFore, an LLM agent-based framework that decomposes the problem into three collaborative roles: a Retriever autonomously generates SQL to fetch data, a Forecaster invokes external time-series models for higher accuracy, and an Analyzer synthesizes the results to construct a precise and consistent final answer. Extensive experiments demonstrate the effectiveness of our TimeFore.

2606.02423 2026-06-02 cs.CL cs.LG (version update)

Investigating and Alleviating Harm Amplification in LLM Interactions

Ruohao Guo, Wei Xu, Alan Ritter

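Affiliations: Georgia Institute of Technology

AI summary: Proposes the HarmAmp benchmark and the TrajSafe monitor to evaluate and mitigate harm amplification by large language models in multi-turn conversations.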

Abstract

Large language models (LLMs) can serve as helpful assistants, yet they can equally function as harm amplifiers that enable malicious users to achieve harmful outcomes beyond their capabilities through extended interactions. This risk manifests along two axes, i.e., democratizing domain expertise that allows novices to produce specialized harmful content, and scaling harmful operations at volumes that manual effort cannot match. Existing works, however, often overlook how LLMs compound harm across multi-turn conversations. We introduce HarmAmp, a new benchmark for multi-turn harm amplification scenarios spanning twelve risk categories. Each scenario is grounded in real-world threats and satisfies rigorous criteria, i.e., substantive amplification, operational specificity, and multi-turn necessity. We further propose TrajSafe, a proactive monitor that anticipates harmful trajectories and intervenes through actions such as probing users' genuine intents and steering the models towards safer completion. Our extensive experiments demonstrate that TrajSafe significantly reduces the harmfulness incurred in multi-turn interactions while preserving a low over-refusal rate and the target model's general capabilities. Our work offers a promising paradigm to alleviate the nuanced safety risks in LLM interactions.

2606.02404 2026-06-02 cs.CL (version update)

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, Seungone Kim

Affiliations: Chung-Ang University; KAIST; Seoul National University; OnelineAI; NAVER Cloud AI; Carnegie Mellon University

AI summary: Builds K-BrowseComp, a 400-problem web-browsing agent benchmark grounded in Korean contexts, evaluates frontier models on it, finds their accuracy is substantially lower than on BrowseComp, and releases the data and code.

Abstract

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.

2606.02398 2026-06-02 cs.LG cs.CL (version update)

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

Lei Yang, Siyu Ding, Deyi Xiong

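Affiliations: TJUNLP Lab, College of Intelligence and Computing, Tianjin University; Baidu Inc.

AI summary: To explain why training on one domain degrades others in multi-domain RL, the paper proposes a local perturbation theory. It proves that later-domain training harms earlier domains mainly through a second-order damage term concentrated in a low-dimensional shared conflict subspace, and shows that a short domain refresh enables selective recovery.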

Abstract

Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code → Math → QA → CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.

2606.02380 2026-06-02 cs.CL cs.AI (version update)

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

Yuyan Bu, Haowei Li, Qirui Zheng, Bowen Dong, Kaiyue Yang, Jiaming Ji, Yingshui Tan, Wenxin Li, Yaodong Yang, Juntao Dai

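Affiliations: Beijing Academy of Artificial Intelligence; Peking University; University of Science and Technology of China; University of Chinese Academy of Sciences; Alibaba Group

AI summary: Targeting the spontaneous strategic deception that LLM agents may exhibit during tool use (reported plans that diverge from executed actions), the paper proposes the SPADE-Bench benchmark. It combines actual tool execution with controlled pressure scenarios to distinguish deception from hallucination, and experiments confirm that the problem is genuine and pressing.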

Abstract

As LLM-based agents expand their operational scope, reliability becomes a prerequisite for real-world deployment. However, in practical applications, human users cannot monitor every immediate behavior; instead, the execution process often remains a black box, leaving users dependent solely on the agent's self-reported updates. This opacity creates a critical risk: agents may present observer-facing reports that diverge from their executed actions, rendering the system uncontrollable, especially in high-stakes autonomous scenarios. We term such self-reported plan-action divergence as agent deception. To assess this, we introduce SPADE-Bench, a benchmark designed to evaluate spontaneous plan-action divergence. Unlike prior deception benchmarks, SPADE-Bench simultaneously integrates actual tool execution and controlled pressure scenarios. This design ensures ecological validity and rigorously distinguishes strategic deception from mere hallucination through controlled plan-action comparisons under pressure. Experiments across mainstream models confirm that agent deception is a genuine and pressing issue in tool-use contexts. By providing a comprehensive and robust evaluation framework, SPADE-Bench fills a critical gap in agent safety, facilitating the community's progress toward building trustworthy and controllable autonomous systems.

2606.02375 2026-06-02 cs.CL cs.CY cs.HC (version update)

WAXAL-NET: Finetuned Edge ASR Across 19 African Languages

Victor Tolulope Olufemi, Oreoluwa Babatunde, Ramsey Njema, Bolarinwa Gbotemi, Wanchi Lucia Yen, John Uzodinma, Sunday Ajayi, Oluwademilade Williams, Kausar Moshood, Innocent Elendu Anyaele, Akebert Arefaine, Candace Hunzwi, Wongel Dawit Daniel, Emmilly Namuganga, Cleophas Kadima, Athanase Bahizire, Onitsiky Ranaivoson, Emmanuel Aaron, Nicholaus Ladislaus, Idris Muhammed, Jonathan Enoch Simenya, Martin Koome, Matewos Tegete Endaylalu, Peter Ifeoluwa Adeyemo, Hondi Prisca Birindwa, Ukachi Agnes Eze-Mbey, Yacoba Oduro-Yeboah, Pericles Adjovi, Mikel K. Ngueajio, Toluwani Aremu, Prasenjit Mitra

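Affiliations: CMU Africa; LyngualLabs; MBZUAI

AI summary: This study evaluates whether compact domain-specialized ASR models outperform massively multilingual foundation models on conversational speech in 19 African languages from the WAXAL corpus. Fine-tuned edge models reduce macro-averaged WER from 64.9% to 38.0% with models 3-40 times smaller, confirming that domain specialization dominates scale.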

Abstract

We evaluate whether compact domain-specialized ASR models can outperform massively multilingual foundation models for conversational African speech across 19 languages in the WAXAL corpus. Fine-tuned edge models achieve a macro-averaged WER of 38.0% compared to 64.9% for the best zero-shot baseline, a 26.9 percentage-point reduction using models 3-40× smaller. Results confirm that domain specialization dominates scale for spontaneous African speech. Cross-domain evaluation shows that fine-tuned models recover usable performance on out-of-distribution (OOD) speech, while zero-shot models regain an advantage when the test domain matches their pretraining distribution. A distributed native-speaker audit across all surveyed languages produces a linguistically-grounded error taxonomy, showing that CTC and autoregressive architectures behave differently across language families. We further show that WER alone misrepresents performance for syllabary-script languages where CER/WER ratios reveal substantially higher character-level accuracy than headline WER suggests. Finally, to contribute to future African ASR research, we release all model weights, fine-tuning and evaluation scripts, and a cleaned WAXAL subset covering all 19 languages.
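
The gap between word-level and character-level error that this abstract mentions can be illustrated with a small sketch. The example strings, the convention of counting spaces as characters in CER, and the per-language values fed to the macro average are all made up here for illustration; none of them come from the WAXAL evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance; works on strings (characters) or lists (words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, counting spaces as characters (assumed convention)."""
    return edit_distance(reference, hypothesis) / len(reference)


def macro_average(per_language_scores):
    """Unweighted mean across languages, regardless of test-set size."""
    return sum(per_language_scores.values()) / len(per_language_scores)


reference = "selam dehna"
hypothesis = "selam dehno"   # a single wrong character

print(wer(reference, hypothesis))   # 0.5
print(round(cer(reference, hypothesis), 3))   # 0.091
print(macro_average({"lang_a": 0.5, "lang_b": 0.25}))  # 0.375
```

One wrong character makes half the words wrong but fewer than a tenth of the characters, which is why reporting the two rates together gives a fuller picture than WER alone.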

2606.02373 2026-06-02 cs.AI cs.CL cs.IR (version update)

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han

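AI summary: Proposes Harness-1, a 20B-parameter search agent trained with reinforcement learning inside a stateful search harness that externalizes routine state management to the environment. Across eight retrieval benchmarks it achieves 0.730 average curated recall, outperforming the next strongest open search subagent by 11.4 points.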

Abstract

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.

2606.02372 2026-06-02 cs.AI cs.CL (updated version)

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

Youwei Liu, Jian Wang, Hanlin Wang, Wenjie Li

Affiliations: Central South University; College of Computer Science, Sichuan University; Department of Computing, The Hong Kong Polytechnic University

AI summary: Proposes COMAP, a framework that co-evolves a textual world model and an agent policy through closed-loop interaction, significantly improving performance on embodied task planning, web navigation, and tool-use benchmarks.

English abstract

Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution. However, existing textual world models are typically fixed after training, preventing them from adapting to the on-policy state-action distributions induced by an evolving agent. Meanwhile, agent-improvement methods often rely on external rewards or verifiers, limiting their applicability in realistic interactive environments. In this paper, we propose COMAP, a novel framework that co-evolves textual world models and agent policies through closed-loop interaction. At each decision step, the world model predicts future state feedback for candidate actions, and the agent performs future-aware reflection by estimating the reliability of this feedback and refining its action accordingly. The resulting on-policy trajectories are then used to update the world model via self-distillation, allowing it to better match the agent's evolving interaction distribution. Across embodied task planning, Web navigation, and tool-use benchmarks, COMAP consistently outperforms competitive baselines, e.g., +16.75% relative improvement with Qwen3-4B. Further analyses show that the co-evolutionary loop improves the world model's prediction accuracy over time and leads to more effective long-horizon decision-making. Our code is available at: https://github.com/loyiv/CoMAP.

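2606.02304 2026-06-02 cs.CL (updated version)

DECK: A Consistency × Confidence Taxonomy of LLM Hallucinations

Mohit Singh Chauhan

Affiliations: University of California, Berkeley

AI summary: Proposes DECK, a 2×2 taxonomy based on inter-sample consistency and token-level confidence that divides LLM hallucinations into four behavioural regimes, each mapped to the scorer families able to detect it; experiments validate the taxonomy and identify a universal blind spot of output-level uncertainty estimation.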

Comments 18 pages, 3 figures, 5 tables

English abstract

Existing hallucination taxonomies classify LLM errors by what is wrong with the output: memorised misconceptions, reasoning failures, fluent fabrications. These taxonomies are useful for diagnosis but cannot answer a different question: which uncertainty scorer would have caught this error? We propose a complementary taxonomy that classifies errors by their detectability signature, that is, the signal a scorer family would read. The DECK taxonomy is a 2x2 partition along inter-sample consistency and token-level confidence into four behavioural regimes (Drift, Entrenched, Confabulation, Knotted), each mapping to a specific scorer family (or families) that can detect it: black-box consistency scorers have signal in D and C, white-box token-probability scorers have signal in K and C, and only an LLM-as-a-Judge with independent pretraining can detect E. Cell membership is operationalised by a Youden's J optimal split on each scorer axis. Across three models and four datasets we validate the taxonomy two ways: by analysing scorer-pair disagreement, and by checking that external labels (SelfAware unanswerable, HaluEval adversarial, PopQA entity popularity) land in the predicted DECK cells, with model-scale and content-specific secondary-cell refinements. We further identify a universal blind spot of output-level uncertainty quantification (UQ): on knowledge-gap inputs where the generator emits confident, repeatable fabrications, every output-level family collapses by construction. A linear probe on Llama-3-8B's hidden states also collapses to chance, giving preliminary evidence that the failure may persist at the activation level; richer internal-state methods (UQ heads, information-theoretic estimators) remain to be tested.
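
The 2x2 assignment can be sketched in a few lines. This is an illustration, not the paper's code: the function names are hypothetical, and the cell mapping is inferred from which scorer families the abstract says have signal in each regime (both in Confabulation, consistency only in Drift, token probability only in Knotted, neither in Entrenched). Youden's J is the standard TPR minus FPR statistic, and the sketch assumes both error and non-error examples are present.

```python
def youden_threshold(scores, is_error):
    """Return the cut that maximizes Youden's J = TPR - FPR.

    scores: higher means "more suspicious"; is_error: matching booleans.
    An item counts as flagged when its score is >= the returned threshold.
    """
    positives = sum(is_error)
    negatives = len(is_error) - positives
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for s, e in zip(scores, is_error) if e and s >= t)
        fp = sum(1 for s, e in zip(scores, is_error) if not e and s >= t)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_j, best_t = j, t
    return best_t


def deck_cell(inconsistent: bool, low_confidence: bool) -> str:
    """Map two binary flags to a DECK regime (inferred reading of the abstract)."""
    if inconsistent and low_confidence:
        return "Confabulation"   # both scorer families have signal
    if inconsistent:
        return "Drift"           # only consistency scorers have signal
    if low_confidence:
        return "Knotted"         # only token-probability scorers have signal
    return "Entrenched"          # neither output-level family has signal
```

One threshold would be fitted per scorer axis, and the two resulting flags select the cell.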

2606.02276 2026-06-02 cs.CV cs.AI cs.CL cs.LG (updated version)

Cross-modal linkage risk in clinical vision-language models

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

Affiliations: Lab for AI in Medicine; RWTH Aachen University; Department of Diagnostic and Interventional Radiology

AI summary: Studies the risk that clinical vision-language models (VLMs) re-link separated images and reports across modalities via cosine similarity, and shows that differentially private fine-tuning of only the projection heads substantially lowers the re-linkage rate while preserving image utility.

English abstract

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce it without retraining, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer (epsilon = 0.34, delta = 6x10^-6). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP finetuning of the shared alignment layer can substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.
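
To make the "times chance" figures concrete: assuming chance Recall@1 is 1/N (one correct report among N candidates), 15 times chance at N = 100 corresponds to a Recall@1 of 15%, and 50 times chance at N = 10,000 to 0.5%. The sketch below shows the retrieval audit in its simplest form, with toy embeddings; the function names are hypothetical and the paper's evaluation code may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def recall_at_1(image_embs, report_embs):
    """Fraction of images whose most similar report is their true pair.

    Assumes image_embs[i] and report_embs[i] come from the same study.
    """
    hits = 0
    for i, img in enumerate(image_embs):
        sims = [cosine(img, rep) for rep in report_embs]
        if max(range(len(sims)), key=sims.__getitem__) == i:
            hits += 1
    return hits / len(image_embs)


def recall_from_chance_multiple(multiple, pool_size):
    """Convert "k times chance" into absolute Recall@1, with chance = 1/N."""
    return multiple / pool_size
```

A defense such as the one described would be judged by how far `recall_at_1` falls toward `1 / pool_size` while downstream image metrics stay flat.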

2606.02255 2026-06-02 cs.CL cs.AI (updated version)

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen, Christian Greisinger, Lotta Kiefer, Christoph Leiter, Subhadeep Roy, Tewodros Achamaleh, Muhammad Arslan Manzoor, Sebastian Pohl, Yufang Hou, Steffen Eger

Affiliations: NLLG Lab University of Technology Nuremberg; Interdisciplinary Transformation University

AI summary: A large-scale audit of human annotation reporting in NLP papers finds that annotation details are reported incompletely, and proposes a framework and recommendations for improving reporting quality.

English abstract

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.

2606.02252 2026-06-02 cs.CL (updated version)

ResMerge: Residual-based Spectral Merging of Large Language Models

Yandu Sun, Zhiyan Hou, Haokai Ma, Yuheng Jia, Junfeng Fang, Haiyun Guo, Hongyan An, Weizhen Wang, Jinqiao Wang

Affiliations: Southeast University; Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; National University of Singapore; Wuhan University of Technology; Peking University; Wuhan AI Research

AI summary: Proposes ResMerge, a framework that decomposes RL task vectors into a spectral head and a residual component, builds a stable backbone from the residuals, and selectively reintroduces head information, merging expert models without training.

Comments 14 pages including appendix

English abstract

Model merging offers a training-free way to combine multiple post-trained expert models, but merging experts obtained through reinforcement learning (RL) remains challenging. Existing spectral merging methods often assume that leading singular directions contain the main task signal, while lower-energy residual components can be compressed, selected, or attenuated to reduce interference. We find that this assumption does not hold for RL task vectors: after decomposing each task vector into a leading spectral head and a residual component, both parts can independently recover substantial behavior knowledge, while exhibiting different merging properties. The head is highly concentrated and informative but more prone to sharp cross-expert conflicts, whereas the residual component is more dispersed and provides a more stable basis for aggregation. Based on this observation, we propose ResMerge, a residual-based spectral merging framework for RL experts. ResMerge first constructs a stable residual backbone with Spherical Residual Consensus Adaptation, which estimates a reliability-weighted consensus direction on the Frobenius sphere. It then reintroduces leading-head information through a Lightweight Head Correction module gated by positive cross-expert agreement. Experiments across multiple RL expert groups and capability domains show that ResMerge better preserves expert capabilities than representative task-vector and spectral merging baselines. The implementation of ResMerge is publicly available at https://github.com/sunyd0303-cpu/ResMerge-release.

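2606.02248 2026-06-02 cs.CL (updated version)

Geometric Latent Reasoning Induces Shorter Generations in LLMs

Shashi Kumar, Yacouba Kaloga, Petr Motlicek, Ina Kodrasi, Andrea Cavallaro

Affiliations: Idiap Research Institute; EPFL; BUT

AI summary: Proposes Geometric Latent Reasoning (GLR), which uses a lightweight transition head to predict iterative direction updates in embedding space, approximating discrete reasoning trajectories and substantially reducing the number of generation steps without explicitly optimizing for length.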

English abstract

Large language models solve complex problems by generating lengthy chains of explicit reasoning tokens. While effective, this makes reasoning expensive, length-sensitive, and constrained to (discrete) natural language. While latent reasoning offers a continuous alternative, determining useful structures for intermediate latent states is an open challenge. In this paper, we formulate latent reasoning as a geometric path-approximation problem within the model's pretrained token-embedding space. We introduce Geometric Latent Reasoning (GLR), which uses a lightweight transition head to predict iterative direction updates in embedding space. Using textual chain-of-thought traces as anchors, GLR learns to approximate discrete reasoning trajectories while permitting continuous deviations from exact token embeddings. Evaluations on mathematical reasoning benchmarks using Qwen3 models reveal an emergent phenomenon: geometric latent reasoning induces substantially shorter generations without an explicit length objective. By replacing early explicit reasoning with continuous latent steps, models often reach correct answers using substantially fewer total generation steps. These findings suggest that continuous trajectories act as compact intermediate reasoning states, exposing a new tradeoff between latent computation budget, output length, and accuracy.

2606.02245 2026-06-02 cs.CL (updated version)

When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation

Mingyan Wu, Han Yang, Omer Ben-Porat, Yftah Ziser

Affiliations: Northeastern University; Technical University of Munich; GESIS – Leibniz Institute for the Social Sciences; Technion–Israel Institute of Technology; NVIDIA Research; University of Groningen

AI summary: Introduces a cost-aware RAG setting with access-cost tiers and budget constraints to study evidence-selection strategies, finding that static selection is brittle while agentic approaches are promising but model- and task-dependent.

English abstract

Retrieval-Augmented Generation (RAG) typically assumes that external knowledge is free, but many high-quality sources are paywalled, licensed, restricted, or otherwise costly to access. We introduce cost-aware RAG, a setting where retrieved evidence is assigned access-cost tiers and systems must answer under an explicit evidence-access budget. We instantiate this setting by augmenting MS MARCO v2.1 with access-friction tiers and evaluate budgeted evidence selection across general-domain and domain-specific QA benchmarks. Our results show that static selection is brittle: no fixed selector uniformly dominates, and larger budgets do not reliably improve answer quality, even when costly evidence is domain-matched. We then study agentic cost-aware RAG, where an LLM decides when to retrieve, which tier to access, and when to stop. Agents show strong promise as adaptive evidence-acquisition controllers, but their behavior remains highly model- and task-dependent. These findings suggest that cost-aware evidence acquisition is a central challenge for the next generation of RAG systems. All code and data are available at https://github.com/Mignonmy/Cost-Aware.
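
A static selector of the kind the abstract calls brittle can be sketched as a greedy choice by relevance per unit cost. The function name, the tuple layout, and the cost values are assumptions for illustration; the paper's selectors and tier definitions are not given in the abstract.

```python
def select_evidence(passages, budget):
    """Greedy static selector: best relevance per unit cost, within budget.

    passages: list of (passage_id, relevance, cost) with cost > 0.
    Returns the chosen ids (in selection order) and the total cost spent.
    """
    chosen, spent = [], 0
    ranked = sorted(passages, key=lambda p: p[1] / p[2], reverse=True)
    for pid, _relevance, cost in ranked:
        if spent + cost <= budget:
            chosen.append(pid)
            spent += cost
    return chosen, spent
```

With hypothetical costs of 1, 3, and 5 for free, licensed, and paywalled passages, raising the budget from 5 to 6 leaves the selection unchanged, a toy version of the observation that a larger budget does not automatically buy better evidence.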

2606.02215 2026-06-02 cs.CL cs.SI (updated version)

Better with Experience: Self-Evolving LLM Agents for Evidence-Grounded Health Community Notes

Zihang Fu, Fanxiao Li, Jianyang Gu, Haonan Wang, Preslav Nakov, Bryan Hooi, Min-Yen Kan, Jiaying Wu

Affiliations: National University of Singapore; Yunnan University; The Ohio State University; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes EvoNote, a framework that uses a self-evolving experience memory and fine-grained credit assignment to carry out evidence acquisition, analysis, and writing for health Community Notes, markedly improving note quality and shortening generation time.

English abstract

Large Language Model (LLM)-augmented Community Notes offer a scalable path for timely, evidence-grounded correction of health misinformation on social platforms. However, they still reset at every post, leaving useful correction experience from prior cases unused. We introduce EvoNote, an agentic framework that enables health Community Notes generation to self-evolve through an evolving experience memory of prior misinformation correction episodes. Its core is fine-grained credit assignment: EvoNote grounds trajectory-level feedback in health-specific note qualities and distills it into action-level memory for claim analysis, evidence acquisition, and note writing. We evaluate EvoNote on MM-HealthCN, a 1.2K-instance multimodal benchmark of user-flagged health posts with human-written Community Notes and crowd-derived helpfulness labels. Under a human-validated hierarchical utility judge, EvoNote-generated notes are preferred over corresponding human-written notes in 89.6% of cases; on a separate set of Needs More Ratings posts without a crowd helpfulness verdict, EvoNote produces helpful notes for 82.0% of cases. It also reduces the median time needed to produce a candidate correction from over 13 hours in the human-note pipeline to under 2 minutes. Analyses link these gains to stronger evidence use and reusable correction strategies, positioning self-evolving note generation as a promising paradigm for health misinformation governance.

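2606.02214 2026-06-02 cs.CL (updated version)

Do Gender Cues Affect LLM Value Trade-offs? Evidence from a Controlled Decision Benchmark

Yangyang Liu, Dong Yu, Pengyuan Liu

Affiliations: Beijing Language and Culture University

AI summary: Builds the controlled benchmark RVDB to study whether gender cues cause systematic flips in LLM value decisions, finding that gender effects concentrate where value boundaries are ambiguous and decision severity is high, and that model self-attributions often mask the influence of gender.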

English abstract

Large language models are increasingly used in value-sensitive decision settings, where irrelevant demographic cues should not alter judgments. We construct the Realistic Value Decision Benchmark (RVDB), a controlled benchmark that varies only the role-gender configuration while holding the scenario, ordered value pair, roles, candidate decisions, Value Distance, and Decision Severity fixed. Using a position-balanced evaluation across seven models, we test whether models preserve decision invariance under gender perturbations and whether their self-attributions reflect observed behavioral changes. We find that explicit gender cues induce bounded but systematic decision flips, including under an explicit gender-attribution prompt that asks models to report whether gender influenced their choice. Cross-gender role swaps reveal a consistent female-proposed-decision asymmetry, while models often attribute flipped decisions to No Influence or other non-gender factors. Further analysis shows that gender effects concentrate near less determinate value boundaries and under more severe decision contexts, suggesting that gender cues act as local boundary-shifting factors rather than global overrides of value reasoning. Value rankings remain largely stable, but ordered value-pair trade-offs shift unevenly across role-gender configurations. These results show that gender can enter LLM value trade-offs behaviorally while remaining obscured in self-attribution, motivating controlled behavioral audits beyond explanation-based evaluation.

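2606.02211 2026-06-02 cs.CL cs.AI (updated version)

Multimodal Approaches to Document Type Classification in Visually Rich Documents: A Comparative Analysis

Catyana Heyne, Jürgen Frikel, Filippo Riccio

AI summary: To address the difficulty of systematically comparing multimodal modeling strategies for document type classification in visually rich documents, this paper runs a controlled comparison of four representative Transformer- and LLM-based models within a unified experimental framework, finding that specialized multimodal Transformers outperform LLM-based approaches and that image information contributes the most.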

English abstract

Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual, and layout modalities. To capture this complexity, current approaches rely on diverse multimodal modeling strategies, resulting in heterogeneous architectures that complicate systematic comparison. This variability is also reflected in existing comparative studies, which often rely on heterogeneous evaluation setups, further complicating systematic comparison and making it difficult to assess progress. To address these limitations, this work provides a structured analysis of multimodal design strategies across transformer- and LLM-based architectures, combined with a controlled empirical comparison within a unified experimental framework. Specifically, four representative models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B) are evaluated on the RVL-CDIP benchmark to systematically analyze the contributions of text, image, and layout information for document type classification, with a particular focus on contrasting OCR-dependent and OCR-free approaches. The results show that specialized multimodal Transformers outperform LLM-based approaches on visually rich and layout-intensive documents. Image information contributes most strongly to reliable classification, while OCR-derived text provides useful but secondary support. These findings highlight that multimodal processing remains essential for documents with pronounced layout structure. Overall, the study provides a systematic basis for comparing multimodal architectures and offers practical guidance for selecting effective feature combinations and model designs for document type classification.

2606.02161 2026-06-02 cs.CV cs.CL (updated version)

InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models

Xinxin Liu, Shiwei Gan, Xiao Liu, Yafeng Yin, Lei Xie, Sanglu Lu

Affiliations: State Key Laboratory of Novel Software Technology

AI summary: Proposes InfoMerge, a training-free visual token compression method that uses robust redundancy estimation and content-aware budget allocation to retain 98.8% of performance while removing 85% of visual tokens, with a 4.24x prefill speedup.

Comments 15 pages, 8 figures

English abstract

Video Large Language Models (Video-LLMs) achieve strong performance in video understanding, but their excessive visual tokens bring substantial computational overhead. Existing training-free compression methods improve inference efficiency by reducing visual tokens, yet they often rely on local adjacent-frame similarity for temporal redundancy estimation or allocate token budgets mainly according to segment length. Such designs are sensitive to frame-level noise and fail to capture the non-uniform information distribution of real-world videos. To address these challenges, we propose InfoMerge, a training-free visual token compression method that improves token utilization through robust redundancy estimation and content-aware budget allocation. Specifically, we propose the Temporal Fingerprint Difference: a segment-level second-order temporal redundancy estimation strategy, which models the temporal similarity structure of tokens at the same spatial positions within each segment. We further introduce Content-Aware Budget Allocation (CABA), which dynamically allocates segment-level token budgets based on segment uniqueness and spectral-entropy-based representational richness. By reducing repeated preservation of redundant static regions and allocating more tokens to informative segments, InfoMerge makes better use of the limited token budget while maintaining strong performance. Extensive experiments show that InfoMerge achieves strong efficiency-accuracy trade-offs across multiple benchmarks and backbones, with more pronounced advantages under aggressive compression. On LLaVA-OneVision-7B, InfoMerge retains 98.8% of the original average performance while reducing 85% of visual tokens and achieving a 4.24-fold speedup in the prefill stage.

2606.02158 2026-06-02 cs.CL (updated version)

On the Salience of Low-Probability Tokens for AI-Generated Text Detection: A Multiscale Uncertainty Perspective

Yikai Guo, Bin Wang, Xilai Fan, Wenjun Ke, Haoran Luo

Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: To address boilerplate-token dominance and brittle point estimates in AI-generated text detection, proposes Uncertainty, a multiscale uncertainty estimator based on low-probability tokens, and its extension Uncertainty++, validating effectiveness, generalization, and robustness across multiple datasets and LLMs.

Comments Accepted by ICML 2026 main conference

English abstract

AI-generated text increasingly blends with human writing, raising practical risks such as misinformation, academic misuse, and corpus contamination. While statistical detectors are appealing for efficiency and generalization, they suffer from two key limitations: (i) boilerplate dominance, where boilerplate tokens shared across human and LLM writing can overwhelm discriminative signals; and (ii) brittle point estimates, where relying on a single probability score yields unstable decisions under adversarial manipulations. To address these issues, we propose Uncertainty, a multiscale uncertainty estimator that focuses on informative low-probability tokens, which more clearly expose distributional discrepancies. Locally, it alleviates boilerplate dominance by averaging the log-probabilities of low-probability tokens; globally, it reduces brittleness by capturing the distributional shape of this low-probability region via Rényi entropy. We further extend the detector to Uncertainty++ via conditional independent sampling, yielding a more stable uncertainty estimation. Experiments across seven datasets and sixteen LLMs demonstrate high effectiveness, generalization, and robustness. Our code is available at https://github.com/guoyikai2000/Uncertainty-AIGT.
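
The two ingredients named above, a mean log-probability over low-probability tokens and a Rényi entropy over that region, can be written down directly. This is a minimal sketch under stated assumptions: the function names, the `fraction` cutoff, the choice of alpha, and how the two quantities would be combined are all hypothetical, since the abstract does not specify them. The entropy uses the standard definition H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha) for alpha != 1.

```python
import math


def low_prob_mean_logprob(token_probs, fraction=0.2):
    """Mean log-probability over the lowest-probability tokens.

    `fraction` (share of tokens kept) is an assumed knob, not from the paper.
    """
    k = max(1, int(len(token_probs) * fraction))
    lowest = sorted(token_probs)[:k]
    return sum(math.log(p) for p in lowest) / k


def renyi_entropy(weights, alpha=2.0):
    """Rényi entropy of the normalized weights, for alpha != 1."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
```

For a uniform distribution over four items the entropy equals log 4 for any valid alpha, which makes a convenient sanity check.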

2606.02147 2026-06-02 cs.CL cs.AI (updated version)

Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

Saeed Almheiri, Bilal Elbouardi, Salsabila Zahirah Pranida, Irina Nikishina, Ashwath Rao B, Parameswari Krishnamurthy, Muhammad Cendekia Airlangga, Rifo Ahmad Genadi, Nguyen Phan Gia Bao, Amir Hossein Yari, Hawau Olamide Toyin, Nurdaulet Mukhituly, Mena Attia, Besher Hassan, Ahmad Fathan Hidayatullah, Tatsuki Kuribayashi, Haonan Li, Suma Bhat, Fajri Koto

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; University of Hamburg; Manipal University; IIIT Hyderabad; University of Science and Technology of Hanoi; Universitas Islam Indonesia; Princeton University

AI summary: For multilingual idiom understanding, builds MIDI, a dataset covering 3 high-resource, 3 medium-resource, and 12 low-resource languages with literal and figurative uses in sentence and conversational contexts; experiments show that comprehension is worse in low-resource languages, that literal readings are harder than figurative ones, and that conversational context helps without closing the gap.

English abstract

Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages and typically evaluates isolated idiom-meaning questions, overlooking realistic discourse. We introduce MIDI, a multilingual idiom dataset spanning 3 high-, 3 medium-, and 12 low-resource languages, curated by native speakers. Unlike previous datasets, MIDI provides idioms embedded in both sentence-level and conversational contexts, capturing both literal and figurative readings. Benchmarking state-of-the-art models shows that idiom comprehension degrades in low-resource languages and that, in all resource tiers, literal interpretations are substantially harder than figurative ones. Conversational context improves performance but does not eliminate these disparities. Through controlled tests and interventions on hidden representations, we further separate memorization from reasoning, exposing core limitations of current models.


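2606.02113 2026-06-02 cs.CL cs.AI (updated version)

A Primer in Post-Training Reasoning Data: What We Know About How It Works

Yaoming Li, Guangxiang Zhao, Qilong Shi, Lin Sun, Xiangzheng Zhang, Tong Yang

AI summary: This survey reviews the types, utility, construction methods, and scaling laws of post-training reasoning data, and offers an attribution framework for future reasoning-data releases and post-training recipes.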

Comments 22 pages. Project Repository: https://github.com/RenBing-Sumeru/Awesome-LLM-Reasoning-Data

2606.02111 2026-06-02 cs.CV cs.AI cs.CL (updated version)

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

Choongwon Kang, Seungjong Sun, Hyunmin Jun, Jang Hyun Kim

Affiliations: Department of Applied Artificial Intelligence, Sungkyunkwan University; Department of Human-Artificial Intelligence Interaction, Sungkyunkwan University

AI summary: Introduces the MCV SafetyBench dataset, which uses multi-clip videos to evaluate safety vulnerabilities of multimodal large language models. The video modality is found to be more vulnerable than images, dynamic and diverse contexts raise attack success rates, and a defense strategy is proposed that builds on the robustness of the image modality.

Comments 27 pages, 20 figures, Accepted to the Main Conference of ACL 2026

Abstract

As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear which properties of video inputs induce this vulnerability. To address this gap, we introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos designed to evaluate how the diversity of video inputs affects the vulnerability of MLLMs. Each video consists of multiple short clips depicting diverse contexts related to a harmful query. Experiments on eight representative video MLLMs show that attack success consistently increases with the number of clips. Our results further indicate that the video modality is (1) more vulnerable than the image modality, (2) more vulnerable to dynamic videos than to static videos, and (3) more vulnerable when videos contain more diverse contexts. Building on these findings, we propose a defense strategy that leverages the relative robustness of the image modality.


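2606.02100 2026-06-02 cs.CL (updated version)

PortBERT: Navigating the Depths of Portuguese Language Models

Raphael Scheible-Schmitt, Henry He, Armando B. Mendes

AI summary: Presents PortBERT, a family of RoBERTa-based Portuguese language models trained on more than 450 GB of data with byte-level BPE tokenization and stable pre-training. The models reach competitive results on the ExtraGLUE benchmark, and the paper focuses on analyzing training and inference efficiency.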

DOI: 10.26615/978-954-452-105-9-008
Journal ref: Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models, 2025, pp. 59-71

Abstract

Transformer models dominate modern NLP, but efficient, language-specific models remain scarce. For Portuguese, most work focuses on scale or accuracy, often neglecting training and deployment efficiency. In the present work, we introduce PortBERT, a family of RoBERTa-based language models for Portuguese, designed to balance performance and efficiency. Trained from scratch on over 450 GB of deduplicated and filtered mC4 and OSCAR23 from CulturaX using fairseq, PortBERT leverages byte-level BPE tokenization and stable pre-training routines across both GPU and TPU processors. We release two variants, PortBERT base and PortBERT large, and evaluate them on ExtraGLUE, a suite of translated GLUE and SuperGLUE tasks. Both models perform competitively, matching or surpassing existing monolingual and multilingual models. Beyond accuracy, we report training and inference times as well as fine-tuning throughput, providing practical insights into model efficiency. PortBERT thus complements prior work by addressing the underexplored dimension of compute-performance tradeoffs in Portuguese NLP. We release all models on Hugging Face and provide fairseq checkpoints to support further research and applications.


2606.02093 2026-06-02 cs.CL cs.AI cs.LG (updated version)

The Role of Ambiguity in Error Prediction via Uncertainty Quantification

Ieva Raminta Staliūnaitė, James Bishop, Andreas Vlachos

Affiliations: University of Cambridge; The Alan Turing Institute

AI summary: Improves error prediction for large language models on question answering by disentangling input ambiguity from uncertainty signals, using gated experts and selective prediction.

Comments 8 pages not including references and appendices, 3 figures

Abstract

The task of Error Prediction, namely predicting whether a model output is correct, is commonly tackled with Uncertainty Quantification (UQ). However, while uncertainty metrics capture when models lack knowledge or capacity to make a prediction, they also reflect aleatoric uncertainty, which is inherent in the model input and context. This paper presents a method for improving error prediction for Large Language Models (LLMs), by disentangling input ambiguity from UQ signal. We conduct experiments on the task of Question Answering (QA) with six UQ metrics and show that UQ metrics are more predictive of errors on unambiguous instances than on questions with multiple plausible answers. We use Gated Experts and Selective Prediction to incorporate gold and predicted ambiguity labels into the error prediction pipeline. We find that ambiguity information improves error prediction scores across model families, training and evaluation paradigms, datasets (including allegedly unambiguous ones), and sources of aleatoric uncertainty, yielding improvements of over 10 points of PRR for individual UQ metrics on standard datasets.


2606.02041 2026-06-02 cs.CL (updated version)

SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

Jiaqi Yu, Xin Wang, Yixu Wang, Jie Li, Yan Teng, Xingjun Ma, Yingchun Wang

Affiliations: Fudan University; Shanghai AI Laboratory

AI summary: Proposes SentGuard, a sentence-level streaming guardrail that runs in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks, aiming at accurate detection of unsafe content at low latency.

Comments 16 pages, 5 figures, submitted to ARR

Abstract

Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.
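The waiting-buffer idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sentence-boundary rule, the function name `stream_with_guard`, and the `is_safe` callback are all assumptions made for illustration, and the guard here runs sequentially rather than in parallel with decoding.

```python
from typing import Callable, Iterable, Iterator

# Assumed sentence terminators; the abstract does not say how boundaries are detected.
SENTENCE_END = (".", "!", "?")


def stream_with_guard(
    tokens: Iterable[str],
    is_safe: Callable[[str], bool],
) -> Iterator[str]:
    """Yield sentence chunks only after the guard approves the prefix so far."""
    prefix = ""  # all text assembled into complete sentences so far
    buffer = ""  # tokens of the sentence currently being assembled
    for token in tokens:
        buffer += token
        if token.rstrip().endswith(SENTENCE_END):
            prefix += buffer
            if not is_safe(prefix):
                return  # stop streaming; the unverified chunk is never released
            yield buffer
            buffer = ""
    # Trailing text without a terminator is checked once generation ends.
    if buffer and is_safe(prefix + buffer):
        yield buffer


# Hypothetical guard: flags any prefix containing the word "bad".
demo_tokens = ["Hello", " world", ".", " Do", " bad", " thing", ".", " More", "."]
released = list(stream_with_guard(demo_tokens, lambda text: "bad" not in text))
```

In this toy run only the first sentence is released, because the second sentence makes the prefix fail the hypothetical check before anything from it reaches the user.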


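2606.02020 2026-06-02 cs.CL cs.LG (updated version)

Unveiling the Entropy Dynamics of Chain-of-Thought Reasoning

Ting Xu, Xu He, Yupu Lu, Jiankai Sun, Dong Li, Wai Lam, Jianye Hao

Affiliations: University of Science and Technology of China

AI summary: Uses entropy dynamics to reveal a two-phase structure in chain-of-thought reasoning (an uncertainty region and a confidence region), and proposes a training-free framework based on CUSUM change-point detection for early exit and test-time scaling, improving reasoning efficiency and reliability.

Comments 21 pages, 10 figures, accepted in ICML2026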

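Abstract (translated from Chinese)

This paper studies the entropy dynamics of chain-of-thought (CoT) reasoning and reveals a consistent two-phase structure: an exploratory uncertainty region followed by a sharp transition into a convergent confidence region. We show that the confidence region has two key properties: 1) high reliability, in that answers in the confidence region become highly accurate and stable, and 2) high redundancy, in that the model generates long stretches of unnecessary tokens after reaching the correct answer. These properties unlock more efficient and reliable reasoning strategies: 1) early exit exploits reliability and redundancy to safely terminate computation once returns diminish, and 2) test-time scaling uses the confidence-region signal to prioritize converged trajectories. To operationalize these insights, we formulate confidence-region detection as a sequential change-point detection problem, applying classical change-point methods to the monitoring of CoT reasoning for the first time. Using the cumulative sum (CUSUM) algorithm, a statistically optimal change-point detector, we develop a training-free framework for real-time reasoning control. Experiments show that our method establishes a superior Pareto frontier for early exit. CUSUM reaches 63.06% accuracy while using 11.1% fewer tokens, exceeding DEER and Dynasor in accuracy by 3.28% and 4.36%, respectively. For test-time scaling, CUSUM-weighted voting consistently outperforms self-consistency.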
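As a rough illustration of how a one-sided CUSUM detector could flag the move from a high-entropy region to a low-entropy region, here is a minimal sketch. The `baseline`, `slack`, and `threshold` parameters and the toy entropy trace are assumptions chosen for illustration; the abstract does not give the paper's statistic or settings.

```python
from typing import Optional, Sequence


def cusum_drop_index(
    entropies: Sequence[float],
    baseline: float,
    slack: float = 0.1,
    threshold: float = 1.0,
) -> Optional[int]:
    """Return the first step at which a one-sided CUSUM flags a sustained entropy drop.

    baseline, slack and threshold are illustrative knobs, not values from the paper.
    """
    stat = 0.0
    for step, entropy in enumerate(entropies):
        # Accumulate evidence that entropy sits below (baseline - slack);
        # clamp at zero so earlier high-entropy steps do not count against it.
        stat = max(0.0, stat + (baseline - slack - entropy))
        if stat > threshold:
            return step
    return None


# Toy per-step entropy trace: noisy and high at first, then sharply lower.
trace = [1.2, 1.1, 1.3, 1.0, 0.2, 0.1, 0.1, 0.1]
exit_step = cusum_drop_index(trace, baseline=1.0)
```

An early-exit controller built on this kind of signal would stop decoding at or shortly after `exit_step`; how the detected point is turned into an exit decision or a voting weight is left open here.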

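MMG2Skill: Can Agents Distill Self-Evolving Skills from In-the-Wild Guides?

Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li, Hang Yan, Han Li, Yuanxing Zhang, Zhiqi Bai, Jinhua Hao, Ming Sun, Han Li, Jiaheng Liu

Affiliations: Nanjing University; Kuaishou Technology

AI summary: Proposes the MMG2Skill framework, which compiles multimodal, heterogeneous in-the-wild guides into editable skills and refines them continually through trajectory-level root-cause feedback, substantially improving VLM agents on GUI control, open-ended gameplay, and strategic card-game tasks.

Comments 35 pages, 12 figures, 13 tables. Code: https://github.com/NJU-LINK/MMG2Skill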

Abstract

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.


2606.01991 2026-06-02 cs.AI cs.CL cs.CY (updated version)

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji, Chi Harold Liu, Yaodong Yang, Juntao Dai

Affiliations: Beijing Institute of Technology; Beijing Academy of Artificial Intelligence; Institute for Artificial Intelligence, Peking University

AI summary: To address the power-seeking risk that LLM agents face as their action spaces grow, proposes SafeMCP, a server-side defense plugin that uses an internal world model for look-ahead reasoning to provide a two-tier defense of proactive tool filtering and immediate intervention, reducing risk while preserving agent utility.

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference

Abstract

As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While broad action space and greater environment influence are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a server-side defense plugin that constrains tool acquisition via predictive reasoning regarding future safety risks. SafeMCP utilizes an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.


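2606.01967 2026-06-02 cs.CL (updated version)

Eyettention II: A Dual-Sequence Architecture for Modeling Fixation Location, Within-Word Landing Position, and Fixation Duration in Reading

Shuwen Deng, Cui Ding, David R. Reich, Paul Prasse, Lena A. Jäger

Affiliations: Department of Computer Science, University of Potsdam; Department of Computational Linguistics, University of Zurich; Department of Informatics, University of Zurich

AI summary: Proposes Eyettention II, a lightweight, end-to-end trained deep-learning model whose dual-sequence architecture generates complete scanpaths with fixation location, within-word landing position, and fixation duration. It outperforms existing models in prediction and captures key psycholinguistic phenomena.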

Abstract

The way our eyes move while reading provides valuable insights into both the reader's cognitive processes and the properties of the text. In particular, eye-tracking-while-reading data has been shown to be highly beneficial in various technological applications, such as enhancing and interpreting language models and inferring a reader's characteristics. However, these applications often rely on large-scale, data-driven models, which demand extensive eye-tracking datasets that are challenging to obtain due to the resource-intensive nature of data collection. To address the challenge of data scarcity, we develop Eyettention II, an end-to-end trained deep-learning model capable of generating realistic scanpaths consisting of a complete set of fixation attributes in chronological order, including fixation location, within-word landing position, and fixation duration. Our model is lightweight, efficiently trainable on limited GPU resources, and closely aligned with cognitive theories. We demonstrate that Eyettention II surpasses state-of-the-art models in scanpath prediction and mirrors human-like gaze behavior by capturing key psycholinguistic phenomena. With its robust performance, Eyettention II holds the potential to drive advancements in natural language processing, facilitate piloting the materials of psycholinguistic experiments, and uncover new insights beyond what is explicitly encoded in theoretical cognitive models.


2606.01936 2026-06-02 cs.CL (updated version)

What to Format and How: A Benchmark and Workflow Approach for Document Formatting

Shihao Rao, Liang Li, Jiapeng Liu, Tong Lin, Bing Li, Xiyan Gao, Peng Fu, Jing Huang, Can Ma

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences

AI summary: For content-aware document formatting, proposes the DocFormBench benchmark and the DocFormFlow workflow method, which decouples target localization from modification execution and improves accuracy while reducing token consumption.

Abstract

Recent advances in large language models (LLMs) have opened up new possibilities for automated document formatting. However, real-world formatting often requires identifying targets based on document content. This content-aware setting remains challenging and underexplored, primarily due to the lack of dedicated evaluation datasets. To enable evaluation in realistic content-aware scenarios, we introduce DocFormBench, a benchmark that extends Text-to-Format evaluation to diverse formatting requirements, along with metrics for both accuracy and efficiency. To mitigate redundant document reading in existing methods during formatting, we propose DocFormFlow, a workflow formatting method that decouples target localization from modification execution into what to format and how. Extensive experiments across multiple LLMs and multimodal models show that DocFormFlow consistently improves formatting accuracy while reducing token consumption compared to representative baselines. Further analysis reveals that precise target localization is the primary factor influencing formatting performance. We hope DocFormBench and DocFormFlow will facilitate future research toward more intelligent and reliable document formatting.


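2606.01934 2026-06-02 cs.LG cs.CL (updated version)

HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin, Xiaoyang Qu, Ze Wang, Shuling Yang, Ziyu Peng, Kaike Zhang, Pan Zhou, Kun Zhan

Affiliations: Li Auto Inc.

AI summary: Proposes HMPO, a single-stage reinforcement learning framework that combines an adaptive median-based budget, a cosine-decay token reward, and a multiplicative reward formulation. Trained on mathematical data, it achieves 19%-46% token compression with minimal accuracy loss and generalizes to a range of tasks.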

Abstract

Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19%-46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
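One way the three components named in this abstract could fit together is sketched below. The abstract does not give HMPO's formulas, so everything here is an assumption made for illustration: the exact cosine schedule, the `max_len` cap, and the function names `length_reward` and `group_rewards` are hypothetical.

```python
import math
from statistics import median
from typing import List, Sequence


def length_reward(length: int, budget: float, max_len: int) -> float:
    """Assumed schedule: 1.0 up to the budget, cosine decay to 0.0 at max_len."""
    if length <= budget:
        return 1.0
    if length >= max_len:
        return 0.0
    progress = (length - budget) / (max_len - budget)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def group_rewards(
    lengths: Sequence[int],
    correct: Sequence[bool],
    max_len: int,
) -> List[float]:
    """Reward each rollout in a group as correctness times the length term."""
    successful = [n for n, ok in zip(lengths, correct) if ok]
    if not successful:
        return [0.0] * len(lengths)
    # Adaptive budget: median length of the rollouts that answered correctly.
    budget = median(successful)
    # Multiplying by correctness means a short but wrong answer earns nothing.
    return [
        float(ok) * length_reward(n, budget, max_len)
        for n, ok in zip(lengths, correct)
    ]


rewards = group_rewards([100, 200, 300, 50], [True, True, True, False], max_len=400)
```

In this toy group the budget is 200 tokens, so the 300-token correct rollout is penalized smoothly while the 50-token incorrect rollout receives zero, which is the intuition behind putting correctness in a product rather than a sum.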


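2606.01926 2026-06-02 cs.CL (updated version)

Mitigating Bias in Locally Constrained Decoding via Tractable Proposals

Meihua Dang, Linxin Song, Honghua Zhang, Jieyu Zhao, Guy Van den Broeck, Stefano Ermon

Affiliations: Stanford University; University of California, Berkeley; Massachusetts Institute of Technology

AI summary: To mitigate the sampling bias caused by myopic masking in locally constrained decoding, proposes globally constrained decoding (GCD) and probabilistic GCD proposals built on tensorized finite automata and used within sequential Monte Carlo sampling. On function calling, keyword-based generation, and SQL generation, they converge faster with significantly fewer particles.

Comments 13 pages, 5 figures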

Abstract

Generations from large language models often fail to conform to desired constraints such as JSON schema. Existing locally constrained decoding (LCD) approaches enforce constraints by myopically masking out next tokens, resulting in biased sampling and degradation in performance. Recent work uses sequential Monte Carlo (SMC) methods to mitigate such biases, but designing effective proposal distributions or potential functions remains a key challenge. In this work, we propose a generic approach to construct proposals and potentials for SMC sampling from $p_{\mathrm{lm}}( \cdot \mid \mathrm{constraint})$. First, we show that constraints specified as finite automata can be tensorized for efficient execution on GPUs, which we use to construct globally constrained decoding (GCD) proposals. In addition, leveraging the fact that tensorized finite automata share the same circuit structure as hidden Markov models, we circuit-multiply them to obtain the probabilistic GCD (P-GCD) proposals encoding both logical and probabilistic information about the target distributions. We evaluate (P-)GCD on the tasks of function calling, keyword-based generation, and SQL generation. Experiments show that under the same SMC sampling setup, compared to LCD proposals, (P-)GCD converges faster to the target distribution with significantly fewer particles.
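The bias of myopic masking described in this abstract can be seen in a tiny worked example. The two-token language model and the "must end in b" constraint below are invented for illustration and are not taken from the paper. Exact fractions are used so the numbers can be checked by hand.

```python
from fractions import Fraction as F

# Toy two-step language model over the vocabulary {"a", "b"} (made-up numbers).
P_FIRST = {"a": F(1, 2), "b": F(1, 2)}
P_SECOND = {
    "a": {"a": F(9, 10), "b": F(1, 10)},
    "b": {"a": F(1, 2), "b": F(1, 2)},
}


def satisfies(seq: str) -> bool:
    """Toy constraint: the sequence must end in "b"."""
    return seq.endswith("b")


def lm_prob(seq: str) -> F:
    return P_FIRST[seq[0]] * P_SECOND[seq[0]][seq[1]]


def target_distribution() -> dict:
    """Exact conditional p_lm(sequence | constraint)."""
    valid = {s: lm_prob(s) for s in ("aa", "ab", "ba", "bb") if satisfies(s)}
    total = sum(valid.values())
    return {s: p / total for s, p in valid.items()}


def lcd_distribution() -> dict:
    """Locally constrained decoding: mask invalid next tokens, then renormalize."""
    out = {}
    # Step 1: either first token can still reach a valid sequence, so nothing is masked.
    for first, p_first in P_FIRST.items():
        # Step 2: drop continuations that violate the constraint and renormalize.
        allowed = {t: p for t, p in P_SECOND[first].items() if satisfies(first + t)}
        total = sum(allowed.values())
        for token, p in allowed.items():
            out[first + token] = p_first * p / total
    return out


target = target_distribution()
lcd = lcd_distribution()
# Importance weights p_lm(s) / q_lcd(s); reweighting LCD samples by these recovers the target.
weights = {s: lm_prob(s) / lcd[s] for s in lcd}
```

Here LCD samples "ab" and "bb" equally often, while the true conditional puts 1/6 on "ab" and 5/6 on "bb". The importance weights (1/10 versus 1/2) are what an SMC-style correction would use, and a proposal closer to the target needs fewer particles because its weights are less uneven.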


2606.01923 2026-06-02 cs.CL cs.LG (updated version)

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

Mingkuan Zhao, Yide Gao, Wentao Hu, Suquan Chen, Tianchen Huang, Zhenhua An, Zetao Chang, Xiayu Sun, Yuheng Min

Affiliations: Xi’an Jiaotong University; University of Science and Technology of China; Tongji University; Tsinghua University

AI summary: Proposes Resonant Context Anchoring (RCA), which decouples routing logic from information magnitude in self-attention and dynamically amplifies the signal of context tokens at inference time, suppressing parametric hallucinations in large language models and improving factual consistency.

Abstract

Large Language Models (LLMs) frequently exhibit "contextual disregard" when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies primarily rely on suppressing specific neuron activations or employing computationally expensive contrastive decoding mechanisms, which often result in increased perplexity or significantly elevated inference latency. To address these limitations, we propose Resonant Context Anchoring (RCA), a lightweight inference-time intervention method grounded in the perspective of residual stream signal dynamics. RCA aims to resolve the signal attenuation of external evidence during its propagation through deep networks. The core mechanism involves the orthogonal decoupling of routing logic and information magnitude within the self-attention module. By utilizing raw pre-softmax attention scores as an instantaneous metric of semantic alignment, we construct a dynamic gain field via non-linear rectification to selectively amplify the norms of value vectors corresponding to context tokens, without altering the attention probability distribution. This mechanism effectively elevates the signal-to-noise ratio (SNR) of input evidence within the residual stream mixture, thereby robustly anchoring the generation trajectory to the truthful context during inference. Extensive experiments on the Llama-3 model series demonstrate that RCA significantly improves contextual faithfulness across multiple factual consistency and strong knowledge-conflict tasks, effectively suppressing parametric hallucinations. Furthermore, results confirm that as a training-free and computationally negligible plug-and-play module, RCA achieves a Pareto improvement in faithfulness and fluency while maintaining the model's general language understanding capabilities.


2606.01914 2026-06-02 cs.CL cs.CV (updated version)

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng, Wang Yang, Sudong Cai, Shuyuan Zheng, Akiko Aizawa, Sadao Kurohashi

Affiliations: Kyoto University; NII LLMC; RIKEN AIP; Case Western Reserve University; The Hong Kong Polytechnic University; The University of Osaka; University of Tokyo

AI summary: Finds that multimodal large language models exhibit spatial lexical bias: adding a spatial relation word to the options attracts the model toward that option. Mechanistic interpretability tools indicate that the bias arises mainly on the language side rather than the visual side, and a lightweight LLM-only DPO update is shown to mitigate it.

Abstract

Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model's decision and make the newly added option likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and leverage mechanistic interpretability tools, revealing that a substantial part of the failure originates on the language side rather than the visual side: visual attention analyses and residual-stream probes show the correct spatial relation remains internally available on these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on tiny single-object-pair synthetic data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR.


2606.01901 2026-06-02 cs.CV cs.AI cs.CL (updated version)

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

Sherzod Hakimov, Mattia D'Agostini, Ivan Samodelkin, David Schlangen

Affiliations: Computational Linguistics, Department of Linguistics, University of Potsdam; German Research Center for Artificial Intelligence (DFKI), Berlin

AI summary: Proposes the Image Reconstruction Game benchmark, in which a vision-language model issues corrective instructions to an image generator over multiple turns so that accumulated common ground becomes directly visible as a reconstructed image. The describer is the dominant factor in reconstruction quality, while the generator determines the effect of iterative refinement.

Abstract

We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns, making accumulated common ground directly observable as a rendered image. Benchmarking two Describer models crossed with two Generator models across seven image categories, we find that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images pose the greatest challenge. The describer's token budget strongly affects convergence: shorter budgets yield sparser first renderings with more room for visible improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores require human recalibration to be used reliably.


2606.01879 2026-06-02 cs.CL (updated version)

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

Yangfan Ye, Xiaocheng Feng, Jialong Tang, Xiayu Cao, Zihan Zhang, Xiachong Feng, Baosong Yang, Bing Qin

Affiliations: Harbin Institute of Technology; The University of Hong Kong; Harvard University

AI summary: Existing work treats cultural intelligence only as knowledge acquisition and overlooks its use in realistic scenarios. CultureForest is a benchmark that evaluates models through reasoning tasks grounded in atomic norms. Top-tier models degrade sharply in open-ended generation, which points to a bottleneck in reasoning ability.

Abstract

Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for Cultural Norm Grounded Reasoning. Each question is grounded in a small set of atomic norms, enabling verifiable and attributable evaluation. CultureForest comprises 5,378 examples across 8 domains and 53 countries/regions, and supports a progressive evaluation from multiple-choice to open-ended generation. Extensive experiments reveal that even top-tier models degrade substantially in open-ended settings, accompanied by pronounced cross-region disparities. Through targeted analysis, we uncover several consistent patterns: (1) test-time reasoning yields limited gains and may exacerbate inequity; (2) models exhibit highly shared regional preference structures; (3) model responses are markedly conservative, especially under stricter cultural constraints; and (4) by disentangling cultural knowledge acquisition from cultural reasoning, we show that while LLMs possess substantial cultural knowledge, their performance is further bottlenecked by its effective use. These findings point to a necessary shift from knowledge-centric evaluation toward measuring knowledge-grounded reasoning.

2606.01845 2026-06-02 cs.CL cs.AI 版本更新

Unveiling the Limits of Large Language Models in Inferring Pragmatic Meaning from Non-Verbal Responses

揭示大型语言模型在推断非语言回应中的语用意义的局限性

Sugyeong Eo, Heuiseok Lim

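Affiliations: Department of Software, Yonsei University Mirae Campus; Department of Computer Science and Engineering, Korea University

AI summary: This study presents the first systematic evaluation of the ability of large language models (LLMs) to infer pragmatic meaning from dialogue consisting solely of non-verbal responses. Accuracy drops by up to 60 percentage points relative to verbal responses, and in-context learning is shown to help pragmatic inference.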

详情

AI中文摘要

尽管大型语言模型（LLMs）在语用语言理解方面取得了显著进展，但先前的研究主要集中在其对语言行为的理解上。然而，非语言行为仍然是人类交流的基本组成部分，特别是当故意单独使用以传达间接意义时。在这项工作中，我们首次系统评估了LLMs从仅包含非语言回应的对话中推断语用意义的能力。我们探讨了三个研究问题：（1）LLMs能否识别通过非语言回应传达的间接意图？（2）LLMs何时以及如何未能捕捉非语言意图？（3）我们如何提高LLMs解释非语言意图的能力？通过评估，我们观察到LLMs难以从非语言回应中推断出潜在意义，准确率相比语言回应下降高达60个百分点。进一步的广泛分析揭示了LLMs在解释非语言行为时的行为模式，并表明上下文学习有助于语用推理。

英文摘要

Although large language models (LLMs) have shown considerable progress in pragmatic language understanding, prior research has focused mainly on their comprehension of verbal behavior. Nonetheless, non-verbal behavior remains a fundamental component of human communication, especially when deliberately utilized in isolation to convey indirect meanings. In this work, we present the first systematic evaluation of LLMs' ability to infer pragmatic meaning in dialogue consisting solely of non-verbal responses. We explore three research questions: (1) Can LLMs recognize indirect intent conveyed through non-verbal responses? (2) When and how do LLMs fail to capture non-verbal intent? (3) How can we improve LLMs' ability to interpret non-verbal intent? Through the evaluation, we observe that LLMs struggle to infer underlying meaning from non-verbal responses, with accuracy dropping by up to 60 percentage points compared to verbal ones. Further extensive analysis reveals a behavioral pattern in LLMs' interpretations of non-verbal behavior and demonstrates that in-context learning facilitates pragmatic inference.

2606.01838 2026-06-02 cs.CL cs.AI cs.LG 版本更新

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

LayerRoute: 基于LoRA微调的输入条件自适应层跳过方法用于智能语言模型

Prateek Kumar Sikdar

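Affiliations: Accenture

AI summary: Proposes LayerRoute, which adds a lightweight router and LoRA adapters to each Transformer block so that layers are skipped adaptively by input type (tool calling or planning and reasoning). With only 0.22% additional trainable parameters, it achieves a 12.91% skip differential and improves quality.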

Comments 10 pages, 3 figures, 4 tables

详情

AI中文摘要

“我知道这会如何发展”：通过渐进条件惊奇度刻画多样性

Matthew Khoriaty, David Williams-King, Shi Feng

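Affiliations: University of California, Berkeley; University of Cambridge; Stanford University

AI summary: Proposes Decan ($D_{Ca_n}$), a diversity metric based on in-context learning that computes a per-byte score in a single forward pass, with no embedding model, reference corpus, or human labels, and validates it on several benchmarks.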

Comments 28 pages, 18 figures, 9 tables. Accepted to the Workshop on Generative AI, Creativity, and Human-AI Co-Creation @ ICML 2026 (non-archival). Code and data: https://github.com/AMindToThink/icl-diversity

详情

AI中文摘要

衡量创意输出的多样性对于评估训练后模式崩溃、比较解码策略以及量化AI和人类写作中的创造性行为至关重要。我们提出了一种使用上下文学习来度量多样性的新方法，其中“Decan”度量 $D_{Ca_n} = C \times a_n$ 是我们评估的工作实例：一个基于每个字节的得分，该得分从基础模型 $θ$ 的每个标记对数概率中读取，每次排列只需一次前向传递，无需嵌入模型、参考语料库和人工标签。该方法基于信息论，利用语言模型的上下文学习来检测任意数量输入之间的广泛相似性，并避免了训练专用模型的需要。同一流程对AI样本和人类编写的回答集进行评分，将多样性视为（回答、提示、评分模型）的一个属性。在Tevet和Berant基于人类判断的McDiv基准上，$D_{Ca_n}$ 在McDiv prompt_gen 集上达到了0.846的OCA，这是其表现最好的情况，仅次于Tevet和Berant报告的最强神经基线（SentBERT，0.897）。在OLMo-2-7B训练后流程中，$D_{Ca_n}$ 在基础→SFT→DPO→RLVR阶段单调下降，检测到创意写作应用所关注的多样性损失类型。

英文摘要

Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the "Decan" metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $θ$ in a single forward pass per permutation, with no embedding model, no reference corpus, and no human labels. This approach is grounded in information theory, makes use of language model in-context learning to detect a wide range of similarities between any number of inputs, and obviates the need to train a special-purpose model. The same pipeline scores AI samples and human-written response sets, with diversity treated as a property of (responses, prompt, scoring model). On Tevet and Berant's human-grounded McDiv benchmark, $D_{Ca_n}$ reaches OCA 0.846 on the McDiv prompt_gen set where it performs best, behind the strongest neural baseline reported in Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, $D_{Ca_n}$ drops monotonically across the base → SFT → DPO → RLVR stages, detecting the type of diversity loss that creative-writing applications care about.

2606.01806 2026-06-02 cs.CL cs.AI cs.LG 版本更新

ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference

ProbeScale: 通过探测分析优化神经缩放定律以实现高效小语言模型推理

Sourav Das

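Affiliations: Department of Computer Science and Engineering; Indian Institute of Information Technology Kalyani

AI summary: Proposes ProbeScale, a framework that uses scaling laws and probing analysis to identify parameter-efficient subnetworks in pre-trained small language models by maximizing task-weighted probe performance under a parameter budget. It achieves a 5-10x parameter reduction while retaining 95%-98% of the original performance.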

Comments 7 pages, 2 figures, ACL

详情

AI中文摘要

小语言模型在能力与计算可行性之间取得了平衡。神经缩放定律指导其最优训练，表明它们拥有随规模增长而丰富的内部表示。然而，在严格的资源约束下部署即使是这些小语言模型也可能具有挑战性。语言模型探测提供了分析模型内部编码的语言知识的方法。我们提出ProbeScale，一个统一缩放定律和探测洞察的框架，用于在预训练小语言模型中识别参数高效的子网络。ProbeScale利用良好缩放的小语言模型的高质量表示，并使用任务特定探测来数学量化每层对目标下游能力的相关性。这使得能够选择在性能与参数规模之间最优权衡的子网络。我们将子网络选择形式化为在参数预算下寻找最大化聚合任务加权探测性能的层子集。在代表性小语言模型如RoBERTa-Large和T5-Base上的实验表明，ProbeScale识别出的子网络实现了5到10倍的显著参数减少，同时在目标任务上保持了高性能（原始小语言模型的95%至98%），优于启发式基线。

英文摘要

Small Language Models (SLMs) offer a balance between capability and computational feasibility. Neural scaling laws inform their optimal training, suggesting that they possess rich internal representations that scale with their size. However, deploying even these SLMs can be challenging under strict resource constraints. Language model probing provides methods for analyzing the linguistic knowledge encoded in a model's internals. We propose ProbeScale, a framework that unifies insights from scaling laws and probing to identify parameter-efficient subnetworks within pre-trained SLMs. ProbeScale utilizes the high-quality representations of well-scaled SLMs and uses task-specific probes to mathematically quantify the relevance of each layer for target downstream capabilities. This allows selecting subnetworks that optimally trade off performance against parameter size. We formulate the subnetwork selection as finding a layer subset maximizing aggregated, task-weighted probe performance under a parameter budget. Experiments on representative SLMs such as RoBERTa-Large and T5-Base demonstrate that ProbeScale identifies subnetworks achieving significant parameter reduction, from 5 to 10 times, while maintaining high performance (95% to 98% of the original SLMs) on targeted tasks, outperforming heuristic baselines.
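
The layer-subset selection described in this abstract can be read as a 0/1 knapsack problem. The sketch below is only an illustration under stated assumptions, not the paper's method: it assumes each layer contributes additively to the objective, that parameter costs are small integers (for example, millions of parameters), and all numbers and the names `layer_scores` and `select_layers` are hypothetical.

```python
from typing import List, Sequence, Tuple


def layer_scores(probe_acc: Sequence[Sequence[float]],
                 task_weights: Sequence[float]) -> List[float]:
    """Aggregate per-task probe accuracies into one task-weighted score per layer.

    probe_acc[t][l] is the (hypothetical) probe accuracy of task t on layer l.
    """
    n_layers = len(probe_acc[0])
    return [
        sum(w * acc[l] for w, acc in zip(task_weights, probe_acc))
        for l in range(n_layers)
    ]


def select_layers(scores: Sequence[float], params: Sequence[int],
                  budget: int) -> Tuple[float, List[int]]:
    """0/1 knapsack: pick layers maximizing total score with sum(params) <= budget."""
    # best[b] = (total score, chosen layers) for the best subset costing at most b
    best: List[Tuple[float, List[int]]] = [(0.0, [])] * (budget + 1)
    for i, (score, cost) in enumerate(zip(scores, params)):
        updated = list(best)
        for b in range(cost, budget + 1):
            # Read from the previous round so each layer is used at most once
            prev_score, prev_layers = best[b - cost]
            if prev_score + score > updated[b][0]:
                updated[b] = (prev_score + score, prev_layers + [i])
        best = updated
    return best[budget]


# Hypothetical inputs: 2 tasks, 4 layers, parameter costs in arbitrary units
task_weights = [0.7, 0.3]
probe_acc = [[0.5, 0.8, 0.9, 0.6],
             [0.4, 0.6, 0.7, 0.9]]
params = [3, 4, 5, 4]

scores = layer_scores(probe_acc, task_weights)
total, chosen = select_layers(scores, params, budget=9)
print(chosen, round(total, 2))
```

With these made-up numbers, layers 1 and 2 together use the whole budget of 9 units and give the highest aggregated score. A real system would also have to account for interactions between layers, which the additive assumption ignores.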

2606.01800 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Multilinguality of Large Language Models From a Structural Perspective

从结构视角看大语言模型的多语言性

Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

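Affiliations: Nara Institute of Science and Technology

AI summary: This study examines the multilinguality of large language models through an analysis of representation structure. It finds that low-resource languages differ structurally from English more than high- and mid-resource languages do, and that language-specific post-training changes the structure while preserving the relationships between languages.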

2606.01779 2026-06-02 cs.CL 版本更新

HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems

HarnessForge：面向自适应智能体系统的协同框架与策略进化

Mingju Chen, Can Lv, Guibin Zhang, Heng Chang, Shiji Zhou

Affiliations: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University; Tsinghua University

AI summary: Proposes HarnessForge, a meta-adaptive framework that achieves full-system adaptation of LLM agent systems through harness-policy co-evolution and significantly improves performance on multiple benchmarks.

Comments 25 pages, 13 figures

详情

AI中文摘要

LLM智能体越来越需要在需要不同执行范式的异构任务环境中运行。这对固定智能体系统提出了挑战，并推动了超越孤立组件更新的系统级元自适应。虽然现有工作已自适应外部框架或训练底层推理策略，但全系统自适应仍未被充分表征。结构与执行之间的自适应空间很少被明确化，外部框架与内部推理器之间的兼容性也未得到联合优化。我们提出HarnessForge，一个用于进化LLM智能体系统的元自适应框架。HarnessForge将智能体系统形式化为一个框架-策略对，定义了一个稳定的自适应空间，将框架级执行结构与策略级推理行为分离。然后，它通过故障引导的框架裁剪和框架条件化的策略对齐执行框架-策略协同进化。在来自不同领域的五个基准上的实验表明，HarnessForge一致地改进了Qwen3-4B和Qwen3-8B骨干网络，优于仅框架和仅策略的基线，比最强基线提升高达12.0%，并实现了有利的展开效率权衡，证明了框架-策略协同进化是有效的，并且框架与推理策略之间的可执行兼容性对于智能体系统自适应至关重要。代码可在https://github.com/mingju-c/HarnessForge获取。

英文摘要

LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execution paradigms. This challenges fixed agent systems and motivates system-level meta-adaptation beyond isolated component updates. While existing works have adapted the external harness or trained the underlying reasoning policy, full-system adaptation remains insufficiently characterized. The adaptation space between structure and execution is rarely made explicit, and the compatibility between the external harness and the internal reasoner is not optimized jointly. We propose HarnessForge, a meta-adaptive framework for evolving LLM agent systems. HarnessForge formulates an agent system as a harness-policy pair, defining a stable adaptation space that separates harness-level execution structure from policy-level reasoning behavior. It then performs harness-policy co-evolution through fault-guided harness tailoring and harness-conditioned policy alignment. Experiments across five benchmarks from diverse domains show that HarnessForge consistently improves both Qwen3-4B and Qwen3-8B backbones, outperforming harness-only and policy-only baselines with gains of up to 12.0% over the strongest baseline and achieving favorable rollout-efficiency tradeoffs, demonstrating that harness-policy co-evolution is effective, and that executable compatibility between the harness and reasoning policy is essential for agent-system adaptation. The code is available at https://github.com/mingju-c/HarnessForge.

2606.01755 2026-06-02 cs.AI cs.CL 版本更新

TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

TriAlign: 迈向个性化大语言模型对齐中的通用真值一致性

Thi-Nhung Nguyen, Linhao Luo, Rollin Omari, Junae Kim, Thuy-Trang Vu, Dinh Phung

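Affiliations: Department of Data Science & AI, Monash University; Defence Science and Technology Group, Australia

AI summary: To address the universal-truth inconsistencies that personalized large language models exhibit across social groups, the paper proposes TriAlign, a framework that uses offline multi-agent reinforcement learning to jointly optimize truth accuracy, cross-group consistency, and personalization, yielding fairer alignment.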

详情

AI中文摘要

个性化大语言模型根据用户的偏好和社会属性调整响应，但可能在不同社会群体间引入显著的通用真值不一致性，即某些群体在客观任务上系统性地获得较不准确的响应。现有的对齐方法要么忽略个性化，要么主要关注主观偏好对齐，很大程度上忽视了通用真值的公平性和一致性。为填补这一空白，我们研究了真值不变对齐（TIA），这是一个针对个性化LLM的对齐问题，旨在确保通用真值在不同社会群体间保持一致，同时保留个性化。我们提出TriAlign，这是首个用于TIA的离线多智能体强化学习（MARL）框架，其中每个社会群体被建模为一个交互的智能体。TriAlign通过一个公平感知目标和一个显式的不一致性惩罚，联合优化通用真值准确性、跨群体真值一致性和个性化。跨多个基准的实验表明，TriAlign在这三个目标之间实现了比强基线更强的平衡，减少了跨社会群体的通用真值差异，同时提高了客观任务性能和个性化质量。

英文摘要

Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either ignore personalization or mainly focus on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To address this gap, we study Truth-Invariant Alignment (TIA), an alignment problem for personalized LLMs that aims to ensure universal truths remain consistent across social groups while preserving personalization. We propose TriAlign, the first offline multi-agent reinforcement learning (MARL) framework for TIA, where each social group is modeled as an interacting agent. TriAlign jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty. Experiments across diverse benchmarks demonstrate that TriAlign achieves a stronger balance among these three objectives than strong baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.

2606.01747 2026-06-02 cs.CL cs.AI 版本更新

Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks

基于BERT和图神经网络的历史知识图谱构建

Ping Li, Bartlomiej Brzozka

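Affiliations: Shandong Management University; Maria Curie-Sklodowska University

AI summary: The paper proposes a high-level architecture combining BERT and graph neural networks to extract entities and relations from historical texts and build knowledge graphs. It reports higher precision, recall, and F1-score than traditional methods and deep-learning baselines.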

Comments 9 pages, 4 figures

详情

AI中文摘要

通过数字人文研究和规模化历史数据分析，大量传统历史文本被转换为结构化知识图谱。本文提出一种结合双向编码器表示（BERT）和图神经网络（GNN）的高层架构，用于从各类历史文本中提取实体和关系。传统历史文本系统地解决了语言歧义、上下文限制的引用以及缺乏既定语法规范的问题。本研究根据上述建议，开发了一种基于FastRQNet和预训练视觉-语言模型Vilt-qaformer+RoBInet的新型图像检索系统。实验充分利用了市政记录、议会文件和历史信函的全面数据集。与传统基于规则的技术和其他流行的深度学习基线相比，联合BERT-GNN系统获得了更高的精度、召回率和F1分数（表2）。该结构在创建知识图谱时能够以足够的准确性和全面性处理复杂的嵌套结构和隐式引用问题。上述实验表明，将关系图学习算法与上下文敏感的语义表示技术相结合，可以自动提取历史数据，为知识库积累累积的智慧。

英文摘要

Through digital humanities research and scale-up historical data analysis, a significant amount of traditional historical text is converted into structured knowledge graphs. This paper provides a high-level architecture that combines bidirectional encoder representations of transformers (BERT) and graph neural networks (GNN) to extract the entities and relationships from various types of historical texts. The texts of traditional history resolve linguistic ambiguities, references limited by context, and a lack of established grammatical norms in a systematic way. This study develops a new image retrieval system based on FastRQNet and pre-trained vision-language model Vilt-qaformer+RoBInet in accordance with the aforementioned recommendations. The experiments make full use of a comprehensive collection of municipal records, parliamentary documents, and historical correspondence. When compared to conventional rule-based techniques and other popular deep-learning baselines, the joint BERT-GNN system obtains greater Precision, Recall, and F1-score (Table 2). Complex nested structures and implicit reference issues can be handled by this structure with sufficient accuracy and thoroughness when creating knowledge graphs. The aforementioned experiments show that combining relational graph learning algorithms with context-sensitive semantic representation techniques can automatically extract historical data to add accumulated wisdom to the knowledge repository.

2606.01738 2026-06-02 cs.CL cs.AI 版本更新

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

THRD：一种针对大语言模型越狱攻击的无训练多轮防御框架

Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang, Changliang Li, Pengyuan Liu

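Affiliations: Beijing Language and Culture University

AI summary: Proposes THRD, a training-free framework that defends against multi-turn jailbreak attacks by explicitly modeling temporal risk accumulation (turn-level risk assessment, cross-turn intent detection, response evaluation, and a decision module). It reduces attack success rate to 0.2-4.0% with less than 1.5% loss in model utility.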

详情

AI中文摘要

多轮越狱攻击通过利用对话动态（如逐步升级和跨轮协调）对LLM构成日益严重的威胁。现有防御要么依赖昂贵的重新训练（通常会降低模型效用），要么在每一轮独立应用单轮分析，无法捕捉风险沿交互轨迹的累积。我们观察到多轮交互中的安全行为是轨迹依赖的：对话历史不断重塑模型的调节上下文，使得孤立评估每一轮变得不足。基于这一洞察，我们提出THRD，这是第一个显式建模多轮越狱防御中时间风险累积的无训练框架。THRD集成了四个模块：用于即时风险评估的逐轮风险评估器（TRA）、用于跨轮意图升级检测的历史上下文分析器（HCA）、用于识别促进性输出的响应评估器（RE），以及通过带衰减调制和趋势感知调整的时间演化评分机制组合这些信号的决策模块。在两个目标模型上针对最先进的多轮攻击（包括基于树搜索和多智能体协作方法）的实验表明，THRD将攻击成功率降至0.2-4.0%，同时在MMLU和GSM8K上将模型效用退化控制在1.5%以内。消融研究证实了模块的非冗余贡献和稳定的跨架构泛化。对首次拒绝触发器的分析显示，超过70%的多轮攻击需要在第2轮或之后才能检测到，验证了显式时间聚合的必要性。

英文摘要

Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining, which often degrades model utility, or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, making it insufficient to evaluate each turn in isolation. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks (including tree-search-based and multi-agent collaborative methods) across two target models show that THRD reduces ASR to 0.2-4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn 2 or later to detect, validating the necessity of explicit temporal aggregation.

2606.01682 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

现成的大语言模型作为过程评分器：数学推理中PRM的无训练替代方案

Atoosa Chegini, Soheil Feizi

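Affiliations: Department of Computer Science, University of Maryland

AI summary: Proposes Chunk-Level Guided Generation, which uses an off-the-shelf large language model as a process scorer. With fixed-length chunk scoring and a contrastive selection rule, it matches or exceeds PRM-guided search on mathematical reasoning without any training.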

详情

AI中文摘要

使用更强的评分器从多个小模型样本中选择最佳响应是一种简单的推理时策略，但当小模型已经陷入错误推理路径时，该策略会失败。PRM引导搜索通过在生成过程中对候选延续进行评分来避免这一问题，但需要经过步骤级标签训练的奖励模型。我们提出Chunk-Level Guided Generation，一种无训练的替代方案，使用现成的大语言模型作为过程评分器。在每一步，小模型采样k个固定长度的候选块，而大模型使用似然度对候选块进行评分，无需生成任何文本。选中的块在下一步之前被提交，从而在错误传播之前引导生成。我们用两种选择规则实例化该框架：似然引导选择（LGS），选择具有最高长度归一化大模型对数概率的块；以及对比引导选择（CGS），减去小模型的对数概率，以偏向于大模型偏好与小模型偏好不同的块。我们证明，由于系统性的长度偏差（即使在长度归一化后仍然存在），使用大模型似然度对可变长度推理步骤进行评分是不可靠的，而固定长度块避免了这一混淆。在GSM8K、MATH、Minerva Math、AMC23和AIME24上，使用Qwen2.5-32B引导Qwen2.5-1.5B以及Llama-3.1-70B引导Llama-3.2-1B，CGS在多数投票上最多提升28个百分点，并且在匹配的引导预算下，在大多数基准测试中匹配或超越了Qwen2.5-Math-PRM-72B引导搜索，且无需奖励模型训练。使用Qwen2.5-72B引导Qwen2.5-7B，CGS在k=16时在MATH上达到81.8%，在Minerva Math上达到63.6%，超过多数投票4-6个百分点。最后，Chunk-Level Guided Generation产生的推理轨迹比PRM引导搜索短得多。

英文摘要

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4-6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.
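
The two selection rules named in this abstract can be sketched from the description alone. The code below is a minimal illustration, not the authors' implementation: it assumes per-token log-probabilities for each candidate chunk are already available from both models, it assumes CGS subtracts the length-normalized small-model score with equal weight (the abstract does not give the exact form), and the function names and numbers are hypothetical.

```python
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of one chunk."""
    return sum(token_logprobs) / len(token_logprobs)


def lgs_score(large: Sequence[float]) -> float:
    """Likelihood-Guided Selection: score by the large model alone."""
    return mean_logprob(large)


def cgs_score(large: Sequence[float], small: Sequence[float]) -> float:
    """Contrastive-Guided Selection: large-model score minus small-model score.

    Equal weighting of the two terms is an assumption made for this sketch.
    """
    return mean_logprob(large) - mean_logprob(small)


def select_chunk(large: List[Sequence[float]], small: List[Sequence[float]],
                 rule: str = "cgs") -> int:
    """Return the index of the candidate chunk with the highest score."""
    if rule == "lgs":
        scores = [lgs_score(lg) for lg in large]
    elif rule == "cgs":
        scores = [cgs_score(lg, sm) for lg, sm in zip(large, small)]
    else:
        raise ValueError(f"unknown rule: {rule}")
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical log-probabilities for k = 3 candidate chunks of 2 tokens each
large_lp = [[-1.0, -1.0], [-0.5, -0.7], [-0.6, -0.8]]
small_lp = [[-0.2, -0.2], [-0.1, -0.1], [-1.0, -1.2]]

lgs_choice = select_chunk(large_lp, small_lp, rule="lgs")
cgs_choice = select_chunk(large_lp, small_lp, rule="cgs")
print(lgs_choice, cgs_choice)
```

In this toy case LGS picks the chunk the large model likes best, while CGS picks the chunk the large model rates much higher than the small model does. Because all chunks have the same length, the normalization is a constant factor and does not change the ranking.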

2606.01679 2026-06-02 cs.CL 版本更新

Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification

编码但未路由：解释科学声明验证中的表格-图表差距

Sunisth Kumar, Xanh Ho, Tim Schopf, Andre Greiner-Petter, Florian Boudin, Akiko Aizawa

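Affiliations: The University of Tokyo; NII LLMC; National Institute of Informatics; University of Göttingen; Inria, LS2N, Nantes Université

AI summary: Using layer-wise linear probing and attention analysis, the paper finds that in scientific claim verification, multimodal large language models encode chart information but do not route it to the prediction position, which explains the performance gap between tables and charts.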

详情

AI中文摘要

多模态大语言模型越来越多地被用于辅助科学同行评审，其核心要求是验证论文中的声明是否得到其证据的支持。先前的研究表明，当证据是表格时，模型在此任务上的表现显著优于相同底层数据的图表。这引发了一个问题：模型是否未能从图表中提取信息，还是提取了信息但在形成预测时未能使用？我们通过分层线性探测和注意力分析，在三个开源视觉语言模型上研究了这个问题，这些模型处理代表相同底层数据的表格和图表证据。我们发现一致的证据支持后者。图表信息被编码在模型的中间表示中，但未到达预测位置，这一差距在表格中不存在，并且在所有测试条件下都成立。注意力分析进一步揭示，这种断开在不同模型家族中表现为两种架构上不同的形式。这些发现将表格-图表差距重新定义为编码的视觉信息在预测时路由失败的问题，而非编码本身失败。

EvoPool: 面向标签高效专业监督的进化式程序化标注

Tianyi Xu, Yaolun Zhang, Xuan Ouyang, Huazheng Wang

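Affiliations: Oregon State University; University of Wisconsin–Madison

AI summary: Proposes EvoPool, an evolutionary multi-agent framework that iteratively evolves programmatic labelers and aggregates their votes, substantially improving supervision in specialized domains at low annotation cost.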

Comments 39 pages, 7 figures. Code: https://github.com/tianyi0216/EvoPool

详情

AI中文摘要

多智能体计算机使用

Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

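Affiliations: Carnegie Mellon University

AI summary: To address the shortcomings of single-agent computer use agents on complex long-horizon tasks, the paper proposes a multi-agent computer use system that decomposes tasks into a directed acyclic graph and executes them in parallel. It improves performance by 3.4-25.5% on several benchmarks and speeds up task completion by about 1.5x.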

详情

AI中文摘要

目前的计算机使用代理（CUA）主要部署为单序列代理。这种设置对于受益于任务分解、并行执行和基于新信息持续重新规划的复杂长时任务来说并不理想。在本文中，我们认为应该转向评估和构建多智能体计算机使用（MACU）系统。这些系统强调规划和并行执行，缓解了单智能体CUA的许多缺点。我们提出了一种通用的多智能体设置，其中管理模型将计算机使用任务分解为有向无环图（DAG），编码子代理的相关依赖关系和目标。在每次迭代中，管理器调度并行的CUA子代理执行DAG就绪前沿上的节点，并根据子代理的新发现持续修订DAG（添加、取消或重写节点）。这种设计将计算机使用的部分可观察环境作为首要挑战：下游代理可能无法重新观察到的信息通过管理器和DAG结构保留并传递。我们证明，MACU在桌面（OSWorld）和网页导航（Online-Mind2Web、WebTailBench、Odysseys）基准测试上始终比强单智能体基线提升3.4-25.5%，表现出更有利的测试时缩放，并解决了单智能体CUA陷入困境的复杂长时任务。在Odysseys（一个长时网页导航基准测试）上，MACU将平均任务完成墙钟时间提高了约1.5倍，证明了其在加速传统缓慢的CUA流程方面的有效性。我们的研究结果强调，多智能体协调是扩展计算机使用代理以更长时间、更有效地工作的一个有前景的方向。我们在https://jykoh.com/multi-agent-computer-use上发布所有代码和交互式可视化。

英文摘要

Computer use agents (CUAs) today are primarily deployed as single serial agents. This setup is suboptimal for complex long-horizon tasks that benefit from task decomposition, parallel execution, and consistent re-planning based on new information. In this paper, we argue that we should instead move towards evaluating and building multi-agent computer use (MACU) systems. These systems, which emphasize planning and parallel execution, alleviate many of the shortcomings of single-agent CUAs. We propose a general multi-agent setup in which a manager model decomposes computer use tasks into a directed acyclic graph (DAG), encoding relevant dependencies and goals for subagents. At each iteration, the manager dispatches parallel CUA subagents to carry out nodes on the ready frontier of the DAG, and continuously revises the DAG (adding, canceling, or rewriting nodes) as new findings arrive from subagents. This design treats the partially observable environment of computer use as a first-class challenge: information that downstream agents may not be able to re-observe is retained and passed forward through the manager and DAG structure. We demonstrate that MACU consistently improves over strong single-agent baselines by 3.4-25.5% on desktop (OSWorld) and web navigation (Online-Mind2Web, WebTailBench, Odysseys) benchmarks, exhibits more favorable test-time scaling, and solves complex long-horizon tasks where single-agent CUAs get stuck. On Odysseys, a long-horizon web navigation benchmark, MACU improves average task completion wall-clock time by about 1.5x, demonstrating its efficacy in speeding up traditionally slow CUA pipelines. Our findings highlight that multi-agent coordination is a promising axis for scaling computer use agents to work productively for longer and more effectively. We release all code and interactive visualizations at https://jykoh.com/multi-agent-computer-use.
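
The "ready frontier" of a task DAG mentioned in this abstract is the set of unfinished nodes whose dependencies have all completed. The sketch below shows only that scheduling idea under simplifying assumptions: every dispatched node succeeds, the DAG is a plain dictionary, and the node names and the helpers `ready_frontier` and `run_in_waves` are hypothetical rather than taken from the released code.

```python
from typing import Dict, List, Set


def ready_frontier(deps: Dict[str, Set[str]], done: Set[str]) -> List[str]:
    """Nodes that are not finished and whose dependencies are all finished."""
    return sorted(node for node, needs in deps.items()
                  if node not in done and needs <= done)


def run_in_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Simulate dispatching every ready node in parallel, one wave at a time."""
    done: Set[str] = set()
    waves: List[List[str]] = []
    while len(done) < len(deps):
        frontier = ready_frontier(deps, done)
        if not frontier:
            raise ValueError("no ready node: cycle or missing dependency")
        waves.append(frontier)
        done.update(frontier)  # assumption: every dispatched subagent succeeds
    return waves


# Hypothetical task graph: each node maps to the nodes it depends on
deps = {
    "search_flights": set(),
    "search_hotels": set(),
    "compare": {"search_flights", "search_hotels"},
    "book": {"compare"},
}
waves = run_in_waves(deps)
print(waves)

# Revising the DAG after a new finding: add a node and rewrite a dependency
revised = dict(deps)
revised["verify_price"] = {"compare"}
revised["book"] = {"compare", "verify_price"}
next_nodes = ready_frontier(revised, {"search_flights", "search_hotels", "compare"})
print(next_nodes)
```

The two searches run in the first wave because neither depends on the other, which is where the parallel speed-up comes from. Recomputing the frontier after each revision is enough to pick up added or rewritten nodes.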

2606.01513 2026-06-02 cs.DC cs.AI cs.CL cs.LG 版本更新

Compliance-Scored Best-of-N Guardrail Orchestration for Multimodal Document Generation in Payments Dispute Defense

基于合规评分的Best-of-N护栏编排用于支付争议防御中的多模态文档生成

Nataraj Agaram Sundar, Tejas Morabia

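Affiliations: eBay Inc.

AI summary: Proposes a guardrail orchestration layer that combines multi-candidate generation with a compliance-score-based early exit. Through parallel generation, weighted scoring, and best-output selection, it achieves a high compliance rate at low latency in a payments dispute defense setting.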

Comments 8 pages, 7 figures, 4 tables. Preprint. Applied systems paper on compliance-scored guardrail orchestration for multimodal LLM document generation. Contains aggregate operational readouts; not a randomized A/B test

详情

AI中文摘要

高风险企业文档生成，包括金融争议叙述、合规通知和审计摘要，要求模式正确性、策略合规性以及大规模低延迟操作。在统一的护栏层之前，生产系统通常将独立的PII编辑、内容审核和格式验证步骤拼接在一起，导致逻辑碎片化、请求路径变慢和运营成本增加。我们提出了一种针对文本和图像输入的护栏编排层，它将多候选生成与用于早退的显式合规评分相结合。该框架运行可配置的并行生成头，根据加权护栏（包括PII检测、内容审核、模式约束和领域规则）对候选进行评分，并返回具有选择元数据的最佳评分输出。可用的运营读数报告在20秒内进行5次尝试，合规率为91%。对于支付争议防御摘要，我们分析聚合运营场景读数，而非随机A/B测试。可变队列显示总体胜率高于对照组，301/659对比536/1548，对应+11.0个百分点，95%置信区间[6.6, 15.5]，p < 0.001；对于调整后的未收到物品案例，+7.5个百分点，95%置信区间[0.2, 15.7]，p = 0.045。欺诈和本地证据排名差异方向为正，但在聚合计数数据中不具有统计显著性。我们还报告了来自770次生成证据审查和70例OCR切片的评审校准的负责任AI证据质量信号，并通过请求接口、评分逻辑、伪代码和运营证据边界记录了可重复性边界。

英文摘要

High-stakes enterprise document generation, including financial dispute narratives, compliance notices, and audit summaries, demands schema correctness, policy compliance, and low-latency operation at scale. Prior to a unified guardrail layer, production systems often stitched together separate PII redaction, content moderation, and format validation steps, leading to fragmented logic, slower request paths, and higher operational cost. We present a guardrail orchestration layer for text and image inputs that couples multi-candidate generation with an explicit compliance score used for early exit. The framework runs configurable parallel generation heads, scores candidates against weighted guardrails including PII detection, content moderation, schema constraints, and domain rules, and returns the best-scoring output with selection metadata. The available operational readout reports 5 attempts within 20 seconds and 91 percent compliance. For payments dispute defense summaries, we analyze aggregate operational scenario readouts rather than a randomized A/B test. Variable cohorts show higher count win rates than controls overall, 301/659 versus 536/1548, corresponding to +11.0 percentage points with 95 percent confidence interval [6.6, 15.5] and p < 0.001, and for adjusted item-not-received cases, +7.5 percentage points with 95 percent confidence interval [0.2, 15.7] and p = 0.045. Fraud and local evidence-ranking deltas are directionally positive but not statistically significant from the aggregate count data. We also report reviewer-calibrated Responsible-AI evidence-quality signals from 770 generated-evidence reviews and a 70-case OCR slice, and document the reproducibility boundary through the request interface, scoring logic, pseudocode, and operational evidence boundary.
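
The overall win-rate comparison reported in this abstract (301/659 versus 536/1548) can be checked with simple arithmetic. The sketch below uses a Wald interval for a difference of two proportions; that this is the interval the authors used is an assumption, although it does reproduce the reported +11.0 percentage points and the interval [6.6, 15.5]. The function name is hypothetical.

```python
import math
from typing import Tuple


def diff_in_proportions(x1: int, n1: int, x2: int, n2: int,
                        z: float = 1.96) -> Tuple[float, float, float]:
    """Difference of two proportions with a Wald confidence interval.

    Returns (difference, lower bound, upper bound) as fractions, not percent.
    z = 1.96 corresponds to a 95 percent interval.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Standard error of the difference, treating the two cohorts as independent
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se


# Counts quoted in the abstract: variable cohort vs. control cohort
diff, low, high = diff_in_proportions(301, 659, 536, 1548)
print(f"diff = {100 * diff:.2f} pp, "
      f"95% CI = [{100 * low:.2f}, {100 * high:.2f}] pp")
```

Because the cohorts come from operational readouts rather than a randomized A/B test, as the abstract itself notes, the interval describes sampling variability only and does not rule out confounding.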

2606.01503 2026-06-02 cs.CV cs.AI cs.CL 版本更新

On the Limits of Token Reduction for Efficient Unified Vision Language Training

论高效统一视觉语言训练中令牌缩减的极限

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

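Affiliations: University of Michigan; Sony AI

AI summary: By analyzing layer-wise attention allocation, the paper finds an asymmetry in token redundancy between visual understanding and visual generation and designs task-specific accelerators. Under unified training, however, task-specific token dropping causes synergy loss, indicating that efficient unified modeling must preserve shared cross-task structure.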

详情

AI中文摘要

统一视觉语言模型（VLM）在单个自回归骨干中集成了视觉理解和视觉生成，但其联合训练计算成本高昂且从效率角度常被忽视。在这项工作中，我们研究了基于令牌缩减的加速在统一VLM训练中的可行性和极限。通过对逐层注意力分配的系统分析，我们揭示了一个基本的不对称性：视觉理解在后期层表现出显著的视觉冗余，而视觉生成在深度上对图像令牌保持持续依赖。受此观察启发，我们设计了任务特定的加速器，针对每个目标选择性地减少图像令牌计算。虽然这些方法在孤立设置中实现了显著的效率提升，但我们在统一训练下观察到一致的协同损失——任务特定的令牌丢弃需要不同的参数路径，并消除了联合优化中通常观察到的相互性能增益。我们的发现表明，高效统一建模需要保留共享的跨任务结构，强调了需要协同感知的加速策略。项目页面：https://chicychen.github.io/TokenReductionUnifiedVLM/。

英文摘要

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training: task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies. Project page: https://chicychen.github.io/TokenReductionUnifiedVLM/.

2606.01498 2026-06-02 cs.CL cs.AI 版本更新

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

TimeSage-MT：用于评估智能时间序列推理的多轮基准测试

Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li, Yilei Shao, Stefan Zohren, Anna Vettoruzzo, Joaquin Vanschoren, Ming Jin, Qingsong Wen

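Affiliations: University of Oxford; VulpiVox Intelligence; Eindhoven University of Technology; Griffith University; Squirrel Ai Learning; East China Normal University

AI summary: Proposes TimeSage-MT, a multi-turn benchmark with 240 tasks and 2,680 dialogue turns across 8 real-world domains for evaluating LLM agents on time series reasoning. It reveals performance drops on decision-oriented tasks and weaknesses in memory and uncertainty handling.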

详情

AI中文摘要

时间序列数据为许多真实世界领域的决策提供信息。虽然大语言模型（LLM）智能体可以通过自然语言和工具分析数据，但目前尚不清楚它们是否能在多轮对话中进行可靠的时间序列分析。现有基准测试侧重于预测和异常检测等单步任务，忽略了用户目标演变、智能体必须基于先前分析以及结论从累积证据中得出的实际工作流程。在这项工作中，我们引入了TimeSage-MT，一个用于智能时间序列推理的多轮基准测试，包含240个任务和2,680轮对话，涵盖8个真实世界领域，从基础探索到决策导向分析。TimeSage-MT通过一个可复现的流程构建，该流程将真实世界的时间序列数据转换为具有可验证答案的多轮对话。它提供了一个统一的评估协议和公共排行榜，用于比较时间序列智能系统。为了展示基准测试的实用性，我们评估了前沿LLM以及TimeSage——一种配备全面时间序列技能库的新型结构化智能体。结果显示，在决策导向任务上性能急剧下降，原因是记忆、不确定性处理和基于领域的决策方面的失败。TimeSage-MT揭示了当前智能推理中的关键差距，并为未来发展提供了严谨的基础。

英文摘要

Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through natural language and tools, it remains unclear whether they can conduct reliable time series analysis across multi-turn conversations. Existing benchmarks focus on single-step tasks such as forecasting and anomaly detection, overlooking practical workflows where user goals evolve, agents must build on prior analyses, and conclusions emerge from accumulated evidence. In this work, we introduce TimeSage-MT, a multi-turn benchmark for agentic time series reasoning with 240 tasks and 2,680 dialogue turns across 8 real-world domains, spanning basic exploration to decision-oriented analysis. TimeSage-MT is built through a reproducible pipeline that converts real-world time series data into multi-turn conversations with verifiable answers. It provides a unified evaluation protocol and public leaderboard for comparing time series agentic systems. To demonstrate the benchmark's utility, we evaluate frontier LLMs alongside TimeSage, a novel structured agent equipped with a comprehensive time series skill library. The results show sharp performance drops on decision-oriented tasks, driven by failures in memory, uncertainty handling, and domain-based decision making. TimeSage-MT exposes critical gaps in current agentic reasoning and provides a rigorous foundation for future development.

2606.01482 2026-06-02 cs.CL 版本更新

Beyond Topical Similarity: Contrastive Evidence Retrieval with Interpretable Attention Alignment in RAG

超越主题相似性：RAG 中具有可解释注意力对齐的对比证据检索

Francielle Vargas, João Robiatti, Diego Alves, Lucas Pascotti Valem, Maximilian Seeth, Sebastián Ferrada, Ameeta Agrawal, Daniel Pedronette, André Freitas

Affiliations: University of Chile; São Paulo State University; Saarland University; University of Munich; Portland State University; Idiap Research Institute

AI summary: Proposes the CERA framework, which uses subjectivity-based hard negative selection and an auxiliary attention alignment loss to inject an evidential inductive bias, enabling interpretable and factually accurate retrieval.

Abstract

Ensuring factuality and interpretability in RAG remains an open and urgent problem. We introduce Contrastive Evidence Rationale Attention (CERA), the first retrieval framework to employ subjectivity-based hard negative selection and inject an evidential inductive bias into contrastive learning through an auxiliary attention alignment loss. CERA fine-tunes a dense retriever using two training objectives: triplet-based contrastive learning and interpretable attention alignment, which supervises CLS-to-token attention using a part-of-speech-weighted masking distribution over human-annotated factual rationales as evidence signals. Experiments on a large corpus of clinical trial reports demonstrate that the subjectivity-based hard negative selection substantially improves retrieval effectiveness compared to both Contriever and hard negative selection baselines. Furthermore, rationale alignment improves faithfulness while maintaining competitive retrieval performance, supporting the hypothesis that attention can serve as a more faithful explanation of model behavior when guided by human rationales. Moving beyond topical similarity, CERA enables the retriever to identify the specific tokens that constitute supporting evidence, promoting more interpretable evidence selection in RAG systems.

2606.01479 2026-06-02 cs.CL (updated version)

Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech

Hongfei Du, Jiacheng Shi, Sidi Lu, Gang Zhou, Ye Gao

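Affiliations: University of California, Berkeley

AI summary: Uses sparse autoencoders to analyze emotion-related latent features in LLM-based TTS models and proposes a feature-level intervention framework for bidirectional emotion induction and suppression without modifying model parameters.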

Comments Accepted by ICML 2026

Abstract

Integrating large language models (LLMs) into text-to-speech (TTS) systems has improved speech expressiveness, yet interpretable emotional control remains challenging. Existing approaches primarily rely on external conditioning or global activation steering, offering limited insight into the internal representations underlying emotional control. In this work, we analyze emotion-related variation in the semantic hidden states of LLM-based TTS models using sparse autoencoders (SAEs) to identify sparse latent features. Our analysis shows that emotional variation is distributed across multiple sparse latent features, while intervening on a small subset enables interpretable emotion control. Building on this observation, we introduce a feature-level intervention framework for bidirectional emotion induction and suppression without modifying backbone parameters. We further show that distinct latent features are associated with specific acoustic attributes (e.g., pitch), suggesting that emotional expression arises from coordinated latent contributions rather than a single global shift. Empirically, steering these sparse latent features achieves comparable or superior emotion induction and suppression performance relative to global steering and existing TTS baselines.

2606.01469 2026-06-02 cs.CL (updated version)

Peacemaker at ATE-IT: Automatic term extraction from Italian text for waste management data using encoder model

Mahdi Bakhtiyarzadeh, Hadi Bayrami Asl Tekanlou, Jafar Razmara

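Affiliations: Department of Computer Science, University of Tabriz; University of Tabriz

AI summary: For Task A of the ATE Shared Task, proposes a low-cost, interpretable automatic term extraction method that fine-tunes an encoder model, achieves balanced performance with few computational resources, and offers a starting point for low-resource models.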

Comments 9 pages, 2 figures, Published in EVALITA 2026, CEUR Workshop Proceedings Vol. 4195

Journal ref: CEUR Workshop Proceedings, Vol. 4195, 2026

Abstract

Automatic term extraction has become increasingly important in modern technology and can be found in virtually every search engine currently available to users. Recent advances have produced promising results for automatic term extraction; however, accurate labeling remains difficult because of several factors, such as the limited number of annotated documents available for training and the complexity of extracting multi-word expressions under domain shift. In this paper, we present a low-cost and interpretable method for automatic term extraction, developed specifically for Task A of the ATE Shared Task. The method uses fine-tuning-based extraction strategies that can run on a small amount of computational resources. We evaluate our system using both type-level and micro-level precision, recall, and F1-score, which measure two complementary aspects of extraction performance. According to the experimental results, our approach achieves consistent and balanced performance compared to other teams. Even though the technique itself is relatively straightforward, it serves as a good starting point for low-resource models. Overall, the findings suggest that significant future gains are possible by scaling up the model while still retaining its interpretability.

2606.01464 2026-06-02 cs.CL (updated version)

Cross-lingual Self-Consistency for Multilingual Reasoning with Language Models

Ahmed Elhady, Eneko Agirre, Mikel Artetxe

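Affiliations: HiTZ Center, University of the Basque Country (UPV/EHU); Reka AI

AI summary: Proposes an unsupervised reinforcement learning method that strengthens multilingual reasoning by enforcing identical answers to cross-lingually equivalent problems, with average gains of up to 21.7% on MGSM and strong generalization.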

Comments Paper under review

Abstract

Despite expanding their multilingual coverage, the advanced reasoning capabilities of LLMs remain largely confined to a few high-resource languages like English. To address this, we propose an unsupervised Reinforcement Learning (RL) approach to enhance multilingual reasoning by enforcing cross-lingual self-consistency: the principle that a model should produce the same final answer for equivalent problems in different languages. Existing methods are limited by the scarcity of multilingual reasoning data and show weak generalization to unseen languages. Our approach requires neither gold answers nor parallel data, and it achieves average gains of up to 21.7% on MGSM across 10 languages. In addition, our method demonstrates strong generalization, with an 18.2% mean improvement on MGSM languages unseen during training, and up to 6.2% gain on 3 out-of-distribution benchmarks. These results show the potential of consistency-based methods to improve the multilingual capabilities of LLMs without requiring supervised data.

2606.01462 2026-06-02 cs.AI cs.CL cs.LG (updated version)

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, Tan Zhi-Xuan

Affiliations: NUS Department of Computer Science; MIT EECS; A*STAR; Singapore-MIT Alliance for Research and Technology (SMART)

AI summary: Using the VAIR dataset, finds that large reasoning models are markedly deficient at evaluating reasoning and exhibit an answer confirmation bias: they tend to verify that the answer is correct rather than scrutinize the reasoning steps.

Comments 10 pages, 8 figures, 2 tables (Appendix: 19 pages, 13 figures, 3 tables)

Abstract

Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How then do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems and solutions with trivial reasoning flaws but valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Unlike humans, who we find are only 6% worse at grading than solving such problems, we find a substantial production-evaluation gap in LRMs: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce then check for the correct answer instead of carefully verifying each step, fabricating rationalizations even when noticing anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer's representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for models' confirmation biases. These findings indicate an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons.

2606.01456 2026-06-02 cs.LG cs.CL cs.GT (updated version)

Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

Hamidreza Hasani Balyani, Seyed Pouyan Mousavi Davoudi, Alireza Amiri-Margavi, Amin Gholami Davodi, Arshia Gharagozlou

Affiliations: Amazon Lab126, HW Tech Org.; Computational Modeling and Simulation, University of Pittsburgh; Mathematics & Statistics Department, University of Minnesota Duluth

AI summary: Builds a benchmark from the Crawford-Sobel cheap-talk model to test whether LLMs stay honest under preference conflict, and finds that models over-reveal information relative to the strategic optimum.

Comments 19 pages. Code and data: https://github.com/iHamidHasani/cheap-talk-llm-benchmark

Abstract

Large language models are increasingly deployed as advisors whose objective is not aligned with the user's: recommenders optimize for engagement, sales assistants for purchases, negotiation agents for concessions. Whether such advisors stay truthful when honesty conflicts with their own payoff is a core alignment-evaluation question. We turn the canonical Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Cheap-talk theory predicts neither full revelation nor silence but coarse monotone partitions, with fewer informative intervals as preference conflict grows. A sender observes a state omega in [0,1], wants the receiver's action near omega+b, and sends one costless message to a receiver whose ideal action is omega. The design uses 5 bias levels, 3 prompt frames, a fixed low-temperature setting, and 200 states per cell: 12,000 sender calls. For the positive-bias grid b in {0.01,0.04,0.08,0.12} the exact most-informative partition sizes are 7,4,3,2, with oracle normalized mutual information 0.5294, 0.3268, 0.2205, 0.1829. Running the full design on four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, Llama-3.3-70B), we find all four over-reveal relative to the most-informative equilibrium by 1.8 to 4.2x: normalized mutual information stays at 0.78-0.94 where the oracle prescribes 0.18-0.53. Informativeness declines with bias as predicted but never approaches the strategic optimum; rather than coarse partitions, models show near-full revelation with a constant upward offset tracking their bias (linear exaggeration). Payoff-maximizing versus honesty framing has negligible effect. A decoder ablation shows the finding is recoverable only when the receiver reads the sender's stated number: an embedding-only decoder mis-reads the same data as near-babbling.
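
The partition sizes quoted in this abstract can be checked with a short sketch. It assumes the textbook uniform-state, quadratic-loss Crawford-Sobel specification, which the abstract does not spell out; under that assumption an N-interval equilibrium exists when 2N(N-1)b < 1, and the cut points are a_i = i/N + 2bi(i-N). The function names are illustrative and are not taken from the paper's repository.

```python
def max_partition_size(bias: float) -> int:
    """Largest N with 2*N*(N-1)*bias < 1 (uniform-quadratic cheap talk, assumed)."""
    if bias <= 0:
        raise ValueError("bias must be positive")
    n = 1
    # Keep growing the partition while an (n+1)-interval equilibrium still exists.
    while 2 * (n + 1) * n * bias < 1:
        n += 1
    return n


def partition_boundaries(bias: float, n: int) -> list[float]:
    """Cut points a_0..a_n of the n-interval equilibrium: a_i = i/n + 2*bias*i*(i-n)."""
    return [i / n + 2 * bias * i * (i - n) for i in range(n + 1)]


if __name__ == "__main__":
    for b in (0.01, 0.04, 0.08, 0.12):
        n = max_partition_size(b)
        print(b, n, [round(a, 4) for a in partition_boundaries(b, n)])
```

For the positive-bias grid b in {0.01, 0.04, 0.08, 0.12}, this yields partition sizes 7, 4, 3, 2, matching the values stated in the abstract. The intervals widen toward higher states, which is the coarse monotone partition the theory predicts.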

2606.01451 2026-06-02 cs.CL (updated version)

Before and After Temperature: A Distributional View of Creative LLM Generation

V. S. Raghu Parupudi, Harsha Ponnada, Aditi Kaushal, S. Shria Parupudi, Saiteja Dasari, Sahiti Bulusu

Affiliations: University of California, San Diego; Indian Institute of Technology, Kharagpur; Delhi Technological University; Carnegie Mellon University; Rutgers University; Georgia Institute of Technology

AI summary: Proposes a reference-free creativity evaluation method based on how sampling temperature reshapes the token distribution, reaching Spearman ρ = 0.918 (LLM judge) and ρ = 0.870 (human majority) on Llama-3.1-8B-Instruct, well above existing baselines.

Comments Submitted to NGEN-AI 2026

Abstract

Reference-free evaluation of large language model (LLM) creativity relies on perplexity, entropy, and top-1 margin. We show that a much stronger signal lives one step earlier in the pipeline: in how sampling temperature reshapes the model's token distribution before the next token is drawn. On Llama-3.1-8B-Instruct generations of 500 open-ended creative prompts at T ∈ {0.3, 0.8, 1.5}, a single per-token feature derived from this reshaping predicts the within-prompt creativity rank at Spearman ρ = 0.918 against an averaged gpt-4o / gemini-2.5-pro judge (n = 500) and ρ = 0.870 against a three-rater human-majority ranking (n = 150). Each of four standard reference-free baselines (self-perplexity, mean predictive entropy, top-1 margin, gzip compression ratio) tops out at |ρ| ≈ 0.76 on both ground truths: a gap of +0.165 on averaged-LLM and +0.110 on human-majority, both far larger than the spread among the baselines themselves. The two ground-truth panels agree with each other at ρ = 0.83, above the inter-human ceiling of ρ = 0.77, so the comparison is not bottlenecked by judge noise. Mechanistically, the win comes from a sharp distributional signature of the incoherence regime: at T = 1.5 the cumulative-mass width n_95(q) inflates from ~1 to ~131 tokens and post-temperature mass leaks off the pre-temperature top-90% plausible set by about 13 percentage points. The per-token aggregates do not separate T = 0.8 from T = 0.3; discriminating the two coherent regimes is left to sequence-level features.
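
The two distributional quantities named in this abstract can be illustrated with a small sketch. It is not the authors' feature: the definitions below are assumptions, namely that n_95 is the smallest number of top tokens covering 95% of the probability mass, that "pre-temperature" means the T = 1 softmax, and that leaked mass is the post-temperature probability falling outside the pre-temperature top-90% set. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_width(probs, mass=0.95):
    """Smallest number of top tokens whose cumulative probability reaches `mass`."""
    cumulative = 0.0
    for k, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= mass - 1e-12:
            return k
    return len(probs)


def leaked_mass(logits, temperature, plausible=0.90):
    """Post-temperature mass outside the pre-temperature (T=1) top-`plausible` set."""
    before = softmax(logits, 1.0)
    after = softmax(logits, temperature)
    order = sorted(range(len(before)), key=lambda i: before[i], reverse=True)
    keep = set(order[:nucleus_width(before, plausible)])
    return sum(q for i, q in enumerate(after) if i not in keep)


if __name__ == "__main__":
    # Toy vocabulary: one strongly preferred token and a long flat tail.
    logits = [10.0] + [0.0] * 999
    for t in (0.3, 0.8, 1.5):
        print(t, nucleus_width(softmax(logits, t)), round(leaked_mass(logits, t), 4))
```

On this toy distribution the width stays at one token for low temperatures and grows sharply at T = 1.5, which is the qualitative pattern the abstract describes for the incoherence regime.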

2606.01444 2026-06-02 cs.AI cond-mat.mtrl-sci cs.CL cs.LG math.CT (updated version)

Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence

Fiona Y. Wang, Markus J. Buehler

Affiliations: Laboratory for Atomistic and Molecular Mechanics; Department of Biological Engineering; Massachusetts Institute of Technology; Department of Civil and Environmental Engineering; Department of Mechanical Engineering; Center for Computational Science and Engineering; Schwarzman College of Computing

AI summary: Proposes a category-theoretic framework in which representational regime transitions in scientific discovery are realized through left Kan extensions, and applies it to protein mechanics and fiber-network modeling in materials science.

Abstract

Scientific discovery is not only answer generation but revision of the representational regime in which evidence, artifacts, operations, and verifiers are typed. We develop a category-theoretic account of agentic discovery for materials science. In a fixed regime b with schema category S_b, the system state is a copresheaf I_t: S_b -> Set, and provenance is the category of elements \int_{S_b} I_t. Fixed-regime operation is an update on such states, endofunctorial only when provenance-preserving refinements are specified and preserved. Discovery is instead a verified regime transition u: S_b -> S_b': old artifacts are preserved, transported by the left Kan extension Lan_u I_t, and compared with the post-transition state to identify residual content beyond functorial transport. This separates retrieval, search, and discovery without subjective novelty. We instantiate the framework in two systems. In Builder/Breaker, a protein-mechanics world model is revised under a Minimum Description Length gate; the accepted law expresses within-chain flexibility as all-mode elastic compliance conditioned by slow collective-mode participation, or mode-conditioned compliance. In CategoryScienceClaw, typed skills, artifacts, open needs, workflow mutation, gates, stress tests, and public discourse become a proof-carrying knowledge-computation graph. A fiber-network example records candidate models, rejected alternatives, an AIC gate, perturbation tests, and an accepted orientation-tensor anisotropic stiffness surrogate over an isotropic fiber-count descriptor. Together, the cases show how category theory can be both a mathematical language for discovery and an engineering specification for self-revising AI discovery systems.

2606.01436 2026-06-02 cs.CL (updated version)

Learning from Saturated Data: Signals Beyond Correctness for LLM Training

Hanno Hiss, Jasper Dekoninck, Martin Vechev

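Affiliations: ETH Zurich

AI summary: Studies how, once benchmarks and training datasets are saturated, finer-grained quality signals beyond binary correctness (pairwise LLM self-judgments and token-level entropy) can improve downstream performance. Quality signals substantially outperform SFT on a simple arithmetic task, but gains on more complex tasks are limited and require careful calibration.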

Comments 25 pages, 5 figures

Abstract

The growing capabilities of large language models (LLMs) have led to the saturation of many benchmarks and training datasets used to improve them. Motivated by this, we investigate whether questions solved with perfect empirical accuracy can nevertheless be used to improve downstream performance. To do so, we replace binary correctness with two sources of more fine-grained quality signals: (1) pairwise LLM self-judgments, in which the model evaluates the relative quality of its own solutions, and (2) token-level entropy, where token-level uncertainty is used as a proxy for solution quality. We incorporate these signals into several training algorithms and evaluate them on Qwen3-1.7B-Base. When training exclusively on a simple arithmetic task, quality-based signals improve performance by up to 18.6% over the base model, substantially outperforming SFT. On GSM8K, however, gains are more modest and depend strongly on the quality signal. For instance, self-judgments show poor agreement with a stronger external judge and can even degrade performance below the base model. Overall, our results suggest that quality-based training can extract useful signal from saturated questions for base models, but that applying such signals to more complex tasks requires careful calibration and further study.
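
The token-level entropy signal mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's training algorithm: it assumes that per-token next-token distributions are available for each sampled solution, that they are averaged over the sequence, and that lower mean entropy is treated as higher quality. The function names and the toy distributions are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_token_entropy(step_distributions):
    """Average per-token entropy over one generated solution."""
    if not step_distributions:
        raise ValueError("a solution needs at least one token")
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)


def rank_by_entropy(solutions):
    """Order solution names from lowest to highest mean entropy (assumed: lower is better)."""
    return sorted(solutions, key=lambda name: mean_token_entropy(solutions[name]))


if __name__ == "__main__":
    # Two equally correct toy solutions that differ only in model uncertainty.
    solutions = {
        "hesitant": [[0.5, 0.5], [0.6, 0.4]],
        "confident": [[0.9, 0.1], [0.99, 0.01]],
    }
    print(rank_by_entropy(solutions))
```

A ranking like this could feed a preference-style objective even when every sampled solution is correct, which is the saturated setting the abstract targets.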

2606.01435 2026-06-02 cs.AI cs.CL cs.IR (updated version)

Don't Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution

Vikas Reddy, Sumanth Challaram

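Affiliations: IIT Kgp

AI summary: To address the poor fact-conflict resolution of LLM-based memory systems, replaces LLM judgment with deterministic aggregation (candidate extraction plus Python max(serial)), gaining 10.8 points on the single-hop task and extending to multi-hop tasks.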

Abstract

LLM-based memory systems increasingly maintain facts that evolve over time, where a recurring failure is conflict resolution: when a fact has multiple contradictory values, which should the agent return? MemoryAgentBench (MAB; Hu et al., 2026) makes this explicit in its FactConsolidation task: facts are numbered, the counterfactual has the higher serial, and agents are told newer facts have larger serials. Yet every published system underperforms: HippoRAG-v2 reaches 54% on single-hop (FC-SH), BM25 48%, Mem0 18%, and the temporal KG Zep/Graphiti just 7%. Multi-hop is near-unsolved (at most 7% across 22 systems). We argue the bottleneck is the assembly step: baselines leave conflict resolution to LLM-mediated retrieval or generation rather than version-aware aggregation. A matched-setup comparison (same backbone, retrieval, chunking, TOP_K) shows that replacing the LLM-judgment answer pipeline with candidate-extraction plus Python max(serial) yields +10.8 points on FC-SH (gpt-4o-mini), widening from +8 at 6K to +21 at 262K. This is a whole-pipeline effect (resolver, prompt, format, and temperature vary jointly); isolating the resolver is future work. The recipe reaches 78.0% on FC-SH (gpt-4o-mini), 94.8% (gpt-4o), and 30.2% on FC-MH (gpt-4o-mini, rising to 51.5% with gpt-4o) via a per-hop deterministic extension of Self-Ask. At matched-262K, it beats HippoRAG-v2 by +28 points and the best published FC-MH result by +20. The implication is corrective for the subfield: the bottleneck on conflict resolution is assembly (post-retrieval aggregation), not storage. A LongMemEval knowledge-update check shows the mechanism ports from max(serial) to max(timestamp) but only ties LLM judgment (57.8% vs 64.4%, n=45): deterministic aggregation is the right primitive for current-value conflicts and must be composed with question-type-aware handling for broader memory QA.
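
The deterministic aggregation step described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `Candidate` type and `resolve_conflict` helper are hypothetical names, and candidate extraction is assumed to have already produced (serial, value) pairs for one fact.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One extracted value for a fact, tagged with its serial number (hypothetical type)."""
    serial: int
    value: str


def resolve_conflict(candidates: Sequence[Candidate]) -> Optional[str]:
    """Return the value carrying the highest serial; no LLM call is involved."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.serial).value


if __name__ == "__main__":
    retrieved = [
        Candidate(serial=12, value="Paris"),
        Candidate(serial=87, value="Lyon"),   # newer, so it should win
        Candidate(serial=40, value="Nice"),
    ]
    print(resolve_conflict(retrieved))
```

Swapping the `serial` key for a timestamp gives the max(timestamp) variant that the abstract mentions for the LongMemEval check.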

2606.01434 2026-06-02 cs.CL (updated version)

DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering

Qing Wang, Bo Li, Jialu Liang, Daling Shi, Bob Zhang, Qianqian Song

Affiliations: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida; PAMI Research Group, Department of Computer and Information Science, Faculty of Science and Technology, University of Macau

AI summary: Proposes DrugClaw, a multi-agent retrieval-augmented system that queries a registry of drug and pharmacovigilance skills through a reflection-driven state machine, and builds DrugAudit, a 3,772-item authority-aware benchmark. DrugClaw achieves the best performance across multiple benchmarks.

Abstract

Drug-information question answering is a high-stakes setting where hallucinated facts can mislead clinical decision-making and the provenance of each cited fact matters as much as the fact itself. We present DrugClaw, a multi-agent retrieval-augmented system that queries a registry of drug and pharmacovigilance skills via a reflection-driven state-machine workflow and returns answers grounded in primary regulatory or peer-reviewed records. We also contribute DrugAudit, a 3,772-item authority-aware benchmark with an evaluation panel that scores upstream-of-gold source match, token-level semantic snippet overlap, and citation faithfulness under a dual-judge LLM-as-judge protocol with inter-judge kappa = 0.88 (almost-perfect). Across DrugAudit plus drug-related subsets of MedQA (751) and PubMedQA (512), DrugClaw is top-1 on every column of the headline table: composite Evidence Index under both judges, judge-mediated answer correctness, primary-source rate (0.918, +10.1 pp over next-best), faithfulness (0.887, +5.9 pp), MedQA (0.920), and PubMedQA (0.693).

2606.01400 2026-06-02 cs.CL cs.AI (updated version)

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

Denica Kjorvezir, Marko Djukanović, Ana Gjorgjevikj, Gjorgjina Cenikj, Tome Eftimov

Affiliations: Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Ljubljana, Slovenia; Center for Astrophysics and Cosmology, University of Nova Gorica, Nova Gorica, Slovenia

AI summary: Proposes a prompt selection framework based on maximum independent sets of similarity graphs that selects diverse, non-redundant subsets, substantially cutting benchmarking cost while keeping LLM rankings consistent.

Abstract

Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph (nodes are prompts, connected if their embedding-space distance falls above a configurable threshold) and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) across six embedding models, three distance measures, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) covering 66 LLMs. Our central hypothesis, that repeated selection under different random seeds yields consistent LLM rankings that may also differ from the full-benchmark baseline, is strongly confirmed: Kendall's W ≥ 0.90 in 99.2% of stochastic configurations (mean W = 0.997 ± 0.008), while at higher percentile thresholds selected subsets achieve 25-48% prompt reduction on average. Ranking divergence from the full benchmark (ρ < 0.95) occurs in only 15.95% of configurations, concentrated at low thresholds (p10-p20) and benchmarks (GPQA, IFEval), identifying overly dense graphs as the primary failure mode.
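
The selection step can be illustrated with a simple greedy heuristic for independent sets. This sketch is an assumption-laden stand-in, not any of the four solvers evaluated in the paper: it takes a ready-made symmetric adjacency map (how edges are derived from embedding distances is left to the caller) and repeatedly keeps the lowest-degree prompt while discarding its neighbors. The function name is hypothetical.

```python
def greedy_independent_set(adjacency):
    """Greedy heuristic: keep the lowest-degree node, drop its neighbors, repeat.

    `adjacency` maps every node to the set of its neighbors and must be symmetric.
    The result is an independent set, but it is not guaranteed to be maximum.
    """
    remaining = {node: set(neighbors) for node, neighbors in adjacency.items()}
    chosen = []
    while remaining:
        # Ties on degree are broken by node id so the output is deterministic.
        node = min(remaining, key=lambda n: (len(remaining[n]), n))
        chosen.append(node)
        removed = remaining[node] | {node}
        for r in removed:
            remaining.pop(r, None)
        for neighbors in remaining.values():
            neighbors -= removed
    return chosen


if __name__ == "__main__":
    # Five prompts on a path: 0-1-2-3-4, where an edge marks a pair
    # that should not both be selected.
    graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    print(greedy_independent_set(graph))
```

Exact solvers can find larger sets on hard graphs; the greedy rule is shown only because it fits in a few lines and makes the independence constraint explicit.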

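2606.01394 2026-06-02 cs.CL (updated version)

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer Hardware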

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta

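Affiliations: Department of Computer Science, University of the Cumberlands; Department of Computer Science, DePaul University; Youngstown State University

AI summary: Evaluates four locally deployed open-source LLMs on natural-language-to-SQL generation over a biopharmaceutical manufacturing database, finding that code-tuned general-purpose models outperform a domain-specific model but that current performance still requires human oversight.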

Abstract

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.

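2606.01336 2026-06-02 cs.CL (version update)

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

Mengmeng Ji, Ravi Shanker Raju, Jonathan Lingjie Li, Chen Wu

Affiliations: SambaNova Systems, Inc.

AI summary: Proposes LongAttnComp, which fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser. Combined with a two-stage fine-tuning strategy, it matches or exceeds full-context accuracy on long-context reasoning tasks.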

Comments Under review

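AI-translated abstract (originally in Chinese)

As real-world applications increasingly need to process inputs of 100k+ tokens, the gap between context length and inference efficiency has become a key bottleneck. Context compression offers a way to lower prefill cost while preserving task accuracy. However, existing training-free attention-based methods leave a significant gap on demanding long-context tasks such as code reasoning. We propose LongAttnComp, a long-context adaptation of AttnComp that fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser. We further design a two-stage fine-tuning scheme for the compressor: Stage 1 builds a general retrieval foundation from NIAH-style data, and Stage 2 extends it with multi-hop and reasoning data to cover a broader range of long-context tasks. On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, significantly outperforms training-free baselines, and transfers across four target models from three families. On LongBench v2, the two-stage scheme largely closes the Stage 1 gap on multi-document reasoning while maintaining Code-Debug performance.

English abstract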

As real-world applications increasingly require processing inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill costs while preserving task accuracy. However, existing training-free attention-based methods leave substantial gaps in demanding long-context tasks such as code reasoning. We present LongAttnComp, a long-context adaptation of AttnComp that fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser. We further design a two-stage fine-tuning recipe for the compressor: Stage 1 builds a general retrieval foundation from NIAH-style data, and Stage 2 extends it with multi-hop and reasoning data for broader long-context task coverage. On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, substantially outperforms training-free baselines, and transfers across four target models from three families. On LongBench v2, the two-stage recipe largely closes the Stage 1 gap on multi-document reasoning while preserving Code-Debug performance.
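
The abstract names a token-budget top-p algorithm and positional reordering but gives no details. The sketch below shows one way such a selection step could work, assuming each chunk already has a relevance score: chunks are taken in descending probability until a cumulative mass `top_p` is reached, chunks that would exceed `token_budget` are skipped, and the survivors are returned in document order. All names and the skip-if-too-large policy are assumptions made for illustration, not the authors' method.

```python
import math
from typing import List


def select_chunks(scores: List[float], lengths: List[int],
                  top_p: float = 0.9, token_budget: int = 4096) -> List[int]:
    """Return indices of kept chunks, sorted back into original document order."""
    # Softmax over chunk scores (shifted by the max for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit chunks from most to least probable.
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    kept, mass, used = [], 0.0, 0
    for i in order:
        if mass >= top_p:
            break  # enough probability mass is covered
        if used + lengths[i] > token_budget:
            continue  # this chunk does not fit in the remaining budget
        kept.append(i)
        mass += probs[i]
        used += lengths[i]

    # Positional reordering: restore the order in which chunks appear in the input.
    return sorted(kept)
```

With scores `[3.0, 0.0, 4.0]`, equal lengths of 10, `top_p=0.95`, and a budget of 100, the third chunk is picked first and the first chunk second, and the function returns them in document order as `[0, 2]`.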

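2606.01323 2026-06-02 cs.CL cs.AI (version update)

Challenger at MultiPRIDE: Is It Hate Speech or Reclaimed Language?

Hadi Bayrami Asl Tekanlou, Mahdi Bakhtiyarzadeh, Jafar Razmara

Affiliations: University of Tabriz

AI summary: Addresses the difficulty of distinguishing hate speech from reclaimed language with an interpretable method that combines semantic embeddings, label-noise filtering (Cleanlab with logistic regression), and an MLP classifier, achieving robust performance under limited resources.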

Comments 9 pages, 2 figures, Published in EVALITA 2026, CEUR Workshop Proceedings Vol. 4195

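Journal ref: CEUR Workshop Proceedings, Vol. 4195, 2026

AI-translated abstract (originally in Chinese)

The spread of hate speech has become increasingly harmful in modern digital environments, especially on social networking platforms. Although recent advances show promising results in automatic hate speech detection, a key challenge remains: distinguishing genuine hate speech from reclaimed language. Because reclaimed expressions are nuanced and context dependent, accurate annotation is difficult. In this paper, we propose a simple and interpretable method for distinguishing hate speech from reclaimed language, developed for the MultiPride shared task. Our method generates dense semantic text embeddings, applies a label-noise filtering stage using Cleanlab with logistic regression, and then uses a multi-layer perceptron (MLP) neural network for final classification. The system is designed to run with limited computational resources while maintaining strong performance. We evaluate our method using precision, recall, and F1 score, including macro-averaged metrics. Experimental results show robust performance despite extreme class imbalance in the dataset. Overall, the findings highlight the potential for further improvement through larger embedding models and more advanced preprocessing techniques while preserving interpretability.

English abstract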

The spread of hate speech has become increasingly harmful in modern digital environments, particularly on social networking platforms. While recent advances have shown promising results in automatic hate speech detection, a key challenge remains: distinguishing genuine hate speech from reclaimed language. Accurate labeling is difficult due to the nuanced and context-dependent nature of reclaimed expressions. In this paper, we present a simple and interpretable approach for distinguishing hate speech from reclaimed language, developed for the MultiPride Shared Task. Our method generates dense semantic text embeddings and incorporates a label-noise filtering stage using Cleanlab with logistic regression, followed by a Multi-layer Perceptron (MLP) neural network for final classification. The system is designed to operate under limited computational resources while maintaining strong performance. We evaluate our approach using precision, recall, and F1-score, including macro-averaged metrics. Experimental results demonstrate robust performance despite extreme class imbalance in the dataset. Overall, the findings highlight the potential for further improvements through larger embedding models and more advanced preprocessing techniques while preserving interpretability.
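
Since the abstract stresses macro-averaged metrics under extreme class imbalance, a small pure-Python sketch may help show why macro-F1 is the more telling number. The function names and the toy labels are illustrative assumptions. The code is not taken from the paper.

```python
from typing import Hashable, List, Sequence, Tuple


def per_class_prf(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                  label: Hashable) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class, with 0.0 where undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels: List[Hashable] = sorted(set(y_true) | set(y_pred))
    return sum(per_class_prf(y_true, y_pred, lab)[2] for lab in labels) / len(labels)
```

On a toy set with nine majority-class items and one minority-class item, a classifier that always predicts the majority class reaches 90% accuracy but a macro-F1 below 0.5, because the minority class contributes an F1 of zero.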

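2606.01294 2026-06-02 cs.CL cs.LG (version update)

Don't Read Everything: A Curvature-Conditioned Query for Linear Attention

Dong Le, Thong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu

Affiliations: Nanyang Technological University; National University of Singapore; VinUniversity

AI summary: To address the shortcomings of linear attention on in-context retrieval and long-context tasks, proposes the Curvature-Conditioned Query (CCQ) mechanism, which builds a local quadratic model from a second-order Taylor expansion and uses the running key covariance to contract the query vector. CCQ modifies only the read step, is compatible with existing linear-attention backbones, and improves perplexity, zero-shot downstream accuracy, retrieval, and long-context tasks.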

Comments 19 pages

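AI-translated abstract (originally in Chinese)

Linear attention lowers the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of memory through gating, delta updates, or kernel feature maps, while the read step stays unchanged: every past key contributes additively to the output, so useful targets are diluted by the bulk of stored vectors. We borrow one specific piece of softmax geometry to build a cheap read-time contraction of the query. A second-order Taylor expansion of the softmax log-partition function at the isotropic-attention point yields a local quadratic model whose curvature coincides with the running key covariance, a quantity that can be maintained with the same recurrent/chunkwise mechanism as the linear-attention state. The associated linear operator contracts the query along the high-density directions of memory before the query reads the state. We call this mechanism the Curvature-Conditioned Query (CCQ). CCQ modifies only the read step and can be combined with any linear-attention backbone. Attached to GLA and Gated DeltaNet, it improves perplexity, zero-shot downstream accuracy, S-NIAH retrieval within and beyond the training context, length-extrapolation perplexity from 4K to 20K, and LongBench accuracy, at small extra cost.

English abstract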

Linear attention reduces the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of memory through gating, delta updates, or kernel feature maps, but the read step is left unchanged: every past key contributes additively to the output, so useful targets are diluted by the bulk of stored vectors. We borrow one specific piece of softmax's geometry to construct a cheap read-time contraction of the query. A second-order Taylor expansion of the softmax log-partition at the isotropic-attention point gives a local quadratic model whose curvature coincides with the running key covariance, a quantity that can be maintained with the same recurrent/chunkwise mechanism as the linear-attention state. The associated linear operator contracts the query along the high-density directions of memory before it reads the state. We call this mechanism Curvature-Conditioned Query (CCQ). CCQ modifies only the read step and is composable with any linear-attention backbone. Attached to GLA and Gated DeltaNet, it improves perplexity, zero-shot downstream accuracy, S-NIAH retrieval at and beyond the training context, length-extrapolation perplexity from 4K to 20K, and LongBench accuracy, at small extra cost.
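
The Taylor expansion described in the abstract can be written out as follows. This is a sketch under two assumptions that the abstract does not state: the isotropic-attention point is taken to be $q = 0$, where each of the $n$ stored keys receives weight $1/n$, and the usual $1/\sqrt{d}$ scaling is omitted. The symbols $Z$, $\mu$, and $\Sigma$ are introduced here for illustration.

```latex
\begin{align}
Z(q) &= \log \sum_{i=1}^{n} \exp\!\left(q^\top k_i\right), \\
\nabla Z(0) &= \frac{1}{n}\sum_{i=1}^{n} k_i = \mu, \\
\nabla^2 Z(0) &= \frac{1}{n}\sum_{i=1}^{n} k_i k_i^\top - \mu\mu^\top = \Sigma, \\
Z(q) &\approx \log n + \mu^\top q + \tfrac{1}{2}\, q^\top \Sigma\, q .
\end{align}
```

Both $\mu$ and the second-moment sum are running averages over keys, which is consistent with the abstract's remark that the covariance can be maintained recurrently. The exact linear operator that CCQ derives from $\Sigma$ is not given in the abstract and is not reconstructed here.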

2606.01286 2026-06-02 cs.SE cs.AI cs.CL cs.LG (version update)

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao, Ziheng Zhou, Mert Cemri, Shu Liu, Yuran Xiu, Chenxiao Yan, Haikun Zhao, Bin Yu, Ion Stoica, Dawn Song

Affiliations: University of California, Berkeley; Institute for Interdisciplinary Information Sciences, Tsinghua University

AI summary: Proposes the BenchEvolver framework, which automatically generates harder programming problems by evolving reference solutions in order to counter benchmark saturation, and validates it on LiveCodeBench and SciCode.

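AI-translated abstract (originally in Chinese)

The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to distinguish model capabilities or provide useful training signal. On LiveCodeBench, for example, frontier models reach over 99% Pass@1 on the easy split and average over 90% Pass@1 across difficulty levels. Building new, challenging datasets usually requires substantial human effort and has become a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically turns existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives the corresponding statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while preserving validity, reference correctness, and diversity. We further curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved tasks with difficult original LCB-v6 tasks, on which frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear separation among strong coding models. Importantly, evolved tasks remain challenging even for the model that generated them, which enables self-improvement. We also show that reinforcement learning on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training yields Pass@1 gains of +8.7 and +8.3 on LCB v6 Hard and LCB-Pro Easy, exceeding the seed-only gains by 70.7% and 34.8%, respectively. Our results indicate that BenchEvolver can turn saturated benchmarks into frontier-level evaluation suites and reusable training signal.

English abstract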

The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or provide useful training signal. For instance, on LiveCodeBench, frontier models achieve over 99% Pass@1 on easy splits and exceed 90% Pass@1 on average across difficulty levels. Constructing new, challenging datasets typically requires substantial human effort, creating a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically transforms existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives corresponding statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while maintaining validity, reference correctness, and diversity. We further curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved and difficult original LCB-v6 tasks, where frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear discrimination among strong coding models. Importantly, evolved tasks remain challenging even for the model that generates them, enabling self-improvement. We further show that RL on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training achieves +8.7 and +8.3 Pass@1 gains on LCB v6 Hard and LCB-Pro Easy, exceeding seed-only gains by 70.7% and 34.8%, respectively. Our results show that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.
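
The abstract reports Pass@1 throughout but does not say how it is estimated. A common choice in code-generation evaluation is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed from n samples per problem of which c pass. The sketch below assumes that definition. The function names are illustrative and nothing here is taken from the BenchEvolver code.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: List[Tuple[int, int]]) -> float:
    """Mean pass@1 over problems, given (n_samples, n_correct) per problem."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, so a problem solved in 3 of 10 samples contributes 0.3, and the benchmark score is the mean of these per-problem values.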

2606.01276 2026-06-02 cs.CL (version update)

Worlds Within Words: Translating Culture in Ancient Chinese Texts with Multi-Agent Coordination

Xiaoqi He, Kaixin Lan, Mu You, Tao Fang, Lidia S. Chao, Derek F. Wong

Affiliations: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Institute of International Language Services Studies, Macau Millennium College

AI summary: To address the difficulty of translating culture-loaded words in ancient Chinese texts, proposes MACAT, a multi-agent culture-aware translation framework that uses a selective explicitation strategy to dynamically identify cultural phrases and inject concise explanations. It outperforms baseline models on traditional Chinese medicine classics and the Analects.

Comments The preprint manuscript is 20 pages long and is currently under review

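AI-translated abstract (originally in Chinese)

Machine translation based on large language models has advanced cross-cultural communication, but it still struggles with culture-loaded words in ancient Chinese texts. The challenge lies not only in lexical alignment but also in deciding when and how to make culture-dependent knowledge explicit for readers who lack the relevant background. Literal translation often preserves surface form while losing the underlying concept, whereas over-explicitation harms conciseness and readability. To address this, we define culture-loaded word translation as a selective explicitation task and propose MACAT, a multi-agent culture-aware translation framework that dynamically identifies culturally salient phrases and injects concise explanatory knowledge when necessary. MACAT also includes a quality-aware reranking module for candidate selection and a multi-round evaluation agent that assesses translations on terminological precision, readability, fidelity, cultural preservation, and cultural explicitation. Experiments on traditional Chinese medicine (TCM) classics and the Analects show that, under a unified GPT-5.4 evaluation setting, MACAT consistently outperforms the backbone model and general-purpose machine translation baselines on 100 TCM documents and a 20-chapter subset of the Analects.

English abstract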

Large language model (LLM)-based machine translation has advanced cross-cultural communication, yet it still struggles with culture-loaded words (CLWs) in ancient Chinese texts. The challenge extends beyond lexical alignment to deciding when and how culture-dependent knowledge should be explicated for readers lacking relevant background. Literal translation often preserves surface forms while missing underlying concepts, whereas over-explicitation harms conciseness and readability. To address this problem, we formulate CLW translation as a selective explicitation task and propose MACAT, a Multi-Agent Culture-Aware Translation framework that dynamically identifies culturally salient phrases and injects concise explanatory knowledge when necessary. MACAT further incorporates a quality-aware reranking module for candidate selection and a multi-round evaluation agent that assesses translations across terminological precision, readability, fidelity, cultural preservation, and cultural explicitation. Experiments on traditional Chinese medicine (TCM) classics and the Analects show that, under a unified GPT-5.4 evaluation setting, MACAT consistently outperforms both the backbone model and general-purpose MT baselines on 100 TCM documents and a 20-chapter subset of the Analects.

2606.01260 2026-06-02 cs.CL cs.AI (version update)

IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages

Ikhlasul Akmal Hanif, Muhammad Falensi Azmi, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Fajri Koto

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Universitas Indonesia; Independent Researcher

AI summary: Proposes the IndoBias benchmark, whose dual depth and breadth evaluation tracks show that existing LLMs exhibit significant bias in Indonesian and three local languages, and that pretraining data source and language diversity affect the degree of bias.

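AI-translated abstract (originally in Chinese)

Although Indonesia is home to more than 1300 ethnic groups and 700 indigenous languages, bias in large language models has not been fully studied there, leaving a critical gap in evaluating representational fairness and localized stereotypes within its uniquely vast, multilingual, and diverse sociocultural context. To address this, we introduce IndoBias, a culturally grounded bias benchmark that evaluates LLM bias in Indonesian and three local languages (Javanese, Sundanese, and Makasar). IndoBias has two evaluation tracks: a depth-oriented track that uses contrastive pairs and a breadth-oriented, generation-based track grounded in social science frameworks (SPI, O*NET, and WGI). Our results show that existing LLMs, decoder models in particular, exhibit strong bias toward prototypical sentences in Indonesian, while local languages suffer higher bias in the Ideology and Religion categories. We also find that LLM responses show non-uniform stereotype polarity when prompted with various local entities. Finally, we find that, in Indonesian, Common Crawl text introduces more bias during pretraining than human-reviewed article text (such as Wikipedia and news), and that adding local languages to pretraining generally increases bias. This work underlines the importance of studying bias in culture-specific contexts. Warning: this paper contains example data that may be offensive, harmful, or biased.

English abstract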

Despite being home to more than 1300 ethnic groups and 700 indigenous languages, bias in Large Language Models has not been fully studied in Indonesia, thus leaving a critical gap in evaluating representational fairness and localized stereotypes within its uniquely vast, multilingual, and diverse sociocultural landscape. To address this, we introduce IndoBias as a culturally-grounded bias benchmark to assess LLMs bias in Indonesian and three local languages: Javanese, Sundanese, and Makasar. IndoBias features dual perspective evaluation tracks: depth-oriented (with contrastive-pairs) and breadth-oriented (with generation-based), where the latter is grounded in social science frameworks (SPI, O*NET, and WGI). Our results show that existing LLMs -- particularly decoder models -- exhibit strong bias towards prototypical sentences in Indonesian, while local languages suffer higher bias under Ideology and Religion category. We also find that LLMs responses exhibit a non-uniform Stereotype Polarity when prompted with various local entities. Finally, we discover that, in Indonesian, Common Crawl texts introduce more bias during pretraining, compared to human-reviewed article texts (e.g., Wikipedia, News), whereas introducing local languages to pretraining generally increases bias. This work highlights the importance of studying bias in culture-specific context. Warning: This paper contains example data that may be offensive, harmful, or biased.

2606.01258 2026-06-02 cs.LG cs.CL eess.SP (version update)

Beyond Sinusoids: A Morlet Wavelet Framework for Transformer Positional Encoding

Athanasios Zeris

Affiliations: Independent Researcher, Athens, Greece

AI summary: Proposes Morlet Positional Encoding (MoPE), which unifies sinusoidal and rotary positional encodings through learnable frequencies and locality bandwidths, and reports a 0.119 improvement on TinyShakespeare when combined with Energy-Gated Attention.

Comments 16 pages, 4 figures, 4 tables

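AI-translated abstract (originally in Chinese)

Standard Transformer positional encodings (sinusoidal encoding and rotary positional encoding, RoPE) treat every position as equally local: they encode where a token is, but not how far its positional influence should extend. We propose that the Morlet wavelet, which simultaneously minimizes uncertainty in position and frequency, is the natural basis for positional encoding, and we introduce Morlet Positional Encoding (MoPE), in which each embedding dimension learns its own frequency and locality bandwidth from data. The main theoretical result is a unification: when locality is switched off (sigma_i -> infinity), both sinusoidal PE and the RoPE correlation kernel emerge as limiting cases of MoPE. The phase of MoPE recovers the RoPE rotation angle exactly, while the amplitude adds a learnable Gaussian locality kernel that standard encodings lack. Empirically, MoPE combined with Energy-Gated Attention improves on standard attention by 0.119 on TinyShakespeare and outperforms either component alone. Analysis of the learned parameters shows that all 128 frequency-bandwidth pairs converge to the wavelet admissibility boundary. This empirical observation is consistent with a companion result on energy gating and suggests a reproducible property of character-level language signals that merits further study.

English abstract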

Standard positional encodings for transformers - sinusoidal and rotary (RoPE) - treat every position as equally local: they encode where a token is, but not how far its positional influence should extend. We propose that the Morlet wavelet, which simultaneously minimises uncertainty in position and frequency, is the natural basis for positional encoding, and introduce Morlet Positional Encoding (MoPE): each embedding dimension learns its own frequency and locality bandwidth from data. The main theoretical result is a unification: sinusoidal PE and the RoPE correlation kernel both emerge as limiting cases of MoPE when locality is switched off (sigma_i -> infinity). The phase of MoPE recovers the RoPE rotation angle exactly; the amplitude adds a learned Gaussian locality kernel that standard encodings lack. Empirically, MoPE combined with Energy-Gated Attention achieves +0.119 improvement over standard attention on TinyShakespeare, outperforming either component alone. Analysis of the learned parameters reveals that all 128 frequency-bandwidth pairs converge to the wavelet admissibility boundary - an empirical observation consistent with a companion result on energy gating, suggesting a reproducible property of character-level language signals that warrants further investigation.
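
The limiting behaviour described above can be illustrated with a real-valued Morlet-style kernel: a cosine carrier multiplied by a Gaussian envelope. The parameterization below is an assumption made for illustration (the abstract does not give MoPE's exact form), with `omega` standing for a per-dimension frequency and `sigma` for its locality bandwidth.

```python
import math


def morlet_kernel(delta: float, omega: float, sigma: float) -> float:
    """Gaussian-windowed cosine of the relative offset delta.

    envelope = exp(-delta^2 / (2 sigma^2)) limits how far positional
    influence extends; cos(omega * delta) is the usual sinusoidal term.
    """
    envelope = math.exp(-(delta * delta) / (2.0 * sigma * sigma))
    return envelope * math.cos(omega * delta)


def kernel_row(max_offset: int, omega: float, sigma: float) -> list:
    """Kernel values for offsets 0..max_offset, e.g. to inspect the decay."""
    return [morlet_kernel(float(d), omega, sigma) for d in range(max_offset + 1)]
```

Passing `sigma=math.inf` makes the envelope equal to 1 at every offset, so the kernel collapses to a plain cosine, which mirrors the claim that the sinusoidal case appears when locality is switched off. A small `sigma` suppresses the kernel at large offsets.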

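2606.01255 2026-06-02 cs.CL (version update)

Agentic Clustering: Controllable Text Taxonomies via Multi-Agent Refinement

Simon Löwe, Emily Silcock

Affiliations: Burning Glass Institute; Harvard University

AI summary: Proposes an adaptive text-clustering method built on multiple agents (proposer, synthesizer, auditor, investigator, critic), in which an orchestrator LLM dynamically adjusts the clustering pipeline. It achieves state-of-the-art performance on seven public benchmarks, with ARI gains of up to 32%.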

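AI-translated abstract (originally in Chinese)

Recent text-clustering methods use large language models to propose a cluster taxonomy from a corpus and then assign each text to it. These pipelines are programmatic in nature: the order of LLM calls and the rules for stopping, merging, and splitting clusters are fixed in code in advance, so they generalize poorly to corpora with different structure and cannot easily incorporate user-supplied constraints such as a target number of clusters or a clustering intent. We propose an agent-based alternative in which an orchestrator LLM inspects the state of the discovery process at each step and dispatches one of a small set of specialized agents (proposer, synthesizer, auditor, investigator, and critic), so that the pipeline adapts to the corpus instead of following a fixed procedure. On seven public text-clustering benchmarks, the method achieves state-of-the-art performance, exceeding the strongest prior LLM baseline by up to 32% in ARI.

English abstract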

Recent text-clustering methods use large language models to propose a cluster taxonomy from a corpus and then assign each text to it. These pipelines are fundamentally programmatic: the sequence of LLM calls and the rules for stopping, merging, and splitting clusters are fixed in code in advance, so they generalise poorly across corpora of different structure and cannot easily incorporate user-supplied constraints such as a target cluster count or a clustering intent. We propose an agentic alternative in which an orchestrator LLM inspects the state of the discovery process at each step and dispatches one of a small set of specialised agents - proposer, synthesizer, auditor, investigator, and critic - adapting the pipeline to the corpus rather than executing a fixed one. On seven public text-clustering benchmarks the method achieves state-of-the-art performance, beating the strongest prior LLM baseline by up to 32% in ARI.
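
The headline result is stated in ARI (adjusted Rand index), which compares a predicted clustering with reference labels while correcting for chance agreement. A minimal pure-Python sketch of the standard pair-counting formula follows. The function name is illustrative and the code is not from the paper.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_true: Sequence[Hashable],
                        labels_pred: Sequence[Hashable]) -> float:
    """ARI = (index - expected) / (max_index - expected), using pair counts."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label sequences must have equal length")

    # Contingency counts: items sharing both a true label and a predicted label.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    index = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    total_pairs = comb(n, 2)

    expected = sum_true * sum_pred / total_pairs if total_pairs else 0.0
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (index - expected) / (max_index - expected)
```

ARI ignores cluster names, so `[0, 0, 1, 1]` against `["a", "a", "b", "b"]` scores 1.0, while a clustering that splits every true pair can score below zero.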

2606.01252 2026-06-02 cs.CL cs.AI (version update)

Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization

Sangwon Ryu, Yihong Liu, Mingyang Wang, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee, Hinrich Schuetze

Affiliations: GSAI, POSTECH; CSE, POSTECH; LMU Munich; MCML; LILT

AI summary: For multi-target cross-lingual summarization, proposes the MEA benchmark and analyzes LLM internals, finding that translation and summarization behaviors emerge jointly in later layers, and introduces an inference-time activation steering method that improves generation quality.

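AI-translated abstract (originally in Chinese)

Multi-target cross-lingual text summarization (MTXLS) summarizes a source document into multiple target languages. Its importance is growing as users consume content in many languages, yet it remains underexplored. To fill this gap, we introduce multi-target cross-lingual element-aware (MEA), a new MTXLS benchmark covering 24 target languages. We evaluate end-to-end and pipeline approaches across a range of LLMs and show that MTXLS performance still lags far behind English monolingual summarization. To better understand MTXLS in LLMs, we propose a layer-wise analysis framework for studying how LLMs carry out MTXLS internally. Our analysis indicates that translation and summarization behaviors emerge jointly in later layers instead of forming distinct decomposed stages. Most task-relevant processing happens within these layers, and errors also tend to arise at similar depths. Motivated by these findings, we introduce an inference-time activation steering method that uses hidden representations from English summarization to guide MTXLS generation. Experiments show that our method consistently improves MTXLS quality across target languages.

English abstract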

Multi-target cross-lingual text summarization (MTXLS), which summarizes a source document into multiple target languages, is increasingly important as users consume content in diverse languages, but remains underexplored. To address this gap, we introduce multi-target cross-lingual element-aware (MEA), a new MTXLS benchmark covering 24 target languages. We benchmark end-to-end and pipeline approaches across various LLMs and show that MTXLS performance still substantially lags behind English monolingual summarization. To better understand MTXLS in LLMs, we propose a layer-wise analysis framework for investigating how LLMs internally perform MTXLS. Our analyses suggest that translation and summarization behaviors emerge jointly within later layers rather than as distinctly decomposed stages. Most task-relevant processing occurs within these layers, and errors also tend to arise at similar depths. Motivated by these findings, we introduce an inference-time activation steering method that leverages hidden representations from English summarization to guide MTXLS generation. Experiments show that our method consistently improves MTXLS quality across target languages.

2606.01243 2026-06-02 cs.CL cs.LG (version update)

Unlocking the Black Box of Latent Reasoning: An Interpretability-Guided Approach to Intervention

Shuochen Chang, Tong Bai, Xiaofeng Zhang, Qianli Ma, Qingyang Liu, Zhaohe Liao, Yibo Miao, Li Niu

Affiliations: Shanghai Jiao Tong University; Fudan University

AI summary: Analyzes the interpretability of latent reasoning vectors with structural, causal, and geometric probes, and on that basis proposes training-free, decode-time interventions that improve LLM reasoning accuracy.

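Journal ref: ACL2026 Main

AI-translated abstract (originally in Chinese)

Latent reasoning lets large language models (LLMs) perform multi-step inference within continuous hidden states, offering efficiency gains over explicit Chain-of-Thought (CoT). However, the opacity of these continuous thought vectors hinders their reliability and controllability. This paper bridges the gap between mechanistic interpretability and actionable control. We first carry out a systematic analysis with structural, causal, and geometric probes, which reveals that latent vectors encode compressed, faithful representations of reasoning steps, with early vectors acting as key causal hubs. Building on this, we turn these interpretability insights into a set of training-free, decode-time interventions that refine the latent reasoning process by imposing the identified geometric and semantic priors. Extensive experiments across multiple model scales and task domains show that our methods consistently improve reasoning accuracy. Our interpretability-guided interventions consistently unlock latent capabilities and improve reasoning accuracy without any parameter updates.

English abstract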

Latent reasoning enables Large Language Models (LLMs) to perform multi-step inference within continuous hidden states, offering efficiency gains over explicit Chain-of-Thought (CoT). However, the opacity of these continuous thought vectors hinders their reliability and controllability. This paper bridges the gap between mechanistic interpretability and actionable control. We first present a systematic analysis using structural, causal, and geometric probes, revealing that latent vectors encode compressed, faithful representations of reasoning steps, with early vectors acting as critical causal hubs. Building on this, we operationalize these interpretability insights into a suite of training-free, decode-time interventions that refine the latent reasoning process by imposing the identified geometric and semantic priors. Extensive experiments across multiple model scales and diverse task domains demonstrate that our approaches consistently improve reasoning accuracy. Our interpretability-guided interventions consistently unlock latent capabilities and improve reasoning accuracy without any parameter updates.

2606.01240 2026-06-02 cs.CL (version update)

Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking

Fachrina Dewi Puspitasari, Chaoning Zhang, Jiaquan Zhang, Zhicheng Wang, Hafiz Shakeel Ahmad Awan, Rizwan Qureshi, Jewon Lee, Tae-Ho Kim, Yang Yang

Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China; Massachusetts General Hospital, Harvard University; Nota AI

AI summary: Proposes the InSemRAG framework, which combines an intent-aware retriever and a semantics-preserving chunking module with an iterative retrieve-and-check mechanism to address information insufficiency in conventional RAG, with significant gains on multi-hop and evidence-sensitive tasks.

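AI-translated abstract (originally in Chinese)

Demand for strong instruction following and reasoning in large language models (LLMs) has driven the rapid development of retrieval-augmented generation (RAG). A RAG system assists LLM generation by retrieving supplementary knowledge chunks that match the query from an external database. Conventional RAG systems, however, suffer from information insufficiency caused by two factors: intent-agnostic retrieval and information fragmentation. Our work proposes a RAG framework named InSemRAG that addresses these challenges through an iterative retrieve-and-check mechanism and two supporting modules, an intent-aware retriever (IAR) and semantics-preserving chunking (SPC). IAR implements a dynamic hybrid retrieval method that adaptively weights the retrieval channels according to query intent, while SPC detects and repairs damaged evidence chunks to preserve semantic integrity. To reduce the computational latency introduced by the iterative mechanism, we use small language models (SLMs). Extensive experiments on several benchmark datasets consistently show that our method is competitive with recent state-of-the-art RAG mechanisms. On multi-hop and evidence-sensitive tasks in particular, our method achieves significant gains, improving F1 on HotPotQA by 2.65 points and accuracy on FEVER by 1.5 points. Using SLMs, our method also reaches performance comparable to Multi-Hop RAG with 4.32 times lower latency.

English abstract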

The demand for powerful instruction following and reasoning capability of large language models (LLMs) has promoted rapid development of retrieval-augmented generation (RAG). The RAG system assists LLM generation by retrieving chunks of query-fit supplementary knowledge from an external database. Conventional RAG systems, however, suffer from information insufficiency due to two factors: intent-agnostic retrieval and information fragmentation. Our work proposes a RAG framework, termed InSemRAG, that addresses these challenges via an iterative retrieve-and-check mechanism with two supporting modules, an intention-aware retriever (IAR) and semantics-preserving chunking (SPC). IAR implements a dynamic hybrid retrieval method that adaptively weights the retrieval channels based on the query intent, while SPC detects and repairs damaged evidence chunks to preserve semantic integrity. To alleviate the computational latency brought by our iterative mechanism, we leverage small language models (SLMs). Extensive experiments across several benchmark datasets consistently demonstrate the competitiveness of our method against recent state-of-the-art RAG mechanisms. In particular, our method achieves significant gains on multi-hop and evidence-sensitive tasks, with a 2.65-point improvement in F1 on HotPotQA and a 1.5-point increase in accuracy on FEVER. Using SLMs, our method also achieves performance competitive with Multi-Hop RAG at 4.32× lower latency.
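
The abstract describes IAR as a hybrid retriever that weights its channels by query intent, without specifying the channels or the weighting rule. The sketch below illustrates the general idea with two assumed channels (dense and sparse scores), min-max normalization, and a hypothetical lookup table of intent weights. Every name and number here is an assumption made for illustration, not the InSemRAG implementation.

```python
from typing import Dict, List

# Hypothetical weight placed on the dense channel for each query intent.
INTENT_DENSE_WEIGHT: Dict[str, float] = {"factoid": 0.3, "multi_hop": 0.7}


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two channels are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(dense: Dict[str, float], sparse: Dict[str, float],
                intent: str, top_k: int = 3) -> List[str]:
    """Fuse both channels with an intent-dependent weight and return top documents."""
    w = INTENT_DENSE_WEIGHT.get(intent, 0.5)  # fall back to equal weighting
    d, s = min_max(dense), min_max(sparse)
    fused = {doc: w * d.get(doc, 0.0) + (1.0 - w) * s.get(doc, 0.0)
             for doc in set(d) | set(s)}
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))[:top_k]
```

With the same candidate scores, a `"factoid"` intent favours the document that the sparse channel ranks highest, and a `"multi_hop"` intent favours the dense channel's top document, which is the kind of adaptive behaviour the abstract attributes to IAR.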

2605.04638 2026-06-02 cs.CL cs.AI (version update)

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

Mingda Li, Rundong Lv, Xinyu Li, Weinan Zhang, Ting Liu

Affiliations: University of Science and Technology of China

AI summary: Proposes SemGrad, the first gradient-based uncertainty quantification method for free-form generation, which computes gradients in semantic space to provide efficient, sampling-free uncertainty estimates.

Comments Accepted by ICML 2026

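AI-translated abstract (originally in Chinese)

Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of large language models (LLMs), because LLMs are prone to hallucination. Existing UQ methods for free-form generation rely heavily on sampling, which leads to high computational cost and high variance. In this work, we propose SemGrad, the first gradient-based UQ method for free-form generation, which requires no sampling and is computationally efficient. Unlike earlier gradient methods, which were developed for classification tasks and operate in parameter space, we propose to consider gradients in semantic space. Our method rests on a key intuition: a confident LLM should keep a stable output distribution under semantically equivalent input perturbations. We interpret this stability as a gradient in semantic space and introduce the Semantic Preservation Score (SPS) to identify the embeddings that best capture semantics, with respect to which the gradients are computed. We further propose HybridGrad, which combines the strengths of SemGrad and parameter gradients. Experiments show that both methods provide efficient and effective uncertainty estimates and outperform existing methods, especially in settings with multiple valid responses.

English abstract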

Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Existing state-of-the-art UQ approaches for free-form generation rely heavily on sampling, which incurs high computational cost and variance. In this work, we propose the first gradient-based UQ method for free-form generation, SemGrad, which is sampling-free and computationally efficient. Unlike prior gradient-based methods developed for classification tasks that operate in parameter space, we propose to consider gradients in semantic space. Our method builds on the key intuition that a confident LLM should maintain stable output distributions under semantically equivalent input perturbations. We interpret this stability as gradients in semantic space and introduce a Semantic Preservation Score (SPS) to identify embeddings that best capture semantics, with respect to which gradients are computed. We further propose HybridGrad, which combines the strengths of SemGrad and parameter gradients. Experiments demonstrate that both of our methods provide efficient and effective uncertainty estimates, outperforming state-of-the-art methods, particularly in settings with multiple valid responses.

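2605.02277 2026-06-02 cs.CL (version update)

TECCI: Tricky Edits of Collected and Curated Images

Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben, Jason Baldridge

Affiliations: Google Research; Google DeepMind

AI summary: Proposes the TECCI benchmark of 7550 image and edit-instruction pairs. Human and automatic evaluation reveal the shortcomings of existing image editing models in instruction following, minimal editing, and visual quality.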

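AI-translated abstract (originally in Chinese)

Despite great recent progress, current text-guided image editing methods still struggle with several aspects of editing, including instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially visible when the requested edit is challenging, for example edits involving position, motion, viewpoint, scale, and creative changes. To test generative image editors systematically, we propose a new image editing benchmark, TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of an entirely new set of images that we are releasing. The images in TECCI span 7 image categories. The images and categories were curated deliberately to target the weaknesses of existing methods. The edit instructions in TECCI are generated automatically by Gemini and cover 5 edit types per source image. We also curated a set of 530 images for which we created challenging, manually written edit instructions. In total, TECCI contains 7550 pairs of images and edit instructions. We ran a human evaluation of five leading image editing models on TECCI. Humans judged outputs along three dimensions: 1) instruction following, 2) minimality of the edit, and 3) visual quality. To scale up evaluation, we also built an auto-rater using Gemini that reaches 74.7% accuracy in matching human evaluations. Our evaluation reveals that: 1) no model exceeds a 22% overall success rate, which shows how challenging TECCI is; 2) Nano Banana Pro is the best-performing model overall; 3) models perform significantly better on instruction following than on minimal editing and visual quality; 4) models struggle with editing architecture and nature images, which require a strong understanding of spatial layout and intricate visual detail; 5) reasoning and creative edits are the hardest, whereas color and appearance edits are the easiest.

English abstract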

Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as those that involve position, motion, viewpoint, scale and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark -- TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images we are releasing. The images in TECCI span 7 image categories. The images and these categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale-up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceed a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following compared to minimal edits and visual quality, 4) models struggle with editing architecture and nature images which require strong understanding of spatial layout and intricate visual details. 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.

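2606.01204 2026-06-02 cs.CL cs.AI cs.CY (version update)

Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations

Qi Han Wong

Affiliations: GitHub

AI summary: Studies how a large language model gives different medical triage recommendations for identical symptoms solely because the language of the patient prompt differs, finding that the model implicitly infers geographic location from the input language, which leads to significantly different emergency-room recommendation rates.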

Comments 7 pages, 4 tables. Code and data at https://github.com/wongqihan/ai-behavioral-experiments

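AI-translated abstract (originally in Chinese)

We study whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) in six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic), with 30 runs per condition (450 API calls in total). We find that although the model assigns nearly identical severity scores across all languages (7.7-8.0/10), the rate at which it recommends an emergency room visit ranges from 0% (Japanese, Hindi) to 30% (English, Arabic). Adding one sentence stating that the patient is located in the United States raises the emergency recommendation rate for non-English prompts by up to 76.7 percentage points, while the reverse anchor (an English prompt with a Tokyo location) lowers the emergency rate from 30% to 6.7%. A back-translation control (Japanese to English) yields emergency rates comparable to the English baseline, confirming that the disparity is caused not by translation quality but by implicit geographic inference from the input language. We release the complete dataset, experiment code, and results.

English abstract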

We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (n=450 total API calls). We find that the model recommends emergency room visits at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic), despite assigning nearly identical severity scores (7.7-8.0/10) across all languages. Adding a single sentence specifying the patient's US location increases ER recommendations by up to 76.7 percentage points for non-English prompts, while the reverse anchor (English prompt with a Tokyo location) reduces the ER rate from 30% to 6.7%. A back-translation control (Japanese to English) produces ER rates comparable to the English baseline, confirming that the disparity is not caused by translation quality but by implicit geographic inference from the input language. We release the complete dataset, experiment code, and results.
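
Because each condition has only 30 runs, every reported percentage corresponds to a small whole number of responses. The worked arithmetic below converts between the two. The counts are inferred from the percentages in the abstract and are not figures reported by the author, and the helper names are illustrative.

```python
RUNS_PER_CONDITION = 30
TOTAL_CALLS = 450


def rate_to_count(rate_pct: float, runs: int = RUNS_PER_CONDITION) -> int:
    """Nearest whole number of runs matching a reported percentage."""
    return round(rate_pct / 100 * runs)


def count_to_rate(count: int, runs: int = RUNS_PER_CONDITION) -> float:
    """Percentage, rounded to one decimal place as in the abstract."""
    return round(100 * count / runs, 1)


# 450 calls at 30 runs each implies 15 experimental conditions.
conditions = TOTAL_CALLS // RUNS_PER_CONDITION

# 30% of 30 runs is 9 emergency recommendations; 6.7% is 2 of 30.
baseline_er = rate_to_count(30.0)
tokyo_anchor_er = rate_to_count(6.7)

# A 76.7-point increase is consistent with a change of 23 runs out of 30.
us_anchor_shift = rate_to_count(76.7)
```

Read this way, the drop from 30% to 6.7% is a change from 9 to 2 responses out of 30, which is worth keeping in mind when judging how precise the reported rates are.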

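2606.01202 2026-06-02 cs.AI cs.CL cs.LG (version update)

The Shape of Wisdom: Decision Trajectories in Language Models

Shailesh Rana

Affiliations: Independent Researcher

AI summary: Analyzes 9,000 trajectories from three language models on MMLU, describes each trajectory by its answer margin, margin change, and distance from a decision flip, finds that correctness and stability are distinct, and examines how attention and MLP scalars affect the margin.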

Comments 6 pages, 5 figures. Code and derived artifacts: https://github.com/gut-puncture/The-Shape-of-Wisdom

English abstract

Language models do not simply choose an answer at the output layer. In a 9,000-trajectory MMLU study across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next-layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable-correct, not stable-correct. A traced subset then asks what moves the margin. In stable-correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer-supporting text hurts the margin and removing distractor-like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.
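
The three per-trajectory quantities named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formal definitions, so everything here is an assumption made for illustration: the margin is taken as the correct option's score minus the best competing score, and the distance from a decision flip is taken as the number of layers until the sign of the margin next changes. The function name `trajectory_features` and the toy scores are hypothetical.

```python
def trajectory_features(scores, correct):
    """scores: one list of option scores per layer; correct: index of the right option.

    Assumed definitions (not taken from the paper):
      margin   = score of the correct option minus the best competing score
      delta    = margin at the next layer minus margin at this layer
      flip_gap = layers until the sign of the margin next changes (None if never)
    """
    margins = []
    for layer in scores:
        best_other = max(s for i, s in enumerate(layer) if i != correct)
        margins.append(layer[correct] - best_other)

    deltas = [nxt - cur for cur, nxt in zip(margins, margins[1:])]

    flip_gaps = []
    for i, m in enumerate(margins):
        gap = None
        for j in range(i + 1, len(margins)):
            if (margins[j] > 0) != (m > 0):
                gap = j - i
                break
        flip_gaps.append(gap)
    return margins, deltas, flip_gaps


# Toy trajectory over three layers and four answer options (hypothetical values).
toy_scores = [[1, 5, 2, 2], [4, 3, 2, 1], [6, 2, 1, 1]]
margins, deltas, flip_gaps = trajectory_features(toy_scores, correct=0)
print(margins, deltas, flip_gaps)
```

Under these assumptions, a trajectory that ends correct but has a small margin or a nearby flip would count as fragile, which is the kind of distinction between correctness and stability the abstract draws.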

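2606.01196 2026-06-02 cs.CL cs.AI (updated version)

Not All Explanations Are Equally Simulatable: Comparing Verbalized Feature Attributions and Self-Generated Rationales

Pingjun Hong, Benjamin Roth

Affiliations: Faculty of Computer Science, University of Vienna; UniVie Doctoral School Computer Science, University of Vienna; Faculty of Philological and Cultural Studies, University of Vienna

AI summary: Using a counterfactual simulation setup, this study compares how two explanation sources, verbalized feature attributions and self-generated rationales, affect the simulatability of a question-answering model's behavior, and finds that explanation format and granularity significantly affect simulation performance.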

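2606.01136 2026-06-02 cs.CL (updated version)

From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication

Máté Metzger, Nadnapang Phophichit, Hansa Dhammahaso

Affiliations: Independent Researcher; Nibbana Meditation Centre

AI summary: To address the problem of legitimate variation being flagged as error when large language models translate classical languages, the paper proposes an audit method that screens candidates using a reference envelope built from multiple human translations and an embedding-drift threshold, then has an LLM judge panel adjudicate them, and applies it to Pali-to-English translation.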

Comments Preprint. This manuscript has not yet been peer reviewed

English abstract

Single-score translation metrics can conflate legitimate variation with error, a problem especially acute for classical languages where multiple defensible English renderings of the same passage coexist. We audit Pali-to-English output from four flagship large language models (LLMs): GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, and Grok 4.3, on 1,700 passages from the Pali Canon, using three established human translations by Bhikkhu Sujato, Thanissaro Bhikkhu, and Bhikkhu Bodhi as a local reference envelope rather than a single gold standard. Each candidate's normalized embedding drift from the reference centroid serves as a triage signal, not an error label; the 1,203 candidates above a 1.5 drift threshold are then adjudicated by a blinded three-model LLM judge panel, calibrated against a 300-instance author-adjudicated validation set. Two results stand out. First, drift predicts severity rather than error per se: the major-error rate among adjudicated high-drift candidates rose monotonically from 7.9% in the 1.5-2.0 band to 51.6% above 3.0, while approximately 80% of 1.5-2.0 outliers were judged valid translation variations. Second, model differences were clearest in the high-drift tail: GPT-5.5 had the lowest adjudicated high-drift major-error rate, with confidence intervals overlapping those of Claude Sonnet 4.6 and Gemini 3.1 Pro; Grok 4.3 had both the largest outlier volume and the highest tail major-error rate (27.6% overall, 74.4% above drift 3.0). The dominant major-error categories (e.g. omission or truncation, doctrinal term errors) are precisely the failures most likely to mislead readers of doctrinal text. The contribution is a reusable audit design for classical-to-modern translation: define a local reference envelope from multiple human translators, use embedding drift to prioritize review, and adjudicate the flagged tail rather than treating outlier status as error.
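
The triage step described above (drift from a reference centroid, then a 1.5 threshold) can be sketched in a few lines of Python. The abstract does not say how drift is normalized or which distance is used, so this sketch assumes Euclidean distance to the centroid of the three reference embeddings, divided by the mean reference-to-centroid distance. The function names and the two-dimensional toy vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def normalized_drift(candidate, references):
    """Distance of the candidate from the reference centroid, in units of the
    mean reference-to-centroid distance (an assumed normalization)."""
    c = centroid(references)
    spread = sum(math.dist(r, c) for r in references) / len(references)
    return math.dist(candidate, c) / spread


DRIFT_THRESHOLD = 1.5  # threshold value quoted in the abstract

# Three reference embeddings (toy 2-D stand-ins for real sentence embeddings).
refs = [[3.0, 0.0], [-3.0, 0.0], [0.0, 0.0]]
close = normalized_drift([1.0, 0.0], refs)
far = normalized_drift([0.0, 4.0], refs)

# Only candidates above the threshold go on to adjudication.
flagged = [d for d in (close, far) if d > DRIFT_THRESHOLD]
print(close, far, flagged)
```

The point of the design is visible even in the toy case: crossing the threshold only queues a candidate for review, it does not label it as an error.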

2606.01109 2026-06-02 cs.DL cs.CL (updated version)

Digging Up Citations: FOSSIL, a Dataset and Workflow for Reference Extraction in Law and the Humanities

Luca Foppiano, Christian Boulanger

Affiliations: University of Amsterdam

AI summary: To address the challenges of extracting references from footnotes in law and the humanities, the paper presents the FOSSIL dataset, the PDF-TEI Editor, an annotation workflow, and a Grobid specialization, raising extraction quality from 0.36 to 0.72 micro-F1.

Comments This is an extended abstract, peer-reviewed and presented at CiteX2026 https://sites.google.com/view/workshop-on-citation-extractio/startseite

English abstract

Citation extraction tools are designed for the structured end-of-document bibliographies of the natural sciences, but law and humanities scholarship cites references primarily in footnotes, where bibliographic data is interleaved with commentary and cross-references and varies widely across languages and styles. To address the scarcity of suitable gold-standard resources, we present FOSSIL (Footnote-based Open-access SSH Scientific Instance Labels), an openly licensed multilingual dataset of 96 annotated scholarly articles containing over 7,600 footnote-embedded references, together with PDF-TEI Editor (a collaborative web annotation tool), a documented seven-annotator workflow, and a Grobid specialization for footnote-based citations. In end-to-end evaluation, the specialized pipeline nearly doubles extraction quality over default Grobid (micro-F1 from 0.36 to 0.72), driven largely by improved recall, while showing that substantial headroom remains for cross-references and mixed-content footnotes. This extended abstract presents work in progress; annotation of citation segmentation and parsing, and of cross-reference resolution, is ongoing.
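
For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives over all documents before computing a single F1 score. The sketch below uses the standard definition; the per-document counts are invented for illustration and are not FOSSIL results, and the function name `micro_f1` is hypothetical.

```python
def micro_f1(per_document_counts):
    """per_document_counts: iterable of (true_pos, false_pos, false_neg) tuples.

    Counts are summed over all documents first, so documents with many
    references weigh more than documents with few.
    """
    counts = list(per_document_counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0
    return 2 * tp / denominator


# Two hypothetical articles: one extracted well, one with most references missed.
example = [(8, 2, 2), (1, 0, 9)]
score = micro_f1(example)
print(round(score, 4))
```

Because missed references enter only through the false-negative count, a pipeline that mainly recovers more references raises micro-F1 through recall, which matches the pattern the abstract reports.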

2606.01099 2026-06-02 cs.CL cs.AI (updated version)

MiCU: End-to-End Smart Home Command Understanding with Large Language Model

Haowei Han, Kexin Hu, Weiwei Cai, Debiao Zhang, Bin Qin, Yuxiang Wang, Jiawei Jiang, Xiao Yan, Bo Du

Affiliations: School of Computer Science, Wuhan University; Xiaomi Corporation; Institute for Math & AI, Wuhan University

AI summary: Proposes MiCU, a domain-specific large language model that uses curriculum learning, reinforcement learning, and token compression to handle ambiguous commands in smart homes, with an average accuracy gain of 20.01%.

DOI: 10.1145/3770855.3818446

English abstract

Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., "turn on the bedroom light"), they struggle with ambiguous or misaligned commands (e.g., "make the bedroom cozy"). Large language models (LLMs) generalize well across various domains and can outperform traditional rule-based systems on such tasks, but their effectiveness is often constrained by scarce domain-specific data, insufficient task-specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; we then build MiCU, a domain-specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then enhance its reasoning ability via cold-start training combined with reinforcement learning (RL) guided by domain-specific thinking rules. Additionally, we introduce a token compression technique that condenses a device description into a single special token, substantially reducing inference overhead and enabling MiCU-fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human audited accuracy by 32.05%. Our data and code are available at https://github.com/xiaomi-research/iot_spec_llm

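2606.01091 2026-06-02 cs.CL (updated version)

Deep Research as Rubric for Reinforcement Learning

Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai, Lefan Zhang, Zhenxin Ding, Bo Chen, Yan Gao, Yi Wu, Yao Hu, Jiaqing Liang, Deqing Yang

Affiliations: Fudan University; Xiaohongshu Inc.; Beijing University of Posts and Telecommunications

AI summary: Proposes the DR-rubric framework, a two-stage process (iterative multi-turn agentic search to build rubrics, then distillation into verifiable constraints for GRPO policy optimization) that produces fine-grained reward signals for open-ended reasoning and long-form generation tasks; experiments show strong performance on multiple benchmarks.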

English abstract

Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts (either hand-crafted or prompt-generated) and often miss the task-specific, knowledge-intensive dimensions that matter most, distorting the reward signal. Our key observation is that rubric construction is itself a research problem: identifying what makes a response correct or insightful requires discovering and synthesizing external knowledge. We propose Deep Research as Rubric (DR-rubric), a two-stage framework for constructing such rubrics. Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search; Stage II distills this evidence into atomic, independently verifiable constraints for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, DR-rubric-8B supports bootstrap rubric generation without frontier-model assistance. We evaluate on 6 benchmarks spanning agentic research and expert reasoning. Experiments show that DR-rubric achieves strong competitive performance with only 1K-3K training instances, where GPT-5-generated rubrics particularly benefit breadth coverage on agentic tasks, Gemini-generated rubrics yield the most balanced performance across agentic and expert reasoning tasks, and bootstrap rubrics exhibit a specialization-to-rebalancing evolution, achieving the best overall performance at the third iteration. Results demonstrate that reframing rubric construction from static evaluation templates into an evidence-driven research process yields more scalable, fine-grained reward signals for open-ended tasks.

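2606.01074 2026-06-02 cs.CL (updated version)

When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression

Riku Kisako, Hayato Tsukagoshi, Ryohei Sasano

Affiliations: Graduate School of Informatics, Nagoya University

AI summary: Systematically studies compressing text embeddings by combining dimensionality reduction and quantization, finding that joint compression can shrink embeddings substantially (down to 0.1% of the original size) with almost no performance loss, and that the best strategy varies by task.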

2606.01049 2026-06-02 cs.CL (updated version)

PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining

Guanghao Zhu, Zeyu Liu, Zhitian Hou, Pengkai Wang, Zhijie Sang, Minheng Ni, Wenjun Wang, Yanggan Gu, Shuo Cai, Congkai Xie, Jianmin Wu, Hongxia Yang

Affiliations: The Hong Kong Polytechnic University; Sun Yat-sen University; InfiX.ai; PolyU-Daya Bay Technology and Innovation Research Institute

AI summary: To address missing context, structural noise, and related problems in image-text pair data for medical multimodal continued pretraining, the paper proposes the PMC-InterCPT interleaved corpus, built by recovering missing captions, cleaning text, reconstructing interleaved samples, and applying LLM-supervised filtering; combined with modality-aware resampling, it improves medical and general multimodal performance.

English abstract

Large-scale biomedical image-text datasets extracted from scientific literature provide valuable resources for medical multimodal model training. These datasets are commonly organized as image-caption pairs; however, figure captions are often short, context-dependent, and only partially informative without the surrounding article text. At the same time, large-scale automatic extraction introduces structural noise such as missing captions, residual markup, duplicated context, and incoherent multi-paragraph figure descriptions. We revisit data construction for medical multimodal continued pretraining (CPT) and present PMC-InterCPT, a context-grounded biomedical interleaved corpus that incorporates figure-referencing body text in addition to captions. Our pipeline recovers missing captions, cleans caption and context text, reconstructs coherent interleaved image-text samples, and applies LLM-supervised medical relevance and quality classifiers to filter noisy records. We further reveal strong modality imbalance in the resulting corpus and introduce a four-bucket evidence taxonomy for modality-aware resampling. Through CPT followed by supervised fine-tuning (SFT) on Qwen3.5-4B-Base, PMC-InterCPT effectively improves medical and general multimodal performance while using fewer CPT tokens than the raw source pool. The experimental results also illustrate the complementarity between the data quality and modality for medical multimodal CPT.

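2606.01045 2026-06-02 cs.CL (updated version)

Child-directed speech facilitates production, not comprehension, in BabyLMs

Bastian Bunzeck, Sina Zarrieß

Affiliations: Computational Linguistics, Department of Linguistics

AI summary: Evaluates how child-directed speech (CDS) affects the production abilities of BabyLMs using a frame-completion task, finding that CDS-trained models outperform models trained on web data at producing grammatical completions, and that comprehension benchmarks underestimate the contribution of CDS.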

Comments Accepted at CoNLL 2026

English abstract

Recent studies suggest that child-directed speech (CDS) is not conducive to language learning in BabyLMs. However, current evaluations focus predominantly on comprehension rather than production, which is central to usage-based theories of language acquisition; these theories argue that CDS facilitates early language use through constructional "frames" (frequent lexical patterns with open slots). We introduce a novel generation-based evaluation inspired by such theories in the form of a frame-completion task, and compare Llama models trained with CDS, the BabyLM corpus, and web-crawl data (FineWeb-edu) on comprehension benchmarks and our novel framework. Our results reveal a clear dissociation between models' comprehension and production capabilities: while FineWeb-trained models excel at minimal pairs, CDS-trained models produce grammatical completions substantially earlier in training and concentrate probability mass on appropriate slot-fillers. These findings show that comprehension benchmarks underestimate what CDS affords to BabyLMs.

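2606.01041 2026-06-02 cs.CL (updated version)

Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding

Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah, Sebastian Rogawski, Pawel Morkisz, Anahita Bhiwandiwalla, Phillip Howard

Affiliations: Thoughtworks; Nvidia

AI summary: Proposes Hybrid Verified Decoding, which predicts the accepted length of a cache draft and dynamically chooses between cache verification and a model-based drafter, achieving a 2.73x average speedup on agentic workflows.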

English abstract

Large Language Model (LLM) generation remains expensive because autoregressive decoding calls the model once for each new token. Speculative decoding reduces this cost by drafting multiple tokens and verifying them with the target model in one step, but its speedup depends on how many drafted tokens are accepted. Parameter-free draft sources can propose long continuations at low cost in structured and agentic workloads, yet a cache match that looks promising at one generation step may have low payoff at the next. We propose Hybrid Verified Decoding, which predicts the accepted length of a cache draft before verification and uses this payoff estimate to choose between cache verification and a model-based drafter. Across three LLMs and sixteen datasets, Hybrid Verified Decoding is especially effective on agentic workflows, where it outperforms EAGLE3 in every setting with a 2.73x average speedup. Our analysis shows how prompt structure creates cache opportunities, how high-payoff cache drafts concentrate in a small part of the draft space, and how payoff-guided selection reduces sequential decoding work, pointing to runtime draft selection as a promising direction for speculative decoding.
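
The quantity the method predicts, the accepted length of a draft, can be made concrete with a small Python sketch of greedy verification: the target model scores the drafted tokens in one pass, and the draft is accepted up to the first position where it disagrees with the target's own choice. The selection rule `choose_draft_source` is only a placeholder assumption; the abstract does not describe the actual predictor or decision rule, and both function names are hypothetical.

```python
def accepted_length(draft_tokens, target_tokens):
    """Length of the longest draft prefix that matches the target model's own
    greedy choice at each position (greedy verification)."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted += 1
    return accepted


def choose_draft_source(predicted_cache_accept, expected_model_accept):
    """Placeholder rule (an assumption): verify the cache draft only when its
    predicted accepted length is at least what the model drafter is expected
    to deliver."""
    if predicted_cache_accept >= expected_model_accept:
        return "cache"
    return "model"


# A cached continuation that diverges from the target at the third token.
n_accepted = accepted_length([5, 7, 9, 2], [5, 7, 1, 2])
print(n_accepted, choose_draft_source(6.0, 3.2), choose_draft_source(1.0, 3.2))
```

The sketch shows why a payoff estimate matters: a long cache draft that is rejected after two tokens costs a verification pass while saving little sequential decoding work.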

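2606.01016 2026-06-02 cs.CL cs.AI eess.AS (updated version)

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu, Lu Fan, Zhi Li, You He

Affiliations: Shenzhen International Graduate School, Tsinghua University; Department of Electronic Engineering, Tsinghua University; JD AI Research

AI summary: To address the bias toward high-resource languages, the lack of semantic reasoning, and the neglect of dialects in existing speech evaluation benchmarks, the paper proposes the PolySpeech-100 benchmark, which covers 110 linguistic variants through a hybrid construction pipeline. An evaluation of 22 models finds that open-source end-to-end models outperform cascade systems on heavy dialects, while chain-of-thought prompting degrades performance in zero-shot settings.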

Comments 19 pages, 13 figures, KDD 2026

English abstract

While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess "native-level" speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.

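2606.01000 2026-06-02 cs.LG cs.CL (updated version)

Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher

Arda Uzunoglu, Alvin Zhang, Daniel Khashabi

Affiliations: University of Washington

AI summary: Proposes trust functions that assign trust scores to weak labels and filter weak supervision accordingly, achieving near-lossless weak-to-strong generalization across multiple domains, with gains that can be amplified through iterative chaining.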

Comments ICML 2026

2606.00997 2026-06-02 cs.CL (updated version)

Decoding in Order-Agnostic Language Models: Chain-Rule Deviation and Uniform Spreading

Lin Yao

Affiliations: School of Computer Science, Shanghai Jiao Tong University; Zhongguancun Academy

AI summary: Studies how reveal order affects likelihood in order-agnostic language models (OALMs), proposes a diagnostic based on confidence variance, and proves a uniform-spreading theorem to optimize decoding paths.

English abstract

Order-agnostic language models (OALMs), including discrete diffusion language models (dLLMs), are trained to predict masked tokens under arbitrary conditioning sets, allowing sequences to be generated or scored under arbitrary reveal orders at inference time. In LLaDA-2.1, we report three findings. First, the learned conditionals are not exact factorizations of a coherent joint distribution: changing only the reveal order shifts target log-likelihood by up to 0.49 nats/token, so likelihood alone mixes content difficulty with path-dependent artifacts. Second, although confidence-first (CF) decoding is order-agnostic, its reveal orders are close to left-to-right (L2R) on content tokens. Third, we propose a complementary diagnostic based on the shape of the confidence trace. A uniform-spreading theorem shows that, at fixed total likelihood, target recoverability is maximized when per-step confidence is spread uniformly; the resulting deviation motivates $\mathrm{Var}(\log q_t)$ as a diagnostic for comparing decoding paths. Across C4 and four downstream benchmarks, low variance separates structured paths from random ordering, and variance is consistently associated with downstream correctness. These results support reporting mean confidence and confidence variance jointly when comparing OALM decoding paths.
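
A small Python sketch can show what the $\mathrm{Var}(\log q_t)$ diagnostic separates. It assumes $q_t$ is the confidence assigned to the token revealed at step $t$ and uses the population variance; the abstract specifies neither detail, and the two toy confidence traces are invented so that they share the same total log-likelihood.

```python
import math
from statistics import fmean, pvariance


def confidence_stats(step_confidences):
    """Return (total log-likelihood, mean log-confidence, variance of
    log-confidence) for one decoding path."""
    logs = [math.log(q) for q in step_confidences]
    return sum(logs), fmean(logs), pvariance(logs)


# Same product of confidences (0.0625), different spread across steps.
uniform_path = [0.5, 0.5, 0.5, 0.5]
spiky_path = [1.0, 0.25, 1.0, 0.25]

uniform_total, uniform_mean, uniform_var = confidence_stats(uniform_path)
spiky_total, spiky_mean, spiky_var = confidence_stats(spiky_path)
print(uniform_total, spiky_total)
print(uniform_var, spiky_var)
```

Total likelihood and mean confidence cannot tell these two paths apart, while the variance can, which is the motivation for reporting mean confidence and confidence variance together.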

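2606.00994 2026-06-02 cs.CL (updated version)

A Registry-Bound LLM Pipeline for Evidence-Grounded Trait Extraction across Tropical Plants, Aquatic Species, and Exotic Pets

Jeff Wang

Affiliations: NEXLY LLC, United States

AI summary: Proposes a registry-bound large language model pipeline that uses four mechanisms (a controlled-vocabulary registry, per-row evidence quotes, confidence labels, and multi-version preservation) to extract evidence-grounded structured trait records at scale from the Tropical Species Encyclopedia, covering 409,820 of 409,880 species (99.985%), with 81.57% of records at high confidence.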

Comments 33 pages, 6 figures; methodology paper

English abstract

We describe a registry-bound large-language-model extraction pipeline producing evidence-grounded structured trait records at scale, on cultivated tropical plant, aquatic, and pet species. Four mechanisms render LLM-derived rows auditable: a versioned 39-key closed-vocabulary trait registry constraining every admitted value to a typed schema; a per-row verbatim evidence quote tying each value to source text; a per-row confidence label (high or medium; low dropped pre-persist); and multi-version preservation. Applied to 409,880 publishable species from the Tropical Species Encyclopedia, the pipeline executed 706,220 runs and persisted 5,489,881 trait records across 409,820 species (99.985%), 81.57% at high confidence. We report three validation layers in descending evidentiary strength: at full population, 90.12% of 5,427,588 evidence-bearing rows have their quote as a verbatim source substring (93.49% excluding one compliance meta-trait); a quote-supports-value audit on n=100 stratified non-red-zone rows yielded 100/100 (lower bound 96.30%); face-validity on n=50 red-zone rows yielded 50/50 Accept (lower bound 92.86%). Per-record correctness is not claimed; 100% pending human curation. The contribution is the four-mechanism framework.
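
The abstract does not say how the two lower bounds were computed. As an observation rather than a confirmed detail of the paper, the lower limit of a 95% Wilson score interval with z = 1.96 reproduces both figures, as the following Python sketch shows. The function name `wilson_lower` is hypothetical.

```python
import math


def wilson_lower(successes, n, z=1.96):
    """Lower limit of the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval; that choice is an
    assumption, since the abstract does not name the method.
    """
    p = successes / n
    denominator = 1 + z * z / n
    center = p + z * z / (2 * n)
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half_width) / denominator


audit_100 = round(100 * wilson_lower(100, 100), 2)
audit_50 = round(100 * wilson_lower(50, 50), 2)
print(audit_100, audit_50)
```

With every sampled row passing, the bound reduces to n / (n + z^2), so a perfect audit of 100 rows supports a claim of roughly 96%, not 100%. This is consistent with the abstract's statement that per-record correctness is not claimed.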

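2606.00981 2026-06-02 cs.CL (updated version)

Robust Asynchronous Planning via Auto-Formalization

Jiayi Zhang, Jianing Yin, Ben Zhou, Li Zhang

Affiliations: Drexel University; Arizona State University

AI summary: To address the challenges of concurrency, non-uniform durations, and execution-time constraints in asynchronous planning, the paper proposes an auto-formalization approach in which a CP-SAT formalizer maintains high accuracy as dependency graphs grow from 5 to 100 actions, and introduces a state-aware repair strategy to handle constraint updates at execution time.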

English abstract

LLMs can plan either by generating action sequences directly as a Planner or by translating tasks into a domain-specific language for an external solver as a Formalizer. While most real-world tasks are asynchronous with non-uniform durations, concurrency, and execution-time constraints, existing benchmarks hardly cover them. We unify these asynchronous planning challenges under a single formulation and introduce the first three benchmarks that address each at scale. We conclude that the choice of formal representation primarily determines whether planning scales: as dependency graphs grow from 5 to 100 actions, Planner collapses from 96% to 5% plan accuracy and PDDL2.1 Formalizer from 13% to 0%, while CP-SAT Formalizer averages 94% and still achieves 83% at 100 actions. Faithfulness diagnostics show that, when LLMs must keep predicates, effects, and goals consistent, PDDL2.1's predicate-based planning representation becomes brittle compared to general constraint satisfaction programs. Execution-time updates of planning constraints further degrade performance sharply (Planner 23.9%, PDDL2.1 0.7%, CP-SAT 46.1%), but a state-aware repair strategy that updates only event-induced constraints recovers CP-SAT Formalizer to 84.5%.
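
To make the problem setting concrete, the Python sketch below schedules a tiny asynchronous task: actions with different durations run concurrently whenever their dependencies allow. It is not the paper's CP-SAT formalizer; it is a plain earliest-start computation over a dependency graph, assuming unlimited parallelism and no resource constraints, and the task names and durations are hypothetical.

```python
from graphlib import TopologicalSorter


def earliest_starts(durations, deps):
    """durations: action -> duration; deps: action -> actions it must wait for.

    Each action starts as soon as all of its prerequisites have finished.
    """
    graph = {action: set(deps.get(action, ())) for action in durations}
    start = {}
    for action in TopologicalSorter(graph).static_order():
        start[action] = max(
            (start[d] + durations[d] for d in graph[action]), default=0
        )
    return start


durations = {"boil_water": 10, "chop_vegetables": 5, "cook": 15}
deps = {"cook": ["boil_water", "chop_vegetables"]}

start = earliest_starts(durations, deps)
makespan = max(start[a] + durations[a] for a in durations)
sequential_time = sum(durations.values())
print(start, makespan, sequential_time)
```

Running the two preparation steps concurrently shortens the plan from 30 to 25 time units. A constraint solver becomes useful once resource limits and execution-time updates are added, since those cannot be handled by a single pass over the graph.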

2606.00975 2026-06-02 cs.CL (updated version)

Lost in Delusion: Examining LLM Safety Under User Delusions and Distress

Andrew Aquilina, Chetna Nihalani, Vasudha Varadarajan, Nathan S. Fishbein, Yu-Ru Lin, Maarten Sap

Affiliations: University of Pittsburgh; Carnegie Mellon University; Fordham University

AI summary: Using multi-turn conversation simulations, the study finds that LLMs detect user distress well but intervene far less often (by up to 4.5x) when the distress is embedded in delusion, and proposes targeted prompting strategies to narrow this gap.

English abstract

LLM chatbots increasingly serve as a first source of support for people in psychological distress, including those whose distress is entangled with delusional beliefs. Prior work on LLM mental-health safety largely evaluates general therapeutic quality or single-turn crisis detection, leaving unclear how models behave when distress is intertwined with delusion over sustained conversations. We address this gap with matched multi-turn simulations, across clinically grounded personas and six LLMs, that pair each delusional conversation with a distress-only control to isolate the effect of delusional framing. This reveals a recognition-intervention gap: models detect distress at comparable rates regardless of framing, yet sharply fail to act on it once distress is embedded in delusion, with safety interventions suppressed by up to 4.5x. The failure tracks accumulated acceptance of the user's premises rather than emotional validation. Worse, the intuitive fix of prompting models to assess user distress backfires under delusional framing; only delusion-aware prompting with explicit response guidance closes the gap, and even this depends on a delusion classifier that is itself unreliable on the most vulnerable models. Safe deployment therefore requires treating delusional framing as a distinct risk signal that overrides conversational accommodation.

2606.00971 2026-06-02 cs.CL (updated version)

HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

Md Motaleb Hossen Manik, Ge Wang

Affiliations: Department of Computer Science, Rensselaer Polytechnic Institute; Department of Biomedical Engineering, Rensselaer Polytechnic Institute

AI summary: Proposes HypothesisMed, an inference-time reliability pipeline that uses answer fusion and SPACE labels (valid, incomplete, or contradicted) to improve accuracy, parseability, and structured reliability reporting in biomedical multiple-choice question answering.

English abstract

Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability instructions, recognize weak answer spaces, or avoid confident incorrect commitments. This paper presents HypothesisMed, an inference-time reliability pipeline for biomedical multiple-choice question answering. It combines direct prompting, chain-of-thought prompting, HypothesisMed-v3 prompting, and answer fusion. The final answer is selected by fusion, while HypothesisMed-v3 supplies SPACE labels and confidence information. SPACE labels mark the answer space as VALID, INCOMPLETE, or CONTRADICTED. We evaluate Qwen2.5-7B, Phi-4-mini, DeepSeek-R1-32B, and BioMistral-7B on MedQA, MedMCQA, and PubMedQA using 1,000 examples per dataset. The pipeline improves weighted accuracy over each model's best direct or chain-of-thought baseline while increasing parse and SPACE coverage. We also scale evaluation to Qwen2.5-7B and Phi-4-mini using 10,183 examples per model. Fusion improves Phi-4-mini accuracy from 0.4296 to 0.5192, while Qwen2.5-7B chain-of-thought remains slightly higher in answer accuracy. However, Qwen2.5-7B fusion achieves complete parse and SPACE coverage with much lower false commitment. A 12,000-example SPACE stress test shows answer-space diagnosis remains difficult, with SPACE accuracy of 0.3074 for Qwen2.5-7B and 0.4168 for Phi-4-mini. These results show that answer accuracy, parseability, structured reliability reporting, calibration behavior, and false-commitment behavior are separable capabilities. The main contribution is not a universal state-of-the-art claim, but a reproducible inference-time framework for evaluating biomedical question answering models as auditable workflow components under structured reliability constraints.

2606.00963 2026-06-02 cs.CV cs.CL 版本更新

MLLM-Microscope：解锁多模态大语言模型中的隐藏结构

Ravil Mussabayev, Rustam Mussabayev

发表机构 * Satbayev University（萨特拜耶夫大学）

AI总结提出MLLM-Microscope系统，通过分析线性度、内在维度和各向异性，揭示多模态大语言模型中隐藏的表示结构，并基于ScienceQA数据集评估LLaVA-NeXT和OmniFusion，发现模态融合方式显著影响模型内部工作机理。

AI中文摘要

本文提出MLLM-Microscope，一个用于分析多模态大语言模型（MLLMs）中隐藏表示的新型系统。我们的系统评估了跨transformer层的多模态token嵌入的线性度、内在维度和各向异性。利用ScienceQA数据集，我们评估了两个最先进的MLLM：LLaVA-NeXT和OmniFusion。我们发现，两种模态的token的主流和残差流在transformer层中均表现出高度线性行为。然而，LLaVA-NeXT的图像token线性度略有下降，而OmniFusion的保持一致。与LLaVA-NeXT相比，OmniFusion的图像token维度在各层中始终较高。此外，观察到OmniFusion的各向异性在各层中保持较低水平。这些发现表明，MLLM的内部工作高度依赖于将token序列传入LLM之前执行的模态融合的性质。这一发现以及从我们的系统中获得的其他潜在新见解，无疑能够增强我们对MLLM内部工作的理解，为未来的模型设计和优化提供信息。

英文摘要

This work presents MLLM-Microscope, a novel system designed for analyzing the hidden representations within Multimodal Large Language Models (MLLMs). Our system evaluates the linearity, intrinsic dimension, and anisotropy of multimodal token embeddings across transformer layers. Utilizing the ScienceQA dataset, we evaluate two state-of-the-art MLLMs, LLaVA-NeXT and OmniFusion. We find that both the main and residual streams for tokens of both modalities exhibit highly linear behavior across transformer layers. However, LLaVA-NeXT's image tokens reveal a slight decline in linearity, whereas OmniFusion's remain consistent. Image token dimensions in OmniFusion remain consistently higher across layers compared to LLaVA-NeXT. Also, OmniFusion's anisotropy is observed to stay consistently low throughout the layers. These findings suggest that the inner workings of MLLMs depend heavily on the nature of the modality fusion performed before passing the token sequence into the LLM. This and other potential new insights obtainable from our system are surely capable of enhancing our understanding of the inner workings of MLLMs, informing future model design and optimization.

2606.00898 2026-06-02 cs.CL cs.DL 版本更新

Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs

引用溯源：通过法律引用图检测和减少LLM引用幻觉

Volodymyr Ovcharov

发表机构 * LEX AI LLC

AI总结提出引用溯源（CG）指标，利用乌克兰法院判决的引用图（1.008亿判决，5.02亿边）检测LLM法律引用幻觉，并通过CG-DPO方法（基于真实判决构建偏好对）减少幻觉，在100个法律查询上CG为0.791-0.873，幻觉率13-21%。

Comments 14 pages, 3 figures, 3 tables. Code and data: https://huggingface.co/datasets/overthelex/citation-grounding-eval

AI中文摘要

大型语言模型系统性地产生法律引用幻觉——编造法规引用、引用已废除条款、混淆司法管辖区——但目前尚无自动化方法可大规模测量或减少此行为。我们提出引用溯源（CG），该指标通过从1.008亿乌克兰法院判决（5.02亿条边，21,736个唯一法规节点）中提取的真实引用图来验证LLM生成的法律引用。CG分解为三个组成部分——引用精确性（引用的条款是否存在？）、引用相关性（是否上下文相关？）和引用时效性（在相关日期是否有效？）——从而实现对幻觉类型的差异化诊断。对100个乌克兰法律查询的实证评估（涉及五个系统：通过AWS Bedrock的四个商业LLM——Claude Haiku 4.5、Mistral Pixtral Large、Amazon Nova Pro/Lite——以及一个RAG增强的生产系统）显示CG范围为0.791至0.873，其中13-21%的引用是幻觉。为了在没有人工标注的情况下减少幻觉，我们引入了引用溯源DPO（CG-DPO）：一种通过四种针对性策略从真实法院判决中破坏已验证引用来自动构建偏好对的方法。在包含2,244个法院判决的数据集上，使用LoRA微调的Qwen2.5-7B-Instruct模型在区分正确和错误引用方面达到了98.5%的平均验证准确率（奖励边际+14.9，3个种子的标准差<0.3个百分点）。引用图、评估框架和CG-DPO数据集作为开放资源发布。

英文摘要

Large language models systematically hallucinate legal citations -- fabricating statute references, citing repealed provisions, and confusing jurisdictions -- yet no automated method exists to measure or reduce this behavior at scale. We propose citation grounding (CG), a metric that verifies LLM-generated legal citations against a ground-truth citation graph extracted from 100.8 million Ukrainian court decisions (502 million edges, 21,736 unique statute nodes). CG decomposes into three components -- citation precision (does the cited provision exist?), citation relevance (is it contextually appropriate?), and citation temporality (was it valid at the relevant date?) -- enabling differential diagnosis of hallucination types. Empirical evaluation on 100 Ukrainian legal queries across five systems -- four commercial LLMs via AWS Bedrock (Claude Haiku 4.5, Mistral Pixtral Large, Amazon Nova Pro/Lite) and one RAG-augmented production system -- reveals CG ranging from 0.791 to 0.873, with 13-21% of citations hallucinated. To reduce hallucinations without human annotation, we introduce Citation Grounding DPO (CG-DPO): a method that constructs preference pairs algorithmically by corrupting verified citations from real court decisions via four targeted strategies. On a dataset of 2,244 court decisions, a Qwen2.5-7B-Instruct model fine-tuned with LoRA achieves 98.5% mean validation accuracy in distinguishing correct from corrupted citations (rewards margin +14.9, std < 0.3 pp across 3 seeds). The citation graph, evaluation framework, and CG-DPO dataset are released as open resources.
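The citation-precision component described above (does the cited provision exist?) can be illustrated with a minimal sketch. This is only an assumption-laden illustration, not the paper's scoring code: the abstract does not give the exact formula, so precision is taken here to be the fraction of cited provisions found among the graph's statute nodes, and the node identifiers and the function name `citation_precision` are hypothetical.

```python
from typing import Iterable, Optional, Set


def citation_precision(cited: Iterable[str], statute_nodes: Set[str]) -> Optional[float]:
    """Fraction of cited provisions that exist as statute nodes (assumed definition)."""
    cited = list(cited)
    if not cited:
        return None  # nothing to verify, so no score is reported
    found = sum(1 for c in cited if c in statute_nodes)
    return found / len(cited)


# Hypothetical statute node identifiers standing in for the real graph.
statute_nodes = {"code-A:art-1", "code-A:art-2", "code-B:art-10"}

# Citations extracted from one model answer (also hypothetical).
cited = ["code-A:art-1", "code-B:art-10", "code-C:art-99"]

precision = citation_precision(cited, statute_nodes)
hallucinated = [c for c in cited if c not in statute_nodes]
```

Relevance and temporality would need edge and validity-date information from the graph, which this sketch deliberately leaves out.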

2606.00881 2026-06-02 cs.CL 版本更新

Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations

检索增强生成中的分块方法——针对计算成本与局限性的有效性评估

Mateusz Śmigielski, Michał Rajkowski, Mateusz Zbrocki, Michał Bernacki-Janson, Karol Kunicki, Julianna Godziszewska, Maciej Piasecki, Konrad Wojtasik

发表机构 * Department of Artificial Intelligence, Faculty of Information and Communication Technology, Wrocław University of Science and Technology（弗罗茨瓦夫科技大学信息与通信技术学院人工智能系）

AI总结本研究首次系统评估多种分块方法在RAG系统中的有效性，揭示分块策略中常被忽视的关键问题。

AI中文摘要

检索增强生成（RAG）在提升大型语言模型（LLMs）性能方面展现了显著能力。RAG系统中的关键任务之一是分块过程。传统上，固定大小分块和语义分块是标准方法。然而，对分块策略的兴趣日益增长，导致越来越多声称性能优于传统技术的方法被提出。许多这些方法针对特定用例和数据类型定制，缺乏在不同场景下有效性的证据。因此，直接比较不同技术并评估其相对优势仍然具有挑战性。据我们所知，本研究首次系统评估了广泛分块方法的有效性，并强调了RAG系统中分块策略的潜在挑战。虽然分块通常被视为简单的预处理步骤，但我们表明它引入了一系列有影响且常被忽视的问题。

英文摘要

Retrieval-Augmented Generation (RAG) has demonstrated significant capabilities in enhancing the performance of Large Language Models (LLMs). One of the key tasks in RAG systems is the chunking process. Traditionally, fixed-size chunking and semantic chunking have been the standard approaches. However, interest in chunking strategies has been increasing, leading to a growing number of proposed methods that often claim improved performance over these conventional techniques. Many of these approaches are tailored to specific use cases and data types, with limited evidence of their effectiveness across diverse scenarios. As a result, it remains challenging to directly compare different techniques and assess their relative strengths. To the best of our knowledge, this study is the first to systematically evaluate the effectiveness of a wide range of chunking methods and emphasize the underlying challenges of chunking strategies in RAG systems. While chunking is commonly treated as a simple preprocessing step, we show that it introduces a range of impactful and often overlooked issues.

2606.00875 2026-06-02 cs.CL 版本更新

IDEAFix: Evaluation Framework for Creative Defixation Prompting in LLMs

IDEAFix: 大语言模型中创造性去固定化提示的评估框架

F. Carichon, S. Sharma, M. Girard, R. Rampa, G. Farnadi

发表机构 * McGill University（麦吉尔大学）； Mila ； Concordia University（康科迪亚大学）； ÉTS

AI总结提出IDEAFix框架，通过控制任务变体和提示策略系统分析大语言模型在开放式创意生成中的发散思维，发现任务表述和属性选择显著影响性能，简单提示可提升原创性但输出同质化问题依然存在。

AI中文摘要

大语言模型（LLMs）越来越多地被用于涉及创造性问题解决和想法生成的任务。然而，关于其创造能力缺乏共识：一些研究报告了相比人类更优越的表现，而另一些则强调了结构性限制，如固定化和输出的同质化。现有的评估方法要么依赖于狭窄、脱离上下文的无法捕捉目标导向生成的任务，要么依赖于更广泛的设置，这些设置混淆了创造性过程的多个方面，使得难以隔离任务表述、提示和评估设计的影响。值得注意的是，结构化提示策略在塑造想法生成中的作用仍未得到充分探索。因此，我们引入了IDEAFix，一个用于分析开放式想法生成任务中发散思维的评估框架。我们提示模型对受控变化的简短设计场景、任务属性和去固定化提示策略生成多个原始解决方案。这种设计使得能够系统分析结构化指导如何影响LLMs的想法生成。我们的结果表明，任务表述和属性选择都显著影响模型的表现，并且简单的提示策略可以提升解决方案的原创性。然而，我们也观察到模型间持续的输出同质化，证实了它们在生成多样化解决方案方面固有的局限性。总体而言，IDEAFix提供了一个受控、可扩展的框架，用于研究LLMs创造力的底层机制。

英文摘要

Large language models (LLMs) are increasingly used for tasks involving creative problem solving and idea generation. However, there is a lack of consensus concerning their creative capabilities: some studies report superior performances compared to humans, while others highlight structural limitations such as fixation and the homogenization of outputs. Existing evaluation approaches either rely on narrow, decontextualized tasks that do not capture goal-oriented generation or on broader settings that confound multiple aspects of the creative process, making it difficult to isolate the effects of task formulation, prompting, and evaluation design. Significantly, the role of structured prompting strategies in shaping idea generation remains underexplored. Therefore, we introduce IDEAFix, an evaluation framework for analyzing divergent thinking in open-ended idea generation tasks. We prompt models to generate multiple original solutions to controlled variations of short design scenarios, task attributes, and defixation prompting strategies. This design enables systematic analysis of how structured guidance influences LLMs' idea generation. Our results show that both task formulation and attribute selection significantly affect models' performance, and that simple prompting strategies can boost the originality of solutions. However, we also observe persistent output homogenization across models, confirming inherent limits in their ability to generate diverse solutions. Overall, IDEAFix provides a controlled, extensible framework for studying the mechanisms underlying LLMs' creativity.

2606.00860 2026-06-02 cs.SI cs.AI cs.CL 版本更新

GenPT: Beyond Self-Report for Reliable LLM Psychometrics via Generative Projective Testing

GenPT：通过生成式投射测试实现超越自我报告的可靠LLM心理测量

Ming Wang, Shuang Wu, Bixuan Wang, Lu Lin, Yuxin Chen, Xiaocui Yang, Daling Wang, Shi Feng, Yifei Zhang, Yufan Sun

发表机构 * School of Computer Science and Engineering, Northeastern University（东北大学计算机科学与工程学院）； School of Computing and Information Systems, Singapore Management University（新加坡管理大学计算与信息系统学院）； Mental Health Education Center, Northeastern University（东北大学心理健康教育中心）； School of Psychology, Northeast Normal University（东北师范大学心理学院）； Faculty of Psychology, Southwest University（西南大学心理学部）； School of Sociology and Psychology, Central University of Finance and Economics（中央财经大学社会与心理学院）； College of Arts, Northeastern University（东北大学艺术学院）

AI总结针对自我报告问卷在人格化智能体心理测量中存在的训练语料污染和方向性偏差问题，提出GenPT方法，通过改编投射测试范式并构建三阶段评估流程，实现了更可靠的心理状态测量。

AI中文摘要

自我报告问卷仍然是探测人格化智能体（PC-Agents）心理状态的主流工具。然而，经典工具存在两个众所周知的威胁：来自训练语料的污染以及由社会期望或上下文框架驱动的方向性偏差。为了克服这些方法论瓶颈，我们探讨投射范式是否能够被改编为一种稳健的心理测量工具。我们提出了GenPT（生成式投射测试），它通过新生成的刺激重新表述了TAT、罗夏测试和SCT，并将评估组织为三阶段流程，以导出标准化的心理指标和目标状态。通过评估由CharacterRAG和AnnaAgent配置文件诱导的PC-Agents，我们针对经典问卷基准测试了GenPT的信度和效度。结果表明，问卷在社会期望框架下表现出系统性的方向性偏移，在自杀意念上最为强烈。相比之下，GenPT收集的行为模式保持在对称基线附近。此外，在纵向咨询背景下，当Qwen3作为骨干模型时，基于GenPT的抑郁评估变化幅度比问卷对应方法大约一个数量级。总体而言，GenPT在需要抗污染、偏差不对称性和上下文敏感性的场景中补充了自我报告方法。代码和刺激材料可在https://github.com/sci-m-wang/GenPT获取。

英文摘要

Self-report questionnaires remain the prevailing tool for probing the psychological states of persona-conditioned agents (PC-Agents). However, classical instruments inherit two well-known threats: contamination from training corpora and directional bias driven by social-desirability or contextual framing. To overcome these methodological bottlenecks, we ask whether projective paradigms can be adapted into a robust psychometric tool. We introduce GenPT (Generative Projective Testing), which reformulates TAT, Rorschach, and SCT with newly generated stimuli and organizes assessment as a three-stage pipeline to derive standardized psychological indicators and target states. Evaluating PC-Agents induced via CharacterRAG and AnnaAgent profiles, we benchmark GenPT's reliability and validity against classical questionnaires. The results indicate that questionnaires exhibit systematic directional shifts under social-desirability framing, most strongly on suicide ideation. In contrast, GenPT's collected behavioral patterns stay near the symmetric baseline. Furthermore, under a longitudinal counselling context, GenPT-based depression assessment shifts by roughly an order of magnitude more than the questionnaire counterpart when Qwen3 serves as the backbone. Overall, GenPT complements self-report methods in scenarios where contamination resistance, bias asymmetry, and context sensitivity matter. Code and stimuli can be found at https://github.com/sci-m-wang/GenPT.

2606.00851 2026-06-02 cs.SD cs.CL cs.HC cs.LG eess.AS 版本更新

Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

Sympatheia: 具有连续情感调节的情感自适应语音助手

Sukru Samet Dindar, Riki Shimizu, Xilin Jiang, Nima Mesgarani

发表机构 * Department of Electrical Engineering, Columbia University（电气工程系，哥伦比亚大学）

AI总结提出Sympatheia语音对话框架，通过从用户语音推断情感并结合连续效价-唤醒度控制信号，实现情感自适应响应，优于基线模型。

AI中文摘要

共情口语对话系统必须推断用户的情感状态以做出适当响应，然而日常语音通常带有微弱、中性或模糊的情感线索。为解决这一问题，我们引入了Sympatheia，一种语音到语音对话框架，其条件基于从用户语音中推断出的情感，并且在可用时，基于多模态感知模块或用户界面提供的连续效价-唤醒度（VA）控制信号中的明确情感规格。为了训练我们的模型，我们构建了Sympatheia-18k，一个包含12个情感锚点的情感条件合成口语对话语料库。该数据集包括用于学习情感语音行为的情感分割，以及一个中性分割，该分割将情感中性查询与多个情感条件响应配对，以在情感模糊情况下隔离明确的情感控制。实验结果表明，Sympatheia在生成语义内容和口语表达均情感适当的响应方面优于语音对话基线。我们进一步表明，相同的VA界面可以整合来自不同感知模块（包括面部表情、生物信号和文本情感描述）的情感估计，从而在语音单独提供有限情感证据时改善响应对齐。这些结果表明，连续情感调节是构建情感自适应语音助手的有效实际步骤。

英文摘要

Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence--arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.

2606.00832 2026-06-02 cs.CL 版本更新

Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations

Momento：评估多会话代理对话中的持久记忆与推理

Adril Putra Merin, David Anugraha, Ayu Purwarianti, Genta Indra Winata

发表机构 * Institut Teknologi Bandung（万隆技术大学）； Stanford University（斯坦福大学）； Capital One

AI总结提出Momento基准，通过多会话服务环境评估代理在跨会话中利用持久记忆和推理完成个性化任务的能力，发现现有代理因误估用户状态而表现不佳。

Comments Preprint

2606.00820 2026-06-02 cs.CL 版本更新

Shaohua Li, Xiuchao Sui, Xiaobing Sun, Yuhang Wu, Liangli Zhen, Yong Liu, Rick Siow Mong Goh

发表机构 * Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore（高性能计算研究所，新加坡科技研究局）； Shanghai University of Engineering Science, China（上海工程技术大学）

AI总结提出 Confidence-Aware SwiGLU (κ-SwiGLU)，通过根据 token 级路由置信度调整专家门控锐度，在 MoE Transformer 中提升性能且仅增加少量参数和计算开销。

Comments 13 pages, 10 figures

AI中文摘要

SwiGLU 已成为现代 Transformer MLP 中的标准门控激活函数，但其门控锐度——即门控函数的平滑性和选择性——在整个训练过程中通常是固定的。在这项工作中，我们提出了 Confidence-Aware SwiGLU (κ-SwiGLU)，这是 SwiGLU 的一种变体，用于混合专家 (MoE) 模型，它根据 token 级路由置信度调整专家门控锐度。具体来说，κ-SwiGLU 将 SiLU 门控锐度系数参数化为路由器 logit 的可学习函数，使每个专家门控单元能够在平滑、广泛激活的门控和尖锐、选择性门控之间进行插值。我们在 FineWeb-Edu 数据集上评估了 κ-SwiGLU，使用了从 8 层到 28 层的 MoE Transformer 模型。在这些设置中，κ-SwiGLU 提高了平均 CORE 性能，同时仅增加了可忽略的参数和少量计算开销，表明置信度感知的门控锐度是改进 MoE MLP 的一种有前景的机制。代码可在 https://github.com/askerlee/kappa-swiglu 获取。

英文摘要

SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU (κ-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, κ-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate κ-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, κ-SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.
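To make "gate sharpness" concrete, the sketch below uses a SiLU gate with a sharpness coefficient β, written as x · sigmoid(βx), and lets β depend on a router logit. The mapping from router logit to β (a softplus of an affine function with hypothetical parameters `w` and `b`) is an assumption chosen for illustration; the abstract does not state the parameterization the paper actually uses, and all function names here are made up.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z: float) -> float:
    """Stable log(1 + exp(z)); keeps the sharpness coefficient positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sharp_silu(x: float, beta: float) -> float:
    """SiLU with sharpness beta: beta = 1 is standard SiLU, large beta approaches ReLU."""
    return x * sigmoid(beta * x)


def sharpness_from_router_logit(router_logit: float, w: float = 1.0, b: float = 0.0) -> float:
    """Assumed mapping: beta = softplus(w * logit + b), with w and b learnable."""
    return softplus(w * router_logit + b)


def confidence_aware_glu_unit(gate_pre: float, up_pre: float, router_logit: float) -> float:
    """One scalar gated unit: sharpened gate multiplied by the linear 'up' branch."""
    beta = sharpness_from_router_logit(router_logit)
    return sharp_silu(gate_pre, beta) * up_pre
```

With a small β the gate is close to x/2 and passes negative inputs (smooth, broadly active); with a large β it suppresses negative inputs almost entirely (sharp, selective).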

2606.00755 2026-06-02 cs.CL cs.LG 版本更新

Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

内化温度：面向强化学习的同策略自蒸馏作为策略加热器

Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun, Junjie Wang, Yujiu Yang

发表机构 * Tsinghua University（清华大学）

AI总结提出温度缩放同策略自蒸馏（TS-OPSD），通过将温度探索效应内化到模型参数中，缓解强化学习中的熵崩溃问题，无需外部教师或额外推理成本。

详情

AI中文摘要

基于可验证奖励的强化学习提升了大语言模型的推理能力，但常常遭受熵崩溃，即日益集中的策略减少了轨迹多样性和有用的学习信号。现有补救措施要么约束强化学习目标（如熵正则化），要么在轨迹收集期间调整采样温度，但这些干预措施仍外在于模型参数。我们提出温度缩放同策略自蒸馏（TS-OPSD），一种轻量级的策略加热方法，将温度的探索效应内化到模型参数中。从熵崩溃的强化学习检查点开始，TS-OPSD 通过对模型自身的 logits 应用高温缩放来构建自教师，然后将得到的更平滑分布蒸馏回学生。这种策略加热不需要外部教师、特权数据或额外的推理成本。在 Qwen3-4B-Base 和 Qwen3-8B-Base 上的实验表明，策略加热为继续强化学习提供了比标准继续强化学习和轨迹级温度加热更强的初始化。进一步分析表明，TS-OPSD 主要降低输出锐度，同时保留中间表示、顶级候选集和推理能力。这些结果表明，熵恢复可以作为面向推理的强化学习的一种简单的崩溃后干预措施。

英文摘要

Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling temperature during rollout collection, but these interventions remain external to the model parameters. We propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a lightweight policy reheating method that internalizes the exploratory effect of temperature into model parameters. Starting from an entropy-collapsed RL checkpoint, TS-OPSD constructs a self-teacher by applying high-temperature scaling to the model's own logits, then distills the resulting smoother distribution back into the student. This policy reheating requires no external teacher, privileged data, or additional inference cost. Experiments on Qwen3-4B-Base and Qwen3-8B-Base show that policy reheating yields a stronger initialization for continued RL than both standard continued RL and rollout-level temperature reheating. Further analyses show that TS-OPSD mainly reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability. These results suggest that entropy restoration can serve as a simple post-collapse intervention for extending reasoning-oriented RL.
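The self-teacher construction described above can be sketched in a few lines: scale the model's own logits by a temperature above 1 to get a smoother distribution, then measure how far the unscaled student distribution is from it. This is a toy single-token illustration under stated assumptions, not the paper's training code: the direction of the KL term (teacher relative to student), the temperature value, and the example logits are all assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature (T > 1 flattens the distribution)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) in nats; assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical next-token logits from an entropy-collapsed checkpoint.
logits = [4.0, 1.0, 0.0]

student = softmax(logits)                   # the model's current (sharp) distribution
teacher = softmax(logits, temperature=2.0)  # self-teacher: same logits, higher temperature

# Distillation signal that would pull the student toward the smoother teacher.
distill_loss = kl_divergence(teacher, student)
```

Because temperature scaling preserves the ordering of the logits, the teacher keeps the same top candidate while spreading probability mass more evenly, which matches the abstract's claim that output sharpness drops while top candidate sets are preserved.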

2606.00750 2026-06-02 cs.CL 版本更新

I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications

I-WebGenBench: 评估大语言模型生成的科学网页应用中的交互性

Dasen Dai, Biao Wu, Meng Fang, Shuoqi Li, Wenhao Wang

发表机构 * Vast Intelligence Lab（vast 智能实验室）； UTS ； University of Liverpool（利物浦大学）

AI总结提出 Paper-to-Interactive-System Agent 将研究论文转化为可执行交互式网页系统，并构建 PaperVoyager 框架以显式建模机制和交互逻辑，显著提升生成质量。

Comments 9 pages, 4 figures

AI中文摘要

近期视觉语言模型的进展使得自主代理能够进行复杂推理、工具使用和文档理解。然而，现有的文档代理主要将论文转化为静态产物，如摘要、网页或幻灯片，这对于涉及动态机制和状态转换的技术论文来说是不够的。在这项工作中，我们提出了一个论文到交互式系统的代理，将研究论文转化为可执行的交互式网页系统。给定一篇 PDF 论文，该代理无需人工干预即可进行端到端处理，包括论文理解、系统建模和交互式网页合成，使用户能够操作输入并观察动态行为。为了评估这一任务，我们引入了一个包含 19 篇研究论文的基准测试，每篇论文都配有专家构建的交互式系统作为真实值。我们进一步提出了 PaperVoyager，一个结构化生成框架，在合成过程中显式建模机制和交互逻辑。实验表明，PaperVoyager 显著提高了生成的交互式系统的质量，为交互式科学论文理解提供了新的范式。

英文摘要

Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.

2606.00728 2026-06-02 cs.CL 版本更新

From Empathy to Personalized Empathy: Adapting Empathetic Strategies to Individual Users

从共情到个性化共情：根据个体用户调整共情策略

Wuqiang Zheng, Chengbing Wang, Yilin Yang, Junyi Cheng, Jianfei Xiao, Hu Sun, Yi Xie, Yangyang Li, Wenjie Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）； Huawei Technologies（华为技术）

AI总结针对大语言模型长期交互中忽略用户个性对共情策略影响的问题，提出个性化共情任务，构建PersonaEmp数据集和PereGRM奖励建模框架，实验证明其有效提升个性化共情能力。

AI中文摘要

随着大语言模型（LLMs）越来越多地部署在与用户的长期交互中，共情已成为一项日益重要的能力。然而，现有研究忽视了用户个性特征对长期交互中共情策略的影响。为弥补这一空白，我们引入了个性化共情任务，其重点是根据从历史中获得的用户个性化特征调整共情策略。为了研究和增强这一能力，我们构建了PersonaEmp，一个基于长期用户-AI交互构建的个性化共情数据集，具有丰富的用户历史、人物信息和寻求共情的查询。我们进一步提出了PereGRM，一种奖励建模框架，它将共情评估结构与动态评估标准生成相结合，用于细粒度奖励建模。在不同设置和多个评判模型下的实验结果表明，PereGRM始终取得最强的性能提升，表明其在增强个性化共情能力方面的有效性。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in long-term interactions with users, empathy has become an increasingly important capability. However, existing research overlooks the influence of users' personality traits on empathetic strategies during long-term interactions. To address this gap, we introduce the task of personalized empathy, which focuses on adapting empathetic strategies according to users' personalized characteristics derived from history. To study and enhance this capability, we construct PersonaEmp, a personalized empathy dataset built from long-term user-AI interactions, featuring rich user histories, persona information, and empathy-seeking queries. We further propose PereGRM, a reward modeling framework that combines the empathy evaluation structure with dynamic evaluation criteria generation for fine-grained reward modeling. Experimental results across different settings and multiple judge models show that PereGRM consistently achieves the strongest performance improvements, indicating its effectiveness for enhancing personalized empathetic capabilities.

2606.00724 2026-06-02 cs.CL cs.AI 版本更新

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

WaveFilter: 通过小波引导的KV缓存过滤增强扩散型大语言模型的长上下文能力

Jinnan Yang, Yan Wang, Zhen Bi, Kehao Wu, Xiaojie Li, Jungang Lou, Zechao Li, Jing Liu

发表机构 * Nanjing University of Science and Technology（南京理工大学）； Alibaba Group（阿里巴巴集团）； Huzhou Normal University（湖州师范学院）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）

AI总结针对扩散型大语言模型在长上下文任务中计算开销大和推理延迟高的问题，提出一种无需训练的通用缓存框架WaveFilter，利用小波变换分解长序列以精确识别关键token，构建稀疏KV缓存，从而提升现有KV缓存方法在复杂长上下文任务中的性能。

Comments 8 pages,3 figures

AI中文摘要

扩散型大语言模型（DLMs）在各种任务中展现出显著优势。然而，受限于其多步迭代推理机制，它们在长上下文任务中的计算开销和推理延迟已成为限制其大规模部署的核心瓶颈。在处理长序列时，现有的键值（KV）缓存机制常常面临生成质量急剧下降的困境，其核心挑战在于如何在超长上下文中精确且高效地过滤关键token。受人类阅读过程的启发，我们提出了WaveFilter，一个通用的、无需训练的缓存框架。该框架创新性地引入小波变换来分解长序列，以实现关键token的精确识别，并基于此构建稀疏KV缓存以计算最终的上下文表示。实验结果表明，WaveFilter作为一个即插即用的通用框架，显著提升了现有主流KV缓存方法在复杂长上下文任务中的性能。

英文摘要

Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in long-context tasks have become core bottlenecks restricting their large-scale deployment. When processing long sequences, existing Key-Value (KV) caching mechanisms often suffer drastic degradation in generation quality; the core challenge lies in precisely and efficiently filtering critical tokens within ultra-long contexts. Inspired by the human reading process, we propose WaveFilter, a universal and training-free caching framework. This framework innovatively introduces the wavelet transform for decomposition of long sequences to achieve precise identification of key tokens, based on which a sparse KV Cache is constructed to compute the final contextual representation. Experimental results demonstrate that WaveFilter, as a plug-and-play generic framework, significantly enhances the performance of existing mainstream KV Cache methods in complex long-context tasks.

2606.00722 2026-06-02 cs.CL cs.AI 版本更新

EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models

EPIC: 扩散语言模型在上下文无关文法约束下的高效并行推理

Hyundong Jin, Yo-Sub Han

发表机构 * Yonsei University（延世大学）

AI总结提出EPIC框架，通过词法记忆化、Earley解析验证和松弛兼容子集选择，解决扩散语言模型在CFG约束解码中的低效和并行性损失问题，推理时间降低67.5%，额外开销减少90.5%。

AI中文摘要

控制语言模型输出对于确保结构有效性、可靠性和下游可用性至关重要，扩散语言模型也不例外。最近扩散语言模型解码的进展已将输出控制从常规约束扩展到上下文无关文法（CFG）约束。然而，现有方法的速度可能比无约束解码慢四倍。更重要的是，它们大大削弱了扩散语言模型相对于自回归模型的关键优势之一，即并行解码。这种减慢是因为在并行生成过程中，顺序有效性检查引入了显著开销。我们提出了一个高效的CFG约束解码框架EPIC，解决了这一限制。我们的方法通过结合词法记忆化、使用Earley风格解析（而非确定性自动机）进行验证，以及用于并行提交的松弛兼容子集选择，提高了解码效率。它减少了重复的词法分析和验证开销，同时允许多个兼容令牌一起提交。在三个基准测试上使用四个模型的实验表明，与现有的CFG约束解码方法相比，我们的方法将推理时间减少了高达67.5%，并将额外开销降低了高达90.5%。我们的实现可在https://github.com/hyundong98/EPIC-Decoding.git获取。

英文摘要

Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion language models are no exception. Recent advances in diffusion language model decoding have extended output control beyond regular constraints to context-free grammar (CFG) constraints. Existing methods, however, can be up to four times slower than unconstrained decoding. More importantly, they substantially diminish one of the key advantages of diffusion language models over autoregressive models, namely parallel decoding. This slowdown arises because sequential validity checking introduces significant overhead during parallel generation. We propose an efficient CFG-constrained decoding framework, EPIC, that addresses this limitation. Our method improves decoding efficiency by combining lexing memoization, validation using Earley-style parsing instead of deterministic automata, and relaxed compatible subset selection for parallel commit. It reduces repeated lexing and validation overhead while allowing multiple compatible tokens to be committed together. Experiments on three benchmarks using four models show that our method reduces inference time by up to 67.5% and decreases the additional overhead by up to 90.5% compared with existing CFG-constrained decoding methods. Our implementation is available at https://github.com/hyundong98/EPIC-Decoding.git.

2606.00684 2026-06-02 eess.AS cs.CL cs.SD 版本更新

Local Diagnostics of Continuous Normalizing Flow for Out-of-Distribution Detection

连续归一化流用于分布外检测的局部诊断

Xinwei Cao, Mengxuan Lu, Torbjørn Svendsen, Giampiero Salvi

发表机构 * Department of Electronic Systems（电子系统系）； Norwegian University of Science and Technology（挪威科技大学）； Trondheim, Norway（挪威特隆赫姆）

AI总结针对高维数据子空间中目标观测的分布外检测问题，提出基于连续归一化流的拉格朗日子流框架，通过速度场几何诊断信号设计零样本音素级发音错误检测指标，优于基于似然的方法。

Comments 16 pages, 5 figures

AI中文摘要

我们解决了嵌入在高维数据空间子空间中的目标观测的分布外（OOD）检测问题。利用连续归一化流（CNFs），我们提出了一个拉格朗日子流（LSF）框架，旨在隔离并估计表示中相关分量的密度，同时将剩余分量作为上下文。通过对语音合成模型的实验，我们表明CNFs与其他深度生成模型（DGMs）类似，容易受到“似然悖论”的影响，即OOD样本被错误地赋予高似然。这归因于DGMs的归纳偏差，即优先考虑低级结构细节而非高级语义一致性。为了缓解这一现象，我们提出了基于子流轨迹上速度场的若干几何诊断信号。基于这些信号，我们为零样本音素级发音错误检测这一具有挑战性的任务设计了指标。最后，我们在一个真实的发音错误检测基准上展示了这些指标相对于基于似然的方法的优越性。

英文摘要

We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high-dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate and estimate the density of the relevant components in the representation while using the remaining components as context. Through experimentation with models for speech synthesis, we show that CNFs, similarly to other deep generative models (DGMs), are susceptible to the "likelihood paradox", where high likelihood is erroneously assigned to OOD samples. This is attributed to the inductive bias of DGMs, which prioritize low-level structural details over high-level semantic coherence. To mitigate this phenomenon, we propose a number of geometric diagnostic signals based on the velocity field over the sub-flow trajectory. Based on these signals, we design metrics for the challenging task of zero-shot phoneme-level mispronunciation detection. Finally, we demonstrate the superiority of these metrics compared to likelihood-based methods on a real-world mispronunciation detection benchmark.

2606.00683 2026-06-02 cs.CL 版本更新

OCC-RAG: Optimal Cognitive Core for Faithful Question Answering

OCC-RAG：面向忠实问答的最优认知核心

Maksim Savkin, Mikhail Goncharov, Alexander Gambashidze, Alla Chepurova, Dmitrii Tarasov, Nikita Andriianov, Daria Pugacheva, Vasily Konovalov, Andrey Galichin, Ivan Oseledets

发表机构 * OCC Team（OCC团队）

AI总结提出OCC-RAG，一种通过多上下文多跳合成数据训练的小语言模型，在忠实问答任务中匹配或超越2-6倍规模通用模型。

AI中文摘要

近年来，语言模型的发展由规模定义，每一代模型都将更多世界知识吸收进其参数中。然而，许多实际应用更受益于稳健推理而非广泛的参数化知识。在此背景下，任务专用的小语言模型（SLM）提供了一种原则性的设计选择。我们提出最优认知核心（OCC），一个基于此前提构建的SLM家族。作为OCC的变体，我们提出OCC-RAG，针对基于给定上下文的忠实问答（QA）进行了优化。该任务与OCC设计方法直接对齐，需要在提供的段落上进行多跳推理，同时忽略记忆的知识。为训练OCC-RAG，我们实现了一种新颖的流水线，用于大规模合成多上下文、多跳QA数据，生成了一个包含超过三百万个样本的语料库，针对多跳推理、严格上下文忠实性和校准的弃权进行了优化。我们发布了OCC-RAG-0.6B和OCC-RAG-1.7B，两者均在此语料库上进行了中期训练。这些模型生成带有源引用的结构化推理轨迹，这些引用基于上下文中的逐字引用。通过OCC-RAG，我们证明了紧凑的任务专用SLM可以在多跳推理（HotpotQA、MuSiQue、TAT-QA）、忠实性（ConFiQA）和拒绝（MuSiQue-Un）基准测试中匹配或超越规模为其2-6倍的通用模型。

英文摘要

Recent progress in the development of language models has been defined by scale, with each generation absorbing more of the world's knowledge into its weights. However, many practical applications benefit more from robust reasoning than from extensive parametric knowledge. In this setting, task-specialized small language models (SLMs) offer a principled design choice. We introduce Optimal Cognitive Core (OCC), a family of SLMs built around this premise. As a variant of OCC, we present OCC-RAG, optimized for faithful question answering (QA) grounded in the provided context. This task directly aligns with the OCC design approach, requiring multi-hop reasoning over supplied passages while ignoring memorized knowledge. To train OCC-RAG, we implement a novel pipeline for synthesizing multi-context, multi-hop QA data at scale, producing a corpus of over three million examples targeting multi-hop reasoning, strict context faithfulness, and calibrated abstention. We release OCC-RAG-0.6B and OCC-RAG-1.7B, both mid-trained on this corpus. The models produce structured reasoning traces with source citations grounded in literal quotes from the context. Through OCC-RAG, we demonstrate that compact, task-specialized SLMs can match or exceed general-purpose models 2-6x their size across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un) benchmarks.

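2606.00671 2026-06-02 cs.AI cs.CL cs.LG Updated version

AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning

Alessio Bruno

Affiliations: Independent researcher

AI summary: Proposes AXIOM, an architecture that restricts the language model to a canonicalizer role and achieves verifiable mathematical reasoning through a deterministic computer-algebra-system pipeline, reaching 94.36% correctness at 100% trust on 4 MATH categories.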

Comments Preprint. 12 pages, 2 figures. Live interactive demo: https://huggingface.co/spaces/Squagghy/axiom-solver. Paper artifact and dataset on Zenodo (concept-DOI): 10.5281/zenodo.20440225

DOI: 10.5281/zenodo.20440225

Abstract

We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language model functions strictly as a canonicalizer: it rewrites informal problem text into a narrow schema consumed by a deterministic Computer-Algebra-System (CAS) pipeline, which derives and verifies the answer or abstains as a first-class output. Routing follows a 1:1:1 alignment between problem-shape regex, schema-specific prompt, and closed-form CAS handler, with 3,100+ such routes shipped and zero LOST_CORRECT regressions across 250+ consecutive ship commits. We report empirical results on 4 MATH categories with a cumulative correctness of 94.36% (2,592/2,747) at 100.00% trust on parseable (zero confident-wrong answers across the full 2,747-record benchmark), all four domains above the per-domain 70/90/70 floor with per-domain trust at 100.0%, and median latency of 1 ms on rule-only handlers (88% of records on the lm-eval arithmetic 20,000-record benchmark). The architecture has served ~30,000 production queries through a public deployment. The contribution we emphasize is not a final accuracy figure but the forward dynamic the architecture establishes: every logged abstain in production is a candidate correct after one ship cycle, since new tasks compose without regressing the registry. The operational discipline behind this property (math-template bucketing, LOST_CORRECT scan as regression oracle, parseable-first onboarding, and abstain as first-class output) constitutes a transferable framework for trustworthy neuro-symbolic systems beyond mathematics.
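
The routing idea in this abstract (one problem-shape regex paired with one closed-form handler, with abstention as a valid output) can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in: the single fraction-addition route, the `ROUTES` table, and `solve` are assumptions for illustration, the language-model canonicalization step is omitted, and none of this is taken from the AXIOM codebase.

```python
import re
from fractions import Fraction
from typing import Callable, Optional

# Each route pairs one problem-shape regex with one closed-form handler.
# This single route is hypothetical and far simpler than a real CAS handler.
Route = tuple[re.Pattern[str], Callable[[re.Match[str]], Fraction]]

ROUTES: list[Route] = [
    (
        re.compile(r"^what is (\d+)/(\d+) \+ (\d+)/(\d+)\?$", re.IGNORECASE),
        lambda m: Fraction(int(m[1]), int(m[2])) + Fraction(int(m[3]), int(m[4])),
    ),
]


def solve(problem: str) -> Optional[Fraction]:
    """Return an exact answer, or None to abstain when no route applies."""
    for pattern, handler in ROUTES:
        match = pattern.match(problem.strip())
        if match:
            try:
                return handler(match)
            except ZeroDivisionError:
                # Abstain rather than emit a confident wrong answer.
                return None
    return None


if __name__ == "__main__":
    print(solve("What is 1/2 + 1/3?"))
    print(solve("Prove that every even number is a sum of two primes."))
```

In this sketch an unmatched or ill-formed problem yields `None` instead of a guess, which mirrors the abstain-first behaviour the abstract describes. The reported correctness figure is also easy to check by hand: 2,592/2,747 ≈ 0.94357, which rounds to 94.36%.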

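2606.00660 2026-06-02 cs.CL Updated version

FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agentic Search

James Xu Zhao, Hui Chen, Bryan Hooi, See-Kiong Ng

Affiliations: National University of Singapore

AI summary: Proposes FineVerify, a fine-grained self-verification framework that decomposes questions into checkable sub-questions, verifies candidate answers check by check, and aggregates the scores, significantly outperforming standard scaling baselines on four agentic search benchmarks.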

Comments 8+18 pages, 6 tables, 11 figures

Abstract

Agentic search requires language model agents to explore many sources and answer complex information-seeking questions. Scaling test-time compute is a promising way to improve these agents, but current approaches can fail, because correct answers are often sparse and score-based selection depends on model calibration. We propose FineVerify, a fine-grained self-verification framework that decomposes each question into checkable sub-questions, verifies sampled candidates against each sub-question, and selects the candidate with the highest aggregated score. This per-check structure turns selection into simpler local judgments and produces scores under the same explicit criteria. Across four agentic search benchmarks and two models, FineVerify consistently outperforms standard scaling baselines. With only four sampled trajectories, it improves GPT-5-mini by 8.2 accuracy points and Gemini-3-flash by 5.6% on average. With 12 samples, FineVerify enables GPT-5-mini to surpass frontier GPT-5 on BrowseComp-Plus. Beyond accuracy, FineVerify produces interpretable verification traces that help audit benchmark errors, suggesting broader applications for inspecting agentic search systems. Code and data are available at https://github.com/XuZhao0/fineverify
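
The selection step described in this abstract (score every sampled candidate against every sub-question, then keep the candidate with the highest aggregate) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `verify` is a hypothetical stand-in for a model-based verifier returning a score in [0, 1], the mean is an assumed aggregation rule, and the keyword-matching `toy_verify` exists only to make the example runnable. It is not the authors' implementation.

```python
from statistics import mean
from typing import Callable, Sequence


def select_candidate(
    candidates: Sequence[str],
    sub_questions: Sequence[str],
    verify: Callable[[str, str], float],
) -> tuple[str, list[float]]:
    """Return the candidate with the highest aggregated per-check score.

    `verify` is a hypothetical verifier that scores one candidate
    against one sub-question in [0, 1].
    """
    # Aggregate by the mean over sub-questions (an assumed aggregation rule).
    scores = [mean(verify(c, q) for q in sub_questions) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


def toy_verify(candidate: str, sub_question: str) -> float:
    """Toy verifier: a check passes if its keyword appears in the candidate."""
    return 1.0 if sub_question.lower() in candidate.lower() else 0.0


if __name__ == "__main__":
    checks = ["physicist", "1903", "polish"]
    answers = [
        "Marie Curie, Polish-born physicist, Nobel Prize 1903",
        "Albert Einstein, physicist, Nobel Prize 1921",
    ]
    winner, per_candidate = select_candidate(answers, checks, toy_verify)
    print(winner, per_candidate)
```

Because every candidate is scored against the same explicit checks, the per-candidate scores are directly comparable, and the per-check results double as a trace of why a candidate was preferred.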

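2606.00651 2026-06-02 cs.LG cs.AI cs.CL Updated version

MESA: Improving MoE Safety Alignment via Decentralized Expertise

Yitong Sun, Yao Huang, Teng Li, Ranjie Duan, Yichi Zhang, Xingjun Ma, Hui Xue, Xingxing Wei

Affiliations: University of Science and Technology of China

AI summary: Targets the fragility that arises when safety capabilities concentrate in a few experts of an MoE architecture. Proposes the MESA framework, which uses optimal transport theory to decentralize safety responsibility across experts and refine routing, improving defensive performance while preserving utility.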

Comments 18 pages, 8 figures, accepted by ICML 2026

Abstract

Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods uniformly adapt all parameters, ignoring their functional differences and inadvertently degrading performances. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost-effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness. Code is available at https://github.com/lorraine021/MESA.

2606.00647 2026-06-02 cs.CL cs.AI Updated version

LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

Shefayat E Shams Adib, Ahmed Alfey Sani, Md Hasibur Rahman Alif, Ajwad Abrar

Affiliations: Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh

AI summary: Addresses class imbalance in detecting psychological defense mechanisms in conversational text with an iterative, imbalance-aware method that fine-tunes Qwen3-8B with QLoRA, using grouped stratified cross-validation, minority-class round-robin lexical augmentation, and a post-processing pipeline. It reaches a macro F1 of 0.3917 and ranks 4th in the PsyDefDetect 2026 shared task.

Comments Accepted at PsyDefDetect, a shared task at the 25th BioNLP Workshop (BioNLP 2026), co-located with ACL 2026 in San Diego, CA, USA

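Linguistically Aware Non-Distortionary LLM Watermarking

Shinwoo Park, Hyejin Park, Hyeseon An, Yo-Sub Han

Affiliations: Yonsei University; Rensselaer Polytechnic Institute

AI summary: Proposes LUNA, a watermarking method whose linguistically adaptive, non-distortionary binary tournament sampler achieves high detection performance while preserving text quality, with an AUROC of 0.9959 and a mean absolute median perplexity shift of only 0.045 across 12 settings.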

Abstract

Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where watermark evidence can enter naturally. We introduce LUNA, a linguistically adaptive watermark that combines model-free detection with single-token non-distortion under the standard random-key model. LUNA estimates normalized next-tag entropy from part-of-speech contexts in an external corpus and uses it to set the depth of a non-distortionary binary tournament sampler; the detector reconstructs the same schedule from text, a tokenizer, a tagger, and a secret key. We evaluate six typologically diverse languages and two domains against eight primary baselines. LUNA attains an AUROC of 0.9959 and the lowest mean absolute median perplexity shift of 0.045 across the twelve settings; its 95% bootstrap interval [0.022, 0.073] lies below all baseline intervals. LUNA also records the lowest mean Self-BLEU, Distinct-1, surprisal, and entropy shifts. It is the only method that simultaneously achieves AUROC > 0.99 and an absolute median perplexity shift below 0.1 in a majority of settings, reaching this regime in 9 of the 12 settings while no baseline reaches it in more than 2. Our code is available at: https://github.com/Shinwoo-Park/luna_watermark

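2606.00596 2026-06-02 cs.CL Updated version

Toward Responsible and Epistemically Grounded Multilingual LLMs for Computational Social Science and Humanities

Wajdi Zaghouani

Affiliations: Northwestern University in Qatar

AI summary: Reconceptualizes multilingual reasoning LLMs as hermeneutic instruments and proposes a theoretically grounded framework for evaluating their cultural alignment, cross-lingual stability, and reasoning faithfulness in Social Sciences and Humanities research.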

Journal ref: Proceedings of LLMs4SSH Workshop at LREC 2026, Palma de Mallorca, Spain, 2026

Abstract

Large language models have rapidly evolved in multilingual competence and reasoning capacity, enabling their integration into Social Sciences and Humanities research workflows. Yet existing evaluation paradigms remain anchored in task-based NLP benchmarks and fail to address interpretive validity, cultural situatedness, and epistemic mediation. This paper reconceptualizes multilingual reasoning LLMs as hermeneutic instruments that actively structure meaning production across linguistic and cultural contexts. Drawing on hermeneutics, philosophy of technology, science and technology studies, multilingual NLP research, and computational social science methodology, we develop a theoretically grounded framework for evaluating multilingual reasoning in Social Sciences and Humanities (SSH) research. We articulate a rigorous experimental protocol with operationalized metrics for cultural alignment, cross-lingual stability, and reasoning faithfulness, along with transparency requirements tailored to interpretive research tasks. We illustrate the framework through a concrete application scenario involving multilingual political discourse analysis. The paper contributes a conceptual and methodological foundation for responsible integration of multilingual reasoning LLMs into computational social science infrastructures.

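2606.00593 2026-06-02 cs.CL cs.AI Updated version

Learning to Retrieve: Two-Level Long-Term Memory for Text-to-SQL Agents

Yibo Wang, Nikki Lijing Kuang, Philip S. Yu, Zhewei Yao, Yuxiong He

Affiliations: University of Illinois Chicago; Snowflake AI Research

AI summary: Proposes MERIT, a framework that optimizes two-level memory retrieval (global strategy and local decisions) with reinforcement learning, raising the success rate of interactive text-to-SQL agents while reducing interaction turns.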

Abstract

Interactive text-to-SQL agents solve database tasks through multi-turn interactions involving schema exploration, query execution, feedback interpretation, and decision revision. Long-term memory helps agents reuse past experiences, but existing retrieval methods remain limited. Static methods rely on fixed similarity heuristics that do not optimize downstream utility, while dynamic methods often learn from sparse final outcomes and retrieve memories at a single decision horizon. This is insufficient when memory usefulness changes across interaction stages, since memories useful for initial planning may differ from those needed for local, state-conditioned execution. We propose MERIT, a dynamic multi-horizon memory retrieval framework. MERIT maintains episode-level memory for global strategic guidance and turn-level memory for local decision support. Both levels use learned retrieval policies optimized with reinforcement learning. To train turn-level retrieval despite limited intermediate supervision, MERIT uses a lightweight Process Reward Model to provide dense proxy rewards for local memory selection. Experiments on BIRD-Interact show that MERIT outperforms no-memory, static-retrieval, and dynamic-retrieval baselines in success rate while reducing average interaction turns. Transfer results on Spider2-Snow further show positive cross-benchmark transfer without benchmark-specific tuning. These results suggest that multi-horizon retrieval improves experience reuse in interactive text-to-SQL agents.

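2606.00544 2026-06-02 cs.LG cs.CL Updated version

SALSA: Speech-Aware LLM Adaptation via Learned Steering Activation Vectors

Yekaterina Yegorova, Argyrios Gerogiannis, Haolong Zheng, Julia Hockenmaier, Chang D. Yoo, Mark A. Hasegawa-Johnson

Affiliations: University of Illinois Urbana-Champaign; Korea Advanced Institute of Science and Technology

AI summary: Proposes SALSA, which optimizes layer-wise steering vectors with a supervised objective and substantially outperforms zero-shot inference and speech in-context learning baselines on children's speech, multilingual speech, and Mandarin-English code-switching benchmarks, with relative improvements of up to 46.8% over zero-shot.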

Abstract

Speech-aware large language models often generalize poorly to out-of-domain settings. We propose SALSA (Speech-Aware LLM Adaptation via Learned Steering Activations), a lightweight adaptation method that learns layer-wise steering vectors. Unlike commonly used steering approaches that rely on contrastive activation differences, SALSA directly optimizes steering vectors using a supervised objective. Across children's speech, multilingual speech, and Mandarin-English code-switching benchmarks, SALSA substantially improves performance over zero-shot inference and speech in-context learning baselines, achieving up to 46.8% relative improvements over zero-shot. Analysis further demonstrates that steering the encoder, particularly the later layers, is more effective than steering the LLM backbone. These findings suggest that steering improves downstream ASR performance by adapting higher-level acoustic and phonetic representations to better align with the pretrained language model representation space, rather than by modifying the decoder itself.

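2606.00451 2026-06-02 cs.CL Updated version

Isolating LLM Lexical Bias: A Triangulated Metric for the Preference-Learning Stage Without Manual Curation

Xiaoyang Ming, Jose Hernandez, Thomas Stephan Juzek

Affiliations: Florida State University

AI summary: Proposes the Triangulated Preference Shift score, a metric requiring no manual curation that quantifies the lexical bias introduced during preference learning by comparing human gold standards, base models, and instruct variants.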

Comments 7 pages, 2 figures, 1 table

DOI: 10.32473/flairs.39.1.141843
Journal ref: The International FLAIRS Conference Proceedings, 39(1) (2026)

Abstract

Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are thought to originate partly in the preference-learning stage, e.g., Reinforcement Learning from Human Feedback, which generally makes models more useful but may also introduce systematic lexical bias. In terms of lexical behavior, this is visible in a model's preference for certain formats or its overuse of certain words (delve, furthermore), even when such patterns are not present in base model outputs. Research on lexical misalignment induced during preference training is constrained by reliance on manual curation. We address this by introducing the Triangulated Preference Shift score, a metric that triangulates between human gold standards, base models, and instruct variants to isolate shifts induced specifically by preference learning, without manual curation. We provide data across six model families, anchor the results in the literature, and illustrate the general approach's utility by analyzing whether preference learning shifts models toward what could be interpreted as a "language of prestige". The metric provides an initial automated method to quantify behavioral shifts attributable to preference tuning and thus may help inform model alignment and the development of trustworthy AI.

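2606.00333 2026-06-02 cs.CL Updated version

Which Institutional Frameworks Do Chatbots Assume? Auditing Jurisdictional Defaults in Multilingual LLMs

Zhizhi Wang, Harini Suresh

Affiliations: University of California, Berkeley

AI summary: Studies whether multilingual LLMs treat input language as a default jurisdictional signal when no jurisdiction is specified. An audit of 2,520 responses from 7 models on 60 prompts finds that Chinese input more often yields China-specific answers and English input more often yields U.S.-specific answers, revealing an institutional-framework misselection risk.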

Abstract

LLMs increasingly answer questions about taxes, labor protections, healthcare, education, pensions, and administrative procedures, where usefulness often depends on the applicable jurisdiction. Multilingual users may write in their most comfortable language rather than one associated with the country or region whose rules apply. We ask whether deployed LLMs use input language as a default jurisdictional signal when prompts omit any country or region. Prior multilingual audits show that prompt language can shift cultural, political, or normative outputs; we examine which legal-administrative framework models supply when jurisdiction is underspecified. We evaluate seven LLMs developed in the United States or China on 60 underspecified legal-administrative prompts in English and Mandarin Chinese under three system-prompt conditions, yielding 2,520 manually annotated responses. Across models and conditions, Chinese input more often produces China-specific answers, while English input more often produces U.S.-specific, comparative, or generic answers. Prompts requiring a single answer further increase jurisdiction selection: pooled across models, 74.5% of English-input responses adopt a U.S. framework, while 53.3% of Chinese-input responses adopt a China framework. This directional pattern appears in all seven models. We describe this deployment-level pattern as institutional-framework misselection risk: a fluent answer may rely on a legal-administrative context the user did not intend, especially when their preferred language differs from the relevant jurisdiction. LLM interfaces should not route institutional advice by input language alone; when location is absent, they should request it or state the jurisdictional scope of the answer.

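2606.00305 2026-06-02 cs.CL cs.AI Updated version

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

Yuxuan Jiang, Francis Ferraro

Affiliations: University of Maryland, Baltimore County

AI summary: Addresses the failure of token-level learning signals to bridge reasoning trajectories in on-policy distillation by proposing trajectory-aware on-policy distillation based on near-future trajectory information, which substantially improves LLM reasoning performance.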

Abstract

On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal remains token-level: it identifies deviations through high-loss tokens and repairs them through local reverse-KL correction. We show that this "trajectory-sampled but token-learned" mechanism cannot reliably bridge student trajectories toward teacher trajectories. About 30% of high-loss tokens fall into the low-divergence regime, indicating that many are surface-form mismatches rather than real reasoning forks. Moreover, even truly divergent tokens are difficult to repair with isolated token-level supervision, since reasoning failures often unfold as short-horizon distributional drift. We propose Trajectory-aware OPD (TOPD), which uses near-future trajectory information to identify real divergent states and distribute guidance across multiple future tokens. Experiments show that suppressing non-divergent high-loss tokens improves standard OPD from 47.8% to 48.2% average accuracy, while TOPD further improves performance to 52.2%, with gains on AIME24 from 60.0% to 63.3% and AIME25 from 46.7% to 53.3%.

2606.00294 2026-06-02 cs.CL Updated version

Uncovering Temporal Framing in the News

Tarek Mahmoud, Veronika Solopova, Premtim Sahitaj, Ariana Sahitaj, Max Upravitelev, Mervat Abassy, Hana Fatima Shaikh, Neda Foroutan, Vera Schmitt, Preslav Nakov

Affiliations: MBZUAI, UAE; Technische Universität Berlin, Germany; German Research Center for Artificial Intelligence (DFKI), Germany; University of Maryland, USA

AI summary: Proposes a taxonomy of eight temporal frames, applies it through expert annotation of a multilingual news corpus, analyzes frame prevalence, co-occurrence patterns, and lexical cues, and evaluates temporal framing detection with supervised fine-tuning and zero-shot classification, finding that sentence-level temporal framing is learnable and that supervised models substantially outperform zero-shot approaches.

Comments ACL 2026 Main Conference Oral

Abstract

Temporal language does more than place events on a timeline. In news discourse, references to the past, present, and future can function as rhetorical devices that shape interpretation and persuasion. Here, we study temporal framing, defined as the persuasive use of time-related language to structure meaning rather than to report chronology. We propose a taxonomy of eight temporal frames grounded in prior work on temporality and framing, and we realize it through expert annotation of a multilingual news corpus. The resulting dataset includes 458 English and German news articles, with over 2K temporally framed sentences and approximately 3K temporal framing annotations identified from a corpus of more than 20K sentences. We analyze frame prevalence, co-occurrence patterns, and lexical cues, and evaluate temporal framing detection using supervised fine-tuning and zero-shot classification. Our experiments show that temporal framing is learnable at the sentence level, with supervised models substantially outperforming zero-shot approaches. We publicly release the corpus to support future research on temporal framing: https://mbzuai-nlp.github.io/temporal-framing/.

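2606.00285 2026-06-02 cs.CL Updated version

DeSQ: Decomposition-based SPARQL Query Generation

Papa Abdou Karim Karou Diallo, Aditya Sharma, Neshat Elhami Fard, Amal Zouaq

Affiliations: LAMA-WeST; Mila – Quebec AI Institute; Polytechnique Montréal

AI summary: Proposes DeSQ, a framework that decomposes complex questions into atomic constraints, maps them to SPARQL fragments, and assembles the fragments into a complete query. It surpasses existing methods on four of five benchmarks and improves robustness and explainability.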

Abstract

Dominant approaches to Knowledge Base Question Answering (KBQA) fall into two categories. First is the generation of a formal query that suffers from brittleness and limited explainability, and the second is direct answer retrieval through KB exploration that is computationally costly and prone to hallucination. To combine the strengths of both paradigms while mitigating their respective weaknesses, we introduce DeSQ (Decomposition-based SPARQL Query Generation), a KB-agnostic framework that operates in three stages. First, it decomposes complex questions into Atomic Constraints (ACs) that mirror the relational structure of the underlying KB. Second, it generates a two-part structured output: (a) Mapping of each AC to its corresponding SPARQL Fragment, using standardized variable and URIs placeholders, and (b) URIs Grounding block describing each placeholder. Third, it assembles these fragments into a complete SPARQL query. DeSQ surpasses state-of-the-art approaches on four out of five major benchmarks and demonstrates superior robustness to lexical variation. Beyond performance gains, our framework greatly simplifies evaluation by eliminating the need for a live KB endpoint, and its structured output enables fine-grained error analysis, allowing more targeted interventions for improvement.

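2606.00198 2026-06-02 cs.LG cs.AI cs.CL Updated version

BAGEN: Are LLM Agents Budget-Aware?

Yuxiang Lin, Zihan Wang, Mengyang Liu, Yuxuan Shan, Longju Bai, Junyao Zhang, Xing Jin, Boshan Chen, Jinyan Su, Xingyao Wang, Jiaxin Pei, Manling Li

Affiliations: Northwestern University; O2 Lab; Independent; University of Michigan; Cornell; All Hands AI; Stanford; UT Austin

AI summary: Introduces the Budget-Aware Agent (BAGEN) concept, which treats budget as an active control signal rather than a passive cost metric and uses progressive interval estimation to predict upper and lower bounds on remaining budget. Across four environments and five frontier models it finds failure patterns such as strong models lacking strong budget-awareness and models being over-optimistic. Early stopping saves 28-64% of tokens, but precise interval calibration remains challenging.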

Abstract

While agents are spending more and more resources, agent cost today is mostly measured only after execution. A Budget-Aware Agent (BAGEN) should treat budget as an active control signal rather than a passive cost metric. We first systematically divide budget estimation into internal budgets (from agent computation) and external budgets (from agent actions). We then formalize budget-awareness as progressive interval estimation: at each step of a plan, an agent should predict an upper and lower bound on remaining budget, and alert when completion is unlikely. Scoring with a rollout-replay protocol, we find consistent failure patterns on four environments and five frontier agents: (1) strong agents do not necessarily have strong budget-awareness, with correlation r=0.35. (2) frontier models are consistently over-optimistic: they continue spending on tasks that are unlikely to succeed instead of alerting the user early. (3) the budget-aware signal is actionable and trainable. Early stop saves 28-64% tokens on failed trajectories, and SFT+RL strengthens early stop and alert behavior. (4) precise interval calibration remains challenging, with interval coverage capping at 47% after SFT+RL. Project page: https://ragen-ai.github.io/bagen/
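
The interval coverage figure in finding (4) can be read as the fraction of steps at which the true remaining budget falls inside the agent's predicted bounds. The following Python sketch computes coverage under that assumed reading. The function name, the data layout, and the example numbers are hypothetical, and the paper's rollout-replay scoring may define coverage differently.

```python
from typing import Sequence


def interval_coverage(
    predicted: Sequence[tuple[float, float]],
    actual_remaining: Sequence[float],
) -> float:
    """Fraction of steps whose true remaining budget lies in [lower, upper]."""
    if len(predicted) != len(actual_remaining):
        raise ValueError("one predicted interval is needed per step")
    if not predicted:
        raise ValueError("at least one step is required")
    hits = sum(
        1
        for (lower, upper), truth in zip(predicted, actual_remaining)
        if lower <= truth <= upper
    )
    return hits / len(predicted)


if __name__ == "__main__":
    # Hypothetical trajectory: predicted (lower, upper) token bounds per step
    # and the tokens that were actually still needed at that step.
    intervals = [(800, 1200), (500, 700), (100, 300), (0, 50)]
    truth = [1000, 900, 250, 80]
    print(interval_coverage(intervals, truth))  # 0.5
```

In this toy trajectory the agent underestimates the remaining budget at two of four steps, which is the over-optimistic pattern the abstract reports: the true cost lands above the predicted upper bound.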

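2606.00168 2026-06-02 cs.CL Updated version

RealityTest: How People Probe AI Identity and Whether Models Disclose It

Anna Gausen, Sarenne Wallbridge, Bessie O'Dell, Christopher Summerfield, Hannah Rose Kirk

Affiliations: AI Security Institute

AI summary: Builds RealityTest, the first large-scale multimodal, multilingual benchmark, from 3,152 identity-probing queries collected from about 750 participants in 49 countries. Finds that only 31% of people ask about identity directly in ambiguous scenarios and that human questions are far more diverse than machine-generated queries. Tests of 17 text and 6 speech models show that question phrasing and conversational context affect disclosure more than the choice of model.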

Comments 9 pages, 4 figures

详情

AI中文摘要

AI系统越来越多地部署在对话场景中，用户可能不确定自己是在与人类还是AI交谈。尽管监管机构对这一已知安全风险日益关注，但现有的AI披露评估通常仅限英语、基于机器生成的问题，且局限于文本。我们提出RealityTest，以全面测试AI系统在被询问时是否披露其身份。该基准是首个大规模多模态和多语言评估，基于人类在现实世界中实际遇到和质疑AI身份的数据。伴随基准，我们发布了包含来自49个国家约750名参与者的3152个身份探查查询的基础数据集，涵盖文本和语音场景。我们发现，在模糊场景中，只有31%的人直接询问身份，而且人们提出的问题远比机器生成的查询多样化。我们测试了17个文本模型和6个语音模型，发现披露行为存在显著差异。然而，即使是在表现最好的模型中，单一的抑制指令也会将披露率降至30%以下。验证我们在多样化、基于人类的评估数据上的投入，我们发现问题的措辞和对话上下文对披露的影响比模型本身更大。基于狭窄或合成查询集的安全评估可能会错误地描述模型在实际部署场景中的行为。

English abstract

AI systems are increasingly deployed in conversational settings where users may be uncertain whether they are speaking with a human or an AI. Despite mounting regulatory attention to this known safety risk, existing evaluations of AI disclosure are typically English-only, based on machine-generated questions, and restricted to text. We present RealityTest to comprehensively test whether AI systems disclose their identity when asked. The benchmark is the first large-scale multimodal and multilingual evaluation, grounded in human data on how people actually encounter and question AI identity in the real world. Alongside the benchmark, we release the underlying dataset of 3,152 identity-probing queries collected from ~750 participants across 49 countries and five languages, in text and speech scenarios. We find that only 31% of people ask about identity directly in ambiguous scenarios, and that the questions people ask are far more diverse than machine-generated queries. We test 17 text and 6 speech models, and find substantial variation in disclosure behaviour. However, a single suppression instruction reduces disclosure rates to below 30%, even in the best-performing models. Validating our investment in diverse, human-grounded evaluation data, we find that how the question is phrased and the context of the conversation matter more for disclosure than which model is being tested. Safety evaluations built on narrow or synthetic query sets risk mischaracterising how models behave in realistic deployment settings.

2606.00160 2026-06-02 cs.CR cs.AI cs.CL (updated version)

DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

DataShield: 针对大语言模型良性指令微调的安全降级数据过滤

Junbo Zhang, Qianli Zhou, Xinyang Deng, Wen Jiang, Jie Pan, Jinbiao Zhu

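Affiliation: Northwestern Polytechnical University (nwpu.edu.cn)

AI summary: Proposes DataShield, which scores each sample's contribution to the model's compliance behavior as a safety-degradation score, efficiently identifying safety-degrading samples in benign datasets. The method is validated on multiple models and datasets.

Chinese abstract (AI-generated)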

大型语言模型（LLM）即使在使用良性数据集进行微调时，也会出现安全能力下降的问题。然而，现有识别良性数据集中安全降级样本的方法存在计算成本高和噪声显著的问题。在本文中，我们提出DataShield，以高效且有效地识别潜在的安全降级样本。我们的关键直觉基于以下观察：良性微调提高了LLM的整体响应合规性。DataShield的核心技术见解是将每个样本对模型合规行为的贡献量化为其安全降级分数。DataShield包含三个核心组件：（1）合规向量提取，捕获LLM的合规行为倾向；（2）一种新颖的合规感知分数（CAS），自动识别最优的安全关键层；（3）安全降级样本过滤，量化训练数据沿合规方向的投影偏移。在Llama3-8B、Llama3.1-8B和Qwen2.5-7B上使用Alpaca和Dolly良性数据集进行的广泛实验评估验证了我们的方法在识别高风险和低风险数据子集方面的有效性。我们还观察到，开放式问答更可能触发安全降级，且相应的响应往往更长。我们希望这项工作能为以数据为中心的防御方法提供新的见解。源代码可在https://github.com/ZJunBo/DataShield获取。

English abstract

Large language models (LLMs) suffer from degraded safety capabilities even when fine-tuned with benign datasets. However, existing methods for identifying safety-degrading samples in benign datasets suffer from high computational costs and significant noise issues. In this paper, we propose DataShield to efficiently and effectively identify potential safety-degrading samples. Our key intuition is based on the observation that benign fine-tuning increases the overall response compliance of LLMs. DataShield's key technical insight is to quantify each sample's contribution to the model's compliance behavior as its safety degradation score. DataShield consists of three core components: (1) Compliance Vector Extraction, which captures the LLM's compliance behavior tendency; (2) a novel Compliance-Aware Score (CAS), which automatically identifies the optimal safety-critical layer; and (3) Safety-degrading Sample Filtering, which quantifies the projection shift of training data along the compliance direction. Extensive experimental evaluation on Llama3-8B, Llama3.1-8B, and Qwen2.5-7B using the Alpaca and Dolly benign datasets validates our method's effectiveness in identifying high-risk and low-risk data subsets. We also observe that open-ended question answering is more likely to trigger safety degradation, and corresponding responses tend to be longer. We hope this work can provide new insights into data-centric defense methods. The source code is available at: https://github.com/ZJunBo/DataShield.
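One way to picture the projection-shift idea is the sketch below. It assumes, purely for illustration, that the compliance direction is a normalised difference of mean hidden states between compliant and refusing responses at a single layer, and that a sample is scored by how far a representation moves along that direction. The function names, the toy vectors, and the difference-of-means choice are all assumptions; the abstract does not specify DataShield's actual procedure at this level of detail.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def compliance_direction(compliant: Sequence[Vector], refusing: Sequence[Vector]) -> List[float]:
    """Unit vector pointing from the mean refusing state to the mean compliant state (assumed definition)."""
    diff = [c - r for c, r in zip(mean_vector(compliant), mean_vector(refusing))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("compliant and refusing means coincide; direction is undefined")
    return [d / norm for d in diff]


def projection_shift(before: Vector, after: Vector, direction: Vector) -> float:
    """Signed movement of a representation along the compliance direction."""
    return sum((a - b) * d for a, b, d in zip(after, before, direction))


# Toy two-dimensional hidden states (hypothetical values).
compliant_states = [[1.0, 0.0], [3.0, 0.0]]
refusing_states = [[0.0, 0.0], [0.0, 0.0]]
direction = compliance_direction(compliant_states, refusing_states)
score = projection_shift(before=[0.5, 1.0], after=[1.5, 3.0], direction=direction)
```

In this toy setup only movement along the first axis counts toward the score, so a sample with a larger positive shift would be ranked as higher risk.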

2606.00136 2026-06-02 cs.LG cs.AI cs.CL cs.CR cs.SI (updated version)

Generative AI and Digital Ecosystem Resilience: A Proactive Lifecycle-Based Survey

生成式AI与数字生态系统韧性：基于生命周期的主动式综述

Jonghyun Chung, Rishabh Chaddha, Sanket Badhe, Debanshu Das, Nathan Huang, Amanpreet Kaur

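Affiliation: Google LLC

AI summary: Using the lifecycle-based C5 Interaction Model and combining machine learning with social science methods, this survey systematically reviews proactive detection techniques for GenAI-driven adversarial synthetic content, including Coordinated Inauthentic Behavior analysis, epidemiological modeling, and Hawkes processes, with the aim of building a more resilient information ecosystem.

Comments 14 pages, 3 figures, 3 tables. Accepted for publication in IEEE Access (May 2026)

Journal ref: IEEE Access (2026)

Chinese abstract (AI-generated)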

生成式AI加速了对抗性合成内容的扩散，使得传统的被动检测方法失效。本综述综合了新兴研究，展示了向主动检测新兴不真实叙事的范式转变。我们采用统一的、基于生命周期的分类法，将对抗性活动的社会技术生命周期模型与新兴不真实叙事检测的高级计算方法相结合。通过围绕C5交互模型（背景、原因、内容、放大循环、后果）构建分析，我们整合了机器学习和社会科学的不同研究流。为了区分合成放大模式与真实基线流量，本文综述了建模新叙事创建、播种和传播的最先进技术，包括协调不真实行为分析、流行病学建模和霍克斯过程。本综述还系统回顾了C5交互模型不同阶段对抗性威胁的主动检测方法，特别是高维嵌入空间中的异常检测、多层图上的无监督协调检测以及代理型AI系统。最后，本综述探讨了生成式AI带来的挑战，包括追踪快速变化威胁和多级分布漂移的困难，并概述了未来研究议程，重点在于检测异常聚类和构建预期性及韧性系统。本综述为更韧性的信息生态系统提供了基于生命周期的主动检测新兴合成威胁方法的全面回顾。

English abstract

The proliferation of adversarial synthetic content, accelerated by Generative AI (GenAI), is rendering traditional reactive detection methods ineffective. This survey synthesizes emerging research to demonstrate a paradigm shift toward the proactive detection of emerging inauthentic narratives. In this survey, we adopt a unified, lifecycle-based taxonomy to combine socio-technical lifecycle models of adversarial campaigns with advanced computational methodologies for emerging inauthentic narrative detection. By structuring the analysis around the C5 Interaction Model (Context, Causes, Content, Cycle of Amplification, Consequences), we integrate different research streams from machine learning and social science. To differentiate spread patterns of synthetic amplification from authentic baseline traffic, this paper surveys state-of-the-art techniques for modeling the creation, seeding, and propagation of fresh narratives, including the analysis of Coordinated Inauthentic Behavior (CIB), epidemiological modeling, and Hawkes processes. This survey also provides a systematic review of proactive detection methods for adversarial threats at different stages in the C5 interaction model, specifically, anomaly detection in high-dimensional embedding spaces, unsupervised coordination detection on multi-layer graphs, and agentic AI systems. Finally, this survey addresses challenges posed by GenAI, including the difficulty of tracking rapidly changing threats and multi-level distributional drift, and it outlines a future research agenda focused on detecting anomalous clusters and building anticipatory and resilient systems. This survey provides a comprehensive, lifecycle-based review of methods for the proactive detection of emerging synthetic threats for more resilient information ecosystems.

2606.00116 2026-06-02 cs.CL cs.AI cs.LG (updated version)

Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization

增强BiGRU与KAN模块在法律文档分类与摘要中的应用

Ahmed Faizul Haque Dhrubo, Souvik Pramanik, Most. Aysha Siddika Sumona, Shahnewaz Siddique, Mohammad Ashrafuzzaman Khan, Mohammad Abdul Qayum, Mohsin Sajjad

Affiliation: Dept. of ECE, North South University

AI summary: Proposes a KAN-based BiGRU model for classifying and summarizing low-resource multilingual legal documents; the KAN module raises classification accuracy to 67.96%.

Comments 10 pages, 10 figures, 4 tables; version 2, revised after review at ACL 2026

Chinese abstract (AI-generated)

本研究引入了一种基于KAN的BiGRU模型的新架构，用于低资源多语言环境下的法律文档分类与摘要任务。为了解决领域语言、不同语言使用、上下文长依赖和类别不平衡等问题，我们使用了由孟加拉国法律文档组成的数据集，这些文档来自Manupatra，包括孟加拉语、英语和音译孟加拉语。我们的分类任务采用BiGRU模型以及Kolmogorov-Arnold网络（KAN）模块，而摘要部分则利用基于注意力的GRU结合KAN模型头部。分类模型达到了67.96%的准确率和0.65的F1分数；摘要的ROUGE-1、ROUGE-2和ROUGE-L指标分别对应0.38、0.23和0.31的F1分数。消融研究表明，使用KAN将分类准确率从57.34%提升至67.96%。此外，我们将所提出的技术与多个基线进行了比较，包括经典机器学习算法和预训练语言模型。

English abstract

This study introduces a novel KAN-based BiGRU architecture for classifying and summarizing legal documents in a low-resource multilingual setting. To tackle domain-specific language, mixed language use, long-range contextual dependencies, and class imbalance, we use a dataset of Bangladeshi legal documents taken from Manupatra, written in Bengali, English, and transliterated Bengali. Classification uses a BiGRU model with a Kolmogorov-Arnold Network (KAN) module, while summarization uses an attention-based GRU combined with a KAN model head. The classification model achieves 67.96% accuracy and a 0.65 F1 score; for summarization, the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores are 0.38, 0.23, and 0.31, respectively. An ablation study shows that adding KAN increases classification accuracy from 57.34% to 67.96%. We also compare the proposed technique with several baselines, including classical ML algorithms and pretrained language models.

2606.00095 2026-06-02 cs.CV cs.AI cs.CL cs.RO (updated version)

SentimentLens: 通过双模态调和酒店业中的情感与评分

Dineth Jayakody, Pasindu Thenahandi, Sampath Jayarathna

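Affiliation: University of Peradeniya

AI summary: Proposes SentimentLens, a system based on aspect-based sentiment analysis that extracts knowledge from unstructured hotel reviews and reconciles textual sentiment with numerical ratings across modalities to identify operational conflicts and service-improvement opportunities.

Chinese abstract (AI-generated)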

在线旅游平台生成大量用户生成的酒店评论，为大规模理解旅行者体验提供了丰富机会。然而，将非结构化文本反馈转化为结构化、可操作的见解仍然是一项具有挑战性的任务。本文提出了SentimentLens，一个基于方面级情感分析的可扩展分析系统，该系统从非结构化酒店评论中执行知识提取，并将其组织成可解释的服务类别。SentimentLens集成了方面术语提取、方面情感分类、语义类别分配和多层次分析模块，以支持区域级、酒店级和类别级评估。该系统设计为在不同地理环境和酒店环境中运行。为了展示其实用性，我们将SentimentLens应用于一个包含超过10,000条公开酒店评论的大型真实数据集。通过广泛分析，该框架揭示了旅行者情感如何随区域、服务类别和酒店类型而变化。我们进一步实现了文本情感与数值评分的跨模态调和，以识别潜在运营冲突、服务质量的结构性不一致性，并使用重要性-绩效和基于熵的分析确定高影响力的改进机会。结果表明，SentimentLens有效地将大规模非结构化评论转化为可操作的情报，支持酒店管理和旅游政策的数据驱动决策。虽然通过一个国家案例研究进行了演示，但所提出的系统可推广到其他目的地和评论驱动的服务领域。

English abstract

Online travel platforms generate vast volumes of user-generated hotel reviews, offering rich opportunities to understand traveler experiences at scale. However, transforming unstructured textual feedback into structured, actionable insights remains a challenging task. This paper presents SentimentLens, a scalable analysis system based on Aspect-Based Sentiment Analysis that performs knowledge extraction from unstructured hotel reviews and organizes them into interpretable service categories. SentimentLens integrates aspect term extraction, aspect sentiment classification, semantic category assignment, and multi-level analytical modules to support region-level, hotel-level, and category-level evaluation. The system is designed to operate across different geographic contexts and hospitality settings. To demonstrate its practical utility, we apply SentimentLens to a large real-world dataset of over 10,000 publicly available hotel reviews. Through extensive analysis, the framework reveals how traveler sentiment varies across regions, service categories, and hotel archetypes. We further implement a cross-modal reconciliation of textual sentiment and numerical ratings to identify latent operational conflicts, structural inconsistencies in service quality, and high-impact improvement opportunities using importance-performance and entropy-based analyses. The results show that SentimentLens effectively transforms large-scale unstructured reviews into actionable intelligence, supporting data-driven decision-making for hospitality management and tourism policy. While demonstrated using a national case study, the proposed system is generalizable to other destinations and review-driven service domains.

2606.00065 2026-06-02 cs.IR cond-mat.mtrl-sci cs.AI cs.CL (updated version)

Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy

超越文本与表格：ComProScanner中视觉-语言模型集成实现从科学图表中高精度提取材料数据

Aritra Roy, Enrico Grisan, Chiara Gattinoni, John Buckeridge

Affiliations: Energy, Materials and Environment Research Centre, London South Bank University, London SE1 0AA, UK; School of Engineering and Design, London South Bank University, London SE1 0AA, UK; Bioscience and Bioengineering Research Centre, London South Bank University, London SE1 0AA, UK; Department of Physics, Kings College London, London WC2R 2LS, UK

AI summary: Extends the ComProScanner framework with vision-language models to automatically extract composition-property data from scientific figures, reaching a composition accuracy of 0.97 and a normalised F1 score of 0.97 on a piezoelectric ceramics dataset, and introduces a range-based error-threshold evaluation method.

Comments 18 pages, 3 figures

Chinese abstract (AI-generated)

基于大语言模型流水线的自动提取科学文献中材料成分-性能数据的方法已取得显著进展；然而，现有框架仍局限于文本和表格内容，忽视了仅在科学图表中报告的大量定量性能数据。本文扩展了ComProScanner——一个用于自动构建成分-性能数据库的完全端到端多智能体框架，为其增加了基于原生视觉-语言模型（VLM）的图表提取能力。该扩展引入了一个FigureExtractor工具，用于基于标题关键词对所有支持的出版商进行图表过滤，以及一个GraphExtractorTool智能体，它将提取的图表传递给可配置的VLM，以从科学图表和绘图中恢复成分-性能对。基于LMArena Diagram排行榜和每百万token输入成本低于1.50美元的标准，选择了四个VLM进行评估。在来自已建立的d33测试语料库的50篇压电陶瓷文章上的基准测试表明，Gemini-3-Flash-Preview实现了最高性能，组成准确率为0.97，归一化F1分数为0.97，同时仍然是四个评估模型中成本效益最高的。此外，我们在评估框架中引入了一个基于范围的值误差阈值参数，与精确值匹配相比，提供了对从图表中提取的数值性能数据更具物理意义的评估。这些贡献使集成VLM的ComProScanner成为第一个针对材料科学、完全自动化、多模态的文献挖掘平台，能够在单一统一流水线中从文本、表格和图表中提取结构化的成分-性能数据。

English abstract

Automated extraction of materials composition-property data from scientific literature has advanced considerably with the development of large language model-based pipelines; however, existing frameworks remain limited to textual and tabular content, overlooking the substantial proportion of quantitative property data reported exclusively in scientific figures. Here, we extend ComProScanner, a fully end-to-end multi-agent framework for automated composition-property database construction, with a native vision-language model (VLM) based figure extraction capability. The extension introduces a FigureExtractor utility for caption-keyword-based figure filtering across all supported publishers, and a GraphExtractorTool agent that passes extracted figures to a configurable VLM to recover composition-property pairs from scientific charts and plots. Four VLMs are selected for evaluation on the basis of the LMArena Diagram leaderboard with an input cost criterion of less than $1.50 per million tokens. Benchmarking on 50 piezoelectric ceramic articles from the established $d_{33}$ test corpus demonstrates that Gemini-3-Flash-Preview achieves the highest performance with a composition accuracy of 0.97 and a normalised F1 score of 0.97, whilst remaining the most cost-effective model among the four evaluated. We additionally introduce a range-based value error threshold parameter into the evaluation framework, providing a more physically meaningful assessment of numeric property values extracted from figures than exact value matching. These contributions establish VLM-integrated ComProScanner as the first materials-specific, fully automated, multimodal literature mining platform capable of extracting structured composition-property data from text, tables, and figures within a single unified pipeline.
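The contrast between exact value matching and a range-based threshold can be sketched as follows. This is an illustrative reading only: it assumes the threshold is a relative tolerance around the reference value, and the function names, the 5% default, and the example readings are hypothetical rather than ComProScanner's actual parameters.

```python
def matches_exact(extracted: float, reference: float) -> bool:
    """Strict comparison: a value read off a plot almost never passes this."""
    return extracted == reference


def matches_within_range(extracted: float, reference: float,
                         rel_threshold: float = 0.05) -> bool:
    """Accept the extracted value if it lies within a relative band around the reference."""
    if rel_threshold < 0:
        raise ValueError("rel_threshold must be non-negative")
    return abs(extracted - reference) <= rel_threshold * abs(reference)


# Hypothetical example: a d33 value of 380 pC/N reported in the text, 372 pC/N read from a figure.
reference_d33 = 380.0
extracted_d33 = 372.0
exact_ok = matches_exact(extracted_d33, reference_d33)
range_ok = matches_within_range(extracted_d33, reference_d33)
```

A tolerance of this kind reflects that values digitised from a chart carry reading error, so a small deviation need not count as an extraction failure.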

2606.00062 2026-06-02 cs.CL (updated version)

TCAR-Gen: 基于证据融合的时间图检索用于知识增强生成

Sidra Nasir, Muhammad Noman Zahid, Rizwan Ahmed Khan

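Affiliations: Dipartimento di Informatica, Università di Verona; School of Advanced Studies, University of Camerino; Department of Computer Science, School of Mathematics and Computer Science, Institute of Business Administration (IBA), Karachi, Pakistan

AI summary: Proposes the TCAR-Gen framework, which combines query-conditioned graph neural networks, temporal evidence fusion, and chain-of-trees reasoning to support temporal reasoning and multi-source evidence fusion in question answering over historical crime narratives, outperforming existing RAG methods.

Chinese abstract (AI-generated)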

检索增强生成系统在回答历史犯罪案件叙述的复杂问题时，在时间推理和证据融合方面存在困难。现有方法要么独立于查询语义进行检索，要么无法连贯地整合多个证据来源。我们提出时间上下文增强检索生成（TCAR-Gen），一个结合查询条件图神经网络、时间证据融合和树链推理的框架，以将答案生成基于检索到的证据。在维多利亚犯罪日记基准上，TCAR-Gen在Recall@5上达到0.3738，在包括多跳推理和反事实问题在内的七种查询类型上优于Vanilla RAG、Temporal RAG、GraphRAG-C和GraphRAG-T。消融研究表明，上下文图、时间惩罚机制和查询条件是关键组件。跨五个语言模型（GPT-OSS 20B到TinyLlama 1.1B）的评估表明，TCAR-Gen在较小模型规模下保持稳健的检索覆盖，但生成质量随模型容量减少而显著下降。我们的工作表明，显式时间建模和多分支证据融合对于基于知识语料库的忠实、推理密集型问答至关重要。

English abstract

Retrieval-augmented generation systems struggle with temporal reasoning and evidence fusion when answering complex questions over historical criminal case narratives. Existing approaches either retrieve independently of query semantics or fail to integrate multiple evidence sources coherently. We propose Temporal Context Augmented Retrieval Generation (TCAR-Gen), a framework that combines query-conditioned graph neural networks, temporal evidence fusion, and chain-of-trees reasoning to ground answer generation in retrieved evidence. On the Victorian Crime Diaries benchmark, TCAR-Gen achieves 0.3738 Recall@5, outperforming Vanilla RAG, Temporal RAG, GraphRAG-C, and GraphRAG-T across seven query types including multi-hop reasoning and counterfactual questions. Ablation studies reveal that the context graph, temporal penalty mechanism, and query conditioning are critical components. Cross-model evaluation across five language models (GPT-OSS 20B to TinyLlama 1.1B) demonstrates that TCAR-Gen maintains robust retrieval coverage at smaller model scales, though generation quality degrades substantially with reduced model capacity. Our work shows that explicit temporal modelling and multi-branch evidence fusion are essential for faithful, reasoning-intensive question answering over knowledge-grounded corpora.

2606.00027 2026-06-02 cs.CL cs.AI (updated version)

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

医疗大语言模型安全性、鲁棒性和公平性评估的多领域红队框架

Andrei Marian Feier, Veysel Kocaman, Yigit Gul, Ahmet Korkmaz, Alexander Thomas, Aleksei Zakharov, Jay Gil, Mehmet Butgul, David Talby

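Affiliation: John Snow Labs Inc.

AI summary: Proposes a multi-domain red teaming framework that evaluates 11 contemporary LLMs on 690 clinical scenarios. Aggregate accuracy masks clinically meaningful risk; performance variance and worst-case failures reflect reliability better than mean accuracy, and hybrid evaluation is essential for credible safety assessment.

Comments 10 pages, 4 figures. To be presented at the Text2Story 2026 Workshop (Delft, The Netherlands, 29 March 2026); CEUR Workshop Proceedings (forthcoming). Affiliation: John Snow Labs Inc

Chinese abstract (AI-generated)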

大语言模型（LLM）在医疗领域的部署日益增多，但现有基准未能捕捉临床实践中常见的对抗性或伦理复杂条件下的模型行为。我们开发了一个多领域红队框架，评估了11个当代LLM在690个临床场景中的表现，这些场景涵盖9个领域和150多个子类别。场景包含对抗性变换，响应使用七维度评分标准进行评估，包括LLM辅助评分和人在环验证。结果揭示了显著的性能差异，平均得分范围从0.791到0.984。关键的是，几个高性能系统在个别安全关键场景中完全失败，表明平均准确率掩盖了临床意义上的风险。最高性能系统（X-BAI、GPT-5、Claude Opus 4.1）得分超过0.97且方差较低，而不同领域间性能差异显著。公平性相关任务在人口统计修改后错误率放大10-20%，人工评审员识别出自动评估遗漏的临床相关失败。我们的发现表明，性能方差和最坏情况失败比平均准确率更能提供临床意义的可靠性指标，而结合自动化与临床监督的混合评估方法对于可信安全评估至关重要。

English abstract

Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice. We developed a multi-domain red teaming framework evaluating eleven contemporary LLMs across 690 clinically grounded scenarios spanning nine domains and over 150 subcategories. Scenarios incorporated adversarial transformations, and responses were assessed using a seven-dimension rubric with LLM-assisted scoring and human-in-the-loop validation. Results revealed substantial performance variance, with mean scores ranging from 0.791 to 0.984. Critically, several high-performing systems produced complete failures in individual safety-critical scenarios, demonstrating that aggregate accuracy masks clinically meaningful risk. The highest-performing systems (X-BAI, GPT-5, Claude Opus 4.1) achieved scores above 0.97 with low variance, while performance varied significantly across domains. Equity-related tasks showed 10-20% error amplification with demographic modifications, and human reviewers identified clinically relevant failures missed by automated evaluation. Our findings demonstrate that performance variance and worst-case failures provide more clinically meaningful reliability indicators than mean accuracy alone, and that hybrid evaluation approaches combining automation with clinician oversight are essential for credible safety assessment.
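The point that a mean score can hide complete failures is easy to see in a small sketch. The scores below are invented for illustration, and the 0.5 failure cut-off and the function name are assumptions, not values from the paper's rubric.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def reliability_summary(scores: Sequence[float], failure_threshold: float = 0.5) -> Dict[str, float]:
    """Report the mean alongside spread, worst case, and the number of low-scoring scenarios."""
    return {
        "mean": mean(scores),
        "std": pstdev(scores),
        "worst": min(scores),
        "failures": sum(1 for s in scores if s < failure_threshold),
    }


# Hypothetical per-scenario scores for two models.
model_a = [1.0] * 19 + [0.0]  # higher mean, but one complete failure
model_b = [0.9] * 20          # lower mean, no failures

summary_a = reliability_summary(model_a)
summary_b = reliability_summary(model_b)
```

Ranked by mean alone, model A comes out ahead; the worst-case and failure-count columns are what expose its single safety-critical miss.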

2606.00026 2026-06-02 cs.CL (updated version)

Cognitive-Linguistic Indicators of Depression in Online Communities: Analysed by DistilBERT and Holographic Reduced Representation

在线社区中抑郁的认知语言指标：基于DistilBERT和全息简化表示的分析

Brian Van Steen

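Affiliation: School of Computing, University of Leeds

AI summary: Combines cognitive-linguistic features with DistilBERT embeddings in a hybrid model (DistilBERT + HRR) to detect depression in Reddit posts, reaching a macro F1 of 0.94 versus 0.80 for a TF-IDF baseline.

Chinese abstract (AI-generated)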

本文研究将认知基础的语言特征与基于transformer的嵌入相结合是否能改善在线文本中抑郁的自动检测。基于Beck的抑郁认知理论，研究提取认知扭曲作为可测量特征，包括第一人称代词密度、绝对化词汇以及Reddit中抑郁相关和对照社区帖子中的负面情绪。使用Kaggle Reddit自杀和抑郁检测数据集的一个子集，比较了两个分类流程：作为基线的TF-IDF嵌入与朴素贝叶斯，以及一个混合模型，该模型将DistilBERT句子嵌入与编码认知语言特征的全息简化表示（HRR）向量拼接，然后进行逻辑回归。混合DistilBERT HRR模型的宏F1得分为0.94，而TF-IDF基线为0.80，5折交叉验证F1从0.83提高到0.92，AUC从0.958提高到0.981。

English abstract

This paper investigates whether combining cognitively grounded linguistic features with transformer-based embeddings improves automated detection of depression in online text. Using Beck's Cognitive Theory of Depression, the study extracts cognitive distortions as measurable features, including first-person pronoun density, absolutist words, and negative emotion in Reddit posts from depression-related and control communities. Using a subset of the Kaggle Reddit Suicide and Depression Detection dataset, two classification pipelines are compared: a TF-IDF embedding with Naive Bayes as a baseline, and a hybrid model that concatenates DistilBERT sentence embeddings with Holographic Reduced Representation (HRR) vectors encoding the cognitive-linguistic features, followed by Logistic Regression. The hybrid DistilBERT HRR model achieves a macro F1 score of 0.94 versus 0.80 for the TF-IDF baseline, with 5-fold cross validation F1 improving from 0.83 to 0.92, and AUC from 0.958 to 0.981.
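The hybrid input described here can be sketched with the standard HRR binding operation, circular convolution. The sketch assumes each feature gets a random role vector bound to a shared filler vector and scaled by the feature value; that encoding, the dimension of 32, the feature names, and the zero-filled placeholder embedding are illustrative assumptions, not the paper's configuration.

```python
import math
import random
from typing import Dict, List, Sequence


def circular_convolution(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """HRR binding: c[i] = sum_j a[j] * b[(i - j) mod n]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def random_hrr_vector(dim: int, rng: random.Random) -> List[float]:
    """Elements drawn from N(0, 1/dim), the usual choice for HRR vectors."""
    scale = 1.0 / math.sqrt(dim)
    return [rng.gauss(0.0, scale) for _ in range(dim)]


def encode_features(features: Dict[str, float], dim: int = 32, seed: int = 0) -> List[float]:
    """Superpose value-weighted role-filler bindings into one trace vector."""
    rng = random.Random(seed)               # fixed seed keeps role vectors stable across posts
    filler = random_hrr_vector(dim, rng)    # shared filler vector (assumed encoding)
    trace = [0.0] * dim
    for name in sorted(features):           # sorted order ties each role vector to a feature name
        role = random_hrr_vector(dim, rng)
        bound = circular_convolution(role, filler)
        trace = [t + features[name] * x for t, x in zip(trace, bound)]
    return trace


def hybrid_vector(sentence_embedding: Sequence[float], features: Dict[str, float],
                  dim: int = 32) -> List[float]:
    """Concatenate a sentence embedding with the HRR trace of the features."""
    return list(sentence_embedding) + encode_features(features, dim)


features = {"first_person_density": 0.12, "absolutist_ratio": 0.03, "negative_emotion": 0.40}
embedding = [0.0] * 768  # placeholder standing in for a DistilBERT sentence embedding
x = hybrid_vector(embedding, features)
```

The resulting vector `x` is what a downstream classifier such as logistic regression would consume; the role vectors stay aligned across posts only if every post supplies the same feature names.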

2606.00023 2026-06-02 cs.CL cs.AI cs.LG (updated version)

TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models

TrustLDM：语言扩散模型的可信度基准测试

Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, Yisen Wang

Affiliations: State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; CISPA Helmholtz Center for Information Security; School of EECS, Peking University; Institute for Artificial Intelligence, Peking University

AI summary: Addresses the trustworthiness of language diffusion models (LDMs) with the TrustLDM benchmark, which evaluates safety, privacy, and fairness across architectures and malicious contexts, and develops TrustLDM-Auto, an automatic evaluation framework for identifying vulnerable configurations.

Chinese abstract (AI-generated)

语言扩散模型（LDM）的快速发展挑战了自回归模型在语言处理中的主导地位。然而，其灵活、任意顺序的解码策略不仅实现了快速解码速度，还可能带来新的可信度挑战。为了更好地理解其流程背后的风险，我们引入了一个针对LDM的全面可信度基准（TrustLDM），评估不同LDM架构在多种静态后上下文类别下的安全性、隐私性和公平性。我们的实证结果表明，尽管LDM在仅使用用户提示时通常表现出较强的可信度，但当恶意后上下文附加到掩码响应时，其对齐行为明显下降。我们进一步观察到，较长的上下文不一定产生更强的影响，解码顺序和生成长度都会影响评估结果。最后，我们提出了TrustLDM-Auto，一个利用LDM解码灵活性自动识别脆弱配置的评估框架，揭示了所有评估模型和维度上的显著可信度弱点。我们的工作可能有助于社区构建更可信的LDM。我们的代码可在https://github.com/PKU-ML/TrustLDM获取。

English abstract

The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto-regressive competitors in language processing. However, their flexible, any-order decoding strategies not only enable fast decoding speed but also potentially bring new trustworthiness challenges. To better understand the risks behind their pipelines, we introduce a comprehensive trustworthiness benchmark tailored to LDMs (TrustLDM), evaluating safety, privacy, and fairness across different LDM architectures with multiple categories of static post contexts. Our empirical results show that although LDMs generally exhibit strong trustworthiness with only the user prompts, their alignment behavior degrades noticeably when the malicious post contexts are attached to the masked responses. We further observe that longer contexts do not necessarily induce stronger effects, and both decoding order and generation length affect the evaluation outcomes. Finally, we propose TrustLDM-Auto, an automatic evaluation framework that leverages LDM decoding flexibility to systematically identify vulnerable configurations, revealing substantial trustworthiness weaknesses across all evaluated models and dimensions. Our work may potentially help the community build more trustworthy LDMs. Our code is available at https://github.com/PKU-ML/TrustLDM.

2606.00022 2026-06-02 cs.CL cs.AI (updated version)

lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation

lmfaoooo at SemEval-2026 Task 1: 幽默是受众。面向约束幽默生成的偏好建模

Alexey Tikhonov, Alexey Ivanov

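Affiliations: Inworld.AI Berlin, Germany; OpenAI; Mountain View, CA

AI summary: For constrained humor generation, proposes a "generate many candidates, select by preference" strategy that trains a preference model on human pairwise comparisons; the system placed 1st, 1st, and 2nd in the English, Chinese, and Spanish subtasks of MWAHAHA.

Comments 5 pages. Accepted for SEMEVAL 2026

Chinese abstract (AI-generated)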

幽默生成仍然困难，不仅因为生成流畅、新颖的笑话很难，而且因为“有趣”取决于受众，监督信号嘈杂——偏好随受众、语境和文化而变化，标注者一致性通常较低。在本文中，我们描述了用于SemEval-2026 Task-1（MWAHAHA）的系统，该任务专注于在显式约束下进行幽默生成。任务通过1对1竞技场式比较中的人类偏好判断来评估提交的系统。我们采用“生成多个->选择最佳”策略。首先，我们通过多步提示、模型集成和多样性导向解码为每个实例生成多样化的候选池。其次，我们使用偏好模型选择输出，该模型通过从人类比较中学习（而非绝对趣味性分数）来近似“读者”。为支持该方法，我们发布了通过幽默竞技场原型收集的2.5K人类成对判断。我们进一步提出一个可解释的流程，将标注的比较转换为偏好模型。在三个偏好数据集上，我们的模型一致优于基线，并表现出更强的跨领域迁移。最后，我们将学到的偏好模型应用于MWAHAHA设置中的候选排序，并发布中间产物（候选池和排序）以促进后续工作。我们的系统在MWAHAHA的英语和汉语子任务中排名第一，在西班牙语子任务中排名第二。

English abstract

Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because "funny" is audience-dependent and supervision is noisy -- preferences vary with audience, context, and culture, and annotator agreement is often low. In this paper, we describe our system for the SemEval-2026 Task-1 (MWAHAHA), which focuses on humor generation under explicit constraints. The task evaluates submitted systems via human preference judgments in 1-on-1 arena-style comparisons. We adopt a "generate-many -> select-best" strategy. First, we generate a diverse pool of candidates per instance using multi-step prompting, model ensembling, and diversity-oriented decoding. Second, we select outputs using a preference model that approximates a "reader" by learning from human comparisons rather than absolute funniness scores. To support this approach, we release 2.5K human pairwise judgments collected through the Humor Arena prototype. We further propose an interpretable pipeline that converts labeled comparisons into a preference model. Across three preference datasets, our models consistently outperform baselines and show stronger cross-domain transfer. Finally, we apply the learned preference model to rank candidates for the MWAHAHA setting and release intermediate artifacts (candidate pools and rankings) to facilitate follow-up work. Our system ranked 1st in the English and Chinese subtasks of MWAHAHA and 2nd in the Spanish subtask.
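The "generate-many -> select-best" idea with a preference model learned from pairwise judgments can be illustrated with a Bradley-Terry fit. The abstract does not say which preference model the authors use, so Bradley-Terry is a stand-in chosen here for illustration; the function names and the toy judgments are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Sequence, Tuple


def fit_bradley_terry(comparisons: Iterable[Tuple[Hashable, Hashable]],
                      n_iter: int = 500) -> Dict[Hashable, float]:
    """Estimate item strengths from (winner, loser) pairs with the classic iterative update."""
    wins = defaultdict(float)
    pair_counts = defaultdict(int)
    items = set()
    for winner, loser in comparisons:
        if winner == loser:
            continue  # a self-comparison carries no information
        items.update((winner, loser))
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1
    if not items:
        return {}
    strength = {item: 1.0 / len(items) for item in items}
    for _ in range(n_iter):
        updated = {}
        for item in items:
            denom = 0.0
            for pair, count in pair_counts.items():
                if item in pair:
                    (other,) = pair - {item}
                    denom += count / (strength[item] + strength[other])
            # Floor keeps never-winning items strictly positive so later divisions stay defined.
            updated[item] = max(wins[item] / denom, 1e-12)
        total = sum(updated.values())
        strength = {item: value / total for item, value in updated.items()}
    return strength


def select_best(candidates: Sequence[Hashable], strength: Dict[Hashable, float]) -> Hashable:
    """Pick the candidate with the highest estimated strength (unseen candidates score 0)."""
    return max(candidates, key=lambda c: strength.get(c, 0.0))


# Hypothetical pairwise judgments over three candidate jokes, as (winner, loser).
judgments = ([("A", "B")] * 3 + [("B", "A")] + [("A", "C")] * 2 + [("C", "A")]
             + [("B", "C")] * 2 + [("C", "B")])
strengths = fit_bradley_terry(judgments)
best = select_best(["A", "B", "C"], strengths)
```

A model of this kind only ranks items that appeared in the comparisons; scoring unseen candidates, as the MWAHAHA setting requires, would need a model over text features rather than per-item strengths.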

2606.00021 2026-06-02 cs.CL cs.AI cs.LG (updated version)

"Înţelegi Româneşte?" 罗马尼亚语视觉语言模型的构建配方

Mihai Masala, Marius Leordeanu, Mihai Dascalu, Traian Rebedea

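Affiliation: National University of Science and Technology POLITEHNICA Bucharest

AI summary: Systematically studies the full pipeline for building a Romanian vision-language model (VLM): translating English corpora, ablating the effects of the vision backbone, language backbone, and OCR data, and building the culturally native evaluation set HoraVQA. Romanian-adapted models outperform same-sized and even larger models.

Chinese abstract (AI-generated)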

视觉语言模型（VLM）很大程度上遵循纯文本LLM的发展轨迹，在英文基准测试中表现出色，但在低资源语言上性能急剧下降，因为既缺乏大规模图像-文本语料库，也缺乏基于文化的评估。我们提出了一项针对罗马尼亚语构建语言特定VLM的系统研究，涵盖了从数据构建到架构选择的完整流程。我们将已有的英文VLM训练和评估语料库翻译成罗马尼亚语，对文本注释和图像内文本应用机器翻译，在保留视觉基础的同时调整文本内容。利用这些数据，我们训练并消融了一系列VLM，以隔离以下因素的贡献：（i）不同规模和预训练的视觉骨干，（ii）从多语言到罗马尼亚语适配LLM的语言骨干，以及（iii）OCR风格的图像-文本数据。我们进一步整理了HoraVQA，一个基于罗马尼亚日常场景的文化本地化评估集。罗马尼亚语适配的VLM始终优于同尺寸的对应模型，并且在所有评估基准上甚至超越了下一个更大尺寸类别的模型。

English abstract

Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluations exist. We present a systematic study of building a language-specific VLM for Romanian, covering the full pipeline from data construction to architectural choices. We translate established English VLM training and evaluation corpora into Romanian, applying machine translation to textual annotations and to in-image text, preserving visual grounding while adapting the textual content. Using this data, we train and ablate a series of VLMs to isolate the contribution of (i) vision backbones of varying scale and pretraining, (ii) language backbones from multilingual to Romanian-adapted LLMs, and (iii) OCR-style image-text data. We further curate HoraVQA, a culturally native evaluation set grounded in Romanian everyday scenes. Romanian-adapted VLMs consistently outperform their same-sized counterparts and, across all evaluated benchmarks, even surpass models from the next larger size category.

2605.31086 2026-06-02 cs.CL cs.IR (updated version)

Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory

超越静态对话：对现实、异构和演化长期记忆的基准测试

Han Zhang, Zihao Tang, Xin Yu, Xiao Liu, Yeyun Gong, Haizhen Huang, Yan Lu, Weiwei Deng, Feng Sun, Qi Zhang, Hanfang Yang

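Affiliations: Center for Applied Statistics, Renmin University of China; School of Statistics, Renmin University of China; Microsoft

AI summary: Existing LLM memory benchmarks use dialogues that lack long-term semantic consistency, rely on static personas, and ignore heterogeneous data streams. The proposed RHELM benchmark uses user profiles and a LOOP module to generate realistic dialogues with dynamic temporal evolution and long-term coherence, integrates heterogeneous external sources, and evaluates multi-source aggregation and real-world contextual reasoning.

Chinese abstract (AI-generated)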

在现有的大语言模型（LLM）记忆基准中，评估的对话会话通常缺乏长期语义一致性，且底层人物角色趋于扁平化和静态化。此外，在现实场景中，用户与助手之间的交互涉及更多样化、异构的数据流，例如文档和电子邮件。这些缺陷严重限制了当前评估的现实性和有效性。为了解决这些限制，我们引入了RHELM（现实、异构和演化的长期记忆）。通过精心设计的用户画像和新型LOOP（规划-展开-演化-剪枝）模块，我们在多样化的交互场景中构建了展现动态时间演化和长期连贯性的现实对话。关键的是，这些对话与用户时间事件轨迹同步的异构外部源深度融合。由此产生的基准涵盖了跨越七种查询类型的具有挑战性的问答对，每个问题映射到至少27个关键记忆特征之一，这些特征在我们看来是当前研究中必要但尚未充分探索的。在全文模型、检索增强生成（RAG）方法和代表性记忆框架上的全面实验表明，当代方法在复杂的现实环境中仍然暴露出关键弱点，特别是在解决多源聚合和现实上下文推理方面。

English abstract

In existing memory benchmarks for Large Language Models (LLMs), the evaluated dialogue sessions often lack long-term semantic consistency, and the underlying personas tend to be flat and static. Furthermore, in real-world scenarios, interactions between users and assistants involve more diverse, heterogeneous data streams, such as documents and emails. These shortcomings significantly limit the realism and effectiveness of current evaluations. To address these limitations, we introduce RHELM (Realistic, Heterogeneous, and Evolving Long-term Memory). Driven by meticulously crafted user profiles and a novel LOOP (pLan-rOllout-evOlve-Prune) module, we construct realistic dialogues across diverse interaction scenarios that exhibit dynamic temporal evolution and long-term coherence. Crucially, these dialogues are deeply integrated with heterogeneous external sources synchronized with the user's temporal event trajectory. The resulting benchmark encompasses challenging question-answer pairs spanning seven inquiry types, with each question mapping to at least one of 27 critical memory characteristics that we identify as essential yet underexplored in current research. Comprehensive experiments across full-context models, retrieval-augmented generation (RAG) methods, and representative memory frameworks reveal that contemporary approaches still expose critical weaknesses in complex, real-world settings, particularly in resolving multi-source aggregation and real-world contextual reasoning.

2605.30848 2026-06-02 cs.CR cs.CL (updated version)

LLM Anonymization Against Agentic Re-Identification

LLM匿名化对抗智能体重识别

Ziwen Li, Jianing Wen, Tianshi Li

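Affiliations: Khoury College of Computer Sciences; Northeastern University

AI summary: Proposes the AURA framework, which uses a mask-reconstruct approach to decouple privacy localization from utility retention and applies adversarial privacy and utility checks to resist web-search-based agentic re-identification while preserving the contextual utility of the text.

Comments 32 pages, 7 figures

Chinese abstract (AI-generated)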

具有网络搜索功能的智能体LLM改变了文本匿名化的威胁模型：弱上下文线索可能成为重识别的交叉引用证据，但这些细节也承载着文本的下游分析价值。现有的防御措施要么移除显式标识符，要么扰动文本以获得形式化隐私，要么针对非网络推理模型测试改写后的文本，而未充分探索抵抗智能体网络搜索重识别与效用保留之间的操作区域。我们引入了AURA（Anonymization with Utility-Retention Adaptation），一个LLM驱动的掩码-重构框架，将隐私定位与效用保留重构解耦，并通过对抗性隐私和效用保留检查选择候选方案。我们在真实用户访谈转录上评估AURA，使用由网络搜索智能体发起的重识别攻击，以及基于受访者档案事实、编码本事实和联合上下文效用网格的效用评估。我们的结果表明，AURA通过使用自适应隐私范围增强对智能体重识别的抵抗，以及使用掩码-重构匿名化方法在固定隐私范围内更好地保留上下文效用，从而改善了隐私-效用边界。

English abstract

Agentic LLMs with web search change the threat model for text anonymization: weak contextual cues can become cross-referenceable evidence for re-identification, yet those same details also carry downstream analytic value of the text. Existing defenses either remove explicit identifiers, perturb text for formal privacy, or test rewritten text against non-web inference models, leaving underexplored the operating region between resistance to agentic web-search re-identification and utility retention. We introduce AURA (Anonymization with Utility-Retention Adaptation), an LLM-powered mask-reconstruct framework that decouples privacy localization from utility-preserving reconstruction and selects candidates with adversarial privacy and utility-retention checks. We evaluate AURA on real-user interview transcripts using re-identification attacks carried out by web-search agents, along with a utility evaluation based on interviewee-profile facts, codebook facts, and the joint contextual utility grid. Our results show that AURA improves the privacy-utility frontier by using adaptive privacy scope to strengthen resistance to agentic re-identification and using a mask-reconstruct anonymization method to better preserve contextual utility under fixed privacy scope.

2605.30743 2026-06-02 cond-mat.mtrl-sci cs.CE cs.CL version update

CAREF: Calibration-Aware Regularization for Explanation Faithfulness without Rationale Supervision

Naphat Nithisopa, Teerapong Panboonyuen

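Affiliations: MARSAIL; Chulalongkorn University; PBYAIL (Panboonyuen AI Lab)

AI summary: Proposes CAREF, a framework that jointly optimizes predictive accuracy and explanation faithfulness through calibration-aware regularization without rationale supervision, achieving the best performance on four NLE benchmarks with few trainable parameters.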

Comments 10 pages

2605.27458 2026-06-02 cs.CV cs.AI cs.CL cs.LG version update

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

Yongjin Cui, Xiaohui Fan, Huajun Chen

Affiliations: Zhejiang University

AI summary: Addresses the multi-source information fusion challenge posed by heterogenous attention structures (such as co-attention) in Transformers by proposing a generic interpretation method, and uses an experimental analysis paradigm to give semantic and logical interpretations of representative models.

Abstract

The Transformer has significantly propelled the development of artificial intelligence, and of agents as well. We categorize the attention structures of Transformers into two types based on the source of the input information: homogenous and heterogenous attention structures. Heterogenous attention structures, with co-attention as a typical example, process information from different sources. They are the foundation for Transformer models to achieve more complex functions and integrate information from more modalities. Whether for research purposes or policy requirements, interpreting Transformer models with heterogenous attention structures is an important task, and the fusion of information from different sources brings new challenges. Our work has two parts: method and experimentation. For the method, we propose an interpretation method for Transformer models with heterogenous attention structures. For the experiments, based on our experimental analysis paradigm, we interpret the operating mechanisms of representative models and conduct semantic and logical interpretation.

2605.27000 2026-06-02 cs.CL cs.AI version update

Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning

Yilong Li, Suman Banerjee, Tong Che

Affiliations: University of Wisconsin–Madison; NVIDIA Research

AI summary: Proposes Coordinated Pass@K Policy Optimization (CPPO), in which a planner generates multiple strategies and coordinates solver attempts, addressing the redundant reasoning paths caused by repeated sampling in code generation and significantly improving pass@4 on multiple benchmarks.

Comments Code reasoning; pass@K optimization; coordinated planning; verifiable rewards; strategy diversity

Abstract

Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical metric. Yet the standard policy class draws $K$ independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste the budget on redundant rollouts. This failure is costly in competitive programming, where many problems admit multiple distinct algorithmic strategies and pass@$K$ requires only one correct attempt. We propose Coordinated Pass@$K$ Policy Optimization (CPPO), which turns pass@$K$ generation into joint exploration over strategies: a planner emits a tuple of $K{=}4$ alternative high-level methods, and a shared solver attempts one solution per method. CPPO trains this joint policy with a multiplicative planner reward, $R_{\mathrm{plan}} = J_ψ\cdot R_{\mathrm{out}}$, assigning credit only to valid strategy tuples that lead to verifier-confirmed pass@$K$ success. Across APPS, CodeContests, and LiveCodeBench-v6, CPPO improves pass@$4$ over direct sampling, planning baselines, planner-only SFT, and pass@$K$-oriented RL under the same $K{=}4$ solver-attempt budget, with statistically significant gains on six of nine model--benchmark cells. The largest single gain is $+0.16$ on Qwen3.5-9B LiveCodeBench-v6 over the strongest baseline, PKPO ($0.588 \rightarrow 0.748$; paired bootstrap, $p < 0.05$).
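
To make the metric and the reward concrete, here is a minimal Python sketch. It is not the authors' code. `pass_at_k` is the commonly used unbiased estimator of pass@K from n sampled attempts with c correct. `planner_reward` assumes, purely for illustration, that $J_ψ$ is a validity score in [0, 1] for the strategy tuple and that $R_{\mathrm{out}}$ is 1 when at least one of the K solver attempts passes the verifier; the abstract does not define either term in detail.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def planner_reward(tuple_validity: float, attempts_passed: list[bool]) -> float:
    """Multiplicative reward R_plan = J * R_out (assumed forms, see lead-in)."""
    # Assumption: R_out is a binary pass@K outcome over the solver attempts.
    r_out = 1.0 if any(attempts_passed) else 0.0
    return tuple_validity * r_out
```

Because the reward is a product, an invalid strategy tuple earns nothing even when a solver attempt happens to pass, and a valid tuple earns nothing when every attempt fails, which matches the abstract's statement that credit goes only to valid tuples that lead to verified success.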

2605.26444 2026-06-02 cs.CL version update

NanoSpec: Accelerating Speculative Decoding using Minimalist In-Context Vocabularies

Zhiyang Chen, Daliang Xu, Yinyuan Zhang, Chenghua Wang, Mengwei Xu, Yun Ma

Affiliations: Institute for Artificial Intelligence, Peking University, Beijing, China; Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China; State Key Laboratory of Networking and Switching Technology; School of Computer Science, Beijing University of Posts and Telecommunications

AI summary: Proposes NanoSpec, a training-free method that dynamically builds a minimalist, context-aware vocabulary (fewer than 3k tokens on average) and, combined with system-algorithm co-design, cuts speculative-decoding draft time by 51.6% on average for a 1.17-1.29x end-to-end speedup.

Abstract

The massive vocabulary sizes of large language models, often exceeding 100k tokens, impose a computational bottleneck on the final linear projection layer during speculative decoding. Existing vocabulary pruning solutions rely on static or coarsely-grained sub-vocabularies that necessitate large active sizes ($\sim$30k) to maintain draft quality. We propose NanoSpec, a novel training-free approach that breaks this trade-off by dynamically constructing a minimalist, context-aware active vocabulary for each generation step. Leveraging the inherent temporal locality of language generation, NanoSpec achieves high coverage while slashing the average vocabulary size by over $40\times$ (to $<$3k tokens) without requiring any auxiliary trained parameters. To realize the theoretical benefits of such high sparsity on modern hardware, we introduce a system-algorithm co-design that overcomes the inefficiencies of sparse memory access through asynchronous gathering and GPU-resident state management. As a complementary plug-and-play module, NanoSpec cuts draft time by an average of 51.6\%, delivering a $1.17$-$1.29\times$ end-to-end speedup over the state-of-the-art speculative decoding methods EAGLE-2 and EAGLE-3 across 7 tasks and outperforming complex training-based pruning baselines.
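
The idea of restricting the final projection to an active vocabulary can be sketched in a few lines of plain Python. The abstract does not say how the active set is built, so `active_vocab` below uses a hypothetical rule (token ids seen in a recent context window plus a small always-kept set); the function names and the window size are illustrative assumptions, not NanoSpec's implementation.

```python
def active_vocab(context_ids, always_keep, window=256):
    """Hypothetical rule: ids in the last `window` context tokens plus a fixed set."""
    return sorted(set(context_ids[-window:]) | set(always_keep))


def sparse_logits(hidden, lm_head, active_ids):
    """Compute logits only for the active rows of the output projection."""
    return {
        tok: sum(h * w for h, w in zip(hidden, lm_head[tok]))
        for tok in active_ids
    }


def draft_token(hidden, lm_head, active_ids):
    """Greedy draft token chosen among the active ids only."""
    logits = sparse_logits(hidden, lm_head, active_ids)
    return max(logits, key=logits.get)
```

The cost of the projection drops from one dot product per vocabulary entry to one per active id. The trade-off is coverage: if the token the full model would pick is outside the active set, the draft cannot propose it, which is why the abstract stresses high coverage at small active sizes.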

2605.26436 2026-06-02 cs.CL cs.AI version update

Targeted Remasking: Replacing Token Editing with Token-to-Mask Refinement in Discrete Diffusion Language Models

Lin Yao

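AI summary: Targets the limitations of token editing in discrete masked diffusion language models with a training-free token-to-mask remasking method that resets suspected erroneous tokens to the mask state so that the diffusion process re-predicts them under cleaner context, markedly improving performance on tasks such as mathematics.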

Comments This paper has been significantly revised, expanded, and superseded by a more comprehensive version available at arXiv:2604.18738. The authors have chosen to withdraw this version to avoid overlap and direct readers to the updated work

Abstract

Discrete masked diffusion language models such as LLaDA generate text through iterative denoising, where mask tokens are progressively replaced with predicted tokens. LLaDA2.1 introduced a Token-to-Token (T2T) editing mechanism that accelerates generation by directly replacing committed tokens suspected of being incorrect. However, we identify fundamental limitations of T2T editing: it couples error detection with replacement, pollutes the generation context with potentially incorrect tokens, and introduces a train-inference noise mismatch where systematic model-generated errors differ from the random perturbations seen during training. We propose Token-to-Mask (T2M) remasking, a training-free, drop-in replacement for T2T editing that resets suspected erroneous tokens back to the mask state, allowing the diffusion process to re-predict them under cleaner context. We design and empirically validate three complementary error detection strategies -- probability-based, trigger-mirrored, and temporal-difference-based -- and provide a unified theoretical analysis showing that T2M remasking purifies the generation context, converts systematic inference errors back to the model's native mask noise type, and enables delayed commitment for joint multi-position optimization. Comprehensive experiments across 12 benchmarks spanning knowledge, reasoning, mathematics, coding, and instruction following show that T2M generally improves performance on tasks requiring precise token-level output, with the largest gain on mathematics (+5.92% on CMATH). Error analysis on CMATH reveals that the dominant failure mode is last-mile token corruption -- where correct reasoning produces a corrupted final answer -- and that T2M repairs 59.4% of such cases.
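
A minimal sketch of T2M remasking with only the probability-based detector may help fix the idea. The threshold value, the `MASK` sentinel, and the function name are assumptions for illustration; the abstract does not give the paper's detector details or its remasking schedule.

```python
MASK = "<mask>"  # hypothetical mask sentinel


def t2m_remask(tokens, confidences, threshold=0.3):
    """Reset committed tokens whose confidence is below `threshold` to MASK.

    Already-masked positions are left untouched. Returns the new sequence and
    the indices that were remasked, so a later denoising step can re-predict them.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must align")
    out, remasked = [], []
    for i, (tok, p) in enumerate(zip(tokens, confidences)):
        if tok != MASK and p < threshold:
            out.append(MASK)
            remasked.append(i)
        else:
            out.append(tok)
    return out, remasked
```

Unlike T2T editing, this step only detects: it never writes a replacement token, so the next denoising pass conditions on masks rather than on a possibly wrong substitute.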

2605.26431 2026-06-02 cs.CL stat.AP version update

Self-Trained Verification for Self-Improvement at Training and Test Time

Chen Henry Wu, Aditi Raghunathan

Affiliations: arXiv

AI summary: Proposes self-trained verification (STV), which trains the verifier to imitate a version of itself that sees the reference solution, addressing the verifier bottleneck in self-improvement; it substantially improves verification-refinement loops at test time and, through verifier-in-the-loop training (ViL), further improves the generator at training time.

Abstract

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with STV verifier's feedback inside the V-R loop - a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification. Website: https://ar-forum.github.io/stv-webpage

2605.30280 2026-06-02 cs.RO cs.AI cs.CL version update

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Haoyang Li, Anzhe Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Ruizhe Chen, Zhaohai Li, Chenxu Lü, Zhibo Yang, Tao Yu, Xionghui Chen

Affiliations: Qwen Team

AI summary: Proposes Qwen-VLA, a unified embodied foundation model built on a DiT-based action decoder that, through large-scale joint pretraining and embodiment-aware prompting, casts manipulation, navigation, and trajectory prediction into one action-and-trajectory prediction framework and generalizes across tasks, environments, and robot embodiments.

Comments 34 pages

Abstract

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.

2605.30237 2026-06-02 cs.IR cs.CL cs.LG version update

GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases

Yicheng Tao, Yiqun Wang, Xiangchen Song, Xin Luo, Kai Liu, Jie Liu

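Affiliations: Department of Electrical Engineering and Computer Science, University of Michigan; Department of Computational Medicine and Bioinformatics, University of Michigan

AI summary: Proposes the GRASP framework, which achieves state-of-the-art retrieval performance on semi-structured knowledge bases through three stages: plan-guided graph retrieval, conditional fusion, and reranking.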

2605.29987 2026-06-02 cs.LG cs.CL version update

MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment

Dang Nguyen Hong, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham

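Affiliations: National University of Singapore

AI summary: Targets dimensional redundancy and spectral collapse in multi-scale representations with the MIC framework, which combines isotropic subspace alignment, soft-collapse regularization, and spectral isotropy regularization with a self-distillation objective to produce semantically dense, highly discriminative representations that significantly outperform baselines under high compression.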

Comments Accepted at the GlobalSouthML Workshop at ICML 2026. 8 pages, 2 figures

2605.04583 2026-06-02 cs.CL version update

TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script)

Mullosharif K. Arabov, Karomatullo Habibullozoda, Nurali Shirinov

Affiliations: Institute of Computational Mathematics and Information Technologies; Kazan Federal University; Bokhtar State University named after Nosiri Khusrav

AI summary: Introduces TajikNLP, an open-source Python library providing the first complete processing pipeline for Tajik text, including cleaning, tokenization, part-of-speech tagging, stemming, and lemmatization, and releases four linguistic datasets to advance NLP research on low-resource Cyrillic-script languages.

Comments Accepted to CLIB 2026

Abstract

The Tajik language, written in Cyrillic script, remains severely under-resourced in terms of publicly available natural language processing (NLP) toolkits, hindering both linguistic research and applied development. This paper introduces TajikNLP, an open-source Python library that provides the first comprehensive pipeline for processing authentic Tajik text while preserving the original Cyrillic orthography. The library implements a modular architecture centered around a unified Doc object, enabling sequential application of components for cleaning, normalization, tokenization (including subword BPE), morphemic segmentation, part-of-speech tagging, stemming, lemmatization, and sentence splitting. A novel unified morphology engine is introduced, offering controlled and deep analysis modes that significantly improve handling of Tajik's agglutinative nominal and verbal inflections. The release further incorporates a lexicon-based sentiment analyser and pre-trained Word2Vec/FastText embeddings loaded directly from the Hugging Face Hub. To ensure reproducibility and facilitate future research, four accompanying linguistic datasets -- a POS-tagged corpus (52.5k entries), a sentiment lexicon (3.5k entries), a toponym gazetteer (5.6k entries), and a personal names dataset (3.8k entries) -- have been openly published under permissive licenses. The library's reliability is validated by an extensive test suite of 616 automated tests achieving 93% source code coverage. TajikNLP thus establishes a foundational technological infrastructure for Tajik language processing, lowering the barrier to entry for both academic and industrial applications in low-resource Cyrillic-script environments.

2512.00837 2026-06-02 cs.CL version update

Mix-MoE: Improving Multilingual Machine Translation in Large Language Models with a Mixed Mixture-of-Experts

Bo Li, Tianyu Dong, Shaolin Zhu, Deyi Xiong

Affiliations: School of Software, Tsinghua University; College of Intelligence and Computing, Tianjin University

AI summary: Proposes the Mix-MoE framework, which splits MoE layers into language model experts and machine translation experts and enhances the routing mechanism with Fourier Transform features, addressing parameter interference when fine-tuning large language models for multilingual machine translation.

Comments Accepted by TASLP

DOI: 10.1109/TASLPRO.2026.3698013

Abstract

Large Language Models (LLMs) have shown great promise in multilingual machine translation (MT), even with limited bilingual supervision. However, fine-tuning LLMs with parallel corpora presents major challenges, namely parameter interference. To address these issues, we propose Mix-MoE, a mixed Mixture-of-Experts framework designed to train LLMs for multilingual MT. Our framework operates in two distinct stages: (1) post-pretraining with MoE on monolingual corpora, and (2) post-pretraining with MoE on parallel corpora. Crucially, we divide the MoE layers into two specialized groups: Language Model Experts (LM Experts) and Machine Translation Experts (MT Experts). LM Experts are designed to capture and retain the monolingual knowledge learned by the pre-trained LLM. MT Experts, on the other hand, are specifically trained to acquire and store bilingual translation knowledge. Furthermore, to facilitate effective interaction between these specialized experts and leverage potential underlying structural patterns in text, we introduce a routing mechanism enhanced by Fourier Transform features derived from model representations. The experimental results demonstrate that Mix-MoE excels in multilingual MT, significantly outperforming existing baselines and showing notable progress in mitigating parameter interference.

2605.24528 2026-06-02 cs.AI cs.CL cs.LG version update

Hypothesis Generation and Inductive Inference in Children and Language Models

Jeffrey Qin, Wasu Top Piriyakulkij, Zhuangfei Gao, Mia Radovanovic, Jessica Sommerville, Kevin Ellis, Marta Kryven

Affiliations: Computer Science University of Waterloo; Department of Computer Science Cornell University; Department of Computer Science Dalhousie University; Department of Psychology University of Toronto

AI summary: Using an inductive-inference Box Task formalized as program induction with Bayesian particle-based inference, compares hypothesis generation and evidence seeking under uncertainty in children and LLM-based agents, finding that the two adapt similarly to environmental structure but differ in information-seeking costs and inductive biases.

Abstract

Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over the state of the world itself. Which computational principles underpin human inference under such conditions, and do LLM-based agents exhibit similar behavior given matching constraints? We address these questions using an inductive inference Box Task in which participants, human children and LLM-based agents, infer a latent cause through sequential interaction with an uncertain environment. We formalize this task as program induction with Bayesian particle-based inference, admitting two complementary interpretations: (1) as a constraint satisfaction process over hypotheses, and (2) as a program synthesis problem in which hypotheses are executable programs evaluated against evidence. Using the constraint-based formulation, we show that children's behavior is best explained by a combination of subjective evidence reliability and online hypothesis generation, accounting for both their evidence-seeking patterns and their dissociation between task completion and rule generalization. Using the program synthesis formulation, we treat LLM-based agents as model organisms: controllable systems that allow systematic manipulation of task conditions. Across backends, LLM-based agents replicate children's responses to changes in evidence reliability and observability, including discounting unreliable evidence, seeking to resolve partial information, and dissociating between task completion and causal generalization. At the same time, LLM-based agents tend to over-observe and over-comply with instructions relative to children. These results suggest that while children and LLM-based agents adapt similarly to environmental structure, their information-seeking behavior exhibits distinct underlying costs and inductive biases.
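
The combination of executable hypotheses and subjective evidence reliability can be illustrated with a generic Bayesian reweighting step. This is a sketch under stated assumptions, not the paper's model: hypotheses are plain Python predicates over an action, and `reliability` is taken to be the probability that an observed outcome reflects the true rule, so a mismatching hypothesis is down-weighted rather than eliminated.

```python
def update_weights(weights, hypotheses, action, outcome, reliability=0.8):
    """One Bayesian reweighting step over named hypotheses.

    weights: dict name -> prior weight
    hypotheses: dict name -> callable(action) -> bool (predicted outcome)
    reliability: assumed probability that the observed outcome is trustworthy
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be in [0, 1]")
    unnormalized = {}
    for name, prior in weights.items():
        agrees = hypotheses[name](action) == outcome
        likelihood = reliability if agrees else 1.0 - reliability
        unnormalized[name] = prior * likelihood
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("all hypotheses have zero posterior weight")
    return {name: w / total for name, w in unnormalized.items()}
```

With `reliability=1.0` the update reduces to hard constraint satisfaction (inconsistent hypotheses get zero weight); lower values discount unreliable evidence, the behavior the abstract reports for both children and LLM-based agents.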

2605.18838 2026-06-02 cs.LG cs.AI cs.CL version update

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

Adil Amin

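Affiliations: ZEHEN Labs

AI summary: Analyzing 63 base models, finds that reasoning and truthfulness shift from anticorrelated to positively correlated at a specific scale threshold, and identifies internal mechanisms such as an output-projection bottleneck and zero competing attention heads.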

Comments 15 pages, 8 figures, 2 tables. Companion paper: "The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next." ( https://doi.org/10.48550/arXiv.2605.18840). Code: https://github.com/adilamin89/cape-scaling. Dashboard: https://zehenlabs.com/cape/

Abstract

Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale N_c, capabilities anticorrelate (r = -0.989, p = 4 x 10^{-5} nonparametric permutation test); above it, they cooperate. N_c ~ 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift N_c independently: curated training eliminated the coupling dip between Qwen generations (0.025 to 0.830 at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals -- only public benchmark scores across a model family. The cooperative regime extends to the frontier (r = +0.72, 34 models, 10 labs). A proof-of-concept intervention confirms the bottleneck is exploitable: adding a single truth-direction vector at the identified layer corrects 60% of misaligned outputs in the tax phase with zero retraining -- a surgical, per-inference correction that requires no weight modification. Code, data, an open-source steering CLI for any open-weight model, and an interactive dashboard for phase diagnosis are released: https://zehenlabs.com/cape/.
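
Since the diagnostic needs only benchmark scores across a model family, its core statistic can be sketched with the standard library. The code below is a generic Pearson correlation with a two-sided permutation test, not the authors' pipeline; which benchmarks stand in for "reasoning" and "truthfulness" is left to the caller.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_perm=10000, seed=0):
    """Two-sided permutation p-value for |r|, shuffling ys against xs."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A negative r within a family's smaller models and a positive r among its larger ones would correspond to the anticorrelated and cooperative regimes the abstract describes.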

2603.09095 2026-06-02 cs.CL cs.CV version update

Herculean: An Agentic Benchmark for Financial Intelligence

Xueqing Peng, Zhuohan Xie, Yupeng Cao, Haohang Li, Lingfei Qian, Yan Wang, Vincent Jim Zhang, Huan He, Xuguang Ai, Linhai Ma, Ruoyu Xiang, Yueru He, Yi Han, Shuyao Wang, Yuqing Guo, Mingyang Jiang, Yilun Zhao, Youzhong Dong, Xiaoyu Wang, Yankai Chen, Ye Yuan, Qiyuan Zhang, Fuyuan Lyu, Haolun Wu, Yonghan Yang, Zichen Zhao, Yuyang Dai, Fan Zhang, Rania Elbadry, Ayesha Gull, Muhammad Usman Safder, Nuo Chen, Fengbin Zhu, Tianshi Cai, Zimu Wang, Polydoros Giannouris, Yuechen Jiang, Zhiwei Liu, Mohsinul Kabir, Yuyan Wang, Yixiang Zheng, Yangyang Yu, Weijin Liu, Wenbo Cao, Anke Xu, Peng Lu, Jerry Huang, Mingquan Lin, Prayag Tiwari, Yijia Zhao, Víctor Gutiérrez-Basulto, Xiao-Yang Liu, Kaleb E Smith, Jiahuan Pei, Arman Cohan, Jimin Huang, Yuehua Tang, Alejandro Lopez-Lira, Xi Chen, Xue Liu, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou

Affiliations: The Fin AI; Yale University; Columbia University; Stevens Institute of Technology; NVIDIA; New York University; Georgia Institute of Technology; University of Florida; MBZUAI; Université de Montréal; University of Minnesota; University of Massachusetts Boston; National Institute of Advanced Industrial Science and Technology; University of Liverpool; Vrije Universiteit Amsterdam; National University of Singapore; Halmstad University; University of Manchester; Cardiff University; McGill University; Mila – Quebec AI Institute

AI summary: Proposes Herculean, the first benchmark of agentic financial intelligence covering four representative workflows (trading, hedging, market insight, and auditing). Evaluating heterogeneous agent systems in a standardized MCP skill environment, it finds that agents do relatively well on trading and market insight but fall well short on hedging and auditing, which require long-horizon coordination, state consistency, and structured verification.

LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection

Adam S. Jovine, Tinghan Ye, Francis Bahk, Jingjing Wang, Matthew Ford, David B. Shmoys, Peter I. Frazier

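Affiliations: Cornell University; Georgia Institute of Technology

AI summary: Proposes the LISTEN framework, which uses an LLM as a decision agent that learns a user's implicit preferences from natural language by iteratively refining an internal preference model (the parametric LISTEN-U or the non-parametric LISTEN-T) to perform multi-objective selection.

Comments Accepted at IJCAI-ECAI 2026 (the 35th International Joint Conference on Artificial Intelligence)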

Cornerstone or Stumbling Block? Decoding Rock Tokens in On-Policy Distillation

Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, Zhao Yang

Affiliations: University of Maryland, Baltimore County; Case Western Reserve University; Arizona State University; VU Amsterdam

AI summary: Studies "Rock Tokens", tokens with persistently high loss in on-policy distillation, finding that they account for a large share of the gradient yet contribute little functionally, and proposes that bypassing them can streamline alignment.

Abstract

While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that--as the most direct signal of student-teacher mismatch under OPD's per-token KL objective--should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18\% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a disproportionately large share of total gradient norms, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens provide negligible functional contribution to the model's actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these ``stumbling blocks'' can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.
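
One way to picture how such tokens might be flagged is sketched below. Everything here is an assumption for illustration: the KL direction (student relative to teacher), the fixed threshold, and the simplification that per-position KL is measured on the same evaluation sequence at several late checkpoints. The abstract does not give the paper's actual criterion.

```python
from math import log


def token_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one position; the direction is an assumption."""
    return sum(
        s * log((s + eps) / (t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )


def rock_token_positions(kl_per_checkpoint, threshold):
    """Positions whose KL stays above `threshold` at every late checkpoint.

    kl_per_checkpoint: list of per-position KL lists, one list per checkpoint.
    """
    if not kl_per_checkpoint:
        return []
    n = len(kl_per_checkpoint[0])
    return [
        i
        for i in range(n)
        if all(ckpt[i] > threshold for ckpt in kl_per_checkpoint)
    ]
```

Positions returned by `rock_token_positions` could then be down-weighted or skipped in the loss, which is the kind of non-uniform token weighting the abstract argues for.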


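2605.09098 2026-06-02 cs.CL (updated version)

Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation

动态元度量：面向机器翻译评估的源句条件加权

Luke Zhang, Justin Vasselli, Aditya Khan, York Hay Ng, En-Shiun Annie Lee

Affiliations: University of Toronto, Canada; Nara Institute of Science and Technology, Japan; Ontario Tech University, Canada

AI summary: Proposes the Dynamic Meta-Metrics (DMM) framework, which combines existing metrics conditioned on the source sentence to improve machine translation evaluation. Experiments show that an MLP combiner outperforms linear and Gaussian-process ensembles, and a soft-conditioning extension yields further gains.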

Comments 5 pages, ACL SRW 2026

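2602.16571 2026-06-02 cs.CL (updated version)

AtomEval: Validity-Aware Atomic Evaluation of Adversarial Claim Rewriting in Fact-Checking

Hongyi Cen, Mingxin Wang, Yule Liu, Jingyi Zheng, Hanze Jia, Tan Tang

Affiliations: Zhejiang University; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes the AtomEval protocol, which uses atomic decomposition and a preservation gate to separate valid verifier evasion from proposition-changing rewrites, and introduces the VASR metric to address the inflation of conventional ASR.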

AI中文摘要

大型语言模型（LLM）可以重写被驳斥的声明以规避基于证据的事实核查器，但当重写改变、削弱或纠正了本应保留的虚假命题时，传统的攻击成功率（ASR）可能会被夸大。我们引入了AtomEval，一种用于固定证据对抗性声明重写的有效性感知评估协议。AtomEval将声明表示为“主体-关系-客体-修饰语”（SROM）原子，应用单向保留门将有效的验证器规避与改变命题的重写分开，并报告有效性感知攻击成功率（VASR），该指标仅统计保留原始虚假命题的验证器规避重写。AtomEval进一步提供细粒度诊断，解释命题级失败和非最小有效重写。在FEVER被驳斥声明重写任务上，AtomEval揭示并解释了ASR膨胀：许多明显的攻击通过改变、削弱或纠正本应保留的命题来欺骗验证器。通过使受攻击命题的保留变得明确且可测量，AtomEval为评估必须在验证器规避与命题保留之间取得平衡的对抗性重写器提供了稳定的评估目标。

英文摘要

Large language models (LLMs) can rewrite refuted claims to evade evidence-based fact verifiers, but conventional attack success rate (ASR) can be inflated when rewrites change, weaken, or correct the false proposition they are supposed to preserve. We introduce AtomEval, a validity-aware evaluation protocol for fixed-evidence adversarial claim rewriting. AtomEval represents claims as subject-relation-object-modifier (SROM) atoms, applies a one-way preservation gate to separate valid verifier evasion from proposition-changing rewrites, and reports validity-aware attack success rate (VASR), which counts only verifier-evasive rewrites that preserve the original false proposition. AtomEval further provides fine-grained diagnostics that explain both proposition-level failures and non-minimal valid rewrites. On FEVER refuted-claim rewriting, AtomEval exposes and explains ASR inflation: many apparent attacks fool the verifier by altering, weakening, or correcting the proposition they should preserve. By making attacked-proposition preservation explicit and measurable, AtomEval provides a stable evaluation target for evaluating adversarial rewriters that must balance verifier evasion with proposition preservation.
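The difference between ASR and VASR described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the AtomEval implementation: the `Rewrite` record, its two boolean flags, and the choice of all rewrites as the denominator are assumptions, and the SROM-based preservation gate is reduced to a precomputed flag.

```python
from dataclasses import dataclass

@dataclass
class Rewrite:
    evades_verifier: bool        # the verifier no longer refutes the rewritten claim
    preserves_proposition: bool  # hypothetical flag: the preservation gate passed

def attack_success_rate(rewrites):
    """Conventional ASR: fraction of rewrites that evade the verifier."""
    if not rewrites:
        return 0.0
    return sum(r.evades_verifier for r in rewrites) / len(rewrites)

def validity_aware_asr(rewrites):
    """VASR: count only evasive rewrites that also preserve the proposition."""
    if not rewrites:
        return 0.0
    valid = sum(r.evades_verifier and r.preserves_proposition for r in rewrites)
    return valid / len(rewrites)

# Toy data: three rewrites evade the verifier, but two of them do so
# by changing the proposition they were supposed to preserve.
rewrites = [
    Rewrite(True, True),
    Rewrite(True, False),
    Rewrite(True, False),
    Rewrite(False, True),
]
asr = attack_success_rate(rewrites)
vasr = validity_aware_asr(rewrites)
```

On this toy data ASR is 0.75 while VASR is 0.25, which is the kind of gap the abstract refers to as ASR inflation.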


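2602.17513 2026-06-02 cs.CL (updated version)

Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

弥合领域鸿沟：从MIMIC-III到产科的监督式与零样本临床章节分割

Baris Karacan, Barbara Di Eugenio, Patrick Thornton

Affiliations: Massachusetts Institute of Technology

AI summary: By building an obstetrics notes dataset, evaluating transformer-based supervised models, and making the first comparison with zero-shot large language models, the work finds that supervised models perform strongly in-domain but degrade markedly out-of-domain, whereas zero-shot models show robust out-of-domain adaptability once hallucinated section headers are corrected.

Comments 14 pages. Camera-ready version accepted at LREC 2026; includes minor revisions and an appendix. To appear in the conference proceedings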

DOI: 10.63317/4ktoypuohtci
Journal ref: Proceedings of the 2026 Language Resources and Evaluation Conference (LREC 2026), pages 2594-2607, Palma, Spain. ELRA 2026

AI中文摘要

临床自由文本笔记包含重要的患者信息。它们被组织成带标签的章节；识别这些章节已被证明支持临床决策和下游NLP任务。在本文中，我们通过三个关键贡献推进临床章节分割。首先，我们整理了一个新的去标识化、带章节标签的产科笔记数据集，以补充公共语料库（如MIMIC-III）所涵盖的医学领域，现有的大多数分割方法都是在这些语料库上训练的。其次，我们在MIMIC-III的一个精选子集（域内）和新的产科数据集（域外）上系统评估了基于Transformer的监督模型用于章节分割。第三，我们首次将医学章节分割的监督模型与零样本大语言模型进行直接比较。我们的结果表明，虽然监督模型在域内表现强劲，但其性能在域外大幅下降。相比之下，一旦纠正了幻觉章节标题，零样本模型展现出稳健的域外适应性。这些发现强调了开发特定领域临床资源的重要性，并指出零样本分割是将医疗NLP应用于研究充分的语料库之外的一个有前景的方向，前提是适当管理幻觉。

英文摘要

Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.


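2604.21511 2026-06-02 cs.IR cs.CL (updated version)

From Tokens to Concepts: Leveraging SAE for SPLADE

从词元到概念：利用稀疏自编码器增强SPLADE

Yuxuan Zong, Mathias Vast, Basile Van Cooten, Laure Soulier, Benjamin Piwowarski

Affiliations: Sorbonne Université, CNRS, ISIR, Paris, France

AI summary: To address polysemy, synonymy, and related problems caused by SPLADE's reliance on the backbone vocabulary, proposes replacing the vocabulary with a semantic concept space learned by a sparse autoencoder, achieving comparable performance with improved efficiency.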

Comments 11 pages, 3 figures, 9 tables. To appear at SIGIR 2026

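2604.19786 2026-06-02 cs.CL (updated version)

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai

Affiliations: University of Science and Technology of China; Meituan LongCat Interaction Team; Nanjing University; Fudan University; Huazhong University of Science and Technology

AI summary: To address the credit-assignment difficulty caused by sparse rewards in on-policy reinforcement learning, proposes the SCOPE framework, whose dual-path adaptive weighting handles correct and incorrect trajectories separately for signal-calibrated distillation enhancement, yielding average relative improvements of 11.42% in Avg@32 and 7.30% in Pass@32 on six reasoning benchmarks.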

AI中文摘要

在线策略强化学习已成为大型语言模型推理对齐的主导范式，但其稀疏的结果级奖励使得令牌级信用分配异常困难。在线策略蒸馏（OPD）通过引入来自教师模型的密集令牌级KL监督缓解了这一问题，但通常对所有rollout均匀应用这种监督，忽略了信号质量的根本差异。我们提出信号校准的在线策略蒸馏增强（SCOPE），一种双路径自适应训练框架，根据正确性将在线策略rollout路由到两个互补的监督路径。对于错误轨迹，SCOPE执行教师困惑度加权的KL蒸馏，优先考虑教师展现出真正纠正能力的实例，同时降低不可靠指导的权重。对于正确轨迹，它应用学生困惑度加权的MLE，将强化集中在能力边界上的低置信度样本，而不是过度强化已掌握的样本。两条路径都采用组级归一化来自适应校准权重分布，考虑不同提示的内在难度差异。在六个推理基准上的大量实验表明，SCOPE在Avg@32和Pass@32上分别比竞争基线平均相对提升11.42%和7.30%，证明了其一致的有效性。

英文摘要

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.


2604.04937 2026-06-02 cs.AI cs.CL (updated version)

Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya

Pramana: 通过 Navya-Nyaya 微调大型语言模型进行认知推理

Sharath Sathish

Affiliations: University of York

AI summary: Proposes Pramana, which fine-tunes LLMs on Navya-Nyaya, a 2,500-year-old Indian logical framework, using structured six-phase reasoning to close the epistemic gap and improve traceability.

Comments 52 pages + appendices, comprehensive treatment of Navya-Nyaya computational formalization

AI中文摘要

大型语言模型能生成流畅文本，但在系统推理方面存在困难，常常产生自信但无根据的幻觉。当苹果研究人员向数学问题添加无关背景时，LLM 性能下降了 65% (Apple Machine Learning Research)，暴露出表面推理下脆弱的模式匹配。这种认知差距，即无法将主张建立在可追溯证据上的能力，限制了 AI 在需要论证的领域的可靠性。我们引入 Pramana，一种新颖的方法，通过在 Navya-Nyaya 逻辑（一种 2500 年历史的印度推理框架）上进行微调，教导 LLM 显式的认识论方法论。与通用的思维链提示不同，Navya-Nyaya 强制执行结构化的六阶段推理：SAMSHAYA（怀疑分析）、PRAMANA（证据源识别）、PANCHA AVAYAVA（包含普遍规则的五段论）、TARKA（反事实验证）、HETVABHASA（谬误检测）和 NIRNAYA（区分知识与假设的确定）。这种逻辑与认识论的整合提供了标准推理方法所缺乏的认知支架。我们在 55 个 Nyaya 结构化的逻辑问题（约束满足、布尔 SAT、多步演绎）上微调了 Llama 3.2-3B 和 DeepSeek-R1-Distill-Llama-8B。第一阶段在保留评估上实现了 100% 的语义正确性，尽管严格格式遵循率仅为 40%，这表明即使结构执行不完美，模型也能内化推理内容。消融研究表明格式提示和温度对性能有关键影响，且不同阶段的最优配置不同。我们在 Hugging Face 上发布所有模型、数据集和训练基础设施，以促进关于 AI 推理认识论框架的进一步研究。

英文摘要

Large language models produce fluent text but struggle with systematic reasoning, often hallucinating confident but unfounded claims. When Apple researchers added irrelevant context to mathematical problems, LLM performance degraded by 65% (Apple Machine Learning Research), exposing brittle pattern-matching beneath apparent reasoning. This epistemic gap, the inability to ground claims in traceable evidence, limits AI reliability in domains requiring justification. We introduce Pramana, a novel approach that teaches LLMs explicit epistemological methodology by fine-tuning on Navya-Nyaya logic, a 2,500-year-old Indian reasoning framework. Unlike generic chain-of-thought prompting, Navya-Nyaya enforces structured 6-phase reasoning: SAMSHAYA (doubt analysis), PRAMANA (evidence source identification), PANCHA AVAYAVA (5-member syllogism with universal rules), TARKA (counterfactual verification), HETVABHASA (fallacy detection), and NIRNAYA (ascertainment distinguishing knowledge from hypothesis). This integration of logic and epistemology provides cognitive scaffolding absent from standard reasoning approaches. We fine-tune Llama 3.2-3B and DeepSeek-R1-Distill-Llama-8B on 55 Nyaya-structured logical problems (constraint satisfaction, Boolean SAT, multi-step deduction). Stage 1 achieves 100% semantic correctness on held-out evaluation despite only 40% strict format adherence, revealing that models internalize reasoning content even when structural enforcement is imperfect. Ablation studies show format prompting and temperature critically affect performance, with optimal configurations differing by stage. We release all models, datasets, and training infrastructure on Hugging Face to enable further research on epistemic frameworks for AI reasoning.


2602.00906 2026-06-02 cs.LG cs.AI cs.CL cs.DS cs.IT math.IT (updated version)

Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing

幻觉是空间最优性的结果：成员测试的率失真定理

Anxin Guo, Jingwei Li

Affiliations: Computer Science Department, Northwestern University; Department of IEOR, Columbia University

AI summary: Formalizes hallucination as a membership testing problem and establishes a rate-distortion theorem, showing that under limited capacity the information-theoretically optimal strategy necessarily assigns high confidence to some non-facts, which produces hallucination.

Comments ICML 2026

AI中文摘要

大型语言模型通常对缺乏可推断模式的“随机事实”以高置信度产生幻觉。我们将此类事实的记忆形式化为一个成员测试问题，统一了布隆过滤器的离散误差指标与LLM的连续对数损失。通过分析在事实在可能主张的宇宙中稀疏的情况下，我们建立了一个率失真定理：最优记忆效率由事实与非事实得分分布之间的最小KL散度刻画。这一理论框架在理想化设置下为幻觉提供了独特的解释：即使有最优训练、完美数据和简化的“封闭世界”设置，有限容量下信息论最优策略不是放弃或遗忘，而是对某些非事实赋予高置信度，从而导致幻觉。我们在合成数据和真实数据上实证验证了这一理论，表明幻觉作为有损压缩的自然结果持续存在。同一定理恢复并锐化了布隆型滤波器的经典空间下界，确定了两侧滤波器遗留的加性常数。

英文摘要

Large language models often hallucinate with high confidence on "random facts" that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination under an idealized setting: even with optimal training, perfect data, and a simplified "closed world" setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on both synthetic and real-world data, showing that hallucinations persist as a natural consequence of lossy compression. The same theorem recovers and sharpens classical space lower bounds for Bloom-type filters, pinning down an additive constant left open for two-sided filters.
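The membership-testing analogy in this abstract can be made concrete with a standard Bloom filter, sketched below in Python. This is the textbook data structure, not the paper's construction; the sizes, the SHA-256-based hashing scheme, and the item names are assumptions chosen for illustration. The filter never rejects a stored fact, but with limited space it accepts some non-facts.

```python
import hashlib

class BloomFilter:
    """Textbook Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# Hypothetical universe: 200 stored facts and 2000 plausible non-facts.
facts = [f"fact-{i}" for i in range(200)]
non_facts = [f"claim-{i}" for i in range(2000)]

bloom = BloomFilter(num_bits=1024, num_hashes=3)
for fact in facts:
    bloom.add(fact)

false_negatives = sum(fact not in bloom for fact in facts)
false_positives = sum(claim in bloom for claim in non_facts)
```

Every stored fact is recognized, while a nonzero number of non-facts is accepted with full confidence; shrinking `num_bits` raises that number, which mirrors the capacity trade-off the abstract describes.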


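2603.03312 2026-06-02 cs.CL cs.AI cs.HC eess.AS q-bio.NC (updated version)

Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding

逃离BLEU陷阱：一种基于信号锚定与解耦语义引导的脑电解码文本框架

Yuchen Wang, Haonan Wang, Yu Guo, Honglong Yang, Xiaomeng Li

Affiliations: The Hong Kong University of Science and Technology; Shenzhen Loop Area Institute

AI summary: To address semantic bias, signal neglect, and the BLEU trap in EEG-to-text decoding, proposes SemKey, a multi-stage framework that uses decoupled semantic objectives (sentiment, topic, length, surprisal) and an Active Retrieval Decoding mechanism to force generation to rely on the signal rather than linguistic priors. It also establishes an evaluation protocol based on retrieval and distribution metrics such as Fréchet distance, mitigates hallucination, and reaches state-of-the-art performance.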

AI中文摘要

从非侵入性脑电信号中解码自然语言是一项有前景但充满挑战的任务。然而，当前最先进的模型仍受限于三个基本问题：语义偏差（输出退化为通用语言模板）、信号忽略（模型严重依赖大语言模型先验，即使在缺乏有意义信号时也能生成流畅文本）以及“BLEU陷阱”（高频停用词虚增n-gram指标，掩盖真正语义保真度的缺失）。为解决这些挑战，我们超越传统的端到端流水线，提出SemKey——一种新颖的多阶段框架，通过四个解耦的语义目标（情感、主题、长度和意外性）强制进行基于信号的生成。我们直接从脑电嵌入中提取这些语义锚点，然后通过主动检索解码机制统一它们，迫使大语言模型将其令牌生成锚定在神经信号上，而非默认使用语言先验。此外，我们通过建立全面的评估协议（使用严格的检索和基于分布的度量，如Fréchet距离）打破BLEU陷阱。大量实验表明，SemKey有效缓解了对噪声输入的幻觉，并在这些鲁棒协议上达到了最先进的性能。代码将在论文被接收后发布于https://github.com/xmed-lab/SemKey。

英文摘要

Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental issues: Semantic Bias, where outputs collapse into generic linguistic templates; Signal Neglect, where models rely heavily on LLM priors to hallucinate fluent text even in the absence of meaningful signals; and the "BLEU Trap", where high-frequency stopwords inflate n-gram metrics, masking a lack of true semantic fidelity. To resolve these challenges, we move beyond conventional end-to-end pipelines and propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We extract these semantic anchors from EEG embeddings directly, then unify them with an Active Retrieval Decoding mechanism, compelling the LLM to ground its token generation in the neural signals rather than defaulting to linguistic priors. Furthermore, we break the BLEU Trap by establishing a comprehensive evaluation protocol using rigorous retrieval and distribution-based metrics such as Fréchet Distance. Extensive experiments demonstrate that SemKey effectively mitigates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed-lab/SemKey.


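2604.01562 2026-06-02 cs.SD cs.AI cs.CL cs.CY cs.HC (updated version)

A Tree Interpretation of Arc-Standard Dependency Derivations

Zihao Huang, Ai Ka Lee, Jungyeul Park

Affiliations: The University of British Columbia; Korea Advanced Institute of Science & Technology

AI summary: Proposes that arc-standard dependency derivations can be interpreted as the incremental construction of lexicalized ordered trees with contiguous yields, and proves that admitting this representation is equivalent to projectivity, giving a tree-theoretic interpretation of projective dependency parsing.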

AI中文摘要

投影依存树上的弧标准推导可以解释为增量构建具有连续产出的词汇化有序树。每个\textsc{shift}、\textsc{leftarc}和\textsc{rightarc}转移对应于确定的树更新，生成的有序树唯一地确定了推导引入的依存弧。我们证明这种表示并非任意编码：单中心依存树允许这种连续有序表示当且仅当它是投影的。因此，该提议是基于推导的而非基于转换的，因为有序对象是在转移序列本身上定义的，而不是通过转换完成的依存图获得的。这为弧标准解析提供了树论解释，其中投影依存推导隐式地构建了可恢复的短语结构风格的有序树。对于非投影输入，可以通过伪投影提升和逆解码使用该解释。一个小型实现研究证实，映射的推导可以在现有的神经转移解析器中执行。

英文摘要

Arc-standard derivations over projective dependency trees can be interpreted as the incremental construction of lexicalized ordered trees with contiguous yields. Each SHIFT, LEFTARC, and RIGHTARC transition corresponds to a deterministic tree update, and the resulting ordered tree uniquely determines the dependency arcs introduced by the derivation. We show that this representation is not an arbitrary encoding: a single-headed dependency tree admits such a contiguous ordered representation if and only if it is projective. The proposal is therefore derivational rather than conversion-based, since the ordered object is defined over the transition sequence itself rather than obtained by transforming a completed dependency graph. This gives a tree-theoretic interpretation of arc-standard parsing, in which projective dependency derivations implicitly construct recoverable constituency-style ordered trees. For non-projective inputs, the interpretation can be used through pseudo-projective lifting and inverse decoding. A small implementation study confirms that the mapped derivations are executable in an existing neural transition-based parser.
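To make the three transitions concrete, here is a minimal Python sketch of the standard arc-standard system together with a crossing-arc projectivity check. It is an illustration under assumptions, not the paper's construction: each stack item is stored as a hypothetical `(head, first, last)` triple so the contiguous span of every partial tree is visible, words are numbered from 1, and 0 stands for an artificial root.

```python
def run_arc_standard(n_words, transitions):
    """Execute shift / leftarc / rightarc; return ({dependent: head}, final span)."""
    stack = []                             # items: (head, first, last) over a contiguous span
    buffer = list(range(1, n_words + 1))   # word positions, left to right
    heads = {}
    for t in transitions:
        if t == "shift":
            w = buffer.pop(0)
            stack.append((w, w, w))
        elif t in ("leftarc", "rightarc"):
            right = stack.pop()
            left = stack.pop()
            # leftarc: the top item heads the one below it; rightarc: the reverse.
            head, dep = (right, left) if t == "leftarc" else (left, right)
            heads[dep[0]] = head[0]
            # The merged item covers both adjacent spans.
            stack.append((head[0], left[1], right[2]))
        else:
            raise ValueError(f"unknown transition: {t}")
    if buffer or len(stack) != 1:
        raise ValueError("derivation is incomplete")
    root, first, last = stack[0]
    heads[root] = 0
    return heads, (first, last)

def is_projective(heads):
    """True if no two arcs cross, counting the arc from the artificial root 0."""
    arcs = [(min(h, d), max(h, d)) for d, h in heads.items()]
    return not any(a < c < b < e for (a, b) in arcs for (c, e) in arcs)

# "She reads books": word 2 heads words 1 and 3.
heads, span = run_arc_standard(3, ["shift", "shift", "leftarc", "shift", "rightarc"])
```

The derivation yields the head map `{1: 2, 3: 2, 2: 0}` over the span `(1, 3)`, and `is_projective` accepts it. A tree such as `{1: 3, 2: 0, 3: 2}`, where the arc from word 3 to word 1 crosses the root arc, is rejected.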


2603.22999 2026-06-02 cs.CL (updated version)

PaperVoyager: Building Interactive Web with Visual Language Models

PaperVoyager：利用视觉语言模型构建交互式网页

Dasen Dai, Biao Wu, Meng Fang, Wenhao Wang

Affiliations: Vast Intelligence Lab; UTS; University of Liverpool

AI summary: Proposes the PaperVoyager framework, which automatically converts research papers into executable interactive web systems and, by explicitly modeling mechanisms and interaction logic, significantly improves the quality of the generated systems.

Comments 9 pages, 5 figures

AI中文摘要

视觉语言模型的最新进展使得自主代理能够进行复杂推理、工具使用和文档理解。然而，现有的文档代理主要将论文转化为静态产物，如摘要、网页或幻灯片，这对于涉及动态机制和状态转换的技术论文来说是不够的。在这项工作中，我们提出了一种论文到交互式系统的代理，将研究论文转化为可执行的交互式网页系统。给定一篇PDF论文，该代理无需人工干预即可进行端到端处理，包括论文理解、系统建模和交互式网页合成，使用户能够操作输入并观察动态行为。为了评估这一任务，我们引入了一个包含19篇研究论文的基准测试，每篇论文都配有专家构建的交互式系统作为真实参考。我们进一步提出了PaperVoyager，一个结构化生成框架，在合成过程中显式建模机制和交互逻辑。实验表明，PaperVoyager显著提高了生成的交互式系统的质量，为交互式科学论文理解提供了新的范式。

英文摘要

Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.


2603.25640 2026-06-02 cs.DL cs.CL (updated version)

RenoBench: A Citation Parsing Benchmark

RenoBench: 引文解析基准

Parth Sarin, Juan Pablo Alperin, Adam Buttrick, Dione Mentis

Affiliations: Graduate School of Education, Stanford University; Public Knowledge Project, Simon Fraser University; DataCite, Hannover, Germany; California Digital Library, University of California Office of the President

AI summary: Because existing citation-parsing evaluations are often not generalizable, based on synthetic data, or not publicly available, proposes RenoBench, a public benchmark collected from four publishing ecosystems. Automated validation and feature-based sampling yield a multilingual, multi-type dataset on which a range of parsing systems are evaluated. Language models, especially when fine-tuned, perform strongly, laying a foundation for reproducible, standardized evaluation.

Comments Presented as a conference paper at CiteX 2026

AI中文摘要

准确解析引文对于机器可读的学术基础设施是必要的。但是，尽管对该问题持续关注，现有的评估技术通常不可泛化、基于合成数据或不可公开获取。我们引入了RenoBench，一个用于引文解析的公共领域基准，来源于四个出版生态系统（SciELO、Redalyc、Public Knowledge Project和Open Research Europe）发布的PDF。从161,000条带注释的引文开始，我们应用自动验证和基于特征的采样，生成了一个包含10,000条引文的数据集，涵盖多种语言、出版类型和平台。然后，我们评估了多种引文解析系统，并报告了字段级别的精确率和召回率。我们的结果显示语言模型表现强劲，尤其是在微调后。RenoBench实现了对引文解析系统的可重复、标准化评估，并为推进自动化引文解析和元科学研究提供了基础。

英文摘要

Accurate parsing of citations is necessary for machine-readable scholarly infrastructure. But, despite sustained interest in this problem, existing evaluation techniques are often not generalizable, based on synthetic data, or not publicly available. We introduce RenoBench, a public domain benchmark for citation parsing, sourced from PDFs released on four publishing ecosystems: SciELO, Redalyc, the Public Knowledge Project, and Open Research Europe. Starting from 161,000 annotated citations, we apply automated validation and feature-based sampling to produce a dataset of 10,000 citations spanning multiple languages, publication types, and platforms. We then evaluate a variety of citation parsing systems and report field-level precision and recall. Our results show strong performance from language models, particularly when fine-tuned. RenoBench enables reproducible, standardized evaluation of citation parsing systems, and provides a foundation for advancing automated citation parsing and metascientific research.
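Field-level precision and recall, the metrics this abstract reports, can be sketched as follows in Python. The exact-match rule, the field names, and the toy records are assumptions for illustration; the abstract does not state how RenoBench matches predicted and gold field values. The two lists are assumed to be aligned, one record per citation.

```python
def field_level_scores(gold, predicted):
    """Per-field (precision, recall) under exact string match.

    gold / predicted: aligned lists of dicts mapping field name -> value.
    """
    fields = sorted({f for rec in gold + predicted for f in rec})
    scores = {}
    for field in fields:
        tp = fp = fn = 0
        for g, p in zip(gold, predicted):
            g_val, p_val = g.get(field), p.get(field)
            if p_val is not None and p_val == g_val:
                tp += 1
            else:
                if p_val is not None:   # predicted a wrong or spurious value
                    fp += 1
                if g_val is not None:   # missed or mis-parsed a gold value
                    fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[field] = (precision, recall)
    return scores

# Hypothetical gold annotations and parser output for three citations.
gold = [
    {"author": "A. Author", "year": "2020", "title": "First Paper"},
    {"author": "B. Writer", "year": "2021"},
    {"author": "C. Scholar", "year": "2019"},
]
predicted = [
    {"author": "A. Author", "year": "2002", "title": "First Paper"},
    {"author": "B. Writer"},
    {"author": "C. Scholar", "year": "2019"},
]
scores = field_level_scores(gold, predicted)
```

Here the `year` field has one correct value, one wrong value, and one omission, so its precision is 0.5 and its recall is 1/3, while `author` and `title` are perfect.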


2603.23485 2026-06-02 cs.CL cs.AI cs.CY (updated version)

Failure of contextual invariance in large language models

大型语言模型中语境不变性的失效

Sagar Kumar, Ariel Flint, Luca Maria Aiello, Andrea Baronchelli

Affiliations: Network Science Institute, Northeastern University; Center for Health Informatics Program, Boston Children’s Hospital; Dept. of Mathematics, City St George’s, University of London; IT University of Copenhagen

AI summary: Using a pronoun selection task, finds that contextually equivalent but uninformative discourse context systematically shifts LLM outputs, showing that they violate contextual invariance, with implications for bias evaluation and high-stakes applications.

AI中文摘要

标准评估实践假设，当提示嵌入语境等价的语篇中时，大型语言模型（LLM）的输出是稳定的。这里，我们在性别推断的背景下测试这一假设。使用受控的代词选择任务，我们引入最小的、理论上无信息的语篇语境，发现这会导致模型输出出现大规模、系统性的偏移。与去语境化设置中存在的文化性别刻板印象的相关性在引入语境后减弱或消失，而理论上无关的特征（如无关指代对象的代词性别）成为模型行为最具信息量的预测因子。通过默认语境性分析发现，在模型间的19%至52%的案例中，这种依赖性在考虑语境对单个输出的所有边际效应后仍然存在，并且不能归因于简单的代词重复。这些发现表明，即使在几乎相同的句法表述下，LLM的输出也违反了语境不变性，这对偏见基准测试和高风险环境中的部署具有重要影响。

英文摘要

Standard evaluation practices assume that large language model (LLM) outputs are stable when prompts are embedded in contextually equivalent discourses. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behavior. A Contextuality-by-Default analysis reveals that, in 19-52% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.


2508.09456 2026-06-02 cs.CV cs.CL cs.CR (updated version)

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

IAG: 基于输入感知的后门攻击针对VLM视觉定位

Junxian Li, Beining Xu, Simin Chen, Jiatong Li, Jingdi Lei, Haodong Zhao, Di Zhang

Affiliations: Shanghai Jiao Tong University; Fudan University; Columbia University; Hong Kong Polytechnic University; Nanyang Technological University

AI summary: Proposes IAG, which uses a text-conditioned UNet to dynamically generate input-aware triggers, realizing the first multi-target backdoor attack on VLM-based visual grounding. It achieves the best attack success rates in almost all settings across multiple models and benchmarks without harming clean performance.

Comments Accepted by CVPR 2026; Code is at https://github.com/lijunxian111/IAG

Journal ref: https://openaccess.thecvf.com/content/CVPR2026/papers/Li_IAG_Input-aware_Backdoor_Attack_on_VLM-based_Visual_Grounding_CVPR_2026_paper.pdf

AI中文摘要

近期视觉语言模型（VLM）的进展显著提升了视觉定位任务，该任务涉及根据自然语言查询在图像中定位对象。尽管取得了这些进展，基于VLM的定位系统的安全性尚未得到彻底研究。本文揭示了一个新颖且现实的安全漏洞：首个针对VLM视觉定位的多目标后门攻击。与依赖静态触发器或固定目标的先前攻击不同，我们提出了IAG，一种动态生成输入感知、文本引导触发器的方法，这些触发器以任意指定目标对象描述为条件来执行攻击。这是通过一个文本条件的UNet实现的，该网络将难以察觉的目标语义线索嵌入视觉输入，同时保持对良性样本的正常定位性能。我们进一步开发了一个联合训练目标，平衡语言能力与感知重建，以确保隐蔽性、有效性和隐秘性。在多个VLM（如LLaVA、InternVL、Ferret）和基准（RefCOCO、RefCOCO+、RefCOCOg、Flickr30k Entities和ShowUI）上的大量实验表明，IAG在几乎所有设置下都实现了比其他基线最佳的攻击成功率，同时不损害干净准确率，保持对现有防御的鲁棒性，并展现出跨数据集和模型的迁移性。这些发现强调了具有定位能力的VLM中的关键安全风险，并突出了对可信多模态理解的进一步研究的必要性。

英文摘要

Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the best ASRs compared with other baselines on almost all settings without compromising clean accuracy, maintaining robustness against existing defenses, and exhibiting transferability across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.


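2603.19453 2026-06-02 cs.CL cs.GT (updated version)

Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas

超越标量奖励：面向序列社会困境的LLM策略合成中的密集反馈

Víctor Gallego

Affiliations: Komorebi AI Technologies

AI summary: Through feedback engineering, compares the effect of sparse reward feedback and dense social-metric feedback on the programmatic policies an LLM generates for multi-agent environments, and finds that dense feedback breaks reward aliasing and improves policy performance.

Comments Accepted to NExT-Game 2026: New Frontiers in Game-Theoretic Learning, ICML 2026 Workshop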

AI中文摘要

我们研究LLM策略合成：使用语言模型迭代生成多智能体环境的程序化智能体策略。我们的框架不是通过强化学习训练神经策略，而是提示LLM生成Python策略函数，在自对弈中评估它们，并在迭代中使用性能反馈进行改进。我们研究反馈工程（在改进过程中向LLM展示哪些评估信息的设计），比较稀疏反馈（仅标量奖励）与密集反馈（奖励加上社会指标：效率、平等、可持续性、和平）。在两个经典的序列社会困境（Gathering和Cleanup）和两个前沿LLM（Claude Sonnet 4.6，Gemini 3.1 Pro）上，密集反馈在所有指标上始终匹配或超过稀疏反馈。我们通过反馈别名解释这种不对称性：当标量奖励单独将不同的失败模式映射到相同的值（例如，清洁不足与过度清洁）时，社会指标打破了别名，让LLM诊断出应该采取哪个纠正方向。因此，社会指标作为协调信号而非干扰，产生了诸如Voronoi区域划分和废物自适应清洁调度等策略。代码见https://github.com/vicgalle/llm-policies-social-dilemmas。

英文摘要

We study LLM policy synthesis: using a language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. We explain the asymmetry through feedback aliasing: when scalar reward alone maps distinct failure modes to the same value (e.g., under- vs. over-cleaning), social metrics break the alias and let the LLM diagnose which corrective direction to take. Social metrics thus function as a coordination signal rather than a distraction, yielding strategies such as Voronoi territory partitioning and waste-adaptive cleaner schedules. Code at https://github.com/vicgalle/llm-policies-social-dilemmas.


2502.15411 2026-06-02 cs.CL cs.AI (updated version)

HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

HiFi-KPI：用于从财报中提取层次化KPI的数据集

Rasmus Aavang, Giovanni Rizzi, Rasmus Bøggild, Alexandre Iolov, Mike Zhang, Johannes Bjerva

Affiliations: Department of Computer Science, Aalborg University; ALIPES ApS; University of Copenhagen; Pioneer Centre for AI

AI summary: To address the poor cross-company transferability of key performance indicators (KPIs) in earnings filings, proposes the HiFi-KPI dataset with 1.65M paragraphs and 198k hierarchical labels, and evaluates three tasks: classification, extraction, and structured extraction.

AI中文摘要

准确标注财报可以为利益相关者带来显著的短期回报。机器可读的内联可扩展商业报告语言（iXBRL）是公开财务申报的强制要求。然而，其复杂且细粒度的分类法限制了标记关键绩效指标（KPI）的跨公司可迁移性。为了解决这个问题，我们引入了层次化财务关键绩效指标（HiFi-KPI）数据集，这是一个包含165万段落和19.8万个独特层次化标签的大规模语料库，这些标签与iXBRL分类法相关联。HiFi-KPI支持多个任务，我们评估了其中三个：KPI分类、KPI提取和结构化KPI提取。为了快速评估，我们还发布了HiFi-KPI-Lite，一个手动策划的8K段落子集。在HiFi-KPI-Lite上的基线实验表明，基于编码器的模型在分类任务上达到了超过0.906的宏F1分数，而大型语言模型（LLM）在结构化提取任务上达到了0.440的F1分数。最后，定性分析显示提取错误主要与日期相关。我们在https://github.com/aaunlp/HiFi-KPI上开源了所有代码和数据。

英文摘要

Accurate tagging of earnings reports can yield significant short-term returns for stakeholders. The machine-readable inline eXtensible Business Reporting Language (iXBRL) is mandated for public financial filings. Yet, its complex, fine-grained taxonomy limits the cross-company transferability of tagged Key Performance Indicators (KPIs). To address this, we introduce the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset, a large-scale corpus of 1.65M paragraphs and 198k unique, hierarchically organized labels linked to iXBRL taxonomies. HiFi-KPI supports multiple tasks and we evaluate three: KPI classification, KPI extraction, and structured KPI extraction. For rapid evaluation, we also release HiFi-KPI-Lite, a manually curated 8K paragraph subset. Baselines on HiFi-KPI-Lite show that encoder-based models achieve over 0.906 macro-F1 on classification, while Large Language Models (LLMs) reach 0.440 F1 on structured extraction. Finally, a qualitative analysis reveals that extraction errors primarily relate to dates. We open-source all code and data at https://github.com/aaunlp/HiFi-KPI.


2603.18016 2026-06-02 cs.CL cs.AI cs.DC cs.LG (updated version)

MineDraft: A Framework for Batch Parallel Speculative Decoding

MineDraft: 批量并行推测解码框架

Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low

Affiliations: Massachusetts Institute of Technology; Toyota Research Institute; Toyota Motor Corporation

AI summary: Proposes the MineDraft framework, whose batch-parallel design overlaps draft generation with verification, significantly improving the throughput and end-to-end latency of speculative decoding.

Comments Accepted at ICML 2026

AI中文摘要

推测解码（SD）通过使用较小的草稿模型提出草稿令牌，随后由较大的目标模型验证，从而加速大型语言模型推理。然而，标准SD的性能通常受限于这些草稿和验证阶段的严格顺序执行。为解决此问题，本文提出MineDraft，一种批量并行推测解码（PSD）框架，旨在通过将草稿生成与验证重叠来有效隐藏草稿延迟。我们的理论分析表明，PSD比标准SD高效得多。MineDraft通过一种新颖的批量并行设计实现PSD，该设计维护两个请求批次，将一个批次的草稿生成与另一个批次的验证重叠。我们的实验结果显示，与标准SD相比，MineDraft在吞吐量（最高提升75%）和端到端延迟（最高降低39%）方面均有显著改进。此外，我们已将MineDraft实现为vLLM的插件，展示了其在生产级推理系统中的实用性。

英文摘要

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.
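The sequential draft-then-verify loop that this abstract calls standard SD can be sketched in Python as below. This is a toy greedy variant and does not implement MineDraft's batch-parallel overlap. The stand-in "models" `target_next` and `draft_next` are hypothetical deterministic functions, and the per-token check stands in for what a real system does in one batched target forward pass.

```python
def target_next(prefix):
    """Toy target model: deterministic next token computed from the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 7

def draft_next(prefix):
    """Toy draft model: agrees with the target unless len(prefix) is a multiple of 3."""
    tok = target_next(prefix)
    return (tok + 1) % 7 if len(prefix) % 3 == 0 else tok

def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_new, k=4):
    """Greedy speculative decoding; returns (sequence, number of verify passes)."""
    seq = list(prompt)
    limit = len(prompt) + n_new
    verify_passes = 0
    while len(seq) < limit:
        # Draft stage: propose up to k tokens autoregressively.
        draft = []
        for _ in range(min(k, limit - len(seq))):
            draft.append(draft_next(seq + draft))
        # Verify stage: accept the longest matching prefix, and on the
        # first mismatch keep the target's token and discard the rest.
        verify_passes += 1
        for tok in draft:
            expected = target_next(seq)
            seq.append(expected)
            if tok != expected:
                break
    return seq, verify_passes

baseline = greedy_decode([1, 2], 10)
spec_seq, passes = speculative_decode([1, 2], 10, k=4)
```

Because every appended token is the target's own choice, the speculative output matches plain greedy decoding while needing fewer verify passes than generated tokens. In this sequential form the draft stage still waits for each verify pass, which is the latency that batch-parallel designs aim to hide.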

2603.16142 2026-06-02 cs.CL 版本更新

Parametric Social Identity Injection and Diversification in Public Opinion Simulation

参数化社会身份注入与多样化在舆论模拟中的应用

Hexi Wang, Yujia Zhou, Bangde Du, Qingyao Ai, Yiqun Liu

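Institutions: DCST, Tsinghua University; Quancheng Laboratory; Tsinghua University; Shandong China

AI summary: To address the lack of social diversity in LLM-based public opinion simulation, proposes the Parametric Social Identity Injection (PSII) framework, which injects explicit parametric representations of demographic attributes and value orientations into intermediate hidden states, significantly improving distributional fidelity and diversity.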

Comments Accepted to KDD 2026 Research Track. Project page: https://github.com/halsayxi/PSII

DOI: 10.1145/3770855.3817926

AI中文摘要

大语言模型（LLMs）最近被用作舆论模拟的合成代理，为昂贵且缓慢的人类调查提供了一种有前景的替代方案。尽管具有可扩展性，当前基于LLM的模拟方法未能捕捉社会多样性，产生扁平化的群体间差异和跨人口群体的过度同质化响应。我们将这一限制识别为LLM隐藏表示中的多样性崩溃现象，其中不同的社会身份在各层之间变得越来越难以区分。受此观察启发，我们提出了参数化社会身份注入（PSII），这是一个通用框架，将人口属性和价值取向的显式参数化表示直接注入LLM的中间隐藏状态。与基于提示的人物角色条件化不同，PSII能够在表示层面实现细粒度且可控的身份调制。在多个开源LLM上使用世界价值观调查进行的广泛实验表明，PSII显著提高了分布保真度和多样性，减少了与真实世界调查数据的KL散度，同时增强了整体多样性。这项工作为LLM代理的表示级控制提供了新见解，并推动了可扩展的、具有多样性意识的舆论模拟。

英文摘要

Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM-based simulation methods fail to capture social diversity, producing flattened inter-group differences and overly homogeneous responses across demographic groups. We identify this limitation as a Diversity Collapse phenomenon in LLM hidden representations, where distinct social identities become increasingly indistinguishable across layers. Motivated by this observation, we propose Parametric Social Identity Injection (PSII), a general framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs. Unlike prompt-based persona conditioning, PSII enables fine-grained and controllable identity modulation at the representation level. Extensive experiments on the World Values Survey using multiple open-source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real-world survey data while enhancing overall diversity. This work provides new insights into representation-level control of LLM agents and advances scalable, diversity-aware public opinion simulation.
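The distributional-fidelity metric mentioned above, KL divergence to survey data, can be sketched as follows. This is a minimal illustration under stated assumptions: the direction KL(survey || simulated), the smoothing constant `eps`, and the example distributions are hypothetical and are not taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answer options."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of options")
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Made-up answer shares for one survey question with three options.
    survey = [0.5, 0.3, 0.2]
    collapsed = [0.9, 0.05, 0.05]   # overly homogeneous simulated responses
    diverse = [0.55, 0.27, 0.18]    # simulated responses closer to the survey
    print(kl_divergence(survey, collapsed))
    print(kl_divergence(survey, diverse))
```

A lower value means the simulated answer distribution is closer to the survey distribution; a homogenized ("collapsed") simulation scores worse.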

2603.10619 2026-06-02 cs.CL 版本更新

Disentangling Similarity and Relatedness in Topic Models

在主题模型中区分相似性和相关性

Hanlin Xiao, Yang Wang, Mauricio A. Álvarez, Rainer Breitling

Institutions: Manchester Institute of Biotechnology, The University of Manchester, UK; Department of Computer Science, The University of Manchester, UK; Department of Chemistry, The University of Manchester, UK

AI summary: Builds a large synthetic benchmark and a neural scorer to separate, from a psycholinguistic perspective, thematic relatedness from taxonomic similarity among topic words, and shows that the two dimensions affect downstream task performance differently.

Comments 26 pages, 9 figures, 18 tables

AI中文摘要

大型预训练语言模型（PLMs）的最新成功推动了它们与主题建模的整合。然而，PLM增强的主题模型不仅在性能上，而且在它们捕获的语义结构类型上也与经典共现模型（如潜在狄利克雷分配（LDA））不同。我们沿着两个心理语言学轴形式化了这种区别：主题相关性（狗/骨头）和分类相似性（狗/狼）。为了衡量主题词上的这两个轴，我们使用基于LLM的标注构建了一个大型合成词对基准，并训练了一个神经评分器。在多个语料库和模型家族中，评分器将不同的主题模型家族置于联合相似性-相关性空间中的不同位置。这两个分数进一步预测了下游任务性能：需要相似性的任务受益于富含相似性的主题，而需要相关性的任务则受益于相反的情况，并且对任一轴的过度强调都会降低与对立语义结构对齐的任务的性能。没有一个轴是普遍有益的。因此，衡量两者提供了一种实用的、与模型无关的诊断方法，用于评估主题模型捕获的语义结构。

英文摘要

The recent success of large pre-trained language models (PLMs) has motivated their integration into topic modeling. However, PLM-augmented topic models differ from classical co-occurrence models such as Latent Dirichlet Allocation (LDA) not only in performance, but also in the type of semantic structure they capture. We formalize this distinction along two psycholinguistic axes: thematic relatedness (dog/bone) and taxonomic similarity (dog/wolf). To measure both axes over topic words, we construct a large synthetic benchmark of word pairs using LLM-based annotation and train a neural scorer on it. Across multiple corpora and model families, the scorer places different topic-model families at distinct positions within the joint similarity-relatedness space. The two scores further predict downstream task performance: tasks requiring similarity benefit from similarity-rich topics, whereas tasks requiring relatedness benefit from the converse, and excessive emphasis on either axis degrades performance on tasks aligned with the opposing semantic structure. Neither axis is uniformly beneficial. Measuring both therefore provides a practical, model-agnostic diagnostic for evaluating the semantic structure captured by topic models.

2603.09692 2026-06-02 cs.LG cs.AI cs.CL 版本更新

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

ActiveUltraFeedback：使用主动学习的高效偏好数据生成

Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause

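Institutions: University of Freiburg

AI summary: Proposes ActiveUltraFeedback, an active learning pipeline that uses uncertainty estimates and two new sampling methods (DRTS and DeltaUCB) to dynamically select the most informative response pairs, matching or exceeding static baselines in downstream performance with as little as one-sixth of the annotated data.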

Comments 40 pages, 9 figures, 26 tables

AI中文摘要

基于人类反馈的强化学习（RLHF）已成为对齐大型语言模型（LLMs）的标准方法，但其有效性受到偏好数据获取高成本的瓶颈限制，尤其是在低资源和专家领域。为解决这一问题，我们引入了ACTIVEULTRAFEEDBACK，一个模块化的主动学习流水线，利用不确定性估计动态识别最具信息量的响应进行标注。我们的流水线支持系统评估标准响应选择方法以及两种新方法：DOUBLE REVERSE THOMPSON SAMPLING（DRTS）和DELTAUCB，这两种方法优先选择预测质量差距大的响应对，利用近期研究结果，即此类对为微调提供良好信号。实验表明，ACTIVEULTRAFEEDBACK生成的高质量数据集在下游性能上带来显著提升，尤其以静态基线六分之一的标注数据即可达到相当或更优的结果。我们的流水线可在https://github.com/lasgroup/ActiveUltraFeedback获取，偏好数据集可在https://huggingface.co/ActiveUltraFeedback获取。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines. Our pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and our preference datasets at https://huggingface.co/ActiveUltraFeedback.

2509.25773 2026-06-02 cs.CV cs.AI cs.CL 版本更新

v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound

v-HUB: 从视觉和声音理解视频幽默的基准

Zhengpeng Shi, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Songchun Zhu, Bo Zhao, Zilong Zheng

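Institutions: Shanghai Jiao Tong University; Wuhan University; Beijing Institute for General Artificial Intelligence; Independent Researcher

AI summary: Proposes v-HUB, a benchmark of non-verbal short videos for evaluating how well multimodal large language models understand humor from visual cues alone, and finds that audio information improves humor understanding.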

Comments 24 pages, 9 figures

AI中文摘要

能够理解幽默的AI模型具有现实应用前景——例如，增强人机交互中的参与度。为了评估和诊断多模态大语言模型（MLLMs）理解幽默的能力，我们引入了v-HUB，一个新颖的视频幽默理解基准。v-HUB包含一个精心策划的非语言短视频集合，反映了仅通过视觉线索即可欣赏幽默的现实场景。我们将每个视频片段与丰富的标注配对，以支持各种评估任务和分析，包括一项关于增强幽默的环境声音的新研究。为了扩大其适用性，我们构建了一个开放式问答任务，使v-HUB能够轻松集成到现有的视频理解任务套件中。我们评估了多种MLLMs，从专门的Video-LLMs到能够原生处理音频的多功能OmniLLMs，涵盖了开源和专有领域。实验结果揭示了MLLMs在仅凭视觉线索理解幽默时面临的困难。我们的发现还表明，结合音频有助于视频幽默理解，突显了为复杂视频理解任务整合更丰富模态的前景。

英文摘要

AI models capable of comprehending humor hold real-world promise, for example by enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel video humor understanding benchmark. v-HUB comprises a curated collection of non-verbal short videos, reflecting real-world scenarios where humor can be appreciated purely through visual cues. We pair each video clip with rich annotations to support a variety of evaluation tasks and analyses, including a novel study of environmental sound that can enhance humor. To broaden its applicability, we construct an open-ended QA task, making v-HUB readily integrable into existing video understanding task suites. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can natively process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the promise of integrating richer modalities for complex video understanding tasks.

2603.08026 2026-06-02 cs.CL cs.AI cs.PF 版本更新

DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

DyLLM: 基于显著性标记选择与部分注意力的高效扩散LLM推理

Younjoo Lee, Seungkyun Dan, Junghoo Lee, Jaiyoung Park, Jung Ho Ahn

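Institutions: University of Seoul

AI summary: To address the high computational cost of iterative denoising in diffusion language models, proposes DyLLM, a training-free framework that accelerates inference by computing attention and feed-forward operations only for salient tokens, raising throughput by up to 9.6x while preserving accuracy.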

Comments 21 pages, 10 figures, 7 tables, accepted at ICML 2026

AI中文摘要

掩码扩散语言模型支持并行令牌解码，为自回归生成的顺序性质提供了一种有前景的替代方案。然而，其迭代去噪过程仍然计算昂贵，因为每一步都重复处理整个序列。我们观察到，在这些扩散步骤中，大多数令牌表示保持稳定；只有一小部分（我们称之为显著令牌）对下一次更新有实质性贡献。利用这种时间稀疏性，我们提出了DyLLM，一种无需训练的推理框架，通过仅选择性地计算这些显著令牌来加速解码。DyLLM通过测量相邻去噪步骤之间注意力上下文的余弦相似性来识别显著性。它仅对显著令牌重新计算前馈和注意力操作，同时为其余令牌重用缓存的激活。在多种推理和代码生成基准测试中，DyLLM实现了高达9.6倍的吞吐量提升，同时基本保持了代表性开源扩散LLM（LLaDA和Dream）的基线准确性。

英文摘要

Masked diffusion language models enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of representative open-source diffusion LLMs (LLaDA and Dream).
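The saliency test described in the abstract, comparing each token's attention context across adjacent denoising steps by cosine similarity, might look roughly like this. It is a sketch under assumptions: the function names, the plain-list vector representation, and the `threshold` value are hypothetical, and the abstract does not state the exact selection rule DyLLM uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_salient_tokens(prev_ctx, curr_ctx, threshold=0.99):
    """Return indices of tokens whose context changed between two steps.

    A token counts as salient when the similarity between its previous and
    current attention context falls below `threshold`; the remaining tokens
    are treated as stable and could reuse cached activations.
    """
    return [
        i
        for i, (prev, curr) in enumerate(zip(prev_ctx, curr_ctx))
        if cosine_similarity(prev, curr) < threshold
    ]


if __name__ == "__main__":
    prev_ctx = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    curr_ctx = [[1.0, 0.01], [1.0, 0.0], [1.0, 1.0]]
    print(select_salient_tokens(prev_ctx, curr_ctx))  # only token 1 changed direction
```

Only the indices returned here would be recomputed at the next step; a stricter threshold marks more tokens as salient and trades speed for fidelity.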

2603.08000 2026-06-02 cs.CL cs.LG 版本更新

SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning

SmartThinker: 渐进式思维链长度校准以实现高效的大语言模型推理

Chenzhi Hu, Qinzhe Hu, Yuhang Xu, Junyi Chen, Ruijie Wang, Shengzhong Liu, Jianxin Li, Fan Wu, Guihai Chen

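Institutions: Tsinghua University

AI summary: To address redundant output from large reasoning models, proposes SmartThinker, a GRPO-based method with progressive CoT length calibration that dynamically estimates the optimal length and modulates the length reward coefficient, compressing response length while improving accuracy.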

Comments Accepted by ICML 2026, 18 pages, 13 figures

AI中文摘要

大型推理模型（LRMs），如OpenAI o1和DeepSeek-R1，通过采用长思维链（CoT）推理路径在复杂任务上实现了高准确率。然而，这些过程固有的冗长常常导致冗余和过度思考。为了解决这一问题，现有工作利用组相对策略优化（GRPO）来减少LRM的输出长度，但其静态长度奖励设计无法根据问题相对难度和响应长度分布动态调整，导致过度压缩和准确率下降。因此，我们提出SmartThinker，一种新颖的基于GRPO的高效推理方法，具有渐进式CoT长度校准。SmartThinker有两个贡献：首先，它在训练期间动态估计具有峰值准确率的最优长度，并引导过长响应朝向该长度，以减少响应长度同时保持准确率。其次，它动态调节长度奖励系数，以避免对正确推理路径的不当惩罚。大量实验结果表明，SmartThinker在提高准确率的同时实现了高达52.5%的平均长度压缩，并在AIME25等具有挑战性的基准上实现了高达16.6%的准确率提升。源代码可在https://github.com/SJTU-RTEAS/SmartThinker获取。

英文摘要

Large reasoning models (LRMs) like OpenAI o1 and DeepSeek-R1 achieve high accuracy on complex tasks by adopting long chain-of-thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot dynamically adapt according to the relative problem difficulty and response length distribution, causing over-compression and compromised accuracy. Therefore, we propose SmartThinker, a novel GRPO-based efficient reasoning method with progressive CoT length calibration. SmartThinker makes a two-fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training and guides overlong responses toward it to reduce response length while sustaining accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experiment results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25. The source code can be found at https://github.com/SJTU-RTEAS/SmartThinker.

2512.16310 2026-06-02 cs.CR cs.AI cs.CL 版本更新

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

Agent工具编排泄露更多：数据集、基准测试与缓解措施

Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu

Institutions: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences

AI summary: Studies the risk that LLM agents disclose sensitive conclusions when orchestrating multiple tools (TOP-R), builds TOP-Bench, a 1,000-instance benchmark, and proposes the TOP-Align post-training method to mitigate leakage.

Comments 17 pages, 2 figures. Dataset and code are available at https://github.com/1Ponder/TOP-R

AI中文摘要

基于LLM的代理越来越多地使用多个外部工具来完成复杂任务。我们研究了工具编排隐私风险（TOP-R）：代理可能组合单个非敏感的工具返回结果，并披露一个非预期的敏感结论。我们通过三个条件形式化TOP-R：结论敏感性、单源不可推断性和组合可推断性。我们引入了LRSE（基于库的反向推理种子扩展），这是一个基于隐私规范、推理链、工具模式和任务场景的四库反向构建流水线，并使用它构建了TOP-Bench，一个包含1000个实例的基准测试。该基准测试在受控的两阶段工具使用协议下评估最终响应的语义泄露。在六个LLM代理中，任务完成率保持较高，但平均泄露率达到88.6%，导致H分数仅为20.4。两种仅提示的防护措施在主基准测试上将H分数提高了约2.7分。我们进一步提出了TOP-Align，一种SFT+DPO后训练方法，用于更安全的任务完成边界。在单独的后训练评估划分上，TOP-Align将H分数比相应基础模型提高了16.2分，而同一划分上仅提示缓解措施的平均增益为4.9分。这些结果表明TOP-R需要超越仅提示的缓解措施。

英文摘要

LLM-based agents increasingly use multiple external tools to complete complex tasks. We study Tools Orchestration Privacy Risk (TOP-R): an agent may combine individually non-sensitive tool returns and disclose an unintended sensitive conclusion. We formalize TOP-R with three conditions: conclusion sensitivity, single-source non-inferability, and compositional inferability. We introduce LRSE (Library-Grounded Reverse-Inference Seed Expansion), a four-library reverse-construction pipeline grounded in privacy norms, reasoning chains, tool schemas, and task scenarios, and use it to build TOP-Bench, a 1,000-instance benchmark. The benchmark evaluates final-response semantic disclosure under a controlled two-stage tool-use protocol. Across six LLM agents, task completion remains high, but the average leakage rate reaches 88.6 percent, yielding an H-score of only 20.4. Two prompt-only safeguards improve H-score by about 2.7 points on the main benchmark. We further propose TOP-Align, an SFT+DPO post-training method for safer task completion boundaries. On a separate post-training evaluation split, TOP-Align improves H-score by 16.2 points over the corresponding base model, compared with a 4.9-point average gain from prompt-only mitigation on the same split. These results show that TOP-R requires mitigation beyond prompting alone.

2603.04828 2026-06-02 cs.CL 版本更新

From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

从陌生到熟悉：通过梯度偏差检测大型语言模型中的预训练数据

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng, Yanyan Lan

Institutions: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University; School of Artificial Intelligence, Beihang University; Institute for AI Industry Research (AIR), Tsinghua University

AI summary: Proposes GDS, which separates pre-training member data from non-member data by analyzing the Gradient Deviation Scores of target samples (update magnitude, update location, and concentration of neuron activation), yielding efficient pre-training data detection that transfers across datasets.

Comments 17 pages, 8 figures

AI中文摘要

大型语言模型的预训练数据检测对于解决版权问题和减轻基准污染至关重要。现有方法主要关注微调前后的基于似然的统计特征或启发式信号，但前者易受语料库中词频偏差的影响，后者强烈依赖于微调数据的相似性。从优化角度，我们观察到在训练过程中，样本从陌生到熟悉的转变方式体现在梯度行为的系统性差异上。熟悉样本表现出更小的更新幅度、模型组件中不同的更新位置以及更尖锐激活的神经元。基于这一洞察，我们提出GDS，一种通过探测目标样本的梯度偏差分数来识别预训练数据的方法。具体来说，我们首先使用梯度轮廓表示每个样本，该轮廓捕获跨FFN和注意力模块的参数更新的幅度、位置和集中度，揭示成员与非成员数据之间的一致区别。然后将这些特征输入轻量级分类器进行二值成员推断。在五个公共数据集上的实验表明，GDS在强基线上实现了最先进的性能，并显著提高了跨数据集的可迁移性。进一步的可解释性分析揭示了梯度分布的差异，半监督结果为检测预训练数据提供了一种实用方法。

英文摘要

Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on the likelihood-based statistical features or heuristic signals before and after fine-tuning, but the former are susceptible to word frequency bias in corpora, and the latter strongly depend on the similarity of fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses reveal differences in gradient distributions, and the semi-supervised results offer a practical way to detect pre-training data.

2603.03202 2026-06-02 cs.CL 版本更新

Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

Code2Math：你的代码智能体能否通过探索有效演化数学问题？

Dadi Guo, Yuejin Xie, Qingyu Liu, Weixian Huang, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Jianjie Feng, Wenze Su, Yujiu Yang, Dongrui Liu, Yi R. Fung

Institutions: Hong Kong University of Science and Technology; Tsinghua University; Zhejiang University; Nanjing Tech University; Shanghai Jiao Tong University; University of Michigan; Independent Researcher

AI summary: Proposes a multi-agent framework in which code agents autonomously evolve existing math problems, through exploration, into more complex and harder variants, and verifies the solvability and increased difficulty of the results.

Comments 38 pages

AI中文摘要

随着大型语言模型（LLM）在数学能力上向IMO和研究水平迈进，高质量挑战性问题的稀缺已成为LLM训练、评估和自我进化的显著瓶颈。同时，最近的代码智能体在智能体编码和推理方面展示了复杂技能，表明代码执行可以作为数学实验的可扩展环境。在本文中，我们研究了代码智能体自主将现有数学问题演化为更复杂变体的潜力。我们引入了一个多智能体框架，旨在执行问题演化，同时验证生成问题的可解性和难度增加。我们的实验表明，在充分的测试时探索下，代码智能体可以合成新的、可解的问题，这些问题在结构上与原始问题不同且更具挑战性。本工作提供了经验证据，表明代码驱动的智能体可以作为在可扩展计算环境中合成高难度数学推理问题的可行机制。代码和数据可在 https://github.com/TarferSoul/Code2Math 获取。

英文摘要

As large language models (LLMs) advance their mathematical capabilities toward the IMO and research level, the scarcity of challenging, high-quality problems has become a significant bottleneck for training, evaluation and self-evolution of LLMs. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Code and data are available at https://github.com/TarferSoul/Code2Math.

2603.03291 2026-06-02 cs.CL cs.AI 版本更新

One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

一个接一个的偏差：语言奖励模型中的机械奖励塑造与持续偏差

Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, Nick Haber

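Institutions: Stanford University; University of California, Berkeley

AI summary: Systematically measures biases in five high-quality reward models, finds persistent issues with length, sycophancy, and overconfidence, and proposes a simple post-hoc intervention (mechanistic reward shaping) to mitigate low-complexity biases.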

Comments ICML 2026 Camera-ready

AI中文摘要

奖励模型（RMs）对于语言模型（LMs）与人类偏好的在线对齐至关重要。然而，基于RM的偏好调优容易受到奖励破解的影响，即LM策略从有缺陷的RM中学习不良行为。通过系统测量五个高质量RM（包括最先进的模型）中的偏差，我们发现尽管已有相关工作，但在长度、谄媚和过度自信方面的问题仍然存在。我们还发现了与模型特定“风格”和答案顺序相关的新偏差。我们将RM失败分类为可处理或对线性干预具有抵抗性，并提出一种简单的后处理干预措施，以减轻由虚假相关性引起的低复杂度偏差。我们提出的机械奖励塑造在不降低奖励质量且使用最少标注数据的情况下减少了目标偏差。该方法可扩展到新偏差、模型内部，并具有分布外泛化能力。

英文摘要

Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state-of-the-art, we find that issues with length, sycophancy, and overconfidence persist despite prior work. We also discover new issues related to bias toward model-specific "styles" and answer order. We categorize RM failures as tractable or resistant to linear intervention and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed mechanistic reward shaping reduces targeted biases without degrading reward quality while using minimal labeled data. The method is extensible to new biases, is model-internal, and generalizes out-of-distribution.

2603.00963 2026-06-02 cs.LG cs.CL 版本更新

Stabilizing Policy Optimization via Logits Convexity

通过Logits凸性稳定策略优化

Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao

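Institutions: National University of Singapore; University of Science and Technology of China; University of California, Berkeley

AI summary: To address unstable reinforcement learning training, analyzes the stability gap between supervised fine-tuning and reinforcement learning from a gradient perspective and proposes the Logits Convex Optimization (LCO) framework, which stabilizes policy optimization by emulating logits-level convexity; experiments show improved training stability and better results than conventional methods on multiple benchmarks.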

AI中文摘要

虽然强化学习（RL）在大语言模型（LLM）近期成功中发挥了核心作用，但RL优化以不稳定著称，尤其是与监督微调（SFT）相比。本文从梯度角度研究SFT和RL之间的稳定性差距，并表明SFT损失相对于模型logits的凸性在实现稳定训练中起关键作用。我们的理论分析证明，该性质在优化过程中诱导了有利的梯度方向性。相比之下，广泛采用的策略梯度算法——使用裁剪替代目标的近端策略优化（PPO）缺乏这种稳定性质。受此观察启发，我们提出Logits凸优化（LCO），一种简单而有效的策略优化框架，将学习策略与从原始RL目标导出的最优目标对齐，从而模拟logits级凸性的稳定效果。跨多个模型家族的大量实验表明，我们的LCO框架一致地提升了训练稳定性，并在广泛的基准测试中优于传统RL方法。

英文摘要

While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.
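The convexity claim about the SFT loss can be checked with a short standard derivation. The notation below ($z$ for the logits at one position, $y$ for the target token, $p$ for the softmax probabilities) is introduced here for illustration and is not taken from the paper; the derivation covers only the textbook cross-entropy case, not the LCO objective itself.

```latex
% Cross-entropy (SFT) loss at a single position, as a function of the logits z
L(z) = -\log \operatorname{softmax}(z)_y = -z_y + \log \sum_{j} e^{z_j},
\qquad p_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}

% Gradient and Hessian with respect to the logits (e_y is the one-hot target)
\nabla_z L = p - e_y,
\qquad
\nabla_z^2 L = \operatorname{diag}(p) - p\,p^{\top}

% The Hessian is positive semidefinite: for any vector v,
v^{\top} \nabla_z^2 L \, v
  = \sum_{j} p_j v_j^2 - \Big(\sum_{j} p_j v_j\Big)^{2}
  = \operatorname{Var}_{j \sim p}\,[v_j] \;\ge\; 0
```

Since the Hessian is positive semidefinite everywhere, $L$ is convex in $z$, which is the property the abstract credits with stabilizing SFT.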

2603.00829 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Constitutional Black-Box Monitoring for Scheming in LLM Agents

LLM Agent 中阴谋行为的宪法黑盒监控

Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn

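Institutions: University of Cambridge

AI summary: Studies constitutional black-box monitors that detect scheming in LLM agents from externally observable inputs and outputs alone, and shows that monitors optimized on synthetic data generalize to more realistic environments.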

Comments Accepted at ICML 2026. Camera-ready version

AI中文摘要

在自主环境中安全部署大型语言模型（LLM）Agent需要可靠的监督机制。一个核心挑战是检测阴谋行为，即Agent暗中追求不一致的目标。缓解此类风险的一种方法是基于LLM的监控：使用语言模型检查Agent行为中的可疑动作。我们研究宪法黑盒监控器：仅利用外部可观测的输入和输出检测阴谋行为的提示分类器，并在从自然语言行为规范生成的合成数据上优化。我们引入两个生成合成Agent轨迹的流水线：STRIDE（迭代精炼）和Gloom（Agent-环境模拟），各生成1000个样本。通过提示扫描、人工精炼和自动提示优化，我们在这些数据集上优化前沿LLM监控器，并在ControlArena（一套Agent在更现实环境中运行的接地环境）中的7500个保留轨迹上评估性能。结果表明，仅基于合成数据选择的监控器可以泛化到更现实的环境，捕获有意义的阴谋信号。然而，我们发现性能在我们的设置中迅速饱和，简单的提示扫描匹配了更广泛优化的结果。超越这一限制不会带来进一步改进，反而导致过拟合。

英文摘要

Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.

2511.00206 2026-06-02 cs.AI cs.CL 版本更新

Addressing Longstanding Challenges in Cognitive Science with Language Models

用语言模型应对认知科学中长期存在的挑战

Dirk U. Wulff, Rui Mata

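Institutions: Center for Adaptive Rationality, Max Planck Institute for Human Development; Faculty of Psychology, University of Basel

AI summary: Discusses how language models can help address longstanding challenges in cognitive science, such as research integration, formalization, and conceptual clarity, and notes the associated risks and opportunities.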

2603.00021 2026-06-02 cs.CL 版本更新

微调不忘上下文学习：线性注意力模型的理论分析

Chungpa Lee, Jy-yong Sohn, Kangwook Lee

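Institutions: KAIST

AI summary: Uses a theoretical analysis of linear attention models to characterize how fine-tuning objectives modify attention parameters and the conditions under which few-shot performance degrades, and shows that updating only the value matrix preserves in-context learning.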

Journal ref: International Conference on Machine Learning (ICML) 2026

AI中文摘要

基于Transformer的大型语言模型展现出上下文学习能力，能够通过少量示例提示适应下游任务。实践中，这类模型常被微调以提升下游任务的零样本性能，使其无需示例即可解决问题，从而降低推理成本。然而，微调可能削弱上下文学习能力，限制微调模型在未见任务上的表现。利用线性注意力模型，我们提供了理论分析，刻画了微调目标如何修改注意力参数，并识别了导致少样本性能下降的条件。我们表明，微调所有注意力参数会损害上下文学习，而仅更新值矩阵可在保持上下文学习的同时提升零样本性能。我们进一步证明，引入辅助的少样本损失主要增强目标任务的上下文学习，但以牺牲微调未见任务上的上下文学习能力为代价。我们提供了来自合成和真实数据集的实验证据，与理论定性预测一致。

英文摘要

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We provide empirical evidence from synthetic and real-world datasets consistent with the qualitative predictions of our theory.

2602.22221 2026-06-02 cs.IR cs.AI cs.CL cs.CY 版本更新

Evaluating Reliability Asymmetries in Chinese Factual Search and AI Answers

评估中文事实搜索与AI答案中的可靠性不对称性

Geng Liu, Li Feng, Mengxiao Zhu, Francesco Pierri

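Institutions: Department of Electronics, Information and Bioengineering, Politecnico di Milano; University of Science and Technology of China

AI summary: Builds a query-based fact-checking dataset from real Chinese search logs and compares traditional search engines, large language models, and search-integrated AI Overviews on Chinese yes/no questions in terms of accuracy, answer frequency, polarity gap, and regional differences in information demand, showing that reliability depends not only on answer correctness but also on answer frequency, the handling of negative claims, and exposure risk tied to information demand.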

AI中文摘要

搜索引擎和AI驱动的系统越来越多地成为获取事实信息的媒介，但在现实信息寻求场景中，其可靠性仍难以评估。我们通过从真实中文搜索日志构建基于查询的事实核查数据集，并比较传统搜索引擎、独立大型语言模型和搜索集成AI概览等九种系统，在中文网络生态中研究这一问题。聚焦于中文事实性是非问题，我们根据证据推导的基准事实评估系统是否提供正确、错误或不确定的判断。我们发现，当系统给出明确答案时，准确率相似（73.2%至78.9%），但给出明确答案的频率差异显著：搜索引擎对超过83%的查询给出明确答案，而Qwen-Max则不到一半。我们还发现一致的极性差距：所有系统在标记为“是”的查询上表现优于标记为“否”的查询。我们利用百度指数数据识别健康相关搜索关注度较高的中国省份，这可能表明更大的错误信息暴露风险。总体而言，我们的结果表明，可靠性不仅取决于系统回答时的正确性，还取决于回答频率、如何处理否定主张以及信息需求可能增加暴露风险的地方。

英文摘要

Search engines and AI-powered systems increasingly mediate access to factual information, yet their reliability remains difficult to evaluate in realistic information-seeking settings. We study this problem in the Chinese web ecosystem by constructing a query-based fact-checking dataset from real Chinese search logs and comparing nine systems across traditional search engines, standalone large language models, and search-integrated AI Overviews. Focusing on Chinese-language factual Yes/No questions, we evaluate whether systems provide correct, incorrect, or uncertain decisions against evidence-derived ground truth. We find that systems are similarly accurate when they provide definitive answers, but differ sharply in how often they do so. Conditional accuracy ranges from 73.2% to 78.9%, yet search engines answer definitively on over 83% of queries, while Qwen-Max does so on fewer than half. We also find a consistent polarity gap: all systems perform better on yes-labeled queries than on no-labeled queries. In addition, we use Baidu Index data to identify Chinese provinces with higher health-related search attention, which may indicate greater potential exposure to misinformation. Overall, our results show that reliability depends not only on whether systems are correct when they answer, but also on how often they answer, how they handle negative claims, and where information demand may increase exposure risks.
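The distinction the abstract draws between how often a system answers and how accurate it is when it does can be made concrete with a small sketch. The label strings, the function name, and the toy judgment lists are assumptions for illustration; they are not the paper's data or evaluation code.

```python
from collections import Counter


def reliability_summary(judgments):
    """Return (answer_rate, conditional_accuracy) for one system.

    `judgments` holds one label per query: "correct", "incorrect",
    or "uncertain" (no definitive answer given).
    """
    counts = Counter(judgments)
    answered = counts["correct"] + counts["incorrect"]
    total = len(judgments)
    answer_rate = answered / total if total else 0.0
    # Accuracy is conditioned on the system having given a definitive answer.
    conditional_accuracy = counts["correct"] / answered if answered else 0.0
    return answer_rate, conditional_accuracy


if __name__ == "__main__":
    # Made-up judgments for two hypothetical systems over ten queries each.
    system_a = ["correct"] * 6 + ["incorrect"] * 2 + ["uncertain"] * 2
    system_b = ["correct"] * 3 + ["incorrect"] * 1 + ["uncertain"] * 6
    print(reliability_summary(system_a))
    print(reliability_summary(system_b))
```

In this toy example both systems have the same conditional accuracy, yet one answers twice as often, which is the kind of asymmetry a single accuracy number would hide.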

2512.09730 2026-06-02 cs.CL cs.LG 版本更新

Interpreto: An Explainability Library for Transformers

Interpreto：一个用于Transformer的可解释性库

Antonin Poché, Thomas Mullor, Gabriele Sarti, Frédéric Boisnard, Corentin Friedrich, Charlotte Claye, François Hoofd, Raphael Bernas, Nicholas Asher, Céline Hudelot, Fanny Jourdan

Institutions: IRT Saint Exupéry Toulouse; IRIT Toulouse; Khoury College of Computer Sciences; Ampere; MICS, CentraleSupélec; Scienta Lab; Thales Avionics; ANITI

AI summary: Interpreto is an open-source Python library that provides a unified explanation workflow for HuggingFace language models, from early BERT variants to LLMs, through attribution methods and concept-based explanations; its end-to-end concept-based pipeline is the main novelty.

Comments Accepted to ACL 2026 System Demonstration. Equal contribution: Poché and Jourdan

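2602.18008 2026-06-02 cs.LG cs.AI cs.CL Updated version

Are LLMs Ready for Neural-integrated Mechanistic Modeling? A Benchmark and Agentic Framework

Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang, Prasanna Balachandran, Sheng Li, Anil Vullikanti

Affiliations: University of Virginia

AI summary: The paper introduces the Neural-Integrated Mechanistic Modeling (NIMM) benchmark, which evaluates how well large language models build neural-integrated mechanistic models across three scientific domains, and proposes NIMMGen, a tree-guided agentic framework whose branch-level search and atomic model refinement markedly improve search stability and solution quality.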

Comments 25 pages, 8 figures

Abstract

Large language models (LLMs) have shown promise in constructing mechanistic models from data. However, existing evaluations largely focus on simplified settings and fail to capture the complexity of real-world scientific modeling. In practice, such modeling often involves neural-integrated formulations, where a mechanistic model component and a neural network component are jointly constructed, leading to a significantly more complex search space. Motivated by this gap, we introduce the Neural-Integrated Mechanistic Modeling (NIMM) benchmark, which evaluates LLM-generated neural-integrated mechanistic models across three scientific domains. Experiments on NIMM reveal that existing LLM-based approaches struggle to effectively explore this complex space, resulting in limited search stability and solution quality. To address this challenge, we propose NIMMGen, a tree-guided agentic framework that enables diversified exploration via branch-level search and improves solutions through atomic model refinement. Extensive experiments demonstrate that NIMMGen achieves state-of-the-art performance on NIMM, significantly improving search stability and solution quality.

2602.12984 2026-06-02 cs.CL Updated version

SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

Affiliations: Fudan NLP Group

AI summary: To address benchmarks that overlook how agents orchestrate tools in scientific workflows, the paper presents SciAgentGym, a scalable interactive environment with 1,780 domain-specific tools, and SciAgentBench, a tiered evaluation suite. It also introduces the data synthesis method SciForge; after fine-tuning, SciAgent-8B outperforms much larger models and shows positive cross-domain transfer of scientific tool-use capabilities.

Abstract

Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models still struggle with complex scientific tool-use, and their performance degrades substantially as interaction horizons extend. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.

2602.11852 2026-06-02 cs.AI cs.CL cs.LG Updated version

Prototype Transformer: Towards Language Model Architectures Interpretable by Design

Yordan Yordanov, Matteo Forasassi, Bayar Menzat, Ruizhi Wang, Chang Qi, Markus Kaltenberger, Amine M'Charrak, Tommaso Salvatori, Thomas Lukasiewicz

Affiliations: University of Cambridge; ETH Zurich

AI summary: The paper proposes the Prototype Transformer (ProtoT), an autoregressive language model architecture that replaces quadratic-cost self-attention with a linear-cost prototype module; the prototypes automatically capture nameable concepts, which improves interpretability and supports behavior editing.

Comments Accepted at ICML 2026. Equal contribution: Yordan Yordanov and Matteo Forasassi. 40 pages, 28 figures, 22 tables

Abstract

While state-of-the-art language models (LMs) surpass most humans in certain domains, their reasoning remains largely opaque, reducing trust and increasing the risk of deception and hallucination. We introduce the Prototype Transformer (ProtoT), an autoregressive LM architecture that replaces the quadratic-cost self-attention module of the Transformer with a linear-cost module based on prototypes, which are learned parameter vectors. In ProtoT, prototypes create communication channels that aggregate contextual information at different time scales. We show that this structure leads prototypes to automatically capture nameable concepts, such as "woman", during training, offering a path toward interpreting model reasoning and making targeted edits to model behavior. Compared with baselines, ProtoT scales well with model and data size, is robust to input perturbations, and performs well on text generation and downstream tasks, including GLUE. These results suggest that ProtoT is a promising step toward autoregressive language models that are more interpretable by design.

2602.11790 2026-06-02 cs.AI cs.CL Updated version

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, Jizhou Huang

Affiliations: Baidu Inc.

AI summary: The paper proposes LASEV, a hierarchical LLM-based multi-agent system that decomposes educational video generation across collaborating specialized agents, addressing the weak logical rigor and knowledge representation of end-to-end models and enabling low-cost, high-throughput automated production of instructional videos.

Comments Accepted at ACM SIGKDD 2026 (KDD '26), Applied Data Science Track. 10 pages, 2 figures, 5 tables. The project is available at https://robitsg.github.io/LASEV

DOI: 10.1145/3770855.3818323

Abstract

Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LASEV, a hierarchical LLM-based multi-agent system for generating high-quality instructional videos from educational problems. LASEV formulates educational video generation as a multi-objective task that simultaneously demands correct step-by-step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio-visual alignment. To address the limitations of prior approaches (low procedural fidelity, high production cost, and limited controllability), LASEV decomposes the generation workflow into specialized agents that collaborate through a central Orchestrating Agent, shared production state, explicit quality gates, and iterative critique mechanisms. Specifically, the Orchestrating Agent supervises a Solution Agent for rigorous problem solving, an Illustration Agent that produces executable visualization code, and a Narration Agent for learner-oriented instructional scripts. In addition, all outputs from the working agents are subject to semantic critique, rule-based constraints, and tool-based compilation checks. Rather than directly synthesizing pixels, the system constructs a structured executable video script that is deterministically compiled into synchronized visuals and narration using template-driven assembly rules, enabling fully automated production without manual editing. In large-scale deployments, LASEV achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost compared to current industry-standard approaches while maintaining a high acceptance rate.

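2602.11177 2026-06-02 cs.CL cs.AI Updated version

How to Correctly Report LLM-as-a-Judge Evaluations

Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

Affiliations: University of Science and Technology of China

AI summary: To address the bias that arises when LLMs serve as evaluators, the paper proposes a plug-in correction framework that yields unbiased estimates and statistically principled uncertainty quantification, and shows that it remains unbiased under distribution shift.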

Journal ref: International Conference on Machine Learning (ICML) 2026

Abstract

Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification. Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset. Additionally, it uses an adaptive strategy to allocate calibration samples for tighter intervals. Importantly, we characterize parameter regimes defined by the true evaluation score and the LLM judge's sensitivity and specificity in which our LLM-based evaluation yields more reliable estimates than human-only evaluation. Moreover, we show that our framework remains unbiased under distribution shift between the test and calibration datasets, in contrast to existing approaches.
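
To see why imperfect sensitivity and specificity bias a naive score, consider the classical misclassification correction (often called the Rogan-Gladen estimator). The sketch below is offered only as an illustration of a plug-in correction of this kind; the abstract does not spell out the authors' estimator, and the function name and numbers are assumptions. It also omits the confidence intervals and adaptive calibration allocation the abstract mentions.

```python
def corrected_score(p_observed, sensitivity, specificity):
    """Invert p_observed = theta * sens + (1 - theta) * (1 - spec) for theta.

    p_observed:  fraction of responses the LLM judge marks as correct.
    sensitivity: P(judge says correct | truly correct), from calibration data.
    specificity: P(judge says incorrect | truly incorrect), from calibration data.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge must beat chance: sensitivity + specificity > 1")
    theta = (p_observed + specificity - 1.0) / denom
    # Clip to [0, 1] because sampling noise can push the estimate outside.
    return min(1.0, max(0.0, theta))


# Hypothetical example: true accuracy 0.60, judge sensitivity 0.90, specificity 0.80.
true_theta, sens, spec = 0.60, 0.90, 0.80
naive = true_theta * sens + (1 - true_theta) * (1 - spec)  # what the judge reports
estimate = corrected_score(naive, sens, spec)
print(f"naive = {naive:.2f}, corrected = {estimate:.2f}")
```

In this toy case the naive judge score overstates the true accuracy, and the correction recovers it once sensitivity and specificity are known; in practice both must be estimated from human-labeled calibration data, which is where the extra uncertainty comes from.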

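2502.16174 2026-06-02 cs.LG cs.AI cs.CL cs.CR Updated version

Efficient LLM Moderation with Multi-Layer Latent Prototypes

Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert

Affiliations: University of Warsaw

AI summary: The paper proposes the Multi-Layer Prototype Moderator (MLPM), which uses prototypes of intermediate representations from multiple layers for lightweight, efficient, and customizable input moderation; it achieves the best performance on several benchmarks and can be combined with output moderation to improve response safety.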

Abstract

Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques. Overall, our work provides a practical and adaptable solution for safe, robust, and efficient LLM deployment.
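
The general idea of classifying an input by its distance to per-layer prototypes can be sketched as follows. This is a generic nearest-prototype classifier written under stated assumptions, not MLPM itself: the mean-vector prototypes, the squared Euclidean distance, the equal weighting of layers, and all names are hypothetical choices made for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_prototypes(features, labels):
    """Build one prototype per (label, layer).

    features[i][l] is the hidden vector of example i at layer l.
    Returns a dict: label -> list of per-layer prototype vectors.
    """
    by_label = {}
    for feats, y in zip(features, labels):
        by_label.setdefault(y, []).append(feats)
    return {
        y: [mean_vector([ex[l] for ex in exs]) for l in range(len(exs[0]))]
        for y, exs in by_label.items()
    }


def predict(protos, feats):
    """Return the label whose prototypes are closest, summed over layers."""
    def total_distance(label):
        return sum(
            sum((a - b) ** 2 for a, b in zip(layer_vec, proto))
            for layer_vec, proto in zip(feats, protos[label])
        )
    return min(protos, key=total_distance)


# Hypothetical toy features: 4 prompts, 2 layers, 2 dimensions per layer.
features = [
    [[0.0, 0.1], [0.2, 0.0]],
    [[0.1, 0.0], [0.0, 0.2]],
    [[1.0, 0.9], [0.8, 1.0]],
    [[0.9, 1.0], [1.0, 0.8]],
]
labels = ["safe", "safe", "harmful", "harmful"]
protos = fit_prototypes(features, labels)
print(predict(protos, [[0.95, 0.9], [0.9, 0.9]]))
```

Because the hidden states are already computed during the forward pass, a classifier of this shape adds only a few distance computations, which is consistent with the negligible overhead the abstract claims.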

2511.16886 2026-06-02 cs.CL cs.AI cs.LG Updated version

Latent Reasoning in TRMs is Secretly a Policy Improvement Operator

Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac

AI summary: By formalizing latent recursive reasoning as a policy improvement algorithm, the paper explains when recursive steps actually improve performance, and proposes training schemes drawn from reinforcement learning and diffusion methods that cut forward passes on the Tiny Recursive Model by 18x while maintaining performance.

Abstract

Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explained by the theory that such recursion increases a network's depth, allowing it to compactly emulate the capacity of larger models. However, the performance of recursively added layers remains behind the capabilities of one-pass models with the same feed-forward depth. This means that in the looped version, not every recursive step effectively contributes to depth. This raises the question: when and why does latent reasoning improve performance, and when does it result in dead compute? In our work, we demonstrate that latent recursive reasoning provides an answer to this question. We show that latent recursive reasoning can be formalized as a policy improvement algorithm. Building on these insights, we propose to use training schemes from reinforcement learning and diffusion methods for latent reasoning models. Using the Tiny Recursive Model as our testbed, we show that with our modifications we can avoid dead compute steps and reduce the total number of forward passes by 18x while maintaining performance. Broadly speaking, we show how a policy improvement perspective on recursive steps can explain model behavior and provide insights for further improvements.

2602.03318 2026-06-02 cs.CL Updated version

MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research

Yifan Shi, Jiayi Wang, Minyi Wu, Ye Fan, Jialong Shi, Jianyong Sun

Affiliations: Xi’an Jiaotong University; Northwestern Polytechnical University

AI summary: The paper proposes MIRROR, a fine-tuning-free multi-agent framework that uses execution-driven iterative adaptive revision and hierarchical retrieval to translate natural-language optimization problems directly into mathematical models and solver code, outperforming existing methods on standard operations research benchmarks.

Abstract

Operations Research (OR) relies on expert-driven modeling, a slow and fragile process ill-suited to novel scenarios. While large language models (LLMs) can automatically translate natural language into optimization models, existing approaches either rely on costly post-training or employ multi-agent frameworks, yet most still lack reliable collaborative error correction and task-specific retrieval, often leading to incorrect outputs. We propose MIRROR, a fine-tuning-free, end-to-end multi-agent framework that directly translates natural language optimization problems into mathematical models and solver code. MIRROR integrates two core mechanisms: (1) execution-driven iterative adaptive revision for automatic error correction, and (2) hierarchical retrieval to fetch relevant modeling and coding exemplars from a carefully curated exemplar library. Experiments show that MIRROR outperforms existing methods on standard OR benchmarks, with notable results on complex industrial datasets such as IndustryOR and Mamo-ComplexLP. By combining precise external knowledge infusion with systematic error correction, MIRROR provides non-expert users with an efficient and reliable OR modeling solution, overcoming the fundamental limitations of general-purpose LLMs in expert optimization tasks.

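2509.23782 2026-06-02 cs.CL Updated version

Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions

Yoonah Park, Haesung Pyun, Yohan Jo

Affiliations: KAIST

AI summary: By analyzing knowledge and prediction subspaces in hidden representations, the paper proposes KAPPA, a lightweight inference-time intervention that aligns the two and thereby reduces the knowledge-prediction gap of LLMs on multiple-choice questions.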

Comments Accepted to ICML 2026

Abstract

While large language models (LLMs) perform strongly on diverse tasks, their trustworthiness is limited by erratic behavior that is unfaithful to their internal knowledge. In particular, LLMs often fail on multiple-choice questions (MCQs) even if they encode correct answers in their hidden representations, revealing a misalignment between internal knowledge and output behavior. We investigate and mitigate this knowledge-prediction gap on MCQs through a three-step analysis of hidden representations. First, we quantify the prevalence and magnitude of the gap across models and datasets. Second, we provide a geometric interpretation by identifying distinct knowledge and prediction subspaces in the residual stream. Third, we introduce KAPPA, a lightweight inference-time intervention that aligns the two subspaces within the residual stream to reduce the knowledge-prediction gap. Our results provide a geometric and interpretable explanation of the knowledge-prediction gap in LLMs. Furthermore, KAPPA effectively reduces the gap across diverse MCQ benchmarks and models, and generalizes to free-form settings.

2501.18649 2026-06-02 cs.CL cs.AI cs.IR cs.LG Updated version

Fake News Detection After LLM Laundering: Measurement and Explanation

Rupak Kumar Das, Jonathan Dodge

Affiliations: College of IST, Pennsylvania State University

AI summary: The study measures how well detectors identify fake news paraphrased by LLMs, finds that detectors struggle with such paraphrased text, and uses LIME explanations to identify sentiment shift as one cause of detection failures.

DOI: 10.24251/HICSS.2026.339

Abstract

With their advanced capabilities, Large Language Models (LLMs) can generate highly convincing and contextually relevant fake news, which can contribute to disseminating misinformation. Though there is much research on fake news detection for human-written text, the field of detecting LLM-generated fake news is still under-explored. This research measures the efficacy of detectors in identifying LLM-paraphrased fake news, in particular, determining whether adding a paraphrase step in the detection pipeline helps or impedes detection. This study makes five contributions: (1) we show that detectors struggle more to detect LLM-paraphrased fake news than human-written text; (2) we identify which models excel at which tasks (evading detection, paraphrasing to evade detection, and paraphrasing for semantic similarity); (3) via LIME explanations, we uncover a possible reason for detection failures: sentiment shift; (4) we find a worrisome trend for paraphrase quality measurement: samples that exhibit sentiment shift despite a high BERTSCORE; and (5) we provide a pair of datasets augmenting existing datasets with paraphrase outputs and scores. The dataset is available on GitHub.

2602.03719 2026-06-02 cs.CL Updated version

BranPO: Scalable Contrastive Branch Sampling for Long-Horizon Agentic Reinforcement Learning

Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, Chengwei Qin

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Nanyang Technological University

AI summary: To address sparse trajectory-level rewards and difficult credit assignment in long-horizon agentic reinforcement learning, the paper proposes BranPO, a value-free method that truncates trajectories and resamples continuations to build contrastive branches, providing localized contrastive supervision without dense rewards.

Comments 26 pages, 5 figures

Abstract

Agentic reinforcement learning enables large language models to perform multi-turn planning and tool use, but long-horizon training remains challenging under sparse trajectory-level rewards, where a single outcome is uniformly assigned to all decisions. Prior methods introduce finer-grained supervision via tree-based exploration or process-level evaluation, but often incur high cost or produce noisy credit signals. In agentic trajectories, early mistakes may still be corrected by later actions, while seemingly promising intermediate states can fail due to poor subsequent decisions. We call this property non-monotonic correctness, which makes outcome rewards or state values insufficient for guiding what actions should be taken from each state. To address this, we propose Branching Relative Policy Optimization (BranPO), a value-free method that constructs localized contrastive supervision without dense rewards. BranPO truncates trajectories at intermediate prefixes and resamples continuations to form contrastive branches that share the same prefix but diverge in final outcomes, thereby isolating decisions that drive success or failure. We further introduce difficulty-aware branch sampling and Redundant Step Masking to improve sampling efficiency and suppress redundant updates. Experiments show that BranPO consistently outperforms diverse baseline categories across multiple multi-hop QA benchmarks without additional training cost, and generalizes to broader long-horizon agentic tasks with consistent improvements. Our code is available at https://github.com/YubaoZhao/BranPO.
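
One way to picture supervision from branches that share a prefix is a group-relative advantage computed only among continuations resampled from that prefix. The sketch below assumes mean-centering and standard-deviation scaling within a branch group, in the style of group-relative policy optimization; the abstract does not state BranPO's exact formula, so the function and the numbers are hypothetical.

```python
def branch_advantages(rewards):
    """Standardize outcome rewards within one group of same-prefix branches.

    rewards: final outcome reward of each continuation resampled from a prefix.
    Returns one advantage per branch; all zeros when the branches agree,
    since identical outcomes carry no contrastive signal about the decision.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    if var == 0.0:
        return [0.0] * len(rewards)
    std = var ** 0.5
    return [(r - mean) / std for r in rewards]


# Hypothetical prefix with four resampled continuations: two succeed, two fail.
mixed = branch_advantages([1.0, 0.0, 0.0, 1.0])
# Hypothetical prefix where every continuation succeeds: nothing to contrast.
uniform = branch_advantages([1.0, 1.0, 1.0])
print(mixed, uniform)
```

Because every branch in a group shares the same prefix, a nonzero advantage can only be attributed to the decisions made after the truncation point, which is the localization effect the abstract describes.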

2602.03619 2026-06-02 cs.CL Updated version

Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Changze Lv, Jie Zhou, Wentao Zhao, Jingwen Xu, Shihan Dou, Zisu Huang, Muzhao Tian, Xiaohua Wang, Yang Liu, Pluto Zhou, Tao Gui, Le Tian, Xiao Zhou, Xiaoqing Zheng, Xuanjing Huang, Jie Zhou

Affiliations: Tencent; Fudan University; Tsinghua University

AI summary: The paper proposes a pipeline that trains a query-specific rubric generator with reinforcement learning, addressing the lack of verifiable reward signals for long-form DeepResearch report generation, and reports marked gains in human preference tests and downstream tasks.

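CURP: Codebook-based Continuous User Representation for Personalized Generation with Large Language Models

Liang Wang, Xinyi Mou, Xiaoyou Liu, Xuanjing Huang, Zhongyu Wei

Affiliations: School of Data Science, Fudan University; Shanghai Innovation Institute; School of Computer Science, Fudan University

AI summary: The paper proposes the CURP framework, which extracts multi-dimensional user traits with a bidirectional user encoder and a discrete prototype codebook, enabling plug-and-play personalized generation with few trainable parameters and outperforming strong baselines on variant generation tasks.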

Abstract

User modeling characterizes individuals through their preferences and behavioral patterns to enable personalized simulation and generation with Large Language Models (LLMs). However, existing methods, whether prompt-based or training-based, face challenges in balancing personalization quality against computational and data efficiency. We propose a novel framework, CURP, which employs a bidirectional user encoder and a discrete prototype codebook to extract multi-dimensional user traits. This design enables plug-and-play personalization with a small number of trainable parameters (about 20M parameters, about 0.2% of the total model size). Through extensive experiments on variant generation tasks, we show that CURP achieves superior performance and generalization compared to strong baselines, while offering better interpretability and scalability. The code is available at https://github.com/RaidonWong/CURP_code

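2507.08038 2026-06-02 cs.CL cs.AI Updated version

AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

Talor Abramovich, Gal Chechik

Affiliations: Google Research

AI summary: The paper presents AblationBench, a benchmark suite with two tasks, author ablation and reviewer ablation, for evaluating how well language models plan ablation experiments in AI research; the best current model identifies only 45% of the original ablations, below human-level performance.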

Comments AI4Science Workshop, ICML 2026; Project page: https://ablation-bench.github.io/

Abstract

Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. A key mechanism to obtain such insights is through ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 45% of the original ablations on average, below human-level performance. We observe an inverse performance trend between the author and reviewer tasks, which we attribute to differences in model grounding. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms an agent-based approach. Our data is available at https://huggingface.co/collections/ai-coscientist/ablationbench, and our code is available at https://github.com/ai-scientist-bench/ablation-bench.

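2601.22947 2026-06-02 cs.CL cs.LG Updated version

Reconsidering Positional Supervision in Masked Diffusion Language Model Training

Mengyu Ye, Keito Kudo, Ryosuke Takahashi, Jun Suzuki

Affiliations: Tohoku University; RIKEN; NII LLMC

AI summary: To address the sensitivity of masked diffusion language models to positional shifts, the paper proposes an objective based on connectionist temporal classification (CTC) that absorbs positional uncertainty by introducing a slack token and an updated collapse map, yielding consistent gains on open-ended generation benchmarks.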

Comments preprint, WIP

Abstract

Masked diffusion language models (MDLMs) generate text by unmasking tokens in parallel and have recently emerged as alternatives to autoregressive language models. They can be viewed as parallel decoders trained with a position-wise cross-entropy (CE) loss, the same setup as non-autoregressive translation (NAT). In NAT, CE-trained parallel decoders have been argued to be sensitive to small positional shifts, since CE penalizes them harshly. We ask whether CE-trained MDLMs are similarly sensitive to such shifts under iterative decoding. To probe this, we apply a controlled intervention that introduces such shifts during decoding. On LLaDA-8B-Instruct with Arena-Hard, displacing as little as 1% of generated tokens by one position substantially reduces win rates against the unintervened model, showing that MDLMs are sensitive to such small shifts under iterative parallel decoding. Motivated by this, we adapt connectionist temporal classification (CTC), an alignment-flexible objective known to mitigate this sensitivity in NAT, to MDLM supervised fine-tuning. By relaxing the strict position-wise match that CE imposes, CTC gives the loss room to absorb small positional shifts; concretely, we modify the CTC objective to use a special <slack> token that absorbs positional uncertainty between target tokens and output positions, and an updated collapse map that preserves target surface forms. Across four open-ended generation benchmarks, the resulting model consistently improves over both the original model and a matched cross-entropy-trained baseline, with statistically significant gains on all four. These results identify training-side alignment flexibility as a useful design dimension for MDLM SFT, complementary to the inference-time approaches explored in prior work.
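
For readers unfamiliar with CTC, the sketch below shows the standard CTC collapse map (merge consecutive repeats, then drop blanks) and why it tolerates a one-position shift. It is the textbook map only; the paper's <slack> token and its updated collapse map that preserves surface forms are not reproduced here, and the `<blank>` symbol and example tokens are assumptions.

```python
def ctc_collapse(tokens, blank="<blank>"):
    """Standard CTC collapse: merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for tok in tokens:
        # Emit a token only when it differs from its predecessor and is not blank.
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out


# Two alignments that differ by a one-position shift of "the".
aligned = ["the", "<blank>", "cat"]
shifted = ["<blank>", "the", "cat"]
print(ctc_collapse(aligned), ctc_collapse(shifted))

# A blank between repeats keeps both copies; adjacent repeats are merged.
print(ctc_collapse(["a", "a", "<blank>", "a", "b", "b", "<blank>"]))
```

Under a position-wise cross-entropy loss the two alignments above would disagree at two of three positions, whereas a CTC-style objective sums over all alignments that collapse to the same target and so treats them as equally correct.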

2512.00986 2026-06-02 cs.CL Updated version

ADRA-Bank: A Modular Benchmark for Academic Deep Research Agents

Zhihan Guo, Feiyang Xu, Yifan Li, Muzhi Li, Shuai Zou, Jiele Wu, Han Shi, Haoli Bai, Ho-fung Leung, Irwin King

Affiliations: The Chinese University of Hong Kong, Hong Kong SAR, China; Hong Kong Polytechnic University, Hong Kong SAR, China; National University of Singapore, Singapore, Singapore; Huawei Technologies, Hong Kong SAR, China; Independent

AI summary: Existing benchmarks emphasize retrieval over planning and reasoning and offer little academic-domain coverage. The paper proposes ADRA-Bank, a modular benchmark of 200 human-annotated instances across 10 academic domains, and the ADRA-Eval paradigm, which tests planning, retrieval, and reasoning in end-to-end and isolated modes. Results show that agents fall short on multi-source retrieval and cross-field consistency, and that stronger high-level planning is key to unlocking the reasoning potential of foundational LLMs.

Abstract

A surge in academic publications calls for automated deep research (DR) systems, but accurately evaluating them is still an open problem. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the academic domains that are the core application for DR agents. To address these gaps, we introduce ADRA-Bank, a modular benchmark for Academic DR Agents. Grounded in academic literature, our benchmark is a human-annotated dataset of 200 instances across 10 academic domains, including both research and review papers. Furthermore, we propose a modular Evaluation Paradigm for Academic DR Agents (ADRA-Eval), which leverages the rich structure of academic papers to assess the core capabilities of planning, retrieval, and reasoning. It employs two complementary modes: an end-to-end evaluation for task agents and an isolated evaluation for foundational LLMs as potential backbones. Results reveal uneven capabilities: while agents show specialized strengths, they struggle with multi-source retrieval and cross-field consistency. Moreover, improving high-level planning capability is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, ADRA-Bank provides a diagnostic tool to guide the development of more reliable automatic academic research assistants.


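2501.13428 2026-06-02 cs.CL cs.AI cs.LG (updated version)

Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

Bo Gao, Michael W. Spratling, Letizia Gionfrida

Affiliations: University of California, Berkeley

AI summary: Proposes a two-stage attention mechanism that replaces Softmax with Softplus followed by l1-normalisation, and adds an invariance-entropy-based dynamic scale factor and a re-weighting mechanism, to improve numerical stability, mitigate the attention sink phenomenon, and markedly improve length extrapolation.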

Comments Accepted by ICML 2026

Abstract

Large language models have achieved remarkable success in recent years, primarily due to self-attention. However, traditional Softmax attention suffers from numerical instability and reduced performance as the number of inference tokens increases. This work addresses these issues by proposing a new design principle for attention, viewing it as a two-stage process. The first stage (normalisation) refines standard attention by replacing Softmax with the more numerically stable Softplus followed by $l_{1}$-normalisation. Furthermore, we introduce a dynamic scale factor based on invariance entropy. We show that this novel attention mechanism outperforms conventional Softmax attention and state-of-the-art Softmax-free alternatives. Our second proposal is to introduce a second processing stage (sharpening), which consists of a re-weighting mechanism that amplifies significant attentional weights while diminishing weaker ones. This enables the model to concentrate more effectively on relevant tokens, mitigating the attention sink phenomenon and fundamentally improving length extrapolation. This novel two-stage replacement for self-attention is shown to ensure numerical stability and dramatically improve length extrapolation, maintaining a nearly constant validation loss at 16$\times$ the training length while achieving superior results on challenging long-context retrieval tasks and downstream benchmarks. Furthermore, symbolic regression experiments demonstrate that our method enables models to recover Newton's gravitational law from orbital trajectory sequences, providing evidence that appropriate attention mechanisms are crucial for foundation models to develop genuine physical world models. Our code is available at https://github.com/iminfine/freeattn.
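To make the two stages in this abstract concrete, here is a minimal sketch for a single query over plain Python lists. All function names are hypothetical. The fixed `scale` argument merely stands in for the entropy-based dynamic scale factor, whose formula the abstract does not give, and the power-based `sharpen` step is only one illustrative way to amplify large weights and shrink small ones; it is not claimed to be the authors' re-weighting mechanism.

```python
import math

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)) without overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_l1_attention(scores, scale=1.0):
    # Stage 1 (normalisation): apply softplus to each score, then divide by
    # the l1 norm so the weights are positive and sum to 1.
    # `scale` is a fixed placeholder (assumption), not the paper's dynamic factor.
    positive = [softplus(scale * s) for s in scores]
    total = sum(positive)
    return [p / total for p in positive]

def sharpen(weights, power=2.0):
    # Stage 2 (sharpening), illustrative assumption: raise each weight to a
    # power greater than 1 and renormalise, which boosts the largest weights.
    raised = [w ** power for w in weights]
    total = sum(raised)
    return [r / total for r in raised]

weights = softplus_l1_attention([2.0, 0.5, -1.0])
sharp = sharpen(weights)
```

In this toy example the largest weight grows after sharpening while both weight vectors remain valid distributions.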


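2601.21579 2026-06-02 cs.CL cs.LG (updated version)

Structured Semantic Information Helps Retrieve Better Examples for In-Context Learning in Few-Shot Relation Extraction

Aunabil Chakma, Mihai Surdeanu, Eduardo Blanco

Affiliations: University of Arizona

AI summary: Proposes an example selection strategy based on syntactic-semantic structural similarity and combines it with LLM-generated examples in a hybrid system that improves few-shot relation extraction.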

Abstract

This paper presents several strategies to automatically obtain additional examples for in-context learning, effectively transforming relation extraction from a 1-shot to a few-shot setting. Specifically, we introduce a novel strategy for example selection, in which new examples are selected based on the similarity of their underlying syntactic-semantic structure to the provided 1-shot example. We show that our strategy results in complementary word choices and sentence structures compared to LLM-generated examples. When both strategies are combined, the resulting hybrid system achieves a more holistic picture of the relations of interest than either method alone. Our framework transfers well across datasets (FS-TACRED and FS-FewRel) and LLM families (Qwen and Gemma). Overall, our hybrid system consistently outperforms alternative strategies, achieving state-of-the-art performance on FS-TACRED and strong gains on a customized FewRel subset.


2601.19919 2026-06-02 cs.CL cs.AI cs.SD (updated version)

ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition

Junseok Lee, Nahun Kim, Sangyong Lee, Chang-Jae Chun

Affiliations: OKESTRO Co., Ltd; Sejong University

AI summary: Proposes Adaptive Self-Knowledge Distillation (ASKD), a dynamic curriculum framework that gradually reduces reliance on the teacher model and adds a self-knowledge distillation phase; when compressing Whisper, it achieves a 5x inference speedup and a 1.07% lower word error rate.

Comments Title and content have been updated

Abstract

Knowledge distillation (KD) is one of the most effective paradigms for compressing large-scale foundation models into deployable architectures. In the context of Automatic Speech Recognition (ASR), previous studies have predominantly focused on forcing the student model to strictly mimic the predictive distribution of a massive teacher model. However, this static dependency often presents an inherent trade-off: while the student rapidly acquires basic linguistic representations, it simultaneously inherits the teacher's domain-specific blind spots and over-confident hallucinations, leading to a severe decline in out-of-distribution generalization capacity. To effectively mitigate this issue, we propose Adaptive Self-Knowledge Distillation (ASKD), a dynamic curriculum framework. ASKD systematically decays the dependency on the teacher's distribution as training progresses, thereby unlocking the student's independent reasoning capacity, and subsequently employs a self-knowledge distillation phase to act as a structural regularizer. By applying ASKD, we distill the massive Whisper architecture into a compact variant, ASKD-Whisper. In our comprehensive evaluations across diverse acoustic domains, ASKD-Whisper not only achieves a 5x speedup in inference latency but also outperforms its teacher model by yielding a 1.07% lower word error rate (WER). These results demonstrate that ASKD effectively prevents teacher-induced overfitting and establishes a new state-of-the-art for generalizable model compression.
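The idea of decaying the student's dependence on the teacher can be sketched as a loss whose teacher term is down-weighted over training. The abstract does not specify the schedule or the loss, so everything below is an assumption for illustration: the cosine schedule, the function names, and the mix of a teacher KL term with a hard-label cross-entropy term. The later self-knowledge distillation phase is not modelled.

```python
import math

def teacher_weight(step, total_steps, start=1.0, end=0.0):
    # Hypothetical cosine decay from `start` to `end` over training.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions given as lists.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(target_index, q, eps=1e-12):
    # Negative log-probability the student assigns to the reference token.
    return -math.log(q[target_index] + eps)

def distillation_loss(student, teacher, target_index, step, total_steps):
    # Early in training the teacher term dominates; late in training the
    # student is driven mostly by the ground-truth label.
    w = teacher_weight(step, total_steps)
    return (w * kl_divergence(teacher, student)
            + (1.0 - w) * cross_entropy(target_index, student))

student = [0.6, 0.3, 0.1]
teacher = [0.7, 0.2, 0.1]
early = distillation_loss(student, teacher, 0, step=0, total_steps=100)
late = distillation_loss(student, teacher, 0, step=100, total_steps=100)
```

At step 0 the loss equals the teacher KL term alone, and at the final step it equals the cross-entropy alone.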


2510.18439 2026-06-02 cs.CL (updated version)

Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

Yasser Hamidullah, Koel Dutta Chowdhury, Yusser Al Ghussin, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet

Affiliations: German Research Center for Artificial Intelligence (DFKI GmbH); Saarland Informatics Campus; Barcelona Supercomputing Center (BSC-CNS)

AI summary: To address hallucinations in sign language translation that arise when models rely on language priors rather than visual input, proposes a token-level reliability measure based on feature sensitivity and counterfactual signals that quantifies how much visual information is used. On two benchmarks, the measure predicts hallucination rates, generalizes across datasets, and improves risk estimation when combined with text-based signals.

Comments Published at ICLR 2026. Code available at https://github.com/yhamidullah/hallucination-slt

Abstract

Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.
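The counterfactual signal described in this abstract can be illustrated with a small sketch: compare each token's probability under the clean video with its probability under an altered video, and treat tokens whose probability barely moves as likely guesses. The function names, the clipping at zero, the mean aggregation, and the 0.1 threshold are all assumptions made for illustration; the feature-based sensitivity component is omitted, and the paper's actual aggregation may differ.

```python
def counterfactual_signal(p_clean, p_altered):
    # Per-token drop in probability when the video input is altered.
    # A drop near zero suggests the token did not depend on the video.
    return [max(c - a, 0.0) for c, a in zip(p_clean, p_altered)]

def sentence_reliability(token_scores):
    # Illustrative aggregation (assumption): mean of the token-level scores.
    return sum(token_scores) / len(token_scores)

def guessed_tokens(tokens, token_scores, threshold=0.1):
    # Flag tokens whose score falls below a hypothetical threshold.
    return [t for t, s in zip(tokens, token_scores) if s < threshold]

tokens = ["rain", "tomorrow", "north"]
p_clean = [0.9, 0.8, 0.7]      # token probabilities with the original video
p_altered = [0.2, 0.8, 0.75]   # token probabilities with the video masked
scores = counterfactual_signal(p_clean, p_altered)
reliability = sentence_reliability(scores)
flagged = guessed_tokens(tokens, scores)
```

Here only "rain" loses probability when the video is masked, so the other two tokens are flagged as candidates for language-prior guesses.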


2601.17952 2026-06-02 cs.CL cs.AI (updated version)

A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Transformer-Based Language Models

Michail Mamalakis, Tiago Azevedo, Cristian Cosentino, Chiara D'Ercoli, Subati Abulikemu, Zhongtian Sun, Richard Bethlehem, Pietro Lio

Affiliations: Department of Computer Science and Technology; Cancer Research UK; Cambridge Institute; University of Cambridge; United Kingdom; DIMES; University of Calabria; Italy; Department of Computer Automatic and Management Engineering (DIAG); Sapienza Università di Roma; Department of Psychiatry; School of Computing; University of Kent; Department of Psychology

AI summary: Proposes a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction, producing stable input-level importance scores and supporting the safe application of language models in cognitive health and neurodegenerative disease.

Abstract

Interpretability remains a key challenge for deploying language models (LMs) in clinical settings such as progression diagnosis of Alzheimer's disease, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations due to the polysemantic nature of Transformer-based LM and LLM representations, while mechanistic interpretability approaches lack direct alignment with model inputs and outputs and do not provide explicit importance scores. We introduce a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction. By constructing a monosemantic embedding space at the level of a Transformer-based LM layer and optimizing the framework to explicitly reduce inter-method variability, our approach produces stable input-level importance scores and highlights salient features via a decompressed representation of the layer of interest, advancing the safe and trustworthy application of LMs in cognitive health and neurodegenerative disease.


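2601.17898 2026-06-02 cs.CL (updated version)

Assessment of Generative Named Entity Recognition in the Era of Large Language Models

Qi Zhan, Yile Wang, Hui Huang

Affiliations: College of Computer Science and Software Engineering, Shenzhen University

AI summary: Systematically evaluates open-source LLMs on flat and nested named entity recognition, finding that parameter-efficient fine-tuning with structured output formats reaches performance competitive with traditional models, that NER capability stems from instruction-following and generative ability rather than memorization, and that fine-tuning has minimal impact on general capabilities.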

Abstract

Named entity recognition (NER) is evolving from a sequence labeling task into a generative paradigm with the rise of large language models (LLMs). We conduct a systematic evaluation of open-source LLMs on both flat and nested NER tasks. We investigate several research questions including the performance gap between generative NER and traditional NER models, the impact of output formats, whether LLMs rely on memorization, and the preservation of general capabilities after fine-tuning. Through experiments across eight LLMs of varying scales and four standard NER datasets, we find that: (1) With parameter-efficient fine-tuning and structured formats like inline bracketed or XML, open-source LLMs achieve performance competitive with traditional encoder-based models and surpass decoder-based LLMs with in-context learning techniques; (2) The NER capability of LLMs stems from instruction-following and generative power, not mere memorization of entity-label pairs; and (3) Applying NER instruction tuning has minimal impact on general capabilities of LLMs, even improving performance on datasets like DROP by 25.50 to 45.32 F1 points due to enhanced entity understanding. These findings demonstrate that generative NER with LLMs is a promising, user-friendly alternative to traditional methods. We release the data and code at https://github.com/szu-tera/LLMs4NER.


2511.12487 2026-06-02 cs.NE cs.AI cs.CL (updated version)

ToxSearch: Evolving Prompts for Toxicity Search in Large Language Models

Onkar Shelar, Travis Desell

Affiliations: Rochester Institute of Technology

AI summary: Proposes ToxSearch, a black-box evolutionary framework that tests LLM safety by evolving prompts in a synchronous steady-state loop, and analyzes the behavior of individual operators and cross-model transfer.

Comments 16 pages

DOI: 10.1007/978-3-032-23607-4_9
Journal ref: In: García-Sánchez, P., Díaz Álvarez, J., Murphy, A. (eds) Applications of Evolutionary Computation. EvoApplications 2026. Lecture Notes in Computer Science, vol 16525. Springer, Cham

Abstract

Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.
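A steady-state evolutionary loop of the kind this abstract mentions can be sketched generically: at each step one parent is picked, one operator produces a child, and the child replaces the worst population member if it scores higher. This is a toy sketch under stated assumptions, not ToxSearch itself: the function names are hypothetical, the single substitution operator draws from a harmless placeholder vocabulary, and `toy_fitness` is a stand-in for the moderation oracle that simply counts a placeholder word.

```python
import random

VOCAB = ["alpha", "beta", "gamma", "delta"]  # harmless placeholder vocabulary

def substitute(prompt, rng):
    # Toy lexical-substitution operator: swap one random word.
    words = prompt.split()
    words[rng.randrange(len(words))] = rng.choice(VOCAB)
    return " ".join(words)

def toy_fitness(prompt):
    # Stand-in for an external scoring oracle (assumption):
    # the fraction of words equal to the placeholder "gamma".
    words = prompt.split()
    return sum(w == "gamma" for w in words) / len(words)

def steady_state_search(population, operators, fitness, steps, seed=0):
    rng = random.Random(seed)
    scored = [(fitness(p), p) for p in population]
    for _ in range(steps):
        _, parent = rng.choice(scored)
        child = rng.choice(operators)(parent, rng)
        child_score = fitness(child)
        # Steady-state replacement: only the current worst member can be replaced.
        worst = min(range(len(scored)), key=lambda i: scored[i][0])
        if child_score > scored[worst][0]:
            scored[worst] = (child_score, child)
    return sorted(scored, reverse=True)

start = ["alpha beta delta", "beta beta alpha", "delta gamma alpha"]
result = steady_state_search(start, [substitute], toy_fitness, steps=200)
```

Because a child only ever replaces the worst member when it scores strictly higher, the population size stays fixed and the best score never decreases.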


2601.16462 2026-06-02 cs.CL (updated version)

Finding What Matters: Anchoring Context Knowledge with Evolving Indices for Iterative Retrieval

Mingyan Wu, Zhenghao Liu, Xinze Li, Yuqing Lan, Yukun Yan, Shuo Wang, Cheng Yang, Minghe Yu, Zheni Zeng, Maosong Sun

Affiliations: School of Computer Science and Engineering, Northeastern University, China; Department of Computer Science and Technology, Institute for AI, Tsinghua University, China; Beijing National Research Center for Information Science and Technology, China; School of Computer Science, Beijing University of Posts and Telecommunications, China

AI summary: Proposes KAIR, a framework that anchors key evidence with a knowledge index updated dynamically during iterative retrieval, guiding LLMs to reason effectively in multi-hop question answering while reducing noise.

Abstract

Retrieval-Augmented Generation (RAG) has become a dominant paradigm for mitigating hallucinations in Large Language Models (LLMs) by incorporating external knowledge. However, existing RAG systems often struggle to effectively integrate and reason over key evidence scattered across noisy retrieved documents, particularly in multi-hop scenarios. In this paper, we propose KAIR, a Knowledge Anchoring framework for Iterative Retrieval that anchors knowledge within retrieved knowledge to guide LLMs to locate the key information. During iterative retrieval, KAIR progressively updates the knowledge index to anchor salient evidence from retrieved documents. The evolving index serves as a navigational anchoring index that enables the LLM to assess knowledge sufficiency and formulate subsequent retrieval queries. Finally, KAIR generates answers by jointly leveraging the retrieved documents and the finalized anchoring index. Experiments on four multi-hop question answering benchmarks demonstrate that KAIR consistently outperforms strong RAG baselines. Further analysis shows that KAIR effectively anchors key knowledge and alleviates the context noise during iterative retrieval, improving the LLM's ability to associate and reason over dispersed evidence across retrieved documents. All code and data are available at https://github.com/NEUIR/KAIR.
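The control flow described in this abstract (retrieve, update an evolving index of salient evidence, check sufficiency, formulate the next query) can be sketched with plain callables. This is only a schematic under stated assumptions: the function names and the toy two-hop corpus are hypothetical, and the four callables stand in for steps that the abstract attributes to an LLM and a retriever.

```python
def iterative_retrieval(question, first_query, retrieve, extract_evidence,
                        is_sufficient, next_query, max_rounds=3):
    documents, index, query = [], [], first_query
    for _ in range(max_rounds):
        new_docs = retrieve(query)
        documents.extend(new_docs)
        # Grow the evolving index with salient evidence not seen before.
        for item in extract_evidence(question, new_docs):
            if item not in index:
                index.append(item)
        if is_sufficient(question, index):
            break
        query = next_query(question, index)
    return documents, index

# Toy stand-ins (assumptions) for the retriever and the LLM-driven steps.
DOCS = ["Film X was directed by Ann Lee.", "Ann Lee was born in Oslo."]

def retrieve(query):
    return [d for d in DOCS if d.startswith(query)]

def extract_evidence(question, docs):
    return list(docs)  # treat every retrieved sentence as evidence

def is_sufficient(question, index):
    return any("born in" in item for item in index)

def next_query(question, index):
    return "Ann Lee"  # a real system would derive this from the index

documents, index = iterative_retrieval(
    "Where was the director of Film X born?", "Film X",
    retrieve, extract_evidence, is_sufficient, next_query)
```

In the toy run the first round anchors the director's name, the second round retrieves the birthplace, and the loop stops once the index is judged sufficient.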


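2509.06093 2026-06-02 cs.DB cond-mat.mtrl-sci cs.AI cs.CL (updated version)

Towards Secure and Compliant AI: Organizational Standards and Protocols for NLP Model Lifecycle Management

Sunil Arora, John Hastings

Affiliations: ICAART 2026

AI summary: Proposes SC-NLP-LMF, a six-phase framework built from a systematic review that aligns with standards such as the NIST AI RMF and combines methods such as bias detection and differential privacy to keep NLP systems secure and compliant from development to retirement, illustrated with a healthcare case study.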

Comments 9 pages, 2 tables, 1 figure

DOI: 10.5220/0014346400004052
Journal ref: Proc. of the 18th International Conference on Agents and Artificial Intelligence - Vol. 2: ICAART, 1624-1634, 2026

Abstract

Natural Language Processing (NLP) systems are increasingly used in sensitive domains such as healthcare, finance, and government, where they handle large volumes of personal and regulated data. However, these systems introduce distinct risks related to security, privacy, and regulatory compliance that are not fully addressed by existing AI governance frameworks. This paper introduces the Secure and Compliant NLP Lifecycle Management Framework (SC-NLP-LMF), a comprehensive six-phase model designed to ensure the secure operation of NLP systems from development to retirement. The framework, developed through a systematic PRISMA-based review of 45 peer-reviewed and regulatory sources, aligns with leading standards, including NIST AI RMF, ISO/IEC 42001:2023, the EU AI Act, and MITRE ATLAS. It integrates established methods for bias detection, privacy protection (differential privacy, federated learning), secure deployment, explainability, and secure model decommissioning. A healthcare case study illustrates how SC-NLP-LMF detects emerging terminology drift (e.g., COVID-related language) and guides compliant model updates. The framework offers organizations a practical, lifecycle-wide structure for developing, deploying, and maintaining secure and accountable NLP systems in high-risk environments.


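2512.20638 2026-06-02 cs.CL cs.AI cs.LG (updated version)

Uncovering Competency Gaps in Large Language Models and Their Benchmarks

Maty Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Christoph Bregler, Stephanie C. Y. Chan

Affiliations: University of Zurich

AI summary: Proposes a new method based on sparse-autoencoder concept activations that automatically surfaces a model's weaknesses on fine-grained concepts (model gaps) and imbalanced benchmark coverage (benchmark gaps), grounding evaluation in the model's internal representations and allowing easy comparison across benchmarks.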

Journal ref: ICML 2026

Abstract

The evaluation of large language models relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics, but can obscure (i) particular sub-areas where the models are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). To automatically uncover both types of gaps, we propose a simple new method using concept activations from sparse autoencoders, to identify fine-grained gaps on a per-concept basis. The method also benefits from grounding evaluation in the model's internal representations, as well as easy comparison across benchmarks. We applied the method to five popular open-source models and more than a dozen benchmarks, as illustrative examples. As validation of the approach, we found that our automatic, unsupervised method was able to recover model gaps that have been previously documented in the literature (e.g. relating to sycophancy), in addition to identifying novel model gaps. We were also able to automatically uncover benchmark gaps: core concepts that should fall within the scope of a given benchmark. Our "competency gaps" method can be used to complement existing benchmarks, by providing a concept-level decomposition of model behavior, and by helping benchmark developers iterate upon benchmark design. Code is available at https://competency-gaps.github.io.


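2512.07795 2026-06-02 cs.AI cs.CL cs.LG (updated version)

ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

Nearchos Potamitis, Vansh Ramani, Har Ashish Arora, Dhairya Kuchhal, Lars Klein, Akhil Arora

Affiliations: Aarhus University; Indian Institute of Technology Delhi; EPFL

AI summary: Proposes the ReasonBench benchmark suite, which uses 30 independent trials to show that LLM reasoning systems exhibit structured variance even under greedy decoding, introduces a taxonomy of Global Noise and Run Noise, and argues that instability is an inherent property of reasoning systems that calls for distribution-aware evaluation.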

Comments 29 pages, 19 tables, 85 figures

Abstract

Benchmark scores for LLM reasoning systems are reported as single numbers, yet the same model, strategy, and task can produce meaningfully different answers and costs across repeated executions, even under greedy decoding (T = 0). This variance is not a statistical nuisance: the highest-performing strategy wins only 77% of head-to-head runs against its nearest competitor, meaning a single observed score can silently misrank systems. We introduce ReasonBench, a benchmark suite recording 30 independent trials across 10 reasoning strategies, 12 models, and 6 tasks, treating quality and cost as distributions rather than point estimates. We find that this variance is structured rather than random: a two-component taxonomy -- Global Noise, capturing cross-benchmark unevenness, and Run Noise, capturing within-benchmark stochasticity -- reveals that strategy architecture predicts stability profiles, while models and strategies shift orthogonal aspects of the distribution. A hierarchical decomposition attributes three-quarters of score variance to benchmark, system, and item structure, with a persistent residual that single-run evaluation silently absorbs. Finally, cost and quality decouple asymmetrically: cheap methods are structurally immune to joint cost-quality failure, while expensive methods remain exposed regardless of their accuracy. These findings establish instability as an inherent property of reasoning systems and motivate distribution-aware evaluation as standard practice.
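To illustrate why this abstract treats scores as distributions, the sketch below computes a head-to-head win rate and a mean and standard deviation from repeated runs of two systems. The function names and the four example scores per system are made up for illustration, and pairing run i of one system with run i of the other is an assumption; the paper's exact pairing procedure is not given in the abstract.

```python
from statistics import mean, stdev

def head_to_head_win_rate(scores_a, scores_b):
    # Fraction of paired runs in which system A strictly beats system B.
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need the same number of runs")
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    return wins / len(scores_a)

def summarize(scores):
    # Report a distribution summary instead of a single point estimate.
    return {"mean": mean(scores), "stdev": stdev(scores)}

runs_a = [0.80, 0.78, 0.82, 0.75]  # illustrative accuracies, not real results
runs_b = [0.79, 0.79, 0.80, 0.74]
win_rate = head_to_head_win_rate(runs_a, runs_b)
summary_a = summarize(runs_a)
summary_b = summarize(runs_b)
```

In this made-up example system A has the higher mean yet loses one of the four paired runs, so a single run could rank the two systems either way.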


2511.20639 2026-06-02 cs.CL cs.AI cs.LG (updated version)

Latent Collaboration in Multi-Agent Systems

Jiaru Zou, Ruizhong Qiu, Gaotang Li, Xiyuan Yang, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang

Affiliations: University of Washington

AI summary: Proposes LatentMAS, a framework that lets LLM agents collaborate directly in the continuous latent space without text mediation, achieving higher accuracy, lower overhead, and faster inference.

Comments ICML2026 Spotlight, Project: https://github.com/Gen-Verse/LatentMAS

Abstract

Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings instead of text. Then, a shared latent working memory preserves and transfers each agent's internal representations and latent thoughts, ensuring lossless information exchange without re-encoding. We provide detailed theoretical analyses showing that LatentMAS achieves higher expressiveness and lossless information preservation with lower overall complexity than standard text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS outperforms advanced single agents and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4$\times$-4.3$\times$ faster end-to-end inference. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.


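2512.05671 2026-06-02 cs.CL (updated version)

ClinTutor-R1: Advancing Scalable and Robust One-to-Many Alignment in Clinical Socratic Education

Zhitao He, Haolin Yang, Zeyu Qin, Yi R Fung

Affiliations: Hong Kong University of Science and Technology

AI summary: To address context dilution and goal misalignment in one-to-many teaching settings such as clinical ward rounds, builds ClinTeach, a Socratic teaching dialogue dataset constructed with the multi-agent simulator ClinEdu, and develops ClinTutor-R1, the first one-to-many alignment vision-language agent with an explicit internal thinking mechanism. By modeling individual belief states and group consensus, it outperforms base models by over 20% across static benchmarks, interactive evaluation, expert assessment, and a 200-participant real user study, reaches parity with proprietary models, and maintains instructional quality as student cohorts grow.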

Comments Accepted by ICML 2026 (Spotlight)

Abstract

While Large Language Models (LLMs) have achieved remarkable success in dyadic (one-on-one) instruction, they face significant challenges in One-to-Many alignment, such as clinical ward rounds, where an instructor must simultaneously guide a diverse group of trainees. Current models often suffer from context dilution and goal misalignment, failing to balance individual scaffolding with collective learning progress. To address this, we introduce ClinEdu, a multi-agent pedagogical simulator that models the complexity of group dynamics. Leveraging this platform, we construct ClinTeach, a large-scale dataset of Socratic teaching dialogues, and propose ClinTutor-R1, the first vision-language agent explicitly architected to achieve one-to-many alignment in clinical education, employing an explicit internal thinking mechanism to model both individual belief states and group consensus. We validate our framework through a comprehensive protocol covering static benchmarks, in-situ interactive evaluation within ClinEdu, expert assessment, and a 200-participant real user study. Experimental results demonstrate that ClinTutor-R1 outperforms base models by over 20% and achieves parity with proprietary models, while exhibiting scalability in maintaining instructional quality across expanding student cohorts.


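2511.21397 2026-06-02 cs.CV cs.AI cs.CL cs.LG (updated version)

Understanding the Effects of Distractors on Reasoning Vision-Language Models

Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee

Affiliations: Pohang University of Science and Technology (POSTECH)

AI summary: Builds Idis, a visual question-answering dataset with distractors that vary along semantic and numerical dimensions, to study how visual distractors affect test-time scaling in vision-language models. It finds that, unlike textual distractors, visual distractors reduce accuracy without increasing reasoning length, and proposes a simple prompting strategy to mitigate distractor-driven predictions.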

Comments preprint

Abstract

How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior work on text-only language models has shown that textual distractors can intensify inverse scaling, causing models to produce longer but less effective reasoning traces. In this work, we investigate whether similar phenomena arise in multimodal settings. We introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic and numerical dimensions. Our analyses reveal that visual distractors affect reasoning VLMs in a fundamentally different way from textual distractors: although inverse scaling still emerges, visual distractors reduce accuracy without increasing reasoning length. We further show that attribute counts extracted from reasoning traces provide key insights into how distractors interact with reasoning length and accuracy. As a sanity check, we propose a simple prompting strategy that mitigates distractor-driven predictions in reasoning vision-language models.


2511.20409 2026-06-02 cs.CL 版本更新

HalleluBERT: 让每个有意义的词都承载其权重

Raphael Schmitt

发表机构 * Raphael Schmitt（拉斐尔·施米特）

AI总结提出基于RoBERTa的希伯来语编码器家族HalleluBERT，在49.1 GB去重希伯来语网络文本和维基百科上从头训练，使用希伯来语特定字节级BPE词汇，在命名实体识别和情感分类基准上优于单语和多语基线。

DOI: 10.63317/3qdqexx4e9i2
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pp. 3022-3030, 2026

AI中文摘要

基于Transformer的模型推动了NLP的发展，但希伯来语仍然缺乏一个经过大规模训练并以base和large两种变体发布的RoBERTa编码器。我们提出HalleluBERT，一个基于RoBERTa的编码器家族，在49.1 GB去重的希伯来语网络文本和维基百科上从头训练，使用希伯来语特定的字节级BPE词汇。在希伯来语命名实体识别（BMC, NEMO）和情感分类（SMCD）的本地基准上，HalleluBERT优于单语和多语基线，并在三个基准上取得了最高的未加权平均分。我们在MIT许可下发布模型权重和分词器，以支持可复现的希伯来语NLP研究。

英文摘要

Transformer-based models have advanced NLP, yet Hebrew still lacks a RoBERTa encoder that is trained at scale and released in both base and large variants. We present HalleluBERT, a RoBERTa-based encoder family trained from scratch on 49.1 GB of deduplicated Hebrew web text and Wikipedia using a Hebrew-specific byte-level BPE vocabulary. On native Hebrew benchmarks for named entity recognition (BMC, NEMO) and sentiment classification (SMCD), HalleluBERT outperforms monolingual and multilingual baselines, and yields the highest unweighted mean score across the three benchmarks. We release model weights and tokenizer under the MIT license to support reproducible Hebrew NLP research.


2510.21364 2026-06-02 cs.CL 版本更新

SindBERT, the Sailor: Charting the Seas of Turkish NLP

SindBERT，水手：绘制土耳其语NLP的海洋

Raphael Schmitt, Stefan Schweter

发表机构 * School of Computation, Information and Technology, Technical University of Munich（慕尼黑工业大学计算、信息与技术学院）； Institute of General Practice, Faculty of Medicine and Medical Center, University of Freiburg（弗赖堡大学医学院及医学中心全科医学研究所）； Independent Researcher（独立研究者）

AI总结针对形态丰富语言预训练不足的问题，提出首个大规模土耳其语RoBERTa编码器SindBERT，在多个任务上表现有竞争力，并揭示了基准饱和与语料质量的重要性。

Comments Published at SIGTURK 2026, co-located with EACL 2026

DOI: 10.18653/v1/2026.sigturk-1.1
Journal ref: Proceedings of the Second Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2026), pp. 1-13, 2026

AI中文摘要

Transformer模型彻底改变了NLP，但许多形态丰富的语言在大规模预训练中仍然代表性不足。通过SindBERT，我们着手绘制土耳其语NLP的海洋，为土耳其语提供了首个大规模基于RoBERTa的编码器。SindBERT在312 GB土耳其语文本（mC4、OSCAR23、Wikipedia）上从头训练，以base和large两种配置发布，是土耳其语可用的首个大规模仅编码器语言模型。我们在词性标注、命名实体识别、攻击性语言检测和TurBLiMP语言可接受性基准上评估SindBERT。结果表明，SindBERT与现有的土耳其语和多语言模型相比具有竞争力，其中large变体在四个任务中的两个上取得了最佳分数，但总体上没有一致的扩展优势。这种平坦的扩展趋势，在XLM-R和EuroBERT中也观察到，表明当前的土耳其语基准可能已经饱和。同时，与更小但更精心策划的模型（如BERTurk）的比较表明，语料库质量和多样性可以超过纯粹的数据量。总之，SindBERT既作为土耳其语NLP的公开资源发布，也作为关于扩展限制和语料库组成在形态丰富语言中核心作用的实证案例研究。SindBERT模型在MIT许可下发布，并提供fairseq和Huggingface格式。

英文摘要

Transformer models have revolutionized NLP, yet many morphologically rich languages remain underrepresented in large-scale pre-training efforts. With SindBERT, we set out to chart the seas of Turkish NLP, providing the first large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base and large configurations, representing the first large-scale encoder-only language model available for Turkish. We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT performs competitively with existing Turkish and multilingual models, with the large variant achieving the best scores in two of four tasks but showing no consistent scaling advantage overall. This flat scaling trend, also observed for XLM-R and EuroBERT, suggests that current Turkish benchmarks may already be saturated. At the same time, comparisons with smaller but more curated models such as BERTurk highlight that corpus quality and diversity can outweigh sheer data volume. Taken together, SindBERT contributes both as an openly released resource for Turkish NLP and as an empirical case study on the limits of scaling and the central role of corpus composition in morphologically rich languages. The SindBERT models are released under the MIT license and made available in both fairseq and Huggingface formats.


2510.17532 2026-06-02 cs.CL cs.LG 版本更新

OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction

OncoReason: 在大语言模型中构建临床推理以实现稳健且可解释的生存预测

Raghu Vamshi Hemadri, Geetha Krishna Guruju, Kristi Topollai, Anna Ewa Choromanska

AI总结提出统一多任务学习框架，通过监督微调、思维链提示和强化学习三种对齐策略，使自回归大语言模型在MSK-CHORD数据集上联合进行二元生存分类、连续生存时间回归和自然语言理由生成，实现可解释的生存预测。

Comments This manuscript is withdrawn to allow careful review and correction of bibliographic issues identified after submission, including references that could not be adequately verified. These matters should be resolved before further circulation

AI中文摘要

预测癌症治疗结果需要模型既准确又可解释，尤其是在存在异质性临床数据的情况下。虽然大语言模型（LLM）在生物医学自然语言处理中表现出强大的性能，但它们通常缺乏在高风险决策支持中至关重要的结构化推理能力。我们提出了一个统一的多任务学习框架，将自回归LLM与临床推理对齐，用于MSK-CHORD数据集上的结果预测。我们的模型被训练来联合执行二元生存分类、连续生存时间回归和自然语言理由生成。我们评估了三种对齐策略：（1）标准监督微调（SFT），（2）带有思维链（CoT）提示的SFT，以引发逐步推理，以及（3）组相对策略优化（GRPO），一种强化学习方法，将模型输出与专家推导的推理轨迹对齐。使用LLaMa3-8B和Med42-8B骨干网络的实验表明，CoT提示将F1提高了+6.0，并将MAE降低了12%，而GRPO在BLEU、ROUGE和BERTScore上实现了最先进的可解释性和预测性能。我们进一步表明，现有的生物医学LLM由于架构限制通常无法产生有效的推理轨迹。我们的发现强调了在多任务临床建模中推理感知对齐的重要性，并为精准肿瘤学中可解释、可信赖的LLM设立了新的基准。

英文摘要

Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.


2505.22961 2026-06-02 cs.CL cs.LG 版本更新

智能的社会成本：多智能体系统中刻板偏见的涌现、传播与放大

Thi-Nhung Nguyen, Linhao Luo, Amardeep Kaur, Rollin Omari, Tamas Abraham, Junae Kim, Thuy-Trang Vu, Dinh Phung

发表机构 * Department of Data Science & AI, Monash University（数据科学与人工智能系，莫纳什大学）； Defence Science and Technology Group, Australia（澳大利亚国防科学与技术集团）

AI总结本研究提出一个评估框架，通过三个智能体级指标量化多智能体系统中偏见的涌现、传播和放大，发现通信可触发高达70%的新偏见、传播至80%以上智能体并放大3倍以上刻板印象，且密集竞争性通信增加偏见，系统易受简单偏见注入攻击。

AI中文摘要

大型语言模型（LLM）中的偏见仍然是一个持续存在的挑战，常常导致跨社会群体的刻板印象和不公平对待。虽然先前的工作主要关注单个LLM，但多智能体系统（MAS）的出现——其中多个LLM协作和通信——引入了偏见如何涌现、传播和放大的新的且未被充分探索的动态。为了系统地研究这些动态，我们提出了一个简单的评估框架，包含三个智能体级指标，用于量化多智能体交互过程中偏见的涌现、传播和放大。我们在不同的LLM骨干网络、社会群体配置、通信行为和对抗性设置下，对三个偏见基准评估了MAS。我们的结果表明，通信可以触发高达70%的新偏见涌现，将偏见传播到超过80%的智能体，并将刻板印象放大3倍以上。我们进一步发现，更密集和竞争性的通信通常会增加偏见。最后，我们证明了MAS极易受到简单的偏见注入攻击，而现有的防御策略只能提供有限的保护。我们的发现为多智能体LLM系统的公平性和鲁棒性提供了重要见解。

英文摘要

Bias in large language models (LLMs) remains a persistent challenge, often leading to stereotyping and unfair treatment across social groups. While prior work has mainly focused on individual LLMs, the emergence of multi-agent systems (MAS), where multiple LLMs collaborate and communicate, introduces new and underexplored dynamics in how bias emerges, propagates, and amplifies. To systematically investigate these dynamics, we propose a simple evaluation framework with three agent-level metrics that quantify bias emergence, propagation, and amplification throughout multi-agent interaction. We evaluate MAS across three bias benchmarks under varying LLM backbones, social-group configurations, communication behaviors, and adversarial settings. Our results show that communication can trigger up to 70% new bias emergence, propagate bias across over 80% of agents, and amplify stereotypes by more than 3×. We further find that denser and competitive communication generally increases bias. Finally, we demonstrate that MAS are highly vulnerable to simple bias injection attacks, and existing defense strategies provide only limited protection. Our findings provide important insights into the fairness and robustness of multi-agent LLM systems.


2510.10676 2026-06-02 cs.AR cs.CL cs.RO eess.AS 版本更新

Bhasha-Rupantarika: Algorithm-Hardware Co-design approach for Multilingual Neural Machine Translation

Bhasha-Rupantarika: 面向多语言神经机器翻译的算法-硬件协同设计方法

Mukul Lokhande, Tanushree Dewangan, Mohd Sharik Mansoori, Tejas Chaudhari, Akarsh J., Damayanti Lokhande, Adam Teman, Santosh Kumar Vishvakarma

发表机构 * Special Manpower Development Program for Chip to Start-Up (SMDP-C2S)（芯片到初创企业专项人才发展计划（SMDP-C2S））； Ministry of Electronics and Information Technology (MeitY)（电子与信息技术部（MeitY））

AI总结提出一种通过算法-硬件协同设计实现的轻量高效多语言翻译系统Bhasha-Rupantarika，采用亚字节精度量化（FP8/INT8/INT4/FP4）在FPGA上实现模型大小减少4.1倍、推理速度提升4.2倍，为资源受限环境下的多语言AI部署提供可行方案。

DOI: 10.1109/ISQED69900.2026.11534749
Journal ref: International Symposium on Quality Electronic Design (ISQED), San Francisco, CA, USA, 2026

AI中文摘要

本文介绍了Bhasha-Rupantarika，一个通过算法-硬件协同设计为资源受限环境量身定制的轻量高效多语言翻译系统。该方法研究了亚字节精度级别（FP8、INT8、INT4和FP4）的模型部署，实验结果表明模型大小减少4.1倍（FP4），推理速度提升4.2倍，对应吞吐量提高至66 tokens/s（提升4.8倍）。这凸显了超低精度量化对于使用FPGA加速器的物联网设备实时部署的重要性，实现了与预期相当的性能。我们的评估涵盖了印度语言和国际语言之间的双向翻译，展示了其在低资源语言环境中的适应性。FPGA部署显示LUT减少1.96倍，FF减少1.65倍，与OPU相比吞吐量提升2.2倍，与HPTA相比提升4.6倍。总体而言，该评估提供了一种基于量化感知翻译且兼顾硬件效率的可行解决方案，适用于可部署的多语言AI系统。完整的代码[https://github.com/mukullokhande99/Bhasha-Rupantarika/]和可复现数据集已公开，便于研究人员快速集成和进一步开发。

英文摘要

This paper introduces Bhasha-Rupantarika, a lightweight and efficient multilingual translation system tailored through algorithm-hardware co-design for resource-limited settings. The method investigates model deployment at sub-octet precision levels (FP8, INT8, INT4, and FP4), with experimental results indicating a 4.1x reduction in model size (FP4) and a 4.2x inference speedup, corresponding to a throughput of 66 tokens/s (a 4.8x improvement). This underscores the importance of ultra-low precision quantization for real-time deployment in IoT devices using FPGA accelerators, achieving performance on par with expectations. Our evaluation covers bidirectional translation between Indian and international languages, showcasing its adaptability in low-resource linguistic contexts. The FPGA deployment demonstrated a 1.96x reduction in LUTs and a 1.65x decrease in FFs, resulting in a 2.2x enhancement in throughput compared to OPU and a 4.6x enhancement compared to HPTA. Overall, the evaluation provides a viable solution based on quantisation-aware translation along with hardware efficiency suitable for deployable multilingual AI systems. The complete code [https://github.com/mukullokhande99/Bhasha-Rupantarika/] and the dataset needed for reproducibility are publicly available, facilitating rapid integration and further development by researchers.


2510.09608 2026-06-02 cs.CV cs.AI cs.CL 版本更新

StreamingVLM: Real-Time Understanding for Infinite Video Streams

StreamingVLM：无限视频流的实时理解

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Yao Lu, Song Han

发表机构 * MIT（麻省理工学院）； NVIDIA（英伟达）

AI总结提出StreamingVLM，通过统一训练与流推理的框架，利用注意力汇点状态复用和滑动窗口机制实现无限视频流的实时稳定理解，在Inf-Streams-Eval基准上以8 FPS速度达到66.18%胜率，并提升通用VQA能力。

Comments Published as a conference paper at ICLR 2026. The first two authors contributed equally to this work

AI中文摘要

视觉语言模型（VLM）可以为实时助手和自主代理提供动力，但它们面临一个关键挑战：理解近乎无限的视频流而不增加延迟和内存使用。对整个视频进行全注意力处理会导致二次计算成本和在长视频上性能不佳。同时，简单的滑动窗口方法也存在缺陷，它们要么破坏连贯性，要么由于冗余重计算而遭受高延迟。在本文中，我们介绍了StreamingVLM，一种专为实时、稳定理解无限视觉输入而设计的模型。我们的方法是一个统一框架，将训练与流推理对齐。在推理过程中，我们通过重用注意力汇点状态、最近视觉令牌的短窗口和最近文本令牌的长窗口来维护一个紧凑的KV缓存。这种流式能力通过一个简单的监督微调（SFT）策略灌输，该策略在短的重叠视频块上应用全注意力，有效地模拟了推理时的注意力模式，而无需在过长的上下文中进行训练。为了评估，我们构建了Inf-Streams-Eval，一个新的基准，包含平均超过两小时的视频，需要帧与文本之间的密集、每秒对齐。在Inf-Streams-Eval上，StreamingVLM对GPT-4O mini实现了66.18%的胜率，并在单个NVIDIA H100上以高达8 FPS的速度保持稳定、实时的性能。值得注意的是，我们的SFT策略还增强了通用的VQA能力，无需任何VQA特定的微调，在LongVideoBench上提高了+4.30，在OVOBench Realtime上提高了+5.96。代码可在https://github.com/mit-han-lab/streaming-vlm获取。

英文摘要

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
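
The cache policy described in the abstract (attention sinks, a short vision window, a long text window) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_kv_indices`, the `"vision"`/`"text"` tags, and the window sizes are all hypothetical, and a real system would evict key/value tensors rather than list indices.

```python
from typing import List


def select_kv_indices(
    kinds: List[str],      # per-token tag, "vision" or "text" (hypothetical encoding)
    n_sink: int,           # number of leading attention-sink positions to keep
    vision_window: int,    # how many of the most recent vision tokens to keep
    text_window: int,      # how many of the most recent text tokens to keep
) -> List[int]:
    """Return the cache positions that survive eviction, in original order."""
    keep = set(range(min(n_sink, len(kinds))))
    vision = [i for i, k in enumerate(kinds) if k == "vision"]
    text = [i for i, k in enumerate(kinds) if k == "text"]
    # Guard against a zero window: lst[-0:] would return the whole list.
    if vision_window > 0:
        keep.update(vision[-vision_window:])
    if text_window > 0:
        keep.update(text[-text_window:])
    return sorted(keep)


kinds = ["text", "text", "vision", "vision", "vision", "text", "vision", "text"]
kept = select_kv_indices(kinds, n_sink=1, vision_window=2, text_window=2)
print(kept)
```

Because the kept set is bounded by `n_sink + vision_window + text_window`, the cache size in this sketch stays constant no matter how long the stream runs.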


2510.08825 2026-06-02 cs.CL 版本更新

Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs

图上搜索：面向知识图谱的大语言模型推理的迭代式知情导航

Jia Ao Sun, Hao Yu, Fabrizio Gotti, Fengran Mo, Yihong Wu, Yuchen Hui, Zhan Su, Lingfeng Xiao, Jian-Yun Nie

发表机构 * Université de Montréal（蒙特利尔大学）； McGill University（麦吉尔大学）； Digital Media, CBC（数字媒体，CBC）； Halmstad University College（哈姆斯塔德大学学院）； University of Waterloo（滑铁卢大学）

AI总结提出Search-on-Graph方法，通过让大语言模型自身基于知识图谱结构和完整推理历史选择关系路径，遵循观察-思考-导航范式，在六个KGQA基准上超越现有方法，无需任务特定微调。

Comments Accepted to KDD '26 (32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining)

DOI: 10.1145/3770855.3817964

AI中文摘要

大语言模型（LLMs）结合知识图谱（KGs）为知识密集型推理提供了一种有前景的方法。该方法的核心是在KG中选择合适的推理路径。然而，现有方法面临一个共同限制：推理路径选择通常由独立模块使用与推理需求仅弱相关的标准进行，这往往导致选择错误的关系或过早剪枝相关路径。我们提出Search-on-Graph（SoG），一种通过让LLM自身选择要遵循的关系来加强路径选择与推理之间联系的方法，该选择基于可用的KG结构和完整的推理历史。SoG遵循“观察-思考-导航”范式：每一步，LLM观察当前实体可用的关系连接，思考哪条路径最有助于回答问题，并相应地进行导航。这种上下文感知的导航充分利用了LLM的推理能力，而不是依赖具有替代标准的独立选择模块。在六个知识图谱问答（KGQA）基准上的实验表明，SoG优于最先进的方法，同时无需任务特定的微调，并能泛化到不同的KG模式。

英文摘要

Large language models (LLMs) augmented with knowledge graphs (KGs) offer a promising approach for knowledge-intensive reasoning. Central to this approach is the selection of appropriate reasoning paths in the KG. Yet, existing methods face a common limitation: reasoning path selection is often performed by separate modules using criteria that are only weakly connected to the reasoning requirements. This often results in selecting incorrect relations or premature pruning of relevant paths. We propose Search-on-Graph (SoG), a method that strengthens the connection between path selection and reasoning by having the LLM itself select which relations to follow, informed by both the available KG structure and the complete reasoning history. SoG follows an observe-think-navigate paradigm: at each step, the LLM observes the relational connections available at the current entity, reasons about which path best advances toward answering the question, and navigates accordingly. This context-aware navigation fully exploits the LLM's reasoning capabilities rather than relying on independent selection modules with surrogate criteria. Experiments on six knowledge graph question answering (KGQA) benchmarks demonstrate that SoG outperforms state-of-the-art methods while requiring no task-specific fine-tuning and generalizing across different KG schemas.
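
The observe-think-navigate loop can be sketched as follows. This is an illustrative toy, not the SoG implementation: the dictionary-based graph, the single-valued relations, the `search_on_graph` name, and the scripted `choose_relation` callback (standing in for the LLM call) are all assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy KG: entity -> {relation: target entity}. Real KGs allow many targets per relation.
KG = Dict[str, Dict[str, str]]
History = List[Tuple[str, str]]
Chooser = Callable[[str, str, List[str], History], Optional[str]]


def search_on_graph(
    kg: KG, start: str, question: str, choose_relation: Chooser, max_steps: int = 4
) -> Tuple[str, History]:
    """Walk the graph, letting the chooser pick one relation per step."""
    entity: str = start
    history: History = []
    for _ in range(max_steps):
        relations = sorted(kg.get(entity, {}))                            # observe
        choice = choose_relation(question, entity, relations, history)    # think
        if choice is None or choice not in kg.get(entity, {}):
            break                                                         # stop: no usable relation
        history.append((entity, choice))
        entity = kg[entity][choice]                                       # navigate
    return entity, history


toy_kg: KG = {
    "Inception": {"directed_by": "Christopher Nolan"},
    "Christopher Nolan": {"born_in": "London"},
}
plan = ["directed_by", "born_in"]  # scripted stand-in for the model's decisions


def scripted_chooser(question, entity, relations, history):
    step = len(history)
    return plan[step] if step < len(plan) and plan[step] in relations else None


answer, path = search_on_graph(
    toy_kg, "Inception", "Where was the director born?", scripted_chooser
)
print(answer, path)
```

Passing the full `history` to the chooser mirrors the abstract's point that each selection is informed by the complete reasoning history and not only the current entity.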


2506.11083 2026-06-02 cs.CL 版本更新

同时多目标对齐：可验证与不可验证奖励

Yiran Shen, Yu Xia, Jonathan Chang, Prithviraj Ammanabrolu

AI总结提出MAHALO框架，通过标准化PRM训练、多动作头DPO和PRM引导解码，实现大语言模型在可验证与不可验证奖励上的多目标对齐，减少目标冲突并支持推理时控制。

Comments ICML 2026

AI中文摘要

将大语言模型与人类偏好对齐本质上是多维的，但大多数流水线将异质信号压缩为单一目标。我们试图回答如何同时在多个领域中对齐模型，这些领域包括：可验证奖励、不可验证主观偏好以及复杂交互场景。这种多目标对齐设置常常因各个目标相互冲突而困扰，导致训练效率低下和推理时用户控制有限。为了解决这些问题，我们提出了MAHALO（Multi-Action-Head Alignment with PRM-guided Decoding），这是一个统一的框架，它在可验证和不可验证设置下标准化PRM训练以进行步骤级监督，通过多动作头DPO执行向量化多目标对齐，并通过目标特定权重和PRM引导解码实现可控推理。在数学推理、人类价值观对齐和多轮辅导上的实验表明，MAHALO能够以有限的干扰同时联合改善多个目标，同时保持跨领域的泛化性和适应性，并在推理时提供灵活的用户控制。我们的代码可在 https://github.com/pearls-lab/multiobj-align 获取。

英文摘要

Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into a single objective. We seek to answer what it would take to simultaneously align a model across various domains spanning those with: verifiable rewards, non-verifiable subjective preferences, and complex interactive scenarios. Such multi-objective alignment setups are often plagued by individual objectives being at odds with each other, resulting in inefficient training and limited user control during inference. To address these issues, we propose Multi-Action-Head Alignment with PRM-guided Decoding (MAHALO), a unified framework that standardizes PRM training across verifiable and non-verifiable settings for step-level supervision, performs vectorized multi-objective alignment with Multi-Action-Head DPO, and enables controllable inference through objective-specific weighting and PRM-guided decoding. Experiments across math reasoning, human values alignment, and multi-turn tutoring show that MAHALO jointly improves multiple objectives with limited interference, while remaining generalizable and adaptable across domains and offering flexible user control at inference time. Our code is available at: https://github.com/pearls-lab/multiobj-align.


2505.17630 2026-06-02 cs.CL cs.LG 版本更新

Correcting Gradient-Based Circuit Localization via Interaction-Aware Backpropagation

通过交互感知反向传播修正基于梯度的电路定位

Joakim Edin, Casper L. Christensen, Róbert Csordás, Tuukka Ruotsalo, Zhengxuan Wu, Maria Maistro, Jing Huang, Lars Maaløe

发表机构 * Corti ； Stanford University（斯坦福大学）； Copenhagen University（哥本哈根大学）； LUT University（拉彭兰塔-拉赫蒂理工大学）

AI总结针对现有电路定位方法忽略组件交互导致重要性误估的问题，提出GIM技术，通过在反向传播中显式建模特征交互，在机械可解释性基准的电路定位任务上达到最优性能。

AI中文摘要

GeistBERT: 为德语 NLP 注入活力

Raphael Scheible-Schmitt, Johann Frei

AI总结通过在大规模德语语料库上使用 RoBERTa 架构和 Whole Word Masking 进行增量预训练，GeistBERT 在多项德语 NLP 任务上取得了领先性能，并在 GermEval 2018 细粒度文本分类中达到新 SOTA。

DOI: 10.26615/978-954-452-105-9-006
Journal ref: Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models, 2025, pp. 42-50

AI中文摘要

基于 Transformer 的语言模型的进展凸显了在高质量语料库上进行特定语言预训练的优势。在此背景下，德语 NLP 有望受益于针对德语语言特征更新的架构和现代数据集。GeistBERT 旨在通过在多样化语料库上进行增量训练并优化模型在多种 NLP 任务上的性能来改进德语语言处理。我们使用 fairseq 预训练 GeistBERT，遵循 RoBERTa 基础配置并采用 Whole Word Masking (WWM)，从 GottBERT 权重初始化。模型在 1.3 TB 的德语语料库上训练，采用动态掩码和固定的 512 令牌序列长度。为了评估，我们在标准下游任务上微调模型，包括 NER（CoNLL 2003、GermEval 2014）、文本分类（GermEval 2018 粗粒度/细粒度、10kGNAD）和 NLI（German XNLI），使用 $F_1$ 分数和准确率作为评估指标。GeistBERT 在所有任务上均取得了强劲结果，在基础模型中领先，并在 GermEval 2018 细粒度文本分类中创下新的最先进水平 (SOTA)。它还优于多个更大的模型，尤其是在分类基准测试中。为了支持德语 NLP 研究，我们在 MIT 许可下发布 GeistBERT。

英文摘要

Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. We pre-trained GeistBERT using fairseq, following the RoBERTa base configuration with Whole Word Masking (WWM), and initialized from GottBERT weights. The model was trained on a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens. For evaluation, we fine-tuned the model on standard downstream tasks, including NER (CoNLL 2003, GermEval 2014), text classification (GermEval 2018 coarse/fine, 10kGNAD), and NLI (German XNLI), using $F_1$ score and accuracy as evaluation metrics. GeistBERT achieved strong results across all tasks, leading among base models and setting a new state-of-the-art (SOTA) in GermEval 2018 fine text classification. It also outperformed several larger models, particularly in classification benchmarks. To support research in German NLP, we release GeistBERT under the MIT license.


2012.02110 2026-06-02 cs.CL cs.LG 版本更新

GottBERT: a pure German Language Model

GottBERT: 一个纯德语语言模型

Raphael Scheible, Johann Frei, Fabian Thomczyk, Henry He, Patric Tippmann, Jochen Knaus, Victor Jaravine, Frank Kramer, Martin Boeker

发表机构 * Institute for AI and Informatics in Medicine, University Hospital rechts der Isar, Technical University Munich（慕尼黑工业大学伊萨尔河右岸大学医院医学人工智能与信息学研究所）； IT-Infrastructure for Translational Medical Research, Faculty of Applied Computer Science, University of Augsburg（奥格斯堡大学应用计算机科学学院转化医学研究IT基础设施）； Data Integration Center, Faculty of Medicine, University of Freiburg（弗赖堡大学医学院数据整合中心）； School of Computation, Information and Technology, Technical University Munich（慕尼黑工业大学计算、信息与技术学院）； Institute of Medical Biometry and Statistics, Medical Center, Faculty of Medicine, University of Freiburg（弗赖堡大学医学院医学中心医学生物统计学研究所）； Freiburg Center for Data Analysis and Modeling, University of Freiburg（弗赖堡大学数据分析与建模中心）； Hengrui Europe Biosciences, Zurich（苏黎世恒瑞欧洲生物科学公司）

AI总结本文提出了首个德语单语言RoBERTa模型GottBERT，在OSCAR德语子集上预训练，并在NER和文本分类任务上展示了竞争性能。

DOI: 10.18653/v1/2024.emnlp-main.1183

AI中文摘要

预训练语言模型显著推进了自然语言处理（NLP），尤其是BERT及其优化版本RoBERTa的引入。虽然最初的研究集中在英语上，但与多语言模型相比，单语言模型在预训练工作量、整体资源效率或下游任务性能上可能具有优势。尽管基于提示的LLM越来越流行，但计算效率更高的类BERT模型仍然高度相关。在这项工作中，我们提出了第一个德语单语言RoBERTa模型GottBERT，该模型仅在OSCAR数据集的德语部分上进行预训练。此外，我们研究了过滤OSCAR语料库的影响。GottBERT使用fairseq和标准超参数进行预训练。我们在两个命名实体识别（NER）任务（CoNLL 2003和GermEval 2014）和三个文本分类任务（GermEval 2018细粒度和粗粒度，以及10kGNAD）上，与现有的德语BERT模型和两个多语言模型进行了性能评估。性能使用$F_{1}$分数和准确率来衡量。GottBERT base和large模型表现出竞争性能，其中GottBERT在base模型中于6个任务中的4个上领先。与我们的预期相反，所应用的过滤并未显著影响结果。为了支持德语NLP研究社区，我们将在MIT许可下发布GottBERT模型。

英文摘要

Pre-trained language models have significantly advanced natural language processing (NLP), especially with the introduction of BERT and its optimized version, RoBERTa. While initial research focused on English, single-language models can be advantageous compared to multilingual ones in terms of pre-training effort, overall resource efficiency or downstream task performance. Despite the growing popularity of prompt-based LLMs, more compute-efficient BERT-like models remain highly relevant. In this work, we present the first German single-language RoBERTa model, GottBERT, pre-trained exclusively on the German portion of the OSCAR dataset. Additionally, we investigated the impact of filtering the OSCAR corpus. GottBERT was pre-trained using fairseq and standard hyperparameters. We evaluated its performance on two Named Entity Recognition (NER) tasks (CoNLL 2003 and GermEval 2014) and three text classification tasks (GermEval 2018 fine and coarse, and 10kGNAD) against existing German BERT models and two multilingual models. Performance was measured using the $F_{1}$ score and accuracy. The GottBERT base and large models showed competitive performance, with GottBERT leading among the base models in 4 of 6 tasks. Contrary to our expectation, the applied filtering did not significantly affect the results. To support the German NLP research community, we are releasing the GottBERT models under the MIT license.


2506.05387 2026-06-02 cs.CL cs.AI 版本更新

Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs

推进解码策略：局部典型采样在大型语言模型中的增强

Jaydip Sen, Saptarshi Sengupta, Subhasis Dasgupta

发表机构 * Praxis Business School（普拉克斯商业学校）； San Jose State University（圣何塞州立大学）

AI总结提出自适应语义感知典型性采样（ASTS）改进局部典型采样算法，通过动态熵阈值、多目标评分和奖惩调整，在保持计算效率的同时提升文本生成的流畅性、多样性和连贯性。

Comments This is the accepted but pre-reviewed version of the chapter that has been accepted for publication in the Springer volume 'Decision-Making in Computational Intelligence-Based Systems,' edited by Witold Pedrycz, Gilberto Rivera, Rose Ma Rodriguez, and Salvador Ibarra Martinez. The chapter is 39 pages long, and it contains 2 figures and 6 tables. This is NOT the final camera-ready version

DOI: 10.1007/978-3-032-13497-4_17
Journal ref: Recent Advances in Artificial Neural Networks. Intelligent Systems Reference Library, vol 283. Springer, Cham, 2026

AI中文摘要

本章探讨了大型语言模型（LLMs）解码策略的进展，重点关注增强局部典型采样（LTS）算法。传统的解码方法，如top-k和核采样，通常在文本生成中难以平衡流畅性、多样性和连贯性。为应对这些挑战，提出了自适应语义感知典型性采样（ASTS）作为LTS的改进版本，融合了动态熵阈值、多目标评分和奖惩调整。ASTS在保持计算效率的同时，确保上下文连贯且多样的文本生成。其性能在多个基准测试中进行了评估，包括故事生成和抽象摘要，使用了困惑度、MAUVE和多样性分数等指标。实验结果表明，ASTS通过减少重复、增强语义对齐和提高流畅性，优于现有的采样技术。

英文摘要

This chapter explores advancements in decoding strategies for large language models (LLMs), focusing on enhancing the Locally Typical Sampling (LTS) algorithm. Traditional decoding methods, such as top-k and nucleus sampling, often struggle to balance fluency, diversity, and coherence in text generation. To address these challenges, Adaptive Semantic-Aware Typicality Sampling (ASTS) is proposed as an improved version of LTS, incorporating dynamic entropy thresholding, multi-objective scoring, and reward-penalty adjustments. ASTS ensures contextually coherent and diverse text generation while maintaining computational efficiency. Its performance is evaluated across multiple benchmarks, including story generation and abstractive summarization, using metrics such as perplexity, MAUVE, and diversity scores. Experimental results demonstrate that ASTS outperforms existing sampling techniques by reducing repetition, enhancing semantic alignment, and improving fluency.
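
For context, the baseline that ASTS builds on can be sketched in a few lines. The code below illustrates plain locally typical sampling only (keep the tokens whose surprisal is closest to the entropy of the distribution until a mass threshold is reached); it does not implement the ASTS additions, whose details are not given in the abstract. The function name, the dictionary input, and the `tau` value are assumptions for illustration.

```python
import math
from typing import Dict, List


def locally_typical_set(probs: Dict[str, float], tau: float) -> List[str]:
    """Return the smallest set of tokens, ranked by |surprisal - entropy|, with mass >= tau."""
    # Shannon entropy of the next-token distribution, in nats.
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    # Rank tokens by how close their surprisal (-log p) is to the entropy.
    ranked = sorted(
        (t for t, p in probs.items() if p > 0),
        key=lambda t: abs(-math.log(probs[t]) - entropy),
    )
    kept: List[str] = []
    mass = 0.0
    for token in ranked:
        kept.append(token)
        mass += probs[token]
        if mass >= tau:
            break
    return kept


next_token_probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
typical = locally_typical_set(next_token_probs, tau=0.7)
print(typical)
```

In this toy distribution the most probable token `a` is not ranked first, because the surprisal of `b` lies closer to the entropy; this is the behavior that separates typical sampling from top-k and nucleus truncation. Sampling would then renormalize the probabilities over the kept tokens.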


2504.04718 2026-06-02 cs.CL cs.AI 版本更新

T1: Tool-integrated Verification for Test-time Compute Scaling in Small Language Models

T1：小语言模型测试时计算扩展的工具集成验证

Minki Kang, Jongwon Jeong, Jaewoong Cho

发表机构 * KAIST（韩国科学技术院）； University of Wisconsin-Madison（威斯康星大学麦迪逊分校）； KRAFTON

AI总结针对小语言模型在测试时扩展中验证能力不足的问题，提出T1框架，通过外部工具过滤候选输出后由小语言模型进行最终验证，显著提升验证准确率和测试时扩展性能。

Comments ICLR 2026

AI中文摘要

近期研究表明，测试时计算扩展能有效提升小语言模型（sLMs）的性能。然而，先前研究主要利用额外的大模型作为验证器进行测试时计算扩展，而sLMs自身的验证能力尚未被充分探索。本文研究sLMs在测试时扩展中能否可靠地验证输出候选。我们发现，即使从大验证器进行知识蒸馏，sLMs在需要记忆的任务（如数值计算和事实核查）上仍表现不佳。为解决这一局限，我们提出工具集成验证（T1），这是一个两阶段框架：首先用外部工具过滤候选，然后使用sLM进行最终验证，将记忆密集型步骤卸载到代码解释器等工具上。在T1框架内，我们证明卸载到外部工具可减轻sLMs的记忆负担，并提升测试时扩展性能。在MATH基准上的实验表明，采用T1的Llama-3.2 1B模型在测试时扩展下性能优于规模更大的Llama-3.1 8B模型。此外，T1提高了过程奖励模型（PRMs）和评论家模型的验证准确率。我们的发现凸显了工具集成在显著提升sLMs验证能力方面的潜力。

英文摘要

Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably verify the output candidates under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated verification (T1), a two-stage framework that first filters candidates with external tools and then uses an sLM for final verification, offloading memorization-heavy steps to tools such as a code interpreter. Within T1, we prove that offloading to external tools reduces the memorization burden on sLMs and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 improves the verification accuracy of both process reward models (PRMs) and critic models. Our findings highlight the potential of tool integration to substantially improve the verification abilities of sLMs.
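
The two-stage idea (filter with a tool, then let the small model judge the survivors) can be sketched as below. This is a simplified illustration under stated assumptions, not the T1 implementation: `Candidate`, `tool_check`, `verify_with_tools`, the tuple encoding of arithmetic steps, and the `toy_scores` lookup standing in for the sLM verifier are all hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


@dataclass
class Candidate:
    answer: str
    # Each step is (left operand, operator, right operand, value claimed by the model).
    steps: List[Tuple[float, str, float, float]]


def tool_check(cand: Candidate, tol: float = 1e-9) -> bool:
    """Stage 1: recompute every arithmetic step with exact tooling instead of recall."""
    for left, op, right, claimed in cand.steps:
        try:
            actual = OPS[op](left, right)
        except ZeroDivisionError:
            return False
        if abs(actual - claimed) > tol:
            return False
    return True


def verify_with_tools(
    cands: List[Candidate], score: Callable[[Candidate], float]
) -> Optional[Candidate]:
    """Stage 2: the verifier only scores candidates that survived the tool filter."""
    survivors = [c for c in cands if tool_check(c)]
    return max(survivors, key=score) if survivors else None


candidates = [
    Candidate("54", [(7, "*", 8, 54)]),   # arithmetic slip, removed in stage 1
    Candidate("56", [(7, "*", 8, 56)]),
    Candidate("15", [(7, "+", 8, 15)]),   # valid arithmetic, wrong operation for the task
]
toy_scores = {"54": 0.9, "56": 0.8, "15": 0.3}  # stand-in for sLM verifier scores
best = verify_with_tools(candidates, lambda c: toy_scores[c.answer])
print(best.answer if best else None)
```

The toy scores deliberately rate the faulty candidate highest, so the example shows the filter correcting a verifier that would otherwise have picked an arithmetically wrong answer.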


2503.05500 2026-06-02 cs.CL cs.AI 版本更新

EuroBERT: Scaling Multilingual Encoders for European Languages

EuroBERT：面向欧洲语言的多语言编码器扩展

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El Haddad, Manuel Faysse, Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo

发表机构 * Artefact Research Center（Artefact研究中心）； CNRS（法国国家科学研究中心）； ISIA Lab（ISIA实验室）； Carnegie Mellon University（卡内基梅隆大学）

AI总结本文提出EuroBERT系列多语言编码器，通过整合生成式模型的最新进展，在检索、回归和分类等任务上超越现有模型，并原生支持长达8192个token的序列。

Comments 28 pages, 8 figures, 13 tables

AI中文摘要

用于检索、回归和分类的通用多语言向量表示传统上来自双向编码器模型。尽管应用广泛，但编码器最近被生成式仅解码器模型的进步所掩盖。然而，推动这一进展的许多创新并非解码器所独有。在本文中，我们通过这些进展的视角重新审视多语言编码器的发展，并介绍EuroBERT，一个覆盖欧洲及全球广泛使用语言的多语言编码器家族。我们的模型在包括多语言能力、数学和编码在内的多种任务上优于现有替代方案，并原生支持长达8192个token的序列。我们还研究了EuroBERT背后的设计决策，提供了关于数据集组成和训练流程的见解。我们公开发布EuroBERT模型，包括中间训练检查点以及我们的训练框架。

英文摘要

General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have been recently overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit the development of multilingual encoders through the lens of these advances, and introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages. Our models outperform existing alternatives across a diverse range of tasks, spanning multilingual capabilities, mathematics, and coding, and natively support sequences of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering insights into our dataset composition and training pipeline. We publicly release the EuroBERT models, including intermediate training checkpoints, together with our training framework.


2501.04424 2026-06-02 cs.AI cs.CL 版本更新

NSA: Neuro-symbolic ARC Challenge

NSA: 神经符号 ARC 挑战

Paweł Batorski, Jannik Brinkmann, Paul Swoboda

发表机构 * Heinrich Heine Universität Düsseldorf（杜塞尔多夫海因里希·海涅大学）； University of Mannheim（曼海姆大学）

AI总结提出一种结合 transformer 提案生成与领域特定语言组合搜索的神经符号方法，在 ARC 评估集上超越现有最优方法 27%。

DOI: 10.14428/esann/2026.ES2026-119
Journal ref: ESANN 2026 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 99-104, 2026

Abstract

The Abstraction and Reasoning Corpus (ARC) evaluates general reasoning capabilities that are difficult for both machine learning models and combinatorial search methods. We propose a neuro-symbolic approach that combines a transformer for proposal generation with combinatorial search using a domain-specific language. The transformer narrows the search space by proposing promising search directions, which allows the combinatorial search to find the actual solution in a short time. We pre-train the transformer with synthetically generated data. At test time, we generate additional task-specific training tasks and fine-tune our model. Our results surpass comparable state of the art on the ARC evaluation set by 27% and compare favourably on the ARC train set. We make our code and dataset publicly available at https://github.com/Batorskq/NSA.

2410.12325 2026-06-02 cs.CL Updated version

$M^3$ Scaling Law: Optimizing Multi-Epoch, Multi-Lingual, and Multi-Stage Training for Low-Resource Language Models

Kosuke Akimoto, Taiki Miyagawa, Masafumi Oyamada

Affiliations * NEC Corporation

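AI summary: Proposes the $M^3$ Scaling Law, which jointly predicts how model scale, the number of target-corpus epochs, the average target-language ratio, and the final-stage target-language ratio affect pretraining loss for low-resource language models, and derives practical guidelines for choosing the optimal training recipe.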

Comments 35 pages, 14 figures, 17 tables

Abstract

In this paper, we study a fundamental design problem in pretraining Large Language Models (LLMs) for low-resource language regimes. Existing works adopt multi-epoch, multi-lingual, and multi-stage training to utilize the limited target-language corpus efficiently, but no prior scaling law can compare recipes spanning these approaches under the same compute budget $C$ and target-language corpus size $D_T$, leaving the optimal training setup unclear. To address this gap, we propose the $M^3$ Scaling Law, a unified predictive model parameterized by the model scale, the number of target-corpus epochs $k$, the average target-language ratio $r$, and the final-stage target-language ratio $r_f$, which places monolingual single-stage, multi-lingual single-stage, and multi-lingual multi-stage recipes on a single target-language loss surface. Across three language pairs, it extrapolates to unseen hyperparameter regions more accurately than existing scaling laws. Using $M^3$ as a surrogate objective, we derive two practical guidelines for low-resource LLM pretraining: (i) as $D_T$ decreases, the optimal recipe shifts directly from monolingual single-stage to multi-lingual two-stage training at a compute-budget-dependent threshold, with multi-lingual single-stage never optimal in our experimental grid; and (ii) the optimal number of epochs collapses onto a single curve in the scarcity variable $D_T/D^*(C)$, where $D^*(C) \propto C^{\alpha/(\alpha+\beta)}$ is the monolingual compute-optimal corpus size.

2407.01374 2026-06-02 cs.CL Updated version

Bridging the Gap: Transfer Learning from English PLMs to Malaysian English

Mohan Raj Chanthran, Lay-Ki Soon, Huey Fang Ong, Bhawani Selvaretnam

Affiliations * School of Information Technology, Monash University Malaysia

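AI summary: Targets Malaysian English, a low-resource creole, by fine-tuning the pre-trained language models MENmBERT and MENBERT; MENmBERT improves over bert-base-multilingual-cased by 1.52% on named entity recognition and 26.27% on relation extraction.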

Comments Accepted in 9th Workshop on Representation Learning for NLP (Rep4NLP) at ACL 2024

Abstract

Malaysian English is a low-resource creole language that carries elements of the Malay, Chinese, and Tamil languages in addition to Standard English. Named Entity Recognition (NER) models underperform when capturing entities from Malaysian English text due to its distinctive morphosyntactic adaptations, semantic features and code-switching (mixing English and Malay). Considering these gaps, we introduce MENmBERT and MENBERT, pre-trained language models with contextual understanding, specifically tailored for Malaysian English. We have fine-tuned MENmBERT and MENBERT using manually annotated entities and relations from the Malaysian English News Article (MEN) Dataset. This fine-tuning process allows the PLM to learn representations that capture the nuances of Malaysian English relevant for NER and RE tasks. MENmBERT achieved a 1.52% and 26.27% improvement on NER and RE tasks respectively compared to the bert-base-multilingual-cased model. Although the overall NER performance does not improve significantly, our further analysis shows a significant improvement when evaluated by the 12 entity labels. These findings suggest that pre-training language models on language-specific and geographically-focused corpora can be a promising approach for improving NER performance in low-resource settings. The dataset and code published in this paper provide valuable resources for NLP research work focusing on Malaysian English.

2405.14782 2026-06-02 cs.CL Updated version

Lessons from the Trenches on Reproducible Evaluation of Language Models

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, Andy Zou

Affiliations * MBZUAI; IIIT Hyderabad; EleutherAI; HiTZ Center - Ixa, UPV/EHU; Ivy Natal; University of Michigan; HubSpot; LibrAI; Kensho; Contextual AI; Brown University; New York University; Amazon; Yale University; HKUST; Sorbonne University; CMU

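AI summary: Drawing on three years of experience developing the Language Model Evaluation Harness framework, this paper summarizes problems in language model evaluation, including methodological challenges and a lack of reproducibility and transparency, and offers recommendations for improving evaluation rigor and confidence.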

Abstract

Reliable evaluation of language models (LMs) remains an open challenge. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. Evaluation difficulties are exacerbated by the fracturing and siloing of information about conventions and common practices. In this paper we draw on three years of experience in evaluating large language models (LMs) as developers of the popular Language Model Evaluation Harness (lm-eval) (Gao et al., 2023) framework to provide guidance and lessons for the field moving forward. We document a variety of challenges faced by practitioners and provide concrete instances where these challenges or the absence of best practices have come into effect. We make recommendations to the field for improving evaluation rigor and confidence, and attempt to codify much of the tacit or folk knowledge surrounding LM evaluation, for a solid ground to move forward.

2403.07008 2026-06-02 cs.LG cs.AI cs.CL stat.ME Updated version

AutoEval Done Right: Using Synthetic Data for Model Evaluation

Pierre Boyeau, Anastasios N. Angelopoulos, Nir Yosef, Jitendra Malik, Michael I. Jordan

Affiliations * Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA; Department of Systems Immunology, Weizmann Institute of Science, Rehovot, Israel; Inria, Ecole Normale Supérieure, Paris, France

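AI summary: Proposes efficient and statistically unbiased algorithms that use AI-labeled synthetic data to reduce the amount of human annotation needed for model evaluation, increasing the effective sample size by up to 50% in experiments with GPT-4.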

Comments camera-ready paper version

2405.01930 2026-06-02 cs.CL Updated version

OARelatedWork: A Large-Scale Dataset of Related Work Sections with Full-texts from Open Access Sources

Martin Docekal, Martin Fajcik, Pavel Smrz

Affiliations * Brno University of Technology

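AI summary: Presents the OARelatedWork dataset, which contains full texts and whole related work sections for related work generation from open-access sources, and evaluates a range of models, finding that large language models lose factual accuracy when processing full texts while human authors often introduce ungrounded abstractive claims.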

Abstract

This paper introduces OARelatedWork: a dataset for related work generation from open-access sources. It is the first large-scale multi-document summarization dataset for related work generation, containing whole related work sections and full texts of cited papers. Its validation and test splits are constructed so that every cited paper is available in full text, enabling controlled evaluation of full-text related work generation. The dataset includes 94,450 papers and 5,824,689 unique referenced papers from multiple domains. With OARelatedWork, we aim to shift the field from generating parts of related work sections from abstracts only to generating entire related work sections from all available content. We (i) benchmark a wide spectrum of models, highlighting that synthesizing massive full-text contexts remains a challenge even for modern Large Language Models (LLMs): under our statement-level judge, GPT-4o-mini's evidence-grounded True rate drops from 92.9% with abstracts to 83.8% with full texts. We (ii) empirically analyze human writing behavior through a human evaluation over 40 papers and 408 factual statements, revealing that authors frequently introduce abstractive claims ungrounded in localized source texts; consequently, advanced LLMs actually surpass human baselines in strict, evidence-grounded factuality. Finally, we (iii) conduct a fine-grained meta-evaluation, revealing that standard reference-based metrics are inadequate for evaluating such long-form structured outputs, and introduce a robust statement-level evaluation framework to address this gap.

2402.14521 2026-06-02 cs.CL Updated version

Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction

Mohan Raj Chanthran, Lay-Ki Soon, Huey Fang Ong, Bhawani Selvaretnam

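AI summary: To address the difficulty that differences between Malaysian English and Standard English pose for NLP tasks, the authors build the MEN dataset of 200 news articles manually annotated with entities and relations, and show by fine-tuning the spaCy NER tool that a tailored dataset significantly improves NER performance.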

Comments Accepted at LREC-COLING 2024

Abstract

Standard English and Malaysian English exhibit notable differences, posing challenges for natural language processing (NLP) tasks on Malaysian English. Unfortunately, most existing datasets are based on Standard English and are therefore inadequate for improving NLP tasks in Malaysian English. An experiment using state-of-the-art Named Entity Recognition (NER) solutions on Malaysian English news articles highlights that they cannot handle morphosyntactic variations in Malaysian English. To the best of our knowledge, there is no annotated dataset available to improve the models. To address these issues, we constructed a Malaysian English News (MEN) dataset, which contains 200 news articles that are manually annotated with entities and relations. We then fine-tuned the spaCy NER tool and validated that having a dataset tailor-made for Malaysian English could improve the performance of NER in Malaysian English significantly. This paper presents our effort in the data acquisition, annotation methodology, and thorough analysis of the annotated dataset. To validate the quality of the annotation, inter-annotator agreement was used, followed by adjudication of disagreements by a subject matter expert. Upon completion of these tasks, we managed to develop a dataset with 6,061 entities and 3,268 relation instances. Finally, we discuss the spaCy fine-tuning setup and analyze the NER performance. This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress, particularly in NER and relation extraction. The dataset and annotation guidelines have been published on GitHub.
