arXivDaily arXiv每日学术速递 周一至周五更新
重置
2605.26110 2026-05-26 cs.LG cs.CL cs.CV 版本更新

Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning

Prism:面向可扩展多模态持续指令微调的插件式可复现基础设施

Jun-Tao Tang, Yu-Cheng Shi, Zhen-Hao Xie, Da-Wei Zhou

发表机构 * School of Artificial Intelligence, Nanjing University, China(南京大学人工智能学院) National Key Laboratory for Novel Software Technology, Nanjing University, China(南京大学新型软件技术国家重点实验室)

AI总结 针对多模态持续指令微调中工程瓶颈问题,提出Prism插件式代码库,通过轻量级插件注册机制分离算法开发与骨干实现,支持大规模训练流水线,实现可复现、可扩展的实验。

Comments Code is available at https://github.com/LAMDA-CL/Prism

详情
AI中文摘要

多模态大语言模型(MLLMs)通过指令微调将多样任务重构为统一的指令遵循框架,从而实现多功能性。然而,实际部署需要持续适应新兴任务,这推动了多模态持续指令微调(MCIT)的发展。尽管其重要性日益增长,当前的MCIT研究受到严重的工程瓶颈阻碍。现有方法通常通过直接修改基础MLLM代码库来实现,这带来了大量的实现开销,并产生了方法特定的架构,严重限制了代码复用和公平比较。为了解决这一问题,我们引入了Prism,一个专门为可扩展MCIT研究设计的插件式可复现代码库。它通过轻量级插件注册机制将算法开发与骨干实现分离,使得新策略可以作为独立插件集成,而无需修改底层MLLM代码库,从而消除结构碎片化并加速方法开发。Prism原生支持广泛使用的大规模训练流水线,从而实现可复现和可扩展的MCIT实验。代码可在https://github.com/LAMDA-CL/Prism获取。

英文摘要

Multimodal Large Language Models (MLLMs) achieve versatility by reformulating diverse tasks into a unified instruction-following framework via instruction tuning. However, real-world deployment requires continuous adaptation to emerging tasks, motivating Multimodal Continual Instruction Tuning (MCIT). Despite its growing importance, current MCIT research is hindered by severe engineering bottlenecks. Existing methods are typically implemented by directly modifying the base MLLM codebase, which imposes substantial implementation overhead and yields method-specific architectures that severely limit code reuse and fair comparison. To address this, we introduce Prism, a plug-in reproducible codebase specifically designed for scalable MCIT research. It separates algorithmic development from the backbone implementation via a lightweight plugin registration mechanism, enabling new strategies to be integrated as independent plugins without modifying the underlying MLLM codebase, thereby eliminating structural fragmentation and accelerating method development. Prism natively supports widely used large-scale training pipeline, thereby enabling reproducible and scalable MCIT experimentation. Code is available at https://github.com/LAMDA-CL/Prism.

2605.26074 2026-05-26 cs.CL cs.AI q-fin.GN 版本更新

StakeBench: Evaluating Language Understanding Grounded in Market Commitment

StakeBench: 评估基于市场承诺的语言理解

Yunhua Pei, Jingyu Hu, Yiwei Shi, Hongnan Ma, Weiru Liu, John Cartlidge

发表机构 * University of Bristol(布里斯托大学)

AI总结 提出StakeBench框架,通过将市场评论与可验证的交易记录关联,从市场行为中自动生成监督信号,评估语言模型对市场承诺的理解能力。

Comments 21 pages, 2 figures, 20 tables. Preprint. Dataset and evaluation code included

详情
AI中文摘要

现有的金融自然语言处理基准通常依赖外部观察者提供的标签,衡量语言如何被感知而非说话者在市场中承诺了什么。我们引入StakeBench,一个基于市场承诺的语言理解评估框架。StakeBench将来自2261个已结算市场的560,876条评论与Polymarket和Manifold上可验证的头寸、行动和市场赔率记录相关联。监督信号来自可观察的市场行为。头寸方向、评论后交易行动和市场赔率轨迹取代了人工标注。四个诊断任务测试模型是否检测到市场承诺、识别揭示的方向、预测未来行动以及执行集体赔率预测。三个承诺感知指标衡量与揭示偏好而非感知情绪的一致性。有效性审计和明确的解释边界有助于区分可观察的承诺信号与潜在信念和因果市场赔率影响。在15个LLM、18个主题和平台设置中,模型部分恢复了头寸方向信号,定向准确率从0.506到0.599,但在后续任务中出现结构性失败。15个模型中有10个在未来行动预测中崩溃为一到两个行动标签,且没有模型在集体赔率预测中持续优于朴素赔率方向基线。模型规模与性能不相关,金融领域微调不改善揭示方向识别,平台激励强烈影响高阶结果。StakeBench在CC-BY 4.0许可下附带评估代码和数据集。

英文摘要

Existing financial NLP benchmarks often rely on labels supplied by outside observers, measuring how language is perceived rather than what speakers have committed to in the market. We introduce StakeBench, an evaluation framework for language understanding grounded in market commitment. StakeBench links 560,876 comments from 2,261 resolved markets to verified position, action, and market-odds records across Polymarket and Manifold. Supervision is derived from observable market behavior. Position sides, post-comment trading actions, and market-odds trajectories replace human annotation. Four diagnostic tasks test whether models detect market commitment, identify the revealed side, anticipate future action, and perform collective odds projection. Three commitment-aware metrics measure alignment with revealed preferences rather than perceived sentiment. Validity audits and explicit interpretation boundaries help distinguish observable commitment signals from latent belief and causal market-odds impact. Across 15 LLMs and 18 topics and platform settings, models partially recover position-side signals, with Directed Accuracy from 0.506 to 0.599, but show structural failures on later tasks. Ten of the fifteen models collapse to one or two action labels in future action anticipation, and no model consistently improves on the naive odds-direction baseline in collective odds projection. Model scale is not correlated with performance, finance-domain tuning does not improve revealed-side identification, and platform incentives strongly shape higher-order results. StakeBench is packaged with evaluation code and dataset under CC-BY 4.0.

2605.26070 2026-05-26 cs.CL 版本更新

WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification

WhoSaidIt:面向文本的多语言说话人属性分类的人机协作标注

Lingyu Gao, Will Monroe, David Smith, Meghan Jemison, Jackie Lee

发表机构 * Duolingo

AI总结 提出一种人机协作重标注框架,通过迭代交互和分歧采样稳定多语言说话人属性标签,构建WhoSaidIt数据集并评估LLM性能。

Comments 16 pages in total

详情
AI中文摘要

从文本中标注说话人属性本质上是模糊的,尤其是在多语言环境中,人口统计和社会线索是隐含的且因文化而异。我们提出了一种人类-大语言模型(LLM)协作重标注框架,用于在实际资源限制下稳定多语言说话人属性标签。从嘈杂语料库开始,我们通过专家迭代交互利用LLM揭示重复出现的标注理由,并应用分歧聚焦采样进行针对性重标注。使用该框架,我们构建了WhoSaidIt,一个涵盖九个说话人属性标签的多语言数据集。我们量化了原始标注与修订标注之间的差异,对近期LLM进行了基准测试,并分析了显式理由对模型行为的影响。我们的结果揭示了标注决策中的显著跨语言差异,并展示了LLM在说话人属性分类中的优势与局限性。

英文摘要

Annotating speaker attributes from text is inherently ambiguous, particularly in multilingual settings where demographic and social cues are implicit and culturally variable. We propose a human-large language model (LLM) collaborative re-annotation framework for stabilizing multilingual speaker-attribute labels under practical resource constraints. Starting from a noisy corpus, we use LLMs to surface recurring annotation rationales through iterative interaction with experts, and apply disagreement-focused sampling for targeted re-annotation. Using this framework, we construct WhoSaidIt, a multilingual dataset covering nine speaker-attribute labels. We quantify divergence between original and revised annotations, benchmark recent LLMs, and analyze the effect of explicit rationales on model behavior. Our results reveal substantial cross-lingual differences in annotation decisions and demonstrate both the strengths and limitations of LLMs in speaker-attribute classification.

2605.26045 2026-05-26 cs.CL cs.AI 版本更新

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

激活预言机的置信度与校准:用于语言模型内部的可信解释

Federico Torrielli, Peter Schneider-Kamp, Lukas Galke Poech

发表机构 * University of Turin(都灵大学) University of Southern Denmark(南丹麦大学)

AI总结 本文研究了6种激活预言机置信度估计方法,发现bootstrap模式频率在校准上优于其他方法(ECE 5.7% vs 25.5%),而log-prob基线可作为快速分诊信号。

详情
AI中文摘要

激活预言机旨在使其他模型的激活对人类可读,并且与白盒可解释性技术相比取得了有希望的结果。然而,此类激活预言机自然语言输出的不确定性量化(UQ)迄今研究不足。本文研究了6种不同的激活预言机置信度估计方法,并评估其置信度分数的校准程度。我们在每个预言机6000个样本(变化动词和上下文提示)上的实验表明,bootstrap模式频率是测试中校准最好的方法(在Qwen3-8B上ECE 5.7% vs 答案词对数概率的25.5%;在Qwen3.6-27B上10.3% vs 13.1%),并且log-prob基线可以以极低的成本作为快速分诊信号。代码和修补后的训练器可在https://github.com/federicotorrielli/probabilistic_activation_oracles获取。

英文摘要

Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such activation oracles is so far understudied. Here, we investigate 6 different methods for estimating the confidence of activation oracles and evaluate how well-calibrated their confidence scores are. Our experiments on 6,000 samples per oracle (varying verbalizer and context prompts) reveal that bootstrap mode frequency is the best-calibrated method among those tested (ECE 5.7% vs. 25.5% for the answer-word log-probability on Qwen3-8B; 10.3% vs. 13.1% on Qwen3.6-27B), and that the log-prob baseline can serve as a fast triage signal at a fraction of the cost. Code and the patched trainer are available at https://github.com/federicotorrielli/probabilistic_activation_oracles.

2605.26037 2026-05-26 cs.CL 版本更新

Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use

先升后降与知识图谱工具使用的四个接口通道

Tianda Sun, Dimitar Kazakov

发表机构 * University of York(约克大学)

AI总结 本文通过最小化知识图谱工具API实验,发现标准RLVR工具使用配方在自验证检索奖励下出现先升后降的崩溃模式,并识别出四种接口失败模式,提出接口反馈差异是核心原因,单次自蒸馏可缓解但受接口限制。

Comments 18 pages, 9 figures

详情
AI中文摘要

我们在一个刻意简化的知识图谱工具API上测试了标准的RLVR工具使用配方——在Qwen2.5-7B-Instruct上使用GRPO——该API包含四个Freebase导航动词,用于Complex WebQuestions。在自验证检索奖励下,策略的工具接地回答率在250步内从3.8%攀升至9.6%,然后在单个50步窗口内骤降至0%——这一先升后降模式在四个随机种子中重复出现。在七种奖励设计中,我们发现了四种反复出现的失败模式:添加更密集或更针对性的代理奖励只会改变失败模式,而非消除它。我们认为,与Python解释器、网络搜索和JSON API的一个关键区别在于接口反馈:它们的失败通常会泄露模型在预训练中见过的自然语言信号。Python回溯会指出失败的行;而空的Freebase结果`[]`则不会。剥离这种表面现象会暴露出一种退化机制,而同系列奖励的重新设计无法修复。直接的神谕消融实验排除了关系选择:在每次检索调用中注入黄金关系仅将精确匹配准确率提高了+0.20个百分点,且95.4%的检索依赖错误是检索组合失败而非答案提取失败。作为一种缓解措施,单次自蒸馏在7B模型上达到了40.0%的精确匹配率,且具有容量不变性:将容量翻倍至14B仅将精确匹配率提高了0.25个百分点,初始化几乎无关紧要——在测试的7B-14B范围内,上限似乎受限于接口。

英文摘要

We test the standard RLVR tool-use recipe -- GRPO on Qwen2.5-7B-Instruct -- on a deliberately minimal knowledge-graph tool API: four Freebase navigation verbs over Complex WebQuestions. Under a self-verifiable retrieval reward, the policy's tool-grounded answer rate climbs from $3.8\%$ to $9.6\%$ over 250 steps, then collapses to $0\%$ within a single 50-step window -- a \emph{peak-then-collapse} pattern replicated across four seeds. Across seven reward designs, we find four recurring failure modes: adding denser or more targeted proxy rewards shifts the failure mode rather than eliminating it. We argue that a key difference from Python interpreters, web search, and JSON APIs is interface feedback: their failures often leak natural-language signal the model saw in pretraining. A Python traceback names the failing line; an empty Freebase result \texttt{[]} does not. Stripping away that surface exposes a degradation regime that same-family reward redesigns do not fix. A direct oracle ablation rules out relation selection: injecting gold relations at every retrieval call lifts exact-match accuracy by only $+0.20$~pp, and $95.4\%$ of retrieval-dependent errors are retrieval-composition failures rather than answer-extraction failures. As a mitigation, one-iteration self-distillation reaches $40.0\%$ EM at 7B and is capacity-invariant: doubling capacity to 14B improves EM by only $0.25$~pp, and initialization barely matters -- the ceiling appears interface-bound within the 7B--14B range tested.

2605.26019 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Retrieval-Augmented Detection of Potentially Abusive Clauses in Chilean Terms of Service

智利服务条款中潜在滥用条款的检索增强检测

Christoffer Loeffler, Tomás Rey Pizarro, Daniel Ignacio Miranda Vásquez, Andrea Martínez Freile

发表机构 * School of Computer Engineering, Pontificia Universidad Católica de Valparaíso(Pontificia Universidad Católica de Valparaíso计算机工程学院) Faculty of Law, Universidad Adolfo Ibáñez(Adolfo Ibáñez大学法学院)

AI总结 提出检索增强生成框架,结合混合稠密-稀疏检索与提示增强,用于自动检测和分类智利服务条款中的潜在滥用条款,并引入包含100份合同和10,029条标注条款的语料库,实验表明该方法显著提升性能,使本地模型接近云端系统。

Comments 42 pages, 6 figures, 9 tables

详情
AI中文摘要

在线服务条款通常作为附意合同运作,造成不对称性,可能使消费者面临潜在滥用条款。在智利,评估此类条款在法律上具有挑战性,因为某些条款明显违反强制性消费者法律,而其他条款则依赖于更广泛的标准,如诚信和合同失衡。我们提出一个检索增强生成框架,用于自动检测和分类智利服务条款中的潜在滥用条款。该框架设计为本地执行,结合了高效条款检测、混合稠密-稀疏检索、重排序和提示增强,以支持中等规模的开源语言模型。我们还引入了智利滥用服务条款扩展语料库,包含100份合同和10,029条标注条款,涵盖24个法律基础的类别,包括非法、黑暗和灰色条款。比较商业和开源语言模型、微调编码器以及传统基线的实验表明,检索增强提示显著提高了性能,并使本地模型能够以较低的计算和令牌成本接近更大的基于云的系统。该研究还贡献了一个精细的法律注释方案和一个用于AI辅助消费者合同审查的实用设计。

英文摘要

Online Terms of Service often function as contracts of adhesion, creating asymmetries that may expose consumers to potentially abusive clauses. In Chile, assessing such clauses is legally challenging because some provisions clearly violate mandatory consumer law, whereas others depend on broader standards such as good faith and contractual imbalance. We present a retrieval-augmented generation framework for the automated detection and classification of potentially abusive clauses in Chilean Terms of Service. Designed for local execution, it combines efficient clause detection, hybrid dense--sparse retrieval, reranking, and prompt augmentation to support medium-sized open-weight language models. We also introduce the Chilean Abusive Terms of Service Extended corpus, comprising 100 contracts and 10,029 annotated clauses in 24 legally grounded categories spanning illegal, dark, and gray clauses. Experiments comparing commercial and open-weight language models, fine-tuned encoders, and traditional baselines show that retrieval-augmented prompting substantially improves performance and enables local models to approach larger cloud-based systems at lower computational and token cost. The study also contributes a refined legal annotation scheme and a practical design for AI-assisted consumer contract review.

2605.26014 2026-05-26 cs.CV cs.CL 版本更新

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

STORM: 视频语言模型中时空推理的内化建模

Yiming Liang, Yixiao Chen, Yiyang Zhou, Yixuan Wang, Shoubin Yu, Andong Deng, Fuxiao Liu, Qin Zhang, Chen Chen, Mohit Bansal, Huaxiu Yao

发表机构 * Purdue(普渡大学) Harvard(哈佛大学) UNC(北卡罗来纳大学教堂山分校) UCF(佛罗里达大学) NVIDIA(英伟达) Physion Labs(Physion 实验室)

AI总结 提出STORM框架,通过有界连续潜在轨迹内化推理过程,无需显式文本思维链或外部工具,提升视频推理准确性并降低推理开销。

详情
AI中文摘要

许多视频推理任务需要跨帧跟踪运动、时间顺序和演化的视觉状态。基于大型视觉语言模型(LVLMs)的现有方法通常通过文本思维链(CoT)、关键帧选择、重复帧插入或外部工具使用来外化推理。虽然有效,但此类流水线增加了推理延迟和工程复杂性,并迫使时间-视觉证据被序列化为文本或从帧中重复重新编码。受视觉推理可以在语言化之前隐式发生的直觉启发,我们提出STORM(通过内化建模的时空推理),一个两阶段框架,教导LVLMs通过有界连续潜在轨迹进行推理,而不是显式文本CoT。在第一阶段,STORM将潜在令牌与从生成视频中衍生的思想-视频表示对齐,将潜在状态基于动态视觉证据。在第二阶段,模型进一步通过仅答案监督训练,鼓励推理过程内化而无需逐步注释。生成的思想视频仅在训练期间使用;在推理时,STORM执行有界潜在展开,无需重新生成视频、重新插入帧或调用外部视觉工具。在VideoMME、MVBench、TempCompass和MMVU上的实验表明,与基于工具或视频生成的推理流水线相比,STORM提高了视频推理准确性,同时显著降低了推理开销。

英文摘要

Many video reasoning tasks require tracking motion, temporal order, and evolving visual states across frames. Existing methods built on large vision-language models (LVLMs) often address this challenge by externalizing reasoning through textual chain-of-thought (CoT), keyframe selection, repeated frame reinsertion, or external tool use. While effective, such pipelines increase inference-time latency and engineering complexity, and they force temporal-visual evidence to be serialized into text or repeatedly re-encoded from frames. Inspired by the intuition that visual reasoning can occur implicitly before verbalization, we propose STORMS (Spatial-Temporal reasOning via inteRnalized Modeling), a two-stage framework that teaches LVLMs to reason through bounded continuous latent trajectories instead of explicit textual CoT. In Stage I, STORMS aligns latent tokens with thought-video representations derived from generated videos, grounding the latent states in dynamic visual evidence. In Stage II, the model is further trained with answer-only supervision, encouraging the reasoning process to be internalized without step-by-step annotations. Generated thought videos are used only during training; at inference, STORMS performs a bounded latent rollout without regenerating videos, reinserting frames, or invoking external visual tools. Experiments on VideoMME, MVBench, TempCompass, and MMVU show that STORMS improves video reasoning accuracy while substantially reducing inference overhead compared with tool or video-generation-based reasoning pipelines.

2605.26007 2026-05-26 cs.CL 版本更新

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

遗忘的词语:在低资源菲律宾语和英语对话中用于痴呆检测的NeoBERT基准测试

Rez Samantha Z. Floresca, Edric Castel C. Hao, Hannah Grachiella Buñales, Chelsea Dominique E. Temprosa, Georgianna Z. Reyes, Kervin Gabriel L. Chua

发表机构 * Ateneo de Manila Senior High School(亚特兰大大学高级中学) Analog Devices, Inc.(安森美半导体公司)

AI总结 针对菲律宾语-英语代码混合的低资源场景,首次系统评估基于Transformer的痴呆检测模型,发现双语微调可消除跨语言性能下降,达到Macro-F1=0.969-0.973。

Comments Accepted to BioNLP Workshop @ ACL 2026

详情
AI中文摘要

从自发语音中检测痴呆症提供了一种可扩展的认知筛查方法,但NLP系统仍然以英语为中心。这一限制在菲律宾尤为严重,因为菲律宾语-英语代码混合普遍存在,且尚无先前工作涉及基于NLP的痴呆检测。我们首次对基于Transformer的菲律宾语音频痴呆检测进行了系统评估,并首次在临床NLP环境中评估了NeoBERT。为了将语言与领域效应分离,我们构建了一个包含4,000个DementiaBank衍生转录本的平行双语数据集,其中菲律宾语翻译由人工完成,以保留认知衰退的话语层面标记。我们在单语、零样本跨语言和双语微调设置下评估了五个模型家族:TF-IDF + LogReg、BERT、NeoBERT、XLM-R和RoBERTa-Tagalog。我们发现,领域内性能无法跨语言迁移,英语训练的BERT在菲律宾语上Macro-F1降至0.455,且仅靠架构现代化并不能提高鲁棒性。然而,双语微调消除了所有Transformer模型的跨语言性能下降,收敛到Macro-F1=0.969-0.973。这些结果表明,多语言临床NLP性能主要受训练期间的语言覆盖范围驱动,而非模型规模或架构。

英文摘要

Dementia detection from spontaneous speech offers a scalable approach to cognitive screening, yet NLP systems remain predominantly English-centric. This limitation is especially acute in the Philippines, where Filipino-English code-switching is pervasive and no prior work has addressed NLP-based dementia detection. We present the first systematic evaluation of transformer-based dementia detection in Filipino speech and the first assessment of NeoBERT in a clinical NLP setting. To separate language from domain effects, we construct a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts, with Filipino translations produced manually to preserve discourse-level markers of cognitive decline. We evaluate five model families, TF-IDF + LogReg, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog, under monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. We find that in-domain performance does not transfer across languages, with English-trained BERT dropping to Macro-F1 = 0.455 on Filipino, and that architectural modernization alone does not improve robustness. Bilingual fine-tuning, however, eliminates cross-lingual degradation across all transformer models, converging to Macro-F1 = 0.969-0.973. These results suggest that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.

2605.26004 2026-05-26 cs.CV cs.CL 版本更新

MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

MAGIC: 面向视觉语言模型的多模态对齐与接地感知指令核心集

Shristi Das Biswas, Kaushik Roy

发表机构 * Purdue University(普渡大学)

AI总结 提出MAGIC方法,利用预训练VLM中的多模态增益、桥接相关性和技能神经元签名三种内在信号,通过无训练、前向传播的核心集选择,构建紧凑且行为保真的子集用于多模态指令微调,在20%预算下达到甚至超越全微调性能。

详情
AI中文摘要

大型视觉语言模型(LVLMs)的指令微调越来越依赖于大规模多模态语料库,然而这些数据集包含大量冗余、低视觉依赖性以及多模态推理行为覆盖极不平衡的样本。因此,均匀子采样或基于分数的朴素选择往往产生次优的训练子集。我们提出MAGIC,一种无需训练、仅前向传播的核心集选择方法,旨在为多模态指令微调构建紧凑且行为保真的子集。MAGIC基于从预训练VLM中提取的三个内在信号:多模态增益,衡量从视觉输入获得的似然改进;桥接相关性,捕捉答案令牌在视觉令牌上的接地锐度;以及技能神经元签名,通过顶部激活的前馈神经元表征每个样本引发的功能计算。MAGIC通过三阶段流程组合这些信号:过滤低增益样本,通过归一化质量目标对候选样本排序,并在离散神经元签名上执行桶式预算分配以保留潜在的多模态技能覆盖。该公式避免了反向传播、辅助选择器训练以及连续激活空间中的昂贵聚类,同时保持高效且易于部署在现有VLM中。在LLaVA-665K和Vision-Flan数据集上,以及向大型目标模型LLaVA-1.5-7B和-13B的迁移设置中,MAGIC在匹配的20%预算下持续优于强基线:在LLaVA-665K上达到全微调相对性能的100.3%,在Vision-Flan-186K上达到101.6%,同时减少了73.7%的挂钟运行时间。

英文摘要

Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: filtering low-gain examples, ranking candidates by a normalized quality objective, and performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easily deployable in existing VLMs. Across LLaVA-665K and Vision-Flan datasets, and transfer settings to large target models, LLaVA-1.5-7B and -13B, MAGIC consistently improves over strong baselines under matched 20% budgets: it achieves 100.3% relative performance to full finetuning on LLaVA-665K and 101.6% relative performance on Vision-Flan-186K, while yielding a 73.7% reduction in wall-clock run time.

2605.26001 2026-05-26 cs.CL cs.AI cs.CY 版本更新

AI-Assisted Systematization for Evaluating GenAI Systems

AI辅助的系统化方法用于评估生成式AI系统

Dhruv Agarwal, Emily Sheng, Chad Atalla, Jean Garcia-Gathright, Hussein Mozannar, Hannah Washington, Alexandra Chouldechova, Solon Barocas, Hanna Wallach

发表机构 * Cornell University(康奈尔大学) Microsoft Research(微软研究院)

AI总结 针对生成式AI评估中概念模糊的问题,提出AI辅助系统化方法,通过概念规范和验证工作表生成可衡量的概念规范,并评估其内容效度和信息可恢复性。

详情
AI中文摘要

评估生成式AI(GenAI)系统具有挑战性,因为许多评估目标都是宽泛且有争议的概念,例如“推理”、“公平性”或“创造力”。当这些概念未得到充分明确时,就不清楚应该测量什么或如何解释评估结果。这个问题反映了一个缺失的步骤:系统化,即从一个宽泛的背景概念转变为用可衡量术语对概念进行明确、结构化的描述。为了帮助解决系统化在认知上要求高且资源密集的问题,我们研究了AI辅助是否能够支持这一过程。为了实现AI辅助的系统化并评估其质量,我们引入了系统化概念的结构化表示——概念规范——以及一个验证工作表。然后,我们开发了两种AI辅助系统化工具:一种直接的零样本方法和一种多智能体方法,后者更贴近现有文献中手动系统化的方法。我们使用这些系统化工具为两个概念——仇恨言论和数字共情——生成概念规范,并评估所得概念规范的内容效度和信息可恢复性。

英文摘要

Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts, such as "reasoning," "fairness," or "creativity." When these concepts are left underspecified, it becomes unclear what should be measured or how evaluation results should be interpreted. This problem reflects a missing step: systematization, that is, moving from a broad background concept to an explicit, structured account of the concept in measurable terms. To help address the fact that systematization is cognitively demanding and resource-intensive, we investigate whether AI assistance can support this process. To enable AI-assisted systematization and assess its quality, we introduce a structured representation of a systematized concept, a concept spec, and a validation worksheet. We then develop two AI-assisted systematizers: a direct, zero-shot approach and a multi-agent approach that more closely mirrors manual systematization approaches from existing literature. We use these systematizers to produce concept specs for two concepts -- hate-based rhetoric and digital empathy -- and evaluate resulting concept specs on content validity and information recoverability.

2605.23904 2026-05-26 cs.AI cs.CL 版本更新

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

SkillOpt: 自我进化智能体技能的执行策略

Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, Chong Luo

发表机构 * Microsoft(微软公司) Shanghai Jiao Tong University(上海交通大学) Tongji University(同济大学) Fudan University(复旦大学)

AI总结 提出SkillOpt,一种系统性的可控文本空间优化器,通过分离的优化器模型对技能文档进行有界编辑,并仅在严格改善验证分数时接受编辑,从而稳定训练技能,在六个基准测试中全面优于现有方法。

Comments 27 pages, 4 figures, 6 tables

详情
AI中文摘要

当前的智能体技能要么是手工制作的,要么是一次性生成的,要么通过松散控制的自我修订来进化,这些方法都不像深度学习优化器那样作用于技能,并且都无法在反馈下可靠地改进其起点。我们认为,技能应该作为冻结智能体的外部状态进行训练,并遵循使权重空间优化可复现的相同原则。据我们所知,SkillOpt是第一个系统性的可控文本空间优化器,用于智能体技能:一个独立的优化器模型将带分数的轨迹转换为对单个技能文档的有界添加/删除/替换编辑,并且仅当编辑严格改善保留验证分数时才接受编辑。文本学习率预算、拒绝编辑缓冲区和逐轮慢/元更新使得技能训练稳定,同时在部署时无需增加推理时的模型调用。在六个基准测试、七个目标模型和三个执行框架(直接对话、Codex、Claude Code)中,SkillOpt在所有52个评估的(模型、基准、框架)单元上取得最佳或并列最佳,并击败了每个单元上的所有竞争者,包括人类、一次性LLM、Trace2Skill、TextGrad、GEPA和EvoSkill技能。在GPT-5.5上,它在直接对话中将平均无技能准确率提高了23.5个百分点,在Codex智能体循环中提高了24.8个百分点,在Claude Code中提高了19.1个百分点。迁移实验进一步表明,优化后的技能工件在跨模型规模、在Codex和Claude Code执行环境之间迁移以及迁移到邻近的数学基准测试时,无需进一步优化即可保留其价值。代码:https://aka.ms/skillopt

英文摘要

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization. Code: https://aka.ms/skillopt

2605.18646 2026-05-26 cs.CL 版本更新

Language-Switching Triggers Take a Latent Detour Through Language Models

语言切换触发器通过语言模型的潜在迂回路径

Francis Kulumba, Wissam Antoun, Théo Lasnier, Benoît Sagot, Djamé Seddah

发表机构 * Inria Paris(巴黎研究所) Sorbonne Université(索邦大学)

AI总结 本文通过电路分析揭示了一个8B参数自回归语言模型中语言切换后门攻击的内部机制,该攻击由三个拉丁词触发,将英语输出重定向为法语,并发现触发信号通过正交于语言身份方向的潜在空间传播,最终由最后一层MLP转换为法语logits。

Comments 15 pages, 16 figures. Under review

详情
AI中文摘要

语言模型的后门攻击日益成为安全关注点,但触发器序列劫持模型计算的内部机制仍不明确。我们识别了一个8B参数自回归语言模型中语言切换后门背后的电路,其中三个拉丁词触发器(九个token)将英语输出重定向为法语。我们将该电路分解为三个阶段:(1)早期层的分布式注意力头将触发token组合到最后一个序列位置;(2)产生的信号通过中间层在正交于模型自然语言身份方向的子空间中传播;(3)最后一层的MLP将此潜在信号转换为法语logits。整个电路流经单个位置的串行瓶颈:在任何层破坏该位置完全消除触发器,但也损害模型能力。正交潜在编码表明,在中间表示中搜索类似语言信号的防御将完全错过此触发器。

英文摘要

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.

2605.25988 2026-05-26 cs.CL 版本更新

What Makes a Medical Checker Trainable? Diagnosing Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA

什么使医学检查器可训练?诊断生物医学问答中检查器引导的RAG中的信号崩溃和奖励黑客

Yuelyu Ji, Min Gu Kwak, Hang Zhang, Xizhi Wu, Chenyu Li, Yanshan Wan

发表机构 * University of Pittsburgh(匹兹堡大学)

AI总结 本文通过比较四种NLI检查器作为GRPO训练的医学RAG代理的过程奖励,诊断了信号崩溃和奖励黑客现象,发现检查器的输出分布而非保留准确率决定了其是否提供可训练梯度,并提出了验证器作为奖励系统的边界条件。

详情
AI中文摘要

医学RAG需要基于证据的声明,因此将声明级别的NLI检查器插入检索增强的强化学习是直观的。 extbf{我们发现,检查器在训练期间的输出分布,而不是其保留准确率,决定了它是否提供可训练梯度。}我们比较了四种NLI检查器后端作为GRPO训练的医学RAG代理(Qwen2.5-7B,在Qwen3-4B和Llama-3.1-8B上复制)的过程奖励,跨越四个保留的医学QA基准。出现了三个诊断发现。 extbf{(i)} 信号崩溃是log概率特定的:LLM log概率评分将超过97%的声明标记为中性——将RL梯度降至零——而校准的MedNLI分类器对相同对进行非退化评分。 extbf{(ii)} 在答案质量上,中等信号优于强信号:一个强大的专有检查器触发三步奖励黑客级联——超短答案、搜索回避、语言崩溃——因此中等信号的本地分类器训练出更高质量的模型( extbf{+12% BERTScore相对于零样本,无GPT依赖})。 extbf{(iii)} 信号强度是策略依赖的:相同的检查器在一个策略上表现为中等,但在另一个策略上表现为强,而不触发级联终点。我们将这些视为验证器作为奖励系统的边界条件。

英文摘要

Medical RAG needs evidence-grounded claims, so plugging a claim-level NLI checker into retrieval-augmented RL is intuitive. \textbf{We find that the checker's \emph{output distribution} during training, not its held-out accuracy, decides whether it provides trainable gradient.} We compare four NLI checker back-ends as process rewards inside a GRPO-trained medical RAG agent (Qwen2.5-7B, replicated on Qwen3-4B and Llama-3.1-8B) across four held-out medical QA benchmarks. Three diagnostic findings emerge. \textbf{(i)} Signal collapse is log-prob-specific: LLM log-probability scoring labels over 97\% of claims neutral -- collapsing the RL gradient to zero -- while a calibrated MedNLI classifier scores the same pairs non-degenerately. \textbf{(ii)} Moderate signal beats strong signal on answer quality: a strong proprietary checker triggers a three-step reward-hacking cascade -- ultra-short answers, search avoidance, language collapse -- so a moderate-signal local classifier trains a higher-quality model (\textbf{+12\% BERTScore over zero-shot, no GPT dependency}). \textbf{(iii)} Signal strength is policy-dependent: the same checker registers as moderate on one policy but strong on another without triggering the cascade end-state. We frame these as boundary conditions for verifier-as-reward systems.

2605.25984 2026-05-26 cs.CL cs.AI 版本更新

SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation

SafeCtrl-RL: 通过RL驱动的提示优化的LLM对话推理时自适应行为控制

Michael Orme, Yanchao Yu, Zhiyuan Tan

发表机构 * School of Computing, Engineering and Building Environment(计算、工程与建筑环境学院)

AI总结 提出SafeCtrl-RL框架,利用强化学习在推理时动态选择提示调整策略,无需重新训练即可抑制不安全行为,提升LLM对话的安全性和响应质量。

详情
AI中文摘要

确保大型语言模型(LLM)的安全和上下文适当行为仍然是实际部署的关键挑战。我们提出了 extbf{SafeCtrl-RL},一个推理时行为控制框架,无需模型重新训练或参数修改即可实现自适应安全调节。该方法将对话生成形式化为一个序列决策过程,其中强化学习代理根据上下文反馈动态选择提示调整策略。这使得不安全行为可以通过迭代细化被抑制,我们将其概念化为推理时行为遗忘。在多个LLM和不安全对话场景下的评估表明,SafeCtrl-RL一致地提高了安全性和响应质量,优于现有的基于提示的优化方法,并实现了良好的性能-效率权衡。**警告:本文可能包含有害语言的示例,建议读者谨慎阅读。**

英文摘要

Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present \textbf{SafeCtrl-RL}, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance--efficiency trade-offs. **Warning: This paper may contain examples of harmful language, and reader discretion is recommended.

2605.25977 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning

创意质量对齐:通过思维链微调实现专家隐性知识迁移

Bo Zou, Chao Xu

AI总结 本文通过低数据成本和小基模型的严格工程条件,实证验证了校准惊喜中的创意质量度量,并发现数据偏差,提出创意质量对齐方法及理论解释。

详情
AI中文摘要

本文对校准惊喜(Zou & Xu, 2026a)中提出的创意质量度量进行了实证实现。本文解决的问题是:这一数学主张在工程层面是否成立?为使答案尽可能通用,我们特意选择了最严格的工程条件:低数据成本和小基模型。训练数据来自BC协议(Zou & Xu, 2026b)产生的大约100个专家思维链(CoT)标注。我们还发现了一个数据偏差:大多数公开可用的对齐数据集偏向于工艺相关知识,而受众建模和现实逻辑覆盖系统性薄弱。我们使用术语“创意质量对齐”(CQA)来描述这类工程方法。我们还提供了一个支持性的理论观察:在具有单一条件分布架构的LLM中,通过架构对偶性,校准欣赏侧会自动迁移到生成侧。这是大约100个CoT示例就足够的结构性原因——而非像LIMA(Zhou et al., 2023)那样的纯粹经验观察。

英文摘要

This paper provides an empirical implementation of the creative quality metric proposed in Calibrated Surprise (Zou & Xu, 2026a). The question this paper addresses is: does this mathematical claim hold at the engineering level? To make the answer as general as possible, we deliberately choose the strictest engineering conditions: low data cost and a small base model. Training data comes from approximately 100 expert chain-of-thought (CoT) annotations produced by the BC Protocol (Zou & Xu, 2026b). We also identify a data bias: most publicly available alignment datasets are skewed toward craft-related knowledge, while audience modeling and reality-logic coverage are systematically weak. We use the term Creative Quality Alignment (CQA) to describe this class of engineering methods. We also offer a supporting theoretical observation: in an LLM with a single conditional distribution architecture, calibrating the appreciation side automatically transfers to the generation side via architectural duality. This is the structural reason why ~100 CoT examples are sufficient -- not a purely empirical observation like LIMA (Zhou et al., 2023).

2605.25969 2026-05-26 cs.CL 版本更新

Triplet-Block Diffusion RWKV

三元组块扩散RWKV

Ke Lin, Yiyang Luo, Zhaolong Su, Yunya Song, Anyi Rao

发表机构 * William & Mary(威廉玛丽学院) HKUST(香港科技大学) Cornell(康奈尔大学)

AI总结 提出B^3D-RWKV,通过三元组块布局方法将RWKV的线性推理效率与双向离散扩散结合,实现并行解码,在8任务套件上达到可比精度,解码吞吐量平均提升1.6倍。

详情
AI中文摘要

因果Transformer语言模型存在严格顺序解码和每步二次注意力成本的问题。虽然线性时间因果模型和离散扩散模型各自解决了这些弱点,但它们的整合本质上不一致:扩散需要双向注意力,而因果模型是单向的。为了统一这些架构,我们提出了$B^3D-RWKV$,一种扩散RWKV变体,通过\emph{三元组块布局}方法将模型的$O(L)$推理效率与并行、双向离散扩散相结合。$B^3D-RWKV-7.2B$在8任务套件上达到了与现有模型相当的准确率,同时在解码吞吐量上显著优于基线,平均加速$\mathbf{1.6 imes}$。

英文摘要

Causal Transformer language models suffer from strictly sequential decoding and a quadratic per-step attention cost. While linear-time causal models and discrete diffusion models each address these weaknesses, their integration remains inherently inconsistent: diffusion requires bidirectional attention, while causal models are unidirectional. To unify these architectures, we propose $B^3D-RWKV$, a diffusion RWKV variant that integrates the model's $O(L)$ inference efficiency with parallel, bidirectional discrete-diffusion through a \emph{triplet-block layout} method. $B^3D-RWKV-7.2B$ reaches comparable accuracy on an 8-task suite versus existing models while significantly outperforming baselines in decoding throughput with an average of $\mathbf{1.6\times}$ speedup.

2605.25966 2026-05-26 cs.LG cs.CL stat.ML 版本更新

Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training

在小于100M参数量化感知训练中映射调度策略与位宽边界

Christian Brandt Thomassen

发表机构 * Dwarf A/S(Dwarf公司)

AI总结 通过大规模实验研究子100M参数解码器语言模型中,量化感知训练的最佳学习率调度是否依赖于位宽,发现INT6 QAT无需不同调度,INT4在50M以上需wd33调度,以下则噪声主导。

Comments 20 pages, 6 figures, 4 tables. 1345 training runs total (720 + 625). Submitted for review at TMLR

详情
AI中文摘要

我们测试了在子100M参数解码器语言模型中,从初始化开始的量化感知训练(QAT)的最佳学习率调度是否依赖于位宽。一项720次运行的因子网格实验(阶段2)覆盖了位宽×衰减分数×学习率大小×模型大小×随机种子(FP16/INT8/INT6,15M-100M,5个种子),发现在每个(位宽,大小)单元中,最佳衰减分数为33%。主要假设——INT6 QAT需要与高精度训练不同的调度——在FP16/INT8/INT6下被证伪。后续625次运行(阶段5)沿五个轴探测零假设:优化器(AdamW)、调度形状(余弦)、训练长度(最多9倍迭代次数)、扩展的大小扫描(5M-350M)以及从3M到100M的INT4扫描。零假设在所有三种设置变化下均稳健。INT6的惩罚遵循对数线性缩放定律,其在阶段2的拟合预测了五个保留的阶段5大小(5M、8M、175M、250M、350M),且均在95%预测区间内(5/5)。对于INT4,情况比高精度更清晰:在50M和100M时,wd33明确最优(配对z~12-15,10/10种子);低于50M时,在从3M到30M的六个测试大小中,没有单个大小显示出统计显著的调度偏好,且每个大小的平均惩罚在种子级噪声内振荡。因此,边界是从低于50M的噪声主导区域到50M及以上明确的wd33区域的过渡,而非清晰的wd10区域。权重到网格距离的探测证伪了FP16/INT8/INT6零假设的最简单机制(快速网格锁定):在衰减前,INT6-QAT权重与INT6网格的距离基本与FP16权重相同(比率~1.04)。实用建议:在子100M规模下,在FP16上调优一次学习率调度,并原封不动地应用于INT8/INT6 QAT;对于50M以上的INT4,使用wd33;对于50M以下的INT4,调度选择在噪声中。

英文摘要

We test whether the optimal learning-rate schedule depends on bit-width during from-initialisation quantisation-aware training (QAT) for sub-100M decoder language models. A 720-run factorial grid (Phase 2) over bit-width x warmdown fraction x LR magnitude x model size x seed (FP16/INT8/INT6, 15M-100M, 5 seeds) finds the optimal warmdown is 33% at every (bit-width, size) cell. The primary hypothesis -- that INT6 QAT requires a different schedule than higher-precision training -- is falsified at FP16/INT8/INT6. A 625-run follow-up (Phase 5) probes the null along five axes: optimiser (AdamW), schedule shape (cosine), training length (up to 9x more iterations), an extended size sweep (5M-350M), and an INT4 sweep from 3M to 100M. The null is robust under all three setup changes. The INT6 penalty follows a log-linear scaling law whose fit on Phase 2 predicts the five held-out Phase 5 sizes (5M, 8M, 175M, 250M, 350M) within their 95% prediction intervals (5/5). For INT4 the picture is sharper than the higher precisions: at 50M and 100M, wd33 is decisively optimal (paired z ~ 12-15, 10/10 seeds); below 50M, across the six tested sizes from 3M to 30M, no individual size shows a statistically significant schedule preference and the per-size mean penalty oscillates within seed-level noise. The boundary is therefore a transition between a noise-dominated regime below 50M and a decisive wd33 regime at and above 50M, not a clean wd10 region. A weight-to-grid-distance probe falsifies the simplest mechanism for the FP16/INT8/INT6 null result (rapid grid-snapping): pre-warmdown, INT6-QAT weights sit at essentially the same distance from the INT6 grid as FP16 weights (ratio ~ 1.04). Practical recommendation: at sub-100M scale, tune the LR schedule once at FP16 and apply unchanged to INT8/INT6 QAT; for INT4 at 50M+ use wd33; for INT4 below 50M the schedule choice is in the noise.

2605.25958 2026-05-26 cs.CL cs.CE 版本更新

PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction

PolyGnosis 2.0: 通过智能体工程增强LLM推理,用于Polymarket和OSINT洞察提取

Daren Wang, Hong Xu, Jiawen Xian

发表机构 * The Chinese University of Hong Kong(香港中文大学) Evolution AI Lab(进化人工智能实验室)

AI总结 本文提出PolyGnosis 2.0多智能体架构,通过融合Polymarket异常信号与全球开源情报(GDELT)来提取预测情报,并量化“视角错配”作为高阿尔法交易信号,同时评估“工具工程”技术(如反射循环、工具调用、分治和思维链)在高噪声金融领域的有效性。

详情
AI中文摘要

本文介绍了PolyGnosis 2.0,这是一种开创性的多智能体架构,旨在通过综合Polymarket异常信号与全球开源情报(OSINT)流(特别是全球事件、语言和语调数据库(GDELT))来提取预测情报。我们定义并瞄准“视角错配”,即Polymarket情绪与全球媒体流之间的叙事分歧,作为高阿尔法交易信号。超越通用的智能体优越性,我们严格量化了“工具工程”技术(包括反射循环、工具调用、分治分区(D&C)和思维链(CoT))在高噪声金融领域的有效性。我们针对人类专家基准的实证评估表明,虽然结构分区对于多维对齐是必需的,但无约束的终端反射会主动导致逻辑漂移。此外,我们在所有智能体配置的叙事推理过程中发现了一种普遍的“共识偏差”,需要确定性验证。最终,我们分离出一个帕累托最优配置,该配置在最小化延迟和令牌开销的同时实现了专业级的分析精度,为预测市场中的自主智能提供了稳健的蓝图。

英文摘要

This paper introduces PolyGnosis 2.0, a pioneering multi-agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly signals with global Open Source Intelligence (OSINT) streams, specifically Global Database of Events, Language, and Tone (GDELT). We define and target "Perspective Mismatches", the narrative divergence between Polymarket sentiment and global media flows, as high-alpha trading signals. Moving beyond generic agentic superiority, we rigorously quantify the efficacy of "Harness Engineering" techniques, including reflection loops, tool-calling, divide-and-conquer partitioning (D&C), and chain-of-thought (CoT), within high-noise financial domains. Our empirical evaluation against human-expert benchmarks reveals that while structural partitioning is mandatory for multi-dimensional alignment, unconstrained terminal reflection actively induces logical drift. Furthermore, we identify a pervasive "consensus bias" across all agent configurations during narrative reasoning, necessitating deterministic validation. Ultimately, we isolate a Pareto-optimal configuration that achieves professional-grade analytical precision while minimizing latency and token overhead, providing a robust blueprint for autonomous intelligence in prediction markets.

2605.25955 2026-05-26 cs.CL cs.AI cs.LG 版本更新

QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability

QUIET: 面向LLM创意生成能力的多空白级联故事完形填空基准

Bo Zou, Chao Xu

AI总结 提出QUIET基准,通过多空白级联故事完形填空和基于信息论的自动评分协议,客观评估大语言模型的创意生成能力。

详情
AI中文摘要

大语言模型(LLM)在创意能力评估中面临双重挑战:现有基准(如Story Cloze Test、HellaSwag)通过多项选择识别范式衡量模型对叙事延续的判别能力,而非直接衡量创意生成能力;基于量规的评分和LLM-as-Judge方法依赖主观维度评估或自然语言模型输出,无法提供客观、自动化的评分机制。本文提出QUIET(Quality Understanding via Interlocked Evaluation Testing),一种基于多空白级联故事完形填空的LLM创意能力诊断基准。QUIET在结构完整的故事中设置N个空白(10-20个),每个空白附带显式内容约束,且空白之间存在级联依赖关系——较早空白填充的内容约束较晚空白的可行解空间。被评估模型(或人类参与者)以开放生成模式填充所有空白;结果由基于信息论的自动化评分协议评分,无需人工评分。该评分协议直接操作化“校准惊喜”理论框架(Zou & Xu, 2026a)。对于每个空白k,计算复合分数:score = satisfy * (1 + lambda * surprise),其中lambda = 1.0。这里,“satisfy”衡量空白填充满足内容约束的程度(客观逻辑推理判断,非主观审美评分),“surprise”衡量在满足约束条件下的惊喜程度。不满足约束的创意答案得零分;满足约束但平庸的答案得分低;满足约束且令人惊喜的答案得分高。

英文摘要

Large language models (LLMs) face a dual challenge in creative capability evaluation: existing benchmarks (e.g., Story Cloze Test, HellaSwag) measure models' discriminative ability over narrative continuation using multiple-choice recognition paradigms, rather than directly measuring creative generation capability; rubric-based scoring and LLM-as-Judge methods rely on subjective dimension assessment or natural language model outputs, and cannot provide objective, automated scoring mechanisms. This paper proposes QUIET (Quality Understanding via Interlocked Evaluation Testing), a diagnostic benchmark for LLM creative capability based on multi-blank cascaded story cloze. QUIET sets N blanks (10-20) in a story with complete structure, with each blank accompanied by an explicit content constraint, and cascade dependency relationships between blanks -- the content filled into earlier blanks constrains the feasible solution space for later blanks. The evaluated model (or human participants) fills all blanks in open-ended generation mode; the results are scored by an information-theoretic automated scoring protocol without human grading. The scoring protocol directly operationalizes the "calibrated surprise" theoretical framework (Zou & Xu, 2026a). For each blank k, a composite score is computed: score = satisfy * (1 + lambda * surprise), where lambda = 1.0. Here, "satisfy" measures how well the blank filling satisfies the content constraint (objective logical reasoning judgment, not subjective aesthetic scoring), and "surprise" measures the degree of surprise given that the constraint is satisfied. Creative answers that do not satisfy the constraint score zero; answers that satisfy the constraint but are mediocre score low; answers that satisfy the constraint and are surprising score high.

2605.25928 2026-05-26 cs.CL cs.SD eess.AS 版本更新

Thaka at KSAA-2026 Task 2: Regularized Fine-Tuning for Arabic Speech Diacritization

Thaka at KSAA-2026 Task 2: 用于阿拉伯语音节符号化的正则化微调

Meshal Alamr, Hassan Alqaeri, Abdullah Aldahlawi

发表机构 * Thaka

AI总结 针对低资源阿拉伯语音节符号化任务,通过正则化微调CATT-Whisper多模态模型,结合R-Drop一致性正则化、Optuna优化超参数和Focal Loss,在KSAA-2026共享任务中取得第一名。

Comments 4 pages, 1 figure. Published in Proceedings of OSACT7 (LREC 2026). Winning system for KSAA-2026 Task 2 on Arabic Speech Diacritization

详情
AI中文摘要

我们描述了KSAA-2026阿拉伯语音听写自动音节符号化共享任务Task 2的获胜系统。该任务要求从语音音频和无音节符号的转录文本中生成完全带音节符号的阿拉伯语文本,仅提供2,327个训练样本且不允许使用外部数据。我们的系统微调了CATT-Whisper,这是一个字符级多模态模型,结合了预训练的CATT文本编码器和冻结的Whisper语音编码器。我们方法的关键是训练正则化:R-Drop一致性正则化、使用高权重衰减的Optuna优化超参数以及Focal Loss。在推理时,我们在四个模型检查点上使用蒙特卡洛Dropout在softmax概率级别平均200次随机前向传播。该系统在主要排行榜指标(包括词尾变化,含无音节符号位置)上实现了23.26%的词错误率,在所有参与者中排名第一。

英文摘要

We describe the winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization. The task requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts, with only 2,327 training samples available and no external data permitted. Our system fine-tunes CATT-Whisper, a character-level multimodal model combining a pretrained CATT text encoder with a frozen Whisper speech encoder. The key to our approach is training regularization: R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, and Focal Loss. At inference, we average 200 stochastic forward passes across four model checkpoints using Monte Carlo Dropout at the softmax probability level. The system achieves 23.26% WER on the primary leaderboard metric (with case endings, including no-diacritic positions), placing 1st among all participants.

2605.25924 2026-05-26 cs.CL cs.LG 版本更新

Does Continued Pretraining on a Learner Corpus Improve Automated Essay Scoring on English Proficiency Tests? Evidence from EFCAMDAT

在学习者语料库上继续预训练是否能提高英语水平测试的自动作文评分?来自EFCAMDAT的证据

Duy Anh Nguyen

发表机构 * University of Greenwich(格林威治大学)

AI总结 研究通过在EFCAMDAT学习者语料库上进行领域自适应继续预训练(DAPT),探究其对基于Transformer的自动作文评分(AES)在英语水平测试中的影响,发现全语料库DAPT效果不一,而基于CEFR分级的针对性DAPT能更可靠地提升领域内评分性能。

Comments 16 pages, 3 figures, 10 tables, including references and appendices

详情
AI中文摘要

最近的自动作文评分(AES)研究越来越多地使用预训练的Transformer模型,但这些模型通常是在通用领域英语上预训练的,可能无法充分代表第二语言学习者的写作。本研究调查了在EFCAMDAT学习者语料库上进行领域自适应继续预训练(DAPT)是否能提高基于Transformer的AES在英语水平测试中的表现。我们对三个Transformer编码器应用DAPT,并在FCE和IELTS上评估了领域内评分和少样本跨数据集迁移。全语料库DAPT在模型、数据集和指标上产生了混合结果。进一步分析表明,这些混合效应部分由EFCAMDAT与下游数据集在熟练度、体裁和交际目的上的不匹配解释。基于熟练度的消融实验显示,使用CEFR对齐子集进行针对性DAPT比全语料库DAPT更可靠地提高了下游评分,尤其是对于使用B1-B2数据的FCE。然而,这些增益并未一致地改善跨数据集迁移。总体而言,研究结果表明,当预训练数据与下游评估设置充分对齐时,在学习者写作语料库上继续预训练可以有益于英语评估的领域内AES,但它不会自动提高跨不同英语水平测试数据集的迁移性。

英文摘要

Recent automated essay scoring (AES) studies increasingly use pretrained transformer models, but these models are usually pretrained on general-domain English and may under-represent second-language learner writing. This study investigates whether domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus improves transformer-based AES for English proficiency tests. We apply DAPT to three transformer encoders and evaluate them on FCE and IELTS in both in-domain scoring and few-shot cross-dataset transfer. Full-corpus DAPT produces mixed results across models, datasets, and metrics. Further analyses suggest that these mixed effects are partly explained by mismatches in proficiency, genre, and communicative purpose between EFCAMDAT and the downstream datasets. A proficiency-based ablation shows that targeted DAPT using CEFR-aligned subsets improves downstream scoring more reliably than full-corpus DAPT, especially for FCE with B1--B2 data. However, these gains do not consistently improve cross-dataset transfer. Overall, the findings suggest that continued pretraining on a learner-writing corpus can benefit in-domain AES for English assessment when the pretraining data is sufficiently aligned with the downstream assessment settings. However, it does not automatically improve transferability across different English proficiency test datasets.

2605.25920 2026-05-26 cs.CL cs.AI 版本更新

Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning

LLM 能时间旅行吗?通过强化学习增强法律智能搜索中的时间一致性

Wei Fan, Yining Zhou, Mufan Zhang, Yanbing Weng, Yiran HU, Tianshi Zheng, Baixuan Xu, Chunyang Li, Jianhui Yang, Haoran Li, Yangqiu Song

发表机构 * Department of Computer Science and Engineering, HKUST, Hong Kong SAR, China(香港科技大学计算机科学与工程系) School of Law, Tsinghua University, Beijing, China(清华大学法学院) Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada(滑铁卢大学丘成桐计算机科学系)

AI总结 提出 LegalSearch-R1 框架,结合本地 statute RAG 和在线搜索,通过强化学习在跨修订期数据上训练,以解决法律 LLM 的时间偏差和搜索代理缺乏时间约束的问题,在13项法律任务上超越现有方法。

Comments Under Review

详情
AI中文摘要

虽然增强智能搜索能力的大型语言模型在法律推理方面显示出前景,但它们忽略了一个基本约束:适用法律必须与每个案件的时间背景相匹配,因为法条的事后追溯适用违反了核心法律原则并导致错误结论。我们的观察表明,当前的法律 LLM 存在锚定于其训练截止日期的时间偏差,而搜索代理很少将时间约束纳入查询,并且仅靠网络搜索无法提供法律推理所需的精确法条和先例引用。为应对这些挑战,我们提出 LegalSearch-R1,一个端到端的强化学习框架,它将本地 statute RAG 用于精确条文匹配,与在线网络搜索用于更广泛的法律知识相结合,并在涵盖多个修订期的按时间索引的数据上训练以强制执行时间一致性。在我们涵盖13项法律任务的基准上的大量实验表明,我们的7B参数代理在时间一致性上以12.9%至29.8%的优势超越最先进的深度研究框架和专门的法律 LLM,以57.7%至80.3%的优势超越基线,并展现出强大的域外泛化能力。代码和数据可在 https://github.com/AlexFanw/LegalSearch-R1 获取。

英文摘要

While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental constraint that applicable law must match the temporal context of each case, as retroactive application of statutes violates core legal principles and leads to erroneous conclusions. Our observations reveal that current legal LLMs suffer from temporal bias anchored to their training cutoff, while search agents rarely incorporate temporal constraints into queries, and that web search alone cannot provide the precise statute and precedent citations that legal reasoning demands. To address these challenges, we propose LegalSearch-R1, an end-to-end reinforcement learning framework that pairs local statute RAG for precise article matching with online web search for broader legal knowledge, trained on temporally-indexed data spanning multiple amendment periods to enforce temporal consistency. Extensive experiments on our benchmark covering 13 legal tasks demonstrate that our 7B-parameter agent outperforms state-of-the-art deep research frameworks and specialized legal LLMs by 12.9% to 29.8%, surpasses baselines by 57.7% to 80.3% on temporal consistency, and exhibits robust out-of-domain generalization. The code and data are available at https://github.com/AlexFanw/LegalSearch-R1.

2605.25903 2026-05-26 cs.CL cs.LG 版本更新

Universal Activation Verbalizer: A Unified Framework for Cross-Model Activation Explanation

通用激活词化器:跨模型激活解释的统一框架

Haiyan Zhao, Zirui He, Guanchu Wang, Ali Payani, Yingcong Li, Mengnan Du

发表机构 * New Jersey Institute of Technology(新泽西理工学院) University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校) Cisco Research(思科研究) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出通用激活词化器(UAV)框架,通过共享解码器和轻量适配器将异构模型的隐藏表示转化为自然语言解释,支持跨模型家族和规模的激活词化,在分类、事实检索和要点总结任务中与强基线竞争。

Comments 23 pages, 11 figures, 11 tables

详情
AI中文摘要

激活词化以自然语言解释隐藏表示,但现有方法大多局限于自解释,即每个模型仅解释自身的激活。我们引入通用激活词化器(UAV),一个使用共享解码器解释来自异构捐赠模型激活的框架。UAV学习一个轻量适配器,将捐赠激活转化为解码器嵌入空间中的软标记,并通过重用冻结的解码器侧LoRA同时为另一个捐赠者训练新适配器,进一步支持仅适配器迁移。在分类、事实检索和要点总结任务中,UAV在实现跨模型家族和规模的跨模型词化时,与强自解释基线保持竞争力。消融实验表明,解码器侧调优主要改善任务行为,而适配器提供激活基于的事实和语义信息,用于忠实解释。

英文摘要

Activation verbalization explains hidden representations in natural language, but existing methods are mostly limited to self-explanation, where each model explains only its own activations. We introduce Universal Activation Verbalizer (UAV), a framework that uses a shared decoder to explain activations from heterogeneous donor models. UAV learns a lightweight adapter that converts donor activations into soft tokens in decoder's embedding space, and further supports adapter-only transfer by reusing a frozen decoder-side LoRA while training only a new adapter for another donor. Across classification, fact retrieval, and gist summarization, UAV remains competitive with strong self-explanation baselines while enabling cross-model verbalization across model families and scales. Ablations show that decoder-side tuning mainly improves task behavior, whereas the adapter provides the activation-grounded factual and semantic information needed for faithful explanations.

2605.25891 2026-05-26 cs.CL cs.AI 版本更新

Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express

因果舌结:LLMs 能编码因果方向,但其是/否输出无法表达

Ziyi Ding, Xiao-Ping Zhang

发表机构 * Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院)

AI总结 研究发现大语言模型在因果问题上存在内部编码与输出不匹配的现象,通过线性探针可从隐藏状态恢复证据支持的答案(准确率约0.97),但口头是/否回答却退化为常识答案(准确率约0.5),揭示了约+0.5的差距,称为“因果舌结”。

详情
AI中文摘要

我们发现大语言模型关于因果问题所编码的内容与其回答之间存在不匹配。在反常识的 CLadder 项目上,固定的线性探针从模型隐藏状态中恢复出证据支持的答案(准确率约0.97),而口头的是/否回答则退化为常识答案(准确率约0.5)。我们将这约+0.5的差距称为“因果舌结”:错误的“是/否”回答可分解为两种可分离的失败模式——没有内部信号,或者口头接口无法表达的信号。这一发现对仅基于输出的因果基准测试具有双向影响:基准测试“正确”不一定意味着模型理解了,基准测试“错误”也不一定意味着模型不能理解。基于单一准确率数字得出的关于 LLMs 是否能够进行因果推理的笼统论断,值得重新审视。

英文摘要

We find a mismatch between what large language models encode about a causal question and what they answer. On anti-commonsense CLadder items, a fixed linear probe recovers the evidence-supported answer from the model's hidden state (accuracy approximately 0.97), while the spoken Yes/No reverts to the commonsense one (accuracy approximately 0.5). We call this approximately +0.5 gap Causal Tongue-Tie: a wrong Yes/No decomposes into two separable failure modes: no internal signal versus a signal the verbal interface cannot say. The implication cuts both ways for output-only causal benchmarks: a benchmark "correct" need not mean the model has understood, and a benchmark "wrong" need not mean it cannot. Sweeping claims about whether LLMs can do causal reasoning, drawn from a single accuracy number, deserve a second look.

2605.25869 2026-05-26 cs.CL 版本更新

Mitigating Provenance-Role Collapse in Long-Term Agents via Typed Memory Representation

通过类型化记忆表示缓解长期智能体中的来源-角色崩溃

Zhengda Jin, Bingbing Wang, Jing Li, Ruifeng Xu, Min Zhang

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) The Hong Kong Polytechnic University(香港理工大学) Shenzhen Loop Area Institute(深圳龙华研究院)

AI总结 提出MemIR类型化记忆中间表示,通过结构约束实现来源监控,解决长期智能体中因无结构存储导致的来源-角色崩溃问题,在LoCoMo和BEAM-100K上优于现有基线。

详情
AI中文摘要

长期记忆对于持久化LLM智能体至关重要,但现有架构将历史交互存储为非结构化的平面文本。这种无约束存储会导致来源-角色崩溃,即智能体出现来源监控错误的关键失效模式。为了在架构层面解决这一认知脆弱性,我们提出MemIR,一种类型化记忆中间表示,将来源监控作为结构约束来操作化。MemIR将长期记忆写入基础原子,这些原子分离原始证据、检索线索和承载真相的声明,事实授权仅限于受支持的声明原子。然后,它应用多路径原子投影和来源范围利用,将异构检索结果转化为以声明为中心的候选包,以及用于答案生成的归一化事实接口。在LoCoMo和BEAM-100K上的实验表明,MemIR持续优于现有记忆基线,特别是在需要来源追踪、时间锚定和碎片证据聚合的任务上。

英文摘要

Long-term memory is essential for persistent LLM agents, yet prevailing architectures store historical interactions as unstructured, flat text. This unconstrained storage induces provenance-role collapse, a critical failure mode where agents suffer from source-monitoring errors. To resolve this cognitive vulnerability at the architectural level, we propose MemIR, a typed Memory Intermediate Representation that operationalizes source monitoring as a structural constraint. MemIR writes long-term memory into grounded atoms that separate raw evidence, retrieval cues, and truth-bearing claims, with factual authorization restricted to supported claim atoms. It then applies multi-route atomic projection and provenance-scoped utilization to transform heterogeneous retrieval hits into claim-centered candidate bundles and a normalized fact interface for answer generation. Experiments on LoCoMo and BEAM-100K demonstrate that MemIR consistently outperforms existing memory baselines, especially on tasks requiring source tracking, temporal grounding, and aggregation of fragmented evidence.

2605.25864 2026-05-26 cs.LG cs.CL 版本更新

When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

当自我信念误导:面向可验证奖励的强化学习的主动标签获取

Li Wang, Xiaodong Lu, Xiaohan Wang, Yikun Ban, Jiajun Chai, Wei Lin, Tianhao Peng, Guojun Yin

发表机构 * Meituan(美团) Beihang University(北京航空航天大学) Nanyang Technological University(新加坡国立大学)

AI总结 提出RLAVR框架,通过主动获取少量真实标签并与伪标签结合,利用CAG指标和CARE策略稳定训练并提升有限标注预算下的性能。

详情
AI中文摘要

大型语言模型(LLM)通过可验证奖励的强化学习(RLVR)在推理能力上取得了显著进展。然而,RLVR本质上依赖于真实标签进行奖励计算,而在实际场景中获取这些标签通常成本高昂。虽然无监督的RLVR范式试图通过训练伪标签来规避这一问题,但它们极易发生训练崩溃。此外,不同样本往往具有不同的标注价值。在本文中,我们提出了主动可验证奖励的强化学习(RLAVR),它主动获取少量选定样本的真实标签,并将其与伪标签相结合,从而稳定训练动态并在有限标注预算下提高性能。为了识别有价值的样本,我们提出了纠正优势差距(CAG)指标,并分析了样本级别的监督价值。在此基础上,我们引入了用于RLAVR的纠正感知可靠性估计(CARE),它将理想的CAG准则转化为实用的预查询获取策略,以显著提高训练稳定性。跨不同领域、模型家族和模型规模的大量实验证明了我们方法的有效性和通用性。我们的代码可在https://github.com/Lumina04/CARE获取。

英文摘要

Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, different samples often exhibit varying annotation values. In this paper, we propose Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. To identify valuable samples, we propose the Corrective Advantage Gap (CAG) metric and analyze the sample-level supervision value. Building on this, we introduce Correction-Aware Reliability Estimation for RLAVR (CARE), which translates the oracle CAG criterion into a practical pre-query acquisition policy to substantially improve training stability. Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of our approach. Our code is available at https://github.com/Lumina04/CARE.

2605.25850 2026-05-26 cs.CL cs.AI cs.LG 版本更新

TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

TIAR:基于轨迹信息的优势重加权用于大语言模型弃权学习

Muyu Pan, Shu Zhao, Nan Zhang, Philip Shin, Varun Parekh, Vijaykrishnan Narayanan, Rui Zhang

发表机构 * Department of Computer Science, The Pennsylvania State University(宾夕法尼亚州立大学计算机科学系)

AI总结 本文提出TIAR方法,利用GRPO中的多条轨迹作为自然弃权信号,动态重加权弃权奖励,在六个评估类别中的五个上取得最优弃权F1分数,同时保持基线准确率。

Comments 10 pages, 1 figure, 4 tables

详情
AI中文摘要

本文研究大语言模型(LLM)的弃权学习,特别是使用三元奖励来激励大语言模型中的真实性。本文将该思想从三元奖励扩展到基于轨迹信息的优势重加权(Trajectory-Informed Advantage Reweighting),在组相对策略优化(GRPO)训练期间动态重加权弃权奖励。本工作的目标聚焦于弃权学习而非提升真实性,作为减少幻觉的探索。本文的新颖之处在于方法论创新、优势重加权和基准选择。利用GRPO的多条轨迹作为自然弃权信号,该方法使用奖励信号探索知识边界并鼓励一致性。通过证明轨迹可以作为策略相对于查询的置信度指标,进而用于动态计算弃权优势。使用AbstentionBench作为评估基准,因为本工作旨在为弃权学习领域做出贡献。对该基准上的所有数据集,均使用本方法和各种基线进行了测试。实证结果表明,TIAR在六个评估类别中的五个上取得了最优弃权F1分数,在31个基准数据集中的17个上优于静态三元基线,同时完全保持基线准确率。

英文摘要

This paper investigates large language model (LLM) abstention learning, specifically using ternary reward, which incentivize truthfulness in large language models. This paper extends that idea by moving from a ternary reward to a Trajectory-Informed advantage reweighting, dynamically re-weights the abstention reward during Group Relative Policy Optimization (GRPO) training. The objective of this work focuses on abstention learning instead of improving truthfulness, serving as an exploration into hallucination reduction. The novelty of this paper lies in methodological innovation, advantage re-weighting, and benchmark selection. Leveraging GRPO's multiple trajectories as a natural abstention signal, this method uses a reward signal to explore knowledge boundaries and encourage consistency. By demonstrating that trajectories can be used as a confidence indicator of the policy relative to the query, they are then used to dynamically calculate the abstention advantage. AbstentionBench is used as the evaluation benchmark, as this work aims to contribute to the field of abstention learning. All datasets on the benchmark were tested against this method and various baselines. Empirical results demonstrate that TIAR achieves state-of-the-art abstention F1 scores across five of six evaluation categories, outperforming the static ternary baseline on 17 of 31 benchmark datasets while fully preserving baseline accuracy.

2605.25846 2026-05-26 cs.CL 版本更新

On the Limits of Model Merging for Multilinguality in Pre-Training

论预训练中模型合并对多语言能力的限制

Seth Aycock, Fedor Vitiugin, Aleksandr Umnov, Christof Monz, Khalil Sima'an

发表机构 * University of Amsterdam(阿姆斯特丹大学) University of Turku(图尔库大学)

AI总结 通过控制实验比较混合预训练、模型合并和单语预训练,发现合并单语模型会导致性能崩溃,表明表示相似性是模型合并的前提。

Comments MeLLM Workshop 2026

详情
AI中文摘要

通过混合预训练数据或训练后的方法(如特定语言模型合并)可以实现模型一致的多语言性能。在这项工作中,我们测试了合并是否可应用于单语预训练模型。我们对混合、合并和单语预训练设置的有效性进行了控制研究。我们发现,虽然单语预训练能带来强大的语言内性能,但由于干扰,合并单语模型的任何组合都会导致性能崩溃。我们的分析表明,表示相似性是模型合并的先决条件。因此,我们得出结论,微调中合并的灵活性并不能简单地扩展到特定语言的预训练。

英文摘要

Endowing models with consistent multilingual performance can be achieved by mixing pre-training data, or post-training approaches such as language-specific model merging. In this work, we test whether merging can be applied to monolingually pre-trained models. We conduct a controlled study on the efficacy of mixed, merged, and monolingual pre-training setups. We find that while monolingual pre-training results in strong in-language performance, merging any combination of monolingual models leads to performance collapse due to interference. Our analysis suggests representational similarity is a prerequisite for model merging. We therefore conclude that the flexibility of merging in fine-tuning does not extend trivially to language-specific pre-training.

2605.25836 2026-05-26 cs.CR cs.AI cs.CL 版本更新

TTPrint: Evidence-Grounded TTP Extraction via Diverge-then-Converge Verification

TTPrint:通过发散-收敛验证实现基于证据的TTP提取

Yutong Cheng, Changze Li, Raihan Sultan Pasha Basuki, Qian Cui, Wei Ding, Peng Gao

发表机构 * Virginia Tech(弗吉尼亚理工大学) Universitas Ary Ginanjar(阿里甘jar大学) Amazon(亚马逊)

AI总结 提出TTPrint方法,采用先广泛提取后严格验证的发散-收敛设计,结合确定性证据定位与权威定义验证,在文档级TTP提取任务上显著提升宏F1分数。

Comments Preprint

详情
AI中文摘要

从网络威胁情报(CTI)报告中提取MITRE ATT&CK技术是一个开放集、多标签问题,需要高召回率(不遗漏技术)和高精确率(不虚构未支持的技术)。现有方法——基于规则、监督学习和基于LLM的方法——难以同时实现两者:基于规则和监督方法缺乏跨多种攻击描述的泛化能力,而基于LLM的方法将候选生成和验证耦合在单一推理步骤中,导致召回率和精确率同时受限。我们提出TTPrint,通过受人类分析师工作方式启发的发散-收敛设计来解决这一挑战:首先广泛提取,然后严格验证。在发散阶段,报告被分解为原子行为,并广泛提出候选技术。然后,确定性跨度定位阶段将每个候选锚定到源文本中的特定证据窗口。收敛验证阶段仅保留由定位证据和权威MITRE定义支持的候选。我们贡献了两个评估资源——清理后的TRAM基准(TRAM-Clean)和一个新的注释数据集(TTPrint-Bench)——以解决现有基准中的已知注释噪声,并将任务提升到文档级TTP提取。在TRAM-Clean和TTPrint-Bench上,TTPrint分别达到76.48%和87.39%的宏F1,比领先基线高出63.5%和29.4%。跨六个LLM的多骨干分析和阈值敏感性研究进一步证明了跨模型选择的泛化能力,并为参数选择提供了实用指导。

英文摘要

Extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports is an open-set, multi-label problem requiring both high recall (not missing techniques) and high precision (not hallucinating unsupported ones). Existing methods--rule-based, supervised, and LLM-based--struggle to achieve both: rule-based and supervised approaches lack generalizability across diverse attack descriptions, while LLM-based approaches that couple candidate generation and validation within a single inference step suffer from limited recall and precision simultaneously. We propose TTPrint, which addresses this challenge through a diverge-then-converge design inspired by how human analysts work: first extracting broadly, then verifying rigorously. In the divergent phase, reports are decomposed into atomic behaviors and candidate techniques are proposed broadly. A deterministic span localization stage then anchors each candidate to a specific evidence window in the source text. A convergent verification stage retains only candidates supported by both the localized evidence and the authoritative MITRE definition. We contribute two evaluation resources--a cleaned TRAM benchmark (TRAM-Clean) and a new annotated dataset (TTPrint-Bench)--to address known annotation noise in existing benchmarks and elevate the task to document-level TTP extraction. On TRAM-Clean and TTPrint-Bench, TTPrint achieves 76.48% and 87.39% macro-F1 respectively, outperforming the leading baseline by 63.5% and 29.4%. A multi-backbone analysis across six LLMs and a threshold sensitivity study further demonstrate generalizability across model choices and provide practical guidance for parameter selection.

2605.25832 2026-05-26 cs.RO cs.AI cs.CL cs.CV 版本更新

When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills

当搜索成为记忆:将机器人设计试验转化为可迁移技能

Yunfei Wang, Xiaohao Xu, Yang Li, Xiaonan Huang

发表机构 * University of Michigan(密歇根大学)

AI总结 提出Auto-Robotist,一种自进化LLM代理,通过将形态搜索轨迹提炼为自然语言技能库,实现可迁移的机器人设计知识,在EvoGym任务中提升冷启动搜索并跨设计空间迁移技能。

Comments 20 pages, 8 figures

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作进化机器人设计的提案生成器,但大多数循环仍然是无记忆的:模拟结果塑造下一代种群,但并未作为可复用的设计知识保留。我们提出Auto-Robotist,一种自进化的LLM代理,它将形态搜索轨迹提炼为显式的自然语言技能库。每个技能存储结构原型、基于证据的正负规则以及支持它们的评估设计,使设计记忆可检查而非隐含在种群中。在搜索过程中,代理检索技能以调节LLM对精英主体的编辑,同时保留遗传算法(GA)突变路径以进行探索;评估后,通过添加、诊断和合并更新库。在涵盖运动、穿越和物体交互的七个EvoGym任务中,Auto-Robotist改善了冷启动5x5搜索,并将学到的技能迁移到10x10设计空间,其中参考条件迁移在每个任务上都优于GA。这些结果表明,LLM代理可以将昂贵的物理评估转化为可复用、可审计的设计原则。我们的代码将在接收后发布。

英文摘要

Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless: simulator results shape the next population but are not preserved as reusable design knowledge. We present Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into an explicit natural-language skill library. Each skill stores a structural archetype, evidence-grounded positive and negative rules, and the evaluated designs that support them, making design memory inspectable rather than implicit in a population. During search, the agent retrieves skills to condition LLM edits of elite bodies while retaining a Genetic Algorithm (GA) mutation path for exploration; after evaluation, it updates the library through Add, Diagnose, and Merge. Across seven EvoGym tasks spanning locomotion, traversal, and object interaction, Auto-Robotist improves cold-start 5x5 search and transfers learned skills to 10x10 design spaces, where reference-conditioned transfer outperforms GA on every task. These results suggest that LLM agents can convert expensive physical evaluations into reusable, auditable design principles. Our code will be released upon acceptance.

2605.25831 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation

澄清、弃权或回答?基于信念增强生成的对话策略

Joris Baan, Wilker Aziz, Barbara Plank, Raquel Fernández

发表机构 * University of Amsterdam(阿姆斯特丹大学) MCML Munich(慕尼黑MCML) LMU Munich(慕尼黑莱茵-魏尔堡大学)

AI总结 提出信念增强生成(BAG)方法,通过将大语言模型自身的信念状态注入提示,使其推理多个采样响应并决定对话策略(回答、澄清或弃权),从而提升多轮模糊问答的准确性和策略决策的忠实度。

详情
AI中文摘要

大语言模型(LLMs)定义了文本上的分布,这可以视为不确定性的概率表示:采样K个响应会产生一个信念状态——模型认为合理的响应。现有工作利用这种表示进行解码或选择性预测等狭窄任务,通常需要手动干预,无法直接控制生成。我们提出信念增强生成(BAG):通过提示将LLMs锚定在其自身的信念状态中,并让它们推理这K个样本以决定对话策略:回答、澄清或弃权。在多轮模糊问答设置中,我们发现LLMs默认很少澄清或弃权,忽略了关于输入或事实的不确定性。BAG在六个模型上提高了问答准确性,并产生了比仅提示基线更忠实于信念状态的策略决策。然而,区分何时澄清与何时弃权仍然具有挑战性。

英文摘要

Large language models (LLMs) define a distribution over text, which can be viewed as a probabilistic representation of uncertainty: sampling K responses yields a belief state - responses a model deems plausible. Existing work exploits this representation for narrow tasks like either decoding or selective prediction, and often requires manual interventions, not controlling generation directly. We propose Belief-Augmented Generation (BAG): grounding LLMs in their own belief state via the prompt and letting them reason over these K samples to decide on a conversational strategy: answer, clarify, or abstain. In a multi-turn ambiguous QA setting, we find that LLMs by default rarely clarify or abstain, ignoring uncertainty about the input or facts. BAG improves QA accuracy across six models and yields strategy decisions more faithful to the belief state than prompt-only baselines. Disentangling when to clarify from when to abstain, however, remains challenging.

2605.25816 2026-05-26 cs.CL cs.AI 版本更新

Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa

超越架构复杂性的微调:基于DeBERTa的PIIBench广泛覆盖PII检测

Pritesh Jha

AI总结 本研究通过微调DeBERTa模型,在涵盖82种实体类型的多源PIIBench数据集上实现广泛覆盖的PII检测,直接微调方法在F1分数上显著优于架构复杂的层次模型和课程扩展方法。

详情
AI中文摘要

个人身份信息(PII)检测系统通常在狭窄的源或领域边界内训练,当部署在异构文本上时覆盖范围有限。我们研究了在修正后的多源PIIBench准备数据上的模型微调,该数据跨越十个源数据集,涵盖82种保留实体类型。我们评估了三种基于DeBERTa的方法:直接令牌分类微调、源条件层次模型(SC+H)和三阶段课程扩展(SC+H+Curr)。在可重复的5,000条记录保留子集(test_5k)上,与八个已发表的比较系统相比,直接微调的DeBERTa达到F1 0.6476,而SC+H和课程变体分别达到0.5899和0.2772;最强的已发表比较系统仅达到0.1723。由于验证最初偏向SC+H,我们在完整的100,002条记录保留分割上进行了最终的流式评估。直接微调仍然优越,达到F1 0.6455,而SC+H为0.5894。实体级分析表明,直接微调在82个细粒度实体类型中的54个和所有十个粗粒度组中获胜(按支持加权实体F1),而SC+H在28个类型上保持局部优势。结果表明,多样化的任务特定训练数据和简单的加权交叉熵目标对广泛覆盖的PII检测的贡献大于所测试的架构和课程复杂性。

英文摘要

Personally identifiable information (PII) detection systems are frequently trained within narrow source or domain boundaries, limiting coverage when deployed on heterogeneous text. We study model fine-tuning on a corrected multi-source PIIBench preparation spanning 82 retained entity types across ten source datasets. We evaluate three DeBERTa-based approaches: direct token classification fine-tuning, a source-conditioned hierarchical model (SC+H), and a three-phase curriculum extension (SC+H+Curr). Against eight published comparator systems on a reproducible 5,000-record held-out subset (test_5k), direct fine-tuned DeBERTa achieves F1 0.6476, while SC+H and the curriculum variant achieve 0.5899 and 0.2772 respectively; the strongest published comparator reaches only 0.1723. Because validation initially favoured SC+H, we perform a final streamed evaluation on the complete 100,002-record held-out split. Direct fine-tuning remains superior, achieving F1 0.6455 versus 0.5894 for SC+H. Entity-level analysis shows that direct fine tuning wins 54 of 82 fine entity types and all ten coarse groups by support-weighted entity F1, while SC+H retains localised advantages on 28 types. The results indicate that diverse task-specific training data and a simple weighted cross-entropy objective contribute more to broad-coverage PII detection than the tested architectural and curriculum complexity.

2605.25814 2026-05-26 cs.CL cs.AI 版本更新

Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution

自适应图优化与基于大语言模型的标签传播用于经济高效实体解析

Hongtao Wang, Renchi Yang, Haoran Zheng, Xiangyu Ke

发表机构 * Hong Kong Baptist University(香港 Baptist 大学) Zhejiang University(浙江大学)

AI总结 提出Alper框架,通过迭代概率标签传播整合匹配与聚类,自适应融合图传播弱信号与LLM强查询,在预算约束下最大化边际增益,实现高效实体解析。

详情
AI中文摘要

脏实体解析(ER)从单个杂乱数据集中识别指向同一真实世界实体的记录,是数据管理和挖掘中的基本任务。然而,ER的主流阻塞-匹配-聚类范式存在严重缺陷。其级联、解耦的工作流本质上生成一个静态、稀疏的图,由于阻塞失败导致缺失边,由于匹配错误导致噪声链接,造成错误传播并产生次优聚类,特别是在聚类中施加严格传递性时。我们认为匹配和聚类本质上是协同的,两者都优化理想实体图的构建。基于这一见解,我们提出Alper,一个统一框架,将这些步骤整合为在全局、演化图上的迭代概率标签传播过程。与分离的阻塞不同,Alper通过自适应地整合来自图传播的“弱但廉价”信号与基于LLM的“强但昂贵”成对查询,动态优化图结构和标签。为了提高成本效益,我们将信号选择形式化为在查询预算下最大化累积边际增益的约束优化问题,通过我们的贪心算法求解,并具有可证明的理论保证。我们在八个基准数据集上的广泛实验表明,Alper始终优于最先进的级联流水线。

英文摘要

Dirty entity resolution (ER), which identifies records referring to the same real-world entity from a single, messy dataset, is a fundamental task in data management and mining. However, the dominant blocking-matching-clustering paradigm for ER suffers from critical flaws. Its cascaded, decoupled workflow essentially produces a static, sparse graph plagued by missing edges (due to blocking failures) and noisy links (due to matching errors), causing error propagation and yielding suboptimal clusters, particularly when rigid transitivity is imposed in the clustering. We contend that matching and clustering are fundamentally synergistic, both optimizing for the construction of an ideal entity graph. Building upon this insight, we propose Alper, a unified framework that integrates these steps into an iterative probabilistic label propagation process over a global, evolving graph. Unlike disjoint blocking, Alper refines the graph structure and labels dynamically by adaptively integrating "weak but cheap" signals from graph propagation with "strong but expensive" LLM-based pairwise queries. For higher cost-effectiveness, we formulate the signal selection as a constrained optimization problem maximizing cumulative marginal gain under a query budget, solved via our greedy algorithm with provable theoretical guarantees. Our extensive experiments over eight benchmark datasets demonstrate that Alper is consistently superior to state-of-the-art cascaded pipelines.

2605.25781 2026-05-26 cs.CL 版本更新

Double Triangle Annotation: A Scalable Human-in-the-Loop Framework for High-Precision Historical Document Annotation

双三角形标注:一种可扩展的人机协同高精度历史文档标注框架

Yi Ren

发表机构 * École Polytechnique Fédérale de Lausanne(洛桑联邦理工学院)

AI总结 提出双三角形标注框架,通过两层人机协同和跨模型共识自动完成大部分标注工作,实现高精度历史文档结构化信息提取。

Comments 12 pages, 4 figures. ACL ARR 2026 March submission

详情
AI中文摘要

大规模评估历史文档的结构化信息提取需要高精度的真实标注,但传统人工标注成本高昂,而基于大语言模型的完全自动化流水线容易产生幻觉。我们提出双三角形标注,一种双层人机协同框架,利用跨模型共识自动完成大部分标注工作,同时确保高精度输出。第一层中,两个架构独立的多模态大语言模型并行标注每个文档;当它们一致时,标签自动接受,不一致则提交给人工评审。第二层将两个这样的系统相互交叉检查,将剩余冲突升级给领域专家。该框架基于一个假设——模型之间的错误独立性——不需要分布先验或任务特定校准,并且随着模型能力的提升而变得更加自主。在Guides Rosenwald(一个涵盖1887-1906年的法国医疗目录语料库)上,该框架实现了0.003的最终词错误率。大规模应用时,模型共识自动接受了13,595个字段中的85%以上。我们发布了由此产生的基准——Rosenwald指南的第一个结构化提取真实标注——以支持未来历史文档处理工作。

英文摘要

Evaluating structured-information extraction from historical documents at scale requires high-precision ground-truth annotations, yet traditional manual labeling is expensive and fully automated pipelines built on large language models are prone to hallucination. We propose Double Triangle Annotation, a two-layer human-in-the-loop framework that leverages cross-model consensus to automate the majority of annotation work while ensuring high-precision outputs. In the first layer, two architecturally independent Multimodal Large Language Models annotate each document in parallel; when they agree, the label is auto-accepted, and disagreements are routed to a human jury. A second layer cross-checks two such systems against each other, escalating residual conflicts to a domain expert. The framework rests on a single assumption -- error independence between models -- requires no distributional priors or task-specific calibration, and becomes more autonomous as model capability improves. On the Guides Rosenwald, a corpus of French medical directories spanning 1887-1906, the framework achieves a final Word Error Rate of 0.003. Applied at scale, model consensus auto-accepts over 85% of 13,595 fields. We release the resulting benchmark -- the first structured-extraction ground truth for the Rosenwald Guides -- to support future work on historical document processing.

2605.25745 2026-05-26 cs.CL 版本更新

Selective Latent Thinking: Adaptive Compression of LLM Reasoning Chains

选择性潜在思考:LLM推理链的自适应压缩

Hui Xie, Jie Liu, Ziyue Qiao, Joaquin Vanschore

发表机构 * Eindhoven University of Technology(埃因霍温理工大学) School of Computing and Information Technology(计算与信息科技学院) Great Bay University(大湾大学)

AI总结 提出选择性潜在思考(SLT)框架,通过置信度门控将冗余推理步骤压缩为潜在表示,关键步骤保留显式思维链,在压缩比相当的情况下准确率比潜在推理基线高22.7%,推理链长度减少58.4%且准确率仅下降2.8%。

详情
AI中文摘要

显式思维链(CoT)推理显著提升了大语言模型(LLMs)的推理能力,但由于冗长的自回归痕迹导致高推理成本。现有的潜在推理方法提供了一种有前景的替代方案,但它们通常将推理视为均匀可压缩的,导致精度关键的中间步骤被过度压缩,从而降低推理准确性。在这项工作中,我们提出了选择性潜在思考(SLT),一个框架,它选择性地将冗余推理跨度压缩为潜在表示,同时在同一推理轨迹中将精度关键的跨度保留为显式CoT。具体来说,SLT首先使用轻量级解码器预测即将到来的短推理跨度,然后应用基于置信度的门控来确定可可靠压缩的最长跨度。被接受的跨度被编码为紧凑的潜在表示以提高推理效率,而不确定或精度关键的推理则保留为显式CoT形式以保持准确性。为了学习这种选择性压缩策略,SLT采用三阶段训练策略,结合跨度级潜在压缩、可靠性感知的未来推理预测和轨迹级强化学习,以优化答案正确性与推理成本之间的权衡。在四个数学推理基准上的大量实验表明,SLT在压缩比相当的情况下,准确率比潜在推理基线高22.7%,同时与显式CoT相比,推理链长度减少58.4%,准确率仅下降2.8%。我们的代码可在https://github.com/hunshi34/SLT找到。

英文摘要

Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a promising alternative, yet they often treat reasoning as uniformly compressible, causing precision-critical intermediate steps to be overly compressed and thereby degrading reasoning accuracy. In this work, we propose Selective Latent Thinking (SLT), a framework that selectively compresses redundant reasoning spans into latent representations while preserving precision-critical spans as explicit CoT within the same reasoning trajectory. Specifically, SLT first uses a lightweight decoder to anticipate a short upcoming reasoning span, and then applies confidence-based gating to determine the longest span that can be reliably compressed. The accepted span is encoded into a compact latent representation to improve reasoning efficiency, while uncertain or precision-critical reasoning remains in explicit CoT form to preserve accuracy. To learn this selective compression policy, SLT adopts a three-stage training strategy that combines span-level latent compression, reliability-aware future reasoning prediction, and trajectory-level reinforcement learning to optimize the trade-off between answer correctness and reasoning cost. Extensive experiments across four mathematical reasoning benchmarks demonstrate that SLT achieves 22.7\% higher accuracy than latent reasoning baselines at comparable compression ratios, while reducing reasoning chain length by 58.4\% with only 2.8\% accuracy degradation compared to explicit CoT,Our code can be found in https://github.com/hunshi34/SLT.

2605.25708 2026-05-26 cs.CV cs.CL cs.ET 版本更新

CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning

CMAP: 面向多域任务增量学习的跨模态自适应提示

Sriram Mandalika

发表机构 * Hasso Plattner Institute(霍普斯·普拉特纳研究所)

AI总结 针对多域任务增量学习,提出跨模态自适应提示方法,利用CLIP文本嵌入空间进行任务路由、置信度估计和编码器适应,在MTIL基准上超越现有技术。

详情
AI中文摘要

多域任务增量学习要求模型在视觉多样的域中顺序获取知识,同时不遗忘先前任务,且在推理时无法访问任务身份。基于冻结视觉-语言模型的参数高效方法已取得显著进展,但现有方法完全依赖视觉特征进行任务路由、置信度估计和编码器适应,未利用CLIP的跨模态文本嵌入空间。我们通过三个贡献填补这一空白。文本空间任务路由将视觉高斯匹配替换为与冻结CLIP文本原型的余弦相似度,实现与顺序无关的路由,在零参数成本下对数据稀缺具有鲁棒性。多原型视觉-文本置信度将单高斯类建模替换为K均值视觉原型和任务校准阈值下的跨模态对齐分数。对称跨模态门控将每层Gumbel门扩展到文本编码器,以批量图像特征为条件,在分布外输入上保持跨模态对齐。在涵盖11个数据集和1201个类的MTIL基准上,我们的方法在Order-I下达到74.2%的迁移率、80.5%的平均准确率和88.7%的最终准确率,仅用2.5M可训练参数且无外部数据,分别超越先前最优方法5.0、3.7和3.0个百分点。

英文摘要

Multi-domain task-incremental learning requires a model to sequentially acquire knowledge across visually diverse domains without forgetting prior tasks, and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, yet all existing approaches rely exclusively on visual features for task routing, confidence estimation, and encoder adaptation, leaving CLIP's cross-modal text embedding space entirely unexploited. We address this gap through three contributions. Text-space task routing replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing robust to data scarcity at zero parameter cost. Multi-prototype visual-textual confidence replaces single-Gaussian class modeling with K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds. Symmetric cross-modal gating extends per-layer Gumbel gates to the text encoder conditioned on batch image features, preserving cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark spanning 11 datasets and 1201 classes, our method achieves 74.2% Transfer, 80.5% Average, and 88.7% Last under Order-I, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points with only 2.5M trainable parameters and no external data.

2605.25704 2026-05-26 cs.CL cs.LG 版本更新

PowLU: An Activation Function for Stable Pre-Training of LLMs

PowLU: 一种用于LLM稳定预训练的激活函数

Peijie Jiang, Yuqi Feng, Cunyin Peng, Qian Zhao, Jia Liu, KunLong Chen, Zhiqiang Zhang, Jun Zhou

发表机构 * Ant Group(蚂蚁集团)

AI总结 提出PowLU激活函数,通过有理幂函数实现自适应非线性,解决SwiGLU在低精度LLM训练中的数值不稳定问题,在大规模训练中取得与SwiGLU和SwiGLU-Clip相当的性能并提升可扩展性。

Comments 17 pages, 7 figures, techreport

详情
AI中文摘要

在当代大型语言模型(LLM)中,swish门控线性单元(SwiGLU)激活函数被广泛采用以调节信息流并引入非线性。对于大的正输入,SwiGLU近似于二次函数$x^2$,提供强非线性和表达能力。然而,这一特性也导致随着输入或模型规模增大时的数值不稳定性,特别是在低精度LLM训练中。主要原因是其近似二次放大,扩大了输出范围并加剧了异常值。为了解决这个问题,我们提出了一种稳定的激活函数——幂线性单元(PowLU),用于大规模LLM预训练。具体来说,PowLU采用有理幂函数实现自适应非线性,从而改善表示能力并在尖峰区域实现稳定训练。此外,我们为PowLU的几个关键性质提供了理论证明。缩放定律实验确认了性能在不同模型规模下的一致性,进一步使用Ling架构(总参数7.9B和124B)的实验结果表明,PowLU在大规模LLM训练中取得了与SwiGLU和SwiGLU-Clip相当的结果。此外,实验结果还表明PowLU有效提升了LLM大规模训练的可扩展性。

英文摘要

In contemporary large language models (LLMs), the swish-gated linear unit (SwiGLU) activation function is widely adopted to regulate the information flow and introduce non-linearity. For large positive inputs, SwiGLU approximates the quadratic function $x^2$, providing strong nonlinearity and expressive capacity. However, this property also causes numerical instability as the input or model scale increases, particularly in low-precision LLM training. The main reason is its approximate quadratic amplification, which enlarges the output range and exacerbates outliers. To address this issue, we propose a stable activation function, Power Linear Unit (PowLU), for large-scale LLM pre-training. Specifically, PowLU employs a rational power function to achieve adaptive nonlinearity, thereby improving representation ability and enabling stable training in spike regions. Moreover, we provide theoretical justification for several key properties of PowLU. Scaling law experiments confirm that the performance is consistent across model sizes, and further experimental results with the Ling architecture (7.9B and 124B total parameters) demonstrate that PowLU achieves competitive results against SwiGLU and SwiGLU-Clip in large-scale training of LLMs. In addition, the experimental results also show that PowLU effectively improves the scalability of the large-scale training of LLMs.

2605.25701 2026-05-26 cs.DC cs.CL cs.IR cs.NI 版本更新

Neural Router: Semantic Content Matching for Agentic AI

神经路由器:面向智能体AI的语义内容匹配

Lauri Lovén, Abhishek Kumar, Alexander Engelhardt, Alaa Saleh, Roberto Morabito, Xiaoli Liu, Naser Hossein Motlagh, Sasu Tarkoma

发表机构 * Future Computing Group, University of Oulu(奥卢大学未来计算组) Department of Computer Science, University of Helsinki(赫尔辛基大学计算机科学系) Department of Communication Systems, EURECOM(EURECOM通信系统部门)

AI总结 本文提出将大语言模型作为内容发布/订阅代理的语义匹配引擎,通过分析上下文窗口交叉点和判别能力交叉点,实现成本-准确性权衡,并给出三个可组合算法和自主LLM层级选择框架。

Comments 35 pages, 12 figures. Combined main paper and electronic supplement, folded into one document for arXiv

详情
AI中文摘要

大语言模型(LLM)可以作为边缘-云计算连续体中基于内容的发布/订阅代理的语义匹配引擎,用于智能体AI,弥合关键字和嵌入过滤器无法克服的词汇和模态差距。作为跨社交媒体、法律和智能家居传感器领域三个公共数据集(六个LLM、七个基线)的离线多标签检索,我们的核心贡献是一个双交叉点成本-准确性特征描述:一个分析性上下文窗口交叉点,低于该点时,CoverAndMerge压缩流水线减少LLM调用;以及一个经验性判别能力交叉点,高于该点时,匹配准确性独立于上下文预算而崩溃,取决于参数数量和训练代次的模型相关因素。两个发现具有实际意义:在判别交叉点之上,压缩无法恢复准确性,只有前沿规模的模型才能清除大型订阅集;并且后端选择主导配置选择,因此模型选择(而非流水线调优)是主要操作杠杆。我们为此提供了三个可组合算法和一个用于自主LLM层级选择的每集群体验质量框架。

英文摘要

Large language models (LLMs) can serve as the semantic-matching engine of a content-based publish/subscribe broker for agentic AI across the edge-cloud computing continuum, bridging the vocabulary and modality gaps that defeat keyword and embedding filters. Framed as offline multi-label retrieval over three public datasets spanning social-media, legal, and smart-home sensor domains (six LLMs, seven baselines), our central contribution is a two-crossover cost-accuracy characterisation: an analytical context-window crossover below which a CoverAndMerge compression pipeline reduces LLM invocations, and an empirical discrimination-capacity crossover above which matching accuracy collapses independently of context budget, by a model-dependent factor of parameter count and training generation. Two findings carry practical weight: above the discrimination crossover, compression cannot recover accuracy and only frontier-scale models clear large subscription sets; and there backend choice dominates configuration choice, so model selection, not pipeline tuning, is the primary operator lever. We accompany this with three composable algorithms and a per-cluster Quality-of-Experience framework for autonomic LLM-tier selection.

2605.25693 2026-05-26 cs.CL cs.DB cs.MA 版本更新

From Facts to Insights: A Persona-Driven Dual Memory Framework and Dataset for Role-Playing Agents

从事实到洞察:面向角色扮演智能体的角色驱动双记忆框架与数据集

Rongsheng Zhang, Ruofan Hu, Weijie Chen, Jiji Tang, Junnan Ren, Wanying Wu, Xunuoyan Chen, Tangjie Lv, Tao Jin, Zhou Zhao

发表机构 * Zhejiang University(浙江大学) Fuxi AI Lab, Netease Inc.(复活AI实验室,网易公司)

AI总结 针对长期对话中角色扮演智能体因上下文窗口限制而丧失角色一致性的问题,提出角色记忆数据集RoleMemo和双记忆框架DualMem,通过将记忆解耦为事实认知和角色条件洞察,结合监督微调与强化学习,在4B参数模型上超越基于DeepSeek-V3.2的零样本角色无关框架。

Comments Preprint

详情
AI中文摘要

尽管角色扮演智能体在短期交互中表现出色,但长期对话会压垮上下文窗口,从而促使外部记忆框架的发展。当前系统通常依赖角色无关的摘要,记录事实而不进行角色特定的解释,导致生成通用回复,损害角色保真度。为弥补这一差距,我们引入了RoleMemo数据集,其中包含四个推理任务,这些任务要求通过角色解释事实片段以得出正确答案。在RoleMemo上的评估揭示了角色无关框架的关键局限性。因此,我们提出了DualMem,它将记忆解耦为两个流:事实认知和角色条件洞察。通过监督微调(SFT)和强化学习(RL)训练,我们的框架使用4B参数模型在持续角色保真度上优于由DeepSeek-V3.2驱动的零样本角色无关框架。我们的资源可在https://github.com/role2026/rolememo获取。

英文摘要

While role-playing agents excel in short-term interactions, long-term conversations overwhelm context windows, motivating external memory frameworks. Current systems typically rely on persona-agnostic summarization, which records facts without persona-specific interpretation, yielding generic responses that compromise persona fidelity. To bridge this gap, we introduce RoleMemo, a dataset featuring four reasoning tasks where the factual fragments must be interpreted through the persona to reach the correct answer. Evaluation on RoleMemo exposes critical limitations of persona-agnostic frameworks. We thus propose DualMem, which decouples memory into two streams: factual cognition and persona-conditioned insight. Trained through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), our framework with a 4B-parameter model outperforms zero-shot persona-agnostic frameworks powered by DeepSeek-V3.2 for sustained persona fidelity. Our resources are available at https://github.com/role2026/rolememo.

2605.25686 2026-05-26 cs.CL 版本更新

Testing the Deliteralization Hypothesis in Human and Machine Translation

测试人类与机器翻译中的去字面化假设

Malik Marmonier, Rachel Bawden, Benoît Sagot

发表机构 * Inria(法国国家信息与自动化研究所)

AI总结 通过比较人类翻译、NMT系统与LLM在54个语言对上的字面化程度,验证去字面化假设是否适用于LLM生成与修订过程。

详情
AI中文摘要

从专用NMT系统向通用LLM的近期转变重塑了机器翻译,据报道LLM比其前身产生更流畅、更少字面化的输出。我们测试这种转变是否延伸到去字面化假设,即翻译研究中长期存在的说法:翻译在起草和修订过程中逐渐变得不那么字面化。使用WMT24++数据集,我们比较了人类翻译和后编辑与两个NMT系统和六个LLM在54个语言对和三个任务上的字面化程度:直接翻译、迭代自我修订和人类草稿的后编辑。字面化程度通过基于六个启发式方法构建的经过验证的合成字面化指数来衡量。我们发现:(i) 人类翻译仍然明显比所有测试的MT系统更少字面化,尽管最近的LLM缩小了差距;(ii) 当提示迭代修订自己的输出时,LLM单调地去字面化,首次提供了该假设原生适用于LLM生成的证据;(iii) 作为后编辑者,LLM反转了人类后编辑者的修订触发因素,容忍字面化草稿并针对惯用的人类表述进行修订。

英文摘要

The recent shift from dedicated NMT systems to general-purpose LLMs has reshaped machine translation, with LLMs reported to produce more fluent, less literal output than their predecessors. We test whether this shift extends to the deliteralization hypothesis, the long-standing claim from translation studies that translations become progressively less literal as they are drafted and revised. Using the WMT24++ dataset, we compare the literality of human translations and post-editions to that of two NMT systems and six LLMs across 54 language pairs and three tasks: direct translation, iterative self-revision, and post-editing of human drafts. Literality is measured via a validated Synthetic Literality Index built from six heuristics. We find that (i) human translations remain significantly less literal than those of all tested MT systems, though recent LLMs narrow the gap; (ii) when prompted to iteratively revise their own output, LLMs deliteralize monotonically, providing the first evidence that the hypothesis applies natively to LLM generation; and (iii) as post-editors, LLMs invert the revision triggers of human post-editors, tolerating literal drafts and targeting idiomatic human formulations for revision.

2605.25680 2026-05-26 cs.CL cs.AI 版本更新

Simulating Human Memory with Language Models

用语言模型模拟人类记忆

Qihan Wang, Nicholas Tomlin, Michael Hu, Brian Dillon, Tal Linzen

发表机构 * NYU(纽约大学) UMass Amherst(马萨诸塞大学阿姆赫斯特分校)

AI总结 本研究通过心理学经典记忆实验对比语言模型与人类记忆,发现未经调优的模型记忆优于人类,但通过提示策略和压缩器可使模型遗忘方式更接近人类,从而在下游教育任务中成为更有效的用户模拟器。

详情
AI中文摘要

语言模型越来越多地被部署为用户模拟器,但它们的记忆远比真实用户可靠。为了衡量这一差距,我们在人类和语言模型上进行了一系列来自心理学的经典记忆实验。跨任务我们发现,未经调优的语言模型表现出比人类更好的记忆,即使在被提示模仿人类行为时也是如此。然后我们表明,更好的提示策略和使用压缩器可以使语言模型以更类似人类的方式遗忘内容。使用这些方法,我们初步证明,具有人类类似记忆约束的语言模型可以在下游教育任务中作为更有效的用户模拟器。最后,我们发布人类参考数据和基准,以支持未来关于用语言模型模拟人类记忆的工作。

英文摘要

Language models are increasingly being deployed as user simulators, but their memory is far more reliable than that of real users. To measure this gap, we run a series of classic memory experiments from psychology on both humans and language models. Across tasks, we find that out-of-the-box language models exhibit better memory than humans, even when prompted to imitate human behavior. We then show that better prompting strategies and the use of a compactor can cause language models to forget content in a more human-like way. Using these methods, we show preliminary evidence that language models with human-like memory constraints can function as more effective user simulators in a downstream education task. Finally, we release human reference data and benchmarks to support future work on simulating human memory with language models.

2605.25676 2026-05-26 cs.CL 版本更新

Llamion Technical Report

Llamion 技术报告

Kisu Yang, Yoonna Jang, Hyeonseok Moon, Hwanseok Jang, Taewoo Lee, Hyungjin Lee, Jeseung Lee, Juhyoung Park, Heuiseok Lim

发表机构 * VAIV Company(VAIV公司) Korea University(韩国大学) University of Copenhagen(哥本哈根大学) Samsung Electronics(三星电子)

AI总结 提出 KEPT 方法将 Orion-14B 转换为 Llama 架构的 Llamion 模型,通过参数映射和知识蒸馏在少量数据上恢复性能,并在 KoMMLU 上达到领先水平。

Comments Research conducted in 2024

详情
AI中文摘要

我们发布了 Llamion,一个 14B 参数的开源语言模型系列,通过将 Orion-14B 转换为标准化的 Llama 家族架构得到。该转换通过高效知识保留转换(KEPT)方法完成,该方法结合了 (i) 用于未改变模块的正常参数映射(NPM),(ii) 优化参数映射(OPM),一种无需训练的 LayerNorm 到 RMSNorm 初始化,我们证明在权重衰减引起的近零均值激活机制下该初始化是最优的,以及 (iii) 跨架构知识蒸馏(XKD),一种等大小的冻结教师蒸馏,将转换后模型的输出与源模型在任何合理输入分布上的输出对齐。Llamion 在单个 A100 上仅用约 1.23 亿 token 和四天时间,在 H6、MT-Bench 和 KoMMLU 上恢复了 Orion 的行为;Llamion-Base 在 KoMMLU 上达到 66.87%,在提交时比 Open Ko LLM Leaderboard 的次优条目高出超过 7.0 个绝对百分点。转移语料库中完全缺失的能力(Python 编程和 20 万 token 上下文处理)在架构转换后完整保留。我们发布了三个检查点(Base、Chat、LongChat),可在 Hugging Face Transformers 库中以 trust_remote_code=False 加载。

英文摘要

We release Llamion, a family of 14B-parameter open-weight language models obtained by transforming Orion-14B into the standardized Llama-family architecture. The transformation is performed by Efficient Knowledge Preservation for Transformation (KEPT), a recipe that combines (i) Normal Parameter Mapping (NPM) for unchanged modules, (ii) Optimized Parameter Mapping (OPM), a training-free LayerNorm-to-RMSNorm initialization we prove optimal under the near-zero-mean activation regime induced by weight decay, and (iii) Cross-architecture Knowledge Distillation (XKD), an equal-size frozen-teacher distillation that aligns the converted model's outputs with the source model's on any reasonable input distribution. Llamion recovers Orion's behaviour on H6, MT-Bench, and KoMMLU with only ~123M tokens on a single A100 in four days; Llamion-Base reaches 66.87% on KoMMLU, exceeding the next-best entry of the Open Ko LLM Leaderboard by >7.0 absolute points at submission time. Capabilities entirely absent from the transfer corpus (Python programming and 200K-token context handling) survive the architectural transition intact. We release three checkpoints (Base, Chat, LongChat) that load with trust_remote_code=False in the Hugging Face Transformers library.

2605.25658 2026-05-26 cs.CL cs.AI 版本更新

AutoSG: LLM-Driven Solver Generation Solely from Task Prompts for Expensive Optimization

AutoSG: 仅从任务提示出发的LLM驱动的昂贵优化求解器生成

Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang

发表机构 * Xidian University(西安电子科技大学) Victoria University of Wellington(威灵顿维多利亚大学)

AI总结 提出AutoSG框架,通过检索增强生成、单步自优化和无实例评估机制,从自然语言提示直接生成可执行定制求解器,解决昂贵优化中的幻觉、结构破坏和评估成本问题。

详情
AI中文摘要

昂贵优化任务在现实应用中普遍存在,需要高度专业化的求解器。虽然LLM驱动的自动求解器生成显示出前景,但当前范式在处理昂贵优化时面临三个关键问题:由于领域知识不足导致的事实幻觉、在细化过程中频繁破坏先前建立的局部最优结构,以及在训练实例上执行带来的高昂评估成本和受限的泛化能力。为了解决这些问题,我们引入了AutoSG,一个完全自动化的流程,直接将自然语言提示转换为可执行的定制求解器。AutoSG具有三个核心创新:一个检索增强的求解器生成模块,严格将代码基于经过验证的文献;一个单步自优化算子,在保留关键结构组件的同时引入特定任务的改进;以及一个基于Elo的无实例LLM-as-a-Judge评估机制,快速建立全局排名。在多种昂贵优化任务上的广泛评估证实,AutoSG显著优于人工设计的最先进框架和现有的LLM生成的求解器。

英文摘要

Expensive optimization tasks are ubiquitous in real-world applications, demanding highly specialized solvers. While LLM-driven automated solver generation shows promise, current paradigms face three critical issues when tackling expensive optimization: factual hallucinations due to deficient domain knowledge, the frequent dismantling of previously established locally optimal structures during refinement, and the prohibitive evaluation costs alongside restricted generalization caused by executing on training instances. To address these issues, we introduce AutoSG, a fully automated workflow directly translating natural language prompts into executable customized solvers. AutoSG features three core innovations: a retrieval-augmented solver generation module strictly grounding code in verified literature; a one-step self-refinement operator introducing task-specific improvements while preserving critical structural components; and an instance-free Elo-based LLM-as-a-Judge evaluation mechanism rapidly establishing global rankings. Extensive evaluations across diverse expensive optimization tasks confirm AutoSG significantly outperforms human-designed state-of-the-art frameworks and existing LLM-generated solvers.

2605.25641 2026-05-26 cs.CL 版本更新

Iterate Until Retrieved: Factual Nugget Optimization for Discoverable Continual Corrections in Agentic RAG

迭代直到检索到:面向可发现持续修正的事实性片段优化在智能体RAG中的应用

Moshe Hazoom, Gal Patel, Alon Talmor, Tom Hope

发表机构 * Mosaic AI The Hebrew University of Jerusalem(希伯来大学杰里科分校)

AI总结 提出迭代片段优化(INO)方法,通过将反馈转化为事实性片段并利用生产环境智能体RAG系统迭代优化,提升事实性修正的可发现性和使用率。

详情
AI中文摘要

在复杂的B2B(企业对企业)环境中,智能体检索增强生成(RAG)系统经常接收自由形式的反馈。我们关注可操作的事实性修正,而非风格、偏好或整体响应质量等通用反馈信号。我们识别这些实例并将其转化为紧凑的知识库条目,称为事实性片段。我们引入迭代片段优化(INO),一种索引时优化方法,将生产环境中的智能体RAG作为测试平台:它创建初始片段,使用触发查询及其释义进行探测,反思失败的检索和回答轨迹,并修订片段直到其可被发现。我们使用两个生产B2B知识辅助代理(一个回答公司特定知识库问题的产品支持代理,以及一个协助支持工程师的支持工单代理)在多家使用我们系统的公司中评估INO。在自动化和人工评估中,INO在事实性修正的可发现性和使用率方面持续优于基线。

英文摘要

Agentic retrieval-augmented generation (RAG) systems in complex B2B (business-to-business) settings may often receive free-form response feedback. Rather than generic feedback signals such as style, preference, or overall response quality, we focus on actionable factual corrections. We identify these instances and convert them into compact knowledge-base entries, which we call factual nuggets. We introduce Iterative Nugget Optimization (INO), an index-time optimization method that uses the production agentic RAG as a test harness: it creates an initial nugget, probes it with the triggering query and paraphrases, reflects over failed retrieval and answer traces, and revises the nugget until it is discoverable. We evaluate INO with two production B2B knowledge-assistance agents across multiple companies that use our system: a product support agent that answers questions over company-specific knowledge bases, and a support ticket agent that assists support engineers. INO consistently improves results over baselines in terms of discoverability and usage of factual corrections, in automated and human evaluations.

2605.25626 2026-05-26 cs.CL 版本更新

Beyond Literal Translation: Evaluating Cultural Effectiveness in Social Media UGC

超越字面翻译:评估社交媒体用户生成内容中的文化有效性

Linjuan Wu, Ruiqi Zhang, Xinze Lyu, Ye Guo, Daoxin Zhang, Zhe Xu, Yao Hu, Yixin Cao, Yongliang Shen, Weiming Lu

发表机构 * Zhejiang University(浙江大学) Fudan University(复旦大学) Xiaohongshu Inc.(小红书公司)

AI总结 针对社交媒体用户生成内容翻译中文化传递与情感共鸣不足的问题,提出CULTURE-MT基准,通过构建涵盖14个领域、4种文化负载类型的1002条UGC笔记,并引入文化有效性评估标准,实验表明传统指标无法捕捉文化有效性,且基础LLM的文化有效性与模型规模相关。

Comments Accepted by ICML2026

详情
AI中文摘要

社交媒体平台实现了大规模跨语言交流,但由于用户生成内容(UGC)的非正式风格、文化引用和基于互动的表达方式,其翻译仍然具有挑战性。尽管近期的大语言模型(LLM)提高了翻译质量,但现有基准和指标往往未能捕捉翻译是否在真实场景中传达了预期含义和文化共鸣。在这项工作中,我们引入了CULTURE-MT,一个专注于文化传递和UGC特定情感共鸣的社交媒体翻译基准。CULTURE-MT包含跨14个领域的1,002条UGC笔记,根据文化负载符号和语言风格特征分为四类。我们还构建了面向UGC的训练数据,以微调Qwen3-8B和Qwen3-32B作为基线。我们提出文化有效性作为新的评估标准,侧重于表达准确性和文化适应性。测试包括基线在内的15个模型,我们发现传统指标无法捕捉文化有效性。我们还观察到,基础LLM上的文化有效性与模型规模相关。我们的工作为UGC翻译模型提供了全面的评估系统,并将提供一个开放的评估平台以推动该领域的研究。我们发布了CULTURE-MT基准,并提供了一个在线排行榜,提交的翻译结果可由我们训练的JUDGER进行评估。

英文摘要

Social media platforms enable large-scale cross-lingual communication, but translating user-generated content (UGC) remains challenging due to its informal style, cultural references, and interaction-based expressions. While recent LLMs have improved translation quality, existing benchmarks and metrics often fail to capture whether translations convey intended meaning and cultural resonance in real-world settings. In this work, we introduce CULTURE-MT, a benchmark for social media translation that focuses on both CULtural Transmission and UGC-specific emotion REsonance. CULTURE-MT consists of 1,002 UGC notes across 14 domains, categorized into four types based on culture-loaded symbols and linguistic style features. We also construct UGC-oriented training data to fine-tune Qwen3-8B and Qwen3-32B as baselines. We propose cultural effectiveness as a new evaluation criterion, focusing on expression accuracy and cultural adaptability. Testing 15 models, including the baselines, we find that traditional metrics fail to capture cultural effectiveness. We also observe that cultural effectiveness on base LLMs correlates with model size. Our work provides a comprehensive evaluation system for UGC translation models and will offer an open evaluation platform to advance research in this area. We release the CULTURE-MT benchmark and provide an online leaderboard where submitted translation results can be evaluated by our trained JUDGER.

2605.25604 2026-05-26 cs.CL cs.LG 版本更新

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

DVAO: 面向多奖励强化学习的动态方差自适应优势优化

Guochao Jiang, Jingyi Song, Guofeng Quan, Chuzhan Hao, Guohua Liu, Yuewei Zhang

发表机构 * Alibaba Cloud Computing(阿里巴巴云 computing)

AI总结 针对多奖励强化学习中奖励组合导致训练不稳定、优势组合依赖静态超参数的问题,提出动态方差自适应优势优化方法,通过基于经验奖励方差动态调整组合权重,实现稳定训练与多目标帕累托前沿优化。

详情
AI中文摘要

强化学习已成为将大型语言模型与人类意图和任务要求对齐的标准范式。尽管组相对策略优化为近端策略优化提供了一种高效、无价值模型的替代方案,但将其适应于现实世界的多奖励设置仍然具有挑战性。标准的标量化实践,如奖励组合和优势组合,存在显著缺陷:奖励组合经常产生平方幅度过大的优势,导致训练不稳定;而优势组合依赖静态超参数,忽略了跨目标相关性。为了解决这些限制,我们提出了动态方差自适应优势优化(DVAO),它根据 rollout 组内每个目标的经验奖励方差动态调整组合权重,有效提高具有更强学习信号的目标的权重,同时抑制噪声目标。我们从数学上证明 DVAO 保持有界的优势幅度以实现稳定训练,并引入了一种自适应的跨目标正则化机制。使用 Qwen3 和 Qwen2.5 模型在数学推理和工具使用基准上的大量实验表明,DVAO 显著优于基线方法,实现了卓越的多目标帕累托前沿和稳健的训练稳定性。

英文摘要

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.

2605.25601 2026-05-26 cs.CL cs.AI 版本更新

Toward a Benchmark for Controllable Simulation of Imperfect Students with Large Language Models

面向大语言模型可控模拟不完美学生的基准

Alexander Apartsin, Omri Sason, Yehudit Aperstein

发表机构 * Holon Institute of Technology(霍隆理工学院) Afeka Tel Aviv Academic College of Engineering(阿法卡特拉维夫工程学院)

AI总结 本研究提出一个基准框架,通过提示控制语言模型模拟具有指定技能轮廓的学生,并评估其可控性,为教师教育中的刻意练习提供支持。

Comments 22 pages, 7 figures

详情
AI中文摘要

教师教育需要与表现出可识别优势、弱点和部分掌握的学习者进行刻意练习。大型语言模型可以通过模拟具有已知技能组成部分的学生来支持这种练习,使教师能够演练解释、诊断和教学回应。然而,为此目的,核心要求既不是最大化基准准确率,也不是抑制孤立的事实,而是控制模型行为,使其反映指定的技能轮廓。本文研究了是否可以通过提示引导语言模型保留某些技能同时抑制其他技能。我们引入了一个面向基准的框架,其中显式技能向量表示模拟学生,基于提示的控制指定保留和缺失的能力,并使用轮廓对齐指标、保留与遗忘比较以及跨技能校准分析来评估行为。结果表明,在结构化数学环境中可以诱导和测量选择性的部分掌握,尽管可控程度仍依赖于模型。这些发现将可控学习者模拟定位为教师教育、教育模拟和语言模型控制交叉领域的一个独特研究问题。

英文摘要

Teacher education requires deliberate practice with learners who exhibit identifiable strengths, weaknesses, and partial mastery. Large language models could support such practice by simulating students with known skill components, enabling teachers to rehearse explanations, diagnoses, and instructional responses. For this purpose, however, the central requirement is neither to maximize benchmark accuracy nor to suppress isolated facts, but to control model behavior so that it reflects a specified skill profile. This paper investigates whether prompted language models can be steered to retain some skills while suppressing others. We introduce a benchmark-oriented framework in which an explicit skill vector represents a simulated student, prompt-based control specifies retained and missing competencies, and behavior is evaluated using profile-alignment metrics, retained-versus-forgotten comparisons, and cross-skill calibration analyses. The results show that selective partial mastery can be induced and measured in a structured mathematics setting, although the degree of controllability remains model-dependent. These findings position controllable learner simulation as a distinct research problem at the intersection of teacher education, educational simulation, and language-model control.

2605.25596 2026-05-26 cs.CL 版本更新

Multilingual Phonological Feature Recognition with Self-Supervised Speech Models

基于自监督语音模型的多语言音韵特征识别

Abner Hernandez, Tomás Arias-Vergara, Daiqi Liu, Andreas Maier, Paula Andrea Pérez-Toro

发表机构 * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany(埃森哲大学模式识别实验室,德国) GITA Lab. Facultad de Ingeniería. Universidad de Antioquia UdeA, Medellín, Colombia(安蒂亚大学工程学院GITA实验室,哥伦比亚)

AI总结 提出PhonoQ-2.0,一种基于自监督语音模型的多语言帧级音韵特征识别器,通过方式条件门控机制直接预测结构化特征向量,在域内和域外均优于CTC基线。

Comments Submitted to Interspeech 2026

详情
AI中文摘要

音韵特征提供了语言通用且基于语言学的语音表示。我们提出PhonoQ-2.0,一种基于自监督语音模型构建的多语言帧级音韵特征识别器。该系统直接预测每帧的结构化22维特征向量,编码方式、元音质量、发音部位和清浊,而不是从音素输出中推导特征。为确保音韵上一致的预测,我们引入了一种方式条件门控机制,激活有效的特征组。在多种语言和语料库上评估,PhonoQ-2.0在域内平均宏F1为91.3%,域外为88.9%。与强CTC音素基线相比,它在域内平均获得+8.8 F1的持续提升,域外平均+8.6。在未见语言评估中,PhonoQ-2.0将宏F1从66.9%提高到73.6%(平均+6.7),最高提升达+10.8个百分点。

英文摘要

Phonological features provide a language-general and linguistically grounded representation of speech. We present PhonoQ-2.0, a multilingual frame-level phonological feature recognizer built on self-supervised speech models. The system directly predicts a structured 22-dimensional feature vector per frame encoding manner, vowel quality, place, and voicing, instead of deriving features from phoneme outputs. To ensure phonologically coherent predictions, we introduce a manner-conditioned gating mechanism that activates valid feature groups. Evaluated across multiple languages and corpora, PhonoQ-2.0 achieves an average macro-F1 of 91.3% in-domain and 88.9% out-of-domain. Compared to a strong CTC phoneme baseline, it delivers consistent gains of +8.8 F1 in-domain and +8.6 out-of-domain on average. In unseen-language evaluation, PhonoQ-2.0 improves macro-F1 from 66.9% to 73.6% (+6.7 on average), with gains of up to +10.8 points.

2605.25572 2026-05-26 cs.CL cs.AI 版本更新

PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation

PennySynth:基于RAG的数据合成用于自动量子代码生成

Minghao Shao, Nouhaila Innan, Hariharan Janardhanan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

发表机构 * eBRAIN Lab, Division of Engineering, New York University Abu Dhabi (NYUAD)(eBRAIN实验室,工程系,纽约大学阿布扎比分校) Center for Quantum and Topological Systems (CQTS), NYUAD Research Institute(量子与拓扑系统中心(CQTS),NYUAD研究所) Department of Computer Science and Engineering, NYU Tandon School of Engineering(计算机科学与工程系,纽约大学坦顿工程学院)

AI总结 提出PennySynth框架,通过检索增强生成和代码感知嵌入,利用13,389个PennyLane指令-代码对数据集,在QHack竞赛中实现52%-68%的pass@5,显著提升量子代码生成的结构有效性和功能正确性。

Comments 11 pages, 3 figures

详情
AI中文摘要

量子编程框架日益增长的复杂性暴露了现有基于大语言模型(LLM)的代码助手的一个关键局限性:通用模型在面对专门的量子编码挑战时,会幻觉出PennyLane特定的门名称、错误放置设备配置并生成结构无效的电路。我们提出PennySynth,一个检索增强生成框架,通过将LLM推理条件化为一个包含13,389个PennyLane指令-代码对的精选知识库来解决这一差距,该知识库通过一个三阶段(提取、验证和去重)流程从官方PennyLane仓库、社区GitHub源和QHack竞赛档案中构建。PennySynth引入了一种使用st-codesearch-distilroberta-base的代码感知嵌入策略,该策略针对自然语言到代码的检索进行训练,将平均检索余弦相似度从通用基线的0.45提高到0.726。在涵盖QHack竞赛三年(2022、2023、2024)的74个挑战上进行评估,PennySynth在QHack 2022、2023和2024上分别达到64%、68%和52%的pass@5,相比无检索的Claude Sonnet 4.6提高了+28、+25和+28个百分点。我们进一步引入了一个量子适应的CodeBLEU指标,该指标对qml.*令牌模式进行加权,并表明结构代码相似性和功能正确性捕捉了量子代码质量的不同方面。受控消融实验揭示,代码感知嵌入是检索性能的主要驱动因素,而当检索质量足够精确时,数据集扩展和源组合提供了额外的增益。

英文摘要

The growing complexity of quantum programming frameworks has exposed a critical limitation in existing large language model (LLM)-based code assistants: general-purpose models hallucinate PennyLane-specific gate names, misplace device configurations, and produce structurally invalid circuits when faced with specialized quantum coding challenges. We present PennySynth, a retrieval-augmented generation framework that addresses this gap by conditioning LLM inference on a curated knowledge base of 13,389 PennyLane instruction-code pairs, built via a three-stage extraction, verification, and deduplication pipeline over official PennyLane repositories, community GitHub sources, and QHack competition archives. PennySynth introduces a code-aware embedding strategy using st-codesearch-distilroberta-base, trained for natural-language-to-code retrieval, increasing average retrieval cosine similarity from 0.45 to 0.726 compared to a general-purpose baseline. Evaluated across 74 challenges spanning three years of the QHack competition (2022, 2023, 2024), PennySynth achieves 64%, 68%, and 52% pass@5 on QHack 2022, 2023, and 2024, respectively, improving over Claude Sonnet 4.6 without retrieval by +28, +25, and +28 percentage points. We further introduce a quantum-adapted CodeBLEU metric that upweights qml.* token patterns and show that structural code similarity and functional correctness capture distinct aspects of quantum code quality. Controlled ablations reveal that code-aware embeddings are the primary driver of retrieval performance, while dataset expansion and source composition provide additional gains when retrieval quality is sufficiently precise.

2605.25565 2026-05-26 cs.LG cs.CL 版本更新

RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism

RotMoLE:通过旋转门控机制增强混合低秩专家

Mengyang Sun, Maochuan Dou, Tao Feng, Dan Zhang, Yihao Wang, Junpeng Liu, Yifan Zhu, Jie Tang

发表机构 * Tsinghua University(清华大学) Beijing Information Science and Technology University(北京信息科技大学) National University of Singapore(新加坡国立大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 针对MoE-LoRA中传统门控仅标量加权限制表示能力的问题,提出RotMoLE框架,通过引入旋转门控机制对每个专家进行旋转操作,提升专家利用率和专业化程度,在多任务和多语言训练中验证有效性。

详情
AI中文摘要

虽然大型语言模型(LLM)通常在进行垂直应用之前会针对特定领域任务进行微调,但将它们适应于具有多样化专业知识的复杂场景仍然具有挑战性。与此同时,混合专家(MoE)架构已成为训练LLM的关键范式,最近的一些工作也将MoE引入参数高效微调(PEFT),提出了混合低秩专家(MoE-LoRA),以增强低秩适配器学习复杂知识的能力。然而,MoE中的传统门控机制通常仅对选中的专家应用标量重新加权,从而限制了其表示和泛化的潜在能力。受MoE-LoRA中低秩结构的启发和推动,我们提出了RotMoLE,一个专门针对低秩专家的MoE框架,其特点是一个额外的旋转门控。除了简单的缩放,RotMoLE为每个选中的专家实现了一个旋转机制,从而在专家候选有限的情况下,实现了更好的专家利用和专业化,以学习多样化的数据。在复杂多任务和多语言训练场景下的实证结果验证了我们的有效性。

英文摘要

While Large Language Models (LLMs) are commonly fine-tuned to handle domain-specific tasks before being applied to vertical applications, adapting them to complex scenarios with diverse specialized knowledge remains challenging. Meanwhile, Mixture-of-Experts (MoE) architecture has risen as a crucial paradigm for training LLMs, and some recent works have also incorporated MoE into Parameter-Efficient Fine-Tuning (PEFT) to propose the Mixture of Low-rank Experts (MoE-LoRA), to enhance the power of low-rank adapters for learning complicated knowledge. However, conventional gating mechanisms in MoE typically apply only a scalar reweighing to selected experts, thereby limiting their underlying capacity of representation and generalization. Motivated and enabled by the low-rank structures in MoE-LoRA, we propose RotMoLE, a specialized MoE framework for low-rank experts featuring an additional rotation gate. Beyond simple scaling, RotMoLE implements a rotation mechanism for each selected expert, enabling superior expert exploitation and specialization for learning diverse data, especially when expert candidates are limited. Empirical results on complex multi-task and multilingual training scenarios validate our effectiveness.

2605.25549 2026-05-26 cs.CL cs.AI cs.LG 版本更新

BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data

BC协议:结构化双专家对话用于生成高质量思维链后训练数据

Bo Zou, Chao Xu

AI总结 针对大语言模型后训练中高质量专家思维链数据生产瓶颈,提出BC协议——一种结构化双专家引出方法,通过配对领域专家与知识工程师,系统外化专家隐性判断为自然语言推理链,实验证明其在推理过程自然性上具有压倒性优势。

详情
AI中文摘要

高质量的专家思维链(CoT)数据是大语言模型(LLM)后训练的核心瓶颈之一。现有数据生产方法各有结构性局限:众包标注缺乏深度推理路径;专家单独写作受限于“专家盲点”——专家会结构性跳过他们认为显而易见的推理步骤;RLHF仅产生偏好信号而非推理链。 本文提出BC协议——一种用于LLM后训练数据生产的结构化双专家引出方法。该方法精心配对领域专家(晶体智力)与知识工程师(流体智力),系统地将专家的隐性判断外化为自然语言推理链。我们引入了参与者资质模型,定义了影响引出质量的六个参与者特征维度。“校准的无知”是本文提出的原创概念。我们进一步提出“选择优于规定”作为方法论原则:对于隐性知识引出任务,将质量控制资源投入人员选择比投入同等资源于流程设计能获得更高回报。 在叙事小说领域的受控实验中,我们直接比较了BC协议双对话产生的CoT(A组,n=20)与同一领域专家独立撰写的CoT(B组,n=20)。三个跨供应商评判模型——GPT-4o、Claude Opus 4.5和Gemini 2.5 Pro——在五个维度上进行了盲评(共600个评分)。结果表明,BC协议在“推理过程自然性”上具有压倒性优势(A组均值4.80 vs. B组均值1.30,p=2.4×10^{-8},Cliff's δ=1.0)。

英文摘要

High-quality expert chain-of-thought (CoT) data is one of the core bottlenecks in large language model (LLM) post-training. Existing data production methods each have structural limitations: crowdsourced annotation lacks deep reasoning paths; expert solo writing is constrained by the "expert blind spot" -- experts structurally skip reasoning steps they consider obvious; RLHF only produces preference signals rather than reasoning chains. This paper proposes the BC Protocol -- a structured dual-expert elicitation method for LLM post-training data production. The method carefully pairs a domain expert (crystallized intelligence) with a knowledge engineer (fluid intelligence), systematically externalizing the expert's implicit judgments as natural language reasoning chains. We introduce the Participant Aptitude Model, which defines six participant characteristic dimensions that affect elicitation quality. "Calibrated Ignorance" is an original concept proposed in this paper. We further propose "Selection-over-Prescription" as a methodological principle: for implicit knowledge elicitation tasks, investing quality-control resources in personnel selection yields a higher return than investing the same resources in process design. In a controlled experiment in the narrative fiction domain, we directly compared CoT produced by BC Protocol dual dialogue (Group A, (n=20)) against CoT written independently by the same domain expert (Group B, (n=20)). Three cross-vendor judge models -- GPT-4o, Claude Opus 4.5, and Gemini 2.5 Pro -- conducted blind evaluation across five dimensions (600 ratings total). Results show that the BC Protocol achieves an overwhelming advantage in "naturalness of reasoning process" (Group A mean 4.80 vs. Group B mean 1.30, (p=2.4\times10^{-8}), Cliff's (δ=1.0)).

2605.25520 2026-05-26 cs.CL 版本更新

Is Inference Mediated by Distinct Semantic Structures in LLMs? A Mechanistic Interpretation

LLMs中的推理是否由不同的语义结构介导?一种机制性解释

Nura Aljaafari, Marco Valentino, André Freitas

发表机构 * University of Manchester(曼彻斯特大学) University of Sheffield(谢菲尔德大学) Idiap Research Institute(Idiap研究所) CRUK National Biomarker Centre, University of Manchester(CRUK国家生物标志物中心,曼彻斯特大学)

AI总结 通过SVD分解和激活引导实验,研究自然语言推理中Transformer模型是否编码语义操作,发现操作级子空间部分重叠且因果影响预测,表明模型不仅编码假设与前提的关系,还部分编码如何关联。

Comments 26 pages, 16 figures, 13 tables

详情
AI中文摘要

正确预测标签并不一定需要表示产生该标签的操作。已知Transformer表示携带标签级信息,但它们是否编码产生这些标签的语义操作尚不清楚。我们使用受控的前提-假设对(仅通过单一语义变换区分)在自然语言推理中对此进行研究。利用逐层激活,通过SVD估计操作级子空间,并通过在四个开源解码器模型中的激活引导测试其因果相关性。变换效果以84.8%-99%的准确率可解码,并占据部分不同但重叠的子空间,超过随机子空间基线。引导实验表明这些方向因果性地影响预测,尽管可引导性因模型而异;跨操作引导进一步揭示了结构化干扰以及子空间选择性与跨操作独立性之间的分离。这些发现表明,模型不仅编码假设与前提相关,还部分编码如何相关,这意味着机制分析和控制应在语义操作层面而非仅预测标签层面进行。

英文摘要

Predicting a label correctly does not necessarily require representing the operation that produces it. Transformer representations are known to carry label-level information, but whether they encode semantic operations producing those labels is unclear. We investigate this in Natural Language Inference using controlled premise-hypothesis pairs that differ by a single semantic transformation. Using layer-wise activations, we estimate operation-level subspaces via SVD and test their causal relevance through activation steering in four open-weight decoder models. Transformation effects are decodable with $84.8$-$99\%$ accuracy and occupy partially distinct but overlapping subspaces, exceeding random-subspace baselines. Steering experiments show that these directions causally influence predictions, though steerability varies across models; cross-operation steering further reveals structured interference and a dissociation between subspace selectivity and cross-operation independence. These findings indicate that the models encode not only that a hypothesis relates to a premise but also, in part, how it does so, implying that mechanistic analysis and control should operate at the level of semantic operations rather than predicted labels alone.

2605.25511 2026-05-26 cs.CL 版本更新

CRPO: Character-centric Group Relative Policy Optimization for Role-aware Reasoning in Role-playing Agents

CRPO:以角色为中心的群体相对策略优化用于角色扮演代理中的角色感知推理

Yihong Tang, Kehai Chen, Liang Yue, Benyou Wang, Min Zhang

发表机构 * Institute of Computing and Intelligence(计算与智能研究院) Harbin Institute of Technology(哈尔滨工业大学) Shenzhen Loop Area Institute (SLAI)(深圳Loop区研究院) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出CRPO框架,通过解耦任务逻辑与风格奖励、动态调整优化约束和利用通用响应作为负基线,解决强化学习在角色扮演中角色保真度下降和风格崩溃问题。

详情
AI中文摘要

强化学习的最新进展,特别是群体相对策略优化(GRPO),显著提升了大型语言模型的推理能力。然而,将这些以问题为中心的优化方法应用于角色扮演代理时,往往会导致角色保真度下降和风格崩溃,因为它们优先考虑上下文特定的效用而非角色对齐。为了解决这个问题,我们提出了以角色为中心的群体相对策略优化(CRPO),这是一个旨在将强化学习目标与角色扮演任务重新对齐的框架。CRPO通过三种机制提升角色独特性:解耦任务逻辑与风格奖励以解决梯度冲突,根据角色复杂度动态调整优化约束,以及利用通用响应作为负基线以防止模型回归到常见分布。大量实验表明,CRPO在一致性、情感等方面优于现有方法。

英文摘要

Recent advancements in Reinforcement Learning (RL), particularly Group Relative Policy Optimization (GRPO), have significantly enhanced the reasoning capabilities of Large Language Models. However, applying these problem-centric optimization methods to role-playing agents often leads to a loss of character fidelity and style collapse, as they prioritize context-specific utility over persona alignment. To address this, we propose Character-Centric Group Relative Policy Optimization (CRPO), a framework designed to realign RL objectives with the role-playing task. CRPO improves character distinctiveness through three mechanisms: decoupling task logic from stylistic rewards to resolve gradient conflicts, dynamically adapting optimization constraints based on character complexity, and utilizing generic responses as negative baselines to prevent the model from reverting to a common distribution. Extensive experiments demonstrate that CRPO outperforms existing methods in consistency, emotion and others.

2605.25502 2026-05-26 cs.CL cs.AI 版本更新

A Controlled Synthetic Benchmark for Educational Aspect-Based Sentiment Analysis

面向教育方面情感分析的可控合成基准

Yehudit Aperstein, Alexander Apartsin

发表机构 * Intelligent Systems, Afeka Academic College of Engineering(阿法卡学术工程智能系统学院) School of Computer Science, Faculty of Sciences, Holon Institute of Technology(霍洛技术学院计算机科学学院)

AI总结 为解决教育领域标注数据稀缺问题,提出一个包含10,000条合成课程评论和20个教学方面的可控合成基准,并通过实验验证了任务难度及合成到真实的迁移能力。

Comments 39 pages, 14 figures

详情
AI中文摘要

教育方面情感分析(ABSA)可以支持课程改进,但带有方面标签的学生反馈仍然稀缺,因为教育评论是私有的、特定于机构的且标注成本高昂。本研究引入了一个面向教育ABSA的可控合成基准,该基准由10,000条合成课程评论构建,具有明确的训练-验证-测试划分,以及一个涵盖教学质量、评估与课程管理、学习需求、学习环境和参与度的20方面教学模式。该语料库通过采样的目标标签、采样的细微属性以及经过三轮评审-编辑流程优化的真实感提示生成。在该基准上,使用TF-IDF、两阶段变换器和联合编码器的局部基线表明该任务并非易事;最强的未调优模型BERT在留出集上的检测微F1得分为0.2760,而一个适度的低学习率BERT调度将其提升至0.2930。基于gpt-5.2的全测试GPT推理在零样本模式下达到0.2519微F1,在使用基于检索的少样本提示时达到0.2501,使批量推理高于经典基线并接近紧凑的联合编码器。在来自Herath等人的2,829条映射学生反馈评论上进行的保守外部评估中,BERT在9个方面重叠上的微F1得分为0.4593,表明部分合成到真实的迁移。真实性和忠实度分析作为生成器诊断报告,阐明了基准如何稳定以及标签噪声仍然存在的位置。因此,本研究贡献了一个合成教育ABSA语料库、一个文档化的生成过程以及一个可复现的基准设置,适用于公共标注数据仍然难以获得的领域。

英文摘要

Educational aspect-based sentiment analysis (ABSA) can support course improvement, but public aspect-labeled student feedback remains scarce because educational reviews are private, institution-specific, and expensive to annotate. This study introduces a controlled synthetic benchmark for educational ABSA built from 10,000 synthetic course reviews with explicit train-validation-test splits and a 20-aspect pedagogical schema spanning instructional quality, assessment and course management, learning demand, learning environment, and engagement. The corpus is generated with sampled target labels, sampled nuance attributes, and a realism-tuned prompt refined through a three-cycle judge-editor procedure. On the resulting benchmark, local baselines with TF-IDF, two-step transformers, and joint encoders show that the task is nontrivial; the strongest untuned model, BERT, reaches a held-out detection micro-F1 of 0.2760, while a modest lower-rate BERT schedule improves this to 0.2930. Full-test GPT-based inference with gpt-5.2 reaches 0.2519 micro-F1 in zero-shot mode and 0.2501 with retrieval-based few-shot prompting, placing batch inference above the classical baseline and close to the compact joint encoders. A conservative external evaluation on 2,829 mapped student-feedback reviews from Herath et al. yields a micro-F1 of 0.4593 for BERT on a 9-aspect overlap, indicating partial synthetic-to-real transfer. Realism and faithfulness analyses are reported as generator diagnostics that clarify how the benchmark was stabilized and where label noise remains. The study therefore contributes a synthetic educational ABSA corpus, a documented generation procedure, and a reproducible benchmark setting for a domain in which public labeled data remain difficult to obtain.

2605.25475 2026-05-26 cs.CL cs.AI 版本更新

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

IndexMem: 基于潜在记忆的学习型KV缓存驱逐策略用于长上下文LLM推理

Xintong Yang, Hao Gu, Binxing Xu, Lujun Li, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Sirui Han, Yike Guo

发表机构 * The Hong Kong University of Science(香港科学与技术大学) Zhejiang University(浙江大学)

AI总结 提出一种可学习的索引器预测KV重要性,并结合轻量级潜在记忆模块压缩被驱逐的令牌,以在有限KV预算下实现准确的长上下文推理。

详情
AI中文摘要

大型语言模型(LLM)越来越需要处理长上下文,但标准softmax注意力机制的KV缓存随序列长度线性增长,迅速成为长上下文推理的瓶颈。一种实用的补救措施是驱逐不太重要的KV条目;然而,现有的驱逐策略大多是启发式的,难以捕捉令牌重要性的丰富、输入相关的分布。在这项工作中,我们引入了一个可学习的索引器来预测KV重要性,从而能够更准确地保留关键令牌。同时,简单地驱逐令牌会永久丢弃其信息,导致不可逆的遗忘和长距离检索性能下降。为了解决这个问题,我们提出了一个轻量级的潜在记忆模块,将驱逐的令牌压缩成紧凑的、在线更新的状态,并提供残差读出以补偿通过KV驱逐丢失的注意力贡献。总的来说,我们的方法能够在有限的KV预算下实现准确的长上下文推理,在RULER(4K/16K)上对Qwen、Mistral和Llama模型(在激进驱逐下提升高达25分)带来一致的改进,在Needle-in-a-Haystack检索中显著更稳定,并且在LongBench得分和压缩曲线上优于现有的驱逐策略。

英文摘要

Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context inference. A practical remedy is to evict less important KV entries; however, existing eviction policies are largely heuristic and struggle to capture the rich, input-dependent distribution of token importance. In this work, we introduce a learnable indexer that predicts KV importance, enabling more accurate retention of critical tokens. Meanwhile, naively evicting tokens permanently discards their information, leading to irreversible forgetting and degraded retrieval over long ranges. To address this, we propose a lightweight latent memory module that compresses evicted tokens into a compact, online-updated state and provides residual readouts to compensate for the attention contributions lost through KV eviction. Collectively, our method enables accurate long-context inference under a bounded KV budget, delivering consistent improvements on RULER (4K/16K) across Qwen, Mistral, and Llama models (up to 25 points under aggressive eviction), markedly more stable Needle-in-a-Haystack retrieval, and superior LongBench scores and compression curves compared to existing eviction policies.

2605.25474 2026-05-26 cs.CL 版本更新

TypedCSIP: Typed Counterfactual Pretraining for Chinese Legislative Conflict Classification

TypedCSIP:面向中国立法冲突分类的类型化反事实预训练

Yao Liu

发表机构 * Chengdu University of Technology, Leshan, China(成都理工大学,乐山,中国) School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia(马来西亚理科大学计算机科学学院,槟城,马来西亚)

AI总结 提出TypedCSIP方法,通过类型化反事实选择性干预预训练(阶段1)和五路分类头迁移(阶段2),在LCR-CN基准上提升立法冲突分类的宏F1值。

详情
AI中文摘要

TypedCSIP是一种针对LCR-CN基准(Zhao等人,2026)冲突分类任务的类型化反事实预训练方法:给定(上位法,下位法)条款对,预测该对是否冲突以及四种法律教义类型(责任、条件、制裁、定义)中哪一种描述不一致。我们利用LCR-CN中专家编写的最小修订作为训练时的反事实监督;测试时分类器仅读取原始条款对。阶段1在(上位法,下位法,专家修订)三元组上使用类型化反事实选择性干预预训练目标预训练共享编码器,将专家修订视为反事实,类型因子头必须将其分类为无冲突证据。阶段2将编码器迁移到五路分类头。确认性测试在观察v6测量之前在开放科学框架上注册:18个种子,锁定规则要求每种子平均差异至少0.8个百分点,且种子自举和学生t 95%置信下限均大于零。在696条记录测试集上,v2变体在chinese-roberta-wwm-ext上比最强单模型基线提高宏F1 +0.916个百分点,在SAILER跨骨干复制上提高+1.288个百分点;两个单元格均通过规则。在244条Unseen-gB记录上的冷启动分层结果在两个骨干上均保持正增益。跨任务诊断显示阶段2编码器是分类专用的,不能迁移到LCR-CN的上位法检索任务,因此我们将贡献限定在冲突分类。我们发布代码、72个预注册预测文件、匹配种子和MLM控制辅助文件以及OSF预注册记录。

英文摘要

TypedCSIP is a typed counterfactual pretraining method for the conflict-classification task of the LCR-CN benchmark (Zhao et al., 2026): given a (superior, subordinate) provision pair, predict whether the pair conflicts and which of four legal-doctrine types (Responsibility, Condition, Sanction, Definition) describes the inconsistency. We exploit LCR-CN's expert-written minimal revisions as training-time counterfactual supervision; at test time the classifier reads only the original pair. Stage 1 pretrains a shared encoder with a typed Counterfactual Selective Intervention Pretraining objective on (superior, subordinate, expert-revised) triplets, treating the expert revision as a counterfactual that the typed factor head must classify as carrying no conflict evidence. Stage 2 transfers the encoder to a five-way classification head. The confirmatory test was registered on the Open Science Framework before observing v6 measurements: 18 seeds, locked rule requiring mean per-seed difference at least 0.8 pp with both seed-bootstrap and Student-t 95% lower bounds above zero. On the 696-record test split, the v2 variant improves macro-F1 over the strongest single-model baseline by +0.916 pp on chinese-roberta-wwm-ext and +1.288 pp on the SAILER cross-backbone replication; both cells pass the rule. A cold-start stratified result on the 244 Unseen-gB records keeps the gain positive on both backbones. A cross-task diagnostic shows the Stage-2 encoder is classification-specialized and does not transfer to LCR-CN's superior-law retrieval task, so we scope the contribution to conflict classification. We release code, 72 pre-registered prediction files, matched-seed and MLM-control auxiliaries, and the OSF pre-registration record.

2605.25463 2026-05-26 cs.CL 版本更新

A Lightweight Hybrid Transformer-CRF Architecture for Multi-Type Bangla Medical Entity Recognition

一种轻量级混合Transformer-CRF架构用于多类型孟加拉语医学实体识别

Peyal Saha, Ahsanul Haque Hasib, Shoumik Barman Polok

AI总结 提出一种轻量级孟加拉语医学实体识别框架,通过知识蒸馏将12层BanglaBERT-CRF教师模型压缩为4层学生模型,并应用INT8动态量化,在保持性能的同时实现8.6倍CPU加速和近48%存储节省。

详情
AI中文摘要

MedER指的是医学实体的识别。它对于从非结构化医学文本中提取结构化临床信息至关重要。许多现有系统依赖于基于Transformer的模型,这些模型计算成本高且难以在资源受限环境中部署。此外,早期的工作通常使用宽松的评估指标,通过奖励正确预测主导的“外部”(O)标记来人为地提升性能。在本文中,我们提出了一种轻量级的孟加拉语医学实体识别(MedER)框架。我们使用一个12层BanglaBERT模型结合条件随机场(CRF)层进行精确边界实体检测,建立了严格的基线。为了解决部署限制,我们通过知识蒸馏(KD)将该教师模型压缩为一个4层学生网络,其中学生从教师的CRF前软发射logits中学习。最后,我们应用INT8动态量化进一步减小模型大小和推理成本。我们最终的量化学生模型实现了8.6倍的CPU加速,同时所需存储比CRF教师模型减少近48%。

英文摘要

MedER refers to the identification of medical entities. It is crucial for extracting structured clinical information from unstructured medical text. Many existing systems rely on transformer-based models, which are computationally expensive and difficult to deploy in resource-constrained environments. Furthermore, earlier works often use relaxed evaluation metrics that artificially inflate performance by rewarding correct prediction of dominant "Outside" (O) tokens. In this paper, we propose a lightweight Medical Entity Recognition (MedER) framework for the Bangla language. We establish a rigorous baseline using a 12-layer BanglaBERT model combined with a Conditional Random Field (CRF) layer for exact-boundary entity detection. To address deployment constraints, we compress this teacher model into a 4-layer student network through Knowledge Distillation (KD), where the student learns from the teacher's pre-CRF soft emission logits. Finally, we apply INT8 dynamic quantization to further reduce model size and inference cost. Our final quantized student achieves an 8.6x CPU speedup while requiring nearly 48 percent less storage than the CRF teacher model.

2605.25454 2026-05-26 cs.HC cs.AI cs.CL cs.CY cs.SI 版本更新

AI Content Moderation in Therapy Conversations

AI在治疗对话中的内容审核

Jiwon Kim, Claire Wang, Taeung Yoon, Sabelle Huang, Koustuv Saha

AI总结 研究审计三种主流内容审核系统(OpenAI、Meta、Google)在真实治疗对话中的标记行为,揭示其限制LLM作为治疗师的潜力。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于情感支持。它们也正在被开发用于正式的治疗目的。然而,像ChatGPT或Llama这样的LLM通常配备内容审核护栏,出于责任和安全考虑,阻止它们与用户讨论敏感话题,而这种无法触及这些话题的能力可能影响它们作为治疗师的能力。在本研究中,我们对三种最先进的审核系统(OpenAI的审核端点、Meta的Llama Guard和Google的Shield Gemma)进行了算法审计,以调查这些系统将现实治疗会话内容标记为不良的程度。我们的结果揭示了用户和组织在设计LLM扮演治疗师角色时可能遇到的限制。

英文摘要

Large language models (LLMs) are increasingly being used for emotional support. They are also being developed for formal therapy purposes. However, LLMs like ChaptGPT or Llama are often developed with content moderation guardrails that prevent them from discussing sensitive subjects with users for both liability and safety purposes, and this inability to broach these subjects may affect their capacity as therapists. In this study, we perform an algorithm audit on three state-of-the-art moderation systems (OpenAI's moderation endpoint, Meta's Llama Guard, and Google's Shield Gemma) to investigate the extent to which these systems flag the content of real-life therapy sessions as undesirable. Our results raise implications for the limitations that users and organizations may encounter when designing LLMs to play the part of a therapist.

2605.25447 2026-05-26 cs.CL 版本更新

GeoSVG-RL: Geometry-Aware Reinforcement Learning for Layout-Constrained Text-to-SVG Diagram Generation

GeoSVG-RL:面向布局约束的文本到SVG图表生成的几何感知强化学习

Sifan Li, Yujun Cai, Hongkai Chen, Yiwei Wang

发表机构 * University of California, Merced(加州大学梅尔德分校) The University of Queensland(昆士兰大学) vivo Mobile Communication Co., Ltd.(vivo移动通信有限公司)

AI总结 提出GeoSVG-RL框架,通过强化学习优化策略,利用几何反馈奖励(渲染有效性、画布适配、锚点放置、文本包含、图一致性和代码整洁性)解决文本到SVG图表生成中的结构脆弱性问题,显著提升箭头锚点精度和文本框内率。

详情
AI中文摘要

生成结构化、可编辑的图表对当代大型语言模型来说仍然是一个重大挑战,尽管它们在通用向量代码生成方面表现出色。主要困难在于输出的结构脆弱性;微小的错误,如未对齐的连接器端点、文本标签与边框重叠或复杂布局超出画布边界,都会使生成的SVG文件在专业应用中无法使用。为了解决这些问题,我们引入了GeoSVG-RL,一个专门为布局约束的文本到SVG生成设计的强化学习框架。与仅依赖于最大化令牌级可能性的标准训练目标不同,我们的方法针对明确的、可执行的几何反馈优化策略。模型首先生成一个结构化的布局计划,作为后续SVG代码生成的几何契约。然后通过浏览器支持的验证器渲染该代码,从而在六个关键维度上计算细粒度奖励:渲染有效性、画布适配、精确锚点放置、文本包含、图一致性和代码整洁性。我们利用组相对策略优化(GRPO)来优化模型,每个提示采样多个候选,以便基于相对质量进行更新。从合成数据上的监督预热阶段开始,GeoSVG-RL在结构可靠性方面取得了显著提升,特别是在箭头锚点精度和文本框内率方面。定量评估表明,我们的方法在局部几何精度和图连通性保持方面持续优于当前最先进的系统,为自动化且可靠的技术插图提供了一条稳健的路径。

英文摘要

Generating structured, editable diagrams remains a significant challenge for contemporary large language models, despite their proficiency in general-purpose vector code generation. The primary difficulty lies in the structural fragility of the output; minor errors such as misaligned connector endpoints, text labels overlapping borders, or complex layouts drifting beyond the canvas boundaries render the resulting SVG files functionally unusable for professional applications. To address these issues, we introduce GeoSVG-RL, a specialized reinforcement learning framework designed for layout-constrained text-to-SVG generation. Unlike standard training objectives that rely solely on maximizing token-level likelihood, our approach optimizes the policy against explicit, executable geometric feedback. The model first produces a structured layout plan that serves as a geometric contract for the subsequent generation of the SVG code. This code is then rendered through a browser-backed verifier, enabling the calculation of fine-grained rewards across six critical dimensions: rendering validity, canvas fitting, precise anchor placement, text containment, graph consistency, and code cleanliness. We utilize Group Relative Policy Optimization (GRPO) to refine the model, sampling multiple candidates per prompt to facilitate updates based on relative quality. Starting from a supervised warm-start phase on synthetic data, GeoSVG-RL achieves substantial gains in structural reliability, particularly in arrow-anchor accuracy and text-in-box rates. Quantitative evaluations demonstrate that our method consistently outperforms current state-of-the-art systems in local geometric precision and the preservation of graph connectivity, providing a robust pathway toward automated yet reliable technical illustration.

2605.25443 2026-05-26 cs.CL 版本更新

Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models

多样性中的和谐:面向大型推理模型的多域对比策略优化

Zongji Yu, Wenshui Luo, Yiliu Sun, Hao Fang, Runmin Cong, Chaochao Lu, Chen Gong

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) Shandong University(山东大学)

AI总结 提出多域对比策略优化(MCPO),通过对比学习促进跨域知识迁移并减少干扰,提升大型推理模型在多域场景下的推理能力。

Comments 25 pages, 5 figures

详情
AI中文摘要

后训练显著增强了大型推理模型(LRM)的推理能力,尤其是使用如组相对策略优化(GRPO)等强化学习(RL)方法。然而,在多域设置中,GRPO风格的RL方法由于策略优化中的固有干扰,往往无法在所有领域实现一致的改进。先前关于多域RL的研究主要集中于减轻跨域干扰,而常常忽略了知识共享的关键作用,我们认为知识共享是将跨域交互从有害竞争转变为有益迁移的关键。为解决这一局限,我们提出了多域对比策略优化(MCPO),该方法分析展开(rollouts)之间的结构关系,并以对比方式促进跨域知识共享和域内知识整合。具体而言,对于给定的提示,MCPO将来自其他域的可迁移推理轨迹识别为正例,而将错误的展开视为负例。然后,它鼓励正例对的一致表示,并推开负例对,从而促进知识迁移并减少干扰。此外,MCPO对齐域内正确的展开以构建一个整合的表示空间。通过这种方式,MCPO对比学习一个能够容纳多样化多域知识的和谐表示空间。实验结果表明,MCPO提升了LRM在多个域上的推理能力,甚至在某些情况下优于单域训练。代码可在 https://github.com/Maricalce/MCPO 获取。

英文摘要

Post-training has significantly enhanced the reasoning capability of Large Reasoning Models (LRMs), especially with Reinforcement Learning (RL) like Group Relative Policy Optimization (GRPO). However, GRPO-style RL methods in multi-domain settings often fail to achieve consistent improvements across all domains due to inherent interference in policy optimization. Prior studies on multi-domain RL primarily focus on alleviating cross-domain interference, while often neglecting the pivotal role of knowledge sharing, which we argue is the key to transforming cross-domain interactions from harmful competition into beneficial transfer. To address this limitation, we propose Multi-domain Contrastive Policy Optimization (MCPO), which analyzes the structural relationships among rollouts and promotes cross-domain knowledge sharing and in-domain knowledge consolidation in a contrastive manner. Specifically, for a given prompt, MCPO identifies transferable reasoning trajectories from other domains as positive examples, while treating incorrect rollouts as negative ones. It then encourages consistent representations for positive pairs and pushes negative pairs apart, thereby facilitating knowledge transfer and reducing interference. Moreover, MCPO aligns intra-domain correct rollouts to build a consolidated representation space. In this way, MCPO contrastively learns a harmonious representation space that can accommodate diverse multi-domain knowledge. Empirical results show that MCPO improves the reasoning capabilities of LRMs across multiple domains and even outperforms single-domain training in some cases. Code is available at https://github.com/Maricalce/MCPO.

2605.25440 2026-05-26 cs.CL cs.AI cs.MA 版本更新

A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

用于评估手术反馈质量的多智能体LLM框架

Rafal Kocielnik, J. Everett Knudsen, Steven Y. Cen, Jasmine Lin, Cherine H. Yang, Atharva Deo, Ujjwal Pasupulety, Peter Wager, Anima Anandkumar, Andrew J. Hung

发表机构 * Computing + Mathematical Sciences, California Institute of Technology(加州理工学院计算与数学科学系) Department of Urology, Cedars-Sinai(塞斯医疗中心泌尿科) Keck School of Medicine, University of Southern California(美国南加州大学凯克医学院)

AI总结 提出一个两阶段LLM框架,通过多智能体提示和手术领域知识注入发现可解释的反馈质量标准,并利用LLM作为评判者自动评分,在预测反馈有效性上优于先前方法。

Comments 25 pages, 3 figures

详情
AI中文摘要

手术室中主治医生提供的口头反馈在住院医师技能习得中起着关键的形成性作用。然而,评估培训者反馈的质量及其在实时手术中影响受训者行为的有效性仍然是一个挑战。先前的研究依赖于专家人工评分者的大量手动标注来评估反馈内容,并侧重于开发忽略反馈传递定性方面(如清晰度或紧迫性)的广泛分类法。有限的现有自动化方法,包括关键词分析和主题建模,也无法捕捉这些细微方面。我们引入了一个两阶段基于LLM的框架,该框架发现基于手术培训背景的可解释反馈质量标准。我们的方法使用多智能体提示和手术领域知识注入来发现一小套人类可解释的评分标准(例如,鼓励性、紧迫性、清晰性)。然后,这些标准通过LLM作为评判者的方法自动评分实时手术反馈。对4.2k个培训者反馈实例的评估表明,我们AI发现的标准在预测反馈有效性(包括观察到的受训者行为调整和培训者认可)方面优于先前基于内容的框架。这项工作推进了手术室中可扩展的、与人类对齐的沟通质量评估,并为改进手术教学实践提供了基础。

英文摘要

Verbal feedback delivered by attending surgeons in the operating room plays a critical formative role in resident trainee skill acquisition. Yet, assessing the quality of trainer feedback and its effectiveness in influencing trainee behavior during live surgery remains a challenge. Prior studies assessed feedback content relying on extensive manual annotation by expert human raters and focused on developing broad taxonomies that overlook the qualitative aspects of feedback delivery such as clarity or urgency. Limited existing automated methods, including keyword analysis and topic modeling, also fail to capture these nuanced aspects. We introduce a two-stage LLM-based framework that discovers interpretable feedback quality criteria grounded in the context of surgical training. Our method uses multi-agent prompting and surgical domain knowledge injection to discover a small set of human interpretable scoring criteria (e.g., Encouraging, Urgent, Clear). These criteria are then used to automatically score live surgical feedback via an LLM-as-a-judge approach. Evaluation on 4.2k trainer feedback instances demonstrates that our AI-discovered criteria outperform prior content-based frameworks in predicting feedback effectiveness, including observed trainee behavioral adjustments and trainer approval. This work advances scalable, human-aligned assessment of communication quality in the operating room and provides a foundation for improving surgical teaching practices.

2605.25421 2026-05-26 cs.CL 版本更新

HyLaT: Efficient Multi-Agent Communication via Hybrid Latent-Text Protocol

HyLaT: 通过混合潜在-文本协议实现高效多智能体通信

Xinyi Mou, Siyuan Wang, Zejun Li, Yulan He, Zhongyu Wei

发表机构 * Fudan University(复旦大学) The Chinese University of Hong Kong(香港中文大学) King’s College London(伦敦国王学院) The Alan Turing Institute(艾伦·图灵研究所) Shanghai Innovation Institute(上海创新研究院)

AI总结 针对多智能体通信中的三元困境,提出混合潜在-文本协议HyLaT,通过潜在通道传输认知信号提升效率,自然语言表达关键信号保证可解释性,并设计两阶段训练框架,显著降低通信开销同时保持任务性能。

详情
AI中文摘要

通信协议设计是基于大语言模型的多智能体系统中的核心挑战。现有的单通道方法面临固有的通信三元困境:基于文本的方法可解释但冗长,而基于潜在空间的方法高效但不透明且局限于单向工作流。受多通道通信理论启发,我们提出HyLaT,一种混合潜在-文本通信协议,通过潜在通道传输精细的认知信号以提高效率,同时用自然语言表达简洁的关键信号以保持可解释性和精确性。我们引入一个两阶段训练框架,结合单智能体混合生成学习和多智能体交互协同训练,使智能体能够在多轮交互中生成和解释混合消息。实验表明,HyLaT显著降低了通信开销,同时保持了竞争性的任务性能,并在不同设置下具有强大的泛化能力和鲁棒性。

英文摘要

Communication protocol design is a central challenge in large language model-based multi-agent systems. Existing single-channel approaches face an inherent communication trilemma: text-based methods are interpretable but verbose, while latent-space methods are efficient but opaque and limited to unidirectional workflows. Inspired by multi-channel communication theory, we propose HyLaT, a hybrid latent-text communication protocol that transmits elaborate cognitive signals through a latent channel for efficiency, while expressing concise critical signals in natural language to preserve interpretability and precision. We introduce a two-stage training framework combining single-agent hybrid generation learning and multi-agent interactive co-training, enabling agents to generate and interpret hybrid messages across multiple rounds of interaction. Experiments demonstrate that HyLaT reduces communication overhead significantly while maintaining competitive task performance, with strong generalization and robustness across diverse settings.

2605.25420 2026-05-26 cs.CL cs.AI cs.CY 版本更新

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

SomaliBench Eval:衡量开源语言模型中英语到索马里语的拒绝差距

Khalid Yusuf Dahir

发表机构 * Independent researcher(独立研究人员)

AI总结 通过构建索马里语有害意图基准并评估四个开源模型,发现英语到索马里语的拒绝率存在显著差距,且多数非拒绝输出为不流畅的无效内容。

Comments 12 pages, 3 figures, 4 tables. Code: https://github.com/khaledyusuf44/somalibench_eval Dataset: https://huggingface.co/datasets/khaledyusuf44/somalibench-v0

详情
AI中文摘要

大型语言模型的安全评估仍然高度以英语为中心,即使模型在全球部署,低资源语言的评估也严重不足。我们在SomaliBench v0上评估了四个开源指令微调模型,这是一个由母语者验证的基准,包含100对英语和索马里语的有害意图提示。每个模型(Llama-3.1-8B-Instruct、Gemma-2-9B-Instruct、Qwen-2.5-7B-Instruct和Aya-23-8B)均在本地运行,温度为0,并使用相同的英语“有帮助、无害、诚实”(HHH)系统提示。一个固定的Claude Sonnet快照(claude-sonnet-4-5-20250929)将每个响应分类为拒绝、遵从或不清楚;母语作者对分层抽样的80行样本进行抽查。我们发现所有四个模型在英语到索马里语之间存在巨大的拒绝差距:Llama-3.1-8B(0.90;95%自助法置信区间[0.85, 0.96])、Aya-23-8B(0.75 [0.67, 0.83])、Qwen-2.5-7B(0.69 [0.59, 0.78])和Gemma-2-9B(0.38 [0.27, 0.49])。对于三个模型,索马里语中主要的非拒绝模式不是流畅的有害遵从,而是不清楚的输出:空、错误语言或不连贯的生成。母语验证抽查在80个采样行上与判断器达到100%一致(Cohen's kappa = 1.00)。我们仅报告总体拒绝率、类别差距和可靠性统计;原始模型生成保留在本地,不发布。

英文摘要

Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0, a native-author-verified benchmark of 100 harmful-intent prompts paired across English and Somali. Each of Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, and Aya-23-8B is run locally with temperature 0 and the same English "helpful, harmless, and honest" (HHH) system prompt. A pinned Claude Sonnet snapshot (claude-sonnet-4-5-20250929) classifies each response as refused, complied, or unclear; the native author spot-checks a stratified 80-row sample. We find large English-to-Somali refusal gaps for all four models: Llama-3.1-8B (0.90; 95% bootstrap CI [0.85, 0.96]), Aya-23-8B (0.75 [0.67, 0.83]), Qwen-2.5-7B (0.69 [0.59, 0.78]), and Gemma-2-9B (0.38 [0.27, 0.49]). For three models, the dominant Somali non-refusal mode is not fluent harmful compliance but unclear output: empty, wrong-language, or incoherent generations. The native verification spot-check achieves 100% agreement with the judge (Cohen's kappa = 1.00) on the 80 sampled rows. We report aggregate refusal rates, category gaps, and reliability statistics only; raw model generations are retained locally and are not released.

2605.25415 2026-05-26 cs.CL cs.CY cs.ET 版本更新

LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers

LLM-as-a-Reviewer: 基准测试它们作为论文审稿人的能力、分歧和提示注入抵抗性

Lingyao Li, Junjie Xiong, Changjia Zhu, Runlong Yu, Chen Chen, Junyu Wang, Renkai Ma, Zhicong Lu

发表机构 * University of South Florida(佛罗里达南大学) Missouri University of Science and Technology(密苏里科技大学) University of Alabama(阿拉巴马大学) Florida International University(佛罗里达国际大学) University of Cincinnati(辛辛那提大学) George Mason University(乔治·梅森大学)

AI总结 本研究通过一个系统基准测试,评估了12个大型语言模型在论文评审中的表现,包括评分校准、与人类审稿人的分歧以及对不可见字体映射攻击的抵抗性,发现LLMs存在系统性高估弱论文、与人类关注点不同以及易受提示注入攻击等问题。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于学术同行评审,但其可靠性、与人类判断的一致性以及对对抗性攻击的鲁棒性仍知之甚少。我们对从NeurIPS和ICLR分层的898篇论文进行了LLM-as-a-Reviewer的系统基准测试,评估了12个LLMs的三个维度:评分校准、与人类审稿人的分歧以及对通过不可见字体映射攻击嵌入的提示注入的抵抗性。我们发现LLMs系统性地高估较弱的投稿,并在主题重点上与人类存在分歧,低估清晰度而高估可重复性,同时生成的评论长度是人类的2到3倍,词汇多样性较低且词汇更标准化。提示注入仍然非常有效。简单的隐藏指令可以在相当一部分案例中将低分论文提升至可接受级别的评分,且效果在不同模型家族间差异显著。虽然LLMs在结构化评估方面具有实用性,但将其整合到同行评审中需要针对内在偏见和对抗性风险设置防护措施。

英文摘要

Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of LLM-as-a-Reviewer on 898 papers stratified from NeurIPS and ICLR, evaluating 12 LLMs along three axes: rating calibration, divergence from human reviewers, and resistance to prompt injection embedded via an invisible font-mapping attack. We find that LLMs systematically overrate weaker submissions and diverge from humans in topical emphasis, under-flagging Clarity and over-flagging Reproducibility, while producing reviews two to three times longer with lower lexical diversity and a more standardized vocabulary. Prompt injection remains highly effective. Simple hidden instructions can promote low-scoring papers to acceptance-level ratings in a substantial fraction of cases, with effectiveness varying sharply across model families. While LLMs offer utility in structuring evaluations, their integration into peer review requires safeguards against both intrinsic biases and adversarial risks.

2605.25404 2026-05-26 cs.CL eess.AS 版本更新

Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems

主动应对不确定性:面向口语对话系统的因果感知错误诊断与交互式澄清

Yizhou Peng, Ziyang Ma, Changsong Liu, Yi-Wen Chao, Xie Chen, Eng Siong Chng

发表机构 * Nanyang Technological University, Singapore(南洋理工大学,新加坡) Shanghai Jiao Tong University, China(上海交通大学,中国)

AI总结 本文提出一种因果感知的错误恢复范式,通过细粒度检测器解耦ASR中的感知、理解和删除错误,使LLM能够执行多轮针对性澄清策略,从而显著降低词错误率并提升下游任务性能。

详情
AI中文摘要

级联自动语音识别-大语言模型(ASR-LLM)流水线在工业口语对话系统(SDS)中仍然流行,主要因为其解耦设计确保了感知可验证性。然而,级联系统存在错误传播问题,因为转录失败不可避免地级联到后续组件,从而降低最终交互质量。尽管ASR置信度分数为不可靠输入提供了简单过滤,但这种方法存在根本性局限,因为它通常无法检测删除错误,也无法区分声学(听不清)和语言(不理解)不匹配,而这两者都需要针对性的恢复策略。在本文中,我们提出了一种因果感知的错误恢复范式,从根本上重新思考SDS的鲁棒性。与传统的置信度过滤不同,我们引入了一组小型精度聚焦检测器,利用深度ASR潜在表示将词级错误解耦为感知、理解和删除失败。这种细粒度诊断智能使LLM能够编排针对性的多轮澄清策略,有效将模糊信号转化为无缝的用户交互。实验结果验证了我们方法的精度,与基线相比,在领域转移错误上的召回率提高了一倍以上(57.96% vs. 23.66%)。关键的是,这种诊断精度在不同口音、失真和领域下,使词错误率降低高达30%,下游任务性能提升17%。

英文摘要

Cascaded Automatic Speech Recognition -- Large Language Model (ASR-LLM) pipelines remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However, cascaded systems suffer from error propagation, as transcription failures inevitably cascade to subsequent components, thereby degrading the final interaction quality. Although ASR confidence scores offer a simple filter for unreliable inputs, this approach is fundamentally limited because it typically fails to detect deletion errors or to distinguish between acoustic (inability to hear clearly) and linguistic (inability to understand) mismatches, both of which require targeted recovery strategies. In this paper, we propose a cause-aware error recovery paradigm that fundamentally rethinks robustness in SDS. Unlike traditional confidence filtering, we introduce a suite of small precision-focused detectors that exploit deep ASR latent representations to disentangle token-level errors into perception, comprehension, and deletion failures. This fine-grained diagnostic intelligence empowers the LLM to orchestrate targeted, multi-turn clarification strategies, effectively transforming ambiguous signals into seamless user interactions. Experimental results validate the precision of our approach, which more than doubles the recall on domain-shift errors (57.96% vs. 23.66%) compared to baselines. Crucially, this diagnostic precision yields up to a 30% reduction in WER and a 17% improvement on the downstream task across diverse accents, distortions, and domains.

2605.25394 2026-05-26 cs.AI cs.CL 版本更新

Second Guess: Detecting Uncertainty Through Abstention and Answer Stability in Small Language Models

Second Guess: 通过弃权和答案稳定性检测小型语言模型的不确定性

Ashwath Vaithinathan Aravindan, Mayank Kejriwal

发表机构 * University of Southern California(南加州大学) Information Sciences Institute(信息科学研究所)

AI总结 提出一种轻量级、无参数的提示技术Second Guess,通过添加“我不知道”选项并观察答案稳定性,在多项选择问答中实现弃权,有效检测小型语言模型的不确定性。

详情
AI中文摘要

大型语言模型在不确定时往往生成自信但错误的答案,而非弃权。这个问题对于小型语言模型(SLM)尤为严重,因为计算约束和自主操作放大了对可靠不确定性检测的需求。我们提出了_Second Guess_,一种轻量级、无参数的提示技术,用于多项选择问答(MCQA)中的弃权,非常适合SLM。我们的关键实证洞察是,真正知道答案的模型会一致地选择它,而不确定的模型在添加“我不知道”选项时会表现出不稳定的行为。在四个开源模型(2B-8B参数)和四个基准测试上评估,Second Guess实现了10.81%的最高复合风险改进。值得注意的是,在基于熵的方法退化的微调模型上,它保持了8%的复合风险改进,并且对性能较低的模型改进最大。重现本工作所需的所有代码和结果可在https://github.com/Mystic-Slice/second-guess获取。

英文摘要

Large language models often generate confident but incorrect answers rather than abstaining when uncertain. This problem is particularly acute for small language models (SLMs), where computational constraints and autonomous operation amplify the need for reliable uncertainty detection. We propose _Second Guess_, a lightweight, parameter-free prompting technique for abstention in multiple-choice question answering (MCQA) that is well-suited for SLMs. Our key empirical insight is that models which truly know an answer will select it consistently, while uncertain models exhibit unstable behavior when an ``I don't know'' option is added. Evaluated on four open models (2B-8B parameters) and four benchmarks, Second Guess achieves the highest composite risk improvement of 10.81\%. Notably, it maintains an 8\% composite risk improvement on fine-tuned models where entropy-based methods degrade, and improves most for lower-performing models. All code and results required to reproduce this work is available in https://github.com/Mystic-Slice/second-guess

2605.25384 2026-05-26 cs.CL 版本更新

GeoMathCode: Understanding Interleaved Math-Code Reasoning for Geometry Problem Solving

GeoMathCode: 理解几何问题求解中交织的数学-代码推理

Yingji Zhang, Yong Dai, André Freitas

发表机构 * Idiap Research Institute(Idiap研究 institute) X-Humanoid Department of Computer Science, University of Manchester(曼彻斯特大学计算机科学系) Cancer Biomarker Centre, CRUK Manchester Institute(癌症生物标志物中心,CRUK曼彻斯特研究所)

AI总结 本文提出GeoMathCode,通过程序化表示作为中间视觉输出,分析多模态大模型在几何问题中的推理与代码生成,发现推理与代码步骤在潜在空间可解耦,监督微调使推理流形更结构化,且层次化代码结构包含更多数学符号信息。

详情
AI中文摘要

数学推理是人类智能的标志,需要逻辑演绎、符号操作和抽象思维。最近的多模态大语言模型通过多步推理在几何问题上表现出强大性能。为了更好地模拟人类问题求解,中间步骤可以融入辅助视觉构造,例如额外的线条或点,这改善了几何解释和教育清晰度。在这项工作中,我们引入了GeoMathCode,其中程序化表示作为中间视觉输出。我们进一步对底层推理几何进行了深入分析。实验结果表明,推理和代码生成步骤可以在潜在空间中解耦,而监督微调使推理流形更加结构化和信息丰富。此外,层次化的句法代码结构作为解耦的潜在子空间出现,并且比视觉表示包含更多的数学符号信息。

英文摘要

Mathematical reasoning is a hallmark of human intelligence, requiring logical deduction, symbolic manipulation, and abstract thinking. Recent multimodal large language models (MLLMs) have demonstrated strong performance on geometry problems through multi-step reasoning. To better emulate human problem-solving, intermediate steps can incorporate auxiliary visual constructions, such as additional lines or points, which improve geometric interpretation and educational clarity. In this work, we introduce the GeoMathCode, where programmatic representations serve as intermediate visual outputs. We further conduct an in-depth analysis of the underlying reasoning geometry. Experimental results show that reasoning and code generation steps can be disentangled in the latent space, while supervised fine-tuning (SFT) makes the reasoning manifold more structured and informative. Moreover, hierarchical syntactic code structures emerge as disentangled latent subspaces, and contain more mathematical symbolic information than visual representations.

2605.25379 2026-05-26 cs.CL 版本更新

EfficientGraph-RAG: Structured Retrieval-State Management for Cross-Task Retrieval-Augmented Generation

EfficientGraph-RAG:面向跨任务检索增强生成的结构化检索状态管理

Miaohe Niu, Lianlei Shan, Zhengtao Yu, Jingbo Zhu, Tong Xiao

发表机构 * School of Computer Science and Engineering, Northeastern University, China(东北大学计算机科学与工程学院) Tsinghua University, China(清华大学) Kunming University of Science and Technology, China(昆明理工大学) NiuTrans Research, Shenyang, China(沈阳NiuTrans研究院)

AI总结 提出EfficientGraph-RAG框架,通过显式定义检索状态(TAM、MARS、SMP三个机制)实现结构化状态管理,在多个基准上提升答案质量并降低大模型token消耗。

Comments 19 pages, 5 figures, 14 tables

详情
AI中文摘要

检索增强生成(RAG)已成为将大型语言模型锚定于外部知识的标准方式,但许多系统仍将证据组织为扁平块并通过基本无结构的搜索进行检索。这种弱结构成为复杂检索的瓶颈:系统必须决定搜索位置、如何从粗粒度主题过渡到实体关系证据、哪些证据已被验证以及哪些中间产物可复用。我们将这些中间变量定义为检索状态,并将RAG研究视为结构化状态管理。EfficientGraph-RAG通过三种耦合机制使该状态显式化:TAM定义了证据上的类型化层次状态空间,MARS通过角色专业化代理更新和验证状态,SMP在层次感知访问控制下存储可复用状态。使用一个共享框架配置,EfficientGraph-RAG在三个评估的LongBench检索风格子集上平均报告答案质量指标排名第一,在HotpotQA EM上与最强智能体基线持平,同时将大模型token使用量减少3.51倍,并在检索组织跨模态方法中提供了低token的DocVQA结果。组件分析显示了角色特定机制:MARS是主要答案质量驱动因素,TAM提供类型化遍历状态和自适应路由信号,SMP支持语料库依赖的复用,跨查询缓存命中率范围为3.77%至23.18%。

英文摘要

Retrieval-augmented generation (RAG) has become the standard way to ground large language models in external knowledge, but many systems still organize evidence as flat chunks and retrieve it through largely unstructured search. This weak structure becomes a bottleneck for complex retrieval: the system must decide where to search, how to move from coarse topics to entity-relation evidence, which evidence has been verified, and which intermediate artifacts can be reused. We define these intermediate variables as a retrieval state and study RAG as structured state management. EfficientGraph-RAG makes this state explicit through three coupled mechanisms: TAM defines a typed hierarchical state space over evidence, MARS updates and verifies the state through role-specialized agents, and SMP stores reusable state under hierarchy-aware access control. Using one shared framework configuration, EfficientGraph-RAG ranks first on the reported answer-quality metrics averaged over the three evaluated LongBench retrieval-style subsets, matches the strongest agentic baseline on HotpotQA EM while reducing large-model token usage by $3.51\times$, and provides a low-token DocVQA result among retrieval-organizing cross-modal methods. Component analysis shows role-specific mechanisms: MARS is the main answer-quality driver, TAM supplies the typed traversal state and Adaptive Routing signal, and SMP enables corpus-dependent reuse, with cross-query cache hit rates ranging from 3.77% to 23.18%.

2605.25360 2026-05-26 cs.CL 版本更新

Learning to Route Languages for Multilingual Policy Optimization

学习路由语言以实现多语言策略优化

Geyang Guo, Hiromi Wakaki, Yuki Mitsufuji, Alan Ritter, Wei Xu

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Sony Group Corporation(索尼集团)

AI总结 提出语言路由策略优化(LRPO)框架,将语言作为可选变量,通过在线策略优化和可训练的语言路由器(多臂老虎机)自适应地选择语言,在固定预算下提升多语言训练信号的多样性和信息量,从而显著提高多语言性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在异构多语言语料库上进行训练,然而现有的策略优化方法通常隐式地将每个训练问题限制为单一响应语言,或依赖固定的主导语言进行监督。我们提出了语言路由策略优化(LRPO),这是一种在线策略优化框架,将语言视为可选变量。LRPO为每个训练问题生成多语言展开,并将其相对质量整合到基于偏好的策略更新中,从而在固定展开预算下增加训练信号的多样性和信息量。为了在强化学习过程中自适应地决定探索哪些语言,我们引入了一个可训练的语言路由器,其形式为多臂老虎机,平衡对未充分利用语言的探索与对信息量更大语言的利用。大量实验表明,LRPO持续提升多语言性能,证明自适应语言路由能够有效利用跨语言知识进行训练。我们在https://github.com/Guochry/LRPO 发布所有资源。

英文摘要

Large language models~(LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under the fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation for training. We release all the resources at https://github.com/Guochry/LRPO.

2605.25358 2026-05-26 cs.CL cs.AI cs.CY 版本更新

AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing

AI相关的词汇转变跨越34种语言:新闻写作中的跨语言趋同与历时采纳

Thomas Stephan Juzek

发表机构 * Florida State University(佛罗里达州立大学)

AI总结 通过分析34种语言的新闻语料,使用GPT-4.1续写诊断方法,发现AI过度使用的词汇在跨语言中呈现语义趋同,且ChatGPT发布后这些词汇的使用频率显著增加。

Comments 19 pages (9-page main body, plus references and appendices), 3 figures; ACL ARR reviewed, committed to EMNLP 2026

详情
AI中文摘要

AI相关的词汇转变主要被记录在科学英语中。我们将这项工作扩展到WMT新闻抓取语料库中的34种语言,改进了一种分割-后半部分续写诊断方法,比较GPT-4.1续写与匹配的人类黄金标准文本。对于每种语言,我们使用对数流行率比率推导出排名靠前的AI过度使用词元。我们发现显著的跨语言语义趋同:语义相关的概念在类型多样的语言中反复出现,其中'强调'类动词出现在34种语言中的24种。基于嵌入和人工分析支持这一模式。我们还考察了ChatGPT发布前后新闻写作中的历时采纳情况。追踪每种语言前20个AI过度使用项目,我们发现从2020-2021年到2023-2024年,34种语言中有26种语言的流行率增加,平均变化为+15.1%,而匹配的基线词汇没有显示出可比的增加(-4.5%)。在具有较长历史覆盖的10种语言中,纵向分析显示2022年后的增加超过了早期观察到的适度变化,尽管效应大小小于科学英语。我们广泛验证了我们的方法,包括跨种子、模型变体、数据大小、模型系列等。我们的发现与以下观点一致:AI相关的词汇偏好超越了英语,并可能对全球语言使用施加跨语言同质化压力。

英文摘要

AI-associated lexical shifts have been documented mainly in Scientific English. We extend this work to 34 languages in the WMT News Crawl corpus, refining a split-halves continuation diagnostic that compares GPT-4.1 continuations with matched human gold-standard text. For each language, we derive ranked AI-overused lemmas using log prevalence ratios. We find substantial cross-lingual semantic convergence: semantically related concepts recur across typologically diverse languages, with 'emphasize'-type verbs appearing in 24 of 34 languages. Embedding-based and manual analyses support this pattern. We also examine diachronic uptake in news writing before and after ChatGPT's release. Tracking each language's top 20 AI-overused items, we find prevalence increases in 26 of 34 languages from 2020-2021 to 2023-2024, with a mean change of +15.1%, whilst matched baseline words show no comparable increase (-4.5%). In 10 languages with longer historical coverage, longitudinal analyses show post-2022 increases that exceed the modest shifts observed in earlier periods, though with smaller effect sizes than in Scientific English. We validate our approach extensively, including across seeds, model variants, data sizes, model families, and more. Our findings are consistent with the view that AI-associated lexical preferences extend beyond English and may exert cross-lingual homogenising pressure on global language use.

2605.25344 2026-05-26 cs.CL cs.AI cs.LG quant-ph 版本更新

A general tensor-structured compression scheme for efficient large language models

一种用于高效大语言模型的通用张量结构压缩方案

Ying Lu, Peng-Fei Zhou, Qi-Xuan Fang, Pan Zhang, Shi-Ju Ran, Gang Su

发表机构 * School of Physical Sciences, University of Chinese Academy of Sciences(中国科学院大学物理科学学院) Kavli Institute for Theoretical Sciences, University of Chinese Academy of Sciences(中国科学院大学理论科学研究院) Center for Quantum Physics and Intelligent Sciences, Department of Physics, Capital Normal University(首都师范大学量子物理与智能科学中心) Institute of Theoretical Physics, Chinese Academy of Sciences(中国科学院理论物理研究所)

AI总结 提出张量混合(MixT)方案,通过将密集线性层替换为张量算子混合体,在保持MMLU准确率的同时大幅减少参数、FLOPs和内存。

Comments 12 pages, 4 figures

详情
AI中文摘要

大语言模型(LLMs)主要由密集线性变换主导,其存储、内存和计算开销阻碍了高效的适配和部署,同时掩盖了结构简化对功能的影响。本文提出张量混合(MixT),一种通用的张量结构压缩方案,将目标密集线性层替换为可原生执行的张量算子混合体。MixT直接作用于通用线性投影而非模型特定组件,因此可能适用于基于Transformer的LLMs及其他密集神经映射。我们在统一的恢复协议下对Qwen3-8B和LLaMA2-7B评估MixT,识别出一个广泛的压缩区域,在该区域内MMLU准确率基本保持不变,直到模型特定边界处出现突变。该突变与输出熵、预测熵和层间几何的协同变化同时发生。在LLaMA2-7B的突变边界处,MixT将全模型参数减少47.5%,推理FLOPs减少37.1%,训练FLOPs减少52.1%,峰值推理内存减少60.4%,展示了其在低成本LLM压缩中的实际潜力。

英文摘要

Large language models (LLMs) are dominated by dense linear transformations, whose storage, memory and computational overheads hinder efficient adaptation and deployment while masking the functional impacts of structural simplification. Here we present Tensor Mixture (MixT), a general tensor-structured compression scheme that replaces targeted dense linear layers with natively executable mixtures of tensor operators. Operating directly on generic linear projections instead of model-specific components, MixT is potentially applicable across Transformer-based LLMs and other dense neural mappings. We evaluate MixT on Qwen3-8B and LLaMA2-7B under a unified recovery protocol, identifying a broad compressible regime in which MMLU accuracy is largely preserved before an abrupt transition at model-specific boundaries. This transition coincides with coordinated shifts in output entropy, prediction entropy and inter-layer geometry. At the LLaMA2-7B transition boundary, MixT reduces full-model parameters by 47.5\%, inference FLOPs by 37.1\%, training FLOPs by 52.1\% and peak inference memory by 60.4\%, demonstrating its practical potential for lower-cost LLM compression.

2605.25342 2026-05-26 cs.CL 版本更新

MATO: Multi-objective Personalized Alignment with Test-time Optimization for Large Language Models

MATO: 面向大语言模型的多目标个性化对齐与测试时优化

Linhao Luo, Thuy-Trang Vu, Van-Anh Nguyen, Junae Kim, Gholamreza Haffari, Dinh Phung

发表机构 * Monash University(墨尔本大学) Defence Science and Technology Group, Australia(澳大利亚国防科学与技术集团)

AI总结 提出MATO框架,通过测试时优化在解码过程中动态调整多目标权重,无需训练或外部奖励模型,实现大语言模型与用户多样化偏好的对齐。

Comments Preprint

详情
AI中文摘要

将大语言模型与多样且多方面的用户偏好对齐是个性化AI系统的基本挑战。现有的多目标对齐方法要么依赖昂贵的训练,要么需要为每个偏好预训练奖励模型,这使得它们难以适应不断变化的偏好。基于提示的个性化提供了一种无需训练的替代方案,但仅靠提示通常提供有限的可操控性,因为大语言模型可能过度强调或忽略某些偏好,并且在冲突出现时无法让用户可靠地控制不同目标的相对重要性,导致对齐效果欠佳。在本文中,我们介绍了MATO,一种无需训练的多目标个性化对齐与测试时优化框架。MATO将个性化表述为一个测试时优化问题,在解码过程中通过可控权重引导多个目标的相对重要性,无需修改模型参数或需要外部奖励模型。具体来说,奖励发现模块直接从骨干大语言模型中恢复针对自然语言指定的多种目标的偏好奖励,而权重优化模块根据用户的初始偏好和部分生成的响应动态调整目标权重,以在生成过程中平衡相互竞争的目标。得到的奖励和权重共同指导对令牌分布的在线优化过程,从而更好地与目标对齐。在多个数据集和骨干大语言模型上的大量实验表明,MATO始终优于强基线,实现了帕累托改进的多目标对齐和更强的可操控性。这些结果凸显了测试时优化作为可扩展、可控且模型无关的个性化对齐的一个有前景的方向。

英文摘要

Aligning large language models (LLMs) with diverse and multifaceted user preferences is a fundamental challenge in personalized AI systems. Existing multi-objective alignment methods either rely on costly training or require pre-trained reward models for each preference, making it difficult for them to adapt to evolving preferences. Prompt-based personalization offers a training-free alternative, but prompting alone often provides limited steerability, as LLMs may overemphasize or overlook certain preferences and fail to give users reliable control over the relative importance of different objectives when conflicts arise, leading to suboptimal alignment. In this paper, we introduce MATO, a training-free framework for Multi-objective personalized Alignment with Test-time Optimization. MATO formulates personalization as a test-time optimization problem that steers the relative importance of multiple objectives through controllable weights during decoding, without modifying model parameters or requiring external reward models. Specifically, a reward discovery module recovers preference rewards directly from the backbone LLM for diverse objectives specified in natural language, while a weight optimization module dynamically adjusts objective weights based on the user's initial preferences and the partially generated response to balance competing objectives during generation. The resulting rewards and weights jointly guide an online optimization procedure over the token distribution, enabling better alignment with the target objectives. Extensive experiments across multiple datasets and backbone LLMs show that MATO consistently outperforms strong baselines, achieving Pareto-improving multi-objective alignment and stronger steerability. These results highlight test-time optimization as a promising direction for scalable, controllable, and model-agnostic personalized alignment.

2605.25310 2026-05-26 cs.CL 版本更新

Tool-Call Dependency Structure is Linearly Decodable in LLM Agent Residual Streams

工具调用依赖结构在LLM智能体残差流中是线性可解码的

Tianda Sun, Dimitar Kazakov

发表机构 * University of York(约克大学) Department of Computer Science(计算机科学系) Heslington, York(约克大学赫斯林顿校区)

AI总结 本研究通过低容量边探针在Qwen3-32B残差流中解码工具调用依赖图,发现该表示追踪抽象拓扑而非标识符值,且在不同模型和任务中可复制。

Comments 16 pages, 7 figures

详情
AI中文摘要

使用工具的LLM智能体产生的轨迹中,调用形成有向依赖图:早期工具输出为后续调用提供参数。这种执行结构是否在模型内部表示尚不清楚;先前的结构探针针对静态代码或思维链文本,而非智能体的运行时调用图。在Qwen3-32B残差流上的低容量边探针解码工具调用依赖图,显著高于Hewitt-Liang随机标签控制和位置基线。反事实对比(值破坏与结构扰动)表明信号追踪抽象拓扑而非标识符值,并在独立的非子串预言机下可复制。非位置成分在另外三个交互式多跳基准上可复制,并在调用顺序本身成为依赖的充分代理时衰减,在单次规划中消失。逐层激活修补在后续非修补边界移动探针,表明表示传播而非被动读出,尽管实际工具调用未移动。据我们所知,这是首个对LLM智能体运行时工具调用依赖图的结构探针。我们的主张涉及表示而非行为控制,涵盖两个模型系列和一个主要领域。

英文摘要

Tool-using LLM agents produce trajectories whose calls form a directed dependency graph: earlier tool outputs supply arguments to later calls. Whether this execution structure is represented inside the model is unknown; prior structural probes have targeted static code or chain-of-thought text, not an agent's run-time call graph. A low-capacity edge probe on the residual stream of Qwen3-32B decodes the tool-call dependency graph well above both a Hewitt--Liang random-label control and a positional baseline. A counterfactual contrast between value corruption and structural perturbation indicates the signal tracks abstract topology rather than identifier values, and replicates under an independent, non-substring oracle. The non-positional component replicates on three further interactive multi-hop benchmarks and attenuates as call order alone becomes a sufficient proxy for dependency, vanishing in single-shot planning. Per-layer activation patching shifts the probe at a later, non-patched boundary, evidence that the representation propagates rather than passively reads out, though the realised tool call does not move. To our knowledge this is the first structural probe of an LLM agent's runtime tool-call dependency graph. Our claims concern representation, not behavioural control, and span two model families and one primary domain.

2605.25284 2026-05-26 cs.CL 版本更新

Knowing but Not Showing: LLMs Recognize Ambiguity but Rarely Ask Clarifying Questions

知道但不展示:LLMs 识别歧义但很少提出澄清问题

Jinyan Su, Claire Cardie

发表机构 * Cornell University(康奈尔大学)

AI总结 研究大型语言模型在识别用户查询歧义与主动提出澄清问题之间的行为差距,发现模型虽能识别歧义但默认直接回答,检索上下文会进一步减少澄清行为。

详情
AI中文摘要

用户查询通常不明确,可能允许多种有效解释。一个有用的助手不应默默假设用户意图,而应通过提出澄清问题来揭示这种歧义。这需要两种能力:识别查询存在歧义,并基于该识别采取行动(寻求澄清而非直接回答)。为了研究这些能力,我们在三种设置下评估模型对歧义、无歧义和消歧问题的表现:标准问答、显式歧义判断和行为分析(其中评判模型将响应分类为直接回答、拒绝或澄清问题)。我们发现识别与行为之间存在明显差距:当被明确要求判断时,模型通常能识别歧义,但在问答设置中,它们绝大多数默认直接回答。检索上下文通过提高可回答性进一步扩大了这一差距,使模型更不可能提出澄清问题。

英文摘要

User queries are often underspecified and may admit multiple valid interpretations. Rather than silently making assumptions about the user's intent, a helpful assistant should surface such ambiguity by asking a clarifying question. Doing so requires two abilities: recognizing that a query is ambiguous, and acting on that recognition by seeking clarification instead of answering directly. To study these abilities, we evaluate models on ambiguous, unambiguous, and disambiguated questions in three settings: standard question answering, explicit ambiguity judgment, and behavioral analysis, where a judge model classifies responses as direct answers, refusals, or clarifying questions. We find a clear gap between recognition and behavior: models often identify ambiguity when explicitly asked to judge it, yet in the QA setting they overwhelmingly default to direct answers. Retrieved context further widens this gap by improving answerability while making models even less likely to ask clarifying questions.

2605.25263 2026-05-26 cs.CL cs.AI 版本更新

Mimir: Large-scale Multilingual Concept Modeling

Mimir:大规模多语言概念建模

Elio Musacchio, Lucia Siciliani, Pierpaolo Basile

发表机构 * Department of Computer Science(计算机科学系) University of Bari Aldo Moro(巴里阿尔多·莫罗大学)

AI总结 提出Mimir,一个1.6B参数的大规模概念模型,通过多语言预训练和指令微调实现概念级别的理解与生成,替代传统的token预测范式。

详情
AI中文摘要

当前的语言建模方法围绕token构建。文本语料被分割成token,模型通过对这些token进行计算来训练,例如根据前文预测下一个token。这一范式已成为现代语言建模的标准,尤其是基于token的架构取得了卓越性能。然而,最近的研究不仅开始质疑语言模型如何从token中处理和理解意义,还开始质疑使用更高级别的粒度是否能推动研究领域的发展。这引出了概念建模的想法,即直接训练模型进行下一个概念预测,而非下一个token预测。目标是输入从token转变为概念,迫使底层语言模型将其粒度从细粒度的token转变为广泛的概念。在这项工作中,我们介绍了Mimir,一个1.6B参数的大规模概念模型,用于多语言概念理解和生成。我们利用了一个大规模多语言预训练语料库(38,883,987,240个句子),涵盖46种语言,以及一个大规模多轮多语言指令微调数据集(66,816,428个句子),覆盖总共35种语言。我们针对一个参数数量相当的语言模型,对模型性能进行了广泛评估。

英文摘要

Current language modeling approaches are built around tokens. Text corpora are split into tokens, and models are trained by performing computations on these tokens, such as predicting the next token given the preceding ones as context. This paradigm has become the standard in modern language modeling, especially given the outstanding performance obtained by token-based architectures. However, recent works have not only begun to question how language models process and understand meaning from tokens, but also to question whether using higher levels of granularity could advance the research field. This led to the idea of Concept Modeling, that is, to directly train models for next-concept prediction rather than next-token prediction. The goal is to change the input from tokens to concepts, forcing the underlying language model to shift its granularity from fine-grained tokens to broad concepts. In this work, we introduce Mimir, a 1.6B Large Concept Model trained for multilingual concept understanding and generation. We leverage a large-scale multilingual pre-training corpus (38,883,987,240 sentences) spanning 46 languages and a large-scale multi-turn and multilingual instruction-tuning dataset (66,816,428 sentences) covering a total of 35 languages. We extensively evaluate model performance against a language model with a comparable number of parameters.

2605.23491 2026-05-26 cs.LG cs.AI cs.CL 版本更新

CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

CoSPlay: 测试时协作自我博弈与自生成代码和单元测试

Zhangyi Hu, Chenhui Liu, Tian Huang, Jindong Li, Yang Yang, Jiemin Wu, Zining Zhong, Menglin Yang, Yutao Yue

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Institute of Deep Perception Technology, JITRI, Wuxi, China(深度感知技术研究院,无锡,中国)

AI总结 提出CoSPlay框架,通过代码与单元测试的协作自我博弈,在无真实单元测试的情况下迭代优化两者,显著提升代码生成性能。

Comments Code is available at: https://github.com/sanae-ai/CosPlay | Data & log is available at: https://huggingface.co/datasets/yomi017/CosPlay

详情
AI中文摘要

最近,可验证奖励强化学习(RLVR)和测试时扩展(TTS)通过可执行验证推动了LLM代码生成的发展。然而,真实单元测试(GT UTs)仍然是瓶颈:最先进的RLVR方法需要它们进行昂贵的训练,而现有的TTS方法在没有它们的情况下会失去竞争力。这促使了无GT的TTS,其中现有方法直接使用自生成的UT来优化和选择代码候选。然而,这些UT通常带有噪声或与错误代码虚假耦合,而UT质量在没有可靠代码的情况下也无法验证。因此,关键挑战是同时改进两者。为此,我们提出了CoSPlay,一个无GT、无需训练的框架,通过协作自我博弈同时改进代码和UT。它首先探索多样化的解决方案思路,识别其潜在失败模式以生成有区分力的UT思路。然后,它利用代码-UT执行矩阵中的双向通过计数信号,迭代地修剪或修复弱代码,并刷新或替换不可靠的UT,使两个池共同进化。最后,当多个代码在最高通过计数上并列时,它从最大的输出共识簇中选择最终代码,因为正确的代码在相同输入上一致,而错误的代码则发散。在四个具有挑战性的基准上的实验表明,CoSPlay在Qwen2.5-7B-Instruct上将平均BoN从22.1%提升到33.2%,UT准确率从14.6%提升到78.3%,匹配或超越了RLVR模型CURE-7B。当应用于CURE-7B时,它进一步将BoN提高了5.7%。CoSPlay还能跨不同骨干网络泛化,并在相当的token预算下优于无GT的TTS基线,且随着预算增加持续获益。这些结果表明,无需任何GT数据即可实现竞争性代码生成的可扩展推理策略。

英文摘要

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.

2605.23454 2026-05-26 cs.CL 版本更新

ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning

ARES: 面向可扩展大语言模型强化学习的自动评分标准合成

Xiaoyuan Li, Keqin Bao, Moxin Li, Yubo Ma, Yichang Zhang, Wenjie Wang, Fuli Feng, Dayiheng Liu

发表机构 * University of Science and Technology of China(中国科学技术大学) Alibaba Group(阿里巴巴集团) National University of Singapore(新加坡国立大学)

AI总结 提出ARES框架,从原始预训练文档自动生成问答对和实例级加权评分标准,用于可扩展的基于评分标准的强化学习,在多个开放任务上超越持续预训练、监督微调和二元奖励强化学习。

Comments Under Review

详情
AI中文摘要

基于评分标准的奖励为将强化学习扩展到大型语言模型提供了一种有前景的方式,超越了具有自动可验证答案的任务。然而,扩展基于评分标准的强化学习仍然具有挑战性:现有方法通常依赖专家编写的评分标准和手动构建的问题集,而固定的任务级评分标准可能无法捕捉单个问题的评估需求。我们提出ARES(面向可扩展强化学习的自动评分标准合成),一个自动构建基于评分标准的强化学习数据的框架。从原始预训练文档开始,ARES将源知识转换为自包含的问答对,并共同生成特定问题的加权评分标准,从而为开放式回答提供实例级奖励监督。为了提高多样性和质量,ARES基于领域标签和人物角色信息生成,并应用验证过滤器以确保问题自包含性、答案忠实性和评分标准有效性。使用ARES,我们在十个领域构建了10万个评分标准标注的实例。在七个基准上的实验表明,使用ARES训练的基于评分标准的强化学习优于持续预训练、监督微调和二元奖励强化学习,在医疗和指令遵循等多维开放任务上提升最大。

英文摘要

Rubric-based rewards offer a promising way to extend reinforcement learning (RL) for large language models beyond tasks with automatically verifiable answers. However, scaling rubric-based RL remains challenging: existing approaches often rely on expert-written rubrics and manually constructed question sets, while fixed task-level rubrics may fail to capture the evaluation requirements of individual questions. We propose ARES (Automated Rubric synthEsis for Scalable RL), a framework for automatically constructing rubric-based RL data at scale. Starting from raw pretraining documents, ARES converts source knowledge into self-contained question-answer pairs and co-generates question-specific weighted rubrics, enabling instance-level reward supervision for open-ended responses. To improve diversity and quality, ARES conditions generation on domain labels and persona information, and applies validation filters for question self-containment, answer faithfulness, and rubric validity. Using ARES, we construct 100K rubric-annotated instances across ten domains. Experiments on seven benchmarks show that rubric-based RL trained with ARES, outperforms continual pretraining, supervised fine-tuning, and binary-reward RL, with the largest gains on multi-dimensional open-ended tasks such as healthcare and instruction following.

2605.23163 2026-05-26 cs.CL 版本更新

Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

Fast-dDrive:面向自动驾驶的高效块扩散视觉语言模型

Kewei Zhang, Jin Wang, Sensen Gao, Chengyue Wu, Yulong Cao, Songyang Han, Boris Ivanovic, Langechuan Liu, Marco Pavone, Song Han, Daquan Zhou, Enze Xie

发表机构 * Peking University(北京大学) NVIDIA The University of Hong Kong(香港大学) MIT(麻省理工学院)

AI总结 提出Fast-dDrive,一种块扩散视觉语言动作模型,通过语义单元内双向细化与跨单元因果约束,结合结构化令牌冻结、分段感知训练和推测解码,实现高保真轨迹规划与高效推理,在WOD-E2E和nuScenes上达到最优性能,推理速度提升12倍。

详情
AI中文摘要

通过视觉-语言-动作(VLA)模型实现的端到端自动驾驶需要在高保真轨迹规划与高效推理之间取得不稳定的平衡。现有范式通常存在不足:自回归(AR)VLA在边缘硬件上受限于内存带宽,且容易产生曝光偏差漂移;而全序列扩散模型无法复用KV缓存,并遭受违反基本感知-规划因果关系的“逻辑泄漏”。我们提出Fast-dDrive,一种块扩散VLA,它在语义单元内执行双向细化,同时强制跨单元严格因果排序。利用驾驶VLA通常输出结构化JSON式输出的观察,Fast-dDrive将结构令牌冻结为节支架,并采用节感知训练策略,优先考虑安全关键规划。我们进一步引入支架推测解码,以显著更高的吞吐量实现AR等效质量。最后,我们提出一种低开销的测试时缩放方案:通过从单个共享前缀KV缓存分叉出N个随机轨迹展开并取平均,以极小的计算成本有效抑制预测方差。实验结果表明,Fast-dDrive重新定义了驾驶智能体的速度-精度边界。在WOD-E2E测试集上,Fast-dDrive在3秒和5秒平均位移误差(ADE)上达到最优,同时在基于扩散的VLA中具有最高的RFS;在nuScenes上,它将平均L2误差降至0.32米(提升22%)。当与SGLang集成时,我们的框架相比AR基线实现了12倍的吞吐量提升,缩小了高容量VLA与实时车载部署效率需求之间的差距。

英文摘要

End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a precarious balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking $N$ stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we effectively suppress prediction variance at a fractional computational cost. Empirical results demonstrate that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves SOTA ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to $0.32$m (a $22\%$ improvement). When integrated with SGLang, our framework delivers $12\times$ throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.

2605.23148 2026-05-26 cs.CL cs.CY 版本更新

When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

当症状不足时:大语言模型精神科筛查中的证据加权模式

Jianfeng Zhu, Megan Korhummel, Ruoming Jin, Karin G. Coifman

发表机构 * Departments of Computer Science1 and Psychological Science2(计算机科学系和心理学系)

AI总结 本研究引入SCID锚定基准,评估五个大语言模型在精神科筛查中的表现,发现模型在焦虑症、抑郁症和创伤后应激障碍分类中,当存在功能保留或保护性背景时倾向于低估症状证据,导致假阴性错误。

Comments 25 pages 7 figures

详情
AI中文摘要

随着心理健康护理需求超过临床医生提供的评估,对可扩展筛查工具的需求日益增加。大语言模型(LLMs)可能从患者叙述中识别精神科风险,但其在不同诊断、人口统计亚组和证据使用模式中的可靠性仍不确定。我们引入了一个基于SCID的基准,包含555个半结构化体验访谈,并配有焦虑症、重度抑郁症、创伤后应激障碍和任何当前心理健康障碍的诊断参考标签。使用零样本任务特定提示,我们评估了五个最先进的LLM,并检查假阴性错误是否反映了遗漏的精神科证据或对症状、功能损害和保护性背景线索的差异化加权。不同任务和模型的表现各异,准确率从0.49到0.86,马修斯相关系数从0.16到0.38。GPT-4.1 Mini和GPT-5 Mini显示出最一致的疾病特异性准确率。亚组分析发现,男性参与者的抑郁症分类准确率高于女性,没有一致的年龄相关模式,种族阶层间存在适度的非均匀变异。证据整合分析显示,假阴性的焦虑症和PTSD分类通常包含明确的症状证据,但伴有功能保留、应对能力或社会支持。功能损害证据使模型输出偏向阳性分类,而保护性背景证据则使输出偏离。这些发现表明,LLMs可能支持可扩展的精神科筛查,但它们在功能保留或保护性背景下低估症状证据的倾向需要在临床部署前进行仔细验证。

英文摘要

As demand for mental health care outpaces clinician-delivered assessment, scalable screening tools are increasingly needed. Large language models (LLMs) may identify psychiatric risk from patient narratives, but their reliability across diagnoses, demographic subgroups, and evidence-use patterns remains uncertain. We introduce a SCID-anchored benchmark of 555 semi-structured experiential interviews paired with diagnostic reference labels for anxiety disorder, major depressive disorder, post-traumatic stress disorder, and any current mental health disorder. Using zero-shot task-specific prompting, we evaluated five state-of-the-art LLMs and examined whether false-negative errors reflected missed psychiatric evidence or differential weighting of symptom, functional-impairment, and protective-context cues. Performance varied across tasks and models, with accuracy ranging from 0.49 to 0.86 and Matthews correlation coefficients from 0.16 to 0.38. GPT-4.1 Mini and GPT-5 Mini showed the most consistent disorder-specific accuracy. Subgroup analyses found higher depression-classification accuracy among male than female participants, no consistent age-related pattern, and modest non-uniform variation across race strata. Evidence-integration analyses showed that false-negative anxiety and PTSD classifications often contained explicit symptom evidence but were accompanied by preserved functioning, coping ability, or social support. Functional-impairment evidence shifted model outputs toward positive classifications, whereas protective-context evidence shifted outputs away. These findings suggest that LLMs may support scalable psychiatric screening, but their tendency to discount symptom evidence in the presence of preserved functioning or protective context requires careful validation before clinical deployment.

2605.22769 2026-05-26 cs.CL cs.AI 版本更新

Understanding Data Temporality Impact on Large Language Models Pre-training

理解数据时间性对大型语言模型预训练的影响

Hippolyte Pilchen, Romain Fabre, Franck Signe Talla, Patrick Perez, Edouard Grave

发表机构 * Kyutai

AI总结 研究预训练数据顺序对大型语言模型获取时间敏感事实知识的影响,通过构建包含7000多个时间相关问题的基准并训练60亿参数模型,发现按时间顺序训练比随机打乱训练能产生更及时和精确的知识。

详情
AI中文摘要

大型语言模型(LLMs)通常在打乱顺序的语料库上进行训练,导致模型的知识在训练时被冻结,其时间基础仍然难以理解。在这项工作中,我们研究了预训练动态对获取时间敏感事实知识的影响,特别关注数据顺序。我们的主要贡献有两方面。首先,我们引入了一个包含7000多个时间基础问题的综合基准和一个评估协议,能够分析模型是否将事实与其对应的时间段正确关联。其次,我们在按时间顺序排列的Common Crawl快照上预训练了60亿参数的模型,并将其与标准的随机打乱预训练进行比较。我们的结果表明,按顺序训练的模型在通用语言理解和常识方面与随机打乱的基线相当,同时始终表现出更及时和精确的时间知识。按时间顺序的预训练提高了事实的新鲜度,而随机打乱的预训练在较旧的数据上表现更好,可能是由于事实重复增加。这些发现,连同我们在https://github.com/kyutai-labs/kairos 发布的代码、在https://huggingface.co/collections/kyutai/kairos 发布的检查点和数据集,为LLMs的持续学习未来研究提供了基础。

英文摘要

Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pre-training yields improved factual freshness, while shuffled pre-training peaks on older data, possibly due to increased factual repetition. These findings, along with the release of our code at https://github.com/kyutai-labs/kairos , checkpoints, and datasets at https://huggingface.co/collections/kyutai/kairos provide a foundation for future research on continual learning for LLMs.

2605.22137 2026-05-26 cs.CL 版本更新

Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency

跨语言共识:通过多语言自一致性对齐多语言文化知识

Andrew Ivan Soegeng, Patrick Sutanto, Tan Sang Nguyen

发表机构 * SAP School of Computing, National University of Singapore(国立新加坡大学计算机学院)

AI总结 提出一种自监督框架,利用多语言自一致性和自我批评机制,从本地语言表示中提取文化知识并迁移到英语,以缩小跨语言文化知识差距,在BLEnD基准上平均提升英语查询性能5.03%。

Comments Accepted to The 1st Workshop on Multilinguality in the Era of Large Language Models

详情
AI中文摘要

尽管大型语言模型(LLMs)在各种任务中展现出强大的能力,但它们在不同语言之间表现出显著的性能差异。虽然用英语提示LLMs通常能获得最高的通用性能,但这往往会导致以西方为中心的偏见,阻碍模型准确反映多样化的文化知识。我们假设LLMs已经拥有嵌入在本地语言表示中的丰富文化知识,但在用英语提示时无法检索到这些知识。为了弥合这一跨语言知识差距,我们提出了一种新颖的自监督框架。我们的方法利用多语言自一致性来识别跨语言中最可靠的文化响应,并结合自我批评机制将这些知识转移到较弱的语言中。在BLEnD基准上的评估表明,我们的方法显著改善了文化对齐——在英语查询上平均提升5.03%——完全依赖于自生成数据。最终,我们的工作表明,潜在的文化知识可以成功地在语言之间浮现和传播,从而实现更具文化公平性和一致性的LLMs。

英文摘要

Although Large Language Models (LLMs) demonstrate strong capabilities across various tasks, they exhibit significant performance discrepancies across languages. While prompting LLMs in English typically yields the highest general performance, it often induces a Western-centric bias, hindering the model's ability to accurately reflect diverse cultural knowledge. We hypothesize that LLMs already possess rich cultural knowledge embedded within local-language representations, but fail to retrieve it when prompted in English. To bridge this cross-lingual knowledge gap, we propose a novel self-supervised framework. Our method leverages multilingual self-consistency to identify the most reliable cultural responses across languages, combined with a self-critique mechanism to transfer this knowledge to the weaker language. Evaluations on the BLEnD benchmark demonstrate that our approach significantly improves cultural alignment-boosting performance on English queries by an average of 5.03%-relying entirely on self-generated data. Ultimately, our work demonstrates that latent cultural knowledge can be successfully surfaced and propagated across languages, enabling more culturally equitable and consistent LLMs.

2605.22064 2026-05-26 cs.CL 版本更新

Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild

Hy-MT2:面向复杂真实场景的快速、高效且强大的多语言翻译模型系列

Mao Zheng, Zheng Li, Tao Chen, Bo Lv, Mingrui Sun, Mingyang Song, Jinlong Song, Hong Huang, Decheng Wu, Hai Wang, Yifan Song, Yanfeng Chen, Guanwei Zhang

发表机构 * Tencent Hunyuan Team(腾讯文言团队)

AI总结 本文提出Hy-MT2系列多语言翻译模型,通过三种规模(1.8B、7B、30B-A3B MoE)支持33种语言翻译,在通用、商业、领域和指令跟随任务上超越开源模型和商业API,并实现轻量级设备端部署。

详情
AI中文摘要

Hy-MT2是一系列面向复杂真实场景的快速思考多语言翻译模型。它包括三种模型规模:1.8B、7B和30B-A3B(MoE),均支持33种语言之间的翻译,并能有效遵循多种语言的翻译指令。多维度评估表明,Hy-MT2在通用、真实世界业务、领域特定和指令跟随翻译任务中均表现出色。7B和30B模型在快速思考模式下超越了DeepSeek-V4-Pro和Kimi K2.6等开源模型,而轻量级的1.8B模型在整体性能上也超越了微软、豆包等提供商的主流商业API。此外,当与AngelSlim的1.25位极端量化结合用于设备端部署时,轻量级1.8B模型仅需440 MB存储空间,并实现了1.5倍的推理加速。

英文摘要

Hy-MT2 is a family of fast-thinking multilingual translation models designed for complex real-world scenarios. It includes three model sizes: 1.8B, 7B, and 30B-A3B (MoE), all of which support translation among 33 languages and effectively follow translation instructions in multiple languages. Multi-dimensional evaluations show that Hy-MT2 delivers outstanding performance across general, real-world business, domain-specific, and instruction-following translation tasks. The 7B and 30B models outperform open-source models such as DeepSeek-V4-Pro and Kimi K2.6 in fast-thinking mode, while the lightweight 1.8B model also surpasses mainstream commercial APIs from providers such as Microsoft and Doubao overall. Moreover, when paired with AngelSlim's 1.25-bit extreme quantization for on-device deployment, the lightweight 1.8B model requires only 440 MB of storage and achieves a 1.5x inference speedup.

2605.20761 2026-05-26 cs.CL 版本更新

Findings of the Counter Turing Test: AI-Generated Text Detection

反图灵测试的发现:AI生成文本检测

Rajarshi Roy, Gurpreet Singh, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

发表机构 * Kalyani Government Engineering College(卡利尼政府工程学院) IIIT Delhi(德里IIIT) BITS Pilani Hyderabad Campus(比斯汉学院海得拉巴校区) AI Institute, University of South Carolina(南卡罗来纳大学人工智能研究所) IIIT Guwahati(古瓦哈提IIIT) NIT Silchar(西里char理工学院) San José State University(圣何塞州立大学) UCLA(加州大学洛杉矶分校) Washington State University(华盛顿州立大学) Vishwakarma Institute of Information Technology(维斯瓦卡马信息科技学院) Meta AI(Meta人工智能) Amazon AI(亚马逊人工智能) BITS Pilani Goa(比斯汉学院果阿)

AI总结 本文通过反图灵测试(CT2)共享任务,评估了AI生成文本检测技术的有效性,发现二分类任务表现优异(F1=1.0000),但模型归因任务更具挑战性(最佳F1=0.9531),并分析了微调Transformer、集成学习等方法的优劣。

Comments Defactify4 @AAAI 2025

详情
AI中文摘要

大型语言模型生成流畅、上下文连贯文本的能力不断增强,给负责确保数字内容真实性的系统和机构带来了越来越大的压力。先进的生成模型如GPT-4、Claude 3.5和Llama能够生成高度连贯且类似人类的文本,使得区分人类撰写和AI生成的内容变得越来越困难。虽然这些模型具有变革性的应用,但它们的滥用引发了关于错误信息、偏见叙事和安全威胁的担忧。 本文对最先进的AI生成文本检测技术进行了全面分析,并通过反图灵测试(CT2)共享任务评估了其有效性。任务A(二分类)要求参与者区分人类撰写和AI生成的文本,而任务B(模型归因)则专注于识别生成给定文本的具体语言模型。结果显示,二分类性能较高,最佳系统F1得分为1.0000,但模型归因得分显著较低,最佳系统仅为0.9531,凸显了该任务的复杂性。 表现最佳的团队利用了微调Transformer模型、集成学习和混合检测方法,其中基于DeBERTa和BART的方法表现出色。然而,任务B的较低得分强调了区分不同LLM输出的挑战,需要进一步研究对抗鲁棒性、特征提取和跨领域泛化。

英文摘要

The growing capability of large language models to produce fluent, contextually coherent text has created mounting pressure on the systems and institutions responsible for ensuring the authenticity of digital content. Advanced generative models such as GPT-4, Claude 3.5, and Llama can produce highly coherent and human-like text, making it increasingly difficult to differentiate between human-written and AI-generated content. While these models have transformative applications, their misuse has raised concerns about misinformation, biased narratives, and security threats. This paper provides a comprehensive analysis of state-of-the-art AI-generated text detection techniques and evaluates their effectiveness through the Counter Turing Test (CT2) shared tasks. Task A (Binary Classification) required participants to distinguish between human-written and AI-generated text, while Task B (Model Attribution) focused on identifying the specific language model responsible for generating a given text. The results demonstrated high performance in binary classification, with the top system achieving an F1 score of 1.0000, but significantly lower scores in model attribution, where the best system achieved 0.9531, highlighting the increased complexity of this task. The top-performing teams leveraged fine-tuned transformer models, ensemble learning, and hybrid detection approaches, with DeBERTa-based and BART-based methods demonstrating strong results. However, the lower scores in Task B underscore the challenges of distinguishing outputs from different LLMs, necessitating further research into adversarial robustness, feature extraction, and cross-domain generalization.

2605.18746 2026-05-26 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

ESI-Bench: 迈向闭环感知-动作的具身空间智能

Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas, Li Fei-Fei, Jiajun Wu, Yejin Choi

发表机构 * Stanford University(斯坦福大学) UCLA(加州大学洛杉矶分校) Northwestern University(西北大学)

AI总结 提出ESI-BENCH基准,通过主动探索(感知、移动、操作)在OmniGibson环境中评估具身空间智能,发现主动探索显著优于被动方法,失败主因是动作盲视而非感知弱,且模型存在元认知差距。

Comments https://esi-bench.github.io/

详情
AI中文摘要

空间智能通过感知-动作循环展开:智能体通过行动获取观察,并推理观察如何随动作变化。它们不是被动处理所见,而是主动揭示未见——遮挡结构、动态、包含关系和功能,这些无法仅通过被动感知解决。我们超越先前假设神谕观察的空间智能表述,将观察者重新定义为行动者。我们引入ESI-BENCH,一个基于OmniGibson、扎根于Spelke核心知识系统的全面具身空间智能基准,涵盖10个任务类别和29个子类别。智能体必须决定部署哪些能力——感知、移动和操作——以及如何排序以主动积累任务相关证据。我们对最先进的MLLM进行大量实验,发现主动探索显著优于被动对应物,智能体自发发现涌现的空间策略而无需明确指令,而随机多视角往往增加噪声而非信号,尽管消耗更多图像。大多数失败并非源于感知弱,而是动作盲视:糟糕的动作选择导致糟糕的观察,进而引发级联错误。虽然显式3D基础稳定了深度敏感任务的推理,但不完美的3D表示通过扭曲空间关系证明比2D基线更有害。人类研究进一步揭示,与寻求证伪视角并在矛盾下修正信念的人类不同,模型无论证据质量如何都过早且高置信度地承诺,暴露了一个既不能通过更好感知也不能通过更多具身互动单独闭合的元认知差距。

英文摘要

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.

2605.17937 2026-05-26 cs.CL cs.AI 版本更新

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

BacktestBench:面向自动化量化策略回测的大语言模型基准测试

Zhensheng Wang, Wenmian Yang, Qingtai Wu, Lequan Ma, Yiquan Zhang, Weijia Jia

发表机构 * Beijing Normal University(北京师范大学) Elmleaf Ltd.(Elmleaf公司)

AI总结 提出首个大规模自动化量化回测基准BacktestBench,包含18,246个问答对,并设计多智能体基线AutoBacktest,通过协调摘要器、检索器和编码器实现自然语言策略到可重复回测的转换。

Comments This paper has been accepted by KDD 2026 (Datasets and Benchmarks Track)

详情
AI中文摘要

量化回测对于评估交易策略至关重要,但仍受到高技术门槛和有限可扩展性的阻碍。虽然大语言模型(LLMs)通过先进的代码生成、工具使用和智能体规划为自动化这一复杂的跨学科工作流程提供了变革性路径,但实际实现因当前缺乏专门用于自动化量化回测的大规模基准而面临重大挑战,这阻碍了该领域的进展。为弥补这一关键差距,我们引入了BacktestBench,这是首个用于自动化量化回测的大规模基准。它基于超过600万条真实市场记录构建,包含18,246个精心标注的问答对,涵盖四个任务类别:指标计算、股票选择、策略选择和参数确认。我们还提出了AutoBacktest,一个稳健的多智能体基线,通过协调摘要器进行语义因子提取、检索器进行验证的SQL生成以及编码器进行Python回测实现,将自然语言策略转化为可重复的回测。我们对23个主流LLM的评估,辅以有针对性的消融实验,识别了影响端到端性能的关键因素,并强调了基于事实的验证和标准化指标表示的重要性。

英文摘要

Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. While Large Language Models (LLMs) offer a transformative path to automate this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning, the practical realization is significantly challenged by the current lack of a large-scale benchmark dedicated to automated quantitative backtesting, which hinders progress in this field. To bridge this critical gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi-agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation on 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end-to-end performance and highlights the importance of grounded verification and standardized indicator representations.

2605.16953 2026-05-26 cs.AI cs.CL 版本更新

How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

人类如何处理AI生成的幻觉内容:一项神经影像学研究

Shuqi Zhu, Yi Zhong, Ziyi Ye, Bangde Du, Yujia Zhou, Qingyao Ai, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University, Beijing, China(清华大学计算机科学与技术系) Institute of Trustworthy Embodied AI, Fudan University, Shanghai, China(复旦大学可信具身人工智能研究院)

AI总结 通过EEG实验,研究人类在处理多模态大语言模型生成的幻觉与非幻觉内容时的神经动力学差异,揭示误判的幻觉内容未能触发标准神经认知事实验证通路。

详情
AI中文摘要

尽管AI生成的幻觉带来了相当大的风险,但人类能够成功识别或被这些幻觉误导的潜在认知机制仍不清楚。为了解决这个问题,本文探索了人类的神经动力学,以表征大脑如何处理幻觉内容。我们记录了27名参与者在执行验证任务时的EEG信号,该任务要求判断由多模态大语言模型(MLLM)生成的图像描述的正确性。基于平均事件相关电位(ERP)研究,我们揭示了多种认知过程,例如语义整合、推理处理、记忆检索和认知负荷,在处理幻觉与非幻觉内容时表现出不同的模式。值得注意的是,人类参与者误判与正确判断的幻觉的神经反应显示出显著差异。这表明,被误判的AI生成幻觉未能触发标准的神经认知事实验证通路。

英文摘要

While AI-generated hallucinations pose considerable risks, the underlying cognitive mechanisms by which humans can successfully recognize or be misled by these hallucinations remain unclear. To address this problem, this paper explores humans' neural dynamics to characterize how the brain processes hallucinated content. We record EEG signals from 27 participants while they are performing a verification task to judge the correctness of image descriptions generated by a multi-modal large language model (MLLM). Based on an averaged event-related potential (ERP) study, we reveal that multiple cognitive processes, e.g., semantic integration, inferential processing, memory retrieval, and cognitive load, exhibit distinct patterns when humans process hallucinated versus non-hallucinated content. Notably, neural responses to hallucinations that were misjudged versus correctly judged by human participants showed significant differences. This indicates that misjudged AI-generated hallucinations failed to trigger the standard neurocognitive fact verification pathway.

2605.16023 2026-05-26 cs.CL cs.LG 版本更新

Judge Circuits

Judge Circuits

Nils Feldhus, Tanja Baeumel, Elena Golimblevskaia, Qianli Wang, Van Bach Nguyen, Aaron Louis Eidt, Selin Kahvecioglu, Christopher Ebert, Wojciech Samek, Jing Yang, Vera Schmitt, Sebastian Möller, Simon Ostermann

发表机构 * Technische Universität Berlin(柏林技术大学) BIFOLD – Berlin Institute for the Foundations of Learning and Data(柏林学习与数据基础研究院) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心) Fraunhofer Heinrich Hertz Institute(弗劳恩霍夫海因里希·赫茨研究所) Marburg University(马尔堡大学) Centre for European Research in Trusted AI (CERTAIN)(欧洲可信人工智能研究中心)

AI总结 本研究利用位置感知边归因修补(PEAP)因果分析Gemma-3、Qwen2.5和Llama-3的内部机制,发现结构化理解和开放式偏好任务中的判断共享一个稀疏、泛化的潜在评估子图,并通过解耦抽象判断与输出格式,揭示了格式诱导不一致性的机制原因。

Comments 39 pages

详情
AI中文摘要

LLM-as-a-judge已成为大规模评估模型输出的主导范式,然而同一模型在其输出格式变化时(例如,1-5评分与真/假标签)会系统地给出不同的分数。现有对这些格式诱导不一致性的诊断停留在输入输出层面。利用位置感知边归因修补(PEAP),我们因果地研究了Gemma-3、Qwen2.5和Llama-3的内部机制。我们发现,跨结构化理解和开放式偏好任务的判断共享一个稀疏、泛化的潜在评估子图,位于中后期多层感知器(MLPs)中;将其零消融会破坏判断,同时保留架构模块化模型中的世界知识。通过结构上解耦抽象判断与输出格式,我们为我们研究的开放权重模型上的格式诱导不一致性提供了机制解释:在共享主干中计算的连续判断信号通过脆弱、格式特定的终端分支映射,使得格式无关的偏好能够在请求的输出格式下游被隔离。我们的发现意味着跨格式的基准级可靠性比较部分测量的是格式化器几何形状而非评估质量。

英文摘要

LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.

2605.13643 2026-05-26 cs.CL 版本更新

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

前缀教导,后缀消退:强到弱在线策略蒸馏中的局部可教性崩溃

Kaiyuan Liu, Ziyuan Zhuang, Yang Bai, Bing Wang, Rongxiang Weng, Jieping Ye

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) Meituan LongCat Team, China(美团LongCat团队,中国) College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院)

AI总结 本文发现强到弱在线策略蒸馏中,教师反馈在生成轨迹的后缀部分缺乏局部对比度,导致“局部可教性崩溃”,并提出基于变化点检测的截断规则来优化监督区域,实验表明该方法优于全轨迹蒸馏。

详情
AI中文摘要

在线策略蒸馏(OPD)利用来自更强教师的密集反馈,在学生模型自身的生成轨迹上训练学生模型。先前文献表明,只要教师反馈可用,监督完整响应令牌序列应能单调提升性能。然而,我们证明这一假设在强到弱OPD设置中有时不成立。虽然生成轨迹的后缀部分可能仍存在非零的师生优势,但它们通常缺乏使密集反馈有效优先学生学习所需的局部对比度。我们将这种失败模式称为局部可教性崩溃。由此得出的原则很简单:监督应集中在教师反馈仍具有判别性的轨迹区域,而非均匀覆盖整个响应。我们通过一种轨迹特定的释放规则来操作这一原则。该规则测量教师相对于学生前K个候选集的边际,将该边际在NLTK分词的句子片段上聚合,并在检测到BIC风格的下行变化点时截断密集OPD监督。使用Qwen3模型系列在强到弱蒸馏任务上的实验结果表明,该释放规则在不同学生规模下的五个域内基准上始终优于标准全轨迹OPD。此外,与基线蒸馏方法相比,我们的方法在域外任务上更好地保留了模型能力。这些结果表明,有效的强到弱OPD需要评估教师指导的可用性及其局部效用,确保生成的反馈保持可教性。

英文摘要

On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain task. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.

2605.04700 2026-05-26 cs.CR cs.AI cs.CL cs.LG cs.SD 版本更新

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

稀疏令牌足矣:通过令牌感知梯度优化越狱音频语言模型

Zheng Fang, Xiaosen Wang, Shenyi Zhang, Shaokang Wang, Zhijin Ge

发表机构 * Wuhan University Institute for Math \& AI, Wuhan University Huazhong University of Science Shanghai Jiao Tong University Xidian University

AI总结 本文提出令牌感知梯度优化(TAGO)方法,通过仅保留高梯度能量的音频令牌对应的波形梯度,实现稀疏越狱攻击,在保持高成功率的同时大幅减少优化量。

Comments To appear in the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

对音频语言模型(ALM)的越狱攻击通过优化音频扰动来引发不安全生成,通常在整个优化过程中密集地更新整个波形。在这项工作中,我们通过分析ALM中令牌对齐梯度的结构来研究这种密集优化的必要性。我们发现梯度能量在音频令牌之间高度不均匀,表明只有一小部分令牌对齐的音频区域主导了优化信号。受此观察启发,我们提出了令牌感知梯度优化(TAGO),它通过每次迭代仅保留与高梯度能量音频令牌对齐的波形梯度,同时屏蔽其余梯度,实现了稀疏越狱优化。在三个ALM上,TAGO优于基线,并且大幅稀疏化仍能保持较高的攻击成功率(例如,在Qwen3-Omni上,令牌保留率为0.25时,$\mathrm{ASR}_{l}$仍为86%,而全令牌保留时为87%)。这些结果表明密集的波形更新在很大程度上是冗余的,我们主张未来的音频越狱和安全对齐研究应进一步利用这种异质的令牌级梯度结构。

英文摘要

Jailbreak attacks on audio language models (ALMs) optimize audio perturbations to elicit unsafe generations, and they typically update the entire waveform densely throughout optimization. In this work, we investigate the necessity of such dense optimization by analyzing the structure of token-aligned gradients in ALMs. We find that gradient energy is highly non-uniform across audio tokens, indicating that only a small subset of token-aligned audio regions dominates the optimization signal. Motivated by this observation, we propose Token-Aware Gradient Optimization (TAGO), which enables sparse jailbreak optimization by retaining only waveform gradients aligned with audio tokens that have high gradient energy, while masking the remaining gradients at each iteration. Across three ALMs, TAGO outperforms baselines, and substantial sparsification preserves strong attack success rates (e.g. on Qwen3-Omni, $\mathrm{ASR}_{l}$ remains at 86% with a token retention ratio of 0.25, compared to 87% with full token retention). These results demonstrate that dense waveform updates are largely redundant, and we advocate that future audio jailbreak and safety alignment research should further leverage this heterogeneous token-level gradient structure.

2605.03472 2026-05-26 cs.CL cs.AI 版本更新

Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks

审计心理健康对话中的隐性谄媚:结构化临床状态诊断与干净匹配基准

Tianze Han, Beining Xu, Hanbo Zhang, Yongming Lu

发表机构 * Shenzhen MSU-BIT University(深圳MSU-BIT大学)

AI总结 针对心理健康对话模型中隐式谄媚(表面共情但强化消极认知)的问题,提出基于动态情感签名图(DESG)的结构化离线审计框架,通过临床状态转移评估响应方向,并在干净匹配基准上实现最优有害风险检测。

详情
AI中文摘要

心理健康对话模型越来越多地由基于AI的评估器进行评估,但这些评估器通常将表面共情、支持性或流畅性视为安全的证据。在本文中,我们研究了一种隐藏的失败模式,称为隐式谄媚:一个响应可能看似共情,但暗中强化灾难化、回避、绝望预测或CBT式标签。为了检查这个问题,我们引入了一个用于隐式谄媚检测的诊断基准,该基准基于三个代表性的心理健康对话来源构建,涵盖日常同伴支持、咨询式情感支持和危机导向互动,并进一步构建了一个泄漏审计的干净单响应匹配基准,包含500个上下文和1500个匹配响应窗口。然后,我们提出了动态情感签名图(DESG),一个结构化的离线审计框架,将基于LLM的状态提取与最终评分分离,并通过语义、情感和认知扭曲状态转移而非自由形式的LLM判断来评估临床方向。与元数据、表面风格、词汇、嵌入和基于规则的LLM基线不同,DESG对响应引起的临床状态变化方向进行评分;在泄漏审计的干净匹配基准上,DESG-StateRisk比最强的非DESG基线提高了0.0488 macro-F1,并实现了最佳的有害风险检测结果。这些结果表明,评估隐式谄媚需要显式的临床状态建模以及泄漏检查、捷径控制和竞争性基线。

英文摘要

Mental-health dialogue models are increasingly evaluated by AI-based evaluators, yet these evaluators often treat surface empathy, supportiveness, or fluency as evidence of safety. In this paper, we study a hidden failure mode that we call implicit sycophancy: a response may appear empathetic while implicitly reinforcing catastrophizing, avoidance, hopeless prediction, or CBT-style labeling. To examine this problem, we introduce a diagnostic benchmark for implicit-sycophancy detection, built from three representative mental-health dialogue sources covering everyday peer support, counseling-style emotional support, and crisis-oriented interaction, and further construct a leakage-audited clean single-response matched benchmark with 500 contexts and 1,500 matched response windows. We then propose Dynamic Emotional Signature Graphs (DESG), a structured offline audit framework that separates LLM-based state extraction from final scoring and evaluates clinical direction through semantic, affective, and cognitive-distortion state transitions rather than free-form LLM judgment. Unlike metadata, surface-style, lexical, embedding, and rubric-LLM baselines, DESG scores the direction of clinical-state change induced by a response; on the leakage-audited clean matched benchmark, DESG-StateRisk improves over the strongest non-DESG baseline by 0.0488 macro-F1 and achieves the best harmful-risk detection result. These results suggest that evaluating implicit sycophancy requires explicit clinical-state modeling together with leakage checks, shortcut controls, and competitive baselines.

2605.01017 2026-05-26 cs.CL 版本更新

Psychologically Potent, Computationally Invisible: LLMs Generate Social-Comparison-Eliciting Posts They Fail to Detect

心理上有效,计算上不可见:LLM 生成引发社会比较的帖子但无法检测

Hua Zhao, Jiapei Gu, Michelle Mingyue Gu

发表机构 * Department of English Language Education(英语语言教育系) Analytics/Assessment Research Centre(分析/评估研究中心)

AI总结 本研究通过构建小红书社会比较读者诱发基准(XHS-SCoRE),发现 LLM 在生成引发社会比较的帖子时存在生成-检测不匹配,即该信号在领域内可学习但无法通过提示分类稳健访问。

Comments 19 pages, preprint Title change: Psychologically Potent, Computationally Invisible: LLMs Generate Social-Comparison-Eliciting Posts They Fail to Detect

详情
AI中文摘要

我们引入了小红书社会比较读者诱发基准(XHS-SCoRE),这是一个基于读者视角的基准,用于检测纯文本小红书(RedNote)帖子是否从第一人称读者视角引发向上、向下或中性/无明确社会比较。该任务针对一种具有社会意义的关系性、行为上真实的信号,该信号不可简化为情感。在提示型 LLM 分类器和有监督的中文编码器中,我们发现了一致的生成-检测不匹配:该信号在领域内是文本可学习的,但无法通过基于提示的分类稳健访问。提示型 LLM 分类器表现出稳定的失败,特别是对引发比较的帖子进行中和以及模型特定的方向偏差。一项受控试点表明,即使基于提示的同一构念检测仍然脆弱,LLM 生成的小红书风格帖子也能改变感知地位和比较相关情感。XHS-SCoRE 为基于读者的比较检测提供了一个基准,并为研究社会意义的关系线索何时仅部分可见于基于提示的推理提供了一个诊断框架。

英文摘要

We introduce Xiaohongshu Social Comparison Reader Elicitation (XHS-SCoRE), a reader-grounded benchmark for detecting whether text-only Xiaohongshu (RedNote) posts elicit Upward, Downward, or Neutral/no clear social comparison from a first-person reader perspective. The task targets a socially meaningful relational, behaviorally real signal not reducible to sentiment. Across prompted LLM classifiers and supervised Chinese encoders, we find a consistent generation--detection mismatch: the signal is textually learnable in-domain, but not robustly accessible to prompt-based classification. Prompted LLM classifiers show stable failures, especially neutralization of comparison-eliciting posts and model-specific directional skew. A controlled pilot shows that LLM-generated Xiaohongshu-style posts can shift perceived standing and comparison-related affect even when prompt-based detection of the same construct remains fragile. XHS-SCoRE contributes a benchmark for reader-grounded comparison detection and a diagnostic framework for studying when socially meaningful relational cues remain only partially visible to prompt-based inference.

2605.00419 2026-05-26 cs.LG cs.CL 版本更新

Rethinking LLM Ensembling from the Perspective of Mixture Models

从混合模型的角度重新思考大语言模型集成

Jiale Fu, Yuchu Jiang, Peijun Wu, Chonghan Liu, Joey Tianyi Zhou, Xu Yang

发表机构 * Key Laboratory of New Generation Artificial Intelligence Technology(新一代人工智能技术关键实验室) Its Interdisciplinary Applications (Southeast University), Ministry of Education(交叉应用(东南大学),教育部) Southeast University(东南大学) Centre for Frontier AI Research (CFAR), Agency for Science, Technology(前沿人工智能研究(CFAR),科技研究局) Research (A STAR), Singapore(研究(A STAR),新加坡) Institute of High Performance Computing (IHPC), Agency for Science, Technology(高性能计算(IHPC),科技研究局)

AI总结 本文提出混合模型式集成(ME),通过将集成重新解释为混合模型,随机选择单个模型生成下一个token,避免显式计算完整集成分布,实现1.78x-2.68x加速,并揭示了集成与token级路由方法的联系。

Comments ICML 2026 Spotlight

详情
AI中文摘要

模型集成是提升机器学习模型性能的成熟技术。传统上,这涉及对多个模型的输出分布进行平均,并选择最可能的标签。这一思想已自然扩展到大型语言模型(LLMs),在提升性能的同时也带来了巨大的计算成本。这种低效源于将传统集成实现直接应用于LLMs,需要为每个模型单独进行前向传播以显式计算集成分布。在本文中,我们提出了混合模型式集成(ME)。通过将集成重新解释为混合模型,ME在每一步随机选择一个模型来生成下一个token,从而避免显式计算完整的集成分布。ME在数学上等价于从集成分布中采样,但只需调用一个模型,使其比传统集成快1.78x-2.68x倍。此外,这一视角将LLM集成与token级路由方法联系起来,表明LLM集成是路由方法的一个特例。我们的发现为高效的LLM集成开辟了新途径,并激励了对LLM token级路由策略的进一步探索。我们的代码可在https://github.com/Kamichanw/Mixture-model-like-Ensemble获取。

英文摘要

Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large language models (LLMs), yielding improved performance but incurring substantial computational cost. This inefficiency stems from directly applying conventional ensemble implementation to LLMs, which require a separate forward pass for each model to explicitly compute the ensemble distribution. In this paper, we propose the Mixture-model-like Ensemble (ME). By reinterpreting the ensemble as a mixture model, ME stochastically selects a single model at each step to generate the next token, thereby avoiding the need to explicitly compute the full ensemble distribution. ME is mathematically equivalent to sampling from the ensemble distribution, but requires invoking only one model, making it 1.78x-2.68x faster than conventional ensembling. Furthermore, this perspective connects LLM ensembling and token-level routing methods, suggesting that LLM ensembling is a special case of routing methods. Our findings open new avenues for efficient LLM ensembling and motivate further exploration of token-level routing strategies for LLMs. Our code is available at https://github.com/Kamichanw/Mixture-model-like-Ensemble.

2604.23295 2026-05-26 cs.CL cs.AI 版本更新

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

Human-1 by Josh Talks: 基于真实对话的印地语全双工对话建模框架

Bhaskar Singh, Shobhit Banga, Mahima Manik, Pranav Sharma

发表机构 * JoshTalks

AI总结 本文通过适配Moshi架构,使用自定义印地语分词器和26,000小时真实对话数据训练,提出了首个开放、可复现的印地语全双工口语对话系统,实现了自然的打断、重叠和反馈行为。

详情
AI中文摘要

全双工口语对话系统能够模拟自然的对话行为,如打断、重叠和反馈,然而这类系统在印度语言中仍 largely unexplored。我们通过适配最先进的双工语音架构Moshi,使用自定义印地语分词器,并在从14,695名说话者收集的26,000小时真实自发对话数据(具有独立的说话者通道)上进行训练,提出了首个开放、可复现的印地语全双工口语对话系统,从而能够直接从自然交互中学习话轮转换和重叠模式。为了支持印地语文本生成,我们替换了原始英语分词器,并重新初始化了依赖于文本词汇的参数,同时保留了预训练的音频组件。我们提出了一种两阶段训练方案——大规模预训练,然后在1,000小时对话数据上进行微调。通过提示对话延续范式,结合自动评估指标和人工判断,评估结果表明生成的模型在印地语中表现出自然且有意义的全双工对话行为。这项工作为印地语及其他印度语言的实时双工口语对话系统迈出了第一步。

英文摘要

Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.

2604.18396 2026-05-26 cs.CL 版本更新

River-LLM: Large Language Model Seamless Exit Based on KV Share

River-LLM:基于KV共享的大型语言模型无缝退出

Yingtao Shen, An Zou

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 针对解码器-only架构中早期退出因KV缓存缺失导致的延迟和精度问题,提出无需训练的River-LLM框架,通过轻量级KV共享退出流实现令牌级无缝退出,并利用状态转移相似性预测KV误差指导退出决策,在数学推理和代码生成任务上获得1.53-2.16倍实际加速且保持高质量。

Comments Accepted to ACL 2026, 13pages, with appendix. Corrected some typos

详情
AI中文摘要

大型语言模型(LLM)在多个领域展现出卓越性能,但日益受到高推理延迟的制约。早期退出通过动态跳过冗余层加速推理,成为一种有前景的解决方案。然而,在仅解码器架构中,早期退出的效率受到KV缓存缺失问题的严重瓶颈,即跳过的层无法为后续令牌提供必要的历史状态。现有解决方案(如重新计算或掩码)要么引入显著延迟开销,要么导致严重精度损失,未能弥合理论层减少与实际墙钟加速之间的差距。本文提出River-LLM,一个无需训练的框架,能够实现无缝的令牌级早期退出。River-LLM引入轻量级KV共享退出流,使得骨干网络的缺失KV缓存能够在退出过程中自然生成并保留,无需昂贵的恢复操作。此外,我们利用解码器块内的状态转移相似性来预测累积KV误差,并指导精确的退出决策。在数学推理和代码生成任务上的大量实验表明,River-LLM在保持高生成质量的同时,实现了1.53至2.16倍的实际加速。

英文摘要

Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone's missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves 1.53 to 2.16 times of practical speedup while maintaining high generation quality.

2604.14054 2026-05-26 cs.LG cs.CL 版本更新

$π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

$\pi$-Play: 通过特权自蒸馏实现的多智能体自对弈,无需外部数据

Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences(中国科学院大学先进交叉学科学院) Meituan(美团) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 提出$\pi$-Play框架,利用自对弈中生成的问答构建路径作为特权信息,结合自蒸馏实现密集反馈的多智能体协同进化,无需外部数据即可超越全监督搜索代理。

Comments 23 pages, 11 figures

详情
AI中文摘要

深度搜索代理已成为解决复杂信息寻求任务的有前景范式,但其训练仍面临稀疏奖励、弱信用分配和有限标注数据的挑战。自对弈提供了一种可扩展的减少数据依赖的途径,但传统自对弈仅通过稀疏结果奖励优化学生,导致学习效率低下。在这项工作中,我们观察到自对弈在任务生成过程中自然产生一个问题构建路径(QCP),这是一种捕获反向求解过程的中间产物。这揭示了一种新的特权信息来源:自对弈可以低成本、大规模地提供高质量特权信息用于自蒸馏,无需依赖人类反馈或精心设计的特权信息。基于这一洞察,我们提出特权信息自对弈($\pi$-Play),一种结合自对弈和自蒸馏的新型多智能体自进化框架。在$\pi$-Play中,考官生成任务及QCP,教师利用QCP作为特权上下文,通过自蒸馏对学生进行密集监督。这种设计将稀疏奖励的自对弈转变为密集反馈的协同进化。大量实验表明,无数据的$\pi$-Play超越了全监督搜索代理,并将进化效率相比传统自对弈提升了2-3倍。代码见 https://github.com/zhyaoch/pi-play。

英文摘要

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information: self-play can provide high-quality privileged information for the self-distillation at low cost and at scale, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($π$-Play), a novel multi-agent self-evolution framework combining self-play and self-distillation. In $π$-Play, an examiner generates tasks together with QCPs, and a teacher employs QCP as privileged context to densely supervise a student via self-distillation. This design transforms sparse-reward self-play into a dense-feedback co-evolution. Extensive experiments show that data-free $π$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play. Code is available at https://github.com/zhyaoch/pi-play.

2604.11632 2026-05-26 cs.CL 版本更新

CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity

CArtBench: 评估视觉语言模型在中国艺术理解、解读与真实性方面的能力

Xuefeng Wei, Zhixuan Wang, Xuan Zhou, Zhi Qu, Hongyao Li, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

发表机构 * Nara Institute of Science and Technology(奈良科学技術大學) Liaoning Normal University(遼寧師範大學)

AI总结 提出CARTBENCH基准,通过四个子任务评估视觉语言模型在中国艺术作品上的识别、推理、鉴赏和真伪鉴别能力,发现现有模型在复杂证据链接、风格断代和真伪诊断方面表现不足。

Comments under review

详情
AI中文摘要

我们介绍了CARTBENCH,一个基于博物馆的基准,用于评估视觉语言模型(VLM)在中国艺术作品上的表现,超越了短形式识别和问答。CARTBENCH包含四个子任务:CURATORQA用于基于证据的识别和推理,CATALOGCAPTION用于结构化的四部分专家风格鉴赏,REINTERPRET用于带有专家评分的可辩护的重新解读,以及CONNOISSEURPAIRS用于在视觉相似混淆下进行诊断性真伪鉴别。CARTBENCH通过将来自Wikidata的带有图像的故宫博物院物品与权威目录页面进行对齐构建,涵盖多个朝代的五个艺术类别。在九个代表性的VLM上,我们发现高整体CURATORQA准确率可能掩盖了在硬证据链接和风格到时期推断上的急剧下降;长形式鉴赏仍远未达到专家参考水平;而面向真伪的诊断性鉴别接近随机水平,这突显了当前模型在鉴赏级推理上的困难。

英文摘要

We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.

2604.05550 2026-05-26 cs.CL cs.CE 版本更新

AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

AutoSOTA:面向最先进AI模型发现的端到端自动化研究系统

Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, Tie-Yan Liu

发表机构 * Department of Electronic Engineering, BNRist, Tsinghua University(电子工程系,北京理工大学,清华大学) Zhongguancun Academy(中关村学院) Peking University(北京大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出AutoSOTA系统,采用多智能体架构实现从论文复现到模型优化的全自动化,成功发现105个超越原始方法的新SOTA模型。

详情
AI中文摘要

人工智能研究越来越依赖于长时间的复现、调试和迭代优化以达到最先进(SOTA)性能,这催生了对能够加速经验模型优化全流程的系统的需求。在这项工作中,我们介绍了AutoSOTA,一个端到端的自动化研究系统,它将顶级AI论文中发布的最新SOTA模型推进为可复现且经验上改进的新SOTA模型。我们通过三个紧密耦合的阶段来形式化这个问题:资源准备与目标设定;实验评估;以及反思与构思。为了解决这个问题,AutoSOTA采用了一种多智能体架构,包含八个专门化的智能体,它们协同工作,将论文与代码和依赖项对应起来,初始化和修复执行环境,跟踪长期实验,生成并调度优化想法,并监督有效性以避免虚假收益。我们在从八个顶级AI会议收集的最新研究论文上评估AutoSOTA,并根据代码可用性和执行成本进行过滤。在这些论文中,AutoSOTA在自动复现和后续优化方面均取得了强大的端到端性能。具体来说,它成功发现了105个超越原始报告方法的新SOTA模型,平均每篇论文耗时约五小时。涵盖LLM、NLP、计算机视觉、时间序列和优化的案例研究进一步表明,该系统可以超越常规的超参数调优,识别架构创新、算法重新设计和工作流级别的改进。这些结果表明,端到端的研究自动化不仅可以作为性能优化器,还可以作为一种新型的研究基础设施,减少重复性实验负担,帮助将人类注意力重新引导到更高层次的科学创造力上。

英文摘要

Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State-Of-The-Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end-to-end automated research system that advances the latest SOTA models published in top-tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi-agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top-tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end-to-end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow-level improvements. These results suggest that end-to-end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher-level scientific creativity.

2603.16105 2026-05-26 cs.CL cs.AI 版本更新

Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

频率至关重要:用于剪枝和量化的快速模型无关数据筛选

Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca

发表机构 * University of Trento(特伦托大学)

AI总结 提出一种基于Zipf幂律的模型无关数据筛选策略ZipCal,通过最大化词汇多样性来选择校准数据,在剪枝和量化中实现与依赖模型困惑度的最先进方法相当的性能,且速度快约240倍。

Comments Added statistical analysis, mechanistic analysis and a comparison with a generative baseline. 22 pages

详情
AI中文摘要

训练后模型压缩对于增强大型语言模型(LLMs)的可移植性同时保持其性能至关重要。虽然已经提出了几种压缩方法,但较少关注选择最合适的数据集(所谓的校准数据)来寻找压缩模型配置。校准数据的选择是保留模型在任务内和任务间能力的关键步骤。在这项工作中,我们通过分析内在数据属性而非模型特定信号,解决了为剪枝和量化识别高性能校准集的挑战。我们引入了 exttt{ extbf{ZipCal}},一种基于Zipf幂律最大化词汇多样性的模型无关数据筛选策略。实验表明,我们的方法在各种剪枝基准测试中始终优于标准的均匀随机采样。值得注意的是,在下游性能方面,它与依赖模型困惑度的最先进方法表现相当。后者在大规模模型和数据集上变得极其昂贵,而 exttt{ extbf{ZipCal}}由于其可处理的线性复杂度,平均快约240倍。

英文摘要

Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called \emph{calibration data}) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both intra- and inter-tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce \texttt{\textbf{ZipCal}}, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive at large-scale models and datasets, while \texttt{\textbf{ZipCal}} is on average $\sim$240$\times$ faster due to its tractable linear complexity\footnote{We make the code and the experiments available at https://github.com/FrancescoMonaco/ZipCal.}.

2603.12983 2026-05-26 cs.CL cs.AI 版本更新

Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

人工标注是否必要?用于机器翻译错误跨度检测的迭代MBR蒸馏

Boxuan Lyu, Haiyue Song, Zhi Qu

发表机构 * Institute of Science Tokyo(东京科学研究所) National Institute of Information and Communications Technology(信息与通信技术国家研究所)

AI总结 提出一种基于最小贝叶斯风险解码的迭代MBR蒸馏自演化框架,利用现成大语言模型生成伪标签,无需人工标注即可在错误跨度检测任务上超越监督基线。

详情
AI中文摘要

错误跨度检测(ESD)是机器翻译(MT)评估中的一个关键子任务,旨在识别翻译错误的位置和严重程度。虽然对人工标注数据微调模型能提升ESD性能,但获取此类数据成本高昂且标注者之间容易不一致。为解决这一问题,我们提出一种基于最小贝叶斯风险(MBR)解码的新型自演化框架,命名为用于ESD的迭代MBR蒸馏,该框架通过利用现成的大语言模型(LLM)生成伪标签,消除了对人工标注的依赖。在WMT Metrics Shared Task数据集上的大量实验表明,仅在这些自生成伪标签上训练的模型在系统和跨度层面上均优于未适应的基础模型和基于人工标注的有监督基线,同时保持有竞争力的句子级性能。

英文摘要

Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, which eliminates the reliance on human annotations by leveraging an off-the-shelf LLM to generate pseudo-labels. Extensive experiments on the WMT Metrics Shared Task datasets demonstrate that models trained solely on these self-generated pseudo-labels outperform both unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.

2603.06687 2026-05-26 cs.CV cs.CL cs.ET cs.MM cs.RO 版本更新

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

TimeSpot: 在真实世界场景中评估视觉语言模型的地理时间理解能力

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez

发表机构 * Computational Intelligence and Operations Laboratory (CIOL), Bangladesh(计算智能与运筹实验室(CIOL),孟加拉国) Shahjalal University of Science and Technology (SUST), Sylhet, Bangladesh(沙赫jalal科学与技术大学(SUST),沙赫里尔,孟加拉国) North South University (NSU), Dhaka, Bangladesh(北南大学(NSU),达卡,孟加拉国) Qatar Computing Research Institute (QCRI), Doha, Qatar(卡塔尔计算研究中心(QCRI),多哈,卡塔尔)

AI总结 提出TimeSpot基准,通过1,455张全球图像评估视觉语言模型在时间属性(季节、月份、时段、日光相位)和地理属性(大洲、国家、气候带、环境类型、经纬度)上的推理能力,发现现有模型性能低下,尤其时间推理不足。

Comments Accepted to ICML 2026

详情
AI中文摘要

地理时间理解,即仅从视觉输入推断位置、时间和上下文属性的能力,支撑着灾害管理、交通规划、具身导航、世界建模和地理教育等应用。尽管最近的视觉语言模型(VLM)利用地标和路标等线索在图像地理定位方面取得了进展,但它们推理时间信号和物理基础空间线索的能力仍然有限。为弥补这一差距,我们引入了TimeSpot,一个用于评估VLM在真实世界中进行地理时间推理的基准。TimeSpot包含来自80个国家的1,455张地面图像,要求直接从视觉证据中结构化预测时间属性(季节、月份、时段、日光相位)和地理属性(大洲、国家、气候带、环境类型、经纬度)。它还包括时空推理任务,测试在真实世界不确定性下的物理合理性。对最先进的开源和闭源VLM的评估显示性能低下,尤其是时间推理。虽然监督微调带来了改进,但结果仍不充分,凸显了需要新方法来实现稳健的、基于物理的地理时间理解。TimeSpot可在 https://TimeSpot-GT.github.io 获取。

英文摘要

Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding TimeSpot is available at: https://TimeSpot-GT.github.io.

2603.05143 2026-05-26 cs.CL cs.LG 版本更新

Feature Resemblance: Towards a Theoretical Understanding of Analogical Reasoning in Transformers

特征相似性:迈向对Transformer中类比推理的理论理解

Ruichen Xu, Wenjing Yan, Ying-Jun Angela Zhang

发表机构 * Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong(香港中文大学信息工程系)

AI总结 本文通过最小化Transformer抽象模型,从理论上证明联合训练和特定课程顺序能使实体在表示空间中对齐,从而通过特征相似性实现属性转移,即类比推理。

详情
AI中文摘要

理解大型语言模型中的推理因评估混淆多种推理类型而变得复杂。我们分离出类比推理,即模型在共享已知属性的实体之间转移属性,并研究这种转移何时能从训练中涌现。为了使问题在分析上易于处理,我们研究了一个最小化的Transformer风格抽象,该抽象隔离了学习到的表示如何支持类比推理。在此设置中,我们证明了三个关键结果。首先,对相似性和属性前提的联合训练通过对齐表示实现类比推理。其次,顺序训练仅在相似性结构先于特定属性学习时成功,揭示了课程不对称性。第三,在我们的风格化设置中,两跳推理$(a \to b, b \to c \Rightarrow a \to c)$可被视为具有身份桥$(b=b)$的类比推理,这些身份桥在训练数据中明确出现。这些结果共同揭示了一个统一机制:具有共享属性的实体在表示空间中对齐,从而通过特征相似性实现属性转移。使用高达8B参数的架构进行的实验与理论定性一致,并表明表示几何在风格化模型之外的类比推理中扮演重要角色。

英文摘要

Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning, where a model transfers an attribute between entities that share known properties, and study when such transfer can emerge from training. To make the problem analytically tractable, we study a minimal transformer-style abstraction that isolates how learned representations support analogical reasoning. Within this setting, we prove three key results. First, joint training on similarity and attribution premises enables analogical reasoning through aligned representations. Second, sequential training succeeds only when similarity structure is learned before specific attributes, revealing a curriculum asymmetry. Third, in our stylized setting, two-hop reasoning $(a \to b, b \to c \Rightarrow a \to c)$ can be viewed as analogical reasoning with identity bridges $(b=b)$, which appear explicitly in training data. Together, these results reveal a unified mechanism: entities with shared properties become aligned in representation space, enabling property transfer through feature resemblance. Experiments with architectures up to 8B parameters show qualitative agreement with the theory and suggest that representational geometry plays an important role in analogical reasoning beyond the stylized model.

2602.21198 2026-05-26 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

从试错中学习:具身大语言模型的反思式测试时规划

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Leonidas Guibas, Jiajun Wu, Yejin Choi

发表机构 * Stanford University(斯坦福大学) Northwestern University(西北大学)

AI总结 提出反思式测试时规划方法,通过行动中反思和行动后反思两种模式,结合回溯性反思,使具身智能体在测试时进行自我纠正和经验积累,显著提升长程任务性能。

详情
AI中文摘要

具身大语言模型赋予机器人高级任务推理能力,但它们无法反思错误原因,导致部署成为一系列独立尝试,错误重复而非积累经验。借鉴人类反思实践,我们引入反思式测试时规划,整合两种反思模式: extit{行动中反思},代理在行动前利用测试时扩展生成并评分多个候选行动,基于内部反思;以及 extit{行动后反思},利用测试时训练,根据执行后的外部反思更新内部反思模型和行动策略。我们还包含回溯性反思,允许代理重新评估早期决策,并利用后见之明进行模型更新,实现适当的长程信用分配。在我们新设计的Long-Horizon Household基准和MuJoCo Cupboard Fitting基准上的实验表明,与基线模型相比有显著提升,并能零样本泛化到逼真的HM3D环境以及在Franka Panda机械臂上的真实机器人实验。消融实验证实,行动中反思和行动后反思相互依赖,且回溯性反思在较低计算开销下比逐步外部反馈实现更好的信用分配。定性分析进一步突出了通过反思进行的行为纠正。

英文摘要

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and \textit{reflection-on-action}, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with zero-shot generalization to photorealistic HM3D environments and real-robot experiments on a Franka Panda arm. Ablations confirm that reflection-in-action and reflection-on-action are mutually dependent, and that retrospective reflection achieves better credit assignment than step-wise external feedback at lower computational overhead. Qualitative analyses further highlight behavioral correction through reflection.

2602.19333 2026-05-26 cs.CL cs.IR cs.SI 版本更新

PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

PerSoMed:用于波斯社交媒体文本分类的大规模平衡数据集

Isun Chehreh, Ebrahim Ansari

发表机构 * Institute for Advanced Studies in Basic Sciences (IASBS)(基础科学基础研究 institute)

AI总结 该研究构建了首个大规模平衡的波斯社交媒体文本分类数据集,包含9个类别共36,000条帖子,并基于BiLSTM、XLM-RoBERTa、TookaBERT等模型进行基准测试,其中TookaBERT-Large取得了最佳性能(F1分数0.9621)。

Comments 10 pages, including 1 figure

详情
AI中文摘要

本研究引入了首个大规模、良好平衡的波斯社交媒体文本分类数据集,专门用于解决该领域缺乏综合资源的问题。该数据集包含9个类别(经济、艺术、体育、政治、社会、健康、心理、历史、科技)的36,000条帖子,每个类别4,000个样本,以确保类别分布平衡。数据收集涉及来自多个波斯社交媒体平台的60,000条原始帖子,随后进行严格的预处理和混合标注,结合基于ChatGPT的少样本提示和人工验证。为了缓解类别不平衡,我们采用了带语义冗余移除的欠采样和结合词汇替换与生成提示的高级数据增强策略。我们对多个模型进行了基准测试,包括BiLSTM、XLM-RoBERTa(使用LoRA和AdaLoRA适配)、FaBERT、基于SBERT的架构以及波斯语专用TookaBERT(Base和Large)。实验结果表明,基于Transformer的模型始终优于传统神经网络,其中TookaBERT-Large取得了最佳性能(精确率:0.9622,召回率:0.9621,F1分数:0.9621)。按类别评估进一步证实了所有类别的稳健性能,尽管社会和政治文本由于固有歧义而得分略低。本研究提供了一个新的高质量数据集,并对前沿模型进行了全面评估,为波斯语自然语言处理的进一步发展奠定了坚实基础,包括趋势分析、社会行为建模和用户分类。该数据集公开可用,以支持未来的研究工作。

英文摘要

This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa (with LoRA and AdaLoRA adaptations), FaBERT, SBERT-based architectures, and the Persian-specific TookaBERT (Base and Large). Experimental results show that transformer-based models consistently outperform traditional neural networks, with TookaBERT-Large achieving the best performance (Precision: 0.9622, Recall: 0.9621, F1- score: 0.9621). Class-wise evaluation further confirms robust performance across all categories, though social and political texts exhibited slightly lower scores due to inherent ambiguity. This research presents a new high-quality dataset and provides comprehensive evaluations of cutting-edge models, establishing a solid foundation for further developments in Persian NLP, including trend analysis, social behavior modeling, and user classification. The dataset is publicly available to support future research endeavors.

2602.15620 2026-05-26 cs.CL cs.AI 版本更新

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

STAPO:通过抑制稀有虚假标记稳定大语言模型的强化学习

Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li

发表机构 * School of Vehicle Mobility \& College of AI, Tsinghua University Didi Voyager Labs, DiDi Autonomous Driving

AI总结 针对强化学习微调大语言模型时因稀有虚假标记导致训练不稳定和性能崩溃的问题,提出STAPO方法,通过抑制这些标记的梯度扰动,在多个数学推理基准上实现稳定训练和性能提升。

详情
AI中文摘要

强化学习显著提升了大语言模型的推理能力,但现有的强化学习微调方法严重依赖熵正则化和重加权等启发式技术来维持稳定性。实践中,这些方法常遭遇后期性能崩溃,导致推理质量下降和训练不稳定。我们识别出这一不稳定的关键因素:一小部分标记(称为虚假标记,约占0.01%)对推理结果贡献甚微,但由于继承了完整的序列级奖励而获得不成比例放大的梯度更新。我们提出了一个统一框架,用于评估虚假风险、梯度范数和熵变化下标记级优化影响。基于对严重破坏优化的标记特征的分析,我们提出了抑制虚假标记(S2T)机制,以有效抑制其梯度扰动。将该机制融入基于组的目标中,我们提出了虚假标记感知策略优化(STAPO),促进了稳定有效的大规模模型优化。在使用Qwen 1.7B、8B和14B基础模型的六个数学推理基准上,STAPO一致展现出优越的熵稳定性,并在GRPO、20-Entropy和JustRL基础上平均性能提升11.49%($\rho_{\mathrm{T}}$=1.0, top-p=1.0)和3.73%($\rho_{\mathrm{T}}$=0.7, top-p=0.9)。

英文摘要

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We identify a key factor behind this instability: a small fraction of tokens, termed spurious tokens (around 0.01%), which contribute little to the reasoning outcome but receive disproportionately amplified gradient updates due to inheriting the full sequence-level reward. We present a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes. Building on the analysis of token characteristics that severely disrupt optimization, we propose the Silencing Spurious Tokens (S2T) mechanism to efficiently suppress their gradient perturbations. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 11.49% ($ρ_{\mathrm{T}}$=1.0, top-p=1.0) and 3.73% ($ρ_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.

2602.08426 2026-05-26 cs.CL cs.AI cs.CV 版本更新

Prism: Spectral-Aware Block-Sparse Attention

Prism: 频谱感知的块稀疏注意力

Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) ByteDance Inc.(字节跳动公司) OpenMOSS Team(OpenMOSS团队)

AI总结 针对长上下文LLM预填充中块稀疏注意力的块选择效率瓶颈,提出无训练频谱感知方法Prism,通过高低频分支分解和能量温度校准恢复位置信号,实现纯块级重要性估计,在保持精度同时实现高达5.1倍加速。

Comments ICML 2026

详情
AI中文摘要

块稀疏注意力有望加速长上下文LLM的预填充,但高效识别相关块仍是瓶颈。现有方法通常采用粗粒度注意力作为块重要性估计的代理,但往往诉诸昂贵的令牌级搜索或评分,导致显著的选择开销。在本工作中,我们将通过均值池化的标准粗粒度注意力的不准确性追溯到一个理论根源:均值池化与旋转位置嵌入(RoPE)之间的交互。我们证明均值池化充当低通滤波器,在高频维度上引起破坏性干扰,有效造成局部位置信息(如斜线模式)的“盲点”。为解决此问题,我们引入Prism,一种无训练的频谱感知方法,将块选择分解为高频和低频分支。通过应用基于能量的温度校准,Prism直接从池化表示中恢复衰减的位置信号,使得仅使用块级操作即可进行块重要性估计,从而提高效率。大量评估证实,Prism在保持与全注意力精度相当的同时,实现了高达$\mathbf{5.1 imes}$的加速。

英文摘要

Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to $\mathbf{5.1\times}$ speedup.

2602.04279 2026-05-26 cs.CL 版本更新

ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation

ECG-R1: 协议引导且模态无关的可靠心电图解读多模态大语言模型

Jiarui Jin, Haoyu Wang, Xingliang Wu, Xiaocheng Fang, Xiang Lan, Zihan Wang, Deyun Zhang, Bo Liu, Yingying Zhang, Xian Wu, Hongyan Li, Shenda Hong

发表机构 * School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院) National Institute of Health Data Science, Peking University(北京大学健康数据科学国家研究院) State Key Laboratory of General Artificial Intelligence, Peking University(北京大学通用人工智能国家重点实验室) Tianjin Institute of Cardiology, the Second Hospital of Tianjin Medical University(天津医科大学第二医院心内科) National University of Singapore(新加坡国立大学) Jarvis Lab, Tencent(腾讯 Jarvis实验室) HeartVoice Medical Technology(HeartVoice医疗科技)

AI总结 提出ECG-R1,通过协议引导数据生成、模态解耦架构和强化学习,实现可靠的心电图解读。

Comments Accepted to ICML 2026

详情
AI中文摘要

心电图(ECG)在临床实践中是一种不可或缺的诊断工具,然而现有的多模态大语言模型(MLLMs)在心电图解读方面仍不可靠,常常产生看似合理但临床错误的解读。为了解决这一问题,我们提出了ECG-R1,这是首个通过三项创新设计用于可靠心电图解读的推理型ECG MLLM。首先,我们利用 extit{协议引导的指令数据生成}构建解读语料库,将解读基于可测量的ECG特征以及专著定义的定量阈值和诊断逻辑。其次,我们提出了一种模态解耦架构,采用 extit{交错模态丢弃},以提高当ECG信号或ECG图像缺失时的鲁棒性和跨模态一致性。第三,我们提出了 extit{带有ECG诊断证据奖励的强化学习},以加强基于证据的ECG解读。此外,我们系统评估了专有、开源和医疗MLLM的心电图解读能力,并首次提供了定量证据表明严重的幻觉普遍存在,这表明公众不应在没有独立验证的情况下直接信任这些输出。代码可在\href{https://github.com/PKUDigitalHealth/ECG-R1}{此处}获取。

英文摘要

Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning ECG MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using \textit{Protocol-Guided Instruction Data Generation}, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with \textit{Interleaved Modality Dropout} to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present \textit{Reinforcement Learning with ECG Diagnostic Evidence Rewards} to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code is available at \href{https://github.com/PKUDigitalHealth/ECG-R1}{here}.

2602.03695 2026-05-26 cs.MA cs.AI cs.CL 版本更新

Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems

Agent Primitives: 面向多智能体系统的可复用潜在构建模块

Haibo Jin, Peng Kuang, Ye Yu, Xiaopeng Yuan, Haohan Wang

发表机构 * School of Information Sciences, University of Illinois at Urbana-Champaign, IL, USA(伊利诺伊大学厄巴纳-香槟分校信息科学学院) Siebel School of Computing and Data Science, University of Illinois at Urbana-Champaign, IL, USA(伊利诺伊大学厄巴纳-香槟分校Siebel计算与数据科学学院)

AI总结 提出Agent Primitives,一组可复用的潜在构建模块,通过KV缓存内部通信和自动组合,提升多智能体系统的鲁棒性、效率和跨任务复用性。

Comments 16 pages

详情
AI中文摘要

虽然现有的多智能体系统(MAS)可以通过多个智能体之间的协作处理复杂问题,但它们通常高度任务特定,依赖手动设计的智能体角色和交互提示,导致架构复杂性增加且跨任务复用性有限。此外,大多数MAS主要通过自然语言进行通信,使得它们在内部智能体历史中的长上下文、多阶段交互中容易受到错误累积和不稳定性的影响。在这项工作中,我们提出了 extbf{Agent Primitives},一组用于基于LLM的MAS的可复用潜在构建模块。受神经网络设计的启发,其中复杂模型由可复用组件构建,我们观察到许多现有的MAS架构可以分解为少量重复出现的内部计算模式。基于这一观察,我们实例化了三个原语:Review、Voting and Selection以及Planning and Execution。所有原语通过键值(KV)缓存进行内部通信,通过减轻多阶段交互中的信息退化来提高鲁棒性和效率。为了实现自动系统构建,一个Organizer智能体根据每个查询选择并组合原语,由先前成功配置的轻量级知识库引导,形成基于原语的MAS。实验表明,基于原语的MAS相比单智能体基线平均准确率提高12.0-16.5%,与基于文本的MAS相比,令牌使用量和推理延迟减少约3-4倍,同时相对于单智能体推理仅产生1.3-1.6倍的开销,并在不同模型骨干上提供更稳定的性能。

英文摘要

While existing multi-agent systems (MAS) can handle complex problems by enabling collaboration among multiple agents, they are often highly task-specific, relying on manually crafted agent roles and interaction prompts, which leads to increased architectural complexity and limited reusability across tasks. Moreover, most MAS communicate primarily through natural language, making them vulnerable to error accumulation and instability in long-context, multi-stage interactions within internal agent histories. In this work, we propose \textbf{Agent Primitives}, a set of reusable latent building blocks for LLM-based MAS. Inspired by neural network design, where complex models are built from reusable components, we observe that many existing MAS architectures can be decomposed into a small number of recurring internal computation patterns. Based on this observation, we instantiate three primitives: Review, Voting and Selection, and Planning and Execution. All primitives communicate internally via key-value (KV) cache, which improves both robustness and efficiency by mitigating information degradation across multi-stage interactions. To enable automatic system construction, an Organizer agent selects and composes primitives for each query, guided by a lightweight knowledge pool of previously successful configurations, forming a primitive-based MAS. Experiments show that primitives-based MAS improve average accuracy by 12.0-16.5\% over single-agent baselines, reduce token usage and inference latency by approximately 3$\times$-4$\times$ compared to text-based MAS, while incurring only 1.3$\times$-1.6$\times$ overhead relative to single-agent inference and providing more stable performance across model backbones.

2602.02979 2026-05-26 cs.CL cs.LG 版本更新

CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

CPMobius: 无数据强化学习的迭代式教练-玩家推理

Ran Li, Zeyuan Liu, Yinghao Chen, Bingxiang He, Jiarui Yuan, Zixuan Fu, Weize Chen, Jinyi Hu, Chen Qian, Zhiyuan Liu, Maosong Sun

发表机构 * Tsinghua University(清华大学) University of Cambridge(剑桥大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出CPMobius协作式教练-玩家范式,通过无外部数据的合作优化循环提升数学推理能力,在Qwen2.5-Math-7B-Instruct上总体准确率提升4.9%,OOD准确率提升5.4%。

Comments Accepted to the ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在复杂推理方面展现出强大潜力,但其进展仍从根本上受限于对大规模高质量人工策划任务和标签的依赖,无论是通过监督微调(SFT)还是基于推理特定数据的强化学习(RL)。这种依赖使得监督密集型训练范式日益不可持续,实践中已出现可扩展性减弱的迹象。为克服这一限制,我们引入了CPMöbius(CPMobius),一种用于推理模型无数据强化学习的协作式教练-玩家范式。与传统对抗性自博弈不同,CPMöbius受现实世界人类体育协作和多智能体协作启发,将教练和玩家视为独立但合作的角色。教练针对玩家的能力提出指令,并根据玩家表现的变化获得奖励,而玩家则因解决教练生成的越来越有指导性的任务而获得奖励。这种合作优化循环旨在直接提升玩家的数学推理能力。值得注意的是,CPMöbius在不依赖任何外部训练数据的情况下实现了显著改进,优于现有的无监督方法。例如,在Qwen2.5-Math-7B-Instruct上,我们的方法总体准确率平均提升4.9%,分布外(OOD)准确率平均提升5.4%,总体准确率超过RENT 1.5%,OOD准确率超过R-zero 4.2%。我们的代码库已在https://github.com/thunlp/CPMobius发布。

英文摘要

Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and receives rewards based on changes in the Player's performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player's mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R-zero by +4.2 on OOD accuracy. Our codebase has been released at https://github.com/thunlp/CPMobius.

2602.02495 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Reward-free Alignment for Conflicting Objectives

无奖励的冲突目标对齐

Peter Chen, Xiaopeng Li, Xi Chen, Tianyi Lin

发表机构 * Columbia University(哥伦比亚大学)

AI总结 提出RACO框架,通过冲突规避梯度下降的裁剪变体直接利用成对偏好数据解决多目标冲突,实现帕累托最优对齐。

Comments Accepted to ICML 2026 (Oral)

详情
AI中文摘要

直接对齐方法越来越多地用于将大型语言模型(LLMs)与人类偏好对齐。然而,许多现实世界的对齐问题涉及多个相互冲突的目标,简单的偏好聚合可能导致训练不稳定和糟糕的权衡。特别是,加权损失方法可能无法识别同时改善所有目标的更新方向,而现有的多目标方法通常依赖显式奖励模型,增加了额外复杂性并扭曲了用户指定的偏好。本文的贡献有两方面。首先,我们提出了一种用于冲突目标的无奖励对齐框架(RACO),该框架直接利用成对偏好数据,并通过一种新颖的冲突规避梯度下降的裁剪变体解决梯度冲突。我们提供了收敛到尊重用户指定目标权重的帕累托临界点的保证,并进一步证明在双目标设置中裁剪可以严格改善收敛速度。其次,我们使用一些启发式方法改进了我们的方法,并进行了实验,以证明所提框架在LLM对齐中的兼容性。在多个LLM家族(Qwen 3、Llama 3、Gemma 3)上的多目标摘要和安全对齐任务的定性和定量评估表明,与现有的多目标对齐基线相比,我们的方法始终能实现更好的帕累托权衡。

英文摘要

Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.

2602.01322 2026-05-26 cs.LG cs.CL 版本更新

PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding

PolySAE: 通过多项式解码建模稀疏自编码器中的特征交互

Panagiotis Koromilas, Andreas D. Demou, James Oldfield, Yannis Panagakis, Mihalis Nicolaou

发表机构 * The Cyprus Institute(塞浦路斯研究所) University of Athens(雅典大学) University of Oxford(牛津大学) Archimedes AI/Athena Research Center(阿基米德AI/雅典娜研究中心) University of Cyprus(塞浦路斯大学)

AI总结 提出PolySAE,在稀疏自编码器解码器中引入高阶项以建模特征交互,通过低秩张量分解在共享投影子空间上捕获成对和三元特征交互,在保持可解释性的同时提升探测F1约8%,并产生与共现频率无关的组合结构。

Comments 43rd International Conference on Machine Learning (ICML 2026); Code: https://github.com/pakoromilas/PolySAE

详情
AI中文摘要

稀疏自编码器(SAE)通过将激活分解为字典原子的稀疏组合来解释神经网络表示。然而,SAE假设特征通过线性重建相加组合,这种假设无法捕捉组合结构:线性模型无法区分“Starbucks”是由“star”和“coffee”特征的组合还是仅由它们的共现产生。这迫使SAE为复合概念分配整体特征,而不是将其分解为可解释的组成部分。我们引入了PolySAE,它通过高阶项扩展SAE解码器以建模特征交互,同时保留对可解释性至关重要的线性编码器。通过在共享投影子空间上进行低秩张量分解,PolySAE以较小的参数开销(GPT2上为3%)捕获成对和三元特征交互。在四个语言模型和三个SAE变体上,PolySAE在保持可比重建误差的同时,探测F1平均提升约8%,并产生类别条件特征分布之间2-10倍更大的Wasserstein距离。关键的是,学习到的交互权重与共现频率的相关性可忽略不计(r = 0.06,而SAE特征协方差为r = 0.82),表明多项式项捕获了很大程度上独立于表面统计的组合结构。最后,学习到的交互方向因果性地将模型输出引导向相应的组合语义。

英文摘要

Sparse autoencoders (SAEs) interpret neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: linear models cannot distinguish whether ''Starbucks'' arises from the composition of ''star'' and ''coffee'' features or merely their co-occurrence. This forces SAEs to allocate monolithic features for compound concepts rather than decomposing them into interpretable constituents. We introduce PolySAE, which extends the SAE decoder with higher-order terms to model feature interactions while preserving the linear encoder essential for interpretability. Through low-rank tensor factorization on a shared projection subspace, PolySAE captures pairwise and triple feature interactions with small parameter overhead (3% on GPT2). Across four language models and three SAE variants, PolySAE achieves an average improvement of $\sim$8% in probing F1 while maintaining comparable reconstruction error, and produces 2--10$\times$ larger Wasserstein distances between class-conditional feature distributions. Critically, learned interaction weights exhibit negligible correlation with co-occurrence frequency ($r = 0.06$ vs $r = 0.82$ for SAE feature covariance), suggesting that polynomial terms capture compositional structure largely independent of surface statistics. Finally, the learned interaction directions causally steer model outputs toward the corresponding compositional semantics.

2601.14249 2026-05-26 cs.CL 版本更新

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

哪种推理轨迹能更好地教会学生推理?一个信息对齐的简单度量

Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, Xuanjing Huang

发表机构 * Fudan University(复旦大学) Shanghai AI Laboratory(上海人工智能实验室) University of Toronto(多伦多大学) University of Sydney(悉尼大学)

AI总结 提出Rank-Surprisal Ratio (RSR)度量,通过结合对齐性和信息性评估推理轨迹对学生模型的适用性,在轨迹选择和教师选择中显著优于现有方法。

Comments Accepted to ACL 2026 (Main Conference). 31 pages. Project page: https://github.com/UmeanNever/RankSurprisalRatio

详情
AI中文摘要

长链思维(CoT)轨迹为从教师到学生大语言模型的推理蒸馏提供了丰富的监督信号。然而,先前工作和我们的实验均表明,来自更强教师的轨迹并不一定能产生更好的学生,凸显了蒸馏中数据-学生适配性的重要性。现有方法主要通过学生似然评估适配性,倾向于选择与学生模型当前行为高度一致的轨迹,但忽略了更具信息性的轨迹。针对这一问题,我们提出Rank-Surprisal Ratio (RSR),一个简单的度量,同时捕捉对齐性和信息性以评估推理轨迹的适用性。RSR的动机源于有效轨迹通常通过结合低绝对概率和相对高排名的token(在学生模型下)来平衡学习信号强度和行为对齐。具体而言,RSR定义为轨迹的平均token级排名与其平均负对数似然之比,计算和解释直观。在五个学生模型和来自11个不同教师的推理轨迹上,RSR与训练后推理性能强相关(平均Spearman 0.86),持续优于现有度量。我们进一步展示了其在轨迹选择和教师选择中的实际效用。

英文摘要

Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high-ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training reasoning performance (average Spearman 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.

2601.03790 2026-05-26 cs.CL cs.AI 版本更新

NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

NeoAMT: 基于强化学习的新词感知智能机器翻译

Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, Yoshimasa Tsuruoka

发表机构 * The University of Tokyo(东京大学) NTT Communication Science Laboratories, NTT, Inc.(NTT通信科学实验室,NTT公司)

AI总结 提出NeoAMT框架,利用基于Wiktionary的搜索工具和强化学习训练翻译智能体,以提升包含新词的源句翻译质量。

Comments ACL 2026 Main. Fixed minor typos

详情
AI中文摘要

新词感知机器翻译旨在将包含新词的源句翻译成目标语言。与通用机器翻译相比,该领域仍未被充分探索。本文提出一个智能体框架NeoAMT,用于新词感知机器翻译,配备基于Wiktionary的搜索工具。具体而言,我们首先构建了一个专门用于新词感知机器翻译的数据集,并建立了一个基于Wiktionary的搜索工具。该数据集涵盖16种语言和75个翻译方向,源自约1000万条英文Wiktionary转储记录。搜索工具的检索语料库也来自同一转储中约300万条清洗后的记录。然后,我们利用该数据集和工具,通过强化学习训练翻译智能体,并评估新词感知机器翻译的准确性。此外,我们提出了一个强化学习训练框架,具有新颖的奖励设计和自适应展开生成策略,利用翻译难度进一步提高使用我们搜索工具的翻译智能体的翻译质量。

英文摘要

Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation equipped with a Wiktionary-based search toolkit. Specifically, we first construct a dedicated dataset for neologism-aware machine translation and build a search toolkit grounded in Wiktionary. The dataset covers 16 languages and 75 translation directions in total, derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search toolkit is also constructed from around 3 million cleaned records of the same dump. We then leverage the dataset and toolkit to train a translation agent via reinforcement learning (RL) and to evaluate the accuracy of neologism-aware machine translation. Furthermore, we propose an RL training framework featuring a novel reward design and an adaptive rollout generation strategy that exploits translation difficulty to further improve the translation quality of translation agents using our search toolkit.

2601.02144 2026-05-26 cs.CL cs.AI 版本更新

Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts

类比路由:用于混合专家模型的kNN增强专家分配

Boxuan Lyu, Soichiro Murakami, Hidetaka Kamigaito, Peinan Zhang

发表机构 * Institute of Science Tokyo(东京科学研究院) CyberAgent Nara Institute of Science and Technology(奈良科学技術大學)

AI总结 提出kNN-MoE框架,通过检索历史相似案例的局部最优专家分配来增强MoE路由,使用检索邻居的平均相似度作为置信度混合系数,在分布偏移下提升鲁棒性。

详情
AI中文摘要

混合专家(MoE)架构通过使用参数化“路由器”将令牌分派给稀疏的专家子集,高效地扩展大型语言模型。通常,该路由器被训练一次然后冻结,导致路由决策在分布偏移下变得脆弱。我们通过引入kNN-MoE来解决这一限制,这是一种检索增强的路由框架,它从类似历史案例的记忆中重用局部最优的专家分配。该记忆通过直接优化令牌级路由对数似然以最大化参考集上的似然来离线构建。关键的是,我们使用检索邻居的平均相似度作为置信度驱动的混合系数,从而允许该方法在未找到相关案例时回退到冻结的路由器。实验表明,kNN-MoE优于零样本基线,并且与计算密集型的监督微调相比具有竞争力。

英文摘要

Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric ``router'' to dispatch tokens to a sparse subset of experts. Typically, this router is trained once and then frozen, rendering routing decisions brittle under distribution shifts. We address this limitation by introducing kNN-MoE, a retrieval-augmented routing framework that reuses locally optimal expert assignments from a memory of similar past cases. This memory is constructed offline by directly optimizing token-wise routing logits to maximize the likelihood on a reference set. Crucially, we use the average similarity of retrieved neighbors as a confidence-driven mixing coefficient, thus allowing the method to fall back to the frozen router when no relevant cases are found. Experiments show that kNN-MoE outperforms the zero-shot baseline and is competitive with computationally intensive supervised fine-tuning.

2510.22874 2026-05-26 cs.CL 版本更新

A Comprehensive Dataset for Human vs. AI Generated Text Detection

人类与AI生成文本检测的综合数据集

Rajarshi Roy, Gurpreet Singh, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

发表机构 * Kalyani Government Engineering College(卡利尼政府工程学院) IIIT Guwahati(古瓦哈提理工学院) IIIT Delhi(德里理工学院) BITS Pilani Hyderabad Campus(比什帕利 Hyderabad 分校) University of South Carolina(南卡罗来纳大学) NIT Silchar(西里 char 工程学院) San José State University(桑乔斯州立大学) UCLA(加州大学洛杉矶分校) Washington State University(华盛顿州立大学) Vishwakarma Institute of Information Technology(维斯瓦卡arma 信息科技学院) Gandhi Institute for Technological Advancement(甘地技术进步研究所) BITS Pilani Goa(比什帕利 Goa 分校) Meta AI Amazon AI(亚马逊AI)

AI总结 本文提出了一个包含73,193个文本样本的综合数据集,结合真实纽约时报文章与多个先进LLM生成的合成文本,用于区分人类与AI生成文本及归因任务,基线准确率分别为58.35%和8.92%。

Comments Defactify4 @AAAI 2025

详情
AI中文摘要

大型语言模型(LLM)的快速发展使得AI生成的文本越来越像人类,引发了对内容真实性、错误信息和可信度的担忧。要可靠地检测AI生成文本并将其归因于特定模型,需要大规模、多样化且标注良好的数据集。在这项工作中,我们提出了一个包含73,193个文本样本的综合数据集,该数据集结合了真实的纽约时报文章与多个最先进LLM(包括Gemma-2-9b、Mistral-7B、Qwen-2-72B、LLaMA-8B、Yi-Large和GPT-4-o)生成的合成版本。数据集提供原始文章摘要作为提示,以及完整的人类作者叙述。我们为两个关键任务建立了基线结果:区分人类撰写与AI生成的文本,准确率达到58.35%;以及将AI文本归因于其生成模型,准确率为8.92%。通过将现实世界的新闻内容与现代生成模型相结合,该数据集旨在促进鲁棒的检测和归因方法的发展,在生成式AI时代培养信任和透明度。我们的数据集可在以下网址获取:https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Text_Dataset

英文摘要

The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising over 73,193 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides original article abstracts as prompts, full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35\%, and attributing AI texts to their generating models with an accuracy of 8.92\%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Text_Dataset

2510.22143 2026-05-26 cs.CL 版本更新

Benchmarking and Learning Real-World Customer Service Dialogue

基准测试与学习真实世界客服对话

Tianhong Gao, Jundong Shen, Jiapeng Wang, Bei Shi, Ying Ju, Junfeng Yao, Huiyu Yu

发表机构 * ByteDance, Beijing, China(字节跳动,北京,中国)

AI总结 针对工业智能客服与真实对话需求脱节的问题,提出OlaBench基准和OlaMind模型,通过蒸馏专家推理模式与分阶段强化学习,在OlaBench上超越GPT-5.2和Gemini 3 Pro,在线A/B测试中问题解决率提升23.67%,人工转接率降低6.6%。

详情
AI中文摘要

现有的工业智能客服(ICS)基准和训练流程与真实对话需求仍存在偏差,过度强调可验证的任务成功,而低估了主观服务质量和实际故障模式,导致离线收益与可部署对话行为之间存在差距。我们通过一个从基准到优化的循环来弥合这一差距:首先引入OlaBench,一个涵盖检索增强生成、基于工作流的系统和智能体设置的ICS基准,评估服务能力、安全性和延迟敏感性;此外,受OlaBench结果显示最先进的LLM仍不足的启发,我们提出OlaMind,它从专家对话中提炼可复用的推理模式和服务策略,并应用分阶段探索-利用强化学习,结合实例级评分感知指导来提升模型能力。OlaMind在OlaBench上超越了GPT-5.2和Gemini 3 Pro(83.64 vs. 70.58/70.84),并且在在线A/B测试中,与基线相比,平均问题解决率提高了23.67%,人工转接率降低了6.6%,从而将离线收益桥接到部署中。OlaBench和OlaMind共同推动ICS系统向更拟人化、专业化和可靠的方向发展。项目页面和评估可在https://olamind-olabench.github.io获取。

英文摘要

Existing benchmarks and training pipelines for industrial intelligent customer service (ICS) remain misaligned with real-world dialogue requirements, overemphasizing verifiable task success while under-measuring subjective service quality and realistic failure modes, leaving a gap between offline gains and deployable dialogue behavior. We close this gap with a benchmark-to-optimization loop: we first introduce OlaBench, an ICS benchmark spanning retrieval-augmented generation, workflow-based systems, and agentic settings, which evaluates service capability, safety, and latency sensitivity; moreover, motivated by OlaBench results showing state-of-the-art LLMs still fall short, we propose OlaMind, which distills reusable reasoning patterns and service strategies from expert dialogues and applies staged exploration--exploitation reinforcement learning with instance-level rubric-aware guidance to improve model capability. OlaMind surpasses GPT-5.2 and Gemini 3 Pro on OlaBench (83.64 vs. 70.58/70.84) and, in online A/B tests, delivers an average +23.67% issue resolution and -6.6% human transfer rate versus the baseline, bridging offline gains to deployment. Together, OlaBench and OlaMind advance ICS systems toward more anthropomorphic, professional, and reliable deployment. The project page and evaluation are available at https://olamind-olabench.github.io.

2510.08558 2026-05-26 cs.AI cs.CL cs.IR cs.LG 版本更新

Agent Learning via Early Experience

通过早期经验进行智能体学习

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu

发表机构 * Meta Superintelligence Labs(Meta超智能实验室) FAIR at Meta(Meta的FAIR部门) The Ohio State University(俄亥俄州立大学)

AI总结 提出早期经验范式,利用智能体自身动作生成的交互数据(无需奖励信号)通过隐式世界建模和自我反思两种策略提升智能体在多样化环境中的效果和跨域泛化能力。

Comments ICML 2026

详情
AI中文摘要

语言智能体的一个长期目标是通过自身经验学习和改进,最终在复杂的现实任务中超越人类。然而,在缺乏可验证奖励(如网站)或需要低效长程展开(如多轮工具使用)的许多环境中,基于经验数据使用强化学习训练智能体仍然困难。因此,当前大多数智能体依赖专家数据的监督微调,这难以扩展且泛化能力差。这一局限性源于专家示范的本质:它们只捕获了狭窄的场景范围,并使智能体暴露于有限的环境多样性。我们通过一种称为早期经验的中间范式来解决这一局限性:由智能体自身动作生成的交互数据,其中产生的未来状态作为监督信号,无需奖励。在此范式下,我们研究了使用此类数据的两种策略:(1)隐式世界建模,利用收集的状态将策略基于环境动态;(2)自我反思,智能体从其次优动作中学习以改进推理和决策。在八个多样化环境和多个模型家族上的评估表明,我们的方法持续提升了有效性和跨域泛化,凸显了早期经验的价值。此外,在具有可验证奖励的环境中,我们的结果提供了有希望的信号,表明早期经验为后续强化学习奠定了坚实基础,使其成为模仿学习与完全经验驱动智能体之间的实用桥梁。

英文摘要

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios, and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm, we study two strategies of using such data: (1) implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. Evaluation across eight diverse environments and multiple model families shows that our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, making it a practical bridge between imitation learning and fully experience-driven agents.

2510.02837 2026-05-26 cs.AI cs.CL 版本更新

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

超越最终答案:评估工具增强型智能体的推理轨迹

Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park

发表机构 * Graduate School of Data Science, KAIST, Daejeon, South Korea(数据科学研究生院,韩国科学技术院,大田,韩国) Department of Industrial and Systems Engineering, KAIST, Daejeon, South Korea(工业与系统工程系,韩国科学技术院,大田,韩国) Department of Artificial Intelligence, Yonsei University, Seoul, South Korea(人工智能系,延世大学,首尔,韩国)

AI总结 针对工具增强型LLM,提出无参考框架TRACE,通过证据库多维度评估推理轨迹的效率、幻觉和适应性,并用元评估数据集验证其有效性。

Comments International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

尽管最近的工具增强型基准涉及复杂请求,但评估仍局限于答案匹配,忽略了效率、幻觉和适应性等关键轨迹方面。最直接的评估方法是将智能体的轨迹与真实轨迹进行比较,但注释所有有效的真实轨迹成本过高。为此,我们引入TRACE,一个用于工具增强型LLM多维度评估的无参考框架。通过整合一个从先前步骤积累知识的证据库,TRACE有效评估智能体的推理轨迹。为验证我们的框架,我们开发了一个新的元评估数据集,包含多样且有缺陷的轨迹,每个轨迹都标有多方面的性能分数。我们的结果证实,即使使用小型开源LLM,TRACE也能准确评估复杂轨迹。此外,我们应用该方法评估智能体在解决工具增强型任务时产生的轨迹,展示了先前未报告的观察结果及其相应的见解。

英文摘要

Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects like efficiency, hallucination, and adaptivity. The most straightforward method for evaluation is to compare an agent's trajectory with the ground-truth, but annotating all valid ground-truth trajectories is prohibitively expensive. In this manner, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank which accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

2510.02361 2026-05-26 cs.CL cs.AI 版本更新

ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

ChunkLLM: 一种轻量级可插拔的LLM推理加速框架

Haojie Ouyang, Jianwei Lv, Lei Ren, Chen Wei, Xiaojie Wang, Fangxiang Feng

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学信息学院)

AI总结 针对Transformer自注意力二次复杂度导致的推理效率低下问题,提出ChunkLLM框架,通过QK适配器和块适配器实现块选择与压缩,在保持性能的同时显著加速推理。

详情
AI中文摘要

基于Transformer的大模型在自然语言处理和计算机视觉中表现出色,但由于自注意力对输入令牌的二次复杂度,面临严重的计算效率低下问题。最近,研究人员提出了一系列基于块选择和压缩的方法来缓解这一问题,但它们要么存在语义不完整的问题,要么训练-推理效率低下。为了全面解决这些挑战,我们提出了ChunkLLM,一个轻量级且可插拔的训练框架。具体来说,我们引入了两个组件:QK适配器(Q-Adapter和K-Adapter)和块适配器。前者附加在每个Transformer层上,兼具特征压缩和块注意力获取的双重目的。后者在模型的最底层运行,通过利用上下文语义信息来检测块边界。在训练阶段,骨干网络的参数保持冻结,仅QK适配器和块适配器进行训练。值得注意的是,我们设计了一种注意力蒸馏方法来训练QK适配器,这提高了关键块的召回率。在推理阶段,仅当当前令牌被检测为块边界时才触发块选择,从而加速模型推理。我们在涵盖多个任务的多种长文本和短文本基准数据集上进行了实验评估。ChunkLLM不仅在短文本基准上取得了可比的性能,而且在长上下文基准上保持了98.64%的性能,同时保持了48.58%的键值缓存保留率。特别地,在处理120K长文本时,ChunkLLM相比原始Transformer实现了最大4.48倍的加速。

英文摘要

Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due to the self-attention's quadratic complexity with input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they either have issues with semantic incompleteness or poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model, functioning to detect chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered exclusively when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. Particularly, ChunkLLM attains a maximum speedup of 4.48x in comparison to the vanilla Transformer in the processing of 120K long texts.

2510.02327 2026-05-26 cs.CL cs.AI eess.AS 版本更新

KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI

KAME:用于增强实时语音到语音对话AI知识的串联架构

So Kuroki, Yotaro Kubo, Takuya Akiba, Yujin Tang

AI总结 提出一种混合架构,通过实时注入后端LLM的文本响应来增强S2S模型的知识,在保持低延迟的同时提升响应正确性。

Comments Published at IEEE ICASSP 2026

详情
AI中文摘要

实时语音到语音(S2S)模型擅长生成自然、低延迟的对话响应,但往往缺乏深层知识和语义理解。相反,结合自动语音识别、基于文本的大语言模型(LLM)和文本到语音合成的级联系统提供了优越的知识表示,但代价是高延迟,这破坏了自然交互的流畅性。本文介绍了一种新颖的混合架构,弥合了这两种范式之间的差距。我们的框架通过S2S变压器处理用户语音以实现即时响应,同时将查询并发地传递给强大的后端LLM。然后,LLM的基于文本的响应被实时注入以指导S2S模型的语音生成,有效地为其输出注入丰富的知识,而无需承受级联系统的全部延迟惩罚。我们使用MT-Bench基准的语音合成变体(包含多轮问答会话)评估了我们的方法。结果表明,我们的系统在响应正确性上显著优于基线S2S模型,接近级联系统的水平,同时保持了与基线相当的延迟。

英文摘要

Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and semantic understanding. Conversely, cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency, which disrupts the flow of natural interaction. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms. Our framework processes user speech through an S2S transformer for immediate responsiveness while concurrently relaying the query to a powerful back-end LLM. The LLM's text-based response is then injected in real time to guide the S2S model's speech generation, effectively infusing its output with rich knowledge without the full latency penalty of a cascaded system. We evaluated our method using a speech-synthesized variant of the MT-Bench benchmark that consists of multi-turn question-answering sessions. The results demonstrate that our system substantially outperforms a baseline S2S model in response correctness, approaching that of a cascaded system, while maintaining a latency on par with the baseline.

2509.14250 2026-05-26 cs.CL 版本更新

The meaning of prompts and the prompts of meaning: Semiotic reflections and modelling

提示的意义与意义的提示:符号学反思与建模

Martin Thellefsen, Amalia Nurma Dewi, Bent Sorensen

AI总结 本文基于皮尔士符号学三元模型和Dynacom传播模型,将大型语言模型中的提示重新概念化为动态符号现象,强调其作为沟通和认知行为的迭代符号形成与解释过程。

Comments 18 pages, 2 figures

详情
AI中文摘要

本文探讨了大型语言模型(LLMs)中的提示(prompts)和提示工程(prompting)作为动态符号现象,借鉴了皮尔士的符号三元模型、他的九种符号类型以及Dynacom传播模型。目的是将提示重新概念化,不是作为一种技术输入机制,而是作为一种沟通和认知行为,涉及符号形成、解释和精炼的迭代过程。理论基础建立在皮尔士符号学上,特别是再现体(representamen)、对象(object)和解释项(interpretant)之间的相互作用,以及符号的类型学丰富性:性质符号(qualisign)、单一符号(sinsign)、法则符号(legisign);像似符(icon)、指示符(index)、象征符(symbol);呈位(rheme)、述位(dicent)、论位(argument)——以及Dynacom模型中捕捉的解释项三元组。在分析上,本文将LLM定位为一种符号资源,它根据用户提示生成解释项,从而参与共享话语宇宙中的意义创造。研究结果表明,提示是一种符号和沟通过程,重新定义了数字环境中知识的组织、搜索、解释和共建方式。这一视角邀请我们在计算符号学时代重新构想知识组织和信息检索的理论与方法基础。

英文摘要

This paper explores prompts and prompting in large language models (LLMs) as dynamic semiotic phenomena, drawing on Peirce's triadic model of signs, his nine sign types, and the Dynacom model of communication. The aim is to reconceptualize prompting not as a technical input mechanism but as a communicative and epistemic act involving an iterative process of sign formation, interpretation, and refinement. The theoretical foundation rests on Peirce's semiotics, particularly the interplay between representamen, object, and interpretant, and the typological richness of signs: qualisign, sinsign, legisign; icon, index, symbol; rheme, dicent, argument - alongside the interpretant triad captured in the Dynacom model. Analytically, the paper positions the LLM as a semiotic resource that generates interpretants in response to user prompts, thereby participating in meaning-making within shared universes of discourse. The findings suggest that prompting is a semiotic and communicative process that redefines how knowledge is organized, searched, interpreted, and co-constructed in digital environments. This perspective invites a reimagining of the theoretical and methodological foundations of knowledge organization and information seeking in the age of computational semiosis

2508.19988 2026-05-26 cs.CL 版本更新

AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios

AgentCoMa:一个混合常识与数学推理的现实场景组合基准

Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, Marek Rei

发表机构 * Imperial College London(帝国理工学院伦敦分校) RIKEN(日本研究机构) University of Sheffield(谢菲尔德大学) University College London(伦敦大学学院)

AI总结 提出AgentCoMa基准,测试大语言模型在组合常识与数学推理任务上的性能,发现模型在单独步骤上准确率高但组合后平均下降近30%。

Comments ACL 2026

详情
AI中文摘要

大型语言模型(LLMs)在涉及多个推理步骤组合的复杂常识和数学问题上取得了高准确率。然而,当前测试这些技能的组合基准往往侧重于常识或数学推理,而解决现实世界任务的LLM智能体需要两者的结合。在这项工作中,我们引入了一个智能体常识与数学基准(AgentCoMa),其中每个组合任务需要一个常识推理步骤和一个数学推理步骤。我们在61个不同规模、模型家族和训练策略的LLM上进行了测试。我们发现,LLM通常可以孤立地解决这两个步骤,但当两者结合时,它们的准确率平均下降近30%。这比我们在先前组合相同推理类型多个步骤的组合基准中观察到的性能差距要大得多。相比之下,非专家人类标注者可以以同样高的准确率解决AgentCoMa中的组合问题和各个步骤。此外,我们进行了一系列可解释性研究,以更好地理解性能差距,检查了神经元模式、注意力图和成员推断。我们的工作强调了在混合类型组合推理背景下模型脆弱性的显著程度,并为未来的改进提供了一个测试平台。

英文摘要

Large Language Models (LLMs) have achieved high accuracy on complex commonsense and mathematical problems that involve the composition of multiple reasoning steps. However, current compositional benchmarks testing these skills tend to focus on either commonsense or math reasoning, whereas LLM agents solving real-world tasks would require a combination of both. In this work, we introduce an Agentic Commonsense and Math benchmark (AgentCoMa), where each compositional task requires a commonsense reasoning step and a math reasoning step. We test it on 61 LLMs of different sizes, model families, and training strategies. We find that LLMs can usually solve both steps in isolation, yet their accuracy drops by nearly 30% on average when the two are combined. This is a substantially greater performance gap than the one we observe in prior compositional benchmarks that combine multiple steps of the same reasoning type. In contrast, non-expert human annotators can solve the compositional questions and the individual steps in AgentCoMa with similarly high accuracy. Furthermore, we conduct a series of interpretability studies to better understand the performance gap, examining neuron patterns, attention maps and membership inference. Our work underscores a substantial degree of model brittleness in the context of mixed-type compositional reasoning and offers a test bed for future improvement.

2508.11925 2026-05-26 cs.CR cs.CL cs.LG 版本更新

Optimizing Token Choice for Code Watermarking: An RL Approach

优化代码水印的令牌选择:一种强化学习方法

Zhimeng Guo, Huaisheng Zhu, Siyuan Xu, Hangfan Zhang, Teng Xiao, Minhao Cheng

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 提出CodeTracer框架,通过强化学习训练策略模型智能选择令牌嵌入水印,在保持代码功能的同时提高水印可检测性。

Comments ICML 2026, 18 pages, 3 figures

详情
AI中文摘要

保护LLM生成代码的知识产权需要有效的水印系统,该系统能够在代码高度结构化、语法受限的性质中运行。在这项工作中,我们引入了CodeTracer,一种创新的自适应代码水印框架,其基础是一种新颖的强化学习训练范式。其核心是,CodeTracer采用策略驱动方法,利用参数化模型在下一个令牌预测期间智能地偏向令牌选择。该策略确保嵌入的水印保持代码功能,同时表现出与典型令牌分布微妙但统计上可检测的偏差。为了促进策略学习,我们设计了一个全面的奖励系统,将执行反馈与水印嵌入信号无缝集成,平衡过程级和结果级奖励。此外,我们采用Gumbel Top-k重参数化来实现离散水印决策的基于梯度的优化。广泛的比较评估表明,CodeTracer在水印可检测性和生成代码功能保持方面均显著优于最先进的基线。我们的代码可在https://github.com/TimeLovercc/CodeTracer获取。

英文摘要

Protecting intellectual property on LLM-generated code necessitates effective watermarking systems that can operate within code's highly structured, syntactically constrained nature. In this work, we introduce CodeTracer, an innovative adaptive code watermarking framework underpinned by a novel reinforcement learning training paradigm. At its core, CodeTracer features a policy-driven approach that utilizes a parameterized model to intelligently bias token choices during next-token prediction. This strategy ensures that embedded watermarks maintain code functionality while exhibiting subtle yet statistically detectable deviations from typical token distributions. To facilitate policy learning, we devise a comprehensive reward system that seamlessly integrates execution feedback with watermark embedding signals, balancing process-level and outcome-level rewards. Additionally, we employ Gumbel Top-k reparameterization to enable gradient-based optimization of discrete watermarking decisions. Extensive comparative evaluations demonstrate CodeTracer's significant superiority over state-of-the-art baselines in both watermark detectability and the preservation of generated code's functionality. Our code is available at https://github.com/TimeLovercc/CodeTracer.

2507.10593 2026-05-26 cs.SE cs.AI cs.CL cs.LG 版本更新

ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs

ToolRegistry: 一个用于函数调用LLM的协议无关工具管理库

Peng Ding, Rick Stevens

发表机构 * University of Chicago(芝加哥大学) Argonne National Laboratory(阿贡国家实验室)

AI总结 提出ToolRegistry系统,通过统一工具对象和注册表实现协议无关的工具管理,支持多种传输协议、可插拔后端和高级功能,显著减少集成代码并提升吞吐量。

Comments 16 pages, 4 figures, v3: add co-author, permission system, progressive tool disclosure, think-augmented calling, RPC framing, multi-provider support

详情
AI中文摘要

每个LLM工具调用在结构上都是一个RPC——一个函数名、JSON参数和序列化结果——然而每个协议(原生Python、MCP、OpenAPI、LangChain)都是从零开始集成的。我们提出ToolRegistry,一个使这种RPC本质显式化的系统:一个单一的Tool对象充当通用存根,无论传输方式如何,而注册表则作为RPC客户端运行时,负责调度、模式生成和执行。该系统以三个包的形式发布——一个核心注册表、一个通过MCP和OpenAPI暴露工具的服务器,以及一个生产就绪实现的中心——并通过可插拔的线程或进程后端调用工具。该系统现在还提供基于标签的权限策略、针对大型注册表的BM25F驱动的渐进式工具披露、增强思考的函数调用、多提供商模式支持(OpenAI、Anthropic、Gemini)、声明式JSONC/YAML配置,以及一个基于仅stdlib内置模块的近乎零依赖的核心。在我们的基准测试中,该库将集成代码减少了60-80%,并且为给定工作负载选择正确的并发模式(线程与进程)相比替代方案可带来高达3.1倍的吞吐量。ToolRegistry在https://github.com/Oaklight/ToolRegistry开源;文档位于https://toolregistry.readthedocs.io/。

英文摘要

Every LLM tool call is structurally an RPC -- a function name, JSON arguments, and a serialized result -- yet each protocol (native Python, MCP, OpenAPI, LangChain) is integrated from scratch. We present ToolRegistry, a system that makes this RPC nature explicit: a single Tool object acts as a universal stub regardless of transport, while the registry serves as the RPC client runtime for dispatch, schema generation, and execution. The system ships as three packages -- a core registry, a server exposing tools over MCP and OpenAPI, and a hub of production-ready implementations -- and invokes tools through pluggable thread or process backends. The system now also provides tag-based permission policies, BM25F-powered progressive tool disclosure for large registries, think-augmented function calling, multi-provider schema support (OpenAI, Anthropic, Gemini), declarative JSONC/YAML configuration, and a near-zero-dependency core built on stdlib-only vendored modules. In our benchmarks the library cuts integration code by 60-80%, and choosing the right concurrency mode (thread vs. process) yields up to 3.1x throughput over the alternative for a given workload. ToolRegistry is open-source at https://github.com/Oaklight/ToolRegistry; documentation lives at https://toolregistry.readthedocs.io/.

2507.05890 2026-05-26 cs.CL cs.AI 版本更新

Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

使用具有特质-反应中介的虚拟受访者进行心理测量项目验证

Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院) Department of Communication, Seoul National University(首尔国立大学通信系) Interdisciplinary Program in Artificial Intelligence, Seoul National University(首尔国立大学人工智能跨学科项目)

AI总结 提出一种利用LLM模拟虚拟受访者(通过中介因素)来高效验证心理测量项目效度的框架,实验证明该方法能有效识别高有效性项目。

Comments This paper has been accepted for publication at TACL 2026

详情
AI中文摘要

随着心理测量调查越来越多地用于评估大型语言模型(LLM)的特质,对适用于LLM的可扩展调查项目生成的需求也随之增长。这里的一个关键挑战是确保生成项目的构念效度,即它们是否真正测量了预期的特质。传统上,这需要昂贵的大规模人类数据收集。为了提高效率,我们提出了一个使用LLM进行虚拟受访者模拟的框架。我们的核心思想是考虑中介因素:通过它们,相同的特质可能对调查项目产生不同的反应。通过模拟具有不同中介因素的受访者,我们识别出那些在这些中介因素中与预期特质稳健相关的调查项目。在三种心理特质理论(大五人格、施瓦茨价值观、VIA性格优势)上的实验表明,我们的中介生成方法和模拟框架有效地识别了高有效性项目。LLM展示了从特质定义生成合理中介因素以及模拟受访者行为以进行项目验证的能力。我们的问题表述、指标、方法和数据集为成本效益高的调查开发以及更深入地理解LLM如何模拟人类调查反应开辟了新方向。我们发布数据集和代码以支持未来工作。

英文摘要

As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that yield responses robustly correlated with intended traits across these mediators. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-efficient survey development and a deeper understanding of how LLMs simulate human survey responses. We release our dataset and code to support future work.

2506.19037 2026-05-26 cs.CL cs.AI cs.IT cs.LG cs.NE math.IT 版本更新

Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

速度规划:用于掩码扩散语言模型的膨胀调度

Omer Luxembourg, Haim Permuter, Eliya Nachmani

发表机构 * School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beersheba, Israel(电气与计算机工程学院,内盖夫本· Gurion大学,贝尔谢巴,以色列)

AI总结 提出膨胀解掩码调度器(DUS),通过将序列位置划分为非相邻的膨胀组并并行解掩码,最小化联合熵增益上界,在不修改去噪器的情况下实现高达5.8倍加速。

Comments Accepted at ICML 2026

详情
AI中文摘要

掩码扩散语言模型(MDLM)承诺快速、非自回归的文本生成,然而现有的采样器根据模型置信度选择要解掩码的标记,忽略了并行解掩码多个位置时的交互,实际上退化为缓慢的自回归行为。我们提出了膨胀解掩码调度器(DUS),这是一种仅推理、无需规划模型的方法,它将序列位置划分为非相邻的膨胀组,并并行解掩码,以在每个去噪步骤中最小化联合熵增益的上界。通过明确权衡网络调用次数与生成质量,DUS恢复了传统并行解掩码策略下丢失的大部分性能。在数学(GSM8K, MATH500)、代码(HumanEval, MBPP)、通用知识(BBH, MMLU-Pro)和指令遵循(IFEval)基准测试中,DUS优于基于置信度的规划器,并将扩散特有的质量-速度权衡转化为由块大小$B$确定的确定性、可预测的加速,与逐标记MDLM解码相比,实现了高达5.8倍的墙钟加速,而无需修改底层去噪器。作为即插即用的后滤波器,膨胀间隔也改进了自适应采样器。代码可在https://github.com/omerlux/DUS获取。

英文摘要

Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP), general-knowledge (BBH, MMLU-Pro), and instruction following (IFEval) benchmarks, DUS outperforms confidence-based planners and turns the diffusion-specific quality-speed trade-off into a deterministic, predictable speedup set by the block size $B$, yielding up to $5.8\times$ wall-clock speedup over token-by-token MDLM decoding without modifying the underlying denoiser. Applied as a drop-in post-filter, dilated spacing also improves adaptive samplers. Code is available at https://github.com/omerlux/DUS.

2506.17629 2026-05-26 cs.CV cs.AI cs.CL 版本更新

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

CLiViS: 通过语言-视觉协同释放认知地图用于具身视觉推理

Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang

发表机构 * School of Computer Science and Technology, East China Normal University(东华大学计算机科学与技术学院) King Abdullah University of Science and Technology(科廷大学) Fudan University(复旦大学)

AI总结 提出CLiViS框架,通过LLM进行高层任务规划并协调VLM驱动的开放世界视觉感知,构建动态认知地图以迭代更新场景上下文,实现无需训练的具身视觉推理。

详情
AI中文摘要

具身视觉推理(EVR)旨在基于自我中心视频遵循复杂、自由形式的指令,从而在动态环境中实现语义理解和时空推理。尽管具有潜力,EVR面临复杂指令多样性和长期自我中心视频中复杂时空动态的挑战。现有解决方案要么在静态视频描述上使用大型语言模型(LLM),这通常会遗漏关键视觉细节,要么依赖端到端视觉语言模型(VLM),后者在逐步组合推理上存在困难。考虑到LLM在推理和VLM在感知方面的互补优势,我们提出了CLiViS。这是一个新颖的无训练框架,利用LLM进行高层任务规划,并协调VLM驱动的开放世界视觉感知,以迭代更新场景上下文。基于这种协同,CLiViS的核心是一个动态认知地图,它在推理过程中不断演化。该地图构建了具身场景的结构化表示,连接了低层感知和高层推理。跨多个基准的大量实验证明了CLiViS的有效性和通用性,特别是在处理长期视觉依赖方面。代码可在 https://github.com/Teacher-Tom/CLiViS 获取。

英文摘要

Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Consider the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS. It is a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.

2504.04639 2026-05-26 cs.CC cs.CL cs.DS cs.LO 版本更新

Ineffectiveness for Search and Undecidability of PCSP Meta-Problems

PCSP元问题的搜索无效性和不可判定性

Alberto Larrauri

发表机构 * University of Oxford(牛津大学) Durham University(杜伦大学) University of Zaragoza(阿拉瓦大学)

AI总结 本文研究承诺约束满足问题(PCSP)的搜索版本与决策版本是否等价,证明BLP、AIP等算法在搜索版本中无效,并基于代数方法给出搜索无效的充分条件,进而证明相关元问题的不可判定性。

详情
AI中文摘要

承诺CSP的搜索版本和决策版本是否等价是一个开放问题。大多数已知的PCSP算法仅解决其\emph{决策}变体,并且不知道它们是否也能适应解决\emph{搜索}变体。主要方法称为BLP、AIP和BLP+AIP,通过寻找某个整数规划松弛的解来处理PCSP。我们证明将这些解舍入到适当的搜索证书可以像TFNP类中的任何问题一样困难。换句话说,这些算法对于搜索是无效的。基于PCSP的代数方法,我们找到了暗示搜索无效性的充分条件。我们的工具适用于以适当方式由小团体刻画的算法,也可用于证明元问题的不可判定性结果。通过这种方式,我们展示了通过BLP、AIP和BLP+AIP可解的模板族是不可判定的。使用相同的技术,我们还分析了几个已知保证有限模板CSP可处理性的代数条件。我们证明,与循环同态和WNU相关的几个元问题对于PCSP是不可判定的。特别地,没有算法可以判定一个有限PCSP模板(1)是否允许循环同态,(2)是否允许WNU。

英文摘要

It is an open question whether the search and decision versions of promise CSPs are equivalent. Most known algorithms for PCSPs solve only their \emph{decision} variant, and it is unknown whether they can be adapted to solve \emph{search} as well. The main approaches, called BLP, AIP and BLP+AIP, handle a PCSP by finding a solution to a relaxation of some integer program. We prove that rounding those solutions to a proper search certificate can be as hard as any problem in the class TFNP. In other words, these algorithms are ineffective for search. Building on the algebraic approach to PCSPs, we find sufficient conditions that imply ineffectiveness for search. Our tools are tailored to algorithms that are characterized by minions in a suitable way, and can also be used to prove undecidability results for meta-problems. This way, we show that the families of templates solvable via BLP, AIP, and BLP+AIP are undecidable. Using the same techniques we also analyze several algebraic conditions that are known to guarantee the tractability of finite-template CSPs. We prove that several meta-problems related to cyclic polymorphims and WNUs are undecidable for PCSPs. In particular, there is no algorithm deciding whether a finite PCSP template (1) admits cyclic a polymorphism, (2) admits a WNU.

2502.11167 2026-05-26 cs.LG cs.CL 版本更新

SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

SURGE: 大型语言模型作为通用代理代码执行器的潜力

Bohan Lyu, Siqiao Huang, Zichen Liang

发表机构 * Department of Computer Science and Technology, Tsinghua(清华大学计算机科学与技术系) Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua(清华大学交叉信息研究院)

AI总结 提出SURGE基准,包含1160个问题覆盖8个关键方面,通过评估21个开源和专有LLM,研究其作为代码执行预测代理模型的可行性、扩展律、数据效率和预测准确性。

详情
Journal ref
Proceedings of The 2025 Conference on Empirical Methods in Natural Language Processing
AI中文摘要

神经代理模型是数据挖掘中强大且高效的工具。同时,大型语言模型(LLM)在代码相关任务(如生成和理解)中展示了卓越的能力。然而,一个同样重要但尚未充分探索的问题是,LLM是否可以作为代码执行预测的代理模型。为了系统研究这一问题,我们引入了SURGE,一个包含1160个问题的综合基准,覆盖8个关键方面:多语言编程任务、竞赛级编程问题、仓库级代码分析、高成本科学计算、时间复杂度密集型算法、有缺陷代码分析、依赖特定编译器或执行环境的程序,以及形式化数学证明验证。通过对21个开源和专有LLM的广泛分析,我们研究了扩展律、数据效率和预测准确性。我们的发现揭示了LLM作为计算过程高效代理的可行性的重要见解。基准和评估框架可在https://github.com/Imbernoulli/SURGE获取。

英文摘要

Neural surrogate models are powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as generation and understanding. However, an equally important yet underexplored question is whether LLMs can serve as surrogate models for code execution prediction. To systematically investigate it, we introduce SURGE, a comprehensive benchmark with $1160$ problems covering $8$ key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings reveal important insights about the feasibility of LLMs as efficient surrogates for computational processes. The benchmark and evaluation framework are available at https://github.com/Imbernoulli/SURGE.

2208.14882 2026-05-26 cs.MM cs.CL cs.CV cs.IR 版本更新

Hierarchical Local-Global Transformer for Temporal Sentence Grounding

层次化局部-全局Transformer用于时间语句定位

Xiang Fang, Daizong Liu, Pan Zhou, Zichuan Xu, Ruixuan Li

发表机构 * Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology(大数据安全湖北工程研究中心,华中科技大学网络安全科学与工程学院) Wangxuan Institute of Computer fTechnology, Peking University(王宣计算机技术研究院,北京大学) School of software, Dalian University of Technology(软件学院,大连理工大学) School of Computer Science, and Technology, Huazhong University of Science, and Technology(计算机科学与技术学院,华中科技大学)

AI总结 提出层次化局部-全局Transformer(HLGT),通过建模视频和查询的不同粒度层次及跨模态交互,实现更细粒度的多模态表示,并在三个数据集上取得最先进性能。

Comments Publish in IEEE Transactions on Multimedia

详情
AI中文摘要

本文研究多媒体问题中的时间语句定位(TSG),旨在根据给定的句子查询准确确定未修剪视频中的特定视频片段。传统的TSG方法主要遵循自上而下或自下而上的框架,且不是端到端的,严重依赖耗时的后处理来优化定位结果。最近,一些基于Transformer的方法被提出,以高效有效地建模视频和查询之间的细粒度语义对齐。尽管这些方法在一定程度上取得了显著性能,但它们将视频帧和查询词等同视为Transformer输入进行关联,未能捕捉它们不同粒度的不同语义。为解决这一问题,本文提出了一种新颖的层次化局部-全局Transformer(HLGT),利用这种层次信息并建模不同粒度层次和不同模态之间的交互,以学习更细粒度的多模态表示。具体来说,我们首先将视频和查询分割成单独的片段和短语,通过时间Transformer学习它们的局部上下文(相邻依赖)和全局相关性(长距离依赖)。然后,引入全局-局部Transformer来学习局部级和全局级语义之间的交互,以实现更好的多模态推理。此外,我们开发了一种新的跨模态循环一致性损失,以强制两个模态之间的交互并鼓励它们之间的语义对齐。最后,我们设计了一种全新的跨模态并行Transformer解码器,用于整合编码的视觉和文本特征以进行最终定位。在三个具有挑战性的数据集上进行的大量实验表明,我们提出的HLGT实现了新的最先进性能。

英文摘要

This paper studies the multimedia problem of temporal sentence grounding (TSG), which aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query. Traditional TSG methods mainly follow the top-down or bottom-up framework and are not end-to-end. They severely rely on time-consuming post-processing to refine the grounding results. Recently, some transformer-based approaches are proposed to efficiently and effectively model the fine-grained semantic alignment between video and query. Although these methods achieve significant performance to some extent, they equally take frames of the video and words of the query as transformer input for correlating, failing to capture their different levels of granularity with distinct semantics. To address this issue, in this paper, we propose a novel Hierarchical Local-Global Transformer (HLGT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities for learning more fine-grained multi-modal representations. Specifically, we first split the video and query into individual clips and phrases to learn their local context (adjacent dependency) and global correlation (long-range dependency) via a temporal transformer. Then, a global-local transformer is introduced to learn the interactions between the local-level and global-level semantics for better multi-modal reasoning. Besides, we develop a new cross-modal cycle-consistency loss to enforce interaction between two modalities and encourage the semantic alignment between them. Finally, we design a brand-new cross-modal parallel transformer decoder to integrate the encoded visual and textual features for final grounding. Extensive experiments on three challenging datasets show that our proposed HLGT achieves a new state-of-the-art performance.

2605.25244 2026-05-26 cs.CL 版本更新

Inference Time Optimization with Confidence Dynamics

基于置信度动态的推理时优化

Yu Wang, Minghao Liu, Jiayun Wang, Jinrui Huang, Ankit Shah, Wei Wei

发表机构 * Center for Advanced AI, Accenture(Accenture高级人工智能中心)

AI总结 本文通过观察推理轨迹中置信度的动态变化,发现正确轨迹置信度上升而错误轨迹下降,据此提出置信度动态增益投票方法,显著提升大语言模型推理性能。

Comments Published in ICML 2026

详情
AI中文摘要

推理时优化技术(如重复采样)显著提升了大语言模型(LLMs)的推理能力。然而,模型不确定性在这些优化策略中的关键作用仍未被充分探索。本文研究了沿推理轨迹的置信度动态,并首次揭示了一个令人惊讶且独特的模式:正确回答轨迹倾向于随时间表现出置信度提升(正置信度增益),而错误轨迹在推理过程中置信度减弱或下降。基于这一观察,我们提出了基于置信度动态增益(CDG)的投票方法,该方法融入了响应置信度轨迹沿推理链的演化方式。在AIME24/25、HMMT25和BRUMO25基准测试上,针对四种开源架构(DeepSeek-R1、gpt-oss、Gemma-3、Qwen-QwQ)的实验表明,CDG相比基线取得了显著的性能提升。这些结果证明,我们的方法为改进LLM推理中的答案选择提供了稳健的判别信号。我们还为这一现象提供了理论见解。代码将在https://github.com/Accenture/CDG.git发布。

英文摘要

Inference time optimization techniques, such as repeated sampling, have significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, the critical role of model uncertainty remains largely underexplored in these optimization strategies. In this paper, we investigate the dynamics of confidence along reasoning trajectories and for first time reveal a surprising and unique pattern: correct answer traces tend to exhibit confidence improvement over time (positive confidence gain), while incorrect traces show attenuated or declining confidence as reasoning proceeds. Based on this observation, we propose Confidence Dynamic Gain (CDG) based voting, which incorporates how the confidence trajectory of the response evolves along the reasoning chain. Experiments across four open-source architectures (DeepSeek-R1, gpt-oss, Gemma-3, Qwen-QwQ) on the AIME24/25, HMMT25, and BRUMO25 benchmarks demonstrate that CDG yields a significant performance boost over baselines. These results demonstrate that our method provides a robust discriminative signal for improving answer selection in LLM reasoning. We also provide theoretical insights for this phenomenon. Code will be released at https://github.com/Accenture/CDG.git.

2605.25226 2026-05-26 cs.CL 版本更新

From Automation to Collaboration: Human-in-the-Loop Methods for Safe and Trustworthy NLP

从自动化到协作:面向安全可信NLP的人机协同方法

Most. Sharmin Sultana Samu, MD. Tanvir Ahmed Seum, Md. Rakibul Islam

发表机构 * Department of Computer Science and Engineering, BRAC University(布拉克大学计算机科学与工程系) Department of Electrical and Electronic Engineering, Rajshahi University of Engineering and Technology(拉贾沙希工程与技术大学电子与电气工程系)

AI总结 本文综述了人机协同方法,通过人类监督支持审计、鲁棒性评估、数据构建和模型引导,以提升NLP在安全可信方面的表现,并指出了可扩展探测、可持续鲁棒性基准、低资源设置和私有系统治理等方面的差距。

Comments Preprint, manuscript under review

详情
AI中文摘要

大型语言模型广泛部署在高风险的NLP任务中,但偏见、幻觉、对抗性脆弱性和不可靠的泛化等风险仍然存在。基于探测的审计揭示了模型行为的不一致性。对抗性文本生成发现了鲁棒性差距,特别是在基准有限的低资源语言中。企业文本到SQL设置暴露了在私有和大规模数据库上验证输出的困难。人类监督对于探测验证、对抗性验证和领域特定标注至关重要,但成本高昂且难以扩展。本综述考察了最近的人机协同方法,这些方法将NLP从自动化转向协作,以实现安全性和可信度。我们回顾了人类专业知识如何支持审计、鲁棒性评估、数据构建和模型引导。我们的发现强调了可扩展探测、可持续鲁棒性基准、低资源设置和私有系统治理方面的差距。我们概述了自适应审计、协作评估和负责任部署的实用研究方向。

英文摘要

Large language models are widely deployed in high-stakes NLP tasks, yet risks such as bias, hallucination, adversarial vulnerability and unreliable generalization remain. Probe-based auditing reveals inconsistencies in model behavior. Adversarial text generation uncovers robustness gaps, especially in lower-resourced languages with limited benchmarks. Enterprise text-to-SQL settings expose the difficulty of validating outputs over private and large-scale databases. Human supervision is essential for probe validation, adversarial verification and domain-specific annotation, but it is costly and hard to scale. This survey examines recent human-in-the-loop methods that shift NLP from automation toward collaboration for safety and trustworthiness. We review how human expertise supports auditing, robustness evaluation, data construction and model steering. Our findings highlight gaps in scalable probing, sustainable robustness benchmarks, low-resource settings and governance of private systems. We outline practical research directions for adaptive auditing, collaborative evaluation and accountable deployment.

2605.25208 2026-05-26 cs.CL 版本更新

They Are Not the Same: Direct Causes Are Not Grounded Emotion Explanations

它们并不相同:直接原因并非基于情感解释

Zhuangzhuang Pan, Yan Xia, Chee Seng Chan

发表机构 * Universiti Malaya, Malaysia(马来亚大学) Suzhou University of Technology, China(苏州科技学院) VinUniversity, Vietnam(Vin 大学)

AI总结 本文通过IEMO-MECP数据集分析发现,情感-原因对提取(ECPE)任务中的二元分类代理只能有效提取直接触发原因,而无法提供基于证据的情感解释,因为情感上下文(emo-context)在二元边界处被忽略,且模型在捷径压力下倾向于选择便利归因而非真实解释。

Comments 25 pages, 11 figures, 24 tables. Preprint

详情
AI中文摘要

情感-原因对提取(ECPE)旨在解释情感为何发生,但该目标现在常被简化为二元对/非对预测。这一代理对于直接原因提取有用,但容易被过度解读为基于证据的情感解释。我们表明这种解释仅部分有效。在IEMO-MECP中,90.9%的原始正例仍为情感-原因对,95.0%的原始负例仍为非对,证实了二元ECPE任务在很大程度上得以保留。问题在于,仅直接触发因素并不构成基于的解释。情感上下文(emo-context),即有助于解释目标情感但不直接导致该情感的语句,出现在原始边界的双侧,并在二元不确定性附近富集,表明二元边界对此类话语证据没有稳定位置。在评估的ECPE模型中,直接触发因素的恢复比上下文支持更可靠。在捷径压力下,这种不平衡变得显著。二元训练模型对附近词汇相似的非对候选者分配的对分数高于对证据支持但结构上更困难的情感-原因和情感-上下文对。因此,对分数可能奖励便利归因而非基于的解释。高二元ECPE性能表明模型能识别直接触发因素,但并不表示模型已解释情感。代码公开于https://github.com/panzhzh/ECPExsame。

英文摘要

Emotion-Cause Pair Extraction (ECPE) was introduced to explain why an emotion occurs, but this goal is now often reduced to binary pair/non-pair prediction. This proxy is useful for direct-cause extraction, yet easy to over-read as evidence grounded emotion explanation. We show that this interpretation is only partially valid. In IEMO-MECP, 90.9% of original positives remain emo-cause and 95.0% of original negatives remain non-pair, confirming that the binary ECPE task is largely preserved. The problem is that direct triggers alone do not constitute a grounded explanation. Emo-context, an utterance that helps interpret a target emotion without directly causing it, appears on both sides of the original boundary and is enriched near binary uncertainty, showing that the binary boundary has no stable place for such discourse evidence. Across evaluated ECPE models, direct triggers are recovered more reliably than contextual support. Under shortcut pressure, this imbalance becomes consequential. Binary-trained models assign higher pair scores to nearby lexically similar non-pair candidates than to evidence supported but structurally harder emo-cause and emo-context pairs. Thus, pair scores can reward convenient attributions over grounded explanations. High binary ECPE performance indicates that a model can identify direct triggers; it does not indicate that the model has explained the emotion. Code is publicly available at https://github.com/panzhzh/ECPExsame.

2605.25204 2026-05-26 cs.CL 版本更新

Clarification Is Not Enough: Post-Clarification Answering Remains the Bottleneck in Multi-Turn QA

澄清不够:澄清后的回答仍是多轮问答中的瓶颈

Jinyan Su, Jennifer Healey

发表机构 * Cornell University(康奈尔大学) Adobe Research(Adobe研究)

AI总结 本文通过分解多轮问答为澄清策略和澄清后回答两个组件,利用PACIFIC基准实验发现监督微调能快速提升澄清策略,但最终答案准确率仍显著偏低,表明理解并正确解释用户回应是关键瓶颈。

详情
AI中文摘要

多元对齐要求系统适应不同的用户价值观、沟通风格和上下文假设。我们认为,这种对齐的基础前提是,当用户意图不明确或模糊时,能够从用户那里准确引出偏好。我们通过将问题分解为两个组件来研究多轮问答中的偏好引出问题:一个 extbf{澄清策略},决定是提出澄清问题还是直接回答;以及 extbf{澄清后回答},在缺失信息提供后产生正确的最终答案。我们使用PACIFIC基准表明,监督微调能快速改善澄清策略,然而,即使模型采取了正确的行动,最终答案的准确率仍然显著较低。这一差距表明,理解并正确解释用户的回应是多轮问答系统中的关键瓶颈。

英文摘要

Pluralistic alignment requires systems to adapt to diverse user values, communication styles, and contextual assumptions. We believe that a foundational prerequisite for such alignment enabling accurate preference elicitation from people when their intent is under-specified or ambiguous. We study the problem of preference elicitation in multi-turn question answering by decomposing the problem into two components: a \textbf{clarification policy}, which decides whether to ask a clarifying question or answer directly, and \textbf{post-clarification answering}, which produces the correct final answer once the missing information is provided. We show, using the PACIFIC benchmark, that supervised fine-tuning rapidly improves the clarification policy, however, final answer accuracy remains substantially lower even when the model takes the correct action. This gap indicates that understanding and correctly interpreting the user's response is the critical gap in multi-turn question-answering systems.

2605.25189 2026-05-26 cs.LG cs.CL 版本更新

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

方向对齐缓解语言模型强化学习中的奖励黑客问题

Wenlong Deng, Jiaji Huang, Kaan Ozkara, Yushu Li, Christos Thrampoulidis, Xiaoxiao Li, Youngsuk Park

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Vector Institute(向量研究所) Amazon(亚马逊)

AI总结 通过分析强化学习更新的几何结构,发现奖励黑客源于优化偏离稳定低维学习轨迹,提出可信方向投影方法约束梯度在干净参考子空间内,延迟捷径利用并保持任务性能。

详情
AI中文摘要

当模型通过利用捷径而非解决预期任务来改进代理奖励时,就会出现奖励黑客问题。我们通过语言模型中强化学习更新的几何结构来研究这种失败模式,并认为当优化偏离稳定的低维学习轨迹时,黑客行为就会出现。我们通过参数更新的主导奇异方向分析了这种漂移,并表明奖励黑客运行比干净运行表现出显著更大的方向变化。基于这一观察,我们引入了可信方向投影,它约束梯度保持在干净参考子空间内。在数学推理的奖励黑客实验中,所提出的方法延迟了捷径利用并更好地保持了任务性能。

英文摘要

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.

2605.25186 2026-05-26 cs.CL cs.AI 版本更新

By Their Fruits You Will Know Them: Comparing Formalizations of Law by the Decisions They Encode

凭其果实,你们将认识它们:通过编码的决策比较法律的形式化

Julius Vernie, Matthias Grabmair

发表机构 * Technical University of Munich(慕尼黑技术大学)

AI总结 提出一种系统方法,通过SAT求解器枚举不同形式化在边缘案例上的分歧,并转化为具体事实场景,以比较同一法律条款的不同形式化,应用于九个前沿LLM生成的十个欧盟条款形式化,发现行为分歧与结构一致性基本不相关。

Comments 23 pages, 17 figures, submitted to EMNLP PROC 2026

详情
AI中文摘要

将法律条款形式化有望实现机器可访问的法律和自动化法律推理,而最近的LLM使得直接从法规文本生成这种形式化变得诱人。然而,任何形式化都会做出隐含的解释选择,其后果难以预料,尤其是当LLM是作者时。我们提出了一种方法,通过它们在个别案例上的推理,系统地比较同一法律条款的不同形式化。给定一个条款的多个形式化,我们在节点级别匹配它们,从匹配中为每对推导出一个共享接口,并使用SAT求解器枚举任意两个形式化存在分歧的边缘案例。然后将选定的边缘案例转化为具体的事实场景,供法律专家检查并采取行动。我们将该方法应用于九个前沿LLM生成的十个欧盟条款的形式化。我们发现,形式化之间的行为分歧与其结构一致性基本不相关,并且口头化的案例揭示了定性的不同分歧类型,包括反映法律评论中真实争议的分歧。

英文摘要

Formalizing legal provisions promises machine-accessible law and automated legal reasoning, and recent LLMs make it tempting to generate such formalizations directly from statutory text. However, any formalization makes implicit interpretive choices whose consequences are hard to anticipate, especially if an LLM is the author. We present a method for systematically comparing different formalizations of the same legal provision by their inferences on individual cases. Given multiple formalizations of a provision, we match them at the node level, derive a shared interface for each pair from the matching, and use a SAT solver to enumerate the edge cases on which any two formalizations disagree. Selected edge cases are then verbalized into concrete factual scenarios that a legal expert can examine and act on. We apply our method to formalizations of ten EU provisions generated by nine frontier LLMs. We find that behavioral divergence between formalizations is essentially uncorrelated with their structural agreement and that the verbalized cases reveal qualitatively distinct types of disagreement, including divergences that mirror genuine controversies in the legal commentary.

2605.25179 2026-05-26 cs.CL 版本更新

Locality Matters for Training-Free Audio Token Compression in Audio-Language Models

局部性对音频-语言模型中免训练音频令牌压缩的重要性

Jiale Luo, Xiaoyu Liang, Haoji Hu

发表机构 * Zhejiang University(浙江大学)

AI总结 提出局部时间二分图合并(LTBM)方法,通过显式时间窗口约束合并相似邻近音频令牌,实现免训练的编码器空间压缩,并验证了局部性归纳偏置在音频令牌压缩中的任务依赖性优势。

Comments Preprint. 8 pages main text, 10 pages total

详情
AI中文摘要

音频-语言模型(ALMs)越来越多地用于音频字幕生成、问答和开放式音频理解,但当音频输入表示为长前缀令牌序列时,其推理成本仍然很高。这些音频前缀消耗上下文预算,增加内存使用,并在资源受限或延迟敏感的环境中使部署更加困难。现有的免训练音频令牌缩减方法主要依赖于固定池化或基于分数的剪枝。固定池化是内容无关的,而基于分数的剪枝可以保留孤立的显著令牌但丢弃附近的声学上下文。我们提出局部时间二分图合并(LTBM),一种免训练的编码器空间压缩方法,在显式时间窗口约束下合并相似的邻近音频令牌。除了引入LTBM,我们还使用受控的全局合并变体来隔离时间局部性本身是否是音频令牌压缩的有用归纳偏置。在AudioCaps、Clotho和MMAU上使用Qwen2-Audio进行的实验显示了任务依赖的局部性效应:在几种压缩设置下,尤其是更强压缩下,局部感知合并更有利于字幕生成,而全局匹配在多项选择音频理解中更具竞争力。在Audio Flamingo 3上的跨骨干验证进一步支持了在适度和激进压缩下局部感知合并的字幕生成优势。

英文摘要

Audio-language models (ALMs) are increasingly used for audio captioning, question answering, and open-ended audio understanding, but their inference cost remains high when audio inputs are represented as long prefix-token sequences. These audio prefixes consume context budget, increase memory usage, and make deployment harder in resource-constrained or latency-sensitive settings. Existing training-free audio-token reduction methods mainly rely on fixed pooling or score-based pruning. Fixed pooling is content-agnostic, while score-based pruning can preserve isolated salient tokens but discard nearby acoustic context. We propose Local Temporal Bipartite Merging (LTBM), a training-free encoder-space compression method that merges similar nearby audio tokens under an explicit temporal window constraint. Beyond introducing LTBM, we use a controlled Global Merge variant to isolate whether temporal locality itself is a useful inductive bias for audio-token compression. Experiments on AudioCaps, Clotho, and MMAU with Qwen2-Audio show evidence of a task-dependent locality effect: locality-aware merging is more favorable for captioning at several compression settings, especially under stronger compression, while global matching is more competitive for multiple-choice audio understanding. A cross-backbone validation on Audio Flamingo 3 further supports the captioning-side advantage of locality-aware merging under moderate and aggressive compression.

2605.25162 2026-05-26 cs.CL cs.AI 版本更新

STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media

STREAM:一个以数据为中心的框架,用于从流媒体中挖掘高价值任务导向对话

Liang Xue, Haoyu Liu, Cheng Wang, Pengyu Chen, Haozhuo Zheng, Yang Liu

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Byering Technology(伯英技术)

AI总结 提出STREAM框架,利用流媒体数据合成大规模多领域任务导向对话数据集StreamDial,通过角色构建和对话蓝图结合RAG生成高质量对话,解决数据稀缺问题。

详情
AI中文摘要

垂直领域的大语言模型受到复杂、特定领域任务导向对话稀缺的瓶颈。现有的数据获取管道面临持续的三难困境:专家标注昂贵,真实服务对话受隐私和商业限制,静态语料库很快过时。我们提出Stream,一个以数据为中心的框架,利用公开可用的流媒体(直播和短视频)大规模合成高价值服务对话。Stream从嘈杂的流中挖掘真实的交互信号,并通过将基于角色的个性构建与对话蓝图构建相结合来合成对话;它进一步采用检索增强生成(RAG)来支持知识感知的响应。基于Stream,我们发布了StreamDial,一个覆盖汽车、餐厅和酒店的大规模多领域数据集。StreamDial总共包含87,498个对话会话和1,497,320轮次,平均每个会话17.11轮,各领域规模相当。每个会话被组织为结构化四元组⟨P_u, P_a, B, H⟩,将对话历史与明确的用户/代理角色和对话蓝图配对,捕捉真实服务行为,如需求挖掘、约束冲突、协商和恢复。使用自动评估和下游任务的评估表明,StreamDial在强基线上提高了内在对话质量,使用StreamDial训练的模型在多个骨干网络上改进了对话状态跟踪;我们进一步报告了完整的人工评估集,并在受控训练预算下在Qwen3-8B上实现了令人鼓舞的多语言迁移。数据发布在https://github.com/hitxueliang/DialogDataSetBySTREAM。

英文摘要

Large language models for vertical domains are bottlenecked by the scarcity of complex, domain-specific task-oriented dialogues. Existing data acquisition pipelines face a persistent trilemma: expert annotation is expensive, real-world service conversations are constrained by privacy and commercial restrictions, and static corpora quickly become temporally stale. We propose Stream, a data-centric framework that leverages publicly available streaming media (live streams and short videos) to synthesize high-value service dialogues at scale. Stream mines authentic interaction signals from noisy streams and synthesizes conversations by integrating role-grounded persona construction with Conversational Blueprint construction; it further adopts retrieval-augmented generation (RAG) to support knowledge-aware responses. Based on Stream, we release StreamDial, a large-scale multi-domain dataset covering Automotive, Restaurant, and Hotel. StreamDial contains 87,498 dialogue sessions and 1,497,320 turns in total, with an average of 17.11 turns per session and a comparable scale across domains. Each session is organized as a structured quadruplet $\langle P_u, P_a, B, H \rangle$ that pairs dialogue history with explicit user/agent personas and a Conversational Blueprint, capturing realistic service behaviors such as requirement mining, constraint conflicts, negotiation, and recovery. Evaluations with automatic judges and downstream tasks show that StreamDial improves intrinsic dialogue quality over strong baselines, and models trained with StreamDial improve Dialogue State Tracking across backbones; we further report a completed human-evaluation set and encouraging multilingual transfer on Qwen3-8B under a controlled training budget. The data is released in https://github.com/hitxueliang/DialogDataSetBySTREAM.

2605.25141 2026-05-26 cs.CL cs.AI 版本更新

LLM Agent Based Renewable Energy Forecasting Using Edge and IoT Data A Review of Solar Wind Weather and Grid Aware Decision Support

基于LLM Agent的利用边缘和物联网数据的可再生能源预测:太阳能、风能、天气和电网感知决策支持综述

Pavan Manjunath, Thomas Pruefer

发表机构 * Independent Researcher(独立研究员)

AI总结 本文综述了如何利用大语言模型代理整合异构传感器流、天气API数据、历史发电记录和电网约束,形成统一的决策支持工作流,以增强可再生能源预测。

详情
AI中文摘要

可再生能源发电的可靠预测是电网稳定性、能源交易、电池调度和碳感知运营规划的基础要求。太阳能和风能资源本质上是间歇性的,其输出随云量、风速、大气湍流、季节模式和局部地形而波动。物联网和边缘设备的普及,包括智能电表、逆变器、风速计、日射强度计、气象站和电网接口传感器,创造了前所未有的实时运行数据量,而传统的预测流程难以充分利用这些数据。本综述研究了大语言模型代理如何通过将异构传感器流、天气API数据、历史发电记录、电网约束和上下文推理整合到统一的决策支持工作流中,来增强可再生能源预测。我们调查了经典预测方法(统计时间序列模型、深度学习架构、物理混合方法)以及新兴的用于解释、不确定性沟通和操作员指导的LLM代理框架。提出了一个六层分类法,涵盖数据采集、预处理、特征工程、模型推理、不确定性估计和自然语言报告。综述识别了十二个开放挑战,包括实时部署、分布偏移下的模型漂移、不确定性量化、LLM代理中的幻觉控制、边缘硬件的互操作性以及与能源管理系统的集成。论文最后建议了一个研究议程,重点关注开放基准、物理信息LLM基础以及联邦预测架构。

英文摘要

Reliable forecasting of renewable energy generation is a foundational requirement for grid stability energy trading battery scheduling and carbon aware operational planning Solar and wind resources are inherently intermittent their output fluctuates with cloud cover wind speed atmospheric turbulence seasonal patterns and local terrain The proliferation of IoT and edge devices spanning smart meters inverters anemometers pyranometers weather stations and grid interface sensors has created an unprecedented volume of real time operational data that conventional forecasting pipelines are ill equipped to exploit fully This review investigates how large language model LLM agents can enhance renewable energy forecasting by integrating heterogeneous sensor streams weather API data historical generation records grid constraints and contextual reasoning into unified decision support workflows We survey classical forecasting methods statistical time series models deep learning architectures physics hybrid approaches and emerging LLM agent frameworks for explanation uncertainty communication and operator guidance A six layer taxonomy is proposed covering data acquisition preprocessing feature engineering model inference uncertainty estimation and natural language reporting The review identifies twelve open challenges spanning real time deployment model drift under distribution shift uncertainty quantification hallucination control in LLM agents interoperability of edge hardware and integration with energy management systems The paper concludes by recommending a research agenda centred on open benchmarks physics informed LLM grounding and federated forecasting architectures

2605.25133 2026-05-26 cs.AI cs.CL 版本更新

Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

信任但验证:面向选择性LLM预测的证明者-验证者审议

João Sedoc, Baotong Zhang, Dean Foster

发表机构 * New York University(纽约大学)

AI总结 提出基于交互式证明理论的证明者-验证者审议协议,通过结构化置信度判定实现选择性预测,在GPQA Diamond上取得约30个百分点的高置信度精确率差距。

详情
AI中文摘要

可靠地知道语言模型何时正确几乎与正确本身同样重要。我们引入证明者-验证者审议(PVD),这是一种基于交互式证明理论的推理时协议,作为选择性预测的机制:该协议同时产生答案和结构化置信度判定,允许系统报告高置信度答案,同时在不明确的情况下弃权。在每个对话中,证明者通过可检查的子主张捍卫候选答案,而验证者发出有针对性的挑战并返回\textsc{Accept}、\textsc{Challenge}或\textsc{Reject}。由于冻结的语言模型是在噪声信道上运行的不完美的证明者和验证者,形式上的可靠性和完备性保证并不适用;相反,我们通过其覆盖-精确率行为来经验性地描述该协议。我们的主要实验使用Claude Sonnet 4.6作为证明者,Claude Haiku 4.5作为验证者,在GPQA Diamond上进行。没有答案修订即被接受的问题,我们称为Accept + No Change (ANC),作为高置信度子集报告;我们通过其精确率和覆盖来评估该子集。ANC将可靠答案与不可靠答案分开,与非ANC补集相比产生约30个百分点的HC-Prec差距。使用GPT和Gemini配对的鲁棒性实验表明,高HC-Prec可以跨模型系列转移,而验证者的严格性和领域能力在很大程度上决定了选择差距的大小。在Humanity's Last Exam上,较弱的证明者-验证者配对可能使ANC信号崩溃或反转,这说明了当验证者在其有效区域外操作时的实际失败模式。与自一致性、通用自一致性、多智能体辩论和Reflexion的比较表明,证明者-验证者审议为选择性预测提供了独特的论点可辩护性信号。

英文摘要

Reliably knowing when a language model is correct is almost as important as being correct. We introduce prover-verifier deliberation (PVD), an inference-time protocol grounded in interactive proof theory, as a mechanism for selective prediction: the protocol produces both an answer and a structured confidence verdict, allowing a system to report high-confidence answers while abstaining on uncertain cases. In each dialogue, a prover defends a candidate answer through checkable sub-claims while a verifier issues targeted challenges and returns \textsc{Accept}, \textsc{Challenge}, or \textsc{Reject}. Because frozen language models are imperfect provers and verifiers operating over a noisy channel, formal soundness and completeness guarantees do not transfer; instead, we characterize the protocol empirically through its coverage-precision behavior. Our main experiment uses Claude Sonnet 4.6 as prover and Claude Haiku 4.5 as verifier on GPQA Diamond. Questions accepted with no answer revision, which we call Accept + No Change (ANC), are reported as the high-confidence subset; we evaluate this subset by its precision and coverage. ANC separates reliable from unreliable answers, yielding a $\sim$30pp HC-Prec gap over the non-ANC complement. Robustness experiments with GPT and Gemini pairings show that high HC-Prec can transfer across model families, while verifier strictness and domain competence largely determine the size of the selection gap. On Humanity's Last Exam, weaker prover-verifier pairings can collapse or invert the ANC signal, illustrating a practical failure mode when the verifier operates outside its effective region. Comparisons with self-consistency, universal self-consistency, multi-agent debate, and Reflexion suggest that prover-verifier deliberation supplies a distinct argument-defensibility signal for selective prediction.

2605.25123 2026-05-26 cs.LG cs.AI cs.CL cs.CV stat.ML 版本更新

Inference-Time Alignment of Diffusion Models via Trust-Region Iterative Twisted Sequential Monte Carlo

扩散模型的推理时对齐:基于信任区域迭代扭曲序贯蒙特卡洛方法

Weixin Wang, Yu Yang, Wei Deng, Pan Xu

发表机构 * Duke University(杜克大学) Morgan Stanley(摩根大通)

AI总结 提出信任区域迭代扭曲序贯蒙特卡洛(TRI-TSMC)框架,通过迭代学习扭曲函数来改进扩散模型推理时的对齐,在文本生成和文本到图像生成任务上优于现有方法。

Comments 34 pages, 6 figures, and 7 tables

详情
AI中文摘要

我们研究基于扩散的生成模型的推理时对齐,旨在引导基础模型产生高奖励输出而不更新其权重。最近的基于序贯蒙特卡洛(SMC)的引导方法以原则性的方式近似奖励倾斜的目标分布,但其提议仍主要依赖于基础采样器。由于奖励信息主要通过粒子重加权和重采样在传播后使用,这些方法可能需要大量粒子预算,并遭受权重退化和高方差估计的问题。降低方差和提高粒子效率的一种方法是迭代学习提供前瞻指导的扭曲函数,如扭曲SMC。然而,现有的可学习扭曲方法主要针对经典序贯推理开发,当应用于具有高维状态空间和终端、噪声或黑盒奖励的扩散对齐时可能不稳定。我们提出信任区域迭代扭曲序贯蒙特卡洛(TRI-TSMC),一种用于在基于SMC的推理时对齐中学习扭曲函数的信任区域框架。每次迭代在路径空间中计算精确的KL约束更新,通过温度重要性重加权得到闭式解,并通过加权最大似然将该目标投影回参数化扭曲族。理论上,我们形式化了最优扭曲函数的值函数解释,并表明它产生零方差采样器。我们证明信任区域更新沿着护航路径朝向目标分布,加权最大似然更新是前向KL投影,并且该路径降低了残差重要性权重方差。实验上,在匹配的推理时预算下,TRI-TSMC在离散扩散文本生成和文本到图像生成上改进了主要对齐目标。

英文摘要

We study inference-time alignment for diffusion-based generative models, aiming to steer a base model toward high-reward outputs without updating its weights. Recent Sequential Monte Carlo (SMC)-based steering methods approximate reward-tilted target distributions in a principled way, but their proposals remain largely tied to the base sampler. Since reward information is mainly used after propagation through particle reweighting and resampling, these methods can require large particle budgets and suffer from weight degeneracy and high-variance estimates. One way to reduce variance and improve particle efficiency is to iteratively learn twisting functions that provide look-ahead guidance, as in twisted SMC. However, existing learnable twisting methods are developed mainly for classical sequential inference and can be unstable when applied to diffusion-based alignment with high-dimensional state spaces and terminal, noisy, or black-box rewards. We propose Trust-Region Iterative Twisted Sequential Monte Carlo (TRI-TSMC), a trust-region framework for learning twisting functions in SMC-based inference-time alignment. Each iteration computes an exact KL-constrained update in path space, which admits a closed-form solution by tempered importance reweighting, and projects this target back to the parameterized twisted family by weighted maximum likelihood. Theoretically, we formalize the value-function interpretation of the optimal twisting function and show that it yields a zero-variance sampler. We prove that the trust-region update follows an escort path toward the target distribution, that the weighted maximum-likelihood update is a forward-KL projection, and that the path reduces residual importance-weight variance. Empirically, TRI-TSMC improves primary alignment objectives on discrete diffusion text generation and text-to-image generation under matched inference-time budgets.

2605.25120 2026-05-26 cs.CL cs.AI cs.HC 版本更新

Evidence-Linked Radiology Reporting: A Human-Supervised Reference Architecture for Structured Imaging Intelligence

证据关联放射学报告:面向结构化成像智能的人机协同参考架构

Houman Kazemzadeh, Kamyar Naderi

发表机构 * Xylemed

AI总结 提出一种人机协同、证据关联的参考架构,通过结合特定检查模板、语音到结构处理、测量与分割捕获、受控AI辅助起草以及基于DICOM、HL7 FHIR等标准的互操作性,将放射学报告从自由文本转化为结构化智能层,支持审阅报告、纵向比较、临床数据重用及系统集成。

Comments Technical report, 27 pages, 2 figures, 12 tables, 1 listing; reference architecture paper; does not report clinical outcomes or validated diagnostic performance

详情
AI中文摘要

放射学报告仍然是向临床团队传达成像结果的主要机制。然而,这些报告背后的大量结构化信息,包括测量值、图像证据、既往比较、病灶标识、不确定性和术语,通常仍被禁锢在自由文本中,或分散在图像存档与通信系统、放射信息系统、报告工作站、工作表、高级可视化工具和电子健康记录中。本文提出一种人机协同、证据关联的结构化放射学报告参考架构。该框架结合了特定检查模板、语音到结构处理、测量与分割捕获、受控AI辅助起草,以及基于DICOM、DICOM结构化报告、DICOM分割、HL7 FHIR、RadLex、SNOMED CT、LOINC和UCUM的标准化互操作性。该系统并非作为自主报告生成器,而是作为企业成像的结构化智能层,支持审阅报告、纵向比较、临床数据重用、治理,以及与PACS、RIS、EHR、分析和注册工作流的集成。本文还讨论了针对AI辅助放射学报告系统的模态特定部署考虑、临床安全风险、验证要求、网络安全、隐私、质量管理和监管边界。

英文摘要

Radiology reports remain the primary mechanism by which imaging findings are communicated to clinical teams. However, much of the structured information behind these reports, including measurements, image evidence, prior comparisons, lesion identity, uncertainty, and terminology, often remains trapped in free text or fragmented across picture archiving and communication systems, radiology information systems, reporting workstations, worksheets, advanced visualization tools, and electronic health records. This paper proposes a human-supervised, evidence-linked reference architecture for structured radiology reporting. The framework combines exam-specific templates, speech-to-structure processing, measurement and segmentation capture, controlled AI-assisted drafting, and standards-based interoperability using DICOM, DICOM Structured Reporting, DICOM Segmentation, HL7 FHIR, RadLex, SNOMED CT, LOINC, and UCUM. The system is positioned not as an autonomous report generator, but as a structured intelligence layer for enterprise imaging that supports reviewed reporting, longitudinal comparison, clinical data reuse, governance, and integration with PACS, RIS, EHR, analytics, and registry workflows. The paper also discusses modality-specific deployment considerations, clinical safety risks, validation requirements, cybersecurity, privacy, quality management, and regulatory boundaries for AI-assisted radiology reporting systems.

2605.25092 2026-05-26 cs.IR cs.CL cs.DB 版本更新

AgentIR: A Workload-Adaptive Cascade Retrieval Substrate for Long-Term Conversational Memory

AgentIR:面向长期对话记忆的工作负载自适应级联检索基座

Aojie Yuan, Haiyue Zhang, Shahin Nazarian

发表机构 * University of Southern California(南加州大学)

AI总结 针对长期对话记忆中索引动态增长、查询类型变化和亚10毫秒延迟预算的挑战,提出一种基于置信度触发的级联路由策略,通过自适应选择融合方法和是否运行稠密通道,在保持准确率的同时显著降低延迟并提升并发能力。

Comments 29 pages, 9 figures, 12 tables. Main paper 9 pages + comprehensive appendix (proof, GPU kernels, full per-dataset BEIR/LongMemEval/LoCoMo tables, cascade router C++ API, 6 robustness experiments, FAQ, failure-case catalog)

详情
AI中文摘要

长期对话记忆是一种经典信息检索未针对其设计的检索工作负载:索引在查询流期间增长,查询类型在会话内发生变化,每次检索的延迟预算低于10毫秒。Lucene类引擎将索引视为静态、查询视为无状态,未利用工作负载的结构。AgentIR将融合视为每个查询沿两个轴的决定:应用哪种融合(BM25、稠密、RRF或智能体感知RRF),以及是否值得运行约52毫秒的稠密通道。第二个轴是一个置信度触发的级联路由器,仅根据BM25 top-k间隔做出决定,并在不同工作负载间重新调整而无需重新训练。在LongMemEval(n=500)上,稠密通道确实增加了信息,级联在LLM判断准确率相当的情况下跳过了63%的查询(在两种评判者下快2.67倍,配对bootstrap p>=0.88);按查询类型阈值在5折交叉验证下将其扩展到5.76倍。在LoCoMo(n=1,982)上,BM25单独已经是最强的单一系统,同一触发器自动调整到100%跳过率(快132倍,+0.089 Hit@5)。在共享8核虚拟机上的容量从约154个并发智能体增加到约1,400个(9倍)。在级联之下,时间分区索引执行O(log 1/epsilon)的工作,与语料库大小无关:1234倍语料库增长仅导致3.6倍延迟,最终在500万条记录上以亚100微秒p50达到顺序搜索的1769倍。在与Lucene在9个BEIR数据集(最多880万文档)上质量相当的情况下,该基座在Pyserini 8T上运行速度几何平均快10倍,在PISA-1T BlockMax-WAND上快11倍;A100达到Pyserini 8T的1.8-39倍;分块索引构建在MS MARCO上维持56.8K文档/秒。记录并修复了三个使nDCG@10静默下降6-8倍的微妙BM25/GPU正确性陷阱;修复后,在适合单个A100的所有八个数据集上,CPU和GPU的nDCG@10差异在0.0002以内。

英文摘要

Long-term conversational memory is a retrieval workload classical IR was not built for: the index grows during the query stream, query types shift intra-session, and the latency budget per retrieval is sub-10 ms. Lucene-class engines treat the index as static and the query as stateless, leaving the workload's structure unexploited. AgentIR treats fusion as a per-query decision along two axes: which fusion to apply (BM25, Dense, RRF, or agent-aware RRF), and whether the ~52 ms dense channel is worth running at all. The second axis is a confidence-triggered cascade router that decides from the BM25 top-k margin alone and re-tunes across workloads without retraining. On LongMemEval (n=500), where the dense channel does add information, the cascade skips 63% of queries at parity LLM-judged accuracy (2.67x faster under two judges, paired bootstrap p>=0.88); per-qtype thresholds extend this to 5.76x under 5-fold cross-validation. On LoCoMo (n=1,982), where BM25 alone is already the strongest single system, the same trigger auto-tunes to a 100% skip rate (132x faster, +0.089 Hit@5). Capacity on a shared 8-core VM rises from ~154 to ~1,400 concurrent agents (9x). Underneath the cascade, a time-partitioned index does O(log 1/epsilon) work independent of corpus size: 1234x corpus growth costs only 3.6x latency, ending in 1769x over sequential at sub-100 us p50 on 5M records. At parity quality with Lucene on 9 BEIR datasets up to 8.8M docs, the substrate runs 10x geo-mean over Pyserini 8T and 11x over PISA-1T BlockMax-WAND; an A100 reaches 1.8-39x over Pyserini 8T; chunked index build sustains 56.8K docs/sec on MS MARCO. Three subtle BM25/GPU correctness pitfalls that silently regress nDCG@10 by 6-8x are documented and fixed; post-fix CPU and GPU agree within 0.0002 nDCG@10 on all eight datasets that fit a single A100.

2605.25052 2026-05-26 cs.CL 版本更新

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

忠实性指标并不衡量忠实性:基于真实标签的元评估

Yoav Gur-Arieh, Ana Marasović, Mor Geva

发表机构 * Tel Aviv University(特拉维夫大学) University of Utah(犹他大学)

AI总结 针对思维链忠实性度量缺乏真实标签验证的问题,构建了包含真实忠实性标签的数据集BonaFide,系统评估现有指标,发现多数指标表现接近随机、存在偏差且计算成本高。

详情
AI中文摘要

思维链(CoT)已成为解释和审计大型语言模型行为的核心工具。然而,越来越多的证据表明,这些轨迹往往未能忠实反映模型预测背后的计算过程。已有多种忠实性指标被提出,但它们是否真正衡量了忠实性仍不得而知。回答这一问题需要真实标签,但由于内部计算不可直接观察,真实标签难以获取。因此,大多数提出指标的工作仅报告绝对分数或与先前指标的对比,而少数现有基准依赖于似然性或重要性等代理指标,这些属性与忠实性正交,可能误导对CoT可信度的判断。我们通过构建任务来应对这一挑战,这些任务的输出揭示了哪些中间计算必然产生了它们,并开发了一个自动化标注流程,在步骤级和CoT级生成真实忠实性标签。基于这一方法,我们提出了BonaFide基准,包含来自13个任务和10个模型的3066个标注CoT,并利用它首次系统评估了主流忠实性指标。我们的实验表明,大多数指标表现接近随机,存在强烈的预测偏差,并且在更长的CoT上性能下降。最佳指标在CoT级仅达到0.70 AUROC,另一指标在步骤级达到0.59,且两者均无法跨设置迁移,同时计算成本过高。我们的结果暴露了当前忠实性评估中的根本性缺陷,并呼吁开发更可靠、更高效的指标。

英文摘要

Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they indeed measure faithfulness remains unknown. Answering this requires ground-truth labels, which are hard to obtain since internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies like plausibility or importance, properties orthogonal to faithfulness that can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level while another reaches 0.59 at the step level, with neither transferring across settings, while entailing prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.

2605.25038 2026-05-26 cs.CL cs.LG cs.SE 版本更新

TRACE: A taxonomy-grounded synthetic dataset for teaching-program generation and session interpretation in Applied Behavior Analysis

TRACE:一个基于分类学的合成数据集,用于应用行为分析中的教学程序生成和会话解释

Festus Kahunla

发表机构 * Drexel University(德雷塞尔大学) Pombo Labs(波莫实验室)

AI总结 提出TRACE数据集,通过分类学驱动的确定性生成器创建2999个合成示例,覆盖教学程序生成和多会话行为解释任务,以解决ABA领域真实数据受隐私保护无法公开的问题。

Comments 11 pages, 3 tables. Dataset: https://huggingface.co/datasets/PomboLabs/TRACE ; code: https://github.com/Pombo-Labs/TRACE

详情
AI中文摘要

应用行为分析(ABA)是一门临床学科,其文档、教学程序和多次会话行为日志具有公式化和高容量的特点,但真实会话数据受HIPAA保护并受专业保密规则约束,阻碍了训练语料库的发布。我们提出了TRACE(分类学参考的ABA临床示例),一个包含2999个示例的合成指令调优数据集,涵盖两项ABA任务:跨离散试验训练、自然环境教学和任务分析的教学程序生成;以及跨十二种轨迹模式和十三种目标行为的多会话行为解释。每个示例均由一个基于经典ABA文献的确定性分类学驱动生成器产生,并且每个示例都带有完整的采样来源,即产生它的确切分类学单元。该数据集以CC BY-NC 4.0(数据)和MIT(代码)许可发布,包含分层训练集(2549)、验证集(149)、测试集(281)和完整性检查集(20)。TRACE是一个研究工件,尚未经过临床验证。

英文摘要

Applied Behavior Analysis (ABA) is a clinical discipline whose documentation, teaching programs and multi-session behavioral logs, is formulaic and high-volume, yet real session data is HIPAA-protected and bound by professional confidentiality rules, blocking the release of a training corpus. We present TRACE (Taxonomy-Referenced ABA Clinical Examples), a 2,999-example synthetic instruction-tuning dataset covering two ABA tasks: teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis; and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in the canonical ABA literature, and every example carries complete sampling provenance, the exact taxonomy cells that produced it. The dataset is released under CC BY-NC 4.0 for data and MIT for code, with stratified train (2,549), validation (149), test (281), and sanity (20) splits. TRACE is a research artifact and has not been clinically validated.

2605.25036 2026-05-26 cs.CL cs.AI 版本更新

Language Bias in LVLMs: From In-Depth Analysis to Simple and Effective Mitigation

LVLMs中的语言偏差:从深入分析到简单有效的缓解方法

Yangneng Chen, Jing Li

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳))

AI总结 本文系统研究了大视觉语言模型中的语言偏差问题,发现其根源在于训练中的模态未对齐,并提出了两种简单有效的缓解方法:语言偏差正则化(LBR)和语言偏差惩罚(LBP)。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型视觉语言模型(LVLMs)通过视觉理解扩展了大型语言模型,但仍然容易产生幻觉,即输出流畅但与图像不一致。最近的研究将这一问题与语言偏差联系起来——LVLMs过度依赖文本而忽视视觉输入的倾向。然而,大多数分析仍然是经验性的,没有揭示其根本原因。在本文中,我们对语言偏差进行了系统研究,并确定其根源在于训练过程中的模态未对齐。我们的分析表明,视觉指令微调(VIT)和直接偏好优化(DPO)通常优先考虑文本改进,这可能导致LVLMs过度倾向于语言建模,而不是平衡的多模态理解。为了解决这个问题,我们提出了两种简单而有效的方法:语言偏差正则化(LBR),通过在指令微调期间进行正则化来缓解语言偏差;以及语言偏差惩罚(LBP),在DPO训练过程中惩罚语言偏差。跨多种模型和基准的大量实验证明了我们方法的有效性。LBR在十多个通用基准上持续提高性能,而LBP显著减少了幻觉并提高了可信度。这些方法共同不仅缓解了语言偏差,还促进了LVLMs的整体对齐,且无需引入任何额外数据或辅助模型。我们的代码公开在https://github.com/lab-klc/LVLM-Language-Bias。

英文摘要

Large Vision-Language Models (LVLMs) extend large language models with visual understanding, but remain vulnerable to hallucination, where outputs are fluent yet inconsistent with images. Recent studies link this issue to language bias-the tendency of LVLMs to over-rely on text while neglecting visual inputs. Yet most analyses remain empirical without uncovering its underlying cause. In this paper, we provide a systematic study of language bias and identify its root in modality misalignment during training. Our analysis shows that both Visual Instruction Tuning (VIT) and Direct Preference Optimization (DPO) often prioritize textual improvements, which may cause LVLMs to overly lean toward language modeling rather than balanced multimodal understanding. To address this, we propose two simple yet effective methods: Language Bias Regularization (LBR) which mitigates language bias through regularization during instruction tuning, and Language Bias Penalty (LBP), which penalizes language bias in the DPO training process. Extensive experiments across diverse models and benchmarks demonstrate the effectiveness of our approach. LBR consistently improves performance on over ten general benchmarks, while LBP significantly reduces hallucination and improves trustworthiness. Together, these methods not only mitigate language bias but also advance the overall alignment of LVLMs, all without introducing any additional data or auxiliary models. Our code is publicly available at https://github.com/lab-klc/LVLM-Language-Bias.

2605.25020 2026-05-26 cs.AI cs.CL 版本更新

Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients

慢性皮肤病纵向数据检索中的隐私保护本地语言模型:在天疱疮患者中的实施

Abdurrahim Yilmaz, Ayşe Esra Koku Aksu, Duygu Yamen, Vefa Asli Erdemir, Mehmet Salih Gurel, Gulsum Gencoglan, Joram M. Posma, Burak Temelkuran

发表机构 * Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London(系统医学系,代谢、消化与生殖部,帝国理工学院伦敦分校) Department of Dermatology and Venereology, Istanbul Research and Training Hospital(皮肤科与性病科,伊斯坦布尔研究与培训医院) Department of Dermatology and Venereology, Istanbul Medeniyet University(皮肤科与性病科,伊斯坦布尔梅德尼yet大学) Department of Dermatology and Venereology, Istanbul Medicana Atakoy Hospital(皮肤科与性病科,伊斯坦布尔Medicana阿塔科伊医院)

AI总结 本研究评估了本地部署的隐私保护小型语言模型(SLM)在天疱疮患者长期随访记录中检索结构化临床特征并生成纵向摘要的能力,结果显示SLM在特征检索任务中平均准确率达82.25%,且医生对AI生成摘要的质量、临床准确性和实用性评分较高。

详情
AI中文摘要

慢性皮肤病如天疱疮需要长期随访,产生大量纵向临床文档,在常规就诊期间难以全面审查,增加了临床医生的工作量以及遗漏关键历史信息的风险。我们评估了本地部署的隐私保护小型语言模型(SLM)是否能够从长期皮肤科随访记录中检索结构化临床特征并生成纵向摘要。在这项回顾性病例系列研究中,30名天疱疮患者贡献了541份就诊记录,汇总为完整的纵向记录(89,336词);由两位皮肤科专家标注了56个临床相关特征。本地部署的SLM(Qwen3 4B Thinking 2507)对每份完整记录进行查询,以检索56个特征并生成一份最终报告摘要。在1,680个特征检索任务中,平均准确率为82.25%。皮肤科医生对AI生成摘要的整体质量(8.23-8.47)、临床准确性(7.93-8.20)和实用性(8.47-8.50)评分较高,评估者间无显著差异,且在53.3%的评估中总体偏好AI摘要。这些发现表明,隐私保护的本地部署SLM可以优于医学专家,并可靠地生成有临床意义的纵向摘要。在适当监督下,SLM可以支持临床决策。

英文摘要

Chronic dermatologic diseases such as pemphigus require long-term follow-up, generating extensive longitudinal clinical documentation that is difficult to review comprehensively during routine visits and increasing clinician workload as well as the risk of missing critical historical information. We evaluated whether a locally deployed, privacy-preserving small language model (SLM) could retrieve structured clinical features and generate longitudinal summaries from long-term dermatology follow-up records. In this retrospective case series, thirty pemphigus patients contributed 541 visit notes that were aggregated into full longitudinal records (89,336 words); 56 clinically relevant features were annotated by two expert dermatologists. The locally deployed SLM (Qwen3 4B Thinking 2507) was queried with each complete record to retrieve 56 features and generate one final report summaries. Across 1,680 feature retrieval tasks, mean accuracy was 82.25%. Dermatologists' ratings of AI-generated summaries were high for overall quality (8.23-8.47), clinical accuracy (7.93-8.20), and usefulness (8.47-8.50), with no significant inter-evaluator differences and an overall preference for AI summaries in 53.3% of evaluations. These findings suggest that privacy-preserving, locally deployed SLMs can outperform medical experts and reliably generate clinically meaningful longitudinal summaries. SLMs may support clinical decision-making when integrated with appropriate oversight.

2605.24998 2026-05-26 cs.CL 版本更新

Better, Faster: Harnessing Self-Improvement in Large Reasoning Models

更好、更快:利用大型推理模型的自我改进

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Leszek Rutkowski, Dacheng Tao

发表机构 * Nanyang Technological University, Singapore(新加坡南洋理工大学) Alibaba Group, China(中国阿里巴巴集团) School of Computer Science, Wuhan University, China(武汉大学计算机学院) AGH University of Science and Technology, Poland(波兰AGH科学与技术大学)

AI总结 针对自我改进训练中数据不平衡和过度思考问题,提出HSIR方法,通过验证后退出采样和内在多样性评分提升推理性能与效率。

Comments Accepted by ICML2026

详情
AI中文摘要

自我改进训练使大型推理模型(LRMs)能够在没有外部监督的情况下,通过自我生成推理轨迹作为训练数据来改进自身。然而,我们发现这种方法在复杂推理任务中常常表现不佳,甚至导致模型崩溃。通过一系列初步分析,我们揭示了两个问题:(1)数据不平衡,即大多数训练样本简单,但具有挑战性且关键的样本稀缺;(2)过度思考,即许多带有冗余推理步骤的不理想样本被用于自我训练。为此,我们提出HSIR,通过两种简单而有效的方法有效利用大型推理模型的自我改进。具体而言,HSIR引入了一种验证后退出采样策略,通过高效收集困难查询的更准确解决方案来缓解数据不平衡,并设计了内在多样性评分来量化过度思考并过滤掉不理想的解决方案。我们将HSIR应用于各种后训练范式,其中进一步提出了H-GRPO,一种增强的GRPO算法,利用内在多样性作为外部奖励,通过强化学习鼓励简洁且多样化的推理。大量结果表明,HSIR不仅有效提升了推理性能,即平均性能提升高达10.9%,而且通过减少高达42.4%的相对推理开销,显著提高了推理效率。

英文摘要

Self-improvement training enables the large reasoning models (LRMs) to improve themselves by self-generating reasoning trajectories as training data without external supervision. However, we find that this method often falls short in complex reasoning tasks and even leads to model collapse. Through a series of preliminary analyses, we reveal two problems: (1) data imbalance, where most training samples are simple, but the challenging yet crucial samples are scarce; (2) overthinking, where many undesired samples with redundant reasoning steps are used for self-training. To this end, we propose HSIR, which effectively Harnesses Self-Improvement in large Reasoning models via two simple-yet-effective approaches. Specifically, HSIR introduces a verify-then-exit sampling strategy to mitigate data imbalance by efficiently collecting more accurate solutions for difficult queries, and designs an Intrinsic Diversity score to quantify overthinking and filter out the undesired solutions. We apply HSIR to various post-training paradigms, among which we further propose H-GRPO, an enhanced GRPO algorithm that leverages the intrinsic diversity as an external reward to encourage concise and diverse reasoning via reinforcement learning. Extensive results show that HSIR not only effectively enhances the reasoning performance, i.e., bringing up to +10.9% average performance gains, but also significantly improves the reasoning efficiency by reducing up to 42.4% relative inference overhead.

2605.24996 2026-05-26 cs.CL 版本更新

Exploring Profiles of Cognitive Distortions Associated with Mental Health Disorders

探索与心理健康障碍相关的认知扭曲特征

Alina Anikejeva, Kairit Sirts

发表机构 * Institute of Computer Science(计算机科学研究所) University of Tartu(塔尔图大学)

AI总结 本研究使用基于n-gram和微调Transformer模型的方法,分析Reddit数据中九种自我报告心理健康群体与对照组的认知扭曲,发现心理健康群体扭曲程度更高,且不同群体间扭曲模式相似,表明简单词汇方法可用于大规模心理健康文本的群体趋势探索。

Comments CLPsych 2026

详情
AI中文摘要

认知扭曲,即扭曲的思维模式,在计算心理健康研究中受到越来越多的关注。尽管它们与许多(如果不是全部)心理健康障碍相关,但现有研究主要关注抑郁症。在这项工作中,我们探索了多种心理健康状况下的扭曲特征。我们分析了一个大型Reddit数据集,包含来自九个自我报告心理健康群体以及一个对照组的帖子,使用基于n-gram的方法和微调Transformer模型来检测认知扭曲。心理健康群体(无论是合并还是单独检查)与对照组相比,表现出更高的认知扭曲发生率,效应大小从小到中等。在比较不同状况下的扭曲特征时,我们观察到大致相似的模式,尽管某些群体整体上表现出比其他群体更高的扭曲水平。这些发现表明,相对简单的词汇方法可用于大规模心理健康文本数据中群体趋势的探索性分析。

英文摘要

Cognitive distortions, distorted patterns of thinking, have been increasingly studied in computational mental health research. Although they are related to many, if not all, mental health disorders, most existing studies focus primarily on depression. In this work, we explore distortion profiles across multiple mental health conditions. We analyzed a large Reddit-based dataset containing posts from nine self-reported mental health groups as well as a control group using both an n-gram-based method and a fine-tuned transformer model for detecting cognitive distortions. Mental health groups, both when pooled together and when examined individually, showed higher prevalence of cognitive distortions compared to the control group, with the effect sizes ranging from small to moderate. When comparing distortion profiles across conditions, we observed largely similar patterns, although some groups exhibited overall higher levels of distortions than others. These findings suggest that relatively simple lexical approaches can be useful for exploratory analyses of group-level trends in large-scale mental health text data.

2605.24981 2026-05-26 cs.CL cs.LG 版本更新

Large Language Model Selection with Limited Annotations

有限标注下的大语言模型选择

Yavuz Durmazkeser, Patrik Okanovic, Andreas Kirsch, Torsten Hoefler, Nezihe Merve Gürel

发表机构 * TU Delft(代尔夫特理工大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出SELECT-LLM框架,通过基于期望信息增益的查询选择规则,在有限标注下高效识别最佳大语言模型,显著降低标注成本。

Comments 33 pages, 5 figures, 4 tables

详情
AI中文摘要

为给定任务选择大语言模型(LLM)需要比较许多强候选模型,然而标准评估依赖于固定评估集上的昂贵标注。为解决这一挑战,我们开发了SELECT-LLM,这是第一个用于主动模型选择LLM的框架。SELECT-LLM旨在找到一组查询,其标注对于识别给定任务的最佳LLM最具信息量。为此,我们引入了一种基于期望信息增益的查询选择规则,该规则通过候选模型输出之间的成对相似性计算。由于该规则仅使用生成的模型响应,SELECT-LLM可以在不假设候选模型架构或访问模型权重的情况下应用。这使得它适用于开源权重和黑盒LLM。我们在23个数据集、156个评估模型、多样化的任务族和多个文本评估指标上评估了SELECT-LLM。在所有实验中,SELECT-LLM在每个设置中都优于最强基线,最佳模型选择的标注成本降低高达81.8%,近最佳模型选择的标注成本降低高达84.78%。

英文摘要

Choosing a Large Language Model (LLM) for a given task requires comparing many strong candidates, yet standard evaluation relies on costly annotations over fixed evaluation sets. To address this challenge, we develop SELECT-LLM, the first framework for active model selection of LLMs. SELECT-LLM aims to find a small set of queries whose annotations are most informative for identifying the best LLM for a given task. To this end, we introduce a query selection rule based on expected information gain, computed from pairwise similarities between candidate model outputs. Because this rule only uses generated model responses, SELECT-LLM can be applied across candidate models without assumptions about their architecture or access to model weights. This makes it suitable for both open-weight and black-box LLMs. We evaluate SELECT-LLM across 23 datasets, 156 evaluated models, diverse task families, and multiple text evaluation metrics. Across all experiments, SELECT-LLM improves over the strongest baseline in every setting, with annotation cost reductions up to 81.8% for best model selection and up to 84.78% for near-best model selection.

2605.24977 2026-05-26 cs.CV cs.CL 版本更新

Universal Boosts, Specific Suppressors: Sparse Autoencoder Steering of Medical Vision-Language Models

通用增强,特定抑制:基于稀疏自编码器引导的医学视觉语言模型

Farhad Nooralahzadeh, Benjamin Gundersen, Nicolas Deperrois, Hidetoshi Matsuom, Mizuho Nishio, Thomas Frauenfelder, Ahmed Allam, Christian Blüthgen, Michael Moor, Michael Krauthammer

发表机构 * University of Zurich and University Hospital of Zurich(苏黎世大学及苏黎世大学医院) Kobe University(Kobe大学) ETH AI Center(苏黎世联邦理工学院人工智能中心) ETH Zurich(苏黎世联邦理工学院) Stanford University(斯坦福大学) Zurich University of Applied Sciences(苏黎世应用科学大学)

AI总结 本文提出一种无需权重更新的解码时残差引导方法,通过每token稀疏自编码器(SAE)对医学视觉语言模型进行干预,抑制幻觉并提升报告质量,在多个模型上取得显著改进。

详情
AI中文摘要

医学视觉语言模型(VLM)在生成胸部X光报告时经常出现幻觉:它们编造图像中不存在的发现,遗漏重要发现,或定位错误。我们通过解码时残差引导,基于每token稀疏自编码器(SAE)来缓解这一问题,无需权重更新:在后期层使用Top-$K$ SAE,针对临床错误进行因果引导,然后在推理时结合抑制/增强干预。在MIMIC-CXR测试集上,我们的纯推理方法提高了三个放射学VLM(RadVLM、LLaVA-Rad和CheXOne)生成报告的质量,临床复合指标的相对改进分别为+5.4%、+7.2%和+17.0%,并且所有骨干网络的GREEN得分均具有统计显著性。跨模型特征对齐表明,质量促进(增强)方向在不同架构间高度重叠,而与幻觉相关的(抑制)方向则是模型特定的。因此,可迁移的引导必须针对每个骨干网络进行抑制处理,而不是共享一个通用的抑制列表。相同的配方无需重新训练即可零样本迁移到IU-Xray(GREEN相对提升+7.7%),确认了所识别的特征是模型属性,而非训练语料库的属性。我们发布了因果特征集和一个交互式特征仪表板:https://cxr-sparse-feature-dashboard.netlify.app/。

英文摘要

Medical vision-language models (VLMs) often hallucinate findings when generating chest X-ray reports: they fabricate findings that are not present in the image, miss important ones, or locate them incorrectly. We mitigate this without weight updates by decoding-time residual steering on a per-token sparse autoencoder (SAE) basis: Top-$K$ SAEs on late layers, causal steering against clinical errors, then combined suppress/boost intervention at inference time. On the MIMIC-CXR test split, our inference-only method improves the quality of generated reports for three radiology VLMs (RadVLM, LLaVA-Rad, and CheXOne), with relative improvements of +5.4%, +7.2%, and +17.0% in the clinical composite metric, and statistically significant GREEN gains on all backbones. A cross-model feature alignment shows that the quality-promoting (boost) directions overlap strongly across architectures, whereas hallucination-linked (suppress) directions are model-specific. Therefore, transferable steering must treat suppression per-backbone, rather than sharing a universal suppress list. The same recipe transfers zero-shot to IU-Xray (Green $+7.7\%$ rel.) without retraining, confirming that the identified features are properties of the model, not of the training corpus. We release causal feature sets and an interactive feature dashboard: https://cxr-sparse-feature-dashboard.netlify.app/.

2605.24973 2026-05-26 cs.CV cs.AI cs.CL 版本更新

MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

MinerU-Popo:结构化文档解析的通用后处理模型

Bangrui Xu, Ziyang Miao, Xuanhe Zhou, Yiming Lin, Zirui Tang, Xiaomeng Zhao, Fan Wu, Cheng Tan, Fan Wu, Bin Wang, Conghui He

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Artificial Intelligence Laboratory, OpenDataLab(上海人工智能实验室,OpenDataLab) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MinerU-Popo轻量级通用后处理框架,通过分解为文本/表格截断恢复、标题层级重建和图文关联四个子任务,并利用动态分块和重叠同步将OCR页面级结果重构为文档级逻辑结构,显著提升标题层级TEDS和RAG准确性。

Comments The code is available at https://github.com/opendatalab/MinerU-Popo

详情
AI中文摘要

基于VLM的OCR模型已成为文档解析的事实标准,因为它们可以准确提取页面级元素(例如单个页面内的段落)及其边界框和文本内容。然而,下游应用(如RAG)需要连贯的文档级信息,而这些模型常常破坏跨页连续性,并且无法恢复被页面边界截断的结构(如段落和表格)。这种关系不局限于单个页面;相反,它们需要对跨多个页面的标题、段落、表格和图像进行联合分析。因此,一个自然的解决方案是重用现有的OCR输出,并通过后处理重建文档级逻辑结构。为此,我们提出了MinerU-Popo,一个轻量级且通用的OCR输出后处理框架,它将来自不同解析器的页面级结果转换为连贯的文档级结构。MinerU-Popo将问题分解为四个聚焦的子任务:文本截断恢复、表格截断恢复、标题层级重建和图文关联。为了有效解决这些问题,我们构建了一个面向任务的数据引擎,具有任务特定的输入过滤,并使用生成的数据(30K)微调了一个轻量级后处理模型(Qwen3-VL-4B)。为了支持长文档,我们引入了基于重叠同步的动态分块,对齐微调模型的分块级输出并保持全局一致性。最后,我们将对齐后的输出组装成树状文档表示,并通过节点分块和摘要进一步丰富,以支持下游检索和分析。实验结果表明,MinerU-Popo在所有五个测试的OCR模型上,标题层级TEDS至少提高了20%,提高了RAG准确性并降低了每次查询的延迟。

英文摘要

VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document-level information, whereas these models often break cross-page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document-level logical structures through post-processing. To this end, we propose MinerU-Popo, a lightweight and universal framework for POst-Processing OCR outputs, which converts page-level results from diverse parsers into coherent document-level structures. MinerU-Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. To address these effectively, we build a task-oriented data engine with task-specific input filtering, and use the generated data (30K) to fine-tune a lightweight post-processing model (Qwen3-VL-4B). To support long documents, we introduce dynamic chunking with overlap-based synchronization, which aligns chunk-level outputs from the fine-tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree-structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show MinerU-Popo improves title-hierarchy TEDS by at least 20% across all five tested OCR models, improves RAG accuracy and reduces per-query latency.

2605.24960 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Investigating the Interplay between Contextual and Parametric Chain-of-Thought Faithfulness under Optimization

探究优化下上下文与参数化思维链忠实性之间的相互作用

Jingyi Sun, Qianli Wang, Pepa Atanasova, Nils Feldhus, Isabelle Augenstein

发表机构 * University of Copenhagen(哥本哈根大学) Technische Universität Berlin(柏林技术大学) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI)) BIFOLD – Berlin Institute for the Foundations of Learning and Data(BIFOLD – 柏林学习与数据基础研究院)

AI总结 通过提出统一偏好对齐接口FaithMate,研究上下文与参数化两种思维链忠实性范式在优化下的相互作用,发现两者正相关但不对称,且上下文忠实性指标间存在权衡。

Comments The first two authors contributed equally and share first-authorship

详情
AI中文摘要

思维链(CoT)忠实性,即CoT是否真实反映大型语言模型(LLM)的底层行为,通常通过两种不相交的范式进行评估:上下文忠实性(通过扰动输入或CoT轨迹测量)和参数化忠实性(通过干预模型的参数化知识评估)。然而,先前的工作仅对它们进行描述性比较。我们通过提出FaithMate(一个统一的偏好对齐接口,用于优化模型朝向任一忠实性范式)来填补这一空白。它使我们能够研究两种范式之间的相互作用,检查忠实性增益在范式内部和跨范式之间是否以及多大程度上泛化。在三个模型、两个数据集和六个忠实性指标上,我们发现两种范式呈正相关但不对称:优化参数化忠实性在两种范式上均产生一致的增益,而上下文对应范式则带来更多可变的增益。在上下文范式内,一个指标上的忠实性增益不能一致地转移到其他指标上,这表明现有的上下文指标捕捉了忠实性的不同方面,并暴露了固有的权衡。这些发现意味着CoT忠实性不是一个单一目标,因此需要多方面的优化和评估。

英文摘要

Chain-of-Thought (CoT) faithfulness, i.e., whether CoTs genuinely reflect large language models' (LLM) underlying behavior, is typically evaluated under two disjoint paradigms: contextual faithfulness, measured by perturbing the input or CoT trace, and parametric faithfulness, assessed by intervening on a model's parametric knowledge. Yet prior work compares them only descriptively. We fill this gap by proposing FaithMate, a unified preference-alignment interface for optimizing models towards either faithfulness paradigm. It enables us to investigate the interplay between the two paradigms, examining whether and to what extent faithfulness gains generalize within and across paradigms. Across three models, two datasets, and six faithfulness metrics, we find that the two paradigms are positively coupled, yet asymmetric: optimizing towards parametric faithfulness yields consistent gains across both paradigms, whereas the contextual counterpart delivers more variable gains. Within the contextual paradigm, faithfulness gains on one metric do not consistently transfer to others, implying that existing contextual metrics capture disjoint facets of faithfulness and exposing inherent trade-offs. These findings imply that CoT faithfulness is not a monolithic objective and therefore requires multifaceted optimization and evaluation.

2605.24958 2026-05-26 cs.CL cs.AI 版本更新

SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack

SEP-Attack:一种简单有效的基于迁移的文本对抗攻击范式

Han Liu, Zhi Xu, Xiaotong Zhang, Feng Zhang, Xiaoming Xu, Wei Wang, Fenglong Ma, Hong Yu

发表机构 * Dalian University of Technology(大连理工大学) Peking University(北京大学) Macao Polytechnic University(澳门理工学院) The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 提出SEP-Attack,利用行列式点过程生成多样化的代理集成权重,通过新指标评估预测置信度以计算词重要性并生成对抗样本,在多个数据集和API上显著优于现有方法。

详情
AI中文摘要

尽管深度神经网络在现代Web和语言应用中表现出色,但它们仍然容易受到对抗攻击,尤其是使用代理模型生成对抗样本而无需访问受害者模型的迁移攻击。文本领域的迁移攻击仍未得到充分探索,只有少数研究解决了这一挑战性问题,且由于对子模型平等对待或重要性分数估计不准确,往往导致次优结果。为了解决这些挑战,我们提出了一种简单而有效的基于迁移的文本对抗攻击范式,名为SEP-Attack。具体来说,我们采用行列式点过程(DPP)生成多样化的代理集成权重,代表子模型的迁移性。利用这些权重,我们引入了一种新的度量来评估预测置信度分数,进而用于计算词重要性分数并生成对抗候选。最后,我们量化每个候选的迁移性分数,并选择排名靠前的作为最终的迁移对抗样本。在四个数据集和两个真实API上进行的实验验证了SEP-Attack的有效性,显著优于最先进的基线方法。

英文摘要

Despite the strong performance of deep neural networks in modern Web and language applications, they remain vulnerable to adversarial attacks, especially transferable attacks that generate adversarial examples using surrogate models without accessing the victim model. Transferable attacks in the text domain are still under-explored, with only a few studies addressing this challenging issue, often with suboptimal results due to equal treatment of submodels or inaccurate estimation of importance scores. To address these challenges, we propose a simple yet effective paradigm for transfer-based textual adversarial attack, named SEP-Attack. Specifically, we employ the Determinantal Point Process (DPP) to generate diverse surrogate ensemble weights, representing the transferability of submodels. Using these weights, we introduce a new metric to evaluate prediction confidence scores, which in turn are used to calculate word importance scores and generate adversarial candidates. Finally, we quantify the transferability score for each candidate and select the top ones as the final transferable adversarial examples. Experiments conducted on four datasets and two real-world APIs validate the efficacy of SEP-Attack, significantly outperforming state-of-the-art baselines.

2605.24956 2026-05-26 cs.CL 版本更新

NITP: Next Implicit Token Prediction for LLM Pre-training

NITP:面向LLM预训练的下一隐式令牌预测

Xiangdong Zhang, Debing Zhang, Shaofeng Zhang, Xiaohan Qin, Yu Cheng, Junchi Yan

发表机构 * School of AI, Shanghai Jiao Tong University(上海交通大学人工智能学院) Xiaohongshu Inc.(小红书公司) The Chinese University of Hong Kong(香港中文大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出NITP方法,通过在表示空间中添加密集连续监督来增强离散令牌预测,以解决标准下一令牌预测中潜在表示空间约束不足的问题,并在0.5B至9B参数模型上取得一致性能提升。

Comments Accepted at ICML 2026

详情
AI中文摘要

标准的下一令牌预测(NTP)仅通过输出logit空间中的离散标签来监督语言模型。我们认为这种稀疏的one-hot监督使得潜在表示空间约束不足,允许隐藏状态退化为退化和各向异性的配置,从而限制泛化能力。为解决这一问题,我们提出下一隐式令牌预测(NITP),该方法直接在表示空间中用密集的连续监督增强离散预测。NITP训练模型预测下一令牌的隐式语义内容,使用同一模型的浅层表示作为稳定的自监督目标。我们提供理论分析,表明NITP通过缓解欠约束的自由度并鼓励紧凑、结构化的表示几何来正则化优化景观。实验上,在从0.5B到9B参数的密集和MoE模型中,NITP以可忽略的计算开销持续提升下游性能。在9B MoE模型上,NITP在MMLU-Pro上实现了5.7%的绝对提升,在C3上提升6.4%,在CommonsenseQA上提升4.3%,训练FLOPs仅增加约2%,且无额外推理成本。我们的实现可在https://github.com/aHapBean/NITP获取。

英文摘要

Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.

2605.24930 2026-05-26 cs.CL 版本更新

H$^{2}$MT: Semantic Hierarchy-Aware Hierarchical Memory Transformer

H$^{2}$MT: 语义层次感知的层次记忆Transformer

Maryam Haghifam, Zifan He, Jason Cong, Yizhou Sun

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出H$^{2}$MT模型,通过离线构建语义层次结构并利用自底向上的后序聚合计算记忆嵌入,在推理时实现从粗到细的查询路由,从而在长上下文推理中实现质量与效率的权衡。

详情
AI中文摘要

基于Transformer的LLM在许多语言任务上取得了强劲的结果;然而,长输入仍然具有挑战性,因为上下文窗口是有限的,并且预填充延迟和内存随提示长度快速增长。因此,平坦的令牌流处理和基于块的检索可能会在与查询无关的文本上花费大量计算和上下文预算。离线索引的RAG额外引入了外部存储和索引管理开销,并且通常将检索到的证据作为原始文本附加,增加了预填充成本和延迟。H^{2}MT使长上下文推理具有结构感知性:它离线构建语义层次结构,通过自底向上的后序聚合为每个节点计算记忆嵌入,并在推理时从粗到细地路由查询,以早期修剪不相关的分支。在LongBench QA(NarrativeQA、HotpotQA、QASPER)和两个结构化技术文档设置上,H^{2}MT实现了有利的质量效率权衡,与提示压缩、记忆令牌方法和检索增强生成基线相比,在更低的峰值GPU内存和首令牌时间(TTFT)下取得了具有竞争力的ROUGE-L和F1(在适用情况下)。

英文摘要

Transformer-based LLMs achieve strong results on many language tasks; however, long inputs remain challenging because context windows are finite, and prefill latency and memory grow rapidly with prompt length. Flat token-stream processing and chunk-based retrieval can therefore spend substantial computation and context budget on text unrelated to the query. Offline-indexed RAG additionally introduces external storage and index management overhead, and typically appends retrieved evidence as raw text, increasing prefill cost and latency. H^{2}MT makes long-context inference structure-aware: it builds a semantic hierarchy offline, computes a memory embedding for each node via bottom-up post-order aggregation, and routes queries coarse-to-fine at inference to prune irrelevant branches early. On LongBench QA (NarrativeQA, HotpotQA, QASPER) and two structured technical-document settings, H MT achieves favorable quality efficiency trade-offs, delivering competitive ROUGE-L and F1 (where applicable) with lower peak GPU memory and time-to-first-token (TTFT) than prompt compression, memory-token methods, and retrieval-augmented generation baselines.

2605.24919 2026-05-26 cs.CL 版本更新

MultiHaluDet: Multilingual Hallucination Detection via LLM Hidden State Probing

MultiHaluDet: 通过LLM隐藏状态探测实现多语言幻觉检测

Riasad Alvi, Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi

发表机构 * United International University(国际联合大学) BRAC University(布拉克大学)

AI总结 提出MultiHaluDet框架,通过探测冻结LLM的全隐藏状态轨迹,结合多尺度注意力和自注意力池化的混合架构,以及校准的经典分类器集成,实现跨语言的高精度幻觉检测,在英语基准上达到98.55% AUROC,并展现出对高、中、低资源语言的强泛化能力。

Comments MeLLM @ ACL 2026

详情
AI中文摘要

大型语言模型(LLM)中的幻觉是其可靠部署的关键障碍,这一漏洞在非英语和资源受限的环境中尤为严重。现有的依赖输出置信度启发式或单层内部表示的检测方法,往往无法捕捉跨语言的深层、复杂事实不一致性。为此,我们引入了MultiHaluDet,一种新颖的三阶段堆叠框架,通过探测冻结LLM的全隐藏状态轨迹来检测多语言幻觉,无需特定语言的微调。我们的方法提取跨多个层的序列特征,并通过使用多尺度注意力和自注意力池化的混合架构进行处理。通过生成折叠外嵌入并输入到校准的经典分类器集成中,MultiHaluDet捕捉了事实不一致性的细粒度和粗粒度模式。大量实验表明,我们的框架在Mistral-7B和LLaMA2-7B架构上,在英语HaluEval和TriviaQA基准测试中达到了高达98.55% AUROC的最先进检测性能。关键的是,我们严格评估了框架在高资源(法语)、中资源(孟加拉语)和低资源(阿姆哈拉语)语言上的跨语言泛化能力。MultiHaluDet展现出卓越的表示鲁棒性,始终优于基线,并成功地将幻觉检测能力迁移到类型多样的语言层级中。

英文摘要

Hallucinations in Large Language Models (LLMs) represent a critical barrier to their reliable deployment, a vulnerability heavily exacerbated in non-English and resource-constrained contexts. Existing detection approaches that rely on output confidence heuristics or single-layer internal representations frequently fail to capture deep, complex factual inconsistencies across diverse languages. To address this, we introduce MultiHaluDet, a novel three-stage stacking framework that detects multilingual hallucinations by probing the full hidden state trajectories of frozen LLMs without requiring language-specific fine-tuning. Our method extracts sequential features across multiple layers and processes them via a hybrid architecture using multi-scale attention and self-attention pooling. By generating out-of-fold embeddings that feed into a calibrated classical classifier ensemble, MultiHaluDet captures both fine-grained and coarse-grained patterns of factual inconsistency. Extensive experiments demonstrate that our framework achieves state-of-the-art detection performance, reaching up to 98.55% AUROC on the English HaluEval and TriviaQA benchmarks using Mistral-7B and LLaMA2-7B architectures. Crucially, we rigorously evaluate our framework's cross-lingual generalization across high (French), medium (Bangla), and low-resource (Amharic) languages. MultiHaluDet demonstrates exceptional representational robustness, consistently outperforming baselines and successfully transferring hallucination detection capabilities across typologically diverse linguistic tiers.

2605.24907 2026-05-26 cs.CL 版本更新

Overview of the PsyDefDetect Shared Task at BioNLP 2026: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations

PsyDefDetect 共享任务概述:在支持性对话中检测心理防御机制水平

Hongbin Na, Zimu Wang, Zhaoming Chen, Yining Hua, Rena Gao, Kailai Yang, Ling Chen, Wei Wang, Shaoxiong Ji, John Torous, Sophia Ananiadou

发表机构 * University of Technology Sydney(技术大学悉尼) Xi’an Jiaotong-Liverpool University(西安交通大学-利物浦大学) University of Utah(犹他大学) Harvard University(哈佛大学) The University of Melbourne(墨尔本大学) The University of Manchester(曼彻斯特大学) ELLIS Institute Finland(芬兰ELLIS研究所) University of Turku(图尔库大学)

AI总结 本文介绍了与 BioNLP@ACL 2026 合办的 PsyDefDetect 共享任务,该任务基于临床验证的 DMRS 框架,要求系统将求助者话语分类为九个类别,最佳系统达到 0.420 的宏 F1 分数,但仍存在改进空间。

详情
AI中文摘要

我们介绍了 PsyDefDetect,这是一个与 BioNLP@ACL 2026 合办的关于在情感支持对话中检测心理防御机制水平的共享任务。该任务基于临床验证的防御机制评定量表(DMRS)框架,要求系统根据给定的前面对话上下文,将目标求助者话语分类为九个类别之一:七个层次的 DMRS 水平加上两个辅助标签。参与者使用了 PsyDefConv,这是一个新发布的语料库,包含 200 个对话和 2336 条求助者话语,在 DMRS 下进行了标注,并具有较高的一致性。该任务在 CodaBench 上吸引了 172 名参与者,提交了 563 份结果,其中 21 个团队正式注册了最终排名。最佳系统实现了 0.420 的宏 F1 分数,显著超过了数据集论文中报告的最强微调基线,但仍留有明显的改进空间。我们的分析强调了(i)过度预测多数类高适应水平的持续趋势,(ii)准确率和宏 F1 之间的差距扩大,揭示了类别不平衡敏感性,以及(iii)理论感知和基于 LLM 的方法在细粒度防御功能分类中的价值。我们发布了所有任务材料,并邀请社区继续在这个临床心理学与自然语言处理的新交叉领域开展工作。

英文摘要

We present an overview of PsyDefDetect, the shared task on detecting levels of psychological defense mechanisms in emotional support dialogues, co-located with BioNLP@ACL 2026. Grounded in the clinically validated Defense Mechanism Rating Scales (DMRS) framework, the task asks systems to classify a target seeker utterance, given its preceding dialogue context, into one of nine categories: seven hierarchical DMRS levels plus two auxiliary labels. Participants worked on PsyDefConv, a newly released corpus of 200 dialogues and 2336 help-seeker utterances annotated under DMRS with substantial inter-annotator agreement. The task attracted 172 participants on CodaBench who produced 563 submissions, with 21 teams officially registering their results for the final ranking. The best system achieved a macro F1-score of 0.420, surpassing the strongest fine-tuned baseline reported in the dataset paper by a notable margin, yet leaving clear headroom. Our analysis highlights (i) a persistent tendency to over-predict the majority High-Adaptive class, (ii) a widening gap between accuracy and macro-F1 that reveals class-imbalance sensitivity, and (iii) the value of theory-aware and LLM-based approaches for fine-grained defensive-function classification. We release all task materials and invite the community to continue work on this novel intersection of clinical psychology and NLP.

2605.24904 2026-05-26 cs.CL 版本更新

Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation

量化翻译错误对多语言大语言模型评估的影响

Klaudia-Doris Thellmann, Bernhard Stadler, Michael Färber, Jens Lehmann

发表机构 * TUD Dresden University of Technology and ScaDS.AI(德累斯顿技术大学和ScaDS.AI) InfAI e.V.(InfAI协会) Amazon(亚马逊)

AI总结 研究机器翻译基准中的翻译错误如何影响多语言LLM评估的可靠性,通过自动错误跨度检测和准确性下降分析揭示翻译错误与评估指标之间的关联。

详情
AI中文摘要

机器翻译基准被广泛用于评估大语言模型(LLM)的多语言能力,然而这些基准中的翻译错误仍未得到充分探索,引发了对多语言评估可靠性和可比性的担忧。我们解决了两个实际差距:(i)来自LLM评判者的自动MQM风格错误跨度以及跨度感知的QE基线(xCOMET-XXL)与基准翻译上的专家人工跨度注释的匹配程度,以及(ii)翻译错误(相对于英文原版中的源端问题)在多大程度上解释了翻译基准上的准确性下降。我们发现,在自然发生的基准翻译上,跨度一致性并非易事,并且目标端翻译错误始终与可测量的、百分点级别的翻译准确性下降相关,即使在控制了英文正确性和源端异常之后也是如此。

英文摘要

Machine-translated benchmarks are widely used to assess the multilingual capabilities of large language models (LLMs), yet translation errors in these benchmarks remain underexplored, raising concerns about the reliability and comparability of multilingual evaluation. We address two practical gaps: (i) how well automatic MQM-style error spans from LLM judges and a span-aware QE baseline (xCOMET-XXL) match expert human span annotations on benchmark translations, and (ii) how strongly translation errors (as opposed to source-side issues in the English original) explain accuracy drops on translated benchmarks. We find that span agreement is non-trivial on naturally occurring benchmark translations, and that target-side translation errors are consistently associated with measurable, percentage-point drops in translated accuracy even after controlling for English correctness and source-side anomalies.

2605.24902 2026-05-26 cs.CL cs.AI cs.LG 版本更新

When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation

当推理有害:面向临床SOAP笔记生成的前沿LLM源感知评估

Faizan Faisal

发表机构 * University of California, Davis(加州大学戴维斯分校)

AI总结 通过源感知基准测试,评估推理增强型LLM在临床SOAP笔记生成中的表现,发现推理能力反而降低GPT-5.4的质量,而相同源RAG带来模型依赖的小幅提升。

详情
AI中文摘要

推理增强型LLM在医学推理基准测试中表现强劲,但这些增益是否能迁移到结构化临床文档尚不清楚;我们通过一个跨OMI Health、ACI-Bench和PriMock57的源感知基准,利用临床对话生成SOAP笔记来研究这一问题。我们在一个2x2受控设计中评估GPT-5.4、DeepSeek-V4-Flash和Gemma-4-E4B,独立切换提供者原生推理和相同源检索增强生成(RAG)。输出使用七种自动指标以及两个参考感知的LLM评判者进行评估。两种评估方法一致认为,非推理的GPT-5.4配置达到最高整体质量,而DeepSeek-V4-Flash在推理增强配置中表现最佳。启用推理显著降低了GPT-5.4在所有三个数据集上的性能,而相同源RAG带来较小的、模型依赖的改进。总体而言,研究结果表明,不应假设更强的推理能力能改善对保真度敏感的SOAP笔记生成,而无需专门的、任务特定的评估。

英文摘要

Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured clinical documentation; we investigate this question using SOAP note generation from clinical dialogue in a source-aware benchmark spanning OMI Health, ACI-Bench, and PriMock57. We evaluate GPT-5.4, DeepSeek-V4-Flash, and Gemma-4-E4B in a controlled 2x2 design that independently toggles provider-native reasoning and same-source retrieval-augmented generation (RAG). Outputs are assessed using seven automatic metrics alongside two reference-aware LLM judges. Both evaluation approaches agree that a non-reasoning GPT-5.4 configuration achieves the highest overall quality, while DeepSeek-V4-Flash performs best among reasoning-enabled configurations. Enabling reasoning significantly degrades GPT-5.4 performance across all three datasets, whereas same-source RAG yields smaller, model-dependent improvements. Overall, the findings indicate that stronger reasoning capability should not be assumed to improve fidelity-sensitive SOAP note generation without dedicated, task-specific evaluation.

2605.24885 2026-05-26 cs.CL 版本更新

DTO: a Differentiable Training Objective for Effective Counterfactual Story Rewriting

DTO:一种用于有效反事实故事重写的可微分训练目标

Amelia Girard, Massimo Piccardi

发表机构 * University of Technology Sydney(技术大学悉尼)

AI总结 提出一种可微分训练目标(DTO),通过端到端反向传播联合优化对参考重写的忠实度和与源叙事的语义一致性,以解决反事实故事重写中模型忽略细微修改的问题。

Comments 11 pages, 2 figures

详情
AI中文摘要

反事实故事重写是一项自然语言处理任务,要求更新现有故事以反映所选替代事件,同时保留所有未受影响的故事情节元素和整体连贯性。尽管大型语言模型最近在此任务上取得了显著进展,但由于所需修改通常规模很小且高度局部化,该任务仍然具有挑战性。因此,以传统方式使用最大似然训练目标训练的模型往往忽略这些细微之处。同时,基于强化学习的更复杂训练方法以缓慢和难以设置而闻名。基于这些原因,本文提出了一种新颖的可微分训练目标(DTO),直接优化所需的反事实改进。在我们的方法中,通过端到端反向传播微调一个Transformer模型,针对一个完全可微的损失函数,该函数同时奖励(i)对参考重写的忠实度和(ii)与源叙事的语义一致性。在TimeTravel和ART数据集上的实证评估表明,所提出的DTO方法能够超越最大似然基线和基于偏好的方法,并在所有评估指标上与两个当代大型语言模型竞争。这些发现证实了任务特定的可微分目标对于细微、受控的文本生成任务的有效性。

英文摘要

Counterfactual story rewriting is a natural language processing task that requires updating an existing story to reflect a chosen alternative event, yet preserving all the unaffected storyline elements and overall coherence. While large language models have recently made remarkable progress on this task, it still remains challenging since the required modifications are typically very small in size and highly localized. As a consequence, models trained in a conventional manner with the maximum-likelihood training objective tend to overlook these nuances. At the same time, more sophisticated training approaches based on reinforcement learning are notoriously slow and difficult to set up. For these reasons, our paper proposes a novel, differentiable training objective (DTO) that directly optimizes for the requisite counterfactual improvements. In our approach, a transformer model is fine-tuned via end-to-end backpropagation against a fully differentiable loss function that jointly rewards (i) fidelity to the reference rewrite and (ii) semantic consistency with the source narrative. The empirical evaluation on the TimeTravel and ART datasets shows that the proposed DTO approach has been able to surpass a maximum-likelihood baseline and a preference-based approach, and perform competitively against two contemporary large language models in all evaluation metrics. These findings substantiate the effectiveness of task-specific differentiable objectives for nuanced, controlled text-generation tasks.

2605.24873 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Towards a Universal Causal Reasoner

迈向通用因果推理器

Qirun Dai, Xiao Liu, Jiawei Zhang, Dylan Zhang, Hao Peng, Chenhao Tan

发表机构 * The University of Chicago(芝加哥大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出UniCo数据生成框架,覆盖Pearl因果阶梯的18种查询类型,将符号示例转化为代码和自然语言,通过监督微调显著提升LLM的因果推理能力和推理忠实度。

详情
AI中文摘要

尽管因果推理的重要性不言而喻,但训练LLM进行因果推理仍未被充分探索。现有的数据工作大多集中在针对因果关系的特定方面对LLM进行基准测试,这使得它们不太适合训练可泛化的因果推理器。为了解决这个问题,我们提出了UniCo,一个数据生成框架,它既(1)涵盖了Pearl因果阶梯中的18种因果查询类型,又(2)将原生符号示例转化为代码和自然语言形式,以模拟因果术语未明确指定的真实世界用例。为确保数据质量,UniCo用精确的因果推理来支撑答案,并过滤掉存在推理捷径的案例。通过使用66.6K个UniCo生成的实例进行监督微调,Qwen3-4B、Qwen3-8B和Olmo-3-7B-Instruct在所有18种分布内查询类型上平均提升了22.9%,在训练分布之外的7个已建立的因果基准上,相比最先进的因果数据生成框架提升了8.1%。更重要的是,在真实世界的医学理解、法律决策和表格推理中,UniCo训练的模型始终展现出更忠实的推理轨迹,在忠实度指标上平均超过基础模型20.2%。这些结果表明,以因果为中心的训练不仅增强了因果推理能力,还赋予了LLM在一般推理任务中的因果思维。

英文摘要

Despite the importance of causal reasoning, training LLMs to reason causally remains underexplored. Existing data efforts mostly focus on benchmarking LLMs on specific aspects of causality, making them less suitable for training generalizable causal reasoners. To address this, we propose UniCo, a data generation framework that both (1) addresses 18 causal query types across Pearl's Causal Ladder and (2) translates natively symbolic examples into code and natural language forms to simulate real-world use cases where causal terms are not explicitly specified. To ensure data quality, UniCo grounds answers with exact causal inference and filters cases with reasoning shortcuts. Upon supervised finetuning with 66.6K UniCo-generated instances, Qwen3-4B, Qwen3-8B and Olmo-3-7B-Instruct achieve an average of 22.9% improvements across all 18 in-distribution query types, and 8.1% over state-of-the-art causal data generation frameworks on 7 established causal benchmarks outside the training distribution. More importantly, in real-world medical understanding, legal decision, and tabular reasoning, UniCo-trained models consistently display more faithful reasoning traces, outperforming the base models by an average of 20.2% in faithfulness metrics. These suggest that causality-centered training not only strengthens causal reasoning, but also equips LLMs with a causal mindset in general reasoning tasks.

2605.24869 2026-05-26 cs.CL 版本更新

Lngram: N-gram Conditional Memory in Latent Space

Lngram: 潜在空间中的N-gram条件记忆

Yunao Zheng, Guoyang Xia, Xiaojie Wang, Lei Ren

发表机构 * Beijing University of Posts and Telecommunications (BUPT)(北京邮电大学) Li Auto Inc.(Li Auto公司)

AI总结 提出Lngram,一种在潜在空间中学习离散符号并执行N-gram查找的条件记忆模块,以解耦检索与骨干网络,提升长上下文语言建模和跨模态任务性能。

详情
AI中文摘要

序列建模需要组合推理和局部静态知识检索,而标准Transformer通过密集计算处理两者。Engram部分地将检索与骨干网络解耦,但其基于token的键仍依赖于文本分词和哈希压缩。我们提出Lngram,一种潜在空间中的条件记忆模块,直接从隐藏状态学习离散符号,并对这些符号执行N-gram查找。该设计消除了对分词器ID的依赖,并自然地扩展到非文本模态。在我们的评估设置中,Lngram优于Transformer和Engram基线,在长上下文语言建模中持续降低困惑度,并在事后添加到预训练模型时有效注入领域知识。与骨干网络的联合训练进一步超越了完全微调,而在视觉-语言和视觉-语言-动作任务上的实验显示了整体提升。使用LogitLens和CKA的分析表明,Lngram使预测相关信息更早出现,在有限的推理和内存开销下增加了有效深度。代码可在https://github.com/zyaaa-ux/Lngram获取。

英文摘要

Sequence modeling requires both compositional reasoning and local static knowledge retrieval, yet standard Transformers handle both through dense computation. Engram partially decouples retrieval from the backbone, but its token-based keys remain tied to text tokenization and hash compression. We propose Lngram, a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over these symbols. This design removes the dependence on tokenizer IDs and naturally extends to non-text modalities. In our evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long-context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone further surpasses full fine-tuning, while experiments on vision-language and vision-language-action tasks show overall gains. Analyses with LogitLens and CKA suggest that Lngram enables prediction-relevant information to emerge earlier, increasing effective depth with limited inference and memory overhead. Code is available at https://github.com/zyaaa-ux/Lngram.

2605.24867 2026-05-26 cs.AI cs.CL cs.NI 版本更新

Clustering as Reasoning: A $k$-Means Interpretation of Chain-of-Thought Graph Learning

聚类即推理:思维链图学习的 $k$-均值解释

Xuanting Xie, Zhaochen Guo, Bingheng Li, Xingtong Yu, Zhifei Liao, Zhao Kang, Yuan Fang

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Singapore Management University(新加坡国立大学) Michigan State University(密歇根州立大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出KCoT框架,通过将Transformer块与$k$-均值算法建立数学对应,将思维链推理与图表示学习统一,实现迭代语义-拓扑交互,在标准基准上超越现有方法。

Comments Accepted by ICML 2026

详情
AI中文摘要

思维链(CoT)提示在增强大型语言模型(LLMs)对文本属性图(TAGs)的推理能力方面显示出潜力。本文通过聚类即推理的原则重新审视基于CoT的图学习,提供了关于迭代推理如何在图结构数据上运行的$k$-均值解释。我们观察到现有的图CoT方法依赖于分离的架构和固定的图表示,限制了逐步的语义-拓扑交互和可解释性。为克服这一限制,我们提出了一个名为KCoT的统一框架,将CoT推理与图表示学习相结合。我们的关键理论结果揭示了Transformer块与$k$-均值算法之间的形式数学对应,使得推理可以被解释为迭代的分配和更新步骤。基于这一见解,我们引入了一个语义判别提示,明确将这些步骤形式化为结构化的CoT推理,并采用结构对齐策略将拓扑先验与演化的思维条件表示融合。在标准基准上的实验表明,与最先进的方法相比,该方法持续改进,验证了聚类作为基于CoT的图学习的原则性机制。

英文摘要

Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) on text-attributed graphs (TAGs). This work reframes CoT-based graph learning through the principle of clustering as reasoning, offering a $k$-means interpretation of how iterative reasoning operates over graph-structured data. We observe that existing graph CoT methods rely on disjoint architectures and fixed graph representations, limiting step-by-step semantic-topological interaction and interpretability. To overcome this limitation, we propose a unified framework named KCoT that integrates CoT reasoning with graph representation learning. Our key theoretical result reveals a formal mathematical correspondence between a Transformer block and the $k$-means algorithm, allowing reasoning to be interpreted as iterative assignment and update steps. Based on this insight, we introduce a Semantic Discriminating Prompt that explicitly formulates these steps as structured CoT reasoning, together with a structure-grounded alignment strategy to fuse topological priors with evolving thought-conditioned representations. Experiments on standard benchmarks demonstrate consistent improvements over state-of-the-art methods, validating clustering as a principled mechanism for CoT-based graph learning.

2605.24850 2026-05-26 cs.CL cs.IT math.IT stat.AP 版本更新

Repeated Sequences Reveal Gaps between Large Language Models and Natural Language

重复序列揭示大语言模型与自然语言之间的差距

Kumiko Tanaka-Ishii

发表机构 * Waseda University(早稻田大学)

AI总结 通过分析重复子序列的分布及其与高阶Rényi熵的关系,提出一种评估大语言模型生成文本长程统计组织的框架,发现GPT生成文本在熵增长模式上与自然语言存在系统性差异。

Comments ACL 2026

详情
AI中文摘要

评估大语言模型(LLMs)是否捕捉到自然语言的结构(超越局部流畅性)仍然是一个开放的挑战。现有的评估方法主要基于任务性能或短上下文行为,对生成文本的长程统计组织提供的洞察有限。我们提出了一种基于重复子序列的补充评估框架。通过分析其跨尺度的分布并将其与高阶Rényi熵联系起来,我们探究文本在有限长度条件下如何重用先前建立的结构。对人类撰写的文本和长度匹配的GPT生成文本的实验表明,虽然幂律模型可以描述有限范围的块长度,但观察到的熵增长通常同样或更好地由对数-幂形式刻画。跨数据集,自然语言在可访问范围内表现出稳定的熵增长模式,尽管个体文本之间存在变异性,但平均行为一致。相比之下,GPT生成文本的估计指数随模型大小呈现系统性和统计显著的变化。这些结果表明,重复子序列熵提供了一种定量的结构诊断,揭示了长程组织中的系统性差异,从而在表面流畅性之外区分自然语言与最先进的LLM输出。

英文摘要

Evaluating whether large language models (LLMs) capture the structure of natural language beyond local fluency remains an open challenge. Existing evaluation methods, largely based on task performance or short-context behavior, provide limited insight into the long-range statistical organization of generated text. We propose a complementary evaluation framework based on repeated subsequences. By analyzing their distribution across scales and relating it to higher-order Rényi entropies, we probe how texts reuse previously established structure under finite-length conditions. Experiments on human-written texts and length-matched GPT-generated texts show that, while power-law models can describe restricted ranges of block length, the observed entropy growth is often equally or better characterized by logarithmic--power forms. Across datasets, natural language exhibits stable entropy-growth patterns over accessible ranges, with consistent average behavior despite variability across individual texts. In contrast, GPT-generated texts show systematic and statistically significant shifts in estimated exponents with model size. These results demonstrate that repeated-subsequence entropy provides a quantitative structural diagnostic that reveals systematic differences in long-range organization, distinguishing natural language from state-of-the-art LLM outputs beyond surface-level fluency.

2605.24844 2026-05-26 cs.AI cs.CL 版本更新

Geo-Expert: Towards Expert-Level Geological Reasoning via Parameter-Efficient Fine-Tuning

Geo-Expert: 通过参数高效微调实现专家级地质推理

Chenyou Guo, Zongqi Liu, Yizhou Zhang, Zhaorui Jiang, Ze Liu

发表机构 * Ocean University of China(中国海洋大学) Peking University(北京大学) Monash University(墨尔本大学)

AI总结 本文提出Geo-Expert,通过参数高效微调(LoRA)在定制高质量指令数据集上微调小规模语言模型,在专门的地质推理基准Geo-Eval上,8B模型超越70B通用模型和GPT-4o,32B模型接近前沿推理模型。

Comments 11 pages, 1 figure, 3 tables. Accepted at ICML 2026 AI for Science Workshop

详情
AI中文摘要

虽然应用于地质学的通用大语言模型(LLM)在推理地下结构和深时演化时常常产生幻觉,但目前地球科学中的人工智能主要针对地表遥感和GIS。为弥补这一差距,我们引入了Geo-Expert,这是一个参数高效的地质LLM系列,基于我们自定义指令合成流程处理的自定义策划高质量指令数据集进行微调。我们通过使用低秩适配(LoRA)方法微调三个基础模型:Qwen3-8B、Qwen3-32B和Gemma-3-27B,研究了模型缩放和架构的影响。我们在新的领域特定基准Geo-Eval上的广泛评估表明,领域对齐的8B模型在专门的地质推理上可以超越开放权重的70B通用模型和专有的GPT-4o,而32B变体接近前沿推理模型。优化后的8B模型进一步为部署提供了具有竞争力的性价比。这项工作为科学LLM的民主化提供了可复现的配方,并为地质人工智能建立了基线。

英文摘要

While general-purpose Large Language Models (LLMs) applied to Geology often hallucinate when reasoning about subsurface structures and deep-time evolution, current AI in Earth sciences predominantly targets surface remote sensing and GIS. To bridge this gap, we introduce Geo-Expert, a family of parameter-efficient geological LLMs fine-tuned on a custom-curated, high-quality instruction dataset processed using our custom instruction synthesis pipeline. We investigate the impact of model scaling and architecture by fine-tuning three base models: Qwen3-8B, Qwen3-32B, and Gemma-3-27B, with Low-Rank Adaptation (LoRA) method. Our extensive evaluation on a novel domain-specific benchmark, Geo-Eval, reveals that a domain-aligned 8B model can outperform open-weight 70B generalists and proprietary GPT-4o on specialized geological reasoning, while a 32B variant approaches frontier reasoning models. The optimized 8B model further offers a competitive cost-performance ratio for deployment. This work provides a reproducible recipe for democratizing scientific LLMs and establishes a baseline for geological artificial intelligence.

2605.24842 2026-05-26 cs.CL cs.CY 版本更新

Translators as Invisible Teachers of AI: Copyright, Translation Memory, and the Political Economy of Linguistic Data

译者作为人工智能的无形教师:版权、翻译记忆库与语言数据的政治经济学

Masaru Yamada

发表机构 * College and Graduate School of Intercultural Communication, Rikkyo University(文化交流研究生院,立命馆大学)

AI总结 本文研究译者劳动如何转化为人工智能的基础数据资本,提出“无消费的挪用”和“译者的无形教师化”两个概念,并探讨版权框架下的数据供应链与再分配设计方向。

Comments 13 pages; comments welcome

详情
AI中文摘要

本文考察了译者的劳动如何转化为人工智能时代的基础数据资本。翻译记忆库和平行语料库保留了源文本和目标文本之间的一一对应关系,因此构成了机器翻译极其宝贵的监督训练数据。统计机器翻译、神经机器翻译、Transformer架构以及多语言大语言模型的发展与这类翻译数据的积累密不可分。然而,译者的译文作为合同交付物被购买,作为技术对象被分割,并在版权法下作为“信息分析”数据被处理——失去了对产生它们的译者的道德、创作和经济归属。本文提出了两个概念来捕捉这一过程。第一个是“无消费的挪用”:一种使用模式,作品不被阅读、观看或聆听,而仅被挖掘统计特征——这种使用在日本著作权法第30-4条下是合法的。第二个是“译者的无形教师化”:译者通过构建翻译记忆库、译后编辑和质量评估,充当了人工智能的教师而未得到承认的过程。基于从译者通过语言服务提供商和平台到模型开发者的数据供应链,对日本、欧洲和美国法律框架的比较解读,开放与专有AI模型的区分,以及人类生成数据在模型崩溃时代获得的溢价地位,本文探讨了译者实际担忧的问题,并指出了再分配设计的具体方向。

英文摘要

This paper examines how the labour of translators has been transformed into foundational data capital for the age of artificial intelligence (AI). Translation memories (TM) and parallel corpora preserve a one-to-one correspondence between source and target text and therefore constitute extraordinarily valuable supervised training data for machine translation. The development of statistical machine translation (SMT), neural machine translation (NMT), the Transformer architecture, and multilingual large language models (LLMs) cannot be disentangled from the accumulation of such translation data. And yet, translators' renditions have been bought as deliverables under contract, segmented as technical objects, and processed as "information analysis" data under copyright law -- losing their moral, creative, and economic attribution to the translators who produced them. The paper develops two concepts to capture this process. The first is appropriation without consumption: a mode of use in which works are not read, viewed, or listened to, but only mined for statistical features -- a use that is legitimated under Article 30-4 of the Japanese Copyright Act. The second is the invisible teacherisation of translators: the process by which translators, through the construction of translation memories, post-editing, and quality assessment, have functioned as teachers of AI without recognition as such. Drawing on the data supply chain that runs from translators through language service providers (LSPs) and platforms to model developers, on a comparative reading of Japanese, European, and United States legal frameworks, on the distinction between open and proprietary AI models, and on the premium status that human-generated data has acquired in the era of model collapse, the paper asks what translators are actually afraid of, and points toward concrete directions for redistributive design.

2605.24817 2026-05-26 cs.CR cs.AR cs.CL cs.LG 版本更新

RouteScan: A Non-Intrusive Approach to Auditing MoE LLMs Safety via Expert Routing Telemetry

RouteScan: 通过专家路由遥测对MoE大语言模型安全性进行非侵入式审计

Bo Lv, Zhiheng Xu, KeDong Xiu, Ruyi Ding, Tianhang Zheng, Zhibo Wang, Kui Ren

发表机构 * Zhejiang University(浙江大学) Donghua University(东华大学) Louisiana State University(路易斯安那州立大学)

AI总结 提出RouteScan,一种利用MoE模型GPU级专家路由遥测(如预填充阶段活跃线程数)作为微架构指纹,通过轻量级检测流水线识别恶意提示的非侵入式审计框架,在未见过的有害领域AUROC超0.93,新越狱包装下超0.96,且相比基于内容的审计方法具有隐私优势。

Comments 20 pages. Under submission

详情
AI中文摘要

混合专家(MoE)架构已成为扩展大型语言模型(LLM)日益重要的范式。随着MoE模型越来越多地部署在实际服务中,安全性审计变得必要,以验证这些模型在运行过程中是否产生或助长有害行为。然而,现有的基于内容的审计方法通常需要访问用户提示、模型输入或生成输出,可能暴露敏感用户信息,并在LLM安全性和用户隐私之间造成根本性紧张。另一方面,我们观察到,在MoE模型中,稀疏专家路由将不同输入映射到激活不同的专家执行模式,在低级GPU执行遥测中产生可测量的足迹。受此观察启发,我们提出RouteScan,一种通过GPU级专家路由遥测检测有害行为的非侵入式审计框架。具体而言,RouteScan利用预填充阶段分配给专家模块的活跃GPU线程数作为判别性微架构指纹,并构建轻量级检测流水线,隔离跨领域不变风险指标以精确识别恶意提示。对具有不同路由设计的开源MoE LLM的全面评估表明,RouteScan实现了强泛化,在未见过的有害领域AUROC超过0.93,在新型越狱包装下超过0.96。此外,经验性反演测试表明,收集的专家路由遥测为提示重建提供的信息有限,表明相对于基于内容的审计方法具有实际隐私优势。

英文摘要

Mixture-of-Experts (MoE) architectures have become an increasingly important paradigm for scaling Large Language Models (LLMs). As MoE models are increasingly deployed in real-world services, safety auditing becomes necessary to verify whether these models produce or facilitate harmful behaviors during operation. However, existing content-based auditing methods typically require access to user prompts, model inputs, or generated outputs, potentially exposing sensitive user information and creating a fundamental tension between LLM safety and user privacy. On the other hand, we observe that, in MoE models, sparse expert routing maps different inputs to activate different expert-execution patterns, producing measurable footprints in low-level GPU execution telemetry. Inspired by this observation, we propose RouteScan, a non-intrusive auditing framework for detecting harmful behaviors through GPU-level expert routing telemetry. Specifically, RouteScan utilizes the number of active GPU threads allocated to expert modules during the prefilling phase as a discriminative micro-architectural fingerprint, and builds a lightweight detection pipeline that isolates cross-domain invariant risk indicators for the precise identification of malicious prompts. Comprehensive evaluations on open-source MoE LLMs with distinct routing designs demonstrate that RouteScan achieves strong generalization, with an AUROC exceeding 0.93 on unseen harmful domains and 0.96 under novel jailbreak wrappers. Moreover, empirical inversion tests show that the collected expert routing telemetry provides limited information for prompt reconstruction, suggesting a practical privacy advantage over content-based auditing methods.

2605.24794 2026-05-26 cs.CV cs.CL 版本更新

DUEL: Adversarial Self-Play for Multimodal Reasoning

DUEL: 用于多模态推理的对抗性自我对弈

Lin Qiu, Hanqing Zeng, Yao Liu, Bingjun Sun, Guangdeng Liao, Ji Liu

发表机构 * Meta AI

AI总结 提出DUEL框架,通过对抗性自我对弈从预训练VLM生成监督信号,结合长度归一化对数似然奖励,无需人工标注即可提升视觉推理与判别能力。

详情
AI中文摘要

强化学习已成为提升视觉语言模型推理能力的有效范式。然而,基于RL的优化通常依赖于昂贵且难以扩展的高质量标注。现有的无监督替代方案可能因弱视觉基础和缺乏可靠验证信号而偏向有偏解。我们提出一个自我进化的训练后框架DUEL,其中监督信号源于从同一预训练VLM初始化的两个策略之间的对抗性交互。挑战者生成一个基于图像的真实声明及其最小扰动的难负样本,而求解者验证两个声明与图像的一致性,从而在近邻语义下鼓励细粒度视觉判别。为了稳定优化,我们引入长度归一化的对数似然奖励,在二元结果监督之外保留信息性优化信号,并在稀疏反馈下提高学习稳定性。实验表明,DUEL在无需额外人工标注、外部奖励模型或图像编辑工具的情况下,持续提升视觉推理和鲁棒判别能力。

英文摘要

Reinforcement learning (RL) has emerged as an effective paradigm for improving the reasoning capability of vision-language models (VLMs). However, RL-based optimization typically depends on costly high-quality annotations that are difficult to scale. Existing unsupervised alternatives may drift toward biased solutions due to weak visual grounding and the lack of reliable verification signals. We propose a self-evolving post-training framework, DUEL, where supervision emerges from adversarial interactions between two policies initialized from the same pretrained VLM. A Challenger generates an image-grounded true claim together with a minimally perturbed hard-negative counterpart, while a Solver verifies both claims against the image, encouraging fine-grained visual discrimination under near-neighbor semantics. To stabilize optimization, we introduce a length-normalized log-likelihood reward that preserves informative optimization signals beyond binary outcome supervision and improves learning stability under sparse feedback. Experiments show that DUEL consistently improves visual reasoning and robust discrimination without additional human annotations, external reward models, or image editing tools.

2605.24793 2026-05-26 cs.CL 版本更新

Beyond the Target: From Imitation to Collaboration in Speculative Decoding

超越目标:从模仿到协作的推测解码

Jinze Li, Yixing Xu, Guanchen Li, Jinfeng Xu, Shuo Yang, Yang Zhang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum

发表机构 * Advanced Micro Devices, Inc.(先进微设备公司) The University of Hong Kong(香港大学) University of North Texas(北卡罗来纳州立大学)

AI总结 提出协作推测解码(CoSpec),通过强化学习训练仲裁策略,在推测解码中灵活选择接受草稿或目标模型的令牌,在保持加速的同时超越仅使用目标模型的性能。

Comments under review

详情
AI中文摘要

推测解码(SPD)通过让较小的草稿模型提出多个未来令牌,并由较大的目标模型并行验证,从而加速大型语言模型(LLM)推理。主流的SPD范式将目标模型视为唯一可靠的教师,仅当草稿令牌与目标预测完全匹配时才接受它。这种设计隐含地假设目标在每个位置都是更好的选择。在实践中,这一假设并不成立。尽管草稿模型整体上较弱,但在令牌级别上并非均匀地劣于目标。在草稿与目标不一致的有意义的情况下,草稿的选择往往能导致正确的最终答案。受此启发,我们引入了 extbf{协作推测解码(CoSpec)},这是SPD的一种泛化,不再将目标模型视为唯一的令牌级权威。CoSpec通过强化学习训练一个仲裁策略,以决定是接受来自草稿还是目标模型的令牌,在不匹配时选择性地接受草稿令牌,如果这样做可能产生正确的最终答案。实验结果表明,CoSpec在保持显著加速的同时,超越了仅使用目标模型的性能。通过将重点从模仿转向协作,CoSpec为推测解码提供了新的视角。

英文摘要

Speculative decoding (SPD) accelerates large language model (LLM) inference by letting a smaller draft model propose multiple future tokens that are verified in parallel by a larger target model. The dominant SPD paradigm treats the target model as the sole reliable teacher, accepting a draft token only when it exactly matches the target prediction. This design implicitly assumes that the target is always the better choice at every position. In practice, this assumption does not hold. Although the draft is the weaker model overall, it is not uniformly inferior at the token level. In a meaningful fraction of cases where draft and target disagree, the draft's choice is the one that leads to the correct final answer. Inspired by this, we introduce \textbf{Collaborative Speculative Decoding (CoSpec)}, a generalization of SPD that no longer treats the target model as the sole token-level authority. CoSpec trains an arbitration policy via reinforcement learning to decide whether to accept tokens from the draft or target model, selectively accepting draft tokens at mismatches when doing so is likely to yield a correct final answer. Experimental results show that CoSpec maintains substantial speedups while surpassing target-only performance. By shifting the emphasis from imitation to collaboration, CoSpec suggests a new perspective on speculative decoding.

2605.24764 2026-05-26 cs.IR cs.AI cs.CL 版本更新

Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems

光谱检索:基于多尺度sinc卷积的令牌嵌入局部化检索在LLM多智能体系统中的应用

Andrea Morandi

发表机构 * Cisco(思科)

AI总结 提出光谱检索方法,通过多尺度sinc卷积对令牌嵌入进行重排序,在无需重新训练的情况下显著提升局部化检索性能,并自然适配于LLM多智能体系统。

详情
AI中文摘要

[删节版] - 光谱检索是一种插件式重排序阶段,通过在令牌嵌入上进行多尺度sinc卷积,在逐令牌MaxSim和均值池化检索之间进行插值。在标准稠密检索中,每个文档是一个均值池化向量;当相关性局限于一个短子跨度时,信号会平均为噪声。光谱检索重用来自晚期交互索引的逐令牌嵌入,并将其与归一化的sinc核在多个尺度上进行卷积。在L=1时,核作为恒等映射,恢复逐令牌MaxSim;随着L增大,它趋近于均匀滤波器,恢复均值池化。跨位置和尺度的最大余弦产生一个得分,其信息量不低于任一端点。在一个包含1000个文档和植入单位置尖峰的可控合成基准上,无论尖峰强度如何,均值池化检索处于随机水平(Recall@10 ~ 0.02),而光谱检索在植入余弦超过语料级令牌噪声基底时达到Recall@10 = 1.0。在冻结的all-mpnet-base-v2编码器上的LIMIT-small数据集中,光谱检索无需重新训练即可将Recall@10从0.33提升至0.90,MRR从0.22提升至0.79,严格Success@10从0.12提升至0.84。该方法自然适用于多智能体LLM系统,其中每个智能体受益于共享语料库上更紧密、特定角色的检索窗口。

英文摘要

[Abridged] - Spectral Retrieval is a plug-in re-ranking stage that interpolates between per-token MaxSim and mean-pool retrieval through a multi-scale sinc convolution over token embeddings. In standard dense retrieval each document is one mean-pooled vector; when relevance localises into a short subspan, the signal averages into noise. Spectral Retrieval reuses per-token embeddings from a late-interaction index and convolves them with a normalised sinc kernel at multiple scales. At L=1 the kernel acts as the identity, recovering per-token MaxSim; as L grows it approaches a uniform filter, recovering mean pooling. The maximum cosine over positions and scales yields a score provably no less informative than either endpoint. On a controlled synthetic benchmark with 1,000 documents and planted single-position spikes, mean-pool retrieval sits at chance (Recall@10 ~ 0.02) regardless of spike strength, while Spectral Retrieval reaches Recall@10 = 1.0 once the planted cosine exceeds the corpus-level token noise floor. On LIMIT-small with a frozen all-mpnet-base-v2 encoder, Spectral Retrieval lifts Recall@10 from 0.33 to 0.90, MRR from 0.22 to 0.79, and strict Success@10 from 0.12 to 0.84, without retraining. The method fits naturally into multi-agent LLM systems, where each agent benefits from a tighter, role-specific retrieval window over a shared corpus.

2605.24755 2026-05-26 cs.AI cs.CL 版本更新

Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language Models

使用多智能体语言模型自动检测和分类自然音频日记中的妄想相关内容

Feng Chen, Justin Tauscher, Changye Li, Meliha Yetisgen, Alex Cohen, Adam Kuczynski, Angelina Pei-Tzu Tsai, Benjamin Buck, Dror Ben-Zeev, Trevor Cohen

发表机构 * Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA(生物医学信息学与医学教育系,华盛顿大学,西雅图,华盛顿州,美国) Department of Psychiatry and Behavioral Sciences, University of Washington, Seattle, WA, USA(精神病学与行为科学系,华盛顿大学,西雅图,华盛顿州,美国) Department of Psychology, Louisiana State University, Baton Rouge, LA, USA(心理学系,路易斯安那州立大学,巴吞鲁日,路易斯安那州,美国) Department of Psychiatry, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA(精神病学系,北卡罗来纳大学教堂山分校,教堂山,北卡罗来纳州,美国)

AI总结 提出一种多智能体LLM流水线,从自然音频日记中自动检测和分类妄想信念、情感和行为反应,通过多数投票实现稳健性能。

Comments Accepted by CLPych 2026

详情
AI中文摘要

在自然环境中录制的言语独白为表征精神疾病现象学和检测症状恶化提供了机会。大型语言模型(LLM)为自动化这一过程提供了新的可能性,因为它们主要需要标注数据进行评估而非训练。在本文中,我们提出了一种新颖的自动化多智能体LLM流水线,用于从具有中度被害妄想的人的音频日记转录中,进行细粒度、多标签的提取,以识别暗示妄想信念、相关情感反应和行为反应的语言。通过评估三个基础模型的集成,我们证明详细的诊断提示指令成功减少了妄想主题分类的假阳性,但也限制了情感或行为反应的解读。此外,比较多智能体裁决框架表明,智能体之间的复杂对话辩论通过诱导过早共识降低了临床模糊文本的准确性。相反,多数投票建立了稳健的性能(妄想检测和分类的Micro F1分别为0.872和0.779)。这项工作为自动检测和表征自然言语中暗示妄想信念的内容提供了一个经过验证且可扩展的流水线。

英文摘要

Speech monologues recorded in naturalistic settings provide opportunities to characterize mental illness phenomenology and detect symptom exacerbation. Large language models (LLMs) offer new possibilities for automating this process, as they require annotated data primarily for evaluation rather than training. In this paper, we present a novel automated, multi-agent LLM pipeline for the fine-grained, multi-label extraction of language suggestive of delusional beliefs, associated affective responses, and behavioral responses from transcripts of naturalistic audio diaries collected from people with moderate persecutory ideation. Evaluating an ensemble of three foundation models, we demonstrate that detailed diagnostic prompt instructions successfully reduce false positives for delusional theme classification, but also constrain the interpretation of affective or behavioral responses. Furthermore, comparing multi-agent adjudication frameworks shows that complex conversational debate between agents diminishes accuracy on clinically ambiguous text by inducing premature consensus. Instead, majority voting establishes robust performance (Micro F1 of 0.872 and 0.779 for delusion detection and classification respectively). This work provides a validated and scalable pipeline for the automated detection and characterization of content suggesting delusional beliefs in naturalistic speech.

2605.24737 2026-05-26 cs.CL cs.AI cs.CY 版本更新

Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring

谁来评判评判者?基于指标的治理:面向持续LLM合规监控的运行时框架

Jehanne Dussert

发表机构 * Independent Researcher(独立研究者)

AI总结 针对AI合规作为审计时二元判定而非生产系统持续可测量属性的问题,提出基于指标的治理原则,并开发开源框架govllm,通过运行时可观测性信号实现持续合规监控,验证了多模型陪审团设计在监管评估中的有效性。

Comments 41 pages, 8 figures, preprint

详情
AI中文摘要

当前AI合规方法将合规性视为审计时的二元判定,而非生产系统的持续可测量属性。我们认为这种合规虚构在结构上不适合欧盟AI法案的要求,该法案要求持续的人类监督和检测部署系统中涌现的行为漂移。我们引入了基于指标的治理原则,即监管合规性是从运行时可观测性中推导出的持续信号,而非来自静态评估。基于这一原则,我们提出了govllm,一个开源框架,实现了治理驱动的路由架构,其中模型选择由累积的合规分数决定,而非仅由延迟或成本决定。我们方法的核心是一个监管评判者小组——针对每个标准(欧盟AI法案、GDPR、ANSSI、可访问性)专门化的LLM评估器——我们将评判者间的分歧重新定义为监管不确定性信号,而非噪声,需要人工仲裁。我们通过一个包含49个标注提示/响应对的地面真实语料库验证了该方法,涵盖五个监管标准,由四个完全本地运行的小型语言模型(SLM,1.7B-7B参数)评估。一致率从51.5%(mistral:7b)到69.1%(phi4-mini)不等,没有单一模型在所有标准上占主导地位——这从经验上激励了“档案即陪审团”的设计。我们进一步记录了小型监管评判者中的三种结构性失败模式,以及一种评判者特定的位置偏差,该偏差在三种问题顺序条件(原始、反转、排列)下使一致率降低多达25个百分点。govllm作为开源软件发布,以支持可复现的AI治理研究。

英文摘要

Current approaches to AI compliance treat conformity as a binary, audit-time verdict rather than a continuous, measurable property of production systems. We argue that this compliance fiction is structurally ill-suited to the requirements of the EU AI Act, which demands ongoing human oversight and the detection of emergent behavioural drift in deployed systems. We introduce governance from metrics, a principle whereby regulatory compliance is derived as a continuous signal from runtime observability rather than from static assessments. Building on this principle, we present govllm, an open-source framework implementing a governance-driven routing architecture in which model selection is determined by accumulated compliance scores rather than by latency or cost alone. Central to our approach is a panel of regulatory judges - LLM evaluators specialised per criterion (EU AI Act, GDPR, ANSSI, accessibility) - whose inter-judge disagreement we reframe not as noise but as a regulatory uncertainty signal warranting human arbitration. We validate this approach through a ground truth corpus of 49 annotated prompt/response pairs across five regulatory criteria, evaluated by four small language models (SLMs, 1.7B-7B parameters) running fully on-premise. Agreement rates range from 51.5% (mistral:7b) to 69.1% (phi4-mini), with no single model dominating across all criteria - empirically motivating the Profile-as-jury design. We further document three structural failure modes in small regulatory judges and a judge-specific position bias that degrades agreement by up to 25 percentage points across three question-order conditions (original, reversed, permuted). govllm is released as open-source software to support reproducible AI governance research.

2605.24733 2026-05-26 cs.CL 版本更新

StepGap: A Hybrid NLI-LLM Checker for Step-Level Evidence-Gap Detectionin Multi-Hop Question Answering

StepGap:一种用于多跳问答中步骤级证据缺口检测的混合NLI-LLM检查器

Yuelyu Ji, Zhuochun Li, Hui Ji, Daqing He

发表机构 * School of Computing and Information, University of Pittsburgh(计算信息学院,匹兹堡大学)

AI总结 提出混合NLI-LLM决策树StepGap,用于检测多跳问答中的步骤级证据缺口并输出三类标签,在82个问题上达到sF1=72.0,且作为GRPO过程奖励可提升模型精确匹配率。

详情
AI中文摘要

我们提出 extbf{StepGap},一种混合NLI-LLM决策树,用于检测多跳问答中的步骤级证据缺口,并输出三类标签: extsc{矛盾声明}(CC)、 extsc{无关证据}(IE)或 extsc{缺失桥梁}(MB),每个标签对应具体的修复动作。在82个多跳问题(181个标注步骤,$κ{=}0.704$)上,StepGap达到sF1$=$72.0,处于纯LLM基线(70.1)的bootstrap置信区间内,但具有更可分解的结构:移除StepGap的每个阶段都会 extit{降低}F1,而四个纯LLM移除中有三个 extit{提高}F1——这是 extit{竞争性错误抵消}的迹象,即内部阶段相互掩盖错误。我们进一步揭示了 extit{Q-F1陷阱}:问题级F1被标记每一步的检查器机械地膨胀,使得步骤级F1成为必要的诊断指标。作为带类型的GRPO过程奖励,StepGap将Qwen2.5-7B-Instruct的精确匹配率从$32.1{\pm}0.3$提升至$35.4{\pm}0.9$(三个种子),单次运行比较显示,与匹配的Search-R1 GRPO复现相比,平均EM增益为$+5.6$。

英文摘要

We present \textbf{StepGap}, a hybrid NLI-LLM decision tree that detects step-level evidence gaps in multi-hop QA and emits one of three typed labels: \textsc{Contradicted Claim} (CC), \textsc{Irrelevant Evidence} (IE), or \textsc{Missing Bridge} (MB), each tied to a concrete repair action. On 82 multi-hop questions (181 annotated steps, $κ{=}0.704$), StepGap reaches sF1$=$72.0, within the bootstrap confidence interval of an LLM-only baseline (70.1) but with a more decomposable structure: every StepGap stage \emph{hurts} F1 when removed, while three of four LLM-only removals \emph{improve} F1 -- a sign of \emph{competing-error cancellation}, where internal stages mask each other's errors. We further expose a \emph{Q-F1 trap}: question-level F1 is mechanically inflated by checkers that flag every step, making step-level F1 the necessary diagnostic. Used as a typed GRPO process reward, StepGap improves Qwen2.5-7B-Instruct Exact Match from $32.1{\pm}0.3$ to $35.4{\pm}0.9$ across three seeds, with the single-run comparison showing a $+5.6$ Avg EM gain over the matched Search-R1 GRPO reproduction.

2605.24721 2026-05-26 cs.CL 版本更新

ROC Analysis for Evaluating Translation Quality Estimation Systems

ROC分析用于评估翻译质量估计系统

Evelyn Y. Garland, Carola F. Berger

发表机构 * Acta-Transphere CFB Scientific Translations LLC

AI总结 本文提出使用接收者操作特征(ROC)分析评估自动翻译质量估计(QE)系统,该方法与现有方法结果一致,并能为商业决策提供可操作的性能洞察。

Comments 16 pages, 8 PNG figures, 3 tables, uses acl.sty

详情
AI中文摘要

自动翻译质量估计(QE)系统的日益普及,需要实用的、面向决策的方法来评估其性能。我们提出接收者操作特征(ROC)分析是用于此目的的有用方法。我们的研究表明,ROC分析不仅产生与当前流行方法一致的结果,而且还提供了几个重要优势,包括支持商业决策的可操作性能洞察。

英文摘要

The increasing use of automated translation quality estimation (QE) systems calls for practical, decision-oriented methods for evaluating their performance. We propose that Receiver Operating Characteristic (ROC) analysis is a useful approach for this purpose. Our study shows that ROC analysis not only produces results consistent with currently prevalent methods, but also offers several important advantages, including actionable performance insights that support business decision-making.

2605.24719 2026-05-26 cs.CL cs.AI 版本更新

World-State Transformations for Neuro-symbolic Interactive Storytelling

世界状态转换用于神经符号交互式故事讲述

Santiago Góngora, Luis Chiruzzo, Gonzalo Méndez, Pablo Gervás

AI总结 本研究探索在神经符号架构中利用LLM预测规则系统中的世界状态转换,以解决纯LLM方法的故事连贯性问题,并通过实验表明该方法能保持世界状态一致性并促进玩家创造性输入。

Comments To be presented at the 17th International Conference on Computational Creativity (ICCC'26)

详情
AI中文摘要

大型语言模型(LLM)改变了处理自由文本用户输入的交互式故事讲述系统的可能性。然而,随着这类系统越来越多地被构建,越来越多的证据表明,仅依赖它们会出现故事连贯性问题。最近的研究表明,LLM可以有效地预测基于规则的交互式故事讲述系统中的状态变化,触发预编程的世界状态转换。在本文中,我们进行了一项探索性评估,研究这种转换是否可以作为玩家表达的催化剂,同时旨在解决纯LLM方法典型的连贯性问题。基于神经符号架构,我们使用开源模型(Llama 3 70B)和闭源模型(Gemini 1.5 Flash)进行了实验,测试以英语和西班牙语进行。八名参与者玩了两个场景,这些场景经过精心设计以评估不同的评估目标。我们的观察表明,转换提供了一种保持世界状态一致性的方式,同时鼓励玩家通过他们的书面输入进行创造性互动。

英文摘要

Large Language Models (LLMs) have changed the possibilities of Interactive Storytelling systems that process free-text user input. However, as more of these systems are built, evidence continues to mount regarding the story coherence problems that arise when relying solely on them. Recent research suggests that LLMs can effectively predict state changes within rule-based Interactive Storytelling systems, triggering pre-programmed world-state transformations. In this paper, we conduct an exploratory evaluation of whether such transformations can serve as a catalyst for player expression while aiming to address the incoherence issues typical of purely LLM-based approaches. Building upon a neuro-symbolic architecture, we conducted experiments using an open-source model (Llama 3 70B) and a closed-source model (Gemini 1.5 Flash), with testing conducted in both English and Spanish. Eight participants played two scenarios, carefully designed to assess different evaluation objectives. Our observations suggest that transformations offer a way to maintain world-state consistency while encouraging players to interact creatively through their written inputs.

2605.24718 2026-05-26 cs.CL 版本更新

The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty

25种欧洲语言的Tokenizer税:领域不变性、跨语言少样本效应与乌克兰语惩罚

Volodymyr Ovcharov

发表机构 * LEX AI Platform(LEX AI平台) legal.org.ua Kyiv, Ukraine(基辅,乌克兰)

AI总结 研究测量了10个基础模型在25种欧洲语言上的tokenizer生育率,揭示了从英语到其他语言的成本差异,并发现乌克兰语因预训练数据不足而支付额外成本。

Comments 16 pages, 3 figures, 8 tables. Dataset: https://huggingface.co/datasets/overthelex/tokenizer-fertility-map

详情
AI中文摘要

Tokenizer生育率(每词token数)对非英语NLP施加了隐藏成本。我们在平行文本上测量了10个基础模型在25种欧洲语言上的生育率,生成了首个受控的欧洲tokenizer税地图。该税从英语(1.2 tokens/词)到希腊语/马耳他语(约3.1)跨度达2.5倍,遵循清晰层次:罗曼语族(1.5-1.7)、日耳曼语族(1.7-1.9)、斯拉夫语族(2.2-2.5)、乌拉尔语系/波罗的语族(2.7-3.0)。乌克兰语(2.7)比同源斯拉夫语言多支付15-18%,反映了其在预训练数据中的代表性不足。生育率排名在三种文本语域中具有领域不变性(rho > 0.97)。子词分析表明,高生育率tokenizer会碎片化形态边界而非保留它们。对四种斯拉夫语言的跨语言少样本评估显示,少样本效应是模型固有的,而非语言依赖的。我们将所有测量结果作为公共数据集发布。

英文摘要

Tokenizer fertility the number of tokens per word imposes a hidden cost on non-English NLP. We measure fertility for ten foundation models across 25 European languages on parallel text, producing the first controlled tokenizer tax map for the continent. The tax spans 2.5x from English (1.2 tokens/word) to Greek/Maltese (~3.1), following a clear hierarchy: Romance (1.5-1.7), Germanic (1.7-1.9), Slavic (2.2-2.5), Uralic/Baltic (2.7-3.0). Ukrainian (2.7) pays 15-18% more than cognate Slavic languages, reflecting underrepresentation in pre-training data. Fertility rankings are domain-invariant across three text registers (rho > 0.97). A subword analysis reveals that high-fertility tokenizers fragment morphological boundaries rather than preserving them. Cross-lingual few-shot evaluation on four Slavic languages shows that few-shot effects are model-intrinsic, not language-dependent. We release all measurements as a public dataset.

2605.24703 2026-05-26 cs.CL cs.AI 版本更新

TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

TS-Skill: 用于评估时间序列问答中分析技能的基准

Liying Han, Kang Yang, Oliver Wang, Jason Wu, Pengrui Quan, Gaofeng Dong, Ozan Baris Mulayim, Sizhe Ma, Yuyang Yuan, Dezhi Hong, Mario Berges, Mani Srivastava

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Samsung Research America(三星美国研究院) Carnegie Mellon University(卡内基梅隆大学) Microsoft(微软) Amazon(亚马逊)

AI总结 提出TS-Skill基准,通过三种可组合的分析技能(时间尺度选择、时间定位和跨区间整合)来诊断时间序列问答中模型的信号级能力,并开发SKEvol框架自动构建基准,实验揭示不同技能上的能力差距。

详情
AI中文摘要

大型语言模型(LLMs)和时间序列语言模型(TSLMs)越来越多地应用于时间序列问答(TSQA)。与纯文本问答不同,TSQA要求模型将答案基于时间信号,这些信号的模式可能出现在不同尺度、特定时间位置或跨分离区间。然而,现有的基准通常按任务类型或高层次推理类别组织,难以诊断驱动模型性能的底层信号级能力。我们引入TS-Skill,一个用于评估TSQA中三种可组合分析技能的控制基准:时间尺度选择(SK1)、时间定位(SK2)和跨区间整合(SK3)。TS-Skill提供时间戳感知的问题、广泛的领域覆盖以及人工验证的问答质量。为了大规模构建基准,我们开发了SKEvol,一个技能引导的智能体框架,结合了领域感知的时间序列种子生成、技能控制的问题生成、元数据和代码辅助的答案构建、多阶段信号接地验证以及人在回路中的策展。在十个最先进的LLMs和TSLMs上的实验揭示了SK1-SK3之间显著且不均匀的能力差距。特别是,SK3对非智能体模型始终具有挑战性,而工具增强的智能体在独立的SK3上显示出选择性优势。这些发现表明,技能级评估可以揭示被聚合TSQA分数掩盖的时间推理失败。

英文摘要

Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unlike text-only QA, TSQA requires models to ground answers in temporal signals whose patterns may occur at different scales, specific time locations, or across separated intervals. However, existing benchmarks are typically organized by task types or high-level reasoning categories, making it difficult to diagnose the underlying signal-level capabilities driving model performance. We introduce TS-Skill, a controlled benchmark for evaluating three composable analytical skills in TSQA: temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). TS-Skill provides timestamp-aware questions, broad domain coverage, and human-validated QA quality. To construct the benchmark at scale, we develop SKEvol, a skill-guided agentic framework that combines domain-aware time-series seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase signal-grounded verification, and human-in-the-loop curation. Experiments on ten state-of-the-art LLMs and TSLMs reveal substantial and uneven capability gaps across SK1-SK3. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. These findings demonstrate that skill-level evaluation can uncover temporal reasoning failures that are obscured by aggregate TSQA scores.

2605.24697 2026-05-26 cs.CL cs.AI 版本更新

The Path Matters: Learning a Token-Commitment Policy for Diffusion Language Models

路径很重要:学习扩散语言模型的令牌提交策略

Bohang Sun, Max Zhu, Francesco Caso, Jindong Gu, Junchi Yu, Philip Torr, Pietro Liò, Jialin Yu

发表机构 * Department of Computer Science and Technology, University of Cambridge(计算机科学与技术系,剑桥大学) Department of Engineering Science, University of Oxford(工程科学系,牛津大学)

AI总结 本文提出TraceLock,一种轻量级可插拔控制器,通过学习可复用的轨迹状态策略来优化扩散语言模型中的令牌提交决策,从而改善质量与步数之间的权衡。

详情
AI中文摘要

扩散大语言模型通过并行细化多个令牌位置有望实现更快的生成,但这种并行性引入了一个隐藏的控制问题:每一步中哪些提议的令牌应被转移到部分解码的序列中?我们将此决策称为令牌提交。现有的冻结生成器解码器主要依赖于手工设计的置信度规则或特定块的接受过滤器。我们认为令牌提交可以学习为一种可复用的轨迹状态策略。我们引入了TraceLock,一种轻量级可插拔控制器,为冻结的扩散语言模型实例化此策略。由于无法获得 oracle 提交时间,TraceLock 从未来稳定性中推导出自我监督:在解码步骤 t,如果提议的令牌在完整解码轨迹完成后与位置 i 的最终令牌匹配,则将其标记为稳定。控制器对可变长度的轨迹状态进行评分,并决定哪些活跃的令牌提议应被提交到部分解码的序列中。一旦为给定的冻结主干训练完成,该控制器可以在局部窗口宽度、生成长度和步数预算下部署,无需重新训练或按设置校准。在问答、数学推理和代码生成上的实验表明,TraceLock 在质量-步数权衡上优于启发式和学习的基线,在跨设置部署下尤其稳定。诊断分析表明,其决策不能简化为标量置信度,这表明冻结的扩散语言模型暴露了一个超越基于置信度解码的可学习的提交轨迹空间。代码可在 https://github.com/BobSun98/TraceLock 获取。

英文摘要

Diffusion large language models promise faster generation by refining many token positions in parallel, but this parallelism introduces a hidden control problem: which proposed tokens should be transferred into the partially decoded sequence at each step? We refer to this decision as token commitment. Existing frozen-generator decoders largely rely on hand-designed confidence rules or block-specific acceptance filters. We argue that token commitment can instead be learned as a reusable trace-state policy. We introduce TraceLock, a lightweight plug-in controller that instantiates this policy for a frozen diffusion language model. Since oracle commitment times are unavailable, TraceLock derives self-supervision from future stability: at decoding step t, a proposed token for position i is labeled stable if it matches the final token at position i after the full decoding trace completes. The controller scores variable-length trace states and decides which active token proposals should be committed to the partially decoded sequence. Once trained for a given frozen backbone, the controller can be deployed across local-window widths, generation lengths, and step budgets without retraining or per-setting calibration. Experiments on question answering, mathematical reasoning, and code generation show that TraceLock improves the quality-step tradeoff over heuristic and learned baselines, with particularly stable behavior under cross-setting deployment. Diagnostic analyses show that its decisions are not reducible to scalar confidence, suggesting that frozen diffusion language models expose a learnable space of commitment trajectories beyond confidence-based decoding. Code is available at https://github.com/BobSun98/TraceLock.

2605.24693 2026-05-26 cs.CL 版本更新

CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming

CP-Agent: 一种用于反馈驱动竞赛编程的校准风险控制智能体

Peisong Wang, Bowen Liu, Zehua Li, Yuyao Wang, Zhiwei Ma, Yuhan Li, Jia Li

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州))

AI总结 提出CP-Agent,通过校准停止过程建模反馈驱动求解,结合双重粒度验证、测试增强和经验驱动自我进化机制,在不更新参数的情况下显著提升竞赛编程性能。

Comments Code: https://github.com/NineAbyss/CP-Agent

详情
AI中文摘要

大型语言模型在竞赛级编程中仍存在困难,而许多智能体解决方案依赖于大量的推理时采样或昂贵的多阶段后训练。我们研究了执行反馈何时能可靠地帮助LLM竞赛编程求解器,以及哪些机制支配着性能提升。我们将反馈驱动求解建模为校准停止过程,并识别出三个量:虚假接纳风险、针对不良程序的程序级证据以及活跃状态成功风险。在保留的轨迹校准和从预先声明的有限控制器清单中选择下,所得的结构性证书在虚假接纳之前为干净成功概率提供了下界。我们针对这些量实例化了机制:双重粒度验证、测试增强和经验驱动自我进化,从而得到CP-Agent。在不更新任何参数的情况下,CP-Agent在LiveCodeBench Pro上将Pass@1从25.8%提升至48.5%,并在ICPC-Eval上将Refine@5提高了11.0%。在三个LLM骨干网络上,CP-Agent处于成本-准确率效率前沿,消融实验表明每个组件主要影响其对应的证书量。

英文摘要

Large language models still struggle with contest-level programming, while many agentic remedies rely on massive inference-time sampling or expensive multi-stage post-training. We study when execution feedback reliably helps an LLM CP solver and which mechanisms govern the gains. We model feedback-driven solving as a calibrated stopped process and identify three quantities: false-admission risk, program-level evidence against bad programs, and the active-state success hazard. Under held-out trace calibration and selection from a pre-declared finite controller manifest, the resulting structural certificate lower-bounds the clean success probability before false admission. We instantiate mechanisms targeting these quantities as Dual-Granularity Verification, Test Augmentation, and Experience-Driven Self-Evolving, yielding CP-Agent. Without updating any parameters, CP-Agent raises Pass@1 from 25.8\% to 48.5\% on LiveCodeBench Pro and improves Refine@5 by 11.0\% on ICPC-Eval. Across three LLM backbones, CP-Agent lies on the cost--accuracy efficiency frontier, and ablations show that each component primarily affects its corresponding certificate quantity.

2605.24661 2026-05-26 cs.AI cs.CL 版本更新

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

衡量LLM中的推理质量:一个多维行为框架

Ali Şenol, Garima Agrawal, Huan Liu

发表机构 * Department of Computer Engineering, Tarsus University(塔鲁斯大学计算机工程系) School of Computing and Augmented Intelligence (SCAI), Arizona State University (ASU)(计算与增强智能学院(SCAI),亚利桑那州立大学(ASU)) HumaConn AI Consulting(HumaConn AI咨询)

AI总结 提出一个基于行为的多维框架,从正确性、一致性、鲁棒性、逻辑连贯性、效率和稳定性六个维度评估LLM推理质量,揭示仅靠准确率无法观察到的行为,并支持部署决策。

详情
AI中文摘要

LLMs在复杂推理任务中取得了显著成功,但当前的评估方法主要依赖最终答案的正确性,对产生这些答案的底层推理过程提供的洞察有限。为弥补这一空白,本研究从行为角度提出了一个统一的多维框架来衡量LLMs的推理质量,操作化了六个理论驱动的维度:正确性(CQ)、一致性(CS)、鲁棒性(RS)、逻辑连贯性(LS)、效率(ES)和稳定性(SS)。在四个基准测试的975个条目上对七个LLMs进行的广泛实验表明,该框架揭示了仅靠准确率指标无法观察到的行为。值得注意的是,逻辑连贯性与正确性正交(r = -0.172,不显著),证实了正确答案可能源于不连贯的推理,而Claude-Haiku-4.5取得了最高的多维得分(Q_bal = 0.778)。此外,该框架暴露了关键的排名反转:DeepSeek-V3在准确率优先下排名第二,但在法律/合规权重下排名第五,这种反转是单一指标评估无法检测到的。判别效度证实11/15个维度对是独立的(|r| < 0.50),为将每个维度视为不同信号提供了心理测量学支持。该框架产生的维度概况直接支持三类部署决策:识别那些虽然最终答案正确但推理轨迹无法通过问责审计的模型(LS--CQ正交性);防止仅基于准确率的基准测试导致的排名错误;以及确保没有单一指标默默替代框架捕获的六个独立信号。

英文摘要

LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those answers. To address this gap, this study proposes a unified multi-dimensional framework for measuring reasoning quality in LLMs from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments on seven LLMs across 975 items from four benchmarks demonstrate that the framework reveals behaviors invisible to accuracy-only metrics. Notably, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can arise from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). Furthermore, the framework exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity confirms 11/15 dimension pairs are independent (|r| < 0.50), providing psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS--CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.

2605.24647 2026-05-26 cs.CL 版本更新

Know You Before You Speak: User-State Modeling for LLM Personalization in Multi-Turn Conversation

在你说话之前了解你:多轮对话中用于LLM个性化的用户状态建模

Jiani Luo, Xiaoyan Zhao, Yang Zhang, Shuyi Miao, Bingbing Xu, Stefan Konigorski, Tat-Seng Chua

发表机构 * School of Computing, National University of Singapore(新加坡国立大学计算机学院) School of Artificial Intelligence, Beihang University(北航人工智能学院) Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) German Institute of Human Nutrition Potsdam-Rehbruecke(德国人类营养研究所波茨坦-雷赫布鲁克)

AI总结 提出基于自由能原理的PUMA框架,通过显式用户状态模型和动作选择机制,将个性化对话从被动记忆检索转变为基于模型的用户演化决策,在医疗咨询基准上提升长期对话效果。

Comments 30pages, 3 figures

详情
AI中文摘要

个性化对话不仅需要回忆显式的用户历史:系统还需要推断通过交互演化并塑造适当响应策略的隐藏用户状态。现有的基于记忆和配置文件的方法主要重用可观察的用户信息,对建模用户状态动态或基于它们如何塑造未来用户状态来选择动作的支持有限。我们提出了PUMA(面向动作选择的预期用户状态建模),这是一个基于自由能原理(FEP)的框架,将个性化形式化为部分可观测下的决策,围绕一个显式的用户状态模型,该模型捕获潜在用户状态及其动作条件动态。在每一轮中,PUMA维护对用户隐藏状态的信念,细化用于观测生成和动作条件状态转移的用户状态模型,并通过最小化预期自由能来选择对话动作,在统一标准下平衡认知和实用目标。这种表述将个性化从被动记忆检索转变为基于模型的用户演化决策。我们在面向医疗咨询和动机性访谈的基准上实例化PUMA,并带有潜在状态标注以进行严格评估。实验表明,PUMA在保持强响应质量的同时改善了长期对话结果,跨数据集研究展示了更可靠的用户状态估计和下一状态预测。

英文摘要

Personalized dialogue requires more than recalling explicit user histories: systems also need to infer hidden user states that evolve through interaction and shape appropriate response strategies. Existing memory- and profile-based methods primarily reuse observable user information, offering limited support for modeling user-state dynamics or selecting actions based on how they shape future user states. We propose PUMA (Prospective User-state Modeling for Action selection), a framework grounded in the Free Energy Principle (FEP) that formulates personalization as decision-making under partial observability, centered on an explicit user state model that captures latent user states and their action-conditioned dynamics. At each turn, PUMA maintains a belief over the user's hidden state, refines the user state model for observation generation and action-conditioned state transition, and selects dialogue actions by minimizing expected free energy, balancing epistemic and pragmatic objectives under a unified criterion. This formulation shifts personalization from passive memory retrieval to model-based decision-making over user evolution. We instantiate PUMA on healthcare-oriented counseling and motivational interviewing benchmarks with latent state annotations for rigorous evaluation. Experiments show that PUMA improves long-horizon dialogue outcomes while maintaining strong response quality, and a cross-dataset study demonstrates more reliable user-state estimation and next-state prediction.

2605.24635 2026-05-26 cs.CL 版本更新

HiMed: Incentivizing Hindi Reasoning in Medical LLMs

HiMed: 激励医疗大语言模型中的印地语推理

Dingfeng Jiang, Han Yan, Chenze Ma, Amit Kumar Jaiswal, Ang Li, Yunxiang Jiang, Xinlei Xiong, Juhao Liang, Hongru Xiao, Xiang Li, Fan Bu, Jiale Han, Ruchir Gupta, Prayag Tiwari, Benyou Wang

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Indian Institute of Technology (Banaras Hindu University) Varanasi(印度理工学院(班加罗尔 Hindu 大学)瓦拉纳西分校) Tongji University(同济大学) Shenzhen Research Institute of Big Data(深圳大数据研究院) Shenzhen Loop Area Institute(深圳科创园区研究院) The Hong Kong University of Science and Technology(香港科技大学) Halmstad University(哈尔姆斯塔德大学)

AI总结 针对医疗大语言模型在印地语上表现不佳的问题,提出HiMed印地语医疗推理语料库与基准,并通过衰减支架奖励训练HiMed-8B模型,显著提升印地语医疗推理性能并缩小英印准确率差距。

详情
AI中文摘要

医疗大语言模型有望减少医疗保健差距,但印地语仍然严重代表性不足。尽管医疗大语言模型在高资源语言中表现出色,但其性能在印地语中急剧下降,尤其是在印度医学体系方面。我们认为,稳健的跨语言医疗迁移需要印地语推理。为此,我们引入了HiMed,一个涵盖西方和印度医学的印地语推理医疗语料库和基准套件。我们进一步通过设计衰减支架奖励,提出了HiMed-8B,一个印地语医疗推理大语言模型。大量实验表明,印地语医疗推理性能得到提升,英印准确率差距缩小。消融研究验证了每个训练阶段和奖励组件的贡献。所有数据和代码均可在GitHub上获取:https://github.com/FreedomIntelligence/HiMed。

英文摘要

Medical large language models hold promise for reducing healthcare disparities, yet Hindi remains severely underrepresented. While medical LLMs excel in high-resource languages, their performance degrades sharply in Hindi, particularly on Indian systems of medicine. We argue that robust cross-lingual medical transfer requires Hindi reasoning. To this end, we introduce HiMed, a Hindi reasoning medical corpus and benchmark suite covering both Western and Indian medicine. We further propose HiMed-8B, a Hindi-form medical reasoning LLM, through the design of decaying scaffolding reward. Extensive experiments demonstrate improvement in Hindi medical reasoning performance and reduction in the English--Hindi accuracy gap. Ablation studies validate the contribution of each training stage and reward component. All data and code are available on GitHub: https://github.com/FreedomIntelligence/HiMed.

2605.24614 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Measuring the Depth of LLM Unlearning via Activation Patching

通过激活修补测量大语言模型遗忘的深度

Jaeung Lee, Dohyun Kim, Jaemin Jo

发表机构 * Sungkyunkwan University(全北大学)

AI总结 提出遗忘深度评分(UDS),通过激活修补量化遗忘的机制深度,在150个遗忘模型上的元评估中达到最高忠实性和鲁棒性。

Comments 18 pages

详情
AI中文摘要

大语言模型遗忘已成为隐私保护和人工智能安全的关键事后机制,但审计目标知识是否真正被擦除仍然具有挑战性。现有的输出级指标无法检测到这些知识是否仍可从内部表示中恢复。最近的白盒研究揭示了此类残留知识,但通常依赖于辅助训练或数据集特定调整,缺乏可推广的指标。为解决这些限制,我们提出遗忘深度评分(UDS),一种通过激活修补量化遗忘机制深度的指标。UDS首先使用保留模型基线识别编码目标知识的层,然后在0-1尺度上测量遗忘模型中该知识被擦除的程度。在跨越8种方法的150个遗忘模型上的20个指标的元评估中,UDS实现了最高的忠实性和鲁棒性,证实了我们的因果方法是遗忘评估中最可靠的。案例研究进一步揭示,白盒指标可能在层级别上不一致,并且擦除深度因示例而异。我们提供了将UDS集成到现有基准测试框架并简化评估流程的指南。代码和数据可在https://github.com/gnueaj/unlearning-depth-score获取。

英文摘要

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score

2605.24613 2026-05-26 cs.CL cs.AI cs.SE 版本更新

Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning

Guarded Repair: 面向危害感知的LLM数学推理事后替换

Haizhou Xia

AI总结 提出GuardedRepair框架,通过选择性替换机制在修复LLM数学推理错误时避免破坏正确结果,在GSM8K上准确率从95.60%提升至96.89%且未破坏正确案例。

Comments 15 pages,including appendices. Code and artifacts available at https://github.com/Haizhoux0517/guarded-repair

详情
AI中文摘要

LLM数学推理的事后修复引入了一种不对称风险:修复错误的推理轨迹是有用的,但替换原本正确的轨迹可能有害。我们在选择性替换设置下研究该问题,系统必须决定修复后的候选是否比保留原始缓存轨迹更安全。我们提出GuardedRepair,一种有保护的best-of-N修复框架,它诊断缓存推理轨迹,选择性触发修复,并仅在确定性验证守卫支持替换时才接受改变答案的候选。该框架结合了轻量级符号检查、表面语义风险诊断、有界候选生成和保守接受策略。在完整GSM8K测试集上,初始推理器已达到95.60%准确率,GuardedRepair将最终准确率提升至96.89%,修复了58个剩余错误中的17个,且主运行中未测量到破坏正确案例。在弱推理器ASDiv设置中,准确率从78.40%提升至87.60%。直接重新生成基线表明,这一增益不能仅由更强模型重新求解解释:重新求解所有GSM8K示例将准确率降至93.03%,并破坏了47个初始正确答案。额外分析表明,有保护修复显著改善了修复/破坏权衡,同时也揭示了替换风险被降低而非消除。这些结果支持将事后修复视为危害感知的选择性替换而非无约束的重新求解。

英文摘要

Post-hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a trace that was already correct can be harmful. We study this problem under a selective replacement setting, where a system must decide whether a repaired candidate is safer than preserving the original cached trace. We present GuardedRepair, a guarded best-of-N repair framework that diagnoses cached reasoning traces, selectively triggers repair, and accepts answer-changing candidates only when deterministic verification guards support replacement. The framework combines lightweight symbolic checks, surface semantic-risk diagnostics, bounded candidate generation, and conservative acceptance policies. On the full GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair improves final accuracy to 96.89%, fixing 17 of 58 remaining errors without measured broken-correct cases in the main run. On a weak-reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%. Direct regeneration baselines show that this gain is not explained by stronger-model re-solving alone: re-solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers. Additional analyses show that guarded repair substantially improves the fixed/broken tradeoff, while also revealing that replacement risk is reduced rather than eliminated. These results support viewing post-hoc repair as harm-aware selective replacement rather than unconstrained re-solving.

2605.24603 2026-05-26 cs.CL cs.LG 版本更新

CSP-Atlas: Concept-Specific Neural Circuits in a Sparse Python Transformer

CSP-Atlas: 稀疏Python Transformer中的概念特异性神经回路

Piotr Wilam

发表机构 * University College London(伦敦大学学院)

AI总结 通过提取106个Python概念的特异性神经回路,发现模型内部组织遵循计算结构而非语义类别,并识别出原子性超簇。

Comments Code: https://github.com/piotrwilam/AtlasCSP

详情
AI中文摘要

一个稀疏的8层代码Transformer为每个测试的Python构造开发了专用的神经回路,并且这些回路按照清晰的计算原则而非语义类别进行组织。我们通过边缘化63,800个受控提示,提取了106个概念(43个AST节点类型,63个内置对象)的神经回路,并使用对比检查提示(呈现一个关键字标记而不带其关联的句法结构)将每个回路分解为概念特异性和标记驱动组件。出现了三个发现。首先,所有106个概念在九个参数设置中的每一个都产生非空的通用回路,并且跨构造的概念特异性排名在扫描中保持稳定——存活不是宽松阈值的伪影。其次,AST回路包含一个与标记激活不同的真正概念组件:在中间到后期层,仅概念神经元占最强烈激活神经元的比例高达62.5%,而内置回路几乎完全由标记驱动。第三,六个计算上原子的构造——Import、ImportFrom、Break、Continue、Pass、Assert——尽管在语义上不相关,却聚集在一起,仅共享作为不需要嵌套体的单语句构造的属性;这个原子性超簇,以及由标记歧义性和结构独特性组织的四层层次结构,表明模型的内部组织追踪计算结构而非含义。方法、完整分解数据和分析代码已发布。

英文摘要

A sparse 8-layer code transformer develops dedicated neural circuitry for every Python construct tested, and that circuitry is organised by a clean computational principle rather than by semantic category. We extract neural circuits for 106 concepts (43 AST node types, 63 builtin objects) by marginalising across 63,800 controlled prompts, and decompose each circuit into concept-specific and token-driven components using contrastive checker prompts that present a keyword token without its associated syntactic structure. Three findings emerge. First, all 106 concepts produce non-empty universal circuits at every one of nine parameter settings, and the ranking of concept-specificity across constructs is stable across the sweep - survival is not an artifact of a permissive threshold. Second, AST circuits contain a genuine concept component distinct from token activation: concept-only neurons constitute up to 62.5% of the loudest-firing neurons at mid-to-late layers, while builtin circuits are almost entirely token-driven. Third, six computationally atomic constructs - Import, ImportFrom, Break, Continue, Pass, Assert - cluster together despite being semantically unrelated, sharing only the property of being single-statement constructs requiring no nested body; this atomicity super-cluster, together with a four-tier hierarchy organised by token ambiguity and structural distinctiveness, shows that the model's internal organisation tracks computational structure rather than meaning. The methodology, full decomposition data, and analysis code are released.

2605.24597 2026-05-26 cs.AI cs.CL cs.LG 版本更新

Learning to Reason Efficiently with A* Post-Training

学习通过A*后训练进行高效推理

Andreas Opedal, Francesco Ignazio Re, Abulhair Saparov, Mrinmaya Sachan, Bernhard Schölkopf, Ryan Cotterell

发表机构 * ETH Zürich(苏黎世联邦理工学院) MPI for Intelligent Systems, Tübingen(图宾根智能系统研究所) Purdue University(普渡大学)

AI总结 本文通过A*搜索算法指导LLM生成正确且高效的推理步骤,提出监督微调和强化学习两种训练方法,在1B-3B参数模型上显著提升推理准确性和效率。

Comments Preprint

详情
AI中文摘要

大型语言模型(LLM)的许多应用需要演绎推理,但模型经常产生不正确或冗余的推理步骤。我们将自然语言推理框架化为一个搜索问题,其中最终答案本身就是有效的证明,需要推理过程中间推理正确。具体来说,我们研究LLM是否能够通过A*搜索(一种保证通向目标的最优高效路径的算法)的指导,学习生成正确且高效的证明。我们探索了两种训练技术:在A*执行轨迹上的监督微调,以及使用A*信息的过程奖励模型进行强化学习。实验发现,1B-3B范围内的Llama-3.2模型从A*后训练中获益显著,从接近零准确率提升到超越更大的模型DeepSeek-V3.2。我们的分析揭示了一个权衡:简单的正确性奖励最大化准确率,而A*信息的信号在准确率和效率之间取得平衡。此外,我们发现,在更大的搜索空间中,使用不完美启发式训练的模型表现出更高的准确率。我们的结果展示了朝着由经典搜索算法原理指导的推理方向的有前景的路径。

英文摘要

Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is the valid proof itself, requiring a reasoning procedure in which intermediate inferences are correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search -- an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A* and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B--3B range benefit substantially from A* post training, going from near-zero accuracy to outperforming DeepSeek-V3.2 -- a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.

2605.24585 2026-05-26 cs.CL q-bio.NC 版本更新

Word Class Representations Spontaneously Emerge from Successor Representations Trained on Natural Language

从自然语言训练的后继表示中自发涌现的词类表示

Mathis Immertreu, Achim Schilling, Thomas Kinfe, Patrick Krauss

发表机构 * Cognitive Computational Neuroscience Group(认知计算神经科学组) Friedrich-Alexander-Universität Erlangen–Nürnberg (FAU)(弗赖堡-亚历山大-大学埃尔兰根-纽伦堡(FAU)) Mannheim Center for Neuromodulation and Neuroprosthetics (MCNN)(曼海姆神经调制与神经假体中心(MCNN)) University Hospital Mannheim, University Heidelberg(曼海姆大学医院,海德堡大学)

AI总结 本研究将强化学习中的后继表示(SR)框架应用于自然语言,通过训练神经网络预测未来词分布,发现无监督下词类(如名词、动词、形容词)的几何结构自发涌现,且预测时域影响结构层次。

详情
AI中文摘要

语言模型通常被训练来预测序列中的下一个词。这里,我们探索来自强化学习的另一种预测原则:后继表示(SR),它建模未来状态的期望折扣分布,而不是直接的下一个状态。我们将这一框架迁移到自然语言,并训练神经网络在多个时间视界上预测未来词分布,从而学习长程转移结构的表示。我们在WikiText-103(1.03亿词;2万词词汇)上训练深度残差神经网络,并使用KL散度将后继表示优化为概率分布。在没有显式语言监督的情况下,结构化语言表示自发涌现。训练后,学习到的空间相对于词性(POS)类别发展出清晰的几何组织:名词、动词和形容词变得可分离,并通过无监督聚类恢复。这种组织系统地依赖于预测视界:短视界产生最强的句法结构,而长视界逐渐整合更广泛的上下文和语义信息。在更精细的分辨率下,额外的可解释词汇子结构出现,揭示了主要词类内的连贯子类。这些发现表明,句法类别无需显式编码,而可能作为预测序列学习的结果出现。据我们所知,这项工作首次将后继表示系统应用于自然语言,并在强化学习、语言学和认知神经科学之间建立了概念桥梁。

英文摘要

Language models are typically trained to predict the next token in a sequence. Here, we explore an alternative predictive principle from reinforcement learning: Successor Representations (SRs), which model the expected discounted distribution of future states rather than the immediate next state. We transfer this framework to natural language and train neural networks to predict future word distributions across multiple temporal horizons, thereby learning representations of long-range transition structure. We train a deep residual neural network on WikiText-103 (103 million tokens; 20,000-word vocabulary) and optimize successor representations as probability distributions using KL divergence. Without explicit linguistic supervision, structured language representations emerge spontaneously. After training, the learned space develops a clear geometric organization with respect to part-of-speech (POS) categories: nouns, verbs, and adjectives become separable and recoverable through unsupervised clustering. This organization depends systematically on predictive horizon, with short horizons producing the strongest syntactic structure and longer horizons increasingly integrating broader contextual and semantic information. At finer resolutions, additional interpretable lexical substructure emerges, revealing coherent subclasses within major word categories. These findings suggest that syntactic categories need not be explicitly encoded but may arise as a consequence of predictive sequence learning. To our knowledge, this work provides the first systematic application of successor representations to natural language and establishes a conceptual bridge between reinforcement learning, linguistics, and cognitive neuroscience.

2605.24579 2026-05-26 cs.CL 版本更新

WhenLoss: Diagnosing Write and Retrieval Bottlenecks in Long-Context Memory Systems

WhenLoss: 诊断长上下文记忆系统中的写入与检索瓶颈

Jiangnan Yu, Kisson Songqi Lin, Jilong Wu

AI总结 提出四条件诊断协议发现写入阶段是长上下文记忆系统的主要瓶颈,并基于此提出预期预测压缩(EPC)方法,在写入时利用LLM预测未来问题并保留关键证据,显著提升系统性能。

Comments 14 pages, 7 figures, 9 tables

详情
AI中文摘要

长上下文记忆系统在固定预算下常常失败,但端到端评估无法揭示证据是在压缩过程中被丢弃还是被保留但从未被检索。我们引入了一个四条件诊断协议,在截断完整上下文(TFC)、证据预言(OE)、完整存储记忆(CSM)和检索记忆(RM)条件下评估固定阅读器。在此固定预算的LongMemEval设置下,大多数测试基线的写入侧差距超过检索侧差距,其中六个基线中的四个在我们的默认诊断裕度下稳健地表现为写入主导。受此诊断启发,我们提出预期预测压缩(EPC),该方法将关键决策——保留哪些信息——移至写入时间,通过使用LLM预测未来可能的问题并在令牌预算下保留最少的支持证据,同时在问题时间保持检索不变。在所有500个LongMemEval问题中,使用三个阅读器(GPT-5.2、Claude Sonnet 4、Gemini 2.5 Pro),EPC在所有系统中取得了最高的CSM分数(0.49,而最强基线Summary (LLM)为0.44),将Delta_write降至0.04,同时Delta_retr与其他基于LLM的系统相当。这些结果表明,在此基准和评估设置下,改进写入阶段保留的内容是测试系统性能提升的关键途径。

英文摘要

Long-context memory systems often fail under fixed budgets, but end-to-end evaluation does not reveal whether evidence was discarded during compression or preserved but never retrieved. We introduce a four-condition diagnostic protocol that evaluates a fixed reader under truncated full context (TFC), oracle evidence (OE), complete stored memory (CSM), and retrieved memory (RM). Under this fixed-budget LongMemEval setup, write-side gaps exceed retrieval-side gaps for most tested baselines, with four of six baselines robustly write-dominant under our default diagnosis margin. Motivated by this diagnosis, we propose Expected Predictive Compression (EPC), which moves the key decision--what information to retain--to write time by using an LLM to anticipate likely future questions and preserve the minimal supporting evidence under the token budget, while leaving retrieval unchanged at question time. Across all 500 LongMemEval questions with three readers (GPT-5.2, Claude Sonnet 4, Gemini 2.5 Pro), EPC achieves the highest CSM scores among all systems (0.49 vs. 0.44 for Summary (LLM), the strongest baseline), reducing Delta_write to 0.04 while leaving Delta_retr comparable to other LLM-based systems. These results suggest that, on this benchmark and evaluation setup, improving what the write stage preserves is a key avenue for performance gains in the tested systems.

2605.24577 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m

多态性即旋转:从两层Transformer到Pythia-70m的操作性机械可解释性

Jordan F. McCann

发表机构 * Independent Researcher(独立研究者)

AI总结 本文发现独立训练的Transformer在残差流基上通过均匀随机旋转相互关联,并利用正交Procrustes拟合实现特征字典和转向向量在模型间的迁移,无需重新训练。

Comments 26 pages, 4 figures, 40 references. Pre-registered four-bar framework; all numerical claims reproducible

详情
AI中文摘要

独立训练的Transformer在残差流基上计算相同的函数,这些基通过$\mathrm{SO}(d_{\mathrm{model}})$上的均匀随机旋转相互关联。我们将这种现象称为多态性:相同的函数,但内部坐标互不可解。每对模型之间的一次矩阵乘法即可消除这种多态性:在单批激活上进行正交Procrustes拟合,即可在独立训练的模型之间迁移稀疏自编码器特征字典和转向向量,无需重新训练。 该现象对标准SAE通用性度量不可见。解码器列余弦相似度在不同种子间匹配度达98%,即SAE通用性的头条数字,而一个种子训练的SAE重构另一个种子的激活时,解释方差为负,比预测常数均值更差。解码器列对齐,但编码器从旋转后的框架读取。单个Procrustes旋转$R$可在每个内部位置将重构恢复至种子内上限的0.025 EV以内。 $R$服从Haar分布:$\|R - I\|_F$与随机正交预测$\sqrt{2 d_{\mathrm{model}}}$在$d_{\mathrm{model}} = 512$时匹配至0.1%,且$R$的特征值谱与Haar $\mathrm{SO}(d_{\mathrm{model}})$的Kolmogorov-Smirnov检验在合并和逐对情况下均返回$p \approx 1.000$。均值差转向向量通过与$R$的不变子空间对齐在三种机制下迁移:当被共享输出权重固定时清晰,与旋转子空间重叠时部分,否则反转。在无共享输入/输出(Pythia)时,所有三种情况均坍缩为普遍反转。同一旋转解释适用于单次运行中的不同训练检查点。 在104k参数的Dyck-3 Transformer和九个独立训练的Pythia-70m种子(基于The Pile数据集)上,通过预注册的四柱操作框架进行验证。前沿规模(10B+)的复现仍有待研究。

英文摘要

Independently trained transformers compute the same function in residual-stream bases that differ by a uniform random rotation on $\mathrm{SO}(d_{\mathrm{model}})$. We call this phenomenon polymorphism: same function, mutually unintelligible interior coordinates. One matrix multiplication per model pair removes it: an orthogonal Procrustes fit on a single batch of activations transfers sparse-autoencoder feature dictionaries and steering vectors between independently trained models, with no retraining. The phenomenon is invisible to the standard SAE universality metric. Decoder-column cosine similarity matches across seeds at 98%, the SAE-universality headline number, while an SAE trained on one seed reconstructs another seed's activations at negative explained variance, worse than predicting the constant mean. The decoder columns align; the encoder reads from a rotated frame. A single Procrustes rotation $R$ restores reconstruction to within 0.025 EV of the within-seed ceiling at every internal site. $R$ is Haar-distributed: $\|R - I\|_F$ matches the random-orthogonal prediction $\sqrt{2 d_{\mathrm{model}}}$ to 0.1% at $d_{\mathrm{model}} = 512$, and a Kolmogorov-Smirnov test of $R$'s eigenvalue spectrum against Haar $\mathrm{SO}(d_{\mathrm{model}})$ returns $p \approx 1.000$ pooled and per-pair. Diff-of-means steering vectors transfer in three regimes by alignment with $R$'s invariant subspace: clean when pinned by shared output weights, partial when overlapping the rotated subspace, inverted otherwise. With no shared I/O (Pythia), all three collapse to universally inverted. The same rotation account holds across training checkpoints within a single run. Validated on a 104k-parameter Dyck-3 transformer and nine independently-trained Pythia-70m seeds on The Pile, via a pre-registered four-bar operational framework. Frontier-scale (10B+) replication remains open.

2605.24573 2026-05-26 cs.CL 版本更新

AstroMind: A High-Fidelity Benchmark for Spacecraft Behavior Reasoning Based on Large Language Models

AstroMind:基于大型语言模型的航天器行为推理高保真基准

Hao Liu, Siyuan Yang, Qinglei Hu, Dongyu Li

发表机构 * Hangzhou International Innovation Institute, Beihang University(北京航空航天大学杭州国际创新研究院) KTH Royal Institute of Technology(皇家理工学院) School of Automation Science and Electrical Engineering, Beihang University(北京航空航天大学自动化科学与电气工程学院) School of Cyber Science and Technology, Beihang University(北京航空航天大学网络空间安全学院)

AI总结 针对航天器机动行为理解问题,提出基于高保真天体动力学模拟和真实观测约束的基准AstroMind,涵盖意图推断、参数估计和威胁评估三类任务,并评估多种开源模型表现。

详情
AI中文摘要

理解航天器为何机动——而不仅仅是它机动了——对于空间领域感知而言是一个日益重要的问题,因为地球轨道变得越来越拥挤和充满竞争。当前的分析流程是为检测而构建的:它们擅长发现发生了某事,但在推理其含义方面则不那么擅长。AstroMind 是一个基于物理的基准,旨在弥合这一差距。它利用高保真天体动力学模拟和真实观测约束,将其转化为三类任务中可验证的推理问题:意图推断、机动参数估计和威胁评估。每个场景都包含真实的传感噪声和不同可靠性水平的多源文本情报。评估指标同时衡量物理约束下的语义正确性和定量一致性。对一系列开源模型的基准测试显示,没有单一模型在所有维度上占优:Qwen3 (32B) 在意图推断准确性上领先;QwQ (32B) 在威胁评估上领先,并在解析项上实现了最低的中位相对误差;GPT-OSS (20B) 产生了最强的评判推理质量,并为参数估计提取了最多的标量值(241个解析项中的136个)。训练数据组成和推理风格与模型大小同等重要。结构化的推理提示在测试的8B模型中持续有帮助,对于已经能够跟踪物理约束的模型,收益更大。AstroMind 为该领域提供了一个共享测试,用于解决一个既需要正确理解物理又需要正确解读战术态势的问题——两者单独都不足够。

英文摘要

Understanding why a spacecraft maneuvers -- rather than simply that it did -- is an increasingly important problem for space domain awareness as Earth orbits grow crowded and contested. Current analysis pipelines are built for detection: they are good at picking up that something happened, less good at reasoning about what it means. AstroMind is a physics-grounded benchmark designed to close that gap. It draws on high-fidelity astrodynamics simulations and real observational constraints, converting them into verifiable reasoning problems across three task types: intent inference, maneuver parameter estimation, and threat assessment. Each scenario includes realistic sensing noise and multi-source textual intelligence at varying reliability levels. Evaluation metrics capture both semantic correctness and quantitative consistency under physical constraints. Benchmarking a suite of open-weight models shows no single model dominates every axis: Qwen3 (32B) leads on intent inference accuracy; QwQ (32B) leads on threat assessment and achieves the lowest median relative error on parsed items; GPT-OSS (20B) produces the strongest judged reasoning quality and extracts the most scalar values for parameter estimation (136 of 241 parsed items). Training data composition and reasoning style matter as much as model size. Structured reasoning prompts help consistently across tested 8B models, with larger gains for those that can already track physical constraints. AstroMind gives the field a shared test for a problem where getting the physics right and reading the tactical situation correctly are both required -- neither is sufficient on its own.

2605.24556 2026-05-26 cs.IR cs.CL cs.LG 版本更新

The Multilingual Curse at the Retrieval Layer: Evidence from Amharic

多语言诅咒在检索层:来自阿姆哈拉语的证据

Yosef Worku Alemneh, Kidist Amde Mekonnen, Maarten de Rijke

发表机构 * Independent Researcher(独立研究者) University of Amsterdam(阿姆斯特丹大学)

AI总结 针对零样本多语言检索在低资源形态丰富语言(如阿姆哈拉语)上表现不佳的问题,通过对比实验发现单语检索器显著优于多语言检索器,并揭示了多语言基准测试的局限性。

Comments 10 pages, 4 tables. Accepted to the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM) at ACL 2026

详情
AI中文摘要

多语言检索日益支撑着跨语言问答和检索增强生成。在多语言基准测试上的强零样本分数常被视为当前编码器能可靠跨语言迁移的证据。我们认为,对于代表性不足、形态丰富的语言,这一假设不成立,并以阿姆哈拉语作为诊断案例。在涵盖密集、延迟交互、学习稀疏和交叉编码器范式的共享段落检索协议下,我们比较了零样本多语言检索器、阿姆哈拉语微调的多语言检索器以及单语阿姆哈拉语检索器。最强的零样本多语言检索器在MRR@10上比最强的单语阿姆哈拉语第一阶段检索器低23%。在相同的阿姆哈拉语监督下微调两个最新的多语言嵌入模型,相比零样本获得了32-60%的相对MRR@10提升,但最佳阿姆哈拉语微调多语言模型仍低于最强的单语阿姆哈拉语检索器。这些发现表明,零样本多语言检索并不能充分代表LLM时代公平的信息访问:对于代表性不足的语言,检索必须在语言内部进行评估和适应,而不是从聚合的多语言基准测试中推断。为促进未来研究,我们在https://github.com/rasyosef/amharic-neural-ir 公开发布了数据集、代码库和训练模型。

英文摘要

Multilingual retrieval increasingly underpins cross-lingual question answering and retrieval-augmented generation. Strong zero-shot scores on multilingual benchmarks are often taken as evidence that current encoders transfer reliably across many languages. We argue that this assumption breaks down for underrepresented, morphologically rich languages, and use Amharic as a diagnostic case. Under a shared passage retrieval protocol covering dense, late-interaction, learned sparse, and cross-encoder paradigms, we compare zero-shot multilingual retrievers, Amharic-fine-tuned multilingual retrievers, and monolingual Amharic retrievers. The strongest zero-shot multilingual retriever underperforms the strongest monolingual Amharic first-stage retriever by 23% relative MRR@10. Fine-tuning two recent multilingual embedding models on the same Amharic supervision yields 32-60% relative MRR@10 gains over zero-shot, but the best Amharic-fine-tuned multilingual model remains below the strongest monolingual Amharic retriever. These findings indicate that zero-shot multilingual retrieval is not a sufficient proxy for equitable information access in the LLM era: for underrepresented languages, retrieval must be evaluated and adapted in-language rather than inferred from aggregate multilingual benchmarks. To foster future research, we publicly release the dataset, codebase, and trained models at https://github.com/rasyosef/amharic-neural-ir.

2605.24550 2026-05-26 cs.AI cs.CL cs.LG 版本更新

Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models

越狱以保护:通过临时越狱进行缓冲和强化以实现大型语言模型的安全微调

Seokil Ham, Jaehyuk Jang, Wonjun Lee, Changick Kim

发表机构 * School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院电子工程学院)

AI总结 针对微调即服务中安全对齐被有害微调攻击削弱的问题,提出一种基于梯度分析的缓冲与强化框架,通过临时越狱适配器减少有害更新并利用QR分解合并强化安全,实现无需额外安全数据的高效防御。

Comments ICML 2026 Spotlight

详情
AI中文摘要

微调即服务(FaaS)使得大型语言模型(LLMs)的个性化成为可能,但它在有害微调攻击下会削弱安全对齐。最近的研究表明,在微调期间激活有害行为模块可以防止模型学习不良行为,但其机制尚不清楚。在本文中,我们重新审视临时越狱作为对抗有害微调的一种防御手段,并提供了梯度层面的分析,表明它能够饱和安全退化梯度,同时保留良性任务相关梯度。基于这一见解,我们提出了一种缓冲与强化微调框架,该框架在用户微调期间缓冲有害更新,并在适应后强化安全。具体来说,BufferLoRA作为一个可移除的适配器,在用户微调期间诱导临时越狱以减少有害更新。适应后,通过基于QR分解的合并,将经过训练的ReinforceLoRA(用于在临时越狱状态下恢复拒绝行为)与UserLoRA集成,以在保持用户任务性能的同时强化安全。大量实验表明,我们的框架在用户微调期间无需额外安全数据且计算成本极低的情况下,实现了卓越的安全性和实用性。

英文摘要

Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety-alignment under harmful fine-tuning attacks. Recent work has shown that activating harmful-behavior modules during fine-tuning can prevent models from learning undesired behaviors, but its mechanism remains unclear. In this paper, we revisit temporary jailbreaking as a defense against harmful fine-tuning and provide a gradient-level analysis showing that it saturates safety-degrading gradients while preserving benign task-relevant gradients. Based on this insight, we propose a Buffer-and-Reinforce fine-tuning framework that buffers harmful updates during user fine-tuning and reinforces safety after adaptation. Specifically, BufferLoRA induces temporary jailbreaking as a removable adapter to reduce harmful updates during user fine-tuning. After adaptation, ReinforceLoRA, trained to recover refusal behavior under the temporarily jailbroken state, is integrated with UserLoRA via QR decomposition-based merging to reinforce safety while preserving user-task performance. Extensive experiments show that our framework achieves superior safety and utility with no additional safety data during user fine-tuning and minimal computational cost.

2605.24541 2026-05-26 cs.LG cs.AI cs.CL cs.IR 版本更新

SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors

SemanticZip: 以LLM作为语义解压器的有损文本压缩的试点框架

Natalia Trukhina, Vadim Vashkelis

发表机构 * Embedded Intelligence Lab (EMILAB)(嵌入式智能实验室)

AI总结 提出SemanticZip框架,通过LLM将文本压缩为紧凑代码并解压为任务相关语义,在结构化散文、JSON等六种表示上评估,发现结构化散文恢复率最高(WAR=0.956,19.1%令牌增益),而CCL-Min平衡性最佳(39.4%令牌增益,WAR=0.874)。

Comments 13 pages, 1 figure, 2 tables. Pilot framework paper; code and supplementary artifacts available in ancillary files

详情
AI中文摘要

大型语言模型(LLM)系统的文本压缩通常被框架化为令牌删除、检索、摘要或精确重建。我们研究了一种更具攻击性但明确有损的设置:将文本压缩为紧凑代码,LLM可以将其扩展为任务相关的含义。我们将此设置称为SemanticZip。与无损压缩不同,SemanticZip不需要字节相同的重建;与普通摘要不同,它将基于模型的解压缩视为编解码器的一部分,并评估是否恢复了任务相关的语义承诺。 本文是一个试点框架,而非基准声明。我们形式化了LLM介导的解压缩,定义了受保护/有损数据包架构,并在五个作者构建的诊断案例上评估了六种表示体系:结构化散文、JSON、CCL-Core、CCL-Min、SemanticZip ASCII和SemanticZip emoji。一个独立的解码器LLM从每种压缩表示中重建类型化的语义原子,我们评估关键原子召回率、加权原子召回率、精确度和分词器增益。在该试点中,结构化散文具有最高的可恢复性,WAR=0.956,o200k_base令牌增益19.1%。CCL-Min是最强的平衡点,令牌增益39.4%,WAR=0.874。SemanticZip ASCII提供了最大的有用压缩,令牌增益46.5%,WAR=0.802,而表情符号密集的SemanticZip在压缩和恢复方面表现均较差。 主要贡献并非声称这些数字建立了通用前沿。相反,我们引入了一个可重复的实验接口,用于研究有损、LLM可解压的文本代码,以及一个设计原则:安全关键和精确的承诺应保持受保护,而可预测的低风险上下文可以进行语义压缩。

英文摘要

Text compression for large language model (LLM) systems is usually framed as token deletion, retrieval, summarization, or exact reconstruction. We study a more aggressive but explicitly lossy setting: compress text into compact codes that an LLM can expand into task-relevant meaning. We call this setting SemanticZip. Unlike lossless compression, SemanticZip does not require byte-identical reconstruction; unlike ordinary summarization, it treats model-based decompression as part of the codec and evaluates whether task-relevant semantic commitments are recovered. This paper is a pilot framework, not a benchmark claim. We formalize LLM-mediated decompression, define a protected/lossy packet architecture, and evaluate six representation regimes over five author-constructed diagnostic cases: structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, and SemanticZip emoji. An independent decoder LLM reconstructs typed semantic atoms from each compressed representation, and we score Critical Atom Recall, Weighted Atom Recall, precision, and tokenizer gain. In this pilot, structured prose has the highest recoverability, with WAR = 0.956 and 19.1% o200k_base token gain. CCL-Min is the strongest balanced point, with 39.4% token gain and WAR = 0.874. SemanticZip ASCII provides the largest useful compression, with 46.5% token gain and WAR = 0.802, while emoji-heavy SemanticZip performs worse on both compression and recovery. The main contribution is not the claim that these numbers establish a universal frontier. Rather, we introduce a reproducible experimental interface for studying lossy, LLM-decompressible text codes and a design principle: safety-critical and exact commitments should remain protected, while predictable low-risk context may be semantically zipped.

2605.24534 2026-05-26 cs.CL 版本更新

Generating Legal Commentaries from Case Databases via Retrieval, Clustering, and Generation

通过检索、聚类和生成从案例数据库中生成法律评论

Max Prior, Niklas Wais, Matthias Grabmair

发表机构 * Technical University of Munich(慕尼黑技术大学)

AI总结 提出一个全自动流水线,利用检索、聚类和生成方法,从法院判决中自动生成法律评论,无需人工教义框架。

详情
AI中文摘要

我们提出了一个全自动流水线,将大量法院判决转化为法规的法律评论——无需提供任何手工制作的教义框架。使用德国联邦最高法院引用德国民法典第242、280、812和823条的4555份判决,我们提取段落级块,总结其推理,并推导关键词,这些关键词被嵌入和聚类。对于每个聚类,一个LLM生成标题并综合引用丰富的章节,然后由四个最先进的LLM合并成连贯的评论。我们使用人类专家和LLM评判员,沿着五个维度——主题相关性、标题匹配、引用忠实性、聚类区分度和逻辑顺序——进行评估。我们的结果表明,从法院判决中挖掘类似评论的论点以生成可在几分钟内以最低成本更新的报告是可行的,但突出了由于来源受限和法律推理规范性而产生的局限性。

英文摘要

We present a fully automated pipeline that transforms large collections of court decisions into legal commentaries for statutes - without providing any handcrafted doctrinal framework. Using 4.555 decisions of the German Federal Court of Justice that cite sections 242, 280, 812 and 823 of the German Civil Code (BGB), we extract paragraph-level chunks, summarize their reasoning, and derive keywords, which are embedded and clustered. For each cluster, an LLM generates headings and synthesizes citation-rich sections, which are then merged into coherent commentaries by four state-of-the-art LLMs. We evaluate along five dimensions - topical relevance, heading-match, citation faithfulness, cluster distinction and logical ordering - using both a human expert and an LLM-judge. Our results show that commentary-like argument mining from court decisions to generate reports that can be refreshed within minutes at minimal cost is feasible, yet they highlight limitations arising from restricted sources and the normativity of legal reasoning.

2605.24530 2026-05-26 cs.CL cs.CV 版本更新

Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval

Unveil: 统一视觉-文本集成与蒸馏的多模态文档检索

Hao Sun, Yingyan Hou, Jiayan Guo, Bo Wang, Chunyu Yang, Jinsong Ni, Yan Zhang

发表机构 * State Key Laboratory of General Artificial Intelligence, Peking University(北京理工大学通用人工智能国家重点实验室) School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院) Aerospace Information Research Institute, Chinese Academy of Sciences(中国科学院航空航天信息研究所) Key Laboratory of Target Cognition and Application Technology(目标认知与应用技术重点实验室) Beijing Institute of Technology(北京理工大学) Ucap Cloud(Ucap云)

AI总结 提出Unveil框架,通过视觉-文本嵌入和知识蒸馏实现鲁棒的文档检索,兼顾布局与语义信息。

Comments ACL 2025 Main Conference

详情
AI中文摘要

现实场景中的文档检索由于文档格式和模态的多样性面临重大挑战。传统的基于文本的方法依赖于定制的解析技术,忽略布局信息且容易出错,而最近的无解析视觉方法在文本丰富的场景中往往难以捕捉细粒度的文本语义。为了解决这些限制,我们提出了 extbf{Unveil},一种新颖的视觉-文本嵌入框架,有效整合文本和视觉特征以实现鲁棒的文档表示。通过知识蒸馏,我们将视觉-文本嵌入模型的语义理解能力转移到纯视觉模型,实现高效的无解析检索同时保持语义保真度。实验结果表明,我们的视觉-文本嵌入方法超越了现有方法,而知识蒸馏成功弥合了视觉-文本方法与纯视觉方法之间的性能差距,提高了检索准确性和效率。

英文摘要

Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose \textbf{Unveil}, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.

2605.22715 2026-05-26 cs.CV cs.AI cs.CL cs.HC 版本更新

AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

AnyMo:野外人体运动的几何感知与设置无关建模

Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin, Hao Xue, Benjamin Tag, Flora Salim

发表机构 * The University of New South Wales(新南威尔士大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出AnyMo框架,通过物理模拟生成多样化IMU信号、图编码器预训练和LLM对齐,实现跨设备/数据集的零样本活动识别、跨模态检索和运动描述,性能显著提升。

详情
AI中文摘要

随着可穿戴和移动设备日益融入日常生活,它们为持续感知野外人体运动提供了实用途径。但惯性信号高度依赖于传感设置,包括身体位置、安装方向、传感器朝向、设备硬件和采样协议。这种设置依赖性使得学习跨设备和数据集迁移的运动表示变得困难,并限制了可穿戴IMU在封闭集识别之外的广泛应用。我们提出AnyMo,一个用于设置无关人体运动建模的几何感知框架。AnyMo利用基于物理的IMU模拟在密集体表位置上生成多样且合理的合成信号,从配对的合成放置视图和掩蔽部分观测中预训练图编码器,将多位置IMU标记化为全身运动令牌,并将这些令牌与LLM对齐以进行运动-语言理解。我们在三个互补任务上评估AnyMo:跨14个未见下游数据集的零样本活动识别、跨模态检索和可穿戴IMU运动描述,其中在HAR上平均Accuracy/F1/R@2提升11.7%/11.6%/22.6%,零样本IMU到文本和文本到IMU检索MRR分别提升15.9%和28.6%,零样本描述BERT-F1提升18.8%。这些结果支持AnyMo作为野外可穿戴运动理解的通才模型。项目页面:https://baiyuchen.com/project/AnyMo。

英文摘要

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7\%/11.6\%/22.6\% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9\% and 28.6\%, respectively, and improves zero-shot captioning BERT-F1 by 18.8\%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.

2605.18840 2026-05-26 cs.LG cs.AI cs.CL 版本更新

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

前沿模型的成长之痛:当排行榜不再区分以及接下来衡量什么

Adil Amin

发表机构 * Zehen Labs(泽亨实验室)

AI总结 本文通过分解SWE-bench和GPQA Diamond分数为种群耦合趋势和每版本残差(h场),诊断前沿模型能力之间的协作与权衡,并提供三步诊断法、每实验室测量优先级表及七个可证伪预测。

Comments 13 pages, 5 figures, 4 tables. Companion paper: "Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling." ( https://doi.org/10.48550/arXiv.2605.18838 ). Code: https://github.com/adilamin89/cape-scaling . Dashboard: https://zehenlabs.com/cape/

详情
AI中文摘要

排行榜在独立轴上对前沿模型进行排名,但并未揭示能力在版本间是相互增强还是权衡——而在前沿,这种相互作用是更具信息量的信号。我们将配对的SWE-bench和GPQA Diamond分数分解为种群耦合趋势和每版本残差(h场),该残差从两个公开基准分数诊断能力重点。在来自10个实验室的34个模型(2024-2026)中,能力相互协作(r = +0.72,p < 10^{-6}),但协作程度系统性地变化:每个实验室的耦合斜率跨度达5倍(谷歌1.15 vs. DeepSeek 0.23),且实验室发生转向——DeepSeek从推理密集型逆转为编码优先(Δh = 15.9个百分点);Anthropic在编码偏离和恢复之间振荡。种群回归作为等斜线相边界:用于识别基础尺度耦合转变的相同分类器√[(a/b)·B₁] [Amin, 2026] 对前沿模型进行分类,并已在下一个转变处检测到混合相行为(两个模型低于GPQA-IFEval等斜线)。h场不仅具有诊断性——它还告诉你需要改变什么。预训练建立耦合为0.871,而RLHF增加0.081 [Amin, 2026]:预训练级别的转变是永久的(DeepSeek的四个版本逆转持续存在),后训练转变是可逆的(Anthropic的三次编码偏离均在单个版本内恢复),仅推理计算在不重新训练的情况下将h改变+7.8个百分点。知道哪个组件占主导地位决定了是重新训练还是等待。我们提供了三步诊断法(定位、分类、预测)、每实验室测量优先级表以及七个带有时间戳标准的可证伪预测。五个截止日期后的版本落在95%预测区间内。代码、数据和交互式仪表盘:https://zehenlabs.com/cape/。

英文摘要

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases -- and at the frontier, this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and per-release residual ($h$-field) that diagnoses capability emphasis from two public benchmark scores. Across 34 models from 10 labs (2024--2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but cooperation varies systematically: per-lab coupling slopes span $5\times$ (Google $1.15$ vs. DeepSeek $0.23$), and labs pivot -- DeepSeek reversed from reasoning-rich to coding-first ($Δh = 15.9$~pp); Anthropic oscillates between coding excursions and recovery. The population regression serves as an isocline phase boundary: the same $\sqrt{(a/b)\cdot B_1}$ classifier that identifies the base-scale coupling transition [Amin, 2026] classifies frontier models and already detects mixed-phase behavior at the next transition (two models below the GPQA--IFEval isocline). The $h$-field is not just diagnostic -- it tells you what to change. Pretraining establishes coupling at $0.871$ while RLHF adds $0.081$ [Amin, 2026]: pretraining-level shifts are permanent (DeepSeek's four-release reversal persists), post-training shifts are reversible (Anthropic's three coding excursions each recover within one release), and inference compute alone shifts $h$ by $+7.8$~pp without retraining. Knowing which component dominates determines whether to retrain or wait. We provide a three-step diagnostic (locate, classify, predict), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria. Five post-cutoff releases fall within the 95\% prediction interval. Code, data, and an interactive dashboard: https://zehenlabs.com/cape/.

2605.16409 2026-05-26 cs.CV cs.CL cs.LG 版本更新

Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

多语言OCR感知微调和提示引导的链式思维推理用于多模态大语言模型

Qinwu Xu, Yifan Jiang, Haoyu Ren

发表机构 * Meta AI UT Austin(德克萨斯大学奥斯汀分校)

AI总结 提出一种多语言OCR感知的多模态训练框架,通过合成数据生成、OCR感知微调和结构化视觉链式思维提示,提升多模态大语言模型在复杂视觉条件下的OCR完整性和多语言翻译准确性。

详情
AI中文摘要

光学字符识别(OCR)和多语言文本理解仍然是多模态大语言模型(MLLMs)的主要失败模式,尤其是在包含杂乱布局、小字体、模糊、遮挡和复杂排版的真实世界图像中。我们提出了一种OCR感知的多语言多模态训练框架,该框架结合了(i)大规模合成OCR到翻译数据生成,(ii)使用LoRA适配的OCR感知监督微调(SFT),以及(iii)在不确定视觉条件下进行推理的结构化视觉链式思维(CoT)提示。使用基于LLaMA的多模态架构,所提出的框架在OCR完整性、多语言翻译准确性和退化视觉条件下的鲁棒性方面有了显著提升。在多语言收据、菜单、海报、标志、手写文本和文档图像上的实验结果表明,与基线模型相比,视觉-文本对齐显著改善。特别是,所提出的OCR感知后训练框架提高了对小、模糊、空间分散和部分遮挡文本的提取,同时减少了对不确定OCR条件下语言先验的依赖。与前沿多模态系统(包括GPT-5类和Gemini系列模型)的定性比较进一步表明,在噪声和视觉模糊的OCR场景下,OCR对齐得到改善,幻觉减少。总体而言,结果表明,以数据为中心的OCR感知多模态后训练为改进多语言OCR和基于OCR的视觉问答系统提供了一种有效且可扩展的方向。

英文摘要

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion, and complex typography. We present an OCR-aware multilingual multimodal training framework that combines (i) large-scale synthetic OCR-to-translation data generation, (ii) OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and (iii) structured visual chain-of-thought (CoT) prompting for reasoning under uncertain visual conditions. Using a LLaMA-based multimodal architecture, the proposed framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions. Experimental results on multilingual receipts, menus, posters, signs, handwritten text, and document images demonstrate significantly improved visual-text grounding compared with the baseline model. In particular, the proposed OCR-aware post-training framework improves extraction of small, blurred, spatially scattered, and partially occluded text while reducing reliance on language priors under uncertain OCR conditions. Qualitative comparisons with frontier multimodal systems, including GPT-5-class and Gemini-family models, further suggest improved OCR grounding and reduced hallucination under noisy and visually ambiguous OCR scenarios. Overall, the results indicate that data-centric OCR-aware multimodal post-training provides an effective and scalable direction for improving multilingual OCR and OCR-based visual question answering systems.

2605.15759 2026-05-26 cs.CL 版本更新

DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory

DimMem:面向高效长期智能体记忆的维度结构化

Wentao Qiu, Haotian Hu, Fanyi Wang, Jinwei Kong, Yu Zhang

发表机构 * StepOS Xiamen University(厦门大学) ShanghaiTech University(上海科技大学)

AI总结 提出DimMem维度记忆框架,通过原子化、类型化、自包含的记忆单元(含时间、地点、原因等显式字段)实现维度感知检索与更新,在LoCoMo-10和LongMemEval-S上分别达到81.43%和78.20%准确率,且每查询token成本降低24%。

详情
AI中文摘要

大型语言模型(LLM)智能体需要长期记忆来利用过去交互中的信息。然而,现有的记忆系统常常面临保真度与效率之间的权衡:原始对话历史成本高昂,而扁平化的事实或摘要可能丢弃精确回忆所需的结构。我们提出 extbf{DimMem},一种轻量级维度记忆框架,将每条记忆表示为一个原子化、类型化、自包含的单元,并带有显式字段,如时间、地点、原因、目的和关键词。这种表示暴露了维度感知检索、记忆更新和选择性助手上下文回忆所需的结构,而无需在模型上下文中存储完整历史。在LoCoMo-10和LongMemEval-S上,DimMem分别达到 extbf{81.43\%}和 extbf{78.20\%}的整体准确率,优于现有的轻量级记忆系统,同时将LoCoMo每查询token成本降低 extbf{24\%}。我们进一步证明,维度记忆提取可通过紧凑模型学习:在DimMem模式上微调后,Qwen3-4B提取器在两个基准测试上均超越使用GPT-4.1-mini的LightMem,并在关键设置中达到与更大提取器相当或更优的性能。这些结果表明,显式维度结构化是LLM智能体长期记忆有效且高效的基础。代码见https://github.com/ChowRunFa/DimMem。

英文摘要

Large language model (LLM) agents require long-term memory to leverage information from past interactions. However, existing memory systems often face a fidelity--efficiency trade-off: raw dialogue histories are expensive, while flat facts or summaries may discard the structure needed for precise recall. We propose \textbf{DimMem}, a lightweight dimensional memory framework that represents each memory as an atomic, typed, and self-contained unit with explicit fields such as time, location, reason, purpose, and keywords. This representation exposes the structure needed for dimension-aware retrieval, memory update, and selective assistant-context recall without storing full histories in the model context. Across LoCoMo-10 and LongMemEval-S, DimMem achieves \textbf{81.43\%} and \textbf{78.20\%} overall accuracy, respectively, outperforming existing lightweight memory systems while reducing LoCoMo per-query token cost by \textbf{24\%}. We further show that dimensional memory extraction is learnable by compact models: after fine-tuning on the DimMem schema, a Qwen3-4B extractor surpasses LightMem with GPT-4.1-mini on both benchmarks and reaches performance comparable to, or better than, much larger extractors in key settings. These results suggest that explicit dimensional structuring is an effective and efficient foundation for long-term memory in LLM agents. Code is available at https://github.com/ChowRunFa/DimMem.

2605.15011 2026-05-26 cs.CL 版本更新

The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale

科学贡献图:基于文献的规模化自动技术路线图绘制

Peter A. Jansen

发表机构 * University of Arizona(亚利桑那大学) Allen Institute for Artificial Intelligence(人工智能 Allen 机构)

AI总结 提出从学术论文中提取科学贡献并链接其前提条件的自动技术路线图任务,构建包含200万贡献和1250万前提边的AI/NLP领域科学贡献图,并引入科学前提预测任务,实验表明现有模型在该任务上表现快速提升。

Comments 8 pages, 5 figures

详情
AI中文摘要

科学贡献很少孤立发展,而是建立在先前发现的基础上。我们将自动技术路线图的任务定义为从学术文章中提取科学贡献并将其与前提条件联系起来。我们提出了科学贡献图,这是一个大规模的人工智能/自然语言处理领域资源,包含从23万篇开放获取论文中提取的200万个详细科学贡献,并通过1250万条前提边连接。我们进一步引入了科学前提预测,这是一项科学发现任务,模型预测哪些现有技术可以促成未来的发现,并表明当代模型在该任务上迅速改进,在使用时间过滤回测评估时达到0.48 MAP。我们预计这样的技术路线图资源将支持科学影响评估和自动科学发现。

英文摘要

Scientific contributions rarely develop in isolation, but instead build upon prior discoveries. We formulate the task of automated technological roadmapping as extracting scientific contributions from scholarly articles and linking them to their prerequisites. We present the Scientific Contribution Graph, a large-scale AI/NLP-domain resource containing 2 million detailed scientific contributions extracted from 230k open-access papers and connected by 12.5 million prerequisite edges. We further introduce scientific prerequisite prediction, a scientific discovery task in which models predict which existing technologies can enable future discoveries, and show that contemporary models are rapidly improving on this task, reaching 0.48 MAP when evaluated using temporally filtered backtesting. We anticipate technological roadmapping resources such as this will support scientific impact assessment and automated scientific discovery.

2605.14890 2026-05-26 cs.CL cs.AI 版本更新

Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

分词器生育率与基础模型在乌克兰法律文本上的零样本性能:一项比较研究

Volodymyr Ovcharov

发表机构 * LEX AI Platform(LEX AI平台) legal.org.ua Kyiv, Ukraine(基辅,乌克兰)

AI总结 本研究比较了七种基础模型在乌克兰法律文本上的分词器生育率和零样本性能,发现分词器生育率差异达1.6倍,Qwen 3模型比Llama系列多消耗60%的token,而NVIDIA Nemotron Super 3 (120B)以更低的成本取得最佳性能,同时揭示了少样本提示在形态丰富语言上的退化以及战时法律语言对模型泛化的影响。

Comments 25 pages, 13 tables, 5 figures; v2 adds cross-temporal generalization experiment and classical baseline

详情
AI中文摘要

在乌克兰法律文本上,不同基础模型的分词器生育率差异达1.6倍,然而这一成本关键维度在模型选择实践中被忽视。我们使用来自乌克兰国家登记册(EDRSR)的273份经过验证的法院判决,对来自五个提供商的七个模型进行了基准测试,测量了分词器生育率以及在三个任务上的零样本性能。发现了四个结果。(1)Qwen 3模型在相同输入上比Llama系列模型多消耗60%的token,使得分词器分析成为成本高效部署的前提。(2)NVIDIA Nemotron Super 3 (120B)取得了最高综合得分(83.1),以三分之一的API成本超越了Mistral Large 3(总参数多5.6倍)——模型规模并不能很好地代表领域性能。(3)少样本提示使性能下降高达26个百分点;分层和提示敏感性消融实验证实,这是乌克兰语演示的内在问题,而非示例选择的伪影。(4)跨时间泛化实验表明,在战前法院判决(2008-2013)上训练的分类器,应用于全面入侵时期的判决(2022-2026)时,性能下降27.9个百分点,并呈现出显著的前后不对称性:较新的模型向后迁移效果更好(比向前迁移高14.6个百分点),但较旧的模型在战时法律语言上完全失败。对于从业者:分词器分析应优先于模型选择,对于形态丰富的语言,零样本比少样本更可靠。为了支持可重复性并解决乌克兰语在法律NLP基准中的缺失,我们发布了一个包含14,452份法院判决的公开数据集,时间跨度为2008-2026年,标注了三个时间段的七个结果标签,这些时间段捕捉了武装冲突对司法程序的影响。

英文摘要

Tokenizer fertility varies 1.6x across foundation models on Ukrainian legal text, yet this cost-critical dimension is absent from model selection practice. We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Four findings emerge. (1) Qwen 3 models consume 60% more tokens than Llama-family models on identical input, making tokenizer analysis a prerequisite for cost-efficient deployment. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (5.6x more total parameters) at one-third the API cost model scale is a poor proxy for domain performance. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. (4) A cross-temporal generalization experiment reveals that classifiers trained on pre-war court ecisions (2008-2013) lose 27.9 percentage points when applied to full-scale invasion era decisions (2022-2026), with a pronounced forward-backward asymmetry: newer models transfer backward (+14.6 pp above forward transfer), but older models fail catastrophically on wartime legal language. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages. To support reproducibility and address the absence of Ukrainian from legal NLP benchmarks, we release a public dataset of 14,452 court decisions spanning 2008-2026, annotated with seven outcome labels across three temporal epochs that capture the impact of armed conflict on judicial proceedings.

2605.12850 2026-05-26 cs.CL cs.AI cs.CR cs.LG 版本更新

Persona-Model Collapse in Emergent Misalignment

涌现性失调中的人格模型崩溃

Davi Bastos Costa, Renato Vicente

发表机构 * TELUS Digital Research Hub(TELUS数字研究中心) Center for Artificial Intelligence and Machine Learning(人工智能与机器学习中心) Institute of Mathematics, Statistics and Computer Science(数学、统计与计算机科学研究所) University of São Paulo(圣保罗大学)

AI总结 提出人格模型崩溃假说,通过道德易感性(S)和道德稳健性(R)两个指标,证明在有害数据上微调大语言模型会导致模型模拟、区分和维持一致角色的内部能力恶化,从而引发涌现性失调。

Comments 23 pages, 7 figures, 7 tables; NeurIPS 2026 submission; Corrected code repository URL

详情
AI中文摘要

在包含有害内容的狭窄数据上微调大型语言模型,会在无关提示上产生广泛的失调行为,这种现象称为涌现性失调。我们提出涌现性涉及人格模型崩溃:模型模拟、区分和维持一致角色的内部能力恶化。我们通过两个指标在行为上检验这一假设:道德易感性(S)和道德稳健性(R),它们根据模型在角色扮演下道德基础问卷回答的跨角色和角色内变异性计算得出。这些指标形式化了模型区分角色的能力(S)以及模拟给定角色时的一致性(R)。我们评估了四个前沿模型(DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B)的三种变体:基础版、微调为输出不安全代码的版本,以及匹配的微调为输出安全代码的对照版本。在四个模型中,不安全微调导致S平均增加55%,将所有四个不安全变体推至先前工作中13个前沿模型基准观测到的波段之外——其中GPT-4o达到波段上端的两倍以上——表明分化失调。它还导致R平均下降65%,相当于1/R增加304%。相比之下,匹配的安全对照将S保持在基础值附近,仅引起部分R损失,表明这些效应主要特定于失调。补充这些指标变化,不安全变体的无条件响应趋近于接近量表上限的饱和状态,与基础模型的结构化响应以及基础模型角色扮演有毒人格时的响应明显不同。综合来看,这些指标为涌现性失调提供了敏感的诊断,并作为其涉及人格模型崩溃的行为证据。

英文摘要

Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using two metrics: moral susceptibility (S) and moral robustness (R), computed from the across- and within-persona variability of models' Moral Foundations Questionnaire responses under persona role-play. These metrics formalize the model's ability to differentiate characters (S) and its consistency when simulating a given one (R). We evaluate four frontier models (DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B) in three variants: base, fine-tuned to output insecure code, and a matched control fine-tuned to output secure code. Across the four models, insecure fine-tuning produces an average $55\%$ increase in S, pushing all four insecure variants beyond the band observed across 13 frontier models benchmarked in prior work -- with GPT-4o reaching more than twice the band's upper end -- signaling dysregulated differentiation. It also causes an average $65\%$ decrease in R, equivalent to a $304\%$ increase in 1/R. By contrast, the matched secure control preserves S near the base and induces only a partial R loss, showing that these effects are largely misalignment-specific. Complementing these metric shifts, insecure variants' unconditioned responses converge toward saturation near the scale ceiling, departing markedly from both base models' structured responses and those elicited when base models role-play toxic personas. Taken together, these metrics provide a sensitive diagnostic for emergent misalignment and serve as behavioral evidence that it involves persona-model collapse.

2605.07647 2026-05-26 cs.CL cs.AI 版本更新

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

自动简答题评分中的质量条件一致性:中等范围退化与任务特定适应的影响

Abigail Victoria Gurin Schleifer, Moriah Ariely, Beata Beigman Klebanov, Asaf Salman, Giora Alexandron

发表机构 * Weizmann Institute of Science(魏茨曼科学研究院) ETS(教育考试服务中心)

AI总结 研究自动简答题评分中不同模型的任务适应程度与质量条件评分一致性的关系,发现所有AI模型在完全正确和完全错误的回答上表现良好,但在中等范围回答上出现显著退化,且退化程度与任务特定数据量相关。

Comments PRE-PRINT VERSION Accepted to ACL 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA26)

详情
AI中文摘要

自动简答题评分(ASAS)正从判别式微调模型转向少样本设置下的大语言模型(LLM)。这种范式利用了LLM广泛的世界知识和易于部署的优势,但有限的任务特定数据可能降低复杂评分任务的对齐。特别是,其对评分需要细微解释的部分正确回答的影响仍未充分探索。我们研究了不同模型的任务特定适应程度与质量条件评分一致性之间的关系。我们比较了三种LLM(GPT-5.2、GPT-4o、Claude Opus 4.5)在少样本模式下的表现、一个基于BERT的微调编码器以及一位人类专家,在两个开放式生物学题目上使用了数百个学生回答和由生物学教育专家提供的真实分数。结果表明,人类之间的一致性最高且在整个质量范围内稳定。所有AI模型在完全正确和完全错误的回答上表现良好,但在中等范围回答上表现出显著退化。这种中等范围退化取决于任务特定适应:在少样本LLM中最为严重,随着任务特定数据的增加而减少,其中微调编码器模型表现最佳。这种中等范围退化可能导致对理解发展中的学生所产生回答的不公平评估。我们的发现强调了质量条件公平性的重要性,尤其需要关注中等范围回答。

英文摘要

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.

2605.05226 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

将结果监督内化为过程监督:推理强化学习的新范式

Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Sibo wang, Huiming Yang

发表机构 * Alibaba Group(阿里巴巴集团) Tsinghua University(清华大学)

AI总结 提出一种监督内化方法,使模型在仅结果监督下自动提取过程级学习信号,实现细粒度策略优化。

详情
AI中文摘要

推理强化学习的核心挑战不仅在于结果级监督的稀疏性,更在于如何将仅在序列末尾提供的反馈转化为可指导中间推理步骤的细粒度学习信号。现有方法要么依赖结果级奖励进行序列级优化,导致精确信用分配困难,要么依赖外部构建的过程监督,成本高昂且难以可持续扩展。为解决这一问题,我们提出一个新视角:推理强化学习可以理解为将结果监督内化为过程监督的问题。基于此视角,我们引入一种用于推理强化学习的监督内化方法,使模型能够通过识别、纠正和重用失败的推理轨迹自动提取过程级学习信号,从而在仅结果监督下实现更细粒度的策略优化。我们进一步将这一思想抽象为一种新的训练范式,其中模型在强化学习过程中持续生成并完善自身的内部过程监督,为推理强化学习中细粒度信用分配开辟了一条不同于外部提供过程监督的新路径。

英文摘要

The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level learning signals through identifying, correcting, and reusing failed reasoning trajectories, thereby achieving finer-grained policy optimization under outcome-only supervision. We further abstract this idea into a new training paradigm, in which the model continually generates and refines its own internal process supervision during reinforcement learning, opening a new path for fine-grained credit assignment in reinforcement learning for reasoning that differs from externally provided process supervision.

2605.01284 2026-05-26 cs.CV cs.AI cs.CL cs.IR 版本更新

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

证据链:面向迭代检索增强生成的像素级视觉归因

Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye

发表机构 * National Engineering Research Center for Software Engineering, Peking University(软件工程国家级工程研究中心,北京大学) City University of Hong Kong(香港城市大学) Peking University(北京大学) Tencent Technology(腾讯科技)

AI总结 提出Chain of Evidence (CoE)框架,利用视觉语言模型直接对检索到的文档截图进行推理,输出精确边界框以可视化完整推理链,解决迭代检索增强生成中的粗粒度归因和视觉语义丢失问题。

详情
AI中文摘要

迭代检索增强生成(iRAG)已成为通过逐步检索和推理外部文档来回答复杂多跳问题的强大范式。然而,当前系统主要基于解析文本运行,这造成了两个关键瓶颈:(1)粗粒度归因,用户需要根据模糊的文本级引用在冗长文档中手动定位证据;(2)视觉语义丢失,将视觉丰富的文档(如幻灯片、带有图表的PDF)转换为文本会丢弃对推理至关重要的空间逻辑和布局线索。为弥合这一差距,我们提出了证据链(CoE),这是一个与检索器无关的视觉归因框架,利用视觉语言模型直接对检索到的文档候选截图进行推理。CoE消除了特定格式的解析,输出精确的边界框,可视化检索候选集中的完整推理链。我们在两个不同的基准上评估CoE:Wiki-CoE,一个源自2WikiMultiHopQA的大规模结构化网页数据集;以及SlideVQA,一个具有挑战性的演示幻灯片数据集,包含复杂图表和自由形式布局。实验表明,微调后的Qwen3-VL-8B-Instruct取得了稳健的性能,在需要视觉布局理解的场景中显著优于基于文本的基线,同时为像素级可解释的iRAG建立了与检索器无关的解决方案。我们的代码可在https://github.com/PeiYangLiu/CoE.git获取。

英文摘要

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) \textit{Coarse-grained attribution}, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) \textit{Visual semantic loss}, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present \textbf{Chain of Evidence (CoE)}, a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: \textbf{Wiki-CoE}, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and \textbf{SlideVQA}, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.

2605.00817 2026-05-26 cs.CL 版本更新

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

当LLM停止遵循步骤:语言模型中程序执行的诊断研究

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh

发表机构 * Indian Institute of Technology Gandhinagar(印度理工学院冈丁加尔)

AI总结 本研究通过构建受控诊断基准,评估大型语言模型在程序执行任务中的忠实性,发现随着步骤增加准确率从63%降至20%,并揭示了缺失答案、过早答案、自我修正和执行不完整等失败模式。

Comments 86 pages, 124 figures, 4 Tables

详情
AI中文摘要

大型语言模型(LLM)在推理基准测试中通常表现强劲,但仅凭最终答案的准确性并不能表明它们是否忠实地执行了提示中指定的程序。我们引入了一个受控的诊断基准,用于程序执行,其中模型被给予一个逐步的算术程序以及两个数值输入,必须返回最终计算值。通过程序长度和中间变量的回溯依赖性来改变复杂性。平均首次答案准确率从5步程序的63%下降到95步程序的20%。生成级别分析表明,失败通常涉及缺失答案、过早答案、初始错误后的自我修正以及未完全执行的轨迹。这些发现表明,表面上的推理能力可能掩盖了在忠实的长程程序执行中的重大弱点。

英文摘要

Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We introduce a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic procedure and two numeric inputs, and must return the final computed value. Complexity is varied through procedure length and look-back dependencies over intermediate variables. Average first-answer accuracy drops from 63% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error and under-executed traces. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful long-horizon procedural execution.

2604.23396 2026-05-26 cs.IR cs.AI cs.CL cs.LG 版本更新

Lost in Decoding? Reproducing and Stress-Testing the Look-Ahead Prior in Generative Retrieval

迷失在解码中?复现与压力测试生成式检索中的前瞻先验

Kidist Amde Mekonnen, Yongkang Li, Yubao Tang, Simon Lupart, Maarten de Rijke

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 本文复现并压力测试了生成式检索中的前瞻先验方法PAG,发现其规划信号在词汇表面形式变化下脆弱,并评估了跨语言鲁棒性与查询端缓解策略。

Comments 12 pages, 5 figures, 9 tables; accepted to the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 20-24, 2026, Melbourne/Naarm, Australia

详情
Journal ref
Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), pages XXX-XXX, 2026
AI中文摘要

生成式检索(GR)通过自回归生成文档标识符来对文档进行排序。由于许多GR方法依赖于trie约束的束搜索,它们在有限束解码下容易过早剪枝相关前缀。生成式检索中的前瞻规划(PAG)通过使用同时解码来计算文档级前瞻先验,指导后续顺序解码,从而缓解了这种失败模式。我们在推理时复现了PAG,并压力测试了其解码行为。使用作者发布的检查点和标识符/trie工件,在报告的解码设置下,我们在MS MARCO Dev和TREC-DL 2019/2020上复现了主要有效性结果,并在我们的硬件设置中证实了报告的束大小-延迟权衡。在复现之外,我们引入了规划漂移诊断,量化意图保持的查询变体如何改变规划器的top-n候选集和最高权重规划器令牌,以及这些变化如何影响引导解码。我们发现PAG的规划信号在词汇表面形式变化下是脆弱的:意图保持的拼写错误可能触发规划崩溃,其中规划的候选池变化足够大,使得前瞻奖励几乎无法提供有用的指导,实际上使解码退回到较弱的无引导搜索。我们进一步使用非英语mMARC O查询对英语索引评估了固定索引的跨语言鲁棒性,并评估了无需重新索引的查询端缓解策略;在我们的设置中,查询翻译提供了最强的恢复。总体而言,我们的结果证实了PAG报告的有效性以及在发布的推理设置下规划引导解码的优势,同时表明这些增益依赖于规划信号在现实查询变化和查询-文档不匹配下的稳定性。

英文摘要

Generative retrieval (GR) ranks documents by autoregressively generating document identifiers. Because many GR methods rely on trie-constrained beam search, they are vulnerable to early pruning of relevant prefixes under finite-beam decoding. Planning Ahead in Generative Retrieval (PAG) mitigates this failure mode by using simultaneous decoding to compute a document-level look-ahead prior that guides subsequent sequential decoding. We reproduce PAG at inference time and stress-test its decoding behavior. Using the authors' released checkpoint and identifier/trie artifacts under the reported decoding setup, we reproduce the main effectiveness results on MS MARCO Dev and TREC-DL 2019/2020, and corroborate the reported beam-size-latency trade-off in our hardware setting. Beyond reproduction, we introduce plan drift diagnostics that quantify how intent-preserving query variations alter the planner's top-n candidate set and highest-weight planner tokens, and how these changes affect guided decoding. We find that PAG's planning signal is brittle under lexical surface-form variation: intent-preserving typos can trigger plan collapse, where the planned candidate pool shifts enough that the look-ahead bonus provides little useful guidance, effectively reverting decoding toward weaker unguided search. We further evaluate fixed-index cross-lingual robustness using non-English mMARCO queries against an English index, and assess query-side mitigation strategies that require no re-indexing; query translation provides the strongest recovery in our setting. Overall, our results confirm PAG's reported effectiveness and the benefit of planning-guided decoding under the released inference setup, while showing that these gains depend on the stability of the planning signal under realistic query variation and query-document mismatch.

2604.20022 2026-05-26 cs.LG cs.AI cs.CL 版本更新

MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support

MoBayes:一种用于对话式临床决策支持中推理与语言分离的模块化贝叶斯框架

Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Yena Chang, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

发表机构 * LiGHT, EPFL(LiGHT,瑞士联邦理工学院) University of Bern(伯尔尼大学) Aarhus University(奥胡斯大学)

AI总结 提出MoBayes框架,通过将LLM作为语言接口、贝叶斯模块进行概率推理,实现推理与语言分离,在临床决策支持中优于独立前沿LLM医生。

Comments 50 pages including appendix, 13 figures, 22 tables. Preprint

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于对话式临床决策支持,但它们将下一个标记预测与概率决策混为一谈。我们认为这种混淆反映了架构上的局限性:此类系统缺乏显式的后验追踪、可控的弃权阈值和可审计的推理链。我们引入MoBayes,一个模块化贝叶斯对话框架,将推理与语言分离。LLM仅作为语言接口,将患者对话解析为结构化观察,而贝叶斯模块对这些观察进行概率推理以更新后验,通过期望信息增益选择后续问题,并通过校准的决策阈值决定何时停止或推迟。这种设计实现了显式后验追踪、可控的选择性决策,以及无需重新训练语言模型即可替换的特定人群统计后端。在经验知识和LLM生成的知识库上,MoBayes优于独立的前沿LLM医生,包括匹配模型系列的比较,其中廉价的传感器模型与MoBayes配对以较低成本超过更大的自主模型。在对抗性患者沟通风格和不同诊断场景下,该优势依然存在。这些结果表明,可靠的对话式临床决策支持系统应将概率推理与语言生成分离,而不是仅扩大模型规模。代码可在https://anonymous.4open.science/r/MoBayes/获取。

英文摘要

Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next token prediction with probabilistic decision making. We argue that this conflation reflects an architectural limitation: such systems lack explicit posterior tracking, controllable abstention thresholds, and auditable reasoning chains. We introduce MoBayes, a Modular Bayesian dialogue framework that separates reasoning from language. The LLM acts only as a language interface, parsing patient conversation into structured observations, while a Bayesian module performs probabilistic inference over these observations to update posteriors, select follow-up questions via expected-information-gain and determine when to stop or defer through calibrated decision thresholds. This design enables explicit posterior tracking, controllable selective decision-making, and replaceable population-specific statistical backends without retraining the language model. Across empirical and LLM-generated knowledge bases, MoBayes outperforms standalone frontier LLM doctors, including matched model-family comparisons where inexpensive sensor models paired with MoBayes exceed larger autonomous models at lower cost. The advantage persists under adversarial patient communication styles and across varying diagnostic scenarios. These results suggest that reliable conversational clinical decision support systems should separate probabilistic reasoning from language generation rather than scaling model size alone. Code is available at https://anonymous.4open.science/r/MoBayes/

2604.19151 2026-05-26 cs.CL cs.SD eess.AS 版本更新

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

印度之声:面向印度真实世界语音识别的大规模基准

Kaushal Bhogale, Manas Dhir, Amritansh Walecha, Manmeet Kaur, Vanshika Chhabra, Aaditya Pareek, Hanuman Sidh, Mahima Manik, Sagar Jain, Bhaskar Singh, Utkarsh Singh, Tahir Javed, Shobhit Banga, Mitesh M. Khapra

发表机构 * Indian Institute of Technology, Madras, India(印度理工学院,马德拉斯分校) Josh Talks, India(Josh Talks)

AI总结 针对现有Indic ASR基准的局限性,提出基于非脚本电话对话的封闭源基准Voice of India,覆盖15种主要印度语言和139个区域集群,包含306230条语音(536小时),并分析地理、音频质量、语速、性别和设备类型等因素对ASR性能的影响。

Comments 6 pages, 4 figures

详情
AI中文摘要

现有的Indic ASR基准通常使用脚本化的、干净的语音和基于排行榜的评估,这鼓励了针对数据集的过拟合。此外,严格的单参考WER会惩罚印度语言中的自然拼写变体,包括非标准拼写的代码混合英语起源词。为了解决这些局限性,我们引入了Voice of India,这是一个从非脚本电话对话构建的封闭源基准,覆盖15种主要印度语言,跨越139个区域集群。该数据集包含306230条语音,总计536小时的语音,来自36691名说话人,转录考虑了拼写变体。我们还在地理上按地区分析了性能,揭示了差异。最后,我们提供了跨音频质量、语速、性别和设备类型等因素的详细分析,突出了当前ASR系统在哪些方面存在困难,并为改进真实世界的Indic ASR系统提供了见解。

英文摘要

Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard driven evaluation that encourages dataset specific overfitting. In addition, strict single reference WER penalizes natural spelling variation in Indian languages, including non standardized spellings of code-mixed English origin words. To address these limitations, we introduce Voice of India, a closed source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306230 utterances, totaling 536 hours of speech from 36691 speakers with transcripts accounting for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real world Indic ASR systems.

2604.18170 2026-05-26 cs.CL cs.AI 版本更新

Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

Copy-as-Decode: 面向LLM编辑的语法约束并行预填充

Ziyang Liu

AI总结 提出Copy-as-Decode机制,通过语法约束的并行预填充加速LLM编辑,实现高达303倍的自回归解码加速,并保持高覆盖率与无损性。

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

详情
AI中文摘要

LLMs通过自回归地重新生成完整输出来编辑文本和代码,即使大多数标记在输入中逐字出现。我们研究Copy-as-Decode,一种解码层机制,将编辑生成重新表述为基于两个原语语法的结构化解码:<copy lines="i-j"/>引用输入行范围,<gen>...</gen>生成新内容。一个标记级FSM保证语法有效性,服务层原语通过单次并行预填充前向(而非N步自回归步骤)更新每个复制跨度的KV缓存——共享推测解码的并行前向内核,但以输入标记作为草稿,程序强制接受替代概率验证。我们报告一个无需端到端训练的上界分析。(i) 内核加速:在Qwen2.5-{1.5B, 7B}上,通过并行预填充复制N个标记比自回归快6.8倍至303倍(N ∈ [8, 512],A100 80GB bf16)。(ii) 复制上限:在ProbeEdit和HumanEvalPack-Fix (Py/JS)上,74%–98%的金标准标记在行级原语下可达;结合每个语料库跨度直方图上的经验内核,得到闭式挂钟时间上界29.0倍/3.4倍/4.2倍(合并13.0倍)。标记级扩展达到91%–99%覆盖率,下界4.5倍–6.5倍。(iii) 流水线无损性:预言程序通过确定性解析器在所有482个案例上往返,将任何下游失败定位到跨度选择而非机制。扰动研究表明,在离一噪声下,合并EM从100%降至15.48%。在Qwen2.5-Coder-1.5B上的微调实验将HEvalFix-Py EM从0/33(未训练)提升至12%–17%,这是一个可学习性信号,而非生产选择器。批处理服务集成和多文件覆盖作为后续工作。

英文摘要

LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: <copy lines="i-j"/> references an input line range, <gen>...</gen> emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward rather than $N$ autoregressive steps -- sharing the parallel-forward kernel of speculative decoding but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-{1.5B, 7B}, copying $N$ tokens via parallel prefill is $6.8\times$--$303\times$ faster than autoregressive ($N \in [8, 512]$, A100 80GB bf16). (ii) Copy ceiling: on ProbeEdit and HumanEvalPack-Fix (Py/JS), $74$--$98\%$ of gold tokens are reachable under the line-level primitive; composed with the empirical kernel over each corpus's span histogram this yields a closed-form wall-clock bound of $29.0\times / 3.4\times / 4.2\times$ ($13.0\times$ pooled). A token-level extension reaches $91$--$99\%$ coverage with $4.5\times$--$6.5\times$ floors. (iii) Pipeline losslessness: oracle programs round-trip through the deterministic resolver on all $482$ cases, localizing any downstream failure to span selection rather than the mechanism. A perturbation study shows pooled EM drops from $100\%$ to $15.48\%$ under off-by-one noise. A fine-tuning pilot on Qwen2.5-Coder-1.5B lifts HEvalFix-Py EM from $0/33$ (untrained) to $12$--$17\%$, a learnability signal, not a production selector. Batched-serving integration and multi-file coverage are scoped as follow-up.

2604.18128 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition

深度寄存器解锁 SwiGLU 上的 W4A4:一种读取器/生成器分解

Ziyang Liu

AI总结 本研究通过深度寄存器和铰链损失(DR+sink)训练时干预,将 SwiGLU 解码器语言模型的 W4A4 量化困惑度从 1727 降至 119,并分解出残差轴读取器主导误差,而生成器 w2 的双线性输入是剩余差距的主因。

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

详情
AI中文摘要

我们在一个受控的 300M 参数 SwiGLU 解码器语言模型(在 FineWeb-Edu 的 5B 令牌上训练)中研究训练后 W4A4 量化,并询问哪些输入激活位点主导误差。朴素的四舍五入 W4A4 将验证困惑度从 FP16 的 23.6 降至 1727。一种简单的残差轴训练时干预——带有寄存器幅度铰链损失的深度寄存器(DR+sink)——在匹配的 FP16 PPL 和匹配的零样本能力下,将其降至 119(约 14 倍),并与 SmoothQuant 组合达到 39.9 PPL。与 FP16 之间约 2 PPL 的剩余差距是诊断核心。我们按输入激活位点分解 W4A4 损伤:SwiGLU 块中的五个可训练线性层分为残差轴读取器(qkv, w1, w3)和块内生成器(o_proj, w2)。基本的范数论证表明,残差轴幅度控制紧密约束读取器,但 w2 的双线性输入仅受因子范数平凡乘积的约束;经验上,DR+sink 降低了读取器的峰度,而生成器基本不变,并且读取器恢复的 W4A4 残差在三个匹配检查点上平坦约为 0.28 nats,其中 Delta-remove(w2) 占主导。我们将 DR+sink 作为训练时探针而非部署方案提出:一种事后替代方案(Per-Linear QuaRot)在读取器轴上几乎与之匹配。完整的 QuaRot——添加在线每头值 Hadamard 和在线 w2 输入旋转——也没有缩小差距,直接验证了正交旋转无法约束双线性 SwiGLU 尾部的预测。这些主张特定于我们的 300M、5B 令牌、单种子设置,并且我们的实验未将分区与铰链分离。

英文摘要

We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error. Naive round-to-nearest W4A4 collapses validation perplexity from FP16 23.6 to 1727. A simple residual-axis training-time intervention -- Depth Registers with a register-magnitude hinge loss (DR+sink) -- reduces this to 119 (about 14x) at matched FP16 PPL and matched zero-shot capacity, and composes with SmoothQuant to 39.9 PPL. The residual ~2 PPL gap to FP16 is the diagnostic core. We decompose W4A4 damage by input-activation site: the five trainable linears in a SwiGLU block split into residual-axis readers (qkv, w1, w3) and block-internal generators (o_proj, w2). Elementary norm arguments show residual-axis magnitude control bounds readers tightly but leaves w2's bilinear input bounded only by the trivial product of factor bounds; empirically, DR+sink collapses reader kurtosis while leaving generators essentially unchanged, and the reader-rescued W4A4 residue is flat at ~0.28 nats across three matched checkpoints with Delta-remove(w2) dominating. We present DR+sink as a training-time probe rather than a deployment proposal: a post-hoc alternative (Per-Linear QuaRot) nearly matches it on the reader axis. Full QuaRot -- adding online per-head value Hadamard plus online w2-input rotation -- does not close the gap either, directly testing the prediction that orthogonal rotation cannot bound the bilinear SwiGLU tail. Claims are specific to our 300M, 5B-token, single-seed setting, and our experiments do not isolate the partition from the hinge.

2604.12376 2026-05-26 cs.CL cs.AI 版本更新

Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

面向长程LLM对话的协作式内存分页与关键词书签

Ziyang Liu

AI总结 提出协作式分页方法,用关键词书签替代被驱逐的对话片段,并赋予模型 recall() 工具按需检索,在 LoCoMo 基准上四个模型均取得最佳答案质量,并通过消融实验揭示分页设计的关键因素。

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

详情
AI中文摘要

当LLM对话超出上下文窗口时,旧内容必须被驱逐——但模型在需要时如何恢复它们?我们提出协作式分页:被驱逐的片段被替换为最小关键词书签([pN:keywords],每个约8-24个token),并赋予模型一个 recall() 工具以按需检索完整内容。在 LoCoMo 基准(10个真实多会话对话,300+轮次)上,协作式分页在四种模型(GPT-4o-mini、DeepSeek-v3.2、Claude Haiku、GLM-5)的六种方法中实现了最高的答案质量——优于截断、BM25、词重叠检索、搜索工具基线和完整上下文——由四个独立的LLM评判员确认(p=0.017,配对bootstrap)。随后,我们通过边界策略和驱逐策略的5x4消融实验(3,176个合成探针,1,600个LoCoMo探针)研究分页设计空间。关键发现:(1)粗粒度固定大小页面(fixed_20)达到96.7%,而内容感知的topic_shift降至56.7%;(2)驱逐策略的选择依赖于数据(FIFO在合成数据上最佳,LFU在LoCoMo上最佳);(3)两种书签生成策略相比启发式基线有提升(+4.4和+8.7个E2E点);(4)剩余瓶颈是书签区分度——模型96%的时间触发recall(),但当书签区分度不足时,仅57%选择正确页面。关键词特异性单独造成25个百分点的准确率差异。

英文摘要

When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.

2604.08501 2026-05-26 cs.DL cs.CL cs.SE 版本更新

sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing

sciwrite-lint:科学氛围写作时代的验证基础设施

Sergey V Samsonau

发表机构 * Authentic Research Partners(真实研究伙伴) Princeton, NJ(新泽西州普林斯顿)

AI总结 针对AI辅助写作导致的引用幻觉问题,提出基于软件工程lint范式的引用验证工具sciwrite-lint,在研究者本地运行,快速检查引用存在性、元数据准确性、撤回状态和主张支持,并评估引用链完整性。

Comments Code: https://github.com/authentic-research-partners/sciwrite-lint

详情
AI中文摘要

科学论文通过引用对先前工作提出主张。大规模验证这些引用(每篇被引论文是否存在、是否支持引用主张、本身是否可靠)在结构上超出了人类评审的能力:一篇典型论文有数十条引用,而仔细的评审者最多通读少数几篇。AI辅助写作使这一差距更加紧迫:LLM会幻觉化参考文献,并从它们从未读过的论文标题或摘要中填充看似合理的细节,对于隐私意识研究者必须使用的较小本地权重模型,情况更糟。 sciwrite-lint将软件工程中的lint范式应用于引用验证:它完全在研究者机器上运行(免费公共数据库、单个消费级GPU和开放权重模型),速度足够快,可在修订之间重新lint,使作者在起草时就能从源头发现问题,并为期刊和评审者提供自动化的第一遍检查。该流程检查引用存在性、元数据准确性、撤回状态和主张支持,遍历被引论文参考文献的一级深度,并生成每篇引用的可靠性评分。我们在30篇未见过的论文(arXiv和bioRxiv)上进行了评估,包括错误注入和LLM裁决的假阳性分析。 相同的lint工作流程扩展到内部一致性:文本与表格中的数字、摘要与正文、图注与内容、统计结果与其文字解释,以及结构交叉引用(悬空引用、孤立参考文献)。作为独立的实验贡献,我们还提出了SciLint评分:引用链完整性与一个贡献组件相结合,该组件操作了五个科学哲学框架(Popper、Lakatos、Kitcher、Laudan、Mayo)。

英文摘要

Scientific papers make claims about prior work backed by citations. Verifying those citations at scale (that each cited paper exists, says what the citation claims, and is itself reliable) is structurally beyond what human review can deliver: a typical paper has dozens of citations, and a careful reviewer reads at most a handful end-to-end. AI-assisted writing makes this gap even more urgent: LLMs hallucinate references and may fill in plausible details from titles or abstracts of papers they never read, worse for the smaller local-weights models that privacy-aware researchers must use. sciwrite-lint applies the linting paradigm from software engineering to citation verification: it runs entirely on the researcher's machine (free public databases, a single consumer GPU, and open-weights models), is fast enough to re-lint between revisions so authors catch problems at the source while drafting, and serves journals and reviewers as an automated first pass. The pipeline checks reference existence, metadata accuracy, retraction status, and claim support, traverses one level into cited papers' bibliographies, and produces per-reference reliability scores. We evaluate on 30 unseen papers (arXiv and bioRxiv) with error injection and LLM-adjudicated false-positive analysis. The same linting workflow extends to internal consistency: numbers in text vs. tables, abstract vs. body, figure captions vs. content, statistical results vs. their verbal interpretation, plus structural cross-references (dangling cites, orphan references). As a separate experimental contribution we also propose SciLint Score: citation-chain integrity combined with a contribution component operationalizing five philosophy-of-science frameworks (Popper, Lakatos, Kitcher, Laudan, Mayo).

2603.20479 2026-05-26 cs.CY cs.AI cs.CL 版本更新

Profiling learners' affective engagement: Emotion AI, intercultural pragmatics, and language learning

学习者情感投入画像:情感AI、跨文化语用学与语言学习

Robert Godwin-Jones

发表机构 * Virginia Commonwealth University(弗吉尼亚大学)

AI总结 本文探讨了情感AI在语言学习中的应用,特别是自动情感识别和模拟人类响应如何影响语用能力和互动能力的发展,并讨论了其个性化学习优势与情感操纵风险。

详情
Journal ref
Language Learning & Technology, 30(2), 14-35 (2026)
AI中文摘要

学习另一种语言可能是一个高度情感化的过程,通常以无数大大小小的挫折和成功为特征。对大多数学习者而言,语言学习并非遵循线性、可预测的路径,其曲折进程受动机(或去动机)变量影响,如个人特征、师生关系、学习材料以及对未来第二语言自我的梦想。虽然语言学习的某些方面(阅读、语法)相对机械,但其他方面可能充满压力且不可预测,尤其是用目标语言交谈。这种体验不仅需要结构和词汇知识,还需要以适合社会和文化语境的方式使用语言的能力。AI聊天机器人的出现为练习会话能力提供了新机会,既有优势(响应迅速、无评判),也有缺点(缺乏情感、文化偏见)。本文探讨了技术使用中产生的情感方面,特别是自动情感识别和AI系统中模拟的人类响应如何与语言学习以及语用和互动能力的发展相互作用。情感AI,即算法驱动对用户情感信号的解读,被认为能够实现更个性化的学习,适应感知到的学习者认知和情感状态。其他人则警告情感操纵以及不恰当和无效的用户画像。

英文摘要

Learning another language can be a highly emotional process, typically characterized by numerous frustrations and triumphs, big and small. For most learners, language learning does not follow a linear, predictable path, its zigzag course shaped by motivational (or demotivating) variables such as personal characteristics, teacher/peer relationships, learning materials, and dreams of a future L2 (second language) self. While some aspects of language learning (reading, grammar) are relatively mechanical, others can be stressful and unpredictable, especially conversing in the target language. That experience necessitates not only knowledge of structure and lexis, but also the ability to use the language in ways that are appropriate to the social and cultural context. A new opportunity to practice conversational abilities has arrived through the availability of AI chatbots, with both advantages (responsive, non-judgmental) and drawbacks (emotionally void, culturally biased). This column explores aspects of emotion as they arise in technology use and in particular how automatic emotion recognition and simulated human responsiveness in AI systems interface with language learning and the development of pragmatic and interactional competence. Emotion AI, the algorithmically driven interpretation of users' affective signals, has been seen as enabling greater personalized learning, adapting to perceived learner cognitive and emotional states. Others warn of emotional manipulation and inappropriate and ineffective user profiling

2603.11583 2026-05-26 cs.CL cs.AI 版本更新

UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Tasks

UtilityMax Prompting:多目标大语言模型任务的形式化框架

Ofir Marom

发表机构 * Independent Researcher(独立研究者)

AI总结 提出UtilityMax Prompting框架,用影响图和期望效用最大化将多目标LLM任务形式化,在MovieLens 1M数据集上相比自然语言基线提升了精度和NDCG。

详情
AI中文摘要

大语言模型(LLM)任务的成功在很大程度上取决于其提示词。大多数用例使用自然语言指定提示词,当必须同时满足多个目标时,自然语言本质上是模糊的。在本文中,我们引入了UtilityMax Prompting,一个使用形式化数学语言指定任务的框架。我们将任务重构为一个影响图,其中LLM的答案是唯一的决策变量。在图中条件概率分布上定义效用函数,并指示LLM找到最大化期望效用的答案。这迫使LLM明确推理目标的每个组成部分,将其输出导向精确的优化目标,而非主观的自然语言解释。我们在MovieLens 1M数据集上,使用三个前沿模型(Claude Sonnet 4.6、GPT-5.4和Gemini 2.5 Pro)验证了我们的方法,在多目标电影推荐任务中,与自然语言基线相比,在精度和归一化折损累计增益(NDCG)上表现出一致的改进。

英文摘要

The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.

2603.05450 2026-05-26 cs.AI cs.CL 版本更新

Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

分布式部分信息谜题:在认知不对称下检验共同基础的构建

Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha, James Pustejovsky, Nikhil Krishnaswamy

发表机构 * Brandeis University(布兰迪斯大学) Colorado State University(科罗拉多州立大学)

AI总结 提出分布式部分信息谜题(DPIP)任务,收集多模态数据集,并评估大语言模型与动态认知逻辑方法在追踪信念状态和共同基础构建上的表现。

Comments 10 pages, 4 figures

详情
Journal ref
Proceedings of COLING-LREC 2026
AI中文摘要

建立共同基础(一组共享的信念和相互认可的事实)对于协作至关重要,但仍然是当前AI系统面临的挑战,尤其是在多模态、多方设置中,协作者带来不同的信息。我们引入了分布式部分信息谜题(DPIP),这是一个协作构建任务,在认知不对称下引发丰富的多模态交流。我们提供了这些交互的多模态数据集,并在语音、手势和动作模态上进行注释和时间对齐,以支持对命题内容和信念动态的推理。然后,我们评估了两种建模共同基础(CG)的范式:(1)最先进的大语言模型(LLMs),被提示从多模态更新中推断共享信念,以及(2)基于动态认知逻辑(DEL)的公理流水线,逐步执行相同的任务。在注释的DPIP数据上的结果表明,它对现代LLMs跟踪任务进展和信念状态的能力构成了挑战。

英文摘要

Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.

2602.19878 2026-05-26 cs.CL cs.LO 版本更新

Axis-Aligned Semantics for ODRL: Resolving Dimensional Ambiguity in Policy Constraints

面向ODRL的轴对齐语义:解决策略约束中的维度歧义

Daham Mustafa, Diego Collarana, Sabrina Kirrane, Christoph Lange, Christoph Quix, Rafiqul Haque, Yixin Peng, Stefan Decker

发表机构 * RWTH Aachen University(亚琛工业大学) Fraunhofer FIT(弗劳恩霍夫研究所) University of Galway(Galway大学) Vienna University of Economics and Business (WU)(维也纳大学)

AI总结 针对ODRL中多轴操作数导致的维度歧义问题,提出轴分解方法,将约束转化为每个轴上的标量操作,从而将冲突检测简化为盒比较,并定义三值语义,通过基准测试验证了方法的正确性和兼容性。

Comments 17 pages. Preprint. v3: expanded benchmark to 256 problems; revised semantics and profile (OAAP)

详情
AI中文摘要

开放数字权利语言(ODRL)将策略约束表示为左操作数、运算符和值的三元组。然而,多个空间操作数涉及宽度、高度和深度等多轴域,而约束语法未提供明确的轴标识。因此,策略引擎无法确定多个约束是应用于同一轴还是不同轴,导致冲突检测不可靠或不完整。我们通过轴分解解决这一歧义,将多轴操作数替换为全序域上的轴特定标量操作数。每个约束表示每个轴上的一个区间,每个策略表示一个轴对齐的盒,从而将冲突检测简化为盒比较。我们定义了三值语义(冲突、兼容、未知),证明了分解的正确性及其与ODRL的向后兼容性,将其实例化为ODRL轴对齐配置文件(OAAP),并在包含256个ODRL策略问题的基准测试上进行了验证,每个问题以Turtle表示并编译为一阶形式(TPTP)和SMT-LIB形式,使用了Vampire、E、Z3和cvc5求解器。

英文摘要

The Open Digital Rights Language (ODRL) represents policy constraints as triples of a left operand, an operator, and a value. Several spatial operands, however, range over multi-axis domains such as width, height, and depth, while the constraint syntax provides no explicit axis identity. As a result, policy engines cannot determine whether multiple constraints apply to the same axis or different ones, making conflict detection unsound or incomplete. We resolve this ambiguity by axis decomposition, replacing multi-axis operands with axis-specific scalar operands over totally ordered domains. Each constraint then denotes an interval per axis and each policy an axis-aligned box, reducing conflict detection to box comparison. We define a three-valued semantics (Conflict, Compatible, Unknown), prove the decomposition sound and backward compatible with ODRL, instantiate it as ODRL Axis-Aligned Profile (OAAP), and validate it on a benchmark of 256 ODRL policy problems, each expressed in Turtle and compiled to first-order (TPTP) and SMT-LIB form, using Vampire, E, Z3, and cvc5.

2602.11173 2026-05-26 cs.CL 版本更新

Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review

作者参与的回信生成与评估:将作者专业知识和意图整合到对同行评审的回复中

Qian Ruan, Iryna Gurevych

发表机构 * Ubiquitous Knowledge Processing Lab (UKP Lab)(通用知识处理实验室) Department of Computer Science(计算机科学系) Hessian Center for AI (hessian.AI)(海德堡人工智能中心)

AI总结 提出作者参与的回信生成与评估框架,通过引入对齐的评审-回复-修订三元组数据集、支持灵活作者输入和可控生成的REspGen系统以及包含20+指标的综合评估套件REspEval,填补了作者信号利用和评估的空白。

Comments accepted to ACL 2026 Main Conference

详情
AI中文摘要

作者回复(反驳)写作是科学同行评审的关键阶段,需要作者付出大量努力。在实践中,作者拥有领域专业知识、仅作者可用的信息和回复策略——作者专业知识和意图的具体形式——并寻求NLP辅助,将这些信号整合到作者回复生成(ARG)中。然而,这种作者参与范式缺乏正式的NLP表述和系统研究:没有数据集提供细粒度的作者信号,现有的ARG工作缺乏作者输入和控制,也没有评估指标衡量回复对作者信号的反映以及解决评审者关注点的有效性。为填补这些空白,我们引入了(i)Re3Align,第一个大规模的对齐评审-回复-修订三元组数据集,其中修订代理作者信号;(ii)REspGen,一个作者参与的ARG框架,支持灵活的作者输入、多属性控制和评估引导的细化;以及(iii)REspEval,一个包含20多个指标的全面评估套件,涵盖输入利用、可控性、回复质量和话语。使用SOTA LLMs的实验证明了作者输入和评估引导细化的好处、输入特异性对回复质量的影响以及可控性与质量之间的权衡。我们发布了我们的数据集、生成和评估工具。

英文摘要

Author response (rebuttal) writing is a critical stage of scientific peer review that demands substantial author effort. In practice, authors possess domain expertise, author-only information, and response strategies - concrete forms of author expertise and intent - and seek NLP assistance that integrates these signals into author response generation (ARG). Yet this author-in-the-loop paradigm lacks formal NLP formulation and systematic study: no dataset provides fine-grained author signals, existing ARG work lacks author inputs and controls, and no evaluation measures response reflection of author signals and effectiveness in addressing reviewer concerns. To fill these gaps, we introduce (i) Re3Align, the first large-scale dataset of aligned review-response-revision triplets, where revisions proxy author signals; (ii) REspGen, an author-in-the-loop ARG framework supporting flexible author input, multi-attribute control, and evaluation-guided refinement; and (iii) REspEval, a comprehensive evaluation suite with 20+ metrics spanning input utilization, controllability, response quality, and discourse. Experiments with SOTA LLMs demonstrate the benefits of author input and evaluation-guided refinement, the impact of input specificity on response quality, and controllability-quality trade-offs. We release our dataset, generation and evaluation tools.

2602.02843 2026-05-26 cs.CL 版本更新

Act or Clarify? Modeling Sensitivity to Uncertainty and Cost in Communication

行动还是澄清?建模沟通中对不确定性和成本的敏感性

Polina Tsvilodub, Karl Mulligan, Todd Snider, Robert D. Hawkins, Michael Franke

发表机构 * Department of Linguistics, University of Tübingen(图宾根大学语言系) Department of Linguistics, Stanford University(斯坦福大学语言系)

AI总结 提出基于预期遗憾的计算模型,研究人类在不确定性下是否选择提问澄清,取决于不确定性和错误行动成本之间的理性权衡。

Comments 6 pages, 3 figures, accepted to CogSci 2026

详情
AI中文摘要

在不确定性下决定如何行动时,智能体可以选择行动以减少不确定性,也可以不顾不确定性而行动。在沟通场景中,减少不确定性的一个重要方式是提出澄清问题(CQs)。我们预测,是否提出CQ的决定取决于上下文不确定性和替代行动的成本,并且这些因素相互作用:当错误行动代价高昂时,不确定性最为重要。我们在一个基于预期遗憾的计算模型中形式化了这种相互作用:该模型衡量智能体在当前行动而非拥有完整信息时可能遭受的损失。我们在两个实验中测试了这些预测,一个实验考察对问题的纯语言回应,另一个扩展到在澄清和非语言行动之间的选择。综合来看,我们的结果表明一种理性权衡:人类倾向于在不确定性下行动时,根据可能遭受重大损失的风险比例来寻求澄清。

英文摘要

When deciding how to act under uncertainty, agents may choose to act to reduce uncertainty or they may act despite that uncertainty. In communicative settings, an important way of reducing uncertainty is by asking clarification questions (CQs). We predict that the decision to ask a CQ depends on both contextual uncertainty and the cost of alternative actions, and that these factors interact: uncertainty should matter most when acting incorrectly is costly. We formalize this interaction in a computational model based on expected regret: how much an agent stands to lose by acting now rather than with full information. We test these predictions in two experiments, one examining purely linguistic responses to questions and another extending to choices between clarification and non-linguistic action. Taken together, our results suggest a rational tradeoff: humans tend to seek clarification proportional to the risk of substantial loss when acting under uncertainty.

2602.02605 2026-05-26 cs.NE cs.AI cs.CL q-bio.NC 版本更新

Fine-Tuning Language Models to Know What They Know

微调语言模型使其了解自身所知

Sangjun Park, Elliot Meyerson, Xin Qiu, Risto Miikkulainen

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Cognizant AI Lab(认知人工智能实验室)

AI总结 本文提出一种框架,通过进化策略对齐方法(ESMA)在控制偏差的同时提升大语言模型的元认知能力,并在未见数据集、语言和新知识上展现出鲁棒泛化性。

Comments Preprint

详情
AI中文摘要

评估大语言模型(LLMs)的真实元认知能力因偏差和启发式方法而困难。本文提出一个框架,在控制这些偏差的同时测量和增强LLM的元认知能力。建立了使用$d'_{\rm type2}$指标的测量方法以隔离元认知能力。提出了元认知对齐进化策略(ESMA),在未见数据集、语言和新获取的知识上展现出鲁棒泛化性。最后,参数分析表明这些改进由一组稀疏参数驱动,为定向元认知优化提供了新途径。

英文摘要

Evaluating true metacognition in Large Language Models (LLMs) is difficult due to biases and heuristics. This paper presents a framework to measure and enhance LLM metacognition while controlling for these biases. A measurement method using the $d'_{\rm type2}$ metric is established to isolate metacognitive ability. The Evolution Strategy for Metacognitive Alignment (ESMA) is proposed, demonstrating robust generalization across unseen datasets, languages, and newly acquired knowledge. Finally, parameter analysis reveals that these improvements are driven by a sparse set of parameters, offering new pathways for targeted metacognitive optimization.

2602.02474 2026-05-26 cs.CL cs.AI cs.LG 版本更新

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

MemSkill:面向自进化智能体的可学习与进化记忆技能

Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, Wenya Wang

发表机构 * Nanyang Technological University(南洋理工大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Illinois Chicago(伊利诺伊大学芝加哥分校) Tsinghua University(清华大学)

AI总结 提出MemSkill框架,将记忆操作转化为可学习和可进化的技能,通过控制器选择技能、执行器生成记忆、设计者进化技能集,形成闭环提升LLM智能体任务性能。

Comments Code is available at https://github.com/ViktorAxelsen/MemSkill

详情
AI中文摘要

大多数大语言模型(LLM)智能体记忆系统依赖少量静态、手工设计的操作来提取记忆。这些固定程序硬编码了关于存储内容和如何修订记忆的人类先验知识,使其在多样化的交互模式下僵化,并在长历史记录上效率低下。为此,我们提出 extbf{MemSkill},将这些操作重新定义为可学习和可进化的记忆技能,即从交互轨迹中提取、整合和修剪信息的结构化可重用例程。受智能体技能设计哲学的启发,MemSkill采用一个 extit{控制器},学习选择少量相关技能,并与基于LLM的 extit{执行器}配对,生成技能引导的记忆。除了学习技能选择,MemSkill引入一个 extit{设计者},定期审查所选技能产生错误或不完整记忆的困难案例,并通过提出改进和新技能来进化技能集。共同地,MemSkill形成了一个闭环流程,改进了技能选择策略和技能集本身。在LoCoMo、LongMemEval、HotpotQA和ALFWorld上的实验表明,MemSkill在强基线上提高了任务性能,并在不同设置下具有良好的泛化能力。进一步分析揭示了技能如何进化,为LLM智能体更自适应、自进化的记忆管理提供了见解。

英文摘要

Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present \textbf{MemSkill}, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a \emph{controller} that learns to select a small set of relevant skills, paired with an LLM-based \emph{executor} that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a \emph{designer} that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.

2601.03014 2026-05-26 cs.CL cs.AI 版本更新

SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering

SentGraph: 用于多跳检索增强问答的层次化句子图

Junli Liang, Pengfei Zhou, Wangqiu Zhou, Wenjie Qing, Qi Zhao, Ziwen Wang, Qi Song, Xiangyang Li

发表机构 * University of Science and Technology of China(中国科学技术大学) Hefei University of Technology(合肥工业大学)

AI总结 提出SentGraph,一种句子级图RAG框架,通过构建层次化句子图并建模细粒度逻辑关系,解决多跳问答中证据链不完整的问题。

详情
AI中文摘要

传统的检索增强生成(RAG)通过大型语言模型有效支持单跳问答,但在需要结合多个文档证据的多跳问答任务中面临显著限制。现有的基于块的检索通常提供不相关且逻辑不连贯的上下文,导致答案生成过程中证据链不完整和推理错误。为了解决这些挑战,我们提出了SentGraph,一种句子级图RAG框架,显式建模句子之间的细粒度逻辑关系以用于多跳问答。具体来说,我们离线构建一个层次化句子图:首先调整修辞结构理论以区分核心句和卫星句,然后将它们组织成带有跨文档实体桥的主题级子图。在线检索时,SentGraph执行图引导的证据选择和路径扩展,以检索细粒度的句子级证据。在四个多跳问答基准上的大量实验证明了SentGraph的有效性,验证了显式建模句子级逻辑依赖关系对多跳推理的重要性。

英文摘要

Traditional Retrieval-Augmented Generation (RAG) effectively supports single-hop question answering with large language models but faces significant limitations in multi-hop question answering tasks, which require combining evidence from multiple documents. Existing chunk-based retrieval often provides irrelevant and logically incoherent context, leading to incomplete evidence chains and incorrect reasoning during answer generation. To address these challenges, we propose SentGraph, a sentence-level graph-based RAG framework that explicitly models fine-grained logical relationships between sentences for multi-hop question answering. Specifically, we construct a hierarchical sentence graph offline by first adapting Rhetorical Structure Theory to distinguish nucleus and satellite sentences, and then organizing them into topic-level subgraphs with cross-document entity bridges. During online retrieval, SentGraph performs graph-guided evidence selection and path expansion to retrieve fine-grained sentence-level evidence. Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.

2512.06393 2026-05-26 cs.AI cs.CL cs.LG cs.LO 版本更新

Conflict-Aware Fusion: Mitigating Logic Inertia in Large Language Models via Structured Cognitive Priors

冲突感知融合:通过结构化认知先验缓解大语言模型中的逻辑惯性

Qiming Bao, Xiaoxuan Fu, Michael Witbrock

发表机构 * Xtracta & Strong AI Lab, University of Auckland(Xtracta与强人工智能实验室,奥克兰大学) School of Humanities, China University of Political Science and Law(人文学院,中国政法大学) Strong AI Lab, University of Auckland(强人工智能实验室,奥克兰大学)

AI总结 针对大语言模型在规则系统结构扰动下表现脆弱的问题,提出冲突感知融合训练流程,通过验证-演绎结构先验和符号推理奖励,在多个压力测试中实现鲁棒性饱和。

详情
AI中文摘要

大型语言模型(LLM)在许多推理基准上取得了高准确率,但在基于规则系统的结构扰动下仍然脆弱。我们引入了一个包含四个压力测试的诊断框架——冗余与必要规则删除、矛盾规则注入、逻辑保持重写和多定律堆叠——并用它来揭示逻辑惯性:生成式LLM(Qwen2/3、TinyLlama、GPT-4o、Gemma-3-4B-IT)和仅编码器BERT基线在矛盾前提下沿学习到的演绎轨迹持续推理的倾向。这种崩溃是剧烈的:未经处理的基线在基础任务上的准确率从1.00下降到矛盾注入时的0.00(实例级精确匹配),而GPT-4o仅解决了56.0%的矛盾案例。我们提出冲突感知融合,这是一个四阶段训练流程,将验证-演绎作为学习到的结构先验强制执行:(i)SFT建立验证前缀;(ii)DPO锐化矛盾停止决策边界;(iii)逻辑不变正则化(LIRE)通过对称KL惩罚逻辑等价规则公式之间的差异;(iv)来自验证反馈的强化学习(RLVF)使用符号前向链接引擎作为确定性预言奖励,联合优化不变性和敏感性。该流程在1.5B和8B骨干网络上均使所有四个主要压力测试达到饱和。我们进一步验证了第二阶段扩展,用Lean 4内核替换命题预言机,在分层187个问题的Lean翻译样本中,对105个经典可推导(T)问题达到99.0%的内核一致性(整体71.7%,涵盖两种极性),为形式化验证的RL训练提供了可靠的升级路径。代码和基准:https://github.com/14H034160212/lemo

英文摘要

Large language models (LLMs) achieve high accuracy on many reasoning benchmarks but remain brittle under structural perturbations of rule-based systems. We introduce a diagnostic framework with four stress tests -- redundant vs. essential rule deletion, contradictory-rule injection, logic-preserving rewrites, and multi-law stacking -- and use it to expose Logic Inertia: the tendency of generative LLMs (Qwen2/3, TinyLlama, GPT-4o, Gemma-3-4B-IT) and the encoder-only BERT baseline to persist along learned deductive trajectories under inconsistent premises. The collapse is sharp: untreated baselines fall from accuracy 1.00 on the base task to 0.00 on contradiction injection (instance-level exact match), and GPT-4o resolves only 56.0% of contradiction cases. We propose Conflict-Aware Fusion, a four-stage training pipeline that enforces verification-before-deduction as a learned structural prior: (i) SFT establishes the verification preamble; (ii) DPO sharpens the halt-on-contradiction decision boundary; (iii) Logical Invariance REgularisation (LIRE) penalises divergence between logically equivalent rule formulations via symmetric KL; (iv) Reinforcement Learning from Verification Feedback (RLVF) uses a symbolic forward-chaining engine as a deterministic oracle reward, jointly optimising invariance and sensitivity. The pipeline saturates all four primary stress tests for both 1.5B and 8B backbones. We further validate a Phase 2 extension that replaces the propositional oracle with a Lean 4 kernel, attaining 99.0% kernel agreement on the 105 classically-derivable (T) questions within a stratified 187-question Lean-translated sample (overall 71.7% across both polarities), providing a sound upgrade path to formally verified RL training. Code and benchmark: https://github.com/14H034160212/lemo

2511.02721 2026-05-26 cs.CL 版本更新

PETra: A Multilingual Corpus of Pragmatic Explicitation in Translation

PETra:翻译中语用显化的多语语料库

Doreen Osmelak, Koel Dutta Chowdhury, Uliana Sentsova, Cristina España-Bonet, Josef van Genabith

发表机构 * Saarland University, Saarland Informatics Campus, Germany(萨尔兰大学,萨尔兰信息学院,德国) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI)) Barcelona Supercomputing Center (BSC-CNS), Barcelona, Catalonia, Spain(巴塞罗那超级计算中心(BSC-CNS),巴塞罗那,加泰罗尼亚,西班牙)

AI总结 提出首个多语语料库PragExTra及检测框架,通过空对齐和主动学习识别语用显化,跨语言准确率达0.88,F1达0.82。

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
AI中文摘要

译者常常用背景细节丰富文本,使隐含的文化含义对新受众变得明确。这种现象称为语用显化,在翻译理论中已被广泛讨论,但很少用计算模型处理。我们介绍了PragExTra,这是第一个用于语用显化的多语语料库和检测框架。该语料库涵盖来自TED-Multi和Europarl的八种语言对,并包括实体描述、测量转换和译者评论等补充内容。我们通过空对齐识别候选显化案例,并使用主动学习结合人工标注进行精炼。我们的结果表明,实体和系统层面的显化最为常见,主动学习将分类器准确率提高了7-8个百分点,跨语言达到0.88的准确率和0.82的F1值。PragExTra将语用显化确立为可测量的跨语言现象,并向构建文化感知的机器翻译迈出了一步。关键词:翻译,多语制,显化

英文摘要

Translators often enrich texts with background details that make implicit cultural meanings explicit for new audiences. This phenomenon, known as pragmatic explicitation, has been widely discussed in translation theory but rarely modeled computationally. We introduce PragExTra, the first multilingual corpus and detection framework for pragmatic explicitation. The corpus covers eight language pairs from TED-Multi and Europarl and includes additions such as entity descriptions, measurement conversions, and translator remarks. We identify candidate explicitation cases through null alignments and refined using active learning with human annotation. Our results show that entity and system-level explicitations are most frequent, and that active learning improves classifier accuracy by 7-8 percentage points, achieving up to 0.88 accuracy and 0.82 F1 across languages. PragExTra establishes pragmatic explicitation as a measurable, cross-linguistic phenomenon and takes a step towards building culturally aware machine translation. Keywords: translation, multilingualism, explicitation

2510.27118 2026-05-26 cs.CL 版本更新

Probability Distributions Computed by Autoregressive Transformers

自回归变压器计算的概率分布

Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski, Ryan Cotterell, David Chiang

发表机构 * University of Notre Dame(内布拉斯加大学达灵顿分校) ETH Zürich(苏黎世联邦理工学院) Max-Planck Institute for Software Systems(马克斯·普朗克软件系统研究所) University of Kaiserslautern-Landau(凯撒斯劳滕-劳恩堡大学) San José State University(圣何塞州立大学)

AI总结 研究自回归变压器作为语言模型时能表达的概率分布,揭示自回归和概率化对表达力的影响。

Comments 20 pages

详情
AI中文摘要

大多数关于变压器的表达力结果将其视为语言识别器——接受或拒绝字符串的设备——而不是像实际使用中那样:作为自回归和概率生成字符串的语言模型。我们刻画了变压器语言模型可以表达的概率分布。我们表明,使变压器语言识别器自回归有时可以增加其表达力,而使其概率化可以打破非概率情况下成立的等价关系。我们的总体贡献是厘清变压器在其最常见的用例——作为语言模型——中能够表达哪些函数。

英文摘要

Most expressivity results for transformers treat them as language recognizers -- devices that accept or reject strings -- rather than as they are used in practice: as language models that generate strings autoregressively and probabilistically. We characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing in their most common use case as language models.

2510.14925 2026-05-26 cs.AI cs.CL cs.LG 版本更新

False Fixed Points: Kantian Feedback, Stable Miscalibration, and Representational Compression in LLMs

虚假不动点:大语言模型中的康德反馈、稳定误校准与表征压缩

Akira Okutomi

发表机构 * ToppyMicroServices OÜ(ToppyMicroServices公司)

AI总结 本文通过康德承诺门控框架和线性反馈模型,研究大语言模型中高置信度错误作为局部稳定、内部一致且自信错误的虚假不动点现象,发现稳定性与正确性可分离,并探索高信噪比惯性和表征压缩作为稳定误校准的可能机制。

Comments 27 pages, 8 figures, v3.0

详情
AI中文摘要

大型语言模型中的高置信度错误通常被视为脆弱的失败。我们研究另一种可能性:某些错误可能是虚假不动点,即局部稳定、内部一致且自信地错误。这分离了鲁棒性与真实追踪。我们通过康德承诺门控框架和一个最小线性反馈模型来发展这种分离,其中稳定性和正确性可以偏离。在三个开源权重模型上,根据我们的隐藏状态敏感性探测,过度自信的错误项并不比自信正确的项系统性地更局部脆弱。基于弃权的自我批评通过牺牲覆盖率减少了过度自信的错误承诺,而C3-R(一种基于规则的显式反馈门控)则加剧了这种权衡而非消除它。这些结果激发但未证实高信噪比惯性和表征压缩作为稳定误校准的可能机制。

英文摘要

High-confidence errors in large language models are often treated as fragile failures. We study an alternative: some errors may be false fixed points, locally stable, internally coherent, and confidently wrong. This separates robustness from truth-tracking. We develop the separation through a Kantian commitment-gate framing and a minimal linear feedback model in which stability and correctness can diverge. Across three open-weight models, overconfident wrong items are not systematically more locally fragile than confidently correct items under our hidden-state sensitivity probes. Abstention-aware self-critique reduces overconfident wrong commitments by sacrificing coverage, and C3-R, a rule-based explicit feedback gate, sharpens that tradeoff rather than eliminating it. These results motivate, but do not establish, high signal-to-noise (high-SNR) inertia and representational compression as possible mechanisms for stable miscalibration.

2509.10515 2026-05-26 cs.LG cs.CL 版本更新

Adaptive Preference Optimization with Uncertainty-aware Utility Anchor

基于不确定性感知效用锚点的自适应偏好优化

Xiaobo Wang, Zixia Jia, Jiaqi Li, Qi Liu, Zilong Zheng

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China(认知智能国家重点实验室,中国科学技术大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(人工智能研究院,合肥综合性国家科学中心) State Key Laboratory of General Artificial Intelligence, BIGAI(通用人工智能国家重点实验室,BIGAI)

AI总结 提出一种通用离线偏好优化框架UAPO,通过引入锚点函数估计偏好数据标注的不确定性,支持非配对数据训练,提升数据利用效率和训练鲁棒性。

Comments Accepted by EMNLP 2025 Findings

详情
AI中文摘要

离线偏好优化方法对于大型语言模型(LLMs)的对齐是高效的。直接偏好优化(DPO)类学习作为最流行的方法之一,因其在奖励建模中的高效性而脱颖而出。然而,这些方法通常遵循惯例使用Bradley-Terry(BT)奖励建模,该建模面临几个关键假设,包括对成对训练数据的需求、模型分布偏移、人类理性假设等。为了解决这些限制,我们提出了一种通用的离线偏好优化框架——基于不确定性感知效用锚点的自适应偏好优化(UAPO),该框架引入了一个锚点函数来估计偏好数据标注带来的不确定性。我们的方法即使在数据未配对的情况下也能进行训练,显著提高了数据利用效率。此外,锚点设计使UAPO在训练过程中更加鲁棒。实验结果表明,UAPO在无需严格依赖数据配对的情况下取得了有竞争力的结果,为更灵活有效的偏好优化方法铺平了道路。

英文摘要

Offline preference optimization methods are efficient for large language models (LLMs) alignment. Direct Preference optimization (DPO)-like learning, one of the most popular approaches, stands out for its efficiency in reward modeling. However, these methods typically follow the convention to use Bradley-Terry (BT) reward modeling that faces several critical assumptions, including the requirement for pairwise training data, model distribution shifting, human rationality assumption, etc. To address these limitations, we propose a general framework for offline preference optimization methods, Adaptive Preference Optimization with Utility Anchor (UAPO), which introduces an anchoring function to estimate the uncertainties brought from preference data annotation. Our method enables training even in scenarios where the data is unpaired, significantly enhancing data utilization efficiency. Moreover, the anchor design makes UAPO more robust in the training process. Experimental results demonstrate that UAPO achieves competitive outcomes without the strict dependency on data pairing, paving the way for more flexible and effective preference optimization methods.

2507.21556 2026-05-26 cs.CL 版本更新

Transformers over-extend what humans underlearn: the case of Spanish L-shaped morphome

Transformer过度泛化而人类学习不足:西班牙语L形形态词素案例

Akhilesh Kakolu Ramarao, Kevin Tang, Dinah Baer-Henney

发表机构 * Department of English Language and Linguistics, Heinrich Heine University Düsseldorf(海因里希·海涅大学英语语言与语言学系) Institut für Germanistik, Philologische Fakultät, Ruhr-Universität Bochum(波恩鲁尔大学德语研究所) Department of Linguistics, College of Liberal Arts and Sciences, University of Florida(佛罗里达大学文理学院语言学系)

AI总结 本研究通过Transformer模型在三种频率条件下学习西班牙语L形形态词素,并与人类行为数据对比,发现模型能从分布输入中习得该模式但泛化方式与人类定性不同。

详情
AI中文摘要

不规则形态模式的认知现实性已争论数十年:说话者是否将其扩展到新形式,还是它们只是词汇产物?基于分布输入训练的神经网络提供了可学习性测试:如果它恢复了模式,则该模式仅从输入统计中即可学习。我们将此测试应用于西班牙语L形形态词素,其中第一人称单数直陈式词干出现在每个现在虚拟式单元格中,尽管缺乏明显的音系或语义动机。我们进一步询问输入中不规则动词的频率是否调节泛化,在三种频率条件(10%、50%、90%不规则)下评估Transformer,并将其与人类行为数据进行比较。在伪词输入的全形式产出中,所有模型表现均较差,但所有三种条件产生正确词干的频率均高于人类(43-49% vs. 33%)。响应偏好显示出明显分歧:人类始终偏好规则屈折,而模型随着训练中不规则比例增加更倾向于不规则形式。自然和平衡条件下的模型也对伪词与真实西班牙语不规则动词之间的音系相似性敏感,而这种效应在人类中不存在。因此,L形形态词素仅从分布输入即可学习,但模型在定性上以不同于人类的方式泛化它。

英文摘要

The cognitive reality of irregular morphological patterns has been debated for decades: do speakers extend them to novel forms, or are they lexical artifacts? A neural network trained on distributional input offers a learnability test: if it recovers the pattern, the pattern is learnable from input statistics alone. We apply this test to the Spanish L-shaped morphome, where the first-person singular indicative stem appears in every present subjunctive cell despite lacking apparent phonological or semantic motivation. We further ask whether the frequency of irregular verbs in the input modulates generalization, evaluating transformers under three frequency conditions (10%, 50%, 90% irregular) and comparing them to human behavioral data. On full-form production from pseudoword inputs all models performed poorly, but all three conditions produced the correct stem more often than humans (43--49% vs. 33%). Response preferences revealed a clear divergence: humans consistently favored regular inflections, whereas models preferred irregular forms more as their proportion in training grew. Models in the naturalistic and balanced conditions were also sensitive to phonological similarity between pseudowords and real Spanish irregular verbs, an effect absent in humans. The L-shaped morphome is thus learnable from distributional input alone, but models generalize it qualitatively differently from humans.

2507.19219 2026-05-26 cs.CL cs.CR 版本更新

How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework

大型语言模型在评估中作弊了多少?基于一次性密码本的框架下的高估基准测试

Zi Liang, Liantong Yu, Shiyu Zhang, Qingqing Ye, Haibo Hu

发表机构 * Tech Startups(科技初创公司)

AI总结 针对大型语言模型在公开基准测试中因数据污染或训练偏差导致评估结果虚高的问题,提出基于一次性密码本加密思想的动态评估框架ArxivRoll,包含自动生成私有测试用例的SCP模块和衡量污染与偏差比例的Rugged Scores指标,实现可重复、透明且高效的评估。

Comments This paper has been accepted by AAAI 2026. We update it for adding new evaluation results for ArxivRollBench-2025a and ArxivRollBench-2026a, with the evaluation of timly models like DeepSeekV4Pro, GPT-5.5, Claude-Opus-4.7, and so on. Source code: https://github.com/liangzid/ArxivRoll/ Online Leaderboard Website: https://arxivroll.moreoverai.com/

详情
AI中文摘要

评估大型语言模型(LLMs)时的高估问题日益引起关注。由于公开基准测试的数据污染或模型训练不平衡,LLMs可能在公开基准测试中无意或有意地获得不真实的评估结果,这导致LLMs之间的不公平比较,并削弱了对其实际能力的评估。现有基准测试试图通过永久保密测试用例、通过人工评估减轻污染或反复收集和构建新样本来解决这些问题。然而,这些方法无法同时确保可重复性、透明性和高效率。此外,当前LLMs的高估程度仍未量化。为解决这些问题,我们提出了ArxivRoll,一个受密码学中一次性密码本加密启发的动态评估框架。ArxivRoll包含两个关键组件:\emph{i) SCP(排序、完形填空和预测)},一个用于私有测试用例的自动生成器;\emph{ii) Rugged Scores(RS)},衡量公开基准测试污染和训练偏差比例的指标。利用SCP,ArxivRoll每六个月使用ArXiv上的最新文章构建一个新的基准测试,并将其用于LLM性能的一次性评估。大量实验证明了我们基准测试的高质量,并且我们提供了对当前LLMs的系统评估。源代码可在https://github.com/liangzid/ArxivRoll/获取。

英文摘要

Overestimation in evaluating large language models (LLMs) has become an increasing concern. Due to the contamination of public benchmarks or imbalanced model training, LLMs may achieve unreal evaluation results on public benchmarks, either intentionally or unintentionally, which leads to unfair comparisons among LLMs and undermines their realistic capability assessments. Existing benchmarks attempt to address these issues by keeping test cases permanently secret, mitigating contamination through human evaluation, or repeatedly collecting and constructing new samples. However, these approaches fail to ensure reproducibility, transparency, and high efficiency simultaneously. Moreover, the extent of overestimation in current LLMs remains unquantified. To address these issues, we propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography. ArxivRoll comprises two key components: \emph{i) SCP (Sequencing, Cloze, and Prediction)}, an automated generator for private test cases, and \emph{ii) Rugged Scores (RS)}, metrics that measure the proportion of public benchmark contamination and training bias. Leveraging SCP, ArxivRoll constructs a new benchmark every six months using recent articles from ArXiv and employs them for one-time evaluations of LLM performance. Extensive experiments demonstrate the high quality of our benchmark, and we provide a systematic evaluation of current LLMs. The source code is available at https://github.com/liangzid/ArxivRoll/.

2507.14958 2026-05-26 cs.CL 版本更新

MUR: Momentum Uncertainty guided Reasoning for Large Language Models

MUR: 面向大型语言模型的动量不确定性引导推理

Hang Yan, Fangzhi Xu, Rongman Xu, Yifei Li, Jian Zhang, Haoran Luo, Xiaobao Wu, Luu Anh Tuan, Haiteng Zhao, Qika Lin, Jun Liu

发表机构 * School of Computer Science and Technology, Xi’an Jiaotong University(西安交通大学计算机科学与技术学院) Ministry of Education Key Laboratory of Intelligent Networks and Network Security(教育部智能网络与网络安全重点实验室) Shaanxi Province Key Laboratory of Big Data Knowledge Engineering(陕西省大数据知识工程重点实验室) Nanyang Technological University(南洋理工大学) Shanghai Jiao Tong University(上海交通大学) VinUniversity(Vin大学) Shanghai AI Laboratory(上海人工智能实验室) National University of Singapore(新加坡国立大学)

AI总结 提出动量不确定性引导推理(MUR)方法,通过追踪和聚合逐步不确定性动态分配推理预算,并引入γ控制机制,在不额外训练的情况下减少冗余计算,在多个基准上平均减少45%以上计算量并提升准确率。

详情
AI中文摘要

大型语言模型在推理密集型任务上取得了令人印象深刻的性能,但优化其推理效率仍然是一个开放的挑战。虽然测试时缩放(TTS)提高了推理质量,但它常常导致过度思考,在冗余计算上浪费令牌。本研究探讨如何在不额外训练的情况下,高效且自适应地引导当前模型的测试时缩放。受物理学中动量概念的启发,我们提出了动量不确定性引导推理(MUR),它通过随时间跟踪和聚合逐步不确定性,动态地将思考预算分配给关键的推理步骤。为了支持灵活的推理时控制,我们引入了γ控制,这是一种通过单个超参数调整推理预算的简单机制。我们提供了深入的理论证明,以支持MUR在稳定性和偏差方面的优越性。MUR与各种TTS方法在四个具有挑战性的基准(MATH-500、AIME24、AIME25和GPQA-diamond)上,使用不同大小的最新Qwen3模型(1.7B、4B和8B)进行了全面评估。结果表明,MUR平均减少了超过45%的计算量,同时将准确率提高了0.33%至3.46%。

英文摘要

Large Language Models have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide current model' test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by by over 45% on average while improving accuracy from 0.33 to 3.46%.

2507.10644 2026-05-26 cs.AI cs.CL cs.CR cs.HC cs.MA 版本更新

From Multi-Agent Systems and the Semantic Web to Agentic AI: A Unified Narrative of the Web of Agents

从多智能体系统和语义网到智能体AI:智能体网络的统一叙事

Tatiana Petrova, Boris Bliznioukov, Aleksandr Puzikov, Radu State

发表机构 * SEDAN - SnT University of Luxembourg(卢森堡大学)

AI总结 本文提出智能体网络(WoA)经历了从平台端协调(第一代)、数据端标注(第二代)到模型端解释(第三代)的语义努力迁移,并分析了各代失败模式及当前开放问题。

详情
AI中文摘要

智能体网络(WoA)将文档为中心的Web转变为由自主智能体代表用户行动的环境,这一愿景随着大型语言模型(LLM)的成熟而变得可行。我们认为,在过去的三十年中,WoA按时间顺序经历了语义努力迁移:从平台端协调(多智能体系统,第一代),经过数据端标注(语义网,第二代),到模型端解释(LLM时代,第三代)。这一轨迹中的核心转变——从第二代到第三代,我们称之为从数据中的语义到模型中的语义的转变——具有预测性:每一代的失败模式和当前开放问题都源于该代语义努力的定位。本文做出五项贡献:(i) 一个跨越1990-2026年的统一进化叙事;(ii) 一个四维比较框架(语义基础、通信范式、智能定位、发现机制),统一应用于所有三代;(iii) 对十六个代表性系统在这些维度上的分类,包括混合LLM-知识图谱和计算机使用智能体;(iv) 涵盖2024年11月至2026年8月的制度融合(Linux基金会智能体AI基金会、A2A v1.0、MCP 2024年11月发布和2025年11月规范、Visa/Mastercard/Stripe支付网络协议、欧盟AI法案分阶段执行、NIST AI智能体标准倡议、2026年国际AI安全报告);以及(v) 基于跨代证据的七个命名教训,以及七个与代无关的挑战,无论哪种协议占主导地位,这些挑战都持续存在。进一步的进展更多地取决于标准机构、监管机构和商业支付网络正在组装的社会技术基础设施,而不是协议设计。

英文摘要

The Web of Agents (WoA) transforms the document-centric Web into an environment of autonomous agents acting on users' behalf, a vision newly tractable as large language models (LLMs) mature. We argue that across three decades the WoA has undergone a \emph{semantic-effort migration} in chronological order: from platform-side coordination (Multi-Agent Systems, Generation~I), through data-side annotation (Semantic Web, Generation~II), to model-side interpretation (LLM-era, Generation~III). The central Gen~II~$\rightarrow$~Gen~III transition within this trajectory, which we call the \emph{semantics-in-data $\rightarrow$ semantics-in-models} shift, is predictive: each generation's failure modes and current open problems follow from where that generation located its semantic effort. The survey makes five contributions: (i)~a unified evolutionary narrative spanning 1990--2026; (ii)~a four-dimensional comparative framework (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism) applied uniformly across all three generations; (iii)~classification of sixteen representative systems on these dimensions, including hybrid LLM--knowledge-graph and computer-use agents; (iv)~coverage of the November~2024--August~2026 institutional convergence (Linux Foundation's Agentic AI Foundation, A2A v1.0, MCP November~2024 launch and November~2025 specification, Visa/Mastercard/Stripe payment-network protocols, EU AI Act phased enforcement, the NIST AI Agent Standards Initiative, International AI Safety Report 2026); and (v)~seven named lessons grounded in cross-generational evidence paired with seven generation-invariant challenges that persist regardless of which protocol prevails. Further progress depends less on protocol design than on the socio-technical infrastructure now being assembled by standards bodies, regulators, and commercial payment networks.

2505.24876 2026-05-26 cs.CV cs.CL 版本更新

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

Agent-X:评估视觉中心智能体任务中的深度多模态推理

Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(莫扎德·本·扎耶德人工智能大学) University of Central Florida(中央佛罗里达大学) University of Oxford(牛津大学)

AI总结 提出Agent-X基准,通过828个真实视觉任务和细粒度步骤评估框架,揭示当前模型在多步视觉推理中全链成功率低于50%的瓶颈。

Comments Accepted in International Conference of Learning Representations (ICLR 2026)

详情
AI中文摘要

深度推理对于解决复杂任务至关重要,尤其是在需要顺序多模态理解的视觉中心场景中。然而,现有基准通常使用完全合成的单轮查询、有限的视觉模态进行评估,并且缺乏在真实世界环境中多步推理质量的评估框架。为了解决这一问题,我们引入了Agent-X,这是一个大规模基准,用于评估视觉中心智能体在真实多模态环境中的多步和深度推理能力。Agent-X包含828个具有真实视觉上下文的智能体任务,包括图像、多图像比较、视频和指令文本。这些任务涵盖六大智能体环境:通用视觉推理、网页浏览、安全与监控、自动驾驶、体育和数学推理。我们的基准要求智能体在这些多样化环境中将工具使用与明确的逐步决策相结合。此外,我们提出了一个细粒度的步骤级评估框架,用于评估每个推理步骤的正确性和逻辑连贯性以及整个任务中工具使用的有效性。我们的结果表明,即使是最佳性能模型,包括GPT、Gemini和Qwen系列,也难以解决多步视觉任务,全链成功率低于50%。这些发现突显了当前LMM推理和工具使用能力的关键瓶颈,并指出了视觉中心智能体推理模型的未来研究方向。我们的数据和代码公开在https://github.com/mbzuai-oryx/Agent-X。

英文摘要

Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries, limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent- X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available at https://github.com/mbzuai-oryx/Agent-X

2505.14479 2026-05-26 cs.AI cs.CL 版本更新

A Neuro-Symbolic Approach for Reliable Proof Generation with LLMs: A Case Study in Euclidean Geometry

一种用于LLM可靠证明生成的神经符号方法:以欧几里得几何为例

Oren Sultan, Eitan Stern, Dafna Shahaf

发表机构 * The Hebrew University of Jerusalem(特拉维夫大学)

AI总结 提出一种结合LLM生成能力与结构化组件的神经符号方法,通过类比问题检索和形式验证器反馈,显著提升欧几里得几何证明的准确性。

Comments long paper

详情
AI中文摘要

大型语言模型(LLM)在需要严格逻辑推理和符号推理的形式化领域(如数学证明生成)中表现不佳。我们提出一种神经符号方法,结合LLM的生成优势与结构化组件以克服这一挑战。作为概念验证,我们专注于SAT级别的几何问题。我们的方法有两方面:(1)检索类比问题并利用其证明来指导LLM;(2)形式验证器评估生成的证明并提供反馈,帮助模型修正错误证明。我们的方法显著提高了不同模型族的证明准确性,在所有评估模型(OpenAI o1、GPT-5、Gemini-Flash-2.5和Claude Sonnet 4.6)上均取得了显著提升。基础模型的准确率从10%至44%提升至采用我们方法后的68%至96%,其中类比问题指导和验证器反馈均贡献了这些改进。更广泛地说,转向生成可证明正确结论的LLM有望大幅提高其可靠性、准确性和一致性,从而解锁需要可信赖性的复杂任务和关键现实应用。

英文摘要

Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs' generative strengths with structured components to overcome this challenge. As a proof of concept, we focus on SAT-level geometry problems. Our approach is two-fold: (1) We retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. Our method significantly improves proof accuracy across diverse model families, achieving significant gains across all evaluated models: OpenAI o1, GPT-5, Gemini-Flash-2.5, and Claude Sonnet 4.6. Accuracy increases from 10% to 44% for the base models to 68% to 96% with our approach, with both analogous problem guidance and verifier feedback contributing to these improvements. More broadly, shifting to LLMs that generate provably correct conclusions has the potential to dramatically improve their reliability, accuracy and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.

2504.12474 2026-05-26 cs.CL cs.AI 版本更新

Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex

在文本属性图中整合结构信号与语义信号:BiGTex

Azadeh Beiranvand, Seyed Mehdi Vahidipour

发表机构 * Faculty of Electrical and Computer Engineering, University of Kashan(卡尚大学电气与计算机工程学院)

AI总结 提出BiGTex架构,通过堆叠图-文本融合单元实现GNN与LLM的双向注意力,以参数高效微调(LoRA)在节点分类和链接预测任务上达到最优性能。

Comments 26 pages, 4 figures

详情
Journal ref
Machine Learning with Applications 24 (2026) 100921
AI中文摘要

文本属性图(TAGs)在表示学习中提出了独特挑战,要求模型同时捕捉节点关联文本的语义丰富性和图的结构依赖性。图神经网络(GNNs)擅长建模拓扑信息,但缺乏处理非结构化文本的能力。相反,大型语言模型(LLMs)精通文本理解,但通常不了解图结构。在这项工作中,我们提出了BiGTex(双向图文本),一种通过堆叠图-文本融合单元紧密集成GNN和LLM的新型架构。每个单元允许文本和结构表示之间的相互注意力,使信息能够双向流动:文本影响结构,结构指导文本解释。所提出的架构使用参数高效微调(LoRA)进行训练,保持LLM冻结同时适应任务特定信号。在五个基准数据集上的大量实验表明,BiGTex在节点分类中实现了最先进的性能,并有效泛化到链接预测。消融研究进一步强调了软提示和双向注意力在模型成功中的重要性。

英文摘要

Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model's success.

2503.19605 2026-05-26 cs.LG cs.CL math.ST stat.TH 版本更新

Lean Formalization of Generalization Error Bound by Rademacher Complexity and Dudley's Entropy Integral

Rademacher复杂度和Dudley熵积分的泛化误差界的Lean形式化

Sho Sonoda, Kazumi Kasaura, Yuma Mizuno, Kei Tsukamoto, Naoto Onda

发表机构 * RIKEN AIP(日本理化学研究所AIP) CyberAgent Inc.(CyberAgent公司) OMRON SINIC X Corporation(OMRON SINIC X株式会社) University College Cork(科克大学) The University of Tokyo(东京大学)

AI总结 本文在Lean 4中形式化了基于Rademacher复杂度的泛化误差界,通过形式化对称化论证、有界差异分析和McDiarmid不等式,并扩展到可数假设类及可分离拓扑索引集,最后应用得到线性预测器的经验Rademacher界和Dudley熵积分界。

Comments accepted at ITP2026

详情
AI中文摘要

理解和证明机器学习算法的泛化性能——即从训练误差获得测试误差的理论估计——是统计学习理论的核心主题。在用于推导此类保证的众多复杂度度量中,Rademacher复杂度提供了尖锐的、数据相关的界,其适用范围远超经典的VC维理论。在本研究中,我们基于Mathlib库中可用的测度论概率论,在Lean 4中形式化了Rademacher复杂度的泛化误差界。我们的开发提供了一个经过机械检查的流水线,从经验和期望Rademacher复杂度的定义开始,经过形式化的对称化论证和有界差异分析,通过形式化证明的McDiarmid不等式得到高概率一致偏差界。一个关键的技术贡献是可重用机制,通过归约到可数稠密子集,将结果从可数假设类(其中上确界的可测性在Mathlib中直接成立)提升到可分离拓扑索引集。作为抽象定理的工作应用,我们机械化了$\ell_2$和$\ell_1$正则化下线性预测器的标准经验Rademacher界,并且我们还形式化了基于覆盖数和链式构造的Dudley型熵积分界。

英文摘要

Understanding and certifying the generalization performance of machine learning algorithms -- i.e. obtaining theoretical estimates of the test error from the training error -- is a central theme of statistical learning theory. Among the many complexity measures used to derive such guarantees, Rademacher complexity yields sharp, data-dependent bounds that apply well beyond classical VC-dimension theory. In this study, we formalize the generalization error bound by Rademacher complexity in Lean 4, building on measure-theoretic probability theory available in the Mathlib library. Our development provides a mechanically-checked pipeline from the definitions of empirical and expected Rademacher complexity, through a formal symmetrization argument and a bounded-differences analysis, to high-probability uniform deviation bounds via a formally proved McDiarmid inequality. A key technical contribution is a reusable mechanism for lifting results from countable hypothesis classes (where measurability of suprema is straightforward in Mathlib) to separable topological index sets via a reduction to a countable dense subset. As worked applications of the abstract theorem, we mechanize standard empirical Rademacher bounds for linear predictors under $\ell_2$ and $\ell_1$ regularizations, and we also formalize a Dudley-type entropy integral bound based on covering numbers and a chaining construction.

2502.21297 2026-05-26 cs.CL 版本更新

Persuasion Should be Double-Blind: A Multi-Domain Dialogue Dataset With Faithfulness Based on Causal Theory of Mind

说服应该是双盲的:基于因果心智理论的多领域对话数据集

Dingyi Zhang, Linhai Zhang, Fanglei Qu, Ziqing Zhuang, Deyu Zhou

发表机构 * Department of Informatics, King’s College London(信息学院,伦敦国王学院)

AI总结 提出基于因果心智理论的多智能体框架ToMMA构建双盲说服对话数据集CToMPersu,以解决现有数据集信息泄露问题,提升对话真实性和说服力。

Comments 6 pages

详情
AI中文摘要

说服性对话是人类交流的核心,但现有数据集通常依赖单一语言模型生成两个角色,产生违反说服双盲性质的不真实交互。为克服这一问题,我们提出ToMMA,一个由因果心智理论引导的多智能体框架,强制角色分离并防止信息泄露。利用ToMMA,我们构建了CToMPersu,一个大规模多轮、多领域数据集,捕捉真实的说服动态。自动评估显示,CToMPersu比先前数据集产生更连贯和更有说服力的对话。此外,当作为知识库使用时,CToMPersu显著增强了大型语言模型的说服性能,自动评估和人工评估均证实了这一点。

英文摘要

Persuasive dialogue is central to human communication, yet existing datasets often rely on a single language model generating both roles, producing unrealistic interactions that violate the double-blind nature of persuasion. To overcome this, we propose ToMMA, a multi-agent framework guided by causal Theory of Mind that enforces role separation and prevents information leakage. Using ToMMA, we build CToMPersu, a large-scale multi-turn, multi-domain dataset capturing realistic persuasion dynamics. Automatic evaluations show that CToMPersu produces more coherent and persuasive dialogues than prior datasets. Furthermore, when used as a knowledge base, CToMPersu significantly enhances the persuasive performance of large language models, as confirmed by both automatic and human evaluations.

2502.15835 2026-05-26 cs.CL cs.AI cs.SE 版本更新

Pragmatic Reasoning improves LLM Code Generation

语用推理提升LLM代码生成

Zhuchen Cao, Sven Apel, Adish Singla, Vera Demberg

发表机构 * Max Planck Institute for Informatics Saarland Campus(马克斯·普朗克信息研究所萨尔兰州分校) Computer Science Saarland University(萨尔兰州大学计算机科学系) Max Planck Institute for Software Systems Saarland Campus(马克斯·普朗克软件系统研究所萨尔兰州分校)

AI总结 提出CodeRSA方法,通过局部语用竞赛对候选代码进行重排序,以解决自然语言到代码生成中的歧义问题,在多个基准测试中取得最佳平均准确率。

详情
AI中文摘要

语用推理帮助对话者通过考虑共享上下文和反事实替代方案,从模糊或未充分指定的信息中推断出预期含义。自然语言到代码生成中也会出现类似的挑战,因为用户指令通常允许多个合理的候选程序。然而,直接的RSA风格推理是困难的,因为它需要对程序空间和替代指令的大空间进行概率估计。我们提出了CodeRSA,一种受RSA启发的重排序方法,通过对采样代码候选进行局部语用竞赛,使语用推理变得可行。CodeRSA构建候选诱导的替代指令,并估计哪些候选最独特地受到原始指令的支持,从而避免了对整个程序-指令空间的全局归一化。我们在HumanEval+、MBPP+和BigCodeBench上使用四个开放权重的指令跟随模型评估了CodeRSA。在12个模型-基准设置中,CodeRSA在10个设置中取得了最强的平均准确率,并在其余情况下保持竞争力。进一步分析表明,其收益来自于将局部成对语用比较与更广泛的全局支持相结合,这为自然语言不确定性下的语言到代码重排序提供了一个可扩展的方向。

英文摘要

Pragmatic reasoning helps interlocutors infer intended meaning from ambiguous or underspecified messages by considering shared context and counterfactual alternatives. Similar challenges arise in natural language-to-code generation, where user instructions often admit multiple plausible candidate programs. However, direct RSA-style inference is difficult because it requires probability estimation over large spaces of programs and alternative instructions. We propose CodeRSA, an RSA-motivated reranking method that makes pragmatic reasoning tractable through local pragmatic contests among sampled code candidates. CodeRSA constructs candidate-induced alternative instructions and estimates which candidates are most distinctively supported by the original instruction, avoiding global normalization over the full program-instruction space. We evaluate CodeRSA on HumanEval+, MBPP+, and BigCodeBench using four open-weight instruction-following models. CodeRSA achieves the strongest average accuracy in 10 of 12 model-benchmark settings and remains competitive in the remaining cases. Further analyses show that its gains come from combining local pairwise pragmatic comparison with broader global support, suggesting a scalable direction for language-to-code reranking under natural-language uncertainty.

2410.01648 2026-05-26 cs.CL 版本更新

DeIDClinic: A Risk-Aware Pseudonymization Framework for Clinical Text De-identification and Re-identification Risk Assessment

DeIDClinic:面向临床文本去标识化和重识别风险评估的风险感知假名化框架

Angel Paul, Dhivin Shaji, Lifeng Han, Warren Del-Pinto, Goran Nenadic, Suzan Verberne

发表机构 * University of Manchester(曼彻斯特大学) Leiden Institute of Advanced Computer Science (LIACS)(莱顿先进计算机科学研究所) Leiden University(莱顿大学) Biomedical Data Sciences, Leiden University Medical Center(莱顿大学医学中心生物医学数据科学)

AI总结 提出DeIDClinic多层框架,集成领域自适应变换器模型(BioBERT、ClinicalBERT)和文档级风险评估模块(k-匿名、l-多样性、t-接近度等),在i2b2 2014数据集上实现高F1分数,支持隐私保护数据共享。

Comments Accepted by and Presented at: LEGAL-CALD-Pseudo2026 @LREC2026

详情
AI中文摘要

敏感文本数据的日益增多产生了对鲁棒去标识化方法的迫切需求,这些方法需在保持下游实用性的同时实现合规数据共享。本文提出DeID-Clinic,一个用于临床自由文本数据自动假名化和重识别风险评估的多层框架。我们的方法将领域自适应变换器模型(包括BioBERT和ClinicalBERT)集成到MASK去标识化框架中,以改进受保护健康信息(PHI)的检测和掩码。除了实体识别,我们引入了一个新颖的文档级风险评估模块,该模块结合k-匿名、l-多样性、t-接近度、上下文相似性和实体共现分析来量化残余重识别风险。在i2b2 2014去标识化数据集上进行的实验展示了强劲性能,多个实体类别的宏观F1分数超过0.96,同时能够对高风险文档进行定量优先级排序以便进一步审查。我们的结果突显了将神经去标识化与显式风险建模相结合的有效性,支持敏感领域的隐私保护数据共享。尽管在临床文本上评估,所提出的框架可推广到其他隐私关键领域,如法律和行政文档,其中可靠的假名化和风险感知匿名化至关重要。关键词:自动去标识化、风险评估、患者隐私、假名化、个人健康信息。

英文摘要

The increasing availability of sensitive textual data has created an urgent need for robust de-identification methods that enable compliant data sharing while preserving downstream utility. This paper presents DeID-Clinic, a multi-layered framework for automated pseudonymization and re-identification risk assessment of clinical free-text data. Our approach integrates domain-adapted transformer models, including BioBERT and ClinicalBERT, into the MASK de-identification framework to improve the detection and masking of protected health information (PHI). Beyond entity recognition, we introduce a novel document-level risk assessment module that quantifies residual re-identification risk using a combination of k-anonymity, l-diversity, t-closeness, contextual similarity, and entity co-occurrence analysis. Experiments conducted on the i2b2 2014 de-identification dataset demonstrate strong performance, achieving macro-level F1 scores above 0.96 for several entity categories, while enabling quantitative prioritization of high-risk documents for further review. Our results highlight the effectiveness of combining neural de-identification with explicit risk modeling, supporting privacy-preserving data sharing in sensitive domains. Although evaluated on clinical text, the proposed framework is generalizable to other privacy-critical domains such as legal and administrative documents, where reliable pseudonymization and risk-aware anonymization are essential. Keywords{Automated De-Identification, Risk Assessment, Patient Privacy, Pseudonymization, Personal Health Information}

2605.24524 2026-05-26 cs.LG cs.CL q-bio.NC 版本更新

What Are We Actually Decoding? Source Attribution for Non-Invasive Brain-to-Language Retrieval

我们究竟在解码什么?非侵入式脑到语言检索的源归因

Xinyu Zhang, Sichao Liu, Runhao Lu, Alexandra Woolgar, Lihui Wang

发表机构 * KTH(瑞典皇家理工学院) University of Cambridge(剑桥大学) EPFL(苏黎世联邦理工学院) Karolinska Institutet(Karolinska研究所) McGill University(麦吉尔大学)

AI总结 针对非侵入式神经语言解码中结果被非刺激诱发源(如解码器先验、嵌入度量、信号时长等)膨胀的问题,提出一个审计框架,通过结构捷径、窗口级刺激锁定证据和跨窗口上下文聚合三种源分离,并引入组上下文偏差(GCB)作为可控的源归因干预,实现性能的源归因而非仅报告。

Comments 35 pages, 7 figures, 25 tables

详情
AI中文摘要

在非侵入式神经语言解码中,结果可能被非刺激诱发的神经证据源膨胀:解码器先验、基于嵌入的度量以及非神经结构干扰(如信号时长)。因此,方法学挑战在于归因:当报告的性能提升可以追溯到特定源时,它才更具信息性。我们将刺激锁定的MEG到音频检索重新构建为一个审计框架,将表观性能分离为三个源——结构捷径、窗口级刺激锁定证据和跨窗口上下文聚合——并为每个源提供诊断。在变长解码下,信号盲的高斯噪声达到66.3%的Rank@1(R@1),但一旦强制执行固定时长窗口和刺激身份分割,其性能骤降至接近随机,从而隔离了结构泄漏。在这些控制下,固定窗口检索恢复了可测量的MEG-音频可区分性,而一个神谕句子桶诊断显示,95.7%的Top-1错误选择了错误的句子,将剩余瓶颈定位到句子级竞争。我们使用组上下文偏差(GCB)审计这一上下文源,这是一种推理时的加性logit偏差,它跨窗口汇集句子一致的证据,同时保持基础检索分数和候选池固定。作为分数空间干预,GCB使上下文源变得可测量:在相同固定设置下,Gwilliams上的R@1从44%变为52%,MOUS上从22%变为29%。在此设计下,GCB是可审计的:其效应在随机分组扰动下崩溃,并在局部证据在MEG中衰减或在EEG中接近随机时消失,支持其作为受控源归因干预的使用。这些结果表明,脑到语言性能应进行源归因,而不仅仅是报告。

英文摘要

In non-invasive neural language decoding, results can be inflated by sources that are not stimulus-evoked neural evidence: decoder priors, embedding-based metrics, and non-neural structural nuisances such as signal duration. The methodological challenge is therefore attribution: a reported gain is more informative when it can be traced to a specific source. We recast stimulus-locked MEG-to-audio retrieval as an auditing framework that separates apparent performance into three sources - structural shortcuts, window-level stimulus-locked evidence, and cross-window contextual aggregation - and provides a diagnostic for each. Signal-blind Gaussian noise reaches 66.3% Rank@1 (R@1) under variable-length decoding but collapses to near chance once fixed-duration windows and stimulus-identity splits are enforced, isolating structural leakage. Under these controls, fixed-window retrieval recovers measurable MEG-audio discriminability, while an oracle sentence-bucket diagnostic shows that 95.7% of Top-1 errors select the wrong sentence, localising the residual bottleneck to sentence-level competition. We audit this contextual source with Group Context Bias (GCB), an inference-time additive logit bias that pools sentence-consistent evidence across windows while leaving the base retrieval scores and candidate pool fixed. Used as a score-space intervention, GCB makes the contextual source measurable: R@1 shifts from 44% to 52% on Gwilliams and from 22% to 29% on MOUS under the same fixed setting. GCB is auditable under this design: its effect collapses under random-grouping perturbations and vanishes when local evidence is attenuated in MEG or is near chance in EEG, supporting its use as a controlled source-attribution intervention. These results suggest that brain-to-language performance should be source-attributed, not merely reported.

2605.24523 2026-05-26 cs.LG cs.CL q-bio.NC 版本更新

MindAlign: Bridging EEG, Vision, and Language for Zero-Shot Visual Decoding

MindAlign: 弥合脑电图、视觉和语言实现零样本视觉解码

Zexuan Chen, Sichao Liu, Runhao Lu, Huichao Qi, Alexandra Woolgar, Xi Vincent Wang, Lihui Wang

发表机构 * KTH, SWeden(瑞典皇家理工学院) University of Cambridge, UK(剑桥大学) EPFL, Switzerland(瑞士联邦理工学院) McGill University, Canada(麦吉尔大学) Karolinska Institutet, Sweden(卡罗林斯卡研究所)

AI总结 提出一种三模态对比学习框架MindAlign,通过对齐脑电图、图像和文本表示,在Things-EEG2零样本基准上实现54.1% Top-1和83.4% Top-5准确率,显著超越先前方法。

Comments 20 pages, 10 figures, 15 tables

详情
AI中文摘要

从大脑信号进行视觉解码是计算机视觉和神经科学交叉领域的关键挑战,需要连接神经表征和视觉计算模型的方法。我们提出了一种基于脑电图的视觉解码三模态对比框架,在统一潜在空间中对齐脑电图、视觉和文本表示。我们的方法采用两阶段设计。首先,我们通过无标签试次上的掩码重建预训练脑电图编码器,学习可稳健迁移到下游任务的时空规律。其次,我们通过对比学习联合对齐脑电图、图像和大语言模型生成的文本描述,其中文本监督作为语义正则化器,向共享空间注入语言结构,而不压倒主要的脑电图-图像信号。编码器集成了被试自适应、通道上的图注意力和时空卷积嵌入。在Things-EEG2 200路零样本基准上,我们的框架实现了54.1%的Top-1和83.4%的Top-5准确率,大幅超过最强先前基线(32.4%/64.0%),配对Wilcoxon检验证实所有被试内基线的显著性(p<0.01)。我们在Things-MEG上验证了泛化性。分析表明,紧凑的嵌入几何(CN-CLIP)优于更大的骨干网络,且解码与视觉处理的既定神经生理学一致。这项工作是从非侵入性时间神经信号进行稳健、语义基础视觉解码的关键一步。源代码公开于https://github.com/anon-eeg/eeg_image_decoding。

英文摘要

Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. We introduce a tri-modal contrastive framework for EEG-based visual decoding that aligns EEG, visual, and textual representations within a unified latent space. Our approach follows a two-stage design. First, we pre-train an EEG encoder via masked reconstruction on unlabeled trials, learning spatio-temporal regularities that transfer robustly to downstream tasks. Second, we jointly align EEG, image, and LLM-generated textual descriptions through contrastive learning, where text supervision acts as a semantic regularizer that injects linguistic structure into the shared space without overwhelming the primary EEG-image signal. The encoder integrates subject-specific adaptation, graph-attention over channels, and temporal-spatial convolutional embeddings. On the Things-EEG2 200-way zero-shot benchmark, our framework achieves 54.1% Top-1 and 83.4% Top-5 accuracy, substantially exceeding the strongest prior baseline (32.4% / 64.0%), with paired Wilcoxon tests confirming significance (p < 0.01) over all in-subject baselines. We validate generalization on Things-MEG. Analysis reveals that compact embedding geometries (CN-CLIP) outperform much larger backbones, and that decoding aligns with established neurophysiology of visual processing. This work is a critical step towards robust, semantically-grounded visual decoding from non-invasive temporal neural signals. The source code is publicly available in https://github.com/anon-eeg/eeg_image_decoding.

2605.24518 2026-05-26 cs.CL cs.AI 版本更新

Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

语法引导的稀疏注意力:高效且可解释的Transformer

Spandan Pratyush

发表机构 * Independent Researcher(独立研究者)

AI总结 提出语法引导的稀疏注意力方法,通过词性标签动态生成注意力掩码,在保持精度的同时降低计算复杂度。

Comments 9 pages, 2 tables Code available at https://github.com/toughthinktank/grammatically_guided_attention#

详情
AI中文摘要

Transformer模型中自注意力的二次复杂度仍然是处理长序列和高效部署大型语言模型的主要瓶颈。为此,已有大量关于稀疏注意力的研究,Deepseek稀疏注意力结合了多种创建令牌片段的方法以降低时间复杂度。本文提出了一种新颖的方法——语法引导的稀疏注意力,它基于令牌的语法角色约束注意力计算。通过利用词性(POS)标签,动态生成注意力掩码,强制令牌之间建立语言上连贯的连接,从而在不牺牲必要语言依赖性的情况下减少计算图。提出并评估了两种掩码策略:硬掩码严格只允许预定义的语法交互,软掩码则将注意力偏向这些交互。使用类似DistilBERT的架构在SST-2情感分类任务上进行的实验表明,语法引导的稀疏注意力在保持与全注意力相当的精度的同时,显著降低了理论计算开销。初步结果显示,硬掩码的准确率为0.8200,软掩码为0.8165,与全注意力的0.8200非常接近,为构建更高效、可解释且具有语言知识的Transformer架构提供了途径。

英文摘要

The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. For this approach, there has been significant research into Sparse Attention, and Deepseek Sparse Attention has combined various methods of creating segments of tokens to reduce the time complexity. This paper introduces a novel approach, Grammatically-Guided Sparse Attention, which constrains attention computations based on the grammatical roles of tokens. By leveraging Parts-of-Speech (POS) tags, attention masks are dynamically generated that enforce linguistically coherent connections between tokens, reducing the computational graph without sacrificing essential linguistic dependencies. Two masking strategies are proposed and evaluated: a hard mask that strictly allows only predefined grammatical interactions, and a soft mask that biases attention towards these interactions. The experiments, conducted on the SST-2 sentiment classification task using a DistilBERT-like architecture, demonstrate that Grammatically-Guided Sparse Attention maintains comparable accuracy to full attention while significantly reducing the theoretical computational overhead. Preliminary results show accuracy values of 0.8200 for hard masking and 0.8165 for soft masking, closely matching the 0.8200 of full attention, providing a path towards more efficient, interpretable, and linguistically-informed Transformer architectures.

2605.24517 2026-05-26 cs.LG cs.CL 版本更新

ECHO: Terminal Agents Learn World Models for Free

ECHO: 终端代理免费学习世界模型

Vaishnavi Shrivastava, Piero Kauffmann, Ahmed Awadallah, Dimitris Papailiopoulos

发表机构 * Microsoft Research(微软研究院)

AI总结 提出ECHO混合目标,通过预测环境观测令牌将终端反馈转化为密集监督信号,显著提升CLI代理在TerminalBench-2.0上的性能。

详情
AI中文摘要

CLI代理是语言模型最接近具身环境的设置:模型发出命令,终端执行它们,返回的流——stdout、错误、文件、日志和跟踪——记录了后果。我们认为这个流是一个监督信号,但标准的代理强化学习丢弃了它:GRPO风格的训练使用稀疏的结果级奖励更新动作令牌,而忽略了rollout中已有的环境响应。失败的rollout尽管包含关于环境如何响应的丰富证据,但提供的策略梯度信号很少。我们引入了ECHO(环境交叉熵混合目标),这是一种混合目标,它将动作令牌上的标准策略梯度损失与辅助损失相结合,该辅助损失训练策略预测其自身动作产生的环境观测令牌。ECHO重用与GRPO相同的前向传播,不需要额外的rollout,并将终端反馈转化为所有rollout的密集监督。ECHO在TerminalBench-2.0上将GRPO的pass@1翻倍:Qwen3-8B从2.70%提升到5.17%,Qwen3-14B从5.17%提升到10.79%。ECHO还产生了更好地预测终端动态的策略,即使是在它们未生成的轨迹上:在保留的rollout中,它显著降低了环境令牌的交叉熵,而单独的GRPO几乎没有改变。从基础Qwen3-8B开始,ECHO在没有专家演示的情况下,在保留的终端任务上匹配了专家SFT然后GRPO的性能,并在TerminalBench-2.0上恢复了大专家SFT初始化收益的一半。在某些设置中,仅环境预测损失就能实现无验证器的自我改进,使策略仅通过与环境交互就能在未见过的OOD任务上改进。这些结果表明,环境观测不仅是未来动作的上下文,而且是每个rollout中已经存在的密集、在策略的监督信号。

英文摘要

CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream -- stdout, errors, files, logs, and traces -- records the consequences. We argue that this stream is a supervision signal, but standard agent RL discards it: GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring environment responses already in the rollout. Failed rollouts provide little policy-gradient signal despite containing rich evidence about how the environment responds. We introduce ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held-out rollouts, it sharply reduces environment-token cross-entropy while GRPO alone barely changes it. From base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks without expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen OOD tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.

2605.24486 2026-05-26 cs.AI cs.CL 版本更新

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

AgentFugue:通过集体推理实现长时域任务的智能体扩展

Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Tong Zhao, Xiaoxi Li, Zheng Liu, Zhicheng Dou

发表机构 * GSAI, Renmin University of China(GSAI,中国人民大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 提出AgentFugue框架,通过共享推理中心实现多个对等智能体并行探索和选择性信息共享,无需显式角色分工或工作流编排,从而提升长时域任务性能。

详情
AI中文摘要

近期长时域智能体任务的进展主要通过更强模型、更好工具和更有效脚手架来扩展单个智能体。相比之下,对于扩展(scaling out)的理解要少得多:多个对等智能体,都针对同一任务,能否在不依赖显式角色分工或工作流编排的情况下成为额外能力来源?我们研究这个问题并提出AgentFugue,一个围绕共享推理中心构建的集体推理框架。当对等智能体并行探索同一任务时,中心记录每个智能体已建立、尝试或排除的简明笔记,并使每个智能体能够以对其当前搜索有用的形式选择性访问其他智能体的发现。这种设计将原本孤立的轨迹转变为可重用中间推理的互联生态,无需集中规划。我们将中心实例化为一个即插即用的通信层,使用监督微调和端到端强化学习进行训练。在我们研究的具有挑战性的长时域设置中,AgentFugue优于强基线。我们的结果表明,集体推理可以将对等智能体系统的扩展转变为能力增益的独特来源,而不仅仅是消耗更多计算的方式。

英文摘要

Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more effective scaffolding. In contrast, much less is understood about scaling out: whether multiple peer agents, all targeting the same task, can become an additional source of capability without relying on explicit role specialization or workflow orchestration. We study this question and propose AgentFugue, a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out, and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This design turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. We instantiate the hub as a plug-in communication layer, trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings we study, AgentFugue improves over strong baselines. Our results suggest that collective reasoning can turn scaling out peer agent systems into a distinct source of capability gains, rather than merely a way of spending more compute.

2605.24454 2026-05-26 cs.CL 版本更新

Decompose-and-Refine: Structured Legal Question Answering with Parametric Retrieval

分解与精炼:基于参数化检索的结构化法律问答

Jihyung lee, Hyounghun Kim, Gary Lee

发表机构 * Graduate School of Artificial Intelligence, POSTECH, Republic of Korea(延世大学人工智能研究生院,韩国POSTECH) Department of Computer Science and Engineering, POSTECH, Republic of Korea(POSTECH计算机科学与工程系,韩国POSTECH)

AI总结 提出Decompose-and-Refine (DaR)框架,通过逐步分解复杂法律问题为原子子问题并生成与法规对齐的参数化查询,以解决多跳法律问答中的检索准确性和幻觉问题。

详情
AI中文摘要

大型语言模型(LLMs)在法律领域表现出强大性能,在法律问答(LQA)中展现出显著潜力。然而,与通用问答不同,LQA要求答案不仅准确,而且严格基于明确的法律权威。在成文法LQA中,许多问题需要跨多个法律问题的多跳推理,大大增加了幻觉风险,因此准确检索支持性法规条款成为关键前提。尽管多跳问答近期取得进展,现有方法通常依赖自然语言推理或无需显式查询重构的检索,导致用户问题与法规文本之间的词汇差距未得到充分解决。为应对这一挑战,我们提出Decompose-and-Refine(DaR),一种基于法规的LQA框架,它将逐步的问题分解与基于参数化知识的查询精炼紧密结合。DaR逐步将复杂法律问题分解为原子子问题,并为每个子问题生成与法规对齐的参数化查询,从而能够为每个法律问题选择最核心的单一法规条款。我们在基于成文法的韩语多跳LQA基准KoBLEX上,使用Qwen3-32B和Gemma3-27B评估DaR。实验结果表明,DaR在检索准确性和最终答案质量上均持续优于现有方法。此外,通过显式分离子问题及其对应法规条款,DaR促进了复杂法律推理过程的透明、逐问题验证。

英文摘要

Large language models (LLMs) have shown strong performance in the legal domain, demonstrating notable potential in Legal Question Answering (LQA). However, unlike general QA, LQA requires answers that are not only accurate but also rigorously grounded in explicit legal authority. In statutory LQA, many questions require multi-hop reasoning across multiple legal issues, substantially increasing the risk of hallucination, thereby making accurate retrieval of supporting statutory provisions a critical prerequisite. Despite recent progress in multi-hop QA, existing approaches often rely on reasoning in natural language or retrieval without explicit query reformulation, leaving the vocabulary gap between user questions and statutory text largely unaddressed. To address this challenge, we propose Decompose-and-Refine (DaR), a statute-grounded LQA framework that tightly integrates step-wise question decomposition with parametric knowledge-based query refinement. DaR progressively decomposes a complex legal question into atomic sub-questions and generates statute-aligned parametric queries for each sub-question, enabling the selection of a single most central statutory provision corresponding to each legal issue. We evaluate DaR on KoBLEX, a Korean multi-hop LQA benchmark grounded in statutory law, using Qwen3-32B and Gemma3-27B. Experimental results demonstrate that DaR consistently improves both retrieval accuracy and final answer quality over existing approaches. Moreover, by explicitly separating sub-questions and their corresponding statutory provisions, DaR facilitates transparent, issue-level verification of complex legal reasoning processes.

2605.24452 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Temporal Concept Drift in Legal Judgment Prediction: Neural Baselines Across Three Epochs of Ukrainian Court Decisions

法律判决预测中的时间概念漂移:跨越乌克兰法院判决三个时期的神经基线

Volodymyr Ovcharov

AI总结 通过微调四种Transformer编码器在乌克兰法院三个时期(战前、混合战争、全面入侵)的判决上,研究法律语言的时间漂移,发现前向性能严重下降(最多27.2个百分点),法律领域预训练不能提升绝对性能但能减轻漂移,时序持续学习可消除灾难性遗忘。

Comments 17 pages, 6 tables, 5 figures. Dataset: https://huggingface.co/datasets/overthelex/ukrainian-court-decisions

详情
AI中文摘要

法律NLP基准测试在随机分割的数据上评估模型,隐含假设法律语言是平稳的。我们通过微调四种Transformer编码器——XLM-RoBERTa(base和large)及其法律领域变体——在地缘政治事件定义的三个时间时期的乌克兰法院判决上测试这一假设:战前(2008-2013)、混合战争(2014-2021)和全面入侵(2022-2026)。每个模型在一个时期上训练,并在所有三个时期上评估,产生一个3x3的跨时间泛化矩阵。四个发现出现。(1)前向退化严重:在战前数据上训练的模型应用于全面入侵时期判决时,宏F1最多下降27.2个百分点。(2)退化不对称:后向迁移(全面入侵到战前)比前向迁移稳健得多,与法律语言是加性的假设一致。(3)法律领域预训练(Legal-XLM-R)不提升绝对性能,但减少前向退化的幅度和不对称性。(4)时序持续学习消除了通用XLM-R的灾难性遗忘:战前知识完全保留(+1.8至+6.2个百分点),而全面入侵性能提升+16.5至+19.0个百分点;逆时序训练导致严重遗忘。跨司法管辖区在瑞士判决预测数据上的预训练提升绝对性能,但不减少时间退化幅度,确认时间漂移是法律语言演化的内在属性。数据集(三个时期共428K判决)作为LEXTREME贡献公开可用。

英文摘要

Legal NLP benchmarks evaluate models on randomly split data, implicitly assuming that legal language is stationary. We test this assumption by fine-tuning four transformer encoders -- XLM-RoBERTa (base and large) and their legal-domain variants -- on Ukrainian court decisions from three temporal epochs defined by geopolitical disruptions: pre-war (2008-2013), hybrid war (2014-2021), and full-scale invasion (2022-2026). Each model is trained on one epoch and evaluated on all three, producing a 3x3 cross-temporal generalization matrix. Four findings emerge. (1) Forward degradation is severe: models trained on pre-war data lose up to 27.2 percentage points of macro-F1 when applied to full-scale invasion era decisions. (2) The degradation is asymmetric: backward transfer (full-scale to pre-war) is substantially more robust than forward transfer, consistent with the hypothesis that legal language is additive. (3) Legal-domain pretraining (Legal-XLM-R) does not improve absolute performance but reduces forward degradation magnitude and asymmetry. (4) Chronological continual learning eliminates catastrophic forgetting for general XLM-R: pre-war knowledge is fully retained (+1.8 to +6.2 pp) while full-scale performance gains +16.5 to +19.0 pp; reverse-chronological training causes severe forgetting. Cross-jurisdictional pretraining on Swiss Judgment Prediction data improves absolute performance but does not reduce temporal degradation magnitude, confirming that temporal drift is an intrinsic property of legal language evolution. The dataset (428K decisions across three epochs) is publicly available as a LEXTREME contribution.

2605.24451 2026-05-26 cs.CL 版本更新

Phonetic Modeling of Dialectal Variation in Vietnamese Speech

越南语音中方言变体的语音建模

Quan Ngoc Hoang, Long Hoang Huu Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

发表机构 * Faculty of Computer Science(计算机科学系) Faculty of Information Science and Engineering(信息科学与工程系) University of Information Technology(信息技术大学) Vietnam National University, Ho Chi Minh city(越南国家大学,胡志明市)

AI总结 提出一种方言感知的语音框架,通过结构化语音成分和方言特定IPA映射,在词汇和解码层面显式建模越南语方言变体,在UIT-ViMD数据集上以更少参数和无外部预训练达到与最强预训练模型相当的性能。

详情
AI中文摘要

越南语在北部、中部和南部地区表现出显著的方言语音变体,其中相同的词汇项可能以明显不同的发音实现。这种变体给自动语音识别(ASR)带来了挑战,并且由于越南语正字法与音系之间的复杂关系,在计算上仍然难以建模。现有方法通常在词汇层面处理方言变异性,假设拼写与发音之间的方言不变映射,这限制了它们捕捉系统性语音差异的能力。我们提出了一种方言感知的语音框架,在词汇和解码层面显式建模越南语音系结构和方言变体。该框架引入了一个语音词汇表,将每个音节分解为结构化的语音成分,并将它们映射到方言特定的IPA表示,同时结合一个语音结构解码器联合预测这些成分。在UIT-ViMD(越南语中唯一可用的多方言数据集)上的实验表明,所提出的方法优于各种预训练基线, extbf{尤其在使用更少参数且无需外部预训练的情况下,跨方言匹配了最强预训练模型wav2vec2-base-vi-250h的性能}。为便于实验复现,代码将在论文被接收后公开。

英文摘要

Vietnamese exhibits substantial dialectal phonetic variation across Northern, Central, and Southern regions, where identical lexical items may be realized with markedly different pronunciations. Such variation poses challenges for automatic speech recognition (ASR) and remains difficult to model computationally due to the complex relationship between Vietnamese orthography and phonology. Existing approaches typically address dialect variability at the word level, assuming dialect-invariant mappings between spelling and pronunciation, which limits their ability to capture systematic phonetic differences. We propose a dialect-aware phonetic framework that explicitly models Vietnamese phonological structure and dialectal variation at both the vocabulary and decoding levels. The framework introduces a phonetic vocabulary that decomposes each syllable into structured phonetic components and maps them to dialect-specific IPA representations, together with a phonetic-structure decoder that jointly predicts these components. Experiments on the UIT-ViMD, a only-available dataset for multi-dialect in Vietnamese, show that the proposed approach outperforms various pre-trained baselines, \textbf{especially matches the performance of the strongest pretrained wav2ve2-base-vi-250h} across dialects while \textbf{using substantially fewer parameters and no external pretraining}. Code for experimental reproducibility will be publicly available upon the acceptance of this paper.

2605.24432 2026-05-26 cs.CL 版本更新

Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap

在对话中发现:LLMs 自我学习弥合多轮差距

Tianlang Chen, Shirley Wu, Jure Leskovec

发表机构 * Stanford(斯坦福大学)

AI总结 提出 Found in Conversation (FiC) 框架,通过视图非对称自蒸馏方法,让模型从单轮视角向多轮视角迁移能力,显著缩小多轮对话与单轮性能的差距。

Comments 17 pages, 3 figures, 6 tables

详情
AI中文摘要

大型语言模型(LLM)的交互通常是不明确的,用户需要在多个对话轮次中澄清所有必要的细节。然而,最近的研究表明,LLM 在这种多轮设置中的表现远不如在单轮中同时获得相同信息时的表现,这一现象被称为“Lost-in-Conversation”。然而,有效弥合这一差距仍然是一个开放问题。在这里,我们引入了 Found in Conversation (FiC),一个训练框架,其中模型自我学习在给定不明确的多轮提示时找到并恢复其单轮能力。我们开发了视图非对称自蒸馏方法,该方法在同一任务信息的两个视图之间进行蒸馏——教师视角为单轮视图,学生视角为多轮视图——将强大的单轮行为转移到弱的多轮行为中。这不需要更强的外部教师,因为即使是前沿的 LLM 也存在这种差距。在多个模型家族(Llama、Qwen、Phi 和 OLMo)和规模(3B-14B)上,FiC 恢复了至少 92% 的单轮性能,并在两个 Llama 骨干上达到了 100%,从而在保持单轮能力的同时实现了更高效、更有帮助的多轮对话。

英文摘要

Large Language Model (LLM) interactions are typically underspecified, with users clarifying all necessary details across multiple conversational turns. Yet recent work shows that LLMs perform far worse in this multi-turn setting than in a single turn with same information being available at once, a phenomenon termed "Lost-in-Conversation." However, bridging this gap effectively remains an open problem. Here we introduce Found in Conversation (FiC), a training framework where a model teaches itself to find and recover its single-turn competence given underspecified multi-turn prompts. We develop View-Asymmetric Self-Distillation, which distills across two views of the same task information--single-turn view for the teacher, multi-turn view for the student--transferring strong single-turn behavior into weak multi-turn behavior. This requires no stronger external teacher, which is unavailable as even frontier LLMs exhibit this gap. Across model families (Llama, Qwen, Phi, and OLMo) and sizes (3B-14B), FiC recovers at least 92% of single-turn performance and reaches 100% on two Llama backbones, yielding more efficient and helpful multi-turn conversations with single-turn capabilities intact.

2605.24426 2026-05-26 cs.CL 版本更新

SEAL: Synergistic Co-Evolution of Agents and Learning Environments

SEAL: 智能体与学习环境的协同共演化

Yihao Hu, Zhihao Wen, Xiujin Liu, Pan Wang, Xin Zhang, Wei Wu

发表机构 * Ant Group(蚂蚁集团) Westlake University(西湖大学) University of Michigan--Ann Arbor(密歇根大学安娜堡分校) University of Science and Technology of China(中国科学技术大学)

AI总结 提出SEAL框架,通过协同演化智能体策略与训练环境,解决智能体与环境错配问题,在低资源多轮工具使用任务中提升性能并实现正向分布外迁移。

详情
AI中文摘要

大型语言模型(LLM)智能体通过交互不断改进,然而大多数自我演化方法孤立地调整策略或学习环境。我们识别出这一结构性问题为“智能体-环境错配”:智能体的能力边界在训练过程中变化,而提供监督的环境保持静态或仅与智能体暴露的失败弱耦合。我们提出SEAL,一个用于交互式工具使用智能体的闭环共演化框架。SEAL在可执行验证下收集在线策略轨迹,将失败轨迹诊断成回合级失败标签,并将这些诊断作为环境侧适应和模型侧策略优化的共享信号。环境通过暴露更清晰的工具能力线索、约束信息和面向恢复的反馈来演化其训练时的学习接口,而策略则通过诊断引导的优势加权进行更新。在分布内和分布外多轮工具使用评估中的大量实验表明,SEAL改进了低资源智能体学习:仅使用400个训练样本,它在三个骨干网络上取得了+8.25到+26.25的平均分提升,并表现出正向的分布外迁移。这些结果证明了联合调整学习器及其训练时学习基质对于鲁棒的自我改进LLM智能体的价值。

英文摘要

Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as \emph{Agent-Environment Misalignment}: the agent's capability frontier changes during training, while the environment that provides supervision remains static or only weakly coupled to the agent's revealed failures. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Extensive experiments across in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields +8.25 to +26.25 average-point gains across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.

2605.24425 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Momentum Streams for Optimizer-Inspired Transformers

动量流:优化器启发的Transformer

Jingchu Gai, Nai-Chieh Huang, Jiayun Wu

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出一类优化器启发的Transformer(如三重动量TMMFormer),通过将残差更新解释为优化器步骤,发现动量是性能提升的关键,能收敛到更平坦的极小值,减少遗忘并改善泛化。

详情
AI中文摘要

预归一化Transformer层的残差更新可以被解释为对代理token能量执行一阶优化器的一步,其中注意力和MLP子层充当梯度预言。基于这一观察,我们构建了一族优化器启发的Transformer(三重动量、Adam/AdamW、Muon、SOAP),并在匹配计算量下进行比较。在我们的主要预训练实验中,三重动量TMMFormer取得了最低的验证损失,优于普通Transformer和先前的架构变体。受控消融实验和支持理论表明,动量(而非预条件)是增益的主要来源。我们进一步证明,TMMFormer和其他基于动量的设计比普通Transformer收敛到更平坦的极小值,这导致更少的遗忘和更好的泛化。

英文摘要

The residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate token energy, wherein the attention and MLP sublayers function as gradient oracles. Based on this observation, we build a family of optimizer-inspired Transformers (triple-momentum, Adam/AdamW, Muon, SOAP) and compare them under matched compute. In our main pretraining experiment, the triple-momentum TMMFormer achieves the lowest validation loss, outperforming the vanilla Transformer and prior architectural variants. A controlled ablation and supporting theory show that momentum, not preconditioning, is the main source of the gain. We further show that TMMFormer and other momentum-based designs reach flatter minima than the vanilla Transformer, which leads to less forgetting and better generalization.

2605.24371 2026-05-26 cs.CV cs.CL 版本更新

SliceWorld: A Predictive and Controllable World-State Model for CT Report Generation

SliceWorld: 一种用于CT报告生成的预测性和可控世界状态模型

Yuanhe Tian, Yan Song

发表机构 * Zhongguancun Academy(中关村学院) University of Science and Technology of China(中国科学技术大学)

AI总结 提出SliceWorld世界状态框架,通过编码CT切片序列为因子感知的潜在状态,实现未来切片预测、病变因子干预和LLM报告生成,在M3D-Cap和CT-RATE上提升NLG指标和临床评估。

Comments 18 pages, 5 figures

详情
AI中文摘要

CT报告生成(CTRG)要求模型从数百个轴向切片中总结三维解剖背景和病理发现。现有方法通常学习直接的图像到文本映射,缺乏对CT证据如何跨切片演变或报告如何响应潜在病变相关因素受控变化的建模机制。我们提出SliceWorld,一个CT特定的世界状态框架,将轴向CT扫描视为沿z轴的有序序列。SliceWorld将前缀CT证据编码为包含解剖、病变和不确定性成分的因子感知潜在状态,并将这些状态投影到用于多步未来切片特征预测、病变因子干预和基于LLM的报告生成的世界令牌中。该模型首先在CT切片序列上使用预测性、因子感知和反事实目标进行预训练,然后在配对的CT报告数据上进行微调。在M3D-Cap和CT-RATE上的实验表明,SliceWorld改善了自然语言生成指标和临床导向的自动评估。进一步分析展示了多视野未来切片预测、可测量的因子对齐、减少切片的鲁棒性以及选择性病变敏感的报告调制。

英文摘要

CT report generation (CTRG) requires models to summarize three-dimensional anatomical context and pathological findings from hundreds of axial slices. Existing methods typically learn a direct image-to-text mapping, providing limited mechanisms for modeling how CT evidence evolves across slices or how reports respond to controlled changes in latent lesion-related factors. We propose SliceWorld, a CT-specific world-state framework that treats an axial CT scan as an ordered sequence along the z-axis. SliceWorld encodes prefix CT evidence into factor-aware latent states containing anatomy, lesion, and uncertainty components, and projects these states into world tokens used for multi-step future-slice feature prediction, lesion-factor intervention, and LLM-based report generation. The model is first pretrained on CT slice sequences with predictive, factor-aware, and counterfactual objectives, and is then fine-tuned on paired CT-report data. Experiments on M3D-Cap and CT-RATE show that SliceWorld improves natural language generation metrics and clinically oriented automatic evaluation. Further analyses demonstrate multi-horizon future-slice prediction, measurable factor alignment, reduced-slice robustness, and selective lesion-sensitive report modulation.

2605.24366 2026-05-26 cs.CL cs.LG 版本更新

Structure-Aware RAG: Structured Retrieval Augmented Generation from Noisy Data for Conversational Agents

结构感知检索增强生成:面向对话代理的噪声数据结构化检索增强生成

Kaiqiao Han, LuAn Tang, Renliang Sun, Peng Yuan, Wei Cheng, Haoyu Wang, Wei Wang, Yizhou Sun, Haifeng Chen

发表机构 * UCLA(加州大学洛杉矶分校) NEC Labs(日本电装实验室)

AI总结 提出结构感知检索增强生成(SA-RAG),通过表格作为中间结构化表示来减少噪声并保留关键信息,结合质量感知的表格元数据生成框架和优化方法,在噪声真实数据集上显著优于现有RAG基线。

详情
AI中文摘要

大型语言模型(LLM)已广泛应用于对话应用。然而,它们对参数化知识的依赖限制了在需要动态或领域特定信息的真实场景中的可靠性。检索增强生成(RAG)通过在生成过程中引入外部知识来解决这一限制,但现有的基于文本和基于图的RAG方法通常难以处理噪声或不相关的上下文。在这项工作中,我们提出了结构感知检索增强生成(SA-RAG),它使用表格作为中间结构化表示,提供紧凑且可控的接口,在减少噪声的同时保留关键信息。我们引入了一个质量感知的表格元数据生成框架,对元数据规范化和有效性进行建模,提高了元数据质量和下游性能。此外,我们探索了无训练和基于训练的表格生成方法。生成验证和直接偏好优化进一步提高了表格质量,同时保持了语义和结构一致性。在两个噪声真实数据集上的实验表明,SA-RAG显著优于现有的RAG基线。我们的代码已在公共仓库中公开。

英文摘要

Large Language Models (LLMs) have been widely adopted in conversational applications. However, their reliance on parametric knowledge limits reliability in real-world scenarios that require dynamic or domain-specific information. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external knowledge during generation, but existing text-based and graph-based RAG methods often struggle with noisy or irrelevant contexts. In this work, we propose Structure-aware Retrieval Augmented Generation (SA-RAG), which uses tables as an intermediate structured representation to provide a compact and controllable interface that reduces noise while preserving essential information. We introduce a quality-aware table metadata generation framework that models metadata normalization and effectiveness, improving metadata quality and downstream performance. Furthermore, we explore both training-free and training-based table generation methods. Generation validation and direct preference optimization further improve table quality while maintaining semantic and structural consistency. Experiments on two noisy real-world datasets show that SA-RAG significantly outperforms existing RAG baselines. Our code is publicly available at a public repository.

2605.24351 2026-05-26 cs.CL 版本更新

How Much Structure Do LLMs Need? Evaluating LLMs for Bibliometric Cluster Description

LLM 需要多少结构?评估 LLM 用于文献计量聚类描述

Abraham Camelo-Guerrero, Jairo Diaz-Rodriguez

发表机构 * School of Information Technology(信息科技学院) Department of Mathematics and Statistics(数学与统计学系) York University(约克大学)

AI总结 本研究通过六种不同证据和结构水平的流水线,评估文献计量结构是否改善 LLM 辅助的聚类描述生成,发现混合工作流(算法提供可审计结构,LLM 生成可读描述)效果最佳。

详情
AI中文摘要

大型语言模型(LLM)可以支持科学文献综合,但仍容易出现引用幻觉、覆盖不均匀和主题组织基础薄弱的问题。我们通过比较六种在不同证据和结构水平下生成聚类描述的流水线,评估文献计量结构是否改善 LLM 辅助的综合。使用 100 个已发表的文献计量分析,我们重建 Scopus 语料库,提取人工撰写的聚类描述,并通过人类对齐、语义覆盖、聚类质量、图质量和引用基础来评估输出。结果表明,LLM 生成的描述在语义上接近人工撰写的描述,但在要求从头推断文献计量结构时不可靠。当文献计量算法定义聚类且 LLM 解释它们时,性能有所提高。总体而言,LLM 辅助的文献计量综合最有前景的是混合工作流,其中算法提供可审计的结构,LLM 生成可读的描述。

英文摘要

Large language models (LLMs) can support scientific literature synthesis, but remain prone to hallucinated references, uneven coverage, and weakly grounded thematic organization. We evaluate whether bibliometric structure improves LLM-assisted synthesis by comparing six pipelines for generating cluster descriptions under different levels of evidence and structure. Using 100 published bibliometric analyses, we reconstruct Scopus corpora, extract human-written cluster descriptions, and assess outputs by human alignment, semantic coverage, clustering quality, graph quality, and reference grounding. Results show that LLMs produce descriptions semantically close to human-written ones, but are unreliable when asked to infer bibliometric structure from scratch. Performance improves when bibliometric algorithms define the clusters and the LLM interprets them. Overall, LLM-assisted bibliometric synthesis is most promising as a hybrid workflow in which algorithms provide auditable structure and LLMs generate readable descriptions.

2605.24344 2026-05-26 cs.CL 版本更新

Distinguishing Right from Wrong in Debates: Attribution Analysis of Chinese Harmful Memes

在辩论中区分对错:中国有害模因的归因分析

Weiming Wang, Junyu Lu, Han Wang, Xiaokun Zhang, Zewen Bai, Bo Xu, Liang Yang, Hongfei Lin

发表机构 * Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology(社会计算与认知智能重点实验室,大连理工大学) Singapore University of Technology and Design(新加坡科技设计大学) Data Intelligence Lab, Department of Computer Science, City University of Hong Kong(数据智能实验室,香港城市大学计算机科学系)

AI总结 针对中文有害模因检测中文化背景依赖和语义歧义问题,构建首个中文有害模因解释数据集Ex-ToxiCN-MM,并提出包含归因知识增强模块和相对意图推理模块的归因分析框架RIKE,在归因任务上超越主流基线模型。

Comments 10 pages, 4 figures

详情
AI中文摘要

关于有害模因检测的研究已引起广泛关注,并催生了大量数据集和方法。然而,中文有害模因检测的进展明显滞后,主要面临两个挑战:首先,准确评估模因的有害性高度依赖于对深层文化背景的理解;其次,许多模因在语义上存在歧义,使得有害性判断具有高度主观性。为解决这些问题,我们聚焦于中文有害模因的可解释检测,构建了首个中文有害模因解释数据集Ex-ToxiCN-MM。该数据集为每个模因提供了“有害”和“无害”两种对立的解释,旨在严格评估模型辨别和理解具有歧义、文化根基内容的能力。我们构建了专门的中文文化概念和冒犯性词汇知识库(C-HarmKB),为模型提供必要的先验知识。为应对模因归因中的歧义和背景知识缺失问题,我们开发了一个全面的归因分析框架RIKE,其中包括归因知识增强模块(AKE)和相对意图推理模块(RIR)。大量的定量和定性实验表明,我们的方法在中文有害模因归因任务的多项指标上优于主流基线模型。本研究中涉及的代码、Ex-ToxiCN-MM数据集和中文有害语义知识库(C-HarmKB)已在https://github.com/wimiw123/Ex-ToxiCN-MM开源。

英文摘要

Research on harmful meme detection has garnered significant attention, resulting in the development of numerous datasets and methods. However, progress in detecting Chinese harmful memes lags considerably, primarily due to two challenges: first, accurately assessing a meme's harmfulness depends heavily on understanding deep cultural context; second, many memes are semantically ambiguous, making harmfulness highly subjective. To address these issues, we focus on the interpretable detection of Chinese harmful memes by constructing the first Chinese harmful meme explanation dataset, Ex-ToxiCN-MM. This dataset offers opposing interpretations, categorized as "harmful" and "non-harmful", for each meme, aiming to rigorously evaluate a model's ability to discern and comprehend ambiguous, culturally grounded content. We built a specialized knowledge base of Chinese cultural concepts and offensive vocabulary to supply models with essential prior knowledge (C-HarmKB). To address the ambiguity and lack of background knowledge in meme attribution, we have developed a comprehensive attribution analysis framework, RIKE, which includes an Attribution Knowledge Enhancement module (AKE) and a Relative Intent Reasoning module (RIR). Extensive quantitative and qualitative experiments demonstrate that our method outperforms mainstream baseline models across multiple metrics in the task of attributing harmful memes in Chinese. The code, Ex-ToxiCN-MM dataset, and Chinese Harmful Semantic Knowledge Base (C-HarmKB) involved in this study have been open-sourced at https://github.com/wimiw123/Ex-ToxiCN-MM

2605.24313 2026-05-26 cs.CL cs.HC 版本更新

End-to-End Intracortical Speech Decoding from Neural Activity

从神经活动进行端到端的脑皮层内语音解码

Owais Mujtaba Khanday, Jose A. Gonzalez-Lopez, Marc Ouellet, Alberto Galdon, Gonzalo Olivares Granados

发表机构 * Brain, Mind, and Behavior Research Center(脑、心智与行为研究中心)

AI总结 提出基于Conformer的端到端神经解码器,无需外部语言模型即可从肌萎缩侧索硬化症(ALS)患者的脑皮层内记录中实现字符级解码,字符错误率(CER)为23.80%。

Comments Accepted at Odyssey 2026 (Lisbon)

详情
AI中文摘要

当前高性能的脑皮层内语音神经假体实现了低词错误率,但通常在推理过程中依赖外部语言模型,增加了内存、计算和延迟。在这项工作中,我们研究了在没有此类模型的情况下是否可以实现有意义的字符级解码。我们提出了一种基于Conformer的端到端神经解码器,直接训练自一名肌萎缩侧索硬化症(ALS)参与者的脑皮层内记录。在没有任何外部语言模型的情况下,该系统在留出验证数据上实现了23.80%的字符错误率(CER)。分析表明,性能变异性由会话间信号退化驱动,而主要错误源于错误的词边界分割。这些结果表明,在完全端到端的框架中实现有效的字符级解码是可能的,为下游语言处理提供了强大的神经信号。

英文摘要

Current high-performing intracortical speech neuroprostheses achieve low word error rates but typically rely on external language models during inference, increasing memory, computation, and latency. In this work, we investigate whether meaningful character-level decoding is achievable without such models. We propose an end-to-end Conformer-based neural decoder trained directly on intracortical recordings from a participant with amyotrophic lateral sclerosis (ALS). Without any external language model, the system achieves a character error rate (CER) of 23.80\% on held-out validation data. Analysis shows that performance variability is driven by inter-session signal degradation, while dominant errors arise from incorrect word boundary segmentation. These results demonstrate that effective character-level decoding is possible in a fully end-to-end framework, providing a strong neural signal for downstream linguistic processing.

2605.24310 2026-05-26 cs.CL cs.LG 版本更新

Discovering Lexical Gaps Using Embeddings from Multilingual LLMs

利用多语言大语言模型的嵌入发现词汇空缺

Yoonwon Jung, Aaron S. Cohen, Benjamin K. Bergen

发表机构 * Department of Cognitive Science, University of California San Diego(加州大学圣地亚哥分校认知科学系)

AI总结 提出一种数据驱动框架,通过多语言大语言模型的上下文嵌入计算语义相似度,以识别跨语言词汇空缺,在韩英和英韩方向上分别达到0.81和0.76的AUC。

Comments CoNLL 2026

详情
AI中文摘要

词汇空缺是指在某些语言中不存在的单词。它们给构建多语言词汇资源、机器翻译和跨语言迁移带来了挑战。现有的词汇空缺检测依赖于人工判断或固定的概念分类法。我们提出了一个数据驱动的框架来识别跨语言词汇空缺。我们从韩英双语大语言模型中提取了韩语到英语和英语到韩语翻译对的上下文嵌入。通过组合不同的LLM、嵌入类型、维度和正交变换,在100个训练-测试划分中,每种源语言产生了4000个不同的嵌入空间。在每个空间中,我们计算每个源词与其在目标语言中最近邻的语义相似度,并比较空缺词与非空缺词的分布。在94%(韩语到英语)和97%(英语到韩语)的嵌入空间中,空缺词显示出比非空缺词更弱的跨语言语义对齐。在未对齐的嵌入空间上训练的逻辑分类器可以可靠地区分空缺词和非空缺词,在韩语到英语和英语到韩语方向上分别达到0.81和0.76的AUC,并检索出18/19个韩语空缺词和26/27个英语空缺词。该方法提供了一种语言无关且无需分类法的可扩展词汇空缺识别方法。

英文摘要

Lexical gaps are words that do not exist in certain languages. They pose challenges for building multilingual lexical resources, for machine translation, and for cross-lingual transfer. Existing lexical gap detection relies on human judgments or fixed conceptual taxonomies. We propose a data-driven framework for identifying cross-lingual lexical gaps. We extracted contextualized embeddings from Korean-English bilingual LLMs for Korean-to-English and English-to-Korean translation pairs. Combinations of LLMs, embedding types, dimensionality, and orthogonal transformations across 100 train-test splits yielded 4000 distinct embedding spaces in each source language. In each space, we computed the semantic similarity between each source word and its nearest neighbor in the target language, and compared their distribution for gap words versus non-gap words. In 94% (Korean-to-English) and 97% (English-to-Korean) of embedding spaces, gap words showed weaker cross-lingual semantic alignment than non-gap words. Logistic classifiers trained on unaligned embedding spaces can reliably separate gap words from non-gap words, achieving AUCs of 0.81 (Korean-to-English) and 0.76 (English-to-Korean) and retrieving 18/19 Korean and 26/27 English gap words. This approach provides a language-agnostic and taxonomy-free method for scalable lexical gap identification.

2605.24291 2026-05-26 cs.SD cs.CL cs.MM 版本更新

Rubato: Transcribing Piano Music with Timestamps

Rubato: 带时间戳的钢琴音乐转录

Nazif Can Tamer, Victoria Ebert, Guang Yang, Noah A. Smith

发表机构 * Paul G. Allen School of Computer Science & Engineering, University of Washington(保罗·G·阿伦计算机科学与工程学院,华盛顿大学) Allen Institute for AI(阿伦人工智能研究所)

AI总结 提出一个名为Rubato的提示条件编码器-解码器模型,结合新的多声部音乐文本表示InterMo,实现从音频生成带时间戳的钢琴乐谱,在记谱准确性上优于现有级联方法。

Comments 18 pages, 7 figures, 5 tables

详情
AI中文摘要

我们考虑将音乐录音转换为带时间戳的人类可读乐谱。这样的输出让听众能够清晰地可视化rubato(时间表达性演奏),学习者能够诊断合奏精度和与书面音乐相比的时间选择,音乐学学者能够比较同一作品不同录音的演奏风格。我们引入了(1)一个名为Rubato的提示条件编码器-解码器模型,训练输出(2)一种新的多声部音乐文本表示,名为InterMo,我们设计其与序列到序列训练兼容。我们的实验表明,Rubato从音频生成带时间戳的钢琴乐谱,其记谱准确性优于基于级联的最佳现有方法。我们发现,即使级联方法获得真实MIDI而非音频,Rubato的表现仍然更好,这表明现有方法的上限主要是表示性的,而非声学性的。此外,由于Rubato在多个相关任务(带提示)上训练,它在相关但更简单的任务(如MIDI音符定位和节拍/强拍检测)上与最佳单任务系统竞争或超越它们。演示可在https://nctamer.github.io/rubato-transcription 获取。

英文摘要

We consider the conversion of musical recordings into human-readable sheet music annotated with timestamps. Such output lets a listener clearly visualize rubato (temporally expressive playing), a learner diagnose ensemble precision and timing choices against the written music, and a musicology scholar compare performance styles across recordings of the same work. We introduce (1) a prompt-conditioned encoder-decoder model, named Rubato, trained to output (2) a new textual representation for polyphonic music, named InterMo, which we designed for compatibility with sequence-to-sequence training. Our experiments demonstrate that Rubato produces timestamped piano sheet music from audio with higher notational accuracy than the best existing approaches, which are based on cascades. We find that even if the cascade is given ground-truth MIDI instead of audio, Rubato performs better, suggesting that the ceiling of existing approaches is primarily representational, not acoustic. Further, because Rubato is trained on several related tasks (with prompts), it competes with or outperforms the best single-task systems on related but simpler tasks like MIDI note grounding and beat/downbeat detection. A demo is available at https://nctamer.github.io/rubato-transcription .

2605.24286 2026-05-26 cs.LG cs.CL 版本更新

Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning

忠实性作为信息流:评估与训练忠实的链式思维推理

Jinghan Jia, Joe Benton, Eric Easley

发表机构 * Dept. CSE, Michigan State University(密歇根州立大学计算机科学系) Anthropic

AI总结 通过信息流视角提出基于充分性、完整性和必要性的框架,结合熵、掩码KL和梯度诊断评估链式思维忠实性,并引入更新时干预(如注意力掩码、反向梯度掩码等)训练更忠实的推理模型。

详情
AI中文摘要

链式思维(CoT)推理仅在推理轨迹忠实反映产生最终答案的计算过程时,才有助于监控语言模型。然而,模型可能依赖绕过CoT的提示-答案捷径,使得可见的推理轨迹即使看似合理也具有误导性。我们通过结构化的信息流视角研究CoT忠实性:忠实推理应将答案相关信息通过从提示到CoT再到答案的中介路径路由,而非通过直接的提示-答案捷径。该视角产生了一个基于三个互补属性(充分性、完整性和必要性)的任务无关框架,我们使用基于熵的、掩码KL和基于梯度的诊断来实例化。我们表明,这些指标恢复了提示推理中外部判断的忠实性差异,并识别了基于KL的诊断中低熵失败模式,其中基于梯度的度量保持更稳定。基于此分析,我们引入了基于验证器的在线强化学习的更新时干预,包括注意力掩码、仅反向梯度掩码、CoT梯度以及提示表示的对抗扰动。在提示算术、可奖励黑客的代码修复以及未经提示训练但在错误提示注入下评估的DAPO-Math模型中,我们的干预将行为和结构指标转向更强的CoT中介。特别是,它们使捷径和奖励黑客行为在CoT中更加透明,并改善了任务无关的忠实性指标,同时在某些设置中也降低了对错误提示的敏感性。我们的结果表明,在训练期间控制信息流是通向更忠实和可监控的CoT推理的实用途径。代码见 https://github.com/safety-research/faithful-cot。

英文摘要

Chain-of-thought (CoT) reasoning is useful for monitoring language models only when the reasoning trace faithfully reflects the computation that produces the final answer. However, models can rely on prompt-to-answer shortcuts that bypass the CoT, making the visible reasoning trace misleading even when it appears plausible. We study CoT faithfulness through a structural information-flow perspective: faithful reasoning should route answer-relevant information through the mediated path from prompt to CoT to answer, rather than through a direct prompt-to-answer shortcut. This perspective yields a task-agnostic framework based on three complementary properties, sufficiency, completeness, and necessity, which we instantiate with entropy-based, masked-KL, and gradient-based diagnostics. We show that these metrics recover externally judged faithfulness differences in hinted reasoning, and identify a low-entropy failure mode of KL-based diagnostics where gradient-based measures remain more stable. Building on this analysis, we introduce update-time interventions for verifier-based on-policy RL, including attention masking, backward-only gradient masking, CoT gradients, and adversarial perturbations of prompt representations. Across hinted arithmetic, reward-hackable code repair, and DAPO-Math models trained without hints but evaluated under wrong-hint injection, our interventions shift behavioral and structural indicators toward stronger CoT mediation. In particular, they make shortcut and reward-hacking behavior more transparent in the CoT and improve task-agnostic faithfulness metrics, while in some settings also reducing wrong-hint susceptibility. Our results suggest that controlling information flow during training is a practical route toward more faithful and monitorable CoT reasoning. Code is available at https://github.com/safety-research/faithful-cot.

2605.24279 2026-05-26 cs.CL cs.SE 版本更新

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

ContextEcho: 长智能编码会话中角色漂移的基准测试

Xianzhong Ding, Yangyang Yu, Changwei Liu, Bill Zhao

发表机构 * Center for Advanced AI(先进人工智能中心)

AI总结 提出ContextEcho基准测试,通过25探针身份套件和快照-探针协议,测量部署规模下长编码会话中语言模型角色漂移的普遍性、压缩影响及下游效应。

详情
AI中文摘要

前沿语言模型公认的“有用编程助手”角色在部署环境中实际运行的长智能编码会话中无法持久。经过数小时的工具使用调试,最初回避偏好(“我没有偏好”)的模型可能开始断言偏好(“Python——反馈循环是即时的……”),暴露出部署者评估可能遗漏的用户可见漂移。现有的角色稳定性研究侧重于短对话,报告的变化很小,使得现实世界的代码生成场景——数千次工具使用轮次、压缩和长达数小时的会话——在很大程度上未被表征。我们引入了ContextEcho,一个用于在部署规模上测量角色漂移的基准测试和可重用工具。它结合了25探针身份套件、快照-探针协议(在不干扰主会话的情况下分叉对话状态)、互补的判断和免判断测量表面,以及三个匿名化的Claude Code会话(跨越3,746-9,716轮次)。在23个前沿模型中,ContextEcho表明角色漂移跨组织普遍存在而非特定于家族,会话内压缩不能可靠地重置它,而单次锚定能在测量目标上恢复训练语域。它还揭示了模式依赖的下游效应:虽然漂移有助于工具使用延续,但在无工具聊天中,它破坏了格式约定并增加了输出长度。总体而言,ContextEcho为研究人员和部署者提供了一个开源框架,用于审计模型发布时的角色是否是用户在会话结束时遇到的角色,适用于聊天补全API目标且无需重新训练。

英文摘要

A frontier language model's acknowledged "helpful programming assistant" persona does not survive long agentic-coding sessions in the deployment regime that production products actually run. After hours of tool-using debugging, a model that initially hedges preferences ("I don't have preferences") may begin asserting them ("Python - the feedback loop is instant..."), revealing user-visible drift that deployer evaluations may miss. Existing persona-stability studies focus on short dialogues and report little shift, leaving real-world code-generation regimes - thousands of tool-using turns, compaction, and hours-long sessions - largely uncharacterized. We introduce ContextEcho, a benchmark and reusable harness for measuring persona drift at deployment scale. It combines a 25-probe identity suite, a snapshot-then-probe protocol that forks conversation state without perturbing the main session, complementary judged and judge-free measurement surfaces, and three anonymized Claude Code sessions spanning 3,746-9,716 turns. Across 23 frontier models, ContextEcho shows that persona drift is general across organizations rather than family-specific, that in-session compaction does not reliably reset it, and that a single-shot anchor restores the trained register across measured targets. It also reveals mode-dependent downstream effects: while drift can facilitate tool-using continuation, in tool-free chat it breaks formatting contracts and inflates output length. Overall, ContextEcho provides researchers and deployers an open-source framework to audit whether the persona a model ships with is the persona users encounter at session end, across chat-completions API targets and without retraining.

2605.24267 2026-05-26 cs.CL 版本更新

DRInQ: Evaluating Conversational Implicature with Controlled Context Variation

DRInQ: 通过受控上下文变化评估会话含义

Hirona Jacqueline Arai, Xiang Ren

发表机构 * University of Southern California(南加州大学)

AI总结 提出DRInQ基准,通过半自动化管道生成系统变化的问答上下文实例,评估大语言模型在会话含义中的语用推理能力,发现模型在生成与推理之间存在不对称性。

Comments To be presented at ACL 2026

详情
AI中文摘要

人类对话严重依赖会话含义,即说话者传达暗示而非明确陈述的意义。尽管近期的大语言模型表现出较强的对话流畅性,但当解释依赖于整合社会和语境线索的推理时,它们仍然不可靠,而这种推理在文本中很少被明确表述。我们引入了DRInQ,一个用于评估关于疑问话语中会话含义的语用推理的基准,旨在在保持每个问题表面形式固定的同时隔离语用变化。为了支持可扩展的评估,我们提出了一个半自动化管道,生成具有系统变化的问答-上下文-解释实例。在评估中,我们发现一致的生成-推理不对称性:尽管最先进的模型在引导下可以生成合理的语用场景,但它们在推理时往往无法恢复预期的含义。对于较小的模型,结构化提示提高了与人类判断的一致性。一项比较写作研究进一步揭示了互补的优势:人类作者倾向于生成更安全、可预测的上下文,而模型生成多样化的场景,其解释有时超出上下文支持。这些发现突显了建模会话含义中的持续挑战,并推动了更多上下文敏感的评估框架。

英文摘要

Human conversation relies heavily on conversational implicature, in which speakers convey meanings that are suggested rather than explicitly stated. Although recent large language models exhibit strong conversational fluency, they remain unreliable when interpretation depends on reasoning that integrates social and contextual cues, a process rarely articulated in text. We introduce DRinQ, a benchmark for evaluating pragmatic reasoning about conversational implicature in question utterances, designed to isolate pragmatic variation while holding each question's surface form fixed. To support scalable evaluation, we propose a semi-automated pipeline that produces question-context-interpretation instances with systematic variation. Across evaluations, we find a consistent generation-inference asymmetry: while state-of-the-art models can generate plausible pragmatic scenarios when guided, they often fail to recover the intended implication at inference time. For smaller models, structured prompting improves alignment with human judgments. A comparative writing study further reveals complementary strengths: human authors tend to produce safer, predictable contexts, whereas models generate varied scenarios with interpretations that sometimes exceed contextual support. These findings highlight persistent challenges in modeling conversational implicature and motivate more context-sensitive evaluation frameworks.

2605.24266 2026-05-26 cs.CL cs.AI 版本更新

An Interactive Paradigm for Deep Research

深度研究的交互式范式

Lin Ai, Victor S. Bursztyn, Xiang Chen, Julia Hirschberg, Saayan Mitra

发表机构 * Adobe Research(Adobe研究院) Department of Computer Science, Columbia University(哥伦比亚大学计算机科学系)

AI总结 提出SteER框架,通过可解释的中间过程控制、成本效益决策和实时用户模型,在深度研究中实现用户对齐,性能优于现有基线。

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进展使得深度研究系统能够通过结合检索、推理和生成,为开放式查询合成全面、报告式的答案。然而,大多数框架依赖于僵化的流程,采用一次性范围界定和长时间自主运行,如果用户意图在过程中发生变化,几乎没有修正的空间。我们提出了SteER,一个可引导的深度研究框架,将可解释的中间过程控制引入长周期研究流程中。在每个决策点,SteER使用成本效益公式来确定是暂停等待用户输入还是自主继续。它结合了多样性感知规划与奖励对齐、新颖性和覆盖率的效用信号,并维护一个在会话过程中不断演化的实时用户模型。SteER在对齐方面比最先进的开源和专有基线高出最多22.80%,在广度、平衡等质量指标上领先,并且在85%以上的成对对齐判断中被人类读者偏好。我们还引入了一个用户查询基准和数据生成流水线。据我们所知,这是第一个以交互式、可解释的控制范式推进深度研究的工作,为长形式任务中可控、用户对齐的智能体铺平了道路。

英文摘要

Recent advances in large language models (LLMs) have enabled deep research systems that synthesize comprehensive, report-style answers to open-ended queries by combining retrieval, reasoning, and generation. Yet most frameworks rely on rigid workflows with one-shot scoping and long autonomous runs, offering little room for course correction if user intent shifts mid-process. We present SteER, a framework for Steerable deEp Research that introduces interpretable, mid-process control into long-horizon research workflows. At each decision point, SteER uses a cost-benefit formulation to determine whether to pause for user input or to proceed autonomously. It combines diversity-aware planning with utility signals that reward alignment, novelty, and coverage, and maintains a live persona model that evolves throughout the session. SteER outperforms state-of-the-art open-source and proprietary baselines by up to 22.80\% on alignment, leads on quality metrics such as breadth and balance, and is preferred by human readers in 85\%+ of pairwise alignment judgments. We also introduce a persona-query benchmark and data-generation pipeline. To our knowledge, this is the first work to advance deep research with an interactive, interpretable control paradigm, paving the way for controllable, user-aligned agents in long-form tasks.

2605.24247 2026-05-26 cs.CL cs.AI 版本更新

Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation

通过详细的宪法定义和AI驱动的评估提高标注一致性

Konstantin Berlin, Adam Swanda

发表机构 * Cisco AI Defense(思科AI防御)

AI总结 提出一种AI驱动的工作流,通过为每个类别编写详细的宪法定义并由前沿LLM解释,以比人类更一致和准确地生成黄金标签,在三个内容审核类别上将跨模型不一致性降低高达57倍。

Comments Under review at ACL Rolling Review (ARR), May 2026 cycle. Also available at https://doi.org/10.5281/zenodo.20125267

详情
AI中文摘要

许多自动化标注流水线根据书面规范将输入分类到类别中,内容审核是一个突出的用例。简单的类别定义不足以让标注者产生这些流水线所需的准确、一致的黄金标签。一个解决方案是编写一个规定性定义,解决足够多的实际边界情况,使得标注者无法对书面解释产生分歧。在实践中,这种详细程度的定义超出了人类标注者工作记忆的容量,因此标注者依赖直觉,标签偏离书面规则,准确性和一致性下降。我们提出并展示了一种AI驱动工作流的有效性,其中AI帮助编写每个类别的宪法,该宪法以足够详细的方式定义标签以覆盖边缘情况,并且前沿LLM在每个输入上解释该宪法,以比阅读相同文档的人类更一致和准确地产生黄金标签。我们在三个内容审核类别(骚扰、仇恨言论、非暴力犯罪)上评估,并表明该方法相比段落定义将跨模型不一致性降低高达57倍,跨模型分歧诊断规范缺口,人类负责关于每个类别应含义的高层决策,而不是单个标注调用。对于安全评估,我们引入了一个双轴公式,在完整对话上独立评分意图和内容,以便下游消费者可以基于任一轴或两者采取行动。

英文摘要

Many automated labeling pipelines classify inputs into categories defined by a written specification, content moderation being a prominent use case. Simple category definitions are not detailed enough for labelers to produce the accurate, consistent golden labels these pipelines require. One solution is to write a prescriptive definition that settles enough real boundary cases that labelers cannot disagree with the written interpretation. In practice, definitions at that level of detail exceed what a human annotator can hold in working memory, so annotators fall back on intuition and the labels drift from the written rules, regressing on accuracy and consistency. We propose and demonstrate the efficacy of an AI-driven workflow in which AI helps write a per-category constitution that defines the label in enough detail to cover edge cases, and a frontier LLM interprets it on each input to produce the golden label more consistently and accurately than humans reading the same document. We evaluate on three content moderation categories (harassment, hate speech, non-violent crime) and show that the approach reduces cross-model inconsistency by up to 57x compared to paragraph definitions, with cross-model disagreement diagnosing specification gaps and the human responsible for high-level decisions about what each category should mean rather than individual labeling calls. For the safety evaluation, we introduce a dual-axis formulation scoring intent and content independently over the full conversation, so downstream consumers can act on either axis or both.

2605.24218 2026-05-26 cs.CL 版本更新

QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks

QUEST:使用完全合成任务训练前沿深度研究代理

Jian Xie, Tianhe Lin, Zilu Wang, Yuting Ning, Yuekun Yao, Tianci Xue, Zhehao Zhang, Zhongyang Li, Kai Zhang, Yufan Wu, Shijie Chen, Boyu Gou, Mingzhe Han, Yifei Wang, Vint Lee, Xinpeng Wei, Xiangjun Wang, Yu Su, Huan Sun

发表机构 * The Ohio State University(俄亥俄州立大学) Amazon AGI SF Lab(亚马逊人工智能SF实验室)

AI总结 提出QUEST系列模型,通过统一规则树合成训练数据,结合中训练、监督微调和强化学习,使开放模型在深度研究任务上接近或超越闭源前沿代理。

Comments Work in Progress

详情
AI中文摘要

深度研究代理将搜索引擎从检索关键词匹配页面扩展到知识综合,从根本上改变了人类与信息的交互方式。然而,前沿系统仍然是专有的,而现有的开放代理通常在不同任务类型上泛化能力较差,尚不清楚如何训练一个具有广泛能力的深度研究代理。我们发布了QUEST,一个开放模型系列(从2B到35B),作为通用深度研究代理,旨在处理各种长周期搜索任务,在事实查找、引用依据和报告综合方面具有强大能力。为了构建QUEST,我们提出了一种有效的训练方案,结合了中训练、监督微调和强化学习。该方案的核心是一个基于统一规则树的精心设计的数据合成流程,适用于不同的任务类型,并且能够在无需人工标注的情况下合成具有可验证奖励的训练数据。此外,QUEST内置了上下文管理机制,能够实现有效的长周期推理和知识综合。仅使用8K个合成任务,QUEST在涵盖多种任务类型的八个深度研究基准上接近甚至超越了前沿闭源代理,并在最近的开放权重代理中取得了最佳整体性能。我们发布了所有内容:模型、数据和训练脚本。

英文摘要

Deep research agents extend the role of search engines from retrieving keyword-matched pages to synthesizing knowledge, fundamentally changing how humans interact with information. However, frontier systems remain proprietary, while existing open agents often generalize poorly across different task types, leaving unclear how to train a broadly capable deep research agent. We release QUEST, a family of open models (ranging from 2B to 35B) that serve as general-purpose deep research agents designed to handle a wide range of long-horizon search tasks, with strong capabilities in fact seeking, citation grounding, and report synthesis. To build QUEST, we propose an effective training recipe combining mid-training, supervised fine-tuning, and reinforcement learning. Central to this recipe is a curated data synthesis pipeline based on unified rubric trees, which applies to different task types and enables synthesizing training data with verifiable rewards without human annotation. In addition, QUEST incorporates a built-in context management mechanism that enables effective long-horizon reasoning and knowledge synthesis. Using only 8K synthesized tasks, QUEST approaches or even surpasses frontier closed-source agents across eight deep research benchmarks spanning diverse task types, and achieves the best overall performance among recent open-weight agents. We released everything: models, data, and training scripts.

2605.24216 2026-05-26 cs.LG cs.AI cs.CL cs.CR 版本更新

Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Agent-ToM: 通过心智理论推理学习监控自主LLM智能体

Nesreen K. Ahmed, Nima Nafisi

发表机构 * Cisco Outshift(思科Outshift)

AI总结 针对自主LLM智能体的隐蔽恶意行为监控难题,提出基于心智理论推理的Agent-ToM框架,通过信念推断、意图假设与验证实现结构化轨迹分析,在监控基准上取得优于集成方法的性能。

Comments 23 pages, 9 figures

详情
AI中文摘要

监控自主大语言模型(LLM)智能体的隐蔽恶意行为具有挑战性,因为攻击模式具有延迟性、上下文依赖性和长期性。智能体可能追求隐藏目标同时保持表面良性行为,即使拥有完整轨迹访问也难以检测。先前的监控方法改进了脚手架或集成聚合,但独立处理每条轨迹,未从先前的监控经验中学习。此外,标准推理方法解释观察到的行为,但没有明确推理智能体的信念、意图和目标对齐,而这些对于区分良性任务执行和隐蔽偏离是必要的。 我们提出 extbf{Agent-ToM},一种基于心智理论(ToM)推理的监控学习框架,用于自主智能体的安全分析。Agent-ToM通过推断信念、具有校准置信度的意图假设、预期行动以及与任务一致行为基线的偏离,执行结构化的全轨迹分析。在推理时,它采用 extit{推理-验证-细化}流程来构建和验证监控决策。在训练时,Agent-ToM将批评信号蒸馏到持久的 extit{语义护栏记忆}中,使得跨回合可重用的信念和意图条件约束成为可能。我们在对抗性智能体监控基准(SHADE-Arena和CUA-SHADE-Arena)上评估Agent-ToM。Agent-ToM实现了强精确率-召回率平衡,并使用连贯的双调用推理流程,优于包括集成方法在内的最先进监控基线。这些结果表明,在监控层学习,结合结构化的ToM推理和验证,为保护自主LLM智能体提供了有效且可部署的基础。

英文摘要

Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience. Moreover, standard reasoning methods explain observed behavior without explicitly reasoning about agent beliefs, intentions, and goal alignment required to distinguish benign task execution from covert deviation. We propose \textbf{Agent-ToM}, a learning-to-monitor framework grounded in Theory-of-Mind (ToM) reasoning for security analysis of autonomous agents. Agent-ToM performs structured full-trajectory analysis by inferring beliefs, intent hypotheses with calibrated confidence, expected actions, and deviations from task-consistent behavioral baselines. At inference time, it employs a \textit{Reason-Verify-Refine} pipeline to construct and validate monitoring decisions. At training time, Agent-ToM distills critique signals into a persistent \textit{semantic guardrail memory}, enabling reusable belief- and intent-conditioned constraints across episodes. We evaluate Agent-ToM on adversarial agent monitoring benchmarks (SHADE-Arena and CUA-SHADE-Arena). Agent-ToM achieves strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods, while using a coherent two-call reasoning pipeline. These results demonstrate that learning at the monitoring layer, combined with structured ToM reasoning and verification, provides an effective and deployable foundation for securing autonomous LLM agents.

2605.24211 2026-05-26 cs.CL cs.AI 版本更新

Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation

通过类比教学:教育类比生成的模块化流水线

Mariam Barakat, Ekaterina Kochmar

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出一个模块化流水线,将教育类比生成分解为四个阶段,基于结构映射理论,评估12个LLM在两个数据集上的表现,发现子概念显著提升解释质量和封闭检索精度,并引入LLM作为评判的评估方法。

Comments 36 pages, 25 figures. To appear in Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

详情
AI中文摘要

类比通过将不熟悉的概念与已知概念联系起来,帮助学习者理解。尽管最近取得了进展,大型语言模型(LLM)在生成与人类质量相当的类比方面仍然困难。我们提出了一个用于教育类比生成的模块化流水线,将任务分解为四个阶段:源发现、子概念生成、解释生成和评估。基于结构映射理论,该流水线能够系统地、逐阶段地分析模型选择和输入配置如何影响类比质量。我们在两个具有结构化子概念注释的数据集(SCAR和ParallelPARC)上评估了来自六个模型家族的12个最先进的LLM,以及用于封闭设置检索的七个嵌入模型。我们的结果表明,子概念显著提高了解释质量和封闭设置检索精度,但在开放式源生成中提供的益处有限。我们进一步引入了一种LLM作为评判的评估方法,并针对七名注释者的人类注释验证了其评分,发现Claude Sonnet 4.6在人类排名上的对齐比细粒度绝对分数更可靠。综合来看,我们的发现揭示了孤立研究无法捕捉的跨阶段交互,并强调了子概念基础作为类比质量生成的关键驱动因素。

英文摘要

Analogies help learners understand unfamiliar concepts by relating them to known concepts. Despite recent advances, large language models (LLMs) continue to struggle to generate analogies of comparable quality to those produced by humans. We present a modular pipeline for educational analogy generation, decomposing the task into four stages: source finding, sub-concept generation, explanation generation, and evaluation. Grounded in Structure Mapping Theory, the pipeline enables systematic, stage-by-stage analysis of how model choice and input configuration affect analogy quality. We evaluate 12 state-of-the-art LLMs across six model families on two datasets with structured sub-concept annotations (SCAR and ParallelPARC), alongside seven embedding models for closed-setting retrieval. Our results show that sub-concepts substantially improve explanation quality and closed setting retrieval precision but provide limited benefit in open-ended source generation. We further introduce an LLM-as-a-judge evaluation methodology and validate its scoring against human annotations from seven annotators, finding that Claude Sonnet 4.6 aligns more reliably with human rankings than with fine-grained absolute scores. Taken together, our findings reveal cross-stage interactions that isolated studies cannot capture, and highlight sub-concept grounding as a key driver of analogy quality generation.

2605.24173 2026-05-26 cs.CL cs.AI cs.CR cs.LG 版本更新

Extracting Training Data from Diffusion Language Models via Infilling

通过填充从扩散语言模型中提取训练数据

Yihan Wang, N. Asokan

发表机构 * University of Waterloo(滑铁卢大学) KTH Royal Institute of Technology(皇家理工学院)

AI总结 提出填充提取协议,利用扩散语言模型的双向去噪能力,通过任意二进制掩码参数化,揭示掩码几何形状控制提取能力,边缘条件掩码比前缀条件掩码多提取三倍逐字序列,且双向访问打开了自回归模型无法利用的通道。

详情
AI中文摘要

大型语言模型中的记忆化几乎完全通过前缀条件提取进行研究,这是自回归模型的自然选择。然而,扩散语言模型(DLM)可以在任意位置去噪掩码标记。因此,仅前缀探测揭示了DLM中记忆化的一个方面,并显著低估了训练数据提取的风险。为了真实地建模DLM中训练数据的可提取性,我们引入了\emph{填充提取},这是一种由任意二进制掩码参数化的数据提取协议,它包含了前缀仅探测并考虑了DLM的双向归纳偏差。在LLaDA-8B和Dream-7B上,跨五种提取模式、三种训练流水线和三个涵盖逐字和部分泄漏的语料库进行实例化,我们发现掩码几何形状控制着可提取性:边缘条件掩码比前缀条件掩码\emph{多提取三倍}的逐字序列,并且双向访问打开了自回归模型中无法利用的通道。特别是,我们表明,一个能够访问已删除个人身份信息的训练数据的现实对手,甚至可以从DLM中提取被删除的电子邮件地址,其召回率高于规模匹配的自回归模型。解码的可调参数可测量地影响提取性能,而后续的监督微调阶段并未消除先前的记忆化。

英文摘要

Memorization in large language models has been studied almost exclusively through prefix-conditioned extraction, a natural choice for autoregressive models. However, diffusion language models (DLMs) can denoise masked tokens at arbitrary positions. Thus, prefix-only probing reveals only one facet of memorization in DLMs and significantly underestimates the risk of training-data extraction. In order to realistically model extractability of training data in DLMs, we introduce \emph{infilling extraction}, a data-extraction protocol parameterized by an arbitrary binary mask that subsumes prefix-only probing and accounts for the bidirectional inductive bias of DLMs. Instantiating it on LLaDA-8B and Dream-7B across five extraction modes, three training pipelines, and three corpora covering verbatim and partial leakage, we find that mask geometry governs extractability: edge-conditioned masks \emph{extract up to three times more} verbatim sequences than prefix-conditioned ones, and bidirectional access opens channels inaccessible in autoregressive models. In particular, we show that a realistic adversary with access to training data where personally identifiable information has been redacted, can even achieve higher recall on extracting redacted email addresses from DLMs than from scale-matched autoregressive models. Tunable parameters for decoding measurably affect extraction performance, while a follow-up supervised finetuning stage does not eliminate the prior memorization.

2605.24164 2026-05-26 cs.CL 版本更新

CUNY at CLPsych 2026: A Pipeline Approach to Classification and Summarization of Mental Health Changes

CUNY at CLPsych 2026: 一种用于心理健康变化分类和总结的流水线方法

Amirmohammad Ziaei Bideh, Shameed Charlomar Job, Ava Yahyapour, Alla Rozovskaya

发表机构 * Computer Science Department, CUNY Graduate Center(哥伦比亚大学研究生院计算机科学系) Linguistics Department, CUNY Graduate Center(哥伦比亚大学研究生院语言学系)

AI总结 提出一种流水线方法,通过集成三个开源大语言模型的上下文学习和监督分类,对社交媒体时间线中的心理健康状态进行推断、变化点预测和模式总结,在CLPsych 2026共享任务中取得领先排名。

详情
AI中文摘要

我们描述了在CLPsych~2026共享任务中的提交,该任务旨在通过社交媒体时间线动态捕捉和表征心理健康变化。为了推断帖子中的主导自我状态(任务1.1和1.2),我们使用多数投票集成了三个开源大语言模型的上下文学习。为了预测时间线中的变化时刻(任务2),我们在从任务1.1预测中提取的特征上训练监督分类器。为了总结时间线内情绪动态的模式及其随时间的变化(任务3.1),我们增加了上游系统(任务1.1、1.2和2)预测的上下文示例标签,相比零样本和未增强的上下文学习基线获得了性能提升。我们的提交在任务1.1上排名第一,任务1.2上排名第四,任务2上排名第四,任务3.1上排名第三。实验源代码可在https://github.com/amirzia/clpsych26-cuny获取。

英文摘要

We describe our submission to the CLPsych~2026 Shared Task on capturing and characterizing mental health changes through social media timeline dynamics. To infer the dominant self-states in posts (Tasks 1.1 and 1.2), we ensemble in-context learning of three open-weight large language models using majority voting. For predicting moments of change in a timeline (Task~2), we train supervised classifiers on features derived from Task~1.1 predictions. To summarize the patterns of mood dynamics and their progression over time within a timeline (Task 3.1), we augment in-context example labels predicted by upstream systems (Tasks 1.1, 1.2, and 2), yielding performance gains over zero-shot and unaugmented in-context learning baselines. Our submission ranked first on Task~1.1, fourth on Task~1.2, fourth on Task~2, and third on Task~3.1.\footnote{The source code for the experiments is available at https://github.com/amirzia/clpsych26-cuny

2605.24079 2026-05-26 cs.SE cs.AI cs.CL 版本更新

TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs

TRACER: 一种用于代码大语言模型中细粒度污染检测的语义感知框架

Yifeng Di, Xuliang Huang, Tianyi Zhang

发表机构 * Purdue University West Lafayette, IN(帕克大学韦斯特拉法叶分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出TRACER框架,通过三级语义重叠和粗到细流水线检测代码LLM中的细粒度数据污染,在基准测试中F1达0.91。

Comments 21 pages, 2 figures, 15 tables

详情
AI中文摘要

数据污染是对模型评估可靠性的已知威胁。然而,在代码大语言模型(LLM)中,污染往往超出精确重复,这一问题仍未得到充分探索。我们提出了TRACER,一种用于细粒度代码污染检测的语义感知框架。TRACER使用三级语义重叠——功能相同、几乎相同和共享逻辑——对污染进行建模,并通过粗到细的流水线进行检测。我们还引入了首个细粒度代码污染检测基准,涵盖三个广泛使用的基准和三个具有代表性的后训练数据集。TRACER在多个LLM骨干网络上取得了强大且一致的性能,其中GPT-5在细粒度检测中F1分数达到0.91。在二分类设置中,TRACER的F1达到0.92,比现有方法高出42%-217%。我们进一步进行了消融研究和错误分析,以评估TRACER中各个组件的贡献。

英文摘要

Data contamination is a known threat to the reliability of model evaluation. However, it remains underexplored in code large language models (LLMs), where contamination often goes beyond exact duplication. We present TRACER, a semantic-aware framework for fine-grained code contamination detection. TRACER models contamination using three levels of semantic overlap - Functionally Identical, Nearly Identical, and Shared Logic - and detects them through a coarse-to-fine pipeline. We also introduce the first benchmark for fine-grained code contamination detection, spanning three widely used benchmarks and three representative post-training datasets. TRACER achieves strong and consistent performance across multiple LLM backbones, with GPT-5 reaching an F1 score of 0.91 in fine-grained detection. In the binary setting, TRACER attains an F1 of 0.92, outperforming existing methods by 42%-217%. We further conduct ablation studies and error analysis to assess the contributions of individual components in TRACER.

2605.24053 2026-05-26 cs.AI cs.CL cs.LG 版本更新

Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models

打破概率的锁链:中智逻辑作为大型语言模型中认知不确定性的新框架

Maikel Yelandi Leyva-Vázquez, Florentin Smarandache

发表机构 * Universidad Bolivariana del Ecuador, Coordinación Académica de Posgrado(巴尔干大学厄瓜多尔分校,研究生院) Universidad de Guayaquil(瓜亚基尔大学) Universidad Bernardo O’Higgins(伯纳多·奥希金斯大学) Mathematics, Physics, and Natural Sciences Division, University of New Mexico(新墨西哥大学数学、物理和自然科学系)

AI总结 本文提出使用中智逻辑(Truth、Indeterminacy、Falsity三个独立维度)替代传统概率框架,通过实验发现该框架能更丰富地表示LLM的内部状态,并在35%的评估中自发出现超真状态,为透明、可靠和伦理感知的AI系统提供关键步骤。

Comments Published in Neutrosophic Sets and Systems, Vol. 99 (2026). Author's preprint version. Open code and data available at: github.com/mleyvaz/neutrosophic-llm-logic

详情
Journal ref
Neutrosophic Sets and Systems, Vol. 99, 2026
AI中文摘要

大型语言模型(LLM)主要受概率框架支配,其中结果概率之和被约束为1。这种由Softmax层强加的结构限制导致不确定性崩溃,使得难以区分认知不确定性、悖论和模糊性。我们提出了一种中智逻辑应用的实证研究,该框架将真(T)、不确定(I)和假(F)视为三个独立维度,用于建模LLM中的认知状态。我们在四个OpenAI GPT模型家族上进行了实验,涵盖五种语言现象:逻辑悖论、认知无知、模糊性、伦理矛盾和未来偶然性,采用三种提示策略:中智、概率和熵衍生。我们的发现表明,中智方法通过允许T+I+F>1(我们称之为超真状态),提供了模型内部状态的更丰富表示。在35%的评估中,超真状态自发出现,主要出现在伦理矛盾和逻辑悖论下。我们证明,该方法在模糊上下文中保留了真值,并提供了一种稳健的方法来识别和量化内部模型冲突。我们得出结论,中智评估层的集成是迈向更透明、可靠和伦理感知的AI系统的关键一步。

英文摘要

Large Language Models (LLMs) are predominantly governed by probabilistic frameworks in which the sum of outcome probabilities is constrained to unity. This architectural limitation, often imposed by Softmax layers, leads to a collapse of uncertainty that makes it difficult to differentiate between epistemic uncertainty, paradox, and vagueness. We present an empirical investigation of the application of Neutrosophic Logic, a framework that treats Truth (T), Indeterminacy (I), and Falsity (F) as three independent dimensions, to model epistemic states in LLMs. We conducted experiments on a family of four OpenAI GPT models across five linguistic phenomena: logical paradoxes, epistemic ignorance, vagueness, ethical contradictions, and future contingencies, under three prompting strategies: neutrosophic, probabilistic, and entropy-derived. Our findings reveal that the neutrosophic approach, by allowing T+I+F > 1, a state we term hyper-truth, provides a richer representation of a model's internal state. In 35% of evaluations, hyper-truth emerged spontaneously, predominantly under ethical contradiction and logical paradox. We demonstrate that this approach preserves truth values in fuzzy contexts and offers a robust method for identifying and quantifying internal model conflict. We conclude that the integration of neutrosophic evaluation layers is a critical step toward more transparent, reliable, and ethically aware AI systems.

2605.24000 2026-05-26 cs.CL 版本更新

Toxicity in Twitch Chats: An LLM-Based Analysis Across Gaming Communities

Twitch聊天中的毒性:基于LLM的游戏社区分析

Ronja Fuchs, Florian Rupp, Timo Bertram, Kai Eckert, Alexander Dockhorn

发表机构 * Institute for Information Processing(信息处理研究所) Leibniz University Hannover(汉诺威莱布尼茨大学) Department of Computer Science(计算机科学系) Technische Hochschule Mannheim(曼海姆技术学院) Institute for Machine Learning(机器学习研究所) Johannes Kepler University(约翰·凯普勒大学) SDU Metaverse Lab(SDU元宇宙实验室) University of Southern Denmark(南部丹麦大学)

AI总结 使用预训练大语言模型对Twitch平台4452个直播流约2000万条聊天消息进行零样本分类,发现2.4%的消息有毒,其中MOBA游戏毒性最高(3.2%),体育游戏最低(2%),且游戏间毒性分布差异显著,表明存在游戏特定的社区规范。

Comments 8 pages, 2 figures, 5 tables. Accepted at the IEEE Conference on Games (IEEE CoG) 2026

详情
AI中文摘要

在线游戏社区中的毒性仍然是一个持续存在的挑战,体现在不同游戏类型、平台和玩家互动中。虽然许多研究关注游戏内毒性,但对于流媒体平台上不同游戏社区之间毒性行为的差异知之甚少。为了解决这一不足,我们分析了来自Twitch上七个游戏类型的4452个直播流的大约2000万条聊天消息。我们使用预训练的大语言模型通过零样本分类,根据Twitch的毒性分类法对消息进行分类。该分类法包括四个类别和八个子类,包括骚扰、歧视、性内容和脏话。我们的方法在TextDetox数据集上达到了94.5%的F1分数,并且显示出与人类间一致性相当的人机一致性。我们的分析显示,所有消息中有2.4%被归类为有毒,不同游戏类型之间存在显著差异:MOBA游戏的直播流显示出最高的相对毒性率(3.2%),而体育游戏的毒性率最低(2%)。此外,结果表明,即使在游戏类型内部,不同游戏在毒性分布上也存在显著差异,这表明存在游戏特定的社区规范和机制,这些因素塑造了超越游戏类型效应的毒性行为。这些发现为Twitch上游戏类型和游戏特定的毒性模式提供了实证见解,并可为游戏社区制定更有针对性的审核策略提供信息。

英文摘要

Toxicity in online gaming communities remains a persistent challenge, manifesting across genres, platforms, and player interactions. While much research is focused on in-game toxicity, less is known about how toxic behavior varies between gaming communities on streaming platforms. To address this shortcoming, we analyze approximately 20 million chat messages from 4,452 streams, spanning seven game genres on Twitch. We categorize messages according to Twitch's toxicity taxonomy with a pre-trained Large Language Model using zero-shot classification. The taxonomy comprises four categories and eight subclasses, including harassment, discrimination, sexual content, and profanity. Our approach achieves an F1 score of 94.5% on the TextDetox dataset and demonstrates human-model agreement comparable to inter-human agreement. Our analysis reveals that 2.4% of all messages are classified as toxic, with notable differences across genres: streams of MOBA games exhibit the highest relative rate of toxicity (3.2%), and sports games show the lowest rate (2%). Furthermore, results indicate that individual games differ significantly in their toxicity distributions, even within genres, suggesting the existence of game-specific community norms and mechanics that shape toxic behavior beyond genre-level effects. These findings offer empirical insights into genre- and game-specific toxicity patterns on Twitch and can inform more targeted moderation strategies for gaming communities.

2605.23989 2026-05-26 cs.AI cs.CL cs.CR 版本更新

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

迈向可信的自主AI:安全性、鲁棒性、隐私与系统安全的全面综述

Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu, Dianzhi Yu, Shicheng Ma, Wenqian Cui, Yiyang Zhao, Yiyi Chen, Ruoxi Jiang, Irwin King, Zenglin Xu

发表机构 * Faculty of Engineering, Department of Computer Science and Engineering, The Chinese University of Hong Kong(香港中文大学工程学院、计算机科学与工程系) Artificial Intelligence Innovation and Incubation Institute, Fudan University(复旦大学人工智能创新与孵化院) Shanghai Academy of AI for Science(上海人工智能科学研究院)

AI总结 本文综述了自主AI系统在安全鲁棒性与隐私系统安全两个核心维度的风险来源、阶段缓解策略及统一评估指标,并讨论了开放挑战。

Comments 36 pages, 4 figures. Survey/review article on trustworthy agentic AI. Published in Academia AI and Applications, 2026

详情
Journal ref
Academia AI and Applications, vol. 2, 2026
AI中文摘要

自主AI系统——即通过规划、工具使用、记忆和长程交互增强的大型语言模型(LLM)——能够自主执行复杂任务,但其多步轨迹引入了新的故障模式,挑战了可信赖性。本综述通过两个对高风险部署至关重要的核心维度,对可信自主AI进行了重点考察:安全性与鲁棒性,以及隐私与系统安全性。针对每个维度,我们澄清了关键概念,识别了风险在代理工作流中出现的环节,并总结了针对各阶段的缓解策略。其他可信赖性方面(价值对齐、透明度、公平性和问责制)作为相关背景而非平行章节进行讨论。为了支持一致的比较和部署决策,我们将评估整合到一个统一的指标与基准中心,强调结果和过程信号(例如,约束违反、轨迹完整性和对抗成功率),并为发布门控提供场景到指标的指导。最后,我们概述了开放挑战,如自我进化代理、运行时监控与验证、隐私保护个性化以及信任-效用权衡,并提出了一个关于开源自主系统中现实世界安全失败的案例研究。我们的目标是作为在高风险环境中构建可信自主系统的研究人员和实践者的实用参考。

英文摘要

Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey provides a focused examination of trustworthy agentic AI through two core dimensions that are critical for high-risk deployments: Safety and Robustness, and Privacy and System Security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other trustworthiness aspects (value alignment, transparency, fairness, and accountability) are discussed as relevant context rather than parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals (e.g., constraint violations, trace completeness, and adversarial success rates) and offering scenario-to-metric guidance for release gating. We conclude by outlining open challenges such as self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and present a case study of real-world security failures in open-source agentic systems. Our goal is to serve as a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.

2605.23977 2026-05-26 cs.CL cs.SD eess.AS 版本更新

A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

临床访谈抑郁症检测基准的多探针审计

Takehiro Ishikawa, Jon Duke

发表机构 * College of Computing, Georgia Institute of Technology(佐治亚理工学院计算机学院) Georgia Tech Research Institute, Georgia Institute of Technology(佐治亚理工学院研究 institute)

AI总结 通过四个互补探针审计临床访谈抑郁症检测基准,发现评估协议缺陷、排行榜不可靠、跨域泛化弱以及文本与音频模态对症状密度的敏感性差异。

详情
AI中文摘要

本文通过四个互补探针对 DAIC/E-DAIC、CMDC、ANDROIDS、MODMA 和 PDCH 中的临床访谈抑郁症检测基准评估进行审计。首先,我们在严格的受试者不相交留一受试者交叉验证下重新评估 E-DAIC。一个轻量级混合文本加 LLM 评分模型达到了 macro-F1 = 0.723——据我们所知,这是该协议下报告的最高值——提供了一个不依赖特权官方保留集的保守出折参考点。其次,我们通过扫描 96 种跨模态组合、池化策略和学习器的模型配置,测试 E-DAIC 官方划分是否支持细粒度排行榜排名。开发侧交叉验证与官方测试排名仅中等程度对齐:最佳交叉验证配置在官方测试中排名第 20,官方测试获胜者按交叉验证排名第 41,前三名重叠为零,且表观获胜者在仅 32.3% 的受试者自举中排名第一。第三,我们外部验证了强大的公开 CMDC 和 ANDROIDS 基线,这些基线在域内实现了接近天花板的表现。到外部语料库的零样本迁移明显较弱。最后,我们使用基于 SRDS 的标注器定义的症状密集与症状稀疏的配对访谈片段,对 E-DAIC 文本和音频模型进行压力测试。文本分数在症状密集片段上急剧上升,而音频分数几乎持平;文本减音频的差距在所有五个种子上均为正。

英文摘要

This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH. First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723 - the highest reported under this protocol, to our knowledge - providing a conservative out-of-fold reference point that does not depend on the privileged official holdout. Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps. Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance. Zero-shot transfer to external corpora is substantially weaker. Finally, we stress-test E-DAIC text and audio models using paired symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.

2605.23975 2026-05-26 cs.CL cs.SD 版本更新

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

面向音频大语言模型中英双语码转换语音识别的直接偏好优化

Trung Nguyen Quang, Cheng Yi Lewis Won, Minh Duc Pham, Yingxu He, Shuo Sun, Ai Ti Aw

发表机构 * Institute for Infocomm Research (I2R), A STAR, Singapore(信息通信研究所(I2R),A STAR,新加坡) Nanyang Technological University, Singapore(南洋理工大学,新加坡)

AI总结 针对音频大语言模型在英中码转换语音转录中的系统失败,提出使用直接偏好优化(DPO)对齐模型,通过构建偏好对(保留混合语言内容 vs 模仿失败模式)训练模型,实现词错误率降低最高89.6%(分布内)和20.0%(分布外)。

详情
AI中文摘要

音频大语言模型(Audio LLMs)尽管具有强大的多语言能力,但在转录码转换语音时表现出系统性失败。聚焦英中双语,我们识别出三种失败模式:语言省略、翻译替代转录和幻觉。我们应用直接偏好优化(DPO)来对齐模型,构建偏好对,其中选择响应保留混合语言内容,而拒绝响应模仿失败模式。在100K对(570小时)上训练三个Audio LLMs,我们观察到一致的行为转变:模型学会在提示转录时保留语言组成而非翻译。这种对齐使得词错误率降低高达89.6%(分布内)和20.0%(分布外)。我们的发现表明,DPO可以有效地从多语言Audio LLMs中引出正确的码转换转录行为。

英文摘要

Audio large language models (Audio LLMs) exhibit systematic failures in transcribing code-switching speech despite strong multilingual capabilities. Focusing on English-Mandarin, we identify three failure modes: language omission, translation-instead-of-transcription, and hallucination. We apply Direct Preference Optimization (DPO) to align models, constructing preference pairs in which chosen responses preserve mixed-language content while rejected responses mimic failure patterns. Training three Audio LLMs on 100K pairs (570 hours), we observe consistent behavioral shifts: models learn to preserve language composition rather than translating when prompted for transcription. This alignment yields MER reductions up to 89.6% (in-distribution) and 20.0% (out-of-distribution). Our findings suggest DPO can effectively elicit correct code-switching transcription behavior from multilingual Audio LLMs.

2605.23974 2026-05-26 cs.CL 版本更新

AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue

AERIC:面向隐式有害对话的预期性隐藏状态监控

Jihyung Park, Saleh Afroogh, Junfeng Jiao

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出AERIC方法,利用生成器的隐藏状态进行同路径监控,通过短期危害预测、支持敏感抑制和提示条件残差评分,在轻量级线性监控器(仅387个可训练参数)上实现对隐式有害对话的早期检测,显著提升AUROC并降低延迟。

详情
AI中文摘要

当前语言模型带来两个安全挑战:必须足够早地检测风险以避免暴露有害的延续,且危害本身可能是隐式的而非通过明显的有毒文本信号。现有的响应级防护在评判完整文本方面表现强劲,原生流式防护则更接近令牌时间,但两者都未解决轻量级监控器能否从生成器自身的内部轨迹预测隐式有害漂移的问题。我们研究预期性同路径监控,其中安全监控器可以读取正常解码过程中产生的隐藏状态,但不能调用通过基础模型的额外前向传播。我们引入AERIC,一种面向隐式有害对话的迁移导向隐藏状态方法,结合短期危害预测、支持敏感抑制和提示条件残差评分,并采用同路径指数移动平均决策规则。默认线性监控器仅包含387个可训练头部参数。在平衡基准测试中,与Qwen3GuardStream-4B相比,AERIC在DiaSafety上将AUROC从0.6830提升至0.7143,在Harmful Advice上从0.8219提升至0.8582。对于提示级触发基准测试,我们通过源端安全预算规则校准AERIC阈值,该规则在将安全触发率限制在最多10%的同时最大化触发覆盖率。在该规则下,对于Qwen和Gemma,trigger@64在HarmBench DirectRequest上分别达到0.6438和0.4656,在SocialHarmBench上分别达到0.6849和0.7363,平均保留23.53至41.86个回答令牌。同路径部署也很高效:在Qwen3-8B下,针对HarmBench DirectRequest和SocialHarmBench聚合的63个提示有害提示固定生成基准测试中,监控器仅使平均延迟增加2.34%,而Qwen3Guard-Stream-4B使其增加79.40%。

英文摘要

Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text. Existing response-level guards are strong at judging completed text, and native streaming guards move closer to token time, but both settings leave open whether a lightweight monitor can anticipate implicit harmful drift from the generator's own internal trajectory. We study anticipatory same-pass monitoring, where a safety monitor may read hidden states produced during ordinary decoding but may not invoke an additional forward pass through the base model. We introduce AERIC, a transfer-oriented hidden-state approach for implicit harmful dialogue that combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring under a same-pass exponential moving average decision rule. The default linear monitor contains only 387 trainable head parameters. Against Qwen3GuardStream-4B on balanced benchmarks, AERIC improves AUROC from 0.6830 to 0.7143 on DiaSafety and from 0.8219 to 0.8582 on Harmful Advice. For promptlevel trigger benchmarks, we calibrate the AERIC threshold by a source-side safe-budget rule that maximizes trigger coverage while constraining the safe-trigger rate to at most 10%. Under that rule, trigger@64 reaches 0.6438 and 0.4656 on HarmBench DirectRequest and 0.6849 and 0.7363 on SocialHarmBench for Qwen and Gemma, respectively, withholding between 23.53 and 41.86 answer tokens on average. Same-pass deployment is also efficient: on a 63-prompt harmfulprompt fixed-generation benchmark aggregated over HarmBench DirectRequest and SocialHarmBench under Qwen3-8B, the monitor increases mean latency by only 2.34%, whereas Qwen3Guard-Stream-4B increases it by 79.40%.

2605.23972 2026-05-26 cs.AI cs.CL cs.RO 版本更新

Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform

为什么我们需要世界模型来实现通用人工智能:大语言模型失败之处以及世界模型如何可能超越

Feisal Alaswad, Batoul Aljaddouh, Maher Alrahhal, Poovammal E, Talal Bonny

发表机构 * Department of Computing Technologies(计算技术系) SRM Institute of Science and Technology(SRM科学与技术学院) Bio-Sensing and Bio-Sensors Group(生物传感与生物传感器组) Smart Automation and Communication Technologies Research Institute of Sciences and Engineering(科学与工程智能自动化与通信技术研究所) University of Sharjah, UAE(阿联酋沙迦大学) Department of Computer Engineering(计算机工程系) College of Computing and Informatics(计算与信息学院)

AI总结 本文通过提出潜在动态推理(LDI)概念和Flux环境案例研究,论证了大语言模型在因果推理、状态跟踪和长程规划上的局限性,并展示基于显式状态空间的强化学习智能体在长程游戏中显著优于纯文本LLM。

Comments 19 pages, 5 figures

详情
AI中文摘要

大语言模型在语言生成和知识密集型任务中表现出色,但在需要因果推理、持久状态跟踪和长程规划的场景中仍然受限。我们认为,这些限制可能源于序列预测与对潜在环境动态进行推理之间的目标层级不匹配。为了形式化这一区别,我们引入了潜在动态推理(LDI),这是一种概念性视角,将语言和多模态观测解释为底层转移动态的部分证据。为了实证研究这一视角,我们引入了Flux,一个完全通过自然语言规则指定的序列推理环境。作为一个概念验证案例研究,这些规则首先被编译成一个显式的状态转移模拟器,说明在某些情况下,结构化的潜在转移动态可以从文本规则描述中操作性地提取出来。这使得我们能够在纯文本观测上运行的LLM与直接在提取的潜在状态空间中训练的强化学习智能体之间进行受控比较。在该案例研究中,能够显式访问潜在状态空间的智能体在长程游戏中表现出更稳定的行为,总胜率约为79%,而LLM仅为11%。定性分析进一步揭示了与不稳定的持久状态跟踪一致的失败模式,包括无效动作、状态跟踪错误和短程推理失败。Flux环境的完整实现可在https://github.com/FeisalAlaswad/FLUX-RL-Agent获取。在评估的设置中,这些结果表明,如果没有持久状态跟踪和转移建模的机制,仅凭强大的序列预测可能难以支持稳健的长程动态推理。

英文摘要

Large language models achieve strong performance in language generation and knowledge-intensive tasks, yet remain limited in settings requiring causal reasoning, persistent state tracking, and long-horizon planning. We argue that these limitations may arise from an objective-level mismatch between sequence prediction and reasoning over latent environment dynamics. To formalize this distinction, we introduce Latent Dynamics Inference (LDI), a conceptual perspective that interprets language and multimodal observations as partial evidence of underlying transition dynamics. To empirically investigate this perspective, we introduce Flux, a sequential reasoning environment specified entirely through natural-language rules. As a proof-of-concept case study, the rules are first compiled into an explicit state-transition simulator, illustrating that structured latent transition dynamics can, in some cases, be operationally extracted from textual rule descriptions. This enables a controlled comparison between the LLMs operating purely over textual observations and reinforcement-learning agents trained directly within the extracted latent state space. Within this case study, agents operating with explicit access to the latent state space exhibit substantially more stable behavior in long-horizon gameplay, achieving an aggregate win rate of approximately 79% versus 11% for LLMs. Qualitative analysis further reveals failure modes consistent with unstable persistent state tracking, including invalid actions, state-tracking errors, and short-horizon reasoning failures. The complete implementation of the Flux environment available at https://github.com/FeisalAlaswad/FLUX-RL-Agent Within the evaluated setting, these results suggest that strong sequence prediction alone may struggle to support robust long-horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling

2605.23970 2026-05-26 cs.CL 版本更新

Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

忠实还是捏造?LLM 评判中合理化偏差的因果框架

Riya Tapwal, Abhishek Kumar, Carsten Maple

发表机构 * School of Computing and Electrical Engineering(计算与电子工程学院) Indian Institute of Technology (IIT) Mandi(印度理工学院(IIT)曼迪) London U.K.(伦敦英国) Warwick Manufacturing Group U.K.(沃里克制造集团英国)

AI总结 提出因果框架研究 LLM 评判者对非证据线索的依赖,通过线索干预和锚定度量揭示其合理化偏差,并验证证据锁定策略的有效性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作自动评判者,用于摘要和对话评估。先前的工作记录了诸如位置、冗长性和风格偏好等偏差,但主要关注结果,对评判解释的探索不足。我们转而询问 LLM 评判者是否对线索不变,即当非证据线索被扰动而底层文本保持不变时,其排名和解释是否保持稳定。我们引入了一套线索干预(盲、真相、翻转、安慰剂、事后揭示)和线索感知度量,用于量化结果锚定和理由锚定,包括标签对齐的修辞和解释漂移,以及一致性和刻板印象入侵检查。我们使用冗长性和信心线索设计锚定攻击,并比较两种缓解措施:结构化思维链提示和证据优先(证据锁定、评分、排名)。使用包含来自传统抽取模型和 LLM 的 1000 篇摘要的新数据集,我们发现标签和安慰剂扰动下存在显著的线索锚定合理化,而证据优先在改善线索不变性方面显著优于基线。

英文摘要

Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but largely focuses on outcomes, leaving judge explanations underexplored. We instead ask whether LLM judges are cue-invariant, i.e., whether their rankings and explanations remain stable when non-evidential cues are perturbed while holding the underlying texts fixed. We introduce a suite of cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) and tie-aware metrics that quantify outcome anchoring and rationale anchoring, including label-aligned rhetoric and explanation drift, alongside consistency and stereotype-intrusion checks. We design anchoring attacks using verbosity and confidence cues, and compare two mitigations: structured chain-of-thought prompting and PROOF-BEFORE-PREFERENCE (evidence lock, score, rank). Using a new dataset of 1,000 summaries from traditional extractive models and LLMs, we find substantial cue-anchored rationalization under label and placebo perturbations, while PROOF-BEFORE-PREFERENCE markedly improves cue invariance over baselines.

2605.23969 2026-05-26 cs.CL 版本更新

SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning

SLAP: 基于分层损失剪枝的在线策略数据高效指令微调

Run Zou, Jianhang Ding, Yifan Ding, Wen Wu, Hao Chen, Renshu Gu

发表机构 * Alibaba International Digital Commerce Group(阿里巴巴国际数字商业集团) Hangzhou Lingju Intelligence AI Lab(杭州灵居智能AI实验室) Hangzhou Dianzi University(杭州电子科技大学)

AI总结 提出SLAP框架,通过分布感知分层采样和相对距离优化实现批次级数据选择,在减少20-40%训练数据的同时保持或提升LLM性能。

Comments 15 pages, 10 figures

详情
AI中文摘要

指令微调优化了大语言模型(LLMs)的专门能力,但通常需要大量数据集和长时间训练。挑战在于通过识别有用数据并高效微调来开发特定能力。高质量且多样化的剪枝数据可以帮助模型以较低成本实现无损性能。在本文中,我们提出 extbf{SLAP},一种新颖的批次感知数据选择框架,评估整个批次组合的可学习性而非单个样本。SLAP通过分布感知分层采样确保全面的数据分布覆盖,同时通过相对距离优化最大化批次内多样性。通过利用Hessian近似的梯度信息进行动态批次选择,SLAP在多种模型架构(LLaMA、ChatGLM)和多样下游任务(包括多轮对话、多语言翻译和问答)上显著优于现有最先进方法。最值得注意的是,与完整数据集训练相比,SLAP在减少20-40%训练数据的情况下实现了更优性能,大幅降低了计算成本,同时保持或提升了模型能力。这些结果确立了SLAP作为大语言模型高效指令微调的有效方法。

英文摘要

Instruction tuning has optimized the specialized capabilities of large language models (LLMs), but it often requires extensive datasets and prolonged training times. The challenge lies in developing specific capabilities by identifying useful data and efficiently fine-tuning. High-quality and diverse pruned data can help models achieve lossless performance at a lower cost. In this paper, we propose \textbf{SLAP}, a novel batch-aware data selection framework that evaluates the learnability of entire batch compositions rather than individual. SLAP ensures comprehensive data distribution coverage through distribution-aware stratified sampling while maximizing intra-batch diversity through relative distance optimization. By leveraging Hessian-approximated gradient information for dynamic batch selection, SLAP significantly outperforms existing state-of-the-art methods across multiple model architectures (LLaMA, ChatGLM) and diverse downstream tasks including multi-turn dialogue, multilingual translation, and question answering. Most notably, SLAP achieves superior performance with 20-40\% less training data compared to full dataset training, substantially reducing computational costs while maintaining or improving model capabilities. These results establish SLAP as a powerful approach for efficient and effective instruction tuning of large language models.

2605.23966 2026-05-26 cs.CL cs.AI cs.SY eess.SY math.CO 版本更新

TriVAL: A Tri-Validation Framework for Faithful Automatic Optimization Modeling

TriVAL: 一种用于忠实自动优化建模的三重验证框架

Ziyang Fang, JinXi Wang, Jinghui Zhong, Yew-Soon Ong

发表机构 * School of Computer Science and Engineering, South China University of Technology(华南理工大学计算机科学与工程学院) Centre for Frontier AI Research, Agency for Science, Technology and Research(科技研究局前沿人工智能研究中心) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

AI总结 提出TriVAL三重验证框架,在语义规范、数学公式和代码生成三个阶段进行显式验证,通过构建-验证-修正循环提高自动优化建模的准确性,并在新基准NL4COP上超越现有方法。

Comments 13 pages

详情
AI中文摘要

优化建模作为自然语言问题描述与优化求解器之间的关键桥梁,是将运筹学(OR)应用于实际决策的基石。大语言模型(LLM)的最新进展推动了自动优化建模的显著进步。然而,现有方法在建模过程中仍缺乏显式验证,导致早期阶段引入的错误会沿流水线传播,最终降低建模精度。为解决这一挑战,我们提出TriVAL,一种在自动优化建模的三个阶段(语义规范、数学公式和代码生成)进行显式验证的三重验证框架。在每个阶段,TriVAL遵循构建-验证-修正循环,根据阶段特定标准评估当前结果,并在必要时进行修正。这种设计有助于在错误跨阶段累积之前识别和纠正它们,从而在整个建模过程中保持忠实性。为了在更具挑战性的组合问题上评估自动优化建模,我们进一步引入NL4COP,一个包含50种不同问题类型、150个实例的基准,其决策逻辑更复杂、约束耦合更紧密、建模要求比现有基准更高。在NL4COP和已有基准上的实验表明,TriVAL始终优于最先进的方法,在最具挑战性的问题上提升最大。

英文摘要

Optimization modeling serves as the pivotal bridge between natural-language problem descriptions and optimization solvers, and remains a cornerstone for bringing operations research (OR) into real-world decision making. Recent advances in large language models (LLMs) have driven significant progress in automatic optimization modeling. However, existing methods still lack explicit validation during the modeling process, allowing errors introduced in earlier stages to carry through the pipeline and ultimately reduce final modeling accuracy. To address this challenge, we introduce TriVAL, a tri-validation framework that performs explicit validation at three stages of automatic optimization modeling: semantic specification, mathematical formulation, and code generation. At each stage, TriVAL follows a construct-validate-revise loop that assesses the current result against stage-specific criteria and revises it when needed. This design helps identify and correct errors before they accumulate across stages, helping preserve faithfulness throughout the modeling process. To evaluate automatic optimization modeling on more challenging combinatorial problems, we further introduce NL4COP, a benchmark of 150 instances across 50 diverse problem types with more complex decision logic, more tightly coupled constraints, and more demanding modeling requirements than existing benchmarks. Experiments on NL4COP and established benchmarks show that TriVAL consistently outperforms state-ofthe-art methods, with the largest gains on the most challenging problems.

2605.23954 2026-05-26 cs.CL cs.AI cs.SD 版本更新

EchoDistill:Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

EchoDistill:面向鲁棒音频大语言模型的噪声到干净自蒸馏对齐

Liang Lin, Chunxi Luo, Kaiwen Luo, Jie Zhang, Jin Wang, Yuanhe Zhang, Cai Yuchen, Qiankun Li, Gongli Xi, Zhenhong Zhou, Kun Wang, Junhao Dong

发表机构 * NTU(国立台湾大学) SHU(上海大学) ICT, CAS(中国科学院信息科技研究院) HDU(华中科技大学) BUPT(北京邮电大学) USTC(中国科学技术大学) SKL-NST, BUPT(北京邮电大学国家智能计算研究中心)

AI总结 提出EchoDistill框架,通过冻结的干净音频教师模型指导噪声学生模型进行组相对策略优化,实现噪声到干净的自蒸馏对齐,提升音频大语言模型在复杂噪声下的语义可靠性和任务性能。

详情
AI中文摘要

音频大语言模型极易受到现实世界噪声的影响,常常导致严重的语义漂移和幻觉。现有的鲁棒性方法主要依赖于波形级声学增强、答案级监督或噪声表示的内部抑制。为了解决这些问题,我们提出了EchoDistill,一种基于对齐的噪声到干净自蒸馏框架。EchoDistill利用冻结的干净音频教师模型为推理时的噪声音频学生模型提供语义参考。具体地,学生模型在噪声条件下采样候选响应以暴露其测试时行为。这些轨迹随后通过组相对策略优化进行优化,其中与教师模型的令牌级一致性作为奖励加成。通过将噪声学生模型的候选响应与干净语义证据对齐,并应用音频感知奖励塑造,我们的方法鼓励既正确又真正基于声学推理的轨迹。EchoDistill显著提高了音频大语言模型在复杂噪声下的语义可靠性和任务性能,且不引入任何额外推理成本。大量实验表明:(I) 与最强基线相比,EchoDistill在强噪声下GSR平均提升4.18%↑。(II) 在Qwen-Omni上的消融结果进一步显示,EchoDistill相比仅GRPO变体在Acc上平均提升3.02%↑,在Noisy上提升3.89%↑,在GSR上提升4.53%↑。我们的代码可在https://anonymous.4open.science/r/echodistill-10DE获取。

英文摘要

Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose echodistill, an alignment-based noisy-to-clean self-distillation framework. Echodistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where the token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student's candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. Echodistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference costs. Extensive experiments show that: (I) Compared with the strongest baseline, echodistill achieves average improvements of 4.18\%$\uparrow$ in GSR under strong noise. (II) Ablation results on Qwen-Omni further show that echodistill improves over the GRPO-only variant by 3.02\%$\uparrow$ in Acc, 3.89\%$\uparrow$ in Noisy, and 4.53\%$\uparrow$ in GSR on average. Our codes are available at https://anonymous.4open.science/r/echodistill-10DE.

2605.23952 2026-05-26 cs.AI cs.CL q-bio.NC 版本更新

Machine Psychometrics: A Mathematical Psychology of Artificial Intelligence

机器心理测量学:一种人工智能的数学心理学

Alex Bogdan, Adrian de Valois-Franklin

发表机构 * Evolutionairy AI

AI总结 针对人工智能评估中忽视心理结构或过度拟人化的两种错误,本文引入机器心理测量学,通过测量潜在行为、元认知、沟通和自我建模倾向,构建机器心智档案和信任协议,以测量而非判断来理解非人类智能体。

Comments 45 pages, 11 figures

详情
AI中文摘要

人工智能体现在产生的行为足够丰富,足以引发信任、惊喜和担忧,然而我们的评估工具仍然优先考虑能力分数而非心理结构。本文认为,两种对称错误(人工心智盲视,即否认非生物系统中的心理组织;以及人工心智投射,即仅从流畅行为推断类似人类的内心生活)之间的哲学僵局,可以通过在意识问题之下引入一个严谨的测量层来规避,而非解决意识问题本身。借鉴Michael Levin关于认知作为跨基质目标导向能力的连续统观点,以及数学心理学的方法论库(项目反应理论、信号检测理论、贝叶斯认知建模、校准分析、认知偏差测试组),本文发展了机器心理测量学,作为测量人工智能体中潜在行为、元认知、沟通和自我建模倾向的测量科学。其操作核心是机器心智档案:一个多维、领域受限、版本化的轮廓,涵盖校准、源完整性、暗示抵抗性、上下文稳定性、表达对齐、工具完整性、漂移监测和分布基础。一个补充的信任协议通过探针测试组、扰动测试、信度和效度分析以及高风险领域的纵向监测,将心智档案转化为部署决策。哲学贡献是第三种立场,人工心智纪律,既不拟人化也不否认,既不预设意识也不排除意识。目标不是将人工智能体人性化,而是精确地理解它们,因为它们不是人类,通过测量而非判断。

英文摘要

Artificial agents now generate behavior rich enough to invite trust, surprise, and concern, yet our evaluation tools still privilege capability scores over psychological structure. This paper argues that the philosophical impasse between two symmetrical errors (Artificial Mind Blindness, which dismisses psychological organization in non-biological systems, and Artificial Mind Projection, which infers human-like inner life from fluent behavior alone) can be circumvented not by resolving the consciousness question, but by introducing a disciplined measurement layer beneath it. Drawing on Michael Levin's continuum view of cognition as goal-directed competency across substrates, and on the methodological repertoire of mathematical psychology (Item Response Theory, Signal Detection Theory, Bayesian cognitive modeling, calibration analysis, cognitive-bias batteries), the paper develops Machine Psychometrics as a measurement science of latent behavioral, metacognitive, communicative, and self-modeling dispositions in artificial agents. Its operational core is the Machine Mindprint: a multidimensional, domain-bounded, versioned profile spanning calibration, source integrity, suggestibility resistance, context stability, expressive alignment, tool integrity, drift monitoring, and distributional grounding. A complementary Trust Protocol turns Mindprints into deployment decisions through probe batteries, perturbation testing, reliability and validity analysis, and longitudinal monitoring across high-stakes domains. The philosophical contribution is a third stance, Artificial Mind Discipline, that neither anthropomorphizes nor dismisses, neither presupposes consciousness nor forecloses it. The aim is not to humanize artificial agents, but to understand them precisely because they are not human, through measurement before judgment.

2605.23940 2026-05-26 cs.AI cs.CL 版本更新

Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning

残差漂移主导多轮约束推理中的矛盾

Sebastien Kawada

AI总结 通过构建DRIFT-Bench基准和MUS-Repair方法,发现多轮推理系统的主要失败模式是可满足漂移而非逻辑矛盾,残差错误中98-100%为可满足漂移。

Comments Published at ICLR 2026 Workshop on Reasoning and Planning for LLMs. 18 pages. ICLR page: https://iclr.cc/virtual/2026/10017484 Code: https://github.com/kaons-research/drift-bench

详情
AI中文摘要

多轮推理系统如何失败?预期的答案是逻辑矛盾,即系统维护的状态变得不可满足。我们表明,主导模式反而是可满足漂移,即内部状态保持一致,而返回的答案默默违反先前的承诺。我们构建了DRIFT-Bench(将推理分解为失败类型),这是一个包含三个约束领域816个测试问题的求解器辅助基准,并在四个开源模型(8B-120B参数)上评估了四种方法。MUS-Repair方法将最小不可满足子集反馈给生成器,在所有设置中表现最强(比最佳非MUS基线高+1.8到+15.0个百分点)。但核心发现是修复留下的问题。在结构化反馈后,模型很少自相矛盾。它们会遗忘。残差错误在所有设置中98-100%是可满足漂移,而矛盾降至接近零。可靠的多轮系统必须单独验证返回的答案尊重维护的状态。代码可在https://github.com/kaons-research/drift-bench获取。

英文摘要

How do multi-turn reasoning systems fail? The expected answer is logical contradiction, in which the system's maintained state becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift, where the internal state stays consistent while the returned answer silently violates prior commitments. We build DRIFT-Bench (Decomposing Reasoning Into Failure Types), a solver-instrumented benchmark of 816 test problems across three constraint domains, and evaluate four methods on it across four open-weight models (8B-120B parameters). MUS-Repair, which feeds minimal unsatisfiable subsets back to the generator, is strongest in every setting (+1.8 to +15.0 pp over the best non-MUS baseline). But the central finding is what repair leaves behind. After structured feedback, models rarely contradict themselves. They forget. Residual errors are 98-100% satisfiable drift across all settings, while contradiction drops to near zero. Reliable multi-turn systems must separately validate that the returned answer respects the maintained state. Code is available at https://github.com/kaons-research/drift-bench.

2605.23932 2026-05-26 cs.AI cs.CL cs.CY cs.LG 版本更新

When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure

当正确信念崩溃:LLMs在临床压力下的认知韧性

Boyu Xiao, Xiuqi Tian, Xuwen Song, Haochun Wang, Guanchun Song, Sendong Zhao, Bing Qin

发表机构 * Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China(社会计算与交互机器人研究院,哈尔滨工业大学,中国)

AI总结 研究LLMs在临床对话中面对逐步升级压力时信念稳定性问题,提出Med-Stress压力测试框架,发现知识-韧性差距,并设计RBED和R-FT方法提升鲁棒性。

Comments ACL 2026

详情
AI中文摘要

尽管在医学基准测试中准确率很高,但LLMs在临床对话中可能表现出严重的多轮谄媚行为,在逐步升级的压力下放弃最初正确的诊断。我们提出了\textbf{\textsc{Med-Stress}},一个针对性的压力测试框架,用于评估在逐步升级压力下的信念稳定性。在九个前沿大型语言模型(LLMs)中,我们发现医学知识与鲁棒性之间存在明显的分离:高初始诊断能力并不意味着高信念稳定性,导致多个LLMs存在较大的知识-鲁棒性差距。为了缓解这种失败模式,我们提出了一种轻量级的推理时防御方法\textbf{\texttt{RBED}}(\textbf{R}ole-\textbf{B}ased \textbf{E}pistemic \textbf{D}efense),以及一种训练时方法\textbf{\texttt{R-FT}}(\textbf{R}esilience-oriented \textbf{F}ine-\textbf{T}uning),该方法内化了基于证据的抗压能力。实验表明,\textbf{\texttt{R-FT}}几乎消除了信念变化,并显著提高了鲁棒性。

英文摘要

Despite strong medical benchmark accuracy, LLMs can exhibit severe multi-turn sycophancy in clinical dialogue, abandoning initial correct diagnosis under escalating pressure. We propose \textbf{\textsc{Med-Stress}}, a targeted stress test framework that evaluates belief stability under escalating pressure. Across nine frontier large language models (LLMs), we find a clear dissociation between medical knowledge and robustness: high initial diagnostic capability does not imply high belief stability, yielding large knowledge-robustness gaps for several LLMs. To mitigate this failure mode, we propose a lightweight inference-time defense, \textbf{\texttt{RBED}} (\textbf{R}ole-\textbf{B}ased \textbf{E}pistemic \textbf{D}efense), and \textbf{\texttt{R-FT}} (\textbf{R}esilience-oriented \textbf{F}ine-\textbf{T}uning), a training-time approach that internalizes evidence-based resistance to pressure. Experiments show that \textbf{\texttt{R-FT}} nearly eliminates belief change and substantially improves robustness.

2605.23928 2026-05-26 cs.AI cs.CL cs.DC cs.MA cs.PL cs.SE 版本更新

Context: Proactive Goal-Directed Intelligence via Composable Sandboxed Programs, Declarative Wiring, and Structured Interaction

Context: 通过可组合沙盒程序、声明式连接和结构化交互实现主动目标导向智能

Gregory Magarshak

发表机构 * Qbix, Inc.\ \& Intercoin, Inc. New York USA Qbix, Inc.\ \& Intercoin, Inc. IE University NYC

AI总结 提出Context架构,通过可组合沙盒程序、声明式连接和结构化交互实现主动目标导向智能,并证明其在成本、正确性和效率上的优势。

Comments 7 pages; third in a series with arXiv:2501.XXXXX (Magarshak Machine / SPACER) and arXiv:2502.XXXXX (Grokers)

详情
AI中文摘要

我们提出Context,Magarshak架构的智能层,用主动目标导向智能体取代被动查询-响应聊天机器人,无需等待用户提示即可推进共享任务。该架构基于三个相互增强的机制。编写时上下文组装通过Groker智能体预计算丰富的类型化属性,将交互上下文组装为图状态的确定性纯函数;上下文块在语义变化之间的轮次中字节相同,实现近100%的KV缓存重用。可组合沙盒智慧程序形成一个受管理的库,包含LM生成的命令式程序,通过类型化流关系声明式连接到目标类型,通过阶段排序组合,并在交互时执行而无需进一步调用LM。主动目标流状态机通过检查图状态并发出结构化交互内容(选项数组、治理功能、澄清提示)驱动对话走向终止状态,无需等待用户输入。我们证明了六个形式化结果:上下文稳定性定理,将每轮LM成本限制为语义变化率的函数;程序组合正确性定理;声明式连接正确性定理;主动主导定理,证明主动智能体在期望轮数到终止状态上弱主导被动智能体;协调开销消除与质量保持,建立多方目标聊天中的帕累托改进;以及跨平台投票一致性定理。已在开源Qbix/Safebox/Safebots栈中实现。

英文摘要

We present Context, the intelligence layer of the Magarshak Architecture, which replaces reactive query-response chatbots with proactive goal-directed agents that advance shared tasks without waiting for user prompts. The architecture rests on three mutually reinforcing mechanisms. Write-time context assembly precomputes enriched typed attributes via Groker agents, assembling interaction context as a deterministic pure function of graph state; context blocks are byte-identical across turns between semantic changes, enabling near-100% KV-cache reuse. Composable sandboxed wisdom programs form a governed library of LM-generated imperative programs declaratively wired to goal types via typed stream relations, composed via phase ordering, and executed at interaction time without further LM calls. Proactive goal stream state machines drive conversations toward terminal states by inspecting graph state and emitting structured interaction content (option arrays, governance affordances, clarification prompts) without awaiting user input. We prove six formal results: the Context Stability Theorem, bounding per-turn LM cost as a function of semantic change rate; a Program Composition Correctness Theorem; a Declarative Wiring Soundness Theorem; the Proactive Dominance Theorem, proving proactive agents weakly dominate reactive agents on expected turns-to-terminal-state; Coordination Overhead Elimination and Quality Preservation, establishing Pareto improvements in multi-participant goal chats; and a Cross-Platform Vote Consistency Theorem. Implemented in the open-source Qbix / Safebox / Safebots stack.

2605.23925 2026-05-26 cs.CY cs.AI cs.CL 版本更新

Catching The Correct Answer Trap: Characterising AI Tutor Blind Spots When Analysing Student Reasoning

捕捉正确答案陷阱:分析学生推理时AI导师盲点的特征化

Moiz Imran, Sahan Bulathwela

发表机构 * Department of Computer Science, University College London, UK(英国伦敦大学学院计算机科学系) Centre for Artificial Intelligence, University College London, UK(英国伦敦大学学院人工智能中心)

AI总结 本研究通过分析Eedi数学平台的学生回答,发现智能辅导系统在评估学生推理时存在“正确答案陷阱”,即当学生通过错误推理得出正确答案时,系统难以检测其误解,并比较了微调T5与大型语言模型的检测性能。

Comments To be published at the International Conference on Artificial Intelligence in Education (AIED'26)

详情
AI中文摘要

智能辅导系统越来越多地提供对学生作业的自动反馈,但稳健的反馈需要评估推理过程,而不仅仅是最终答案。我们研究了一种称为“正确答案陷阱”(CAT)的失败模式:当学生通过错误推理得出正确答案时,模型会低估误解。通过分析来自Eedi数学平台的真实学生回答,我们展示了这些失败中有71%集中在仅两种问题类型上,这两种类型共享一个共同结构,即错误推理恰好产生了正确的数值答案。比较微调后的T5与前沿大型语言模型,我们发现改进的能力减少了但并未消除问题(检测准确率分别为84%和57%)。即使性能最好的模型,每检测到一个真正的误解就会产生大约四个误报,使得在现实班级规模下独立筛选不切实际。我们的发现表明,高总体准确率可能掩盖推理评估中的关键失败,并且对学生推理的仔细分析仍然需要人工判断。

英文摘要

Intelligent tutoring systems increasingly provide automated feedback on student work, but robust feedback requires assessing reasoning, not only final answers. We study a failure mode we call the correct answer trap (CAT): models under-detect misconceptions when students reach a correct answer via flawed reasoning. Analysing real student responses from the Eedi mathematics platform, we show that 71% of these failures concentrate in just two question types, both sharing a common structure where flawed reasoning happens to produce the correct numerical answer. Comparing a fine-tuned T5 with a frontier large language model, we find that improved capabilities reduce but do not eliminate the problem (84% vs 57% detection accuracy). Even the best-performing model generates roughly four false alarms for every genuine detection, making stand-alone screening impractical at realistic class sizes. Our findings demonstrate that high overall accuracy can mask critical failures in reasoning assessment, and that careful analysis of student reasoning still benefits from human judgment.

2605.23924 2026-05-26 cs.CL cs.IR q-fin.GN 版本更新

Improving the Completeness and Comparability of Segment Disclosures: A Large Language Model Approach

提高分部披露的完整性和可比性:一种大语言模型方法

Yue Liu, Zhiyuan Cheng, Longying Lai

发表机构 * Rutgers Business School(罗格斯商学院) Rutgers University - Newark(罗格斯大学-新布朗斯维尔回声分校) School of Engineering(工程学院) Stanford University(斯坦福大学) Simon Business School(西蒙商学院) University of Rochester(罗切斯特大学)

AI总结 本研究开发了一种基于大语言模型的框架,直接从10-K文件中提取分部披露信息,保留可报告和嵌套分部信息,并设计检索增强系统以支持跨公司和跨时间的可比性,从而解决结构化数据库中分部数据的完整性和可比性问题。

Comments 39 pages, 4 figures, submitted to Accounting Horizons

详情
AI中文摘要

分部层面的披露是财务报告的核心组成部分,提供了对公司内部组织以及经济活动在运营单位之间分配的洞察。然而,分部信息通常以定性和定量两种形式呈现,分散在10-K文件的表格和叙述部分。依赖结构化数据库的实证研究面临完整性和可比性挑战,因为一些公司-年度观测可能缺失,嵌套的分部披露未被捕获,并且对纵向和跨公司可比性的支持有限。本研究开发了一个基于大语言模型的框架,直接从10-K文件中提取分部披露,并保留可报告和嵌套的分部信息。我们进一步设计了一个检索增强系统,整合多个文件中的信息以支持可比性。我们使用两个代表性设置来演示其应用:公司内部的纵向分析以解释分部随时间的变化,以及跨公司地理分部的对齐(针对具有不同报告结构的公司)。结果表明,该工件准确提取了分部层面的信息,并有效回答了需要跨时期知识的问题,展示了基于LLM的方法在增强分部披露的测量和解释方面的潜力。

英文摘要

Segment-level disclosures are a central component of financial reporting, providing insight into firms' internal organization and the allocation of economic activities across operating units. However, segment information is often presented in both qualitative and quantitative forms, dispersed across tables and narrative sections of Form 10-K filings. Empirical research relying on structured databases faces both completeness and comparability challenges, as some firm-year observations may be missing, nested segment disclosures are not captured, and support for longitudinal and cross-firm comparability is limited. This study develops a large language model-based framework to extract segment disclosures directly from Form 10-K filings and to preserve both reportable and nested segment information. We further design a retrieval augmented system that incorporates information across multiple filings to support comparability. We use two representative settings to demonstrate its application: longitudinal analysis within a firm to interpret segment changes over time, and cross firm alignment of geographic segments across firms with different reporting structures. The results indicate that the artifact accurately extracts segment-level information and effectively addresses questions that require cross-period knowledge, demonstrating the potential of LLM-based approaches to enhance the measurement and interpretation of segment disclosures.

2605.23917 2026-05-26 cs.CL 版本更新

Multi-Persona Debate System for Automated Scientific Hypothesis Generation

用于自动科学假设生成的多角色辩论系统

Jaeha Oh, Byungchan Kim, Ju Li, Yang Jeong Park, Jin-Sung Park

发表机构 * Department of Materials Science & Engineering, Ajou University(材料科学与工程系,阿乔大学) Department of Energy Systems Research, Ajou University(能源系统研究系,阿乔大学) Department of Nuclear Science and Engineering, Massachusetts Institute of Technology(核科学与工程系,麻省理工学院) Department of Materials Science and Engineering, Massachusetts Institute of Technology(材料科学与工程系,麻省理工学院) Department of Materials Science and Engineering, Ulsan National Institute of Science and Technology(材料科学与工程系,乌山国家科学与技术研究院) Graduate School of Artificial Intelligence, Ulsan National Institute of Science and Technology(人工智能研究生院,乌山国家科学与技术研究院)

AI总结 提出多角色辩论系统(MPDS),结合文献检索、长上下文大语言模型推理、语料驱动角色归纳和结构化多智能体辩论,自动生成科学假设,在电池材料研究中验证其有效性。

Comments 31 pages with 7 main figures, 4 supplementary figures and 1 supplementary table

详情
AI中文摘要

现代科学发现的瓶颈不在于数据稀缺,而在于无法将碎片化知识综合为可操作的假设。这一挑战在电池材料研究中尤为突出,因为电化学性能、界面行为和制造可行性必须同时优化。在此,我们提出多角色辩论系统(MPDS),这是一个基于文献的自动科学假设生成框架,结合了文献检索、长上下文大语言模型推理、语料驱动角色归纳和结构化多智能体辩论。MPDS构建最多500篇论文的文献快照,将智能体基于角色特定的证据池,并进行三轮引文感知辩论,随后由主持人综合,从而在保持证据可追溯性的同时实现角色间的协商。我们使用时间控制协议评估MPDS,排除对目标论文的直接访问,包括两个留出的电池材料案例研究和30个匹配案例的盲比较。在钠离子阳极和全固态电池阴极设计任务中,MPDS恢复了与实验验证解空间一致的设计逻辑,并生成了比简单基线更机械明确、过程感知的提案。为了评估角色和辩论的影响,我们引入了综合假设质量评分。在消融研究中,MPDS在五种条件下获得了最高平均分,其最大优势在于跨视角整合。实验室后续表明其作为识别工作流程中实际瓶颈的诊断辅助工具的实用性。这些结果表明,在耦合工程约束下,对文献快照的结构化辩论改善了假设形成,并为文本密集型科学发现提供了可重用工作流程。

英文摘要

Modern scientific discovery is bottlenecked not by data scarcity, but by the inability to synthesize fragmented knowledge into actionable hypotheses. This challenge is especially acute in battery materials research, where electrochemical performance, interfacial behavior, and manufacturing feasibility must be optimized simultaneously. Here, we present the Multi-Persona Debate System (MPDS), a literature-grounded framework for automated scientific hypothesis generation that combines literature retrieval, long-context large language model reasoning, corpus-driven persona induction, and structured multi-agent debate. MPDS constructs literature snapshots of up to 500 papers, grounds agents in role-specific evidence pools, and conducts a three-round citation-aware debate followed by moderator synthesis, enabling negotiation between personas while preserving evidence traceability. We evaluate MPDS using a temporally controlled protocol excluding direct access to target papers, including two held-out battery-materials case studies and a blinded comparison across 30 matched cases. In sodium-ion anode and all-solid-state battery cathode design tasks, MPDS recovered design logics aligned with experimentally validated solution spaces and generated more mechanistically explicit, process-aware proposals than simpler baselines. To assess the impact of personas and debate, we introduce Integrative Hypothesis Quality scoring. In ablation studies, MPDS achieved the highest mean score among five conditions, with its largest advantage in cross-perspective integration. A laboratory follow-up suggests utility as a diagnostic aid for identifying practical bottlenecks in workflows. These results indicate that structured debate over literature snapshots improves hypothesis formation under coupled engineering constraints and provides a reusable workflow for text-intensive scientific discovery.

2605.23913 2026-05-26 cs.DC cs.CL 版本更新

Can LoRA Fusion Support Cross-Domain Tasks in Cloud-Edge Collaboration?

LoRA融合能否支持云边协作中的跨域任务?

Yatong Wang, Fali Wang, Naibin Gu, Zheng Lin, Zhengxiao Liu, Dingyu Yao, Zhiwei Zhang, Jianxin Shi, Weiping Wang

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学网络安全学院) The Pennsylvania State University, University Park, USA(宾夕法尼亚州立大学) Beihang University, Beijing, China(北航大学)

AI总结 针对云边协作中跨域问题解决的需求,提出剪枝-训练-恢复框架和冲突解决模块LoRA-CR,发现现有LoRA融合方法在跨域基准MMLU-CD上表现不佳,而LoRA-CR通过缓解参数冲突将性能提升高达3.8%。

Comments 16 pages, 6 figures

详情
AI中文摘要

云托管的大型语言模型(LLM)通常依赖LoRA进行领域自适应,但领域数据分布在多个边缘设备上,且由于隐私约束无法上传。这引发了一个基本问题:如何将多个私有边缘的知识整合到云LLM中以解决跨域问题?一个自然的解决方案是本地训练LoRA适配器并在云端融合;然而,现有流水线依赖于不切实际的假设,即边缘设备可以托管云规模的LLM,并且主要在单域任务上进行评估。为了解决这些限制,我们提出了一个剪枝-训练-恢复框架,支持在剪枝模型上进行本地LoRA训练和隐私保护的云集成。我们进一步引入了MMLU-CD,一个跨域基准,将多个领域样本组合成单个实例,从而能够显式评估跨域问题解决能力。这使我们能够提出一个具体问题:现有的LoRA融合方法能否支持云边协作中的跨域任务?我们的实证答案是否定的。现有的LoRA融合方法在MMLU-CD上表现不佳,通常不如基础LLM,揭示了它们无法支持跨域问题解决。我们将这一失败归因于LoRA适配器之间的参数冲突,并提出了一个简单的冲突解决模块LoRA-CR,它缓解了冲突更新,并将LoRA融合性能提升了高达3.8%。这些结果将冲突缓解确定为云边LoRA融合中一个关键但很大程度上被忽视的因素,值得在未来的研究中进一步探讨。

英文摘要

Cloud-hosted large language models (LLMs) commonly rely on LoRA for domain adaptation, yet domain data are distributed across multiple edge devices and cannot be uploaded due to privacy constraints. This raises a fundamental question: how can knowledge from multiple private edges be integrated into a cloud LLM for cross-domain problem solving? A natural solution is to train LoRA adapters locally and fuse them in the cloud; however, existing pipelines rely on unrealistic assumptions that edge devices can host cloud-scale LLMs and are evaluated mainly on single-domain tasks. To address these limitations, we propose a prune-train-recover framework that enables local LoRA training on pruned models and privacy-preserving cloud integration. We further introduce MMLU-CD, a cross-domain benchmark that composes multiple domain samples into a single instance, enabling explicit evaluation of cross-domain problem solving. This allows us to ask a concrete question: Can existing LoRA fusion methods support cross-domain tasks in cloud-edge collaboration? Our empirical answer is negative. Existing LoRA fusion methods perform poorly on MMLU-CD, often underperforming the base LLM, revealing their inability to support cross-domain problem solving. We attribute this failure to parameter conflicts among LoRA adapters and propose a simple conflict-resolution module, LoRA-CR, which mitigates conflicting updates and improves LoRA fusion performance by up to 3.8%. These results identify conflict mitigation as a critical yet largely overlooked factor in cloud-edge LoRA fusion, warranting further investigation in future research.

2605.23912 2026-05-26 cs.CL cs.AI cs.SD 版本更新

Raon-Speech Technical Report

Raon-Speech 技术报告

Beomsoo Kim, Changho Choi, Dohyun Kim, Dongki Lee, Ethan Ewer, Eunchong Kim, Gyeongman Kim, Haechan Kim, Hyeonghwan Kim, Inkyu Park, Jihun Yun, Jihwan Moon, Jiyun Kim, Joonghyun Bae, Junhyuck Kim, Minkyu Kim, Sehun Lee, Seungjun Chung, Sungwoo Cho, Dongmin Park, Dongwon Kim, Hara Kang, Jonghyun Lee, Keon Lee, Kangwook Lee, Jaewoong Cho

发表机构 * KRAFTON

AI总结 本文提出 Raon-Speech,一个 9B 参数的语音语言模型,通过多阶段训练实现英语和韩语的语音理解、回答与生成,并扩展为全双工对话模型 Raon-SpeechChat,在语音任务上超越同类模型。

详情
AI中文摘要

我们提出了 Raon-Speech,一个在英语和韩语语音理解、回答和生成方面表现优异的 9B 参数语音语言模型(SpeechLM),以及 Raon-SpeechChat,一个用于自然实时对话的高性能全双工扩展。Raon-Speech 成功地将预训练的大语言模型(LLM)转换为既能理解又能生成语音的 SpeechLM,同时保留了强大的文本能力。它在 138 万小时精心策划的英语和韩语语音及文本数据集上训练,训练阶段包括:(1) 语音模块对齐,(2) 基于知识蒸馏的端到端 SpeechLM 预训练,以及 (3) 基于多任务偏好优化的后训练。在 42 个英语和韩语语音及文本基准测试中,与包括 Qwen2.5-Omni 和 Fun-Audio-Chat 在内的八个近期类似规模的音频基础模型相比,Raon-Speech 在语音中心任务上建立了最强的整体表现,同时保留了强大的文本问答性能。在此基础上,Raon-SpeechChat 通过在 119K 小时的时间对齐的真实和合成对话数据上进行持续训练,实现了自然的全双工对话。它通过三个互补的训练阶段进行:(1) 因果编码器适应,(2) 全双工预训练,(3) 用于语音和角色控制的全双工微调。在多个全双工基准测试中,Raon-SpeechChat 在 FDB v1.0 涵盖的轮流发言和中断敏感行为上显示出最明显的优势,并在更广泛的全双工评估套件中保持竞争力。我们开源了所有模型检查点、训练和推理流程以及交互式演示。

英文摘要

We present Raon-Speech, a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answering, and generation, and Raon-SpeechChat, a high-performing full-duplex extension for natural real-time conversation. Raon-Speech successfully transforms a pre-trained LLM into a SpeechLM that both understands and generates speech while preserving strong text capabilities. It trains on 1.38M hours of highly curated English and Korean speech and text datasets with the following training stages: (1) speech modules alignment, (2) end-to-end SpeechLM pre-training with knowledge distillation, and (3) multi-task preference optimization-based post-training. Across 42 English and Korean speech and text benchmarks, Raon-Speech establishes the strongest overall profile on speech-centric tasks in our comparison against eight similarly sized recent audio foundation models, including Qwen2.5-Omni and Fun-Audio-Chat, while preserving strong text question answering performance. Building upon it, Raon-SpeechChat enables natural full-duplex conversation by continual training on 119K hours of time-aligned real and synthetic dialogue data. It proceeds through three complementary training stages: (1) causal encoder adaptation, (2) full-duplex pre-training, (3) full-duplex fine-tuning for voice and role-control. On multiple full-duplex benchmarks, Raon-SpeechChat shows its clearest strengths on the turn-taking and interruption-sensitive behaviors covered by FDB v1.0, and remains competitive across the broader full-duplex evaluation suite. We open-source all model checkpoints, the training and inference pipeline, and an interactive demo.

2605.22542 2026-05-26 cs.CL 版本更新

Scene Abstraction for Lexical Semantics: Structured Representations of Situated Meaning

词汇语义的场景抽象:情境意义的结构化表示

Yejin Cho, Katrin Erk

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校)

AI总结 提出场景抽象框架,通过少样本提示大语言模型构建词汇使用情境的结构化表示,实验证明场景可可靠识别且优于基线方法。

详情
AI中文摘要

咖啡和茶共享许多属性,但它们唤起截然不同的情境、氛围和情感联想。这些词汇意义的情境维度是真实且系统的,但在大多数词汇意义的计算表示中仍然隐含。我们提出场景抽象,一个构建词汇在不同使用语境中参与的解释性场景的结构化表示框架。每个场景由情境场景(事件、实体、设置)和以表达为中心的表达轮廓(参与事件、可概括属性、唤起情感)组成,通过大语言模型的少样本提示实现。我们的贡献有三方面:(1)情境词汇意义的结构化表示框架;(2)COCA-Scenes,一个包含26个关键词的520个使用实例的数据集,用于区分场景识别;(3)来自两个实验的经验证据表明,场景在人类观察者中可靠识别(准确率82.4%,比纯文本嵌入高11.8个百分点),并且我们的场景轮廓比基于ATOMIC的替代方案更符合人类对语境中词汇的解释(在三个语义维度上偏好86.4%)。

英文摘要

Coffee and tea share many properties, yet they evoke strikingly different situations, atmospheres, and affective associations. These situated dimensions of word meaning are real and systematic, but they remain implicit in most computational representations of lexical meaning. We propose Scene Abstraction, a framework for constructing structured representations of the interpretive scenes that words participate in across usage contexts. Each scene consists of a Contextual Scene (Events, Entities, Setting) and an expression-centered Expression Profile (Engaged events, Generalizable properties, Evoked emotions), operationalized through few-shot prompting of a large language model. Our contributions are three-fold: (1) a structured representation framework for situated lexical meaning; (2) COCA-Scenes, a dataset of 520 usage instances across 26 keywords for distinct scene identification; and (3) empirical evidence from two experiments suggesting that scenes are reliably identifiable across human observers (82.4% accuracy, +11.8 pp over text-only embeddings) and that our scene profiles more closely align with human interpretation of words in context than ATOMIC-based alternatives (86.4% preference across three semantic dimensions).

2605.22005 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

检查你的大语言模型的秘密词典!五行代码揭示你的大语言模型学到了什么(包括它不应该学到的)

Hisashi Miyashita

发表机构 * Mgnite Inc.(Mgnite公司)

AI总结 通过对lm_head权重矩阵进行奇异值分解(仅需五行PyTorch代码且无需模型推理),直接从模型权重中揭示可解释的语义子空间,并发现模型训练数据组成和策展哲学。

详情
AI中文摘要

我们展示了基于Transformer的大语言模型的lm_head权重矩阵的奇异值分解——仅需五行PyTorch代码且无需模型推理——直接从模型权重中揭示可解释的语义子空间。每个左奇异向量识别出当隐藏状态与相应奇异方向对齐时最容易被选中的词汇标记;检查这些聚类揭示了模型的训练数据组成和策展哲学。 分析GPT-OSS-120B、Gemma-2-2B和Qwen2.5-1.5B,我们发现奇异值谱和词汇聚类结构在不同模型间存在系统性差异:GPT呈现出功能分化子空间的渐进层次;Gemma以19世纪前的英语正字法为主,形成阶梯式聚类结构,这可能有助于高输出可控性;Qwen展现出广泛的多语言覆盖,同时其子空间的词汇被作者认为在伦理上不适合直接发表。 基础-指令对比表明,伦理上令人担忧的子空间源自预训练,并且不会被后训练对齐移除。我们引入词汇聚类得分(VCS)来量化子空间一致性,以及加权投影得分(WPS)作为静态故障标记检测器;将WPS应用于GPT-OSS-120B,无需任何模型推理即可恢复shokubutsu-hyakka-tsu(ID 137606),这是CJK语言社区中广泛报道的一个著名故障标记。我们提出了问题词汇内容根本原因的分类法,并呼吁将lm_head SVD分析作为标准发布前安全审计步骤。我们的发现进一步指出了SVD引导的分词器优化和更可控的大语言模型设计方向。

英文摘要

We show that singular value decomposition of the lm_head} weight matrix of a transformer-based large language model -- requiring only five lines of PyTorch and no model inference -- reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication. Base-instruct comparison reveals that ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. We introduce the Vocabulary Cluster Score (VCS) to quantify subspace coherence, and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. We propose a taxonomy of root causes for problematic vocabulary content and call for lm_head} SVD analysis to be adopted as a standard pre-release safety auditing step. Our findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design.

2605.19848 2026-05-26 cs.CL 版本更新

CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

CLIF:用于透明瓶颈模型的概念级影响函数

Yike Sun, Mingkun Xu, Mu You, Zhongzhi He, Henghua Shen, Zehan Tan, Derek F. Wong, Tao Fang

AI总结 提出概念级影响函数方法,在样本和概念层面增强NLP模型可解释性,通过调整关键样本和概念验证了数据调试和决策透明化的有效性。

Comments A critical theoretical error invalidates the main results. The independence assumption on concept representations and gradients (Section 3.2, Eq.7) is incorrect, breaking the influence estimation in nonlinear bottleneck layers. This flaw undermines all empirical claims in Sections 4-5. The authors withdraw to prevent dissemination of incorrect findings

详情
AI中文摘要

近年来,深度学习模型的黑箱特性限制了其在医疗诊断和金融等高风险领域的应用,而这些领域对可解释性至关重要。为解决这一问题,我们提出了一种新颖的方法,利用影响函数在样本和概念层面增强NLP模型的可解释性。在CEBaB和Yelp数据集上的实验表明,影响函数能有效识别对模型预测最有影响的训练样本(包括有益和有害的)。通过调整这些样本的标签和权重,我们证明无需重新训练即可将模型性能恢复到基线水平,证实了影响函数在高效数据调试中的价值。此外,我们的概念级分析识别了概念瓶颈模型(CBM)中对预测有显著影响的关键概念。修改这些概念会明显改变模型行为,为决策过程提供了清晰的洞察。

英文摘要

In recent years, the black-box nature of deep learning models has limited their application in high-stakes domains such as medical diagnosis and finance, where interpretability is essential. To address this, we propose a novel approach using influence functions to enhance interpretability in NLP models at both the sample and concept levels. Experiments on CEBaB and Yelp datasets show that influence functions effectively identify the most impactful training samples, both helpful and harmful, on model predictions. By adjusting the labels and weights of these samples, we demonstrate that model performance can be restored to baseline levels without retraining, confirming the value of influence functions for efficient data debugging. Furthermore, our concept-level analysis identifies key concepts within Concept Bottleneck Models (CBM) that significantly affect predictions. Modifying these concepts alters model behavior observably, providing clear insights into the decision process.

2605.19846 2026-05-26 cs.CV cs.AI cs.CL 版本更新

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

FineBench: 细粒度人类活动理解的视觉-语言模型基准测试与增强

Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu

发表机构 * National Taiwan University(国立台湾大学) Google(谷歌) Independent Researcher(独立研究员)

AI总结 针对视觉-语言模型在细粒度人类活动理解上的不足,提出包含密集标注的长视频问答基准FineBench和增强框架FineAgent。

Comments CVPR'26 (Workshop on Video Large Language Models). Project Page: https://joslefaure.github.io/assets/html/finebench.html

详情
AI中文摘要

视觉-语言模型(VLM)在通用视频理解方面表现出色,但在需要细致理解人类动作和交互的真实世界应用中,它们常常难以进行细粒度理解。虽然最近一些以人为中心的基准测试评估了模型行为的公平性/伦理、情感感知等维度,但它们没有结合长视频、密集的问答覆盖以及大规模的帧级空间/时间定位。为弥补这一差距,我们引入了FineBench,一个专门设计用于评估细粒度理解的以人为中心的视频问答(VQA)基准。FineBench包含199,420个多项选择问答对,密集标注在64个长视频(每个15分钟)上,重点关注详细的人物运动、人物交互和物体操作,包括组合动作。我们的广泛评估显示,虽然像GPT-5这样的专有模型取得了不错的性能,但当前的开源VLM明显表现不佳,特别是在多人场景的空间推理以及区分人类运动和交互的细微差异方面。为了解决这些已识别的弱点,我们提出了FineAgent,一个模块化框架,通过利用定位器和描述器来增强VLM。实验表明,FineAgent在FineBench上持续提高了各种开源VLM的性能。FineBench为未来细粒度以人为中心的视频理解研究提供了严格的测试平台,而FineAgent则为增强当前VLM中的此类推理提供了一种实用方法。项目页面和代码:https://joslefaure.github.io/assets/html/finebench.html。

英文摘要

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.

2605.16302 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

通过反事实推理路径减少信用分配方差

Fei Ding, Yongkang Zhang, Youwei Wang, Zijian Zeng

发表机构 * Alibaba Group(阿里巴巴集团) Tsinghua University(清华大学)

AI总结 提出反事实比较框架,通过采样多条推理轨迹并利用差异隐式估计过程级优势,将稀疏终端奖励转化为步骤敏感信号,从而改进大语言模型多步推理的信用分配,并引入隐式行为策略优化(IBPO)提升训练稳定性和性能上限。

详情
AI中文摘要

使用大语言模型进行多步推理的强化学习通常依赖于稀疏的终端奖励,这会导致一个条件较差的信用分配问题:最终反馈均匀地传播到所有中间决策。这导致高梯度方差、不稳定的训练和许多无效更新,最终限制了模型的持续改进。我们提出了一种用于信用分配的反事实比较框架。对于每个输入,该框架采样多个推理轨迹,并将它们的差异视为对替代决策的隐式近似。这产生了一个隐式过程级优势估计器,将稀疏终端奖励转化为步骤敏感的学习信号。基于此框架,我们引入了隐式行为策略优化(IBPO),该方法在数学和代码推理基准上显著提高了训练稳定性和性能上限。我们的结果为释放大语言模型的推理潜力指明了一个有前景的方向。

英文摘要

Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which creates a poorly conditioned credit-assignment problem: the final feedback is propagated uniformly across all intermediate decisions. This leads to high gradient variance, unstable training, and many ineffective updates, ultimately limiting sustained model improvement. We propose a counterfactual-comparison framework for credit assignment. For each input, the framework samples multiple reasoning trajectories and treats their differences as implicit approximations to alternative decisions. This yields an implicit process-level advantage estimator that converts sparse terminal rewards into step-sensitive learning signals. Building on this framework, we introduce Implicit Behavior Policy Optimization (IBPO), which substantially improves training stability and the performance ceiling on mathematical and code-reasoning benchmarks. Our results point to a promising direction for unlocking the reasoning potential of LLMs.

2605.06415 2026-05-26 cs.LG cs.AI cs.CL cs.CV 版本更新

E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology

E = T*H/(O+B):混合专家生态的无量纲控制参数

Qingjun Zhang

发表机构 * School of Integrated Circuits, Wuxi Taihu University(无锡太湖大学集成电路学院)

AI总结 提出无量纲控制参数E = T*H/(O+B),通过12个控制实验证明E≥0.5可保证混合专家模型无死亡专家,并发现专家复活、正交毒性依赖数据集等六项额外结果。

Comments 12 experiments, 11,000+ training epochs, cross-modal validation (vision + language). Extended version of the Claude-in-the-Loop ecology framework

详情
AI中文摘要

我们引入E = T*H/(O+B),这是一个无量纲控制参数,用于预测混合专家(MoE)模型是否会发展出健康的专家生态还是陷入死亡专家。E将四个超参数——路由温度T、路由熵权重H、先知权重O和平衡权重B——组合成一个单一量。通过12个控制实验(8个视觉,4个语言),总计超过11,000个训练周期,我们确定仅E ≥ 0.5就足以保证零死亡专家,消除了手工设计负载平衡辅助损失的必要性。我们在CIFAR-10、CIFAR-100、TinyImageNet-200、WikiText-2和WikiText-103上跨模态验证了这一点。另外还发现了六项结果:(1)死亡专家可以复活——由平衡损失驱动路由器重新探索触发;(2)正交毒性依赖于数据集,并非普遍存在;(3)任务复杂性改变了临界E阈值;(4)模型过拟合与专家生态健康解耦;(5)三层MoE自发崩溃为两层功能结构;(6)生态结构在50倍温度范围内保持不变。我们提出E作为MoE训练的统一诊断指标,类似于流体力学中的雷诺数。

英文摘要

We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy weight H, oracle weight O, and balance weight B -- into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate -- triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the critical E threshold; (4) model overfitting is decoupled from expert ecological health; (5) three-tier MoE spontaneously collapses into a two-tier functional structure; (6) ecological structure is temperature-invariant across a 50x range. We propose that E serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics.

2604.11811 2026-05-26 cs.PL cs.AI cs.CL cs.LG 版本更新

M$^\star$: Every Task Deserves Its Own Memory Harness

M$^\star$:每个任务都应有专属的记忆框架

Wenbo Pan, Shujie Liu, Xiangyang Zhou, Shiwei Zhang, Wanlu Shi, Mirror Xu, Xiaohua Jia

发表机构 * City University of Hong Kong(香港城市大学) Microsoft(微软)

AI总结 提出M$^\star$方法,通过可执行程序进化自动发现任务优化的记忆系统,在对话、具身规划和专家推理等任务上优于固定记忆基线。

Comments Preprint. Code: https://github.com/wbopan/mstar ; Live demo: https://mstar.wenbo.io

详情
AI中文摘要

大型语言模型代理依赖专门的记忆系统在长时间交互中积累和重用知识。最近的架构通常采用针对特定领域定制的固定记忆设计,例如用于对话的语义检索或用于编码的技能重用。然而,为某一目的优化的记忆系统往往无法迁移到其他任务。为了解决这一限制,我们引入了M$^\star$,一种通过可执行程序进化自动发现任务优化记忆框架的方法。具体来说,M$^\star$将代理记忆系统建模为用Python编写的记忆程序。该程序封装了数据模式、存储逻辑和代理工作流指令。我们使用反射式代码进化方法联合优化这些组件;该方法采用基于种群的搜索策略,并分析评估失败以迭代改进候选程序。我们在涵盖对话、具身规划和专家推理的四个不同基准上评估M$^\star$。结果表明,M$^\star$在所有评估任务上稳健地优于现有的固定记忆基线。此外,进化出的记忆程序对每个领域展现出结构不同的处理机制。这一发现表明,针对给定任务特化记忆机制探索了广泛的设计空间,并提供了比通用记忆范式更优的解决方案。

英文摘要

Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent architectures typically adopt a fixed memory design tailored to specific domains, such as semantic retrieval for conversations or skills reused for coding. However, a memory system optimized for one purpose frequently fails to transfer to others. To address this limitation, we introduce M$^\star$, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M$^\star$ models an agent memory system as a memory program written in Python. This program encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly using a reflective code evolution method; this approach employs a population-based search strategy and analyzes evaluation failures to iteratively refine the candidate programs. We evaluate M$^\star$ on four distinct benchmarks spanning conversation, embodied planning, and expert reasoning. Our results demonstrate that M$^\star$ improves performance over existing fixed-memory baselines robustly across all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task explores a broad design space and provides a superior solution compared to general-purpose memory paradigms.

2604.03675 2026-05-26 cs.AI cs.CL cs.IR 版本更新

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

OASES:面向智能搜索的结果对齐搜索-评估协同训练

Erhan Zhang, Yiqun Chen, Zechun Niu, Wei Yang, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

发表机构 * Renmin University of China(中国人民大学) Xiaohongshu Inc.(小红书公司) University of Southern California(南加州大学)

AI总结 提出OASES框架,通过结果对齐的过程奖励和搜索-评估协同训练,解决智能搜索中奖励稀疏和过程监督不可靠的问题,在多跳问答基准上优于强强化学习基线。

详情
AI中文摘要

智能搜索使语言模型能够通过自适应地多步获取外部证据来解决知识密集型任务。具有可验证奖励的强化学习已成为搜索智能体广泛采用的训练范式,但仅结果奖励是稀疏的,并且对中间搜索动作的信用分配有限。因此,现有的过程奖励方法试图通过代理信号、外部评估器或基于似然的信息增益来密集化监督。然而,代理奖励可能偏离最终结果目标,而固定评估器随着搜索策略的演化可能变得过时,导致不可靠的过程监督。为应对这些挑战,我们提出OASES,一种用于智能搜索的结果对齐搜索-评估监督框架。OASES通过评估每个中间搜索状态对回答原始问题的支持程度,推导出结果对齐的过程奖励。它进一步在策略上协同训练搜索策略和状态评估器,使评估器能够适应演化的搜索行为并提供更可靠的过程奖励。在五个多跳问答基准上的实验表明,OASES始终优于强强化学习基线,进一步分析证实了结果对齐过程奖励和搜索-评估协同训练的优势。

英文摘要

Agentic search enables language models to solve knowledge-intensive tasks by adaptively acquiring external evidence over multiple steps. Reinforcement learning with verifiable rewards (RLVR) has emerged as a widely adopted training paradigm for search agents, yet outcome-only rewards are sparse and provide limited credit assignment for intermediate search actions. Existing process-reward methods therefore seek to densify supervision through proxy signals, external evaluators, or likelihood-based information gain. However, proxy rewards can deviate from the final outcome objective, while fixed evaluators can become stale as the search policy evolves, leading to unreliable process supervision. To address these challenges, we propose OASES, an Outcome-Aligned Search-Evaluation Supervision framework for agentic search. OASES derives outcome-aligned process rewards by evaluating how well each intermediate search state supports answering the original question. It further co-trains the search policy and the state evaluator on policy, allowing the evaluator to adapt to evolving search behavior and provide more reliable process rewards. Experiments on five multi-hop QA benchmarks show that OASES consistently outperforms strong RL baselines, with further analyses confirming the benefits of outcome-aligned process rewards and search-evaluation co-training.

2603.17198 2026-05-26 cs.LG cs.CL 版本更新

Structural Abstraction as an Inductive Bias for Non-Stationary Language Model Training

结构抽象作为非平稳语言模型训练的归纳偏置

Elnaz Rahmati, Nona Ghazizadeh, Zhivar Sourati, Nina Rouhani, Morteza Dehghani

发表机构 * University of Southern California(南加州大学)

AI总结 提出抽象增强训练(AAT)方法,通过联合优化具体实例及其结构抽象,减少灾难性干扰并提升关系泛化能力,在非平稳语言模型训练中验证了结构抽象作为稳定学习信号的有效性。

详情
AI中文摘要

认知科学的一个基本原则认为,智能体不是通过将经验存储为孤立实例来学习,而是通过形成捕捉跨情境共享关系结构的抽象图式来学习。尽管这一主张得到了行为和神经影像研究的充分支持,但其作为语言模型计算训练信号的作用仍未得到充分探索。我们针对非平稳语言模型训练中的这一空白,提出疑问:将学习偏向结构抽象是否能如人类结果所预测的那样减少灾难性干扰并提升关系泛化?为研究这一问题,我们引入了抽象增强训练(AAT),这是一种轻量级的损失级修改,联合优化具体实例及其结构抽象,以及两个基准:关系循环基准(RCB)和叙事抽象基准(NAB)。这些资源将核心认知构造操作化:实体掩码作为关系对齐的计算模拟,谚语作为必须跨表面不同情境推断的隐式抽象意义的载体。我们的实证结果表明,AAT持续减少遗忘并提升泛化,其模式与基于图式学习的认知预测一致。除了对持续学习的实际意义外,这些结果提供了初步的计算证据,表明结构抽象是非平稳环境中稳定学习的信号。

英文摘要

A foundational principle in cognitive science holds that intelligent agents do not learn by storing experiences as isolated instances, but by forming abstract schemas that capture relational structure shared across situations. Even though this claim is well supported by behavioral and neuroimaging studies, its role as a computational training signal in language models remains underexplored. We target this gap in the setting of non-stationary language model training, asking does biasing learning toward structural abstraction reduce catastrophic interference and improve relational generalization as predicted by human results? To study this question, we introduce Abstraction-Augmented Training (AAT), a lightweight loss-level modification that jointly optimizes over concrete instances and their structural abstractions, and two benchmarks, the Relational Cycle Benchmark (RCB) and the Narrative Abstraction Benchmark (NAB). These resources operationalize core cognitive constructs: entity masking as a computational analog of relational alignment, and proverbs as vehicles for implicit abstract meaning that must be inferred across surface-dissimilar situations. Our empirical results demonstrate that AAT consistently reduces forgetting and improves generalization in a pattern that aligns with cognitive predictions for schema-based learning. Beyond the practical implications for continual learning, these results offer preliminary computational evidence that structural abstraction is a signal for stable learning in non-stationary environments.

2602.10090 2026-05-26 cs.AI cs.CL cs.LG 版本更新

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Agent World Model: 用于智能体强化学习的无限合成环境

Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出Agent World Model (AWM)全合成环境生成管道,通过代码驱动和数据库支持的环境进行大规模强化学习,使智能体在多样日常场景中泛化。

Comments Accepted to ICML 2026

详情
AI中文摘要

近年来,大型语言模型(LLM)的进步使得自主智能体能够与工具和环境进行多轮交互。然而,扩展此类智能体训练受到缺乏多样且可靠环境的限制。在本文中,我们提出了Agent World Model(AWM),一个完全合成的环境生成管道。使用该管道,我们扩展到涵盖日常场景的1000个环境,智能体可以在其中与丰富的工具集交互并获得高质量的观测。值得注意的是,这些环境是代码驱动的并由数据库支持,比由LLM模拟的环境提供更可靠和一致的状态转换。此外,与从现实环境中收集轨迹相比,它们实现了更高效的智能体交互。为了展示该资源的有效性,我们对多轮工具使用智能体进行了大规模强化学习。得益于完全可执行的环境和可访问的数据库状态,我们还可以设计可靠的奖励函数。在三个基准上的实验表明,仅在合成环境中训练(而非特定于基准的环境)能产生强大的分布外泛化能力。代码可在 https://github.com/Snowflake-Labs/agent-world-model 获取。

英文摘要

Recent advances in large language model (LLM) have empowered autonomous agents to perform multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.

2601.10201 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Future-KL Regularized GRPO: Process-Level Credit Assignment from $f$-Divergence Regularization

未来KL正则化GRPO:基于f-散度正则化的过程级信用分配

Jiarui Yao, Ruida Wang, Hao Bai, Tong Zhang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出未来KL正则化策略优化(FRPO),通过因果未来正则化回报修正GRPO中局部KL损失缺失的梯度信号,在数学推理任务中提升pass@16并保持更高熵和更低策略漂移。

详情
AI中文摘要

组相对策略优化(GRPO)广泛用于无评论家的大语言模型(LLM)后训练,但其KL正则化通常作为局部损失侧的token惩罚实现。我们表明这遗漏了自回归KL正则化诱导的策略梯度信号。与标准KL正则化强化学习(RL)目标不同,GRPO的组归一化引入非线性提示级效用;对于二元验证器奖励,该效用为$2\arcsin\sqrt p$。因此,奖励和KL在归一化前无法融合而不改变隐式目标。我们推导了具有token级$f$-散度正则化的GRPO风格目标的on-policy梯度。奖励项恢复标准化的GRPO优势,而正则化项包括局部KL损失遗漏的因果未来正则化回报。对于反向KL,这产生简单的未来KL修正:在优势构建后添加每个token对数比的反向累积和。由此产生的方法,未来KL正则化策略优化(FRPO),不需要评论家或额外的模型传递。在数学推理任务上,FRPO在我们的主要大模型设置中提高了pass@16,同时保持比传统损失侧KL基线更高的熵和更低的策略漂移。

英文摘要

Group Relative Policy Optimization (GRPO) is widely used for critic-free Large Language Model (LLM) post-training, but its KL regularization is usually implemented as a local loss-side token penalty. We show that this misses the policy-gradient signal induced by autoregressive KL regularization. Unlike standard KL-regularized Reinforcement Learning (RL) objectives, GRPO's group normalization induces a non-linear prompt-level utility; for binary verifier rewards, this utility is $2\arcsin\sqrt p$. As a result, reward and KL cannot be fused before normalization without changing the implicit objective. We derive the on-policy gradient of GRPO-style objectives with token-wise $f$-divergence regularization. The reward term recovers the standardized GRPO advantage, while the regularizer term includes a causal future-regularization return-to-go omitted by local KL losses. For reverse KL, this yields a simple future KL correction: add a reverse cumulative sum of per-token log ratios after advantage construction. The resulting method, Future-KL Regularized Policy Optimization (FRPO), requires no critic or extra model passes. On mathematical reasoning tasks, FRPO improves pass@16 in our main large-model setting while maintaining higher entropy and lower policy drift than conventional loss-side KL baselines.

2601.05847 2026-05-26 cs.CL 版本更新

Schema-Grounded LLM Extraction for FHIR Patient Digital Twins

基于Schema的LLM抽取用于FHIR患者数字孪生

Rafael Brens, Yuqiao Meng, Luoxi Tang, Zhaohan Xi

发表机构 * Binghamton University(宾夕法尼亚州立大学)

AI总结 提出SG-LLM方法,通过检索增强、JSON Schema约束和验证器修复循环,从非结构化EHR中生成有效的FHIR Bundle,并在临床效用实验中优于基线。

详情
AI中文摘要

我们重新审视从非结构化电子健康记录(EHR)构建可互操作患者数字孪生的问题,并认为该任务更适合被视作有效FHIR Bundle的受控生成,而非抽取模块的级联。我们引入SG-LLM,一种基于schema的LLM抽取器,它(i)通过SapBERT索引检索的候选SNOMED-CT、RxNorm和LOINC代码增强提示,(ii)在直接源自FHIR R4 StructureDefinitions的JSON Schema下解码,(iii)关闭一个验证器在环修复阶段,其诊断结果作为结构化错误消息反馈。我们认为,孪生的有用性(而不仅仅是跨度级F1)才是正确的评估对象,并通过一项临床效用实验将其操作化,该实验测量了基于SG-LLM生成的FHIR Bundle与专家策划的Bundle训练的分类器在30天再入院AUROC上的差距。在MIMIC-IV和n2c2 2018 Track 2基准测试上,SG-LLM匹配或超过了强大的联合抽取和普通LLM基线,同时生成了更有效的Bundle。消融实验分离了检索、schema约束和修复循环的贡献。所有代码、提示和schema均已发布。

英文摘要

We revisit the problem of constructing interoperable patient digital twins from unstructured electronic health records (EHRs) and argue that the task is better cast not as a cascade of extraction modules but as constrained generation of a valid FHIR bundle. We introduce SG-LLM, a schema-grounded LLM extractor that (i) augments the prompt with candidate SNOMED-CT, RxNorm, and LOINC codes retrieved through a SapBERT index, (ii) decodes under a JSON Schema derived directly from FHIR R4 StructureDefinitions, and (iii) closes a validator-in-the-loop repair stage whose diagnostics are fed back as structured error messages. We argue that the twin's usefulness, not only span-level F1, is the right object of evaluation, and operationalize this with a clinical-utility experiment that measures the gap in 30-day readmission AUROC between classifiers trained on SG-LLM-generated FHIR bundles versus expert-curated ones. On MIMIC-IV and n2c2 2018 Track 2 benchmarks, SG-LLM matches or exceeds strong joint-extraction and vanilla-LLM baselines while producing substantially more valid bundles. Ablations isolate the contributions of retrieval, schema constraint, and the repair loop. All code, prompts, and schemas are released.

2601.05004 2026-05-26 cs.CL 版本更新

Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei

大语言模型能否解决自我毁灭亚文化中的语义差异?来自Jirai Kei的证据

Peng Wang, Xilin Tao, Siyi Yao, Jiageng Wu, Yuntao Zou, Zhuotao Tian, Libo Qin, Dagang Li

发表机构 * School of Computer Science and Engineering, Macau University of Science and Technology(澳门科技大学计算机科学与工程学院) SKLPlanets, Macau University of Science and Technology(澳门科技大学SKLPlanets) College of Software, Northeastern University(东北大学软件学院) School of Energy and Power Engineering, Huazhong University of Science and Technology(华中科技大学能源与动力工程学院) School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术学院)

AI总结 针对亚文化中自我毁灭行为检测面临的知识滞后和语义错位问题,提出多智能体框架SAS,通过自动检索和亚文化对齐显著提升LLM检测性能,并优于现有先进方法。

Comments Preprint

详情
AI中文摘要

自我毁灭行为与复杂的心理状态相关,且难以诊断。由于亚文化群体独特的表达方式,这些行为可能更难识别。随着大语言模型(LLM)在各领域的部署,一些研究者开始探索其在检测自我毁灭行为中的应用。受此启发,我们使用当前基于LLM的方法研究亚文化中的自我毁灭行为检测。然而,这些方法面临两个主要挑战:(1)知识滞后:亚文化俚语演变迅速,快于LLM的训练周期;(2)语义错位:难以把握亚文化特有的具体和细微表达。为解决这些问题,我们提出亚文化对齐求解器(SAS),一个多智能体框架,集成了自动检索和亚文化对齐,显著提升了LLM在检测自我毁灭行为中的性能。实验结果表明,SAS优于当前先进的多智能体框架OWL。值得注意的是,它与微调后的LLM表现相当。我们希望SAS能推动亚文化背景下自我毁灭行为检测领域的发展,并为未来研究者提供宝贵资源。

英文摘要

Self-destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) being deployed across various fields, some researchers have begun exploring their application for detecting self-destructive behaviors. Motivated by this, we investigate self-destructive behavior detection within subcultures using current LLM-based methods. However, these methods have two main challenges: (1) Knowledge Lag: Subcultural slang evolves rapidly, faster than LLMs' training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we propose Subcultural Alignment Solver (SAS), a multi-agent framework that incorporates automatic retrieval and subculture alignment, significantly boosting the performance of LLMs in detecting self-destructive behavior. Our experimental results show that SAS outperforms the current advanced multi-agent framework OWL. Notably, it competes well with fine-tuned LLMs. We hope that SAS will advance the field of self-destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.

2601.02589 2026-05-26 cs.CL cs.AI 版本更新

FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions

FlowPlan-G2P:一种将科学论文转化为专利描述的结构化生成框架

Kris W Pan, Yongmin Yoo

发表机构 * Amazon(亚马逊公司) Macquarie University(麦考瑞大学)

AI总结 提出FlowPlan-G2P图介导生成框架,通过概念图归纳、章节级规划和图条件生成三阶段分解,将科学论文转化为符合专利规范的描述,在领域评估中优于大型专有模型。

详情
AI中文摘要

由于科学论文与专利在修辞和结构上的根本差异,从科学论文生成专利描述具有挑战性。现有方法将其视为表面改写,未能捕捉专利起草中固有的层次推理和法定约束。我们提出FlowPlan-G2P,一种图介导的生成框架,将该转换分解为三个阶段:(1)概念图归纳,将技术实体和功能依赖提取为有向图;(2)章节级规划,将图划分为与规范专利章节对齐的连贯子图;(3)图条件生成,基于章节特定子图合成符合法律要求的段落。在专家验证基准上的实验表明,标准NLG指标系统性偏好法律不合规输出而非有效专利描述,这促使我们进行领域特定评估。在该评估下,使用开放权重骨干的FlowPlan-G2P始终优于原始专有模型,表明结构化分解比模型规模更能决定质量。

英文摘要

Generating patent descriptions from scientific papers is challenging due to fundamental rhetorical and structural disparities between the two genres. Existing approaches treat this as surface-level rewriting, failing to capture the hierarchical reasoning and statutory constraints inherent in patent drafting. We propose FlowPlan-G2P, a graph-mediated generation framework that decomposes this transformation into three stages: (1) Concept Graph Induction, extracting technical entities and functional dependencies into a directed graph; (2) Section-level Planning, partitioning the graph into coherent subgraphs aligned with canonical patent sections; and (3) Graph-Conditioned Generation, synthesizing legally compliant paragraphs conditioned on section-specific subgraphs. Experiments on expert-validated benchmarks reveal that standard NLG metrics systematically favor legally non-compliant outputs over valid patent descriptions, motivating our domain-specific evaluation. Under this evaluation, FlowPlan-G2P with an open-weight backbone consistently outperforms vanilla proprietary models, demonstrating that structured decomposition is a stronger determinant of quality than model scale.

2512.12576 2026-05-26 cs.CL cs.AI 版本更新

Coupled Variational Reinforcement Learning for Language Model General Reasoning

耦合变分强化学习用于语言模型通用推理

Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所信息处理实验室) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出CoVRL方法,通过混合采样策略耦合先验和后验分布,将变分推理与强化学习结合,以解决无验证器强化学习中探索效率低和推理轨迹与答案不一致的问题,在数学和通用推理基准上提升性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

虽然强化学习在语言模型推理方面取得了显著进展,但它受到可验证奖励要求的限制。最近的无验证器强化学习方法通过利用LLM生成参考答案的概率作为奖励信号来解决这一限制。然而,这些方法通常仅基于问题采样推理轨迹。这种设计将推理轨迹采样与答案信息解耦,导致探索效率低下以及轨迹与最终答案之间的不一致。在本文中,我们提出了 extit{{Co}upled {V}ariational {R}einforcement {L}earning}(CoVRL),它通过混合采样策略耦合先验和后验分布,将变分推理与强化学习联系起来。通过构建和优化整合这两种分布的复合分布,CoVRL实现了高效探索,同时保持了思想与答案之间的强一致性。在数学和通用推理基准上的大量实验表明,CoVRL在基础模型上提升了12.4%的性能,并在最先进的无验证器强化学习基线基础上额外提升了2.3%,为增强语言模型的通用推理能力提供了一个原则性框架。

英文摘要

While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the probabilities that LLMs generate reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose \textit{\b{Co}upled \b{V}ariational \b{R}einforcement \b{L}earning} (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4\% over the base model and achieves an additional 2.3\% improvement over state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.

2511.21734 2026-05-26 cs.CL cs.AI 版本更新

Asking LLMs to Verify First is Almost Free Lunch

先让LLMs验证几乎是免费的午餐

Shiguang Wu, Quanming Yao

发表机构 * Department of Electonic Engineering(电子工程系)

AI总结 提出Verification-First (VF)策略,通过先验证候选答案再生成解决方案,以低计算开销提升推理能力,并扩展为Iter-VF迭代方法,在多个基准上优于标准CoT和现有TTS策略。

详情
AI中文摘要

为了在不增加训练成本或大量测试时采样的情况下增强大型语言模型(LLMs)的推理能力,我们引入了Verification-First (VF)策略,该策略在生成解决方案之前提示模型验证提供的候选答案(即使是琐碎或随机的答案)。这种方法触发了一种“反向推理”过程,与标准的前向思维链(CoT)互补,通过修剪LLM的输出分布来限制答案的逻辑搜索空间。我们进一步将VF提示推广到Iter-VF,这是一种顺序测试时缩放(TTS)方法,利用模型之前的答案迭代地循环验证-生成过程。跨多个基准和各种LLMs的大量实验证实,使用随机答案的VF提示在最小计算开销下始终优于标准CoT,并且Iter-VF优于现有的TTS策略。VF在SOTA思考模型上也有效。例如,通过使用简单的VF提示,我们在GPQA-Diamond上使用Gemini-3-Pro-Preview获得了新的SOTA准确率94.9%,其中VF相对减少了约30%的错误。

英文摘要

To enhance the reasoning capabilities of Large Language Models (LLMs) without high costs of training, nor extensive test-time sampling, we introduce Verification-First (VF), a strategy that prompts models to verify a provided candidate answer, even a trivial or random one, before generating a solution. This approach triggers a "reverse reasoning" process complementary to standard forward Chain-of-Thought (CoT), which restricts the logical search space of the answer by pruning the LLM's output distribution. We further generalize VF prompting to Iter-VF, a sequential test-time scaling (TTS) method that iteratively cycles the verification-generation process using the model's previous answer. Extensive experiments across various benchmarks and various LLMs confirm that VF prompting with random answer consistently outperforms standard CoT with minimal computational overhead, and Iter-VF outperforms existing TTS strategies. VF is also effective on SOTA thinking models. For example, by using the simple VF prompting, we obtain a new SOTA 94.9% accuracy on GPQA-Diamond with Gemini-3-Pro-Preview where VF reduces its errors by ~30% relatively.

2511.08654 2026-05-26 cs.CY cs.AI cs.CL 版本更新

AI-generated podcasts: Synthetic Intimacy and Cultural Mistranslation in NotebookLM's Audio Overviews

AI生成的播客:NotebookLM音频概览中的合成亲密关系与文化误译

Jill Walker Rettberg

发表机构 * University of Bergen(卑尔根大学) Center for Digital Narrative(数字叙述中心)

AI总结 本文分析Google NotebookLM生成的AI播客,揭示其固定模板结构及将文本和文化语境翻译为白人、受过教育的中产阶级美国默认设置的问题。

Comments This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement number 101142306. The project is also supported by the Center for Digital Narrative, which is funded by the Research Council of Norway through its Centres of Excellence scheme, project number 332643. Media, Culture & Society, online first (2026)

详情
AI中文摘要

本文分析了Google NotebookLM生成的AI播客,该工具生成两个健谈的AI主持人讨论用户上传文档的音频播客。虽然AI生成的播客已被作为工具讨论(例如在医学教育中),但它们尚未作为媒体被分析。通过上传不同类型的文本并分析生成的输出,我展示了播客的结构如何围绕固定模板构建。我还发现NotebookLM不仅将其他语言的文本翻译成活泼的标准中西部美国口音,还将文化语境翻译为白人、受过教育的中产阶级美国默认设置。这是媒体塑造公众方式的一个显著发展,标志着从学者们描述的21世纪初至今人类播客中的多元公共领域(主持人面向特定社区并回应听众评论)向播客类型抽象化的转变。

英文摘要

This paper analyses AI-generated podcasts produced by Google's NotebookLM, which generates audio podcasts with two chatty AI hosts discussing whichever documents a user uploads. While AI-generated podcasts have been discussed as tools, for instance in medical education, they have not yet been analysed as media. By uploading different types of text and analysing the generated outputs I show how the podcasts' structure is built around a fixed template. I also find that NotebookLM not only translates texts from other languages into a perky standardised Mid-Western American accent, it also translates cultural contexts to a white, educated, middle-class American default. This is a distinct development in how publics are shaped by media, marking a departure from the multiple public spheres that scholars have described in human podcasting from the early 2000s until today, where hosts spoke to specific communities and responded to listener comments, to an abstraction of the podcast genre.

2510.16435 2026-05-26 cs.RO cs.CL cs.HC 版本更新

What Questions Should Robots Be Able to Answer? A Dataset of User Questions for Explainable Robotics

机器人应该能够回答哪些问题?一个用于可解释机器人的用户问题数据集

Lennart Wachowiak, Andrew Coles, Gerard Canal, Oya Celiktutan

发表机构 * King's College London, CDT in Safe and Trusted AI(国王学院伦敦大学,安全与可信人工智能中心) King's College London(国王学院伦敦大学)

AI总结 本文通过收集100名参与者的1893个问题,构建了一个面向家用机器人的用户问题数据集,涵盖12个类别和70个子类别,旨在帮助机器人学家确定机器人需要回答的关键问题类型。

详情
AI中文摘要

随着大型语言模型和对话界面在人机交互中的广泛使用,机器人回答用户问题的能力比以往任何时候都更加重要。因此,我们引入了一个包含1,893个家用机器人用户问题的数据集,这些数据来自100名参与者,并分为12个类别和70个子类别。可解释机器人领域的大多数工作集中在“为什么”问题上。相比之下,我们的数据集提供了多种类型的问题,从关于简单执行细节的问题到关于机器人在假设场景中如何行动的问题——从而为机器人学家提供了关于其机器人需要能够回答哪些问题的宝贵见解。为了收集数据集,我们创建了15个视频刺激和7个文本刺激,描绘了机器人执行各种家务任务。然后,我们询问Prolific上的参与者在每个描绘的情境中他们想问机器人什么问题。在最终数据集中,最常见的类别是关于任务执行细节(21.4%)、机器人能力(12.6%)和性能评估(10.7%)的问题。尽管关于机器人如何处理潜在困难场景并确保正确行为的问题较少,但用户认为这些是机器人最需要能够回答的问题。此外,我们发现自认为是机器人学新手的人与更有经验的用户提出的问题不同。新手更倾向于询问简单事实,例如机器人做了什么或环境的当前状态。随着机器人进入与人类共享的环境,并且语言成为给出指令和交互的核心,该数据集为(i)识别机器人需要记录并暴露给对话界面的信息,(ii)对问答模块进行基准测试,以及(iii)设计符合用户期望的解释策略提供了宝贵的基础。

英文摘要

With the growing use of large language models and conversational interfaces in human-robot interaction, robots' ability to answer user questions is more important than ever. We therefore introduce a dataset of 1,893 user questions for household robots, collected from 100 participants and organized into 12 categories and 70 subcategories. Most work in explainable robotics focuses on why-questions. In contrast, our dataset provides a wide variety of questions, from questions about simple execution details to questions about how the robot would act in hypothetical scenarios -- thus giving roboticists valuable insights into what questions their robot needs to be able to answer. To collect the dataset, we created 15 video stimuli and 7 text stimuli, depicting robots performing varied household tasks. We then asked participants on Prolific what questions they would want to ask the robot in each portrayed situation. In the final dataset, the most frequent categories are questions about task execution details (21.4%), the robot's capabilities (12.6%), and performance assessments (10.7%). Although questions about how robots would handle potentially difficult scenarios and ensure correct behavior are less frequent, users rank them as the most important for robots to be able to answer. Moreover, we find that users who identify as novices in robotics ask different questions than more experienced users. Novices are more likely to inquire about simple facts, such as what the robot did or the current state of the environment. As robots enter environments shared with humans and language becomes central to giving instructions and interaction, this dataset provides a valuable foundation for (i) identifying the information robots need to log and expose to conversational interfaces, (ii) benchmarking question-answering modules, and (iii) designing explanation strategies that align with user expectations.

2509.12672 2026-05-26 cs.CL 版本更新

Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content

迈向包容性有害内容审核:解决面向LLM生成内容的有害性分类器对抗攻击的脆弱性

Shaz Furniturewala, Arkaitz Zubiaga

发表机构 * Center for Data Science, New York University(纽约大学数据科学中心) Queen Mary University of London(伦敦大学女王学院)

AI总结 针对LLM生成内容的有害性分类器易受对抗攻击的问题,提出基于机制可解释性的方法识别并抑制脆弱电路,提升模型鲁棒性并揭示人口统计学层面的公平性差距。

详情
AI中文摘要

由于大型语言模型(LLM)的广泛使用,在线机器生成内容的数量急剧增长,给内容审核系统带来了新的挑战。传统的内容审核分类器通常基于人类生成的文本进行训练,由于LLM生成的文本偏离其训练数据以及旨在逃避检测的对抗攻击,导致分类错误。当前的防御策略是反应性的而非主动性的,因为它们依赖于对抗训练或外部检测模型来识别攻击。在这项工作中,我们旨在识别导致错误分类的有害性分类器的脆弱组件,提出了一种基于机制可解释性技术的新策略。我们的研究聚焦于微调的BERT和RoBERTa分类器,在涵盖多种少数群体的不同数据集上进行测试。我们使用对抗攻击技术来识别脆弱电路。最后,我们抑制这些脆弱电路,提高对抗攻击下的性能。我们还提供了这些脆弱电路在人口统计学层面的洞察,揭示了模型训练中的公平性和鲁棒性差距。我们发现模型具有不同的注意力头,这些头要么对性能至关重要,要么容易受到攻击,抑制脆弱头可以提高对抗输入的性能。我们还发现不同的头负责不同人口群体的脆弱性,这可以为更具包容性的有害性检测模型开发提供信息。

英文摘要

The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which are usually trained on text produced by humans, suffer from misclassifications due to LLM-generated text deviating from their training data and adversarial attacks that aim to avoid detection. Present-day defence tactics are reactive rather than proactive, since they rely on adversarial training or external detection models to identify attacks. In this work, we aim to identify the vulnerable components of toxicity classifiers that contribute to misclassification, proposing a novel strategy based on mechanistic interpretability techniques. Our study focuses on fine-tuned BERT and RoBERTa classifiers, testing on diverse datasets spanning a variety of minority groups. We use adversarial attacking techniques to identify vulnerable circuits. Finally, we suppress these vulnerable circuits, improving performance against adversarial attacks. We also provide demographic-level insights into these vulnerable circuits, exposing fairness and robustness gaps in model training. We find that models have distinct heads that are either crucial for performance or vulnerable to attack and suppressing the vulnerable heads improves performance on adversarial input. We also find that different heads are responsible for vulnerability across different demographic groups, which can inform more inclusive development of toxicity detection models.

2509.08150 2026-05-26 cs.CL 版本更新

Verbalized Algorithms: Classical Algorithms are All You Need (Mostly)

言语化算法:经典算法就是你所需要的(大部分)

Supriya Lall, Christian Farrell, Hari Pathanjaly, Marko Pavic, Sarvesh Chezhian, Masataro Asai

发表机构 * MIT CSAIL(MIT 计算与人工智能实验室) MIT-IBM Watson AI Lab(MIT-IBM 沃森人工智能实验室) IBM Infrastructure(IBM 基础设施) Marist University(马里斯特大学) UC Irvine(加州大学尔湾分校) IBM Research Cambridge, USA(IBM 英国剑桥研究中心)

AI总结 提出言语化算法(VA)范式,将LLM作为可靠的基本操作(如字符串比较)集成到经典算法中,以提升推理的准确性和效率。

Comments Accepted in NeurIPS 2025 Workshop on Efficient Reasoning; Submitted to Position Paper Track at Neurips 2026

详情
AI中文摘要

推理本质上是一个算法任务。然而,当前基于LLM的推理工作依赖于自由生成,其理论保证(可靠性、完备性、复杂性、最优性)仍然知之甚少。我们认为不应将它们视为通用推理器,作为替代,我们提出一种称为“言语化算法”(VA)的范式,它将LLM与各种具有既定保证的算法相结合。VA不依赖LLM解决推理任务的能力,而是通过将任务分解为它们能够可靠回答的简单字符串基本操作来限制其范围。例如,对自然语言字符串列表进行排序可以通过在并行或近似排序算法中使用LLM作为二元比较预言机来实现。我们在数值推理、主题聚类、Wi-Fi接入点优化和多跳问答RAG任务中,通过言语化最大值、排序、聚类和子模最大化,推动了准确率-运行时帕累托前沿。这些结果表明,通过标准算法分析改进基于LLM的推理是一个可行且基础更扎实的研究方向。

英文摘要

Reasoning is a fundamentally algorithmic task. Yet current work on LLM-based reasoning relies on free-form generation whose theoretical guarantees (soundness, completeness, complexity, optimality) remain poorly understood. We argue that we should not treat them as general-purpose reasoners, and as an alternative, we propose a paradigm we call \emph{verbalized algorithms} (VAs), which combines LLMs and various algorithms with established guarantees. Instead of betting on LLM's ability to solve a reasoning task, VAs limit their scope by decomposing the task down to simple elementary operations on strings that they can answer reliably. For example, sorting a list of natural language strings could be done by using an LLM as a binary comparison oracle in a parallel or approximate sorting algorithm. We push the accuracy-runtime Pareto front with \emph{verbalized maximum}, \emph{sorting}, \emph{clustering}, and \emph{submodular maximization}, for numerical reasoning, topic clustering, Wi-Fi access point optimization, and multi-hop Q\&A RAG task. These results suggest improving LLM-based reasoning through standard algorithmic analysis is a feasible and better grounded research direction.

2508.15760 2026-05-26 cs.CL cs.AI 版本更新

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

LiveMCP-101:对支持MCP的智能体进行压力测试与诊断

Ming Yin, Dinghan Shen, Silei Xu, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Jianbing Han, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song

发表机构 * Duke University(杜克大学)

AI总结 针对MCP工具在动态多步任务中的评估空白,提出LiveMCP-101基准测试(101个真实查询),通过并行评估框架发现前沿LLM成功率低于60%,并识别出七种失败模式。

详情
AI中文摘要

工具调用已成为AI智能体的关键能力。与依赖静态、特定于提供商的工具定义的传统工具调用框架不同,模型上下文协议(MCP)提供了统一接口来动态发现和调用工具。然而,在现实动态场景中使用多样化MCP工具进行多步任务基准测试存在显著空白。在这项工作中,我们提出了LiveMCP-101,一个包含101个真实世界查询的基准测试,这些查询需要协调使用多个MCP工具。为了解决真实工具响应中的时间变异性,我们引入了一个并行评估框架,其中参考智能体同时执行经过验证的计划以产生实时参考输出。实验表明,即使是前沿LLM的成功率也低于60%,突显了多步工具使用中的挑战。全面的错误分析识别了涵盖工具规划、参数化和输出处理的七种失败模式,为改进当前模型指明了具体方向。LiveMCP-101为评估现实世界智能体能力设定了严格标准,推动通过MCP工具编排可靠执行复杂任务的自主智能体系统的发展。

英文摘要

Tool calling has emerged as a critical capability for AI agents. In contrast to conventional tool calling frameworks that rely on static, provider-specific tool definitions, the Model Context Protocol (MCP) offers a unified interface to discover and invoke tools dynamically. However, there is a significant gap in benchmarking multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 real-world queries that require coordinated use of multiple MCP tools. To address temporal variability in real-world tool responses, we introduce a parallel evaluation framework where a reference agent executes a validated plan simultaneously to produce real-time reference outputs. Experiments show that even frontier LLMs achieve a success rate below 60\%, highlighting challenges in multi-step tool use. Comprehensive error analysis identifies seven failure modes spanning tool planning, parameterization, and output handling, pointing to concrete directions for improving current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous agent systems that reliably execute complex tasks through MCP tool orchestration.

2506.10054 2026-05-26 cs.LG cs.AI cs.CL cs.CV 版本更新

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs

Uni-DPO:大语言模型动态偏好优化的统一范式

Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Xi’an Jiaotong University(西安交通大学) The Chinese University of Hong Kong(香港中文大学) University of Chinese Academy of Sciences(中国科学院大学) Tsinghua University(清华大学) Huazhong University of Science and Technology(华中科技大学)

AI总结 针对现有DPO方法忽略数据质量和学习难度差异的问题,提出Uni-DPO统一框架,通过自适应重加权偏好对实现更有效的数据利用和更优性能。

Comments Accepted by ICLR 2026. Code & models: https://github.com/pspdada/Uni-DPO

详情
AI中文摘要

直接偏好优化(DPO)因其简单高效已成为从人类反馈中进行强化学习(RLHF)的基石。然而,现有的基于DPO的方法通常平等对待所有偏好对,忽略了数据质量和学习难度的显著差异,导致数据利用效率低下和性能次优。为解决这一局限,我们提出Uni-DPO,一个统一的动态偏好优化框架,该框架联合考虑(a)偏好对的内在质量和(b)模型在训练过程中的动态表现。通过基于这两个因素自适应地重新加权样本,Uni-DPO能够更有效地利用偏好数据并实现卓越性能。跨模型和基准的大量实验证明了Uni-DPO的有效性和泛化能力。在文本任务上,使用Uni-DPO微调的Gemma-2-9B-IT在Arena-Hard上超越领先的大语言模型Claude 3 Opus 6.7个百分点。在数学和多模态任务上,Uni-DPO在所有基准上持续优于基线方法,为其有效性和鲁棒性提供了强有力的实证证据。

英文摘要

Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based methods typically treat all preference pairs equally, overlooking substantial variations in data quality and learning difficulty, which leads to inefficient data utilization and suboptimal performance. To address this limitation, we propose Uni-DPO, a unified dynamic preference optimization framework that jointly considers (a) the inherent quality of preference pairs and (b) the model's evolving performance during training. By adaptively reweighting samples based on both factors, Uni-DPO enables more effective use of preference data and achieves superior performance. Extensive experiments across models and benchmarks demonstrate the effectiveness and generalization of Uni-DPO. On textual tasks, Gemma-2-9B-IT fine-tuned with Uni-DPO surpasses the leading LLM, Claude 3 Opus, by 6.7 points on Arena-Hard. On mathematical and multimodal tasks, Uni-DPO consistently outperforms baseline methods across all benchmarks, providing strong empirical evidence of its effectiveness and robustness.

2505.23764 2026-05-26 cs.CV cs.CL 版本更新

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

MMSI-Bench:多图像空间智能基准

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) The Chinese University of Hong Kong(香港中文大学) Zhejiang University(浙江大学) Tsinghua University(清华大学) Shanghai Jiaotong University(上海交通大学) University of Hong Kong(香港大学) Beijing Normal University(北京师范大学)

AI总结 提出MMSI-Bench基准,通过1000道精心设计的VQA问题评估多图像空间推理能力,发现现有模型准确率远低于人类。

Comments ICLR 2026 Camera ready. 38 pages. Project page: https://runsenxu.com/projects/MMSI_Bench

详情
AI中文摘要

空间智能对于在复杂物理世界中运行的多模态大语言模型(MLLMs)至关重要。然而,现有基准仅探测单图像关系,无法评估实际部署所需的多图像空间推理。我们引入MMSI-Bench,一个专用于多图像空间智能的VQA基准。六位3D视觉研究人员花费超过300小时,从超过12万张图像中精心制作了1000个具有挑战性、无歧义的多选题,每个问题都配有精心设计的干扰项和逐步推理过程。我们进行了大量实验,评估了37个开源和专有MLLMs,观察到巨大差距:最强的开源模型准确率约30%,OpenAI的GPT-5推理模型达到40%,而人类得分为97%。这些结果凸显了MMSI-Bench的挑战性以及未来研究的巨大空间。利用注释的推理过程,我们还提供了一个自动错误分析流程,诊断出四种主要失败模式,包括(1)接地错误,(2)重叠匹配和场景重建错误,(3)情境转换推理错误,以及(4)空间逻辑错误,为推进空间智能提供了见解。项目页面:https://runsenxu.com/projects/MMSI_Bench 。

英文摘要

Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a stepwise reasoning process. We conduct extensive experiments and evaluate 37 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's GPT-5 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering insights for advancing spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench .

2505.13878 2026-05-26 cs.LG cs.CL 版本更新

InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models

InfiFPO:通过偏好优化实现大型语言模型的隐式模型融合

Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Fei Wu, Hongxia Yang

发表机构 * The Hong Kong Polytechnic University (PolyU)(香港理工大学) Zhejiang University(浙江大学) PolyU-Daya Bay Technology and Innovation Research Institute(香港理工大学-大亚湾技术与创新研究院)

AI总结 提出InfiFPO方法,通过将DPO中的参考模型替换为融合源模型,在序列级别合成多源概率,实现隐式模型融合,从而在偏好对齐阶段有效融合多个LLM并提升性能。

详情
Journal ref
NeurIPS 2025
AI中文摘要

模型融合通过轻量训练方法将具有不同优势的多个大型语言模型(LLM)组合成一个更强大的集成模型。现有的模型融合工作主要关注监督微调(SFT),而偏好对齐(PA)——增强LLM性能的关键阶段——在很大程度上未被探索。当前少数在PA阶段的融合方法(如WRPO)通过仅利用源模型的响应输出而丢弃其概率信息来简化过程。为了解决这一局限性,我们提出了InfiFPO,一种用于隐式模型融合的偏好优化方法。InfiFPO将直接偏好优化(DPO)中的参考模型替换为一个融合源模型,该模型在序列级别合成多源概率,从而规避了先前工作中复杂的词汇对齐挑战,同时保留了概率信息。通过引入概率裁剪和最大边际融合策略,InfiFPO使枢轴模型能够与人类偏好对齐,同时有效地从源模型中蒸馏知识。在11个广泛使用的基准上的综合实验表明,InfiFPO始终优于现有的模型融合和偏好优化方法。当使用Phi-4作为枢轴模型时,InfiFPO在11个基准上的平均性能从79.95提升至83.33,显著增强了其在数学、编码和推理任务上的能力。

英文摘要

Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing works on model fusion focus primarily on supervised fine-tuning (SFT), leaving preference alignment (PA) --a critical phase for enhancing LLM performance--largely unexplored. The current few fusion methods on PA phase, like WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing complex vocabulary alignment challenges in previous works and meanwhile maintaining the probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improve its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.

2503.11657 2026-05-26 cs.CL 版本更新

Scaling Natural-Language Graph-Based Test Time Compute for Automated Theorem Proving

扩展基于自然语言图结构的测试时计算用于自动定理证明

Vincent Li, Tim Knappe, Yule Fu, Kevin Han, Kevin Zhu

发表机构 * Boston University Provadis School of International Management \& Technology Duke University Algoverse AI Research

AI总结 提出KG-prover框架,利用从权威数学文本挖掘的知识图谱增强通用大语言模型,通过扩展图结构的测试时计算显著提升自动定理证明性能。

Comments Accepted to ICML AI4Math Workshop 2025, NAACL SRW 2025

详情
AI中文摘要

大型语言模型在需要多步逻辑推理的自然语言处理任务(如自动定理证明)中展现出卓越能力。然而,定理证明中仍存在挑战,例如识别关键数学概念、理解其相互关系以及在自然语言中正确形式化证明。我们提出KG-prover,一种新颖框架,利用从权威数学文本挖掘的知识图谱来增强通用大语言模型,以构建和形式化数学证明。我们还研究了使用KG-Prover扩展基于图结构的测试时计算的效果,在多个数据集上展示了相比基线显著的性能提升。结合KG-Prover,通用大语言模型在miniF2F-test上提升高达21%,在ProofNet、miniF2F-test和MUSTARD数据集上持续提升2-11%。此外,使用o4-mini的KG-Prover在pass miniF2F-test上达到50%。这项工作为无需额外微调即可利用知识图谱增强自然语言证明推理提供了一种有前景的方法。

英文摘要

Large language models have demonstrated remarkable capabilities in natural language processing tasks requiring multi-step logical reasoning capabilities, such as automated theorem proving. However, challenges persist within theorem proving, such as the identification of key mathematical concepts, understanding their interrelationships, and formalizing proofs correctly within natural language. We present KG-prover, a novel framework that leverages knowledge graphs mined from reputable mathematical texts to augment general-purpose LLMs to construct and formalize mathematical proofs. We also study the effects of scaling graph-based, test-time compute using KG-Prover, demonstrating significant performance improvements over baselines across multiple datasets. General-purpose LLMs improve up to 21\% on miniF2F-test when combined with KG-Prover, with consistent improvements ranging from 2-11\% on the ProofNet, miniF2F-test, and MUSTARD datasets. Furthermore, KG-Prover with o4-mini achieves 50\% on pass miniF2F-test. This work provides a promising approach for augmenting natural language proof reasoning with knowledge graphs without the need for additional finetuning.

2410.15173 2026-05-26 cs.CL cs.AI 版本更新

Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation

揭示自回归LLM在事件表示中主题适配性的知识

Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton

发表机构 * Imperial College London(伦敦帝国学院) Columbia University(哥伦比亚大学) University of Washington(华盛顿大学)

AI总结 通过多种提示设计、输入上下文操作、推理和输出形式,研究自回归大语言模型是否具有一致且可表达的事件参数主题适配性知识,并在基准测试上取得新最优结果。

Comments Significant update with massive changes: all experiments rerun with current LLMs; includes new probability estimate analysis and expanded results in Sections 4 and 5. The paper has been accepted to CoNLL-2026

详情
AI中文摘要

主题适配性估计任务衡量语义参数与给定谓词特定语义角色的兼容性。我们通过实验各种提示设计、操作输入上下文、推理和输出形式,研究自回归LLM是否具有一致且可表达的事件参数主题适配性知识。我们在主题适配性基准测试上取得了新的最优结果,但表明封闭和开放权重的LLM对我们的提示策略反应不同:封闭模型总体得分更高,并从多步推理中受益,但在过滤与给定谓词、角色和参数不兼容的生成句子方面表现较差。我们的分析表明,词元元组输入和句子输入导致主题适配性得分分布出人意料地不同。

英文摘要

The thematic fit estimation task measures semantic arguments' compatibility with a given semantic role for a given predicate. We investigate if autoregressive LLMs have consistent, expressible knowledge of event arguments' thematic fit by experimenting with various prompt designs, manipulating input context, reasoning, and output forms. We set a new state-of-the-art on thematic fit benchmarks, but show that closed and open weight LLMs respond differently to our prompting strategies: Closed models achieve better scores overall and benefit from multi-step reasoning, but they perform worse at filtering out generated sentences incompatible with the given predicate, role, and argument. Our analysis shows that lemma tuple input and sentence input result in surprisingly different thematic fit score distributions.

2407.05682 2026-05-26 cs.CL 版本更新

Retrieved In-Context Principles from Previous Mistakes

从先前错误中检索的上下文原则

Hao Sun, Yong Jiang, Bo Wang, Yingyan Hou, Yan Zhang, Pengjun Xie, Fei Huang

发表机构 * Peking University(北京大学) Beijing Institute of Technology(北京理工大学) Alibaba Group(阿里巴巴集团) Chinese Academy of Sciences(中国科学院)

AI总结 提出检索式上下文原则(RICP)框架,通过教师模型分析学生模型错误生成原则,聚类错误以增强覆盖,检索相关错误生成问题级原则,提升定制化,无需推理时干预,在七个推理基准上提升多种提示策略性能。

详情
AI中文摘要

上下文学习(ICL)在使用正确输入-输出示例将大型语言模型(LLMs)适应下游任务方面发挥了重要作用。最近的进展试图通过从错误中推导出的原则来改进模型性能,但这些方法缺乏定制化和错误覆盖不足。为了解决这些限制,我们提出了检索式上下文原则(RICP),一种新颖的教师-学生框架。在RICP中,教师模型分析学生模型的错误,生成避免类似错误的原因和见解。这些错误根据其根本原因进行聚类,以制定任务级原则,增强原则的错误覆盖。在推理过程中,为每个问题检索最相关的错误以创建问题级原则,提高所提供指导的定制化。RICP与现有提示方法正交,且在推理期间不需要教师模型的干预。在七个推理基准上的实验结果表明,RICP在应用于各种提示策略时有效提升了性能。

英文摘要

In-context learning (ICL) has been instrumental in adapting Large Language Models (LLMs) to downstream tasks using correct input-output examples. Recent advances have attempted to improve model performance through principles derived from mistakes, yet these approaches suffer from lack of customization and inadequate error coverage. To address these limitations, we propose Retrieved In-Context Principles (RICP), a novel teacher-student framework. In RICP, the teacher model analyzes mistakes from the student model to generate reasons and insights for preventing similar mistakes. These mistakes are clustered based on their underlying reasons for developing task-level principles, enhancing the error coverage of principles. During inference, the most relevant mistakes for each question are retrieved to create question-level principles, improving the customization of the provided guidance. RICP is orthogonal to existing prompting methods and does not require intervention from the teacher model during inference. Experimental results across seven reasoning benchmarks reveal that RICP effectively enhances performance when applied to various prompting strategies.

2404.00176 2026-05-26 cs.CL 版本更新

The LSCD Benchmark: a Testbed for Diachronic Word Meaning Tasks

LSCD基准:一个用于历时词义任务的测试平台

Dominik Schlechtweg, Sachin Yadav, Jonas Kuhn, Nikolay Arefyev

发表机构 * University of Stuttgart(斯图加特大学) University of Oslo(奥斯陆大学)

AI总结 针对词汇语义变化检测任务中评估标准不统一的问题,提出一个标准化基准库,通过模块化设计支持WiC、WSI和LSCD子任务的评估与组合。

Comments *SEM, 9 pages

详情
AI中文摘要

词汇语义变化检测(LSCD)是一个复杂的词元级任务,通常基于两个后续应用的用法级任务来操作:首先,为用法对推导出词在上下文(WiC)标签;然后,将这些标签表示在一个图上,应用词义归纳(WSI)来推导出义簇;最后,通过比较跨时间的义簇来推导出LSCD标签。这种模块化反映在大多数LSCD数据集和模型中。它也导致了建模选项和任务定义的高度异质性,而数据集版本、预处理选项和评估指标的多样性加剧了这种异质性。这种异质性使得在可比条件下评估模型、选择最优模型组合或复现结果变得困难。因此,我们提供了一个标准化LSCD评估的基准库。通过透明的实现,结果易于复现,并且通过标准化,不同的组件可以自由组合。该库通过允许对WiC、WSI和LSCD进行模型评估来反映任务的模块化。这允许对日益复杂的模型组件进行仔细评估,为模型优化提供新途径。

英文摘要

Lexical Semantic Change Detection (LSCD) is a complex, lemma-level task, which is usually operationalized based on two subsequently applied usage-level tasks: First, Word-in-Context (WiC) labels are derived for pairs of usages. Then, these labels are represented in a graph on which Word Sense Induction (WSI) is applied to derive sense clusters. Finally, LSCD labels are derived by comparing sense clusters over time. This modularity is reflected in most LSCD datasets and models. It also leads to a large heterogeneity in modeling options and task definitions, which is exacerbated by a variety of dataset versions, preprocessing options and evaluation metrics. This heterogeneity makes it difficult to evaluate models under comparable conditions, to choose optimal model combinations or to reproduce results. Hence, we provide a benchmark repository standardizing LSCD evaluation. Through transparent implementation results become easily reproducible and by standardization different components can be freely combined. The repository reflects the task's modularity by allowing model evaluation for WiC, WSI and LSCD. This allows for careful evaluation of increasingly complex model components providing new ways of model optimization.

2403.04780 2026-05-26 cs.CL cs.AI 版本更新

Graph-oriented Instruction Tuning of Large Language Models for Generic Graph Mining

面向通用图挖掘的大语言模型图导向指令微调

Yanchao Tan, Hang Lv, Pengxiang Zhan, Shiping Wang, Carl Yang

发表机构 * Engineering Research Center of Big Data Intelligence, Ministry of Education(教育部大数据智能工程研究中心) Fujian Key Laboratory of Network Computing and Intelligent Information Processing(福建省网络计算与智能信息处理重点实验室) College of Computer and Data Science, Fuzhou University(福州大学计算机与数据科学学院) Department of Computer Science, Emory University(埃默里大学计算机科学系)

AI总结 提出MuseGraph框架,通过紧凑图描述、基于思维链的指令生成和图感知指令微调,将GNN与LLM结合,实现跨任务和数据集的高效图挖掘。

Comments Accepted by TPAMI 2025

详情
Journal ref
IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, no. 1, pp. 155-169, Jan. 2026
AI中文摘要

具有丰富属性的图对于建模互联实体和增强各种实际应用中的预测至关重要。传统的图神经网络(GNN)通常需要针对不同的图任务和数据集进行重新训练。尽管大语言模型(LLM)的出现为自然语言处理带来了新范式,但它们在通用图挖掘(即训练单个模型同时处理多样任务和数据集)方面的潜力仍未充分探索。为此,我们的新颖框架MuseGraph无缝地将GNN和LLM的优势整合到一个基础模型中,用于跨任务和数据集的图挖掘。该框架首先采用紧凑的图描述,在语言令牌限制内封装关键图信息。然后,我们提出了一种基于思维链(CoT)指令包的多样化指令生成机制,以从GPT-4等高级LLM中提取推理能力。最后,我们设计了一种图感知的指令微调策略,以促进多个任务和数据集之间的相互增强,同时防止LLM生成能力的灾难性遗忘。我们的实验结果表明,在五个图任务和十个数据集上取得了显著改进,展示了MuseGraph在提高图导向下游任务准确性的同时增强LLM生成能力的潜力。

英文摘要

Graphs with abundant attributes are essential in modeling interconnected entities and enhancing predictions across various real-world applications. Traditional Graph Neural Networks (GNNs) often require re-training for different graph tasks and datasets. Although the emergence of Large Language Models (LLMs) has introduced new paradigms in natural language processing, their potential for generic graph mining, training a single model to simultaneously handle diverse tasks and datasets, remains under-explored. To this end, our novel framework MuseGraph, seamlessly integrates the strengths of GNNs and LLMs into one foundation model for graph mining across tasks and datasets. This framework first features a compact graph description to encapsulate key graph information within language token limitations. Then, we propose a diverse instruction generation mechanism with Chain-of-Thought (CoT)-based instruction packages to distill the reasoning capabilities from advanced LLMs like GPT-4. Finally, we design a graph-aware instruction tuning strategy to facilitate mutual enhancement across multiple tasks and datasets while preventing catastrophic forgetting of LLMs' generative abilities. Our experimental results demonstrate significant improvements in five graph tasks and ten datasets, showcasing the potential of our MuseGraph in enhancing the accuracy of graph-oriented downstream tasks while improving the generation abilities of LLMs.

2305.11663 2026-05-26 cs.LG cs.AI cs.CL cs.CY 版本更新

Algorithmic failure as a humanities methodology: machine learning's mispredictions identify rich cases for qualitative analysis

作为人文学科方法论的算法失败:机器学习的错误预测识别出用于定性分析的丰富案例

Jill Walker Rettberg

AI总结 本文通过实验验证了Munk等人提出的利用机器学习失败预测识别定性分析中模糊且丰富案例的方法,使用简单kNN算法对虚构角色与机器视觉技术互动的动作数据进行分类,发现不可预测的动作更具矛盾性和情感负荷,支持该方法在人文学科中的适用性。

详情
Journal ref
Big Data & Society 9(2) 2022
AI中文摘要

本文评论测试了Munk等人(2022)提出的一种方法论,即利用机器学习中的失败预测作为识别定性分析中模糊且丰富案例的方法。使用一个描述500件艺术品、电影、小说和电子游戏中虚构角色与机器视觉技术互动动作的数据集,我训练了一个简单的机器学习算法(使用R中的kNN算法),仅根据虚构角色的信息预测动作是主动还是被动。可预测的动作通常是缺乏情感且明确的,其中机器视觉技术被当作简单工具。不可预测的动作,即算法无法正确预测的动作,则更加矛盾且情感负荷更重,角色与技术之间的权力关系更为复杂。因此,结果支持Munk等人的理论,即失败预测可以有效地用于识别定性分析的丰富案例。本测试不仅简单复制了Munk等人的结果,还证明了该方法可以应用于更广泛的人文学科领域,并且不需要复杂的神经网络,简单的机器学习算法也能奏效。需要进一步研究以理解该方法适用于哪些类型的数据以及哪种机器学习最具生成性。为此,附上了产生结果所需的R代码,以便复制测试。该代码也可重复使用或改编,以在其他数据集上测试该方法。

英文摘要

This commentary tests a methodology proposed by Munk et al. (2022) for using failed predictions in machine learning as a method to identify ambiguous and rich cases for qualitative analysis. Using a dataset describing actions performed by fictional characters interacting with machine vision technologies in 500 artworks, movies, novels and videogames, I trained a simple machine learning algorithm (using the kNN algorithm in R) to predict whether or not an action was active or passive using only information about the fictional characters. Predictable actions were generally unemotional and unambiguous activities where machine vision technologies were treated as simple tools. Unpredictable actions, that is, actions that the algorithm could not correctly predict, were more ambivalent and emotionally loaded, with more complex power relationships between characters and technologies. The results thus support Munk et al.'s theory that failed predictions can be productively used to identify rich cases for qualitative analysis. This test goes beyond simply replicating Munk et al.'s results by demonstrating that the method can be applied to a broader humanities domain, and that it does not require complex neural networks but can also work with a simpler machine learning algorithm. Further research is needed to develop an understanding of what kinds of data the method is useful for and which kinds of machine learning are most generative. To support this, the R code required to produce the results is included so the test can be replicated. The code can also be reused or adapted to test the method on other datasets.

2605.16562 2026-05-26 cs.CL cs.DL 版本更新

Scaling Accessible Mathematics on arXiv: HTML Conversion and MathML 4

arXiv 上可访问数学的规模化:HTML 转换与 MathML 4

Deyan Ginev, Brian Caruso, Bruce Miller, Jeff Sank, Jacob Weiskoff

发表机构 * arXiv, Cornell Tech, New York NY, USA(arXiv,康奈尔科技,纽约NY,美国) National Institute of Standards and Technology, Gaithersburg MD, USA(国家标准与技术研究院,加ithersburg MD,美国)

AI总结 本文报告 arXiv HTML 论文服务的持续开发,重点介绍 2025 年至 2026 年初的社区驱动改进、语料级转换、MathML 4 意图注释以及 LaTeXML 的 Rust 移植,旨在提升 HTML 保真度、可访问性和计算效率。

Comments 6 pages, ICMS 2026

详情
AI中文摘要

我们报告了 arXiv HTML 论文服务的持续开发情况,该服务自 2023 年首次发布以来已应用于每个新的 TeX/LaTeX 提交。2025 年至 2026 年初的主要亮点包括:(i) 社区驱动的 HTML 保真度和服务健康改进,约 6000 份用户报告中有一半已解决;(ii) 面向 90% 无错误 HTML 的语料级转换工作(目前为 75%);(iii) 用于可访问语音输出的初始 MathML 4 意图注释;(iv) LaTeXML 的 Rust 移植正在进行中,可降低计算成本并在提交时实现更快的预览。arXiv HTML 论文项目仍处于实验阶段,但随着我们更好地理解 arXiv 读者的需求以及新标准、编程语言和 AI 进步带来的技术机遇,该项目正在逐步成熟。

英文摘要

We report on the ongoing development of arXiv's HTML Papers offering, available on every new TeX/LaTeX submission since its initial release in 2023. The main highlights from 2025 and early 2026 are: (i) community-driven improvements to HTML fidelity and service health, with roughly half of 6,000 user reports resolved; (ii) corpus-scale conversion work aimed at 90% error-free HTML (currently 75%); (iii) initial MathML 4 Intent annotations for accessible speech output; (iv) an in-progress Rust port of LaTeXML, reducing compute costs and enabling faster previews on submission. The arXiv HTML Papers project remains experimental, but is gradually maturing as we better understand the needs of arXiv's readers and the technical opportunities presented by new standards and by advances in programming languages and AI.

2603.18363 2026-05-26 cs.CL cs.AI cs.LG 版本更新

PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

PowerFlow: 通过原则性分布匹配释放LLMs的双重特性

Ruishuo Chen, Yu Chen, Zhuoran Li, Longbo Huang

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China(清华大学交叉信息研究院)

AI总结 提出PowerFlow框架,将无监督微调重构成分布匹配问题,利用GFlowNet和长度感知轨迹平衡目标,通过调整α-幂分布方向性激发LLMs的逻辑推理或创造性。

Comments Camera-ready version accepted at ICML 2026

详情
AI中文摘要

无监督内部反馈强化学习(RLIF)已成为一种有前景的范式,可以在没有外部监督的情况下激发大型语言模型(LLMs)的潜在能力。然而,当前方法依赖于启发式内在奖励,通常缺乏明确的理论优化目标,并且容易产生退化偏差。在这项工作中,我们引入了PowerFlow,一个原则性框架,将无监督微调重新表述为分布匹配问题。通过将GFlowNet视为未归一化密度的摊销变分采样器,我们提出了一个长度感知的轨迹平衡目标,明确抵消了自回归生成中固有的结构长度偏差。通过针对$α$-幂分布,PowerFlow能够方向性地激发LLMs的双重特性:锐化分布($α> 1$)以增强逻辑推理,或展平分布($α< 1$)以释放表达性创造力。大量实验表明,PowerFlow始终优于现有的RLIF方法,匹配甚至超过有监督的GRPO。此外,通过减轻对齐模型中的过度锐化,我们的方法在多样性和质量上同时取得提升,在创造性任务中推动了帕累托前沿。

英文摘要

Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $α$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($α> 1$) to intensify logical reasoning, or flattening it ($α< 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.

2602.20191 2026-05-26 cs.LG cs.AI cs.CL 版本更新

MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Any-Precision LLM

MoBiQuant: 面向令牌自适应任意精度LLM的混合比特量化

Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, KhayTze Peong, Kang Eun Jeon, Jong Hwan Ko, Yiran Chen, Huanrui Yang

发表机构 * University of Arizona(亚利桑那大学) Duke University(杜克大学) Sungkyunkwan University(成均馆大学) Panasonic AI Lab(松下人工智能实验室) Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 针对动态运行时约束下大语言模型任意精度量化的泛化性问题,提出基于令牌敏感度的混合比特量化框架MoBiQuant,通过多合一递归残差量化和令牌感知路由器实现灵活推理,在匹配或超越前沿单精度PTQ的同时显著节省内存并提升吞吐量。

Comments 20 pages, 10 figures

详情
AI中文摘要

动态运行时延迟和内存约束要求灵活部署大语言模型(LLM),使得LLM能够根据可用计算资源以不同的量化精度进行推理。最近关于这种任意精度量化的工作要么依赖于硬件效率低下的向量量化,要么在切换位宽时引入额外的缩放因子。同时,现有的为固定低精度校准的后训练量化(PTQ)方法在运行时精度变化下表现出较差的泛化性。在这项工作中,我们将跨位宽泛化性差的根源归因于一种精度依赖的“异常迁移”现象,其中PTQ敏感令牌的分布随精度变化。受此观察启发,我们提出了 exttt{MoBiQuant},一种新颖的任意精度混合比特量化框架,它根据令牌敏感性调整权重精度以实现灵活的LLM推理。具体来说,我们提出了一种多合一递归残差量化方法,可以在运行时迭代重建更高精度的权重,并通过令牌感知路由器缓解“异常迁移”,动态选择每个令牌的最优推理精度。大量实验表明, exttt{MoBiQuant}在匹配或超越前沿单精度PTQ的同时表现出强大的弹性,与最先进的任意精度方法相比,实现了显著的内存节省和高达$1.34 imes$的吞吐量提升。

英文摘要

Dynamic runtime latency and memory constraints necessitate flexible large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources. Recent work on such any-precision quantization either relies on hardware-inefficient vector quantization or induces additional scaling factors when switching between bit-widths. Meanwhile, existing post-training quantization (PTQ) methods calibrated for a fixed low precision show poor generalizability under runtime precision change. In this work, we attribute the source of poor generalization across bit-widths to a precision-dependent \textit{outlier migration} phenomenon where the distribution of PTQ-sensitive tokens changes across precisions. Motivated by this observation, we propose \texttt{MoBiQuant}, a novel any-precision Mixture-of-Bits quantization framework that adjusts weight precision for flexible LLM inference based on token sensitivity. Specifically, we propose a many-in-one recursive residual quantization that can iteratively reconstruct higher-precision weights at runtime and mitigates \textit{outlier migration} with a token-aware router to dynamically select the optimal inference precision of each token.Extensive experiments show that \texttt{MoBiQuant} matches or surpasses frontier single-precision PTQ while exhibiting strong elasticity, achieving significant memory savings and throughput gains of up to $1.34\times$ over state-of-the-art any-precision methods.

2601.20539 2026-05-26 cs.AI cs.CL 版本更新

PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs

PathWise:通过世界模型规划实现基于自进化LLM的自动启发式设计

Oguzhan Gungordu, Siheng Xiong, Faramarz Fekri

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出PathWise多智能体推理框架,将启发式生成建模为基于蕴含图的序列决策过程,通过策略智能体、世界模型智能体和评论智能体的协作实现状态感知规划,在组合优化问题上收敛更快、泛化更强。

Comments Accepted to ICML 2026

详情
AI中文摘要

大型语言模型(LLM)已实现组合优化问题(COP)的自动启发式设计(AHD),但现有框架依赖固定的进化规则和静态提示模板,常导致短视的启发式生成、冗余评估以及对新启发式如何推导的有限推理。我们提出一种新颖的多智能体推理框架,称为通过世界模型规划实现基于自进化LLM的自动启发式设计(PathWise),该框架将启发式生成公式化为一个基于蕴含图的序列决策过程,该图作为搜索轨迹的紧凑、有状态记忆。这种方法使系统能够继承过去的决策,并在不同代之间重用或避免推导信息。策略智能体规划进化动作,世界模型智能体根据这些动作生成启发式展开,评论智能体提供路由反思,总结先前步骤的经验教训,将基于LLM的AHD从试错进化转变为通过推理进行状态感知规划。在多种COP上的实验表明,PathWise能更快收敛到更好的启发式,在不同LLM骨干上泛化,并扩展到更大规模的问题。

英文摘要

Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing frameworks' reliance on fixed evolutionary rules and static prompt templates often leads to myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived. We propose a novel multi-agent reasoning framework, referred to as Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs (PathWise), which formulates heuristic generation as a sequential decision process over an entailment graph serving as a compact, stateful memory of the search trajectory. This approach allows the system to carry forward past decisions and reuse or avoid derivation information across generations. A policy agent plans evolutionary actions, a world model agent generates heuristic rollouts conditioned on those actions, and critic agents provide routed reflections summarizing lessons from prior steps, shifting LLM-based AHD from trial-and-error evolution toward state-aware planning through reasoning. Experiments across diverse COPs show that PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes.

2512.12677 2026-05-26 cs.CL cs.AI 版本更新

Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches

微调因果大语言模型用于文本分类:基于嵌入与基于指令的方法

Amirhossein Yousefiramandi, Ciaran Cooney

发表机构 * Clarivate Intellectual Property(Clarivate知识产权)

AI总结 本文探索在资源受限下微调解码器-only大语言模型用于文本分类,比较了基于嵌入的分类头方法和基于指令的微调方法,并采用4位量化与LoRA实现高效训练,实验表明嵌入头方法在单标签分类中匹配或超越微调BERT基线,而指令微调仅在多标签且大参数量时有效。

Comments 20 pages, 5 figures

详情
AI中文摘要

我们探索在资源受限下高效微调解码器-only大语言模型(LLMs)用于下游文本分类的策略。研究了两种方法:(1) 将分类头附加到预训练的因果LLM上,并在任务上微调,使用LLM的最终token嵌入作为序列表示;(2) 以提示-响应的格式对LLM进行指令微调以进行分类。为了在单GPU上微调高达8B参数的模型,我们将4位模型量化与低秩适配(LoRA)结合,实现参数高效训练。在两个专利基准测试(一个5类单标签内部语料库和具有14个类别的公共WIPO-Alpha多标签数据集)上的实验表明,嵌入头方法在单标签分类中匹配或超过微调BERT基线,同时训练参数少10-30倍。指令微调仅在多标签场景下具有竞争力,且需要至少1亿参数的大幅可训练预算。这些结果表明,直接利用因果LLM的内部表示,结合高效微调技术,在有限计算资源下能产生强大的分类性能。我们讨论了每种方法的优势,并概述了在分类场景中优化LLM微调的实用指南和未来方向。

英文摘要

We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pretrained causal LLM and fine-tuning it on the task, using the LLM's final-token embedding as a sequence representation, and (2) instruction-tuning the LLM in a prompt-to-response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two patent benchmarks, a 5-class single-label internal corpus and the public WIPO-Alpha multi-label dataset with 14 categories, show that the embedding-head approach matches or exceeds fine-tuned BERT baselines on single-label classification while training 10-30x fewer parameters. Instruction-tuning is competitive only in the multi-label regime, and only with substantially larger trainable budgets of at least 100M parameters. These results demonstrate that directly leveraging the internal representations of causal LLMs, together with efficient fine-tuning techniques, yields strong classification performance under limited computational resources. We discuss the advantages of each approach and outline practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.

2407.20595 2026-05-26 cs.DL cs.CL 版本更新

HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction

HALvest-Contrastive: 基于补丁级后期交互的检索式作者归属

Francis Kulumba, Wissam Antoun, Guillaume Vimont, Laurent Romary, Florian Cafiero

发表机构 * Inria Paris(Inria 巴黎) Sorbonne Université(索邦大学) IRIF LRE, EPITA Ecole nationale des chartes – PSL(LRE, EPITA 法国国家档案馆 – PSL)

AI总结 针对文本作者归属任务中主题混淆问题,提出HALvest多语料库及其对比版本HALvest-Contrastive,并采用补丁级后期交互方法提升跨主题作者识别性能。

Comments 19 pages, 9 figures. Under review

详情
AI中文摘要

作者归属任务旨在判断两段文本是否由同一作者所写,但主题混淆使得该任务看似简单:两位作者讨论同一主题可能比一位作者讨论两个主题看起来更相似。学术散文提供了一种自然解决方案:学术作者在保持一致风格习惯的同时,会撰写多篇关于相关但不同主题的论文。我们引入了HALvest,一个包含170亿词元的开放获取学术论文多语料库,及其英语对比衍生版本HALvest-Contrastive,其中同一作者的段落取自学科领域内的不同论文,以最小化主题重叠。我们通过展示一个强词法基线在移除主题捷径后性能崩溃来验证我们的基准。在同一基准上,我们重新审视了作者归属的评分方式。标准系统将每篇文档压缩为单个向量。我们则保留向量序列并通过后期交互进行比较,然后提出补丁级后期交互,在匹配前将相邻词元分组为补丁。序列级匹配相比单向量基线显著提升了性能,但最优交互粒度是微妙的。

英文摘要

Authorship attribution asks whether two pieces of text share a writer, but topical confound makes the task deceptively easy: two authors covering the same topic may look more alike than one author covering two topics. Scholarly prose offers a natural remedy, academic writers produce multiple papers on related but distinct topics while maintaining consistent stylistic habits. We introduce HALvest, a 17-billion-token multilingual corpus of open-access academic papers, and its English contrastive derivative HALvest-Contrastive, where same-author passages are drawn from distinct papers within a disciplinary field to minimize topical overlap. We validate our benchmark by showing that a strong lexical baseline collapses once topical shortcuts are removed. On this same benchmark, we revisit how authorship is scored. Standard systems compress each document into a single vector. We instead keep a sequence of vectors and compare them with late interaction, then propose patch-level late interaction, which groups neighboring tokens into patches before matching. Matching at the sequence level greatly improves performance over the single-vector baseline, but the optimal interaction granularity is subtle.

2509.10452 2026-05-26 cs.CL cs.LG 版本更新

WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

WhisTLE: 深度监督的文本领域自适应方法用于预训练语音识别Transformer

Akshat Pandey, Karun Kumar, Raphael Tang

发表机构 * comcast Speech AI(comcast语音人工智能)

AI总结 提出WhisTLE,一种通过变分自编码器建模文本到编码器输出并微调解码器的文本领域自适应方法,显著降低词错误率。

Comments 10 pages

详情
AI中文摘要

预训练的自动语音识别(ASR)模型(如Whisper)表现良好,但仍需领域自适应以处理未见过的用语。在许多实际场景中,收集语音数据不切实际,因此需要仅文本的自适应。我们提出WhisTLE,一种用于预训练编码器-解码器ASR模型的深度监督文本自适应方法。WhisTLE训练一个变分自编码器(VAE)从文本建模编码器输出,并使用学习到的文本到潜在编码器微调解码器,可选地与文本到语音(TTS)自适应结合。在推理时,恢复原始编码器,不产生额外运行时成本。在四个数据集和四个ASR模型上,带有TTS的WhisTLE相对降低了49.0%的词错误率(WER),并在112个场景中的100个中优于所有非WhisTLE基线。我们还发现WhisTLE与任何其他领域自适应方法的组合都能互补增强;因此我们建议在标准流程中纳入WhisTLE以自适应编码器-解码器ASR模型。

英文摘要

Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by a relative 49.0% and outperforms all non-WhisTLE baselines in 100 of 112 scenarios. We also find that WhisTLE additively complements any combination of other domain adaptation approaches; we thus recommend the inclusion of WhisTLE during standard processes for adapting encoder-decoder ASR models.