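2606.15079 2026-06-16 cs.CL cs.AI New submission

Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

Zirui Pang, Chenlong Zhang, Haosheng Tan, Zhuoran Jin, Jiaheng Wei, Zixin Zhong

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Institute of Automation, Chinese Academy of Sciences; University of Glasgow

AI summary: To address the under-use of hard samples by on-policy optimization in LLM unlearning, the paper proposes ReRULE, which stores low-reward hard samples in an off-policy replay buffer and reuses them, improving unlearning efficiency while preserving general utility.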

Abstract

LLM unlearning has emerged as a cost-effective alternative to full retraining for removing hazardous knowledge from pretrained models while preserving general utility. Recent RL-based methods such as RULE reformulate unlearning as learning a refusal behavior, but their on-policy optimization repeatedly samples from the same forget and retain/boundary prompts throughout training. We identify a critical inefficiency in this process: easy cases quickly converge and provide little useful gradient signal, while hard cases near the forget/retain boundary continue to produce low-reward rollouts that are discarded after a single use. To address this issue, we propose ReRULE, an off-policy replay enhancement for reinforcement unlearning. ReRULE stores low-reward hard-case rollout groups in a replay buffer during early GRPO training and reuses them in later stages through importance-sampled off-policy updates, redirecting computation toward boundary cases that still require learning. Theoretically, we show that ReRULE yields a tighter hard-case convergence bound than pure on-policy RULE. Empirically, ReRULE improves MUSE-Books Retain Quality from 46.3 to 56.2 while adding only 5-11% training time across benchmarks. Its limited improvement on the simpler TOFU setting further supports the intended conditional behavior: replay is most beneficial when the hard/easy disparity is pronounced.
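
The replay idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names, the mean-reward threshold, the buffer capacity, and the PPO-style clipping of the importance ratio are all assumptions made for illustration.

```python
import math
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class RolloutGroup:
    prompt: str
    rewards: List[float]        # one reward per rollout in the group
    old_logprobs: List[float]   # sequence log-probs under the policy that sampled them


class HardCaseReplayBuffer:
    """Keeps only rollout groups whose mean reward is below a threshold (assumed rule)."""

    def __init__(self, capacity: int = 4, reward_threshold: float = 0.5):
        self.groups = deque(maxlen=capacity)   # oldest groups are evicted first
        self.reward_threshold = reward_threshold

    def maybe_store(self, group: RolloutGroup) -> bool:
        mean_reward = sum(group.rewards) / len(group.rewards)
        if mean_reward < self.reward_threshold:
            self.groups.append(group)
            return True
        return False


def importance_weights(old_logprobs, new_logprobs, clip=0.2):
    """Clipped ratio pi_new / pi_old for each replayed rollout."""
    weights = []
    for old, new in zip(old_logprobs, new_logprobs):
        ratio = math.exp(new - old)
        weights.append(min(max(ratio, 1.0 - clip), 1.0 + clip))
    return weights


buffer = HardCaseReplayBuffer()
easy = RolloutGroup("easy prompt", [1.0, 0.9, 1.0], [-2.0, -2.1, -1.9])
hard = RolloutGroup("boundary prompt", [0.1, 0.0, 0.3], [-3.0, -2.5, -4.0])
stored_easy = buffer.maybe_store(easy)   # high reward, so it is not kept
stored_hard = buffer.maybe_store(hard)   # low reward, so it is kept for replay
# Hypothetical log-probs of the same rollouts under a later policy.
weights = importance_weights(hard.old_logprobs, [-3.0, -2.0, -4.5])
```

In this sketch only the boundary prompt enters the buffer, and the weights correct for the gap between the policy that sampled the rollouts and the policy being updated.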

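2606.15378 2026-06-16 cs.CL cs.LG New submission

Rethinking the Role of Efficient Attention in Hybrid Architectures

Ziqing Qiao, Yinuo Xu, Chaojun Xiao, Zhou Su, Zihan Zhou, Yingfa Chen, Xiaoyue Xu, Xu Han, Zhiyuan Liu

Affiliations: Tsinghua University; OpenBMB

AI summary: The paper systematically analyzes the effect of efficient attention modules (such as sliding-window attention and recurrent sequence mixers) in hybrid architectures. It finds that they mainly affect how fast long-context capability emerges, identifies a "Large-Window Laziness" phenomenon, and shows that removing positional encoding from only the full-attention layers improves long-context performance.

Comments: 23 pages, 13 figures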

Abstract

Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis across hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, whereas efficient attention shapes its optimization trajectory. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
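
To make the contrast between full attention and sliding-window attention concrete, the following sketch builds both causal masks as nested lists. The window convention (the current token counts toward the window) is an assumption; the paper may define it differently.

```python
def causal_mask(seq_len):
    """Full causal attention: position i may attend to every j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Causal SWA: position i attends to the last `window` positions, itself included."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


full = causal_mask(6)
swa = sliding_window_mask(6, window=3)

# Number of keys the last query position can see under each mask.
full_reach = sum(full[5])
swa_reach = sum(swa[5])
```

Under the SWA mask the last position cannot see position 0 directly, which is why long-range retrieval in such a hybrid has to go through the full-attention layers.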

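2606.15419 2026-06-16 cs.CL cs.AI New submission

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

Zaifu Zhan, Shuang Zhou, Rui Zhang

Affiliations: University of Minnesota

AI summary: Proposes a multi-agent peer-reviewed reasoning method in which multiple LLMs independently generate chain-of-thought reasoning, evaluate one another, and output the answer from the best-rated chain. It outperforms single-model and majority-voting baselines on three medical QA datasets.

Comments: Accepted by the Journal of the American Medical Informatics Association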

Abstract

Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with five state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting. Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high- from low-quality reasoning chains. Conclusion: The proposed multi-agent peer-reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.
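
The selection step can be sketched in a few lines of Python and compared with the majority-voting baseline. The agent names, the 1-5 score scale, the exclusion of self-reviews, and the use of a plain mean are assumptions for illustration; the abstract does not specify the aggregation rule.

```python
from collections import Counter


def select_by_peer_review(candidates, scores):
    """Return the answer of the chain with the highest mean peer score.

    candidates: {agent: answer}
    scores: {reviewer: {author: score}}; self-reviews are ignored.
    """
    mean_score = {}
    for author in candidates:
        peer = [given[author] for reviewer, given in scores.items()
                if reviewer != author and author in given]
        mean_score[author] = sum(peer) / len(peer)
    best_author = max(mean_score, key=mean_score.get)
    return candidates[best_author], mean_score


def majority_vote(candidates):
    """Baseline: the most frequent answer wins."""
    return Counter(candidates.values()).most_common(1)[0][0]


# Hypothetical answers and review scores for one multiple-choice question.
candidates = {"agent_a": "B", "agent_b": "B", "agent_c": "D"}
scores = {
    "agent_a": {"agent_a": 5, "agent_b": 2, "agent_c": 4},
    "agent_b": {"agent_a": 2, "agent_b": 5, "agent_c": 5},
    "agent_c": {"agent_a": 3, "agent_b": 2, "agent_c": 5},
}
peer_answer, mean_score = select_by_peer_review(candidates, scores)
vote_answer = majority_vote(candidates)
```

Here the two agents that agree on "B" receive low peer scores, so peer review picks "D" while majority voting picks "B". This is the situation in which rating reasoning quality can differ from counting answer agreement.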

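2606.15521 2026-06-16 cs.CL cs.LG New submission

Emergent retokenization symmetry in large language models: phenomenology and applications

Kanishk Jain, Matthew Day, Tankut Can

Affiliations: Department of Physics, Emory University

AI summary: Finds that retokenization symmetry partially emerges in large language models during training, uses retokenization experiments to probe sensitivity and robustness to semantically equivalent input representations, and proposes a new inference-time sampling strategy.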

Abstract

Tokenization introduces representational redundancy: under a fixed token vocabulary, every byte string admits many valid token encodings, or segmentations, that decode to the same surface string. However, given a prompt, most language model tokenizers break this representational symmetry by returning a canonical segmentation. Training only on canonical segmentations should influence inference behavior, and there is little reason to expect models to respect segmentation symmetry on downstream tasks. We find that this symmetry partially emerges during training. Here, we probe this emergent symmetry through experiments testing token compositional understanding, representation diversity, and task-focused benchmark performance. We primarily use retokenization: replacing a prompt's canonical tokenization with an alternative segmentation while preserving its bytes exactly. Relative to other prompt perturbations, retokenization is unusually clean because it isolates segmentation effects without changing syntax, semantics or surface form. We use retokenization to study sensitivity and robustness to semantically identical input representations across pretraining and post-training. Moreover, this partial retokenization symmetry suggests a distinct inference-time sampling axis. While temperature sampling generates diverse outputs from the model using its next-token probability distribution, retokenization generates diversity from the model's internal computations through semantically equivalent input representations. We find that while this retokenization sampling strategy can hurt performance on easy problems, it can also recover solutions that conventional sampling does not find. Overall, our work presents retokenization as a simple yet powerful probe of large language models, shedding light on compositional understanding and prompt sensitivity, and offering a novel sampling strategy.
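
The redundancy described above is easy to see on a toy vocabulary. The sketch below enumerates every segmentation of a string and singles out a "canonical" one. The vocabulary is made up, the greedy longest-match tokenizer is a simplified stand-in for a real BPE tokenizer, and the code works on characters rather than bytes.

```python
def segmentations(text, vocab):
    """Enumerate every way to split `text` into tokens drawn from `vocab`."""
    if text == "":
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


def canonical(text, vocab):
    """Toy canonical tokenizer: greedy longest match from the left."""
    tokens = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in vocab:
                tokens.append(text[:end])
                text = text[end:]
                break
        else:
            raise ValueError("text cannot be tokenized with this vocabulary")
    return tokens


vocab = {"u", "n", "d", "o", "un", "do", "undo"}   # hypothetical vocabulary
all_segs = segmentations("undo", vocab)
canon = canonical("undo", vocab)
# Retokenizations: same surface string, different token sequence.
alternatives = [seg for seg in all_segs if seg != canon]
```

Every entry in `alternatives` decodes to the same string as the canonical segmentation, so feeding one of them to a model changes the input representation without changing the text.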

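2606.15733 2026-06-16 cs.CL cs.AI New submission

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

Zhenyu Yu

Affiliations: College of Computer Science and Artificial Intelligence, Fudan University

AI summary: Using a paired-view weight update and activation patching, the paper finds that the answer differences caused by replacing variable names in causal reasoning stem from representational misalignment rather than information loss, and verifies the aligning effect of counterfactual augmentation on Qwen and Llama models.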

Abstract

Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are unchanged. We ask whether this lexical gap reflects information loss in the placeholder view or a misaligned read-out from a representation that still carries answer-relevant content. Vernier uses a paired-view weight update as an instrument and then inspects the mechanism left after the gap closes. In the working regimes, the evidence favours representational misalignment. A variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The update that realigns the views is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL mainly sharpens intermediate answer-belief agreement. Success is bounded by model family, scale, and task. CRASS transfer is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.

2606.15734 2026-06-16 cs.CL cs.AI cs.IR cs.LG New submission

Retrievable Gradients: Continual Post-Training Without Cumulative Weight Drift

Weihang Su, Jiacheng Kang, Jingyan Xu, Qingyao Ai, Jianming Long, Hanwen Zhang, Bangde Du, Xinyuan Cao, Min Zhang, Yiqun Liu

Affiliations: Department of Computer Science and Technology, Tsinghua University

AI summary: Proposes the ReGrad paradigm, which treats gradients as retrievable units of knowledge and uses meta-learning to reshape document gradients into generalizable adaptation signals, enabling scalable parametric knowledge injection without weight drift.

Abstract

Continual post-training enables models to absorb emerging knowledge after deployment, but repeatedly updating shared parameters can accumulate weight drift, potentially causing catastrophic forgetting and degrading general capabilities. Retrieval-augmented generation avoids such parameter drift, yet often lacks the depth of parametric knowledge integration. In this paper, we propose ReGrad (Retrievable Gradients), a new paradigm that treats gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time for temporary weight adaptation. However, raw language-modeling gradients are optimized for token-level document reconstruction rather than for query-driven knowledge use. We therefore introduce a bi-level meta-learning objective that reshapes document-derived gradients into generalizable adaptation signals for downstream tasks. Experiments across general and domain-specific settings show that ReGrad outperforms CPT and RAG baselines, enabling scalable and reversible parametric knowledge injection without accumulating weight drift.
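
The store, retrieve, and temporarily adapt loop can be sketched with plain Python lists standing in for model weights. This is only an illustration of the data flow: the `GradientBank` interface, the dot-product retrieval, the single gradient step, and the learning rate are assumptions, and the bi-level meta-learning objective is not modeled.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class GradientBank:
    """Stores one precomputed gradient per document, indexed by a key vector."""

    def __init__(self):
        self.entries = []   # list of (key_vector, gradient)

    def add(self, key, gradient):
        self.entries.append((key, gradient))

    def retrieve(self, query, top_k=1):
        ranked = sorted(self.entries, key=lambda e: dot(query, e[0]), reverse=True)
        return [gradient for _, gradient in ranked[:top_k]]


def adapted_weights(base_weights, gradients, lr=0.1):
    """Return a temporary copy of the weights after one step per retrieved gradient."""
    weights = list(base_weights)   # copy, so the base weights are never mutated
    for gradient in gradients:
        weights = [w - lr * g for w, g in zip(weights, gradient)]
    return weights


bank = GradientBank()
bank.add(key=[1.0, 0.0], gradient=[1.0, -1.0, 0.0])   # hypothetical document on topic A
bank.add(key=[0.0, 1.0], gradient=[0.0, 2.0, -2.0])   # hypothetical document on topic B

base = [0.5, 0.5, 0.5]
temp = adapted_weights(base, bank.retrieve(query=[0.1, 0.9]))
```

Because the adaptation is applied to a copy, `base` is unchanged after the query is answered, which is the sense in which the injection is reversible and does not accumulate drift.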

2606.15778 2026-06-16 cs.CL cs.AI cs.LG cs.SI New submission

DYNA: Dynamic Episodic Memory Networks for Augmenting Large Language Models with Temporal Knowledge Graphs in Continuous Learning

Ali Sarabadani, Mahtab Tajvidiyan

Affiliations: Department of Computer Engineering and Information Technology, University of Qom

AI summary: Proposes the DYNA framework, which augments a frozen large language model with a temporal knowledge graph that serves as external, updatable memory, reducing catastrophic forgetting by about 7% and improving temporal ordering by about 5% on three temporal recall tasks.

2606.15821 2026-06-16 cs.CL cs.AI cs.LG New submission

The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages

Miso Choi, Seonga Choi, Mincheol Kwon, Woosung Joung, Jinkyu Kim, Jungbeom Lee

Affiliations: University of Science and Technology of China

AI summary: Finds strong inheritance of context-truthfulness scores between base LLMs and their downstream variants, and proposes TruthProbe, a soft-gating strategy that amplifies truthful heads to improve contextual truthfulness and reduce multimodal hallucination.

Comments: Accepted at ICML 2026

Abstract

Recent advances in large language models (LLMs) have produced many specialized multimodal LLMs (MLLMs) that share common foundational LLMs, forming distinct model lineages. It remains unclear whether a fundamental behavioral link exists between the foundational LLMs and downstream variants. We investigate this question by quantifying head-level context-truthfulness scores. Across diverse LLM and MLLM lineages, including Vicuna-, Qwen2.5-, LLaMA2-, and Mistral-based models, we find that Truth Scores are strongly preserved within model families, even after instruction tuning or multimodal adaptation. We further show that this inheritance is consistent with attention-head weight preservation, and that context-truthful heads attend to query-relevant evidence. Building on this finding, we propose TruthProbe, a soft-gating strategy that amplifies context-truthful heads while preserving other head contributions. TruthProbe improves contextual truthfulness on HaluEval and reduces multimodal hallucination on POPE and CHAIR, with base-LLM Truth Scores transferring effectively to their fine-tuned LLM and MLLM descendants. Code is available at https://github.com/miso-choi/TruthProbe.

2606.15872 2026-06-16 cs.CL New submission

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

Jingru Guo, Xiangyuan Xue, Lian Zhang, Wanghan Xu, Siki Chen, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin

Affiliations: Imperial College London; The Chinese University of Hong Kong; University of Illinois Urbana-Champaign; Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; University of Oxford; Shenzhen Loop Area Institute

AI summary: Proposes SciOrch, a framework that trains a lightweight 8B model to orchestrate multiple frontier LLMs. Optimized with MCTS and GRPO, it surpasses the strongest single model and the strongest multi-agent baseline on scientific reasoning tasks.

Abstract

Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with an MCTS-based approach, producing diverse orchestration trajectories, extracting per-node single-turn samples, and optimizing the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE with less than half the API cost of typical multi-agent methods.
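
For readers unfamiliar with GRPO-style training, the core of it is a group-relative advantage: each sample's reward is normalized against the other samples drawn for the same prompt or node. The sketch below shows only that normalization, under the assumption that the population standard deviation and a small epsilon are used; the abstract does not describe the authors' exact objective.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four samples taken at the same search node.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group whose samples all receive the same reward carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct samples get a positive advantage and incorrect ones a negative advantage, while a group with identical rewards yields all zeros.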

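2606.15877 2026-06-16 cs.CL cs.AI New submission

Free Energy Heuristics: Fast-And-Frugal Cognition as Active Inference Under Uncertain Precision

Alex Bogdan

Affiliations: Evolutionairy AI, Toronto, Canada

AI summary: The paper argues that meta-uncertainty determines whether chain-of-thought (CoT) helps: when a model is highly uncertain about the reliability of its own evidence, more reasoning lowers accuracy. It proves that, under a heavy-tailed precision prior, the free-energy-minimizing policy stops integrating after a finite number of high-validity cues and is equivalent to the take-the-best heuristic. Experiments confirm that long CoT lowers accuracy by 17.3 points under high meta-uncertainty.

Comments: 64 pages, 6 figures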

Abstract

Chain-of-thought (CoT) improves large language models' performance in math and symbolic reasoning. But on planning, contested ethics, and tasks where the model cannot check itself, more reasoning makes things worse. Both effects are documented; what has been missing is a principled account of which property decides the outcome. We argue it is meta-uncertainty: how unsure the model is about the reliability of its own evidence. When that uncertainty is high, extra reasoning stops adding signal and starts manufacturing false confidence. We prove that the policy minimizing expected free energy under uncertain precision stops integrating cues after a finite number of high-validity ones when the precision prior is heavy-tailed (Theorem 2.6.1), and under a Descending Dominance condition, is sample-wise identical to take-the-best (Theorem 2.7.4). Fast-and-frugal heuristics and active inference are, then, two descriptions of the same computation. The prediction is that on high-meta-uncertainty items, longer CoT should degrade accuracy. We score the regime per item (simulate-and-recover rho > 0.96), build FEH-79, a benchmark of Knightian frames with matched controls, and run a pre-registered study across seven models (five open-weight 3B-32B, two frontier), five CoT lengths, and 7,875 responses. The gate, fixed before any data, required a negative interaction with posterior probability above 0.95 and an accuracy drop of more than 6 points. It held. The high-regime drop is 17.3 points (95% CI [7.7, 25.5]); matched items with definite answers show no cost. The effect is regime-dependent: decisive in capable mid-to-large models, directional in the two frontier systems, absent-to-reversed in the weakest. The framework answers when CoT helps and unifies the Bayesian and fast-and-frugal traditions: less-is-more effects are evidence about the meta-uncertainty regime, not against Bayesian cognition.
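
Take-the-best, the heuristic that the abstract relates to active inference, is simple enough to state in code. The sketch below is the textbook form of the heuristic for a two-option comparison with binary cues; the cue names and values are hypothetical, and nothing here reproduces the paper's theorems.

```python
def take_the_best(option_a, option_b, cue_order):
    """Compare two options cue by cue, in descending order of cue validity.

    Stops at the first cue that discriminates and ignores all remaining cues.
    Returns (choice, cues_inspected); choice is None if no cue discriminates.
    """
    for inspected, cue in enumerate(cue_order, start=1):
        a, b = option_a[cue], option_b[cue]
        if a != b:
            return ("a" if a > b else "b"), inspected
    return None, len(cue_order)


# Hypothetical binary cues for a "which city is larger?" comparison.
cue_order = ["is_capital", "has_airport", "has_university"]   # most valid first
city_a = {"is_capital": 0, "has_airport": 1, "has_university": 0}
city_b = {"is_capital": 0, "has_airport": 0, "has_university": 1}
choice, cues_used = take_the_best(city_a, city_b, cue_order)
```

The decision is made on the second cue and the third is never consulted, even though it points the other way. Stopping early like this is the behavior the paper derives for cue integration under a heavy-tailed precision prior.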

2606.15884 2026-06-16 cs.CL New submission

Neuron Level Analysis of Large Language Model in Legal Domain Reasoning

Eri Onami, Youmi Ma, Shuhei Kurita, Naoaki Okazaki

Affiliations: Institute of Science Tokyo; NII; AIST

AI summary: Identifies and suppresses key neurons using neuron attribution scores, finding both task-specific neurons and neurons shared across tasks. Legal-domain neurons overlap heavily, and their distribution is affected by input format.

Abstract

We present a neuron-level analysis of legal-domain reasoning in LLMs, comparing it with other applied domain tasks across seven open-weight models. Using neuron attribution scores to rank and suppress influential neurons, we confirm that suppressing the identified neurons collapses accuracy on the target task, whereas suppressing the same number of random neurons does not. We further find a small subset of neurons influential across all seven tasks; once these are removed, suppressing the remaining neurons degrades only the task they were identified from, revealing genuinely task-specific neurons in every model studied. Within the legal domain, the three benchmarks exhibit relatively high neuron overlap and tend to be affected jointly, suggesting the existence of legal-component neurons that span jurisdictions. The distribution of identified neurons in our experiments suggests that the hypothesis that influential neurons are concentrated in middle MLP layers may depend on the input format and content, rather than being a universal phenomenon.

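2606.15893 2026-06-16 cs.CL New submission

Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

Reef Menaged, Gili Lior, Shauli Ravfogel, Roee Aharoni, Gabriel Stanovsky

Affiliations: The Hebrew University of Jerusalem; New York University; Google Research

AI summary: Proposes an agentic automata learning framework that uses membership and equivalence queries to evaluate how well LLM agents can uncover a hidden deterministic finite automaton. Performance drops sharply as DFA size grows, and reasoning models outperform non-reasoning models but still show failures in planning, evidence integration, and hypothesis construction.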

Abstract

We propose agentic automata learning to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membership queries ("Does this string belong to the target language?") and (2) equivalence queries ("Is this the target DFA?"). This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines (classic automata-learning algorithms). Evaluating state-of-the-art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, our results show that current LLM agents can sometimes perform non-trivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.
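
The two oracle query types can be made concrete with a small simulated oracle. The sketch below is an assumption about how such an environment might look, not the authors' testbed: membership is a run of the hidden DFA, and equivalence is decided by a breadth-first search over the product automaton, which returns a shortest counterexample when the hypothesis is wrong.

```python
from collections import deque


class DFA:
    def __init__(self, start, accepting, transitions):
        self.start = start
        self.accepting = set(accepting)
        self.transitions = transitions   # {(state, symbol): next_state}, assumed total

    def accepts(self, word):
        """Membership query: is `word` in the language of this DFA?"""
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


def equivalence_query(target, hypothesis, alphabet):
    """Return None if both DFAs agree on every string, else a shortest counterexample."""
    start = (target.start, hypothesis.start)
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        (s, t), word = queue.popleft()
        if (s in target.accepting) != (t in hypothesis.accepting):
            return word
        for symbol in alphabet:
            nxt = (target.transitions[(s, symbol)], hypothesis.transitions[(t, symbol)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + symbol))
    return None


# Hidden target: binary strings with an even number of 1s.
target = DFA(0, {0}, {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0})
# A wrong first hypothesis: accept every string.
guess = DFA(0, {0}, {(0, "0"): 0, (0, "1"): 0})
counterexample = equivalence_query(target, guess, "01")
```

An agent in this setting would use the returned counterexample, together with further membership queries, to refine its hypothesis. Classic algorithms such as L* do this systematically.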

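2606.16629 2026-06-16 cs.CL New submission

Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI

Mohammed Amine Mouhoub

Affiliations: Paris Dauphine University

AI summary: Surveys the field of Islamic large language models and argues that building trustworthy, hallucination-resistant Islamic AI requires combining Arabic NLP, retrieval-augmented generation, madhhab-aware reasoning, and expert evaluation.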

Abstract

Large language models (LLMs) are increasingly used for knowledge-intensive question answering, including religious and legal questions. Islamic knowledge is a particularly demanding setting: answers are expected to be grounded in authoritative sources, citations must be exact, Arabic varieties differ substantially from the language of classical sources, and legitimate jurisprudential disagreement must be represented rather than collapsed into a single answer. This survey reviews the emerging field of Islamic LLMs and trustworthy Islamic AI. We organize the literature around Arabic NLP and Arabic-centric LLMs, Islamic NLP resources, Qur'anic question answering, Islamic knowledge benchmarks, retrieval-augmented generation, Islamic legal reasoning, inheritance reasoning, hallucination evaluation, and trustworthiness. We argue that fluency in Arabic is not sufficient for Islamic AI. Reliable systems require curated sources, retrieval and verification modules, citation-aware generation, madhhab-aware reasoning, human expert evaluation, and benchmarks that measure not only answer accuracy but also faithfulness, source validity, and reasoning quality. The survey concludes with a research agenda for hallucination-resistant Islamic AI systems.

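2606.16825 2026-06-16 cs.CL cs.AI cs.LG New submission

Tying the Loop - Tied Expert Layers in Mixture-of-Experts Language Models

Martin Jaggi

Affiliations: EPFL

AI summary: Proposes Expert Tying, which shares expert parameters across consecutive Transformer layers while keeping routing and attention independent per layer, cutting MoE memory footprint by almost 2x without loss in perplexity or downstream performance.

Comments: Code available at https://github.com/epfml/looped-moe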

Abstract

Mixture-of-Experts (MoE) architectures efficiently scale Large Language Models (LLMs) by activating only a small fraction of their experts per token, yet the full parameter count - dominated by the expert parameters - must be held in training and inference memory. To address this, we introduce Expert Tying, an architectural modification that shares expert parameters across consecutive transformer layers while preserving independent, layer-wise routing and attention. We evaluate this approach across common, state-of-the-art architectures, including OLMoE, Qwen3, and DeepSeek-style MoEs. Our pretraining experiments demonstrate that tying experts can reduce memory footprint by almost 2x at virtually no degradation in perplexity or downstream quality. By exploiting the parameter redundancy inherent in MoE pathways, our method provides a highly favorable compute-to-memory trade-off, advancing efficient training and scaling of next-generation LLMs.
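
A back-of-the-envelope count shows where the "almost 2x" figure can come from. The sketch assumes that experts are tied in pairs of consecutive layers, that each expert has three projection matrices, and that routers remain per layer; the configuration is hypothetical and none of these details are confirmed by the abstract.

```python
def moe_param_count(n_layers, n_experts, d_model, d_ff, tie_group=1):
    """Rough count of expert and router parameters in an MoE stack.

    tie_group = 1: every layer owns its experts.
    tie_group = 2: each pair of consecutive layers shares one set of experts.
    Routers are always counted per layer.
    """
    per_expert = 3 * d_model * d_ff            # gate, up and down projections (assumed)
    expert_sets = -(-n_layers // tie_group)    # ceiling division
    expert_params = expert_sets * n_experts * per_expert
    router_params = n_layers * d_model * n_experts
    return expert_params, router_params


# Hypothetical configuration, not taken from the paper.
untied_experts, routers = moe_param_count(16, 64, 2048, 1024, tie_group=1)
tied_experts, _ = moe_param_count(16, 64, 2048, 1024, tie_group=2)
ratio = untied_experts / tied_experts
```

Expert parameters halve exactly under pairwise tying. The whole-model saving is slightly below 2x because routers, attention, and embeddings are not shared.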

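2606.16847 2026-06-16 cs.CL cs.AI New submission

Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens

Yizhen Yao, Qinglin Zhu, Runcong Zhao, Xiangxiang Dai, Yanzheng Xiang, Yulan He, Lin Gui

Affiliations: King's College London; The Chinese University of Hong Kong; The Alan Turing Institute, UK

AI summary: To address the trade-off between decoding speed and quality in diffusion LLMs, the paper proposes ASRD, a training-free framework that decouples the context via anchor tokens and combines anchor-guided generation with anchor-perturbed verification. On math and coding benchmarks it improves accuracy by up to 6.4% and speeds up inference by up to 7.2x.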

Abstract

Diffusion Large Language Models (dLLMs) offer a promising avenue for parallel generation but face a trade-off between decoding speed and quality. While revocable decoding strategies attempt to mitigate errors by verifying and remasking tokens, they typically operate within a mixed-quality context. This leads to two critical failures: Error Propagation, where new tokens absorb toxic information from erroneous context, and Local Error Reinforcement, where errors mutually reinforce each other to evade detection. To alleviate these challenges, we propose ASRD (Anchor Supervised Revocable Decoding), a training-free framework that operates within the embedding space. ASRD explicitly decouples the decoding context into trusted Anchor Tokens, which are identified via temporal consistency, and uncertain candidates. Leveraging a dynamic Anchor Tokens Cache, we introduce two complementary mechanisms: (1) Anchor-Guided Generation, which injects entropy-weighted anchor signals into masked positions to implicitly rectify attention toward the reliable global skeleton; and (2) Anchor-Perturbed Verification, which applies orthogonal perturbations to uncertain candidate tokens, destabilizing and remasking errors driven by fragile local consensus. Extensive experiments on math and coding benchmarks demonstrate that ASRD outperforms recent remasking baselines, achieving accuracy improvements of up to 6.4% while accelerating inference throughput by up to 7.2x.

2606.16905 2026-06-16 cs.CL New submission

Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences

Mingyang Li, Yurou Liu, Jieping Ye, Bing Su, Ji-Rong Wen, Zheng Wang

Affiliations: Alibaba Group; Gaoling School of Artificial Intelligence, Renmin University of China

AI summary: Proposes the LOGOS model, which uses a unified scientific grammar to cast heterogeneous tasks as next-token prediction within an autoregressive framework. It matches or outperforms domain-specific baselines on multiple scientific tasks, supporting the feasibility of "one model fits all".

Abstract

In this report, we present LOGOS (Language Of Generative Objects in Science), a scientific generative language model that unifies heterogeneous tasks across the natural sciences within a single autoregressive framework based on a shared scientific grammar. It encodes diverse scientific objects and their spatial interactions as token sequences over a common vocabulary. By representing spatial contact and constraint patterns as discrete tokens, the model captures complex structural interactions in a purely sequential manner, without relying on explicit coordinates or geometric neural networks. This unified representation enables a wide range of downstream tasks to be formulated consistently as next-token prediction in the same grammar space, creating strong alignment between continued multi-domain pre-training and downstream objectives. Across diverse tasks, LOGOS consistently matches or outperforms domain-specific baselines, providing preliminary evidence for the feasibility of "one model fits all" in the natural sciences. We train LOGOS models at different scales (1B, 3B, and 8B parameters) and find a consistent positive correlation between model size and performance. This suggests that the future of AI for Science (AI4S) may not lie in building an independent technical stack that is separated from large language models (LLMs). Instead, it may depend on deeply aligning scientific foundation models with LLMs through shared architectures, shared training paradigms, and shared inference infrastructure, so that LLMs can truly become a new entry point for AI4S. We release the model weights and associated resources to facilitate further research.

2606.16908 2026-06-16 cs.CL New submission

LESS Is More: Mutual-Stability Sampling for Diffusion Language Models

Amr Mohamed, Guokan Shang, Michalis Vazirgiannis

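Affiliations: MBZUAI; Ecole Polytechnique

AI summary: To address the inefficiency of fixed-step sampling in diffusion language models, proposes LESS, a training-free adaptive sampler that uses a mutual-stability rule to decide dynamically when each masked position is decoded; across seven benchmarks it improves average accuracy while reducing the number of steps by 72.1%.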

Abstract

Diffusion large language models (dLLMs) offer a promising alternative to autoregressive decoding by iteratively refining masked sequences, enabling parallel token updates and bidirectional conditioning. Their practical efficiency, however, is limited by sampling procedures that execute a fixed number of reverse denoising steps selected before decoding, spending computation on already-stable positions and sometimes committing unstable ones too early. We present \textsc{LESS}, a training-free, model-agnostic adaptive sampler that treats token commitment as an online stopping problem. \textsc{LESS} implements mutual-stability sampling through a joint stability rule that makes a masked position eligible for unmasking only when its top-1 prediction has high confidence, its top-1 token persists across recent reverse steps, and its predictive distribution is stable under top-$K$ inter-step Jensen--Shannon divergence. We evaluate \textsc{LESS} on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B, covering full-sequence diffusion and semi-autoregressive blockwise sampling regimes, across seven benchmarks spanning general knowledge, math, and code. \textsc{LESS} improves average accuracy over strong training-free adaptive samplers while using $72.1\%$ fewer reverse steps than fixed-budget decoding. Since each reverse step requires a Transformer forward pass, these step-count reductions translate into fewer forward evaluations, lower measured wall-clock latency, and lower estimated inference compute.
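
The joint stability rule described in this abstract can be illustrated with a minimal sketch. The threshold values (`conf_min`, `persist_steps`, `js_max`), the choice to renormalize the top-K probabilities before comparing steps, and every function name below are assumptions made for illustration; this is not the authors' implementation.

```python
import math

def top_k_renorm(probs, k):
    """Keep the k largest probabilities (keyed by token index) and renormalize."""
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in idx)
    return {i: probs[i] / total for i in idx}

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two sparse distributions."""
    keys = set(p) | set(q)
    m = {i: 0.5 * (p.get(i, 0.0) + q.get(i, 0.0)) for i in keys}

    def kl(a):
        return sum(a[i] * math.log(a[i] / m[i]) for i in a if a[i] > 0.0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def eligible(history, k=3, conf_min=0.9, persist_steps=2, js_max=0.01):
    """Decide whether one masked position may be unmasked.

    history: per-step probability vectors for this position, oldest first.
    All three conditions must hold over the last `persist_steps` steps.
    """
    if len(history) < persist_steps:
        return False
    recent = history[-persist_steps:]
    tops = [max(range(len(p)), key=lambda i: p[i]) for p in recent]
    confident = recent[-1][tops[-1]] >= conf_min          # high top-1 confidence
    persistent = len(set(tops)) == 1                      # same top-1 token throughout
    stable = all(                                         # small top-K inter-step JS divergence
        js_divergence(top_k_renorm(a, k), top_k_renorm(b, k)) <= js_max
        for a, b in zip(recent, recent[1:])
    )
    return confident and persistent and stable
```

A position whose top-1 token flips between steps, or whose distribution is still moving, stays masked for another reverse step; only positions that pass all three checks are committed.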

2606.16934 2026-06-16 cs.CL cs.LG New submission

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Sen Xu, Shixi Liu, Wei Wang, Jixin Min, Yingwei Dai, Zhibin Yin, Yirong Chen, Xin Zhou, Junlin Zhang

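Affiliations: University of Science and Technology of China

AI summary: Proposes VibeThinker-3B, a compact 3B-parameter model that reaches frontier-level performance on verifiable reasoning tasks through the Spectrum-to-Signal post-training paradigm (curriculum SFT, multi-domain reinforcement learning, offline self-distillation), matching or exceeding much larger models, and shows that the reasoning gains do not compromise instruction controllability.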

Abstract

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling), an 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization with a 96.1\% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.

2606.16310 2026-06-16 cs.LG cs.CL Cross-list

QK-Normed MLA: QK normalization without full key caching

Yizhou Han, Yao Zhao, Jun Zhou, Longfei Li, Ruoyu Sun

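Affiliations: The Chinese University of Hong Kong; Ant Group

AI summary: Proposes a way to make QK normalization compatible with MLA by absorbing the static weight and keeping a dynamic scalar, with no need to cache full keys; in 400M-parameter training runs it lowers loss and improves downstream accuracy, with less than 2% added decode latency.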

Comments 13 pages, 5 figures, conference-style manuscript

Abstract

Query-key (QK) normalization stabilizes attention by controlling the scale of queries and keys before the dot product, but is not immediately compatible with Multi-head Latent Attention (MLA). MLA achieves efficient decoding by caching low-dimensional latent states instead of full keys, whereas post-projection QK RMSNorm appears to require the fully projected key for every cached token. We show this apparent incompatibility is an implementation artifact, not an architectural constraint. RMSNorm decomposes into a static affine weight and a dynamic scalar RMS statistic. The static key-side weight can be absorbed into the MLA query-side projection; the dynamic key statistic reduces to one inverse-RMS scalar per token and KV group. The resulting formulation is exactly equivalent to explicit post-projection QK RMSNorm in exact arithmetic and preserves MLA's latent decode path. In our 400M runs trained for up to 100B tokens, QK-Normed MLA achieves lower training loss and better downstream accuracy than QK clipping, while H800 decode benchmarks show less than 2% latency overhead up to 256k context. These results make QK normalization a practical stabilization option for MLA models without requiring full-key caching.
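
The algebra behind the absorption claim can be checked numerically on a toy example. The sketch below assumes a single head, ignores rotary position embeddings and the query-side norm, and uses made-up matrices (`W_K`, `G_K`) and hypothetical function names; it only shows that folding the static key-side weight into the query and caching one inverse-RMS scalar per token gives the same attention logit as normalizing the fully projected key.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def inv_rms(x, eps=1e-6):
    """Inverse root-mean-square of x: the dynamic scalar part of RMSNorm."""
    return 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)

# Toy sizes (hypothetical): latent dimension 3, head dimension 4.
W_K = [[0.5, -1.0, 0.2], [1.5, 0.3, -0.7], [-0.4, 0.8, 1.1], [0.9, -0.2, 0.6]]
G_K = [1.2, 0.8, 1.0, 0.5]  # static RMSNorm weight on the key side

def explicit_score(q, c):
    """Reference path: up-project the full key, apply RMSNorm, then take q . k."""
    k = matvec(W_K, c)
    r = inv_rms(k)
    return dot(q, [g * ki * r for g, ki in zip(G_K, k)])

def absorbed_score(q, c, cached_inv_rms):
    """Latent path: fold G_K and W_K into the query, rescale by one cached scalar."""
    W_K_T = [list(col) for col in zip(*W_K)]
    q_latent = matvec(W_K_T, [g * qi for g, qi in zip(G_K, q)])
    return dot(q_latent, c) * cached_inv_rms

q = [0.3, -0.6, 0.9, 0.1]      # query for one head
c = [0.7, -1.3, 0.4]           # cached latent state for one token
r = inv_rms(matvec(W_K, c))    # computed once when the token enters the cache
```

In this sketch the decode step touches only the latent `c` and the scalar `r`, so the full key never has to be stored alongside the latent cache.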

2606.16811 2026-06-16 cs.AI cs.CL Cross-list

Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier

Keizo Kato, Chenhui Chu, Yugo Murawaki, Sado Kurohashi

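Affiliations: Fujitsu Limited; Kyoto University; National Institute of Informatics

AI summary: Proposes a semi-supervised framework that uses a lightweight reasoning-correctness classifier and entropy filtering to generate high-quality pseudo reasoning chains from a small amount of labeled data, matching the accuracy obtained with 10-15x more labeled data on math and visual question answering tasks.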

Comments LREC 2026. Section 3.3 is updated

Abstract

For the development of Large language models (LLMs), recent approaches to generating pseudo intermediate reasoning have shown remarkable progress. But they typically rely on large numbers of correctly annotated answers to assess reasoning quality. This paper presents a semi-supervised framework that scales reasoning learning from minimal supervision, turning reasoning verification itself into a data creation mechanism. We train a lightweight reasoning-correctness classifier on only a few labeled samples, which judges whether intermediate reasoning traces generated by an LLM are valid. Furthermore, an entropy-based confidence threshold filters out unreliable samples, and the remaining high-confidence reasoning traces are used to fine-tune the model. Experiments on Verifiable Math Problems (Orca-Math subset) and Question Answering on Image Scene Graphs (GQA) with Visual Programming show that our method achieves accuracy comparable to using 10-15x more labeled data. Ablation analyses confirm that both the classifier and entropy filtering are essential for scalable and noise-resistant pseudo-labeling. By replacing expensive answer-level supervision with lightweight reasoning verification, our method provides a practical path toward constructing large-scale reasoning resources and paves the way for future autonomous reasoning systems that learn from minimal human input.
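
The filtering step in this abstract can be sketched as follows, assuming the classifier outputs a probability `p_valid` that a reasoning trace is correct. The binary-entropy formulation, the threshold `max_entropy`, and the function names are illustrative assumptions rather than details taken from the paper.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 means fully confident."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def select_pseudo_labels(traces, max_entropy=0.5):
    """Keep traces that the classifier marks as valid with low predictive entropy.

    traces: list of (trace_text, p_valid) pairs, where p_valid is the
    classifier's probability that the reasoning trace is correct.
    """
    return [
        text
        for text, p_valid in traces
        if p_valid > 0.5 and binary_entropy(p_valid) <= max_entropy
    ]
```

Traces that pass the filter would then serve as fine-tuning data; a borderline prediction such as `p_valid = 0.6` has entropy close to 1 bit and is dropped.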

2505.09655 2026-06-16 cs.CL cs.LG Updated version

DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi

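Affiliations: Morgan Stanley; Clemson University; Arizona State University; Washington University in St. Louis; University of Notre Dame; University of Arizona

AI summary: To address policy collapse in GRPO for mathematical reasoning caused by non-injective reward signals, proposes DRA, a diversity-aware reward adjustment framework based on submodular mutual information that de-biases gradient estimation through inverse propensity scoring, reaching 58.2% average accuracy on five math benchmarks with little data and cost.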

Comments ACL2026

Abstract

Post-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often non-injective with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a Diversity-Quality Inconsistency, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies. To bridge this gap, we propose Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups. By leveraging Submodular Mutual Information (SMI), DRA implements an Inverse Propensity Scoring (IPS) mechanism that effectively de-biases the gradient estimation. This creates a repulsive force against redundancy, driving the policy to achieve better coverage of the high-reward landscape. Our method is plug-and-play and integrates seamlessly with GRPO variants. Empirical evaluations on five math benchmarks demonstrate that DRA-GRPO consistently outperforms strong baselines, achieving an average accuracy of 58.2% on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and $55 cost, highlighting the critical role of diversity calibration in data-efficient alignment. The code is available at https://github.com/xiwenc1/DRA-GRPO.

2505.23666 2026-06-16 cs.CL cs.LG Updated version

LoLA: Low-Rank Linear Attention With Sparse Caching

Luke McDermott, Robert W. Heath, Rahul Parhi

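Affiliations: University of California, San Diego

AI summary: Proposes LoLA, a training-free augmentation to linear attention that improves associative recall through three memory systems (a local sliding window, a sparse global cache, and a recurrent hidden state); on pass-key retrieval it raises accuracy from 0.6% to 97.4% with a cache 4.6x smaller than that of Llama-3.1 8B.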

Abstract

The per-token cost of transformer inference scales with context length, preventing its application to lifelong in-context learning. Linear attention is an efficient alternative that maintains a constant memory footprint, even on infinite context lengths. While this is a potential candidate for lifelong learning, it falls short in memory capacity. In this paper, we propose LoLA, a training-free augmentation to linear attention that boosts associative recall. LoLA distributes past key-value pairs from context into three memory systems: (i) recent pairs in a local sliding window cache; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. We show through ablations that our self-recall error metric is crucial to efficiently manage long-term associative memories. On pass-key retrieval tasks, LoLA improves the base model's performance from 0.6% to 97.4% accuracy. This is achieved with a 4.6x smaller cache than Llama-3.1 8B on 4K context length. LoLA also outperforms other 1B and 8B parameter subquadratic models on zero-shot commonsense reasoning tasks.

2506.11418 2026-06-16 cs.CL Updated version

Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang

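AI summary: To address latent overthinking in looped transformers, proposes Think-at-Hard, which uses a lightweight decider to trigger latent iterations selectively at hard tokens and adds depth-aware LoRA and a duo-causal attention mechanism, yielding consistent gains on math, QA, and coding tasks.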

Comments Accepted by ICML'26

Abstract

Improving the reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Looped transformers address this by performing multiple latent iterations to refine each token beyond a single forward pass. However, we identify a latent overthinking phenomenon: most token predictions are already correct after the first pass, but are sometimes revised into errors in later iterations. We ask whether selectively skipping latent iterations can improve accuracy, and reveal significant potential with an oracle iteration policy that boosts performance by up to 7.3%. Motivated by this, we propose Think-at-Hard (TaH), a looped transformer optimized for selective iteration. TaH employs a lightweight neural decider to trigger latent iteration, only at tokens likely to be incorrect after the standard forward pass. During latent iterations, depth-aware Low-Rank Adaptation (LoRA) modules shift the objective from general next-token prediction to focused hard-token refinement. A duo-causal attention mechanism extends attention from the token sequence dimension to an additional iteration depth dimension, enabling cross-iteration information flow with full sequential parallelism. Experiments on nine benchmarks show consistent gains across math, QA, and coding tasks. With identical parameter counts, TaH outperforms always-iterate baselines by 3.8-4.4% while skipping iterations on 93% of tokens, and exceeds single-iteration Qwen3 baselines by 3.0-3.8%. When allowing <3% more parameters from LoRA and decider, the gains further increase to 5.3-6.2% and 6.1-6.8%, respectively. Our code is available at https://github.com/thu-nics/TaH.

2512.21577 2026-06-16 cs.CL cs.AI cs.LG stat.ML Updated version

A Unified Definition of Hallucination: It's The World Model, Stupid!

Emmy Liu, Varun Gangal, Chelsea Zou, Michael Yu, Xiaoqi Huang, Alex Chang, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng

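Affiliations: University of California, Berkeley

AI summary: Proposes a unified definition of hallucination as inaccurate internal world modeling that is observable to the user, and connects it to the HalluWorld benchmark to distinguish true hallucinations from planning or reward errors.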

Comments ICML 2026. HalluWorld benchmark at https://github.com/DegenAI-Labs/HalluWorld

Abstract

Despite numerous attempts at mitigation since the inception of language models, hallucinations remain a persistent problem even in today's frontier LLMs. Why is this? We review existing definitions of hallucination and fold them into a single, unified definition wherein prior definitions are subsumed. We argue that hallucination can be unified by defining it as simply inaccurate (internal) world modeling, in a form where it is observable to the user. For example, stating a fact which contradicts a knowledge base OR producing a summary which contradicts the source. By varying the reference world model and conflict policy, our framework unifies prior definitions. We argue that this unified view is useful because it forces evaluations to clarify their assumed reference "world", distinguishes true hallucinations from planning or reward errors, and provides a common language for comparison across benchmarks and discussion of mitigation strategies. Building on this definition, we also connect our framework to HalluWorld, a complementary benchmark that instantiates fully specified reference world models for stress-testing model hallucinations.

2601.17421 2026-06-16 cs.CL Updated version

Oops, Wait: Discourse Tokens Matter in Reasoning Model

Jaehui Hwang, Byeongho Heo, Sangdoo Yun, Dongyoon Han

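Affiliations: NAVER AI Lab

AI summary: Studies the role of discourse tokens (such as "wait") in reasoning trajectories and finds that data-efficient fine-tuning can partially reproduce their patterns, though these align less well with high-confidence answer transitions than those from large-scale post-training.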

Abstract

Recent studies suggest that even data-efficient training with ($\simeq$1K) reasoning trajectories can induce non-trivial reasoning capabilities in large language models through post-training. Such training corpora often contain iconic tokens such as "wait", "so", and "alternatively", which frequently appear in reasoning trajectories and may play a role in this process. This paper focuses on characterizing observable token-level patterns in post-training and a case study of how data-efficient supervised fine-tuning (SFT) differs from, and falls short of, large-scale post-training. To this end, we first identify tokens that correlate with correct answers along reasoning trajectories across models and training setups. We then focus on the distribution and (functional) roles of the "wait" token to primarily study the model trained in a data-efficient manner compared with the counterpart. Our study finds that discourse tokens are associated with correctness and a reasoning accuracy jump, even in data-efficient SFT. This suggests data-efficient SFT can partially reproduce discourse-token patterns to mimic meaningful reasoning behavior, but the patterns are less aligned with high-confidence answer transitions than those from large-scale post-training.

2602.11543 2026-06-16 cs.CL Updated version

Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

Jinrui Zhang, Chaodong Xiao, Aoqi Wu, Xindong Zhang, Lei Zhang

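Affiliations: Department of Computing, The Hong Kong Polytechnic University; OPPO Research Institute

AI summary: Proposes SPES, a framework that lowers memory requirements by training only a subset of the experts of an MoE LLM on each node in a decentralized fashion; combined with an expert-merging warm-up strategy, it trains a 2B-parameter model on 16 48GB GPUs with performance competitive with centralized training.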

Abstract

Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To mitigate limited per-expert data utilization under sparse expert updates, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at https://github.com/zjr2000/SPES.

2603.08999 2026-06-16 cs.CL Updated version

Learning When to Sample: Confidence-Aware Selective Sampling for Efficient Chain-of-Thought Reasoning

Juming Xiong, Kevin Guo, Congning Ni, Wexin Liu, Chao Yan, Katherine Brown, Avinash Baidya, Xiang Gao, Bradley Malin, Zhijun Yin

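Affiliations: Vanderbilt University; Vanderbilt University Medical Center; Intuit AI Research

AI summary: Proposes a confidence-aware selective sampling framework that analyzes a single reasoning trajectory to decide adaptively whether to trigger multi-path sampling, substantially reducing inference cost while maintaining performance.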

Abstract

Large language models (LLMs) can achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet they often generate unnecessarily long reasoning paths that incur high inference cost. Self-consistency-based approaches push accuracy higher still, but they require sampling and aggregating multiple reasoning trajectories, leading to substantial computational overhead. In this paper, we introduce a confidence-aware selective sampling framework that, at inference time, analyzes a single reasoning trajectory to adaptively determine whether to rely on that trajectory alone or trigger multi-path sampling. The framework uses trajectory-level numeric features and sentence-level linguistic features extracted from reasoning states to guide selective multi-path reasoning. We train it on MedQA and evaluate it in-domain on MedQA and under calibration-only transfer on MathQA, MedMCQA, and MMLU, without further fine-tuning. Experimental results show that the proposed framework maintains comparable performance to full and efficient multi-path reasoning baselines, with accuracy changes of $-0.41 \pm 0.58$ and $-0.31 \pm 0.58$ percentage points, respectively, while reducing token usage by $71.7 \pm 5.0\%$ and $36.6 \pm 9.1\%$. These findings demonstrate that reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.

2604.03472 2026-06-16 cs.CL cs.AI Updated version

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

Jacob Dineen, Aswin RRV, Zhikun Xu, Ben Zhou

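Affiliations: Arizona State University

AI summary: To address the collapse of problem diversity in LLM co-evolution, proposes vocabulary dropout, which sustains diversity by randomly masking output logits during policy training and curriculum generation, improving solver performance on mathematical reasoning by +4.4 points on average.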

Abstract

Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training. It also yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.
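
A hard, non-stationary logit mask of the kind this abstract describes can be sketched in a few lines. The drop rate, the choice to resample the mask on every call, and the fallback that keeps the highest-scoring token when everything is masked are assumptions for illustration; the paper's actual masking schedule is not reproduced here.

```python
import math
import random

def vocab_dropout(logits, drop_rate, rng):
    """Return a copy of `logits` with a random subset of entries set to -inf.

    The mask is hard (masked tokens get zero probability) and is resampled on
    every call, so it differs from step to step.
    """
    masked = [(-math.inf if rng.random() < drop_rate else x) for x in logits]
    if all(x == -math.inf for x in masked):
        # Assumed fallback: keep the highest-scoring token so sampling stays possible.
        keep = max(range(len(logits)), key=lambda i: logits[i])
        masked[keep] = logits[keep]
    return masked

def softmax(logits):
    """Numerically stable softmax that tolerates -inf entries."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because masked entries receive exactly zero probability after the softmax, the proposer cannot emit those tokens at that step and has to route around them.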

2604.25853 2026-06-16 cs.CL cs.AI cs.LG Updated version

G-Loss: Graph-Guided Fine-Tuning of Language Models

Aditya Sharma, Vinti Agarwal, Rajesh Kumar

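Affiliations: BITS Pilani; Bucknell University

AI summary: Proposes G-Loss, a loss function that builds a document-similarity graph and uses semi-supervised label propagation to capture global semantic structure, guiding language models toward more discriminative and robust embeddings; it improves accuracy and speeds up convergence on multiple classification tasks.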

Comments 20 pages, Learning on Graphs (LoG2025)

Abstract

Traditional loss functions, including cross-entropy, contrastive, triplet, and supervised contrastive losses, used for fine-tuning pre-trained language models such as BERT, operate only within local neighborhoods and fail to account for the global semantic structure. We present G-Loss, a graph-guided loss function that incorporates semi-supervised label propagation to use structural relationships within the embedding manifold. G-Loss builds a document-similarity graph that captures global semantic relationships, thereby guiding the model to learn more discriminative and robust embeddings. We evaluate G-Loss on five benchmark datasets covering key downstream classification tasks: MR (sentiment analysis), R8 and R52 (topic categorization), Ohsumed (medical document classification), and 20NG (news categorization). In the majority of experimental setups, G-Loss converges faster and produces semantically coherent embedding spaces, resulting in higher classification accuracy than models fine-tuned with traditional loss functions.

2605.07013 2026-06-16 cs.CL Updated version

CoBit: Language Modeling with Bitstream Diffusion

Georgios Batzolis, Mark Girolami, Luca Ambrogioni

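Affiliations: University of Cambridge

AI summary: Proposes CoBit, which models text as a continuous diffusion process over fixed-width binary bitstreams, using a matched-filter residual parameterization and a Langevin-corrected sampler gated by the entropy-rate profile; it reaches generative perplexity close to autoregressive models on LM1B and OpenWebText and removes the vocabulary scaling bottleneck.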

Abstract

Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion approaches have narrowed this gap. In this work, we further close the autoregressive gap by modeling text as a continuous diffusion process over fixed-width binary bitstreams. We refer to the resulting model as CoBit (Continuous Bitstream Diffusion). Our approach represents semantic tokens as analog bit sequences and uses a matched-filter residual parameterization to isolate contextual learning from analytic independent-bit posteriors. Crucially, we adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere. On LM1B, our 130M-parameter model reaches a generative perplexity (GenPPL) of 59.76 at matched real-data entropy (4.31) using 256 neural function evaluations (NFEs), outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), our sampler establishes a new continuous-DLM Pareto frontier, achieving GenPPL 27.06 at entropy 5.26 using 4x fewer steps than previous 1024-NFE baselines. Scaling the same recipe to a 462M-parameter model (CoBit-M) further improves the OWT GenPPL-entropy frontier over the 130M model (CoBit-S) and over medium-scale continuous and discrete DLM baselines, reaching GenPPL 19.5 at entropy 5.40, near real-data entropy (5.44), and approaching pretrained GPT-2 Medium over the high-quality region. As an additional benefit, bitstream diffusion removes the O(V) vocabulary scaling bottleneck of standard DLMs: by predicting O(log V) bitwise logits via semantic bit-patching, it lowers memory and raises throughput, a scalable paradigm as vocabulary sizes grow.
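
The O(log V) claim at the end of this abstract follows from representing each token by a fixed-width bit pattern. The sketch below uses a plain binary code mapped to analog values in {-1, +1} and decodes by thresholding at zero; the function names and this particular code are assumptions, and the paper's semantic bit-patching is not reproduced.

```python
def bit_width(vocab_size):
    """Number of bits needed to index `vocab_size` tokens (at least 1)."""
    return max(1, (vocab_size - 1).bit_length())

def encode(token_id, width):
    """Map a token id to `width` analog bits in {-1.0, +1.0}, most significant first."""
    return [1.0 if (token_id >> (width - 1 - i)) & 1 else -1.0 for i in range(width)]

def decode(analog_bits):
    """Threshold (possibly noisy) analog bits at zero and rebuild the token id."""
    token_id = 0
    for b in analog_bits:
        token_id = (token_id << 1) | (1 if b > 0 else 0)
    return token_id
```

For a vocabulary of 50,257 tokens, `bit_width` gives 16, so a model in this style would predict 16 bitwise logits per position instead of 50,257 token logits.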


2605.17106 2026-06-16 cs.CL cs.LG 版本更新

HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools

HyDRA：异构LLM池的混合动态路由架构

Aashna Garg, Siddharth Singha Roy, Jinu Jang, Federico Brancasi, Shengyu Fu

发表机构 * Microsoft（微软）

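AI summary: This paper proposes HyDRA, a hybrid dynamic routing architecture that predicts fine-grained, multi-dimensional capability requirements for each query and matches them against configuration-defined model profiles, enabling efficient model selection across heterogeneous LLM pools without retraining.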

Comments preprint v2

详情

AI中文摘要

生产中的LLM部署越来越多地维护跨越数量级成本差异的异构模型池。现有路由器只做二元的强弱决策，并将学习到的参数与特定模型身份耦合，每当模型目录变化时都需要重新训练。我们提出了HyDRA（混合动态路由架构），该框架预测每个查询的细粒度、多维能力需求，并通过短缺匹配将其与配置定义的模型画像进行匹配。一个带有K=4个独立sigmoid头的ModernBERT编码器沿推理、代码生成、调试和工具使用四个维度对每个查询评分；随后，短缺匹配算法选择能力满足预测需求的最便宜模型。部署的预测器在生产环境中的中位CPU推理延迟为86毫秒，并且与模型目录完全解耦--添加或删除模型只需更改配置，无需重新训练。在SWE-Bench Verified（5模型池：GPT-5.4-mini、Claude Haiku 4.5、GPT-5.3 Codex、Claude Sonnet 4.6、GPT-5.4）上，HyDRA的可调短缺阈值覆盖三种模式：峰值质量模式在节省12.9%成本的同时超过始终使用强模型Claude Sonnet 4.6的基线（解决率75.4% vs. 74.2%）；等质量模式在节省54.1%成本的同时与Sonnet质量持平，相比我们先前内部二元路由器9.1%的节省提升了6倍；激进模式以3.2个点的质量损失为代价，将节省提高到72.5%。这些结果可推广到LiveCodeBench、BigCodeBench和tau-bench。HyDRA已在GitHub Copilot的VS Code Chat自动模式中面向所有用户部署，并且据我们所知，在LLM路由文献中首次展示了跨CJK、欧洲及其他文字体系的语言不变路由。

英文摘要

Production LLM deployments increasingly maintain heterogeneous model pools spanning order-of-magnitude cost differences. Existing routers make binary strong-vs-weak decisions and couple learned parameters to specific model identities, requiring retraining whenever the catalog changes. We present HyDRA (Hybrid Dynamic Routing Architecture), a framework that predicts fine-grained, multi-dimensional capability requirements per query and matches them against configuration-defined model profiles via shortfall matching. A ModernBERT encoder with K=4 independent sigmoid heads scores each query along reasoning, code generation, debugging, and tool use; a shortfall-matching algorithm then selects the cheapest model whose capabilities meet the predicted requirements. The deployed predictor runs at 86 ms median CPU inference latency in production, and is fully decoupled from the model catalog -- adding or removing models requires only a configuration change, with zero retraining. On SWE-Bench Verified (5-model pool: GPT-5.4-mini, Claude Haiku 4.5, GPT-5.3 Codex, Claude Sonnet 4.6, GPT-5.4), HyDRA's tunable shortfall threshold spans three regimes: peak-quality exceeds the always-strong Claude Sonnet 4.6 baseline (75.4% vs. 74.2% resolution) at 12.9% cost savings; iso-quality matches Sonnet at 54.1% cost savings, a 6x improvement over our prior in-house binary router at 9.1%; aggressive pushes savings to 72.5% for a 3.2-point quality trade. Results generalize across LiveCodeBench, BigCodeBench, and tau-bench. HyDRA is deployed to all users in GitHub Copilot's VS Code Chat auto-mode and -- to our knowledge for the first time in the LLM routing literature -- demonstrates language-invariant routing across CJK, European, and other script families.
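
The routing step described above (pick the cheapest model whose capabilities meet the predicted requirements, with a tunable shortfall threshold) might look roughly like the sketch below. The shortfall definition (summed per-dimension deficit), the fallback rule, and every model name, cost, and score are hypothetical; the abstract does not give the exact formula.

```python
from dataclasses import dataclass

# The four capability dimensions named in the abstract.
DIMENSIONS = ("reasoning", "code_generation", "debugging", "tool_use")


@dataclass(frozen=True)
class ModelProfile:
    name: str                 # hypothetical model identifier
    cost: float               # relative cost per request (made-up units)
    capabilities: tuple       # one score in [0, 1] per entry of DIMENSIONS


def shortfall(required, capabilities) -> float:
    """Assumed definition: total amount by which capabilities fall below requirements."""
    return sum(max(0.0, r - c) for r, c in zip(required, capabilities))


def route(required, pool, threshold=0.0) -> ModelProfile:
    """Cheapest model within the shortfall threshold; otherwise the least-deficient model."""
    eligible = [m for m in pool if shortfall(required, m.capabilities) <= threshold]
    if eligible:
        return min(eligible, key=lambda m: m.cost)
    return min(pool, key=lambda m: shortfall(required, m.capabilities))


# Profiles live in configuration, so changing the pool needs no retraining.
pool = [
    ModelProfile("small", 1.0, (0.3, 0.3, 0.3, 0.3)),
    ModelProfile("medium", 4.0, (0.55, 0.7, 0.7, 0.7)),
    ModelProfile("large", 10.0, (0.9, 0.9, 0.9, 0.9)),
]
query_requirements = (0.6, 0.6, 0.6, 0.6)   # would come from the K=4 sigmoid heads
strict = route(query_requirements, pool, threshold=0.0)
relaxed = route(query_requirements, pool, threshold=0.1)
```

Raising the threshold tolerates a small capability deficit in exchange for a cheaper model, which is one plausible reading of how a single knob could span the peak-quality, iso-quality, and aggressive regimes.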


2605.21850 2026-06-16 cs.CL cs.AI 版本更新

Implicit Reasoning for LLM-Based Generative Recommendation

Yinhan He, Liam Collins, Bhuvesh Kumar, Jundong Li, Neil Shah, Donald Loveland

发表机构 * University of Virginia（弗吉尼亚大学）； Snap Inc.（Snap公司）

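AI summary: To address three limitations of explicit reasoning when LLMs are used for generative recommendation (weakened world-knowledge verbalization, misalignment between Semantic ID and natural-language embedding spaces, and sensitivity to rationale quality), the paper proposes PauseRec, a lightweight implicit reasoning paradigm that outperforms explicit methods in accuracy, training cost, and inference speed.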

详情

AI中文摘要

大语言模型（LLMs）越来越多地被用作生成式推荐（GR）的骨干，有望利用预训练的世界知识。然而，如何可靠地调用这些知识进行GR仍不清楚。一个关键障碍是，基于LLM的GR通常使用语义ID（SIDs）表示物品，这破坏了LLM的自然语言推理接口，因为这些标记在预训练期间对LLM是未见过的。现有方法通过昂贵的多阶段流程来应对，这些流程将SID接地并引发显式推理，但对每个阶段何时以及为何必要提供的见解有限。在这项工作中，我们系统地分解了基于LLM的GR的显式推理训练流程，揭示了三个关键局限：弱化的世界知识表达、SID与自然语言标记嵌入空间之间的不对齐，以及对推理质量的敏感性，所有这些都损害了显式推理性能。为了规避这些问题，我们提出了PauseRec，一种为GR量身定制的轻量级隐式推理范式。PauseRec非常实用，避免了昂贵的推理轨迹获取和推理对齐训练，带来了诸多好处：（1）其性能比标准显式CoT方法高出高达6.22%，（2）将训练成本降低高达65%的GPU小时，（3）将推理速度提升高达71.3%。这些结果使PauseRec成为显式推理生成的轻量级替代方案，能够实现更有效、更高效的基于LLM的GR。

英文摘要

Large Language Models (LLMs) are increasingly adopted as backbones for Generative Recommendation (GR), promising access to pretrained world knowledge. Yet reliably invoking this knowledge for GR remains poorly understood. A key obstacle is that LLM-based GR typically represents items with Semantic IDs (SIDs), disrupting LLMs' natural-language reasoning interface because these tokens are unseen by the LLM during pretraining. Existing approaches address this with expensive multi-stage pipelines that ground SIDs and elicit explicit rationales, but offer limited insight into when and why each stage is necessary. In this work, we systematically decompose explicit reasoning training pipelines for LLM-based GR, revealing three key limitations: weakened world-knowledge verbalization, misalignment between SID and natural-language token embedding spaces, and sensitivity to rationale quality, all of which hurt explicit reasoning performance. To circumvent these issues, we propose PauseRec, a lightweight implicit reasoning paradigm tailored for GR. PauseRec is exceptionally practical, avoiding costly reasoning trace acquisition and reasoning alignment training, leading to a multitude of benefits: (1) it outperforms standard explicit CoT methods by up to 6.22%, (2) it reduces training cost by up to 65% GPU hours, and (3) it speeds up inference by up to 71.3%. These results position PauseRec as a lightweight alternative to explicit rationale generation, enabling more effective and efficient LLM-based GR.


2606.14694 2026-06-16 cs.CL 版本更新

Entropy-Aware On-Policy Distillation of Language Models

Woogyeol Jin, Taywon Min, Yongjin Yang, Dennis Wei, Yi Zhou, Swanand Ravindra Kadhe, Nathalie Baracaldo, Kimin Lee

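AI summary: In on-policy distillation, reverse KL reduces generation diversity and yields unstable learning signals when the teacher's entropy is high. The paper proposes Entropy-Aware On-Policy Distillation, which adds forward KL at high-entropy positions to balance mode-seeking and mode-covering behavior, improving generation diversity and student-teacher alignment.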

Comments 18 pages, 11 figures, ICML 2026

详情

AI中文摘要

在线策略蒸馏是一种有前景的语言模型知识迁移方法，学生模型沿着自身轨迹从密集的token级信号中学习。该框架通常使用反向KL散度，鼓励学生匹配教师的高置信度预测。然而，我们表明反向KL的模式寻求特性会降低生成多样性，并在教师分布具有高熵时产生不稳定的学习信号。为解决此问题，我们引入了熵感知的在线策略蒸馏。我们的关键思想是在教师熵高时，用前向KL增强标准的反向KL目标，以捕获全部合理输出范围，同时在其他地方保留精确模仿。它在不牺牲在线策略训练效率的情况下，平衡了模式寻求的精确性与模式覆盖的鲁棒性。实验表明，我们的方法保持了生成多样性（持续的token级熵），并改善了学生-教师对齐（在高熵token上降低前向KL）。在六个数学推理基准上，与基线在线策略蒸馏方法相比，Qwen3-0.6B-Base的Pass@8准确率提升+1.37，Qwen3-1.7B-Base提升+2.39，Qwen3-4B-Base提升+5.05。这些结果表明，考虑教师不确定性对于保持多样性和实现有效知识迁移至关重要。

英文摘要

On-policy distillation is a promising approach for transferring knowledge between language models, where a student learns from dense token-level signals along its own trajectories. This framework typically uses reverse KL divergence, encouraging the student to match the teacher's high-confidence predictions. However, we show that the mode-seeking property of reverse KL reduces generation diversity and yields unstable learning signals when the teacher distribution has high entropy. To address this, we introduce Entropy-Aware On-Policy Distillation. Our key idea is augmenting the standard reverse KL objective with forward KL when teacher entropy is high, capturing the full range of plausible outputs while retaining precise imitation elsewhere. It balances mode-seeking precision with mode-covering robustness without sacrificing on-policy training efficiency. Experiments show that our method maintains generation diversity (sustained token-level entropy) and improves student-teacher alignment (lower forward KL on high-entropy tokens). Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods. These results demonstrate that accounting for teacher uncertainty is essential for maintaining diversity and achieving effective knowledge transfer.
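
The key idea above (reverse KL everywhere, plus forward KL where the teacher is uncertain) can be sketched for a single token position as follows. The hard entropy threshold, the `forward_weight` parameter, and the function names are assumptions; the paper may use a different, possibly soft, gating rule.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions over the same vocabulary."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def entropy_aware_loss(student, teacher, entropy_threshold=1.0, forward_weight=1.0):
    """Per-token loss: reverse KL always, forward KL added only when teacher entropy is high."""
    reverse = kl(student, teacher)            # mode-seeking term KL(student || teacher)
    if entropy(teacher) > entropy_threshold:  # assumed hard gate on teacher uncertainty
        return reverse + forward_weight * kl(teacher, student)  # mode-covering term
    return reverse


student = [0.7, 0.1, 0.1, 0.1]
confident_teacher = [0.97, 0.01, 0.01, 0.01]   # low entropy: reverse KL only
uncertain_teacher = [0.25, 0.25, 0.25, 0.25]   # high entropy: forward KL is added
loss_confident = entropy_aware_loss(student, confident_teacher)
loss_uncertain = entropy_aware_loss(student, uncertain_teacher)
```

In a real training loop the same computation would run over logits for every position of the student's own rollout; the list-based version here only shows the gating logic.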


2605.22873 2026-06-16 cs.LG cs.AI cs.CL 版本更新

When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

LLM何时推理？基于熵相变的动力系统视角

Wei Xia, Haoqing Wang, Zhi-Hong Deng, Yehui Tang

发表机构 * Samsung Research（三星研究院）； State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University（通用人工智能国家重点实验室，北京大学智能学院）

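AI summary: This paper detects an LLM's reasoning state from early decoding entropy dynamics and proposes EDRM, a lightweight, training-free routing framework that adaptively selects the inference strategy, improving accuracy while reducing token consumption.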

详情

AI中文摘要

链式思维（CoT）推理已成为增强LLM能力的默认策略，但其应用引发了一个基本问题：显式推理何时真正有益？实证证据揭示了一个显著悖论：CoT在事实性和开放式任务上往往带来边际甚至负增益，同时成倍增加token消耗。在这项工作中，我们表明LLM推理不是任务或模型的静态属性，而是在生成过程中涌现的动态解码状态。通过系统分析，我们发现早期熵动态提供了这一状态的可靠信号：受益于CoT的任务表现出一致的熵降低，而其他任务则呈现不稳定或增加的模式。这种行为可以解释为从高熵探索状态到低熵结构化推理状态的类相变转变。基于这些见解，我们提出了EDRM（基于熵动态的推理流形），一个轻量级且无需训练的路由框架，利用早期解码熵自适应选择推理策略。EDRM将熵轨迹嵌入到紧凑且可解释的流形表示中，支持零样本部署和细粒度实例级适应。在15个基准测试和4个不同规模与架构的LLM上，EDRM始终优于静态基线。在数据集层面，EDRM实现了41-55%的token减少，同时仅需50个校准样本即可提高准确率。在实例层面，它进一步将准确率提升高达4.7%，同时保持27-45%的token节省。这些结果表明，推理应被选择性地调用而非默认使用，并展示了基于熵的解码控制对于高效自适应LLM推理的有效性。

英文摘要

Chain-of-thought (CoT) reasoning has become the default strategy for enhancing LLM capabilities, yet its application raises a fundamental question: when is explicit reasoning actually beneficial? Empirical evidence reveals a striking paradox: CoT often provides marginal or even negative gains on factual and open-ended tasks while multiplying token consumption. In this work, we show that LLM reasoning is not a static property of tasks or models, but a dynamic decoding state that emerges during generation. Through systematic analysis, we find early-stage entropy dynamics provide a reliable signal of this state: tasks benefiting from CoT exhibit consistent entropy reduction, while others display unstable or increasing patterns. This behavior can be interpreted as a phase-transition-like shift from a high-entropy exploratory regime to a low-entropy structured reasoning regime. Based on these insights, we propose EDRM (Entropy Dynamics-based Reasoning Manifold), a lightweight and training-free routing framework that leverages early decoding entropy to adaptively select inference strategies. EDRM embeds entropy trajectories into a compact and interpretable manifold representation, enabling both zero-shot deployment and fine-grained instance-level adaptation. Across 15 benchmarks and 4 LLMs of varying scales and architectures, EDRM consistently outperforms static baselines. At the dataset level, EDRM achieves 41-55% token reduction while improving accuracy with as few as 50 calibration samples. At the instance level, it further improves accuracy by up to 4.7% while maintaining 27-45% token savings. These results suggest that reasoning should be invoked selectively rather than by default, and demonstrate the effectiveness of entropy-driven decoding control for efficient and adaptive LLM inference.
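
The signal described above (entropy that falls over the first decoding steps suggests CoT will help) can be approximated with a much simpler stand-in than the paper's manifold embedding: fit a line to the early entropy trajectory and route on its slope. The slope heuristic, the threshold, and the strategy labels are assumptions for illustration only, not EDRM itself.

```python
import math


def token_entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_slope(entropies):
    """Least-squares slope of entropy against decoding step index."""
    n = len(entropies)
    if n < 2:
        raise ValueError("need at least two decoding steps")
    mean_x = (n - 1) / 2
    mean_y = sum(entropies) / n
    cov = sum((i - mean_x) * (h - mean_y) for i, h in enumerate(entropies))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def choose_strategy(early_entropies, slope_threshold=0.0):
    """Route to CoT when early entropy is consistently falling, else answer directly."""
    return "cot" if entropy_slope(early_entropies) < slope_threshold else "direct"


falling = [2.0, 1.6, 1.3, 0.9]     # hypothetical trajectory: steady entropy reduction
unstable = [1.0, 1.4, 1.2, 1.6]    # hypothetical trajectory: noisy, trending upward
```

Each entropy value would be computed with `token_entropy` from the model's next-token distribution at one of the first few decoding steps; the trajectories shown here are invented numbers.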


2606.04547 2026-06-16 cs.IR cs.CL 版本更新

Beyond Retrieval: Learning Compact User Representations for Scalable LLM Personalization

超越检索：学习紧凑用户表示以实现可扩展的LLM个性化

Heng Cao, Fan Zhang, Jian Yao, Yujie Zheng, Changlin Zhao, Lu Hao, Yuxuan Wei, Wangze Ni, Huaiyu Fu, Yuqian Sun, Xuyan Mo

发表机构 * Microsoft（微软公司）； Shanghai International Studies University（上海外国语大学）； Zhejiang University（浙江大学）； Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University（香港理工大学数据科学与人工智能系）

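AI summary: Proposes TAP-PER, a framework that encodes user preferences as learnable user-state prefix embeddings, avoiding explicit prompt construction and heavy per-user adapters; it outperforms baselines on six LaMP tasks with substantially lower parameter overhead.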

Comments 16 pages, 6 figures

详情

AI中文摘要

个性化大型语言模型需要在保持鲁棒性和部署规模效率的同时，将模型行为适应于个体用户。现有方法通常在输入层面（通过检索用户历史或构建个人资料提示）或参数层面（通过维护用户特定的参数高效模块）进行个性化。前者使个性化对检索质量和提示设计敏感，而后者则产生随用户数量增长的存储和维护成本。为解决这些限制，我们提出TAP-PER（时间注意力前缀个性化），一种基于前缀的框架，将用户偏好编码为可学习的表示，消除了显式提示构建，并用轻量级用户状态前缀嵌入替代了繁重的每用户适配器。受个性化推荐系统启发，TAP-PER将用户建模分解为用户状态和查询条件组件，并引入时间信号以捕捉用户兴趣的演变特性。在六个LaMP任务上的实验表明，TAP-PER在分类、评分和生成设置中均持续优于基于提示和基于模型的基线。此外，在1000用户规模下，TAP-PER的每用户参数比OPPU少130倍，总参数量约为PER-PCS的一半，证明无需显式提示构建或繁重的每用户适配器即可实现可扩展的LLM个性化。

英文摘要

Personalizing large language models requires adapting model behavior to individual users while preserving robustness and deployment-scale efficiency. Existing approaches typically personalize LLMs either at the input level, by retrieving user histories or constructing profile prompts, or at the parameter level, by maintaining user-specific parameter-efficient modules. The former makes personalization sensitive to retrieval quality and prompt design, whereas the latter incurs storage and maintenance costs that grow with the user population. To address these limitations, we propose TAP-PER (Temporal Attentive Prefix for PERsonalization), a prefix-based framework that encodes user preferences as learnable representations, eliminating explicit prompt construction and replacing heavy per-user adapters with lightweight user-state prefix embeddings. Inspired by personalized recommendation systems, TAP-PER decomposes user modeling into user-state and query-conditioned components, and incorporates temporal signals to capture the evolving nature of user interests. Experiments on six LaMP tasks show that TAP-PER consistently outperforms prompt-based and model-based baselines across classification, rating, and generation settings. Moreover, TAP-PER uses 130x fewer per-user parameters than OPPU and roughly half the total parameter footprint of PER-PCS at the 1,000-user scale, demonstrating that scalable LLM personalization can be achieved without explicit prompt construction or heavy per-user adapters.


2606.13003 2026-06-16 cs.AI cs.CL cs.MA 版本更新

The Illusion of Multi-Agent Advantage

多智能体优势的错觉

Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang, Yifei Ming, Zixuan Ke, Chengwei Qin, Giuseppe Carenini, Shafiq Joty

发表机构 * Salesforce Research（Salesforce研究院）； HKUST (Guangzhou)（香港科技大学（广州））； University of British Columbia（不列颠哥伦比亚大学）； Nanyang Technological University（南洋理工大学）

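AI summary: A systematic evaluation finds that automatically generated multi-agent systems trail single-agent baselines such as Chain-of-Thought with Self-Consistency in both performance and cost-efficiency, exposing flaws in existing evaluation frameworks and architectural bloat in the generated systems.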

详情

AI中文摘要

普遍观点认为多智能体系统优于单智能体系统，其优势包括上下文保护、并行处理和分布式决策。然而，这一主张的经验支持主要依赖于与使用优先考虑孤立推理任务的基准测试的单智能体基线的比较，这些基准测试未能充分评估这些优势。我们专注于自动生成的多智能体系统（旨在比手动设计的系统具有更强的泛化能力），对单智能体系统（特别是思维链自一致性）进行了严格、系统的评估。在传统推理数据集和具有交互式多步骤工作流的任务（例如 BrowseComp-Plus）上，我们证明自动多智能体系统始终不如思维链自一致性，尽管其成本高达10倍。为了将这些失败与任务结构固有的局限性隔离开来，我们引入了一个为多智能体系统量身定制的诊断性合成数据集，该数据集具有显式任务分解、上下文分离和并行化潜力。我们表明，专家设计的多智能体系统在该数据集上的原始性能和成本效率方面始终优于自动生成的架构，这表明现有的评估框架未能考虑增加计算成本的边际效用，从而掩盖了复杂多智能体系统的关键架构缺陷和低效性。关键的是，对生成的多智能体系统架构的系统解构表明，当前的自动化设计范式产生了架构膨胀，优先考虑表面复杂性，但这并未转化为功能效用，暴露了与多智能体原则的根本性错位。

英文摘要

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.
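
For readers unfamiliar with the single-agent baseline named above, Chain-of-Thought with Self-Consistency samples several reasoning chains and keeps the most frequent final answer. The sketch below shows only the aggregation step; the answer normalization and the vote-share return value are illustrative choices, and sampling from a model is not shown.

```python
from collections import Counter


def self_consistency(answers):
    """Majority vote over final answers extracted from independently sampled CoT chains."""
    counts = Counter(a.strip().lower() for a in answers)   # light normalization (assumed)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Final answers from five hypothetical sampled chains.
sampled = ["42", "42 ", "17", "42", "17"]
best, vote_share = self_consistency(sampled)
```

The cost of this baseline grows linearly with the number of sampled chains, which is what makes the up-to-10x cost comparison with multi-agent systems meaningful.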


2605.28860 2026-06-16 cs.LG cs.AI cs.CL cs.CR 版本更新

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

灾难性遗忘的机制起源：为什么RL比SFT更好地保留电路？

Jeanmely Rojas Nunez, Viraj Sawant, Nathan Allen, Nomgondalai Amgalanbaatar, Yannis Zongo, Vasu Sharma, Maheep Chaudhary

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Washington（华盛顿大学）； University of Toronto（多伦多大学）

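AI summary: Introducing a differential circuit vulnerability metric, the study compares how well reinforcement learning and supervised fine-tuning preserve internal computational circuits when fine-tuning LLMs, and finds that RL adapts to the task more slowly but preserves circuits better, which mitigates catastrophic forgetting.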

详情

AI中文摘要

微调大型语言模型（LLMs）经常导致先前能力的灾难性遗忘。最近的研究表明，强化学习（RL）比监督微调（SFT）更有效地保留先前能力，这归因于策略梯度更新更接近基础策略\cite{shenfeld2025rl}。我们将这种行为解释扩展到机制层面，并探究RL的优势是否通过内部计算电路的更强保留来体现。我们引入了差异电路脆弱性，一种头部级别的度量，用于衡量电路在微调下的退化程度，并将其用于比较RL和SFT在Qwen2.5-3B-Instruct适应科学问答任务上的表现。我们发现了清晰的机制权衡：SFT更快地适应目标任务，但导致更大的电路破坏和先前能力的遗忘，而RL保留了更大比例的基础电路，代价是任务适应较慢。这些发现表明，电路保留可能有助于解释为什么RL对灾难性遗忘更具鲁棒性。我们在此发布了代码：https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability。

英文摘要

Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy \cite{shenfeld2025rl}. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability.


2606.15483 2026-06-16 cs.CL 新提交

Evaluative Judgement in Teaching AI-based Translation: A Class-room Case Study of AI-Mediated Translation and Post-Editing

基于AI的翻译教学中的评价判断：AI中介翻译与译后编辑的课堂案例研究

Gokhan Dogru

发表机构 * Universitat Pompeu Fabra Barcelona（巴塞罗那庞培法布拉大学）

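AI summary: Analyzing 23 student projects, the study examines how structured comparison of general-purpose LLMs and online MT systems elicits evaluative judgement in AI-mediated translation, and finds that students do not blindly follow automatic metrics but choose which output to post-edit based on adequacy, fluency, and similar considerations.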

Comments Workshop on Teaching AI-based Translation and Technologies (TAITT 2026) - EAMT 2026

详情

Context Compression Is Not One Thing: Readable Symbolic Re-expression versus Coherent Summarization at Matched Budget

Sisong Bei, Mikhail L. Arbuzov, Ziwei Dong, Dmitri Kalaev, Alexey Shvets

发表机构 * Independent Researcher（独立研究员）； Palo Alto Networks

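AI summary: Proposes Telegraph English, a readable symbolic format that rewrites retrieved passages into structured entity-relation statements; at matched budget it outperforms three compression baselines by 13 to 20 F1 points and also beats a coherent prose summary on the hardest dataset.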

详情

AI中文摘要

我们研究了使用小型语言模型进行多跳问答时的上下文压缩。我们提出Telegraph English，一种可读的符号化格式，将检索到的段落重写为结构化的实体关系陈述，以更低的token成本保留推理证据。在MuSiQue、TwoWiki和HotpotQA上的受控实验中，Telegraph English在每个数据集上都优于三种匹配预算的压缩基线（字符级删除、截断和随机子采样），F1得分提升13至20个百分点。它还在最难的数据集上优于由同一编码器生成的连贯散文摘要。一个预先注册的深度交互假设被证伪：在数据集内，优势并未随推理深度增加而增长。我们将这些结果解释为证据，表明在匹配的token预算下，可读的符号化重新表达比自然语言或连贯摘要更密集地保留了实体内容。

英文摘要

We study context compression for multi-hop question answering with small language models. We propose Telegraph English, a readable symbolic format that rewrites retrieved passages into structured entity-relation statements, preserving reasoning evidence at lower token cost. In controlled experiments on MuSiQue, TwoWiki, and HotpotQA, Telegraph English outperforms three matched-budget compression baselines (character-level deletion, truncation, and random sub-sampling) on every dataset, with gains of 13 to 20 F1 percentage points. It also outperforms a coherent prose summary produced by the same encoder on the hardest dataset. A pre-registered depth-interaction hypothesis is null: the advantage does not grow with reasoning depth within datasets. We interpret these results as evidence that readable symbolic re-expression preserves entity content more densely than either natural language or coherent summarization at matched token budget.


2606.15412 2026-06-16 cs.CL cs.AI 新提交

Few-Shot Biomedical Relation Extraction with Large Language Models: A Viable Alternative to Supervised Learning?

基于大语言模型的少样本生物医学关系抽取：监督学习的可行替代方案？

Jakob Mraz, Tomaž Curk, Blaž Zupan

发表机构 * University of Ljubljana（卢布尔雅那大学）； Baylor College of Medicine（贝勒医学院）

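AI summary: Studies few-shot biomedical relation extraction with LLMs, comparing pairwise classification and joint generation; joint generation is more precise and more efficient, and prompt-based approaches exceed the supervised baseline on macro-F1, especially on rare relation types.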

详情

AI中文摘要

生物医学关系抽取（BioRE）是将生物医学文献转化为结构化知识的关键步骤。然而，现有方法大多依赖在昂贵标注数据集上训练的监督模型，限制了其在关系类型和领域上的可扩展性和适应性。我们研究了基于提示学习的大语言模型（LLMs）进行少样本BioRE，并比较了两种任务形式：成对分类（预测单个实体对的关系）和联合生成（在单次模型调用中提取多个关系）。在BioREDirect数据集上的实验揭示了明确的精确率-召回率权衡。成对分类实现了更高的召回率，而联合生成更精确且计算效率更高。最佳模型达到了0.44的微F1分数，显著优于之前的少样本结果（0.34），但仍低于监督基线（0.56）。这一差距大部分归因于一个定义模糊的关系类型。当使用宏F1评估时（在类别不平衡设置下更能反映跨关系类型的性能），基于提示的方法优于监督基线（0.45 vs. 0.38），尤其在稀有关系类型上。这些发现突显了LLMs在低资源场景下进行BioRE的潜力，并强调了定义良好的关系模式的重要性。

英文摘要

Biomedical relation extraction (BioRE) is a key step in transforming biomedical literature into structured knowledge. However, most existing approaches rely on supervised models trained on costly annotated datasets, limiting their scalability and adaptability across relation types and domains. We investigate few-shot BioRE using prompt-based learning with large language models (LLMs) and compare two task formulations: pairwise classification, which predicts relations for individual entity pairs, and joint generation, which extracts multiple relations in a single model call. Experiments on the BioREDirect dataset reveal a clear precision-recall trade-off. Pairwise classification achieves higher recall, whereas joint generation is more precise and computationally efficient. The best-performing model achieves a micro-F1 score of 0.44, substantially outperforming previous few-shot results (0.34) while remaining below the supervised baseline (0.56). Much of this gap is attributable to a single ambiguously defined relation type. When evaluated using macro-F1, which better captures performance across relation types in an imbalanced setting, prompt-based approaches outperform the supervised baseline (0.45 vs. 0.38), particularly on rare relation types. These findings highlight the potential of LLMs for BioRE in low-resource settings and underscore the importance of well-defined relation schemas.
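
The reversal between micro-F1 and macro-F1 reported above is easier to see with numbers. In the sketch below, system A is strong on a frequent relation type and weak on a rare one, while system B is the opposite; all counts and labels are invented for illustration and are not taken from BioREDirect.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps relation type -> (tp, fp, fn); returns (micro-F1, macro-F1)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)                                     # pools all instances
    macro = sum(f1(*c) for c in counts.values()) / len(counts) # averages per type
    return micro, macro


# Hypothetical per-type counts for two systems.
system_a = {"frequent": (90, 10, 10), "rare": (1, 4, 4)}
system_b = {"frequent": (70, 30, 30), "rare": (4, 1, 1)}
micro_a, macro_a = micro_macro_f1(system_a)
micro_b, macro_b = micro_macro_f1(system_b)
```

System A wins on micro-F1 because the frequent type dominates the pooled counts, while system B wins on macro-F1 because each relation type counts equally, the same pattern the abstract describes for supervised versus prompt-based approaches.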


2606.15449 2026-06-16 cs.CL cs.IR cs.LG 新提交

Transfer Learning for FHIR Questionnaire Terminology Binding

面向 FHIR 问卷术语绑定的迁移学习

Maxim Gorshkov

发表机构 * Department of Computer Science, Stanford University（斯坦福大学计算机科学系）

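AI summary: Treats binding FHIR Questionnaire items to LOINC codes as a retrieval problem and compares six methods; BioLORD achieves the best top-1 accuracy, contrastive fine-tuning does better at top-5 and top-10, and the paper analyzes distribution shift and error types.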

详情

AI中文摘要

电子预授权工作流要求 FHIR 问卷项携带 LOINC 代码，但 HL7 Da Vinci CDS-Library 中的大多数项缺乏这些绑定。我们将其视为一个检索问题：给定问卷项的文本，从 97,314 个活跃代码池中找到正确的 LOINC 代码。我们在一个包含 54 个项的评估集上比较了六种方法（TF-IDF、冻结 MiniLM、BioBERT、BioLORD、对比微调 MiniLM 以及 TF-IDF+GPT 重排序器），该评估集涵盖三种查询风格（自然问题、中等和简洁）。没有单一方法在所有指标上获胜。BioLORD 是一个在生物医学本体定义上预训练的冻结编码器，尽管没有见过任务特定数据，但其 top-1 准确率最高（R@1 = 0.185，MRR = 0.246），而在原始 LHC-Forms 对上的对比微调则在 R@5（0.389）和 R@10（0.426）上表现最佳。分布偏移消融实验表明，为什么我们主表中的微调不是最强的：在原始对中添加 GPT 生成的释义后，R@5 从 0.389 降至 0.296，因此增强联合在除 R@1 外的所有指标上均不如仅使用原始训练。性能在 5k 训练对时达到峰值。对 BioLORD 的 R@1 失败案例的错误分析表明，错误特异性和歧义文本案例共占错误的 59%。

英文摘要

Electronic prior authorization workflows require FHIR Questionnaire items to carry LOINC codes, yet most items in the HL7 Da Vinci CDS-Library lack these bindings. We treat this as a retrieval problem: given a Questionnaire item's text, find the correct LOINC code in a pool of 97,314 active codes. We compare six methods (TF-IDF, frozen MiniLM, BioBERT, BioLORD, contrastively fine-tuned MiniLM, and a TF-IDF+GPT reranker) on a 54-item evaluation set spanning three query styles (natural question, medium, and terse). No single method wins on every metric. BioLORD, a frozen encoder pre-trained on biomedical ontology definitions, has the best top-rank accuracy (R@1 = 0.185, MRR = 0.246) despite seeing no task-specific data, while a contrastive fine-tune on raw LHC-Forms pairs takes R@5 (0.389) and R@10 (0.426). A distribution-shift ablation shows why the fine-tune in our main table is not the strongest one: adding GPT-generated paraphrases to the raw pairs drops R@5 from 0.389 to 0.296, so the augmented union underperforms raw-only training on every metric except R@1. Performance peaks at 5k training pairs. Error analysis on BioLORD's R@1 failures shows that wrong-specificity and ambiguous-text cases together account for 59% of errors.
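
The retrieval metrics quoted above, R@k and MRR, can be computed as in the following sketch. The single-letter codes are placeholders rather than real LOINC codes, and the function names are illustrative.

```python
def recall_at_k(rankings, gold, k):
    """Fraction of queries whose gold code appears among the top-k retrieved codes."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(rankings, gold):
    """Mean of 1/rank of the gold code; a query contributes 0 if the code is not retrieved."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)


# Three toy queries whose correct code is always "A" (placeholder codes).
rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "D"]]
gold = ["A", "A", "A"]
r_at_1 = recall_at_k(rankings, gold, 1)
r_at_2 = recall_at_k(rankings, gold, 2)
mrr = mean_reciprocal_rank(rankings, gold)
```

On a 54-item evaluation set, 10 top-1 hits would give R@1 = 10/54 ≈ 0.185, and 21 and 23 hits would give 0.389 and 0.426, so the reported figures are consistent with whole-item counts. This also shows how coarse each metric is at this sample size: one item moves a score by about 0.019.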


2606.15566 2026-06-16 cs.CL cs.AI 新提交

LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science

科学话语中的立场检测：以贝叶斯认知科学为例的LLM辅助方法

Eyup Engin Kucuk, Tarik Kelestemur, Ömer Dağlar Tanrikulu

发表机构 * University of New Hampshire（新罕布什尔大学）； Independent Researcher（独立研究员）

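AI summary: Combines a theory-driven codebook, expert annotation, and diagnostic-gated prompt optimization to detect realist versus instrumentalist stances toward Bayesian models in scientific text with three frontier LLMs, reaching a combined reliability of 0.78 on 6,858 quotes from 210 articles.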

Comments 9 pages, 4 figures; Code and data: https://github.com/EyupEK/autoresearch_bayes

详情

AI中文摘要

定性编码是社会科学的核心，但专家标注难以规模化。LLM提供了一种可能的扩展，但当目标构念是解释性的、理论负载的且仅间接表达时，需要仔细验证。我们在一个困难案例中研究这个问题：检测作者是将贝叶斯模型视为心理和神经机制的描述（现实主义）还是有用的数学工具（工具主义）。我们的方法结合了理论驱动的编码手册、专家编码的参考标注、诊断门控提示优化搜索（为三个前沿LLM：GPT-5.1、Claude Sonnet 4.6、Gemini 3 Pro Preview生成共享的零样本提示）以及多评估者信度分析。最终提示在保留样本上实现了0.76的综合信度分数（ICC=0.79和α=0.74的调和平均数），所有诊断均满足。在来自210篇文章的6858条引文上部署后，三个LLM达到了显著的引文级一致性（ICC=0.80；α=0.76；综合=0.78）和近乎完美的文章级排名稳定性（评估者对之间r=0.96-0.97）。语料库总体偏向弱现实主义，但文章级立场很少一致：仅1.4%的文章使用单一波段，而59.5%的文章跨越四个或更多波段。低层感知/运动文章比高层认知文章高出8.8个现实主义点（p<.001，d=0.60），量化了长期持有的定性直觉。我们将其作为专家主导的案例研究呈现；该框架旨在推广到类似的理论密集型任务，而非所有定性分析。

英文摘要

Qualitative coding is central to social science, but expert annotation is difficult to scale. LLMs offer a possible extension, yet require careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. We study this problem in a difficult case: detecting whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). Our method combines a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search yielding a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt achieved a held-out combined reliability score of 0.76 (harmonic mean of ICC = 0.79 and $α$ = 0.74), with all diagnostics satisfied. Deployed on 6,858 quotes from 210 articles, the three LLMs reached substantial quote-level agreement (ICC = 0.80; $α$ = 0.76; combined = 0.78) and near-perfect article-level rank stability ($r$ = 0.96-0.97 across rater pairs). The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single band, while 59.5% spanned four or more. Low-level perception/motor articles scored 8.8 Realism points higher than high-level cognition articles ($p < .001$, $d = 0.60$), quantifying a long-held qualitative intuition. We present this as an expert-led case study; the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis.
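
The combined reliability score above is stated to be the harmonic mean of ICC and Krippendorff's α, so both reported values can be checked directly. The helper name is illustrative; only the reported ICC and α figures come from the abstract.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two positive scores; it is pulled toward the lower of the two."""
    return 2 * a * b / (a + b)


held_out = harmonic_mean(0.79, 0.74)     # ICC and alpha on the held-out sample
deployment = harmonic_mean(0.80, 0.76)   # ICC and alpha on the 6,858 deployed quotes
```

Rounded to two decimals these give 0.76 and 0.78, matching the combined scores reported for the held-out sample and the full deployment.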


2606.15641 2026-06-16 cs.CL 新提交

Distilling Examples into Task Instructions: Enhanced In-Context Learning for Real-World B2B Conversations

将示例提炼为任务指令：面向真实B2B对话的增强上下文学习

Guy Rotman, Adi Kopilov, Danit Berger Zalmanson, Omri Allouche

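AI summary: Traditional in-context learning for B2B conversation classification suffers as concatenated examples lengthen the context. The paper proposes knowledge extraction methods that distill verbose examples into structured classification criteria and precise task descriptions, cutting token usage by 99%, improving macro-averaged AUC by up to 7%, and staying robust as context grows.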

Comments Accepted for publication in Findings of the Association for Computational Linguistics 2026

详情

AI中文摘要

上下文学习（ICL）是低资源分类的标准方法，但其在专业领域的有效性尚未充分探索。我们解决了语义复杂、多方B2B对话分类的挑战，传统ICL在此面临显著限制，尤其是当多个少样本示例拼接导致上下文长度增加时。我们引入了Call Playbook数据集，包含源自真实B2B对话的五项分类任务，针对核心销售概念。为了弥合性能与实际应用之间的差距，我们提出了新颖的知识提取方法，将冗长示例蒸馏为紧凑、可解释的结构化分类标准和精确任务描述。我们的方法实现了令牌使用减少99%，宏平均AUC比传统ICL提升高达7%。值得注意的是，与先进的令牌压缩基线（其F1分数下降超过9点）不同，我们的方法在上下文增长时保持鲁棒。重要的是，我们的框架能够直接优化分类逻辑，满足了真实NLP应用中对透明度、效率和用户交互的关键需求。

英文摘要

In-context learning (ICL) is the standard method for low-resource classification, yet its efficacy in specialized domains remains largely unexplored. We address the challenge of classifying semantically complex, multi-party B2B conversations, where traditional ICL encounters significant limitations, especially as context length increases due to the concatenation of multiple few-shot examples. We introduce the Call Playbook dataset, featuring five classification tasks derived from real-world B2B conversations targeting core sales concepts. To bridge the gap between performance and practical utility, we propose novel knowledge extraction methods that distill verbose examples into compact, interpretable representations of structured classification criteria and precise task descriptions. Our approach achieves a 99% reduction in token usage and improves macro-averaged AUC by up to 7% over traditional ICL. Notably, it remains robust as context grows, unlike advanced token compression baselines which degrade by over 9 F1 points. Importantly, our framework enables direct refinement of classification logic, addressing critical needs for transparency, efficiency, and user interaction in real-world NLP applications.


2606.15770 2026-06-16 cs.CL 新提交

ttda704 at SemEval-2026 Task 6: Structured Chain-of-Thought Prompting for Political Evasion Detection

ttda704 at SemEval-2026 Task 6: 用于政治回避检测的结构化思维链提示

Tai Tran Tan, An Dinh Thien

发表机构 * University of Information Technology, Ho Chi Minh City, Vietnam（胡志明市信息技术大学）； Vietnam National University, Ho Chi Minh City, Vietnam（越南国家大学胡志明市分校）

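AI summary: For classifying political evasion strategies in presidential interviews, the paper compares QLoRA fine-tuning of Qwen3 with structured chain-of-thought prompting of DeepSeek-V3.2 and Grok-4-Fast; prompting achieves substantially higher Macro F1, and the best system reaches 0.5147 Macro F1 on the 9-class evasion subtask.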

详情

AI中文摘要

本文描述了我们在SemEval-2026任务6中的系统，该任务涉及对从美国总统访谈中提取的英文问答对进行政治回避策略分类。我们系统比较了两种不同的范式：(1) 使用QLoRA对Qwen3模型（4B-32B）进行参数高效微调，通过分层上采样和加权交叉熵损失来应对严重的类别不平衡；(2) 对具备推理能力的API模型（即DeepSeek-V3.2和Grok-4-Fast）使用结构化思维链（CoT）提示。我们的评估表明，启用推理能力的模型的结构化CoT提示在绝对Macro F1上显著优于我们的基线参数高效微调实现。我们最好的系统，即具有扩展推理和少样本分层CoT提示的Grok-4-Fast，在子任务2（9类回避）上达到0.5147的Macro F1，在子任务1（3类清晰度）上达到0.7979的Macro F1，在官方排行榜上分别位列子任务2的第8名（共33支队伍）和子任务1的第13名（共41支队伍）。此外，我们的消融研究揭示了回避检测中有效提示设计的关键见解：在分层分类法中呈现标签有助于结构化模型推理，而少样本示例提供了任务校准。然而，最强的提示变体在Macro F1上并无统计显著差异，而显式启用扩展推理模式通过促进检测回避意图所需的多步语用分析，带来了显著的性能提升。

英文摘要

This paper describes our system for SemEval-2026 Task 6, which addresses the classification of political evasion strategies in English question-answer pairs extracted from U.S. presidential interviews. We systematically compare two distinct paradigms: (1) Parameter-Efficient Fine-Tuning of Qwen3 models (4B-32B) using QLoRA, enhanced with tiered upsampling and weighted cross-entropy loss to address severe class imbalance, and (2) structured Chain-of-Thought (CoT) prompting of reasoning-capable API models, namely DeepSeek-V3.2 and Grok-4-Fast. Our evaluation demonstrates that structured CoT prompting of reasoning-enabled models substantially outperforms our baseline parameter-efficient fine-tuning implementation in absolute Macro F1. Our best system, Grok-4-Fast with extended reasoning and few-shot hierarchical CoT prompting, achieves a Macro F1 of 0.5147 on Subtask 2 (9-class evasion) and 0.7979 on Subtask 1 (3-class clarity), ranking 8th out of 33 teams on Subtask 2 and 13th out of 41 teams on Subtask 1 on the official leaderboard. Furthermore, our ablation studies reveal key insights into effective prompt design for evasion detection: presenting labels within a hierarchical taxonomy helps structure model reasoning, while few-shot exemplars provide task calibration. However, the strongest prompt variants are not statistically distinguishable in Macro F1, and explicitly enabling extended reasoning modes yields substantial performance gains by facilitating the multi-step pragmatic analysis required to detect evasive intent.


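2606.15833 2026-06-16 cs.CL New submission

When Correct Edges Cannot Be Verified: A Provenance Gap in Incomplete KGQA and a Provenance-Favoring Completion Policy

当正确边无法被验证：不完全KGQA中的溯源缺口及一种偏好溯源的补全策略

Yongqi Kang, Yu Fu, Yong Zhao

Affiliations: University of Science and Technology of China

AI summary: Examines whether completed edges in incomplete knowledge graph question answering can be verified against text, finds that 76-96% of correct edges lack textual support, and proposes TGComplete, a provenance-favoring policy that substantially improves edge precision and strict faithfulness while preserving answer quality.

AI Chinese abstract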

不完全知识图谱问答（IKGQA）需要补全缺失的边以继续推理。越来越多的研究通过检索文本来验证补全的边，将文本支持视为边质量的代理。我们提出了一个据我们所知尚未被系统检验的问题：文本可验证性是否真的反映了正确性？利用标准随机删除协议提供的黄金删除三元组，我们测量了这两者。发现是反直觉的：在黄金正确的补全边中，76-96%即使在穷尽检索下也没有支持段落，这一结果在删除率（20%/40%）、数据集（CWQ/WebQSP）和关系类型（结构型、常识型、长尾型）上均稳健。大多数Freebase风格的事实根本不会以头尾共现的形式出现在文本中。因此，文本忠实性衡量的是溯源，而非正确性——两者之间存在一个语料内检索无法弥合的范式级差距。这重新定义了边补全问题。由于大多数补全边——无论正确与否——对答案而言是因果冗余的（95-97%的正确答案不依赖于任何无支持的边），核心问题从“边是否正确？”转变为“在溯源不确定下是接受还是放弃？”在此框架下，我们提出了TGComplete，一种偏好溯源的接受策略，它在推理断点处检索证据，通过轻量级循环验证候选边，并在缺乏支持时放弃。与生成-补全基线GoG相比，它在黄金标准上获得了更高的边精度（15-21% vs 3-14%），且没有统计上显著的EM损失，同时被接受边的严格忠实性提高了3.1-7.4倍——代价是召回率较低。我们将TGComplete定位为并非全面更优，而是在精度/溯源召回权衡下的一个原则性选择，适用于可审计性重要的场景。

English abstract

Incomplete Knowledge Graph Question Answering (IKGQA) requires completing missing edges to continue reasoning. A growing line of work verifies completed edges against retrieved text, treating textual support as a proxy for edge quality. We ask a question that, to our knowledge, has not been systematically tested: does textual verifiability actually track correctness? Exploiting the gold deleted triples provided by the standard random-deletion protocol, we measure both. The finding is counterintuitive: among gold-correct completed edges, 76-96% have no supporting passage even under exhaustive retrieval, robustly across deletion rates (20%/40%), datasets (CWQ/WebQSP), and relation types (structural, commonsense, long-tail). Most Freebase-style facts simply do not occur as head-tail co-mentions in text. Textual faithfulness therefore measures provenance, not correctness -- separated by a paradigm-level gap no in-corpus retrieval closes. This reframes edge completion. Since most completed edges -- correct or not -- are causally redundant for the answer (95-97% of correct answers do not depend on any unsupported edge), the central question shifts from "is the edge correct?" to "admit or abstain under provenance uncertainty?" Within this framing we present TGComplete, a provenance-favoring admission policy that retrieves evidence at a reasoning breakpoint, verifies a candidate through a lightweight loop, and abstains when support is absent. Against the generate-to-complete baseline GoG, it attains higher edge precision against gold (15-21% vs 3-14%), with no statistically detectable EM loss and 3.1-7.4 times higher strict faithfulness of admitted edges -- at the cost of lower recall. We position TGComplete not as uniformly better, but as a principled point on a precision/provenance-recall trade-off, appropriate when auditability matters.
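The admit-or-abstain framing can be made concrete with a toy sketch. This is not TGComplete itself: the co-mention check, the triples, and the single passage are hypothetical stand-ins, and the resulting numbers are not the paper's. It only shows how requiring textual support raises precision, lowers recall, and leaves correct edges unsupported.

```python
def supported(edge, passages):
    """Hypothetical support check: some passage mentions both head and tail."""
    head, _relation, tail = edge
    return any(head.lower() in p.lower() and tail.lower() in p.lower()
               for p in passages)

def precision_recall(admitted, gold):
    """Edge precision and recall of the admitted edges against gold triples."""
    correct = len(set(admitted) & gold)
    precision = correct / len(admitted) if admitted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Toy gold triples and candidates (illustrative only).
gold = {
    ("Paris", "capital_of", "France"),
    ("Seine", "flows_through", "Paris"),
    ("Louvre", "located_in", "Paris"),
}
candidates = sorted(gold) + [("Paris", "capital_of", "Spain")]
passages = ["Paris is the capital of France."]

admit_all = candidates                                            # never abstain
admit_supported = [e for e in candidates if supported(e, passages)]
unsupported_correct = [e for e in gold if not supported(e, passages)]

print(precision_recall(admit_all, gold))
print(precision_recall(admit_supported, gold))
print(len(unsupported_correct), "of", len(gold), "correct edges lack support")
```

In this toy setup, two of the three correct edges have no supporting passage, so a support-only policy trades recall for precision and provenance, which is the trade-off the abstract describes.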

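2606.15971 2026-06-16 cs.CL New submission

SAG: SQL-Retrieval Augmented Generation with Query-Time Dynamic Hyperedges

SAG: 查询时动态超边的SQL检索增强生成

Yuchao Wu, Junqin Li, XingCheng Liang, Yongjie Chen, Yinghao Liang, Linyuan Mo, Guanxian Li

Affiliations: Zleap AI

AI summary: Proposes the SAG architecture, which builds local hyperedge indexes dynamically through SQL queries, avoids global graph maintenance, achieves the best recall on multi-hop reasoning benchmarks, and supports production deployment over hundreds of millions of items.

AI Chinese abstract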

检索增强生成（RAG）为大语言模型访问外部知识提供了一种有效方法。然而，现有方法依赖密集相似性检索，在处理结构化约束和多跳推理方面存在固有限制。引入知识图谱可部分缓解这些问题，但代价是语义碎片化、高维护成本和难以增量更新。本文介绍了SAG（SQL检索增强生成），一种用于检索和代理系统的结构化架构。SAG不预先构建全局静态图，而是将每个块转换为一个语义完整的事件和一组索引实体，然后使用SQL连接查询动态地将共享实体的事件链接到局部超边中，在查询时构建动态实例化的局部索引结构。这种设计避免了全局图重建和持续维护的需要；该系统通过依赖标准数据库基础设施，自然支持增量写入、并发处理和持续扩展。在HotpotQA、2WikiMultiHop和MuSiQue这三个标准多跳基准上，SAG在9个Recall@K指标中的8个上取得了最佳结果，在MuSiQue（多跳推理需求最高的基准）上达到80.0%的Recall@5。SAG还已在数亿数据项的生产规模上部署，在线检索延迟保持在秒级。项目网站和代码见https://github.com/Zleap-AI/SAG-Benchmark。

English abstract

Retrieval-Augmented Generation (RAG) offers an effective approach for large language models to access external knowledge. However, existing methods rely on dense similarity retrieval and face inherent limitations in handling structured constraints and multi-hop reasoning. Incorporating knowledge graphs partially alleviates these issues, but at the cost of semantic fragmentation, high maintenance overhead, and difficult incremental updates. This paper introduces SAG (SQL-Retrieval Augmented Generation), a structured architecture for retrieval and agent systems. Instead of pre-building a global static graph, SAG converts each chunk into one semantically complete event and a set of indexing entities, then uses SQL join queries to dynamically link events that share entities into local hyperedges, constructing, at query time, a dynamically instantiated local index structure. This design avoids the need for global graph rebuilding and ongoing maintenance; the system naturally supports incremental writes, concurrent processing, and continuous scaling through its reliance on standard database infrastructure. Across HotpotQA, 2WikiMultiHop, and MuSiQue, three standard multi-hop benchmarks, SAG achieves the best results on 8 out of 9 Recall@K metrics, reaching 80.0% Recall@5 on MuSiQue, the benchmark with the highest multi-hop reasoning demands. SAG has also been deployed at a production scale of hundreds of millions of data items, with online retrieval latency kept within seconds. Project site and code are available at https://github.com/Zleap-AI/SAG-Benchmark.
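The idea of linking events that share entities through SQL joins at query time can be sketched with Python's standard `sqlite3` module. The table names, columns, sample events, and the `local_hyperedges` helper below are assumptions made for illustration; the abstract does not specify SAG's actual schema or queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE event_entities (event_id INTEGER, entity TEXT);
""")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "Alice founded Acme in Paris."),
    (2, "Acme acquired Beta Corp."),
    (3, "Beta Corp is headquartered in Oslo."),
    (4, "Carol lives in Lima."),
])
conn.executemany("INSERT INTO event_entities VALUES (?, ?)", [
    (1, "Alice"), (1, "Acme"), (1, "Paris"),
    (2, "Acme"), (2, "Beta Corp"),
    (3, "Beta Corp"), (3, "Oslo"),
    (4, "Carol"), (4, "Lima"),
])

def local_hyperedges(conn, seed_entities):
    """Seed entities -> seed events -> their entities -> all events sharing each entity.

    Returns {entity: [event ids]}, keeping only entities that link 2+ events.
    """
    marks = ",".join("?" * len(seed_entities))
    sql = f"""
        SELECT DISTINCT shared.entity, other.event_id
        FROM event_entities AS seed
        JOIN event_entities AS shared ON shared.event_id = seed.event_id
        JOIN event_entities AS other  ON other.entity = shared.entity
        WHERE seed.entity IN ({marks})
        ORDER BY shared.entity, other.event_id
    """
    edges = {}
    for entity, event_id in conn.execute(sql, list(seed_entities)):
        edges.setdefault(entity, []).append(event_id)
    return {e: ids for e, ids in edges.items() if len(ids) > 1}

hyperedges = local_hyperedges(conn, ["Alice"])
print(hyperedges)
```

Nothing is precomputed here: inserting a new row into `event_entities` changes the next query's result without any graph rebuild, which mirrors the incremental-write property the abstract emphasizes.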

2606.16074 2026-06-16 cs.CL cs.AI New submission

PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

PVminerLLM2：通过偏好优化改进患者声音的结构化提取

Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Elyas Irankhah, Sreeraj Ramachandran, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

Affiliations: Yale School of Medicine; Yale School of Public Health; Texas State University

AI summary: Proposes PVminerLLM2, which uses preference optimization with a token-level gated stabilization term and confusion-aware preference pair construction to address fine-grained errors that supervised fine-tuning handles poorly, outperforming baseline models on structured patient voice extraction.

AI Chinese abstract

动机：患者生成的文本包含关于患者生活经历、社会背景和护理参与的关键信息，但大多是非结构化的，限制了其在以患者为中心的结果研究中的应用。先前的工作引入了PV-Miner基准和PVMinerLLM模型用于结构化提取。然而，仅靠监督微调（SFT）难以处理罕见、细粒度且分布不均的错误，尤其是在令牌关键的结构化输出中。结果：我们提出了PVminerLLM2，一组改进的用于结构化患者声音提取的LLM，它应用偏好优化来解决监督微调无法处理的令牌级错误。我们的方法引入了（i）带有令牌级门控稳定项的偏好目标，防止在偏好优化下绝对令牌似然的退化，以及（ii）混淆感知的偏好对构建，以更好地捕捉低分离度的区分。我们进一步引入了令牌重要性加权和逆频率重加权，以解决令牌不平衡和类别偏斜问题。在多种模型规模下，PVMinerLLM2始终优于强基线，在代码、子代码和跨度上分别获得了高达4.43%、3.50%和1.55%的提升，并且优于使用现有偏好优化方法训练的基线LLM。可用性和实现：PVminerLLM2的补充材料、代码、评估脚本和训练模型公开于：https://github.com/Data-Mining-Lab-Yale/PVminerLLM2

English abstract

Motivation: Patient-generated text contains critical information on patients' lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior work introduced the PV-Miner benchmark and PVMinerLLM models for structured extraction. However, supervised fine-tuning (SFT) alone struggles with rare, fine-grained, and unevenly distributed errors, particularly in token-critical structured outputs. Results: We present PVminerLLM2, an improved set of LLMs for structured patient voice extraction that applies preference optimization to address token-critical errors beyond the reach of supervised fine-tuning. Our method introduces (i) a preference objective with a token-level gated stabilization term that prevents degradation of absolute token likelihood under preference optimization, and (ii) confusion-aware preference pair construction to better capture low-separation distinctions. We further incorporate token-importance weighting and inverse-frequency reweighting to address token imbalance and class skew. Across multiple model sizes, PVminerLLM2 consistently outperforms strong baselines, achieving gains of up to 4.43% (Code), 3.50% (Sub-code), and 1.55% (Span), and outperforms baseline LLMs trained with existing preference optimization methods. Availability and Implementation: The supplementary material, code, evaluation scripts, and trained models for PVminerLLM2 are publicly available at: https://github.com/Data-Mining-Lab-Yale/PVminerLLM2
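The abstract names inverse-frequency reweighting without giving its exact form. One common convention, the same one scikit-learn uses for "balanced" class weights, is sketched below; the class names are hypothetical and the formula is an assumption, not necessarily the one used in PVminerLLM2.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rare classes count more.

    n is the number of examples and k the number of distinct classes.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Hypothetical code labels with an 8:2 skew.
labels = ["symptom"] * 8 + ["social_context"] * 2
weights = inverse_frequency_weights(labels)
print(weights)
```

With this normalization the weights summed over all examples equal the number of examples, so the overall loss scale is unchanged while each class contributes equally in total.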

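2606.16409 2026-06-16 cs.CL New submission

MAGE-RAG：面向长文档问答的多粒度自适应图证据多模态RAG

Yilong Zuo, Xunkai Li, Jing Yuan, Qiangqiang Dai, Hongchao Qin, Ronghua Li

Affiliations: University of Science and Technology of China

AI summary: Proposes the MAGE-RAG framework, which builds an evidence graph of page and element nodes offline and adaptively constructs evidence subgraphs online, balancing evidence coverage against noise control and achieving the best performance on long-document multimodal question answering.

AI Chinese abstract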

长文档多模态问答要求系统在长PDF中定位稀疏证据，并整合来自文本、表格、图像、图表和复杂布局的线索。现有RAG方法大多依赖于文本块或页面的固定Top-k检索。文本检索可以压缩上下文，但往往丢失视觉和布局信息；页面级视觉检索保留原始页面，但也会将大量无关区域送入阅读器，导致证据覆盖、噪声和推理成本之间的静态权衡。本文提出MAGE-RAG，一种用于长文档多模态问答的多粒度自适应图证据框架。MAGE-RAG以页面检索作为查询时证据构建的入口。离线阶段，它构建一个包含页面节点和元素节点的证据图，编码包含关系、阅读顺序、布局邻接、章节层次和语义邻居关系。查询时，在线证据控制器在显式预算下迭代地激活、打开、搜索和剪枝证据。生成的证据子图随后被渲染为结构化的多模态阅读器输入，使LVLM能够在有限上下文中消费紧凑且相关的证据。在LongDocURL和MMLongBench-Doc上，我们建立了统一的比较和分析协议，涵盖直接MLLM、文本RAG、页面级视觉RAG和图/智能体RAG。实验表明，MAGE-RAG在LongDocURL上达到52.75的整体准确率，在MMLongBench-Doc上达到53.26的准确率和51.19的F1。细粒度分解、预算-性能曲线、消融和基于轨迹的分析进一步表明，查询时证据子图构建能够平衡分散证据覆盖与上下文噪声控制。我们的代码可在https://github.com/laonuo2004/MAGE-RAG.git获取。

English abstract

Long-document multimodal question answering requires a system to locate sparse evidence in long PDFs and integrate clues from text, tables, images, charts, and complex layouts. Existing RAG methods mostly rely on fixed Top-k retrieval over text chunks or pages. Text retrieval can compress the context but often loses visual and layout information; page-level visual retrieval preserves the original page, yet it also sends large irrelevant regions to the reader, leading to a static trade-off among evidence coverage, noise, and inference cost. This paper proposes MAGE-RAG, a multigranular adaptive graph evidence framework for long-document multimodal QA. MAGE-RAG uses page retrieval as the entry point for query-time evidence construction. Offline, it builds an evidence graph with page nodes and element nodes, encoding containment, reading order, layout adjacency, section hierarchy, and semantic-neighbor relations. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets. The resulting evidence subgraph is then rendered into structured multimodal reader input, allowing the LVLM to consume compact and relevant evidence within a limited context. On LongDocURL and MMLongBench-Doc, we establish a unified comparison and analysis protocol covering Direct MLLM, Text RAG, Page-level Visual RAG, and Graph/Agentic RAG. Experiments show that MAGE-RAG achieves 52.75 overall accuracy on LongDocURL, and 53.26 accuracy with 51.19 F1 on MMLongBench-Doc. Fine-grained breakdowns, budget-performance curves, ablations, and trace-based analysis further show that query-time evidence subgraph construction can balance dispersed evidence coverage with context-noise control. Our code is available at https://github.com/laonuo2004/MAGE-RAG.git.

2606.15998 2026-06-16 cs.IR cs.AI cs.CL cs.LG Cross-listed

Entity Labels Are Not Entity Signals: A Framework for Observable Relevance in Document Re-Ranking

实体标签并非实体信号：文档重排序中可观测相关性的框架

Utshab Kumar Ghosh, Shubham Chatterjee

Affiliations: Department of Computer Science, Missouri University of Science and Technology

AI summary: Distinguishes Observable Entity Relevance (OER) from Conceptual Entity Relevance (CER), shows that CER-based supervision performs poorly, and shows that aligning supervision with OER substantially improves re-ranking performance.

Comments ICTIR '26

DOI: 10.1145/3805713.3820411
Journal ref: Proceedings of the 2026 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR)

AI Chinese abstract

实体感知的文档检索使用与查询关联的实体作为排序信号，假设语义相关的实体也是有用的检索信号。我们证明这一假设是不充分的，并解释原因。与作为真实观测的词项不同，实体链接是由不完美的链接器产生的假设：如果链接器在相关和非相关文档中无差别地触发，那么一个实体可能在主题上重要，却不提供任何判别性信号。我们将此形式化为概念实体相关性（CER）——实体是否与查询主题相关——和可观测实体相关性（OER）——其在集合中的观测出现是否能区分相关与非相关文档——之间的区别。在四个集合和包括人工实体判断的标注来源上，CER和OER表现出接近随机的吻合度（κ≈0），而OER的操作化实现吻合度较高（κ≈0.5），确认CER是系统性异常值。基于CER的监督选择主题上合理但判别性弱的实体，在某些集合上仅能过滤不到4%的非相关文档。将监督与OER对齐可将非相关文档过滤提升至10倍，并在BM25基础上将开放世界MAP提升0.051。我们的发现促使实体感知检索中从概念实体相关性向可观测实体相关性的转变。

English abstract

Entity-aware document retrieval uses query-associated entities as ranking signals, assuming that semantically relevant entities are also useful retrieval signals. We show that this assumption is insufficient, and explain why. Unlike terms, which are ground-truth observations, entity links are hypotheses produced by an imperfect linker: an entity can be topically central yet provide no discriminative signal if the linker fires indiscriminately across relevant and non-relevant documents. We formalize this as a distinction between Conceptual Entity Relevance (CER), whether an entity is topically related to a query, and Observable Entity Relevance (OER), whether its observed presence in a collection discriminates relevant from non-relevant documents. Across four collections and annotation sources including human entity judgments, CER and OER exhibit near-chance agreement ($\kappa \approx 0$), while OER operationalizations agree substantially ($\kappa \approx 0.5$), confirming CER as the systematic outlier. CER-based supervision selects topically plausible but weakly discriminative entities, pruning fewer than 4% of non-relevant documents on some collections. Aligning supervision with OER improves non-relevant pruning by up to 10x and open-world MAP by 0.051 over BM25. Our findings motivate a shift from conceptual to observable notions of entity relevance in entity-aware retrieval.
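To make the reported agreement levels concrete, the sketch below computes Cohen's kappa for two sets of binary entity labels. It assumes the standard two-rater Cohen's kappa, which the abstract does not confirm, and the label vectors are toy data constructed to land at 0 and 0.5; they are not the paper's annotations.

```python
def cohens_kappa(a, b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    labels = set(a) | set(b)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_chance = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Toy labels for 8 entities (1 = relevant under that notion).
cer   = [1, 1, 1, 1, 0, 0, 0, 0]   # topical (conceptual) judgments
oer_a = [1, 1, 0, 0, 1, 1, 0, 0]   # one observable-relevance operationalization
oer_b = [1, 1, 0, 0, 1, 0, 0, 1]   # a second operationalization

kappa_cer_oer = cohens_kappa(cer, oer_a)
kappa_oer_oer = cohens_kappa(oer_a, oer_b)
print(kappa_cer_oer, kappa_oer_oer)
```

A kappa of 0 means the two label sets agree no more often than chance would predict from their marginal frequencies, even though raw agreement here is still 50%.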

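2606.16661 2026-06-16 cs.IR cs.CL Cross-listed

SPI：向量数据库中流式RAG的查询深度自适应索引

Dong Liu, Yanxuan Yu

Affiliations: Yale University; Columbia University

AI summary: Proposes Semantic Pyramid Indexing (SPI), which adapts retrieval depth per query through multi-level resolution organization and an uncertainty-aware controller, supports streaming insertion and progressive ANN search, and achieves a 1.4-2.3x latency reduction over baselines on MS MARCO and Natural Questions.

AI Chinese abstract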

向量数据库（VecDB）越来越多地部署在检索增强生成（RAG）管道中，其中查询处理和文档摄取同时发生。索引层需要提供低延迟搜索，同时在不频繁全局重建的情况下纳入新向量。现有的VecDB管道通常在统一表示机制下运行，尽管查询所需的语义粒度存在显著差异。这促使设计一种支持增量更新同时根据查询分布和复杂性调整检索深度的索引。我们提出语义金字塔索引（SPI），一种VecDB层索引框架，将嵌入组织成$L$个语义对齐的分辨率级别，并通过轻量级不确定性感知控制器为每个查询选择检索深度。SPI支持渐进式粗到细ANN搜索、无需全局重建的逐级流式插入，以及通过LSH分区和异步gRPC协调的分布式执行。与具有固定遍历规则的分层ANN结构（例如SPANN）不同，SPI在查询时自适应分辨率，同时保持与FAISS和Qdrant后端的兼容性。在MS MARCO和Natural Questions上，在相同密集编码器系列下，SPI在Recall@10上具有竞争力且延迟更低，相对于可比较的近似ANN基线，在固定Recall@10目标下实现了1.4-2.3倍的平均检索延迟降低。一个最多8个节点的原型扩展研究显示吞吐量扩展了6.2倍（约73%效率）；为完整性包含了16节点配置，但显示出递减的效率。我们提供了top-$K$稳定性保证：具有足够检索裕度的查询在较浅层返回相同的top-$K$集合。代码和配置可从 https://github.com/FastLM/SPI_VecDB 获取。

English abstract

Vector databases (VecDBs) are increasingly deployed in retrieval-augmented generation (RAG) pipelines where query processing and document ingestion occur concurrently. The index layer needs to provide low-latency search while incorporating new vectors without frequent global rebuilding. Existing VecDB pipelines typically operate within a uniform representation regime, despite substantial variation in the semantic granularity required across queries. This motivates an index design that supports incremental updates while adapting retrieval depth to query distribution and complexity. We propose Semantic Pyramid Indexing (SPI), a VecDB-layer indexing framework that organizes embeddings into $L$ semantically aligned resolution levels and selects retrieval depth per query via a lightweight uncertainty-aware controller. SPI supports progressive coarse-to-fine ANN search, level-wise streaming insertion without global rebuilds, and distributed execution through LSH partitioning with asynchronous gRPC coordination. Unlike hierarchical ANN structures with fixed traversal rules (e.g., SPANN), SPI adapts resolution at query time while remaining compatible with FAISS and Qdrant backends. On MS MARCO and Natural Questions, SPI achieves competitive Recall@10 with lower latency under the same dense encoder family, yielding a 1.4--2.3$\times$ average retrieval latency reduction under fixed Recall@10 targets relative to comparable approximate-ANN baselines. A prototype scaling study up to 8 nodes shows $6.2\times$ throughput scaling (${\approx}73\%$ efficiency); the 16-node configuration is included for completeness but shows diminishing efficiency. We provide a top-$K$ stability guarantee: queries with sufficient retrieval margin return an identical top-$K$ set at a shallower level. Code and configurations are available at https://github.com/FastLM/SPI_VecDB.

2603.19595 2026-06-16 cs.IR cs.CL Updated version

All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution

All-Mem: 通过动态拓扑演化实现智能体终身记忆

Can Lv, Heng Chang, Shengyu Tao, Mingju Chen, Zhaoxin Fan, Ziwei Zhang, Yuchen Guo, Shiji Zhou

Affiliations: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing; Beihang University; Tsinghua University; Chalmers University of Technology

AI summary: Proposes the All-Mem framework, which uses a non-destructive, topology-structured memory bank with combined online and offline operation to address the redundant and noisy retrieval caused by growing histories in lifelong interactive agents, improving retrieval and QA performance on LoCoMo and LongMemEval-s.

AI Chinese abstract

终身交互代理期望在数月或数年内协助用户，这需要在固定上下文和延迟预算下持续写入长期记忆，同时为每个新查询检索正确的证据。现有的记忆系统随着历史增长往往会退化，产生冗余、过时或噪声的检索上下文。我们提出All-Mem，一个在线/离线终身记忆框架，通过显式的、非破坏性的整合维护一个拓扑结构化的记忆库，避免了基于摘要压缩的典型不可逆信息损失。在在线操作中，它将检索锚定在有界可见表面上以保持粗略搜索成本有界。定期离线时，LLM诊断器提出置信度评分的拓扑编辑，通过三个算子（拆分、合并和更新）执行门控，同时保留不可变证据以保持可追溯性。在查询时，类型化链接支持从活动锚点到存档证据的跳数有界、预算可控的扩展。在LoCoMo和LongMemEval-s上的实验表明，与代表性基线相比，检索和问答性能得到提升。代码可在 https://github.com/LvCan926/All-Mem 获取。

English abstract

Lifelong interactive agents are expected to assist users over months or years, which requires continually writing long-term memories while retrieving the right evidence for each new query under fixed context and latency budgets. Existing memory systems often degrade as histories grow, yielding redundant, outdated, or noisy retrieved contexts. We present All-Mem, an online/offline lifelong memory framework that maintains a topology-structured memory bank via explicit, non-destructive consolidation, avoiding the irreversible information loss typical of summarization-based compression. In online operation, it anchors retrieval on a bounded visible surface to keep coarse search cost bounded. Periodically offline, an LLM diagnoser proposes confidence-scored topology edits, which are executed with gating through three operators (Split, Merge, and Update) while preserving immutable evidence for traceability. At query time, typed links enable hop-bounded, budgeted expansion from active anchors to archived evidence when needed. Experiments on LoCoMo and LongMemEval-s show improved retrieval and QA over representative baselines. The code is available at https://github.com/LvCan926/All-Mem.
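The query-time step, hop-bounded and budgeted expansion over typed links, can be pictured as a breadth-first traversal with two stopping rules. The sketch below is a generic illustration under assumed names: the node identifiers, link types, and the `expand` function are hypothetical and are not taken from the All-Mem code.

```python
from collections import deque

def expand(links, anchors, max_hops, budget, allowed_types=None):
    """Breadth-first expansion from anchors over typed links.

    Stops following links after max_hops and never returns more than
    budget nodes. links maps node -> list of (link_type, target).
    """
    visited = list(dict.fromkeys(anchors))[:budget]  # de-duplicate, keep order
    seen = set(visited)
    queue = deque((a, 0) for a in visited)
    while queue and len(visited) < budget:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # hop bound reached on this path
        for link_type, target in links.get(node, []):
            if allowed_types is not None and link_type not in allowed_types:
                continue
            if target not in seen and len(visited) < budget:
                seen.add(target)
                visited.append(target)
                queue.append((target, hops + 1))
    return visited

# Hypothetical memory graph: an active anchor, consolidated notes, raw evidence.
links = {
    "anchor:trip": [("summarizes", "note:flight"), ("summarizes", "note:hotel")],
    "note:flight": [("derived_from", "raw:msg17")],
    "note:hotel": [("derived_from", "raw:msg21")],
    "raw:msg17": [("next", "raw:msg18")],
}
print(expand(links, ["anchor:trip"], max_hops=2, budget=10))
print(expand(links, ["anchor:trip"], max_hops=2, budget=3))
```

Because archived nodes stay in the graph and are only reached through links, raising `max_hops` or `budget` recovers more raw evidence without changing what is stored.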

2606.14832 2026-06-16 cs.CL New submission

PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

PhoneHarness: 通过混合GUI、CLI和工具操作利用手机使用代理

Chenxin Li, Zhengyao Fang, Zhengyang Tang, Pengyuan Lyu, Xingran Zhou, Xin Lai, Fei Tang, Liang Wu, Yiduo Guo, Weinong Wang, Junyi Li, Yi Zhang, Yang Ding, Huawen Shen, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Chengquan Zhang, Han Hu

Affiliations: Tencent Hunyuan; The Chinese University of Hong Kong; The Chinese University of Hong Kong, Shenzhen; Tsinghua University

AI summary: Proposes PhoneHarness, a mixed-action benchmark and execution harness for evaluating phone agents on verifiable mobile workflows; by combining GUI, CLI, and tool actions it reaches a 75.0% pass rate, 12.9 percentage points above the strongest baseline.

Comments Project Page: https://phoneharness.github.io/

AI Chinese abstract

手机代理越来越被期望完成真实的移动工作流，而不仅仅是预测下一个屏幕动作。然而，当前许多移动代理文献仍然主要将代理评估为GUI控制器，它们观察屏幕、发出点击和滑动，并根据目标应用状态评分。真实的手机使用任务更为广泛：它们需要决定何时使用应用GUI、设备端命令或结构化工具，同时留下预期副作用实际发生的证据。我们引入了PhoneHarness，一个混合动作基准和执行框架，用于研究在可验证移动工作流上的手机使用代理。PhoneHarness在GUI、CLI和主机端工具动作上运行设备端代理循环，结合确定性动作路由与有界GUI委托和可审计执行轨迹。其基准PhoneHarness Bench评估代理是否完成具有可观察副作用的任务，而不仅仅是产生合理的最终答案。在注释评估集上，PhoneHarness达到75.0%的通过率，比最强的非PhoneHarness设置高出12.9个百分点。因此，PhoneHarness和PhoneHarness Bench扮演着不同但相互依赖的角色：框架使混合手机工作流可执行，而基准衡量代理是否能够可靠且安全地使用该框架。我们的发现表明，可靠的手机自动化依赖于动作表面路由和可验证执行，而不仅仅是视觉GUI控制。

English abstract

Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.

2606.15017 2026-06-16 cs.CL New submission

Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents

在线技能与记忆模块是否总是值得其令牌消耗？Web代理的预算约束研究

Sina Hajimiri, Masih Aminbeidokhti, Jose Dolz, Ismail Ben Ayed, Issam H. Laradji, Spandana Gella, Nicolas Gontier

Affiliations: ServiceNow AI Research; ÉTS Montreal; University of British Columbia; McGill University

AI summary: Under a fixed inference budget, compares three online augmentation modules with a token-matched baseline and finds that the baseline matches or surpasses all augmentation methods in aggregate success rate, often while using fewer tokens.

AI Chinese abstract

在线Web代理通常用记忆、工作流或技能模块增强基础执行器。这些模块可提升性能，但也消耗测试时令牌，这一成本很少与执行器的推理成本一同报告。我们研究在线增强（每项任务都需支付此开销），并在固定总推理预算下重新评估其收益。我们将AWM、ASI和ReasoningBank与令牌匹配的普通基线（使用相同预算进行额外执行器步骤）进行比较。在三个WebArena领域和三个模型（Gemini 3 Flash、GPT-5.4-mini和Qwen 3.6-27B）上，普通基线在总成功率上匹配或超越所有三种增强方法，同时通常使用更少总令牌。我们在WorkArena-L1上使用Qwen 3.6-27B观察到类似趋势，表明该效果扩展到企业知识工作任务。我们的结果表明，技能和工作流记忆在特定领域可能有用，但其表面收益在预算匹配的执行器面前往往消失。我们进一步表明，运行间方差显著影响结果，应作为在线Web代理的核心评估标准报告。

English abstract

Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor's inference cost. We study online augmentation, where this overhead is paid on every task, and re-evaluate its benefits under a fixed total inference budget. We compare AWM, ASI, and ReasoningBank with a token-matched vanilla baseline that uses the same budget for additional actor steps. Across three WebArena domains and three models, Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B, the vanilla baseline matches or surpasses all three augmentation methods in aggregate success rate while often using fewer total tokens. We observe a similar trend on WorkArena-L1 with Qwen 3.6-27B, indicating that the effect extends to enterprise knowledge-work tasks. Our results suggest that skills and workflow memory can be useful in specific domains, but their apparent gains often vanish against a budget-matched actor. We further show that run-to-run variance materially affects outcomes and should be reported as a core evaluation criterion for online web agents.

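2606.15152 2026-06-16 cs.CL New submission

Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

智能体能否读懂房间气氛？多模态模拟中的视觉社交智能基准测试

Shijun Wan, Xuehai Wu, Jiwen Zhang, Siyuan Wang, Zhongyu Wei

Affiliations: Fudan University; Shanghai Innovation Institute; The Chinese University of Hong Kong

AI summary: Proposes the AgentViSS benchmark, which evaluates the visual social intelligence of multimodal large language models through 240 scenarios and four role-level tasks, and finds that local role enactment is near saturation while interaction regulation remains difficult.

AI Chinese abstract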

社交互动依赖于语言和可见的社交信号，如面部表情、姿势、注视和情绪变化。然而，现有的社交智能体基准大多基于文本，很少测试多模态智能体是否能够利用视觉线索来指导互动。我们引入了AgentViSS，一个评估多模态社交模拟中视觉社交智能的基准。它包含240个场景、585个角色实例和2,340个角色任务实例，结合了对齐的文本-视觉证据、结构化角色档案和四个角色级任务：表情任务、特征任务、交互调控任务和交互结果任务。在口头视觉和直接视觉条件下评估七个最新的多模态大语言模型，揭示了局部角色扮演与交互管理之间的明显差距：角色特定的表情和冲突处理接近饱和，而交互调控和基于视觉的结果实现仍然困难得多。代码已发布在https://github.com/JunsWan/AgentViSS，数据集可在https://huggingface.co/datasets/JunsWan/AgentViSS获取。

English abstract

Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal agents can use visual cues to guide interaction. We introduce AgentViSS, a benchmark evaluating visual social intelligence in multimodal social simulation. It contains 240 scenarios, 585 role instances, and 2,340 role-task instances, combining aligned textual-visual evidence, structured role profiles, and four role-level tasks: expression task, characteristic task, interaction regulation task, and interaction outcome task. Evaluating seven recent MLLMs under verbalized-vision and direct-vision conditions reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain substantially more difficult. The code is released at https://github.com/JunsWan/AgentViSS, and the dataset is available at https://huggingface.co/datasets/JunsWan/AgentViSS.

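2606.15390 2026-06-16 cs.CL cs.AI cs.LG New submission

Not All Skills Help: Measuring and Repairing Agent Knowledge

并非所有技能都有用：测量与修复智能体知识

Yixuan Wang, Yiyang Zhou, Yiming Liang, Congyu Zhang, Fuxiao Liu, Jiawei Zhou, Huaxiu Yao

Affiliations: UNC Chapel Hill; Purdue; NVIDIA

AI summary: Proposes the ASSAY framework, which measures the causal contribution of each skill through randomized masking, separates skill generation from curation, and suppresses harmful skills at inference time, substantially improving LLM agent task completion rates.

Comments 18 pages, 5 figures

AI Chinese abstract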

LLM智能体可以通过从经验中积累自然语言技能来改进，而无需更新权重，但当前系统将所有关于保留哪些技能以及如何应用它们的决策完全交由LLM判断。我们认为这混淆了两个不同的角色：从经验中生成技能是判断擅长的创造性行为，而决定该技能是否真正有帮助则需要跨多个任务的实证证据。通过随机掩码测量每个技能的因果贡献，我们发现技能库表现出普遍的因果异质性：单个技能通常在某些任务类型上有帮助，但在其他任务类型上有害，然而它们的相反效应在总体上相互抵消，使得全局筛选方法无法察觉。我们提出ASSAY，一个将生成与筛选分离的框架：它在小型开发集上计算每个技能的因果归因，离线重组技能库，并为每个测试任务抑制预测效应为负的技能。在跨越四个提供商的七个基础模型以及两个基准（AppWorld和tau-bench）上，ASSAY始终优于先前的技能筛选方法。在AppWorld最难的数据划分上，DeepSeek-V3实现了69.3%的任务目标完成率（相对提升47.4%），在所有已发表方法（包括权重调整方法）中达到了新的最先进水平。在tau-bench零售领域，GPT-4.1相对提升8.7%，在公开排行榜上超越了o4-mini、o1和GPT-4.5，且无需任何权重修改。消融实验将主要收益归因于每任务掩码，证实瓶颈在于推理时将技能与任务匹配，而非全局移除不良技能。代码已开源：https://github.com/aiming-lab/assay。

English abstract

LLM agents can improve without weight updates by accumulating natural-language skills from experience, but current systems entrust every decision about which skills to keep and how to apply them to LLM judgment alone. We argue that this conflates two distinct roles: generating a skill from experience is a creative act that judgment handles well, while deciding whether that skill actually helps requires empirical evidence across many tasks. Measuring per-skill causal contributions via randomized masking, we find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, yet their opposing effects cancel in aggregate, making them invisible to global curation methods. We propose ASSAY, a framework that separates generation from curation: it computes a per-skill causal attribution on a small development set, restructures the library offline, and suppresses skills with negative predicted effect for each test task. Across seven base models spanning four providers and two benchmarks (AppWorld and tau-bench), ASSAY consistently improves over prior skill-curation approaches. On AppWorld's hardest split, DeepSeek-V3 achieves 69.3% task-goal completion (47.4% relative improvement), a new state of the art among all published methods including weight-tuned approaches. On tau-bench retail, GPT-4.1 improves by 8.7% relative, advancing past o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablation traces the dominant gain to per-task masking, confirming that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally. Code is available at https://github.com/aiming-lab/assay.
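The claim that opposing per-task effects cancel in aggregate can be illustrated with a toy randomized-masking experiment. Everything below is an assumption made for illustration: the skill names, task types, effect sizes, and the `run_task` stand-in are hypothetical, and the difference-of-means estimator is a generic choice rather than ASSAY's actual attribution procedure.

```python
import random

# Hypothetical true effect of each skill on the success score, per task type.
TRUE_EFFECT = {
    "retry_on_error": {"booking": +0.20, "refund": -0.20},
    "confirm_first":  {"booking": +0.10, "refund": +0.10},
}
BASE_RATE = 0.5

def run_task(task_type, enabled):
    """Toy stand-in for an agent run: returns an expected success score."""
    return BASE_RATE + sum(TRUE_EFFECT[s][task_type] for s in enabled)

def estimate_effects(task_types, n_trials=2000, seed=0):
    """Mask each skill independently at random; compare mean score on vs off."""
    rng = random.Random(seed)
    skills = list(TRUE_EFFECT)
    cells = {}  # (skill, task_type, skill_enabled) -> [score total, run count]
    for _ in range(n_trials):
        task = rng.choice(task_types)
        enabled = [s for s in skills if rng.random() < 0.5]
        score = run_task(task, enabled)
        for s in skills:
            cell = cells.setdefault((s, task, s in enabled), [0.0, 0])
            cell[0] += score
            cell[1] += 1
    effects = {}
    for s in skills:
        for t in task_types:
            on, off = cells[(s, t, True)], cells[(s, t, False)]
            effects[(s, t)] = on[0] / on[1] - off[0] / off[1]
    return effects

effects = estimate_effects(["booking", "refund"])
pooled = (effects[("retry_on_error", "booking")]
          + effects[("retry_on_error", "refund")]) / 2
for key, value in sorted(effects.items()):
    print(key, round(value, 3))
print("pooled effect of retry_on_error:", round(pooled, 3))
```

The pooled estimate for the toy `retry_on_error` skill sits near zero even though it clearly helps one task type and hurts the other, which is why a global keep-or-drop decision would miss it while per-task masking would not.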

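2606.15405 2026-06-16 cs.CL cs.AI New submission

T-Mem: Memory That Anticipates, Not Archives

T-Mem：预测而非归档的记忆

Weidong Guo, Dakai Wang, Zixuan Wang, Hui Liu, Yu Xu

Affiliations: Tencent

AI summary: Proposes the T-Mem architecture, which covers both descriptive and associative recall through write-time triggers, addresses retrieval of semantically associated memories in long conversations, and reaches the state of the art on LoCoMo and LoCoMo-Plus.

AI Chinese abstract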

长期记忆对于对话代理在扩展对话中保持连贯性、遵循多个会话前做出的承诺以及根据每个用户调整行为至关重要。然而，当前基于LLM的长期对话记忆受限于查询与存储内容（包括词汇和稠密向量）之间的相似性。当查询和记忆共享表面特征（如措辞或命名实体，我们称之为描述性）时，该方法有效。但它忽略了另一类同样有价值的案例，即查询和记忆不共享表面特征，仅通过潜在语义弧（关联性）相连。在这种机制下，现有的长期记忆系统普遍失败。覆盖这另一半使得助手首次能够主动将过去的对话作为语义资产。在记忆方面，这是认知科学中称为情景未来思维的工程对应物：预演过去的经验，以便在未来需要找到它的上下文中使用。我们将这些写时预演称为触发器。我们提出T-Mem，这是第一个覆盖描述性和关联性回忆的长期对话记忆架构。在两种证据粒度（单个事实和完整交流）上，T-Mem实例化一个描述性触发器家族和一个关联性触发器家族，使得每个记忆都能从表面相似和相关性约束的查询中访问。作为实证验证，T-Mem在LoCoMo和LoCoMo-Plus上达到了最先进水平。

English abstract

Long-term memory is essential for conversational agents to remain coherent across extended dialogues, follow through on commitments made many sessions earlier, and adapt their behaviour to each user. Current LLM-backed long-term conversational memory, however, is reachability-bounded by the similarity between a query and stored content, both lexical and dense-vector. The approach is effective when query and memory share surface features such as wording or named entities (we call this descriptive). But it misses another, equally valuable class of cases, where query and memory do not share surface features and are tied only by a latent semantic arc (associative). In this regime, prevailing long-term memory systems collectively fail. Covering this other half is what allows an assistant, for the first time, to actively draw on past dialogue as a semantic asset. On the memory side, this is the engineering counterpart of what cognitive science calls episodic future thinking: rehearsing past experience for the future contexts under which it will need to be found. We call these write-time rehearsals triggers. We propose T-Mem, the first long-term conversational memory architecture that covers both descriptive and associative recall. At each of two evidence granularities, single facts and full exchanges, T-Mem instantiates one descriptive trigger family and one associative trigger family, so that every memory remains reachable from both surface-similar and relevance-bound queries. As empirical validation, T-Mem reaches state-of-the-art results on both LoCoMo and LoCoMo-Plus.

2606.15422 2026-06-16 cs.CL q-bio.BM New submission

Pepti-Agent: An AI Agent for Peptide Design and Optimization

Pepti-Agent: 一种用于肽设计与优化的人工智能代理

Houxu Chen, Achuth Chandrasekhar, Amir Barati Farimani

Affiliations: Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA

AI summary: Proposes Pepti-Agent, a closed-loop peptide design framework built on the Model Context Protocol (MCP) that combines independently inspectable generation, prediction, and mutation tools with a large language model controller and live property prediction, enabling multi-objective optimization and reproducible benchmarking.

AI Chinese abstract

治疗性肽占据小分子和生物制剂之间有价值的设计空间，但它们的开发需要同时满足几个相互竞争的约束：溶解度、溶血活性和非特异性表面污染由重叠的序列特征控制，因此改善一个属性往往会降低另一个属性。计算设计通过将生成模型与基于序列的属性预测器配对，迭代地提出和优化候选物来解决这一问题。然而，这些组件通常被连接成难以检查、扩展或重用的整体脚本，并且它们通常通过自然语言推理而不是跟踪每个候选物不断变化的多属性状态来优化序列。我们提出了Pepti-Agent，一个闭环的、肽特异性的框架，它将生成、属性预测和单残基突变暴露为可独立检查的模型上下文协议（MCP）工具。一个大语言模型控制器调用这些工具，并在调用之间查阅实时的预测器输出，因此优化由每个序列当前的属性概况指导，而不是仅由语言推理指导。任务特异性的PeptideGPT模型生成候选物，基于ProtBERT的分类器对溶解度、溶血和非污染进行评分，两个可互换的突变算子提出序列编辑。通过记录控制器决策、预测器输出和接受突变的每一步迹，Pepti-Agent为多目标设计策略的基准测试和为实验验证优先排序候选物提供了可复现的基础。

English abstract

Therapeutic peptides occupy a valuable design space between small molecules and biologics, but their development requires satisfying several competing constraints at once: solubility, hemolytic activity, and nonspecific surface fouling are governed by overlapping sequence features, so improving one property often degrades another. Computational design addresses this by pairing generative models with sequence-based property predictors, iteratively proposing and refining candidates. However, these components are typically wired together as monolithic scripts that are difficult to inspect, extend, or reuse, and they often refine sequences by natural-language reasoning rather than by tracking the evolving multi-property state of each candidate. We present Pepti-Agent, a closed-loop, peptide-specific framework that exposes generation, property prediction, and single-residue mutation as independently inspectable Model Context Protocol (MCP) tools. A large language model controller invokes these tools and consults live predictor output between calls, so refinement is guided by each sequence's current property profile rather than by language reasoning alone. Task-specific PeptideGPT models generate candidates, ProtBERT-based classifiers score solubility, hemolysis, and non-fouling, and two interchangeable mutation operators propose sequence edits. By recording a per-step trace of controller decisions, predictor outputs, and accepted mutations, Pepti-Agent offers a reproducible substrate for benchmarking multi-objective design strategies and for prioritizing candidates for experimental validation.


2606.15532 2026-06-16 cs.CL cs.LG 新提交

EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management

EIBench: 基于模拟器的基准测试和用于情绪管理的回合信用强化学习

Rongzhi Zhu, Xiang Huang, Yuchuan Wu, Rui Wang, Zequn Sun, Tao Ren, Weiyao Luo, Bingxue Qiu, Jieping Ye, Yongbin Li, Wei Hu

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University（南京大学计算机软件新技术国家重点实验室）； Qwen-Character Team, Alibaba Group（阿里巴巴集团Qwen-Character团队）

AI总结提出EIBench模拟器基准，包含2222个场景，通过2x2分类（支持、防御、修复、魅力）评估多轮情绪管理；并设计CTC-GRPO方法利用逐轮状态更新作为密集反馈，提升模型情绪智能。


AI中文摘要

大型语言模型（LLM）的情绪智能（EI）通常通过静态理解任务或单轮对话生成来评估。然而，情绪管理是交互式的：一个好的模型不仅应识别用户的情绪，还应在多轮对话中改善用户的情绪和关系状态。我们引入了EIBench，一个基于模拟器的交互式情绪管理基准。EIBench包含2222个场景，其中2009个用于训练，213个用于保留测试。场景按2x2分类法组织，涵盖支持、防御、修复和魅力，分别对应不同形式的支持、边界维护、信任修复和融洽关系建立。在每个场景中，LLM模拟器扮演用户，每轮后更新情绪-关系状态，并将最终状态映射到基于锚点的分数。这一设计使EIBench既是一个评估基准，也是一个训练环境：最终状态提供结果奖励，而逐轮状态更新为强化学习提供密集反馈。我们评估了15个开源和闭源LLM。当前模型在支持和融洽关系建立场景中表现良好，但在用户压力下的边界维护方面存在困难。为了提升LLM的情绪智能能力，我们提出了中心化回合信用GRPO（CTC-GRPO），这是GRPO的一个扩展，它重用模拟器的逐轮状态更新作为密集的回合级反馈，同时保留最终结果奖励。CTC-GRPO将Qwen3-8B在EIBench上的得分从-22.4提升至+22.4，并在分布外评估（包括SAGE +12.4和EQBench3 +20.9%）中也有所提升。我们的结果表明，模拟器追踪的用户状态可以支持多轮情绪管理的评估和训练。

英文摘要

Emotional intelligence (EI) in Large Language Models (LLMs) is often evaluated through static understanding tasks or single-response dialogue generation. However, emotion management is interactive: a good model should not only recognize a user's emotion, but also improve the user's emotional and relational state over several turns. We introduce EIBench, a simulator-based benchmark for interactive emotion management. EIBench contains 2,222 scenarios, with 2,009 for training and 213 for held-out testing. The scenarios are organized by a 2x2 taxonomy covering Support, Defense, Repair, and Charm, which together capture different forms of support, boundary maintenance, trust repair, and rapport building. In each scenario, an LLM simulator plays the user, updates an emotion-relation state after each turn, and maps the final state to an anchor-based score. This design makes EIBench both an evaluation benchmark and a training environment: the final state gives the outcome reward, while the per-turn state updates provide dense feedback for RL. We evaluate 15 open- and closed-source LLMs. Current models perform well on support and rapport-building scenes, but struggle with boundary maintenance under user pressure. To improve the EI ability of LLMs, we propose Centered Turn-Credit GRPO (CTC-GRPO), a GRPO extension that reuses the simulator's per-turn state updates as dense turn-level feedback while preserving the final outcome reward. CTC-GRPO improves Qwen3-8B from -22.4 to +22.4 on EIBench and also improves on out-of-distribution evaluations including SAGE (+12.4) and EQBench3 (+20.9%). Our results show that simulator-tracked user states can support both evaluation and training for multi-turn emotion management.


2606.15911 2026-06-16 cs.CL cs.IR 新提交

LectūraAgents：面向自适应个性化AI辅助学习与具身教学的多智能体框架

Jaward Sesay, Yue Yu, Siwei Dong, Yemin Shi, Guangyao Chen, Börje F. Karlsson

发表机构 * Beijing Institute of Technology（北京理工大学）； Peking University（北京大学）； Cornell University（康奈尔大学）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）

AI总结提出LectūraAgents多智能体框架，通过层次化架构和自适应具身教学机制（如手势、高亮等）实现端到端个性化学习，并设计教学动作-语音对齐算法提升连贯性，在多个课程级别上优于现有方法。


AI中文摘要

有效的个性化AI辅助学习需要系统不仅能够生成准确的、针对学习者的教育材料，还能动态调整其教学方式以适应不同学习者。然而，现有的教育智能体主要关注讲座内容自动化和模拟，往往缺乏针对个体学习者的多模态和具身教学方法的建模。为此，我们提出LectūraAgents——一个多智能体框架，通过端到端的自适应具身教学实现个性化学习。其核心模拟了教授-学生关系，其中ProfessorAgent领导一个由专业下属智能体组成的协作团队，通过研究、规划、审查和具身交付适应学习者需求的讲座内容。该框架有三个主要贡献：（1）用于端到端个性化学习的层次化多智能体架构；（2）自适应具身教学机制，其中ProfessorAgent在教学环境中对内容执行可见且具有教学动机的教学动作（例如手写、高亮、下划线等）；（3）教学动作-语音对齐（TASA）算法，该算法采用基于显著性的启发式和时序语义分割，生成与学习者档案对齐的连贯教学动作序列。我们在高中、本科和研究生级别的多样化课程上，使用基于样本特定量规的分析评估LectūraAgents；生成的讲座材料和教学动作由专家教育者评估和验证。实验结果显示，在讲座内容质量、具身教学质量、评估和个性化方面，LectūraAgents持续优于现有方法，使其成为大规模个性化学习的教学基础扎实的框架。

英文摘要

Effective personalized AI-assisted learning demands systems that can not only generate accurate learner-specific educational materials, but also dynamically adapt their instruction to diverse learners. However, existing educational agents have primarily focused on lecture content automation and simulations, which often fall short of modelling multimodal and embodied instructional methods tailored for the individual learner. To this end, we propose LectūraAgents - a multi-agent framework that enables personalized learning through end-to-end adaptive embodied teaching. At its core, LectūraAgents mirrors a professor-student relationship, in which a ProfessorAgent leads a collaborative team of specialized subordinate agents through research, planning, review, and embodied delivery of lecture contents that adapt to a learner's needs. The framework offers three main contributions: (1) a hierarchical multi-agent architecture for end-to-end personalized learning; (2) an adaptive embodied teaching mechanism, wherein the ProfessorAgent executes visible and pedagogically motivated teaching actions (e.g., handwrite, highlight, underline, etc.) over contents in a teaching environment; and (3) a Teaching Action-Speech Alignment (TASA) algorithm that employs salience-based heuristics and temporal semantic segmentation to generate coherent teaching action sequences aligned with learner profiles. We evaluate LectūraAgents on diverse courses at high school, undergraduate, and graduate levels using sample-specific rubric-based analysis; with generated lecture materials and teaching actions assessed and validated by expert educators. Experimental results show consistent gains in lecture content quality, embodied teaching quality, assessment, and personalization over existing approaches, positioning LectūraAgents as a pedagogically well-grounded framework for personalized learning at scale.


2606.16432 2026-06-16 cs.CL cs.AI 新提交

ACCORD: Action-Conditioned Contextual Grounding for Language Agents

ACCORD: 面向语言智能体的动作条件上下文接地

Lai Jiang, Cheng Qian, Zhenhailong Wang, Pan Lu, Heng Ji, Hao Peng

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Stanford University（斯坦福大学）

AI总结针对用户指令常因隐含环境假设而欠指定，导致LLM智能体执行失败的问题，提出ACCORD框架，在每次动作前主动探测缺失信息并整合轨迹上下文，无需额外训练，在AppWorld和AlfWorld上显著提升任务完成率。


DEEPRUBRIC: 基于证据树规则监督的高效深度研究智能体强化学习

Minghang Zhu, Chuyang Wei, Junhao Xu, Yilin Cheng, Zhumin Chen, Jiyan He

发表机构 * Shandong University（山东大学）； Zhongguancun Academy（中关村学院）； Fudan University（复旦大学）

AI总结提出DeepRubric框架，通过构建证据树生成查询-规则对，确保奖励信号准确评估查询所需信息，以13倍少的RL GPU时间达到与先前最优模型相当的性能。


AI中文摘要

深度研究智能体通过搜索和推理检索到的证据来综合长篇报告。基于规则的奖励强化学习通过优化智能体以符合可检查的标准（这些标准将报告质量转化为奖励信号）来改进这些智能体，但其效率取决于这些标准是否可靠地捕捉任务范围和证据需求。大多数现有研究要求LLM为给定查询生成规则，但当模型无法推断潜在信息需求时，生成的规则可能不完整，从而降低RL效率。为了获得更可靠的查询-规则监督，我们引入了DeepRubric，一个反向这一过程的数据构建框架：它首先确定基于证据的报告应该评估什么，然后从这些评估目标中合成对齐的查询-规则对，而不是为给定查询推断评估标准。从采样的种子主题开始，DeepRubric通过递归扩展有证据支持的子问题构建证据树，其叶子节点作为原子且可验证的评估目标。然后，它使用证据树合成训练查询和规则，确保奖励准确评估查询所请求的信息。使用DeepRubric，我们构建了9K个查询-规则监督示例，并使用基于规则的GRPO训练了DeepRubric-8B，在三个基准测试中实现了与先前开源最先进深度研究模型相当的性能，而RL GPU时间减少了约13倍。

英文摘要

Deep research agents synthesize long-form reports by searching and reasoning over retrieved evidence. Reinforcement learning with rubric-based rewards improves these agents by optimizing them against checkable criteria that translate report quality into reward signals, but its efficiency depends on whether those criteria reliably capture the task scope and evidence needs. Most existing studies ask an LLM to generate rubrics for a given query, but when the model fails to infer the underlying information needs, the generated rubrics may be incomplete and reduce RL efficiency. To obtain more reliable query--rubric supervision, we introduce DeepRubric, a data construction framework that reverses this process: instead of inferring evaluation criteria for a given query, it first determines what an evidence-backed report should be evaluated on and then synthesizes aligned query--rubric pairs from those evaluation targets. Starting from a sampled seed topic, DeepRubric builds an evidence tree by recursively expanding evidence-backed sub-questions, whose leaves serve as atomic and verifiable evaluation targets. It then uses the evidence tree to synthesize the training query and rubrics, ensuring that the reward evaluates exactly the information requested by the query. Using DeepRubric, we construct 9K query--rubric supervision examples and train DeepRubric-8B with rubric-based GRPO, achieving comparable performance to prior open state-of-the-art deep research models across three benchmarks with roughly 13x fewer RL GPU-hours.


2606.15033 2026-06-16 cs.HC cs.CL cs.CY 交叉投稿

Cloze: An Open Research Platform for Studying Human-AI Conversations in Mental Health Contexts

Cloze：一个用于研究心理健康背景下人机对话的开放研究平台

Matthew Flathers, Francesco Cipriani, John Torous

发表机构 * Beth Israel Deaconess Medical Center（贝斯以色列女执事医疗中心）； University College London（伦敦大学学院）； Division of Digital Psychiatry（数字精神病学部）

AI总结提出开源平台Cloze，支持在心理健康研究中控制、监控人机对话，统一配置模型、指令、安全约束并记录完整溯源，为建立人机交互证据基础提供研究基础设施。

Comments 7 pages, 2 figures. Cloze is released under AGPL-3.0


AI中文摘要

Cloze是一个开源网络平台，用于在心理健康研究背景下进行受控、受监测的人机对话研究。消费者大语言模型（LLM）产品如ChatGPT、Claude和Gemini是为个人生产力而构建的，为研究人员提供的实验控制很少，数据导出不一致，并且没有跨提供商的共享安全框架。Cloze为研究团队提供了一个单一环境，在其中他们配置参与者与哪些模型对话、AI如何被指示、对话如何随时间安排以及哪些安全约束无条件适用，同时每条消息都带有完整的溯源信息（模型版本、提示配置、时间）。该平台目前支持OpenAI、Anthropic、Google以及通过Ollama在统一接口后提供的本地托管开放权重模型，并可在云端或完全本地运行，以便参与者数据无需离开机构。Cloze是为在心理健康背景下建立人机交互证据基础而研究的基础设施。它不是治疗产品。

英文摘要

Cloze is an open-source web platform for conducting controlled, monitored studies of human-AI conversation in mental health research contexts. Consumer large language model (LLM) products such as ChatGPT, Claude, and Gemini are built for individual productivity, and offer researchers little experimental control, inconsistent data export, and no shared safety scaffolding that holds across providers. Cloze gives research teams a single environment in which they configure which models participants converse with, how the AI is instructed, how conversations are scheduled over time, and which safety constraints apply unconditionally, while every message is captured with full provenance (model version, prompt configuration, timing). The platform currently supports OpenAI, Anthropic, Google, and locally hosted open-weight models served through Ollama behind a unified interface, and runs in the cloud or fully on premises so that participant data need never leave an institution. Cloze is research infrastructure for building an evidence base on human-AI interaction in mental health contexts. It is not a therapeutic product.


2606.15367 2026-06-16 cs.AI cs.CL cs.IR cs.LG 交叉投稿

构建面向1亿用户规模的客户支持AI代理：一种评估驱动的框架

Aman Gupta, Kevin Rossell, Edesio Alcobaça, Jose Chrystian Lima Pacheco, Carolina Baptista de Lima, Shao Tang, Luiz Paulo Rabachini, Luis Moneda, Herbert Fei, Daniel Silva, Rohan Ramanath

发表机构 * Nubank

AI总结提出一个统一框架，通过评估驱动开发、上下文工程、人工循环提示迭代和LLM评判一致性优化，在Nubank的100M+用户规模下实现客户支持AI代理的离线开发与在线效果桥接，并在五个生产部署中验证了离线指标与在线结果的高度相关性。

Comments 12 pages. Accepted to KDD '26 (32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining)


DOI: 10.1145/3770855.3818332

AI中文摘要

LLM能力的快速提升使得AI代理在广泛任务中越来越可行。其中最有前景的应用之一是构建生产就绪的面向客户代理，这一挑战需要在评估方法论、上下文工程、训练和在线测量方面协调卓越。然而，这些关键支柱通常是孤立开发的，导致只有在部署后才会暴露的盲点。在本文中，我们提出了一个统一框架，将离线开发与在线影响桥接起来，应用于Nubank（一家拥有1亿+用户的公司）的客户支持AI代理。我们的方法整合了几个关键组件：(1) 针对客户支持代理定制的结构化上下文工程，(2) 系统化的人工在环提示迭代，(3) 具有测量评估者间一致性和GEPA优化一致性的严格LLM评判评估，以及(4) 从构思到生产的验证。一个核心见解是评估管道质量直接决定迭代速度。我们展示了跨越不同领域的五个生产部署的结果：卡片递送、债务管理、信用额度支持、卡片管理和产品解释。这些部署在显著加速迭代的同时，带来了持续的客户满意度提升。在我们的卡片递送部署中，大规模A/B测试显示，与之前的代理变体相比，AI交易净推荐值提高了37个百分点，自助服务率提高了29个百分点，同时离线模拟指标与在线结果之间存在强相关性，表明评估驱动开发可靠地预测了生产影响。在大多数用例中，AI满意度达到了与专家人类代理相差几个百分点的水平。

英文摘要

The rapid rise in LLM capabilities has made AI agents increasingly viable across a broad range of tasks. Among the most promising applications is building production-ready customer-facing agents, a challenge that demands coordinated excellence in evaluation methodology, context engineering, training, and online measurement. Yet these critical pillars are typically developed in isolation, creating blind spots that only surface after deployment. In this paper, we present a unified framework that bridges offline development with online impact for customer support AI agents at Nubank, a company with 100M+ users. Our approach integrates several key components: (1) structured context engineering tailored to customer support agents, (2) systematic human-in-the-loop prompt iteration, (3) rigorous LLM judge evaluation with measured inter-rater agreement and GEPA optimization for consistency, and (4) ideation-to-production validation. A central insight is that evaluation-pipeline quality directly determines iteration velocity. We present results from five production deployments spanning distinct domains: card delivery, debt management, credit-limit support, card management, and product explanation. These deployments deliver consistent customer-satisfaction gains while substantially accelerating iteration. In our card-delivery deployment, large-scale A/B testing yields a 37 percentage-point improvement in AI transactional Net Promoter Score and a 29 percentage-point gain in self-service rate over prior agent variants, alongside a strong correlation between offline simulation metrics and online outcomes, demonstrating that eval-driven development reliably predicts production impact. On most use cases, AI satisfaction reaches within a few percentage points of expert human agents.
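The abstract reports its A/B result as a percentage-point change in Net Promoter Score. As a minimal sketch of that arithmetic, the snippet below computes NPS under the conventional 0-10 definition (share of promoters rated 9-10 minus share of detractors rated 0-6) and the point difference between two arms. Whether the paper's "AI transactional NPS" uses exactly this definition is an assumption, and the ratings are made-up illustration data, not results from the paper.

```python
def net_promoter_score(ratings):
    """Conventional NPS on a 0-10 scale: % promoters (9-10) minus % detractors (0-6)."""
    if not ratings:
        raise ValueError("ratings must be non-empty")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


def percentage_point_gain(before, after):
    """Absolute difference between two scores already expressed in percent."""
    return after - before


# Hypothetical survey responses for two A/B arms (illustration only).
control = [10, 9, 8, 6, 3, 7, 9, 2, 10, 5]
treatment = [10, 9, 9, 8, 6, 9, 10, 7, 10, 9]

gain = percentage_point_gain(
    net_promoter_score(control), net_promoter_score(treatment)
)
print(f"NPS gain: {gain:.1f} percentage points")
```

A percentage-point gain is an absolute difference between two scores, so it should not be read as a relative improvement over the baseline.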


2606.11520 2026-06-16 cs.CL cs.AI cs.LG 版本更新

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

ISE：一种基于执行的多轮操作系统代理轨迹合成方法

Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu

发表机构 * University of Electronic Science and Technology of China（电子科技大学）； SenseTime Research（商汤科技研究院）

AI总结提出ISE三阶段范式，通过结构化意图构建、角色锁定用户模拟和真实执行环境，生成多轮代理轨迹，微调后显著提升代理工具使用性能。

Comments 13 pages, 6 figures. Dataset and code: https://github.com/Valiere01/ISE-Trace


AI中文摘要

训练有能力的操作系统代理需要同时捕获结构化用户意图、多轮任务委派和基于工具执行的数据——这些属性在现有数据集中缺失。我们提出ISE（意图->模拟->执行），一种三阶段合成范式，联合解决这些差距。阶段1通过4D框架（人物角色x领域x任务x复杂度）构建约50000个结构化意图；去重后池中包含43956个唯一意图，并在mpnet-base-v2嵌入（余弦核，q=1）上获得61.57的Vendi分数。阶段2通过角色锁定的用户模拟器驱动多轮用户-代理交互，将每轮用户交互基于实际执行结果，生成23132条完整轨迹，平均8.12轮用户交互和68.24轮总对话。阶段3在实时、隔离的操作系统工作空间中执行每个工具调用，生成真实的故障恢复动态而非模拟响应。在ISETrace上微调后，使用Qwen3-8B在标准协议下的代理工具使用任务中，ClawEval pass@1从19.3提升至37.7。该结果优于零样本GPT-4o和四倍大的Qwen3-32B基础模型。对阶段2的消融实验证明多轮模拟带来了大部分性能提升。我们在https://github.com/Valiere01/ISE-Trace发布所有源代码和数据集。

英文摘要

Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution, properties absent from existing datasets. We propose ISE (Intent -> Simulate -> Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50000 structured intents via a 4D framework (Persona x Domain x Task x Complexity); after deduplication the pool contains 43956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning on ISETrace improves ClawEval pass@1 from 19.3 to 37.7 using Qwen3-8B on agent tool-use tasks with a standard protocol. This result outperforms zero-shot GPT-4o and the Qwen3-32B base model, which is four times larger. An ablation on Stage 2 shows that multi-turn simulation accounts for a large portion of the performance gain. We release all source code and the dataset at https://github.com/Valiere01/ISE-Trace.
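For readers unfamiliar with the diversity metric quoted above, the order-1 Vendi Score is conventionally defined as the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix. The block below states that standard definition for a cosine kernel. The symbols $x_i$, $K$, and $\lambda_i$ are notation introduced here for illustration, and it is an assumption that the paper follows this definition without modification.

```latex
% n items with embeddings x_1, ..., x_n; cosine kernel with unit diagonal
K_{ij} = \frac{x_i^\top x_j}{\lVert x_i \rVert \, \lVert x_j \rVert},
\qquad
\lambda_1, \dots, \lambda_n = \text{eigenvalues of } K / n,
\qquad
\mathrm{VS}_{q=1}(K) = \exp\!\Big( -\sum_{i=1}^{n} \lambda_i \log \lambda_i \Big),
\quad \text{with } 0 \log 0 := 0 .
```

Because $K$ has a unit diagonal, the eigenvalues of $K/n$ sum to 1, so the score ranges from 1 (all items identical) to $n$ (all items mutually orthogonal) and can be read as an effective number of distinct items.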


2602.05060 2026-06-16 cs.LG cs.CL 版本更新

StagePilot: Stage-Level Planning for Long-Horizon Dialogue Simulation in Cybergrooming

StagePilot: 网络诱骗中长程对话模拟的阶段级规划

Heajun An, Qi Zhang, Minqian Liu, Xinyi Zhang, Sang Won Lee, Lifu Huang, Pamela J. Wisniewski, Jin-Hee Cho

发表机构 * Virginia Tech（弗吉尼亚理工大学）； University of California, Davis（加州大学戴维斯分校）； International Computer Science Institute（国际计算机科学研究所）

AI总结提出StagePilot框架，通过分离阶段级规划与响应生成，结合强化学习学习阶段策略，实现网络诱骗对话的结构化、连贯模拟，相比基线减少对话停滞，IQL+AWAC变体最终阶段到达率提升43%。

Comments Accepted at the 27th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2026)


AI中文摘要

网络诱骗是对青少年的一种不断演变的威胁，需要主动的教育干预。我们通过将对话进展建模为阶段式交互上的结构化规划问题来解决这一问题。我们提出StagePilot，一个将阶段级规划与响应生成分离的对话框架，其中模型在受约束的转换下选择下一阶段，并基于该阶段生成响应，从而实现连贯且逼真的进展。使用强化学习从离线数据中学习阶段级策略，优化情感对齐和目标一致进展。我们的实证实验表明，与基线相比，StagePilot生成更结构化、更连贯的对话轨迹，并减少对话停滞；值得注意的是，IQL+AWAC变体更频繁地到达最终阶段，同时保持超过70%的正面或中性响应，实现了43%的相对改进。

英文摘要

Cybergrooming is an evolving threat to youth, requiring proactive educational interventions. We address this by modeling dialogue progression as a structured planning problem over stage-wise interactions. We propose StagePilot, a dialogue framework that separates stage-level planning from response generation, in which the model selects the next stage under constrained transitions and generates responses conditioned on it, enabling coherent and realistic progression. Reinforcement learning is used to learn stage-level policies from offline data, optimizing for both emotional alignment and goal-consistent progression. Our empirical experiments show that StagePilot generates more structured, coherent dialogue trajectories and reduces conversational stagnation compared to baselines; notably, the IQL+AWAC variant reaches the final stage more often while maintaining over 70% positive or neutral responses, yielding a 43% relative improvement.


2605.01101 2026-06-16 cs.AI cs.CL cs.SD eess.AS 版本更新

Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy

虚拟言语治疗师：一种临床医生参与的AI言语治疗代理，用于个性化和监督式治疗

Shakeel Sheikh, Patrick Marmaroli, MD Sahidullah, Slim Ouni, Fabrice Hirsch, Goncalo Leal, Bjorn W Schuller

发表机构 * The Kashmir Hub for Artificial Intelligence（克什米尔人工智能中心）； Microsoft / Vocametrix（微软 / Vocametrix）； IAI, TCG CREST（IAI，TCG CREST）； Université de Lorraine, CNRS, Inria, LORIA（洛林大学，CNRS，Inria，LORIA）； Laboratoire Praxiling, UMR5267, CNRS et Université Paul-Valéry Montpellier 3（Praxiling实验室，UMR5267，CNRS及蒙彼利埃Paul-Valéry大学）； Speechcare iStutter, Portuguese Catholic University（Speechcare iStutter，葡萄牙天主教大学）； CHI – Chair of Health Informatics, TUM University Hospital（健康信息学系，TUM大学医院）； GLAM – Group on Language, Audio, & Music, Imperial College London（语言、音频与音乐小组，伦敦帝国理工学院）

AI总结提出虚拟言语治疗师（VST）平台，集成深度学习口吃分类与多智能体大语言模型推理，自动生成个性化治疗方案，并通过临床医生反馈优化，实验证明其高质量推荐。

Comments Under Review


AI中文摘要

本文开发了虚拟言语治疗师（VST），这是一个基于智能体的平台，通过自动化和自适应的AI驱动工作流程，简化口吃评估并提供定制化的治疗计划。VST集成了最先进的基于深度学习的口吃分类和多智能体大语言模型（LLM）推理，以支持循证临床决策。VST首先获取并提取患者语音样本的特征，然后对口吃类型进行稳健分类。基于这些输出，VST启动一个智能体推理过程，其中专门的LLM智能体自主生成、批评并迭代优化个性化治疗计划。一个专门的批评智能体评估所有生成的治疗计划，以确保临床安全性、方法学合理性，并与同行评审的证据和既定专业指南保持一致。最终输出是一个全面的、针对患者的治疗草案，供临床医生审查。系统结合临床医生的反馈，生成最终的治疗计划，适用于患者交付，从而保持临床医生参与的范式。由专家言语治疗师进行的实验评估证实，VST持续生成高质量、基于证据的治疗建议。这些发现表明该系统具有增强临床工作流程、减轻临床医生负担并改善言语障碍患者治疗效果的潜力。所提出系统的交互式用户界面可在以下网址在线获取：https://vocametrix.com/ai/stuttering-therapy-planning-agent ，支持实时口吃评估和个性化治疗计划。

英文摘要

This paper develops Virtual Speech Therapist (VST), an intelligent agent-based platform that streamlines stuttering assessment and delivers customized therapy planning through automated and adaptive AI-driven workflows. VST integrates state-of-the-art deep learning-based stuttering classification, and multi-agent large language model (LLM) reasoning to support evidence-based clinical decision-making. The VST begins with the acquisition and feature extraction of patient speech samples, followed by robust classification of stuttering types. Building on these outputs, VST initiates an agentic reasoning process in which specialized LLM agents autonomously generate, critique, and iteratively refine individualized therapy plans. A dedicated critic agent evaluates all generated therapy plans to ensure clinical safety, methodological soundness, and alignment with peer-reviewed evidence and established professional guidelines. The resulting output is a comprehensive, patient-specific therapy draft intended for clinician review. Incorporating clinician feedback, the system then produces a finalized therapy plan suitable for patient delivery, thereby maintaining a clinician-in-the-loop paradigm. Experimental evaluation by expert speech therapists confirms that VST consistently generates high-quality, evidence-based therapy recommendations. These findings demonstrate the system's potential to augment clinical workflows, reduce clinician burden, and improve therapeutic outcomes for individuals with speech impairments. An interactive user interface for the proposed system is available online at: https://vocametrix.com/ai/stuttering-therapy-planning-agent , facilitating real-time stuttering assessment and personalized therapy planning.


2605.05855 2026-06-16 cs.IR cs.CL 版本更新

Bridging Passive and Active: Enhancing Conversation Starter Recommendation via Active Expression Modeling

桥接被动与主动：通过主动表达建模增强对话启动推荐

Yiqing Wu, Haoming Li, Guanyu Jiang, Jiahao Liang, Yongchun Zhu, Jingwu Chen, Feng Zhang

发表机构 * Bytedance Beijing China（字节跳动北京中国）

AI总结针对LLM驱动的对话搜索中被动推荐陷入回声室的问题，提出PA-Bridge框架，通过对抗分布对齐器桥接被动推荐与主动表达之间的分布差异，并引入语义离散化器实现流行度去偏，在线实验显著提升特征渗透率和用户活跃天数。

Comments Accepted by SIGIR 2026


AI中文摘要

大型语言模型（LLM）驱动的对话搜索正在将信息检索从被动关键词匹配转变为主动、开放式的对话。在此背景下，对话启动器被广泛部署，以提供个性化查询推荐，帮助用户发起对话。传统上，推荐这些启动器依赖于一个封闭的“曝光-点击”循环。然而，这种反馈循环机制使系统陷入回声室，加上数据稀疏性，无法捕捉由开放世界塑造的对话搜索意图的动态特性。结果，系统偏向于流行但通用的建议。在这项工作中，我们揭示了一个未被利用的范式转变，以打破这种有害的反馈循环：通过用户的主动表达来利用用户的“自由意志”。与传统推荐不同，对话搜索使用户能够通过手动输入查询完全绕过菜单。主动查询中的开放世界意图是打破这一循环的关键。然而，整合它们并非易事：（1）主动查询与制定的启动器之间存在固有的分布偏移。（2）此外，开放文本的“非ID化”特性使得传统的基于项目的流行度统计在大规模工业流式训练中无效。为此，我们提出了被动-主动桥接（PA-Bridge），一种新颖的框架，采用对抗分布对齐器来桥接被动推荐的启动器与主动表达之间的分布差距。此外，我们引入了一个语义离散化器，以实现流行度去偏算法的部署。在我们平台上的在线A/B测试表明，PA-Bridge显著提升了特征渗透率0.54%和用户活跃天数0.04%。

英文摘要

Large Language Model (LLM)-driven conversational search is shifting information retrieval from reactive keyword matching to proactive, open-ended dialogues. In this context, Conversation Starters are widely deployed to provide personalized query recommendations that help users initiate dialogues. Conventionally, recommending these starters relies on a closed "exposure-click" loop. Yet, this feedback loop mechanism traps the system in an echo chamber where, compounded by data sparsity, it fails to capture the dynamic nature of conversational search intents shaped by the open world. As a result, the system skews towards popular but generic suggestions. In this work, we uncover an untapped paradigm shift to shatter this harmful feedback loop: harnessing user "free will" through active user expressions. Unlike traditional recommendations, conversational search empowers users to bypass menus entirely through manually typed queries. The open-world intents in active queries hold the key to breaking this loop. However, incorporating them is non-trivial: (1) there exists an inherent distribution shift between active queries and formulated starters; (2) the "non-ID-able" nature of open text renders traditional item-based popularity statistics ineffective for large-scale industrial streaming training. To this end, we propose Passive-Active Bridge (PA-Bridge), a novel framework that employs an adversarial distribution aligner to bridge the distributional gap between passively recommended starters and active expressions. Moreover, we introduce a semantic discretizer to enable the deployment of popularity debiasing algorithms. Online A/B tests on our platform demonstrate that PA-Bridge significantly boosts the Feature Penetration Rate by 0.54% and User Active Days by 0.04%.


2605.29796 2026-06-16 cs.AI cs.CL cs.LG 版本更新

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

SAAS：面向智能体搜索中过度搜索缓解的自我感知强化学习

Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su

发表机构 * School of Informatics, Xiamen University（厦门大学信息学院）； School of Artificial Intelligence, Jilin University（吉林大学人工智能学院）

AI总结提出SAAS强化学习框架，通过搜索边界建模、边界感知奖励和分阶段优化策略，使LLM智能体具备动态自我感知能力，在不降低准确率的前提下显著减少过度搜索。


AI中文摘要

智能体搜索使LLM能够通过迭代推理和外部搜索解决复杂的多跳问题。尽管有效，但这些系统在实践中常受限于一个关键缺陷：智能体无法识别自身知识边界，在内部知识足够时盲目触发搜索，甚至在已收集足够证据时未能终止搜索。缺乏自我感知导致严重的过度搜索，带来大量推理延迟和过高的计算成本。为此，我们提出SAAS，一种新颖的强化学习框架，旨在培养动态自我感知能力，精确调节搜索行为而不损害准确性。SAAS引入三个关键组件：(i) 搜索边界建模机制，通过对比禁用搜索和启用搜索的轨迹，识别策略演化下的搜索边界；(ii) 边界感知奖励模块，将这种边界意识转化为轨迹级惩罚，抑制不必要和冗余的搜索；(iii) 分阶段优化策略，利用顺序课程优先考虑推理而非搜索正则化，从而避免奖励黑客。大量实验表明，SAAS在保持准确性的同时大幅减少了过度搜索。我们的代码和实现细节已在https://github.com/XMUDeepLIT/SAAS发布。

英文摘要

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self-awareness leads to severe over-search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search while maintaining accuracy. Our code and implementation details are released at https://github.com/XMUDeepLIT/SAAS.


2606.09365 2026-06-16 cs.AI cs.CL 版本更新

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

经验造就熟练：通过自进化技能记忆实现可泛化的医疗智能体推理

Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin, Fanrui Zhang, Kaitao Chen, Xingqi He, Yichen Li, Mianxin Liu, Lei Liu, Yankai Jiang

发表机构 * Fudan University（复旦大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Shanghai Innovation Institute（上海创新研究院）； Huazhong University of Science and Technology（华中科技大学）

AI总结提出SkeMex框架，通过技能记忆实现医疗智能体后部署自进化，无需更新模型权重，在临床任务中优于现有记忆型智能体。


AI中文摘要

医疗智能体系统越来越期望支持交互式临床决策，而不仅仅是静态问答。在这种设置中，有效的智能体必须跨演化病例重用先前经验，然而现有的记忆机制通常保留原始历史轨迹，这些轨迹冗余、嘈杂且难以管理。更重要的是，它们很少区分哪些记忆对未来推理真正有用。这限制了它们积累紧凑且可靠的经验以进行长期临床推理的能力。为弥补这一差距，我们提出SkeMex，一种部署后自进化框架，通过基于技能的记忆改进医疗智能体，无需更新模型权重。SkeMex将信息丰富的交互轨迹提炼为结构化技能，编码可重用的程序性知识，并将其组织成涵盖通用、任务特定和行动级经验的多分支存储库。为确定哪些记忆应被重用和保留，SkeMex从环境反馈中估计上下文相关的效用，并用其指导价值感知的检索和存储库治理。闭环的“读-写-评估-治理”生命周期通过写入新技能、更新效用、促进有用记忆和移除有害条目进一步支持持续进化。跨不同临床任务的实验表明，SkeMex在离线和在线设置中均持续优于代表性记忆型智能体。它还能跨模型骨干泛化并支持可迁移的技能记忆。所有数据和代码将公开发布。

英文摘要

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop "Read-Write-Assess-Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.


2606.15069 2026-06-16 cs.CL 新提交

CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction

CoCoGEC：面向鲁棒语法纠错的反事实生成

Qianyu Wang, Xiaoman Wang, Yuanyuan Liang, Xinyuan Li, Yunshi Lan

发表机构 * East China Normal University（华东师范大学）

AI总结提出CoCoGEC框架，通过生成词级和句级反事实样本并筛选高互信息实例，提升语法纠错模型在上下文扰动下的稳定性，在三个扰动数据集上取得显著F0.5提升。


AI中文摘要

语法纠错（GEC）系统通常在GEC基准上进行训练和评估，但一旦周围上下文发生轻微扰动或扩展，其性能往往会急剧下降。这表明现有的GEC模型通常无法理解变化上下文中的错误模式。在本文中，我们深入研究了GEC任务的反事实，其中上下文的细微变化可能导致标签翻转问题。我们提出了CoCoGEC，一个反事实生成框架，该框架创建训练实例的副本，并改变与错误无关的上下文。我们的框架通过以下方式系统地生成反事实：（1）通过改变词级和句级上下文，生成保持原始实例错误模式及语法的句内和句间反事实；（2）通过选择具有翻转标签和高GEC互信息（MI）系数的实例来修正生成的反事实。大量实验表明，我们的方法显著提高了GEC模型的稳定性，优于一组数据增强基线。特别是，在扰动的BEA-19*、CoNLL-14*和TEM-8*数据集上，它分别实现了+9.9、+11.3和+20.8个点的绝对F0.5增益。我们的代码已发布在https://github.com/Quinnok/CoCoGEC。

英文摘要

Grammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that existing GEC models often fail to understand error patterns in varying contexts. In this paper, we thoroughly investigate counterfactuals for GEC tasks, where subtle changes to the context can lead to a label-flipping issue. We propose CoCoGEC, a counterfactual generation framework that creates copies of training instances with error-irrelevant contexts altered. Our framework systematically generates counterfactuals by (1) generating intra- and inter-sentence counterfactuals that maintain the error patterns as well as the syntax of the original instances by altering the word-level and sentence-level contexts; and (2) revising the generated counterfactuals by selecting the instances with flipped labels and a high GEC Mutual Information (MI) coefficient. Extensive experiments show that our method substantially improves the stability of GEC models, outperforming a set of data augmentation baselines. In particular, it achieves absolute F0.5 gains of +9.9, +11.3, and +20.8 points on the perturbed BEA-19*, CoNLL-14*, and TEM-8* datasets. Our code is released at https://github.com/Quinnok/CoCoGEC
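The gains above are reported in F0.5, the F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below computes it from raw counts of correct, spurious, and missed edits. The function name and the counts are hypothetical, and the snippet does not reproduce the edit-matching scorers that GEC benchmarks use to obtain those counts.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from edit counts: tp = correct edits, fp = spurious edits, fn = missed edits."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 6 correct edits, 2 spurious, 6 missed (P = 0.75, R = 0.50).
print(round(f_beta(6, 2, 6), 4))            # F0.5, pulled toward precision
print(round(f_beta(6, 2, 6, beta=1.0), 4))  # F1, for comparison
```

With precision above recall, F0.5 comes out higher than F1 on the same counts, which is why the metric favors systems that propose fewer but more reliable corrections.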


2606.15416 2026-06-16 cs.CL 新提交

Encode Errors: Representational Retrieval of In-Context Demonstrations for Multilingual Grammatical Error Correction

编码错误：多语言语法错误纠正中上下文示例的表征检索

Guangyue Peng, Wei Li, Wen Luo, Houfeng Wang

发表机构 * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University（北京大学计算机学院多媒体信息处理国家重点实验室）

AI总结提出从LLM内部状态提取语法错误表征（GER）用于检索上下文示例，显著提升多语言语法错误纠正的少样本性能，在低资源语言上F0.5提升达1.20倍。

Comments 15 pages, 6 figures


DOI: 10.18653/v1/2025.findings-acl.1090
Journal ref: Findings of the Association for Computational Linguistics: ACL 2025, pages 21166-21180, Vienna, Austria. Association for Computational Linguistics, 2025

AI中文摘要

语法错误纠正（GEC）涉及检测和纠正语法的错误使用。虽然具有上下文学习（ICL）能力的大型语言模型（LLM）在各种自然语言处理（NLP）任务上取得了显著进展，但它们在GEC上的少样本性能仍然次优。这主要是由于难以检索到能够捕捉错误模式而非语义相似性的合适上下文示例。在本文中，我们证明LLM可以通过其内部状态固有地捕捉与语法错误相关的信息。从这些状态中，我们提取了语法错误表征（GER），这是一种信息丰富且语义中立的语法错误编码。我们基于GER的新型检索方法显著提升了多语言GEC数据集上ICL设置的性能，提高了纠正的精确度。对于高资源语言，我们在8B大小的开源模型上的结果与Deepseek2.5和GPT-4o-mini等闭源模型相当。对于低资源语言，我们的F0.5分数比基线高出最多1.20倍。该方法为多语言GEC提供了一种更精确且资源高效的解决方案，为可解释的GEC研究提供了一个有前景的方向。

英文摘要

Grammatical Error Correction (GEC) involves detecting and correcting the wrong usage of grammar. While large language models (LLMs) with in-context learning (ICL) capabilities have shown significant progress on various natural language processing (NLP) tasks, their few-shot performance on GEC remains suboptimal. This is mainly due to the challenge of retrieving suitable in-context demonstrations that capture error patterns instead of semantic similarity. In this paper, we demonstrate that LLMs can inherently capture information related to grammatical errors through their internal states. From these states, we extract the Grammatical Error Representation (GER), an informative and semantically neutral encoding of grammatical errors. Our novel GER-based retrieval method significantly boosts performance in ICL settings on multilingual GEC datasets, improving the precision of correction. For high-resource languages, our results on 8B-sized open-source models match those of closed-source models such as Deepseek2.5 and GPT-4o-mini. For low-resource languages, our $F_{0.5}$ scores surpass the baseline by up to a factor of 1.20. This method provides a more precise and resource-efficient solution for multilingual GEC, offering a promising direction for interpretable GEC research.
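
For readers unfamiliar with the metric, F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The following is a minimal sketch of that computation; the function name `f_beta` and the edit counts are illustrative assumptions, not values from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from edit-level counts; beta < 1 favors precision over recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 40 correct edits, 10 spurious edits, 40 missed edits,
# i.e. precision 0.8 and recall 0.5.
score = f_beta(tp=40, fp=10, fn=40)
```

With these counts the precision-weighted F0.5 comes out higher than the balanced F1 computed from the same counts, which is why the metric suits GEC, where a spurious correction is usually considered worse than a missed one.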

2606.15741 2026-06-16 cs.CL cs.AI New submission

A Self Consistency Based Reranking for Narrative Question Answering

Molham Mohamed, Ali Hamdi

Affiliations: GitHub

AI summary: Proposes a self-consistency reranking framework that generates multiple candidate answers and selects the final answer by semantic agreement, improving the robustness and accuracy of narrative question answering.

Abstract

Narrative question answering (NQA) is a challenging task in natural language processing that requires models to understand long textual contexts, capture relationships across events, and generate coherent responses. Despite recent advances in pretrained language models, most existing approaches rely on a single decoding output during inference, making them sensitive to generation variability and often resulting in incomplete or inconsistent answers. To address this limitation, we propose a self-ensemble, self-consistency-based reranking framework for narrative question answering. The proposed method generates multiple candidate answers for each story-question pair and selects the final answer based on semantic agreement among the generated responses. This allows the model to explore diverse answer formulations while improving robustness through consensus-based selection, without requiring modifications to the underlying architecture. The framework combines pretrained and fine-tuned language generation with multi-answer inference and similarity-based reranking. We evaluate the proposed approach on the NarrativeQA dataset using multiple models, including FLAN-T5 (Base and Small) and Pegasus-Large, under both baseline and fine-tuned settings. Experimental results demonstrate that the proposed method consistently improves performance across all models. In particular, FLAN-T5-Base achieves the best overall performance, improving from 82.32% to 86.66% (+4.34%) when combined with self-ensemble inference. Additionally, the largest improvement is observed with Pegasus-Large, which increases from 72.50% to 87.07% (+14.57%), highlighting the effectiveness of the proposed strategy.
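
The consensus-based selection described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `jaccard` and `rerank_by_agreement` are hypothetical, and token-overlap similarity stands in for whatever semantic similarity model the paper uses.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def rerank_by_agreement(candidates: List[str],
                        sim: Callable[[str, str], float] = jaccard) -> str:
    """Return the candidate with the highest mean similarity to the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def mean_agreement(i: int) -> float:
        others = [sim(candidates[i], c)
                  for j, c in enumerate(candidates) if j != i]
        return sum(others) / len(others)

    best = max(range(len(candidates)), key=mean_agreement)
    return candidates[best]


# Hypothetical sampled answers for one story-question pair.
answers = [
    "he leaves the village at dawn",
    "he leaves the village",
    "she stays in the city",
]
chosen = rerank_by_agreement(answers)
```

Here the second answer is selected because it overlaps most with the other samples on average. Swapping `sim` for an embedding-based cosine similarity would bring the sketch closer to the semantic agreement the abstract refers to.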

2606.15783 2026-06-16 cs.CL New submission

ttda704 at SemEval-2026 Task 4: Modeling Narrative Structures via Pseudonymization and Multi-View Sentence Alignment

Tai Tran Tan, An Dinh Thien

Affiliations: University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam

AI summary: Proposes a narrative-similarity method based on contrastive learning and fine-tuned sentence transformers, with two pipelines trained on synthetic data: single-view (smart layer freezing) and multi-view (theme/plot/ending projection heads plus self-supervised alignment).

2606.16281 2026-06-16 cs.CL cs.AI New submission

Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models

Heecheol Yun, Joonhyung Park, Joowon Kim, Eunho Yang

Affiliations: KAIST; AITRICS

AI summary: Addresses ensembling of masked diffusion language models with TIE, a framework that tracks confidence dynamics at answer-relevant positions to iteratively identify and relay reliable decoding trajectories, enabling collaborative generation across models.

Comments preprint

Abstract

Masked Diffusion Language Models (MDLMs) have emerged as a distinct paradigm for sequence generation. As MDLMs become diverse in capabilities and knowledge coverage, an important question is how to combine their knowledge. Toward this, we first investigate the unique decoding dynamics of MDLMs. We find that successful generations exhibit stable confidence dynamics over answer-relevant positions, while unreliable trajectories can often be corrected by injecting promising intermediate states from other models. Guided by this observation, we propose TIE (Trajectory-based Iterative Ensembling), a knowledge fusion framework in which MDLMs iteratively identify reliable decoding trajectories and relay them across models. TIE tracks confidence dynamics over answer-relevant positions to determine which model currently follows a more reliable trajectory and selectively transfers partially denoised sequences across models. As the model on the more promising trajectory often changes across denoising steps, TIE allows different models to contribute complementary strengths at different stages of generation. Strong performance across diverse reasoning tasks, along with our analyses, suggests that TIE offers a practical approach to the underexplored problem of MDLM ensembling.

2606.16322 2026-06-16 cs.CL New submission

Agentic Retrieval and Reinforcement-Learned Equation Chains: A Controllable Generation Framework for Complex, Novel Physics Word Problems

Tirthankar Mittra

Affiliations: University of California, Berkeley

AI summary: Proposes ARVRE, a two-stage framework that builds valid chains of physics equations via offline temporal-difference learning, uses agentic retrieval-augmented generation to control problem structure and difficulty, and then has an LLM write the natural-language question, yielding complex, novel, and solvable physics word problems.

Abstract

Generating high-quality Physics Word Problems (PWPs) that are novel, complex, and solvable remains a challenging and underexplored problem in educational content generation. Existing approaches, many adapted from Math Word Problem (MWP) generation, often produce ambiguous, unsolvable, or structurally simple questions with limited linguistic diversity. We introduce ARVRE (Agentic Retrieval Value Reinforced Equation-chain), a two-stage framework for generating diverse and mathematically valid PWPs. In the first stage, a form of offline temporal-difference learning is used to construct valid chains of physics equations, while an agentic retrieval-augmented generation (RAG) framework dynamically selects topic-specific concepts and vocabulary. This design enables explicit control over problem structure and difficulty. In the second stage, a Large Language Model (LLM) converts the equation chain and retrieved concepts into a natural-language physics question. By grounding generation in valid equation chains, our method preserves mathematical correctness while promoting linguistic diversity and contextual richness. Human and automated evaluations demonstrate that ARVRE generates PWPs that are more complex, novel, and solvable than those produced by existing approaches. These results highlight the potential of combining reinforcement learning, retrieval, and LLMs for reliable generation of educational physics content.

2403.13089 2026-06-16 cs.CL Updated version

Automatic Summarization of Doctor-Patient Encounter Dialogues Using Large Language Model through Prompt Tuning

Mengxian Lyu, Cheng Peng, Xiaohan Li, Patrick Balian, Jiang Bian, Yonghui Wu

Affiliations: University of Florida; University of Florida Health Cancer Center

AI summary: Proposes prompt tuning of generative large language models to automatically summarize doctor-patient dialogues; experiments show that GatorTronGPT-20B performs best on all evaluation metrics at low computing cost.

Abstract

Automatic text summarization (ATS) is an emerging technology to assist clinicians in providing continuous and coordinated care. This study presents an approach to summarize doctor-patient dialogues using generative large language models (LLMs). We developed prompt-tuning algorithms to instruct generative LLMs to summarize clinical text. We examined the prompt-tuning strategies, the size of soft prompts, and the few-shot learning ability of GatorTronGPT, a generative clinical LLM developed using 277 billion clinical and general English words with up to 20 billion parameters. We compared GatorTronGPT with a previous solution based on fine-tuning of a widely used T5 model, using the clinical benchmark dataset MTS-DIALOG. The experimental results show that the GatorTronGPT-20B model achieved the best performance on all evaluation metrics. The proposed solution has a low computing cost because the LLM parameters are not updated during prompt tuning. This study demonstrates the efficiency of generative clinical LLMs for clinical ATS through prompt tuning.

2512.03503 2026-06-16 cs.CL Updated version

Understanding LLM Reasoning for Abstractive Summarization

Haohan Yuan, Haopeng Zhang

Affiliations: ALOHA Lab, University of Hawaii at Manoa

AI summary: A large-scale comparison of 8 reasoning strategies and 3 large reasoning models on 8 datasets finds that reasoning is not a universal solution: its effect depends on the strategy and the summarization setting, and there is a trade-off between summary quality and factual faithfulness.

Comments 27 pages, 15 figures

Abstract

Reasoning has substantially improved Large Language Models (LLMs) on analytical tasks such as mathematics and code generation, but its value for abstractive summarization remains unclear. To address this gap, we adapt general reasoning strategies to the summarization setting and conduct a large-scale comparative study of 8 reasoning strategies and 3 Large Reasoning Models (LRMs) across 8 diverse datasets, evaluating both summary quality and factual faithfulness. Our results show that reasoning is not a universal solution and its effectiveness depends strongly on the strategy and the summarization setting. In particular, we find a trade-off between summary quality and factual faithfulness. Explicit reasoning strategies often improve reference-based quality, but may weaken factual grounding, whereas implicit reasoning in LRMs shows the opposite tendency. We further find that increasing an LRM's internal reasoning budget does not reliably improve summarization and can even reduce factual consistency. These findings suggest that, for summarization, more reasoning is not always better. Effective reasoning should preserve faithful compression rather than induce over-elaboration. Our source code is publicly available.

2606.05742 2026-06-16 cs.CL Updated version

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

Runheng Liu, Jincheng Xie, Wen Hu, Xingchen Xiao, Heyan Huang

Affiliations: School of Computer Science and Technology, Beijing Institute of Technology; Department of Mathematical Sciences, Tsinghua University; JDT AI Infra

AI summary: Targets two weaknesses of existing reuse-based speculative decoding, low recall when lexical matching fails and brittle deterministic copying, with AdaPLD, a training-free adaptive method that recovers reuse opportunities via semantic similarity and builds branched hypotheses, achieving up to 3.10× decoding speedup.

Abstract

Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, but their speedup depends on the reliability of the constructed drafts. We identify two limitations of existing reuse-based methods: lexically anchored retrieval has limited recall under surface-form variation, and deterministic span copying can be brittle when the retrieved context does not uniquely determine the continuation. We propose AdaPLD, a training-free method that adaptively improves both retrieval and draft construction. AdaPLD preserves high-precision lexical reuse while using semantic similarity to recover additional reuse opportunities when lexical matching fails. It further constructs branched reuse hypotheses to account for continuation uncertainty, rather than relying on a single copied span. Across diverse benchmarks, AdaPLD reduces target-model forward passes and achieves up to 3.10× decoding speedup.
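
To make the starting point concrete, the lexically anchored, single-span reuse that the abstract identifies as limited can be sketched as below. This is an illustrative baseline only, not AdaPLD itself: it has no semantic retrieval and no branching, and the function name `lexical_draft` and its parameters `ngram` and `span` are assumptions introduced here.

```python
from typing import List, Sequence


def lexical_draft(tokens: Sequence[int], ngram: int = 2, span: int = 3) -> List[int]:
    """Propose draft tokens by copying what followed the latest earlier match
    of the final `ngram` tokens. Returns [] when no earlier match exists."""
    if len(tokens) < ngram + 1:
        return []
    suffix = list(tokens[-ngram:])
    # Scan backwards so the most recent earlier occurrence wins; the range
    # excludes the suffix's own position.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == suffix:
            follow = start + ngram
            return list(tokens[follow:follow + span])
    return []


# Hypothetical token ids: the final bigram (5, 6) also appears at the start.
context = [5, 6, 7, 8, 9, 5, 6]
draft = lexical_draft(context)
```

In a full speculative decoding loop the target model would verify `draft` in one forward pass and keep only the accepted prefix. The sketch also shows both failure modes named in the abstract: an exact-match miss yields an empty draft, and a single copied span offers no alternative when the continuation is ambiguous.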

2606.15510 2026-06-16 cs.CL cs.DL New submission

AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

Nikolaos Lavidas, Kiki Nikiforidou, Dag Haug, Leonid Kulikov, Vassiliki Geka, Vassileios Symeonidis, Theodoros Michalareas, Sofia Chionidi, Anastasia Tsiropina, Eleni Plakoutsi, Evangelos Argyropoulos

Affiliations: National and Kapodistrian University of Athens

AI summary: Presents the first openly licensed dependency treebank spanning eight diachronic periods, under a single PROIEL XML 2.0 schema, cross-aligned with Latin, Gothic, and other Indo-European languages.

Comments 16 pages. Data paper for the v0.4 release of AthDGC. Concept DOI: 10.5281/zenodo.20439182. Companion site: https://athdgc.github.io

DOI: 10.5281/zenodo.20439182

Abstract

AthDGC ("Athens-PROIEL") is an open, end-to-end workflow and dataset. It is, to the best of our knowledge, the first openly licensed dependency-parsed treebank of Greek that spans eight diachronic periods, namely Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek, under a single PROIEL XML 2.0 schema, with verse-level cross-alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family (Haug and Johndal 2008; Eckhoff et al. 2018), which established the schema and the Koine-Greek reference set for the project. Annotation uses the Stanford Stanza PROIEL-trained workflow; sentence-level alignment uses LaBSE, a multilingual sentence-embedding model; word-level alignment uses multilingual-BERT attention through the AwesomeAlign procedure. The v0.4 release provides curated samples and the open-source toolkit; the full annotated corpus partitions remain under v0.5 audit on the Greek national HPC. Quantitative scale, per-witness verse counts, and per-period annotated-row counts are reported in the v0.5 release notes, after the audit pass completes. Concept DOI: 10.5281/zenodo.20439182.

2606.16047 2026-06-16 cs.CL New submission

From Argument Components to Graphs: A Multi-Agent Debate with Confidence Gating for Argument Relations

Jakub Bąba, Jarosław A. Chudziak

Affiliations: Faculty of Electronics and Information Technology, Warsaw University of Technology

AI summary: Proposes a multi-agent debate framework whose confidence gating triggers debate only on uncertain cases; it attains the highest Macro F1 among training-free methods on the UKP corpus and produces readable debate transcripts.

Comments Accepted for publication in the proceedings of KES 2026

Abstract

Large Language Models (LLMs) are increasingly assessed and utilized in the field of Argument Mining (AM), thanks to their strong general reasoning capabilities. However, standard training-free models often miss sophisticated details, specifically in contexts where two parts of the text have to be analyzed together. Furthermore, self-correction mechanisms tend to reinforce initial hallucinations in reasoning. Overcoming these limitations typically requires expensive, domain-specific supervised fine-tuning. Recent work has shown that a multi-agent paradigm can address such weaknesses for the component classification task through dialectical refinement with a Proponent-Opponent-Judge architecture, setting a promising direction for training-free approaches in the field. In this paper, we extend and evaluate this framework on the Argument Relation Identification and Classification (ARIC) task, reformulating it as a debate over component pairs. Besides that, we introduce a confidence gating mechanism that enables debating only on the uncertain cases and accepting the initial prediction when confidence is high. On the UKP Argument Annotated Essays v2 corpus, we demonstrate that the selective debate achieves the highest Macro F1 among all training-free methods, while debate over all samples degrades performance below that of one of the baselines. All generative approaches also outperform fine-tuned RoBERTa models on Macro F1, suggesting that the under-representation of the Attack class was more damaging to supervised fine-tuning than to inference-only models. Additionally, our framework produces human-readable debate transcripts, offering interpretability absent from both single-agent and supervised classifiers.
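
The confidence gating idea in this abstract, debating only uncertain cases and accepting the initial prediction otherwise, can be sketched as follows. This is a minimal illustration under stated assumptions: the function `gated_predict`, the threshold of 0.8, and the stub classifier and debate components are all hypothetical and stand in for the LLM calls the paper would use.

```python
from typing import Callable, Tuple

Label = str
Pair = Tuple[str, str]


def gated_predict(pair: Pair,
                  initial: Callable[[Pair], Tuple[Label, float]],
                  debate: Callable[[Pair, Label], Label],
                  threshold: float = 0.8) -> Tuple[Label, bool]:
    """Return (label, debated). The debate runs only when confidence < threshold."""
    label, confidence = initial(pair)
    if confidence >= threshold:
        return label, False
    return debate(pair, label), True


# Stub components for illustration only; a real system would call an LLM.
def stub_initial(pair: Pair) -> Tuple[Label, float]:
    return ("Support", 0.95) if "because" in pair[1] else ("Support", 0.40)


def stub_debate(pair: Pair, first_guess: Label) -> Label:
    return "Attack"


confident = gated_predict(("claim", "because of evidence"), stub_initial, stub_debate)
uncertain = gated_predict(("claim", "however it fails"), stub_initial, stub_debate)
```

The first pair is accepted without debate, while the second falls below the threshold and is relabeled by the debate stub. Returning the `debated` flag alongside the label makes it easy to measure how often the expensive path is taken.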

2606.16407 2026-06-16 cs.CL cs.LG New submission

A Mechanistic Understanding of Pronoun Fidelity in LLMs

Katharina Trinley, Jesujoba O. Alabi, Dietrich Klakow, Vagrant Gautam

Affiliations: Saarland University; Heidelberg Institute for Theoretical Studies

AI summary: Causal analysis shows that pronoun fidelity arises from the joint action of three causal subspaces (entity binding, recency bias, and stereotype bias), which together explain 91-99.5% of the behavior.

From Affect Prediction to Affect Forecasting: Evidence of Distinct Information Sources in Longitudinal Text

Sadia Noor, Seemab Latif, Raja Khurram Shahzad, Mehwish Fatima

Affiliations: School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST); Department of Communication, Quality Management and Information Systems, Mid Sweden University

AI summary: Distinguishes current affect estimation from forecasting future affective change and proposes the TSAP framework and the ACF-Hybrid model; experiments show that textual semantics support current prediction, while numeric trajectory dynamics are better suited to forecasting future change.

Abstract

Modeling dimensional affect in longitudinal text requires distinguishing current affect estimation from future affective change forecasting. Existing approaches often treat each text as an independent observation and apply similar assumptions to both tasks, without testing whether they rely on different information sources. This paper investigates that distinction using longitudinal self-reported ecological essays and feeling-word entries. We propose the Trait-State Affective Prediction (TSAP) framework and its temporal extension E-TSAP for per-text valence and arousal prediction, evaluated on a held-out prediction test set of 1,737 entries from 91 users. We further propose the Affective Change Forecaster Hybrid (ACF-Hybrid) for next-step affective change forecasting, evaluated on a held-out forecasting test set of 46 users. For prediction, E-TSAP achieves composite Pearson correlations of 0.670 for valence and 0.449 for arousal. For forecasting, textual representations perform worse than compact numeric trajectory baselines: the text-inclusive model achieves only r=0.316 for valence and r=0.284 for arousal, whereas a simple prior-state baseline reaches r=0.615 and r=0.670, respectively. ACF-Hybrid, using dimension-specific numeric trajectory features, achieves r=0.659 for valence and r=0.658 for arousal. These results show that textual semantics support current affect prediction, whereas future affective change is better captured through prior numeric trajectory dynamics.

2606.16893 2026-06-16 cs.AI cs.CL cs.LO Cross-list

Symbolic Informalization: Fluent, Productive, Multilingual

Aarne Ranta

Affiliations: Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg

AI summary: Proposes symbolic informalization, a method that reliably converts formal mathematics into natural language, built on an interlingua architecture based on Dedukti and Grammatical Framework and supporting fluent conversion across multiple proof systems and multiple natural languages.

2502.18795 2026-06-16 cs.CL Updated version

Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs

Xiulin Yang, Tatsuya Aoyama, Yuekun Yao, Ethan Gotlieb Wilcox

Affiliations: Georgetown University; Saarland University

AI summary: By training language models on impossible and typologically unattested languages across 12 languages, finds that GPT-2 small can partly distinguish natural languages from impossible ones, though its inductive biases are weaker than those of human learners.

Comments ACL 2025

Abstract

Do language models (LMs) offer insights into human language learning? A common argument against this idea is that because their architecture and training paradigm are so vastly different from humans, LMs can learn arbitrary inputs as easily as natural languages. We test this claim by training LMs to model impossible and typologically unattested languages. Unlike previous work, which has focused exclusively on English, we conduct experiments on 12 languages from 4 language families with two newly constructed parallel corpora. Our results show that while GPT-2 small can largely distinguish attested languages from their impossible counterparts, it does not achieve perfect separation between all the attested languages and all the impossible ones. We further test whether GPT-2 small distinguishes typologically attested from unattested languages with different NP orders by manipulating word order based on Greenberg's Universal 20. We find that the model's perplexity scores do not distinguish attested vs. unattested word orders, while its performance on the generalization test does. These findings suggest that LMs exhibit some human-like inductive biases, though these biases are weaker than those found in human learners.

2606.15714 2026-06-16 cs.CL cs.RO New submission

Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

Hanyang Chen, Hongliang Li, Jiarui Cao, Yang Li, Yang Jiang, Haonan Wen, Kaiyu Huang, Shengnan Guo, Huaiyu Wan

Affiliations: Beijing Jiaotong University

AI summary: Presents the first systematic study of multilingual instruction following in VLA models, finds that English-trained models degrade substantially on other languages, and proposes Multilingual Principal Component Alignment to narrow the gap.

Abstract

Vision-Language-Action models have recently demonstrated promising capabilities in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with English instructions, leaving their ability to understand and execute instructions in other languages largely unexplored. While the underlying large language models often possess multilingual capabilities, it remains unclear whether these multilingual capabilities transfer to VLAs during training. In this work, we present the first systematic study of multilingual instruction following in VLA models. We first construct multilingual instructions by extending existing benchmarks with translations of their instructions. Using these instructions, we evaluate several representative VLA models across a range of tasks in simulation settings. Our experiments reveal a significant multilingual gap: models trained primarily on English instructions exhibit substantial performance degradation when evaluated on other languages, even when the underlying language backbone is multilingual. We provide several findings and analyses to understand the multilingual gap. Cross-lingual transfer behavior analysis shows that performance drops correlate with both instruction understanding and action execution. Representation analyses suggest that multilingual instruction-caused representation shifts may contribute to the multilingual gap. Motivated by these findings, we further explore strategies to improve multilingual performance in VLAs. We propose a simple yet effective multilingual fine-tuning approach, Multilingual Principal Component Alignment, which leverages Principal Component Analysis to get the principal component subspace and align projected multilingual representations, effectively reducing the multilingual performance gap.

2606.15910 2026-06-16 cs.CL New submission

Focus When Necessary: Adaptive Routing and Collaborative Grounding for Training-Free Visual Grounding

Yifan Wang, Peiming Li, Shiyu Li, Zhiyuan Hu, Xiaochen Yang, Wenming Yang, Yang Tang, Zheng Wei

Affiliations: East China University of Science and Technology; Tsinghua University; Shanghai Artificial Intelligence Laboratory; University of Science and Technology of China

AI summary: Proposes LazyMCoT, a dynamic framework whose adaptive routing estimates uncertainty, bypasses extra processing for simple queries, and applies a two-stage refinement through a collaborative grounding module to hard samples, improving reasoning accuracy while lowering average inference latency.

Abstract

While Multimodal Large Language Models (MLLMs) excel in cross-modal reasoning, they often struggle to perceive fine-grained details in complex high-resolution images. Recent training-free methods address this through image scaling and localized cropping. However, applying these manipulations indiscriminately introduces computational redundancy for simple queries and can degrade accuracy by truncating essential global context or introducing irrelevant background noise. To this end, we propose LazyMCoT, a dynamic and training-free framework that adaptively allocates visual grounding efforts based on sample difficulty. The framework features an Adaptive Routing mechanism that evaluates predictive uncertainty using first-token statistics from a single forward pass. This efficiently bypasses confident cases while ensuring the recall of difficult samples via conformal calibration. For these challenging cases, a Collaborative Grounding module integrates the inherent cross-modal attention of the model with an external visual expert through a two-stage refinement process. This refinement process generates a precise localized display to recover small or occluded targets. Extensive experiments across diverse benchmarks demonstrate that LazyMCoT rivals training-based approaches by simultaneously improving reasoning accuracy and reducing average inference latency. Our code is available at https://github.com/TencentBAC/LazyMCoT.

2606.16295 2026-06-16 cs.CV cs.CL Cross-list

VisualClaw: A Real-Time, Personalized Agent for the Physical World

Haoqin Tu, Jianwen Chen, Zijun Wang, Siwei Han, Juncheng Wu, Hardy Chen, Haonian Ji, Kaiwen Xiong, Jiaqi Liu, Peng Xia, Jieru Mei, Hongliang Fei, Jason Eshraghian, Zeyu Zheng, Yuyin Zhou, Huaxiu Yao, Cihang Xie

Affiliations: UC Santa Cruz; UNC-Chapel Hill; Google; UC Berkeley

AI summary: Proposes VisualClaw, a self-evolving multimodal agent that uses hybrid encoding and skill evolution to lower deployment cost and raise accuracy, cutting API cost by 98% on average and improving accuracy by up to +15.80% on video-QA benchmarks.

Comments H. T. and J. C. contribute to this project equally

Abstract

Vision language models are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver as direct concatenated context or as guided evidence, producing skill-bank updates that help future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average -98% versus full-frame upload and by -25.9% over the offline uniform 8 frame baseline, while boosting accuracy in most settings, e.g., an average +3.85% and a peak +15.80% on EgoSchema with Gemini 3 Flash. To address the gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a -9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications, where the cascade reduces a 1-hour streaming session from ~3,600 API uploads down to only 5-20 calls and the self-evolution makes it a perfect personalized assistant.

2602.22391 2026-06-16 cs.CL Updated version

Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework

Rakib Ullah, Mominul Islam, Md Sanjid Hossain, Md Ismail Hossain

Affiliations: University of California, Berkeley; University of Washington

AI summary: To address the research gap in detecting hateful and inflammatory content in Bengali memes, the authors build Bn-HIB, the first dataset to distinguish inflammatory content from direct hate speech, and propose MCFM, a multimodal co-attention fusion model that jointly analyzes visual and textual features for more accurate classification.

Comments Added public link to dataset and fixed typo in abstract

Abstract

Internet memes have become a dominant form of expression on social media, including within the Bengali-speaking community. While often humorous, memes can also be exploited to spread offensive, harmful, and inflammatory content targeting individuals and groups. Detecting this type of content is exceptionally challenging due to its satirical, subtle, and culturally specific nature. This problem is magnified for low-resource languages like Bengali, as existing research predominantly focuses on high-resource languages. To address this critical research gap, we introduce Bn-HIB (Bangla Hate Inflammatory Benign), a novel dataset containing 3,247 manually annotated Bengali memes categorized as Benign, Hate, or Inflammatory. Notably, Bn-HIB is the first dataset to distinguish inflammatory content from direct hate speech in Bengali memes. Furthermore, we propose MCFM (Multi-Modal Co-Attention Fusion Model), a simple yet effective architecture that jointly analyses the visual and textual elements of a meme. MCFM employs a co-attention mechanism to identify and fuse the most critical features from each modality, leading to more accurate classification. Our experiments show that MCFM significantly outperforms several state-of-the-art models on the Bn-HIB dataset, demonstrating its effectiveness in this nuanced task. To facilitate reproducibility and future research, the Bn-HIB dataset has been made publicly available through Mendeley Data. Warning: This work contains material that may be disturbing to some audience members. Viewer discretion is advised.

2606.13578 2026-06-16 cs.CL cs.AI cs.LG cs.MM cs.RO Updated version

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen

Affiliations: Zhejiang University; Shanghai AI Laboratory; Harbin Institute of Technology

AI summary: To address the data and embodiment bottlenecks facing robotic protocol execution in scientific laboratories, the authors propose RoboGenesis, a simulation-based data engine, and LabVLA, a policy trained with a two-stage recipe, which achieves the highest average success rate on the LabUtopia benchmark.

Comments Work in progress. Project website at https://zjunlp.github.io/LabVLA/

Abstract

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.

2308.06035 2026-06-16 cs.AI cs.CL Updated version

Attention, not scale, drives human-AI alignment in multimodal language prediction

Viktor Kewenig, Andrew Lampinen, Samuel A. Nastase, Christopher Edwards, Quitterie Lacome D'Elascombe, Akilles Rechardt, Jeremy I Skipper, Gabriella Vigliocco

Affiliations: Psychology and Language Science, Experimental Psychology, University College London, London, UK; Google DeepMind, Mountain View, US; Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA; Computer Science Department, Exeter University

AI summary: Comparing five vision-language models with 600 human participants in a Visual-World Paradigm, the study finds that adding visual context markedly increases model-human alignment in predictability ratings, and that attention rather than model scale is the main driver.

Comments 39 pages, 6 Figures, published in NPJ Artificial Intelligence

Abstract

Humans routinely draw on visual context to predict upcoming words. To what extent current vision-language models produce comparable behaviour is unclear. Here we placed five state-of-the-art pretrained systems side-by-side with 600 human participants in a web-based Visual-World Paradigm. On each of 100 six-second movie clips, models and participants received either text only or synchronised video and text and judged how likely a specified target word was to appear next; human eye movements were tracked throughout. Adding visual context increased model-human alignment in predictability ratings across all architectures (average Delta r = 0.18) with no impact of parameter size. When visual context was informative, transformer attention significantly increased alignment. Attention maps from two transformer models corresponded with human gaze, explaining up to 70% of the inter-participant variance when the scene contained informative cues. Notably, cross-modal attention reliably tracked anticipatory human fixations on semantic cues. These results suggest that current transformer-based vision-language models can approximate human behaviour exploiting visual context during language prediction - and that selective attention to informative cues, not sheer model scale, is the principal driver of this alignment.

2510.01444 2026-06-16 cs.AI cs.CL cs.LG Updated version

Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning

Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu

Affiliations: Tencent Hunyuan; University of Maryland; University of North Carolina

AI summary: Proposes DUPL, which quantifies perceptual uncertainty and output uncertainty to guide policy updates, markedly improving accuracy on several multimodal reasoning benchmarks and outperforming existing methods.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model's uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce DUPL, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Evaluated on diverse multimodal reasoning benchmarks spanning mathematical and general domains, DUPL achieves solid gains. It improves Qwen2.5-VL accuracy by up to 12.3% (3B) and 7.9% (7B), and Qwen3-VL-Instruct accuracy by up to 10.7% (4B) and 12.4% (8B), consistently outperforming GRPO, while generalizing to alternative algorithms (DAPO, +6.5% avg) and architectures (LLaVA-OneVision-1.5, +4.7% avg). These results demonstrate that DUPL is an effective and generalizable approach for multimodal RLVR.
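
The two uncertainty measures named in this abstract can be sketched in a few lines of Python. This is only an illustration of the standard definitions, not the authors' implementation: the function names are hypothetical, the symmetric KL is taken here as the sum of both directions (some definitions halve it), and the abstract does not say which pair of distributions DUPL compares or how the values enter the advantage.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q, eps=1e-12):
    """KL(p || q); eps guards against log(0) when q has zero entries."""
    return sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(p, q) if x > 0)

def symmetric_kl(p, q):
    """Symmetric KL divergence, assumed here to be KL(p||q) + KL(q||p)."""
    return kl(p, q) + kl(q, p)

# Hypothetical next-token distributions for the same prompt under two views of an image.
p_original = [0.9, 0.1]
p_perturbed = [0.5, 0.5]
perceptual_uncertainty = symmetric_kl(p_original, p_perturbed)
output_uncertainty = entropy(p_original)
```

A larger symmetric KL means the prediction is more sensitive to the change in visual input, while a larger entropy means the policy itself is less decided about its output.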

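2602.00344 2026-06-16 cs.CV cs.AI cs.CL Updated version

When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

Beidi Zhao, Wenlong Deng, Xinting Liao, Yushu Li, Nazim Shaikh, Yao Nie, Xiaoxiao Li

Affiliations: University of Science and Technology of China

AI summary: The paper finds that retrieval-augmented generation (RAG) causes Attention Distraction (AD) in LVLMs, where retrieved text suppresses visual attention and shifts it away from question-relevant regions. It proposes MAD-RAG, which decouples visual grounding from context integration through a dual-question formulation and attention mixing, improving performance on three benchmarks and correcting most failure cases.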

Comments 19 pages, 13 figures

Abstract

While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context and proposes reducing the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous studies have overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or including the correct answer), the retrieved text suppresses visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18%, respectively, over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.

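2605.10157 2026-06-16 cs.CV cs.CL Updated version

ROMPAR: Morphological Completion and Demographic Debiasing for Romanian Accented Speech Recognition

Andrei-Marius Avram, Aureliu-Valentin Antonie, Ştefan-Bogdan Badea, Andrei Florea, Robert-Nicolae Zaharoiu, Dumitru-Clementin Cercel

Affiliations: National University of Science and Technology POLITEHNICA Bucharest

AI summary: Proposes the ROMPAR corpus and a multi-task adversarial training framework that uses an exponential decay mechanism and LLM-guided decoding to achieve demographic invariance and morphological completion, significantly reducing WER and reaching an F1-score of 96.6% in morphological reconstruction.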

Abstract

Automated transcription of parliamentary proceedings faces significant hurdles due to demographic bias, dialectal variation, and technical artifacts such as utterance truncation during segmentation. This paper introduces the ROManian PARliamentary Speech Corpus (ROMPAR) dataset, a 17.80-hour corpus of Romanian and Moldavian parliamentary speech, featuring double-annotated ground truth and explicit labels for reconstructed word fragments. To build a robust ASR system, we propose a multi-task adversarial training framework that enforces demographic invariance across age, gender, and dialect. We address the inherent instability of adversarial objectives in generative architectures by introducing an exponential decay mechanism for the adversarial coefficients. Furthermore, we implement an LLM-guided decoding strategy with position-dependent weighting to facilitate morphological completion of truncated terminal words. Our results demonstrate that the proposed framework significantly reduces WER and achieves an F1-score of 96.6% in morphological reconstruction.
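
The exponential decay of adversarial coefficients mentioned in this abstract could look roughly like the following Python sketch. Everything specific here is an assumption for illustration: the schedule form lambda_0 * exp(-k * step), the default values, the function names, and the sign convention (the encoder minimizes the ASR loss while maximizing the demographic classifiers' losses) are not taken from the paper.

```python
import math

def adversarial_coefficient(step, lambda0=1.0, decay_rate=1e-3):
    """Hypothetical schedule: the adversarial weight shrinks exponentially with the training step."""
    return lambda0 * math.exp(-decay_rate * step)

def encoder_objective(asr_loss, adversary_losses, step, lambda0=1.0, decay_rate=1e-3):
    """Assumed encoder-side objective: ASR loss minus the weighted sum of adversary losses.

    adversary_losses holds one loss per demographic head (e.g. age, gender, dialect).
    """
    lam = adversarial_coefficient(step, lambda0, decay_rate)
    return asr_loss - lam * sum(adversary_losses)

# The adversarial pressure is strongest early in training and fades later.
early = adversarial_coefficient(0)
late = adversarial_coefficient(5000)
```

Under this assumed schedule, the debiasing term dominates early updates and contributes progressively less as training proceeds, which is one plausible way to damp the instability the abstract attributes to adversarial objectives.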

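2606.16019 2026-06-16 cs.CL cs.LG cs.SD New submission

Scaling Human and G2P Supervision for Robust Phonetic Transcription

Alexander Metzger, Aruna Srivastava, Ruslan Mukhamedvaleev

Affiliations: Koel Labs LLC

AI summary: Studies how automatic phonetic transcription scales with human annotation and G2P supervision, finding that G2P helps only when fewer than 20-30 hours of human annotation are available, offers no benefit and can even reduce robustness beyond that, and that ASR pretraining markedly improves performance.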

Comments Accepted to Interspeech 2026

Abstract

Expert phonetic annotation is costly, especially for non-standard dialects and atypical speech. A common alternative is to use Grapheme-to-Phoneme (G2P) models to auto-generate phonetic labels from text transcripts at scale. We study how automatic phonetic transcription performance scales with human and G2P supervision in English. Using a curated 80-hour benchmark spanning native, non-native, and post-stroke speech, we identify a supervision quality threshold: G2P supervision helps only when fewer than 20-30 hours of human annotation are available. Beyond this threshold, it provides no significant benefit and can reduce cross-dialect robustness. What remains effective beyond this threshold is ASR pretraining, which we use to achieve a 2.3x reduction in weighted phone feature error rate over prior systems, with strong gains on non-native and aphasic speech. These results suggest that quantity-driven G2P scaling may yield diminishing returns for robust generalization.

2606.16472 2026-06-16 cs.CL New submission

From Awareness to Adherence: Bridging the Context Gap in Spoken Dialogue Systems via Context-Aware Decoding

Che Hyun Lee, Heeseung Kim, Sungroh Yoon

Affiliations: ECE, Seoul National University; IPAI, Seoul National University; Department of AI, University of Seoul

AI summary: Proposes an audio-adapted context-aware decoding method that contrasts output distributions with and without key context to amplify multimodal contextual signals, addressing context adherence in spoken dialogue systems.

Comments Interspeech 2026 Main Track

Abstract

Despite the success of end-to-end (E2E) spoken dialogue systems, maintaining strict context adherence in multi-round conversations remains a challenge. While prior works attribute these failures to models forgetting dialogue history, we highlight an equally critical but overlooked bottleneck: a gap between latent context awareness and active adherence. Although models internally recognize relevant past utterances, strong parametric priors often overshadow these signals during decoding. To bridge this gap, we propose an audio-adapted Context-Aware Decoding (CAD) approach. By leveraging internal attention mechanisms to isolate key historical rounds, our approach contrasts output distributions with and without this key context during inference, directly amplifying multimodal contextual signals. Evaluations on the Audio MultiChallenge benchmark demonstrate significant improvements in Semantic Memory and Self Coherence subtasks, successfully enforcing strict, context-faithful adherence.
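
Contrasting output distributions with and without a piece of context can be illustrated with a minimal Python sketch. It assumes the commonly used contrastive form (1 + alpha) * logits_with - alpha * logits_without; the abstract does not state the exact formula of the audio-adapted variant, and the function names, the value of alpha, and the toy logits below are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(with_ctx, without_ctx, alpha=0.5):
    """Amplify whatever the key context changes: (1 + alpha) * with - alpha * without."""
    return [(1 + alpha) * a - alpha * b for a, b in zip(with_ctx, without_ctx)]

# Toy vocabulary of three tokens; the key context raises the logit of token 1.
logits_with = [2.0, 1.0, 0.0]
logits_without = [2.0, 0.0, 0.0]

p_plain = softmax(logits_with)
p_contrast = softmax(contrastive_logits(logits_with, logits_without, alpha=1.0))
```

With alpha set to 0 the adjusted logits reduce to ordinary decoding with the context; larger values push probability mass toward tokens whose scores depend on the isolated context.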

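2606.16568 2026-06-16 cs.CL cs.AI New submission

Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

Rutherford A. Patamia, Ming Liu, Wei Luo, Favour Ekong, Akan Cosgun

Affiliations: Deakin University; Griffith University

AI summary: For turn-taking in multiparty conversation, proposes an audio-only two-stage pipeline that first quickly detects candidate turn boundaries and then uses a lightweight verifier to decide whether the floor shifts and to predict the next speaker; diffusion augmentation further improves detection.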

Abstract

Reliable turn-taking is essential for spoken dialogue systems. However, most existing methods are designed for two-speaker interaction and struggle with realistic multiparty audio containing overlap and rapid speaker changes. We study multiparty turn-taking on the VoxConverse dataset and propose an audio-only two-stage pipeline that separates when to trigger a turn boundary from whether the floor is actually transferring. A fast trigger scans the audio and proposes candidate end-of-turn times, while a lightweight verifier runs only at those times to decide between Hold and Shift and to support next-speaker prediction. We report results in the full multiparty setting and in a controlled dyadic top-2 projection for comparability. We also investigate diffusion-based, label-preserving background-audio mixing as a data augmentation strategy. Results show improved shift detection over a baseline, with further improvements from diffusion augmentation.

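2606.16807 2026-06-16 cs.CL New submission

Cross-Lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition

Zhengdong Yang, Qianying Liu, Sheng Li, Fei Cheng, Chenhui Chu

AI summary: Proposes building a hierarchical softmax decoder from cross-lingual embedding clustering, improving low-resource multilingual speech recognition accuracy by sharing representations among similar tokens.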

Comments Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

2506.16738 2026-06-16 cs.CL cs.AI cs.SD eess.AS Updated version

ReportQA: Question-Answering-Based Radiology Report Evaluation

Yiming Shi, Shaoshuai Yang, Xi Chen, Haolin Li, Hengyu Zhang, Che Jiang, Kaiwen Wang, Xun Zhu, Dong Xie, Fei Wang, Dejing Dou, Miao Li, Ji Wu

Affiliations: Department of Electronic Engineering, Tsinghua University; College of AI, Tsinghua University; Beijing National Research Center for Information Science and Technology; Beijing Electronic Digital & Intelligence

AI summary: Proposes the ReportQA framework, which uses knowledge trees and LLMs to extract structured information from reports and generate QA pairs, and uses question-answering accuracy as the evaluation metric, aligning better with radiologist judgments than existing metrics.

Abstract

Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus mainly on presence and cover only a limited set of entities. Because CE metrics rely heavily on manual annotation, they are difficult to extend to new clinical entities or attributes. In clinical practice, radiology reports serve as a medium for information transfer: clinicians use them to perform downstream diagnostic tasks without directly inspecting images. Based on this insight, we propose ReportQA, a clinically relevant and flexible radiology report evaluation framework that supports detailed quantitative analysis of radiology report generation systems. We first collect datasets covering multiple imaging modalities and anatomical regions. We then construct knowledge trees of clinical entities and attributes with radiologist guidance, and use large language models (LLMs) to extract structured information from raw reports. Next, we generate QA pairs from predefined templates and apply quality control through self-filtering and report-based filtering. During evaluation, the report is treated as context, and an LLM acts as a judge model to answer the QA pairs. Based on the resulting QA accuracy, we introduce the QAScore metric. Compared with existing metrics, QAScore shows better alignment with radiologist judgments. Experiments on multiple state-of-the-art vision-language models reveal that current report-based inference paradigms struggle to learn fine-grained clinical representations and exhibit strong negative prior biases. In contrast, question-driven inference provides a more effective alternative. For reproducibility and extensibility, we release the knowledge trees, structured reports, and QA pairs, along with the pipeline code for QA construction and evaluation.

2606.15144 2026-06-16 cs.CL cs.AI New submission

PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

Jann Railey Montalan, David Demitri Africa, Jimson Paulo Layacan, Richell Isaiah Flores, Ivan Yuri De Leon, Lance Calvin Gamboa

Affiliations: AI Singapore; Nanyang Technological University; UK AI Security Institute; Ateneo de Manila University; University of Birmingham

AI summary: Proposes PACUTE, a benchmark of 4,600 tasks that evaluates LLMs' morphological understanding of Filipino through a six-level diagnostic framework, finding that open-weight models perform near chance on morpheme decomposition and that frontier models fall far below their character-level ceilings on compositional tasks.

Comments Submitted to EMNLP 2026

Abstract

Large language models (LLMs) process text as sequences of subword tokens, which can obscure the character-level and morphological structure that underlies word formation. This limitation is most acute for languages with non-concatenative morphology, where standard tokenizers systematically misalign token boundaries with morpheme boundaries. We introduce PACUTE, a diagnostic benchmark of 4,600 tasks designed to evaluate morphological understanding in Filipino, a language characterized by productive infixation, reduplication, and diacritic-driven lexical distinctions that are typically absent from written text. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down. Evaluating open-weight LLMs and frontier commercial models, we find that open-weight models perform near chance on morpheme decomposition regardless of scale. Frontier models perform much better, often recovering individual affixes under contains-match scoring, but remain far below their character-level ceilings on compositional tasks of morpheme transformations and syllabification. These results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word-structure understanding.

2606.15191 2026-06-16 cs.CL New submission

AmchiBias: Measuring Stereotypical Bias in Goan Identity Groups with a Minimal Pair Dataset in English and Konkani

Michelle Barbosa, Sebastian Padó, Franziska Weeber

Affiliations: Institute for Natural Language Processing, University of Stuttgart

AI summary: Proposes the AmchiBias benchmark, which uses 313 minimal pairs to evaluate multilingual encoders' stereotypical bias toward Goan identity groups, finding near-chance scores in Konkani and that English queries reflect pan-Indian bias rather than local cultural knowledge.

Comments The 1st Workshop on Stereotypes Across Cultures in Language Technologies

Abstract

Socio-cultural stereotypical bias is an important consideration in the development and deployment of NLP systems. It is however often considered only at the national level, despite rich subnational socio-cultural structures. We present AmchiBias, the first benchmark for measuring socio-cultural stereotypical bias for the Indian state of Goa with its unique historically multicultural setting. It covers various Goan identity groups and comprises 313 minimal pairs across eight sociodemographic dimensions in both English and Devanagari Konkani. We then evaluate stereotypical bias in five multilingual encoder models on this benchmark. We find near-chance scores in Konkani, reflecting language incompetence for general multilingual models and a lack of Goan cultural competence for Indian language models. Queried in English, models with a stronger Indian language coverage show higher bias for pan-Indian groups than hyperlocal Goan groups. This suggests the English signal reflects pan-Indian pretraining associations rather than genuine Goan cultural knowledge. Our findings highlight a critical gap in low-resource multilingual NLP evaluation for hyperlocal community identities.

2606.15610 2026-06-16 cs.CL astro-ph.IM cs.AI cs.LG New submission

LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

Hiroyasu Usami, Keisuke Hara, Ayato Tsuboi, Naohiko Matsuda

Affiliations: Chubu University; Mitsubishi Heavy Industries, Ltd., Research & Innovation Center

AI summary: Proposes a Judge Datasheet protocol that measures the dark current and biases of LLM judges using true-vacuum inputs, surface variation, positional preference, and related measures, revealing their measurement characteristics.

Comments 22 pages, 4 figures

Abstract

LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or agreement devices. We argue that a judge should instead be reported as a measurement instrument. We introduce a Judge Datasheet protocol that measures dark current under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, target sensitivity on a controlled quality ladder, and the criterion or operating point induced by tie instructions. The direction-stability decomposition reveals that apparent Delta0 preference can be stable surface response or disguised position bias. In a three-judge open-weight case study, Llama-3.1-8B shows high dark current and presentation-conflicted Delta0 behavior, Qwen2.5-14B is vacuum-clean and target-sensitive but mixes stable and positional over-discrimination, and Qwen2.5-32B is vacuum-clean with low stable cross-sensitivity and low positional false preference. A strict tie criterion eliminates Qwen32B Delta0 false preference but absorbs marginal Delta1 target signals into ties while preserving Delta5 sensitivity. The results show that prompting moves the criterion, not the resolution. We do not claim that the downstream mechanism hypothesis that motivated this work is confirmed; the contribution is a metrological protocol for measuring the measuring device before downstream claims are made.

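2606.15643 2026-06-16 cs.CL New submission

GRACE: A Step-Level Benchmark for Context-Faithful Reasoning

Hoang Pham, Dong Le, Anh Tuan Luu

Affiliations: Nanyang Technological University; VinUniversity

AI summary: Proposes GRACE, the first human-annotated step-level faithfulness benchmark, which evaluates chain-of-thought steps in context-grounded reasoning with a data-driven error taxonomy, and shows that step-level faithfulness signals improve downstream accuracy and reasoning reliability.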

Abstract

Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify where in the chain a failure occurs or what type it is. We introduce GRACE, the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy for context-grounded textual reasoning. GRACE covers CoT traces from 10 models across 4 source datasets, with each step annotated for faithfulness, error category, and natural language explanation. A data-driven taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks: GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), with four categories each. The evaluation set is human-annotated and challenging by design. Our experiments reveal substantial headroom for current models. In addition, integrating step-level faithfulness signals into reinforcement learning pipelines improves both downstream accuracy and reasoning reliability.

2606.16211 2026-06-16 cs.CL New submission

Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework

Xingyu Tan, Shiyuan Liu, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, Wenjie Zhang

Affiliations: University of New South Wales; CSIRO; University of Technology Sydney

AI summary: Proposes the BioMedHop benchmark and the BioWeave framework for evaluating and performing biomedical reasoning over multi-source evidence; BioWeave outperforms the baseline by 10.5% on the benchmark and improves the performance of smaller models.

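AI-generated abstract

Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where the supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However, existing biomedical QA benchmarks focus mainly on exam-style knowledge, literature comprehension, or short-range multi-hop reasoning, while source-conditioned graph reasoning and evidence-topology construction remain underexplored. To fill this gap, we introduce BioMedHop, a multi-source graph-based benchmark for evaluating biomedical reasoning over structured evidence topologies. BioMedHop contains 10,045 instances covering knowledge-graph, document, web, and mixed evidence settings, including shared-neighbor matching, intersection reasoning, path-based reasoning, and counting, in option-based, open-ended, and numerical-counting formats. To support the benchmark, we further propose BioWeave, a source-aware reasoning framework that retrieves biomedical knowledge-graph paths, gathers supporting clues from document and web sources, assembles them into a unified evidence graph, and verifies answers through entity-level evidence support. Comprehensive experiments show that, on BioMedHop, BioWeave achieves the best overall performance among the compared methods, exceeding the strong hybrid baseline ToG-2 by 10.5% on the overall average. In addition, BioWeave consistently improves different large language model backbones and enables smaller models such as Qwen3-4B to reach reasoning performance comparable to GPT-4-Turbo.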

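P3B3: A Multi-Turn Dialogue Benchmark for Measuring European and Brazilian Portuguese Variety Bias in Large Language Models

Rafael Ferreira, Inês Vieira, Inês Calvo, James Furtado, Iago Paulo, Diogo Tavares, Diogo Glória-Silva, David Semedo, João Magalhães

Affiliations: NOVA University of Lisbon; NOVA LINCS

AI summary: Proposes the P3B3 benchmark, which uses expert-curated dialogue prompts and an evaluation framework to measure bias and controllability of large language models across Portuguese varieties (European vs. Brazilian), and finds that most models lean toward Brazilian Portuguese.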

Comments Accepted at MeLLM Workshop at ACL 2026

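2606.16890 2026-06-16 cs.CL cs.AI New submission

Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

Sanjay Basu

Affiliations: University of California San Francisco; Waymark

AI summary: Introduces hop count (the number of reasoning steps) as a theory-motivated predictor of large language model failure in electronic health record question answering; accuracy declines monotonically as hop count increases and extended thinking does not significantly help, suggesting that compositional reasoning depth is a cross-architecture predictor of failure.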

Comments 20 pages, 5 figures. Code: https://github.com/sanjaybasu/compositional-depth-clinical-ehr

Abstract

Aggregate accuracy benchmarks conceal a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions requiring more inferential steps produce disproportionately more errors. Motivated by theoretical results on transformer compositionality limits, we introduce a pre-specified hop-count taxonomy -- the number of distinct reasoning steps required to answer a clinical question from an EHR -- as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), show monotone accuracy decline with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o replicates this (37.8% to 14.7%; OR 0.58 [0.45,0.75], p<0.001); and gpt-5.4-2026-03-05 confirms it (37.8% to 23.5%; OR 0.80 [0.66,0.98], p=0.027). A pre-specified context-sufficiency audit shows higher-hop questions are not differentially disadvantaged by EHR truncation (answerability 93-95% at hops 2-4 vs. 79% at hop=1), so the decline reflects compositional reasoning difficulty. Extended thinking did not significantly flatten the accuracy-depth curve across three reasoning conditions, and thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of large-language-model error on EHR question answering, with direct implications for deployment risk stratification of clinical AI.
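
The trend statistic quoted in this abstract can be illustrated with a minimal sketch of the Cochran-Armitage test for a trend in proportions. The counts below are hypothetical and are not the paper's data; the function name `cochran_armitage_z` and the choice of hop levels 1 to 4 as linear scores are assumptions made only for illustration.

```python
import math
from typing import Sequence


def cochran_armitage_z(correct: Sequence[int], totals: Sequence[int],
                       scores: Sequence[float]) -> float:
    """Cochran-Armitage trend statistic for proportions across ordered groups.

    A negative z means the proportion correct falls as the score rises.
    """
    n_total = sum(totals)
    p_bar = sum(correct) / n_total  # pooled proportion correct
    # Score-weighted deviation of observed from expected correct counts.
    t = sum(s * (x - n * p_bar) for s, x, n in zip(scores, correct, totals))
    # Variance of t under the null hypothesis of no trend.
    var = p_bar * (1 - p_bar) * (
        sum(n * s * s for s, n in zip(scores, totals))
        - sum(n * s for s, n in zip(scores, totals)) ** 2 / n_total
    )
    return t / math.sqrt(var)


# Hypothetical counts (NOT from the paper): 100 questions per hop level.
hops = [1, 2, 3, 4]
totals = [100, 100, 100, 100]
correct = [40, 30, 20, 10]
z = cochran_armitage_z(correct, totals, hops)
print(f"z = {z:.3f}")  # negative: accuracy declines with hop count
```

With real data, `correct` and `totals` would hold the per-hop counts of correctly answered and evaluated questions; a flat accuracy profile yields z = 0.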

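2606.16897 2026-06-16 cs.CL New submission

Contrastive-Difference CKA Reveals Concept-Specific Structural Alignment Across Language Model Architectures

Xueping Gao

Affiliations: Alibaba Cloud

AI summary: Proposes contrastive-difference CKA (CKA_Delta) and finds that, across LLM architectures, geometric convergence of concept representations is dissociated from functional transferability; the method effectively separates concept-specific similarity from generic similarity.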

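Measuring Whether LLM Tutors Teach or Solve: A Diagnostic for Educational Impact

Junyi Yao, Zihao Zheng, Baichuan Li

Affiliations: Washington University in St. Louis; Department of Operations Research and Engineering Management, Southern Methodist University

AI summary: Because solving ability in an LLM tutor does not equal teaching support, proposes a diagnostic based on the gap between solving-oriented and pedagogy-oriented benchmark performance; an analysis of MathTutorBench shows the two are only partially aligned, and the authors recommend reporting the scores separately and making student-agency-preserving criteria explicit.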

Abstract

Large language models are increasingly proposed as educational tutors, yet stronger task-solving ability does not necessarily imply stronger learning support. Motivated by recent calls to measure the social impact of NLP systems in practice, we study whether public LLM tutoring benchmarks distinguish learning-supportive behavior from mere answer production. We propose a lightweight diagnostic based on the gap between solving-oriented and pedagogy-oriented benchmark performance. Using public MathTutorBench leaderboard results, we show that these dimensions are only partially aligned: across eight publicly reported models, the correlation between solving and pedagogy composites is 0.421, and several models shift meaningfully in rank when evaluation moves from solving to pedagogy. We then analyze the public TutorBench sample and show that agency-relevant behaviors are explicitly encoded in benchmark rubrics, especially in active-learning settings that reward guiding questions, calibrated hints, and non-disclosive scaffolding. Together, these findings suggest that educational-impact evaluation should not treat task success as a sufficient proxy for learning support. We argue that public tutoring benchmarks can better support positive-impact evaluation by reporting solving-oriented and pedagogy-oriented scores separately and by making disclosure-sensitive, student-agency-preserving criteria more explicit.
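
The diagnostic described in this abstract compares two composite scores per model. A minimal sketch follows, assuming each model has one solving composite and one pedagogy composite; the model names and scores are hypothetical placeholders (not MathTutorBench results), and the helper names `pearson`, `ranks`, and `rank_shifts` are illustrative.

```python
import math
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models by score; rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def rank_shifts(solving: Dict[str, float], pedagogy: Dict[str, float]) -> Dict[str, int]:
    """Solving rank minus pedagogy rank; negative means the model drops on pedagogy."""
    rs, rp = ranks(solving), ranks(pedagogy)
    return {m: rs[m] - rp[m] for m in solving}


# Hypothetical composites (NOT leaderboard values).
solving = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5}
pedagogy = {"model_a": 0.4, "model_b": 0.6, "model_c": 0.5}

names = list(solving)
r = pearson([solving[m] for m in names], [pedagogy[m] for m in names])
shifts = rank_shifts(solving, pedagogy)
print(round(r, 3), shifts)
```

Reporting `r` alongside the per-model shifts keeps the two dimensions separate rather than folding them into one leaderboard number.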

2606.16748 2026-06-16 cs.LG cs.CL Cross-list

MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov

Affiliations: Carnegie Mellon University

AI summary: Proposes the MyPCBench benchmark, which tests personal computer-use agents in a simulated realistic desktop environment with 17 web applications; the best model, Claude Opus 4.6, solves only 55.4% of tasks, with failures concentrated on multi-application and long-trajectory tasks.

Abstract

Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. This gap is widest on web tasks, where live web evaluations cannot exercise sites that require logging in or personal information, the kind of site a real personal assistant has to drive. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed and open-weight models with a uniform computer+bash tool surface. We find that the best model, Claude Opus 4.6, fully solves 55.4% of the tasks, the only model above 50%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. We release the environment, task set, and agent harness at https://mypcbench.com.

2306.11252 2026-06-16 cs.CL cs.LG Updated version

HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

Cihan Xiao, Henry Li Xinyuan, Jinyi Yang, Dongji Gao, Matthew Wiesner, Kevin Duh, Sanjeev Khudanpur

Affiliations: Center for Language and Speech Processing; Human Language Technology Center of Excellence; Johns Hopkins University

AI summary: Presents the HK-LegiCoST corpus, with 600+ hours of three-way parallel Cantonese-English data; addresses the challenge of sentence-level alignment with non-verbatim transcripts, and reports competitive Cantonese speech translation baselines with cross-corpus validation.

Abstract

We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned at the sentence level. We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verbatim transcripts. Such transcripts make the corpus suitable for speech translation research when there are significant differences between the spoken and written forms of the source language. Due to its large size, we are able to demonstrate competitive speech translation baselines on HK-LegiCoST and extend them to promising cross-corpus results on the FLEURS Cantonese subset. These results deliver insights into speech recognition and translation research in languages for which non-verbatim or "noisy" transcription is common due to various factors, including vernacular and dialectal speech.

2412.21036 2026-06-16 cs.CL Updated version

GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models

Shangyu Xing, Changhao Xiang, Yuteng Han, Yifan Yue, Zhen Wu, Xinyu Liu, Zhangtai Wu, Fei Zhao, Xinyu Dai

Affiliations: University of Science and Technology of China; Tsinghua University

AI summary: Proposes the GePBench benchmark, which systematically evaluates how well multimodal large language models recognize geometric shapes and perceive spatial relationships; existing models show significant deficiencies, while training on the benchmark improves downstream task performance.

Abstract

Geometric shapes play important roles in both the physical world and human cognition. While multimodal large language models (MLLMs) have made significant advancements in visual understanding, their abilities to recognize geometric shapes and their spatial relationships, which we term "geometric perception", are not explicitly and systematically explored. To address this gap, we introduce GePBench, a novel benchmark specifically designed to assess the geometric perception capabilities of MLLMs. Our extensive evaluations reveal that even the current state-of-the-art MLLMs exhibit significant deficiencies in geometric perception tasks. Furthermore, we show that models trained with GePBench data demonstrate considerable improvements on a wide range of downstream tasks, highlighting the critical role of geometric perception in enabling advanced multimodal applications. Our code and datasets are available at https://github.com/Changhao-Xiang/GePBench.

2506.21613 2026-06-16 cs.CL cs.SD eess.AS Updated version

ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech

Gautam Siddharth Kashyap, Mohammad Anas Azeez, Rafiq Ali, Zohaib Hasan Siddiqui, Jiechao Gao, Usman Naseem

Affiliations: Macquarie University; MBZUAI; DSEU; Department of Information and Computer Science, KFUPM; Stanford University

AI summary: To address hate speech targeting children on social media, builds ChildGuard, a large-scale English dataset of 351,877 annotated instances covering three age groups, and evaluates a range of models.

Comments Updated Version

Abstract

The mental health industry faces growing concerns regarding hate speech directed at children on social media, as exposure to such content can contribute to adverse psychological outcomes during critical stages of development. Current hate speech datasets and detection systems provide limited support for child-focused applications because they are primarily designed for adults and lack dedicated representations of age-specific characteristics associated with hate speech directed at children. To address this gap, we introduce ChildGuard, a large-scale English dataset for child-targeted hate speech containing 351,877 annotated instances collected from X (formerly Twitter), Reddit, and YouTube. The dataset covers three age groups: younger children (under 11), pre-teens (11-12), and teens (13-17). ChildGuard contains two subsets: a contextual subset (157K) and a lexical subset (194K). Evaluation using recent transformer-based models and LLMs achieves a best Macro-F1 of 82.07%, decreasing to 79.41%, 79.24%, 76.04%, and 74.88% on younger children, contextual, implicit hate, and cross-subset settings, respectively.

2508.01401 2026-06-16 cs.CL cs.AI Updated version

MedSynth: Realistic, Synthetic Medical Dialogue-Note Pairs

Ahmad Rezaie Mianroodi, Amirali Rezaie, Niko Grisel Todorov, Nadine A. Friedrich, Maria P Mogollon, Alexander Hernandez-Tirado, Guillermo Lopez Garcia, Cyril Rakovski, Frank Rudzicz

Affiliations: Dalhousie University; Vector Institute; Shahrood University of Technology; Chapman University; Cedars-Sinai Medical Center

AI summary: To reduce physicians' documentation burden, proposes MedSynth, a synthetic dataset of more than 10,000 dialogue-note pairs covering 2,000+ ICD-10 codes, which significantly improves performance on the Dial-2-Note and Note-2-Dial tasks.

Comments 7 pages excluding references and appendices

2509.22808 2026-06-16 cs.CL Updated version

ArFake: A Robust Framework for Multi-Dialect Arabic Speech Spoofing Detection Benchmark

Mohamed Elsetohy, Alhassan Ehab, Ali Mekky, Besher Hassan, Shady Shehata

Affiliations: MBZUAI, UAE; Queen's University, Canada; University of Waterloo, Canada

AI summary: Presents the first multi-dialect Arabic spoofed speech dataset, builds a benchmark through multi-model evaluation and human ratings, and finds that FishSpeech generates the most realistic synthetic speech on the Casablanca corpus.

Abstract

With the rise of generative text-to-speech models, distinguishing between real and synthetic speech has become challenging, especially for Arabic, which has received limited research attention. Most spoof detection efforts have focused on English, leaving a significant gap for Arabic and its many dialects. In this work, we introduce the first multi-dialect Arabic spoofed speech dataset. To evaluate the difficulty of the synthesized audio from each model, determine which model produces the most challenging samples, and thereby guide the construction of our final dataset (either by merging audio from multiple models or by selecting the best-performing model), we ran an evaluation pipeline that trained classifiers using modern embedding-based methods combined with classifier heads, classical machine learning algorithms applied to MFCC features, and the RawNet2 architecture. The pipeline further incorporated the calculation of Mean Opinion Score based on human ratings, as well as processing both original and synthesized datasets through an Automatic Speech Recognition model to measure the Word Error Rate. Our results demonstrate that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples. However, relying on a single TTS for dataset creation may limit generalizability.
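
The Word Error Rate mentioned in this abstract is a standard metric: the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the reference length. The sketch below shows the metric only; it is not the authors' pipeline, the function name `word_error_rate` is illustrative, and whitespace tokenization is an assumption that may not suit every Arabic text normalization scheme.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Comparing the WER of original and synthesized audio under the same ASR model gives a rough indication of how intelligible the synthetic speech is.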

2510.06143 2026-06-16 cs.CL Updated version

RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets

Jan Cegin, Branislav Pecher, Ivan Srba, Jakub Simko

Affiliations: Kempelen Institute of Intelligent Technologies

AI summary: Proposes RoSE, which selects the best LLM generator without human test sets by training a small model on each candidate generator's outputs and evaluating it on synthetic examples from the other LLMs; it outperforms other intrinsic heuristics across multiple languages and tasks.

Comments 16 pages; EACL 2026 Main

Abstract

LLMs are powerful generators of synthetic data, which are used for training smaller, specific models. This is especially valuable for low-resource languages, where human-labelled data is scarce but LLMs can still produce high-quality text. However, LLMs differ in how useful their outputs are for training. Selecting the best LLM as a generator is challenging because extrinsic evaluation requires costly human annotations (which are often unavailable for low-resource languages), while intrinsic metrics correlate poorly with downstream performance. We introduce Round robin Synthetic data Evaluation (RoSE), a proxy metric for selecting the best LLM generator without human test sets. RoSE trains a small model on the outputs of a candidate generator (LLM) and then evaluates it on generated synthetic examples from all other candidate LLMs. The final RoSE score is the mean performance of this small model. Across six LLMs, eleven languages, and three tasks (sentiment, topic, intent), RoSE identifies the optimal generator more often than any other intrinsic heuristics. RoSE outperforms intrinsic heuristics and comes within 0.76 percentage points of the optimal generator baseline. This result is measured in terms of downstream performance, obtained by training a small model on the chosen generator's outputs (optimal vs. proxy metric selected) and evaluating it on human-labelled test data. Additionally, RoSE is the only metric to achieve a positive correlation with performance on human test data.
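
The round-robin procedure described in this abstract can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `rose_scores`, `train_first_word`, and `accuracy` are hypothetical names, the "small model" is a toy first-word lookup, and the three synthetic datasets are invented.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


def rose_scores(synthetic: Dict[str, List[Example]],
                train_fn: Callable, eval_fn: Callable) -> Dict[str, float]:
    """For each generator, train on its data and average the scores on all others."""
    scores = {}
    for name, train_set in synthetic.items():
        model = train_fn(train_set)
        scores[name] = mean([
            eval_fn(model, data)
            for other, data in synthetic.items() if other != name
        ])
    return scores


def train_first_word(examples: List[Example]) -> Dict[str, str]:
    """Toy stand-in for a small classifier: map each text's first word to its label."""
    return {text.split()[0]: label for text, label in examples}


def accuracy(model: Dict[str, str], examples: List[Example]) -> float:
    hits = sum(model.get(text.split()[0]) == label for text, label in examples)
    return hits / len(examples)


# Invented synthetic data from three hypothetical generators.
synthetic = {
    "gen_a": [("good film", "pos"), ("bad film", "neg")],
    "gen_b": [("good food", "pos"), ("bad food", "neg")],
    "gen_c": [("nice day", "pos"), ("awful day", "neg")],
}
scores = rose_scores(synthetic, train_first_word, accuracy)
print(scores)
```

In practice `train_fn` would fine-tune a small classifier and `eval_fn` would compute the task metric; the generator with the highest score is the proxy-selected one.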

2510.12306 2026-06-16 cs.CL Updated version

A large-scale pipeline for LLM-assisted corpus annotation: variation and change in the English consider construction

Cameron Morin, Matti Marttinen Larsson

Affiliations: Université Paris-Cité; University of Gothenburg

AI summary: Proposes a four-phase pipeline that uses large language models to annotate large corpora automatically; annotates about 140,000 consider constructions in a historical English corpus with over 98% accuracy, revealing previously undocumented genre-specific trajectories of change.

Abstract

As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable pipeline for automating grammatical annotation in voluminous corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline's accessibility and effectiveness through a diachronic case study of variation in the English evaluative consider construction (consider X as/to be/Ø Y). We annotate 143,933 'consider' concordance lines from the Corpus of Historical American English (COHA) via the OpenAI API in under 60 hours, achieving 98%+ accuracy on two sophisticated annotation procedures. A Bayesian multinomial GAM fitted to 44,527 true positives of the evaluative construction reveals previously undocumented genre-specific trajectories of change, enabling us to advance new hypotheses about the relationship between register formality and competing pressures of morphosyntactic reduction and enhancement. Our results suggest that LLMs can perform a range of data preparation tasks at scale with minimal human intervention, unlocking substantive research questions previously beyond practical reach, though implementation requires attention to costs, licensing, and other ethical considerations.

2602.06015 2026-06-16 cs.CL Updated version

A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies

Panagiotis Kaliosis, Adithya V Ganesan, Oscar N. E. Kjell, Whitney Ringwald, Scott Feltman, Melissa A. Carr, Dimitris Samaras, Camilo Ruggero, Benjamin J. Luft, Roman Kotov, Andrew H. Schwartz

Affiliations: Department of Computer Science; Stony Brook University; College of Connected Computing; Vanderbilt University; Lund University; University of Minnesota; Stony Brook World Trade Center Wellness Program; Renaissance School of Medicine at Stony Brook University; University of Texas at Dallas; Department of Psychiatry

AI summary: Systematically evaluates 11 large language models on PTSD severity estimation; finds that providing detailed definitions and context improves accuracy, that greater reasoning effort improves estimates, and that ensembling strategies are effective.

Comments 24 pages, 5 figures, 5 tables

Abstract

Large language models (LLMs) are increasingly being used in a zero-shot (generative) fashion to assess mental health conditions, yet we have limited knowledge of what factors affect their accuracy. In this study, we use a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting the models' assessment accuracy, we systematically varied (i) contextual knowledge prompted to the models, such as subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs. few-shot, amount of reasoning effort, model sizes, structured subscales vs. direct scalar prediction, output rescaling, and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative, even exceeding human raters' agreement with self-reported scores; (b) increased reasoning effort leads to better estimation accuracy; (c) performance of open-weight models (Llama, DeepSeek) plateaus beyond 70B parameters while closed-weight (gpt-o3-mini, gpt-5) alternatives improve with newer generations; and (d) best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Beyond agreement with self-reports, LLMs' estimates discriminated PTSD severity from depression, anxiety, and alcohol use, and prospectively predicted future mental healthcare expenditure. Together, these results suggest that contextual knowledge and modeling strategies meaningfully affect accuracy and clinical utility of LLM-based assessments of PTSD severity.

2603.07539 2026-06-16 cs.CL Updated version

MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs

Abdessalam Bouchekif, Shahd Gaben, Samer Rashwani, Somaya Eltanbouly, Mutaz Al-Khatib, Heba Sbahi, Mohammed Ghaly, Emad Mohamed

Affiliations: Hamad Bin Khalifa University, Qatar; Nazarbayev University, Kazakhstan

AI summary: Presents the MAWARITH dataset of 12,500 Arabic inheritance cases supporting end-to-end Islamic inheritance reasoning, and designs the MIR-E multi-stage evaluation metric; experiments show a commercial model reaching about 90% while open-source models stay below 50%.

Abstract

Islamic inheritance law is challenging for large language models because solving inheritance cases requires complex, structured, multi-step reasoning and the correct application of juristic rules to compute heirs' shares. We introduce MAWARITH, a large-scale annotated dataset of 12,500 Arabic inheritance cases for training and evaluating models on the full reasoning chain: (i) identifying eligible heirs, (ii) applying blocking (ḥajb) and allocation rules, and (iii) computing exact inheritance shares. To the best of our knowledge, MAWARITH is the first Arabic corpus and benchmark designed for end-to-end Islamic inheritance reasoning. Unlike prior datasets that restrict inheritance case solving to multiple-choice questions, MAWARITH supports the full reasoning chain and provides step-by-step solutions with justifications grounded in classical juristic sources and established inheritance rules, as well as exact share calculations. This enables models to learn how to generate detailed, step-by-step responses to user queries that reflect real-world Islamic inheritance cases. To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline. We evaluate six large language models in a zero-shot setting. A commercial model achieves about 90%, whereas all evaluated open-source models remain below 50%. Our error analysis identifies recurring failure patterns, including scenario misinterpretation, errors in heir identification, errors in share allocation, and missing or incorrect application of key inheritance rules such as 'awl and radd. The MAWARITH dataset is publicly available at https://gitlab.com/nlpresearcher/mawarith.

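2605.18421 2026-06-16 cs.CL cs.AI cs.LG Updated version

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu, Yuhan Li, Miao Peng, Bing Tong, Chen Zhang, Yan Zhou, Jia Li

Affiliations: Hong Kong University of Science and Technology (Guangzhou); Createlink Technology; Beijing University of Posts and Telecommunications; Beijing Institute of Technology

AI summary: Proposes EvoMemBench, which evaluates agent memory from a self-evolving perspective with a unified benchmark built along two dimensions, memory scope and memory content. Comparing 15 memory methods, it finds that current memory systems are not yet a general solution: long-context baselines remain competitive, memory helps most when context is insufficient or tasks are hard, retrieval methods excel on knowledge-intensive tasks, and procedural and long-term memory methods work better when stored experience matches the task structure.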

Abstract

Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.

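2509.22888 2026-06-16 cs.AI cs.CL Updated version

JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory

Louie Hong Yao, Nicholas Jarvis, Tiffany Zhan, Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang

Affiliations: Independent Researcher; University of Cincinnati; Carnegie Mellon University

AI summary: Proposes JE-IRT, a geometric framework that embeds LLMs and questions in a shared space, with direction encoding semantics and norm encoding difficulty; it reveals topical specialization and out-of-distribution behavior, supports efficient extension to new models, and uncovers an internal structure that only partially aligns with human categories.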

Comments 35 pages, 17 figures, 9 tables, accepted to TMLR

Abstract

Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings. This geometry replaces a global ranking of LLMs with topical specialization and enables smooth variation across related questions. Building on this framework, our experimental results reveal that out-of-distribution behavior can be explained through directional alignment, and that larger norms consistently indicate harder questions. Moreover, JE-IRT naturally supports generalization: once the space is learned, new LLMs are added by fitting a single embedding. The learned space further reveals an LLM-internal taxonomy that only partially aligns with human-defined subject categories. We also show that simple linear probes of the embedding space recover cross-subject ability directions, such as an arithmetic axis that highlights quantitatively demanding questions in seemingly distant subjects like virology and global facts. JE-IRT thus establishes a unified and interpretable geometric lens that connects LLM abilities with the structure of questions, offering a distinctive perspective on model evaluation and generalization.


2606.07226 2026-06-16 cs.LG cs.AI cs.CL updated version

DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang, Yaoyu Jiang, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

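Affiliations: Nanjing University; Shanghai Innovation Institute; East China Normal University

AI summary: Proposes DEFINED, a framework that combines a hierarchical eight-dimensional metric system, a pre-trained language model, and a mixed-granularity training strategy to deliver data-efficient, fine-grained automated creativity assessment in debate scenarios, outperforming existing methods.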

Comments Accepted by KDD 2026

DOI: 10.1145/3770855.3817874

Abstract

Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.


2606.09669 2026-06-16 cs.AI cs.CL updated version

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan, Yinpeng Dong

Affiliations: Tsinghua University; Chongqing University; Peking University; ZenoMind AI; Xi’an Jiaotong University; Beijing Institute of Technology; Southeast University; Shanghai Jiao Tong University; Joy Future Academy; The University of Hong Kong

AI summary: Proposes the SpatialWorld benchmark, which integrates eight heterogeneous simulation backends and uses 760 human-annotated tasks to evaluate the interactive spatial understanding of multimodal agents under vision-only partial observability. The strongest model, GPT-5, reaches a task success rate of only 17.4%.

Abstract

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.


2606.15307 2026-06-16 cs.CL cs.AI new submission

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam

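Affiliations: Hamad Bin Khalifa University; Qatar University

AI summary: Proposes a reinforcement learning-based post-training method that combines task-specific rewards with Group Relative Policy Optimization (GRPO) to improve the classification performance and explanation quality of thinking-based multimodal LLMs on hateful and propagandistic meme detection.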

Abstract

Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes and ArMeme benchmarks show that our approach improves over previously reported results on FHM accuracy (up to +2.1%, from 79.9% to 82.0%) and on ArMeme macro-F1 (up to +7.6 points, from 0.536 to 0.612 with explanations; +6.1 compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.
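The abstract above names Group Relative Policy Optimization (GRPO) but does not spell out its mechanics. As a rough illustration only, the sketch below shows the group-relative advantage step commonly associated with GRPO: the rewards of several rollouts sampled for the same prompt are normalized against that group's mean and standard deviation. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this example; they are not taken from the paper, whose task-specific rewards and thinking-length regularization are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one prompt's rollout rewards against the group statistics.

    `rewards` holds the scalar reward of each rollout sampled for the same
    prompt. `eps` (an assumed constant) avoids division by zero when every
    rollout in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must not be empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one meme-classification prompt.
mixed_group = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
uniform_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed_group)
print(uniform_group)
```

A group in which every rollout earns the same reward yields all-zero advantages, so only prompts with mixed outcomes contribute a learning signal under this normalization.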


2606.15335 2026-06-16 cs.CL cs.AI new submission

Privacy-Preserving Text Sanitization for Distributed Agents Collaboration via Disentangled Representations

Xuan Liu, Hefeng Zhou, Sicheng Chen, Chao Yang, Xingcheng Xu, Jingjing Qu, Jiong Lou, Jie LI, Xia Hu

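Affiliations: Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University

AI summary: Proposes DiSan, a framework that disentangles text into task-semantic and style subspaces and combines federated prototype alignment with adversarial regularization to protect privacy in distributed multi-agent collaboration, substantially reducing stylometric attribution and PII leakage.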

Abstract

When distributed agents exchange text across organizational boundaries, privacy leakage arises not only from explicit identifiers but also from distributional signatures such as formatting conventions, vocabulary choices, and syntactic patterns. We propose DiSan (Disentangled Sanitization), a privacy-preserving sanitization framework and a built-in component of Intern-Shannon for multi-agent collaboration. DiSan uses a two-stream encoder to factorize text into a source-invariant role subspace that preserves task semantics and a source-identifying style subspace that remains local. Federated prototype alignment and adversarial regularization enable joint training without centralizing raw text. Experiments show that identifier-level masking is insufficient: masking 19.2% of tokens reduces TF-IDF stylometric attribution by only 18.6%. By contrast, DiSan reduces answer-level PII exposure by a factor of 20 while maintaining 83% answer faithfulness on a distributed multi-agent RAG benchmark, and lowers Enron stylometric attribution by 73.2% under TF-IDF and 70.6% under a neural probe.


2606.15396 2026-06-16 cs.CL cs.AI new submission

CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

Wenbo Yu, Bohua Wang, Hao Fang, Kuofeng Gao, Jingru Zeng, Xiaochen Yang, Tianyi Zhang, Xiaoxiao Ma, Jiawei Kong, Hao Wu, Bin Chen, Shu-Tao Xia, Min Zhang

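Affiliations: Tsinghua University; Beijing Normal University; South China University of Technology; Harbin Institute of Technology, Shenzhen; Shenzhen ShenNong Information Technology Co., Ltd.

AI summary: For Chinese scenarios, proposes a fine-grained risk taxonomy (5 macro and 31 micro categories), generates high-quality training data through a scalable data construction pipeline, and trains CHILLGuard with Model-aware Direct Preference Optimization, improving the F1 score on the authors' benchmark by 15.92%.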

Abstract

Malicious content generated from large language models (LLMs) could pose severe safety risks and ethical concerns. While existing LLM safety guardrails excel in English or multilingual settings, they lack adaptation to Chinese-specific regulatory policies, cultural context and linguistic nuances, failing to support fine-grained risk classification for diverse deployment needs. In this paper, we introduce a 5-macro, 31-micro category fine-grained risk taxonomy for Chinese scenarios, and build CHILLGuard: a dedicated Chinese LLM content safety guardrail. To address the critical scarcity of high-quality annotated Chinese safety data, we propose a scalable multi-stage data construction pipeline: we expand multi-source corpus via retrieval-augmented generation, generate implicit harmful samples through prompt engineering rewriting, and refine high-quality data via multi-model voting-based label calibration. Based on this, we build CHILLGuardTrain, a large-scale training set with 405,007 samples, and CHILLGuardTest, a rigorously curated annotated test set with 51,745 samples. We then train CHILLGuard on CHILLGuardTrain under a generator-classifier collaborative framework via Model-aware Direct Preference Optimization. Extensive experiments under multiple settings demonstrate the state-of-the-art performance of CHILLGuard, e.g., a 15.92% improvement of F1 score over Qwen3Guard-8B-Strict on our benchmark. We will release our resources at https://github.com/cswbyu/CHILLGuard.


2606.15517 2026-06-16 cs.CL new submission

SHARD: Safe and Helpful Alignment via Self-Reframing Distillation

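AI summary: Adopts a materials-science framing that treats LLM sycophancy as material failure under pushback load. Using 14 axis measurements across three loading cases (debate, false presuppositions, ethical setting) and 7,800 specimens in total, it shows that the failure mode depends on the loading type and finds differences in cross-judge reliability.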

Comments 12 pages, 3 figures. Code, data, and pre-registrations: https://github.com/FerdinandSchessl/sycophancy-note-companion

Abstract

Sycophancy in LLMs is documented across 70+ papers, but expert agreement on construct boundaries remains low (ICC=.184; Ye et al., 2026). The construct fragments because behavioral classification depends on which surface form is privileged. We adopt a materials-science framing: conversation as test specimen under load, LLM-model as material charge, pushback as progressive load, stance-flip as material failure. We characterize this failure across three loading cases (debate n=1000; false-presuppositions n=3400; ethical-setting n=3400; 10-17 material charges per case; 7800 specimens total) using 14 turn-level axis-measurements spanning velocity, damage accumulation, frame-drift, brittleness, and direction stability, plus three speaker-resolved axes from an independent pipeline. The measurements are Hooke-coupled ($\sigma = E \cdot \varepsilon$ analog) and reproduce across loading cases with effects up to $|r_{rb}| = 0.35$ on debate; the sign structure adds a second pattern: the ethical-setting case inverts the velocity and accumulation blocks. Variance composition partitions into two profiles: debate is charge-dominated (brittle-fracture-like: the material grade decides), false-presuppositions and ethical-setting are topic-dominated (creep-like: the load decides); the ratios (2.03 vs 0.13/0.17) are estimator-dependent, for debate even in direction. Cross-judge reliability (GPT-4o vs Haiku 4.5) shows debate scoring is judge-robust (Cohen's $\kappa = 0.88$) while false-presupposition scoring is judge-sensitive ($\kappa = 0.36$), a caveat single-judge benchmarks must report. This is the methodological move Ye et al.'s diagnosis calls for: a multi-axis characterization that does not depend on which surface form of the construct one privileges.
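The cross-judge reliability figures in the abstract above are reported as Cohen's kappa. As a minimal sketch of that statistic, assuming two judges who each assign one categorical label per specimen, the code below computes observed agreement, chance agreement from each judge's label frequencies, and the resulting kappa. The function name and the toy labels are hypothetical and are not drawn from the paper's data or code.

```python
from collections import Counter


def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two judges labelling the same items."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must label the same non-empty item list")
    n = len(judge_a)
    # Fraction of items on which the two judges agree.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Agreement expected by chance from each judge's marginal label rates.
    freq_a = Counter(judge_a)
    freq_b = Counter(judge_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical stance-flip verdicts (1 = flip, 0 = no flip) from two judges.
judge_one = [1, 1, 0, 0]
judge_two = [1, 1, 0, 1]
print(cohens_kappa(judge_one, judge_two))
```

Kappa discounts the agreement two judges would reach by chance, which is why it can be much lower than raw percent agreement when one label dominates.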


2606.16801 2026-06-16 cs.CL new submission

The Art of Mixology: Mixup-based Obfuscation for Privacy-Preserving Split Learning in Large Language Models

Chen Chen, Xiang Gao, Xianshun Wang, Chengran Li, Shengyu Xia, Xueluan Gong, Linru Zhang, Qian Wang, Kwok-Yan Lam

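Affiliations: College of Computing and Data Science, Nanyang Technological University, Singapore; School of Cyber Science and Engineering, Wuhan University, China

AI summary: Proposes MIXGUARD, a framework that uses token-level obfuscation, representation-level obfuscation, and adaptive gradient perturbation to protect privacy while preserving model utility. Experiments show it outperforms existing methods.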

Comments 19 pages, 5 figures

Abstract

Split learning provides a practical paradigm for resource-constrained users to train Large Language Models (LLMs) by offloading computation-intensive layers to a server while keeping raw data local. However, existing privacy-preserving split learning methods still face a difficult trade-off among utility, privacy, efficiency, and stability. Specifically, these methods often suffer from substantial utility degradation, remain vulnerable to advanced data reconstruction attacks, incur prohibitive computational and communication overhead, or exhibit unstable performance across different tasks. In this paper, we propose MIXGUARD, a novel mixup-based privacy-preserving split learning framework for LLMs. MIXGUARD introduces token-level obfuscation, representation-level obfuscation, and adaptive gradient perturbation mechanisms, which operate jointly to preserve useful learning signals while preventing privacy leakage to the server. Technically, MIXGUARD first constructs a lightweight calibration model on a public dataset to refine the approximated target representation, and then applies this model during privacy-preserving fine-tuning on private data. We conduct extensive experiments on four classification tasks and four text generation tasks across multiple LLM families, model sizes, architectures, and fine-tuning strategies. The results show that MIXGUARD preserves model utility comparable to non-split training baselines, consistently achieves stronger privacy protection than existing split learning defense methods against state-of-the-art data reconstruction attacks, and remains robust under adaptive attack settings.


2606.16821 2026-06-16 cs.CL cs.CR cs.CY cs.IR new submission

How Much Can We Trust LLM Search Agents? Measuring Endorsement Vulnerability to Web Content Manipulation

Yimeng Chen, Zhe Ren, Firas Laakom, Yu Li, Dandan Guo, Jürgen Schmidhuber

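Affiliations: Center of Excellence for Generative AI, KAUST; Jilin University; Zhejiang University; The Swiss AI Lab, IDSIA-USI/SUPSI; NNAISENSE

AI summary: Proposes SearchGEO, a framework that uses five attack modes and output-level metrics to evaluate the endorsement vulnerability of 13 LLM backends to web content manipulation. Attack success rates range from 0% to 31.4%, and backends differ markedly in their sensitivity to attack modes and in how the deployment scaffold affects them.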

Comments 23 pages, 3 figures

Abstract

Large language model (LLM)-based search agents synthesize open-web content into actionable recommendations on behalf of users, creating a risk that attacker-published pages are transformed into endorsed claims. We introduce SearchGEO, a controlled evaluation framework for measuring endorsement corruption in LLM-based web-search agents, combining a web-evidence manipulation pipeline, a five-mode attack taxonomy, and multiple output-level metrics. We evaluate 13 LLM backends on 308 cases each. Results show that vulnerability patterns vary across backends: overall attack success rate (ASR) ranges from 0.0% on Claude-Sonnet-4.6 to 31.4% on Gemini-3-Flash, the strongest attack mode differs by model family, and the same deployment scaffold could amplify or decrease ASR on different backends. An auxiliary agent-skill probe, where endorsement becomes an install command, exposes a sharp split among otherwise robust backends: Claude over-rejects while GPT over-trusts. These findings argue for treating recommendation reliability under adversarial search content as a first-class dimension of backend safety evaluation.


2606.15980 2026-06-16 cs.LG cs.AI cs.CL cross-list

Do Safety Monitors Stay Reliable After an Update? Benchmarking and Predicting Activation-Monitor Staleness

Evan Duan

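Affiliations: University of Michigan

AI summary: Studies whether activation monitors remain reliable after a language model is updated. Quantization updates have little effect, fine-tuning updates often make monitors stale, and the degradation can be predicted from pre-deployment features.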

Abstract

Activation monitors, lightweight probes trained on a language model's internal representations, are an increasingly common layer in deployment safety stacks. Deployed models, however, are rarely static: they are quantized, fine-tuned, adapted with LoRA, or served with merged adapters while the monitor remains frozen. We present the first systematic test of whether this implicit contract holds: whether activation monitors trained on a base model remain reliable after these routine model updates. Across multiple safety-relevant monitors, model depths, update families, and open-weight models, we find a sharp split: quantization-style updates largely preserve frozen probe performance, while fine-tuning-style updates frequently make probes stale. Fragility is highly monitor-dependent, with privacy/PII probes most affected and refusal-compliance probes comparatively stable, showing that retraining a behavior need not make its corresponding monitor stale. QLoRA is especially damaging despite NF4 quantization alone being relatively benign, suggesting that quantization becomes riskier when combined with adaptation. We further show that degradation is predictable from pre-deployment features, enabling revalidation budgets to be triaged toward the monitors most likely to fail. These results suggest that fine-tuning should trigger activation-monitor revalidation by default, while prediction can help prioritize which monitors to check first.


2606.16100 2026-06-16 cs.CR cs.CL cs.LG cross-list

Misinformation Propagation in Benign Multi-Agent Systems

Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp

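AI summary: Studies how intent-based misinformation propagates through agent interactions in benign multi-agent systems. Multi-agent debate reduces the performance degradation relative to single-agent prompting, and robustness depends on group composition and decision protocol.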

Comments 20 pages, 8 figures, 1 table

Abstract

Multi-agent systems, in which multiple large language model agents solve problems through turn-based interaction, are increasingly deployed in high-stakes settings such as medical diagnosis, legal analysis, and forensic decision-making. Their reliability can be at risk when single agents reason from incorrect or misleading context, e.g., from tool calls, since errors may propagate through agent interactions. This work studies this risk by injecting intent-based misinformation into benign single-agent and multi-agent systems across reasoning, knowledge, and alignment tasks. We find that misinformation can degrade single-agent performance and persists across multi-agent debate, with agents often retaining answers introduced by misinformed peers. Nevertheless, multi-agent debate reduces the resulting performance degradation compared to single-agent prompting, especially when most agents are not exposed to misinformation. Robustness depends on group composition and decision protocol. Consensus can be more stable than voting under peer pressure, while majorities can often steer misinformed agents back toward correct answers. Our results show that misinformation robustness in multi-agent systems depends on the underlying model and also on how agents exchange information and aggregate decisions.


2501.19337 2026-06-16 cs.CL cs.CV updated version

Token-Level Entropy Reveals Demographic Disparities in Language Models

Messi H. J. Lee

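Affiliations: Independent Researcher

AI summary: Measuring full-vocabulary Shannon entropy at temperature zero, the study finds that Black-associated names produce higher first-token entropy than White-associated names, and that women-associated names produce lower entropy and more homogeneous outputs than men-associated names. Race and gender effects are additive, instruction tuning does not attenuate the race gap, and probing with explicit group labels yields no significant race effect.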

Comments 9 pages

Abstract

We ask whether demographic identity, signaled by a name alone, systematically reshapes the generative distribution of a language model. Measuring full-vocabulary Shannon entropy at temperature zero across six open-weight base models and 5,760 implicit sentence-completion prompts (e.g., "Tanisha walked into the office on a Monday morning and"), we find that Black-associated names produce higher first-token entropy than White-associated names across all six architectures, the opposite of the output-level homogeneity bias documented under explicit demographic prompting (Lee et al., 2024), and that Black-associated names always produce greater entropy above identity-neutral baselines than White-associated names ($\Delta\Delta > 0$ in all six models). Women-associated names co-occur with lower first-token entropy (DL-pooled $\hat{\beta} = -0.041$, $p = .019$) and more homogeneous outputs ($\hat{\alpha} = +0.024$, $p < .001$) than men-associated names, a pattern convergent with homogeneity bias; race and gender effects are additive. Instruction tuning does not attenuate the race gap (matched-format DL-pooled $\hat{\beta} = +0.153$). Running the same templates with explicit group labels instead of names yields null race effects in 10 of 12 models where implicit probing is significant, establishing that probing methodology is a primary determinant of which distributional structure is recovered.
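The central measurement in the abstract above is the Shannon entropy of the model's first-token distribution. As a small sketch of that quantity, assuming access to a vector of next-token logits, the code below converts logits to probabilities with a softmax and computes the entropy over the whole vocabulary. The logit values and function names are made up for illustration, and the unit is nats; the paper may use a different base or pipeline.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical first-token logits over a toy four-token vocabulary.
flat_logits = [0.0, 0.0, 0.0, 0.0]    # model is indifferent between tokens
peaked_logits = [8.0, 0.0, 0.0, 0.0]  # model strongly prefers one token

print(shannon_entropy(softmax(flat_logits)))
print(shannon_entropy(softmax(peaked_logits)))
```

Higher entropy means probability mass is spread across more candidate first tokens, so comparing this value across prompts that differ only in the name isolates how the name changes the distribution.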


2502.05163 2026-06-16 cs.CL cs.LG updated version

Enhancing LLM Safety Through a Theoretical Minimax Game Lens

Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li

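Affiliations: University of California, Los Angeles; VirtueAI; University of Illinois Urbana-Champaign

AI summary: Proposes a minimax reinforcement learning framework in which a data generator and a classifier co-evolve to produce high-quality multilingual safety data, with a theoretical proof of convergence to a Nash equilibrium. A smaller model surpasses the state of the art by nearly 10% on English benchmarks with 4.5x faster inference.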

Comments 24 pages, 9 figures, 5 tables

Abstract

The rapid advancement of large language models (LLMs) necessitates effective mechanisms to ensure their responsible deployment by accurately distinguishing unsafe content from benign content. While substantial safety datasets are available in English, multilingual safety modeling remains underexplored due to limited open-source safety datasets in other languages. Even within English datasets, safe yet sensitive corner-case content is scarce, leading to shortcut learning by models and non-trivial false-positive rates. To mitigate these issues, we introduce a novel minimax reinforcement learning (RL) framework wherein a data generator and a classifier model co-evolve, facilitating the production of high-quality synthetic multilingual safety data. We theoretically formalize this interaction as a minimax game and rigorously demonstrate convergence to a Nash equilibrium. Empirical evaluations confirm that our synthetic data generation method significantly enhances the classifier model performance, enabling a substantially smaller model to surpass the state-of-the-art by nearly 10% on English benchmarks while achieving 4.5x faster inference speed. These results establish a scalable and efficient methodology for synthetic data generation, advancing the development of safer and more robust multilingual LLM deployments.


2502.08266 2026-06-16 cs.CL cs.AI cs.LG updated version

Dealing with Annotator Disagreement in Hate Speech Classification

Somaiyeh Dehghan, Mehmet Umut Sen, Berrin Yanikoglu

Affiliations: Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey; Center of Excellence in Data Analytics (VERIM), Sabanci University, Istanbul, Turkey

AI summary: Studies the effect of annotator disagreement on hate speech classification, evaluates aggregation methods such as majority voting, uses perceived strength to enhance classification performance, and sets new state-of-the-art results on Turkish tweets.

Comments 19 pages, 4 Tables

Abstract

Hate speech detection is a crucial task, especially on social media where harmful content can spread quickly. Collecting social media content (tweets etc.) to train machine learning models is easy, but detecting and categorizing hate speech can be difficult due to its inherently subjective nature. This subjectivity leads to frequent disagreement among annotators, particularly for subtle or borderline content. Traditional approaches either discard non-consensus samples or force a "gold standard" through expert adjudication, ignoring valuable information about uncertainty and diverse human perspectives. We examine the largely overlooked problem of annotator disagreement in hate speech classification, evaluate a range of aggregation methods, including majority voting and ordinal strategies (minimum, maximum, and mean), and analyze their impact across binary, 4-class, and 6-class classification tasks. In addition, we leverage annotators' perceived hate speech strength scores to explore regression-based and hybrid modeling approaches. Among other findings, we show that filtering non-consensus samples yields over-optimistic results and that the perceived strength provides a complementary signal that enhances classification performance. Finally, we establish new state-of-the-art results for hate speech detection in Turkish tweets, and demonstrate that annotator disagreement, when properly modeled, is a valuable resource for building more robust and reliable systems.
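
The aggregation strategies named in this abstract (majority voting and the ordinal minimum, maximum, and mean) can be illustrated with a minimal sketch. It assumes labels are ordinal integers (0 = no hate, higher = stronger); the function name `aggregate` and the tie-breaking rule are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from statistics import mean

def aggregate(labels, strategy="majority"):
    """Collapse one item's annotator labels (ordinal ints) into a single label."""
    if strategy == "majority":
        counts = Counter(labels)
        top = max(counts.values())
        # Assumption: ties are broken in favor of the higher (more severe) label.
        return max(label for label, c in counts.items() if c == top)
    if strategy == "min":
        return min(labels)
    if strategy == "max":
        return max(labels)
    if strategy == "mean":
        # Round the mean to the nearest class (Python rounds exact halves to even).
        return round(mean(labels))
    raise ValueError(f"unknown strategy: {strategy}")

# Four annotators disagree on one tweet.
votes = [0, 2, 2, 3]
results = {s: aggregate(votes, s) for s in ("majority", "min", "max", "mean")}
print(results)
```

The same item receives label 0 under the minimum strategy and label 3 under the maximum strategy, which is why the choice of aggregation rule can change the training target for borderline content.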

2503.23688 2026-06-16 cs.CL cs.HC updated version

Mapping Geopolitical Bias in 11 Large Language Models: A Bilingual, Dual-Framing Analysis of U.S.-China Tensions

William Guey, Wei Zhang, Pierrick Bougault, Vitor D. de Moura, José O. Gomes

Affiliations: Department of Industrial Engineering, Tsinghua University; School of Social Sciences, Tsinghua University; Department of Industrial Engineering, Federal University of Rio de Janeiro

AI summary: Introduces balanced keying from survey psychometrics to quantify the stance of 11 models in 2 languages on U.S.-China geopolitical issues; finds that developer origin, query language, and issue domain are three near-equal, additive factors, and that every model leans more pro-China in Chinese.

Comments 37 pages, 6 main-text figures, 12 supplementary figures, 5 supplementary tables; supplementary information included

Abstract

Large language models are how hundreds of millions of people now encounter contested political questions, raising a subtle measurement problem: a model that simply agrees with whatever it is told can masquerade as biased, contaminating any claim that models hold political opinions. We address this by importing balanced keying from survey psychometrics, posing each proposition and its swapped reverse and signing the response so acquiescence cancels and genuine conviction accumulates. The result is a reproducible, quantitative instrument that maps geopolitical stance across 11 models and 2 languages (19,712 responses). Developer origin, query language and issue domain emerge as three near-equal, additive factors; every model, including those built in the United States, leans more Pro-China in Mandarin; and two models with identical agreement bias are told apart, one neutral, one biased. We release it as an open, interactive tool that extends to any contested-opinion domain.
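
The idea of balanced keying can be sketched in a few lines: each proposition is posed together with its swapped reverse, and the two responses are combined so that blanket agreement cancels. The sketch below assumes responses on a -2 (strongly disagree) to +2 (strongly agree) scale; the scale, the function name `balanced_key`, and the half-difference/half-sum decomposition are assumptions for illustration, not the paper's exact scoring.

```python
def balanced_key(forward, reverse):
    """Split a (proposition, swapped proposition) response pair into two parts.

    forward: agreement with the original proposition, from -2 to +2
    reverse: agreement with the swapped proposition, from -2 to +2
    """
    stance = (forward - reverse) / 2        # survives only if the answers differ
    acquiescence = (forward + reverse) / 2  # tendency to agree with both framings
    return stance, acquiescence

# A model that agrees with whatever it is told: no stance, pure acquiescence.
yes_sayer = balanced_key(2, 2)
# A model with a consistent position: full stance, no acquiescence.
committed = balanced_key(2, -2)
print(yes_sayer, committed)
```

Under this decomposition, two models with the same average agreement can still be told apart by their stance scores, which matches the distinction the abstract draws between a neutral and a biased model.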

2505.14418 2026-06-16 cs.CL updated version

Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents

Pengzhou Cheng, Haowen Hu, Zheng Wu, Zongru Wu, Tianjie Ju, Zhuosheng Zhang, Gongshen Liu

Affiliations: School of Computer Science, Shanghai Jiao Tong University; Huawei Inc.

AI summary: Proposes the AgentGhost framework, which combines goal-level and interaction-level triggers and injects backdoors via supervised contrastive learning and fine-tuning, achieving high success rates and stealthiness in mobile GUI agents, with attack accuracy reaching 99.7%.

Comments EMNLP-Findings 2025 (Correcting model settings)

Abstract

Graphical user interface (GUI) agents powered by multimodal large language models (MLLMs) have shown great promise for human interaction. However, due to the high fine-tuning cost, users often rely on open-source GUI agents or APIs offered by AI providers, which introduces a critical but underexplored supply chain threat: backdoor attacks. In this work, we first unveil that MLLM-powered GUI agents naturally expose multiple interaction-level triggers, such as historical steps, environment states, and task progress. Based on this observation, we introduce AgentGhost, an effective and stealthy framework for red-teaming backdoor attacks. Specifically, we first construct composite triggers by combining goal and interaction levels, allowing GUI agents to unintentionally activate backdoors while ensuring task utility. Then, we formulate backdoor injection as a Min-Max optimization problem that uses supervised contrastive learning to maximize the feature difference across sample classes in the representation space, improving the flexibility of the backdoor. Meanwhile, it adopts supervised fine-tuning to minimize the discrepancy between backdoor and clean behavior generation, enhancing effectiveness and utility. Extensive evaluations of various agent models on two established mobile benchmarks show that AgentGhost is effective and generic, with attack accuracy that reaches 99.7% on three attack objectives, and shows stealthiness with only 1% utility degradation. Furthermore, we tailor a defense method against AgentGhost that reduces the attack accuracy to 22.1%. Our code is available at anonymous.

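2506.18756 2026-06-16 cs.CL cs.CR updated version

Semantic-Preserving Prompt Hijacking: A Black-Box Adversarial Attack on Auto-Prompt Optimization

Chong Zhang, Xiang Li, Jia Wang, Shan Liang, Haochen Xue, Xiaobo Jin

Affiliations: Jiangsu universities; NSFC; RDF-24-01-016

AI summary: Proposes an adaptive greedy local search method under black-box conditions that hierarchically decomposes the input text, masks key language units, and dynamically adjusts candidate words, maximizing the deviation of the model output from the original intent while preserving semantic similarity; it achieves a higher attack success rate across 2,400+ test cases.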

Comments 12 pages, 8 figures. Accepted by the IEEE International Conference on Multimedia and Expo (ICME 2026)

Abstract

LLMs increasingly integrate auto-suggestion optimization modules, enabling them to rewrite and display user input before generating the final response. While this design aims to enhance transparency and trust, its process of autonomously selecting a single best result from multiple candidate solutions allows attackers to hijack this optimization process by inducing subtle, imperceptible semantic shifts. To this end, we propose a semantic-preserving hijacking attack method under black-box conditions: Adaptive Greedy Local Search. This method hierarchically decomposes the input text, masks key language units, and dynamically adjusts candidate replacement words at predefined semantic checkpoints. This maximizes the deviation between the model output and the original intent while strictly maintaining semantic similarity to the original text. Experimental results on commercial and open-source LLMs demonstrate that, under the same semantic similarity constraints, this method achieves a higher attack success rate than existing attack methods in over 2,400 test cases. Code is available at: https://github.com/franz-chang/DOBS

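2510.06445 2026-06-16 cs.CL cs.AI cs.CR updated version

A Survey on Agentic Security: Applications, Threats and Defenses

Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, Md Rizwan Parvez

Affiliations: BRAC University; Qatar Computing Research Institute (QCRI)

AI summary: Presents the first holistic survey of agentic security, organized around the three pillars of applications, threats, and defenses; it classifies more than 260 papers, analyzes attack entry points, defense strategies, and lifecycle coverage, and concludes that agentic systems are structurally fragile by default and need full-lifecycle defenses.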

Abstract

LLM-based agents are now used throughout cybersecurity. While these agents facilitate powerful and autonomous security applications, their autonomy opens up new attack surfaces, and the security community is actively building defenses to secure them. Yet the literature on this subject has grown quickly and unevenly. Existing surveys treat applications, threats, and defenses in isolation, leaving no unified account of how an agent's capabilities, vulnerabilities, and countermeasures interconnect. In this work we present the first holistic survey of the agentic security landscape, structuring the field around the fundamental pillars of Applications, Threats and Defenses. We provide a comprehensive taxonomy of over 260 papers, explaining how agents are used in downstream cybersecurity applications, inherent threats to agentic systems, and countermeasures designed to protect them. In addition, we provide detailed pillar-specific and cross-cutting analyses that show the security-lifecycle coverage of agentic applications, comparison between red-teaming and blue-teaming agents, and the adversarial use of red-teaming applications. On the threat side, we analyze the entry points and agent-loop stages that attacks target, their specificity to the agentic setting, and the threat models they assume. On the defense side, we analyze the prevailing defense strategies, their cost and security trade-offs, and where in the agent lifecycle they are deployed. We further map which defenses cover which attack classes and chart trends in agent architecture, backbone model usage, data modality coverage, and the growth of attack and defense research over time. Taken together, these findings indicate that agentic systems are structurally fragile by default and that securing them will require defenses that span the full agent lifecycle rather than single-layer fixes.

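2510.17431 2026-06-16 cs.CL updated version

Agentic Reinforcement Learning for Search Misaligns Instruction-Tuning

Yushi Yang, Shreyansh Padarha, Sarah Ball, Andrew Lee, Adam Mahdi

Affiliations: University of Oxford; Ludwig-Maximilians-Universität München; Harvard University

AI summary: Studies how agentic reinforcement learning (RL) affects the alignment of instruction-tuned models; finds that RL-trained models deflect harmful requests into benign search queries but produce multi-step unsafe search behavior under a trigger condition, and proposes representation-guided RL training to restore alignment.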

Abstract

Agentic reinforcement learning (RL) trains large language models to use tools, but its impact on alignment is poorly understood. We study how agentic RL for search affects the alignment of instruction-tuned (IT) models. We find that RL-trained models inherit refusal reasoning by deflecting harmful requests into benign search queries, but this breaks down under a simple diagnostic trigger that elicits a search call before refusal can occur. Under this condition, RL models produce multi-step unsafe search actions and reasoning, reducing search query safety by up to 68.6% in Qwen and Llama models relative to their IT counterparts. The effect generalises across model families, scales, and RL algorithms. To understand why, we identify linear directions in the residual stream that control search query safety, and show that RL training progressively shifts search behaviour toward the harmful end of this direction. We thus propose representation-guided RL training, which adds a reward penalty based on projection toward the harmful search direction. Training on benign data alone, it restores IT-level alignment without reducing task accuracy and requires no additional training data. Together, our work provides the first framework for diagnosing, mechanistically analysing, and mitigating alignment degradation in agentic RL for search.
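
The reward penalty described in this abstract, based on projection toward a harmful search direction, can be sketched as follows. This is a minimal illustration under stated assumptions: the hidden state and direction are plain lists of floats, and the function name `penalized_reward`, the `weight` coefficient, and the choice to penalize only positive projections are hypothetical, not the authors' formulation.

```python
import math

def penalized_reward(task_reward, hidden, direction, weight=1.0):
    """Subtract a penalty proportional to the projection of a hidden state
    onto a (hypothetical) harmful-search direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    # Scalar projection of the hidden state onto the unit direction.
    projection = sum(h * d for h, d in zip(hidden, direction)) / norm
    # Assumption: only movement toward the harmful end (positive side) is penalized.
    return task_reward - weight * max(0.0, projection)

direction = [1.0, 0.0]
toward_harm = penalized_reward(1.0, [3.0, 4.0], direction, weight=0.5)
away_from_harm = penalized_reward(1.0, [-3.0, 4.0], direction, weight=0.5)
print(toward_harm, away_from_harm)
```

Because the penalty depends only on the model's internal representation, a shaping term of this kind needs no harmful training prompts, which is consistent with the abstract's claim of training on benign data alone.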

2604.17301 2026-06-16 cs.CL cs.AI cs.HC cs.IR cs.LG updated version

RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation

Juhyeon Lee, Wonduk Seo, Junseo Koh, Seunghyun Lee, Haihua Chen, Yi Bu

Affiliations: Peking University; Enhans; University of North Texas

AI summary: Proposes the RoTRAG framework, which retrieves external moral norms (RoTs) to strengthen LLM-based harm detection in multi-turn dialogue through norm-grounded reasoning and classification, with an average F1 gain of about 40% and an 8.4% reduction in distributional error.

Comments Accepted by SIGIR-ICTIR 2026, Oral Presentation

DOI: 10.1145/3805713.3820397
Journal ref: Proceedings of the 2026 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR '26), July 25, 2026, Melbourne, VIC, Australia. ACM, New York, NY, USA, 12 pages

Abstract

Detecting harmful content in multi-turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models' internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval-augmented framework that incorporates concise human-written moral norms, called Rules of Thumb (RoTs), into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context. Experiments on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.

2606.02493 2026-06-16 cs.CL updated version

Not What, But How: A Framework for Auditing LLM Responses across Positioning, Generalization, Anthropomorphism, and Maxims

Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh, Isabelle Augenstein

Affiliations: University of Copenhagen; KAIST

AI summary: Proposes the FRANZ framework, which audits how LLMs frame responses to subjective questions along four dimensions (cultural positioning, generalizing language, anthropomorphic cues, and adherence to conversational maxims), and builds the SQUARE corpus for empirical analysis.

Comments 34 pages, 19 Figures, 4 Tables

Abstract

Large language models (LLMs) are increasingly used to answer subjective, information-seeking questions, where users are sensitive to how responses are communicated, not just whether the answers are correct. Existing LLM evaluations for subjective cultural queries largely focus on factual correctness, ignoring how the response is framed. To this end, we introduce FRANZ, an automated FRAmework for respoNse characteriZation, to conduct a communicative audit of LLM responses along four dimensions: cultural positioning, use of generalizing language, anthropomorphic cues, and adherence to conversational maxims. To enable this evaluation, we contribute SQUARE, a corpus of 376k subjective questions sourced from 57 subreddits and mapped to 7 countries and 19 question categories. We demonstrate FRANZ's applicability by scoring responses from three open-weight LLMs. We observe that LLMs show statistically significant differences in the frequency with which they employ each response characteristic. Unlike single-dimensional audits, FRANZ reveals that insider positioning and anthropomorphism are positively coupled, with the degree of coupling varying by country, providing a diagnostic lens for identifying framing divergences.

2606.12291 2026-06-16 cs.CL updated version

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

Hongjian Zhou, Xinyu Zou, Jinge Wu, Sean Wu, Junchi Yu, Bradley Max Segal, Tobias Erich Niebuhr, Sara Amro, Michael Petrus, Sheikh Momin, Alexandra M. Cardoso Pinto, Rachel Niesen, Laura Sophie Wegner, Dhruv Darji, Jung Moses Koo, Joshua Fieggen, Kapil Narain, Mingde Zeng, Lei Clifton, Linda Shapiro, Fenglin Liu, David A. Clifton

Affiliations: University of Oxford; University of Washington; University College London; University of Waterloo

AI summary: Proposes the MedMisBench benchmark, which injects misleading context to test the epistemic resilience of large language models in medical settings; model accuracy falls from 71.1% to 38.0%, and authority-framed falsehoods reach a 69.5% attack success rate.

Abstract

Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
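
One common way to compute an attack success rate in this kind of setup is to count how many initially correct answers flip to incorrect once misleading context is injected. The sketch below uses that definition as an assumption; the abstract does not spell out the exact formula, and the function name and toy data are hypothetical.

```python
def attack_success_rate(correct_before, correct_after):
    """Fraction of initially correct answers that become incorrect after injection.

    correct_before, correct_after: parallel lists of bools, one entry per question.
    """
    # Keep only the questions the model answered correctly without the injection.
    after_for_correct = [a for b, a in zip(correct_before, correct_after) if b]
    if not after_for_correct:
        return 0.0
    flipped = sum(1 for a in after_for_correct if not a)
    return flipped / len(after_for_correct)

# Toy data: 4 of 5 questions are answered correctly at first; 2 of those 4 flip.
before = [True, True, True, True, False]
after = [True, False, False, True, False]
asr = attack_success_rate(before, after)
accuracy_before = sum(before) / len(before)
accuracy_after = sum(after) / len(after)
print(asr, accuracy_before, accuracy_after)
```

Reporting the flip rate next to raw accuracy separates what a model knows from whether it keeps that judgment under pressure, which is the distinction the benchmark is built around.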

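2408.05568 2026-06-16 cs.AI cs.CL cs.CY stat.AP updated version

SAMark: A Self-Anchored Text Watermark Robust to Paragraph-Level Paraphrasing

Jiahao Huo, Wenjie Qu, Yibo Yan, Kening Zheng, Jiaheng Zhang, Xuming Hu, Philip S. Yu, Mingxun Zhou

Affiliations: University of Science and Technology of China

AI summary: Proposes SAMark, a self-anchored watermarking framework that establishes a step-independent green region in semantic space that does not depend on sentence order, combined with a multi-channel hyperbolic scoring mechanism and a diversity-aware filtering strategy; it achieves high detection rates under paragraph-level paraphrasing attacks and breaks the robustness-quality trade-off.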

Abstract

Semantic-level watermarking (SWM) improves robustness against text modifications by treating sentences as the basic unit. However, robustness to paragraph-level paraphrasing remains difficult because such attacks globally disrupt watermark signals by changing sentence order. In this work, we propose SAMark, a self-anchored watermarking framework that removes the dependency on sentence order by establishing a step-independent green region in semantic space. To improve detectability, we introduce a multi-channel hyperbolic scoring mechanism that amplifies watermark signals while suppressing noise from weakly aligned candidates. We further propose a diversity-aware filtering strategy that combines hard filtering with soft regularization, extending beyond simple n-gram repetition filters to address semantic redundancy. Experimental results show that SAMark achieves up to 90.2% TP@FP1% under typical paragraph-level paraphrasing attacks, outperforming the strongest prior baseline by more than 30% on average, while maintaining generation quality competitive with unwatermarked text and breaking the robustness-quality trade-off that limits prior methods.
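
The TP@FP1% figure is the true-positive rate of the detector at a threshold that flags at most 1% of unwatermarked texts. A minimal sketch of that computation, assuming higher detector scores mean "more likely watermarked" (the function name and the toy scores are illustrative, not from the paper):

```python
import math

def tpr_at_fpr(pos_scores, neg_scores, max_fpr=0.01):
    """True-positive rate at the strictest threshold whose false-positive
    rate on neg_scores does not exceed max_fpr."""
    ranked = sorted(neg_scores, reverse=True)
    # Number of unwatermarked samples that may be flagged by mistake.
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    # Scores strictly above the threshold are flagged as watermarked.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    return sum(s > threshold for s in pos_scores) / len(pos_scores)

unwatermarked = list(range(100))      # 100 negative scores: 0, 1, ..., 99
watermarked = [98.5, 120, 50, 200]    # 4 positive scores
rate = tpr_at_fpr(watermarked, unwatermarked)
print(rate)
```

With 100 negatives, one false positive is allowed, so the threshold sits at the second-highest negative score (98) and three of the four watermarked samples are detected.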

2605.28734 2026-06-16 cs.CR cs.CL cs.LG updated version

Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

Richard J. Young, Gregory D. Moody

Affiliations: University of Nevada Las Vegas; Department of Information Systems

AI summary: Builds a prompt bank labeled by five-judge consensus (4,748 requests for executable malicious code and 1,923 requests for harmful security knowledge), providing a reliable, cross-corpus-comparable benchmark for measuring how coding models refuse malicious-code requests.

Comments 23 pages, 9 figures, 6 tables. Consensus-labeled prompt bank consolidating eight malicious-code corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) spanning diverse elicitation paradigms; 6,675 prompts, 33,375 classification calls

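AI-generated abstract (translated from Chinese)

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon: a keylogger, a ransomware stub, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies that coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field currently cannot tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software (ready-to-use weapons) with requests for harmful security knowledge (information that still requires a human operator), and they report refusal rates over non-comparable corpora, so no single statistic measures the property that actually matters. This paper introduces an expanded consensus-labeled prompt bank that separates these two request types and provides a structurally stable basis for measuring coding-model compliance across corpora. Eight corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are consolidated and classified under a five-judge consensus protocol (6,675 prompts × 5 judges = 33,375 calls). The judge panel reaches Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"); 95.0% of prompts receive agreement from at least four judges, 76.9% are unanimous, and the panel reproduces the prior four-corpus release with Cohen's kappa = 0.952 on the 3,133 shared prompts. The released bank contains 4,748 consensus-CODE prompts (requests for executable malicious code) and 1,923 consensus-KNOWLEDGE prompts (requests for harmful security knowledge). The bank is the validated instrument the field has lacked: a reliability-quantified basis for testing whether coding models meet the stricter refusal bar that their executable output demands.

Abstract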

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon: a keylogger, ransomware, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software with requests for harmful security knowledge and report refusal rates over non-comparable corpora. This paper's central result is that the CODE-versus-KNOWLEDGE classification axis established in a prior four-corpus release remains stable under a substantially expanded corpus pool and an independently refreshed judge panel, evidence that it measures a real construct rather than an artifact of the prompts or judges. Eight corpora spanning diverse elicitation paradigms (direct, jailbreak-decorated, indirect, and agent/interpreter: ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls), reaching Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"). Critically, the panel shares no judge with the prior release (five paid commercial APIs replaced by five open-weight models from five vendors), yet the two panels agree on 94.45% of the 3,133 shared prompts and reach Cohen's kappa = 0.952 [0.942, 0.963] on the 3,031-prompt binary overlap: the axis survives near-total panel replacement. The released bank comprises 4,748 consensus-CODE and 1,923 consensus-KNOWLEDGE prompts, a reliability-quantified benchmark whose central classification axis is shown stable across corpus expansion and judge-panel replacement.
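
The agreement statistic reported here, Fleiss' kappa, can be computed directly from a table of per-prompt label counts. The sketch below implements the standard formula; the toy table (four prompts, five judges, two categories) is invented for illustration and is not data from the paper.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for table[i][j] = number of judges who put item i in category j.

    Every item must be rated by the same number of judges (at least two).
    """
    n_items = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every item must be rated by the same number of judges")
    # Observed agreement: share of agreeing judge pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    n_categories = len(table[0])
    p_cat = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)

# Columns: [CODE votes, KNOWLEDGE votes] from five judges per prompt.
toy = [[5, 0], [0, 5], [3, 2], [2, 3]]
kappa = fleiss_kappa(toy)
print(round(kappa, 3))
```

In the toy table two prompts are unanimous and two are split 3-2, giving an observed agreement of 0.7 against a chance level of 0.5.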

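2606.10740 2026-06-16 cs.AI cs.CL cs.LG updated version

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Sai Kartheek Reddy Kasu, Nils Lukas, Samuele Poppi

Affiliations: GitHub

AI summary: Proposes the CoT-Output 2x2 safety matrix to diagnose hidden temporal failures in multi-turn reasoning models, and identifies two reproducible vulnerabilities: the oversight paradox and context-injection failure.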

Comments Accepted at the ICML 2026 Workshop on Failure Modes in Agentic AI (FAGEN)

Abstract

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic: the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined failure cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning, but the visible output produces harm, highlighting a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6,750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox where explicit monitoring cues increase alignment-faking rates rather than suppress them, and a context-injection failure where models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.
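
The 2x2 labeling scheme can be written down as a small lookup over two boolean judgments per turn. In the sketch below, only the context-injection cell (safe reasoning, harmful output) is defined explicitly in the abstract; assigning "alignment faking" to unsafe reasoning with safe output and "overt jailbreak" to unsafe reasoning with unsafe output is an assumed reading, and the function names are hypothetical.

```python
from collections import Counter

def classify_turn(cot_safe, output_safe):
    """Map one turn's (reasoning, output) safety judgments to a matrix cell."""
    if cot_safe and output_safe:
        return "robust alignment"
    if cot_safe and not output_safe:
        return "context-injection failure"   # defined this way in the abstract
    if not cot_safe and output_safe:
        return "alignment faking"            # assumed cell assignment
    return "overt jailbreak"                 # assumed cell assignment

def trajectory_profile(turns):
    """Count matrix cells over a whole dialogue instead of scoring the last turn."""
    return Counter(classify_turn(cot, out) for cot, out in turns)

# Toy dialogue: (cot_safe, output_safe) per turn; the final turn looks fine.
dialogue = [(True, True), (True, False), (True, False), (True, True)]
profile = trajectory_profile(dialogue)
final_label = classify_turn(*dialogue[-1])
print(profile, final_label)
```

The toy dialogue ends in a robustly aligned turn even though two earlier turns fall in the context-injection cell, which is the kind of failure a terminal-score evaluation would miss.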

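2606.13441 2026-06-16 cs.AI cs.CL updated version

Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models

Joseph Keshet

Affiliations: Joseph Keshet

AI summary: Argues that large language models lack the commitment-bearing agency that moral responsibility requires: their outputs arise from probabilistic mappings rather than intrinsic intentionality, and stochastic sampling does not amount to choice.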

Abstract

Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and that such agency constitutes the form of free will relevant to responsibility. Although LLMs generate coherent and normatively evaluable outputs, their operation is fully characterized by probabilistic input-output mappings learned from data. Their apparent intentionality is derived rather than intrinsic, and their outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. We address objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffice to establish genuine agency.

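2606.14027 2026-06-16 cs.CR cs.AI cs.CL cs.SY eess.SY updated version

Same-Origin Policy for Agentic Browsers

Xilong Wang, Xiaoxing Chen, Patrick Li, Dawn Song, Neil Gong

Affiliations: Duke University; Stanford University; UC Berkeley

AI summary: Studies whether the same-origin policy remains effective in agentic browsers; builds the SOPBench evaluation benchmark, finds that existing agentic browsers frequently violate SOP, and proposes the SOPGuard mechanism to enforce SOP while preserving utility with low overhead.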

Abstract

Agentic browsers integrate autonomous AI agents into web browsers, enabling users to accomplish web tasks through natural-language instructions. The same-origin policy (SOP) is a fundamental browser security mechanism that prevents unauthorized automated cross-origin data flows induced by scripts. However, whether SOP remains effective in agentic browsers is an open question that has not been systematically studied. In this work, we bridge this gap. We first observe that an agentic browser can itself serve as an automated channel for cross-origin data flows, potentially leading to SOP violations. To investigate this phenomenon, we construct SOPBench, a benchmark for evaluating SOP violations in agentic browsers. Our evaluation shows that existing agentic browsers frequently violate SOP, both in benign settings and under attacks. To address this problem, we propose SOPGuard, an SOP enforcement mechanism tailored to agentic browsers. We implement SOPGuard in BrowserOS, an open-source agentic browser. Extensive evaluations demonstrate that SOPGuard effectively enforces SOP while preserving utility and incurring only a small runtime overhead. Our code and data are available at https://github.com/wxl-lxw/BrowserOS-SOPGuard.
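
For readers unfamiliar with the policy itself: two URLs share an origin when their scheme, host, and port all match. The sketch below shows that comparison using Python's standard library. It illustrates the general web definition only and is not SOPGuard's mechanism; the helper names and the small default-port table are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Assumption: only http and https need default ports in this example.
DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    """Return the (scheme, host, port) triple that identifies a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)

def same_origin(url_a, url_b):
    """True when both URLs have identical scheme, host, and port."""
    return origin(url_a) == origin(url_b)

print(same_origin("https://example.com/a", "https://example.com:443/b"))
print(same_origin("https://example.com/", "http://example.com/"))
print(same_origin("https://example.com/", "https://api.example.com/"))
```

An agent that reads a page on one origin and then types what it read into a form on another origin moves data across exactly this boundary, without any script performing the transfer.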

2606.15044 2026-06-16 cs.CL new submission

Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

Kieron Seven Jun Wei Lee, Muhammad Reza Qorib, Andrew Ivan Soegeng, Hwee Tou Ng

Affiliations: National University of Singapore; Carnegie Mellon University; SAP

AI summary: Systematically compares equitable tokenizers on 11 Southeast Asian languages, finding that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, while Morphology-Driven Byte Encoding performs best on semantic reasoning.

Abstract

Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training using the same training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at a higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.

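2606.15161 2026-06-16 cs.CL new submission

Beyond Layer Importance in Layer-wise Sparsity: An Inter-Layer Perturbation-Absorption Perspective

Tao Jing, Ningxin Wu, Chen Kang, Dong Yu, Changliang Li, Pengyuan Liu

Affiliations: University of Science and Technology of China

AI summary: Controlled perturbation experiments show that layers of large language models respond heterogeneously to pruning perturbations: early layers amplify them, while middle and late layers absorb them. Building on this, the paper proposes absorption-aware correction, which lowers perplexity by 7.13% and raises zero-shot accuracy by 1.02% at 70% sparsity.

Comments: 10 pages, 4 figures, 4 tables. Submitted to EMNLP 2026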

Abstract

The considerable layer-wise redundancy in large language models (LLMs) has established non-uniform sparsity allocation across layers as the standard pruning approach for efficient compression. Existing layer-wise allocation methods that estimate allocation strategy from local signals such as activation outliers or weight spectra mainly derive from local layer importance, whereas the final post-pruning performance is also influenced by the network's subsequent compensatory capacity. In this paper, we directly characterize this property through controlled perturbation experiments. We make the following empirical findings. First, layers exhibit highly heterogeneous responses to pruning-scale perturbations. In most cases, early layers amplify perturbations, while middle and late layers actively absorb them, with relative L2 drift decreasing monotonically across depth and direction realigning toward the unperturbed hidden-state trajectory. Second, absorption is a large-perturbation phenomenon. Under small perturbations the network exhibits amplification across all layers, and the transition to absorption occurs smoothly as perturbation magnitude grows to pruning scale. This enriches the linearized accumulation theory underlying related works. Building on these findings, we define an absorption coefficient per layer and propose absorption-aware correction, an orthogonal augmentation that improves OWL and AlphaPruning by reducing perplexity by 7.13% and boosting zero-shot accuracy by 1.02% across multiple model families at 70% sparsity.
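
The relative L2 drift mentioned in the abstract can be sketched as follows. This is a toy illustration under assumptions: hidden states are plain Python lists, and the function names are hypothetical. The abstract does not define the absorption coefficient, so it is not reproduced here.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def relative_l2_drift(clean_states, perturbed_states):
    """Per-layer ||h_perturbed - h_clean|| / ||h_clean||.

    Both arguments are lists with one hidden-state vector per layer.
    """
    drifts = []
    for clean, pert in zip(clean_states, perturbed_states):
        diff = [p - c for p, c in zip(pert, clean)]
        drifts.append(l2_norm(diff) / l2_norm(clean))
    return drifts

def drift_is_absorbed(drifts):
    """True when drift never increases with depth (a simple reading of
    'decreasing monotonically across depth')."""
    return all(later <= earlier for earlier, later in zip(drifts, drifts[1:]))
```

A falling drift profile across layers would correspond to the absorption behavior the abstract describes for middle and late layers; a rising one would correspond to amplification.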

2606.16026 2026-06-16 cs.CL new submission

In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation

Isaac Hands, Bin Huang, Adam Spannaus, John Gounley, Heidi Hanson, Eric Durbin, Sally R. Ellingson

Affiliations: University of Kentucky; UK Markey Cancer Center; Kentucky Cancer Registry; Division of Cancer Biostatistics, University of Kentucky; Oak Ridge National Laboratory; Division of Biomedical Informatics, University of Kentucky

AI summary: Proposes an in-domain supervised pipeline to address the performance drop when pathology report classifiers move across registries. With standardized data curation, a production-matched holdout, and operating points chosen for a low false-negative rate, it reaches FNR 0.003 and F1 0.922 on a 418k-report holdout set.

Abstract

We introduce an in-domain supervised pipeline designed to counter the out-of-distribution performance drop that hampers supervised biomedical NLP models, a problem observed when models trained on pathology reports are moved across cancer registries. Our contribution is a reproducible recipe for training a supervised classifier from routinely collected cancer registry data. It describes how to build the in-domain training set and a production-matched holdout, and to choose operating points that keep the false-negative rate (FNR) very low while keeping reviewer workload manageable. The pipeline standardizes data curation with facility-stratified sampling and separate handling of reports linked to registry cases, and includes a blinded manual audit to estimate positive-case prevalence and label noise. On a 418k-report holdout set, the Kentucky model achieved FNR 0.003 and false-positive rate (FPR) 0.097, improving over the Seattle-trained MOSSAIC OncoID baseline (FNR 0.010, FPR 0.183) and raising F1 from 0.860 to 0.922. In a blinded manual review of 600 reports, estimated positive prevalence declined from 0.500 to 0.398, indicating substantial label noise with errors concentrated in rare primary sites.
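
The metrics and the operating-point choice described above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the threshold rule (take the highest score threshold whose FNR stays within a target) and all names are assumptions.

```python
def rates(tp, fp, tn, fn):
    """Return (FNR, FPR, F1) from confusion-matrix counts."""
    fnr = fn / (fn + tp)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return fnr, fpr, f1

def pick_threshold(scores, labels, max_fnr):
    """Highest threshold whose false-negative rate is at most max_fnr.

    A report is flagged positive when score >= threshold, so a higher
    threshold flags fewer reports and lowers reviewer workload.
    Returns None if no candidate threshold meets the target.
    """
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        if fn / (tp + fn) <= max_fnr:
            best = t  # thresholds ascend, so the last qualifying one is highest
    return best
```

Because FNR can only grow as the threshold rises, keeping the last qualifying threshold gives the smallest flagged set that still meets the FNR target.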

2606.16383 2026-06-16 cs.CL new submission

Surpassing Scale by Efficiency: A Compact 135M Parameter Foundational LLM Natively Adapted for the Bangla Language

Rabindra Nath Nandi

Affiliations: Independent Researcher

AI summary: Presents bangla-smollm-135m, a compact 135M-parameter decoder model that addresses Bangla subword fragmentation with a deterministic intersect-and-append token merging strategy and matches or outperforms models twice its size on zero-shot multi-task benchmarks.

Comments: Submitted to a Workshop

Abstract

While the NLP landscape is dominated by multi-billion parameter architectures, their deployment in low-resource, non-Latin scripts remains computationally prohibitive for edge configurations, mobile systems, and decentralized local hardware. This paper presents bangla-smollm-135m, a highly compact 135-million parameter decoder-only foundational model engineered explicitly for high-efficiency language modeling in the Bangla script. By leveraging a deterministic intersect-and-append token merging strategy between TituLLMs and SmolLM2-135M, the model overcomes subword script fragmentation without destabilizing early pretrained parameter states. In zero-shot multi-task benchmark evaluations (PIQA_bn, OpenBookQA_bn, CommonsenseQA_bn, and Bangla_MMLU), bangla-smollm-135m matches or outperforms models twice its size (Gemma-3-270m) and achieves parity with models in the 1B parameter tier. The model is available at rnnandi/bangla-smollm-135m

2606.15963 2026-06-16 cs.DC cs.AI cs.CL cs.LG cross-list

PreLort: Prefix-Nested LoRA for Federated Fine-Tuning under Rank Heterogeneity

Muhammad Waseem, Nurbek Tastan, Andrej Jovanovic, Nicholas D. Lane, Nils Lukas, Karthik Nandakumar, Samuel Horvath

Affiliations: MBZUAI, UAE; University of Cambridge, UK; Flower Labs, UK; Michigan State University, USA

AI summary: Targets the uneven distribution of information caused by heterogeneous ranks in federated LoRA. PreLort combines a prefix-hierarchical nested low-rank structure, a segment-wise aggregation rule, and a prefix-nested training strategy so that low-rank clients benefit from the richer information of higher-rank clients, and it outperforms existing methods in accuracy and ROUGE-L.

Abstract

Federated fine-tuning of large language models using parameter-efficient methods such as LoRA enables privacy-preserving adaptation of foundation models. Heterogeneous hardware resources introduce challenges, as clients with different adapter ranks cannot be directly aggregated. While existing methods enable aggregation under heterogeneous ranks, they fail to control how information is distributed across rank dimensions, leading to suboptimal use of shared low-rank representations. Instead, we propose PreLort: a nested low-rank formulation for federated LoRA that organizes adapter dimensions into a prefix hierarchy. Our approach ensures that lower-rank dimensions encode task-relevant information, while higher-rank dimensions capture additional capacity. Building on this, we introduce (i) a segment-wise aggregation rule that averages only over clients contributing to each rank segment, avoiding dilution from zero-padded lower-rank clients, and (ii) a prefix-nested training strategy that optimizes each adapter under multiple rank truncations, encouraging useful signal to concentrate in low-rank prefix dimensions. Together, these components encourage a consistent low-rank prefix capturing the most task-relevant information, while higher-rank dimensions learn additional capacity. This allows low-rank clients to benefit from richer information contributed by higher-rank clients, as prefix dimensions are consistently learned and aggregated. Experiments demonstrate that our method consistently outperforms prior heterogeneous federated LoRA methods in accuracy and ROUGE-L, while achieving lower or comparable perplexity across multiple base models.
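
The segment-wise aggregation idea can be illustrated with a small sketch that contrasts it with zero-padded averaging. This is a simplification under assumptions: each client's adapter is reduced to a list of rank rows, rows are averaged with equal client weights, and the function names are hypothetical. The paper's exact rule may differ.

```python
def segmentwise_average(client_rows):
    """Average rank row j over only the clients whose rank exceeds j.

    client_rows: one entry per client, each a list of rank rows
    (lists of floats). Clients may have different numbers of rows.
    """
    max_rank = max(len(rows) for rows in client_rows)
    aggregated = []
    for j in range(max_rank):
        contributors = [rows[j] for rows in client_rows if len(rows) > j]
        width = len(contributors[0])
        aggregated.append(
            [sum(r[c] for r in contributors) / len(contributors) for c in range(width)]
        )
    return aggregated

def zero_padded_average(client_rows):
    """Baseline for contrast: pad missing rows with zeros, then average
    over all clients, which dilutes the higher-rank rows."""
    max_rank = max(len(rows) for rows in client_rows)
    width = len(client_rows[0][0])
    n = len(client_rows)
    aggregated = []
    for j in range(max_rank):
        rows = [c[j] if len(c) > j else [0.0] * width for c in client_rows]
        aggregated.append([sum(r[k] for r in rows) / n for k in range(width)])
    return aggregated
```

With one rank-1 client and one rank-2 client, the second row keeps the rank-2 client's values under segment-wise averaging but is halved under zero padding, which is the dilution the abstract refers to.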

2606.16243 2026-06-16 cs.LG cs.CL cross-list

LiFT: Local Search via Linear Programming for Overfitting-Controlled Transformers

Abhishek Shukla, Anikeit Khanna, Ankur Sinha, Faiz Hamid

Affiliations: Department of Management Sciences, Indian Institute of Technology Kanpur; Department of Civil Engineering, Indian Institute of Technology Kanpur; Operations and Decision Sciences, Indian Institute of Management Ahmedabad; Brij Disa Centre for Data Science and AI, Indian Institute of Management Ahmedabad

AI summary: Proposes a linear-programming-based local search framework that jointly updates model parameters and regularization hyperparameters through bilevel optimization. It uses validation gradients and Hessian information to construct a local descent direction that reduces overfitting while preserving training optimality, and experiments show consistent test-perplexity improvements when fine-tuning GPT-2 Small.

Comments: 22 pages, 6 figures, published in The 20th Learning and Intelligent Optimization Conference (LION 2026)

Abstract

This paper proposes a Linear Programming (LP)-based local search framework for fine-tuning pretrained transformer models with explicit control against overfitting. The approach formulates transformer fine-tuning as a bilevel optimization-based regularization problem, in which model parameters and regularization hyperparameters are jointly updated. Information collected during initial warm-up iterations, including validation gradients and training Hessian information, is used to construct a local descent direction by solving an LP that minimizes a scaled directional derivative while preserving training optimality. This validation-aware descent direction enables focused local updates of both parameters and regularization hyperparameters, reducing overfitting without requiring repeated full retraining cycles. The resulting method, termed Linear Programming-based Fine-Tuning (LiFT) for transformers, differs from conventional fine-tuning by systematically identifying task-specific updates rather than relying on heuristic or grid-based hyperparameter selection. Experiments on GPT-2 Small fine-tuned on WikiText-2 demonstrate that LiFT enables effective adaptation through selective tuning of transformer blocks and regularization parameters, yielding consistent improvements in test perplexity across multiple layer configurations and regularization settings, with particularly pronounced gains in overfitting-prone scenarios. Beyond empirical performance, LiFT establishes a principled connection between transformer fine-tuning, bilevel optimization, local search, and regularization theory.

2606.16246 2026-06-16 cs.LG cs.AI cs.CL cross-list

Multimodal Emotion Recognition from Physiological Signals with Deep Temporal Modeling and Ensemble Fusion

Desta Haileselassie Hagos, Saurav Keshari Aryal, Patrick Ymele-Leki, Anietie Andy, Legand L. Burge

Affiliations: Howard University

AI summary: Evaluates LSTM, TCN, and Transformer models for multimodal affect recognition on the WESAD dataset, using modality ablations, sensor-level early fusion, and a late-fusion ensemble; the ensemble reaches 98.91% accuracy and a 98.56% macro-F1 score.

Comments: Accepted for publication in the 17th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM BCB 2026). DOI: https://doi.org/10.1145/3807503.3819363

DOI: 10.1145/3807503.3819363

Abstract

Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Transformer on the WESAD dataset for multimodal affect recognition using wrist and chest sensor signals. We perform ablation studies to assess the individual contributions of each modality by training models on wrist-only and chest-only inputs. In addition, we implement a late-fusion ensemble strategy that combines predictions from all three architectures trained on multimodal input. We also employ early fusion at the sensor level by concatenating wrist and chest signals before feeding them into each model. Our results show that Transformer models consistently achieve the highest accuracy in multimodal settings, while TCN models perform best in the wrist-only configuration. The ensemble method yields the highest overall accuracy (98.91 +/- 0.13%) and macro-F1 score (98.56 +/- 0.17%). These findings demonstrate the effectiveness of sensor fusion and ensemble-based fusion in developing robust systems for physiological emotion recognition.
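
The two fusion strategies named in the abstract can be sketched as follows. This is an illustrative toy, not the authors' code: the abstract does not say how ensemble predictions are combined, so simple probability averaging is assumed, and both function names are hypothetical.

```python
def early_fusion(wrist, chest):
    """Sensor-level fusion: concatenate wrist and chest feature vectors
    at each time step before they reach a model."""
    return [w + c for w, c in zip(wrist, chest)]

def late_fusion(model_probs):
    """Decision-level fusion: average per-class probabilities across
    models (assumed rule) and return (predicted_label, averaged_probs)."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    averaged = [sum(p[k] for p in model_probs) / n_models for k in range(n_classes)]
    label = max(range(n_classes), key=lambda k: averaged[k])
    return label, averaged
```

In this sketch each of the three architectures (LSTM, TCN, Transformer) would contribute one probability vector to `late_fusion`, while `early_fusion` prepares the shared multimodal input.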

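2606.15325 2026-06-16 cs.CL new submission

Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback

Rong Wang, Kun Sun

Affiliations: University of Tuebingen; Tongji University

AI summary: Tests whether LLMs ground their L2 pronunciation feedback in speech evidence rather than pretraining priors. It finds that rating accuracy decouples from reasoning, that phoneme-level feedback converges to a fixed set of difficult phonemes, and that acoustic evidence improves ratings only when it directly probes the target dimension.

Comments: 12 pages, 2 figures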

Abstract

Large language models are increasingly deployed for written pronunciation feedback in second-language (L2) English learning, under the assumption that their diagnoses are grounded in the supplied speech evidence rather than in priors from pretraining. This assumption is tested on 1,800 L2-Arctic utterances spanning six L1 backgrounds, three audio-capable LLMs, four pronunciation dimensions, and five evidence conditions ranging from a text-only baseline to numeric acoustic features and raw audio. Each (utterance x model x condition x dimension) cell is scored on three metrics: Rating Accuracy (RA) against gold labels, Evidence Coherence (EC) assessing internal consistency without ground truth, and Grounded Correctness (GC) evaluated against gold evidence. Results show three findings across models. First, rating accuracy and grounded reasoning decouple: 39.6% of judged cells contain internally coherent reasoning that supports a wrong rating, against only 15.8% where the reasoning supports a correct rating. Second, phoneme-level feedback converges to a fixed inventory of L2-English difficulty phones that recurs across all six L1 backgrounds and all evidence conditions. Third, acoustic evidence improves the rating only when the supplied feature directly probes the target dimension: textualised F0 range raises pitch-variation grounding from (0.18-0.19) to (0.45-0.62) across all three models, while stress and phoneme correctness, which require target-to-realisation alignment, remain ungrounded. The same audio waveform without textualised F0 values does not reproduce this improvement. These findings indicate that current general-purpose LLMs are more reliable as verbalisers of externally computed pronunciation evidence than as standalone diagnostic engines.

2606.15461 2026-06-16 cs.CL cs.AR new submission

ESBMC-PLC: Formal Verification of IEC 61131-3 Ladder Diagram Programs Using SMT-Based Model Checking

Pierre Dantas, Lucas Cordeiro, Waldir Junior

Affiliations: The University of Manchester; Federal University of Amazonas (UFAM)

AI summary: Presents ESBMC-PLC, the first open-source formal verification tool with native ladder diagram (LD) support. It verifies safety properties via SMT-based bounded model checking and k-induction, correctly classifies 61 properties across 13 benchmarks, and finds 8 bugs.

Comments: 24 pages

Abstract

PLCs execute safety-critical programs across industrial sectors. The dominant PLC notation, ladder diagram (LD) per IEC 61131-3, remains absent from formal verification: SMT-based model checkers cannot process LD's rung-and-coil graphics. This paper presents ESBMC-PLC, the first open-source formal verifier with native LD support (PLCopen XML format), implemented as a new ESBMC frontend. ESBMC-PLC translates LD rungs to GOTO IR, models the PLC scan cycle as a while(true) loop with nondeterministic inputs, and checks safety properties via SMT-based bounded model checking or k-induction. A five-property YAML language (mutual_exclusion, invariant, absence, response, reachability) avoids temporal logic. A survey of 22 studies (2020-2026) identifies four research gaps; ESBMC-PLC closes two of them. Evaluation on 13 benchmarks (6 domains, 3 sources - including deployed CONTROLLINO PLCs and MathWorks Simulink PLC Coder) shows correct classification across 61 properties: all 9 author-constructed programs (Categories A/B) as expected, all 4 vendor programs (Category C) correctly unlabeled, with 8 bugs found (actionable counterexamples), 7 unbounded k-induction proofs, all runs under 60ms on Apple Silicon. Feature comparison with PLCverif shows that ESBMC-PLC is the only open-source tool that combines native LD, k-induction, and SMT bit-vector semantics.
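
The scan-cycle modeling described in the abstract can be illustrated with a toy bounded search. This is only a sketch: it enumerates inputs explicitly in Python instead of calling an SMT solver, and the two-rung motor interlock, `scan`, and `bounded_check` are hypothetical examples, not part of ESBMC-PLC.

```python
from itertools import product

def scan(state, inputs, interlocked=True):
    """One PLC scan: read inputs, evaluate two rungs top to bottom.

    state  = (forward, reverse) coil values from the previous scan
    inputs = (fwd_button, rev_button, stop) read at the start of the scan
    """
    fwd, rev = state
    fwd_btn, rev_btn, stop = inputs
    # Rung 1: latch forward unless stopped (and, if interlocked, reverse is on).
    new_fwd = (fwd_btn or fwd) and not stop and ((not rev) if interlocked else True)
    # Rung 2: latch reverse unless stopped (and, if interlocked, forward is on).
    new_rev = (rev_btn or rev) and not stop and ((not new_fwd) if interlocked else True)
    return (new_fwd, new_rev)

def bounded_check(bound, interlocked=True):
    """Search all input sequences up to `bound` scans for a violation of
    mutual exclusion (both coils on). Returns a counterexample input
    trace, or None if no violation exists within the bound."""
    frontier = {(False, False): []}  # reachable state -> one trace reaching it
    for _ in range(bound):
        next_frontier = {}
        for state, trace in frontier.items():
            # Nondeterministic inputs: try every combination each scan.
            for inputs in product([False, True], repeat=3):
                new_state = scan(state, inputs, interlocked)
                new_trace = trace + [inputs]
                if new_state[0] and new_state[1]:
                    return new_trace
                next_frontier.setdefault(new_state, new_trace)
        frontier = next_frontier
    return None
```

The loop over `bound` plays the role of unrolling the `while(true)` scan cycle a fixed number of times. A `None` result only says no violation was found within that bound; an unbounded proof would need a technique such as k-induction, which this sketch does not implement.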

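2606.16496 2026-06-16 cs.CL cs.LG new submission

LLM-Based Virtual Populations for Demand Simulation and Pricing

Chengpiao Huang, Kaizheng Wang

Affiliations: Columbia University

AI summary: Proposes an LLM-driven virtual population model that uses a mixture of customer personas and LLM-assessed purchase probabilities to generate a demand distribution, supports risk-aware pricing, and performs best on an H&M dataset.

Comments: 18 pages, 7 figures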

Abstract

We develop an LLM-powered virtual population model that simulates demand for pricing decisions, in settings where products are described by rich unstructured information, such as text descriptions and images, and where decision makers need not only mean-demand predictions but also uncertainty estimates for counterfactual prices. Our model represents exposed customers as draws from a finite mixture of customer personas. For each persona, product, and candidate price, an LLM elicits a persona-level purchase probability using both structured persona information and unstructured product information. These probabilities are aggregated through calibrated mixture weights to form a predictive distribution of aggregate demand. The resulting simulator can evaluate counterfactual prices under various pricing objectives, including expected revenue and risk-aware criteria such as conditional value at risk. We test the framework on an online H&M fashion dataset with product descriptions and images. The calibrated LLM-based simulator achieves the best overall predictive performance among the models considered, and supports sample-efficient pricing decisions. Our framework provides a practical way to use LLMs as demand simulators for products with limited historical demand data but rich product information. By producing a full predictive demand distribution rather than only a point forecast, it enables managers to compare candidate prices, quantify demand uncertainty, and choose prices that target either average-case revenue or risk-aware objectives.
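
The persona-mixture simulator can be sketched with placeholder numbers. This is a minimal illustration under assumptions: the purchase probabilities are hard-coded here, whereas in the paper they are elicited from an LLM; the mixture weights are taken as given rather than calibrated; and all function names are hypothetical.

```python
import random

def simulate_revenue(weights, purchase_probs, price, n_customers, n_sims, seed=0):
    """Draw each exposed customer from the persona mixture, let them buy
    with their persona's purchase probability at this price, and return
    one revenue sample per simulated population."""
    rng = random.Random(seed)
    personas = range(len(weights))
    revenues = []
    for _ in range(n_sims):
        drawn = rng.choices(personas, weights=weights, k=n_customers)
        demand = sum(1 for k in drawn if rng.random() < purchase_probs[k])
        revenues.append(price * demand)
    return revenues

def expected_value(samples):
    """Sample mean, used here as the expected-revenue objective."""
    return sum(samples) / len(samples)

def cvar_lower_tail(samples, alpha):
    """Risk-aware objective: mean of the worst (lowest) alpha fraction."""
    ordered = sorted(samples)
    k = max(1, int(alpha * len(ordered)))
    return sum(ordered[:k]) / k
```

Running `simulate_revenue` once per candidate price, with price-specific purchase probabilities, gives a revenue distribution for each price. Comparing `expected_value` or `cvar_lower_tail` across those distributions then selects an average-case or risk-aware price.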

2606.16497 2026-06-16 cs.LG cs.AI cs.CL cross-list

daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization

Dayuan Fu, Mohan Jiang, Tongyu Wang, Dian Yang, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Li

Affiliations: University of Science and Technology of China

AI summary: Proposes daVinci-kernel, a framework that uses reinforcement learning to jointly train three agents (skill selection, policy generation, and skill summarization) on a shared LLM backbone for GPU kernel optimization, surpassing the previous best model on KernelBench.

Abstract

GPU kernel optimization represents a paradigm where functional correctness is assumed and execution efficiency is the objective. We present daVinci-kernel, a reinforcement learning framework that couples skill discovery with skill exploitation through a dynamically evolving skill library. daVinci-kernel jointly trains three agents sharing one LLM backbone: a Skill Selection Agent that retrieves relevant techniques via BM25 and LLM reranking, a Policy Agent that generates multi-turn CUDA/Triton kernels conditioned on selected skills, and a Skill Summary Agent that distills successful rollouts into reusable skills. Candidate skills are added only after execution-based verification confirms reproducible speedups. All three agents share a single LLM backbone, are initialized via a structured SFT cold start on diversity-filtered data, and are then jointly optimized end-to-end with multi-turn REINFORCE and per-agent advantage estimation. On KernelBench, daVinci-kernel-14B achieves 37.2%, 70.6%, and 32.2% on Level 1, Level 2, and Level 3 under the Fast$_1$ threshold, outperforming the strongest prior RL-trained model, Dr.Kernel-14B.

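2606.16999 2026-06-16 cs.SE cs.CL cs.LG cross-list

Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models

Mehmet Iscan

AI summary: Measures the effectiveness of post-hoc semantic operators (such as selection, verification, and repair) for frozen small code models. None outperforms Best-of-N, for mechanistic reasons the paper names the coverage wall, the capability scissors, and the consensus trap, whereas expression-layer recovery (M1) improves accuracy through robust extraction and signature alignment.

Comments: 33 pages, 4 figures, 8 tables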

Abstract

Frozen small code models (<=1.5B parameters, run locally without fine-tuning) suit offline and privacy-constrained use, but often emit plausible-but-wrong programs. A natural remedy is a post-hoc operator that selects, verifies, repairs, or re-processes the model's samples without retraining; in principled form it is Popperian: attack each candidate with a severe test, keep what survives. We measure whether such operators help. Under one deterministic execution oracle and a leakage-free, matched-compute protocol, 26 semantic post-hoc operators (selection, verification, repair, elimination, portfolios, sound vetoes, generation conditioning) are evaluated against Best-of-N (BoN); on the cells and benchmarks tested, none improves held-out accuracy over BoN. The negative is mechanistic: a coverage wall (systematic hard-task failures deeper sampling does not rescue), a capability scissors (a competent generator leaves almost no discriminable error among visible-test passers), and a near-empty consensus trap (the visible-pass-but-hidden-wrong majority a leakage-free selector needs rarely co-occurs with a correct alternative). A distribution-free do-no-harm bound cannot certify a harm rate <=alpha at zero observed harm unless n>=45. Two operators help on a different axis, outside the semantic output space. An expression-layer recovery (M1), the only accuracy gain here, recovers correct programs the standard extractor discards (robust extraction and public-test signature alignment); it does no harm (b10=0), is leakage-free, and lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4). An adaptive consensus early-stop (ACE) is a calibrated compute-saving control (~19% saving, zero harm). M1 and the selection negative replicate on HumanEval+ and MBPP+ across three model cells. The lesson: fix the harness and measure coverage before blaming semantic post-hoc reasoning.
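
The sample-size flavor of a distribution-free do-no-harm bound can be illustrated with the standard zero-failure argument: if the true harm rate exceeded alpha, seeing zero harms in n independent trials would have probability below (1 - alpha)^n. The abstract does not state alpha or the confidence level, so the values used below are assumptions for illustration, and the paper's actual bound may be derived differently.

```python
import math

def min_trials_zero_harm(alpha, delta):
    """Smallest n with (1 - alpha)**n <= delta.

    With zero harms observed in n independent trials, this is the sample
    size needed to certify 'harm rate <= alpha' at confidence 1 - delta.
    """
    return math.ceil(math.log(delta) / math.log(1.0 - alpha))
```

Under the assumed settings alpha = 0.05 and delta = 0.10, `min_trials_zero_harm` returns 45, which matches the n>=45 figure quoted in the abstract; whether the paper uses exactly these settings is not confirmed by the text here.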

2310.06555 2026-06-16 cs.CL cs.AI cs.LG cs.MA updated version

It's About Time: Temporal References in Emergent Communication

Olaf Lipinski, Adam J. Sobey, Federico Cerutti, Timothy J. Norman

Affiliations: University of Southampton; The Alan Turing Institute; University of Brescia

AI summary: Studies the absence of temporal references in emergent communication. Changing the loss function alone is insufficient; an architectural change (a different batching method) is needed for temporal references to emerge, after which more than 95% of agents develop them, laying groundwork for more efficient communication.

Comments: 23 pages main body and 31 pages supplementary material, 9 figures in main body. Code available at https://github.com/olipinski/TRG

DOI: 10.1613/jair.1.19795
Journal ref: Journal of Artificial Intelligence Research 86, Article 11 (June 2026)

Abstract

Emergent communication enables agents to develop bespoke languages that improve communication efficiency. Despite the known importance of temporal structure in natural language, there is no existing evidence of temporal references in emergent communication. This paper addresses this gap, by exploring how agents communicate about temporal relationships. We analyse three potential factors for the emergence of temporal references: environmental, external, and architectural. Our experiments demonstrate that altering the loss function is insufficient for temporal references to emerge; rather, architectural changes are necessary. A minimal change in agent architecture, using a different batching method, allows the emergence of temporal references. This modified design is compared with the standard architecture in a temporal referential games environment, which emphasises temporal relationships. The analysis shows that over 95% of the agents with the modified batching method develop temporal references, without changes to their loss function. We consider temporal referencing necessary for future improvements to the agents' communication efficiency, enabling future agents to use a closer to optimal coding as compared to purely compositional languages. These insights provide the basis for incorporation of temporal references into other emergent communication settings, and investigation of other aspects of language.

2410.00812 2026-06-16 cs.CL q-bio.NC updated version

Generative causal testing to bridge data-driven models and scientific theories in language neuroscience

Richard Antonello, Chandan Singh, Shailee Jain, Aliyah Hsu, Sihang Guo, Jianfeng Gao, Bin Yu, Alexander Huth

Affiliations: Computer Science Department, University of Texas at Austin; Microsoft Research; Neurosurgery Department, University of California; EECS Department, University of California; Statistics Department, University of California; Center for Computational Biology, University of California; Neuroscience Department, University of California

AI summary: Proposes generative causal testing (GCT), a framework that uses large language models to generate concise explanations and validates them with LLM-generated stimuli. It successfully explains language selectivity in brain regions, bridging the gap between data-driven models and scientific theories.

Comments: Accepted to Nature Neuroscience, please cite that version

Abstract

Representations from large language models are highly effective at predicting BOLD fMRI responses to language stimuli. However, these representations are largely opaque: it is unclear what features of the language stimulus drive the response in each brain area. We present generative causal testing (GCT), a framework for generating concise explanations of language selectivity in the brain from predictive models and then testing those explanations in follow-up experiments using LLM-generated stimuli. This approach is successful at explaining selectivity both in individual voxels and cortical regions of interest (ROIs), including newly identified microROIs in prefrontal cortex. We show that explanatory accuracy is closely related to the predictive power and stability of the underlying predictive models. Finally, we show that GCT can dissect fine-grained differences between brain areas with similar functional selectivity. These results demonstrate that LLMs can be used to bridge the widening gap between data-driven models and formal scientific theories.


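2410.21803 2026-06-16 cs.CL updated version

SimSiam Naming Game: A Unified Approach for Emergent Communication and Representation Learning

Nguyen Le Hoang, Tadahiro Taniguchi, Tianwei Fang, Akira Taniguchi, Masatoshi Nagano

Affiliations: Kyoto University

AI summary: Proposes the SimSiam Naming Game (SSNG), which replaces sampling-based updates with symmetric self-supervised representation alignment to achieve feedback-free emergent communication; on CIFAR-10 and ImageNet-100 its linear-probe classification accuracy is substantially higher than that of existing methods.

Details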

Abstract

Emergent Communication (EmCom) investigates how agents develop symbolic communication through interaction without predefined language. Recent frameworks, such as the Metropolis-Hastings Naming Game (MHNG), formulate EmCom as the learning of shared external representations negotiated through interaction under joint attention, without explicit success or reward feedback. However, MHNG relies on sampling-based updates that suffer from high rejection rates in high-dimensional perceptual spaces, making the learning process sample-inefficient for complex visual datasets. In this work, we propose the SimSiam Naming Game (SSNG), a feedback-free EmCom framework that replaces sampling-based updates with a symmetric, self-supervised representation alignment objective between autonomous agents. Building on a variational inference-based probabilistic interpretation of self-supervised learning, SSNG formulates symbol emergence as an alignment process between agents' latent representations mediated by message exchange. To enable end-to-end gradient-based optimization, discrete symbolic messages are learned via a Gumbel-Softmax relaxation, preserving the discrete nature of communication while maintaining differentiability. Experiments on CIFAR-10 and ImageNet-100 show that the emergent messages learned by SSNG achieve substantially higher linear-probe classification accuracy than those produced by referential games, reconstruction games, and MHNG. These results indicate that self-supervised representation alignment provides an effective mechanism for feedback-free EmCom in multi-agent systems.
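
Editor's illustration: the Gumbel-Softmax relaxation named in the abstract can be sketched in a few lines. This is the generic textbook form, not the SSNG implementation; the function name `gumbel_softmax`, the temperature `tau`, and the example logits are assumptions made for illustration only.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed (soft) one-hot sample over len(logits) symbols.

    Generic sketch: add Gumbel(0, 1) noise to each logit, divide by the
    temperature tau, and apply a softmax. Lower tau gives sharper samples.
    """
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / tau)
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a three-symbol vocabulary.
random.seed(0)
message = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5)
print(message)
```

The output is a probability vector rather than a hard symbol, which is what keeps the sampling step differentiable in an autodiff framework; a hard message can be read off with an argmax at evaluation time.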


2507.00783 2026-06-16 cs.CL cs.DL updated version

Generative AI and the future of scientometrics: current topics and future questions

Benedetto Lepori, Jens Peter Andersen, Karsten Donnay

Affiliations: Università della Svizzera italiana; University of Aarhus; University of Zurich

AI summary: Proposes a conceptual framework based on the semantic and pragmatic dimensions of texts, analyzes the strengths and limitations of generative AI in scientometrics, and guides its explainable, actionable use.

Comments Scientometrics (2026)

Details

DOI: 10.1007/s11192-026-05667-1

Abstract

In this paper, we contribute to the debate on generative artificial intelligence (GenAI) in scientometrics. We argue that moving from a trial-and-error approach to an explainable and actionable use requires a principled understanding of the strengths and weaknesses of GenAI as compared with other techniques and with human judgment. To this end, we introduce a conceptual framework based on the distinction between the semantic dimension of texts, i.e. the meanings attributed to words, and their pragmatic dimension, i.e. their embedding within communicative situations. We leverage this framework to interpret the results of applications of GenAI in scientometrics and to provide guidance to users. Specifically, we conclude that key parameters to be considered are the nature of the task, the level of granularity of the analysis, and whether the goal is descriptive, inferential or evaluative. These parameters lead to different strategies for using GenAI and human-machine integration. Finally, we suggest that, by generating large amounts of scientific language, GenAI might affect textual characteristics used to measure science, such as authors, words, and references. We argue that careful empirical work and theoretical reflection will be essential to remain capable of interpreting the evolving patterns of knowledge production in the age of AI.


2606.06646 2026-06-16 cs.CL cs.AI updated version

CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures

Jakub Bąba, Jarosław A. Chudziak

Affiliations: Faculty of Electronics and Information Technology, Warsaw University of Technology

AI summary: Proposes CAF-Gen, a multi-agent framework that uses an iterative create-review process to automatically convert shallow argument structures into rich models compliant with the Carneades Argumentation Framework, overcoming the structural instability of single-pass generation.

Comments Accepted for publication in the proceedings of ICCCI 2026

Details

Abstract

Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text. Current Argument Mining (AM) techniques identify basic claims and premises, yet they often struggle to capture the richer structural information required by advanced schemas such as the Carneades Argumentation Framework (CAF), which incorporates features such as premise types, proof standards, and argument schemes. We address this limitation by introducing CAF-Gen, an automated multi-agent framework designed to enrich shallow argument structures into CAF-compliant argument models. CAF-Gen employs an iterative Creator-Reviewer pipeline in which a creator agent's output is validated by a critical agent to ensure structural integrity. This multi-agent collaboration is crucial for mitigating the structural instability typical of single-pass generative models. Our experiments demonstrate that the iterative feedback loop improves the quality of the resulting data and achieves strong alignment with the original annotations, while producing structurally richer models. Our findings show that the multi-agent system can overcome the limitations of single-pass generation, providing a robust methodology for the automated modeling of formal argumentation.


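2606.06834 2026-06-16 cs.CL q-bio.GN updated version

The Dark Regulome: Disentangling Predictability from Regulation in Genomic Foundation Models

Chahat Baranwal, Aaditya Baranwal, Lakshya Nitin Tandon

Affiliations: IIT Jodhpur; University of Central Florida; Northeastern University

AI summary: Proposes a residualization-and-permutation diagnostic that separates sequence predictability from regulatory signal in the in-silico mutagenesis scores of genomic foundation models, reveals a 10kb proximal-regulatory horizon, and shows that cross-architecture decomposition distinguishes a predictability layer from a regulatory-output layer, offering a general tool for studying dark-genome regulation.

Details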

Abstract

High-grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells. The regulatory program written across the dark genome, what we call the $\textit{dark regulome}$, is the natural substrate to probe, and sequence foundation models offer a zero-shot route through in-silico mutagenesis (ISM); yet likelihood-based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined. Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma-relevant loci, we introduce a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. A sharp 10kb proximal-regulatory horizon survives every control we apply, but the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC $= 0.985$. Cross-architecture decomposition cleanly separates a sequence-predictability layer (the two language models co-rank long well-predicted transposable elements) from a regulatory-output layer (Enformer alone retains residual cCRE-discriminative signal), with literally zero overlap between the two top-100 lists. Conservation, brain cis-eQTL, and STRING-PPI cross-checks then anchor what biology survives: top-100 elements across all three models are $3.3\times$ enriched per model for matching brain eQTLs ($p_\mathrm{emp} < 5\times 10^{-3}$), while a tempting transposable-element regulatory layer and a striking NRXN1+NLGN1 protein-pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM-based regulatory study.
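
Editor's illustration: the abstract reports empirical p-values from permutation tests. A minimal sketch of a generic label-permutation test for top-k enrichment follows. The statistic (count of annotated elements among the top-k scored elements), the function names, and the toy data are all assumptions; the abstract does not specify how the authors construct their null.

```python
import random


def top_k_hits(labels, ranked_indices):
    """Count annotated elements (label 1) among the given indices."""
    return sum(labels[i] for i in ranked_indices)


def empirical_p_value(labels, scores, top_k, n_perm=999, seed=0):
    """One-sided empirical p-value for enrichment of labels in the top-k scores."""
    rng = random.Random(seed)
    # Indices of the top_k highest-scoring elements.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    observed = top_k_hits(labels, ranked)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and scores
        if top_k_hits(shuffled, ranked) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)


# Toy data: 100 elements; the 10 annotated ones also have the highest scores.
labels = [1] * 10 + [0] * 90
scores = [100 - i for i in range(100)]
observed, p_emp = empirical_p_value(labels, scores, top_k=10)
print(observed, p_emp)
```

With the add-one correction, the smallest reportable value is 1 / (n_perm + 1), so a bound such as p < 5e-3 needs at least a few hundred permutations.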


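2606.13751 2026-06-16 cs.CL updated version

Which Models Perform Better in Inheritance Reasoning?

Mohammed Amine Mouhoub, Chahinez Bouchekif

Affiliations: Paris Dauphine University; University of Abou Bekr Belkaïd

AI summary: Compares commercial and open-source large language models on Islamic inheritance reasoning and finds that commercial models are better at identifying heirs, applying exclusion rules, and keeping their reasoning consistent, with Gemini 2.5 Flash performing best.

Details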

Abstract

This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning. The task evaluates the ability of large language models to solve inheritance cases that require legal interpretation, multi-step reasoning, and precise numerical computation. We compare commercial and open-source models under a unified prompting strategy to assess their effectiveness in structured legal reasoning with minimal task-specific adaptation. Our results show a clear gap in reliability between the two model families. Commercial models demonstrate stronger performance in identifying eligible heirs, applying exclusion rules, and maintaining consistency across reasoning steps. In contrast, open-source models exhibit greater instability, particularly in cases involving dependent legal decisions and fractional share adjustments. The best performance is achieved by Gemini 2.5 Flash, with an MRE of $0.989$.


2606.13691 2026-06-16 cs.CY cs.CL updated version

Incentives Of EdTech: A Systematic Review Of EduNLP Research

Gabrielle Gaudeau, Aoife O'Driscoll, Jasper Degraeuwe, Andrew Caines, Donya Rooein, Zeerak Talat

Affiliations: ALTA Institute, Computer Laboratory, University of Cambridge; Ghent University; Bocconi University; University of Edinburgh

AI summary: Through a systematic review of 204 ACL educational-application papers, reveals a tension in educational NLP research between private-sector incentives and the needs of educational infrastructure, finding that teachers are systematically under-represented as beneficiaries (33.3%), real-world deployment is rare (9.8%), and ethical engagement tends toward acknowledgement rather than action.

Comments 10 main pages (13 appendix pages), 20 figures, accepted to 21st Workshop on Innovative Use of NLP for Building Educational Applications @ ACL 2026

Details

Abstract

While the Natural Language Processing community has dedicated significant resources to developing educational technologies (EdTech) that support this shift, it remains unclear whose interests are being best served among the stakeholders of education. In this paper, we present a systematic literature review of 204 papers published in venues of the Association for Computational Linguistics' Special Interest Group on Building Educational Applications in 2024 and 2025, and validate these against EdTech papers from the wider ACL Anthology. By examining stakeholder inclusion and the prioritisation of research tasks, our findings reveal a critical tension: a push and pull between private-sector incentives and the foundational needs of educational infrastructure. Our analysis reveals that teachers are systematically under-represented as beneficiaries of research (33.3%) despite being the most affected, that real-world deployment remains rare (9.8%), and that ethical engagement tends toward acknowledgement rather than action. Drawing on exemplary papers in our corpus, we offer concrete recommendations for more responsible EduNLP research practices.


2506.00955 2026-06-16 cs.CL cs.SD eess.AS updated version

Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection

Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler

Affiliations: University of Groningen; Speech Technology Lab; Center for Language and Cognition

AI summary: Proposes using large language models to build a sarcastic speech dataset, improves annotation quality through human verification, validates detection performance on a public dataset, and introduces the PodSarc dataset, achieving an F1 score of 73.63%.

Comments Interspeech 2025; Project page: https://github.com/Abel1802/PodSarc

Details

DOI: 10.21437/Interspeech.2025-2074

Abstract

Sarcasm fundamentally alters meaning through tone and context, yet detecting it in speech remains a challenge due to data scarcity. In addition, existing detection systems often rely on multimodal data, limiting their applicability in contexts where only speech is available. To address this, we propose an annotation pipeline that leverages large language models (LLMs) to generate a sarcasm dataset. Using a publicly available sarcasm-focused podcast, we employ GPT-4o and LLaMA 3 for initial sarcasm annotations, followed by human verification to resolve disagreements. We validate this approach by comparing annotation quality and detection performance on a publicly available sarcasm dataset using a collaborative gating architecture. Finally, we introduce PodSarc, a large-scale sarcastic speech dataset created through this pipeline. The detection model achieves a 73.63% F1 score, demonstrating the dataset's potential as a benchmark for sarcasm detection research.


2602.14819 2026-06-16 cs.CL updated version

Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

Matteo Rinaldi, Rossella Varvara, Viviana Patti

Affiliations: University of Bologna

AI summary: The corpus provides a rich resource of discussion-board text for pretraining Italian large language models and for sociolinguistic research, covering more than 30 billion words of Italian from 1996 to 2024.

Details

Journal ref: Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora (CMLC-12) @ LREC 2026

Abstract

We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), makes it an ideal dataset for pre-training native Italian Large Language Models. Furthermore, discussion board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.


2508.12365 2026-06-16 cs.IR cs.AI cs.CL updated version

TaoSR1: The Thinking Model for E-commerce Relevance Search

Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, Haihong Tang, Bo Zheng

Affiliations: Taobao & Tmall Group of Alibaba

AI summary: Proposes the TaoSR1 framework, which uses CoT-guided supervised fine-tuning, offline sampling, and DPO optimization to address reasoning errors and hallucination in relevance prediction for e-commerce search, enabling efficient deployment.

Details

DOI: 10.1145/3770855.3818489
Journal ref: KDD '26: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2026

Abstract

Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. While Large Language Models (LLMs) are explored, most still use discriminative fine-tuning or distill to smaller models for deployment. We propose a framework to directly deploy LLMs for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.


2510.18355 2026-06-16 cs.CL cs.HC cs.IR updated version

KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers

Mohd Ruhul Ameen, Akif Islam, Farjana Aktar, M. Saifuzzaman Rafat

Affiliations: University of California, Berkeley

AI summary: Proposes KrishokBondhu, a voice-based agricultural advisory platform built on a retrieval-augmented generation framework that processes agricultural handbooks and similar material through OCR and generates answers with a large language model, providing real-time agricultural guidance to Bengali-speaking farmers.

Comments Accepted at the 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence and Networking (QPAIN 2026)

Details

DOI: 10.1109/QPAIN69676.2026.11546653
Journal ref: 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)

Abstract

In Bangladesh, many farmers still struggle to access timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Generation (RAG) framework for Bengali-speaking farmers. The system combines agricultural handbooks, extension manuals, and NGO publications, processes them through an OCR-based pipeline, and indexes the curated content in a vector database for semantic retrieval. Through a phone-based interface, farmers can receive real-time, context-aware advice: speech-to-text converts the Bengali query, the RAG module retrieves relevant information, a large language model (Gemma 3-4B) generates a grounded response, and text-to-speech delivers the answer in spoken Bengali. In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries. Compared to the KisanQRS benchmark, it achieved a composite score of 4.53 versus 3.13 on a 5-point scale, with a 44.7% improvement and especially large gains in contextual richness and completeness, while maintaining comparable relevance and technical specificity. Semantic-similarity analysis further showed a strong correlation between retrieved context and answer quality. KrishokBondhu demonstrates the feasibility of combining call-centre accessibility, multilingual voice interaction, and modern RAG techniques to deliver expert-level agricultural guidance to remote Bangladeshi farmers.
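
Editor's illustration: the 44.7% figure in the abstract is consistent with a relative improvement computed from the two composite scores it reports (4.53 versus 3.13). The helper name `relative_improvement` is an assumption for illustration.

```python
def relative_improvement(new_score, baseline_score):
    """Percentage gain of new_score over baseline_score."""
    return (new_score - baseline_score) / baseline_score * 100.0


# Composite scores quoted in the abstract (5-point scale).
gain = relative_improvement(4.53, 3.13)
print(f"{gain:.1f}%")  # prints 44.7%
```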


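2602.12639 2026-06-16 cs.CL updated version

CLASE: A Hybrid Method for Chinese Legalese Stylistic Evaluation

Yiran Rex Ma, Yuxiao Ye, Huiyuan Xie

Affiliations: Nanyang Technological University

AI summary: Proposes the hybrid method CLASE, which combines linguistic features with experience-guided LLM evaluation to improve the stylistic quality of legal text; experiments show that it agrees with human judgments better than traditional methods.

Comments Accepted at LREC 2026

Details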

DOI: 10.63317/2xrbbj7oaghv

Abstract

Legal text generated by large language models (LLMs) can usually achieve reasonable factual accuracy, but it frequently fails to adhere to the specialised stylistic norms and linguistic conventions of legal writing. In order to improve stylistic quality, a crucial first step is to establish a reliable evaluation method. However, having legal experts manually develop such a metric is impractical, as the implicit stylistic requirements in legal writing practice are difficult to formalise into explicit rubrics. Meanwhile, existing automatic evaluation methods also fall short: reference-based metrics conflate semantic accuracy with stylistic fidelity, and LLM-as-a-judge evaluations suffer from opacity and inconsistency. To address these challenges, we introduce CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid evaluation method that focuses on the stylistic performance of legal text. The method incorporates a hybrid scoring mechanism that combines 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. This hybrid design captures both surface-level features and implicit stylistic norms in a transparent, reference-free manner. Experiments on 200 Chinese legal documents show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods. Beyond improved alignment, CLASE provides interpretable score breakdowns and suggestions for improvements, offering a scalable and practical solution for professional stylistic evaluation in legal text generation (Code and data for CLASE are available at: https://github.com/rexera/CLASE).


2511.08507 2026-06-16 cs.CL cs.AI updated version

Introducing A Bangla Sentence - Gloss Pair Dataset for Bangla Sign Language Translation and Research

Neelavro Saha, Rafi Shahriyar, Nafis Ashraf Roudra, Saadman Sakib, Annajiat Alim Rasel

Affiliations: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology

AI summary: Introduces Bangla-SGP, a new dataset of 1,000 human-annotated sentence-gloss pairs, extended with about 3,000 synthetic pairs generated by a rule-based retrieval-augmented generation pipeline, for Bangla Sign Language translation and research.

Details

DOI: 10.63317/38qenrwzegr9
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pp. 10457-10466, ELRA, Palma, Mallorca, Spain, May 2026

Abstract

Bangla Sign Language (BdSL) translation represents a low-resource NLP task due to the lack of large-scale datasets that address sentence-level translation. Correspondingly, existing research in this field has been limited to word- and alphabet-level detection. In this work, we introduce Bangla-SGP, a novel parallel dataset consisting of 1,000 human-annotated sentence-gloss pairs, augmented with around 3,000 synthetically generated pairs produced from syntactic and morphological rules through a rule-based Retrieval-Augmented Generation (RAG) pipeline. The gloss sequences of the spoken Bangla sentences are made up of individual glosses, which are Bangla sign-supported words and serve as an intermediate representation for a continuous sign. Our dataset consists of 1,000 high-quality Bangla sentences that are manually annotated into gloss sequences by a professional signer. The augmentation process incorporates rule-based linguistic strategies and prompt engineering techniques that we have adopted by critically analyzing our human-annotated sentence-gloss pairs and by working closely with our professional signer. Furthermore, we fine-tune several transformer-based models, such as mBart50, Google mT5, and GPT4.1-nano, and evaluate their sentence-to-gloss translation performance using BLEU scores. Based on these evaluation metrics, we compare the models' gloss-translation consistency across our dataset and the RWTH-PHOENIX-2014T benchmark.


2509.01182 2026-06-16 cs.AI cs.CL cs.HC cs.IR cs.MA updated version

Question-to-Knowledge (Q2K): Multi-Agent Generation of Inspectable Facts for Product Mapping

Wonduk Seo, Taesub Shin, Hyunjin An, Dokyun Kim, Seunghyun Lee

Affiliations: The University of Tokyo; KISTI

AI summary: Q2K uses a multi-agent framework with large language models for reliable product SKU mapping, improving accuracy and robustness by generating disambiguation questions, searching the web, and deduplicating; it suits complex scenarios such as bundle identification and brand-origin disambiguation.

Comments Accepted by IEEE BigData 2025 Industry Track

Details

DOI: 10.1109/BigData66926.2025.11401468
Journal ref: 2025 IEEE International Conference on Big Data (BigData), Macau, China, 2025, pp. 2646-2653

Abstract

Identifying whether two product listings refer to the same Stock Keeping Unit (SKU) is a persistent challenge in e-commerce, especially when explicit identifiers are missing and product names vary widely across platforms. Rule-based heuristics and keyword similarity often misclassify products by overlooking subtle distinctions in brand, specification, or bundle configuration. To overcome these limitations, we propose Question-to-Knowledge (Q2K), a multi-agent framework that leverages Large Language Models (LLMs) for reliable SKU mapping. Q2K integrates: (1) a Reasoning Agent that generates targeted disambiguation questions, (2) a Knowledge Agent that resolves them via focused web searches, and (3) a Deduplication Agent that reuses validated reasoning traces to reduce redundancy and ensure consistency. A human-in-the-loop mechanism further refines uncertain cases. Experiments on real-world consumer goods datasets show that Q2K surpasses strong baselines, achieving higher accuracy and robustness in difficult scenarios such as bundle identification and brand-origin disambiguation. By reusing retrieved reasoning instead of issuing repeated searches, Q2K balances accuracy with efficiency, offering a scalable and interpretable solution for product integration.


2509.02093 2026-06-16 cs.CL cs.AI cs.IR updated version

Better by Comparison: Retrieval-Augmented Contrastive Reasoning for Automatic Prompt Optimization

Juhyeon Lee, Wonduk Seo, Hyunjin An, Seunghyun Lee, Yi Bu

Affiliations: University of Science and Technology of China

AI summary: Proposes the CRPO framework, which improves prompt optimization through contrastive reasoning, using high-quality examples from the HelpSteer2 dataset for contrastive analysis to make prompt generation more robust and interpretable.

Comments Preprint

Details

DOI: 10.1109/JCDL67857.2025.00043
Journal ref: 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Dekalb, IL, USA, 2025, pp. 269-272

Abstract

Automatic prompt optimization has recently emerged as a strategy for improving the quality of prompts used in Large Language Models (LLMs), with the goal of generating more accurate and useful responses. However, most prior work focuses on direct prompt refinement or model fine-tuning, overlooking the potential of leveraging LLMs' inherent reasoning capability to learn from contrasting examples. In this paper, we present Contrastive Reasoning Prompt Optimization (CRPO), a novel framework that formulates prompt optimization as a retrieval-augmented reasoning process. Our approach retrieves top k reference prompt-response pairs from the HelpSteer2 dataset, an open source collection where each response is annotated for helpfulness, correctness, coherence, complexity, and verbosity, and constructs two complementary optimization paradigms: (1) tiered contrastive reasoning, where the LLM compares high-, medium-, and low-quality exemplars (both prompts and responses) to refine its own generation through reflective reasoning, and (2) multi-metric contrastive reasoning, where the LLM analyzes the best exemplars along each evaluation dimension and integrates their strengths into an optimized prompt. By explicitly contrasting high and low quality exemplars, CRPO enables the model to deduce why certain prompts succeed while others fail, thereby achieving more robust and interpretable optimization. Experimental results on the HelpSteer2 benchmark demonstrate that CRPO significantly outperforms baselines. Our findings highlight the promise of contrastive, retrieval-augmented reasoning for advancing automatic prompt optimization.

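2410.13439 2026-06-16 cs.LG cs.CL cs.CV Updated

Similarity-Dissimilarity Loss for Multi-label Supervised Contrastive Learning

Guangming Huang, Yunfei Long, Cunjin Luo

Affiliations: University of Essex; Queen Mary University of London

AI summary: This paper proposes a Similarity-Dissimilarity Loss that dynamically re-weights samples to address positive-sample determination in multi-label settings, provides theoretical proofs, and unifies single-label and multi-label supervised contrastive learning; experiments show it outperforms baselines on image, text, and medical data.

Comments Accepted by Transactions on Machine Learning Research (TMLR)

Details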

Abstract

Supervised contrastive learning has achieved remarkable success by leveraging label information; however, determining positive samples in multi-label scenarios remains a critical challenge. In multi-label supervised contrastive learning (MSCL), multi-label relations are not yet fully defined, leading to ambiguity in identifying positive samples and formulating contrastive loss functions to construct the representation space. To address these challenges, we: (i) systematically formulate multi-label relations in MSCL, (ii) propose a novel Similarity-Dissimilarity Loss, which dynamically re-weights samples based on similarity and dissimilarity factors, (iii) further provide theoretically grounded proofs for our method through rigorous mathematical analysis that supports the formulation and effectiveness, and (iv) offer a unified form and paradigm for both single-label and multi-label supervised contrastive loss. We conduct experiments on both image and text modalities and further extend the evaluation to the medical domain. The results show that our method consistently outperforms baselines in comprehensive evaluations, demonstrating its effectiveness and robustness.

2508.13028 2026-06-16 cs.CL Updated

Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis

Zhu Li, Yuqing Zhang, Xiyuan Gao, Devraj Raghuvanshi, Nagendra Kumar, Shekhar Nayak, Matt Coler

Affiliations: University of Groningen; Brown University; Indian Institute of Technology Indore

AI summary: This paper proposes a speech synthesis method that integrates feedback loss from a bi-modal sarcasm detector into TTS training and uses two-stage transfer learning to improve the quality, naturalness, and sarcasm-awareness of synthesized speech.

Comments Speech Synthesis Workshop 2025

Details

DOI: 10.21437/SSW.2025-23

Abstract

Sarcastic speech synthesis, which involves generating speech that effectively conveys sarcasm, is essential for enhancing natural interactions in applications such as entertainment and human-computer interaction. However, synthesizing sarcastic speech remains a challenge due to the nuanced prosody that characterizes sarcasm, as well as the limited availability of annotated sarcastic speech data. To address these challenges, this study introduces a novel approach that integrates feedback loss from a bi-modal sarcasm detection model into the TTS training process, enhancing the model's ability to capture and convey sarcasm. In addition, by leveraging transfer learning, a speech synthesis model pre-trained on read speech undergoes a two-stage fine-tuning process. First, it is fine-tuned on a diverse dataset encompassing various speech styles, including sarcastic speech. In the second stage, the model is further refined using a dataset focused specifically on sarcastic speech, enhancing its ability to generate sarcasm-aware speech. Objective and subjective evaluations demonstrate that our proposed methods improve the quality, naturalness, and sarcasm-awareness of synthesized speech.

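2502.16560 2026-06-16 cs.AI cs.CL cs.SI Updated

An Analytical Emotion Framework of Rumour Threads on Social Media

Rui Xing, Boyang Sun, Kun Zhang, Preslav Nakov, Timothy Baldwin, Jey Han Lau

Affiliations: University of Edinburgh

AI summary: This paper proposes a multi-aspect emotion detection framework that compares emotions in rumour and non-rumour threads, finding that rumours trigger more negative emotions while non-rumours evoke more positive ones, and uses causal analysis to examine how emotions relate to one another.

Comments Accepted to ICWSM 2025 MisD Workshop

Details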

DOI: 10.36190/2025.28

Abstract

Rumours in online social media pose significant risks to modern society, motivating the need for better understanding of how they develop. We focus specifically on the interface between emotion and rumours in threaded discourses, building on the surprisingly sparse literature on the topic, which has largely focused on a single aspect of emotions within the original rumour posts themselves and largely overlooked the comparative differences between rumours and non-rumours. In this work, we take one step further to provide a comprehensive analytical emotion framework with multi-aspect emotion detection, contrasting rumour and non-rumour threads and providing both correlation and causal analysis of emotions. We apply our framework to existing widely used rumour datasets to further understand the emotion dynamics in online social media threads. Our framework reveals that rumours trigger more negative emotions (e.g., anger, fear, pessimism), while non-rumours evoke more positive ones. Emotions are contagious: rumours spread negativity, while non-rumours spread positivity. Causal analysis shows that surprise bridges rumours and other emotions; pessimism comes from sadness and fear, while optimism arises from joy and love.

2504.08609 2026-06-16 cs.CL cs.AI Updated

A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English

Julian Bäumler, Louis Blöcher, Lars-Joel Frey, Xian Chen, Markus Bayer, Christian Reuter

Affiliations: Technical University of Darmstadt, Science and Technology for Peace and Security (PEASEC)

AI summary: This survey reviews 46 publications on multi-label classification of English textual hate speech and analyzes 28 datasets suited to training multi-label classifiers, revealing heterogeneity in label set, size, meta-concept, and other properties. It also notes inconsistent evaluation and a preference for BERT- and RNN-based architectures, and offers ten research recommendations.

Comments 35 pages, 4 figures, 4 tables

Details

DOI: 10.1145/3820778
Journal ref: ACM Transactions on Knowledge Discovery from Data (2026)

Abstract

The dissemination of online hate speech can have serious negative consequences for individuals, online communities, and entire societies. This and the large volume of hateful online content prompted both practitioners', i.e., in content moderation or law enforcement, and researchers' interest in machine learning models to automatically classify instances of hate speech. Whereas most scientific works address hate speech classification as a binary task, practice often requires a differentiation into sub-types, e.g., according to target, severity, or legality, which may overlap for individual content. Hence, researchers created datasets and machine learning models that approach hate speech classification in textual data as a multi-label problem. This work presents the first systematic and comprehensive survey of scientific literature on this emerging research landscape in English (N=46). We contribute with a concise overview of 28 datasets suited for training multi-label classification models that reveals significant heterogeneity regarding label-set, size, meta-concept, annotation process, and inter-annotator agreement. Our analysis of 24 publications proposing suitable classification models further establishes inconsistency in evaluation and a preference for architectures based on Bidirectional Encoder Representation from Transformers (BERT) and Recurrent Neural Networks (RNNs). We identify imbalanced training data, reliance on crowdsourcing platforms, small and sparse datasets, and missing methodological alignment as critical open issues and formulate ten recommendations for research.

2406.07277 2026-06-16 cs.CL cs.AI cs.MA Updated

Speaking Your Language: Spatial Relationships in Interpretable Emergent Communication

Olaf Lipinski, Adam J. Sobey, Federico Cerutti, Timothy J. Norman

Affiliations: University of Southampton; The Alan Turing Institute; University of Brescia

AI summary: This paper studies how agents communicate about spatial relationships, showing that they can develop a language expressing relationships between parts of their observations with over 90% accuracy, and that this language is interpretable by humans.

Comments Accepted at NeurIPS 2024. 18 pages, 3 figures

Details

DOI: 10.52202/079017-4446
Journal ref: In Advances in Neural Information Processing Systems (Vol. 37, pp. 140113-140137) 2024

Abstract

Effective communication requires the ability to refer to specific parts of an observation in relation to others. While emergent communication literature shows success in developing various language properties, no research has shown the emergence of such positional references. This paper demonstrates how agents can communicate about spatial relationships within their observations. The results indicate that agents can develop a language capable of expressing the relationships between parts of their observation, achieving over 90% accuracy when trained in a referential game which requires such communication. Using a collocation measure, we demonstrate how the agents create such references. This analysis suggests that agents use a mixture of non-compositional and compositional messages to convey spatial relationships. We also show that the emergent language is interpretable by humans. The translation accuracy is tested by communicating with the receiver agent, where the receiver achieves over 78% accuracy using parts of this lexicon, confirming that the interpretation of the emergent language was successful.

2408.14892 2026-06-16 cs.CL cs.SD eess.AS Updated

A Functional Trade-off between Prosodic and Semantic Cues in Conveying Sarcasm

Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler

AI summary: By analyzing the acoustic features of utterances across sarcasm types, the study finds that prosodic cues matter less when sarcasm is evident from the semantics and more when it is not, indicating a trade-off between semantic and prosodic cues in conveying sarcasm.

Comments Accepted at Interspeech 2024

Details

DOI: 10.21437/Interspeech.2024-1962

Abstract

This study investigates the acoustic features of sarcasm and disentangles the interplay between the propensity of an utterance being used sarcastically and the presence of prosodic cues signaling sarcasm. Using a dataset of sarcastic utterances compiled from television shows, we analyze the prosodic features within utterances and key phrases belonging to three distinct sarcasm categories (embedded, propositional, and illocutionary), which vary in the degree of semantic cues present, and compare them to neutral expressions. Results show that in phrases where the sarcastic meaning is salient from the semantics, the prosodic cues are less relevant than when the sarcastic meaning is not evident from the semantics, suggesting a trade-off between prosodic and semantic cues of sarcasm at the phrase level. These findings highlight a lessened reliance on prosodic modulation in semantically dense sarcastic expressions and a nuanced interaction that shapes the communication of sarcastic intent.

2402.16515 2026-06-16 cs.CL cs.CR Updated

LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification

Yiping Song, Juhua Zhang, Zhiliang Tian, Yuxin Yang, Minlie Huang, Dongsheng Li

Affiliations: College of Science, National University of Defense Technology, Hunan, China; College of Computer, National University of Defense Technology, Hunan, China; The CoAI Group, DCST, BNRist, Tsinghua University, Beijing

AI summary: This paper proposes a privacy-preserving data augmentation method that combines an LLM with knowledge distillation and uses a distribution tutor to control the generated distribution, improving privacy protection and performance in medical text classification.

Details

DOI: 10.1016/j.neunet.2026.108668
Journal ref: Neural Networks, Vol. 199, 2026, 108668

Abstract

As sufficient data are not always publicly accessible for model training, researchers exploit limited data with advanced learning algorithms or expand the dataset via data augmentation (DA). Conducting DA in private domains requires privacy protection approaches (i.e. anonymization and perturbation), but those methods cannot provide protection guarantees. Differential privacy (DP) learning methods theoretically bound the protection but are not skilled at generating pseudo text samples with large models. In this paper, we recast the DP-based pseudo-sample generation task as a DP-based generated-sample discrimination task, and propose a DP-based DA method with an LLM and a DP-based discriminator for text classification on private domains. We construct a knowledge distillation model as the DP-based discriminator: teacher models, which access private data, teach students how to select private samples with calibrated noise to achieve DP. To constrain the distribution of DA's generation, we propose a DP-based tutor that models the noised private distribution and controls sample generation with a low privacy cost. We theoretically analyze our model's privacy protection and empirically verify our model.

1. Large Language Models and Foundation Models (68 papers)

Simplifying the Modeling of Arbitrary Conditionals in Natural Language

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

AdaMame: A Training Recipe for Adaptive Multilingual Reasoning

Spokes: Optimizing for Diverse Pretraining Data Selection

Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

Rethinking the Role of Efficient Attention in Hybrid Architectures

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

Emergent retokenization symmetry in large language models: phenomenology and applications

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

Retrievable Gradients: Continual Post-Training Without Cumulative Weight Drift

DYNA: Dynamic Episodic Memory Networks for Augmenting Large Language Models with Temporal Knowledge Graphs in Continuous Learning

The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

Free Energy Heuristics: Fast-And-Frugal Cognition as Active Inference Under Uncertain Precision

Neuron Level Analysis of Large Language Model in Legal Domain Reasoning

BALTO: Balanced Token-Level Policy Optimization for Hallucination Mitigation

Formalize Once, Edit the Rest: Efficient Lean-Based Answer Selection for Math Reasoning

Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing

Creative Collision: Directorial Persona Steering and Competition in Large Language Models

Tyler: Typed Latent Reasoning for Language Models -- When to Think, What to Compute, and How Much to Allocate

Can LLM Coding Agents Reason About Time Series?

Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI

Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models

Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens

Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences

LESS Is More: Mutual-Stability Sampling for Diffusion Language Models

Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter

KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing

Context-Aware RL for Agentic and Multimodal LLMs

The Value Axis: Language Models Encode Whether They're on the Right Track

From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

Re-feeding Is Not Replaying: Measuring Replay Noise in Counterfactual Token-Credit Estimation

MosaicQuant: Inlier-Outlier Disaggregation for Unified 4-Bit LLM Quantization

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

QK-Normed MLA: QK normalization without full key caching

Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier

DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

LoLA: Low-Rank Linear Attention With Sparse Caching

CentroidKV: Efficient Long-Context LLM Inference via KV Cache Clustering

Why Tree-Style Branching Matters for Thought Advantage Estimation in GRPO

OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

A Unified Definition of Hallucination: It's The World Model, Stupid!

Oops, Wait: Discourse Tokens Matter in Reasoning Model

Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

Learning When to Sample: Confidence-Aware Selective Sampling for Efficient Chain-of-Thought Reasoning

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

G-Loss: Graph-Guided Fine-Tuning of Language Models

CoBit: Language Modeling with Bitstream Diffusion

HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools

ACC: Compiling Agent Trajectories for Long-Context Training

Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference

Depth-Attention: Cross-Layer Value Mixing for Language Models

Implicit Reasoning for Large Language Model-based Generative Recommendation

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving

Mechanistic Analysis of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

Entropy-Aware On-Policy Distillation of Language Models

When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

Beyond Retrieval: Learning Compact User Representations for Scalable LLM Personalization

The Illusion of Multi-Agent Advantage

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

2. Machine Translation and Cross-Lingual Processing (4 papers)

Evaluative Judgement in Teaching AI-based Translation: A Classroom Case Study of AI-Mediated Translation and Post-Editing

How Far Can Machine Translation Quality Take You? Extrinsic Discourse Evaluation in Goal-Oriented Setups

RASST: Retrieval-Augmented Simultaneous Speech Translation

Dual-branch Prompting for Multimodal Machine Translation

3. Information Extraction, Retrieval, and Question Answering (21 papers)

Context Compression Is Not One Thing: Readable Symbolic Re-expression vs. Coherent Summary at Matched Budget

Few-Shot Biomedical Relation Extraction with Large Language Models: A Viable Alternative to Supervised Learning?

Transfer Learning for FHIR Questionnaire Terminology Binding

LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science

Distilling Examples into Task Instructions: Enhanced In-Context Learning for Real-World B2B Conversations