arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Large Language Models and Foundation Models (15 papers)

2606.19348 2026-06-19 cs.CL cs.AI New submission

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

DeepSeek-AI, Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chenchen Ling, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chengyu Hou, Chenhao Xu, Chenze Shao, Chong Ruan, Conner Sun, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Donghao Li, Dongjie Ji, Erhang Li, Fang Wei, Fangyun Lin, Fangzhou Yuan, Feiyu Xia, Fucong Dai, Guangbo Hao, Guanting Chen, Guoai Cao, Guolai Meng, Guowei Li, Han Yu, Han Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoling Zhang, Haoming Luo, Haoran Wei, Haotian Yuan, Haowei Zhang, Haowen Luo, Haoyu Chen, Haozhe Ji, Hengqing Zhang, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, J Yang, JQ Zhu, Jia Luo, Jia Song, Jia Yu, Jialiang Huang, Jialu Cai, Jian Liang, Jiangting Zhou, Jiasheng Ye, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jieyu Yang, Jin Chen, Jin Yan, Jingchang Chen, Jingli Zhou, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jingzi Zhou, Jinhua Zhu, Jiping Yu, Joseph Sun, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junmin Zheng, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Leyi Xia, Li Zhang, Liang Zhao, Lihua Guo, Lingxiao Luo, Linwang Ma, Linyan Zhu, Litong Wang, Liyu Cai, Liyue Zhang, Longhao Chen, MS Di, MY Xu, Max Mei, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Mingxu Zhou, Minmin Han, Ning Wang, Panpan Huang, Panpan Wang, Peixin Cong, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Qiwei Jiang, Rui Tian, Ruifan Xu, Ruijie Lu, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqian Chen, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, Ruyi Chen, SH Liu, Shanghao Lu, Shangmian Sun, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoheng Nie, Shaoqing Wu, Shaoyuan Chen, Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, Shuying Yu, Songyang Zhou, Tao Ni, 
Tao Yun, Tian Jin, Tian Pei, Tian Ye, Tianle Lin, Tianran Ji, Tianyi Cui, Tianyuan Yue, Tingting Yu, Tun Wang, W Zhang, WL Xiao, Wangding Zeng, Wei An, Weilin Zhao, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjing Yao, Wenjun Gao, Wenkai Yang, Wenlve Huang, Wenqing Hou, Wentao Zhang, Wenting Ma, Xi Gao, Xiang He, Xiangwen Wang, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingchen Liu, Xingkai Yu, Xingyou Li, Xinyu Yang, Xinyu Zhang, Xu Chen, Xuanyu Wang, Xuecheng Su, Xueyin Chen, Xuheng Lin, Xuwei Fu, YC Yan, YQ Wang, YW Ma, Yanfeng Luo, Yang Zhang, Yanhong Xu, Yanru Ma, Yanwen Huang, Yao Li, Yao Li, Yao Xu, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Shao, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yijia Wu, Yiliang Xiong, Yiling Ma, Ying He, Ying Tang, Ying Zhou, Yingjia Luo, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiang Zhang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, YuKun Li, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuanhao Li, Yuduan Wang, Yuehan Yang, Yuer Xu, Yuhan Wu, Yuhao Meng, Yuheng Zou, Yukun Zha, Yunfan Xiong, Yupeng Chen, Yuping Lin, Yuqian Cao, Yuqian Wang, Yushun Zhang, Yuting Yan, Yutong Lin, Yuxian Gu, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuxuan Zhou, Yuyang Zhou, Yuzhen Huang, ZF Wu, Zehao Wang, Zehua Zhao, Zehui Ren, Zekai Zhang, Zhangli Sha, Zhe Fu, Zhe Ju, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zheren Gao, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhixuan Chen, Zhiyu Wu, Zhizhou Ren, Zhongyu Wu, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihua Qu, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang, Ziwei Xie, Ziyi Gao, Ziyi Wan, Zizheng Pan, Zongqing Yao

Affiliations * DeepSeek-AI

AI summary Proposes the DeepSeek-V4 series of MoE models, which use a hybrid attention architecture, Manifold-Constrained Hyper-Connections, and the Muon optimizer to achieve efficient inference over million-token contexts and outperform their predecessors on core tasks.

Details
Abstract

We present a preview version of the DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models, DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting a context length of one million tokens. The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; and (3) the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, the DeepSeek-V4 series is highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.

2606.19349 2026-06-19 cs.CL cs.AI New submission

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

Zhengheng Li, Panrui Li, Xuyang Liu, Puzhi Xia

Affiliations * Southeast University

AI summary Systematically analyzes how query position affects generation quality in diffusion LLMs, finds that it matters as much as the semantic quality of the examples, and proposes Auto-ICL, a training-free adaptive routing strategy based on Average Confidence, to optimize query placement.

Comments 9 figures, 4 tables

Details
Abstract

While In-Context Learning (ICL) is extensively studied in Autoregressive (AR) LLMs, its mechanism within Diffusion Large Language Models (dLLMs) remains largely unexplored. Unlike AR models restricted by unidirectional causal masking, dLLMs intrinsically utilize bidirectional attention, offering extensive spatial flexibility for query placement. Unfortunately, current practices conventionally inherit AR-style trailing-query templates, often overlooking the structural paradigm shift. This paper presents a comprehensive analysis unveiling that query position is actually a first-order variable in dLLMs. Through empirical decoupling, we demonstrate that positional variance impacts generation quality on par with example semantic quality. Internally, this positional sensitivity stems from a spatial "Recency Effect" in attention flow and task-dependent shifts in decoding trajectories. To mitigate this instability without ground-truth labels, we reveal that traditional single-step confidence ($C_{decoded}$) fails in dLLMs. Instead, we propose Average Confidence ($\overline{C}$), a novel metric tracking the iterative decoding process. By establishing the foundational spatial ICL baselines, we introduce Auto-ICL, a training-free adaptive routing strategy that dynamically optimizes query placement, robustly approaching oracle performance across heterogeneous reasoning and perception tasks.
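
The routing idea in this abstract can be illustrated with a minimal sketch. It assumes each candidate query placement yields a list of per-step confidence values from iterative decoding; the function names, placement labels, and numbers are hypothetical, and the paper's exact definition of $\overline{C}$ may differ.

```python
from statistics import fmean

def average_confidence(step_confidences):
    """Mean of the per-step confidences recorded over one decoding run."""
    return fmean(step_confidences)

def route_query_placement(traces):
    """Pick the placement whose decoding trace has the highest average confidence.

    `traces` maps a placement label to its list of per-step confidences.
    """
    return max(traces, key=lambda name: average_confidence(traces[name]))

# Hypothetical traces for three placements of the query relative to the examples.
traces = {
    "query_first": [0.40, 0.55, 0.70, 0.90],
    "query_middle": [0.60, 0.72, 0.80, 0.85],
    "query_last": [0.30, 0.35, 0.50, 0.95],
}
best = route_query_placement(traces)

# A single-step criterion (final-step confidence only) can disagree with the average.
best_single_step = max(traces, key=lambda name: traces[name][-1])
```

In this toy data the two criteria pick different placements, which is the kind of disagreement that motivates tracking the whole decoding trajectory rather than one step.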

2606.19350 2026-06-19 cs.CL New submission

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

Amogh Sheth, Biruk Assefa, Yi Wen Huang, Andrew Lin, Yuhao Ge

Affiliations * Edison Academy Magnet School; Massachusetts Institute of Technology; State University of New York College at Plattsburgh; The University of Texas at Austin; Independent Researcher

AI summary Proposes Causal Attribution Pruning (CAP), a training-free method that measures the causal impact of attention heads on reasoning tasks to guide fine-grained pruning, yielding relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity.

Comments Accepted at the ICLR 2026 Workshop on LLM Reasoning. 13 pages, 2 figures

Details
Abstract

Large language models (LLMs) excel at multi-step reasoning but incur substantial inference cost. We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tasks and uses these head-level scores to guide fine-grained weight pruning. For each attention head, CAP estimates the expected performance degradation when the head is masked during forward passes on a small calibration set of reasoning problems. These causal scores are then converted into weight-level importance values for the corresponding projection matrices. Unlike magnitude-only or activation-based criteria, CAP's interventional measurement directly captures each head's functional contribution, yielding relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity. We evaluate CAP on GSM8K, StrategyQA, and ARC-Challenge using Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10-20%), CAP improves over Wanda in most model-benchmark configurations, with especially large gains on ARC-Challenge for Llama-3. Our results suggest that attention-head-level causal attribution can better preserve reasoning performance on downstream benchmarks than correlational pruning criteria at equivalent sparsity, while remaining limited by coarse MLP attribution at 50% sparsity.
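
The mask-and-measure procedure described above can be sketched on toy data. This is an illustration under stated assumptions, not the authors' implementation: `toy_evaluate` stands in for calibration-set accuracy with one head masked, and scaling weight magnitude by the head score is an assumed mapping from head-level scores to weight-level importance.

```python
def head_causal_scores(evaluate, num_heads):
    """Score each head by the accuracy drop observed when it alone is masked."""
    baseline = evaluate(None)  # None means no head is masked
    return [baseline - evaluate(h) for h in range(num_heads)]

def weight_importance(weights_per_head, scores):
    """Assumed mapping: weight magnitude scaled by its head's causal score."""
    return [[abs(w) * max(s, 0.0) for w in ws]
            for ws, s in zip(weights_per_head, scores)]

def prune(weights_per_head, importance, sparsity):
    """Zero out the `sparsity` fraction of weights with the lowest importance."""
    ranked = sorted((importance[h][i], h, i)
                    for h in range(len(weights_per_head))
                    for i in range(len(weights_per_head[h])))
    pruned = [list(ws) for ws in weights_per_head]
    for _, h, i in ranked[:int(len(ranked) * sparsity)]:
        pruned[h][i] = 0.0
    return pruned

def toy_evaluate(masked_head):
    """Hypothetical calibration accuracy: masking head 0 hurts most, head 1 least."""
    drops = {0: 0.30, 1: 0.02, 2: 0.10}
    return 0.80 - drops.get(masked_head, 0.0)

weights = [[0.5, -0.2], [0.9, -0.8], [0.1, 0.4]]  # hypothetical per-head weights
scores = head_causal_scores(toy_evaluate, num_heads=3)
importance = weight_importance(weights, scores)
pruned = prune(weights, importance, sparsity=0.5)
```

Here the large-magnitude weights of the low-impact head 1 are removed while the smaller weights of the high-impact head 0 survive, which a magnitude-only criterion would not do.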

2606.19353 2026-06-19 cs.CL cs.LG New submission

Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence

Jinseok Chung, Minkyoung Song, Hyunji Jung, Namhoon Lee

Affiliations * POSTECH

AI summary To address the sensitivity of in-context learning (ICL) predictions to prompt design, proposes self-function vectors, grounded in a Bayesian view and mechanistic interpretability, to estimate aleatoric uncertainty directly; also designs a rigorous evaluation protocol and validates the method's reliability on synthetic and real-world datasets and its usefulness in applications such as hallucination detection.

Comments Accepted to ACL 2026

Details
Abstract

In-Context Learning (ICL) allows LLMs to adapt to new tasks from a few demonstrations, but its reliability remains a concern: predictions are highly sensitive to both prompt design and the model's ability to understand the context, obscuring whether failures arise from data properties or model limitations. Uncertainty decomposition, which separates aleatoric from epistemic sources, is particularly crucial in this setting, yet existing methods, designed for standard generation tasks, fail to capture the unique dynamics of ICL. To address this, we introduce the concept of self-function vectors, built upon Bayesian views and the mechanistic interpretability of ICL. These vectors leverage internal model representations to model the latent concept learned during in-context prompting, thereby enabling a direct estimation of aleatoric uncertainty within a Bayesian framework and circumventing the reliance on brittle input or decoding manipulations. Given the lack of established benchmarks and suitable evaluation protocols, we also propose the first rigorous evaluation protocol, in which data is manipulated in controlled ways so as to quantify aleatoric uncertainty precisely and separately from epistemic uncertainty. With this new evaluation framework, initially grounded in synthetic tasks for conceptual development and subsequently extended to real-world datasets, we show that our proposed methodology can measure uncertainty of LLM predictions made under ICL more reliably than existing alternative methods. Moreover, we show it can be used as a practical tool for trustworthiness-related applications, such as hallucination detection. Our findings pave a new direction for connecting the quantitative view of uncertainty with the mechanistic understanding of model behavior.

2606.19354 2026-06-19 cs.CL cs.LG New submission

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

Ardit Krasniqi, Luan Vejsiu, Elira Dervishi

Affiliations * European University of Tirana

AI summary Proposes GRACE, a theoretical framework that models verification granularity as a function of problem difficulty, verifier accuracy, and compute budget, and proves a phase transition: fine-grained verification dominates when the compute budget is large or the problem is hard, coarse-grained verification is better for easy problems under low budgets, and an adaptive strategy reaches the compute-performance Pareto frontier.

Details
Abstract

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the verifier, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: what is the optimal granularity of verification under a given compute budget? Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called GRACE (Granularity-Regulated Adaptive Computational Efficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-$N$, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1% accuracy at matched compute.

2606.19625 2026-06-19 cs.CL cs.LG New submission

Where Does Social Reasoning Come From? Capability Provenance in Language Models

Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla, Louis Jaburi, Alvin Deng, Taywon Min, Lucia Quirke, Stella Biderman, Mark Riedl

Affiliations * Georgia Institute of Technology, College of Computing; MATS Program; EleutherAI; KAIST AI; Georgia Tech AI Safety Initiative

AI summary Using training-data attribution, finds that social reasoning and STEM reasoning in OLMo3-7B rely on different regions of the pretraining corpus, and that the difference is more pronounced at the reasoning level than at the knowledge level.

Comments Under review at COLM 2026 (Conference)

Details
Abstract

We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.
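
The bin-level aggregation step in this abstract can be sketched as follows. This is a toy illustration only: the (format, topic) labels and influence scores are hypothetical, summation is an assumed aggregation rule, and the gradient-based attribution that would produce the document scores is not shown.

```python
from collections import defaultdict

def aggregate_by_bin(doc_scores):
    """Sum document-level influence scores into (format, topic) bins."""
    bins = defaultdict(float)
    for fmt, topic, score in doc_scores:
        bins[(fmt, topic)] += score
    return dict(bins)

def contrast(bins_a, bins_b):
    """Per-bin difference in aggregated influence between two benchmarks."""
    keys = set(bins_a) | set(bins_b)
    return {k: bins_a.get(k, 0.0) - bins_b.get(k, 0.0) for k in keys}

# Hypothetical document-level scores: (format label, topic label, influence).
social_docs = [
    ("fiction", "Literature", 0.5),
    ("fiction", "Literature", 0.25),
    ("tutorial", "Science", 1.0),
]
stem_docs = [("tutorial", "Science", 2.0)]

social_bins = aggregate_by_bin(social_docs)
stem_bins = aggregate_by_bin(stem_docs)
diff = contrast(social_bins, stem_bins)
```

Positive entries in `diff` mark bins that matter more for the first benchmark, negative entries bins that matter more for the second.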

2606.19668 2026-06-19 cs.CL New submission

Code-Switching Reveals Language Anchoring in Multilingual LLMs

Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee

Affiliations * Chung-Ang University; Adobe Research

AI summary Diagnoses language anchoring in multilingual LLMs using grammar-forced code-switching, proposes the Anchor Bias measure, and designs the CANVAS intervention, which effectively mitigates the question-answering degradation caused by code-switching.

Comments 36 pages, 13 figures, 27 tables

Details
Abstract

Multilingual Large Language Models (MLLMs) are increasingly expected to handle Code-Switched (CS) inputs, yet mixing languages frequently degrades performance relative to source- or target-language monolingual counterparts. To understand this degradation, we use grammar-forced CS as a controlled diagnostic setting for locating CS representations relative to their source and target counterparts. We introduce Anchor Bias, a geometric measure that quantifies language anchoring, that is, whether a CS hidden state aligns more closely with its source- or target-language counterpart. Across diverse MLLMs, Anchor Bias reveals a consistent grammar-frame effect: source-framed CS stays source-anchored, whereas target-framed CS shifts target-ward and shows larger Question Answering (QA) degradation. Motivated by this representational pattern, we propose CANVAS (Contextual Anchor-based Neural Vector Alignment Steering), an inference-time intervention that extracts a source-side canvas from the input and softly steers target-language hidden states toward the source anchor during prefill. CANVAS consistently recovers QA F1 across MLLMs and CS conditions, showing that internal anchoring signals provide an actionable target for mitigating CS inference failures.
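
One way to picture a geometric anchoring measure is a difference of cosine similarities. The sketch below is an assumption for illustration: the abstract does not give the formula for Anchor Bias, and the vectors here are toy two-dimensional stand-ins for hidden states.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def anchor_bias(h_cs, h_src, h_tgt):
    """Assumed definition: positive means closer to the source counterpart,
    negative means closer to the target counterpart."""
    return cosine(h_cs, h_src) - cosine(h_cs, h_tgt)

# Toy hidden states (hypothetical values).
h_src = [1.0, 0.0]
h_tgt = [0.0, 1.0]
source_framed_cs = [0.9, 0.1]
target_framed_cs = [0.2, 0.8]

bias_source_framed = anchor_bias(source_framed_cs, h_src, h_tgt)
bias_target_framed = anchor_bias(target_framed_cs, h_src, h_tgt)
```

With these toy values the source-framed state gets a positive score and the target-framed state a negative one, mirroring the grammar-frame effect the abstract reports.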

2606.19744 2026-06-19 cs.CL cs.AI cs.HC New submission

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim

Affiliations * Network Analysis and Social Influence Modelling (NASIM) Lab; School of Physics Maths and Computing, The University of Western Australia; School of Psychological Science, The University of Western Australia; School of Computing, Macquarie University

AI summary Studies sequential DPO across different preference settings and finds that forgetting is not uniform but depends on objective relationship, signal strength, and training order; suggests that future alignment pipelines account for objective compatibility.

Comments Submitted to EMNLP 2026

Details
Abstract

Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation (DPO), but it remains unclear whether later training uniformly degrades preferences learned earlier or whether the effect depends on the relationship between objectives. We study sequential DPO across four preference settings covering distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives. Using Llama-3.1-8B-Instruct with LoRA adapters, we evaluate all objectives after every stage with a fixed base-model reference. We find that sequential DPO does not produce a single forgetting pattern; preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer depending on objective relationship, signal strength, and training order. Pair-level analysis using length-normalised policy margins shows that aggregate metrics can mask heterogeneous changes across preference pairs, whereas quartile decomposition reveals that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage 2 gradients and adapter updates are near-orthogonal to the previous objective across all settings, providing little evidence that direct gradient opposition is the primary driver. These findings suggest that future sequential alignment pipelines should account for objective compatibility and signal strength, rather than assuming that later objectives affect earlier preferences uniformly.
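
For readers unfamiliar with the quantities involved, the sketch below computes the standard DPO loss for one preference pair from sequence log-probabilities, plus one plausible form of a length-normalised policy margin. The margin definition, the function names, and all numbers are assumptions for illustration; the paper may define its margin differently.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities under
    the policy and under a fixed reference model."""
    logits = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))

def length_normalised_margin(pol_chosen, pol_rejected, len_chosen, len_rejected):
    """Assumed definition: per-token log-probability gap between chosen and rejected."""
    return pol_chosen / len_chosen - pol_rejected / len_rejected

# Hypothetical log-probabilities and token counts for one preference pair.
loss = dpo_loss(pol_chosen=-10.0, pol_rejected=-12.0,
                ref_chosen=-11.0, ref_rejected=-11.0)
margin = length_normalised_margin(-10.0, -12.0, len_chosen=5, len_rejected=4)
```

A loss below log 2 means the policy prefers the chosen response more strongly than the reference does; tracking the margin per pair, rather than only an aggregate, is what exposes the heterogeneous changes the abstract describes.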

2606.19815 2026-06-19 cs.CL New submission

Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang, Ying Liu, Michael Lepech

Affiliations * Independent Researcher; University of California, Irvine; University of the Chinese Academy of Sciences

AI summary Proposes a semantic pre-training framework that clusters text with K-means or Top2Vec and pre-trains a Tsetlin Machine on cluster-sample pairs so that it learns interpretable semantic keywords; across five datasets it outperforms vanilla and embedding-based TMs and is competitive with BERT.

Details
Abstract

Pre-trained language models such as BERT achieve strong text classification performance but lack transparency, limiting their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable, clause-based reasoning but captures little semantic information, and prior attempts to bridge the two rely on static word embeddings that miss contextual meaning. We propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs pre-train a non-negated TM with enhanced Type I feedback. The TM thereby learns interpretable semantic keywords that are fine-tuned on downstream tasks. Across five datasets, our method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.

2606.19831 2026-06-19 cs.CL cs.LG New submission

Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

Hongliang Liu

Affiliations * Palo Alto Networks

AI summary Proposes a budget-normalized control-window framework in which a coherence budget, defined as the ratio of residual norm to write norm, predicts when a single-neuron intervention yields coherent behavioral control; validates the predictions on 15 held-out neurons.

Details
Abstract

Aligned language models gate behaviors such as refusal and language routing through sparse feed-forward neurons, yet no theory predicts when a single-neuron intervention controls a behavior coherently rather than collapsing the output. We develop a budget-normalized control-window framework for single-neuron steering. A dose along one write direction reduces to one control coordinate: the alignment between the residual stream and the write, driven along a universal saturation curve in units of a coherence budget set by the residual norm divided by the write norm. Coherent control exists when a behavior trigger lies below the collapse ceiling. The same coordinate governs benign mode switches and refusal; the ceiling follows from weights and one generic forward pass, while triggers are measured at rollout. On fifteen held-out neurons, the predicted ceiling has mean absolute error 0.14, about 0.07 in bulk layers, and the committed open-or-closed verdict holds on eleven against a ten-of-fifteen majority baseline. Closed cases expose three failure modes rather than violations: collapse before trigger, too little depth to propagate, or a normalization that caps how far one neuron can push. The law explains why local gradient attribution anti-predicts control: true controllers write off the readout axis and carry a near-zero first-order gradient. A forward-only contrastive screen made precise by the window recovers controllers that attribution misses. On refusal, the hardest case, intervention success is typed, not scalar: coherent bypass and strict actionable reach separate, so a neuron can flip refusal in fluent, on-task text with no actionable content, and genuine actionable reach appears only for three of six audited Llama pivots and only at later rollout horizons. Single-neuron steering is therefore a budgeted, typed audit of controllability rather than a fixed-dose anecdote.
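
The two quantities the abstract defines explicitly, the coherence budget and the open-or-closed window test, can be written down directly. The sketch below is a reading of the abstract under stated assumptions: expressing a dose in budget units by simple division is an assumption, the saturation curve is not modelled, and the vectors and thresholds are hypothetical.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def coherence_budget(residual, write):
    """Residual norm divided by write norm, as stated in the abstract."""
    return norm(residual) / norm(write)

def dose_in_budget_units(dose, residual, write):
    """Assumed normalization: raw dose expressed as a multiple of the budget."""
    return dose / coherence_budget(residual, write)

def window_is_open(trigger, ceiling):
    """Coherent control exists when the behavior trigger lies below the collapse ceiling."""
    return trigger < ceiling

residual = [3.0, 4.0]  # toy residual-stream vector
write = [0.0, 2.0]     # toy write direction of one neuron
budget = coherence_budget(residual, write)
normalized_dose = dose_in_budget_units(5.0, residual, write)
```

A neuron whose trigger sits at or above the ceiling falls into the "collapse before trigger" failure mode listed in the abstract.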

2606.19857 2026-06-19 cs.CL cs.AI New submission

Large Language Models Do Not Always Need Readable Language

Jiayi Zhu, Haoxuan Peng, Junxi Wang, Liang Ke, Chen Zhang, Linfeng Zhang

AI summary Proposes BabelTele, a representation that encodes semantics in compact, non-standard text, sacrificing human readability while remaining recoverable by LLMs; experiments show text can be compressed to 27.9% of its original length while retaining 99.5% semantic fidelity, reducing context overhead.

Comments 23 pages, 10 figures. Preprint

Details
Abstract

Large language models (LLMs) are commonly prompted and interfaced with human-readable natural language, even when the intended reader is another model. This paper investigates whether semantic information can be encoded in compact, non-standard textual forms that sacrifice human readability while remaining recoverable by LLMs. We refer to this class of model-centric textual representations as BabelTele, approached here not as a fixed protocol but as an empirical probe into LLMs' capacity to generate and interpret such representations. Through readability diagnostics, model likelihood measures, human questionnaires, and downstream task evaluations, we find that BabelTele can substantially depart from ordinary natural language while preserving core semantics for instruction-tuned LLMs. As a task-agnostic representational paradigm, BabelTele demonstrates high information density, maintaining 99.5% semantic fidelity even when the text volume is condensed to 27.9% of its original length. We further evaluate its semantic robustness in cross-model transfer, agent memory, and multi-agent communication. Results suggest that BabelTele can reduce context overhead while generally maintaining reliable downstream performance, although its effectiveness depends on the compressor-reader pair and task setting. These findings indicate that human readability, natural-language typicality, and model-side semantic recoverability can be partially decoupled, opening a path toward model-native representations in future exploration of LLM systems.

2606.19946 2026-06-19 cs.CL cs.LG New submission

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

Yu Deng

AI summary Proposes GEMS, which addresses distributional deviation and directional interference in training-free multi-directional activation intervention through geometric constraints (norm-preserving weighted superposition and targeted attention-pathway injection for the former, real-time orthogonalization for the latter), preserving 98% accuracy on GSM8K.

Comments 30 pages, 5 figures, 20 tables. Code and logs are available at: https://github.com/LuLu663939/gems-multi-semantic-steering

Details
Abstract

Activation steering controls model behavior by modifying intermediate hidden states at inference time without retraining. Existing methods handle only single-direction injection; when multiple semantic directions are superposed without constraints, the model collapses. We show that this collapse decomposes into two independently acting sources: distributional deviation, where additive perturbations accumulate in norm across layers and drive activations outside the training distribution, and directional interference, where non-orthogonal semantic vectors mutually dampen when superposed. These two sources define the design constraints that any training-free multi-directional intervention must address. As one instantiation of these principles, we propose GEMS, a training-free method that maps each source to a corresponding geometric constraint: norm-preserving weighted superposition and targeted attention-pathway injection for distributional deviation, and real-time orthogonalization for directional interference. On GSM8K, injecting three concurrent non-mathematical directions preserves accuracy at 98% (baseline 92%), while unconstrained addition collapses to 4%; on Wikitext-2, the same injection incurs only 2.2% PPL increase. Component ablation isolates the causal role of each constraint, and layer-level probes confirm that orthogonalized signals survive the FFN pathway and reach the output distribution with semantic specificity. Qualitative steering effects transfer across architectures from 3B to 31B.
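
Two of the geometric constraints named above can be illustrated with plain vectors. This is a minimal sketch under assumptions, not the GEMS implementation: Gram-Schmidt stands in for "real-time orthogonalization", rescaling to the original norm stands in for "norm-preserving weighted superposition", attention-pathway injection is not modelled, and the directions are assumed linearly independent.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def orthogonalize(directions):
    """Gram-Schmidt: strip from each direction its components along earlier ones."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            coeff = dot(r, b)
            r = [x - coeff * y for x, y in zip(r, b)]
        n = norm(r)
        if n < 1e-12:
            raise ValueError("directions must be linearly independent")
        basis.append([x / n for x in r])
    return basis

def steer(hidden, directions, weights):
    """Add weighted orthogonalized directions, then rescale to the original norm."""
    mixed = list(hidden)
    for w, b in zip(weights, orthogonalize(directions)):
        mixed = [m + w * x for m, x in zip(mixed, b)]
    scale = norm(hidden) / norm(mixed)
    return [scale * m for m in mixed]

hidden = [3.0, 4.0, 0.0]                         # toy hidden state
directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # toy non-orthogonal directions
basis = orthogonalize(directions)
steered = steer(hidden, directions, weights=[1.0, 1.0])
```

The rescaling keeps the steered state at the same norm as the input, which targets the norm accumulation the abstract calls distributional deviation; the orthogonalization targets directional interference.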

2606.20089 2026-06-19 cs.CL cs.AI 新提交

IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

IHUBERT: 面向波斯语资源的基于向量的语义去重与领域平衡预训练

Arash Ghafouri, Mahdi Firouzmandi, Hossein Saberi, Mohammad Reza Hasani Ahangar

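AI summary Proposes IHUBERT, a Persian pretrained model based on RoBERTa-base, trained on a 45 GB corpus after multi-stage preprocessing (including vector-database-based semantic deduplication and domain balancing); it achieves leading results on several NLU tasks, with extractive question answering standing out.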

详情
AI中文摘要

波斯语预训练语言模型仍然受到大规模高质量预训练语料库稀缺以及标准分类和NER任务之外评估不足的限制。我们提出了IHUBERT,一个从头训练的波斯语单语PLM,采用RoBERTa-base编码器(1.25亿参数),在Sepahr-Danesh集合的45GB精选子集(约70-80亿token)上进行训练。为了提高语料质量并减少冗余,我们采用多阶段预处理流程,包括规范化、精确和近似重复去除、匿名化,以及基于向量数据库的语义去重,以实现跨领域和语体的分布平衡控制。我们还在完整的预训练语料库上训练了一个13.9万词汇量的BPE分词器,以更好地捕捉波斯语的形态和拼写变化。IHUBERT在七个波斯语NLU基准测试上进行评估,涵盖NER、情感分析、主题分类、NLI、抽取式问答和关系抽取,使用任务标准指标(实体级F1、宏F1、EM/F1)。IHUBERT在抽取式QA上取得了最强增益,在PQuAD(F1 88.3542)和ParsiNLU-RC(F1 49.0987)上均排名第一,并在FarsTail上取得了最佳结果(宏F1 0.8350)。在NER和主题分类上,它保持竞争力(例如,ParsTwiNER上F1 0.8308;DigiMag上宏F1 0.7953),而关系抽取仍然是主要差距(PERLEX上宏F1 0.6684)。在IHUBERT预训练语料库上的受控分词器消融实验表明,在匹配词汇量下,BPE产生的子词碎片化程度略低于WordPiece,支持了我们的分词设计。总体而言,IHUBERT通过语义精选的大规模预训练以及跨分类和理解型任务的广泛评估,推进了波斯语语言建模。

英文摘要

Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and reduce redundancy, we employ a multi-stage preprocessing pipeline that includes normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication to control the distribution balance across domains and registers. We additionally train a 139k-vocabulary BPE tokenizer on the full pretraining corpus to better capture Persian morphology and orthographic variation. IHUBERT is evaluated on seven Persian NLU benchmarks covering NER, sentiment analysis, topic classification, NLI, extractive question answering, and relation extraction, using task-standard metrics (entity-level F1, Macro-F1, EM/F1). IHUBERT achieves its strongest gains on extractive QA, ranking first on both PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and attains the best result on FarsTail (Macro-F1 0.8350). On NER and topic classification, it remains competitive (e.g., 0.8308 F1 on ParsTwiNER; 0.7953 Macro-F1 on DigiMag), while relation extraction is the main remaining gap (0.6684 Macro-F1 on PERLEX). A controlled tokenizer ablation on the IHUBERT pretraining corpus shows that BPE yields slightly lower subword fragmentation than WordPiece at matched vocabulary size, supporting our tokenization design. Overall, IHUBERT advances Persian language modeling through semantically curated large-scale pretraining and broad evaluation across both classification and comprehension-oriented tasks.
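
The semantic deduplication step can be pictured as dropping any document whose embedding is too close to one already kept. The sketch below is an assumption-laden illustration, not the IHUBERT pipeline: it uses a brute-force cosine comparison in place of a vector database, and the function name and the `threshold` value are hypothetical.

```python
import numpy as np

def semantic_dedup(embeddings, threshold=0.9):
    """Return the indices of documents kept after greedy near-duplicate removal.

    A document is dropped when its cosine similarity to any already kept
    document reaches `threshold` (a hypothetical value, not from the paper).
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize each row
    kept = []
    for i in range(len(x)):
        if all(np.dot(x[i], x[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

A real pipeline at the 45 GB scale would replace the inner loop with an approximate nearest-neighbor query; the greedy keep-or-drop logic is the part this sketch is meant to show.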

2606.20097 2026-06-19 cs.CL 新提交

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

HydraHead:从头部级功能异质性到专业化注意力混合

Zhentao Tan, Wei Chen, Jingyi Shen, Yao Liu, Xu Shen, Yue Wu, Jieping Ye

发表机构 * Alibaba Group(阿里巴巴集团)

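AI summary Proposes HydraHead, an architecture that hybridizes full attention and linear attention along the head dimension; with interpretability-driven head selection and a scale-normalized fusion module, it outperforms layer-wise hybrid designs on long-context tasks and, trained on only 15B tokens, improves over the baseline by more than 69% at 512K context length.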

详情
AI中文摘要

注意力的二次复杂度对长上下文处理构成了关键瓶颈,激发了混合注意力设计的兴趣。大多数开源混合模型采用层级策略。然而,先前工作注意到线性注意力与全注意力整合的内在困难,表明注意力混合的设计空间仍未充分探索。为了探索这一空间,我们进行可解释性分析,观察到层表现出块级功能相似性,而同一层内的单个头部尽管共享输入特征,却显示出不同的功能专门化。这种头部级异质性表明,头部维度为融合异质注意力信号提供了自然且原则性的粒度。基于这一洞察,我们引入了HydraHead,一种沿头部轴混合全注意力和线性注意力的新型架构。HydraHead具有两个关键创新:(1)一种可解释性驱动的选择策略,识别检索关键的头部并仅为其保留全注意力;(2)一种尺度归一化融合模块,调和全注意力和线性注意力头部输出之间的分布差距。通过利用参数重用和蒸馏的三阶段迁移流程,我们以最小的训练开销实现了高性能混合模型。在统一的训练设置下,HydraHead在长上下文任务中优于其他混合设计,同时保持强大的通用推理能力。通过可解释性驱动的头部选择,它以7:1的线性注意力与全注意力比例匹配了3:1层级混合的长上下文性能。关键的是,仅用15B token训练,HydraHead在512K上下文长度上比基线提升超过69%,接近Qwen3.5(一个具有256K原生上下文长度的类似规模领先模型)。这突显了头部级混合的显著扩展潜力。

英文摘要

The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid's long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.

2606.20482 2026-06-19 cs.CL cs.HC cs.LG 新提交

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

你的鼠标和眼睛悄悄泄露你的偏好:利用用户隐式反馈进行LLM对齐

Haw-Shiuan Chang, Jeffrey Gomez, Mehul Patwari, Aryan Sajith, Hamed Zamani

发表机构 * University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校) York University(约克大学)

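AI summary To address the scarcity of explicit feedback, proposes training reward models on implicit feedback such as mouse trajectories and eye-gaze data, raising text-based reward model accuracy from 55% to 64% and markedly improving response quality after DPO alignment.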

详情
AI中文摘要

为了对齐大型语言模型(LLM),大多数现有方法收集显式的人类反馈,并基于响应文本训练奖励模型来预测人类偏好。这些现有方法有两个关键局限性。首先,用户很少为LLM响应提供显式反馈,这使得高质量偏好标注的收集成本高昂。其次,这些方法没有利用隐式人类反馈,而隐式反馈已被证明对互联网巨头的经济护城河至关重要。为了量化隐式反馈的价值,我们构建了一个名为IFLLM的新数据集,收集了来自59名Mechanical Turk工作者的1336个多轮问题、他们的鼠标轨迹以及通过网络摄像头对LLM响应的眼动注视点。IFLLM显示用户具有非常多样化的注视行为和鼠标轨迹。基于隐式用户反馈的奖励模型将基于文本的奖励模型准确率从55%提升至64%,并在将DPO应用于八个LLM后,相对响应质量改进几乎翻了三倍,证明了隐式反馈在现实场景中的价值。我们的数据收集网站、数据集和代码可在以下网址找到:此https URL。

英文摘要

To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict human preference from the response text. These methods have two key limitations. First, users rarely provide explicit feedback on LLM responses, which makes high-quality preference annotation expensive to collect. Second, the methods do not leverage implicit human feedback, which has proven vital to the economic moats of Internet giants. To quantify the value of implicit feedback, we build a new dataset called IFLLM, which collects 1336 multi-turn questions from 59 Mechanical Turk workers, together with their mouse trajectories and their eye-gaze points on the LLMs' responses, recorded through their webcams. IFLLM shows that users exhibit very diverse gazing behavior and mouse trajectories. Our reward model based on implicit user feedback boosts the accuracy of the text-based reward model from 55% to 64% and nearly triples the relative response quality improvements after applying DPO to eight LLMs, demonstrating the value of implicit feedback in the wild. Our data collection website, dataset, and code can be found at https://github.com/themehulpatwari/llm-implicit-feedback/.

2. Machine Translation and Cross-Lingual Processing (2 papers)

2606.19346 2026-06-19 cs.CL cs.AI 新提交

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

跨语言迁移中语言相关性与任务对齐的解耦

Ahmed Haj Ahmed, Ruochen Zhang, Alvin Grissom

发表机构 * Haverford College(哈弗福德学院) Brown University(布朗大学)

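AI summary By fine-tuning large language models and evaluating zero-shot reading comprehension on Semitic and non-Semitic languages, finds that cross-lingual transfer mainly improves task-format alignment rather than language-specific knowledge.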

详情
AI中文摘要

我们通过微调七个大语言模型(4B--671B参数)在阿拉伯语上,并在闪语族语言和非闪语族对照语言上评估零样本阅读理解,研究跨语言迁移。在密集架构和混合专家架构中,我们没有发现闪语族特定迁移的证据:基线较弱的模型在所有语言上都有显著提升,而基线较强的模型无论语言族系如何,只有边际提升。思维链消融实验强化了这一发现——从微调中获益最多的模型同样从推理时推理中获益,这表明两种机制都解决了任务格式对齐问题,而非跨语言知识迁移。

英文摘要

We study cross-lingual transfer by fine-tuning seven large language models (4B to 671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all languages, while strong-baseline models show only marginal gains regardless of language family. A chain-of-thought ablation reinforces this finding: the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggesting both mechanisms address task-format alignment rather than cross-lingual knowledge transfer.

2606.19640 2026-06-19 cs.CL cs.AI cs.HC 新提交

Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

创建多语言心理健康对话数据集:基于国籍和语言的人物角色本地化方法的局限性

Yunkai Xu, Saeed Abdullah

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学)

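AI summary Generates clinical dialogues in Mandarin, Bengali, and Hindi by modifying the nationality and language parameters of personas, and finds that adding these parameters alone introduces cross-lingual clinical inconsistency and that LLMs are inaccurate when assessing depression severity in non-English text.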

Comments 15 pages, 4 figures. Accepted to the 2026 Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026), co-located with ACL 2026

详情
AI中文摘要

人工智能和大语言模型(LLMs)已成为应对全球心理健康挑战的有前景的工具。尽管这些挑战具有全球性,但用于训练和评估此类系统的高质量数据集仍然严重短缺。为弥补这一差距,研究人员越来越多地生成合成临床人物角色来模拟用户数据并测试数字心理健康支持系统。然而,大多数经过验证的人物角色依赖于以英语为中心的语境。本文研究了是否可以使用类似的人物角色方法生成多语言心理健康数据集。我们修改了人物角色中的国籍和语言参数,以生成普通话、孟加拉语和印地语的临床对话。然后,我们考察了不同LLM在评估这些生成的多语言数据集的抑郁严重程度(与英语基线相比)时的表现。我们的研究结果表明,仅在人物角色中添加国籍和语言参数可能不够,因为它可能引入跨语言的临床不一致性。LLM评判模型在评估非英语文本中的抑郁严重程度时常常表现出不准确性,且不同模型的性能存在差异。这暴露了将以英语为中心的人物角色应用于多语言语境的系统性局限性。最终,我们的工作强调了迫切需要文化响应式数据生成,以确保全球心理健康系统的公平性。

英文摘要

AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.

3. Information Extraction, Retrieval, and Question Answering (8 papers)

2606.19345 2026-06-19 cs.CL cs.AI 新提交

Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts

基于摘要识别PubMed中EQ-5D研究的大型语言模型集成

Zhyar Rzgar K. Rostam, Márta Péntek, János Tibor Czere, Zsombor Zrubka, László Gulácsi, Gábor Kertész

发表机构 * Doctoral School of Applied Informatics and Applied Mathematics, Obuda University(欧布达大学应用信息学与应用数学博士学院) John von Neumann Faculty of Informatics, Obuda University(欧布达大学约翰·冯·诺伊曼信息学学院) Doctoral School of Innovation Management, Obuda University(欧布达大学创新管理博士学院)

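AI summary Proposes a multi-phase framework that ensembles LLMs such as Gemini and Gemma, using few-shot prompting, weighted ensembling, and a soft stacking meta-classifier to automatically detect EQ-5D studies in PubMed; the weighted ensemble reaches a weighted F1 of 0.74.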

Comments 6 pages, 7 tables, 8 equations

详情
AI中文摘要

科学出版物的快速增长导致系统文献综述(SLR)中的人工研究筛选越来越耗费资源、效率低下且不一致。分类明确报告健康相关生活质量结果(如EQ-5D数据)的研究需要高水平的临床解释,并给人类评审者带来挑战。本研究探讨了使用Google的Gemini和Gemma大型语言模型(LLM)仅基于已发表摘要自动检测PubMed生物医学数据库中的EQ-5D。提出了一个多阶段框架,集成了少样本提示、权重集成聚合和软堆叠元分类器。在由两位专家手动标记的PubMed研究数据集上评估了九个LLM的EQ-5D报告情况。gemini-2.5-pro、gemma-3-12b和gemma-3-27b的加权集成获得了0.74的加权F1分数和0.74的准确率,超过了单独获得的结果。与单个模型相比,表现最佳模型的集成改善了精确率和召回率之间的平衡,而软堆叠方法提供了更高的可靠性和可解释性。特征分析表明,模型的概率结果在指导最终预测中很重要。研究结果表明,基于集成的LLM设置是自动化生物医学研究筛选的可靠且可扩展的方法。

英文摘要

The rapid increase in scientific publications makes manual study screening in systematic literature reviews (SLRs) increasingly resource-intensive, inefficient, and inconsistent. Classifying studies that clearly report health-related quality-of-life results, such as EQ-5D data, requires a high level of clinical interpretation and poses challenges for human reviewers. This study investigates the use of Google's Gemini and Gemma large language models (LLMs) in automating EQ-5D detection in the PubMed biomedical database based only on published abstracts. A multi-phase framework is proposed that integrates few-shot prompting, weighted ensemble aggregation, and a soft stacking meta-classifier. Nine LLMs are evaluated on a dataset of PubMed studies manually labeled by two experts regarding EQ-5D reporting. The weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b obtained a 0.74 weighted F1-score and 0.74 accuracy, exceeding individually attained results. The ensembling of top-performing models improved the balance between precision and recall compared to individual models, while the soft stacking approach provided greater reliability and interpretability. Feature analysis shows that the probability outputs of the models are important in guiding the final predictions. The findings suggest that an ensemble-based LLM setup is a reliable and scalable approach for automating screening in biomedical research.
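
A weighted ensemble of per-model probabilities can be sketched in a few lines. The example below is only an illustration of weighted soft voting: the function name, the weights, and the 0.5 decision threshold are assumptions, and the abstract does not say how the paper derives its weights.

```python
def weighted_ensemble(probabilities, weights, threshold=0.5):
    """Combine per-model probabilities that an abstract reports EQ-5D data.

    `probabilities` and `weights` are parallel lists, one entry per model.
    Returns the weighted-average score and the resulting binary decision.
    """
    total = sum(weights)
    score = sum(p * w for p, w in zip(probabilities, weights)) / total
    return score, score >= threshold
```

For three models with probabilities 0.9, 0.4, and 0.6 and hypothetical weights 0.5, 0.2, and 0.3, the combined score is 0.71, so the abstract is flagged as an EQ-5D study even though one model voted against it.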

2606.19351 2026-06-19 cs.CL cs.AI 新提交

Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

基于大语言模型的知识图谱推理中的幻觉检测

Xinyan Zhu, Yaoqi Liu, Yue Gao, Huadong Ma, Cheng Yang, Chuan Shi

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学)

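AI summary Proposes LUCID, which combines LLM attention scores with knowledge graph semantics and structural information and uses a graph neural network to detect hallucinations in LLM-based knowledge graph reasoning, achieving the best performance on nine datasets.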

详情
AI中文摘要

知识图谱推理从现有事实中推断新知识,广泛应用于问答、推荐和决策支持。随着大语言模型(LLM)的快速发展,基于LLM的知识图谱推理框架通过利用检索到的知识图谱信息变得越来越流行。然而,LLM中的幻觉仍然是一个关键问题。即使融入了相关的知识图谱知识,模型仍可能生成错误输出,导致错误信息和不可靠的决策。现有的幻觉检测方法要么关注LLM内部状态,要么验证与检索上下文的一致性,但两者都忽略了知识图谱中的结构信息,导致性能次优。为了解决这一差距,我们提出了LUCID,这是首个针对基于LLM的知识图谱推理框架的幻觉检测方法。LUCID联合利用LLM注意力分数、知识图谱语义和结构信息。具体来说,它从注意力分数和语义相似度中提取节点和边特征,并使用图神经网络将其与知识图谱结构集成。我们还构建了人工标注的基准数据集用于评估。在九个数据集上的实验表明,与15个基线相比,LUCID达到了最先进的性能。

英文摘要

Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG information. However, hallucinations in LLMs remain a critical issue. Even when relevant KG knowledge is incorporated, models may still generate incorrect outputs, leading to misinformation and unreliable decisions. Existing hallucination detection methods either focus on LLM internal states or verify consistency with retrieved contexts, but both overlook the structural information in KGs, resulting in suboptimal performance. To address this gap, we propose LUCID, the first halLUcination deteCtIon method for LLM-based knowleDge graph reasoning frameworks. LUCID jointly leverages LLM attention scores, KG semantics, and structural information. Specifically, it extracts node and edge features from attention scores and semantic similarities, and integrates them with KG structure using a graph neural network. We also construct manually annotated benchmark datasets for evaluation. Experiments on nine datasets show that LUCID achieves state-of-the-art performance compared to 15 baselines.

2606.19667 2026-06-19 cs.CL 新提交

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

CacheWeaver:面向高效接地RAG推理的缓存感知证据排序

Kaizhen Tan, Rong Gu, Mingyuan Li

发表机构 * Heinz College of Information Systems and Public Policy, Carnegie Mellon University(卡内基梅隆大学海因茨信息系统与公共政策学院)

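AI summary Proposes CacheWeaver, a lightweight prompt-layer method that lowers time-to-first-token in RAG inference through cache-aware evidence ordering, without modifying the serving engine or the evidence set.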

详情
AI中文摘要

检索增强生成(RAG)改善了事实基础,但也延长了提示并增加了预填充成本。vLLM等服务引擎中的前缀缓存仅在请求共享相同令牌前缀时降低此成本。然而,在接地生成中,相邻查询可能以不同顺序检索重叠证据,因此集合重叠不会变成可重用的前缀重叠。我们提出CacheWeaver,一种用于缓存感知证据排序的轻量级提示层方法。该方法维护最近服务的证据序列的前缀树,并使用贪婪遍历将最可重用的前缀放在首位,同时保持服务引擎和检索到的证据集不变。在三种vLLM配置中,相对于检索顺序前缀缓存,该方法将中位首令牌时间(TTFT)降低了约20-33%,且在我们的QA测试中不损害答案质量。贪婪策略达到了Oracle排序中位TTFT增益的97.5%,表明大多数可重用前缀局部性可以通过检索和推理之间的简单调度层恢复。

英文摘要

Retrieval-Augmented Generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token prefix. In grounded generation, however, adjacent queries may retrieve overlapping evidence in different orders, so set overlap does not become reusable prefix overlap. We present CacheWeaver, a lightweight prompt-layer method for cache-aware evidence ordering. The method keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first, while leaving the serving engine and retrieved evidence set unchanged. Across three vLLM configurations, the method lowers median time-to-first-token (TTFT) by about 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in our QA tests. The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering, indicating that most reusable prefix locality can be recovered by a simple scheduling layer between retrieval and inference.
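
The prefix tree and greedy walk described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the CacheWeaver code: the class and function names are hypothetical, evidence is identified by plain string IDs, and "most reusable" is approximated by how often a prefix was served before.

```python
class PrefixTree:
    """Counts how often each evidence-ID prefix has been served recently."""

    def __init__(self):
        self.children = {}
        self.count = 0

    def insert(self, sequence):
        node = self
        for doc_id in sequence:
            node = node.children.setdefault(doc_id, PrefixTree())
            node.count += 1

def cache_aware_order(tree, retrieved):
    """Reorder `retrieved` so it starts with the most frequently served prefix.

    The evidence set is unchanged; only the order differs.
    """
    remaining = list(retrieved)
    ordered = []
    node = tree
    while remaining:
        candidates = [d for d in remaining if d in node.children]
        if not candidates:
            break  # no cached prefix continues through the remaining evidence
        best = max(candidates, key=lambda d: node.children[d].count)
        ordered.append(best)
        remaining.remove(best)
        node = node.children[best]
    # Evidence without a reusable prefix keeps its retrieval order.
    return ordered + remaining
```

After `tree.insert(["a", "b", "c"])`, a new request that retrieves `["c", "b", "a", "d"]` is reordered to `["a", "b", "c", "d"]`, so its first three documents line up with the previously served prefix and only `d` is new to the prompt.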

2606.19700 2026-06-19 cs.CL 新提交

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

TerraMARS: 用于火星地球化改造文献的领域自适应小语言模型管道

Jyotsna Singh, Ash Black, Jeff Larsen, Scott R. Saleska

发表机构 * University of Arizona(亚利桑那大学) College of Information Science, University of Arizona(亚利桑那大学信息科学学院) Biosphere 2, University of Arizona(亚利桑那大学生物圈2) Department of Ecology and Evolutionary Biology, University of Arizona(亚利桑那大学生态与进化生物学系) Department of Environmental Sciences, University of Arizona(亚利桑那大学环境科学系)

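AI summary Proposes the TerraMARS pipeline, which uses a domain-adapted small language model to extract structured information from Mars science literature in support of terraforming research.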

Comments 16 pages, 1 figure, 4 tables

详情
AI中文摘要

研究人员有兴趣了解火星,以便最终使其适合人类居住。为此,需要通过科学文献全面了解行星的大气、水文、表面化学、辐射环境和空间特征。这些文献包含有价值的信息和有意义的定量约束,可用于其他模型和研究,如宜居性评估和未来的地球化改造研究。我们提出了TerraMARS,一个端到端的信息提取管道,它结合了领域自适应的小语言模型来回答火星地球化改造相关问题,并将非结构化的火星科学文本转换为机器可读的结构化输出(JSON格式)。收集了一个开放获取论文语料库,并使用多阶段检索和分块框架进行处理。使用量化低秩自适应(QLoRA)对火星特定问答和信息提取数据集进行微调,使Google Gemma 3 1B适应领域。生成的管道产生两种类型的输出,并为将科学文献中的知识整合到下游应用(如数字孪生和火星宜居性建模)提供了基础。该管道的输出看起来很有前景,但需要进一步改进以提高提取准确性和事实一致性。

英文摘要

Researchers are interested in learning about Mars so that it may eventually become habitable for humans. Achieving this requires comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features, gathered from the scientific literature. This literature contains valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that uses a domain-adapted Small Language Model to answer Mars terraforming-related questions and to convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.

2606.19710 2026-06-19 cs.CL cs.AI 新提交

FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs

FineREX: 面向人口走私知识图谱的微调NER-RE

Elijah Feldman, Dipak Meher, Carlotta Domeniconi

发表机构 * Thomas Jefferson High School for Science and Technology(托马斯·杰斐逊科技高中)

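AI summary Proposes FineREX, a pipeline built on a fine-tuned LLM that extracts entities and relations from legal documents to construct knowledge graphs, achieving absolute F1 improvements of 15.50% and 31.46% respectively and cutting processing time by 50%.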

Comments Code available at https://github.com/ElijahFeldman7/FineREX

详情
AI中文摘要

法庭记录包含关于人口走私网络的有价值证据,但这些信息通常埋藏在非结构化的、充满术语的法律文件中。虽然大型语言模型(LLM)可以通过自动信息提取支持知识图谱构建,但现有方法依赖通用模型,未针对该领域所需的实体和关系定义进行定制。我们提出FineREX,一个精简的知识图谱构建流水线,基于微调的LLM进行命名实体识别和关系提取(NER-RE)。使用包含512个文本块的手动标注数据集,FineREX在实体和关系F1分数上分别比更大的通用基线模型绝对提高了15.50%和31.46%。这些提升转化为更高质量的知识图谱,将法律噪声减少近一半,并将长文档上的节点重复率从17.78%降至11.17%。通过消除文档重写和冗余提取阶段,FineREX还将端到端处理时间减少了50.0%。我们的结果表明,领域特定的微调可以显著优于更大的通用模型,同时提高非法网络分析知识图谱构建的质量和效率。

英文摘要

Court proceedings contain valuable evidence about human smuggling networks, but this information is often buried within unstructured, jargon-heavy legal documents. While large language models (LLMs) can support knowledge graph construction through automated information extraction, existing approaches rely on general-purpose models that are not tailored to the entity and relationship definitions required in this domain. We introduce FineREX, a streamlined knowledge graph construction pipeline built around a fine-tuned LLM for named entity recognition and relationship extraction (NER-RE). Using a manually annotated dataset of 512 text chunks, FineREX achieves absolute improvements of 15.50% and 31.46% in entity and relationship F1-score, respectively, compared to a larger general-purpose baseline. These gains translate into higher-quality knowledge graphs, reducing legal noise by nearly half and lowering node duplication on long documents from 17.78% to 11.17%. By eliminating document rewriting and redundant extraction stages, FineREX also reduces end-to-end processing time by 50.0%. Our results demonstrate that domain-specific fine-tuning can substantially outperform larger general-purpose models while improving both the quality and efficiency of knowledge graph construction for illicit network analysis.

2606.19852 2026-06-19 cs.CL cs.LG 新提交

Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

提示、规划、提取:用于从临床叙述中提取肺部病理学的零样本智能体LLM工作流

Aman Pathak, Cheng Peng, Mengxian Lyu, Ziyi Chen, Reema Solan, Sankalp Talankar, Yasir Khan, Hiren Mehta, Aokun Chen, Yi Guo, Yonghui Wu

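AI summary Proposes a zero-shot agentic workflow that uses open-source large language models to extract 13 CAP fields from lung resection pathology reports, reaching 0.893 Micro-F1 without training and approaching supervised methods.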

Comments 7 pages, 2 figures, 3 tables. Affiliations: (1) Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; (2) Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA; (3) College of Nursing, Florida State University, Tallahassee, FL, USA

详情
AI中文摘要

从病理报告中提取信息对于癌症分期和肿瘤登记人群至关重要。然而关键数据仍嵌入在叙述性报告中,使得手动提取劳动密集且易出错。传统的监督自然语言处理流程通过完全监督的命名实体识别和关系提取来解决这一问题,但需要昂贵的人工标注,并且当上游实体缺失时会出现级联故障。在本研究中,我们开发了一个零样本智能体工作流,并评估了五个开源生成式大语言模型(LLMs),以从肺切除病理报告中填充13个美国病理学家学会的概要字段。我们使用一种新颖的、与注册对齐的评估框架,将它们与最先进的监督GatorTron NER-RE基线进行比较。基线达到了0.960的Micro-F1,而最佳零样本模型(GPT-OSS-20B)达到了0.893的Micro-F1(召回率:0.949),在没有任务特定训练的情况下准确提取了复杂关系(如病理分期)。这些结果表明,开源零样本智能体LLMs是提取肺部病理信息的低成本解决方案。

英文摘要

Information extraction from pathology reports is essential for cancer staging and tumor registry population. Yet key data remain embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional supervised Natural Language Processing pipelines address this through fully supervised Named Entity Recognition and Relation Extraction, but they require expensive manual annotation and suffer cascading failures when upstream entities are missed. In this study, we developed a zero-shot agentic workflow and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports. We compared them against a state-of-the-art supervised GatorTron NER-RE baseline using a novel, registry-aligned evaluation framework. The baseline achieved a Micro-F1 of 0.960, while the best zero-shot model (GPT-OSS-20B) achieved a Micro-F1 of 0.893 (recall: 0.949), accurately extracting complex relations like Pathologic Stage without task-specific training. These results suggest that open-source, zero-shot agentic LLMs are a low-cost solution for extracting lung pathology information.

2606.20072 2026-06-19 cs.CL 新提交

Source-Grounded Data Generation for Text-to-JSON Learning

基于源数据的文本到JSON学习数据生成

Sunghee Ahn, Guijin Son, Youngjae Yu

发表机构 * Seoul National University(首尔大学)

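AI summary Proposes STAGE, which uses spreadsheets as source data, generates reports and JSON schemas with LLMs, and validates ground-truth values, substantially improving training-data quality for text-to-JSON tasks.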

Comments Preprint

详情
AI中文摘要

从财务文件到临床记录,传统行业严重依赖冗长、非结构化的文档来存储高价值信息。将这些信息可靠地提取为结构化的、机器可读的表示形式,是使自动化系统能够访问这些内容的关键前提。JSON是这种结构化提取的自然目标,然而构建可靠且可扩展的文本到JSON训练数据仍然具有挑战性。为了解决这一差距,我们提出了STAGE(电子表格基础的文本到JSON工件生成),一种基于源数据的数据生成管道,通过使用LLM进行可扩展合成,同时根据底层电子表格验证真实值,来构建报告和JSON模式。在STAGE-Eval(我们的基于源数据的基准测试,包含851个示例的测试集)上的评估表明,STAGE生成的训练数据优于现有方法。这使Qwen3-4B的精确匹配从31.37%提高到74.27%,值准确率从45.46%提高到90.69%。

英文摘要

From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite to making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schema by using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.

2606.20113 2026-06-19 cs.CL cs.IR 新提交

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

流式工具使用何时有帮助?表征流式检索增强生成中的工具意图稳定化

Elroy Galbraith

发表机构 * SMG Labs(SMG实验室)

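AI summary Measures tool-intent stabilization (the point at which a speculative query converges to the answer) to analyze latency hiding in streaming RAG on the CRAG benchmark, finding that 73.9% of queries admit substantial latency hiding and identifying predictors of early versus late stabilization.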

详情
AI中文摘要

流式检索增强生成(Streaming RAG)通过在用户输入完成前并行发出工具查询来减少用户感知的延迟。报告的性能提升是聚合性的,但该机制的好处本质上是查询内在的:只有当正确的工具查询在用户停止说话或打字之前变得可确定时,推测才有帮助。我们隔离并测量了这一属性——工具意图稳定化,即输入流中推测查询的检索收敛到包含答案的结果的时间点。在CRAG基准(1371个验证问题)上,我们(i)测量了稳定化的分布,(ii)推导出一个与模型无关的界限H,表示可以隐藏在用户剩余输入背后的工具延迟比例,该比例是工具延迟L和输入节奏δ的函数,(iii)通过一个工作流式管道验证了实际节省达到或超过此界限,(iv)识别了哪些查询属性预测早期与晚期稳定化。该研究无需模型训练,在普通CPU硬件上运行。我们发现,在现实操作点(L=600ms,δ=3w/s,θ=0.8)下,整个基准中73.9%的查询实现了显著的延迟隐藏——这一混合数字结合了21.3%的问题(其中黄金证据以原文形式存在且可被BM25检索)上的充分稳定化(在此有利切片上95.2%可流式处理)以及其余问题上的无基础top-1稳定化回退。在有利切片上,φ_suf被精确和宽松基础限定在[0.26, 0.281]之间——两者均为早期。问题类型产生了显著但粗略的早期/晚期划分(Kruskal-Wallis p=0.017, epsilon^2=0.04),直接指导了何时学习到的推测触发器值得其成本。

英文摘要

Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the mechanism's benefit is fundamentally query-intrinsic: speculation can only help when the correct tool query becomes determinable before the user stops speaking or typing. We isolate and measure this property, tool-intent stabilization: the point in the input stream at which a speculative query's retrieval converges to the answer-bearing result. On the CRAG benchmark (1371 validation questions) we (i) measure the distribution of stabilization, (ii) derive a model-agnostic bound H on the portion of tool latency that can be hidden behind the user's remaining input, as a function of tool latency L and input cadence δ, (iii) validate against a working streaming pipeline that realized savings meet or exceed this bound, and (iv) identify which query properties predict early versus late stabilization. The study requires no model training and runs on commodity CPU hardware. We find that at a realistic operating point (L=600ms, δ=3w/s, θ=0.8), 73.9% of queries across the full benchmark admit substantial latency hiding. This is a blended figure that mixes sufficiency stabilization on the 21.3% of questions where gold evidence is verbatim-present and BM25-retrievable (95.2% streamable on this favorable slice) with a grounding-free top-1-settling fallback on the remainder. On the favorable slice, ϕ_suf is bracketed to [0.26, 0.281] by exact and relaxed grounding, both early. Question type produces a significant but coarse early/late split (Kruskal-Wallis p=0.017, ε²=0.04), directly informing when a learned speculative trigger is worth its cost.

4. Dialogue Systems and Agents (4 papers)

2606.19356 2026-06-19 cs.CL cs.AI 新提交

Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

可信多智能体系统:使用Argent信令协议缓解语义漂移

Anantha Sharma

发表机构 * Synechron Inc(Synechron公司)

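AI summary Proposes the Argent Signaling Protocol (ASP), which uses structured quality signals to distinguish repairable from unrepairable failures, raising pass rates in document QA and blocking ungrounded propagation in multi-agent systems.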

Comments 17 pages

详情
AI中文摘要

当多智能体LLM系统产生错误答案时,并非所有失败都相同:有些答案基于正确材料但不完整,而另一些则完全无依据且应被阻止。当前的重新尝试策略对两种情况一视同仁(重试并希望最好),使得人类监督者无法判断重试是否合理或系统是否应停止。我们引入Argent信令协议(ASP),这是一种紧凑的机器可读头部,为每个AI生成的响应附带结构化质量信号:确定性(@C)、依据性(@G)、随机性(@S)以及一个假设索引,用于分类每个声明的证据基础。这些信号使控制器能够区分可修复失败与遏制失败,并对每种情况进行不同路由。我们在两种模式下评估ASP。在独立模式下,基于Array BioPharma/Ono许可协议的27个问题的文档问答基准,比较基线提示与ASP仪器化控制器动作在三个本地GGUF模型上的表现。在Qwen~(0.8B)上,ASP将通过率从11.1%提升至33.3%,平均术语覆盖率从36.7%提升至65.4%;在Dobby~(8B)上,ASP产生4次失败到通过的恢复,通过率从33.3%提升至44.4%;在SmolLM3~(3B)上,ASP在每次问题中交替进行修复和遏制。总体改进显著(从12/81通过到21/81通过)。在多智能体模式下,ASP侧车位于检索智能体和下游决策智能体之间;侧车100%阻止无依据的上游输出到达下游智能体(24/27被阻止,0次无依据传播)。

英文摘要

When multi-agent LLM systems produce bad answers, not all failures are equal: some answers are grounded in the right material but incomplete, while others are simply ungrounded and should be stopped. Current retry strategies treat both cases identically (try again and hope for the best), leaving human supervisors unable to tell whether a retry was warranted or whether the system should have halted instead. We introduce the Argent Signaling Protocol (ASP), a compact machine-readable header that accompanies every AI-generated response with structured quality signals: certainty (@C), grounding (@G), stochasticity (@S), and an assumption index that classifies the evidentiary basis of each claim. These signals enable a controller to distinguish repairable failures from containment failures and route each case differently. We evaluate ASP in two modes. In standalone mode, a 27-question document-grounded QA benchmark over the Array BioPharma/Ono license agreement compares baseline prompts against ASP-instrumented controller actions across three local GGUF models. On Qwen (0.8B), ASP improves pass rate from 11.1% to 33.3% and mean term coverage from 36.7% to 65.4%; on Dobby (8B), ASP produces 4 fail-to-pass recoveries, raising pass rate from 33.3% to 44.4%; on SmolLM3 (3B), ASP alternates between repair and containment per question. Aggregate improvement is meaningful (12/81 to 21/81 passes). In multi-agent mode, an ASP sidecar sits between a retrieval agent and a downstream decision agent; the sidecar blocks 100% of ungrounded upstream outputs from reaching the downstream agent (24/27 blocked, 0 ungrounded propagations).
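
To make the routing idea concrete, the sketch below parses a header carrying the three named signals and picks a controller action. The abstract does not give the wire format, so everything here is hypothetical: the `@C=0.9 @G=0.2 @S=0.1` syntax, the thresholds, and the action names are assumptions, and the assumption index is omitted.

```python
import re

# Hypothetical header syntax: "@C=<certainty> @G=<grounding> @S=<stochasticity>"
HEADER_PATTERN = re.compile(r"@([CGS])=([01](?:\.\d+)?)")

def parse_header(header):
    """Extract the quality signals from a header string into a dict of floats."""
    return {key: float(value) for key, value in HEADER_PATTERN.findall(header)}

def route(header, min_grounding=0.5, min_certainty=0.7):
    """Pick a controller action from the signals (thresholds are illustrative)."""
    signals = parse_header(header)
    if signals.get("G", 0.0) < min_grounding:
        return "contain"  # ungrounded: stop it from reaching downstream agents
    if signals.get("C", 0.0) < min_certainty:
        return "repair"   # grounded but uncertain: a retry may be warranted
    return "pass"
```

With these illustrative thresholds, a response headed `@C=0.9 @G=0.2 @S=0.1` is contained no matter how certain it claims to be, while `@C=0.4 @G=0.8 @S=0.3` is routed to repair.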

2606.19659 2026-06-19 cs.CL 新提交

SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

SAGE-OPD:面向多轮在策略蒸馏的选择性智能体引导干预

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

发表机构 * Meta AI

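AI summary: Proposes SAGE-OPD, a framework that uses environment feedback and teacher judgment to intervene selectively on student responses and combines confidence weighting with loss normalization to address error accumulation in multi-turn on-policy distillation, achieving a relative improvement of up to 13.3% on ALFWorld.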

Comments 21 pages, 3 figures

AI中文摘要

在策略蒸馏(OPD)通过训练学生模型在其自身策略生成的轨迹上来改进学生模型,使其成为缓解智能体训练中曝光偏差的一种有前景的方法。然而,大多数OPD研究集中在单轮设置,而现实中的LLM智能体需要与环境进行多轮交互。在这种机制下,早期错误会改变未来观察并沿轨迹累积,标准的密集令牌级OPD变得脆弱,因为它可能过度惩罚语义上有效的替代方案,强化局部退化(如重复动作),并在分布外历史中传播不可靠的教师监督。我们提出SAGE-OPD,一种专门为多轮OPD设计的无验证器选择性干预框架。SAGE-OPD不是在所有轮次上统一应用教师监督,而是首先观察环境反馈,并使用教师判断来决定每个学生响应是否应被跳过或干预。为了进一步解决累积错误,SAGE-OPD通过教师置信度对令牌级蒸馏进行加权,减少不确定的教师分布在受损或模糊历史上的影响。最后,SAGE-OPD应用损失归一化以保留标准OPD的整体损失规模,同时保持选择性轮次级加权。在智能体任务上的实验表明,SAGE-OPD持续优于基线,在ALFWorld未见成功率上比标准OPD实现了高达13.3%的相对提升。消融研究进一步表明,轮次级干预、教师置信度加权和损失归一化提供了互补的益处。我们的结果表明,有效的多轮OPD应保持策略内,但教师监督应选择性地分配到需要干预且可靠的轮次。

英文摘要

On-policy distillation (OPD) improves student models by training them on trajectories induced by their own policy, making it a promising approach for mitigating exposure bias in agent training. However, most OPD studies focus on single-turn settings, while realistic LLM agents interact with environments over multiple turns. In this regime, early errors can alter future observations and compound across the trajectory, and standard dense token-level OPD becomes brittle, as it may over-penalize semantically valid alternatives, reinforce local degeneracies such as repeated actions, and propagate unreliable teacher supervision on off-distribution histories. We propose SAGE-OPD, a verifier-free selective intervention framework specifically designed for multi-turn OPD. Instead of applying teacher supervision uniformly across all turns, SAGE-OPD first observes environment feedback and uses teacher judgment to decide whether each student response should be skipped or intervened on. To further address compounding errors, SAGE-OPD weights token-level distillation by teacher confidence, reducing the influence of uncertain teacher distributions on corrupted or ambiguous histories. Finally, SAGE-OPD applies loss normalization to preserve the overall loss scale of standard OPD while retaining selective turn-level weighting. Experiments on agent tasks show that SAGE-OPD consistently improves over baselines, achieving up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablation studies further demonstrate that turn-level intervention, teacher confidence weighting, and loss normalization provide complementary benefits. Our results suggest that effective multi-turn OPD should remain on-policy, but teacher supervision should be selectively allocated to turns where intervention is necessary and reliable.
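
The three ingredients named above (turn-level selection, teacher-confidence weighting, loss normalization) can be sketched as a single weighted loss. This is an illustration under assumptions, not the paper's formulation: the abstract does not say how confidence is computed or how the normalization is defined, so here confidence is a supplied number in [0, 1] and the weights are rescaled to average 1. The name `sage_opd_loss` is hypothetical.

```python
def sage_opd_loss(token_losses, teacher_conf, intervene):
    """Weighted distillation loss over a multi-turn trajectory.

    token_losses[t][i]: distillation loss of token i in turn t.
    teacher_conf[t][i]: teacher confidence in [0, 1] for that token (assumed given).
    intervene[t]: 1 if the turn was selected for intervention, 0 if skipped.
    """
    weights, losses = [], []
    for t, turn in enumerate(token_losses):
        for i, loss in enumerate(turn):
            weights.append(intervene[t] * teacher_conf[t][i])
            losses.append(loss)
    total_w = sum(weights)
    if total_w == 0:
        return 0.0  # every turn skipped: no teacher signal to apply
    # Rescale so the weights average to 1, which keeps the result on the
    # same scale as a plain mean over all tokens.
    scale = len(weights) / total_w
    return sum(w * scale * l for w, l in zip(weights, losses)) / len(losses)
```

With every turn selected and all confidences equal to 1, the function reduces to the ordinary token-level mean, which is one way to read "preserving the overall loss scale of standard OPD".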

2606.19847 2026-06-19 cs.CL 新提交

AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

AtomMem: 通过原子事实构建简单有效的LLM智能体记忆系统

Yanyu Yao, Shangze Li, Zhi Zheng, Hui Zheng, Qi Liu, Tong Xu, Enhong Chen

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China(中国科学技术大学认知智能国家重点实验室) Anhui University(安徽大学)

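AI summary: To address the coarse-grained storage and unstable updates of existing memory systems, proposes AtomMem, which uses a Fact Executor to extract high-value atomic facts as an efficient memory representation and organizes them into hierarchical event structures and temporal profiles, enabling value-dense storage and stable evolution; it achieves state-of-the-art performance on the LoCoMo benchmark.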

Comments 19 pages, 10 figures, 5 tables

AI中文摘要

大型语言模型(LLM)展示了强大的推理和生成能力,但其固定的上下文窗口限制了跨多会话交互的长期信息积累和重用。现有的记忆增强系统通常以粗粒度且不稳定的方式构建记忆,依赖于低效的记忆表示或不稳定的无约束更新。为了解决这些挑战,我们提出了AtomMem,一种专为价值密集存储和稳定记忆演化设计的长期记忆系统。AtomMem引入了一个事实执行器,从长形式交互中选择性地提取高价值原子事实,作为高效的记忆表示。随后,AtomMem将这些事实组织成层次化的事件结构和时间档案,捕获连贯的情景上下文并随时间跟踪动态演变的用户属性。在检索过程中,系统激活一个关联记忆图来连接碎片化的记忆。在LoCoMo基准上的实验证实,AtomMem在各种推理任务中实现了最先进的性能,为部署智能个性化智能体提供了一种可扩展且经济可行的解决方案。

英文摘要

Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented systems often construct memory in a coarse and unstable manner, relying on inefficient memory representations or unstable unconstrained updates. To address these challenges, we propose AtomMem, a long-term memory system designed for value-dense storage and stable memory evolution. AtomMem introduces a Fact Executor, which selectively extracts high-value atomic facts from long-form interactions to serve as highly efficient memory representations. Subsequently, AtomMem organizes these facts into hierarchical event structures and temporal profiles, capturing coherent episodic contexts and tracking dynamically evolving user attributes over time. During retrieval, the system activates an associative memory graph to connect fragmented memories. Experiments on the LoCoMo benchmark confirm that AtomMem achieves state-of-the-art performance across various reasoning tasks, offering a scalable and economically viable solution for deploying intelligent personalized agents.

2606.20487 2026-06-19 cs.CL 新提交

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

超越全局重规划:跨设备智能体系统的分层恢复

Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu, Yuheng Wang, Lin Wu, Yufan Dang, Huatao Li, Chen Qian

发表机构 * School of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学人工智能学院) Shanghai Innovation Institute(上海创新研究院) Southeast University(东南大学) Tsinghua University(清华大学)

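AI summary: Proposes H-RePlan, a hierarchical replanning framework that uses unified API-CLI-GUI execution and a cross-layer failure abstraction to separate device-local strategy recovery from global replanning, substantially improving cross-device task completion and instruction adherence on the HeraBench benchmark.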

AI中文摘要

现实世界中的计算机使用任务通常跨越多个应用程序和设备,要求智能体在动态运行时故障下协调异构环境。现有的多设备智能体系统支持任务分解和跨设备分配,但恢复仍然粗粒度:当执行失败时,它们通常重试相同策略、重新分配子任务或修改全局计划,而没有系统地建模设备本地策略空间。这限制了它们区分可在当前设备内修复的故障与需要跨设备重规划的故障的能力。我们提出H-RePlan,一个用于具有统一API-CLI-GUI执行的多设备智能体的分层重规划框架。H-RePlan为每个设备配备可互换的执行策略,并通过紧凑的跨层失败抽象将设备本地策略恢复与编排器级全局重规划分离。为了评估这一能力,我们引入HeraBench,一个故障注入基准,它在Linux和Android设备上构建跨设备工作流,并注入策略级和设备级故障。实验表明,H-RePlan显著优于单策略和粗粒度多设备基线,实现了更高的完成率、指令遵循率和完美通过率,同时降低了可靠端到端成功所需的令牌成本。这些结果表明,范围感知的分层恢复对于鲁棒的多设备智能体执行至关重要。

英文摘要

Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API-CLI-GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce HeraBench, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.
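
The split between device-local recovery and global replanning can be sketched as follows. This is a hedged illustration, not the H-RePlan implementation: the `Failure` record with a two-valued `scope` field and the function `run_on_device` are hypothetical stand-ins for the "compact cross-layer failure abstraction" the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Failure:
    scope: str   # "strategy": repairable on this device; "device": needs global replanning
    detail: str

# A strategy returns None on success, or a Failure describing what went wrong.
Strategy = Callable[[str], Optional[Failure]]

def run_on_device(subtask: str, strategies: Dict[str, Strategy]) -> Optional[Failure]:
    """Try the device's strategies in order; escalate only device-scoped failures."""
    last = Failure("device", "no strategies available")
    for strategy in strategies.values():
        failure = strategy(subtask)
        if failure is None:
            return None          # recovered locally, the orchestrator never hears of it
        if failure.scope == "device":
            return failure       # switching strategy on this device cannot help
        last = failure           # strategy-scoped: fall through to the next strategy
    return Failure("device", f"all local strategies failed: {last.detail}")
```

An orchestrator would call `run_on_device` per subtask and revise the global plan only when a `Failure` comes back.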

5. Text Generation, Summarization, and Editing (2 papers)

2606.19591 2026-06-19 cs.CL cs.AI 新提交

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

基于BART的分层策略用于越南语抽象式多文档摘要

Vu Nguyen Nguyen Xuan, Huy Ngo Quang

发表机构 * Aimesoft JSC(Aimesoft股份公司)

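AI summary: Proposes a novel yet simple hierarchical strategy that shortens documents based on the golden summary and combines it with a BART model for Vietnamese abstractive multi-document summarization, reaching a ROUGE2-F1 of 0.2468 on the VLSP 2022 test set and augmenting training with external data.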

Comments originally written in 2022

AI中文摘要

在本技术报告中,我们专注于解决越南语多文档抽象式摘要的挑战,该任务在2022年越南语言与语音处理国际研讨会(VLSP)上提出。我们选择遵循流行的分层方法,即先浓缩每个文档,然后进行聚合和摘要。我们提出了一种新颖而简单的策略来缩短文档,该策略由黄金摘要驱动,从而确保分层方法各阶段之间的高度相关性。我们的方法在VLSP的公开测试集上达到了0.2468的ROUGE2-F1分数,并且能够生成流畅简洁的摘要。此外,我们利用外部来源获取额外数据,这极大地增加了越南语多文档摘要的数据量。额外数据已向社区公开。

英文摘要

In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to follow the popular hierarchical approach, i.e. condensing each document followed by aggregation and summarization. We propose a novel yet simple strategy to shorten documents that is driven by the golden summary, thus ensuring high correlation between stages of the hierarchical approach. Our method achieves a ROUGE2-F1 score of 0.2468 on the VLSP's public test set, and can produce fluent and concise summaries. Additionally, we utilize external sources for extra data, which greatly enhances the quantity of data for Vietnamese multi-document summarization. The additional data is made available for the community.
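
For readers unfamiliar with the reported metric, here is a simplified sketch of a ROUGE-2 F1 computation. It assumes whitespace tokenization and a single reference; official ROUGE tooling applies its own tokenization and options, so scores from this function (the name `rouge2_f1` is illustrative) will not match the 0.2468 figure exactly.

```python
from collections import Counter

def rouge2_f1(candidate: str, reference: str) -> float:
    """Bigram-overlap F1 between a candidate and one reference summary."""
    def bigrams(text):
        toks = text.split()
        return Counter(zip(toks, toks[1:]))
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # clipped count of shared bigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```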

2606.20287 2026-06-19 cs.CL 新提交

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

PsyScore: 一种心理测量感知的特质自适应作文评分与最近发展区支架反馈框架

Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng

发表机构 * Department of Educational Psychology, East China Normal University(华东师范大学教育心理学系) Shanghai Institute of Artificial Intelligence for Education, East China Normal University(华东师范大学上海智能教育研究院) School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院)

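AI summary: Proposes the PsyScore framework, which integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation; it comprises a trait-adaptive neural IRT scorer, a ZPD-scaffolded feedback generator, and a multi-perspective feedback evaluation strategy, and achieves competitive scoring performance on the ASAP++ dataset while providing more pedagogically aligned feedback.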

AI中文摘要

有效的自动作文评分(AES)应支持可靠评估和可操作的教学反馈。然而,现有方法通常将评分和反馈视为独立组件:神经评分模型可解释性有限,而基于大语言模型(LLM)的反馈通常对学习者熟练度不敏感。为解决这一碎片化问题,本工作提出PsyScore,一个心理测量感知的框架,通过共享潜在能力表示整合诊断评估与教学支架。PsyScore包含三个关键模块:特质自适应神经IRT评分器,将分级部分信用模型(GPCM)融入神经架构,能够在保持心理测量可解释性的同时精确估计学生能力;ZPD支架反馈生成器,根据诊断出的能力参数调节多智能体反馈策略,以适应不同熟练水平的教学重点;以及多视角反馈评估策略,通过成对偏好判断和学生修订模拟评估反馈质量。在ASAP++数据集上的实验表明,PsyScore在提供更具教学一致性的反馈的同时,实现了有竞争力的评分性能。

英文摘要

Effective Automated Essay Scoring (AES) systems are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners' proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer, which incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling precise estimation of student ability while maintaining psychometric interpretability; a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels; and a Multi-Perspective Feedback Evaluation Strategy, which assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.
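
As background for the IRT scorer, the sketch below computes score-category probabilities under a partial-credit model with a discrimination parameter. It assumes the GPCM in the abstract follows the commonly used generalized partial credit parameterization (ability `theta`, discrimination `a`, step difficulties `steps`); the paper's neural variant may differ, and `gpcm_probs` is an illustrative name.

```python
import math

def gpcm_probs(theta, a, steps):
    """Probabilities of scores 0..len(steps) for one trait.

    The logit of score k is the running sum of a * (theta - b_v) over the
    first k step difficulties b_v; score 0 has logit 0.
    """
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + a * (theta - b))
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because `theta` is an explicit, interpretable quantity, a feedback generator can condition on it directly, which is the kind of shared latent ability representation the abstract describes.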

6. Semantics, Syntax, and Linguistic Analysis (1 paper)

2606.19638 2026-06-19 cs.CL 新提交

MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection

MiqraBERT:基于回归的Sentence-BERT微调用于圣经希伯来语平行检测

David M. Smiley

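AI summary: Proposes MiqraBERT, which finetunes Sentence-BERT with cosine-similarity regression to detect textual parallels in Biblical Hebrew, improving distributional separation 2.7-fold and reducing the overlap region from roughly 24% to about 6%.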

AI中文摘要

文本复用遍及希伯来圣经,但用于检测的计算方法仍主要依赖词汇重叠,一旦平行涉及释义、词汇替换或句法重组,这些方法就会失效。本文介绍MiqraBERT,一个从AlephBERT(现代希伯来语编码器)微调而来的Sentence-BERT模型,用于圣经希伯来语的诗句级语义相似度。训练集包含1,650个标注的诗句和半诗句对:825个来自编年史同源材料和诗歌平行基础研究的真实平行,与825个随机采样的负例平衡。通过余弦相似度回归,模型学习到一个嵌入空间,其中平行诗句聚集在一起,无关诗句彼此远离。我们使用基于分布的指标、Wasserstein距离和重叠系数,在十个随机种子上评估分离度。MiqraBERT将分布分离度比预训练基线提高了2.7倍,并将模糊重叠区域从约24%减少到约6%。叙事同源平行的召回@10达到87.1%;诗歌平行仍然困难,低于9%。这种依赖于体裁的不对称性将模型的可靠范围限制在叙事文本复用。MiqraBERT已在 https://huggingface.co/davidmsmiley/MiqraBERT 公开发布。

英文摘要

Textual reuse pervades the Hebrew Bible, yet the computational methods used to detect it still rest largely on lexical overlap, and they falter once a parallel involves paraphrase, lexical substitution, or syntactic reworking. This paper introduces MiqraBERT, a Sentence-BERT model finetuned from AlephBERT (a Modern Hebrew encoder) for verse-level semantic similarity in Biblical Hebrew. The training set comprises 1,650 labeled verse and half-verse pairs: 825 true parallels drawn from the Chronicles synoptic material and from foundational studies of poetic parallelism, balanced against 825 randomly sampled negatives. Through cosine-similarity regression, the model learns an embedding space in which parallel verses cluster together and unrelated verses move apart. We evaluate separation with distribution-based metrics, Wasserstein distance and the overlap coefficient, across ten random seeds. MiqraBERT improves distributional separation 2.7-fold over the pre-trained baseline and reduces the ambiguous overlap region from roughly 24% to about 6%. Narrative synoptic parallels reach a recall@10 of 87.1%; poetic parallels remain difficult, below 9%. This genre-dependent asymmetry confines the model's reliable scope to narrative textual reuse. MiqraBERT is publicly available at https://huggingface.co/davidmsmiley/MiqraBERT
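
The two separation metrics can be illustrated on toy cosine-similarity scores. This sketch rests on assumptions: the abstract does not give the estimators used, so the Wasserstein distance here is the simple equal-sample-size form and the overlap coefficient is histogram-based with an arbitrary bin count. Both function names are illustrative.

```python
def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance between two samples of equal size."""
    assert len(a) == len(b), "this simple form needs equal-size samples"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def overlap_coefficient(a, b, bins=20, lo=-1.0, hi=1.0):
    """Probability mass shared by the two histograms over [lo, hi]."""
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            idx = int((x - lo) / (hi - lo) * bins)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp edge values into range
        return [c / len(xs) for c in counts]
    return sum(min(p, q) for p, q in zip(hist(a), hist(b)))
```

A well-separated model gives a large Wasserstein distance and a small overlap between the score distributions of true parallels and random negatives; the abstract's drop from roughly 24% to about 6% refers to the second quantity.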

7. Multimodal Language Processing (2 papers)

2606.19552 2026-06-19 cs.CL 新提交

LaViSA: A Language and Vision Structural Ambiguity Benchmark

LaViSA:语言与视觉结构歧义基准

Lee Sangmyeong, Shun Inadumi, Koichiro Yoshino

发表机构 * Nara Institute of Science and Technology(奈良先端科学技术大学院大学) Guardian Robot Project RIKEN(RIKEN守护机器人项目) The University of Osaka(大阪大学)

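AI summary: Proposes the LaViSA benchmark, which uses ambiguous sentences from seven categories and their corresponding images to evaluate how well vision-language models use visual scenes to resolve structural ambiguity; experiments show that current models make partial use of visual information but remain limited on certain ambiguity types and subtle semantic distinctions.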

AI中文摘要

结构歧义是指单个句子由于其句法结构而产生多种有效解释,这给语言理解带来了基本挑战。视觉场景可作为解决此类歧义的有用线索,视觉语言模型(VLM)需要能够从视觉场景中推导出可能的语义解释。我们引入了语言与视觉结构歧义(LaViSA)基准,旨在评估VLM利用视觉场景解决结构歧义的能力。LaViSA包含歧义句子、其消歧句子以及这些消歧句子对应的图像,涵盖七类歧义。利用LaViSA,我们对多种VLM进行了全面评估,包括专有模型和开源模型,参数规模和推理能力各异。实验结果表明,尽管最近的VLM能在一定程度上利用视觉场景解决结构歧义,但它们仍然在特定歧义类型和视觉上微妙的语义区分上存在困难,表明在利用视觉场景解决结构歧义方面仍存在局限性。

英文摘要

Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving such ambiguity, and Vision and Language Models (VLMs) need to be capable of deriving possible semantic interpretations from visual scenes. We introduce Language and Vision Structural Ambiguity (LaViSA), a benchmark designed to evaluate the ability of VLMs to resolve structural ambiguity by leveraging visual scenes. LaViSA consists of ambiguous sentences, their disambiguated sentences, and corresponding images of these disambiguated sentences across seven ambiguity categories. Using LaViSA, we conduct a comprehensive evaluation of diverse VLMs, including both proprietary and open-source models with varying parameter scales and reasoning capabilities. Experimental results show that although recent VLMs can leverage visual scenes to resolve structural ambiguity to some extent, they still struggle with certain ambiguity types and visually subtle semantic distinctions, indicating remaining limitations in resolving structural ambiguity using visual scenes.

2606.20164 2026-06-19 cs.CL cs.AI cs.LG q-bio.QM 新提交

MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

MedRLM:用于长上下文临床推理、传感器引导筛查、证据支持决策及社区到三级转诊优化的递归多模态健康智能

Aueaphum Aueawatthanaphisut

发表机构 * School of Information, Computer Communication Technology, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani, Thailand

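AI summary: Proposes MedRLM, a recursive multimodal health intelligence framework that recursively inspects, decomposes, retrieves, verifies, and synthesizes patient information, coordinates multiple specialized agents, and introduces a Clinical Evidence Graph Memory to support long-context clinical reasoning and sensor-guided screening.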

Comments 9 pages, 3 figures, 3 tables, 1 Algorithm, 29 equations

AI中文摘要

现实世界的临床决策支持需要对异质性和纵向的患者信息进行推理,而不是回答孤立的医学问题。然而,当前的医学大语言模型和检索增强生成系统通常依赖单步提示或检索,当临床证据分布在长电子健康记录、医学图像、传感器流、指南和转诊约束中时,这可能变得脆弱。本文提出MedRLM,一个用于长上下文临床推理、传感器引导筛查和社区到三级转诊支持的递归多模态健康智能框架。MedRLM不是将所有患者信息压缩到一个提示中,而是将患者病例视为一个外部临床环境,可以递归地检查、分解、检索、验证和综合。该框架协调了专门用于临床文本、纵向EHR、医学影像、生理传感器信号、指南检索、不确定性审计和转诊规划的代理。它进一步引入了临床证据图记忆,将患者特定的观察结果与检索到的证据、标准化定义、传感器衍生的生物标志物和转诊标准连接起来。传感器引导的递归触发机制在检测到异常生理或行为模式时激活更深层次的推理,而不确定性门控细化支持临床医生对高风险或低置信度病例的审查。我们还概述了一个使用公共和经认证的临床数据集(涵盖EHR、放射学、ECG、ICU时间序列和转诊代理结果)的真实数据评估设计。MedRLM旨在将医学AI从静态问答转向可审计、多模态和流程感知的临床决策支持。

英文摘要

Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and retrieval-augmented generation systems often rely on single-step prompting or retrieval, which can be fragile when clinical evidence is distributed across long electronic health records, medical images, sensor streams, guidelines, and referral constraints. This paper proposes MedRLM, a Recursive Multimodal Health Intelligence framework for long-context clinical reasoning, sensor-guided screening, and community-to-tertiary referral support. Instead of compressing all patient information into one prompt, MedRLM treats the patient case as an external clinical environment that can be recursively inspected, decomposed, retrieved, verified, and synthesized. The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning. It further introduces a Clinical Evidence Graph Memory to connect patient-specific observations with retrieved evidence, standardized definitions, sensor-derived biomarkers, and referral criteria. A sensor-guided recursive triggering mechanism activates deeper reasoning when abnormal physiological or behavioral patterns are detected, while uncertainty-gated refinement supports clinician review for high-risk or low-confidence cases. We also outline a real-data evaluation design using public and credentialed clinical datasets spanning EHR, radiology, ECG, ICU time series, and referral-proxy outcomes. MedRLM aims to move medical AI from static question answering toward auditable, multimodal, and workflow-aware clinical decision support.

8. Joint Speech-Language and Audio-Text (2 papers)

2606.19910 2026-06-19 cs.CL cs.SD eess.AS 新提交

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

轻量级发音评估:基于离散语音标记的意外度

Syeda Faiza Ahmed Sara, Shammur Absar Chowdhury

发表机构 * Qatar Computing Research Institute, Doha, Qatar(卡塔尔计算研究所,多哈,卡塔尔)

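AI summary: Proposes a lightweight pronunciation assessment framework trained only on native speech resources, which discretizes speech into tokens, computes surprisal with a language model, and adds transcript-guided alignment features, approaching the performance of supervised methods with no supervision or light calibration.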

Comments Accepted to Interspeech 2026

AI中文摘要

训练自动发音评估通常依赖于标记的学习者错误或非母语语料库,这些语料库收集成本高昂。我们提出一个轻量级框架,仅使用母语语音资源训练,以无监督或通过少量评分话语进行轻量校准的方式运行。在推理时,学习者语音通过SSL编码器和K-means码本进行离散化。一个在母语序列上训练的标记语言模型计算意外度,其中较高的意外度表示音位偏差。我们添加了一个转录引导的Text2DUnit-DTW模块,该模块从参考文本预测母语标记序列,并将其与声学标记对齐以推导出错误敏感特征。意外度和对齐特征通过简单回归融合。在SpeechOcean762上,PCC从0.60提升到0.66(带转录引导),接近监督基线。在L2-ARCTIC上的跨数据集评估显示了一致的提升。

英文摘要

Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal, where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit-DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, near supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
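
The surprisal signal can be illustrated with a toy model. In this sketch a smoothed bigram model stands in for the token language model (an assumption; the abstract does not name the model type), and integer IDs stand in for K-means unit indices. The names `train_bigram` and `mean_surprisal` are illustrative.

```python
import math
from collections import Counter

def train_bigram(sequences, vocab_size):
    """Fit an add-one-smoothed bigram model on native token sequences."""
    pair_counts, ctx_counts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            ctx_counts[prev] += 1
    def prob(prev, cur):
        # Smoothing gives unseen transitions a small nonzero probability.
        return (pair_counts[(prev, cur)] + 1) / (ctx_counts[prev] + vocab_size)
    return prob

def mean_surprisal(seq, prob):
    """Average surprisal in bits per transition; higher means less native-like."""
    bits = [-math.log2(prob(p, c)) for p, c in zip(seq, seq[1:])]
    return sum(bits) / len(bits)
```

A sequence that follows native transition patterns scores low, while one with rare or unseen transitions scores high; that score is the feature later fused with the alignment features.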

2606.20179 2026-06-19 cs.CL 新提交

ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion

ReNikud:音频监督的希伯来语字素到音素转换

Maxim Melichov, Yakov Kolani, Morris Alper

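AI summary: Proposes ReNikud, which uses audio supervision and a pseudo-vocalization architecture, with ASR pseudo-labels from unlabeled audio and character-level alignment, to address missing vowels and pronunciation ambiguity in Hebrew G2P conversion, achieving state-of-the-art results on multiple benchmarks.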

AI中文摘要

现代希伯来语的字素到音素(G2P)转换对于文本到语音(TTS)等应用是必需的,但由于该语言的辅音音素文字系统(abjad)使元音大多不写出来,造成大量歧义,因此具有挑战性。标准方法首先预测元音变音符号(nikud)以生成国际音标(IPA)转录,但这存在局限性:元音化数据稀缺且制作费力,它不指定词汇重音等特征,并且反映的是正式语法规则而非日常口语发音。同时,直接的序列到序列IPA预测在有限数据上表现不佳,且未能利用辅音音素文字特有的字符级对齐。我们的方法ReNikud通过两个关键洞察克服了这些限制:(1)通过基于音素的自动语音识别(ASR)伪标签流水线,在数千小时无标注希伯来语音频上进行弱音频监督,生成反映自然口语规范的音位转录,无需人工标注。(2)一种伪元音化架构,在每个字符位置预测IPA音素,强制字符级对齐作为归纳偏置。在现有希伯来语G2P基准和针对口语希伯来语的新MILIM基准上的结果表明,ReNikud超越了先前的最先进方法。我们将发布代码和训练模型,以支持希伯来语TTS和语音技术的进一步研究。

英文摘要

Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications like text-to-speech (TTS), but is challenging due to the language's abjad writing system, which leaves vowels largely unwritten, creating substantial ambiguity. Standard approaches first predict vowel diacritics (nikud) to produce International Phonetic Alphabet (IPA) transcriptions, but this is limited: vocalization data is scarce and laborious to produce, it does not specify features such as lexical stress, and it reflects formal grammatical rules rather than everyday spoken pronunciation. Direct sequence-to-sequence IPA prediction, meanwhile, struggles on limited data and fails to exploit the character-level alignment characteristic of abjads. Our method, ReNikud, overcomes these limitations with two key insights: (1) Weak audio supervision via a phoneme-based automatic speech recognition (ASR) pseudo-labeling pipeline on thousands of hours of unlabeled Hebrew audio, yielding phonemic transcriptions that reflect natural spoken norms without manual annotation. (2) A pseudo-vocalization architecture that predicts IPA phonemes at each character position, enforcing character-level alignment as an inductive bias. Results on existing Hebrew G2P benchmarks and the new targeted MILIM benchmark for spoken Hebrew show that ReNikud surpasses previous state-of-the-art methods. We will release our code and trained models to support further work on Hebrew TTS and speech technologies.

9. Evaluation, Datasets, and Benchmarks (12 papers)

2606.19352 2026-06-19 cs.CL cs.AI 新提交

Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

大规模手语数据集:资源、基准和标注标准的综合调查

Yiming Ni, Zhi-Qi Cheng, Jiayu Li, Wei Cheng

发表机构 * Tacoma School of Engineering & Technology, University of Washington(华盛顿大学塔科马工程与技术学院)

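AI summary: Surveys 120 datasets covering 35 sign languages, analyzes challenges such as modality imbalance, annotation granularity, and signer bias, and proposes a 24-field Sign-Language Datasheet to support standardized documentation and reproducible evaluation.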

Comments Accepted to ACL 2026 Main. 27 pages, 5 figures

AI中文摘要

手语是聋人和听障社区使用的表达性视觉语言。尽管在手语识别、翻译和生成方面取得了显著进展,但由于数据集碎片化、标注不一致以及语言覆盖有限,进展仍然受到制约。现有的基准往往无法反映现实世界的通信需求,对这些局限性的系统分析仍然有限。在本调查中,我们提出了一个全面的手语数据集索引,涵盖了35种手语的120个资源。我们分析了关键挑战,如模态不平衡、标注粒度和手语者偏差,并概述了未来数据集设计的考虑因素。我们还引入了一个24字段的手语数据表,并发布了一个公共GitHub仓库(https://github.com/Ginqwerty/Open-Sign-Language),以支持标准化文档和可复现评估。总体而言,我们的工作为在现实应用中开发包容、稳健和可扩展的手语技术提供了统一且实用的基础。

英文摘要

Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented datasets, inconsistent annotations, and limited linguistic coverage. Existing benchmarks often fail to reflect real-world communication needs, and systematic analyses of these limitations remain limited. In this survey, we present a comprehensive index of sign-language datasets, covering 120 resources across 35 sign languages. We analyze key challenges such as modality imbalance, annotation granularity, and signer bias, and outline considerations for future dataset design. We also introduce a 24-field Sign-Language Datasheet and release a public GitHub repository (https://github.com/Ginqwerty/Open-Sign-Language) to support standardized documentation and reproducible evaluation. Overall, our work provides a unified and practical foundation for developing inclusive, robust, and scalable sign-language technologies in real-world applications.

2606.19468 2026-06-19 cs.CL 新提交

Characterizing Narrative Content in Web-scale LLM Pretraining Data

网络规模LLM预训练数据中的叙事内容特征化

Teagan Johnson, Elliott Ash, Andrew Piper, Maria Antoniak

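AI summary: Presents the first fine-grained study of narrative features in the LLM pretraining corpus Dolma, proposing a framework spanning three core narrative elements (agency, setting, and events), building the NarraBERT model, and releasing the NarraDolma dataset; it shows that narrative structure is measurable in heterogeneous data and unevenly distributed.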

Comments 8 pages of main content, 28 total pages. 30 figures

AI中文摘要

尽管叙事是人类交流的基本模式,但网络规模LLM预训练语料库的叙事组成仍然很大程度上未被探索。我们首次对Dolma(一个3万亿词元的开放预训练语料库)中的叙事特征进行了细粒度研究。借鉴叙事理论,我们设计了一个框架,涵盖三个核心叙事元素(能动性、场景和事件),并将其操作化为11个可解释维度。在采样并标注了400个多样化的段落之后,我们微调并验证了NarraBERT,一个基于RoBERTa的细粒度叙事预测模型。我们将NarraBERT应用于300万个段落,生成了新数据集NarraDolma。我们发现:(i) 叙事结构在极度异构的数据中是可大规模测量的;(ii) 我们揭示了网络文本背后连续的多维叙事结构;(iii) 叙事质量在预训练来源和主题之间分布不均,而当前的策展实践既未测量也未考虑这一点。我们的框架、数据集和分析为理解LLM预训练数据中叙事质量的分布以及研究数据组成如何影响叙事推理任务提供了基础。我们公开发布了NarraDolma和NarraBERT。

英文摘要

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) a continuous, multidimensional narrative structure underlies web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.

2606.19544 2026-06-19 cs.CL 新提交

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

无效度的可靠性:LLM-as-a-Judge 模型在一致性、稳定性和偏差上的系统性大规模评估

Justin D. Norman, Michael U. Rivera, D. Alex Hughes

发表机构 * UC Berkeley School of Information(加州大学伯克利分校信息学院)

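AI summary: Through a large-scale systematic evaluation (21 judge models, 118 runs, about 541,000 judgments), finds pervasive problems with LLM-as-a-Judge in agreement, consistency, and bias, including kappa deflation, ranking shifts, and high test-retest reliability coexisting with severe position bias, and proposes a Minimum Viable Validation Protocol.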

AI中文摘要

LLM-as-a-Judge已成为语言模型的主导评估范式,但实际中的裁判验证依赖于精确匹配一致性,这一指标未对随机性进行校正,且系统性地高估了判别能力。我们展示了迄今为止最大规模的LLM-as-a-Judge系统性评估:来自九个提供商的21个裁判模型,在MT-Bench、JudgeBench和RewardBench上,按照三种协议(一致性、稳定性、偏差审计)进行了118次运行,约54.1万次独立判断。发现了四个结果,在整个队列中一致,包括2026年4月的前沿模型:精确匹配与Cohen's kappa之间的kappa通缩是普遍存在的(MT-Bench上33-41个百分点),裁判排名在不同基准上最多移动14个位置,高重测信度(>0.95)与两个生产部署裁判中的严重位置偏差(>0.10)并存(体现了一致性-偏差悖论),以及在单一成对评分标准下,整个队列中的冗长偏差较小(<0.011)。我们将这些结果提炼为一个最小可行验证协议。

英文摘要

LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates discriminative ability. We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and approximately 541,000 individual judgments. Four findings emerge, consistent across the full cohort, including the April 2026 frontier: kappa deflation between exact match and Cohen's kappa is universal (33-41 pp on MT-Bench), judge rankings shift by up to 14 positions across benchmarks, high test-retest reliability (>0.95) coexists with severe position bias (>0.10) in two production-deployed judges (instantiating a consistency-bias paradox), and verbosity bias is small (<0.011) across our cohort under a single pairwise rubric. We distill these into a Minimum Viable Validation Protocol.
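
The gap between exact-match agreement and chance-corrected agreement is easy to reproduce on toy labels. The sketch below uses the standard definition of Cohen's kappa; the data are invented for illustration and the function names are not from the paper.

```python
from collections import Counter

def exact_match(judge, gold):
    """Fraction of items where the judge label equals the gold label."""
    return sum(j == g for j, g in zip(judge, gold)) / len(gold)

def cohens_kappa(judge, gold):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e). Undefined when p_e == 1."""
    n = len(gold)
    p_o = exact_match(judge, gold)
    jc, gc = Counter(judge), Counter(gold)
    # Chance agreement expected from the two marginal label distributions.
    p_e = sum(jc[label] * gc[label] for label in set(jc) | set(gc)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# A judge that always answers "A" on a gold set that is 80% "A".
gold = ["A"] * 8 + ["B"] * 2
judge = ["A"] * 10
```

On this toy data `exact_match(judge, gold)` is 0.8 while `cohens_kappa(judge, gold)` is 0: the judge has no discriminative ability, yet exact match looks respectable. This is the mechanism behind the deflation the abstract reports.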

2606.19637 2026-06-19 cs.CL cs.AI 新提交

Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

标签之前:数据集构建如何塑造临床文本中的自杀检测

Priyanshi Garg, Ishita Rao, Jieqiong Ding, Amandalynne Paullada

发表机构 * University of Washington(华盛顿大学)

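AI summary: Through a case study of the ScAN dataset, shows that EHR-based suicidality datasets encode a particular operationalization shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved, and demonstrates that identical labels cover heterogeneous clinical framings.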

Comments To appear in the Proceedings of the 11th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)

AI中文摘要

临床自然语言处理越来越依赖电子健康记录(EHR)数据来检测自杀行为,将临床文档视为比社交媒体更可靠的真相。我们认为,这种框架掩盖了基于EHR的自杀数据集如何编码自杀的特定操作化定义,这种定义受到数据作者、事件边界划定方式以及歧义处理方式的影响。我们以ScAN数据集(基于MIMIC-III临床笔记构建)的案例研究为基础,论证了这一观点。我们展示了治理约束、基于ICD的队列选择、单一标注者标签以及住院级别聚合如何产生反映临床医生记录判断的标签,将自杀视为一个有边界的事件,并假设意图可以从文档中可靠推断。语言学分析表明,相同的标签涵盖了在时间性、否定性和不确定性方面不同的异质性临床框架。我们认为,临床自然语言处理在将自杀数据集的标签解释为真相之前,应审视其中嵌入的假设。

英文摘要

Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty. We argue that clinical NLP should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.

2606.19698 2026-06-19 cs.CL 新提交

What sentiment analysis can't see: Measuring whether customers were helped, and what went wrong, across 70,000 support conversations

情感分析看不到的:衡量客户是否得到帮助以及出了什么问题——基于70,000次客服对话

Jason Potteiger

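AI summary: Uses GPT-5.4 to estimate customer satisfaction and flag concrete problems in 70,450 support conversations, finding that the satisfaction estimate tracks customer ratings better than sentiment analysis and reveals customer states and problem causes that sentiment analysis cannot capture.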

Comments 25 pages, 6 figures

AI中文摘要

大多数公司通过情感分析大规模阅读客户支持数据,这种方法衡量的是客户听起来如何,而不是他们对结果是否满意。我们在一个领先的在线筹款平台的70,450次客服对话中测试了一种更丰富的替代方案:除了语气,我们使用GPT-5.4估计每位客户的满意度,并标记他们是否报告了具体问题,然后将这三个读数与客户对对话的1到5星评分进行验证。满意度估计跟踪这些评分的效果远好于情感分析,相关性分别为0.47和0.36,并且标记不满客户时的误报率低得多。结构化阅读还能看到情感分析看不到的东西:语气和满意度在44%的对话中不一致,一个单一的“中性”标签掩盖了从安静满意的客户到安静放弃的客户的一切,而最大的群体是“容忍摩擦”——那些满意但仍然报告可修复问题的客户,这是一个情感分析仪表板无法揭示的长期问题。更广泛的发现是,基于LLM的标注可以捕捉到比客户语言语气多得多的信息,为基于客户状态(他们是否满意)和直接从交互与反馈的原始文本数据中提取的问题原因的新业务指标提供了强大潜力。

英文摘要

Most companies read their customer support data at scale using sentiment analysis, which measures how customers sound rather than whether they were satisfied with the result. We tested a richer alternative on 70,450 support conversations from a leading online fundraising platform: alongside tone, we used GPT-5.4 to estimate each customer's satisfaction and to flag whether they reported a concrete problem, then validated all three readings against the 1-to-5 ratings customers left on the conversations they rated. The satisfaction estimate tracked those ratings far better than sentiment did, correlating at 0.47 against 0.36 and flagging unhappy customers with far fewer false alarms. The structured read also sees what sentiment cannot: tone and satisfaction disagree in 44% of conversations, a single "Neutral" label hides everything from quietly satisfied customers to ones who quietly gave up, and the largest group of all is "tolerated friction," customers who are satisfied but still reporting a fixable problem, a standing issue that no sentiment-based dashboard can surface. The broader finding is that LLM-based annotation can capture far more than the tonality of a customer's language, offering strong potential for new business metrics grounded instead in the customer's state (whether they were satisfied) and the cause of their problem extracted directly from the raw textual data of interactions and feedback.

2606.19727 2026-06-19 cs.CL cs.AI 新提交

NRITYAM: Language Models Meet Art and Heritage of Dance

NRITYAM:语言模型遇见舞蹈的艺术与遗产

Punit Kumar Singh, Niladri Ghosh, Advait Joshi, Shailee Choudhary, Michael Färber, Haiqin Yang

发表机构 * Shenzhen Technology University(深圳技术大学) New Delhi Institute of Management(新德里管理学院) Technische Universität Dresden(德累斯顿工业大学) Ramakrishna Mission Vivekananda Educational and Research Institute(罗摩克里希纳传道会维韦卡南达教育与研究学院) Indian Institute of Technology(印度理工学院) Swami Vivekananda Institute of Technology(斯瓦米·维韦卡南达技术学院) GuangDong Engineering Technology Research Center of Edge Intelligence(广东省边缘智能工程技术研究中心)

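AI summary Introduces the NRITYAM benchmark of 9,260 cultural question-answer pairs across 12 languages to evaluate how well language models understand global dance traditions, covering several model types.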

Comments 18 pages, 12 figures, in ECML_PKDD'26

详情
AI中文摘要

语言模型已成为塑造现代工作流程的重要工具。然而,其全球有效性取决于对当地社会文化背景的细致理解。为弥补这一差距,我们提出NRITYAM,一个用于评估语言模型在全球舞蹈传统背景下文化理解能力的综合基准。NRITYAM包含9,260个精心策划的问答对,涵盖12种语言,是专门用于评估舞蹈文化知识的最大数据集。该数据集通过与本地舞蹈艺术家和母语者的密切合作从头开发,他们创作并验证了特定地区的文化相关问题。我们评估了一系列模型,包括大型语言模型、小型语言模型、多模态大型语言模型和小型多模态语言模型。作为一个多语言和多文化基准,NRITYAM为评估AI系统理解和推理传统表演艺术的能力设定了新标准。详细数据集样本可在 https://github.com/niladrighosh03/NRITYAM 获取。

英文摘要

Language models have become essential tools in shaping modern workflows. However, their global effectiveness hinges on a nuanced understanding of local socio-cultural contexts. To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions. NRITYAM comprises 9,260 carefully curated question-answer pairs spanning 12 languages, making it the largest dataset dedicated to evaluating cultural knowledge in dance. The dataset has been developed from the ground up through close collaboration with native dance artists and native speakers of the languages, who authored and validated culturally relevant questions specific to their regions. We evaluate a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models. As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating the ability of AI systems to understand and reason about traditional performing arts. Detailed dataset samples are available at https://github.com/niladrighosh03/NRITYAM.

2606.19819 2026-06-19 cs.CL cs.AI 新提交

CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis

CREDENCE: 面向分解与增强可信度的声明缩减——语义度量与收敛性分析

Phuong Huu Vu Tran, Thuan Duc Mai, Bach Xuan Le

发表机构 * Vietnamese-German University(越南德国大学) Ho Chi Minh University of Technology(胡志明市理工大学)

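AI summary Proposes the CREDENCE framework, which uses a Semantic-F1 metric to address Jaccard's underestimation of paraphrased claims and formally analyzes the convergence of the repair pipeline; in experiments, Semantic-F1 exceeds Jaccard-F1 by 15-32 percentage points and rule-based repair reduces the atomicity violation rate by 47-100%.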

Comments 40 pages, 6 figures, 19 tables. Submitted to Language Resources and Evaluation

详情
AI中文摘要

将复合句分解为原子化的、可验证的声明是可靠自动化事实核查的前提。先前工作依赖基于词重叠(Jaccard)的度量,系统性地低估了释义声明的分解质量,并且缺乏对修复循环的形式化终止分析。我们提出CREDENCE,一个改进的声明分解与评估框架,解决了这两个缺陷。我们的贡献包括:(1) 语义F1:我们使用BGE-large余弦相似度保真度度量,解决了Jaccard的惩罚问题,并提高了下游事实核查的准确性;(2) 收敛定理:我们形式化地表征了修复管道的四个性质,确立了在预言解析器假设下基于规则的修复是单调且有限终止的;基于LLM的自修复被证明是非单调的,需要早期退出保护;(3) 三个评估基准,涵盖社交媒体、百科全书和新闻领域,用于跨领域泛化度量;(4) 跨四个分解器模型(3.8B-12B)和一个封闭API模型的多模型基准测试。在SocialClaimSplit、WikiSplitBench和ClaimDecompBench上的实验表明,语义F1比Jaccard F1提升15-32个百分点。在SocialClaimSplit和WikiSplitBench上,EPR范围为0.94至1.00,而ClaimDecompBench由于更难的新闻领域构造,包含较低的基线EPR情况(低至0.824),规则修复相对于基线模型将原子性违反率(AVR)降低了47-100%,且不降低保真度。

英文摘要

Decomposing compound sentences into atomic, verifiable claims is a prerequisite for reliable automated fact-checking. Prior work has relied on token-overlap (Jaccard) metrics that systematically underestimate decomposition quality for paraphrastic claims, and has lacked formal termination analysis for the repair loop. We present CREDENCE, a revised claim decomposition and evaluation framework addressing both shortcomings. Our contributions are: (1) Semantic-F1: we use a BGE-large cosine-similarity fidelity metric that resolves Jaccard's penalisation and improves downstream fact-checking accuracy; (2) Convergence theorems: we formally characterise four properties of the repair pipeline, establishing that rule-based repair is monotone and finitely terminating under an oracle parser assumption; LLM-based self-repair is provably non-monotone and requires an early-exit guard; (3) Three evaluation benchmarks spanning social-media, encyclopaedic, and news domains for cross-domain generalisation measurement; (4) Multi-model benchmarking across four decomposer models (3.8B-12B) and a closed API model. Experiments on SocialClaimSplit, WikiSplitBench, and ClaimDecompBench show that Semantic-F1 outperforms Jaccard-F1 by +15-32pp. EPR ranges from 0.94 to 1.00 on SocialClaimSplit and WikiSplitBench, while ClaimDecompBench includes lower base EPR cases (down to 0.824) due to harder news-domain constructions, and rule-repair reduces the Atomicity Violation Rate (AVR) by 47-100% relative to the base model without degrading fidelity.
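The abstract's point that token-overlap metrics penalise paraphrases can be illustrated with a small sketch. It is not the paper's Semantic-F1: the abstract does not give the exact definition, so the best-match F1 below (`soft_f1`) and the example claims are assumptions made for illustration. The similarity function is pluggable, so an embedding-based cosine similarity could be passed in place of `jaccard`.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens of a claim, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-overlap (Jaccard) similarity between two claims."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

def soft_f1(predicted, reference, sim):
    """Hypothetical best-match F1; `sim` is any pairwise similarity in [0, 1]."""
    if not predicted or not reference:
        return 0.0
    precision = sum(max(sim(p, r) for r in reference) for p in predicted) / len(predicted)
    recall = sum(max(sim(p, r) for p in predicted) for r in reference) / len(reference)
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Illustrative claims (not taken from the paper's benchmarks).
reference = ["The mayor resigned on Monday."]
verbatim = ["The mayor resigned on Monday."]
paraphrase = ["On Monday the city's mayor stepped down."]

print(soft_f1(verbatim, reference, jaccard))               # 1.0
print(round(soft_f1(paraphrase, reference, jaccard), 3))   # 0.444
```

The paraphrase states the same fact but scores well under half of the verbatim copy, which is the kind of underestimation a semantic similarity is meant to correct.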

2606.19881 2026-06-19 cs.CL 新提交

REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

REDACT:一个系统控制的个人信息检测多语言基准

Guneesh Vats, Anubha Agrawal, Shikha Singhal, Ajita Dash, Praison Selvaraj, Vidhan Jhawar, Ranga Prasad Chenna, Bharadwaj Y M G

发表机构 * ServiceNow

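AI summary Presents the REDACT benchmark with 13,427 records, 51 entity types, and 25 languages; a strength-2 covering-array sampler controls nine generation axes, and entity-level metadata (disclosure status, disclosure form, GDPR-aligned sensitivity tier) supports stratified evaluation that reveals architecture-dependent detector failure modes on sensitive data.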

Comments 14 pages, 5 figures

详情
AI中文摘要

个人可识别信息(PII)检测的基准基础设施仍然有限:现有语料库涵盖的实体类型少,使用临时生成条件,并且未显示哪些表面条件导致检测器失败。我们提出REDACT,一个系统控制的多语言PII基准,包含13,427条记录、324,078个实体注释、51种实体类型、4,127个表面形式模式以及跨越9种文字的25种语言。一个强度-2覆盖阵列采样器控制九个生成轴:领域、格式、难度、长度、密度、代码切换、语言、邻接和共现。三个实体级元数据字段(披露状态、披露形式和符合GDPR的敏感层级)使得能够进行超越聚合或按类型F1的分层评估。从完整基准中,我们在一个锁定的、按语言分层的1000条记录样本上评估了五个检测器(Presidio、GLiNER、OpenAI隐私过滤器、GPT-4.1和Claude Sonnet 4.6)。聚合F1掩盖了架构依赖的失败结构:基于规则的检测器在最高风险数据上表现不佳,包括高敏感类别(召回率0.07)和非逐字披露形式,而LLM检测器保持更鲁棒,高敏感层级是其最强的敏感切片。一个三模型无参考LLM作为评判者的评估证实,敏感层级分配是任务最困难的轴。我们发布了基准、模式、提示和分层评估工具。

英文摘要

Benchmark infrastructure for personally identifiable information (PII) detection remains limited: existing corpora cover few entity types, use ad hoc generation conditions, and do not show which surface conditions cause detector failures. We present REDACT, a systematically controlled multilingual PII benchmark with 13,427 records, 324,078 entity annotations, 51 entity types, 4,127 surface-form patterns, and 25 languages across 9 scripts. A strength-2 covering-array sampler controls nine generation axes: domain, format, difficulty, length, density, code-switching, language, adjacency, and co-occurrence. Three entity-level metadata fields (disclosure status, disclosure form, and a GDPR-aligned sensitivity tier) enable stratified evaluation beyond aggregate or per-type F1. From the full benchmark, we evaluate five detectors (Presidio, GLiNER, the OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) on a locked, language-stratified sample of 1,000 records. Aggregate F1 masks an architecture-dependent failure structure: the rule-based detector performs poorly on the highest-stakes data, including HIGH-sensitivity categories (recall 0.07) and non-verbatim disclosure forms, while the LLM detectors remain more robust, with the HIGH tier as their strongest sensitivity slice. A three-model reference-free LLM-as-judge assessment corroborates that sensitivity-tier assignment is the task's hardest axis. We release the benchmark, schema, prompts, and stratified evaluation harness.
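A strength-2 covering array guarantees that every pair of values from any two axes appears together in at least one record, using far fewer records than the full cross product. The sketch below is a generic greedy construction, not REDACT's sampler; the three axes and their values are hypothetical stand-ins for the nine axes named in the abstract.

```python
from itertools import combinations, product

def pairwise_cover(axes):
    """Greedily pick records until every value pair of every two axes appears."""
    names = list(axes)
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in axes[a]
        for vb in axes[b]
    }
    # Candidate pool: the full cross product (fine for small toy axes).
    candidates = [dict(zip(names, combo)) for combo in product(*axes.values())]

    def pairs_of(record):
        return {((a, record[a]), (b, record[b])) for a, b in combinations(names, 2)}

    chosen = []
    while uncovered:
        # Take the record that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda r: len(pairs_of(r) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen

# Hypothetical axes and values, for illustration only.
axes = {
    "domain": ["medical", "finance", "legal"],
    "format": ["email", "chat", "form"],
    "language": ["en", "hi", "de"],
}
rows = pairwise_cover(axes)
print(len(rows), "records instead of", 3 ** 3)
```

Greedy selection does not guarantee a minimum-size array, but it always terminates with full pairwise coverage because the candidate pool contains every combination.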

2606.20152 2026-06-19 cs.CL cs.AI 新提交

From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

从文本到分数:追踪大型语言模型中作文质量表征的出现

Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo, Henghua Shen, Lidia S. Chao, Derek F. Wong

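AI summary Analyzes the hidden representations of eight LLMs on three datasets using linear probing and related methods, finding that essay quality information is linearly decodable and identifying score-correlated neurons, which sheds light on the internal mechanisms of LLM-based scoring.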

Comments This is a preprint of a manuscript currently under peer review

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进展极大地改变了自动作文评分(AES),但基于LLM的评分内部机制仍知之甚少。在本工作中,我们系统分析了八个LLMs在两个英文作文数据集(ASAP++、CSEE)和一个葡萄牙语数据集(ENEM)上的隐藏表征。通过线性探测、跨提示泛化、降维和神经元级分析,我们发现一致证据表明作文质量信息以线性可访问的形式编码在LLM表征中。这些表征在层间逐步出现,在不同提示策略下保持稳健,并且尽管评分标准不同,仍能在作文提示间部分迁移。此外,非线性探测相对于线性探测仅提供边际且不一致的改进,表明大多数作文质量信息已经是线性可解码的。我们进一步识别出单个“作文评分神经元”,其激活与作文分数强相关,且其行为对目标干预敏感。此外,这些神经元的逐层分布随作文长度系统性地变化,较长的作文更依赖深层。总体而言,我们的发现提供了LLM编码与作文质量相关的结构化表征的证据,并为基于LLM的AES系统的可解释性提供了新见解。

英文摘要

Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood. In this work, we systematically analyze the hidden representations of eight LLMs across two English essay datasets (ASAP++, CSEE) and one Portuguese dataset (ENEM). Using linear probing, cross-prompt generalization, dimensionality reduction, and neuron-level analyses, we find consistent evidence that essay quality information is encoded in a linearly accessible form within LLM representations. These representations emerge progressively across layers, remain robust across prompting strategies, and partially transfer across essay prompts despite differences in scoring rubrics. In addition, nonlinear probes provide only marginal and inconsistent improvements over linear probes, suggesting that most essay quality information is already linearly decodable. We further identify individual "essay scoring neurons" whose activations strongly correlate with essay scores and whose behavior is sensitive to targeted intervention. Moreover, the layer-wise distribution of these neurons systematically shifts with essay length, with longer essays relying more heavily on deeper layers. Overall, our findings provide evidence that LLMs encode structured representations related to essay quality and offer new insights into the interpretability of LLM-based AES systems.
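Linear probing, as used in the abstract above, means fitting a linear map from hidden states to the target (here, an essay score) and checking how well it predicts held-out examples. The sketch below uses synthetic 4-dimensional "hidden states" and a made-up quality direction (`true_w`); the data, dimensions, and hyperparameters are all assumptions, and real probes would be fit on activations extracted from a model.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_linear_probe(states, scores, lr=0.1, steps=300, l2=1e-3):
    """Fit w, b by gradient descent so that w . h + b approximates the score."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        errors = [dot(w, h) + b - y for h, y in zip(states, scores)]
        grad_w = [sum(e * h[j] for e, h in zip(errors, states)) / n + l2 * w[j]
                  for j in range(dim)]
        grad_b = sum(errors) / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
true_w = [1.0, -2.0, 0.5, 0.0]  # hypothetical "quality direction"
states = [[rng.gauss(0, 1) for _ in true_w] for _ in range(200)]
scores = [dot(true_w, h) + 3.0 + rng.gauss(0, 0.1) for h in states]

# Train on the first 150 examples, evaluate on the remaining 50.
w, b = fit_linear_probe(states[:150], scores[:150])
predicted = [dot(w, h) + b for h in states[150:]]
print("held-out correlation:", round(pearson(predicted, scores[150:]), 3))
```

A high held-out correlation from a purely linear map is what "linearly decodable" refers to; comparing it with a nonlinear probe on the same split would mirror the comparison the abstract describes.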

2606.20212 2026-06-19 cs.CL 新提交

CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia

CzechDocs:捷克少数民族语言格式化文档的多路平行数据集

Josef Jon, Ondřej Bojar

发表机构 * Charles University, Faculty of Mathematics and Physics, Institute of Formal

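AI summary Presents CzechDocs, a multiway parallel dataset of formatted documents covering Czech and minority languages used in Czechia, which supports the evaluation of format-preserving machine translation systems; a validation subset and an evaluation toolkit are publicly released.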

详情
AI中文摘要

我们提出了CzechDocs,一个多路平行的格式化文档(HTML、DOCX和PDF)数据集,涵盖捷克语及捷克境内使用的少数民族语言——主要是乌克兰语和英语,以及少量越南语、俄语和其他语言。该数据集旨在支持评估旨在翻译过程中保留文档格式的机器翻译系统。我们在数据集的验证子集上比较了最常见的格式保留机器翻译方法。该验证子集连同评估工具包已公开发布,以供进一步研究。一个保留的测试子集将用于未来专注于文档级翻译并保留格式的共享任务。

英文摘要

We present CzechDocs, a multiway parallel dataset of formatted documents (HTML, DOCX, and PDF) covering Czech and minority languages used in Czechia, primarily Ukrainian and English, with smaller portions of Vietnamese, Russian and other languages. The dataset is designed to support the evaluation of machine translation systems that aim to preserve document formatting during translation. We provide a comparison of the most common approaches to format-preserving machine translation on a validation subset of the dataset. This validation split, together with the evaluation toolkit, is publicly released for further research. A held-out test split will be reserved for a future shared task focused on document-level translation with formatting preservation.

2606.20255 2026-06-19 cs.CL cs.AI 新提交

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse

语域差距:尼日利亚公共话语的意义智能框架

Celestine Achi

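AI summary Proposes the nine-dimension Meaning Intelligence Framework (MIF), which separates surface sentiment from true communicative intent through dimensions such as register and true intent; on a Nigerian public discourse dataset, schema-informed prompting raises register classification accuracy by 40 percentage points and the composite Meaning Intelligence Score by 5.4 points.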

Comments Preprint. 12 pages, 2 tables. Supplementary materials: MIF Master Specification v2.0, Annotation Guidelines v1.0, and 30-item public calibration set with gold labels available from the author

详情
AI中文摘要

我们提出了意义智能框架(MIF),这是一个用于尼日利亚公共话语的九维标注和评估方案,将表面情感与真实交际意图区分开来。现有的尼日利亚语言基准(包括NaijaSenti和AfriSenti)将情感分类视为三向极性任务(正面、负面、中性)。我们认为,AI系统在尼日利亚话语上的主要失败模式不是翻译失败,而是语境失败:同一话语根据说话者、听众和情境可能具有相反的语用效力。MIF通过九个评分维度将这一见解操作化:语域、表面情感、真实意图、反讽、编码潜台词、风险等级、标注者置信度、说话者情绪和推荐沟通行动。我们构建了一个包含30个项目的校准数据集,涵盖标准英语、尼日利亚英语、尼日利亚皮钦语和混合语域,并在零样本和模式引导提示条件下评估了一个前沿语言模型(Gemini 2.5 Flash)。主要发现是语域差距:零样本语域分类准确率为33.3%,当模型在上下文中接收到MIF模式时,准确率上升至73.3%(+40个百分点)。在模式引导提示下,复合意义智能评分增加了5.4分(从73.2到78.6),最大的实际收益体现在语域识别、编码潜台词检测(+10分)和战略行动推荐(+10.3分)上。我们发布了框架规范、标注指南和包含30个项目的公开校准集以支持可重复性,同时保留了一个私有留存语料库用于防污染评估。

英文摘要

We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for Nigerian languages, including NaijaSenti and AfriSenti, treat sentiment classification as a three-way polarity task (positive, negative, neutral). We argue that the dominant failure mode of AI systems on Nigerian discourse is not translation failure but context failure: the same utterance carries opposite pragmatic force depending on speaker, audience, and situation. The MIF operationalises this insight across nine scored dimensions: register, surface sentiment, true intent, irony, coded subtext, risk tier, annotator confidence, speaker emotion, and recommended communications action. We construct a 30-item calibration dataset spanning Standard English, Nigerian English, Nigerian Pidgin, and code-mixed registers, and evaluate a frontier language model (Gemini 2.5 Flash) under zero-shot and schema-informed prompting conditions. The headline finding is the Register Gap: zero-shot register classification accuracy is 33.3%, rising to 73.3% (+40 points) when the model receives the MIF schema in-context. The composite Meaning Intelligence Score increases by 5.4 points (73.2 to 78.6) under schema-informed prompting, with the largest practical gains in register identification, coded-subtext detection (+10 points), and strategic action recommendation (+10.3 points). We release the framework specification, annotation guidelines, and the 30-item public calibration set to support reproducibility, while retaining a private holdout corpus for contamination-protected evaluation.

2606.20369 2026-06-19 cs.CL 新提交

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

CATCH-ME if you RAG:针对仇恨与虚假信息交流的上下文注释多轮对抗言论数据集

Helena Bonaldi, Genoveffa Martone, Marco Guerini

发表机构 * Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会) Università Cattolica del Sacro Cuore(圣心天主教大学)

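AI summary Introduces the first large-scale, expert-curated, multilingual dialogue dataset addressing the overlap of hate and misinformation, with fact-check grounding and span annotations that support training more credible counterspeech models in RAG systems.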

详情
AI中文摘要

在线仇恨言论和虚假信息经常重叠,但NLP研究主要将它们孤立处理。虽然LLMs代表了协助人类针对这两种威胁生成对抗言论的可扩展解决方案,但零样本模型经常生成重复和模糊的回应,凸显了需要高质量示例来指导模型生成。然而,现有的针对仇恨和虚假信息重叠的对抗言论数据集很少,且仅限于单轮英语对话,而现实中的交互跨越多个轮次和语言。为弥补这一差距,我们引入了第一个大规模、专家策划的多语言对话数据集,处理仇恨与虚假信息的交叉点。为确保事实基础,对话还锚定在已验证的外部知识(即事实核查文章和非政府组织报告)中,并包含文档级和块级跨度标注,使其可直接应用于RAG系统。该新资源涵盖五种语言,针对七个边缘化群体的仇恨,能够训练和评估更具说服力、基于事实的对抗言论模型。

英文摘要

Online hate speech and misinformation frequently overlap, yet NLP research has mainly treated them in isolation. While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer model generation. However, existing counterspeech datasets against the overlap of hate and misinformation are scarce and limited to single-turn English dialogues, while real-life interactions span across multiple turns and languages. To bridge this gap, we introduce the first large-scale, expert-curated, multilingual dataset of dialogues tackling the intersection of hate and misinformation. To ensure factual grounding, the dialogues are also anchored in verified external knowledge (i.e., fact-checking articles and NGO reports) and include document- and chunk-level span annotations, making it directly applicable for RAG systems. Covering five languages and targeting hate directed at seven marginalized groups, this novel resource enables the training and evaluation of more persuasive, factually grounded counterspeech models.

10. Safety, Privacy, Fairness, and Explainable NLP (4 papers)

2606.19344 2026-06-19 cs.CL cs.AI 新提交

Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation

揭示未言明之事:通过随机路径聚合可视化隐藏的LLM偏见

Matteo Pelossi, Rita Sevastjanova, Thilo Spinner, Mennatallah El-Assady

发表机构 * ETH Zurich(苏黎世联邦理工学院)

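AI summary Presents TreeTracer, a tool that combines systematic perturbation analysis, syntax-aligned aggregation, and classification-aware node merging, and uses Sankey diagrams to compare semantic contexts and expose hidden representational and syntactic biases in LLMs.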

Comments 14 pages

详情
AI中文摘要

大型语言模型(LLM)表现出表征性和句法性偏见,由于文本生成的随机性,这些偏见难以评估。标准审计方法依赖于单一输出检查或静态自动化指标,这些方法掩盖了底层概率分布,未能捕捉隐藏在低概率生成分支中的偏见。本文介绍了TreeTracer,一种通过聚合比较评估LLM偏见的可视化分析工具。该工具使用系统扰动分析流程,替换每个输入提示中由本体定义的术语,将数百次随机生成聚合成语法对齐的层次结构,然后使用辅助语言模型进行分类感知节点合并。生成的结构通过自定义桑基图可视化。通过并置两个本体驱动的树,工作空间能够直接比较语义上下文,并支持系统性偏见检测。由于任何可视化仅反映模型学习行为的一个子集,系统进一步应用对比推理来计算并直接显示跨上下文的反事实标记概率,从而降低误解偏见存在的风险。我们通过案例研究验证了该工作空间,比较了未对齐的基线模型GPT-2 XL与宪法对齐的Apertus模型。视觉聚合成功揭示了隐藏的代表性伤害,例如反事实代词抑制和对话中对个体的边缘化。初步用户研究证实,聚合比较界面降低了认知负荷,并有效支持分析人员检测系统性偏见。

英文摘要

Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation. Standard auditing methods rely on a single output inspection or static automated metrics. These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability generation branches. This paper introduces TreeTracer, a visual analytics tool designed to evaluate LLM bias through aggregated comparison. Using a systematic perturbation analysis pipeline, the tool replaces ontology-defined terms in each input prompt, aggregates hundreds of stochastic generations into a syntax-aligned hierarchical structure, and then performs classification-aware node merging with an auxiliary language model. The resulting structure is visualized through a custom Sankey diagram. By juxtaposing two ontology-driven trees, the workspace enables direct comparison between semantic contexts and supports systematic bias detection. Because any visualization reflects only a subset of the model's learned behavior, the system further applies contrastive inference to compute and directly display counterfactual token probabilities across contexts, reducing the risk of misinterpreting the presence of bias. We validate the workspace through case studies comparing an unaligned baseline model GPT-2 XL against the constitutionally aligned Apertus models. The visual aggregation successfully exposes hidden representational harms, such as counterfactual pronoun suppression and conversational marginalization of individuals. A preliminary user study confirms that the aggregated comparative interface reduces cognitive load and effectively supports analysts in detecting systemic biases.

2606.20093 2026-06-19 cs.CL 新提交

Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship

自我偏好在可验证的指令遵循修订中弱或不存在:基于真正作者身份的四模型测试

William Guey, Pierrick Bougault

发表机构 * Department of Industrial Engineering, Tsinghua University(清华大学工业工程系)

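AI summary Uses the IFEval verifier to test self-preference in instruction-following revision across four mid-tier model families; authors reject verified-correct edits at a rate not significantly different from fresh models, indicating that self-preference is weak or absent.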

Comments 7 pages, 3 tables. Code and data: https://github.com/williamguey/self-preference-revision

详情
AI中文摘要

大型语言模型(LLMs)越来越多地审查和修订文本,包括它们自己的文本。有记录的自我偏好偏差(模型在充当评判者时偏爱自己的生成)引发了一个问题:模型是否也会抵制对自己写作的有效修正。我们在一个“有效”不是由另一个模型决定,而是由确定性验证器决定的设置中测试这一点:基于IFEval的指令遵循修订。模型撰写草稿;官方IFEval检查器确认草稿违反约束,并且候选编辑修复了它;然后模型接受或拒绝该编辑,要么作为真正的上下文内作者,要么作为一个以中立方式看待草稿的新鲜模型。在四个中端模型系列和85次作者与新鲜模型比较中,我们未检测到可察觉的自我偏好:作者拒绝对自己草稿的已验证正确修复的比例与判断相同草稿的新鲜模型基本相同(差距-5.1个百分点,95%置信区间[-12.9, +2.7])。来自较小试点的自我怀疑提示未在大规模上复制。唯一稳健的观察是定性的:当作者确实拒绝已验证正确的修复时,他们陈述的理由中有97%是挑错而非偏好,即关于拒绝的性质,而非升高的比率。在此样本量下,不能排除小于约13个百分点的效应。

英文摘要

Large language models (LLMs) increasingly review and revise text, including their own. A documented self-preference bias (models favoring their own generations when acting as judges) raises the question of whether models also resist valid corrections to their own writing. We test this in a setting where "valid" is decided not by another model but by a deterministic verifier: instruction-following revision on IFEval. A model writes a draft; the official IFEval checker confirms the draft violates a constraint and that a candidate edit fixes it; the model then accepts or rejects that edit either as the genuine in-context author or as a fresh model that sees the draft neutrally. Across four mid-tier model families and 85 author-versus-fresh comparisons, we find no detectable self-preference: authors reject verified-good fixes to their own drafts at essentially the same rate as fresh models judging the same drafts (gap -5.1 pp, 95% CI [-12.9, +2.7]). A self-skepticism hint from a smaller pilot did not replicate at scale. The one robust observation is qualitative: when authors do reject a verified-good fix, 97% of their stated reasons are flaw-catching rather than preference, that is, about the character of rejections, not an elevated rate. Effects smaller than ~13 pp cannot be excluded at this sample size.

2606.20225 2026-06-19 cs.CL 新提交

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

可操作的激活方向:检测和缓解跨语言模型家族的突发性对齐失调

Abdul Rafay Syed

发表机构 * Universität des Saarlandes(萨尔大学)

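AI summary A difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at the final layer, and causal steering reduces code spillover by 21-51 points; cross-architecture transfer suppresses the behavior but lacks specificity, revealing a two-tier specificity structure.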

Comments 12 pages, 2 figures

详情
AI中文摘要

在不安全代码上微调语言模型会引发突发性对齐失调,其内部结构尚不明确。我们研究了这种失调是否对应于跨架构共享的因果可操作的激活空间方向。在四个指令微调模型家族(Qwen2.5-1.5B、Gemma-2-2B、Llama-3.2-1B、Ministral-3-3B)上进行相同微调后,差分均值方向在每个模型的最终层实现了99.6%的对齐与失调激活分离。通过减去该方向进行因果干预,代码泄露减少了21-51个百分点,而安全代码控制验证了内容特异性。通过岭回归映射进行跨架构迁移产生了较大的行为抑制(高达46个百分点),但未能通过特异性控制,因为随机和正交方向表现相当。我们识别出一个两层特异性结构:模型内方向具有因果特异性和可操作性;跨模型方向具有因果真实性但缺乏特异性。出现了不对称的迁移拓扑,Gemma和Qwen作为几何捐赠者,Llama作为接收者。这些发现定义了线性跨架构校正的局限性,并推荐使用模型内探测进行审计。

英文摘要

Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared across architectures. Across four instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3-3B) finetuned identically, a difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at each model's final layer. Causal steering by subtracting this direction reduces code spillover by 21-51 points, while a secure-code control confirms content specificity. Cross-architecture transfer via ridge regression maps yields large behavioral suppression (up to 46 points) but fails specificity controls as random and orthogonal directions perform comparably. We identify a two-tier specificity structure: within-model directions are causally specific and actionable; cross-model directions are causally real but non-specific. An asymmetric transfer topology emerges, with Gemma and Qwen acting as geometric donors and Llama as a receiver. These findings define the limits of linear cross-architecture correction and recommend within-model probing for auditing.
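The two operations named in the abstract above, a difference-in-means direction and steering by subtracting it, can be sketched on toy vectors. The 3-dimensional activations below are invented for illustration, and removing the full projection onto the direction is only one common steering variant; the abstract does not say which variant or scaling the paper uses.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def diff_in_means_direction(misaligned, aligned):
    """Unit vector pointing from the aligned mean to the misaligned mean."""
    delta = [m - a for m, a in zip(mean(misaligned), mean(aligned))]
    norm = dot(delta, delta) ** 0.5
    return [x / norm for x in delta]

def steer(activation, direction, strength=1.0):
    """Subtract `strength` times the activation's projection onto the direction."""
    coeff = strength * dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy activations (hypothetical): the first coordinate separates the classes.
aligned = [[0.0, 1.0, 0.0], [0.2, 0.8, 0.1], [-0.1, 1.1, -0.1]]
misaligned = [[2.0, 1.0, 0.0], [2.2, 0.9, 0.1], [1.9, 1.1, -0.1]]

direction = diff_in_means_direction(misaligned, aligned)

# Classify by projecting onto the direction and thresholding at the midpoint.
midpoint = [(m + a) / 2 for m, a in zip(mean(misaligned), mean(aligned))]
threshold = dot(midpoint, direction)
separated = (all(dot(a, direction) < threshold for a in aligned)
             and all(dot(m, direction) > threshold for m in misaligned))
print("linearly separated along direction:", separated)

steered = steer(misaligned[0], direction)
print("projection after steering:", round(dot(steered, direction), 6))
```

The specificity controls mentioned in the abstract correspond to repeating `steer` with a random or orthogonal unit vector in place of `direction` and checking that the behavioural effect disappears.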

2606.20527 2026-06-19 cs.CL cs.CV 新提交

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

StylisticBias: 少数人类视觉线索驱动多模态大语言模型中的大部分社会偏见

Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner

发表机构 * Technical University of Munich(慕尼黑工业大学) Munich Center for Machine Learning(慕尼黑机器学习中心) Princeton Center for Information and Technology Policy(普林斯顿信息与技术政策中心)

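AI summary Introduces the StylisticBias benchmark, which varies one visual attribute at a time; age and body type dominate identity-level bias, while about 15 attributes such as fashion style explain nearly 80% of the variation, so bias is concentrated in a small set of visual cues.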

Comments Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地部署在个人和社会影响重大的场景中,但影响这些模型判断人物的视觉线索仍知之甚少。先前的工作通常比较不同的(群体)个体,难以将外貌效应与身份差异分离。我们引入StylisticBias,一个用于评估MLLMs中属性级社会偏见的受控基准。我们生成500张逼真的基础人脸,每张脸创建约50个单一属性变体,产生约25K张图像。这种设计保持身份不变,每次改变一个视觉属性,使我们能够测量特定线索如何改变模型判断。我们在25个二元社会判断场景中评估了六个MLLMs。我们发现年龄和体型主导身份层面的效应,而时尚风格和其他视觉线索驱动最大的属性级变化。我们进一步发现,约15个属性解释了近80%的总变异,表明偏见集中在少数视觉线索上。在与外貌语义对齐的判断中,尤其是社会经济和风格相关判断,敏感性最强。我们发布StylisticBias作为多模态模型细粒度偏见评估的基准。代码和数据集:https://github.com/timo-cavelius/StylisticBias 和 https://hf.co/datasets/shaghayegh/stylistic-bias-dataset。

英文摘要

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.

11. Other / General NLP (4 papers)

2606.19347 2026-06-19 cs.CL cs.AI cs.PL 新提交

How LLMs Fail and Generalize in RTL Coding for Hardware Design?

LLM在硬件设计的RTL编码中如何失败与泛化?

Guan-Ting Liu, Chao-Han Huck Yang, Chenhui Deng, Zhongzhi Yu, Brucek Khailany, Yu-Chiang Frank Wang

发表机构 * NVIDIA Research(英伟达研究院)

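AI summary Proposes an error taxonomy grounded in problem solvability and shows that LLM performance in RTL coding is bounded by pretraining knowledge: alignment techniques merely teach models to compile, while reasoning ability is the key bottleneck.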

Comments Preview, under submission for EMNLP 2026

详情
AI中文摘要

将顺序编程先验转换为硬件设计的并行时序逻辑仍然是大型语言模型(LLM)的关键瓶颈。为了研究这一点,我们引入了一种新的错误分类法,该分类法基于问题可解性,受认知理论启发。我们的分类法将失败分为语法、语义、可解功能和不可解功能类型。评估揭示了VerilogEval基准上的严格经验上限,前沿模型初始通过率稳定在90.8%。这些平台期由不可解的功能错误定义,暴露出对测试时计算扩展免疫的持续知识差距。此外,我们揭示了一个显著的表面收敛差距:优化容易消除语法错误,但同时加剧了更深层次的功能失败。我们的发现表明,对齐技术仅仅教会模型编译。虽然重复采样策略可以修补可解错误,但寄存器传输级(RTL)编码能力仍然严格受限于预训练知识。解决当前基于LLM的硬件生成流水线中的挑战需要更多关于模型推理的研究,而不是对齐干预。

英文摘要

Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models (LLMs). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test-time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level (RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM-based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.

2606.19647 2026-06-19 cs.CL cs.CY cs.SI 新提交

From 50K to 8.2 Million in 24 Hours: Vozinha's Algorithmic Consecration and the Multilingual Making of World Cup Visibility

从5万到820万在24小时内:Vozinha的算法封圣与世界杯可见性的多语言构建

Vinicius Covas

发表机构 * Universidad Anáhuac México(墨西哥阿纳瓦克大学)

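AI summary Uses a multilingual corpus and a nine-frame narrative taxonomy to analyze the algorithmic consecration of Vozinha after a 2026 World Cup match, showing that different languages carried different narrative frames and treating the platform follower count as a linguistic object in the construction of visibility.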

Comments 11 pages, 4 figures, 3 tables; v0.1 pilot preprint. Dataset and evidence package available at https://doi.org/10.5281/zenodo.20722235

详情
AI中文摘要

我们提出了一项多语言计算话语分析,研究语言如何构建了Vozinha——这位40岁的佛得角门将在2026年世界杯西班牙0-0佛得角比赛后的算法封圣。该研究贡献了一个包含葡萄牙语、西班牙语、英语和法语的多语言语料库;一个基于线索的九框架叙事分类法;一个结合LLM辅助建议与人工验证的可复现标注流程;以及跨话语阶段的多语言叙事扩散分析。我们将平台粉丝数本身——被叙述为“从5万到800万”——视为一个语言对象:一种流通且可叙述的可见性证明,而非单纯的测量。粉丝增长时间线仅作为上下文元数据使用:我们重构了一个保守的阶段结构,而非连续的API原生序列,并对每个数据点按值类别、置信度和证据类型进行标注。唯一精确的主要爬取锚点是2026年6月16日15:47 UTC的8,235,652粉丝;所有其他数字均报告为估计范围或阈值,包括估计的赛前基线45k-56k。研究结果表明,不同语言承载了不同的框架:葡萄牙语的动员、西班牙语的危机、英语的民族构建,以及共享的平台指标奇观,通过这种奇观,边缘的体育表现变得全球可见。作为v0.1试点,本文发布了语料库模式、框架分类法、标注指南、哈希视觉证据日志和类型化时间线,同时将完整的双重标注和标注者间一致性标记为计划工作。

英文摘要

We present a multilingual computational discourse analysis of how language constructed the algorithmic consecration of Vozinha, the 40-year-old Cape Verde goalkeeper, after Spain 0-0 Cape Verde at the 2026 FIFA World Cup. The study contributes a multilingual corpus in Portuguese, Spanish, English, and French; a nine-frame narrative taxonomy with cue-based frame annotation; a reproducible annotation pipeline combining LLM-assisted suggestion with human validation; and an analysis of cross-lingual narrative diffusion across discourse phases. We treat the platform follower count itself, narrated as "50k to 8M", as a linguistic object: a circulating and narratable proof of visibility rather than a mere measurement. The follower-growth timeline is used only as contextual metadata: we reconstruct a conservative phase structure, not a continuous API-native series, and type every datapoint by value class, confidence, and evidence type. The only exact primary scraper anchor is 8,235,652 followers at 2026-06-16 15:47 UTC; all other figures are reported as estimated ranges or thresholds, including an estimated pre-match baseline of 45k-56k. Findings suggest that distinct languages carried distinct frames: Portuguese mobilization, Spanish crisis, English nation-making, and a shared platform-metric spectacle through which peripheral athletic performance became globally visible. As a v0.1 pilot, the paper releases the corpus schema, frame taxonomy, annotation guidelines, hashed visual-evidence log, and typed timeline, while flagging full double annotation and inter-annotator agreement as planned work.

2606.19864 2026-06-19 cs.CL 新提交

The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI

近乎智能的革命:扩大审议规模并利用AI赋能人类的选项

Serge Sharoff

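AI summary Examines, through the lens of Systemic-Functional Linguistics, how large language models can scale up democratic deliberation, improve inclusivity, and empower marginalised groups, while cautioning against both overclaiming and underclaiming.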

Comments Published in /Handbook of Democracy in the Era of Artificial Intelligence/ edited by Evangelos Pournaras, Srijoni Majumdar, Carina Ines Hausladen, and Dirk Helbing. 2026

详情
AI中文摘要

大型语言模型在公共话语中的日益突出为民主审议带来了机遇和挑战。虽然红队策略有助于缓解特定风险,但关于语言限制、偏见和LLM的谄媚倾向等更广泛的担忧仍然存在。本章探讨如何利用LLM显著扩大和民主化审议,特别是在促进包容性和赋权传统边缘群体方面。借鉴系统功能语言学的概念,本章考察了语言使用者之间的差异(例如,关于社会人口群体)和语言使用中的差异(例如,关于交际功能)如何影响AI支持的审议参与。本章介绍了AI驱动的审议研究,并评估了它们在支撑论证、增强可及性以及减少嵌入在声望语域中的排斥性语言规范和偏见的影响方面的潜力。同时,本章警告不要过度承诺(导致不切实际的期望)和低估承诺(冒着错失AI辅助参与机会的风险)。最后,本章确定了未来的研究方向,以最大化AI辅助参与的民主潜力,同时嵌入伦理保障以抵消语言不平等的再生产。

英文摘要

The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation. While red teaming strategies help mitigate specific risks, broader concerns persist regarding linguistic constraints, biases, and the sycophantic tendencies of LLMs. This chapter explores how LLMs can be used to significantly scale up and democratise deliberation, particularly in fostering inclusivity and empowering traditionally marginalised groups. Drawing on concepts from Systemic-Functional Linguistics, the chapter examines how variations across language users (for example, with respect to socio-demographic groups) and across language use (for example, with respect to communicative functions) shape participation in AI-supported deliberation. The chapter presents AI-driven deliberation studies and assesses their potential to scaffold argumentation, enhance access, and reduce the influence of exclusionary linguistic norms and biases which are embedded in prestigious registers. At the same time, the chapter cautions against both overclaiming, which leads to unrealistic expectations, and underclaiming, which risks missed opportunities for AI-assisted engagement. The chapter concludes by identifying future research directions to maximise the democratic potential of AI-assisted participation while embedding ethical safeguards to counteract the reproduction of linguistic inequalities.

2606.20198 2026-06-19 cs.CL New submission

Pitch Spelling Jazz Lead Sheets, Solo Transcriptions, Classical Piano and Monophonic Scores

Pitch Spelling for Jazz Lead Sheets, Solo Transcriptions, Classical Piano, and Monophonic Scores

Augustin Bouquillard, Florent Jacquemard

Affiliations: École polytechnique; INRIA (French National Institute for Research in Computer Science and Automation)

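AI summary: Proposes a pitch-spelling and key-estimation algorithm that jointly estimates note names, a global key signature, and a local scale for each bar through two optimisation stages (modal and tonal), and validates it on a range of digital score datasets.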

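Details
AI-generated abstract (translated from Chinese)

We propose an algorithm for pitch spelling and key estimation. Given MIDI-format input containing note pitches (in semitones, relative to the lowest reference note) and bar-boundary information, the algorithm estimates appropriate note names, a global key signature, and a local scale for each bar. These related pieces of information are evaluated jointly in two optimisation stages. In the initial 'modal' stage, a shortest-path search proposes a probable scale for each bar so as to minimise the number of accidentals in the printed score. Then, in a second stage called 'tonal', these local scales are used to estimate the key signature and note names, producing the best musical notation for the whole piece. We evaluate on datasets comprising a variety of digital scores: jazz lead sheets from the Real Book, transcriptions of recordings of jazz solos and bass lines, traditional tunes, and classical scores for piano and monophonic instruments. Our procedure was originally designed for music transcription, in particular for building digital collections of jazz solos transcribed from audio recordings, for music analysis, teaching, and cultural-heritage preservation. The method should also help with other tasks related to processing musical notation. In addition, for this purpose we define new distances between various common jazz scales, which may be of some interest to musicological research.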

English abstract

We present an algorithm for pitch spelling and key estimation. Given an input in MIDI-like format, containing information on note pitches (expressed in semitones relative to the lowest reference note) and bar boundaries, it estimates the appropriate note names, a global Key Signature, and a local scale for each bar. These related information elements are evaluated jointly during two stages of optimisation. During an initial 'modal' stage, a probable scale is proposed for each bar, minimising the number of accidentals to be printed in the score with a shortest-path search. Then, during a second stage called 'tonal', these local scales are used to estimate the Key Signature and note names that would result in the best musical notation for the entire piece. We present evaluations conducted on datasets comprising a variety of digital musical scores: jazz lead sheets taken from the Real Book, transcriptions of recordings of jazz soli and bass lines, traditional tunes, as well as classical scores for piano and monophonic instruments. Our procedure was originally designed for use in music transcription, specifically for building digital collections of jazz solos transcribed from audio recordings, for the purposes of music analysis, teaching and the preservation of cultural heritage. This method should also prove useful for other tasks related to the processing of musical notation. Furthermore, to this end, we have defined new distances between various common jazz scales, which may be of some interest to musicological studies.
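
Illustrative sketch (not from the paper): the 'modal' stage described above picks one scale per bar with a shortest-path search that keeps accidentals low. The Python below shows that general idea under strong simplifying assumptions that are ours, not the authors': only the 12 major scales are candidates, a note counts as an accidental when its pitch class lies outside the bar's scale, and moving between scales costs their circle-of-fifths distance. The names `local_scales`, `accidentals`, `fifths_distance`, and `change_weight` are hypothetical.

```python
from typing import List

# Assumption: candidate scales are the 12 major scales only.
MAJOR_STEPS = (0, 2, 4, 5, 7, 9, 11)  # semitone offsets of a major scale
SCALES = [frozenset((tonic + s) % 12 for s in MAJOR_STEPS) for tonic in range(12)]


def accidentals(bar: List[int], tonic: int) -> int:
    """Count notes in the bar whose pitch class lies outside the scale."""
    return sum(1 for pitch in bar if pitch % 12 not in SCALES[tonic])


def fifths_distance(a: int, b: int) -> int:
    """Steps between two tonics on the circle of fifths (0 to 6)."""
    # Multiplying a semitone difference by 7 (mod 12) converts it to fifths.
    d = ((b - a) * 7) % 12
    return min(d, 12 - d)


def local_scales(bars: List[List[int]], change_weight: float = 1.0) -> List[int]:
    """Return one tonic (0 = C, ..., 11 = B) per bar.

    Shortest path through a layered graph: one layer per bar, one node per
    candidate scale. Node cost is the accidental count; edge cost is the
    (assumed) circle-of-fifths distance scaled by change_weight.
    """
    n = len(bars)
    if n == 0:
        return []
    cost = [[0.0] * 12 for _ in range(n)]
    back = [[0] * 12 for _ in range(n)]
    for t in range(12):
        cost[0][t] = accidentals(bars[0], t)
    for i in range(1, n):
        for t in range(12):
            prev = min(
                range(12),
                key=lambda s: cost[i - 1][s] + change_weight * fifths_distance(s, t),
            )
            back[i][t] = prev
            cost[i][t] = (
                cost[i - 1][prev]
                + change_weight * fifths_distance(prev, t)
                + accidentals(bars[i], t)
            )
    # Trace the cheapest path back from the last bar.
    t = min(range(12), key=lambda s: cost[n - 1][s])
    path = [t]
    for i in range(n - 1, 0, -1):
        t = back[i][t]
        path.append(t)
    path.reverse()
    return path


if __name__ == "__main__":
    bars = [
        [60, 62, 64, 65, 67, 69, 71, 77],  # C major notes, F twice
        [67, 69, 71, 72, 74, 76, 78, 66],  # G major notes, F sharp twice
    ]
    print(local_scales(bars))
```

For these two bars the sketch returns `[0, 7]`, meaning C major for the first bar and G major for the second. The paper's own scale set, its cost function, and its second 'tonal' stage are not reproduced here.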