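arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.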

Abstract

Self-distillation bootstraps large language models (LLMs) by training on their own generations. However, existing methods either rely on external signals to curate self-generated outputs (e.g., correctness filtering, execution feedback, and reward search), which are costly and unavailable for the best-performing frontier models, or skip curation entirely and train on all raw outputs, an approach that is often domain-specific and hard to generalize. Both also share a deeper weakness: self-generated outputs entangle task-relevant capability with other factors, such as stylistic patterns, formatting artifacts, and model-specific errors, diluting the signal for the specific capability one aims to improve. In this paper, we propose Self-Policy Distillation (SPD), which achieves generalizable, capability-selective self-distillation without any external signal. Specifically, SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, projects key-value (KV) activations into this subspace during self-generation, and fine-tunes on the resulting raw outputs with standard next-token prediction loss. Through extensive experiments across code generation, mathematical reasoning, and multiple-choice QA, we show that SPD achieves up to 13% improvement over state-of-the-art self-distillation methods without external signals and up to 16% improvement over pre-trained baselines. Notably, SPD demonstrates superior generalizability, achieving 15% better performance under out-of-domain generalization settings.
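The subspace-projection step can be pictured with a very small sketch. It is an illustration under loud assumptions, not the SPD implementation: the gradient vectors and the activation are toy values, a single direction found by power iteration stands in for a full low-rank decomposition, and the function names are hypothetical.

```python
import math

def top_direction(grads, iters=100):
    """Leading right-singular vector of the gradient matrix via power iteration."""
    d = len(grads[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # One power-iteration step: w = G^T (G v)
        gv = [sum(g[j] * v[j] for j in range(d)) for g in grads]
        w = [sum(gv[i] * grads[i][j] for i in range(len(grads))) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(x, basis):
    """Project x onto the span of orthonormal basis vectors."""
    out = [0.0] * len(x)
    for b in basis:
        coeff = sum(xi * bi for xi, bi in zip(x, b))
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out

# Toy gradients (hypothetical) that mostly vary along the first coordinate.
grads = [[2.0, 0.1, 0.0], [1.9, -0.1, 0.0], [2.1, 0.0, 0.1]]
direction = top_direction(grads)

# Toy activation (hypothetical); after projection only the part lying in
# the extracted rank-1 subspace remains.
activation = [1.0, 1.0, 1.0]
projected = project(activation, [direction])
print(direction, projected)
```

A rank-k version would keep the k leading directions; the rank-1 case is used here only to keep the sketch free of external dependencies.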

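2605.22660 2026-05-22 cs.CL cs.AI (version update)

Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora

Maciej Skorski

Affiliations: University of Luxembourg

AI summary: This study examines whether LLM-based translation can compensate for the lack of language-specific annotated corpora in moral values classification. Using Polish as a test case, it shows that direct translation preserves subtle moral cues well, offering a practical route to moral values research in under-resourced languages.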

Abstract

Moral language is subtle and culturally variable, making it difficult to translate faithfully across languages. Idiomatic expressions, slang, and cultural references introduce hard-to-avoid translation artifacts. Yet automated moral values classification depends on language-specific annotated corpora that exist almost exclusively in English. We investigate whether LLM-based translation can bridge this gap, taking Polish as a test case. Using ~50k morally-annotated social media posts from a diverse range of topics, we apply a principled four-method validation pipeline: LaBSE cross-lingual embedding similarity, Centered Kernel Alignment (CKA), LLM-as-judge evaluation, and deep learning classifier parity tests. We show that despite shortcomings in handling slang, vulgarity, and culturally-loaded expressions, direct translation preserves subtle moral cues well enough to be harvested by cross-lingual machine learning: mean cosine similarity is 0.86, and AUC gaps of 0.01 to 0.02 across all foundations close further when language models are fine-tuned. These results demonstrate that machine translation is a practical and cost-effective path to moral values research in languages currently under-resourced in this domain. We demonstrate this for Polish as a representative Slavic language, with expected generalisation to related languages.
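Two of the four validation checks, paired embedding similarity and linear CKA, reduce to short formulas. The sketch below is a minimal illustration rather than the authors' pipeline: the toy vectors are hypothetical stand-ins for LaBSE embeddings of source posts and their translations, and all function names are invented for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_paired_cosine(src, tgt):
    """Average cosine similarity over aligned (source, translation) pairs."""
    return sum(cosine(u, v) for u, v in zip(src, tgt)) / len(src)

def center_columns(X):
    """Subtract each column's mean so that features are zero-centered."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def cross_gram_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y for row-major matrices."""
    total = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            s = sum(X[k][i] * Y[k][j] for k in range(len(X)))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    Xc, Yc = center_columns(X), center_columns(Y)
    num = cross_gram_sq_norm(Xc, Yc)
    den = math.sqrt(cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc))
    return num / den

# Hypothetical 3-dimensional "embeddings" for four posts and their translations.
src = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 0.0], [1.0, 3.0, 1.0]]
tgt = [[0.9, 0.1, 2.1], [0.1, 1.1, 0.9], [2.0, 0.8, 0.1], [1.2, 2.9, 1.0]]
print(mean_paired_cosine(src, tgt), linear_cka(src, tgt))
```

Cosine similarity compares each post with its own translation, while CKA compares the geometry of the two embedding sets as a whole, so the two numbers answer different questions.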

2605.22654 2026-05-22 cs.CL cs.CV (version update)

Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs

Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu, Jingming Yao, Chengzhong Xu, Jiebo Luo, Derek F. Wong

Affiliations: Department of Computer and Information Science, University of Macau; University of Rochester; Sichuan University; Department of Portuguese, Faculty of Arts and Humanities, University of Macau

AI summary: This paper proposes an image-semantic guided poetry detection method that combines image content with the poem text to improve LLM performance in detecting AI-generated modern Chinese poetry. Experiments show that it outperforms traditional methods on multiple datasets.

Abstract

Previous detection studies have shown that LLMs cannot be effectively used as detectors, but these studies have not addressed modern Chinese poetry. Moreover, no relevant research has explored the performance of LLMs in detecting modern Chinese poetry. This paper evaluates and enhances the performance of LLMs as detectors for modern Chinese poetry, and proposes an image-semantic guided poetry detection method. Compared with traditional detection approaches, our method innovatively incorporates images that reflect the content of the poetry. Through example-driven approaches, our method effectively integrates information such as meaning, imagery, and feeling from the image, then forms a complementary judgment with the poem text. Experimental results demonstrate that the LLM detectors based on our method outperform baseline detectors based on plain text, and even surpass the best-performing traditional detector, RoBERTa. The Gemini detector using our method achieves a Macro-F1 score of 85.65%, reaching the state-of-the-art level. The performance improvements of different LLM detectors on data generated by multiple LLMs demonstrate the effectiveness of our method.
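Macro-F1, the metric reported above, averages the per-class F1 scores so that the human-written and AI-generated classes count equally regardless of class balance. A minimal sketch follows; the labels and predictions are hypothetical toy values, not data from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no predictions scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical labels for six poems.
y_true = ["human", "human", "ai", "ai", "ai", "human"]
y_pred = ["human", "ai", "ai", "ai", "human", "human"]
print(macro_f1(y_true, y_pred))
```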

2605.22650 2026-05-22 cs.CL (version update)

Whose Voice Counts? Mapping Stakeholder Perspectives on AI Through Public Submissions to the U.S. Government

Alina Karakanta, Alex Christiansen, Tomás Dodds, Bissie Anderson, Matteo Fuoli, Marcus Perlman, Aletta G. Dorst

Affiliations: Leiden University Centre for Linguistics, Leiden University; Department of Linguistics and Communication, University of Birmingham; School of Journalism and Mass Communication, University of Wisconsin-Madison

AI summary: This paper analyzes letters submitted during the public consultation for the U.S. government's AI Action Plan to examine how different stakeholders view AI. Individuals are more concerned with AI's impact on life, while other groups focus on AI development, and the AI Action Plan mainly reflects private-sector concerns.

Abstract

As artificial intelligence (AI) systems become more common in our daily lives, it is important to understand how different stakeholders comprehend and envisage the role that these technologies play in shaping social, political, and economic realities. In this paper, we investigate public perceptions of AI based on a corpus of letters submitted during the public consultation for the Trump Administration's US AI Action Plan. To this end, we release a corpus cleaning pipeline and perform topic modelling and frequency analysis to explore the predominant topics discussed by different subgroups (e.g., academia, individuals, private sector) and those appearing in the AI Action Plan. Our results show that individuals voice strong concerns related to the impact of AI on life, while other stakeholders are more concerned with AI development. Our comparison of topics suggests that the AI Action Plan predominantly reflects the concerns of the private sector on security, policies, and development, with individuals' concerns less represented.

2605.22620 2026-05-22 cs.LG cs.CL (version update)

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

Shourov Joarder, Diganta Sikdar, Ahsan Habib Akash, Binod Bhattarai, Prashnna Gyawali

Affiliations: Bangladesh University of Engineering and Technology; West Virginia University; University of Aberdeen; Fogsphere (Redev.AI Ltd, UK); University College London

AI summary: This paper proposes a multi-reward RLIF framework that decomposes the training signal into an answer-level reward and a completion-level reward and combines them with GDPO-based normalization and KL-Cov regularization. It improves stability and robustness while approaching the performance of supervised RLVR methods on mathematical reasoning and code generation tasks.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of LLMs, but often depends on external supervision from human annotations or gold-standard solutions. Reinforcement learning from internal feedback (RLIF) has recently emerged as a scalable unsupervised alternative, using signals extracted from the model itself. However, existing RLIF methods typically rely on a single internal reward, which can lead to reward hacking, entropy collapse, and degraded reasoning structure. We propose a multi-reward RLIF framework that decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To combine these signals robustly, we apply GDPO-based normalization to reduce reward-scale imbalance. We further introduce KL-Cov regularization, which targets low-entropy token distributions responsible for disproportionate entropy reduction, preserving exploration and preventing late-stage collapse. Across mathematical reasoning and code-generation benchmarks, our method improves stability and robustness over prior unsupervised RL approaches, while achieving performance close to supervised RLVR methods. These results show that complementary internal rewards, combined with targeted regularization, can support stable long-horizon reasoning without relying on external ground-truth supervision. Code will be released soon.
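A rough sketch of how two internal rewards might be combined for one prompt's group of sampled completions. Everything here is an assumption made for illustration: answers are "clustered" by exact string match, the self-certainty scores are supplied as inputs rather than computed from token distributions, and the per-reward z-score normalization only conveys the idea of putting both rewards on a common scale before summing. The paper's GDPO-based normalization and KL-Cov regularizer are not reproduced.

```python
from collections import Counter
import statistics

def cluster_vote_rewards(answers):
    """Reward 1.0 for completions whose answer falls in the largest cluster."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def zscore(values):
    """Normalize within the group; a constant group maps to all zeros."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def combined_advantages(answers, self_certainty):
    """Normalize each reward separately, then sum, so neither scale dominates."""
    vote = zscore(cluster_vote_rewards(answers))
    cert = zscore(self_certainty)
    return [v + c for v, c in zip(vote, cert)]

# Hypothetical group of four completions for one math prompt.
answers = ["42", "42", "41", "42"]
self_certainty = [0.9, 0.7, 0.8, 0.6]
advantages = combined_advantages(answers, self_certainty)
print(advantages)
```

Normalizing each reward before summing is one simple way to address the reward-scale imbalance the abstract mentions: a binary vote and a continuous certainty score would otherwise contribute very different magnitudes.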

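2605.22608 2026-05-22 cs.CL cs.AI (version update)

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer

Affiliations: IBM Research

AI summary: This study presents Agentic CLEAR, a framework that automates the evaluation of LLM agents through analysis at multiple levels of granularity, providing high-quality, data-driven feedback and predicting task success rates.

Comments: ACL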

Abstract

Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation capabilities or imposing static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.

2605.22579 2026-05-22 cs.CL cs.AI stat.ML (version update)

Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann

Affiliations: Department of Statistics, LMU Munich; Munich Center for Machine Learning (MCML); School of Computer and Information Engineering, Henan University

AI summary: This paper studies hyperfitting and finds that it differs from distribution sharpening. Experiments indicate that hyperfitting relies on a dynamic, context-dependent rank-reordering mechanism and achieves a geometric expansion of the feature space through a terminal expansion in the final Transformer layer. The authors propose Late-Stage LoRA to improve generation quality.

Comments: Accepted at ICML 2026

2605.22567 2026-05-22 cs.CL (version update)

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

Yuchun Fan, Bei Li, Peiguang Li, Yilin Wang, Yongyu Mu, Jian Yang, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Jingbo Zhu, Tong Xiao

Affiliations: NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China; Meituan Inc.; NiuTrans Research, Shenyang, China

AI summary: This paper proposes LANG, a framework that uses language-conditioned hints to guide exploration in non-English reasoning tasks. It addresses the trade-off between input-language consistency and reasoning quality in multilingual reinforcement learning, improving reasoning performance without compromising language consistency.

Comments: Accepted to ACL 2026 (main conference)

Abstract

Reinforcement learning has proven effective for enhancing multi-step reasoning in large language models (LLMs), yet its benefits have not fully translated to multilingual contexts. Existing methods struggle with a fundamental trade-off: prioritizing input-language consistency severely hampers reasoning quality, while prioritizing reasoning often leads to unintended language drift toward English. We address this challenge with LANG, a novel framework that leverages language-conditioned hints to guide exploration in non-English reasoning tasks. Our method incorporates two key mechanisms to prevent dependency on these hints: a progressive decay schedule that gradually withdraws scaffolding, and a language-adaptive switch that tailors learning horizons to specific language difficulties. Empirical results on challenging multilingual mathematical benchmarks reveal that LANG substantially enhances reasoning performance without compromising language consistency. Moreover, we show that our framework generalizes beyond mathematics, fostering more consistent language alignment across model layers.

2605.22564 2026-05-22 cs.CL cs.LG cs.SE (version update)

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti

Affiliations: Carnegie Mellon University; Microsoft Research

AI summary: This paper presents SynAE, a framework for measuring the quality of synthetic data for multi-turn, tool-calling agents. It assesses validity, fidelity, and diversity across four metric categories and shows that no single metric fully characterizes synthetic data quality.

Abstract

Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre-deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine-grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi-axis evaluation of synthetic data for agent testing. A demo of SynAE is available at https://synae-2026-synae-demo.static.hf.space/index.html, with code at https://github.com/wsqwsq/SynAE.

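2605.22544 2026-05-22 cs.CL cs.IR (version update)

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

Yevhen Kostiuk, Kenneth Enevoldsen

Affiliations: Aarhus University

AI summary: This paper examines the shortcomings of single-prompt evaluation for instruction-tuned embedding models. It finds that default prompts can systematically understate or overstate performance and that leaderboard rankings are not robust to prompt choice, and it recommends that benchmarks evaluate over multiple prompts or report sensitivity.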

Abstract

Instruction embedding models have become common among state-of-the-art models, yet they are typically evaluated using a single prompt per task. This single-point evaluation ignores a main problem of the instruction-based approach, namely sensitivity to the phrasing of the instruction. We present an empirical study of prompt sensitivity across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset, a total of 990 cases. We show that reported scores misrepresent the distribution of scores over plausible prompts. The default prompt can systematically either understate or overstate performance. Furthermore, we show that the leaderboard ranking is not robust to prompt selection: by choosing prompts favorably, any model in our study can be promoted to first place. Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity alongside point estimates.
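The size of the study and the ranking instability it describes can be illustrated with a toy calculation. The scores below are invented purely for illustration and do not come from the paper; the model and prompt names are hypothetical.

```python
# Study size: every model is evaluated on every dataset with every prompt.
n_models, n_datasets, n_prompts = 6, 11, 15
n_cases = n_models * n_datasets * n_prompts

# Hypothetical scores of three models on one task under three prompts.
scores = {
    "model_a": {"default": 0.71, "prompt_2": 0.64, "prompt_3": 0.69},
    "model_b": {"default": 0.68, "prompt_2": 0.73, "prompt_3": 0.66},
    "model_c": {"default": 0.66, "prompt_2": 0.67, "prompt_3": 0.74},
}

def leader_under(prompt):
    """Return the top-ranked model when all models use the same prompt."""
    return max(scores, key=lambda model: scores[model][prompt])

def spread(model):
    """Gap between a model's best and worst score across prompts."""
    values = scores[model].values()
    return max(values) - min(values)

for prompt in ("default", "prompt_2", "prompt_3"):
    print(prompt, "->", leader_under(prompt))
print({model: round(spread(model), 2) for model in scores})
```

In this toy setup a different model leads under each prompt, which is the kind of instability a single-prompt leaderboard would hide. Reporting the spread alongside the point estimate is one way to expose it.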

2605.22536 2026-05-22 cs.CV cs.CL (version update)

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong

Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; University of Electronic Science and Technology of China; Chongqing University; The University of Tokyo; Beihang University; Northwestern Polytechnical University

AI summary: This paper introduces SpaceDG, the first large-scale dataset for degradation-aware spatial understanding, built with a physically grounded degradation synthesis engine that simulates nine degradation types. It evaluates the spatial reasoning of multimodal large language models under visual degradation and shows that fine-tuning on degraded data improves robustness.

Abstract

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.

2605.22501 2026-05-22 cs.CL cs.AI cs.IR (version update)

BeLink: Biomedical Entity Linking Meets Generative Re-Ranking

Darya Shlyk, Stefano Montanelli, Lawrence Hunter

Affiliations: University of Milan; University of Chicago

AI summary: This paper proposes a generative re-ranking method that uses instruction tuning to improve the efficiency and accuracy of biomedical entity linking, raising linking accuracy by 3% to 24% on multiple benchmarks while reducing inference time.

Comments: Accepted to ACM SIGIR 2026

2605.22487 2026-05-22 cs.CL (version update)

Polite on the Surface, Wrong in Practice: A Curated Dataset for Fixing Honorific Failures in Multilingual Bangla Generation

Md. Asaduzzaman Shuvo, Mahedi Hasan, Md. Tashin Parvez, Azizul Haque Noman, Md. Shafayet Hossain Ovi

Affiliations: United International University

AI summary: This paper presents BLADE, a curated dataset for improving honorific handling in multilingual Bangla generation. The authors systematically fine-tune and evaluate leading open-weight architectures, such as DeepSeek-8B and LLaMA-3.2-3B, to improve structural fidelity and honorific alignment.

Abstract

Recent advances in Multilingual Large Language Models (MLLMs) have significantly enhanced cross-lingual conversational capabilities, yet modeling culturally nuanced and context-dependent communication remains a critical bottleneck. Specifically, existing state-of-the-art models exhibit a severe pragmatic gap when handling structural variations, regional idioms, and honorific consistency in low-resource contexts like Bangla. To address this limitation, we introduce BLADE (BangLa Application and DialoguE generation), a novel, culturally aligned instruction-tuning dataset and benchmarking framework comprising 4,196 meticulously curated interaction pairs. We leverage this resource to systematically fine-tune and evaluate leading open-weight architectures, including DeepSeek-8B and LLaMA-3.2-3B, utilizing parameter-efficient fine-tuning via LoRA adapters in a 4-bit NormalFloat (NF4) quantization framework. Our empirical evaluations demonstrate that models fine-tuned on our dataset yield substantial improvements in structural fidelity and honorific alignment, providing a rigorous benchmark for bridging pragmatic disparities in low-resource multilingual text generation. Code and dataset: https://github.com/ashuvo25/Bangla_Application_LLM/tree/main

2605.22476 2026-05-22 cs.LG cs.CL (version update)

Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity

Hangyue Zhao, Paul Caillon, Erwan Fagnou, Alexandre Allauzen

Affiliations: ESPCI PSL; LAMSADE, Université Paris Dauphine - PSL

AI summary: This paper proposes a structured-sparse attention mechanism for efficiently maintaining and updating the latent states of entities and attributes over long sequences, improving the efficiency and accuracy of entity tracking by reducing computational complexity.

Comments: 12 pages, 1 figure, 9 tables

Abstract

Entity tracking requires maintaining and updating latent states for entities and attributes over long sequences. Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation remains expensive. We show that in this setting, learned attention is strongly structured: most mass concentrates in local block-diagonal neighborhoods with a light cross-block residue. Exploiting this, we derive a blockwise evaluation of a resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system. The resulting evaluation is subquadratic in sequence length, $O(n^{4/3}d)$ (and $O(n^{7/3})$ when $d\approx n$). On controlled tracking benchmarks, our method matches the dense operator's accuracy while reducing wall-clock time by 12-29% under a standardized measurement protocol, and is up to $2.4\times$ faster than a compact dense Transformer at comparable exact-match accuracy. We further provide ablations over block size and model capacity, and identify a limitation: performance collapses when the number of simultaneously evolving properties exceeds the number of attention heads.
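The structural observation, that most attention mass sits in block-diagonal neighborhoods, can be checked on any attention matrix with a few lines of code. The sketch below is an illustration under assumptions: the matrix is a toy example, the block size is chosen by hand, and the function name is hypothetical. It does not implement the paper's blockwise resolvent evaluation.

```python
def block_diagonal_mass(attn, block_size):
    """Fraction of total attention mass where query i and key j share a block."""
    inside = 0.0
    total = 0.0
    for i, row in enumerate(attn):
        for j, weight in enumerate(row):
            total += weight
            if i // block_size == j // block_size:
                inside += weight
    return inside / total

# Toy 4x4 row-stochastic attention matrix (hypothetical values).
attn = [
    [0.6, 0.3, 0.1, 0.0],
    [0.4, 0.5, 0.0, 0.1],
    [0.1, 0.0, 0.5, 0.4],
    [0.0, 0.1, 0.3, 0.6],
]
print(block_diagonal_mass(attn, block_size=2))
```

A value close to 1 means the cross-block residue is light, which is the condition under which treating within-block interactions exactly and cross-block interactions through a smaller system is plausible.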

2605.22465 2026-05-22 cs.CL (version update)

In Silico Modeling of the RAMPHO Buffer: Dissociating Informational and Energetic Masking via Phonetic Entropy in Deep Neural Networks

Stefan Bleeck

Affiliations: Institute of Sound and Vibration Research (ISVR), University of Southampton

AI summary: This paper presents an in silico simulation based on the self-supervised acoustic model wav2vec 2.0 that uses phonetic entropy to dissociate informational and energetic masking, revealing a cognitive-acoustic Pareto optimization problem.

Abstract

The fundamental challenge of listening in multi-talker environments is a cognitive bottleneck, defined by the Ease of Language Understanding (ELU) model as a failure within the RAMPHO episodic buffer. Current deep neural networks for speech enhancement optimize purely for physical acoustics, failing to account for the cognitive penalty of informational masking. Here, we present an in silico simulation of the RAMPHO buffer using the frame-by-frame phonetic entropy of a self-supervised acoustic model (wav2vec 2.0). By contrasting a semantically intact distractor with a phase-decorrelated distractor (the Concentration Shield) across a signal-to-noise ratio (SNR) sweep, we successfully dissociate the cognitive penalty of informational distraction from the physical penalty of energetic decay. The simulation reveals a cognitive-acoustic Pareto optimization problem: destroying a distractor's semantic payload provides a release from informational masking at high SNRs, but fundamentally degrades temporal glimpsing cues at low SNRs.
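Frame-by-frame phonetic entropy can be read as the Shannon entropy of a model's output distribution over phonetic classes at each frame. The sketch below is a minimal illustration under that reading, assuming the per-frame posteriors have already been obtained (for example, from a softmax over an acoustic model's logits). The toy distributions and function names are hypothetical, and no wav2vec 2.0 API is called.

```python
import math

def frame_entropy(probs):
    """Shannon entropy, in bits, of one frame's posterior distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mean_phonetic_entropy(frames):
    """Average entropy over all frames of an utterance."""
    return sum(frame_entropy(frame) for frame in frames) / len(frames)

# Hypothetical posteriors over four phonetic classes for three frames each.
confident = [[0.97, 0.01, 0.01, 0.01]] * 3   # model is sure of the phone
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3   # model cannot decide
print(mean_phonetic_entropy(confident), mean_phonetic_entropy(uncertain))
```

Higher average entropy indicates that the model is less able to commit to a phonetic class, which is the quantity a simulation of this kind would track across an SNR sweep.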

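2605.22462 2026-05-22 cs.CL cs.AI (version update)

From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models

Caleb Munigety

Affiliations: Independent Researcher

AI summary: This paper proposes a five-stage methodology for causal feature analysis in Transformer language models and demonstrates it end-to-end on GPT-2 small for the Indirect Object Identification task. Activation patching recovers the canonical IOI circuit, and a sparse autoencoder recovers name-specific features. Causal validation finds these features specific but only partially causal, robustness testing reveals a gap between detection robustness and causal robustness, and the deployment evaluation shows the cost savings of the optimal monitor configuration.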

Abstract

We propose a five-stage methodology for causal feature analysis in transformer language models (probe design, feature extraction, causal validation, robustness testing, and deployment integration) and demonstrate it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. Activation patching recovers the canonical IOI circuit (layer-9 head 9 alone gives recovery +1.02). A sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation finds these features specifically but only partially causal: ablating fifteen of them leaves the model accurate on 98% of prompts. Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56). Robustness testing under three distribution shifts finds that the circuit transfers cleanly but feature ablation effects degrade substantially, exposing a gap between detection robustness and causal robustness. A cost-based deployment evaluation (assumed $50/FN, $0.42/FP, 2% error rate) finds an optimal monitor configuration yielding $8.96 per 1000 queries against a $1000 baseline, a 99.1% saving. Optimal composition strategy varies with cost ratio and base rate. The conjunction of stages produces findings no single stage would.
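The headline saving follows directly from the stated cost assumptions. The check below reproduces the arithmetic, assuming that the $1000 baseline corresponds to leaving every error uncaught. That interpretation matches the numbers in the abstract but is not stated there explicitly, and the variable and function names are hypothetical.

```python
queries = 1000
error_rate = 0.02
cost_fn = 50.00   # assumed cost of a missed error (false negative)
cost_fp = 0.42    # assumed cost of a false alarm (false positive)

def expected_cost(false_negatives, false_positives):
    """Total cost of a monitor's mistakes over a batch of queries."""
    return false_negatives * cost_fn + false_positives * cost_fp

# No monitor: all of the expected 20 errors per 1000 queries go undetected.
baseline = expected_cost(queries * error_rate, 0)

monitored = 8.96  # cost per 1000 queries reported for the best configuration
saving = 1 - monitored / baseline
print(f"baseline=${baseline:.2f}, saving={saving:.1%}")
```

Because a false negative is assumed to cost roughly 119 times as much as a false positive, a monitor can afford many false alarms before it becomes more expensive than missing a single error, which is why the optimal configuration shifts with the cost ratio and base rate.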

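2605.22447 2026-05-22 cs.CL (version update)

Cohesion-6K: An Arabic Dataset for Analyzing Social Cohesion and Conflict in Online Discourse

Aisha Ali Al-Athba, Wajdi Zaghouani

Affiliations: Hamad Bin Khalifa University; Northwestern University in Qatar

AI summary: This study uses the Cohesion-6K dataset to examine social cohesion and conflict in online discourse. A five-category discourse scheme captures the balance between conflict and cohesion, and the publicly released resources support future research in computational social science, digital communication, and Arabic natural language processing.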

Abstract

The study of online discourse has become central to understanding societal polarization. While much research has focused on detecting overt toxicity, the subtle dynamics of social cohesion, meaning the interaction between divisive and unifying narratives, remain computationally underexplored (Bail, 2021; Gonzalez-Bailon and Lelkes, 2023). This paper presents Cohesion-6K, a dataset of six thousand Arabic public Facebook posts related to the Israeli Occupation of Palestine, annotated manually with ChatGPT assistance. Each post is assigned to one of five discourse categories that represent a continuum from conflict to cohesion: Conflict, Resolution, Community Engagement, Supportive Interactions, and Shared Values. The annotation process combines expert human judgment with model-assisted pre-labeling verified by trained annotators, achieving substantial inter-annotator agreement (Cohen's kappa = 0.85). Quantitative analysis reveals a consistent engagement gap, where conflict-oriented posts receive between two and four times more user interaction than resolution-oriented ones (p < 0.01). This pattern illustrates how divisive discourse tends to attract disproportionate visibility in Arabic social media spaces. Cohesion-6K provides a transparent and reproducible resource for the study of online cohesion and polarization. The dataset, annotation guidelines, and preprocessing code will be released for research use under an open license, supporting future work in computational social science, digital communication, and Arabic natural language processing.
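Cohen's kappa, the agreement statistic quoted above, corrects raw agreement for the agreement two annotators would reach by chance. A minimal sketch for two annotators follows; the label sequences are hypothetical toy data that reuse the paper's category names, and the function name is invented for this example.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same category
    # if each labels independently according to their own label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators for six posts.
annotator_1 = ["Conflict", "Conflict", "Resolution", "Shared Values", "Resolution", "Conflict"]
annotator_2 = ["Conflict", "Conflict", "Resolution", "Shared Values", "Conflict", "Conflict"]
print(cohens_kappa(annotator_1, annotator_2))
```

In this toy case the annotators agree on five of six posts (about 83%), yet kappa is noticeably lower because agreement on the dominant Conflict category is partly expected by chance.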

2605.22435 2026-05-22 cs.CL 版本更新

Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation

在仇恨言论与虚假信息交汇处的辅助反制言论写作

Genoveffa Martone, Helena Bonaldi, Marco Guerini

发表机构 * Fondazione Bruno Kessler（布鲁诺·克塞勒基金会）； Università Cattolica del Sacro Cuore（天主教圣心大学）

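AI summary: This paper studies how large language models can assist expert counterspeech writing in contexts where hate speech and misinformation co-occur. It uses three knowledge-driven generation strategies that draw on fact-checking material and NGO guidelines to improve the quality and effectiveness of counterspeech.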

AI中文摘要

仇恨言论和虚假信息经常在网上同时出现，加剧偏见和极化。鉴于其规模，使用大型语言模型（LLMs）辅助专家撰写反制言论（counterspeech，CS）引起了关注，但以往的研究将这两种现象分开处理。我们通过研究仇恨言论与虚假信息共存情境下的反制言论生成来填补这一空白。我们测试了三种知识驱动的生成策略：第一，用事实核查员的指南和事实核查文章提示LLM；第二，用非政府组织的指南和报告提示LLM；第三，构建一种混合策略，结合来自两方面的指南和文档。23名专家对生成的反制言论进行了修改，这些文本通过人工和自动指标进行评估。尽管LLM在40%的情况下生成了合格的反制言论，但专家的编辑显著提高了自然性、全面性和对指南的遵循程度。基于编辑后的反制言论，混合策略在众包评估中被证明最为有效，兼具有力的事实纠正、刻板印象缓解和共情式互动。我们发布了一个数据集，包含带有仇恨和虚假信息的言论，并附有经专家验证的反制言论和支撑知识。

英文摘要

Hate speech and misinformation frequently co-occur online, amplifying prejudice and polarization. Given their scale, using Large Language Models (LLMs) to assist expert counterspeech (CS) writing has gained interest, yet prior work has addressed these phenomena separately. We bridge this gap by studying CS generation in contexts where both hate and misinformation co-occur. We test three knowledge-driven generation strategies: first we prompt an LLM with fact-checkers' guidelines and fact-checking articles; secondly, with NGOs' guidelines and reports; thirdly, we create a mixed strategy that combines guidelines and documents from both. 23 experts revise the generated CS, which are assessed via human and automatic metrics. While LLMs produce adequate CS in 40% of cases, expert edits substantially improve naturalness, exhaustiveness, and adherence to guidelines. Based on the post-edited CS, the mixed strategy proves to be the most effective in crowdsourcing evaluation, pairing strong factual correction with stereotype mitigation and empathetic engagement. We release a dataset of hateful and misinformed claims with expert-verified CS and supporting knowledge.

2605.22411 2026-05-22 cs.CL cs.AI cs.LG 版本更新

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

DeferMem: 通过强化学习进行长时记忆问答的查询时证据蒸馏

Jianing Yin, Tan Tang

发表机构 * State Key Lab of CAD&CG（计算机辅助设计与图形学国家重点实验室）

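AI summary: This paper proposes DeferMem, a long-term memory framework that decouples the problem into high-recall candidate retrieval and query-conditioned evidence distillation to improve the accuracy and efficiency of long-term memory question answering.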

Comments 31 pages, 3 figures

AI中文摘要

大型语言模型（LLM）智能体在长时记忆问答任务中仍面临挑战，因为支持答案的证据通常分散在很长的对话历史中，并被大量无关内容掩盖。现有记忆系统通常在未来查询尚未知晓时就处理记忆，然后根据相似性而非对回答查询的效用来检索处理后得到的单元。这种工作流程迫使下游回答者自行对检索到的候选进行去噪，并重建针对查询的证据。我们提出了DeferMem，一种长时记忆框架，将该问题解耦为高召回候选检索和以查询为条件的证据蒸馏。DeferMem使用轻量级的段链接结构来组织原始历史，并在查询时检索范围较广的候选。随后，它应用一个由DistillPO训练的记忆蒸馏器；DistillPO是我们提出的强化学习算法，用于将高召回但噪声很大的候选蒸馏成一组忠实、自包含且以查询为条件的证据。DistillPO将检索后的证据蒸馏表述为一个结构化动作，由消息选择和证据重写组成。它通过分解并门控的奖励管道和结构对齐的优势分配来优化该动作：奖励组件按从有效性检查到质量检查的顺序逐级门控，同时尽早暴露任务级的正确性反馈，并将每项奖励分配给对其负责的输出片段。在LoCoMo和LongMemEval-S上，DeferMem在问答准确率和记忆系统效率上均超过了强基线，以最快的运行时间取得了最高的问答准确率，且记忆操作的商业API令牌成本为零。

英文摘要

Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into a set of faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.

2605.22391 2026-05-22 cs.AI cs.CL cs.CY 版本更新

Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings

Epicure：探索食品成分嵌入的涌现几何

Jakub Radzikowski, Josef Chen

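AI summary: This paper presents Epicure, a family of three sibling skip-gram ingredient embedding models retrained from scratch on a multilingual recipe corpus. The embeddings cover 1,790 canonical ingredients, and the three models differ only in their random-walk schema, which gives each a different emphasis.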

AI中文摘要

我们提出了Epicure，一种由三个兄弟skip-gram成分嵌入模型组成的家族，这些模型是从多语言食谱语料库中从头开始重新训练的。我们汇总了来自11个来源的414万条食谱，涵盖七种语言：英语、中文、俄语、越南语、西班牙语、土耳其语、印度尼西亚语、德语和印度英语，并通过一个增强语言模型的流程将原始成分字符串标准化为1790个标准条目。一个包含203,508条边的成分-成分NPMI图和一个包含80,019条边的带类型FlavorDB成分-化合物图，以及2,247个带类型化合物节点跨越15个类别，为三种共享架构和超参数但仅在随机游走方案上不同的Metapath2Vec变体提供了基础：Cooc仅在共现图上行走，Chem仅在带类型化合物元路径上行走，Core则通过注入的成分-成分行走进行混合，在可控混合下，将每个模型置于化学与食谱上下文的谱线上不同的位置。

英文摘要

We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages, English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English, and normalise the raw ingredient strings to 1,790 canonical entries via an LLM-augmented pipeline. A 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph, 2,247 typed compound nodes across 15 categories, seed three Metapath2Vec variants that share architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the co-occurrence graph only, Chem walks the typed compound metapaths only, and Core blends both via injected ingredient-ingredient walks at controlled mixing, placing each model at a distinct point on the chemistry-vs-recipe-context spectrum.
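
The ingredient-ingredient graph described above is weighted by normalized pointwise mutual information (NPMI). As a rough illustration of how such edge weights could be computed from recipe co-occurrence counts, here is a minimal sketch. The function name `npmi_edges`, the toy recipes, and the choice to treat each recipe as a set of ingredients are assumptions; the abstract does not describe the authors' actual pipeline.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_edges(recipes):
    """Return {(a, b): NPMI} for every ingredient pair that co-occurs."""
    n = len(recipes)
    single, pair = Counter(), Counter()
    for recipe in recipes:
        items = sorted(set(recipe))  # count each ingredient once per recipe
        single.update(items)
        pair.update(combinations(items, 2))
    edges = {}
    for (a, b), c_ab in pair.items():
        p_ab = c_ab / n
        if p_ab == 1.0:
            edges[(a, b)] = 1.0  # pair occurs in every recipe (limit value)
            continue
        pmi = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
        edges[(a, b)] = pmi / -math.log(p_ab)  # normalize into [-1, 1]
    return edges


# Toy corpus (hypothetical).
recipes = [
    ["tomato", "basil"],
    ["tomato", "basil", "garlic"],
    ["garlic", "onion"],
    ["onion", "tomato"],
]
edges = npmi_edges(recipes)
print(round(edges[("basil", "tomato")], 3))
```

NPMI is positive when two ingredients appear together more often than chance predicts and negative when they appear together less often. A graph builder would typically keep only edges above some threshold, which is a design choice the abstract leaves unspecified.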

2605.22389 2026-05-22 cs.CL 版本更新

Unified Data Selection for LLM Reasoning

统一的数据选择用于LLM推理

Xiaoyuan Li, Yubo Ma, Chengpeng Li, Fengbin Zhu, Yiyao Yu, Keqin Bao, Wenjie Wang, Fuli Feng, Dayiheng Liu

发表机构 * University of Science and Technology of China（中国科学技术大学）； Alibaba Group（阿里巴巴集团）； National University of Singapore（新加坡国立大学）

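AI summary: This paper proposes High-Entropy Sum (HES), a training-free metric for assessing and selecting high-quality reasoning samples. Validation across three mainstream training paradigms (supervised fine-tuning, rejection fine-tuning, and reinforcement learning) shows that it improves LLM reasoning performance while reducing computational overhead.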

Comments Under Review

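AI abstract (translated from Chinese)

Effectively training large language models (LLMs) for complex, long chain-of-thought reasoning is often constrained by the need for large amounts of high-quality reasoning data. Existing methods are either computationally expensive or cannot reliably distinguish high-quality from low-quality reasoning samples. To address this, we propose High-Entropy Sum (HES), a training-free metric that quantifies reasoning quality by summing the entropy of only the highest-entropy tokens (e.g., the top 0.5%) in each reasoning sample. We validate HES across three mainstream training paradigms: supervised fine-tuning (SFT), rejection fine-tuning (RFT), and reinforcement learning (RL). The results show consistent gains and substantially reduced computational overhead. In SFT, training on the top 20% of data ranked by HES matches the performance of the full dataset, whereas training on the lowest-HES data degrades it. In RFT, our HES-based training method significantly outperforms baseline methods. In RL, HES-selected successful trajectories enable the model to learn strong reasoning patterns, significantly surpassing the compared methods. Our findings establish HES as a robust, training-free metric that makes the development of advanced LLM reasoning capabilities unified, effective, and efficient.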
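
The HES metric is simple enough to sketch from the description alone: compute a per-token entropy, keep only the highest-entropy fraction of tokens, and sum. The code below is a minimal illustration under stated assumptions: the function names, the rounding rule (round up, keep at least one token), and the toy inputs are hypothetical, since the abstract does not give these details.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def high_entropy_sum(token_entropies, top_fraction=0.005):
    """Sum the entropies of the top `top_fraction` highest-entropy tokens."""
    if not token_entropies:
        return 0.0
    # Assumption: round up and always keep at least one token.
    k = max(1, math.ceil(len(token_entropies) * top_fraction))
    return sum(sorted(token_entropies, reverse=True)[:k])


def select_top_samples(samples, keep_fraction=0.2):
    """Rank samples (id -> per-token entropies) by HES and keep the top share."""
    ranked = sorted(samples, key=lambda sid: high_entropy_sum(samples[sid]), reverse=True)
    k = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:k]


# Toy per-token entropies for four reasoning samples (hypothetical values).
samples = {"a": [0.1, 0.2], "b": [2.0, 0.1], "c": [0.5, 0.4], "d": [1.0]}
kept = select_top_samples(samples, keep_fraction=0.5)
print(kept)
```

In practice the per-token entropies would come from the model's own output distributions over a reasoning trace, so the metric requires only forward passes and no additional training.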

Pattern-and-Root Inflectional Morphology: The Arabic Broken Plural

Alexis Amid Neme, Eric Laporte

发表机构 * LIGM, Université Paris-Est（LIGM，巴黎东部大学）； DLL, Universidade Federal do Espírito Santo（DLL，圣埃斯皮里图联邦大学）

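AI summary: This paper proposes a model for describing the inflectional morphology of Arabic nouns, with a focus on how Arabic-speaking linguists manage dictionaries and other language resources. Its breakthrough is to reverse the traditional root-and-pattern Semitic model into a pattern-and-root model that gives precedence to patterns. The model covers broken plurals (BPs), i.e. plurals formed by modifying the stem, and builds on the traditional Semitic notions of root and pattern. Unlike traditional Arabic morphology, it keeps the formal description of inflection separate from derivation and semantics. As in traditional Arabic dictionaries, the updatable dictionary is organized into lexical entries for lemmas, and the reference spelling is fully diacritized. Morphological analysis of Arabic text is performed directly with the dictionary, without morphophonological rules. The taxonomy of noun inflection is simple, orderly, and detailed; the classification of singular patterns is simplified by specifying vowel quantity as v or vv and ignoring vowel quality. Root alternations and orthographical variations are encoded factually and independently of patterns, without deep roots or morphophonological or orthographical rules. Nouns with a triliteral BP are classified according to 22 patterns subdivided into 90 classes, and nouns with a quadriliteral BP according to 3 patterns subdivided into 70 classes. These 160 classes become 300 inflectional classes once inflectional variations that affect only the singular are taken into account. The paper provides a straightforward encoding scheme, applied to 3,200 entries of BP nouns.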

DOI: 10.1016/j.langsci.2013.06.002
Journal ref: Language Sciences, 2013, 40, pp.221-250

AI中文摘要

我们提出了一个已基本实现的模型，用于描述阿拉伯语名词的屈折形态，并特别关注讲阿拉伯语的语言学家对词典和其他语言资源的管理。其突破在于将传统的闪米特语“根-词型”模型反转为“词型-根”模型，使词型优先于词根。我们的模型涵盖破碎复数（BPs），即通过改变词干构成的复数。它基于闪米特语形态学中传统的词根和词型概念。然而，与传统阿拉伯语形态学相比，它将屈折的形式化描述与派生和语义的描述分开。与传统阿拉伯语词典一样，这部可更新的词典按词元（lemma）的词条组织，参考拼写带有完整的变音符号。在我们的模型中，阿拉伯语文本的形态分析直接借助词表词典完成，无需形态音位规则。我们的名词屈折分类体系简单、有序且详细。我们将元音长度标为v或vv并忽略元音音质，从而简化了单数词型的分类。词根交替和正字法变体独立于词型、以如实记录的方式编码，不涉及深层词根，也不涉及形态音位规则或正字法规则。带有三辅音破碎复数的名词按22个词型划分为90个类别，带有四辅音破碎复数的名词按3个词型划分为70个类别。若将仅影响单数的屈折变化考虑在内，这160个类别将变为300个屈折类别。我们提供了一种简明的编码方案，并已将其应用于3200个破碎复数名词条目。

英文摘要

We present a substantially implemented model for describing the inflectional morphology of Arabic nouns, with special attention to the management of dictionaries and other language resources by Arabic-speaking linguists. The breakthrough lies in the reversal of the traditional root-and-pattern Semitic model into pattern-and-root, giving precedence to patterns over roots. Our model includes broken plurals (BPs), i.e. plurals formed by modifying the stem. It is based on the traditional notions of root and pattern of Semitic morphology. However, as compared to traditional Arabic morphology, it keeps the formal description of inflection separate from that of derivation and semantics. As in traditional Arabic dictionaries, the updatable dictionary is structured in lexical entries for lemmas, and the reference spelling is fully diacritized. In our model, morphological analysis of Arabic text is performed directly with a dictionary of words and without morphophonological rules. Our taxonomy for noun inflection is simple, orderly and detailed. We simplify the taxonomy of singular patterns by specifying vowel quantity as v or vv, and ignoring vowel quality. Root alternations and orthographical variations are encoded independently from patterns and in a factual way, without deep roots or morphophonological or orthographical rules. Nouns with a triliteral BP are classified according to 22 patterns subdivided into 90 classes, and nouns with a quadriliteral BP according to 3 patterns subdivided into 70 classes. These 160 classes become 300 inflectional classes when we take into account inflectional variations that affect only the singular. We provide a straightforward encoding scheme that we applied to 3,200 entries of BP nouns.

2605.22258 2026-05-22 cs.CL 版本更新

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

更难防御：通过隐含增强和模糊重写实现中文毒性攻击

Jingyi Kang, Junyu Lu, Bo Xu, Hongbo Wang, Linlin Zong, Roy Ka-Wei Lee, Hongfei Lin

发表机构 * Dalian University of Technology（大连理工大学）； The University of Tokyo（东京大学）； Singapore University of Technology and Design（新加坡科技设计大学）

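AI summary: This study proposes CITA, a framework for Chinese toxicity attacks that generates attack samples through implicit enhancement and obfuscation rewriting. It exposes the shortcomings of existing detectors in recognizing implicitly toxic content and shows that training a defense model on such data improves robustness.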

Comments 16 pages, 5 figures

AI中文摘要

大型语言模型（LLMs）需要超越显式措辞的稳健毒性评估。这一设定在中文中仍未得到充分探索，而中文毒性可能将语义上的间接性与表层混淆结合起来。我们提出了中文隐式毒性攻击（CITA），它是一个受控的红队评估与防御数据生成框架，而非可部署的规避工具。CITA包含三个阶段：（i）有害意图学习，（ii）隐式毒性增强，以及（iii）混淆变体重写，用以保留有害意图、提高隐含性并加入受控的表层变体。在CITA生成的评估样本上，七个受测检测器表现出显著的漏检风险，平均ASR达到69.48%；人工评估进一步证实了有害性得以保留，且隐含性/规避性有所提高。作为下游防御应用，我们使用CITA生成的红队数据微调了中文隐式毒性防御模型（CITD），表明此类数据可通过额外训练提高鲁棒性。

英文摘要

Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation and defense-data generation framework, not a deployable evasion tool. CITA uses three stages: (i) Harmful Intent Learning, (ii) Implicit Toxicity Enhancement, and (iii) Obfuscation Variant Rewriting, to preserve harmful intent, increase implicitness, and add controlled surface variants. On CITA-generated evaluation samples, the seven tested detectors exhibit substantial missed-detection risks, reaching an average ASR of 69.48%; human evaluation further confirms preserved harmfulness and increased implicitness/evasiveness. As a downstream defense application, we fine-tune a Chinese Implicit Toxicity Defense model (CITD) with CITA-generated red-team data, showing that such data can improve robustness through additional training.

2605.22247 2026-05-22 cs.CL 版本更新

IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions

IdioLink：在习语表达与字面表达之间检索超越词语的意义

Kai Golan Hashiloni, Daniel Fadlon, Lior Livyatan, Ofri Hefetz, Jiahuan Pei, Kfir Bar

发表机构 * Data Science Institute, Reichman University（雷赫曼大学数据科学学院）； Efi Arazi School of Computer Science, Reichman University（雷赫曼大学埃菲·阿拉兹计算机科学学院）； Vrije Universiteit Amsterdam（阿姆斯特丹自由大学）

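AI summary: This paper introduces IdioLink, a retrieval benchmark that tests whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased forms, and it exposes the shortcomings of current models in idiom-aware semantic retrieval.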

AI中文摘要

习语对语言模型构成了根本性挑战，因为其含义无法仅从表层形式推断。因此，理解此类表达需要超越词汇重叠的语义抽象。我们提出了IdioLink，一个检索基准，用于测试模型能否将习语表达与以字面或改写形式表达的概念等价意义联系起来。IdioLink包含10,700个文档和2,140个查询，涵盖107个兼具字面用法和比喻用法的习语。每个文档和查询都标注了传达核心意义的片段。通过评估强嵌入基线（如BGE、E5、Contriever和Qwen），我们发现当前模型难以在差异较大的表层实现之间检索等价意义，而是依赖主题线索和浅层语义线索。IdioLink揭示了习语感知语义检索中的关键缺口，并为未来模型提供了一个具有挑战性的测试平台。

英文摘要

Idioms pose a fundamental challenge for language models, as their meaning cannot be inferred from surface form alone. Understanding such expressions, therefore, requires semantic abstraction beyond lexical overlap. We introduce IdioLink, a retrieval benchmark designed to test whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased forms. IdioLink comprises 10,700 documents and 2,140 queries, spanning 107 idioms with both literal and figurative uses. Each document and query is annotated with spans that convey the core meaning. Evaluating strong embedding baselines (e.g., BGE, E5, Contriever, and Qwen), we show that current models struggle to retrieve equivalent meanings across divergent surface realizations, relying instead on topical and shallow semantic cues. IdioLink exposes key gaps in idiom-aware semantic retrieval and provides a challenging testbed for future models.

2605.22228 2026-05-22 cs.CL 版本更新

GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis

GHI：基于条件超图关联的Graphormer，用于基于方面的情感分析

Yu Du, Wenlong Zhu, Xingze Li, Chenglong Cao, Jing Wang, Yukun Ma

发表机构 * Qiqihar University（齐齐哈尔大学）

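AI summary: This paper proposes the GHI framework, an incidence-based structural reasoning layer built on a bipartite topology that handles the different structural signals in aspect-based sentiment analysis through a unified interface. Experiments show that GHI outperforms existing methods on several standard benchmarks and performs well with comparatively few parameters.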

Comments 15 pages, 8 figures, 7 tables

AI中文摘要

基于方面的情感分析（ABSA）要求模型将情感证据绑定到正确的方面，因而成为细粒度结构推理的天然试验场。我们提出了GHI，一个基于条件超图关联（incidence）的Graphormer框架，它被设计为建立在二部拓扑之上、以关联关系为基础的结构推理层。GHI将多样的语言学和语义证据表示为token-超边关联关系，使不同的结构信号可以通过统一接口纳入。在六个标准ABSA基准上的大量实验表明，GHI在SemEval各领域上优于所有基线，多随机种子评估显示其相对强DeBERTa基线有稳定提升。进一步的实验表明，仅用247M参数，GHI在ISE基准上就接近基于11B Flan-T5的方法的性能。此外，GHI在具有挑战性的ARTS数据集上表现出很强的鲁棒性，在传统模型性能退化的情况下仍保持了很有竞争力的性能。这些结果表明，对于细粒度任务，紧凑的结构推理仍是规模驱动方法之外一种有价值的替代方案。

英文摘要

Aspect-based sentiment analysis (ABSA) requires models to bind sentiment evidence to the correct aspect, making it a natural testbed for fine-grained structural reasoning. We introduce GHI, a Graphormer-over-Conditioned-Hypergraph-Incidence framework that is designed as an incidence-based structural reasoning layer built on a bipartite topology. GHI represents diverse linguistic and semantic evidence as token--hyperedge incidence relations, allowing different structural signals to be incorporated through a unified interface. Extensive experiments on six standard ABSA benchmarks show that GHI outperforms all baselines on the SemEval domains, and multi-seed evaluations show stable improvements over strong DeBERTa. Further experiments show that with only 247M parameters, GHI approaches the performance of 11B Flan-T5 based methods on the ISE benchmark. Moreover, it demonstrates strong robustness on the challenging ARTS datasets, maintaining highly competitive performance where traditional models degrade. These results demonstrate that compact structural reasoning remains a valuable alternative to scale-driven approaches for fine-grained tasks.

2605.22217 2026-05-22 cs.LG cs.CL 版本更新

Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

生存或崩溃：自我博弈强化学习中数据门控与奖励基础的不对称作用

Sophia Xiao Pu, Zhaotian Weng, Chengzhi Liu, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Xin Eric Wang

发表机构 * University of California, Santa Barbara（加州大学圣巴巴拉分校）； Cisco Research（思科研究）

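AI summary: This paper studies the asymmetric roles of data gating and reward grounding in self-play reinforcement learning. It finds that data gating is the key factor in maintaining stability, that no reward signal alone guarantees stability once the gate is removed, and it identifies the "Grounded Proposer Paradox".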

AI中文摘要

自我博弈强化学习让语言模型在自己生成的任务上训练，使提出者与求解者共同进化，无需人工标注。近期的系统报告了显著的推理提升，但崩溃和不稳定现象普遍存在，且尚未得到充分理解。主流的应对方式将其视为奖励设计问题。我们则认为，自我博弈的稳定性由两个不同的调节杠杆决定：数据层面的门控，决定哪些由提出者生成的任务进入训练池；以及奖励信号，在已准入的任务上更新策略。通过在Python输出预测任务和一个确定性DSL孪生任务（该任务剔除了预训练先验、输出歧义和执行器噪声）上的受控实验，我们发现这两个杠杆是不对称的。严格的门控在我们测试的每种奖励变体下都足以保证稳定，包括无法访问真实标签的自一致性奖励；而一旦移除门控，没有任何奖励变体足以保证稳定。这种不对称性揭示了一种反直觉的耦合，我们称之为“有依据提出者悖论”（Grounded Proposer Paradox）：与自一致性求解者配对时，能够访问真实标签的提出者比没有依据的提出者更快地加速崩溃，因为它把训练集中在干净的任务上，而这些任务构成了通往虚假自一致吸引子的最快路径。将二元门控替换为连续的严格度参数$\varepsilon$后，进一步揭示了两阶段相变：训练侧指标在较低的$\varepsilon$处就发生解耦，而验证准确率一直保持到$\varepsilon$高得多时。数据层面的门控，而非奖励校准，才是自我博弈稳定性的约束性条件。

英文摘要

Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels. Recent systems report strong reasoning gains, but collapse and instability are widely observed and poorly understood. The dominant response treats this as a reward-design problem. We argue instead that self-play stability is governed by two distinct levers: a data-level gate that decides which proposer-generated tasks enter the training pool, and the reward signal that updates the policy on tasks already admitted. Through controlled experiments on a Python output-prediction task and a deterministic-DSL twin task that strips pretraining priors, output ambiguity, and executor noise, we find the two levers are asymmetric. A strict gate is sufficient for stability under every reward variant we test, including a self-consistency reward with no access to ground truth; while no reward variant is sufficient once the gate is removed. This asymmetry exposes a counter-intuitive coupling we call the Grounded Proposer Paradox: a proposer with ground-truth access accelerates collapse faster than an ungrounded one when paired with a self-consistency solver, by concentrating training on clean tasks that form the fastest path to a spurious self-consistent attractor. Replacing the binary gate with a continuous strictness parameter $\varepsilon$ further reveals a two-stage phase transition: training-side metrics decouple at low $\varepsilon$, while validation accuracy holds until $\varepsilon$ is much higher. Data-level gating, not reward calibration, is the binding constraint on self-play stability.

2605.22204 2026-05-22 cs.CL 版本更新

Audience Engagement with Arabic Women's Social Empowerment and Wellbeing: A Decadal Corpus

阿拉伯女性社会赋权与福祉的受众参与：一个十年语料库

Wajdi Zaghouani, Mabrouka Bessghaier, MD. Rafiul Biswas, Shimaa Amer Ibrahim

发表机构 * Northwestern University in Qatar（卡塔尔西北大学）； Hamad bin Khalifa University（哈马德·本·哈利法大学）

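AI summary: This paper presents a corpus on Arab women and society that contains 252,487 public Arabic-language Facebook posts from 2013 to 2024 on women's empowerment and social wellbeing. Processed through an automated pipeline, it supports large-scale analysis of gender discourse, social reform, and emotional engagement across Arabic dialects.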

2605.22203 2026-05-22 cs.CL 版本更新

Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

对低资源语言农业文档中有效文本嵌入的分块策略评估

Sovandara Chhoun, Pichdara Po, Sereiwathna Ros, Wan-Sup Cho, Saksonita Khoeurn

发表机构 * Department of Big Data, Chungbuk National University, Cheongju-si, South Korea（忠北国立大学大数据系，韩国清州市）； Department of Computer Science, Chungbuk National University, Cheongju-si, South Korea（忠北国立大学计算机科学系，韩国清州市）； BigDataLabs Co., Ltd. Department of Management Information Systems, Chungbuk National University, South Korea（BigDataLabs公司；忠北国立大学管理信息系统系，韩国）

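AI summary: This study compares four text chunking methods on Khmer agricultural documents, evaluating within a retrieval-augmented generation (RAG) framework how the chunking strategy affects dense retrieval, and finds that character-based recursive chunking performs best for this low-resource language.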

Comments 11 pages, 1 figure

AI中文摘要

在本研究中，我们比较了四种文本分块方法：递归、Khmer-aware、基于句子和基于大语言模型（LLM）在检索增强生成（RAG）框架中应用于Khmer农业文档的性能。文档分块使用BGE-M3多语言嵌入模型进行编码，并使用FAISS库进行检索。性能通过四个指标评估：平均检索分数（L2距离）、答案相关性、Khmer覆盖率和Khmer交并比（IoU），均基于真实问题-答案对进行测量。在评估中，我们对18个问题-答案对进行了5折交叉验证。我们观察到基于字符的递归分块方法在分块大小为300字符时表现最佳，实现了最低的L2距离（0.4295 ± 0.0461）、最高的答案相关性（0.8663 ± 0.0199）和最高的Khmer IoU（0.6441 ± 0.0347）。配对t检验显示在L2距离上，与基于句子的分块方法相比有统计学显著改进（p = 0.0121）。这些结果突显了分块粒度和结构保持在优化形态复杂、低资源语言如Khmer的密集检索中的重要性。

英文摘要

In this study, we compare the performance of four text chunking approaches: Recursive, Khmer-Aware, Sentence-Based, and LLM-Based within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 multilingual embedding model and retrieved using the FAISS library. Performance is evaluated using four metrics: Average Retrieval Score (L2 distance), Answer Relevance, Khmer Coverage, and Khmer Intersection over Union, all measured against ground-truth question-answer pairs. For evaluation, we perform 5-fold cross-validation over 18 question-answer pairs. We observe the best performance for the character-based Recursive chunking method with a chunk size of 300 characters, achieving the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347). A paired t-test shows a statistically significant improvement over the Sentence-Based chunking method in L2 distance (p = 0.0121). These results highlight the importance of segmentation granularity and structural preservation for optimizing dense retrieval in morphologically complex, low-resource languages such as Khmer.
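
To make the best-performing method concrete, the following is a simplified sketch of character-based recursive chunking with a 300-character limit. It is not the implementation used in the study, which the abstract does not describe: the function name `recursive_chunks`, the separator order, and the merging rule are all assumptions chosen for illustration.

```python
def recursive_chunks(text, chunk_size=300, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most `chunk_size` characters.

    Coarse separators are tried first; a piece that is still too long is
    split again with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_chunks(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


# Toy input (hypothetical): 200 space-separated words, 999 characters.
text = " ".join(["word"] * 200)
chunks = recursive_chunks(text, chunk_size=300)
print([len(c) for c in chunks])
```

Because Khmer is normally written without spaces between words, a splitter like this would fall back to paragraph breaks, line breaks, and hard character cuts more often than it would for English. This may be one reason why chunk size matters so much in this setting.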

2605.22202 2026-05-22 cs.CL 版本更新

Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance

嵌入空间中的结构保留作为基准性能预测因子

Amanda Myntti, Jenna Kanerva, Veronika Laippala, Filip Ginter

发表机构 * TurkuNLP, University of Turku, Finland（图尔库大学TurkuNLP实验室，图尔库大学，芬兰）； ELLIS Institute Finland（芬兰ELLIS研究所）

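AI summary: This paper studies whether high-performing embedding models organize their embedding spaces consistently. Evaluating 25 modern embedding models on five MTEB tasks, it finds that nearest-neighbor overlap and the magnitude differences between paired text instances under independent component analysis (ICA) correlate strongly with task performance, and that embedding tasks differ in linearity and in how much they depend on retained local information.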
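
One of the two signals named in this summary, nearest-neighbor overlap, can be illustrated with a short sketch. The version below compares the top-k cosine neighbors of each text in two embedding spaces and averages the shared fraction. This definition, the function names, and the toy vectors are assumptions; the summary does not say how the paper computes the overlap.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn(vectors, index, k):
    """Indices of the k nearest neighbors of vectors[index] by cosine."""
    scored = [(cosine(vectors[index], v), j) for j, v in enumerate(vectors) if j != index]
    scored.sort(key=lambda t: (-t[0], t[1]))  # best first, ties by index
    return {j for _, j in scored[:k]}


def neighbor_overlap(space_a, space_b, k):
    """Mean fraction of shared k-nearest neighbors across two spaces."""
    n = len(space_a)
    shared = sum(len(knn(space_a, i, k) & knn(space_b, i, k)) / k for i in range(n))
    return shared / n


# Toy embeddings of the same four texts from two hypothetical models.
space_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
space_b = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
print(neighbor_overlap(space_a, space_a, k=1), neighbor_overlap(space_a, space_b, k=1))
```

A value near 1 means the two models place the same texts close together, while a value near 0 means their local neighborhoods disagree.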

2605.22177 2026-05-22 cs.LG cs.CL 版本更新

Psy-Chronicle: A Structured Pipeline for Synthesizing Long-Horizon Campus Psychological Counseling Dialogues

Chaogui Gou, Jiarui Liang

发表机构 * University of Science and Technology Beijing（北京科技大学）

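AI summary: This paper proposes Psy-Chronicle, a structured data-generation framework for synthesizing long-horizon campus psychological counseling dialogues. By generating a semester-spanning temporal stress event graph and simulating interactions between a student agent and a counselor agent, it builds the CPCD dataset of 100 student profiles and 90,000 dialogues, and evaluates models' long-horizon campus counseling capabilities with CPCD-Bench. Experiments show that CPCD improves session-level response generation and long-horizon memory recall.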

AI中文摘要

近年来，大语言模型在心理支持任务中展现出显著潜力。然而，现有心理辅导数据大多依赖于单轮问答或短多轮对话，难以刻画大学生心理困扰在校园生活事件中如何积累、交互并逐渐演变的长期过程。为解决这一问题，本文提出Psy-Chronicle，一种结构化数据生成框架，用于合成长周期校园心理辅导对话。我们生成一个跨越学期的时间压力事件图，以建模校园压力事件之间的时序顺序和演变依赖关系。通过学生代理与辅导员代理之间的交互模拟，以及结构化记忆整合机制，Psy-Chronicle生成具有连续性且跨越多个辅导会话的长周期对话。基于Psy-Chronicle，我们构建并开源了CPCD，一个中文长周期对话数据集，包含100个学生档案和90,000条辅导对话。我们进一步构建CPCD-Bench，从三个维度评估模型的长周期校园辅导能力：会话级响应、长周期记忆召回和时间因果推理。实验结果表明，CPCD有效提升了具有相同基础架构的模型的会话级响应生成和长周期记忆召回能力。同时，时间因果推理的改进仍有限，表明事件链组织和因果解释是长周期心理辅导建模中的关键挑战。相关代码和数据可在：https://github.com/EdwinUSTB/Psy-Chronicle 获取。

英文摘要

In recent years, large language models have shown substantial potential in psychological support tasks. However, existing psychological counseling data mostly rely on single-turn question answering or short multi-turn dialogues, making it difficult to characterize how college students' psychological distress accumulates, interacts, and gradually evolves over long periods within campus life events. To address this issue, this paper proposes Psy-Chronicle, a structured data-generation framework for synthesizing long-horizon campus psychological counseling dialogues. We generate a semester-spanning temporal stress event graph to model the chronological order and evolutionary dependencies among campus stress events. Through interactive simulation between a student agent and a counselor agent, together with a structured memory integration mechanism, Psy-Chronicle generates long-horizon dialogues with continuity across counseling sessions. Based on Psy-Chronicle, we construct and open-source CPCD, a Chinese long-horizon dialogue dataset for college psychological counseling, containing 100 student profiles, 90,000 counseling dialogues. We further build CPCD-Bench to evaluate models' long-horizon campus counseling capabilities from three dimensions: session-level response, long-horizon memory recall, and temporal-causal reasoning. Experimental results show that CPCD effectively improves session-level response generation and long-horizon memory recall for models with the same base architecture. Meanwhile, improvements in temporal-causal reasoning remain limited, indicating that event-chain organization and causal explanation are key challenges in long-horizon psychological counseling modeling. The related code and data are available at: https://github.com/EdwinUSTB/Psy-Chronicle

2605.22138 2026-05-22 cs.AI cs.CL cs.LG cs.RO 版本更新

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

通过自我调节模拟规划实现高效的代理推理

Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing

发表机构 * Institute of Foundation Models (IFM)（基础模型研究所）； Carnegie Mellon University（卡内基梅隆大学）

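AI summary: This paper proposes improving the efficiency of agentic reasoning by decomposing decision-making into three systems (simulative reasoning, self-regulation, and reactive execution), and reports the performance of the SR$^2$AM model across different tasks.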

Comments Code and model artifacts are available at https://github.com/sailing-lab/sr2am

AI中文摘要

智能体应如何决定何时规划、如何规划？一种主流方法将智能体构建为带有自适应计算（例如思维链）的反应式策略，并进行端到端训练，期望规划能力隐式涌现。由于无法控制规划是否出现、其结构或视野，这些系统大幅增加了推理长度，导致令牌使用效率低下，却没有带来可靠的准确率提升。我们认为，将决策分解为三个系统有助于实现高效的智能体推理：模拟推理（系统II）通过世界模型将思考建立在对未来状态的预测之上；自我调节（系统III）通过学习得到的配置器决定何时规划以及规划多深；反应执行（系统I）负责细粒度动作。模拟推理可在多样的任务中提供统一的规划，无需针对每个领域进行工程设计，而自我调节确保仅在需要时才调用规划器。为检验这一点，我们开发了SR$^2$AM（Self-Regulated Simulative Reasoning Agentic LLM），在LLM的思维链中将这两者实现为不同的阶段，并以LLM本身作为世界模型。我们探索了两种实例化方式：记录一个基于提示的多模块系统的决策（v0.1），以及从预训练推理LLM的轨迹中重建结构化计划（v1.0）；模型先经监督学习、再经强化学习（RL）训练。在数学、科学、表格分析和网络信息检索任务上，v0.1-8B和v1.0-30B的Pass@1分别可与120-355B和685B-1T参数的系统相竞争，同时v1.0-30B使用的推理令牌比同类智能体LLM少25.8-95.3%。RL使平均规划视野增加22.8%，而规划频率仅增长2.0%，表明它学会的是规划得更远，而不是规划得更频繁。更广泛地说，学习得到的自我调节体现了一条原则，我们预计它可以从规划扩展到智能体如何管理自身的学习和适应。

英文摘要

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR$^2$AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

2605.22099 2026-05-22 cs.CL 版本更新

A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering

面向高棉语检索增强问答的语言模型比较研究

Sereiwathna Ros, Phannet Pov, Ratanaktepi Chhor, Kimleang Ly, Wan-Sup Cho, Saksonita Khoeurn

发表机构 * Department of Computer Science, Chungbuk National University（Chungbuk National University 计算机科学系）； Department of Big Data, Chungbuk National University（Chungbuk National University 大数据系）； General Department of Information and Communication Technology, Ministry of Post and Telecommunications（邮电部信息和通信技术总局）； Department of Management Information Systems, Chungbuk National University（Chungbuk National University 管理信息系统系）； BigDatalabs Co., Ltd（BigDatalabs 公司）

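AI summary: Targeting Khmer, a low-resource language written in a non-Latin script, this paper compares several language models on retrieval-augmented question answering. It finds that retriever choice is the key factor affecting performance and that generators differ in which metrics they do best on.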

Comments 14 pages, 1 figure

AI中文摘要

检索增强生成（RAG）作为一种将大型语言模型（LLM）输出与检索到的证据相结合的范式，已被证明可以减少幻觉并提高事实准确性。然而，其在低资源、非拉丁语种如柬埔寨语言中的有效性仍鲜有研究。本文提出了一种基于RAG的柬埔寨语言电信领域文档问答系统。我们进行了两阶段的比较评估。首先，我们基准测试了三种嵌入模型：BGE-M3（567M）、Jina-Embeddings-v3（570M）和Qwen3-Embedding（597M），用于柬埔寨文档的密集检索。BGE-M3在Hit Rate@3、File Hit Rate@3、MRR@3和Precision@3等指标上均表现最佳，显著优于其他检索器。其次，使用BGE-M3作为选定的检索器，我们在200个柬埔寨问答对的精心编纂的黄金数据集上评估了五个生成器后端：Qwen3（8B）、Qwen3.5（9B）、Sailor2-8B-Chat、SeaLLMs-v3-7B-Chat和Llama-SEA-LION-v2-8B-IT。为了量化系统性能，我们应用了六个受RAGAS启发的指标：忠实度、答案相关性、上下文相关性、事实正确性、答案相似性和答案正确性。结果表明，没有单一模型在所有指标上占据主导：Qwen3.5-9B在忠实度（0.859）和上下文相关性（0.726）上最高，Qwen3-8B在事实正确性（0.380）上最高，而SeaLLMs-v3-7B-Chat在答案相关性（0.867）、答案相似性（0.836）和答案正确性（0.599）上表现最佳。这些发现突显了检索器选择仍是柬埔寨RAG的主要瓶颈，而生成器的强项则取决于优先考虑的是 grounding、事实精度还是语义相似性。

英文摘要

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for grounding large language model (LLM) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy. Its efficacy, however, remains largely unexamined for low-resource, non-Latin-script languages such as Khmer. In this paper, we present a RAG-based question answering system for Khmer-language telecom-domain documents. We conduct a two-phase comparative evaluation. First, we benchmark three embedding models: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M), for dense retrieval over Khmer documents. BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112, substantially outperforming the other retrievers. Second, using BGE-M3 as the selected retriever, we evaluate five generator backends: Qwen3 (8B), Qwen3.5 (9B), Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, and Llama-SEA-LION-v2-8B-IT, on a curated golden dataset of 200 Khmer question-answer pairs. To quantify system performance, we apply six RAGAS-inspired metrics: faithfulness, answer relevance, context relevance, factual correctness, answer similarity, and answer correctness. The results show no single model dominates across all metrics: Qwen3.5-9B achieves the highest faithfulness (0.859) and context relevance (0.726), Qwen3-8B attains the highest factual correctness (0.380), and SeaLLMs-v3-7B-Chat performs best on answer relevance (0.867), answer similarity (0.836), and answer correctness (0.599). These findings highlight that retriever choice remains a major bottleneck for Khmer RAG, while generator strengths vary depending on whether the priority is grounding, factual precision, or semantic similarity.
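
The retrieval figures above (Hit Rate@3, MRR@3, Precision@3) follow common information-retrieval definitions, which the sketch below implements for a cutoff `k`. The paper's exact definitions are not given in the abstract, so the formulas, function names, and toy chunk ids here should be read as assumptions based on standard usage.

```python
def retrieval_metrics(ranked_ids, relevant_ids, k=3):
    """Return (hit, reciprocal rank, precision) at cutoff k for one query."""
    hits = [doc_id in relevant_ids for doc_id in ranked_ids[:k]]
    hit = 1.0 if any(hits) else 0.0
    reciprocal_rank = 0.0
    for rank, is_hit in enumerate(hits, start=1):
        if is_hit:
            reciprocal_rank = 1.0 / rank  # rank of the first relevant result
            break
    precision = sum(hits) / k
    return hit, reciprocal_rank, precision


def average_metrics(runs, k=3):
    """Average the three metrics over (ranked_ids, relevant_ids) pairs."""
    totals = [0.0, 0.0, 0.0]
    for ranked_ids, relevant_ids in runs:
        for i, value in enumerate(retrieval_metrics(ranked_ids, relevant_ids, k)):
            totals[i] += value
    return tuple(t / len(runs) for t in totals)


# Toy retrieval results for three queries (hypothetical chunk ids).
runs = [
    (["c1", "c7", "c3"], {"c7"}),
    (["c2", "c4", "c9"], {"c5"}),
    (["c8", "c6", "c1"], {"c8", "c1"}),
]
hit_rate, mrr, precision = average_metrics(runs, k=3)
print(round(hit_rate, 3), round(mrr, 3), round(precision, 3))
```

Hit Rate@k only asks whether any relevant chunk appears in the top k, MRR@k rewards placing the first relevant chunk higher, and Precision@k penalizes irrelevant chunks in the top k.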

2605.16984 2026-05-22 cs.CL 版本更新

Closing the Gap at CRAC 2026: Two-Stage Adaptation for LLM-Based Multilingual Coreference Resolution

在CRAC 2026上缩小差距：基于LLM的多语言共指消解的两阶段适应

Antoine Bourgois, Olga Seminck, Thierry Poibeau

发表机构 * Lattice (CNRS UMR 8094 & ENS-PSL & Université Sorbonne Nouvelle)（Lattice实验室：法国国家科学研究中心UMR 8094、巴黎高等师范学院-PSL、新索邦大学）

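AI summary: This paper presents an LLM-based approach to multilingual coreference resolution that uses a two-stage adaptation strategy, achieving an average CoNLL F1 of 74.32 and ranking first in the CRAC 2026 shared task.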

2605.16545 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces

Symphony for Speech-to-Text: 支持实时医疗语音接口

Arne Nix, Robert James, Lasse Borgholt, Anna B. Ekner, Lana Krumm, Julius Severin, Dan Engel, Lars Maaløe, Jakob Havtorn

发表机构 * Corti

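AI summary: This paper presents Symphony for Speech-to-Text, a medical-grade real-time speech recognition system. It decomposes transcription into specialized components for recognition, formatting, and contextual correction to optimize medical term recall and produce clinically structured text in real time. It substantially outperforms existing systems in medical settings while performing no worse in general-domain settings.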

Comments Updated with a correction and improvement to Symphony's performance in spoken punctuation evaluation (R_punct, P_punct)

Abstract

After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary modality for interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are typically optimized either for general-purpose transcription or narrow dictation workflows, limiting their reliability in safety-critical settings and their usefulness for broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes the transcription process into specialized components for recognition, formatting, and contextual correction to optimize medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmark and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, suggesting robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.

2605.15040 2026-05-22 cs.AI cs.CL Version update

Orchard: An Open-Source Agentic Modeling Framework

Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandro Sordoni, Xingdi Yuan, Yelong Shen, Pengcheng He, Tong Zhang, Zhou Yu, Jianfeng Gao

Affiliations: Microsoft Research; Columbia University; UIUC

AI summary: This paper presents Orchard, an open-source agentic modeling framework that uses a lightweight environment service and three agentic modeling recipes to enable reusable agent data, training, and evaluation across domains.

Abstract

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

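2605.14257 2026-05-22 cs.CL Version update

Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult?

Adam Nohejl, Xuanxin Wu, Yusuke Ide, Maria Angelica Riera Machin, Yi-Ning Chang, Hitomi Yanaka

Affiliations: RIKEN; The University of Osaka; Nara Institute of Science and Technology; National Tsing Hua University; The University of Tokyo; Tohoku University

AI summary: This paper proposes two models for predicting vocabulary difficulty: a high-accuracy black-box model that achieved the best result in the open track, and an interpretable model that outperforms a fine-tuned encoder baseline. The black-box model fine-tunes an LLM with a soft-target loss and reaches r > 0.91 on the scoring task, while the interpretable model retains a strong correlation (r > 0.77) and reveals the factors that affect each item's difficulty. Further analysis shows that vocabulary difficulty in the British Council's Knowledge-based Vocabulary Lists (KVL) is often driven by spelling difficulty or test item construction rather than only by the difficulty of producing the word itself.

Comments: To be published in Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)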

2605.13989 2026-05-22 cs.CL Version update

VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

Juan S. Santillana

Affiliations: Globant

AI summary: This paper presents VectraYX-Nano, a Spanish-language cybersecurity language model, and shows how curriculum learning and native tool calling can be applied and improved in the cybersecurity domain.

Comments: 24 pages, 5 figures, 12 tables. v3: post-Chinchilla compute ablation (v8-v15), Globant affiliation finalized, EMNLP Findings 2026 submission. Released model: VectraYX-Nano v7 (42M params, GGUF Q4 ~20 MB, native MCP)

Abstract

We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American regional focus and native tool invocation via the Model Context Protocol (MCP). The model has four contributions. (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus assembled by an eight-VM distributed pipeline at ~$25 USD of cloud compute and split into three curriculum phases (conversational 42M, cybersecurity 118M, offensive tooling 10M). (ii) Architecture: a 42M Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE and z-loss, paired with a domain-balanced 16,384-token byte-fallback BPE. (iii) Curriculum with replay across the three phases yields a monotonic loss descent (9.80 -> 3.17 -> 3.00 -> 2.16); after SFT (loss 1.74) the v2 bootstrap-ablation reference attains a conversational gate of 0.775 +/- 0.043 on B5 over N=4 seeds, and a controlled Phase-2 replay sweep over {0,5,10,25,50}% saturates B5 at >=25% replay. (iv) Two empirical findings, both N=4. A controlled bootstrap-corpus ablation across v2 (OpenSubs), v4 (mC4-ES), and v6 (60/25/15 OpenSubs/mC4/Wiki) exposes a loss-versus-register inversion: lower-perplexity bootstraps yield measurably worse conversational behavior (v2 > v4 > v6 on B5 at every paired seed). The B4 (tool-selection) floor of 0.000 is a corpus-density artifact, not a capacity gate: rebalancing the SFT mixture to tool-use ratio 1:21 yields VectraYX-Nano v7, the released headline configuration, reaching B4 = 0.230 +/- 0.052 at 42M while retaining B1 = 0.332 +/- 0.005 and B5 = 0.725 +/- 0.130; a LoRA replication on a 260M from-scratch mid-tier reaches 0.445 +/- 0.201. The released GGUF is 96 MB in F16, runs sub-second TTFT on commodity hardware under llama.cpp, and is, to our knowledge, the first published Spanish-native cybersecurity LLM with end-to-end MCP integration.

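2605.12456 2026-05-22 cs.CR cs.CL cs.LG Version update

TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection

Tom Sander, Hongyan Chang, Tomáš Souček, Tuan Tran, Valeriu Lacatusu, Sylvestre-Alvise Rebuffi, Alexandre Mourachko, Surya Parimi, Christophe Ropers, Rashel Moritz, Vanessa Stark, Hady Elsahar, Pierre Fernandez

Affiliations: FAIR, Meta Superintelligence Labs

AI summary: This paper presents TextSeal, an LLM watermark that builds on Gumbel-max sampling, adding dual-key generation to restore output diversity along with entropy-weighted scoring and multi-region localization to improve detection. It supports serving optimizations such as speculative decoding and multi-token prediction without adding inference overhead. It strictly dominates the SynthID-text baseline in detection strength and is robust to dilution, keeping confident localized detection even in mixed human/AI documents. The scheme is theoretically distortion-free; reasoning benchmarks confirm that downstream performance is preserved, and a multilingual human evaluation (6000 A/B comparisons, 5 languages) shows no perceptible quality difference. Beyond provenance detection, TextSeal is also "radioactive": its watermark signal transfers through model distillation, enabling detection of unauthorized use.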

Abstract

We introduce TextSeal, a state-of-the-art watermark for large language models. Building on Gumbel-max sampling, TextSeal introduces dual-key generation to restore output diversity, along with entropy-weighted scoring and multi-region localization for improved detection. It supports serving optimizations such as speculative decoding and multi-token prediction, and does not add any inference overhead. TextSeal strictly dominates baselines like SynthID-text in detection strength and is robust to dilution, maintaining confident localized detection even in heavily mixed human/AI documents. The scheme is theoretically distortion-free, and evaluation across reasoning benchmarks confirms that it preserves downstream performance, while a multilingual human evaluation (6000 A/B comparisons, 5 languages) shows no perceptible quality difference. Beyond its use for provenance detection, TextSeal is also "radioactive": its watermark signal transfers through model distillation, enabling detection of unauthorized use.
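
As background for the Gumbel-max sampling that the abstract builds on, here is a minimal sketch of a generic keyed Gumbel-max watermark. It is not TextSeal itself: dual-key generation, entropy-weighted scoring, and multi-region localization are not shown, and the hashing scheme, window size, key, and toy uniform distribution are all assumptions made for illustration.

```python
import hashlib
import math
import random


def keyed_uniform(key, context, token):
    """Pseudo-random number in (0, 1) derived from a secret key, the recent context, and a candidate token."""
    digest = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    return (int.from_bytes(digest[:6], "big") + 0.5) / 2**48


def watermarked_sample(probs, key, context):
    """Gumbel-max style choice: argmax of r_i ** (1 / p_i), computed in log space."""
    candidates = [i for i, p in enumerate(probs) if p > 0]
    return max(candidates, key=lambda i: math.log(keyed_uniform(key, context, i)) / probs[i])


def detection_score(tokens, key, window=4):
    """Sum of -log(1 - r) over tokens; unwatermarked text averages about 1 per token."""
    score = 0.0
    for t, token in enumerate(tokens):
        context = tuple(tokens[max(0, t - window):t])
        score += -math.log(1.0 - keyed_uniform(key, context, token))
    return score


VOCAB, WINDOW, KEY = 50, 4, "demo-key"  # hypothetical toy settings
probs = [1.0 / VOCAB] * VOCAB           # toy uniform next-token distribution

marked = []
for _ in range(200):
    context = tuple(marked[-WINDOW:])
    marked.append(watermarked_sample(probs, KEY, context))

rng = random.Random(0)
plain = [rng.randrange(VOCAB) for _ in range(200)]  # text generated without the key

marked_score = detection_score(marked, KEY, WINDOW)
plain_score = detection_score(plain, KEY, WINDOW)
print(f"watermarked: {marked_score:.1f}  unwatermarked: {plain_score:.1f}")
```

When the keyed numbers behave like independent uniforms, token i is still selected with probability p_i, which is why schemes of this family are described as distortion-free. Detection only needs the key and the text, not the model.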

2605.11739 2026-05-22 cs.CL Version update

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, Junfeng Fang

Affiliations: USTC; Tencent; NUS; HKUST(GZ); UCAS-IIE; SHU

AI summary: This paper studies where the efficiency of on-policy distillation (OPD) comes from and proposes EffOPD, which accelerates OPD by adaptively selecting an extrapolation step size and moving along the current update direction, achieving an average 3x training speedup while maintaining comparable final performance.

Abstract

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD's efficiency remain poorly understood. In this work, we argue that OPD's efficiency stems from a form of "foresight": it establishes a stable update trajectory toward the final model early in training. This foresight manifests in two aspects. First, at the Module-Allocation Level, OPD identifies regions with low marginal utility and concentrates updates on modules that are more critical to reasoning. Second, at the Update-Direction Level, OPD exhibits stronger low-rank concentration, with its dominant subspaces aligning closely with the final update subspace early in training. Building on these findings, we propose EffOPD, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average training acceleration of $3\times$ while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective for understanding the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models.
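
The idea of extrapolating along the current update direction with an adaptively chosen step size can be sketched as follows. The abstract does not give EffOPD's actual selection rule, so this sketch assumes a simple search over candidate step sizes scored by a held-out loss; `extrapolate`, `pick_step`, the candidate grid, and the quadratic toy loss are all hypothetical.

```python
def extrapolate(prev, curr, step):
    """Move from curr along the current update direction (curr - prev)."""
    return [c + step * (c - p) for p, c in zip(prev, curr)]


def pick_step(prev, curr, loss_fn, candidates=(0.0, 0.5, 1.0, 2.0, 4.0)):
    """Return (step, params) with the lowest loss among the candidate steps.

    A step of 0.0 keeps the current parameters, so extrapolation is only
    accepted when it strictly lowers the loss.
    """
    best_step, best_params = 0.0, list(curr)
    best_loss = loss_fn(best_params)
    for step in candidates:
        params = extrapolate(prev, curr, step)
        loss = loss_fn(params)
        if loss < best_loss:
            best_step, best_params, best_loss = step, params, loss
    return best_step, best_params


# Toy problem: quadratic loss with its optimum at (3, 3), and two
# consecutive checkpoints that are heading toward it.
target = [3.0, 3.0]


def toy_loss(params):
    return sum((w - t) ** 2 for w, t in zip(params, target))


prev_ckpt, curr_ckpt = [0.0, 0.0], [1.0, 1.0]
step, new_params = pick_step(prev_ckpt, curr_ckpt, toy_loss)
print(step, new_params)
```

The sketch only works when the update direction is stable, which is the property the abstract attributes to OPD early in training.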

2605.07711 2026-05-22 cs.CL Version update

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

Jie Sun, Mao Zheng, Mingyang Song, Qiyong Zhong, Yilin Cheng, Bichuan Feng, Pengfei Liu, Junfeng Fang, Xiang Wang

Affiliations: University of Science and Technology of China; Large Language Model Department, Tencent; Shanghai Innovation Institute; Zhongguancun Academy

AI summary: This paper proposes SimCT, an improved on-policy distillation method that enlarges the supervision space to recover supervision lost to tokenization mismatches, improving performance on mathematical reasoning and code generation tasks.

Comments: 4 figures, 6 tables, 28 pages

Abstract

On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared-token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose Simple Cross-Tokenizer OPD (SimCT), which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi-token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives remove teacher-student distinctions that are useful for on-policy learning. Across three heterogeneous teacher-student pairs on mathematical reasoning and code-generation benchmarks, SimCT shows consistent gains over shared-vocabulary OPD and representative cross-tokenizer baselines, with ablations confirming that the improvements come from recovering supervision discarded by exact shared-token matching. Code is available at https://github.com/sunjie279/SimCT-.

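2605.07243 2026-05-22 cs.CL Version update

MemEvoBench: Evaluating the Safety Risks of Memory Misevolution in LLM Agents

Weiwei Xie, Shaoxiong Guo, Fan Zhang, Tian Xia, Xue Yang, Lizhuang Ma, Junchi Yan, Qibing Ren

Affiliations: Shanghai Jiao Tong University; East China Normal University; Shandong University; Duke University

AI summary: This paper presents MemEvoBench, the first benchmark for evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. Using QA-style tasks across 7 domains and 36 risk types, plus workflow-style tasks adapted from 20 Agent-SafetyBench environments, it shows that memory evolution has a major effect on safety and that static prompt-based defenses are insufficient, underscoring the need to secure memory evolution in LLM agents.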

Abstract

Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt-based defenses prove insufficient, underscoring the urgency of securing memory evolution in LLM agents.

2604.08362 2026-05-22 cs.CL cs.AI cs.LG Version update

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han, Le Sun, Xiangyu Wu, Hongyu Lin

Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Kuaishou Technology

AI summary: This paper presents the OmniBehavior benchmark, which integrates long-horizon, cross-scenario, and heterogeneous behavioral patterns from real-world data. It reveals the limits of existing models in simulating complex human behavior, including convergence toward a positive average person, persona homogenization, and a utopian bias, and points to directions for future high-fidelity simulation research.

Comments: Project page: https://OmniBehavior.github.io

Abstract

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

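2603.27355 2026-05-22 cs.AI cs.CL cs.SE Version update

LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications

Alexandre Cristovão Maiorano

Affiliations: Lumytics

AI summary: This paper presents a readiness framework for LLM and RAG applications that turns evaluation into a deployment decision workflow through automated benchmarking, OpenTelemetry observability, and CI quality gates. It uses Pareto frontiers to compute scenario-weighted readiness scores and reports evaluation results on a ticket-routing workflow and BEIR grounding tasks.

Comments: 19 pages, 4 figures, 15 tables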

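Fixing Structural Bottlenecks: Context Compression via Explicit Information Transmission

Jiangnan Ye, Hanqi Yan, Zhenyi Shen, Heng Chang, Ye Mao, Yulan He

Affiliations: King's College London; Tsinghua University; Imperial College London; The Alan Turing Institute

AI summary: This paper revisits context compression from a structural perspective, identifies two key bottlenecks in standard LLM-based compressors, and proposes ComprExIT, a framework that improves compression through explicit information transmission. Experiments show strong results across multiple datasets, with higher F1 scores and lower computational cost.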

Abstract

Long-context LLM agents often struggle with growing token, memory, and latency costs, making efficient context compression essential for practical deployment. Existing LLM-as-a-compressor methods remain noticeably inferior to using the full context. We find that this gap partly stems from their inability to preserve contextual information effectively. In this work, we revisit context compression from a structural perspective and identify two key bottlenecks in standard LLM-based compressors: limited coordination among compression tokens during information aggregation, and layerwise dilution that weakens useful signals from intermediate hidden states. To address these limitations, we propose ComprExIT, a new context compression framework based on explicit information transmission. ComprExIT adaptively selects features across frozen LLM layers, then allocates information from anchors to compression slots through a globally coordinated transport plan. Experiments on 12 datasets show that ComprExIT consistently outperforms strong soft-compression baselines, improving average F1 by up to 18.5%, while adding only ~1% trainable parameters and achieving more than 2x faster compression than the fastest baselines. The code will be released upon acceptance.

2602.02112 2026-05-22 cs.LG cs.AI cs.CL Version update

Unifying Masked Diffusion Models with Various Generation Orders and Beyond

Chunsan Hong, Sanghyun Lee, Jong Chul Ye

Affiliations: Graduate School of AI, KAIST, South Korea

AI summary: This paper proposes the Order-Expressive Masked Diffusion Model (OeMDM) and the Learnable-Order Masked Diffusion Model (LoMDM), which unify diffusion generative processes with different generation orders and jointly learn the generation order and the diffusion backbone through a single objective, improving text generation performance.

Comments: Accepted at ICML 2026

Abstract

Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation, but generation quality depends critically on the generation order. Prior work either hard-codes an ordering (e.g., blockwise left-to-right) or learns an ordering policy for a pretrained MDM, which incurs extra cost and can yield suboptimal solutions due to the two-stage optimization. Motivated by this, we propose order-expressive masked diffusion model (OeMDM) for a broad class of diffusion generative processes with various generation orders, enabling the interpretation of MDM, ARM, and block diffusion in a single framework. Furthermore, building on OeMDM, we introduce learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and diffusion backbone through a single objective from scratch, enabling the diffusion model to generate text in context-dependent ordering. Empirically, we confirm that LoMDM outperforms various discrete diffusion models across multiple language modeling benchmarks.

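2601.20107 2026-05-22 cs.CV cs.CL cs.IR Version update

Structural Anchor Pruning: Training-Free Multi-Vector Compression for Visual Document Retrieval

Zhuchenyang Liu, Ziyu Hu, Yao Zhang, Yu Xiao

Affiliations: Aalto University

AI summary: This paper proposes Structural Anchor Pruning (SAP), a training-free multi-vector compression method built from three components: Score Retention, SR-guided window selection, and a visual in-degree centrality scorer. Without any per-model parameter tuning, it prunes more than 90% of visual tokens while retaining over 90% of NDCG@5.

Comments: methodology revision and new title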

Abstract

Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive multi-vector index storage overhead. Existing training-free pruning methods either rely on heuristic layer choices or degrade sharply under aggressive compression, leading prior work to argue that effective high-compression pruning requires query-dependent training. We challenge this view with Structural Anchor Pruning (SAP), a self-calibrating, training-free, and query-agnostic index-time pruning framework with three components: (i) Score Retention (SR), a white-box per-layer compression diagnostic; (ii) SR-guided window selection, a procedure that automatically locates the structural pruning region for any backbone with no per-model hyperparameters; and (iii) a visual in-degree centrality scorer that identifies anchor patches within the selected window. On the ViDoRe v1/v2 benchmarks across three architectures spanning 18, 28, and 36 backbone layers, SAP retains over 90% of NDCG@5 while pruning more than 90% of visual tokens, without any per-model parameter tuning. Our layer-resolved SR analysis reveals an Alignment-Aggregation Divergence: the document's visual structure is preserved as a stable "Structural Plateau" within the backbone, but the final layers reshape this representation into a sparse, query-aligned form that is no longer suitable for pruning. This is the mechanistic reason SAP succeeds where final-layer methods fail.
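
The in-degree centrality scorer in component (iii) can be illustrated with a small sketch. It assumes a row-stochastic attention matrix among visual patches taken from a single layer, where entry [i][j] is how much patch i attends to patch j. How SAP actually builds its graph and which layers it uses are not stated in the abstract, and `select_anchors` and the toy matrix are hypothetical.

```python
def in_degree_centrality(attention):
    """Column sums: total attention each patch receives from all patches."""
    n = len(attention)
    return [sum(attention[i][j] for i in range(n)) for j in range(n)]


def select_anchors(attention, keep_ratio=0.1):
    """Keep the patches with the highest in-degree; return their indices in order."""
    scores = in_degree_centrality(attention)
    keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# Toy 4-patch attention matrix (each row sums to 1). Patch 1 receives the
# most attention, followed by patch 3.
attention = [
    [0.1, 0.6, 0.1, 0.2],
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.5, 0.1, 0.2],
    [0.1, 0.4, 0.1, 0.4],
]
anchors = select_anchors(attention, keep_ratio=0.5)
print(anchors)
```

Because the score depends only on the document's own attention pattern, this kind of selection can run at index time without seeing any query, which matches the query-agnostic framing in the abstract.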

2601.04537 2026-05-22 cs.LG cs.CL Version update

Linear Dynamics in the RLVR Training of Large Language Models

Tianle Wang, Jiayu Liu, Zhongyuan Wu, Shenghao Jin, Wei Chen, Hao Xu, Ning Miao

Affiliations: Department of Data Science, City University of Hong Kong; Hong Kong Institute of AI for Science, City University of Hong Kong; Li Auto Inc.; Beihang University

AI summary: This paper studies the internal dynamics of reinforcement learning with verifiable rewards (RLVR) in LLM training. It finds that RLVR enters a linear regime across many models and training configurations, shows through experiments and theoretical analysis that this linearity stems from the high variance and noise of the training signal, and demonstrates that the linearity is predictive and practically useful.

Comments: Major revision: substantially reorganized the manuscript and added a theoretical explanation section. The replacement is intended for the same arXiv paper; the core topic and contribution remain the same

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven significant performance gains in reasoning-oriented large language models (LLMs), yet its internal training dynamics remain largely a black box. In this work, we perform a comprehensive trajectory-level analysis of RLVR and uncover a striking regularity: across various model families, RL algorithms, and training configurations, RLVR consistently enters a robust linear regime, where both parameter weights and output log-probabilities, measured rigorously via teacher-forced evaluation, evolve in a highly linear manner ($R^2 > 0.7$). Through controlled experiments and theoretical analysis, we demonstrate that this linearity is not a coincidence, but stems from the high-variance, noisy nature of RLVR training signals, which act as a low-pass filter to concentrate optimization along a stable, low-dimensional drift. Moreover, we show that this linear structure is not merely descriptive but powerfully predictive and actionable. Specifically, weight-space extrapolation matches the performance of standard RL optimization while achieving a 6.1x training speedup through periodic re-grounding. Meanwhile, output-space extrapolation serves as a lightweight intervention that effectively bypasses late-stage model collapse, consistently outperforming standard RL across mathematical and coding benchmarks, with an average performance improvement of 4.2%. Our code is available at https://github.com/Miaow-Lab/RLVR-Linearity.
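
The linearity check and weight-space extrapolation described above can be sketched for a single scalar. This is a minimal illustration, assuming ordinary least squares over checkpoint values and a plain forward projection; the paper's periodic re-grounding and its treatment of full parameter tensors and log-probabilities are not shown, and the checkpoint values are invented.

```python
def linear_fit(steps, values):
    """Least-squares fit values ~ slope * step + intercept; returns (slope, intercept, r2)."""
    n = len(steps)
    mean_x = sum(steps) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in steps)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(steps, values))
    ss_tot = sum((y - mean_y) ** 2 for y in values)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r2


# Hypothetical trajectory of one weight recorded at five RL checkpoints.
steps = [0, 100, 200, 300, 400]
weight = [0.50, 0.52, 0.55, 0.57, 0.60]

slope, intercept, r2 = linear_fit(steps, weight)

# Extrapolate only if the trajectory looks linear (the abstract reports R^2 > 0.7).
if r2 > 0.7:
    predicted_at_800 = slope * 800 + intercept
    print(f"R^2={r2:.3f}, extrapolated weight at step 800: {predicted_at_800:.3f}")
```

Gating the extrapolation on R^2 is a design choice of this sketch: projecting a weight forward is only meaningful while the fitted line explains most of the observed variation.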

2511.04106 2026-05-22 physics.soc-ph cs.CL cs.CY stat.AP Version update

Sub-exponential Growth Dynamics in Complex Systems: A Piecewise Power-Law Model for the Diffusion of New Words and Names

Hayafumi Watanabe

Affiliations: Department of Economics, Seijo University, Setagaya-ku, Tokyo 157-8511, Japan; The Institute of Statistical Mathematics, Tachikawa-shi, Tokyo 190-8562, Japan

AI summary: This paper proposes a piecewise power-law model for describing complex growth curves and, by analyzing large-scale datasets, finds that sub-exponential growth is a common pattern of social diffusion.

DOI: 10.1103/f3d5-2tb8
Journal ref: Physical Review E (2026)

Abstract

The diffusion of ideas and language in society has conventionally been described by S-shaped models, such as the logistic curve. However, the role of sub-exponential growth -- a slower-than-exponential pattern known in epidemiology -- has been largely overlooked in broader social phenomena. Here, we present a piecewise power-law model to characterize complex growth curves with a few parameters. We systematically analyzed a large-scale dataset of approximately one billion Japanese blog articles linked to Wikipedia vocabulary, and observed consistent patterns in web search trend data (English, Spanish, and Japanese). Our analysis of 2,963 items, selected for reliable estimation (e.g., sufficient duration/peak, monotonic growth), reveals that 1,625 (55%) diffusion patterns without abrupt level shifts were adequately described by one or two segments. For single-segment curves, we found that (i) the mode of the shape parameter $α$ was near 0.5, indicating prevalent sub-exponential growth; (ii) the peak diffusion scale is primarily determined by the growth rate $R$, with minor contributions from $α$ or the duration $T$; and (iii) $α$ showed a tendency to vary with the nature of the topic, being smaller for niche/local topics and larger for widely shared ones. Furthermore, a micro-behavioral model of outward (stranger) vs. inward (community) contact suggests that $α$ can be interpreted as an index of the preference for outward-oriented communication. These findings suggest that sub-exponential growth is a common pattern of social diffusion, and our model provides a practical framework for consistently describing, comparing, and interpreting complex and diverse growth curves.

2510.23090 2026-05-22 cs.CL 版本更新

MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models

MAP4TS: 一个用于基于大语言模型的时间序列预测的多方面提示框架

Suchan Lee, Jihoon Choi, Sohyeon Lee, Minseok Song, Bong-Gyu Jang, Hwanjo Yu, Soyeon Caren Han

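AI summary: This paper proposes the MAP4TS framework, which improves the performance of large language models on time-series forecasting by building classical time-series analysis into prompt design; experiments show that it outperforms existing methods across multiple datasets.

Comments There is an error in modeling. Thereafter, the paper will be revised and re-uploaded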

AI中文摘要

最近的研究探讨了使用预训练的大语言模型（LLMs）进行时间序列预测，通过将数值输入对齐到LLM嵌入空间。然而，现有的多模态方法往往忽视了时间序列数据中独特的统计特性和时间依赖性。为弥合这一差距，我们提出了MAP4TS，一种新颖的多方面提示框架，该框架明确将经典时间序列分析纳入提示设计。我们的框架引入了四个专门的提示组件：一个全局领域提示传达数据集级别的上下文，一个局部领域提示编码近期趋势和系列特定行为，以及一对统计和时间提示，嵌入了从自相关（ACF）、偏自相关（PACF）和傅里叶分析中提取的手工洞察。多方面提示与原始时间序列嵌入结合，并通过跨模态对齐模块生成统一的表示，然后通过LLM处理并投影以进行最终预测。在八个多样化的数据集上进行的广泛实验表明，MAP4TS在多个数据集上均优于现有方法。我们的消融研究进一步揭示，提示意识设计显著提升了性能稳定性，并且当与结构化提示结合时，GPT-2模型在长期预测任务中优于较大的模型如LLaMA。

英文摘要

Recent advances have investigated the use of pretrained large language models (LLMs) for time-series forecasting by aligning numerical inputs with LLM embedding spaces. However, existing multimodal approaches often overlook the distinct statistical properties and temporal dependencies that are fundamental to time-series data. To bridge this gap, we propose MAP4TS, a novel Multi-Aspect Prompting Framework that explicitly incorporates classical time-series analysis into the prompt design. Our framework introduces four specialized prompt components: a Global Domain Prompt that conveys dataset-level context, a Local Domain Prompt that encodes recent trends and series-specific behaviors, and a pair of Statistical and Temporal Prompts that embed handcrafted insights derived from autocorrelation (ACF), partial autocorrelation (PACF), and Fourier analysis. Multi-Aspect Prompts are combined with raw time-series embeddings and passed through a cross-modality alignment module to produce unified representations, which are then processed by an LLM and projected for final forecasting. Extensive experiments across eight diverse datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based methods. Our ablation studies further reveal that prompt-aware designs significantly enhance performance stability and that GPT-2 backbones, when paired with structured prompts, outperform larger models like LLaMA in long-term forecasting tasks.

2510.13910 2026-05-22 cs.CL 版本更新

Evaluating Scoring Bias in LLM-as-a-Judge

Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu

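Affiliations: Ant Group

AI summary: This paper studies bias in LLM-as-a-Judge scoring tasks, identifies three new types of scoring bias, and develops a framework to quantify them, with the aim of improving scoring-prompt design.

Comments Accepted by DASFAA 2026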

DOI: 10.1007/978-981-92-0372-7_2

AI中文摘要

"LLM-as-a-Judge"范式通过大型语言模型（LLMs）作为自动评估者，是LLM发展中的关键部分，为复杂任务提供可扩展的反馈。然而，这些裁判的可靠性受到多种偏见的影响。现有研究主要集中在比较性评估中的偏见。相比之下，基于评分的评估（分配绝对分数，常用于工业应用）研究较少。为填补这一空白，我们进行了首次专门的评分偏见评估。我们从评分提示本身而非评估目标的偏见出发。我们正式定义了评分偏见，并识别了三种新的偏见类型：评分标准顺序偏见、评分ID偏见和参考答案评分偏见。我们提出了一种全面的框架来量化这些偏见，包含多方面的度量指标和自动数据合成管道来创建定制的评估语料库。我们的实验实证地证明了即使最先进的LLMs也受这些显著评分偏见的影响。我们的分析为设计更稳健的评分提示和缓解这些新发现的偏见提供了可行的见解。

英文摘要

The "LLM-as-a-Judge" paradigm, using Large Language Models (LLMs) as automated evaluators, is pivotal to LLM development, offering scalable feedback for complex tasks. However, the reliability of these judges is compromised by various biases. Existing research has heavily concentrated on biases in comparative evaluations. In contrast, scoring-based evaluations-which assign an absolute score and are often more practical in industrial applications-remain under-investigated. To address this gap, we undertake the first dedicated examination of scoring bias in LLM judges. We shift the focus from biases tied to the evaluation targets to those originating from the scoring prompt itself. We formally define scoring bias and identify three novel, previously unstudied types: rubric order bias, score ID bias, and reference answer score bias. We propose a comprehensive framework to quantify these biases, featuring a suite of multi-faceted metrics and an automatic data synthesis pipeline to create a tailored evaluation corpus. Our experiments empirically demonstrate that even the most advanced LLMs suffer from these substantial scoring biases. Our analysis yields actionable insights for designing more robust scoring prompts and mitigating these newly identified biases.

2506.04708 2026-05-22 cs.CL 版本更新

Accelerated Test-Time Scaling with Model-Free Speculative Sampling

加速测试时间缩放与模型无关的推测采样

Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

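Affiliations: KAIST; Amazon AGI; AirSignal

AI summary: This paper proposes STAND, a model-free speculative decoding method that exploits redundancy in reasoning trajectories to speed up inference without sacrificing accuracy; evaluated across multiple models and tasks, STAND reduces inference latency by 60-65% while maintaining accuracy.

Comments EMNLP 2025 Oral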

DOI: 10.18653/v1/2025.emnlp-main.1558

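AI abstract (translated from Chinese)

Language models have shown remarkable reasoning capabilities through test-time scaling techniques such as best-of-N sampling and tree search. However, these methods usually require substantial computation, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding method that exploits the inherent redundancy in reasoning trajectories to achieve significant speedups without sacrificing accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, which enables efficient model-free token prediction without a separate draft model. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluation across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) shows that STAND reduces inference latency by 60-65% compared with standard autoregressive decoding while maintaining accuracy. STAND also consistently outperforms state-of-the-art speculative decoding methods across inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free method, STAND can be applied to any existing language model without additional training, making it a strong plug-and-play solution for accelerating language model reasoning.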

英文摘要

Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
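
The Gumbel-Top-K sampling named in the abstract can be illustrated with a minimal sketch. This is the textbook trick only (perturb each logit with Gumbel noise and keep the k largest), not STAND's optimized variant, whose details the abstract does not give. The function name `gumbel_top_k` and the candidate logits are hypothetical.

```python
import math
import random


def gumbel_top_k(logits, k, rng=random):
    """Return k distinct indices sampled without replacement.

    Adding independent Gumbel(0, 1) noise to each logit and keeping the k
    largest perturbed scores samples from softmax(logits) without replacement.
    """
    perturbed = []
    for index, logit in enumerate(logits):
        # Clamp away from 0 so that log(u) is always defined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((logit + gumbel_noise, index))
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


# Hypothetical n-gram continuation scores for four candidate tokens.
candidate_logits = [2.0, 1.0, 0.5, -1.0]
draft = gumbel_top_k(candidate_logits, k=2, rng=random.Random(0))
print(draft)  # two distinct candidate indices; the choice depends on the seed
```

Because the noise is drawn independently per candidate, repeated calls give different drafts while still favoring high-scoring tokens, which is the property a stochastic drafter needs.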

2505.17123 2026-05-22 cs.CL 版本更新

MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

MTR-Bench：多轮推理评估的综合性基准

Xiaoyuan Li, Keqin Bao, Yubo Ma, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu

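Affiliations: University of Science and Technology of China; Alibaba Group; National University of Singapore

AI summary: This paper presents MTR-Bench, a comprehensive benchmark with 4 classes, 40 tasks, and 3600 instances for evaluating the multi-turn reasoning of large language models; its automated framework enables large-scale evaluation and reveals the shortcomings of current state-of-the-art reasoning models on multi-turn interactive tasks.

Comments ACL 2026 Main Conference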

AI中文摘要

近年来，大型语言模型（LLMs）在复杂推理任务中展现出有前景的结果。然而，当前的评估主要集中在单轮推理场景，忽略了交互性任务。我们归因于缺乏全面的数据集和可扩展的自动评估协议。为了填补这些空白，我们提出了MTR-Bench用于LLM的多轮推理评估。MTR-Bench包含4类、40个任务和3600个实例，覆盖了多样的推理能力、细粒度难度层次以及需要与环境进行多轮交互的任务。此外，MTR-Bench具备完全自动化的框架，涵盖了数据集构建和模型评估，使大规模评估成为可能而无需人工干预。广泛实验表明，即使是最先进的推理模型在多轮交互推理任务中也显得不足。对这些结果的进一步分析为未来交互式人工智能系统的研究提供了有价值的见解。

英文摘要

Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute it to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for LLMs' Multi-Turn Reasoning evaluation. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities, fine-grained difficulty granularity, and necessitates multi-turn interactions with the environments. Moreover, MTR-Bench features fully-automated framework spanning both dataset constructions and model evaluations, which enables scalable assessment without human interventions. Extensive experiments reveal that even the cutting-edge reasoning models fall short of multi-turn, interactive reasoning tasks. And the further analysis upon these results brings valuable insights for future research in interactive AI systems.

2505.05406 2026-05-22 cs.CL 版本更新

Frame In, Frame Out: Measuring Framing Bias in LLM-Generated News Summaries

框入框出：衡量LLM生成新闻摘要中的框架偏差

Valeria Pastorino, Nafise Sadat Moosavi

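Affiliations: Department of Computer Science, University of Sheffield (UK)

AI summary: This paper proposes the FIFO benchmark for measuring framing presence in LLM-generated news summaries and finds that such summaries show elevated framing rates in science and public health topics, indicating that framing is an overlooked but important dimension of summary quality.

Comments Accepted to The 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026) co-located with ACL 2026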

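AI abstract (translated from Chinese)

News headlines and summaries shape how events are interpreted through selective emphasis and omission, a phenomenon commonly called framing. Large language models are now routinely used to generate such content, yet existing evaluation frameworks largely overlook this dimension. We introduce Frame In, Frame Out (FIFO), the first large-scale benchmark for measuring framing presence in LLM-generated news summaries, built on the widely used XSum dataset. FIFO combines 15,499 jury-annotated examples with 320 expert-labeled instances (κ = 0.61) to validate and calibrate model-based annotations. Using FIFO, we analyze measured framing rates across 27 summarization models. We find that LLM-generated summaries often exhibit higher calibrated framing rates than human-written references, with substantial variation across topics and training regimes, including elevated rates in scientific and public health summaries. Our results establish framing as an underexplored and consequential dimension of summarization quality.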

英文摘要

News headlines and summaries shape how events are interpreted through selective emphasis and omission, a phenomenon commonly referred to as framing. Large language models are now routinely used to generate such content, yet existing evaluation frameworks largely overlook this dimension. We introduce Frame In, Frame Out (FIFO), the first large-scale benchmark for measuring framing presence in LLM-generated news summaries, grounded in the widely used XSum dataset. FIFO combines 15,499 jury-annotated examples with 320 expert-labeled instances ($κ= 0.61$) to validate and calibrate model-based annotations. Using FIFO, we analyze measured framing rates across 27 summarization models. We find that LLM-generated summaries often exhibit higher calibrated framing rates than human-written references, with substantial variation across topics and training regimes, including elevated rates in scientific and public health summaries. Our results establish framing as an underexplored and consequential dimension of summarization quality.

2503.17599 2026-05-22 cs.CL cs.AI 版本更新

Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark

利用通用医疗基准评估大型语言模型的临床能力

Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Junrong Chen, Yuhang Zhao, Shuang Li, Dingqian Wang, Zhu Lin, Xuanna Li, Yuze Tang, Jiexian Qiu, Xiaolin Lu, Hongji Yu, Shuang Chen, Yuhua Bi, Xiaofei Zeng, Yixian Chen, Lin Yao

Affiliations: The Sixth Affiliated Hospital of Sun Yat-sen University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Xinyi People’s Hospital; The Fifth Affiliated Hospital of Sun Yat-sen University; School of Public Health of Sun Yat-sen University

AI summary: This paper proposes a new evaluation framework and a general practice benchmark (GPBench) to assess the capabilities of large language models in general practice, and finds that current LLMs cannot be deployed autonomously in clinical care and require continuous human oversight.

AI中文摘要

大型语言模型（LLMs）在一般医疗实践中展现出了相当大的潜力。然而，现有的基准测试和评估框架主要依赖于考试式或简化的问题-答案格式，缺乏与一般医疗实践中实际临床责任相匹配的基于能力的结构。因此，LLMs能否可靠地履行一般医生（GPs）职责的范围仍然不确定。在本工作中，我们提出了一种新的评估框架，用于评估LLMs作为GPs的能力。基于此框架，我们引入了一个通用医疗基准（GPBench），其数据由领域专家根据常规临床实践标准进行细致标注。我们评估了十种最先进的LLMs，并分析了它们的能力。我们的发现表明，当前的LLMs不适合在临床一般实践中自主部署，所有实际应用都需要持续的人类监督；进一步针对GPs日常职责进行的特定优化仍至关重要。

英文摘要

Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not suitable for autonomous deployment in clinical general practice and that all realistic applications require continuous human oversight; further optimization specifically tailored to the daily responsibilities of GPs remains essential.

2503.02885 2026-05-22 cs.CY cs.CL cs.HC 版本更新

"Would You Want an AI Tutor?" Understanding Stakeholder Perceptions of LLM-based Systems in the Classroom

你希望有一个AI导师吗？理解基于大语言模型的系统在课堂中的利益相关者观点

Caterina Fuligni, Daniel Dominguez Figaredo, Armanda Lewis, Julia Stoyanovich

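Affiliations: Khan Academy; New York City Department of Education; Learning Innovation Catalyst

AI summary: This paper examines stakeholder perceptions of deploying LLM-based systems in the classroom and proposes a stakeholder-first framework to support more deliberate decision-making.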

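AI abstract (translated from Chinese)

Large language models (LLMs) have gained traction in educational settings, where they are often framed as virtual tutors or teaching assistants. After early skepticism and bans, many schools and universities have begun integrating these systems into curricula. Yet decisions about whether and how to deploy LLM-based tools are frequently made without systematic engagement with the full range of affected stakeholders. In this paper, we argue that understanding stakeholder perceptions of LLM-based systems in the classroom is not a matter of measuring approval or acceptance, but of identifying whose concerns are surfaced, in which contexts, and with what implications for responsible design and governance. We introduce Contextualized Perceptions for the Adoption of LLMs in Education (Co-PALE), a stakeholder-first framework that connects educational context, responsible AI principles, and categories of perception to support more deliberate decisions about adopting LLM-based tools. We ground Co-PALE in a targeted analysis of prior work, which diagnoses recurring gaps in how stakeholder perceptions are studied, and in contextually distinct educational scenarios that illustrate how the same technology raises different concerns for different stakeholders. We further examine, through focus groups, how university faculty and K-12 parents make sense of the framework, using their reflections to surface tensions and uncertainties. Co-PALE supports more systematic reasoning about whether, where, and for whom LLM-based tools should be deployed in education.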

英文摘要

Large Language Models (LLMs) have gained traction in educational settings, often framed as virtual tutors or teaching assistants. Following early skepticism and bans, many schools and universities have begun integrating these systems into curricula. Yet decisions about whether and how to deploy LLM-based tools are frequently made without systematic engagement with the full range of stakeholders they affect. In this paper, we argue that understanding stakeholder perceptions of LLM-based systems in the classroom is not a matter of measuring approval or acceptance, but of identifying whose concerns are surfaced, in which contexts, and with what implications for responsible design and governance. We introduce Contextualized Perceptions for the Adoption of LLMs in Education (Co-PALE), a stakeholder-first framework that connects educational context, responsible AI principles, and categories of perception to support more deliberate decision-making about the adoption of LLM-based tools. We ground Co-PALE through a targeted analysis of prior work to diagnose recurring gaps in how stakeholder perceptions are studied, and through contextually distinct educational scenarios that illustrate how the same technology raises different concerns for different stakeholders. We further examine how university faculty and K--12 parents make sense of the framework through focus groups, using their reflections to surface tensions and uncertainties. Co-PALE supports more systematic reasoning about whether, where, and for whom LLM-based tools should be deployed in education.

2605.22081 2026-05-22 cs.CL 版本更新

ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination

ArabDiscrim: 一个十年的阿拉伯语Facebook语料库，涉及种族主义和歧视

Wajdi Zaghouani, Shimaa Amer Ibrahim, Mabrouka Bessghaier, Houda Bouamor

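Affiliations: Northwestern University in Qatar; Carnegie Mellon University in Qatar

AI summary: This paper presents ArabDiscrim, a decade-long (2014-2024) corpus of 293,000 public Arabic Facebook posts for studying racism and discrimination. The corpus integrates platform-native engagement signals such as reactions, shares, comments, and page metadata, supporting joint analysis of language and audience response. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) and 20 discrimination axes capturing identity-based grounds for unequal treatment, and it provides explicit attribution patterns. Released under a restricted research-use license for ethical compliance, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it lays a foundation for fairness-oriented, platform-aware Arabic NLP.

Comments Accepted at LREC 2026 Main Conference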

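AI abstract (translated from Chinese)

We present ArabDiscrim, a decade-long lexical resource and corpus of 293K public Arabic Facebook posts (2014-2024) that discuss racism and discrimination. Unlike existing Twitter-centric datasets, ArabDiscrim integrates platform-native engagement signals, including reactions, shares, comments, and page metadata, which makes joint analysis of language and audience response possible. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13+ inflections per lemma), as well as 20 discrimination axes capturing identity-based grounds for unequal treatment. It also provides explicit attribution patterns. Released under a restricted research-use license for ethical compliance with platform terms, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it establishes a foundation for fairness-oriented, platform-aware Arabic NLP.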

英文摘要

We present ArabDiscrim, a decade-long lexical resource and corpus of 293K public Arabic Facebook posts (2014--2024) discussing racism and discrimination. Unlike existing Twitter-centric datasets, ArabDiscrim integrates platform-native engagement signals, including reactions, shares, comments, and page metadata, enabling joint analysis of language and audience response. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13+ inflections per lemma), and 20 discrimination axes capturing identity-based grounds for unequal treatment. It also provides explicit attribution patterns. Released under a restricted research-use license for ethical compliance with platform terms, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it establishes a foundation for fairness-oriented, platform-aware Arabic NLP.

2605.22074 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Hallucination as Commitment Failure: Larger LLMs Still Err When They Know the Answer

Jewon Yeom, Jaewon Sok, Heejun Kim, Seonghyeon Park, Jeongjae Park, Taesup Kim

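Affiliations: Graduate School of Data Science; Department of Rural Systems Engineering; Electrical Engineering and Computer Science; Department of Aerospace Engineering

AI summary: This paper studies why large LLMs hallucinate even when they know the correct answer, and finds that hallucination is determined by how probability is distributed over the correct concept at generation time, not by whether the correct concept is represented.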

AI中文摘要

幻觉通常被视为知识缺失的直接后果：当正确答案不在生成时的分布中，模型会错误回答；当正确答案存在时，模型会正确回答。我们通过引入一个语义上的答案可用性概念，聚合表达相同答案概念的token级变体，检验这一假设。在Qwen和Llama模型（0.8B至72B，包括Instruct和Base版本）中，16-47%的Instruct幻觉在模型承诺回答时已有显著概率质量在正确概念上，且随着规模增加，此比例单调上升。将此类失败与具有匹配语义支持的正确生成进行比较，发现区别不在于是否表示正确概念，而在于概率分布方式：正确生成将质量集中于单一表层形式，幻觉则将其分散到多个替代选项中。这种锐化不对称性在多token生成中也延伸，并在预生成隐藏状态中可检测到。这些结果识别出单一机制：指令微调通过规模锐化答案承诺，使有用性和自信幻觉成为同一底层倾向的两种结果。

英文摘要

Hallucination is often viewed as a direct consequence of missing knowledge: a model answers incorrectly when the correct answer is absent from its generation-time distribution, and correctly when it is present. We test this assumption by introducing a semantic notion of answer availability that aggregates token-level variants expressing the same answer concept, and asks whether the correct concept is already available at the moment the model commits to an answer. Across Qwen and Llama models from 0.8B to 72B in both Instruct and Base variants, 16-47% of Instruct hallucinations occur with substantial probability mass already on the correct concept, and the rate rises monotonically with scale. Comparing such failures against correct generations with matched semantic support, the distinguishing factor is not whether the correct concept is represented, but how its probability is distributed: correct generations concentrate mass on a single surface form, hallucinations disperse it across alternatives. The same sharpening asymmetry extends across multi-token generation and is detectable in pre-generation hidden states. Together, these results identify a single mechanism: instruction tuning sharpens answer commitment with scale, making helpfulness and confident hallucination two consequences of the same underlying disposition.
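
The idea of aggregating token-level variants into one answer concept can be sketched as follows. This is an illustrative toy under stated assumptions: the helper names `concept_mass` and `top_form_share`, the case-and-whitespace normalization, and the probabilities are all hypothetical, and the paper's actual aggregation procedure is not described in the abstract.

```python
def _matching_probs(next_token_probs, surface_forms):
    """Probabilities of tokens whose normalized text matches a surface form."""
    forms = {form.strip().lower() for form in surface_forms}
    return [
        prob
        for token, prob in next_token_probs.items()
        if token.strip().lower() in forms
    ]


def concept_mass(next_token_probs, surface_forms):
    """Total probability over tokens that express the same answer concept."""
    return sum(_matching_probs(next_token_probs, surface_forms))


def top_form_share(next_token_probs, surface_forms):
    """Share of the concept's mass held by its single most likely token."""
    matching = _matching_probs(next_token_probs, surface_forms)
    total = sum(matching)
    return max(matching) / total if total > 0 else 0.0


# Hypothetical next-token distribution at the moment of commitment.
probs = {"Paris": 0.15, " paris": 0.12, "PARIS": 0.08, "Lyon": 0.40, "The": 0.25}

committed = max(probs, key=probs.get)          # greedy choice
mass = concept_mass(probs, ["paris"])          # mass on the correct concept
share = top_form_share(probs, ["paris"])       # how concentrated that mass is
print(committed, round(mass, 2), round(share, 2))
```

In this toy case the correct concept holds substantial total mass, yet it is spread over three surface forms, so a single wrong token wins the greedy choice. That is the dispersed-mass pattern the abstract associates with hallucination.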

2605.22003 2026-05-22 cs.CL cs.AI cs.IR cs.LG 版本更新

From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification

从TF-IDF到Transformer：一种比较和集成的方法用于情感分类

Dip Biswas Shanto, Mitali Yadav, Prajwal Panth, Suresh Chandra Satapathy

Affiliations: School of Computer Engineering, KIIT Deemed to be University

AI summary: This paper compares several machine learning models for sentiment classification of movie reviews, including Naive Bayes, Logistic Regression, SVM, LightGBM, LSTM, and the transformer-based RoBERTa and DistilBERT; RoBERTa achieves the best accuracy, and a soft voting ensemble of all models also improves classification performance.

Comments 6 pages, 9 figures. This is the author's accepted manuscript, presented at the International Conference on Intelligent Computing, Networks and Security (IC-ICNS 2026), March 26-28, Bhubaneswar, India. Proceedings publication pending

AI中文摘要

情感分析，也称为观点挖掘，主要试图从任何基于文本的数据中提取观点。在电影评论和评论员的背景下，情感分析可以成为预测电影评论总体是积极还是消极的有用工具。对于ML模型来说，理解上下文或隐喻性情感可能具有挑战性，因为ML模型主要依赖统计词表示。本文的目标是检验并分类电影评论为积极或消极情感。为此考虑了多种机器学习模型，并运用自然语言处理（NLP）方法进行数据预处理和模型评估。使用IMDb数据集。具体来说，评估了Naive Bayes、逻辑回归、支持向量机（SVM）、LightGBM、LSTM以及基于Transformer的RoBERTa和DistilBERT等模型。经过大量测试，使用准确率、精确率、召回率、F1分数和ROC-AUC后，RoBERTa在所有其他模型之上表现更好，准确率为93.02%。一个结合所有模型的软投票集成方法也提高了分类性能，表明模型集成在情感分析中效果良好。

英文摘要

Sentiment analysis, also referred to as opinion mining, primarily tries to extract opinion from any text-based data. In the context of movie reviews and critics, sentimental analysis can be a helpful tool to predict whether a movie review is generally positive or negative. It can be difficult for the ML models to understand the context or metaphysical sentiment accurately, as ML models rely largely on statistical word representations. The objective of this paper is to examine and categorise movie reviews into positive and negative sentiments. Diverse machine learning models are considered in doing so, and Natural Language Processing (NLP) methodologies are employed for data preprocessing and model assessment. The IMDb dataset is used. Specifically, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), LightGBM, LSTM, and transformer-based models such as RoBERTa and DistilBERT were evaluated. After a lot of testing with accuracy, precision, recall, F1-score, and ROC-AUC, RoBERTa performed better than all the other models, with an accuracy of 93.02%. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.
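
Soft voting, as mentioned in the abstract, averages the class probabilities of the individual models instead of counting their predicted labels. The sketch below is a generic illustration with made-up probabilities; the function names `soft_vote` and `hard_vote` are hypothetical, and the paper's own ensemble weights and models are not reproduced here.

```python
def soft_vote(model_probs, weights=None):
    """Average per-class probabilities across models.

    model_probs holds one [p_negative, p_positive] list per model.
    Returns (predicted_label, averaged_probabilities).
    """
    if weights is None:
        weights = [1.0] * len(model_probs)
    total_weight = sum(weights)
    n_classes = len(model_probs[0])
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, model_probs)) / total_weight
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: averaged[c])
    return label, averaged


def hard_vote(model_probs):
    """Majority vote over each model's most likely label, for comparison."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in model_probs]
    return max(set(votes), key=votes.count)


# Hypothetical outputs for one review: two models lean weakly positive,
# one model is strongly negative.
review_probs = [[0.45, 0.55], [0.45, 0.55], [0.95, 0.05]]

soft_label, averaged = soft_vote(review_probs)
print(hard_vote(review_probs), soft_label)
```

Here majority voting picks the positive class while soft voting picks the negative one, because soft voting lets a confident model outweigh two uncertain ones.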

2605.22001 2026-05-22 cs.CR cs.AI cs.CL 版本更新

Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

守卫中的盲区：如何域伪装注入攻击在多智能体大语言模型系统中逃避检测

Aaditya Pai

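Affiliations: Data Science Institute; Columbia University

AI summary: This paper studies how domain-camouflaged injection attacks evade detection in multi-agent LLM systems by mimicking the domain vocabulary and authority structures of the target document. It characterizes the difference in detection rate between static and camouflaged payloads (the Camouflage Detection Gap, CDG) and shows both the amplification of static injection attacks by multi-agent debate architectures and the limited effectiveness of detector augmentation.

Comments 8 pages, 3 figures, 2 tables. Submitted to EMNLP 2026 ARR cycle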

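AI abstract (translated from Chinese)

Injection detectors deployed to protect LLM agents are calibrated on static, template-based payloads that openly announce themselves as override directives. We identify a systematic blind spot: when payloads are generated to mimic the domain vocabulary and authority structures of the target document, which we call domain-camouflaged injection, standard detectors fail to flag them, with detection rates dropping from 93.8% to 9.7% on Llama 3.1 8B and from 100% to 55.6% on Gemini 2.0 Flash. We formalize this as the Camouflage Detection Gap (CDG), the difference in injection detection rate between static and camouflaged payloads. Across 45 tasks spanning three domains and two model families, CDG is large and statistically significant (chi^2 = 38.03, p < 0.001 for Llama; chi^2 = 17.05, p < 0.001 for Gemini), with zero reverse discordant pairs in either case. We also evaluate Llama Guard 3, a production safety classifier, which detects none of the camouflaged payloads (IDRcamouflage = 0.000), confirming that the blind spot extends beyond few-shot detectors to dedicated safety classifiers. We further show that multi-agent debate architectures amplify static injection attacks by up to 9.9x on smaller models, while stronger models show collective resistance. Targeted detector augmentation provides only partial remediation (10.2% improvement on Llama, 78.7% on Gemini), suggesting that for weaker models the vulnerability is architectural rather than incidental. Our framework, task bank, and payload generator are publicly released.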

英文摘要

Injection detectors deployed to protect LLM agents are calibrated on static, template-based payloads that announce themselves as override directives. We identify a systematic blind spot: when payloads are generated to mimic the domain vocabulary and authority structures of the target document, what we call domain camouflaged injection, standard detectors fail to flag them, with detection rates dropping from 93.8% to 9.7% on Llama 3.1 8B and from 100% to 55.6% on Gemini 2.0 Flash. We formalize this as the Camouflage Detection Gap (CDG), the difference in injection detection rate between static and camouflaged payloads. Across 45 tasks spanning three domains and two model families, CDG is large and statistically significant (chi^2 = 38.03, p < 0.001 for Llama; chi^2 = 17.05, p < 0.001 for Gemini), with zero reverse discordant pairs in either case. We additionally evaluate Llama Guard 3, a production safety classifier, which detects zero camouflage payloads (IDRcamouflage = 0.000), confirming that the blind spot extends beyond few-shot detectors to dedicated safety classifiers. We further show that multi-agent debate architectures amplify static injection attacks by up to 9.9x on smaller models, while stronger models show collective resistance. Targeted detector augmentation provides only partial remediation (10.2% improvement on Llama, 78.7% on Gemini), suggesting the vulnerability is architectural rather than incidental for weaker models. Our framework, task bank, and payload generator are released publicly.
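
The Camouflage Detection Gap defined in the abstract is a difference of two detection rates, and the mention of discordant pairs suggests a paired test. The sketch below computes CDG on toy data and, as an assumption, uses McNemar's chi-square with continuity correction for the paired comparison. The abstract does not name the test, so the choice of McNemar, the helper names, and the toy flags are all hypothetical.

```python
def detection_rate(flags):
    """Fraction of payloads that the detector flagged (flags are booleans)."""
    return sum(flags) / len(flags)


def camouflage_detection_gap(static_flags, camouflaged_flags):
    """CDG: detection rate on static payloads minus rate on camouflaged ones."""
    return detection_rate(static_flags) - detection_rate(camouflaged_flags)


def discordant_counts(static_flags, camouflaged_flags):
    """Count pairs where only one of the two payload variants was detected."""
    pairs = list(zip(static_flags, camouflaged_flags))
    only_static = sum(1 for s, k in pairs if s and not k)
    only_camouflaged = sum(1 for s, k in pairs if k and not s)  # "reverse" pairs
    return only_static, only_camouflaged


def mcnemar_chi2(b, c):
    """McNemar chi-square with continuity correction (assumed test)."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)


# Toy paired outcomes for ten tasks: the same task attacked in both styles.
static_flags = [True] * 9 + [False]
camouflaged_flags = [True] * 2 + [False] * 8

gap = camouflage_detection_gap(static_flags, camouflaged_flags)
b, c = discordant_counts(static_flags, camouflaged_flags)
print(round(gap, 2), b, c, round(mcnemar_chi2(b, c), 3))
```

As a side observation, this corrected statistic evaluates to about 38.03 for 40 one-directional discordant pairs and about 17.05 for 19, which matches the two values quoted in the abstract. The abstract states neither the test nor the counts, so the match is suggestive only.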

2605.21984 2026-05-22 cs.AI cs.CL 版本更新

Echo: Learning from Experience Data via User-Driven Refinement

Echo：通过用户驱动的细化学习经验数据

Hande Dong, Xiaoyun Liang, Jiarui Yu, Jiayi Lin, Changqing Ai, Feng Liu, Wenjun Zhang, Rongbi Wei, Chaofan Zhu, Linjie Che, Feng Wu, Xin Shen, Dexu Kong, Xiaotian Wang, Qiuyuan Chen, Bingxu An, Yueting Lei, Qiang Lin

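Author notes: Core Contributors; Qiang Lin is the team leader

AI summary: This paper proposes Echo, a framework that turns raw experience data into learnable knowledge through user-driven refinement to improve model performance; experiments show that it raises the acceptance rate from 25.7% to 35.7%.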

AI中文摘要

静态的'人类数据'面临固有局限：扩展成本高且受制于创造者知识。持续学习'经验数据'——智能体与其环境的交互——有望超越这些障碍。如今，AI智能体的广泛应用使我们能够以低成本获取大量真实世界经验数据。然而，原始交互日志本质上嘈杂，充满试错和低信息密度，使其不适合直接用于模型训练。我们引入Echo，一个通用框架，旨在将原始经验转化为可学习的知识，有效将环境反馈回训练循环以优化模型。在当今智能体生态系统中，用户细化是主要的反馈来源：出于对结果的责任感，用户严格地将缺陷智能体提案转化为已验证的解决方案。这些用户驱动的细化序列本质上将智能体的粗略尝试提炼为高质量的训练信号。Echo系统性地收集这些信号，持续使智能体与真实世界需求对齐。在大规模生产代码补全环境中的验证表明，Echo有效利用这一流程，打破静态性能上限，将接受率从25.7%提升至35.7%。

英文摘要

Static "human data" faces inherent limitations: it is expensive to scale and bounded by the knowledge of its creators. Continuous learning from "experience data" - interactions between agents and their environments - promises to transcend these barriers. Today, the widespread deployment of AI agents grants us low-cost access to massive streams of such real-world experience. However, raw interaction logs are inherently noisy, filled with trial-and-error and low information density, rendering them inefficient for direct model training. We introduce Echo, a generalized framework designed to operationalize the transition from raw experience to learnable knowledge, effectively "echoing" environmental feedback back into the training loop for model optimization. In today's agent ecosystem, user refinement serves as a primary source of such feedback: driven by responsibility for the outcome, users rigorously transform flawed agent proposals into verified solutions. These user-driven refinement sequences inherently distill agents' crude attempts into high-quality training signals. Echo systematically harvests these signals to continuously align the agent with real-world needs. Large-scale validation in a production code completion environment confirms that Echo effectively harnesses this pipeline, breaking the static performance ceiling by increasing the acceptance rate from 25.7% to 35.7%.

2605.21965 2026-05-22 cs.CL 版本更新

Planning in the Era of Large Language Models: Building for Reliability and Efficiency

Michael Katz, Harsha Kokel, Kavitha Srinivas, Shirin Sohrabi

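AI summary: This paper discusses how the planning field is developing in the era of large language models, focusing on approaches that improve planning reliability and efficiency by generating verifiable symbolic solvers.

Comments Published at ICAPS 2026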

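AI abstract (translated from Chinese)

As intelligent agents attract growing attention, planning has come into focus as one of their central capabilities. Early attempts to use large language models (LLMs) for planning relied on single-shot plan generation, followed by hybrid approaches that coupled LLMs with limited external search. These methods are unsound and incomplete by nature and often require substantial resources without yielding better solutions on unseen problems. As the limitations of LLMs have become clearer, recent work has shifted toward using them at solution construction time: generating symbolic solvers for a family of problems that can be verified and then used efficiently at inference time. This trend reflects the growing need for agents that are both reliable and resource-efficient. It also offers a path toward maintainable planners with minimal dependence on language models at inference time. In this paper, we argue that this shift reflects a broader realignment of the planning field in the LLM era. We examine three major categories of planner-generation methods, discuss their current limitations, and outline research steps toward more reliable and efficient LLM-based generation of planners.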

英文摘要

Growing attention to intelligent agents has put a spotlight on one of their central capabilities: planning. Early attempts to leverage large language models (LLMs) for planning relied on single-shot plan generation, followed by hybrid approaches that coupled LLMs with limited external search. These methods, unsound and incomplete by their very nature, often require substantial resources without yielding better solutions on unseen problems. As the limitations of LLMs become clearer, recent work has shifted toward using them at solution construction time -- generating symbolic solvers for a family of problems that can be verified and then used efficiently at inference time. This trend reflects the growing need for agents that are both reliable and resource-efficient. It also offers a path towards generating maintainable planners with minimal dependence on language models at inference time. In this paper, we argue that this shift reflects a broader realignment of the planning field in the LLM era. We examine three major categories of planner-generation methods, discuss their current limitations, and outline research steps towards a more reliable and efficient LLM-based generation of planners.

2605.21858 2026-05-22 cs.CL 版本更新

Hypergraph as Language

超图作为语言

Mengqi Lei, Guohuan Xie, Shihui Ying, Shaoyi Du, Jun-Hai Yong, Siqi Li, Yue Gao

Affiliations: Tsinghua University; Yangtze Delta Region Institute; Shanghai Institute of Applied Mathematics and Mechanics; State Key Laboratory of Human-Machine Hybrid Augmented Intelligence; National Engineering Research Center for Visual Information and Applications; Institute of Artificial Intelligence and Robotics

AI summary: This paper proposes Hyper-Align, a hypergraph-native alignment framework for language models that converts hypergraph structures into hypergraph tokens an LLM can consume, handling high-order associations more effectively and achieving significant gains on structural modeling tasks.

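AI abstract (translated from Chinese)

Large language models (LLMs) have recently shown strong potential for modeling relational structures. However, existing approaches remain fundamentally graph-centric: they focus on converting pairwise graph structures into tokens that LLMs can understand. In contrast, many real-world relational patterns do not naturally fit the pairwise-edge assumption and are better modeled as high-order associations in hypergraphs. For hypergraph structures, existing methods often fail to preserve the native semantics that multiple objects are jointly connected by the same high-order relation, which limits their ability to exploit complex structures. To address this limitation, we put forth the "Hypergraph as Language" perspective and propose Hyper-Align, a hypergraph-native alignment framework for large language models. Hyper-Align compiles the query-object-centered hypergraph context into hypergraph tokens that a base LLM can consume directly. Specifically, we introduce the Hypergraph Incidence Detail Template with Overview (HIDT-O), which serializes high-order association structures into a fixed-shape hybrid template combining local incidence details with overview-level summaries. We then design a Hypergraph Incidence Projector (HIP), which maps native high-order incidence structures into the LLM token space through explicit semantic-structural decoupling and bidirectional message passing between vertices and hyperedges. We further define a concrete Hypergraph-as-Language input protocol that jointly feeds hypergraph tokens and textual prompts into a frozen base LLM, supporting both vertex-level and hyperedge-level tasks under a unified question-answering paradigm. To evaluate methods for hypergraph structural modeling systematically, we introduce HyperAlign-Bench. Extensive experiments show that Hyper-Align significantly outperforms existing methods in both in-domain and zero-shot evaluations.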

英文摘要

Large language models (LLMs) have recently shown strong potential in modeling relational structures. However, existing approaches remain fundamentally graph-centric: they focus on processing pairwise graph structures into tokens that LLMs can understand. In contrast, many real-world relational patterns do not naturally conform to the pairwise-edge assumption, and are better modeled as high-order associations in hypergraphs. For hypergraph structures, existing methods often fail to preserve the native semantics that multiple objects are jointly connected by the same high-order relation, limiting their ability to exploit complex structures. To address this limitation, we put forth the "Hypergraph as Language" perspective and propose Hyper-Align, a hypergraph-native alignment framework for large language models. Hyper-Align compiles the query-object-centered hypergraph context into hypergraph tokens directly consumable by a base LLM. Specifically, we introduce Hypergraph Incidence Detail Template with Overview (HIDT-O), which serializes high-order association structures into a fixed-shape hybrid template combining local incidence details and overview-level summaries. We then design a Hypergraph Incidence Projector (HIP), which maps native high-order incidence structures into the LLM token space through explicit semantic-structural decoupling and bidirectional message passing between vertices and hyperedges. We further define a concrete Hypergraph-as-Language input protocol, which jointly feeds hypergraph tokens and textual prompts into a frozen base LLM, supporting both vertex-level and hyperedge-level tasks under a unified question-answering paradigm. To systematically evaluate different methods in hypergraph structural modeling, we introduce HyperAlign-Bench. Extensive experiments show that Hyper-Align significantly outperforms existing methods across in-domain and zero-shot evaluations.

2605.21849 2026-05-22 cs.LG cs.CL 版本更新

Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

Zheyuan Zhang, Kaiwen Shi, Han Bao, Zehong Wang, Tianyi Ma, Yanfang Ye

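Affiliations: University of Notre Dame

AI summary: This paper proposes GCPO, a new policy optimization framework that captures semantic disagreement with geometry-aware measures and uses reward-based calibration to align uncertainty with learning signal strength, tracking gradient variability more faithfully and improving post-training performance.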

Abstract

Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish informative from noisy signals. Recent approaches leverage response-level measures as uncertainty signals to regulate group-based optimization methods such as GRPO. Yet their empirical success remains unstable, and it is unclear how they influence optimization dynamics. In this paper, we provide, to our knowledge, the first principled formulation that interprets uncertainty signals as mechanisms for characterizing and regulating gradient variance and learning signal quality. Based on both empirical and theoretical analysis, we identify two critical gaps in current entropy-based estimators: the anisotropic gap and the calibration gap. Motivated by this analysis, we propose Geometric-aware Calibrated Policy Optimization (GCPO), a novel framework that integrates geometry-aware measures, which capture semantic disagreement, with reward-based calibration, which aligns uncertainty with learning signal strength. Experiments on multiple benchmarks show that our approach more faithfully tracks gradient variability and consistently improves post-training performance. Our results highlight the importance of designing uncertainty signals that are aligned with optimization dynamics, offering a principled perspective for robust post-training.


2605.21796 2026-05-22 cs.CV cs.CL (updated version)

MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue

Anna Deichler, Jim O'Regan, Fethiye Irmak Dogan, Lubos Marcinek, Anna Klezovich, Iolanda Leite, Jonas Beskow

Affiliations: KTH Royal Institute of Technology

AI summary: This paper presents a multimodal dataset and benchmark for context-aware reference resolution in dynamic 3D environments. It introduces a benchmark built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, together with a two-stage grounding pipeline, and improves grounding performance in dialogue.

Comments Extended version of the paper published at LREC 2026 (Palma de Mallorca, Spain), with expanded VLM baselines and inter-annotator agreement analysis

DOI: 10.63317/37fzwjphsb9y
Journal ref: Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026), Palma de Mallorca, Spain

Abstract

Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing (1) a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, and (2) a two-stage grounding pipeline that explicitly resolves conversational ambiguity before visual localization. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types. Our contextual rewriting approach improves grounding performance by 11-22 percentage points on average, with a pure detector (GroundingDINO) reaching 56.7% on pronominals after rewriting, nearly double the best end-to-end baseline. Results demonstrate that decoupling linguistic reasoning from visual perception is more effective than end-to-end approaches for conversational grounding.


2605.21792 2026-05-22 cs.CL cs.AI cs.DB cs.LG (updated version)

Residual Skill Optimization for Text-to-SQL Ensembles

Jiongli Zhu, Haoquan Guan, Parjanya Prajakta Prashant, Nikki Lijing Kuang, Seyedeh Baharan Khatami, Canwen Xu, Xiaodong Yu, Yingyu Lin, Zhewei Yao, Yuxiong He, Babak Salimi

Affiliations: University of California, San Diego; Snowflake AI Research

AI summary: This paper proposes DivSkill-SQL, a residual skill optimization framework that builds complementary Text-to-SQL ensembles by optimizing each new skill on the examples the current skill ensemble fails on. This raises Pass@K, yields notable accuracy gains on Spider2-Lite, and gives consistent improvements across dialects and tasks.

Abstract

Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.
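The residual idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the paper optimizes new skills, whereas the toy below only selects from a fixed, hypothetical pool of skills whose per-example successes are already known. The function names `pass_at_k` and `greedy_residual_ensemble` and the data in `solved_by` are assumptions made for illustration. The point it shows is that choosing each member by its gain on still-failing examples raises empirical Pass@K more than choosing the individually strongest members.

```python
def pass_at_k(selected, solved_by, n_examples):
    """Empirical Pass@K: fraction of examples solved by at least one selected skill."""
    covered = set()
    for skill in selected:
        covered |= solved_by[skill]
    return len(covered) / n_examples


def greedy_residual_ensemble(solved_by, n_examples, k):
    """Pick up to k skills, each maximizing gain on examples the ensemble still fails."""
    selected, covered = [], set()
    for _ in range(k):
        residual = set(range(n_examples)) - covered  # examples no member solves yet
        best, best_gain = None, 0
        for skill in sorted(solved_by):
            if skill in selected:
                continue
            gain = len(solved_by[skill] & residual)  # marginal contribution to Pass@K
            if gain > best_gain:
                best, best_gain = skill, gain
        if best is None:  # no remaining skill adds coverage
            break
        selected.append(best)
        covered |= solved_by[best]
    return selected


# Hypothetical data: which of 8 examples each candidate skill solves.
solved_by = {
    "a": {0, 1, 2, 3},
    "b": {0, 1, 2},   # strong alone, but its failures are correlated with "a"
    "c": {4, 5},
    "d": {3, 6},
}
ensemble = greedy_residual_ensemble(solved_by, n_examples=8, k=2)
print(ensemble, pass_at_k(ensemble, solved_by, 8))   # ['a', 'c'] 0.75
print(pass_at_k(["a", "b"], solved_by, 8))           # 0.5
```

In this toy, pairing the two individually strongest skills ("a" and "b") covers only half the examples because they fail on the same ones, while the residual choice covers three quarters.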


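2605.21781 2026-05-22 cs.CL (updated version)

Reflective Prompt Tuning through Language Model Function-Calling

Farima Fatahi Bayat, Moin Aminnaseri, Pouya Pezeshkpour, Estevam Hruschka

Affiliations: Megagon Labs

AI summary: This paper proposes Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. A diagnostic function evaluates the target model and returns a structured diagnostic report, and the optimizer uses the history of reports to revise the prompt, improving prompt effectiveness and confidence calibration.

Comments 17 pages, 6 figures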

Abstract

Large language models (LLMs) have become increasingly capable of following instructions and complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, motivating automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. However, existing methods often search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, limiting their ability to capture systematic error patterns and make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. An LLM optimizer calls a diagnostic function that evaluates the target model over an entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in diagnostic feedback and final prompt selection. Across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points, remains competitive with state of the art, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, producing targeted prompt revisions that align with diagnosed failure patterns and lead to gains in task performance and calibration.


2605.21776 2026-05-22 cs.CL (updated version)

PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts

Juliette Woodrow, Chris Piech

Affiliations: Department of Computer Science, Stanford University

AI summary: This paper proposes PromptNCE, which frames conditional probability estimation as a contrastive task and adds an explicit OTHER category to recover the true conditional probability, enabling zero-shot pointwise mutual information estimation in low-data settings.

Abstract

Estimating mutual information from text usually requires training a task-specific critic, which limits its use in low-data settings. We ask whether large language models can instead estimate pointwise mutual information zero-shot, using only prompts and elicited probabilities. We introduce a benchmark with human-derived ground-truth PMI across three publicly available datasets, and evaluate five information-theoretic prompting-based estimators. Our main method, PromptNCE, frames conditional probability estimation as a contrastive task and augments the candidate set with an explicit OTHER category. We show theoretically that adding OTHER recovers the true conditional P(y | x) rather than just a ranking over listed candidates, turning a contrastive prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method on all three datasets, reaching Spearman correlation up to 0.82 with human-derived PMI. We also present a case study in computer science education showing how these estimators can be used to score student knowledge summaries in a low-data setting.
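A small numeric sketch may help show why the OTHER category matters for the quantities named in the abstract above. It is only an illustration of the arithmetic, not the paper's prompt design or proof. The elicited probabilities, the prior `prior_a`, and the helper names `normalize` and `pmi` are all hypothetical. PMI is taken in its standard form, log P(y | x) - log P(y).

```python
import math


def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}


def pmi(p_y_given_x, p_y):
    """Pointwise mutual information: log P(y | x) - log P(y)."""
    return math.log(p_y_given_x) - math.log(p_y)


# Hypothetical probabilities elicited from an LLM for one context x.
elicited = {"A": 0.2, "B": 0.1, "OTHER": 0.7}

# With OTHER, listed candidates keep their share of the full outcome space,
# so the value for "A" can be read as an estimate of P(y = A | x).
open_set = normalize(elicited)

# Without OTHER, renormalizing over the listed candidates gives only
# P(y = A | x, y is a listed candidate), which inflates every candidate.
closed_set = normalize({k: v for k, v in elicited.items() if k != "OTHER"})

prior_a = 0.1  # hypothetical marginal P(y = A)
pmi_open = pmi(open_set["A"], prior_a)
pmi_closed = pmi(closed_set["A"], prior_a)
print(round(pmi_open, 3), round(pmi_closed, 3))
```

The closed-set estimate preserves the ranking of A over B but overstates the PMI, which is the gap between a ranking and a probability estimate that the abstract describes.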


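2605.21748 2026-05-22 cs.CL (updated version)

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

Zhenwei Tang, Zhaoyan Liu, Rasa Hosseinzadeh, Tongzi Wu, Keyvan Golestan, Jesse C. Cresswell

Affiliations: Layer 6 AI

AI summary: This paper presents RankJudge, a synthetic benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations. It creates conversation pairs in which one conversation contains a single injected flaw, which enables strict evaluation of judging accuracy. It covers multiple domains, evaluates 21 frontier LLM judges, and finds the resulting judge rankings to be stable.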

Abstract

As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation, where LLMs are also used to judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations where one conversation has a single flaw injected into one turn. This construction allows paired conversations to be labeled unambiguously as better or worse, and precisely isolates failure categories to individual turns, enabling a strict joint correctness criterion for judging. We implement RankJudge across the domains of machine learning, biomedicine, and finance, evaluate 21 frontier LLM judges, and rank those judges via the Bradley-Terry model. Our formulation also allows ranking each conversation pair with difficulty ratings, which we use to dynamically curate the evaluation slice to reduce label noise, as confirmed via human annotation. We find that judge rankings are stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.
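For readers unfamiliar with the Bradley-Terry model mentioned above, here is a minimal sketch of fitting it from pairwise win counts with the standard minorization-maximization update. The abstract does not say how the paper builds its comparisons or which fitting procedure it uses, so the `wins` matrix and the function name `bradley_terry` are hypothetical, and this should be read as a generic illustration only.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths; wins[i][j] = times item i beat item j."""
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each pair (i, j) contributes its comparison count over p_i + p_j.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [v / norm for v in new_p]  # keep strengths on a fixed scale
    return p


# Hypothetical win counts among three judges (10 comparisons per pair).
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
print(ranking)  # [0, 1, 2]
```

Under the model, the probability that item i beats item j is strengths[i] / (strengths[i] + strengths[j]), so the fitted values give both a ranking and pairwise win probabilities.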


2605.21728 2026-05-22 cs.CV cs.CL cs.LG (updated version)

BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

Gonçalo Gomes, Bruno Martins, Chrysoula Zerva

Affiliations: Instituto Superior Técnico; INESC-ID; Instituto de Telecomunicações

AI summary: This paper proposes BEiTScore, a reference-free image captioning evaluation metric built on an efficient cross-encoder model. It addresses the computational cost and limited sensitivity of existing metrics and validates its performance across diverse scenarios.

Abstract

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as "bags of words." We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.


2605.21726 2026-05-22 cs.CL cs.AI (updated version)

Probabilistic Attribution For Large Language Models

Shilpika Shilpika, Carlo Graziani, Bethany Lusch, Venkatram Vishwanath, Michael E. Papka

Affiliations: Argonne Leadership Computing Facility; Argonne National Laboratory; Mathematics and Computer Science Division; Department of Computer Science

AI summary: This paper proposes a model-agnostic probabilistic token attribution measure that uses Bayes' rule to invert next-token log-probabilities, capturing the model's internal representation of the distribution over token sequences and improving the interpretability of large language models.

Comments 29 pages, 13 figures

Abstract

The generative nature of Large Language Models (LLMs) is reflected in the conditional probabilities they compute to sample each response token given the previous tokens. These probabilities encode the distributional structure that the model learns in training and exploits in inference. In this work, we use these probabilities to situate LLMs within the mathematical theory of stochastic processes. We use this framework to design a model-agnostic probabilistic token attribution measure, using Bayes' rule to invert the next-token log-probabilities so as to capture the model's internal representation of the distribution over token sequences. The representation is independent of the model's computational structure. This representation yields the conditional probability of the response given the prompt, and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities. We further compute the entropies of a single prompt's token distributions, conditioned on the remaining context. The interplay between entropy and attribution score sheds light on LLM behavior. We evaluate 8 models across 7 prompts and investigate anomalies, token sensitivity, response stability, model stability, and training convergence, thereby improving interpretability and guiding users to focus on uncertain or unstable parts of the generation.
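The attribution score described above, a log-ratio of two conditional probabilities, can be sketched on a toy example. This is one plausible reading under stated assumptions, not the paper's procedure: marginalizing a prompt token is modeled here as averaging the response probability over alternative tokens at that position, weighted by their probability given the remaining context. The dictionaries and the names `attribution_score` and `entropy` are hypothetical.

```python
import math


def attribution_score(p_resp_given_token, p_token_given_context, token):
    """log P(response | prompt) - log P(response | prompt with one token marginalized)."""
    # Assumed marginalization: sum over alternative tokens v at the position.
    marginal = sum(
        p_token_given_context[v] * p_resp_given_token[v]
        for v in p_token_given_context
    )
    return math.log(p_resp_given_token[token] / marginal)


def entropy(dist):
    """Shannon entropy (in nats) of a token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


# Hypothetical numbers for one prompt position with two possible tokens.
p_token_given_context = {"cat": 0.5, "dog": 0.5}  # P(token | rest of prompt)
p_resp_given_token = {"cat": 0.8, "dog": 0.2}     # P(response | prompt with that token)

score_cat = attribution_score(p_resp_given_token, p_token_given_context, "cat")
score_dog = attribution_score(p_resp_given_token, p_token_given_context, "dog")
position_entropy = entropy(p_token_given_context)
print(round(score_cat, 3), round(score_dog, 3), round(position_entropy, 3))
```

A positive score means the actual token makes the response more likely than a typical token at that position would, and a negative score means the opposite. The entropy indicates how constrained that position was by its context.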


2605.21713 2026-05-22 cs.CL (updated version)

Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews

André V. Duarte, Brian Tufts, Aditya Oke, Fei Fang, Arlindo L. Oliveira, Lei Li

Affiliations: Language Technologies Institute, Carnegie Mellon University

AI summary: This paper proposes Sem-Detect, which combines textual features with semantic analysis to distinguish AI-generated from human-written peer reviews. Experiments show strong performance in both the binary and three-class settings, with substantially improved detection accuracy.

Abstract

How can we distinguish whether a peer review was written by a human or generated by an AI model? We argue that, in this setting, authorship should not be attributed solely from the textual features of a review, but also from the ideas, judgments, and claims it expresses. To this end, we propose Sem-Detect, an authorship detection method for peer reviews that operationalizes this principle by combining textual features with claim-level semantic analysis. Sem-Detect compares a target review against multiple AI-generated reviews of the same paper, leveraging the observation that different AI models tend to converge on similar points, while human reviewers introduce more unique and diverse ones. As a result, Sem-Detect is able to distinguish fully AI reviews from authentic human-written ones, including those that have been refined using an LLM but still reflect human judgment. Across a dataset of over 20,000 peer reviews from ICLR and NeurIPS conferences, Sem-Detect improves over the strongest baseline by 25.5% in TPR@0.1% FPR in the binary setting. Moreover, in the three-class scenario, we empirically show that LLM refinement preserves the semantic signals of human reviews, which remain distinct from the patterns exhibited by fully AI-generated text; as a result, fewer than 3.5% of LLM-refined human reviews are misclassified as AI-generated.


2605.21712 2026-05-22 cs.CL (updated version)

Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries

Mahdi Azhdari, Eric J. Gonzales

Affiliations: Department of Civil and Environmental Engineering, University of Massachusetts Amherst

AI summary: This paper presents a schema-grounded natural language interface that uses a large language model to interpret user intent while keeping execution deterministic and reviewable, addressing uneven access to transportation safety data. By integrating crash records, roadway attributes, and geospatial data, it aims to strengthen public-sector safety planning capacity.

Comments 30 pages, 5 figures

Abstract

Transportation safety analysis requires integrating crash records, roadway attributes, and geospatial data through GIS-based workflows, but access remains uneven across agencies and community stakeholders. Technical prerequisites create a gap between analytical tools central to safety planning and the practitioners able to use them. Local agencies, school committees, and residents may have safety concerns but limited capacity to retrieve, filter, map, and analyze relevant data. Generative AI offers a way to narrow this divide, but its public-sector use raises questions about reliability, reproducibility, and governance. This paper presents a schema-grounded natural language interface for transportation safety analysis, using a large language model (LLM) to interpret user intent while preserving deterministic, reviewable execution against an authoritative database. User queries are translated into structured semantic frames, validated by a rule-based layer, compiled into a typed directed acyclic graph of spatial operations, and executed against a PostGIS database. This bounded design separates language interpretation from deterministic execution, keeping results reproducible and schema-grounded while removing access barriers. The framework is evaluated using a statewide Massachusetts transportation safety database integrating crash records, roadway attributes, and geospatial layers including schools, bus stops, crosswalks, and municipal boundaries. All queries executed successfully; the validation layer corrects errors in 29% of evaluation queries, reflecting the gap between flexible natural language and strict schema-grounded requirements. The results suggest that combining natural language accessibility with deterministic execution is a practical direction for broadening access to transportation safety data, with implications for trustworthy AI in public-sector planning.


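2605.21699 2026-05-22 cs.LG cs.CL (updated version)

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

Sharath Turuvekere Sreenivas, Adithyakrishna Venkatesh Hanasoge, Mingyu Yang, Ali Taghibakhshi, Saurav Muralidharan, Ashwath Aithal, Pavlo Molchanov

Affiliations: NVIDIA

AI summary: This paper proposes X-Token, a projection-guided cross-tokenizer knowledge distillation method that addresses the shortcomings of existing approaches when transferring knowledge between models with different tokenizers, improving distillation through two complementary loss functions.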

Abstract

Cross-tokenizer knowledge distillation allows a student model to learn from teachers with incompatible vocabularies. Prior work operates on hidden states or logits; the latter is preferred as a drop-in replacement requiring no auxiliary components. Logit-based methods either use only the correct-token probability, missing the full 'dark knowledge' in the teacher's distribution, or operate on the full output distribution, relying on strict token partitioning and/or unprincipled heuristic ranking. We identify two key shortcomings of full-distribution, logit-based methods: (i) an uncommon-token failure, where critical tokens fall into the unmatched subset (e.g., Llama's 1100 multi-digit numerals under digit-splitting Qwen supervision) and are suppressed during training, reducing GSM8k from 12.89 to 2.56 compared to same-tokenizer KD from a weaker teacher; and (ii) over-conservative matching, where strict 1-to-1 matching excludes near-equivalent tokens across surface forms. These failures require distinct remedies: eliminating the partition when critical tokens are misaligned, and refining it when alignment is reliable. We propose X-Token, an approach with two complementary loss formulations targeting these issues. P-KL removes partitioning and aligns the student's distribution with the teacher's via a sparse projection matrix W (initialized from tokenizer-level string rules) to address the uncommon-token failure. H-KL retains the hybrid form while relaxing matching to align each student token with its top-ranked teacher mapping under W. Both objectives share W and extend naturally to multiple teachers. Empirically, on Llama-3.2-1B, X-Token outperforms the current state of the art GOLD by +3.82 average points with a Qwen3-4B teacher and by +0.5 with a Phi-4-Mini teacher. Further, a two-teacher setup (Phi-4-mini + Llama-3B) improves over single-teacher distillation by +1.3 points.
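The projection step described for P-KL above can be sketched as follows. This is a toy under assumptions, not the paper's loss: the direction of the KL divergence, the row layout of W, and the rule that maps the multi-digit student token "1100" to the teacher's token for its first digit are all hypothetical choices made for illustration, as are the names `project` and `kl`.

```python
import math


def project(student_probs, W):
    """Map student-vocabulary probabilities into the teacher vocabulary.

    W is stored sparsely: W[student_token] = {teacher_token: weight}.
    """
    out = {}
    for s_tok, p in student_probs.items():
        for t_tok, w in W.get(s_tok, {}).items():
            out[t_tok] = out.get(t_tok, 0.0) + p * w
    return out


def kl(p, q, eps=1e-12):
    """KL(p || q) over the support of p, with q floored at eps."""
    return sum(
        pv * math.log(pv / max(q.get(k, 0.0), eps))
        for k, pv in p.items()
        if pv > 0
    )


# Hypothetical vocabularies: the student has a multi-digit token "1100",
# while the teacher splits numbers into single digits.
student_probs = {"1100": 0.6, "the": 0.3, "a": 0.1}
W = {
    "1100": {"1": 1.0},  # illustrative string rule: first digit of the numeral
    "the": {"the": 1.0},
    "a": {"a": 1.0},
}
projected = project(student_probs, W)

teacher_agrees = {"1": 0.6, "the": 0.3, "a": 0.1}
teacher_differs = {"1": 0.5, "the": 0.4, "a": 0.1}
loss_agree = kl(teacher_agrees, projected)
loss_differ = kl(teacher_differs, projected)
print(loss_agree, round(loss_differ, 4))
```

Because every student token has a row in W, the numeral token contributes to the loss instead of falling into an unmatched subset, which is the failure case the abstract attributes to strict partitioning.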


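2605.21654 2026-05-22 cs.LG cs.AI cs.CL (updated version)

Value-Gradient Hypothesis of RL for LLMs

Arip Asadulaev, Daniil Ognev, Karim Salta, Martin Takac

Affiliations: MBZUAI

AI summary: This paper proposes a value-gradient perspective to explain why critic-free reinforcement learning methods are effective in LLM post-training. By analyzing the actor update and automatic differentiation through attention, it derives a decomposition of RL impact into value-gradient signal and reachable reward headroom.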

Abstract

Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.


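2605.21653 2026-05-22 cs.LG cs.AI cs.CL (updated version)

Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction

Alexander Smirnov

Affiliations: University College London

AI summary: This study argues that fine-tuned AI text detectors amplify a pretrained direction rather than learn an AI-vs-human boundary. It finds that fine-tuning can reduce discrimination in some cases, that the axis behaves differently on non-native writing, and that a closed-form Jacobian predictor is effective across architectures.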

Abstract

AI text detectors amplify a pretrained typicality axis; they do not construct an AI-vs-human boundary. On raw encoders before any task supervision, projecting onto centroid(AI)-centroid(HC3) achieves NYT-vs-HC3 AUROC 0.806/0.944/0.834 across three architectures (86-106% of the fine-tuned discrimination ceiling: on RoBERTa-base, raw projection exceeds fine-tuning); on RoBERTa-base, full fine-tuning reduces discrimination below raw on both fluent-formal populations tested. The same axis inverts on non-native ESL writing (AUROC 0.06-0.20) -- a falsifiable prediction unique to the typicality reading. A 24-example frozen probe matches full fine-tuning (0.900 vs 0.895). A closed-form Jacobian predictor parameterises axis-manipulating interventions with R^2 = 1.000 universal, lifts ELECTRA-CE deployment TPR from 0.000 to 0.904 at FPR = 1%, and transfers to three independently-trained third-party RoBERTa detectors at 16/16 oracle-equivalence (57% NYT-FPR reduction on the OpenAI detector). Scope: encoder family; mechanism magnitude HC3-anchored; population-level shared axis with per-text mechanisms varying across architectures. Three operationally distinct probes -- text-surface caps_rate residualisation, geometric signed-epsilon ablation, closed-form text-pair predictor -- agree at cos 0.74/0.81/1.00 across three architectures, confirming observer-invariance. Under matched-TPR-0.90 evaluation, the published intervention zoo (CC, dealign-f2c) is calibration-equivalent across 27 cells (|Delta AUROC| <= 0.0081), and >= 97% of the LoRA->full-FT bias gap on ELECTRA is calibration shift, not learned representation -- the central claim's prediction confirmed.
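The centroid-difference projection and the AUROC readings quoted above can be illustrated with a toy sketch. It assumes the axis is the difference between two class centroids in embedding space, which is one reading of the abstract's "centroid(AI)-centroid(HC3)". The two-dimensional vectors stand in for encoder embeddings and are entirely hypothetical, as are the helper names.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def project(vector, axis):
    """Dot product of a vector with the axis (an unnormalized projection score)."""
    return sum(a * b for a, b in zip(vector, axis))


def auroc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative, ties counted as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical 2-D "embeddings" for AI-written and human-written texts.
ai = [[2.0, 0.0], [3.0, 1.0]]
human = [[0.0, 0.0], [1.0, 1.0]]
axis = [a - h for a, h in zip(centroid(ai), centroid(human))]

ai_scores = [project(v, axis) for v in ai]
human_scores = [project(v, axis) for v in human]
separable = auroc(ai_scores, human_scores)

# A hypothetical human population that lies further along the same axis:
# the detector direction now ranks these human texts above the AI texts.
shifted_human = [[4.0, 0.0], [5.0, 1.0]]
inverted = auroc(ai_scores, [project(v, axis) for v in shifted_human])
print(axis, separable, inverted)  # [2.0, 0.0] 1.0 0.0
```

An AUROC well below 0.5, as in the second case, corresponds to the inversion the abstract reports for non-native ESL writing: the axis still separates the groups, but in the opposite direction.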


2605.21649 2026-05-22 cs.LG cs.CL (updated version)

EntmaxKV: Support-Aware Decoding for Entmax Attention

Gonçalo Duarte, Miguel Couceiro, Marcos V. Treviso

Affiliations: Instituto Superior Técnico, Universidade de Lisboa; ELLIS Unit Lisbon; INESC-ID; Instituto de Telecomunicações

AI summary: This paper proposes EntmaxKV, a support-aware decoding framework that exploits the sparsity of entmax attention before KV pages are loaded. It combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention to reduce dropped probability mass and make long-context language models more efficient.

Abstract

Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, α-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass δ, showing that output error is controlled by δ and vanishes when the entmax support is recovered. We further introduce a Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax-based sparse decoding at matched KV budgets. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to 3.36× (softmax) and 5.43× (entmax) speedup over full attention baselines at 1M context length. Code available at: https://github.com/deep-spin/entmaxkv.
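The support-recovery claim above can be checked on a toy example. The sketch uses sparsemax, which is the α = 2 special case of α-entmax, rather than general α-entmax, and it uses scalar values instead of value vectors. The scores, values, and function names are hypothetical, and none of this reflects the paper's kernels or page selection.

```python
def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): a projection onto the simplex."""
    z = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        if 1 + k * zk > cumsum:   # the k-th largest score is still in the support
            tau = (cumsum - 1) / k
        else:
            break
    return [max(s - tau, 0.0) for s in scores]


def attend(scores, values, keep):
    """Attention output using only the cache positions listed in `keep`."""
    weights = sparsemax([scores[i] for i in keep])
    return sum(w * values[i] for w, i in zip(weights, keep))


scores = [2.0, 1.5, 0.1, -1.0]   # hypothetical attention logits
values = [1.0, 2.0, 3.0, 4.0]    # hypothetical scalar values

full_weights = sparsemax(scores)                  # [0.75, 0.25, 0.0, 0.0]
full = attend(scores, values, [0, 1, 2, 3])
covering = attend(scores, values, [0, 1, 3])      # subset contains the support {0, 1}
missing = attend(scores, values, [0, 2])          # subset drops support index 1
print(full, covering, missing)                    # 1.25 1.25 1.0
```

When the kept positions include every index with nonzero weight, the sparse result equals the full result exactly. Dropping index 1 loses its probability mass of 0.25, and the output shifts accordingly.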

2605.21625 2026-05-22 cs.CV cs.AI cs.CL 版本更新

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Flat-Pack Bench: 通过家具组装评估大视觉-语言模型的时空理解

Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan

发表机构 * Cornell University（康奈尔大学）； Cornell Tech（康奈尔科技）； MBZUAI（穆罕默德·本·扎耶德人工智能大学）； UC Berkeley（加州大学伯克利分校）

AI总结本文提出Flat-Pack Bench基准，用于评估大视觉-语言模型在复杂视频场景中的时空理解能力，发现当前模型在细粒度时空推理上存在显著不足。

Comments CVPR 2026

AI中文摘要

大视觉-语言模型（LVLMs）的出现显著提升了视频理解能力。然而，现有基准主要集中在粗粒度任务，如动作分割、分类、描述和检索，且这些基准通常依赖于易于口头识别的实体，如家庭物品、动物、人类主体等，限制了其在复杂真实视频场景中的适用性。但许多应用，如家具组装、烹饪等，需要对视频进行逐步细粒度的时空理解，而当前基准并未充分评估。为解决这一差距，我们引入了Flat-Pack Bench，一个专注于家具组装任务的新基准。我们的基准评估LVLMs在细微任务上的表现，包括组装动作的时间顺序、组装状态的时间定位、理解部件配合和追踪，使用多选问题配以视觉提示突出相关部分作为参考，以回答细粒度问题。我们的实验表明，最先进的LVLMs在细粒度时空推理上表现显著不足，凸显了其在有效利用视频时间信息、跟踪能力和理解空间交互（如物理接触）方面的局限性。

英文摘要

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, like household objects, animals, human subjects, etc., limiting their applicability to complex, in-the-wild video scenarios. But, many applications such as furniture assembly, cooking, etc., require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in effectively leveraging temporal information from videos, limited tracking ability, and understanding of spatial interactions like physical contact.

2605.21609 2026-05-22 cs.CL cs.AI cs.CY 版本更新

CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety

CR4T：基于重写的青少年LLM安全机制

Heajun An, Qi Zhang, Vedanth Achanta, Jin-Hee Cho

发表机构 * Virginia Tech（弗吉尼亚理工大学）

AI总结本文提出CR4T框架，通过选择性响应重构替代拒绝导向的安全机制，以更符合青少年发展需求的方式提升LLM的安全性。

AI中文摘要

大型语言模型（LLMs）越来越多地嵌入青少年的数字环境，介导信息搜索、建议和情感敏感的互动。然而，现有安全机制仍主要基于成人中心的规范，并通过拒绝导向的压制来实现安全。尽管这些方法可能减少即时的政策违规，但它们也可能导致对话死胡同、限制建设性指导，并未能解决青少年与AI互动中固有的发展脆弱性。我们主张，青少年LLM安全不应仅被视为过滤问题，而应被视为一种社会技术、发展一致的转变问题。为实现这一视角，我们提出了Critique-and-Revise-for-Teenagers（CR4T），一种模型无关的安全保障框架，该框架可选择性地将不安全或拒绝式输出重构为适合年龄的指导性响应，同时保持善意意图。CR4T结合轻量级风险检测与领域条件重写，以去除风险放大内容，减少不必要的对话关闭，并引入适合发展的指导。实验结果表明，针对重写显著减少了不安全和拒绝导向的结果，同时避免了对可接受互动的不必要的干预。这些发现表明，选择性响应重构为青少年面向的LLM系统提供了一种更以人为本的替代方案，以替代以拒绝为中心的安全机制。

英文摘要

Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in adult-centric norms and operationalize safety through refusal-oriented suppression. While such approaches may reduce immediate policy violations, they can also create conversational dead-ends, limit constructive guidance, and fail to address the developmental vulnerabilities inherent in adolescent-AI interactions. We argue that adolescent LLM safety should be framed not solely as a filtering problem, but as a socio-technical, developmentally aligned transformation problem. To operationalize this perspective, we propose Critique-and-Revise-for-Teenagers (CR4T), a model-agnostic safeguarding framework that selectively reconstructs unsafe or refusal-style outputs into age-appropriate, guidance-oriented responses while preserving benign intent. CR4T combines lightweight risk detection with domain-conditioned rewriting to remove risk-amplifying content, reduce unnecessary conversational shutdown, and introduce developmentally appropriate guidance. Experimental results show that targeted rewriting substantially reduces unsafe and refusal-oriented outcomes while avoiding unnecessary intervention on acceptable interactions. These findings suggest that selective response reconstruction offers a more human-centered alternative to refusal-centric guardrails for adolescent-facing LLM systems.

2605.21558 2026-05-22 cs.LG cs.CL 版本更新

From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

从参数到数据：一种任务参数引导的微调流水线用于高效的LLM对齐

Hao Chen, Qi Zhang, Liyao Li, Zhanming Shen, Wentao Ye, Lirong Gao, Ningtao Wang, Xing Fu, Xiaoyu Shen, Junbo Zhao

发表机构 * Zhejiang University（浙江大学）； Eastern Institute of Technology（东部技术研究所）

AI总结本研究提出了一种任务参数引导的微调流水线，通过任务敏感的注意力头作为双指南，实现样本挖掘和结构剪枝，从而提高LLM对齐的效率。

Comments Accepted@ICML26, 28 pages, 11 figures, 26 tables

AI中文摘要

适应大型语言模型（LLM）到专业领域通常会带来高数据和计算开销。尽管先前的效率努力大多将数据选择和参数高效微调视为孤立过程，我们的实证分析表明它们可能本质上是耦合的。我们提出了强映射假说：稀疏的注意力头子集在任务特定适应中起主导作用，作为解锁特定数据模式的钥匙。基于这一观察，我们提出了从参数到数据（P2D）统一框架，利用这些任务敏感的注意力头作为双指南，用于样本挖掘和结构剪枝。为了严格量化整个流程的成本，我们引入了对齐效率比率（AER）指标，用于衡量选择延迟和训练时间。机理上，P2D通过轻量级代理识别关键头，并利用它们作为功能性过滤器来精选高亲和力数据，建立协同流程。经验上，通过更新仅10%的注意力头在10%的数据上，P2D在强基线基础上实现了8.3个百分点的性能提升，并提供了7.0倍的端到端时间加速。这些结果验证了精确的参数-数据同步消除了冗余，提供了一种新的高效对齐范式。

英文摘要

Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated processes, our empirical analysis suggests they may be intrinsically coupled. We posit the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation, acting as keys that unlock specific data patterns. Building on this observation, we propose From Parameters to Data (P2D), a unified framework that leverages these task-sensitive attention heads as a dual compass for both sample mining and structural pruning. To rigorously quantify the total pipeline cost, we introduce the Alignment Efficiency Ratio (AER) metric for both selection latency and training time. Mechanistically, P2D identifies critical heads via a lightweight proxy and uses them as a functional filter to curate high-affinity data, establishing a synergistic pipeline. Empirically, by updating merely 10% of attention heads on 10% of the data, P2D achieves an 8.3 pp performance gain over strong baselines and delivers a 7.0x end-to-end time speedup. These results validate that precise parameter-data synchronization eliminates redundancy, offering a new paradigm for efficient alignment.

2605.21540 2026-05-22 cs.SI cs.AI cs.CL cs.CY 版本更新

通用偏好强化学习

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi

发表机构 * Stanford University（斯坦福大学）； The University of Oklahoma（俄克拉荷马大学）

AI总结本文提出通用偏好强化学习（GPRL），通过引入通用偏好模型（GPM）解决传统强化学习在开放任务中连续探索不足的问题，通过多维偏好比较提升模型性能。

AI中文摘要

后训练阶段已将大型语言模型（LLM）对齐分成两条基本互不相连的路线。基于可验证奖励的在线强化学习（RL）推动了数学和代码上的涌现推理，但依赖于无法覆盖开放式任务的程序化验证器；而偏好优化能够处理开放式生成，却放弃了驱动在线RL的持续探索。弥合这一差距需要一个面向开放式质量的验证器，但标量奖励模型并不适合这一任务。质量是多维的，任何标量分数都只是不完整的代理，会使在线RL坍缩到分数最敏感的那个维度上。我们转而采用通用偏好模型（GPM），它将响应嵌入到k个斜对称子空间中，并将偏好表示为结构化的、能够刻画非传递性的比较。在此基础上，我们提出通用偏好强化学习（GPRL），将这一k路结构贯穿到策略更新中。GPRL计算每个维度的组相对优势，在各自的尺度上分别归一化，使任何维度都无法主导，并用上下文相关的特征值进行聚合。同一结构还支撑了一个闭环漂移监视器，它能检测单一维度上的利用行为，并通过重新加权各维度和收紧信任区域进行即时纠正。以Llama-3-8B-Instruct为起点，GPRL在AlpacaEval 2.0上达到56.51%的长度控制胜率，同时凭借在长时间训练中抵御奖励黑客行为，在Arena-Hard、MT-Bench和WildBench上优于SimPO和SPPO。

英文摘要

Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
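
The per-dimension normalization step can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes each response in a group already has one scalar score per preference dimension and that the aggregation weights are given, whereas the abstract describes pairwise preferences and context-dependent eigenvalues. The function name and inputs are hypothetical.

```python
from statistics import mean, pstdev


def per_dimension_advantages(scores, weights, eps=1e-8):
    """Group-relative advantages, normalized per dimension, then aggregated.

    scores[i][d] is the (hypothetical) score of response i on dimension d;
    weights[d] stands in for the context-dependent aggregation weights.
    """
    n_dims = len(weights)
    columns = [[row[d] for row in scores] for d in range(n_dims)]
    normalized = []
    for column in columns:
        mu, sigma = mean(column), pstdev(column)
        # Each dimension is centred and scaled on its own statistics,
        # so a dimension with a large raw range cannot dominate.
        normalized.append([(x - mu) / (sigma + eps) for x in column])
    return [
        sum(weights[d] * normalized[d][i] for d in range(n_dims))
        for i in range(len(scores))
    ]


# Two responses scored on two dimensions with very different raw scales.
group = [[1.0, 100.0], [3.0, 300.0]]
advantages = per_dimension_advantages(group, weights=[0.5, 0.5])
```

Both dimensions contribute equally after normalization even though the second one has raw values a hundred times larger.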

2605.15588 2026-05-22 cs.CL cs.LG 版本更新

Calibrating LLMs with Semantic-level Reward

通过语义层面奖励校准大型语言模型

Fengfei Yu, Ruijia Niu, Dongxia Wu, Yian Ma, Rose Yu

发表机构 * Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, USA（加州大学圣地亚哥分校计算机科学与工程系，拉霍亚，加利福尼亚州，美国）； Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, California, USA（加州大学圣地亚哥分校Halıcıoğlu数据科学研究所，拉霍亚，加利福尼亚州，美国）； Department of Statistics, Stanford University, Stanford, California, USA（斯坦福大学统计学系，斯坦福，加利福尼亚州，美国）

AI总结本文提出了一种新的校准框架CSR，通过在语义空间中直接校准语言模型，避免了传统方法中因口头化置信度导致的不一致问题，实验显示CSR在多个数据集上均能有效降低ECE并提高AUROC。

AI中文摘要

随着大型语言模型（LLMs）被应用于医疗问答和法律推理等关键领域，估计其输出何时可能正确的能力对于安全可靠的使用至关重要，这要求模型具有良好校准的不确定性。标准的可验证奖励强化学习（RLVR）使用二元正确性奖励训练模型，该奖励与置信度无关，不会惩罚自信但错误的预测，从而损害校准效果。最近的研究通过训练模型在给出答案的同时输出口头化置信度分数，并奖励其与正确性的一致来解决这一问题。然而，口头化置信度是在token层面校准的，因此在语义相同但文本表述不同的情况下会表现出不一致。我们提出Calibration with Semantic Reward（CSR），一种直接在语义空间中校准语言模型的框架，无需口头化置信度接口。CSR将正确性奖励与一种新的语义校准奖励相结合：在正确的rollout之间通过促进语义一致来鼓励利用，在错误的rollout之间通过抑制虚假一致来鼓励探索。在三个模型家族上，针对HotpotQA（分布内）以及TriviaQA、MSMARCO和NQ-Open（分布外）的实验表明，CSR在几乎所有设置中都比口头化置信度基线取得了更低的ECE和更高的AUROC，ECE最多降低40%，AUROC最多提高31%，且校准行为在全部四个评估设置中都能稳健泛化。

英文摘要

As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to 40% and improving AUROC by up to 31% over verbalized-confidence baselines, with calibration behavior generalizing robustly across all four evaluation settings.
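
For readers unfamiliar with the ECE metric reported above, here is a minimal sketch of the standard equal-width-bin Expected Calibration Error. The bin count and the example inputs are assumptions for illustration; the abstract does not state which binning the authors use.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Four answers, all stated with 0.9 confidence, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Here the model is overconfident by the gap between its stated confidence (0.9) and its accuracy (0.75), and the ECE equals that gap because all predictions fall in one bin.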

2605.12623 2026-05-22 cs.CL cs.CV cs.LG 版本更新

DocAtlas: Multilingual Document Understanding Across 80+ Languages

DocAtlas: 跨80多种语言的多语言文档理解

Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan

发表机构 * MBZUAI（穆罕默德·本·扎耶德人工智能大学）； IBM Research（IBM研究院）

AI总结本文提出DocAtlas框架，通过构建高保真的OCR数据集和基准测试，覆盖82种语言和9个评估任务，利用双重管道生成精确的结构注解，展示了直接偏好优化在多语言适应中的有效性，提升了领域内和领域外的准确率。

Comments Under submission

AI中文摘要

多语言文档理解在低资源语言中受限于稀缺的训练数据和基于模型的标注流程，这些流程会加剧现有偏见。我们引入DocAtlas，一个构建覆盖82种语言和9个评估任务的高保真OCR数据集和基准测试的框架。我们的双重管道，包括本地DOCX文档的差异渲染和针对从右到左脚本的合成LaTeX生成，生成统一的DocTag格式注解，编码布局、文本和组件类型，无需学习模型进行核心注解。评估16种最先进的模型揭示了低资源脚本中的持续差距。我们展示直接偏好优化（DPO）使用渲染派生的真实情况作为正信号，实现了稳定的多语言适应，提高了领域内（+1.9%）和领域外（+1.8%）的准确性，而监督微调会导致领域外性能下降高达21%。我们的最佳变体，DocAtlas-DeepSeek，在最强基线基础上提高了+1.7%。代码可在https://github.com/ahmedheakl/DocAtlas获取。

英文摘要

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, where supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline. Code is available at https://github.com/ahmedheakl/DocAtlas .

2605.09252 2026-05-22 cs.CL 版本更新

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, Tsui-Wei Weng

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）

AI总结本文提出When2Tool基准，通过18个环境研究工具调用的必要性，发现模型已能识别何时需要调用工具，但生成时未能有效利用此知识，提出Probe&Prefill方法显著减少工具调用。

英文摘要

Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5$\times$ higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool

2604.08571 2026-05-22 cs.LG cs.AI cs.CL 版本更新

非自回归生成的离散随机定位

Yunshu Wu, Jiayi Cheng, Longxuan Yu, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis, Greg Ver Steeg

发表机构 * University of California Riverside（加州大学河滨分校）； New York University（纽约大学）

AI总结本文提出了一种名为离散随机定位（DSL）的连续状态框架，通过单位球体令牌嵌入实现最优去噪，从而在离散序列生成中提升分布忠实度，并展示了其在OpenWebText上的有效性。

AI中文摘要

连续扩散是一种非自回归生成的自然框架，但在离散序列生成中通常落后于掩码离散扩散模型（MDMs）。我们认为瓶颈不在于连续性本身，而在于一种表示方式，其中去噪依赖于时间步索引的噪声模式。我们引入了离散随机定位（DSL），一种具有单位球体令牌嵌入的连续状态框架，其贝叶斯最优去噪器在定位信道下对名义信号噪声比（SNR）具有不变性。一个训练好的网络可以支持整个SNR路径家族，端点掩码扩散路径是特殊情况。对预训练MDLM检查点进行微调可显著提升OpenWebText在所有步预算（从T=128到T=1024）下的分布忠实度（MAUVE），并且同一检查点支持随机顺序自回归采样，以及使用最少T=48总步数的混合连续-然后-离散采样器，无需蒸馏或重新训练。

英文摘要

Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from $T=128$ to $T=1024$, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as $T=48$ total steps -- without distillation or retraining.

2602.15338 2026-05-22 cs.LG cs.CL 版本更新

Discovering Implicit Large Language Model Alignment Objectives

发现隐式大语言模型对齐目标

Edward Chen, Sanmi Koyejo, Carlos Guestrin

发表机构 * Stanford University（斯坦福大学）

AI总结本文提出Obj-Disco框架，通过自动分解对齐奖励信号为可解释的目标，解决现有方法的不足，验证了框架在多种任务和模型上的鲁棒性，并发现潜在的对齐偏差。

Comments ICML 2026

AI中文摘要

大语言模型（LLM）对齐依赖于复杂的奖励信号，这些信号往往模糊了被激励的具体行为，导致对齐风险和奖励黑客问题。现有解释方法通常依赖预定义的准则，可能遗漏“未知的未知”，或无法识别全面覆盖和因果影响模型行为的目标。为了解决这些限制，我们引入Obj-Disco框架，该框架能够自动将对齐奖励信号分解为稀疏、加权的可解释自然语言目标的组合。我们的方法利用迭代贪心算法分析训练检查点的行为变化，识别并验证最佳解释残差奖励信号的候选目标。在多种任务、模型大小和对齐算法上的广泛评估证明了框架的鲁棒性。对流行开源奖励模型的实验表明，框架一致捕获超过90%的奖励行为，这一发现进一步得到人类评估的证实。此外，对开源奖励模型对齐的案例研究显示，Obj-Disco能够成功识别伴随预期行为出现的潜在偏移激励。我们的工作提供了一种关键工具，用于揭示LLM对齐中的隐式目标，为更透明和安全的AI发展铺平道路。

英文摘要

Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of "unknown unknowns", or fail to identify objectives that comprehensively cover and are causal to the model behavior. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our approach utilizes an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness. Experiments with popular open-source reward models show that the framework consistently captures > 90% of reward behavior, a finding further corroborated by human evaluation. Additionally, a case study on alignment with an open-source reward model reveals that Obj-Disco can successfully identify latent misaligned incentives that emerge alongside intended behaviors. Our work provides a crucial tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.

2601.10348 2026-05-22 cs.CL cs.AI cs.LG 版本更新

Training-Trajectory-Aware Token Selection

基于训练轨迹的token选择

Zhanming Shen, Jiaqi Hu, Zeyu Qin, Hao Chen, Wentao Ye, Zenan Huang, Yihong Zhuang, Guoshan Lu, Junlin Zhou, Junbo Zhao

发表机构 * Zhejiang University（浙江大学）； Hong Kong University of Science and Technology（香港科技大学）

AI总结本文提出T3S方法，通过在token层面重构训练目标，清除未学习token的优化路径，从而在连续蒸馏中提升性能，实验表明在AR和dLLM设置中均取得显著效果。

Comments Accepted by ICML 2026

AI中文摘要

高效的蒸馏是将昂贵的推理能力转化为可部署效率的关键途径，然而在前沿领域中，当学生模型已具备较强的推理能力时，朴素的连续蒸馏往往产生有限的收益甚至退化。我们观察到一种训练特征现象：即使损失单调下降，所有性能指标在几乎相同的瓶颈处会突然大幅下降，然后逐渐恢复。我们进一步揭示了token层面的机制：置信度会分裂成稳步增加的模仿锚点token，快速锚定优化，以及尚未学习的token，其置信度被抑制直到瓶颈之后。这两种类型token无法共存的特性是连续蒸馏失败的根本原因。为此，我们提出了基于训练轨迹的token选择（T3S）方法，以在token层面重建训练目标，清除未学习token的优化路径。T3S在AR和dLLM设置中均取得一致的收益：仅用数百个示例，Qwen3-8B在竞争性推理基准上超越DeepSeek-R1，Qwen3-32B接近Qwen3-235B，且T3训练的LLaDA-2.0-Mini超越其AR基线，达到所有16B级模型中的最先进性能。

英文摘要

Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. And the characteristic that these two types of tokens cannot coexist is the root cause of the failure in continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3S yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all of 16B-scale no-think models.

2511.08093 2026-05-22 eess.AS cs.CL cs.SD 版本更新

Quantizing Whisper-small: How design choices affect ASR performance

对Whisper-small的量化：设计选择如何影响语音识别性能

Arthur Söhler, Julian Irigoyen, Andreas Søeborg Kirkedal

发表机构 * Copenhagen Business School（哥本哈根商学院）； Danske Bank（丹麦银行）； Jabra (GN Group)（Jabra（GN集团））

AI总结本文研究了不同量化方案对Whisper-small模型性能的影响，发现动态int8量化在模型压缩和识别准确率之间取得了最佳平衡，同时展示了通过精心选择量化方法可以显著减少模型大小和推理成本，从而在受限硬件上实现高效部署。

Comments Accepted to SPEAKABLE workshop at LREC 2026

AI中文摘要

大型语音识别模型如Whisper-small虽然能实现高精度，但其高计算需求使其难以在边缘设备上部署。为此，我们对Whisper-small上的训练后量化（PTQ）进行了统一的跨库评估，以分离量化方案、方法、粒度和位宽各自的影响。我们的研究基于四个库：PyTorch、Optimum-Quanto、HQQ和bitsandbytes。在LibriSpeech test-clean和test-other上的实验表明，使用Quanto的动态int8量化提供了最佳权衡，将模型大小减少57%，同时词错误率优于基线。静态量化表现较差，可能与Whisper的Transformer架构有关，而更激进的格式（如nf4、int3）实现了高达71%的压缩，但代价是在嘈杂条件下准确率下降。总体而言，我们的结果表明，精心选择的PTQ方法可以在不重新训练的情况下显著减少模型大小和推理成本，从而在受限硬件上实现Whisper-small的高效部署。

英文摘要

Large speech recognition models like Whisper-small achieve high accuracy but are difficult to deploy on edge devices due to their high computational demand. To this end, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the impact of quantization scheme, method, granularity, and bit-width. Our study is based on four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate. Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining, enabling efficient deployment of Whisper-small on constrained hardware.
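
As background on what int8 quantization does to a tensor, the following is a minimal sketch of symmetric absmax quantization in plain Python. It is a generic illustration and not the Quanto, HQQ, or bitsandbytes API; the function names and sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor absmax quantization to the integer range [-127, 127]."""
    # Fall back to a scale of 1.0 when every value is zero.
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]       # illustrative weight values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`). Dynamic quantization, as evaluated in the paper, additionally computes activation scales at inference time rather than fixing them in advance.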

2511.07885 2026-05-22 cs.DC cs.AI cs.CL cs.LG 版本更新

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

每瓦智能：衡量本地AI的智能效率

Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, Christopher Ré

发表机构 * Stanford University（斯坦福大学）； Together AI

AI总结本文研究了本地AI在能源效率和性能上的表现，提出了一种统一的衡量指标IPW，展示了本地推理在重新分配需求方面的能力，并揭示了本地加速器的优化潜力。

AI中文摘要

大型语言模型（LLM）查询主要由集中式云基础设施中的前沿模型处理。需求增长对这一范式造成的压力超过了提供商的扩容速度。两项进展创造了重新思考这一范式的机会：小型本地LM（<=20B活跃参数）如今在许多任务上能取得与前沿模型相当的表现，而本地加速器（如Apple M4 Max）可以以交互式延迟运行这些模型。这引出了一个问题：本地推理能否切实分担集中式基础设施的需求？回答这一问题需要同时衡量本地LM能否准确回答现实查询，以及能否在功率受限的设备（如笔记本电脑）上高效地做到这一点。我们提出每瓦智能（IPW），即单位功率下的任务准确率，作为衡量不同模型-加速器配置下本地推理能力与效率的统一指标。我们评估了20多个最先进的本地LM、8种硬件加速器（本地和云）以及100万条现实单轮聊天和推理查询。对于每个查询，我们测量准确率（本地LM相对前沿模型的胜率）、能耗、延迟和功率。我们得到三个关键结果。首先，本地LM成功回答了88.7%的这些查询，准确率因领域而异。其次，2023-2025年的纵向分析显示IPW提高了5.3倍，由算法和加速器两方面的进步驱动，本地可服务的查询覆盖率从23.2%上升到71.3%。第三，在运行相同模型时，本地加速器的IPW比云加速器至少低1.4倍，表明本地加速器仍有很大的优化空间。这些发现表明，对于相当一部分查询，本地推理可以切实分担集中式基础设施的需求，而IPW是跟踪这一转变的关键指标。

英文摘要

Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Demand growth strains this paradigm faster than providers can scale. Two advances create an opportunity to rethink it: small, local LMs (<=20B active parameters) now achieve competitive performance to frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) can host these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? This requires measuring both whether local LMs can accurately answer real-world queries and whether they can do so efficiently on power-constrained devices (e.g., laptops). We propose intelligence per watt (IPW), task accuracy per unit of power, as a unified metric for the capability and efficiency of local inference across model-accelerator configurations. We evaluate 20+ state-of-the-art local LMs, 8 hardware accelerators (local and cloud), and 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy (local LM win rate against frontier models), energy, latency, and power. We find three key results. First, local LMs successfully answer 88.7% of these queries, with accuracy varying by domain. Second, longitudinal analysis from 2023-2025 shows IPW improved 5.3x, driven by both algorithmic and accelerator advances, with locally-serviceable query coverage rising from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for local accelerator optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure for a substantial subset of queries, with IPW serving as the critical metric for tracking this transition.
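
The IPW definition (task accuracy per unit of power) reduces to a simple ratio, sketched below. The function name and all numbers are hypothetical placeholders chosen for illustration, not measurements from the paper.

```python
def intelligence_per_watt(accuracy, avg_power_watts):
    """Task accuracy divided by average power draw, following the IPW definition."""
    if avg_power_watts <= 0:
        raise ValueError("average power must be positive")
    return accuracy / avg_power_watts


# Hypothetical configurations: same accuracy, different power draw.
ipw_a = intelligence_per_watt(accuracy=0.8, avg_power_watts=40.0)
ipw_b = intelligence_per_watt(accuracy=0.8, avg_power_watts=80.0)
ratio = ipw_a / ipw_b   # how many times more efficient configuration A is
```

Comparing two configurations at equal accuracy isolates the hardware's contribution, which is how a statement such as "1.4x lower IPW on identical models" should be read.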

2510.07962 2026-05-22 cs.CL cs.AI 版本更新

LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

LightReasoner: 小型语言模型能否教会大型语言模型推理？

Jingyuan Wang, Yankai Chen, Zhonghang Li, Chao Huang

发表机构 * University of Hong Kong（香港大学）； University of Chicago（芝加哥大学）

AI总结本文提出LightReasoner框架，通过利用强专家模型与弱业余模型之间的行为差异，发现高价值推理时刻，从而提升大型语言模型的推理能力，同时减少资源消耗。

Comments Updated to ACL 2026 camera-ready version with improved method presentation, expanded related work discussion, additional analyses, and presentation refinements

AI中文摘要

大型语言模型（LLMs）在推理任务上取得了显著进展，通常通过监督微调（SFT）实现。然而，SFT过程资源消耗大，依赖大规模定制数据集、拒绝采样演示和对所有token的统一优化，尽管只有少量token具有实际学习价值。在本工作中，我们探索了一个反直觉的想法：小型语言模型（SLMs）能否通过揭示高价值推理时刻来教会大型语言模型（LLMs）其独特优势？我们提出了LightReasoner，一种新的框架，利用强专家模型（LLM）与弱业余模型（SLM）之间的行为差异。LightReasoner分为两个阶段：（1）采样阶段通过专家-业余对比确定关键推理时刻，并构建捕捉专家优势的监督示例；（2）微调阶段将专家模型与这些提炼出的示例对齐，放大其推理优势。在七个数学基准测试中，LightReasoner将准确性提高了28.1%，同时将时间消耗减少了90%，采样问题减少了80%，调优token使用减少了99%，且不依赖真实标签。通过将弱SLMs转化为有效的教学信号，LightReasoner提供了一种可扩展且资源高效的提升LLM推理能力的方法。代码可在：https://github.com/HKUDS/LightReasoner获取。

英文摘要

Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter's unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert's advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: https://github.com/HKUDS/LightReasoner
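
One way to picture the expert-amateur contrast in the sampling stage is to flag the generation steps where the two models' next-token distributions diverge strongly. The sketch below uses KL divergence with a fixed threshold; both choices are assumptions made for illustration, since the abstract only speaks of "behavioral divergence", and the helper names and toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        p_i * math.log((p_i + eps) / (q_i + eps))
        for p_i, q_i in zip(p, q)
        if p_i > 0
    )


def select_critical_steps(expert_dists, amateur_dists, threshold):
    """Indices of steps where the expert diverges from the amateur by more than `threshold`."""
    return [
        step
        for step, (p, q) in enumerate(zip(expert_dists, amateur_dists))
        if kl_divergence(p, q) > threshold
    ]


# Toy two-token vocabulary: the models agree at step 0 and disagree at step 1.
expert = [[0.5, 0.5], [0.9, 0.1]]
amateur = [[0.5, 0.5], [0.1, 0.9]]
critical = select_critical_steps(expert, amateur, threshold=0.5)
```

Only the steps returned in `critical` would then be turned into supervision examples, which is consistent with the abstract's claim that a small fraction of tokens carries most of the learning value.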


2508.03865 2026-05-22 cs.CL (updated version)

An Entity Linking Agent for Question Answering

Yajie Luo, Yihong Wu, Muzhi Li, Jia Ao Sun, Xinyu Wang, Liheng Ma, Yingxue Zhang, Jian-Yun Nie

Affiliations: The Chinese University of Hong Kong; McGill University; Mila - Quebec AI Institute; Huawei Noah’s Ark Lab

AI summary: This paper proposes an entity linking agent based on a large language model that links entities in the short, ambiguous user questions typical of question answering, and validates its effectiveness in two experiments.

Comments 12 pages, 2 figures

2507.03674 2026-05-22 cs.CL cs.AI (updated version)

STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

Tek Raj Chhetri, Yibei Chen, Puja Trivedi, Dorota Jarecka, Saif Haobsh, Patrick Ray, Lydia Ng, Satrajit S. Ghosh

Affiliations: McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA; Fylo Labs Inc., New York, NY, USA; Allen Institute for Brain Science, Seattle, WA, USA

AI summary: This paper proposes the STRUCTSENSE framework, which makes structured information extraction more robust by integrating ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation, and demonstrates cross-task generalization in three domains.

Comments -

Abstract

Extracting structured information from scientific literature is critical for accelerating discovery, yet Large Language Models (LLMs) often struggle in specialized domains that require expert knowledge and generalize poorly across tasks. We introduce StructSense, a modular, task-agnostic, open-source framework that integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust domain-aware extraction. We evaluate StructSense on three tasks of increasing semantic complexity: schema-based extraction of assessment instruments (91-100% accuracy), metadata and resource extraction from scientific papers (86-93% overall), and named entity recognition (NER) from neuroscience literature (58-75% label accuracy across 8,882 entities). On two biomedical NER benchmarks (NCBI Disease and S800 Species), the system achieves ≥90% relaxed recall and 62.5-85.8% strict recall while extracting 1,000-3,600 additional entities beyond gold annotations. The local concept mapping service achieves Hits@1 of 62-82% under strict matching and 68-86% under semantic matching. These results across three domains demonstrate that StructSense generalizes across tasks while maintaining source grounding and provenance transparency.
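The gap between relaxed and strict recall reported above is easier to read with a concrete scoring sketch. The abstract does not define the two matching modes, so the definitions here (strict means identical span boundaries, relaxed means any character overlap, both with the same label) and all names and data are assumptions for illustration only.

```python
def overlaps(a, b):
    """True if half-open character spans a=(start, end) and b=(start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def recall(gold, predicted, relaxed=False):
    """Fraction of gold entities recovered by the predictions.

    Each entity is a (start, end, label) tuple. Strict matching requires
    identical boundaries; relaxed matching accepts any overlap. Both modes
    require the same label.
    """
    if not gold:
        return 0.0
    hits = 0
    for g_start, g_end, g_label in gold:
        for p_start, p_end, p_label in predicted:
            if g_label != p_label:
                continue
            same_span = (g_start, g_end) == (p_start, p_end)
            if same_span or (relaxed and overlaps((g_start, g_end), (p_start, p_end))):
                hits += 1
                break
    return hits / len(gold)

# Hypothetical annotations: the second prediction has a boundary error,
# the third has the wrong label, and the fourth is not in the gold set.
gold = [(0, 12, "Disease"), (20, 28, "Disease"), (35, 41, "Species"), (50, 58, "Disease")]
pred = [(0, 12, "Disease"), (18, 28, "Disease"), (35, 41, "Disease"), (70, 75, "Species")]

strict = recall(gold, pred)                 # exact boundaries only
relaxed = recall(gold, pred, relaxed=True)  # boundary errors forgiven
```

Note that the extra prediction at (70, 75) does not change recall in either mode, which is why a system can report high recall while also extracting entities beyond the gold annotations.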


2506.19500 2026-05-22 cs.AI cs.CL cs.LG (updated version)

NaviAgent: Graph-Driven Bilevel Planning for Scalable Tool Orchestration

Yan Jiang, Hao Zhou, Lizhong GU, Tianlong Li, Ruinan Jin, Wanqi Zhou, Ai Han

Affiliations: Department of Electrical and Computer Engineering, The Ohio State University, USA

AI summary: This paper proposes NaviAgent, a graph-driven bilevel planning framework that decouples task planning from tool execution to make large-scale tool orchestration more scalable and robust; experiments show strong task success rates and good performance in real-world use.

Comments Accepted to ICML 2026

Journal ref: Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026

Abstract

Large Language Models (LLMs) increasingly act as function-call agents that invoke external tools to tackle tasks beyond their static knowledge. However, they typically invoke tools one at a time without a global view of task structure. As tools often depend on one another, this leads to error accumulation and poor scalability, particularly when scaling to hundreds or thousands of tools. To address these limitations, we propose NaviAgent, an explicit bilevel architecture that decouples task planning from tool execution through graph-based modeling of tool relations. At the planning level, the LLM-based agent decides whether to respond directly, clarify intent, or retrieve and execute a toolchain independent of inter-tool complexity. At the execution level, a Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, steering the agent to compose scalable and robust invocation sequences. Incorporating feedback from real tool interactions, NaviAgent achieves closed-loop alignment between planning and execution, enabling adaptive navigation in large-scale tool ecosystems. Evaluations on API-Bank and ToolBench show consistent improvements in task success rate (TSR), with TWNM yielding an average gain of 13.1 points on complex tasks. Further tests on 50 real APIs across 7 domains show consistent gains of 4.3-12.0 points, with fewer steps and lower latency, demonstrating robust generalization under real-world dynamics.
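To see why a global view of tool dependencies helps, consider a minimal sketch that composes a toolchain from a hand-written dependency graph. This is not the TWNM described in the abstract: the tool names and the graph are hypothetical, and a plain topological sort from Python's standard `graphlib` module stands in for the learned navigation model.

```python
from graphlib import TopologicalSorter

# Hypothetical tool graph: each tool maps to the tools whose outputs it needs.
TOOL_DEPENDENCIES = {
    "geocode_city": set(),
    "get_weather": {"geocode_city"},
    "find_hotels": {"geocode_city"},
    "plan_trip": {"get_weather", "find_hotels"},
    "convert_currency": set(),
}

def required_tools(target, graph):
    """Collect the target tool and every tool it transitively depends on."""
    needed, stack = set(), [target]
    while stack:
        tool = stack.pop()
        if tool not in needed:
            needed.add(tool)
            stack.extend(graph[tool])
    return needed

def plan_toolchain(target, graph):
    """Return an invocation order in which every tool follows its prerequisites."""
    needed = required_tools(target, graph)
    subgraph = {tool: graph[tool] & needed for tool in needed}
    return list(TopologicalSorter(subgraph).static_order())

# Unrelated tools such as convert_currency are left out of the chain.
chain = plan_toolchain("plan_trip", TOOL_DEPENDENCIES)
```

Planning the whole chain up front means a missing prerequisite is caught before any call is made, instead of surfacing as an error several invocations later.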


2502.09487 2026-05-22 cs.CL cs.AI cs.LG (updated version)

Internal narratives parameterise affective states

Jakub Onysk, Quentin J. M. Huys

Affiliations: Applied Computational Psychiatry Lab; Max Planck UCL Centre for Computational Psychiatry and Ageing Research; Queen Square Institute of Neurology and Mental Health; Neuroscience Department; Division of Psychiatry

AI summary: By quantifying large-language-model representations of participants' internal narratives and their subspaces, this paper studies the relationship between narrative and affective state. It finds that verbal descriptions of symptom-specific thoughts predict standardised depression scores, and stresses that preserving the covariance between symptoms is essential for construct validity.

Abstract

Characterising how we verbalise our feelings is central to psychological assessment and intervention, yet the mapping between narrative and affective state remains poorly understood. Across two large studies (n=1257), we parameterised the structure and dynamics of depressive states by quantifying participants' internal narratives through large-language-model representations and their subspaces. In Study 1, we found verbal descriptions of symptom-specific thoughts captured granular information predictive of standardised, self-reported depression scores. Critically, we show preserving the specific covariance between symptoms is essential for construct validity, suggesting high-dimensional text representations mirror the latent geometry of the disorder. Study 2 probed the temporal dynamics of this relationship as participants engaged with emotional narratives. We found quantified changes in internal narratives led to changes in self-report, while the baseline narrative severity predicted the magnitude of subsequent affective change. By framing affect as a computational state, our results highlight its core, therapeutically pertinent functions: constraining the structure of internal narratives and integrating context to shape self-report.


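2410.04753 2026-05-22 cs.AI cs.CL cs.LG cs.LO (updated version)

ImProver: Agent-Based Automated Proof Optimization

Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck

Affiliations: Carnegie Mellon University

AI summary: This paper studies automated proof optimization and proposes ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary criteria such as length or readability; experiments show it can make proofs substantially shorter, more modular, and more readable.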

Comments Published as a conference paper at ICLR 2025

Abstract

Large language models (LLMs) have been used to generate formal proofs of mathematical theorems in proof assistants such as Lean. However, we often want to optimize a formal proof with respect to various criteria, depending on its downstream use. For example, we may want a proof to adhere to a certain style, or to be readable, concise, or modularly structured. Having suitably optimized proofs is also important for learning tasks, especially since human-written proofs may not be optimal for that purpose. To this end, we study a new problem of automated proof optimization: rewriting a proof so that it is correct and optimizes for an arbitrary criterion, such as length or readability. As a first method for automated proof optimization, we present ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary user-defined metrics in Lean. We find that naively applying LLMs to proof optimization falls short, and we incorporate various improvements into ImProver, such as the use of symbolic Lean context in a novel Chain-of-States technique, as well as error-correction and retrieval. We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter, more modular, and more readable.


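2403.03920 2026-05-22 cs.AI cs.CL cs.HC (updated version)

Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational Artifacts

Zewei Tian, Min Sun, Alex Liu, Shawon Sarkar, Jing Liu

Affiliations: University of Washington; University of Maryland

AI summary: This paper explores the transformative potential of computer-assisted textual analysis for enhancing instructional quality through in-depth analysis of educational artifacts. Drawing on Richard Elmore's Instructional Core Framework, it examines how AI and machine learning methods, particularly natural language processing (NLP), can analyze educational content, teacher discourse, and student responses to foster instructional improvement, and it identifies key advantages of AI/ML in teacher coaching, student support, and content development.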

DOI: 10.4337/9781035321544.00019

Abstract

This paper explores the transformative potential of computer-assisted textual analysis in enhancing instructional quality through in-depth insights from educational artifacts. We integrate Richard Elmore's Instructional Core Framework to examine how artificial intelligence (AI) and machine learning (ML) methods, particularly natural language processing (NLP), can analyze educational content, teacher discourse, and student responses to foster instructional improvement. Through a comprehensive review and case studies within the Instructional Core Framework, we identify key areas where AI/ML integration offers significant advantages, including teacher coaching, student support, and content development. We unveil patterns that indicate AI/ML not only streamlines administrative tasks but also introduces novel pathways for personalized learning, providing actionable feedback for educators and contributing to a richer understanding of instructional dynamics. This paper emphasizes the importance of aligning AI/ML technologies with pedagogical goals to realize their full potential in educational settings, advocating for a balanced approach that considers ethical considerations, data quality, and the integration of human expertise.


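2402.11621 2026-05-22 cs.CL (updated version)

Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Detection

Valeria Pastorino, Jasivan A. Sivakumar, Nafise Sadat Moosavi

AI summary: This paper studies large language models for framing detection, analyzing how different models perform under zero-shot, few-shot, and explanation-based prompting. It finds that performance is sensitive to prompt design and prone to systematic errors on ambiguous cases, and it introduces a new dataset of out-of-domain news headlines for more realistic evaluation.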

Journal ref: Proceedings of the 3rd Workshop on Natural Language Processing for Political Sciences (PoliticalNLP 2026) @ LREC 2026, pages 17-28

Abstract

The growing complexity and diversity of news coverage have made framing analysis a crucial yet challenging task in computational social science. Traditional approaches, including manual annotation and fine-tuned models, remain limited by high annotation costs, domain specificity, and inconsistent generalisation. Instruction-based large language models (LLMs) offer a promising alternative, yet their reliability for framing analysis remains insufficiently understood. In this paper, we conduct a systematic evaluation of several LLMs, including GPT-3.5/4, FLAN-T5, and Llama 3, across zero-shot, few-shot, and explanation-based prompting settings. Focusing on domain shift and inherent annotation ambiguity, we show that model performance is highly sensitive to prompt design and prone to systematic errors on ambiguous cases. Although LLMs, particularly GPT-4, exhibit stronger cross-domain generalisation, they also display systematic biases, most notably a tendency to conflate emotional language with framing. To enable principled evaluation under real-world topic diversity, we introduce a new dataset of out-of-domain news headlines covering diverse subjects. Finally, by analysing agreement patterns across multiple models on existing framing datasets, we demonstrate that cross-model consensus provides a useful signal for identifying contested annotations, offering a practical approach to dataset auditing in low-resource settings.
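The idea of using cross-model consensus to surface contested annotations can be sketched as a simple majority-vote check. The abstract does not describe the exact procedure, so the agreement threshold, the function name, and the toy labels below are assumptions rather than the authors' method.

```python
from collections import Counter

def contested_items(gold_labels, model_predictions, min_agreement=0.75):
    """Flag items where most models agree with each other but not with the gold label.

    gold_labels: list of labels, one per item.
    model_predictions: list of per-model prediction lists, aligned with gold_labels.
    Returns a list of (item_index, gold_label, consensus_label, agreement).
    """
    flagged = []
    n_models = len(model_predictions)
    for i, gold in enumerate(gold_labels):
        votes = Counter(preds[i] for preds in model_predictions)
        consensus, count = votes.most_common(1)[0]
        agreement = count / n_models
        if consensus != gold and agreement >= min_agreement:
            flagged.append((i, gold, consensus, agreement))
    return flagged

# Hypothetical frame labels for four headlines and four models.
gold = ["economic", "morality", "security", "economic"]
predictions = [
    ["economic", "security", "security", "economic"],  # model A
    ["economic", "security", "security", "morality"],  # model B
    ["economic", "security", "morality", "economic"],  # model C
    ["economic", "security", "security", "security"],  # model D
]

# Only the second headline is flagged: every model disagrees with its gold label.
flagged = contested_items(gold, predictions)
```

Flagged items are candidates for human re-inspection, not automatic relabeling, since the models may share the same systematic bias.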


2401.00139 2026-05-22 cs.AI cs.CL cs.LG stat.ME (updated version)

Enhancing Causal Reasoning in Large Language Models: A Causal Attribution Model for Precision Fine-Tuning

Hengrui Cai, Shengjie Liu, Rui Song

Affiliations: University of California, Irvine; North Carolina State University; Amazon

AI summary: This paper proposes a causal attribution model that improves the interpretability and causal reasoning of large language models through precise fine-tuning, and demonstrates its effectiveness on causal discovery tasks across domains.

Comments A Python implementation of our proposed method is available at https://github.com/ncsulsj/Causal_LLM

Abstract

This paper introduces a causal attribution model to enhance the interpretability of large language models (LLMs) and improve their causal reasoning abilities via precise fine-tuning. Despite LLMs' proficiency in diverse tasks, their reasoning processes often remain a black box, which restricts targeted enhancement. We propose a novel causal attribution model that utilizes "do-operators" for constructing interventional scenarios, allowing us to systematically quantify the contribution of different components in LLMs' causal reasoning process. By assessing the proposed attribution scores through causal discovery tasks across various domains, we demonstrate that LLMs' effectiveness in causal discovery relies heavily on provided context and domain-specific knowledge, although LLMs can also use numerical data through limited calculations that capture correlation rather than causation. This motivates the proposed fine-tuned LLM for pairwise causal discovery, which effectively and correctly leverages both knowledge and numerical information.


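1607.06330 2026-05-22 cs.CL (updated version)

La representación de la variación contextual mediante definiciones terminológicas flexibles

The Representation of Contextual Variation by Means of Flexible Terminological Definitions

Antonio San Martín

AI summary: This thesis studies how flexible terminological definitions can reflect the way specialized concepts in the environmental domain vary across contexts, proposes the flexible terminological definition, and analyzes the effect of contextual variation on terminological definitions through an empirical study.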

Comments PhD Thesis. in Spanish. University of Granada. 2016

Abstract

In this doctoral thesis, we apply premises of cognitive linguistics to terminological definitions and present a proposal called the flexible terminological definition. This consists of a set of definitions of the same concept made up of a general definition (in this case, one encompassing the entire environmental domain) along with additional definitions describing the concept from the perspective of the subdomains in which it is relevant. Since context is a determining factor in the construction of the meaning of lexical units (including terms), we assume that terminological definitions can, and should, reflect the effects of context, even though definitions have traditionally been treated as the expression of meaning void of any contextual effect. The main objective of this thesis is to analyze the effects of contextual variation on specialized environmental concepts with a view to their representation in terminological definitions. Specifically, we focused on contextual variation based on thematic restrictions. To accomplish the objectives of this doctoral thesis, we conducted an empirical study consisting of the analysis of a set of contextually variable concepts and the creation of a flexible definition for two of them. As a result of the first part of our empirical study, we divided our notion of domain-dependent contextual variation into three different phenomena: modulation, perspectivization and subconceptualization. These phenomena are additive in that all concepts experience modulation, some concepts also undergo perspectivization, and finally, a small number of concepts are additionally subjected to subconceptualization. In the second part, we applied these notions to terminological definitions and presented guidelines on how to build flexible definitions, from the extraction of knowledge to the actual writing of the definition.
