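arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.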

2606.17682 2026-06-17 cs.CL New submission

ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation

Yao Chen, Yinqi Yang, Junyuan Shang, Xiangzhao Hao, Simeng Zhang, Yilong Chen, Tingwen Liu, Shuohuan Wang, Dianhai Yu

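Affiliations * Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Baidu Inc.

AI summary: Proposes the ConSA framework, which uses L0 regularization and an augmented Lagrangian constraint to learn the optimal allocation between full attention and sliding-window attention under a user-specified sparsity target. On LLMs at the 0.6B and 1.7B scales it outperforms rule-based baselines, and it finds a concentrated allocation pattern with SWA in the bottom layers and FA in the middle layers.

AI abstract (translated from Chinese)

Hybrid architectures that combine full attention (FA) and sliding-window attention (SWA) are a promising paradigm for efficient LLM inference. However, existing methods usually rely on hand-crafted rules or simple post-hoc heuristics for FA/SWA allocation, and offer limited analysis of the attention behaviors behind these designs. We propose Controllable Sparsity in Hybrid Attention (ConSA), a framework that learns the optimal FA/SWA allocation under a user-specified sparsity target. ConSA uses L0 regularization to learn binary masks that select FA or SWA for each attention unit, while an augmented Lagrangian constraint enforces the target sparsity at layer or KV-head granularity. We evaluate ConSA on two LLMs at the 0.6B and 1.7B scales. The learned allocations consistently outperform rule-based baselines, and KV-head-level allocation brings clear gains over layer-level allocation. The learned patterns place SWA in the bottom layers and concentrate FA in contiguous blocks of middle layers, unlike the evenly interleaved patterns of rule-based methods. This structure persists across model scales, sparsity levels, and allocation granularities, revealing a fine-grained spectrum of intrinsic attention behaviors behind the learned allocation.

English abstract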

Hybrid architectures combining full attention (FA) and sliding-window attention (SWA) are a promising paradigm for efficient LLM inference. However, existing methods typically rely on hand-crafted rules or simple post-hoc heuristics for FA/SWA allocation and offer limited analysis of the attention behaviors underlying these designs. We propose Controllable Sparsity in Hybrid Attention (ConSA), a framework that learns optimal FA/SWA assignment under a user-specified sparsity target. ConSA employs L0 regularization to learn binary masks selecting between FA and SWA for each attention unit, while an augmented Lagrangian constraint enforces the target sparsity at either layer or KV-head granularity. We evaluate ConSA on two LLMs at the 0.6B and 1.7B scales. Learned allocations consistently outperform rule-based baselines, with KV-head-wise allocation yielding clear gains over layer-wise allocation. The learned patterns place SWA in the bottom layers and concentrate FA into contiguous middle-layer blocks, diverging from evenly interleaved patterns in rule-based methods. This structure persists across model scales, sparsity levels, and allocation granularities, revealing a fine-grained spectrum of intrinsic attention behaviors that underlies the learned allocation.
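The two ingredients named in the abstract, L0 gates and an augmented Lagrangian sparsity constraint, can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: it uses the hard-concrete gate from the L0-regularization literature, a linear-plus-quadratic penalty, and hypothetical names (`prob_gate_open`, `expected_sparsity`, `lagrangian_penalty`). An open gate is taken to mean FA and a closed gate SWA.

```python
import math

# Hard-concrete stretch limits and temperature (common defaults; assumed here).
LEFT, RIGHT, TEMP = -0.1, 1.1, 2.0 / 3.0


def prob_gate_open(log_alpha: float) -> float:
    """Probability that a hard-concrete gate is non-zero (unit keeps FA)."""
    shift = TEMP * math.log(-LEFT / RIGHT)
    return 1.0 / (1.0 + math.exp(-(log_alpha - shift)))


def expected_sparsity(log_alphas: list[float]) -> float:
    """Expected fraction of attention units assigned to SWA (gate closed)."""
    open_fraction = sum(prob_gate_open(a) for a in log_alphas) / len(log_alphas)
    return 1.0 - open_fraction


def lagrangian_penalty(sparsity: float, target: float,
                       lam1: float, lam2: float) -> float:
    """Linear plus quadratic multiplier terms; zero when the target is met."""
    gap = sparsity - target
    return lam1 * gap + lam2 * gap * gap
```

In training, the penalty would be added to the language-modeling loss, with the gate parameters minimizing it and the multipliers `lam1` and `lam2` updated by gradient ascent, so the expected sparsity is pushed toward the user-specified target.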

2606.18195 2026-06-17 cs.CL New submission

Learning from the Self-future: On-policy Self-distillation for dLLMs

Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu

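Affiliations * Tsinghua University; Technical University of Munich; Nanyang Technological University; University of British Columbia; University of Texas at Austin; ELLIS Institute Tubingen; Max Planck Institute for Intelligent Systems; Tubingen AI Center

AI summary: Proposes d-OPSD, the first on-policy self-distillation framework for diffusion large language models. It uses self-generated answers as suffix conditioning to learn from self future-experience, shifts supervision from the token level to the step level, and surpasses RLVR and SFT baselines on reasoning benchmarks with about 10% of the optimization steps.

Comments Preprint

AI abstract (translated from Chinese)

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), but its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric: they inject privileged information through left-to-right prefix conditioning and token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We propose d-OPSD, the first OPSD framework tailored to dLLMs. Our method makes two core contributions. First, we rebuild the self-teacher by using self-generated answers as suffix conditioning, so that the student model learns from "self future-experience" rather than privileged prefixes. Second, we shift supervision from the token level to the step level, aligning training with the iterative denoising process of dLLMs. Experiments on four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines in sample efficiency, needing only about 10% of the optimization steps of RLVR, and opens a promising path for dLLM post-training. The code is available at this https URL.

English abstract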

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps required by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.

2606.18216 2026-06-17 cs.CL New submission

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma

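Affiliations * NVIDIA

AI summary: Proposes ZPPO, which injects teacher knowledge into the prompt rather than the policy gradient, addressing teacher-gradient dominance in small-model knowledge distillation and policy drift in reinforcement learning, and outperforms existing methods across model scales.

Comments Project page: https://byungkwanlee.github.io/ZPPO-page/

AI abstract (translated from Chinese)

Knowledge distillation transfers a teacher's competence to a small student model, but it is brittle in the small-model regime: forcing the student to imitate the logits of a much larger teacher concentrates it on the teacher's sharpest modes, which hurts generalization on benchmark families outside the training corpus. Reinforcement learning avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and causes drift. We propose Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher in the prompt rather than in the policy gradient. On hard questions, ZPPO builds two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates that the student must tell apart, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to expose their shared failure modes. A prompt replay buffer recirculates each hard question until it graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ within the student's current zone of proximal development. On the Qwen3.5 family, with four student scales (0.8B-9B) and a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.

English abstract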

Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
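The prompt replay buffer described in the abstract might look roughly like the sketch below. It only illustrates the two exit rules (graduation at a mean rollout accuracy of one half, FIFO eviction at capacity); the class and method names are hypothetical, and the BCQ/NCQ prompt construction is not modeled.

```python
from collections import deque


class PromptReplayBuffer:
    """FIFO buffer of hard questions; a question leaves by graduating or eviction."""

    def __init__(self, capacity: int, graduation_accuracy: float = 0.5):
        self.capacity = capacity
        self.graduation_accuracy = graduation_accuracy
        self.questions = deque()

    def add(self, question: str) -> None:
        """Insert a hard question, evicting the oldest one if the buffer is full."""
        if question in self.questions:
            return
        if len(self.questions) >= self.capacity:
            self.questions.popleft()  # FIFO eviction under finite capacity
        self.questions.append(question)

    def update(self, question: str, rollout_correct: list[bool]) -> bool:
        """Score fresh rollouts; return True if the question graduated."""
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.graduation_accuracy and question in self.questions:
            self.questions.remove(question)
            return True
        return False
```

Questions that stay in the buffer would be re-served on later training steps, which is how the hard cases get recirculated instead of being silently dropped.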

2606.18246 2026-06-17 cs.CL New submission

Variable-Width Transformers

Zhaofeng Wu, Oliver Sieberling, Shawn Tan, Rameswar Panda, Yury Polyanskiy, Yoon Kim

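Affiliations * MIT; MIT-IBM Watson AI Lab

AI summary: Proposes a variable-width Transformer architecture that is narrow in the middle and wide at both ends, using a parameter-free residual resizing mechanism to allocate capacity non-uniformly; it beats uniform-width baselines on language-model perplexity, FLOPs, and KV cache.

AI abstract (translated from Chinese)

Scaling model size, especially depth and width, has driven significant progress in Transformer-based language models. However, most architectures keep a constant width across all layers, spreading a fixed parameter and compute budget evenly even though different layers may play different computational roles. In this work, we empirically study non-uniform capacity allocation across network depth by proposing a ×-shaped > <former architecture. The design keeps the early and late layers wide while narrowing the middle layers, using a parameter-free residual resizing mechanism. On decoder-only language models from 200M to 2B parameters (dense) and 3B parameters (MoE), our > <former consistently outperforms parameter-matched uniform baselines on language-modeling loss. By lowering the average layer width, the architecture also reduces overall FLOPs (a 22% reduction under fitted loss-matched scaling curves) and shrinks KV cache memory and I/O cost (a 15% reduction). In our analysis, we show that this bottleneck structure leads to qualitatively different representations in the residual stream. Overall, our results show that non-uniform width allocation can lead to more resource-optimal scaling of language models.

English abstract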

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped > <former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our > <former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.

2606.17250 2026-06-17 cs.LG cs.CL Cross-list

Rethinking Groups in Critic-Free RLVR

Yihong Wu, Liheng Ma, Lingfeng Xiao, Muzhi Li, Xinyu Wang, Yingxue Zhang, Jian-Yun Nie

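Affiliations * Université de Montréal; McGill University; Mila - Quebec AI Institute; University of Waterloo; The Chinese University of Hong Kong; Huawei Noah’s Ark Lab

AI summary: Targets the data inefficiency and synchronization problems of grouping strategies in critic-free reinforcement learning, and proposes a negative token filtering method that enables stable training from a single rollout, with comparable or better performance on reasoning and agentic tasks.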

2606.17289 2026-06-17 cs.AI cs.CL Cross-list

Nothing from Something: Can a Language Model Discover 0?

Phoebe Zeng, Thomas L. Griffiths, Brenden M. Lake

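Affiliations * Department of Computer Science, Princeton University

AI summary: Studies whether language models can independently discover the concept of "zero", using arithmetic tasks as the test. GPT-2-scale models fail to generalize at test time, but improve substantially after training on a small number of examples, and language pretraining cuts the number of required examples by about 50%.

DOI: 10.32470/vp0ddtg

AI abstract (translated from Chinese)

AI systems based on artificial neural networks are being developed with the aim of pushing the boundary of human mathematical knowledge. A key question for these systems is how far they can go beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new, and possibly logically more powerful, mathematical structures. It has been hypothesized that language abilities support this kind of generalization in human cognition. In this work, we use simple arithmetic as a case study to examine how modern AI models could extend their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) GPT-2-scale language models cannot perform this generalization at test time, with or without language pretraining, but (2) models improve substantially after training on tens or hundreds of examples of zero. In addition, we find that language pretraining reduces the number of required examples by about 50%, indicating that language abilities can scaffold mathematical discovery in neural models.

English abstract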

AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new, and potentially logically more powerful, mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) language models of GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately 50%, showing that language abilities can scaffold mathematical discovery in neural models.

2606.17443 2026-06-17 cs.AI cs.CL cs.CY Cross-list

Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems

Xi Chu, Yupeng Hou

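Affiliations * Trine University; Texas A&M University

AI summary: Studies brand dynamics in LLM recommendations. Well-known brands receive 100% of recommendations when specifications are equal (IAI = 10.0), but a slight rating advantage can break the monopoly; authority-style marketing language (such as fabricated clinical evidence) breaks the monopoly at a Bias Surplus Value of +0.17 rating points; and multi-brand GEO competition is a social dilemma in which collective optimization lowers individual payoff.

Comments 16 pages, 4 figures, 11 tables

AI abstract (translated from Chinese)

Large language models (LLMs) are becoming a major way for consumers to find products, but we do not yet understand how brands compete in this new channel. Using skincare products, a category in which consumers find it hard to judge quality before buying and must rely on brand reputation, we study brand dynamics in LLM recommendations across three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), with a robustness check on search goods. Across three experiments, we find: (1) a conditional monopoly: when all products have the same specifications, well-known brands receive 100% of recommendations (IAI = 10.0), but this dominance disappears when a competitor has a rating advantage of less than +0.1 stars; (2) authority-style marketing language, including fabricated clinical-evidence claims, breaks this monopoly at a Bias Surplus Value equal to +0.17 rating points, with each model responding differently; (3) a social dilemma in multi-brand GEO competition: when all brands adopt the same optimization strategy, individual payoff falls from +0.802 to +0.007 in our payoff proxy, and non-participating brands receive zero recommendations in our tests. Our results suggest that generative engine optimization (GEO) should be studied not only as a security risk but also as an emerging marketing practice that shapes market competition.

English abstract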

Large language models (LLMs) are becoming a major way for consumers to find products, but we do not yet understand how brands compete in this new channel. We study brand dynamics in LLM recommendations using skincare products -- a category where consumers cannot easily judge quality before buying and must rely on brand reputation -- across three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), with a robustness check on search goods. In three experiments, we find: (1) a Conditional Monopoly where well-known brands get recommended 100% of the time (IAI = 10.0) when all products have the same specifications, but this dominance disappears with less than a +0.1-star rating advantage for a competitor; (2) authority-style marketing language, including fabricated clinical-evidence claims, breaks this monopoly at a Bias Surplus Value equal to +0.17 rating points, with each model responding differently; and (3) a social dilemma in multi-brand GEO competition: when all brands adopt the same optimization strategy, individual payoff falls from +0.802 to +0.007 in our payoff proxy, and non-participating brands receive zero recommendations in our tests. Our results suggest that generative engine optimization (GEO) should be studied not only as a security risk, but also as an emerging marketing practice that shapes market competition.

2606.17579 2026-06-17 cs.LG cs.AI cs.CL cs.SI Cross-list

LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

Zhongyuan Wang, Pratyusha Vemuri

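AI summary: Finds that introducing LLM features into graph neural networks through pure input concatenation (rather than joint training) systematically lowers accuracy on homophilous benchmarks, and proposes Delta_sig, a measure of LLM-alone discriminability, to predict the effect of concatenation.

Comments 29 pages, 8 figures

AI abstract (translated from Chinese)

Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt conditioning), they systematically lower accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone, the Planetoid public split, and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features changes PubMed test accuracy by -17.0 ± 0.3 pp and Cora by -4.3 ± 0.6 pp (CiteSeer by -0.6 ± 0.8 pp, within seed noise). The drop weakens as we relax each condition (GCN/GCNII/GAT backbones, random splits, smaller encoders) and reverses on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps or hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets, Delta_sig correlates with the concatenation cost (r^2 = 0.38) more strongly than homophily does (r^2 = 0.06; N = 9, bootstrap confidence intervals overlap). The bootstrap-best change point is tau = 13.8 pp, and the rule "Delta_sig <= tau predicts non-positive concatenation cost" classifies 7/9 datasets correctly; because 60% of bootstrap samples place tau in [5, 30] pp, we treat Delta_sig as an interpretive lens rather than a precision filter. A dimension-controlled ablation on PubMed places the LLM-feature drop between same-source PCA (-2.3 pp) and same-dimension Gaussian noise (-37.3 pp), ruling out dimensionality and weight-decay effects. Nine PubMed configurations fit a power law |Delta_concat| ∝ (sqrt(d_l/n))^1.31 with r^2 = 0.97; the low-Delta_sig, small-n corner is exactly where the headline -17 pp PubMed deficit appears.

English abstract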

Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features reduces PubMed test accuracy by -17.0 +/- 0.3 pp and Cora by -4.3 +/- 0.6 pp (CiteSeer -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets Delta_sig correlates with the concatenation cost more strongly than homophily at point estimate (r^2 = 0.38 vs. 0.06; N=9, bootstrap CIs overlap). The bootstrap-best change-point is tau = 13.8 pp, and the rule "Delta_sig <= tau predicts non-positive concat cost" classifies 7/9 datasets correctly; since 60% of bootstrap samples place tau in [5, 30] pp, we treat Delta_sig as an interpretive lens rather than a precision filter. A dimension-controlled ablation on PubMed places the LLM-feature drop between same-source PCA (-2.3 pp) and same-dim Gaussian noise (-37.3 pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations fit a power law |Delta_concat| proportional to (sqrt(d_l/n))^1.31 with r^2 = 0.97; the low-Delta_sig, small-n corner is exactly where the headline -17 pp PubMed deficit appears.

2606.18057 2026-06-17 cs.HC cs.AI cs.CL cs.CY cs.SI Cross-list

When AI Says "I have been in similar situations": Synthetic Lived Experience in Peer-Like Caregiver Support

Drishti Goel, Violeta J. Rodriguez, Daniel S. Brown, Ravi Karkar, Dong Whi Yoo, Koustuv Saha

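Affiliations * University of Illinois Urbana-Champaign; University of Massachusetts Amherst; Indiana University Indianapolis

AI summary: Studies the paradox of AI generating "synthetic lived experience" in peer support. By analyzing how human and AI narratives differ in Alzheimer's caregiving communities, it shows that AI can simulate emotional support but lacks real experience, and that mechanisms are needed to separate supportive language from fabricated experience.

AI abstract (translated from Chinese)

Caregivers often turn to online communities for informational and emotional support. In these spaces, peer supporters frequently draw on personal narratives to respond to emotionally complex caregiving situations. As LLMs are designed as peer-like sources of support, they introduce a critical tension: AI can provide immediate, private, and nonjudgmental support, but it cannot authentically possess the lived experience that makes human peer support meaningful. Yet, when prompted to sound like a peer, LLMs may generate language that implies lived experience. This creates a synthetic lived experience paradox: the same experiential language that makes AI support feel warm, relatable, and peer-like can also falsely position the system as someone with lived experience. We examine this paradox in the context of family caregivers of people living with Alzheimer's Disease and Related Dementias (ADRD). Using caregiver support exchanges from online communities and peer-like responses generated by three LLMs (LLaMA, GPT-4o-mini, and MedGemma), we analyze how human peers use personal narratives and how AI incorporates similar narrative forms. Psycholinguistic analysis shows that peer responses use significantly more first-person and past-tense language than peer-like AI responses. Qualitatively, we identify seven types of personal narrative in human peer support and show that AI often captures their emotional work but may fabricate the experiential grounding. These findings reveal a narrative authenticity gap: peer-like AI can generate synthetic lived experience without the real experience that makes peer support meaningful. We argue that caregiver-support AI systems need mechanisms to distinguish supportive peer-like framing from fabricated lived experience, so that models can offer warmth and validation without falsely positioning themselves as experiential peers.

English abstract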

Caregivers often turn to online communities for informational and emotional support. In these spaces, peer supporters frequently draw on personal narratives to respond to emotionally complex caregiving situations. As LLMs are increasingly designed as peer-like sources of support, they introduce a critical tension: AI can provide immediate, private, and nonjudgmental support, but it cannot authentically possess the lived experiences that make human peer support meaningful. Yet, when prompted to sound peer-like, LLMs may generate language that implies lived experience. This creates a synthetic lived experience paradox: the same experiential language that may make AI support feel warm, relatable, and peer-like can also falsely position the system as someone with lived experience. We examine this paradox in the context of family caregivers of people living with Alzheimer's Disease and Related Dementias (ADRD). Drawing on caregiver support exchanges from online communities and prompted peer-like responses from three LLMs -- LLaMA, GPT-4o-mini, and MedGemma -- we analyze how human peers use personal narratives and how AI incorporates similar narrative forms. Psycholinguistic analysis shows that peer responses used significantly more first-person and past-focused language than peer-like AI responses. Qualitatively, we identify seven types of personal narratives in human peer support and show that AI often captures their emotional work, but can fabricate experiential grounding. These findings reveal a narrative authenticity gap: peer-like AI can generate synthetic lived experience without the real experience that makes peer support meaningful. We argue that caregiver-support AI systems need mechanisms to distinguish supportive peer-like framing from fabricated lived experience, ensuring that models can offer warmth and validation without falsely positioning themselves as experiential peers.

2606.18208 2026-06-17 cs.LG cs.AI cs.CL cs.CV Cross-list

Looped World Models

Hongyuan Adam Lu, Z. L. Victor Wei, Qun Zhang, Jinrui Zeng, Bowen Cao, Lingwei Meng, Mocheng Li, Zezhong Wang, Haonan Yin, Naifu Xue, Minyu Chen, Cenyuan Zhang, Zefan Zhang, Hao Wei, Jiawei Zhou, Haoran Xu, Hao Yang, Ronglai Zuo, Tongda Xu, Yonghao Li, Jian Chen, Hebin Wang, Zeyu Gao, Yang Li, Wei Zhao, Qimin Zhong, Siqi Liu, Yumeng Zhang, Leyan Cui, Zhangyu Wang, Wai Lam

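Affiliations * FaceMind Research Asia

AI summary: Proposes Looped World Models (LoopWM), which iteratively refine the latent environment state with a parameter-shared Transformer block, achieving up to 100x parameter efficiency and establishing iterative latent depth as a new scaling axis for world simulation.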

Comments Technical Report

2502.08363 2026-06-17 cs.CL cs.AI Updated version

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli

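AI summary: Proposes Top-Theta attention, a training-free, inference-time sparsification method that uses static per-head thresholds to keep a fixed number of important elements per row, combined with compensation techniques that preserve accuracy at high sparsity. On NLP tasks it achieves a 3-10x reduction in V-cache and up to a 10x reduction in attention elements, with an accuracy drop of no more than 1%.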

Comments Extended version of a paper accepted at ICANN 2026
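The thresholding idea in the summary can be illustrated on a single attention row. The sketch below is an assumption-laden toy, not the paper's method: `thresholded_attention_row` is a hypothetical name, the threshold is applied to pre-softmax scores, and the only "compensation" shown is renormalizing the softmax over the kept entries, which may differ from the compensation techniques the paper uses.

```python
import math


def thresholded_attention_row(scores: list[float], theta: float) -> list[float]:
    """Softmax over one row of scores, keeping only entries >= theta."""
    kept = [i for i, s in enumerate(scores) if s >= theta]
    if not kept:
        # Fall back to the largest score so the row still sums to one.
        kept = [max(range(len(scores)), key=lambda i: scores[i])]
    peak = max(scores[i] for i in kept)
    weights = {i: math.exp(scores[i] - peak) for i in kept}
    total = sum(weights.values())
    return [weights[i] / total if i in weights else 0.0
            for i in range(len(scores))]
```

Because the threshold is static per head, no sort or top-k search is needed at inference time, and value vectors whose weight is zero never have to be read.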

2506.18831 2026-06-17 cs.CL Updated version

Adaptive Activation Steering for Efficient LLM Reasoning via Closed-Loop PID Control

Aryasomayajula Ram Bharadwaj

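AI summary: Proposes PID-steering, which uses a PID controller driven by a chunk-level redundancy classifier to adjust the activation steering strength dynamically, reducing inference overhead while improving accuracy.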
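For readers unfamiliar with PID control, a textbook discrete controller is sketched below. How the paper wires it to the model is not specified here, so the mapping in the usage note (error = redundancy score minus a target, output = steering strength) is an assumption, as are the class name and gains.

```python
class PIDController:
    """Textbook discrete PID controller: output = P + I + D terms."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float, dt: float = 1.0) -> float:
        """Consume one error measurement and return the control signal."""
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)
```

Under the assumed mapping, each reasoning chunk would yield a redundancy score from the classifier, and `step(score - target)` would return the steering strength to apply to the next chunk, closing the loop.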

2508.03250 2026-06-17 cs.CL cs.AI Updated version

Learning under Arbitrary Textual Conditions: A Hypernetwork-Driven Meta-Gated Large Language Model

Luo Ji, Qi Qin, Ningyuan Xi, Teng Chen, Qingqing Gu, Hongyan Li

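Affiliations * University of Science and Technology of China

AI summary: Proposes a hypernetwork-driven meta-gating mechanism that dynamically adjusts the β parameter in SwiGLU blocks so that the LLM adapts to different textual conditions, outperforming fine-tuning and meta-learning baselines.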

Comments Accepted by ICML2026
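The β parameter mentioned in the summary is the slope of the Swish gate inside SwiGLU, Swish_β(x) = x · sigmoid(βx). The scalar sketch below shows only how β changes the gate; the hypernetwork that would predict β from a text condition is not modeled, and the function names are hypothetical.

```python
import math


def swish(x: float, beta: float) -> float:
    """Swish activation with slope beta: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))


def swiglu_unit(gate_pre: float, value_pre: float, beta: float) -> float:
    """One SwiGLU hidden unit: gated product of two linear projections."""
    return swish(gate_pre, beta) * value_pre
```

With β = 0 the gate is linear (x / 2), and as β grows it approaches ReLU, so a condition-dependent β lets the same weights behave like different activation functions.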

2605.12227 2026-06-17 cs.CL Updated version

A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation

Miguel Moura Ramos, Duarte M. Alves, André F. T. Martins

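Affiliations * Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações; TransPerfect

AI summary: Proposes dGRPO, which combines on-policy optimization with distillation and uses a strong teacher model for dense guidance, improving long-context reasoning while preserving short-context capabilities.

AI abstract (translated from Chinese)

Adapting large language models (LLMs) to long-context tasks requires methods that preserve accuracy and coherence after training. Existing methods have limitations: (1) offline methods such as supervised fine-tuning (SFT) and knowledge distillation (KD) suffer from exposure bias and struggle to recover from model-generated errors; (2) online reinforcement learning methods such as Group Relative Policy Optimization (GRPO) match model-generated states better, but sparse rewards make them unstable and sample-inefficient; (3) on-policy distillation (OPD) provides dense token-level guidance but does not directly optimize arbitrary reward signals. This paper proposes Distilled Group Relative Policy Optimization (dGRPO), which augments GRPO with dense guidance obtained through OPD from a stronger teacher model. We also introduce LongBlocks, a synthetic long-context dataset covering multi-hop reasoning, contextual grounding, and long-form generation. We run extensive experiments and ablation studies comparing offline training, sparse-reward GRPO, and our combined method, and derive an improved recipe for long-context alignment. Overall, our results show that combining outcome-based policy optimization with knowledge distillation in a single objective gives a more stable and effective approach to long-context reasoning while preserving short-context capabilities.

English abstract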

Existing approaches to post-train models for long-context tasks face complementary limitations: (i) supervised fine-tuning (SFT) provides stable supervision but suffers from exposure bias; (ii) reinforcement learning methods such as Group Relative Policy Optimization (GRPO) train on model-generated trajectories but struggle with long-horizon credit assignment and sparse rewards; and (iii) on-policy distillation (OPD) provides dense token-level guidance but does not directly optimize task rewards. We study these complementary strategies for long-context alignment and derive a recipe that combines GRPO with OPD-style teacher guidance: the student learns from its own rollouts using outcome-level rewards, while a stronger teacher provides dense token-level regularization in place of the standard reference policy. This is especially useful when process-level supervision is difficult to obtain. To support this study, we introduce LongBlocks, a synthetic multilingual dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. Through controlled ablations, we isolate the roles of cold-start initialization, teacher anchoring, and data mixing, showing that our recipe yields a more stable and effective path to long-context reasoning than GRPO or OPD while preserving short-context capabilities.
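The recipe in the abstract (outcome-level rewards on the student's own rollouts, with a teacher replacing the usual reference policy as the regularizer) can be sketched per token as below. This is a simplified illustration under assumptions: the KL direction (student to teacher), the weight `beta`, and all function names are chosen for the example, and the clipping and importance ratios of full GRPO are omitted.

```python
import math


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: rewards standardized within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def token_kl(student_probs: list[float], teacher_probs: list[float]) -> float:
    """KL(student || teacher) over one token's vocabulary distribution."""
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0.0)


def token_loss(logp_token: float, advantage: float,
               student_probs: list[float], teacher_probs: list[float],
               beta: float = 0.1) -> float:
    """Policy-gradient term plus dense teacher regularization for one token."""
    return -advantage * logp_token + beta * token_kl(student_probs, teacher_probs)
```

When every rollout in a group receives the same reward, the advantages are all zero and only the teacher term carries a training signal, which is one way to see why dense guidance helps under sparse rewards.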

2606.00024 2026-06-17 cs.CL Updated version

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

Chen Qiu, Guozhong Li, Cristian McGee, Aritra Dutta, Panos Kalnis

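Affiliations * King Abdullah University of Science and Technology; University of Central Florida

AI summary: Proposes the Attention Run-time Termination (ART) mechanism, which tracks the accumulated attention output and terminates subsequent KV block accesses once their contribution is negligible, raising large-batch generation throughput by 20% without a significant effect on accuracy.

AI abstract (translated from Chinese)

Long-context decoding in large language models (LLMs) is severely limited by the memory bandwidth needed to fetch a large key-value (KV) cache. Most existing KV management methods rely on key-only pruning before decoding, despite evidence that attention outputs depend jointly on keys and values, because incorporating values into these methods incurs prohibitive overhead. In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks the accumulated attention output during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. This design makes ART orthogonal to existing key-based KV cache management methods, so it integrates with them seamlessly. Experiments on the LongBench benchmark show that, compared with state-of-the-art baselines, ART achieves 20% higher generation throughput at large batch sizes while maintaining comparable accuracy.

English abstract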

Long-context decoding in Large Language Models (LLMs) is constrained by the cost of accessing and processing the Key-Value (KV) cache. Despite evidence that attention outputs depend jointly on keys and values, most existing KV management methods rely on key-only pruning, since incorporating values incurs prohibitive overhead. In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks accumulated attention outputs during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. Rather than replacing KV selection, ART dynamically terminates redundant KV traversal on top of existing dense or sparse attention policies. We introduce a stability-based criterion that monitors both magnitude and directional changes of intermediate attention outputs and provides a theoretical characterization of the resulting truncation error. Experiments on the LongBench and RULER Needle-in-a-Haystack tasks show that ART increases the generation throughput of existing KV-cache methods by up to 20%, without compromising the result quality.
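The run-time termination idea can be sketched with a streaming (online) softmax over KV blocks that stops once the running output barely changes. This toy makes several assumptions that are not from the paper: a single max-absolute-change tolerance stands in for the paper's magnitude and direction criterion, blocks are visited in the order given, inputs are non-empty, and the function name is hypothetical.

```python
import math


def attention_with_early_stop(query, keys, values, block_size=2, tol=1e-3):
    """Stream KV blocks with an online softmax; stop when the output is stable.

    Returns the attention output and the number of KV blocks actually read.
    """
    running_max, denom = -math.inf, 0.0
    numer = [0.0] * len(values[0])
    prev_out, out, blocks_read = None, None, 0
    for start in range(0, len(keys), block_size):
        stop = start + block_size
        for k, v in zip(keys[start:stop], values[start:stop]):
            score = sum(q * x for q, x in zip(query, k))
            new_max = max(running_max, score)
            scale = math.exp(running_max - new_max)  # rescale old accumulators
            weight = math.exp(score - new_max)
            denom = denom * scale + weight
            numer = [n * scale + weight * x for n, x in zip(numer, v)]
            running_max = new_max
        blocks_read += 1
        out = [n / denom for n in numer]
        if prev_out is not None and max(
                abs(a - b) for a, b in zip(out, prev_out)) < tol:
            break  # further blocks are assumed to contribute negligibly
        prev_out = out
    return out, blocks_read
```

The check looks at the accumulated output, which depends on both keys and values, rather than at key scores alone, which is the distinction the abstract draws against key-only pruning.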

2602.06154 2026-06-17 cs.LG cs.CL Updated version

MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

Nurbek Tastan, Stefanos Laskaridis, Karthik Nandakumar, Samuel Horvath

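AI summary: Proposes the MoSE architecture, in which each expert has a nested, variable-width structure that supports continuous adjustment of the accuracy-compute trade-off at inference time, made efficient and adaptive through multi-width training and lightweight test-time training.

Comments Accepted to ICML 2026

AI abstract (translated from Chinese)

Mixture-of-Experts (MoE) models scale large language models efficiently by activating experts sparsely, but once an expert is selected, it is executed in full. As a result, the trade-off between accuracy and computation in MoE models usually shows large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated but also over how much of each expert is used. A single pretrained MoSE model can therefore support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable recipe for training slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. At inference time, we explore strategies for determining widths at run time, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT-style models, various routing mechanisms, zero-shot downstream reasoning benchmarks, and continual pre-training adaptation of a DeepSeek model show that MoSE matches or improves on standard MoE at full width and consistently shifts the compute-quality frontier toward lower inference FLOPs. The code is available at this https URL.

English abstract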

Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT-style models, various routing regimes, zero-shot downstream reasoning benchmarks, and continual pre-training adaptation of a DeepSeek model show that MoSE matches or improves on standard MoE at full width and consistently shifts the compute-quality frontier toward lower inference FLOPs. The code can be found at: https://github.com/tnurbek/mose.

2602.09802 2026-06-17 cs.AI cs.CL Updated version

Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices

Manon Reusens, Sofie Goethals, Toon Calders, David Martens

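AI summary: In a travel-assistant setting, analyzes LLMs' subjective choices with multinomial logit models to infer their willingness to pay and compares it with human benchmarks. LLMs show systematic deviations at the attribute level and overestimate willingness to pay, but conditioning on preferences improves the estimates.

DOI: 10.1016/j.eswa.2026.133279

AI abstract (translated from Chinese)

As large language models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they often have to make subjective choices on behalf of users where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses with multinomial logit models to derive implied willingness-to-pay (WTP) estimates. These WTP values are then compared with human benchmark values from the economics literature. Beyond the baseline setting, we also study how model behavior changes under more realistic conditions, including information about users' past choices and persona-based prompting. Our results show that although meaningful WTP values can be derived from larger LLMs, they also display systematic deviations at the attribute level. In addition, they tend to overestimate human WTP overall, especially when expensive options or business-oriented personas are introduced. Conditioning models on a prior preference for cheaper options yields valuations closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support, and underline the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.

English abstract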

As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users' past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
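In a multinomial logit model with linear utility, the implied willingness to pay for an attribute is the ratio of its coefficient to the negated price coefficient. The sketch below shows that standard calculation; the coefficient values and attribute names are invented for illustration and are not estimates from the paper.

```python
import math


def choice_probabilities(options: list[dict], coefficients: dict) -> list[float]:
    """Multinomial logit: softmax over the linear utilities of the options."""
    utilities = [sum(coefficients[name] * value for name, value in opt.items())
                 for opt in options]
    peak = max(utilities)
    exp_u = [math.exp(u - peak) for u in utilities]
    total = sum(exp_u)
    return [e / total for e in exp_u]


def willingness_to_pay(coefficients: dict, attribute: str,
                       price_key: str = "price") -> float:
    """Price change that exactly offsets one unit of the attribute."""
    return -coefficients[attribute] / coefficients[price_key]


# Illustrative (made-up) coefficients: utility falls with price, rises with a view.
coefficients = {"price": -0.02, "sea_view": 0.6}
wtp = willingness_to_pay(coefficients, "sea_view")  # about 30 price units
```

As a sanity check on the definition, a room with a view priced 30 units above an otherwise identical room without one has the same utility, so the model would choose between them with equal probability.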

2602.11715 2026-06-17 cs.LG cs.CL Updated version

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

Haolei Bai, Lingcheng Kong, Xueyi Chen, Jianmian Wang, Zhiqiang Tao, Huan Wang

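AI summary: Proposes the CuKe dataset and the BiC-RL training framework, and builds DICE, a series of diffusion large language models (1.7B/4B/8B) that significantly outperform comparable autoregressive and diffusion models on KernelBench, setting a new state of the art for CUDA kernel generation.

Comments v2: Expanded with dLLM vs. autoregressive LLM comparisons, ablation studies, and qualitative case studies

AI abstract (translated from Chinese)

Diffusion large language models (dLLMs) have become a strong alternative to autoregressive (AR) LLMs because they can generate tokens in parallel. This paradigm is especially well suited to code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs to CUDA kernel generation remains challenging, not only because of the high degree of specialization but also because of a severe lack of high-quality training data. To address these challenges, we build CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. On top of it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Using this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation at three parameter scales: 1.7B, 4B, and 8B. Extensive experiments on KernelBench show that DICE significantly outperforms autoregressive and diffusion LLMs of comparable scale, establishing a new state of the art for CUDA kernel generation.

English abstract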

Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, obstructed not only by the high specialization but also by the severe lack of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. On top of it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales, 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state-of-the-art for CUDA kernel generation.

2604.03444 2026-06-17 cs.LG cs.CL version update

Olmo Hybrid: From Theory to Practice and Back

Olmo Hybrid：从理论到实践再回到理论

William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, Chuan Li, Kyle Lo, Saumya Malik, DJ Matusz, Benjamin Minixhofer, Jacob Morrison, Luca Soldaini, Finbarr Timbers, Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi, Ashish Sabharwal

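AI summary: Through theoretical analysis and experiments, argues that hybrid models combining attention with linear RNNs surpass pure transformers in expressivity and scaling efficiency, and trains the 7B-parameter Olmo Hybrid, which outperforms Olmo 3 on standard evaluations.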

Comments Corrected author list and typos in appendix

AI中文摘要

近期工作展示了非Transformer语言模型（尤其是线性递归神经网络（RNN）和混合注意力与递归的混合模型）的潜力。然而，对于这些新架构的潜在优势是否值得承担规模化扩展的风险和努力，尚无共识。为解决此问题，我们从多个方面提供混合模型优于纯Transformer的证据。首先，理论上，我们证明混合模型不仅继承了Transformer和线性RNN的表达能力，还能表达超出两者的任务，例如代码执行。将这一理论付诸实践，我们训练了Olmo Hybrid，一个70亿参数模型，与Olmo 3 7B基本相当，但将滑动窗口层替换为Gated DeltaNet层。我们表明，在标准预训练和中期训练评估中，Olmo Hybrid优于Olmo 3，证明了混合模型在受控大规模设置下的优势。我们发现混合模型的扩展效率显著高于Transformer，这解释了其更高的性能。然而，尚不清楚为何特定形式问题上的更高表达能力会导致更好的扩展性或在下游任务（与这些问题无关）上表现更优。为解释这一明显差距，我们回到理论，论证为何增强的表达能力应转化为更好的扩展效率，从而完成循环。总体而言，我们的结果表明，混合注意力和递归层的混合模型是语言建模范式的强大扩展：不仅用于减少推理时的内存，更是获得在预训练中更好扩展的更具表达能力模型的基本途径。

英文摘要

Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.

2605.30036 2026-06-17 cs.AI cs.CL version update

Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

向机器传授价值观：在LLMs中模拟类人行为

Asaf Yehudai, Naama Rozen, Ariel Gera

Affiliations: The Hebrew University of Jerusalem; IBM Research; Tel-Aviv University

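AI summary: Drawing on psychological value theory, runs large-scale experiments (over 5 million questions) to assess how well value-prompted LLMs match humans in value structure and value-behavior relationships, and shows that incorporating human value distributions improves population-level simulation.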

Comments We had some disagreement regarding proper attribution; we hope to resolve it soon and upload the paper

AI中文摘要

大型语言模型（LLMs）展示了采用不同角色和身份的能力；然而，它们是否能表现出符合连贯、类人价值结构的行为仍不清楚。在这项工作中，我们借鉴既定的心理学价值理论，在LLMs中诱导类人价值观，并评估它们与人类研究中观察到的模式的一致性。使用经过验证的心理学问卷，我们进行了大规模实验——超过500万个问题——以评估领先LLMs的价值结构和价值-行为关系，并将其与人类进行比较。我们的发现揭示了价值提示的LLMs与人类在两个维度上的强烈一致性。此外，引入人类价值分布增强了价值诱导LLMs的群体模拟。这些发现凸显了价值诱导LLMs作为有效的、基于心理学的模拟人类行为工具的潜力。

英文摘要

Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments -- over 5 million questions -- to evaluate value structures and value-behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.

2606.17234 2026-06-17 cs.CL new submission

Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

自评之言：论大语言模型在机器翻译中的口头化置信度

Ali Marashian, Alexis Palmer, Katharina von der Wense

Affiliations: University of Colorado Boulder; Johannes Gutenberg University Mainz

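AI summary: Devises five verbalized methods that extract an LLM's per-token confidence without access to internal signals and compares them with the model's internal certainty signals; the two families perform similarly on fine-grained error detection and calibration but show little to no correlation with each other.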

AI中文摘要

大语言模型（LLMs）在翻译中的迅速普及要求对其自身输出的置信度可靠性进行深入研究。与许多生成任务不同，翻译错误和置信度可以在不同粒度级别（标记、单词或片段）上发挥作用。基于预测概率等内部信号的无监督方法可能具有误导性，因为它们反映的是替代方案之间的确定性而非正确性。此外，这些方法需要访问此类内部信号。本文设计了五种口头化方法，用于在没有这些缺点的情况下提取LLM的逐词置信度，并将其可靠性与模型内部确定性信号进行比较。我们使用两种对齐形式评估可靠性：细粒度错误检测和校准。对于两者，内部方法和口头化方法表现相似，尽管结果因模型而异。有趣的是，我们发现内部方法与口头化方法之间几乎没有相关性。

英文摘要

The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs. Unlike many generation tasks, translation errors and confidence levels can be useful at different levels of granularity (tokens, words, or spans). Unsupervised approaches based on internal signals like predicted probabilities can be misleading because they reflect certainty among alternatives rather than correctness. In addition, they require access to such internal signals. Here, we devise five verbalized methods of extracting an LLM's per-token confidence without those shortcomings and compare their reliability with that of the model's internal signals of certainty. We evaluate reliability using two forms of alignment: fine-grained error detection and calibration. For both, internal and verbalized methods perform similarly, although results vary by model. Interestingly, we find little to no correlation between internal and verbalized methods.
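
Calibration, one of the two reliability measures named above, is often summarized with expected calibration error (ECE). The sketch below is a generic ECE computation, not the paper's evaluation code; the function name, the equal-width binning, and the toy inputs are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin.

    confidences: floats in [0, 1], e.g. one per translated token (assumed).
    correct: booleans saying whether each token was judged error-free.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Each bin contributes in proportion to how many items it holds.
        ece += len(items) / total * abs(avg_conf - accuracy)
    return ece


# Toy example: four tokens all rated 0.9, three of which are correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False])
print(round(toy_ece, 3))
```

The same function can score verbalized confidences and probability-based confidences alike, which is what makes a side-by-side comparison of the two possible.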

2606.17354 2026-06-17 cs.CL cs.AI new submission

Translating the Untranslatable: An Operationalizable Ontology for Untranslatability

翻译不可译：一个可操作化的不可译性本体论

Jacob Bremerman, Brihi Joshi, Hirona Arai, Xiang Ren, Jonathan May

Affiliations: University of Southern California Information Sciences Institute

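AI summary: Proposes a structured ontology of untranslatability and a taxonomy of compensation strategies, builds a multilingual dataset, and finds in human preference studies that the Annotation compensation strategy is preferred, laying a foundation for strategy-informed machine translation.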

AI中文摘要

不可译性，即意义无法在语言间直接保留的情况，在语言学中已有深入研究，但在自然语言处理中尚未充分探索。随着机器翻译系统在标准基准测试上的改进，其局限性越来越集中在这些情况下，即翻译无法简化为一一对应。我们引入了一个结构化的不可译性本体论以及补偿策略的分类法，这些策略是在这些不可译情况下传达意义的具体技术。我们将该框架操作化为一个多语言数据集，包含不可译句子及其基于策略的翻译，从而能够对翻译行为进行受控分析。初步的人类偏好研究表明，翻译质量取决于所使用的策略，并且对包含解释性上下文（称为注释补偿策略）的输出存在一致的偏好。我们的框架和数据集为研究和建模策略感知的机器翻译提供了基础。

英文摘要

Untranslatability, cases where meaning cannot be directly preserved across languages, is well-studied in linguistics but underexplored in NLP. As machine translation (MT) systems improve on standard benchmarks, their limitations increasingly concentrate in such cases, where translation cannot be reduced to one-to-one equivalence. We introduce a structured ontology of untranslatability along with a taxonomy of compensation strategies, which are specific techniques to convey meaning under these untranslatable circumstances. We operationalize this framework into a multilingual dataset of untranslatable sentences paired with strategy-based translations, enabling controlled analysis of translation behavior. Initial human preference studies suggest that translation quality depends on the strategy used, with consistent preferences for outputs that include explanatory context, known as the Annotation compensation strategy. Our framework and dataset provide a foundation for studying and modeling strategy-informed machine translation.

2606.18033 2026-06-17 cs.CL cs.AI new submission

When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning

当英语不是最好的老师：跨语言上下文学习中的源语言效应

Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé

Affiliations: SnT, University of Luxembourg; Luxembourg Institute of Science and Technology

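AI summary: Studies how the choice of source language affects cross-lingual in-context learning (ICL), finds that expectations carried over from fine-tuning do not hold for ICL, and proposes alternative heuristics for selecting a source language.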

Comments Accepted at 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026), co-located with ACL 2026

2605.21135 2026-06-17 cs.CL version update

Smarter edits? Post-editing with error highlights and translation suggestions

更智能的编辑？基于错误高亮和翻译建议的后编辑

Fleur V. J. van Tellingen, Gautam Ranka, Dora Žugčić, Joyce van der Wal, Andrea Camasta, Livio Guerra, Alina Karakanta

Affiliations: Leiden University Centre for Linguistics; Visvesvaraya National Institute of Technology; Department of Bionanoscience, Faculty of Applied Sciences, Delft University of Technology; Pedagogical Sciences, Leiden University; Faculty of Science, Leiden University

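AI summary: Studies the effectiveness of error highlights and correction suggestions based on automatic post-editing (APE) in post-editing tasks; they did not improve productivity or quality, but they did improve user experience.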

Comments Accepted at EAMT 2026

2606.15883 2026-06-17 cs.CL cs.AI version update

Koshur Diacritizer: A Byte-Level Sequence-to-Sequence Model for Kashmiri Diacritic Restoration

Koshur Diacritizer：用于克什米尔语变音符号恢复的字节级序列到序列模型

Haq Nawaz Malik, Nahfid Nissar, Faizan Iqbal

Affiliations: arXiv

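AI summary: To address the ambiguity caused by missing diacritics in digital Kashmiri text, proposes Koshur Diacritizer, a byte-level sequence-to-sequence model based on ByT5-small that combines script-aware normalization, alignment validation, and skeleton-preserving inference, reaching a DERm of 0.2012 and a WER of 0.2159 on the test set and 77.5% accuracy in expert evaluation.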

AI中文摘要

克什米尔语是一种使用改良的波斯-阿拉伯字母书写的印度-雅利安语言，在数字文本中经常省略变音符号，造成歧义并挑战下游NLP应用。我们提出了Koshur Diacritizer，一个基于ByT5-small的字节级序列到序列模型，用于恢复克什米尔语文本中的变音符号。为支持此任务，我们发布了一个公开可用的数据集，包含23.7k对齐的未变音/变音克什米尔语句对。所提出的框架结合了脚本感知归一化、对齐验证和骨架保留推理，以确保在保持原始基本字母序列的同时进行可靠的恢复。在保留测试集上的实验结果显示，DERm为0.2012，WER为0.2159。此外，由克什米尔语母语语言学专家评估的平均准确率为77.5%。数据集、模型和源代码已公开发布，为克什米尔语变音符号恢复和未来的低资源语言研究提供了可复现的基线。

英文摘要

Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, frequently omits diacritic marks in digital text, creating ambiguity and challenging downstream NLP applications. We present Koshur Diacritizer, a ByT5-small byte-level sequence-to-sequence model for restoring diacritics in Kashmiri text. To support this task, we release a publicly available dataset of 23.7k aligned undiacritized/diacritized Kashmiri sentence pairs. The proposed framework combines script-aware normalization, alignment validation, and skeleton-preserving inference to ensure reliable restoration while maintaining the original base-letter sequence. Experiments on a held-out test set yield a DERm of 0.2012 and a WER of 0.2159. Additionally, evaluation by a native Kashmiri linguistic expert yields a mean accuracy of 77.5%. The dataset, model, and source code are publicly released to provide a reproducible baseline for Kashmiri diacritic restoration and future low-resource language research.

2606.16009 2026-06-17 cs.CL cs.HC version update

PACE-RAG: Patient-Aware Contextual and Evidence-Constrained RAG for Clinical Drug Recommendation

Chaeyoung Huh, Hyunmin Hwang, Jung Hwan Shin, Sungyang Jo, Jinse Park, Jong Chul Ye

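AI summary: Proposes the PACE-RAG framework, which extracts patient-specific clinical features, retrieves relevant cases, and combines current symptoms with medication history to produce personalized medication recommendations, achieving the best performance on a Parkinson's cohort and on MIMIC-IV.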

Comments 32 pages, 18 figures

AI中文摘要

药物推荐需要深入理解个体患者背景，尤其是帕金森病等复杂疾病。尽管大语言模型拥有广泛的医学知识，但无法捕捉实际处方模式的细微差别。现有的RAG方法也难以应对这些复杂性，因为基于指南的检索仍然过于通用，而相似患者检索往往复制多数模式，未考虑个体患者的独特临床细微差别。为弥合这一差距，我们提出PACE-RAG（患者感知上下文与证据约束RAG）。PACE-RAG并非直接从检索到的患者中复制常用药物，而是首先提取患者特定临床特征，围绕这些特征检索病例，然后利用患者当前症状、活跃用药史和焦点特异性处方倾向来优化最终处方。通过分析针对特定临床特征的治疗模式，PACE-RAG生成患者特定的药物推荐以及可解释的临床总结。在帕金森病队列和MIMIC-IV基准上使用Llama-3.1-8B和Qwen3-8B进行评估，PACE-RAG实现了最先进的性能，F1分数分别达到80.84%和47.22%。这些结果表明PACE-RAG是一个稳健且临床基础扎实的个性化决策支持框架。我们的代码可在以下网址获取：this https URL。

英文摘要

Drug recommendation requires a deep understanding of individual patient context, especially for complex conditions like Parkinson's disease. While LLMs possess broad medical knowledge, they fail to capture the subtle nuances of actual prescribing patterns. Existing RAG methods also struggle with these complexities because guideline-based retrieval remains too generic and similar-patient retrieval often replicates majority patterns without accounting for the unique clinical nuances of individual patients. To bridge this gap, we propose PACE-RAG (Patient-Aware Contextual and Evidence-Constrained RAG). Rather than directly copying frequent medications from retrieved patients, PACE-RAG personalizes recommendations by first extracting patient-specific clinical features, retrieving cases around these features, and then refining the final prescription using the patient's current symptoms, active medication history, and focus-specific prescribing tendencies. By analyzing treatment patterns tailored to specific clinical features, PACE-RAG generates patient-specific medication recommendations along with an explainable clinical summary. Evaluated on a Parkinson's cohort and the MIMIC-IV benchmark using Llama-3.1-8B and Qwen3-8B, PACE-RAG achieved state-of-the-art performance, reaching F1 scores of 80.84% and 47.22%, respectively. These results suggest that PACE-RAG is a robust and clinically grounded framework for personalized decision support. Our code is available at: https://github.com/ChaeYoungHuh/PACE-RAG.

2605.08077 2026-06-17 cs.CL version update

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

Conformal Path Reasoning: 通过路径级校准实现可信的知识图谱问答

Shuhang Lin, Chuhao Zhou, Xiao Lin, Zihan Dong, Kuan Lu, Zhencan Peng, Jie Yin, Dimitris N. Metaxas

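AI summary: Proposes the Conformal Path Reasoning (CPR) framework, which uses query-level conformal calibration and a Residual Conformal Value Network (RCVNet) to learn discriminative path-level nonconformity scores, reducing prediction set size by 52% on average while preserving valid coverage.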

Comments 13 pages, 3 figures, 2 tables

AI中文摘要

知识图谱问答(KGQA)提供了基于事实的、可解释的推理，但现有方法通常无法对检索到的答案提供可靠的覆盖保证。虽然共形预测(CP)为生成具有统计保证的预测集提供了原则性框架，但先前的共形KGQA方法存在两个关键缺陷：由于无效校准导致的覆盖保证被违反，以及弱分数判别性导致预测集过大。我们提出Conformal Path Reasoning (CPR)，一个基于两个关键创新的新型可信KGQA框架。首先，路径级分数上的查询级共形校准保持可交换性以确保有效的覆盖保证。其次，我们引入残差共形价值网络(RCVNet)，这是一个通过PUCT引导探索训练的轻量级模块，用于学习判别性的路径级非一致性分数。大量实验表明，与基准数据集上的共形基线相比，CPR将经验覆盖率平均提高45%，同时将预测集大小平均减少52%，突显了其在知识图谱上进行可靠共形推理的有效性。

英文摘要

Knowledge Graph Question Answering (KGQA) offers grounded, interpretable reasoning, but existing methods often fail to provide reliable coverage guarantees over retrieved answers. While Conformal Prediction (CP) offers a principled framework for producing prediction sets with statistical guarantees, prior conformal KGQA methods suffer from two critical pitfalls: violated coverage guarantees due to invalid calibration, and weak score discriminability that yields excessively large prediction sets. We propose Conformal Path Reasoning (CPR), a novel trustworthy KGQA framework built on two key innovations. First, query-level conformal calibration over path-level scores preserves exchangeability to ensure valid coverage guarantees. Second, we introduce the Residual Conformal Value Network (RCVNet), a lightweight module trained via PUCT-guided exploration to learn discriminative path-level nonconformity scores. Extensive experiments show that CPR significantly improves the Empirical Coverage Rate by 45% while reducing prediction set size by 52% on average over conformal baselines across benchmark datasets, highlighting its effectiveness for reliable conformal reasoning over knowledge graphs.
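
For readers new to conformal prediction, the following is a minimal sketch of generic split conformal calibration, the building block the abstract refers to. It is not CPR itself: the query-level calibration and the RCVNet scorer are not reproduced, and the function names, calibration scores, and path names are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores.

    With n exchangeable calibration scores, the ceil((n + 1) * (1 - alpha))-th
    smallest score gives prediction sets with coverage of at least 1 - alpha.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [name for name, score in candidate_scores.items() if score <= threshold]


# Hypothetical nonconformity scores of the correct answers on calibration queries.
cal = [0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.55, 0.60, 0.80]
tau = conformal_threshold(cal, alpha=0.25)

# Hypothetical candidate reasoning paths for a new query (lower score = more plausible).
paths = {"path_a": 0.12, "path_b": 0.58, "path_c": 0.91}
print(tau, prediction_set(paths, tau))
```

A more discriminative score function pushes wrong candidates above the threshold more often, which is how better scores translate into smaller prediction sets at the same coverage level.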

2606.15735 2026-06-17 cs.CL cs.AI version update

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Kellen Gillespie, Robyn Perry

Affiliations: Superhuman, Inc.

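AI summary: Studies how routing accuracy degrades as an enterprise assistant's tool library grows, and recovers 10-17 percentage points of F1 through embedding-based preselection.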

Comments 10 pages (6 main + 4 appendix), 4 figures, 6 tables

2606.17542 2026-06-17 cs.CL new submission

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

评估大型语言模型在会议中的收话人、话轮转换和下一说话人预测能力

Ryo Fukuda, Takatomo Kano, Siddhant Arora, Marc Delcroix, Naohiro Tawara, Atsunori Ogawa, Yuya Chiba, Atsushi Ando, William Chen, Shinji Watanabe

Affiliations: NTT, Inc., Japan; Language Technologies Institute, Carnegie Mellon University, USA

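AI summary: Uses LLMs to study turn-taking in multimodal multi-party conversations, building an evaluation framework for addressee detection, turn-change prediction, and next speaker prediction; LLMs outperform supervised models and humans on next speaker prediction, while a multimodal LLM remains below human performance on addressee detection and turn-change prediction.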

Comments Accepted to INTERSPEECH 2026

AI中文摘要

我们利用大型语言模型（LLMs）研究多模态多人对话中的话轮转换。我们构建了一个评估框架，涵盖三个任务：收话人检测、话轮转换预测和下一说话人预测。我们比较了针对这些任务训练的监督模型、基于文本的LLMs、多模态LLMs（MM-LLMs）以及人类受试者。在AMI语料库上的实验表明，尽管LLMs未在目标领域进行训练，也无法访问音频或视觉信息，但它们在下一说话人预测方面优于监督模型和人类。多模态LLM在收话人检测和话轮转换预测上表现优于基于文本的LLMs，但仍低于人类表现，表明其难以利用原始音视频信号。消融分析显示，对话上下文至关重要，尤其是对于下一说话人预测。我们观察到人类和LLM的预测模式相似，且话轮转换频繁的区间对两者都具有挑战性。

英文摘要

We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.

2606.17628 2026-06-17 cs.CL new submission

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

OPD-Evolver：通过在线策略蒸馏培养整体智能体进化器

Guibin Zhang, Xun Xu, Yanwei Yue, Zikun Su, Wangchunshu Zhou, Xiaobin Hu, Shuicheng Yan

Affiliations: LV-NUS Lab; FDU; PKU; Bytedance Inc.

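AI summary: Proposes OPD-Evolver, which uses slow and fast loops together with on-policy self-distillation so that an agent learns to select, use, write, and maintain experience; it surpasses existing methods on several benchmarks and lets a small model challenge much larger ones.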

AI中文摘要

记忆已成为自我进化智能体的标准基础，但保留经验并不等同于学习如何通过经验进化。现有的记忆智能体可以存储轨迹、检索反思或积累技能，但往往缺乏选择有用经验、据此行动、编写可重用知识以及维护不断增长的存储库的整体能力。我们引入OPD-Evolver，一种快慢双协同进化框架，通过在线策略自蒸馏培养这样的智能体进化器。在快循环中，OPD-Evolver与四级记忆层次结构交互，以读取、使用、编写和维护经验，实现快速测试时进化。在慢循环中，结果校准的记忆归因和特权事后视角将这些四种能力蒸馏到可部署的策略中。在多领域基准测试中，OPD-Evolver超越了诸如ReasoningBank等记忆系统高达11.5%，以及诸如Skill0等基于训练的方法约5.8%。进一步分析表明，OPD-Evolver内化了高价值经验和记忆管理，使OPD-Evolver-9B能够挑战诸如Qwen3.5-397B-A17B和Step-3.5-Flash等大型对手，指向超越记忆增强智能体的真正合格的智能体进化器。

英文摘要

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.

2606.17838 2026-06-17 cs.CL new submission

Environment-Grounded Automated Prompt Optimization for LLM Game Agents

面向LLM游戏智能体的环境驱动自动提示优化

Rean Clive Fernandes, Lukas Fehring, Theresa Eimer, Marius Lindauer, Matthias Feurer

Affiliations: Lamarr Institute for ML and AI; TU Dortmund University; Leibniz University Hannover; L3S Research Center

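AI summary: Proposes an automated prompt optimization framework that splits the observation-to-action pipeline into a descriptor and a selector and iteratively refines their prompts through an evolutionary loop driven by environment returns, substantially raising success rates on BabyAI tasks.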

AI中文摘要

交互环境中的LLM智能体对其提示高度敏感，但提示工程仍然是手动的、特定于任务的过程。我们为LLM智能体引入了一个自动提示优化框架，该框架将观察-动作管道分解为一个目标条件描述器智能体和一个动作选择智能体，并通过由环境回报引导的LLM驱动进化循环迭代地优化每个模块的提示。我们提出一个行为分析器，将情节结果归因于特定的提示组件，以及一个变异器，在通过环境回滚验证之前，对提示提出有针对性的修订。我们在BALROG基准测试中的所有五个BabyAI任务上进行了评估，在普通和引导提示初始化下，将我们的管道与BALROG的RobustCoTAgent进行了比较。优化在任务和条件下一致地提高了性能，无需更新模型权重。在PutNext（一个多步协调任务，RobustCoTAgent的成功率为0%）上，我们的框架使用相同的底层LLM和优化提示达到了高达72.5%的成功率。这些结果表明，多智能体框架结合自动提示优化，无需微调或大量人工监督即可增强LLM。

英文摘要

LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes the observation-to-action pipeline into a goal-conditioned descriptor agent and an action selection agent, and iteratively refines each module's prompt through an LLM-driven evolutionary loop guided by environment returns. We propose a behavior analyzer to attribute episode outcomes to specific prompt components, and a mutator to propose targeted revisions to the prompt, before validating them through environment rollouts. We evaluate on all five BabyAI tasks in the BALROG benchmark, comparing our pipeline against BALROG's RobustCoTAgent under both plain and guided prompt initializations. Optimization improves performance consistently across tasks and conditions, without requiring updates to the model weights. On PutNext, a multi-step coordination task where the RobustCoTAgent achieves 0% success, our framework reaches up to 72.5% success rate using the same underlying LLM with optimized prompts. These results suggest that a multi-agent framework, combined with automatic prompt optimization, enhances LLMs without the need for fine-tuning or extensive human supervision.

2606.18051 2026-06-17 cs.CL new submission

Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose

面向LLM智能体的组合技能路由：分解、检索与组合

Xueping Gao

Affiliations: Alibaba Cloud

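AI summary: Proposes the SkillWeaver framework, which uses Iterative Skill-Aware Decomposition (SAD) to split complex queries into atomic sub-tasks, retrieves a skill for each, and composes them into an executable plan; experiments on CompSkillBench show that decomposition quality is the main bottleneck, and SAD raises decomposition accuracy from 51.0% to 67.7%, a relative gain of 32.7%.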

AI中文摘要

LLM智能体越来越依赖外部技能（可复用的工具规范），但现实任务通常需要组合多种技能，而不仅仅是选择一种。我们将此形式化为组合技能路由问题：给定一个复杂的用户查询和一个大型技能库，将查询分解为原子子任务，为每个子任务检索适当的技能，并组合成一个可执行的计划。我们提出了SkillWeaver，一个分解-检索-组合框架，结合了LLM任务分解器、带有FAISS索引的双编码器技能检索器以及依赖感知的DAG规划器。为了支持评估，我们引入了CompSkillBench，这是一个包含300个组合查询的基准测试，覆盖来自公共MCP生态系统的2,209个真实MCP服务器技能，跨越24个功能类别。我们的实验表明，任务分解质量是主要瓶颈：标准LLM分解在步骤级别仅达到34.2%的类别召回率。为了解决这个问题，我们提出了迭代技能感知分解（SAD），这是一种检索增强的反馈循环，能够迭代地将分解与可用技能对齐。SAD在单次迭代中将分解准确率从51.0%提高到67.7%（+32.7%，Wilcoxon p < 10^-6）；DA条件分析证实，正确的粒度是有效检索的前提（当DA=1时，CatR@1从34%上升到41%）。SkillWeaver将上下文窗口消耗减少了99%以上，迁移实验证实了泛化能力（即使目标类别不在检索池中，相对DA增益仍达到+35.6%）。

英文摘要

LLM agents increasingly rely on external skills -- reusable tool specifications -- but real-world tasks often require composing multiple skills, not just selecting one. We formalize this as the Compositional Skill Routing problem: given a complex user query and a large skill library, decompose the query into atomic sub-tasks, retrieve the appropriate skill for each sub-task, and compose an executable plan. We present SkillWeaver, a decompose-retrieve-compose framework combining an LLM task decomposer, a bi-encoder skill retriever with FAISS indexing, and a dependency-aware DAG planner. To support evaluation, we introduce CompSkillBench, a benchmark of 300 compositional queries over 2,209 real MCP server skills spanning 24 functional categories, sourced from the public MCP ecosystem. Our experiments reveal that task decomposition quality is the primary bottleneck: standard LLM decomposition reaches only 34.2% category recall at the step level. To address this, we propose Iterative Skill-Aware Decomposition (SAD), a retrieval-augmented feedback loop that iteratively aligns decomposition with available skills. SAD improves decomposition accuracy from 51.0% to 67.7% (+32.7%, Wilcoxon p < 10^-6) in a single iteration; DA-conditioned analysis confirms that correct granularity is the prerequisite for effective retrieval (CatR@1 rises from 34% to 41% when DA=1). SkillWeaver reduces context window consumption by over 99%, and transfer experiments confirm generalization (+35.6% relative DA gain even when target categories are absent from the retrieval pool).
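
Two details in the abstract above can be made concrete with a short sketch: how a dependency-aware plan can be ordered, and how the +32.7% figure follows from the two accuracies. This is an illustration using Python's standard `graphlib` module (Python 3.9+), not SkillWeaver's planner; the sub-task names and helper functions are hypothetical.

```python
from graphlib import TopologicalSorter


def plan_order(dependencies):
    """Order sub-tasks so that each one runs after everything it depends on.

    dependencies maps a sub-task name to the set of sub-tasks it needs first.
    """
    return list(TopologicalSorter(dependencies).static_order())


def relative_gain(before, after):
    """Relative improvement in percent, as opposed to percentage points."""
    return (after - before) / before * 100.0


# Hypothetical decomposition of one compositional query into three sub-tasks.
deps = {
    "fetch_report": set(),
    "summarize": {"fetch_report"},
    "email_summary": {"summarize"},
}
order = plan_order(deps)

# 51.0% -> 67.7% is +16.7 percentage points, which is about +32.7% relative.
gain = relative_gain(51.0, 67.7)
print(order, round(gain, 1))
```

Reporting the gain as relative rather than absolute explains why the quoted +32.7% is larger than the 16.7-point difference between the two accuracies.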

2606.17645 2026-06-17 cs.AI cs.CL cs.LG cross-list

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

超越领域：通过可迁移交互模式重用网络技能

Shiqi He, Yue Cui, Feijie Wu, Xinyu Ma, Jiaheng Lu, Yaliang Li, Bolin Ding, Mosharaf Chowdhury

Affiliations: University of Michigan; Alibaba Group; Purdue University; McMaster University; University of Pennsylvania

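AI summary: Proposes SkillMigrator, an agent that stores skills as transferable interaction patterns (TIPs) and matches layout structure rather than element references to reuse skills across sites, cutting the LLM-action count on successful trajectories by 8-10% on WebArena and Mind2Web.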

AI中文摘要

大型语言模型（LLM）网络代理通常被部署为工具调用者：每轮，模型读取新的页面观察并发出一个结构化工具动作。当每个动作都是低级原语时，视野迅速增长，面向策略的LLM完成次数也随之增加，在Mind2Web和WebArena等基准测试中主导了延迟和成本。因此，最近的系统将重复的交互片段包装为网络技能：从成功轨迹或诱导程序中构建的可调用工具，这样一次调用可以替代多个原语。然而，先前的技能库仍然主要通过指令相似性或粗略的站点元数据触发，这导致在未见站点上技能重用率低，并留下了许多潜在的步骤和令牌减少空间。我们提出了SkillMigrator，一个学习可重用网络技能并通过匹配布局结构而非特定元素引用来跨站点迁移它们的代理。每个诱导技能被存储为可迁移交互模式（TIP）：技能与诱导时快照的结构草图配对。在测试时，SkillMigrator通过布局相似性检索TIP，并将其引用锚定到实时页面。其余堆栈是标准的：具有稳定引用的可访问性快照观察，以及基于原语加技能调用的固定工具调用。与最先进的方法相比，SkillMigrator在匹配成功率的情况下，将WebArena和Mind2Web上成功轨迹的平均LLM动作数减少了8-10%。

英文摘要

Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost on benchmarks such as Mind2Web and WebArena. Recent systems therefore wrap repeated interaction fragments as web skills: callable tools built from successful trajectories or induced programs, so one call can replace several primitives. However, prior skill libraries are still triggered mainly by instruction similarity or coarse site metadata, which yields low skill reuse on held-out sites and leaves much of the potential step and token reduction on the table. We present SkillMigrator, an agent that learns reusable web skills and transfers them across sites by matching layout structure rather than specific element references. Each induced skill is stored as a transferable interaction pattern (TIP): the skill paired with a structural sketch of the snapshot at induction time. At test time, SkillMigrator retrieves TIPs by layout similarity and grounds their references on the live page. The rest of the stack is standard: accessibility-snapshot observations with stable references, and fixed tool calling over primitives plus skill invocations. Compared with the state-of-the-art approaches, SkillMigrator reduces the average LLM-action count on successful trajectories by 8-10% across both WebArena and Mind2Web at matched success rate.

2606.17680 2026-06-17 cs.LG cs.CL cross-list

EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

EnvRL: 在智能体强化学习中从环境动力学中学习

Zhitong Wang, Songze Li, Hao Peng, Shuzheng Si, Yi Wang, Maosong Sun, Juanzi Li

Affiliations: Department of Computer Science and Technology, Tsinghua University; Shanghai AI Laboratory

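AI summary: Proposes EnvRL, which folds environment dynamics learning into agentic reinforcement learning through two auxiliary objectives, state prediction and inverse dynamics, significantly improving success rates on long-horizon tasks.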

AI中文摘要

强化学习已成为训练大型语言模型作为智能体的强大范式。然而，针对长周期智能体任务的常规强化学习方法往往难以处理稀疏的结果奖励。直观上，这忽略了展开交互轨迹中包含的丰富环境动力学信息。我们认为交互体验本身固有地充当隐式监督信号，揭示了环境的潜在转换机制，并使智能体能够构建更准确的环境内部模型。因此，在这项工作中，我们研究了如何利用这一额外信号来改进策略学习。具体来说，我们提出了EnvRL，一个通过两个辅助目标（状态预测和逆动力学）将环境动力学学习融入智能体强化学习的框架。通过与主要强化学习目标联合优化，我们鼓励智能体从其自身的交互体验中内化环境动力学。在两个长周期智能体基准上的大量实验表明，EnvRL在成功率上比仅使用强化学习的基线有显著提升，例如，当使用GRPO训练时，在ALFWorld上将Qwen-2.5-1.5B-Instruct从72.8%提升到77.4%，在WebShop上从56.8%提升到67.0%。

英文摘要

Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to construct a more accurate internal model of the environment. Therefore, in this work, we investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By jointly optimizing with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements in success rates over RL-only baselines, e.g., when trained with GRPO, lifting Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld, and from 56.8% to 67.0% on WebShop.
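
One plausible way to write the joint objective described above is a weighted sum of the RL loss and the two auxiliary losses. The abstract does not give the formula, so the additive form, the weights $\lambda_{\text{sp}}$ and $\lambda_{\text{id}}$, and all symbol names below are assumptions for illustration only.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{RL}}(\theta)
  + \lambda_{\text{sp}} \, \mathcal{L}_{\text{state}}(\theta)
  + \lambda_{\text{id}} \, \mathcal{L}_{\text{inv}}(\theta),
\qquad
\mathcal{L}_{\text{state}} = -\log p_\theta(s_{t+1} \mid s_t, a_t),
\quad
\mathcal{L}_{\text{inv}} = -\log p_\theta(a_t \mid s_t, s_{t+1})
```

In this reading, state prediction asks the model to predict the next observation from the current state and action, while inverse dynamics asks it to recover the action that connects two consecutive states; both targets come for free from rollout trajectories.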

2606.17786 2026-06-17 cs.HC cs.CL cross-list

Toward Accessible Psychotherapy Training Using AI-Driven Interactive Patient Avatars

利用AI驱动的交互式患者化身实现可及的心理治疗培训

Pascal Riachi, Sofie Kamber, Stella Brogna, Andrew Gloster, Rafael Wampfler

Affiliations: ETH Zurich; University of Lucerne

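AI summary: Presents a system for training ACT psychotherapists through dialogue with an embodied virtual patient; LLMs simulate patient behavior and an automated evaluator gives turn-by-turn feedback based on ACT fidelity criteria, and expert evaluation confirms high realism and training value.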

DOI: 10.1109/ICHI69079.2026.00124
Journal ref: 2026 IEEE 14th International Conference on Healthcare Informatics (ICHI), Minneapolis, MN, June 1-3, 2026, pp. 990-995

AI中文摘要

培训心理治疗师掌握诸如接纳与承诺疗法（ACT）等循证干预措施需要反复练习并伴有有意义的反馈，然而安全、标准化的培训机会受到伦理、后勤和资源限制。我们引入一个系统，旨在通过与具身虚拟患者的语音对话支持ACT导向的心理治疗培训。该系统使用大语言模型模拟患者行为，其行为基于真实治疗会话中提取的档案和可配置的临床场景，同时一个独立的自动评估器根据既定的ACT保真度标准为治疗师的回应提供逐轮反馈。该系统并非旨在取代督导，而是通过支持在低风险环境中进行实验、反思和即时反馈来促进刻意练习。执业心理学家的专家评估证实了患者行为的高度真实性，并表明即时的逐轮ACT反馈提高了治疗师对干预选择的意识，并使他们能够有效尝试替代回应。对49份治疗记录的定量评估确定GPT-4o-mini为最佳反馈模型，在复制人类督导的ACT保真度评分时实现了最低的平均绝对误差（MAE = 6.12），且具有统计显著性的一致性。这项工作展示了保真度感知的模拟患者作为心理治疗培训的可扩展补充的潜力。

英文摘要

Training psychotherapists in evidence-based interventions such as Acceptance and Commitment Therapy (ACT) requires repeated practice with meaningful feedback, yet opportunities for safe, standardized training are limited by ethical, logistical, and resource constraints. We introduce a system designed to support ACT-oriented psychotherapy training through spoken dialogue with an embodied virtual patient. The system uses large language models to simulate patient behavior conditioned on profiles derived from real therapy sessions and configurable clinical scenarios, while a separate automated evaluator provides turn-by-turn feedback on therapist responses based on established ACT fidelity criteria. Rather than aiming to replace supervision, the system is intended to support deliberate practice by enabling experimentation, reflection, and immediate feedback in low-risk settings. Expert evaluation with practicing psychologists confirmed high realism in patient behavior and demonstrated that immediate turn-by-turn ACT feedback increased therapists' awareness of intervention choices and enabled effective experimentation with alternative responses. Quantitative evaluation across 49 therapy transcripts identified GPT-4o-mini as the optimal feedback model, achieving the lowest mean absolute error (MAE = 6.12) in replicating human supervisor ACT fidelity ratings with statistically significant agreement. This work demonstrates the potential of fidelity-aware simulated patients as a scalable complement to psychotherapy training.

2504.03991 2026-06-17 cs.CL cs.AI cs.HC cs.MA version update

Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

利用大型语言模型实现多样化类人团队协作与通信的算法化提示生成

Siddharth Srikanth, Varun Bhatt, Boshen Zhang, Werner Hager, Charles Michael Lewis, Katia P. Sycara, Aaquib Tabrez, Stefanos Nikolaidis

发表机构 * Thomas Lord Department of Computer Science, University of Southern California（美国南加州大学汤姆·劳德计算机科学系）； School of Computing and Information, University of Pittsburgh（美国匹兹堡大学计算与信息学院）； Robotics Institute, Carnegie Mellon University（卡内基梅隆大学机器人研究所）； Sibley School of Mechanical and Aerospace Engineering, Cornell University（康奈尔大学西伯利机械与航空航天工程学院）

AI总结结合质量多样性优化与LLM代理，自动搜索生成多样化团队行为的提示，捕获人类协作与通信策略，并通过用户研究验证其类人性。

详情

AI中文摘要

理解人类如何在团队中协作和通信对于改善人-代理团队协作和AI辅助决策至关重要。然而，由于后勤、伦理和实际限制，仅依赖大规模用户研究的数据是不切实际的，因此需要多种多样化人类行为的合成模型。最近，基于大型语言模型（LLM）的代理已被证明能够在社交环境中模拟类人行为。但是，获得大量多样化行为需要手动设计提示。另一方面，质量多样性（QD）优化已被证明能够生成多样化的强化学习（RL）代理行为。在这项工作中，我们将QD优化与LLM驱动的代理相结合，以迭代搜索在长时域、多步骤协作环境中生成多样化团队行为的提示。我们首先通过一项人类受试者实验表明，人类在该领域中表现出多样化的协调和通信行为。然后，我们进行一系列实验，表明我们的方法捕获了在没有大规模数据收集的情况下难以观察到的行为，并通过后续用户研究表明这些生成的行为是类人的。我们的发现凸显了QD与LLM驱动代理的结合作为研究多代理协作中团队协作和通信策略的有效工具。

英文摘要

Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. However, obtaining a large set of diverse behaviors requires manual effort in designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment, that humans exhibit diverse coordination and communication behavior in this domain. We then present a series of experiments showing that our approach captures behaviors that are difficult to observe without large-scale data collection, and a follow-up user study showing that these generated behaviors are human-like. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.


2504.11837 2026-06-17 cs.CL cs.AI 版本更新

EmoFSM: A Finite State Machine for Emotional Support Conversation

EmoFSM：一种用于情感支持对话的有限状态机

Yue Zhao, Qingqing Gu, Xiaoyu Wang, Teng Chen, Zhonglin Jiang, Yong Chen, Hongyan Li, Luo Ji

AI总结针对情感支持对话中长期满意度不足的问题，提出EmoFSM框架，利用有限状态机引导大语言模型进行规划与自我推理，在多个数据集上优于多种基线方法。

Comments 15 pages, 4 figures. PAKDD 2026

2606.15903 2026-06-17 cs.CL cs.AI 版本更新

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

控制平面放置塑造遗忘：跨十三种系统配置的智能体记忆架构研究

Dongxu Yang

发表机构 * DeepLethe

AI总结研究LLM在智能体记忆管道中的位置（控制平面 vs 召回平面）对遗忘失败模式的影响，通过13种配置在385例对抗测试集上的实验，揭示了三种放置机制的互补覆盖范围，并提出了ForgetEval评估套件。

Comments 25 pages including appendices. Code, benchmark, and adapters released under MIT at https://github.com/deeplethe/lethe

详情

AI中文摘要

LLM在智能体记忆管道中的位置——位于检索存储事实（广泛基准测试）的召回平面和通过替换、释放、清除来改变事实（基本未经测试）的控制平面之间——决定了系统能够恢复哪些遗忘失败模式。通过在385例对抗测试集上比较十三种系统配置，我们观察到三种具有部分互补覆盖范围的放置机制：确定性原语足以处理词汇/时间类别，但无法处理规范化（标识符混淆上5%，跨语言上0%）；写入时LLM可以恢复规范化（100%），但无法处理意图感知删除（前缀冲突和复合事实为0%）；变异时钩子可以恢复意图感知删除（78-85%），并同时提升几乎所有类别的性能（整体91.7-93.2%，每385例运行成本0.17美元，每例变异延迟2.3秒，而确定性方法为64-191毫秒，召回路径不变）。我们通过ForgetEval揭示了这种权衡，ForgetEval包含1000例模板化套件和385例对抗层（132例手工制作+253例LLM生成并经预言机验证），通过确定性子串匹配评分，并配有一个六方法适配器协议，采用诚实的N/A评分，允许异构记忆存储以130行代码接入。该协议通过10名标注者的IAA（Fleiss' kappa = 0.958）和77例外部作者子集（四位盲贡献者）得到验证，该子集复现了规范化不对称性并放大了联合放置的提升（+27.8个百分点）。生产环境中的失败主要是遗忘失败而非召回失败，但现有基准仅衡量召回。ForgetEval和所有适配器均以MIT许可发布。

英文摘要

Where an LLM sits in an agent memory pipeline -- between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested) -- shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage: deterministic primitives suffice for lexical/temporal categories but fail canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact); a mutation-time hook recovers intent-aware deletion (78-85%) and improves nearly all categories simultaneously (91.7-93.2% overall, $0.17 per 385-case run, 2.3s/case mutation latency vs. 64-191ms/case deterministic, recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted oracle-validated) scored by deterministic substring match, paired with a six-method Adapter Protocol with honest N/A scoring that lets heterogeneous memory stores enter in 130 lines. Admission is corroborated by 10-annotator IAA (Fleiss' kappa = 0.958) and a 77-case external-authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under MIT.
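
The abstract states that cases are "scored by deterministic substring match". The sketch below shows one way such a scorer could look. It is an assumption-laden illustration, not the ForgetEval implementation: the names `score_case` and `pass_rate`, the case-insensitive comparison, and the four toy cases are all hypothetical.

```python
def score_case(response, must_be_absent, must_be_present=()):
    """Pass only if every forgotten string is gone and every retained string remains.

    Case-insensitive matching is an assumption made for this sketch.
    """
    text = response.casefold()
    forgotten = all(s.casefold() not in text for s in must_be_absent)
    retained = all(s.casefold() in text for s in must_be_present)
    return forgotten and retained


def pass_rate(results):
    """Percentage of passing cases."""
    return 100.0 * sum(results) / len(results)


# Hypothetical cases: (memory response, strings that must be absent, strings that must remain).
cases = [
    ("The user lives in Berlin.", ["Paris"], ["Berlin"]),  # superseded fact gone, new fact kept
    ("The user lives in Paris.", ["Paris"], ["Berlin"]),   # stale fact leaked
    ("No phone number on file.", ["555-0100"], []),        # purge succeeded
    ("Call 555-0100 for details.", ["555-0100"], []),      # purge failed
]

results = [score_case(resp, absent, present) for resp, absent, present in cases]
rate = pass_rate(results)
print(results, rate)
```

Because the check is a plain substring test, the same inputs always yield the same verdict, which is what makes a score of this kind reproducible without an LLM judge.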


2606.16591 2026-06-17 cs.CL 版本更新

SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents

SING: 用于LLM代理中可扩展主动工具发现的合成意图图

Qiao Xiao, Haochen Shi, Yisen Gao, Wenbin Hu, Huihao Jing, Tianshi Zheng, Baixuan Xu, Ziheng Zhang, Weiqi Wang, Haoran Li, Jiaxin Bai, Yangqiu Song

发表机构 * Cornell University（康奈尔大学）； The Hong Kong University of Science and Technology（香港科技大学）； The Ohio State University（俄亥俄州立大学）； Hong Kong Baptist University（香港浸会大学）

AI总结提出SING框架，通过构建意图-工具图并动态检索工具，在长周期任务中提升工具发现准确率，Global Recall@5最高提高59.8%，下游成功率最高提高28.9%。

详情

AI中文摘要

大型语言模型（LLM）代理越来越依赖管理上下文、工具和多轮执行的代理框架，使工具成为在真实数字环境中行动的核心接口。随着框架连接的工具生态系统扩展到数百或数千个API、服务和任务特定技能，穷举工具模式注入变得昂贵，并施加了封闭世界假设，将代理限制在预定义的静态库存中。检索增强的工具选择提供了一种自然的替代方案，但现有的一次性检索方法通常无法将孤立的工具描述与代理的真实任务意图对齐，特别是在所需能力通过分解、观察和新诱导的子目标逐步涌现的长期任务中。我们提出SING，一种意图感知的主动工具发现框架，它构建了一个连接用户意图、工具能力和工具协作模式的意图-工具图，并根据不断变化的任务状态动态检索工具。使用包含7,471个工具的统一语料库，我们在三个真实世界的工具使用基准上评估了SING。与基线相比，SING将全局Recall@5最高提高了59.8%，下游成功率最高提高了28.9%，同时将全语料库工具模式暴露减少了99.8%，表明意图感知的图结构能够在大规模代理生态系统中实现更准确和上下文高效的工具发现。

英文摘要

Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, making tools a central interface for acting in realistic digital environments. As harness-connected tool ecosystems expand to hundreds or thousands of APIs, services, and task-specific skills, exhaustive tool schema injection becomes costly and imposes a closed-world assumption that limits agents to a predefined static inventory. Retrieval-augmented tool selection offers a natural alternative, but existing one-shot retrieval methods often fail to align isolated tool descriptions with the agent's true task intention, especially in long-horizon tasks where required capabilities emerge through decomposition, observations, and newly induced subgoals. We propose SING, an intention-aware active tool discovery framework that builds an intention-tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and dynamically retrieves tools according to evolving task states. Using a unified corpus of 7,471 tools, we evaluate SING on three real-world tool-use benchmarks. SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines, while reducing full-corpus tool-schema exposure by 99.8%, demonstrating that intention-aware graph structure enables more accurate and context-efficient tool discovery in large-scale agentic ecosystems.
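
The headline metric here is Recall@5. As a rough sketch of the standard per-query definition (the paper's "Global" variant may aggregate differently, which this snippet does not attempt to reproduce), recall at k is the fraction of required tools that appear among the top k retrieved. The tool names below are hypothetical.

```python
def recall_at_k(ranked_tools, relevant_tools, k=5):
    """Fraction of the relevant tools found in the top-k ranked results."""
    relevant = set(relevant_tools)
    if not relevant:
        raise ValueError("relevant_tools must not be empty")
    hits = relevant.intersection(ranked_tools[:k])
    return len(hits) / len(relevant)


# Hypothetical retrieval output for one task (tool names are illustrative only).
ranked = [
    "search_flights", "get_weather", "book_hotel",
    "send_email", "convert_currency", "create_invoice",
]
needed = ["book_hotel", "create_invoice", "search_flights"]

print(recall_at_k(ranked, needed, k=5))  # two of the three needed tools are in the top 5
```

Retrieving only a handful of candidates per step is also what allows the large reduction in schema exposure the abstract reports: the agent sees the top k tools instead of the full corpus.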


2510.19838 2026-06-17 cs.AI cs.CL cs.LG 版本更新

弥合基于LLM的代码翻译中的功能正确性与运行时效率差距

Longhui Zhang, Jiahao Wang, Chenhao Hu, Bingyu Liang, Jing Li, Min Zhang

发表机构 * Harbin Institute of Technology, Shenzhen, China（哈尔滨工业大学（深圳））

AI总结提出SwiftTrans框架，通过多视角探索和差异感知选择两阶段，利用并行上下文学习和差异比较，同时提升LLM代码翻译的正确性和运行时效率。

Comments Accepted to ICML 2026

详情

AI中文摘要

尽管大型语言模型（LLM）显著提高了自动代码翻译系统的功能正确性，但翻译后程序的运行时效率却相对较少受到关注。随着摩尔定律的式微，运行时效率与功能正确性一样，对程序质量变得越来越重要。我们的初步研究表明，LLM翻译的程序通常比人类编写的程序运行得更慢，而且这个问题无法仅通过提示工程来解决。因此，我们的工作提出了SwiftTrans，一个包含两个关键阶段的代码翻译框架：（1）多视角探索，其中MpTranslator利用并行上下文学习（ICL）生成多样化的翻译候选；（2）差异感知选择，其中DiffSelector通过显式比较翻译之间的差异来识别最优候选。我们进一步为MpTranslator引入层次化指导，为DiffSelector引入序数指导，使LLM能够更好地适应这两个核心组件。为了支持对翻译程序运行时效率的评估，我们扩展了现有基准CodeNet和F2SBench，并引入了一个新基准SwiftBench。在所有三个基准上的实验结果表明，SwiftTrans在正确性和运行时效率方面都取得了一致的改进。

英文摘要

While large language models (LLMs) have greatly advanced the functional correctness of automated code translation systems, the runtime efficiency of translated programs has received comparatively little attention. With the waning of Moore's law, runtime efficiency has become increasingly important for program quality, alongside functional correctness. Our preliminary study reveals that LLM-translated programs often run slower than human-written ones, and this issue cannot be remedied through prompt engineering alone. Therefore, our work proposes SwiftTrans, a code translation framework comprising two key stages: (1) Multi-Perspective Exploration, where MpTranslator leverages parallel in-context learning (ICL) to generate diverse translation candidates; and (2) Difference-Aware Selection, where DiffSelector identifies the optimal candidate by explicitly comparing differences between translations. We further introduce Hierarchical Guidance for MpTranslator and Ordinal Guidance for DiffSelector, enabling LLMs to better adapt to these two core components. To support the evaluation of runtime efficiency in translated programs, we extend existing benchmarks, CodeNet and F2SBench, and introduce a new benchmark, SwiftBench. Experimental results across all three benchmarks show that SwiftTrans achieves consistent improvements in both correctness and runtime efficiency.


2606.17999 2026-06-17 cs.CL 新提交

VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination

VoidPadding: 让 [VOID] 处理掩码扩散语言模型中的填充，以便 [EOS] 专注于语义终止

Chunyu Liu, Zhengyang Fan, Kaisen Yang, Alex Lamb

发表机构 * Tsinghua University（清华大学）

AI总结提出VoidPadding方法，通过引入[VOID]令牌处理填充，将[EOS]从双重角色中解放，实现大块解码下的早期停止和自适应响应扩展，在数学推理和代码生成任务上平均提升17.84分，并减少55.7%的NFE。

详情

AI中文摘要

MDLM通过去噪预分配的掩码响应画布生成文本，使得响应长度建模成为指令调优的核心。现有的MDLM通常继承自回归惯例，在指令调优期间使用重复的[EOS]令牌进行填充，赋予[EOS]双重角色：既是语义终止符又是填充令牌。我们表明这种双重角色是大块解码下[EOS]溢出的根本原因之一。为了解耦这些角色，我们提出VoidPadding，引入[VOID]用于填充，并保留[EOS]用于终止。在推理过程中，学习到的[EOS]信号实现早期停止，而学习到的[VOID]信号指导自适应响应画布扩展。在Dream-7B-Instruct上，VoidPadding在数学推理和代码生成基准测试中，将块大小平均的四任务均值比原始模型提高+17.84分，比RainbowPadding提高+6.95分，同时平均减少55.7%的解码NFE。代码可在 https://github.com/Haru-LCY/VoidPadding 获取。

英文摘要

MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated [EOS] tokens for padding during instruction tuning, giving [EOS] a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of [EOS] overflow under large-block decoding. To decouple these roles, we propose VoidPadding, which introduces [VOID] for padding and reserves [EOS] for termination. During inference, the learned [EOS] signal enables early stopping, while the learned [VOID] signal guides adaptive response canvas expansion. On Dream-7B-Instruct, VoidPadding improves the block-size-averaged four-task mean across mathematical reasoning and code generation benchmarks by +17.84 points over the original model and +6.95 points over RainbowPadding, while reducing decoding NFE by 55.7% on average. Code is available at https://github.com/Haru-LCY/VoidPadding.


2606.08810 2026-06-17 cs.CL cs.LG 版本更新

Continuous Language Diffusion as a Decoder-Interface Problem

连续语言扩散作为解码器-接口问题

Zhicheng Du, Lan Ma

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院, 清华大学）

AI总结研究连续扩散语言模型如何从高斯噪声生成流畅文本，提出解码器-盆地机制，并设计诊断协议揭示标量指标隐藏的失败，通过接口相图解释令牌恢复行为。

详情

AI中文摘要

高斯扰乱的句子嵌入没有直接的语言解释，但连续扩散语言模型可以从它们生成流畅文本。我们通过嵌入式语言流（ELF）研究这一谜题，并识别出解码器-盆地机制：我们的证据表明，当轨迹到达原生解码器可以读取稳定令牌的区域时，去噪变得可靠。我们引入了针对可去噪性、语义可恢复性、顺序敏感性、解码器兼容性和轨迹可靠性的诊断协议。它暴露了标量指标隐藏的失败：低均方误差可能丢弃语言内容，低困惑度可能反映低熵崩溃，干净的潜在重建可能与狭窄的解码器盆地共存。一个解码器间隔（margin）界解释了为什么令牌恢复依赖于间隔和局部解码器敏感性，而不仅仅是潜在误差。审计公开的ELF检查点揭示了一个接口相图：早期预测弱可读，轨迹中期分歧标志竞争区域，晚期预测进入高间隔解码器盆地。一旦进入，在生成的ELF状态上令牌实现出奇简单：冻结的T5令牌嵌入查找恢复了原生解码器决策的93%–96%，单个线性读出在32k样本时达到97.9%的一致性，在结构化残差尾部留下约1.1–1.2的困惑度差距。在保守的留出门控下，间隔规则在显式诊断监控下于去噪步骤中提前约17%–28%退出。对LangFlow、BitstreamDiffusion和连续潜在扩散语言模型（Cola-DLM）的边界检查表明，当状态对象和解码器改变时，相同的接口问题仍然有意义。因此，连续和潜在扩散语言模型应作为表示-解码器系统进行评估。

英文摘要

Gaussian-corrupted sentence embeddings have no direct linguistic interpretation, yet continuous diffusion language models can generate fluent text from them. We study this puzzle through Embedded Language Flows (ELF) and identify a decoder-basin mechanism: our evidence suggests that denoising becomes reliable when trajectories reach regions where the native decoder can read stable tokens. We introduce a diagnostic protocol for denoisability, semantic recoverability, order sensitivity, decoder compatibility, and trajectory reliability. It exposes failures hidden by scalar metrics: low mean-squared error can discard linguistic content, low perplexity can reflect low-entropy collapse, and clean latent reconstruction can coexist with a narrow decoder basin. A decoder-margin bound explains why token recovery depends on margin and local decoder sensitivity, not latent error alone. Auditing public ELF checkpoints reveals an interface phase diagram: early predictions are weakly readable, mid-trajectory disagreement marks a competition region, and late predictions enter a high-margin decoder basin. Once inside, token realization is surprisingly simple on generated ELF states: frozen T5 (Text-to-Text Transfer Transformer) token-embedding lookup recovers 93–96% of native decoder decisions, and a single linear readout reaches 97.9% agreement at 32k samples, leaving a perplexity gap of approximately 1.1–1.2 in a structured residual tail. Under conservative held-out gates, a margin rule exits roughly 17–28% earlier in denoising steps under an explicit diagnostic monitor. Boundary checks on LangFlow, BitstreamDiffusion, and the Continuous Latent Diffusion Language Model (Cola-DLM) show that the same interface questions remain meaningful when the state object and decoder change. Continuous and latent diffusion language models should therefore be evaluated as representation-decoder systems.
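
The abstract mentions a margin rule that exits denoising early once predictions sit in a high-margin region. The sketch below illustrates the general idea only: it takes the gap between the two largest decoder logits at each position as the margin and stops at the first step where every position clears a threshold. The function names, the threshold, and the toy logits are hypothetical, and the paper's actual gate may be defined differently.

```python
def top2_margin(logits):
    """Gap between the largest and second-largest logit at one token position."""
    first, second = sorted(logits, reverse=True)[:2]
    return first - second


def exit_step(per_step_logits, threshold):
    """First denoising step where all positions have margin >= threshold.

    Falls back to the final step if the threshold is never met.
    """
    for step, positions in enumerate(per_step_logits):
        if all(top2_margin(logits) >= threshold for logits in positions):
            return step
    return len(per_step_logits) - 1


# Toy trajectory: 4 denoising steps, 2 token positions, 3-word vocabulary.
trajectory = [
    [[0.2, 0.1, 0.0], [0.3, 0.3, 0.1]],  # step 0: decoder barely prefers any token
    [[1.5, 0.4, 0.1], [0.9, 0.7, 0.2]],  # step 1: one position still contested
    [[3.0, 0.5, 0.1], [2.8, 0.6, 0.2]],  # step 2: both positions clearly decided
    [[4.0, 0.3, 0.1], [3.9, 0.4, 0.1]],  # step 3: final step
]

step = exit_step(trajectory, threshold=2.0)
print(f"exit at step {step} of {len(trajectory) - 1}")
```

A stricter threshold exits later and a looser one exits earlier, so the threshold trades saved denoising steps against the risk of decoding before tokens have stabilized.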


2606.17299 2026-06-17 cs.CL 新提交

Examining the Limits of Word2Vec with Toki Pona

用 Toki Pona 检验 Word2Vec 的极限

Daniel Zhenhan Huang, Hongchen Wu

发表机构 * Georgia Institute of Technology（佐治亚理工学院）

AI总结本文使用仅有约130个单词的人造语言 Toki Pona 训练 Word2Vec，探究词汇量极小时嵌入质量，并分析非核心噪声词的影响。

Comments 10 pages, 4 figures, 3 tables. Accepted to the Society for Computation in Linguistics (SCiL) 2026

详情

AI中文摘要

Word2Vec 在生成语义嵌入方面的有效性已得到广泛验证，但其测试几乎完全集中在词汇量大的语言上。本研究使用 Toki Pona（一种约130个单词的人造语言）的数据，检验 Word2Vec 能否在极小的词汇量下成功捕获语义关系。我们从 Toki Pona 社区获取了140万句子（795万词元）用于训练。语料库中约23%的句子包含非 Toki Pona 词元，如命名实体、借词和新词。为了探究这种语言噪声是增强还是阻碍性能——这是词嵌入文献中很少涉及的话题——我们训练了两个不同的模型：一个保留这些偶然词元，另一个将其完全过滤。评估采用定量方法（测量单词到语义类别质心的距离）、自动轮廓分数（通过凝聚聚类）以及定性分析（使用与英语对比的表征相似性矩阵）。结果表明，虽然稀疏的非核心词元不影响所学嵌入的相对结构，但它们实际上使相似词在向量空间中更接近。重要的是，即使在这个极端下限，Word2Vec 的有效性更多地取决于分布模式而非词汇表大小。

英文摘要

Word2Vec's effectiveness at generating semantic embeddings has been widely validated, yet it has been tested almost exclusively on languages with large vocabulary inventories. This study examines whether Word2Vec can successfully capture semantic relationships within an extremely reduced vocabulary using data from Toki Pona, a constructed language with approximately 130 words. We sourced 1.4 million sentences (7.95 million tokens) from the Toki Pona community for training. Approximately 23% of sentences in the corpus contain non-Toki Pona tokens such as named entities, loanwords, and neologisms. To investigate whether this linguistic noise enhances or hinders performance -- a topic rarely addressed in word embedding literature -- we trained two distinct models: one retaining these incidental tokens and another filtering them out completely. Evaluation was conducted using quantitative methods measuring word proximity to semantic category centroids, automated silhouette scores via agglomerative clustering, and qualitative analysis utilizing representational similarity matrices compared against English. The results indicate that while sparse, non-core tokens do not affect the relative structure of the learned embeddings, they actually draw similar words closer together in the vector space. Importantly, Word2Vec's effectiveness depends more on distributional patterns than lexicon size even at this extreme lower bound.
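
One of the evaluations described above measures how close words lie to the centroid of their semantic category. A minimal sketch of that measurement, using cosine similarity and only the standard library, is shown below. The two-dimensional vectors are made-up toy values rather than trained embeddings, and the category grouping is an illustrative assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy 2-D "embeddings" for four Toki Pona words (values invented for illustration).
embeddings = {
    "soweli": [1.0, 0.1],  # land animal
    "waso": [0.9, 0.2],    # bird
    "kala": [0.8, 0.0],    # fish
    "telo": [0.1, 1.0],    # water, outside the animal category
}

animal_centroid = centroid([embeddings[w] for w in ("soweli", "waso", "kala")])
for word, vector in embeddings.items():
    print(word, round(cosine(vector, animal_centroid), 3))
```

In a real run the vectors would come from the trained model, and a category is well captured when its members score consistently higher against their own centroid than words from other categories do.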


2606.17522 2026-06-17 cs.CL cs.LG 新提交

An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars

深度Transformer中基于有界深度文法的层次建模的表达性分析

Vinoth Nandakumar, Qiang Qu, Pramod Thebe, Sakshi Khachariya, Tongliang Liu

发表机构 * University of Sydney（悉尼大学）； San Francisco State University（旧金山州立大学）； IIT Madras（印度理工学院马德拉斯分校）

AI总结通过有界深度、非递归的上下文无关文法，显式构造深度随文法深度线性增长的Transformer，其神经元数随派生树形状数量以及产生式规则数量的平方缩放，结果支持线性表示假说。

详情

AI中文摘要

深度神经网络普遍被认为其表达能力源于形成层次表示的能力，即在各层中逐步捕获更抽象和组合的特征。在语言建模中，Transformer已成为主导架构，早期层捕获局部句法模式，后期层编码更复杂的从句级依赖。尽管这种直觉塑造了模型设计，但缺乏严格的理论工作来展示深度Transformer如何表示这种层次结构。本文通过有界深度、非递归上下文无关文法的形式化视角，分析深度Transformer模型的表达性。对于这类文法，我们显式构造了具有位置注意力的Transformer，其深度随文法深度线性增长，而神经元数量随派生树形状数量以及产生式规则数量的平方缩放。我们的理论结果支持线性表示假说，证明了这些架构具有将抽象语法状态编码为残差流中低维、线性可分子空间的结构能力。

英文摘要

Deep neural networks are widely believed to derive their expressive power from their ability to form hierarchical representations, capturing progressively more abstract and compositional features across layers. In language modeling, transformers have emerged as the dominant architecture, with early layers capturing local syntactic patterns and later layers encoding more complex clause-level dependencies. While this intuition has shaped model design, there remains a lack of rigorous theoretical work demonstrating how deep transformers represent such hierarchical structures. In this work, we analyze the expressiveness of deep transformer models through the formal lens of bounded-depth, non-recursive context-free grammars. For this class of grammars, we explicitly construct transformers with positional attention whose depth grows linearly with grammar depth, while the neuron count scales with the number of derivation-tree shapes and quadratically with the number of production rules. Our theoretical results support the linear representation hypothesis by demonstrating that these architectures possess the structural capacity to encode abstract grammatical states into low-dimensional, linearly separable subspaces within the residual stream.


2606.18205 2026-06-17 cs.CL 新提交

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0

使用ISO语言标记框架和TEI Lex-0分析与编码Al-Mawrid阿拉伯语-英语词典

Diaa Fayed, Laurent Romary

发表机构 * Faculty of Information Technology and Computer Science, Sinai University（西奈大学信息技术与计算机科学学院）； Inria ； ISO committee TC 37 (Language and terminology)（ISO TC 37委员会（语言与术语））

AI总结本文提出了一种系统化方法，将Al-Mawrid阿拉伯语-英语词典数字化并编码为标准化计算词典，采用ISO LMF和TEI Lex-0双重标准，实现91%的结构解析准确率。

Comments 44 pages, 58 figures, 12 tables. Submitted to Language Resources and Evaluation, under review since Aug 2025, round 3

详情

AI中文摘要

本文提出了一种稳健的方法，用于系统化数字化和编码Al-Mawrid阿拉伯语-英语词典，将其从传统的印刷资源转变为标准化的计算词典。针对阿拉伯语词汇基础设施中的显著空白，本研究采用双重标准框架，将ISO词汇标记框架（LMF）与文本编码倡议TEI Lex-0指南对齐。通过应用编辑视角处理词典的宏观和微观结构，研究解决了20世纪双语词典中典型的结构歧义和标点不一致问题。该方法基于对词典词汇知识密度的实证分析。基于代表性样本（字母Ayn，占总量的4.6%），研究为编码过程提供了科学依据，展示了91%的结构解析准确率。信息提取规则的定量评估显示出高性能，同义词的精确率为85%，召回率为98%，其他形态语义特征的精确率为88%。除了技术描述，本文还与现有阿拉伯语词汇资源进行了批判性比较，并讨论了TEI Lex-0在建模特定阿拉伯语现象（如隐式“开放集”语义关系和分散的形态线索）时的局限性。此外，研究通过建立可扩展的基于前缀的引用系统，探索了语言关联开放数据（LLOD）集成的潜力，促进了该资源在语义网中的包含。最终成果是一个可互操作、机器可处理的资源，为阿拉伯语自然语言处理和数字人文社区中复杂遗留双语词典的逆向数字化提供了可复现的工作流程。

英文摘要

This paper presents a robust methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary, transforming it from a legacy print resource into a standardized computational lexicon. Addressing a significant gap in Arabic lexical infrastructure, the study adopts a dual-standard framing that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative TEI Lex-0 guidelines. By applying an editorial view to the dictionary's macro- and microstructure, the research resolves the structural ambiguities and punctuation inconsistencies typical of 20th-century bilingual dictionaries. The methodology is grounded in an empirical analysis of the dictionary's lexical knowledge density. Drawing on a representative sample (the letter Ayn, comprising 4.6% of the total volume), the study provides scientific weight to the encoding process, demonstrating a structural parsing accuracy of 91%. Quantitative evaluation of the information extraction rules reveals high performance, with 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. Beyond technical description, the paper provides a critical comparison with existing Arabic lexical resources and discusses the limitations of TEI Lex-0 when modelling specific Arabic phenomena, such as implicit "open set" semantic relations and scattered morphological cues. Furthermore, the study explores the potential for Linguistic Linked Open Data (LLOD) integration by establishing a scalable prefix-based referencing system that facilitates the resource's inclusion in the semantic web. The result is an interoperable, machine-tractable resource that provides a reproducible workflow for the retro-digitization of complex legacy bilingual lexicons within the Arabic NLP and Digital Humanities communities.
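
The extraction rules above are evaluated by precision and recall. As a small sketch of how those two figures are derived from a set of extracted items and a gold-standard set, consider the snippet below. The function name and the example synonym lists are hypothetical and are not taken from the dictionary or the paper.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted items against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical synonyms pulled from one entry versus a hand-checked gold list.
extracted_synonyms = ["big", "large", "huge", "great", "vast"]
gold_synonyms = ["big", "large", "huge", "vast"]

precision, recall = precision_recall(extracted_synonyms, gold_synonyms)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

High recall with lower precision, the pattern reported for synonyms, means the rules find nearly every true synonym but also admit some items that a human annotator would reject.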


2604.22128 2026-06-17 cs.CL cs.LG 版本更新

MODE-RAG: 基于流形异常诊断和能量的检索增强生成评估

Zehang Wei, Jiaxin Dai, Jiamin Yan, Xiang Xiang

发表机构 * School of Computer Science & Tech, Huazhong University of Science and Technology（华中科技大学计算机科学与技术学院）； School of AI and Automation, Huazhong University of Science and Technology（华中科技大学人工智能与自动化学院）

AI总结提出MODE-RAG多智能体系统，利用变分自由能和内部注意力状态动态门控干预，结合蒙特卡洛树搜索和logit扰动减少多模态检索增强生成中的幻觉和逻辑捏造。

Comments To be presented at ACL 2026

详情

AI中文摘要

虽然多模态检索增强生成（M-RAG）增强了大型视觉语言模型，但它仍然非常容易受到跨模态幻觉、因果捏造和谄媚的影响。此外，现有的缓解流程常常面临干预悖论：静态规则往往不必要地干扰准确的生成，而完全不加引导的多模态推理则允许现有的不匹配级联成严重的逻辑捏造。为了量化和缓解这些幻觉，我们提出了一个多智能体系统MODE-RAG，由变分自由能（VFE）和内部注意力状态驱动，以动态门控干预。高风险查询被路由到五个阶段特定的智能体，集成蒙特卡洛树搜索（MCTS）进行严格的因果推导，以及logit扰动以惩罚谄媚。专门的纠正和监管智能体确保格式稳定性并执行事后事实验证。为了客观评估我们的方法，我们引入了ModeVent，一个源自MultiVent数据集的具有挑战性的子集。大量实验表明，我们的系统有效降低了幻觉率和逻辑捏造，显著提高了M-RAG系统的鲁棒性。

英文摘要

While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi-Agent system, MODE-RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High-risk queries are routed to five stage-specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post-hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M-RAG systems.


2606.17791 2026-06-17 cs.CL cs.CV 新提交

MambaCount: 基于空间稀疏状态空间对偶块的高效文本引导开放词汇目标计数

Hao-Yuan Ma, Li Zhang, Minjie Qiang, Jie Gao

发表机构 * School of Computer Science and Technology, Soochow University（苏州大学计算机科学与技术学院）

AI总结提出MambaCount框架，通过空间稀疏状态空间对偶块解决Mamba在非因果视觉任务中的双向依赖限制和空间token高熵问题，实现线性复杂度的开放词汇目标计数，在FSC-147上取得12.23的测试MAE。

详情

AI中文摘要

文本引导的开放词汇目标计数（TOOC）旨在估计由文本提示描述的目标数量，在具有大规模变化的密集场景中尤其具有挑战性。现有的TOOC方法主要依赖Transformer，其相对于图像分辨率的二次复杂度限制了可扩展性。Mamba因其线性复杂度提供了一种有前景的替代方案。然而，先前基于Mamba的方法存在两个主要限制。一方面，Mamba固有的因果公式限制了非因果视觉任务所需的双向空间依赖建模。另一方面，现有的基于Mamba的视觉模型往往忽略了空间token响应中无约束的高熵，这可能削弱局部细节和高频线索。为了解决这些限制，我们提出了MambaCount，一种基于空间稀疏状态空间对偶（S^4D）块的高效框架。具体来说，我们分析并重构了Mamba中隐藏状态的衰减动态，以缓解因果建模引入的依赖约束。此外，我们引入了空间token选择（STS）子块，以减少Mamba中空间token响应的无约束高熵。另外，我们设计了多粒度原型（MGP），以在不同语义级别识别类似目标的区域，改善跨模态对齐和可解释性。在FSC-147上的大量实验表明，MambaCount在无需二次查询的方法中达到了最先进的性能，测试MAE为12.23，同时保持了线性复杂度。

英文摘要

Text-guided Open-vocabulary Object Counting (TOOC) aims to estimate the number of objects described by text prompts, which is particularly challenging in dense scenes with large scale variations. Existing TOOC approaches predominantly rely on Transformers, whose quadratic complexity with respect to image resolution limits their scalability. Mamba offers a promising alternative due to its linear complexity. However, previous Mamba-based methods have two main limitations. On the one hand, the inherent causal formulation of Mamba constrains the bidirectional spatial dependency modeling required by non-causal vision tasks. On the other hand, existing Mamba-based vision models often overlook the unconstrained high entropy in the spatial token responses, which can weaken local details and high-frequency cues. To address these limitations, we propose MambaCount, an efficient framework built on the Spatial Sparse State Space Duality (S^4D) block. Specifically, we analyze and reconstruct the decay dynamics of hidden states in Mamba to alleviate the dependency constraints introduced by causal modeling. Moreover, we introduce a Spatial Token Selection (STS) sub-block to reduce the unconstrained high entropy in spatial token responses within Mamba. In addition, we design Multi-Granularity Prototypes (MGP) to identify object-like regions at different semantic levels, improving cross-modal alignment and interpretability. Extensive experiments on FSC-147 demonstrate that MambaCount achieves state-of-the-art performance among methods without secondary querying, obtaining a test MAE of 12.23, while retaining linear complexity.


2606.17710 2026-06-17 cs.CV cs.AI cs.CL cs.LG 交叉投稿

使用大型视觉语言模型审核不安全的用户生成内容游戏中的非法在线图像推广

Keyan Guo, Ayush Utkarsh, Wenbo Ding, Isabelle Ondracek, Ziming Zhao, Guo Freeman, Nishant Vishwamitra, Hongxin Hu

AI总结针对社交媒体上非法推广不安全UGC游戏的图像，提出UGCG-Guard系统，利用大型视觉语言模型和条件提示策略实现零样本域适应，检测准确率达94%。

Comments In Proceedings of the 33rd USENIX Conference on Security Symposium (SEC '24), August 14-16, 2024

详情

AI中文摘要

在线用户生成内容游戏（UGCG）在儿童和青少年中越来越受欢迎，用于社交互动和更具创造性的在线娱乐。然而，它们带来了更高的接触露骨内容的风险，引发了对儿童和青少年在线安全的日益关注。尽管存在这些担忧，但很少有研究关注社交媒体上基于图像的非法不安全UGCG推广问题，这种推广可能无意中吸引年轻用户。这一挑战源于难以获得全面的UGCG图像训练数据以及这些图像与传统不安全内容不同的独特性质。在这项工作中，我们迈出了研究不安全UGCG非法推广威胁的第一步。我们收集了一个包含2,924张图像的真实世界数据集，这些图像展示了游戏创作者用于推广UGCG的多种色情和暴力内容。我们的深入研究揭示了对此问题的新认识，以及自动标记非法UGCG推广的迫切需求。我们还创建了一个尖端系统UGCG-Guard，旨在帮助社交媒体平台有效识别用于非法UGCG推广的图像。该系统利用最近引入的大型视觉语言模型（VLM），并采用一种新颖的条件提示策略进行零样本域适应，以及思维链（CoT）推理进行上下文识别。UGCG-Guard取得了出色结果，在现实场景中检测这些用于非法推广游戏的图像时准确率达到94%。

英文摘要

Online user generated content games (UGCGs) are increasingly popular among children and adolescents for social interaction and more creative online entertainment. However, they pose a heightened risk of exposure to explicit content, raising growing concerns for the online safety of children and adolescents. Despite these concerns, few studies have addressed the issue of illicit image-based promotions of unsafe UGCGs on social media, which can inadvertently attract young users. This challenge arises from the difficulty of obtaining comprehensive training data for UGCG images and the unique nature of these images, which differ from traditional unsafe content. In this work, we take the first step towards studying the threat of illicit promotions of unsafe UGCGs. We collect a real-world dataset comprising 2,924 images that display diverse sexually explicit and violent content used to promote UGCGs by their game creators. Our in-depth studies reveal a new understanding of this problem and the urgent need for automatically flagging illicit UGCG promotions. We additionally create a cutting-edge system, UGCG-Guard, designed to aid social media platforms in effectively identifying images used for illicit UGCG promotions. This system leverages recently introduced large vision-language models (VLMs) and employs a novel conditional prompting strategy for zero-shot domain adaptation, along with chain-of-thought (CoT) reasoning for contextual identification. UGCG-Guard achieves outstanding results, with an accuracy rate of 94% in detecting these images used for the illicit promotion of such games in real-world scenarios.


2502.00241 2026-06-17 cs.LG cs.AI cs.CL cs.CV 版本更新

MLLP-VRAIN UPV 系统在 IWSLT 2026 同声传译任务中的应用

Jorge Iranzo-Sánchez, Gerard Mas-Mollà, Adrià Giménez, Jorge Civera, Albert Sanchis, Alfons Juan

发表机构 * MLLP-VRAIN research group（MLLP-VRAIN研究组）； VRAIN ； Universitat Politècnica de València（瓦伦西亚理工大学）

AI总结提出基于Parakeet和Qwen 3.5模型的级联同声传译系统，通过自适应黑盒策略优化质量-延迟权衡，并引入ASR词增强和RAG机制处理上下文跟踪，在MCIF En→De测试集上实现XCOMET-XL提升+5.82。

Comments IWSLT 2026 System Description

详情

AI中文摘要

本文描述了MLLP-VRAIN研究组参与IWSLT 2026同声传译赛道共享任务的情况。我们的提交利用最近发布的Parakeet和Qwen 3.5模型，通过自适应“黑盒”策略构建了一个鲁棒的级联解决方案，用于长形式SimulST。我们探索了这些策略的松弛版本以实现更好的质量-延迟权衡。与去年相比，我们参与了所有语言方向。此外，对于En→{De, It, Zh}方向，我们还参与了今年新增的上下文跟踪赛道，采用ASR词增强和离线预翻译示例的RAG机制相结合，以引导生成并丰富系统的领域特定上下文。最后，我们提供了系统的详细延迟分析。与去年相比，在MCIF En→De测试集上的结果显示质量显著提升，XCOMET-XL提高了+5.82。我们的上下文跟踪处理进一步提升了+1.03的性能。

英文摘要

This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. Our submission utilizes the recently released Parakeet and Qwen 3.5 models to create a robust, cascaded solution for long-form SimulST through the use of adaptive "black-box" policies. We explore relaxations of these policies to achieve better quality-latency trade-offs. Compared to last year, we participate in all language directions. In addition, for the En→{De, It, Zh} directions we also participate in this year's new context track, employing a combination of ASR word-boosting and a RAG mechanism of offline pre-translated exemplars to guide generation and enrich our system with domain-specific context. Finally, we provide a detailed latency analysis of our system. Compared to last year, results on the MCIF En→De test set show a substantial quality improvement of +5.82 XCOMET-XL. Our context track processing further improves performance by +1.03.


2606.17281 2026-06-17 cs.CL cs.SD eess.AS 新提交

Are you speaking my languages? On spoken language adherence in multimodal LLMs

你在说我的语言吗？多模态大语言模型中的口语遵循问题

Hyungwon Kim, Kandarp Joshi, Lillian Zhou, Pavel Golik, Petar Aleksic

发表机构 * Google DeepMind（谷歌DeepMind）

AI总结针对多模态大语言模型在自动语音识别中输出语言识别错误的问题，提出软提示方法，引入新指标量化语言违背，并评估零样本提示、监督微调和思维链推理三种缓解策略在减少违规和保持ASR性能上的效果。

Comments 7 pages, 3 tables in the main body

Abstract

While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating their effectiveness in reducing language violations while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.

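2606.17820 2026-06-17 cs.CL New submission

Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

Reihaneh Amooie, Yun Hao, Wietse de Vries, Jelske Dijkstra, Matt Coler, Martijn Wieling

Affiliations: University of Groningen; Fryske Akademy; Vrije Universiteit Brussel

AI summary: Studies how bilingual fine-tuning affects speech recognition for low-resource languages across nine language pairs, using language-identification tokens to distinguish the languages; bilingual fine-tuning helps when language-identification accuracy is high, and supplying the token at inference time improves performance when it is low.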

2606.17826 2026-06-17 cs.CL cs.AI New submission

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

Jean Seo, Minkyu Kim, Jeonguk Lee, Jisoo Jung, Wooseok Han, Eunho Yang

Affiliations: AITRICS; University of Copenhagen; KAIST

AI summary: Addresses multiscript variability in ASR for non-English clinical settings by proposing the MultiClin benchmark, whose multiscript-aware evaluation measures recognition quality more fairly, and finds that script unification improves ASR performance.

Comments Interspeech 2026

Abstract

Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation provides a fairer assessment of recognition quality than conventional single-reference evaluation. We further investigate the impact of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at: https://github.com/aitrics-ronaldo/Interspeech_MultiClin.
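The multiscript-aware scoring idea described above can be sketched as follows: score a hypothesis against every valid orthographic variant of the reference and keep the lowest character error rate. This is a minimal illustration and not the MultiClin implementation; the function names and the example strings are assumptions made for this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (single-row dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate against one reference."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(hypothesis, reference) / len(reference)


def multi_reference_cer(hypothesis: str, references: list[str]) -> float:
    """Hypothetical multiscript-aware score: best CER over all valid variants."""
    return min(cer(hypothesis, ref) for ref in references)


# Hypothetical example: the same term written in Latin script or in Hangul.
variants = ["씨티 촬영", "CT 촬영"]
single = cer("CT 촬영", variants[0])               # penalised as an error
multi = multi_reference_cer("CT 촬영", variants)   # 0.0: matches a valid variant
```

Under single-reference scoring the Latin-script output counts as two substitutions, while the multi-reference score treats it as correct, which is the underestimation effect the abstract describes.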

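2606.17835 2026-06-17 cs.CL cs.AI eess.AS New submission

Perceptual compensation for tonal context in self-supervised speech models

James Kirby, Ioana Krehan, Michele Gubian

Affiliations: Institute for Phonetics and Speech Processing, LMU Munich

AI summary: Pseudo-replicates a perceptual compensation experiment on Mandarin tones, comparing purely self-supervised pre-trained models with fine-tuned ones; the pre-trained models show no evidence of compensation, while the fine-tuned models show partial compensation that falls short of human listeners, suggesting that a supervised objective may be necessary to abstract certain phonological regularities.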

Comments Accepted for publication at Interspeech 2026

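2606.17967 2026-06-17 cs.CL New submission

Learning task-specific subspaces via interventional post-training of speech foundation models

Jack Cox, Jon Barker

Affiliations: University of Sheffield

AI summary: Proposes an interventional contrastive-learning post-training method that decomposes the entangled representations of speech foundation models into content and speaker subspaces, improving speaker verification and keyword spotting.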

Comments Accepted to Interspeech 2026; 6 pages (4 main body), 2 figures

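2606.17537 2026-06-17 eess.AS cs.CL Cross-list

Non-Autoregressive Minimum Bayes' Risk Decoding for Fast Speech Recognition

Hiroyuki Deguchi, Takatomo Kano, Katsuki Chousa, Marc Delcroix

Affiliations: NTT, Inc.

AI summary: Proposes a non-autoregressive decoding framework based on minimum Bayes risk decoding that efficiently samples multiple candidates in a single forward pass, improving recognition accuracy while keeping the speed advantage.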

Comments Accepted at Interspeech 2026
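For readers unfamiliar with minimum Bayes risk (MBR) decoding, the selection step can be sketched as below: among a set of sampled candidates, pick the one with the lowest total loss against all the others. This is a generic sketch and not the paper's method; it assumes equally weighted candidates and uses word-level edit distance as a stand-in loss, and it does not show how the candidates are sampled non-autoregressively. All names are hypothetical.

```python
def word_edit_distance(hyp: str, ref: str) -> int:
    """Word-level Levenshtein distance between two transcripts."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, start=1):
        curr = [i]
        for j, rw in enumerate(r, start=1):
            cost = 0 if hw == rw else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mbr_select(candidates: list[str]) -> str:
    """Return the candidate with minimum summed loss against the other candidates.

    Assumption: every candidate is an equally likely sample, so the expected
    risk is proportional to the plain sum of pairwise losses.
    """
    def risk(idx: int) -> int:
        return sum(word_edit_distance(candidates[idx], other)
                   for k, other in enumerate(candidates) if k != idx)

    best = min(range(len(candidates)), key=risk)
    return candidates[best]


samples = ["the cat sat", "the cat sat down", "a cat sat"]
chosen = mbr_select(samples)  # "the cat sat": total loss 2, versus 3 for each other candidate
```

The pairwise comparison costs grow quadratically with the number of candidates, which is why cheap candidate generation matters for keeping decoding fast.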

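2606.18019 2026-06-17 eess.AS cs.CL cs.SD Cross-list

Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews

Franziska Braun, Alea Rüggeberg, Thomas Ranzenberger, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

Affiliations: TH Nürnberg; FAU Erlangen; PMU Klinikum Nürnberg

AI summary: Uses open-weights large language models to predict dementia and depression severity from recorded clinical interviews with 154 German-speaking subjects and introduces a Global Depression Scale aligned with the Global Deterioration Scale; zero-shot prediction works well for depression, structured feature extraction substantially improves dementia assessment with errors reduced by up to 35%, and pause-enriched transcripts perform on par with human transcripts.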

Comments Accepted for publication in Text, Speech and Dialogue (TSD 2026). The final authenticated publication will be available online via Springer LNCS/LNAI

Abstract

Dementia and depression are the most prevalent neuropsychiatric disorders in geriatric populations, and their overlapping symptoms pose major challenges for differential diagnosis. In this study, we investigate open-weights Large Language Models (LLMs) for predicting dementia and depression severity from speech samples collected during standardized history taking interviews with 154 German-speaking subjects. We introduce an observer-based Global Depression Scale (GDS-D) aligned with the established Global Deterioration Scale (GDS), enabling parallel global staging of affective and cognitive symptoms. We compare three LLMs (Mistral 3.1, DeepHermes, Qwen3) in two settings: (1) zero-shot prediction and (2) LLM-based feature extraction for Support Vector Regression, using human and pause-enriched transcripts. Results show that LLMs effectively predict depression severity in zero-shot settings (best MAE of 0.60), while dementia assessment benefits substantially from structured feature extraction (best MAE of 0.78), reducing errors by up to 35% over zero-shot baselines. Pause-enriched transcripts achieve competitive performance with human transcriptions, demonstrating the viability of fully automatic screening pipelines for differential neuropsychiatric assessment.

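2602.15537 2026-06-17 cs.CL eess.AS Updated version

ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper

AI summary: Proposes ZeroSyl, a training-free method that extracts syllable boundaries and embeddings directly from a frozen WavLM model, achieving competitive syllable segmentation and outperforming prior methods on lexical, syntactic, and narrative benchmarks.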

Comments Accepted to Interspeech 2026

Abstract

Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
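The pipeline in the abstract (frame norms, segmentation, mean pooling, discretization) can be sketched roughly as below. This is not the authors' implementation. The abstract does not say how boundaries are derived from the L2 norms, so placing them at local minima of the norm curve is an assumption of this sketch, as are all function names; a toy array stands in for WavLM features, and only the nearest-centroid assignment step of K-means is shown, with centroids assumed to be fitted elsewhere.

```python
import numpy as np


def segment_by_norm_minima(features: np.ndarray) -> list[tuple[int, int]]:
    """Split frames (T, D) into segments at interior local minima of the frame L2 norm.

    Assumption: norm minima approximate syllable boundaries.
    """
    norms = np.linalg.norm(features, axis=1)
    minima = [t for t in range(1, len(norms) - 1)
              if norms[t] < norms[t - 1] and norms[t] <= norms[t + 1]]
    bounds = [0] + minima + [len(norms)]
    return list(zip(bounds[:-1], bounds[1:]))


def mean_pool(features: np.ndarray, segments: list[tuple[int, int]]) -> np.ndarray:
    """One embedding per segment: the mean of its frames."""
    return np.stack([features[start:end].mean(axis=0) for start, end in segments])


def assign_to_centroids(embeddings: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """K-means assignment step: index of the nearest centroid for each embedding."""
    distances = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


# Toy stand-in for encoder features: six frames, one dimension.
toy = np.array([[3.0], [1.0], [3.0], [3.0], [1.0], [3.0]])
segments = segment_by_norm_minima(toy)        # [(0, 1), (1, 4), (4, 6)]
units = assign_to_centroids(mean_pool(toy, segments),
                            np.array([[2.0], [3.0]]))  # token ids per segment
```

The resulting unit ids are what a spoken language model would be trained on; because each segment yields one token, sequences are far shorter than frame-level token streams.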

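2603.26292 2026-06-17 cs.CL cs.AI Updated version

Benchmark Illusion: Pruned Large Language Models Can Pass Multiple-Choice Tests but Cannot Answer the Questions

Rui Wen, Lu Sun, Jiayang Liu, Zesheng Xu, Tianshuo Cong, Zheng Li

Affiliations: Institute of Science Tokyo; Tohoku University; Nanyang Technological University; KTH Royal Institute of Technology; Shandong University

AI summary: Finds that large language models pruned to high sparsity still do well on multiple-choice evaluation yet fail to answer the same questions correctly in open generation, exposing a blind spot in benchmarks.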

Abstract

Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the same question in open generation. We ask what pruning changes: does it erase the correct answer, or does it make the answer harder to produce as the top output? We study this question with multilingual question answering, tracking the same questions before and after pruning. We find a benchmark illusion. Under high-sparsity pruning, especially Wanda, models often fail in greedy open generation while still selecting the correct answer under multiple-choice scoring. In these recognition-only errors, the answer is usually not gone, but demoted: it often reappears with beam search, sampling, or one in-context example. Overall, multiple-choice benchmarks can overstate the usability of compressed LLMs, creating an evaluation blind spot. Compressed models should be tested on what they can produce, not only on what they can recognize.

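2606.17634 2026-06-17 cs.CL New submission

Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs

Dong Huang, Jianbo Sun, Pengkun Yang

Affiliations: Department of Statistics and Data Science, Tsinghua University

AI summary: Addresses intransitive contradictions in pairwise LLM evaluation with a prompt perturbation framework that generates perturbed prompt variants and filters out structurally inconsistent comparison patterns, improving ranking consistency.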

Comments 42 pages, 8 figures

Abstract

Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has become a popular paradigm, in which two responses to the same prompt are compared and the resulting judgments are aggregated into an overall ranking. A central challenge of this paradigm is intransitivity: the induced comparison outcomes may fail to support any coherent global ranking. For example, one may observe cyclic preferences such as $A \succ B \succ C \succ A$, or inconsistencies involving ties such as $A \equiv B\equiv C\neq A$. Such contradictions make the resulting leaderboard unstable and challenging to interpret. In this paper, we propose a prompt perturbation framework for improving the consistency of pairwise LLM evaluation. Our approach generates perturbed variants of each prompt, uses the resulting comparison graphs to identify and filter out structurally inconsistent comparison patterns, and then applies standard ranking methods to the filtered comparisons. A key feature of the proposed framework is that graph-level structural consistency is incorporated explicitly into the evaluation pipeline before ranking aggregation. This provides a simple and principled way to reduce cyclic inconsistencies and improve the reliability of LLM rankings.
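The filter-then-rank idea above can be illustrated with a small sketch: treat each perturbed prompt variant as a directed comparison graph of (winner, loser) edges, discard variants whose graph contains a preference cycle such as $A \succ B \succ C \succ A$, and rank on what remains. This is only an illustration under stated assumptions, not the paper's algorithm: dropping a whole variant is one possible filtering rule, ties are ignored, and win counting stands in for the "standard ranking methods" the abstract mentions. All names are hypothetical.

```python
from collections import defaultdict


def has_cycle(edges: list[tuple[str, str]]) -> bool:
    """Detect a directed cycle in a (winner, loser) comparison graph via depth-first search."""
    graph = defaultdict(list)
    for winner, loser in edges:
        graph[winner].append(loser)

    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited, on the current path, finished
    color = defaultdict(int)

    def visit(node: str) -> bool:
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:    # back edge: preference cycle found
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))


def rank_consistent_variants(variant_edges: list[list[tuple[str, str]]]) -> list[str]:
    """Drop cyclic variants (assumed filtering rule), then rank models by win count."""
    wins = defaultdict(int)
    for edges in variant_edges:
        if has_cycle(edges):
            continue
        for winner, loser in edges:
            wins[winner] += 1
            wins[loser] += 0          # make sure losers appear in the ranking
    return sorted(wins, key=lambda m: (-wins[m], m))


variants = [
    [("A", "B"), ("B", "C"), ("A", "C")],   # transitive: kept
    [("A", "B"), ("B", "C"), ("C", "A")],   # cyclic: filtered out
]
ranking = rank_consistent_variants(variants)  # ['A', 'B', 'C']
```

In practice the win count could be replaced by a model such as Bradley-Terry fitted on the retained comparisons; the point of the sketch is only that the consistency check happens before aggregation.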

2606.17861 2026-06-17 cs.CL New submission

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Tongxu Luo, Rongsheng Wang, Jiaxi Bi, Chenming Xu, Zhengyang Tang, Jianlong Chen, Juhao Liang, Ke Ji, Shuqi Guo, Yuhao Du, Fan Bu, Wenyu Du, Xiaotong Zhang, Kyle Li, Shaobo Wang, Linfeng Zhang, Yuxuan Liu, Xin Lai, Chenxin Li, Yiduo Guo, Zhexin Zhang, Xinyuan Wang, Tianyi Bai, Ziniu Li, Benyou Wang

Affiliations: The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute; Hunyuan Team, Tencent; USTB; DualverseAI; SJTU; NUS

AI summary: Proposes GameCraft-Bench, a benchmark that evaluates whether coding agents can generate playable games end-to-end in the Godot engine; the strongest agent scores only 41.46%.

Abstract

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.

2606.17905 2026-06-17 cs.CL New submission

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

Peixian Zhou, Yuxu Chen, Chaorui Zhang, Wei Han, Bo Bai, Xueyan Niu

Affiliations: College of Mathematics, Sichuan University; Theory Lab, 2012 Labs, Huawei Technologies Co., Ltd

AI summary: Proposes ChLogic, an English-Chinese aligned benchmark built from formal logical templates that tests whether logical reasoning stays robust across English and diverse Chinese expressions; it finds an English-Chinese performance gap, and the effect of back-translation varies by data set and model.

Abstract

Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English-Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three data sets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English-Chinese performance gap. Back-translation from standard Chinese into English often improves performance on the General aligned set, but produces mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. Overall, ChLogic provides a useful stress test for the robustness of multilingual reasoning.

2606.18203 2026-06-17 cs.CL cs.AI New submission

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

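Affiliations: Google Research; University of Illinois Chicago

AI summary: Proposes the RubricsTree framework, which combines an expert-aligned hierarchical taxonomy of more than 100 atomic Boolean rubrics with context-adaptive routing for scalable, auditable, and evolving open-ended evaluation, and yields relative gains of up to about 66% on HealthBench.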

Abstract

LLM-empowered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically verifiable Boolean rubrics, evolved from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides the scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.

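2606.18222 2026-06-17 cs.CL cs.DL New submission

Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses

Joy Bose

Affiliations: Independent Researcher, Bangalore, India

AI summary: Builds a parallel commentary corpus of Indian philosophy with more than 125,000 records, about 8,500 of which align root verses across 18 commentators, and analyzes argumentative style and conceptual relationships with stylometry and a constrained large language model pipeline, revealing patterns of disagreement between schools.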

Comments 12 pages, 1 figure. Open Source Code available at https://github.com/joyboseroy/darshana-graph and dataset at https://huggingface.co/datasets/joyboseroy/darshana-graph

Abstract

We introduce Darshana Graph, a corpus of over 125,000 text records spanning classical Hindu, Buddhist, and Jain philosophical traditions, drawn from public-domain and openly licensed translations of sources including the Bhagavad Gita, Brahma Sutras, principal Upanishads, the Pali Canon, and core Jain texts. Its distinctive contribution lies in a structurally unique subset of roughly 8,500 Hindu and Jain records in which the same root verse or sutra is aligned across eighteen historical commentators representing five schools of Vedanta and other darshanas, enabling direct comparison of how independent interpretive traditions read identical source material. To our knowledge, no publicly available resource provides comparable cross-commentator alignment at this scale. We present two analyses built on this corpus. First, a transparent stylometric comparison requiring no machine learning measures argumentative style through scriptural citation density, explicit refutation rate, and sentence complexity. It finds a moderate negative correlation between citation density and refutation rate, a marked increase in refutation rate across three commentators in a related doctrinal lineage, and measurable genre-level differences within the Pali Canon itself. Second, we describe a constrained large language model pipeline that extracts typed philosophical relationships between concepts using a predefined relation vocabulary and deterministic post-hoc validation. The resulting graph surfaces cross-school disagreement patterns while also revealing important extraction limitations, including cases where an independent embedding-based analysis disagrees with the graph-derived findings. We release the full corpus, extracted relationship graph, and all source code.

2606.18237 2026-06-17 cs.CL cs.AI cs.LG New submission

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah, Tim Dettmers, Yiming Yang, Ameet Talwalkar

Affiliations: School of Computer Science, Carnegie Mellon University; Datadog

AI summary: Proposes the ReproRepo framework, which uses GitHub issues as a supervision signal to evaluate reproducibility on 1,149 papers, and finds that Codex with GPT-5.5 surfaces a semantically related reproduction issue for about 90% of papers.

Abstract

Reproducing research results from papers and released code is central to scientific progress. Existing work has introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but these benchmarks are difficult to scale because they rely on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective at surfacing visible failures and identifying the right semantic region, but may still fall short in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.

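2606.17339 2026-06-17 cs.AI cs.CL cs.SD Cross-list

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

Sejal Bhalla, Larry Kieu, Aina Merchant, Eyal de Lara, Alex Mariakakis

Affiliations: University of Toronto

AI summary: Proposes the SpeechDx benchmark, spanning 12 datasets and 27 tasks organized by the stage of speech production they disrupt (conceptualization, formulation, articulation), and evaluates 12 audio encoders; large-scale speech models perform best, but no representation generalizes reliably.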

Abstract

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.

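2606.17698 2026-06-17 cs.AI cs.CL Cross-list

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

Zeyao Du, Tong Li, Haibo Zhang

Affiliations: Shopee

AI summary: Proposes EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products in which an agent must uncover hidden intent from a visible query, a tool-gated profile, and scripted clarification, verify candidate products, and commit to a final choice within 100 tool calls; typed, source-tagged rubrics attribute each failure to a requirement and its source.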

Abstract

As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To address this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover hidden intent, verify candidates against attributes and review evidence, and commit to a single product within 100 tool calls. Moreover, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source. Construction is automated yet reliable, with every answer fixed in code before any text is generated and every sample validated. Our evaluation of seven models reveals that even the strongest attains only 57.1% overall accuracy, and rubric satisfaction degrades from visible to hidden sources. Overall, we believe EComAgentBench will serve as a reproducible foundation for moving shopping agents from single-query search toward dependable assistance over long horizons.

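2606.17799 2026-06-17 cs.SE cs.AI cs.CL Cross-list

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

Maria I. Gorinova, Macey Baker, Amy Heineike, Maksim Shaposhnikov, Rob Willoughby, Dru Knox

Affiliations: Tessl

AI summary: Argues that current coding benchmarks have three problems in the agent era: they conflate the model with the system harness, grading against a single reference solution penalises valid alternatives, and the lack of component-level signal makes iteration difficult; calls for redesigning benchmarks to align with agentic software engineering.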

Abstract

Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model but a system harness: a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the harness; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual harness components makes the end-to-end system score difficult to iterate on.

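2606.17819 2026-06-17 cs.SE cs.AI cs.CL Cross-list

A Framework for Evaluating Agentic Skills at Scale

Maksim Shaposhnikov, Nicolas Fortuin, Simon Stipcich, Maria I. Gorinova, Amy Heineike, Rob Willoughby

Affiliations: Tessl, London, United Kingdom

AI summary: Presents an evaluation framework that builds realistic tasks and scoring rubrics and applies it at scale to 500 real-world skills across 19 agent-model configurations, finding that models differ markedly in how closely they follow skill instructions and that skills significantly change model behavior.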

AI中文摘要

智能体技能——结构化、可重用的知识工件，增强LLM智能体能力——已在工业界迅速采用，但其跨领域影响以及在商业和开源模型中的使用仍未得到充分研究，并且缺乏可复用的方法来评估单个技能。在这项工作中，我们提出了一个评估框架，允许技能作者构建真实任务，以严格评估技能中对他们最重要的方面，并通过解决这些任务来估计技能效用。此外，我们将评估方法大规模应用于500个真实技能，生成了1000个源自技能内容的任务，以及指令遵循和目标完成评分标准。使用这些指标，我们评估了19种智能体模型配置（包括专有和开源模型）在任务上的表现。我们的结果表明，模型在遵循技能中编码的指令方面差异很大，导致其性能提升存在显著差异。此外，我们表明，与无技能设置相比，访问技能显著改变了模型行为，为将主观工作流编码到LLM智能体中提供了一种重要机制。我们发布了评估数据集，以支持未来关于智能体技能的工作。

英文摘要

Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks to rigorously assess the aspects of a skill that matter most to them, and that estimates skill utility by solving those tasks. Further, we apply our evaluation approach at scale to 500 real-world skills, generating 1,000 tasks derived from the skills' content, along with instruction-following and goal-completion scoring rubrics. Using these metrics, we evaluate how 19 agent-model configurations, both proprietary and open-source, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in their performance gains. Furthermore, we show that access to a skill significantly changes model behavior compared to the no-skill setup, providing an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.

2606.18060 2026-06-17 cs.AI cs.CL 交叉投稿

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

PseudoBench: 衡量自主研究如何助长伪科学

Xinyang Liao, Lingyu Li, Huacan Liu, Tianle Gu, Yang Yao, Tong Zhu, Yan Teng, Yingchun Wang

发表机构 * Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Xi’an Jiao Tong University（西安交通大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出PseudoBench基准，通过200个伪科学声明-证据对评估AI代理识别和抵制伪科学的能力，发现当前系统极易生成有说服力的伪科学报告，拒绝率接近零。

Comments 26 pages, 21 figures

AI中文摘要

ALAS：音频语言模型的自动潜在对齐分数

Pooneh Mousavi, Yingzhi Wang, Mirco Ravanelli, Cem Subakan

AI总结提出ALAS指标，通过计算音频与文本表示的跨模态余弦相似度，无需训练即可评估语音-LLM的音频-文本对齐质量，揭示模型对齐深度与任务需求的关系。

AI中文摘要

大型语言模型（LLM）被扩展为语音-LLM，它们学习的音频-文本对齐质量影响大多数下游口语理解（SLU）行为。然而，尽管融合策略不断增长，但没有标准方法来衡量语音-LLM内部如何将音频帧与文本标记绑定。我们引入ALAS（自动潜在对齐分数），一种模型和任务无关的度量，探测LLM的逐层隐藏状态，将音频和文本表示之间的跨模态余弦相似度与Whisper导出的参考进行评分。ALAS仅需要冻结的前向传递和现成的ASR参考，无需训练或拟合分类器，并校准到可解释的均匀基线，可在任务间比较。将ALAS应用于四个开源语音-LLM（AF3、Qwen2-Audio、Qwen-Omni、SALMONN），在情感识别（IEMOCAP）、开放式SQA（LibriSQA）和多选音频理解（MMAU-speech）上，我们发现对齐的深度和强度反映了每个模型的音频编码器设计以及任务的声学与语义需求，并且ALAS跟踪但不重复任务准确性，暴露了那些得分高但未真正基于音频的模型。我们将ALAS作为开源库发布，以便从业者探测自己的语音-LLM或在新任务上尝试。

英文摘要

Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio--text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growth of fusion strategies, there is no standard way to measure how well a Speech-LLM internally binds audio frames to text tokens. We introduce ALAS (Automatic Latent Alignment Score), a model and task-agnostic metric that probes the LLM's per-layer hidden states, scoring the cross-modal cosine similarity between audio and text representations against a Whisper-derived reference. ALAS needs only a frozen forward pass and an off-the-shelf ASR reference, with no training or fitted classifier, and is calibrated to an interpretable uniform baseline comparable across tasks. Applying ALAS to four open-source Speech-LLMs (AF3, Qwen2-Audio, Qwen-Omni, SALMONN) across emotion recognition (IEMOCAP), open-ended SQA (LibriSQA), and multi-choice audio understanding (MMAU-speech), we find that the depth and strength of alignment reflect each model's audio-encoder design and the acoustic-versus-semantic demands of the task, and that ALAS tracks but does not duplicate task accuracy, exposing models that score well without genuinely grounding in the audio. We release ALAS as an open-source library so that practitioners can probe their own Speech-LLMs or try it on new tasks.
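The abstract describes scoring cross-modal cosine similarity against an ASR-derived reference and comparing it with a uniform baseline. The sketch below is only an illustration of that idea for a single layer, not the ALAS implementation: the function names, the "argmax frame falls inside the reference span" scoring rule, and the toy vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(text_states, audio_states):
    """Rows are text tokens, columns are audio frames."""
    return [[cosine(t, a) for a in audio_states] for t in text_states]

def alignment_hit_rate(sim, reference_spans):
    """Fraction of tokens whose most similar frame lies in the reference span.

    reference_spans[i] = (start, end) frame range for token i, assumed to come
    from an off-the-shelf ASR alignment (hypothetical input format).
    """
    hits = 0
    for row, (start, end) in zip(sim, reference_spans):
        best = max(range(len(row)), key=row.__getitem__)
        hits += start <= best < end
    return hits / len(sim)

def uniform_baseline(reference_spans, n_frames):
    """Expected hit rate if the best frame were picked uniformly at random."""
    return sum((e - s) / n_frames for s, e in reference_spans) / len(reference_spans)

# Toy hidden states: 4 audio frames, 2 text tokens, dimension 2.
audio = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text = [[1.0, 0.0], [0.0, 1.0]]
spans = [(0, 2), (2, 4)]

sim = similarity_matrix(text, audio)
score = alignment_hit_rate(sim, spans)
baseline = uniform_baseline(spans, len(audio))
print(score, baseline)
```

In this toy case both tokens peak inside their reference spans, so the score sits well above the uniform baseline; a model whose text states ignored the audio would be expected to land near the baseline instead.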

2511.01650 2026-06-17 cs.CL cs.AI cs.LG 版本更新

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

EngTrace：工程推理可验证过程监督的符号基准

Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Fan Zhang, Veselin Stoyanov, Preslav Nakov, Zhuohan Xie

AI总结提出EngTrace符号基准，包含1350个参数化测试用例，通过两阶段可验证评估框架（分层协议+AI仲裁）检验中间推理轨迹与最终答案，揭示数值精度与轨迹保真度的权衡。

Comments 33 pages, includes figures and tables; introduces the EngTrace benchmark

AI中文摘要

大型语言模型（LLM）正越来越多地进入由严格定量标准和不变物理定律约束的专业化、安全关键的工程工作流程，因此对其推理能力进行严格评估势在必行。然而，现有的基准（如MMLU、MATH和HumanEval）评估的是孤立的认知技能，未能捕捉工程中核心的基于物理的推理，其中科学原理、定量建模和实际约束必须融合。为了实现工程中的可验证过程监督，我们引入了EngTrace，这是一个基于90个参数化模板构建的符号基准，每个模板生成独特的、抗污染的实例，涵盖三个主要工程分支、九个核心领域和20个不同领域，产生1350个测试用例，以压力测试跨多样物理场景的泛化能力。超越结果匹配，我们引入了一个可验证的两阶段评估框架，该框架使用分层协议通过自动化程序检查和异构AI仲裁来验证中间推理轨迹以及最终答案。我们对27个领先LLM的评估揭示了数值精度与轨迹保真度之间的明显权衡，识别出一个复杂性悬崖，其中抽象数学预训练未能转化为高级工程任务所需的整合推理。

英文摘要

Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning capabilities imperative. However, existing benchmarks such as MMLU, MATH, and HumanEval assess isolated cognitive skills, failing to capture the physically grounded reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To enable verifiable process supervision in engineering, we introduce EngTrace, a symbolic benchmark built on 90 parameterized templates, each generating unique, contamination-resistant problem instances, spanning three major engineering branches, nine core domains, and 20 distinct areas, yielding 1,350 test cases that stress-test generalization across diverse physical scenarios. Moving beyond outcome matching, we introduce a verifiable two-stage evaluation framework that uses a tiered protocol to validate intermediate reasoning traces alongside final answers through automated procedural checks and a heterogeneous AI Tribunal. Our evaluation of 27 leading LLMs reveals a distinct trade-off between numeric precision and trace fidelity, identifying a complexity cliff where abstract mathematical pre-training fails to translate into the integrative reasoning required for advanced engineering tasks.
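To make the idea of a parameterized, contamination-resistant template concrete, here is a minimal sketch. It is not taken from EngTrace: the Ohm's-law template, the field names, and the tolerance are hypothetical, and the figure of 15 instances per template is only what 1,350 / 90 would imply if instances were spread evenly.

```python
import random

def ohms_law_template(seed):
    """Generate one problem instance with its reference trace and answer.

    Hypothetical template: parameters are drawn from a seeded RNG so that
    each seed yields a distinct but reproducible instance.
    """
    rng = random.Random(seed)
    voltage = rng.randint(5, 240)        # volts
    resistance = rng.randint(10, 1000)   # ohms
    current = voltage / resistance       # amperes, computed symbolically in code
    question = (
        f"A {voltage} V source drives a {resistance} ohm resistor. "
        "Find the current in amperes."
    )
    trace = [("I = V / R", current)]     # intermediate steps a checker could verify
    return {"question": question, "trace": trace, "answer": current}

def check(instance, predicted, rel_tol=1e-2):
    """Procedural check of a final answer within a relative tolerance."""
    return abs(predicted - instance["answer"]) <= rel_tol * abs(instance["answer"])

instances = [ohms_law_template(seed) for seed in range(15)]

# Assumed even split: 90 templates x 15 instances each.
templates, per_template = 90, 15
total = templates * per_template
print(total)
```

Because the answer is fixed in code before any text is produced, the same template can be re-sampled with new seeds whenever leakage into training data is suspected.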

2601.04574 2026-06-17 cs.CL 版本更新

FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

FeedEval: 面向教学法的LLM生成作文反馈评估

Seongyeub Chu, Jongwoo Kim, Munyong Yi

发表机构 * Graduate School of Data Science, KAIST（数据科学研究生院，韩国科学技术院）； Department of Industrial & Systems Engineering, KAIST（工业与系统工程系，韩国科学技术院）

AI总结提出FeedEval框架，沿特异性、帮助性和有效性三个教学维度评估LLM生成的作文反馈，通过专用评估器筛选高质量反馈，提升下游评分和修订效果。

AI中文摘要

超越数值分数预测，近期自动作文评分研究日益强调生成提供理由和可操作指导的高质量反馈。为减轻专家标注的高成本，先前工作通常依赖LLM生成的反馈来训练作文评估模型。然而，此类反馈常未经明确质量验证即被纳入，导致下游应用中噪声的传播。为解决这一局限，我们提出FeedEval，一个基于LLM的框架，用于沿三个教学维度（特异性、帮助性和有效性）评估LLM生成的作文反馈。FeedEval采用维度专用的LLM评估器，这些评估器在本研究策划的数据集上训练，以评估多个候选反馈并选择高质量反馈供下游使用。在ASAP++基准上的实验表明，FeedEval与人类专家判断高度一致，且使用FeedEval筛选的高质量反馈训练的作文评分模型取得了更优的评分性能。此外，使用小型LLM进行的修订实验表明，FeedEval识别的高质量反馈能导致更有效的作文修订。我们在以下网址发布代码和策划的数据集：https://github.com/BBeeChu/FeedEval.git。

英文摘要

Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We release our code and curated datasets at: https://github.com/BBeeChu/FeedEval.git.

2602.13139 2026-06-17 cs.CL 版本更新

OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

OpenLID-v3：提高近亲语言识别精度的经验报告

Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer

AI总结针对现有语言识别工具对近亲语言和噪声区分困难的问题，通过增加训练数据、合并问题语言变体簇和引入噪声标签扩展OpenLID分类器，提出OpenLID-v3，在多个基准上提升精度。

Comments VarDial'26 workshop at the EACL 2026 conference

DOI: 10.18653/v1/2026.vardial-1.23

AI中文摘要

语言识别（LID）是从网络数据构建高质量多语言数据集的关键步骤。现有的LID工具（如OpenLID或GlotLID）通常难以识别近亲语言，也难以区分有效自然语言与噪声，这污染了特定语言子集，尤其是低资源语言。在本工作中，我们通过增加更多训练数据、合并有问题的语言变体簇以及引入一个专门标记噪声的标签来扩展OpenLID分类器。我们将这个扩展系统称为OpenLID-v3，并在多个基准上将其与GlotLID进行评估。在开发过程中，我们重点关注三组近亲语言（波斯尼亚语、克罗地亚语和塞尔维亚语；意大利北部和法国南部的罗曼语变体；以及斯堪的纳维亚语言），并在现有评估数据集不足的地方贡献了新的评估数据集。我们发现集成方法提高了精度，但也显著降低了对低资源语言的覆盖。OpenLID-v3可在 https://huggingface.co/HPLT/OpenLID-v3 获取。

英文摘要

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on https://huggingface.co/HPLT/OpenLID-v3.

2605.25652 2026-06-17 cs.CL cs.CY 版本更新

RepSelect: 通过表示选择性实现鲁棒的大语言模型遗忘

Filip Sondej, Yushi Yang, Adam Mahdi

发表机构 * Independent（独立）； University of Oxford（牛津大学）

AI总结针对现有遗忘方法易被微调或少样本提示逆转的问题，提出RepSelect方法，通过梯度主成分坍塌隔离遗忘集表示，实现深度鲁棒遗忘。

AI中文摘要

使大语言模型（LLMs）深度遗忘特定知识和价值观而不牺牲通用能力仍然是遗忘领域的一个核心挑战。然而，当前方法容易被微调或少样本提示逆转，表明其遗忘仅是浅层的。我们找到了根本原因：现有方法针对与保留集以及微调攻击者恢复的子空间共享的表示，这使得遗忘既破坏通用能力又容易被逆转。我们提出RepSelect（表示选择性），通过在每次更新前坍塌权重梯度的主成分来隔离遗忘集特定的表示，从而保持通用能力完整，同时限制微调可恢复的内容。我们在两个遗忘类别（生物危害知识和虐待倾向）以及四种涵盖密集和混合专家架构的模型家族（Llama 3、Qwen 3.5、Gemma 4 E4B、DeepSeek V2 Lite）上进行评估。与五种流行基线（GradDiff、NPO、SimNPO、RMU、UNDIAL）相比，RepSelect在重新学习后的答案准确性上实现了比最强基线大4-50倍的降低，并且对少样本提示攻击近乎完美鲁棒。因此，针对选择性表示是实现深度鲁棒LLM遗忘的重要一步。

英文摘要

Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. However, current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause: existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.

2606.17478 2026-06-17 cs.CL cs.AI 新提交

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

解码推理型LLM中的隐藏欺骗：用于欺骗审计的激活解释器

Kexin Chen, Yi Liu, Haonan Zhang, Yanhui Li, Xinyu Deng, Dongxia Wang

发表机构 * Zhejiang University（浙江大学）； Griffith University（格里菲斯大学）

AI总结提出STATEWITNESS，一种通过解码目标模型隐藏状态来生成自然语言查询答案和结构化报告的激活解释器，在欺骗检测中平均AUROC达0.916，优于现有方法。

Comments Under review

AI中文摘要

随着LLM获得更强的推理能力，欺骗行为成为一个日益严重的安全问题。现有的欺骗监控器要么对可见文本进行评分，要么从表示向量中导出标量探针分数，几乎没有留下关于为什么响应可疑的可检查证据。我们引入了STATEWITNESS，一种用于欺骗审计的激活解释器。一个独立的解码器读取目标模型的隐藏状态，然后回答自然语言查询或发出关于它们的结构化报告。我们在两个目标推理LLM上评估了STATEWITNESS，涵盖七个欺骗数据集。在相同评估协议下，STATEWITNESS的平均AUROC达到0.916，比最佳黑盒文本监控器相对提升11.6%，比最佳激活探针基线相对提升25.0%。当与现有监控器结合时，STATEWITNESS在简单阈值集成中减少了遗漏的欺骗示例。除了标量检测，解码器还返回查询级答案、模式报告以及令牌级或句子级证据痕迹供人工检查。我们将此接口视为更广泛的可解释性和对齐工具的潜在构建块。

英文摘要

As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.
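The relative gains quoted above can be turned back into approximate baseline scores. The short calculation below assumes that "relative gain" means (new - baseline) / baseline; the abstract does not spell out the definition, so the implied baselines are estimates rather than reported numbers.

```python
def implied_baseline(score, relative_gain):
    """Invert score = baseline * (1 + relative_gain)."""
    return score / (1.0 + relative_gain)

statewitness_auroc = 0.916

# Implied mean AUROC of the best black-box text monitor (11.6% relative gain).
text_monitor = implied_baseline(statewitness_auroc, 0.116)

# Implied mean AUROC of the best activation-probe baseline (25.0% relative gain).
probe = implied_baseline(statewitness_auroc, 0.250)

print(round(text_monitor, 3), round(probe, 3))  # → 0.821 0.733
```

Under that assumption, the text monitor would sit near 0.82 and the probe near 0.73, which is consistent with the abstract's ordering of the two baselines.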

2606.17506 2026-06-17 cs.CL 新提交

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

通过认知权利评估大语言模型的二阶偏见

Ramaravind Kommiya Mothilal, Terry Jingchen Zhang, Raiyan Ahmed, Zhijing Jin, Shion Guha, Syed Ishtiaque Ahmed

发表机构 * University of Toronto（多伦多大学）； Vector Institute（向量研究所）； EuroSafeAI ； Max Planck Institute for Intelligent Systems, Tübingen, Germany（马克斯·普朗克智能系统研究所（德国图宾根））

AI总结提出基于认知权利理论的逻辑推理任务，评估LLM在判断偏见内容时表现出的二阶偏见，发现模型判断存在系统性偏差且能规避安全护栏。

Comments 20 pages, 13 tables, 2 figures

AI中文摘要

对大语言模型社会偏见的评估主要关注模型是否生成或暗示偏见内容。然而，随着LLM越来越多地被用作偏见评判者，它们可能在评估偏见内容时以更微妙的方式表现出社会偏见，而当前方法并未系统性地捕捉这一点。我们称之为二阶偏见：LLM对社会偏见的判断中存在的偏见，并通过一种新颖的、基于哲学推理的任务进行评估。借鉴认知权利理论，我们将偏见概念化为塑造主体理性探究的错位基础知识，并推导出一个逻辑推理任务，让LLM判断偏见文本对谁是可接受的或不可接受的。我们开发了两个简单指标来衡量LLM评判者在没有充分支持的情况下推断人口统计学可接受性的偏见程度，以及这些推断在不同目标群体间的差异。评估开源和闭源模型时，我们发现我们的任务通过揭示模型判断中的偏见来规避安全护栏。它随目标群体系统性地变化，反映了隐性的社会图谱，并展示了模型如何仍然被人口统计标签触发。我们的工作指出了在判断任务中评估LLM偏见的必要性，以及更广泛地，在NLP中采用更具理论基础的偏见评估方法。我们在以下网址发布代码和模型响应：https://github.com/uofthcdslab/second-order-bias。

英文摘要

Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biased content, which current methods do not systematically capture. We call this second-order bias: social bias in an LLM's judgment about social bias, which we evaluate through a novel, philosophically grounded reasoning task. Drawing on entitlement epistemology, we conceptualize bias as misplaced foundational knowledge that shapes an agent's rational inquiry, and derive a logical reasoning task for LLMs to judge to whom a biased text is acceptable or non-acceptable. We develop two simple metrics to measure how biased LLM judges are in inferring demographics for acceptability without sufficient support, and how these inferences vary across groups targeted by biased texts. Evaluating open and closed models, we find that our task evades safety guardrails by surfacing bias in model judgment. It varies systematically across target groups, reflects implicit social maps, and shows how models are still triggered by demographic labels. Our work points to the need for LLM bias evaluation in judgment tasks and broadly, for more theoretically grounded approaches to bias evaluation in NLP. We release our code and model responses at https://github.com/uofthcdslab/second-order-bias.

2606.18062 2026-06-17 cs.CL cs.AI cs.CR cs.HC 新提交

超越原生成功：审计CLIP后门的部署接口暴露

Kunlan Xiang, Haomiao Yang, Wenbo Jiang

发表机构 * University of Electronic Science and Technology of China（电子科技大学）

AI总结提出DIFE框架审计CLIP后门在不同部署接口下的暴露情况，发现原生成功不代表全局安全，并引入BadTextTower填补文本编码器后门缺失。

AI中文摘要

对比语言-图像预训练模型广泛重用于下游接口，包括特征提取、检索、重排序和选择。然而，现有的CLIP后门通常在小规模的原生攻击任务上验证攻击，导致不清楚当通过其他接口重用时，相同的投毒检查点是否仍然暴露、减弱或变得不适用。我们引入DIFE，一个部署接口足迹评估框架，用于审计跨部署接口的带后门CLIP检查点。DIFE通过指定每个接口的组件读出、触发通道、目标事件、参考条件和度量，使各种评估具有可比性。DIFE还引入了有效足迹诊断，以识别携带暴露的可重用CLIP组件或组件组合，并解释风险转移的位置。使用DIFE审计复现的CLIP后门揭示了一个结构化的景观：原生成功不是检查点级别的风险证书，暴露遵循组件足迹，文本侧投毒不会产生文本编码器控制，一些耦合攻击仍然受机制约束。这次审计揭示了现有CLIP后门中的一个重要空白：文本编码器本身成为对抗行为的可重用载体。因此，我们引入BadTextTower来填补这一空白。BadTextTower产生强大的文本条件检索、重排序和选择暴露，同时使仅视觉重用几乎保持清洁。

英文摘要

Contrastive Language-Image Pre-training models are widely reused across downstream interfaces, including feature extraction, retrieval, reranking, and selection. Existing CLIP backdoor studies, however, usually validate attacks on a small attack-native task, leaving unclear whether the same poisoned checkpoint remains exposed, weakens, or becomes inapplicable when reused through other interfaces. We introduce DIFE, a Deployment-Interface Footprint Evaluation framework that audits backdoored CLIP checkpoints across deployment interfaces. DIFE makes various evaluations comparable by specifying each interface's component readout, trigger channel, target event, reference condition, and metric. DIFE also introduces effective-footprint diagnosis to identify the reusable CLIP component or component combination that carries exposure and explains where risk transfers. Auditing reproduced CLIP backdoors with DIFE reveals a structured landscape: native success is not a checkpoint-level risk certificate, exposure follows component footprints, text-side poisoning does not yield textual-encoder control, and some coupled attacks remain mechanism-bound. This audit reveals an important gap in existing CLIP backdoors: a textual encoder that itself becomes a reusable carrier of adversarial behavior. We therefore introduce BadTextTower to fill this gap. BadTextTower produces strong text-conditioned retrieval, reranking, and selection exposure while leaving visual-only reuse nearly clean.

2606.18021 2026-06-17 cs.AI cs.CL cs.LG cs.MA 交叉投稿

LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI

LegalHalluLens: 类型化幻觉审计与校准的多智能体辩论以实现可信赖的法律AI

Lalit Yadav, Akshaj Gurugubelli

发表机构 * Independent Researcher, Sunnyvale, CA, USA（独立研究者，美国加州桑尼维尔）； Independent Researcher, San Diego, CA, USA（独立研究者，美国加州圣地亚哥）

AI总结针对法律AI中聚合指标掩盖的错误集中性和方向性问题，提出LegalHalluLens审计框架，通过类型化幻觉画像、风险方向指数（RDI）和校准辩论管道，将幻觉检测减少45%，并揭示聚合指标隐藏的失败模式。

Comments 15 pages, 5 figures; Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

AI中文摘要

部署在法律工作流程中的AI系统以聚合指标报告的约52%的比率产生幻觉，但这个平均值掩盖了错误集中的位置和方向，使合规官员无法获得可操作的可信部署信号。我们提出LegalHalluLens，一个包含三个组件的审计框架：基于CUAD（Hendrycks等人，2021）的四种法律动机声明类别（数字、时间、义务/权利、事实）的类型化幻觉画像；一个风险方向指数（RDI），将遗漏与发明偏差简化为一个可部署比较的标量；以及一个针对幅度和方向校准的类型化辩论管道。在510份合同和249,252个条款级实例上，我们测量了义务/数字和时间声明之间约38-40个百分点的模型内差距，而聚合报告隐藏了这一点，并表明两个具有匹配的52%比率的系统可能具有相反的RDI。辩论管道将虚构检测减少了45%，每个类别的收益跟踪诊断结果，使用显著更小的骨干网络（4B活跃参数）匹配商业API。类型化画像和RDI揭示了聚合指标隐藏的失败模式；我们进一步表明这些诊断可作为多智能体辩论管道的校准输入，其中针对测量失败模式的怀疑挑战和非对称门优于通用调整的辩论。该框架支持部署在现实世界中的法律AI的方向感知采购、问责制和智能体设计。

英文摘要

AI systems deployed in legal workflows hallucinate at rates that aggregate metrics report at ~52%, but this average conceals where errors concentrate and in which direction they run, leaving compliance officers without an actionable signal for trustworthy deployment. We present LegalHalluLens, an auditing framework with three components: typed hallucination profiles across four legally-motivated claim categories (numeric, temporal, obligation/entitlement, factual) over CUAD (Hendrycks et al., 2021); a Risk Direction Index (RDI) that reduces omission-versus-invention bias to a single deployment-comparable scalar; and a typed debate pipeline calibrated to both magnitudes and directions. Across 510 contracts and 249,252 clause-level instances we measure a within-model gap of approximately 38-40 pp between obligation/numeric and temporal claims that aggregate reporting hides, and show that two systems with matched 52% rates can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45% with per-category gains tracking the diagnosis, matching commercial APIs with a substantially smaller backbone (4B active parameters). Typed profiles and RDI surface failure modes that aggregate metrics hide; we further show these diagnostics serve as calibration inputs for multi-agent debate pipelines, where Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically-tuned debate. The framework supports direction-aware procurement, accountability, and agent design for legal AI deployed in the wild.
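The claim that two systems with the same aggregate rate can carry opposite RDIs is easy to see with a toy calculation. The abstract does not give the RDI formula, so the signed ratio below is purely a hypothetical stand-in, and the error counts are invented for illustration.

```python
def error_rate(omissions, inventions, n_instances):
    """Aggregate hallucination rate: all errors over all instances."""
    return (omissions + inventions) / n_instances

def risk_direction_index(omissions, inventions):
    """Hypothetical direction score in [-1, 1].

    Negative values mean errors lean toward omission, positive values toward
    invention. This is an assumed form, not the paper's definition.
    """
    total = omissions + inventions
    if total == 0:
        return 0.0
    return (inventions - omissions) / total

n = 100
system_a = {"omissions": 40, "inventions": 12}   # mostly misses real clauses
system_b = {"omissions": 12, "inventions": 40}   # mostly fabricates clauses

rate_a = error_rate(system_a["omissions"], system_a["inventions"], n)
rate_b = error_rate(system_b["omissions"], system_b["inventions"], n)
rdi_a = risk_direction_index(system_a["omissions"], system_a["inventions"])
rdi_b = risk_direction_index(system_b["omissions"], system_b["inventions"])

print(rate_a, rate_b)   # identical aggregate rates
print(rdi_a, rdi_b)     # opposite signs
```

A compliance reviewer looking only at the matched aggregate rate would treat the two systems as equivalent, while the signed score separates a system that tends to drop obligations from one that tends to invent them.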

2606.18120 2026-06-17 cs.CR cs.AI cs.CL cs.LG 交叉投稿

Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping

Handlebars模板化LLM提示中的结构角色注入：三花括号插值、分隔符家族与HTML自动转义的局限性

Mohammadreza Rashidi

发表机构 * Department of Computer Science AI（计算机科学系人工智能）； Media Analysis Lab Berlin, Germany（媒体分析实验室柏林德国）

AI总结本文研究Handlebars模板引擎中双花括号与三花括号插值对结构角色注入攻击的影响，通过无模型分析和5760次实验，揭示HTML转义仅保护特定分隔符家族，无法替代指令与数据的结构分离。

Comments 7 pages, 6 figures

AI中文摘要

大型语言模型应用从模板构建提示，Handlebars是广泛使用的模板引擎，也是Microsoft Semantic Kernel中的默认提示模板格式。其双花括号{{x}}表达式对插值值进行HTML转义，并被记录为安全默认；而三花括号{{{x}}}表达式则直接插入原始值。我们表明，这一选择悄然决定了应用对结构角色注入的暴露程度，攻击者控制的数据携带聊天角色分隔符，从而伪造高权限轮次。无模型分析建立了机制：Handlebars转义重写尖括号，但不重写方括号、冒号或Markdown井号，因此它中和了ChatML、Llama-3和XML角色分隔符（存活率0.00），同时保留Llama-2 [INST]、传统Human:/Assistant:和Markdown ###分隔符（后两者存活率1.00）。随后，我们在七个分隔符家族、两个攻击目标和四个模型（GPT-3.5 Turbo、GPT-4o mini、GPT-4.1 mini、Claude Haiku 4.5）上运行了5760次试验，总API成本为1.63美元。GPT-3.5 Turbo在97%的原始试验和91%的转义试验中遵循任务劫持指令，转义保护集中在尖括号家族，而在冒号和Markdown家族中缺失；更难的秘密泄露目标未饱和，更清晰地暴露了相同的家族交互。Claude Haiku 4.5几乎完全抵抗了两个目标。转义默认仅保护HTML转义恰好覆盖的分隔符方案，对剩余方案无保护，且无法替代指令与数据的结构分离。

英文摘要

Large language model applications build prompts from templates, and Handlebars is a widely used templating engine and the default prompt-template format in Microsoft Semantic Kernel. Its double-brace {{x}} expression HTML-escapes the interpolated value and is documented as the safe default; its triple-brace {{{x}}} expression inserts the value raw. We show that this choice silently governs an application's exposure to structural role injection, where attacker-controlled data carries chat role delimiters that forge a higher-privilege turn. A model-free analysis establishes the mechanism: Handlebars escaping rewrites angle brackets but not square brackets, colons, or Markdown hashes, so it neutralises ChatML, Llama-3, and XML role delimiters (survival rate 0.00) while leaving Llama-2 [INST], legacy Human:/Assistant:, and Markdown ### delimiters intact (survival rate 1.00 for the last two). We then run 5760 trials across seven delimiter families, two attack objectives, and four models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5) at a combined API cost of 1.63 USD. GPT-3.5 Turbo follows the task-hijack instruction in 97% of raw and 91% of escaped trials, with the escaping protection concentrated in the angle-bracket families and absent for the colon- and Markdown-based families; the harder secret-exfiltration objective, which does not saturate, exposes the same family interaction more cleanly. Claude Haiku 4.5 resists both objectives almost entirely. The escaped default protects only the delimiter schemes whose characters HTML escaping happens to cover, gives no protection for the rest, and cannot substitute for a structural separation of instruction and data.
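The model-free part of the argument can be reproduced in a few lines. The sketch below re-implements, in Python, the character map that Handlebars.js applies to double-brace values as far as I understand it (`&`, `<`, `>`, `"`, `'`, backtick, `=`); treat the map as an assumption and check it against the Handlebars source before relying on it. The delimiter strings are illustrative examples of each family, not the paper's payloads.

```python
# Assumed copy of the Handlebars.js HTML-escape map for double-brace values.
HANDLEBARS_ESCAPES = {
    "&": "&amp;", "<": "&lt;", ">": "&gt;",
    '"': "&quot;", "'": "&#x27;", "`": "&#x60;", "=": "&#x3D;",
}

def escape_expression(value):
    """Escape a value the way a double-brace interpolation would."""
    return "".join(HANDLEBARS_ESCAPES.get(ch, ch) for ch in value)

# One illustrative delimiter per family (hypothetical test strings).
DELIMITERS = {
    "ChatML": "<|im_start|>system",
    "XML": "<system>",
    "Llama-2": "[INST]",
    "Human/Assistant": "Human:",
    "Markdown": "### System",
}

# A delimiter "survives" if escaping leaves it byte-for-byte unchanged.
survives = {name: escape_expression(d) == d for name, d in DELIMITERS.items()}

for name, ok in survives.items():
    print(f"{name:16s} survives escaping: {ok}")
```

Angle-bracket delimiters are rewritten, while bracket-, colon-, and hash-based ones pass through untouched. The bare `[INST]` string survives in this sketch; the abstract reports a survival rate of 1.00 only for the colon and Markdown families, so the paper's Llama-2 payloads presumably contain additional characters that this example does not model.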

2606.18193 2026-06-17 cs.CR cs.AI cs.CL 交叉投稿

A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

Anthropic Fable 5 与 Opus 4.8 模型的红队研究

Nicola Franco

发表机构 * AI4I

AI总结通过 HackAgent 框架对两个前沿大语言模型进行自动化越狱攻击，发现尽管模型抵抗大部分攻击，但自适应迭代攻击仍能成功，且残差表面比总体框架更大。

Comments White paper

AI中文摘要

我们评估了 Anthropic 开发的两个前沿大语言模型（LLM）Fable 5 和 Opus 4.8 的对抗鲁棒性，针对涵盖十个危害类别的 7 826 个有害意图，使用了四类自动化越狱攻击。利用 HackAgent 红队框架，生成了数十万次对抗尝试，每个明显的成功案例均由三个评判模型组成的委员会（多数投票）独立重新裁定。两个模型抵抗了大部分攻击，但残差表面比总体框架所暗示的更大：它主要由自适应迭代攻击主导，而静态混淆几乎完全被中和。最强的自适应搜索（攻击树）在 11.5% 的意图上攻破了 Opus 4.8，而 Fable 5 保持在个位数（最坏情况 6.1%）。因此，总体成功率不应被视为令人放心。即使在这些加固配置下，两个模型仍产生了 1 620 个（Opus 4.8）和 702 个（Fable 5）经委员会确认的有害完成，涵盖每个危害类别，这些完成是由攻击模型在没有人类专家参与的情况下，自动、廉价地在前一两个细化步骤中发现的。合理的结论是，即使是最好的、经过最严格测试的前沿模型，在持续的自动化压力下仍然可以被可靠地攻破。

英文摘要

We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attacks across 7 826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of attacks, but the residual surface is larger than aggregate framing suggests: it is dominated by adaptive iterative attacks, while static obfuscation is near-fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents overall, whereas Fable 5 stays in the single digits (6.1% worst-case). Aggregate rates therefore should not be read as reassurance. Even in these hardened configurations, the two models produced 1 620 (Opus 4.8) and 702 (Fable 5) panel-confirmed harmful completions spanning every harm category, located automatically, cheaply, and within the first one or two refinement steps by an attacker model with no human expert in the loop. The reasonable conclusion is that even the best, most-tested frontier models remain reliably breakable under sustained automated pressure.
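The re-adjudication step, a three-judge panel deciding by majority vote, can be sketched as follows. This is an illustration only: the vote format and function names are assumptions, and the toy votes are made up rather than drawn from the study.

```python
from collections import Counter

def panel_verdict(votes):
    """Return True when a strict majority of judges flag the output as harmful."""
    counts = Counter(votes)
    return counts[True] > len(votes) / 2

def confirmed_rate(all_votes):
    """Share of apparent successes that the panel confirms."""
    confirmed = sum(panel_verdict(v) for v in all_votes)
    return confirmed / len(all_votes)

# Each tuple holds the three judges' votes for one apparent jailbreak success.
attempts = [
    (True, True, False),    # confirmed, 2 of 3
    (True, False, False),   # rejected, 1 of 3
    (True, True, True),     # confirmed, unanimous
    (False, False, False),  # rejected
]

rate = confirmed_rate(attempts)
print(rate)  # → 0.5
```

Requiring a majority rather than a single judge's opinion trades some recall for fewer false positives, which matters when the headline numbers are counts of confirmed harmful completions.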

2601.16407 2026-06-17 cs.CL cs.AI 版本更新

Jacobian Scopes: token-level causal attributions in LLMs

Jacobian Scopes: LLM中的令牌级因果归因

Toni J. B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Gurbir Arora, Christopher J. Earls

发表机构 * Cornell University（康奈尔大学）； Imperial College London（伦敦帝国理工学院）； Goodfire AI

AI总结提出Jacobian Scopes，一种基于梯度的令牌级因果归因方法，用于解释LLM预测，揭示政治偏见、翻译策略和上下文学习机制。

Comments 25 pages, 16 figures

AI中文摘要

大型语言模型（LLM）基于上下文中的线索（如语义描述和上下文示例）进行下一个令牌预测。然而，由于现代架构中层和注意力头数量的激增，阐明哪些先前的令牌对给定预测影响最大仍然具有挑战性。我们提出Jacobian Scopes，一套基于梯度的令牌级因果归因方法，用于解释LLM预测。基于微扰理论和信息几何，Jacobian Scopes量化输入令牌如何影响模型预测的各个方面，例如特定logits、完整预测分布和模型不确定性（有效温度）。通过涵盖指令理解、翻译和上下文学习（ICL）的案例研究，我们展示了Jacobian Scopes如何揭示隐含的政治偏见，揭示词级和短语级翻译策略，并阐明最近争论的上下文时间序列预测的潜在机制。为了便于在自定义文本上探索Jacobian Scopes，我们开源了实现，并在以下网址提供了云托管交互式演示：https://huggingface.co/spaces/Typony/JacobianScopes。

英文摘要

Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. Grounded in perturbation theory and information geometry, Jacobian Scopes quantify how input tokens influence various aspects of a model's prediction, such as specific logits, the full predictive distribution, and model uncertainty (effective temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we demonstrate how Jacobian Scopes reveal implicit political biases, uncover word- and phrase-level translation strategies, and shed light on recently debated mechanisms underlying in-context time-series forecasting. To facilitate exploration of Jacobian Scopes on custom text, we open-source our implementations and provide a cloud-hosted interactive demo at https://huggingface.co/spaces/Typony/JacobianScopes.

2512.15792 2026-06-17 cs.CY cs.AI cs.CL 版本更新

A Multifaceted Analysis of Social Biases in Large Language Models

大型语言模型中社会偏见的多维度分析

Xulang Zhang, Rui Mao, Erik Cambria

发表机构 * Nanyang Technological University（南洋理工大学）

AI总结本文系统分析了四种广泛使用的大型语言模型在政治、意识形态、联盟、语言和性别等维度上的偏见，通过多项实验揭示了模型在中立性、意识形态倾向、地缘政治倾向、多语言故事完成中的偏见以及性别倾向。

AI中文摘要

大型语言模型（LLMs）已迅速成为获取信息和支持人类决策不可或缺的工具。然而，确保这些模型在各种情境下保持公平性对于其安全和负责任的部署至关重要。在本研究中，我们对四种广泛采用的LLMs进行了全面分析，探讨了它们在政治、意识形态、联盟、语言和性别等维度上的潜在偏见和倾向。通过一系列精心设计的实验，我们利用新闻摘要来检验其政治中立性，通过新闻立场分类来研究意识形态偏见，通过联合国投票模式来探讨对特定地缘政治联盟的倾向，通过多语言故事完成来检验语言偏见，并通过世界价值观调查中的响应来揭示性别相关倾向。结果表明，尽管这些模型被设计为中立和公正，但它们仍然表现出不同类型的偏见和倾向。

英文摘要

Large language models (LLMs) have rapidly become indispensable tools for acquiring information and supporting human decision-making. However, ensuring that these models uphold fairness across varied contexts is critical to their safe and responsible deployment. In this study, we undertake a comprehensive examination of four widely adopted LLMs, probing their underlying biases and inclinations across the dimensions of politics, ideology, alliance, language, and gender. Through a series of carefully designed experiments, we investigate their political neutrality using news summarization, ideological biases through news stance classification, tendencies toward specific geopolitical alliances via United Nations voting patterns, language bias in the context of multilingual story completion, and gender-related affinities as revealed by responses to the World Values Survey. Results indicate that while the LLMs are aligned to be neutral and impartial, they still show biases and affinities of different types.

伏尼契手稿中分层位置和方向约束的证据：对类密码结构的影响

Christophe Parisel

AI总结通过分析伏尼契手稿的字素序列，发现词内从右到左优化和词边界从左到右依赖的双层结构，这种方向分离在四种对比语言中未出现；测试两类生成器均无法同时满足四个签名标准，表明手稿存在难以用简单位置或频率机制复现的类密码结构约束。

AI中文摘要

伏尼契手稿（VMS）展示了一种起源不明的文字，其字素序列一直抗拒语言学分析。我们对其字素序列进行了系统分析，揭示了两个互补的结构层：词内序列中字符级的从右到左优化，以及词边界处的从左到右依赖，这种方向分离在我们四种对比语言（英语、法语、希伯来语、阿拉伯语）中均未观察到。我们进一步根据一个四签名联合标准评估了两类结构化生成器：一个参数化的槽位生成器和一个实现Rugg（2004）胡言乱语假设的卡尔达诺格栅。在其全部测试参数空间中，两类生成器均无法同时再现所有四个签名。虽然这些结果并未排除我们未测试的生成器类别，但它们提供了第一个定量基准，未来任何关于VMS的生成或密码分析模型均可据此评估，并且表明VMS表现出类似密码的结构约束，这些约束难以仅通过简单的位置或频率机制复现。

英文摘要

The Voynich Manuscript (VMS) exhibits a script of uncertain origin whose grapheme sequences have resisted linguistic analysis. We present a systematic analysis of its grapheme sequences, revealing two complementary structural layers: a character-level right-to-left optimization in word-internal sequences and a left-to-right dependency at word boundaries, a directional dissociation not observed in any of our four comparison languages (English, French, Hebrew, Arabic). We further evaluate two classes of structured generator against a four-signature joint criterion: a parametric slot-based generator and a Cardan grille implementing Rugg's (2004) gibberish hypothesis. Across their full tested parameter spaces, neither class reproduces all four signatures simultaneously. While these results do not rule out generator classes we have not tested, they provide the first quantitative benchmarks against which any future generative or cryptanalytic model of the VMS can be evaluated, and they suggest that the VMS exhibits cipher-like structural constraints that are difficult to reproduce from simple positional or frequency-based mechanisms alone.


2606.15121 2026-06-17 cs.CL Updated

When Cognitive Graphs Meet LLMs: BDEI Cognitive Pathways for Panic Emotional Arousal Prediction

Mengzhu Liu, Long Qin, Chuan Ai, Zhengqiu Zhu, Hongru Liang, Chen Gao, Yong Li, Xin Lu, Quanjun Yin

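Affiliation: University of Science and Technology of China

AI summary: Proposes the PanicCognitivePath framework, which fuses multi-domain signals through a Psychological Safety Distance model, adds an explicit Emotion node to form a BDEI cognitive pathway, and confines the LLM to single-step parameter estimation. It predicts the timing of panic emotional arousal and improves accuracy by 10.68%.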

Abstract

Predicting individual panic emotional arousal timing before manifestation is essential for proactive emergency intervention. Existing methods incorporate cognitive elements, but none explicitly models the emotional arousal process, making them ill-suited for emotional arousal timing prediction. We argue that grounding prediction in appraisal emotion theory is necessary because it explicitly models this process, but three problems must be solved. (1) Appraisal theory posits that emotion arises from simultaneous evaluation across multiple threat dimensions, yet no prior work fuses these inputs into risk perception. (2) Existing cognitive models lack an Emotion node, decoupling threat appraisal from emotional arousal and forcing emotions to be inferred indirectly from behaviors. (3) Given their generalizable cognitive reasoning, current approaches adopt LLMs as the primary decision-maker, yet overlook the fragility and hallucination-proneness of their outputs. To address all three, we introduce PanicCognitivePath (PCP). A Psychological Safety Distance (PSD) model, grounded in psychological distance theory, maps four-domain signals into a unified risk metric as the entry condition for subsequent cognitive reasoning. An explicit Emotion node grounded in appraisal emotion theory is introduced into BDI, forming a Belief-Desire-Emotion-Intention (BDEI) pathway. Agents whose risk metric exceeds the PSD threshold enter this pathway, coupling threat appraisal directly to emotional arousal. The BDEI pathway governs all state transitions while the LLM is confined to parameter estimation for the Belief-to-Desire transition, confining hallucinations to a single step and preventing error propagation. Experiments on Hurricane Sandy show that PCP improves arousal timing accuracy by 10.68% over baselines and reduces peak count error to 7.07%.


2407.13053 2026-06-17 cs.CY cs.AI cs.CL cs.LG Updated

E2Vec: Feature Embedding with Temporal Information for Analyzing Student Actions in E-Book Systems

Yuma Miyazaki, Valdemar Švábenský, Yuta Taniguchi, Fumiya Okubo, Tsubasa Minematsu, Atsushi Shimada

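Affiliation: Kyushu University

AI summary: Proposes E2Vec, which uses word embeddings to turn operation logs and their time intervals into student vectors for an at-risk detection task, showing potential for generalizability and performance.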

Comments Research paper published in the Proceedings of the 17th Educational Data Mining Conference (EDM 2024), see https://doi.org/10.5281/zenodo.12729853

DOI: 10.5281/zenodo.12729853

Abstract

Digital textbook (e-book) systems record student interactions with textbooks as a sequence of events called EventStream data. In the past, researchers extracted meaningful features from EventStream, and utilized them as inputs for downstream tasks such as grade prediction and modeling of student behavior. Previous research evaluated models that mainly used statistical-based features derived from EventStream logs, such as the number of operation types or access frequencies. While these features are useful for providing certain insights, they lack temporal information that captures fine-grained differences in learning behaviors among different students. This study proposes E2Vec, a novel feature representation method based on word embeddings. The proposed method regards operation logs and their time intervals for each student as a string sequence of characters and generates a student vector of learning activity features that incorporates time information. We applied fastText to generate an embedding vector for each of 305 students in a dataset from two years of computer science courses. Then, we investigated the effectiveness of E2Vec in an at-risk detection task, demonstrating potential for generalizability and performance.
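As a rough illustration of the encoding idea in this abstract, the sketch below turns a timestamped operation log into a character string, inserting one symbol per time-interval bucket between operations. The operation-to-character map, the bucket thresholds, and the function names are hypothetical and not taken from the paper; strings of this kind are what a word-embedding tool such as fastText would then be trained on.

```python
from typing import List, Tuple

# Hypothetical alphabet: one character per e-book operation type.
OP_CHARS = {"OPEN": "O", "NEXT": "N", "PREV": "P", "ADD_MARKER": "M", "CLOSE": "C"}


def interval_char(seconds: float) -> str:
    """Map a gap between two operations to a bucket symbol (assumed thresholds)."""
    if seconds < 10:
        return "s"  # short pause
    if seconds < 300:
        return "m"  # medium pause
    return "l"      # long pause


def encode_log(events: List[Tuple[float, str]]) -> str:
    """Encode (timestamp_seconds, operation) events as one string.

    Each operation becomes its character; between consecutive operations
    the elapsed time is inserted as a bucket symbol.
    """
    out = []
    prev_time = None
    for timestamp, op in events:
        if prev_time is not None:
            out.append(interval_char(timestamp - prev_time))
        out.append(OP_CHARS[op])
        prev_time = timestamp
    return "".join(out)


log = [(0, "OPEN"), (5, "NEXT"), (65, "NEXT"), (1000, "ADD_MARKER"), (1003, "CLOSE")]
encoded = encode_log(log)
print(encoded)
```

Keeping the interval symbols in the same string as the operations is what lets a subword model see timing and action together, which is the property the abstract says purely count-based features lack.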


2507.05169 2026-06-17 cs.LG cs.AI cs.CL cs.CV cs.RO Updated

Critique of World Model: A Generative Latent Prediction Architecture for World Modeling

Eric Xing, Mingkai Deng, Jinyu Hou

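AI summary: Starting from the psychological notion of "hypothetical thinking", the essay argues that the core goal of a world model is to simulate all actionable possibilities of the real world, and proposes a Generative Latent Prediction (GLP) architecture based on stateful, hierarchical, multi-level, mixed continuous/discrete representations.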

Abstract

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.


2601.06116 2026-06-17 cs.AI cs.CL cs.CY Updated

The Homogenization Problem in LLMs: Towards Meaningful Diversity in AI Safety

Ian Rios-Sialer

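Affiliation: Independent Researcher

AI summary: Examines homogenization in large language models, introduces a framework in which stakeholders encode their context and value system to characterize it, surfaces gender bias in an experiment, and introduces the concept of xeno-reproduction to mitigate homogenization.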

Abstract

Generative AI models reproduce the human biases in their training data and further amplify them through mechanisms such as mode collapse. The loss of diversity produces homogenization, which not only harms the minoritized but impoverishes everyone. We argue homogenization should be a central concern in AI safety. To meaningfully characterize homogenization in Large Language Models (LLMs), we introduce a framework that allows stakeholders to encode their context and value system. We illustrate our approach with an experiment that surfaces gender bias in an LLM (Claude 3.5 Haiku) on an open-ended story prompt. Building from queer theory, we formalize homogenization in terms of normativity. Borrowing language from feminist theory, we introduce the concept of xeno-reproduction as a class of tasks for mitigating homogenization by promoting diversity. Our work opens a collaborative line of research that seeks to understand and advance diversity in AI.


2509.03932 2026-06-17 cs.CL cs.CY cs.LG Updated

KPoEM: A Human-Annotated Dataset for Emotion Classification and RAG-Based Poetry Generation in Korean Modern Poetry

Iro Lim, Haein Ji, Byungjun Kim

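Affiliations: The Academy of Korean Studies; Graduate School of Korean Studies; Cultural Informatics

AI summary: Builds the KPoEM multi-label emotion dataset, reaches an F1-micro of 0.60 on emotion classification with a sequential fine-tuning strategy, and shows that RAG-based poetry generation reflecting the emotional and cultural sensibilities of Korean literature is feasible.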

Comments 43 pages, 22 tables, 3 figures, Digital Humanities and Social Sciences Korea Conference, James Joo-Jin Kim Center for Korean Studies, University of Pennsylvania, Philadelphia, USA

DOI: 10.25024/review.2026.29.1.006
Journal ref: The Review of Korean Studies 29(1) (2026) 161-206

Abstract

This study introduces KPoEM (Korean Poetry Emotion Mapping), a novel dataset that serves as a foundation for both emotion-centered analysis and generative applications in modern Korean poetry. Despite advancements in NLP, poetry remains underexplored due to its complex figurative language and cultural specificity. We constructed a multi-label dataset of 7,662 entries (7,007 line-level and 615 work-level), annotated with 44 fine-grained emotion categories from five influential Korean poets. The KPoEM emotion classification model, fine-tuned through a sequential strategy -- moving from general-purpose corpora to the specialized KPoEM dataset -- achieved an F1-micro score of 0.60, significantly outperforming previous models (0.43). The model demonstrates an enhanced ability to identify temporally and culturally specific emotional expressions while preserving core poetic sentiments. Furthermore, applying the structured emotion dataset to a RAG-based poetry generation model demonstrates the empirical feasibility of generating texts that reflect the emotional and cultural sensibilities of Korean literature. This integrated approach strengthens the connection between computational techniques and literary analysis, opening new pathways for quantitative emotion research and generative poetics. Overall, this study provides a foundation for advancing emotion-centered analysis and creation in modern Korean poetry.
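For readers unfamiliar with the metric reported above, here is a minimal sketch of micro-averaged F1 for multi-label classification: true positives, false positives, and false negatives are pooled over all items and labels before the score is computed. The emotion labels and predictions are made up for illustration and are not drawn from the KPoEM dataset.

```python
from typing import List, Set


def f1_micro(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over multi-label predictions.

    Counts are pooled across every item and label, so frequent labels
    weigh more than rare ones.
    """
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, pred):
        tp += len(gold_labels & pred_labels)   # predicted and correct
        fp += len(pred_labels - gold_labels)   # predicted but wrong
        fn += len(gold_labels - pred_labels)   # missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical line-level annotations with several labels per line.
gold = [{"sorrow", "longing"}, {"joy"}, {"anger", "sorrow"}]
pred = [{"sorrow"}, {"joy", "hope"}, {"sorrow"}]
score = f1_micro(gold, pred)
print(round(score, 3))
```

In this toy case the pooled counts are 3 true positives, 1 false positive, and 2 false negatives, so the score is 6 / 9.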


2509.13196 2026-06-17 cs.CL Updated

The Few-shot Dilemma: Over-prompting Large Language Models

Yongjian Tang, Doruk Tuncel, Christian Koerner, Thomas Runkler

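Affiliations: Siemens AG; Technical University of Munich

AI summary: Outlines a prompting framework with three few-shot selection methods (random sampling, semantic embedding, and TF-IDF), finds across several LLMs that too many domain-specific examples can degrade performance, and identifies the optimal number of examples by combining TF-IDF selection with stratified sampling, surpassing the state of the art by 1% on software requirement classification.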

Comments accepted for the main track of FLLM

DOI: 10.1109/FLLM67465.2025.11391015

Abstract

Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods - random sampling, semantic embedding, and TF-IDF vectors - and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets. By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM. This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state-of-the-art by 1% in classifying functional and non-functional requirements.
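The combination of TF-IDF selection and stratification described above can be sketched with the standard library alone. This is an illustrative reading, not the authors' pipeline: the whitespace tokenizer, the smoothed IDF formula, the fixed `per_class` budget, and the example requirements are all assumptions.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """TF-IDF with a smoothed IDF: ln((1 + n) / (1 + df)) + 1 (assumed variant)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            tok: (count / len(doc)) * (math.log((1 + n) / (1 + df[tok])) + 1)
            for tok, count in tf.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query: str, pool: List[Tuple[str, str]],
                    per_class: int) -> List[Tuple[str, str]]:
    """Pick the `per_class` most query-similar examples from each label."""
    docs = [text.lower().split() for text, _ in pool] + [query.lower().split()]
    vectors = tfidf_vectors(docs)
    query_vec = vectors[-1]
    by_label = defaultdict(list)
    for (text, label), vec in zip(pool, vectors):
        by_label[label].append((cosine(query_vec, vec), text))
    chosen = []
    for label in sorted(by_label):  # stratify: same budget for every class
        ranked = sorted(by_label[label], key=lambda item: item[0], reverse=True)
        chosen.extend((text, label) for _, text in ranked[:per_class])
    return chosen


# Hypothetical labeled pool of software requirements.
pool = [
    ("The system shall export reports as PDF", "functional"),
    ("The user shall be able to reset the password", "functional"),
    ("The system shall respond within two seconds", "non-functional"),
    ("The interface shall be available in English and German", "non-functional"),
]
shots = select_examples("The system shall export invoices as PDF", pool, per_class=1)
print(shots)
```

Raising `per_class` step by step and re-evaluating would be one way to look for the point where extra examples stop helping, which is the effect the abstract calls over-prompting.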


2503.08679 2026-06-17 cs.AI cs.CL cs.LG Updated

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy

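Affiliation: Poseidon Research

AI summary: On naturally worded prompts, models sometimes produce superficially coherent chains of thought that justify mutually contradictory answers, revealing implicit post-hoc rationalization; frontier models are not entirely free of it either.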

Comments Published at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Recent studies indicate that when faced with explicit biases in prompts, models often omit mentioning these biases in their Chain-of-Thought (CoT) output, revealing that verbalized reasoning can give an incorrect picture of how models arrive at conclusions (unfaithfulness). In this work, we show that unfaithful CoT also occurs on naturally worded, non-adversarial prompts without adding artificial biases or editing model outputs. We find that when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify systematically answering Yes to both or No to both, despite the contradiction. We present preliminary evidence that this is due to models' implicit biases towards Yes or No, labeling this Implicit Post-Hoc Rationalization. Our results reveal rates up to 13% for production models, and while frontier models are more faithful, none are entirely so, including thinking models like DeepSeek R1 (0.37%) and Sonnet 3.7 with thinking (0.04%). We also investigate Unfaithful Illogical Shortcuts, where models use subtly illogical reasoning to make speculative answers to hard math problems seem rigorously proven. Our findings indicate that while CoT can be useful for assessing outputs, it is not a complete account of the internal process that produced the model's answer and should be used with caution in agentic or safety-critical settings.
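The paired-question check can be illustrated with a small sketch that counts how often a model gives the same answer to "Is X bigger than Y?" and "Is Y bigger than X?". This is a simplification under stated assumptions: it takes one final answer per question, assumes X and Y are never equal, and uses placeholder item names; the paper's actual criterion for systematic bias is likely stricter than this single-sample count.

```python
from typing import Dict, Tuple


def contradictory_pair_rate(answers: Dict[Tuple[str, str], str]) -> float:
    """Fraction of reversed question pairs answered identically.

    `answers[(x, y)]` is the model's final "YES"/"NO" to
    "Is x bigger than y?". Answering YES to both orders, or NO to both,
    is contradictory when x and y differ in size.
    """
    seen = set()
    pairs = contradictions = 0
    for (x, y), answer in answers.items():
        key = frozenset((x, y))
        if (y, x) not in answers or key in seen:
            continue  # skip unpaired questions and already-counted pairs
        seen.add(key)
        pairs += 1
        if answer == answers[(y, x)]:
            contradictions += 1
    return contradictions / pairs if pairs else 0.0


# Placeholder items; the answers are invented for illustration.
answers = {
    ("item_a", "item_b"): "YES",
    ("item_b", "item_a"): "NO",   # consistent pair
    ("item_c", "item_d"): "YES",
    ("item_d", "item_c"): "YES",  # contradictory pair
}
rate = contradictory_pair_rate(answers)
print(rate)
```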


2408.15188 2026-06-17 eess.AS cs.CL cs.SD Updated

Infusing Acoustic Pause Context into Text-Based Dementia Assessment

Franziska Braun, Sebastian P. Bayerl, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

Affiliations: Technische Hochschule Nürnberg; Technische Hochschule Rosenheim; Klinik für Psychiatrie und Psychotherapie, Universitätsklinik der Paracelsus Medizinischen Privatuniversität, Klinikum Nürnberg, Germany; KST Institut GmbH, Bad Emstal, Germany

AI summary: Studies pause-enriched transcripts in transformer language models to distinguish subjects with no cognitive impairment, mild cognitive impairment, and Alzheimer's dementia, and examines how pause information and acoustic context affect different tasks.

Comments Accepted at INTERSPEECH 2024

DOI: 10.21437/Interspeech.2024-2496
Journal ref: Proceedings of Interspeech 2024

Abstract

Speech pauses, alongside content and structure, offer a valuable and non-invasive biomarker for detecting dementia. This work investigates the use of pause-enriched transcripts in transformer-based language models to differentiate the cognitive states of subjects with no cognitive impairment, mild cognitive impairment, and Alzheimer's dementia based on their speech from a clinical assessment. We address three binary classification tasks: onset, monitoring, and dementia exclusion. The performance is evaluated through experiments on a German Verbal Fluency Test and a Picture Description Test, comparing the model's effectiveness across different speech production contexts. Starting from a textual baseline, we investigate the effect of incorporating pause information and acoustic context. We show that the test should be chosen depending on the task and that, similarly, lexical pause information and acoustic cross-attention contribute differently.


2206.06208 2026-06-17 eess.AS cs.CL cs.SD Updated

Automated Evaluation of Standardized Dementia Screening Tests

Franziska Braun, Markus Förstel, Bastian Oppermann, Andreas Erzigkeit, Thomas Hillemacher, Hartmut Lehfeld, Korbinian Riedhammer

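AI summary: Studies automated scoring of standardized dementia screening tests by analyzing how scores derived from manual and automatic transcripts correlate with expert scores; automatic scoring is stricter than human scoring on some tasks, but correlations remain high overall.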

Comments Submitted to Interspeech 2022. arXiv admin note: text overlap with arXiv:2206.05018

DOI: 10.21437/Interspeech.2022-10436
Journal ref: Proceedings of Interspeech 2022

Abstract

For dementia screening and monitoring, standardized tests play a key role in clinical routine since they aim at minimizing subjectivity by measuring performance on a variety of cognitive tasks. In this paper, we report on a study that consists of a semi-standardized history taking followed by two standardized neuropsychological tests, namely the SKT and the CERAD-NB. The tests include basic tasks such as naming objects, learning word lists, but also widely used tools such as the MMSE. Most of the tasks are performed verbally and should thus be suitable for automated scoring based on transcripts. For the first batch of 30 patients, we analyze the correlation between expert manual evaluations and automatic evaluations based on manual and automatic transcriptions. For both SKT and CERAD-NB, we observe high to perfect correlations using manual transcripts; for certain tasks with lower correlation, the automatic scoring is stricter than the human reference since it is limited to the audio. Using automatic transcriptions, correlations drop as expected and are related to recognition accuracy; however, we still observe high correlations of up to 0.98 (SKT) and 0.85 (CERAD-NB). We show that using word alternatives helps to mitigate recognition errors and subsequently improves correlation with expert scores.
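To make the comparison between expert and automatic scores concrete, the sketch below computes a correlation coefficient between two score lists. The abstract does not say which coefficient was used, so Pearson's r is an assumption here, and the scores are invented to mimic an automatic scorer that is slightly stricter on some items.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-patient scores on one subtest.
expert_scores = [10, 8, 6, 9, 4]
auto_scores = [10, 8, 5, 9, 3]  # automatic scorer is stricter on two patients
r = pearson(expert_scores, auto_scores)
print(round(r, 3))
```

A high coefficient here only says the two scorers rank patients similarly; it does not show that their absolute scores agree, which is why a stricter automatic scorer can still correlate well.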


1. Large Language Models and Foundation Models (28 papers)

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

LLMs Infer Cultural Context but Fail to Apply It When Responding

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue

ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation

Learning from the Self-future: On-policy Self-distillation for dLLMs

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Variable-Width Transformers

Rethinking Groups in Critic-Free RLVR

Nothing from Something: Can a Language Model Discover 0?

Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems

LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

When AI Says "I have been in similar situations": Synthetic Lived Experience in Peer-Like Caregiver Support

Looped World Models

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Adaptive Activation Steering for Efficient LLM Reasoning via Closed-Loop PID Control

RooseBERT: A New Deal For Political Language Modelling

Regression Language Models for Code

Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

Learn-To-Learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM

A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

Olmo Hybrid: From Theory to Practice and Back

Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

2. Machine Translation and Cross-Lingual Processing (6 papers)

Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

Translating the Untranslatable: An Operationalizable Ontology for Untranslatability

When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning

Smarter edits? Post-editing with error highlights and translation suggestions

Koshur Diacritizer: A Byte-Level Sequence-to-Sequence Model for Kashmiri Diacritic Restoration

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

3. Information Extraction, Retrieval, and Question Answering (6 papers)

HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice

Non-negative Elastic Net Decoding for Information Retrieval

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

PACE-RAG: Patient-Aware Contextual and Evidence-Constrained RAG for Clinical Drug Recommendation

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

4. Dialogue Systems and Agents (15 papers)

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

Environment-Grounded Automated Prompt Optimization for LLM Game Agents

Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

Toward Accessible Psychotherapy Training Using AI-Driven Interactive Patient Avatars

Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

EmoFSM: A Finite State Machine for Emotional Support Conversation

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents

Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory

5. Text Generation, Summarization, and Editing (5 papers)

Self-Generated Error Training for Token Editing in Diffusion Language Models

Do Large Language Models Always Tell The Same Stories?

Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation

VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination

Continuous Language Diffusion as a Decoder-Interface Problem

6. Semantics, Syntax, and Linguistic Analysis (5 papers)

Examining the Limits of Word2Vec with Toki Pona

An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0

Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

Priors Persist Through Suppression: A Stroop Paradigm for Lexical Override

7. Multimodal Language Processing (15 papers)

Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block

Vision-language models for chest radiography do not always need the image

When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents