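arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.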

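2606.19348 2026-06-19 cs.CL cs.AI New submission

Granularity-Regulated Adaptive Computational Efficiency: Optimal Verification in Test-Time Scaling

Ardit Krasniqi, Luan Vejsiu, Elira Dervishi

Affiliations: European University of Tirana

AI summary: Proposes GRACE, a theoretical framework that models verification granularity as a function of problem difficulty, verifier accuracy, and compute budget. It proves a phase transition: fine-grained verification dominates when the compute budget is large or the problem is hard, coarse-grained verification is better for easy problems at low budgets, and an adaptive strategy can reach the compute-performance Pareto frontier.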

Abstract

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the verifier, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: what is the optimal granularity of verification under a given compute budget? Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called GRACE (Granularity-Regulated Adaptive Computational Efficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-N, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1% accuracy at matched compute.
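
The two granularity extremes named in the abstract can be contrasted in a minimal sketch. This is not GRACE itself: it only shows how outcome-level (ORM-style) and step-level (PRM-style) selection differ in verifier calls. The function names, the toy scores, and the choice to rate a candidate by its weakest step are all assumptions made for illustration.

```python
def best_of_n_outcome(candidates, score_outcome):
    """Coarse granularity: one verifier call per candidate, on the final step only."""
    calls = len(candidates)
    best = max(candidates, key=lambda steps: score_outcome(steps[-1]))
    return best, calls


def best_of_n_process(candidates, score_step):
    """Fine granularity: one verifier call per step.

    Rating a candidate by its weakest step is an assumed aggregation rule.
    """
    calls = sum(len(steps) for steps in candidates)
    best = max(candidates, key=lambda steps: min(score_step(s) for s in steps))
    return best, calls


# Toy candidates: each is a list of reasoning steps ending in an answer.
candidates = [["a1", "a2", "ans=4"], ["b1", "ans=5"]]
# Hypothetical verifier scores; the same table serves both verifiers here.
step_scores = {"a1": 0.9, "a2": 0.2, "ans=4": 0.9, "b1": 0.8, "ans=5": 0.6}

coarse_pick, coarse_calls = best_of_n_outcome(candidates, step_scores.get)
fine_pick, fine_calls = best_of_n_process(candidates, step_scores.get)
```

In this toy case the coarse verifier spends 2 calls and keeps the candidate with a weak intermediate step, while the fine verifier spends 5 calls and rejects it, which is the compute-versus-accuracy trade-off the abstract analyzes.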

2606.19625 2026-06-19 cs.CL cs.LG New submission

Where Does Social Reasoning Come From? Capability Provenance in Language Models

Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla, Louis Jaburi, Alvin Deng, Taywon Min, Lucia Quirke, Stella Biderman, Mark Riedl

Affiliations: Georgia Institute of Technology, College of Computing; MATS Program; EleutherAI; KAIST AI; Georgia Tech AI Safety Initiative

AI summary: Using training-data attribution, finds that social reasoning and STEM reasoning in OLMo3-7B depend on different regions of the pretraining corpus, and that the difference is more pronounced at the reasoning level than at the knowledge level.

Comments Under review at COLM 2026 (Conference)

Abstract

We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.
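
The step of aggregating noisy document-level scores into taxonomy bins can be sketched as follows. The sketch assumes each document carries a (format, topic) label and a scalar influence score; the function name `aggregate_by_bin`, the use of the mean as the aggregation statistic, and the toy labels are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def aggregate_by_bin(docs):
    """Average document-level influence scores within each (format, topic) bin.

    `docs` is an iterable of (format_label, topic_label, score) tuples.
    Averaging is an assumption; other aggregation statistics are possible.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for fmt, topic, score in docs:
        sums[(fmt, topic)] += score
        counts[(fmt, topic)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy data: per-document scores collapse to one value per bin.
docs = [
    ("fiction", "literature", 0.5),
    ("fiction", "literature", 1.5),
    ("tutorial", "science", -1.0),
]
bin_influence = aggregate_by_bin(docs)
```

With the full 24 x 24 taxonomy the result would be a 576-entry influence table per benchmark, which could then be contrasted across benchmark pairs.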

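2606.19668 2026-06-19 cs.CL New submission

Code-Switching Reveals Language Anchoring in Multilingual LLMs

Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee

Affiliations: Chung-Ang University; Adobe Research

AI summary: Diagnoses language anchoring in multilingual LLMs using grammar-forced code-switching, proposes the Anchor Bias metric, and designs the CANVAS intervention, which mitigates the QA performance degradation caused by code-switching.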

Comments 36 pages, 13 figures, 27 tables

Abstract

Multilingual Large Language Models (MLLMs) are increasingly expected to handle Code-Switched (CS) inputs, yet mixing languages frequently degrades performance relative to source- or target-language monolingual counterparts. To understand this degradation, we use grammar-forced CS as a controlled diagnostic setting for locating CS representations relative to their source and target counterparts. We introduce Anchor Bias, a geometric measure that quantifies language anchoring, whether a CS hidden state aligns closer to its source or target language counterpart. Across diverse MLLMs, Anchor Bias reveals a consistent grammar-frame effect: source-framed CS stays source-anchored, whereas target-framed CS shifts target-ward and shows larger Question Answering (QA) degradation. Motivated by this representational pattern, we propose CANVAS (Contextual Anchor-based Neural Vector Alignment Steering), an inference-time intervention that extracts a source-side canvas from the input and softly steers target-language hidden states toward the source anchor during prefill. CANVAS consistently recovers QA F1 across MLLMs and CS conditions, showing that internal anchoring signals provide an actionable target for mitigating CS inference failures.
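
One way to picture a geometric anchoring measure of this kind is as a difference of cosine similarities. The abstract does not give the exact definition of Anchor Bias, so the formula, the sign convention, and the function names below are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_bias(h_cs, h_src, h_tgt):
    """Assumed convention: positive means source-anchored, negative target-anchored."""
    return cosine(h_cs, h_src) - cosine(h_cs, h_tgt)


# Toy 2-D hidden states for the monolingual source and target inputs.
h_src = [1.0, 0.0]
h_tgt = [0.0, 1.0]

source_leaning = anchor_bias([0.9, 0.1], h_src, h_tgt)
target_leaning = anchor_bias([0.2, 0.8], h_src, h_tgt)
balanced = anchor_bias([1.0, 1.0], h_src, h_tgt)
```

Under this convention, a source-framed code-switched input would be expected to score above zero and a target-framed one below zero.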

2606.19744 2026-06-19 cs.CL cs.AI cs.HC New submission

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim

Affiliations: Network Analysis and Social Influence Modelling (NASIM) Lab; School of Physics Maths and Computing, The University of Western Australia; School of Psychological Science, The University of Western Australia; School of Computing, Macquarie University

AI summary: Studies sequential DPO across different preference settings and finds that forgetting is not uniform but depends on objective relationship, signal strength, and training order; suggests that future alignment pipelines account for objective compatibility.

Comments Submitted to EMNLP 2026

Abstract

Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation (DPO), but it remains unclear whether later training uniformly degrades preferences learned earlier or whether the effect depends on the relationship between objectives. We study sequential DPO across four preference settings covering distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives. Using Llama-3.1-8B-Instruct with LoRA adapters, we evaluate all objectives after every stage with a fixed base-model reference. We find that sequential DPO does not produce a single forgetting pattern; preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer depending on objective relationship, signal strength, and training order. Pair-level analysis using length-normalised policy margins shows that aggregate metrics can mask heterogeneous changes across preference pairs, whereas quartile decomposition reveals that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage 2 gradients and adapter updates are near-orthogonal to the previous objective across all settings, providing little evidence that direct gradient opposition is the primary driver. These findings suggest that future sequential alignment pipelines should account for objective compatibility and signal strength, rather than assuming that later objectives affect earlier preferences uniformly.
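
A length-normalised margin for one preference pair can be sketched as below. It assumes the margin is the mean per-token log-probability of the chosen response minus that of the rejected response; the abstract does not spell out the formula, so this definition and the function name are assumptions.

```python
def length_normalised_margin(chosen_logprobs, rejected_logprobs):
    """Mean per-token log-prob of the chosen response minus that of the rejected one."""
    chosen_avg = sum(chosen_logprobs) / len(chosen_logprobs)
    rejected_avg = sum(rejected_logprobs) / len(rejected_logprobs)
    return chosen_avg - rejected_avg


# Toy token-level log-probabilities under the policy.
chosen = [-0.5, -0.5, -1.0, -1.0]  # 4 tokens, mean -0.75
rejected = [-1.0, -2.0]            # 2 tokens, mean -1.5

margin = length_normalised_margin(chosen, rejected)
unnormalised_margin = sum(chosen) - sum(rejected)
```

Here the summed log-probabilities are equal, so an unnormalised margin would be zero, while the length-normalised margin favours the chosen response. This is the kind of length effect that normalisation is meant to remove.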

2606.19815 2026-06-19 cs.CL New submission

Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang, Ying Liu, Michael Lepech

Affiliations: Independent Researcher; University of California, Irvine; University of the Chinese Academy of Sciences

AI summary: Proposes a semantic pre-training framework that clusters text with K-means or Top2Vec and pre-trains a Tsetlin Machine on the resulting cluster-sample pairs so that it learns interpretable semantic keywords; across five datasets it outperforms vanilla and embedding-based TMs and is competitive with BERT.

Abstract

Pre-trained language models such as BERT achieve strong text classification performance but lack transparency, limiting their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable, clause-based reasoning but captures little semantic information, and prior attempts to bridge the two rely on static word embeddings that miss contextual meaning. We propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs pre-train a non-negated TM with enhanced Type I feedback. The TM thereby learns interpretable semantic keywords that are fine-tuned on downstream tasks. Across five datasets, our method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.

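2606.19831 2026-06-19 cs.CL cs.LG New submission

Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

Hongliang Liu

Affiliations: Palo Alto Networks

AI summary: Proposes a budget-normalized control-window framework that uses a coherence budget, defined as the ratio of the residual norm to the write norm, to predict when a single-neuron intervention yields coherent behavioral control; the predictions are validated on 15 held-out neurons.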

Abstract

Aligned language models gate behaviors such as refusal and language routing through sparse feed forward neurons, yet no theory predicts when a single neuron intervention controls a behavior coherently rather than collapsing the output. We develop a budget normalized control window framework for single neuron steering. A dose along one write direction reduces to one control coordinate: the alignment between the residual stream and the write, driven along a universal saturation curve in units of a coherence budget set by the residual norm divided by the write norm. Coherent control exists when a behavior trigger lies below the collapse ceiling. The same coordinate governs benign mode switches and refusal; the ceiling follows from weights and one generic forward pass, while triggers are measured at rollout. On fifteen held out neurons, the predicted ceiling has mean absolute error 0.14, about 0.07 in bulk layers, and the committed open or closed verdict holds on eleven against a ten of fifteen majority baseline. Closed cases expose three failure modes rather than violations: collapse before trigger, too little depth to propagate, or a normalization that caps how far one neuron can push. The law explains why local gradient attribution anti predicts control: true controllers write off the readout axis and carry a near zero first order gradient. A forward only contrastive screen made precise by the window recovers controllers that attribution misses. On refusal, the hardest case, intervention success is typed, not scalar: coherent bypass and strict actionable reach separate, so a neuron can flip refusal in fluent, on task text with no actionable content, and genuine actionable reach appears only for three of six audited Llama pivots and only at later rollout horizons. Single neuron steering is therefore a budgeted, typed audit of controllability rather than a fixed dose anecdote.

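2606.19857 2026-06-19 cs.CL cs.AI New submission

Large Language Models Do Not Always Need Readable Language

Jiayi Zhu, Haoxuan Peng, Junxi Wang, Liang Ke, Chen Zhang, Linfeng Zhang

AI summary: Proposes BabelTele, a representation that encodes semantics as compact, non-standard text, sacrificing human readability while remaining recoverable by LLMs. Experiments show that text can be compressed to 27.9% of its original length while retaining 99.5% semantic fidelity, reducing context overhead.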

Comments 23 pages, 10 figures. Preprint

Abstract

Large language models (LLMs) are commonly prompted and interfaced with human-readable natural language, even when the intended reader is another model. This paper investigates whether semantic information can be encoded in compact, non-standard textual forms that sacrifice human readability while remaining recoverable by LLMs. We refer to this class of model-centric textual representations as BabelTele, approached here not as a fixed protocol but as an empirical probe into LLMs' capacity to generate and interpret such representations. Through readability diagnostics, model likelihood measures, human questionnaires, and downstream task evaluations, we find that BabelTele can substantially depart from ordinary natural language while preserving core semantics for instruction-tuned LLMs. As a task-agnostic representational paradigm, BabelTele demonstrates high information density, maintaining 99.5% semantic fidelity even when the text volume is condensed to 27.9% of its original length. We further evaluate its semantic robustness in cross-model transfer, agent memory, and multi-agent communication. Results suggest that BabelTele can reduce context overhead while generally maintaining reliable downstream performance, although its effectiveness depends on the compressor-reader pair and task setting. These findings indicate that human readability, natural-language typicality, and model-side semantic recoverability can be partially decoupled, opening a path toward model-native representations in future exploration of LLM systems.

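2606.19946 2026-06-19 cs.CL cs.LG New submission

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

Yu Deng

AI summary: Proposes GEMS, which addresses distributional deviation and directional interference in training-free multi-directional activation steering through geometric constraints: norm-preserving weighted superposition and targeted attention-pathway injection for the former, and real-time orthogonalization for the latter. On GSM8K it preserves 98% accuracy.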

Comments 30 pages, 5 figures, 20 tables. Code and logs are available at: https://github.com/LuLu663939/gems-multi-semantic-steering

Abstract

Activation steering controls model behavior by modifying intermediate hidden states at inference time without retraining. Existing methods handle only single-direction injection; when multiple semantic directions are superposed without constraints, the model collapses. We show that this collapse decomposes into two independently acting sources: distributional deviation, where additive perturbations accumulate in norm across layers and drive activations outside the training distribution, and directional interference, where non-orthogonal semantic vectors mutually dampen when superposed. These two sources define the design constraints that any training-free multi-directional intervention must address. As one instantiation of these principles, we propose GEMS, a training-free method that maps each source to a corresponding geometric constraint: norm-preserving weighted superposition and targeted attention-pathway injection for distributional deviation, and real-time orthogonalization for directional interference. On GSM8K, injecting three concurrent non-mathematical directions preserves accuracy at 98% (baseline 92%), while unconstrained addition collapses to 4%; on Wikitext-2, the same injection incurs only 2.2% PPL increase. Component ablation isolates the causal role of each constraint, and layer-level probes confirm that orthogonalized signals survive the FFN pathway and reach the output distribution with semantic specificity. Qualitative steering effects transfer across architectures from 3B to 31B.
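
Two of the geometric constraints named above, orthogonalization of the steering directions and norm-preserving superposition, can be sketched in plain Python. This is an illustrative sketch and not the released GEMS code: Gram-Schmidt as the orthogonalization procedure, rescaling to the original norm as the norm-preservation rule, and all function names are assumptions. The attention-pathway injection is not modelled.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def orthogonalize(directions):
    """Gram-Schmidt: strip from each direction its components along earlier ones."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            coeff = dot(r, b)
            r = [x - coeff * y for x, y in zip(r, b)]
        n = norm(r)
        if n > 1e-12:  # drop directions that are linearly dependent on earlier ones
            basis.append([x / n for x in r])
    return basis


def steer_norm_preserving(hidden, directions, weights):
    """Add weighted directions to `hidden`, then rescale to the original norm."""
    steered = list(hidden)
    for d, w in zip(directions, weights):
        steered = [h + w * x for h, x in zip(steered, d)]
    scale = norm(hidden) / norm(steered)
    return [scale * x for x in steered]


hidden = [3.0, 4.0, 0.0]  # toy hidden state with norm 5
raw_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]

basis = orthogonalize(raw_directions)
steered = steer_norm_preserving(hidden, basis, [0.5, 0.5, 0.5])
```

After steering, the hidden state points in a new direction but keeps its original norm, and the three injected directions no longer overlap.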

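2606.20089 2026-06-19 cs.CL cs.AI New submission

How Linear Are Transformer Feed-Forward Blocks? Per-Block Linear Recoverability Is Learned, Not Architectural

Stuart Whipp

Affiliations: Independent Research

AI summary: Using an exact least-squares linear approximation, measures the linear recoverability of each feed-forward block in trained Transformers and finds it highly heterogeneous and non-monotone with depth: a learned property rather than an architectural one, and usable for compression and diagnostics.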

Comments 14 pages, 5 figures

Abstract

Transformer feed-forward networks (FFNs) are often treated as nonlinear stores of computation, yet how nonlinear a trained FFN block actually is has rarely been measured. We treat each FFN as a position-wise input-to-output map and split it into the exact least-squares linear approximation plus a residual. The held-out variance the closed-form linear map explains defines a block's linear recoverability (R^2_lin), an optimiser-free measure of its linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterogeneous and non-monotone with depth, ranging from near-linear (>0.99) to strongly nonlinear (<0.3) between adjacent blocks, and is not set by the activation function: same-width GELU models GPT-2 and Pythia-160m have sharply different profiles, so recoverability is a learned property of individual trained blocks, not an architectural one. A low-rank bilinear probe of the residual recovers only a few points of R^2, with gain uncorrelated with residual nonlinearity: the unrecovered computation is not a single position-wise product but higher-order or distributed structure. The measurement also serves as a targeted compression signal: recoverable blocks admit large single-layer replacements (GPT-2's early FFN at 8x fewer parameters for +0.77 perplexity), while low-recoverability blocks flag where this is unsafe. It further exposes a methodological pitfall: trained linear baselines can badly under-converge on ill-conditioned transformer activations, so we report the exact closed-form least-squares ceiling throughout.
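
The idea of scoring a map by the held-out variance that a closed-form linear fit explains can be shown on scalar functions. This is a deliberately simplified sketch: real FFN blocks map vectors to vectors, whereas the code below fits y ≈ a·x + b for scalars. The function names and the toy maps are assumptions.

```python
import math


def fit_linear(xs, ys):
    """Closed-form least-squares fit y ~ a*x + b (no iterative optimiser)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def r2_held_out(a, b, xs, ys):
    """Fraction of held-out variance explained by the fitted line."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def linear_recoverability(f, train_xs, test_xs):
    """Fit on the training inputs, score on the held-out inputs."""
    a, b = fit_linear(train_xs, [f(x) for x in train_xs])
    return r2_held_out(a, b, test_xs, [f(x) for x in test_xs])


train_xs = [i / 10 for i in range(-20, 21, 2)]  # even grid on [-2, 2]
test_xs = [i / 10 for i in range(-19, 20, 2)]   # interleaved held-out points

r2_affine = linear_recoverability(lambda x: 2.0 * x + 1.0, train_xs, test_xs)
r2_tanh = linear_recoverability(math.tanh, train_xs, test_xs)
r2_square = linear_recoverability(lambda x: x * x, train_xs, test_xs)
```

An affine map is fully recoverable, `tanh` on this range is mostly recoverable, and `x * x` on a symmetric range is essentially unrecoverable by a linear fit, mirroring the near-linear versus strongly nonlinear blocks described in the abstract.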

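2606.19475 2026-06-19 cs.AI cs.CL Cross-list

Diffusion Language Models: An Experimental Analysis

Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi

Affiliations: University of Modena and Reggio Emilia; University of Pisa

AI summary: Systematically compares eight diffusion language models on tasks including reasoning, coding, and translation, analyzes how inference-time factors such as the number of denoising steps and context length affect performance and efficiency, and highlights the trade-offs of diffusion language models across tasks and budgets.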

Abstract

Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.

2606.19697 2026-06-19 cs.LG cs.AI cs.CL Cross-list

Efficiently Representing Algorithms With Chain-of-Thought Transformers

Yanhong Li, Anej Svete, Ashish Sabharwal, William Merrill

Affiliations: Allen Institute for AI; ETH Zürich

AI summary: Proves that chain-of-thought Transformers can efficiently simulate Word RAM algorithms, including sorting and Dijkstra's algorithm, with poly-logarithmic overhead, in contrast to the quadratic overhead of known Turing machine simulations.

Abstract

The increasing popularity of reasoning models (language models that output a series of reasoning or thought tokens before producing an answer) is justified, in part, by theoretical results showing that chain-of-thought (CoT) transformers can simulate Turing machines, and thus perform arbitrary computation. However, the Turing machine, while suitable for complexity-theoretic analysis, is not convenient, intuitive, or efficient for discussing algorithms. Algorithms are typically designed and analyzed at a higher level of abstraction, captured by the Word RAM model with random-access memory and unit-cost operations on $O(\log n)$-bit words. As a result, Word RAM algorithms can be substantially more efficient than their Turing machine counterparts, raising the question: can CoT transformers efficiently simulate Word RAM algorithms? For instance, can they sort $n$ items in $O(n \log n)$ steps or run Dijkstra's algorithm in $O(E + V \log V)$ steps? We answer affirmatively, up to poly-logarithmic overhead. We first establish this for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then strengthen the result to two more practical settings with finite width and log-precision: continuous CoT, where reasoning takes the form of vectors rather than tokens, and a hybrid architecture in which transformer layers sit atop a recurrent (linear RNN) layer. In all three cases, we find that CoT can efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in $n$. This overhead reduces to log-square when the Word RAM has a "flat" instruction set, and only logarithmic for multiplication-free flat instructions, in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.

2606.19750 2026-06-19 cs.LG cs.AI cs.CL Cross-list

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Darrien McKenzie, Nicklas Hansen, Xiaolong Wang

Affiliations: University of California, San Diego

AI summary: Proposes the Bayesian Manifold Curriculum (BMC) framework, which models problem sampling as a manifold-structured bandit problem and guides sampling with a hierarchical task tree and Bayesian learning, balancing learning signal, diversity, and utility.

Comments Webpage: https://darrienmckenzie.com/manifold-bandits/

Abstract

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.

2606.19808 2026-06-19 cs.AI cs.CL Cross-list

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

Sajib Acharjee Dip, Dawei Zhou, Liqing Zhang

Affiliations: Department of Computer Science, Virginia Tech; Fralin Biomedical Research Institute, Virginia Tech; FBRI Cancer Research Center

AI summary: Proposes SEVRA, a selective verification framework in which a serving-layer controller decides whether to verify a frozen solver's initial answer; on Math500 it reaches higher accuracy than always verifying, with fewer tokens and fewer harmful flips.

Abstract

Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes. We study this as a deployment allocation problem rather than a new-verifier problem. We introduce SEVRA, Selective Verification for Reasoning Allocation, a serving-layer controller that decides whether to preserve a frozen solver's initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log intervention outcomes and train recoverability-aware gates from serving-visible attempt state. On Math500, selective verification reaches 76.3% accuracy, compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. However, an 8,192-token initial solve reaches 76.0% accuracy with 28% fewer total model tokens, showing that selective recovery is useful but not the best tested cost frontier. In frozen transfer to GSM, the selective policy verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and reduces verification tokens by 91.2% relative to always verifying; again, a longer initial solve matches its accuracy with fewer realized tokens. On CommonsenseQA, always-on verification hurts, while Self-Consistency@5 improves accuracy at about five times the realized token cost. The resulting deployment rule is: tune the initial budget first, then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.
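
The keep-or-verify decision made by a serving-layer controller can be sketched as a thresholded gate. The abstract does not describe the gate's features or decision rule, so the `gate_score` field, the threshold, and the stub verifier below are hypothetical stand-ins rather than SEVRA's actual design.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Attempt:
    answer: str
    # Hypothetical gate output: estimated chance that verification would
    # repair this attempt rather than waste tokens or flip a correct answer.
    gate_score: float


def serve(attempt: Attempt, verify: Callable[[str], str],
          threshold: float = 0.5) -> Tuple[str, bool]:
    """Return (final_answer, was_verified)."""
    if attempt.gate_score < threshold:
        return attempt.answer, False  # keep the initial answer, spend no extra tokens
    return verify(attempt.answer), True


def toy_verifier(answer: str) -> str:
    """Stub standing in for an active verification pass."""
    return "42" if answer == "41" else answer


kept = serve(Attempt("7", 0.1), toy_verifier)
checked = serve(Attempt("41", 0.9), toy_verifier)
```

Only the second attempt triggers the verifier, so verification tokens are spent where the gate expects a repair to be likely.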

2606.20075 2026-06-19 cs.LG cs.CL 交叉投稿

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

什么使得潜在思维链中的监督有效：一种信息论分析

Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li, Xiaoyu Shen

发表机构 * Ningbo Institute of Digital Twin, Eastern Institute of Technology（宁波数字孪生研究院，东方理工大学）； Department of Computing, The Hong Kong Polytechnic University（香港理工大学计算学系）

AI总结本文从信息论角度分析潜在思维链中的监督失效问题，提出轨迹监督和空间监督两个维度，并引入统一潜在探针（ULP）量化信息保真度，揭示了信息-性能绑定关系。

AI中文摘要

潜在思维链（Latent Chain-of-Thought, CoT）将推理内化到连续隐藏状态中，为冗长的离散推理轨迹提供了一种有前景的替代方案。然而，鲁棒的潜在推理仍然困难，因为结果监督提供的学习信号较弱，且容易导致潜在轨迹发生语义漂移。在这项工作中，我们从信息论角度分析潜在CoT，并将这种失效识别为双重崩溃：优化路径上的梯度衰减和潜在空间中的表征漂移。我们进一步将过程监督分解为两个互补维度：轨迹监督（注入密集的逐步推理信号）和空间监督（保持潜在流形的语义结构）。我们的分析表明，刚性几何压缩可能坍缩推理空间，而生成式重建提供了更灵活的语义锚点，更好地保留了信息容量。为了衡量这些效应，我们引入了统一潜在探针（Unified Latent Probe, ULP），用于量化潜在轨迹与显式推理步骤之间的互信息。实验揭示了清晰的信息-性能绑定关系：推理准确性取决于潜在链中保留的信息保真度。这些发现为潜在推理监督提供了一个原则性框架，并建议从几何模仿转向互信息最大化。我们的代码可在 https://github.com/EIT-NLP/Supervision-in-Latent-CoT 获取。

英文摘要

Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at \href{https://github.com/EIT-NLP/Supervision-in-Latent-CoT}{this repository}.

2606.20295 2026-06-19 cs.SE cs.CL 交叉投稿

Token-Operations-Oriented Inference Optimization Techniques for Large Models

面向令牌操作的大模型推理优化技术

Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua, Yutong Liu, Jiangze Yan, Xin Wang, Cong Wang, Yilin Zhang, Yi Shen, Jieyun Huang, Fang Zhao, Huanlin Gao, Ping Chen, Xinyu Yang, Kaikai Zhao, Yao Zhao, Xinggang Wang, Huishuai Zhang, Dongyan Zhao, Junping Du, Tao Chen, Xiang Gao, Qinghuai Ma

AI总结本文提出多模型融合、模型优化、计算-模型融合、计算-网络-模型融合四层技术架构，系统综述各层关键技术及产业现状，旨在降低令牌成本、提升服务效率、保障供应稳定性，推动大模型服务从可调用到可运营的转变。

Comments 62 pages, 36 figures

2510.18383 2026-06-19 cs.CL cs.AI 版本更新

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

MENTOR: 通过灵活的教师优化奖励进行工具使用蒸馏的强化学习

ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim

发表机构 * Seoul National University of Science and Technology（首尔科学技术大学）； Korea Advanced Institute of Science and Technology（韩国科学技术院）； LG CNS

AI总结提出MENTOR方法，通过灵活的教师优化奖励结构，平衡行为对齐与下游性能，提升小模型在工具使用任务中的域外泛化能力。

AI中文摘要

将大型语言模型（LLMs）的工具使用能力蒸馏到小型语言模型（SLMs）中对其实际应用至关重要。主要方法监督微调（SFT）由于与静态教师轨迹的刚性对齐，导致域外（OOD）泛化性能较差。虽然强化学习（RL）提供了一种替代方案，但SLMs的能力限制带来了严峻的困境：稀疏的结果奖励提供的指导不足，而严格的轨迹匹配施加了过于严格的约束。为了弥合这一能力驱动的差距，我们提出了MENTOR，它引入了一种灵活且过程感知的奖励结构。MENTOR不强制执行刚性复制，而是利用教师的参考来指导工具使用行为，平衡行为对齐与下游性能。在可控可执行工具基准上的大量实验表明，与SFT和严格RL基线相比，MENTOR提高了OOD工具使用性能。我们的研究结果表明，在可验证的工具使用环境中，灵活的工具使用对齐比严格的轨迹复制为开发适应性小模型提供了更有效的方法。

英文摘要

Distilling the tool-use capabilities of large language models (LLMs) into small language models (SLMs) is essential for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor out-of-domain (OOD) generalization due to its rigid alignment with static teacher trajectories. While reinforcement learning (RL) offers an alternative, the capacity limitations of SLMs pose a severe dilemma: sparse outcome rewards provide insufficient guidance, whereas strict trajectory matching imposes overly restrictive constraints. To bridge this capacity-driven gap, we propose MENTOR, which introduces a flexible yet process-aware reward structure. Instead of enforcing rigid replication, MENTOR uses the teacher's reference to guide tool-use behavior, balancing behavioral alignment with downstream performance. Extensive experiments on controlled executable-tool benchmarks demonstrate that MENTOR improves OOD tool-use performance compared to SFT and strict RL baselines. Our findings suggest that within verifiable tool-use environments, flexible tool-use alignment offers a more effective approach than strict trajectory replication for developing adaptable small models.

2603.25702 2026-06-19 cs.CL 版本更新

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

S2D2：通过免训练自我推测实现扩散LLM的快速解码

Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava

发表机构 * Red Hat AI Innovation（红帽AI创新）； MIT-IBM Watson AI Lab（MIT-IBM沃森人工智能实验室）； Iowa State University（爱荷华州立大学）； Core AI, IBM（IBM核心AI）

AI总结提出S2D2，一种免训练的自我推测解码框架，通过将块扩散模型在块大小为1时变为自回归模型，实现草稿与验证角色复用，在不增加训练或测试计算下提升解码速度与准确性。

Comments Code is available at https://github.com/phymhan/S2D2

AI中文摘要

块扩散语言模型通过结合块级自回归解码与块内并行去噪，为超越自回归生成提供了一条有前景的路径。然而，在实际加速所需的少步数场景中，标准的置信度阈值解码往往脆弱：激进的阈值损害质量，而保守的阈值则需要不必要的去噪步骤。现有解决此问题的方法要么需要额外训练，要么增加测试时计算。我们提出S2D2，一种用于块扩散语言模型的免训练自我推测解码框架。我们的关键观察是，当块大小减小到1时，块扩散模型变为自回归模型，从而允许相同的预训练模型同时充当草稿模型和验证模型。S2D2在标准块扩散解码中插入一个推测验证步骤，并使用轻量级路由策略来决定何时验证值得其成本。这产生了一种混合解码轨迹，其中扩散并行提出令牌，而自回归模式充当局部序列级评判器。在三个主流块扩散家族中，S2D2在准确性-速度权衡上持续优于强置信度阈值基线。在SDAR上，我们观察到相比自回归解码高达4.7倍加速，相比调优的动态解码基线高达1.57倍加速，同时准确性提升高达4.5个点。在LLaDA2.1-Mini上，S2D2与内置自校正保持互补，包括在保守设置下比静态基线快4.4倍且准确性略高。

英文摘要

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to $4.7\times$ speedup over autoregressive decoding, and up to $1.57\times$ over a tuned dynamic decoding baseline while improving accuracy by up to $4.5$ points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is $4.4\times$ faster than the static baseline with slightly higher accuracy.
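
The draft-then-verify step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not S2D2's actual procedure: the greedy "accept the longest matching prefix" rule is the standard one from speculative decoding and is assumed here, and `next_token` is a hypothetical stand-in for the same model run autoregressively (block size 1). The routing policy that decides when to verify is not modeled.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy speculative verification (an assumed acceptance rule).

    Draft tokens are kept while they match the autoregressive choice.
    At the first mismatch the autoregressive token replaces the draft
    token and the rest of the draft is dropped.
    """
    accepted: List[int] = []
    for proposed in draft:
        expected = next_token(list(prefix) + accepted)
        if proposed != expected:
            accepted.append(expected)  # correction supplied by the verifier
            break
        accepted.append(proposed)
    return accepted
```

The loop calls `next_token` once per position for readability; a real verifier would typically score all draft positions in a single forward pass.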

2605.16865 2026-06-19 cs.CL 版本更新

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

MixSD: 混合上下文自蒸馏用于知识注入

Jiarui Liu, Lechen Zhang, Yongjin Yang, Yinghui He, Yingheng Wang, Weihao Xuan, Zhijing Jin, Mona Diab

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Jinesis Lab, University of Toronto & Vector Institute（Jinesis实验室，多伦多大学及向量研究所）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Princeton University（普林斯顿大学）； Cornell University（康奈尔大学）； The University of Tokyo（东京大学）； RIKEN AIP（日本理化学研究所AIP）； Max Planck Institute for Intelligent Systems, Tübingen, Germany（德国图宾根马克斯·普朗克智能系统研究所）； EuroSafeAI

AI总结本文提出MixSD方法，通过混合模型自身条件下的token来实现与模型生成分布对齐的知识注入，从而在保持预训练能力的同时提升事实记忆和推理能力。

AI中文摘要

监督微调（SFT）被广泛用于将新知识注入语言模型，但通常会损害预训练能力，如推理和通用领域性能。我们认为这种遗忘的原因在于，来自人类或外部系统的微调目标偏离了模型的自回归分布，迫使优化器模仿低概率token序列。为了解决这个问题，我们提出了MixSD，一种无需外部教师的简单方法，用于实现分布对齐的知识注入。MixSD不在固定目标上训练，而是通过混合来自基础模型自身两个条件分布的token来动态构建监督：一个是在上下文中观察到被注入事实的专家条件分布，另一个是反映模型原始先验的朴素条件分布。由此得到的监督序列保留了事实学习信号，同时明显更接近基础模型的分布。我们在两个为受控研究事实回忆和算术函数习得而构建的合成语料库上评估MixSD，并辅以已有的开放领域事实问答和知识编辑基准。在多种模型规模和设置下，MixSD在记忆-保留权衡上始终优于SFT和在线策略自蒸馏基线：在保持接近完美的训练准确率的同时，最多可保留基础模型100%的留出（held-out）能力，而标准SFT最低仅保留1%。我们进一步表明，MixSD生成的监督目标在基础模型下具有显著更低的NLL，并减少了沿Fisher敏感参数方向的有害移动。这些结果表明，将监督与模型的原生生成分布对齐，是一条简单而有效的知识注入原则，可缓解灾难性遗忘。

英文摘要

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.

2510.27568 2026-06-19 cs.AI cs.CL 版本更新

SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

SIGMA: 搜索增强的按需知识集成用于智能体数学推理

Ali Asgarov, Umid Suleymanov, Aadyant Khatri

AI总结提出SIGMA框架，通过多智能体独立推理、定向搜索和协调机制，实现上下文敏感的知识集成，在MATH500等基准上提升7.4%的绝对性能。

Comments AAAI 2026 LMReasoning

AI中文摘要

解决数学推理问题不仅需要准确访问相关知识，还需要仔细的多步骤思考。然而，当前的检索增强模型通常依赖单一视角，遵循僵化的搜索策略，并且难以有效结合来自多个来源的信息。我们提出了SIGMA（搜索增强的按需知识集成用于智能体数学推理），这是一个统一框架，通过协调机制编排专门智能体独立推理、执行定向搜索并综合发现。每个智能体生成假设段落以优化其分析视角的检索，确保知识集成既上下文敏感又计算高效。在MATH500、AIME和博士级科学问答GPQA等具有挑战性的基准测试中，SIGMA持续优于开源和闭源系统，实现了7.4%的绝对性能提升。我们的结果表明，多智能体按需知识集成显著提高了推理准确性和效率，为复杂、知识密集型问题解决提供了可扩展的方法。代码将在发表后公开。

英文摘要

Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi-step thinking. However, current retrieval-augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context-sensitive and computation-efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD-level science QA GPQA, SIGMA consistently outperforms both open- and closed-source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi-agent, on-demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge-intensive problem-solving. We will release the code upon publication.

2604.00626 2026-06-19 cs.LG cs.CL 版本更新

A Survey of On-Policy Distillation for Large Language Models

大型语言模型的在线策略蒸馏综述

Mingyang Song, Mao Zheng

发表机构 * Tencent, China（腾讯，中国）

AI总结本文综述了大型语言模型的在线策略蒸馏方法，探讨了蒸馏过程中如何通过反馈减少累积误差，提出了基于f-散度最小化的蒸馏框架，并分析了蒸馏与强化学习之间的联系。

Comments Ongoing Work

基于大语言模型的知识图谱推理中的幻觉检测

Xinyan Zhu, Yaoqi Liu, Yue Gao, Huadong Ma, Cheng Yang, Chuan Shi

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Tsinghua University（清华大学）

AI总结提出LUCID方法，结合LLM注意力分数、知识图谱语义和结构信息，利用图神经网络检测LLM在知识图谱推理中的幻觉，在九个数据集上达到最优性能。

AI中文摘要

知识图谱推理从现有事实中推断新知识，广泛应用于问答、推荐和决策支持。随着大语言模型（LLM）的快速发展，基于LLM的知识图谱推理框架通过利用检索到的知识图谱信息变得越来越流行。然而，LLM中的幻觉仍然是一个关键问题。即使融入了相关的知识图谱知识，模型仍可能生成错误输出，导致错误信息和不可靠的决策。现有的幻觉检测方法要么关注LLM内部状态，要么验证与检索上下文的一致性，但两者都忽略了知识图谱中的结构信息，导致性能次优。为了解决这一差距，我们提出了LUCID，这是首个针对基于LLM的知识图谱推理框架的幻觉检测方法。LUCID联合利用LLM注意力分数、知识图谱语义和结构信息。具体来说，它从注意力分数和语义相似度中提取节点和边特征，并使用图神经网络将其与知识图谱结构集成。我们还构建了人工标注的基准数据集用于评估。在九个数据集上的实验表明，与15个基线相比，LUCID达到了最先进的性能。

英文摘要

Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG information. However, hallucinations in LLMs remain a critical issue. Even when relevant KG knowledge is incorporated, models may still generate incorrect outputs, leading to misinformation and unreliable decisions. Existing hallucination detection methods either focus on LLM internal states or verify consistency with retrieved contexts, but both overlook the structural information in KGs, resulting in suboptimal performance. To address this gap, we propose LUCID, the first halLUcination deteCtIon method for LLM-based knowleDge graph reasoning frameworks. LUCID jointly leverages LLM attention scores, KG semantics, and structural information. Specifically, it extracts node and edge features from attention scores and semantic similarities, and integrates them with KG structure using a graph neural network. We also construct manually annotated benchmark datasets for evaluation. Experiments on nine datasets show that LUCID achieves state of the art performance compared to 15 baselines.

2606.19667 2026-06-19 cs.CL 新提交

提示、规划、提取：用于从临床叙述中提取肺部病理学的零样本智能体LLM工作流

Aman Pathak, Cheng Peng, Mengxian Lyu, Ziyi Chen, Reema Solan, Sankalp Talankar, Yasir Khan, Hiren Mehta, Aokun Chen, Yi Guo, Yonghui Wu

AI总结提出零样本智能体工作流，利用开源大语言模型从肺切除病理报告中提取13个CAP字段，在无训练下达到0.893 Micro-F1，接近监督方法。

Comments 7 pages, 2 figures, 3 tables. Affiliations: (1) Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; (2) Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA; (3) College of Nursing, Florida State University, Tallahassee, FL, USA

AI中文摘要

从病理报告中提取信息对于癌症分期和肿瘤登记库的数据填充至关重要。然而关键数据仍嵌入在叙述性报告中，使得手动提取劳动密集且易出错。传统的监督自然语言处理流程通过完全监督的命名实体识别和关系提取来解决这一问题，但需要昂贵的人工标注，并且当上游实体被遗漏时会出现级联故障。在本研究中，我们开发了一个零样本智能体工作流，并评估了五个开源生成式大语言模型（LLMs），以从肺切除病理报告中填充13个美国病理学家学会（CAP）的概要字段。我们使用一种新颖的、与登记库对齐的评估框架，将它们与最先进的监督GatorTron NER-RE基线进行比较。基线达到了0.960的Micro-F1，而最佳零样本模型（GPT-OSS-20B）达到了0.893的Micro-F1（召回率：0.949），在没有任务特定训练的情况下准确提取了病理分期等复杂关系。这些结果表明，开源零样本智能体LLMs是提取肺部病理信息的低成本解决方案。

英文摘要

Information extraction from pathology reports is essential for cancer staging and tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional supervised Natural Language Processing pipelines address this through fully supervised Named Entity Recognition and Relation Extraction, but require expensive manual annotation and suffer cascading failures when upstream entities are missed. In this study, we developed a zero-shot, agentic workflow and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports. We compared them against a state-of-the-art supervised GatorTron NER-RE baseline using a novel, registry-aligned evaluation framework. The baseline achieved a Micro-F1 of 0.960, while the best zero-shot model (GPT-OSS-20B) achieved a Micro-F1 of 0.893 (recall: 0.949), accurately extracting complex relations like Pathologic Stage without task-specific training. These results suggest that open-source, zero-shot agentic LLMs are a low-cost solution for extracting lung pathology information.

2606.20072 2026-06-19 cs.CL 新提交

Source-Grounded Data Generation for Text-to-JSON Learning

基于源数据的文本到JSON学习数据生成

Sunghee Ahn, Guijin Son, Youngjae Yu

发表机构 * Seoul National University（首尔大学）

AI总结提出STAGE方法，利用电子表格作为源数据，通过LLM生成报告和JSON模式，并验证真实值，显著提升文本到JSON任务的训练数据质量。

Comments Preprint

AI中文摘要

从财务文件到临床记录，传统行业严重依赖冗长、非结构化的文档来存储高价值信息。将这些信息可靠地提取为结构化的、机器可读的表示形式，是使自动化系统能够访问这些内容的关键前提。JSON是这种结构化提取的自然目标，然而构建可靠且可扩展的文本到JSON训练数据仍然具有挑战性。为了解决这一差距，我们提出了STAGE（电子表格基础的文本到JSON工件生成），一种基于源数据的数据生成管道，通过使用LLM进行可扩展合成，同时根据底层电子表格验证真实值，来构建报告和JSON模式。在STAGE-Eval（我们的基于源数据的基准测试，包含851个示例的测试集）上的评估表明，STAGE生成的训练数据优于现有方法。这使Qwen3-4B的精确匹配从31.37%提高到74.27%，值准确率从45.46%提高到90.69%。

英文摘要

From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite to making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schema by using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.
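
The idea of validating ground-truth values against the underlying spreadsheet can be illustrated with a small Python sketch. It is a simplified assumption rather than the STAGE pipeline itself: the helper names `leaves` and `ungrounded` are hypothetical, and matching is done by exact string comparison after stripping whitespace, so `1200` and `1200.0` would not match.

```python
from typing import Any, Iterator, List, Set, Tuple


def leaves(node: Any, path: str = "$") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every scalar leaf of a parsed JSON object."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaves(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from leaves(value, f"{path}[{index}]")
    else:
        yield path, node


def ungrounded(target: Any, sheet: List[List[Any]]) -> List[str]:
    """Return the paths whose values do not appear in any spreadsheet cell."""
    cells: Set[str] = {str(cell).strip() for row in sheet for cell in row}
    return [p for p, v in leaves(target) if str(v).strip() not in cells]
```

A synthesized example whose `ungrounded` list is non-empty contains at least one value the source spreadsheet cannot account for, so it could be rejected or regenerated before being used as training data.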

2606.20113 2026-06-19 cs.CL cs.IR 新提交

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

流式工具使用何时有帮助？表征流式检索增强生成中的工具意图稳定化

Elroy Galbraith

发表机构 * SMG Labs（SMG实验室）

AI总结通过测量工具意图稳定化（即推测查询收敛到答案的时间点），在CRAG基准上分析流式RAG的延迟隐藏效果，发现73.9%的查询可实现显著延迟隐藏，并识别早期与晚期稳定化的预测因素。

AI中文摘要

流式检索增强生成（Streaming RAG）通过在用户输入完成前并行发出工具查询来减少用户感知的延迟。已报告的性能提升都是聚合指标，但该机制的好处本质上取决于查询本身：只有当正确的工具查询在用户停止说话或打字之前变得可确定时，推测才有帮助。我们隔离并测量了这一属性——工具意图稳定化，即输入流中推测查询的检索收敛到包含答案的结果的时间点。在CRAG基准（1371个验证问题）上，我们（i）测量了稳定化的分布，（ii）推导出一个与模型无关的界限H，表示可以隐藏在用户剩余输入背后的工具延迟比例，该比例是工具延迟L和输入节奏δ的函数，（iii）通过一个可运行的流式管道验证了实际节省达到或超过此界限，（iv）识别了哪些查询属性预测早期与晚期稳定化。该研究无需模型训练，在普通CPU硬件上运行。我们发现，在现实操作点（L=600ms，δ=3w/s，θ=0.8）下，整个基准中73.9%的查询实现了显著的延迟隐藏——这是一个混合数字：在21.3%的问题（其中黄金证据以原文形式存在且可被BM25检索）上采用充分性稳定化（在此有利切片上95.2%可流式处理），在其余问题上采用不依赖证据对齐（grounding）的top-1稳定回退。在有利切片上，φ_suf被精确与宽松两种证据对齐方式限定在[0.26, 0.281]之间——两者均属早期。问题类型产生了显著但粗略的早期/晚期划分（Kruskal-Wallis p=0.017, epsilon^2=0.04），可直接用于判断学习得到的推测触发器何时值得其成本。

英文摘要

Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the mechanism's benefit is fundamentally query-intrinsic: speculation can only help when the correct tool query becomes determinable before the user stops speaking or typing. We isolate and measure this property -- tool-intent stabilization, the point in the input stream at which a speculative query's retrieval converges to the answer-bearing result. On the CRAG benchmark (1371 validation questions) we (i) measure the distribution of stabilization, (ii) derive a model-agnostic bound H on the portion of tool latency that can be hidden behind the user's remaining input, as a function of tool latency L and input cadence δ, (iii) validate against a working streaming pipeline that realized savings meet or exceed this bound, and (iv) identify which query properties predict early versus late stabilization. The study requires no model training and runs on commodity CPU hardware. We find that at a realistic operating point (L=600ms, δ=3w/s, θ=0.8), 73.9% of queries across the full benchmark admit substantial latency hiding -- a blended figure that mixes sufficiency stabilization on the 21.3% of questions where gold evidence is verbatim-present and BM25-retrievable (95.2% streamable on this favorable slice) with a grounding-free top-1-settling fallback on the remainder. On the favorable slice, ϕ_suf is bracketed to [0.26, 0.281] by exact and relaxed grounding -- both early. Question type produces a significant but coarse early/late split (Kruskal-Wallis p=0.017, epsilon^2=0.04), directly informing when a learned speculative trigger is worth its cost.
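
To make the latency-hiding idea concrete, the following Python sketch computes how much of a tool call could overlap with the user's remaining input. The formula is an assumed simplification and is not the paper's bound H: it supposes the query stabilizes after a fraction `phi` of its words, the user continues at a constant cadence, and the tool call starts at the moment of stabilization. The function name and parameters are hypothetical.

```python
def hidden_fraction(n_words: int, phi: float, latency_s: float, cadence_wps: float) -> float:
    """Fraction of tool latency overlapped with the user's remaining input.

    Assumed model (not taken from the paper): the query stabilizes after
    a fraction `phi` of its `n_words` words, the user keeps going at
    `cadence_wps` words per second, and the tool call takes `latency_s`.
    """
    remaining_s = (1.0 - phi) * n_words / cadence_wps
    return min(1.0, remaining_s / latency_s)
```

Under this assumed model, at L = 600 ms and a cadence of 3 words per second, 0.6 × 3 = 1.8 words of remaining input are enough to hide the whole call, which gives some intuition for why early stabilization matters most on short queries.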

2606.19782 2026-06-19 cs.AI cs.CL 交叉投稿

超越全局重规划：跨设备智能体系统的分层恢复

Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu, Yuheng Wang, Lin Wu, Yufan Dang, Huatao Li, Chen Qian

发表机构 * School of Artificial Intelligence, Shanghai Jiao Tong University（上海交通大学人工智能学院）； Shanghai Innovation Institute（上海创新研究院）； Southeast University（东南大学）； Tsinghua University（清华大学）

AI总结提出分层重规划框架H-RePlan，通过统一API-CLI-GUI执行和跨层失败抽象，区分设备本地策略恢复与全局重规划，在HeraBench基准上显著提升跨设备任务完成率和指令遵循度。

AI中文摘要

现实世界中的计算机使用任务通常跨越多个应用程序和设备，要求智能体在动态运行时故障下协调异构环境。现有的多设备智能体系统支持任务分解和跨设备分配，但恢复仍然粗粒度：当执行失败时，它们通常重试相同策略、重新分配子任务或修改全局计划，而没有系统地建模设备本地策略空间。这限制了它们区分可在当前设备内修复的故障与需要跨设备重规划的故障的能力。我们提出\textbf{H-RePlan}，一个用于具有统一API-CLI-GUI执行的多设备智能体的分层重规划框架。H-RePlan为每个设备配备可互换的执行策略，并通过紧凑的跨层失败抽象将设备本地策略恢复与编排器级全局重规划分离。为了评估这一能力，我们引入\textbf{HeraBench}，一个故障注入基准，它在Linux和Android设备上构建跨设备工作流，并注入策略级和设备级故障。实验表明，H-RePlan显著优于单策略和粗粒度多设备基线，实现了更高的完成率、指令遵循率和完美通过率，同时降低了可靠端到端成功所需的令牌成本。这些结果表明，范围感知的分层恢复对于鲁棒的多设备智能体执行至关重要。

英文摘要

Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose \textbf{H-RePlan}, a hierarchical replanning framework for multi-device agents with unified API--CLI--GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce \textbf{HeraBench}, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.

URL PDF HTML ☆

赞 0 踩 0

2606.19388 2026-06-19 cs.SE cs.CL cs.HC 交叉投稿

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

超越GUI范式：移动代理是否需要手机屏幕？

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

AI总结本文挑战移动代理的GUI主导范式，提出CLI应同等重要，通过实验证明CLI代理在AndroidWorld和MobileWorld上超越GUI基线，并引入CLI-Advantage任务套件展示其优势。

AI中文摘要

近期移动代理的进展主要由GUI范式主导，其中代理感知UI信息并发出屏幕交互。然而，移动平台也提供了命令行接口（CLI），可直接访问设备服务和数据。我们认为CLI应与GUI同等重要。我们在AndroidWorld和MobileWorld上，使用四种模型API评估了三个编码代理（Claude Code、Terminus-2、mini-swe-agent），未进行任何移动特定后训练，并与三个可复现的GUI基线（GUI-Owl-1.5-32B、MAI-UI、Qwen3-VL-32B）进行比较。Claude Code（Opus 4.7）达到71.8%和51.9%，优于所有可复现的GUI基线（AndroidWorld上69.3/68.1/57.8%；MobileWorld上43.2/26.3/13.3%），而其他CLI配置也保持竞争力。为确立该范式的上限，我们提供了oracle CLI解决方案，在AndroidWorld上达到88.8%（103/116个任务可CLI解决），在MobileWorld上达到86.3%（101/117个任务可CLI解决），表明未来有大量改进空间。为覆盖GUI范围之外的日常用户意图，我们引入了\textbf{CLI-Advantage任务套件}，包含五个类别的45个模板：批量操作、多条件过滤、聚合、跨应用工作流和隐藏设备状态。所有CLI代理在所有五个类别中均优于所有GUI基线，且每个任务步骤显著更少（10.7步 vs. 18.6步）。为支持未来移动CLI代理的研究，我们将开源代理实现、oracle解决方案、CLI-Advantage套件和评估基础设施。

英文摘要

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8\% and 51.9\%, outperforming every reproducible GUI baseline (69.3/68.1/57.8\% on AndroidWorld; 43.2/26.3/13.3\% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8\% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3\% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the \textbf{CLI-Advantage Task Suite}, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs.\ 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.

2606.19501 2026-06-19 cs.AI cs.CL cs.LG q-fin.RM 交叉投稿

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

DeXposure-Claw: 一个用于DeFi风险监管的智能体系统

Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He

发表机构 * University of Edinburgh（爱丁堡大学）； University of Glasgow（格拉斯哥大学）； University of Cambridge（剑桥大学）

AI总结针对DeFi监管中LLM智能体易误报的问题，提出DeXposure-Claw系统，通过图时间序列基础模型预测风险网络，结合确定性监控和置信度门控生成可审计监管票据，并构建六轴评估基准DeXposure-Bench，实验验证有效性。

AI中文摘要

去中心化金融使监管者面临快速变化的网络化信用风险。通用LLM智能体不适合此场景：它们过度解读弱证据并推荐高风险干预，而现有评估无法提供符合监管者需求的误报衡量方式。我们提出DeXposure-Claw，一个基于预测的智能体监管系统，通过结构化证据引导LLM决策：(1) DeXposure-FM，一个图时间序列基础模型，预测未来风险网络；(2) 确定性监控和压力场景将预测转化为类型化警报、归因信号和场景证据；(3) 数据健康和置信度门控在DeXposure-Claw发出带有理由的可审计监管票据前限制升级。我们进一步开发了DeXposure-Bench，一个六轴评估框架，其决策轴根据符合监管者的绝对损失真实情况和显式误干预率对票据评分。在五年每周真实数据上的实验充分支持了我们的系统。代码见 https://github.com/EVIEHub/DeXposure-Claw。

英文摘要

Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure-Claw.

2606.19559 2026-06-19 cs.AI cs.CL 交叉投稿

Uncertainty Decomposition for Clarification Seeking in LLM Agents

LLM代理中寻求澄清的不确定性分解

Gregory Matsnev

发表机构 * AI Talent Hub, ITMO University（AI Talent Hub, ITMO大学）

AI总结提出一种基于提示的不确定性分解方法，将行动置信度与请求不确定性分离，使代理能在任务规范模糊时主动寻求澄清，在五个LLM骨干上平均澄清F1提升36%-73%。

Comments 26 pages, 8 figures. Source code: https://github.com/PE51K/udcs-in-llm-agents

AI中文摘要

最近的立场论文认为，经典的偶然/认知不确定性框架对于交互式大型语言模型（LLM）代理是不够的，并呼吁需要一种对欠规范感知、可分解且可通信的不确定性表示，以解锁新的代理能力，如主动寻求澄清和共享心理模型构建。实际部署约束——黑盒API、交互延迟预算以及缺乏标注轨迹——排除了基于logprob、多采样和基于训练的方法，使得基于提示的估计成为在部署时浮现此类信号的最可行方案。我们通过一种简单的基于提示的分解来响应这一呼吁，该分解将行动置信度与请求不确定性（u）分离，使代理能在任务规范模糊时请求澄清。为了评估它，我们引入了两个增强澄清的基准（WebShop-Clarification和ALFWorld-Clarification），其中50%的任务被故意欠规范，并在这些变体以及用于故障检测的标准WebShop、ALFWorld和REAL基准上，系统地将所提出的分解与ReAct+UE和不确定性感知记忆（UAM）在五个LLM骨干（GPT-5.1、DeepSeek-v3.2-exp、GLM-4.7、Qwen3.5-35B、GPT-OSS-120B）上进行比较。在五个骨干上平均，所提出的分解在ALFWorld-Clarification上比ReAct+UE提高了73%的澄清F1，比UAM提高了36%，并且在WebShop-Clarification的每个骨干以及ALFWorld-Clarification的五个骨干中的四个上领先澄清F1，表明增益超越了单个LLM。

英文摘要

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.

2606.19911 2026-06-19 cs.AI cs.CL cs.IR 交叉投稿

Multi-Agent Transactive Memory

多智能体交互记忆

To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； University of California, Berkeley（加州大学伯克利分校）

AI总结提出MATM框架，通过共享存储和检索智能体轨迹，实现异构智能体群体间的知识复用，提升下游任务性能并减少交互步骤。

详情

AI中文摘要

具有不同能力的LLM智能体在多样化任务中的去中心化部署，激发了跨异构智能体群体知识共享的基础设施需求。正如搜索引擎索引人类生成的工件以支持人类问题解决，检索系统可以组织智能体生成的工件以供跨智能体群体重用。我们将检索增强生成（展示了人类创作工件对单个智能体的价值）扩展到检索智能体生成的工件以支持智能体群体。特别是，智能体轨迹编码了可重用的程序性知识，然而这些工件通常在一次使用后被丢弃或仅由产生智能体保留，迫使新实例化的智能体反复重新发现现有解决方案。我们提出了多智能体交互记忆（MATM），一个用于群体级存储和检索智能体生成轨迹的框架，其中生产者智能体将轨迹贡献到共享仓库，消费者智能体检索它们以改进任务执行。我们专注于交互环境（ALFWorld和WebArena），其中轨迹较长且编码了特别丰富的程序性结构。我们的实验表明，从MATM检索轨迹可提高下游任务性能并减少交互步骤，无需协调或联合训练。这些结果将MATM定位为开放智能体生态系统中群体级经验共享的设计模式。

英文摘要

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.
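The producer/consumer pattern described above can be illustrated with a small sketch. This is not the MATM implementation: the class `SharedTrajectoryStore`, the `Trajectory` fields, and the bag-of-words similarity are all hypothetical stand-ins chosen for brevity (a real system would presumably use a learned retriever).

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    producer: str      # id of the agent that produced the trajectory
    task: str          # natural-language task description
    steps: list[str]   # ordered actions taken in the environment


def _bow(text: str) -> Counter:
    """Lower-cased bag-of-words vector (toy stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class SharedTrajectoryStore:
    """Hypothetical population-level repository: any agent may write or read."""
    items: list[Trajectory] = field(default_factory=list)

    def contribute(self, traj: Trajectory) -> None:
        self.items.append(traj)

    def retrieve(self, task: str, k: int = 1) -> list[Trajectory]:
        # Rank stored trajectories by similarity of their task to the query.
        query = _bow(task)
        ranked = sorted(self.items,
                        key=lambda t: _cosine(query, _bow(t.task)),
                        reverse=True)
        return ranked[:k]


store = SharedTrajectoryStore()
store.contribute(Trajectory("agent-A", "put a clean mug on the shelf",
                            ["take mug", "go to sink", "clean mug",
                             "go to shelf", "put mug"]))
store.contribute(Trajectory("agent-B", "heat an egg and put it on the table",
                            ["take egg", "go to microwave", "heat egg",
                             "go to table", "put egg"]))

# A different (consumer) agent retrieves the closest prior trajectory
# and could place its steps in context before acting.
hits = store.retrieve("put a clean plate on the shelf", k=1)
print(hits[0].producer, hits[0].steps)
```

The consumer never coordinates with the producer; it only reads from the shared store, which matches the "without coordination or joint training" setting in the abstract.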


2606.20002 2026-06-19 cs.LG cs.AI cs.CL 交叉投稿

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Connect the Dots：通过强化学习训练具备跨域泛化能力的长期生命周期智能体

Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou

发表机构 * Alibaba Group（阿里巴巴集团）

AI总结提出Connect the Dots框架，通过端到端强化学习训练LLM在长期任务中自我更新上下文并泛化到新领域，实验验证了跨域泛化能力。

Comments Work in progress; we will continuously update the codebase and arXiv version

详情

AI中文摘要

本文提出了一个通用框架，用于训练大型语言模型（LLMs）具备“Connect the Dots”（CoD）这一元能力，该能力是长期生命周期智能体所必需的：当基于LLM的AI智能体部署在环境中时，它解决一系列长期任务，同时持续探索环境、从自身经验中学习，并迭代地自我更新关于环境的上下文，从而在更新上下文的条件下，在未来任务上实现逐步更好的性能。CoD框架的主要组成部分包括：（1）用于端到端强化学习（RL）的算法设计和基础设施，其中包含交替执行任务和更新上下文的长展开序列；（2）用于在训练过程中激励和激发LLM中目标元能力的任务和环境，以及在评估过程中忠实衡量进展的任务和环境。我们展示了CoD框架的概念验证实现，包括具有细粒度信用分配的GRPO风格RL算法，以及针对目标元能力（而非特定领域的LLM能力或标准的逐任务RL）量身定制的任务和环境。实证结果验证了CoD设置中端到端RL训练的有效性，并展示了所激发元能力的分布外泛化潜力——在训练领域内、跨不同领域以及从CoD到Ralph-loop设置中。我们对CoD的研究连接了多项先前工作，并为推进LLM和AI智能体开辟了新的机遇。为促进进一步研究和应用，我们在 https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod 上发布了我们的实现。

英文摘要

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization -- within the training domains, across different domains, and from CoD to Ralph-loop settings -- of the elicited meta-capability. Our investigation of CoD connects several lines of prior work, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.


2606.20138 2026-06-19 cs.AI cs.CL cs.HC cs.LG 交叉投稿

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

学习提示：基于自适应LLM的高中辅导提升学生参与度

Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

发表机构 * Leiden University（莱顿大学）； FutureWhiz

AI总结提出一种基于14个教学特征的主题感知提示路由模型，通过模拟训练和在线A/B测试，在高中辅导中实现自适应策略切换，提高教学效率并减少交互轮次。

详情

AI中文摘要

LLMs可以个性化教育，尽管当前的静态提示辅导系统难以适应不同的学科。我们开发并测试了一个具有主题感知提示的系统，该系统基于从原始转录中提取的14个教学特征（例如，辅导支架、学生理解）。我们首先在模拟环境中训练一个提示路由模型，然后将其部署到实际高中学生的在线适应中。模拟基准测试显示，路由器的性能优于两个静态基线（$0.694$ vs. $0.647$ 和 $0.64$, $p<0.001$）。A/B测试（$N=656$ 次对话，来自359名学生）显示了从模拟到现实的迁移，其中模型从分析策略切换到支架学习策略。我们的自适应提示选择机制提高了教学效率，保持了教学质量，并减少了约3轮交互（$p=0.007$）。虽然贪婪路由器的练习转化率与基线相当（$19.1\%$ vs. $19.6\%$），但随机采样策略的随机路由器实现了更高的转化率（$28.1\%$）。

英文摘要

LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines ($0.694$ vs. $0.647$ and $0.64$, $p<0.001$). A/B testing ($N=656$ conversations from 359 students) shows sim-to-real transfer, with the model switching from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality, and reduces interactions by around 3 turns ($p=0.007$). While a greedy router achieves an exercise conversion rate comparable to the baseline ($19.1\%$ vs. $19.6\%$), a stochastic router that samples strategies leads to a higher conversion rate ($28.1\%$).
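The difference between the greedy and stochastic routers can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how strategies are sampled, so softmax sampling is assumed here, and the score values and the "socratic" strategy name are hypothetical ("analytical" and "scaffolding" are the two strategies the abstract names).

```python
import math
import random


def softmax(scores: dict[str, float], temperature: float = 1.0) -> dict[str, float]:
    """Turn raw strategy scores into a probability distribution."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp((s - peak) / temperature) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def greedy_route(scores: dict[str, float]) -> str:
    """Always pick the highest-scoring prompt strategy."""
    return max(scores, key=scores.get)


def stochastic_route(scores: dict[str, float], rng: random.Random,
                     temperature: float = 1.0) -> str:
    """Sample a prompt strategy in proportion to its softmax probability."""
    probs = softmax(scores, temperature)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


# Hypothetical router outputs for one conversation state.
scores = {"analytical": 0.2, "scaffolding": 0.9, "socratic": 0.5}
rng = random.Random(0)

print(greedy_route(scores))
samples = [stochastic_route(scores, rng) for _ in range(5)]
print(samples)
```

The greedy router always returns the same strategy for a given state, whereas the stochastic router occasionally explores lower-scoring strategies; lowering `temperature` moves the sampler toward the greedy choice.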


2606.20529 2026-06-19 cs.AI cs.CL 交叉投稿

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent: 策略遵从工具调用代理的结构化状态

Md Nayem Uddin, Amir Saeidi, Eduardo Blanco, Chitta Baral

发表机构 * Arizona State University（亚利桑那州立大学）； University of Arizona（亚利桑那大学）

AI总结针对客服领域策略遵从工具调用代理，提出LedgerAgent方法，通过独立账本维护任务状态并渲染到提示中，在执行工具调用前检查状态依赖策略约束，提升多次试验间的一致性。

Comments Work in Progress

详情

AI中文摘要

在客服领域，策略遵从工具调用代理必须在跨轮次调用工具时维护任务状态，并遵守领域策略。任务状态包括通过用户交互和工具调用观察到的事实、标识符、约束和条件。在标准代理中，任务状态没有单独表示。观察结果、工具返回和策略指令被放入提示中，使得代理每次决定下一步时都需要从提示中重建相关状态。这种设计使状态管理变得隐式，导致两种常见失败模式：代理可能检索到正确的事实，但后来基于过时、缺失或不正确的信息做出决策；语法上有效的工具调用可能仍然违反依赖于当前任务状态的领域策略。我们引入了LedgerAgent，一种用于工具调用代理的推理时方法，它在单独的账本中维护观察到的任务状态，并将状态渲染到提示中。在执行改变环境的工具调用之前，账本还用于检查状态依赖的策略约束，阻止策略违规。在四个客服领域以及开放权重和闭源权重模型的混合面板上，LedgerAgent相比标准的基于提示的工具调用方法提高了平均pass^k，在更严格的多次试验一致性指标下提升最大。

英文摘要

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time they decide what to do next. This design makes state management implicit, creating two common failure modes: an agent may retrieve the right facts but later ground its decision in stale, missing, or incorrect information; and a syntactically valid tool call may still violate a domain policy that depends on the current task state. We introduce LedgerAgent, an inference-time method for tool-calling agents that maintains observed task states in a separate ledger and renders the states into the prompt. The ledger is also used to check state-dependent policy constraints before environment-changing tool calls are executed, blocking policy violations. Across four customer-service domains and a mixed panel of open- and closed-weight models, LedgerAgent improves average pass^k over a standard prompt-based tool-calling approach, with the largest gains under stricter multi-trial consistency metrics.
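A minimal sketch of the two roles the ledger plays (rendering state into the prompt, and gating environment-changing calls) might look like the following. All names here (`Ledger`, `guarded_call`, the `issue_refund` tool, and the refund rule) are hypothetical and chosen for illustration; the abstract does not specify the actual data structures or policies.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Ledger:
    """Explicit task state, kept outside the prompt (hypothetical structure)."""
    facts: dict[str, object] = field(default_factory=dict)

    def record(self, key: str, value: object) -> None:
        self.facts[key] = value  # newer observations overwrite stale ones

    def render(self) -> str:
        """Serialize the state so it can be placed in the next prompt."""
        return "\n".join(f"- {k}: {v}" for k, v in sorted(self.facts.items()))


# A rule inspects the ledger and the call arguments; it returns an error
# message when the call would violate policy, otherwise None.
Rule = Callable[[Ledger, dict], Optional[str]]


def refund_requires_delivered(ledger: Ledger, args: dict) -> Optional[str]:
    if ledger.facts.get("order_status") != "delivered":
        return "refund blocked: order is not marked as delivered"
    return None


# Only environment-changing tools are guarded; other tools pass through.
GUARDS: dict[str, list[Rule]] = {"issue_refund": [refund_requires_delivered]}


def guarded_call(ledger: Ledger, tool: str, args: dict,
                 execute: Callable[[str, dict], str]) -> str:
    for rule in GUARDS.get(tool, []):
        violation = rule(ledger, args)
        if violation is not None:
            return violation  # returned to the agent instead of executing
    return execute(tool, args)


def fake_env(tool: str, args: dict) -> str:
    """Stand-in for the real tool backend."""
    return f"executed {tool}"


ledger = Ledger()
ledger.record("order_id", "A-1001")
ledger.record("order_status", "in_transit")

print(ledger.render())
print(guarded_call(ledger, "issue_refund", {"order_id": "A-1001"}, fake_env))
ledger.record("order_status", "delivered")
print(guarded_call(ledger, "issue_refund", {"order_id": "A-1001"}, fake_env))
```

Keeping the check outside the model means a syntactically valid call is still blocked when the recorded state does not satisfy the policy, which is the second failure mode the abstract describes.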


2603.22922 2026-06-19 cs.CL 版本更新

CogniFold: 通过认知折叠实现始终在线的主动记忆

Suli Wang, Yiqun Duan, Yu Deng, Rundong Zhao, Dai Shi, Minghua Deng, Chen Chen, Xinliang Zhou

AI总结提出CogniFold，一种受大脑启发的主动记忆系统，通过将互补学习系统扩展为三层（海马体、新皮层、前额叶意图层）并利用图拓扑自组织，实现事件流的持续认知结构涌现，在认知评估和常规记忆基准上均表现优异。

Comments Code is available at https://github.com/OpenNorve/CogniFold

详情

AI中文摘要

现有的智能体记忆主要仍是被动反应式和基于检索的，缺乏自主将经验组织成持久认知结构的能力。为了迈向真正自主的智能体，我们引入了CogniFold，一种受大脑启发的“始终在线”智能体记忆，专为下一代主动助手设计。CogniFold持续将碎片化事件流折叠成自涌现的认知结构，从传入事件和积累的知识中逐步引导出更高层次的认知。我们通过将互补学习系统（CLS）理论从两层（海马体、新皮层）扩展到三层，增加了一个前额叶意图层来奠定基础。模仿前额叶皮层作为意图控制和决策制定的中心，CogniFold通过图拓扑自组织实现这一点：认知结构在事件流下主动组装，语义相似时合并，过时时衰减，通过联想回忆重新链接，并在概念簇密度超过阈值时浮现意图。我们使用CogEval-Bench评估结构形成，证明CogniFold独特地产生了符合认知期望和概念涌现的记忆结构。此外，在八个下游基准测试中（其中两个考察长期对话记忆：LoCoMo、LongMemEval，另外六个涵盖其他认知领域），我们验证了CogniFold在常规记忆任务上同样表现稳健。我们的代码可在 https://github.com/OpenNorve/CogniFold 获取。

英文摘要

Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce CogniFold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, bootstrapping progressively higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the locus of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures proactively assemble under the stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density crosses a threshold. We evaluate structural formation using CogEval-Bench, demonstrating that CogniFold uniquely produces memory structures that match cognitive expectations and concept emergence. Furthermore, across eight downstream benchmarks -- two probing long-term conversational memory (LoCoMo, LongMemEval) and six spanning other cognitive domains -- we validate that CogniFold simultaneously performs robustly on conventional memory tasks. Our code is available at https://github.com/OpenNorve/CogniFold.


2606.19591 2026-06-19 cs.CL cs.AI 新提交

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

采用分层策略的基于BART的越南语生成式多文档摘要方法

Vu Nguyen Nguyen Xuan, Huy Ngo Quang

发表机构 * Aimesoft JSC（Aimesoft股份公司）

AI总结提出一种新颖简单的基于参考摘要缩短文档的分层策略，结合BART模型实现越南语多文档生成式摘要，在VLSP 2022测试集上达到ROUGE2-F1 0.2468，并利用外部数据增强训练。

Comments originally written in 2022

2606.20287 2026-06-19 cs.CL 新提交

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

PsyScore: 一种心理测量感知的特质自适应作文评分与最近发展区支架反馈框架

Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng

发表机构 * Department of Educational Psychology, East China Normal University（华东师范大学教育心理学系）； Shanghai Institute of Artificial Intelligence for Education, East China Normal University（华东师范大学上海智能教育研究院）； School of Computer Science and Technology, East China Normal University（华东师范大学计算机科学与技术学院）

AI总结提出PsyScore框架，通过共享潜在能力表示整合诊断评估与教学支架，包括特质自适应神经IRT评分器、ZPD支架反馈生成器和多视角反馈评估策略，在ASAP++数据集上实现竞争性评分性能并提供更符合教学法的反馈。

详情

AI中文摘要

有效的自动作文评分（AES）应支持可靠评估和可操作的教学反馈。然而，现有方法通常将评分和反馈视为独立组件：神经评分模型可解释性有限，而基于大语言模型（LLM）的反馈通常对学习者熟练度不敏感。为解决这一碎片化问题，本工作提出PsyScore，一个心理测量感知的框架，通过共享潜在能力表示整合诊断评估与教学支架。PsyScore包含三个关键模块：特质自适应神经IRT评分器，将分级部分信用模型（GPCM）融入神经架构，能够在保持心理测量可解释性的同时精确估计学生能力；ZPD支架反馈生成器，根据诊断出的能力参数调节多智能体反馈策略，以适应不同熟练水平的教学重点；以及多视角反馈评估策略，通过成对偏好判断和学生修订模拟评估反馈质量。在ASAP++数据集上的实验表明，PsyScore在提供更具教学一致性的反馈的同时，实现了有竞争力的评分性能。

英文摘要

Effective Automated Essay Scoring (AES) systems are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners' proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer, which incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling precise estimation of student ability while maintaining psychometric interpretability; a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels; and a Multi-Perspective Feedback Evaluation Strategy, which assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.


2504.02885 2026-06-19 cs.CL 版本更新

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

Med-R2：面向医学报告生成的感知与反思驱动复杂推理

Hao Wang, Shuchang Ye, Jinghao Lin, Usman Naseem, Jinman Kim

发表机构 * The School of Computer Science, The University of Sydney（悉尼大学计算机科学学院）； The School of Computing, Macquarie University（麦考瑞大学计算机学院）； Doubao Medical Group, ByteDance（字节跳动豆包医疗团队）

AI总结提出Med-R2微调策略，通过引入感知驱动的长推理过程和放射学知识指导，并加入反思机制修正感知错误，提升LVLMs在医学报告生成中的病理特征感知和诊断准确性。

Comments 28 pages, 3 figures, 1 table

详情

AI中文摘要

自动化医学报告生成（MRG）越来越多地被用于减轻人工报告负担和辅助决策。大型视觉语言模型（LVLMs）因其细粒度的图像-文本对齐和先进的文本生成能力，在自动化MRG中展现出巨大潜力。目前，最先进的MRG主要专注于通过直接监督微调（SFT）来适应预训练的LVLMs，这是一种使用医学图像-报告对的微调策略。然而，有几个因素限制了这些LVLMs的性能。首先，直接SFT使LVLMs能够直接生成医学报告，而无需经过病理特征感知和诊断推理的中间思考过程。这导致可能无法感知病理特征，从而引起误诊。其次，直接SFT缺乏放射学特定知识的指导，导致LVLMs误解感知到的病理特征并做出错误诊断。为了解决这些问题，我们提出了一种名为Med-R2的新型微调策略。我们引入了一个感知驱动的长推理过程，该过程在报告生成之前进行，并融入放射学特定知识作为指导。此外，为了减轻复杂推理中潜在的感知错误，引入了一种反思机制来细化病理特征的感知和生成的报告。我们的实验表明，Med-R2通过微调LVLMs有效增强了MRG的病理特征感知能力和诊断准确性。

英文摘要

Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and to support decision-making. Large vision-language models (LVLMs) hold great promise for automated MRG due to their fine-grained image-text alignment and advanced text-generation capabilities. Currently, state-of-the-art MRG methods primarily focus on adapting pre-trained LVLMs with direct supervised fine-tuning (SFT), a fine-tuning strategy that uses medical image-report pairs. However, several factors limit the performance of these LVLMs. Firstly, direct SFT trains LVLMs to generate medical reports directly, without an intermediate thinking process of pathological feature perception and diagnostic reasoning. This can cause the model to miss pathological features and thus lead to misdiagnosis. Secondly, direct SFT lacks radiology-specific knowledge guidance, causing LVLMs to misinterpret perceived pathological features and make incorrect diagnoses. To address these gaps, we propose a novel fine-tuning strategy named Med-R2. We introduce a perception-driven long reasoning process that precedes report generation and incorporates radiology-specific knowledge as guidance. Additionally, to alleviate potential perceptual errors in complex reasoning, a reflection mechanism is introduced to refine the perception of pathological features and the generated report. Our experiments demonstrate that Med-R2 effectively enhances pathological feature perception and diagnostic accuracy for MRG via fine-tuned LVLMs.


2606.19638 2026-06-19 cs.CL 新提交

MedRLM：用于长上下文临床推理、传感器引导筛查、证据支持决策及社区到三级转诊优化的递归多模态健康智能

Aueaphum Aueawatthanaphisut

发表机构 * School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani, Thailand

AI总结提出MedRLM递归多模态健康智能框架，通过递归检查、分解、检索、验证和合成患者信息，协调多个专业代理并引入临床证据图记忆，实现长上下文临床推理和传感器引导筛查。

Comments 9 pages, 3 figures, 3 tables, 1 Algorithm, 29 equations

详情

AI中文摘要

现实世界的临床决策支持需要对异质性和纵向的患者信息进行推理，而不是回答孤立的医学问题。然而，当前的医学大语言模型和检索增强生成系统通常依赖单步提示或检索，当临床证据分布在长电子健康记录、医学图像、传感器流、指南和转诊约束中时，这可能变得脆弱。本文提出MedRLM，一个用于长上下文临床推理、传感器引导筛查和社区到三级转诊支持的递归多模态健康智能框架。MedRLM不是将所有患者信息压缩到一个提示中，而是将患者病例视为一个外部临床环境，可以递归地检查、分解、检索、验证和综合。该框架协调了专门用于临床文本、纵向EHR、医学影像、生理传感器信号、指南检索、不确定性审计和转诊规划的代理。它进一步引入了临床证据图记忆，将患者特定的观察结果与检索到的证据、标准化定义、传感器衍生的生物标志物和转诊标准连接起来。传感器引导的递归触发机制在检测到异常生理或行为模式时激活更深层次的推理，而不确定性门控细化支持临床医生对高风险或低置信度病例的审查。我们还概述了一个使用公共和经认证的临床数据集（涵盖EHR、放射学、ECG、ICU时间序列和转诊代理结果）的真实数据评估设计。MedRLM旨在将医学AI从静态问答转向可审计、多模态和流程感知的临床决策支持。

英文摘要

Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and retrieval-augmented generation systems often rely on single-step prompting or retrieval, which can be fragile when clinical evidence is distributed across long electronic health records, medical images, sensor streams, guidelines, and referral constraints. This paper proposes MedRLM, a Recursive Multimodal Health Intelligence framework for long-context clinical reasoning, sensor-guided screening, and community-to-tertiary referral support. Instead of compressing all patient information into one prompt, MedRLM treats the patient case as an external clinical environment that can be recursively inspected, decomposed, retrieved, verified, and synthesized. The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning. It further introduces a Clinical Evidence Graph Memory to connect patient-specific observations with retrieved evidence, standardized definitions, sensor-derived biomarkers, and referral criteria. A sensor-guided recursive triggering mechanism activates deeper reasoning when abnormal physiological or behavioral patterns are detected, while uncertainty-gated refinement supports clinician review for high-risk or low-confidence cases. We also outline a real-data evaluation design using public and credentialed clinical datasets spanning EHR, radiology, ECG, ICU time series, and referral-proxy outcomes. MedRLM aims to move medical AI from static question answering toward auditable, multimodal, and workflow-aware clinical decision support.


2606.19534 2026-06-19 cs.CV cs.AI cs.CL 交叉投稿

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

PerceptionDLM：基于多模态扩散语言模型的并行区域感知

Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong

发表机构 * Peking University（北京大学）； MSALab ； ByteDance（字节跳动）

AI总结提出PerceptionDLM，利用扩散语言模型的并行解码特性，通过高效提示和结构化注意力掩码实现多区域并行感知，显著提升推理效率，并构建ParaDLC-Bench基准进行评估。

Comments Code available at https://github.com/MSALab-PKU/PerceptionDLM

详情

AI中文摘要

多模态大语言模型（MLLMs）在视觉理解任务中取得了显著进展。然而，现有大多数MLLMs依赖自回归生成，这限制了它们在需要描述多个区域的感知任务中的效率。在这项工作中，我们提出PerceptionDLM，一种针对高效并行区域感知优化的多模态扩散语言模型。基于PerceptionDLM-Base（一个在开源扩散MLLMs中达到最先进性能的强基础基线），我们的架构充分利用了DLMs的并行解码特性。具体来说，我们引入了高效提示和结构化注意力掩码，以实现对多个掩码区域的同步感知，使模型能够在序列和token级别并行生成区域描述。与现有顺序处理区域的方法相比，这种设计显著提高了推理效率。为了系统评估DLMs视觉感知能力的并行性，我们通过将DLC-Bench扩展为每张图像包含多个区域掩码，构建了一个新的并行详细局部描述基准（ParaDLC-Bench），从而能够联合评估描述质量和推理效率。实验表明，PerceptionDLM在区域描述中保持竞争性能，同时在多区域感知任务中实现了显著的加速。我们的结果凸显了多模态扩散语言模型在高效并行视觉感知中的潜力。据我们所知，我们是首个利用扩散语言模型优势实现并行区域描述和感知的工作。代码、模型和数据集已发布。

英文摘要

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallel visual perception capability of DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region captioning and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.
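One plausible reading of "structured attention masking" is a block-structured mask in which every region's description tokens see the shared image/prompt prefix and their own block, but not other regions' blocks. The sketch below builds such a mask; the layout (a shared prefix followed by one block per region) and the function name `build_region_mask` are assumptions made for illustration, not details confirmed by the abstract.

```python
def build_region_mask(prefix_len: int, region_lens: list[int]) -> list[list[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed layout: [shared prefix | region 1 tokens | region 2 tokens | ...].
    Attention inside a block is bidirectional, as is usual for diffusion LMs.
    """
    # Segment id per token: 0 for the shared prefix, 1..R for the regions.
    segments = [0] * prefix_len
    for region_id, length in enumerate(region_lens, start=1):
        segments.extend([region_id] * length)

    total = len(segments)
    return [
        [segments[k] == 0 or segments[k] == segments[q] for k in range(total)]
        for q in range(total)
    ]


# Two prefix tokens, then a 2-token and a 1-token region description.
mask = build_region_mask(prefix_len=2, region_lens=[2, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the blocks cannot see each other, all region descriptions can be placed in one sequence and denoised together rather than captioned one region at a time.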


2606.20477 2026-06-19 cs.CV cs.CL cs.LG 交叉投稿

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

面向放射学的空间定位2D视觉-语言模型的可扩展训练

Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter, Behzad Bozorgtabar, Thomas Brox

发表机构 * Computer Vision Group, University of Freiburg, Germany（德国弗莱堡大学计算机视觉组）； Department of Radiology, Medical Center -- University of Freiburg, Germany（德国弗莱堡大学医学中心放射科）； CRIION-AI Lab, Freiburg, Germany（德国弗莱堡CRIION-AI实验室）

AI总结提出RefRad2D大规模双语数据集，通过LLM和自动分割生成空间定位数据，训练RadGrounder模型联合完成报告生成、VQA和空间定位，在外部基准上取得竞争性结果。

Comments Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

详情

AI中文摘要

我们研究了如何在没有手动空间标注的情况下，为放射学训练具有视觉定位能力的视觉-语言模型（VLM）。我们引入了RefRad2D，这是一个大规模的双语（德语/英语）数据集，包含来自临床实践的120万对CT和MR图像-文本对，并通过基于LLM的筛选和自动分割自动生成任务特定的VQA和空间定位子集。在此数据上训练的模型RadGrounder联合执行报告生成、视觉问答以及通过边界框检测或分割进行的空间定位。在外部VQA基准（Slake，VQA-RAD）上，RadGrounder取得了与专用医学VLM竞争的结果。将我们的临床数据加入训练混合集，相比于仅在下游数据集上微调，提高了开放式VQA的性能，显示了数据集的迁移性。关键在于，添加定位监督不会降低语言质量，从而在不牺牲VQA性能的情况下实现空间可验证的输出。

英文摘要

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.


2603.16606 2026-06-19 cs.CL 版本更新

ReNikud：音频监督的希伯来语字素到音素转换

Maxim Melichov, Yakov Kolani, Morris Alper

AI总结提出ReNikud方法，利用音频监督和伪元音化架构，通过无标注音频的ASR伪标签和字符级对齐，解决希伯来语G2P转换中的元音缺失和发音歧义问题，在多个基准上达到最优。

详情

AI中文摘要

现代希伯来语的字素到音素（G2P）转换对于文本到语音（TTS）等应用是必需的，但由于该语言的辅音音素文字系统（abjad）使元音大多不写出来，造成大量歧义，因此具有挑战性。标准方法首先预测元音变音符号（nikud）以生成国际音标（IPA）转录，但这存在局限性：元音化数据稀缺且制作费力，它不指定词汇重音等特征，并且反映的是正式语法规则而非日常口语发音。同时，直接的序列到序列IPA预测在有限数据上表现不佳，且未能利用辅音音素文字特有的字符级对齐。我们的方法ReNikud通过两个关键洞察克服了这些限制：（1）通过基于音素的自动语音识别（ASR）伪标签流水线，在数千小时无标注希伯来语音频上进行弱音频监督，生成反映自然口语规范的音位转录，无需人工标注。（2）一种伪元音化架构，在每个字符位置预测IPA音素，强制字符级对齐作为归纳偏置。在现有希伯来语G2P基准和针对口语希伯来语的新MILIM基准上的结果表明，ReNikud超越了先前的最先进方法。我们将发布代码和训练模型，以支持希伯来语TTS和语音技术的进一步研究。

英文摘要

Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications like text-to-speech (TTS), but is challenging due to the language's abjad writing system, which leaves vowels largely unwritten, creating substantial ambiguity. Standard approaches first predict vowel diacritics (nikud) to produce International Phonetic Alphabet (IPA) transcriptions, but this is limited: vocalization data is scarce and laborious to produce, it does not specify features such as lexical stress, and it reflects formal grammatical rules rather than everyday spoken pronunciation. Direct sequence-to-sequence IPA prediction, meanwhile, struggles on limited data and fails to exploit the character-level alignment characteristic of abjads. Our method, ReNikud, overcomes these limitations with two key insights: (1) Weak audio supervision via a phoneme-based automatic speech recognition (ASR) pseudo-labeling pipeline on thousands of hours of unlabeled Hebrew audio, yielding phonemic transcriptions that reflect natural spoken norms without manual annotation. (2) A pseudo-vocalization architecture that predicts IPA phonemes at each character position, enforcing character-level alignment as an inductive bias. Results on existing Hebrew G2P benchmarks and the new targeted MILIM benchmark for spoken Hebrew show that ReNikud surpasses previous state-of-the-art methods. We will release our code and trained models to support further work on Hebrew TTS and speech technologies.
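The character-level alignment idea in insight (2) can be sketched as a decoding step that expects exactly one IPA chunk per input character. The function names and the hand-written labels below are hypothetical; they illustrate the output format only and are not taken from the ReNikud model.

```python
def decode_per_character(text: str, labels: list[str]) -> str:
    """Join one predicted IPA chunk per input character into a transcription.

    A chunk may be empty (silent letter) or hold several phonemes
    (a consonant plus an unwritten vowel). Requiring exactly one chunk per
    character is what ties the output to the input positions.
    """
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    return "".join(labels)


def alignment(text: str, labels: list[str]) -> list[tuple[str, str]]:
    """Expose the character-to-phoneme alignment for inspection."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    return list(zip(text, labels))


word = "שלום"                    # four letters: shin, lamed, vav, final mem
labels = ["ʃa", "l", "o", "m"]   # hand-written labels, not model output

print(decode_per_character(word, labels))
for char, chunk in alignment(word, labels):
    print(repr(char), "->", repr(chunk))
```

Compared with free-form sequence-to-sequence output, this format cannot drop or reorder letters, which is the inductive bias the abstract refers to.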


2606.19951 2026-06-19 eess.AS cs.CL cs.LG cs.SD 交叉投稿

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

通过声学和韵律扰动研究语音质量评估中的人机差异

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

AI总结通过声学退化、韵律错误和说话人特征扰动，发现MOS预测模型对声学退化敏感，但对韵律错误不敏感，且对基频有偏见，而对语速和基频变化不敏感。

Comments Accepted to INTERSPEECH 2026

详情

AI中文摘要

平均意见得分（MOS）预测模型在文本到语音（TTS）研究中被广泛用作代理指标，但它们捕捉超出声学保真度的质量差异的能力仍不清楚。我们通过控制性扰动来研究这一点：声学退化、韵律错误以及说话人特定特征（如音高和语速）的操纵。我们从人类听众和模型那里获得了这些语音样本的MOS预测，并分析了它们感知特征的差异。结果表明，大多数模型能很好地跟踪声学退化，而所有模型对韵律错误不敏感，尽管主观评分大幅下降。对于说话人特征，模型表现出双重分离：在人类评分中不存在的强平均基频（F0）偏见，但对人类注意到的语速和F0变化不敏感。这些发现突出了标量MOS预测在声学保真度之外的局限性。

英文摘要

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.


2606.19996 2026-06-19 cs.SD cs.CL 交叉投稿

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

基于自编码器与对比学习的段级普通话语音认知障碍检测

Yongqi Shao, Hong Huo, Flavio Bertini, Danilo Montesi, Tao Fang

发表机构 * School of Automation and Intelligent Sensing, Shanghai Jiao Tong University（上海交通大学自动化与智能感知学院）； Key Laboratory of System Control and Information Processing, Ministry of Education of China（教育部系统控制与信息处理重点实验室）； Shanghai Key Laboratory of Perception and Control in Industrial Network Systems（上海市工业网络系统感知与控制重点实验室）； Department of Computer Science and Engineering, University of Bologna（博洛尼亚大学计算机科学与工程系）； Department of Mathematical, Physical and Computer Sciences, University of Parma（帕尔马大学数学、物理与计算机科学系）

AI总结提出段级表示学习框架，结合自编码器和对比学习，在四个普通话数据集上实现稳定的二分类和三分类认知障碍检测，尤其改善了临床困难的三分类性能。

Comments 15 pages, 7 figures, 5 tables

详情

AI中文摘要

背景与目标：语音已成为一种低成本、非侵入性的数字生物标志物，在认知障碍检测方面具有巨大潜力。然而，有限的标注数据和跨数据集变异性仍然是构建稳健的语音筛查系统的主要挑战。方法：我们开发了一个用于语音认知障碍检测的段级表示学习框架。将语音录音分割成短片段并转换为语谱图表示。为了在有限数据条件下提高鲁棒性，将离线和在线增强策略与基于自编码器的表示学习和对比目标相结合，以增强判别性潜在表示。结果：在四个独立的普通话语音数据集上进行的实验表明，在二分类和三分类任务中均取得了稳定且有竞争力的性能，尤其是在临床具有挑战性的三分类设置中取得了显著改进。消融研究进一步支持了所提框架的有效性。结论：研究结果表明，段级语音表示学习可能为资源受限的临床环境中的认知障碍筛查提供一种可扩展且实用的方法。

英文摘要

Background and Objective: Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset variability remain major challenges for robust speech-based screening systems. Methods: We developed a segment-level representation learning framework for speech-based cognitive impairment detection. Speech recordings were divided into short segments and converted into spectrogram representations. To improve robustness under limited-data conditions, offline and online augmentation strategies were combined with autoencoder-based representation learning and contrastive objectives to enhance discriminative latent representations. Results: Experiments conducted on four independent Mandarin Chinese speech datasets demonstrated stable and competitive performance in both binary and three-class classification tasks, with particularly notable improvements in the clinically challenging three-class setting. Ablation studies further supported the effectiveness of the proposed framework. Conclusions: The findings suggest that segment-level speech representation learning may provide a scalable and practical approach for cognitive impairment screening in resource-constrained clinical settings.


2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD 交叉投稿

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

PASQA：在含重音错误的合成语音上训练的、以音高重音为中心的语音质量评估模型

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

AI总结提出PASQA模型，通过可控重音合成数据集和伪重音质量分数，结合自监督表示、摩拉条件融合等训练策略，有效评估音高重音正确性，优于传统MOS模型。

Comments Accepted to INTERSPEECH 2026

详情

AI中文摘要

现有的平均意见得分（MOS）预测模型通常预测话语级别的自然度MOS，并且可能对局部音高重音错误不敏感。我们提出了以音高重音为中心的语音质量评估（PASQA），明确针对音高重音正确性。为了训练我们的模型，我们使用重音可控的文本转语音系统通过改变重音模式构建了一个受控的日语重音错误数据集，并根据重音错误率计算伪重音质量得分。PASQA建立在自监督表示的基础上，并采用摩拉条件融合、排序损失、辅助重音错误定位任务和说话者不变训练。实验表明，传统模型无法保持按重音错误严重程度的排序，而PASQA在已见和未见说话者上都实现了高排序准确性。此外，PASQA与人类重音正确性判断的一致性更强。代码可在以下网址获取：https://github.com/lycorp-jp/PASQA。

Abstract

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.

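2605.17443 2026-06-19 cs.CL cs.SD eess.AS updated version

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

Donghyuk Jung, Youngwon Choi

Affiliations: Korea Culture Technology Institute, Republic of Korea; Maum AI Inc., Republic of Korea

AI summary: Studies error propagation in ASR-LLM cascades for Korean spoken question answering (SQA). By analyzing downstream semantic failures, it reveals error effects that conventional ASR metrics do not fully capture, finds that cascade degradation is consistent across LLMs of differing capability, identifies single-character ASR errors as a channel for semantic failure, and shows in an auxiliary comparison that large audio language models outperform language-model-matched ASR-LLM pipelines on noisy Korean SQA.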

Comments Preprint. Submitted to APSIPA ASC 2026

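2606.05846 2026-06-19 cs.CL eess.AS updated version

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

Gio Paik, Hyunseo Shin, Soungmin Lee

Affiliations: University of Tokyo

AI summary: Uses model merging and domain generalization to test whether code-switching ability learned from a limited set of language pairs generalizes to unseen pairs. Experiments show that bilingual CS-ASR models generalize to unseen pairs only modestly.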

Comments ICML 2026 Workshop on Machine Learning for Audio

Abstract

Automatic Speech Recognition (ASR) has become a key technology for human--AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.

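2604.18105 2026-06-19 eess.AS cs.CL cs.SD updated version

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, Jie Wu

Affiliations: Advanced Intelligent Systems Group, NIO

AI summary: Proposes NIM4-ASR, which redesigns the multi-stage training paradigm (pre-training architecture optimization, iterative asynchronous SFT, and ASR-specialized reinforcement learning) and adds production optimizations (noise robustness, streaming inference, and RAG-based hotword customization), reaching state-of-the-art performance with 2.3B parameters.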

Abstract

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.

2606.19352 2026-06-19 cs.CL cs.AI new submission

Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

Yiming Ni, Zhi-Qi Cheng, Jiayu Li, Wei Cheng

Affiliations: Tacoma School of Engineering & Technology, University of Washington

AI summary: Surveys 120 datasets across 35 sign languages, analyzes challenges such as modality imbalance, annotation granularity, and signer bias, and proposes a 24-field Sign-Language Datasheet to support standardized documentation and reproducible evaluation.

Comments Accepted to ACL 2026 Main. 27 pages, 5 figures

Abstract

Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented datasets, inconsistent annotations, and limited linguistic coverage. Existing benchmarks often fail to reflect real-world communication needs, and systematic analyses of these limitations remain limited. In this survey, we present a comprehensive index of sign-language datasets, covering 120 resources across 35 sign languages. We analyze key challenges such as modality imbalance, annotation granularity, and signer bias, and outline considerations for future dataset design. We also introduce a 24-field Sign-Language Datasheet and release a public GitHub repository (https://github.com/Ginqwerty/Open-Sign-Language) to support standardized documentation and reproducible evaluation. Overall, our work provides a unified and practical foundation for developing inclusive, robust, and scalable sign-language technologies in real-world applications.

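2606.19468 2026-06-19 cs.CL new submission

Characterizing Narrative Content in Web-scale LLM Pretraining Data

Teagan Johnson, Elliott Ash, Andrew Piper, Maria Antoniak

AI summary: Presents the first fine-grained study of narrative features in the LLM pretraining corpus Dolma. It proposes a framework covering three core narrative elements (agency, setting, events), builds the NarraBERT model, releases the NarraDolma dataset, and shows that narrative structure is measurable across heterogeneous data and unevenly distributed.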

Comments 8 pages of main content, 28 total pages. 30 figures

Abstract

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) a continuous, multidimensional narrative structure underlies web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.

2606.19544 2026-06-19 cs.CL new submission

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

Justin D. Norman, Michael U. Rivera, D. Alex Hughes

Affiliations: UC Berkeley School of Information

AI summary: A large-scale systematic evaluation (21 judge models, 118 runs, about 541,000 judgments) finds widespread problems with LLM-as-a-Judge in agreement, consistency, and bias, including kappa deflation, ranking shifts, and high test-retest reliability coexisting with severe position bias. It proposes a Minimum Viable Validation Protocol.

Abstract

LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates discriminative ability. We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and approximately 541,000 individual judgments. Four findings emerge, consistent across the full cohort, including the April 2026 frontier: kappa deflation between exact match and Cohen's kappa is universal (33--41 pp on MT-Bench), judge rankings shift by up to 14 positions across benchmarks, high test--retest reliability (>0.95) coexists with severe position bias (>0.10) in two production-deployed judges (instantiating a consistency--bias paradox), and verbosity bias is small (<0.011) across our cohort under a single pairwise rubric. We distill these into a Minimum Viable Validation Protocol.
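The gap between exact-match agreement and chance-corrected agreement can be illustrated with a minimal Python sketch. This is not the paper's evaluation code: the function names and the toy labels are assumptions chosen for illustration, and only the standard definition of Cohen's kappa, (p_o - p_e) / (1 - p_e), is used.

```python
from collections import Counter

def exact_match(a, b):
    """Fraction of items on which two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = exact_match(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label
    # if each drew independently from its own label distribution.
    p_e = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 0.0  # kappa is undefined here; returning 0.0 is an assumption
    return (p_o - p_e) / (1 - p_e)

# Toy data (hypothetical): the judge always answers "A".
human = ["A"] * 8 + ["B"] * 2
judge = ["A"] * 10

agreement = exact_match(human, judge)
kappa = cohens_kappa(human, judge)
print(f"exact match = {agreement:.2f}, kappa = {kappa:.2f}")
```

In this toy case the judge agrees with the human labels on 80% of items while carrying no discriminative signal at all, which is the kind of overstatement that a chance-corrected metric exposes.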

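2606.19637 2026-06-19 cs.CL cs.AI new submission

Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

Priyanshi Garg, Ishita Rao, Jieqiong Ding, Amandalynne Paullada

Affiliations: University of Washington

AI summary: Through a case study of the ScAN dataset, shows that EHR-based suicidality datasets encode a particular operationalization shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved, and that identical labels subsume heterogeneous clinical framings.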

Comments To appear in the Proceedings of the 11th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)

Abstract

Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty. We argue that clinical NLP should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.

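2606.19698 2026-06-19 cs.CL new submission

From Text to Score: Tracing the Emergence of Essay Quality Representations in Large Language Models

Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo, Henghua Shen, Lidia S. Chao, Derek F. Wong

AI summary: Uses linear probing and related methods to analyze the hidden representations of 8 LLMs on three datasets. It finds that essay-quality information is present in linearly decodable form, identifies score-related neurons, and sheds light on the internal mechanisms of LLM scoring.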

Comments This is a preprint of a manuscript currently under peer review

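Benchmarking Agentic Review Systems

Dang Nguyen, Wanqing Hao, Yanai Elazar, Chenhao Tan

Affiliations: University of Chicago; Bar-Ilan University

AI summary: Agentic review systems are emerging in response to the pressure that AI-assisted research places on peer review, but evaluation standards are lacking. The paper evaluates several systems and finds that the best configuration (OpenAIReview + GPT-5.5) reaches 83.0% pairwise accuracy and catches 71.6% of injected errors, with positive user feedback.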

Comments 11 pages, 7 tables, 4 figures

Abstract

A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models. First, we study whether AI reviews of ICLR/NeurIPS papers track paper quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall. The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting that different models detect different errors and that better harness design can potentially increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.
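The two headline metrics in this abstract, pairwise accuracy and union recall, can be sketched in a few lines of Python. This is an illustrative reading of those metrics, not the paper's benchmark code: the function names, the tie-handling rule (tied pairs are skipped), and the toy data are all assumptions.

```python
from itertools import combinations

def pairwise_accuracy(review_scores, quality):
    """Share of paper pairs ranked the same way by review score and by an
    external quality signal. Pairs tied on either side are skipped (assumption)."""
    agree = total = 0
    for i, j in combinations(range(len(quality)), 2):
        d_score = review_scores[i] - review_scores[j]
        d_quality = quality[i] - quality[j]
        if d_score == 0 or d_quality == 0:
            continue
        total += 1
        agree += (d_score > 0) == (d_quality > 0)
    return agree / total if total else float("nan")

def union_recall(detections_per_model, injected):
    """Recall of the union of detections across models, over the injected errors."""
    caught = set().union(*detections_per_model) & injected
    return len(caught) / len(injected)

# Toy data (hypothetical): four papers, 1 = accepted, 0 = rejected.
scores = [7, 5, 4, 3]
accepted = [1, 0, 1, 0]
acc = pairwise_accuracy(scores, accepted)

# Two hypothetical models that each catch half of four injected errors.
injected = {"e1", "e2", "e3", "e4"}
model_a, model_b = {"e1", "e2"}, {"e2", "e3"}
recall_union = union_recall([model_a, model_b], injected)
print(acc, recall_union)
```

The union recall exceeds either model's individual recall whenever the models catch different errors, which is the effect the abstract attributes to combining six models.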

2606.19788 2026-06-19 cs.AI cs.CL cross-list

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang, Yi Chang

Affiliations: School of Artificial Intelligence, Jilin University; Czech Technical University in Prague; CRRC Zhuzhou Institute; Tengen Intelligence Institute; International Center of Future Science, Jilin University; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE

AI summary: Proposes CombEval, a dynamic benchmark that generates combinatorial counting problems from typed Cofola specifications. Evaluating 11 LLMs under direct and code-augmented settings, it finds that models are brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies.

Comments under review. Code: https://github.com/YuxuZhou-CN/combination-problem-generation

Abstract

We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural-language counting problems with exact solver-verified answers. Unlike static collections, CombEval supports systematic variation of object type, entity scale, constraint count, and reasoning depth. We evaluate 11 LLMs under direct and code-augmented settings and find that models remain brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies. Error analysis further identifies failures in constraint interpretation and counting principles. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning. The code and generated benchmark suites are publicly available at https://github.com/YuxuZhou-CN/combination-problem-generation.
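To give a feel for the kind of problem described here (ordered objects, indistinguishable elements, a relative positional constraint, and an exactly verifiable answer), the following Python sketch counts constrained arrangements by brute-force enumeration. It does not use Cofola or the paper's solver; the helper names and the example problem are assumptions for illustration only.

```python
from itertools import permutations

def count_arrangements(items, constraint):
    """Count distinct orderings of a multiset that satisfy a constraint.
    Using a set collapses orderings that differ only by swapping
    indistinguishable elements."""
    distinct = set(permutations(items))
    return sum(1 for arrangement in distinct if constraint(arrangement))

def no_adjacent(symbol):
    """Constraint factory: no two copies of `symbol` sit next to each other."""
    def check(arrangement):
        return all(not (x == symbol and y == symbol)
                   for x, y in zip(arrangement, arrangement[1:]))
    return check

# Example problem (hypothetical): arrange the letters A, A, B, B, C in a row
# so that the two A's are not adjacent.
total = count_arrangements("AABBC", lambda arrangement: True)
valid = count_arrangements("AABBC", no_adjacent("A"))
print(total, valid)  # 30 18
```

The enumerated answer matches the closed form: 5!/(2!·2!) = 30 arrangements in total, of which 4!/2! = 12 keep the two A's together, leaving 18. Enumeration only scales to small instances, so it serves as a cross-check rather than a general solver.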

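2606.19830 2026-06-19 cs.SE cs.CL cross-list

ShoppingBench: A Real-World, Intent-Grounded Shopping Benchmark for LLM Agents

Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jian Dong Zhang, Xiaoyi Zeng

Affiliations: Alibaba International Digital Commercial Group

AI summary: Proposes ShoppingBench, a benchmark of multi-level, real-world shopping-intent tasks that evaluates LLM agents in a simulated environment with 2.5 million products. GPT-4.1 achieves a success rate below 50%, and a trajectory distillation strategy is proposed to improve smaller models.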

Comments Accepted for oral presentation at AAAI 2026

Abstract

Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-product sellers. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.

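2602.13139 2026-06-19 cs.CL updated version

OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer

AI summary: Existing language identification tools struggle with closely related languages and with separating natural language from noise. The paper extends the OpenLID classifier by adding training data, merging problematic language-variant clusters, and introducing a noise label, yielding OpenLID-v3 with improved precision on multiple benchmarks.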

Comments VarDial'26 workshop at the EACL 2026 conference

DOI: 10.18653/v1/2026.vardial-1.23

Abstract

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on https://huggingface.co/HPLT/OpenLID-v3.

2605.26891 2026-06-19 cs.CL updated version

Telenor Nordics Customer Service self-help corpus

Mike Riess

Affiliations: Research and Innovation, Telenor Group

AI summary: Builds a multilingual customer-service self-help corpus of 1,122 documents in Finnish, Danish, Norwegian, and Swedish to support Nordic NLP and information retrieval research.

Comments 8 pages, 2 figures, 5 tables. Submitted to Nordic Machine Intelligence. Dataset: https://zenodo.org/records/19493152

Abstract

This paper presents a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling 274,599 words and 1,884,833 characters. The documents have been sourced from the public self-help pages of four Nordic telecommunications operators and subsequently filtered for person-identifiable information and relevance through a combined LLM and human annotation pipeline. Domain-specific datasets for Nordic languages remain scarce, particularly in customer service: a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service architectures. An analysis of the corpus reveals substantial variation in document length and structure across operators, reflecting distinct editorial strategies, as well as broad topical coverage spanning network hardware, mobile services, TV and streaming, billing, and account management. The dataset is publicly available under a CC-BY-NC-SA-4.0 license at https://zenodo.org/records/20732652, intended to support reproducible research in Nordic NLP and information retrieval.

2606.01338 2026-06-19 cs.CL updated version

Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and an Evaluation of Mitigation Strategies

Serena A. Hoffstedde, Machiko Hirota, Akshara Nadayanur Sathis Kanna, Rihito Kotani, Ujwal Kumar, Gabriele Trovato, Phan Xuan Tan

Affiliations: Shibaura Institute of Technology, Tokyo, Japan; Amsterdam University of Applied Sciences, Amsterdam, Netherlands; University of Pennsylvania, Philadelphia, USA; Carnegie Mellon University, Pittsburgh, USA; Keio University, Tokyo, Japan

AI summary: Using 60 Japanese rirekisho-format resumes and five state-of-the-art LLMs, the study finds a significant pro-female bias in every model. A simple prompt instruction does not mitigate it, whereas removing the candidate name eliminates it almost entirely.

Abstract

Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o's content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.

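2606.19404 2026-06-19 cs.LG cs.CL cross-list

Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

Salim Khazem

Affiliations: Talan Research & Innovation Center

AI summary: Proposes Free-Energy Signatures (Fes), a spectral descriptor that treats the attention Laplacian as a Hamiltonian and extracts thermodynamic potentials and the random-matrix-theory spectral form factor to detect LLM hallucinations, achieving strong AUROC without updating the underlying LLM.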

Abstract

Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum by a handful of eigenvalues or hand-picked scalars, leaving most of its structure unused. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials (partition function, free energy, spectral entropy, heat capacity) together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i) Lipschitz stability of Fes under attention perturbation; (ii) an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii) a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by +6.5 AUROC points and over GoR-4 by +2.4 points on average, while requiring no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score achieves mean AUROC 0.71, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson-like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics. The anonymized code and config are provided in the supplementary material.
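As a rough illustration of what treating a Laplacian spectrum "as a Hamiltonian" might involve, the Python sketch below computes textbook canonical-ensemble quantities from a list of eigenvalues. It is an assumption-laden sketch, not the paper's Fes descriptor: the eigenvalues are assumed to be precomputed, the entropy is the Gibbs entropy of the Boltzmann weights, the form-factor normalization (dividing by N²) is one common convention, and the function names are hypothetical.

```python
import cmath
import math

def thermodynamic_signature(eigenvalues, beta=1.0):
    """Canonical-ensemble quantities with eigenvalues as energy levels."""
    # Shift by the minimum eigenvalue for numerical stability;
    # log Z is corrected for the shift below.
    lam_min = min(eigenvalues)
    weights = [math.exp(-beta * (lam - lam_min)) for lam in eigenvalues]
    z_shifted = sum(weights)
    probs = [w / z_shifted for w in weights]
    log_z = math.log(z_shifted) - beta * lam_min
    mean_e = sum(p * lam for p, lam in zip(probs, eigenvalues))
    mean_e2 = sum(p * lam * lam for p, lam in zip(probs, eigenvalues))
    return {
        "log_partition": log_z,                      # log Z(beta)
        "free_energy": -log_z / beta,                # F = -(1/beta) log Z
        "mean_energy": mean_e,                       # U = <lambda>
        "entropy": -sum(p * math.log(p) for p in probs if p > 0),
        "heat_capacity": beta ** 2 * (mean_e2 - mean_e ** 2),
    }

def spectral_form_factor(eigenvalues, t):
    """K(t) = |sum_j exp(-i * lambda_j * t)|^2 / N^2, so that K(0) = 1."""
    n = len(eigenvalues)
    amplitude = sum(cmath.exp(-1j * lam * t) for lam in eigenvalues)
    return abs(amplitude) ** 2 / n ** 2

# Toy spectrum (hypothetical), e.g. eigenvalues of a small graph Laplacian.
sig = thermodynamic_signature([0.0, 1.0, 2.0], beta=1.0)
k0 = spectral_form_factor([0.0, 1.0, 2.0], 0.0)
print(sig, k0)
```

A useful sanity check on any such implementation is the identity F = U - S/β, which these definitions satisfy exactly.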

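2606.19660 2026-06-19 cs.CR cs.CL cross-list

A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots

Gulshan Saleem, Nisar Ahmed, Muhammad Imran Zaman, Ali Hassan

AI summary: Proposes a three-layer defense framework (input filtering, a contextual instruction hierarchy, and output auditing) that cuts the prompt-injection attack success rate from 71.4% to 11.3%, with a 4.8% false positive rate and 61.2 ms of median latency overhead.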

Comments Submitted in ICCK Transactions on Information Security and Cryptography

Abstract

Prompt injection is ranked as the most critical vulnerability in large language model (LLM) deployments by the OWASP Top 10 for LLM Applications, yet existing defenses operate at isolated pipeline stages and remain incomplete. Input filters cannot inspect retrieved documents, while output monitors cannot prevent malicious payloads from reaching the model. Consequently, retrieval-augmented generation (RAG) chatbots remain vulnerable to indirect injection, where a poisoned knowledge-base document compromises every user whose query retrieves it. We present a three-layer framework that intercepts both direct and indirect prompt injection throughout the inference pipeline. Layer 1 screens user input using a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, preventing retrieved content from overriding operator policy. Layer 3 audits model output using a policy rule engine and semantic drift detector before delivery. A continuous audit loop aggregates structured logs and supports retraining to adapt the classifier to emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces Attack Success Rate (ASR) from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, while maintaining a 4.8% false positive rate and a median latency overhead of 61.2 ms. Ablation studies confirm that all three layers provide complementary protection and that their combined effect exceeds the sum of individual contributions.
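The three-layer structure described here can be sketched as a thin middleware wrapper. The Python below is a toy outline under strong assumptions, not the authors' framework: it keeps only the rule-based parts (the semantic anomaly classifier and drift detector are omitted), the patterns, tags, and function names are hypothetical, and `model` stands in for any callable that maps a prompt string to a response string. Tagging provenance in the prompt does not by itself enforce an instruction hierarchy; it only shows where such enforcement would sit.

```python
import re

# Hypothetical pattern library for Layer 1 (rule-based screening only).
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|previous|prior) .*instructions", re.I),
    re.compile(r"reveal .*system prompt", re.I),
]

def layer1_screen(user_input):
    """Layer 1: return True if the input matches a known injection pattern."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)

def layer2_assemble(operator_policy, user_input, retrieved_docs):
    """Layer 2: label each segment with its provenance so that only
    operator text is presented as instructions."""
    parts = [f"[OPERATOR INSTRUCTIONS]\n{operator_policy}"]
    for i, doc in enumerate(retrieved_docs):
        parts.append(f"[RETRIEVED DATA {i}: untrusted content, not instructions]\n{doc}")
    parts.append(f"[USER]\n{user_input}")
    return "\n\n".join(parts)

def layer3_audit(output, forbidden_terms):
    """Layer 3: return True if the output passes a simple policy rule."""
    lowered = output.lower()
    return not any(term.lower() in lowered for term in forbidden_terms)

def guarded_answer(model, operator_policy, user_input, retrieved_docs, forbidden_terms):
    """Run the three layers around a model call."""
    if layer1_screen(user_input):
        return "Request blocked by input screening."
    prompt = layer2_assemble(operator_policy, user_input, retrieved_docs)
    output = model(prompt)
    if not layer3_audit(output, forbidden_terms):
        return "Response withheld by output audit."
    return output
```

Because the wrapper only needs a prompt-to-text callable, it leaves the underlying model untouched, which mirrors the middleware deployment described in the abstract.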

2606.20023 2026-06-19 cs.SE cs.AI cs.CL 交叉投稿

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

当较低权限足够时：探究LLM代理中的过度权限工具选择

Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou, Juntao Dai, Songlin Hu, Yaodong Yang

AI总结针对LLM代理在工具选择中偏好高权限工具的安全问题，提出ToolPrivBench评估框架，发现主流代理普遍存在过度权限选择且被瞬态故障放大，并设计权限感知后训练防御方法有效减少不必要的高权限工具使用。

Comments code: https://github.com/AISafetyHub/agent-tool-selection-bias

面向词嵌入迁移学习的组稀疏矩阵分解

Kan Xu, Xuanyi Zhao, Hamsa Bastani, Osbert Bastani

发表机构 * W. P. Carey School of Business, Arizona State University（亚利桑那州立大学凯瑞商学院）； University of Pennsylvania（宾夕法尼亚大学）； Wharton School, University of Pennsylvania（宾夕法尼亚大学沃顿商学院）

AI总结提出一种基于组稀疏惩罚的两阶段估计器，通过结合大规模语料和少量领域数据高效迁移学习领域特定的词嵌入，并证明了其泛化误差界和非凸目标函数的局部最优与全局最优统计等价。

AI中文摘要

非结构化文本为许多领域的决策者提供了丰富的数据源，从零售中的产品评论到医疗保健中的护理记录。为了利用这些信息，单词通常通过无监督学习算法（如矩阵分解）转化为词嵌入——编码单词之间语义关系的向量。然而，从训练数据有限的新领域学习词嵌入可能具有挑战性，因为在新领域中含义/用法可能不同，例如，单词“positive”通常具有积极情感，但在医疗记录中通常具有消极情感，因为它可能意味着患者检测出疾病阳性。在实践中，我们预计只有少数领域特定的单词可能具有新含义。我们提出了一种直观的两阶段估计器，通过组稀疏惩罚利用这种结构，通过结合大规模文本语料库（如维基百科）和有限的领域特定文本数据，高效地迁移学习领域特定的词嵌入。我们限定了迁移学习估计器的泛化误差，证明当只有少量嵌入在领域间改变时，它可以用显著更少的领域特定数据实现高精度。此外，我们证明了在标准正则化条件下，由非凸目标函数识别的所有局部最小值与全局最小值在统计上不可区分，这意味着我们的估计器可以高效计算。我们的结果首次给出了组稀疏矩阵分解的界限，这可能具有独立意义。我们通过与自然语言处理中最先进的微调启发式方法进行实证比较来评估我们的方法。

英文摘要

Unstructured text provides decision-makers with a rich data source in many domains, ranging from product reviews in retail to nursing notes in healthcare. To leverage this information, words are typically translated into word embeddings -- vectors that encode the semantic relationships between words -- through unsupervised learning algorithms such as matrix factorization. However, learning word embeddings from new domains with limited training data can be challenging, because the meaning/usage may be different in the new domain, e.g., the word "positive" typically has positive sentiment, but often has negative sentiment in medical notes since it may imply that a patient tested positive for a disease. In practice, we expect that only a small number of domain-specific words may have new meanings. We propose an intuitive two-stage estimator that exploits this structure via a group-sparse penalty to efficiently transfer learn domain-specific word embeddings by combining large-scale text corpora (such as Wikipedia) with limited domain-specific text data. We bound the generalization error of our transfer learning estimator, proving that it can achieve high accuracy with substantially less domain-specific data when only a small number of embeddings are altered between domains. Furthermore, we prove that all local minima identified by our nonconvex objective function are statistically indistinguishable from the global minimum under standard regularization conditions, implying that our estimator can be computed efficiently. Our results provide the first bounds on group-sparse matrix factorization, which may be of independent interest. We empirically evaluate our approach compared to state-of-the-art fine-tuning heuristics from natural language processing.

2606.19347 2026-06-19 cs.CL cs.AI cs.PL 新提交

How LLMs Fail and Generalize in RTL Coding for Hardware Design?

LLM在硬件设计的RTL编码中如何失败与泛化？

Guan-Ting Liu, Chao-Han Huck Yang, Chenhui Deng, Zhongzhi Yu, Brucek Khailany, Yu-Chiang Frank Wang

发表机构 * NVIDIA Research（英伟达研究院）

AI总结提出基于问题可解性的错误分类法，揭示LLM在RTL编码中受限于预训练知识，对齐技术仅教会编译，而推理能力才是关键瓶颈。

Comments Preview, under submission for EMNLP 2026

AI中文摘要

将顺序编程先验转换为硬件设计的并行时序逻辑仍然是大型语言模型（LLM）的关键瓶颈。为了研究这一点，我们引入了一种新的错误分类法，该分类法基于问题可解性，受认知理论启发。我们的分类法将失败分为语法、语义、可解功能和不可解功能类型。评估揭示了VerilogEval基准上的严格经验上限，前沿模型初始通过率稳定在90.8%。这些平台期由不可解的功能错误定义，暴露出对测试时计算扩展免疫的持续知识差距。此外，我们揭示了一个显著的表面收敛差距：优化容易消除语法错误，但同时加剧了更深层次的功能失败。我们的发现表明，对齐技术仅仅教会模型编译。虽然重复采样策略可以修补可解错误，但寄存器传输级（RTL）编码能力仍然严格受限于预训练知识。解决当前基于LLM的硬件生成流水线中的挑战需要更多关于模型推理的研究，而不是对齐干预。

英文摘要

Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models (LLMs). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test-time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level (RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM-based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.

2606.19647 2026-06-19 cs.CL cs.CY cs.SI 新提交

From 50K to 8.2 Million in 24 Hours: Vozinha's Algorithmic Consecration and the Multilingual Making of World Cup Visibility

24小时内从5万到820万：Vozinha的算法封圣与世界杯可见性的多语言构建

Vinicius Covas

发表机构 * Universidad Anáhuac México（墨西哥阿纳瓦克大学）

AI总结通过多语言语料库和九框架叙事分类法，分析2026年世界杯西班牙0-0佛得角一战后Vozinha的算法封圣过程，揭示不同语言承载不同叙事框架，并将平台粉丝数作为语言对象来研究可见性的构建。

Comments 11 pages, 4 figures, 3 tables; v0.1 pilot preprint. Dataset and evidence package available at https://doi.org/10.5281/zenodo.20722235

AI中文摘要

我们提出了一项多语言计算话语分析，研究语言如何构建了Vozinha——这位40岁的佛得角门将在2026年世界杯西班牙0-0佛得角比赛后的算法封圣。该研究贡献了一个包含葡萄牙语、西班牙语、英语和法语的多语言语料库；一个基于线索的九框架叙事分类法；一个结合LLM辅助建议与人工验证的可复现标注流程；以及跨话语阶段的多语言叙事扩散分析。我们将平台粉丝数本身——被叙述为“从5万到800万”——视为一个语言对象：一种流通且可叙述的可见性证明，而非单纯的测量。粉丝增长时间线仅作为上下文元数据使用：我们重构了一个保守的阶段结构，而非连续的API原生序列，并对每个数据点按值类别、置信度和证据类型进行标注。唯一精确的主要爬取锚点是2026年6月16日15:47 UTC的8,235,652粉丝；所有其他数字均报告为估计范围或阈值，包括估计的赛前基线45k-56k。研究结果表明，不同语言承载了不同的框架：葡萄牙语的动员、西班牙语的危机、英语的民族构建，以及共享的平台指标奇观，通过这种奇观，边缘的体育表现变得全球可见。作为v0.1试点，本文发布了语料库模式、框架分类法、标注指南、哈希视觉证据日志和类型化时间线，同时将完整的双重标注和标注者间一致性标记为计划工作。

英文摘要

We present a multilingual computational discourse analysis of how language constructed the algorithmic consecration of Vozinha, the 40-year-old Cape Verde goalkeeper, after Spain 0-0 Cape Verde at the 2026 FIFA World Cup. The study contributes a multilingual corpus in Portuguese, Spanish, English, and French; a nine-frame narrative taxonomy with cue-based frame annotation; a reproducible annotation pipeline combining LLM-assisted suggestion with human validation; and an analysis of cross-lingual narrative diffusion across discourse phases. We treat the platform follower count itself, narrated as "50k to 8M", as a linguistic object: a circulating and narratable proof of visibility rather than a mere measurement. The follower-growth timeline is used only as contextual metadata: we reconstruct a conservative phase structure, not a continuous API-native series, and type every datapoint by value class, confidence, and evidence type. The only exact primary scraper anchor is 8,235,652 followers at 2026-06-16 15:47 UTC; all other figures are reported as estimated ranges or thresholds, including an estimated pre-match baseline of 45k-56k. Findings suggest that distinct languages carried distinct frames: Portuguese mobilization, Spanish crisis, English nation-making, and a shared platform-metric spectacle through which peripheral athletic performance became globally visible. As a v0.1 pilot, the paper releases the corpus schema, frame taxonomy, annotation guidelines, hashed visual-evidence log, and typed timeline, while flagging full double annotation and inter-annotator agreement as planned work.

2606.19864 2026-06-19 cs.CL 新提交

The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI

近乎智能的革命：扩大审议规模并利用AI赋能人类的选项

Serge Sharoff

AI总结探讨大型语言模型如何通过系统功能语言学视角扩大民主审议规模，增强包容性并赋权边缘群体，同时警惕过度承诺与低估风险。

Comments Published in /Handbook of Democracy in the Era of Artificial Intelligence/ edited by Evangelos Pournaras, Srijoni Majumdar, Carina Ines Hausladen, and Dirk Helbing. 2026

AI中文摘要

大型语言模型在公共话语中的日益突出为民主审议带来了机遇和挑战。虽然红队策略有助于缓解特定风险，但关于语言限制、偏见和LLM的谄媚倾向等更广泛的担忧仍然存在。本章探讨如何利用LLM显著扩大和民主化审议，特别是在促进包容性和赋权传统边缘群体方面。借鉴系统功能语言学的概念，本章考察了语言使用者之间的差异（例如，关于社会人口群体）和语言使用中的差异（例如，关于交际功能）如何影响AI支持的审议参与。本章介绍了AI驱动的审议研究，并评估了它们在支撑论证、增强可及性以及减少嵌入在声望语域中的排斥性语言规范和偏见的影响方面的潜力。同时，本章警告不要过度承诺（导致不切实际的期望）和低估承诺（冒着错失AI辅助参与机会的风险）。最后，本章确定了未来的研究方向，以最大化AI辅助参与的民主潜力，同时嵌入伦理保障以抵消语言不平等的再生产。

英文摘要

The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation. While red teaming strategies help mitigate specific risks, broader concerns persist regarding linguistic constraints, biases, and the sycophantic tendencies of LLMs. This chapter explores how LLMs can be used to significantly scale up and democratise deliberation, particularly in fostering inclusivity and empowering traditionally marginalised groups. Drawing on concepts from Systemic-Functional Linguistics, the chapter examines how variations across language users (for example, with respect to socio-demographic groups) and across language use (for example, with respect to communicative functions) shape participation in AI-supported deliberation. The chapter presents AI-driven deliberation studies and assesses their potential to scaffold argumentation, enhance access, and reduce the influence of exclusionary linguistic norms and biases which are embedded in prestigious registers. At the same time, the chapter cautions against both overclaiming, which leads to unrealistic expectations, and underclaiming, which risks missed opportunities for AI-assisted engagement. The chapter concludes by identifying future research directions to maximise the democratic potential of AI-assisted participation while embedding ethical safeguards to counteract the reproduction of linguistic inequalities.

2606.20198 2026-06-19 cs.CL 新提交

世界模型批判：一种用于世界建模的生成式潜在预测架构

Eric Xing, Mingkai Deng, Jinyu Hou

AI总结本文从心理学“假设性思维”出发，提出世界模型的核心目标是模拟真实世界的所有可行动可能性，并设计了一种基于状态化、分层、多级、混合连续/离散表示的生成式潜在预测（GLP）架构。

AI中文摘要

世界模型，即生物智能体所经历并对其采取行动的真实世界环境的算法模拟器，近年来因开发具有人工（通用）智能的虚拟智能体的需求日益增长而成为一个新兴课题。关于世界模型究竟是什么、如何构建、如何使用以及如何评估，已有许多讨论。本文从著名科幻经典《沙丘》中的想象出发，并借鉴心理学文献中“假设性思维”的概念，论证世界模型的主要目标是模拟真实世界中所有可行动的可能性，以进行有目的的推理和行动。我们审视了世界建模的关键设计维度：数据、表示、架构、学习目标和使用，调查了现有方法并分析了它们的权衡。在此基础上，我们提出了一种新的通用世界模型生成式潜在预测（GLP）架构，基于有状态的、分层的、多层次的、混合连续/离散表示，以及生成式和自监督学习框架，并展望了由这种模型支持的物理、智能体和嵌套（PAN）AGI系统。

英文摘要

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2606.18941 2026-06-19 cs.PL cs.CL 版本更新

ESBMC-GraphPLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

ESBMC-GraphPLC：使用基于SMT的模型检查对图形化PLCopen XML梯形图程序进行形式化验证

Pierre Dantas, Lucas Cordeiro, Waldir Junior

发表机构 * Computer Science, The University of Manchester（计算机科学，曼彻斯特大学）； Electrical Engineering, Federal University of Amazonas (UFAM)（电气工程，亚马逊联邦大学（UFAM））

AI总结针对ESBMC-PLC无法处理图形化PLCopen XML梯形图的问题，提出基于DFS的图形LD解析器，将连接图转换为布尔触点合取，并采用三级I/O推断方案，成功实现完整GOTO IR转换，验证了3个图形LD程序。

Comments 18 pages

AI中文摘要

PLCopen XML为IEC 61131-3梯形图程序定义了两种编码格式：一种使用<rung>元素的文本编码，另一种将梯形逻辑表示为localId/refLocalId连接的有向图的图形编码。ESBMC-PLC此前支持文本格式，但将来自CONTROLLINO、Beremiz和OpenPLC Editor的图形导出解析为空GOTO中间表示，导致空洞的验证成功。本文提出ESBMC-GraphPLC，通过基于DFS的图形LD解析器填补了这一空白。该解析器从leftPowerRail遍历连接图到每个线圈，将梯形路径提取为布尔触点合取，并应用三级I/O推断方案。按rightPowerRail的connectionPointIn序列对线圈排序，确保SET线圈在RESET线圈之前处理，匹配IEC扫描周期语义。图形到IR的转换无需改动ESBMC后端。在来自CONTROLLINO/OpenPLC Editor的3个图形LD程序上的验证表明，所有程序都生成了包含非确定性输入和梯形逻辑的完整GOTO IR，而之前生成的是空IR。所有3个程序在k=2时在70ms内验证为SAFE。11个文本LD基准测试完全保留，无回归。两个不含LD内容或不支持定时器语义的Beremiz示例被报告为发现的局限性。工件位于Zenodo（DantasCordeiro2026graphical，doi:10.5281/zenodo.20699856）。

英文摘要

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents ESBMC-GraphPLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).

2511.23071 2026-06-19 cs.CV cs.AI cs.CL 版本更新

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

发表机构 * Indian Institute of Technology Jodhpur（印度理工学院焦特布尔分校）

Comments Accepted in International Journal on Document Analysis and Recognition (IJDAR)

Journal ref International Journal on Document Analysis and Recognition (IJDAR), 2026

2406.15465 2026-06-19 cs.CL cs.AI 版本更新

RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models

Daniel Reichenpfader, Jonas Knupp, André Sander, Kerstin Denecke

发表机构 * Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland（以患者为中心的数字健康研究所，伯尔尼应用科学大学，比尔，瑞士）； ID Suisse AG, St. Gallen, Switzerland（ID Suisse AG，圣加仑，瑞士）

2306.12679 2026-06-19 cs.CL 版本更新

Constructing Colloquial Dataset for Persian Sentiment Analysis of Social Microblogs

Mojtaba Mazoochi, Leila Rabiei, Farzaneh Rahmani, Zeinab Rajabi

发表机构 * Faculty member in ICT Research Institute（ICT研究所教员）； Iran Telecommunication Research Center (ITRC)（伊朗电信研究中心）； Faculty member in Computer Department（计算机系教员）； Mehralborz University（梅赫拉布尔兹大学）； Hazrat-e Masoumeh University（玛苏姆大学）

Journal ref Multimedia Tools and Applications, 2025

1. 大语言模型与基础模型 27 篇

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

Where Does Social Reasoning Come From? Capability Provenance in Language Models

Code-Switching Reveals Language Anchoring in Multilingual LLMs

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

Large Language Models Do Not Always Need Readable Language

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Diffusion Language Models: An Experimental Analysis

Efficiently Representing Algorithms With Chain-of-Thought Transformers

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

Token-Operations-Oriented Inference Optimization Techniques for Large Models

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

A Survey of On-Policy Distillation for Large Language Models

2. 机器翻译与跨语言处理 4 篇

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

3. 信息抽取、检索与问答 9 篇

Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts

Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs

Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

Source-Grounded Data Generation for Text-to-JSON Learning

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

4. 对话系统与智能体 15 篇

Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

Uncertainty Decomposition for Clarification Seeking in LLM Agents

Multi-Agent Transactive Memory

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Quality Over Clicks: Iterative Reinforcement Learning for Early-Stage E-Commerce Query Suggestion

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

CogniFold: Always-On Proactive Memory via Cognitive Folding

5. 文本生成、摘要与编辑 3 篇

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

6. 语义、语法与语言学分析 2 篇

MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

7. 多模态语言处理 8 篇

LaViSA: A Language and Vision Structural Ambiguity Benchmark

MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Vero: An Open RL Recipe for General Visual Reasoning

8. 语音语言联合与音频文本 8 篇

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning