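arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.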

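2606.19379 2026-06-19 cs.LG cs.AI cs.CL cross-listed

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Stuart Whipp

Affiliations: Independent Research

AI summary: Using exact least-squares linear approximations, the paper measures the linear recoverability of each feed-forward block in trained Transformers and finds it highly heterogeneous and non-monotone with depth; it is a learned property rather than an architectural one, and can be used for compression and diagnostics.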

Comments 14 pages, 5 figures

Abstract

Transformer feed-forward networks (FFNs) are often treated as nonlinear stores of computation, yet how nonlinear a trained FFN block actually is has rarely been measured. We treat each FFN as a position-wise input-to-output map and split it into the exact least-squares linear approximation plus a residual. The held-out variance the closed-form linear map explains defines a block's linear recoverability (R^2_lin), an optimiser-free measure of its linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterogeneous and non-monotone with depth, ranging from near-linear (>0.99) to strongly nonlinear (<0.3) between adjacent blocks, and is not set by the activation function: same-width GELU models GPT-2 and Pythia-160m have sharply different profiles, so recoverability is a learned property of individual trained blocks, not an architectural one. A low-rank bilinear probe of the residual recovers only a few points of R^2, with gain uncorrelated with residual nonlinearity: the unrecovered computation is not a single position-wise product but higher-order or distributed structure. The measurement also serves as a targeted compression signal: recoverable blocks admit large single-layer replacements (GPT-2's early FFN at 8x fewer parameters for +0.77 perplexity), while low-recoverability blocks flag where this is unsafe. It further exposes a methodological pitfall: trained linear baselines can badly under-converge on ill-conditioned transformer activations, so we report the exact closed-form least-squares ceiling throughout.

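2606.19475 2026-06-19 cs.AI cs.CL cross-listed

Diffusion Language Models: An Experimental Analysis

Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi

Affiliations: University of Modena and Reggio Emilia; University of Pisa

AI summary: This paper systematically compares eight diffusion language models on reasoning, coding, translation, and other tasks, analyzes how inference-time factors such as the number of denoising steps and context length affect performance and efficiency, and shows the trade-offs diffusion language models make across tasks and budgets.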

Abstract

Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.

2606.19697 2026-06-19 cs.LG cs.AI cs.CL cross-listed

Efficiently Representing Algorithms With Chain-of-Thought Transformers

Yanhong Li, Anej Svete, Ashish Sabharwal, William Merrill

Affiliations: Allen Institute for AI; ETH Zürich

AI summary: The paper proves that chain-of-thought Transformers can simulate Word RAM algorithms, including sorting and Dijkstra's algorithm, with only poly-logarithmic overhead, improving on the quadratic overhead of Turing-machine simulations.

Abstract

The increasing popularity of reasoning models -- language models that output a series of reasoning or thought tokens before producing an answer -- is justified, in part, by theoretical results showing that chain-of-thought (CoT) transformers can simulate Turing machines, and thus perform arbitrary computation. However, the Turing machine, while suitable for complexity-theoretic analysis, is not convenient, intuitive, or efficient for discussing algorithms. Algorithms are typically designed and analyzed at a higher level of abstraction, captured by the Word RAM model with random-access memory and unit-cost operations on $O(\log n)$-bit words. As a result, Word RAM algorithms can be substantially more efficient than their Turing machine counterparts, raising the question: Can CoT transformers efficiently simulate Word RAM algorithms? For instance, can they sort $n$ items in $O(n \log n)$ steps or run Dijkstra's algorithm in $O(E + V \log V)$ steps? We answer affirmatively, up to poly-logarithmic overhead. We first establish this for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then strengthen the result to two more practical settings with finite width and log-precision: continuous CoT, where reasoning takes the form of vectors rather than tokens, and a hybrid architecture in which transformer layers sit atop a recurrent (linear RNN) layer. In all three cases, we find that CoT can efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in $n$. This overhead reduces to log-square when the Word RAM has a "flat" instruction set, and only logarithmic for multiplication-free flat instructions -- in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.
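To make the overhead claims concrete, they can be written out as follows. This is only a restatement of the abstract under the assumption that "overhead" means a multiplicative factor on the Word RAM running time $T(n)$; the symbol $T_{\mathrm{CoT}}$ and the unspecified constant $c$ are notation introduced here, not taken from the paper.

```latex
\begin{align*}
T_{\mathrm{CoT}}(n) &= O\!\left(T(n)\,\log^{c} n\right) && \text{general Word RAM, for some constant } c,\\
T_{\mathrm{CoT}}(n) &= O\!\left(T(n)\,\log^{2} n\right) && \text{flat instruction set},\\
T_{\mathrm{CoT}}(n) &= O\!\left(T(n)\,\log n\right) && \text{multiplication-free flat instructions}.
\end{align*}
```

Under the same assumption, sorting with $T(n) = O(n \log n)$ would cost $O(n \log^{c+1} n)$ CoT steps in the general case and $O(n \log^{2} n)$ in the multiplication-free flat case.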

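2606.19750 2026-06-19 cs.LG cs.AI cs.CL cross-listed

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Darrien McKenzie, Nicklas Hansen, Xiaolong Wang

Affiliations: University of California, San Diego

AI summary: Proposes the Bayesian Manifold Curriculum (BMC) framework, which models problem sampling as a manifold-structured bandit problem and guides sampling with a hierarchical task tree and Bayesian learning, balancing learning signal, diversity, and utility.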

Comments Webpage: https://darrienmckenzie.com/manifold-bandits/

Abstract

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.

2606.19808 2026-06-19 cs.AI cs.CL cross-listed

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

Sajib Acharjee Dip, Dawei Zhou, Liqing Zhang

Affiliations: Department of Computer Science, Virginia Tech; Fralin Biomedical Research Institute, Virginia Tech; FBRI Cancer Research Center

AI summary: Proposes SEVRA, a selective-verification framework in which a serving-layer controller decides whether to verify a frozen solver's initial answer; on Math500 it reaches higher accuracy with fewer tokens and reduces harmful flips.

Abstract

Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes. We study this as a deployment allocation problem rather than a new-verifier problem. We introduce SEVRA, Selective Verification for Reasoning Allocation, a serving-layer controller that decides whether to preserve a frozen solver's initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log intervention outcomes and train recoverability-aware gates from serving-visible attempt state. On Math500, selective verification reaches 76.3% accuracy, compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. However, an 8,192-token initial solve reaches 76.0% accuracy with 28% fewer total model tokens, showing that selective recovery is useful but not the best tested cost frontier. In frozen transfer to GSM, the selective policy verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and reduces verification tokens by 91.2% relative to always verifying; again, a longer initial solve matches its accuracy with fewer realized tokens. On CommonsenseQA, always-on verification hurts, while Self-Consistency@5 improves accuracy at about five times the realized token cost. The resulting deployment rule is: tune the initial budget first, then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.
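The keep-or-verify decision described above might be sketched as a small gate. Everything named here is hypothetical: the `Attempt` fields, the two probability scores, the `margin` threshold, and the toy verifier are assumptions for illustration, not SEVRA's actual features or gate.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    answer: str
    p_recover: float  # hypothetical gate output: chance that verification repairs a wrong answer
    p_harm: float     # hypothetical gate output: chance that verification flips a correct answer


def serve(attempt: Attempt, verify: Callable[[str], str], margin: float = 0.0) -> tuple[str, bool]:
    """Return (final answer, whether verification ran)."""
    if attempt.p_recover - attempt.p_harm > margin:
        return verify(attempt.answer), True
    return attempt.answer, False


def toy_verifier(answer: str) -> str:
    # Stand-in for a real verification pass; it just tags the answer.
    return answer + " (verified)"


confident = Attempt(answer="42", p_recover=0.02, p_harm=0.10)
shaky = Attempt(answer="17", p_recover=0.60, p_harm=0.05)

print(serve(confident, toy_verifier))
print(serve(shaky, toy_verifier))
```

Raising `margin` in this sketch trades recovered answers for fewer verification calls, which loosely mirrors the token and harmful-flip trade-off the abstract reports.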

2606.20075 2026-06-19 cs.LG cs.CL cross-listed

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li, Xiaoyu Shen

Affiliations: Ningbo Institute of Digital Twin, Eastern Institute of Technology; Department of Computing, The Hong Kong Polytechnic University

AI summary: Analyzes supervision failure in latent chain-of-thought from an information-theoretic perspective, proposes two dimensions (trajectory supervision and space supervision), and introduces the Unified Latent Probe (ULP) to quantify information fidelity, revealing an information-performance binding.

Abstract

Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at https://github.com/EIT-NLP/Supervision-in-Latent-CoT.

2606.20295 2026-06-19 cs.SE cs.CL cross-listed

Token-Operations-Oriented Inference Optimization Techniques for Large Models

Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua, Yutong Liu, Jiangze Yan, Xin Wang, Cong Wang, Yilin Zhang, Yi Shen, Jieyun Huang, Fang Zhao, Huanlin Gao, Ping Chen, Xinyu Yang, Kaikai Zhao, Yao Zhao, Xinggang Wang, Huishuai Zhang, Dongyan Zhao, Junping Du, Tao Chen, Xiang Gao, Qinghuai Ma

AI summary: Proposes a four-layer technical architecture (multi-model fusion, model optimization, compute-model fusion, and compute-network-model fusion), surveys the key techniques and industry status at each layer, and aims to lower token cost, improve service efficiency, and ensure supply stability, moving large-model services from merely callable to operable.

Comments 62 pages, 36 figures

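2606.19782 2026-06-19 cs.AI cs.CL cross-listed

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

Aravind Narayanan, Shaina Raza

Affiliations: Vector Institute

AI summary: Proposes AgentFinVQA, a multi-agent pipeline that decomposes each query into steps and records them in a traceable Model Evaluation Packet, achieving auditability and on-premise deployment for financial chart QA and improving accuracy on FinMME by 7.68 percentage points.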

Abstract

Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves $+7.68$ pp over a primary-backbone matched zero-shot baseline with a proprietary backbone (Gemini-3 Flash; 71.24% vs. 63.56%, McNemar $p \approx 1.1 \times 10^{-16}$), and $+4.84$ pp with open-weights Qwen3.6-27B-FP8 served locally. The verifier's verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human-in-the-loop review routing. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.
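For readers unfamiliar with the McNemar test quoted above, a minimal sketch of the exact two-sided version follows. The discordant counts passed to it are hypothetical, since the abstract reports only the resulting p-value; the accuracy figures in the last lines are the ones quoted in the abstract.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts: 10 questions only the pipeline answers correctly,
# 2 questions only the baseline answers correctly.
p_value = mcnemar_exact(b=10, c=2)
print(f"p = {p_value:.4f}")

# The headline gain in percentage points follows directly from the abstract.
gain_pp = round(71.24 - 63.56, 2)
print(f"gain = {gain_pp} pp")
```

The test looks only at questions where the two systems disagree, which is why it suits a paired comparison on the same benchmark items.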

2606.19388 2026-06-19 cs.SE cs.CL cs.HC cross-listed

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

AI summary: Challenges the GUI-dominated paradigm for mobile agents and argues that the CLI deserves equal consideration; experiments show CLI agents surpassing GUI baselines on AndroidWorld and MobileWorld, and a new CLI-Advantage task suite demonstrates their strengths.

Abstract

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8% and 51.9%, outperforming every reproducible GUI baseline (69.3/68.1/57.8% on AndroidWorld; 43.2/26.3/13.3% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the CLI-Advantage Task Suite, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs. 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.
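The oracle percentages can be checked against the task counts given in the same sentence. The short calculation below uses only numbers from the abstract; the "headroom" figure is a derived quantity added here for illustration and assumes the oracle rate and the Claude Code score are measured over the same task sets.

```python
# Counts quoted in the abstract: CLI-solvable tasks out of all tasks per benchmark.
oracle_counts = {
    "AndroidWorld": (103, 116),
    "MobileWorld": (101, 117),
}

oracle_rate = {
    name: round(100 * solved / total, 1)
    for name, (solved, total) in oracle_counts.items()
}
print(oracle_rate)

# Gap between the oracle ceiling and the best reported CLI agent (Claude Code, Opus 4.7).
best_cli = {"AndroidWorld": 71.8, "MobileWorld": 51.9}
headroom = {name: round(oracle_rate[name] - best_cli[name], 1) for name in oracle_rate}
print(headroom)
```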

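2606.19501 2026-06-19 cs.AI cs.CL cs.LG q-fin.RM cross-listed

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He

Affiliations: University of Edinburgh; University of Glasgow; University of Cambridge

AI summary: To address the tendency of LLM agents to raise false alarms in DeFi supervision, proposes DeXposure-Claw, which forecasts exposure networks with a graph time-series foundation model, combines deterministic monitors with confidence gates to produce auditable supervisory tickets, and builds the six-axis evaluation benchmark DeXposure-Bench; experiments validate its effectiveness.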

Abstract

Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure-Claw.

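2606.19559 2026-06-19 cs.AI cs.CL cross-listed

Uncertainty Decomposition for Clarification Seeking in LLM Agents

Gregory Matsnev

Affiliations: AI Talent Hub, ITMO University

AI summary: Proposes a prompt-based uncertainty decomposition that separates action confidence from request uncertainty so that agents can proactively seek clarification when the task specification is ambiguous; on ALFWorld-Clarification, clarification F1 averaged over five LLM backbones improves by 73% over ReAct+UE and by 36% over UAM.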

Comments 26 pages, 8 figures. Source code: https://github.com/PE51K/udcs-in-llm-agents

Abstract

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.
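One way to picture the decomposition is a decision rule that reads the two signals separately, scored with an F1 over ask decisions. The thresholds, the `reconsider` fallback, the toy episodes, and the F1 definition below are all assumptions for illustration; the abstract does not specify how the paper turns the signals into actions or how it computes clarification F1.

```python
def choose_move(action_confidence: float, request_uncertainty: float,
                ask_threshold: float = 0.5, act_threshold: float = 0.5) -> str:
    """Ask when the request looks underspecified; otherwise act only if confident."""
    if request_uncertainty >= ask_threshold:
        return "ask"
    return "act" if action_confidence >= act_threshold else "reconsider"


def clarification_f1(asked: list[bool], underspecified: list[bool]) -> float:
    """F1 of ask decisions against ground-truth underspecification labels."""
    tp = sum(a and u for a, u in zip(asked, underspecified))
    fp = sum(a and not u for a, u in zip(asked, underspecified))
    fn = sum(u and not a for a, u in zip(asked, underspecified))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical episodes: (action confidence, request uncertainty u, truly underspecified?)
episodes = [
    (0.9, 0.8, True), (0.4, 0.7, True), (0.8, 0.1, False),
    (0.3, 0.2, False), (0.7, 0.6, False), (0.9, 0.3, True),
]
moves = [choose_move(conf, u) for conf, u, _ in episodes]
f1 = clarification_f1([m == "ask" for m in moves], [label for _, _, label in episodes])
print(moves, round(f1, 3))
```

Keeping the two signals apart means a confident agent can still ask (first episode) and an unconfident one need not (fourth episode), which is the separation the abstract argues a single uncertainty score cannot express.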

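2606.19911 2026-06-19 cs.AI cs.CL cs.IR cross-listed

Multi-Agent Transactive Memory

To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz

Affiliations: Carnegie Mellon University; University of California, Berkeley

AI summary: Proposes the MATM framework, which stores and retrieves agent trajectories in a shared repository so that heterogeneous agent populations can reuse knowledge, improving downstream task performance and reducing interaction steps.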

Abstract

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.
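The producer and consumer roles described above can be pictured with a very small shared store. This is a sketch under loud assumptions: the class names, the token-overlap (Jaccard) ranking, and the example tasks are invented for illustration, and the abstract does not say which retriever MATM actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    producer: str
    task: str
    steps: list[str]


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


@dataclass
class SharedRepository:
    items: list[Trajectory] = field(default_factory=list)

    def contribute(self, trajectory: Trajectory) -> None:
        """Producer side: add a finished trajectory to the shared store."""
        self.items.append(trajectory)

    def retrieve(self, task: str, k: int = 1) -> list[Trajectory]:
        """Consumer side: rank stored trajectories by Jaccard overlap of task descriptions."""
        query = _tokens(task)

        def score(t: Trajectory) -> float:
            other = _tokens(t.task)
            union = query | other
            return len(query & other) / len(union) if union else 0.0

        return sorted(self.items, key=score, reverse=True)[:k]


repo = SharedRepository()
repo.contribute(Trajectory("agent-a", "put a clean mug on the shelf",
                           ["find mug", "wash mug", "place mug on shelf"]))
repo.contribute(Trajectory("agent-b", "heat an egg in the microwave",
                           ["find egg", "open microwave", "heat egg"]))

best = repo.retrieve("put a clean plate on the shelf", k=1)[0]
print(best.producer, best.steps)
```

The consumer never coordinates with the producer in this sketch; it only reads from the store, which matches the "without coordination or joint training" framing in the abstract.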

2606.20002 2026-06-19 cs.LG cs.AI cs.CL cross-listed

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou

Affiliations: Alibaba Group

AI summary: Proposes the Connect the Dots framework, which uses end-to-end reinforcement learning to train LLMs to self-update their context over long task sequences and generalize to new domains; experiments validate the cross-domain generalization.

Comments Work in progress; we will continuously update the codebase and arXiv version

Abstract

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization -- within the training domains, across different domains, and from CoD to Ralph-loop settings -- of the elicited meta-capability. Our investigation of CoD connects several lines of prior work, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.

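2606.20138 2026-06-19 cs.AI cs.CL cs.HC cs.LG cross-listed

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

Affiliations: Leiden University; FutureWhiz

AI summary: Proposes a subject-aware prompt routing model based on 14 pedagogical features; trained in simulation and evaluated with online A/B testing, it switches strategies adaptively in high-school tutoring, improving instructional efficiency and reducing the number of interaction turns.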

Abstract

LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines ($0.694$ vs. $0.647$ and $0.64$, $p<0.001$). A/B testing ($N=656$ conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality, and reduces interactions by around 3 turns ($p=0.007$). While a greedy router achieves an exercise conversion rate comparable to the baseline ($19.1\%$ vs. $19.6\%$), a stochastic router that samples strategies leads to a higher conversion rate ($28.1\%$).

2606.20529 2026-06-19 cs.AI cs.CL cross-listed

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Md Nayem Uddin, Amir Saeidi, Eduardo Blanco, Chitta Baral

Affiliations: Arizona State University; University of Arizona

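AI summary: For policy-adherent tool-calling agents in customer service, proposes LedgerAgent, which maintains task state in a separate ledger, renders it into the prompt, and checks state-dependent policy constraints before executing tool calls, improving multi-trial consistency.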

Comments Work in Progress

Abstract

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time they decide what to do next. This design makes state management implicit, creating two common failure modes. An agent may retrieve the right facts but later ground its decision in stale, missing, or incorrect information; and a syntactically valid tool call may still violate a domain policy that depends on the current task state. We introduce LedgerAgent, an inference-time method for tool-calling agents that maintains observed task states in a separate ledger and renders the states into the prompt. The ledger is also used to check state-dependent policy constraints before environment-changing tool calls are executed, blocking policy violations. Across four customer-service domains and a mixed panel of open- and closed-weight models, LedgerAgent improves average pass^k over a standard prompt-based tool-calling approach, with the largest gains under stricter multi-trial consistency metrics.
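
The idea of keeping task state in a separate ledger, rendering it into the prompt, and checking policies before a tool call runs can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Ledger` class, its method names, the fact-key format, the `issue_refund` tool, and the refund policy are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A policy inspects the ledger facts and a proposed tool call; it returns a
# violation message, or None when the call is allowed.
Policy = Callable[[Dict[str, str], str, Dict[str, str]], Optional[str]]


@dataclass
class Ledger:
    """Explicit task state kept outside the prompt (illustrative)."""

    facts: Dict[str, str] = field(default_factory=dict)
    policies: List[Policy] = field(default_factory=list)

    def observe(self, key: str, value: str) -> None:
        """Record or overwrite a fact seen in a user turn or tool return."""
        self.facts[key] = value

    def render(self) -> str:
        """Render the current state as text to place in the prompt."""
        lines = [f"- {key}: {value}" for key, value in sorted(self.facts.items())]
        return "Task state:\n" + "\n".join(lines)

    def violations(self, tool: str, args: Dict[str, str]) -> List[str]:
        """Collect policy violations for a proposed environment-changing call."""
        messages = []
        for policy in self.policies:
            message = policy(self.facts, tool, args)
            if message is not None:
                messages.append(message)
        return messages


def refund_requires_delivery(facts, tool, args):
    """Hypothetical policy: refunds are allowed only for delivered orders."""
    if tool != "issue_refund":
        return None
    status = facts.get(f"order:{args['order_id']}:status")
    if status != "delivered":
        return f"refund blocked: order {args['order_id']} is not delivered"
    return None


ledger = Ledger(policies=[refund_requires_delivery])
ledger.observe("order:A1:status", "shipped")
blocked = ledger.violations("issue_refund", {"order_id": "A1"})

ledger.observe("order:A1:status", "delivered")
allowed = ledger.violations("issue_refund", {"order_id": "A1"})
```

Because the check reads the ledger rather than the prompt, a syntactically valid call is still blocked when the recorded state does not satisfy the policy, and it goes through once the state is updated.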

2606.19626 2026-06-19 cs.AI cs.CL cross-listed

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

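AI summary: Using acoustic degradation, prosodic errors, and speaker-characteristic perturbations, finds that MOS prediction models are sensitive to acoustic degradation but insensitive to prosodic errors, and that they show a bias toward mean F0 while being insensitive to speaking rate and F0 variability.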

Comments Accepted to INTERSPEECH 2026

Abstract

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

2606.19996 2026-06-19 cs.SD cs.CL cross-listed

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

Yongqi Shao, Hong Huo, Flavio Bertini, Danilo Montesi, Tao Fang

Affiliations: School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Key Laboratory of System Control and Information Processing, Ministry of Education of China; Shanghai Key Laboratory of Perception and Control in Industrial Network Systems; Department of Computer Science and Engineering, University of Bologna; Department of Mathematical, Physical and Computer Sciences, University of Parma

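AI summary: Proposes a segment-level representation learning framework that combines an autoencoder with contrastive learning. It achieves stable binary and three-class cognitive impairment detection on four Mandarin datasets, with notable gains in the clinically difficult three-class setting.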

Comments 15 pages, 7 figures, 5 tables

Abstract

Background and Objective: Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset variability remain major challenges for robust speech-based screening systems.
Methods: We developed a segment-level representation learning framework for speech-based cognitive impairment detection. Speech recordings were divided into short segments and converted into spectrogram representations. To improve robustness under limited-data conditions, offline and online augmentation strategies were combined with autoencoder-based representation learning and contrastive objectives to enhance discriminative latent representations.
Results: Experiments conducted on four independent Mandarin Chinese speech datasets demonstrated stable and competitive performance in both binary and three-class classification tasks, with particularly notable improvements in the clinically challenging three-class setting. Ablation studies further supported the effectiveness of the proposed framework.
Conclusions: The findings suggest that segment-level speech representation learning may provide a scalable and practical approach for cognitive impairment screening in resource-constrained clinical settings.

2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD cross-listed

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

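AI summary: Proposes PASQA, a model trained on a synthetic accent-controllable dataset with pseudo accent-quality scores. It combines self-supervised representations, mora-conditioned fusion, and other training strategies to assess pitch-accent correctness, outperforming conventional MOS models.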

Comments Accepted to INTERSPEECH 2026

Abstract

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.

2606.19558 2026-06-19 cs.LG cs.CL cross-listed

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos

Affiliations: ByteShape; University of Toronto; Vector Institute for Artificial Intelligence

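AI summary: Studies how fidelity metrics such as KL divergence correlate with downstream benchmark scores for quantized language models. The correlation is strong over the full cohort but collapses near the baseline, which the authors attribute to KL divergence measuring mainly the volume of disagreement rather than its direction.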

Abstract

Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated across a suite of downstream benchmarks. We find that KLD is strongly correlated with benchmark score over the full cohort ($ρ=-0.72$ on Qwen and $ρ=-0.86$ on Devstral, both with $p<0.001$). However, this relationship collapses to non-significance in the near-baseline silent zone ($ρ=+0.00$ on Qwen and $ρ=-0.24$, $p=0.36$, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code, with failed-vs-passed geometric-mean ratios in $[1.08,1.22]$ across five models on LiveCodeBench, and fails as a cross-model router, achieving only $42.3\%-49.4\%$ accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference, with silent-zone composite $ρ=+0.94$ ($p<0.001$) on Qwen and $+0.55$ ($p=0.03$) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.
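
Per-token KL divergence against a reference can be computed directly from the two models' logits. The sketch below is a plain-Python illustration assuming logits are given as lists of floats per token position; the function names are hypothetical. The toy case at the end is not the paper's decomposition, only a two-token example of why the size of a divergence says nothing about whether probability moved toward or away from the correct token.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_z for x in logits]


def token_kld(ref_logits, quant_logits):
    """KL(reference || quantized) at a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_token_kld(ref_seq, quant_seq):
    """Average per-token KLD over aligned sequences of positions."""
    per_token = [token_kld(r, q) for r, q in zip(ref_seq, quant_seq)]
    return sum(per_token) / len(per_token)


# Toy two-token vocabulary; suppose token 0 is the correct continuation.
reference = [0.0, 0.0]  # reference model is undecided
toward = [1.0, 0.0]     # hypothetical quant that shifts mass to token 0
away = [0.0, 1.0]       # hypothetical quant that shifts mass to token 1

kld_toward = token_kld(reference, toward)
kld_away = token_kld(reference, away)
```

Both hypothetical quants sit at the same KLD from the reference, yet one moves toward the correct token and the other away from it.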

2606.19706 2026-06-19 cs.CV cs.CL cross-listed

NEST: Narrative Event Structures in Time for Long Video Understanding

Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas

Affiliations: Department of Computer Science, Virginia Tech

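AI summary: Introduces the NEST dataset (1005 full-length movies) with multimodal narrative event annotations and relation links, used to evaluate how well models understand event structure, temporal ordering, and long-range dependencies in long videos. Experiments show that event detection and related tasks are highly challenging.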

Abstract

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress; for example, whether a model can connect an early setback, such as a job loss, to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures multimodal narrative events with structured annotations grounded in visual content, dialogue, and audio, and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

2606.19719 2026-06-19 cs.IR cs.CL cs.LG cross-listed

NAMESAKES: Probing Identity Memorization in Text-to-Image Models

Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang, Hadar Averbuch-Elor

Affiliations: Carnegie Mellon University; Tel Aviv University; Cornell University

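AI summary: Proposes a black-box behavioral probe that, without reference photos or training data, distinguishes whether an image generated by a text-to-image model reflects memorization or fabrication, and validates it on the NAMESAKES dataset.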

1. Large Language Models and Foundation Models (7 papers)

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Diffusion Language Models: An Experimental Analysis

Efficiently Representing Algorithms With Chain-of-Thought Transformers

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

Token-Operations-Oriented Inference Optimization Techniques for Large Models

2. Information Extraction, Retrieval, and Question Answering (1 paper)

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

3. Dialogue Systems and Agents (7 papers)

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

Uncertainty Decomposition for Clarification Seeking in LLM Agents

Multi-Agent Transactive Memory

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

4. Semantics, Syntax, and Linguistic Analysis (1 paper)

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

5. Multimodal Language Processing (2 papers)

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

6. Joint Speech-Language and Audio-Text Processing (3 papers)

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

7. Evaluation, Datasets, and Benchmarks (8 papers)

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

NEST: Narrative Event Structures in Time for Long Video Understanding

Closing the Calibration Gap in Semantic Caching

Benchmarking Agentic Review Systems

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Generative Engine Optimization at Scale: Measuring Brand Visibility Across AI Search Engines

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

8. Safety, Privacy, Fairness, and Interpretable NLP (5 papers)

Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

NAMESAKES: Probing Identity Memorization in Text-to-Image Models