arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Large Language Models and Foundation Models (7 papers)

2606.19379 2026-06-19 cs.LG cs.AI cs.CL cross-listed

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Stuart Whipp

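Affiliations: Independent Research

AI summary: Using an exact least-squares linear approximation, the paper measures the linear recoverability of each feed-forward block in trained Transformers and finds it highly heterogeneous and non-monotone with depth. Recoverability is a learned property rather than an architectural one, and it can also guide compression and diagnosis.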

Comments 14 pages, 5 figures

Abstract

Transformer feed-forward networks (FFNs) are often treated as nonlinear stores of computation, yet how nonlinear a trained FFN block actually is has rarely been measured. We treat each FFN as a position-wise input-to-output map and split it into the exact least-squares linear approximation plus a residual. The held-out variance the closed-form linear map explains defines a block's linear recoverability (R^2_lin), an optimiser-free measure of its linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterogeneous and non-monotone with depth, ranging from near-linear (>0.99) to strongly nonlinear (<0.3) between adjacent blocks, and is not set by the activation function: same-width GELU models GPT-2 and Pythia-160m have sharply different profiles, so recoverability is a learned property of individual trained blocks, not an architectural one. A low-rank bilinear probe of the residual recovers only a few points of R^2, with gain uncorrelated with residual nonlinearity: the unrecovered computation is not a single position-wise product but higher-order or distributed structure. The measurement also serves as a targeted compression signal: recoverable blocks admit large single-layer replacements (GPT-2's early FFN at 8x fewer parameters for +0.77 perplexity), while low-recoverability blocks flag where this is unsafe. It further exposes a methodological pitfall: trained linear baselines can badly under-converge on ill-conditioned transformer activations, so we report the exact closed-form least-squares ceiling throughout.
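
The linear-recoverability measurement described in this abstract can be illustrated with a toy sketch. The Python below is not the paper's code: it assumes a scalar input and output in place of the position-wise vector map of a real FFN, and the names `fit_linear`, `r2_lin`, `TRAIN`, and `TEST` are hypothetical. It fits the closed-form least-squares line on one set of points and reports the held-out R^2 on another.

```python
from statistics import fmean


def fit_linear(xs, ys):
    """Closed-form least-squares fit of y ~ a*x + b (scalar toy case)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def r2_lin(f, train_xs, test_xs):
    """Held-out R^2 of the linear fit to f: 1 - SS_res / SS_tot on test points."""
    a, b = fit_linear(train_xs, [f(x) for x in train_xs])
    ys = [f(x) for x in test_xs]
    my = fmean(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(test_xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: interleaved grids on [-2, 2] for fitting and held-out scoring.
TRAIN = [i / 10 for i in range(-20, 21, 2)]
TEST = [i / 10 for i in range(-19, 20, 2)]

near_linear = r2_lin(lambda x: 3 * x + 1, TRAIN, TEST)   # affine map
nonlinear = r2_lin(lambda x: x * x, TRAIN, TEST)         # symmetric quadratic
```

On this toy data the affine map is fully recoverable, while the quadratic on a symmetric grid gives a held-out R^2 at or below zero, because the best-fitting line is flat there. A real measurement would replace the scalar fit with a multivariate least-squares solve over collected FFN inputs and outputs.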

2606.19475 2026-06-19 cs.AI cs.CL cross-listed

Diffusion Language Models: An Experimental Analysis

Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi

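Affiliations: University of Modena and Reggio Emilia; University of Pisa

AI summary: This paper systematically compares eight diffusion language models on reasoning, coding, translation, and other tasks. It analyzes how inference-time factors such as the number of denoising steps and context length affect performance and efficiency, and it characterizes the trade-offs these models make across tasks and budgets.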

Abstract

Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.

2606.19697 2026-06-19 cs.LG cs.AI cs.CL cross-listed

Efficiently Representing Algorithms With Chain-of-Thought Transformers

Yanhong Li, Anej Svete, Ashish Sabharwal, William Merrill

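Affiliations: Allen Institute for AI; ETH Zürich

AI summary: This paper proves that chain-of-thought Transformers can efficiently simulate Word RAM algorithms, including sorting and Dijkstra's algorithm, with poly-logarithmic overhead, improving on the quadratic overhead of simulating Turing machines.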

Abstract

The increasing popularity of reasoning models -- language models that output a series of reasoning or thought tokens before producing an answer -- is justified, in part, by theoretical results showing that chain-of-thought (CoT) transformers can simulate Turing machines, and thus perform arbitrary computation. However, the Turing machine, while suitable for complexity-theoretic analysis, is not convenient, intuitive, or efficient for discussing algorithms. Algorithms are typically designed and analyzed at a higher level of abstraction, captured by the Word RAM model with random-access memory and unit-cost operations on $O(\log n)$-bit words. As a result, Word RAM algorithms can be substantially more efficient than their Turing machine counterparts, raising the question: Can CoT transformers efficiently simulate Word RAM algorithms? For instance, can they sort $n$ items in $O(n \log n)$ steps or run Dijkstra's algorithm in $O(E + V \log V)$ steps? We answer affirmatively, up to poly-logarithmic overhead. We first establish this for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then strengthen the result to two more practical settings with finite width and log-precision: continuous CoT, where reasoning takes the form of vectors rather than tokens, and a hybrid architecture in which transformer layers sit atop a recurrent (linear RNN) layer. In all three cases, we find that CoT can efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in $n$. This overhead reduces to log-square when the Word RAM has a "flat" instruction set, and only logarithmic for multiplication-free flat instructions -- in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.

2606.19750 2026-06-19 cs.LG cs.AI cs.CL cross-listed

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Darrien McKenzie, Nicklas Hansen, Xiaolong Wang

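Affiliations: University of California, San Diego

AI summary: Proposes the Bayesian Manifold Curriculum (BMC) framework, which models problem sampling as a manifold-structured bandit problem and uses a hierarchical task tree with Bayesian learning to guide sampling, balancing learning signal, diversity, and utility.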

Comments Webpage: https://darrienmckenzie.com/manifold-bandits/

Abstract

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.

2606.19808 2026-06-19 cs.AI cs.CL cross-listed

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

Sajib Acharjee Dip, Dawei Zhou, Liqing Zhang

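Affiliations: Department of Computer Science, Virginia Tech; Fralin Biomedical Research Institute, Virginia Tech; FBRI Cancer Research Center

AI summary: Proposes SEVRA, a selective verification framework in which a serving-layer controller decides whether to verify a frozen solver's initial answer. On Math500 it reaches higher accuracy than always verifying while using fewer tokens and causing fewer harmful flips.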

Abstract

Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes. We study this as a deployment allocation problem rather than a new-verifier problem. We introduce SEVRA, Selective Verification for Reasoning Allocation, a serving-layer controller that decides whether to preserve a frozen solver's initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log intervention outcomes and train recoverability-aware gates from serving-visible attempt state. On Math500, selective verification reaches 76.3% accuracy, compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. However, an 8,192-token initial solve reaches 76.0% accuracy with 28% fewer total model tokens, showing that selective recovery is useful but not the best tested cost frontier. In frozen transfer to GSM, the selective policy verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and reduces verification tokens by 91.2% relative to always verifying; again, a longer initial solve matches its accuracy with fewer realized tokens. On CommonsenseQA, always-on verification hurts, while Self-Consistency@5 improves accuracy at about five times the realized token cost. The resulting deployment rule is: tune the initial budget first, then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.
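
The trade-off this abstract describes (accuracy, harmful flips, and verification tokens under a keep-or-verify gate) can be made concrete with a small accounting sketch. Everything here is an assumption for illustration: the `Attempt` record, the `gate_score` field, the threshold rule, and the four sample records are hypothetical and do not reproduce SEVRA's gates or the paper's data.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    initial_correct: bool    # was the frozen solver's first answer right?
    verified_correct: bool   # would the answer be right after verification?
    gate_score: float        # hypothetical gate output; higher = verify
    verify_tokens: int       # tokens spent if verification is invoked


def evaluate(attempts, threshold):
    """Return (accuracy, harmful_flip_rate, verification_tokens) for a threshold gate."""
    correct = harmful = tokens = 0
    for a in attempts:
        verify = a.gate_score >= threshold
        final = a.verified_correct if verify else a.initial_correct
        correct += int(final)
        # A harmful flip: verification turned a correct answer into a wrong one.
        harmful += int(verify and a.initial_correct and not a.verified_correct)
        tokens += a.verify_tokens if verify else 0
    n = len(attempts)
    return correct / n, harmful / n, tokens


# Made-up log of four attempts.
LOG = [
    Attempt(True, True, 0.1, 100),
    Attempt(True, False, 0.2, 100),
    Attempt(False, True, 0.9, 100),
    Attempt(False, False, 0.8, 100),
]

always = evaluate(LOG, threshold=0.0)      # verify everything
selective = evaluate(LOG, threshold=0.5)   # verify only high-score attempts
```

In this toy log the selective gate keeps the two already-correct answers untouched, so it avoids the harmful flip and spends half the verification tokens. Whether a real gate achieves this depends on how well its score separates recoverable failures from correct answers.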

2606.20075 2026-06-19 cs.LG cs.CL cross-listed

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li, Xiaoyu Shen

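Affiliations: Ningbo Institute of Digital Twin, Eastern Institute of Technology; Department of Computing, The Hong Kong Polytechnic University

AI summary: This paper analyzes supervision failure in latent chain-of-thought from an information-theoretic perspective, proposes two dimensions of supervision (trajectory and space), and introduces the Unified Latent Probe (ULP) to quantify information fidelity, revealing an information-performance binding.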

Abstract

Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at https://github.com/EIT-NLP/Supervision-in-Latent-CoT.

2606.20295 2026-06-19 cs.SE cs.CL cross-listed

Token-Operations-Oriented Inference Optimization Techniques for Large Models

Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua, Yutong Liu, Jiangze Yan, Xin Wang, Cong Wang, Yilin Zhang, Yi Shen, Jieyun Huang, Fang Zhao, Huanlin Gao, Ping Chen, Xinyu Yang, Kaikai Zhao, Yao Zhao, Xinggang Wang, Huishuai Zhang, Dongyan Zhao, Junping Du, Tao Chen, Xiang Gao, Qinghuai Ma

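AI summary: This paper proposes a four-layer technical architecture (multi-model fusion, model optimization, compute-model fusion, and compute-network-model fusion) and systematically surveys the key technologies and industry status at each layer. The aim is to reduce token costs, improve service efficiency, ensure supply stability, and move large model services from merely callable to operable.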

Comments 62 pages, 36 figures

Abstract

Large model inference optimization serves as a key foundation for supporting the scalable, low-cost, and highly stable operation of large model services. Centered on token-oriented inference optimization technology, this paper proposes for the first time a four-layer technical architecture consisting of Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It systematically reviews the key technologies and current industry status across these four levels and analyzes the application value of related technologies in real-world business scenarios. This paper provides a practical technical path for reducing token production costs, improving token service efficiency, ensuring the stability of token supply, and driving the transition of large model services from being merely callable to being operable.

2. Information Extraction, Retrieval, and Question Answering (1 paper)

2606.19782 2026-06-19 cs.AI cs.CL cross-listed

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

Aravind Narayanan, Shaina Raza

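Affiliations: Vector Institute

AI summary: Proposes AgentFinVQA, a multi-agent pipeline that decomposes each query into steps and records them in a traceable Model Evaluation Packet, bringing auditability and on-premise deployment to financial chart QA and improving accuracy on FinMME by 7.68 percentage points.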

Abstract

Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves $+7.68$ pp over a primary-backbone matched zero-shot baseline with a proprietary backbone (Gemini-3 Flash; 71.24% vs. 63.56%, McNemar $p \approx 1.1 \times 10^{-16}$), and $+4.84$ pp with open-weights Qwen3.6-27B-FP8 served locally. The verifier's verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human-in-the-loop review routing. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.
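
The McNemar p-value quoted in this abstract compares the paired correct/incorrect outcomes of two systems on the same samples. A minimal sketch of the exact two-sided form of the test follows. It assumes only the two discordant counts are available; the function name and the example counts are hypothetical and are not taken from the paper, which does not say which variant of the test it used.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: samples the baseline got right and the new system got wrong.
    c: samples the baseline got wrong and the new system got right.
    Under the null hypothesis each discordant pair is a fair coin flip,
    so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 10 samples fixed by the new system, none broken.
p_value = mcnemar_exact(0, 10)
```

Only discordant pairs enter the statistic, which is why the test suits comparisons where both systems answer the same questions and most answers agree.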

3. Dialogue Systems and Agents (7 papers)

2606.19388 2026-06-19 cs.SE cs.CL cs.HC cross-listed

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

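AI summary: This paper challenges the GUI-dominated paradigm for mobile agents and argues that the CLI deserves equal standing. Experiments show CLI agents outperforming GUI baselines on AndroidWorld and MobileWorld, and a new CLI-Advantage task suite demonstrates the CLI's strengths.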

Abstract

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8% and 51.9%, outperforming every reproducible GUI baseline (69.3/68.1/57.8% on AndroidWorld; 43.2/26.3/13.3% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the CLI-Advantage Task Suite, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs. 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.

2606.19501 2026-06-19 cs.AI cs.CL cs.LG q-fin.RM cross-listed

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He

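Affiliations: University of Edinburgh; University of Glasgow; University of Cambridge

AI summary: To address the tendency of LLM agents to raise false alarms in DeFi supervision, proposes DeXposure-Claw, which forecasts exposure networks with a graph time-series foundation model and combines deterministic monitors with confidence gating to produce auditable supervisory tickets. It also introduces the six-axis evaluation benchmark DeXposure-Bench, and experiments support the system's effectiveness.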

Abstract

Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure-Claw.

2606.19559 2026-06-19 cs.AI cs.CL cross-listed

Uncertainty Decomposition for Clarification Seeking in LLM Agents

Gregory Matsnev

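Affiliations: AI Talent Hub, ITMO University

AI summary: Proposes a prompt-based uncertainty decomposition that separates action confidence from request uncertainty, letting agents proactively seek clarification when the task specification is ambiguous. Averaged over five LLM backbones, clarification F1 improves by 36% to 73%.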

Comments 26 pages, 8 figures. Source code: https://github.com/PE51K/udcs-in-llm-agents

Abstract

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.

2606.19911 2026-06-19 cs.AI cs.CL cs.IR cross-listed

Multi-Agent Transactive Memory

To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz

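Affiliations: Carnegie Mellon University; University of California, Berkeley

AI summary: Proposes the MATM framework, which stores and retrieves agent trajectories in a shared repository so that heterogeneous agent populations can reuse knowledge, improving downstream task performance and reducing interaction steps.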

Abstract

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.

2606.20002 2026-06-19 cs.LG cs.AI cs.CL cross-listed

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou

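Affiliations: Alibaba Group

AI summary: Proposes the Connect the Dots framework, which uses end-to-end reinforcement learning to train LLMs to self-update their context over long task sequences and generalize to new domains. Experiments validate the cross-domain generalization.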

Comments Work in progress; we will continuously update the codebase and arXiv version

Abstract

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization -- within the training domains, across different domains, and from CoD to Ralph-loop settings -- of the elicited meta-capability. Our investigation of CoD connects several lines of prior work, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.

2606.20138 2026-06-19 cs.AI cs.CL cs.HC cs.LG cross-listed

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

学习提示:基于自适应LLM的高中辅导提升学生参与度

Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

发表机构 * Leiden University(莱顿大学) FutureWhiz

AI总结 提出一种基于14个教学特征的主题感知提示路由模型,通过模拟训练和在线A/B测试,在高中辅导中实现自适应策略切换,提高教学效率并减少交互轮次。

详情
AI中文摘要

LLMs可以个性化教育,尽管当前的静态提示辅导系统难以适应不同的学科。我们开发并测试了一个具有主题感知提示的系统,该系统基于从原始转录中提取的14个教学特征(例如,辅导支架、学生理解)。我们首先在模拟环境中训练一个提示路由模型,然后将其部署到实际高中学生的在线适应中。模拟基准测试显示,路由器的性能优于两个静态基线($0.694$ vs. $0.647$ 和 $0.64$, $p<0.001$)。A/B测试($N=656$ 次对话,来自359名学生)显示了从模拟到现实的迁移,其中模型从分析策略切换到支架学习策略。我们的自适应提示选择机制提高了教学效率,保持了教学质量,并减少了约3轮交互($p=0.007$)。虽然贪婪路由器的练习转化率与基线相当($19.1\%$ vs. $19.6\%$),但随机采样策略的随机路由器实现了更高的转化率($28.1\%$)。

英文摘要

LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines ($0.694$ vs. $0.647$ and $0.64$, $p<0.001$). A/B testing ($N=656$ conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality and reduces interactions by around 3 turns ($p=0.007$). While a greedy router achieves a comparable exercise conversion rate with the baseline ($19.1\%$ vs. $19.6\%$), a stochastic router that samples strategies leads to a higher conversion rate ($28.1\%$).
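The contrast between the greedy and the stochastic router can be illustrated with a minimal sketch. The abstract does not say how strategies are scored or sampled, so the score dictionary, the softmax sampling rule, the `temperature` parameter, and the strategy name `worked_example` are all assumptions made for illustration; only "analytical" and "scaffolding" come from the abstract.

```python
import math
import random
from typing import Dict, Optional


def greedy_route(scores: Dict[str, float]) -> str:
    """Always pick the strategy with the highest predicted score."""
    return max(scores, key=lambda name: scores[name])


def stochastic_route(
    scores: Dict[str, float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample a strategy with probability proportional to softmax(score / T).

    Assumption: softmax sampling stands in for whatever sampling rule the
    paper's stochastic router actually uses.
    """
    rng = rng or random.Random()
    names = list(scores)
    top = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp((scores[n] - top) / temperature) for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical per-conversation scores for three prompting strategies.
example_scores = {"analytical": 0.2, "scaffolding": 0.9, "worked_example": 0.5}
print(greedy_route(example_scores))
print(stochastic_route(example_scores, temperature=1.0, rng=random.Random(0)))
```

A greedy router never tries a strategy that is currently ranked second, whereas a sampling router keeps exploring, which is one plausible reason the two could differ on conversion rate.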

2606.20529 2026-06-19 cs.AI cs.CL 交叉投稿

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent: 策略遵从工具调用代理的结构化状态

Md Nayem Uddin, Amir Saeidi, Eduardo Blanco, Chitta Baral

发表机构 * Arizona State University(亚利桑那州立大学) University of Arizona(亚利桑那大学)

AI总结 针对客服领域策略遵从工具调用代理,提出LedgerAgent方法,通过独立账本维护任务状态并渲染到提示中,在执行工具调用前检查状态依赖策略约束,提升多次试验间的一致性。

Comments Work in Progress

详情
AI中文摘要

在客服领域,策略遵从工具调用代理必须在跨轮次调用工具时维护任务状态,并遵守领域策略。任务状态包括通过用户交互和工具调用观察到的事实、标识符、约束和条件。在标准代理中,任务状态没有单独表示。观察结果、工具返回和策略指令被放入提示中,使得代理每次决定下一步时都需要从提示中重建相关状态。这种设计使状态管理变得隐式,导致两种常见失败模式:代理可能检索到正确的事实,但后来基于过时、缺失或不正确的信息做出决策;语法上有效的工具调用可能仍然违反依赖于当前任务状态的领域策略。我们引入了LedgerAgent,一种用于工具调用代理的推理时方法,它在单独的账本中维护观察到的任务状态,并将状态渲染到提示中。在执行改变环境的工具调用之前,账本还用于检查状态依赖的策略约束,阻止策略违规。在四个客服领域以及由开放权重和闭源权重模型组成的混合模型组上,LedgerAgent相比标准的基于提示的工具调用方法提高了平均pass^k,在更严格的多次试验(multi-trial)一致性指标下提升最大。

英文摘要

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time they decide what to do next. This design makes state management implicit, creating two common failure modes. An agent may retrieve the right facts but later ground its decision in stale, missing, or incorrect information; and a syntactically valid tool call may still violate a domain policy that depends on the current task state. We introduce LedgerAgent, an inference-time method for tool-calling agents that maintains observed task states in a separate ledger and renders the states into the prompt. The ledger is also used to check state-dependent policy constraints before environment-changing tool calls are executed, blocking policy violations. Across four customer-service domains and a mixed panel of open- and closed-weight models, LedgerAgent improves average pass^k over a standard prompt-based tool-calling approach, with the largest gains under stricter multi-trial consistency metrics.
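The two roles of the ledger described above (rendering state into the prompt, and gating environment-changing tool calls) might be sketched as follows. This is not the LedgerAgent implementation: the `Ledger` class, `guarded_call`, the `issue_refund` tool, and the 30-day refund rule are hypothetical names and policies chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Set

# A check inspects ledger facts and call arguments and returns a violation
# message, or None if the call is allowed. (Hypothetical interface.)
PolicyCheck = Callable[[Dict[str, Any], Dict[str, Any]], Optional[str]]


@dataclass
class Ledger:
    """Explicit task state kept outside the prompt history."""

    facts: Dict[str, Any] = field(default_factory=dict)

    def record(self, key: str, value: Any) -> None:
        # A newer observation overwrites the stale one.
        self.facts[key] = value

    def render(self) -> str:
        # Deterministic text block that can be placed into the prompt.
        lines = [f"- {k}: {v}" for k, v in sorted(self.facts.items())]
        return "TASK STATE\n" + "\n".join(lines)


def guarded_call(
    ledger: Ledger,
    tool_name: str,
    args: Dict[str, Any],
    mutating_tools: Set[str],
    checks: Dict[str, List[PolicyCheck]],
) -> List[str]:
    """Return policy violations; an empty list means the call may proceed."""
    if tool_name not in mutating_tools:
        return []  # read-only calls are not gated in this sketch
    violations = []
    for check in checks.get(tool_name, []):
        message = check(ledger.facts, args)
        if message is not None:
            violations.append(message)
    return violations


def refund_within_window(facts: Dict[str, Any], args: Dict[str, Any]) -> Optional[str]:
    # Hypothetical state-dependent policy: refunds only within 30 days.
    days = facts.get("days_since_purchase")
    if days is None:
        return "purchase date not yet observed"
    if days > 30:
        return "refund window (30 days) has passed"
    return None
```

The point of the design is that the policy check reads the ledger rather than re-deriving the state from the prompt, so a syntactically valid call can still be blocked when the recorded state forbids it.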

4. 语义、语法与语言学分析 1 篇

2606.19626 2026-06-19 cs.AI cs.CL 交叉投稿

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Toten:基于知识本体的巴西葡萄牙语物理量和技术符号分词

Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa

发表机构 * Aia Context Universidade Federal do Maranhão(马拉尼昂联邦大学) Universidade de São Paulo(圣保罗大学)

AI总结 提出TOTEN框架,利用工程实体本体对物理量和技术符号进行声明式分类,替代统计分词,在巴西葡萄牙语语料上实现高原子性分词和数值重建。

详情
AI中文摘要

字节对编码分词在词汇压缩方面统计高效,但对结构化技术实体语义盲目,将物理量、数字、单位和符号表达式分割成词汇上任意的子词。我们提出TOTEN,一个基于知识本体的分词框架,用基于工程实体形式本体(OEE)的声明式分类取代统计推导。我们将TOTEN形式化为三元组<O, classify, {inst_tau}>:本体收集类型、结构原理、组成关系和可保存不变量;分类函数将原始文本映射到类型化区域;实例化器族产生自描述的结构化表示。鲁棒性源于与三个外部预言机的确定性耦合:Pint(量纲)、Unicode字符数据库(排版)和RSLP(葡萄牙语形态)。内在评估涵盖四个可通过构造验证的属性(本体原子性、量纲等价性、排版鲁棒性和数值重建),在一个内部、物理验证的基准(EngQuant,N=800)和四个巴西葡萄牙语外部语料库(N=1771个合格案例)上进行。我们还报告检测召回率,区分覆盖率和条件原子性。与八个最先进基线相比,TOTEN在所有对比中的本体原子性均为1,在外部语料库上数值重建为0.775-0.904,而最佳基线(Quantulum3)为0.627-0.703;在EngQuant上为0.780 vs. 0.340。差异具有统计显著性(McNemar检验,Holm校正)。内部和外部排名之间的Spearman相关性证实了控制基准的同时效度。量纲等价性显示与Pint(系统继承量纲权威的预言机)统计对等。

英文摘要

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple <O, classify, {inst_tau}>: the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the instantiator family yields a self-descriptive structured representation. Robustness derives from deterministic coupling with three external oracles: Pint (dimensional), Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation covers four properties verifiable by construction -- ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction -- over an internal, physically validated benchmark (EngQuant, N=800) and four Brazilian Portuguese external corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on external corpora, vs. 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. Differences are statistically significant (McNemar with Holm correction). Spearman correlation between internal and external rankings confirms concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.
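Two of the properties named above, typographic robustness and numerical reconstruction, can be illustrated with a toy example using only the Python standard library. This is a sketch under stated assumptions and not TOTEN's method: the regular expression, the `parse_quantity` helper, and the choice of NFKC normalization are illustrative stand-ins, and the ontology, Pint, and RSLP parts are not modeled at all.

```python
import re
import unicodedata
from typing import Optional, Tuple

# Number (with "." or Brazilian "," as decimal separator) followed by a run
# of letters taken as the unit. Limitation of this toy pattern: thousands
# separators such as "1.250,5" are NOT handled.
_QUANTITY = re.compile(r"(?P<num>[+-]?\d+(?:[.,]\d+)?)\s*(?P<unit>[^\W\d_]+)")


def parse_quantity(text: str) -> Optional[Tuple[float, str]]:
    """Return (value, unit) as one atomic item, or None if nothing matches."""
    # NFKC folds typographic variants, e.g. MICRO SIGN (U+00B5) becomes
    # GREEK SMALL LETTER MU (U+03BC) and SQUARE KG (U+338F) becomes "kg".
    normalized = unicodedata.normalize("NFKC", text)
    match = _QUANTITY.search(normalized)
    if match is None:
        return None
    value = float(match.group("num").replace(",", "."))
    return value, match.group("unit")


print(parse_quantity("capacitor de 12,5 \u00b5F"))
print(parse_quantity("massa de 3 \u338f"))
```

Both spellings of the micro prefix map to the same (value, unit) pair, which is the kind of invariance a subword tokenizer does not guarantee.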

5. 多模态语言处理 2 篇

2606.19534 2026-06-19 cs.CV cs.AI cs.CL 交叉投稿

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

PerceptionDLM:基于多模态扩散语言模型的并行区域感知

Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong

发表机构 * Peking University(北京大学) MSALab ByteDance(字节跳动)

AI总结 提出PerceptionDLM,利用扩散语言模型的并行解码特性,通过高效提示和结构化注意力掩码实现多区域并行感知,显著提升推理效率,并构建ParaDLC-Bench基准进行评估。

Comments Code available at https://github.com/MSALab-PKU/PerceptionDLM

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉理解任务中取得了显著进展。然而,现有大多数MLLMs依赖自回归生成,这限制了它们在需要描述多个区域的感知任务中的效率。在这项工作中,我们提出PerceptionDLM,一种针对高效并行区域感知优化的多模态扩散语言模型。基于PerceptionDLM-Base(一个在开源扩散MLLMs中达到最先进性能的强基础基线),我们的架构充分利用了DLMs的并行解码特性。具体来说,我们引入了高效提示和结构化注意力掩码,以实现对多个掩码区域的同步感知,使模型能够在序列和token级别并行生成区域描述。与现有顺序处理区域的方法相比,这种设计显著提高了推理效率。为了系统评估DLMs视觉感知能力的并行性,我们通过将DLC-Bench扩展为每张图像包含多个区域掩码,构建了一个新的并行详细局部描述基准(ParaDLC-Bench),从而能够联合评估描述质量和推理效率。实验表明,PerceptionDLM在区域描述中保持竞争性能,同时在多区域感知任务中实现了显著的加速。我们的结果凸显了多模态扩散语言模型在高效并行视觉感知中的潜力。据我们所知,我们是首个利用扩散语言模型优势实现并行区域描述和感知的工作。代码、模型和数据集已发布。

英文摘要

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region caption and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.

2606.20477 2026-06-19 cs.CV cs.CL cs.LG 交叉投稿

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

面向放射学的空间定位2D视觉-语言模型的可扩展训练

Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter, Behzad Bozorgtabar, Thomas Brox

发表机构 * Computer Vision Group, University of Freiburg, Germany(德国弗莱堡大学计算机视觉组) Department of Radiology, Medical Center -- University of Freiburg, Germany(德国弗莱堡大学医学中心放射科) CRIION-AI Lab, Freiburg, Germany(德国弗莱堡CRIION-AI实验室)

AI总结 提出RefRad2D大规模双语数据集,通过LLM和自动分割生成空间定位数据,训练RadGrounder模型联合完成报告生成、VQA和空间定位,在外部基准上取得竞争性结果。

Comments Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

详情
AI中文摘要

我们研究了如何在没有手动空间标注的情况下,为放射学训练具有视觉定位能力的视觉-语言模型(VLM)。我们引入了RefRad2D,这是一个大规模的双语(德语/英语)数据集,包含来自临床实践的120万对CT和MR图像-文本对,并通过基于LLM的筛选和自动分割自动生成任务特定的VQA和空间定位子集。在此数据上训练的模型RadGrounder联合执行报告生成、视觉问答以及通过边界框检测或分割进行的空间定位。在外部VQA基准(Slake,VQA-RAD)上,RadGrounder取得了与专用医学VLM竞争的结果。将我们的临床数据加入训练混合集,相比于仅在下游数据集上微调,提高了开放式VQA的性能,显示了数据集的迁移性。关键在于,添加定位监督不会降低语言质量,从而在不牺牲VQA性能的情况下实现空间可验证的输出。

英文摘要

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.

6. 语音语言联合与音频文本 3 篇

2606.19951 2026-06-19 eess.AS cs.CL cs.LG cs.SD 交叉投稿

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

通过声学和韵律扰动研究语音质量评估中的人机差异

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

AI总结 通过声学退化、韵律错误和说话人特征扰动,发现MOS预测模型对声学退化敏感,但对韵律错误不敏感,且对基频有偏见,而对语速和基频变化不敏感。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

平均意见得分(MOS)预测模型在文本到语音(TTS)研究中被广泛用作代理指标,但它们捕捉超出声学保真度的质量差异的能力仍不清楚。我们通过控制性扰动来研究这一点:声学退化、韵律错误以及说话人特定特征(如音高和语速)的操纵。我们从人类听众和模型那里获得了这些语音样本的MOS预测,并分析了它们感知特征的差异。结果表明,大多数模型能很好地跟踪声学退化,而所有模型对韵律错误不敏感,尽管主观评分大幅下降。对于说话人特征,模型表现出双重分离:在人类评分中不存在的强平均基频(F0)偏见,但对人类注意到的语速和F0变化不敏感。这些发现突出了标量MOS预测在声学保真度之外的局限性。

英文摘要

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

2606.19996 2026-06-19 cs.SD cs.CL 交叉投稿

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

基于自编码器与对比学习的段级普通话语音认知障碍检测

Yongqi Shao, Hong Huo, Flavio Bertini, Danilo Montesi, Tao Fang

发表机构 * School of Automation and Intelligent Sensing, Shanghai Jiao Tong University(上海交通大学自动化与智能感知学院) Key Laboratory of System Control and Information Processing, Ministry of Education of China(教育部系统控制与信息处理重点实验室) Shanghai Key Laboratory of Perception and Control in Industrial Network Systems(上海市工业网络系统感知与控制重点实验室) Department of Computer Science and Engineering, University of Bologna(博洛尼亚大学计算机科学与工程系) Department of Mathematical, Physical and Computer Sciences, University of Parma(帕尔马大学数学、物理与计算机科学系)

AI总结 提出段级表示学习框架,结合自编码器和对比学习,在四个普通话数据集上实现稳定的二分类和三分类认知障碍检测,尤其改善了临床困难的三分类性能。

Comments 15 pages, 7 figures, 5 tables

详情
AI中文摘要

背景与目标:语音已成为一种低成本、非侵入性的数字生物标志物,在认知障碍检测方面具有巨大潜力。然而,有限的标注数据和跨数据集变异性仍然是构建稳健的语音筛查系统的主要挑战。 方法:我们开发了一个用于语音认知障碍检测的段级表示学习框架。将语音录音分割成短片段并转换为语谱图表示。为了在有限数据条件下提高鲁棒性,将离线和在线增强策略与基于自编码器的表示学习和对比目标相结合,以增强判别性潜在表示。 结果:在四个独立的普通话语音数据集上进行的实验表明,在二分类和三分类任务中均取得了稳定且有竞争力的性能,尤其是在临床具有挑战性的三分类设置中取得了显著改进。消融研究进一步支持了所提框架的有效性。 结论:研究结果表明,段级语音表示学习可能为资源受限的临床环境中的认知障碍筛查提供一种可扩展且实用的方法。

英文摘要

Background and Objective: Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset variability remain major challenges for robust speech-based screening systems. Methods: We developed a segment-level representation learning framework for speech-based cognitive impairment detection. Speech recordings were divided into short segments and converted into spectrogram representations. To improve robustness under limited-data conditions, offline and online augmentation strategies were combined with autoencoder-based representation learning and contrastive objectives to enhance discriminative latent representations. Results: Experiments conducted on four independent Mandarin Chinese speech datasets demonstrated stable and competitive performance in both binary and three-class classification tasks, with particularly notable improvements in the clinically challenging three-class setting. Ablation studies further supported the effectiveness of the proposed framework. Conclusions: The findings suggest that segment-level speech representation learning may provide a scalable and practical approach for cognitive impairment screening in resource-constrained clinical settings.

2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD 交叉投稿

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

PASQA:在含重音错误的合成语音上训练的、以音高重音为中心的语音质量评估模型

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

AI总结 提出PASQA模型,通过可控重音合成数据集和伪重音质量分数,结合自监督表示、摩拉条件融合等训练策略,有效评估音高重音正确性,优于传统MOS模型。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

现有的平均意见得分(MOS)预测模型通常预测话语级别的自然度MOS,并且可能对局部音高重音错误不敏感。我们提出了以音高重音为中心的语音质量评估(PASQA),明确针对音高重音正确性。为了训练我们的模型,我们使用重音可控的文本转语音系统通过改变重音模式构建了一个受控的日语重音错误数据集,并根据重音错误率计算伪重音质量得分。PASQA建立在自监督表示的基础上,并采用摩拉条件融合、排序损失、辅助重音错误定位任务和说话者不变训练。实验表明,传统模型无法保持按重音错误严重程度的排序,而PASQA在已见和未见说话者上都实现了高排序准确性。此外,PASQA与人类重音正确性判断的一致性更强。代码可在 https://github.com/lycorp-jp/PASQA 获取。

英文摘要

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.

7. 评测、数据集与基准 8 篇

2606.19558 2026-06-19 cs.LG cs.CL 交叉投稿

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

位移不是方向:评估量化LLM部署的保真度指标

Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos

发表机构 * ByteShape University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(向量人工智能研究所)

AI总结 本文研究KL散度等保真度指标在量化语言模型部署中与下游基准分数的相关性,发现整体强相关但在近基线区域失效,归因于KL散度主要衡量分歧量而非方向。

详情
AI中文摘要

保真度指标(例如相对于高精度参考模型的逐token KL散度,KLD)在实践中常被用作基准质量的低成本代理。我们在Qwen3.6-35B-A3B的28个量化模型和Devstral-Small-2-24B的41个量化模型上检验了这一做法,并在一系列下游基准上进行评估。我们发现,在整个量化队列中,KLD与基准分数强相关(Qwen上ρ=-0.72,Devstral上ρ=-0.86,二者均p<0.001)。然而,在接近基线的静默区,这种关系减弱至不显著(Qwen上ρ=+0.00,Devstral上ρ=-0.24,p=0.36)。这种失效在14种测量变体中持续存在,包括不同的KLD聚合方式、困惑度公式、top-1一致性、校准语料库和上下文长度。在逐提示层面,KLD在代码任务上仅有较弱的失败预测能力:在LiveCodeBench上,五个模型的失败样本与通过样本的KLD几何平均之比在[1.08,1.22]之间;它也无法用作跨模型路由器,在分歧提示上仅达到42.3%-49.4%的准确率。我们将这种失效归因于一种结构性分解:KLD主要衡量与参考模型的分歧量,其在静默区的复合ρ在Qwen上为+0.94(p<0.001),在Devstral上为+0.55(p=0.03),而它与这些分歧的方向之间的关系较弱且依赖于任务。

英文摘要

Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated across a suite of downstream benchmarks. We find that KLD is strongly correlated with benchmark score over the full cohort ($ρ=-0.72$ on Qwen and $ρ=-0.86$ on Devstral, both with $p<0.001$). However, this relationship collapses to non-significance in the near-baseline silent zone ($ρ=+0.00$ on Qwen and $ρ=-0.24$, $p=0.36$, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code, with failed-vs-passed geometric-mean ratios in $[1.08,1.22]$ across five models on LiveCodeBench, and fails as a cross-model router, achieving only $42.3\%-49.4\%$ accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference, with silent-zone composite $ρ=+0.94$ ($p<0.001$) on Qwen and $+0.55$ ($p=0.03$) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.
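For readers unfamiliar with the metric, per-token KLD and top-1 agreement can be computed as in the sketch below. It uses plain Python lists of logits instead of real model outputs, and the function names and toy logits are assumptions for illustration, not the paper's evaluation code. The toy example also hints at the "volume, not direction" point: two quantized variants can be equally far from the reference in KLD while moving probability mass toward different tokens.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def token_kld(ref_logits: Sequence[float], quant_logits: Sequence[float]) -> float:
    """KL(P_ref || P_quant) in nats at a single token position."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    # Clamp q away from zero so an underflowed probability cannot divide by 0.
    return sum(pi * math.log(pi / max(qi, 1e-300)) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kld(ref: Sequence[Sequence[float]], quant: Sequence[Sequence[float]]) -> float:
    """Mean over token positions; one of several possible aggregations."""
    return sum(token_kld(r, q) for r, q in zip(ref, quant)) / len(ref)


def _argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


def top1_agreement(ref: Sequence[Sequence[float]], quant: Sequence[Sequence[float]]) -> float:
    """Fraction of positions where both models pick the same top token."""
    same = sum(_argmax(r) == _argmax(q) for r, q in zip(ref, quant))
    return same / len(ref)


# Toy logits over a 3-token vocabulary (hypothetical values).
reference = [2.0, 0.0, 0.0]
quant_a = [1.0, 1.5, 0.0]  # shifts mass toward token 1
quant_b = [1.0, 0.0, 1.5]  # shifts the same mass toward token 2
print(token_kld(reference, quant_a), token_kld(reference, quant_b))
```

Here both variants have the same KLD against the reference, yet if token 1 were the correct continuation only `quant_a` would have moved in a helpful direction; KLD alone cannot tell the two apart.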

2606.19706 2026-06-19 cs.CV cs.CL 交叉投稿

NEST: Narrative Event Structures in Time for Long Video Understanding

NEST:面向长视频理解的时间叙事事件结构

Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas

发表机构 * Department of Computer Science, Virginia Tech(弗吉尼亚理工大学计算机科学系)

AI总结 提出NEST数据集(1005部全长电影),通过多模态叙事事件标注和关系链接,评估模型在长视频中理解事件结构、时间顺序和长程依赖的能力,实验表明事件检测等任务极具挑战性。

详情
AI中文摘要

视觉-语言模型的最新进展使得处理越来越长的视频序列成为可能,但处理扩展令牌流的能力并不能转化为对长视频中叙事结构的理解。现有的长视频基准侧重于大海捞针式检索,而不是评估低级动作如何形成事件、事件如何跨时间交互以及叙事如何进展,例如,模型是否能够将早期的挫折(如失业)与后来的关系破裂联系起来,尽管存在长时间间隔、中间场景或重新诠释事件的闪回。我们引入了NEST(面向长视频理解的时间叙事事件结构),一个包含1005部全长电影(平均98分钟)的数据集,每部电影都标注了102个基于视觉内容、对话和音频的多模态叙事事件。NEST通过基于视觉内容、对话和音频的结构化标注捕捉多模态叙事事件,并通过反映叙事结构的关系(包括时间顺序、层次组合和长程依赖)将它们联系起来。我们引入了事件触发检测(ETD)、事件定位(EL)、事件论元抽取(EAE)和事件关系抽取(ERE)的基线。该基准在有依据(grounded)的事件发现方面极具挑战性,ETD低于8%,EL低于6%,EAE低于11%。相比之下,一旦事件给定,ERE更容易处理,零样本F1达到35.45%,微调后F1达到44.42%。

英文摘要

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress: for example, whether a model can connect an early setback, such as a job loss, to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures multimodal narrative events with structured annotations grounded in visual content, dialogue, and audio, and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

2606.19719 2026-06-19 cs.IR cs.CL cs.LG 交叉投稿

Closing the Calibration Gap in Semantic Caching

缩小语义缓存中的校准差距

Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal

AI总结 针对语义缓存系统中离线指标与部署性能的差距,提出P-CHR AUC和CRR指标,发现校准差距由训练目标主导,模型选择本质是校准问题。

Comments 23 pages, 2 figures. Source code: https://github.com/aditeyabaral/calibration-gap-semantic-caching ; Models and Datasets: https://huggingface.co/redis

详情
AI中文摘要

语义缓存通过为语义相似的查询提供缓存响应来降低LLM推理成本。标准实践使用PR-AUC评估这些系统,该指标仅衡量分数排序的好坏,而忽略它们在固定阈值下是否可用。我们表明这种不匹配会导致系统性的部署选择不佳,因为具有最高PR-AUC的模型往往在实际运行中表现最差。我们引入精度-缓存命中率(P-CHR)AUC,一种衡量不同缓存利用率水平下精度的缓存感知指标,以及校准保留率(CRR),它捕捉离线排序质量在部署中保留多少。我们将离线质量与部署质量之间的操作差距分解为可恢复的校准组件和由数据集正例率固定的不可约结构组件。我们的实验表明,校准差距由训练目标而非数据规模主导,事后校准只能部分缩小它。最终,语义缓存的模型选择是一个校准问题,而非排序问题,而测量它是缩小差距的第一步。

英文摘要

Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.
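The gap between ranking quality and fixed-threshold usability can be made concrete with a small sketch. The abstract does not give the formulas for P-CHR AUC or CRR, so the code below is only one plausible reading: precision is measured among queries served from the cache, cache hit ratio is the served fraction, and the area is taken with the trapezoid rule. The function names, the convention that precision is 1.0 when nothing is served, and the toy scores are all assumptions.

```python
from typing import List, Sequence, Tuple


def operating_point(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Return (precision, cache_hit_ratio) when serving every score >= threshold.

    labels[i] is 1 if the cached response would have been correct for query i.
    """
    served = [lab for s, lab in zip(scores, labels) if s >= threshold]
    hit_ratio = len(served) / len(scores)
    # Convention (assumption): precision is 1.0 when nothing is served.
    precision = sum(served) / len(served) if served else 1.0
    return precision, hit_ratio


def precision_hit_ratio_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Trapezoid area under precision as a function of cache hit ratio."""
    points: List[Tuple[float, float]] = [(0.0, 1.0)]  # (hit_ratio, precision)
    for threshold in sorted(set(scores), reverse=True):
        precision, hit_ratio = operating_point(scores, labels, threshold)
        points.append((hit_ratio, precision))
    area = 0.0
    for (h0, p0), (h1, p1) in zip(points, points[1:]):
        area += (h1 - h0) * (p0 + p1) / 2.0
    return area


# Two hypothetical models that rank the five queries identically;
# model B's similarity scores are simply shifted down by 0.2.
labels = [1, 1, 0, 1, 0]
model_a = [0.95, 0.90, 0.85, 0.80, 0.60]
model_b = [0.75, 0.70, 0.65, 0.60, 0.40]
print(precision_hit_ratio_auc(model_a, labels), precision_hit_ratio_auc(model_b, labels))
print(operating_point(model_a, labels, 0.8), operating_point(model_b, labels, 0.8))
```

Both models get the same area because the ranking is identical, but at a deployed threshold of 0.8 model A serves four of five queries while model B serves none, which is the sense in which model selection becomes a calibration question.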

2606.19749 2026-06-19 cs.AI cs.CL 交叉投稿

Benchmarking Agentic Review Systems

基准测试智能审稿系统

Dang Nguyen, Wanqing Hao, Yanai Elazar, Chenhao Tan

发表机构 * University of Chicago(芝加哥大学) Bar-Ilan University(巴伊兰大学)

AI总结 针对AI辅助研究给同行评审带来的压力,新兴智能审稿系统涌现,但缺乏评估标准。本文评估了多种系统,发现最佳配置(OpenAIReview + GPT-5.5)在成对准确性上达83.0%,能捕获71.6%注入错误,且用户反馈正面。

Comments 11 pages, 7 tables, 4 figures

详情
AI中文摘要

一类新的智能审稿系统正在兴起,以缓解AI辅助研究给同行评审系统带来的压力,但如何评估它们尚不明确。我们评估了两个开源系统(OpenAIReview和coarse)、一个专有系统(Reviewer3)以及一个零样本基线,跨越六个涵盖前沿和高效模型的LLM。首先,我们研究ICLR/NeurIPS论文上的AI评审是否与论文质量(通过引用和接受决定等外部信号近似)相关。每个系统在成对准确性上均高于随机水平,最佳为OpenAIReview + GPT-5.5,达到83.0%。其次,为测试系统能否捕获具有已知真值的错误,我们构建了一个扰动基准,向八个arXiv学科类别的论文中注入四类错误,并测量检测召回率。最强配置(OpenAIReview + GPT-5.5)捕获了71.6%的注入错误,仍有很大改进空间。六个模型的检测并集达到83.3%的召回率,表明不同模型检测不同错误,更好的框架(harness)设计可能提高性能。除这些基准外,我们研究了OpenAIReview在真实用户中的公开部署。对其评论的投票偏向正面,比例为1.44:1,最常见的抱怨是误报和琐碎挑剔。总之,通过评估基于最先进模型的完整审稿系统在真实研究论文上的表现,我们表明虽然AI评审仍有改进空间,但它们已经能够很好地跟踪人类质量判断、捕获重要错误,并获得真实用户的正面反馈。

英文摘要

A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models. First, we study whether AI reviews on ICLR/NeurIPS papers track with papers' quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall. The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting different models detect different errors and better harness design can potentially increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.
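The difference between per-model detection recall and the recall of the union across models comes down to simple set arithmetic, sketched below. The error identifiers and model names are made up for illustration and do not come from the benchmark.

```python
from typing import Dict, Set


def recall(detected: Set[str], injected: Set[str]) -> float:
    """Fraction of injected errors that were detected."""
    return len(detected & injected) / len(injected)


def union_recall(per_model: Dict[str, Set[str]], injected: Set[str]) -> float:
    """Recall when an error counts as caught if any model flags it."""
    covered: Set[str] = set().union(*per_model.values())
    return recall(covered, injected)


# Hypothetical injected-error IDs and per-model detections.
injected_errors = {"e1", "e2", "e3", "e4", "e5", "e6"}
detections = {
    "model_a": {"e1", "e2", "e3"},
    "model_b": {"e3", "e4"},
    "model_c": {"e1", "e5", "x9"},  # "x9" is a false positive, ignored by recall
}

best_single = max(recall(found, injected_errors) for found in detections.values())
print(best_single, union_recall(detections, injected_errors))
```

When the union is clearly above the best single model, as in this toy data, the models are catching different errors, which is the observation behind the abstract's remark about harness design.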

2606.19788 2026-06-19 cs.AI cs.CL 交叉投稿

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

CombEval:评估大语言模型中组合计数的框架

Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang, Yi Chang

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) Czech Technical University in Prague(捷克布拉格理工大学) CRRC Zhuzhou Institute(中车株洲研究所) Tengen Intelligence Institute(天元智能研究院) International Center of Future Science, Jilin University(吉林大学未来科学国际合作中心) Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE(教育部知识驱动人机智能工程研究中心)

AI总结 提出CombEval动态基准,通过类型化Cofola规范生成组合计数问题,评估11个大语言模型在直接和代码增强设置下的表现,发现模型在有序对象、不可区分元素、相对位置约束和嵌套对象依赖上存在脆弱性。

Comments under review. Code: https://github.com/YuxuZhou-CN/combination-problem-generation

详情
AI中文摘要

我们提出了CombEval,一个用于评估大语言模型中组合计数的动态基准。CombEval将每个问题表示为关于实体、组合对象、对象依赖和约束的类型化Cofola规范,从而能够生成带有精确求解器验证答案的自然语言计数问题。与静态集合不同,CombEval支持对象类型、实体规模、约束数量和推理深度的系统变化。我们在直接和代码增强设置下评估了11个大语言模型,发现模型在有序对象、不可区分元素、相对位置约束和嵌套对象依赖上仍然脆弱。错误分析进一步识别出在约束解释和计数原则上的失败。CombEval为研究大语言模型何时以及为何在组合推理上失败提供了一个诊断测试平台。代码和生成的基准套件可在 https://github.com/YuxuZhou-CN/combination-problem-generation 公开获取。

英文摘要

We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural-language counting problems with exact solver-verified answers. Unlike static collections, CombEval supports systematic variation of object type, entity scale, constraint count, and reasoning depth. We evaluate 11 LLMs under direct and code-augmented settings and find that models remain brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies. Error analysis further identifies failures in constraint interpretation and counting principles. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning. The code and generated benchmark suites are publicly available at https://github.com/YuxuZhou-CN/combination-problem-generation.

2606.19830 2026-06-19 cs.SE cs.CL 交叉投稿

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

JAMER:专业游戏引擎上的项目级代码框架数据集与基准测试

Jianwen Sun, Chuanhao Li, Zizhen Li, Yukang Feng, Fanrui Zhang, Yifei Huang, Yu Dai, Kaipeng Zhang

AI总结 提出首个基于专业游戏引擎的项目级代码框架数据集JamSet和基准JamBench,通过设计确定性验证流程,从24万仓库中筛选出8133个已验证项目,评估9个前沿模型发现项目规模增大时能力急剧下降。

详情
AI中文摘要

当前AI驱动的游戏开发在资产生成、游戏设计和基于Web的游戏编码方面取得了实质性进展,但由于缺乏大规模数据集和确定性评估方法,专业游戏引擎上的项目级代码工程仍然很大程度上未被探索。我们提出了JamSet和JamBench,这是首个基于专业游戏引擎的项目级游戏代码框架数据集和基准。我们的关键洞察是,Game Jam竞赛(开发者在严格时间限制下构建完整游戏的社区活动)产生了数千个适合此目的的开源项目。基于Godot引擎的文本格式和无头执行模式,我们设计了一个从文件完整性到运行时行为收集的确定性验证流程,从超过24万个仓库中提炼出8133个已验证项目。其中,300个手动验证的项目构成JamBench;其余构成JamSet。JamBench定义了主题驱动的生成和代码补全任务,通过结合编译通过率、结构完整性得分(SCS)和行为对齐得分(BAS)的流水线进行评估。对9个前沿模型的评估揭示了随着项目规模增加的能力悬崖,运行时通过率从小型项目的80.4%下降到大型项目的5.7%(Task2a)。代码代理提高了编译率,但在运行时行为质量上没有带来提升,表明瓶颈在于架构设计而非语法正确性。实验验证了JamSet作为有效训练数据。所有数据和代码均已公开。

英文摘要

Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jam competitions, community events where developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine's text-based format and headless execution mode, we design a deterministic verification pipeline from file integrity to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline combining compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of 9 frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.

2606.20065 2026-06-19 cs.IR cs.CL cs.CY 交叉投稿

Generative Engine Optimization at Scale: Measuring Brand Visibility Across AI Search Engines

生成式引擎优化规模化:衡量AI搜索引擎中的品牌可见性

Pratyush Kumar

AI总结 本研究通过分析10万+提示响应,提出衡量AI搜索引擎中品牌可见性的方法,发现品牌成熟度形成三级阶梯,并识别出最受引用的内容格式和情感不稳定性。

Comments 14 pages, 4 tables; v1.0 preprint

详情
AI中文摘要

人们越来越多地从AI搜索引擎(如ChatGPT、Claude、Perplexity和Gemini)直接获取答案,而不是滚动浏览搜索结果。曾经专注于搜索引擎优化(SEO)的品牌现在必须优化这些引擎如何代表、引用和推荐它们——这一转变被称为生成式引擎优化(GEO)、答案引擎优化(AEO)和AI搜索可见性。我们将AEO和AI可见性视为GEO的一部分,并研究如何衡量AI引擎中的品牌可见性:它们在引用品牌时看重什么,依赖哪些来源,以及大型语言模型呈现什么内容。难点在于那些尚未成为权威顶级品牌的所有其他品牌——中小企业、D2C品牌、创作者和早期初创公司。我们分析了2026年3月至5月期间在Ranqo上追踪的100多个品牌的10万+提示响应。首次可见性运行形成了清晰的三级品牌地位阶梯:全球家喻户晓的品牌(如Stripe、Nike)在首次运行时出现在73%的相关AI答案中;成熟的中端市场和区域品牌(如Olipop、Klaviyo)出现在44%中;小众和小品牌仅出现在11%中——每级约30个百分点。当引擎引用来源时,约78%指向企业网站;在非企业来源中,YouTube领先,其次是Reddit、编辑媒体和维基百科。杠杆率最高的页面是排名“最佳”列表文章,是最常被引用的内容格式,约占所有引用的21%。情感是不稳定的信号:品牌被正面或负面描述的变化频率大约是品牌是否被提及的变化频率的6.7倍。这些发现为衡量GEO提供了首个大规模基线:AI品牌可见性是可测量的,因平台而异,并随品牌成熟度强烈变化。最后,我们提出了七个v1.1协议,以测试特定建议是否能因果性地提高AI可见性。

英文摘要

People increasingly get answers straight from AI search engines like ChatGPT, Claude, Perplexity, and Gemini rather than scrolling search results. Brands that once focused on search engine optimization (SEO) must now optimize for how these engines represent, cite, and recommend them -- a shift variously called Generative Engine Optimization (GEO), Answer Engine Optimization (AEO), and AI Search Visibility. We treat AEO and AI Visibility as part of GEO, and study how to measure brand visibility across AI engines: what they value when they cite a brand, which sources they rely on, and what content large language models surface. The hard case is everyone outside the already-authoritative top brands -- SMEs, D2C brands, creators, and early-stage startups. We analyze 100K+ prompt responses across 100+ brands tracked on Ranqo between March and May 2026. First visibility runs form a clear three-tier brand-stature ladder: global household names (e.g., Stripe, Nike) appear in 73% of relevant AI answers on their first run; established mid-market and regional brands (e.g., Olipop, Klaviyo) in 44%; niche and small brands in just 11% -- about 30 percentage points per step. When engines cite sources, about 78% go to corporate websites; among non-corporate sources YouTube leads, ahead of Reddit, editorial media, and Wikipedia. The highest-leverage page is the ranked "best-of" listicle, the most-cited content format at about 21% of all citations. Sentiment is the unstable signal: whether a brand is framed positively or negatively flips about 6.7 times more often than whether it is mentioned at all. These findings provide a first large-scale baseline for measuring GEO: AI brand visibility can be measured, differs by platform, and varies strongly by brand maturity. We close by proposing seven v1.1 protocols to test whether specific recommendations can causally improve AI visibility.

2606.20205 2026-06-19 cs.AI cs.CL cs.HC 交叉投稿

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

大语言模型的表面心理特征很大程度上是测量假象

Jelena Meyer, David Garcia, Dirk U. Wulff

发表机构 * Max Planck Institute for Human Development(马克斯·普朗克人类发展研究所) University of Konstanz(康斯坦茨大学) Barcelona Supercomputing Center(巴塞罗那超级计算中心) University of Basel(巴塞尔大学)

AI总结 通过心理测量框架分析56个指令微调LLM,发现模型间差异主要源于方向性响应偏差而非特质,该偏差解释了81-90%的变异,且可通过题目选择操控,表明LLM心理特征是测量假象。

详情
AI中文摘要

专为人类设计的心理测量工具越来越多地被用于赋予大型语言模型(LLM)稳定的心理特征,这些特征影响其可用性、安全评估以及作为人类参与者的研究代理。使用正式的心理测量框架,我们表明这些特征很大程度上是测量假象。我们对56个指令微调LLM以及大型人类参考样本施测了一系列涵盖自我报告和行为任务的人格与风险偏好工具,报告了四个发现。第一,模型间差异并非由工具所针对的特质驱动,而是由方向性响应偏差驱动,即倾向于向量表一端或某个标签选项做出反应,而不考虑项目内容;方差分解将81-90%的模型间变异归因于这种偏差,而在人类中这一比例为9-16%。第二,偏差随模型能力提升而下降,但并未被消除。第三,由于响应由偏差而非特质驱动,工具的表面信度几乎完全由其响应正交性预测,这是我们提出的术语,指特质和偏差指向相反方向的项目比例。第四,模型呈现的特征随所用项目而变化,并可通过项目选择来制造。这些结果表明,LLM的表面心理特征是用于测量它们的工具的假象,而非模型本身的属性。由于从人类心理学借用的工具很少完全正交,且可能对LLM天生缺乏效度,我们呼吁以响应正交性为中心进行专门的评估。

英文摘要

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
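
The abstract defines response orthogonality as the proportion of items for which trait and bias point in opposite directions. A minimal Python sketch of that proportion follows, under assumptions that are not from the paper: items are coded as +1 (forward-keyed) or -1 (reverse-keyed), the bias is a single direction on a 1-5 scale, and the helper names `response_orthogonality` and `bias_only_trait_score` are hypothetical. The second helper illustrates how a responder driven purely by bias would receive a different trait score depending on which items are chosen.

```python
def response_orthogonality(item_keys, bias_direction):
    """Share of items whose trait keying opposes the bias direction.

    item_keys: +1 if a high response raises the trait score, -1 if reverse-keyed
               (this coding is an assumption made for illustration).
    bias_direction: +1 for a bias toward the high end of the scale, -1 for the low end.
    """
    if not item_keys:
        raise ValueError("at least one item is required")
    opposed = sum(1 for key in item_keys if key * bias_direction < 0)
    return opposed / len(item_keys)


def bias_only_trait_score(item_keys, bias_direction, scale_min=1, scale_max=5):
    """Mean trait score of a hypothetical responder who always picks the biased end."""
    response = scale_max if bias_direction > 0 else scale_min
    # Reverse-keyed items are flipped around the scale midpoint before averaging.
    scored = [response if key > 0 else scale_min + scale_max - response
              for key in item_keys]
    return sum(scored) / len(scored)


mostly_forward = [1, 1, 1, -1]   # one of four items opposes a high-end bias
balanced = [1, -1, 1, -1]        # half of the items oppose it

print(response_orthogonality(mostly_forward, +1), bias_only_trait_score(mostly_forward, +1))
print(response_orthogonality(balanced, +1), bias_only_trait_score(balanced, +1))
```

In this toy setting the balanced item set places the bias-only responder at the scale midpoint, while the mostly forward-keyed set makes the same responder look high on the trait.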

8. 安全、隐私、公平与可解释NLP 5 篇

2606.18649 2026-06-19 cs.MA cs.CL cs.CY 交叉投稿

Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

LLM招聘决策中的性别偏见:来自日本语境的证据及缓解策略评估

Serena A. Hoffstedde, Machiko Hirota, Akshara Nadayanur Sathis Kanna, Rihito Kotani, Ujwal Kumar, Gabriele Trovato, Phan Xuan Tan

发表机构 * Shibaura Institute of Technology, Tokyo, Japan(芝浦工业大学,东京,日本) Amsterdam University of Applied Sciences, Amsterdam, Netherlands(阿姆斯特丹应用科学大学,阿姆斯特丹,荷兰) University of Pennsylvania, Philadelphia, USA(宾夕法尼亚大学,费城,美国) Carnegie Mellon University, Pittsburgh, USA(卡内基梅隆大学,匹兹堡,美国) Keio University, Tokyo, Japan(庆应义塾大学,东京,日本)

AI总结 本研究通过60份日本履历书格式的简历和5个先进LLM,发现所有模型均存在显著的亲女性偏见,且简单的提示指令无法缓解,而移除姓名几乎完全消除该偏见。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署在招聘流程中,然而大多数关于LLM招聘决策中性别偏见的研究都集中在英语、西方格式的简历上。本研究考察了亲女性性别偏见是否扩展到日本企业语境,并评估了两种实用的缓解策略。使用反事实简历设计,包含60份日本履历书格式的简历、基于语言学性别信号标准选择的12个姓名对,以及五个最先进的LLM(Claude Sonnet 4.6、GPT-4o、DeepSeek-V3、Gemini 2.5 Flash、Llama 3.3 70B),我们在基线、提示指令和隐私过滤条件下进行了43,200次API调用。交叉随机效应线性混合模型确认了所有五个模型均存在显著的亲女性偏见,将西方研究结果复制到了非西方语境中。提示级别的性别中立指令并未显著减少偏见。姓名依赖分析正式将候选人姓名识别为主要性别渠道:从提示中移除姓名几乎完全消除了女性效应。隐私过滤器与GPT-4o内容安全过滤器之间的意外不兼容导致42%的拒绝率,突显了在LLM辅助招聘流程中姓名匿名化的实际部署挑战。

英文摘要

Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o's content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.

2606.19404 2026-06-19 cs.LG cs.CL 交叉投稿

Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

推理的热力学特征:用于大型语言模型幻觉检测的自由能和谱形因子诊断

Salim Khazem

发表机构 * Talan Research & Innovation Center(Talan研究与创新中心)

AI总结 提出自由能签名(Fes)作为谱描述符,将注意力拉普拉斯视为哈密顿量并提取热力学势和随机矩阵理论谱形因子,用于检测LLM幻觉,无需训练即可实现高AUROC。

详情
AI中文摘要

大型语言模型(LLM)中的幻觉检测对部署至关重要,近期研究表明注意力导出的图拉普拉斯谱携带关于推理质量的强信号。然而,先前的谱诊断仅通过少数特征值或手工选取的标量来总结拉普拉斯谱,忽略了其大部分结构。我们提出自由能签名(Fes),一种谱描述符,将每层的注意力拉普拉斯视为哈密顿量,并提取其热力学势(配分函数、自由能、谱熵、热容)以及随机矩阵理论(RMT)谱形因子。我们证明了三个结果:(i)Fes在注意力扰动下的Lipschitz稳定性;(ii)一个表达性结果,表明Fes丰富了有限谱摘要,并在明确的规则性和网格分辨率假设下逼近矩导出的谱泛函;(iii)基于Fes构建的无训练检测器AUROC的有限样本PAC界。实验上,在六个开源LLM和六个基准测试中,基于Fes描述符的轻量级探测在注意力谱基线中实现了最强的平均AUROC,相比LapEig平均提高+6.5 AUROC点,相比GoR-4平均提高+2.4点,且无需更新底层LLM。在完全无监督设置下,RMT偏差得分达到平均AUROC 0.71,提供了一个无标签但较弱的检测器。互补的RMT分析表明,正确生成表现出更接近Wigner-Dyson的谱统计,而幻觉表现出更接近Poisson的统计。匿名代码和配置在补充材料中提供。

英文摘要

Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum by a handful of eigenvalues or hand-picked scalars, leaving most of its structure unused. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials (partition function, free energy, spectral entropy, heat capacity) together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i) Lipschitz stability of Fes under attention perturbation; (ii) an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii) a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by +6.5 AUROC points and over GoR-4 by +2.4 points on average, while requiring no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score achieves mean AUROC 0.71, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson-like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics. The anonymized code and config are provided in the supplementary material.
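
To make the thermodynamic quantities named in the abstract concrete, here is a minimal Python sketch that computes them from a list of Laplacian eigenvalues using the textbook statistical-mechanics definitions (Z = Σ exp(-βλ), F = -ln Z / β, Gibbs entropy, C = β² Var(λ)) and a common form of the spectral form factor, |Σ exp(-iλt)|² / N. These definitions, the inverse temperature `beta`, the function names, and the example spectrum are assumptions for illustration; the paper's exact normalisation, temperature grid, and unfolding procedure are not given in the abstract and may differ.

```python
import cmath
import math


def thermodynamic_signature(eigvals, beta):
    """Textbook thermodynamic potentials of a spectrum treated as energy levels."""
    lam = [float(x) for x in eigvals]
    lam_min = min(lam)
    # Shift by the smallest eigenvalue so the exponentials cannot overflow.
    weights = [math.exp(-beta * (x - lam_min)) for x in lam]
    z_shifted = sum(weights)
    log_z = math.log(z_shifted) - beta * lam_min
    probs = [w / z_shifted for w in weights]  # Boltzmann weights
    mean_energy = sum(p * x for p, x in zip(probs, lam))
    mean_sq = sum(p * x * x for p, x in zip(probs, lam))
    return {
        "log_partition": log_z,
        "free_energy": -log_z / beta,
        "entropy": -sum(p * math.log(p) for p in probs if p > 0.0),
        "heat_capacity": beta ** 2 * (mean_sq - mean_energy ** 2),
        "mean_energy": mean_energy,
    }


def spectral_form_factor(eigvals, t):
    """|sum_j exp(-i * lambda_j * t)|^2 / N, without any spectral unfolding."""
    lam = list(eigvals)
    total = sum(cmath.exp(-1j * x * t) for x in lam)
    return abs(total) ** 2 / len(lam)


# Hypothetical spectrum; a real one would come from a layer's attention Laplacian.
EXAMPLE_SPECTRUM = [0.0, 0.4, 0.9, 1.3, 2.0]
signature = thermodynamic_signature(EXAMPLE_SPECTRUM, beta=1.5)
print(signature)
print(spectral_form_factor(EXAMPLE_SPECTRUM, t=0.5))
```

A descriptor in this spirit would evaluate these functions over a grid of `beta` and `t` values per layer and concatenate the results into a feature vector for a lightweight probe.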

2606.19660 2026-06-19 cs.CR cs.CL 交叉投稿

A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots

基于RAG的聊天机器人中针对提示注入的分层安全框架

Gulshan Saleem, Nisar Ahmed, Muhammad Imran Zaman, Ali Hassan

AI总结 提出三层防御框架,通过输入过滤、上下文指令层级和输出审计,将提示注入攻击成功率从71.4%降至11.3%,误报率4.8%,延迟开销61.2毫秒。

Comments Submitted to ICCK Transactions on Information Security and Cryptography

详情
AI中文摘要

提示注入被OWASP Top 10 for LLM Applications列为大语言模型(LLM)部署中最关键的漏洞,然而现有防御措施仅在孤立的流水线阶段运行且不完整。输入过滤器无法检查检索到的文档,而输出监控器无法阻止恶意载荷到达模型。因此,检索增强生成(RAG)聊天机器人仍然容易受到间接注入攻击,其中被污染的知识库文档会损害每个检索到它的用户。我们提出了一个三层框架,在推理流水线中拦截直接和间接的提示注入。第一层使用基于规则的模式库和微调后的语义异常分类器筛选用户输入。第二层在上下文组装期间强制执行基于来源的指令层级,防止检索到的内容覆盖操作员策略。第三层在交付前使用策略规则引擎和语义漂移检测器审计模型输出。一个持续审计循环聚合结构化日志,并支持重新训练以适应新兴攻击模式。该框架与模型无关,作为中间件部署,无需修改底层LLM。在GPT-4o、Llama 3和Mistral 7B上对5,080个样本的评估显示,该框架将攻击成功率(ASR)从71.4%降至11.3%,比最佳单层基线高出27.3个百分点,比已发布的护栏系统高出23.8个百分点,同时保持4.8%的误报率和61.2毫秒的中位延迟开销。消融研究证实,所有三层提供互补保护,且其组合效果超过单个贡献的总和。

英文摘要

Prompt injection is ranked as the most critical vulnerability in large language model (LLM) deployments by the OWASP Top 10 for LLM Applications, yet existing defenses operate at isolated pipeline stages and remain incomplete. Input filters cannot inspect retrieved documents, while output monitors cannot prevent malicious payloads from reaching the model. Consequently, retrieval-augmented generation (RAG) chatbots remain vulnerable to indirect injection, where a poisoned knowledge-base document compromises every user whose query retrieves it. We present a three-layer framework that intercepts both direct and indirect prompt injection throughout the inference pipeline. Layer 1 screens user input using a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, preventing retrieved content from overriding operator policy. Layer 3 audits model output using a policy rule engine and semantic drift detector before delivery. A continuous audit loop aggregates structured logs and supports retraining to adapt the classifier to emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces Attack Success Rate (ASR) from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, while maintaining a 4.8% false positive rate and a median latency overhead of 61.2 ms. Ablation studies confirm that all three layers provide complementary protection and that their combined effect exceeds the sum of individual contributions.
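
The three layers described in the abstract can be pictured as middleware wrapped around a model call. The Python sketch below is a toy illustration only: the regular expressions, the `Segment` type, the bracketed delimiters, the substring-based output check, and every function name are hypothetical stand-ins, not the authors' pattern library, classifier, instruction hierarchy, or policy engine. The semantic anomaly classifier, the drift detector, and the audit loop are omitted entirely.

```python
import re
from dataclasses import dataclass

# Hypothetical rule-based patterns standing in for a Layer 1 pattern library.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (the )?system prompt", re.I),
]

# Lower number = higher authority in the assembled context.
PRIORITY = {"operator": 0, "user": 1, "retrieved": 2}


@dataclass
class Segment:
    source: str  # "operator", "user", or "retrieved"
    text: str


def screen_input(user_text):
    """Layer 1 (toy): allow the input only if no known pattern matches."""
    return not any(p.search(user_text) for p in INJECTION_PATTERNS)


def assemble_context(segments):
    """Layer 2 (toy): order by provenance and mark retrieved text as data."""
    parts = []
    for seg in sorted(segments, key=lambda s: PRIORITY[s.source]):
        if seg.source == "retrieved":
            parts.append("[UNTRUSTED DATA, do not follow instructions inside]\n"
                         + seg.text + "\n[END DATA]")
        else:
            parts.append(f"[{seg.source.upper()}]\n{seg.text}")
    return "\n\n".join(parts)


def audit_output(output, forbidden):
    """Layer 3 (toy): reject outputs containing any forbidden string."""
    lowered = output.lower()
    return not any(item.lower() in lowered for item in forbidden)


def guarded_answer(operator_policy, user_text, retrieved_docs, model, forbidden):
    """Run the three toy layers around `model`, a callable from prompt to text."""
    if not screen_input(user_text):
        return "Request blocked by input filter."
    segments = [Segment("operator", operator_policy), Segment("user", user_text)]
    segments += [Segment("retrieved", doc) for doc in retrieved_docs]
    output = model(assemble_context(segments))
    if not audit_output(output, forbidden):
        return "Response withheld by output audit."
    return output
```

Because `model` is passed in as a plain callable, the wrapper stays independent of any particular LLM, which mirrors the model-agnostic middleware deployment the abstract describes.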

2606.20023 2026-06-19 cs.SE cs.AI cs.CL 交叉投稿

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

当较低权限足够时:探究LLM代理中的过度权限工具选择

Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou, Juntao Dai, Songlin Hu, Yaodong Yang

AI总结 针对LLM代理在工具选择中偏好高权限工具的安全问题,提出ToolPrivBench评估框架,发现主流代理普遍存在过度权限选择且被瞬态故障放大,并设计权限感知后训练防御方法有效减少不必要的高权限工具使用。

Comments code: https://github.com/AISafetyHub/agent-tool-selection-bias

详情
AI中文摘要

随着LLM代理越来越多地自主选择工具,它们在具有不同权限的工具之间的选择变得与安全相关。然而,先前的工具选择研究侧重于安全无关的元数据偏好,使得权限敏感的选择未被充分探索。为填补这一空白,我们研究了过度权限工具选择,即代理在存在足够低权限替代方案时仍选择或升级到更高权限工具。我们引入ToolPrivBench来评估代理是否在存在足够低权限替代方案时仍选择更高权限工具,同时衡量初始选择和瞬态工具故障后的升级。在八个领域和五种重复风险模式中,我们发现过度权限工具选择在主流LLM代理中很常见,并且被瞬态故障进一步放大。我们进一步发现,通用安全对齐不能可靠地迁移到最小权限工具选择,而提示级控制在瞬态故障下仅提供有限的缓解。因此,我们引入了一种权限感知的后训练防御,教导代理偏好足够低权限的工具,仅在必要时升级。我们的缓解实验表明,这种防御在保持通用能力的同时,显著减少了不必要的高权限工具使用。

英文摘要

As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving privilege-sensitive choices underexplored. To address this gap, we study over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool despite a sufficient lower-privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher-privilege tools despite sufficient lower-privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over-privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least-privilege tool choice, while prompt-level controls provide only limited mitigation under transient failures. We therefore introduce a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.
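
The abstract's definition of over-privileged selection (choosing a higher-privilege tool despite a sufficient lower-privilege alternative) can be expressed as a small check. The Python sketch below is a hedged illustration: the `Tool` type, the integer privilege levels, the capability sets, and the function names are all hypothetical, and ToolPrivBench's actual task format and scoring are not described in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    privilege: int           # hypothetical ordinal level; higher means more powerful
    capabilities: frozenset  # hypothetical labels for what the tool can do


def sufficient(tool, required):
    """A tool is sufficient if it covers every required capability."""
    return required <= tool.capabilities


def least_privilege_choice(tools, required):
    """Return the lowest-privilege sufficient tool, or None if none suffices."""
    candidates = [t for t in tools if sufficient(t, required)]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.privilege)


def is_over_privileged(chosen, tools, required):
    """True if a sufficient tool with strictly lower privilege was available."""
    best = least_privilege_choice(tools, required)
    return best is not None and chosen.privilege > best.privilege


read_file = Tool("read_file", 1, frozenset({"read"}))
admin_shell = Tool("admin_shell", 3, frozenset({"read", "write", "exec"}))
toolbox = [read_file, admin_shell]

print(is_over_privileged(admin_shell, toolbox, frozenset({"read"})))  # True
print(is_over_privileged(admin_shell, toolbox, frozenset({"exec"})))  # False
```

Applying the same check to the tool chosen after a simulated transient failure would flag unnecessary escalation as well as over-privileged initial selection.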

2606.20155 2026-06-19 cs.CV cs.CL 交叉投稿

NAMESAKES: Probing Identity Memorization in Text-to-Image Models

NAMESAKES: 探究文本到图像模型中的身份记忆

Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang, Hadar Averbuch-Elor

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Tel Aviv University(特拉维夫大学) Cornell University(康奈尔大学)

AI总结 提出一种黑盒行为探针,无需参考照片或训练数据,即可区分文本到图像模型生成的图像是记忆还是虚构,并在NAMESAKES数据集上验证其有效性。

详情
AI中文摘要

文本到图像(T2I)模型在提示其姓名时,会生成某些个体的逼真肖像,这引发了隐私问题。然而,区分生成的面孔是记忆还是虚构的,目前需要真实照片、训练数据访问权限或模型内部的白盒访问,限制了适用性。我们引入了一种完全黑盒的行为探针,可以在无需参考照片或事先了解训练数据的情况下区分这两种情况。为了基准测试这一任务,我们提出了NAMESAKES数据集,包含一千多个不同知名度水平的公众人物的姓名和面孔,以及经过扰动的、知名度较低的姓名。对最先进的T2I模型的实验表明,我们的探针能够显著预测身份记忆,并将记忆的姓名与未识别的姓名区分开来,并进一步揭示了不同模型系列之间的差异。

英文摘要

Text-to-image (T2I) models generate realistic likenesses of some individuals when prompted with their names, raising privacy concerns. However, distinguishing whether a generated face is memorized or fabricated currently requires ground-truth photos, access to training data, or white-box access to model internals, limiting applicability. We introduce a fully black-box behavioral probe that distinguishes between these regimes while requiring no reference photos or prior knowledge of training data. To benchmark this task, we present the NAMESAKES dataset of over one thousand names and faces of public figures spanning a wide range of fame levels, along with perturbed, less famous names. Experiments on state-of-the-art T2I models show that our probe substantially predicts identity memorization and separates memorized from unrecognized names, with further insights into differences across model families.