arXivDaily arXiv每日学术速递 周一至周五更新
重置

1. 大语言模型与基础模型 8 篇

2606.18284 2026-06-18 cs.LG cs.AI cs.CL 交叉投稿

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

打破求解器瓶颈:在可学习前沿训练任务生成器

Lorenz Wolf, Connor Watts, Roger Creus Castanyer, Geoffrey Bradway, Maxwill Lin, Augustine N. Mavor-Parker, Matthew Daborn-Sargent

发表机构 * Vmax Goodfire AI

AI总结 提出PROPEL框架,通过训练轻量级激活探针作为求解率代理,在无需重复求解器评估的情况下优化任务生成器,使生成任务集中在可学习前沿,提升数学、代码和软件工程任务的有效性。

Comments 30 pages, 9 figures, 12 tables

详情
AI中文摘要

通过强化学习训练智能体的限制资源日益成为前沿任务供给:有效、可求解且刚好足够困难以训练当前模型的任务。随着推理和智能体模型的改进,固定任务分布趋于饱和,而天真的合成生成产生琐碎、不可能或不适定的任务。用强化学习训练任务生成器以优化有效性和可学习性可以解决这一瓶颈,但直接优化需要对每个候选任务进行重复求解器评估。对于软件工程任务,单次评估可能耗时数十分钟;求解器在环的生成器训练是不可行的。我们提出PROPEL,一个求解器摊销框架,用于在目标求解率下训练任务生成器。PROPEL在一次性标注的生成任务和求解器结果语料库上训练一个轻量级激活探针。该探针从冻结的生成器参考模型预测目标求解器的通过率,并在生成器优化期间作为求解率的代理,将生成器评估简化为单次前向传播。在多种模型规模下的数学、代码和软件工程任务中,PROPEL将生成任务转向目标求解率:对于编程,在可学习前沿生成的任务从$10.1\% \ ightarrow 20.0\%$(针对Qwen2.5-3B-Instruct求解器)和从$5.3\% \ ightarrow 12.6\%$(针对Qwen2.5-7B-Instruct求解器)。对于软件工程,PROPEL将目标求解率下的生成份额从$9.8\% \ ightarrow 19.6\%$(针对Qwen3.5-27B在探针和生成器训练期间未见过的仓库)。

英文摘要

The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve, fixed task distributions saturate, while naive synthetic generation yields tasks that are trivial, impossible, or ill-posed. Training a task generator with RL to optimize validity and learnability can address this bottleneck, but direct optimization requires repeated solver rollouts per candidate. For software-engineering (SWE) tasks, a single rollout can take tens of minutes; solver-in-the-loop generator training is intractable. We introduce PROPEL, a solver-amortized framework for training task generators at the targeted solve rate. PROPEL trains a lightweight activation probe on a one-time labeled corpus of generated tasks and solver outcomes. The probe predicts target-solver pass rate from a frozen generator reference model and serves as a proxy for solve rate during generator optimization, reducing generator evaluation to a single forward pass. Across math, code, and software-engineering at multiple model scales, PROPEL shifts generation toward the targeted solve rate: for coding, tasks generated at the learnable frontier increase from $10.1\% \rightarrow 20.0\%$ for a Qwen2.5-3B-Instruct solver and from $5.3\% \rightarrow 12.6\%$ for a Qwen2.5-7B-Instruct solver. For SWE, PROPEL increases the share of generations at the targeted solve rate from $9.8\% \rightarrow 19.6\%$ for Qwen3.5-27B on repositories not seen during training of probe and generator.

2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA 交叉投稿

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

LLMZero: 通过LLM智能体发现RL后训练的自适应训练策略

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

发表机构 * Amazon(亚马逊)

AI总结 提出LLMZero系统,利用LLM智能体通过树搜索发现多阶段RL后训练的自适应策略,揭示容量参数单调累积、正则化参数振荡的规律,在4个GRPO任务上相对基线提升9%-140%。

详情
AI中文摘要

RL后训练策略依赖于数据集,并揭示了一个反复出现的经验模式:容量参数在阶段间单调累积,而正则化参数主要根据训练动态的变化而振荡。这种区别很重要,因为固定调度将所有参数提交到固定轨迹,因此无法表达正则化必须跟踪的非平稳探索-利用权衡;该原则为多阶段训练提供了可操作的设计规则。我们通过LLMZero发现了这一点,该系统通过树搜索让LLM智能体搜索训练轨迹,诊断每个检查点的病理并提出协调的多参数转换。在4个不同的GRPO任务中,LLMZero发现的策略相对基础模型提升9%到140%,相对网格搜索提升6%到15%,始终优于随机搜索和基于技能的智能体。该结构原则跨任务迁移,解释了为什么发现的策略形式不同但参数动态相似。

英文摘要

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

2606.18487 2026-06-18 cs.LG cs.AI cs.CL 交叉投稿

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

SFT 过训练通过熵崩溃预测 RLVR 下的排名反转

Siddharth Aphale, Kelly Liu

发表机构 * Stanford University(斯坦福大学)

AI总结 研究发现 SFT 过度训练导致 rollout 分布熵降低,使 GRPO 中优势信号消失,从而引发排名反转;提出基于熵的两阶段诊断方法可预警高风险检查点。

Comments 14 pages, 6 figures. Accepted at the Deep Learning for Code (DL4C) Workshop at ICML 2026

详情
AI中文摘要

当 SFT 压缩 rollout 分布时,选择 pass@1 最高的 SFT 检查点进行 GRPO 的标准启发式方法可能失败。对于二元奖励,组内期望优势方差为 $p(1{-}p)(g{-}1)/g$;当早期 GRPO 将 $p$ 驱动到 $p^*(g)$ 以下时,大多数组具有相同奖励,不提供组间相对信号。我们研究了 Qwen2.5-Coder-3B 和 DeepSeek-Coder-6.7B 的 SFT 深度阶梯。我们在五个深度和三个种子上测试 Qwen2.5-Coder-3B,在四个匹配深度和三个种子上测试 DeepSeek-Coder-6.7B。在 Qwen 上,RL 前的 pass@1 随 SFT 深度增加而上升,但 GRPO 峰值 pass@10 从 $0.806$ 下降到 $0.481$(3 种子均值,$n{=}20$);RL 前的熵与 GRPO 结果正相关($\rho{=}{+}0.69$)。在 DeepSeek 上,pass@1 仍远高于 $p^*(8){=}0.083$,GRPO 结果压缩而非反转。结合 RL 前熵分诊与早期 GRPO 熵监测的两阶段诊断方法,可标记高风险检查点并提前停止失败运行。在我们的设置中,简单的 KL 参考正则化和标签平滑变体未能挽救崩溃的 Qwen 检查点,表明该失败并非琐碎的 GRPO 超参数伪影。

英文摘要

The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within group advantage variance is $p(1{-}p)(g{-}1)/g$; when early GRPO drives $p$ below $p^*(g)$, most groups have identical rewards and provide no group relative signal. We study SFT depth ladders for Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B. We test Qwen2.5-Coder-3B across five depths and three seeds, and DeepSeek-Coder-6.7B across four matched depths and three seeds. On Qwen, pre RL pass@1 rises with SFT depth, but peak GRPO pass@10 falls from $0.806$ to $0.481$ (3 seed mean, $n{=}20$); pre RL entropy is positively associated with the GRPO outcome ($ρ{=}{+}0.69$). On DeepSeek, pass@1 remains far above $p^*(8){=}0.083$, and GRPO outcomes compress rather than invert. A two stage diagnostic, combining pre RL entropy triage with an early GRPO entropy monitor, flags high risk checkpoints and can stop failing runs early. Simple KL to reference regularisation and label smoothing variants do not rescue the collapsed Qwen checkpoint in our setting, suggesting the failure is not a trivial GRPO hyperparameter artefact.

2606.18694 2026-06-18 cs.LG cond-mat.dis-nn cs.CL cs.NE nlin.AO 交叉投稿

Attention as Frustrated Synchronization

注意力作为受挫同步

Joshua Nunley

发表机构 * Cognitive Science Program(认知科学项目) Luddy School of Informatics, Computing, and Engineering(信息学、计算与工程学院) Indiana University Bloomington(印第安纳大学布卢明顿分校)

AI总结 提出受挫同步网络(FSN),通过复值耦合核和延迟项实现基于同步的注意力机制,在百万参数级字符级文本和代码任务上优于调优的RoPE-SwiGLU Transformer。

Comments 25 pages, 4 figures. Preliminary report at the 1-10M parameter scale

详情
AI中文摘要

一个完美同步的振荡器网络无法进一步计算,因此基于同步构建的注意力架构必须将其计算定位在结构性的偏离一致中。我们引入了受挫同步网络(FSN),其令牌状态是环面上的相位,整个值通路是一个学习到的复值耦合核,包含谐波和一步延迟。核的每个分量在同步文献意义上都是一个受挫。复相位是静态的Kuramoto-Sakaguchi受挫角,带符号的谐波是排斥性的Daido分量,而延迟项(将每个令牌与其关注的令牌的后继耦合)在代数上与Kuramoto-Sakaguchi耦合相同,其受挫角是数据自身的转移,因此下一个令牌预测被实现为由数据受挫的同步。在匹配百万参数和训练预算的字符级文本和代码任务上,FSN的验证损失在每个测量周期都低于调优的RoPE-SwiGLU Transformer,并且该比较在基线训练至收敛后仍然成立:每30个周期的enwik8种子都低于Transformer收敛的50周期损失1.611,而FSN完成的50周期运行收敛至1.5953 ± 0.0014。一种变体将每个前馈块替换为对学习到的集体模式的平均场耦合,堆栈中不保留多层感知机,其性能与Transformer相当。在自然文本上,无受挫的基础层在每个复制深度上都落后于收敛的Transformer,在长距离复制事件上最差;而核在四个及以上深度处逆转了这种劣势。标题比较在百万参数规模下进行;规模阶梯在四百万参数下完成,优势持续存在,其余分支标记为进行中。

英文摘要

A network of oscillators that synchronizes perfectly computes nothing further, so an attention architecture built from synchronization must locate its computation in structured departures from agreement. We introduce the Frustrated Synchronization Network (FSN), whose token states are phases on a torus and whose entire value pathway is one learned complex coupling kernel over harmonics and a one-step delay. Each component of the kernel is a frustration in the sense of the synchronization literature. The complex phases are static Kuramoto-Sakaguchi frustration angles, the signed harmonics are repulsive Daido components, and the delay term, which couples each token to the successors of the tokens it attends to, is algebraically identical to Kuramoto-Sakaguchi coupling whose frustration angle is the data's own transition, so next-token prediction is implemented as synchronization frustrated by the data. At matched one-million-parameter and training budgets on character-level text and code, the FSN's validation loss is below a tuned RoPE-SwiGLU transformer's at every epoch measured, and the comparison survives training the baseline to convergence: every thirty-epoch enwik8 seed finishes below the transformer's converged fifty-epoch loss of 1.611, and the FSN's completed fifty-epoch runs converge to 1.5953 +/- 0.0014. A variant with every feed-forward block replaced by mean-field coupling to learned collective modes, leaving no multilayer perceptron in the stack, tracks the transformer. On natural text the unfrustrated base layer falls behind the converged transformer at every copy depth, worst on long-range copy events; the kernel reverses the deficit at every depth of four and beyond. Headline comparisons are at the one-million-parameter scale; a scale ladder is complete through four million parameters with the advantage persisting, and remaining arms are marked as in progress.

2606.18910 2026-06-18 cs.LG cs.CL 交叉投稿

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

REVES:通过修订与验证增强的测试时扩展训练

Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

发表机构 * Northwestern University(西北大学) Amazon AGI(亚马逊人工智能实验室) Qualcomm AI Research(高通人工智能研究) University of Minnesota(明尼苏达大学)

AI总结 提出REVES框架,通过将中间步骤的“接近正确”答案转化为解耦的修订和验证提示,实现高效的离策略数据生成,提升大语言模型的多步推理能力,在LiveCodeBench上比强化学习基线高6.5分。

详情
AI中文摘要

通过顺序修订进行测试时扩展已成为增强大语言模型(LLM)推理能力的强大范式。然而,标准的后训练方法主要优化单次目标,与多步推理动态存在根本性不匹配。虽然最近的工作将其视为多轮强化学习(RL),但传统方法直接优化多步轨迹,未能进一步利用模型可以从纠正中学习的中间步骤中的高质量错误。我们提出了一个两阶段迭代框架,交替进行在线数据/提示增强和策略优化。通过将成功恢复轨迹中的中间步骤(“接近正确”答案)转化为解耦的修订和验证提示,我们的方法将训练集中在有效的答案转换和错误识别上。与标准的多轮RL相比,这种方法实现了高效的离策略数据生成,并减少了长程采样的计算开销。在LiveCodeBench上,使用公开可用的测试用例作为反馈,我们观察到比RL基线高6.5分,比标准多轮训练高4.0分。除了编码,我们的方法在圆填充问题上达到了先前报告的SOTA结果,同时使用了最小的基础模型(4B)和远少于更大进化搜索系统的采样次数。在真实验证下的数学结果进一步证实了改进的纠正能力。该方法还泛化到分布外的约束满足谜题,如n皇后和迷你数独,其中正确性完全由问题约束定义。代码可在该https URL获取。

英文摘要

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that model can learn from correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps (``near-miss'' answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n\_queens and mini\_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.

2606.19236 2026-06-18 cs.LG cs.AI cs.CL 交叉投稿

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE: 基于惊讶度的令牌级优势重加权以实现策略熵稳定性

Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng, Han Hu, Yansong Tang

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Tencent Hunyuan(腾讯混元)

AI总结 针对GRPO等RL算法中策略熵崩溃问题,提出STARE方法,通过惊讶度分位数识别熵关键令牌并重加权其优势,结合目标熵闭环门控稳定熵,在1.5B-32B模型和多种任务上实现稳定训练,AIME24/25准确率提升4%-8%。

Comments LLM, Reinforcement Learning

详情
AI中文摘要

基于可验证奖励的强化学习算法(如GRPO)已成为LLMs复杂推理的主流后训练范式,但通常在训练中遭受策略熵崩溃。我们对GRPO下的令牌级熵动态进行一阶梯度分析,识别出令牌级信用分配不匹配:每个令牌的熵变化分解为轨迹级优势与下一个令牌分布上的熵敏感函数的乘积,产生优势-惊讶度四象限结构和近临界性质。受此启发,我们提出STARE(基于惊讶度的令牌级优势重加权以实现策略熵稳定性),该方法通过批次内惊讶度分位数识别熵关键令牌子集,选择性重加权其有效优势,并引入目标熵闭环门控以实现稳定的熵调节。在1.5B至32B的模型规模以及三个任务族(短思维链、长思维链和多轮工具使用)上,STARE在数千步内维持稳定的RL训练,同时将策略熵保持在目标带内。在AIME24和AIME25上,STARE在平均准确率上比DAPO和其他竞争基线高出4%-8%,反思令牌和响应长度同步增长,表明持续探索-利用平衡进一步释放了RL训练潜力。代码可在https://github.com/xxxx获取。

英文摘要

Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by it, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating sustained exploration-exploitation balance that further unlocks RL training potential.Code is available at https://github.com/hp-luo/STARE.

2606.19264 2026-06-18 cs.LG cs.CL 交叉投稿

Structured Inference with Large Language Gibbs

大语言吉布斯结构化推理

Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer

AI总结 提出大语言吉布斯方法,利用大语言模型的条件分布作为转移算子进行结构化概率推理,通过迭代重采样变量避免顺序偏差,在合成分布、一致性推理和贝叶斯结构学习中验证有效性。

Comments Code: https://github.com/hyeok9855/large-language-gibbs

详情
AI中文摘要

大型语言模型(LLMs)中编码的知识可以作为描述复杂世界变量的结构化推理的基础,但以概率一致的方式访问这些知识构成了一个困难的推理问题。我们提出了大语言吉布斯,一种结构化概率推理方案,它使用LLM的条件分布作为转移算子。不是通过单次自回归生成来采样结构化对象,而是利用LLM的下一个标记条件分布,在给定其他变量的条件下迭代地重采样单个变量。这种方法避免了顺序依赖偏差,并产生一个反映所有局部条件分布之间折衷的平稳分布。我们将这种方法应用于从合成分布中采样、一致性推理任务和贝叶斯结构学习。结果表明,在通过噪声LLM条件分布可访问的世界先验下,MCMC中使用LLM条件分布是用于结构化概率推理的一次性生成的实际替代方案。

英文摘要

The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM's next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.

2606.19327 2026-06-18 cs.AI cs.CL 交叉投稿

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

重新思考奖励监督:基于评分准则的自蒸馏

Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying

发表机构 * Yale University(耶鲁大学)

AI总结 提出评分准则条件自蒸馏框架,通过结构化细粒度反馈指导推理模型,在科学推理基准上平均超越GRPO 1.0分、OPSD 0.9分。

详情
AI中文摘要

推理语言模型的后训练通常由监督蒸馏和基于可验证奖励的强化学习驱动。蒸馏通常依赖于思维链注释,这些注释获取成本高昂,且可能本身带有噪声、不完整或部分错误;即使最终答案正确,不完美的推理过程也会干扰学习。另一方面,基于验证奖励的强化学习通常将评估反馈压缩为标量信号,掩盖了响应中哪些方面需要改进。我们提出\textbf{评分准则条件自蒸馏}框架,该框架将评分准则作为结构化、细粒度的反馈用于策略内自蒸馏。我们的方法使教师模型以准则级评分准则为条件,并利用它在学生自身采样的轨迹上提供令牌级指导。这种设计避免了将单一参考推理过程作为唯一的监督目标。相反,评分准则指定了一个强响应应满足的条件,从而在推理过程中实现比标量奖励优化更细粒度的信用分配。我们通过一个两阶段流程实例化该框架:首先学习生成任务特定的评分准则,然后训练一个评分准则引导的推理器。我们在多样化的科学推理基准上进行评估,结果表明,评分准则条件自蒸馏有效地将准则级标准转化为推理过程中的令牌级指导,平均超过GRPO 1.0分、OPSD 0.9分。

英文摘要

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose \textbf{Rubric-Conditioned Self-Distillation}, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks and results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

2. 信息抽取、检索与问答 2 篇

2606.18780 2026-06-18 cs.CV cs.CL cs.MM 交叉投稿

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

SAMA:面向统一低资源多模态信息抽取的语义锚定对齐增强

Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

发表机构 * School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院)

AI总结 提出语义锚定对齐增强框架SAMA,通过构建结构化语义锚引导多专家多模态大模型生成高保真文本,并利用锚保留扩散机制合成图像,结合双约束过滤模块,在低资源多模态信息抽取任务中显著提升性能。

Comments Accepted by IEEE Transactions on Multimedia

详情
AI中文摘要

多模态信息抽取(MIE)——涵盖多模态命名实体识别(MNER)、关系抽取(MRE)和事件抽取(MEE)等任务——对于理解多媒体内容至关重要,但受到严重数据稀缺的限制。尽管数据增强是一种有前景的补救措施,但现有方法受到粗粒度跨模态对齐和碎片化、任务特定设计的阻碍,未能利用共享语义知识。为克服这些限制,我们引入了语义锚定对齐多模态增强(SAMA),一个用于生成高保真、任务感知合成数据的统一框架。SAMA从真实标签构建结构化语义锚,以指导协作多专家多模态大语言模型(CME-MLLM),该模型集成了用于共享语义的通用适配器和任务特定适配器,以生成多样且符合约束的文本样本。对于图像合成,SAMA采用锚保留扩散机制,使用锚加权提示和潜在条件来维持关键语义锚,同时多样化视觉上下文。为消除人工验证需求,SAMA进一步引入双约束过滤模块,基于跨模态一致性和锚保真度选择合成样本。在MNER、MRE和MEE基准数据集上的大量实验表明,SAMA在全监督和低资源设置下均一致优于最先进的增强基线,突显了其通用性、鲁棒性和有效性。

英文摘要

Multimodal Information Extraction (MIE)-covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE)-is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

2606.18947 2026-06-18 cs.AI cs.CL cs.IR cs.MA 交叉投稿

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

将搜索与推理解耦:面向LLM Agent的供应商无关的接地架构

Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das

发表机构 * DoorDash, Inc.(DoorDash公司)

AI总结 提出解耦搜索接地(DSG)架构,将搜索接地从推理模型中分离,通过MCP兼容网关实现供应商路由、缓存等控制,在降低成本和延迟的同时保持或提升准确性。

Comments 15 pages, Figure 8

详情
AI中文摘要

生产级LLM Agent越来越依赖实时搜索,但原生搜索接地将检索策略、供应商选择、证据注入、成本、延迟和生成行为捆绑在单一模型-供应商边界内。这种耦合使得接地难以检查、调优、重用或移植,并可能触发搜索诱导的冗长,破坏严格的输出合约。我们提出解耦搜索接地(DSG),一种供应商无关的边界,通过MCP兼容网关将接地移出推理模型,将供应商路由、源感知上下文渲染、配置的回退、检索深度控制以及精确和语义缓存作为一级控制暴露。在SimpleQA、FreshQA和HotpotQA上的五个前沿模型上,原生搜索在时效性敏感的FreshQA上领先,但DSG在控制重要时展现出更强的前沿:在SimpleQA上,它以91%更低的搜索成本接近原生准确率(86.1%对87.7%),保持简洁答案合约,并以68%更低的延迟达到99.4%的热缓存命中率。作为大规模Agent工作负载的共享生产接地层部署,DSG在电商查询理解(QIU)工作负载上匹配或略超原生搜索准确率,同时将搜索成本降低超过98%。实时接地最好被视为可优化的接口边界,而非固定的模型特性。

英文摘要

Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.

3. 对话系统与智能体 3 篇

2606.18543 2026-06-18 cs.AI cs.CL cs.SE 交叉投稿

CEO-Bench: Can Agents Play the Long Game?

CEO-Bench:智能体能否玩转长期博弈?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

发表机构 * Princeton University(普林斯顿大学)

AI总结 提出CEO-Bench,通过模拟500天运营初创公司的任务,评估语言模型智能体在长期、不确定、动态环境下的综合决策能力。

详情
AI中文摘要

语言模型智能体在软件工程、客户服务等孤立、短期的任务上正变得熟练。然而,现实世界的挑战需要结合多种复杂技能,这些技能在很大程度上尚未在智能体中得到测试:(1)在不确定性中导航长期视野;(2)在嘈杂环境中获取信息;(3)适应不断变化的世界;(4)协调多个移动部分以实现连贯目标。我们引入CEO-Bench,通过模拟一个代表性的现实世界任务——运营一家初创公司500天——来共同评估这些能力。智能体通过可编程的Python接口管理一家虚构公司的定价、营销、预算等众多方面,在相同的环境中运行,并面临与人类CEO相同的挑战。成功需要分析嘈杂、相互关联的业务数据库,将信号转化为合理的策略,并通过编程协调许多决策。最强的智能体编写复杂的代码,模拟客户群体以预测未来现金流,并挖掘谈判历史以揭示隐藏的客户偏好。即便如此,大多数最先进的模型在此环境中挣扎。只有Claude Opus 4.8和GPT-5.5的最终余额超过100万美元的起始资金,且两者均未能持续盈利。CEO-Bench迈出了衡量驱动持续、自适应进步所需智能的第一步。

英文摘要

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

2606.18668 2026-06-18 cs.MA cs.CL 交叉投稿

EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-scale Multi-Agent Systems

EARS:大规模多智能体系统中可靠子智能体建模的解释性弃权

Shuang Xie, Yunan Lu, Han Li, Lingyun Wang

发表机构 * Shopify Columbia University(哥伦比亚大学)

AI总结 针对大规模多智能体系统中子智能体过度回答导致幻觉的问题,提出EARS框架,通过将弃权重构为智能体间通信协议,利用校准的LLM裁判模型生成结构化弃权标签和理由,微调子智能体以检测故障并返回理由,在电商助手系统中将响应通过率从68.5%提升至78.9%。

详情
AI中文摘要

在大规模企业环境中,集中式多智能体系统(MAS)日益被采用,其中协调器将用户请求委托给轻量级、领域专业化的子智能体。虽然这种架构提高了模块化、可扩展性和成本效率,但其可靠性不仅取决于准确的路由,还取决于子智能体根据能力约束校准其响应的能力。特别是,基于较小微调模型的子智能体通常难以进行这种校准,导致它们过度回答模糊、未明确说明、路由错误或不支持的请求,并产生幻觉输出,而不是可操作的反馈。为了应对这一挑战,我们提出了EARS(用于可靠子智能体建模的解释性弃权),这是一个面向生产的框架,将子智能体弃权重新定义为智能体间通信协议:子智能体不仅弃权,而且向协调器暴露可操作的故障状态。EARS使用一组校准的LLM裁判模型来策划人机交互数据,在子智能体故障模式的分类法下生成结构化的弃权标签和理由。这些数据用于微调子智能体,使其能够检测故障条件并返回理由,以便协调器进行澄清、重新路由或回退。我们在一个支持企业商业智能工作流程的大规模生产电商助手中评估了EARS。EARS将整体响应通过率从68.5%提高到78.9%,证明了子智能体侧的解释性弃权提高了MAS的可靠性。

英文摘要

In large-scale enterprise settings, centralized multi-agent systems (MAS) are increasingly adopted, in which a coordinator delegates user requests to lightweight, domain-specialized sub-agents. While this architecture improves modularity, scalability, and cost efficiency, its reliability depends not only on accurate routing but also on sub-agents' ability to calibrate their responses to capability constraints. In particular, sub-agents built on smaller fine-tuned models often struggle with such calibration, leading them to over-answer ambiguous, underspecified, misrouted, or unsupported requests and produce hallucinated outputs instead of actionable feedback. To address this challenge, we present EARS (Explanatory Abstention for Reliable Sub-Agent Modeling), a production-oriented framework that reframes sub-agent abstention as an inter-agent communication protocol: a sub-agent does not merely abstain, but exposes an actionable failure state to the coordinator. EARS curates human-agent interaction data using an ensemble of calibrated LLM-as-a-Judge models, producing structured abstention labels and rationales under a taxonomy of sub-agent failure modes. These data are used to fine-tune sub-agents to detect failure conditions and return rationales for coordinator-level clarification, rerouting, or fallback. We evaluate EARS in a large-scale production e-commerce assistant supporting enterprise business intelligence workflows. EARS improves the overall response pass rate from 68.5% to 78.9%, demonstrating that sub-agent-side explanatory abstention improves MAS reliability.

2606.19144 2026-06-18 cs.AI cs.CL 交叉投稿

Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

人机协同演化动力学:长期互动中社会智能涌现的形式理论

Jingyi Zhou, Senlin Luo, Haofan Chen

AI总结 提出人机协同演化动力学框架(HACD-H),将情感适应、关系组织、社会记忆和人格一致性整合为统一动力学模型,通过约14,700轮对话数据集验证,发现社会智能与社会认知能量显著负相关,揭示社会智能源于长期协同演化。

详情
AI中文摘要

当前的对话式AI系统在语言生成、个性化和长上下文交互方面取得了显著进展。然而,大多数现有方法通过孤立组件(如情感建模、记忆检索或人格条件化)来建模社会行为,缺乏一个统一的框架来解释长期人机交互中稳定社会关系和社会智能的涌现。为解决这一问题,我们提出了人机协同演化动力学框架(HACD-H),这是一个将人机交互建模为自组织社会认知系统的形式模型。HACD-H将情感适应、关系组织、社会记忆和人格一致性整合到一个统一的动力学框架中,并引入了多时间尺度社会认知、关系吸引子、信任盆地、发展相变和社会认知能量景观等原则。我们构建了一个约14,700轮交互的对话数据集,并开发了一个理论驱动的实证评估框架。结果揭示了社会认知中的时间持久性层次结构、稳定的关系吸引子、类似相变的发展模式以及结构化的社会认知能量景观。社会智能与社会认知能量呈显著负相关(r = -0.391, p < 0.001),且交互轨迹随时间呈现渐进性能量减少。这些发现表明,社会智能源于长期的社会认知协同演化,而非孤立的对话能力。HACD-H为建模适应性人机社会交互和开发社会智能AI系统提供了统一的理论基础。

英文摘要

Current conversational AI systems have made significant progress in language generation, personalization, and long-context interaction. However, most existing methods model social behavior through isolated components such as emotion modeling, memory retrieval, or persona conditioning, lacking a unified framework to explain the emergence of stable social relationships and social intelligence in long-term human-AI interaction.To address this, we propose the Human-AI Coevolution Dynamics Framework (HACD-H), a formal model of human-AI interaction as a self-organizing social cognitive system. HACD-H integrates emotional adaptation, relational organization, social memory, and personality consistency into a unified dynamical framework and introduces principles including multi-timescale social cognition, relational attractors, trust basins, developmental phase transitions, and social cognitive energy dynamics.We construct a conversational dataset with approximately 14,700 interaction turns and develop a theory-driven empirical evaluation framework. Results reveal a hierarchy of temporal persistence in social cognition, stable relational attractors, phase-transition-like developmental patterns, and a structured social cognitive energy landscape. Social intelligence shows a significant negative correlation with social cognitive energy (r = -0.391, p < 0.001), and interaction trajectories exhibit progressive energy reduction over time.These findings suggest that social intelligence emerges from long-term social cognitive coevolution rather than isolated conversational capabilities. HACD-H provides a unified theoretical foundation for modeling adaptive human-AI social interaction and developing socially intelligent AI systems.

4. 文本生成、摘要与编辑 1 篇

2606.18788 2026-06-18 cs.CV cs.CL 交叉投稿

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

HandwritingAgent: 语言驱动的可缩放矢量空间手写合成

Jaward Sesay, Yue Yu, Börje F. Karlsson

发表机构 * Beijing Institute of Technology(北京理工大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 提出HandwritingAgent,利用大推理模型在SVG格式中自动回归生成手写笔画序列,无需风格特定训练,通过自然语言和参考图像控制风格,在模仿、识别、多语言及复杂数学表达式合成等任务上达到或超越现有最优方法。

详情
AI中文摘要

教会机器模仿自然手写风格仍然是一个开放挑战,因为它需要合成在形状、纹理、压力和字体上动态变化的笔画序列——不仅在不同个体之间,而且在同一个人的手写中也是如此。针对这一挑战的尝试主要探索了在线和离线环境下的深度学习方法。然而,这些方法通常受到风格特定架构选择、对大型数据集的严重依赖、高计算成本以及缺乏通过自然语言灵活控制书写风格的限制。为此,我们引入了HandwritingAgent,一个语言驱动的智能体,它可以直接在可缩放矢量图形(SVG)格式中合成自然手写序列,无需风格特定训练。该智能体利用大型推理模型在离散网格画布环境中对目标手写字形进行几何分析并自回归生成笔画序列。生成过程以对话或非对话模式提供的文本以及参考手写风格图像为条件。在涵盖模仿、识别、多语言手写合成以及复杂手写数学和科学表达式生成等多样化手写任务上的实验表明,性能有显著提升,HandwritingAgent匹配或超越了最先进的生成式手写模型,同时提供了一种更高效、可控且泛化能力更强的合成方法。

英文摘要

Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.

5. 多模态语言处理 1 篇

2606.19341 2026-06-18 cs.CV cs.CL cs.SD 交叉投稿

Native Active Perception as Reasoning for Omni-Modal Understanding

原生主动感知作为全模态理解的推理

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

发表机构 * The Chinese University of Hong Kong(香港中文大学) Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学) Qwen Team, Alibaba Group(阿里巴巴集团Qwen团队)

AI总结 提出OmniAgent,一种基于POMDP迭代观察-思考-行动循环的原生全模态智能体,通过主动感知将推理复杂度与视频时长解耦,在多个基准上达到开源模型最优性能。

Comments Accepted at ICML 2026. Code and models: https://github.com/harryhsing/omniagent

详情
AI中文摘要

用于长视频理解的被动模型通常依赖于“全看一遍”范式,无论查询难度如何都统一处理帧,导致计算成本随视频时长增长。尽管出现了交互式框架,但它们通常依赖于全局预扫描,其上下文成本仍随视频长度扩展。我们提出OmniAgent,第一个原生全模态智能体,将视频理解建模为基于POMDP的迭代观察-思考-行动循环。OmniAgent执行按需动作,选择性地将视听线索提炼到持久文本记忆中,有效将推理复杂度与原始视频时长解耦。为实现这一点,我们引入了(1)智能体监督微调,通过最佳N轨迹合成和双阶段质量控制在启动原生主动感知;(2)带TAURA(轮次感知自适应不确定性重缩放优势)的智能体强化学习,利用轮次级熵将信用分配引导至关键发现轮次。关键的是,OmniAgent表现出正向测试时缩放,性能随推理轮次增加而提升,验证了主动感知的有效性。在十个基准(如VideoMME、LVBench)上的实验结果表明,OmniAgent在开源模型中达到了最先进性能。值得注意的是,在LVBench上,我们的7B智能体优于10倍大的Qwen2.5-VL-72B(50.5% vs. 47.3%)。

英文摘要

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10$\times$ larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

6. 语音语言联合与音频文本 1 篇

2606.18979 2026-06-18 eess.AS cs.CL cs.SD 交叉投稿

Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment

缓解语音痴呆评估中的评分错误并补偿非语言子测试

Franziska Braun, Christopher Witzl, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

AI总结 研究通过融合转录分数和Whisper嵌入减少语音评估中的评分错误,并利用融合表示近似专家整体评分以补偿缺失的运动子测试,有效区分认知状态组。

Comments Accepted at INTERSPEECH 2026

详情
AI中文摘要

认知障碍的早期检测依赖于神经心理学测试,通过评估多个认知领域来最小化主观性。基于语音的评估可以支持诊断并提高可及性,但转录错误和非语言子测试(如运动技能)的遗漏限制了准确性。除了传统的测试分数,语音衍生特征可以提供对认知状态的额外见解。本研究调查了德国“综合征短测试”的语音评估,这是一种标准化的痴呆筛查测试,包含语言和运动子测试。我们训练模型,整合每个语言子测试的转录衍生分数和Whisper嵌入,以减少评分错误。为了补偿缺失的运动子测试,我们利用这些融合表示来近似专家整体评分。尽管省略了子测试,我们的模型与专家评分高度相关,并能有效且准确地区分认知状态组。

英文摘要

Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but transcription errors and the omission of nonverbal subtests (e.g., motor skills) limit accuracy. Beyond conventional test scores, speech-derived features can provide additional insights into cognitive status. This study investigates the speech-based evaluation of the German "Syndrom-Kurz-Test," a standardized dementia screening test comprising verbal and motor subtests. We train models that integrate transcript-derived scores and Whisper embeddings per verbal subtest to reduce scoring errors. To compensate for missing motor subtests, we then leverage these fused representations to approximate expert overall ratings. Despite omitting subtests, our models strongly correlate with expert ratings and efficiently and accurately discriminate between cognitive status groups.

7. 评测、数据集与基准 4 篇

2606.18686 2026-06-18 cs.AI cs.CL cs.LG 交叉投稿

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

ForecastBench-Sim:一个模拟世界预测基准

Jaeho Lee, Nick Merrill, Ezra Karger

发表机构 * Forecasting Research Institute(预测研究所)

AI总结 提出基于Freeciv游戏模拟的预测基准ForecastBench-Sim,通过游戏回滚生成可控、即时可解的预测问题,用于评估AI系统的概率推理能力。

Comments 15 pages, 5 main figures, 6 appendix figures. Spotlight presentation at Forecasting as a New Frontier of Intelligence / Workshop on AI Forecasting, ICML 2026

详情
AI中文摘要

通用AI系统的预测基准通常继承现实世界的约束:结果缓慢显现、尾部事件罕见、反事实问题难以评分。我们引入ForecastBench-Sim,一个基于Freeciv(一款以文明系列为模型的回合制策略游戏)游戏回滚的模拟世界预测基准。预测者接收固定的世界报告(当前游戏状态的结构化快照),并回答关于隐藏未来状态的问题;然后基准继续模拟并对预测进行评分。由于世界是模拟的,同一设置可以生成任意时间跨度的连续或二元预测问题、用于条件或因果问题的配对干预世界,以及罕见或破坏性结果的已解决示例。我们描述了基准流程、问题族、评分协议和发布工件,并报告了来自模型评估和匿名人工试点的验证切片。ForecastBench-Sim旨在通过提供受控、即时可解的任务来补充现实世界预测基准,用于研究动态世界状态下的概率推理。

英文摘要

Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series. Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores forecasts. Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes. We describe the benchmark pipeline, question families, scoring protocol, and release artifacts, and report validation slices from model evaluations and an anonymized human pilot. ForecastBench-Sim is intended to complement real-world forecasting benchmarks by providing controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.

2606.18829 2026-06-18 cs.LG cs.CL 交叉投稿

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

GateMem:多主体共享内存代理中的内存治理基准

Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu, Zhihao Shu, Bingjie Zhang, Yangyang Xu, Dandan Guo, Shuicheng Yan

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) Shanghai Jiao Tong University(上海交通大学) King Abdullah University of Science and Technology (KAUST)(卡尔斯鲁厄大学) Tsinghua University(清华大学) National University of Singapore(新加坡国立大学)

AI总结 提出GateMem基准,评估多主体共享内存代理在效用、访问控制和遗忘三方面的治理能力,发现现有方法无法同时满足三者。

Comments 24 pages, 8 figures. Code and dataset are available at https://github.com/rzhub/GateMem and https://huggingface.co/datasets/Ray368/GateMem

详情
AI中文摘要

LLM代理的内存基准主要假设单用户设置,而医院、工作场所、校园和家庭中的共享助手研究不足。在这些部署中,多个主体写入公共内存池并根据不同角色、范围和关系进行查询,因此内存质量需要治理和召回。我们引入GateMem,一个多主体共享内存代理的基准。GateMem联合评估合法长期请求的效用(含状态更新)、跨上下文授权边界的访问控制,以及显式删除请求后的主动遗忘。它涵盖医疗、办公、教育和家庭领域,包含长形式多方情节、增量内存注入、隐藏检查点、结构化评判和泄漏目标注释。在多种基线和骨干模型上,没有方法能同时实现强效用、鲁棒访问控制和可靠遗忘。长上下文提示通常以高令牌成本获得最佳治理分数,而基于检索和外部内存的方法降低成本但仍泄漏未授权或已删除信息。这些结果表明,当前内存代理远未达到可靠的共享机构部署水平。

英文摘要

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.

2606.19139 2026-06-18 cs.CV cs.CL 交叉投稿

Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

Urdu Katib 手写数据集:用于离线乌尔都语手写文本识别的历史文档数据集及基于CRNN的基线评估

Ramza Basharat, Muhammad Usman Ali

发表机构 * Department of Computer Science, University of Gujrat(古杰拉特大学计算机科学系)

AI总结 为解决乌尔都语手写文本识别中数据集稀缺的问题,本文提出了首个由历史时期Katib书写的离线乌尔都语手写文本行数据集UKHD,并评估了多种CRNN混合模型,其中CNN-BGRU-CTC在字符错误率和词错误率上表现最优。

详情
AI中文摘要

自动手写文本识别(HTR)本质上是一项具有挑战性的任务,当处理草书体时,其复杂性进一步增加。尽管在各种草书体上已经做出了显著努力,但关于乌尔都语手写文本识别(UHTR)的研究相对有限。这种研究滞后主要是由于其文字带来的独特挑战,以及基准数据集的稀缺和不可用。因此,为了推进UHTR研究,本研究提出了一个专门的真实数据集,称为Urdu Katib手写数据集(UKHD)。据我们所知,这是第一个专门从历史时期Katib书写的材料中整理的离线乌尔都语手写文本行数据集。它涵盖了Nastalique书法风格中各种扁平笔尖书写变体。此外,评估了不同基于CRNN的混合模型的有效性,以确定用于Urdu Katib手写识别(UKHR)的最佳架构。在分析的模型中,CNN-BGRU-CTC模型表现出更稳健的性能,具有较低的字符错误率(CER)和词错误率(WER)。本研究工作旨在支持和鼓励研究社区开发用于保存乌尔都语手写文学的稳健识别系统。

英文摘要

Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts, research regarding Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This lag of research is primarily due to the unique challenges posed by its script, and the scarcity and unavailability of benchmark datasets. Therefore, to advance research in UHTR, this study presents a specialized real dataset called the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, this is the first offline Urdu handwritten text lines dataset specifically curated from the materials written by Katibs in historical times. It encompasses a diverse range of flat nib writing variations in the Nastalique calligraphic style. Additionally, the effectiveness of different CRNN-based hybrid models has been evaluated to identify the optimal architecture for Urdu Katib Handwriting Recognition (UKHR). Among the analyzed models, the CNN-BGRU-CTC model showed more robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This research work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.

2606.19157 2026-06-18 eess.AS cs.CL 交叉投稿

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

IndicContextEval:评估8种印度语言音频大语言模型上下文利用能力的基准

Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George, R J Hari, Kaushal Bhogale, Mitesh M. Khapra

AI总结 提出IndicContextEval基准,包含8种印度语言555位说话人的56小时自然语音,通过7级提示框架评估音频大语言模型是否真正利用上下文而非依赖参数化知识。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

音频大语言模型(AudioLLMs)能够基于文本提示(如领域描述或实体列表)进行语音识别。然而,尚不清楚这些模型是真正利用此类上下文,还是依赖预训练期间学到的参数化知识。现有基准无法回答这个问题,因为它们仅在固定提示条件下评估转录,且很少包含明确的上下文输入。我们引入IndicContextEval,这是一个56小时的多语言基准,包含来自8种印度语言和23个专业领域的555位说话人的自然语音。我们设计了一个7级提示框架,逐步引入上下文信号,包括元数据、自然语言描述、英语和本地文字的实体列表,以及包含错误实体的对抗性提示。评估五个模型揭示了上下文利用行为的显著差异,凸显了对音频大语言模型中上下文基础进行显式评估的必要性。

英文摘要

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

8. 安全、隐私、公平与可解释NLP 4 篇

2606.18264 2026-06-18 cs.SI cs.AI cs.CL 交叉投稿

Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies

使用多LLM智能体模拟仇恨言论级联:实证基础、建模保真度与干预策略

Fan Huang

发表机构 * Indiana University Bloomington(印第安纳大学布卢明顿分校)

AI总结 本研究通过多LLM智能体系统模拟在线仇恨言论传播,发现其能再现实证数据中的立场单一性和毒性同质性,并通过消融实验识别出智能体异质性为关键保真因素,提出针对密集网络的放大器干预策略。

详情
AI中文摘要

在线平台上仇恨内容传播的忠实建模仍然是内容审核研究中的一个开放问题。经典的级联模型没有明确表示与仇恨内容传播相关的用户画像、社区和内容因素,因此在实际场景中部署时可能产生效果较差的审核策略。多智能体大语言模型系统原则上可以使每次转发决策依赖于用户画像、周围社区和帖子内容,但尚不清楚这种增加的灵活性是否比经典基线更忠实地再现真实的仇恨级联。我们研究了三个仇恨Bluesky级联和一个大小匹配的良性对照。在实证Bluesky数据中,我们发现:97.4--99.7%的转发者采取敌对立场;对于仇恨级联,扩散树上的毒性-参与同质性高于关注图;仇恨级联的拓扑结构是星形(大多数转发直接来自根节点),而良性级联是树形(转发通过多跳链传播)。在模拟中,多LLM智能体模拟器再现了立场单一性和毒性差异方向。结构化消融实验将智能体异质性识别为主要的保真因素,针对密集网络的放大器干预在5.7%良性附带损害下实现了7.5--12.9%的减少。

英文摘要

Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors associated with hateful-content propagation may yield moderation strategies that behave less effectively when deployed in real-world scenarios. Multi-agent large language model (LLM) systems can, in principle, make each reshare decision depend on the user's profile, the surrounding community, and the post's content, but it remains unclear whether this added flexibility actually reproduces real hateful cascades more faithfully than classical baselines. We study three hateful Bluesky cascades and a size-matched benign control. In the empirical Bluesky data, we found that: 97.4--99.7\% of reposters take a hostile stance; toxicity-engagement homophily is higher on the diffusion tree than on the follower graph for hateful cascades; topology is star-like for the hateful cascades (most reposts come directly from the root) versus tree-like for the benign cascade (reposts propagate through multi-hop chains). In simulation, a multi-LLM-agent simulator reproduces the stance monoculture and the toxicity-delta direction. A structured ablation identifies agent heterogeneity as the leading fidelity factor, and amplifier targeting on dense networks yields 7.5--12.9\% reduction at 5.7\% benign collateral.

2606.18383 2026-06-18 cs.LG cs.CL 交叉投稿

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

从稀疏特征到可信代理:认证基于SAE的可解释性

Dibyanayan Bandyopadhyay, Asif Ekbal

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology Patna(印度理工学院巴特那分校计算机科学与工程系)

AI总结 提出一种后验泛化框架,通过稀疏代理(SAE重建)认证语言模型,推导期望风险上界,并在GPT-2 Small等模型上验证非平凡界,揭示深层更易认证且特征分解区分语义对齐与统计稀疏性。

详情
AI中文摘要

稀疏自编码器(SAE)越来越多地被用于从语言模型(LM)中提取可解释特征,但一个核心问题仍然存在:基于SAE的解释何时可以被视为底层冻结LM的忠实视图?我们通过一个后验泛化框架来研究这个问题,该框架通过稀疏代理来认证LM,稀疏代理是通过将原生隐藏激活替换为其预训练的SAE重建而获得的。我们的框架使用四个可测量量推导出基础模型期望风险的上界:代理风险、SAE重建差距、概念池不匹配和稀疏复杂度。我们将此证书解释为解释忠实性的操作标准。特别地,非平凡界表明提取的稀疏特征保留了有意义的预测信息,而小的重建和匹配误差表明代理在行为上接近原始模型。实验上,我们展示了在GPT-2 Small、Gemma-2B和Llama-3-8B上,该界在实际样本量下变得非平凡。对Llama-3-8B的详细逐层分析揭示了强烈的深度依赖性,较深层变得更容易认证,这与更强的局部保真度和更弱的下游误差放大相关。最后,通过特征洗牌消融,我们展示了分解区分了真正的语义对齐与单纯的统计稀疏性,为基于SAE的解释何时变得不太可靠提供了有用的诊断。

英文摘要

Sparse autoencoders (SAEs) are increasingly used to extract interpretable features from language models (LMs), yet a central question remains: when can an SAE-based explanation be treated as a faithful view of an underlying frozen LM We study this through a post-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction. Our framework derives an upper bound on the base model's expected risk using four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. We interpret this certificate as an operational criterion for explanatory faithfulness. In particular, a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy remains behaviorally close to the original model. Empirically, we show that the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A detailed layerwise analysis of Llama-3-8B reveals a strong depth dependence, with later layers becoming much easier to certify, associated with both stronger local fidelity and weaker downstream error amplification. Finally, through feature-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE-based explanations become less reliable.

2606.18530 2026-06-18 cs.CR cs.CL cs.LG 交叉投稿

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

评估基于提示的防御策略对抗领域伪装注入攻击

Aaditya Pai

发表机构 * Data Science Institute(数据科学研究所)

AI总结 针对领域伪装注入攻击,评估五种基于提示的防御方法(如释义、重点标记等)在三个模型家族和三个部署领域中的有效性,发现释义法最有效,可将伪装攻击成功率降低55-84%。

Comments 9 pages, 4 figures, 4 tables; under review at the AdvML-Frontiers x CoTMA workshop, COLM 2026

详情
AI中文摘要

领域伪装注入攻击使用领域特定词汇将恶意指令嵌入检索内容中,从而逃避依赖句法注入标记的标准检测器。当检测失败时,从业者需要知道哪些防御架构能降低攻击成功率。我们评估了五种基于提示的防御方法(重点标记、释义、提示夹层以及两种组合)对抗领域伪装注入攻击,涉及三个模型家族(Claude Haiku、Llama 3.1 8B、Gemini 2.0 Flash)和三个部署领域(金融、法律、通用),共进行3,510次试验。在代理处理之前对检索内容进行释义是最一致有效的防御方法,根据模型不同,可将伪装攻击成功率降低55-84%,并且在所有测试模型上均实现了比我们的Llama Guard 4配置更低的攻击成功率。防御效果强烈依赖于模型:重点标记在Claude Haiku上将攻击成功率减半,但在Llama 3.1 8B上没有任何益处。金融领域部署面临最高的残余风险,基线攻击成功率为26-33%,在较弱模型上没有任何基于提示的防御能完全消除威胁。这些结果首次系统评估了专门针对伪装类注入攻击的基于提示的防御方法,并为从业者建立了基于基准的建议。所有任务均使用合成构建的专业文档;这些基准排名是否能推广到真实企业文档仍是一个开放问题。

英文摘要

Domain-camouflaged injection attacks embed malicious instructions in retrieved content using domain-appropriate vocabulary, evading standard detectors that rely on syntactic injection markers. When detection fails, practitioners need to know which defense architectures reduce attack success. We evaluate five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general) using 3,510 trials. Paraphrasing retrieved content before agent processing is the most consistently effective defense in this benchmark, reducing camouflage attack success rate by 55-84\% depending on model, and achieves lower attack success rates than our Llama Guard 4 configuration on every model tested. Defense effectiveness is strongly model-dependent: spotlighting halves attack success on Claude Haiku but provides no benefit on Llama 3.1 8B. Financial domain deployments face the highest residual risk at 26-33\% baseline attack success rate, with no prompting-based defense fully eliminating the threat on weaker models. These results provide the first systematic evaluation of prompting-based defenses specifically against camouflage-class injection attacks and establish benchmark-based recommendations for practitioners. All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.

2606.18571 2026-06-18 cs.LG cs.CL cs.SD eess.AS 交叉投稿

Fair Cognitive Impairment Detection Through Unlearning

通过去学习实现公平的认知障碍检测

William Nguyen, Jiali Cheng, Hadi Amiri

发表机构 * University of Massachusetts Lowell, USA(马萨诸塞大学洛厄尔分校)

AI总结 提出一种多模态框架,结合跨模态融合和梯度反转去学习,减少人口统计信息对轻度认知障碍检测的偏见,在跨语言数据集上缩小性能差距。

Comments Interspeech 2026

详情
AI中文摘要

轻度认知障碍(MCI)是一种以记忆、语言或思维能力显著下降为特征的医学状况。从自发语音中检测MCI对于可扩展的筛查具有前景。然而,学习模型常常利用与标签相关的人口统计线索,导致不同亚组之间存在较大的性能差距。我们提出了一种多模态框架,结合了(i)模态间(语音、文本和图像)的跨模型融合,以及(ii)使用梯度反转的去学习,该技术阻止共享嵌入编码与任务无关的人口统计属性。在多语言基准TAUKADIAL和PREPARE上的评估表明,我们的方法在MCI分类上优于最先进的多语言和多模态基线,同时显著缩小了患者亚组(性别和语言)之间的性能差距。我们进一步分析了跨数据集的迁移,表明人口统计去学习有助于学习更鲁棒的MCI检测表示。

英文摘要

Mild Cognitive Impairment (MCI) is a medical condition characterized by a noticeable decline in memory, language, or thinking abilities. MCI detection from spontaneous speech is promising for scalable screening. However, learned models often exploit demographic cues correlated with labels, resulting in a large performance gap across subgroups. We present a multimodal framework that combines (i) cross-model fusion between modalities (speech, text, and image), and (ii) unlearning using gradient reversal that discourages the shared embedding from encoding task-irrelevant demographic attributes. Evaluated on the multilingual benchmarks TAUKADIAL and PREPARE, our method outperforms the state-of-the-art multilingual and multimodal baseline in MCI classification while substantially reducing the performance gap across patient subgroups (sex and language). We further analyze transfer across datasets, showing that demographic unlearning helps learn more robust representations for MCI detection.

9. 其他/综合NLP 3 篇

2606.18520 2026-06-18 stat.ML cs.CG cs.CL cs.DS cs.IR cs.LG 交叉投稿

Compact Geometric Representations of Hierarchies

层次结构的紧凑几何表示

Prashant Gokhale, Piotr Indyk, Yuhao Liu, Sandeep Silwal, Tony Chang Wang, Haike Xu

发表机构 * UW-Madison(威斯康星大学麦迪逊分校) MIT(麻省理工学院)

AI总结 研究如何用低维几何嵌入表示有向无环图中的祖先-后代关系,提出基于树宽等结构参数的维度上界和下界,并在真实数据集上验证了紧凑性。

Comments Published at the 39th Annual Conference on Learning Theory (COLT) 2026. 22 Pages

详情
AI中文摘要

计算数据的几何表示是现代机器学习的基石,通常通过训练双编码器将查询和文档映射到共享嵌入空间来实现。You等人[NeurIPS '25]的最新工作将这种方法扩展到层次检索,其中相关性由有向无环图(DAG)中的祖先-后代关系决定。虽然先前的工作表明当后代数量较少时存在有效嵌入,但这些界限对于深层层次结构会严重退化,所需维度与节点总数相当。在本文中,我们研究了更一般图类的紧凑可达性嵌入,并提供了使用维度依赖于结构图参数的嵌入来表示层次结构的理论保证。我们证明,对于任何有向树,存在常数维度3的可达性嵌入,与树的大小或深度无关。我们将这一结果推广到以树宽$t$为特征的图,构造了维度为$O(t \log n)$的嵌入,其中$n$是节点数。作为这些上界的补充,我们提供了匹配或接近匹配的下界,表明对于一般DAG,维度$\Omega(n)$是必要的,而对于树宽为$t$的图,需要$\Omega(t/\log(n/t))$的维度。我们还获得了由DAG中交叉边数量参数化的上界和下界。此外,我们展示了我们的嵌入可以在真实世界数据集上构建,并且与先前具有理论保证的嵌入相比,在高召回率情况下维度小得多。

英文摘要

Computing geometric representations of data is a cornerstone of modern machine learning, typically achieved by training dual encoders which map queries and documents into a shared embedding space. Recent work of You et al. [NeurIPS '25] has extended this approach to hierarchical retrieval, where relevance is determined by the ancestor-descendant relationships in a Directed Acyclic Graph (DAG). While previous work has shown that valid embeddings exist when the number of descendants is small, these bounds degrade significantly for deep hierarchies, requiring dimensions as large as the total number of nodes. In this paper, we investigate compact reachability embeddings for more general graph classes and provide theoretical guarantees for representing hierarchies using embeddings whose dimension depends on structural graph parameters. We prove that for any directed tree, there exists a reachability embedding in constant dimension 3, independent of the tree's size or depth. We generalize this result to graphs characterized by treewidth $t$, constructing embeddings of dimension $O(t \log n)$, where $n$ is the number of nodes. Complementing these upper bounds, we provide matching or near-matching lower bounds, showing that dimension $Ω(n)$ is necessary for general DAGs and $Ω(t/\log(n/t))$ is required for graphs of treewidth $t$. We also obtain upper and lower bounds parameterized by the number of cross-edges in the DAG. We additionally show that our embeddings can be constructed on real world datasets, and that they give much smaller dimensions in high recall regimes compared to prior embeddings with theoretical guarantees.

2606.18941 2026-06-18 cs.PL cs.CL 交叉投稿

Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

Graph-ESBMC-PLC:使用基于SMT的模型检查对图形化PLCopen XML梯形图程序进行形式验证

Pierre Dantas, Lucas Cordeiro, Waldir Junior

发表机构 * Computer Science, The University of Manchester(计算机科学,曼彻斯特大学) Electrical Engineering, Federal University of Amazonas (UFAM)(电气工程,亚马逊联邦大学(UFAM))

AI总结 针对ESBMC-PLC无法处理图形化PLCopen XML梯形图的问题,提出基于DFS的图形LD解析器,将连接图转换为布尔触点合取,并采用三级I/O推断方案,成功实现完整GOTO IR转换,验证了3个图形LD程序。

Comments 18 pages

详情
AI中文摘要

PLCopen XML为IEC 61131-3梯形图程序定义了两种编码格式:一种使用<rung>元素的文本编码,另一种将梯形逻辑表示为localId/refLocalId连接的有向图的图形编码。ESBMC-PLC支持文本格式,但将来自CONTROLLINO、Beremiz和OpenPLC Editor的图形导出解析为空GOTO中间表示,导致空洞的验证成功。本文提出Graph-ESBMC-PLC,通过基于DFS的图形LD解析器填补了这一空白。该解析器从leftPowerRail遍历连接图到每个线圈,将梯形路径提取为布尔触点合取,并应用三级I/O推断方案。按rightPowerRail的connectionPointIn序列对线圈排序,确保SET线圈在RESET线圈之前处理,匹配IEC扫描周期语义。图形到IR的转换无需改动ESBMC后端。在来自CONTROLLINO/OpenPLC Editor的3个图形LD程序上的验证表明,所有程序都生成了包含非确定性输入和梯形逻辑的完整GOTO IR,而之前生成的是空IR。所有3个程序在k=2时在70ms内验证为SAFE。11个文本LD基准测试完全保留,无回归。两个不含LD内容或不支持定时器语义的Beremiz示例被报告为发现的局限性。工件位于Zenodo(DantasCordeiro2026graphical,doi: https://doi.org/10.5281/zenodo.20699856)。

英文摘要

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents Graph-ESBMC-PLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).

2606.19121 2026-06-18 cs.SE cs.CL cs.HC 交叉投稿

Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions

由AI编写,由AI管理:跨越391个连续会话的语义空间控制与索引病消除

Hui Zhang, Shuren Song

发表机构 * Shenzhen Yunxi Technology Co., Ltd.(深圳云曦科技有限公司) Information Technology Center, Tsinghua University(清华大学信息科学技术中心)

AI总结 本文通过真实软件项目中的行动研究,发现长期LLM协作中增加形式约束反而导致“索引病”,提出“基线-日志物理分离”机制,有效消除该问题。

Comments 22 pages, 2 tables, 1 figure. Action research. Bilingual submission (Chinese companion version included as supplementary). Submitted to ICSE 2027 IOR track

详情
AI中文摘要

解决长期LLM协作中概念漂移的主流工程直觉是,用更多的形式约束换取更可靠的输出——设计符号标识符系统,在系统提示中积累防御规则,扩展上下文窗口。我们的工程记录表明,在长期设置中,这种方向可能产生与设计意图相反的效果。通过在跨越约一个月和391个协作会话的真实软件项目(Bang-v3)中使用行动研究方法,我们记录并分析了这些策略的失败过程。当符号系统超过复杂度阈值时,LLM并不会变得更准确——相反,它们放弃了对业务语义的真正理解,退回到符号层内的自我指涉推理,并生成看似内部一致但实际上与现实脱节的输出。我们将这种失败模式命名为“索引病”,其典型表现为“幻影立法”。我们将底层原理命名为“庞原理(语义活力定律)”:带有明确目的的自然语言传达的信息质量远高于符号表达。由此,我们设计并验证了其物理工程机制:“基线-日志物理分离”。在同一项目中,该机制将AI指令量减少了约75%,并且在随后的约150个会话中,未观察到索引病复发。附有双语对照版本(中文)作为补充材料。

英文摘要

The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs -- designing symbolic identifier systems, accumulating defensive rules in System Prompts, expanding context windows. Our engineering record shows that in long-horizon settings, this direction may produce effects contrary to design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze the failure process of these strategies. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate -- instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern "Index Sickness," and its canonical manifestation "Phantom Legislation." We name the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: "Baseline-Log Physical Separation." In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed. A bilingual companion version (Chinese) is included as supplementary material.