arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13850 2026-05-26 cs.AI cs.MA cs.SE

A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology

Jia Huang, Joey Tianyi Zhou

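AI summary: Proposes a two-dimensional taxonomy combining cognitive function (7 categories) and execution topology (6 structures), identifies 28 named patterns, and derives five empirical laws of pattern selection from cross-domain analysis.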

Comments 10 pages, 6 tables, 28 named patterns

Abstract

Existing frameworks for LLM-based agent architectures describe systems from a single perspective: industry guides (Anthropic, Google, LangChain) focus on execution topology -- how data flows -- while cognitive science surveys focus on cognitive function -- what the agent does. Neither axis alone disambiguates architecturally distinct systems: the same Orchestrator-Workers topology can implement Plan-and-Execute, Hierarchical Delegation, or Adversarial Verification -- three patterns with fundamentally different failure modes and design trade-offs. We propose a two-dimensional classification that combines (1) a Cognitive Function axis with seven categories (Perception, Memory, Reasoning, Action, Reflection, Collaboration, Governance) and (2) an Execution Topology axis with six structural archetypes (Chain, Route, Parallel, Orchestrate, Loop, Hierarchy). The resulting 7x6 matrix identifies 28 named patterns, 15 with original names. We demonstrate orthogonality through systematic cross-axis analysis, define eight representative patterns in detail, and validate descriptive coverage across four real-world domains (financial lending, legal due diligence, network operations, healthcare triage). Cross-domain analysis yields five empirical laws of pattern selection governing the relationship between environmental constraints (time pressure, action authority, failure cost asymmetry, volume) and architectural choices. The framework provides a principled, framework-neutral, and model-agnostic vocabulary for AI agent architecture design.

2605.13282 2026-05-26 cs.AI cs.LG

Differentiable Learning of Lifted Action Schemas for Classical Planning

Jonas Reiter, Jakob Elias Gebler, Hector Geffner

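AI summary: Proposes a neural network architecture that learns lifted action schemas from traces with fully observed states but unobserved action arguments, aiming for near-perfect recovery of the ground-truth structure.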

Abstract

Classical planners can effectively solve very large deterministic MDPs represented in STRIPS or PDDL where states are sets of atoms over objects and relations, and lifted action schemas add or delete these atoms. This compact representation yields strong search heuristics and provides an ideal setting for structural generalization, since lifted relations and action schemas give rise to infinitely many domain instances. A central challenge is to learn these relations and action schemas from data, and recent approaches have addressed this problem using different types of observations. In this work, we develop a novel neural network architecture for learning action schemas from traces where states are fully observed but action arguments are unobserved. The problem is a simplification but an important step towards learning planning domains from sequences of images and action labels, and we aim to solve this simplification in a nearly perfect manner. The challenge lies in learning the action schemas while simultaneously identifying the action arguments from observed state changes. Our approach yields a robust differentiable component that can then be integrated into larger neuro-symbolic models. We evaluate the architecture on various planning domains, where the learned lifted action schemas must recover the ground-truth structure. Additionally, we report experiments on robustness to observation noise and on a variation related to slot-based dynamics models.
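To make the STRIPS vocabulary in this abstract concrete, here is a minimal sketch of a lifted action schema being grounded and applied to a state. It is not the paper's learning method: the `Schema` class, the `move` schema, and the object names are hypothetical illustrations of the standard add/delete semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """A lifted action schema: atoms mention variables such as '?r'."""
    name: str
    params: tuple
    pre: frozenset
    add: frozenset
    delete: frozenset


def ground(atoms, binding):
    """Substitute objects for variables in a set of lifted atoms."""
    return frozenset(
        (atom[0],) + tuple(binding.get(term, term) for term in atom[1:])
        for atom in atoms
    )


def apply(schema, args, state):
    """Return the successor state, or None if the preconditions fail."""
    binding = dict(zip(schema.params, args))
    if not ground(schema.pre, binding) <= state:
        return None
    return (state - ground(schema.delete, binding)) | ground(schema.add, binding)


# Hypothetical schema: move(?r, ?from, ?to)
move = Schema(
    name="move",
    params=("?r", "?from", "?to"),
    pre=frozenset({("at", "?r", "?from")}),
    add=frozenset({("at", "?r", "?to")}),
    delete=frozenset({("at", "?r", "?from")}),
)

s0 = frozenset({("at", "robot", "roomA")})
s1 = apply(move, ("robot", "roomA", "roomB"), s0)
```

In the setting the abstract describes, the learner would see `s0` and `s1` together with the action label `move`, but not the arguments `("robot", "roomA", "roomB")`, and would have to infer both the arguments and the schema.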

2605.12850 2026-05-26 cs.CL cs.AI cs.CR cs.LG

Persona-Model Collapse in Emergent Misalignment

Davi Bastos Costa, Renato Vicente

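AI summary: Proposes the persona-model collapse hypothesis and, using two metrics, moral susceptibility (S) and moral robustness (R), provides behavioral evidence that fine-tuning large language models on harmful data degrades the model's internal capacity to simulate, differentiate, and maintain consistent characters, which contributes to emergent misalignment.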

Comments 23 pages, 7 figures, 7 tables; NeurIPS 2026 submission; Corrected code repository URL

Abstract

Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using two metrics: moral susceptibility (S) and moral robustness (R), computed from the across- and within-persona variability of models' Moral Foundations Questionnaire responses under persona role-play. These metrics formalize the model's ability to differentiate characters (S) and its consistency when simulating a given one (R). We evaluate four frontier models (DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B) in three variants: base, fine-tuned to output insecure code, and a matched control fine-tuned to output secure code. Across the four models, insecure fine-tuning produces an average $55\%$ increase in S, pushing all four insecure variants beyond the band observed across 13 frontier models benchmarked in prior work -- with GPT-4o reaching more than twice the band's upper end -- signaling dysregulated differentiation. It also causes an average $65\%$ decrease in R, equivalent to a $304\%$ increase in 1/R. By contrast, the matched secure control preserves S near the base and induces only a partial R loss, showing that these effects are largely misalignment-specific. Complementing these metric shifts, insecure variants' unconditioned responses converge toward saturation near the scale ceiling, departing markedly from both base models' structured responses and those elicited when base models role-play toxic personas. Taken together, these metrics provide a sensitive diagnostic for emergent misalignment and serve as behavioral evidence that it involves persona-model collapse.

2605.11182 2026-05-26 cs.AI

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Siqi Zhu, Xuyan Ye, Hongyu Lu, Weiye Shi, Ge Liu

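AI summary: An empirical study of when on-policy distillation (OPD) and on-policy self-distillation (OPSD) work in LLM post-training, the mechanisms behind their failures, and fixes.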

Abstract

On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.
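The "dense token-level supervision" mentioned in this abstract can be pictured as a per-position divergence between student and teacher next-token distributions, evaluated on a sequence the student itself sampled. The sketch below computes a full-vocabulary reverse KL, KL(student || teacher), per token. It is an illustrative assumption, not the paper's objective: the TopK and stop-gradient variants the abstract discusses are not reproduced, and the logits are made-up toy values.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for a single token position."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def per_token_reverse_kl(student_seq, teacher_seq):
    """One loss value per position of a student-sampled sequence."""
    return [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]


# Toy logits over a 3-token vocabulary at two positions (hypothetical values).
student_seq = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher_seq = [[1.5, 1.0, -0.5], [0.1, 0.2, 0.3]]
losses = per_token_reverse_kl(student_seq, teacher_seq)
```

Every position contributes its own loss term, in contrast to a single scalar reward at the end of the sequence; the second position gives zero loss because student and teacher agree there.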

2605.10989 2026-05-26 cs.LG cs.AI

SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

Haoyu Huang, Boyu Liu, Linlin Yang, Yanjing Li, Yuguang Yang, Xuhui Liu, Canyu Chen, Zhongqian Fu, Baochang Zhang

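AI summary: To address gradient mismatch and the information loss caused by fixed-range gradient clipping in binary neural networks, proposes SURGE, a theoretically grounded learnable gradient compensation framework. A Dual-Path Gradient Compensator and an Adaptive Gradient Scaler provide bias-reduced gradient estimation and dynamic balancing; the authors report the best results among the compared state-of-the-art methods on image classification, object detection, and language understanding.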

Comments Accepted as a poster at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

The training of Binary Neural Networks (BNNs) is fundamentally based on gradient approximation for non-differentiable binarization operations (e.g., the sign function). However, prevailing methods, including the Straight-Through Estimator (STE) and its improved variants, rely on hand-crafted designs that suffer from the gradient mismatch problem and from information loss induced by fixed-range gradient clipping. To address this, we propose SURrogate GradiEnt Adaptation (SURGE), a novel learnable gradient compensation framework with theoretical grounding. SURGE mitigates gradient mismatch through auxiliary backpropagation. Specifically, we design a Dual-Path Gradient Compensator (DPGC) that constructs a parallel full-precision auxiliary branch for each binarized layer, decoupling gradient flow via output decomposition during backpropagation. DPGC enables bias-reduced gradient estimation by leveraging the full-precision branch to estimate components beyond STE's first-order approximation. To further enhance training stability, we introduce an Adaptive Gradient Scaler (AGS) based on an optimal scale factor to dynamically balance inter-branch gradient contributions via norm-based scaling. Experiments on image classification, object detection, and language understanding tasks demonstrate that SURGE performs best among the compared state-of-the-art methods.
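For readers unfamiliar with the baseline this abstract criticizes, the following is a minimal sketch of the clipped Straight-Through Estimator: the forward pass applies the sign function, and the backward pass copies the upstream gradient wherever the input lies inside a fixed range and zeroes it elsewhere. This shows only the hand-crafted baseline, not SURGE; the function names and the clipping range of 1.0 are illustrative assumptions.

```python
def sign_forward(x):
    """Forward pass: binarize each activation to +1 or -1."""
    return [1.0 if v >= 0 else -1.0 for v in x]


def ste_backward(x, grad_out, clip=1.0):
    """Backward pass of the clipped STE.

    The true derivative of sign is zero almost everywhere, so the STE
    pretends the forward pass was the identity inside [-clip, clip] and
    passes no gradient outside that fixed range.
    """
    return [g if abs(v) <= clip else 0.0 for v, g in zip(x, grad_out)]


x = [-2.0, -0.3, 0.0, 0.7, 1.5]
y = sign_forward(x)
g = ste_backward(x, [1.0] * len(x))
```

The inputs `-2.0` and `1.5` receive no gradient at all, which is the fixed-range information loss the abstract refers to.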

2605.10302 2026-05-26 cs.LG

Follow the Mean: Reference-Guided Flow Matching

Pedro M. P. Curvo, Maksim Zhdanov, Floor Eijkelboom, Jan-Willem van de Meent

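AI summary: Steers a pretrained flow matching model toward controllable generation by changing the mean of a reference set, without fine-tuning or auxiliary networks.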

Abstract

Existing approaches to controllable generation typically rely on fine-tuning, auxiliary networks, or test-time search. We show that flow matching admits a different control interface: adaptation through examples. For deterministic interpolants, the velocity field is solely governed by a conditional endpoint mean; shifting this mean shifts the flow itself. This yields a simple principle for controllable generation: steer a pretrained model by changing the reference set it follows. We instantiate this idea in two forms. Reference-Mean Guidance is training-free: it computes a closed-form endpoint-mean correction from a reference bank and applies it to a frozen FLUX.2-klein (4B) model, enabling control of color, identity, style, and structure while keeping the prompt, seed, and weights fixed. Semi-Parametric Guidance amortizes the same idea through an explicit mean anchor and learned residual refiner, matching unconditional DiT-B/4 quality on AFHQv2 while allowing the reference set to be swapped at inference time. These results point to a broader direction: generative models that adapt through data, not parameter updates.
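The claim that the velocity field is governed by a conditional endpoint mean can be checked for the simplest case. The derivation below assumes the linear interpolant $x_t = (1-t)x_0 + t x_1$, which is a common choice but an assumption here; the paper's statement covers deterministic interpolants more generally, and the symbols $\mu_t$ and $\Delta$ are introduced only for this illustration.

```latex
% Assumed linear interpolant and the standard marginal velocity
x_t = (1-t)\,x_0 + t\,x_1, \qquad
v_t(x) = \mathbb{E}\left[\,x_1 - x_0 \mid x_t = x\,\right].

% Solve the interpolant for x_0 and substitute
x_0 = \frac{x_t - t\,x_1}{1-t}
\;\Longrightarrow\;
x_1 - x_0 = \frac{x_1 - x_t}{1-t}.

% The velocity depends on the data only through the endpoint mean
v_t(x) = \frac{\mu_t(x) - x}{1-t}, \qquad
\mu_t(x) = \mathbb{E}\left[\,x_1 \mid x_t = x\,\right].

% Shifting the endpoint mean by \Delta shifts the velocity accordingly
\mu_t(x) \mapsto \mu_t(x) + \Delta
\;\Longrightarrow\;
v_t(x) \mapsto v_t(x) + \frac{\Delta}{1-t}.
```

Under this assumption, replacing the reference set changes $\mu_t$ and therefore the flow, with no change to model weights.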

2605.08063 2026-05-26 cs.CV cs.AI

Flow-OPD: On-Policy Distillation for Flow Matching Models

Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen, Kaituo Feng, Yunlong Lin, Lin Chen, Zehui Chen, Shaosheng Cao, Feng Zhao

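AI summary: Proposes Flow-OPD, a two-stage alignment framework (single-reward GRPO fine-tuning of expert teachers, then a flow-based cold start followed by on-policy distillation) that addresses reward sparsity and gradient interference in multi-task alignment of flow matching models, and introduces Manifold Anchor Regularization to curb aesthetic degradation; reports large gains on GenEval and OCR metrics.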

Comments Project Page: https://costaliya.github.io/Flow-OPD/ , Code: https://github.com/CostaliyA/Flow-OPD

Abstract

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models. The code and weights will be released at https://github.com/CostaliyA/Flow-OPD .

2605.08025 2026-05-26 cs.CV

TRAS: An Interactive Software for Tracing Tree Ring Cross Sections

Henry Marichal, Diego Passarella, Gregory Randall

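AI summary: Presents TRAS, open-source graphical software that integrates three detection algorithms (CS-TRD, DeepCS-TRD, INBD) for automatic delineation, manual correction, and measurement of tree rings. On pine (Pinus taeda L.) cross-section images, DeepCS-TRD reaches an F-score of 81.0%, reducing manual correction to roughly 20% of ring boundaries.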

Comments This manuscript has been accepted for publication in Forestry: An International Journal of Forest Research, published by Oxford University Press. This is an author-produced version and may differ from the final Version of Record. The final published version will be available through the journal website

Abstract

Tree ring marking remains a key step in dendrometry and dendrochronology, but it is often performed manually, making the process time-consuming, subjective, and difficult to scale to large image datasets. We present the Tree Ring Analyzer Suite (TRAS), an open-source graphical software for automatic delineation, manual correction, and measurement of tree rings in wood cross-sectional images. TRAS integrates three complementary detection algorithms: the classical image-processing method CS-TRD and two deep-learning approaches, DeepCS-TRD and INBD. The interface allows users to refine automatic detections, remove false positives, and manually add missing rings. It also computes dendrochronological metrics such as earlywood and latewood areas, ring perimeter, equivalent ring width, and custom path-based ring-width measurements. TRAS was evaluated on 18 expertly annotated Pinus taeda L. cross-section images. DeepCS-TRD achieved the best automatic detection performance, with an F-score of 81.0% and precision of 86.4%. Automatic detection reduced the required manual correction effort to approximately 20% of ring boundaries. For one-dimensional ring-width measurements, TRAS showed excellent agreement with CooRecorder ($r > 0.99$). Common detection errors, such as jump propagation or false positives near knots, were easily corrected through the postprocessing interface. TRAS provides a flexible and reproducible solution for tree-ring analysis on Windows, macOS, and Linux. Code is available at https://hmarichal93.github.io/tras.

2605.07647 2026-05-26 cs.CL cs.AI

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

Abigail Victoria Gurin Schleifer, Moriah Ariely, Beata Beigman Klebanov, Asaf Salman, Giora Alexandron

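AI summary: Studies how the degree of task-specific adaptation relates to quality-conditioned scoring agreement in automated short answer scoring. All AI models perform well on fully correct and fully incorrect responses but degrade substantially on mid-range responses, and the degradation shrinks as task-specific data increases.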

Comments PRE-PRINT VERSION Accepted to ACL 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA26)

Abstract

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: it is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.

2605.06505 2026-05-26 cs.LG cs.AI cs.CR

PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization

Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen, Marten van Dijk, Srinivas Devadas

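AI summary: Proposes PACZero, a family of PAC-private zeroth-order mechanisms that use sign quantization to fine-tune at zero mutual information, with competitive accuracy on SST-2 and nontrivial F1 on SQuAD.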

Abstract

We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at $I(S^*; Y_{1:T})=0$. This privacy regime bounds the membership-inference attack (MIA) posterior success rate at the prior, an MIA-resistance level the DP framework matches only at $\varepsilon=0$ and infinite noise. All DP-ZO comparisons below are matched at the MIA posterior level. The key insight is that PAC Privacy charges mutual information only when the release depends on which candidate subset is the secret. Sign-quantizing subset-aggregated zeroth-order gradients creates frequent unanimity, steps at which every candidate subset agrees on the update direction; at these steps the released sign costs zero conditional mutual information. We propose two variants that span the privacy-utility trade-off: PACZero-MI (budgeted MI via exact calibration on the binary release) and PACZero-ZPL ($I=0$ via a uniform coin flip on disagreement steps). We evaluate on SST-2 and SQuAD with OPT-1.3B and OPT-6.7B in both LoRA and full-parameter tracks. On SST-2 OPT-1.3B full fine-tuning at $I=0$, PACZero-ZPL reaches ${88.99\pm0.91}$, within $2.1$pp of the non-private MeZO baseline ($91.1$ FT). No prior method produces usable utility in the high-privacy regime $\varepsilon<1$, and PACZero-ZPL obtains competitive SST-2 accuracy and nontrivial SQuAD F1 across OPT-1.3B and OPT-6.7B at $I=0$.
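The unanimity idea in this abstract can be sketched in a few lines. Assume each candidate subset yields one scalar zeroth-order gradient estimate along a shared random direction; if every subset agrees on the sign, that sign is released, and otherwise the release is a uniform coin flip, as the abstract describes for PACZero-ZPL. The function name, the input format, and the treatment of an exact zero as positive are hypothetical choices for illustration, not details taken from the paper.

```python
import random


def release_sign(subset_grads, rng):
    """Return (released_sign, unanimous) for one update step.

    subset_grads: one scalar gradient estimate per candidate subset
    (hypothetical input format). Zero is treated as positive here.
    """
    signs = {1 if g >= 0 else -1 for g in subset_grads}
    if len(signs) == 1:
        # Every candidate subset agrees, so the release does not depend
        # on which subset is the secret.
        return signs.pop(), True
    # Disagreement: release a uniform coin flip, independent of the data.
    return rng.choice((-1, 1)), False


rng = random.Random(0)
agree = release_sign([0.2, 0.5, 0.1], rng)
disagree = release_sign([0.2, -0.5, 0.1], rng)
```

In both branches the released bit is the same regardless of which candidate subset is the secret, which is the intuition behind the zero mutual information claim.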

2605.06259 2026-05-26 cs.LG cs.CR

Trade-off Functions for DP-SGD with Subsampling based on Random Shuffling: Tight Upper and Lower Bounds

Marten van Dijk, Murat Bilgehan Ertan

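AI summary: Within the $f$-DP framework, derives a tight analysis of the trade-off function for DP-SGD with subsampling based on random shuffling, obtaining transparent and interpretable closed-form bounds and giving single-epoch parameter settings that achieve meaningful differential privacy.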

Abstract

We derive a tight analysis of the trade-off function for Differentially Private Stochastic Gradient Descent (DP-SGD) with subsampling based on random shuffling within the $f$-DP framework. Our analysis covers the regime $σ \geq \sqrt{3/\ln M}$, where $σ$ is the noise multiplier and $M$ is the number of rounds within a single epoch. Unlike $f$-DP analyses for Poisson subsampling, which yield non-closed implicit formulas that can be machine computed but are non-transparent, random shuffling admits a tight analysis yielding transparent and interpretable closed-form bounds. Our concrete bounds, derived via the Berry-Esseen theorem, are tight up to constant factors within the proof framework. We demonstrate worked parameter settings for a single epoch ($E=1$) with a corresponding trade-off function $\geq 1-a-δ$, that is, only $δ$ below the ideal random-guessing diagonal $1-a$: for $δ = 1/100$ and $σ = 1$, roughly $M \approx 1.14\times 10^6$ rounds and $N \approx 1.14\times 10^7$ training samples suffice to achieve meaningful differential privacy. This is in contrast to recent negative results for the regime $σ \leq 1/\sqrt{2 \ln M}$. Our concrete bounds can be composed over multiple epochs, which makes $δ$ depend linearly on $E$ and restricts $E=O(\sqrt{M})$. To go beyond Berry-Esseen, we introduce a new proof technique based on a generalization of the law of large numbers that yields an asymptotic random-guessing diagonal-limit result: if $E=c_M^2M$ with $c_M\to 0$, then the $E$-fold composed trade-off function satisfies $f^{\otimes E}(a)\to 1-a$ uniformly in $a\in[0,1]$ with $δ$ having only an $O(\sqrt{E})$ dependency. We compare this asymptotic regime with the corresponding Poisson subsampling asymptotic, and highlight the characterization of explicit convergence rates as an open question.
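The worked parameter setting in this abstract can be sanity-checked numerically. The sketch below evaluates the two regime thresholds at the quoted $M$ and confirms that $σ = 1$ lies in the covered regime. Reading $N/M$ as the number of samples per round is an assumption made here for illustration; the variable names are likewise hypothetical.

```python
import math

M = 1.14e6      # rounds per epoch, as quoted in the abstract
N = 1.14e7      # training samples, as quoted in the abstract
sigma = 1.0     # noise multiplier in the worked example

# Regime covered by the analysis: sigma >= sqrt(3 / ln M)
covered_threshold = math.sqrt(3 / math.log(M))

# Regime of the recent negative results: sigma <= 1 / sqrt(2 ln M)
negative_threshold = 1 / math.sqrt(2 * math.log(M))

# Ratio of the quoted figures (interpreting it as a batch size is an assumption)
samples_per_round = N / M

print(f"covered regime starts at sigma >= {covered_threshold:.3f}")
print(f"negative results apply for sigma <= {negative_threshold:.3f}")
print(f"N / M = {samples_per_round:.1f}")
```

At this $M$ the covered regime begins near $σ \approx 0.46$ and the negative-result regime ends near $σ \approx 0.19$, so $σ = 1$ sits well inside the covered regime and the gap between the two thresholds is left uncovered by either statement.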

2605.05795 2026-05-26 cs.LG

Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs

Nicholas Potteiger, Ankita Samaddar, Taylor T. Johnson, Xenofon Koutsoukos

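AI summary: Proposes the masking reward behavior tree (MRBT), with LLM-generated rewards and action masks, SMT-based verification, and a neurosymbolic RL loop, improving training efficiency and success rates on compositional tasks.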

Abstract

Decomposing complex tasks into a sequence of simpler subtasks can improve learning efficiency for an autonomous agent. Reinforcement learning (RL) can be used to optimize agent policies to complete subtasks, but requires well-defined subtask rewards and benefits from action masking. Recent work uses large language models (LLMs) to automate reward shaping and action masking; however, none of these approaches fully addresses reactivity to subtask failure and modularity to varying objects for compositional tasks. To overcome these challenges, we develop the masking reward behavior tree (MRBT), a symbolic structure used as a reactive and modular reward and action mask function. We design an MRBT template and derive logical specifications to construct and verify MRBTs for a sequence of object-interaction subtasks. Further, we develop an automated pipeline that uses an LLM to generate MRBTs robust to varying task objects, an SMT solver to verify correctness of specifications, and a neurosymbolic RL loop to train agents on compositional tasks. Experiments demonstrate successful generation and refinement of five MRBTs, consistently improving training efficiency and task success rates over baselines and MRBTs without action masking. We further highlight three advantages of MRBTs: transferability, modularity, and verifiability.

2605.05759 2026-05-26 cs.LG

Full-Spectrum Graph Neural Networks: Expressive and Scalable

Xiaohan Wang, Deyu Bo, Longlong Li, Kelin Xia

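AI summary: Proposes Full-Spectrum GNNs (FSpecGNNs), which lift signals from the node domain to the node-pair domain and extend univariate spectral filters to bivariate ones, universally approximating node-pair signals while remaining scalable.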

Comments 41 pages, 4 figures. Accepted to ICML 2026

Abstract

It is well established that spectral graph neural networks (GNNs) can universally approximate node signals; however, their expressive power remains bounded by the 1-dimensional Weisfeiler-Lehman test, which is mirrored in their lack of universality for higher-order signals. To go beyond this bound, we propose the Full-Spectrum GNNs (FSpecGNNs), a second-order generalization of classical spectral GNNs. FSpecGNN advances spectral filtering from two perspectives: (1) it lifts signals from the node domain to the node-pair domain; and (2) it extends the univariate spectral filter over eigenvalues to a bivariate filter over eigenvalue pairs. We show that classical spectral GNNs arise as a diagonal special case of FSpecGNNs, and prove that FSpecGNNs can be at most as expressive as Local 2-GNN while universally approximating node-pair signals, the latter being particularly beneficial for heterophilic graph learning. Moreover, FSpecGNN admits scalable implementations that avoid explicit node-pair-level computations; combined with a low-rank approximation that reduces full-spectrum convolution to a combination of polynomial spectral filters, it enables learning on large graphs. Empirically, FSpecGNN validates the predicted expressivity and delivers strong performance on heterophilic benchmarks.

2605.05226 2026-05-26 cs.LG cs.AI cs.CL

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Sibo wang, Huiming Yang

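AI summary: Proposes a supervision-internalization method that lets a model automatically extract process-level learning signals under outcome-only supervision, enabling finer-grained policy optimization.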

Abstract

The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level learning signals through identifying, correcting, and reusing failed reasoning trajectories, thereby achieving finer-grained policy optimization under outcome-only supervision. We further abstract this idea into a new training paradigm, in which the model continually generates and refines its own internal process supervision during reinforcement learning, opening a new path for fine-grained credit assignment in reinforcement learning for reasoning that differs from externally provided process supervision.

2605.05182 2026-05-26 cs.RO cs.SY eess.SY

A Closed-Form Dual-Barrier CBF Safety Filter for Holonomic Robots on Incrementally Built Occupancy Grid Maps

基于增量构建占据栅格地图的全向机器人闭式双障碍CBF安全滤波器

Himanshu Paudel, Basanta Joshi, Dhirendra Raj Madai, Alina Bartaula, Biman Rimal, Sanjay Neupane

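AI summary: Proposes a closed-form dual-barrier control barrier function (CBF) safety filter, derived analytically from the signed distance field of the occupancy grid map, that both avoids mapped obstacles and restricts entry into unexplored regions, giving real-time safe control for holonomic robots on resource-constrained platforms.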

详情
AI中文摘要

我们提出了一种双障碍控制障碍函数(CBF)安全滤波器,用于在增量构建的占据栅格地图中运行的全向机器人的实时、安全关键速度控制。当机器人探索未知环境时,未映射区域引入了不可约的不确定性,因为超出已探索前沿的障碍物几何形状未知,使得进入这些区域成为碰撞风险的来源,尤其是对于前向传感器。为了解决这个问题,我们强制执行两个约束:避免已映射障碍物和限制进入未探索区域。这两个约束都是从占据栅格地图的符号距离场解析推导出来的,产生了一个闭式安全滤波器,每个周期只需求解一个小型线性系统。在资源受限的平台(如Raspberry Pi)上,SLAM和规划已经消耗了大量计算资源,所提出的滤波器的低开销节省了资源。自适应增益调度在信息丰富的区域放松前沿约束,在良好映射的区域收紧约束,提高了探索效率,同时保持了安全性。该滤波器在速度空间中作为最小侵入性校正运行,并与任意标称控制器(包括基于学习的方法)组合。在PX4控制的四旋翼飞行器上的硬件飞行实验表明,在多次室内运行中实现了零碰撞。

英文摘要

We present a dual-barrier control barrier function (CBF) safety filter for real-time, safety-critical velocity control of holonomic robots operating in incrementally built occupancy grid maps. As a robot explores an unknown environment, unmapped regions introduce irreducible uncertainty, since obstacle geometry beyond the explored frontier is unknown, making entry into such regions a source of collision risk, especially with front-facing sensors. To address this, we enforce two constraints: avoidance of mapped obstacles and restriction from unexplored regions. Both constraints are derived analytically from the occupancy grid's signed distance field, yielding a closed-form safety filter that requires only a small linear system solve per cycle. On resource-constrained platforms such as the Raspberry Pi, where SLAM and planning already consume significant compute, the low overhead of the proposed filter preserves resources. An adaptive gain schedule relaxes the frontier constraint in information-rich regions and tightens it in well-mapped areas, improving exploration efficiency while maintaining safety. The filter operates in velocity space as a minimally invasive correction and composes with arbitrary nominal controllers, including learning-based methods. Hardware flight experiments on a PX4-controlled quadrotor demonstrate zero collisions across multiple indoor runs.
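The abstract does not give the filter's equations, so the following is only a minimal sketch of what a closed-form two-constraint velocity filter can look like, under stated assumptions: a planar single-integrator robot (velocity `u` is the control), two barrier conditions of the form `grad_h . u + alpha_h >= 0`, and a minimum-norm correction of the nominal command. The function names, the 2D setting, and the active-set enumeration are illustrative choices, not the authors' derivation.

```python
def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def axpy(u, s, g):
    """Return u + s * g for 2D tuples."""
    return (u[0] + s * g[0], u[1] + s * g[1])


def safety_filter(u_nom, constraints, tol=1e-9):
    """Closest velocity to u_nom satisfying two linear barrier conditions.

    constraints: two (grad_h, alpha_h) pairs, each encoding
    grad_h . u + alpha_h >= 0 (assumed form; gradients must be nonzero).
    """
    (g1, c1), (g2, c2) = constraints

    def feasible(u):
        return all(dot(g, u) + c >= -tol for g, c in constraints)

    # Case 1: the nominal command is already safe, so it passes unchanged.
    if feasible(u_nom):
        return u_nom

    # Case 2: a single active constraint; project onto its boundary.
    for g, c in constraints:
        slack = dot(g, u_nom) + c
        if slack < 0.0:
            u = axpy(u_nom, -slack / dot(g, g), g)
            if feasible(u):
                return u

    # Case 3: both constraints active; solve the 2x2 system for the multipliers.
    r1 = -(dot(g1, u_nom) + c1)
    r2 = -(dot(g2, u_nom) + c2)
    a11, a12, a22 = dot(g1, g1), dot(g1, g2), dot(g2, g2)
    det = a11 * a22 - a12 * a12
    if abs(det) < tol:
        raise ValueError("parallel barrier gradients: no unique solution")
    lam1 = (r1 * a22 - a12 * r2) / det
    lam2 = (a11 * r2 - a12 * r1) / det
    return axpy(axpy(u_nom, lam1, g1), lam2, g2)
```

In this sketch the first pair could stand for the mapped-obstacle barrier and the second for the frontier barrier; an adaptive gain schedule of the kind the abstract mentions would change only the `alpha_h` term passed in for the frontier constraint.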

2605.04363 2026-05-26 cs.LG cs.AI

Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment

通过测试时后验调整缓解表格上下文学习中的标签偏移

Seunghan Lee

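AI summary: Addresses TabPFN's sensitivity to label shift in tabular in-context learning with DistPFN, a test-time posterior adjustment that rescales class probabilities without architectural modification or additional training and substantially improves classification under label shift on over 250 OpenML datasets.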

Comments ICML 2026

详情
AI中文摘要

TabPFN最近作为表格数据集的基础模型受到关注,通过在合成数据上利用上下文学习实现了强性能。然而,我们发现TabPFN容易受到标签偏移的影响,常常过拟合训练数据集中的多数类。为了解决这一局限性,我们提出了DistPFN,这是第一个专为表格基础模型设计的测试时后验调整方法。DistPFN通过降低训练先验(即上下文的类别分布)的影响并强调模型预测后验的贡献来重新缩放预测的类别概率,无需架构修改或额外训练。我们进一步引入了DistPFN-T,它结合了温度缩放,以根据先验和后验之间的差异自适应地控制调整强度。我们在超过250个OpenML数据集上评估了我们的方法,证明在标签偏移下,各种基于TabPFN的模型在分类任务中取得了显著改进,同时在无标签偏移的标准设置中保持了强性能。代码可在以下仓库获取:https://github.com/seunghan96/DistPFN。

英文摘要

TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging in-context learning on synthetic data. However, we find that TabPFN is vulnerable to label shift, often overfitting to the majority class in the training dataset. To address this limitation, we propose DistPFN, the first test-time posterior adjustment method designed for tabular foundation models. DistPFN rescales predicted class probabilities by downweighting the influence of the training prior (i.e., the class distribution of the context) and emphasizing the contribution of the model's predicted posterior, without architectural modification or additional training. We further introduce DistPFN-T, which incorporates temperature scaling to adaptively control the adjustment strength based on the discrepancy between prior and posterior. We evaluate our methods on over 250 OpenML datasets, demonstrating substantial improvements for various TabPFN-based models in classification tasks under label shift, while maintaining strong performance in standard settings without label shift. Code is available at this repository: https://github.com/seunghan96/DistPFN.
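The abstract describes the adjustment only in words, so the snippet below is a hedged sketch of a generic prior-correction rule rather than DistPFN's actual formula: each predicted class probability is divided by the context class frequency raised to a `strength` exponent, then renormalized. The function name, the `strength` parameter, and the `eps` floor are assumptions made for illustration.

```python
def adjust_posterior(probs, context_prior, strength=1.0, eps=1e-12):
    """Downweight the context class prior in a predicted class distribution.

    probs: predicted class probabilities for one test row.
    context_prior: class frequencies of the in-context (training) rows.
    strength: 0 leaves the prediction unchanged, 1 divides the prior out fully.
    """
    weights = [p / max(q, eps) ** strength for p, q in zip(probs, context_prior)]
    total = sum(weights)
    return [w / total for w in weights]
```

With a 90/10 context and a prediction of 0.6 / 0.4, full correction shifts most of the mass to the minority class, which is the qualitative behavior the abstract attributes to the method. A temperature-controlled variant such as DistPFN-T would, on this reading, choose `strength` adaptively from the gap between prior and posterior.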

2605.03462 2026-05-26 cs.LG cs.AI

From Muscle Bursts to Motor Intent: Self-Supervised Token Modeling for Heterogeneous EMG

从肌肉爆发到运动意图:面向异质EMG的自监督令牌建模

Zhenghao Huang, Huilin Yao, Kaikai Wang

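AI summary: Proposes AEMG, a self-supervised method that uses event-level token modeling and Transformer encoding to extract reusable neuromuscular representations from heterogeneous EMG data, improving robustness to unseen users and reducing the calibration data required.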

Comments After further verification, we identified issues in the current version that may affect the reliability and reproducibility of the reported experimental results. In particular, part of the evaluation relies on a dataset for which the public-release/redistribution status and supporting validation remain unresolved

详情
AI中文摘要

表面肌电图提供了一种从可穿戴肌肉记录推断人类运动意图的实用方法,但在单一采集设置下训练的模型在用户、会话、电极布局或手势协议改变时往往会失去可靠性。本文提出AEMG,一种自监督学习方法,旨在从多样化的EMG源中提取可复用的神经肌肉表征。首先将八个公开手势数据集转换为共享信号格式,以减少通道配置、传感器拓扑和记录协议的差异。AEMG不依赖固定长度滑动窗口,而是从能量变化中识别收缩事件并将其表示为紧凑的神经肌肉令牌,同时有序令牌组描述运动过程中多个肌肉的协调活动。然后使用空间和时间条件Transformer编码这些令牌序列,保留电极位置、激活时序和顺序结构信息。在预训练中,模型通过向量量化重建构建收缩原型的离散库,并通过从周围观测中恢复掩蔽的神经肌肉令牌进一步学习上下文依赖关系。在留一受试者和低标签适应设置下的实验表明,学习到的表征提高了对未见用户的鲁棒性,并减少了手势识别所需的校准数据量。这些发现表明,事件级令牌建模为适应性强且数据高效的基于EMG的运动意图理解提供了一条可扩展的途径。

英文摘要

Surface electromyography provides a practical way to infer human movement intention from wearable muscle recordings, but models trained under a single acquisition setting often lose reliability when the user, session, electrode layout, or gesture protocol changes. This paper proposes AEMG, a self-supervised learning approach designed to extract reusable neuromuscular representations from diverse EMG sources. Eight public gesture datasets are first transformed into a shared signal format to reduce discrepancies in channel configuration, sensor topology, and recording protocol. Instead of relying on fixed-length sliding windows, AEMG identifies contraction events from energy variations and represents them as compact neuromuscular tokens, while ordered token groups describe the coordinated activity of multiple muscles during motion. A spatially and temporally conditioned Transformer is then used to encode these token sequences, preserving information about electrode position, activation timing, and sequential structure. For pre-training, the model constructs a discrete library of contraction prototypes through vector-quantized reconstruction and further learns contextual dependencies by recovering masked neuromuscular tokens from surrounding observations. Experiments under leave-one-subject-out and low-label adaptation settings show that the learned representation improves robustness to unseen users and reduces the amount of calibration data required for gesture recognition. These findings suggest that event-level token modeling offers a scalable route toward adaptable and data-efficient EMG-based motor-intent understanding.

2605.02124 2026-05-26 cs.LG cs.AI math.PR

Soft-to-Hard Routing in Sparse Mixture-of-Experts Models

稀疏混合专家模型中的软到硬路由

Reza Rastegar

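AI summary: Uses a boundary-layer calculus to study how softmax routing in sparse mixture-of-experts models approaches hard top-1 routing as the temperature tends to zero, and gives quantitative error bounds in terms of the probability mass near the routing interfaces.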

详情
AI中文摘要

随着温度趋于零,softmax路由趋近于硬top-1路由,但极限过程在路由器平局时存在奇异性。本文针对总体平方损失混合专家回归中的软到硬极限,发展了一种边界层微积分方法。对于具有logits $a_k(x;ϕ)$的路由器,相关的局部量是前两名的间隔$Δ(x;ϕ)$,相关的全局量是边界质量$\mathbb{P}(Δ(X;ϕ)\le w)$。在光滑性和横截性假设下,余面积和管状邻域估计展示了该质量如何随板宽缩放;在二元情形中,主导系数是路由界面上的显式曲面积分。这些几何估计给出了软目标$L_τ$和硬目标$L_0$之间的定量界,包括在间隔尾条件下的$O(τ^α)$一致比较,并得到了紧参数空间上软目标的$Γ$-收敛性。主要结论是,零温度近似由路由界面的$O(τ)$邻域所承载的概率控制,而不仅仅由温度本身决定。在分离出问题的这一边界层部分后,我们记录了一个从硬路由到小温度软路由的条件景观传递定理,以及一个简化的双专家高斯计算,展示了局部对称性破缺。仅包含合成诊断作为边界层预测的受控检验。

英文摘要

Softmax routing approaches hard top-1 routing as the temperature tends to zero, but the limiting passage is singular at router ties. This paper develops a boundary-layer calculus for this soft-to-hard limit in population squared-loss mixture-of-experts regression. For a router with logits $a_k(x;ϕ)$, the relevant local quantity is the top-two margin $Δ(x;ϕ)$, and the relevant global quantity is the boundary mass $\mathbb{P}(Δ(X;ϕ)\le w)$. Under smoothness and transversality assumptions, coarea and tubular-neighborhood estimates show how this mass scales with the slab width; in the binary case the leading coefficient is an explicit surface integral over the routing interface. These geometric estimates give quantitative bounds between the soft objective $L_τ$ and the hard objective $L_0$, including an $O(τ^α)$ uniform comparison under a margin-tail condition, and yield $Γ$-convergence of the soft objectives on compact parameter spaces. The main conclusion is that the zero-temperature approximation is controlled by the probability carried by an $O(τ)$ neighborhood of the routing interfaces, not by temperature alone. After isolating this boundary-layer part of the problem, we record a conditional landscape-transfer theorem from hard to small-temperature soft routing and a reduced two-expert Gaussian calculation illustrating local symmetry breaking. Synthetic diagnostics are included only as controlled checks of the boundary-layer predictions.
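As a small numerical illustration of why the top-two margin matters, the sketch below computes temperature-scaled softmax routing weights and the margin $Δ$ for a single input. For two experts the runner-up weight equals 1 / (1 + exp(Δ/τ)), so routing is effectively hard only where Δ is large compared with τ. The function names are illustrative and not taken from the paper.

```python
import math


def soft_route(logits, tau):
    """Softmax routing weights at temperature tau (numerically stabilized)."""
    top = max(logits)
    exps = [math.exp((a - top) / tau) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_two_margin(logits):
    """Gap between the largest and second-largest router logits."""
    ordered = sorted(logits, reverse=True)
    return ordered[0] - ordered[1]


def runner_up_weight_two_experts(margin, tau):
    """Closed form for the weight soft routing leaves on the losing expert."""
    return 1.0 / (1.0 + math.exp(margin / tau))
```

Evaluating `soft_route([1.0, 0.9], tau)` for shrinking `tau` shows the runner-up weight decaying toward zero, while an exact tie such as `[1.0, 1.0]` stays at one half for every temperature, which is the singular behavior at router ties that the abstract refers to.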

2605.02010 2026-05-26 cs.AI

Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective

可靠AI需要外化隐性知识:人机协作视角

Hengyu Liu, Tianyi Li, Zhihong Cui, Yushuai Li, Zhangkai Wu, Torben Bach Pedersen, Kristian Torp, Christian S. Jensen

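AI summary: Argues, from a human-AI collaboration perspective, that reliable AI needs infrastructure that externalizes implicit knowledge into verifiable forms, using Knowledge Objects (KOs) to enable human validation and thereby improve reliability.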

Comments Accepted at ICML 2026 (Position Paper Track). 14 pages, 2 figures, 1 table

详情
AI中文摘要

本文立场认为,可靠AI需要基础设施来支持人类对隐性知识的验证。AI从显性知识(论文、文档、结构化数据库)和隐性知识(推理模式、调试过程、中间步骤)中学习。隐性知识由于文档成本超过感知价值而未被外化——然而AI不加区分地学习它,既获得有益模式也获得有害偏见。当前的可靠性方法只能根据来源验证显性知识,造成根本性差距:最有价值的AI能力(推理、判断、直觉)恰恰是我们无法验证的。我们提出知识对象(KOs)——将隐性知识外化为人类可以检查、验证和认可的形式的结构化工件。KOs改变了验证经济学:以前验证成本过高的事情变得可行,使得累积的人类验证能够随时间提高可靠性。

英文摘要

This position paper argues that reliable AI requires infrastructure for human validation of implicit knowledge. AI learns from both explicit knowledge (papers, documentation, structured databases) and implicit knowledge (reasoning patterns, debugging processes, intermediate steps). Implicit knowledge remains unexternalized because documentation cost exceeds perceived value -- yet AI learns from it indiscriminately, acquiring both beneficial patterns and harmful biases. Current reliability methods can only verify explicit knowledge against sources, creating a fundamental gap: the most valuable AI capabilities (reasoning, judgment, intuition) are precisely those we cannot verify. We propose Knowledge Objects (KOs) -- structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse. KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.

2605.01512 2026-05-26 cs.CV

Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video

监控视频中罕见交通事件的两次通过零样本时空定位

Jiantang Huang

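AI summary: Proposes a no-fine-tuning pipeline that uses frozen vision-language models, a coarse-to-fine two-pass decomposition, and specialist role assignment to jointly localize rare traffic events in time, space, and collision type.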

Comments Accepted at CVPR 2026 AUTOPILOT Workshop (Non-Archival Track). 7 pages (4 main + references + appendix), 3 figures, 5 tables

详情
AI中文摘要

在真实闭路电视画面中定位交通事故是一个罕见事件问题,通常禁止使用标注事故视频进行训练,但需要精确的时空和碰撞类型联合定位。我们提出一种无需微调的管道,通过两个想法从冻结的视觉语言模型中引出这种联合输出。首先,粗到细的两遍分解:第一遍以1 fps处理全视频,产生粗粒度(t, x, y, c)元组;然后第二遍在±3秒窗口内以5 fps细化时间和位置,并设置两个确定性置信门,在边界犹豫或边缘夹紧坐标时回退到粗估计。其次,专家角色分配:Qwen3-VL-Plus负责定位,Gemini 3.1 Flash-Lite负责在居中视频片段上分类。在ACCIDENT@CVPR 2026基准测试(2,027个真实闭路电视视频)上,我们达到ACC^S = 0.539(95%置信区间[0.525, 0.553]):比基准论文的最佳基线预言机(0.412)高0.127,比最强单VLM基线(Molmo-7B, 0.396)高0.143,比朴素基线(0.289)高0.250。VLM路径每个视频最多调用三次API(17%在API失败时回退到物理方法);完整运行成本约20美元。

英文摘要

Grounding traffic accidents in real CCTV footage is a rare-event problem where training on labeled accident video is often prohibited, yet accurate joint localization in time, space, and collision type is required. We present a no-fine-tuning pipeline that elicits this joint output from frozen vision-language models through two ideas. First, a coarse-to-fine two-pass decomposition: a full-video pass at 1 fps produces a coarse (t, x, y, c) tuple, then a second pass at 5 fps within a +/- 3 s window refines time and location, with two deterministic confidence gates that revert to the coarse estimate on boundary hedges or edge-clamped coordinates. Second, a specialist role assignment: Qwen3-VL-Plus handles grounding, Gemini 3.1 Flash-Lite handles typing on a centered video clip. On the ACCIDENT@CVPR 2026 benchmark (2,027 real CCTV videos) we reach ACC^S = 0.539 (95% CI [0.525, 0.553]): +0.127 over the benchmark paper's best-of-baselines oracle (0.412), +0.143 over the strongest single-VLM baseline (Molmo-7B, 0.396), and +0.250 over the naive baseline (0.289). The VLM path uses up to three API calls per video (17% fall back to physics on API failures); the full run costs ~$20.

2605.01284 2026-05-26 cs.CV cs.AI cs.CL cs.IR

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

证据链:面向迭代检索增强生成的像素级视觉归因

Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye

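AI summary: Proposes the Chain of Evidence (CoE) framework, in which a vision-language model reasons directly over screenshots of retrieved documents and outputs precise bounding boxes that visualize the complete reasoning chain, addressing coarse-grained attribution and visual semantic loss in iterative retrieval-augmented generation.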

详情
AI中文摘要

迭代检索增强生成(iRAG)已成为通过逐步检索和推理外部文档来回答复杂多跳问题的强大范式。然而,当前系统主要基于解析文本运行,这造成了两个关键瓶颈:(1)粗粒度归因,用户需要根据模糊的文本级引用在冗长文档中手动定位证据;(2)视觉语义丢失,将视觉丰富的文档(如幻灯片、带有图表的PDF)转换为文本会丢弃对推理至关重要的空间逻辑和布局线索。为弥合这一差距,我们提出了证据链(CoE),这是一个与检索器无关的视觉归因框架,利用视觉语言模型直接对检索到的文档候选截图进行推理。CoE消除了特定格式的解析,输出精确的边界框,可视化检索候选集中的完整推理链。我们在两个不同的基准上评估CoE:Wiki-CoE,一个源自2WikiMultiHopQA的大规模结构化网页数据集;以及SlideVQA,一个具有挑战性的演示幻灯片数据集,包含复杂图表和自由形式布局。实验表明,微调后的Qwen3-VL-8B-Instruct取得了稳健的性能,在需要视觉布局理解的场景中显著优于基于文本的基线,同时为像素级可解释的iRAG建立了与检索器无关的解决方案。我们的代码可在https://github.com/PeiYangLiu/CoE.git获取。

英文摘要

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.

2605.00817 2026-05-26 cs.CL

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

当LLM停止遵循步骤:语言模型中程序执行的诊断研究

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh

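AI summary: Builds a controlled diagnostic benchmark to evaluate how faithfully large language models execute procedures, finds that average first-answer accuracy falls from 63% on 5-step procedures to 20% on 95-step procedures, and identifies failure modes including missing answers, premature answers, self-correction, and under-executed traces.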

Comments 86 pages, 124 figures, 4 Tables

详情
AI中文摘要

大型语言模型(LLM)在推理基准测试中通常表现强劲,但仅凭最终答案的准确性并不能表明它们是否忠实地执行了提示中指定的程序。我们引入了一个受控的诊断基准,用于程序执行,其中模型被给予一个逐步的算术程序以及两个数值输入,必须返回最终计算值。通过程序长度和中间变量的回溯依赖性来改变复杂性。平均首次答案准确率从5步程序的63%下降到95步程序的20%。生成级别分析表明,失败通常涉及缺失答案、过早答案、初始错误后的自我修正以及未完全执行的轨迹。这些发现表明,表面上的推理能力可能掩盖了在忠实的长程程序执行中的重大弱点。

英文摘要

Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We introduce a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic procedure and two numeric inputs, and must return the final computed value. Complexity is varied through procedure length and look-back dependencies over intermediate variables. Average first-answer accuracy drops from 63% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error and under-executed traces. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful long-horizon procedural execution.
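To make the task format concrete, here is a hedged sketch of how a step-wise arithmetic procedure with look-back dependencies could be generated, rendered as a prompt, and scored against a ground-truth executor. The operator set, the `max_lookback` window, the variable naming, and the modulus that keeps values small are all assumptions for illustration; the abstract does not specify the benchmark's construction.

```python
import random

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def make_procedure(n_steps, max_lookback, seed=0):
    """Build steps of the form v_new = v_latest <op> v_src.

    v0 and v1 are the two inputs; step i (0-based) defines v(i+2) and may
    reference any variable at most max_lookback positions behind the latest.
    """
    rng = random.Random(seed)
    steps = []
    for i in range(n_steps):
        latest = i + 1
        oldest = max(0, latest - max_lookback)
        steps.append((rng.choice(sorted(OPS)), rng.randint(oldest, latest)))
    return steps


def execute(steps, x, y, modulus=1000):
    """Ground-truth executor: run every step and return the final value."""
    values = [x, y]
    for op, src in steps:
        values.append(OPS[op](values[-1], values[src]) % modulus)
    return values[-1]


def render(steps, modulus=1000):
    """Render the procedure as the text a model would be asked to follow."""
    return "\n".join(
        f"Step {i + 1}: v{i + 2} = (v{i + 1} {op} v{src}) mod {modulus}"
        for i, (op, src) in enumerate(steps)
    )
```

Scoring a model then reduces to comparing its first reported final value with `execute(steps, x, y)`, while `n_steps` and `max_lookback` play the role of the two complexity knobs named in the abstract.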

2604.27636 2026-05-26 cs.AI

Generative structure search for efficient and diverse discovery of molecular and crystal structures

生成式结构搜索:高效且多样地发现分子和晶体结构

Yifang Qin, Yu Shi, Junfu Tan, Chang Liu, Ming Zhang, Ziheng Lu

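AI summary: Proposes generative structure search (GSS), a framework that combines diffusion models with random structure search, using data priors to accelerate sampling while retaining energy-guided exploration of local minima, and recovers diverse metastable structures at less than one tenth of the sampling cost of random structure search.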

详情
AI中文摘要

预测稳定和亚稳态结构是分子和材料发现的核心,但受限于高维能量景观的搜索成本。深度生成模型提供了高效的结构采样,但其输出仍受训练数据影响,可能未充分探索罕见但物理相关的极小值。我们引入生成式结构搜索(GSS),一个统一框架,将基于扩散的生成和随机结构搜索(RSS)表述为由学习得分场和物理力驱动的共同采样过程的极限情况。耦合这些驱动因素使GSS能够利用数据先验加速采样,同时保留能量引导的局部极小探索。在分子和晶体系统中,GSS恢复了多样的亚稳态结构,其采样成本比RSS低十倍以上,且对训练分布之外的组成仍然有效。结果建立了一种物理基础的生成搜索策略,用于发现仅靠数据驱动采样无法达到的结构。

英文摘要

Predicting stable and metastable structures is central to molecular and materials discovery, but remains limited by the cost of searching high-dimensional energy landscapes. Deep generative models offer efficient structure sampling, yet their outputs remain shaped by training data and can underexplore minima that are rare but physically relevant. We introduce generative structure search (GSS), a unified framework that formulates diffusion-based generation and random structure search (RSS) as limiting regimes of a common sampling process driven by learned score fields and physical forces. Coupling these drivers lets GSS use data priors to accelerate sampling while retaining energy-guided exploration of local minima. Across molecular and crystalline systems, GSS recovers diverse metastable structures with more than tenfold lower sampling cost than RSS for broad coverage and remains effective for compositions outside the training distribution. The results establish a physically grounded generative search strategy for discovering structures beyond the reach of data-driven sampling alone.
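The idea of one sampling process with score-driven and force-driven limits can be illustrated with a toy one-dimensional Langevin-style update. Everything here is an assumption made for illustration: the double-well energy, the Gaussian stand-in for a learned score field, and the linear `mix` between the two drifts. It is not the coupling used in GSS, whose details the abstract does not give.

```python
import math
import random


def force(x):
    """Physical force -dE/dx for the toy double-well energy E(x) = (x^2 - 1)^2."""
    return -4.0 * x * (x * x - 1.0)


def prior_score(x, center=1.0, sigma=0.5):
    """Score of a Gaussian centered on one well, standing in for a learned prior."""
    return -(x - center) / (sigma * sigma)


def sample(x0, mix, steps=2000, step_size=0.01, temperature=0.0, seed=0):
    """Langevin-style walk whose drift blends the prior score and the force.

    mix = 0 follows only the prior score; mix = 1 follows only the force.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step_size * temperature)
    x = x0
    for _ in range(steps):
        drift = (1.0 - mix) * prior_score(x) + mix * force(x)
        x += step_size * drift + noise_scale * rng.gauss(0.0, 1.0)
    return x
```

Started at `x0 = -0.5` with zero temperature, the score-only limit is pulled to the well the prior knows about (x = 1), whereas the force-only limit relaxes into the nearby well at x = -1 that the prior never covers. This is the kind of underexplored minimum the abstract argues a purely data-driven sampler can miss.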

2604.20022 2026-05-26 cs.LG cs.AI cs.CL

MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support

MoBayes:一种用于对话式临床决策支持中推理与语言分离的模块化贝叶斯框架

Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Yena Chang, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

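AI summary: Proposes MoBayes, a framework that separates reasoning from language by using the LLM as a language interface and a Bayesian module for probabilistic inference, and that outperforms standalone frontier LLM doctors in clinical decision support.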

Comments 50 pages including appendix, 13 figures, 22 tables. Preprint

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于对话式临床决策支持,但它们将下一个标记预测与概率决策混为一谈。我们认为这种混淆反映了架构上的局限性:此类系统缺乏显式的后验追踪、可控的弃权阈值和可审计的推理链。我们引入MoBayes,一个模块化贝叶斯对话框架,将推理与语言分离。LLM仅作为语言接口,将患者对话解析为结构化观察,而贝叶斯模块对这些观察进行概率推理以更新后验,通过期望信息增益选择后续问题,并通过校准的决策阈值决定何时停止或推迟。这种设计实现了显式后验追踪、可控的选择性决策,以及无需重新训练语言模型即可替换的特定人群统计后端。在经验知识和LLM生成的知识库上,MoBayes优于独立的前沿LLM医生,包括匹配模型系列的比较,其中廉价的传感器模型与MoBayes配对以较低成本超过更大的自主模型。在对抗性患者沟通风格和不同诊断场景下,该优势依然存在。这些结果表明,可靠的对话式临床决策支持系统应将概率推理与语言生成分离,而不是仅扩大模型规模。代码可在https://anonymous.4open.science/r/MoBayes/获取。

英文摘要

Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next-token prediction with probabilistic decision making. We argue that this conflation reflects an architectural limitation: such systems lack explicit posterior tracking, controllable abstention thresholds, and auditable reasoning chains. We introduce MoBayes, a Modular Bayesian dialogue framework that separates reasoning from language. The LLM acts only as a language interface, parsing patient conversation into structured observations, while a Bayesian module performs probabilistic inference over these observations to update posteriors, select follow-up questions via expected information gain, and determine when to stop or defer through calibrated decision thresholds. This design enables explicit posterior tracking, controllable selective decision-making, and replaceable population-specific statistical backends without retraining the language model. Across empirical and LLM-generated knowledge bases, MoBayes outperforms standalone frontier LLM doctors, including matched model-family comparisons where inexpensive sensor models paired with MoBayes exceed larger autonomous models at lower cost. The advantage persists under adversarial patient communication styles and across varying diagnostic scenarios. These results suggest that reliable conversational clinical decision support systems should separate probabilistic reasoning from language generation rather than scaling model size alone. Code is available at https://anonymous.4open.science/r/MoBayes/
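The Bayesian side of such a design can be sketched in a few lines, assuming binary yes/no findings, conditionally independent likelihoods, and a single decision threshold. The function names, the dictionary layout of `likelihood`, and the `("decide" | "ask" | "defer", ...)` return convention are hypothetical; they are not taken from the MoBayes code.

```python
import math


def entropy(dist):
    """Shannon entropy (nats) of a dict mapping hypothesis -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def update(posterior, likelihood, question, answer):
    """Bayes update; likelihood[d][question] is P(answer is yes | d)."""
    unnorm = {}
    for d, p in posterior.items():
        p_yes = likelihood[d][question]
        unnorm[d] = p * (p_yes if answer else 1.0 - p_yes)
    total = sum(unnorm.values())
    return {d: v / total for d, v in unnorm.items()}


def expected_information_gain(posterior, likelihood, question):
    """Expected entropy reduction from asking one yes/no question."""
    p_yes = sum(p * likelihood[d][question] for d, p in posterior.items())
    gain = entropy(posterior)
    for answer, p_ans in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_ans > 0:
            gain -= p_ans * entropy(update(posterior, likelihood, question, answer))
    return gain


def next_action(posterior, likelihood, asked, threshold=0.9):
    """Decide if confident, otherwise ask the most informative open question."""
    best, p_best = max(posterior.items(), key=lambda kv: kv[1])
    if p_best >= threshold:
        return ("decide", best)
    questions = {q for table in likelihood.values() for q in table}
    remaining = sorted(questions - set(asked))
    if not remaining:
        return ("defer", None)
    scores = {q: expected_information_gain(posterior, likelihood, q) for q in remaining}
    return ("ask", max(remaining, key=scores.get))
```

In this arrangement the language model's only job would be to turn a patient's free-text reply into the boolean passed to `update`, so the posterior, the stopping threshold, and the likelihood tables stay inspectable and can be swapped without touching the model.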

2604.19151 2026-05-26 cs.CL cs.SD eess.AS

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

印度之声:面向印度真实世界语音识别的大规模基准

Kaushal Bhogale, Manas Dhir, Amritansh Walecha, Manmeet Kaur, Vanshika Chhabra, Aaditya Pareek, Hanuman Sidh, Mahima Manik, Sagar Jain, Bhaskar Singh, Utkarsh Singh, Tahir Javed, Shobhit Banga, Mitesh M. Khapra

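AI summary: To address the limitations of existing Indic ASR benchmarks, introduces Voice of India, a closed-source benchmark built from unscripted telephonic conversations that covers 15 major Indian languages and 139 regional clusters with 306,230 utterances (536 hours), and analyzes how geography, audio quality, speaking rate, gender, and device type affect ASR performance.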

Comments 6 pages, 4 figures

详情
AI中文摘要

现有的Indic ASR基准通常使用脚本化的、干净的语音和基于排行榜的评估,这鼓励了针对数据集的过拟合。此外,严格的单参考WER会惩罚印度语言中的自然拼写变体,包括非标准拼写的代码混合英语起源词。为了解决这些局限性,我们引入了Voice of India,这是一个从非脚本电话对话构建的封闭源基准,覆盖15种主要印度语言,跨越139个区域集群。该数据集包含306230条语音,总计536小时的语音,来自36691名说话人,转录考虑了拼写变体。我们还在地理上按地区分析了性能,揭示了差异。最后,我们提供了跨音频质量、语速、性别和设备类型等因素的详细分析,突出了当前ASR系统在哪些方面存在困难,并为改进真实世界的Indic ASR系统提供了见解。

英文摘要

Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306,230 utterances, totaling 536 hours of speech from 36,691 speakers, with transcripts accounting for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.
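One way to keep WER from penalizing legitimate spelling variants is to let each reference position hold a set of acceptable spellings. The sketch below is a standard word-level edit distance with that one change; it is an assumed scoring scheme for illustration, since the abstract does not state how the benchmark's transcripts encode variants, and the English example words are placeholders.

```python
def variant_wer(reference, hypothesis):
    """Word error rate where each reference slot accepts several spellings.

    reference: list of sets, one set of acceptable spellings per word.
    hypothesis: list of recognized words.
    """
    n, m = len(reference), len(hypothesis)
    prev = list(range(m + 1))  # cost of matching an empty reference prefix
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            sub = 0 if hypothesis[j - 1] in reference[i - 1] else 1
            cur[j] = min(
                prev[j] + 1,        # deletion of a reference word
                cur[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + sub,  # match or substitution
            )
        prev = cur
    return prev[m] / max(n, 1)
```

With a strict single reference `[{"the"}, {"color"}, {"red"}]`, the hypothesis `["the", "colour", "red"]` scores one substitution in three words; widening the middle slot to `{"color", "colour"}` brings the score to zero.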

2604.18170 2026-05-26 cs.CL cs.AI

Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

Copy-as-Decode: 面向LLM编辑的语法约束并行预填充

Ziyang Liu

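AI summary: Proposes Copy-as-Decode, a mechanism that accelerates LLM editing through grammar-constrained parallel prefill: copy spans are processed 6.8× to 303× faster than autoregressive decoding at the kernel level, 74%-98% of gold tokens are reachable by line-level copying, and oracle programs round-trip losslessly through the resolver.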

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

详情
AI中文摘要

LLMs通过自回归地重新生成完整输出来编辑文本和代码,即使大多数标记在输入中逐字出现。我们研究Copy-as-Decode,一种解码层机制,将编辑生成重新表述为基于两个原语语法的结构化解码:<copy lines="i-j"/>引用输入行范围,<gen>...</gen>生成新内容。一个标记级FSM保证语法有效性,服务层原语通过单次并行预填充前向(而非N步自回归步骤)更新每个复制跨度的KV缓存——共享推测解码的并行前向内核,但以输入标记作为草稿,程序强制接受替代概率验证。我们报告一个无需端到端训练的上界分析。(i) 内核加速:在Qwen2.5-{1.5B, 7B}上,通过并行预填充复制N个标记比自回归快6.8倍至303倍(N ∈ [8, 512],A100 80GB bf16)。(ii) 复制上限:在ProbeEdit和HumanEvalPack-Fix (Py/JS)上,74%–98%的金标准标记在行级原语下可达;结合每个语料库跨度直方图上的经验内核,得到闭式挂钟时间上界29.0倍/3.4倍/4.2倍(合并13.0倍)。标记级扩展达到91%–99%覆盖率,下界4.5倍–6.5倍。(iii) 流水线无损性:预言程序通过确定性解析器在所有482个案例上往返,将任何下游失败定位到跨度选择而非机制。扰动研究表明,在离一噪声下,合并EM从100%降至15.48%。在Qwen2.5-Coder-1.5B上的微调实验将HEvalFix-Py EM从0/33(未训练)提升至12%–17%,这是一个可学习性信号,而非生产选择器。批处理服务集成和多文件覆盖作为后续工作。

英文摘要

LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: <copy lines="i-j"/> references an input line range, <gen>...</gen> emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward rather than $N$ autoregressive steps -- sharing the parallel-forward kernel of speculative decoding but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-{1.5B, 7B}, copying $N$ tokens via parallel prefill is $6.8\times$--$303\times$ faster than autoregressive ($N \in [8, 512]$, A100 80GB bf16). (ii) Copy ceiling: on ProbeEdit and HumanEvalPack-Fix (Py/JS), $74$--$98\%$ of gold tokens are reachable under the line-level primitive; composed with the empirical kernel over each corpus's span histogram this yields a closed-form wall-clock bound of $29.0\times / 3.4\times / 4.2\times$ ($13.0\times$ pooled). A token-level extension reaches $91$--$99\%$ coverage with $4.5\times$--$6.5\times$ floors. (iii) Pipeline losslessness: oracle programs round-trip through the deterministic resolver on all $482$ cases, localizing any downstream failure to span selection rather than the mechanism. A perturbation study shows pooled EM drops from $100\%$ to $15.48\%$ under off-by-one noise. A fine-tuning pilot on Qwen2.5-Coder-1.5B lifts HEvalFix-Py EM from $0/33$ (untrained) to $12$--$17\%$, a learnability signal, not a production selector. Batched-serving integration and multi-file coverage are scoped as follow-up.
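The two-primitive grammar is easy to picture with a small resolver that expands an edit program against the input text. This is a sketch under assumptions the abstract does not confirm: line ranges are treated as 1-indexed and inclusive, `<gen>` bodies are emitted verbatim, and anything outside the two primitives is rejected. It covers only the deterministic resolution step, not the token-level FSM or the parallel-prefill KV-cache update.

```python
import re

# One alternative per primitive: a copy reference or a generated span.
PRIMITIVE = re.compile(r'<copy lines="(\d+)-(\d+)"/>|<gen>(.*?)</gen>', re.DOTALL)


def resolve(program, source):
    """Expand an edit program into output text using lines from source."""
    src_lines = source.splitlines(keepends=True)
    out = []
    pos = 0
    for match in PRIMITIVE.finditer(program):
        if program[pos:match.start()].strip():
            raise ValueError("text outside <copy/> or <gen> primitives")
        pos = match.end()
        if match.group(1) is not None:
            start, end = int(match.group(1)), int(match.group(2))
            if not 1 <= start <= end <= len(src_lines):
                raise ValueError(f"copy range {start}-{end} out of bounds")
            out.extend(src_lines[start - 1:end])
        else:
            out.append(match.group(3))
    if program[pos:].strip():
        raise ValueError("text outside <copy/> or <gen> primitives")
    return "".join(out)
```

For a three-line input, the program `<copy lines="1-1"/><gen>b = 20\n</gen><copy lines="3-3"/>` rewrites only the middle line. A range that is off by one silently copies the wrong line rather than failing, which is consistent with the sensitivity to off-by-one noise that the perturbation study reports.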

2604.18128 2026-05-26 cs.CL cs.AI cs.LG

Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition

深度寄存器解锁 SwiGLU 上的 W4A4:一种读取器/生成器分解

Ziyang Liu

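AI summary: Shows that a training-time intervention combining Depth Registers with a hinge loss (DR+sink) reduces the W4A4-quantized perplexity of a SwiGLU decoder-only language model from 1727 to 119, and decomposes the damage by site: residual-axis readers dominate the error that the intervention removes, while the bilinear input of the generator w2 is the main cause of the remaining gap.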

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

详情
AI中文摘要

我们在一个受控的 300M 参数 SwiGLU 解码器语言模型(在 FineWeb-Edu 的 5B 令牌上训练)中研究训练后 W4A4 量化,并询问哪些输入激活位点主导误差。朴素的四舍五入 W4A4 将验证困惑度从 FP16 的 23.6 降至 1727。一种简单的残差轴训练时干预——带有寄存器幅度铰链损失的深度寄存器(DR+sink)——在匹配的 FP16 PPL 和匹配的零样本能力下,将其降至 119(约 14 倍),并与 SmoothQuant 组合达到 39.9 PPL。与 FP16 之间约 2 PPL 的剩余差距是诊断核心。我们按输入激活位点分解 W4A4 损伤:SwiGLU 块中的五个可训练线性层分为残差轴读取器(qkv, w1, w3)和块内生成器(o_proj, w2)。基本的范数论证表明,残差轴幅度控制紧密约束读取器,但 w2 的双线性输入仅受因子范数平凡乘积的约束;经验上,DR+sink 降低了读取器的峰度,而生成器基本不变,并且读取器恢复的 W4A4 残差在三个匹配检查点上平坦约为 0.28 nats,其中 Delta-remove(w2) 占主导。我们将 DR+sink 作为训练时探针而非部署方案提出:一种事后替代方案(Per-Linear QuaRot)在读取器轴上几乎与之匹配。完整的 QuaRot——添加在线每头值 Hadamard 和在线 w2 输入旋转——也没有缩小差距,直接验证了正交旋转无法约束双线性 SwiGLU 尾部的预测。这些主张特定于我们的 300M、5B 令牌、单种子设置,并且我们的实验未将分区与铰链分离。

英文摘要

We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error. Naive round-to-nearest W4A4 collapses validation perplexity from FP16 23.6 to 1727. A simple residual-axis training-time intervention -- Depth Registers with a register-magnitude hinge loss (DR+sink) -- reduces this to 119 (about 14x) at matched FP16 PPL and matched zero-shot capacity, and composes with SmoothQuant to 39.9 PPL. The residual ~2 PPL gap to FP16 is the diagnostic core. We decompose W4A4 damage by input-activation site: the five trainable linears in a SwiGLU block split into residual-axis readers (qkv, w1, w3) and block-internal generators (o_proj, w2). Elementary norm arguments show residual-axis magnitude control bounds readers tightly but leaves w2's bilinear input bounded only by the trivial product of factor bounds; empirically, DR+sink collapses reader kurtosis while leaving generators essentially unchanged, and the reader-rescued W4A4 residue is flat at ~0.28 nats across three matched checkpoints with Delta-remove(w2) dominating. We present DR+sink as a training-time probe rather than a deployment proposal: a post-hoc alternative (Per-Linear QuaRot) nearly matches it on the reader axis. Full QuaRot -- adding online per-head value Hadamard plus online w2-input rotation -- does not close the gap either, directly testing the prediction that orthogonal rotation cannot bound the bilinear SwiGLU tail. Claims are specific to our 300M, 5B-token, single-seed setting, and our experiments do not isolate the partition from the hinge.
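For readers unfamiliar with round-to-nearest quantization, the sketch below shows a symmetric 4-bit quantize-dequantize step and why a single large-magnitude entry is damaging: the scale is set by the maximum, so small values collapse to zero. Per-tensor symmetric scaling is an assumption here; the abstract does not state the granularity or scheme used in the experiments.

```python
def quantize_rtn(values, bits=4):
    """Symmetric round-to-nearest quantize-dequantize with one shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit signed integers
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)
    out = []
    for v in values:
        # Python's round() rounds exact halves to the nearest even integer.
        q = max(-qmax - 1, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

Quantizing `[0.1, -0.2, 0.3, 0.7]` is nearly lossless because every entry sits close to a grid point, whereas replacing the last entry with `70.0` stretches the grid so far that the first three values all round to zero. Heavy-tailed activations of the kind the abstract measures through kurtosis are commonly understood to hurt low-bit quantization in this way.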

2604.17538 2026-05-26 cs.RO

Novel Algorithms for Smoothly Differentiable and Efficiently Vectorizable Contact Manifold Construction

用于光滑可微且高效可向量化的接触流形构建的新算法

Onur Beker, Andreas René Geist, Anselm Paulus, Georg Martius

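AI summary: For optimizing robot behavior in contact-rich settings, proposes a new collision detection pipeline that prioritizes smooth (twice) differentiability and massive vectorizability on GPUs, comprising differentiable SDF representations, broad-phase and narrow-phase routines, and convex-decomposition-based contact blending.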

Comments This version adds late-breaking results in preparation for the CR2 workshop in ICRA 2026

详情
AI中文摘要

在接触丰富的环境中生成智能机器人行为是一个目前零阶方法占主导的研究问题。开发利用接触存在下刚体动力学的一阶/二阶信息的方法,在提高求解速度和计算效率方面具有巨大潜力。该研究方向的主要瓶颈在于,由于常见模拟流水线中所有三个步骤(i)碰撞检测、(ii)接触动力学、(iii)时间积分)的病态性,难以获得对数值优化实际有用的梯度和Hessian矩阵。本文提出了一种旨在解决该难题中碰撞检测部分的方法,通过一个从头设计的新流水线,以光滑(即二次)可微性和GPU上的大规模可向量化作为主要优先级。这与标准碰撞检测例程形成对比,后者针对CPU上的运行时间和最小内存占用进行了优化,但采用了阻碍可微性和向量化的逻辑和控制流。所提出的流水线包括以下贡献:i)高度表达力强且计算高效的SDF表示,ii)使用这些表示生成顶点-SDF和边-SDF接触的可微宽阶段和窄阶段例程,iii)基于凸分解的接触融合的可微例程。

英文摘要

Generating intelligent robot behavior in contact-rich settings is a research problem where zeroth-order methods currently prevail. Developing methods that make use of first- and second-order information about rigid-body dynamics in the presence of contact holds great promise in terms of increasing the solution speed and computational efficiency. The main bottleneck in this research direction is the difficulty in obtaining gradients and Hessians that are actually useful for numerical optimization, due to pathologies in all three steps of a common simulation pipeline: i) collision detection, ii) contact dynamics, iii) time integration. This abstract proposes a method that aims to address the collision detection part of the puzzle, via a novel pipeline designed from scratch with smooth (i.e. twice) differentiability and massive vectorizability on GPUs as the main priorities. This is in contrast to standard collision detection routines, which are instead optimized for runtime on CPUs and minimal memory footprint but employ logic and control flow that hinder differentiability and vectorization. The proposed pipeline consists of the following contributions: i) highly expressive and compute-efficient SDF representations, ii) differentiable broad-phase and narrow-phase routines that use these representations to generate vertex-SDF and edge-SDF contacts, iii) a differentiable routine for convex-decomposition-based contact blending.

2604.17328 2026-05-26 cs.LG cs.AI

Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

重新思考序列级强化学习中的比较单元:从损失校正到样本构建的等长配对训练框架

Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Huiming Yang, Sibo Wang, Linglin Liao

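AI summary: Argues that the length problem in sequence-level relative reinforcement learning is fundamentally a comparison-unit construction problem and, on that basis, proposes EqLen, an equal-length paired training framework that builds comparable training samples through dual-track synchronous generation, prefix inheritance, and segment masking.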

详情
AI中文摘要

本文研究了序列级相对强化学习中的长度问题。我们观察到,尽管现有方法部分缓解了与长度相关的现象,但一个更根本的问题仍未得到充分刻画:训练过程中使用的比较单元缺乏内在可比性。基于这一观察,我们提出一个新的视角:长度问题不应仅仅被视为损失缩放或归一化偏差,而应被视为一个比较单元构建问题。我们进一步建立了一个基于样本构建的训练框架,该框架不是对不等长响应进行事后校正,而是在生成过程中主动构建等长、可对齐且可比较的训练段。在该框架内,我们提出了EqLen,一种适用于组相对比较算法(如GRPO、GSPO和RLOO)的具体方法。通过双轨同步生成、前缀继承和段掩码,EqLen高效地收集有效的等长训练段,并实现稳定的训练。

英文摘要

This paper investigates the length problem in sequence-level relative reinforcement learning. We observe that, although existing methods partially alleviate length-related phenomena, a more fundamental issue remains insufficiently characterized: the comparison units used during training lack inherent comparability. Building on this observation, we propose a new perspective: the length problem should not be viewed merely as a loss-scaling or normalization bias, but rather as a comparison unit construction problem. We further establish a sample-construction-based training framework that, instead of applying post-hoc corrections to unequal-length responses, proactively constructs equal-length, alignable, and comparable training segments during generation. Within this framework, we propose EqLen, a concrete method applicable to group-relative comparison algorithms such as GRPO, GSPO, and RLOO. Through dual-track synchronous generation, prefix inheritance, and segment masking, EqLen efficiently collects effective equal-length training segments and enables stable training.

2604.16778 2026-05-26 cs.LG cs.AI

Federation over Text: Insight Sharing for Multi-Agent Reasoning

Dixi Yao, Tahseen Rabbani, Manzil Zaheer, Tian Li

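AI summary: Proposes FoT, a federated learning-like framework that iteratively aggregates the local reasoning processes of multiple clients to build a cross-task library of metacognitive insights without sharing problem instances or task instructions, significantly improving reasoning effectiveness and efficiency.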

Comments 46 pages

Abstract

We propose a federated learning-like framework, Federation over Text (FoT), that enables multiple clients solving different tasks to collectively generate a shared library of metacognitive insights by iteratively federating their local reasoning processes without sharing actual problem instances or task instructions. Instead of federation over gradients (e.g., as in distributed training), FoT operates at the semantic level without any gradient optimization or supervision signal. In each iteration, every client runs an LLM agent that independently performs local thinking and self-improvement on its specific task and shares reasoning traces with a central server, which aggregates and distills them into a cross-task (and cross-domain) insight library that existing and future agents can leverage to improve performance on related tasks. Experiments show that FoT improves reasoning effectiveness and efficiency across a wide range of challenging applications, including mathematical problem solving, cross-domain collaboration, real-world daily tasks, and machine learning research insight discovery. Specifically, it improves average performance scores by 25% while reducing reasoning tokens by 4% across the first three applications. In the research insight discovery application, FoT is able to generate insights that cover over 80% of the major contributions in the subsequent papers.
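
The round structure described in the abstract (clients reason locally, share traces, and a server distills them into a shared library) can be sketched as a toy loop. This is only a schematic under strong assumptions: `local_reasoning`, `aggregate`, `federate`, the canned notes, and the `min_support` rule are all hypothetical stand-ins, and no LLM is called.

```python
from collections import Counter

def local_reasoning(task, library):
    """Hypothetical stand-in for a client's LLM agent: returns reasoning-trace notes.

    A real client would call an LLM; here each task maps to canned notes so the
    sketch runs offline. Only notes leave the client, never the task itself.
    """
    canned = {
        "algebra": ["check units", "verify the answer by substitution"],
        "geometry": ["draw a diagram", "verify the answer by substitution"],
        "scheduling": ["list constraints first", "draw a diagram"],
    }
    # Skip notes already covered by the shared library.
    return [n for n in canned[task] if n not in library]

def aggregate(traces, min_support=2):
    """Hypothetical server step: keep notes reported by at least min_support clients."""
    counts = Counter(note for trace in traces for note in set(trace))
    return sorted(n for n, c in counts.items() if c >= min_support)

def federate(tasks, rounds=2):
    """Alternate client-side reasoning and server-side aggregation for a few rounds."""
    library = []
    for _ in range(rounds):
        traces = [local_reasoning(t, library) for t in tasks]    # client side
        library = sorted(set(library) | set(aggregate(traces)))  # server side
    return library

print(federate(["algebra", "geometry", "scheduling"]))
# ['draw a diagram', 'verify the answer by substitution']
```

Only notes that recur across clients enter the library here, which is one simple way to approximate "cross-task" insights; the actual distillation step in FoT is performed at the semantic level and is not specified in the abstract.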