arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.06494 2026-06-05 cs.LG

TailLoR: Protecting Principal Components in Parameter-Efficient Continual Learning

Marius Dragoi, Ioana Pintilie, Alexandra Dragomir, Antonio Barbalau, Florin Brad

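AI summary: Proposes TailLoR, which uses the singular bases of the pre-trained weights as a fixed reference frame and applies a low-rank update to the singular value matrix; a soft spectral penalty suppresses updates aligned with dominant singular directions, reducing interference and enabling fine-grained adaptation.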

Abstract

Parameter-efficient finetuning methods based on spectral decomposition have enabled progress in Continual Learning. In this paper we introduce TailLoR, which utilizes the singular bases U and V of the pre-trained weights as a fixed reference frame to learn a low-rank update applied to the singular value matrix. A soft spectral penalty discourages updates aligned with dominant singular directions, reducing interference while routing fine-grained adaptation into the highly flexible, long-tail spectral coordinates.
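
The update described in the abstract can be written out as a short sketch. The symbols $\Sigma$, $\Delta$, $A$, $B$, $r$, $\lambda$, and the weights $\omega_i$ are notation assumed here for illustration. The abstract does not give the exact form of the soft spectral penalty, so the second line shows only one plausible shape for it: entries of the update that couple dominant singular directions are penalized more heavily than entries in the tail.

```latex
W_0 = U \Sigma V^\top, \qquad
W = U\,(\Sigma + \Delta)\,V^\top, \qquad
\Delta = B A, \quad B \in \mathbb{R}^{k \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll k

\mathcal{L} = \mathcal{L}_{\text{task}}
  + \lambda \sum_{i,j} \omega_i\, \omega_j\, \Delta_{ij}^{2},
\qquad \omega_i \text{ increasing in } \sigma_i
```

Here $U$ and $V$ stay frozen, only $A$ and $B$ are trained, and $\Delta_{ij}$ is the coupling between the $i$-th left and $j$-th right singular directions.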

2606.06492 2026-06-05 cs.SE cs.AI cs.CL

Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

Liliana Hotsko, Yinxi Li, Yuntian Deng, Pengyu Nie

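AI summary: Proposes Code2LoRA, a hypernetwork framework that injects repository knowledge by generating repository-specific LoRA adapters with no inference-time token overhead. It supports both static and evolving scenarios and, on the static track of RepoPeftBench, matches the per-repository LoRA upper bound.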

Abstract

Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA -- costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters, effectively injecting repository knowledge with zero inference-time token overhead. Code2LoRA supports two usage scenarios: Code2LoRA-Static converts a single repository snapshot into an adapter, suitable for comprehension of stable codebases; while Code2LoRA-Evo maintains an adapter backed by a GRU hidden state updated per code diff, suitable for active development of evolving codebases. To evaluate Code2LoRA against parameter-efficient fine-tuning baselines, we build RepoPeftBench, a benchmark of 604 Python repositories with two tracks: a static track with 40K training and 12K test assertion-completion tasks, and an evolution track with 215K commit-derived training and 87K commit-derived test tasks. On the static track, Code2LoRA-Static achieves 63.8% cross-repo and 66.2% in-repo exact match, matching the per-repository LoRA upper bound; on the evolution track, Code2LoRA-Evo achieves 60.3% cross-repo exact match (+5.2 pp over a single shared LoRA). Code2LoRA's code can be found at https://anonymous.4open.science/r/code2lora-6857; the model checkpoints and RepoPeftBench datasets can be found at https://huggingface.co/code2lora.

2606.06491 2026-06-05 cs.RO cs.AI

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu, Huaxiu Yao, Zhiwu Lu, Mingyu Ding

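AI summary: Proposes TempoVLA, which combines variable-speed trajectory augmentation with a speed-conditioning mechanism to give robot manipulation flexible speed control in both directions, and supports dynamic speed adjustment.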

Abstract

Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components: (1) a data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times a demonstration to any target speed by merging or splitting actions while preserving its motion semantics, and (2) a model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
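
The merge-or-split idea behind VSTA can be illustrated with a minimal sketch. It assumes that every action is a relative displacement (so consecutive actions can be summed) and that the speed factor is an integer; the function names are hypothetical, and the paper's actual augmentation, including how it handles gripper or rotation dimensions and fractional speeds, is not specified in the abstract.

```python
from typing import List

Action = List[float]  # one delta action, e.g. [dx, dy]


def merge_actions(actions: List[Action], k: int) -> List[Action]:
    """Speed up by k: sum every k consecutive delta actions into one."""
    merged = []
    for start in range(0, len(actions), k):
        chunk = actions[start:start + k]
        merged.append([sum(dim) for dim in zip(*chunk)])
    return merged


def split_actions(actions: List[Action], k: int) -> List[Action]:
    """Slow down by k: replace each delta action with k equal sub-steps."""
    out = []
    for action in actions:
        out.extend([[x / k for x in action] for _ in range(k)])
    return out


def total_displacement(actions: List[Action]) -> Action:
    """Net motion of a trajectory: the per-dimension sum of its deltas."""
    return [sum(dim) for dim in zip(*actions)]


# Hypothetical 4-step demonstration in two dimensions.
demo = [[0.5, 0.0], [0.5, 0.25], [0.25, 0.25], [0.25, 0.5]]
fast = merge_actions(demo, 2)   # 2 steps, larger action magnitudes
slow = split_actions(demo, 2)   # 8 steps, smaller action magnitudes
```

Both re-timed trajectories cover the same net displacement as the original, which is the sense in which this toy version preserves the motion while changing the per-step action magnitude.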

2606.06486 2026-06-05 cs.LG cs.AI cs.GT

Regret Minimization with Adaptive Opponents in Repeated Games

Mingyang Liu, Asuman Ozdaglar, Tiancheng Yu, Kaiqing Zhang

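AI summary: For regret minimization against adaptive opponents in repeated games, proposes the Repeated Policy Regret (RP-Regret) metric and three algorithms for minimizing it, and shows that certain subgame perfect equilibria can be learned when all players minimize this regret.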

Abstract

In this paper, we study regret minimization in repeated games with adaptive opponents who can respond based on histories of play. The standard metric of external regret in online learning is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce Repeated Policy Regret (RP-Regret), a game-theoretic metric that measures the difference between the realized and the best-in-hindsight accumulated utility when all players can respond to the history of play. Compared to existing regret notions in this setting, ours is native to repeated game playing, enabling stronger comparators and opponents with fewer constraints, while maintaining the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for obtaining RP-Regret sublinear in time, on the variation of the player's comparator strategies in the regret definition and on the memories of both the comparator and opponents' strategies. We then study additional conditions and provable algorithms to minimize RP-Regret, which is by definition non-convex in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work in online non-convex learning; (ii) one that minimizes a convex and linearized surrogate of RP-Regret at each iteration; (iii) one that directly minimizes RP-Regret when opponents change strategies slowly. Furthermore, when all players can run algorithms to minimize the RP-Regret (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.

2606.06485 2026-06-05 cs.CV

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang, Liujuan Cao

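AI summary: Proposes the PAR3D framework, which uses part-aware 3D representation learning and hierarchical segmentation query generation to address the weakness of existing 3D-MLLMs in fine-grained part understanding, yielding substantial gains on part-level question answering and referring segmentation.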

Comments
Project page: https://atrovast.github.io/PAR3D/

Abstract

Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.

2606.06481 2026-06-05 cs.CL cs.AI cs.LG

Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection

Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Tianjun Yao, Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Hao Li, Salman Khan, Zhiqiang Shen

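AI summary: Proposes the OpAI-Bench benchmark, which simulates human-AI co-editing through nine progressively revised versions and five AI edit operations, and supports detection at document, sentence, token, and span granularity. It shows that AI-text detectability depends on edit operation, domain, and cumulative revision history, and that mixed-authorship intermediate versions are often harder to detect than fully human or heavily AI-edited endpoints.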

Comments
Our code and data are available at https://github.com/VILA-Lab/OpAI-Bench

Abstract

As AI writing assistants become increasingly integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but instead result from progressive human-AI co-editing. However, existing AI-text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI-Bench, an operation-guided benchmark for studying progressive human-to-AI text transformation across document, sentence, token, and span granularities. Starting from human-written documents, OpAI-Bench constructs nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at multiple granularities. The benchmark supports comprehensive evaluation with 8 document-level detectors, 7 sentence-level detectors, and 2 fine-grained token/span-level detectors. Experiments reveal that AI-text detectability is governed not only by the proportion of AI-edited content, but also by edit operation, domain, and cumulative revision history. Interestingly, we notice that mixed-authorship intermediate versions are often harder to detect than both fully human and heavily AI-edited endpoints, exposing non-monotonic detection patterns missed by existing benchmarks. OpAI-Bench provides a controlled testbed for analyzing whether, when, and how AI-assisted writing becomes detectable under realistic progressive editing scenarios. Our code and benchmark are available at https://github.com/VILA-Lab/OpAI-Bench.

2606.06480 2026-06-05 cs.GT cs.LG

DNQ: Deep Nash Q-Network for Partially Observable n-Player Games

Qintong Xie, Edward Koh, Xavier Cadet, Peter Chin

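AI summary: For simultaneous-move multi-agent games, proposes the DNQ framework, which trains agents with solver-in-the-loop equilibrium supervision and compares the scalability of the pairwise and exact equilibrium-solving formulations.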

Abstract

Many real-world competitive systems require multiple decision-makers to act simultaneously under shared constraints, limited information, and repeated interaction, as in auctions, resource allocation, and security competition. We study multi-turn simultaneous bidding as a controlled testbed for such problems and propose DNQ, a solver-in-the-loop equilibrium supervision framework for training bidding agents. DNQ alternates between trajectory collection, critic-based payoff estimation, equilibrium computation, and policy imitation. At each visited state, a shared critic predicts either pairwise payoff matrices or an exact N-player payoff tensor, an external solver computes equilibrium strategies, and the agents are trained by minimizing the KL divergence between their masked policies and the solver-derived equilibrium targets. We focus on a scalable pairwise formulation that greatly reduces equilibrium-solving cost and training time compared with the exact formulation, while the shared critic amortizes payoff learning across agents and states. Experiments compare the pairwise and exact variants using critic loss, policy entropy, bidding resource usage, and training cost, showing that the pairwise method scales to larger numbers of agents, whereas the exact method becomes computationally impractical as the joint game grows. These results illustrate the trade-off between strategic fidelity and scalability in repeated competitive environments.
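
The policy-imitation step, minimizing a KL divergence between a masked policy and a solver-derived target, might look like the sketch below. The direction of the divergence (target relative to policy), the use of a softmax over logits, and the function names are assumptions made for illustration; the abstract does not state them. The sketch also assumes the equilibrium target places probability only on legal actions.

```python
import math
from typing import List


def masked_softmax(logits: List[float], legal: List[bool]) -> List[float]:
    """Softmax over legal actions only; illegal actions get probability 0."""
    peak = max(l for l, ok in zip(logits, legal) if ok)
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, legal)]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target: List[float], policy: List[float]) -> float:
    """KL(target || policy), skipping actions the target never plays."""
    return sum(t * math.log(t / p) for t, p in zip(target, policy) if t > 0.0)


# Hypothetical state with three bids, the last one illegal (over budget).
policy = masked_softmax([0.0, 0.0, 5.0], [True, True, False])
target = [1.0, 0.0, 0.0]  # stand-in for a solver-derived equilibrium strategy
loss = kl_divergence(target, policy)
```

The mask removes the high-scoring but illegal third bid before normalization, so the loss compares the target only against probabilities the agent could actually play.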

2606.06479 2026-06-05 cs.LG cs.AI

Pretraining Recurrent Networks without Recurrence

Akarsh Kumar, Phillip Isola

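AI summary: Proposes Supervised Memory Training (SMT), which recasts RNN training as supervised learning on one-step memory transition labels, enabling time-parallel training with stable gradient paths and outperforming backpropagation through time (BPTT).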

Comments
30 pages, 23 figures

Abstract

Training recurrent neural networks (RNNs) requires assigning credit across long sequences of computations. Standard backpropagation through time (BPTT) addresses this problem poorly: it is sequential in time, limiting parallelism, and suffers from vanishing or exploding gradients, making long-range associations difficult to learn. We propose Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely by reducing RNN training to supervised learning on one-step memory transition labels $(m_t, x_{t+1}) \rightarrow m_{t+1}$. SMT acquires these memory labels by training a Transformer-based encoder on a predictive state objective--retaining only information from the past necessary to predict the future. By decoupling what to remember from how to update memory, SMT enables time-parallel RNN training with a stable $O(1)$ length gradient path between any two tokens--without ever unrolling the RNN. We find that SMT outperforms BPTT when pretraining various RNN architectures on tasks like language modeling and pixel sequence modeling. SMT enables nonlinear RNNs to better capture long-range dependencies and train in parallel, potentially unlocking the scaling of models that build temporal abstractions of past experience.
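
To make the one-step transition labels $(m_t, x_{t+1}) \rightarrow m_{t+1}$ concrete, the sketch below turns a sequence of memory labels into independent supervised examples. The memory values, token ids, and function name are hypothetical; in the paper the labels come from a trained Transformer-based encoder, which is not reproduced here.

```python
from typing import List, Sequence, Tuple

Memory = Tuple[float, ...]
Example = Tuple[Tuple[Memory, int], Memory]


def transition_examples(memories: Sequence[Memory],
                        tokens: Sequence[int]) -> List[Example]:
    """Pair each (m_t, x_{t+1}) input with its m_{t+1} target.

    memories[t] is the label m_t after reading the first t tokens, so
    there is one more memory than there are tokens; tokens[t] is x_{t+1}.
    """
    if len(memories) != len(tokens) + 1:
        raise ValueError("expected len(memories) == len(tokens) + 1")
    return [((memories[t], tokens[t]), memories[t + 1])
            for t in range(len(tokens))]


# Hypothetical 2-dimensional memory labels for a 3-token sequence.
memories = [(0.0, 0.0), (0.5, 0.0), (0.5, 0.25), (1.0, 0.25)]
tokens = [7, 3, 9]
examples = transition_examples(memories, tokens)
```

Because each example depends only on its own input and target, the examples can be batched across time steps, which is what removes the need to unroll the RNN during training.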

2606.06477 2026-06-05 cs.CV

Complexity-Balanced Diffusion Splitting

Noam Issachar, Dani Lischinski, Raanan Fattal

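AI summary: Proposes the Complexity-Balanced Splitting (CBS) framework, which partitions the diffusion timeline into segments of equal approximation burden and allocates more capacity to difficult regions, improving generation quality across several architectures and datasets without increasing per-step inference cost.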

Abstract

Standard continuous-time generative models rely on monolithic architectures that must navigate vastly different signal regimes, from isotropic noise to intricate data distributions. While scaling model capacity improves performance, deploying a massive network uniformly across the entire generative timeline is inherently inefficient. In this work, we propose Complexity-Balanced Splitting (CBS), a principled framework for temporal capacity allocation that distributes the generative workload across multiple specialized sub-networks. Grounded in function approximation theory and de Boor's equidistribution principle, CBS partitions the diffusion timeline into segments of equal approximation burden, allocating more representational capacity to regions where the generative dynamics are more difficult to model. To estimate this local complexity, we introduce two complementary and tractable monitor functions: a spatial measure based on the flow's Dirichlet energy, and a geometric measure based on the acceleration of the sampling trajectories. Using a lightweight auxiliary model to estimate these complexity profiles, our approach eliminates the need for heuristic temporal splits or computationally expensive search procedures. Extensive evaluation across multiple architectures (SiT, JiT, and UNet) and datasets demonstrates that CBS consistently improves synthesis quality without increasing per-step inference cost. In particular, CBS improves FID by ~35% on SiT-XL with CFG relative to naive temporal partitioning. Project page is available at https://noamissachar.github.io/CBS/.
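
The equidistribution principle can be sketched numerically: given a monitor function sampled on a time grid, choose breakpoints so that every segment carries the same share of its integral. The function name and the sample values are hypothetical, and the sketch assumes a strictly positive monitor function; how CBS actually estimates and uses its monitor functions is not detailed in the abstract.

```python
from typing import List


def equidistribute(t: List[float], monitor: List[float], k: int) -> List[float]:
    """Return k + 1 breakpoints giving each segment an equal monitor integral.

    The cumulative integral uses the trapezoid rule, and breakpoints are
    placed by linear interpolation inside the grid cell they fall in.
    """
    cum = [0.0]
    for i in range(1, len(t)):
        cell = 0.5 * (monitor[i] + monitor[i - 1]) * (t[i] - t[i - 1])
        cum.append(cum[-1] + cell)

    breaks = [t[0]]
    i = 1
    for j in range(1, k):
        target = cum[-1] * j / k
        while cum[i] < target:
            i += 1
        frac = (target - cum[i - 1]) / (cum[i] - cum[i - 1])
        breaks.append(t[i - 1] + frac * (t[i] - t[i - 1]))
    breaks.append(t[-1])
    return breaks


# Hypothetical complexity profile that is three times higher late in time.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
profile = [1.0, 1.0, 1.0, 3.0, 3.0]
two_way = equidistribute(grid, profile, 2)
```

With this profile the split lands at 0.6875 instead of 0.5, so the sub-network covering the harder late region is responsible for a shorter time interval.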

2606.06476 2026-06-05 cs.CV

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu

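AI summary: Proposes the Astra framework, which trains a VLM policy with reinforcement learning to interact with a Bagel-based world simulator and generate imagined visual evidence during reasoning, addressing unobserved layouts, cross-view consistency, and reasoning from alternative viewpoints in spatial reasoning.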

Comments
Project page: https://zcmax.github.io/projects/Thinking-With-Imagination

Abstract

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.

2606.06475 2026-06-05 cs.LG cs.AI

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter

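AI summary: For the delayed-reward problem in RL fine-tuning of reasoning language models, proposes RREDCoT, which uses the model itself to approximate the optimal reward redistribution without additional generation, aiming to reduce variance and make credit assignment more efficient.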

Comments
Preprint, under review

Abstract

Recent advancements in reasoning language models have been driven by Reinforcement Learning (RL) fine-tuning. Most often, these rely on the Group Relative Policy Optimization (GRPO) algorithm or modifications thereof to steer the models to produce Chain-of-Thought (CoT) traces. The final answer can only be verified, and the reward assigned, after the CoT trace is complete, making it a delayed reward problem. GRPO and its modifications correspond to Monte Carlo methods in standard RL, which are known to suffer from high variance. A possible solution to this problem is the redistribution of rewards through credit assignment, where segments of the CoT trace that are important for arriving at the desirable solution are emphasized by assigning a higher reward. While Monte Carlo sampling can be used to provide an unbiased estimate of intermediate state values, its computational overhead makes it unsuitable for train-time credit assignment in long contexts at high granularity. We introduce RREDCoT (Reward REDistribution for Chain of Thoughts), which utilizes the model itself to approximate the optimal reward redistribution without additional generation. We investigate the advantages of our method compared to MC sampling and several attribution methods. We further analyze several aspects relevant to the construction of the redistribution such as segmentation of CoT traces and state value estimation.
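
One common way to build a segment-level redistribution is to reward each segment with the change in estimated state value across it. The sketch below shows that construction under stated assumptions: the value estimates are hypothetical numbers, the function name is invented, and the abstract does not say whether RREDCoT uses exactly this formula or how it obtains its value estimates from the model.

```python
from typing import List


def redistribute(values: List[float], final_reward: float) -> List[float]:
    """Assign each segment the change in estimated value across it.

    values[k] is the estimated value after segment k, with values[0] the
    estimate before any reasoning. The last estimate is replaced by the
    verified final reward, so the segment rewards telescope to
    final_reward - values[0].
    """
    targets = list(values[:-1]) + [final_reward]
    return [targets[k] - targets[k - 1] for k in range(1, len(targets))]


# Hypothetical value estimates at the boundaries of a 3-segment CoT trace.
values = [0.25, 0.25, 0.75, 0.5]
segment_rewards = redistribute(values, final_reward=1.0)
```

In this toy trace the first segment earns nothing, the second earns the most because the estimated value jumps there, and the rewards sum to the final reward minus the initial estimate.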

2606.06474 2026-06-05 cs.CL cs.AI cs.LG

Self-Augmenting Retrieval for Diffusion Language Models

Paul Jünger, Justin Lovelace, Linxi Zhao, Dongyoung Go, Kilian Q. Weinberger

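AI summary: Proposes the SARDI framework, which uses the low-confidence tokens discarded during denoising in diffusion language models as a lookahead signal to guide retrieval. It is training-free and retriever-agnostic, and outperforms training-free baselines on multi-hop QA benchmarks at up to 8x higher throughput.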

Comments
ICML 2026

Abstract

Discrete diffusion language models generate text by iteratively denoising an entire response in parallel. At each step, they predict tentative tokens for every masked position, committing the confident predictions to the output and discarding the unconfident ones. We show that the discarded tokens are in fact a useful lookahead signal for retrieval-augmented generation: even low-confidence tokens often surface salient entities early in the denoising trajectory, enabling retrieval of stronger evidence before the output is finalized. We exploit this through Self-Augmenting Retrieval for Diffusion Language Models (SARDI), a dynamic RAG framework that uses these lookahead tokens to guide retrieval during denoising. SARDI is training-free, retriever-agnostic, and applicable to any reasoning-capable discrete diffusion language model. Across five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to $8\times$ higher throughput.
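
The commit-or-discard step and the reuse of discarded tokens as a retrieval query can be sketched as follows. The confidence threshold, the mask string, the way the query is assembled, and all names are assumptions for illustration only; the abstract does not specify how SARDI selects tokens or forms its queries.

```python
from typing import Dict, List, Tuple

MASK = "[MASK]"


def denoise_step(sequence: List[str],
                 predictions: Dict[int, Tuple[str, float]],
                 threshold: float) -> Tuple[List[str], List[str]]:
    """Commit confident predictions; return the rest as lookahead tokens.

    predictions maps each masked position to (token, confidence).
    """
    updated = list(sequence)
    lookahead = []
    for pos in sorted(predictions):
        token, confidence = predictions[pos]
        if confidence >= threshold:
            updated[pos] = token      # committed to the output
        else:
            lookahead.append(token)   # discarded from output, kept as a hint
    return updated, lookahead


def build_query(question: str, lookahead: List[str]) -> str:
    """Append lookahead tokens to the question to form a retrieval query."""
    if not lookahead:
        return question
    return question + " " + " ".join(lookahead)


# Hypothetical partially denoised answer with two masked positions.
sequence = ["The", MASK, "was", "born", "in", MASK]
predictions = {1: ("director", 0.9), 5: ("Vienna", 0.3)}
sequence, hints = denoise_step(sequence, predictions, threshold=0.5)
query = build_query("Where was the director born?", hints)
```

The low-confidence guess stays masked in the output but still reaches the retriever, which is the lookahead effect the abstract describes.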

2606.06473 2026-06-05 cs.AI cs.CL

MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li, Jie Zhou, Liang He, Bo Zhang, Lei Bai

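AI summary: Proposes the MLEvolve framework, which uses Progressive MCGS, Retrospective Memory, and hierarchical control to address inter-branch information isolation, memoryless search, and the lack of hierarchical control in LLM agents on long-horizon tasks, achieving state-of-the-art performance on MLE-Bench and outperforming specialized algorithm discovery methods on mathematical algorithm optimization tasks.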

Abstract

Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.

2606.06470 2026-06-05 cs.LG cs.AI

PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training

Senmiao Wang, Tiantian Fang, Haoran Zhang, Yushun Zhang, Kunxiang Zhao, Alex Schwing, Ruoyu Sun

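AI summary: Proposes the PC layer, a weight parameterization based on a polynomial preconditioner that reshapes the singular-value spectrum of weight matrices with low-degree polynomial preconditioning, keeping weight conditioning stable during LLM training with no inference overhead after training; it outperforms standard transformers in Llama-1B pre-training.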

Abstract

We propose a preconditioning (PC) layer, a weight parameterization via polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular-value spectrum of weight matrices via low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama-1B pre-training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum-control principle by proving that uniformly bounding each layer's singular values ensures geometric convergence of gradient descent to global minima, for certain deep linear networks. Our code is available at https://github.com/Empath-aln/PC-layer.
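
A short derivation shows why a matrix polynomial can reshape the singular-value spectrum without changing the singular vectors. The form $p(W W^\top)\,W$ and the example polynomial are assumptions chosen for illustration (the example is the classical Newton-Schulz step); the abstract does not state which parameterization or polynomial the PC layer uses.

```latex
W = U \Sigma V^\top
\;\Longrightarrow\;
p(W W^\top)\, W = U\, p(\Sigma^2)\, \Sigma\, V^\top,
\qquad \sigma_i \mapsto \sigma_i\, p(\sigma_i^2)

p(x) = \tfrac{1}{2}\,(3 - x)
\;\Longrightarrow\;
\sigma \mapsto \tfrac{1}{2}\,(3\sigma - \sigma^3)
```

With this example polynomial, $\sigma = 1$ is a fixed point and singular values near 1 are pulled toward it, which narrows the spectrum and so improves the conditioning of the effective weight.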

2606.06469 2026-06-05 math.ST cs.LG math.PR stat.TH

How abundant are good interpolators?

August Y. Chen, Ahmed El Alaoui

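AI summary: In the high-dimensional proportional regime, uses a large deviation principle to study the distribution of generalization error for linear interpolating classifiers chosen uniformly at random, finding that almost all interpolating classifiers have the same generalization performance while efficient algorithms such as gradient descent outperform most interpolators.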

Comments
140 pages

Abstract

Let $S$ be the set of unit norm linear classifiers $\theta \in \mathbb{R}^d$ which correctly classify every point of a labeled dataset $(X_i,y_i)_{i=1}^n$, $X_i \in \mathbb{R}^d$, $y_i \in \{-1,+1\}$, with a possibly negative margin $\kappa$ fixed in advance. Under two natural data-generating distributions of the $(X,y)$ pairs -- a Gaussian mixture model and a logistic model with Gaussian features -- and in the proportional regime $n/d \to \alpha$ with small enough $\alpha$, we establish a large deviation principle on the event that a point $\theta$ chosen uniformly at random from $S$ achieves a given generalization error, with high probability over the choice of the data. The associated large deviation rate function is deterministic and describes the proportion, at the exponential scale in $d$, of interpolating classifiers having a given desired performance. As a consequence, we establish the following concentration phenomenon: all but an exponentially small fraction of interpolating classifiers have approximately the same generalization performance given by the unique maximizer of this rate function. We numerically compare this maximizer to the performance of empirical risk minimization by gradient descent and to the performance of a natural linear program, both finding a point in $S$, and deduce that in the overparametrized regime of small $\alpha$, these efficient procedures outperform the vast majority of interpolators, pointing to their nontrivial benign overfitting in this setting.

2606.06468 2026-06-05 cs.AI

Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement

Goedel-Architect: 通过蓝图生成与精炼简化形式定理证明

Jui-Hui Chung, Ziyang Cai, Zihao Li, Qishuo Yin, Rohit Agarwal, Simon Park, Rodrigo Porto, Narutatsu Ri, Ziran Yang, Shange Tang, Xingyu Dang, Hongzhou Lin, Mengdi Wang, Danqi Chen, Chi Jin, Liam H Fowl, Sanjeev Arora

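AI summary: Proposes Goedel-Architect, a framework that generates and refines dependency-graph blueprints and uses a Lean 4 prover to close lemmas in parallel, reaching state-of-the-art performance among open-source pipelines on several benchmarks.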

详情
AI中文摘要

我们介绍Goedel-Architect,一个以蓝图生成和精炼为中心的Lean 4形式定理证明智能体框架。蓝图是一个定义和引理的依赖图,逐步构建到主定理。首先,Goedel-Architect生成一个包含形式化定义和引理及其声明依赖关系的蓝图。该蓝图可选地由自然语言证明引导。然后,一个配备工具的Lean证明器组件使用相关依赖并行证明每个开放的引理节点。失败的引理反过来驱动全局蓝图的精炼。这种策略与其他主流方法形成对比,后者使用递归引理分解,并可能低效地在死胡同策略上循环。使用开放权重的DeepSeek-V4-Flash (284B-A13B)作为骨干,Goedel-Architect在MiniF2F-test上达到99.2%的pass@1,在PutnamBench上达到75.6%的pass@1。在更困难的问题上,通过可选的初始蓝图自然语言证明种子,我们额外解决了剩余的两个MiniF2F-test问题(达到100%),将PutnamBench提升至88.8%(597/672),并在IMO 2025上解决了4/6,在Putnam 2025上解决了11/12,在USAMO 2026上解决了3/6。这代表了开源流水线在价格点比可比开源流水线低至500倍的情况下的最先进性能。

英文摘要

We introduce Goedel-Architect, an agentic framework for formal theorem proving in Lean 4 centered on blueprint generation and refinement. A blueprint is a dependency graph of definitions and lemmas that builds up to the main theorem. First, Goedel-Architect generates a blueprint of formally stated definitions and lemmas, along with declared dependencies. This blueprint is optionally guided by a natural language proof. Then, a tool-equipped Lean prover component closes each open lemma node in parallel using relevant dependencies. Failed lemmas in turn drive refinement of the global blueprint. This strategy contrasts with other mainstream approaches which use recursive lemma decomposition, and can inefficiently loop on dead-end strategies. Using the open-weight DeepSeek-V4-Flash (284B-A13B) as the backbone, Goedel-Architect attains 99.2% pass@1 on MiniF2F-test and 75.6% pass@1 on PutnamBench. With an optional natural-language proof seeding the initial blueprint on the harder problems, we additionally close the remaining two MiniF2F-test problems (reaching 100%), lift PutnamBench to 88.8% (597/672), and solve 4/6 on IMO 2025, 11/12 on Putnam 2025, and 3/6 on USAMO 2026. This represents state-of-the-art performance for an open-source pipeline at a price point up to 500x less than comparable open-source pipelines.

2606.06467 2026-06-05 cs.CL cs.AI cs.LG

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

仅索引一次:具有共享路由的跨层稀疏注意力

Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei

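AI summary: Proposes cross-layer sparse attention (CLSA), which shares both the KV cache and the routing index across layers, preserving the accuracy of token-sparse attention while reducing routing overhead and substantially improving decoding efficiency for long-context LLMs.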

详情
AI中文摘要

现代LLM中的长上下文推理越来越受到解码效率的限制,尤其是在模型生成长中间思维链的推理密集型场景中。现有的稀疏注意力方法通常面临实际的效率-质量权衡。结构化块稀疏方法通常提供更强的加速,但会导致明显的质量损失,而token稀疏方法通常更准确,但由于在全缓存上进行top-k路由仍然昂贵,因此端到端加速有限。在这项工作中,我们提出了跨层稀疏注意力(CLSA),它建立在KV共享架构(如YOCO)之上。核心思想不仅是跨解码器层共享KV缓存,还共享路由索引。单个索引器计算一次token级别的top-k选择,并在各层之间重用生成的索引,从而保留了token稀疏注意力的细粒度选择性,同时分摊了路由开销。由此产生的架构共同改善了所有主要的推理瓶颈,包括预填充、KV缓存存储和长上下文解码。在短上下文和长上下文基准上的实验表明,CLSA既准确又高效,在128K上下文下实现了高达7.6倍的解码加速和17.1倍的总体吞吐量提升。这些结果表明,对于长上下文LLM,这是一种更完整的架构解决方案,可同时提升模型质量和推理效率。

英文摘要

Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive. In this work, we propose cross-layer sparse attention (CLSA), which is built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, thereby preserving the fine-grained selectivity of token sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly, including pre-filling, KV-cache storage, and long-context decoding. Experiments across short-context and long-context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. These results suggest a more complete architectural solution for long-context LLMs that jointly advances model quality and inference efficiency.
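
The shared-routing idea in the abstract above (one indexer picks the top-k tokens once, and every layer that shares the KV cache reuses that index) can be illustrated with a minimal pure-Python sketch. The function names, the dot-product indexer score, and the toy data layout are all assumptions made for illustration; they are not taken from the CLSA paper or its code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def topk_indices(index_query, keys, k):
    """Hypothetical indexer: score every cached token once, keep the k best positions."""
    scores = [dot(index_query, key) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def sparse_attention(query, keys, values, index):
    """Softmax attention restricted to the token positions listed in `index`."""
    logits = [dot(query, keys[i]) / math.sqrt(len(query)) for i in index]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, index)) / total
            for d in range(dim)]

def decode_step(layer_queries, index_query, keys, values, k):
    """Route once, then reuse the same index in every layer sharing the KV cache."""
    index = topk_indices(index_query, keys, k)
    outputs = [sparse_attention(q, keys, values, index) for q in layer_queries]
    return index, outputs
```

In this sketch the cost of scoring the full cache is paid once per decoding step rather than once per layer, which is the amortization the abstract describes; a per-layer token-sparse baseline would call `topk_indices` inside the loop instead.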

2606.06462 2026-06-05 cs.AI

Benchmark Everything Everywhere All at Once

无处不在的基准测试

Shiyun Xiong, Dongming Wu, Peiwen Sun, Yuang Ai, Bokang Yang, Wencheng Han, Xiao-Hui Li, Xiangyu Yue

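AI summary: Proposes Benchmark Agent, a fully autonomous agentic system that automates the benchmark construction pipeline, addressing the fact that existing benchmarks are labor-intensive to build, hard to reuse, and quick to saturate.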

详情
Comments
Project page: https://benchmarkagent.github.io/
AI中文摘要

基准测试通过提供标准化和明确的性能度量,对于评估和推进LLM和MLLM至关重要。然而,它们的构建劳动密集且难以复用,引发了可持续性和可扩展性的担忧。此外,现有基准在发布后往往很快达到性能饱和,导致对最先进模型的区分不足。为了应对这些挑战,我们引入了Benchmark Agent,一个完全自主的智能体系统,专为基准构建而设计。我们的框架编排了完整的基准构建流程,从用户查询分析和子任务设计到数据注释和质量控制。为了评估Benchmark Agent,我们实现了它来生成15个代表性基准,涵盖多种评估场景,包括文本理解、多模态理解和领域特定推理。大量实验,包括人工评估、LLM作为评判者的评估和一致性检查,表明Benchmark Agent能够在最小人工参与下生成高质量的基准样本。更重要的是,通过持续评估,我们观察到一些有洞察力的发现,包括当前模型在某些领域特定推理任务上存在困难。我们相信快速演进的基准可以为研究社区做出重要贡献。预览和代码将在演示页面和代码仓库中公开。

英文摘要

Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalability. Moreover, existing benchmarks often quickly reach performance saturation after their release, resulting in insufficient discrimination among state-of-the-art models. To address these challenges, we introduce Benchmark Agent, a fully autonomous agentic system designed for benchmark building. Our framework orchestrates the complete benchmark construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, we implement it to produce 15 representative benchmarks, spanning diverse evaluation scenarios, including text understanding, multimodal understanding, and domain-specific reasoning. Extensive experiments, including human evaluation, LLM-as-a-judge assessment, and consistency checks, demonstrate Benchmark Agent can generate high-quality benchmark samples with minimal human involvement. More importantly, through continual evaluation, we observe several insightful findings, including that current models struggle with certain domain-specific reasoning tasks. We believe that rapidly evolving benchmarks can contribute significantly to the research community. The preview and code will be publicly available at the demo page and code repository.

2606.06461 2026-06-05 cs.RO

Flow-based Policy Adaptation without Policy Updates

基于流的策略适应无需策略更新

Luzhe Sun, Jingtian Ji, Haoran Chen, Jiawei Zhou, Matthew R. Walter

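AI summary: Proposes GLOVES, which uses flow models to transport non-expert actions toward the expert action distribution, performing selective action-level adaptation that improves task success while preserving agent intent.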

详情
AI中文摘要

利用预训练策略、基础模型或人类操作员的先验知识,为从零开始学习机器人技能提供了一种高效替代方案。然而,这些智能体提供的动作往往是次优的、有噪声的,或与特定任务的专家行为不一致。我们提出了GLOVES,一系列基于流的适应方法,通过将非专家动作向专家动作分布传输来纠正它们。GLOVES并非用完全自主性取代智能体控制,而是执行选择性的动作级适应,在提升任务成功率的同时保持智能体意图。学习到的流还通过反向流评估提供了一种自然的分布内评分机制。我们利用该信号作为干预门:与专家分布一致的动作保持不变,而异常或分布外(OOD)动作则被纠正。这样,仅在必要时提供辅助。GLOVES仅需有限的专家监督,使用少量演示或可重用的成功技能片段。通过学习局部专家动作模式并在执行过程中拼接,GLOVES提供了一个轻量级的共享控制模块,用于跨任务和环境的鲁棒动作适应。代码和演示可在ripl.github.io/GLOVES_web获取。

英文摘要

Leveraging prior knowledge from pretrained policies, foundation models, or human operators offers an efficient alternative to learning robot skills from scratch. However, these agents often provide actions that are suboptimal, noisy, or misaligned with task-specific expert behavior. We propose GLOVES, a family of flow-based adaptation methods that correct non-expert actions by transporting them toward an expert action distribution. Rather than replacing agentic control with full autonomy, GLOVES performs selective action-level adaptation, improving task success while preserving agent intent. The learned flow also provides a natural in-distribution scoring mechanism through reverse flow evaluation. We use this signal as an intervention gate: actions that appear consistent with the expert distribution are passed through unchanged, while anomalous or out-of-distribution (OOD) actions are corrected. In this way, assistance is only provided when necessary. GLOVES requires only limited expert supervision, using a small number of demonstrations or reusable successful skill segments. By learning local expert action patterns and stitching them during execution, GLOVES provides a lightweight shared-control module for robust action adaptation across tasks and environments. Code and demos are available at ripl.github.io/GLOVES_web.
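
The intervention gate described in the abstract above (pass in-distribution actions through, correct the rest) has a simple control-flow shape, sketched below. This is only an illustration under strong assumptions: the Gaussian log-density score and the pull-toward-the-mean correction are stand-ins for the learned flow's reverse-flow score and transport map, and every name and constant here is hypothetical.

```python
def gated_action(action, score_fn, correct_fn, threshold):
    """Return (action_to_execute, was_corrected) for one agent action."""
    if score_fn(action) >= threshold:
        return action, False          # looks in-distribution: leave it unchanged
    return correct_fn(action), True   # looks out-of-distribution: apply the correction

# Stand-ins for the learned flow (assumed 1-D expert distribution N(0, 1)).
EXPERT_MEAN, EXPERT_STD = 0.0, 1.0

def gaussian_score(action):
    """Unnormalized Gaussian log-density, used as a placeholder in-distribution score."""
    return -0.5 * ((action - EXPERT_MEAN) / EXPERT_STD) ** 2

def pull_toward_expert(action):
    """Placeholder correction: move the action halfway toward the expert mean."""
    return EXPERT_MEAN + 0.5 * (action - EXPERT_MEAN)
```

Keeping the score and the correction as injected callables means the gate itself does not change when the placeholders are replaced by a trained model.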

2606.06460 2026-06-05 cs.CR cs.AI

Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

智能体会自行回避吗?测量LLM智能体对带内拒绝访问信号的遵从性

Thamilvendhan Munirathinam

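AI summary: Proposes a lightweight in-band deny signal (the Recuse Signal) and measures experimentally whether LLM agents voluntarily comply with it; the signal reliably induces recusal, but the most capable model proceeds when given explicit operator authorization.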

详情
Comments
8 pages, 1 figure. Code, specification, and experiment harness: https://github.com/mthamil107/Recuse
AI中文摘要

随着自主LLM智能体越来越多地持有真实凭证并在无人参与的情况下操作基础设施,操作员没有标准方式告知智能体某个资源是禁止访问的。访问控制要么允许智能体进入(它有有效凭证),要么硬性拒绝(与任何其他客户端无法区分)。我们提出第三种模式:一种轻量级的、公开的带内拒绝信号——Recuse Signal——服务器通过协议的现有通道(如SSH横幅、PostgreSQL NOTICE)发出,要求连接的自动化智能体自愿退出。这是一种合作治理控制,类似于实时访问的robots.txt;明确不是安全边界。其价值完全是经验性的,据我们所知,尚未被测量:合规的LLM智能体是否真的会遵守这样的信号?我们将该信号定义为一个开放的小型标准,实现了两个零或低占用适配器(一个SSH横幅/PAM钩子和一个PostgreSQL线路协议代理),将它们部署在实时的生产主机上,并进行受控实验,其中新智能体被赋予一个良性操作任务,并观察其是否回避。在试点中(SSH;OpenAI GPT-4o和GPT-4o-mini;以及作为部署智能体的Claude Code),该信号干净地诱导回避——存在信号时100%回避,而无信号对照组中100%完成任务——并且揭示性地表现为合作信号而非绝对信号:显式的操作员授权框架使最强大的模型继续执行,而其他智能体继续遵从主机策略。我们发布该标准、适配器和实验框架以供复现。

英文摘要

As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in-band deny signal -- the Recuse Signal -- that a server emits over a protocol's existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. This is a cooperative governance control, the robots.txt analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini-standard, implement two zero- or low-footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire-protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT-4o and GPT-4o-mini; and Claude Code as a deployed agent), the signal cleanly induces recusal -- 100% recusal when present versus 100% task completion in a no-signal control -- and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator-authorization framing flips the most capable model to proceed, while other agents continue to defer to the on-host policy. We release the standard, adapters, and experiment harness for reproduction.

2606.06459 2026-06-05 cs.LG

Event Detection for Parameter-to-KPI Dependency Learning for AI-RAN

面向AI-RAN的参数到KPI依赖学习的事件检测

Christie Djidjev, Nicholas Kaminski

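AI summary: To address interference among multiple AI control functions in AI-RAN, proposes an event-detection step for dependency learning that converts noisy continuous telemetry into binary event indicators, and uses synthetic data to evaluate how well a machine-learning pipeline recovers the latent dependency structure.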

详情
AI中文摘要

下一代无线网络预计将依赖多个并发的AI驱动控制功能,这些功能同时优化不同的网络目标,特别是在AI集成和开放无线接入网络架构中,如AI无线接入网络(AI-RAN)和开放无线接入网络(O-RAN)。当这些功能相互作用时,它们可能以难以仅从原始网络数据中检测的方式相互干扰。管理此类交互的一个关键缺失部分是可靠、可解释的依赖结构,该结构捕获在任何给定时间哪些控制参数积极影响哪些网络性能结果。本文聚焦于支持此类依赖学习所需的事件检测步骤,通过将噪声连续遥测转换为参数活动和KPI响应的二元指示器。核心困难在于并非数据中的每个波动都反映真实的控制交互,因此该方法必须区分真实的参数-结果关系与背景变化。由于难以获得具有已知参数-KPI真实标签的真实AI-RAN流量轨迹,我们引入了一个带有植入潜在依赖的合成闭环流量生成器。我们使用这种受控遥测来评估基于机器学习的依赖恢复管道,该管道将连续轨迹到二元事件指示器的转换表述为一个显著性检测问题。实验评估表明,当信号与背景变化充分分离时,所提出的管道能够可靠地从噪声连续轨迹中恢复潜在依赖结构,同时强调阈值校准是控制事件检测质量的关键因素。这些结果为自适应AI-RAN控制系统的可解释依赖学习奠定了基础。

英文摘要

Next-generation wireless networks are expected to rely on multiple concurrent AI-driven control functions that optimize different network objectives simultaneously, particularly in AI-integrated and open radio access network architectures such as AI Radio Access Network (AI-RAN) and Open Radio Access Network (O-RAN). When these functions interact, they can interfere with one another in ways that are difficult to detect from raw network data alone. A key missing piece for managing such interactions is a reliable, interpretable dependency structure that captures which control parameters are actively influencing which network performance outcomes at any given time. This paper focuses on the event-detection step needed to support such dependency learning by converting noisy continuous telemetry into binary indicators of parameter activity and KPI response. The central difficulty is that not every fluctuation in the data reflects a genuine control interaction, so the method must distinguish real parameter-outcome relationships from background variation. Because real AI-RAN traffic traces with known parameter-KPI ground truth are difficult to obtain, we introduce a synthetic closed-loop traffic generator with planted latent dependencies. We use this controlled telemetry to evaluate a machine-learning-based dependency recovery pipeline that formulates the conversion of continuous traces into binary event indicators as a significance-detection problem. Experimental evaluation shows that the proposed pipeline reliably recovers the latent dependency structure from noisy continuous traces when the signal is sufficiently separated from background variation, while highlighting threshold calibration as the key factor controlling event-detection quality. These results constitute a foundational step toward interpretable dependency learning for adaptive AI-RAN control systems.
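
The conversion of continuous telemetry into binary event indicators, and the role of threshold calibration noted in the abstract above, can be illustrated with a deliberately simple stand-in. The sketch below uses a plain z-score test against a baseline window, which is a swapped-in technique and not the paper's machine-learning pipeline; the function names, the baseline window, and the lag parameter are all assumptions.

```python
import statistics

def detect_events(trace, baseline_len, z_threshold):
    """Flag samples whose deviation from a baseline window exceeds a z-score threshold."""
    baseline = trace[:baseline_len]
    mean = statistics.fmean(baseline)
    std = statistics.pstdev(baseline) or 1e-12  # guard against a flat baseline
    return [1 if abs(x - mean) / std > z_threshold else 0 for x in trace]

def co_occurrence(param_events, kpi_events, lag):
    """Fraction of parameter events that are followed by a KPI event `lag` steps later."""
    hits = total = 0
    for t, active in enumerate(param_events):
        if active and t + lag < len(kpi_events):
            total += 1
            hits += kpi_events[t + lag]
    return hits / total if total else 0.0
```

Lowering `z_threshold` turns more background fluctuation into spurious events, while raising it hides weak but genuine responses, which is one way to see why calibration dominates detection quality.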

2606.06458 2026-06-05 cs.LG cs.AI cs.CV

In-Context Multiple Instance Learning

上下文多实例学习

Alexander Möllers, Marvin Sextro, Julius Hense, Gabriel Dernbach, Klaus-Robert Müller

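AI summary: Proposes an in-context learner with a Perceiver-style architecture, pretrained on synthetic data, that solves new multiple instance learning tasks from a handful of labeled bags without gradient updates and outperforms supervised baselines requiring task-specific training across 12 benchmarks.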

详情
AI中文摘要

多实例学习(MIL)解决了在实例包级别提供监督的问题,并已成功应用于从计算病理学到卫星图像等领域。然而,现有算法在低标签率(许多实际应用的特点)下表现不佳。灵活的模型过拟合,而僵化的模型无法适应手头的任务。我们证明,在合成数据上预训练一个具有感知器架构的上下文学习器,可以得到一个能够从少量标记包中解决新任务的模型。在推理时,分类在单次前向传播中完成,无需梯度更新。我们提出并研究了不同的用于包结构数据的合成数据生成器,发现它们捕获了互补的归纳偏差。在这些生成器的混合上预训练的模型继承了每个生成器在各自任务上的优势,并在12个MIL基准上取得了最佳平均性能,超过了需要任务特定训练的监督基线。

英文摘要

Multiple Instance Learning (MIL) addresses problems where supervision is available at the level of bags of instances and has been successfully applied in fields ranging from computational pathology to satellite imagery. Nevertheless, existing algorithms struggle in the low-label regime that characterizes many real-world applications. Flexible models overfit and rigid ones fail to adapt to the task at hand. We show that pretraining an in-context learner with a Perceiver-style architecture on synthetic data yields a model that can solve new tasks from a handful of labeled bags. At inference time, classification happens in a single forward pass and requires no gradient updates. We propose and investigate different synthetic data generators for bag-structured data and find that they capture complementary inductive biases. A model pretrained on a mixture of these generators inherits their per-task strengths and achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.

2606.06455 2026-06-05 quant-ph cs.IT math.IT

Breakeven demonstration of quantum low-density parity-check codes

量子低密度奇偶校验码的盈亏平衡演示

Edwin Tham, Michael L. Goldman, Shantanu Debnath, Ashay N. Patel, Jyothi Saraladevi, Jason Nguyen, Erik Nielsen, Neal Pisenti, Kenneth Wright, John Gamble, Nicolas Delfosse

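AI summary: Uses the flexibility of a trapped-ion quantum computer to demonstrate nine different quantum error-correcting codes without hardware reconfiguration; a qLDPC code encoding 4 logical qubits into 18 physical qubits achieves a logical error rate up to 9× better than a previous similar demonstration and reaches breakeven performance.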

详情
AI中文摘要

高速率量子低密度奇偶校验(qLDPC)码是容错量子计算的主要候选方案。它们比平面替代方案(如表面码)具有更高的编码率,但其实现通常面临重大硬件障碍,例如需要长距离耦合器。我们利用离子阱量子计算机的灵活性,在单个设备上无需任何硬件重配置,演示了九种具有截然不同量子比特连接要求的量子纠错码。这些实验涵盖三个量子纠错码系列:qLDPC码、拓扑码和级联码。使用将4个逻辑量子比特编码到18个物理量子比特中的qLDPC码,我们实现的逻辑错误率比先前在超导固态量子比特上对类似码的演示最多低9倍。此外,我们的实现表现出盈亏平衡性能,某些实例的量子比特寿命与我们的离子阱量子比特的寿命相当或略有超过。我们采用光学亚稳态基态(OMG)架构的新颖实现,用于可寻址的电路中间测量和重置,从而无需任何离子传输或专用冷却离子即可进行这些实验,而这些要求通常会消耗离子阱量子计算机的大部分运行时间或离子数。

英文摘要

High-rate quantum low-density parity-check (qLDPC) codes are a leading candidate for fault-tolerant quantum computing. They feature higher encoding rates than planar alternatives such as the surface code, but their implementation often entails significant hardware hurdles like the need for long-range couplers. We leverage the flexibility of a trapped-ion quantum computer to demonstrate nine quantum error-correcting codes with starkly different qubit connectivity requirements on a single device without any hardware reconfiguration. These experiments span three families of quantum error-correcting codes: qLDPC codes, topological codes, and concatenated codes. With a qLDPC code encoding 4 logical qubits into 18 physical qubits, we achieve a logical error rate up to $9\times$ better than a previous demonstration of a similar code on superconducting solid-state qubits. Moreover, our implementation exhibits breakeven performance, with some instances achieving qubit lifetimes comparable to or slightly exceeding that of our trapped-ion qubits. We use a novel implementation of the optical-metastable-ground (OMG) architecture for addressable mid-circuit measurement and reset, which enables us to perform these experiments without any ion transport or dedicated coolant ions, requirements that typically consume a large fraction of the runtime or ion count of trapped-ion quantum computers.

2606.06454 2026-06-05 cs.SE cs.CL

Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

脚手架,而非词汇?一项受控、双层、预注册的波普尔式代码生成技能研究

Mehmet Iscan

AI总结 通过双层消融实验(包括长度匹配安慰剂、仅标签脚手架和真实执行测试),研究发现波普尔式提示技能对代码正确性的提升主要来自脚手架结构而非其内容,并在大模型上因天花板效应无法检测,在小模型上仅标签脚手架即可达到类似效果。

详情
Comments
34 pages, 5 figures, 8 tables
AI中文摘要

大型语言模型越来越多地编写、审查和评判代码,一种快速发展的实践是为它们配备提示“技能”,要求模型像科学家一样推理。一个突出的例子是告诉模型扮演波普尔式证伪主义者,据报道这种技能能改进生成的代码。但这些增益几乎总是通过LLM作为评判者来读取,而该评判工具存在已知的位置偏好、自我偏好和风格偏差。我们问:如果它看起来有帮助,那么增益是来自技能的波普尔式内容,还是来自任何脚手架所施加的结构?我们预注册了一个双层消融实验,包含三个对照:长度匹配的安慰剂、仅保留波普尔式标题但去除过程的仅标签脚手架,以及一个执行预言机(HumanEval+单元测试),外加一个词汇光环哨兵和一个同模型自评判审计。在前沿模型(Claude Sonnet 4.6,N=163)上,所有条件都接近基准上限且无法区分,因此预注册的+5点改进未得到支持(上限限制的未检测)。在小模型(Qwen2.5-Coder-0.5B,N=164)上,结构化条件将最佳八次正确率提升了20-22点,但完整技能相比仅标签脚手架没有显示出可分离的益处(聚合F@8=L@8 vs V@8=34.8%),而安慰剂仅落后2.4点。一个应用波普尔式评分标准的0.5B自评判器未能击败随机选择,并将其60%的选择集中在一个索引上。在测试的两种设置中,该技能的波普尔式过程内容在仅标签脚手架之外没有增加可分离的执行正确性收益,因此增益追踪的是脚手架结构。我们贡献了一个校准的负结果和一个可重用的消歧协议;该发现界定了关于一个提示技能家族的工程主张,而不是对波普尔式方法论的总体评价。

英文摘要

Large language models increasingly write, review, and judge code, and a fast-growing practice equips them with prompt 'skills' that ask the model to reason like a scientist. A prominent example tells the model to act as a Popperian falsificationist, and such skills are reported to improve generated code. But these gains are almost always read off an LLM-as-a-judge, an instrument with documented positional, self-preference, and stylistic biases. We ask: if it appears to help, is the gain from the skill's Popperian content, or from the structure any scaffold imposes? We pre-register a two-tier ablation with three controls: a length-matched placebo, a labels-only scaffold that keeps the Popperian headers but strips the procedure, and an execution oracle (HumanEval+ unit tests), plus a vocabulary-halo sentinel and a same-model self-judge audit. On a frontier model (Claude Sonnet 4.6, N=163) all conditions sit near the benchmark ceiling and do not separate, so the pre-registered +5-point improvement is not supported (a ceiling-limited non-detection). On a small model (Qwen2.5-Coder-0.5B, N=164) structured arms lift best-of-eight correctness by 20-22 points, but the full skill shows no separable benefit over a labels-only scaffold (aggregate F@8=L@8 vs V@8=34.8%), and the placebo trails by only 2.4 points. A 0.5B self-judge applying the Popperian rubric does not beat random selection and concentrates 60% of its picks on one index. In the two settings tested, the skill's Popperian procedural content adds no separable execution-correctness benefit beyond a labels-only scaffold, so the gains track scaffold structure. We contribute a calibrated negative result and a reusable disambiguation protocol; the finding bounds an engineering claim about one prompt-skill family and is not an evaluation of Popperian methodology in general.

2606.06453 2026-06-05 cs.AI

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

Vortex: 面向AI Agent的高效可编程稀疏注意力服务

Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou, Michael Qizhe Shieh, Zhihao Jia, Beidi Chen

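AI summary: Proposes Vortex, a system that combines a Python-embedded frontend language and a page-centric tensor abstraction with an efficient backend, enabling rapid prototyping, deployment, and evaluation of sparse attention algorithms and substantially improving throughput.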

详情
AI中文摘要

随着生成长度的增长,稀疏注意力对于服务大型语言模型(LLMs)变得越来越重要。然而,大规模部署和评估新的稀疏注意力算法仍然高度工程密集,这减慢了人类研究人员和AI Agent探索稀疏注意力设计的速度。为了应对这一挑战,我们提出了Vortex,一个系统,它结合了在面向页面的张量抽象之上的Python嵌入式前端语言,用于表达广泛的稀疏注意力算法,以及一个紧密集成到现代LLM服务栈中的高效后端。Vortex能够快速原型设计、部署和评估稀疏注意力算法,有效地将其理论效率提升转化为实际吞吐量的改进。因此,Vortex大大加速了稀疏注意力算法的设计和迭代。首先,AI Agent使用Vortex自动生成和优化多样化的算法,最佳算法在保持准确性的同时,吞吐量比全注意力高出高达3.46倍。其次,Vortex将稀疏注意力扩展到新兴架构和非常大的模型,这些模型原本难以实验,在基于MLA的GLM-4.7-Flash上实现了高达4.7倍的吞吐量提升,在229B参数的MiniMax-M2.7上实现了1.37倍的提升(在NVIDIA B200 GPU上)。

英文摘要

Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, slowing both human researchers and AI agents in exploring the sparse attention design. To address this challenge, we present Vortex, a system that combines a Python-embedded frontend language atop a page-centric tensor abstraction for expressing a broad range of sparse attention algorithms, with an efficient backend tightly integrated into modern LLM serving stacks. Vortex enables rapid prototyping, deployment, and evaluation of sparse attention algorithms, effectively translating their theoretical efficiency gains into real-world throughput improvements. As a result, Vortex substantially accelerates the design and iteration of sparse attention algorithms. First, AI agents use Vortex to automatically generate and refine diverse algorithms, the best reaching up to $3.46\times$ higher throughput than full attention while preserving accuracy. Second, Vortex extends sparse attention to emerging architectures and very large models that are otherwise hard to experiment with, reaching up to $4.7\times$ higher throughput on the MLA-based GLM-4.7-Flash and $1.37\times$ on the 229B-parameter MiniMax-M2.7 on NVIDIA B200 GPUs.

2606.06451 2026-06-05 cs.GT

Simultaneous EF1 and approximate MMS allocations for submodular valuations

次模估值下同时满足EF1和近似MMS的分配

Uriel Feige, Assaf Fine

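AI summary: For submodular valuations, studies how to allocate indivisible items so as to achieve EF1 and a constant-factor approximation of the maximin share (MMS) simultaneously.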

详情
AI中文摘要

在将$m$个不可分割物品分配给$n$个具有平等权利的代理人时,通常考虑两类公平性概念。一类是基于份额的公平性概念,其中最大最小份额(MMS)及其松弛到$\rho$-MMS是这类概念的主要代表。另一类是基于比较的公平性概念,其中无嫉妒(EF)及其松弛如EF1是这类概念的主要代表。通常,没有一类概念能为另一类概念提供良好的保证。在这项工作中,我们设计了同时满足两类概念中公平性概念的分配,具体而言,是$\rho$-MMS(对于常数$\rho$)和EF1(实际上还有EFL)。这些结果先前在代理人具有可加估值时已知,而我们针对更一般的次模估值类证明了这些结果。

英文摘要

There are two common classes of fairness notions that are considered when allocating $m$ indivisible items to $n$ agents of equal entitlements. One is that of share-based fairness notions, with the maximin share (MMS) and its relaxations to $\rho$-MMS being prominent representatives of this class. The other is that of comparison-based fairness notions, with envy-freeness (EF) and its relaxations such as EF1 being prominent representatives of this class. In general, no class offers good guarantees for the other class. In this work, we design allocations that simultaneously satisfy notions from both classes, and specifically, are $\rho$-MMS for constant $\rho$ and EF1 (in fact, also EFL). Such results were previously known when agents have additive valuations, and we prove such results for the more general class of submodular valuations.
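
For readers unfamiliar with EF1, the standard definition is that agent $i$ may envy agent $j$ only if the envy disappears after removing some single item from $j$'s bundle. The sketch below checks that condition for arbitrary set valuations and uses a coverage function as an example of a submodular valuation. It illustrates the definition only, not the allocation algorithm of the paper; the function names and the toy instance are assumptions.

```python
def coverage_value(bundle, item_features):
    """Example submodular valuation: number of distinct features covered by the bundle."""
    covered = set()
    for item in bundle:
        covered |= item_features[item]
    return len(covered)

def is_ef1(allocation, valuations):
    """Check EF1: allocation[i] is agent i's bundle (a set), valuations[i] its value function."""
    for i, value in enumerate(valuations):
        own = value(allocation[i])
        for j, other in enumerate(allocation):
            if i == j or own >= value(other):
                continue  # no envy toward agent j
            # Envy must vanish after dropping some single item g from j's bundle.
            if not any(own >= value(other - {g}) for g in other):
                return False
    return True
```

With additive valuations the same check applies unchanged, since only the value functions passed in `valuations` differ.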

2606.06448 2026-06-05 cs.AI

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

Agent记忆:有状态长时任务工作负载的表征与系统影响

Yasmine Omri, Ziyu Gan, Zachary Broveak, Robin Geens, Zexue He, Alex Pentland, Marian Verhelst, Tsachy Weissman, Thierry Tambe

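AI summary: Presents the first systems-level characterization of LLM agent memory systems, introducing a four-axis taxonomy, evaluating 10 representative systems with a phase-aware profiling harness, and deriving 10 system design recommendations.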

详情
AI中文摘要

LLM agent越来越多地被部署在需要跨扩展交互历史进行持续推理的长时任务上。大规模实现这一点要求agent在会话之间持久地存储、检索和更新自己的记忆。一个丰富的agent记忆系统生态系统已经出现,涵盖平面检索、LLM介导的提取、整合事实存储和agent控制流。然而,它们的系统级行为尚未被表征。我们提出了agent记忆的首次系统表征。首先,我们引入了一个面向系统的分类法,沿四个轴对agent记忆系统进行分类。其次,我们构建了一个阶段感知的分析框架,将成本归因于构建、检索和生成。第三,我们跨两个基准套件表征了十个代表性系统,揭示了设计选择如何在写和读路径上转移成本。最后,我们推导出10条系统建议,涵盖构建调度、能力下限、通过查询量的摊销、新鲜度-延迟权衡以及集群规模管理。

英文摘要

LLM agents are increasingly deployed on long-horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory across sessions. A rich ecosystem of agent memory systems has emerged spanning flat retrieval, LLM-mediated extraction, consolidating fact stores, and agentic control flows. Yet, their system-level behavior remains uncharacterized. We present the first systems characterization of agent memory. First, we introduce a system-oriented taxonomy classifying agent memory systems along four axes. Second, we build a phase-aware profiling harness attributing cost to construction, retrieval, and generation. Third, we characterize ten representative systems across two benchmark suites, uncovering how design choices shift cost across the write and read paths. Finally, we derive 10 system recommendations covering construction scheduling, capability floors, amortization via query volume, freshness-latency tradeoffs, and fleet-scale management.

2606.06447 2026-06-05 cs.CL cs.LG

Latent Reasoning with Normalizing Flows

基于归一化流的潜在推理

Guancheng Tu, Xiangjun Fu, Suhao Yu, Yao Tang, Haoqiang Kang, Lianhui Qin, Yizhe Zhang, Jiatao Gu

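AI summary: Proposes NF-CoT, a framework that models continuous latent thoughts inside the LLM with normalizing flows, preserving autoregressive generation, probabilistic sampling, KV-cache decoding, and likelihood estimation, and improving pass rates on code-generation tasks while lowering reasoning cost.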

详情
AI中文摘要

大型语言模型通常通过生成显式思维链(CoT)来改进推理,展示了中间计算的重要性。然而,文本CoT迫使这种计算通过离散、串行且面向通信的令牌流进行:每个推理步骤必须在模型继续之前被语言化,即使底层更新是语义的、不确定的或仅部分形成的。潜在推理通过在承诺文本之前以紧凑的连续状态执行中间计算,提供了一种更高带宽的替代方案。然而,现有的潜在推理方法常常牺牲了使CoT在自回归语言模型中有效的关键优势,包括原生的从左到右生成、概率采样、与KV缓存解码的兼容性以及可处理的似然估计。我们提出NF-CoT,一种潜在推理框架,通过使用归一化流对连续思维进行建模来保留这些优势。NF-CoT在LLM骨干内部实例化一个TARFlow风格的归一化流,定义了从显式CoT提炼的紧凑连续思维上的可处理概率模型。连续思维位置由NF头生成,而文本位置由标准LM头在同一因果流中生成。这种设计为潜在思维提供了精确的似然,支持使用原始KV缓存进行概率从左到右解码,并支持在潜在推理空间中进行直接策略梯度优化。在代码生成基准测试中,NF-CoT在显式CoT和先前潜在推理基线上提高了通过率,同时显著降低了中间推理成本。

英文摘要

Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the importance of intermediate computation. However, textual CoT forces this computation through a discrete, serial, and communication-oriented token stream: each reasoning step must be verbalized before the model can proceed, even when the underlying update is semantic, uncertain, or only partially formed. Latent reasoning offers a higher-bandwidth alternative by performing intermediate computation in compact continuous states before committing to text. Yet existing latent-reasoning methods often sacrifice key advantages that make CoT effective in autoregressive language models, including native left-to-right generation, probabilistic sampling, compatibility with KV-cache decoding, and tractable likelihood estimation. We propose NF-CoT, a latent reasoning framework that preserves these advantages by modeling continuous thoughts with normalizing flows. NF-CoT instantiates a TARFlow-style normalizing flow inside the LLM backbone, defining a tractable probability model over compact continuous thoughts distilled from explicit CoT. Continuous-thought positions are generated by an NF head, while text positions are generated by the standard LM head within the same causal stream. This design provides exact likelihoods for latent thoughts, enables probabilistic left-to-right decoding with the original KV cache, and supports direct policy-gradient optimization in the latent reasoning space. On code-generation benchmarks, NF-CoT improves pass rates over explicit-CoT and prior latent-reasoning baselines while substantially reducing intermediate-reasoning cost.
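
The exact likelihoods mentioned in the abstract above rest on the standard change-of-variables identity for normalizing flows. As a reminder, in generic notation that is not taken from the paper ($z$ for a continuous thought, $f$ for an invertible map, $p_0$ for a simple base density such as a standard Gaussian):

$$\log p(z) = \log p_0\big(f(z)\big) + \log\left|\det \frac{\partial f(z)}{\partial z}\right|$$

When the flow is autoregressive, so that $f_i$ depends only on $z_1,\dots,z_i$, the Jacobian is triangular and the determinant reduces to a product of diagonal entries:

$$\log\left|\det \frac{\partial f(z)}{\partial z}\right| = \sum_{i} \log\left|\frac{\partial f_i(z)}{\partial z_i}\right|$$

This is what makes the likelihood tractable to evaluate, and a tractable log-likelihood is the quantity a policy-gradient objective needs.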

2606.06444 2026-06-05 eess.AS cs.CL cs.SD

USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

USAD 2.0:面向通用音频理解的表征蒸馏规模化

Heng-Jui Chang, Alexander H. Liu, Saurabhchand Bhati, Mrudula Athi, Anton Ratnarajah, Amit Chhetri, James Glass

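AI summary: Proposes USAD 2.0, a universal audio encoder that fuses knowledge from self-supervised and supervised foundation models through domain-aware distillation, extends coverage to music, scales to one billion parameters via depth scaling, and achieves strong or state-of-the-art performance in probing and LLM-based evaluations.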

详情
Comments
Accepted to Interspeech 2026
AI中文摘要

音频编码器对于现代音频应用至关重要,因为大型语言模型(LLM)越来越依赖单一编码器处理多样输入。虽然自监督学习(SSL)已产生强大的领域特定编码器(如语音或音乐专家),但像USAD和SPEAR这样的多领域方法在覆盖范围和评估方面仍然有限。最近的研究也表明,监督编码器与音频LLM的对齐效果更好。我们提出USAD 2.0,一种融合了SSL和监督基础模型知识的通用编码器。USAD 2.0引入了领域感知蒸馏来解决教师不匹配问题,将覆盖范围扩展到音乐领域,并增加了用于下游任务的第二阶段监督蒸馏。我们进一步通过深度缩放将模型扩展到十亿参数。实验表明,USAD 2.0在探测和基于LLM的评估中取得了强劲或最先进的性能。

英文摘要

Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.

2606.06440 2026-06-05 cs.LG stat.ML

Causal Atlases from Entropic Inference: Bayesian Networks beyond Optimal DAGs

来自熵推理的因果图谱:超越最优DAG的贝叶斯网络

Hazhir Aliahmadi, Irina Babayan, Greg van Anders

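AI summary: To address the existence of multiple causal chains in data-driven causal identification, proposes a causal-atlas method based on entropic inference that samples a maximum-entropy ensemble of graphs to quantify ambiguity in causal structure.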

详情
Comments
18 pages, 2 figures
AI中文摘要

数据驱动的因果关系识别对于理解科学内外的复杂系统至关重要。贝叶斯网络通过有向无环图(DAG)为建模通用因果关系提供了一种概率方法。然而,构建贝叶斯网络的典型技术依赖于优化,这可能不适合学习因果关系,因为底层数据可能允许多条因果链。更忠实于数据的因果关系表示将提供构建多个因果图的框架,这些因果图与底层数据固有的变异性一致。在这里,我们展示了基于熵的推理生成了与底层数据一致的合理因果关系的图谱。在2节点和20节点线性结构方程模型的模拟噪声数据上,我们对图的最大熵系综进行采样,从而量化底层因果关系中固有的结构歧义性。我们的方法表明,“优化”的DAG可能包含在同等精确的拓扑中不一致的因果伪影。

英文摘要

Data-driven causal relationship identification is pertinent to advancing understanding of complex systems both within and beyond science. Bayesian networks offer a probabilistic method for modelling generic causal relationships via directed acyclic graphs (DAGs). However, typical techniques for constructing Bayesian networks rely on optimization, which can be ill-suited for learning causal relationships because the underlying data may admit multiple chains of causation. More data-faithful representations of causal relationships would provide frameworks for constructing multiple causal maps that are consistent with the variability that is inherent in underlying data. Here, we show that entropy-based inference generates atlases of plausible causal relationships that are consistent with underlying data. On simulated noisy data of 2- and 20-node linear structural equation models, we sample a maximum-entropy ensemble of graphs that allow us to quantify the inherent structural ambiguity in underlying causal relationships. Our method shows that "optimized" DAGs can contain causal artifacts that are not consistent across equivalently accurate topologies.