arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20246 2026-05-22 cs.LG cs.AI

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

Xiongbin Wu, Zhihao Luo, Shanzhe Lei, Lechao Zhang, Xuhong Wang, Jie Yang, Zhonglong Zheng, Yuanjie Zheng, Xin Tan, Wei Liu

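Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; East China Normal University; Zhejiang Normal University; Shandong Normal University

AI summary: This paper proposes GROW, a framework that decomposes collected trajectories into state-action samples and computes advantages across those samples. It addresses the excessively long context and noise that arise in multi-turn RL because standard GRPO requires full trajectories, and experiments on more than 800 Minecraft tasks show SOTA performance.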

Abstract

Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception and action execution. However, existing methods still rely primarily on Supervised Fine-Tuning (SFT) with expert demonstrations, while the advanced reinforcement learning (RL) algorithm, specifically Group Relative Policy Optimization (GRPO), has not been effectively employed for multi-turn RL in these tasks because standard GRPO requires full trajectories as training samples, which leads to excessively long context and noise. To address this issue, we propose GROW, an RL framework for open-world VLM agents that decomposes collected trajectories into state-action samples and computes advantages between these samples rather than treating a full trajectory as a single entity. We further provide a surrogate analysis indicating that, even though the grouped samples are conditioned on different local states rather than an identical prompt context, the objective can preserve the core relative policy optimization signal of GRPO under simplifying assumptions. Experiments on more than 800 Minecraft tasks show that our method achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of our proposed RL framework for open-world VLM agents.
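
The decomposition described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how rewards are assigned to individual steps or how advantages are normalized, so the sketch assumes that each step inherits its episode's scalar return and that advantages are standardized across the whole group of samples. The names `decompose` and `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev

def decompose(trajectories):
    """Flatten trajectories into (state, action, return) samples.

    Assumption for illustration: every step inherits the scalar return
    of the episode it came from.
    """
    samples = []
    for traj in trajectories:
        for state, action in traj["steps"]:
            samples.append((state, action, traj["ret"]))
    return samples

def group_relative_advantages(samples, eps=1e-8):
    """Standardize returns across the group of samples, GRPO-style."""
    returns = [r for _, _, r in samples]
    mu, sigma = mean(returns), pstdev(returns)
    return [(s, a, (r - mu) / (sigma + eps)) for s, a, r in samples]

# Two short rollouts: one succeeds (return 1.0), one fails (return 0.0).
trajectories = [
    {"steps": [("s0", "move"), ("s1", "dig")], "ret": 1.0},
    {"steps": [("s0", "turn"), ("s2", "jump")], "ret": 0.0},
]
scored = group_relative_advantages(decompose(trajectories))
```

Each resulting sample carries only its own local state, so the training context stays short even when the original trajectory is long.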

2605.19192 2026-05-22 cs.AI cs.CR

Hallucination as Exploit: Evidence-Carrying Multimodal Agents

Guijia Zhang, Hao Zheng, Harry Yang

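Affiliations: Shenzhen University; HKUST

AI summary: This paper studies how hallucination in multimodal agents leads to authorization failures and proposes evidence-carrying multimodal agents (ECA), which decompose tool calls, obtain typed certificates, and authorize through a deterministic gate, converting opaque model belief into auditable residuals and improving system safety.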

Comments 23 pages, 6 figures, 15 tables

Abstract

Multimodal agents increasingly choose tool calls from screenshots, documents, and webpages, where a false perceptual claim can turn hallucination from an answer-quality error into an authorization failure. We formalize this failure mode as hallucination-to-action conversion: an unsupported claim supplies the precondition for a privileged action. We propose evidence-carrying multimodal agents (ECA), which treat free-form model text as inadmissible evidence, decompose each tool call into action-critical predicates, obtain typed certificates from constrained DOM/OCR/AX verifiers, and use a deterministic gate to authorize only the privileges those certificates support. Rather than hiding perception error, ECA converts opaque model belief into auditable residuals at the verifier, schema, and implementation levels. Verifier red-teaming across 17 canonical attack categories shows that four targeted hardening steps are each necessary; after hardening, canonical gate bypass is 0/1,700 (Wilson 95% upper bound 0.22%). With content-derived certificates, ECA observes zero unsafe executions on 200 end-to-end tasks (Wilson 95% upper bound 2.67%) and 120 browser tasks (upper bound 4.3%). A HACR audit on 500 stratified task keys shows that unsupported action-critical claims reach unsafe execution for naive agents (100.0%) and prompt-only defenses (49.6%), but not for ECA. Oracle-certificate replay over 7,488 GPT-5.4 traces isolates gate correctness, while neural judge baselines still admit most unsafe actions under the same threat model. The resulting principle is simple: model language may propose tool use, but certified predicates must authorize it.
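
The gating principle in this abstract (model text may propose, certified predicates must authorize) can be sketched as follows. This is a hypothetical illustration, not the ECA implementation: the `Certificate` type, the `gate` function, and the predicate names are invented for the example, and real certificates would carry typed evidence rather than a bare label.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Certificate:
    predicate: str  # the action-critical predicate this certificate supports
    verifier: str   # which constrained verifier issued it, e.g. "DOM" or "OCR"

def gate(required_predicates, certificates, trusted=("DOM", "OCR", "AX")):
    """Authorize a tool call only if every required predicate is certified.

    Free-form model text is never consulted; only certificates count.
    Returns (allowed, missing_predicates).
    """
    certified = {c.predicate for c in certificates if c.verifier in trusted}
    missing = sorted(set(required_predicates) - certified)
    return (len(missing) == 0, missing)

# Hypothetical payment action with two action-critical predicates.
required = ["recipient_matches_invoice", "amount_visible_on_page"]
certs = [Certificate("recipient_matches_invoice", "DOM")]
allowed, missing = gate(required, certs)
```

Because the gate is a pure function of the certificates, a refused call leaves behind the list of missing predicates, which is the kind of auditable residual the abstract refers to.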

2605.18893 2026-05-22 cs.LG

Position: Graph Condensation Needs a Reset -- Move Beyond Full-dataset Training and Model-Dependence

Mridul Gupta, Samyak Jain, Vansh Ramani, Hariprasad Kodamana, Sayan Ranu

Affiliations: Yardi School of Artificial Intelligence, IIT Delhi, India; Department of Computer Science and Engineering, IIT Delhi, India; Department of Chemical Engineering, IIT Delhi, India; Indian Institute of Technology Delhi, Abu Dhabi, Zayed City, Abu Dhabi, UAE

AI summary: This paper argues that current graph condensation methods have systemic flaws and calls for a shift to lightweight, architecture-agnostic, and deployable methods that enable efficient, generalizable, and scalable GNN training.

Abstract

Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their scalability is increasingly strained by the size of real-world graphs in domains like recommender systems, fraud detection, and molecular biology. Graph condensation -- the task of generating a smaller synthetic graph that retains the performance of models trained on the original -- has emerged as a promising solution. However, the dominant approach of gradient matching introduces a fundamental contradiction: it requires training on the full dataset to create the compressed version, thereby undermining the goal of efficiency. Worse still, these methods suffer from high computational overhead, poor generalization across GNN architectures, and brittle reliance on specific model configurations. Equally concerning is the community's reliance on misleading evaluation protocols such as node compression ratios, which fail to reflect true resource savings, condensation overhead, and illusory application to neural architecture search. These shortcomings are not incidental -- they are systemic, and they obstruct meaningful progress. In this position paper, we argue that graph condensation, in its current form, needs a reset. We call for moving beyond full-dataset training and model-dependent design, and instead advocate for methods that are lightweight, architecture-agnostic, and practically deployable. By identifying key methodological flaws and outlining concrete research directions, we aim to reorient the field toward approaches that deliver on the true promise of condensation: efficient, generalizable, and usable GNN training at scale.

2605.18721 2026-05-22 cs.LG cs.CL

General Preference Reinforcement Learning

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi

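Affiliations: Stanford University; The University of Oklahoma

AI summary: This paper proposes General Preference Reinforcement Learning (GPRL), which introduces the General Preference Model (GPM) to address the lack of continuous exploration on open-ended tasks, improving model performance through multi-dimensional preference comparison.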

Abstract

Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into k skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the k-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
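
The per-dimension normalization step in this abstract can be sketched as below. The sketch is a simplification under stated assumptions: it takes a hypothetical table of per-response scores on each of k dimensions (the GPM itself produces pairwise comparisons) and uses fixed weights as a stand-in for the context-dependent eigenvalues. The function name `per_dimension_advantages` is invented here.

```python
from statistics import mean, pstdev

def per_dimension_advantages(scores, weights, eps=1e-8):
    """scores[i][d]: score of response i on dimension d (hypothetical input).
    weights[d]: non-negative aggregation weight for dimension d.
    Returns one aggregated group-relative advantage per response.
    """
    k = len(weights)
    columns = [[row[d] for row in scores] for d in range(k)]
    stats = [(mean(col), pstdev(col)) for col in columns]
    total = sum(weights)
    advantages = []
    for row in scores:
        # Standardize each dimension on its own scale before mixing.
        z = [(row[d] - stats[d][0]) / (stats[d][1] + eps) for d in range(k)]
        advantages.append(sum(w * zd for w, zd in zip(weights, z)) / total)
    return advantages

# Dimension 1 has a much larger raw scale than dimension 0.
scores = [[0.9, 10.0], [0.1, 30.0], [0.5, 20.0]]
adv = per_dimension_advantages(scores, weights=[1.0, 1.0])
```

In this toy group the two dimensions rank the responses in opposite order, so after per-dimension standardization the aggregated advantages cancel to roughly zero. Without that step the large-scale dimension would decide the ranking on its own.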

2605.17837 2026-05-22 cs.CV cs.AI

Temporal Aware Pruning for Efficient Diffusion-based Video Generation

Sheng Li, Yang Sui, Junhao Ran, Bo Yuan, Yue Dai, Xulong Tang

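Affiliations: University of Pittsburgh; Illinois Institute of Technology; Rutgers University; Rice University

AI summary: This paper proposes TAPE, a training-free temporal-aware pruning method for efficient diffusion-based video generation. It combines temporal smoothing, in-layer token reselection, and timestep-level budget scheduling to speed up generation while preserving high visual quality.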

Abstract

Video diffusion models have recently enabled high-quality video generation with ViT-based architectures, but remain computationally intensive because generation requires attention computation over long spatiotemporal sequences. Token pruning has proven effective for ViTs and VLMs. However, most prior pruning methods are attention-based and operate per frame, failing to ensure the vital temporal coherence across frames in video generation tasks. In practice, naively adopting attention-only pruning causes noticeable degradation due to worsened background consistency, flickering, and reduced image quality. To address this, we propose TAPE, a training-free Temporal Aware Pruning for Efficient diffusion-based video generation. TAPE (i) applies temporal smoothing to align token importance across adjacent frames and suppress selection jitter; (ii) performs token reselection in selected layers to align token pruning with the layers' diverse semantic focus and avoid error accumulation in specific areas; and (iii) adopts timestep-level budget scheduling that prunes aggressively at early noisy steps and relaxes pruning during fidelity-critical refinement. The experimental results show that TAPE delivers significant speedups while preserving high visual fidelity, outperforming prior token reduction approaches.
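
Components (i) and (iii) above can be illustrated with a small sketch. Everything specific in it is an assumption: the neighbour-averaging rule, the blend factor `alpha`, and the linear budget schedule are illustrative choices, since the abstract does not give TAPE's actual formulas. Token reselection across layers is not modeled.

```python
def smooth_importance(importance, alpha=0.5):
    """importance[t][n]: score of token n in frame t.
    Blend each frame's scores with the mean of its temporal neighbours."""
    num_frames = len(importance)
    smoothed = []
    for t in range(num_frames):
        nbrs = [importance[u] for u in (t - 1, t + 1) if 0 <= u < num_frames]
        row = []
        for n, s in enumerate(importance[t]):
            nbr_mean = sum(f[n] for f in nbrs) / len(nbrs) if nbrs else s
            row.append((1 - alpha) * s + alpha * nbr_mean)
        smoothed.append(row)
    return smoothed

def keep_ratio(step, num_steps, low=0.5, high=1.0):
    """Linear budget: keep few tokens at early noisy steps, more later."""
    frac = step / max(num_steps - 1, 1)
    return low + (high - low) * frac

def select_tokens(scores, ratio):
    """Indices of the top-scoring tokens in one frame, in original order."""
    k = max(1, round(ratio * len(scores)))
    top = sorted(range(len(scores)), key=lambda n: scores[n], reverse=True)[:k]
    return sorted(top)

# The middle frame disagrees with its neighbours about tokens 0 and 1.
importance = [
    [0.9, 0.2, 0.6, 0.1],
    [0.5, 0.7, 0.6, 0.1],
    [0.9, 0.2, 0.6, 0.1],
]
ratio = keep_ratio(step=0, num_steps=10)
raw_kept = [select_tokens(frame, ratio) for frame in importance]
kept = [select_tokens(frame, ratio) for frame in smooth_importance(importance)]
```

Selecting on the raw scores makes the middle frame keep a different token set from its neighbours, which is the selection jitter the abstract mentions. After smoothing, all three frames keep the same tokens.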

2605.17659 2026-05-22 cs.LG

Bug or Feature²: Weight Drift, Activation Sparsity and Spikes

Egor Shvetsov, Aleksandr Serkov, Shokorov Viacheslav, Redko Dmitry, Vladislav Goloshchapov, Evgeny Burnaev

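Affiliations: GitHub

AI summary: This paper studies the negative weight drift that arises in modern neural architectures from the interaction between standard losses and positively biased activation functions, analyzes its effect on activation sparsity and model performance, and proposes clipping to resolve activation spikes.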

Abstract

The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENe) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above ~70% activation sparsity. While ReLU² achieves a good sparsity-accuracy ratio in GPT-nano, it pathologically amplifies identified activation spikes in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU² outperforms its unclipped version, and GELU² achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.
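
To make the activation terms concrete, here is a small sketch of ReLU², a clipped variant, and an activation-sparsity measure. The clipping form and the cap value are assumptions chosen for illustration (the abstract does not state where or at what value the authors clip), and the linked repository should be treated as the reference.

```python
def relu2(x):
    """Squared ReLU: max(x, 0)^2."""
    return max(x, 0.0) ** 2

def clipped_relu2(x, cap=6.0):
    """ReLU^2 with its output capped; cap=6.0 is a hypothetical value."""
    return min(relu2(x), cap)

def sparsity(activations):
    """Fraction of activations that are exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)

pre_activations = [-2.0, -0.5, 0.0, 1.0, 10.0]
plain = [relu2(x) for x in pre_activations]            # the spike at 10.0 becomes 100.0
clipped = [clipped_relu2(x) for x in pre_activations]  # the spike is capped at 6.0
```

Squaring leaves the zeros untouched, so sparsity is identical for both variants. Only the magnitude of the large positive input changes.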

2605.17602 2026-05-22 cs.AI cs.CV cs.LG

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

Kuei-Chun Kao, Daixuan Huo, Yuanhao Ban, Cho-Jui Hsieh

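Affiliations: University of California, Los Angeles

AI summary: This paper proposes AutoRubric-T2I, the first rubric learning framework for text-to-image generation, which automatically synthesizes and selects explicit rubrics to guide vision-language model (VLM) judges. The method synthesizes reasoning traces from preference pairs into candidate rubrics, then has a VLM judge score paired images under each rubric, producing pairwise rubric-score differences for preference learning. An ℓ1-regularized logistic regression refiner removes noisy and redundant rubrics, yielding high-quality, interpretable reward signals from a small amount of annotated preference data and outperforming existing reward model baselines on several image reward benchmarks.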

Comments 27 pages

Abstract

Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, making them costly to train, difficult to adapt, and opaque in their evaluation criteria. Meanwhile, Vision-Language Model (VLM) judges can provide more fine-grained assessments through textual rubrics, but their manually designed or heuristically generated scoring rules may fail to reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ an ℓ1-Regularized Logistic Regression Refiner, which selects the Top-N most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where it improves generation quality over scalar reward models using the Flow-GRPO pipeline on diffusion models.
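
The rubric-selection step can be sketched with a tiny ℓ1-regularized logistic regression over pairwise rubric-score differences. This is a generic proximal-gradient solver written for illustration, not the authors' refiner. The toy data, the hyperparameters, and the choice to rank rubrics by absolute weight are all assumptions.

```python
import math

def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    return math.copysign(max(abs(w) - t, 0.0), w)

def l1_logistic(diffs, labels, lam=0.05, lr=0.5, steps=200):
    """Proximal gradient descent for L1-regularized logistic regression.

    diffs[i][j]: score(image A) - score(image B) under rubric j for pair i.
    labels[i]: 1 if annotators preferred image A, else 0. No bias term.
    """
    n, d = len(diffs), len(diffs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(diffs, labels):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j in range(d):
                grad[j] += (p - y) * x[j] / n
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w

def top_n(weights, n):
    """Indices of up to n rubrics with the largest non-zero |weight|."""
    ranked = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    return [j for j in ranked[:n] if weights[j] != 0.0]

# Rubric 0 tracks the human preference; rubric 1 is uninformative noise.
diffs = [[2.0, 0.1], [1.5, -0.1], [-2.0, 0.1], [-1.5, -0.1]]
labels = [1, 1, 0, 0]
weights = l1_logistic(diffs, labels)
selected = top_n(weights, n=1)
```

The ℓ1 penalty drives the weight of the uninformative rubric to exactly zero, so only the discriminative rubric is selected.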

2605.17596 2026-05-22 cs.AI

NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents

Mujahid Sultan, Sri Thuraisamy, Daya Rajaratnam

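Affiliations: iVedha Corporation; MLSoft Inc.

AI summary: NeuSymMS uses a hybrid neuro-symbolic architecture to let LLM agents learn, remember, and reason about users across sessions. It couples neural fact extraction with a CLIPS-based expert system, and its main contribution is a dual-horizon memory model that supports self-curating memory.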

Comments 7 pages

Abstract

We present NeuSymMS, an adaptive memory system that enables large language model (LLM) agents to learn, remember, and reason about users across sessions via a hybrid neuro-symbolic architecture. NeuSymMS couples neural fact extraction from unstructured dialogue using LLMs with a CLIPS-based expert system that classifies, deduplicates, and reconciles facts under explicit lifecycle rules. The system represents knowledge as subject-relation-value triples stored in a relational database management system. It supports user/agent/agent-to-agent scoping and implements a dual-horizon (short-term and long-term) memory model. It leverages access-based promotion and time-based pruning of the memory on both horizons. NeuSymMS maintains continuity of memory while avoiding context-window bloat and cross-entity contamination. We argue that this architecture offers a practical path to trustworthy, auditable memory for production agentic systems and discuss its novelty relative to log retrieval, summarization, and key-value approaches.
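
The dual-horizon lifecycle described above can be sketched in a few lines. This is an in-memory toy under stated assumptions, not NeuSymMS itself: the real system uses a CLIPS rule engine and a relational database, and the `Memory` class, the promotion threshold, and the per-horizon time-to-live values here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    subject: str
    relation: str
    value: str
    horizon: str = "short"  # "short" or "long"
    hits: int = 0
    last_access: float = 0.0

class Memory:
    def __init__(self, promote_after=3, short_ttl=3600.0, long_ttl=2592000.0):
        self.facts = {}
        self.promote_after = promote_after  # hypothetical access threshold
        self.ttl = {"short": short_ttl, "long": long_ttl}  # seconds, hypothetical

    def upsert(self, subject, relation, value, now):
        # Deduplicate on (subject, relation); a newer value replaces the old one.
        key = (subject, relation)
        fact = self.facts.get(key)
        if fact is None:
            self.facts[key] = Fact(subject, relation, value, last_access=now)
        else:
            fact.value, fact.last_access = value, now

    def recall(self, subject, relation, now):
        fact = self.facts.get((subject, relation))
        if fact is None:
            return None
        fact.hits += 1
        fact.last_access = now
        if fact.horizon == "short" and fact.hits >= self.promote_after:
            fact.horizon = "long"  # access-based promotion
        return fact.value

    def prune(self, now):
        # Time-based pruning on both horizons, each with its own time-to-live.
        stale = [k for k, f in self.facts.items()
                 if now - f.last_access > self.ttl[f.horizon]]
        for key in stale:
            del self.facts[key]

mem = Memory()
mem.upsert("alice", "prefers", "tea", now=0.0)
mem.upsert("bob", "lives_in", "Oslo", now=0.0)
for t in (1.0, 2.0, 3.0):
    answer = mem.recall("alice", "prefers", now=t)
mem.prune(now=10000.0)
```

After three recalls the fact about alice is promoted to the long horizon and survives pruning, while the fact about bob, which was never accessed, expires from the short horizon.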

2605.16923 2026-05-22 cs.CV

Neuroscience-inspired Staged Representation Learning with Disentangled Coarse- and Fine-Grained Semantics for EEG Visual Decoding

Xiang Gao, Hui Tian, Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew

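Affiliations: School of Information and Communication Technology, Griffith University

AI summary: This paper proposes a neuroscience-inspired staged representation learning framework that improves EEG visual decoding through disentangled coarse- and fine-grained semantics, addressing existing methods' neglect of the staged and hierarchical nature of human visual processing.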

Comments 17 pages, 5 figures

Abstract

Decoding visual information from electroencephalography (EEG) signals remains a fundamental challenge in brain-computer interfaces and medical rehabilitation. Existing EEG visual decoding methods mainly focus on learning a single global EEG embedding for cross-modal alignment, but they largely overlook the staged and hierarchical characteristics of human visual processing. To address this limitation, we propose a neuroscience-inspired staged representation learning framework that reformulates EEG visual decoding as a stage-specific representation decomposition problem. The proposed framework organizes EEG representation learning into three complementary phases: low-level visual representation learning, high-level semantic representation learning, and integrative information fusion. To strengthen semantic modeling, we further introduce a multimodal dual-level semantic learning mechanism that separates coarse label-level semantics from fine image-level visual-semantic information. In addition, semantic latent channels are introduced as computational representation channels generated from observed visual EEG signals, expanding the channel-level semantic representation space for structured semantic abstraction and cross-modal alignment. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves superior performance under subject-dependent zero-shot evaluation and improved exact retrieval under subject-independent zero-shot evaluation. Additional analyses, including layer-wise retrieval, temporal accumulation, expanded multi-image retrieval, and ablation studies, further support the effectiveness of staged decomposition and structured semantic modeling. These results suggest that explicitly modeling staged perceptual, semantic, and integrative representations provides an effective neuroscience-inspired framework for EEG-based visual decoding.

2605.16579 2026-05-22 cs.CV cs.LG

Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

Kunyang Li, Mubarak Shah, Yuzhang Shang

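Affiliations: Institute of Artificial Intelligence, University of Central Florida

AI summary: This paper proposes ARL2, a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. It addresses the scalability bottleneck of autoregressive video diffusion models in long video generation, achieving linear time complexity and constant memory while improving temporal consistency.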

Abstract

Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to quadratic compute complexity in sequence length and memory usage due to key-value caching, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context. Our key insight is that softmax attention captures fine-grained local interactions, while a recurrent state provides controllable long-range memory. This design achieves linear-time scaling with constant memory while improving temporal consistency over the full-softmax model. To prevent noisy intermediate states from corrupting memory, we update the recurrent state only after the denoised pass. To avoid within-frame information asymmetry, all tokens share the same pre-update state rather than sequential updates. To the best of our knowledge, this is the first work to convert a pretrained AR video diffusion model into a hybrid linear attention architecture, through an efficient two-stage training scheme for AR video. With 75% of layers replaced by hybrid linear attention, the model achieves up to 2.26x wall-clock speedup and 54% memory reduction, while maintaining comparable quality with improved temporal consistency.
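
The inter-frame branch can be pictured as a matrix-valued state that every token of a frame reads before the frame writes to it. The sketch below is an illustration under stated assumptions, not the ARL2 module: it uses a fixed scalar gate instead of a learned one, plain Python lists instead of tensors, and omits the intra-frame softmax branch and the denoising loop.

```python
def outer_sum(keys, values):
    """Sum of outer products k_n v_n^T over the tokens of one frame."""
    d_k, d_v = len(keys[0]), len(values[0])
    out = [[0.0] * d_v for _ in range(d_k)]
    for k, v in zip(keys, values):
        for i in range(d_k):
            for j in range(d_v):
                out[i][j] += k[i] * v[j]
    return out

def read(state, queries):
    """Every token in the frame reads the same pre-update state: o = q^T S."""
    d_v = len(state[0])
    return [[sum(q[i] * state[i][j] for i in range(len(q))) for j in range(d_v)]
            for q in queries]

def update(state, keys, values, gate=0.9):
    """S <- gate * S + sum_n k_n v_n^T. The state size never grows."""
    delta = outer_sum(keys, values)
    return [[gate * s + d for s, d in zip(s_row, d_row)]
            for s_row, d_row in zip(state, delta)]

state = [[0.0, 0.0], [0.0, 0.0]]
frames = [
    {"q": [[1.0, 0.0]], "k": [[1.0, 0.0]], "v": [[2.0, 3.0]]},
    {"q": [[1.0, 0.0]], "k": [[0.0, 1.0]], "v": [[5.0, 7.0]]},
]
outputs = []
for frame in frames:
    outputs.append(read(state, frame["q"]))        # read before writing
    state = update(state, frame["k"], frame["v"])  # one write per finished frame
```

The state stays a 2x2 matrix no matter how many frames are processed, which is where the constant-memory property comes from. A key-value cache would instead grow with every frame.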

2605.16362 2026-05-22 cs.LG cs.AI

When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

John T. Robertson, Jianing Zhu, Haris Vikalo, Zhangyang Wang

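Affiliations: The University of Texas at Austin

AI summary: This paper studies why the effectiveness of rank-1 steering varies across concepts, identifies granularity and geometry as key factors in steering cost, and introduces the GRACE framework for optimizing steering efficiently.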

Comments Updated Abstract metadata

Abstract

Activation steering offers a lightweight way to control LLMs without retraining, but its effectiveness varies sharply across concepts. Prior work often reads this variability as evidence that many concepts are not captured by a single steering direction. We argue instead that much of it reflects search difficulty: a useful rank-1 intervention often exists, but finding it can be expensive. We formalize rank-1 steering as a budget-constrained optimization over intervention layer and coefficient. Across concepts and model families, prompt-boundary directional alignment predicts where effective interventions occur, enabling geometry-guided search that reaches high utility with substantially fewer evaluations, reducing the trials needed to recover 95% of best-found utility by 39.8% on average across three model families. To explain why some concepts remain expensive even under better search, we introduce concept granularity, a measure of directional heterogeneity across contrastive contexts. Granularity distinguishes concepts whose difference vectors share a stable global direction from those where prompts agree locally within each input but the utility-maximizing direction rotates systematically across inputs. Higher granularity is associated with slower convergence and lower best-found performance (Pearson r = 0.44 with trials-to-95%, r = -0.46 with best-found utility, both p < 0.001). We present GRACE, a Granularity- and Representation-Aware Concept Engineering framework that uses activation geometry to diagnose the dominant source of steering difficulty, select the appropriate remedy, and allocate optimization effort efficiently. Our results shift the frame from "when does rank-1 fail?" to "when is rank-1 cheap and stable?", turning activation geometry from a descriptive tool into an actionable prior for LLM control.
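
Two ideas from this abstract lend themselves to a small sketch: the steering intervention itself, and a measure of how much contrastive difference vectors disagree in direction. The heterogeneity proxy below (one minus the mean pairwise cosine) is an assumption made for illustration. The abstract does not give the paper's definition of granularity, and the function names are hypothetical.

```python
import math

def steer(hidden, direction, coeff):
    """Add coeff times the unit steering direction to a hidden state."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [h + coeff * d / norm for h, d in zip(hidden, direction)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def directional_heterogeneity(diff_vectors):
    """1 - mean pairwise cosine of contrastive difference vectors.
    0 means every context agrees on a single direction."""
    n = len(diff_vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(cosine(diff_vectors[i], diff_vectors[j]) for i, j in pairs)
    return 1.0 - total / len(pairs)

steered = steer([0.0, 0.0], direction=[3.0, 4.0], coeff=5.0)

aligned = [[1.0, 0.0], [2.0, 0.0], [0.5, 0.0]]    # one stable global direction
rotating = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # direction rotates across inputs
low = directional_heterogeneity(aligned)
high = directional_heterogeneity(rotating)
```

A concept like `aligned` is the cheap case: one direction serves every input. A concept like `rotating` has no single direction that fits all contexts, which matches the abstract's description of high-granularity concepts.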

2605.16258 2026-05-22 cs.CV cs.AI cs.RO

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun, Jie Zhou, Jiwen Lu

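Affiliations: Intelligent Vision Group, Tsinghua University

AI summary: This paper proposes IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images to build a neural scene representation. It supports continuous spatial queries at any 3D position to predict signed distance and color, and performs strongly across multiple tasks.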

Comments Code: https://github.com/wzzheng/IVGT/

Abstract

Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at any 3D position, retrieving local features to predict signed distance (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT demonstrates generalization across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.

2605.15588 2026-05-22 cs.CL cs.LG

Calibrating LLMs with Semantic-level Reward

Fengfei Yu, Ruijia Niu, Dongxia Wu, Yian Ma, Rose Yu

Affiliations: Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, USA; Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, California, USA; Department of Statistics, Stanford University, Stanford, California, USA

AI summary: This paper proposes CSR, a calibration framework that calibrates language models directly in semantic space, avoiding the inconsistency of verbalized confidence in conventional methods. Experiments show that CSR lowers ECE and raises AUROC across multiple datasets.

Abstract

As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to 40% and improving AUROC by up to 31%, with calibration behavior generalizing robustly across all four evaluation settings.
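
ECE, the calibration metric reported above, can be computed with the standard equal-width binning recipe. The sketch below shows that generic metric only. It says nothing about how CSR derives its confidences, and the choice of 10 bins is a common default rather than a value taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: values in [0, 1]; correct: 1 if the answer was right, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece

# Overconfident model: 95% confidence on every answer, half of them wrong.
overconfident = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0])
# Perfectly calibrated toy case: certain when right, zero confidence when wrong.
calibrated = expected_calibration_error([1.0, 0.0], [1, 0])
```

The overconfident case illustrates the failure mode the abstract attributes to binary correctness rewards: accuracy is 50% while stated confidence is 95%, giving an ECE of 0.45.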

2605.15505 2026-05-22 cs.AI cs.IR cs.LG

X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Digital Human Attention

X-SYNTH:超越检索 -- 基于观测到的数字人类注意力的企业上下文合成

Guruprasad Raghavan, George Nychis, Rohan Narayana Murthy

发表机构 * Workfabric AI

AI总结 本文提出X-SYNTH框架,通过分析数字人类注意力行为模式,解决企业上下文合成问题,其核心方法是基于行为模式的上下文合成,而非传统检索,从而显著提升有效线索率并降低误报率。

Comments 11 pages, 7 figures, 5 tables

详情
AI中文摘要

在企业运营中,AI代理任务所需的上下文分散在记录系统、静态信息存储和通信渠道中。被存储下来的是系统状态,它只是实际工作过程的一种有损表示。主流方法通过将请求内容与已存储的信息进行匹配来检索;对于范围狭窄的请求,这种方法效果良好。但合成质量取决于知道应该呈现什么以及如何解读它:这类知识为每个组织、团队和个人所特有,存在于行为模式之中,而不在任何检索索引里。对于向销售人员推荐具有企业价值的线索这一代理任务,该方法会失效:真实线索率(True Lead Rate)低,虚假线索率(False Lead Rate)高,且模型没有改进机制。我们提出X-SYNTH,一个以数字人类注意力为基础的企业上下文合成框架;数字人类注意力是每位员工可被数字化观测的交互特征,编码了他们做了什么、按什么顺序做,以及隐含的奖励信号。无需外部标注,即可将带来积极结果的行为轨迹与未带来积极结果的行为轨迹区分开来。X-SYNTH将每个个体的行为基线建模为数字孪生签名(Digital Twin Signature,DTS),并针对每个个体和每个查询在七种注意力过滤器(比例、反比、微分、循环、比较、顺序和集体)中进行选择,以识别因果相关的活动特征。一个四阶段流水线基于行为模式而非查询嵌入来组装经过排序的上下文。一个前沿模型在无辅助的情况下取得9.5%的真实线索率(TLR)和90.5%的虚假线索率(FLR)。加入X-SYNTH后,TLR上升到61.9%(6.5倍),FLR下降到18.8%。企业上下文合成不是检索问题,而是相关性问题,而数字人类注意力是其最可靠的基准真值。

英文摘要

In enterprise operations, the context required for an AI agent task is scattered across systems of record, static information stores, and communication channels. What is stored is system state, a lossy representation of the work that actually happened. The prevailing approach retrieves by matching request content to what is stored; for narrow requests this works well. But synthesis quality depends on knowing what to surface and how to interpret it: knowledge specific to each organization, team, and individual, present in behavioral patterns, absent from any retrieval index. For the agentic task of proposing enterprise-valuable leads to sellers, this approach breaks down: True Lead Rate is low, False Lead Rate is high, and the model has no mechanism to improve. We present X-SYNTH, a framework for enterprise context synthesis grounded in digital human attention, the digitally observable interaction signatures of each worker, encoding what they did, the sequence in which they did it, and implicit reward signals. Behavioral traces preceding positive outcomes are distinguishable from those that did not, without external labeling. X-SYNTH models each individual's behavioral baseline as a Digital Twin Signature (DTS) and selects among seven attention filters, Proportional, Inverse, Differential, Recurrent, Comparative, Sequential, and Collective, per individual and per query, to identify causally relevant activity signatures. A four-stage pipeline assembles ranked context grounded in behavioral patterns rather than query embeddings. A frontier model unaided achieves 9.5% True Lead Rate (TLR) with 90.5% False Lead Rate (FLR). Augmented with X-SYNTH, TLR rises to 61.9% (6.5x) while FLR falls to 18.8%. Enterprise context synthesis is not a retrieval problem. It is a relevance problem, and digital human attention is its most reliable ground truth.

2605.14598 2026-05-22 cs.RO

DSSP: Diffusion State Space Policy with Full-History Encoding

DSSP:具有完整历史编码的扩散状态空间策略

Zhiyuan Guan, Jianshu Hu, Han Fang, Yunpeng Jiang, Yize Huang, Shujia Li, Xiao Li, Yutong Ban

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出DSSP,一种具有完整历史编码的扩散状态空间策略,利用状态空间模型将整个观测流压缩为紧凑的上下文表示,以解决长时程机器人操作任务中依赖历史的歧义,并以显著更小的模型规模取得最先进的性能。

详情
AI中文摘要

基于扩散的模仿学习在机器人操作中显示出强大的前景。然而,大多数现有策略仅依赖于当前观察或最近的短窗口观察,限制了它们在长周期任务中解决历史依赖性模糊性的能力。为此,我们引入DSSP,一种具有完整历史编码的扩散状态空间策略,能够为机器人操作提供高效的完整历史条件。利用状态空间模型(SSMs)的连续序列建模特性,我们的历史编码器有效地将整个观察流压缩成一个紧凑的上下文表示。为了确保此上下文保留有关未来状态演化的关键信息,编码器通过动态感知的辅助训练目标进行优化。此高层上下文表示随后与近期状态观察无缝融合,形成一个分层的条件机制用于动作生成。此外,为了保持架构一致性并减少GPU内存开销,我们还用SSM实例化扩散骨干网络。在模拟基准和真实世界操作任务中的广泛实验表明,DSSP在显著更小的模型规模下实现了最先进的性能,展示了分层条件在历史长度增加时捕获关键信息的优越效率。

英文摘要

Diffusion-based imitation learning has shown strong promise for robot manipulation. However, most existing policies condition only on the current observation or a short window of recent observations, limiting their ability to resolve history-dependent ambiguities in long-horizon tasks. To address this, we introduce DSSP, a history-conditioned Diffusion State Space Policy that enables efficient, full-history conditioning for robot manipulation. Leveraging the continuous sequence modeling properties of State Space Models (SSMs), our history encoder effectively compresses the entire observation stream into a compact context representation. To ensure this context preserves critical information regarding future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective. This high-level context representation is then seamlessly fused with recent state observations to form a hierarchical conditioning mechanism for action generation. Furthermore, to maintain architectural consistency and minimize GPU memory overhead, we also instantiate the diffusion backbone itself using an SSM. Extensive experiments across simulation benchmarks and real-world manipulation tasks show that DSSP achieves state-of-the-art performance with a significantly smaller model size, demonstrating superior efficiency of the hierarchical conditioning in capturing crucial information as the history length increases.

2605.14322 2026-05-22 cs.AI

Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows

代理是否准备好教学?一个多阶段基准用于现实世界教学工作流程

Zixin Chen, Peng Liu, Rui Sheng, Haobo Li, Jianhong Tu, Xiaodong Deng, Kashun Shum, Dayiheng Liu, Huamin Qu

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Qwen Team, Alibaba Group(阿里集团通义实验室)

AI总结 本文提出EduAgentBench基准,用于评估教学代理的全面能力,发现当前模型在教学任务中的表现有限,但仍为开发未来教学代理提供了测量基础。

Comments Under review

详情
AI中文摘要

语言代理越来越多地部署在复杂的专业工作流程中,辅导能力作为高风险功能,目前在现有基准中仍未得到充分衡量。有效的辅导代理需要超越产生正确答案或执行准确工具调用:一个稳健的辅导代理必须诊断学习者状态、随时间适应支持、做出基于教育证据的决策,并在现实的学习管理系统中执行干预。我们引入EduAgentBench,一个源驱动的基准,用于全面评估辅导代理在教学工作范围内的能力。它包含150个经过质量控制的任务,涵盖三个能力表面:专业教学判断、情境多轮辅导和Canvas式教学工作流程完成。任务通过教学洞察驱动的流程构建,并通过互补的验证信号和人工审查进行评估。在对前沿模型的全面评估中,我们的发现表明,当前模型在有限的教学判断方面表现良好,但在情境辅导和自主教学工作流程执行方面仍无法达到专业教学标准。据我们所知,EduAgentBench是第一个理论基础和现实的基准,用于评估辅导代理的全面教学能力,为开发未来能够支持现实教学工作的辅导代理提供了测量基础。

英文摘要

Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review. Across a comprehensive evaluation of frontier models, our findings reveal that current models are generally capable of bounded pedagogical judgment, but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents, providing a measurement foundation for developing future tutor agents that can support realistic teaching work.

2605.12623 2026-05-22 cs.CL cs.CV cs.LG

DocAtlas: Multilingual Document Understanding Across 80+ Languages

DocAtlas: 跨80多种语言的多语言文档理解

Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学) IBM Research(IBM研究院)

AI总结 本文提出DocAtlas框架,通过构建高保真的OCR数据集和基准测试,覆盖82种语言和9个评估任务,利用双重管道生成精确的结构注解,展示了直接偏好优化在多语言适应中的有效性,提升了领域内和领域外的准确率。

Comments Under submission

详情
AI中文摘要

多语言文档理解在低资源语言中受限于稀缺的训练数据和基于模型的标注流程,这些流程会加剧现有偏见。我们引入DocAtlas,一个构建覆盖82种语言和9个评估任务的高保真OCR数据集和基准测试的框架。我们的双重管道,包括本地DOCX文档的差异渲染和针对从右到左脚本的合成LaTeX生成,生成统一的DocTag格式注解,编码布局、文本和组件类型,无需学习模型进行核心注解。评估16种最先进的模型揭示了低资源脚本中的持续差距。我们展示直接偏好优化(DPO)使用渲染派生的真实情况作为正信号,实现了稳定的多语言适应,提高了领域内(+1.9%)和领域外(+1.8%)的准确性,而监督微调会导致领域外性能下降高达21%。我们的最佳变体,DocAtlas-DeepSeek,在最强基线基础上提高了+1.7%。代码可在https://github.com/ahmedheakl/DocAtlas获取。

英文摘要

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline. Code is available at https://github.com/ahmedheakl/DocAtlas.

2605.10067 2026-05-22 cs.LG cs.AI

Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization

Metis: 通过自进化元认知策略优化学习越狱LLM

Huilin Zhou, Jian Zhao, Yilu Zhong, Zhen Liang, Xiuyuan Chen, Yuchen Yuan, Tianle Zhang, Chi Zhang, Lan Zhang, Xuelong Li

发表机构 * University of Science and Technology of China(中国科学技术大学) Institute of Artificial Intelligence (TeleAI), China Telecom(人工智能研究院(TeleAI),中国电信)

AI总结 本文提出Metis框架,通过将jailbreaking重新表述为对抗性部分可观测马尔可夫决策过程中的推理时间策略优化,以提高对抗性测试的效率和效果,同时通过结构化反馈和透明推理轨迹提升可解释性,实验表明Metis在多种模型上均表现出更高的攻击成功率和更低的token成本。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

红队测试对于揭示大型语言模型(LLMs)中的漏洞至关重要。尽管自动化方法已提高可扩展性,但现有方法往往依赖静态启发式或随机搜索,使其在面对高级安全对齐时显得脆弱。为了解决这一问题,我们引入了Metis框架,该框架将jailbreaking重新表述为对抗性部分可观测马尔可夫决策过程(POMDP)中的推理时间策略优化。Metis采用自进化元认知循环来执行目标防御逻辑的因果诊断,并利用结构化反馈作为语义梯度来优化其策略,通过透明推理轨迹提高可解释性。在10种不同模型上的广泛评估表明,Metis在比较方法中实现了最强的平均攻击成功率(ASR)为89.2%,在坚韧的前沿模型(如O1和GPT-5-chat)上保持高效果,而传统基线方法表现出显著的性能下降。通过用定向优化替代冗余探索,Metis将token成本平均降低了8.2倍,最高可达11.4倍。我们的分析表明,当前防御在测试设置下仍易受内部引导的闭环推理轨迹影响,突显了下一代防御机制在推理过程中动态处理安全性的关键需求。

英文摘要

Red teaming is critical for uncovering vulnerabilities in Large Language Models (LLMs). While automated methods have improved scalability, existing approaches often rely on static heuristics or stochastic search, rendering them brittle against advanced safety alignment. To address this, we introduce Metis, a framework that reformulates jailbreaking as inference-time policy optimization within an adversarial Partially Observable Markov Decision Process (POMDP). Metis employs a self-evolving metacognitive loop to perform causal diagnosis of a target's defense logic and leverages structured feedback as a semantic gradient to refine its policy, offering enhanced interpretability through transparent reasoning traces. Extensive evaluations across 10 diverse models demonstrate that Metis achieves the strongest average Attack Success Rate (ASR) among compared methods at 89.2%, maintaining high efficacy on resilient frontier models (e.g., 76.0% on O1 and 78.0% on GPT-5-chat) where traditional baselines exhibit substantial performance degradation. By replacing redundant exploration with directed optimization, Metis reduces token costs by an average of 8.2x and up to 11.4x. Our analysis reveals that current defenses remain vulnerable to internally-steered, closed-loop reasoning trajectories under the tested settings, highlighting a critical need for next-generation defenses capable of reasoning about safety dynamically during inference.

2605.09273 2026-05-22 cs.LG

Instance-Adaptive Online Multicalibration

实例自适应在线多校准

Zhiming Huang, Jamie Morgenstern, Aaron Roth, Claire Jie Zhang

发表机构 * Paul G. Allen School of Computer Science and Engineering, University of Washington(华盛顿大学保罗·G·阿伦计算机科学与工程学院) Department of Computer and Information Sciences, University of Pennsylvania(宾夕法尼亚大学计算机与信息科学系)

AI总结 本文提出了一种高效的实例自适应在线多校准算法,通过自适应细化预测值的二进(dyadic)网格,在良性序列与最坏情况序列之间动态插值,其误差由细化树的叶子数量控制,且该依赖关系在对数因子范围内是紧的。

Comments We tightened the analysis and added a comparison to the concurrent work of Liu et al. (arXiv:2605.11490)

详情
AI中文摘要

我们研究超越最坏情况的在线多校准。我们给出一个单一、高效的算法,通过自适应细化预测值的二进(dyadic)网格,在良性序列与最坏情况序列之间动态插值。其误差由细化树中的叶子数量控制。我们的分析恢复了在线多校准已知的最坏情况最优速率$\widetilde O(T^{2/3})$,同时自动适应更容易的实例:在边际随机设定下,速率为$\widetilde O(\sqrt T)$;对于具有$J$段的分段平稳均值,速率为$\widetilde O(\sqrt{JT})$。更一般地,速率取决于可预测均值过程相对于群组族的阈值复杂度度量。我们证明这种依赖关系在对数因子范围内是紧的。

英文摘要

We study online multicalibration beyond the worst-case. We give a single, efficient algorithm which dynamically interpolates between benign and worst-case sequences by adaptively refining a dyadic grid of prediction values. Its error is controlled by the number of leaves in the refinement tree. Our analysis recovers the known $\widetilde O(T^{2/3})$ worst-case-optimal rate for online multicalibration, while simultaneously automatically adapting to easier instances: in the marginal stochastic setting it obtains a rate of $\widetilde O(\sqrt T)$, and for piecewise-stationary means with $J$ segments its rate is $\widetilde O(\sqrt{JT})$. More generally, the rate depends on a threshold-complexity measure of the predictable mean process relative to the group family. We show that this dependence is tight up to logarithmic factors.
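
To see when the piecewise-stationary rate actually improves on the worst-case rate, the two bounds quoted in the abstract can be compared directly. This is a back-of-the-envelope comparison that ignores constants and logarithmic factors, which is a simplification made here and not a statement from the paper.

```latex
\[
\sqrt{JT} \le T^{2/3}
\;\Longleftrightarrow\;
JT \le T^{4/3}
\;\Longleftrightarrow\;
J \le T^{1/3}.
\]
% Example: T = 10^6 gives T^{2/3} = 10^4 and \sqrt{T} = 10^3,
% so the piecewise-stationary bound \sqrt{JT} is no larger than
% the worst-case bound whenever J \le T^{1/3} = 100.
```

Under this simplification, the adaptive bound helps as long as the number of stationary segments grows no faster than the cube root of the horizon.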

2605.09252 2026-05-22 cs.CL

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, Tsui-Wei Weng

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 本文提出When2Tool基准,通过18个环境研究工具调用的必要性,发现模型已能识别何时需要调用工具,但生成时未能有效利用此知识,提出Probe&Prefill方法显著减少工具调用。

详情
AI中文摘要

Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5$\times$ higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool

英文摘要

Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5$\times$ higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
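
The probe-then-prefill idea described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the "hidden states" are synthetic 2-D vectors, the probe is a plain logistic regression trained by gradient descent, and the two steering sentences, the threshold and all function names are hypothetical. The released code at the URL above is the authoritative version.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic regression (a linear probe) with batch gradient descent."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(w, b, x):
    """Probability, according to the probe, that a tool call is needed."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical steering sentences; the abstract does not give the actual wording.
PREFILL_TOOL = "I need a tool for this."
PREFILL_DIRECT = "I can answer this directly."


def choose_prefill(w, b, hidden_state, threshold=0.5):
    """Pick the sentence that would be prefilled into the model's response."""
    if probe_score(w, b, hidden_state) >= threshold:
        return PREFILL_TOOL
    return PREFILL_DIRECT


# Synthetic stand-in for pre-generation hidden states (real ones come from the LLM).
rng = random.Random(0)
features, labels = [], []
for _ in range(200):
    y = rng.randint(0, 1)
    center = 2.0 if y == 1 else -2.0
    features.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
    labels.append(y)

w, b = train_linear_probe(features, labels)
accuracy = sum(
    (probe_score(w, b, x) >= 0.5) == (y == 1) for x, y in zip(features, labels)
) / len(features)
print(f"probe training accuracy: {accuracy:.2f}")
print(choose_prefill(w, b, [3.0, 3.0]))
```

The design point is that the decision is read from the representation before generation starts and then imposed through the prefilled sentence, so the model is not asked to reason about tool necessity in text.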

2605.07287 2026-05-22 cs.CV

SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

SplatWeaver: 学习分配高斯原语以实现可泛化的新型视角合成

Yecong Wan, Fan Li, Mingwen Shao, Wangmeng Zuo

发表机构 * Faculty of Computing, Harbin Institute of Technology(哈尔滨工业大学计算机学院) Zhengzhou Advanced Research Institute of Harbin Institute of Technology(哈尔滨工业大学郑州先进研究院) Huawei Noah’s Ark Lab(华为诺亚实验室) Artificial Intelligence Research Institute, Shenzhen University of Advanced Technology(深圳先进技术大学人工智能研究院)

AI总结 本文提出SplatWeaver框架,通过动态分配高斯原语实现可泛化的新型视角合成,解决传统方法中固定分配导致的资源浪费和表达不足问题。

Comments Project Page: https://yecongwan.github.io/SplatWeaver/

详情
AI中文摘要

可泛化的新型视角合成旨在从未经校准的输入图像中渲染未见过的视角,而无需每个场景的优化。最近基于3D高斯点划的前馈方法在效率和渲染质量上取得了显著进展。然而,大多数方法将固定数量的高斯分布分配给每个像素或体素,忽略了现实场景中空间变化的复杂性。这种均匀分配通常在平滑区域浪费高斯原语,而在细结构、复杂几何和高频细节方面提供不足的容量。这促使我们预测区域依赖的原语数量,而不是在所有地方施加固定原语预算,从而实现更具表达力的3D场景表示。因此,我们提出SplatWeaver,一个能够动态分配高斯原语的可泛化新型视角合成框架。具体而言,SplatWeaver引入了基数高斯专家和像素级路由方案,其中每个专家专门生成从0到M的特定数量的原语,路由方案协调这些专家以适应性地确定每个空间位置应分配多少高斯原语。此外,SplatWeaver结合了高频先验和相关的指导模块和路由正则化,以稳定专家选择并促进复杂度感知的分配。通过利用高频线索,路由过程被鼓励将更多的高斯原语分配给细结构和纹理区域,同时抑制平滑区域的冗余。在多样化的场景中进行的广泛实验表明,SplatWeaver在大多数情况下都优于最先进的方法,能够以更少的高斯原语生成更逼真的新型视角渲染。项目页面:https://yecongwan.github.io/SplatWeaver/

英文摘要

Generalizable novel view synthesis aims to render unseen views from uncalibrated input images without requiring per-scene optimization. Recent feed-forward approaches based on 3D Gaussian Splatting have achieved promising efficiency and rendering quality. However, most of them assign a fixed number of Gaussians to each pixel or voxel, ignoring the spatially varying complexity of real-world scenes. Such uniform allocation often wastes Gaussian primitives in smooth regions while providing insufficient capacity for fine structures, complex geometry, and high-frequency details. This motivates us to predict region-dependent primitive cardinalities rather than impose a fixed primitive budget everywhere, enabling a more expressive 3D scene representation. Therefore, we propose SplatWeaver, a generalizable novel view synthesis framework that is able to dynamically allocate Gaussian primitives over different regions in a feed-forward manner. Specifically, SplatWeaver introduces cardinality Gaussian experts and a pixel-level routing scheme, wherein each expert specializes in producing a specific number of primitives from 0 to M, and the routing scheme coordinates these experts to adaptively determine how many Gaussian primitives should be allocated to each spatial location. Moreover, SplatWeaver incorporates a high-frequency prior with attendant guidance module and routing regularization to stabilize expert selection and promote complexity-aware allocation. By leveraging high-frequency cues, the routing process is encouraged to assign more Gaussian primitives to fine structures and textured regions, while suppressing redundancy in smooth areas. Extensive experiments across diverse scenarios show that SplatWeaver consistently outperforms state-of-the-art methods, delivering more faithful novel-view renderings with fewer Gaussian primitives. Project Page: https://yecongwan.github.io/SplatWeaver/

2605.05765 2026-05-22 cs.CV

X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

X-OmniClaw 技术报告:一种统一的移动代理用于多模态理解和交互

Xiaoming Ren, Ru Zhen, Chao Li, Yang Song, Qiuxia Hou, Yanhao Zhang, Peng Liu, Qi Qi, Quanlong Zheng, Qi Wu, Zhenyi Liao, Binqiang Pan, Haobo Ji, Haonan Lu

发表机构 * OPPO AI Center(OPPO人工智能中心)

AI总结 本文提出X-OmniClaw,一种统一的移动代理,用于Android生态系统中的多模态理解和交互,通过统一的感知、记忆和行动架构,提升复杂移动任务的上下文感知能力,展示了其在多模态交互中的高效性和可靠性。

Comments 12 pages, 7 figures

详情
AI中文摘要

受OpenClaw发展启发,随着对能够处理复杂和直观交互的移动个人代理需求增加,本文介绍了X-OmniClaw,一种专为Android生态系统设计的统一移动代理,用于多模态理解和交互。该统一架构的感知、记忆和行动模块使代理能够通过高上下文感知处理复杂移动任务。具体而言,Omni Perception提供了一个统一的多模态输入管道,整合UI状态、现实世界视觉上下文和语音输入,利用时间对齐模块将原始数据分解为结构化的多模态意图表示。Omni Memory利用多模态记忆优化来增强个性化智能,通过整合运行时工作记忆与从本地数据中提取的长期个人记忆,实现高度上下文感知和个性化的交互。最后,Omni Action采用混合接地策略,结合结构性XML元数据与视觉感知以实现稳健的交互。通过行为克隆和轨迹回放,系统捕获用户导航作为可重用的技能,实现精确的直接访问执行。在多样化的场景中展示表明,X-OmniClaw有效提高了交互效率和任务可靠性,为下一代移动原生个人助手提供了实用的架构蓝图。

英文摘要

Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.

2605.01466 2026-05-22 cs.CV cs.LG

SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion

SplAttN: 通过高斯软溅射和注意力在2D和3D之间架桥以实现点云补全

Zhaoyang Li, Zhichao You, Tianrui Li

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China(西南交通大学计算与人工智能学院,中国成都)

AI总结 本文提出SplAttN方法,通过高斯软溅射和注意力机制解决点云补全中2D和3D模态连接问题,改进了传统硬投影导致的跨模态熵塌陷问题,实现了更有效的跨模态连接学习。

Comments Accepted as a Spotlight paper at ICML 2026; camera-ready version

详情
AI中文摘要

尽管多模态学习在点云补全方面取得了进展,但理论机制仍不明确。最近的研究将成功归因于模态间的联系,但我们发现标准硬投影破坏了这种联系:将稀疏点云投影到图像平面会产生极稀疏的支持,阻碍视觉先验传播,这种失败模式我们称为跨模态熵塌陷。为解决这一实际限制,我们提出了SplAttN,用可微高斯溅射替代硬投影,生成密集的连续图像平面表示。通过将投影重新公式化为连续密度估计,SplAttN避免了塌陷的稀疏支持,促进了梯度流动,并提高了跨模态连接的学习能力。广泛的实验表明,SplAttN在PCN和ShapeNet-55/34上实现了最先进的性能。关键的是,我们利用现实世界的KITTI基准作为多模态依赖的应力测试。反事实评估显示,尽管基线退化为对视觉移除不敏感的单模态模板检索器,SplAttN仍能保持对视觉线索的稳健依赖,验证了我们的方法建立了有效的跨模态连接。代码可在https://github.com/zay002/SplAttN获取。

英文摘要

Although multi-modal learning has advanced point cloud completion, the theoretical mechanisms remain unclear. Recent works attribute success to the connection between modalities, yet we identify that standard hard projection severs this connection: projecting a sparse point cloud onto the image plane yields an extremely sparse support, which hinders visual prior propagation, a failure mode we term Cross-Modal Entropy Collapse. To address this practical limitation, we propose SplAttN, which replaces hard projection with Differentiable Gaussian Splatting to produce a dense, continuous image-plane representation. By reformulating projection as continuous density estimation, SplAttN avoids collapsed sparse support, facilitates gradient flow, and improves cross-modal connection learnability. Extensive experiments show that SplAttN achieves state-of-the-art performance on PCN and ShapeNet-55/34. Crucially, we utilize the real-world KITTI benchmark as a stress test for multi-modal reliance. Counter-factual evaluation reveals that while baselines degenerate into unimodal template retrievers insensitive to visual removal, SplAttN maintains a robust dependency on visual cues, validating that our method establishes an effective cross-modal connection. Code is available at https://github.com/zay002/SplAttN.
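
The contrast between hard projection and Gaussian soft splatting can be illustrated on a toy grid. The sketch below is an assumption-laden illustration, not the SplAttN implementation: points are taken as already projected to normalized 2D coordinates, the kernel is isotropic, and the grid size, `sigma` and the support threshold are arbitrary choices. It is plain Python and computes no gradients; it only shows how much denser the image-plane support becomes.

```python
import math


def hard_project(points, size):
    """Write each 2D point (coordinates in [0, 1)) into its single containing pixel."""
    grid = [[0.0] * size for _ in range(size)]
    for x, y in points:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] = 1.0
    return grid


def gaussian_splat(points, size, sigma=1.5):
    """Spread each point over nearby pixels with an isotropic Gaussian (sigma in pixels)."""
    grid = [[0.0] * size for _ in range(size)]
    radius = int(math.ceil(3 * sigma))
    for x, y in points:
        cx, cy = x * size, y * size  # continuous pixel coordinates
        for row in range(max(0, int(cy) - radius), min(size, int(cy) + radius + 1)):
            for col in range(max(0, int(cx) - radius), min(size, int(cx) + radius + 1)):
                # Squared distance from the point to the pixel centre.
                d2 = (col + 0.5 - cx) ** 2 + (row + 0.5 - cy) ** 2
                grid[row][col] += math.exp(-d2 / (2 * sigma ** 2))
    return grid


def support(grid, eps=1e-3):
    """Number of pixels whose value exceeds eps."""
    return sum(1 for row in grid for v in row if v > eps)


points = [(0.2, 0.3), (0.5, 0.5), (0.8, 0.7)]  # toy projected points
hard = hard_project(points, size=32)
soft = gaussian_splat(points, size=32)
print("hard support:", support(hard))
print("soft support:", support(soft))
```

Because each pixel value in the splatted grid varies smoothly with the point coordinates, a version written in an autodiff framework could pass gradients back to the points, which hard pixel assignment cannot do.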

2605.00414 2026-05-22 cs.LG cond-mat.stat-mech cs.AI

Trees to Flows and Back: Unifying Decision Trees and Diffusion Models

从树到流,再从流到树:统一决策树和扩散模型

Sai Niranjan Ramachandran, Suvrit Sra

发表机构 * School of Computation, Information and Technology, Technical University of Munich, Germany(慕尼黑工业大学计算、信息与技术学院,德国) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心(MCML))

AI总结 本文通过建立层次决策树与扩散过程之间的数学对应关系,统一了决策树和扩散模型,揭示了共同的优化原则'全局轨迹得分匹配',并提出了两种实用应用:treeflow在表格数据生成中表现优异,且计算速度更快;dsmtree将层次决策逻辑转移到神经网络中,在多个基准上与教师模型表现相近。

Comments 12 pages (main), 68 pages (inclusive of appendix), Accepted in the Forty-Third International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

决策树和扩散模型表面上是截然不同的模型类别,前者是离散和层次化的,后者是连续和动态的。本文通过在适当的极限情形下建立层次决策树与扩散过程之间清晰的数学对应关系,将两者统一起来。我们的统一揭示了一个共同的优化原则:全局轨迹得分匹配(GTSM),梯度提升(其理想化版本)对该原则是渐近最优的。我们通过两个关键的实用实例来强调本工作的概念价值:treeflow在表格数据上取得了具有竞争力的生成质量,保真度更高,并带来2倍的计算加速;dsmtree是一种新的蒸馏方法,将层次决策逻辑迁移到神经网络中,在许多基准上与教师模型的性能差距在2%以内。

英文摘要

Decision trees and diffusion models are ostensibly disparate model classes, one discrete and hierarchical, the other continuous and dynamic. This work unifies the two by establishing a crisp mathematical correspondence between hierarchical decision trees and diffusion processes in appropriate limiting regimes. Our unification reveals a shared optimization principle: Global Trajectory Score Matching (GTSM), for which gradient boosting (in an idealized version) is asymptotically optimal. We underscore the conceptual value of our work through two key practical instantiations: treeflow, which achieves competitive generation quality on tabular data with higher fidelity and a $2\times$ computational speedup, and dsmtree, a novel distillation method that transfers hierarchical decision logic into neural networks, matching teacher performance within 2% on many benchmarks.

2605.00185 2026-05-22 cs.LG cs.AI

Fair Dataset Distillation via Cross-Group Barycenter Alignment

通过跨组重心对齐实现公平的数据集蒸馏

Mohammad Hossein Moslemi, Nima Hosseini Dashtbayaz, Zhimin Mei, Bissan Ghaddar, Boyu Wang

发表机构 * Western University(西安大略大学) Vector Institute(向量研究所) IE University(IE大学) Ivey Business School(Ivey商学院)

AI总结 本文研究了数据集蒸馏中因不同群体预测模式差异导致的公平性问题,提出通过跨组重心对齐方法来减少群体间的预测偏差,从而提升模型的公平性。

Comments Accepted by ICML 2026

详情
AI中文摘要

数据集蒸馏旨在将大规模数据集压缩成小规模合成数据集,同时保持预测性能。我们发现,由于不同人口群体表现出不同的预测模式,蒸馏过程在保持所有子群体信息信号方面面临困难,无论群体大小是轻微还是严重不平衡。因此,训练在蒸馏数据上的模型可能会在某些子群体上出现显著性能下降,导致公平性差距。关键的是,这些差距不会仅仅通过纠正群体不平衡来消失,因为它们源于子群体预测模式的根本不匹配,而不是样本数量差异本身。因此,我们正式分析了这两种偏差源之间的相互作用,并将解决方案定义为识别一个不考虑群体不平衡的预测信息重心,该重心在所有子群体中诱导出相似的表示。通过向这个共享的聚合表示进行蒸馏,我们证明可以减少群体公平性方面的担忧。我们的方法与现有蒸馏方法兼容,并且实验证明,它显著减少了数据集蒸馏引入的偏差。代码可在https://github.com/mhmoslemi/COBRA上获得。

英文摘要

Dataset Distillation aims to compress a large dataset into a small synthetic one while maintaining predictive performance. We show that as different demographic groups exhibit distinct predictive patterns, the distillation process struggles to simultaneously preserve informative signals for all subgroups, regardless of whether group sizes are mildly or severely imbalanced. Consequently, models trained on distilled data can experience substantial performance drops for certain subgroups, leading to fairness gaps. Crucially, these gaps do not disappear by merely correcting group imbalance, since they stem from fundamental mismatches in subgroup predictive patterns rather than from sample-size disparities alone. We therefore formally analyze the interaction between these two sources of bias and cast the solution as identifying a group-imbalance-agnostic barycenter of the predictive information that induces similar representations across all subgroups. By distilling toward this shared aggregate representation, we show that group fairness concerns can be reduced. Our approach is compatible with existing distillation methods, and empirical results show that it substantially reduces bias introduced by dataset distillation. Code is available at https://github.com/mhmoslemi/COBRA.

2604.24762 2026-05-22 cs.CV

OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer

OmniShotCut: 基于shot查询Transformer的整体关系式shot边界检测

Boyang Wang, Guangyi Xu, Jiahui Zhang, Zhipeng Tang, Zezhou Cheng

发表机构 * University of Virginia(弗吉尼亚大学) University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校)

AI总结 本文提出OmniShotCut,利用基于shot查询的密集视频Transformer,将shot边界检测建模为结构化关系预测,联合估计shot范围以及shot内和shot间关系,以解决现有方法在转场处边界不可解释、遗漏细微但有害的不连续,以及依赖噪声大、多样性低的标注和过时基准的问题。

详情
AI中文摘要

Shot Boundary Detection (SBD)旨在自动识别shot变化并将视频划分为连贯的shot。尽管SBD在文献中已被广泛研究,现有方法往往在转场处产生不可解释的边界,遗漏细微但有害的不连续,并依赖噪声大、多样性低的标注和过时的基准。为缓解这些局限,我们提出OmniShotCut,利用基于shot查询的密集视频Transformer,将SBD建模为结构化关系预测,联合估计shot范围以及shot内关系和shot间关系。为避免不精确的人工标注,我们采用完全合成的转场合成流水线,自动重现主要的转场类别,并提供精确的边界和参数化的变体。我们还引入OmniShotCutBench,一个现代的宽领域基准,支持整体性和诊断性评估。在这些基准上的实验展示了我们方法的有效性和通用性。

英文摘要

Shot Boundary Detection (SBD) aims to automatically identify shot changes and divide a video into coherent shots. While SBD has been widely studied in the literature, existing methods often produce non-interpretable boundaries on transitions, miss subtle yet harmful discontinuities, and rely on noisy, low-diversity annotations and outdated benchmarks. To alleviate these limitations, we propose OmniShotCut, which formulates SBD as structured relational prediction, jointly estimating shot ranges with intra-shot and inter-shot relations using a shot-query-based dense video Transformer. To avoid imprecise manual labeling, we adopt a fully synthetic transition synthesis pipeline that automatically reproduces major transition families with precise boundaries and parameterized variants. We also introduce OmniShotCutBench, a modern wide-domain benchmark enabling holistic and diagnostic evaluation. Experiments on the benchmarks demonstrate the effectiveness and generality of our method.

2604.24681 2026-05-22 cs.RO

Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

从大规模人类示范中学习人类意图先验以用于机器人操作

Yifan Xie, YuAn Wang, Guangyu Chen, Jinkun Liu, Yu Sun, Wenbo Ding

发表机构 * Tsinghua University(清华大学) ByteDance(字节跳动)

AI总结 本文提出MoT-HRA框架,通过大规模人类示范学习人类意图先验,用于机器人操作,通过构建HA-2.2M数据集和三个耦合专家提升动作合理性和鲁棒性。

Comments 13 pages, 5 figures

详情
AI中文摘要

人类视频包含丰富的操作先验,但用于机器人学习仍然困难,因为原始观测将场景理解、人类运动和特定于身体的动作纠缠在一起。我们引入MoT-HRA,一种层次化视觉-语言-动作框架,从大规模人类示范中学习人类意图先验。我们首先整理HA-2.2M,一个通过手中心过滤、空间重建、时间分割和语言对齐从异构人类视频中重建出的220万集动作-语言数据集。在此数据集之上,MoT-HRA将操作分解为三个耦合专家:一个视觉-语言专家预测无关身体的3D轨迹,一个意图专家将MANO风格的手部运动建模为潜在的人类运动先验,一个精细专家将意图感知的表示映射到机器人动作块。共享注意力主干和只读键值传输允许下游控制使用人类先验同时限制对上游表示的干扰。在手部运动生成、模拟操作和真实世界机器人任务上的实验表明,MoT-HRA在分布偏移下提高了动作合理性和鲁棒控制。

English abstract

Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and robust control under distribution shift.

2604.24514 2026-05-22 cs.LG

SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling

SceneSelect: 用于轨迹场景分类和专家调度的选择性学习

Xinrun Wang, Deshun Xia, Yuxi Sun, Weijie Zhu

Affiliations School of Computer Science, China University of Geosciences (Wuhan); School of Information Engineering, Wuhan University of Technology; School of Mathematics and Statistics, Wuhan University of Technology

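AI summary This paper proposes SceneSelect, a scene-based selective learning method that dynamically routes each input to the most suitable expert model to improve the accuracy and efficiency of trajectory prediction.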

Comments This paper has been accepted by ICIC 2026

AI-generated Chinese abstract

准确的轨迹预测因高度的场景异质性而具有根本挑战性,即不同现实环境中的运动速度、空间密度和交互模式存在剧烈差异。然而,大多数现有方法通常训练一个单一的统一模型,期望固定容量的架构能在所有可能场景中普遍泛化。这种传统的以模型为中心的范式在面对此类极端异质性时存在根本缺陷,不可避免地导致严重的泛化差距、准确性下降以及大量的计算浪费。为克服这一瓶颈,我们不再改进受限的以模型为中心的架构,而是提出选择性学习,一种新的以场景为中心的范式。它显式分析底层场景的特性,将输入动态路由到最合适的专家模型。作为这一范式的具体实现,我们引入SceneSelect。具体而言,SceneSelect在可解释的几何和运动学特征上进行无监督聚类,以发现潜在的场景分类体系。随后训练一个高度解耦的分类模块,将实时输入分配到这些场景类别,并由一个高度可扩展、即插即用的调度策略自动将轨迹序列分派给最优的专家预测器。关键的是,这种解耦设计确保了出色的泛化能力,允许无缝集成不同的现成模型,并在新数据集上稳健适应,而无需计算昂贵的联合再训练。在三个公开基准(ETH-UCY、SDD和NBA)上的大量实验表明,我们的方法始终优于强单模型和集成基线,平均提升10.5%,展示了场景感知选择性学习的有效性。

English abstract

Accurate trajectory prediction is fundamentally challenging due to high scene heterogeneity - the severe variance in motion velocity, spatial density, and interaction patterns across different real-world environments. However, most existing approaches typically train a single unified model, expecting a fixed-capacity architecture to generalize universally across all possible scenarios. This conventional model-centric paradigm is fundamentally flawed when confronting such extreme heterogeneity, inevitably leading to a severe generalization gap, degraded accuracy, and massive computational waste. To overcome this bottleneck, rather than refining restricted model-centric architectures, we propose selective learning, a novel scene-centric paradigm. It explicitly analyzes the characteristics of the underlying scene to dynamically route inputs to the most appropriate expert models. As a concrete implementation of this paradigm, we introduce SceneSelect. Specifically, SceneSelect utilizes unsupervised clustering on interpretable geometric and kinematic features to discover a latent scene taxonomy. A highly decoupled classification module is then trained to assign real-time inputs to these scene categories, and a highly extensible, plug-and-play scheduling policy automatically dispatches the trajectory sequence to the optimal expert predictor. Crucially, this decoupled design ensures excellent generalization capabilities, allowing seamless integration with different off-the-shelf models and robust adaptation across new datasets without requiring computationally expensive joint retraining. Extensive experiments on three public benchmarks (ETH-UCY, SDD, and NBA) demonstrate that our method consistently outperforms strong single-model and ensemble baselines, achieving an average improvement of 10.5%, showcasing the effectiveness of scene-aware selective learning.
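
The cluster, classify, and dispatch flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two features (mean step length and straightness), the plain k-means routine, the nearest-centroid rule standing in for the trained classification module, and the two toy experts are all hypothetical choices made here.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]
Feature = Tuple[float, ...]


def scene_features(traj: Trajectory) -> Feature:
    """Two hand-picked kinematic features (assumed, not from the paper):
    mean step length (speed at unit time step) and path straightness."""
    steps = [math.dist(a, b) for a, b in zip(traj, traj[1:])]
    path_len = sum(steps)
    straightness = math.dist(traj[0], traj[-1]) / path_len if path_len else 1.0
    return (path_len / len(steps), straightness)


def nearest(p: Feature, centroids: Sequence[Feature]) -> int:
    """Index of the closest centroid; stands in for a trained classifier."""
    return min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))


def kmeans(points: Sequence[Feature], k: int, iters: int = 10) -> List[Feature]:
    """Plain Lloyd's k-means, seeded with the first k points for determinism."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        groups: List[List[Feature]] = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def route(
    traj: Trajectory,
    centroids: Sequence[Feature],
    experts: Dict[int, Callable[[Trajectory], Point]],
) -> Point:
    """Classify the scene, then dispatch the trajectory to that scene's expert."""
    return experts[nearest(scene_features(traj), centroids)](traj)


def hold_position(traj: Trajectory) -> Point:
    """Toy expert for slow scenes: predict no further movement."""
    return traj[-1]


def constant_velocity(traj: Trajectory) -> Point:
    """Toy expert for fast scenes: extrapolate the last step."""
    (x1, y1), (x2, y2) = traj[-2], traj[-1]
    return (2 * x2 - x1, 2 * y2 - y1)


train = [
    [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0)],  # slow
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],  # fast
    [(0.0, 0.0), (0.0, 0.2), (0.0, 0.4)],  # slow
    [(0.0, 0.0), (0.0, 1.5), (0.0, 3.0)],  # fast
]
centroids = kmeans([scene_features(t) for t in train], k=2)
# With this seeding, cluster 0 is the slow scene and cluster 1 the fast one;
# in practice the cluster-to-expert mapping would be chosen on validation data.
experts = {0: hold_position, 1: constant_velocity}
print(route([(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)], centroids, experts))
```

Because the experts are only looked up by scene id, swapping one predictor for another requires no retraining of the clustering or routing step, which is the decoupling property the abstract emphasizes.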

2604.17623 2026-05-22 cs.CV cs.GR

ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

ViPS: 为自动绑定网格的视频感知姿态空间

Honglin Chen, Karran Pandey, Rundi Wu, Matheus Gadelha, Yannick Hold-Geoffroy, Ayush Tewari, Niloy J. Mitra, Changxi Zheng, Paul Guerrero

Affiliations Columbia University; University of Toronto; Adobe Research; University of Cambridge; University College London

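AI summary This paper proposes ViPS, a feedforward framework that distills motion priors from a video diffusion model to discover the distribution of valid poses for auto-rigged meshes, supporting sampling of diverse shape variations, inverse kinematics, and keyframe generation for animation.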

Comments Project page: https://honglin-c.github.io/vips/

AI-generated Chinese abstract

运动学绑定(kinematic rig)为驱动3D网格的关节运动提供了结构化接口,但缺乏任何关联的姿态空间,即给定网格的合理关节配置流形的显式表示。没有这样的姿态空间,对原始绑定参数进行随机采样或手动操作很容易导致语义和/或几何违规,例如解剖学上的过度伸展和非物理的自相交。我们提出了Video-informed Pose Spaces (ViPS),一种前馈框架,通过从预训练的视频扩散模型中蒸馏运动先验,发现自动绑定网格有效关节姿态的潜在分布。与依赖稀缺的艺术家创作4D数据集或专注于重建单个运动实例的现有方法不同,ViPS将生成式视频模型的先验转移为给定绑定参数化上的通用分布。应用于蒙皮网格的可微几何验证器无需手动正则项即可保证特定于形状的完整性。我们的前馈模型揭示了平滑、紧凑且可控的姿态空间。这进而支持多样形状变化的采样、用于逆向运动学的流形投影,以及用于动画和关键帧制作的时序一致轨迹。此外,蒸馏得到的3D姿态样本可作为语义代理来引导视频扩散,有效地闭合了生成式2D先验与结构化3D运动学控制之间的回路。我们的评估显示,仅使用视频先验训练的ViPS在合理性和多样性方面与基于艺术家创作的合成4D数据训练的最先进模型表现相当。此外,作为通用模型,ViPS对分布外物种和未见过的骨骼拓扑表现出鲁棒的零样本泛化能力。

English abstract

Kinematic rigs provide a structured interface for articulating 3D meshes but lack any associated pose space, i.e., an explicit representation of the plausible manifold of joint configurations for a given mesh. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters easily results in semantic and/or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feedforward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce, artist-authored 4D datasets, or focus on reconstructing instances of individual motions, ViPS transfers generative video model priors into a universal distribution over the given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce shape-specific integrity without requiring manual regularizers. Our feedforward model reveals a smooth, compact, and controllable pose space. This, in turn, supports sampling for diverse shape variations, manifold projection for inverse kinematics, and temporally coherent trajectories for animation and keyframing. Further, the distilled 3D pose samples serve as semantic proxies to guide video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely using video priors, matches the performance of state-of-the-art models trained on synthetic artist-created 4D data in both plausibility and diversity. Additionally, as a universal model, ViPS exhibits robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.

2604.15003 2026-05-22 cs.CV

Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

流之真相:面向图像到视频生成的主动时间鉴伪

Yuzhuo Chen, Zehua Ma, Han Fang, Hengyi Wang, Guanjie Wang, Weiming Zhang

Affiliations Anhui Province Key Laboratory of Digital Security (School of Cyber Science and Technology, University of Science and Technology of China)

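AI summary This paper proposes a proactive temporal forensics method for image-to-video generation that traces how pixels flow and transform through a video, addressing the shortcomings of conventional spatial forensics along the temporal dimension.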

AI-generated Chinese abstract

图像到视频(I2V)生成的迅速发展使单张图像可以生成逼真的视频,但也带来了新的鉴伪需求。与静态图像不同,I2V内容随时间演变,要求鉴伪方法超越二维像素级篡改定位,追踪像素在视频中的流动和变换。随着帧数增加,嵌入的痕迹会漂移和变形,使传统空间鉴伪失效。为应对这一未探索的维度,我们提出了**Flow of Truth**,首个专注于I2V生成中时间鉴伪的主动框架。关键挑战在于发现一个能够与生成过程一致演化的鉴伪特征,这本质上是一种创造性的转换而非确定性重建。尽管存在这种内在困难,我们创新性地将视频生成重新定义为*像素随时间的运动而非帧的合成*。基于这一观点,我们提出了一种可学习的鉴伪模板,追踪像素运动,并提出一个模板引导的流模块,将运动与图像内容解耦,实现稳健的时间追踪。实验表明,Flow of Truth在商业和开源I2V模型上均表现出色,显著提升了时间鉴伪性能。

English abstract

The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present **Flow of Truth**, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as *the motion of pixels through time rather than the synthesis of frames*. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.
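
To make the "motion of pixels through time" view concrete, here is a toy sketch of a mark that is carried along per-frame motion instead of being checked at a fixed location. It is an illustration only: the integer-valued flow fields, the binary template, and the helpers `warp` and `track` are assumptions introduced here, whereas the paper describes a learnable template and a learned flow module.

```python
from typing import List, Tuple

Grid = List[List[int]]
Flow = List[List[Tuple[int, int]]]  # per-pixel (dy, dx) integer displacement


def warp(template: Grid, flow: Flow) -> Grid:
    """Forward-warp: move each nonzero template pixel along its flow vector.
    Pixels pushed outside the frame are dropped."""
    height, width = len(template), len(template[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if template[y][x]:
                dy, dx = flow[y][x]
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    out[ny][nx] = template[y][x]
    return out


def track(template: Grid, flows: List[Flow]) -> List[Grid]:
    """Propagate the template through a sequence of frame-to-frame flows."""
    states = [template]
    for flow in flows:
        states.append(warp(states[-1], flow))
    return states


# A single mark in the left column of a 3x4 frame.
template = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# Every pixel moves one column to the right in each of two frames.
shift_right = [[(0, 1)] * 4 for _ in range(3)]
states = track(template, [shift_right, shift_right])
for row in states[-1]:
    print(row)
```

After two frames the mark sits two columns to the right of where it was embedded, so a check at the original position finds nothing, while a template propagated with the motion still lines up with it.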