arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.17056 2026-06-16 cs.CL New submission

The Value Axis: Language Models Encode Whether They're on the Right Track

Nick Jiang, Isaac Kauvar, Jack Lindsey

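Affiliations: Stanford University; Anthropic

AI summary: By constructing a "value axis" for Qwen3-8B, the authors find that language models internally track the probability that their current trajectory will succeed, and that this estimate affects confidence, self-correction, and exploration.

Comments: Code repository: https://github.com/nickjiang2378/value-axis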

Abstract

We investigate whether language models internally track the value of their current trajectory, defined as the likelihood that their ongoing strategy will achieve their goals. Using synthetic, in-context reinforcement learning data, we construct a "value" axis for Qwen3-8B. We find that activations along this axis distinguish between high vs. low verbalized confidence, rollouts without and with backtracking, and correct vs. corrupted code. Steering towards high value causally suppresses self-correction and reduces explanatory verbosity, while steering towards low value induces backtracking and exploration. We demonstrate that direct preference optimization (DPO) can increase the internal value of rewarded behaviors (e.g., using a certain word), causing the model to act more confidently after exhibiting them. Finally, we apply the value axis to study in-the-wild settings. For example, we find that Qwen assigns low value to politically sensitive chat queries after post-training and that supervised fine-tuning increases internal confidence within the training domain. Our results suggest that language models linearly encode an estimate of expected goal success that modulates their confidence in pursuing a direction.
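
One common way to obtain a linear axis of this kind is a difference of means between activations from high-value and low-value contexts. The sketch below assumes that construction purely for illustration: the abstract does not say how the axis is fit, and the helper names (`value_axis`, `project`, `steer`) and the toy vectors are hypothetical.

```python
# Illustrative sketch only: a difference-of-means "value" direction.
# The abstract does not specify how the axis is fit; this construction is an assumption.

def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def value_axis(high_acts, low_acts):
    """Unit vector pointing from the low-value mean to the high-value mean."""
    mu_hi, mu_lo = mean_vector(high_acts), mean_vector(low_acts)
    diff = [h - l for h, l in zip(mu_hi, mu_lo)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def project(act, axis):
    """Scalar position of one activation along the axis."""
    return sum(a * u for a, u in zip(act, axis))

def steer(act, axis, alpha):
    """Shift an activation by alpha along the axis (alpha < 0 steers toward low value)."""
    return [a + alpha * u for a, u in zip(act, axis)]

# Toy 3-dimensional "activations" (made-up numbers, not model data).
high = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
low = [[-1.0, 0.0, 2.0], [-3.0, 0.0, 2.0]]
axis = value_axis(high, low)
print(project(high[0], axis), project(low[0], axis))
```

In this toy setup, reading the axis is a dot product and steering is a vector addition; with a real model both operations would be applied to residual-stream activations at a chosen layer, which is a detail the abstract leaves open.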

2606.17055 2026-06-16 cs.RO New submission

T-Rex: Tactile-Reactive Dexterous Manipulation

Dantong Niu, Zhuoyang Liu, Zekai Wang, Boning Shao, Zhao-Heng Yin, Anirudh Pai, Yuvan Sharma, Stefano Saravalle, Ruijie Zheng, Jing Wang, Ryan Punamiya, Mengda Xu, Yuqi Xie, Yunfan Jiang, Letian Fu, Konstantinos Kallidromitis, Matteo Gioia, Junyi Zhang, Jiaxin Ge, Haiwen Feng, Fabio Galasso, Wei Zhan, David M. Chan, Yutong Bai, Roei Herzig, Jiahui Lei, Fei-Fei Li, Ken Goldberg, Jitendra Malik, Pieter Abbeel, Yuke Zhu, Danfei Xu, Jim Fan, Trevor Darrell

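Affiliations: UC Berkeley; NVIDIA; Stanford; Panasonic; La Sapienza University; ItalAI

AI summary: Proposes a large-scale tactile dataset and a variable-rate Mixture-of-Transformers architecture, raising the average success rate by more than 30% across 12 fine-grained manipulation tasks.

Comments: Project page: https://tactile-rex.github.io/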

Abstract

The ability to react dynamically to tactile signals has long been considered crucial to agile human-level dexterity. Yet contemporary learning-based Vision-Language-Action (VLA) models for robotic manipulation generally either overlook the tactile modality or are limited to encoders with static cues, due in part to the scarcity of diverse training data and standardized evaluation, architectural constraints in current VLA models, and limitations of static tactile encoders. In this paper, we push the frontier of tactile-reactive manipulation by addressing all of these limitations. We propose a large-scale, 100-hour tactile-rich dataset collected via a novel, data-efficient recipe that prioritizes elementary motor primitives. To effectively exploit naturally high-frequency touch signals without sacrificing the capabilities of existing VLAs, we introduce a variable-rate Mixture-of-Transformers (MoT) architecture equipped with a novel temporal tactile VQ-VAE encoder. We demonstrate the effectiveness of tactile-reactive policies on 12 manipulation tasks requiring delicate force control and deformable object manipulation, achieving over 30% higher average success rate than the strongest baseline.

2606.17054 2026-06-16 cs.RO New submission

Human Universal Grasping

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

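Affiliations: New York University; Tsinghua University; University of Michigan

AI summary: Proposes HUG, a flow-matching model trained on human grasp data (the 1M-HUGs dataset) that generates diverse grasp poses from a single RGB-D image and retargets them to robot hands for zero-shot grasping, outperforming baselines by 23% and 34% on HUG-Bench.

Comments: 28 pages, 20 figures, 7 tables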

Abstract

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/

2606.17053 2026-06-16 cs.CL cs.CV New submission

Context-Aware RL for Agentic and Multimodal LLMs

Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, Pramod Viswanath, Prateek Mittal, Xingyu Fu

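Affiliations: Princeton University; UC Davis

AI summary: Proposes ContextRL, which uses an indirect auxiliary objective (a context-selection reward) to strengthen fine-grained reasoning in long-context and multimodal tasks, with average gains of +2.2% on 5 long-horizon benchmarks and +1.8% on 12 visual question answering benchmarks.

Comments: 29 pages, 9 figures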

Abstract

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.
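
The context-selection objective described above can be pictured as a binary reward on a two-choice prompt. The following is a minimal sketch under stated assumptions: the prompt wording, the `ContrastiveExample` fields, and the exact-match reward are all hypothetical, since the abstract only says the model is rewarded for selecting the supporting context.

```python
import random
from dataclasses import dataclass

@dataclass
class ContrastiveExample:
    query: str
    answer: str
    supporting_context: str   # context that supports the (query, answer) pair
    distractor_context: str   # highly similar context that does not

def build_prompt(ex, rng):
    """Randomize context order; return the prompt and the gold label ("A" or "B")."""
    swap = rng.random() < 0.5
    if swap:
        first, second, gold = ex.distractor_context, ex.supporting_context, "B"
    else:
        first, second, gold = ex.supporting_context, ex.distractor_context, "A"
    prompt = (
        f"Query: {ex.query}\n"
        f"Answer: {ex.answer}\n"
        f"Context A: {first}\n"
        f"Context B: {second}\n"
        "Which context supports the answer? Reply with A or B."
    )
    return prompt, gold

def selection_reward(model_choice, gold):
    """1.0 if the model picked the supporting context, else 0.0 (assumed reward shape)."""
    return 1.0 if model_choice.strip().upper() == gold else 0.0
```

Shuffling the order of the two contexts keeps a policy from earning reward through a position bias, which is a design choice of this sketch rather than something the abstract states.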

2606.17048 2026-06-16 cs.LG cs.CV stat.ML New submission

Exact Posterior Score Estimation for Solving Linear Inverse Problems

Abbas Mammadov, Ozgur Kara, Kaan Oktay, Iskander Azangulov, Adil Kaan Akan, Hyungjin Chung, James Matthew Rehg, Yee Whye Teh

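Affiliations: University of Oxford; UIUC; EverEx

AI summary: Proposes Exact Posterior Score (EPS), which uses a closed-form posterior score to recast linear inverse problems as denoising problems, needs no likelihood gradients or projections, and outperforms existing methods on FFHQ and ImageNet.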

Abstract

Diffusion and flow-based models learn powerful data priors by training a denoiser to reverse Gaussian corruption. To use this prior to solve a linear inverse problem, one needs to sample from the posterior, but the score that the prior provides is the unconditional score, not the posterior score. Existing methods either steer a fixed pretrained denoiser with approximate measurement-matching corrections, or train a conditional restoration model that abandons the denoising structure of the prior. We derive the exact posterior score in closed form for linear Gaussian inverse problems under general Gaussian interpolants, and show that posterior sampling reduces to a denoising problem at an operator-dependent shifted pivot under an anisotropic noise covariance. We turn this identity into Exact Posterior Score (EPS), a denoising training objective that preserves the input/output structure of standard pretraining and can therefore be trained from scratch or fine-tuned from a pretrained denoiser. At inference, EPS uses the same sampler as the underlying backbone, with no likelihood gradients or projections. We evaluate EPS on five linear inverse problems across FFHQ and ImageNet, where it outperforms training-free and training-based baselines on fidelity, perceptual, and distributional metrics, while using roughly an order of magnitude fewer denoiser evaluations than gradient-based posterior samplers.
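
For orientation, the gap between the unconditional score and the posterior score follows from Bayes' rule. The block below states only the standard textbook identities for a linear Gaussian measurement model; the symbol $\sigma_y$ for the measurement noise level is notation assumed here, and the paper's closed-form result under Gaussian interpolants is not reproduced.

```latex
% Measurement model (standard linear Gaussian setup; \sigma_y is assumed notation)
\begin{align*}
y &= A x + n, \qquad n \sim \mathcal{N}(0, \sigma_y^2 I) \\[4pt]
% Bayes' rule in score form on clean data
\nabla_x \log p(x \mid y)
  &= \nabla_x \log p(x) + \nabla_x \log p(y \mid x) \\
  &= \nabla_x \log p(x) + \frac{1}{\sigma_y^2} A^\top (y - A x) \\[4pt]
% At noise level t the likelihood term marginalizes over the clean sample x_0
\nabla_{x_t} \log p_t(x_t \mid y)
  &= \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t), \\
p_t(y \mid x_t) &= \int p(y \mid x_0)\, p(x_0 \mid x_t)\, \mathrm{d}x_0
\end{align*}
```

The first term is what a pretrained denoiser supplies; the last integral is the quantity that training-free samplers typically approximate, which is the approximation the abstract says EPS avoids.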

2606.17046 2026-06-16 cs.RO cs.CV cs.LG New submission

Geometric Action Model for Robot Policy Learning

Jisang Han, Seonghu Jeon, Jaewoo Jung, René Zurbrügg, Honggyu An, Tifanny Portela, Marco Hutter, Marc Pollefeys, Seungryong Kim, Sunghwan Hong

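Affiliations: KAIST AI; ETH Zurich; ETH AI Center

AI summary: Proposes the Geometric Action Model (GAM), which reuses a pretrained geometric foundation model (GFM) as a shared backbone for a language-conditioned manipulation policy and outperforms existing methods on simulation and real-robot tasks.

Comments: Project page: https://cvlab-kaist.github.io/Geometric-Action-Model/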

Abstract

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
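
The split-backbone data flow described above can be sketched as plain function composition. This is a schematic under assumptions, not the authors' implementation: `blocks`, `predictor`, and `split_forward` are hypothetical names, the blocks are toy callables standing in for GFM layers, and the geometry and action decoding heads are omitted.

```python
def split_forward(blocks, split, predictor, obs_tokens, conditioning):
    """Run shallow blocks as an encoder, predict future latents at the split,
    then route the predicted latents through the remaining blocks."""
    latent = obs_tokens
    for block in blocks[:split]:        # shallow layers: observation encoder
        latent = block(latent)
    future = predictor(latent, conditioning)  # forecast future latent tokens
    for block in blocks[split:]:        # remaining layers: propagation and decoding
        future = block(future)
    return future

# Toy stand-ins: numbers instead of token tensors, arithmetic instead of layers.
toy_blocks = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
toy_predictor = lambda latent, cond: latent + cond
print(split_forward(toy_blocks, 1, toy_predictor, 1, 10))
```

The point of the sketch is only the ordering: the predictor sits between the shallow and deep halves of one backbone, so the deep half processes predicted rather than observed latents.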

2606.17043 2026-06-16 cs.RO cs.LG New submission

Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

Tongyan Fang, Siyuan Huang, Naiyu Fang, Ganlong Zhao, Zhongjin Luo, Jianbo Liu, Xiaogang Wang, Ying Dong, Hongsheng Li

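Affiliations: ACE Robotics; Shenzhen International Graduate School, Tsinghua University; The Chinese University of Hong Kong

AI summary: Proposes Hierarchical Advantage-Weighted Behavior Cloning (HABC), which separates viability and efficiency objectives and balances them adaptively to address credit assignment when fine-tuning VLA policies online from sparse binary outcomes, raising success rates on three contact-rich bimanual tasks from 12-44% to 38-92%.

Comments: Website: https://acerobotics-vla.github.io/HABC-Website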

Abstract

When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which conflates distinct forms of transition-level feedback and provides limited guidance once basic task success becomes achievable. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for these two objectives on different data subsets and combines their outputs with a state-adaptive balance. A state-adaptive gate $g_t$ merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.
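
To make the gating idea concrete, here is a minimal sketch of how a state-adaptive gate $g_t$ could blend two one-step advantages into a per-transition weight. Every functional form here is an assumption chosen for illustration: the sigmoid gate, its `threshold` and `sharpness` parameters, the convex blend, and the clipped exponential weighting are not taken from the paper.

```python
import math

def gate(p_success, threshold=0.7, sharpness=10.0):
    """Hypothetical g_t in (0, 1): near 0 while success is uncertain, near 1 when viability is high."""
    return 1.0 / (1.0 + math.exp(-sharpness * (p_success - threshold)))

def merged_advantage(adv_viability, adv_efficiency, p_success):
    """Convex blend of the two critics' one-step advantages (assumed form)."""
    g = gate(p_success)
    return (1.0 - g) * adv_viability + g * adv_efficiency

def transition_weight(advantage, temperature=1.0, max_weight=20.0):
    """Exponential advantage weighting with clipping, as in advantage-weighted regression."""
    return min(math.exp(advantage / temperature), max_weight)

# Same pair of advantages, two different viability estimates.
for p in (0.2, 0.95):
    a = merged_advantage(adv_viability=1.0, adv_efficiency=-1.0, p_success=p)
    print(p, round(a, 3), round(transition_weight(a), 3))
```

With a low viability estimate the blend follows the viability critic, and with a high one it follows the efficiency critic, matching the prioritization the abstract describes.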

2606.17041 2026-06-16 cs.CL cs.IR New submission

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

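Affiliations: Tsinghua University

AI summary: Introduces MetaSyn, a dataset of 442 expert-curated meta-analyses for evaluating LLM agents across the full retrieval-screening-synthesis pipeline, and finds a severe bottleneck at the screening stage.

Comments: 13 pages, 7 figures, preprint for arXiv, dataset and code available at https://github.com/BFTree/MetaSyn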

Abstract

Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.
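
The difference between a retrieval ceiling and an end-to-end score can be made concrete with a small calculation. The sketch below is an assumption about what stage-attributed recall might look like; the function names and metric definitions are hypothetical and are not the benchmark's official metrics.

```python
def recall(found, relevant):
    """Fraction of relevant items that appear in found."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(found)) / len(relevant)

def stage_metrics(retrieved_top_k, screened_in, ground_truth):
    """Split end-to-end recall into a retrieval stage and a screening stage."""
    retrieved = set(retrieved_top_k)
    truth = set(ground_truth)
    reachable = truth & retrieved            # included studies that survived retrieval
    kept = set(screened_in) & retrieved      # screening can only keep retrieved items
    return {
        "retrieval_recall": recall(retrieved, truth),
        "screening_recall": recall(kept, reachable),
        "end_to_end_recall": recall(kept, truth),
    }

# Toy identifiers, not MetaSyn data.
print(stage_metrics(["a", "b", "c", "x", "y"], ["a", "x"], ["a", "b", "c", "d"]))
```

Under these definitions the end-to-end recall is the product of the two stage recalls, so a low final score can be attributed to whichever stage lost the studies.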

2606.17040 2026-06-16 cs.RO cs.CV New submission

R2RDreamer: 3D-aware Data Augmentation for Spatially-generalized 2D Manipulation Policies

Xiuwei Xu, Haowen Sun, Angyuan Ma, Yiwei Zhang, Zhenyu Wu, Xiaofeng Wang, Bingyao Yu, Zheng Zhu, Jie Zhou, Jiwen Lu

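Affiliations: Tsinghua University; BUPT; GigaAI

AI summary: Proposes R2RDreamer, a framework that combines lightweight 3D editing with 2D video completion to generate geometrically consistent augmented data from a few real demonstrations, improving the spatial generalization of 2D manipulation policies.

Comments: Project page: https://r2rdreamer.github.io/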

Abstract

Spatial generalization is critical for imitation-learned manipulation policies, but achieving it typically requires scaling demonstrations across diverse object poses, robot configurations, and camera viewpoints. Data augmentation from a few source demonstrations offers a practical alternative to costly real-world collection. Simulation-based augmentation can create controllable variation, but requires complex environment and object setup and may introduce a sim-to-real gap. Recent real-to-real methods avoid these issues by jointly editing 3D observations and action trajectories from real demonstrations, yet they still rely on strong 3D scene parsing and geometry completion, and often produce observations tailored to 3D point-cloud policies rather than RGB-based 2D policies. We propose R2RDreamer, a real-to-real demonstration augmentation framework that preserves the geometric consistency of 3D action-observation editing while moving visual completion to 2D video space. Specifically, R2RDreamer first performs lightweight 3D augmentation by editing incomplete object point clouds and end-effector trajectories in a shared 3D frame; it then projects the edited scene into masked image-space control videos with occlusion-aware reasoning and uses a dense-control image-to-video model to complete temporally coherent RGB observations. Experiments on spatially shifted manipulation tasks with both 2D diffusion-style policies and vision-language-action policies show that R2RDreamer improves spatial generalization from limited source demonstrations, with analyses validating the contributions of 3D editing, occlusion-aware projection, and video completion.
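
Editing observations and actions in a shared 3D frame amounts to applying one rigid transform to both. The sketch below is a simplified illustration under assumptions: it handles positions only, rotates about the vertical axis, and transforms the whole trajectory, whereas a real pipeline would also handle end-effector orientation and blend the edited segment with the robot's starting pose. The function names are hypothetical.

```python
import math

def rigid_transform_z(points, yaw, translation):
    """Rotate 3D points about the z axis by yaw (radians), then translate."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in points]

def augment_demo(object_points, ee_waypoints, yaw, translation):
    """Apply the same transform to the object point cloud and the end-effector
    waypoints so their relative geometry is preserved."""
    return (
        rigid_transform_z(object_points, yaw, translation),
        rigid_transform_z(ee_waypoints, yaw, translation),
    )

# Toy scene: one object point and one waypoint hovering 0.1 m above it.
obj, ee = augment_demo([(1.0, 0.0, 0.0)], [(1.0, 0.0, 0.1)], math.pi / 2, (0.5, 0.0, 0.0))
print(obj, ee, math.dist(obj[0], ee[0]))
```

Because the transform is shared, the gripper-to-object offset is unchanged after augmentation, which is the geometric consistency the abstract refers to.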

2606.17034 2026-06-16 cs.CL cs.LG New submission

KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing

Mufei Li, Shikun Liu, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li

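Affiliations: Georgia Institute of Technology; Meta

AI summary: Proposes KVEraser, which learns to steer the KV cache to erase context locally without global recomputation, approaching full-recomputation performance on long-context tasks with only a 24% latency increase.

Comments: Oral at the ICML 2026 Workshop on the Impact of Memorization on Trustworthy Foundation Models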

Abstract

Post-hoc context erasing over the KV cache is challenging because a local edit has a global consequence: once a span has been processed, its influence propagates into the cached states of all subsequent tokens. This issue arises naturally in long-context LLM applications, where stale retrieved facts, incorrect tool observations, retracted user preferences, or harmful prompt injections may be identified only after prefill. Exact erasing must then recompute all tokens after the deleted span, making its computational cost depend on suffix length rather than erased-span length. We introduce KVEraser, a learned KV-cache editing method for efficient localized context erasing. Given a processed context and a span to remove, KVEraser replaces only the KV states of the erased interval with learned steering states while reusing the remaining cache unchanged. To learn a transferable erasing mechanism, we build a two-stage training pipeline: generic span-neighbor pre-training teaches the eraser to suppress the influence of the erased span, while task-specific fine-tuning adapts this capability to downstream scenarios. Experiments show that KVEraser nearly matches full recomputation in post-erasure performance on in-domain tasks across 1K-32K context lengths, while its latency increases by only 24% compared with a 17.6x increase for full recomputation. KVEraser also generalizes to unseen long-document QA tasks with harmful factual distractors, achieving the best performance among approximate baselines with a 3-4x speedup over full recomputation.
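
The cost asymmetry between exact erasing and a localized edit can be shown with a toy cache. This sketch makes several assumptions for illustration: the cache is a flat list of per-position entries, the steering states have the same length as the erased span, and `erase_span` and `positions_touched` are hypothetical helpers. How the steering states are learned is not modeled.

```python
def erase_span(kv_cache, start, end, steering_states):
    """Return a new cache where positions [start, end) hold steering states
    and every other entry is reused unchanged."""
    if len(steering_states) != end - start:
        raise ValueError("steering states must match the erased span length")
    return kv_cache[:start] + list(steering_states) + kv_cache[end:]

def positions_touched(cache_len, start, end):
    """Positions rewritten by exact recomputation versus a localized edit."""
    return {
        "exact_recompute": cache_len - end,  # every token after the deleted span
        "localized_edit": end - start,       # only the erased interval
    }

# Toy cache with string placeholders instead of key/value tensors.
cache = [f"kv{i}" for i in range(8)]
print(erase_span(cache, 2, 4, ["s0", "s1"]))
print(positions_touched(cache_len=32000, start=100, end=150))
```

The second call illustrates why exact erasing scales with suffix length: removing a 50-token span near the start of a 32,000-token context forces recomputation of 31,850 positions, while the localized edit rewrites 50.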

2606.17029 2026-06-16 cs.CL New submission

DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents

Minghang Zhu, Chuyang Wei, Junhao Xu, Yilin Cheng, Zhumin Chen, Jiyan He

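Affiliations: Shandong University; Zhongguancun Academy; Fudan University

AI summary: Proposes DeepRubric, a framework that builds evidence trees to generate query-rubric pairs so that the reward signal evaluates exactly the information the query requests, matching prior state-of-the-art models with roughly 13x fewer RL GPU-hours.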

Abstract

Deep research agents synthesize long-form reports by searching and reasoning over retrieved evidence. Reinforcement learning with rubric-based rewards improves these agents by optimizing them against checkable criteria that translate report quality into reward signals, but its efficiency depends on whether those criteria reliably capture the task scope and evidence needs. Most existing studies ask an LLM to generate rubrics for a given query, but when the model fails to infer the underlying information needs, the generated rubrics may be incomplete and reduce RL efficiency. To obtain more reliable query-rubric supervision, we introduce DeepRubric, a data construction framework that reverses this process: instead of inferring evaluation criteria for a given query, it first determines what an evidence-backed report should be evaluated on and then synthesizes aligned query-rubric pairs from those evaluation targets. Starting from a sampled seed topic, DeepRubric builds an evidence tree by recursively expanding evidence-backed sub-questions, whose leaves serve as atomic and verifiable evaluation targets. It then uses the evidence tree to synthesize the training query and rubrics, ensuring that the reward evaluates exactly the information requested by the query. Using DeepRubric, we construct 9K query-rubric supervision examples and train DeepRubric-8B with rubric-based GRPO, achieving comparable performance to prior open state-of-the-art deep research models across three benchmarks with roughly 13x fewer RL GPU-hours.
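
The evidence tree described above can be represented as a small recursive data structure whose leaves become rubric items. The following sketch is illustrative only: the `EvidenceNode` fields, the rubric wording, and the example topic are hypothetical, and the search and expansion steps that would populate the tree are left out.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceNode:
    question: str
    evidence: str = ""                        # supporting snippet, if any
    children: list = field(default_factory=list)

def leaves(node):
    """Collect leaf nodes, which act as atomic evaluation targets."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found

def rubrics_from_tree(root):
    """Turn each leaf into one checkable rubric item (wording is an assumption)."""
    return [f"The report addresses: {leaf.question}" for leaf in leaves(root)]

# Toy tree for a made-up seed topic.
tree = EvidenceNode("How do heat pumps compare with gas boilers?", children=[
    EvidenceNode("What is the typical efficiency of each?", evidence="source 1"),
    EvidenceNode("What are the costs?", children=[
        EvidenceNode("What is the upfront cost?", evidence="source 2"),
        EvidenceNode("What is the running cost?", evidence="source 3"),
    ]),
])
for item in rubrics_from_tree(tree):
    print(item)
```

Deriving both the query and the rubrics from the same tree is what keeps them aligned: every rubric item corresponds to a leaf that the query was built to ask about.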

2606.17027 2026-06-16 cs.CV New submission

MeshLoom: Feed-Forward Non-Rigid Registration of Mesh Sequences

Jianqi Chen, Jiraphon Yenphraphai, Xiangjun Tang, Sergey Tulyakov, Chaoyang Wang, Peter Wonka, Rameen Abdal

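Affiliations: KAUST, Saudi Arabia; Snap Inc., United States of America; Purdue University, United States of America

AI summary: Proposes MeshLoom, a feed-forward registration network whose topology-aware encoder-decoder directly reconstructs vertex deformations across mesh sequences, registering multiple meshes within seconds, reaching state-of-the-art non-rigid registration, and supporting motion interpolation and mesh morphing.

Comments: Project page: https://meshloom.github.io/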

Abstract

We present MeshLoom, a feed-forward registration network that directly reconstructs vertex deformations across mesh sequences. Our approach advances non-rigid registration beyond existing models, which are typically constrained by costly per-instance optimization, narrow object categories, pairwise-only inputs, or merely intermediate outputs. The network is simple and efficient, registering multiple meshes within seconds. At its core lies a topology-aware encoder-decoder design. Specifically, we first introduce a topology-aware point representation that encodes the anchor (reference) mesh's topology into its per-vertex features. This representation strengthens the network's understanding of the anchor-mesh geometry and disambiguates points that are Euclidean-close yet geodesically distant. We then propose a multi-modal encoder that fuses this anchor-mesh representation with complementary cues from each frame, such as shape latents and image features. These multi-source signals are compressed into a compact global motion embedding that captures dense inter-frame correspondence. A lightweight decoder then queries this global embedding with the anchor-mesh point representation, retrieving per-vertex deformations at target timestamps. Through extensive experiments across diverse motions and object categories, we show that MeshLoom achieves state-of-the-art results on non-rigid registration. In addition, we find that our global embedding-then-query paradigm naturally enables the network to generate deformations at intermediate timestamps, which extends MeshLoom to motion interpolation and mesh morphing. Project page: https://meshloom.github.io/

2606.17024 2026-06-16 cs.LG New submission

ExpRL: Exploratory RL for LLM Mid-Training

Violet Xiang, Amrith Setlur, Chase Blagden, Nick Haber, Aviral Kumar

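Affiliations: Stanford University; Carnegie Mellon University; OpenAI; Rogo

AI summary: Proposes ExpRL, which uses human-written question-answer data as reward scaffolds and dense rewards to reinforce partial progress and useful behaviors during reasoning; on math reasoning tasks it outperforms SFT, sparse-reward GRPO, and self-distillation, and gives a better initialization for subsequent sparse-reward RL.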

Abstract

Sparse reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends critically on the coverage present in the base model. In practice, models are often primed for RL through mid-training on curated reasoning traces that teach useful primitive skills such as decomposition, verification, or self-correction. Although effective, this strategy requires manually specifying what the model should learn, and it remains unclear whether such primitive coverage is enough for much harder problems, which require combining these skills into broader solution strategies. We study a more automated approach: RL-based mid-training using large corpora of human-written question-answer data. Rather than treating reference solutions as targets to imitate, our method, ExpRL, uses them as reward scaffolds: references are hidden from the policy and used only to construct problem-specific grading rubrics for judging on-policy reasoning traces. The policy samples from the original problem prompt, while an LLM judge compares the sampled reasoning trace against the reference solution and assigns outcome-level or process-level dense rewards. This lets ExpRL reinforce partial progress, useful intermediate reductions, and productive reasoning behaviors that sparse final-answer rewards often fail to upweight. On challenging math reasoning tasks, ExpRL yields stronger RL priming than SFT, sparse-reward GRPO, and self-distillation, and provides a better initialization for subsequent sparse-reward RL. Additional mixed-domain experiments further suggest that ExpRL can extend beyond the original math-only setting.
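
The contrast between a sparse final-answer reward and a dense rubric-based reward can be illustrated with two small functions. This is a sketch under assumptions: the exact-match check, the equal weighting of rubric items, and the boolean verdicts (which stand in for an LLM judge's output) are all hypothetical.

```python
def sparse_reward(final_answer, reference_answer):
    """1.0 only when the final answer matches the reference exactly."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0

def dense_rubric_reward(rubric_verdicts):
    """Fraction of rubric items that a judge marked as satisfied."""
    if not rubric_verdicts:
        return 0.0
    return sum(1 for verdict in rubric_verdicts if verdict) / len(rubric_verdicts)

# A trace that sets up the problem correctly but slips on the last step.
verdicts = [True, True, False, False]   # placeholder judge output, not real data
print(sparse_reward("41", "42"), dense_rubric_reward(verdicts))
```

In this toy case the sparse reward gives no signal, while the dense reward still credits the two satisfied rubric items, which is the kind of partial progress the abstract says ExpRL reinforces.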

2606.17020 2026-06-16 cs.CV cs.AI New submission

FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

Jiaju Han, Ben Zhang, Xuemeng Sun, Qike Zhang, Yuxian Dong, Chengyin Hu, Fengyu Zhang, Yiwei Wei, Jiujiang Guo

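Affiliations: China University of Petroleum-Beijing at Karamay; University of Electronic Science and Technology of China; Tianjin University

AI summary: To address the lack of infrared data for remote sensing vision-language models, introduces FusionRS, the first large-scale RGB-infrared-text dataset, built by translating RGB images into infrared-style counterparts paired with IR-aware captions, and uses it to train dual-modal foundation models that improve RGB-IR alignment and dual-modal captioning.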

English abstract

Remote sensing vision-language models have advanced Earth observation understanding, but most existing work remains centered on RGB imagery, leaving the complementary information in infrared data underexplored. Infrared images provide distinctive cues, including thermal intensity structures, object boundaries, and illumination-invariant scene features, which can enrich visual-language learning beyond conventional RGB observations. However, a large-scale RGB-infrared-text dataset for remote sensing vision-language modeling is still absent. To address this gap, we introduce FusionRS, the first large-scale RGB-infrared-text dataset designed for dual-modal vision-language learning in remote sensing. FusionRS is constructed by translating diverse public RGB remote sensing images into infrared-style counterparts, forming aligned RGB-IR image pairs. Each pair is associated with conventional scene captions and IR-aware captions that explicitly describe infrared-specific visual properties while preserving semantic content. Based on FusionRS, we train dual-modal vision-language foundation models for RGB-IR joint understanding. We first train CLIP-style models for RGB-IR-text alignment, and then fine-tune generative VLMs for dual-modal RGB-IR captioning. Experiments show that FusionRS improves RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only and non-IR-aware training settings. Ablation studies further verify that IR-aware captions are crucial for strengthening infrared-language alignment, highlighting the importance of modality-specific textual supervision for more scalable RGB-infrared remote sensing vision-language representation learning.

2606.17016 2026-06-16 cs.CL cs.AI cs.LG cs.MA New submission

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot: 面向LLM智能体的缓存高效上下文管理

Buqiang Xu, Zirui Xue, Dianmou Chen, Chenyang Fu, Chiyu Wu, Caiying Huang, Chen Jiang, Jizhan Fang, Xinle Deng, Yijun Chen, Yunzhi Yao, Xuehai Wang, Jin Shang, Gong Yu, Ningyu Zhang

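Affiliations*: Zhejiang University (浙江大学); University of Electronic Science and Technology of China (电子科技大学); Xidian University (西安电子科技大学); HomologyAI (同源人工智能)

AI summary: To address the high inference cost caused by context accumulation in long LLM-agent sessions, the paper proposes TokenPilot, a dual-granularity context management framework that combines Ingestion-Aware Compaction with Lifecycle-Aware Eviction and cuts cost by 56%-87% while keeping performance competitive.

Comments LightMem Series: Work in Progress

Details
Chinese abstract (AI translation)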

随着LLM智能体被部署在长周期会话中,上下文累积推高了推理成本。现有方法利用文本修剪或动态内存驱逐来最小化token占用,但其无约束的序列突变改变了布局,引入前缀不匹配和缓存失效。这揭示了文本稀疏性与提示缓存连续性之间的关键权衡。为解决此问题,我们提出TokenPilot,一个双粒度上下文管理框架。全局上,摄入感知压缩作为框架工具,稳定提示前缀并在摄入门处消除开放世界环境噪声。局部上,生命周期感知驱逐监控上下文段的持续剩余效用,强制执行保守的批处理轮次调度,仅在任务相关性过期时卸载内容段。在PinchBench和Claw-Eval上的隔离和连续模式实验表明,TokenPilot在隔离模式下成本降低61%和56%,在连续模式下降低61%和87%,同时与先前系统相比保持竞争性能。TokenPilot已集成到LightMem2中,地址为https://github.com/zjunlp/LightMem2。

English abstract

As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches utilize text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained sequence mutations alter layouts, introducing prefix mismatches and cache invalidation. This reveals a critical trade-off between text sparsity and prompt cache continuity. To address this, we present TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction acts as a framework harness to stabilize prompt prefixes and eliminate open-world environmental noise at the ingestion gate. Locally, Lifecycle-Aware Eviction monitors the ongoing residual utility of context segments, enforcing a conservative batch-turn schedule to offload content segments only when task relevance expires. Experiments on PinchBench and Claw-Eval under both isolated and continuous modes demonstrate that TokenPilot reduces costs by 61% and 56% in isolated mode, and 61% and 87% in continuous mode, while maintaining competitive performance compared to prior systems. TokenPilot has been integrated into LightMem2 at https://github.com/zjunlp/LightMem2.
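The trade-off between text sparsity and prompt-cache continuity can be illustrated with a short Python sketch. It assumes a cache that can reuse only the longest common prefix of two consecutive prompts; the segment names and the `reusable_prefix` helper are hypothetical and are not taken from TokenPilot or LightMem2.

```python
def reusable_prefix(prev_prompt, new_prompt):
    """Count the leading segments shared by two prompts.

    Assumption: a prefix cache can reuse exactly this many segments.
    """
    shared = 0
    for old, new in zip(prev_prompt, new_prompt):
        if old != new:
            break
        shared += 1
    return shared


prev_prompt = ["system", "tool_docs", "turn1", "turn2", "turn3"]

# Pruning an early segment shifts everything after it: the cache survives
# only up to the first changed position.
pruned_in_place = ["system", "turn1", "turn2", "turn3", "turn4"]

# Leaving the prefix untouched and only appending keeps the whole cache.
appended_only = prev_prompt + ["turn4"]

print(reusable_prefix(prev_prompt, pruned_in_place))  # 1
print(reusable_prefix(prev_prompt, appended_only))    # 5
```

Under this assumption, an eviction policy that batches its removals changes the prefix less often than one that prunes on every turn, which is consistent with the conservative batch-turn schedule the abstract describes.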

2606.17014 2026-06-16 cs.LG math.ST stat.ML stat.TH New submission

Filtered Conformal Ellipsoids for Graph-Native Time Series

图原生时间序列的过滤共形椭球

Yannick Limmer

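Affiliations*: DRW London (DRW伦敦)

AI summary: Proposes filtered conformal ellipsoids, which combine state-space filtering with conformal calibration to produce joint prediction sets for multivariate time series that control a single event and adapt to cross-coordinate dependence. Coverage bounds are obtained by analysing contraction in an observable predictive-law quotient.

Details
Chinese abstract (AI translation)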

多元时间序列的联合预测集应控制单个事件,同时适应跨坐标依赖性。我们研究过滤共形椭球:一个冻结的状态空间滤波器输出一步预测均值和协方差,并对得到的马氏距离分数应用分割共形校准。滤波器用于选择椭球形状;共形校准选择标量半径,因此该构造受益于学习到的预测协方差,而不依赖高斯尾部概率来保证覆盖。主要困难在于过滤分数是相互依赖的,且学习到的循环滤波器未必在其原始隐藏状态上收缩;因此,我们分析可观测预测律商中的收缩,该商识别产生相同未来发射高斯律序列的隐藏状态。在稳定的贝叶斯高斯投影滤波器、协方差界和有限时域可观测性费舍尔条件下,小超额高斯负对数似然意味着学习到的发射律的收缩。结合阈值自协方差包络,这给出了依赖下过滤分割共形预测的切比雪夫型近似覆盖界;更尖锐的伯恩斯坦型界需要额外的几何混合集中假设。在高斯预言可实现性下,我们还在条件有效的高斯椭球规则类中获得了接近预言的log体积比较。我们使用具有对角加低秩协方差的GCN-GRU滤波器实例化该框架。在中等规模的图原生交通基准(METRLA-20和PEMSBAY-50)上,学习到的滤波器比静态协方差和非滤波基线给出更尖锐的目标椭球;在全图规模和非图原生数据集上,因子和copula基线可能更强。

English abstract

Joint prediction sets for multivariate time series should control a single event while adapting to cross-coordinate dependence. We study filtered conformal ellipsoids: a frozen state-space filter emits a one-step predictive mean and covariance, and split-conformal calibration is applied to the resulting Mahalanobis scores. The filter is used to choose the ellipsoid shape; conformal calibration chooses the scalar radius, so the construction benefits from a learned predictive covariance without relying on Gaussian tail probabilities for coverage. The main difficulty is that filtered scores are dependent and learned recurrent filters need not contract in their raw hidden state; we therefore analyse contraction in an observable predictive-law quotient that identifies hidden states producing the same future sequence of emitted Gaussian laws. Under a stable Bayes Gaussian-projection filter, covariance bounds, and a finite-horizon observability Fisher condition, small excess Gaussian negative log-likelihood implies contraction of the learned emitted laws. Combined with a threshold-autocovariance envelope, this yields a Chebyshev-type approximate coverage bound for filtered split-conformal prediction under dependence; a sharper Bernstein-type bound requires an additional geometric-mixing concentration assumption. Under Gaussian oracle realisability we also obtain a near-oracle log-volume comparison within the class of conditionally valid Gaussian ellipsoid rules. We instantiate the framework with a GCN-GRU filter with diagonal-plus-low-rank covariance. On moderate-size graph-native traffic benchmarks (METRLA-20 and PEMSBAY-50), the learned filter gives sharper at-target ellipsoids than static-covariance and non-filter baselines; at full-graph scale and on non-graph-native datasets, factor and copula baselines can be stronger.
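The division of labour described above (the filter fixes the ellipsoid shape, conformal calibration fixes its radius) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it works in two dimensions, uses the squared Mahalanobis distance as the score (monotone in the unsquared one), and applies the textbook exchangeable split-conformal quantile rather than the dependence-aware analysis of the paper. All function names are hypothetical.

```python
import math


def mahalanobis_sq(y, mean, cov):
    """Squared Mahalanobis distance of a 2-D point under a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    # Uses the closed-form inverse of a 2x2 matrix.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def conformal_radius(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(cal_scores)[k - 1]


def in_ellipsoid(y, mean, cov, radius):
    """True if y lies inside the calibrated ellipsoid around the forecast."""
    return mahalanobis_sq(y, mean, cov) <= radius


# Hypothetical calibration scores from held-out one-step forecasts.
cal_scores = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
radius = conformal_radius(cal_scores, alpha=0.25)  # 6th smallest of 7

identity = ((1.0, 0.0), (0.0, 1.0))
print(radius)
print(in_ellipsoid((1.0, 1.0), (0.0, 0.0), identity, radius))
print(in_ellipsoid((2.0, 0.0), (0.0, 0.0), identity, radius))
```

Because only the scalar radius is calibrated, a poorly shaped covariance still yields a set of the target size on exchangeable data; it is simply larger than necessary.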

2606.17011 2026-06-16 cs.RO cs.LG New submission

ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning

ROVE: 通过强化学习解锁人类干预用于人形机器人操作

Wei Xiao, Weiliang Tang, Yuying Ge, Hui Zhou, Yao Mu, Li Zhang, Yixiao Ge

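Affiliations*: XPENG Robotics (小鹏机器人); Fudan University (复旦大学); The Chinese University of Hong Kong (香港中文大学); Shanghai Jiao Tong University (上海交通大学)

AI summary: Proposes ROVE, a framework that uses reinforcement learning and Optimistic Value Estimation to learn high-value behaviors from suboptimal human intervention trajectories, improving humanoid manipulation performance.

Details
Chinese abstract (AI translation)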

人类干预为视觉-语言-动作(VLA)模型的后训练提供了关键的纠正信号。然而,由于复杂的全身运动学和灵巧手控制,实现无缝的人形干预是一个严峻的系统挑战。因此,收集到的干预轨迹往往是次优的,依赖人类干预作为专家监督的方法可能会吸收犹豫、低效甚至错误的行为。为了解决系统和算法两方面的挑战,我们提出了ROVE,一个用于人形VLA后训练的强化学习框架,能够处理不完美的人类干预。首先,ROVE引入了一个人在环的流水线,能够收集人形操作中的部署和干预数据。其次,它利用乐观价值估计(OVE)从混合质量的轨迹中优先考虑高价值行为。为了进一步增强价值估计的鲁棒性,我们融入了跨具身的人类经验视频,为长尾失败和恢复模式提供丰富的监督。由此产生的评论家产生信息丰富的优势信号,引导VLA演员专注于高价值行为,而不是不加区分地模仿所有动作。在具有挑战性的真实世界接触密集和精细的人形操作任务中,ROVE优于基于经验学习的基线,并在多次部署-干预迭代中持续改进。

English abstract

Human interventions provide crucial corrective signals for post-training Vision-Language-Action (VLA) models. However, enabling seamless humanoid interventions is a formidable systems challenge due to complex whole-body kinematics and dexterous-hand control. Consequently, the collected intervention trajectories are often suboptimal, and methods that rely on human interventions as expert supervision can absorb hesitant, inefficient, or even erroneous behaviors. To address both the system and algorithmic challenges, we propose ROVE, a reinforcement learning framework for humanoid VLA post-training with imperfect human interventions. First, ROVE introduces a human-in-the-loop pipeline capable of collecting deployment and intervention data for humanoid manipulation. Second, it utilizes Optimistic Value Estimation (OVE) to prioritize high-value behaviors from mixed-quality trajectories. To further robustify value estimation, we incorporate cross-embodiment human experience videos to provide rich supervision for long-tailed failure and recovery modes. The resulting critic yields informative advantage signals, steering the VLA actor to focus on high-value behaviors rather than indiscriminately imitating all actions. On challenging real-world contact-rich and fine-grained humanoid manipulation tasks, ROVE outperforms experience-learning baselines and consistently improves across multiple rollout-intervention iterations.

2606.17010 2026-06-16 cs.LG New submission

From Tokens to Policy: Causal and Interpretable Heterogeneous Treatment Effects Identification

从令牌到策略:因果且可解释的异质性处理效应识别

Riccardo Cadei, Frank Otchere, Nyasha Tirivayi, Gustavo Angeles Tagliaferro, Falco J. Bargagli-Stoffi, Francesco Locatello

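Affiliations*: ISTA; UNICEF (联合国儿童基金会); UCLA (加州大学洛杉矶分校)

AI summary: Proposes NEXIS, which uses multi-modal pre-treatment representations to recast HTE identification as a Markov-blanket discovery problem, giving a causal and interpretable identification of heterogeneous treatment effects. The method is deployed on two anti-poverty programs in Africa.

Details
Chinese abstract (AI translation)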

异质性处理效应(HTE)识别对于解释干预的影响并据此优化策略至关重要。现有方法在表达性和可解释性之间权衡,但如果某些活跃的异质性驱动因素未被测量,这两种极端方法都会允许虚假的HTE表征,缺乏因果解读。在这项工作中,我们聚焦于受控实验,并认为通过潜在交互变量实现因果HTE表征现在已触手可及,这得益于(i)更广泛的处理前测量,即多模态和多视角,以及(ii)具有最小人工监督的可扩展表示。然后,我们将HTE识别重新定义为在充分且对齐的处理前表示上的马尔可夫毯发现问题,并引入神经暴露交互搜索(NEXIS),这是一种具有可证明且经验验证的一致选择性的迭代过程。我们在非洲的两个反贫困项目中部署NEXIS,为每个项目增加卫星图像以捕捉先前未测量的环境效应修饰因子,从而为优化项目的后续迭代提供新颖、可解释且规范性的指导。

English abstract

Heterogeneous Treatment Effect (HTE) identification is crucial to explain the impact of an intervention and optimize our policies accordingly. Existing approaches trade expressivity for interpretability, but, if some active heterogeneity drivers are unmeasured, methods at both ends of this spectrum allow for spurious HTE characterization with no causal reading. In this work, we focus on controlled experiments and argue that an oracle HTE causal characterization via the latent interactors is now within reach, thanks to (i) more extensive pre-treatment measurements, i.e., multi-modal and multi-view, and (ii) scalable representations with minimal human supervision. We then re-frame HTE identification as a Markov-blanket discovery problem on a sufficient and aligned pre-treatment representation, and introduce Neural EXposure Interaction Search (NEXIS), an iterative procedure with provable and empirically validated consistent selection. We deploy NEXIS on two anti-poverty programs in Africa, augmenting each with satellite imagery capturing previously unmeasured environmental effect modifiers, leading to novel, interpretable and prescriptive guidelines to optimize the programs' next iterations.

2606.17006 2026-06-16 cs.SD cs.AI cs.LG cs.MM eess.AS New submission

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

TuneJury: 一种改进音乐生成偏好对齐的开放指标

Yonghyun Kim, Junwon Lee, Haiwen Xia, Yinghao Ma, Junghyun Koo, Koichi Saito, Yuki Mitsufuji, Chris Donahue

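Affiliations*: Carnegie Mellon University (卡内基梅隆大学); Sony AI (索尼AI); Georgia Tech (佐治亚理工学院); KAIST (韩国科学技术院); Peking University (北京大学); QMUL (伦敦玛丽女王大学)

AI summary: Proposes TuneJury, an open, instance-level pairwise reward model for text-to-music generation. Its predicted preference scores support data filtering and post-hoc calibration, and improve alignment at inference time, in latent optimization, and in post-training.

Comments 32 pages, 9 figures

Details
Chinese abstract (AI translation)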

我们引入了TuneJury,一个开放、实例级别的成对奖励模型,用于文本到音乐生成,它从文本提示和音频片段中预测音乐偏好分数。发布的检查点在公开的人类偏好标签上训练,涵盖竞技场风格(A vs. B)投票、度量对齐偏好对、众包成对比较和专家审美评分。两个片段之间的预测分数差在我们的保留测试集上校准良好,支持通过简单的分数阈值进行数据筛选。TuneJury泛化到保留测试对和分布外基准,在后一任务上与先前基线保持竞争力。对于训练后发布的生成器,我们引入了锚定校准,一种事后、每系统的Bradley-Terry校准,以显著优于从头再训练的数据效率恢复一致性。相同的冻结奖励在三个下游应用中驱动一致的奖励轴增益:推理时的最佳N选择、DITTO风格的潜在优化和专家迭代后训练。TuneJury可在https://github.com/yonghyunk1m/TuneJury获取。

English abstract

We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.
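Three of the uses named above (reading a score margin as a preference probability, filtering data by a margin threshold, and best-of-N selection) can be sketched as follows. This assumes the standard Bradley-Terry form, in which the probability that clip A is preferred over clip B is the logistic function of the score difference. The function names are hypothetical, the scores are made up, and nothing here reproduces the TuneJury code or its anchor calibration.

```python
import math


def preference_prob(score_a, score_b):
    """Bradley-Terry probability that A is preferred, given scalar scores."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def filter_pairs(scored_pairs, min_margin):
    """Keep only pairs whose absolute score margin reaches the threshold."""
    return [(a, b) for a, b in scored_pairs if abs(a - b) >= min_margin]


def best_of_n(candidates, score_fn):
    """Inference-time best-of-N: return the candidate with the highest score."""
    return max(candidates, key=score_fn)


# Hypothetical reward scores for three generated clips.
scores = {"clip_a": 0.2, "clip_b": 1.4, "clip_c": 0.9}

print(preference_prob(scores["clip_b"], scores["clip_a"]))
print(filter_pairs([(1.0, 0.2), (0.5, 0.4)], min_margin=0.5))
print(best_of_n(list(scores), scores.get))
```

A margin threshold trades dataset size for label confidence: larger margins correspond to preference probabilities further from 0.5.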

2606.16996 2026-06-16 cs.CV cs.AI cs.LG New submission

ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation

ActiveSAM: 图像条件类别剪枝实现快速准确的开放词汇分割

Tran Dinh Tien, Zhiqiang Shen

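Affiliations*: VILA Lab, Mohamed bin Zayed University of Artificial Intelligence (VILA实验室,穆罕默德·本·扎耶德人工智能大学)

AI summary: Proposes ActiveSAM, a training-free, zero-shot inference framework that uses image-conditional class pruning and a low-resolution preview to turn SAM 3 into an active-vocabulary segmenter, gaining about 1.4 mIoU on average across eight benchmarks with speedups of up to 5.5x.

Comments Preprint. Code is available at https://github.com/VILA-Lab/ActiveSAM

Details
Chinese abstract (AI translation)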

Segment Anything Model 3 (SAM 3) 为概念提示分割提供了强大的冻结骨干网络,但直接应用于开放词汇语义分割 (OVSS) 效率低下:全分辨率解码通常在整个数据集词汇表上运行,而每个图像只包含一小部分活跃类别。我们引入ActiveSAM,一种无需训练、零样本的推理框架,将SAM 3转化为主动词汇分割器。ActiveSAM首先规范化并扩展类别提示,然后从低分辨率存在预览中估计图像条件的活跃集。只有保留的类别使用冻结的SAM 3解码器进行桶式提示复用全分辨率解码。预览阶段仅使用类别存在证据,跳过不必要的分割头计算,而最终阶段应用边缘感知背景校准以抑制低置信度像素。ActiveSAM不需要目标数据集训练、权重更新或oracle类别存在标签。在八个OVSS基准上,ActiveSAM改善了无需训练的开放词汇语义分割的速度-准确率权衡,平均比当前最先进的SegEarth-OV3高出约+1.4 mIoU,同时在大型词汇数据集上运行速度最高提升5.5倍。ActiveSAM在模拟真实世界分布偏移的图像损坏下也表现出最强的鲁棒性,使其非常适合部署在噪声输入领域,如自动驾驶和具身AI。代码可在https://github.com/VILA-Lab/ActiveSAM获取。

English abstract

Segment Anything Model 3 (SAM 3) provides a strong frozen backbone for concept-prompted segmentation, but applying it directly to open-vocabulary semantic segmentation (OVSS) is inefficient: full-resolution decoding is typically run over the entire dataset vocabulary, whereas each image contains only a small active subset of classes. We introduce ActiveSAM, a training-free, zero-shot inference framework that turns SAM 3 into an active-vocabulary segmenter. ActiveSAM first canonicalizes and expands class prompts, then estimates an image-conditioned active set from a low-resolution presence preview. Only the retained classes are decoded at full resolution, using bucketed prompt multiplexing with the frozen SAM 3 decoder. The preview stage uses only class-presence evidence and skips unnecessary segmentation-head computation, while the final stage applies margin-aware background calibration to suppress low-confidence pixels. ActiveSAM requires no target-dataset training, no weight updates, and no oracle class-presence labels. Across eight OVSS benchmarks, ActiveSAM improves the speed-accuracy tradeoff of training-free open-vocabulary semantic segmentation, outperforming the current state-of-the-art SegEarth-OV3 by approximately +1.4 mIoU on average while running up to 5.5x faster on large-vocabulary datasets. ActiveSAM also demonstrates the strongest robustness under image corruption that simulates real-world distribution shift, making it well-suited for deployment in noisy-input domains such as autonomous driving and embodied AI. Code is available at https://github.com/VILA-Lab/ActiveSAM.
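The two-stage idea (estimate which classes are present, then decode only those, in buckets) can be sketched without any model code. The presence scores, the threshold, the bucket size, and both function names below are illustrative assumptions; the abstract does not specify how ActiveSAM sets them.

```python
def active_classes(presence_scores, threshold=0.3, min_keep=1):
    """Keep classes whose preview presence score reaches the threshold.

    Falls back to the top-scoring classes so the active set is never empty.
    """
    ranked = sorted(presence_scores.items(), key=lambda kv: kv[1], reverse=True)
    kept = [name for name, score in ranked if score >= threshold]
    if len(kept) < min_keep:
        kept = [name for name, _ in ranked[:min_keep]]
    return kept


def bucket(classes, size):
    """Split the retained classes into fixed-size groups for batched decoding."""
    return [classes[i:i + size] for i in range(0, len(classes), size)]


# Hypothetical low-resolution presence scores for a five-class vocabulary.
preview = {"road": 0.9, "car": 0.7, "tree": 0.4, "boat": 0.05, "plane": 0.02}

kept = active_classes(preview)
print(kept)
print(bucket(kept, size=2))
```

With a vocabulary of hundreds of classes and only a handful present per image, the full-resolution decoder runs on a small fraction of the prompts, which is where the reported speedup on large-vocabulary datasets would come from.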

2606.16991 2026-06-16 cs.CV cs.LG New submission

A Multi-Center Benchmark for Abdominal Disease Diagnosis and Report Generation from Non-Contrast CT

基于非增强CT的腹部疾病诊断与报告生成的多中心基准

Mariam Elbakry, Aliaa Sayed Sheha, Salma Hassan Tantawy, Aya Yassin, Concetto Spampinato, Karim Lekadir, Xiaomeng Li, Marawan Elbatel

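Affiliations*: Ain Shams University (艾因夏姆斯大学); The Hong Kong University of Science and Technology (香港科技大学); University of Catania (卡塔尼亚大学); Universitat de Barcelona (巴塞罗那大学)

AI summary: Proposes a multi-center benchmark that learns to synthesize contrast-enhanced CT findings from non-contrast CT for multi-organ abdominal disease diagnosis and automated report generation. Experiments indicate that non-contrast CT retains diagnostic signal, with an average AUC of 69.1% (internal) and 63.1% (external).

Comments Early Accept (top ~9%), MICCAI 2026

Details
Chinese abstract (AI translation)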

多期增强CT(CECT)广泛用于腹部病变表征,但存在造影剂肾病风险、增加采集负担并加重放射科医生工作量。为解决这些问题,我们引入了一个新的多中心基准,用于多器官腹部疾病诊断和自动放射报告生成,该基准学习从单期非增强CT(NCCT)合成增强CT发现。为此,我们从两个中心收集了配对NCCT-CECT研究及其对应的增强放射报告的大规模数据集,分为内部集和外部验证队列。在统一评估协议下,我们对五种当代深度学习架构进行了基准测试,涵盖胸部专用、腹部专用和通用多模态领域。大量实验表明,NCCT保留了诊断信号,在内部队列和外部队列上分别实现了平均多器官AUC 69.1%和63.1%。通过公开发布该数据集和标准化基准,本研究旨在促进未来对更安全、资源高效且全球可及的免造影腹部成像工作流程的研究。代码地址:https://github.com/xmed-lab/TriALS-Report。

English abstract

Multiphasic contrast-enhanced CT (CECT) is widely used for abdominal lesion characterization, yet it carries inherent risks of contrast-induced nephropathy, escalates acquisition burden, and heavily contributes to radiologist workload. To address these challenges, we introduce a novel multi-center benchmark for multi-organ abdominal disease diagnosis and automated radiology report generation, which learns to synthesize contrast-enhanced findings from single-phase non-contrast CT (NCCT). To support this, we curated a large-scale dataset of paired NCCT-CECT studies and their corresponding contrast-enhanced radiology reports from two centers, partitioned into internal sets and an external validation cohort. Under a unified evaluation protocol, we benchmarked five contemporary deep learning architectures encompassing chest-specific, abdomen-specific, and general-purpose multimodal domains. Extensive experiments demonstrate that NCCT retains diagnostic signals, achieving an average multi-organ AUC of 69.1% on the internal cohort and 63.1% on the external cohort, respectively. By releasing this dataset and standardized benchmark publicly, this study aims to catalyze future research into safer, resource-efficient, and globally accessible contrast-free abdominal imaging workflows. Code is available at: https://github.com/xmed-lab/TriALS-Report.

2606.16990 2026-06-16 cs.LG math.AT New submission

Analytic Torsion and Spectral Gap Capture Persistent-Laplacian Performance

解析挠率和谱间隙捕捉持久拉普拉斯算子的性能

Jernej Grlj, Aaron D. Lauda

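Affiliations*: University of Southern California (南加州大学)

AI summary: Proposes replacing the full spectrum with a compact spectral representation built from three invariants (Betti numbers, the spectral gap, and analytic torsion), which matches or exceeds full-spectrum performance on several datasets while substantially reducing computational overhead and avoiding high-frequency noise.

Comments 13 pages

Details
Chinese abstract (AI translation)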

虽然持久拉普拉斯算子(PL)比持久同调提供更丰富的数据几何表示,但利用其全特征谱进行学习任务常因高维性和不同过滤尺度下的“变长”问题而受阻。我们提出一种紧凑谱表示,将持久拉普拉斯算子提炼为三个数学基础不变量:贝蒂数、谱间隙和解析挠率。在包括MNIST、QM-3D和SKEMPI WT的基准数据集上,我们证明该降维特征空间捕捉了全谱的基本预测信号,在某些情况下甚至优于全谱,同时显著降低计算开销并防止高频特征值引入的噪声。我们的结果表明,这些不变量提供了谱几何与拓扑学习之间原则性的固定长度接口。

English abstract

While persistent Laplacians (PL) offer a richer geometric representation of data than persistent homology, utilizing their full eigenspectrum for learning tasks is often hampered by high dimensionality and the "varying length" problem across different filtration scales. We propose a compact spectral representation that distills the persistent Laplacian into three mathematically grounded invariants: Betti numbers, the spectral gap, and analytic torsion. Across benchmark datasets including MNIST, QM-3D, and SKEMPI WT, we demonstrate that this reduced feature space captures the essential predictive signal of the full spectrum, and in some cases outperforms it, while significantly reducing computational overhead and preventing the noise introduced by higher-frequency eigenvalues. Our results suggest that these invariants provide a principled, fixed-length interface between spectral geometry and topological learning.
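How a variable-length eigenspectrum collapses to a fixed-length feature can be sketched as follows, assuming the eigenvalues of one positive semi-definite Laplacian are already available. The Betti number is read as the multiplicity of the zero eigenvalue and the spectral gap as the smallest positive eigenvalue. For the third entry the sketch returns the log pseudo-determinant (the sum of logs of the positive eigenvalues), which is the per-dimension ingredient of torsion-type quantities; the abstract does not give the paper's exact definition of analytic torsion, so that last feature is a stand-in. The function name and tolerance are hypothetical.

```python
import math


def spectral_features(eigenvalues, tol=1e-9):
    """Reduce a Laplacian spectrum to (betti, spectral_gap, log_pseudo_det).

    Assumes a positive semi-definite Laplacian; eigenvalues at or below
    `tol` are treated as zero.
    """
    positive = sorted(v for v in eigenvalues if v > tol)
    betti = len(eigenvalues) - len(positive)   # dimension of the kernel
    gap = positive[0] if positive else 0.0     # smallest non-zero eigenvalue
    log_pdet = sum(math.log(v) for v in positive)
    return betti, gap, log_pdet


# Graph Laplacian spectrum of a path on three vertices: 0, 1, 3.
print(spectral_features([0.0, 1.0, 3.0]))
```

The output has three entries regardless of how many eigenvalues come in, which is the fixed-length property the abstract contrasts with the full spectrum.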

2606.16987 2026-06-16 cs.AI New submission

Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification

基于共识的智能体大语言模型框架用于协调制度海关编码分类

Truong Thanh Hung Nguyen, Khanh Van Quynh Nguyen, Hoang-Loc Cao, Tri Duong, Phuc Ho, Van Pham, Loc Nguyen, Hung Cao

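Affiliations*: Analytics Everywhere Lab, University of New Brunswick (新不伦瑞克大学无处不在分析实验室); University of Economics Ho Chi Minh City (胡志明市经济大学)

AI summary: Proposes a multi-agent LLM framework that tackles Canadian 10-digit HTS code classification through information retrieval, semantic retrieval, evidence-grounded reasoning, consensus-based validation, and hierarchical voting. Evaluation on 3,300 records supports the need for evidence-driven workflows with human involvement.

Comments Accepted at the 3rd International Conference of Resilience by Technology and Design (RTD 2026)

Details
Chinese abstract (AI translation)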

准确的协调制度(HTS)编码分类对于海运物流中的清关、关税评估、贸易统计和法规合规至关重要。然而,精确的HTS分类仍然具有挑战性,因为产品描述通常简短、不完整或模糊,而正确的分类依赖于层级关税结构、法律注释和特定司法管辖区的规则。本文提出了一种智能体大语言模型(LLM)框架,用于智慧港口和海运物流环境中的加拿大10位HTS编码分类。该框架集成了多智能体信息检索、官方关税文件的语义检索、基于证据的推理、基于共识的验证、跨层级编码组件的逐元素投票、置信度估计以及人工介入升级。我们在一个包含3300条领域专家标注的产品记录(来自物流和配送场景)的私有数据集上评估了该框架。实验结果表明,即使对于先进的LLM,精确的10位分类仍然困难,性能从粗略的章节级预测下降到细粒度的关税和统计后缀分配。这些发现表明,需要基于证据、不确定性感知和以人为中心的分类工作流程,而不是完全自主的单步预测。所提出的框架支持更可解释、可问责和合规导向的HTS分类,适用于海运物流和智慧港口操作。我们的代码可在https://github.com/Analytics-Everywhere-Lab/hts获取。

English abstract

Accurate Harmonized Tariff Schedule (HTS) code classification is essential for customs clearance, duty assessment, trade statistics, and regulatory compliance in maritime logistics. However, exact HTS classification remains challenging because product descriptions are often short, incomplete, or ambiguous, while correct classification depends on hierarchical tariff structures, legal notes, and jurisdiction-specific rules. This paper proposes an agentic large language model (LLM) framework for Canadian 10-digit HTS code classification in smart-port and maritime logistics environments. The framework integrates multi-agent information retrieval, semantic retrieval over official tariff documents, evidence-grounded reasoning, consensus-based validation, element-wise voting across hierarchical code components, confidence estimation, and human-in-the-loop escalation. We evaluate the framework on a private dataset of 3,300 domain-expert-labeled product records collected from logistics and delivery contexts. Experimental results show that exact 10-digit classification remains difficult even for advanced LLMs, with performance decreasing from coarse chapter-level prediction to fine-grained tariff and statistical suffix assignment. These findings demonstrate the need for evidence-grounded, uncertainty-aware, and human-centered classification workflows rather than fully autonomous single-step prediction. The proposed framework supports more interpretable, accountable, and compliance-oriented HTS classification for maritime logistics and smart-port operations. Our code is available at https://github.com/Analytics-Everywhere-Lab/hts.
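Element-wise voting over hierarchical code components, with escalation when agreement is low, can be sketched as below. The sketch assumes the 10-digit code is read at prefix lengths 2, 4, 6, 8, and 10 (chapter, heading, subheading, tariff item, statistical suffix) and that each level is voted on only among predictions consistent with the prefix already chosen. The function name, the agreement threshold, and the example codes are illustrative and are not taken from the authors' repository or checked against the tariff.

```python
from collections import Counter

# Prefix lengths for chapter, heading, subheading, tariff item, suffix.
LEVELS = (2, 4, 6, 8, 10)


def elementwise_vote(predictions, min_agreement=0.6):
    """Vote level by level; return (code, per-level agreement, escalate)."""
    pool = list(predictions)
    prefix = ""
    agreement = []
    for end in LEVELS:
        counts = Counter(code[:end] for code in pool)
        prefix, votes = counts.most_common(1)[0]
        # Agreement is measured against all agents, not just the survivors.
        agreement.append(votes / len(predictions))
        pool = [code for code in pool if code.startswith(prefix)]
    escalate = min(agreement) < min_agreement  # hand off to a human reviewer
    return prefix, agreement, escalate


# Hypothetical outputs of four agents for one product description.
agent_codes = ["8471300000", "8471300000", "8471410010", "8473300000"]
code, agreement, escalate = elementwise_vote(agent_codes)
print(code, agreement, escalate)
```

In this example agreement falls from 1.0 at chapter level to 0.5 at the finer levels, mirroring the coarse-to-fine drop in performance the abstract reports, and the record is flagged for human review.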

2606.16979 2026-06-16 cs.LG New submission

Scalable Pairwise Kernel Learning with Stochastic Vec Trick

可扩展的成对核学习与随机Vec技巧

Napsu Karmitsa, Tapio Pahikkala, Antti Airola

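Affiliations*: Department of Computing, University of Turku (图尔库大学计算系)

AI summary: Proposes SPaiK, which uses the stochastic generalized vec trick (sGVT) to scale pairwise kernel learning to large datasets, and evaluates it against state-of-the-art pairwise learning methods on seven drug-target affinity datasets.

Details
Chinese abstract (AI translation)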

成对学习是一种特殊形式的监督学习,专注于预测对象对的结果。在这项工作中,我们引入了SPaiK,一种针对成对设置的新可扩展核学习方法。我们的方法保留了核方法的表达能力,同时大幅降低了计算和内存需求。关键创新是随机广义vec技巧(sGVT),它是稀疏Kronecker积乘法算法的随机扩展,能够使用成对核进行高效的大规模训练。通过结合sGVT,SPaiK使得将基于核的成对学习应用于以前无法达到的大规模数据集成为可能。我们在七个真实的药物-靶标亲和力数据集上评估了SPaiK的性能,并将结果与成对学习中的最新方法进行了比较。

English abstract

Pairwise learning is a specialized form of supervised learning that focuses on predicting outcomes for pairs of objects. In this work, we introduce SPaiK, a new scalable kernel learning method tailored for pairwise settings. Our approach preserves the expressive power of kernel methods while substantially reducing computational and memory requirements. The key innovation is the stochastic generalized vec trick (sGVT), a stochastic extension of the sparse Kronecker product multiplication algorithm, which enables efficient large-scale training with pairwise kernels. By incorporating sGVT, SPaiK makes it possible to apply kernel-based pairwise learning to datasets of a size previously out of reach. We evaluate the performance of SPaiK on seven real-world drug-target affinity datasets and compare the results with state-of-the-art methods in pairwise learning.
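The identity underlying vec-trick methods is (A ⊗ B) vec(X) = vec(B X Aᵀ), with vec stacking columns: the product with a Kronecker matrix is obtained without ever forming that matrix. The pure-Python check below illustrates only this classical dense identity; it is not the generalized (sparse, index-based) or stochastic variant that SPaiK introduces, and all helper names are hypothetical.

```python
def matmul(P, Q):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def transpose(M):
    return [list(row) for row in zip(*M)]


def vec(M):
    """Stack the columns of M into one flat list (column-major order)."""
    return [M[i][j] for j in range(len(M[0])) for i in range(len(M))]


def kron(A, B):
    """Explicit Kronecker product, built here only for comparison."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
X = [[1, 2], [3, 4]]

# Slow path: materialise the 4x4 Kronecker matrix, then multiply.
slow = [sum(k * v for k, v in zip(row, vec(X))) for row in kron(A, B)]

# Fast path: two small matrix products, no Kronecker matrix.
fast = vec(matmul(matmul(B, X), transpose(A)))

print(slow)
print(fast)
```

For n-by-n factors the slow path touches n^4 entries while the fast path costs two n-by-n products, which is why pairwise kernels (Kronecker products of a drug kernel and a target kernel, for example) become tractable.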

2606.16978 2026-06-16 cs.RO cs.LG cs.SY eess.SY New submission

Task-Error Residual Learning for Real-Robot Five-Ball Juggling

任务误差残差学习用于真实机器人五球杂耍

Kai Ploeger, Jan Peters

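Affiliations*: Technical University of Darmstadt (达姆施塔特工业大学); German Research Center for AI (DFKI) (德国人工智能研究中心); Hessian Center for Artificial Intelligence (hessian.AI) (黑森州人工智能中心)

AI summary: Proposes residual learning with directional task-error supervision and error-model-driven sample selection, achieving stable three-, four-, and five-ball juggling on Barrett WAM arms. After the first attempt drops, task error decreases monotonically with no further failures.

Comments Submitted to the 2026 International Symposium on Robotics Research (ISRR)

Details
Chinese abstract (AI translation)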

对于改进现有行为的残差学习,样本效率取决于两个因素:每次试错返回的信息量,以及学习器使用这些信息的效率。强化学习的标准标量奖励携带的信息远少于定义任务的方向性任务误差。随机探索进一步丢弃了每次试错返回的信息。通过使用方向性任务误差监督和驱动样本选择的任务误差模型进行残差学习,我们在拟人化Barrett WAM机械臂上实现了稳定的三、四、五球杂耍。尽管通过简单、理想化的堆栈进行规划和控制,系统从第二次尝试开始收敛。第一次尝试失败后,任务误差单调递减,没有进一步的失败。相比之下,五球杂耍通常需要人类多年的练习。我们在两个三元轴上比较残差学习器:学习反馈中的方向性信息和分析先验的承诺,涵盖牛顿式雅可比更新、复合贝叶斯优化和随机搜索方法。两个轴都被证明是必要的:方向性反馈或信息性先验单独都不足够,而结合它们的最简单方法——固定雅可比牛顿更新——是最可靠的。学习到的残差能够容忍大量的先验失准和退化的关节跟踪,主要影响收敛速度。因此,真实机器人上残差学习的瓶颈是监督信号的信息内容以及学习器如何使用它,而不是周围堆栈的精度。所有实验的视频文档可在 https://kai-ploeger.com/residual-juggling 获取。

English abstract

For residual learning that refines existing behavior, sample efficiency depends on two things: how much information each rollout returns, and how efficiently the learner uses that information. Reinforcement learning's standard scalar reward carries far less information than the directional task error that defines the task. Random exploration further discards whatever information each rollout returns. Through residual learning with directional task-error supervision and a task error model that drives sample selection, we achieve stable three-, four-, and five-ball juggling on anthropomorphic Barrett WAM arms. Despite planning and controlling through a simple, idealized stack, the system converges from the second attempt. The first attempt drops, after which task error decreases monotonically without further failures. In comparison, five-ball juggling typically takes humans years of practice. We compare residual learners across two ternary axes, the directional information in the learning feedback and the commitment of the analytic prior, spanning Newton-style Jacobian updates, Composite Bayesian Optimization, and stochastic search methods. Both axes prove necessary: neither directional feedback nor an informative prior suffices alone, and the simplest method that combines them, a fixed-Jacobian Newton update, is the most reliable. The learned residual tolerates substantial prior misalignment and degraded joint tracking, affecting mainly convergence speed. The bottleneck for residual learning on real robots is therefore the information content of the supervision signal and how the learner uses it, not the accuracy of the surrounding stack. Video documentation of all experiments is available at https://kai-ploeger.com/residual-juggling.
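A fixed-Jacobian Newton update of the kind named above can be sketched on a toy problem: after each rollout the parameters are corrected by the measured task-error vector, mapped through the inverse of a Jacobian taken once from an analytic prior and never re-estimated. Everything specific below is an assumption made for illustration: the two-parameter linear "real system", its mismatch with the prior, and the function names. The robot, its task error, and the sample-selection model from the paper are not represented.

```python
import math


def solve2(J, e):
    """Solve the 2x2 linear system J x = e by Cramer's rule."""
    (a, b), (c, d) = J
    det = a * d - b * c
    return ((d * e[0] - b * e[1]) / det, (a * e[1] - c * e[0]) / det)


def task_error(theta):
    """Stand-in for one rollout: a linear system the learner cannot see."""
    true_map = ((1.2, 0.1), (-0.1, 0.9))  # deliberately differs from the prior
    target = (1.0, 2.0)
    return (true_map[0][0] * theta[0] + true_map[0][1] * theta[1] - target[0],
            true_map[1][0] * theta[0] + true_map[1][1] * theta[1] - target[1])


def refine(theta, prior_jacobian, rollouts):
    """Apply theta <- theta - J_prior^{-1} e(theta) once per rollout."""
    error_norms = []
    for _ in range(rollouts):
        e = task_error(theta)
        error_norms.append(math.hypot(e[0], e[1]))
        step = solve2(prior_jacobian, e)
        theta = (theta[0] - step[0], theta[1] - step[1])
    return theta, error_norms


prior = ((1.0, 0.0), (0.0, 1.0))  # idealized analytic prior
theta, error_norms = refine((0.0, 0.0), prior, rollouts=10)
print(error_norms)
```

Each rollout returns a full error vector rather than a scalar reward, so one measurement fixes the direction of the correction; the misaligned prior only slows the contraction, in line with the abstract's remark that prior misalignment mainly affects convergence speed.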

2606.16974 2026-06-16 cs.AI New submission

The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers

拥抱开放科学:十年AI研究与56 800篇会议论文的分析

Kevin L Coakley, Thijs Snelleman, Holger Hoos, Odd Erik Gundersen

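Affiliations*: Norwegian University of Science and Technology (挪威科技大学); University of California San Diego (加州大学圣迭戈分校); RWTH Aachen University (亚琛工业大学); Leiden University (莱顿大学)

AI summary: Analyses 56,800 papers from five leading AI conferences over 2014-2024 and finds that documentation practices improved: the share of papers sharing both code and data rose from 11% to 64%, and estimated reproducibility rose from 28% to 64%. The improvement predates the introduction of reproducibility checklists, reflecting the broader open-science movement.

Details
Chinese abstract (AI translation)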

可重复性危机促使AI研究社区改进文档实践。多项研究已指出方法论问题,作为回应,该领域最具影响力的会议引入了可重复性检查清单。我们试图通过评估过去十年五大领先AI会议的所有已发表论文,了解文档实践是否随时间改变。确定了七个可重复性变量,经过质量保证并用于分析56,800篇出版物。我们的分析显示,在2014年至2024年期间,文档实践有所改善;同时共享代码和数据的论文增加了近六倍,从11%增至64%。基于先前研究的实证可重复性率,我们估计——根据文档实践推断,而非直接测试——可重复性从2014年的28%增加到2024年的64%。文档实践的改善早于可重复性检查清单的引入,表明这些变化反映了更广泛的开放科学运动,而非对正式要求的直接响应。

English abstract

The reproducibility crisis has directed the AI research community toward improving documentation practices. Several studies have identified methodological issues, and in response, the most impactful venues in the field have introduced reproducibility checklists. We seek to understand whether documentation practices have changed over time by assessing all published papers at five leading AI conferences over the past decade. Seven reproducibility variables were identified, quality-assured and used to analyse 56 800 publications. Our analysis reveals that in the period 2014 to 2024, documentation practices have improved; papers sharing both code and data increased nearly sixfold, from 11% to 64%. Building on empirical reproducibility rates from a prior study, we estimate (inferred from documentation practices, not direct testing) that reproducibility increased from 28% in 2014 to 64% in 2024. Improvements in documentation practices predate the introduction of reproducibility checklists, suggesting these changes reflect a broader movement toward open science rather than a direct response to formal requirements.

2606.16972 2026-06-16 cs.RO cs.SY eess.SY New submission

When Should a Robot Replan? Regret-Guided Update Scheduling in Time-Varying MDPs

机器人何时应重新规划?时变MDP中的遗憾引导更新调度

Negin Musavi, Gokul Puthumanaillam, Ruben Hernandez, William Schafer, Melkior Ornik

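Affiliations*: University of Illinois Urbana–Champaign (伊利诺伊大学厄巴纳-香槟分校)

AI summary: For robots in time-varying environments whose energy and compute budgets rule out continual re-planning, proposes an online update-scheduling rule guided by dynamic regret, which outperforms other budgeted baselines in simulation and on hardware.

Details
Chinese abstract (AI translation)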

在非平稳环境中运行的机器人必须随着动态漂移不断调整其策略,但机载能量和计算预算限制了全状态估计和重规划步骤的执行频率。这引出一个问题:在时间轴上,机器人何时应花费其有限的预算?我们在具有已知转移漂移率边界的时变马尔可夫决策过程(TVMDP)中形式化该问题。我们将执行建模为一种“跳过更新”方案,即在选定的更新时间点,智能体通过最大似然估计转移核并计算有限时域策略,而在更新间隔之间,则在传播的状态估计下重用该策略。我们分析了该方案的动态遗憾,并展示了它如何根据TVMDP的性质和跳过长度在跳过区间内增长;由此产生的界限通过一种在线、遗憾引导的更新规则回答了开头的问题,该规则自适应地分配预算。我们在具有时变滑移动力学的模拟火星车导航任务和室内障碍物场中的Crazyflie四旋翼飞行器上评估了该规则。自适应分配优于其他预算基线。

English abstract

Robots operating in non-stationary environments must continually adapt their policies as the dynamics drift, but onboard energy and compute budgets cap how often a full state estimation and re-planning step can be performed. This raises a question: when, along a horizon, should a robot spend its limited budget? We formulate this problem in time-varying Markov decision processes (TVMDPs) with a known bound on the rate of transition drift. We model execution as a skip-update scheme in which, at chosen update times, the agent estimates the transition kernel by maximum likelihood and computes a finite-horizon policy, and between updates reuses this policy under a propagated state estimate. We analyze the dynamic regret of this scheme and show how it grows during skip intervals in terms of the properties of the TVMDP and the skip lengths; the resulting bound answers the opening question via an online, regret-guided update rule that allocates the budget adaptively. We evaluate the rule in a simulated Mars-rover navigation task with time-varying slip dynamics and on a Crazyflie quadrotor in indoor obstacle fields. Adaptive allocation outperforms other budgeted baselines.

2606.16969 2026-06-16 cs.SD cs.AI eess.AS New submission

Probing Low Frame Rate Degradation in Neural Audio Codecs

探测神经音频编解码器中的低帧率退化

Alex Gichamba, Moise Busogi

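Affiliations*: Carnegie Mellon University Africa (卡内基梅隆大学非洲校区)

AI summary: Controlled frame-rate ablations indicate that the low-frame-rate quality cliff stems from a suboptimal training configuration rather than a fundamental barrier. Once it is corrected, quality degrades smoothly down to 3.1 Hz and 1.6 Hz.

Comments Accepted at Interspeech 2026

Details
Chinese abstract (AI translation)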

神经音频编解码器中的低帧率对于自回归语音合成具有吸引力,因为生成成本与序列长度线性相关。最近的研究表明,编解码器可以在12.5 Hz及以下运行,但低帧率退化的机制仍未被充分理解。我们通过受控的帧率消融实验来研究这些机制。我们重现了先前工作中报告的6.25 Hz处的质量悬崖,并评估了候选解释:音素冲突和码本饱和,两者均未显示出根本性障碍的证据。该悬崖实际上是由次优的训练配置引起的:训练期间固定的剪辑时长在低帧率下产生过少的令牌,使解码器缺乏令牌间上下文。一旦修正,WER随音素负载平滑退化,直至3.1 Hz和1.6 Hz,这表明低帧率编解码器的推理时效率增益比先前假设的更容易实现。

English abstract

Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate degradation remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce a quality cliff at 6.25 Hz reported in previous works and evaluate candidate explanations: phonemic collisions and codebook saturation, neither of which shows evidence of a fundamental barrier. The cliff is instead caused by suboptimal training configuration: fixed clip duration during training yields too few tokens at low frame rates, starving the decoder of inter-token context. Once corrected, WER degrades smoothly with phonemic load down to 3.1 Hz and 1.6 Hz, suggesting the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.
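The training-configuration effect described above comes down to simple arithmetic: with a fixed clip duration, the number of tokens the decoder sees per training example shrinks in proportion to the frame rate. The sketch below makes that concrete. The 4-second clip length is a hypothetical value (the abstract does not state the durations used), and the rates 3.125 Hz and 1.5625 Hz are assumed to be the exact values behind the rounded 3.1 Hz and 1.6 Hz.

```python
def tokens_per_clip(clip_seconds, frame_rate_hz):
    """Whole codec frames (tokens per codebook) in one training clip."""
    return int(clip_seconds * frame_rate_hz)


def clip_seconds_for(tokens, frame_rate_hz):
    """Clip duration needed to keep a fixed token count at a given rate."""
    return tokens / frame_rate_hz


CLIP_SECONDS = 4.0  # hypothetical fixed training clip length

for rate in (50.0, 12.5, 6.25, 3.125, 1.5625):
    print(rate, tokens_per_clip(CLIP_SECONDS, rate))

# Holding the 6.25 Hz token budget constant at 1.5625 Hz needs a longer clip.
print(clip_seconds_for(tokens_per_clip(CLIP_SECONDS, 6.25), 1.5625))
```

Under these assumed numbers a clip that yields 200 tokens at 50 Hz yields only 6 at 1.5625 Hz, which is the kind of shortage of inter-token context the abstract identifies; scaling clip duration inversely with frame rate restores the token count.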

2606.16961 2026-06-16 cs.LG q-fin.CP New submission

Beyond the Smile: A Hybrid Convolutional VAE for Crypto Volatility Surfaces

Sadanand Singh, Allam Reddy, Manan Chopra

Affiliations * Jasper Research, USA

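AI summary Proposes a predictor that combines a hybrid convolutional VAE with a quadratic smile re-fit, achieving low RMSE on BTC and ETH options data, substantially outperforming the purely parametric approach, and eliminating calendar and butterfly arbitrage.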

Details
English abstract

We present a convolutional variational autoencoder for cryptocurrency implied-volatility surfaces, together with a deployable predictor that combines it with a quadratic smile re-fit through a deterministic per-tenor routing rule. Trained on 6,034 fully-filled hourly Binance Options surfaces of BTC and ETH spanning May-October 2023 and parameterised on a common $6 \times 7$ tenor-delta grid, the model attains a hidden-cell surface-completion RMSE in the 0.94-1.56 vol-point range across both markets and mask rates of 10-50%. The hybrid predictor attains 0.83 vol points at 50% masking against 7.00 for the smile re-fit alone, an eightfold reduction obtained at no additional inference cost. Under structurally-correlated hole patterns that emulate the withdrawal of an entire tenor of strikes, the smile re-fit incurs 9.6-13.1 vol points of error while the learned model remains at 1.5-1.9, isolating a regime in which the generative model is the only viable predictor. Joint training on BTC and ETH improves the in-distribution model on both markets by 9-27% relative to the better-performing single-symbol counterpart, indicating a substantially shared vol-surface manifold across the two largest cryptocurrencies over the observation window. The hybrid is calendar- and butterfly-arbitrage-free at the listed strikes, a property that the parametric smile re-fit alone fails to maintain at high mask rates. The per-snapshot reconstruction error of the trained model flags the late-October ETF-anticipation rally and the August 17, 2023 flash crash as elevated-error periods without supervision. All training and evaluation infrastructure is released to support reproducible follow-on work.
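To make the parametric baseline concrete, here is a minimal sketch of a quadratic smile re-fit for a single tenor: fit vol = a + b·x + c·x² to the observed cells by least squares, then evaluate the fit at the masked cells. It covers only the re-fit half of the hybrid, not the VAE or the routing rule. The x-coordinates, the example vols, and all function names are illustrative assumptions; the abstract does not say which variable the smile is fitted against or how cells are weighted.

```python
# Quadratic smile re-fit for one tenor, standard library only.
from typing import List, Optional, Sequence


def det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Least-squares [a, b, c] via the normal equations and Cramer's rule."""
    if len(xs) < 3:
        raise ValueError("need at least 3 observed cells to fit a quadratic")
    s = [sum(x ** k for x in xs) for k in range(5)]  # power sums of x
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    a_mat = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    d = det3(a_mat)
    coeffs = []
    for col in range(3):
        m = [row[:] for row in a_mat]
        for r in range(3):
            m[r][col] = t[r]  # replace one column with the right-hand side
        coeffs.append(det3(m) / d)
    return coeffs


def refit_smile(grid_x: Sequence[float],
                vols: Sequence[Optional[float]]) -> List[float]:
    """Return vols with each None (masked cell) replaced by the fitted value."""
    obs = [(x, v) for x, v in zip(grid_x, vols) if v is not None]
    a, b, c = fit_quadratic([x for x, _ in obs], [v for _, v in obs])
    return [v if v is not None else a + b * x + c * x * x
            for x, v in zip(grid_x, vols)]


if __name__ == "__main__":
    grid_x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]  # 7 hypothetical delta buckets
    vols = [66.5, None, 51.5, 50.0, 52.5, None, 69.5]  # made-up vol points
    print(refit_smile(grid_x, vols))
```

With fewer than three observed cells in a tenor the quadratic is under-determined, so this sketch simply raises an error. That is the kind of situation the abstract's whole-tenor withdrawal pattern describes, where a per-tenor parametric fit has little to work with.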

2606.16960 2026-06-16 cs.CV New submission

SurroundNEXO: Ego-Centric Metric Bridging for Spatially Consistent Geometry in Autonomous Driving

Shuai Yuan, Runxi Tang, Yuzhou Ji, Fudong Ge, Hanshi Wang, Yifei Wang, Xianming Zeng, Jianyun Xu, Xingliang Liu, Yanfeng Wang, Zhipeng Zhang

Affiliations * School of Artificial Intelligence, Shanghai Jiao Tong University; Hello Inc.

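AI summary Proposes the SurroundNEXO framework, which uses ego-centric geometry (Ego-Ray Positional Encoding) and sparse LiDAR metric anchors to address metric depth prediction and spatial consistency under low multi-camera overlap, with substantial gains on multiple benchmarks.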

Details
English abstract

Modern autonomous driving depends on accurate metric 3D understanding for perception, reconstruction, and planning, which in turn requires reliable multi-camera depth prediction. However, the outward-facing nature of vehicle-mounted surround-view camera rigs inherently limits visual overlap across views, challenging the correspondence-based assumptions that underpin conventional multi-view geometry. To bridge this gap, we present SurroundNEXO, named after the Spanish word nexo for a geometric link, a low-overlap multi-camera metric depth framework that grounds cross-view reasoning in ego-centric geometry rather than dense visual correspondences. Instead of directly enforcing early global fusion, SurroundNEXO first assigns image tokens globally comparable ego-frame viewing directions through Ego-Ray Positional Encoding, then uses sparse LiDAR measurements as metric anchors to propagate absolute scale cues, and finally expands feature interaction progressively from view-local modeling to decomposed spatio-temporal reasoning and global integration. This design enables metric-scale depth prediction with improved spatial consistency across weakly overlapping cameras. Across low-overlap autonomous driving benchmarks, including NuScenes, Waymo and DDAD, SurroundNEXO reduces single-view error by 33.2%, improves cross-view consistency by 10.5%, and enhances metric reconstruction quality by 25.6% compared with SOTA methods. It further remains robust under extremely sparse depth prompts and exhibits strong zero-shot generalization to unseen camera layouts.