arXivDaily arXiv Daily Academic Digest, updated Monday through Friday
2605.25299 2026-05-29 cs.CV cs.LG

A Principled Self-Referenced Early Stopping Approach for Deep Image Prior

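Chaoyan Huang, Cheng-Han Huang, Ismail R. Alkhouri, Rongrong Wang

AI summary To address overfitting in Deep Image Prior (DIP), the paper proposes an overfitting detection framework based on constructing pseudo self-referenced images, yielding an early stopping method that does not require noise-level estimation.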

Comments 35 pages, 10 figures, 14 tables

Abstract

Recently, Deep Image Prior (DIP) has demonstrated strong capabilities for solving inverse imaging problems (IIPs) by optimizing a randomly initialized convolutional neural network in a training-data-free regime. However, DIP suffers from overfitting to noisy measurements due to network over-parameterization, making early stopping (ES) essential. The most successful ES method tracks fluctuations in the running variance of the network output to detect overfitting. However, in many applications, these fluctuations may appear prematurely, leading to unstable reconstructions. In this paper, we first show that nearly optimal DIP early stopping can be achieved when two independent noisy copies of the degraded image are available. Motivated by this observation, and since obtaining two fully independent copies is infeasible, we propose an overfitting detection framework based on constructing pseudo self-referenced images, resulting in three IIP-specific algorithms. Our approach is further supported by theoretical results on single-reference validation, pseudo-validation estimation, and the impact of shared noise. Across different IIPs, ranging from natural image restoration to medical image reconstruction, and under varying noise levels and noise types, our methods consistently outperform existing DIP early stopping approaches, all without requiring an accurate estimate of the noise level.
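
The two-copy observation in this abstract can be illustrated with a toy stopping rule. The sketch below is not the paper's algorithm. It assumes a hypothetical list `held_out_errors` holding, per iteration, the error between the network output (fitted to one noisy copy) and a second, independent noisy copy, and it applies a plain patience rule with a hypothetical `patience` parameter.

```python
def early_stop_index(held_out_errors, patience=3):
    """Return the iteration with the lowest held-out error seen before
    `patience` consecutive non-improving iterations occur."""
    best_idx = 0
    best_err = float("inf")
    stale = 0
    for i, err in enumerate(held_out_errors):
        if err < best_err:
            # New best: remember it and reset the stale counter.
            best_idx, best_err, stale = i, err, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_idx


# Toy curve: the error against the second copy falls, then rises
# once the network starts fitting the noise of the first copy.
errors = [1.00, 0.60, 0.40, 0.35, 0.37, 0.41, 0.50, 0.30]
print(early_stop_index(errors, patience=3))  # 3
```

On this toy curve the rule stops at iteration 3; the later dip at index 7 is never reached because the loop exits after three non-improving iterations. A larger `patience` trades extra iterations for robustness to such dips.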

2605.25297 2026-05-29 cs.CL cs.AI cs.LG

Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction

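Hangxuan Li, Renjun Jia, Xuezhang Wu, Yunjie Qian, Zeqi Zheng, Xianling Zhang

AI summary Proposes Eureka, a framework that treats feature engineering as an agentic code generation problem and automatically generates executable feature code through three stages (an Expert Agent, an LLM Feature Factory, and a Self-Evolving Alignment Engine), significantly improving performance on 7 public benchmarks in healthcare, finance, and social domains and on cloud GPU resource demand prediction at Alibaba Cloud.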

Comments accepted at NeurIPS 2025 Workshop, DASFAA 2026 (International Conference on Database Systems for Advanced Applications)

Journal ref Database Systems for Advanced Applications (DASFAA 2026), Lecture Notes in Computer Science, vol. 16540, pp. 528-540, Springer

Abstract

Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability across applications. We define feature engineering as an agentic code generation problem: features are not static data transformations, but executable programs that can be generated, evaluated, and iteratively improved. We present Eureka, an LLM-driven framework with three stages. (1) An Expert Agent, fine-tuned via SFT on domain knowledge, produces structured feature design plans in JSON format. (2) An LLM Feature Factory translates each plan into executable Python code through chain-of-thought reasoning, turning feature hypotheses into runnable programs. (3) A Self-Evolving Alignment Engine uses Reinforcement Learning (GRPO) with dual-channel reward (metric-based utility + semantic alignment) to enhance code quality. By expressing features as programs, the learned generation patterns can transfer across domains. Evaluated on 7 public benchmarks in healthcare, finance, and social domains, Eureka consistently outperforms both traditional AutoFE and LLM-based baselines. We further demonstrate Eureka's effectiveness on cloud GPU resource demand prediction at Alibaba Cloud, where Eureka improves demand fulfillment rate by 16% and lowers computing resource migration rates by 33%.

2605.25059 2026-05-29 cs.CV

VEOcc: Voxel-Centric Online Semantic Occupancy Prediction For Embodied Scene Understanding

Ruoyu Wang, Yong Liu, Sheng Tao, Yuhang Lin, Yukai Ma

AI summary Proposes VEOcc, a voxel-centric recursive perception-and-assimilation framework that uses a spatio-temporal-aware online update strategy to achieve efficient, robust semantic occupancy prediction without initial scale estimation, reaching state-of-the-art performance in both local and embodied settings.

Abstract

Crucial for autonomous exploration, online 3D occupancy prediction and mapping incrementally constructs dense spatial representations on the fly. However, recent Gaussian-centric methods struggle with structural boundary fidelity and rely heavily on predefined scene-size priors, fundamentally limiting their operational efficiency. In this work, we present VEOcc, a voxel-centric framework formulated as a recursive perception-and-assimilation paradigm. By eliminating the need for initial scale estimation, VEOcc enables highly streamlined, open-ended map expansion. Furthermore, to robustly aggregate noisy temporal observations within the discrete voxel space, we propose a Spatio-Temporal-Aware Online Update Strategy. It integrates Cross-Temporal Logit Aggregation (TLA) for temporal consistency, Reliability-Aware Confidence Modulation (RCM) for spatial uncertainty calibration, and Confidence-Driven Incremental State Update (CSU) for robust global state assimilation. Extensive experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate that VEOcc establishes new state-of-the-art performance in both local and embodied settings. Notably, zero-shot evaluations on self-collected video sequences further confirm its robust out-of-distribution generalization capability in completely unseen real-world environments. Ultimately, our framework provides an accurate and highly efficient solution for autonomous exploration. Code and supplementary visualizations are available on our project page: https://wryzju.github.io/VEOcc/.

2605.24846 2026-05-29 cs.LG cs.AI

Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

Xiangtian Ji, Yuxin Chen, Zhengzhou Cai, Xiang Wang, An Zhang, Tat-Seng Chua

AI summary Through cross-task activation-strength analysis, this study finds an extremely sparse set of keystone neurons in large language models whose removal causes model behavior to collapse, and on that basis proposes a fine-tuning method that updates only keystone neurons, matching or exceeding the task performance of full-parameter fine-tuning while modifying few parameters.

Abstract

Large language models (LLMs) display strong comprehensive abilities, yet the internal mechanisms that support these behaviors remain insufficiently understood. In this work, we show that across a wide range of open-weight Transformers, a subset of neurons remains consistently highly activated during inference across tasks of multiple capability dimensions. By probing along the cross-task activation strength, an extremely sparse subset is isolated, whose removal causes a collapse in model behavior, which we term keystone neurons. Our analysis reveals that keystone neurons are a stable and intrinsic neuron subset of the model that is largely established during pretraining. The parameters associated with these neurons are tightly calibrated during the training process, and their precise values are critical for the capabilities of the model. Building on these insights, we propose a supervised fine-tuning approach that updates only keystone neurons, achieving task gains comparable to or even better than full-parameter fine-tuning while better preserving performance in other capability dimensions, despite modifying a much smaller number of parameters.
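
The idea of updating only a sparse neuron subset can be sketched in a few lines. This is a toy illustration and not the authors' procedure: the ranking rule (mean activation magnitude across tasks), the function names, and the plain SGD step are all assumptions made for the example.

```python
def cross_task_strength(acts_by_task):
    """Average each neuron's activation magnitude over all tasks."""
    n_tasks = len(acts_by_task)
    n_neurons = len(acts_by_task[0])
    return [sum(task[j] for task in acts_by_task) / n_tasks
            for j in range(n_neurons)]


def select_top_k(strength, k):
    """Indices of the k neurons with the highest cross-task strength."""
    order = sorted(range(len(strength)), key=lambda j: strength[j], reverse=True)
    return sorted(order[:k])


def masked_sgd_step(weights, grads, selected, lr=0.5):
    """Apply one SGD step to the selected neurons' rows; copy the rest unchanged."""
    chosen = set(selected)
    updated = []
    for j, (w_row, g_row) in enumerate(zip(weights, grads)):
        if j in chosen:
            updated.append([w - lr * g for w, g in zip(w_row, g_row)])
        else:
            updated.append(list(w_row))
    return updated


# Hypothetical mean |activation| per neuron on two tasks (4 neurons).
acts = [[0.1, 0.9, 0.2, 0.8],
        [0.2, 0.7, 0.1, 0.9]]
selected = select_top_k(cross_task_strength(acts), k=2)

weights = [[1.0, 1.0] for _ in range(4)]
grads = [[1.0, 1.0] for _ in range(4)]
new_weights = masked_sgd_step(weights, grads, selected)
print(selected, new_weights)
```

Here neurons 1 and 3 are selected and only their weight rows change; the other rows are returned untouched, which is the sense in which the update modifies a small fraction of the parameters.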

2605.24399 2026-05-29 cs.AI

ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

Xuan Wang, Zhongling Xu, Gopi Kannedhara, Joakim Nguyen, Jian Yu, Jinrui Fang, Abdurrahmaan Baghdadi, Tianlong Chen, Awais Naeem, Chandra Krishnan, Edward Castillo, Andrew H. Song, Ankita Shukla, Ying Ding, Nicholas Konz, Hairong Wang

AI summary Proposes the ConceptM$^3$oE framework, which embeds concept formation in concept-guided multimodal mixture-of-experts pathways and uses residual pathways to maintain both performance and interpretability, outperforming baselines on brain tumor classification and improving small-sample performance.

Abstract

Healthcare models are transitioning from unimodal prediction toward multimodal reasoning over heterogeneous diagnostic inputs. In computational pathology, for complex tumor subtypes where morphology alone can be challenging to distinguish, pathology reports and molecular measurements may provide additional diagnostic evidence alongside whole-slide images, yet existing models often fail to clarify how diverse signals assemble into recognizable diagnostic concepts. We propose ConceptM$^3$oE (Concept Multimodal MoE), which embeds concept formation directly within interaction-aware mixture-of-experts (MoE) pathways. The architecture decomposes evidence into modality-specific, redundant, and synergistic experts, which are then projected into structured concept bottlenecks mapping latent features to a hierarchy of morphology and biomarker concepts. To prevent the information loss typical of interpretable bottlenecks, we utilize residual pathways within each expert to allow task-relevant signals to flow both through the concepts and directly to the final task prediction, so that high performance is maintained alongside interpretability. Across an institutional pediatric brain tumor cohort and a public glioma cohort, the framework delivers competitive performance to unconstrained models while producing reasoning traces validated by an independent neuropathologist. In data-limited regimes, ConceptM$^3$oE improves limited-data performance, increasing macro-F1 from 56.41% to 66.70% at small training sizes compared to non-concept-informed baselines, while also showing faster training convergence consistent with the regularizing effect of concept learning. This work offers a scalable path toward high-performance medical AI that is inherently verifiable and better aligned with the complex decision-making of clinical practice.

2605.24140 2026-05-29 cs.AI

HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

Yuyu Liu, Haotian Xu, Yanan He, Sarang Rajendra Patil, Mengjia Xu, Tengfei Ma

AI summary To address the trade-off in multi-step reasoning between single-pass generation, which is efficient but less accurate, and tree search, which is computationally expensive, the paper distills reasoning progress into a hyperbolic geometric signal that guides step-by-step generation, using distance and angle in hyperbolic space to encode solution proximity and branch distinction; a lightweight head is trained to project hidden states and an adapter is fine-tuned, yielding consistent gains on multiple benchmarks.

Abstract

Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy; tree-search methods explore multiple paths but are computation-heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step-by-step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution-bearing states are few while dead ends are exponentially numerous. The hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance-to-origin naturally encodes solution proximity while angular separation distinguishes branches requiring different next operations. We train a lightweight head to project LLM hidden states into this space, then fine-tune a low-rank adapter interactively on its own reasoning attempts to act on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at https://github.com/yuyuliu11037/HyperGuide.
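
To make the geometric intuition concrete, the sketch below computes distances in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model the authors use, so the choice of the Poincaré ball with curvature -1 and the function names are assumptions for illustration only.

```python
import math


def squared_norm(x):
    return sum(c * c for c in x)


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit ball (curvature -1)."""
    diff = squared_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


def distance_to_origin(x):
    """Closed form of poincare_distance(origin, x): 2 * artanh(||x||)."""
    return 2.0 * math.atanh(math.sqrt(squared_norm(x)))


near = [0.5, 0.0]
far = [0.9, 0.0]
print(distance_to_origin(near), distance_to_origin(far))
```

The same Euclidean step of 0.4 costs far more hyperbolic distance near the boundary than near the origin, which is the "exponentially expanding capacity toward the boundary" the abstract refers to.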

2605.23993 2026-05-29 cs.CV cs.AI cs.LG

Nano World Models: A Minimalist Implementation of Future Video Prediction

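Siqiao Huang, Partha Kaushik, Michael Chen, Hengkai Pan, Kaiwen Geng, Omar Chehab, Fernando Moreno-Pino, Max Simchowitz

AI summary Proposes Nano World Models, a minimalist codebase for future video prediction built on diffusion forcing that supports controlled studies of world-model design choices, with experiments analyzing how factors such as prediction parameterization and architecture scale affect video prediction quality.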

Comments Project page: https://simchowitzlabpublic.github.io/nano-world-model/

Abstract

World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.

2605.23657 2026-05-29 cs.CL

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao

AI summary Proposes OpenSkillEval, an automatic evaluation framework that dynamically constructs realistic task instances and collects community skills to systematically evaluate skill-augmented agent systems and the skills themselves, with key findings including that skill availability does not guarantee effective use and that the benefit of skill augmentation depends on the model and framework.

Abstract

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.

2605.23531 2026-05-29 cs.CV

PixIE: Prompted Pixel-Space Low-Light Image Enhancement

Ruirui Lin, Guoxi Huang, David Bull, Nantheera Anantrasirichai

AI summary Proposes the PixIE framework, which uses semantic prompts from a vision foundation model to perform pixel-space low-light image enhancement through cross-scale denoising and DINO-Prompted Pixel Blocks, improving PSNR and LPIPS on multiple benchmarks.

Abstract

Low-light images suffer from severe noise, contrast loss, and semantic ambiguity, making enhancement a joint problem of denoising and detail recovery. We propose PixIE, a feed-forward pixel-space LLIE framework semantically prompted by a vision foundation model. PixIE first performs cross-scale denoising to suppress noise and preserve structure, then refines details using DINO-Prompted Pixel Blocks (DPPBs), which inject intermediate DINOv3 features through patch-conditioned, spatially continuous per-pixel modulation. To make pixel-space attention efficient across scales, we introduce Spatial-Channel Compaction (SCC), which jointly reduces the spatial token grid and channel dimension. We further propose Multi-Receptive-Field Pixel Embedding (MRPE) to provide neighborhood-aware pixel representations before semantic prompting, improving robustness to signal-dependent noise beyond point-wise embeddings. Experiments on LLIE benchmarks show that PixIE improves average PSNR by 1.9-15.0% over recent state-of-the-art methods and reduces LPIPS by 8.5-44.4%. Qualitative comparisons further show sharper details and more stable textures, improving both reconstruction fidelity and perceptual quality.

2605.23345 2026-05-29 cs.CV

SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

Zizhao Tong, Yeying Jin, Hongfeng Lai, Zeqing Wang, Zhaohu Xing, Kexu Cheng, Haoran Xu, Zhao Pu, Shangwen Zhu, Ruili Feng, Jian Zhao, Yan Zhang, Hao Tang, Ling Shao

AI summary Proposes SCOPE, which inserts a conditioning module into each Transformer block and reshapes features into per-pixel temporal sequences to separate in-scope action effects in FPS games from global generation, and introduces the cross-game dataset CrossFPS, enabling zero-shot transfer.

Comments Project page: https://z2tong.github.io/SCOPE/. Code is available at https://github.com/z2tong/SCOPE

Abstract

Interactive world models for first-person shooter (FPS) games must resolve high-frequency overlapping control signals at every frame without disrupting unaffected regions. Existing methods inject actions globally and train on single titles, failing under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. It reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content. This separates in-scope effects from out-of-scope generation without segmentation labels. We also introduce CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry. It comprises 69K clips from 7 titles with 10-DoF controller signals, curated to remove gameplay bias. The model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.

2605.22924 2026-05-29 cs.LG cs.IR

Building a privacy-preserving Federated Recommender system for mobile devices

Aasheesh Singh

AI summary Proposes a two-stage federated recommender system pipeline that separates non-sensitive preference data from sensitive on-device context data, enabling personalized recommendations on mobile devices while preserving privacy.

Comments Master's thesis, Université de Montréal, Department of Computer Science and Operations Research, 2024

Abstract

Serving personalized content on mobile devices has traditionally required pooling sensitive user data on centralized servers, a practice increasingly at odds with modern privacy expectations and geographical regulations. We present a two-stage federated recommendation system pipeline for mobile devices, built around a principled separation between non-sensitive user preference data and sensitive mobile context data that never leaves the device. The first stage runs a collaborative filtering model on non-sensitive app-context data in the cloud to generate a shortlist of relevant items. The second stage re-ranks these candidates on-device using sensitive mobile signals, with only model updates/gradients ever leaving the device. We validate the approach on MovieLens, UCI Human Activity Recognition, and a proprietary pilot dataset, and deliver a production-ready implementation as a Kotlin Multiplatform library deployable on Android and iOS.
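
The two-stage split described in this abstract can be sketched as two small functions. This is a toy illustration under stated assumptions, not the thesis implementation (which is a Kotlin Multiplatform library): the function names, the dictionary-based scores, and the linear context scoring are all hypothetical.

```python
def cloud_shortlist(preference_scores, k):
    """Stage 1 (cloud): rank items by a non-sensitive preference score, keep top k."""
    ranked = sorted(preference_scores, key=preference_scores.get, reverse=True)
    return ranked[:k]


def on_device_rerank(candidates, context_weights, item_features):
    """Stage 2 (device): re-rank the shortlist with sensitive context that stays local."""
    def score(item):
        features = item_features[item]
        return sum(context_weights.get(name, 0.0) * value
                   for name, value in features.items())
    return sorted(candidates, key=score, reverse=True)


# Hypothetical collaborative-filtering scores computed in the cloud.
preference_scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7}
shortlist = cloud_shortlist(preference_scores, k=3)

# Hypothetical on-device context: the user is currently outdoors.
context_weights = {"outdoor": 1.0, "indoor": 0.2}
item_features = {
    "a": {"indoor": 1.0},
    "b": {"outdoor": 1.0},
    "d": {"indoor": 0.5, "outdoor": 0.5},
}
print(shortlist, on_device_rerank(shortlist, context_weights, item_features))
```

Only `shortlist` crosses the network in this sketch; `context_weights` is read and used on the device, mirroring the separation between preference data and sensitive context described in the abstract.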

2605.22586 2026-05-29 cs.LG cs.CL

A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models

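Jiayi Fu, Yuxia Wang

AI summary This tutorial gives a unified account of the mathematical foundations of diffusion models from the perspective of differential equations, deriving the ODE and SDE representations, explaining score matching and the denoising objective, and covering DDPM, DDIM, flow matching, and diffusion language models.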

Comments A detailed tutorial on Diffusion models and SDE

Abstract

Diffusion models have emerged as a dominant framework for generative modeling, but their mathematical foundations are often presented separately through diffusion probabilistic models, score-based modeling, stochastic differential equations, and numerical sampling methods. We write this tutorial to provide a unified and self-contained account of these viewpoints from the perspective of differential equations. Starting from a conditional Gaussian noising process, we derive ordinary differential equation (ODE) and stochastic differential equation (SDE) representations, pass to the corresponding marginal forward dynamics, and then obtain the reverse-time SDE and probability-flow ODE that make generation possible. We show that the central unknown quantity in reverse sampling is the marginal score, explain how score matching becomes the standard denoising objective under a noise-prediction parameterization, and discuss practical reverse-time sampling and guidance. We further place DDPM, DDIM, flow matching, and score-based SDEs in a common framework, and conclude with diffusion language models in continuous embedding space together with a brief discussion of discrete masked-token diffusion. The tutorial is intended as a bridge between the analytical foundations of diffusion processes and the modern generative algorithms built upon them.
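
The link between score matching and the noise-prediction objective mentioned in this abstract can be written out in a few lines. The notation below (a schedule $\alpha_t, \sigma_t$, a noise predictor $\varepsilon_\theta$, a score estimate $s_\theta$) is assumed for illustration and may differ from the tutorial's own symbols; in DDPM notation, $\alpha_t = \sqrt{\bar\alpha_t}$ and $\sigma_t = \sqrt{1 - \bar\alpha_t}$.

```latex
% Conditional Gaussian noising with schedule (\alpha_t, \sigma_t):
x_t = \alpha_t x_0 + \sigma_t \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)

% Conditional score of p(x_t \mid x_0) = \mathcal{N}(\alpha_t x_0, \sigma_t^2 I):
\nabla_{x_t} \log p(x_t \mid x_0)
  = -\frac{x_t - \alpha_t x_0}{\sigma_t^{2}}
  = -\frac{\varepsilon}{\sigma_t}

% Noise-prediction objective and the score estimate it implies:
\mathcal{L}(\theta)
  = \mathbb{E}_{t,\, x_0,\, \varepsilon}
    \left\| \varepsilon_\theta(x_t, t) - \varepsilon \right\|^{2},
\qquad
s_\theta(x_t, t) = -\frac{\varepsilon_\theta(x_t, t)}{\sigma_t}
```

The second line follows by substituting $x_t - \alpha_t x_0 = \sigma_t \varepsilon$. Regressing $s_\theta$ onto the conditional score is therefore the same as regressing $\varepsilon_\theta$ onto $\varepsilon$, up to a time-dependent weighting of $1/\sigma_t^2$.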

2605.22100 2026-05-29 cs.AI

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

Bangbang Zhou, Hangdi Xing, Yifan Chen, Jianjun Xu, Qi Zheng, Feiyu Gao, Zhibo Yang, Shuai Bai, Ming Yan, Jieping Ye, Hongtao Xie

AI summary To address the inadequacy of existing benchmarks for evaluating real-world scenarios, the paper proposes the MPDocBench-Parse benchmark, comprising 433 multi-page documents (3,246 pages) across 15 document types, with a comprehensive evaluation protocol for content fidelity and logical structure; experiments show that existing models have clear limitations in semantic continuity, visual content parsing, and hierarchical structure recovery.

Abstract

Document parsing converts visually rich documents into machine-readable structured representations, forming a crucial foundation for information systems. Although many benchmarks have been proposed for document parsing, they remain inadequate for realistic scenarios. Existing benchmarks either focus on specific tasks or assess only single-page, text-centric settings, making them insufficient for practical multi-page parsing. Moreover, they lack fine-grained evaluation of semantic continuity, hierarchical structure recovery, and visual content preservation. To address these gaps, we propose MPDocBench-Parse, a benchmark for multi-page document parsing in real-world applications. It contains 433 manually annotated documents with 3,246 pages, covering 15 document types in English and Chinese, with diverse layout styles, and supports document-level end-to-end evaluation. We further design a comprehensive protocol for content fidelity and logical structure, covering text, table, and formula recognition, truncated text and table merging, figure extraction, reading order, and heading hierarchy recovery. Experiments show that, while existing models perform well on basic text extraction, they still suffer clear limitations in semantic continuity integration, visual content parsing, and hierarchical structure recovery. MPDocBench-Parse provides a unified foundation for advancing document parsing toward more realistic scenarios.

2605.22082 2026-05-29 cs.RO cs.LG

CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation

Wentian Wang, Chutong Wen, Hongxu Ma, Wuhao Wang, Zhexiong Xue, Abdul Haseeb Nizamani, Dandi Zhou, Xinhai Sun, Jianqiao Zhu

AI summary Proposes the CoRMA framework, which achieves meta-adaptation for force-dominant assembly tasks through a semantic contact context and contrastive learning, without demonstrations or gradient updates, outperforming baselines in simulation and on a real robot.

Abstract

We present CoRMA (Contrastive Robotic Motor Adaptation), a context-based meta-adaptation framework that modifies RMA for force-dominant assembly. CoRMA replaces raw simulator-parameter adaptation with a compact 6D simulator-only semantic contact context describing contact onset, lateral engagement, guided transition, contact direction, and jamming. A deployable causal Transformer adapter infers this context online from force, proprioceptive, and action histories using semantic regression and a force-regime contrastive objective. At deployment, oracle context is removed and replaced by the inferred context, enabling within-episode adaptation without demonstrations, privileged inputs, or gradient updates. We evaluate CoRMA on PegInsert, GearMesh, and NutThread in Isaac Lab / Isaac Sim 5.0 and on a real Marvin arm. Compared with FORGE baselines that achieve high simulation success but degrade substantially on hardware, CoRMA retains higher verified real success under controlled target-pose noise. These results support semantic contact inference as a reusable adaptation interface within a related assembly task family, while broader unseen-task generalization and Real2Sim calibration remain future work.

2605.22080 2026-05-29 cs.CV cs.AI

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

Yue Xun, Junyu Liu, Qian Niu, Xinyi Wang, Zheng Yuan, Zirui Li, Zequn Zhang, Bowen Zhao, Shujun Wang, Irene Li, Kan Hatakeyama-Sato, Yusuke Iwasawa, Yutaka Matsuo

AI summary This paper presents JMed48k, a multi-profession Japanese medical licensing benchmark containing 48,862 exam questions and 20,142 images; by evaluating 21 models and introducing a paired image-removal audit, it finds that proprietary and open-source models benefit substantially from images, whereas medical-specific models make limited use of visual evidence.

Abstract

We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official PDF materials released by the Japanese Ministry of Health, Labour and Welfare, JMed48k contains 48,862 exam questions and 20,142 images from 11 national licensing examinations between 2005 and 2025, with visual content annotated under an 8-type taxonomy. From this corpus, we derive JMed48k-Eval, a recent five-year evaluation subset with 12,484 scored questions, including 9,905 text-only questions and 2,579 questions with images. We evaluate 21 proprietary, open-source, and medical-specific models, reporting text-only and with-image performance separately. Because these subsets contain different questions, we further introduce a paired image-removal audit that evaluates questions with images before and after removing visual content to explore four answer-transition states. The audit shows that proprietary and open-source models gain substantially from images, whereas medical-specific systems show limited observable use of visual evidence, with many correct answers persisting after image removal. Even among proprietary models, the net image-removal effect varies sevenfold across professions, from +5.7 points on Physician questions to +39.8 points on Public Health Nurse questions. We release JMed48k to support reproducible, profession-stratified evaluation of vision-language models in medical licensing settings.
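
The paired image-removal audit can be pictured as a tally over per-question outcomes. The sketch below is an assumption-laden illustration rather than the benchmark's released tooling: the function name, the boolean inputs, and the sign convention for the net effect (accuracy with images minus accuracy without, in percentage points) are all hypothetical.

```python
from collections import Counter


def image_removal_audit(with_image, without_image):
    """Tally the four answer-transition states and the net effect in points.

    Both arguments are lists of booleans (True = answered correctly),
    aligned so that index i refers to the same question in both runs.
    """
    label = {True: "correct", False: "wrong"}
    counts = Counter(
        f"{label[a]}->{label[b]}" for a, b in zip(with_image, without_image)
    )
    n = len(with_image)
    net_points = 100.0 * (sum(with_image) - sum(without_image)) / n
    return counts, net_points


# Hypothetical outcomes for five image questions.
with_image = [True, True, True, False, True]
without_image = [True, False, False, False, True]
counts, net_points = image_removal_audit(with_image, without_image)
print(dict(counts), net_points)
```

In this toy run two answers survive image removal ("correct->correct"), two are lost ("correct->wrong"), and one is wrong in both runs. The surviving answers are the cases the abstract flags as correct without observable use of visual evidence.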

2605.22069 2026-05-29 cs.CV cs.LG

TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting

TWINGS: 基于薄板样条翘曲对齐的稀疏视图高斯泼溅初始化

Hyeseong Kim, Geonhui Son, Deukhee Lee, Dosik Hwang

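AI summary Proposes TWINGS, a framework that uses Thin Plate Splines (TPS) to align points backprojected from estimated depth with triangulated control points, giving 3D Gaussian Splatting a geometrically accurate initialization and improving detail preservation and color fidelity in sparse-view scene reconstruction.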

Comments Accepted at CVPR 2026, Project page: https://sandokim.github.io/twings/

AI中文摘要

从稀疏视图输入进行新视角合成是3D计算机视觉中的一个重大挑战,特别是在有限视角下实现高质量场景重建。我们引入了TWINGS,这是一个通过直接解决点稀疏性来增强3D高斯泼溅(3DGS)的框架。我们采用薄板样条(TPS),一种平滑的非刚性变形模型,通过最小化弯曲能量从控制点对应关系估计全局一致的翘曲,将估计深度反投影的点与三角化的3D控制点对齐,从而生成校准的反投影点。通过在这些控制点附近采样校准点,TWINGS为3DGS提供了快速且几何精确的初始化,最终改善了重建场景中结构细节的保留和颜色保真度。在DTU、LLFF和Mip-NeRF360上的大量实验表明,TWINGS在稀疏视图场景下始终优于现有方法,提供详细且准确的重建。

英文摘要

Novel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control-point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrate that TWINGS consistently outperforms existing methods, delivering detailed and accurate reconstructions under sparse-view scenarios.

2605.21739 2026-05-29 cs.AI

AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

AttuneBench: 基于对话的LLM情商基准测试

Kate M. Lubrano, Faisal Sayed, Ankita Rathod, Akshansh, Craver Corbyn Thomas-Smith, Mark E. Whiting, Karina Nguyen

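AI summary Proposes AttuneBench, a benchmark built on 200 genuine multi-turn human-model conversations that evaluates LLM emotional intelligence across emotion recognition, behavioral classification, preference prediction, and response quality. Model rankings on these capabilities are largely independent, and preference alignment and response quality discriminate between models more than emotion-label accuracy does.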

Comments v2: Updated def_18 and def_20 supplemental figures to cover all 11 evaluated models (previously 9). Removed redundant supplemental figures. Corrected select captions (color descriptions, chance baselines, figure-content mismatches). No changes to experimental results, numerical claims, or conclusions

AI中文摘要

情商(EI),即感知、理解并恰当回应他人情绪状态的能力,是人类交流的核心,随着LLM在日常生活中承担对话角色,评估其情商日益重要。现有的EI基准依赖于合成提示、单轮案例或第三方标注。这些方法不能直接衡量模型在真实对话过程中如何推断和回应参与者的情绪状态。我们引入AttuneBench,一个基于200个真实多轮人机对话的基准,其中参与者与匿名LLM对话,并逐轮标注其情绪状态、模型行为以及他们偏好的回应。在11个评估模型中,我们发现模型在情绪识别、行为分类、偏好预测和评判响应质量上的排名基本独立,表明情商行为可分解为可分离的能力。偏好对齐和响应质量判断比情绪标签准确性更具模型区分性。这些结果表明,情商行为需要预测特定用户在上下文中想要什么样的回应,这一区别可能被总体评分掩盖,而单轮或合成格式无法跨轮直接捕捉。AttuneBench提供了一个评估这些能力以及诊断模型在情绪显著对话中的特定优势和失败模式的框架。

英文摘要

Emotional intelligence (EI), the ability to perceive, understand, and respond appropriately to others' emotional states, is central to human communication, and increasingly important to assess as LLMs assume conversational roles in everyday life. Existing EI benchmarks rely on synthetic prompts, single-turn cases, or third-party annotation. These approaches do not directly measure how models infer and respond to a participant's emotional state over the course of a real conversation. We introduce AttuneBench, a benchmark grounded in 200 genuine multi-turn human-model conversations in which participants conversed with anonymized LLMs and provided turn-by-turn annotations of their emotional state, the model's behavior, and their preferred responses. Across 11 evaluated models, we find that model rankings on emotion recognition, behavioral classification, preference prediction, and judged response quality are largely independent, indicating that emotionally intelligent behavior decomposes into separable capabilities. Preference alignment and response-quality judgments are substantially more model-discriminating than emotion-label accuracy. These results indicate that emotionally intelligent behavior requires predicting what kind of response a specific user wants in context, a distinction that aggregate scoring can obscure and that single-turn or synthetic formats cannot directly capture across turns. AttuneBench provides a framework for assessing each of these capabilities and for diagnosing model-specific strengths and failure modes in emotionally salient conversation.

2605.17286 2026-05-29 cs.CV

HyperVision: A Channel-Adaptive Ground-Based Hyperspectral Vision Pre-trained Backbone

HyperVision: 一种通道自适应的地基高光谱视觉预训练骨干网络

Guanyiman Fu, Jingtao Li, Zihang Cheng, Zhuanfeng Li, Diqi Chen, Yan Xu, Xiangyu Liu, Fengchao Xiong, Jianfeng Lu, Chengrong Chen, Jun Zhou

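AI summary To address varying sensor configurations, scarce and inconsistent labels, and limited dataset scale in ground-based hyperspectral imaging, proposes HyperVision, the first ground-based hyperspectral pre-trained backbone. It combines channel-adaptive dynamic embedding, multi-source pseudo-labeling, and cross-modal knowledge distillation, and achieves state-of-the-art performance on three downstream tasks.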

AI中文摘要

虽然高光谱成像通过数百个窄波长波段提供丰富的空间-光谱信息,用于精确的材料识别,但地基高光谱预训练骨干网络仍然缺失,受限于传感器间的光谱配置差异、标签的稀缺性和不一致性,以及现有数据集的规模有限和场景多样性不足。为了解决这些挑战并实现通用感知,我们提出了HyperVision,这是首个地基高光谱预训练骨干网络。首先,为了处理不同的光谱配置,HyperVision采用通道自适应动态嵌入机制,将异构输入映射到统一的标记空间。其次,我们开发了一个无监督表示学习框架。具体来说,为了解决标签稀缺和不一致问题,引入了一种多源伪标签方法,融合来自SAM2的空间结构和来自HyperFree的细粒度光谱材料信息。此外,为了丰富场景多样性并补偿有限的数据集规模,利用跨模态知识蒸馏机制,将预训练RGB视觉模型的丰富语义表示迁移到我们的骨干网络。HyperVision在来自26个不同地基数据集的15000张图像集合上进行预训练,展现出卓越的泛化能力。仅需高效的头适配而无需调整骨干参数,它在不同传感器配置下的三个下游任务中取得了比任务特定方法更优的性能,在高光谱语义分割中$\mathrm{Acc}_{\mathrm{M}}$相对提升高达16.3%,目标跟踪AUC相对提升2.1%,显著目标检测MAE降低35.5%。源代码和预训练模型将在https://github.com/lronkitty/HyperVision 公开。

英文摘要

While hyperspectral imaging provides rich spatial-spectral information across hundreds of narrow wavelength bands for precise material identification, ground-based hyperspectral pre-trained backbones remain absent, constrained by varying spectral configurations across sensors, the scarcity and inconsistency of labels, and the limited scale and scene diversity of existing datasets. To address these challenges and enable universal perception, we propose HyperVision, the first ground-based hyperspectral pre-trained backbone. First, to handle varying spectral configurations, HyperVision adopts a channel-adaptive dynamic embedding mechanism to map heterogeneous inputs into a unified token space. Second, we develop an unsupervised representation learning framework. Specifically, to address label scarcity and inconsistency, a multi-source pseudo-labeling method is introduced to fuse spatial structures from SAM2 and fine-grained spectral material information from HyperFree. Furthermore, to enrich scene diversity and compensate for limited dataset scale, a cross-modal knowledge distillation mechanism is utilized to transfer rich semantic representations from a pre-trained RGB vision model to our backbone. Pre-trained on a collection of 15k images from 26 diverse ground-based datasets, HyperVision demonstrates exceptional generalization. Requiring only efficient head-only adaptation without adjusting backbone parameters, it achieves state-of-the-art performance compared to task-specific methods across three downstream tasks under varying sensor configurations, yielding up to a 16.3% relative improvement in hyperspectral semantic segmentation $\mathrm{Acc}_{\mathrm{M}}$, a 2.1% relative gain in object tracking AUC, and a 35.5% reduction in salient object detection MAE. The source code and pre-trained model will be publicly available on https://github.com/lronkitty/HyperVision .

2605.15852 2026-05-29 cs.CV

GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction

GHOST: 用于高效3D重建的几何层次化在线流式令牌驱逐

Leyang Chen, Junyi Wu, Zhiteng Li, Yulun Zhang

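AI summary Proposes GHOST, a framework that uses the model's own 3D geometry outputs to evict redundant tokens online, cutting the KV cache by nearly half and delivering 1.75x faster inference while preserving reconstruction quality.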

AI中文摘要

从长单目视频序列进行流式3D重建需要维护一个随序列长度线性增长的键值(KV)缓存,造成严重的内存瓶颈。现有方法要么将缓存截断为固定的一组锚帧,导致重建质量下降,要么依赖于对3D场景结构无关的注意力分数启发式方法,未能保留几何上有价值的令牌。为解决这些问题,我们提出GHOST(几何层次化在线流式令牌驱逐),一种无需训练的KV缓存管理框架,利用模型自身的3D几何输出在线驱逐冗余令牌。GHOST引入了三项相互增强的创新:层次化双层重要性评分方案、保护特殊令牌不被驱逐的特权机制,以及余弦相似度引导的逐层预算分配。在各种基准上的实验表明,GHOST在保持出色重建质量的同时,将KV缓存削减近一半,并且与最先进方法相比实现了1.75倍的推理加速。我们的代码可在 https://github.com/lokiniuniu/GHOST 获取。

英文摘要

Streaming 3D reconstruction from long monocular video sequences requires maintaining a key-value (KV) cache that grows linearly with sequence length, creating a severe memory bottleneck. Existing approaches either truncate the cache to a fixed set of anchor frames, leading to reconstruction quality degradation, or rely on attention-score heuristics that are agnostic to 3D scene structure, failing to preserve geometrically valuable tokens. To address these problems, we present GHOST (Geometry-Hierarchical Online Streaming Token Eviction), a training-free KV cache management framework that exploits the model's own 3D geometry outputs to evict redundant tokens online. GHOST introduces three mutually reinforcing innovations: a hierarchical dual-level importance scoring scheme, a privilege mechanism that protects special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation. Experiments on various benchmarks show that GHOST preserves excellent reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference compared to state-of-the-art methods. Our code is available at https://github.com/lokiniuniu/GHOST.

2605.15422 2026-05-29 cs.LG

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

DualKV: 面向高效RL训练的共享提示Flash注意力机制,支持大规模展开和长上下文

Jiading Gai, Shuai Zhang, Xiang Song, Bernie Wang, George Karypis

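AI summary To remove the redundant computation on shared prompts in RL training, proposes the DualKV kernel, which eliminates prompt replication through fused CUDA forward/backward kernels and a repacked veRL data pipeline, achieving 1.63-3.82x policy-update speedup.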

AI中文摘要

现代RL后训练方法(如GRPO和DAPO)在从共享提示($P$个token)采样的$N$个响应序列(每个$R$个token)上进行训练,但标准FlashAttention在前向和反向传播中将所有$P$个提示token复制$N$次——在相同的隐藏状态上重复计算和内存。在大规模展开、长上下文RL训练($N\geq16$,$P\geq8\text{K}$)中,这种冗余主导了策略更新成本。我们观察到,在仅解码器模型中,因果掩码使提示表示在每一层跨序列不变,因此所有逐token操作(归一化、投影、MLP)和注意力可以一次性处理提示——这一特性尚未在训练的内核级别被利用。我们提出\textbf{DualKV},这是首个消除RL训练中共享提示复制的FlashAttention内核变体,通过(1)~融合的CUDA前向和反向内核,在单次内核启动中迭代两个不相交的KV区域——共享上下文和逐序列响应,以及(2)~veRL中的数据流水线重设计,将$N(P{+}R)$个token重新打包为每个微批$P{+}NR$个token,将token减少从注意力扩展到整个模型,因子$ρ= N(P{+}R)/(P{+}NR)$。DualKV在数学上等价于标准注意力,且不引入近似。在Qwen3-8B GRPO训练中,使用8$\times$H100 GPU($N{=}32$,8K上下文),DualKV实现了$1.63$--$2.09\times$的策略更新加速,支持$2\times$更大的微批,并将MFU从$36\%$提升至$76\%$。类似增益在DAPO上成立($2.47\times$加速,$77\%$ MFU)。在30B MoE规模下,使用16$\times$H100,DualKV相比FlashAttention(需要4路Ulysses序列并行以避免OOM)实现了$3.82\times$的策略更新加速和$3.38\times$的端到端步骤加速。

英文摘要

Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both forward and backward passes -- duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training ($N{\geq}16$, $P{\geq}8\text{K}$), this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across sequences at every layer, so all per-token operations (norms, projections, MLP) and attention can process the prompt once -- a property not yet exploited at the kernel level for training. We propose \textbf{DualKV}, the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training, via (1)~fused CUDA forward and backward kernels that iterate over two disjoint KV regions -- shared context and per-sequence response -- in a single kernel launch, and (2)~a data-pipeline redesign in veRL that repacks $N(P{+}R)$ tokens into $P{+}NR$ tokens per micro-batch, extending the token reduction from attention to the entire model by a factor $ρ= N(P{+}R)/(P{+}NR)$. DualKV is mathematically equivalent to standard attention and introduces no approximation. On Qwen3-8B GRPO training with 8$\times$H100 GPUs ($N{=}32$, 8K-context), DualKV achieves $1.63$--$2.09\times$ policy-update speedup, enables $2\times$ larger micro-batches, and raises MFU from $36\%$ to $76\%$. Similar gains hold for DAPO ($2.47\times$ speedup, $77\%$ MFU). At 30B MoE scale on 16$\times$H100, DualKV achieves $3.82\times$ policy-update and $3.38\times$ end-to-end step speedup over FlashAttention (which requires 4-way Ulysses sequence parallelism to avoid OOM).
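
The token-reduction factor $ρ= N(P{+}R)/(P{+}NR)$ stated above can be evaluated directly. The sketch below is a worked arithmetic example only: $N{=}32$ and $P{=}8192$ follow the abstract's setting, while the response lengths $R$ and the helper name `token_reduction` are hypothetical. Note that $ρ$ counts tokens per micro-batch and is not the measured speedup.

```python
def token_reduction(n_responses: int, prompt_len: int, response_len: int) -> float:
    """rho = N(P+R) / (P+NR): replicated token count over repacked token count."""
    replicated = n_responses * (prompt_len + response_len)  # prompt copied N times
    repacked = prompt_len + n_responses * response_len      # prompt stored once
    return replicated / repacked


# N and P match the abstract's regime; the R values are illustrative assumptions.
for response_len in (512, 2048, 8192):
    rho = token_reduction(32, 8192, response_len)
    print(f"R={response_len}: rho={rho:.2f}")
```

The factor approaches $N$ when responses are short relative to the prompt and approaches 1 when responses dominate, which is consistent with the abstract's focus on large-rollout, long-prompt training.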

2605.15219 2026-05-29 cs.AI cs.IT math.IT

NOVA: Fundamental Limits of Knowledge Discovery Through AI

NOVA:通过人工智能进行知识发现的基本限制

Salman Avestimehr, Ken Duffy, Muriel Médard

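AI summary Proposes the NOVA framework, which models the "generate, verify, accumulate, retrain" loop as an adaptive sampling process over a knowledge space, identifies conditions under which accumulated knowledge covers a finite domain along with the associated failure modes, and proves a scaling law for discovery cost under a Zipf-law assumption.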

AI中文摘要

人工智能系统能否通过迭代自我改进发现真正的新知识,如果可以,代价是什么?我们引入了NOVA框架,将常见的“生成、验证、积累、再训练”循环建模为知识空间上的自适应采样过程。我们识别了积累的真正知识最终覆盖有限域的充分条件,并展示了违反这些条件如何产生不同的失败模式:污染、遗忘、探索失败和接受失败。然后,我们分析了不完美的验证,并识别了一个污染陷阱:随着容易发现的知识被耗尽,模型分配给新有效工件的质量缩小,因此即使很小的假阳性率也可能导致无效工件比真正发现更快地进入知识库。我们澄清了Good-Turing估计是一种局部批次多样性诊断工具,而不是用于估计历史上未发现的、支配长期发现的有效质量的估计量。在将模型的有效发现分布与指数$α>1$的Zipf定律联系起来的独立尾部等价假设下,我们证明了获得$D$个不同真正发现所需的累积生成成本满足$R_{\mathrm{cum}}(D)=Θ(c_{\mathrm{gen}}D^α)$,其中$c_{\mathrm{gen}}$是每个候选的生成成本。这个标度律量化了随着发现前沿推进而渐进的收益递减。最后,我们通过指导、生成和验证形式化了人类增强,解释了为什么专家输入在自主探索障碍附近最有价值。

英文摘要

Can AI systems discover genuinely new knowledge through iterative self improvement, and if so, at what cost? We introduce the NOVA framework, which models the common ``generate, verify, accumulate, retrain'' loop as an adaptive sampling process over a knowledge space. We identify sufficient conditions under which accumulated genuine knowledge eventually covers a finite domain, and show how their violations produce distinct failure modes: contamination, forgetting, exploration failure, and acceptance failure. We then analyze imperfect verification and identify a contamination trap: as easy-to-find knowledge is exhausted, the model mass assigned to new valid artifacts shrinks, so even small false-positive rates can cause invalid artifacts to enter the knowledge base faster than genuine discoveries. We clarify that Good--Turing estimation is a local batch-diversity diagnostic, not an estimator of the historically undiscovered valid mass that governs long-term discovery. Under a separate tail-equivalence assumption relating the model's effective discovery distribution to a Zipf law with exponent $α>1$, we prove that the cumulative generation cost required to obtain $D$ distinct genuine discoveries satisfies $R_{\mathrm{cum}}(D)=Θ(c_{\mathrm{gen}}D^α)$, where $c_{\mathrm{gen}}$ is the per-candidate generation cost. This scaling law quantifies asymptotic diminishing returns as the discovery frontier advances. Finally, we formalize human amplification through guidance, generation, and verification, explaining why expert input is most valuable near autonomous exploration barriers.

2605.14270 2026-05-29 cs.CV

Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

诊断和纠正多模态扩散Transformer中的概念遗漏

Kanghyun Baek, Jaihyun Lew, Chaehun Shin, Jungbeom Lee, Sungroh Yoon

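AI summary Using linear probing, this paper finds an "omission signal" in text embeddings that represents the absence of target concepts, and proposes Omission Signal Intervention (OSI), which amplifies this signal to actively catalyze generation of the missing concepts, significantly alleviating concept omission on FLUX.1-Dev and SD3.5-Medium.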

Comments Accepted to ICML 2026

AI中文摘要

多模态扩散Transformer(MM-DiTs)在文本到图像生成方面取得了显著进展,但它们经常遭受概念遗漏,即指定的对象或属性未能出现在生成的图像中。通过对文本标记进行线性探测,我们证明文本嵌入可以区分代表目标概念缺失的特征性“遗漏信号”。利用这一见解,我们提出了遗漏信号干预(OSI),该方法放大遗漏信号以主动催化缺失概念的生成。在FLUX.1-Dev和SD3.5-Medium上的全面实验表明,即使在极端场景下,OSI也能显著缓解概念遗漏。

英文摘要

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-to-image generation, yet they frequently suffer from concept omission, where specified objects or attributes fail to emerge in the generated image. By performing linear probing on text tokens, we demonstrate that text embeddings can distinguish a characteristic `omission signal' representing the absence of target concepts. Leveraging this insight, we propose Omission Signal Intervention (OSI), which amplifies the omission signal to actively catalyze the generation of missing concepts. Comprehensive experiments on FLUX.1-Dev and SD3.5-Medium demonstrate that OSI significantly alleviates concept omission even in extreme scenarios.

2605.13841 2026-05-29 cs.SD cs.AI cs.CL cs.LG

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

EVA-Bench:一种用于评估语音代理的新型端到端框架

Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara

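AI summary Proposes the EVA-Bench framework, which evaluates voice agents' accuracy and experience quality through simulated bot-to-bot audio conversations and two composite metrics (EVA-A and EVA-X).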

Comments Work in progress

AI中文摘要

语音代理是一种通过口语对话完成任务的人工智能系统,越来越多地部署在企业应用中。然而,现有基准测试未能同时解决两个核心评估挑战:生成逼真的模拟对话,以及全面衡量语音特定故障模式的质量。我们提出了EVA-Bench,一个端到端评估框架,同时解决这两个问题。在模拟方面,EVA-Bench通过动态多轮对话协调机器人间的音频对话,并自动进行模拟验证,检测用户模拟器错误并在评分前适当重新生成对话。在测量方面,EVA-Bench引入了两个复合指标:EVA-A(准确性),捕捉任务完成度、忠实度和音频级语音保真度;以及EVA-X(体验),捕捉对话进展、口语简洁性和话轮转换时机。这两个指标适用于所有主要的代理架构,支持直接的跨架构比较。EVA-Bench包含三个企业领域的213个场景、一个用于口音和噪声鲁棒性的受控扰动套件,以及区分峰值能力和可靠能力的pass@1、pass@k、pass^k测量。在跨越所有三种架构的12个系统中,我们发现:(1)没有系统在EVA-A pass@1和EVA-X pass@1上同时超过0.5;(2)峰值性能和可靠性能差异显著(EVA-A上pass@k与pass^k的中位数差距为0.44);(3)口音和噪声扰动暴露了显著的鲁棒性差距,其影响因架构、系统和指标而异(平均Δ高达0.314)。我们在开源许可下发布了完整的框架、评估套件和基准数据。

英文摘要

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to all major agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k--pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean $Δ$ up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
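
The distinction between peak capability (pass@k) and reliable capability (pass^k) can be illustrated with the standard combinatorial estimators. The abstract does not define the metrics, so this sketch assumes the common readings: pass@k is the probability that at least one of k sampled trials succeeds, and pass^k is the probability that all k succeed, both estimated from n recorded trials with c successes. The function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k trials (drawn from n) succeeds."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated probability that all k trials (drawn from n) succeed."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    return comb(c, k) / comb(n, k)


# Toy scenario: 5 trials of one scenario, 3 of them successful.
n, c, k = 5, 3, 3
gap = pass_at_k(n, c, k) - pass_hat_k(n, c, k)
print(pass_at_k(n, c, k), pass_hat_k(n, c, k), gap)
```

For k = 1 the two estimators coincide at c / n, and the gap widens as k grows whenever a system succeeds on only some of its trials.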

2605.07707 2026-05-29 cs.AI

Hierarchical Task Network Planning with LLM-Generated Heuristics

基于LLM生成启发式的层次任务网络规划

Felipe Meneguzzi, Alexandre Buchweitz, Augusto B. Corrêa, Victor Scherer Putrich, André Grahl Pereira

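AI summary Studies using large language models to generate search heuristics for hierarchical task network (HTN) planning. Evaluated with the Pytrich planner on six benchmark domains, the LLM-generated heuristics nearly match the coverage of the best available HTN planner and substantially reduce search effort on 83% of shared problems.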

Comments 9 pages, 3 figures; submitted to NeurIPS 2026

AI中文摘要

HTN规划是经典规划的一种变体,其中算法不是搜索线性动作序列,而是使用方法库分解高层任务,直到只剩下可执行动作。一方面,这允许引入领域知识,通过方法库加速解决方案的搜索。另一方面,它带来了超越经典状态空间搜索的挑战。尽管最近的研究产生了一些加速HTN规划的启发式和新型算法,但这些启发式仍不如经典规划算法中的启发式信息丰富。我们研究大语言模型(LLMs)能否为HTN规划生成有效的搜索启发式,将Corrêa、Pereira和Seipp(2025)的方法从经典规划扩展到层次规划。使用Pytrich规划器在六个标准全序HTN基准领域上,我们评估了九个LLM在领域特定提示下生成的启发式,并将它们与TDG和LMCount领域无关基线以及PANDA规划器进行比较。结果表明,LLM生成的启发式在覆盖度上几乎与最佳可用HTN规划器相当,同时在83%的共享问题上显著减少了搜索开销。

英文摘要

HTN planning is a variation of classical planning where, instead of searching for a linear sequence of actions, an algorithm decomposes higher-level tasks using a method library until only executable actions remain. On one hand, this allows one to introduce domain knowledge that can speed up the search for a solution through the method library. On the other hand, it creates challenges that go beyond those of classical state-space search. While recent research produced a number of heuristics and novel algorithms that speed up HTN planning, these heuristics are not yet as informative as those available in classical planning algorithms. We investigate whether large language models (LLMs) can generate effective search heuristics for HTN planning, extending the methodology of Corrêa, Pereira, and Seipp (2025) from classical to hierarchical planning. Using the Pytrich planner on six standard total-order HTN benchmark domains, we evaluate heuristics generated by nine LLMs under domain-specific prompting and compare them against the TDG and LMCount domain-independent baselines and the PANDA planner. Our results show that LLM-generated heuristics nearly match the coverage of the best available HTN planner, while substantially reducing search effort on 83% of shared problems.
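
The decomposition process described above, where compound tasks are expanded through a method library until only executable actions remain, can be sketched as a small total-order planner. This is a toy under stated assumptions: the travel domain, the `ACTIONS` and `METHODS` tables, and the plain depth-first search are all hypothetical, and it is not how Pytrich or PANDA is implemented. In a heuristic-search planner, a heuristic (LLM-generated or otherwise) would decide which partial decomposition to expand next, instead of the fixed method order used here.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# Primitive actions: name -> (preconditions, add effects, delete effects).
ACTIONS: Dict[str, Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]] = {
    "walk": (frozenset(), frozenset({"at_station"}), frozenset()),
    "buy_ticket": (
        frozenset({"at_station", "has_money"}),
        frozenset({"has_ticket"}),
        frozenset({"has_money"}),
    ),
    "ride_train": (
        frozenset({"at_station", "has_ticket"}),
        frozenset({"at_destination"}),
        frozenset({"at_station"}),
    ),
    "call_taxi": (frozenset({"has_phone"}), frozenset({"at_destination"}), frozenset()),
}

# Method library: compound task -> alternative ordered decompositions.
METHODS: Dict[str, List[List[str]]] = {
    "travel": [["call_taxi"], ["walk", "buy_ticket", "ride_train"]],
}


def decompose(tasks: List[str], state: FrozenSet[str]) -> Optional[List[str]]:
    """Depth-first total-order decomposition; returns a primitive plan or None."""
    if not tasks:
        return []
    head, rest = tasks[0], tasks[1:]
    if head in ACTIONS:
        pre, add, delete = ACTIONS[head]
        if not pre <= state:
            return None  # precondition unmet: this branch fails
        tail = decompose(rest, (state - delete) | add)
        return None if tail is None else [head] + tail
    for subtasks in METHODS.get(head, []):
        plan = decompose(subtasks + rest, state)  # try each method in order
        if plan is not None:
            return plan
    return None


plan = decompose(["travel"], frozenset({"has_money"}))
print(plan)
```

With only `has_money` in the initial state, the taxi method fails its precondition and the planner falls back to the train decomposition.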

2605.06322 2026-05-29 cs.LG

SMolLM: Small Language Models Learn Small Molecular Grammar

SMolLM: 小型语言模型学习小型分子语法

Akhil Jindal, Harang Ju

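AI summary Proposes SMolLM, a 53K-parameter weight-shared transformer that learns SMILES grammar in a fixed hierarchy and generates molecules with 95% validity on the ZINC-250K benchmark, outperforming a GPT model with 10 times more parameters.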

Comments 19 pages, 5 figures, 11 tables

AI中文摘要

用于分子设计的语言模型已扩展到数亿个参数,但人们对它们如何学习化学语法知之甚少。我们训练了SMolLM,一个53K参数的权重共享Transformer,在ZINC-250K药物样分子基准上生成新颖的SMILES,有效性达95%,优于参数多10倍的标准GPT。从机制上看,同一模块在多次前向传播中以固定层次结构解决SMILES约束:首先是括号,其次是环,最后是化合价,这一点通过错误分类和线性探测得到证明,并通过消融实验隔离出括号匹配头。综合这些结果,我们得到了一个紧凑、可机械解释的分子生成器,以及一个用于研究形式语言领域迭代计算的测试平台。

英文摘要

Language models for molecular design have scaled to hundreds of millions of parameters, yet how they learn chemical grammar is poorly understood. We train SMolLM, a 53K-parameter weight-shared transformer, to generate novel SMILES with 95% validity on the ZINC-250K drug-like-molecule benchmark, outperforming a standard GPT with 10 times more parameters. Mechanistically, the same block resolves SMILES constraints across passes in a fixed hierarchy: brackets first, rings second, and valence last, as shown by error classification and linear probing, with ablation isolating the bracket-matching head. Together, these results yield a compact, mechanistically interpretable molecular generator and a testbed for studying iterative computation in formal-language domains.

2605.04916 2026-05-29 cs.AI cs.LG cs.SC

A Foundation Model for Zero-Shot Logical Rule Induction

零样本逻辑规则归纳的基础模型

Yin Jun Phua

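AI summary Proposes Neural Rule Inducer (NRI), a pretrained model built on a statistical encoder and parallel slot-based decoding that performs zero-shot logical rule induction and generalizes to new predicates without retraining.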

Comments Camera-ready version accepted at IJCAI 2026, with full appendices

AI中文摘要

归纳逻辑编程(ILP)从数据中学习可解释的逻辑规则。现有方法是传导性的:其学习参数绑定到特定谓词,并且每个新任务都需要重新训练。我们引入了神经规则归纳器(NRI),一种用于零样本规则归纳的预训练模型。NRI 不编码文字标识,而是使用领域无关的统计属性(如类别条件率、熵和共现)来表示文字,这些属性无需重新训练即可泛化到不同的标识和数量。该模型由一个统计编码器和一个基于并行槽的解码器组成。并行解码保持了逻辑析取的置换不变性;而自回归解码器则会施加任意子句顺序。乘积 T-范数松弛使规则执行可微分,从而仅基于预测准确性进行端到端训练。我们在规则恢复、对标签噪声和虚假相关性的鲁棒性以及零样本迁移到真实世界基准上评估了 NRI,并相信这项工作开启了符号推理基础模型的可能性。代码和参考检查点可在 https://github.com/phuayj/neural-rule-inducer 获取。

英文摘要

Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to specific predicates and require retraining for each new task. We introduce Neural Rule Inducer (NRI), a pretrained model for zero-shot rule induction. Rather than encoding literal identities, NRI represents literals using domain-agnostic statistical properties such as class-conditional rates, entropy, and co-occurrence, which generalize across variable identities and counts without retraining. The model consists of a statistical encoder and a parallel slot-based decoder. Parallel decoding preserves the permutation invariance of logical disjunction; an autoregressive decoder would instead impose an arbitrary clause order. Product T-norm relaxation makes rule execution differentiable, allowing end-to-end training on prediction accuracy alone. We evaluate NRI on rule recovery, robustness to label noise and spurious correlations, and zero-shot transfer to real-world benchmarks, and we believe this work opens up the possibility of foundation models for symbolic reasoning. Code and the reference checkpoint are available at https://github.com/phuayj/neural-rule-inducer.
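
The product T-norm relaxation mentioned above can be illustrated on a rule in disjunctive normal form. This is a minimal sketch, not NRI's implementation: conjunction is taken as the product of truth values, and disjunction is assumed to use the dual probabilistic sum, which the abstract does not specify. The clause encoding and function names are hypothetical.

```python
from math import prod


def soft_and(values):
    """Product T-norm: conjunction as a product of truth values in [0, 1]."""
    return prod(values)


def soft_or(values):
    """Probabilistic sum (dual of the product T-norm), assumed here for disjunction."""
    return 1.0 - prod(1.0 - v for v in values)


def rule_score(clauses, truth):
    """clauses: list of clauses, each a list of (literal_index, negated) pairs."""
    clause_scores = [
        soft_and((1.0 - truth[i]) if negated else truth[i] for i, negated in clause)
        for clause in clauses
    ]
    return soft_or(clause_scores)


# Hypothetical rule: (x0 AND NOT x1) OR x2
rule = [[(0, False), (1, True)], [(2, False)]]

print(rule_score(rule, [1.0, 0.0, 0.0]))  # crisp inputs behave like Boolean logic
print(rule_score(rule, [0.9, 0.2, 0.5]))  # soft inputs give a graded score
```

Because both operators are built from products, the score is smooth in the inputs, and reordering the clauses leaves it unchanged up to floating-point rounding, which mirrors the permutation invariance that motivates parallel decoding.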

2605.02116 2026-05-29 cs.LG

Statistical Consistency and Generalization of Contrastive Representation Learning

对比表示学习的统计一致性与泛化性

Yuanfan Li, Xiyuan Wei, Tianbao Yang, Yiming Ying

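AI summary Develops a unified statistical learning theory showing that the contrastive loss is statistically consistent with optimal ranking, and derives generalization bounds that improve as the number of negative samples grows, explaining the empirical benefits of large negative sets.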

Comments Accepted by ICML 2026

AI中文摘要

对比表示学习(CRL)支撑着许多现代基础模型。尽管最近取得了理论进展,现有分析仍存在几个关键限制:(i)CRL的统计一致性仍知之甚少;(ii)可用的泛化界随着负样本数量的增加而恶化,这与大负样本集的经验优势相矛盾;(iii)CRL的检索性能受到的理论关注有限。在本文中,我们为CRL发展了一个统一的统计学习理论。对于下游任务,我们使用AUC型总体准则评估检索质量,并证明对比损失与最优排序是\textit{统计一致的}。我们进一步建立了一个\textit{校准型不等式},定量地将过剩对比风险与过剩检索次优性联系起来。对于上游训练,我们研究了监督和自监督对比目标,并分别推导了阶为$O(1/m + 1/\sqrt{n})$和$O(1/\sqrt{m} + 1/\sqrt{n})$的泛化界,其中$m$表示负样本数量,$n$表示锚点数量。这些界不仅解释了大负样本集的经验优势,还揭示了$m$和$n$之间的显式权衡。在大规模视觉-语言模型上的广泛实验证实了我们的理论预测。

英文摘要

Contrastive representation learning (CRL) underpins many modern foundation models. Despite recent theoretical progress, existing analyses suffer from several key limitations: (i) the statistical consistency of CRL remains poorly understood; (ii) available generalization bounds deteriorate as the number of negative samples increases, contradicting the empirical benefits of large negative sets; and (iii) the retrieval performance of CRL has received limited theoretical attention. In this paper, we develop a unified statistical learning theory for CRL. For downstream tasks, we evaluate retrieval quality using an AUC-type population criterion and show that the contrastive loss is \emph{statistically consistent} with optimal ranking. We further establish a \emph{calibration-style inequality} that quantitatively relates excess contrastive risk to excess retrieval suboptimality. For upstream training, we study both supervised and self-supervised contrastive objectives and derive generalization bounds of order $O(1/m + 1/\sqrt{n})$ and $O(1/\sqrt{m} + 1/\sqrt{n})$, respectively, where $m$ denotes the number of negative samples and $n$ the number of anchor points. These bounds not only explain the empirical advantages of large negative sets but also reveal an explicit trade-off between $m$ and $n$. Extensive experiments on large-scale vision--language models corroborate our theoretical predictions.

2605.01663 2026-05-29 cs.LG cs.RO

Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

基于流锚定噪声条件Q学习的离线强化学习:高效且表达力强的方法

Sungyoung Lee, Dohyeong Kim, Eshan Balachandar, Zelal Su Mustafaoglu, Keshav Pingali

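AI summary Proposes the FAN algorithm, which achieves efficient offline reinforcement learning with a single flow policy iteration and a single Gaussian noise sample, substantially reducing computational cost while maintaining high performance.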

Comments ICML 2026

AI中文摘要

我们提出流锚定噪声条件Q学习(FAN),一种高效且高性能的离线强化学习算法。近期工作表明,表达力强的流策略和分布性评论家能提升离线强化学习性能,但计算成本高。具体而言,流策略需要迭代采样才能产生单个动作,分布性评论家需要计算多个样本(如分位数)来估计价值。为解决这些低效问题并保持高性能,我们引入FAN。我们的方法采用行为正则化技术,仅需单次流策略迭代,且分布性评论家仅需单个高斯噪声样本。我们对收敛性和性能边界的理论分析表明,这些简化不仅提高了效率,还带来了更优的任务性能。在机器人操作和运动任务上的实验表明,FAN实现了最先进的性能,同时显著减少了训练和推理时间。我们在https://github.com/brianlsy98/FAN 发布代码。

英文摘要

We propose Flow-Anchored Noise-conditioned Q-Learning (FAN), a highly efficient and high-performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that uses a single flow policy iteration and requires a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes. We release our code at https://github.com/brianlsy98/FAN.

2605.01194 2026-05-29 cs.RO

VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

VLA-ATTC:基于相对动作评判模型的VLA模型自适应测试时计算

Wenhao Li, Xiu Su, Yichao Cao, Hongyan Xu, Xiaobo Xia, Shan You, Yi Chen, Chang Xu

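AI summary Proposes the VLA-ATTC framework, which enables adaptive test-time compute through an uncertainty-based "cognitive clutch" and a Relative Action Critic (RAC) model, reducing the failure rate of the SOTA model PI0.5 by over 50% on the LIBERO-LONG benchmark.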

AI中文摘要

视觉-语言-动作(VLA)模型在具身操作中展现了卓越的能力和泛化性。然而,它们的决策依赖于快速、本能的过程,缺乏深思熟虑。当面对需要更多考虑的复杂或模糊场景时,这种策略往往会导致次优或灾难性的动作。在本文中,我们引入了\textbf{VLA-ATTC},一个赋予VLA模型自适应测试时计算(TTC)能力的框架。VLA-ATTC采用基于不确定性的“认知离合器”,在必要时动态地从反射执行过渡到TTC深思阶段。在TTC阶段,一种新颖的\textbf{相对动作评判}(RAC)模型通过成对比较从生成的候选动作中识别最优动作。这种相对机制取代了不稳定的绝对值估计,显著简化了学习目标。此外,我们引入了一种高效的采样策略来分摊计算成本,以及一个自动数据管道,无需人工标注即可整理偏好对。在LIBERO-LONG基准上,VLA-ATTC将SOTA模型PI0.5的失败率降低了50%以上。我们将开源所有代码和权重。

英文摘要

Vision-Language-Action (VLA) models have demonstrated remarkable capabilities and generalization in embodied manipulation. However, their decision-making relies on a fast, instinctive process that lacks deliberation. This strategy often leads to suboptimal or catastrophic actions when facing complex or ambiguous scenarios that require greater consideration. In this paper, we introduce \textbf{VLA-ATTC}, a framework that endows VLA models with adaptive test-time compute (TTC). VLA-ATTC employs an uncertainty-based ``cognitive clutch'' to dynamically transition from reflexive execution to a TTC deliberation phase when necessary. During the TTC phase, a novel \textbf{Relative Action Critic} (RAC) model identifies the optimal action from generated candidates via pairwise comparisons. This relative mechanism replaces unstable absolute value estimation, significantly simplifying the learning objective. Furthermore, we introduce an efficient sampling strategy to amortize computational costs and an automated data pipeline that curates preference pairs without manual annotation. On the LIBERO-LONG benchmark, VLA-ATTC reduces the failure rate of the SOTA model PI0.5 by over 50\%. We will open-source all the code and weights.
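
The control flow described above, an uncertainty gate followed by pairwise selection among candidates, can be sketched as follows. Everything concrete here is an assumption made for illustration: the abstract does not say how uncertainty is measured, how comparisons are aggregated, or what interface the RAC model exposes. The names `act`, `select_by_pairwise`, and `prefers`, the sequential knockout scheme, and the scalar toy actions are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

Action = TypeVar("Action")


def select_by_pairwise(
    candidates: Sequence[Action], prefers: Callable[[Action, Action], bool]
) -> Action:
    """Sequential knockout: keep the winner of each pairwise comparison."""
    best = candidates[0]
    for challenger in candidates[1:]:
        if prefers(challenger, best):  # stand-in for a learned relative critic
            best = challenger
    return best


def act(reflex_action, uncertainty, threshold, sample_candidates, prefers):
    """Execute reflexively when confident; otherwise deliberate over candidates."""
    if uncertainty <= threshold:
        return reflex_action  # fast path: no extra test-time compute
    candidates = [reflex_action, *sample_candidates()]
    return select_by_pairwise(candidates, prefers)


# Toy setup: actions are scalars and the critic prefers values closer to 0.3.
TARGET = 0.3


def prefers(a: float, b: float) -> bool:
    return abs(a - TARGET) < abs(b - TARGET)


def sample_candidates():
    return [0.2, 0.35]


confident = act(0.9, uncertainty=0.1, threshold=0.5,
                sample_candidates=sample_candidates, prefers=prefers)
uncertain = act(0.9, uncertainty=0.8, threshold=0.5,
                sample_candidates=sample_candidates, prefers=prefers)
print(confident, uncertain)
```

The critic only needs to answer which of two actions is better, which is the sense in which a relative comparison avoids estimating an absolute value for each candidate.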

2605.01191 2026-05-29 cs.RO

Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery

Sentinel-VLA:一种具有主动状态监控的元认知VLA模型,用于动态推理和错误恢复

Wenhao Li, Xiu Su, Dan Niu, Yichao Cao, Hongyan Xu, Zhe Qu, Lei Fan, Shan You, Chang Xu

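AI summary Proposes Sentinel-VLA, whose active sentinel module monitors execution status and triggers dynamic reasoning or error recovery only when necessary. Combined with a self-evolving continual learning algorithm and an orthogonal continual adapter, and trained on automatically generated data spanning 44 tasks, it raises real-world task success by over 30% compared with PI0.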

AI中文摘要

视觉-语言-动作(VLA)模型通过利用广泛的世界知识和强大的泛化能力,推动了具身操作领域的发展。然而,当前的VLA模型仍面临几个关键挑战,包括推理能力有限、缺乏状态监控以及难以自我纠正。在本文中,我们引入了\textbf{Sentinel-VLA},一种元认知VLA模型,配备了一个主动的“哨兵”模块来监控实时执行状态。仅在必要时,例如在初始规划或检测到错误时,模型会触发动态推理或制定错误恢复方案。这种按需推理机制确保了鲁棒的决策,同时最小化计算开销。值得注意的是,所有训练数据(涵盖44个任务和超过260万次转换)都是通过我们设计的流水线自动生成和标注的。我们还提出了自进化持续学习(SECL)算法,该算法允许Sentinel-VLA识别其能力边界并自动收集数据进行扩展,并与正交持续适配器(OC-Adapter)配对,将参数更新约束在正交空间中,从而防止灾难性遗忘。真实世界实验表明,与最先进的模型PI0相比,Sentinel-VLA将任务成功率提高了30%以上。我们将开源所有代码、权重和数据生成流水线。

英文摘要

Vision-language-action (VLA) models have advanced the field of embodied manipulation by harnessing broad world knowledge and strong generalization. However, current VLA models still face several key challenges, including limited reasoning capability, lack of status monitoring, and difficulty in self-correction. In this paper, we introduce \textbf{Sentinel-VLA}, a metacognitive VLA model equipped with an active ``sentinel'' module to monitor real-time execution status. Only when necessary, such as during initial planning or upon detecting an error, does the model trigger dynamic reasoning or formulate error-recovery solutions. This on-demand reasoning mechanism ensures robust decision-making while minimizing computational overhead. Notably, all training data (spanning 44 tasks and over 2.6 million transitions) is automatically generated and annotated through our designed pipeline. We also propose the Self-Evolving Continual Learning (SECL) algorithm, which allows Sentinel-VLA to identify its capability boundaries and automatically collect data for expansion, paired with the Orthogonal Continual Adapter (OC-Adapter), which constrains parameter updates to an orthogonal space, thereby preventing catastrophic forgetting. Real-world experiments demonstrate that Sentinel-VLA boosts the task success rate by over 30\% compared to the SOTA model, PI0. We will open-source all the code, weights, and data generation pipeline.