arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.24823 2026-05-26 cs.AI

Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities

Yilei Zhang

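AI summary: This paper proposes the Agent Manufacturing paradigm, in which foundation-model agents coordinate production by interpreting open-ended goals, planning over long horizons, invoking tools and machines, and negotiating with other agents and humans, thereby automating the coordinative cognitive work that humans perform in industry.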

Abstract

Manufacturing has passed through four widely recognized paradigms - mechanization, electrification, programmable automation, and Smart Manufacturing - each defined by the kind of work it shifted from humans to machines. In every case, one layer of industrial work remained fundamentally human: the coordinative cognition of production, comprising the interpretive, allocative, diagnostic, negotiative, and governance work exercised by engineers, planners, and operational managers. We argue that a fifth transition is now underway in which this layer, rather than the physical or routine-cognitive layers below it, is what foundation-model-based autonomous agents primarily redistribute. We name this paradigm Agent Manufacturing and define it operationally: a manufacturing system is an instance of Agent Manufacturing when its principal coordination mechanism is reasoning performed by foundation-model agents that can interpret open-ended goals, plan over long horizons, invoke tools and machines, and negotiate with other agents and humans. This is a narrower and more falsifiable definition than the existing literature on cognitive manufacturing or Industry 5.0 provides, and it distinguishes the paradigm sharply from classical multi-agent manufacturing systems, which were autonomous only within closed protocol spaces.

2605.24816 2026-05-26 cs.CV

AOEPT: Breaking the Implicit Modality-Reduction Bottleneck in Modality-Missing Prompt Tuning

Jian Lang, Rongpei Hong, Ting Zhong, Fan Zhou

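AI summary: Proposes AOEPT, which uses Modal-Contextualized Prompts (MCPs) to distill global modality-wise priors as latent information sources for missing modalities, restoring the reasoning scope of multimodal Transformers and addressing the implicit modality-reduction bottleneck in modality-missing scenarios.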

Comments 20 pages, Accepted by ICML 2026, Code is available from https://github.com/Jian-Lang/AOEPT

Abstract

Deploying multimodal systems in real-world environments often entails handling modality-missing scenarios, where one or more modalities are unavailable. While recent studies address this challenge for the general Multimodal Transformer (MT) architecture via prompt tuning, we identify a fundamental limitation in these methods: the Implicit Modality-Reduction bottleneck. By conditioning prompts solely on the observed modalities, they inadvertently restrict the reasoning scope of MTs to the modality-reduced subspace, cutting off access to the latent information sources of the missing modalities. To overcome this limitation, we propose AOEPT, which pioneers a novel modal-contextualized prompting fashion. Specifically, we introduce lightweight Modal-Contextualized Prompts (MCPs) that distill global modality-wise priors from training data, serving as latent repositories of the information sources for missing modalities. Conditioned on the remaining modalities, these MCPs are instantiated into instance-aware prompts that selectively augment missing-modality information for each sample, thereby restoring the reasoning scope of MTs beyond the observed-modality-only subspace. Experiments across various multimodal benchmarks and backbones confirm the strong performance of AOEPT, with minimal computational overhead.

2605.24813 2026-05-26 cs.RO cs.SY eess.SY

Manifold-Constrained MPPI: Real-Time Sampling-Based Control Under Hard Constraints

Seulchan Lee, Sanghyun Kim

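AI summary: Proposes Manifold-Constrained MPPI (MC-MPPI), which learns a low-dimensional representation of the constraint manifold with a variational autoencoder and combines it with a quadratic programming controller to satisfy hard constraints in real time.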

Comments International Journal of Control, Automation, and Systems

Abstract

Sampling-based model predictive control methods, such as Model Predictive Path Integral (MPPI), offer derivative-free optimization and robustness in complex robotic systems. However, standard MPPI relies on cost-based soft penalties that cannot guarantee hard-constraint satisfaction, severely limiting its applicability to highly constrained tasks such as closed-chain manipulation. To address this, we propose Manifold-Constrained MPPI (MC-MPPI), a real-time sampling-based control framework that enforces manifold-based equality constraints while preserving the computational advantages of MPPI. The key idea is to decouple the constrained optimal control problem into latent-space planning and execution-level correction. At the planning stage, a Variational Autoencoder (VAE) learns a low-dimensional latent representation of the constraint manifold, enabling MPPI to efficiently generate near-feasible candidate trajectories without per-sample modification. Since this reference enables accurate linearization of the equality constraints, an execution-level Quadratic Programming (QP) controller resolves the residual manifold mismatch in a single solve rather than through iterative projection. Experiments on a 14-DoF closed-chain dual-arm system in both simulation and real-world settings demonstrate that MC-MPPI operates stably at 100 Hz, reliably navigates dynamic environments while effectively maintaining hard equality constraints, and significantly outperforms baseline methods in tracking accuracy. Supplementary videos and implementation details are available at https://rcilab.github.io/mcmppi.
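For readers unfamiliar with the baseline, the following is a minimal sketch of one standard MPPI update, not the MC-MPPI method itself: it omits the VAE latent space and the QP correction entirely. The function names, the toy 1-D integrator cost, and all parameter values are assumptions made for illustration.

```python
import math
import random

def mppi_update(nominal, rollout_cost, rng, num_samples=64, sigma=0.5, lam=1.0):
    """One MPPI iteration: perturb the nominal controls, then average the
    perturbations with weights proportional to exp(-cost / lam)."""
    # Sample Gaussian perturbations around the nominal control sequence.
    noises = [[rng.gauss(0.0, sigma) for _ in nominal] for _ in range(num_samples)]
    costs = [rollout_cost([u + e for u, e in zip(nominal, eps)]) for eps in noises]
    # Subtract the minimum cost before exponentiating for numerical stability.
    best = min(costs)
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    # The new nominal is the importance-weighted average of the sampled controls.
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
        for t, u in enumerate(nominal)
    ]

def toy_cost(controls):
    # Hypothetical system x_{t+1} = x_t + u_t; penalize distance from x = 1.
    x, cost = 0.0, 0.0
    for u in controls:
        x += u
        cost += (x - 1.0) ** 2
    return cost

rng = random.Random(0)
controls = [0.0] * 5
for _ in range(20):
    controls = mppi_update(controls, toy_cost, rng)
```

Note that the cost here acts only as a soft penalty, which is the limitation the abstract attributes to standard MPPI: nothing in this update forces a sampled trajectory to satisfy an equality constraint exactly.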

2605.24812 2026-05-26 cs.AI

CoRe-Code: Collaborative Reinforcement Learning for Code Generation

Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Xiaoyu Xia, Sumon Biswas

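AI summary: Proposes the CoRe-Code framework, which uses a Planner-Coder paradigm and GRPO-based collaboration-aware reinforcement learning to strengthen coordination and specialization among agents, improving the accuracy and efficiency of generated code.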

Abstract

Large language models (LLMs) have achieved strong performance in code generation, but most methods rely on autoregressive decoding without global planning, often leading to locally coherent yet globally suboptimal solutions (e.g., failing test cases or inefficient complexity). While recent approaches such as Chain-of-Thought (CoT) and multi-agent systems (MAS) introduce planning, their limited role specialization and coordination hinder performance on complex tasks. To address the challenges of coordination and specialization in multi-agent code generation, we propose Collaborative Reinforcement Code (CoRe-Code), a framework for role-specialized LLM agents that enhances inter-agent coordination to generate more accurate and efficient code. CoRe-Code adopts a simple Planner-Coder paradigm, where the Planner produces high-level plans and the Coder executes them to generate code. We further introduce a collaboration-aware reinforcement learning stage based on Group Relative Policy Optimization (GRPO) to enhance role specialization and alignment. Experiments show that CoRe-Code outperforms a wide range of existing RL-based and multi-agent methods. In addition, we demonstrate that CoRe-Code can generalize to other multi-agent frameworks (e.g., Retrieval and Debugging agents), highlighting its flexibility and scalability. We evaluate CoRe-Code on multiple benchmarks of varying difficulty using three base models. Compared to existing baselines, the results show consistent improvements in accuracy, while also achieving higher efficiency in terms of execution time and memory usage, demonstrating the effectiveness and practicality of CoRe-Code.
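The core of GRPO is that each sampled output is scored relative to the other outputs drawn for the same prompt. A minimal sketch of that group-relative advantage follows; it is not CoRe-Code's collaboration-aware reward, which the abstract does not specify. The reward values are hypothetical, and the use of the population standard deviation is an assumption (implementations differ).

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against the mean and spread of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four code samples generated for the same prompt,
# e.g. the fraction of unit tests each one passes.
advantages = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
```

Samples above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to provide a baseline.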

2605.24810 2026-05-26 cs.LG cs.AI cs.RO stat.AP

Cross-Domain Energy-Guided Diffusion Generation for Off-Dynamics Reinforcement Learning

Yu Yang, Yihong Guo, Anqi Liu, Pan Xu

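AI summary: Proposes the CEDGE framework, which uses an energy-guided diffusion model to generate target-domain trajectories, addressing domain adaptation for offline reinforcement learning under dynamics shift.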

Comments 29 pages, 3 figures, and 14 tables

Abstract

Off-dynamics offline reinforcement learning seeks to learn a target-domain policy from a large source dataset and a limited target dataset under mismatched transition dynamics. Existing approaches such as reward augmentation and data filtering are constrained to the source dataset and cannot synthesize new target behavior to improve coverage beyond the collected source trajectories. While recent model-based methods attempt to address this by learning target-aware dynamics, the generated experience is constructed only at the transition level, which leads to accumulated errors over long horizons. These limitations necessitate a shift toward trajectory-level generation for off-dynamics offline RL. We propose CEDGE, a Cross-domain Energy-guided Diffusion GEneration framework. CEDGE trains a trajectory diffusion model on source-domain trajectories and adapts the generated samples to the target domain through energy guidance. This guidance is derived by minimizing the distribution mismatch between the source and desired target-domain trajectories and is decomposed into return, domain, and behavior energy components. The resulting energy-guided trajectories are useful both for direct planning and as synthetic data for policy learning. Since target adaptation is achieved via energy guidance rather than retraining the diffusion model, CEDGE can be efficiently adapted to new target dynamics compared to previous methods. Experiments on the ODRL benchmark demonstrate that trajectory-level energy-guided generation improves diffusion planning under dynamics shifts and produces synthetic data that improves downstream target policy learning.

2605.24808 2026-05-26 cs.LG cs.AI

Disentangled Double Machine Learning for Accurate Causal Effect Estimation

Guodu Xiang, Kui Yu, Yujie Wang, Richang Hong, Fuyuan Cao, Jiye Liang

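AI summary: Proposes Disentangled Double Machine Learning (DDML), which uses causal role disentanglement and residual dependence orthogonalization to address the bias and instability that arise in double machine learning when confounders are not disentangled in high-dimensional or finite-sample settings; it outperforms 13 baselines on synthetic, semi-synthetic, and real-world datasets.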

Comments 15 pages, 9 figures

Abstract

Confounding bias is a key challenge in causal effect estimation from observational data. Double Machine Learning (DML) addresses this issue by estimating treatment and outcome nuisance functions, constructing treatment and outcome residuals, and estimating causal effects from the residuals. However, DML often produces biased and unstable estimates in high-dimensional or finite-sample scenarios. One reason is that DML estimates nuisance functions using all covariates without disentangling distinct latent factors, resulting in unreliable nuisance function estimation. Another is that imprecise nuisance estimation further introduces residual dependence between the treatment residual and the remaining outcome error, undermining the accuracy of causal effect estimates. To address these issues, we propose Disentangled Double Machine Learning (DDML), a novel algorithm that integrates two key strategies. First, a causal role disentanglement strategy decomposes covariates into confounders, treatment-specific factors, and outcome-specific factors to enable reliable nuisance function estimation. Second, a residual dependence orthogonalization strategy mitigates residual dependence caused by nuisance estimation errors to enhance the precision of causal effect estimates. Experimental results on synthetic, semi-synthetic, and real-world datasets demonstrate that DDML significantly outperforms 13 state-of-the-art baseline algorithms in both MAE and RMSE.
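The residual-on-residual step of plain DML, as described in the abstract, can be sketched as below. This is the baseline procedure only, not DDML: there is no disentanglement or orthogonalization step. The single covariate, the linear nuisance models, the two-fold split, and the simulated data with a true effect of 2.0 are all assumptions chosen to keep the example small.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def dml_effect(x, t, y, n_folds=2):
    """Cross-fitted residual-on-residual estimate of the treatment effect."""
    n = len(x)
    res_t, res_y = [0.0] * n, [0.0] * n
    for k in range(n_folds):
        test = [i for i in range(n) if i % n_folds == k]
        train = [i for i in range(n) if i % n_folds != k]
        # Nuisance models for E[T|X] and E[Y|X] are fit on the other fold only.
        m_hat = fit_line([x[i] for i in train], [t[i] for i in train])
        g_hat = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            res_t[i] = t[i] - m_hat(x[i])
            res_y[i] = y[i] - g_hat(x[i])
    # Regress the outcome residual on the treatment residual (no intercept).
    return sum(a * b for a, b in zip(res_t, res_y)) / sum(a * a for a in res_t)

# Simulated data: x confounds both treatment t and outcome y; true effect is 2.0.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(2000)]
t = [0.8 * xi + rng.gauss(0, 1) for xi in x]
y = [2.0 * ti + 1.5 * xi + rng.gauss(0, 1) for ti, xi in zip(t, x)]
theta_hat = dml_effect(x, t, y)
```

With well-specified nuisance models and ample data the estimate lands close to the true effect; the failure modes the abstract describes arise when the nuisance fits are poor.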

2605.24807 2026-05-26 cs.CV

CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation

Shayan Jalilian, Abdul Bais

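AI summary: Proposes the CLIP-Guided SAM framework, which injects CLIP features into SAM's image encoder through lightweight multi-modal semantic adapters for internal semantic conditioning, improving segmentation with little labeled data and supporting both Manual and Semi-Automatic modes.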

Abstract

Promptable foundation models such as the Segment Anything Model (SAM) produce high-quality masks but remain semantically blind, relying on external prompts to specify categories. Existing vision-language approaches address this limitation by using external prompt coupling, where a vision-language model generates spatial prompts for SAM as a separate stage. We propose CLIP-Guided SAM, a parameter-efficient segmentation framework built on internal semantic conditioning. Instead of using semantic signals only to generate prompts, we inject CLIP-derived text, vision, and similarity features directly into SAM's image encoder through lightweight multi-modal semantic adapters. These adapters condition SAM's internal feature representations, allowing semantic information to influence mask prediction while preserving SAM's original promptable interface. Our framework is designed for low labeled-data settings and applies to both general-domain benchmarks and specialized downstream tasks. It supports two operating modes: Manual mode, for interactive segmentation with both text and spatial prompts, and Semi-Automatic text-only mode, for applications that require concept-specific segmentation using only textual input. We show that robustness depends on aligning training with the type of prompts used at inference, making train-test prompt consistency an important design principle. Through extensive experiments and ablations, we evaluate our method against SAM+PEFT baselines without semantic conditioning, vision-language + SAM pipelines, SAM 3, and strong semi-supervised segmentation methods that rely on large amounts of unlabeled data. Across these settings, CLIP-Guided SAM consistently achieves superior or competitive performance while remaining parameter-efficient in both training and deployment.

2605.24806 2026-05-26 cs.SD cs.AI eess.AS

Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models

Muhammad Ashad Kabir, Sirajam Munira

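AI summary: Compares two input modalities, handcrafted acoustic features and raw audio waveforms, to study how zero-shot Parkinson's disease detection performance varies across languages; handcrafted features are more stable in a low-resource language, while audio input yields dataset-dependent gains.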

Comments 6 pages

Abstract

Large audio and language models have recently demonstrated zero-shot reasoning capabilities across various domains. However, it remains unclear how the form of audio input, whether handcrafted acoustic features extracted from speech or the raw audio waveform itself, affects performance for Parkinson's disease (PD) detection across different languages. In this study, we systematically compare two input modalities for zero-shot PD detection: (i) handcrafted acoustic features extracted from speech recordings analyzed by a general-purpose LLM, and (ii) direct waveform input analyzed by audio-capable models. Experiments on PD speech datasets in four languages show that performance varies across input modalities, speech tasks, and languages. Handcrafted acoustic features provide more stable performance in a low-resource language (e.g., Bengali), whereas audio input yields dataset-dependent gains. These findings highlight the impact of input modality on zero-shot PD detection from speech.

2605.24805 2026-05-26 cs.CV

Fishbone: From One 3D Asset to a Million Controllable Edits

Yumeng He, Xiaoying Wang, Peihao Li, Yanjia Huang, Joe Masterjohn, Jiajun Wu, Leonidas Guibas, Yin Yang, Ying Jiang, Chenfanfu Jiang

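AI summary: Proposes Fishbone, a unified rib-spine representation that supports controllable parametric deformation, reduced-space dynamics, and animation of general meshes, and builds the Fishbone-136K dataset, with applications including controllable 3D generation and data augmentation for robot learning.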

Comments 20 pages, 19 figures

Abstract

Large-scale controllable 3D assets are critical for computer graphics, embodied AI, robotics, and interactive content creation, yet creating diverse 3D assets remains challenging due to the high cost of manual modeling and rigging. Shape deformation offers a natural way to generate variations from existing meshes, but existing data-driven methods often rely on sparse user inputs, while parametric editing frameworks require manually designed control structures and category-specific configurations. Inspired by natural creatures, where a central spine governs global shape and cross-sectional ribs control local variation, we introduce Fishbone, a unified rib-spine representation for general shapes that supports controllable parametric mesh deformation, reduced-space dynamics, and animation. Given an input mesh, Fishbone computes a geodesic scalar field with an adaptive heat method, extracts iso-contours as cross-sectional ribs, constructs a smooth geometry-aware spine through rib centers, and associates surface vertices with nearby rib and spine structures using Gaussian-weighted skinning. The resulting representation enables real-time and predictable deformation: ribs control local profiles such as thickness, orientation, and cross-sectional variation, while the spine controls global bending, twisting, and stretching. The same structure also supports reduced-space simulation and keyframe animation. We further construct Fishbone-136K by augmenting Hunyuan3D with rib-spine structures, and demonstrate applications in controllable 3D generation, deformation-based data augmentation for robot learning, interactive mesh editing, and agentic generation. Experiments demonstrate the effectiveness, efficiency, and versatility of the proposed framework.
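The abstract mentions Gaussian-weighted skinning to tie surface vertices to nearby ribs. One plausible reading is sketched below, under stated assumptions: weights fall off as a Gaussian of the distance to each rib center and are normalized to sum to one. The function names, the choice of rib centers as anchors, the `sigma` value, and the translation-only deformation are illustrative assumptions, not the paper's formulation.

```python
import math

def gaussian_skinning_weights(vertex, rib_centers, sigma=0.5):
    """Normalized Gaussian weights tying one vertex to each rib center."""
    raw = [math.exp(-math.dist(vertex, c) ** 2 / (2 * sigma ** 2)) for c in rib_centers]
    total = sum(raw)
    return [w / total for w in raw]

def deform_vertex(vertex, rib_centers, rib_offsets, sigma=0.5):
    """Move a vertex by the weight-blended translation of the ribs."""
    weights = gaussian_skinning_weights(vertex, rib_centers, sigma)
    return tuple(
        v + sum(w * off[k] for w, off in zip(weights, rib_offsets))
        for k, v in enumerate(vertex)
    )

# Three hypothetical ribs along the x-axis; only the middle rib is lifted in y.
ribs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
offsets = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 0.0)]
vertex = (0.9, 0.2, 0.0)
weights = gaussian_skinning_weights(vertex, ribs)
moved = deform_vertex(vertex, ribs, offsets)
```

Because the weights decay smoothly with distance, editing one rib changes the surface locally while leaving distant vertices nearly untouched.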

2605.24803 2026-05-26 cs.LG

Active Learning for Stochastic Contextual Linear Bandits

Emma Brunskill, Ishani Karmarkar, Zhaoqi Li

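AI summary: Proposes an algorithm that learns a near-optimal policy by actively sampling rewards of context-action pairs, proves that active context sampling can improve over the minimax rate by up to a factor of √d, and demonstrates improved sample efficiency on warfarin dose prediction and joke recommendation.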

Abstract

A key goal in stochastic contextual linear bandits is to efficiently learn a near-optimal policy. Prior algorithms for this problem learn a policy by strategically sampling actions but naively (passively) sampling contexts from the underlying context distribution. However, in many practical scenarios -- including online content recommendation, survey research, and clinical trials -- practitioners can actively sample or recruit contexts based on prior knowledge of the context distribution. Despite this potential for active learning, the role of strategic context sampling in stochastic contextual linear bandits is underexplored. We propose an algorithm that learns a near-optimal policy by strategically sampling rewards of context-action pairs. We prove instance-dependent theoretical guarantees demonstrating that our active context sampling strategy can improve over the minimax rate by up to a factor of $\sqrt{d}$, where $d$ is the linear dimension. We show empirically that our algorithm reduces the number of samples needed to learn a near-optimal policy, in tasks such as warfarin dose prediction and joke recommendation.

2605.24799 2026-05-26 cs.CV cs.AI

Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models

Zhipeng Ye, Jiaqi Huang, Feng Jiang, Qiufeng Wang, Yikang Duan, Dawei Wang, Xihang Zhou, Qian Qiao

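AI summary: To address the performance collapse of multimodal large language models in long-sequence recognition, proposes Divide-and-Conquer Inference (DCI), which recursively decomposes the task and applies dynamic pruning to improve the signal-to-noise ratio and classification accuracy.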

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision-language tasks. However, when applied to large-scale image classification, their performance degrades significantly as the label space expands, a phenomenon we define as Performance Collapse in Long Sequence Recognition. Through an information-theoretic analysis, we reveal that this collapse stems from a fundamental conflict between the escalating information entropy and the prominent attention dilution and decay within attention mechanisms, which impairs the model's ability to maintain a sufficient signal-to-noise ratio when processing extremely long prompts. To mitigate this, we propose Divide-and-Conquer Inference (DCI), a novel test-time scaling strategy for visual recognition with MLLMs. DCI recursively decomposes complex global classification tasks into multiple simpler, localized subproblems and employs a dynamic pruning mechanism to compress the search space. This method effectively improves the local signal-to-noise ratio and model accuracy by mitigating the inherent weight dilution issues in long-sequence inference. Moreover, while traditional self-attention incurs a prohibitive quadratic computational complexity, DCI achieves more favorable scaling behavior and substantially accelerates inference in large-scale classification scenarios. Extensive experiments on benchmarks such as ImageNet-1K and ImageNet-21K demonstrate that DCI consistently improves classification accuracy. This enables lightweight open-source models to rival or even surpass frontier closed-source giants without any additional training or fine-tuning. As a model-agnostic, plug-and-play paradigm, DCI offers an efficient approach for scaling the inferential precision of MLLMs in large-scale scenarios.
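The recursive decomposition can be pictured as a tournament over small label chunks. The sketch below assumes a hypothetical `pick_best` callback standing in for one short-prompt model query; it does not reproduce the paper's dynamic pruning rule or prompt format, which the abstract does not detail.

```python
def divide_and_conquer_classify(labels, pick_best, chunk_size=8):
    """Shrink a large label set by repeatedly solving small local sub-problems."""
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2")
    candidates = list(labels)
    while len(candidates) > 1:
        survivors = []
        for start in range(0, len(candidates), chunk_size):
            chunk = candidates[start:start + chunk_size]
            # pick_best stands in for one model call over a short list of labels.
            survivors.append(pick_best(chunk))
        candidates = survivors
    return candidates[0]

# Hypothetical per-label scores replace the model so the example is self-contained.
scores = {f"class_{i}": (i * 37) % 101 for i in range(100)}
predicted = divide_and_conquer_classify(scores, lambda chunk: max(chunk, key=scores.get))
```

Each query sees at most `chunk_size` labels, so prompt length stays bounded no matter how large the full label space is. The price is multiple rounds of queries, and with a real, noisy model a wrong local pick eliminates the correct label for good, which is presumably why a pruning mechanism matters.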

2605.24797 2026-05-26 cs.CV

HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm

Jie-En Yao, Hong-En Chen, C.-C. Jay Kuo

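AI summary: To address the Forward-Forward algorithm's lack of hierarchical coordination and its semantically ambiguous features, proposes the HCL-FF framework, which combines a coarse-to-fine hierarchical learning strategy with a supervised contrastive objective and achieves the best results among FF methods on CIFAR-10 and other datasets.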

Comments Accepted by CVPR 2026. Code: https://github.com/JNNNNYao/HCL-FF

Abstract

Deep neural networks trained with backpropagation have achieved outstanding performance in vision tasks but remain biologically implausible, computationally demanding, and difficult to interpret. The Forward-Forward (FF) algorithm offers a promising alternative by training each layer independently through local goodness objectives. However, its purely local optimization lacks hierarchical coordination across layers, and the decoupling of goodness from features leaves the representations unconstrained and semantically ambiguous. We propose a Hierarchical and Contrastive Learning FF framework (HCL-FF) to address these limitations. HCL-FF introduces (1) a coarse-to-fine hierarchical learning strategy that guides representations from low-level cues to high-level semantics, and (2) a supervised contrastive objective that enforces class-discriminative alignment after goodness decoupling. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that HCL-FF achieves new state-of-the-art performance among FF-based methods, with notable accuracy gains of +5.46%, +17.00%, and +12.51%, respectively.
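As background on the "local goodness objectives" the abstract refers to, here is a minimal sketch of the per-layer loss from the original Forward-Forward formulation, where goodness is the sum of squared activations. It illustrates the baseline only, not HCL-FF's hierarchical or contrastive terms; the threshold `theta` and the example activations are assumed values.

```python
import math

def goodness(activations):
    """Sum of squared activations of one layer."""
    return sum(a * a for a in activations)

def ff_layer_loss(pos_act, neg_act, theta=2.0):
    """Logistic loss that pushes goodness above theta for positive data
    and below theta for negative data."""
    softplus = lambda z: math.log1p(math.exp(z))
    return softplus(theta - goodness(pos_act)) + softplus(goodness(neg_act) - theta)

# Hypothetical layer activations for a positive and a negative input.
well_separated = ff_layer_loss([2.0, 1.0], [0.1, 0.2])
poorly_separated = ff_layer_loss([0.1, 0.2], [2.0, 1.0])
```

The loss depends only on one layer's own activations, so each layer can be trained without gradients flowing back from later layers. That locality is also the source of the missing cross-layer coordination the abstract points out.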

2605.24794 2026-05-26 cs.CV cs.CL

DUEL: Adversarial Self-Play for Multimodal Reasoning

Lin Qiu, Hanqing Zeng, Yao Liu, Bingjun Sun, Guangdeng Liao, Ji Liu

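AI summary: Proposes the DUEL framework, which generates supervision from a pretrained VLM through adversarial self-play and, combined with a length-normalized log-likelihood reward, improves visual reasoning and discrimination without human annotations.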

Abstract

Reinforcement learning (RL) has emerged as an effective paradigm for improving the reasoning capability of vision-language models (VLMs). However, RL-based optimization typically depends on costly high-quality annotations that are difficult to scale. Existing unsupervised alternatives may drift toward biased solutions due to weak visual grounding and the lack of reliable verification signals. We propose a self-evolving post-training framework, DUEL, where supervision emerges from adversarial interactions between two policies initialized from the same pretrained VLM. A Challenger generates an image-grounded true claim together with a minimally perturbed hard-negative counterpart, while a Solver verifies both claims against the image, encouraging fine-grained visual discrimination under near-neighbor semantics. To stabilize optimization, we introduce a length-normalized log-likelihood reward that preserves informative optimization signals beyond binary outcome supervision and improves learning stability under sparse feedback. Experiments show that DUEL consistently improves visual reasoning and robust discrimination without additional human annotations, external reward models, or image editing tools.
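Length normalization in its simplest form divides a sequence's total log-likelihood by its token count, so longer responses are not penalized merely for being long. The sketch below shows only that generic idea; how DUEL turns it into a reward is not specified in the abstract, and the token log-probabilities are made-up values.

```python
def length_normalized_loglik(token_logprobs):
    """Average per-token log-probability of a sequence."""
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token log-probabilities for a short and a long response.
short_resp = [-0.2, -0.3]
long_resp = [-0.2] * 10
short_score = length_normalized_loglik(short_resp)
long_score = length_normalized_loglik(long_resp)
```

The raw sum ranks the long response lower even though each of its tokens is at least as likely as the short response's, whereas the normalized score ranks it higher. The score is also continuous, which fits the abstract's point about providing signal beyond a binary outcome.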

2605.24793 2026-05-26 cs.CL

Beyond the Target: From Imitation to Collaboration in Speculative Decoding

Jinze Li, Yixing Xu, Guanchen Li, Jinfeng Xu, Shuo Yang, Yang Zhang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum

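AI summary: Proposes Collaborative Speculative Decoding (CoSpec), which trains an arbitration policy with reinforcement learning to choose between draft and target tokens during speculative decoding, surpassing target-only performance while retaining the speedup.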

Comments under review

Abstract

Speculative decoding (SPD) accelerates large language model (LLM) inference by letting a smaller draft model propose multiple future tokens that are verified in parallel by a larger target model. The dominant SPD paradigm treats the target model as the sole reliable teacher, accepting a draft token only when it exactly matches the target prediction. This design implicitly assumes that the target is always the better choice at every position. In practice, this assumption does not hold. Although the draft is the weaker model overall, it is not uniformly inferior at the token level. In a meaningful fraction of cases where draft and target disagree, the draft's choice is the one that leads to the correct final answer. Inspired by this, we introduce Collaborative Speculative Decoding (CoSpec), a generalization of SPD that no longer treats the target model as the sole token-level authority. CoSpec trains an arbitration policy via reinforcement learning to decide whether to accept tokens from the draft or target model, selectively accepting draft tokens at mismatches when doing so is likely to yield a correct final answer. Experimental results show that CoSpec maintains substantial speedups while surpassing target-only performance. By shifting the emphasis from imitation to collaboration, CoSpec suggests a new perspective on speculative decoding.
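The exact-match rule that the abstract calls the dominant paradigm can be sketched as follows. The function name and token lists are illustrative, and `target_tokens` is assumed to hold the target model's greedy prediction at each draft position from a single parallel pass. The bonus token usually emitted when every draft token is accepted is omitted for brevity.

```python
def verify_draft(draft_tokens, target_tokens):
    """Exact-match verification: keep the longest agreeing prefix, then
    emit the target's token at the first disagreement and stop."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            # The target's prediction replaces the rejected draft token.
            accepted.append(t)
            break
        accepted.append(d)
    return accepted

# Hypothetical draft proposal and target predictions for the same positions.
out = verify_draft(["the", "cat", "sat", "on"], ["the", "cat", "lay", "on"])
```

In this baseline the target always wins at a mismatch. A CoSpec-style arbiter would replace that branch with a learned decision about which token to keep; that policy is not sketched here.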

2605.24792 2026-05-26 cs.CV cs.AI

Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering

Ojonugwa Oluwafemi Ejiga Peter, Frederick Akor Ejiga, Fahmi Khalifa, Md Mahmudur Rahman

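AI summary: Proposes a dual-pipeline parameter-efficient fine-tuning model that combines Florence-2 with LoRA-adapted Stable Diffusion to address clinical visual question answering and privacy-preserving synthetic data generation, achieving high ROUGE and BLEU scores on the Kvasir-VQA dataset while substantially reducing computational cost.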

Abstract

The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. Such limitations impede the successful application of sophisticated AI models in clinical practice, particularly affecting the reliability and scalability of diagnosis. In this paper, we present a dual-pipeline PEFT model that addresses two fundamental problems: medical Visual Question Answering (VQA) and the generation of privacy-preserving synthetic data. For clinical VQA, we adopt the Florence-2 vision-language model. Leveraging PEFT enhances model interpretability while substantially reducing the computational cost of training. Simultaneously, we employ Low-Rank Adaptation (LoRA) with Stable Diffusion 2.1 to generate high-quality GI images that enhance training databases without violating patient privacy. This research utilized the Kvasir-VQA dataset. Our Florence-2 VQA model achieved ROUGE-1 of 0.92, ROUGE-L of 0.91, and BLEU score improvements from 0.08 to 0.24. Fine-tuning on private datasets consistently showed better results than fine-tuning on public datasets. The rank-4 LoRA synthesis achieved optimal performance with a fidelity score of 0.290, an agreement score of 0.730, and a Frechet BiomedCLIP Distance (FBD) of 1450, reducing computational costs by almost 90 percent. This framework improves the clinical potential of AI in GI endoscopy. Compared to FLUX, MSDM, and Kandinsky 2.2, our model demonstrates superior FBD and strong semantic alignment. While other models lead in Fidelity or Agreement, our lower FBD indicates better image-text coherence. These results establish our approach as a robust solution for enhancing VQA and synthetic data generation in clinical AI.

2605.24789 2026-05-26 cs.CV eess.IV

Self-Supervised Contrastive Learning for Cardiac MR Sequence Classification

自监督对比学习用于心脏磁共振序列分类

Yuli Wang, Hyewon Jung, Dongshen Peng, Yuwei Dai, Jing Wu, Haoyue Guan, Yoko Kato, Zhicheng Jiao, Yu Sun, Ihab Kamel, Joao Lima, Cheng Ting Lin, Harrison Bai

AI总结 针对预训练ViT在心脏MR领域迁移效果差的问题,提出基于图像的自监督对比学习适应策略,在内部数据集上优于监督训练,并泛化到外部MR数据集,四个常见序列分类AUC超过0.75。

AI中文摘要

利用自注意力机制的视觉Transformer(ViT)模型在各种视觉任务(包括图像分类)中展现出强大的泛化能力。然而,这些通常在通用公共数据集上预训练的模型往往缺乏医学成像应用所需的专门领域知识。在本研究中,我们使用内部数据集调查了ViT模型对心脏磁共振(MR)图像的适应情况。我们发现预训练的ViT特征不能有效地迁移到心脏MR领域。为了克服这一限制,我们引入了一种利用基于图像的自监督对比学习的适应策略,与传统的监督训练方法相比,表现出优越的性能。此外,我们适应的ViT模型对外部MR数据集(如BraTS和ADNI)表现出强大的泛化能力。通过消融研究,我们进一步研究了批次大小和数据集规模对性能的影响。最终,我们的适应模型在四种最常见的心脏MR序列上实现了超过0.75的分类AUC。

英文摘要

Vision Transformer (ViT) models, utilizing self-attention mechanisms, have demonstrated robust generalization capabilities across various vision tasks, including image classification. However, these models, typically pretrained on general public datasets, often lack the specialized domain knowledge necessary for medical imaging applications. In this study, we investigate the adaptation of ViT models, specifically for cardiac magnetic resonance (MR) images, using an in-house dataset. We found that pretrained ViT features do not effectively transfer to the cardiac MR domain. To overcome this limitation, we introduce an adaptation strategy that utilizes image-based self-supervised contrastive learning, demonstrating superior performance compared to traditional supervised training approaches. Moreover, our adapted ViT model exhibits strong generalization to external MR datasets such as BraTS and ADNI. Through ablation studies, we further investigate the impact of batch size and dataset scale on performance. Ultimately, our adapted model achieves classification AUC exceeding 0.75 across the four most common cardiac MR sequences.

2605.24786 2026-05-26 cs.LG cs.AI

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

CONF-KV:面向长序列LLM的置信度感知KV缓存淘汰与混合精度存储

Yubo Li, Yidi Miao

AI总结 提出CONF-KV方法,利用模型当前不确定性(置信度)动态调整KV缓存预算,结合混合精度存储和分块在线softmax注意力,在长序列推理中显著降低显存占用并保持高精度。

AI中文摘要

长序列LLM推理使键值(KV)缓存成为GPU内存的主要消耗者,并使每个token的注意力计算越来越昂贵。许多常见的淘汰策略使用静态的最近窗口或历史注意力,忽略了每个解码步骤中计算出的一个信号:模型当前的不确定性。我们引入CONF-KV,一个KV缓存管理器,它将下一个token分布转换为标量置信度分数,并用它来选择每步缓存预算,在模型不确定时保留更多上下文,在模型确定时积极剪枝。在每个预算内,token根据累积注意力质量和最近性的组合进行排序,同时一个受保护的最近窗口保持局部连贯性。我们将该策略与分块在线softmax注意力、混合FP16/INT8存储以及金字塔式逐层预算变体相结合。在四个模型家族和生成长度高达4K的情况下,CONF-KV的显存占用接近固定的512 token滑动窗口,同时与完整KV相比,困惑度差异保持在1.5-2.1点以内。在长达32K token的“大海捞针”测试中,CONF-KV的检索准确率达到91.4%,而滑动窗口为53.8%,H2O为80.6%;在75个VisualWebArena任务中,它以2.8倍的峰值内存降低保留了完整KV成功率的95.3%。

英文摘要

Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attention, leaving unused a signal computed on every decoding step: the model's current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalar confidence score and uses it to choose the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, while a protected recent window preserves local coherence. We combine the policy with blockwise online-softmax attention, mixed FP16/INT8 storage, and a pyramidal per-layer budget variant. Across four model families and generated lengths up to 4K, CONF-KV stays near the footprint of a fixed 512-token sliding window while remaining within 1.5-2.1 perplexity points of full KV. On Needle-in-a-Haystack up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy versus 53.8% for sliding windows and 80.6% for H2O; on 75 VisualWebArena tasks it retains 95.3% of full-KV success at 2.8 times lower peak memory.
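The abstract does not say how the next-token distribution becomes a scalar, or how that scalar maps to a budget. The following is only a minimal sketch of the idea under stated assumptions: confidence is taken to be one minus the normalized entropy, the budget is interpolated linearly between hypothetical `min_budget` and `max_budget` values, and the composite ranking is accumulated attention mass plus a small recency bonus. All function names, defaults, and weights here are illustrative and are not taken from the paper.

```python
import math


def confidence_from_probs(probs):
    """Scalar confidence in [0, 1]: 1 minus normalized Shannon entropy (assumed definition)."""
    n = len(probs)
    if n < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(n)


def cache_budget(confidence, min_budget=128, max_budget=1024):
    """Assumed linear map: low confidence keeps more tokens, high confidence prunes harder."""
    c = min(max(confidence, 0.0), 1.0)
    return round(max_budget - c * (max_budget - min_budget))


def select_tokens(attn_mass, budget, recent_window=4, recency_weight=0.1):
    """Return sorted indices of cached tokens to keep.

    The most recent tokens are always protected; older tokens compete on
    accumulated attention mass plus a recency bonus (hypothetical weighting).
    """
    n = len(attn_mass)
    if budget >= n:
        return list(range(n))
    protected = min(recent_window, budget)
    recent = list(range(n - protected, n))
    older = list(range(n - protected))
    older.sort(key=lambda i: attn_mass[i] + recency_weight * i / n, reverse=True)
    return sorted(older[:budget - protected] + recent)
```

In this sketch a uniform next-token distribution gives confidence near 0 and therefore the largest budget, while a one-hot distribution gives confidence 1 and the smallest budget.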

2605.24784 2026-05-26 cs.AI

GRAIL: AI translation for scientists application workflow on satellite data

GRAIL:面向卫星数据科学家应用工作流的AI翻译

Zhuocheng Shang, Ahmed Eldawy

AI总结 提出GRAIL系统,通过LangGraph管道将Python地理空间工作流翻译为可扩展的Spark程序,无需科学家学习新框架。

AI中文摘要

领域科学家越来越多地开发Python脚本来分析卫星图像,但这些脚本缺乏大规模数据的可扩展性。本文演示了GRAIL,一个代理翻译系统,它将Python地理空间工作流转换为可执行的基于Spark的程序,而无需科学家学习新框架。GRAIL不是微调专门的LLM模型,而是调整RDPro(一个用于卫星数据分析的Scala库),通过结构化文档、API别名函数和面向修复的错误日志使其为LLM就绪。翻译被构建为一个LangGraph管道,将代码生成分解为具有引导输入和输出的显式部分,从而无需重新生成整个程序即可进行有针对性的修复。我们在真实的地理空间工作流上演示了GRAIL,并展示了翻译代码的正确性和可扩展性。

英文摘要

Domain scientists increasingly develop Python scripts to analyze satellite imagery, but these scripts do not scale to large-scale data. This paper demonstrates GRAIL, an agentic translation system that converts Python geospatial workflows into executable Spark-based programs without requiring scientists to learn a new framework. Rather than fine-tuning a specialized LLM, GRAIL adapts RDPro, a Scala library for satellite data analysis, to make it LLM-ready using structured documentation, API alias functions, and repair-oriented error logs. Translation is structured as a LangGraph pipeline that decomposes code generation into explicit sections with guided inputs and outputs, enabling targeted repair without regenerating the full program. We demonstrate GRAIL on real-world geospatial workflows and showcase the correctness and scalability of the translated code.

2605.24779 2026-05-26 cs.LG cs.AI math.CO

Complement Submodular Information Measures for Balanced and Robust Data Selection

互补子模信息度量用于平衡和鲁棒的数据选择

Rishabh Iyer

AI总结 提出互补子模信息(CSI)目标函数,通过建模子集与其补集之间的共享结构信息,实现平衡且鲁棒的数据选择,并在理论上证明其近似单调性和贪心近似保证,实验表明在鲁棒隐藏切片感知子集选择中优于经典子模目标。

AI中文摘要

子模优化已成为数据选择、检索、摘要和表示学习的基本范式,因为它能够建模覆盖度、多样性和代表性。然而,经典子模目标仅优化所选子集,并未明确保留所选子集与剩余数据之间的结构信息。在许多现代机器学习应用中,包括训练/验证/测试分割、基准构建和鲁棒子集选择,选择的质量关键取决于在所选子集及其补集之间保持平衡结构。在这项工作中,我们引入了互补子模信息(CSI),这是一类新的互补感知子模目标,用于量化子集与其补集之间的共享结构信息。我们的框架产生了几个经典子模函数的互补感知变体,包括设施选址、图割、LogDet、饱和覆盖、集合覆盖、概率集合覆盖和基于特征函数。我们分析了CSI目标的理论性质,并表明它们在有限曲率条件下表现出近似单调性,从而得到接近$(1-1/e)$的贪心近似保证。实验上,CSI目标在鲁棒隐藏切片感知子集选择中始终优于标准子模目标。特别是,CSI目标显著改善了相干稀有/尾部语义结构的保留,同时抑制了噪声和孤立异常值,从而显著提高了下游预测性能。合成实验进一步说明了不同的CSI实例如何捕获代表性、多样性、连通性和平衡邻域保留的互补概念。

英文摘要

Submodular optimization has become a fundamental paradigm for data selection, retrieval, summarization, and representation learning due to its ability to model coverage, diversity, and representativeness. However, classical submodular objectives optimize only the selected subset and do not explicitly preserve structural information between the selected subset and the remaining data. In many modern machine learning applications, including train/validation/test splitting, benchmark construction, and robust subset selection, the quality of a selection depends critically on preserving balanced structure across both the selected subset and its complement. In this work, we introduce Complement Submodular Information (CSI), a new class of complement-aware submodular objectives that quantify shared structural information between a subset and its complement. Our framework induces complement-aware variants of several classical submodular functions including Facility Location, Graph Cut, LogDet, Saturated Coverage, Set Cover, Probabilistic Set Cover, and Feature Based Functions. We analyze the theoretical properties of CSI objectives and show that they exhibit approximate monotonicity under bounded curvature conditions, leading to near-$(1-1/e)$ greedy approximation guarantees. Empirically, CSI objectives consistently outperform standard submodular objectives on robust hidden-slice-aware subset selection. In particular, CSI objectives significantly improve preservation of coherent rare/tail semantic structure while simultaneously suppressing noisy and isolated outliers, leading to substantially improved downstream predictive performance. Synthetic experiments further illustrate how different CSI instantiations capture complementary notions of representativeness, diversity, connectivity, and balanced neighborhood preservation.

2605.24777 2026-05-26 cs.RO

MR-LiDAR: A Multi-Resolution Roadside LiDAR Benchmark for Perception Diagnostics and Deployment Guidance

MR-LiDAR:用于感知诊断和部署指导的多分辨率路边激光雷达基准

Shunlai Cui, Peng Cao, Yuan Zhu, Yongjiang He, Jiacheng Yin, Xiao Huo, Gang Cao, Xiaobo Liu

AI总结 针对激光雷达选型缺乏实证基准的问题,提出MR-LiDAR多分辨率基准,通过控制光束数和分布等变量,系统分析其对感知性能的影响,并给出选型指导。

Comments 9 pages, 6 figures

AI中文摘要

激光雷达选型是路边感知系统中的关键问题,因为它直接决定了感知能力和部署成本。然而,缺乏用于比较不同激光雷达配置下感知性能的经验基准,极大地限制了科学的传感器选择和部署规划。为填补这一空白,我们提出了MR-LiDAR,一个用于路边感知诊断的受控多分辨率激光雷达基准。在相同的路边场景中,使用16、32、80和128线激光雷达,我们收集了不同距离下各类交通参与者(包括车辆和弱势道路使用者(VRU))的点云和真实标注。这种受控设计将激光雷达的内在规格(特别是线束数和线束分布)隔离为精确性能诊断的关键变量。基于MR-LiDAR,我们进行了系统的实证分析,以考察线束数、线束分布、目标距离、目标类别和车辆遮挡如何影响激光雷达感知性能。结果表明,所有这些因素都有显著影响。特别是,与“更高线束数总是带来更好感知”的常见假设相反,我们发现,具有优化线束分布的80线激光雷达可以匹配甚至超越具有均匀线束分布的128线激光雷达。此外,我们提供了实用的激光雷达选型参考指南,包括目标点计数统计和基于两种广泛使用的检测算法的检测性能比较。这项工作为确定路边感知应用中经济高效的激光雷达配置提供了诊断基准和实用指导。

英文摘要

LiDAR model selection is a critical issue in roadside sensing systems, as it directly determines both perception capability and deployment cost. However, the lack of empirical benchmarks for comparing perception performance across different LiDAR configurations has greatly constrained scientific sensor selection and deployment planning. To address this gap, we present MR-LiDAR, a controlled multi-resolution LiDAR benchmark for roadside perception diagnostics. Using 16-, 32-, 80-, and 128-beam LiDARs in identical roadside scenarios, we collect point clouds and ground-truth annotations for diverse traffic participants, including vehicles and vulnerable road users (VRUs), across varying distances. This controlled design isolates intrinsic LiDAR specifications, particularly beam count and beam distribution, as the key variables for precise performance diagnostics. Based on MR-LiDAR, we conduct systematic empirical analyses to examine how beam count, beam distribution, target distance, object category, and vehicle occlusion affect LiDAR perception performance. The results reveal that all of these factors have substantial impacts. In particular, contrary to the common assumption that higher beam counts always yield better perception, we show that an 80-beam LiDAR with optimized beam distribution can match or even outperform a 128-beam LiDAR with uniform beam distribution. In addition, we provide a practical reference guide for LiDAR selection, including target point-count statistics and detection performance comparisons based on two widely used detection algorithms. This work offers a diagnostic benchmark and practical guidance for determining cost-effective LiDAR configurations in roadside perception applications.

2605.24776 2026-05-26 cs.CV

How Noisy Poses Break Inverse Dynamics: Analysis and Mitigation for Video-Based Joint Torque Estimation

噪声姿态如何破坏逆动力学:基于视频的关节力矩估计的分析与缓解

Donghyun Kim, Chanyoung Kim, Eunseo Jeong, Youngjoong Kwon, Seong Jae Hwang

AI总结 本文系统分析了3D人体姿态估计噪声通过逆动力学放大关节力矩误差的问题,提出SMPL-Dynamics模块并通过可微姿态优化将力矩误差降低93%。

AI中文摘要

单目3D人体姿态估计的最新进展使得从视频中实现精确的身体跟踪成为可能。然而,由于逆动力学中的噪声放大,将这些运动学估计转化为物理量(如关节力矩)仍然具有挑战性。在这项工作中,我们系统分析了姿态估计噪声如何通过逆动力学管道传播。我们提出了三个关键发现:(1)通过数值微分计算关节力矩时,姿态噪声被放大约1000倍;(2)近端关节(脊柱、髋部)对噪声的敏感度比远端关节(手腕、手)高10倍;(3)在微分之前进行低通滤波可显著减少这种放大。为了支持这一分析,我们开发了SMPL-Dynamics,这是一个用于SMPL人体模型的完全可微逆动力学模块,无需外部物理模拟器。我们的模块支持端到端梯度计算,并通过可微姿态优化证明了这一点,该优化将力矩误差降低了93%,而姿态变化可忽略不计。

英文摘要

Recent advances in monocular 3D human pose estimation enable accurate body tracking from video. However, translating these kinematic estimates into physical quantities, such as joint torques, remains challenging due to noise amplification through inverse dynamics. In this work, we provide a systematic analysis of how pose estimation noise propagates through the inverse dynamics pipeline. We present three key findings: (1) pose noise is amplified by approximately 1,000x when computing joint torques via numerical differentiation, (2) proximal joints (spine, hips) are up to 10x more sensitive to noise than distal joints (wrists, hands), and (3) low-pass filtering before differentiation substantially reduces this amplification. To enable this analysis, we develop SMPL-Dynamics, a fully differentiable inverse dynamics module for the SMPL body model that requires no external physics simulators. Our module supports end-to-end gradient computation, and we demonstrate this through differentiable pose refinement, which reduces torque error by 93% with negligible change in pose.
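The amplification described in findings (1) and (3) can be illustrated with a one-dimensional toy signal. This is a hedged sketch, not the paper's SMPL-Dynamics pipeline: it assumes a 0.5 Hz sinusoidal joint angle sampled at 30 fps, Gaussian pose noise of 0.01 rad, a central second difference as the numerical differentiator, and a plain moving average standing in for the low-pass filter. All names and parameter values are illustrative.

```python
import math
import random


def second_derivative(x, dt):
    """Central finite-difference second derivative at interior samples."""
    return [(x[i - 1] - 2.0 * x[i] + x[i + 1]) / dt ** 2
            for i in range(1, len(x) - 1)]


def moving_average(x, half_width):
    """Centered moving average (a crude low-pass filter); windows shrink at the edges."""
    out = []
    for i in range(len(x)):
        lo = max(0, i - half_width)
        hi = min(len(x), i + half_width + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def rms(values):
    """Root mean square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def acceleration_errors(sigma=0.01, fps=30.0, n=300, half_width=3, seed=0):
    """Return (RMS acceleration error without filtering, with filtering)."""
    rng = random.Random(seed)
    dt = 1.0 / fps
    clean = [math.sin(math.pi * i * dt) for i in range(n)]  # 0.5 Hz joint angle
    noisy = [c + rng.gauss(0.0, sigma) for c in clean]
    ref = second_derivative(clean, dt)
    raw = second_derivative(noisy, dt)
    filt = second_derivative(moving_average(noisy, half_width), dt)
    trim = half_width + 2  # skip samples affected by the shrunken edge windows
    err_raw = rms([a - b for a, b in zip(raw[trim:-trim], ref[trim:-trim])])
    err_filt = rms([a - b for a, b in zip(filt[trim:-trim], ref[trim:-trim])])
    return err_raw, err_filt
```

For white noise with standard deviation $\sigma$, the second difference has standard deviation $\sqrt{6}\,\sigma/\Delta t^2$, so at 30 fps the acceleration error is thousands of times larger than the pose noise in these units. Torques scale with acceleration, which is why smoothing before differentiation matters; the exact factors reported in the paper depend on its own setup.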

2605.24775 2026-05-26 cs.AI cs.MA

PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback

PRIMA:具有可验证身份和收敛反馈的弹性多智能体研究的操作模式

Sasank Annapureddy

AI总结 针对长时间运行的多智能体LLM系统面临的故障模式,提出PRIMA框架,包含弹性恢复、子智能体操作规范和结构化工程交付的多阶段应用模式,并通过图同构案例验证其有效性。

Comments 11 pages. Single-author preprint. Supplementary case-study report (Graph Isomorphism algorithm proposal with three theorems, five conjectures, complete complexity analysis, and hard-instance evaluation) available at https://spockstein.github.io/prima/case-study-graph-isomorphism.html

AI中文摘要

将LLM作为协调的多智能体研究系统运行数小时,会暴露出单次评估无法发现的故障模式:上游提供商无预警地限制服务,子智能体使任务偏离以适应可用工具,叙述机制而非使用它,以自我道歉开始修订迭代,或将上游上下文视为可执行指令。我们提出PRIMA,其主要贡献是三种应对这些故障模式的操作模式:(1) 弹性与恢复层,检测上游速率限制信号,将类型化的暂停记录持久化到磁盘,并在进程重启后恢复长时间运行的任务而不重新执行已收敛的工作;(2) 子智能体操作规范,将任务保真度、工具使用、修订和步骤间上下文边界规范编码为结构化的提示层;(3) 用于结构化工程交付的多阶段应用模式,将正交的草稿步骤与最终综合前的显式跨文档协调过程配对。这些模式基于一个基础协议:具有显式收敛标准的研究程序规范语言、双指标评分引擎(LLM评判的评分标准加沙盒代码)、外部元优化循环、事件驱动持久化、基于钩子的中间件、上下文压缩和多提供商LLM抽象。智能体身份来源于素数幂,提供无冲突标识符和无需中央注册表的可轻松验证的集群成员资格。理论保证包括$O(k)$验证、$O(V+E)$ DAG验证以及由算术基本定理保证的身份无冲突。一个图同构案例研究将架构主张落实到生成的产物中:一个六步协议,产生了一篇研究论文,提出了一种新的规范形式算法,包含三个定理和五个猜想。

英文摘要

Operating LLMs as coordinated multi-agent research systems over multi-hour runs surfaces failure modes that single-shot evaluation cannot: upstream providers throttle without warning, sub-agents drift the task to fit accessible tools, narrate machinery instead of using it, open revision iterations with self-apology, or treat upstream context as executable directives. We present PRIMA, whose primary contributions are three operational patterns for surviving these failure modes: (1) a resilience-and-recovery layer that detects upstream rate-limit signals, persists a typed pause record to disk, and resumes long-running runs without re-executing converged work even across process restarts; (2) a sub-agent operating discipline encoding task-fidelity, tool-use, revision, and inter-step context-boundary norms as a structural prompt layer; (3) a multi-phase application pattern for structured engineering deliverables pairing orthogonal draft steps with an explicit cross-document harmonization pass before final synthesis. These sit atop a foundational protocol: a research-program specification language with explicit convergence criteria, a dual-metric scoring engine (LLM-judged rubric plus sandboxed code), an outer meta-optimization loop, event-driven persistence, hook-based middleware, context compaction, and a multi-provider LLM abstraction. Agent identities derive from prime powers, giving collision-free identifiers and trivially-verifiable cluster membership without a central registry. Theoretical guarantees include $O(k)$ verification, $O(V+E)$ DAG validation, and identity collision freedom by the Fundamental Theorem of Arithmetic. A Graph Isomorphism case study grounds the architectural claims in a generated artifact: a six-step protocol that produced a research paper proposing a new canonical-form algorithm with three theorems and five conjectures.
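The abstract states that agent identities derive from prime powers but does not give the construction. One plausible reading, shown here purely as an assumption, is that each cluster owns a distinct prime $p$ and its $k$-th agent receives the identifier $p^k$. Membership can then be checked locally with $k$ divisions, and unique factorization rules out collisions between clusters. The function names below are hypothetical and are not PRIMA's API.

```python
def agent_id(cluster_prime, k):
    """Identifier of the k-th agent (k >= 1) in the cluster owning `cluster_prime`."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return cluster_prime ** k


def membership_exponent(identifier, cluster_prime):
    """Return k if identifier == cluster_prime ** k for some k >= 1, else None.

    Uses one division per factor of the prime, with no central registry lookup.
    """
    if identifier < 2 or cluster_prime < 2:
        return None
    k = 0
    while identifier % cluster_prime == 0:
        identifier //= cluster_prime
        k += 1
    return k if identifier == 1 and k >= 1 else None
```

Under this assumed scheme, `membership_exponent(agent_id(3, 4), 3)` recovers the index 4, while the same identifier checked against any other prime is rejected.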

2605.24774 2026-05-26 cs.LG physics.comp-ph

Hermite-NGP: Gradient-Augmented Hash Encoding for Learning PDEs

Hermite-NGP:用于学习PDE的梯度增强哈希编码

Jinjin He, Zhiqi Li, Sinan Wang, Bo Zhu

AI总结 提出Hermite-NGP,一种梯度增强的多分辨率哈希编码,通过显式存储哈希网格顶点处的函数值和混合偏导数并利用Hermite插值实现解析梯度计算,从而快速准确地计算神经PDE求解器的空间导数,并引入多分辨率课程训练策略,在2D和3D PDE基准上实现高达约20倍误差降低和2-10倍收敛时间减少。

Comments Accepted by ICML 2026. Project page: https://jinjinhe2001.github.io/hermite-ngp/

AI中文摘要

我们提出Hermite-NGP,一种梯度增强的多分辨率哈希编码,旨在实现神经PDE求解器空间导数的快速准确计算。与现有依赖自动微分或有限差分且存在不稳定或高成本的NGP方法不同,Hermite-NGP在哈希网格顶点处显式存储函数值和混合偏导数,从而通过Hermite插值实现梯度、雅可比矩阵和海森矩阵的完全解析计算。该设计在保持NGP的效率和空间自适应性的同时,支持高达二阶的解析微分算子。我们进一步引入一种类似于多重网格V-cycle的多分辨率课程训练策略,以实现从粗到细的优化。在一系列2D和3D PDE基准测试中,Hermite-NGP相比先前的神经PDE方法实现了高达约20倍的误差降低,并将与其他求解器相比的收敛时间缩短了2到10倍,对于多达1700万参数的模型,每个epoch的训练时间低至3.5毫秒。

英文摘要

We propose Hermite-NGP, a gradient-augmented multi-resolution hash encoding designed to enable fast and accurate computation of spatial derivatives for neural PDE solvers. Unlike existing NGP-based approaches that rely on automatic differentiation or finite differences and suffer from instability or high cost, Hermite-NGP explicitly stores function values and mixed partial derivatives at hash grid vertices, allowing fully analytic evaluation of gradients, Jacobians, and Hessians via Hermite interpolation. This design preserves the efficiency and spatial adaptivity of NGP while supporting analytic differential operators up to second order. We further introduce a multi-resolution curriculum training strategy analogous to multigrid V-cycles to enable coarse-to-fine optimization. Across a range of 2D and 3D PDE benchmarks, Hermite-NGP achieves up to approximately 20 times lower error than prior neural PDE methods, and reduces wall-clock convergence time by 2 to 10 times compared to other solvers, with per-epoch training times as low as 3.5 ms for models with up to 17M parameters.
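The core idea of storing values and derivatives at grid vertices so that gradients come out in closed form can be seen in one dimension. The sketch below uses standard cubic Hermite interpolation on a single cell; it is an illustration only and assumes nothing about the paper's multi-dimensional hash grid, mixed partial derivatives, or training procedure. The function name and argument layout are hypothetical.

```python
def hermite_eval(x, x0, x1, f0, f1, d0, d1):
    """Cubic Hermite interpolation on the cell [x0, x1].

    f0, f1 are stored function values and d0, d1 are stored derivatives df/dx
    at the two vertices. Returns (value, derivative) at x, both analytic.
    """
    h = x1 - x0
    t = (x - x0) / h
    # Hermite basis functions and their derivatives with respect to t.
    h00, dh00 = 2 * t**3 - 3 * t**2 + 1, 6 * t**2 - 6 * t
    h10, dh10 = t**3 - 2 * t**2 + t, 3 * t**2 - 4 * t + 1
    h01, dh01 = -2 * t**3 + 3 * t**2, -6 * t**2 + 6 * t
    h11, dh11 = t**3 - t**2, 3 * t**2 - 2 * t
    value = h00 * f0 + h10 * h * d0 + h01 * f1 + h11 * h * d1
    # Chain rule: d/dx = (1/h) d/dt.
    deriv = (dh00 * f0 + dh10 * h * d0 + dh01 * f1 + dh11 * h * d1) / h
    return value, deriv
```

Because the interpolant is a cubic in `t`, its derivative is available without automatic differentiation or finite differences, and any cubic polynomial is reproduced exactly on a cell when the vertex values and slopes are exact.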

2605.24773 2026-05-26 cs.AI

Uncertainty Decomposition via Cyclical SG-MCMC and Soft-label Learning for Subjective NLP

通过循环SG-MCMC和软标签学习进行主观NLP中的不确定性分解

Keito Inoshita, Takato Ueno

AI总结 提出结合循环随机梯度马尔可夫链蒙特卡洛(cSG-MCMC)与软标签学习的方法,在情感分类中沿多个轴评估不确定性,并在GoEmotions基准上优于现有方法。

AI中文摘要

情感分类中标注者的分歧反映了情感概念固有的模糊性,对于主观NLP中的预测质量评估至关重要。然而,先前没有工作将软标签学习与贝叶斯深度学习相结合,以评估包括标注者分布保真度在内的多个轴上的不确定性。我们在冻结的RoBERTa上通过循环随机梯度马尔可夫链蒙特卡洛(cSG-MCMC)训练一个线性头,在五轴评估下以软标签目标针对经验标注者分布。在28情感的GoEmotions基准上,所提出的方法在三个轴上同时优于蒙特卡洛Dropout和深度集成——标注者分布的Jensen-Shannon散度(JSD)、每个情感偶然不确定性与分歧之间的Spearman相关性,以及选择性预测的风险-覆盖率曲线下面积(AURC)和ROC曲线下面积(AUROC)——表明独立的轴可以从一个后验中联合获得。事后温度缩放表现出双向效应,建立了硬标签校准和标注者JSD作为独立维度,并激励联合报告作为诚实协议。

英文摘要

Annotator disagreement in emotion classification reflects ambiguity intrinsic to emotion concepts and is essential for predictor-quality assessment in subjective NLP. Yet no prior work integrates soft-label learning with Bayesian deep learning to evaluate uncertainty along axes including annotator-distribution fidelity. We train a linear head on a frozen RoBERTa via cyclical stochastic gradient Markov chain Monte Carlo (cSG-MCMC), targeting the empirical annotator distribution with a soft-label objective under a five-axis evaluation. On the 28-emotion GoEmotions benchmark, the proposed method outperforms Monte Carlo Dropout and Deep Ensemble simultaneously on three axes: Jensen-Shannon divergence (JSD) to the annotator distribution, Spearman correlation between per-emotion aleatoric uncertainty and disagreement, and selective-prediction Area Under the Risk-Coverage Curve (AURC) and Area Under the ROC Curve (AUROC). This shows that independent axes are jointly attainable from one posterior. Post-hoc temperature scaling exhibits a bidirectional effect, establishing hard-label calibration and annotator-JSD as independent dimensions and motivating joint reporting as an honest protocol.
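As a small illustration of the annotator-distribution axis, the sketch below builds a soft label from raw annotator votes and scores a prediction against it with the Jensen-Shannon divergence. It assumes one vote per annotator and a base-2 logarithm, so the divergence lies in [0, 1]. The paper's exact normalization for the multi-label GoEmotions setting is not given in the abstract, and the helper names are hypothetical.

```python
import math


def soft_label(votes, num_classes):
    """Empirical annotator distribution from a list of class indices (one per annotator)."""
    counts = [0] * num_classes
    for v in votes:
        counts[v] += 1
    total = len(votes)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

A prediction that matches the annotator distribution exactly scores 0, and two distributions with disjoint support score 1.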

2605.24771 2026-05-26 cs.CV cs.AI cs.LG

From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for Vision-Language Model Weak Supervision Across Three Medical-Imaging Benchmarks

从理论到决策规则:校准视觉-语言模型弱监督的噪声标签交叉点——基于三个医学影像基准

Bruce Changlong Xu, Jose James, Alexander Ryu

AI总结 通过三个医学影像基准校准理论预测的噪声标签交叉点,提出基于少量金标标签的决策规则。

Comments 5 pages, 2 figures, 4 tables

AI中文摘要

经典的噪声标签理论预测,弱监督下的下游性能上限是标注者的准确率,这意味着一个尖锐的交叉点:一旦金标训练的分类器达到标注者的水平,弱标签就会从帮助变为伤害。该预测是理论性的;缺少的是将其转化为现代基础模型标注者的实例级陈述的基准校准。我们针对BiomedCLIP生成的弱标签,在三个医学影像基准(PCAM、ISIC、NIH-CXR)和六个跨越11倍参数范围的下游架构上提供了这样的校准。理论预测的交叉点出现在PCAM上约100个样本,ISIC上20-50个,NIH-CXR上250-500个;交叉点以上的弱标签使AUC降低高达-0.10。对于五个预训练架构中的四个,交叉点位置与架构无关,而一个家族内的DenseNet扫描(2.5倍参数,相同预训练)支持了标注者(而非学生)是主要约束的观点。该校准进而产生一个可在10-20个金标标签下操作的决策规则:比较仅金标AUC与用户金标集上的VLM准确率。NIH-CXR上的结构化与随机噪声符号翻转表明,该界限的仅速率形式是不完整的,并确定了一个具体的改进(标签空间投影),未来的基准可以设计来测试它。

英文摘要

Classical noisy-label theory predicts that downstream performance under weak supervision is bounded above by the labeler's accuracy, implying a sharp crossover: once a gold-trained classifier matches the labeler, weak labels stop helping and start hurting. The prediction is theoretical; what is missing is a benchmark calibration that turns it into an instance-level statement for modern foundation-model labelers. We provide such a calibration for BiomedCLIP-generated weak labels on three medical-imaging benchmarks (PCAM, ISIC, NIH-CXR) and six downstream architectures spanning an 11x parameter range. The crossover predicted by theory appears at $n_g \sim 100$ on PCAM, 20-50 on ISIC, and 250-500 on NIH-CXR; weak labels above the crossover degrade AUC by up to -0.10. The location is architecture-invariant for four of five pretrained architectures, and a within-family DenseNet sweep (2.5x parameters, identical pretraining) supports the view that the labeler, not the student, is the dominant constraint. The calibration in turn produces a decision rule operable from 10-20 gold labels: compare gold-only AUC to VLM accuracy on the user's gold set. A structured-vs-random noise sign flip on NIH-CXR shows that the rate-only formulation of the bound is incomplete and identifies a concrete refinement (label-space projection) that future benchmarks can be designed to test.
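The decision rule can be written down in a few lines. The sketch below is an interpretation rather than the authors' code: it assumes binary labels, computes the VLM labeler's accuracy and a gold-only classifier's AUC on the same small gold set, and recommends dropping weak labels once the gold-only AUC reaches the labeler's accuracy. How the gold-only AUC is estimated from so few labels (for example by cross-validation) is left open here, and all names are hypothetical.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that agree with the gold labels."""
    return sum(int(p == g) for p, g in zip(predictions, gold)) / len(gold)


def auc(scores, gold):
    """Rank-based AUC for binary labels (1 = positive); ties count as one half."""
    pos = [s for s, g in zip(scores, gold) if g == 1]
    neg = [s for s, g in zip(scores, gold) if g == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recommend(gold_only_auc, vlm_accuracy):
    """Assumed rule: weak labels stop helping once the gold-only model matches the labeler."""
    return "gold-only" if gold_only_auc >= vlm_accuracy else "add weak labels"
```

A typical use would compute `accuracy` of the VLM's labels and `auc` of the gold-trained classifier's scores on the same 10-20 gold examples, then pass both to `recommend`.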

2605.24770 2026-05-26 cs.LG cs.CV

Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra

Muon在视觉Transformer中的应用:优化器-数据增强交互与梯度谱

Ben S. Southworth, Shuai Jiang, Daniel McBride, Eric C. Cyr, Stephen Thomas

AI总结 研究Muon优化器在视觉Transformer训练中的表现,发现其优于AdamW,且增益依赖于数据增强,通过梯度奇异值分析揭示Muon与AdamW在注意力投影和深层前馈块中的谱差异。

Comments 25 pages, 15 figures

AI中文摘要

Muon是一种最近开发的矩阵感知优化器,在Transformer训练中表现出色,但其在视觉Transformer(ViT)中的行为尚不明确。我们研究Muon在ViT训练中的应用,主要在ImageNet-100和Pl@ntNet-300K上,与AdamW在涉及mixup、cutmix、平滑以及随机增强和擦除的标准视觉方案下进行比较。Muon始终优于AdamW,在长尾Pl@ntNet宏观top-1上尤其显著。这些增益也依赖于数据增强方案,Muon从高级和显著的数据增强技术中获益远大于AdamW。为了理解这种交互,我们分析了整个ViT中矩阵梯度的奇异值结构。在Muon训练中,去除重度数据增强会导致训练后期梯度矩阵的谱集中和模式坍塌,主要发生在深层MLP-down块中。在固定的“完整”增强方案下,Muon与AdamW最明显的对比出现在QKV梯度中,其中AdamW梯度能量集中在更窄的基上,而Muon将能量分散到更多的奇异模式上。因此,ViT中的Muon最好理解为一种优化器-数据增强交互。在固定方案下,Muon与AdamW最明显的区别在于注意力投影,其梯度由更宽的谱基组成。在Muon内部,完整的训练方案对于防止深层前馈块中的后期谱集中和模式坍塌很重要。我们进一步展示了在图像分割和掩码自编码器模型上训练ViT的效果,Muon在所有考虑的设置中均优于AdamW。

英文摘要

Muon is a recently developed matrix-aware optimizer that has shown strong results in transformer training, but its behavior in vision transformers (ViTs) is not yet well understood. We study Muon for ViT training, largely on ImageNet-100 and Pl@ntNet-300K, comparing against AdamW under standard vision recipes involving mixup, cutmix, smoothing, and random augmentation and erasing. Muon consistently outperforms AdamW, with especially large gains on long-tailed Pl@ntNet macro top-1. These gains are also recipe-dependent, where Muon benefits much more than AdamW from advanced and significant data augmentation techniques. To understand this interaction, we analyze the singular-value structure of matrix gradients throughout the ViT. Within Muon training runs, removing heavy data augmentation induces a late-training spectral concentration and mode collapse in gradient matrices, primarily in deep MLP-down blocks. Under a fixed "full" augmentation recipe, the clearest Muon-AdamW contrast appears instead in QKV gradients, where AdamW gradient energy remains concentrated in a much narrower basis while Muon spreads energy across substantially more singular modes. Muon in ViTs is therefore best understood as an optimizer-recipe interaction. Under a fixed recipe, Muon differs from AdamW most clearly in attention projections, where its gradients consist of a broader spectral basis. Within Muon, a full training recipe is important for preventing late spectral concentration and mode collapse in deep feedforward blocks. We further demonstrate efficacy in training ViTs on image segmentation and masked autoencoder models, where Muon outperforms AdamW in all settings considered.

2605.24769 2026-05-26 cs.CV cs.AI eess.IV

Leveraging pretrained RGB denoisers for hyperspectral image restoration

利用预训练RGB去噪器进行高光谱图像恢复

Daniele Picone, Mohamad Jouni, Mauro Dalla-Mura

AI总结 提出一种轻量级适配器,通过投影映射重用冻结的预训练RGB去噪器,实现高光谱图像的去噪、去模糊和超分辨率恢复,实验表明RGB先验具有良好的迁移性。

AI中文摘要

高光谱图像恢复面临若干挑战,包括训练数据有限、传感器特异性强以及光谱维度高。这些限制阻碍了鲁棒高光谱先验的学习,促使我们重用从大规模RGB数据中学到的先验。在这项工作中,我们提出了一种最小训练的轻量级适配器,通过投影映射将冻结的预训练RGB去噪器重新用于高光谱恢复。该方法对低维光谱投影进行去噪,并通过约束线性聚合重建高光谱立方体,同时保持即插即用的兼容性和底层RGB去噪器的稳定性。在多个数据集上的去噪、去模糊和超分辨率实验表明,该方法持续优于高光谱专用基线,显示了大规模RGB先验的强迁移性。

英文摘要

Hyperspectral image restoration faces several challenges, including limited training data, strong sensor specificity, and high spectral dimensionality. These limitations hinder the learning of robust hyperspectral priors, motivating the reuse of priors learned from large-scale RGB data. In this work, we propose a minimally trained, lightweight adapter that repurposes frozen pretrained RGB denoisers for hyperspectral restoration through a projection mapping. The method denoises low-dimensional spectral projections and reconstructs the hyperspectral cube through constrained linear aggregation, while preserving plug-and-play compatibility and the stability properties of the underlying RGB denoiser. Experiments on denoising, deblurring, and super-resolution across multiple datasets demonstrate consistent improvements over hyperspectral-specific baselines, showing the strong transferability of large-scale RGB priors.

2605.24767 2026-05-26 cs.RO

Enhanced INS/GNSS State Estimation using GNSS-Based Acceleration Measurements

增强的INS/GNSS状态估计:利用基于GNSS的加速度测量

Gal Versano, Itzik Klein

AI总结 提出利用历史GNSS测量和运动模型提取车辆加速度信息,并集成到INS/GNSS滤波器中以提高定位鲁棒性和精度,在两组真实无人地面车辆数据集上分别实现11.40%和20.74%的平均位置均方根误差改进。

AI中文摘要

精确可靠的导航对于自主地面车辆运行至关重要。标准的INS/GNSS融合依赖于GNSS位置更新,这提供了有限的方位和惯性传感器误差状态的可观测性,特别是在低动态运动期间。在这项工作中,我们提出利用过去的GNSS测量以及运动模型来提取有意义的车辆加速度信息。然后将该加速度测量集成到INS/GNSS滤波器中,以提高其鲁棒性和准确性。所提出的方法在两个来自不同移动平台和惯性传感器等级的真实无人地面车辆数据集上进行了评估。结果表明,相对于标准位置辅助滤波器,定位精度一致提高,在两个数据集上平均位置均方根误差分别提高了11.40%和20.74%。

英文摘要

Accurate and reliable navigation is essential for autonomous ground vehicle operations. Standard INS/GNSS fusion relies on GNSS position updates, which provide limited observability of orientation and inertial sensor error states, particularly during low-dynamic motion. In this work, we propose utilizing past GNSS measurements alongside a motion model to extract meaningful vehicle acceleration information. This acceleration measurement is then integrated into the INS/GNSS filter to improve its robustness and accuracy. The proposed approach is evaluated on two real-world unmanned ground vehicle datasets collected from different mobile platforms and inertial sensor grades. Results demonstrate consistent positioning accuracy improvements relative to the standard position-aided filter, with mean position root mean square error improvements of 11.40% and 20.74% on the two datasets, respectively.

2605.24763 2026-05-26 cs.LG physics.flu-dyn

High-fidelity Modeling of Full-scale Pressurized Water Reactor Flow Fields for Machine Learning Applications

面向机器学习应用的全尺寸压水堆流场高保真建模

Logan A. Burnett, Hyungjun Kim, Hsien-Cheng Chou, Arsha Witoelar, Robert A. Brewster, Benoit Forget, Emilio Baglietto, Majdi I. Radaideh

AI总结 本研究利用高保真CFD模拟和机器学习模型,对四环路压水堆组件级流场进行表征,揭示了冷腿旋流和下腔室输运导致的入口流量分布不均匀性,并验证了ConvLSTM等空间感知架构在流场重建与预测中的优越性。

Comments 30 pages, 10 figures, and 6 Tables

AI中文摘要

本工作提出了一个用于四环路压水堆组件级流动表征的高保真计算流体动力学和数据驱动建模框架。利用公开可用的几何和运行条件构建了完整的下腔室和堆芯入口域,实现了带有泵诱导旋流边界条件的瞬态模拟。结果表明,冷腿旋流和下腔室输运在堆芯下部区域产生强烈的非均匀组件级入口流量分布,而轴向阻力和混合作用逐渐使更高位置的流动均匀化。这些基于物理的数据集随后被用于评估机器学习在部分场重建和短期自回归预测中的应用。一个基于3D卷积的修复模型成功地从部分观测中重建了缺失的组件级质量流量,误差集中在高湍流底部层,并在上层显著减小。跨多个ML模型的比较分析表明,空间感知架构,特别是ConvLSTM,通过有效捕捉耦合的时空动态,显著优于基于序列的LSTM和算子学习DeepONet方法。研究还强调了关键挑战,包括入口流预测对湍流和网格分辨率的敏感性,以及缺乏全尺寸实验验证数据。尽管存在这些限制,结果仍与预期的物理行为一致。总体而言,本工作将高保真CFD确立为开发数据驱动代理模型、稀疏传感策略和未来多物理场耦合框架的关键基础。

英文摘要

This work presents a high-fidelity computational fluid dynamics (CFD) and data-driven modeling framework for assembly-level flow characterization in a four-loop pressurized water reactor (PWR). A full lower-plenum and core-inlet domain was constructed using publicly available geometry and operating conditions, enabling transient simulations with pump-induced swirl boundary conditions. The results show that cold-leg swirl and lower-plenum transport generate strongly heterogeneous assembly-wise inlet flow distributions, particularly near the lower core region, while axial resistance and mixing progressively homogenize the flow at higher elevations. These physics-informed datasets were subsequently used to evaluate machine learning (ML) applications for partial field reconstruction and short-term autoregressive prediction. A 3D convolutional-based inpainting model successfully reconstructed missing assembly-level mass flow rates from partial observations, with errors concentrated in the highly turbulent base (bottom) layer and diminishing significantly in upper layers. Comparative analysis across multiple ML models demonstrates that spatially aware architectures, particularly ConvLSTM, significantly outperform sequence-based (LSTM) and operator-learning (DeepONet) approaches by effectively capturing coupled spatio-temporal dynamics. The study also highlights key challenges, including the sensitivity of inlet flow predictions to turbulence and mesh resolution, as well as the absence of full-scale experimental validation data. Despite these limitations, the results remain consistent with expected physical behavior. Overall, this work establishes high-fidelity CFD as a critical foundation for developing data-driven surrogates, sparse sensing strategies, and future multiphysics coupling frameworks.

2605.24762 2026-05-26 cs.CV

4KLSDB: A Large-Scale Dataset for 4K Image Restoration and Generation

4KLSDB:用于4K图像恢复与生成的大规模数据集

Zihao Zhu, Kuan-Ru Huang, Zhaoming Xu, Renjie Li, Bo Wu, Ruizheng Bai, Mingyang Wu, Sayak Paul, Zhengzhong Tu

AI总结 为解决现有数据集缺乏原生4K分辨率和规模的问题,提出包含129,484张4K图像的大规模数据集4KLSDB,并通过多阶段自动过滤和标注确保质量,实验证明其在超分辨率和扩散模型训练中能显著提升4K基准性能。

Comments Accepted to the DataCV Workshop at CVPR 2026; 10 pages, 4 figures, 7 tables; Our project page is available at: https://4klsdb.github.io/

AI中文摘要

高分辨率数据集对于推进超分辨率(SR)和文本到图像(T2I)扩散研究至关重要。然而,当前公开可用的数据集既缺乏原生4K分辨率,也缺乏训练最先进模型所需的大规模。为解决这一差距,我们引入了一个4K大规模数据集与基准(4KLSDB),这是一个大规模、多样化的数据集,包含129,484张精心策划的4K分辨率图像,涵盖自然、城市景观、人物、食物、艺术品和CGI等多个类别,以及分别包含2,000和1,984张图像的独立验证集和测试集。图像来源于已建立的开放数据集,包括Photo Concept Bucket、Laion2B和PD12M。4KLSDB经历了严格的多阶段自动过滤和标注流程,涉及人工标注员和大规模多模态模型(LMMs),以确保高美学质量和数据集一致性。我们通过训练代表性的超分辨率和扩散模型来证明4KLSDB的有效性,观察到在原生4K基准上性能的显著提升。综合实验表明,在真实4K分辨率数据上训练与图像恢复任务中保真度的提高之间存在正相关,尤其是在4K分辨率下。我们通过提供4KLSDB,为研究社区提供宝贵资源,以推动真正高保真图像合成与恢复的进展。我们的项目页面位于:https://4klsdb.github.io/。

英文摘要

High-resolution datasets are essential for advancing super-resolution (SR) and text-to-image (T2I) diffusion research. However, current publicly available datasets lack both the native 4K resolution and the extensive scale necessary for training state-of-the-art models. To address this gap, we introduce a 4K Large Scale Dataset and Benchmark (4KLSDB), a large-scale, diverse dataset consisting of 129,484 carefully curated 4K resolution images spanning multiple categories such as nature, urban scenes, people, food, artwork, and CGI, alongside distinct validation and test sets containing 2,000 and 1,984 images respectively. Images were sourced from established open datasets including Photo Concept Bucket, Laion2B, and PD12M. 4KLSDB underwent rigorous multi-stage automated filtering and annotation pipelines involving both human annotators and Large Multimodal Models (LMMs) to ensure high aesthetic quality and dataset consistency. We demonstrate 4KLSDB's effectiveness by training representative super-resolution and diffusion models, observing significant improvements in performance on native 4K benchmarks. Comprehensive experiments illustrate a positive correlation between training on true 4K resolution data and improved fidelity in image restoration tasks, especially at 4K resolution. With 4KLSDB, we provide the research community with a valuable resource to drive progress toward genuinely high-fidelity image synthesis and restoration. Our project page is available at: https://4klsdb.github.io/.