arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.15585 2026-05-18 cs.AI cs.CV

See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation

Yuejia Li, Ke He, Junheng Li, Shutong Chen, Jingkang Xia, Zhiyue Su, Junchi Zhang, Mang Ye

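AI summary This paper proposes the OmniManim framework, which uses visual planning and a feedback mechanism to raise the quality of generated educational animations, improving both render quality and instructional quality.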

Comments 21 pages, 4 figures

English abstract

Large language models can generate executable code for educational animations, but the resulting renders often exhibit visual defects, including element overlap, misalignment, and broken animation continuity. These defects cannot be reliably detected from the code alone and become apparent only after execution. We formalize this problem as render-feedback-aware constrained code generation: given a natural language specification, the model must generate executable code whose rendered output satisfies structured quality criteria that can be evaluated only after rendering. To address this problem, we introduce OmniManim, a render-feedback-aware educational animation generation framework built around a shared scene state, explicit visual planning, structured post-render diagnostics, and localized repair. Within OmniManim, the Vision Agent is a task-specific visual planning module: it predicts sparse keyframe layouts with coarse-to-fine bounding-box denoising and optimizes an interpolation-aware objective to reduce intermediate-frame failures induced by downstream animation interpolation. We further construct two datasets, ManimLayout-1K and EduRequire-500, and provide a reproducible evaluation protocol covering executability, instructional quality, visual quality, and efficiency. On EduRequire-500, OmniManim improves measured render quality over both single-model baselines and existing multi-agent frameworks. Systematic ablation studies further verify that explicit visual planning, especially its coarse spatial prior, bounding-box refinement, and interpolation-aware optimization, is central to these gains.
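The intermediate-frame failures mentioned in the abstract can be illustrated with a toy check. This is only a sketch, not OmniManim's objective: the box format `(x_min, y_min, x_max, y_max)`, the use of plain linear interpolation between two keyframes, and all function names are assumptions made for illustration. It shows that two layouts with no overlap at either keyframe can still collide in between.

```python
from typing import List, Tuple

# Hypothetical box format (an assumption): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def lerp_box(a: Box, b: Box, t: float) -> Box:
    """Linearly interpolate each box coordinate at time t in [0, 1]."""
    return (
        a[0] + t * (b[0] - a[0]),
        a[1] + t * (b[1] - a[1]),
        a[2] + t * (b[2] - a[2]),
        a[3] + t * (b[3] - a[3]),
    )


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def max_intermediate_overlap(start: List[Box], end: List[Box], steps: int = 20) -> float:
    """Largest pairwise overlap seen at any sampled frame between two keyframes."""
    worst = 0.0
    for s in range(steps + 1):
        t = s / steps
        boxes = [lerp_box(a, b, t) for a, b in zip(start, end)]
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                worst = max(worst, overlap_area(boxes[i], boxes[j]))
    return worst


if __name__ == "__main__":
    # Two unit boxes swap places: both keyframes are clean, the midpoint is not.
    start = [(0.0, 0.0, 1.0, 1.0), (4.0, 0.0, 5.0, 1.0)]
    end = [(4.0, 0.0, 5.0, 1.0), (0.0, 0.0, 1.0, 1.0)]
    print(max_intermediate_overlap(start, end))
```

A planner that scores only the keyframes would accept this pair of layouts, which is the kind of gap an interpolation-aware objective is meant to close.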

2605.15584 2026-05-18 cs.CV

AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models

Zhiwei Li, Jiacheng Xue, Weining Wang, Ajian Liu, Xingyu Gao, Zhenan Sun, Qi Li

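AI summary This paper proposes AGC, a training-free defense that corrects input features with an adaptive step size to improve the adversarial robustness of vision-language models. Across eight fine-grained datasets it reports a 44.4% gain in robust accuracy together with a 10× reduction in inference latency.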

English abstract

Vision-language models like CLIP have demonstrated remarkable zero-shot transfer capabilities. However, their susceptibility to imperceptible adversarial perturbations remains a critical security concern. While test-time defenses offer a pragmatic solution for deployed models, existing approaches typically rely on gradient-based optimization during inference, incurring significant computational overhead. In this paper, we revisit the role of data augmentation in CLIP robustness and observe that augmentations are not equally effective: specific augmentations consistently provide robust geometric cues that align with correct class semantics in the hyperspherical feature space. Based on this, we propose Adaptive Geodesic Correction (AGC), a training-free defense mechanism that requires no parameter updates. AGC identifies a reliable augmentation as a geometric anchor and corrects the input feature towards it, utilizing an adaptive step size to balance robustness against clean accuracy preservation. AGC achieves superior performance across eight fine-grained datasets and three CLIP backbones, improving average robust accuracy by 44.4% over the state-of-the-art baseline while delivering a 10× reduction in inference latency. Our findings reveal a fundamental geometric property of CLIP features, offering a highly efficient and effective paradigm for robust multimodal deployment.
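To make the idea of a correction along a geodesic concrete, here is a minimal sketch of moving a unit-norm feature toward an anchor on the hypersphere using spherical linear interpolation. It is not the AGC algorithm: how AGC selects its anchor and computes its step is not stated in the abstract, so the `adaptive_step` rule below (step proportional to angular distance) is a purely hypothetical placeholder, as are all function names.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angle(u: List[float], v: List[float]) -> float:
    """Geodesic (angular) distance between two vectors, in radians."""
    u, v = normalize(u), normalize(v)
    cos_theta = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def geodesic_step(feature: List[float], anchor: List[float], step: float) -> List[float]:
    """Move `feature` a fraction `step` in [0, 1] along the great circle toward `anchor`."""
    f, a = normalize(feature), normalize(anchor)
    theta = angle(f, a)
    sin_theta = math.sin(theta)
    if sin_theta < 1e-8:
        # Identical or antipodal points: no unique geodesic, leave the feature as is.
        return f
    w_f = math.sin((1.0 - step) * theta) / sin_theta
    w_a = math.sin(step * theta) / sin_theta
    return [w_f * x + w_a * y for x, y in zip(f, a)]


def adaptive_step(feature: List[float], anchor: List[float], max_step: float = 0.5) -> float:
    """Hypothetical rule (an assumption): larger angular gap, larger correction."""
    return max_step * angle(feature, anchor) / math.pi


if __name__ == "__main__":
    feature, anchor = [1.0, 0.0], [0.0, 1.0]
    step = adaptive_step(feature, anchor)
    print(step, geodesic_step(feature, anchor, step))
```

Interpolating along the great circle keeps the corrected feature on the unit sphere, which matters for cosine-similarity classifiers such as CLIP's zero-shot head; a straight-line blend would need renormalizing afterwards.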

2605.15583 2026-05-18 cs.CV

Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling

Ryohei Goto, Takuya Fujihashi, Shunsuke Saruwatari, Fumio Okura

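AI summary This paper proposes a single-view 3D human pose estimation method that needs no 3D supervision. It uses the 2D diffusion prior of a pre-trained 2D motion diffusion model (MDM) and optimizes the 3D pose through conditional multi-view ancestral sampling, so that its multi-view projections follow the manifold in the 2D MDM noise space while matching the given 2D pose and human anatomical constraints.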

Comments International Conference on Automatic Face and Gesture Recognition (FG 2026), Oral

English abstract

We propose a method of estimating a 3D human pose from a single view without 3D supervision. The key to our method is to leverage the 2D diffusion priors of motion diffusion models (MDMs) pre-trained on large 2D human pose datasets. Specifically, we extend multi-view ancestral sampling of diffusion models to the task of 2D-3D lifting of human pose. To this end, we newly propose a conditional multi-view ancestral sampling (cMAS) that optimizes the 3D pose such that its multi-view projections follow the manifold in 2D MDM noise space, while conditioning the 3D pose to match the given 2D poses and anatomical constraints of humans. Experiments on the Yoga dataset demonstrate that our method achieves better cross-domain performance compared to state-of-the-art supervised and unsupervised 3D pose estimation methods, including extreme human poses where 3D supervision is unavailable. Code is available at: https://github.com/asaa0001/c-MAS.

2605.15582 2026-05-18 cs.CV

LDGuid: A Framework for Robust Change Detection via Latent Difference Guidance

Jiaxuan Zhao, Ali Bereyhi

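AI summary This paper proposes the LDGuid framework, which improves change detection by learning semantic differences and injecting them into the model. Experiments show improved segmentation on several datasets, most notably in challenging settings affected by spectral noise.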

Comments Accepted to IGARSS 2026. Code is available at: https://github.com/zjxyoyo/LDGuid

English abstract

Modern deep learning models for change detection (CD) often struggle to explicitly represent task-relevant semantic differences. This paper proposes the Latent Difference Guidance (LDGuid) framework that explicitly learns and injects semantic differences into CD models. LDGuid deploys adversarial autoencoding to implement a difference embedding (DE) module. The DE module is pretrained via the information bottleneck method, restricting it to learn only task-relevant differences between pre- and post-event samples. The learned latent difference is then used as an explicit guidance signal in the CD model. We validate LDGuid by integrating it into U-Net, BIT, and AERNet baselines for CD and evaluating it on LEVIR-CD, WHU-CD, SVCD, and CaBuAr datasets. Experimental results show that LDGuid enhances segmentation performance across all benchmarks, with particularly remarkable gains in challenging settings affected by spectral noise. The results further highlight the ability of LDGuid in incorporating domain knowledge, such as task-specific spectral indices. Our findings suggest that semantic difference learning can drastically enhance the robustness of CD in remote sensing.

2605.15581 2026-05-18 cs.AI

STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

Junle Wang, Xingchuang Liao, Wenjun Wu

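AI summary This paper proposes the STAR framework, which decomposes the RCA workflow into four stages to improve the reliability and self-repair capability of RCA agents in microservices.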

Comments 11 pages

English abstract

LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.

2605.15575 2026-05-18 cs.LG cs.DB

Gaussian Relational Graph Transformer

Zezhong Ding, Jin Li, Xugang Wang, Xike Xie

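AI summary This paper proposes GelGT, which uses structure-semantic collaborative sampling and a Gaussian graph attention mechanism to address long-range dependencies and the joint modeling of multiple kinds of information in relational graph models. Experiments report state-of-the-art predictive performance on several real-world datasets.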

English abstract

Relational graph learning models relational databases as graphs and has demonstrated superior performance on a wide range of relational predictive tasks. However, existing methods struggle to capture long-range dependencies due to information decay in their message-passing mechanisms, and recent relational graph transformers remain limited in jointly modeling structural, semantic, and temporal information. In this paper, we propose GelGT, a Gaussian relational graph transformer that explicitly addresses these challenges. GelGT introduces a structure-semantic collaborative sampling strategy to preserve structural connectivity while filtering irrelevant semantic information, and incorporates a Gaussian graph attention mechanism with a learnable Gaussian bias on the sampled subgraphs to dynamically encode temporal dependencies. Extensive experiments on various real-world datasets demonstrate that GelGT achieves state-of-the-art downstream task performance, with up to a 13.8% improvement in predictive performance.
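One common way to let attention depend on time is to add a distance-dependent bias to the attention logits before the softmax. The sketch below illustrates that general pattern with a Gaussian-shaped bias over time gaps. The abstract does not give GelGT's formula, so the bias form `-(dt - mu)^2 / (2 * sigma^2)`, the fixed `mu` and `sigma` (which would be learnable in a real model), and the function names are all assumptions.

```python
import math
from typing import List


def gaussian_bias(dt: float, mu: float, sigma: float) -> float:
    """Log of an unnormalized Gaussian in the time gap dt (assumed bias form)."""
    return -((dt - mu) ** 2) / (2.0 * sigma ** 2)


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(
    scores: List[float], time_gaps: List[float], mu: float = 0.0, sigma: float = 1.0
) -> List[float]:
    """Attention over neighbours: content score plus a Gaussian temporal bias."""
    logits = [s + gaussian_bias(dt, mu, sigma) for s, dt in zip(scores, time_gaps)]
    return softmax(logits)


if __name__ == "__main__":
    # Equal content scores, so the weights differ only through the time gaps.
    print(attention_weights([0.0, 0.0, 0.0], [0.0, 1.0, 3.0]))
```

With `mu = 0` the bias favours temporally close neighbours; a non-zero `mu` would let a head prefer a particular lag instead.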

2605.15574 2026-05-18 cs.CV

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

Sunghwan Steve Cho, Yunseok Han, Jaeyoung Do

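AI summary The MI-CXR benchmark evaluates longitudinal reasoning over multi-visit chest X-rays. Using five-way multiple-choice questions and three complementary task families, it exposes the temporal-reasoning limitations of current vision-language models.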

Comments 33 pages

English abstract

Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision-language models (VLMs) shows low overall performance, with an average accuracy of 29.3%, only modestly above random guessing. Using stage-wise diagnostic probing, we find that models often produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions over the full timeline. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning. The benchmark is available at https://github.com/AIDASLab/MI-CXR

2605.15573 2026-05-18 cs.CL cs.LG cs.MA

Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems

Nurbek Tastan, Alex Iacob, Lorenzo Sani, Meghdad Kurmanji, Nicholas D. Lane, Samuel Horvath, Karthik Nandakumar

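AI summary This paper proposes the Nexa framework, whose response-conditioned policy combines parallel and sequential execution to reduce communication and latency while improving final-response accuracy, and it demonstrates that the learned policy generalizes.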

English abstract

Multi-agent systems can solve complex tasks through collaboration between multiple Large Language Model agents. Existing collaboration frameworks typically operate in either a parallel or a sequential mode. In the parallel mode, agents respond independently to queries followed by aggregation of responses. In contrast, sequential systems allow agents to communicate via a directed topology and refine one another step by step. However, both modes are inadequate for achieving the desired objectives of minimizing communication and latency while simultaneously maximizing the accuracy of the final response. In this work, we introduce a hybrid paradigm called Nexa, a trainable response-conditioned policy that bridges the gap between the two modes. Nexa begins with a parallel execution stage, embeds the resulting responses into a shared semantic space, and then predicts a sparse directed acyclic communication graph. If the graph is empty, the system remains purely parallel; if it is non-empty, the system performs one sequential message propagation. The policy is a lightweight transformer model, and the method avoids the need for external LLM judges or reward models, as well as hand-crafted test-time topology search. We formalize this hybrid execution problem, show that the resulting graph is acyclic by construction, and that the framework strictly subsumes pure parallel execution, and present a training procedure based on policy-gradient optimization. Results demonstrate that the response-conditioned policy learned by Nexa under one setting can be reused when the number of agents, the task, or the underlying agent changes, thus emphasizing the generalizability of the learned communication policy.
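A graph can be made acyclic by construction in more than one way; the abstract does not say which one Nexa uses. The sketch below shows one simple option, stated as an assumption: fix an ordering of the agents and only allow edges from an earlier agent to a later one, keeping those whose predicted score exceeds a threshold. The score matrix, the threshold, and every function name are hypothetical. An empty edge list corresponds to the purely parallel case.

```python
from collections import deque
from typing import List, Tuple

Edge = Tuple[int, int]  # (sender agent index, receiver agent index)


def build_dag(edge_scores: List[List[float]], threshold: float = 0.5) -> List[Edge]:
    """Keep edge i -> j only if i < j and its score passes the threshold.

    Restricting edges to i < j (an assumed construction) rules out cycles.
    """
    n = len(edge_scores)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if edge_scores[i][j] > threshold
    ]


def execution_mode(edges: List[Edge]) -> str:
    """Empty graph: stay parallel. Otherwise: one sequential propagation pass."""
    return "parallel" if not edges else "parallel+sequential"


def topological_order(n: int, edges: List[Edge]) -> List[int]:
    """Kahn's algorithm; raises ValueError if the graph contains a cycle."""
    indegree = [0] * n
    children: List[List[int]] = [[] for _ in range(n)]
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    queue = deque(i for i in range(n) if indegree[i] == 0)
    order: List[int] = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != n:
        raise ValueError("graph contains a cycle")
    return order


if __name__ == "__main__":
    scores = [[0.0, 0.1, 0.9], [0.0, 0.0, 0.7], [0.9, 0.0, 0.0]]
    edges = build_dag(scores)
    print(edges, execution_mode(edges), topological_order(3, edges))
```

Note that the high score at `scores[2][0]` is ignored: under this construction a later agent can never send to an earlier one, which is what guarantees a valid message-passing order exists.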

2605.15567 2026-05-18 cs.AI

Position: Artificial Intelligence Needs Meta Intelligence -- the Case for Metacognitive AI

Sergei Chuprov, Richard D. Lange, Leon Reznik, Paulo Shakarian, Raman Zatsarenko, Dmitrii Korobeinikov

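AI summary This paper argues for metacognition as a general principle for designing more accurate, secure, and efficient AI. It uses a federated learning case study to show how metacognition improves learning efficiency and security, and it presents a new software framework for building metacognitive AI.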

Comments This is a preliminary version accepted for presentation and publication at the 43rd International Conference on Machine Learning (ICML26). The modified final version will be available in the conference proceedings

English abstract

This position paper argues for metacognition as a general design principle for creating more accurate, secure, and efficient AI. The metacognitive solution involves systems monitoring their own states and judiciously allocating resources depending on each problem instance's difficulty or cost of mistakes. Drawing inspiration both from past work on resource-rational AI and from well-documented metacognitive strategies in psychology and cognitive science, we identify specific challenges in embedding these strategies into AI design and highlight open theoretical and implementation problems. We showcase these principles through a tangible example of improved learning efficiency, effectiveness, and security in a Federated Learning (FL) case study. We show how these principles can be translated into practice with a novel software framework developed specifically to allow the community to design, deploy, and experiment with metacognition-enabled AI applications.

2605.15565 2026-05-18 cs.LG cs.AI

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

Haizhong Zheng, Yizhuo Di, Jiahui Wang, Shuowei Jin, Xueshen Liu, Yongji Wu, Z. Morley Mao, Ion Stoica, Jiawei Zhao, Beidi Chen

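AI summary AstraFlow is a dataflow-oriented reinforcement learning system that supports complex multi-policy collaborative training and makes efficient use of heterogeneous compute resources, with the goal of improving the reasoning and tool-use capabilities of agentic LLMs.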

English abstract

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.

2605.15564 2026-05-18 cs.LG cs.CE eess.IV

CrystalBoltz: End-to-End Protein Structure Determination via Experiment-Guided Diffusion for X-Ray Crystallography

Minseo Kim, Huanghao Mai, Jay Shenoy, Alec Follmer, Gordon Wetzstein, Frederic Poitevin

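AI summary CrystalBoltz performs end-to-end protein structure determination with an experiment-guided diffusion model. It treats refinement as Bayesian inference over atomic structures, lowering coordinate RMSD and R-factors and making X-ray crystallographic structure determination more efficient.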

Comments Project page: https://soniaminseokim.github.io/crystalboltz-website/

English abstract

Generative models trained on public databases of protein structures, most of which have been determined by X-ray crystallography, now provide powerful priors for structure prediction. However, they are not readily conditioned on the measurements from a new crystallographic experiment, limiting their use for X-ray structure determination. In crystallography, the measured structure-factor amplitudes do not by themselves determine an electron density map or atomic structure because the associated phases are unobserved and must be inferred. Structure determination therefore remains an inverse problem in which candidate models must be both structurally plausible and consistent with measured diffraction data, often requiring substantial manual refinement by human experts. Emerging methods aim to incorporate experimental information more directly into predictive and refinement workflows. We present CrystalBoltz, a generative framework that casts crystallographic refinement as Bayesian inference over atomic structures and operates directly on structure-factor amplitudes. CrystalBoltz moves from unguided generation with a pre-trained prior over protein structures to experiment-guided posterior sampling, followed by atomic coordinate and B-factor refinement. Across multiple protein crystallography datasets, CrystalBoltz attains lower coordinate RMSD and lower R-factors than the strongest baselines considered, while reducing runtime by a factor of 33 relative to existing experimentally guided refinement.
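For readers unfamiliar with the R-factor named in the abstract, it measures the relative disagreement between observed structure-factor amplitudes and those calculated from a model: the sum of | |F_obs| - |F_calc| | over reflections, divided by the sum of |F_obs|. The sketch below implements that conventional definition only. It assumes the two amplitude lists are already on a common scale and index the same reflections, the sample values are made up, and it is not taken from CrystalBoltz.

```python
from typing import Sequence


def r_factor(f_obs: Sequence[float], f_calc: Sequence[float]) -> float:
    """Conventional crystallographic R-factor.

    Assumes both lists cover the same reflections in the same order and that
    any scale factor between them has already been applied.
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    denominator = sum(abs(f) for f in f_obs)
    if denominator == 0.0:
        raise ValueError("observed amplitudes sum to zero")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    return numerator / denominator


if __name__ == "__main__":
    # Illustrative amplitudes only, not real diffraction data.
    print(r_factor([100.0, 50.0, 25.0], [90.0, 55.0, 25.0]))
```

Because only amplitudes enter the formula, a low R-factor shows agreement with the measured data without requiring the unobserved phases, which is why it is paired with coordinate RMSD in the reported results.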

2605.15562 2026-05-18 cs.CL

GiLT: Augmenting Transformer Language Models with Dependency Graphs

Tianyu Huang, Yida Zhao, Chuyan Zhou, Kewei Tu

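AI summary GiLT augments Transformer language models with dependency graphs, improving syntactic generalization while keeping perplexity competitive, and it can be fine-tuned to improve downstream task performance.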

English abstract

Augmenting Transformers with linguistic structures effectively enhances the syntactic generalization performance of language models. Previous work in this direction focuses on syntactic tree structures of languages, in particular constituency tree structures. We propose Graph-Infused Layers Transformer Language Model (GiLT) which leverages dependency graphs for augmenting Transformer language models. Unlike most previous work, GiLT does not insert extra structural tokens in language modeling; instead, it injects structural information into language modeling by modulating attention weights in the Transformer with features extracted from the dependency graph that is incrementally constructed along with token prediction. In our experiments, GiLT with semantic dependency graphs achieves better syntactic generalization while maintaining competitive perplexity in comparison with Transformer language model baselines. In addition, GiLT can be finetuned from a pretrained language model to achieve improved downstream task performance. Our code is released at https://github.com/cookie-pie-oops/GiLT-LM.

2605.15561 2026-05-18 cs.CV

RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding

Jiayan Yang, Zhuoyu Wu, Wenqi Fang

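AI summary This paper proposes RoiMAM, which combines a training-free ROI generation module with Semantic Selective Suppression to focus on lesion-relevant regions, improving the efficiency and accuracy of medical visual question answering.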

Comments under revision

English abstract

Vision-Language Models (VLMs) facilitate medical visual question answering (MedVQA) by jointly interpreting images and text. However, existing models typically depend on large architectures and closed-set answers, which limits their efficiency and potential clinical applicability. To overcome these shortcomings, we introduce RoiMAM, an efficient VLM. It integrates a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions, alongside a Text Prompt Enhancer module that provides modality-specific context without introducing training parameters. Compared to the widely used MedVInT-TD model, our design achieves efficient and accurate diagnosis at less than 20% of the model size, while improving accuracy by approximately 2% on SLAKE and 4.6% on PMC-VQA.

2605.15559 2026-05-18 cs.RO

NavRL++: A System-Level Framework for Improving Sim-to-Real Transfer in Reinforcement Learning-Based Robot Navigation

Zhefan Xu, Hanyu Jin, Kenji Shimada

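AI summary This paper proposes the NavRL++ framework. It systematically studies the key factors behind sim-to-real transfer and introduces a strategy for improving perception robustness and a Transformer-based temporal reasoning policy to improve robot navigation performance.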

Comments 18 pages, 18 figures, 6 tables

English abstract

Recent years have witnessed significant progress in autonomous navigation using reinforcement learning. However, existing approaches largely emphasize reinforcement learning framework design, such as input representations, action spaces, and reward functions, while providing limited analysis of sim-to-real transfer and insufficient insight into how training strategies affect real-world deployment performance. To bridge this gap, we not only introduce an effective RL framework but also present a complete training and deployment pipeline, along with a systematic empirical study that disentangles the key factors affecting sim-to-real transfer in reinforcement learning-based navigation, including sensor noise, perception failures, system latency, and control response. Building on insights from this analysis, we introduce perturbation-aware fine-tuning, a post-training adaptation strategy that improves transfer robustness by explicitly accounting for empirically identified domain discrepancies. To further mitigate perception degradation and enhance control smoothness in real-world deployment, we propose a Transformer-based temporal reasoning policy that leverages short-horizon observation for navigation control. We quantitatively evaluate how individual sim-to-real perturbations and training design choices impact navigation performance across environments. Experimental results demonstrate that the proposed training strategy and policy architecture outperform learning-based baselines in both static and dynamic environments, while achieving performance comparable to optimization-based planners in static settings. We validate our approach through real-world deployment on multiple robotic platforms, including aerial and legged robots, across navigation-centric tasks such as exploration and inspection, demonstrating zero-shot sim-to-real transfer.

2605.15557 2026-05-18 cs.CL cs.LG

When Latent Geometry Is Not Enough: Draft-Conditioned Latent Refinement for Non-Autoregressive Text Generation

De Shuai Zhang

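AI summary This paper proposes a draft-conditioned latent refinement model to improve non-autoregressive text generation. It finds that latent geometry alone does not guarantee generation quality; decoder readability and structure preservation must also be considered.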

Comments 17 pages, 1 figure, 6 tables. Technical Report v1. Stage 1 complete; Stage 2 ongoing. Code: https://github.com/saslifat-gif/structured-latent-text-refinement

English abstract

Continuous diffusion and flow models are attractive for non-autoregressive text generation because they can update all positions in parallel. A major difficulty is the interface between continuous latent states and discrete tokens. This report studies a draft-conditioned latent refinement model built from a frozen BERT encoder, a parallel decoder, a denoising DraftPrior, a local FlowNet, and a learned diagonal MetricNet. Early Gaussian-start experiments showed that good latent-space metrics, such as scale matching or cosine similarity, do not guarantee good decoding. Generated latents can be close to real encoder latents but still produce high-entropy, biased, or repetitive token distributions. We therefore frame the task as controlled local refinement rather than full generation from noise. On ROCStories, using the first two sentences as prompt and the last three as target, full 768-dimensional BERT latents recover tokens much better than compressed 256-dimensional latents. With 768-dimensional latents, DraftPrior target-token probability is 0.938 for clean drafts, 0.613 for 3% token dropout, 0.483 for 5% dropout, and 0.272 for 10% dropout. Local flow refinement and fused decoder-aware readout give modest additional gains, while metric learning and OT-style alignment improve geometry but do not close the decoder gap. The main result is a diagnostic one: latent geometry alone is not enough. Continuous latent text generation should be evaluated by decoder recoverability, the quality of the start distribution, and whether refinement preserves decoder-readable structure.

2605.15551 2026-05-18 cs.LG

Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis

Pedram Bakhtiarifard, Sophia N. Wilson, Mahmoud Afifi, Jonathan Wenshøj, Raghavendra Selvan

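AI summary This paper proposes QuBD, a method for estimating the algorithmic complexity of deep neural network weights. It shows how that complexity changes across the phases of training and offers a theoretical basis for model compression.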

英文摘要

Training large-scale deep neural networks (DNNs) is resource-intensive, making model compression a practical necessity. The widely accepted "learning as compression" hypothesis posits that training induces structure in network weights, which enables compression. Measuring this structure through Kolmogorov-Chaitin-Solomonoff (KCS) complexity is appealing, but existing estimators based on the Coding Theorem Method (CTM) and the Block Decomposition Method (BDM) are limited to small binary objects and do not scale to modern DNNs. We introduce the Quantized Block Decomposition method (QuBD), which extends algorithmic complexity estimation to any $k$-ary object. QuBD first quantizes the network weights to a finite alphabet, then estimates the KCS complexity by aggregating per-bit-plane CTM estimates. We show theoretically that QuBD yields a strictly tighter estimation gap with respect to true KCS complexity than binarization-based methods. Using QuBD, we study how the algorithmic complexity of neural network weights evolves during training, showing that it decreases as models learn, scales with data budget, increases during overfitting, follows the delayed generalization observed during grokking, and correlates with generalization performance. We further show that algorithmic information resides predominantly in the most significant bit-planes, which can serve as a practical diagnostic for determining appropriate post-training quantization levels. This work offers novel insights into learning mechanisms in DNNs by providing the first scalable, tractable estimates of KCS complexity for large, non-binary objects such as DNN weights.
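As a rough illustration of the quantize-then-split step described in this abstract, the following Python sketch quantizes a list of weights to a finite alphabet and separates the resulting integer codes into binary bit-planes. The function names, the min-max uniform quantizer, and the 3-bit width are assumptions made for illustration; the per-plane CTM estimation and the aggregation that QuBD performs are not reproduced here.

```python
def quantize_uniform(weights, bits):
    """Map floats to integers in [0, 2**bits - 1] by min-max uniform quantization.

    Assumption: a uniform quantizer. The abstract does not say which quantizer QuBD uses.
    """
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    if hi == lo:  # constant input: every weight maps to level 0
        return [0 for _ in weights]
    return [round((w - lo) / (hi - lo) * levels) for w in weights]


def bit_planes(codes, bits):
    """Split integer codes into `bits` binary planes, most significant plane first."""
    return [[(c >> b) & 1 for c in codes] for b in range(bits - 1, -1, -1)]


def from_bit_planes(planes):
    """Recombine planes (most significant first) into the integer codes."""
    bits = len(planes)
    codes = [0] * len(planes[0])
    for i, plane in enumerate(planes):
        shift = bits - 1 - i
        codes = [c | (bit << shift) for c, bit in zip(codes, plane)]
    return codes


# Hypothetical toy weights, not data from the paper.
weights = [-1.0, -0.4, 0.1, 0.6, 1.0]
codes = quantize_uniform(weights, bits=3)
planes = bit_planes(codes, bits=3)
```

Each plane is a binary sequence, which is the kind of object that binary-only estimators such as CTM and BDM are described as handling; how the per-plane estimates are combined is left to the paper.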

2605.15549 2026-05-18 cs.LG cs.AI cs.CE

CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models

CTF4Nuclear: 用于核裂变和核聚变模型的通用任务框架

Stefano Riva, Carolina Introini, Antonio Cammi, Dean Price, Alexey Yermakov, Yue Zhao, Philippe M. Wyder, Judah Goldfeder, Jan Williams, Amy Sara Rude, Matteo Tomasetto, Joe Germany, Joseph Bakarji, Georg Maierhofer, Miles Cranmer, J. Nathan Kutz

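AI summary This paper proposes CTF4Nuclear, a framework for standardized evaluation of machine learning methods in nuclear engineering. It scores methods on 12 established metrics plus a system-monitoring task based on sparse measurements, aiming to improve rigour and reproducibility in scientific ML for the nuclear industry.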

详情
AI中文摘要

清洁能源需求持续增长,新型核技术为可再生能源提供补充方案。然而,设计和运行这些系统极具挑战性,因为物理现象的复杂性导致系统动态难以预测。尽管高保真模拟有助于理解反应堆中的非线性多物理场相互作用,但计算成本高,难以实现实时应用。此外,基于模型的方法对简化假设敏感,导致与实际测量存在固有差异。相比之下,机器学习(ML)方法有潜力生成可靠的替代模型,快速预测系统行为。然而,可用于此任务的数据驱动方法种类繁多且多样。在安全关键领域如核工程中,公平比较不同ML方法及其优缺点至关重要。为此,我们引入了一个通用任务框架(CTF)用于核工程中的ML,基于动态系统和地震学的先前努力。该CTF考虑了来自不同核和核相邻系统的精选数据集。CTF评估方法在12个已建立的指标上表现,以及一个专注于仅稀疏测量的系统监控新范式。我们通过基准测试标准ML基线方法,揭示了当前方法的限制。我们的愿景是用标准化评估替代随意比较,提高核工业科学ML的严谨性和可重复性。

英文摘要

The demand for clean energy is ever-increasing, with new nuclear technologies presenting a complementary solution to renewable energies. However, designing and operating these systems is exceptionally difficult, given the complexity of the physical phenomena that interact to form the system dynamics. While high-fidelity simulations help to understand the non-linear, multi-physics interactions within a reactor, they are computationally expensive and rarely suitable for real-time applications. Furthermore, model-based approaches are inherently sensitive to the simplifying assumptions required to derive their governing equations and parameters, leading to inevitable discrepancies with real-world measurements. In contrast, Machine Learning (ML) methods have the potential to generate reliable surrogate models that may be able to predict the system's behaviour quickly. However, the number of data-driven methods that can potentially be used for this task is large and diverse. In a safety-critical setting such as nuclear engineering, a fair comparison of different ML methods, and a clear understanding of their advantages and limitations, is of paramount importance. To address this, we introduce a Common Task Framework (CTF) for ML in nuclear engineering, building upon previous efforts in dynamical systems and seismology. This CTF considers a curated set of datasets from different nuclear and nuclear-adjacent systems. The CTF evaluates the performance of a method on 12 established metrics, alongside a new paradigm focused on system monitoring from sparse measurements only. We illustrate the framework by benchmarking standard ML baselines against these datasets, revealing current method limitations. Our vision is to replace ad hoc comparisons with standardized evaluations on hidden test sets, raising the bar for rigour and reproducibility in scientific ML for the nuclear industry.

2605.15548 2026-05-18 cs.RO

KaRMA: A Kinematic Metric for Fine Manipulation Ability in Robotic Hands

KaRMA:一种用于机器人手精细操作能力的运动学指标

Martin Peticco, Pulkit Agrawal

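AI summary KaRMA is a kinematics-only metric that measures a robotic hand's ability to continuously change an object's pose while maintaining contact, evaluated through the reachable translation and reorientation of a spherical test object.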

详情
AI中文摘要

传统机器人手指标侧重于静态属性,如工作空间、操作性和抓取稳定性。然而,这些指标无法直接测量标准定义下的灵活性:在保持初始抓握接触的情况下,连续改变物体姿态的能力。我们引入了运动学滚动操作能力(KaRMA),这是一种仅基于运动学的指标,通过可行的滚动运动量化两指精密捏合中球形测试物体的可达平移和重新定向。KaRMA强制执行关节限制、碰撞约束、滚动接触和反向力可行性,然后通过平移和旋转原语的广度优先搜索来研究可达的在手物体姿态。KaRMA报告三个评分:平移覆盖(KaRMA-T)、旋转覆盖(KaRMA-R)和对初始抓握的敏感性(KaRMA-S)。我们在16种广泛使用的机器人手 上评估KaRMA,并与静态基线进行比较,显示KaRMA能够区分在静态代理中排名相同的手,揭示现有基线无法看到的平移-旋转权衡,并在选定的发表任务基准中与Jacobian基指标一致。

英文摘要

Traditional robotic hand metrics focus on static properties such as workspace, manipulability, and grasp stability. However, these metrics do not directly measure dexterity under the standard definition in robotic manipulation: the ability to continuously change an object's pose within the hand while maintaining contact from an initial grasp. We introduce Kinematic Rolling Manipulation Ability (KaRMA), a kinematic-only metric for fine manipulation that quantifies reachable in-hand translation and reorientation of a spherical test object within a two-finger precision pinch through feasible rolling motions. KaRMA enforces joint limits, collision constraints, rolling contact, and antipodal force feasibility, then investigates reachable in-hand object poses via breadth-first search over translation and rotation primitives. KaRMA reports three scores: translational coverage (KaRMA-T), rotational coverage (KaRMA-R), and sensitivity to the initial grasp (KaRMA-S). We evaluate KaRMA on 16 widely used robotic hands and compare against static baselines, showing that KaRMA separates hands that rank identically under static proxies, reveals translation-rotation tradeoffs invisible to existing baselines, and is qualitatively consistent with selected published task benchmarks where Jacobian-based metrics can be misleading.
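The search step mentioned in this abstract, breadth-first search over translation and rotation primitives with feasibility checks, can be sketched generically in Python. Everything below is a toy stand-in: the two-component integer pose, the four primitives, and `toy_feasible` are hypothetical, and the real metric's joint-limit, collision, rolling-contact, and antipodal-force checks are not modeled.

```python
from collections import deque


def reachable_poses(start, primitives, is_feasible):
    """Breadth-first search over discretized poses reachable from `start`.

    A pose is only expanded if `is_feasible` accepts it, so every pose in the
    result is connected to `start` through feasible intermediate poses.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        pose = queue.popleft()
        for step in primitives:
            nxt = tuple(p + s for p, s in zip(pose, step))
            if nxt not in seen and is_feasible(nxt):
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical pose: (translation step index, rotation step index).
PRIMITIVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def toy_feasible(pose):
    """Placeholder constraint set standing in for the real feasibility checks."""
    t, r = pose
    return abs(t) <= 2 and abs(r) <= 3 and abs(t) + abs(r) <= 4


poses = reachable_poses((0, 0), PRIMITIVES, toy_feasible)
distinct_translations = len({t for t, _ in poses})
distinct_rotations = len({r for _, r in poses})
```

The counts of distinct translation and rotation values are only a loose analogue of coverage; how KaRMA-T, KaRMA-R, and KaRMA-S are actually computed is defined in the paper, not here.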

2605.15546 2026-05-18 cs.CV

3DTMDet: A Dual-Path Synergy Network of Transformer and SSM for 3D Object Detection in Point Clouds

3DTMDet:一种结合Transformer和SSM的双路径协同网络用于点云中的3D目标检测

Bingwen Qiu, Yuan Liu, Junqi Bai, Tong Jiang, Ben Liang, Fangzhou Chen, Xiubao Sui, Qian Chen

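AI summary This paper proposes 3DTMDet, a network that combines SSMs and Transformers to resolve the tension between sparse distant points and the need for long-range context in point cloud detection. A 3D Hybrid Mamba Transformer block and a voxel generation block improve detection performance.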

详情
AI中文摘要

点云目标检测面临远距离点极稀疏与需要远程上下文理解的矛盾。现有方法通过1D序列扩展感受野,不可避免地丢弃已稀缺的局部几何细节并降低远距离和小物体的检测。为了解决这个问题,我们提出了3DTMDet,一种新颖的检测网络,协同结合状态空间模型(Mamba)与Transformer。核心思想是利用SSM的线性复杂度和长序列建模优势,有效捕捉稀疏和远距离点之间的全局交互,同时使用Transformer模块进行局部注意力编码,以编码局部点集中的细粒度几何结构,保留准确的形状信息。我们提出了3D混合Mamba Transformer(3DHMT)块,使用SSM-Attention-SSM流水线来平衡全局上下文理解和局部细节保存,有效缓解了远距离检测中感受野扩大与几何保存之间的张力。此外,我们引入了受LiDAR物理启发的体素生成块,该模块沿传感器观测方向扩散特征,以重建遮挡和远距离区域的完整物体结构。在KITTI和ONCE数据集上进行的大量实验表明,3DTMDet优于最先进的检测器。代码可在https://github.com/QiuBingwen/3DTMDet获取。

英文摘要

A fundamental challenge in point cloud object detection lies in the conflict between the extreme sparsity of distant points and the need for long-range context understanding. Existing methods typically use 1D serialization to expand the receptive field, which inevitably discards already scarce local geometric details and degrades detection of distant and small objects. To address this issue, we propose 3DTMDet, a novel detection network that synergistically combines state space models (Mamba) with Transformers. The core idea is to utilize the SSM's linear complexity and advantages in long-sequence modeling to effectively capture global interactions between sparse and distant points, while using Transformer modules with local attention to encode fine-grained geometric structures in local point sets, preserving accurate shape information. We propose the 3D Hybrid Mamba Transformer (3DHMT) block, which uses an SSM-Attention-SSM pipeline to balance global context understanding and local detail preservation, effectively alleviating the tension between receptive field enlargement and geometric preservation in long-range detection. In addition, we introduce a voxel generation block inspired by LiDAR physics, which diffuses features along the sensor observation direction to reconstruct the complete object structure in occluded and distant areas. Extensive experiments on the KITTI and ONCE datasets show that 3DTMDet outperforms state-of-the-art detectors. The code is available at https://github.com/QiuBingwen/3DTMDet.

2605.15542 2026-05-18 cs.AI

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

DRS-GUI: 动态区域搜索用于无训练的GUI定位

Yichao Liu, Huawen Shen, Liu Yu, Shiyu Liu, Zeyu Chen, Yu Zhou

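AI summary DRS-GUI is a dynamic region search framework that improves GUI grounding for multimodal large language models. A lightweight UI Perceptor and an MCTS-based Action Planner explore and filter candidate regions efficiently.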

Comments 11 pages, 8 figures

详情
AI中文摘要

基于多模态大语言模型(MLLM)的GUI代理在理解和执行用户指令方面表现出色,但准确地从高分辨率截图中定位相关元素仍具挑战性。受人类动态调整感知范围的启发,本文提出DRS-GUI,一种无训练的动态区域搜索框架,可无缝集成到现有MLLM中。DRS-GUI引入轻量级UI感知器,执行聚焦、位移和分散三种人类似感知动作,逐步探索界面并生成区域提案。通过基于蒙特卡洛树搜索(MCTS)的动作规划器动态调度这些动作,并利用区域质量奖励评估和选择高度相关的区域,有效剪枝冗余UI元素。实验表明,DRS-GUI在ScreenSpot-Pro上对通用和GUI特定的MLLM(Qwen2.5-VL-7B和UGround-V1-7B)实现了14%的提升,显著增强了定位性能和泛化能力。

英文摘要

GUI agents powered by Multimodal Large Language Models (MLLMs) have demonstrated impressive capability in understanding and executing user instructions. However, accurately grounding instruction-relevant elements from high-resolution screenshots cluttered with irrelevant UI components remains challenging for existing approaches. Inspired by how humans dynamically adjust their perceptual scope to locate task-related regions on complex screens, we propose DRS-GUI, a training-free dynamic region search framework for GUI grounding that can be seamlessly integrated into existing MLLMs. DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions (Focus, Shift, and Scatter) to progressively explore the interface and generate region proposals. To dynamically schedule these actions, we further design an Action Planner based on Monte Carlo Tree Search (MCTS). A region quality reward is employed to evaluate and select the highly instruction-relevant region, efficiently pruning redundant UI elements. Experiments demonstrate that DRS-GUI yields a 14% improvement on ScreenSpot-Pro for general and GUI-specific MLLMs (Qwen2.5-VL-7B and UGround-V1-7B), significantly enhancing grounding performance and generalization.

2605.15537 2026-05-18 cs.AI

RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision

RTL-BenchMT:通过代理辅助分析和修订动态维护RTL生成基准

Jing Wang, Shang Liu, Hangan Zhou, Zhiyao Xie

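AI summary This paper proposes RTL-BenchMT, a framework that automatically identifies and revises flawed benchmark cases and detects and updates overfitting cases, addressing flaws and overfitting in RTL benchmarks while reducing manual maintenance costs.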

Comments This paper has been accepted by DAC 2026

详情
AI中文摘要

本文介绍了RTL-BenchMT,一种用于动态维护RTL生成基准的代理框架。大语言模型(LLMs)辅助自动化RTL生成是EDA研究中的重要方向。然而,当前RTL基准面临两个关键挑战:(1)基准中的错误案例和(2)对基准的过拟合。这两个挑战难以仅通过手动工程努力解决。为解决这些问题并系统降低人工维护成本,我们提出自动化代理框架RTL-BenchMT。RTL-BenchMT专注于两个关键应用:(1)自动识别和修正错误基准案例和(2)自动检测和更新过拟合案例。借助RTL-BenchMT,我们对错误和过拟合案例进行了深入分析,并生成一个经过改进的基准套件,该套件将向社区开源。

英文摘要

This paper introduces RTL-BenchMT, an agentic framework for dynamically maintaining RTL generation benchmarks. Automated RTL generation assisted by Large Language Models (LLMs) is one of the most important directions in EDA research. However, current RTL benchmarks face two critical challenges: (1) flawed cases in the benchmarks and (2) overfitting to the benchmarks. Both challenges are difficult to resolve purely by manual engineering effort. To address these issues and systematically reduce human maintenance costs, we propose an automated agentic framework, RTL-BenchMT. RTL-BenchMT focuses on two key applications: (1) automatically identifying and revising flawed benchmark cases and (2) automatically detecting and updating overfitting cases. With the assistance of RTL-BenchMT, we conduct a thorough, in-depth analysis of flawed and overfitting cases and produce a refined benchmark suite that will be open-sourced to the community.

2605.15536 2026-05-18 cs.RO cs.AI cs.CV

SkiP: When to Skip and When to Refine for Efficient Robot Manipulation

SkiP: 在何时跳过和何时细化以实现高效的机器人操作

Mingtong Dai, Guanqi Peng, Yongjie Bai, Feng Yan, Chunjie Chen, Lingbo Liu, Liang Lin, Xinyu Wu

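AI summary SkiP improves robot manipulation efficiency by dynamically skipping redundant steps and densely refining key steps, without a learned skip planner or hierarchical structure.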

详情
AI中文摘要

先前的模仿学习策略在每个控制步骤都预测未来动作,无论是在平滑运动阶段还是精确的接触丰富操作阶段。这种统一处理是浪费的:大多数操作轨迹步骤在自由空间中移动,携带很少的任务相关信息,而一小部分关键步骤围绕接触、抓取和对齐需求密集的高分辨率预测。我们提出了一种新的动作重标机制:在跳过段的每个时间步,我们用下一个关键段入口的动作替换行为克隆目标,使策略能够在一个决策中跳过冗余步骤。由此产生的Skip Policy (SkiP)在单一统一网络中动态跳过跳过段并密集细化关键段,无需学习跳过规划器或分层结构。为了自动将演示分成关键和跳过段而无需手动标注,我们引入了Motion Spectrum Keying (MSK),一种快速且任务无关的程序,从动作信号中检测局部运动复杂性。在72个模拟操作任务和三个真实机器人任务上的广泛实验表明,SkiP将执行步骤减少15-40%,同时在各种策略骨干上匹配或提高成功率。项目页面:https://pgq18.github.io/SkiP-page/.

英文摘要

Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact-rich operation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory traverse free space and carry little task-relevant information, while a small fraction of key steps around contacts, grasps, and alignment demand dense, high-resolution prediction. We propose a novel action relabeling mechanism: at each timestep in a skip segment, we replace the behavior cloning target with the action at the entrance of the next key segment, enabling the policy to leap over redundant steps in a single decision. The resulting Skip Policy (SkiP) dynamically leaps over skip segments and intensively refines actions in key segments, within a single unified network requiring no learned skip planner or hierarchical structure. To automatically partition demonstrations into key and skip segments without manual annotation, we introduce Motion Spectrum Keying (MSK), a fast, task-agnostic procedure that detects local motion complexity from action signals. Extensive experiments across 72 simulated manipulation tasks and three real-robot tasks show that SkiP reduces executed steps by 15-40% while matching or improving success rates across various policy backbones. Project page: https://pgq18.github.io/SkiP-page/.
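The action relabeling rule stated in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the boolean `is_key` mask (which in the paper would come from MSK), and the handling of trailing skip steps with no later key segment are all hypothetical.

```python
def relabel_actions(actions, is_key):
    """Return behavior-cloning targets after skip-segment relabeling.

    Key timesteps keep their original action. Each skip timestep gets the
    action at the first timestep of the next key segment. Trailing skip steps
    with no later key segment keep their original action (an assumption; the
    abstract does not cover this case).
    """
    targets = list(actions)
    entrance = None  # index of the nearest key timestep ahead of t
    for t in range(len(actions) - 1, -1, -1):
        if is_key[t]:
            # Scanning backward, the last key index seen is the earliest one
            # after any skip step that follows, i.e. the segment entrance.
            entrance = t
        elif entrance is not None:
            targets[t] = actions[entrance]
    return targets


# Hypothetical demonstration: string labels stand in for action vectors.
actions = ["a0", "a1", "a2", "a3", "a4", "a5", "a6"]
is_key = [False, False, True, True, False, True, False]
targets = relabel_actions(actions, is_key)
```

With this mask, steps 0 and 1 both target the action at step 2, so a policy trained on these targets would be pushed to jump straight to the start of the first key segment.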

2605.15535 2026-05-18 cs.CV

Learning Dynamic Structural Specialization for Underwater Salient Object Detection

学习动态结构专业化用于水下显著目标检测

Lin Hong, Chenhui Wang, Linan Deng, Yuning Cui, Yu Zhang, Xin Wang, Bojian Zhang, Wenqi Ren, Xingchen Yang, Fumin Zhang

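AI summary This paper proposes DSS-USOD, which uses dynamic structural specialization to address the inaccurate localization, fragmented regions, and coarse boundary predictions caused by underwater image degradation, improving boundary precision and region coherence.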

Comments 15 pages

详情
AI中文摘要

水下显著目标检测(USOD)因在水下视觉场景理解和视觉引导机器人应用中受到越来越多关注。然而,现有USOD方法仍难以应对水下图像退化,这通常导致目标定位不准确、显著区域碎片化和边界预测粗劣。为解决这些挑战,本文提出DSS-USOD,一种基于RGB的USOD方法,建立在动态结构专业化之上。DSS-USOD从单张水下图像中提取共享基础表示,将其分解为对边界敏感和区域一致的结构特征,并根据局部结构上下文动态协调其贡献。具体而言,提取的共享基础表示被分解为一个用于建模细粒度边界细节的边界敏感分支和一个用于捕捉区域级结构一致性的区域一致分支。随后引入一个空间协调模块,根据局部结构上下文自适应调节两个分支的相对贡献。此外,引入协作结构监督以促进分支专业化并稳定空间协调,使DSS-USOD在退化的水下条件下更好地平衡边界精度和区域一致性。大量实验表明,DSS-USOD在基准数据集上实现了优越性能。最后,实际部署在水下机器人上验证了DSS-USOD在水下目标检测中的实际有效性。

英文摘要

Underwater salient object detection (USOD) has attracted increasing attention for underwater visual scene understanding and vision-guided robotic applications. However, existing USOD methods still struggle with underwater image degradations, which often lead to inaccurate object localization, fragmented salient regions, and coarse boundary prediction. To address these challenges, this paper proposes DSS-USOD, a novel RGB-based USOD method built upon dynamic structural specialization. DSS-USOD extracts a shared base representation from a single underwater image, decomposes it into boundary-sensitive and region-coherent structural features, and dynamically coordinates their contributions according to local structural context. Specifically, the extracted shared base representation is decomposed into a boundary-sensitive branch for modeling fine-grained boundary details and a region-coherent branch for capturing region-level structural consistency. A spatial coordination module is then introduced to adaptively regulate the relative contributions of the two branches according to local structural context. Moreover, cooperative structural supervision is introduced to promote branch specialization and stabilize spatial coordination, enabling DSS-USOD to better balance boundary precision and region coherence under degraded underwater conditions. Extensive experiments show that DSS-USOD achieves superior performance on benchmark datasets. Finally, real-world deployment on an underwater robot validates the practical effectiveness of DSS-USOD for underwater object inspection.

2605.15533 2026-05-18 cs.CV cs.AI

Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance

无需调优的指令式视频编辑:通过结构噪声初始化和引导

Song Wu, Xinyu Chen, Qian Wang, Liang Li, Zili Yi, Junlan Feng

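AI summary This paper proposes a tuning-free, instruction-based video editing framework that improves visual quality and editing performance through a Structural Noise Initialization Strategy and a Noise Guidance Mechanism.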

Comments Accepted by ICIP 2026

详情
AI中文摘要

视频编辑面临重大挑战。尽管一系列无需调优的方法避免了大量数据收集和模型训练的需求,但它们往往未能充分利用嵌入在噪声潜在空间中的丰富信息,导致结果不满意。为此,我们提出一种无需调优、基于指令的视频编辑框架。我们从噪声潜在空间的角度出发:设计了结构噪声初始化策略(SNIS),通过为编辑区域分配更高的噪声水平(以促进内容变化)和为未编辑区域分配更低的噪声水平(以保持内容一致性),从而获得更优的编辑起点。我们引入了噪声引导机制(NGM),利用生成模型中的视频先验知识,有效整合噪声潜在空间中的丰富信息以引导去噪过程,从而保持未编辑内容和整体视觉一致性。实验表明,我们提出的方法在视觉质量和性能上均优于现有方法。

英文摘要

Video editing poses a significant challenge. While a series of tuning-free methods circumvent the need for extensive data collection and model training, they often underutilize the rich information embedded within the noisy latent, leading to unsatisfactory results. To address this, we propose a tuning-free, instruction-based video editing framework. We approach video editing from the perspective of the noisy latent: we design a Structural Noise Initialization Strategy (SNIS) to secure a superior editing starting point by assigning higher noise levels to edited regions (to facilitate content change) and lower noise levels to unedited regions (to maintain content consistency). We also introduce a Noise Guidance Mechanism (NGM), which leverages the video prior in the generative model and effectively integrates rich information within the noisy latent to guide the denoising process, thereby preserving unedited content and overall visual coherence. Experiments show that our proposed method achieves better visual quality and state-of-the-art performance.

2605.15529 2026-05-18 cs.CL cs.AI cs.LG

Process Rewards with Learned Reliability

基于学习可靠性的过程奖励

Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai, Yuyi Yang, Wenxuan Zhang, Jiaxin Huang

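AI summary This paper proposes BetaPRM, which improves process reward models by predicting both a step-level success probability and the reliability of that prediction, so that downstream methods can tell reliable rewards from uncertain ones. Applied to Best-of-N reasoning, ACA improves the accuracy-token tradeoff.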

详情
AI中文摘要

Process Reward Models (PRMs) 提供步骤级反馈用于推理,但当前PRMs通常为每个步骤输出单一奖励分数。下游方法必须将不完美的步骤级奖励预测视为可靠的决策信号,但无指示何时应信任这些预测。我们提出BetaPRM,一种分布型PRM,预测步骤成功概率及该预测的可靠性。给定步骤成功监督来自蒙特卡洛延续,BetaPRM学习Beta信念,通过Beta-Binomial似然解释观察到的成功延续数量,而非回归到有限样本成功比率作为点目标。该学习的可靠性信号指示何时应信任步骤奖励,使下游应用能区分可靠奖励与不确定奖励。作为一项应用,我们引入自适应计算分配(ACA)用于PRM引导的最佳N推理。ACA利用学习的可靠性信号在高奖励解决方案可靠时停止,并在不确定候选前缀上投入更多计算。在四个backbone和四个推理基准上的实验表明,BetaPRM改进了PRM引导的最佳N选择,同时保持标准步骤级错误检测。基于此信号,ACA在固定预算最佳16上提升了准确率-token权衡,减少token使用达33.57%,同时提高最终答案准确率。

英文摘要

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
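The Beta-Binomial likelihood named in this abstract has a standard closed form, sketched below in Python for k successful continuations out of n under a Beta(alpha, beta) belief. The formula itself is textbook; how BetaPRM parameterizes or predicts alpha and beta is not stated in the abstract, and reading the concentration alpha + beta as a reliability signal is an assumption made here for illustration.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_log_likelihood(k, n, alpha, beta):
    """log P(k successes in n trials) when the success rate is Beta(alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def beta_mean_and_concentration(alpha, beta):
    """Mean of the Beta belief and its concentration alpha + beta.

    Assumption: a larger concentration is read as a more reliable prediction.
    """
    return alpha / (alpha + beta), alpha + beta


# Hypothetical numbers: 8 Monte Carlo continuations, belief Beta(6, 2).
n = 8
probs = [exp(beta_binomial_log_likelihood(k, n, 6.0, 2.0)) for k in range(n + 1)]
mean, concentration = beta_mean_and_concentration(6.0, 2.0)
```

Training against the negative of this log-likelihood, rather than regressing to k / n, lets two steps with the same observed ratio receive beliefs of different sharpness, which is the distinction the abstract draws.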

2605.15528 2026-05-18 cs.RO cs.MA

Task-Semantic Graph-Driven Distributed Agent Networking for Underwater Target Tracking

基于任务语义的分布式智能体网络用于水下目标跟踪

Shengchao Zhu, Guangjie Han, Chuan Lin, Yu He

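AI summary This paper builds an open-source MARL-AUV platform that integrates DI-engine with a six-degree-of-freedom AUV target-tracking simulator for evaluating MARL algorithms, and on top of it proposes STG-MAPPO to address the challenges of multi-agent reinforcement learning in underwater target tracking.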

详情
AI中文摘要

自主水下航行器(AUV)群正在成为智能水下网络,其中每个节点必须在严峻的声学约束下感知、通信、处理本地数据并做出决策。持久性的水下目标跟踪是典型的任务,具有移动目标、变化的通信拓扑、间歇性声学链路和每个AUV有限的观测。多智能体强化学习(MARL)是分布式跟踪的自然候选者,但现有研究仍缺乏一个统一的开源平台来评估不同MARL算法在六自由度AUV动态下的性能。此外,使用原始几何状态和低层力动作训练的策略往往难以表示任务阶段、观测可靠性、链路质量以及局部合作角色。本文通过开发一个整合DI-engine与六自由度水下AUV目标跟踪仿真的开源MARL-AUV平台来解决这些问题。据我们所知,这是第一个将公共MARL训练框架与物理建模的AUV群任务连接起来的开源平台,并提供统一的实验协议,用于公平训练、测试和比较代表性RL和MARL算法。基于此平台,我们提出了STG-MAPPO,一种增强的多智能体近端策略优化变种。STG-MAPPO从跟踪诊断、任务阶段、观测置信度、链路可用性、邻居跟踪质量以及局部角色优势构建语义策略输入。一个紧凑的语义任务图将通信受限的网络状态连接到去中心化的动作决策,而速度级动作抽象将高层协作决策映射到可执行的六自由度AUV控制输入。代码可在https://github.com/dasjsaj/MARL-AUV获取。

英文摘要

Autonomous underwater vehicle (AUV) swarms are emerging as intelligent underwater networks, where each node must sense, communicate, process local data, and make decisions under severe acoustic constraints. Persistent underwater target tracking is a typical task with moving targets, changing communication topology, intermittent acoustic links, and limited observation for each AUV. Multi-agent reinforcement learning (MARL) is a natural candidate for distributed tracking, yet existing studies still lack a unified open-source platform for evaluating different MARL algorithms under six-degree-of-freedom AUV dynamics. In addition, policies trained with raw geometric states and low-level force actions often struggle to represent task phases, observation reliability, link quality, and local cooperation roles. This paper addresses these issues by developing an open-source MARL-AUV platform that integrates DI-engine with a six-degree-of-freedom underwater AUV target-tracking simulator. To the best of our knowledge, it is the first open platform that connects a public MARL training framework with physically modeled AUV swarm-based tasks, and provides a unified experimental protocol for fair training, testing, and comparison of representative RL and MARL algorithms. Based on this platform, we propose STG-MAPPO, a Semantic Task Graph-enhanced variant of Multi-Agent Proximal Policy Optimization. STG-MAPPO builds semantic policy inputs from tracking diagnostics, task phases, observation confidence, link availability, neighbor tracking quality, and local role advantage. A compact semantic task graph links communication-constrained network states to decentralized actor decisions, and a velocity-level action abstraction maps high-level cooperative decisions to executable six-degree-of-freedom AUV control inputs. The code is available at https://github.com/dasjsaj/MARL-AUV.

2605.15524 2026-05-18 cs.LG cs.AI math.DG math.ST stat.TH

Neural Point-Forms

神经点形

Bruno Trentini, Jacob Hume, Vincenzo Antonio Isoldi, Philipp Misof, Ekaterina S. Ivshina, Kelly Maggs

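AI summary This paper proposes neural point-forms (NPFs), learnable geometric features for point clouds that use Laplacian-based techniques from diffusion geometry to compare differential forms. Synthetic and biologically relevant experiments show their advantages when labels depend on sampling density, manifold structure, or population geometry.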

详情
AI中文摘要

点云学习通常基于观察样本是嵌入高维特征空间的底层几何对象的噪声轨迹的假设。然而,许多几何特性无法仅通过坐标、成对距离或学习的图邻域直接捕捉。在光滑情况下,微分形式用于编码高阶切线信息。本文引入了一种新的可学习几何特征家族,称为神经点形(NPFs)。在没有自然切线结构的情况下,我们使用来自扩散几何的拉普拉斯技术,通过内积构建点云的离散模型,以比较微分形式。在连续情况下,共享环境特征空间的子流形表示为比较矩阵,其条目描述了特征形式对偶切线信息的相互作用。我们通过证明在标准采样、带宽、密度和流形假设下比较矩阵的长期一致性,使这一直觉精确化。这产生了一个紧凑、高效且可交换的神经层,其输出是一个学习的形比较矩阵。在合成和生物相关实验中,我们展示了NPFs提供了一个竞争性且可解释的表示,当标签依赖于采样密度、流形结构或响应相关群体几何时,其优势最为明显。

英文摘要

Point cloud learning often rests on the premise that observed samples are noisy traces of an underlying geometric object, such as a manifold embedded in a high-dimensional feature space. Yet much of this geometry is not captured directly by coordinates, pairwise distances, or learned graph neighborhoods alone. In the smooth setting, differential forms are devices that encode higher-order tangency information. In this work, we introduce a new family of principled learnable geometric features for point clouds called neural point-forms (NPFs). In the absence of a natural tangency structure, we instead use Laplacian-based techniques from Diffusion Geometry to build a discrete model for comparing differential forms on point clouds via inner products. In the continuum, submanifolds of a shared ambient feature space are represented as comparison matrices, whose entries describe how pairs of feature forms interact with extrinsic tangency information. We make this intuition precise by proving the long-run consistency of comparison matrices under standard sampling, bandwidth, density, and manifold-hypothesis assumptions. This yields a compact, efficient, and permutation-invariant neural layer whose output is a learned form-comparison matrix. Across synthetic and biologically relevant experiments, we show that NPFs provide a competitive and interpretable representation, with the strongest benefits appearing when labels depend on sampling density, manifold-like structure, or response-relevant population geometry.

2605.15520 2026-05-18 cs.LG cs.AI cs.DC

On the Fragility of Data Attribution When Learning Is Distributed

在分布式学习中数据归因的脆弱性

Xian Gao, Bo Hui, Min-Te Sun, Wei-Shinn Ku

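AI summary This study exposes the fragility of data attribution in distributed learning: an attribution-first attack shows that a participant can artificially inflate its attribution value. The results motivate attribution-robust and incentive-compatible scoring mechanisms.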

详情
AI中文摘要

数据归因已成为机器学习流水线中定价、审计和治理的重要组成部分,但大多数归因方法隐含假设归因值忠实反映参与者贡献。我们证明这一假设可能失效:在标准分布式训练流程中,单个参与者可通过潜变量优化注入小合成批次,保持全局效用的同时放大其归因值。在多个数据集、模型和边际效用评估器中,攻击一致增加攻击者的归因值并重塑良性客户端间的相对归因结构,而不会降低准确性或触发基于几何的防御。这些结果表明归因本身形成新的攻击面,推动归因鲁棒和激励相容的评分机制发展。

英文摘要

Data attribution has become an important component of pricing, auditing, and governance in machine learning pipelines, yet most attribution methods implicitly assume that attribution values faithfully reflect participants' contributions. We show that this assumption can fail: a single participant in a standard distributed training workflow can substantially inflate its measured attribution value while preserving global utility. Our attribution-first attack uses latent optimization to inject small synthetic batches that preserve utility while exploiting non-IID label coverage and evaluator sensitivities. Across datasets, models, and multiple marginal-utility evaluators, the attack consistently increases the adversary's attribution value and reshapes the relative attribution structure among benign clients without degrading accuracy or triggering geometry-based defenses. These results show that attribution itself forms a new attack surface and motivate the development of attribution-robust and incentive-compatible scoring mechanisms.

2605.15519 2026-05-18 cs.CV cs.AI

DiffVAS: Diffusion-Guided Visual Active Search in Partially Observable Environments

DiffVAS: 在部分可观测环境中基于扩散的视觉主动搜索

Anindya Sarkar, Srikumar Sastry, Aleksis Pirinen, Nathan Jacobs, Yevgeniy Vorobeychik

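AI summary DiffVAS is a target-conditioned policy that searches for diverse targets simultaneously in partially observable environments, advancing the deployment of visual active search in real-world applications.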

Comments 26 Pages, 12 figures, Accepted to AAMAS 2026

详情
AI中文摘要

视觉主动搜索(VAS)已被引入作为一种建模框架,利用视觉线索指导空中(如基于无人机的)探索,并在广阔的地理区域中定位感兴趣区域。潜在应用包括检测稀有野生动物盗猎的热点、协助搜救任务以及揭露非法武器交易等。先前的VAS方法假设整个搜索空间在前期已知,这在受限视野和高采集成本的约束下往往不现实,且通常学习针对特定目标对象的策略,限制了同时搜索多种目标类别的能力。在本工作中,我们提出DiffVAS,一种目标条件化的策略,根据任务需求在部分可观测环境中同时搜索多种对象,从而推进视觉主动搜索策略在现实应用中的部署。DiffVAS利用扩散模型从顺序观测的局部视图中重建整个地理区域,使基于目标条件的强化学习规划模块能够有效推理并引导后续的搜索步骤。大量实验表明,DiffVAS在部分可观测环境中搜索多种对象方面表现优异,在多个数据集上显著超越了最先进的方法。

英文摘要

Visual active search (VAS) has been introduced as a modeling framework that leverages visual cues to direct aerial (e.g., UAV-based) exploration and pinpoint areas of interest within extensive geospatial regions. Potential applications of VAS include detecting hotspots for rare wildlife poaching, aiding search-and-rescue missions, and uncovering illegal trafficking of weapons, among other uses. Previous VAS approaches assume that the entire search space is known upfront, which is often unrealistic due to constraints such as a restricted field of view and high acquisition costs, and they typically learn policies tailored to specific target objects, which limits their ability to search for multiple target categories simultaneously. In this work, we propose DiffVAS, a target-conditioned policy that searches for diverse objects simultaneously according to task requirements in partially observable environments, which advances the deployment of visual active search policies in real-world applications. DiffVAS leverages a diffusion model to reconstruct the entire geospatial area from sequentially observed partial glimpses, which enables a target-conditioned reinforcement learning-based planning module to effectively reason and guide subsequent search steps. Extensive experiments demonstrate that DiffVAS excels in searching diverse objects in partially observable environments, significantly surpassing state-of-the-art methods on several datasets.

2605.15517 2026-05-18 cs.RO cs.SY eess.SY

Terrain Consistent Reference-Guided RL for Humanoid Navigation Autonomy

地形一致的参考引导强化学习用于人形导航自主性

William D. Compton, Zachary Olkin, Aaron D. Ames

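AI summary This paper proposes a method for training reference-guided, perceptive reinforcement learning policies in which reference trajectories are modulated during training to be consistent with terrain geometry, improving humanoid navigation autonomy.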

Comments 8 pages, 4 figures, intended to submit to Humanoids 2026

详情
AI中文摘要

我们提出了一种方法,用于训练参考引导的感知强化学习运动策略,用于人形机器人,其中参考轨迹在训练中被调节以与地形几何一致。为了部署我们的方法与标准导航自主基础设施,我们合成SE(2)-可控的参考轨迹,将期望的步态投影到有效的脚踏点,并调整摆动脚和质心轨迹以匹配地形。所得到的策略暴露了一个干净的SE(2)速度接口,与标准导航规划器兼容。在仿真中,环境条件化的参考显著提高了参考跟踪性能,与环境无关的参考相比。在硬件上,我们将该策略与MPC + 控制屏障函数规划器集成,并在包含粗糙地形和连续楼梯的户外环境中展示了超过70米的闭环自主导航,所有传感和计算均在设备上完成。

英文摘要

We present a method for training reference-guided, perceptive reinforcement learning locomotion policies for humanoid robots in which reference trajectories are modulated in training to be consistent with terrain geometry. Aiming to deploy our method with standard navigation autonomy infrastructure, we synthesize SE(2)-controllable reference trajectories inside the RL training loop, projecting desired footsteps onto valid footholds and adjusting swing-foot and center-of-mass trajectories to match the terrain. The resulting policy exposes a clean SE(2) velocity interface compatible with standard navigation planners. In simulation, environmentally conditioned references significantly improve reference tracking performance compared to environment-agnostic references. On hardware, we integrate the policy with an MPC + control barrier function planner and demonstrate long-horizon (>70 m) closed-loop autonomous navigation on the Unitree G1 through outdoor environments containing rough terrain and consecutive flights of stairs, with all sensing and computation onboard.