arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20544 2026-05-21 cs.RO cs.CV

The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

Doguhan Yeke, Elif Su Temirel, Ananth Shreekumar, Brandon Lee, Dongyan Xu, Z Berkay Celik

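AI summary: This paper proposes RoboAbstention, a benchmarking framework for abstention in embodied robotic agents. It generates abstention instructions grounded in images from five robotics datasets, evaluates several frontier VLMs on abstention, and explores methods for improving abstention performance.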

English abstract

Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a three-phase pipeline: (1) structured visual grounding, (2) deterministic constraint derivation, and (3) controlled instruction generation via category-specific templates. This enables the construction of a diverse dataset with verifiable abstention conditions. We evaluate several frontier VLMs and find that all models exhibit significant weaknesses in abstention, including those with advanced reasoning capabilities. The best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of our 6,069 benchmark instructions, while the embodied planner Gemini Robotics ER 1.6 Preview abstains on just 16.5%. We further explore methods for improving abstention in VLM planners, such as defensive prompting and in-context learning, and find that these interventions substantially improve performance, reaching 93.6% abstention rate for Gemini Robotics ER 1.6 Preview and 88.6% for GPT 5.4 Mini, yet no approach fully solves the problem. We open-source RoboAbstention at https://purseclab.github.io/RoboAbstention/.

2605.20543 2026-05-21 cs.CV

Uncertainty-Guided Conservative Propagation for Structured Inference in Vessel Segmentation

Huan Huang, Michele Esposito, Chen Zhao

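AI summary: This paper proposes Uncertainty-Guided Conservative Propagation (UGCP), a module that improves structured inference in vessel segmentation. It performs a few logit-space update steps through local prediction interactions, improving the Dice similarity coefficient, centerline Dice, and 95th percentile Hausdorff distance while reducing vessel disconnections and improving structural consistency.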

Comments Pattern Recognition submission. 35 pages, 6 figures

English abstract

Accurate vessel segmentation is essential for medical image analysis, yet remains challenging due to complex vascular patterns and imaging ambiguity. Most deep models rely on single-pass prediction, limiting their ability to refine uncertain or disconnected regions during inference. To address this limitation, we propose Uncertainty-Guided Conservative Propagation (UGCP), a general plug-in module for vessel segmentation. Instead of directly using a one-shot output as the final prediction, UGCP performs a small number of logit-space update steps to refine the segmentation through local prediction interactions. Predictive uncertainty guides reliable regions to support ambiguous regions, while structure-aware modulation and source-based stabilization reduce unreliable propagation and excessive drift. The module is differentiable and can be trained end-to-end with different segmentation networks. We evaluate UGCP on four public vessel segmentation datasets covering 2D and 3D tasks, including retinal vessel, coronary artery, and cerebral vessel segmentation. Experiments with convolutional neural network-based and Transformer-based backbones show consistent improvements in Dice similarity coefficient, centerline Dice, and 95th percentile Hausdorff distance. Further analysis demonstrates that UGCP reduces vessel disconnections and improves structural consistency with limited additional computation. The code will be made available at https://github.com/chenzhao2023/UGC_PR.
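
To make the idea of logit-space refinement concrete, here is a minimal 1D sketch. It is not the authors' UGCP module: the update rule, the binary-entropy uncertainty measure, and the names `refine_logits`, `step`, and `anchor` are illustrative assumptions. Each position moves toward a confidence-weighted average of its neighbors in proportion to its own uncertainty, and an anchor term pulls it back toward the one-shot (source) logits to limit drift.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def uncertainty(logit):
    """Binary entropy (in bits) of the predicted probability; 1.0 is most uncertain."""
    p = sigmoid(logit)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def refine_logits(source, steps=3, step=0.5, anchor=0.1):
    """Hypothetical conservative propagation over a 1D row of logits."""
    logits = list(source)
    for _ in range(steps):
        unc = [uncertainty(value) for value in logits]
        conf = [1.0 - u for u in unc]
        updated = []
        for i, value in enumerate(logits):
            nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(logits)]
            weight = sum(conf[j] for j in nbrs)
            if weight > 0.0:
                # Confidence-weighted neighbor average (reliable regions dominate).
                target = sum(conf[j] * logits[j] for j in nbrs) / weight
                # Uncertain positions move more; confident ones barely change.
                value = value + step * unc[i] * (target - value)
            # Anchor term (assumed): pull back toward the source prediction.
            value = value - anchor * (value - source[i])
            updated.append(value)
        logits = updated
    return logits


# A one-pixel gap (logit 0.0) between two confident vessel pixels.
refined = refine_logits([4.0, 0.0, 4.0])
print([round(value, 2) for value in refined])
```

In this toy case the uncertain middle position is pulled toward the vessel class while the confident endpoints stay close to their source values, which is the qualitative behavior the abstract describes.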

2605.20052 2026-05-21 cs.CL cs.AI

PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling

Ying-Jia Lin, Tzu-Chin Lo, Ping-Chien Li, Chi-Tung Cheng, Chien-Hung Liao, Hung-Yu Kao

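AI summary: This paper proposes PromptRad, a knowledge-enhanced multi-label prompt-tuning method for radiology report labeling in low-resource settings. It enriches category representations with synonyms from the UMLS Metathesaurus and outperforms conventional methods with less labeled data.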

Comments BioNLP 2026 @ ACL (camera-ready version)

English abstract

Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label prompt-tuning approach for radiology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category representations. By fine-tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine-tuning. Experiments on liver CT (computed tomography) reports show that PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT-4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data-scarce clinical scenarios. Our code is available at https://github.com/ila-lab/PromptRad.
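
The multi-word verbalizer idea can be sketched in a few lines. This is a generic illustration, not PromptRad's implementation: the label names, synonym lists, probabilities, threshold, and the averaging rule are all hypothetical, and the masked language model is replaced by a hand-written dictionary of [MASK]-position token probabilities.

```python
def label_scores(mask_token_probs, verbalizer):
    """Average the [MASK]-position probability over each label's synonym words."""
    scores = {}
    for label, words in verbalizer.items():
        probs = [mask_token_probs.get(word, 0.0) for word in words]
        scores[label] = sum(probs) / len(probs)
    return scores


def predict_labels(mask_token_probs, verbalizer, threshold=0.1):
    """Multi-label decision: every label is thresholded independently."""
    scores = label_scores(mask_token_probs, verbalizer)
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical verbalizer: each category maps to several synonym words.
verbalizer = {
    "liver_laceration": ["laceration", "tear"],
    "hemoperitoneum": ["hemoperitoneum", "bleeding"],
}

# Hypothetical masked-LM output for one report (token -> probability at [MASK]).
mask_token_probs = {"laceration": 0.30, "tear": 0.10, "bleeding": 0.02}

predicted = predict_labels(mask_token_probs, verbalizer)
print(predicted)
```

Because no classification layer is added, the only label-specific component is the word list, which is where a synonym resource such as the UMLS Metathesaurus would plug in.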

2605.20030 2026-05-21 cs.LG math.OC

Take It or Leave It: Intent-Controlled Partial Optimal Transport

Salil Parth Tripathi, Bertrand Chapron, Fabrice Collard, Nicolas Courty, Ronan Fablet

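AI summary: This paper proposes intent-controlled partial optimal transport (IC-POT), which replaces the global rejection mechanism with pointwise rejection costs for applications that need more structured, pointwise rejection, and demonstrates its practical value in positive-unlabeled learning and open-partial domain adaptation.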

English abstract

While optimal transport (OT) enforces a rigid constraint by requiring two measures to be matched exactly, partial optimal transport relaxes this requirement by allowing mass to remain unmatched through a global budget, scalar rebate, or uniform rejection rule. However, many applications call for more structured, pointwise rejection mechanisms, where the decision to leave mass unmatched depends on side-specific reliability, support geometry, or external information about which components should participate in the comparison. We introduce intent-controlled partial optimal transport (IC-POT), a targeted generalization of partial transport that replaces the global rejection paradigm with pointwise rejection costs over both measures. We show that the resulting optimization problem admits a dual interpretation in terms of local acceptance thresholds and can be solved by recasting it as a balanced Kantorovich OT problem on an augmented support. Beyond theoretical analysis, we demonstrate the practical relevance of IC-POT in settings where rejection is driven by side information. In positive-unlabeled learning and open-partial domain adaptation, incorporating pointwise rejection rules that encode statistical structure improves fixed baseline pipelines. Finally, we motivate the use of IC-POT with a geophysical practical case: multi-modal satellite ocean measurements, for which physical and sensor priors naturally inform the rejection mechanism and define the retrieved comparable signal information.
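
The recasting into a balanced problem can be illustrated on a toy case. The sketch below assumes unit-mass points, so the balanced problem on the augmented support reduces to an assignment problem that can be brute-forced for tiny inputs. It uses the standard dummy-point construction for partial transport and is not necessarily the paper's exact formulation; the function name `ic_pot_assignment` and all cost values are hypothetical.

```python
from itertools import permutations


def ic_pot_assignment(cost, reject_src, reject_tgt):
    """Partial matching with pointwise rejection costs via a balanced assignment.

    cost[i][j]     : cost of matching source i to target j
    reject_src[i]  : cost of leaving source i unmatched
    reject_tgt[j]  : cost of leaving target j unmatched
    """
    n, m = len(cost), len(cost[0])
    size = n + m  # n real + m dummy sources; m real + n dummy targets

    def augmented_cost(i, j):
        if i < n and j < m:
            return cost[i][j]        # real source -> real target
        if i < n:
            return reject_src[i]     # real source -> dummy target (rejected)
        if j < m:
            return reject_tgt[j]     # dummy source -> real target (rejected)
        return 0.0                   # dummy -> dummy carries no cost

    best_total, best_perm = float("inf"), None
    for perm in permutations(range(size)):  # brute force: tiny inputs only
        total = sum(augmented_cost(i, perm[i]) for i in range(size))
        if total < best_total:
            best_total, best_perm = total, perm

    matches = [(i, best_perm[i]) for i in range(n) if best_perm[i] < m]
    return best_total, matches


# Hypothetical costs: only the pair (0, 0) is cheaper than rejecting both ends.
cost = [[1.0, 5.0],
        [4.0, 9.0]]
total, matches = ic_pot_assignment(cost, reject_src=[2.0, 2.0], reject_tgt=[2.0, 2.0])
print(total, matches)
```

In this construction a pair (i, j) can appear in an optimal plan only if `cost[i][j] <= reject_src[i] + reject_tgt[j]`, since otherwise rejecting both endpoints is cheaper. That is one way to read the local acceptance thresholds mentioned in the abstract.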

2605.19776 2026-05-21 cs.CV

Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation

Yuanpei Zhao, Jie Lin, Chao Zhang, Yilin Wang, Mao Li, Chenhui Li, Jie Hou, Tangjie Lv

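AI summary: This paper proposes the PPaint benchmark, which fuses expert preference and rating data for image aesthetic assessment, and a self-distillation method that yields a single-pass aesthetic scorer, outperforming open-source baselines and coming within 0.04 SRCC of the closed-source Gemini-3.1-Pro.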

Comments 27 pages, 7 pages

English abstract

Pairwise preferences and pointwise ratings are the two dominant annotation protocols in image aesthetic assessment (IAA), yet existing benchmarks adopt only one, leaving their complementarity unmeasured under controlled conditions. We introduce PPaint, a matched dual-protocol benchmark in which 15 domain experts, 5 per category, annotate 150 Chinese paintings under both protocols across five aesthetic dimensions, collecting 45,900 pairwise expert judgments through a locally dense preference design alongside the matched ratings. The matched design reveals complementary strengths: preferences yield more consistent ordinal rankings, while ratings anchor the absolute score scale. Fusing both signals via two independent preference-to-score methods yields a fused expert ground truth on which the two constructions converge to nearly identical scores. The same preference-to-score principle extends to label-free VLM training. PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via an Elo reference pool, and trains the same VLM with confidence-weighted ranking optimization to produce a single-pass aesthetic scorer. Trained on a single painting category, the distilled Qwen3-VL-8B improves mean SRCC from 0.504 to 0.709 across all three categories, outperforming all open-source baselines including the dedicated aesthetic model ArtiMuse and matching closed-source Gemini-3.1-Pro within 0.04 SRCC at single-pass inference cost, with cross-domain transfer further validated on APDDv2. We will release the full PPaint dataset and training code.
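
Turning pairwise judgments into scalar scores with Elo can be sketched as follows. This is the standard Elo rating update only; the abstract does not specify how PSDistill builds its reference pool or calibrates pseudo-scores, so the K-factor, the initial rating of 1500, and the helper names here are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A is preferred over B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, outcome_a, k=32.0):
    """outcome_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - exp_a)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - exp_a))
    return new_a, new_b


def rate_pool(items, judgments, initial=1500.0, k=32.0):
    """Replay (winner, loser) judgments in order and return item ratings."""
    ratings = {item: initial for item in items}
    for winner, loser in judgments:
        ratings[winner], ratings[loser] = elo_update(
            ratings[winner], ratings[loser], 1.0, k
        )
    return ratings


# Hypothetical judgments over three images: A > B, A > C, B > C.
ratings = rate_pool(["A", "B", "C"], [("A", "B"), ("A", "C"), ("B", "C")])
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

The ratings give an ordinal ranking from preferences alone; anchoring them to an absolute scale, which the abstract attributes to pointwise ratings, would be a separate calibration step.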

2605.19649 2026-05-21 cs.CV

CAD-Free Learning of Spacecraft Pose Estimators via NeRF-Based Augmentations

Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer

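AI summary: This paper proposes a NeRF-based image augmentation method that frees spacecraft pose estimator training from large sets of CAD-rendered images. A few tens to a few hundred realistic images suffice to train an accurate pose estimator, with improved robustness to real on-orbit conditions.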

Comments This work has been submitted to the IEEE for possible publication

English abstract

Spacecraft pose estimation networks require tens of thousands of CAD-rendered images to be trained. This reliance on synthetic CAD data (i) limits applicability to targets with a reliable geometry prior, excluding uncooperative or poorly documented spacecraft, and (ii) causes poor generalization to real on-orbit conditions due to unrealistic illumination and material appearance. This paper introduces a NeRF-based image augmentation method that enables the learning of spacecraft pose estimators from only a few tens to a few hundred images. The method learns a Neural Radiance Field of the target and generates a large, diverse dataset through geometrically consistent viewpoint and appearance augmentation. This augmented dataset enables the training of accurate target-specific pose estimators without requiring a CAD model or large synthetic datasets. Experiments show that our approach supports the training of accurate pose estimators from only 25 to 400 realistic images, even under severe illumination variations. When applied to large CAD-based synthetic datasets, the NeRF-based augmentation also enhances out-of-domain generalization, yielding improved robustness to real on-orbit conditions.

2605.19624 2026-05-21 cs.CV cs.AI

Component-Aware Structure-Preserving Style Transfer for Satellite Visual Sim2Real Data Construction

Zongwu Xie, Yonglong Zhang, Yifan Yang, Yang Liu, Baoshi Cao

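AI summary: This paper proposes a component-aware, structure-preserving style transfer framework for synthetic-to-real data construction in satellite vision. It extracts part-wise style codes from real images and injects them into synthetic images, improving annotation-preserving Sim2Real data generation.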

English abstract

For camera-based satellite visual sensing, Sim2Real data construction requires images that approach real-domain sensor appearance while retaining the annotations inherited from simulation. Real sensor images of satellite targets with reliable pose labels and component-level masks are difficult to acquire at scale, whereas synthetic rendering provides exact geometric annotations but suffers from a visible appearance gap. This paper presents a component-aware structure-preserving style transfer framework for satellite visual synthetic-to-real data construction. The method builds weakly paired real-synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream sensor-data supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve annotation-preserving satellite visual Sim2Real data generation in the considered calibrated setup.

2605.19537 2026-05-21 cs.LG

The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

David Pape, Jonathan Evertz, Lea Schönherr

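AI summary: This paper studies how inference backends affect LLM benchmark results. It finds that the choice of backend can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement, making the inference backend a consequential hyperparameter.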

English abstract

Progress in LLMs is increasingly measured through standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has driven widespread adoption of specialized inference backends, software systems that execute trained models efficiently at inference time. While critical for scalability, system-level optimizations, such as custom CUDA kernels and reduced-precision arithmetic, can alter token probabilities and introduce non-determinism, possibly cascading into divergent generation. In this work, we first survey the inference landscape, identifying 200 distinct engines, and analyze 35,000 ML publications, finding that the specific inference stack is rarely reported despite this widespread diversity. We then present a systematic empirical study of how inference backends affect LLM benchmark results. Holding model weights, decoding parameters, and hardware constant, we evaluate five widely used inference engines, including vLLM, SGLang, and llama.cpp, across multiple open-weight models and established benchmarks. We show that the choice of backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement. By isolating backend optimizations and tracing the execution pipeline, we find this divergence is driven by system-level optimizations like prefix caching and CUDA graphs, custom kernels, and engine-specific defaults in logit processing. Our findings identify the inference backend as a previously unreported but consequential hyperparameter in the evaluation of LLMs and advocate standardized reporting of inference stacks to improve the reproducibility and interpretability of benchmark comparisons.
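
The two quantities the abstract reports, score shift and output disagreement, measure different things, and a small sketch shows why both matter. The metric definitions below (exact-match accuracy and per-item disagreement) and the answer lists are hypothetical illustrations, not the paper's evaluation code.

```python
def accuracy(outputs, references):
    """Fraction of items whose output exactly matches the reference."""
    return sum(o == r for o, r in zip(outputs, references)) / len(references)


def disagreement_rate(outputs_a, outputs_b):
    """Fraction of items on which two backends produced different outputs."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("both backends must be run on the same items")
    differing = sum(1 for a, b in zip(outputs_a, outputs_b) if a != b)
    return differing / len(outputs_a)


def score_shift_pp(outputs_a, outputs_b, references):
    """Accuracy difference between two backends, in percentage points."""
    return 100.0 * (accuracy(outputs_a, references) - accuracy(outputs_b, references))


# Hypothetical multiple-choice answers from the same model on two backends.
references = ["B", "A", "D", "C"]
backend_x = ["B", "A", "D", "A"]
backend_y = ["B", "C", "D", "C"]

print(score_shift_pp(backend_x, backend_y, references))  # aggregate score gap
print(disagreement_rate(backend_x, backend_y))           # per-item divergence
```

In this toy data both backends reach the same accuracy while disagreeing on half of the items, so an identical headline score does not by itself show that two inference stacks behave identically.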

2605.19503 2026-05-21 cs.RO cs.AI cs.LG

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

Carlo Romeo, Andrew D. Bagdanov

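AI summary: This paper proposes ARC-RL, a reinforcement learning playground of four MuJoCo continuous-control environments whose robot morphologies are inspired by the bestiary of ARC Raiders. With a unified observation template, action convention, and reward function, it studies how reinforcement learning algorithms perform under differing morphologies and animation-style constraints.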

English abstract

Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints. Source code is available at https://github.com/CarloRomeo427/ARC_RL.git.

2605.19376 2026-05-21 cs.AI

Generative Recursive Reasoning

Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn

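AI summary: This paper proposes the GRAM framework, which turns recursive latent reasoning into probabilistic multi-trajectory computation, addressing the determinism of existing recursive reasoning models and supporting both conditional reasoning and unconditional generation.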

English abstract

How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via $p_\theta(y \mid x)$ and, with fixed or absent inputs, unconditional generation via $p_\theta(x)$. Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. Project page: https://ahn-ml.github.io/gram-website
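
For reference, a latent-variable model of this kind is commonly trained by maximizing an evidence lower bound. The following is the generic conditional bound for amortized variational inference, written with an assumed latent trajectory $z$ and an assumed inference network $q_\phi$. The abstract does not give GRAM's exact objective, so this is the standard form rather than the paper's loss.

```latex
\log p_\theta(y \mid x)
  = \log \int p_\theta(y \mid z, x)\, p_\theta(z \mid x)\, dz
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[ \log p_\theta(y \mid z, x) \right]
  - \mathrm{KL}\!\left( q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x) \right)
```

Sampling several trajectories $z$ from the prior $p_\theta(z \mid x)$ at inference time is what would produce multiple candidate predictions for the same input.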

2605.19138 2026-05-21 cs.RO cs.AI cs.LG

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

Ayush Agarwal, Ansh Gandhi, Jeremy A. Collins, Omar Rayyan, Aryan Sarswat, Ranjani Koushik, Masoud Moghani, Ajay Mandlekar, Animesh Garg

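AI summary: This paper proposes COBALT, a cloud-based teleoperation platform that uses smartphones and other common devices to collect high-quality robot learning data at scale, in both simulation and the real world.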

English abstract

The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit https://cobalt-teleop.github.io/ for more details.

2605.18860 2026-05-21 cs.LG cs.CV

Spectral structural distortion reveals redundant neurons in neural networks

Yongyu Wang

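AI summary: This paper proposes a criterion for neuron redundancy based on spectral structural distortion. By analyzing relational structure before and after each layer transformation, it identifies removable neurons while preserving task performance.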

English abstract

Overparameterized neural networks often contain many removable neurons, yet what makes a neuron redundant remains poorly understood. Existing pruning criteria commonly rely on local quantities such as weight magnitude, activation strength, or gradient sensitivity, but these measures provide limited insight into the structural role of a neuron in the transformation performed by a layer. Here we show that neuronal redundancy can be characterized by weak participation in the spectral structural distortion induced by layer-wise representation transformations. For each hidden layer of a trained network, we record pre-activation and post-activation hidden states, model neurons as graph nodes, and construct input-side and output-side graphs that describe neuron-level relational structure before and after the layer transformation. We then define a spectral structural importance score that measures the contribution of each neuron to the dominant graph-spectral distortion between these two relational structures. Low-participation neurons are treated as structurally redundant and removed through an iterative pruning process in which scores are recomputed after each structural change. No parameter updates are performed during intermediate pruning rounds; after the target parameter reduction is reached, a single recovery fine-tuning stage is applied to the compact model. Direct ablation analysis and experiments across conventional neural networks, encoder-only Transformers, and decoder-only language models show that this graph-spectral criterion identifies removable neurons and Transformer units while preserving task performance after compression. These results suggest that neural redundancy is not merely a consequence of small weights or weak activations, but can be understood through weak participation in the spectral distortion of layer-wise relational structure.

2605.18833 2026-05-21 cs.LG cs.AI

Automated Big Data Quality Assessment using Knowledge Graph Embeddings

Hadi Fadlallah, Rima Kilany, Mitri Haber, Ali Jaber

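AI summary: This paper proposes an automated big data quality assessment method based on knowledge graph embeddings. It integrates diverse knowledge graph representations and uses contextual information to generate a comprehensive data quality assessment plan tailored to each context.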

Comments 17 pages, 10 figures

Journal ref International Journal of Data Mining, Modelling and Management 17.4 (2025) 383-405

English abstract

Automated data quality assessment is crucial for managing big data, but existing solutions face challenges in achieving accurate context-aware assessment. This paper presents a novel knowledge-based approach to enhance automated data quality assessment. Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph representing contextual data characteristics and the required quality assessment operations. We surpass conventional practices by integrating diverse representations within the knowledge graph, drawing insights from contextual information from a thorough literature investigation. This integration allows us to develop a comprehensive and context-specific data quality assessment plan tailored to each context. Leveraging the knowledge graph improves our understanding of the input dataset's context, overcoming the limitations of traditional methods that rely solely on strict matching and overlook contextual characteristics. By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement, providing a comprehensive data quality assessment plan for the input dataset. To evaluate our approach, we leverage AmpliGraph, a framework developed and benchmarked by AccentureLabs. The evaluation involves employing a real-world radiation sensors dataset provided by the Lebanese Atomic Energy Commission (LAEC-CNRS). The results obtained from this evaluation demonstrate the capability of our solution to generate a comprehensive data quality assessment plan for the given input dataset.
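
Predicting a missing edge with knowledge graph embeddings amounts to scoring candidate triples and ranking them. The sketch below uses the TransE scoring function as an example; the abstract does not say which embedding model the authors trained, and the entity names, relation name, and two-dimensional vectors are hypothetical values chosen for illustration.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_candidates(head, relation, candidates):
    """Rank candidate tail entities from most to least plausible."""
    scored = [(name, transe_score(head, relation, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: a dataset context, a relation, and candidate quality rules.
sensor_context = [0.2, 0.1]
requires_rule = [0.5, 0.3]
quality_rules = {
    "range_check": [0.7, 0.4],
    "format_check": [-0.6, 0.9],
    "null_check": [0.6, 0.1],
}

ranked = rank_candidates(sensor_context, requires_rule, quality_rules)
for name, score in ranked:
    print(name, round(score, 3))
```

The scores from such a ranking could then be turned into per-rule weights, in the spirit of the numerical edge attributes described in the abstract.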

2605.18743 2026-05-21 cs.AI

WorldString: Actionable World Representation

WorldString: 可操作的世界表征

Kunqi Xu, Jitao Li, Jianglong Ye, Tianshu Tang, Isabella Liu, Sifei Liu, Xueyan Zou

AI总结 本文提出WorldString,一种能够通过点云或RGB-D视频流直接学习现实物体状态流形的神经架构,为构建物理世界模型提供基础构建块。

详情
AI中文摘要

受大语言模型中涌现行为的启发,研究社区正在世界模型中探索类似的涌现能力,尤其关注物理世界的建模。在物理世界建模中,物体是构成物理现实的基本元素。从人类到计算机,我们与之交互的几乎一切事物都是物体。这些物体很少是静态的;它们是可操作的实体,其变化的状态由内在属性决定。尽管当前方法通过视频生成或动态场景重建来处理物体的动作状态,但没有一种方法以统一、原则性的方式显式建模这一基本元素,以构建可操作的物体表征。我们提出了WorldString,一种神经架构,能够直接从点云或RGB-D视频流中学习,从而建模现实物体的状态流形。作为通用的数字孪生,它充当物理世界模型的基础构建块;因此,我们将其命名为WorldString。值得一提的是,其完全可微的结构使其未来能够与策略学习和神经动力学无缝整合。

英文摘要

Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with an emphasis on modeling the physical world. Within the scope of physical world modeling, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Notably, its fully differentiable structure seamlessly enables future integration with policy learning and neural dynamics.

2605.18736 2026-05-21 cs.CV

Spectral Progressive Diffusion for Efficient Image and Video Generation

频域渐进扩散用于高效图像和视频生成

Howard Xiao, Brian Chao, Lior Yariv, Gordon Wetzstein

AI总结 本文提出了一种频域渐进扩散框架,通过在预训练扩散模型的去噪轨迹中逐步提高分辨率,实现高效的图像和视频生成,同时改进了效率和质量。

Comments Project website at https://howardxiao.ca/speed

详情
AI中文摘要

扩散模型已被证明可以在频域中隐式地自回归地生成视觉内容,其中低频分量在去噪过程中早期生成,而高频细节仅在后期时间步出现。这种结构为高效的生成提供了自然机会,因为对噪声主导的高频分量进行高分辨率计算几乎冗余。我们提出了频域渐进扩散,这是一种通用框架,它在预训练扩散模型的去噪轨迹中逐步增加分辨率。为此,我们开发了一种频域噪声扩展机制,并从模型的功率谱中推导出最优的分辨率计划。我们的框架支持无训练加速,并且提供了一种新的微调配方,进一步提高了效率和质量。我们在最先进的预训练图像和视频生成模型上实现了显著的加速,同时保持了视觉质量。

英文摘要

Diffusion models have been shown to implicitly generate visual content autoregressively in the frequency domain, where low-frequency components are generated earlier in the denoising process while high-frequency details emerge only in later timesteps. This structure offers a natural opportunity for efficient generation, as high-resolution computation on noise-dominated frequencies is largely redundant. We propose Spectral Progressive Diffusion, a general framework that progressively grows resolution along the denoising trajectory of pretrained diffusion models. To this end, we develop a spectral noise expansion mechanism and derive an optimal resolution schedule from the model's power spectrum. Our framework supports training-free acceleration and a novel fine-tuning recipe that further improves efficiency and quality. We demonstrate significant speedups on state-of-the-art pretrained image and video generation models while preserving visual quality.
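
To make "noise-dominated frequencies" concrete, the sketch below estimates where white Gaussian noise overtakes the signal, assuming a power-law image spectrum P(f) = A / f^alpha. The power-law model, the parameter values, and the function names are assumptions for illustration only; the paper derives its schedule from the model's own power spectrum, which is not reproduced here.

```python
def noise_cutoff_frequency(amplitude, alpha, noise_variance):
    """Frequency above which flat noise power exceeds signal power A / f**alpha."""
    return (amplitude / noise_variance) ** (1.0 / alpha)


def resolution_fraction(cutoff, nyquist):
    """Fraction of full resolution needed to represent frequencies up to the cutoff."""
    return min(1.0, cutoff / nyquist)


# Hypothetical values: A = 100, alpha = 2, Nyquist frequency = 64 cycles per image.
early = resolution_fraction(noise_cutoff_frequency(100.0, 2.0, 1.0), 64.0)   # high noise
late = resolution_fraction(noise_cutoff_frequency(100.0, 2.0, 0.01), 64.0)   # low noise
```

Under these assumed numbers, the high-noise step needs only a small fraction of the full resolution, while the low-noise step needs all of it, which is the intuition behind growing resolution along the denoising trajectory.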

2605.18678 2026-05-21 cs.CV cs.AI

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Lance:通过多任务协同实现统一多模态建模

Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo, Hao Li, Yinghang Song, Fei Ding, Jianzhu Guo, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang

AI总结 本文提出Lance,一种轻量级的原生统一模型,支持图像和视频的多模态理解、生成和编辑。该模型通过协同多任务训练的实用范式实现统一多模态建模,基于统一上下文建模和解耦能力路径两个核心原则,通过双流混合专家架构实现联合上下文学习并解耦理解和生成路径。

Comments 34 pages, 14 figures, 10 tables, homepage url: https://lance-project.github.io , code url: https://github.com/bytedance/Lance

详情
AI中文摘要

我们提出了Lance,一种轻量级的原生统一模型,支持图像和视频的多模态理解、生成和编辑。与依赖模型容量扩展或文本-图像主导设计不同,Lance通过协同多任务训练探索统一多模态建模的实用范式。其基于两个核心原则:统一上下文建模和解耦能力路径。具体而言,Lance从头开始训练,并在共享交错的多模态序列上采用双流混合专家架构,实现联合上下文学习的同时解耦理解和生成路径。我们进一步引入模态感知的旋转位置编码以减轻异构视觉标记之间的干扰并提升跨任务对齐。在训练过程中,Lance采用分阶段的多任务训练范式,结合能力导向的目标和自适应数据调度,以加强语义理解和视觉生成性能。实验结果表明,Lance在图像和视频生成方面显著优于现有开源统一模型,同时保留了强大的多模态理解能力。该模型的主页可在https://lance-project.github.io上访问。

英文摘要

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.

2605.18579 2026-05-21 cs.LG

S2Aligner: Pair-Efficient and Transferable Pre-Training for Sparse Text-Attributed Graphs

S2Aligner: 用于稀疏文本属性图的高效且可迁移的预训练方法

Yuhan Wang, Haopeng Zhang, Yibo Ding, Jiaqi Yu, Xinyu Zhao, Yuhang Liu, Ziwei Zhang, Xiao Wang, Ruijie Wang

AI总结 本文提出S2Aligner,一种针对稀疏文本属性图的高效且可迁移的预训练方法,通过解耦语义对齐与结构建模,增强对齐过程而不污染共享的语义空间,从而减少跨域泛化差距。

Comments 19 pages

详情
AI中文摘要

在文本属性图(TAGs)上进行预训练是构建可迁移图基础模型的核心,其中LLM-as-Aligner方法通过大语言模型的语义知识对图和文本表示进行对齐。然而,这些方法通常假设节点文本提供充分且可靠的监督,但在实际稀疏TAGs中这一假设往往不成立。当文本锚点缺失、含噪或跨域不均时,图结构必须通过弱语义证据进行对齐,导致不可靠的结构-语义对应关系和稀疏性引起的迁移偏差。本文提出S2Aligner,一种针对稀疏TAGs的稀疏感知且结构增强的LLM-as-Aligner框架,用于图-文本预训练。关键思想是解耦语义对齐与结构建模,使拓扑感知信号能够增强对齐而不污染共享的语义空间。具体而言,S2Aligner将图-文本表示分解为语义和结构成分,利用结构导向的重建与一致性控制来将可靠的拓扑线索注入文本表示,并在文本稀疏性下抑制不一致的结构信号。此外,S2Aligner引入稀疏感知的跨域风险平衡,通过全局-域密度比校准域风险,并通过图可靠性估计降低不可靠的稀疏样本权重。理论分析表明,该目标通过控制域风险差异来减少跨域泛化差距。在多样化的图域、稀疏程度和下游任务上进行的广泛实验表明,S2Aligner始终优于现有基线。

英文摘要

Pre-training on text-attributed graphs (TAGs) is central to building transferable graph foundation models, where LLM-as-Aligner methods align graph and text representations through the semantic knowledge of large language models. However, these methods usually assume that node texts provide sufficient and reliable supervision, an assumption often violated in real-world sparse TAGs. When textual anchors are missing, noisy, or uneven across domains, graph structures must be aligned with weak semantic evidence, leading to unreliable structure-semantics correspondence and sparsity-induced transfer bias. This paper presents S2Aligner, a sparsity-aware and structure-enhanced LLM-as-Aligner framework for graph-text pre-training on sparse TAGs. The key idea is to decouple semantic alignment from structural modeling, allowing topology-aware signals to enhance alignment without contaminating the shared semantic space. Specifically, S2Aligner decomposes graph-text representations into semantic and structural components, uses structure-oriented reconstruction with consistency control to inject reliable topology cues into text representations, and suppresses inconsistent structural signals under textual sparsity. Moreover, S2Aligner introduces sparsity-aware cross-domain risk balancing, which calibrates domain risks through a global-domain density ratio and downweights unreliable sparse samples via graph reliability estimation. Theoretical analysis shows that this objective reduces cross-domain generalization gaps by controlling domain risk discrepancy. Extensive experiments across diverse graph domains, sparsity levels, and downstream tasks demonstrate that S2Aligner consistently outperforms existing baselines.

2605.18447 2026-05-21 cs.CV

NeRF-based Spacecraft Reconstruction from Monocular Imagery Under Illumination Variability and Pose Uncertainty

基于NeRF的在轨航天器单目影像重建:在光照变化和姿态不确定性下的应用

Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer

AI总结 本文提出一种基于NeRF的方法,通过引入图像特定的外观嵌入和姿态修正项,提升在光照变化和姿态不确定性下的航天器重建鲁棒性,验证了其在离线重建中的有效性,并展示了其在在线重建中的潜力。

Comments (under review)

详情
AI中文摘要

自主接近和临近操作围绕非合作、未知航天器是主动碎片清除和在轨服务任务的关键。此类操作的关键组成部分是从一组2D图像中离线重建目标的3D模型。这项任务具有挑战性,因为有两个主要因素:首先,在轨光照条件表现出显著的变异性,并且随时间迅速变化。其次,图像中的姿态信息不准确,导致3D重建的不确定性。为克服这些挑战,我们提出扩展Neural Radiance Fields,引入每图像的自由度:一个可学习的外观嵌入,捕捉每张图像特定的光照条件,以及一个图像特定的姿态修正项,以细化其噪声姿态标签,提高图像间的3D一致性。这些参数增加了极小的复杂性,因为它们与NeRF联合学习,但显著提高了对光照变化和姿态不准确性的鲁棒性。我们在三个代表在轨操作的图像集中验证了我们的方法,证明了其在离线重建中的有效性,并突显了其在在线重建中的适用性,这在该领域是一个开放性问题。

英文摘要

Autonomous rendezvous and proximity operations around uncooperative, unknown spacecraft are critical for active debris removal and on-orbit servicing missions. A key component of such operations is the offline reconstruction of a 3D model of the target from a set of 2D images. This task is challenging due to two main factors. First, in-orbit illumination conditions exhibit considerable variability and change rapidly over time. Second, the pose information associated with the images is inaccurate, which results in 3D reconstruction uncertainty. To overcome these challenges, we propose to extend Neural Radiance Fields with per-image degrees of freedom: a learnable appearance embedding that captures the illumination conditions specific to each image, and an image-specific pose correction term that refines its noisy pose label to increase 3D consistency across images. These parameters add minimal complexity, as they are learned jointly with the NeRF, yet they substantially improve robustness to illumination variability and pose inaccuracies. We validate our approach on three image sets representative of in-orbit operations, demonstrating its effectiveness for offline reconstruction and highlighting its suitability for online reconstruction, an open problem in the field.

2605.17946 2026-05-21 cs.AI cs.CV cs.LG

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

SVFSearch: 一种面向游戏垂直领域的多模态知识密集型短视频帧搜索基准

Lingtao Mao, Huangyu Dai, Xinyu Sun, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou

AI总结 本文提出SVFSearch,首个针对中文游戏领域短视频帧搜索的多模态知识密集型基准,通过5000个四选一测试示例和4198个辅助训练示例,评估了从直接问答到计划-行动-重新计划代理等多种方法在短视频帧搜索中的性能。

详情
AI中文摘要

多模态大语言模型越来越多地被用作代理的骨干,以理解多模态输入、计划检索操作、调用外部工具并推理由检索信息得出的结论。然而,现有的基准很少评估在短视频应用中的这种能力,其中暂停的帧通常在视觉上具有歧义性,回答需要垂直的、长尾的和快速发展的领域知识。我们引入了SVFSearch,这是首个针对中文游戏领域短视频帧搜索的开放基准。SVFSearch包含5,000个四选一测试示例和4,198个辅助训练示例,每个示例都围绕一个暂停的游戏场景展开,来自真实的短视频片段。为了支持公平且可重复的评估,SVFSearch提供了一个冻结的离线检索环境,包括一个游戏领域文本语料库、一个主题链接的图像画廊以及文本、图像和多模态检索接口,避免了对不受控的网络搜索API的依赖。我们评估了从直接问答和RAG工作流程到计划-行动-重新计划代理和学习搜索模型在内的代表性范式。结果揭示了模型单独回答、实际代理搜索和 oracle 知识之间的巨大差距:最好的开源直接问答模型达到66.4%,最好的实际代理达到79.1%,而 oracle 知识达到95.4%。进一步分析揭示了视觉定位、检索质量、证据基础推理和工具使用行为中的瓶颈,包括过度检索、只回答捷径和检索诱导的误导。

英文摘要

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.
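
The reported gaps can be read off with a few lines of arithmetic. The sketch below uses only the three accuracies quoted in the abstract; the chance level assumes uniform guessing over four choices, and the variable names are illustrative.

```python
# Accuracies reported in the abstract (percent).
direct_qa, agent, oracle = 66.4, 79.1, 95.4
chance = 100.0 / 4  # uniform guessing on a four-choice question

gain_from_search = round(agent - direct_qa, 1)   # agentic search over direct QA
headroom_to_oracle = round(oracle - agent, 1)    # remaining gap to oracle knowledge
```

On these figures, the remaining gap to oracle knowledge is larger than the gain that agentic search already provides over direct QA.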

2605.17776 2026-05-21 cs.RO

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

CosFly-Track: 一个大规模多模态数据集,用于通过多约束轨迹优化的无人机视觉跟踪

Xiangyue Wang, Hanxuan Chen, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Kangli Wang, Ji Pei

AI总结 本文提出CosFly-Track数据集,用于无人机视觉跟踪任务,通过多约束轨迹优化生成大规模多模态数据,提升了动态目标跟踪性能。

详情
AI中文摘要

近年来,空中视觉-语言导航(VLN)数据集发展迅速,但主要解决的是面向静态目的地的目标导向导航问题,而无人机视觉跟踪——在保持可见性的同时持续跟随移动目标——则缺乏专门的训练数据。我们介绍了CosFlyTrack,这是一个用于城市环境中无人机视觉跟踪的大规模多模态数据集和可扩展生成管道。该数据集提供了约12,000条专家和扰动的无人机轨迹,这些轨迹源自6,000条行人路径,包含240万时间步(约334小时),包含七个对齐的数据通道:RGB、度量深度、语义分割、六自由度无人机姿态、带有可见性标志的目标状态、双语(中文-英文)指令以及轨迹对元数据。为了生成高质量的专家轨迹,我们开发了MuCO,一个多约束优化器,能够在连续的三维空间中直接规划,使用BVH加速的碰撞和可见性查询,共同执行目标可见性、视角质量、碰撞避免、平滑度、运动学可行性等约束,避免了基于网格的规划器的离散化伪影和事后平滑。在七个视觉-语言模型上的微调实验表明,CosFlyTrack将跟踪性能提升到78.3至95.6个百分点的SR@1米,比零样本基线提高了53至69个百分点,支持该数据集作为动态目标跟踪代理的训练资源。该数据集在https://huggingface.co/datasets/AutelRobotics/CosFly上公开可用;评估脚本和预训练检查点托管在https://huggingface.co/AutelRobotics/CosFly-Track上。

英文摘要

Recent aerial vision-language navigation (VLN) datasets have grown rapidly, but they primarily address goal-oriented navigation to static destinations, leaving UAV visual tracking -- continuously following a moving target while maintaining visibility -- largely without dedicated training data. We introduce CosFlyTrack, a large-scale multi-modal dataset and scalable generation pipeline for UAV visual tracking in urban environments. The dataset provides approximately 12,000 expert and perturbed UAV trajectories generated from 6,000 pedestrian paths, comprising 2.4 million timesteps (approximately 334 hours) with seven aligned data channels: RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with visibility flag, bilingual (Chinese-English) instructions, and trajectory-pair metadata. To generate high-quality expert trajectories, we develop MuCO, a multi-constraint optimizer that plans directly in continuous three-dimensional space with BVH-accelerated collision and visibility queries, jointly enforcing target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility, avoiding the discretization artifacts and post-hoc smoothing of grid-based planners. Fine-tuning experiments on seven vision-language models show that CosFlyTrack improves tracking performance to 78.3 to 95.6 percent SR@1 meter, a 53 to 69 percentage point gain over zero-shot baselines, supporting the dataset as a training resource for dynamic target-following agents. The dataset is publicly available at https://huggingface.co/datasets/AutelRobotics/CosFly; evaluation scripts and pre-trained checkpoints are hosted at https://huggingface.co/AutelRobotics/CosFly-Track.
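
As a quick sanity check on the dataset scale, the quoted totals imply some per-trajectory averages. This is a back-of-the-envelope sketch: it assumes uniformly spaced timesteps and that the roughly 334 hours cover all 2.4 million timesteps, neither of which the abstract states explicitly.

```python
# Totals quoted in the abstract (all approximate).
trajectories = 12_000
pedestrian_paths = 6_000
timesteps = 2_400_000
hours = 334

trajectories_per_path = trajectories / pedestrian_paths    # expert + perturbed pair
mean_steps_per_trajectory = timesteps / trajectories
implied_rate_hz = timesteps / (hours * 3600)                # timesteps per second
mean_duration_s = mean_steps_per_trajectory / implied_rate_hz
```

Under those assumptions the figures work out to two trajectories per pedestrian path, about 200 timesteps per trajectory, a sampling rate close to 2 Hz, and a mean trajectory length of roughly 100 seconds.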

2605.17694 2026-05-21 cs.CL

Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?

大语言模型在权力不对称对话中是否反映社会认知效应?

Anvesh Rao Vijjini, Sagar Manjunath, Snigdha Chaturvedi

AI总结 研究探讨了大语言模型在被赋予高或低地位角色时是否表现出与人类相似的社会认知效应,通过模拟多轮权力不对称对话,分析语言协调、代词使用、说服成功率和对危险请求的遵从性,发现模型在权力效应上表现出关键特征,但存在差异和变异性。

Comments ACL 2026 (main)

详情
AI中文摘要

权力差异通过已有充分记录的社会认知效应塑造人类交流,包括语言协调、代词使用、权威偏见和有害遵从。我们检验了大型语言模型(LLMs)在被赋予高或低地位角色时是否表现出类似行为。使用来自不同职业的角色设定,我们模拟多轮、权力不对称的对话(例如,校长与教师、法官与律师),并测量(i)语言协调、(ii)代词使用、(iii)说服成功率以及(iv)对不安全请求的遵从性。我们的结果表明,LLMs表现出权力的关键社会认知效应,尽管存在细微差别和变异性,这些模拟交互既与有益行为相关,也与不安全行为相关。

英文摘要

Power differences shape human communication through well-documented socio-cognitive effects, including language coordination, pronoun usage, authority bias, and harmful compliance. We examine whether large language models (LLMs) exhibit similar behaviors when assigned high- or low-status personas. Using personas from diverse professions, we simulate multi-turn, power-asymmetric dialogues (e.g., principal-teacher, justice-lawyer) and measure (i) language coordination, (ii) pronoun usage, (iii) persuasion success, and (iv) compliance with unsafe requests. Our results show that LLMs exhibit key socio-cognitive effects of power, albeit with nuances and variability, linking simulated interactions to both desirable and unsafe behaviors.

2605.17472 2026-05-21 cs.CV

Weighted Reverse Convolution for Feature Upsampling

加权反卷积用于特征上采样

Wentong Li, Zhiyuan Qi, Zichen Zhao, Kai Zhang, Lei Zhang

AI总结 本文提出加权反卷积(WRC),从逆问题的角度重新审视视觉基础模型中的特征上采样,通过空间自适应的逆操作提升高层视觉描述符的密度,从而在需要细粒度定位、密集预测和点对应的任务中提升性能。

Comments 18 pages, 7 figures, code: https://github.com/PolyU-VCLab/WRC

详情
AI中文摘要

预训练的视觉基础模型(VFMs)提供强大的语义表示,但其补丁级特征本质上是粗略的,限制了在需要细粒度定位、密集预测和点对应的任务中的有效性。在本文中,我们从逆问题的角度重新审视VFMs中的特征上采样,并提出加权反卷积(WRC),一种空间自适应的逆操作,用于密集化高层视觉描述符。具体来说,我们将特征上采样公式化为加权Tikhonov正则化最小二乘问题,其中空间变化的权重在每个空间位置调节数据保真度和先验强度。这使得WRC能够适应空间变化的特征特性,从而在保留关键结构的同时减轻过平滑问题。此外,WRC保留了一个高效、完全可微的闭合形式FFT解,使其成为一种实用的上采样操作符。在轻量级自监督密集化框架中集成后,WRC在各种下游基准测试中一致提高了密集特征质量,包括分割、深度估计、视频对象分割、对象发现和关键点对应,同时保持高计算效率。

英文摘要

Pre-trained vision foundation models (VFMs) provide strong semantic representations, yet their patch-level features are inherently coarse, limiting their effectiveness on tasks requiring fine-grained localization, dense prediction, and point-wise correspondence. In this work, we revisit feature upsampling for VFMs from the perspective of an inverse problem and propose Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for densifying high-level visual descriptors. Specifically, we formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator. Integrated into a lightweight self-supervised densification framework, WRC consistently improves dense feature quality across various downstream benchmarks, including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while maintaining high computational efficiency.
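
For background on why a Tikhonov-regularized least-squares problem can admit a closed-form FFT solution, the sketch below solves the textbook unweighted case for 1-D circular convolution. It is not the WRC operator: it has no spatially varying weights, no downsampling, and a zero prior, and the kernel and regularization strength are hypothetical values.

```python
import numpy as np


def tikhonov_inverse_fft(y, kernel, lam):
    """Solve min_x ||k (*) x - y||^2 + lam * ||x||^2 for circular convolution (*)."""
    K = np.fft.fft(kernel)
    Y = np.fft.fft(y)
    X = np.conj(K) * Y / (np.abs(K) ** 2 + lam)  # per-frequency closed form
    return np.real(np.fft.ifft(X))


rng = np.random.default_rng(0)
x_true = rng.standard_normal(64)
kernel = np.zeros(64)
kernel[[0, 1, -1]] = [0.6, 0.2, 0.2]             # small circular blur, no spectral zeros
y = np.real(np.fft.ifft(np.fft.fft(kernel) * np.fft.fft(x_true)))
x_hat = tikhonov_inverse_fft(y, kernel, lam=1e-10)
```

In this uniform case the normal equations diagonalize in the Fourier domain, so each frequency is solved independently. How WRC keeps a closed-form FFT solution with spatially varying weights is derived in the paper and is not shown here.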

2605.16962 2026-05-21 cs.CV cs.AI

OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

OmniVL-Guard Pro: 一个增强工具的代理用于综合视觉-语言防伪

Jinjie Shen, Zheng Huang, Yuchen Zhang, Yujiao Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

AI总结 该研究提出OmniVL-Guard Pro,一种增强工具的代理,用于综合视觉-语言防伪,通过整合多种工具环境和引入新的强化学习方法,实现了开放世界中的线索驱动推理,并在多个任务上达到了最先进的性能。

Comments 29 pages

详情
AI中文摘要

现有的视觉-语言伪造检测和定位方法基于封闭世界范式,假设模型可以单独完成验证。然而,自包含的MLLMs受限于有限的参数知识、静态训练语料和有限的感知分辨率,在动态开放世界防伪中存在实际限制,特别是在需要外部线索的实时事件验证和需要对局部篡改进行细致审查的伪造分割中。为了解决这些限制,我们从扩大自包含模型转向超越它。我们提出了OmniVL-Guard Pro,一种增强工具的代理,将统一的防伪从封闭世界预测扩展到开放世界的线索驱动推理。OmniVL-Guard Pro整合了一个涵盖实时事件搜索、局部裁剪和缩放、边缘异常筛查、人脸检测、视频帧提取以及SAM3基于分割的工具环境。为了生成高质量的工具推理轨迹,我们引入了树状结构的自进化工具轨迹生成,通过种子引导、无引导的自我进化和弱提示的硬样本合成生成多样化的轨迹,产生Full-Spectrum Tool Reasoning (FSTR)数据集用于训练。我们进一步提出了Checker-Guided Agentic Reinforcement Learning (CGARL),它为过程级监督提供,以惩罚那些答案正确但推理扭曲的情况。广泛的实验表明,OmniVL-Guard Pro在各种任务上实现了最先进的性能,并表现出强大的零样本泛化能力。FSTR数据集和OmniVL-Guard Pro的代码将在https://github.com/shen8424/OmniVL-Guard-Pro公开发布。

英文摘要

Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics -- particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose OmniVL-Guard Pro, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clues-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce Tree-Structured Self-Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose Checker-Guided Agentic Reinforcement Learning (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks, and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at https://github.com/shen8424/OmniVL-Guard-Pro.

2605.16812 2026-05-21 cs.LG cs.CR

Jacobian-Guided Anisotropic Noise Reshaping for Enhancing Representation Utility under Local Differential Privacy

Youngmok Ha, Viktor Schlegel, Yidan Sun, Anil Anthony Bharath

AI总结 本文提出了一种基于雅可比矩阵的各向异性噪声重塑方法,以在局部差分隐私下提升表示的效用。该方法通过识别任务关键子空间,选择性地衰减噪声,并将标准LDP的各向同性噪声重塑为各向异性分布,从而在保持每个维度隐私预算的同时,异质地调节噪声影响,显著提升数据效用。

详情
AI中文摘要

尽管局部差分隐私(LDP)是分布式数据收集的基础原始构件,其严格的噪声注入要求常常导致数据效用的严重下降。这种下降源于传统LDP机制的任务无关性质,即在所有维度上均匀注入噪声,而不考虑其对下游目标的相对重要性。为了解决这个问题,我们提出了一种新的方法,通过数据表示的任务相关子空间来减轻噪声。我们的方法通过公共下游模型的雅可比矩阵识别任务关键子空间,选择性地衰减这些维度的噪声,并将标准LDP的各向同性噪声重塑为各向异性分布。该方法在保持每个维度隐私预算的同时,异质地调节噪声影响,从而显著提升数据效用。此外,我们的方法适用于线性和非线性模型,并能无缝集成到现有机制中。在CIFAR-10-C(亮度腐败最高严重级别5)上的大量实验表明,整合我们的方法可使PrivUnit2和PrivUnitG的效用在ε=7.5时提高约20%。源代码可在https://github.com/ymha/jacobian-anr-ldp获取。

英文摘要

While Local Differential Privacy (LDP) serves as a foundational primitive for distributed data collection, its stringent noise injection requirement often leads to severe degradation in data utility. This degradation stems from the task-agnostic nature of conventional LDP mechanisms, which inject noise uniformly across all dimensions regardless of their relative importance to the downstream objective. To address this issue, we propose a novel approach that mitigates noise in task-relevant subspaces of the data representation. Our method identifies task-critical subspaces via the Jacobian matrix of the public downstream model, selectively attenuates noise along those dimensions, and reshapes the isotropic noise of standard LDP into an anisotropic distribution. This method preserves the uniform per-dimension privacy budget while heterogeneously modulating noise impact across dimensions, thereby substantially enhancing data utility. Furthermore, our approach generalizes to both linear and non-linear models and integrates seamlessly with existing mechanisms. Extensive experiments on CIFAR-10-C (Brightness corruption at the highest severity level 5) demonstrate that integrating our approach improves the utility of PrivUnit2 and PrivUnitG by approximately 20% at ε = 7.5. The source code is available at https://github.com/ymha/jacobian-anr-ldp.

2605.16793 2026-05-21 cs.LG

PULSE: Generative Phase Evolution for Non-Stationary Time Series Forecasting

PULSE: 非平稳时间序列预测的生成性相演变

Yangyou Liu, Zezhi Shao, Xinyu Chen, Hu Chen, Fei Wang, Yuankai Wu

AI总结 针对非平稳时间序列预测中稳定表示与分布偏移之间的矛盾,本文提出PULSE框架,通过物理假设引导相演变,采用解耦-演化-模拟设计哲学,通过相锚解耦、相路由器和统计感知混合等方法提升模型鲁棒性,实验证明物理引导的归纳偏置比原始架构复杂度更重要。

详情
AI中文摘要

在非平稳条件下进行时间序列预测面临捕捉稳定表示与适应分布偏移之间的根本矛盾。现有方法隐式依赖静态历史假设,导致我们称之为相位遗忘(Phase Amnesia)的关键失败模式,即模型对不断演变的全局上下文视而不见。为了解决这一问题,我们通过三个物理假设形式化非平稳动态:Wold分解、动态相位演变和异方差流形生成。这些原理启发了PULSE,一个受物理启发、即插即用的框架,采用解耦-演化-模拟设计哲学。具体而言,PULSE利用相位锚定解耦来解决由主导趋势引起的优化干扰,采用相位路由器(Phase Router)主动生成未来轨迹,并引入统计感知混合(SAM)以确保对分布外波动的鲁棒性。实验证明,PULSE使简单的MLP主干在12个现实世界基准上达到最先进的或高度竞争的性能。这验证了正确的物理引导的归纳偏置比单纯的架构复杂度对非平稳预测更为关键。代码可在:https://github.com/Gemost/PULSE获取。

英文摘要

Time series forecasting under non-stationarity faces a fundamental tension between capturing stable representations and adapting to distribution shifts. Existing methods implicitly rely on static historical assumptions, leading to a critical failure mode we term Phase Amnesia, where models become blind to the evolving global context. To resolve this, we formalize non-stationary dynamics through three physical hypotheses: Wold decomposition, dynamical phase evolution, and heteroscedastic manifold generation. These principles inspire PULSE, a physics-informed, plug-and-play framework adopting a Disentangle--Evolve--Simulate design philosophy. Specifically, PULSE utilizes phase-anchored disentanglement to resolve optimization interference caused by dominant trends, employs a Phase Router to actively generate future trajectories, and introduces Statistic-Aware Mixup (SAM) to ensure robustness against out-of-distribution volatility. Empirically, PULSE enables a simple MLP backbone to achieve state-of-the-art or highly competitive performance across 12 real-world benchmarks. This validates that a correct physics-informed inductive bias is far more critical than raw architectural complexity for non-stationary forecasting. The code is available at: https://github.com/Gemost/PULSE.

2605.16530 2026-05-21 cs.CV

SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

SWoMo:用于白内障手术模拟的神经符号世界模型

Ssharvien Kumar Sivakumar, Akwele Johnson, Anirudh Dhingra, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay

AI总结 本文提出SWoMo,一种用于白内障手术模拟的神经符号世界模型,通过分离运动生成与视觉真实性,结合规则基模拟器和场景图表示来建模运动动态和工具-组织交互,同时使用扩散模型生成逼真的视觉效果,从而提升手术模拟的真实性和临床适用性。

详情
AI中文摘要

现实手术模拟在培训初学者外科医生和开发自主代理方面起着至关重要的作用。世界模型可以通过根据当前观察和手术动作预测未来患者状态,将此类模拟环境扩展到真实且多样的程序中。然而,当前最先进的方法往往无法满足临床应用所需的关键标准,包括视觉真实性、物理基础的交互以及模拟超出训练分布的场景的能力。因此,我们引入SWoMo,一种用于白内障手术模拟的神经符号世界模型,该模型将运动生成与视觉真实性解耦。符号组件包括基于规则的模拟器和场景图表示,用于建模运动动态和工具-组织交互,而扩散模型则生成逼真的视觉外观,包括纹理和组织变形。我们提出了一种逆配对策略,通过在模拟器中重建真实的手术视频以获得配对的模拟和真实视频,然后用于训练我们的视频扩散模型,以实现反向的仿真到现实的翻译目标。我们的实验表明,与先前工作相比,既有定性也有定量的改进。我们证明,我们的模拟器进一步满足了关键标准,包括对未见交互几何的泛化、下游阶段检测的改进以及无监督的视频风格迁移。代码、数据和模型权重可在:https://ssharvienkumar.github.io/SWoMo/上获取。

英文摘要

Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: https://ssharvienkumar.github.io/SWoMo/

2605.16217 2026-05-21 cs.CL cs.AI cs.IR

Argus: Evidence Assembly for Scalable Deep Research Agents

Argus:可扩展深度研究代理的证据组装

Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, Xinyu Wang

AI总结 Argus通过将深度研究视为拼图碎片的组装过程,而非并行暴力求解整个答案,提高了大规模信息检索任务的效率和效果。

详情
AI中文摘要

深度研究代理在复杂信息检索任务上取得了显著进展。即使是很长的ReAct风格推演也只探索单一轨迹,而最新的最先进系统则通过并行搜索和聚合来扩展推理时计算。然而,深度研究的答案由互补的证据片段组成,并行推演往往重复而非补全这些片段,导致收益递减,并使聚合上下文逼近模型的上限。我们提出Argus,一种代理系统,其中搜索者(Searcher)和导航者(Navigator)协作,将深度研究视为用互补的证据片段拼合拼图,而非并行地暴力求解整个答案。搜索者通过ReAct风格的交互为给定子查询收集证据轨迹。导航者维护共享的证据图,验证哪些片段仍然缺失,派遣搜索者去收集它们,并在补全后的图上进行推理,生成可追溯来源的最终答案。我们用强化学习训练导航者进行验证、派遣和综合,同时独立训练搜索者,使其保持为标准的ReAct代理。由此得到的导航者无需重新训练即可支持单个搜索者或多个并行搜索者的推演。在搜索者和导航者均基于35B-A3B MoE骨干的情况下,Argus在八个基准上平均而言,使用单个搜索者时提升5.5分,使用8个并行搜索者时提升12.7分。使用64个搜索者时,其在BrowseComp上达到86.2分,超越了我们评测的所有专有代理,同时导航者的推理上下文保持在21.5K tokens以下。

英文摘要

Deep research agents have achieved remarkable progress on complex information-seeking tasks. Even long ReAct-style rollouts explore only a single trajectory, while recent state-of-the-art systems scale inference-time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute-forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.
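
The Navigator/Searcher division of labor can be sketched as a loop over missing evidence pieces. Everything here is hypothetical: the `EvidenceGraph` class, the `fake_searcher` stub, and the sub-queries are placeholders, and the trained verification, dispatch, and synthesis policies described in the abstract are replaced by trivial rules.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceGraph:
    """Shared store mapping each required sub-query to its evidence (None = missing)."""
    pieces: dict = field(default_factory=dict)

    def missing(self):
        return [q for q, ev in self.pieces.items() if ev is None]


def fake_searcher(sub_query):
    """Stand-in for a ReAct-style Searcher; returns (evidence, source)."""
    return (f"finding for {sub_query}", f"source://{sub_query}")


def navigate(sub_queries, searcher, max_rounds=3):
    """Dispatch one searcher call per missing piece until the graph is complete."""
    graph = EvidenceGraph({q: None for q in sub_queries})
    calls = 0
    for _ in range(max_rounds):
        todo = graph.missing()
        if not todo:
            break
        for q in todo:  # only missing pieces are dispatched, so no piece is fetched twice
            graph.pieces[q] = searcher(q)
            calls += 1
    found = [v for v in graph.pieces.values() if v is not None]
    answer = "; ".join(f"{ev} [{src}]" for ev, src in found)
    return answer, calls


answer, calls = navigate(["founding year", "founder"], fake_searcher)
```

The point of the toy loop is the contrast with parallel rollouts: each dispatch targets a piece known to be missing, so extra searcher calls add new evidence instead of repeating it.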

2605.15944 2026-05-21 cs.RO cs.LG

FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy

FocalPolicy: 频率优化的分块和局部锚定的流匹配用于连贯的视觉-运动策略

Qian He, Zhenshuo Yang, Wenqi Liang, Chunhui Hao, Nicu Sebe, Jiandong Tian

AI总结 本文提出FocalPolicy,一种面向视觉-运动策略的策略,通过频率优化的分块和局部锚定的流匹配,解决连续视觉-运动策略中的精度与远见之间的平衡问题。

详情
AI中文摘要

视觉-运动策略旨在从专家示范中学习复杂的操作任务。然而,生成平滑且连贯的轨迹仍然具有挑战性,因为它需要在近端精度与远端远见之间进行平衡。现有方法通常专注于优化块内动作分布,往往忽略了块间连贯性。因此,块间不连续性显著阻碍了连贯长周期动作的学习。为克服这一限制并实现精度与远见之间的协同平衡,我们提出了FocalPolicy,一种具有远见的视觉-运动策略,结合了频率优化的分块与局部锚定的流匹配。我们引入了一个远见复合目标,监督时间域内近端动作的对齐,同时在多个未来动作块上正则化频率域结构以提高跨块连贯性。为了高效学习复杂动作分布,我们设计了局部锚定采样,以提高一致性流匹配训练期间的目标信号传播效率。广泛的实验表明,FocalPolicy优于现有方法,并验证了我们的模块对其他基线的通用性。项目网站:https://focalpolicy.github.io/

英文摘要

Visuomotor policies aim to learn complex manipulation tasks from expert demonstrations. However, generating smooth and coherent trajectories remains challenging, as it requires balancing proximal precision with distal foresight. Existing approaches typically focus on optimizing intra-chunk action distributions, often neglecting the inter-chunk coherence. Consequently, inter-chunk discontinuities significantly impede the learning of coherent long-horizon actions. To overcome this limitation and achieve a synergetic balance between precision and foresight, we propose FocalPolicy, a foresight-aware visuomotor policy that combines Frequency-Optimized Chunking with Locally Anchored flow matching. We introduce a foresight composite objective that supervises time-domain alignment within the proximal actions while regularizing frequency-domain structure over multiple future action chunks to improve cross-chunk coherence. To efficiently learn complex action distributions, we design locally anchored sampling to enhance target signal propagation efficiency during consistency flow matching training. Extensive experiments demonstrate that FocalPolicy outperforms existing approaches and confirm the generalizability of our modules to other baselines. Project website: https://focalpolicy.github.io/
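
One way to picture a composite objective of this kind is a time-domain error on the near-term actions plus a spectral penalty over a longer multi-chunk horizon. The sketch below is an assumption-laden illustration, not the paper's loss: the magnitude-spectrum penalty, the `freq_weight` value, and the function name are all hypothetical, and flow matching is not modeled.

```python
import numpy as np


def foresight_composite_loss(pred, target, proximal_len, freq_weight=0.1):
    """pred, target: arrays of shape (horizon, action_dim) spanning several chunks."""
    # Time-domain term: pointwise error on the proximal (near-term) actions only.
    time_term = np.mean((pred[:proximal_len] - target[:proximal_len]) ** 2)
    # Frequency-domain term: compare magnitude spectra over the full horizon.
    pred_mag = np.abs(np.fft.rfft(pred, axis=0))
    target_mag = np.abs(np.fft.rfft(target, axis=0))
    freq_term = np.mean((pred_mag - target_mag) ** 2)
    return time_term + freq_weight * freq_term


steps = np.linspace(0.0, 1.0, 32)[:, None]
target = np.sin(2.0 * np.pi * steps)          # smooth reference trajectory, 1-D action
jumpy = target.copy()
jumpy[16:] += 0.5                             # discontinuity at a later chunk boundary
loss_same = foresight_composite_loss(target, target, proximal_len=16)
loss_jump = foresight_composite_loss(jumpy, target, proximal_len=16)
```

In this example the jump lies entirely outside the proximal window, so the time-domain term is zero and only the spectral term penalizes the cross-chunk discontinuity.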

2605.15876 2026-05-21 cs.CV

Unlocking Dense Metric Depth Estimation in VLMs

Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke

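AI summary: This paper proposes DepthVLM, a simple yet effective framework that turns a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm, DepthVLM produces full-resolution depth maps alongside language outputs in a single forward pass. The paper also introduces a unified indoor-outdoor metric depth benchmark. Experiments show that DepthVLM outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning.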

Comments Project Page: https://depthvlm.github.io/

English abstract

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified multimodal foundation model. The project page is available at https://depthvlm.github.io/

2605.15691 2026-05-21 cs.LG

SEED: Targeted Data Selection by Weighted Independent Set

Yuan Zhang, Lifeng Guo, Junwen Pan, Wenzhao Zheng, Wen Zhou, Kuan Cheng, Kurt Keutzer, Shanghang Zhang

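AI summary: This paper proposes SEED, which formulates data selection as a Weighted Independent Set (WIS) problem on a similarity graph to balance sample quality against diversity, and introduces node value calibration and local scale normalization to make selection more robust and scalable.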

Comments 20 pages

English abstract

Data selection seeks to identify a compact yet informative subset from large-scale training corpora, balancing sample quality against collection diversity. We formulate this problem as a Weighted Independent Set (WIS) on a similarity graph, where nodes represent data samples weighted by influence, and edges connect semantically redundant pairs. This formulation naturally yields subsets that are simultaneously high-quality and diverse. However, two challenges arise in practice: naive node weights fail to distinguish informative signals from gradient noise, and edge construction under heterogeneous domain distributions produces structurally imbalanced graphs that bias selection toward sparse regions. To address these issues, we introduce two principled refinements from a unified graph perspective: (1) node value calibration that restricts influence estimation to the bilateral salient subspace to ground node importance in task-relevant signals rather than surface-level statistics; (2) local scale normalization that adapts edge thresholds to local neighborhood density, mitigating graph imbalance induced by cross-domain distribution shifts. Together, these components yield a robust and scalable data selection pipeline dubbed SEED. We further construct Honeybee-Remake-SEED-200K, a compact multimodal dataset curated by SEED. Extensive experiments show that SEED consistently outperforms state-of-the-art methods on instruction tuning, visual instruction tuning, and semantic segmentation across diverse model families.
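
The weighted-independent-set formulation can be sketched in a few lines. Maximum-weight independent set is NP-hard in general, so this sketch uses a simple greedy heuristic; the abstract does not say which solver SEED uses. The cosine-similarity edges, the single global `threshold` (SEED instead adapts thresholds to local neighborhood density), and all function names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_similarity_graph(embeddings, threshold):
    """Connect every pair whose similarity reaches the global threshold.

    An edge marks the two samples as redundant with each other.
    """
    n = len(embeddings)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


def greedy_weighted_independent_set(weights, adjacency, budget=None):
    """Greedy heuristic: take the heaviest free node, then block its neighbours."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    selected, blocked = [], set()
    for i in order:
        if i in blocked:
            continue
        selected.append(i)
        blocked.add(i)
        blocked |= adjacency[i]  # neighbours are redundant with the chosen node
        if budget is not None and len(selected) >= budget:
            break
    return selected


# Example: two near-duplicate pairs; one sample survives from each pair.
embeddings = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.99)]
weights = [0.9, 0.8, 0.4, 0.7]  # stand-ins for per-sample influence scores
graph = build_similarity_graph(embeddings, threshold=0.95)
subset = greedy_weighted_independent_set(weights, graph)  # -> [0, 3]
```

No two selected samples share an edge, so the subset is diverse by construction, and the weight ordering favors high-influence samples. A greedy pass carries no optimality guarantee; it only shows how quality and diversity interact in the formulation.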