arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.19961 2026-06-19 cs.CV New submission

Addressing Detail Bottlenecks in Latent Diffusion for RGB-to-SWIR Image Translation

Kaili Wang, Martin Dimitrievski, Jose Maria Salvador, Ben Stoffelen, David Van Hamme, Lore Goetschalckx

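Affiliations: imec; imec-IPI-Ghent University; Yale University

AI summary: To address the loss of spatial detail in latent diffusion models for RGB-to-SWIR image translation, the paper proposes two lightweight fixes, a Source-Conditioned Autoencoder and a Learnable Guidance Encoder. On driving scenes they raise detection mAP by up to 2x (up to 3.4x on small objects) and achieve state-of-the-art FID.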

Abstract

Latent diffusion models (LDMs) enable efficient image-to-image translation but discard fine spatial details during compression, degrading downstream perception tasks. We identify two bottlenecks: the autoencoder, which loses spatial information, and the conditioning pathway, which further degrades the source signal through naive downsampling. We propose two lightweight, backbone-agnostic fixes: a Source-Conditioned Autoencoder (SCAE) that injects high-resolution source features into the decoder via skip connections, and a Learnable Guidance Encoder (LGE) that replaces naive downsampling with a learned conditioning signal. Evaluated on RGB-to-SWIR translation for driving scenes with two denoiser backbones (U-Net and DiT), our approach improves detection mAP by up to 2x over the latent diffusion baseline, with up to 3.4x gains on small objects (COCO-small, <32^2 px^2), while achieving state-of-the-art FID. We further show that FID and detection performance are poorly correlated, motivating multi-axis evaluation. Results generalise zero-shot to the public RASMD benchmark. We will publicly release test data with annotations, all checkpoints, and training code.
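
The "COCO-small, <32^2 px^2" criterion mentioned in the abstract can be illustrated with a minimal sketch. The 32^2 and 96^2 area thresholds are the standard COCO size buckets; the function name and the handling of boxes that sit exactly on a threshold are assumptions made for illustration, not details from the paper.

```python
def coco_size_bucket(width: float, height: float) -> str:
    """Classify a box by pixel area using the COCO small/medium/large thresholds."""
    area = width * height
    if area < 32 ** 2:      # fewer than 1024 px^2: the "COCO-small" bucket
        return "small"
    if area < 96 ** 2:      # between 32^2 and 96^2 px^2 (boundary handling is an assumption)
        return "medium"
    return "large"
```

A 20x20 px pedestrian far down the road lands in the small bucket, which is where the abstract reports the largest gains.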

2606.19958 2026-06-19 cs.CV New submission

SketchKeyAnime: Reference-anchored Sparse Key-Sketch Animation Synthesis

Meixi Li, Xianlin Zhang, Yue Zhang, Xueming Li

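Affiliations: Beijing University of Posts and Telecommunications

AI summary: Proposes SketchKeyAnime, a video diffusion framework that uses a dual-branch conditioning mechanism and Sketch Cross Attention with learnable gating to generate structurally controllable, appearance-consistent, and temporally coherent animation from a single reference RGB image and sparse key sketches. It significantly outperforms baselines on the Sakuga-42M dataset.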

Abstract

Traditional animation production relies heavily on manual drawing and iterative refinement, particularly for key-pose design, in-betweening, and character coloring. While existing animation and video generation methods have made notable progress, they typically depend on RGB boundary frames, dense frame-wise conditions, or complete sketch sequences, limiting their applicability under low-cost input conditions. We present SketchKeyAnime, a video diffusion framework for generating structurally controllable, appearance-consistent, and temporally coherent animations from sparse key-sketch inputs. Given a single reference RGB image and a few temporally indexed key sketches, SketchKeyAnime introduces a dual-branch conditioning mechanism to encode local geometric constraints alongside semantic-temporal context. It leverages Sketch Cross Attention to fuse reference image and sketch conditions with learnable gating, and incorporates an Adaptive Weighted Loss to strengthen supervision on key-sketch frames and line-art regions. Experimental results on the Aesthetic subset of Sakuga-42M show that our approach consistently outperforms representative animation interpolation and sketch-guided generation baselines. Compared to the best-performing baseline, SketchKeyAnime reduces EDMD by 31.9% and FVD by 9.5%, demonstrating superior sketch fidelity and temporal coherence, while achieving the best overall performance across most quantitative metrics. These results validate the proposed framework and highlight its potential for low-cost, highly controllable animation creation.

2606.19956 2026-06-19 cs.LG New submission

Towards Graph-Based Deep Learning for Map Generalization: Insights from Building Footprints Simplification and Aggregation

Yanning Wang, Zhiyong Zhou, Zhouyu Liu, Mengni Yu, Yu Feng

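Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Zhejiang University; Mainz University of Applied Sciences

AI summary: The first exploration of graph-based deep learning for building footprint simplification (node movement prediction) and aggregation (link prediction). The study evaluates GCN, GAT, and GraphSAGE and finds that GraphSAGE performs comparatively well on link prediction, that node movement prediction remains challenging, and that aggregation is more complex than simplification.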

Comments 15 pages, 20 figures, 10 tables

Abstract

Map generalization remains one of the fundamental tasks in cartography, especially for the simplification and aggregation of complex building footprints. This study presents the first exploratory application of graph-based deep learning to both tasks, reformulating simplification as node movement prediction and aggregation as link prediction within a unified graph learning framework. We evaluate representative graph neural network architectures (GCN, GAT, and GraphSAGE) on multi-scale building datasets, showing that GraphSAGE demonstrates relative strengths in link prediction accuracy, while also revealing persistent challenges in precise node movement prediction. Beyond quantitative performance, the results highlight that aggregation poses greater complexity and challenges than simplification, underscoring the difficulty of capturing higher-level spatial relationships in map generalization with current deep learning approaches. Although limitations such as data imbalance and the need for post-processing remain, the study provides valuable insights and methodological directions for advancing automated map generalization with deep learning approaches.
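
As a rough illustration of how aggregation can be framed as link prediction, the sketch below runs one GraphSAGE-style mean-aggregation step and scores a candidate link with a sigmoid over a dot product. It is a simplification under stated assumptions: the learned weight matrix and nonlinearity of a real GraphSAGE layer are omitted, and the node names, features, and scoring function are hypothetical rather than taken from the paper.

```python
import math

def sage_mean_layer(features, neighbors):
    """Weight-free GraphSAGE-style step: concatenate each node with its neighbour mean."""
    out = {}
    for node, feat in features.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            mean = [sum(features[n][d] for n in nbrs) / len(nbrs)
                    for d in range(len(feat))]
        else:
            mean = [0.0] * len(feat)  # isolated node: no neighbourhood signal
        out[node] = feat + mean
    return out

def link_score(z_u, z_v):
    """Sigmoid of the dot product, read as the probability that two nodes are linked."""
    dot = sum(a * b for a, b in zip(z_u, z_v))
    return 1.0 / (1.0 + math.exp(-dot))

# Hypothetical toy graph: buildings "a" and "b" are adjacent, "c" is isolated.
features = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [-1.0, 0.0]}
neighbors = {"a": ["b"], "b": ["a"], "c": []}
embeddings = sage_mean_layer(features, neighbors)
```

In this toy setup the pair ("a", "b") scores above 0.5 and ("a", "c") below it, which is the kind of decision an aggregation model would threshold.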

2606.19950 2026-06-19 cs.CV cs.AI New submission

Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA

Yuetian Du, Yucheng Wang, Ming Kong, Tian Liang, Qiang Long, Bingdi Chen, Qiang Zhu

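Affiliations: College of Computer Science and Technology, Zhejiang University; School of Computer Science and Technology, Xidian University; Zhihui Medical Technology (Shanghai) Co., Ltd.

AI summary: To address the mismatch between confidence and accuracy of multimodal LLMs on medical tasks, the paper combines Multi-Strategy Fusion-Based Interrogation with expert LLM assessment, reducing Expected Calibration Error by an average of 40% across three medical VQA datasets and improving model reliability.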

Comments Accepted by MICCAI 2025

Abstract

Multimodal Large Language Models (MLLMs) show great potential in medical tasks, but their elicited confidence often misaligns with actual accuracy, potentially leading to misdiagnosis or overlooking correct advice. This study presents the first comprehensive analysis of the relationship between accuracy and confidence in medical MLLMs. It proposes a novel method that combines Multi-Strategy Fusion-Based Interrogation (MS-FBI) with auxiliary expert LLM assessment, aiming to improve confidence calibration in Medical Visual Question Answering (VQA). Experiments demonstrate that our method reduces the Expected Calibration Error (ECE) by an average of 40% across three Medical VQA datasets, significantly enhancing MLLMs' reliability. The findings highlight the importance of domain-specific calibration for MLLMs in healthcare, offering a more trustworthy solution for AI-assisted diagnosis.
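
For readers unfamiliar with the metric, Expected Calibration Error is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence, weighted by bin size. The sketch below follows that common definition; the bin count and the half-open bin edges are assumptions, and the abstract does not state which binning the authors use.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece
```

A model that answers four questions with confidence 0.9 and gets three right has accuracy 0.75 in that bin, so its ECE is about 0.15; a 40% relative reduction would bring that to roughly 0.09.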

2606.19948 2026-06-19 cs.AI New submission

Advancing DialNav through Automatic Embodied Dialog Augmentation

Leekyeung Han, Sangwon Jung, Hyunji Min, Jinseong Jeong, Minyoung Kim, Paul Hongsuck Seo

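Affiliations: Korea University; Trillion Labs

AI summary: Proposes an automatic generation pipeline to build the large-scale RAINbow dataset (238K episodes) and combines it with Dual-Strategy Training and a localization model, substantially raising success rate on DialNav (Val Seen +89%, Val Unseen +100%).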

Comments 29 pages, 9 figures

Abstract

For embodied agents capable of physical interaction, the capability to create and understand dialog is crucial to ensure both safety and effectiveness. While DialNav [han2025dialnav] provides a framework for holistic evaluation of the dialog-execution loop in photorealistic indoor navigation, its performance remains limited by a critical scarcity of training data (2K episodes). To address this, we propose an automatic generation pipeline and construct the RAINbow dataset, a large-scale training dataset with 238K episodes for DialNav. Our pipeline converts existing VLN datasets into multi-turn dialog and creates a cost-efficient, high-quality dataset. Then, we introduce two additional complementary advances to unlock the data's full potential: (1) Dual-Strategy Training, a navigation training scheme that aligns navigation training with the dynamic dialog-navigation loop, and (2) a localization model that leverages VLN knowledge. By combining these complementary solutions, our model substantially outperforms the baseline in success rate on both the Val Seen (58.24, +89%) and Val Unseen (29.05, +100%) splits, establishing a new state of the art.

2606.19944 2026-06-19 cs.CV New submission

Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models

Yifeng Wu, Huimin Huang, Ruiluo Wu, Chunyi Lin, Guanhua Chen, Xian Wu, Wang Song, Ruize Han

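Affiliations: Fudan University; Shenzhen University of Advanced Technology; Tencent Jarvis Lab; Southern University of Science and Technology

AI summary: Proposes the Timage paradigm, which uses a Constrained Schrödinger Bridge to embed the query text into the image as a typeset overlay. The overlay gives the model an explicit spatial anchor and improves fine-grained spatial reasoning without eroding the backbone's capabilities.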

Comments ECCV

Abstract

Multimodal Large Language Models (MLLMs) often lose track of the right image regions during fine-grained spatial reasoning, because a textual query rarely carries any explicit geometric anchor into the pixel domain. Prevailing remedies either rewire the model's weights or pad the prompt with verbose instructions, yet neither reliably pins the language to the correct visual coordinates without eroding the backbone's general competence. We introduce Timage, a paradigm that recasts multimodal understanding as an alignment problem solved at the input: the query is drawn, as a typeset overlay, onto the image itself. The placement and appearance of this overlay are produced by a Constrained Schrödinger Bridge (cSB), an entropic optimal-transport sampler that factorizes layout synthesis into two coupled stochastic stages. The first stage, Region Search, transports noise toward query-aligned image zones while obeying a hard occlusion barrier that protects salient foreground content; the second stage, Appearance Shaping, sizes the glyphs through an "ink-budget" regularizer so that the rendered text stays legible and visually balanced. The resulting overlay behaves as an explicit attention beacon that channels the model's focus along spatial semantics. On the VMCBench suite, Timage paired with a modest 7B backbone clearly overtakes far larger proprietary systems as well as parameter-tuned baselines. The study positions deliberate input reconstruction as a powerful, architecture-neutral lever for strengthening multimodal reasoning.

2606.19941 2026-06-19 cs.LG New submission

Compositionality Emerges in a Narrow Depth-Connectivity Regime: Architecture Constraints and Solution Manifolds

Dat H. Do, Rushi Shah, Duc V. Le, Dianbo Liu

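Affiliations: National University of Singapore; University of Twente

AI summary: Finds that compositionality emerges only in specific sparse networks and within a specific depth range, proposes similarity-based pruning and a depth predictor, and explains the findings with a theoretical framework.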

Abstract

Compositionality is believed to be the foundation for generalization, enabling models to reuse meaningful primitives in novel combinations. Yet, models trained with standard gradient-based optimization rarely, and often only weakly, exhibit compositional internal structure, and it remains unclear how or why such compositionality forms. In this work, we show that compositionality emerges in a narrow connectivity-depth sweet spot. Along the connectivity axis, compositionality appears only in certain sparse networks and depends heavily on which connections remain rather than on weight sparsity alone. Along the depth axis, compositionality emerges within a narrow, target-dependent regime, peaking at specific depths, while both shallower and deeper networks fail. When either the depth or connectivity condition is violated, gradient descent silently converges to fractured solutions rather than compositional ones. To discover and exploit this emergence, we introduce (i) similarity-based pruning (SP) to recover compositional connectivity and (ii) a heuristic depth predictor to estimate where compositionality is most likely to appear. Finally, we support these empirical findings with a theoretical framework based on compositional sparsity, volume-ratio arguments, and feature-interference bounds, explaining why compositional solutions are reachable only in a narrow depth-connectivity regime.

2606.19939 2026-06-19 cs.CV New submission

DiffMath: Symbol- and Graph-Aware Latent Diffusion Transformer for Handwritten Mathematical Expression Generation

Wei Pan, Xuhan Zheng, Yilin Shi, Huiguo He, Hiuyi Cheng, Dezhi Peng, Minghui Liao, Lianwen Jin

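Affiliations: South China University of Technology; Huawei Technologies Co., Ltd.

AI summary: Proposes the DiffMath framework, which uses the hierarchical structure of LaTeX as a prior and, through a Relational Abstract Syntax Tree, structure-preserving latent representations, and conditional denoising, generates structurally consistent handwritten mathematical expressions without positional supervision.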

Abstract

Handwritten Mathematical Expression Generation (HMEG) is challenging due to the complex two-dimensional layouts and long-range structural dependencies of mathematical expressions. Existing methods typically rely on explicit spatial supervision, such as symbol-level bounding boxes, which incurs high annotation costs and limits scalability. In this work, we propose DiffMath, a symbol- and graph-aware latent diffusion framework that leverages the hierarchical structure inherent in LaTeX as a structural prior, eliminating the need for positional supervision. First, we design a Relational Abstract Syntax Tree (RelAST), a generation-oriented representation that distills MathML trees into compact triplet sequences [S, R, D], where each token directly encodes a symbol identity, spatial relation, or nesting depth. Second, we introduce MathVAE, which learns structure-preserving latent representations through symbol-aware and relation-aware perceptual regularization, ensuring that the latent space captures both character semantics and spatial topology. Third, MathDiT performs conditional denoising in this structured latent space, further guided by a global symbol-count prior via Adaptive Layer Normalization (AdaLN) to improve structural coherence. Experiments show that DiffMath produces structurally consistent handwritten expressions, achieves superior performance over existing methods, and improves the accuracy of downstream OCR models through synthetic data augmentation.
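
The Adaptive Layer Normalization (AdaLN) step mentioned above can be sketched in its usual DiT-style form: normalize the features, then apply a scale and shift derived from the conditioning signal. In this minimal sketch the scale and shift are passed in by hand; how DiffMath maps its symbol-count prior to those values is not described in the abstract, so that part is left out rather than guessed.

```python
import math

def adaln(x, scale, shift, eps=1e-5):
    """Layer-normalize x, then modulate each feature: (1 + scale) * norm(x) + shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    return [(1.0 + s) * n + b for n, s, b in zip(normed, scale, shift)]

# Hand-picked modulation values standing in for the output of a conditioning network.
modulated = adaln([1.0, 2.0, 3.0], scale=[0.5, 0.5, 0.5], shift=[1.0, 1.0, 1.0])
```

With zero scale and shift the function reduces to plain layer normalization, which is why this form is a convenient way to inject a global prior without changing the backbone.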

2606.19938 2026-06-19 cs.CV cs.AI New submission

Triangular Consistency as a Universal Constraint for Learning Optical Flow

Yi Xiao, Carlos Rodriguez Coronel, Jing Zhan, Haniyeh Ehsani Oskouie, Alex Wong, Dong Lao

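Affiliations: Louisiana State University; University of California, Los Angeles; Yale University

AI summary: Proposes a triangular consistency constraint that composes two optical flows to induce a third and enforces consistency among the three. It applies across network architectures, supervision types, and datasets, and improves performance in supervised, unsupervised, and transfer learning.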

Comments Accepted by ECCV 2026

Abstract

We propose triangular consistency as a first-principles constraint for optical flow that is agnostic to network architecture, supervision type, and dataset, and applies to both image-pair and multi-frame settings. The constraint is simple but powerful: compose two flows to induce a third flow, and enforce consistency among the three. The composed flows may arise from (i) image pairs, yielding cycle consistency; (ii) multiple video frames, producing longer-range motion through temporal chaining; or (iii) image pairs combined with controlled synthetic transformations, which acts as data augmentation. Triangular consistency introduces negligible computational overhead and requires no additional annotations. Since it is derived directly from the geometry of optical flow, it does not rely on model-specific assumptions and serves as a "universal" plug-and-play component for optical flow training. Experiments show consistent improvement across supervised, unsupervised, and transfer learning settings.
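
The idea of composing two flows to induce a third can be sketched as follows. The composition rule, flow_ac(x) = flow_ab(x) + flow_bc(x + flow_ab(x)), is the standard one; the nearest-neighbour lookup, the border clamping, and the L1 penalty are simplifying assumptions for illustration, and the paper's actual loss may differ (for example, by using differentiable bilinear sampling).

```python
def compose_flows(flow_ab, flow_bc):
    """Compose A->B and B->C flows into an A->C flow using nearest-neighbour lookup."""
    h, w = len(flow_ab), len(flow_ab[0])
    composed = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow_ab[y][x]
            # Landing position in frame B, rounded to a pixel and clamped to the image.
            xb = min(max(int(round(x + dx)), 0), w - 1)
            yb = min(max(int(round(y + dy)), 0), h - 1)
            dx2, dy2 = flow_bc[yb][xb]
            row.append((dx + dx2, dy + dy2))
        composed.append(row)
    return composed

def triangle_loss(flow_ab, flow_bc, flow_ac):
    """Mean L1 gap between a directly estimated A->C flow and the composed one."""
    composed = compose_flows(flow_ab, flow_bc)
    total, count = 0.0, 0
    for row_c, row_d in zip(composed, flow_ac):
        for (cx, cy), (ax, ay) in zip(row_c, row_d):
            total += abs(cx - ax) + abs(cy - ay)
            count += 1
    return total / count

def constant_flow(dx, dy, size=3):
    """Hypothetical helper: a size x size flow field with the same vector everywhere."""
    return [[(dx, dy)] * size for _ in range(size)]
```

Setting frame C equal to frame A turns the same loss into the cycle-consistency case (i) from the abstract: the forward and backward flows should compose to zero.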

2606.19935 2026-06-19 cs.AI New submission

PhysDrift: Bridging the Embodiment Gap in Humanoid Co-Speech Motion Generation

Zhangzhao Liang, Xiaofen Xing, Mingyue Yang, Wenlve Zhou, Xiangmin Xu

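Affiliations: South China University of Technology; DexForce Technology; Foshan University

AI summary: To address the mismatch between the human motion manifold and robot embodiment constraints in humanoid co-speech motion generation, the paper proposes the IK-EER framework and the PhysDrift model, which directly predicts executable joint trajectories and improves motion alignment, physical plausibility, and real-time interaction.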

Abstract

Humanoid robots require co-speech motions that are not only expressive and speech-aligned, but also physically executable under embodiment constraints. Existing co-speech generation pipelines are predominantly human-centric: motions are first generated in human-body representations such as SMPL-X and subsequently retargeted to humanoid robots. In this work, we identify a fundamental embodiment gap in this paradigm, where the mismatch between human motion manifolds and humanoid embodiment constraints disrupts embodiment consistency during motion transfer and physical execution. Through extensive analysis, we show that although retargeting can preserve coarse motion semantics, it significantly compresses motion diversity and weakens prosody-motion synchronization, limiting expressive humanoid behaviors. To address this problem, we first propose IK-EER, a prosody-preserving humanoid motion curation framework that jointly optimizes kinematic feasibility and speech-motion temporal alignment during retargeting. Building upon the curated robot-native motion dataset, we further introduce PhysDrift, an embodiment-aware co-speech motion generation framework that directly predicts executable humanoid joint trajectories from speech without relying on intermediate human-body representations. Unlike conventional human-centric pipelines, PhysDrift maintains embodiment consistency throughout both training and inference while incorporating physical regularization to stabilize robot motion dynamics. Extensive experiments and real-world humanoid deployment demonstrate that embodiment-aware robot-native generation substantially improves speech-motion alignment, physical plausibility, motion smoothness, inference efficiency, and real-time interaction capability.

2606.19934 2026-06-19 cs.CV cs.AI New submission

Speeding up the annotation process in semantic segmentation industrial applications

Marta Fernandez-Moreno, Margarita Guerrero, Rosalia Rementeria, Pablo Mesejo, Raul Moreno

Affiliations: Department of Computer Science and Artificial Intelligence, Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), University of Granada; Department of Computer Science and Automatic Control, National Distance Education University (UNED)

AI summary: Uses unsupervised algorithms to cut annotation time for semantic segmentation in materials science from 170 hours to 37 hours (a 78% reduction) and releases the largest public steel microstructure segmentation dataset.

Abstract

Current machine learning models commonly require large and well-annotated datasets. However, the annotation process often becomes a bottleneck, with increased complexity leading to a higher chance of human error. Within this context, our goal in this paper is to leverage unsupervised algorithms to improve data annotation efficiency for complex semantic segmentation problems in industrial materials science. Some previous studies have quantified labeling time, and others have explored unsupervised methods. However, to the best of our knowledge, this is the first study to quantify how much unsupervised algorithms accelerate the labeling process. We aim to validate the extent to which this laborious process can be accelerated, focusing on semantic segmentation tasks that involve annotating each pixel of high-resolution images, such as the microstructure characterization challenge in materials science. Specifically, we demonstrate that by using unsupervised computer vision algorithms, the time required for the labeling process can be reduced from 170 hours to 37 hours, achieving an approximate reduction of 78%. The dataset we work with includes large images of dimensions 1280x959 and 960x703, which further increases the complexity of the annotation task. Despite these challenges, we create and share the largest public steel microstructure segmentation dataset to date, available under the MIT License with a permanent DOI, contributing a fully annotated, high-resolution dataset to the field. Additionally, this is the first work to compare the labeling time from scratch (a common approach in previous studies) to the labeling time when using these unsupervised algorithms as a pre-annotation step. Furthermore, we provide a Deep Learning model trained on this dataset, validated by field experts, and deployed in an industrial setting, serving as an initial benchmark for this public dataset.
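
The headline figure can be checked with a line of arithmetic. The snippet below uses only the two numbers given in the abstract (170 and 37 hours); the function name is a hypothetical helper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original cost that was saved."""
    return (before - after) / before

reduction = relative_reduction(170, 37)  # 133 / 170, roughly 0.782
speedup = 170 / 37                       # roughly 4.6 times faster
```

The result, about 78.2%, matches the "approximate reduction of 78%" stated in the abstract.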

2606.19932 2026-06-19 cs.CV cs.AI New submission

Spatial-Aware Reduction Framework: Towards Efficient and Faithful Visual State Space Models

Jindi Lv, Aoyu Li, Yuhao Zhou, Zheng Zhu, Xiaofeng Wang, Qing Ye, Yueqi Duan, Wentao Feng, Jiancheng Lv

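Affiliations: Sichuan University; Tsinghua University

AI summary: Proposes the STORM framework, which preserves spatial structure to prevent the performance collapse of vision Mamba models under token reduction, achieving high-accuracy pruning without training.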

Comments Accepted by ICML 2026

Abstract

Mamba demonstrates strong efficiency in modeling long visual sequences. However, when token reduction is applied to structurally enhanced Mamba variants, these models exhibit a severe performance collapse. We attribute this degradation to the spatially agnostic nature of existing reduction methods, which violate the two-dimensional structural premise required by the selective scanning mechanism. In this work, we propose STORM, a spatial-aware token reduction framework designed to maintain structural integrity throughout the compression process. STORM reformulates reduction into a structured operation on spatial units, enforcing localized constraints to maintain both grid topology and neighborhood coherence. As a plug-and-play module, STORM equips existing reduction pipelines with explicit spatial awareness without any training. Empirical results demonstrate that STORM achieves state-of-the-art pruning accuracy across diverse vision Mamba backbones under training-free settings. Notably, STORM delivers a substantial accuracy recovery on VMamba, outperforming prior methods by up to 63.3% in top-1 accuracy. Meanwhile, STORM incurs only a 1.0% accuracy drop on PlainMamba, achieving performance comparable to ViT.
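
To make the notion of a grid-preserving reduction concrete, the sketch below merges each non-overlapping 2x2 block of tokens by averaging, so the output is still a rectangular grid that a 2D scan could traverse. This is a generic stand-in chosen for illustration, not STORM's algorithm: the abstract does not specify its spatial units or constraints, and the function name and block size are assumptions.

```python
def merge_tokens_2x2(grid):
    """Average each non-overlapping 2x2 block of token vectors; the result is a 2D grid."""
    h, w = len(grid), len(grid[0])
    if h % 2 or w % 2:
        raise ValueError("grid height and width must be even")
    dim = len(grid[0][0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            block = [grid[y][x], grid[y][x + 1], grid[y + 1][x], grid[y + 1][x + 1]]
            row.append([sum(tok[d] for tok in block) / 4.0 for d in range(dim)])
        out.append(row)
    return out

# Toy 4x4 grid of one-dimensional tokens whose value encodes their position.
tokens = [[[float(y * 4 + x)] for x in range(4)] for y in range(4)]
reduced = merge_tokens_2x2(tokens)
```

By contrast, dropping arbitrary tokens by score leaves a ragged set with no well-defined rows or columns, which is the kind of spatially agnostic reduction the abstract identifies as the cause of collapse.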

2606.19928 2026-06-19 cs.RO New submission

SWAP: Symmetric Equivariant World-Model for Agile Robot Parkour

Kaixin Lan, Ze Wang, Hongyi Li, Lei Jiang, Chaojie Fu, Chengkai Su, Choi Lam Wong, Yongbin Jin, Hongtao Wang

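Affiliations: Center for X-Mechanics, Zhejiang University; ZJU-Hangzhou Global Scientific and Technology Innovation Center; Mirrorme Technology Co., Ltd.

AI summary: Proposes the SWAP framework, which embeds symmetry equivariance into the world model and actor-critic networks. It sets quadruped parkour records (leaping a 2.13 m gap, climbing a 1.63 m platform) and shows geometric generalization to unseen mirrored terrains and zero-shot transfer.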

Abstract

While latent world models enable the proactive predictions required for extreme parkour, their purely data-driven nature forces them to redundantly encode left-right symmetric interactions as independent patterns. This inflates the learning burden and hinders the capture of geometric regularities, restricting the latent space's efficiency for downstream policies. To address this, we propose SWAP, an end-to-end equivariant symmetric world model. This framework embeds symmetry directly into both the world model and the actor-critic networks. In real-world tests, the robot leaps across a 2.13 m gap and climbs a 1.63 m platform, breaking records for quadruped parkour. Furthermore, the framework exhibits robust geometric generalization to unseen mirrored terrains and exceptional zero-shot transferability across diverse outdoor environments. These results demonstrate that symmetry equivariance is an effective structural prior for pushing the physical boundaries of learned legged locomotion.
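
Left-right equivariance means that mirroring the observation mirrors the action: policy(mirror(s)) = mirror(policy(s)). One generic way to obtain this property, sketched below, is to average a policy's output with the mirrored output on the mirrored state. This is an illustrative construction only; the abstract says SWAP embeds symmetry directly in its networks but does not say how, and the state layout, action layout, and `raw_policy` are hypothetical.

```python
def mirror_state(s):
    """Hypothetical layout [lateral velocity, left-leg reading, right-leg reading]."""
    lateral, left, right = s
    return [-lateral, right, left]

def mirror_action(a):
    """Hypothetical layout [left-leg command, right-leg command]."""
    left, right = a
    return [right, left]

def raw_policy(s):
    """Arbitrary asymmetric function standing in for an unconstrained network."""
    lateral, left, right = s
    return [0.3 * lateral + 0.5 * left - 0.1 * right, 0.2 * left * left + 0.7 * right]

def symmetric_policy(s):
    """Average the direct output with the reflected output on the reflected state."""
    direct = raw_policy(s)
    reflected = mirror_action(raw_policy(mirror_state(s)))
    return [0.5 * (d + r) for d, r in zip(direct, reflected)]
```

Because both mirror maps are their own inverse, the averaged policy is equivariant by construction, so experience on a left-side obstacle transfers to its right-side mirror image without being learned twice.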

2606.19927 2026-06-19 cs.CV New submission

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu, Tat-Seng Chua

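Affiliations: School of Information Science and Engineering, Lanzhou University; School of Medical Technology, Beijing Institute of Technology; School of Computing, National University of Singapore

AI summary: Proposes the CARE framework, which adaptively optimizes reasoning length through competence-aware reward shaping. It estimates competence with an exponential moving average, shifts the reward preference across stages, and adds batch-level normalization and a posterior amplifier to improve efficiency and accuracy.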

Abstract

In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at https://github.com/1Pansy/Video-CARE.
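
The competence estimate described above, an exponential moving average of pass rates that routes training into stages, can be sketched as follows. The smoothing factor, the two thresholds, and the stage names are all hypothetical values chosen for illustration; the abstract does not give the ones used by CARE.

```python
def update_competence(ema: float, pass_rate: float, beta: float = 0.9) -> float:
    """Exponential moving average: keep beta of the old estimate, blend in the new rate."""
    return beta * ema + (1.0 - beta) * pass_rate

def training_stage(competence: float,
                   explore_below: float = 0.4,
                   concise_above: float = 0.7) -> str:
    """Map the smoothed competence to a reward preference (thresholds are assumptions)."""
    if competence < explore_below:
        return "explore"   # favour long-form reasoning early on
    if competence < concise_above:
        return "balance"
    return "concise"       # favour short reasoning once the model is competent

competence = 0.0
for batch_pass_rate in [1.0, 1.0, 1.0]:
    competence = update_competence(competence, batch_pass_rate)
```

The smoothing keeps a few lucky batches from flipping the stage: even after three perfect batches the estimate is only about 0.27, so training stays in the exploration stage.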

2606.19921 2026-06-19 cs.AI New submission

eCNNTO: A Highly Generalizable ConvNet for Accelerating Topology Optimization

Shengbiao Lu, Xiaodong Wei

发表机构 * Global College, Shanghai Jiao Tong University(上海交通大学全球学院)

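AI summary: Proposes eCNNTO, an element-based convolutional neural network that accelerates density-based topology optimization by predicting near-optimal densities and skipping most iterations, and introduces a new training strategy that improves efficiency and generalization.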

AI中文摘要

本工作提出了一种基于元素的卷积神经网络(CNN)来加速基于密度的拓扑优化(TO),称为eCNNTO。TO通常需要大量迭代,其中每次迭代都进行有限元分析,导致效率瓶颈,尤其是在使用密集网格实现高分辨率设计时。为解决这一限制,eCNNTO建立在Kallioras等人(2020)的工作基础上,该工作为每个元素训练了一个深度信念网络(DBN),根据其早期历史预测近最优密度,从而跳过绝大多数迭代并显著加速TO过程。然而,该方法缺乏相邻元素间的空间相关性,可能导致最终结构中存在不连通的特征。所提方法采用带有残差连接的CNN来解决这一问题。在此基础上,引入了一种新的训练策略以进一步提高优化效率,其中训练数据集由最终阶段的密度历史而非早期历史组成。这一变化也有助于减少所需的训练数据量。eCNNTO仅需少量数据集进行训练,却能泛化到边界条件、载荷情况、设计域几何形状、网格分辨率以及非设计域大不相同的各种问题。最后,通过二维和三维的多个示例展示了eCNNTO的泛化能力和效率,分别实现了高达90%和97%的迭代次数减少。

英文摘要

This work proposes an element-based Convolutional Neural Network (CNN) to accelerate density-based Topology Optimization (TO), termed eCNNTO. TO generally requires a large number of iterations, with finite element analysis performed in every iteration, which creates an efficiency bottleneck, especially when dense meshes are used to achieve high-resolution designs. To address this limitation, eCNNTO builds upon Kallioras et al. (2020), where a Deep Belief Network (DBN) was trained for every element to predict its near-optimal density from its early history, thereby skipping the great majority of iterations and significantly accelerating the TO procedure. However, that method lacks spatial correlations among neighboring elements and may lead to disconnected features in the final structure. The proposed method employs a CNN with residual connections to address this issue. On top of this, a novel training strategy is introduced to further enhance the optimization efficiency, where the training dataset consists of final-stage density histories rather than early ones. This change can also help reduce the required training data size. eCNNTO requires only a small dataset to train, yet it generalizes to problems with largely different boundary conditions, loading cases, design domain geometries, mesh resolutions, and non-design domains. Finally, the generalization capabilities and efficiency of eCNNTO are demonstrated through a variety of examples in two and three dimensions, achieving up to 90% and 97% reduction in iterations, respectively.

2606.19920 2026-06-19 cs.RO cs.LG cs.MA 新提交

Deep-Unfolded Coordination

深度展开协调

Hunter Kuperman, Minchan Jung, Rahul V. Ghosh, Alex Oshin, Evangelos A. Theodorou

发表机构 * Autonomous Control and Decision Systems Laboratory, Georgia Institute of Technology, United States(佐治亚理工学院自主控制与决策系统实验室)

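AI summary: Proposes Deep Coordinator, a framework that deep-unfolds ADMM-DDP iterations and learns to adjust hyperparameters dynamically, adapting the penalty parameters of a non-convex optimizer at solve time; in simulations with fleets of cars and quadrotors it runs 6.18-9.44x faster and scales to systems up to 8x larger than those seen in training.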

Comments The second and third authors contributed equally (equal second authorship). 35 pages (10 pages main text), 17 figures, 3 tables

AI中文摘要

分布式优化是一种高度可扩展且结构透明的技术,用于解决多机器人问题;然而,这类方法通常需要高度专门化、针对特定问题的超参数调整。在这项工作中,我们提出了Deep Coordinator,一个深度展开框架,学习在求解时根据优化器性能动态调整ADMM-DDP(一种流行的机器人任务分布式求解器)的超参数。我们的架构包括将固定数量的ADMM-DDP迭代展开成一个神经网络,层之间具有可学习的函数,将优化器状态映射到下一个超参数。据我们所知,Deep Coordinator是第一个在求解时调整非凸优化器惩罚参数的深度展开框架;我们展示了主流的监督方法在训练此类模型时可能产生退化解,并提出了一种无监督学习方案。在车队和四旋翼飞行器的仿真中,Deep Coordinator生成的轨迹质量与常规求解器相当,但速度快6.18-9.44倍。此外,当部署到比训练规模大8倍的系统时,Deep Coordinator仍能保持其性能优势。

英文摘要

Distributed optimization is a highly scalable and structurally transparent technique for solving multi-agent robotics problems; however, such methods often require highly specialized, problem-specific hyperparameter tuning. In this work, we propose Deep Coordinator, a deep-unfolding framework that learns to dynamically adjust the hyperparameters of ADMM-DDP, a popular distributed solver for robotics tasks, at solve-time in response to optimizer performance. Our architecture unrolls a fixed number of ADMM-DDP iterations into a neural network, with learnable functions between layers that map the optimizer state to the next hyperparameters. To the best of our knowledge, Deep Coordinator is the first deep-unfolding framework to adapt the penalty parameters of a non-convex optimizer at solve-time; we show that the mainstream supervised approach can yield degenerate solutions when training such models, and propose an unsupervised learning scheme. On simulations with fleets of cars and quadrotors, Deep Coordinator produces trajectories of comparable quality 6.18-9.44x faster than conventional solvers. Furthermore, Deep Coordinator retains its performance benefits when deployed to systems up to 8x larger than those it was trained on.

2606.19919 2026-06-19 cs.LG 新提交

ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models

ADaPT:面向高效大推理模型的令牌级解耦

Tingyun Li, Zishang Jiang, Jinyi Han, Xinyi Wang, Sihang Jiang, Han Xia, Zhaoqian Dai, Shuguang Ma, Fei Yu, Jiaqing Liang, Yanghua Xiao

发表机构 * School of Data Science, Fudan University(复旦大学数据科学学院) Shanghai Institute of Artificial Intelligence for Education, East China Normal University(华东师范大学上海智能教育研究院) College of Computer Science and Artificial Intelligence, Fudan University(复旦大学计算机科学与人工智能学院) Ant Group(蚂蚁集团)

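AI summary: Proposes ADaPT, a token-level dual-process framework that decouples efficiency and correctness signals and introduces a mode-selection token to control fast and slow reasoning, giving precise, continuous control over the efficiency-performance trade-off at inference time and lowering inference cost while retaining strong reasoning ability.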

AI中文摘要

大型推理模型依赖长思维链实现强性能,但统一应用此类推理会产生高计算成本。现有面向效率的方法试图缩短或混合推理策略,但往往会降低推理能力。我们将根本原因识别为效率激励与正确性优化之间的序列级耦合,这隐式惩罚了长但正确的推理轨迹。为解决此问题,我们提出自适应双过程思维(ADaPT),一种令牌级双过程框架,在训练期间显式解耦效率和正确性信号。ADaPT引入模式选择令牌来控制快速和慢速推理,将效率相关奖励仅应用于此令牌,以避免惩罚正确的长推理,同时在适当时鼓励效率。此外,ADaPT在推理时实现了对效率-性能权衡的精确连续控制:通过调整模式选择令牌的生成概率,单个训练好的模型可以平滑地沿效率-性能帕累托前沿移动。大量实验表明,ADaPT在多个基准测试中显著降低推理成本,同时保持强推理性能。

英文摘要

Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning strategies, yet often degrade reasoning capability. We identify the root cause as sequence-level coupling between efficiency incentives and correctness optimization, which implicitly penalizes long but correct reasoning trajectories. To address this issue, we propose Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training. ADaPT introduces a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. Moreover, ADaPT enables precise and continuous control over the efficiency-performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency-performance Pareto frontier. Extensive experiments demonstrate that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.
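The abstract says the trade-off is steered by adjusting the generation probability of the mode-selection token, but it does not say how. One simple way to picture this, assuming a two-way choice between a hypothetical fast-mode token and a hypothetical slow-mode token, is to add a bias to the fast-mode logit before the softmax. The function name and the bias mechanism below are assumptions for illustration only.

```python
import math


def fast_mode_probability(logit_fast: float, logit_slow: float, bias: float = 0.0) -> float:
    """Probability of the (hypothetical) fast-mode token after adding `bias` to its logit."""
    # A two-way softmax reduces to a sigmoid of the logit difference.
    return 1.0 / (1.0 + math.exp(-((logit_fast + bias) - logit_slow)))


# Sweeping the bias moves the choice smoothly between slow and fast reasoning.
sweep = [round(fast_mode_probability(0.0, 0.0, b), 3) for b in (-2.0, 0.0, 2.0)]
```

Since the bias is a continuous knob, a single set of model logits yields a continuum of fast/slow mixing ratios, which is the kind of control the abstract describes.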

2606.19915 2026-06-19 cs.CV 新提交

SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision

SpatialSV: 通过任务导向的视觉监督在多模态大语言模型中内化可解释的3D空间感知

Jiayu Tang, Yuchen Zhou, Chao Gou

发表机构 * School of Intelligent Systems Engineering, Sun Yat-sen University(中山大学智能工程学院)

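AI summary: Proposes the SpatialSV framework, which uses task-oriented visual supervision to lift an MLLM's 2D features into explicit 3D representations (depth maps, camera poses, point clouds), internalizing interpretable 3D spatial awareness without external tools and showing strong generalization in semi-supervised settings.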

Comments Accepted by IJCAI 2026

AI中文摘要

解锁多模态大语言模型(MLLMs)的空间智能对于理解和与3D世界交互至关重要。当前主流方法通常通过外部工具注入空间先验,这会带来显著的推理开销,或依赖潜在特征蒸馏,后者缺乏可解释性和细粒度几何约束。为解决这些问题,我们提出SpatialSV,一个旨在将鲁棒的3D空间感知内化到MLLMs中,同时提供内在可解释性的框架。与被动特征模仿不同,SpatialSV采用任务导向的视觉监督,迫使模型主动将其2D视觉特征提升为显式3D表示,包括深度图、相机姿态和点云。关键的是,这个2D到3D的提升过程为模型的表示提供了一个透明窗口:生成的3D重建作为可视化和诊断模型内在空间知识质量的直观代理。跨多个模型和基准的广泛实验证明了SpatialSV在增强和解释MLLMs空间智能方面的有效性。此外,该框架在半监督设置中展现出强泛化能力,验证了其利用未标记视觉数据进行可扩展、可解释空间表示学习的潜力。

英文摘要

Unlocking the spatial intelligence of multimodal large language models (MLLMs) is crucial for understanding and interacting with the 3D world. Prevailing approaches typically inject spatial priors via external tools, which impose significant inference overhead, or rely on latent feature distillation, which remains uninterpretable and lacks fine-grained geometric constraints. To address these issues, we propose SpatialSV, a framework designed to internalize robust 3D spatial awareness within MLLMs while simultaneously offering inherent interpretability. Deviating from passive feature imitation, SpatialSV employs task-oriented visual supervision, compelling the model to actively lift its 2D visual features into explicit 3D representations, including depth maps, camera poses, and point clouds. Crucially, this 2D-to-3D lifting process provides a transparent window into the model's representations: the resulting 3D reconstructions serve as an intuitive proxy for visualizing and diagnosing the quality of the model's intrinsic spatial knowledge. Extensive experiments across multiple models and benchmarks demonstrate the effectiveness of SpatialSV in enhancing and interpreting MLLMs' spatial intelligence. Furthermore, the framework exhibits strong generalization in semi-supervised settings, validating its potential to leverage unlabeled visual data for scalable, interpretable spatial representation learning.

2606.19914 2026-06-19 cs.RO cs.AI 新提交

Co-policy: Responsive Human-Robot Co-Creation for Musical Performances

Co-policy: 响应式人机音乐共创框架

Xuetao Li, Wenke Huang, Mang Ye, Zijian Liu, Jinhua Xie, Jifeng Xuan, Miao Li

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) School of Automation, Wuhan University of Technology(武汉理工大学自动化学院) School of Geodesy and Geomatics, Wuhan University(武汉大学测绘学院) School of Robotics, Wuhan University(武汉大学机器人学院)

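AI summary: Proposes the Co-policy framework, which enables real-time human-robot musical co-creation through semantic anchoring, constrained variation, and a visuomotor policy, and outperforms a diffusion-policy baseline in real-robot chime experiments.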

AI中文摘要

艺术长期以来一直是人类创造力的关键表达。具身人工智能为生成模型通过物理动作而非无形数字内容参与创造力提供了一条途径。在机器人音乐共创中,将语义音乐理解与实时且可物理执行的表演连接起来具有挑战性。我们提出了Co-policy,一个人机音乐共创框架,它分离了语义意图接地、约束音乐变分和视觉运动执行。为了接地音乐语义,Co-policy使用预推理语义锚点和微调的Qwen-vl规划器(F-Qwen)将语音、实时音乐种子和视觉观察转化为结构化的共创计划。为了支持低延迟执行,Co-policy引入了高斯混合视觉运动策略(GMP),实现为条件混合密度策略,在单次前向传递中将目标音符和视觉上下文映射到多模态机器人动作。与仅复现用户指定音符的机器人回放系统不同,Co-policy在音乐和物理约束下生成互补的音乐响应。真实机器人钟琴实验、消融研究和专家评估显示,与扩散策略和消融基线相比,意图对齐、执行准确性和响应频率均有提升,支持物理接地动作生成作为具身人机共创的关键要求。

英文摘要

Art has long stood as a pivotal expression of human creativity. Embodied artificial intelligence offers a route for generative models to participate in that creativity through physical action rather than disembodied digital content. In robotic music co-creation, it is challenging to connect semantic musical understanding with real-time and physically executable performance. We present Co-policy, a framework for human-robot musical co-creation that separates semantic intent grounding, constrained musical variation, and visuomotor execution. To ground musical semantics, Co-policy uses pre-inference semantic anchors and a fine-tuned Qwen-vl planner (F-Qwen) to transform speech, live musical seeds, and visual observations into structured co-creation plans. To support low-latency execution, Co-policy introduces a Gaussian-Mixture Visuomotor Policy (GMP), implemented as a conditional mixture-density policy that maps target notes and visual context to multimodal robot actions in a single forward pass. Unlike robotic playback systems that merely reproduce user-specified notes, Co-policy generates complementary musical responses under both musical and physical constraints. Real-robot chime experiments, ablations, and expert evaluation show improved intent alignment, execution accuracy, and response frequency over diffusion-policy and ablated baselines, supporting physically grounded action generation as a key requirement for embodied human-AI co-creation.

2606.19911 2026-06-19 cs.AI cs.CL cs.IR 新提交

Multi-Agent Transactive Memory

多智能体交互记忆

To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of California, Berkeley(加州大学伯克利分校)

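AI summary: Proposes the MATM framework, which stores and retrieves agent trajectories in a shared repository so that knowledge can be reused across heterogeneous agent populations, improving downstream task performance and reducing interaction steps.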

AI中文摘要

具有不同能力的LLM智能体在多样化任务中的去中心化部署,激发了跨异构智能体群体知识共享的基础设施需求。正如搜索引擎索引人类生成的工件以支持人类问题解决,检索系统可以组织智能体生成的工件以供跨智能体群体重用。我们将检索增强生成(展示了人类创作工件对单个智能体的价值)扩展到检索智能体生成的工件以支持智能体群体。特别是,智能体轨迹编码了可重用的程序性知识,然而这些工件通常在一次使用后被丢弃或仅由产生智能体保留,迫使新实例化的智能体反复重新发现现有解决方案。我们提出了多智能体交互记忆(MATM),一个用于群体级存储和检索智能体生成轨迹的框架,其中生产者智能体将轨迹贡献到共享仓库,消费者智能体检索它们以改进任务执行。我们专注于交互环境(ALFWorld和WebArena),其中轨迹较长且编码了特别丰富的程序性结构。我们的实验表明,从MATM检索轨迹可提高下游任务性能并减少交互步骤,无需协调或联合训练。这些结果将MATM定位为开放智能体生态系统中群体级经验共享的设计模式。

英文摘要

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.

2606.19910 2026-06-19 cs.CL cs.SD eess.AS 新提交

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

轻量级发音评估:基于离散语音标记的意外度

Syeda Faiza Ahmed Sara, Shammur Absar Chowdhury

发表机构 * Qatar Computing Research Institute, Doha, Qatar(卡塔尔计算研究所,多哈,卡塔尔)

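AI summary: Proposes a lightweight pronunciation assessment framework trained only on native speech resources; it computes surprisal with a language model over discretized speech tokens, adds text-guided alignment features, and approaches supervised performance with no supervision or only light calibration.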

Comments Accepted to Interspeech 2026

AI中文摘要

训练自动发音评估通常依赖于标记的学习者错误或非母语语料库,这些语料库收集成本高昂。我们提出一个轻量级框架,仅使用母语语音资源训练,以无监督或通过少量评分话语进行轻量校准的方式运行。在推理时,学习者语音通过SSL编码器和K-means码本进行离散化。一个在母语序列上训练的标记语言模型计算意外度,其中较高的意外度表示音位偏差。我们添加了一个转录引导的Text2DUnit--DTW模块,该模块从参考文本预测母语标记序列,并将其与声学标记对齐以推导出错误敏感特征。意外度和对齐特征通过简单回归融合。在SpeechOcean762上,PCC从0.60提升到0.66(带转录引导),接近监督基线。在L2-ARCTIC上的跨数据集评估显示了一致的提升。

英文摘要

Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal, where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit-DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, close to supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
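To make the surprisal idea concrete, here is a small sketch that scores a sequence of discrete tokens with a language model fitted on "native" sequences only. An add-one-smoothed bigram model stands in for the paper's token language model, and the integer tokens stand in for K-means unit IDs; both are simplifying assumptions, as are the function names.

```python
import math
from collections import Counter


def train_bigram(sequences, vocab_size):
    """Count bigrams over native token sequences (a stand-in for the token LM)."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab_size


def mean_surprisal(seq, model):
    """Average -log2 P(cur | prev) with add-one smoothing, in bits per transition."""
    bigrams, contexts, vocab_size = model
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total += -math.log2(prob)
    return total / max(len(seq) - 1, 1)


native = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
model = train_bigram(native, vocab_size=4)
typical = mean_surprisal([0, 1, 2, 3], model)   # follows the native pattern
deviant = mean_surprisal([3, 2, 1, 0], model)   # transitions never seen in training
```

The deviant sequence receives a higher mean surprisal than the typical one, which is the signal the framework reads as deviation from native patterns.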

2606.19908 2026-06-19 cs.CV 新提交

Gaussian Process Prior Variational Autoencoder for Endoscopic Videos

用于内窥镜视频的高斯过程先验变分自编码器

Ivan De Boi, Xinxing Shi, Xiaoyu Jiang, Tim J. M. Jaspers, Francisco Caetano, Mauricio A. Alvarez, Fons van der Sommen, Sam Van der Jeught

发表机构 * Department of Electromechanics, InViLab, University of Antwerp(安特卫普大学机电工程系InViLab实验室) Department of Computer Science, University of Manchester(曼彻斯特大学计算机科学系) Department of Electrical Engineering, Eindhoven University of Technology(埃因霍温理工大学电气工程系)

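AI summary: Proposes a Gaussian Process Prior Variational Autoencoder (GPVAE) that replaces the factorized prior with a temporal Gaussian process prior and combines two scalable GP approximations with specular-reflection masking to interpolate and restore missing frames in endoscopic video, reducing RMSE by 21.9% on average on the C3VDv2 dataset.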

AI中文摘要

内窥镜视频分析对于胃肠道诊断和计算机辅助干预至关重要,但视频序列经常受到镜面反射、运动伪影和缺失帧的退化影响。这些瞬态损坏会分散临床医生的注意力,降低图像可解释性,并干扰下游任务(如3D重建和导航)。因此,有效的修复需要利用时间连续性而非孤立处理帧的方法。我们提出了一种用于内窥镜视频修复的高斯过程先验变分自编码器(GPVAE)框架,该框架用时间高斯过程先验替代标准因子化潜在先验,从而能够以不确定性感知的重建方式插值缺失帧。该框架结合了内窥镜专用编码器(包括卷积EndoVAE骨干网络和来自GastroNet-5M的预训练Vision Transformer编码器)以及两种可扩展GP近似:层次先验近似(HPA)和稀疏精度近似(SPA)。镜面反射通过基于DUCKNet的掩码流水线处理,该流水线从重建目标中排除损坏像素。在C3VDv2结肠镜数据集上,最佳GPVAE变体相对于匹配的VAE基线,图像重建RMSE平均降低21.9%,最高降低26.1%。下游轨迹RMSE在经典视觉里程计和预训练PoseNet上平均降低12.7%,而每epoch训练时间平均增加27.3%。最后,GP后验提供每帧不确定性估计,反映时间支持并为修复帧提供置信度信号。

英文摘要

Endoscopic video analysis is essential for gastrointestinal diagnosis and computer-assisted interventions, but video sequences are routinely degraded by specular reflections, motion artifacts, and missing frames. These transient corruptions can distract clinicians, reduce image interpretability, and disrupt downstream tasks such as 3D reconstruction and navigation. Effective restoration therefore requires methods that exploit temporal continuity rather than treating frames in isolation. We introduce a Gaussian Process Prior Variational Autoencoder (GPVAE) framework for endoscopic video restoration that replaces the standard factorized latent prior with a temporal Gaussian process prior, enabling interpolation of missing frames with uncertainty-aware reconstruction. The framework combines endoscopy-specific encoders, including a convolutional EndoVAE backbone and pretrained Vision Transformer encoders from GastroNet-5M, with two scalable GP approximations: Hierarchical Prior Approximation (HPA) and Sparse Precision Approximation (SPA). Specular reflections are handled using a DUCKNet-based masking pipeline that excludes corrupted pixels from the reconstruction objective. On the C3VDv2 colonoscopy dataset, the best GPVAE variants reduced image reconstruction RMSE by 21.9% on average, and by up to 26.1%, relative to matched VAE baselines. Downstream trajectory RMSE was reduced by 12.7% on average across classical visual odometry and a pretrained PoseNet, at an average increase of 27.3% in training time per epoch. Finally, the GP posterior provides per-frame uncertainty estimates that reflect temporal support and offer a confidence signal for restored frames.
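The role of a temporal GP prior can be illustrated with exact GP regression on a toy one-dimensional latent trajectory. This is only a sketch: the paper uses scalable approximations (HPA and SPA) inside a VAE, whereas the snippet below uses a plain squared-exponential kernel with an assumed length scale and noise level, and the function names are hypothetical.

```python
import numpy as np


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two vectors of time stamps."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_interpolate(t_obs, z_obs, t_query, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at the query times."""
    k_oo = rbf(t_obs, t_obs) + noise * np.eye(len(t_obs))
    k_qo = rbf(t_query, t_obs)
    mean = k_qo @ np.linalg.solve(k_oo, z_obs)
    # Prior variance is 1 for this kernel; subtract what the observations explain.
    var = 1.0 - np.sum(k_qo * np.linalg.solve(k_oo, k_qo.T).T, axis=1)
    return mean, var


t_obs = np.array([0.0, 1.0, 2.0, 5.0, 6.0])   # frames 3 and 4 are missing
z_obs = np.sin(0.5 * t_obs)                   # toy 1-D latent trajectory
mean, var = gp_interpolate(t_obs, z_obs, np.array([2.0, 3.5]))
```

The variance at the observed time stamp is small, while the variance inside the gap is much larger. That difference is the per-frame confidence signal: frames with little temporal support are flagged as less certain.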

2606.19901 2026-06-19 cs.CV 新提交

Linear Recurrent Unit with Semantic Modulation for Image Super-Resolution

基于语义调制的线性递归单元用于图像超分辨率

Mingyu Choi, Woo Kyoung Han, Sunghoon Im, Kyong Hwan Jin

发表机构 * Korea University(高丽大学) DGIST(大邱庆北科学技术院)

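AI summary: Proposes a linear recurrent network equipped with a semantic modulating unit that achieves efficient image super-resolution through modulation, spatial categorization, and prototype-based enhancement, outperforming existing methods.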

Comments Accepted to CVPR 2026 Findings

AI中文摘要

线性递归单元(LRU)基于稳定线性递归的原则性设计,已在长程依赖任务上展现出有前景的准确性和鲁棒性。然而,其静态参数化和单扫描方法限制了其在二维视觉任务中的适用性。在本研究中,我们提出了一种基于LRU的恢复网络,并配备语义调制单元(SMU),以在单图像超分辨率中实现性能与效率的和谐平衡。SMU扮演三个关键角色:LRU调制、空间分类和通过学习原型进行特征增强。大量实验表明,我们的方法在定量和定性上均超越了近期最先进的方法。值得注意的是,我们的方法在计算复杂度与现有方法相当的情况下实现了更优的性能。源代码和模型可在以下网址获取:https://github.com/MingyuChoi-run/LSM

英文摘要

The linear recurrent unit (LRU), designed with a principled formulation for stable linear recurrence, has demonstrated promising accuracy and robustness on long-range dependency tasks. However, its static parameterization and single-scan method limit its applicability to 2D vision tasks. In this study, we propose an LRU-based restoration network with a semantic modulating unit (SMU) to achieve a harmonious balance between performance and efficiency in single-image super-resolution. The SMU plays three key roles: LRU modulation, spatial categorization, and feature enhancement through learned prototypes. Extensive experiments demonstrate that our method quantitatively and qualitatively surpasses recent state-of-the-art methods. Notably, our approach achieves superior performance with computational complexity on par with existing methods. The source code and models are available at https://github.com/MingyuChoi-run/LSM

2606.19897 2026-06-19 cs.RO 新提交

One-to-Two Acting: A Novel Framework for Single-arm Agent Action Expansion to Dual Arms

一对二执行:一种面向单臂智能体动作扩展至双臂的新框架

Youbin Yao, Nieqin Cao, Mingyan Li, Yan Ding, Fuqiang Gu, Chao Chen

发表机构 * Chongqing University(重庆大学) Xi’an Jiaotong-Liverpool University(西交利物浦大学) Lumos Robotics

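AI summary: Proposes ExS2D, a hierarchical action expansion framework that achieves dual-arm manipulation from single-arm supervision through temporal precedence extraction, subtask-guided action mapping, and collision-avoiding coordinated planning, reducing execution steps by 54.4% in simulation while maintaining the success rate.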

Comments 6 pages, 5 figures, 3 tables

AI中文摘要

双臂操作可以通过并行执行提高吞吐量,但收集双臂演示进行训练成本高且困难。我们提出ExS2D,一种层次化动作扩展框架,能够从单臂监督实现双臂操作。ExS2D首先从文本指令生成结构化子任务,同时显式捕获时间优先关系。然后通过观察中的子任务引导动作映射,将每个子任务落地为可执行动作。最后,由多模态大语言模型驱动的协调器执行考虑优先关系的动作分配和同步规划,以选择无碰撞的双臂执行。仿真实验表明,ExS2D在保持与单臂基线相当的成功率的同时,平均执行步骤减少了54.4%。在四个任务上的真实机器人实验进一步证明了ExS2D在少量单臂样本下进行双臂执行的可靠性,且未使用任何双臂演示。

英文摘要

Dual-arm manipulation can improve throughput via parallel execution, but collecting bimanual demonstrations for training is costly and difficult. We present ExS2D, a hierarchical action expansion framework that enables dual-arm manipulation from single-arm supervision. ExS2D first generates structured subtasks from textual instructions while explicitly capturing temporal precedence. It then grounds each subtask into executable actions through subtask-guided action mapping in observation. Finally, a coordinator driven by a multimodal large language model performs precedence-aware action allocation and synchronized planning to select collision-free dual-arm executions. Simulation experiments demonstrate that ExS2D reduces the average number of execution steps by 54.4% while maintaining a success rate comparable to that of a single-arm baseline. Real-robot experiments on four tasks further demonstrate the reliability of ExS2D for dual-arm execution under few-shot single-arm samples, while using zero bimanual demonstrations.

2606.19894 2026-06-19 cs.LG 新提交

Score Approximation for Diffusion Models on Arbitrary Low-Dimensional Structures

任意低维结构上扩散模型的分数近似

Xinhe Mu, Zaijiu Shang, Zhaoqi Zhou, Chuan Zhou, Qi Meng, Guiying Yan, Zhiming Ma

发表机构 * Shanghai Institute for Mathematics and Interdisciplinary Sciences(上海数学与交叉科学研究院) Huawei Technologies Co., Ltd.(华为技术有限公司)

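AI summary: For arbitrary compactly supported distributions, proposes a score approximation method based on a discrete mixture and proves that the ReLU network complexity grows exponentially only with the upper Minkowski dimension d, breaking the curse of ambient dimensionality and explaining why diffusion models work on non-smooth data.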

AI中文摘要

基于分数的扩散模型的显著成功激发了大量建立其理论基础的努力。然而,现有的分数近似复杂度界限严重依赖于限制性假设,如Lipschitz连续密度或光滑流形支撑,而这些假设通常被真实感知数据固有的奇异性、尖锐边界和不连续簇所违反。本文建立了一个通用的分数近似定理,适用于任何支撑在任意上Minkowski维数为$d$的紧集上的分布。通过一种新颖的离散混合公式,我们证明了分数函数可以用ReLU网络近似,其复杂度仅随$d$指数增长,从而打破了环境维数的指数诅咒。结合现有关于精确求解任意紧分布的反向扩散SDE的理论,我们的工作表明扩散模型能够自适应地处理不规则、非光滑的数据结构,解释了它们在真实生成任务中的能力。

英文摘要

The remarkable success of score-based diffusion models has spurred significant efforts to establish their theoretical foundations. However, existing complexity bounds for score approximation rely heavily on restrictive assumptions like Lipschitz continuous densities or smooth manifold supports, which are routinely violated by the singularities, sharp boundaries, and disjoint clusters inherent to real-world perceptual data. This work establishes a universal score approximation theorem that works for any distribution supported on any compact set of upper Minkowski dimension $d$. Using a novel discrete-mixture formulation, we prove that the score function can be approximated with a ReLU network whose complexity grows exponentially only with $d$, thus breaking the exponential curse of ambient dimensionality. Combined with existing theories on accurately solving the backward diffusion SDE for arbitrary compact distributions, our work shows that diffusion models readily adapt to irregular, non-smooth data structures, explaining their competence in real-world generative tasks.
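The abstract does not spell out the discrete-mixture formulation. As an illustration of why a finite mixture is convenient, assume a forward noising of the form $x_t = \alpha_t x_0 + \sigma_t \varepsilon$ with standard Gaussian $\varepsilon$, and an atomic data distribution $\sum_{i=1}^{n} w_i \delta_{x_i}$. Both the noising schedule and the notation are assumptions made here, not taken from the paper. Under these assumptions the noised density and its score have closed forms:

```latex
p_t(x) = \sum_{i=1}^{n} w_i \, \mathcal{N}\!\left(x;\ \alpha_t x_i,\ \sigma_t^2 I\right),
\qquad
\nabla_x \log p_t(x) = \sum_{i=1}^{n} \pi_i(x)\, \frac{\alpha_t x_i - x}{\sigma_t^2},
\qquad
\pi_i(x) = \frac{w_i \exp\!\left(-\|x - \alpha_t x_i\|^2 / (2\sigma_t^2)\right)}
                {\sum_{j=1}^{n} w_j \exp\!\left(-\|x - \alpha_t x_j\|^2 / (2\sigma_t^2)\right)}.
```

The score is a softmax-weighted average of directions toward the scaled atoms, so no density or smoothness assumption on the data is needed to write it down. Covering a compact set of upper Minkowski dimension $d$ at scale $\epsilon$ takes on the order of $\epsilon^{-d}$ atoms, which gives some intuition for a complexity that depends on $d$ rather than on the ambient dimension.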

2606.19893 2026-06-19 cs.AI 新提交

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

MetaResearcher: 通过对抗虚拟环境中的自我反思强化学习扩展深度研究

Wei Yu, Suxing Liu, Minjie Yu, Jiahao Wang, Zhijian Zheng, Haocheng Deng, Bing Li

发表机构 * School of Digital Arts, Jiangxi Arts & Ceramics Technology Institute(江西陶瓷工艺美术职业技术学院数字艺术学院) Universiti Sains Malaysia(马来西亚理科大学)

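AI summary: Proposes the MetaResearcher framework, which scales the training of deep research agents in adversarial environments through an evolving virtual world, discovery-oriented tasks, a self-reflective meta-reward, and a heterogeneous multi-agent architecture, aiming to improve benchmark performance and epistemic robustness.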

AI中文摘要

深度研究智能体在自主信息收集和综合方面展现了卓越的能力,但其训练仍受限于模拟环境的静态性、仅限事实检索的任务设计的局限性以及基于结果的强化学习的低效性。在这项工作中,我们提出了MetaResearcher,一个新颖的框架,在四个协同维度上扩展深度研究智能体的训练。首先,我们引入了一个演化虚拟世界,将时间动态和对抗性错误信息注入训练环境,迫使智能体发展来源可信度评估和时间冲突解决技能。其次,我们设计了发现导向任务——包括假设生成和矛盾解决——超越了简单的事实检索,推动智能体走向真正的研究行为。第三,我们在GRPO框架内提出了一种自我反思元奖励机制,共同优化答案正确性、搜索路径效率、反思深度和工具调用多样性,直接解决了先前工作中观察到的重复动作循环问题。第四,我们引入了一个异构多智能体群体架构,包括专门的侦察、过滤和合成模型,通过协调强化学习学习协作研究策略。基于LiteResearcher基础设施,MetaResearcher在训练中需要零边际API成本,同时目标是在基准性能(GAIA,Xbench-DS)和对抗条件下的认知鲁棒性方面实现显著改进。我们展示了完整的框架设计、训练方法和计划的实验验证。

英文摘要

Deep research agents have demonstrated remarkable capabilities in autonomous information gathering and synthesis, yet their training remains constrained by the static nature of simulated environments, the limits of fact-retrieval-only task designs, and the inefficiency of outcome-based reinforcement learning. In this work, we propose MetaResearcher, a novel framework that scales deep research agent training across four synergistic dimensions. First, we introduce an Evolving Virtual World that injects temporal dynamics and adversarial misinformation into the training environment, forcing agents to develop source credibility assessment and temporal conflict resolution skills. Second, we design Discovery-Oriented Tasks -- including hypothesis generation and contradiction resolution -- that transcend simple fact retrieval and push agents toward genuine research behaviors. Third, we propose a Self-Reflective Meta-Reward mechanism within the GRPO framework that jointly optimizes for answer correctness, search path efficiency, reflection depth, and tool call diversity, directly addressing the repetitive action loop problem observed in prior work. Fourth, we introduce a Heterogeneous Multi-Agent Swarm architecture comprising specialized Scout, Filter, and Synthesizer models that learn collaborative research strategies through coordinated reinforcement learning. Built upon the LiteResearcher infrastructure, MetaResearcher requires zero marginal API cost for training while targeting substantial improvements in both benchmark performance (GAIA, Xbench-DS) and epistemic robustness under adversarial conditions. We present the complete framework design, training methodology, and planned experimental validation.

2606.19891 2026-06-19 cs.LG 新提交

Adversarial Bandit Optimization with Globally Bounded Perturbations to Convex Losses

具有全局有界扰动的凸损失对抗性赌博机优化

Zhuoyu Cheng, Kohei Hatano, Eiji Takimoto

发表机构 * Department of Informatics, Kyushu University(九州大学信息学系) RIKEN AIP(理化学研究所革新智能综合研究中心)

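AI summary: Studies adversarial bandit optimization in which the loss functions may be non-convex and non-smooth, proposes a modified bandit optimization algorithm, analyzes how the perturbation budget affects regret, and extends the globally budgeted, post-action perturbation model from linear losses to general convex and β-smooth losses.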

AI中文摘要

我们研究对抗性赌博机优化,其中损失函数可能非凸且非光滑。在每一轮中,学习者选择一个动作并仅观察该动作产生的损失。损失由一个潜在的凸且β-光滑分量和一个对抗性扰动组成,该扰动可能在观察学习者的动作后选择。扰动受全局预算约束,控制其随时间累积的幅度。该框架将全局预算的后行动扰动模型从线性损失扩展到一般凸且β-光滑损失。对于这个更广泛的类别,我们建立了期望遗憾保证,明确刻画了扰动预算的影响。为了建立这些保证,我们修改了一个标准的赌博机优化算法,并开发了一种分析来控制由扰动引起的额外遗憾。在没有扰动的情况下,我们的结果退化为具有β-光滑损失的标准赌博机凸优化设置的遗憾保证。

英文摘要

We study adversarial bandit optimization in which the loss functions may be non-convex and non-smooth. In each round, the learner selects an action and observes only the loss incurred at that action. The loss consists of an underlying convex and $β$-smooth component and an adversarial perturbation that may be chosen after observing the learner's action. The perturbations are subject to a global budget controlling their cumulative magnitude over time. This framework extends the globally budgeted, post-action perturbation model from underlying linear losses to general convex and $β$-smooth losses. For this broader class, we establish expected regret guarantees that explicitly characterize the effect of the perturbation budget. To establish these guarantees, we modify a standard bandit optimization algorithm and develop an analysis that controls the additional regret caused by the perturbations. In the absence of perturbations, our results reduce to regret guarantees for the standard bandit convex optimization setting with $β$-smooth losses.

2606.19889 2026-06-19 cs.CV 新提交

SurgVista: Long-Horizon Surgical World Modeling with Plausible Instrument-Tissue Dynamics

SurgVista:具有合理器械-组织动力学的长程手术世界建模

Wentao Pan, Wuyang Li, Shengyuan Liu, Xinyu Liu, Hengyu Liu, Yixuan Yuan

发表机构 * The Chinese University of Hong Kong(香港中文大学) EPFL(瑞士联邦理工学院洛桑) Imperial College London(伦敦帝国学院)

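AI summary: Proposes SurgVista, a surgical world model that uses Deformation Consistency Regularization and Drift Adaptation Training to address spatial interaction incoherence and temporal fidelity collapse, clearly outperforming existing methods on long-horizon prediction.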

AI中文摘要

将机器人策略学习扩展到自主手术面临挑战,因为专家演示成本高昂且体内探索存在重大安全风险。手术世界模型通过从初始观测生成逼真的、动作条件下的未来帧来解决这一问题,但现有方法存在两种持续失效模式:空间交互不连贯,即可见器械接触未能引起空间一致的组织变形;以及时间保真度崩溃,即预测误差在自回归展开中累积并逐渐破坏视觉质量。我们提出SurgVista,一种通过两种训练策略缓解这两种失效的手术世界模型。变形一致性正则化从训练视频中提取场景点轨迹,并通过潜在对比学习强制跨帧一致性,增强物理一致的器械-组织动力学。漂移适应训练通过用在线预测残差和根据长程漂移统计校准的光度增强扰动条件帧,减轻长程漂移,在扩展展开中维持视觉保真度。为了进行严格评估,我们进一步引入SurgWorld-Bench,包含多样化的手术类型、长程展开以及用于器械运动精度和组织响应保真度的解耦指标。大量实验表明,SurgVista在视觉质量、时间一致性和交互保真度方面持续优于最先进方法,且随着预测视界增长优势扩大。

英文摘要

Scaling robot policy learning for autonomous surgery is challenging, as expert demonstrations are expensive and in vivo exploration poses substantial safety risks. Surgical world models address this by generating realistic, action-conditioned future frames from an initial observation, but existing methods exhibit two persistent failure modes: spatial interaction incoherence, where visible instrument contact fails to induce spatially consistent tissue deformation, and temporal fidelity collapse, where prediction errors compound across autoregressive rollouts and progressively corrupt visual quality. We present SurgVista, a surgical world model that mitigates both failures through two training recipes. Deformation Consistency Regularization extracts scene-point trajectories from training videos and enforces cross-frame coherence through latent contrastive learning, strengthening physically consistent instrument-tissue dynamics. Drift Adaptation Training mitigates long-horizon drift by perturbing conditioning frames with online prediction residuals and photometric augmentations calibrated to long-horizon drift statistics, sustaining visual fidelity over extended rollouts. To enable rigorous evaluation, we further introduce SurgWorld-Bench, featuring diverse procedure types, long-range rollouts, and decoupled metrics for instrument-motion accuracy and tissue-response fidelity. Extensive experiments show that SurgVista consistently outperforms state-of-the-art methods across visual quality, temporal consistency, and interaction fidelity, with gains widening as the prediction horizon grows.

2606.19888 2026-06-19 cs.LG cs.AI 新提交

SL-S4Wave: Self-Supervised Learning of Physiological Waveforms with Structured State Space Models

SL-S4Wave:基于结构化状态空间模型的生理波形自监督学习

Feng Wu, Harsh Deep, Eric Lehman, Sanyam Kapoor, Guoshuai Zhao, Rahul Krishnan, Gari Clifford, Li-wei H Lehman

发表机构 * Massachusetts Institute of Technology(麻省理工学院) OpenEvidence, USA(OpenEvidence(美国)) New York University(纽约大学) Xi’an Jiaotong University(西安交通大学) University of Toronto(多伦多大学) Emory University(埃默里大学)

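AI summary: Proposes the SL-S4Wave framework, which combines contrastive learning with an encoder based on structured state space models and uses global convolution with multiscale subkernels to capture local and long-range dependencies in multichannel physiological waveforms, outperforming existing methods on arrhythmia detection and other tasks.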

AI中文摘要

由于高采样率、多通道信号复杂性、固有噪声和有限的标记数据,对长序列医学时间序列数据(如心电图)进行建模面临重大挑战。尽管最近基于各种编码器架构(如卷积神经网络)的自监督学习方法被提出用于从未标记数据中学习表示,但它们往往在捕获长程依赖和噪声不变特征方面存在不足。结构化状态空间模型擅长长序列建模,但现有的S4架构无法捕获多通道生理波形的独特特征。在这项工作中,我们提出了SL-S4Wave,一个自监督学习框架,它将对比学习与基于结构化状态空间模型的定制编码器相结合。该编码器利用多尺度子核实现多层全局卷积,从而能够在嘈杂的高分辨率多通道波形中捕获细粒度局部模式和长程时间依赖。在真实世界数据集上的大量实验表明,SL-S4Wave(1)在具有挑战性的心律失常检测任务中持续优于最先进的监督和自监督基线,(2)使用显著更少的标记示例实现高性能,展示了强大的标签效率,(3)在长波形片段上保持稳健性能,突出了其对大多数现有方法无法有效建模的长序列中复杂时间动态的建模能力,以及(4)有效迁移到未见的心律失常类型,强调了其强大的跨域泛化能力。我们还在多个EEG任务上评估了SL-S4Wave,在强基线上取得了优越性能,证明了我们的方法在心脏波形之外的泛化能力。

英文摘要

Modeling long-sequence medical time series data, such as electrocardiograms (ECG), poses significant challenges due to high sampling rates, multichannel signal complexity, inherent noise, and limited labeled data. While recent self-supervised learning (SSL) methods, based on various encoder architectures such as convolutional neural networks, have been proposed to learn representations from unlabeled data, they often fall short in capturing long-range dependencies and noise-invariant features. Structured state space models (S4) excel at long-sequence modeling, but existing S4 architectures fail to capture the unique characteristics of multichannel physiological waveforms. In this work, we propose SL-S4Wave, a self-supervised learning framework that combines contrastive learning with a tailored encoder built on structured state space models. The encoder incorporates multi-layer global convolution using multiscale subkernels, enabling the capture of both fine-grained local patterns and long-range temporal dependencies in noisy, high-resolution multichannel waveforms. Extensive experiments on real-world datasets demonstrate that SL-S4Wave (1) consistently outperforms state-of-the-art supervised and self-supervised baselines in a challenging arrhythmia detection task, (2) achieves high performance with significantly fewer labeled examples, showcasing strong label efficiency, (3) maintains robust performance on long waveform segments, highlighting its capacity to model complex temporal dynamics in long sequences that most existing approaches fail to model efficiently, and (4) transfers effectively to unseen arrhythmia types, underscoring its robust cross-domain generalization. We additionally evaluate SL-S4Wave on multiple EEG tasks, achieving superior performance over strong baselines and demonstrating the generalizability of our approach beyond cardiac waveforms.

2606.19883 2026-06-19 cs.LG stat.ML New submission

Matching Markets meet Cumulative Prospect Theory: Towards Optimal and Adversarially Robust Learning

匹配市场遇上累积前景理论:迈向最优和对抗鲁棒学习

Ananya Kunisetty, Avishek Ghosh

Affiliations: Indian Institute of Technology Bombay

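AI summary: Studies a multi-agent multi-armed bandit problem in competitive two-sided matching markets under cumulative prospect theory (CPT), proposes algorithms with optimal regret bounds, and extends them to adversarial markets.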

Comments Accepted at ECML-PKDD 2026, Naples, Italy

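Details
AI abstract (translated from Chinese)

We study a multi-agent multi-armed bandit problem with two-sided matching markets in a competitive setting, under a human-centric decision-making model. To capture human preferences, we use cumulative prospect theory (CPT), which weights the agent's actions nonlinearly through an ($\alpha$-Hölder continuous) weight function. CPT has been widely used in behavioral economics and risk-sensitive machine learning to emulate human preferences. We analyze the state-of-the-art learning algorithm with CPT weight-distorted rewards and obtain a player-optimal regret bound of $\mathcal{O}(K\log T \left(\frac{1}{\Delta}\right)^{2/\alpha})$, where $K$ is the number of arms, $T$ is the learning horizon, and $\Delta$ is the (suitably defined) minimum preference gap of the players. Noting that the dependence on $\Delta$ is sub-optimal, we further improve this regret by judiciously selecting the active set of arms during exploration, which removes the dependence on $K$ in the dominant term and yields an improved (optimal) regret guarantee when the number of arms $K$ is significantly larger than the number of players $N$. We also consider adversarial markets in which the agents' observed rewards may be corrupted. We propose and analyze robust market algorithms with CPT as the risk-sensitive measure, both when the total corruption budget is known and when it is unknown, and establish logarithmic player-optimal regret guarantees in both cases.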

English abstract

We study a multi-agent multi-armed bandit problem in the competitive setup with two-sided matching markets under a human-centric decision-making model. To capture human preferences, we use cumulative prospect theory (CPT), which weighs the actions of the agent in a nonlinear fashion using an ($\alpha$-Hölder continuous) weight function. CPT has been widely used in behavioral economics and risk-sensitive machine learning to emulate human preferences. We analyze the state-of-the-art learning algorithm with CPT weight-distorted rewards and obtain a player-optimal regret of $\mathcal{O}(K\log T \left(\frac{1}{\Delta}\right)^{2/\alpha})$, where $K$ denotes the number of arms, $T$ is the learning horizon, and $\Delta$ represents the (suitably defined) players' minimum preference gap. Noticing the dependence on $\Delta$ to be sub-optimal, we further improve this regret by judiciously selecting the active set of arms during exploration, which removes the dependence on $K$ in the dominant term and achieves an improved (optimal) regret guarantee in the setting where the number of arms $K$ is significantly larger than the number of players $N$. In addition, we consider adversarial markets where the observed rewards of the agents may be corrupted. We propose and analyze algorithms for robust markets with CPT as the risk-sensitive measure, both when the total corruption budget is known and when it is unknown, and establish logarithmic player-optimal regret guarantees in both cases.
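
To make the CPT weighting concrete, here is a minimal sketch of how a probability weight function distorts the value of a discrete, nonnegative reward. It is not the paper's algorithm. It assumes the power weight $w(p) = p^{\alpha}$ with $0 < \alpha \le 1$ (one simple example of an $\alpha$-Hölder continuous function on $[0, 1]$), an identity utility, and gains only. The names `power_weight` and `cpt_value` are hypothetical.

```python
def power_weight(p, alpha):
    """Assumed weight function w(p) = p**alpha, with 0 < alpha <= 1."""
    return p ** alpha


def cpt_value(outcomes, probs, alpha=0.5):
    """CPT-distorted value of a discrete nonnegative reward (gains only).

    With outcomes sorted ascending, outcome x_i is weighted by
    w(P(X >= x_i)) - w(P(X > x_i)) instead of by its raw probability.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    value = 0.0
    tail = 1.0  # P(X >= current outcome)
    for x, p in sorted(zip(outcomes, probs)):
        next_tail = max(tail - p, 0.0)  # clamp floating-point round-off
        value += x * (power_weight(tail, alpha) - power_weight(next_tail, alpha))
        tail = next_tail
    return value


if __name__ == "__main__":
    # A reward of 1 that occurs with probability 0.1.
    outcomes, probs = [0.0, 1.0], [0.9, 0.1]
    print(cpt_value(outcomes, probs, alpha=1.0))  # alpha = 1 gives the plain mean
    print(cpt_value(outcomes, probs, alpha=0.5))  # rare gain is overweighted
```

Setting `alpha=1.0` recovers the ordinary expectation, while smaller values of `alpha` inflate the weight given to rare large rewards. The same exponent appears in the regret bound quoted in the abstract: with $\alpha = 1$ the gap term is $(1/\Delta)^{2}$, and with $\alpha = 0.5$ it grows to $(1/\Delta)^{4}$.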