arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.19727 2026-05-20 cs.CV

Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence

Zebin He, Mingxin Yang, Shuhui Yang, Hanxiao Sun, Xintong Han, Chunchao Guo, Wenhan Luo

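AI summary This paper proposes Tango3D, a 3D foundation model that unifies dense correspondence and global retrieval. A geometry-aware 2D visual backbone and a pretrained 3D VAE encode images into 2D patches and point clouds into 3D tokens, which are mapped into a shared space for local pixel-to-point alignment and global semantic alignment.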

Abstract

Existing 3D foundation models typically align point clouds to frozen vision-language spaces like CLIP, achieving strong cross-modal retrieval by compressing each 3D shape into a global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To solve this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone and a pretrained 3D VAE to encode images into 2D patches and point clouds into 3D tokens. These are mapped into a single shared space to achieve both local pixel-to-point alignment and global semantic alignment. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show our model successfully achieves object-level pixel-to-point alignment while maintaining competitive global retrieval, a joint capability not offered by existing 3D foundation models. By establishing a fine-grained alignment feature space, Tango3D injects rich semantics into purely geometric 3D tokens, paving the way for a wide range of dense 3D downstream tasks.

2605.19726 2026-05-20 cs.CV

Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention

Wenhu Zhang, Yiming Wu, Huanyu Wang, Yaoyang Liu, Huanzhang Dou, Senqiao Yang, Sitong Wu, Hanbin Zhao, Jiaya Jia

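AI summary This paper proposes Block Approximate Sparse Attention (BA-Att), a framework that uses a block-wise pre-downsampling operation to identify informative regions without relying on brittle positional priors, improving computational efficiency while preserving performance. Experiments show attention computation up to 6.95x faster than FlashAttention and near full-attention performance at 50% sparsity.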

Comments CVPR 2026 Findings paper

Abstract

Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, but scaling them to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks by fixed sampling patterns over the high-resolution attention space, such as tail regions or anti-diagonal stripes. Such prior-driven sampling can miss salient tokens and introduce instability under distribution shifts. In this paper, we propose the Block Approximate Sparse Attention framework (BA-Att), whose block-wise pre-downsampling operation identifies informative regions within a compact downsampled space, avoiding reliance on brittle positional priors. To analyze its theoretical behavior, we define an oracle post-downsample attention map and formalize the approximation error between pre- and post-downsample schemes. Based on this insight, we introduce a lightweight norm-sorting module and a covariance-compensated correction that approximates full covariance using diagonal QK variances, reducing computational complexity. Extensive experiments show that our operator achieves up to 6.95x acceleration over FlashAttention in attention computation, and maintains near full-attention performance at 50% sparsity across language models, multimodal language models, and video generation models, demonstrating strong efficiency and generalization.
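
The idea of choosing blocks in a downsampled space can be illustrated with a minimal sketch. It assumes mean pooling as the downsampling step and a plain dot product between pooled query and key blocks as the score. The names `mean_pool`, `select_key_blocks`, and the `keep` parameter are hypothetical, and the sketch leaves out the norm-sorting module and the covariance-compensated correction described in the abstract.

```python
def mean_pool(rows, block):
    """Average consecutive groups of `block` vectors into one vector each."""
    pooled = []
    for start in range(0, len(rows), block):
        chunk = rows[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(r[d] for r in chunk) / len(chunk) for d in range(dim)])
    return pooled


def select_key_blocks(queries, keys, block, keep):
    """For each query block, return indices of the `keep` highest-scoring key blocks.

    Scores are computed between pooled (downsampled) blocks, so the full
    token-by-token attention map is never materialized. This is an
    illustrative assumption, not the paper's exact operator.
    """
    q_blocks = mean_pool(queries, block)
    k_blocks = mean_pool(keys, block)
    selected = []
    for q_vec in q_blocks:
        scores = [sum(a * b for a, b in zip(q_vec, k_vec)) for k_vec in k_blocks]
        order = sorted(range(len(k_blocks)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(order[:keep]))
    return selected


# Toy input: four tokens with 2-D features, grouped into two blocks of two.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
block_choice = select_key_blocks(tokens, tokens, block=2, keep=1)
```

With `keep` set to half the number of key blocks, this selection would correspond to roughly 50% block sparsity; dense attention would then be computed only inside the selected blocks.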

2605.19723 2026-05-20 cs.CL cs.AI

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

Husnain Amjad, Raja Khurram Shahzad, Aamir Shahzad, Mehwish Fatima

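AI summary This survey reviews recent progress in mathematical reasoning with large language models, analyzing datasets, architectures, training strategies, and evaluation protocols, and discusses benchmarks, architecture design, evaluation methods, and open research challenges.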

Abstract

Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.

2605.19721 2026-05-20 cs.AI cs.LG cs.NI

Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization

Franco Terranova, Guillermo Bernardez, Albert Cabellos-Aparicio, Nina Miolane, Abdelkader Lahmadi

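AI summary This paper proposes a new RL-GCO approach that operates directly in a continuous GNN-based action embedding space to solve graph combinatorial optimization problems efficiently, improving generalization and scalability.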

Comments Preprint

Abstract

Graph combinatorial optimization (GCO) has attracted growing interest, as many NP-hard problems naturally admit graph formulations, yet their combinatorial explosion renders exact methods computationally intractable. Recent advances in Reinforcement Learning (RL) combined with Graph Neural Networks (GNNs) have significantly improved learning-based GCO solvers. However, existing approaches face limitations in both generalization across diverse graph instances and computational scalability as action spaces grow. To address both challenges, we introduce projection agents, a novel RL-GCO approach that operates directly in a continuous GNN-based action embedding space, predicting a desired latent action in a single forward pass and subsequently decoding it into a valid discrete action. Additionally, we enable fair comparison across RL methods through a shared embedding space for both observations and actions. Across diverse benchmarks, our approach achieves up to 16.2x faster inference and up to 40% better generalization than existing solutions using only simple nearest-neighbor decoding, while opening the door to strong RL performance in super-linear decision spaces with multiple interdependent variables. Finally, we release LaGCO-RL, a Python library that automates latent action-space construction and supports existing RL-GCO solutions, promoting reproducibility and adaptation to new GCO benchmarks.
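
The decoding step mentioned in the abstract (predict a latent action, then map it to a valid discrete action by nearest neighbor) might look like the sketch below. It assumes Euclidean distance and a dictionary of precomputed action embeddings; the function name `decode_nearest` and the toy action names are hypothetical and are not taken from the LaGCO-RL library.

```python
import math


def decode_nearest(latent_action, action_embeddings, valid_actions):
    """Return the valid discrete action whose embedding is closest to the latent action.

    latent_action: vector predicted by the policy in one forward pass.
    action_embeddings: dict mapping each discrete action to its embedding vector.
    valid_actions: the actions that are currently legal (a masking step).
    """
    if not valid_actions:
        raise ValueError("no valid actions to decode into")
    return min(
        valid_actions,
        key=lambda action: math.dist(latent_action, action_embeddings[action]),
    )


# Toy example with three hypothetical graph actions embedded in 2-D.
embeddings = {"add_edge": [0.0, 0.0], "remove_edge": [1.0, 1.0], "swap": [5.0, 5.0]}
chosen = decode_nearest([0.9, 1.2], embeddings, ["add_edge", "remove_edge", "swap"])
```

Because the policy outputs a single vector regardless of how many actions exist, the cost of the forward pass does not grow with the action space; only the nearest-neighbor lookup does.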

2605.19718 2026-05-20 cs.CL

CAIT: A Syntactic Parsing Toolkit for Child-Adult InTeractions

Francesca Padovani, Xiulin Yang, Bastian Bunzeck, Jaap Jumelet, Yevgen Matusevych, Nathan Schneider, Arianna Bisazza

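AI summary This paper presents CAIT, a syntactic parsing toolkit tailored to CHILDES data. It trains a state-of-the-art dependency parser and accompanying taggers that parse syntactic patterns in child-adult interactions more accurately, supporting large-scale, reproducible research on language acquisition.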

Abstract

CHILDES is a paramount resource for language acquisition studies, yet computational tools for analyzing its syntactic structure remain limited. Leveraging the recent release of the UD-English-CHILDES treebank with gold-standard Universal Dependencies (UD) annotations, we train a state-of-the-art dependency parser specifically tailored to CHILDES. The parser more accurately captures syntactic patterns in child-adult interactions, outperforming widely used off-the-shelf English parsers, including SpaCy and Stanza. Alongside the parser, we also release a Part-of-Speech tagger and an utterance-level construction tagger, which together form the open-source Syntactic Parsing Toolkit for Child-Adult InTeractions (CAIT). Through a detailed error analysis and a case study tracking the distribution of syntactic constructions across developmental time in CHILDES, we demonstrate the practical utility of the toolkit for large-scale, reproducible research on language acquisition.

2605.19717 2026-05-20 cs.CV

Physics-in-the-Loop: A Hybrid Agentic Architecture for Validated CAD Engineering Design

Elias Berger, Muhammad Usama, Jan Mehlstäubl, Bernhard Saske, Kristin Paetzold-Byhain

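AI summary This paper proposes a hybrid agentic-physical architecture that embeds validated knowledge-based engineering tools directly into the decision-making loop of autonomous AI agents, addressing the lack of physical understanding in LLM-generated CAD designs. Explicit physical verification guides a closed-loop, sequential decision-making process, improving the physical correctness of the generated designs.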

Comments Accepted in IJCAI-ECAI 2026 (Special Track on AI4Tech)

Abstract

Large Language Models (LLMs) can generate Computer-Aided Design (CAD), yet lack the physical comprehension required for reliable engineering design. Instead of attempting to implicitly learn physical laws from data, we propose a Hybrid Agentic-Physical Architecture that embeds validated knowledge-based engineering tools directly into the decision-making loop of autonomous AI agents. In this framework, engineering design is formulated as a closed-loop, sequential decision-making process guided by explicit physical verification. Based on a load case, dedicated agents iteratively plan, generate, evaluate, and revise engineering designs using knowledge-based tools as a feedback signal. We introduce a benchmark dataset and metrics for assessing functional validity in generative CAD. Our system generates more complex and physically verified designs, with a 4.2 increase in structural complexity and a 3.5% improvement in compile rate compared to similar agentic methods. The codebase, prompts and dataset will be made publicly available to support reproducibility and future research.

2605.19714 2026-05-20 cs.CL

LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets

Mona H. Albaqawi, Eman M. Albalkhi, Joud A. Albaiti, Enrico Lopedoto

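AI summary This paper presents an Arabic NLP framework for large-scale financial sentiment analysis of the Saudi market. It combines official financial news with social media data, builds an Arabic financial corpus through a multi-stage pipeline, links mentions to companies using Transformer-based NER and a curated company lexicon, assigns sentiment labels, and supports company-level sentiment aggregation and analysis of sentiment dynamics.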

Comments Accepted at the 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT7), co-located with LREC 2026, Palma de Mallorca, Spain, May 2026. ISBN: 978-2-493814-52-4

Abstract

Investor sentiment shapes financial markets, yet modeling sentiment in Arabic financial contexts remains challenging due to linguistic complexity and limited resources. We present an Arabic NLP framework for large-scale financial sentiment analysis tailored to the Saudi market, integrating official financial news and social media to capture institutional and public investor sentiment. The framework constructs a large Arabic financial corpus through a multi-stage pipeline encompassing data collection, cleaning, deduplication, entity linking, and sentiment annotation. Transformer-based NER combined with a curated company lexicon links textual mentions to canonical company identifiers, with sentiment labels assigned using a five-class scheme. The resulting dataset of 84K samples supports company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange. Experimental results demonstrate reliable and scalable Arabic financial sentiment analysis.

2605.19712 2026-05-20 cs.CV

Physics-informed simulation framework for realistic sonar image generation and statistical validation

Kamal Basha S, Athira Nambiar

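AI summary This paper presents ACOUSIM, a physics-informed simulation framework for generating realistic sonar images and validating them statistically. By comparing the statistical properties of synthetic and real sonar images, it establishes a reproducible distribution-level baseline.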

Abstract

Synthetic sonar datasets offer a scalable alternative to costly real-world acquisition, yet their utility remains limited by the absence of rigorous quantitative validation. We present ACOUSIM (ACOustic SIMulation and Validation Platform), a physics-informed framework that evaluates the statistical alignment between synthetic and real sonar imagery without relying on generative models. A Gazebo-based environment generates sonar-like images by explicitly controlling seabed texture, illumination-driven shadowing, platform altitude, and noise. Realism is quantified against two public sonar datasets, SeabedObjects-KLSG-II and Sonar Common Target Detection (SCTD), using global intensity and local texture (LBP) distributions assessed via Kullback-Leibler divergence, Jensen-Shannon divergence, and Earth Mover's Distance. Results show strong texture alignment (KL < 0.07) across all classes, with plane-class intensity alignment outperforming ship-class due to shadow geometry complexity. ACOUSIM establishes a reproducible, distribution-level baseline for sim-to-real sonar evaluation and directly supports reliable dataset validation for underwater image analysis.
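
The three distribution-level measures named in the abstract can be sketched on a pair of histograms as follows. This is a minimal illustration that assumes equal-width bins with unit spacing and natural logarithms; the function names and the small `eps` smoothing term are hypothetical choices, not the paper's implementation.

```python
import math


def normalize(hist):
    """Turn raw bin counts into a probability distribution."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive mass")
    return [h / total for h in hist]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against division by zero in empty bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def emd_1d(p, q):
    """1-D Earth Mover's Distance: summed absolute difference of the two CDFs."""
    running, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        running += pi - qi
        total += abs(running)
    return total


# Hypothetical 4-bin intensity histograms for a synthetic and a real image set.
synthetic = normalize([10, 30, 40, 20])
real = normalize([12, 28, 38, 22])
scores = (kl_divergence(synthetic, real), js_divergence(synthetic, real), emd_1d(synthetic, real))
```

The same functions apply to LBP texture histograms; only the binning of the input changes.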

2605.19711 2026-05-20 cs.CL

Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

Yun Hao, Reihaneh Amooie, Wietse de Vries, Rik van Noord, Martijn Wieling

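AI summary This study examines whether large language models can improve automatic speech recognition (ASR) for a low-resource language, West Frisian, through generative error correction (GER). It finds that GER improves ASR performance in most settings and provides a detailed error analysis revealing the models' correction patterns.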

Comments Submitted to Interspeech 2026

Abstract

Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error correction (GER), but their effectiveness in low-resource settings remains underexplored. In addition, it remains unclear to what extent data contamination influences the reported improvements in LLM-based GER. This study investigates LLM-based GER for low-resource Frisian. In addition to a public corpus, we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination. Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability. We further provide a detailed error analysis revealing model correction patterns.
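
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between hypothesis and reference divided by the reference length. The sketch below also computes an oracle WER under the common assumption that "oracle" means the best hypothesis in an N-best list; the abstract does not define the term, so that reading and the function names are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of one hypothesis against one reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def oracle_wer(reference, nbest):
    """Lowest WER achievable by picking a single hypothesis from the N-best list."""
    return min(wer(reference, hyp) for hyp in nbest)


# Placeholder tokens stand in for words; no real transcripts are used here.
best_in_list = oracle_wer("a b c", ["a x c", "a b c d", "x y z"])
```

Under this reading, a corrector can go below the oracle WER only by producing a transcript that is not in the N-best list, which is why the abstract treats that result as notable.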

2605.19703 2026-05-20 cs.RO

KIO-planner: Attention-Guided Single-Stage Motion Planning with Dual Mapping for UAV Navigation

Dexing Yao, Haochen Li, Junhao Wei, Yifu Zhao, Yanxiao Li, Jiahui Xu, Jinxuan Hu, Lele Tian, Baili Lu, Zikun Li, Xu Yang, Sio-Kei Im, Dingcheng Yang, Yapeng Wang

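AI summary This paper proposes KIO-planner, an attention-guided single-stage trajectory planning framework that integrates a CBAM module and a dual mapping mechanism to achieve low-latency, reliable motion planning in obstacle-dense environments, improving navigation agility and safety.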

Comments Accepted by an IEEE Vehicular Technology Conference. 6 pages, 4 figures, 1 table

Abstract

Autonomous UAV flight in confined, wall-dense environments requires low-latency and reliable motion planning under strict safety constraints. Traditional optimization-based planners suffer from mapping latency and easily fall into local minima when navigating through dense structural obstacles. Meanwhile, existing end-to-end learning methods struggle to extract fine-grained geometric features from raw depth images and lack hard kinodynamic constraints, leading to unpredictable collisions near walls. To address these issues, we propose KIO-planner, an attention-guided single-stage trajectory planning framework. First, we integrate a Convolutional Block Attention Module (CBAM) into the perception backbone to adaptively focus on critical structural edges and traversable space. Second, we introduce a novel Dual Mapping mechanism, comprising physical bounds activation and a deterministic Geometric Safety Shield in the depth-pixel space, to enforce kinodynamic feasibility and collision-free flight without global map fusion. Extensive high-fidelity simulated experiments demonstrate that KIO-planner enables highly agile navigation at speeds up to 3.0 m/s. Compared to the state-of-the-art baseline, KIO-planner achieves lower inference latency (approximately 24 ms) and generates significantly smoother trajectories, reducing control cost by 28.4%. Most notably, our Dual Mapping substantially increases the worst-case safety margin, measured by minimum distance to obstacles, from 0.48 m to 0.76 m, ensuring fast, smooth, and safer navigation in highly constrained environments.

2605.19701 2026-05-20 cs.RO

Multi-Session Ground Texture SLAM in Low-Dynamic Environments

Kyle M. Hart, Brendan Englot

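AI summary This paper studies how three techniques affect trajectory estimation accuracy in multi-session ground texture SLAM in low-dynamic environments. Kullback-Leibler divergence, used as a similarity score and as a bias on loop closure confidence, performs best. The paper also introduces a dataset with multi-session images and high-accuracy pose information.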

Comments 8 pages, 9 figures. To appear at the 23rd International Conference on Ubiquitous Robots, Osaka, Japan. Distribution Statement A: Approved for public release; distribution is unlimited, as submitted under NAVAIR Public Release Authorization 2025-0098

Abstract

The simultaneous localization and mapping community has introduced a growing number of systems adapted for multi-session operations where the operational environment features low-dynamic changes that impact mapping, such as surface wear, weather phenomena, or seasonal change. These systems allow for lifelong operations by a robot within these environments. There is also growing interest in operations in environments where the unique ground texture is the only mapping feature available for use. However, these ground texture systems do not yet target multi-session low-dynamic-change environments. This work explores the impact of three different techniques on trajectory estimation accuracy in these multi-session low-dynamic ground texture environments. Of the three, using Kullback-Leibler divergence, both as a similarity score and as a bias influencing loop closure confidence, proves most successful. We show an analysis of all three methods and a deeper exploration of the impact of Kullback-Leibler divergence. We also introduce a dataset for use by the robotics community that contains multi-session images where the ground changes between sessions, along with high-accuracy pose information for use in evaluation.

2605.19692 2026-05-20 cs.CV

WBCAtt+: Fine-Grained Pixel-Level Morphological Annotations for White Blood Cell Images

Satoshi Tsutsui, Winnie Pang, Shuting He, Bihan Wen

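AI summary This paper introduces the WBCAtt+ dataset, which densely annotates white blood cell images with 11 morphological attributes and five pixel-level cell components. It provides baseline models for attribute recognition and semantic segmentation and showcases applications such as explainable AI models.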

Comments Accepted to Medical Image Analysis. arXiv admin note: substantial text overlap with arXiv:2306.13531

Abstract

The microscopic examination of white blood cells (WBCs) plays a fundamental role in pathology and is essential for diagnosing blood disorders such as leukemia and anemia. To support further research on WBC images, multiple datasets have been proposed. However, they mainly annotate cell categories, and lack detailed morphological characteristics that pathologists use to explain their interpretations of cells. To address this gap, we introduce WBCAtt+, a novel dataset of WBC images densely annotated with 11 morphological attributes and five pixel-level cell components. With 113k image-level labels and 10k segmentation maps, WBCAtt+ is the first to provide comprehensive annotations for WBC images. Leveraging this dataset, we provide baseline models for attribute recognition and semantic segmentation. We also design an attribute recognition model to incorporate compositional structure of cells, further improving the recognition performance. Lastly, we showcase various applications enabled by our dataset, such as explainable AI models, including counterfactual example generation. The dataset and code are publicly available at https://doi.org/10.57967/hf/8143.

2605.19690 2026-05-20 cs.RO

D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning for Navigation Foundation Models

Shintaro Nakaoka, Takayuki Kanai, Kazuhito Tanaka

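AI summary This paper proposes a new fine-tuning method that leverages large-scale pre-training while efficiently learning novel setups such as new environments or camera configurations, improving the robustness and accuracy of navigation models while preserving pre-trained knowledge.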

Comments This paper has been accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026), which will be held in Vienna, Austria, from June 1 to 5, 2026

Abstract

Navigation Foundation Models (NFMs) trained on large cross-embodied datasets have demonstrated powerful generalizability in various scenarios. Adopting in-domain fine-tuning for an NFM efficiently calibrates the visuomotor policy, promising further improvement even in a novel scenario. However, the fine-tuned models still suffer from poor obstacle avoidance or fail to properly reach the provided goals. Furthermore, model updates using a small subset of data typically erode the pre-trained prior, compromising the pre-training generalization. Consequently, fine-tuning deteriorates the capability of the model for robust and accurate navigation. In this work, we present a novel fine-tuning method that leverages large-scale pre-training while efficiently learning in novel setups, such as environments or camera configurations. In particular, inspired by ControlNet, we fine-tune an NFM by attaching a trainable copy of the pre-trained backbone using zero-initialized residual pathways, thereby learning geometric cues. This design enables the model to efficiently acquire in-domain geometry while preserving pre-trained knowledge across various behaviors. Despite its simplicity, our comprehensive evaluation of real-world navigation suggests that our proposal effectively enables robust long-horizon navigation with minimal collisions and human intervention. Additionally, our offline analysis shows that the proposed method maintains or further improves action prediction capabilities beyond the fine-tuned dataset, providing a key insight into continual learning for general navigation. The project page: https://toyotafrc.github.io/DCLING-Proj/
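
The ControlNet-style mechanism mentioned in the abstract, a trainable copy attached through a zero-initialized residual pathway, can be sketched with a single toy linear layer. Everything here is an illustrative assumption: the class name `ZeroInitResidual`, the use of plain Python lists instead of a deep-learning framework, and feeding depth features to the trainable branch (suggested by the title, not specified in the abstract).

```python
def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class ZeroInitResidual:
    """Frozen layer plus a trainable copy joined through a zero-initialized projection."""

    def __init__(self, pretrained_weights):
        size = len(pretrained_weights)
        self.frozen = pretrained_weights                      # pre-trained path, kept fixed
        self.branch = [row[:] for row in pretrained_weights]  # trainable copy of the layer
        self.zero_proj = [[0.0] * size for _ in range(size)]  # starts at zero, learned later

    def forward(self, rgb_features, depth_features):
        base = linear(self.frozen, rgb_features)
        side = linear(self.branch, depth_features)
        residual = linear(self.zero_proj, side)
        return [b + r for b, r in zip(base, residual)]


weights = [[1.0, 2.0], [0.0, 1.0]]
layer = ZeroInitResidual(weights)
# Before any training step, the new branch contributes nothing.
initial_output = layer.forward([1.0, 1.0], [3.0, 4.0])
```

Because the projection starts at zero, the first forward pass reproduces the pre-trained output exactly, so fine-tuning begins from the pre-trained behavior and departs from it only as the projection weights are learned.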

2605.19688 2026-05-20 cs.CV

DocQT: Improving Document Forgery Localization Robustness via Diverse JPEG Quantization Tables

Kylian Ronfleux-Corail, Guillaume Bernard, Mickaël Coustaty, Nicolas Sidère

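AI summary This paper presents the DocQT dataset. By comparing architectures trained with different quantization tables, it shows that standard quality factor augmentation does not represent real-world compression diversity and that architectures explicitly conditioned on the quantization table are more robust in real-world deployment.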

Abstract

Document manipulation localization models achieve strong performance on public benchmarks yet fail to generalize to operational document workflows. We identify a critical and overlooked source of this gap: the mismatch between the narrow distribution of JPEG quantization tables used during training (restricted to standard libjpeg quality factors) and the heterogeneous compression profiles encountered in real-world insurance document pipelines. To isolate this factor, we conduct a controlled factorial study comparing two architectures with contrasting levels of quantization table awareness, FFDN [2] and Mesorch [20], each trained under either standard quality factor augmentation (Standard-QT) or operationally calibrated quantization tables sampled from DocQT, a quantization-table bank derived from a MAIF operational image corpus (Real-QT), and evaluated under three recompression conditions. Training under Real-QT yields substantial localization gains on DocTamper [15] and significantly reduces the pixel-level false positive rate on authentic operational documents, but only for architectures that explicitly ingest the quantization table as input. The released DocQT quantization-table dataset and compression-reproduction material are directly available at https://github.com/Kyliroco/Improving-Document-Forgery-Localization-Robustness-via-Diverse-JPEG-Quantization-Tables. These results demonstrate that standard quality factor augmentation does not adequately proxy operational compression diversity, and that architectural choices explicitly conditioning on the quantization table provide a meaningful robustness advantage for real-world deployment.
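
To see why standard quality factors span only a narrow family of tables, the sketch below reproduces the scaling rule that libjpeg applies to the standard luminance table from Annex K of the JPEG specification. The Python function names are hypothetical; the arithmetic follows libjpeg's quality scaling with baseline clamping, and chrominance tables are omitted for brevity.

```python
# Standard JPEG luminance quantization table (Annex K), row-major order.
STD_LUMINANCE = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]


def quality_to_scale(quality):
    """Map a quality factor in 1..100 to a percentage scale, as libjpeg does."""
    quality = max(1, min(100, quality))
    return 5000 // quality if quality < 50 else 200 - 2 * quality


def luminance_table(quality):
    """Scale the standard table and clamp entries to the baseline range 1..255."""
    scale = quality_to_scale(quality)
    return [max(1, min(255, (base * scale + 50) // 100)) for base in STD_LUMINANCE]


# Every table reachable this way is a scaled copy of one base table.
table_q75 = luminance_table(75)
```

Each quality factor yields one scaled and clamped copy of the same base table, so augmentation over quality factors covers at most 100 tables along a single one-parameter family. Custom tables from cameras, scanners, or document software need not lie on that family.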

2605.19678 2026-05-20 cs.RO

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

Jingzhou Luo, Yifan Wen, Yongjie Bai, Xinshuai Song, Yang Liu, Liang Lin

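AI summary This paper proposes RoVLA, a framework that improves the robustness of vision-language-action models through multi-consistency constraints, enforcing consistency under three complementary transformations (instruction semantics, trajectory evolution, and observation perturbation) to improve stability and generalization.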

Abstract

Vision-Language-Action (VLA) models have shown strong performance on embodied manipulation, yet they remain brittle under visual observation changes, paraphrased language instructions, and compounded perturbations. This limitation suggests that existing methods still rely heavily on shallow correlations in the training distribution, rather than learning stable couplings among task semantics, environment states, and action generation. Although recent efforts improve robustness through larger-scale training, post-training adaptation, or enhanced predictive modeling, they rarely enforce invariance-oriented consistency within the end-to-end policy itself. To address this issue, we propose RoVLA, a robust vision-language-action framework with multi-consistency constraints. RoVLA enforces consistency under three complementary transformations: instruction semantics, trajectory evolution, and observation perturbation. Specifically, Instructional Consistency (IC) promotes stable grounding under semantically equivalent instruction rewrites, Evolutionary Consistency (EC) preserves coherent action intent throughout the generation process, and Observational Consistency (OC) improves robustness to visual and proprioceptive perturbations by enforcing consistent predictions before and after targeted disturbances. By explicitly modeling these invariances during training, RoVLA reduces reliance on superficial correlations and improves robustness and generalization. Experiments on LIBERO-Plus, RoboTwin 2.0, and real-world manipulation tasks show that RoVLA consistently outperforms strong baseline methods and exhibits superior robustness under diverse task and observation shifts. These results demonstrate the effectiveness of multi-consistency learning for robust embodied control. Code will be available at https://github.com/HCPLab-SYSU/RoVLA.

2605.19677 2026-05-20 cs.LG q-bio.QM

Agentic Discovery of Cryomicroneedle Formulations

冷冻微针配方的智能体式发现

Hao Li, Lifu Du, Nurul Hameed, Shemonti Saha Authai, Zlata Stefanovic, Chenjie Xu

AI总结 本研究提出了一种结合文献整理、高斯过程代理建模、贝叶斯优化和顺序湿实验验证的闭环工作流程,用于发现冷冻微针的冷冻保护剂配方,通过迭代湿实验验证提高了配方的准确性和有效性。

详情
AI中文摘要

冷冻微针为活细胞的微创皮内递送提供了一条途径,但其低温配方必须在细胞保护与毒性及器件制造方面的约束之间取得平衡。本文报告了一种AI辅助的闭环工作流程,用于冷冻微针冷冻保护剂的发现,结合了文献整理、高斯过程代理建模、贝叶斯优化和顺序湿实验验证。一个经人工整理的数据集包含来自42项研究的198种间充质干细胞冷冻保存配方,被转换为21种成分特征,并用于训练一个具有不确定性感知的文献先验模型。该模型捕捉了文献数据中的中等程度结构,但在前瞻性预测中失败,促使我们进行迭代的湿实验修正。在十次验证迭代和106次湿实验观测中,模型逐步适应了冷冻微针特有的结果:批次RMSE从41.21个百分点降低到6.86个百分点,后期阶段的秩相关性稳定为正,累积的湿实验预测值与实测值对比汇总达到了R²=0.942。最佳验证配方实现了95.15%的解冻后存活率,同时使用低含量DMSO、ectoin、乙二醇和胎牛血清。然而,仅有高存活率并不能保证冷冻微针完整成形,突显了未来进行多目标优化的必要性。这些结果表明,智能体辅助的计算基础设施可以使数据高效的配方发现对内部数据专业能力有限的实验室更加可及。项目代码可在https://github.com/baitmeister/ML-for-CryoMN上获得。

英文摘要

Cryomicroneedles offer a route to minimally invasive intradermal delivery of living cells, but their cryogenic formulations must reconcile cell protection with constraints on toxicity and device fabrication. Here we report an AI-assisted, closed-loop workflow for cryomicroneedle cryoprotectant discovery that combines literature curation, Gaussian-process surrogate modelling, Bayesian optimization, and sequential wet-lab validation. A curated dataset of 198 mesenchymal stem-cell cryopreservation formulations from 42 studies was converted into 21 ingredient features and used to train an uncertainty-aware literature prior. This model captured moderate structure in the literature data but failed prospectively, motivating iterative wet-lab correction. Across ten validation iterations and 106 wet-lab observations, the model progressively adapted to cryomicroneedle-specific outcomes: batch RMSE decreased from 41.21 to 6.86 percentage points, later-stage rank correlations became consistently positive, and the cumulative wet-lab predicted-versus-measured summary reached $R^2 = 0.942$. The best validated formulation achieved 95.15% post-thaw viability with low DMSO, ectoin, ethylene glycol, and fetal bovine serum. However, high viability alone did not ensure intact cryomicroneedle formation, highlighting the need for future multi-objective optimization. These results demonstrate that agent-assisted computational infrastructure can make data-efficient formulation discovery more accessible to labs with minimal data expertise in-house. Project code is available at https://github.com/baitmeister/ML-for-CryoMN.
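
The surrogate-plus-acquisition step of such a closed loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the single scalar feature, the RBF kernel and its length scale, the noise level, the expected-improvement acquisition, and all viability numbers below are hypothetical choices made only for the example.

```python
import math

def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two scalar inputs (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at one query point."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(x_train)]
            for i, a in enumerate(x_train)]
    k_query = [rbf(a, x_query) for a in x_train]
    alpha = solve(gram, y_train)
    beta = solve(gram, k_query)
    mean = sum(k * a for k, a in zip(k_query, alpha))
    var = 1.0 - sum(k * b for k, b in zip(k_query, beta))  # prior variance is 1
    return mean, math.sqrt(max(var, 1e-12))

def expected_improvement(mean, std, best):
    """Expected improvement over the best observation (maximization)."""
    z = (mean - best) / std
    cdf = 0.5 * math.erfc(-z / math.sqrt(2.0))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf

# Hypothetical data: one ingredient fraction vs. measured viability (as a fraction).
x_train = [0.05, 0.35, 0.65, 0.95]
viability = [0.40, 0.72, 0.81, 0.55]
offset = sum(viability) / len(viability)
y_train = [v - offset for v in viability]  # center targets for the zero-mean GP

candidates = [i / 100 for i in range(101)]
scores = []
for x in candidates:
    mean, std = gp_posterior(x_train, y_train, x)
    scores.append(expected_improvement(mean, std, max(y_train)))
next_x = candidates[scores.index(max(scores))]  # formulation to test next in the lab
```

In a real loop the measured result for `next_x` would be appended to the training data and the model refit, which is the "iterative wet-lab correction" the abstract describes.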

2605.19671 2026-05-20 cs.AI

Transforming Constraint Programs to Input for Local Search

将约束程序转换为局部搜索的输入

Jo Devriendt, Patrick De Causmaecker, Marc Denecker

AI总结 本文通过建立约束优化问题的对称性属性与局部搜索邻域之间的联系,自动从约束规范中生成邻域,用于IDP系统中的元启发式算法,并在六个经典优化问题上评估了生成的邻域。

Comments Unpublished paper accepted and presented at the Fourteenth International Workshop on Constraint Modelling and Reformulation (ModRef) in 2015

详情
AI中文摘要

将局部搜索算法应用于组合优化问题并不容易。通常需要人工干预才能将约束转换为某些元启发式算法的输入数据。在本文中,我们建立了约束优化问题的对称性属性与局部搜索邻域之间的联系,并利用这一联系在IDP系统中自动从约束规范生成邻域。我们对六个经典优化问题评估了所获得的邻域。所得结果支持了该技术的可行性。

英文摘要

Applying local search algorithms to combinatorial optimization problems is not an easy feat. Typically, human intervention is required to compile the constraints to input data for some metaheuristic algorithm. In this paper, we establish a link between symmetry properties of constraint optimization problems and local search neighborhoods, and we use this link to automatically generate neighborhoods from a constraint specification in the context of the IDP system. We evaluate the obtained neighborhoods for six classical optimization problems. The resulting observations support the viability of this technique.

2605.19663 2026-05-20 cs.AI

Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models

伪代码引导的结构化推理:在视觉-语言模型中实现自动化的可靠推理

Weicong Ni, Tianbao Jiang, Linlin Wang

AI总结 本文提出了一种基于伪代码的结构化推理框架(PStar),旨在通过自适应选择结构化伪代码推理路径,提高视觉-语言模型在复杂任务中的可靠性和鲁棒性,从而减少幻觉现象并提升推理性能。

详情
AI中文摘要

视觉-语言模型(VLMs)正成为机器人自动化高级推理的基石,使机器人能够解析自然语言指令并感知其环境。然而,它们容易产生幻觉,会导致决策中的严重失败,给物理部署带来重大的安全性和可靠性风险。现实世界任务的开放性进一步加剧了这一挑战:问题在难度和模态上差异巨大,需要鲁棒且适应性强的推理策略。为解决这一问题,我们提出了伪代码引导的结构化推理框架(PStar),该框架自适应地选择结构化伪代码推理路径,帮助VLMs进行灵活的逐步推理。我们首先设计了一组抽象推理函数,并构建了一个结构化伪代码库来表示模块化推理策略。关键的是,我们设计了一个难度特征向量(DFV),使模型能够评估问题复杂度并自适应地选择合适的推理策略,从而增强鲁棒性和可解释性。大量实验表明,PStar显著降低了幻觉率,在POPE上取得87.1%、在MMStar上取得68.0%的最先进成绩,甚至优于GPT-4V。通过提供一种经过验证的减少视觉-语言错误的机制,PStar朝着在真实世界自动化系统中部署更可信、更具确定性的VLMs迈出了关键一步,在这些系统中此类错误可能导致灾难性后果。

英文摘要

Vision-Language Models (VLMs) are becoming the cornerstone of high-level reasoning for robotic automation, enabling robots to parse natural language commands and perceive their environments. However, their susceptibility to hallucinations introduces critical failures in decision-making, posing significant safety and reliability risks in physical deployments. This challenge is exacerbated by the open-ended nature of real-world tasks, where questions vary vastly in difficulty and modality, demanding robust and adaptable reasoning strategies. To tackle this, we propose the Pseudocode-guided Structured Reasoning framework (PStar), which adaptively selects structured pseudocode reasoning paths to help VLMs perform flexible and step-by-step reasoning. We first design a set of abstract reasoning functions and formulate a structured pseudocode library to represent modular reasoning strategies. Crucially, we design a Difficulty Feature Vector (DFV) that allows the model to assess question complexity and adaptively choose appropriate reasoning strategies, enhancing robustness and interpretability. Extensive experiments demonstrate that PStar significantly reduces hallucination rates, achieving state-of-the-art scores of 87.1% on POPE and 68.0% on MMStar, outperforming even GPT-4V. By providing a validated mechanism to reduce visual-language errors, PStar offers a critical step toward deploying more trustworthy and deterministic VLMs for real-world automated systems, where such errors can lead to catastrophic outcomes.

2605.19660 2026-05-20 cs.LG cs.CL

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

OScaR:面向LLM及更广泛场景的极限KV缓存量化的奥卡姆剃刀

Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu, Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong

AI总结 本文针对LLMs中KV缓存极限压缩时的量化保真度问题,提出OScaR框架,通过Canalized Rotation和Omni-Token Scaling有效缓解Token Norm Imbalance,实现近无损的INT2量化性能,同时提升解码速度和吞吐量。

Comments Under review

详情
AI中文摘要

长上下文推理和多模态智能的快速发展,使Key-Value(KV)缓存的内存占用成为高效部署的主要瓶颈。虽然成熟的每通道量化方法能有效应对Key张量中固有的通道级异常值,但在极限压缩下其效果会下降。本文从经验和理论两个角度重新审视每通道量化范式的固有局限。我们的分析指出,Token Norm Imbalance(TNI)是量化保真度的主要瓶颈。我们证明,当共享的量化参数需要覆盖范数差异显著的token组时,TNI会系统性地放大误差。我们没有依赖复杂的量化流水线(如TurboQuant),而是提出了OScaR(Omni-Scaled Canalized Rotation),一种适用于X-LLMs(即纯文本、多模态和全模态LLMs)的准确且轻量的KV缓存压缩框架。OScaR在每通道范式的基础上更进一步,先进行Canalized Rotation,再进行Omni-Token Scaling,有效且高效地缓解由TNI引起的序列维度方差,并辅以我们优化的系统设计和CUDA内核。在X-LLMs上的广泛评估显示,OScaR始终优于现有方法,并在INT2量化下实现了近无损性能,确立了其作为稳健、低复杂度且通用的框架的地位,定义了新的帕累托前沿。与BF16 FlashDecoding-v2基线相比,我们的OScaR实现使解码速度提升最高达3.0倍,内存占用减少5.3倍,吞吐量提高4.1倍。OScaR的代码已在https://github.com/ZunhaiSu/OScaR-KV-Quant公开。

英文摘要

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced sequence-dimensional variance both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves a notable up to 3.0x speedup in decoding, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.
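
The effect the abstract calls Token Norm Imbalance can be reproduced on a toy Key matrix. The sketch below is only an illustration under stated assumptions: it uses plain min/max INT2 quantization with one parameter pair per channel, and a generic per-token max-abs rescaling as a stand-in remedy. It does not implement OScaR's Canalized Rotation or Omni-Token Scaling, and the matrix values are hypothetical.

```python
def quantize_per_channel(rows, bits=2):
    """Min/max quantization with one (offset, step) pair per channel (column),
    shared by every token (row). Returns the dequantized matrix."""
    levels = 2 ** bits - 1
    n_tokens, n_channels = len(rows), len(rows[0])
    out = [[0.0] * n_channels for _ in range(n_tokens)]
    for c in range(n_channels):
        column = [row[c] for row in rows]
        lo, hi = min(column), max(column)
        step = (hi - lo) / levels or 1.0  # constant channel: any step works
        for t in range(n_tokens):
            code = round((rows[t][c] - lo) / step)  # integer code in [0, levels]
            out[t][c] = lo + code * step
    return out

def quantize_with_token_scaling(rows, bits=2):
    """Rescale every token to unit max-abs before per-channel quantization,
    then undo the scaling (the per-token scales are kept in full precision)."""
    scales = [max(abs(v) for v in row) or 1.0 for row in rows]
    normalized = [[v / s for v in row] for row, s in zip(rows, scales)]
    dequantized = quantize_per_channel(normalized, bits)
    return [[v * s for v in row] for row, s in zip(dequantized, scales)]

def mse(a, b):
    """Mean squared error between two equally shaped matrices."""
    count = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / count

# Hypothetical Key matrix: three ordinary tokens and one token with a 50x larger norm.
keys = [
    [1.0, -1.0, 0.5, -0.5],
    [-1.0, 1.0, -0.5, 0.5],
    [0.5, -0.5, 1.0, -1.0],
    [50.0, 50.0, 50.0, 50.0],
]
mse_plain = mse(keys, quantize_per_channel(keys))
mse_scaled = mse(keys, quantize_with_token_scaling(keys))
```

With shared per-channel parameters, the large-norm token stretches every channel's range, so the three ordinary tokens collapse onto a single quantization level. Equalizing token norms first removes most of that error at the cost of storing one scale per token.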

2605.19656 2026-05-20 cs.CV

Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

跨视图泼溅:基于地理参考图像的前馈视图合成

Matias Turkulainen, Akshay Krishnan, Filippo Aleotti, Mohamed Sayed, Guillermo Garcia-Hernando, Juho Kannala, Arno Solin, Gabriel Brostow, Daniyar Turmukhambetov

AI总结 本文提出了一种基于地理参考图像的前馈视图合成方法,通过融合正射校正的卫星图像与带GPS标记的地面照片,在统一的3D坐标系中预测高斯泼溅,从而提升场景覆盖和新视角合成效果。

Comments Submitted to CVPR 2026. 8 figures, 3 tables. Project page: https://nianticspatial.github.io/cross-view-splatter/

详情
AI中文摘要

我们提出了Cross-View Splatter,一种前馈方法,用于为同时在地面和由卫星拍摄的户外场景预测像素对齐的高斯泼溅。忠实的重建需要良好的相机覆盖,但对于大型户外场景,地面影像的大规模采集既耗时又困难。幸运的是,卫星影像可以提供全局几何先验,并且可通过公共API轻松获取。Cross-View Splatter融合正射校正的卫星视图与带GPS标记的地面照片,在统一的3D坐标系中预测高斯泼溅。通过对齐地面和鸟瞰特征表示,与仅使用地面影像相比,我们的模型提升了场景覆盖和新视角合成效果。我们在经过整理的地理参考数据集和配对的卫星-地形数据上进行训练,这些数据挖掘自开放地图服务。我们在一个面向地理参考影像新视角合成的新基准上评估了我们的方法,该基准允许与先前最先进的方法进行比较。我们的代码和数据准备流程将在https://nianticspatial.github.io/cross-view-splatter/上提供。

英文摘要

We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level and by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite-terrain data, mined from open mapping services. We evaluate our method on a new benchmark for novel-view synthesis with georeferenced imagery, allowing comparison to prior state-of-the-art methods. Our code and data preparation will be available at https://nianticspatial.github.io/cross-view-splatter/.

2605.19645 2026-05-20 cs.CL

K-Quantization and its Impact on Output Performance

K-量化及其对输出性能的影响

Robin Baki Davidsson, Pierre Nugues

AI总结 本文研究了不同量化级别(2-6位)对大型语言模型(LLM)在MMLU-Pro、CRUXEval和MuSR等任务上的性能和准确性的影响,发现更高精度的量化(如8位Q8_0)性能更好但收益递减,激进量化(如2位Q2_K)通常仍能保持可接受的准确性,但部分模型性能损失明显,且不同模型和任务的响应差异显著。

Comments 13 pages, 4 figures

详情
AI中文摘要

近年来,大型语言模型(LLMs)在许多自然语言处理(NLP)任务中展现出卓越的能力。然而,其庞大的规模常常给部署带来挑战。这就需要高效的模型压缩技术,其中量化已成为一种重要的解决方案。尽管量化有诸多优势,但量化(2位到6位)对LLMs性能和准确性的确切影响仍然是一个活跃的研究领域。本文研究了八个LLMs在不同量化级别下的性能,重点考察了MMLU-Pro(知识处理和推理)、CRUXEval(代码理解)和MuSR(阅读理解)等任务。我们的结果显示出一致的趋势:更高的精度(例如8位Q8_0)带来更好的性能,但收益递减。激进的量化(例如2位Q2_K)通常能保持可接受的准确性,不过某些模型的性能会大幅下降。我们的发现表明,虽然较低的位精度通常会降低性能,但其影响因模型和任务而异。较大的模型对激进量化表现出更强的韧性,但在较低精度下仍可能出现显著下降。参数量在70亿至90亿范围内的中等规模模型在效率和资源使用之间取得了最佳平衡。这些结果为模型规模、量化和性能之间的权衡提供了见解。

英文摘要

Recent advancements in large language models (LLMs) have shown their remarkable capacities in many NLP tasks. However, their substantial size often presents challenges for deployment. This necessitates efficient techniques for model compression, with quantization emerging as a prominent solution. Despite its benefits, the exact impact of quantization (from 2- to 6-bit) on the performance and accuracy of LLMs remains an active area of research. This paper investigates the performance of eight LLMs at various quantization levels, focusing on tasks such as MMLU-Pro for knowledge processing and reasoning, CRUXEval for code comprehension, and MuSR for reading comprehension. Our results show a consistent trend where higher precision (e.g., 8-bit Q8_0) yields improved performance, albeit with diminishing returns. Aggressive quantization (e.g., 2-bit Q2_K) usually retains acceptable accuracy, though some models show a substantial loss in performance. Our findings indicate that while lower bit precision generally reduces performance, the impact varies across models and tasks. Larger models show greater resilience to aggressive quantization, but can still undergo significant drops at lower precision levels. Mid-sized models in the 7-9 billion parameter range strike an optimal balance between efficiency and resource usage. Such results provide insights into the trade-offs between model size, quantization, and performance.
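
To make the precision trade-off concrete, here is a minimal sketch of block-wise symmetric quantization at 8, 4, and 2 bits. It assumes one scale per block of 32 values and uses synthetic weights. Real K-quant formats are more elaborate (for example, nested block scales), and none of that is modeled here, so the function is a hypothetical stand-in rather than the Q8_0 or Q2_K layout itself.

```python
import math

def quantize_blocks(values, bits, block_size=32):
    """Quantize and immediately dequantize `values`, one scale per block.
    Codes are signed integers in [-qmax, qmax] with qmax = 2**(bits-1) - 1."""
    qmax = 2 ** (bits - 1) - 1
    restored = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / qmax if amax > 0 else 1.0
        codes = [round(v / scale) for v in block]  # the integers that would be stored
        restored.extend(scale * c for c in codes)
    return restored

def max_abs_error(original, restored):
    """Largest absolute reconstruction error."""
    return max(abs(a - b) for a, b in zip(original, restored))

# Synthetic "weights" standing in for one row of a weight matrix.
weights = [math.sin(i) for i in range(64)]
errors = {bits: max_abs_error(weights, quantize_blocks(weights, bits))
          for bits in (8, 4, 2)}
```

The worst-case error per value is half a quantization step, and the step grows as the bit width shrinks, which is one way to read the abstract's observation that lower precision generally costs accuracy.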

2605.19639 2026-05-20 cs.CV

Benchmarking and Evolving Reason-Reflect-Rectify for Reflective Visual Generation

面向反思式视觉生成的Reason-Reflect-Rectify基准测试与演进

Junjie Wang, Xinghua Lou, Jason Li, Ye Tian, Keyu Chen, Yulin Li, Bin Kang, Jacky Mai, Yanwei Li, Zhuotao Tian, Liqiang Nie

AI总结 本文提出R^3-Bench基准和R^3-Refiner框架,用于评估和提升反思视觉生成能力,通过改进迭代推理和修正能力,提升文本到图像模型的生成质量。

详情
AI中文摘要

文本到图像(T2I)模型和统一多模态模型(UMMs)在视觉生成领域取得了显著进展。然而,它们对单次生成范式的依赖限制了其处理需要迭代细化的复杂提示的能力。为了实现多轮反思式视觉生成(RVG),我们将Reason-Reflect-Rectify(R^3)循环形式化为核心框架,并引入R^3-Bench,一个包含600多个专家标注实例的基准,用于量化迭代推理和修正能力。在R^3-Bench上的评估揭示了一个关键差距:尽管最先进的模型能够识别生成错误,但它们无法生成可操作的修正指令。为弥合这一差距,我们提出了R^3-Refiner,一个双阶段框架,利用组相对策略优化(GRPO)和分层奖励机制(HRM)使修正更好地与反思推理对齐。实验表明,R^3-Refiner在R^3-Bench上取得了显著改进(反思判断分数提升12.0%,修正分数提升9.0%),并且可以与各种多模态大语言模型(MLLMs)无缝集成,以提升不同T2I模型在GenEval++和T2I-CompBench上的生成质量。代码可在https://github.com/xiaomoguhz/R3-Bench获取。

英文摘要

Text-to-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation. However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the Reason-Reflect-Rectify (R^3) loop as a core framework and introduce R^3-Bench, a benchmark of over 600 expert-annotated instances that quantifies iterative reasoning and rectification capabilities. Evaluation on R^3-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions. To bridge this gap, we propose R^3-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification with reflective reasoning. Experiments show that R^3-Refiner achieves significant improvements on R^3-Bench (+12.0% in Reflective Verdict Score, +9.0% in Rectification Score), and can be seamlessly integrated with various MLLMs to enhance the generation quality of different T2I models on GenEval++ and T2I-CompBench. Code is available at https://github.com/xiaomoguhz/R3-Bench.
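
The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows that normalization step only. The `hierarchical_reward` function is a hypothetical gating rule invented for this example (rectification quality counts only when the verdict is correct); the abstract does not specify how the paper's Hierarchical Reward Mechanism is actually defined.

```python
from statistics import mean, pstdev

def hierarchical_reward(verdict_correct, rectification_score):
    """Hypothetical two-level reward: the rectification score (in [0, 1]) is
    only credited once the reflective verdict itself is correct."""
    return 1.0 + rectification_score if verdict_correct else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses for one prompt: (verdict correct?, rectification score).
samples = [(True, 0.8), (True, 0.2), (False, 0.9), (False, 0.1)]
rewards = [hierarchical_reward(ok, score) for ok, score in samples]
advantages = group_relative_advantages(rewards)
```

Responses with a wrong verdict receive the same negative advantage regardless of their rectification text, while correct verdicts are ranked by how useful the rectification is.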

2605.19634 2026-05-20 cs.CV cs.AI

P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

P2DNav:面向零样本视觉-语言导航的全景到俯视推理

Kai Sheng, Liuyi Wang, Haojie Dai, Jinlong Li, Yongrui Qin, Zongtao He, Chengju Liu, Qijun Chen

AI总结 本文提出P2DNav框架,通过全景到俯视视角的分解、滑动窗口对话记忆和反思重新定位机制,解决零样本视觉-语言导航中的方向推理与局部定位问题,实验表明其在R2R-CE基准上性能优异。

详情
AI中文摘要

视觉-语言导航(VLN)要求具身智能体在未见过的环境中将自然语言指令落地为可执行的导航动作。现有零样本方法通常依赖额外的航点预测模块,这些模块往往将高层方向推理与细粒度局部定位纠缠在一起,导致决策容易出错且不稳定。在本文中,我们提出P2DNav,一种用于零样本视觉-语言导航的分层框架。P2DNav包含三个核心组件:全景到俯视(P2D)、滑动窗口对话记忆(SDM)和反思重新定位机制(RRM)。P2D明确地将导航决策分解为两个阶段:全景方向选择和俯视局部定位。它首先从360°全景中选择与指令相关的方向,然后从该方向的俯视RGB观测中预测像素级目标点。此外,SDM将导航历史组织为多轮对话上下文,并在滑动窗口内保留最近的视觉观测,以支持长程导航。RRM则根据俯视观测评估局部定位的可靠性,并在必要时返回全景方向选择,从而实现反思式的重新定位。在R2R-CE基准上的实验表明,P2DNav在零样本方法中表现强劲。特别是,与最先进的(SOTA)基于航点和无航点的零样本方法相比,P2DNav的SR分别提升了146.6%和58.9%,证明了P2D、SDM和RRM对零样本VLN的有效性。代码将公开发布。

英文摘要

Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.
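
One way to picture a sliding-window dialogue memory is a history that keeps every textual turn but retains image observations only for the most recent steps. The sketch below is a hypothetical data structure written for illustration; the class name, fields, and window size are assumptions, and the paper's actual SDM may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    step: int
    text: str              # instruction or action text for this step
    image: Optional[str]   # placeholder for a visual observation (None once evicted)

@dataclass
class SlidingWindowDialogueMemory:
    window: int = 2                        # number of recent observations to keep
    turns: List[Turn] = field(default_factory=list)

    def add(self, text: str, image: str) -> None:
        """Append a turn, then drop images that fell out of the window."""
        self.turns.append(Turn(len(self.turns), text, image))
        cutoff = max(len(self.turns) - self.window, 0)
        for turn in self.turns[:cutoff]:
            turn.image = None              # text stays, visual payload is released

memory = SlidingWindowDialogueMemory(window=2)
for step in range(5):
    memory.add(f"action at step {step}", f"frame_{step}.png")
steps_with_images = [t.step for t in memory.turns if t.image is not None]
```

Keeping the text of all turns preserves the long-horizon context, while bounding the number of stored images keeps the prompt size roughly constant as the episode grows.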

2605.19633 2026-05-20 cs.CL cs.AI cs.LG cs.NE cs.SE

optimize_anything: A Universal API for Optimizing any Text Parameter

optimize_anything: 一个用于优化任何文本参数的通用API

Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia

AI总结 本文提出了一种基于LLM的通用优化系统,能够跨不同领域实现文本参数的优化,展示了其在六个多样化任务中的state-of-the-art性能,通过多任务搜索和跨问题迁移实现了高效的优化。

Comments 16 pages, 11 figures; Blog: https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/

详情
Journal ref
Proceedings of the ACM Conference on AI and Agentic Systems (CAIS 26), May 26-29, 2026, San Jose, CA, USA
AI中文摘要

单个基于LLM的优化系统能否在根本不同的领域中与专用工具相媲美?我们证明,当优化问题被表述为改进一个由评分函数评估的文本工件时,单个基于AI的优化系统(支持单任务搜索、带跨问题迁移的多任务搜索以及对未见输入的泛化)在六个不同的任务上取得了state-of-the-art的结果。我们的系统发现了将Gemini Flash的ARC-AGI准确率提升至近三倍的智能体架构(从32.5%到89.5%),找到了将云成本降低40%的调度算法,生成的CUDA内核中有87%达到或超过PyTorch,并优于AlphaEvolve报告的圆填充(circle packing)解(n=26)。在三个领域的消融研究表明,可操作的辅助信息比仅有评分的反馈收敛更快、最终得分也显著更高;在每个问题预算相同的情况下,多任务搜索通过跨任务迁移优于独立优化,且收益随相关任务数量的增加而增大。总之,我们首次展示了基于LLM搜索的文本优化是一种通用的问题求解范式,将传统上需要领域特定算法的任务统一到单一框架之下。我们开源了optimize_anything(支持多个后端),作为GEPA项目的一部分,发布在https://github.com/gepa-ai/gepa上。

英文摘要

Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa.
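
The framing "improve a text artifact evaluated by a scoring function" can be sketched as a small propose-and-evaluate loop. Everything below is a hypothetical toy: `optimize_text`, `evaluate`, and `propose` are names invented for this example and are not the optimize_anything or GEPA API, and a deterministic character edit stands in for the LLM that would propose revisions in practice. The evaluator returns side information alongside the score, mirroring the abstract's point about actionable feedback.

```python
TARGET = "sort"  # hidden optimum of the toy scoring function

def evaluate(text):
    """Return (score, feedback). Score is the negative character distance to
    TARGET; feedback names the first wrong position and which way to move it."""
    score = -sum(abs(ord(a) - ord(b)) for a, b in zip(text, TARGET))
    for i, (a, b) in enumerate(zip(text, TARGET)):
        if a != b:
            return score, (i, 1 if b > a else -1)
    return score, None

def propose(text, feedback):
    """Stand-in for an LLM proposer: apply the edit suggested by the feedback."""
    if feedback is None:
        return text
    i, direction = feedback
    return text[:i] + chr(ord(text[i]) + direction) + text[i + 1:]

def optimize_text(seed, evaluate, propose, budget):
    """Greedy search: keep a candidate only if it strictly improves the score."""
    best = seed
    best_score, feedback = evaluate(seed)
    for _ in range(budget):
        candidate = propose(best, feedback)
        score, new_feedback = evaluate(candidate)
        if score > best_score:
            best, best_score, feedback = candidate, score, new_feedback
    return best, best_score

best_text, best_score = optimize_text("aaaa", evaluate, propose, budget=100)
```

The loop only ever sees text, a score, and feedback, which is why the same skeleton can wrap very different problems once each is expressed as a scored text artifact.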

2605.19631 2026-05-20 cs.RO cs.CV

HEAT: Heterogeneous End-to-End Autonomous Driving via Trajectory-Guided World Models

HEAT: 基于轨迹引导的世界模型实现异构端到端自动驾驶

Hoonhee Cho, Giwon Lee, Jae-Young Kang, Hyemin Yang, Heejun Park, Kuk-Jin Yoon

AI总结 本文提出一种基于轨迹引导的学习方法,通过规划轨迹组织训练,使模型能够捕捉驾驶意图的领域不变表示,并结合预测未来潜在特征的世界模型,提高特征一致性并缓解领域偏见,从而在多个异构数据集上实现强性能。

详情
AI中文摘要

端到端自动驾驶通过将原始传感器数据直接映射到驾驶动作,已成为传统模块化流水线之外一种颇具吸引力的替代方案。尽管近期方法在单域数据集上表现强劲,但在多个异构领域上联合训练时,其性能会显著下降。然而在实践中,自动驾驶系统必须在分布各异的多样环境中运行,包括不同的城市、传感器配置和交通模式,且无需针对特定领域重新训练。这一差距凸显了多领域学习中的一个关键挑战:异构领域之间的领域特定差异会引入相互冲突的学习信号,使模型趋向于在各个领域上都次优的折中解。为此,我们提出了一种轨迹驱动的学习范式,围绕规划轨迹组织训练,使模型能够捕捉驾驶意图的领域不变表示。此外,我们引入了一个世界模型,以自车动作为条件预测未来的潜在特征,从而提高特征一致性并缓解领域引起的偏差。我们在三个基准(nuScenes、NAVSIM和Waymo端到端数据集)上评估了我们的方法,并在所有领域上都展示了相对于现有方法的显著改进。我们的结果表明,单个统一模型可以在异构数据集上训练,同时在每个领域内保持强劲性能,这标志着向可扩展的真实世界部署迈出了一步。我们将公开我们的代码。

英文摘要

End-to-end autonomous driving has emerged as a compelling alternative to traditional modular pipelines by directly mapping raw sensor data to driving actions. While recent approaches achieve strong performance on single-domain datasets, their performance degrades significantly when trained jointly across multiple heterogeneous domains. In practice, however, autonomous systems must operate across diverse environments with heterogeneous distributions, including different cities, sensor configurations, and traffic patterns, without domain-specific retraining. This gap highlights a key challenge in multi-domain learning: domain-specific variations across heterogeneous domains introduce conflicting learning signals, driving models toward compromised solutions that are suboptimal across domains. To address this, we propose a trajectory-driven learning paradigm that organizes training around planning trajectories, enabling the model to capture domain-invariant representations of driving intent. Furthermore, we incorporate a world model that predicts future latent features conditioned on ego actions, improving feature consistency and mitigating domain-induced biases. We evaluate our approach on three benchmarks, nuScenes, NAVSIM, and the Waymo end-to-end dataset, and show substantial improvements over existing methods across all domains. Our results demonstrate that a single unified model can be trained on heterogeneous datasets while maintaining strong performance within each domain, highlighting a step toward scalable real-world deployment. We will make our code publicly available.

2605.19630 2026-05-20 cs.AI

EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

EMO-BOOST:情感增强的音频视觉特征用于深度伪造检测中的泛化改进

Aritra Marik, Marcel Klemt, Anna Rohrbach

AI总结 本文提出EMO-BOOST框架,通过融合传统RGB和声学聚焦检测器与基于情感的EmoForensics检测器,利用高阶语义线索提升深度伪造检测的泛化能力,实验显示在FakeAVCeleb数据集上平均跨操纵泛化AUC提升了2.1%。

Comments Accepted at SAFE@CVPRW 2026

详情
AI中文摘要

随着生成式AI模型的不断发展,取证学正面临越来越大的压力。新的生成技术不断出现,使得无法为每种操纵收集数据来训练深度伪造检测模型。因此,将模型泛化到训练期间未见过的深度伪造类型是当前深度伪造检测研究中的主要挑战之一。为解决这一挑战,我们采用了高层语义线索,并认为这些线索可以支持低层聚焦方法在泛化到未见操纵类型时发挥作用。在本研究中,我们研究了情感作为高层语义线索。我们提出了EMO-BOOST,一种多模态深度伪造检测框架,该框架融合了传统RGB和声学聚焦深度伪造检测器与我们基于情感的深度伪造检测器EmoForensics。EmoForensics利用视觉和音频情感识别模块,并在音频视频流中建模内在和跨模态的时间一致性。我们发现EmoForensics和低层聚焦方法捕获了互补的信号。因此,在EMO-BOOST中结合这两种信号,使在FakeAVCeleb数据集上的平均跨操纵泛化AUC提高了2.1%。

英文摘要

With every advancement in generative AI models, forensics is under increasing pressure. The constant emergence of new generation techniques makes it impossible to collect data for each manipulation to train a deepfake detection model. Thus, generalizing to deepfakes unseen during training is one of the major challenges in current deepfake detection research. To tackle this challenge, we employ high-level semantic cues and argue that these cues can support low-level focused approaches in generalizing to unseen types of manipulations. In this work, we study emotions as a high-level semantic cue. We propose Emo-Boost, a multimodal deepfake detection framework that fuses an off-the-shelf RGB- and acoustic-focused deepfake detector with our emotion-based deepfake detector EmoForensics. EmoForensics utilises vision and audio emotion recognition modules and models intra- and inter-modal temporal consistency in emotion representations from an audio-visual stream. We found that EmoForensics and the low-level focused method capture complementary signals. Consequently, combining both signals in Emo-Boost enhances the average cross-manipulation generalization AUC by 2.1% on FakeAVCeleb.

2605.19625 2026-05-20 cs.LG

Optimal Reconstruction from Linear Queries

从线性查询中最优重建

Yuval Filmus, Shay Moran, Elizaveta Nesterova

AI总结 研究如何从近似线性查询中重建未知点,分析查询数量、维度和噪声参数对重建误差的影响,并提出一种改进的重建问题变体。

Comments Accepted to COLT 2026. 46 pages, 4 figures

详情
AI中文摘要

我们研究从近似线性查询中重建$\mathbb{R}^d$中未知点的问题。这一设定自然地出现在从低维遥感和信号恢复到高维数据分析和隐私敏感推断的各类应用中。我们的主要目标是将最优重建误差刻画为查询数量$T$、环境维度$d$和噪声参数$\delta$的函数。我们首先分析$T \to \infty$的极限,证明最优重建误差收敛到显式值$\sqrt{2d/(d+1)} \delta$,其作用类似于监督学习中的贝叶斯最优误差。当维度固定时,我们证明高于该极限的超额误差随$T \to \infty$以双指数速度衰减,这一速率明显快于学习曲线中通常遇到的速率。当维度增长时,我们证明数量级为$\exp(d)$的查询对于实现趋于零的超额误差是必要且充分的。最后,我们引入并分析了重建问题的一个非正常(improper)变体。从技术角度看,我们的主要贡献是对Jung定理(1901)的推广。经典定理给出了直径为1的集合的最大可能半径的界,并刻画了极值体。我们的推广提供了一个刻画近极值体的鲁棒变体,并通过利用对称性和李群作用的几何与动力学论证加以证明。

英文摘要

We study the problem of reconstructing an unknown point in $\mathbb{R}^d$ from approximate linear queries. This setting arises naturally in applications ranging from low-dimensional remote sensing and signal recovery to high-dimensional data analysis and privacy-sensitive inference. Our main goal is to characterize the optimal reconstruction error as a function of the number of queries $T$, the ambient dimension $d$, and the noise parameter $\delta$. We first analyze the limit $T \to \infty$ and show that the optimal reconstruction error converges to the explicit value $\sqrt{2d/(d+1)} \delta$, which plays a role analogous to the Bayes optimal error in supervised learning. When the dimension is fixed, we show that the excess error above this limit decays doubly exponentially fast as $T \to \infty$, a rate that is significantly faster than those typically encountered in learning curves. When the dimension grows, we show that a number of queries on the order of $\exp(d)$ is necessary and sufficient to achieve vanishing excess error. Finally, we introduce and analyze an improper variant of the reconstruction problem. From a technical perspective, our main contribution is a generalization of Jung's theorem (1901). The classical theorem bounds the maximum possible radius of a set of diameter 1 and characterizes extremal bodies. Our generalization provides a robust variant that characterizes near-extremal bodies and is proved via geometric and dynamical arguments exploiting symmetry and Lie group actions.
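
The limiting value quoted above lines up with the classical Jung bound by simple arithmetic. The short derivation below is a consistency check, not the paper's argument: identifying the relevant diameter with $2\delta$ is an assumption made here for illustration and is not stated in the abstract.

```latex
% Jung's theorem: a set K in R^d of diameter D lies in a ball of radius r(K) with
\[
  r(K) \;\le\; D \sqrt{\frac{d}{2(d+1)}} .
\]
% Assumption (ours): the set of points consistent with all answers has diameter D = 2\delta.
\[
  r \;\le\; 2\delta \sqrt{\frac{d}{2(d+1)}}
    \;=\; \delta \sqrt{\frac{4d}{2(d+1)}}
    \;=\; \sqrt{\frac{2d}{d+1}}\;\delta .
\]
% Sanity checks: d = 1 gives r <= \delta, and d -> infinity gives r -> \sqrt{2}\,\delta.
```

Under that reading, the limiting error is exactly the Jung radius of a set of diameter $2\delta$, which grows from $\delta$ in one dimension toward $\sqrt{2}\,\delta$ as $d$ increases.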

2605.19623 2026-05-20 cs.CV

PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

PrAda:基于文本提示的分割的少样本视觉适应

Gabriele Rosi, Fabio Cermelli, Carlo Masone, Barbara Caputo

AI总结 该研究针对文本提示分割在特定领域中的性能下降问题,提出了一种新的少样本视觉适应方法PrAda,通过结合细粒度像素特征和高层Transformer表示学习类特定原型,从而在不改变模型零样本潜力的情况下实现对新领域的强适应。

Comments CVPR 2026 Findings. Code: https://github.com/FocoosAI/PrAda

详情
AI中文摘要

图像分割对于视觉理解至关重要,但需要大量的像素级标注。基础模型已经使预测新类别的新范式成为可能,这些范式通过文本提示引导,而无需目标领域的标注。然而,在专门化的目标领域中,远离原始预训练,其性能会下降。我们研究了现有方法在这样的领域偏移下的误差,发现误分类而不是掩码生成是主要的罪魁祸首。为了解决这个问题,我们引入了新的问题:基于文本提示的分割的少样本视觉适应。这种适应在图像分类中已被广泛研究,但在分割中仍属未探索的领域。我们通过原型适应(PrAda)解决了这一任务,这是一种新颖且参数高效的适应方法,用于适应冻结的文本提示分割模型。我们的方法通过结合细粒度像素特征和高层Transformer表示来学习类特定原型,然后通过学习的重要性因子将这些原型与原始基于文本的预测融合。这在保持模型零样本潜力的同时,使模型能够适应新领域。在五个基准上的语义、实例和全景分割实验表明,PrAda在与现有最先进方法和所提基线相比时,取得了显著的改进。

英文摘要

Segmenting images is critical for visual understanding but demands extensive pixel-level annotations. Foundational models have enabled new paradigms for predicting new classes guided by textual prompts, without annotations from the target domain. Yet, on specialized target domains, far from the original pre-training, their performance degrades. We study the errors of existing methods under such domain-shift, finding that misclassification rather than mask generation is the main culprit. To address this, we introduce the novel problem of Few-Shot Visual Adaptation for text-prompted Segmentation. This kind of adaptation has been largely studied for image classification, but it remains unexplored for segmentation. We tackle this task with Prototype Adaptation (PrAda), a novel, parameter-efficient method that adapts a frozen text-prompted segmentation model. Our approach learns class-specific prototypes by combining fine-grained pixel features and high-level transformer representations, which are then fused with the original text-based predictions through a learned importance factor. This preserves the model's zero-shot potential while enabling strong adaptation to new domains. Experiments across semantic, instance, and panoptic segmentation on five benchmarks demonstrate that PrAda yields significant improvements over state-of-the-art and proposed baselines.

2605.19622 2026-05-20 cs.CV

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

UniRefiner: 通过对比寄存器教会预训练ViTs自行清除杂质

Congpei Qiu, Zhaoyu Hu, Wei Ke, Zhuotao Tian, Yanhao Wu, Tong Zhang

AI总结 本文提出UniRefiner,一种通用的精炼(refinement)框架,通过对比寄存器教会预训练 ViT 自行清除空间敏感任务中的伪影 token,提升模型在密集预测任务中的表现。

Comments CVPR 2026

详情
AI中文摘要

基于 Vision Transformers (ViTs) 的表示学习进展迅速,然而大规模模型在空间敏感任务中的实用性受到伪影 token 的阻碍。先前的缓解工作较为有限,通常将这些伪影定义得很窄,例如仅视为简单的高范数异常值。我们认为这一范围是不够的。对于密集预测任务,我们主张任何未能编码位置对齐语义的 token 都应被视为伪影。这一更宽泛的定义揭示了一个更复杂的问题,促使我们系统地分类并刻画三种会破坏空间表示的基本伪影 token 类型。基于这一全面的诊断,我们提出了 UniRefiner,一种通用的精炼框架,教会预训练 ViTs 自行清除这些伪影。UniRefiner 使用对比寄存器,通过双重目标显式地隔离并重新分配伪影 token:(i) 将图像 token 与过滤后的正常 token 对齐以保持语义,(ii) 将寄存器 token 与检测到的伪影 token 对齐以捕捉伪影信号。我们的方法仅需在 ~5k 图像上微调几个 epoch,即可精炼多种 ViTs,包括 EVA-CLIP-8B 和 InternViT-6B 等超大规模模型。实验显示出一致且显著的改进:值得注意的是,精炼后的 EVA-CLIP-8B 在 ADE20K 上达到 51.9% mIoU(+9.4%),超过 DINOv2(49.1%)等专用视觉模型,同时零样本分割精度最高提升 22%。UniRefiner 释放了现有大规模基础模型潜在的空间能力,为其更广泛的应用铺平了道路。

英文摘要

Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.

2605.19620 2026-05-20 cs.CV

Bézier Degradation Modeling for LiDAR-based Human Motion Capture

基于LiDAR的人体动作捕捉的贝塞尔退化建模

Xiaoqi An, Lin Zhao, Jun Li, Chen Gong, Jian Yang

AI总结 本文提出BMLiCap框架,通过时间可压缩的贝塞尔曲线建模人体动作,采用轨迹保留策略减少控制点,设计渐进式动作重建模块,利用时间尺度运动变换器和多级动作聚合器有效融合多尺度曲线,以提高复杂场景下的动作重建精度和时间连续性。

Comments Accepted by CVPR 2026

详情
AI中文摘要

基于LiDAR的3D人体动作捕捉在自动驾驶和机器人领域有广泛应用,准确的动作重建至关重要。然而,现有方法在不稳定输入和严重遮挡情况下常常导致预测抖动甚至失败。为了解决这些挑战,我们提出BMLiCap,一种从粗到细的框架,通过时间可压缩的贝塞尔曲线建模运动。通过采用轨迹保留策略减少控制点,我们获得了一种连贯且易于学习的动作表示。为了从LiDAR点云线索中重建人体动作,我们设计了一个渐进式动作重建模块。具体来说,引入了时间尺度运动变换器(TMT)来在多个时间尺度上预测运动曲线,并利用多级动作聚合器(MMA)来适应性融合多尺度曲线,以恢复详细的、时间连贯的姿态,有效弥补由遮挡和噪声引起的观测缺口。在四个主流基准LiDARHuman26M、FreeMotion、NoiseMotion和SLOPER4D上,BMLiCap在复杂场景中实现了最先进的准确性和时间连续性,证明了其在严重遮挡下的补偿能力和减少预测抖动的能力。

英文摘要

LiDAR-based 3D human motion capture has broad applications in fields such as autonomous driving and robotics, where accurate motion reconstruction is crucial. However, existing methods often struggle with unstable inputs and severe occlusions, leading to jittery or even failed pose predictions. To address these challenges, we propose BMLiCap, a coarse-to-fine framework that models motion using temporally compressible Bézier curves. By reducing control points through a trajectory-preserving strategy, we obtain a coherent and learning-friendly motion representation. To reconstruct human actions from LiDAR point-cloud cues, we design a progressive motion-reconstruction module. Specifically, a Time-scale Motion Transformer (TMT) is introduced to predict motion curves at multiple temporal scales, and a Multi-level Motion Aggregator (MMA) is utilized to adaptively fuse the multi-scale curves to recover detailed, temporally coherent poses, effectively bridging observation gaps caused by occlusions and noise. Across four mainstream benchmarks LiDARHuman26M, FreeMotion, NoiseMotion, and SLOPER4D, BMLiCap achieves state-of-the-art accuracy and temporal continuity in complex scenes, demonstrating its ability to compensate for severe occlusions and reduce prediction jitter.
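
The appeal of a Bézier motion representation is that a handful of control points defines a smooth trajectory over many frames. The sketch below evaluates such a curve with de Casteljau's algorithm. The joint trajectory and control points are hypothetical, and the code does not reproduce BMLiCap's trajectory-preserving control-point reduction or its multi-scale fusion.

```python
def de_casteljau(control_points, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] by repeated
    linear interpolation between neighbouring control points."""
    points = [tuple(p) for p in control_points]
    while len(points) > 1:
        points = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]

def sample_trajectory(control_points, num_frames):
    """Decode a dense per-frame trajectory from a few control points."""
    return [de_casteljau(control_points, i / (num_frames - 1))
            for i in range(num_frames)]

# Hypothetical 3D positions of one joint, stored as three control points only.
wrist_controls = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 0.0, 0.0)]
frames = sample_trajectory(wrist_controls, num_frames=30)
midpoint = de_casteljau(wrist_controls, 0.5)
```

Here 30 frames are decoded from three stored points, and because the curve is a polynomial in time, consecutive frames vary smoothly, which is the property that helps suppress jitter when individual observations are noisy or occluded.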