arXivDaily: arXiv daily academic digest, updated Monday through Friday
2510.07118 2026-05-15 cs.CL cs.LG

TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning

Manish Nagaraj, Sakshi Choudhary, Utkarsh Saxena, Deepak Ravikumar, Kaushik Roy

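AI summary: TRIM is a data selection method for efficient instruction tuning that replaces coarse, gradient-based signals with fine-grained, attention-based features from the model. It extracts attention "fingerprints" from a small number of target samples to identify and select the data subset most important for defining the task, maintaining high performance while substantially reducing computational cost. Experiments show that coresets selected by TRIM outperform existing methods on multiple downstream tasks and in some settings even surpass full-data fine-tuning.

Details
English abstract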

Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based "fingerprints" from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.

2510.05213 2026-05-15 cs.RO cs.AI cs.LG

VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka

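AI summary: VER is a vision expert Transformer for robot learning that addresses the problem that pretrained vision foundation models excel in specific domains but generalize poorly across tasks. It distills multiple vision foundation models into an expert library and uses a lightweight dynamic routing network to select task-relevant experts from the pretrained library, enabling efficient and flexible feature extraction. VER also introduces Patchwise Expert Routing and Curriculum Top-K Annealing to improve the precision and adaptability of dynamic selection, achieving state-of-the-art performance on multiple robotic tasks.

Details
English abstract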

Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and code can be found at https://yixiaowang7.github.io/ver_page/.

2508.11845 2026-05-15 cs.SD cs.AI cs.IR cs.LG

AVEX: What Matters for Animal Vocalization Encoding

Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist

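AI summary: This paper studies the key factors that affect model performance in animal vocalization encoding, aiming to develop a general-purpose bioacoustic encoder suited to many downstream tasks. Through large-scale experiments, the authors analyze how training data diversity, model architecture, and training strategy affect encoder performance, and propose a hybrid recipe that combines self-supervised pre-training with supervised post-training, which markedly improves performance across tasks and datasets. The study also finds that data diversity matters in both training stages, and the authors plan to release the model checkpoints to support further research and application.

Details
Comments
In The Fourteenth International Conference on Learning Representations 2026
English abstract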

Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.

2507.18553 2026-05-15 cs.LG cs.DS cs.IT math.IT

The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm

Jiale Chen, Yalda Shabanzadeh, Elvir Crnčević, Torsten Hoefler, Dan Alistarh

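AI summary: This paper shows that the GPTQ quantization method is mathematically equivalent to Babai's nearest plane algorithm for the classical closest vector problem on a lattice, which gives the method an intuitive geometric interpretation and an error upper bound. Building on this equivalence, the authors design quantization methods that avoid weight clipping and outperform the original GPTQ, and they provide efficient GPU inference kernels. The work places LLM quantization on a firm theoretical footing and opens new directions for designing quantization algorithms for billion-parameter models.

Details
Comments
Published as a conference paper at the Fourteenth International Conference on Learning Representations (ICLR 2026): https://openreview.net/forum?id=NFB4QGGS65
English abstract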

Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. While GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale, its inner workings are described as a sequence of algebraic updates that obscure geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: first, the GPTQ error propagation step gains an intuitive geometric interpretation; second, GPTQ inherits the error upper bound of Babai's algorithm under the assumption that no weights are clipped. Leveraging this bound, we design post-training quantization methods that avoid clipping, and outperform the original GPTQ. In addition, we provide efficient GPU inference kernels for the resulting representation. Taken together, these results place GPTQ on a firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models. Source code is available at https://github.com/IST-DASLab/GPTQ-Babai.
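
As background for the abstract above, the following is a minimal sketch of the textbook Babai nearest plane algorithm for the closest vector problem. It is not the authors' GPTQ-Babai code: the function names, the row-vector basis layout, and the toy inputs are assumptions made for illustration, and the Hessian-defined lattice from the paper is not constructed here.

```python
from typing import List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(basis: List[Vector]) -> List[Vector]:
    """Orthogonalize the basis (no normalization).

    Assumes the basis vectors are linearly independent.
    """
    ortho: List[Vector] = []
    for b in basis:
        v = list(b)
        for q in ortho:
            mu = dot(b, q) / dot(q, q)
            v = [vi - mu * qi for vi, qi in zip(v, q)]
        ortho.append(v)
    return ortho


def babai_nearest_plane(basis: List[Vector], target: Vector) -> Tuple[List[int], Vector]:
    """Return integer coefficients and the lattice point chosen by Babai's algorithm.

    The loop runs from the last basis vector to the first. At each step it
    rounds the projection of the residual onto the corresponding Gram-Schmidt
    direction, then subtracts that multiple of the original basis vector.
    """
    ortho = gram_schmidt(basis)
    residual = list(target)
    coeffs = [0] * len(basis)
    for i in reversed(range(len(basis))):
        c = round(dot(residual, ortho[i]) / dot(ortho[i], ortho[i]))
        coeffs[i] = c
        residual = [r - c * b for r, b in zip(residual, basis[i])]
    # Rebuild the lattice point from the integer coefficients.
    lattice_point = [
        sum(c * b[j] for c, b in zip(coeffs, basis)) for j in range(len(target))
    ]
    return coeffs, lattice_point


if __name__ == "__main__":
    # Hypothetical 2D lattice and target, chosen only for illustration.
    print(babai_nearest_plane([[2.0, 0.0], [1.0, 2.0]], [2.6, 3.1]))
```

The back-to-front loop order mirrors the "last to first dimension" execution that the abstract describes; how GPTQ's updates map onto these steps is the paper's contribution and is not reproduced in this sketch.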

2507.15774 2026-05-15 cs.LG cs.AI

Time Series Forecasting Through the Lens of Dynamics

Alexis-Raja Brachet, Pierre-Yves Richard, Céline Hudelot

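AI summary: This paper examines the performance gap between deep learning models and shallow linear models in time series forecasting, and hypothesizes that models should learn a direct link from past to future data points, a capability the authors call learning dynamics. The authors introduce the $\texttt{PRO-DYN}$ nomenclature to analyze the dynamics of existing models, finding that under-performing models learn dynamics only partially and that the location of the dynamics block is critical to model performance. Based on systemic and empirical studies, they propose a simple, easy-to-use methodology for model design and improvement.

Details
Comments
Accepted at ICML 2026
English abstract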

While deep learning is facing a homogenization across modalities led by Transformers, they are still challenged by shallow linear models in the time series forecasting task. Our hypothesis is that models should learn a direct link from past to future data points, which we identify as a learning dynamics capability. We develop an original $\texttt{PRO-DYN}$ nomenclature to analyze existing models through the lens of dynamics. Two observations thus emerge: $\textbf{1.}$ under-performing architectures learn dynamics at most partially, $\textbf{2.}$ the location of the dynamics block at the model end is of prime importance. Our systemic and empirical studies both confirm our observations on a set of performance-varying models with diverse backbones. We propose a simple plug-and-play methodology guiding model designs and improvements.

2507.01909 2026-05-15 cs.CV

Modality-agnostic, patient-specific digital twins modeling temporally varying digestive motion

Jorge Tapias Gomez, Nishant Nadkarni, Lando S. Bosma, Jue Jiang, Ergys D. Subashi, William P. Segars, James M. Balter, Mert R Sabuncu, Neelam Tyagi, Harini Veeraraghavan

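AI summary: This study addresses the difficulty of accurately evaluating deformable image registration (DIR) for highly mobile gastrointestinal organs, and proposes a modality-agnostic approach based on patient-specific digital twins (DTs) to simulate temporally varying motion and assess DIR. Using a semi-automated pipeline, it generates 4D sequences with 21 motion phases from published GI motion models and real patient scans, evaluates the registration accuracy of six DIR methods, and validates dose mapping accuracy. The approach provides detailed spatial and dosimetric evaluation for dynamic, anatomically complex regions and is of clinical relevance.

Details
Journal ref
Phys. Med. Biol. 71 (2026) 015029
Comments
This work is still under review; it contains 7 pages, 6 figures, and 4 tables
English abstract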

Objective: Clinical implementation of deformable image registration (DIR) requires voxel-based spatial accuracy metrics such as manually identified landmarks, which are challenging to implement for highly mobile gastrointestinal (GI) organs. To address this, patient-specific digital twins (DT) modeling temporally varying motion were created to assess the accuracy of DIR methods. Approach: 21 motion phases simulating digestive GI motion as 4D sequences were generated from static 3D patient scans using published analytical GI motion models through a semi-automated pipeline. Eleven datasets were used, including six T2w FSE MRI (T2w MRI), two T1w 4D golden-angle stack-of-stars, and three contrast-enhanced CT scans. The motion amplitudes of the DTs were assessed against real patient stomach motion amplitudes extracted from independent 4D MRI datasets. The generated DTs were then used to assess six different DIR methods using target registration error, Dice similarity coefficient, and the 95th percentile Hausdorff distance using summary metrics and voxel-level granular visualizations. Finally, for a subset of T2w MRI scans from patients treated with MR-guided radiation therapy, dose distributions were warped and accumulated to assess dose warping errors, including evaluations of DIR performance in both low- and high-dose regions for patient-specific error estimation. Main results: Our proposed pipeline synthesized DTs modeling realistic GI motion, achieving mean and maximum motion amplitudes and a mean log Jacobian determinant within 0.8 mm and 0.01, respectively, similar to published real-patient gastric motion data. It also enables the extraction of detailed quantitative DIR performance metrics and rigorous validation of dose mapping accuracy. Significance: The pipeline enables rigorous testing of DIR tools for dynamic, anatomically complex regions, supporting granular assessment of spatial and dosimetric accuracy.

2506.05762 2026-05-15 cs.LG

BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

Yunpeng Qing, Yixiao Chi, Shuo Chen, Shunyu Liu, Kexuan Zhou, Sixu Lin, Litao Liu, Changqing Zou

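AI summary: This paper proposes BiTrajDiff, a data augmentation framework for offline reinforcement learning that uses bidirectional trajectory diffusion to generate both future and history trajectories, improving dataset diversity and generalization. Unlike existing methods that only reconstruct future trajectories, BiTrajDiff also considers the history paths that lead to the current state, allowing a fuller exploration of potentially high-reward regions of the state space. Experiments show that it outperforms other advanced data augmentation techniques on multiple benchmark tasks and significantly improves offline RL performance.

Details
English abstract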

Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enrich data distribution. Despite the promising results, current DA techniques focus solely on reconstructing future trajectories from given states, while ignoring the exploration of history transitions that reach them. This single-direction paradigm inevitably hinders the discovery of diverse behavior patterns, especially those leading to critical states that may have yielded high-reward outcomes. In this work, we introduce Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework for offline RL that models both future and history trajectories from any intermediate states. Specifically, we decompose the trajectory generation task into two independent yet complementary diffusion processes: one generating forward trajectories to predict future dynamics, and the other generating backward trajectories to trace essential history transitions. BiTrajDiff can efficiently leverage critical states as anchors to expand into potentially valuable yet underexplored regions of the state space, thereby facilitating dataset diversity. Extensive experiments on the D4RL benchmark suite demonstrate that BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.

2506.01015 2026-05-15 cs.CV

AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

Yuyuan Liu, Yuanhong Chen, Chong Wang, Junlin Han, Junde Wu, Can Peng, Jingkun Chen, Yu Tian, Gustavo Carneiro

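AI summary: This paper proposes AuralSAM2, which integrates audio into the SAM2 model to strengthen multimodal interaction in video segmentation. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts, and propagates auditory cues through SAM2's feature pyramid to reinforce cross-modal influence. An audio-guided contrastive loss is also introduced to improve modality alignment. Experiments show notable gains on public benchmarks with little impact on interactive efficiency.

Details
Comments
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026
English abstract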

Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, we introduce an audio-guided contrastive loss that emphasises auditory relevance in dominant visual features. Our method achieves notable accuracy gains on public benchmarks with only minimal impact on the interactive efficiency of promptable segmentation. Our code is available at https://github.com/yyliu01/AuralSAM2.

2505.04535 2026-05-15 cs.LG cs.DC

Communication-Efficient Federated Fine-Tuning

Michael Theologitis, Vasilis Samoladas, Antonios Deligiannakis

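AI summary: This paper studies communication efficiency when fine-tuning large language models in federated learning, and proposes a new family of algorithms, FDA-Opt, to address the fixed communication intervals and hard-to-calibrate parameters of existing methods. The approach combines dynamic adjustment with optimization strategies and outperforms existing methods on NLP tasks without additional configuration, offering a more efficient and practical solution for federated fine-tuning.

Details
English abstract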

Federated Learning (FL) enables the utilization of vast, previously inaccessible data sources. At the same time, pre-trained Language Models (LMs) have taken the world by storm and for good reason. They exhibit remarkable emergent abilities and are readily adapted to downstream tasks. This opens one of the most exciting frontiers in FL: fine-tuning LMs. Yet, a persistent challenge in FL is the frequent, rigid communication of parameters -- a problem magnified by the sheer size of these contemporary models. The FedOpt family of algorithms has become the go-to approach for FL, relying on fixed but arbitrary intervals for model exchanges. Recently, the FDA algorithm prescribed a dynamic approach by monitoring the training progress. However, it introduced a hard-to-calibrate parameter and imposed a rigid synchronization scheme. In this work, we address these limitations by proposing the FDA-Opt family of algorithms -- a unified generalization of both FDA and FedOpt. Our experimental evaluation focuses on fine-tuning LMs on downstream NLP tasks and demonstrates that FDA-Opt outperforms FedOpt even when it is configured with hyper-parameters specifically optimized for the latter. In other words, we show that FDA-Opt is a practical, drop-in replacement for FedOpt in modern FL libraries and systems: it requires no additional configuration and delivers superior performance out of the box.

2501.12202 2026-05-15 cs.CV

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Shaoxiong Yang, Song Zhang, Yang Liu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo

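AI summary: This paper presents Hunyuan3D 2.0, an advanced large-scale synthesis system for generating high-resolution textured 3D assets. It comprises two foundation components: Hunyuan3D-DiT, a shape generation model built on a scalable flow-based diffusion transformer, and Hunyuan3D-Paint, which uses geometric and diffusion priors to produce high-quality textures. The authors also build Hunyuan3D-Studio, a user-friendly platform that lets professional and amateur users generate and manipulate 3D assets efficiently. Experiments show that Hunyuan3D 2.0 outperforms previous state-of-the-art models in geometry detail, condition alignment, and texture quality, and it has been open-sourced to fill the gap in large-scale 3D generative models in the open-source community.

Details
Comments
GitHub link: https://github.com/Tencent/Hunyuan3D-2
English abstract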

We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, both open-source and closed-source, in geometry details, condition alignment, texture quality, and more. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2

2501.05465 2026-05-15 cs.CL

Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)

Akanksha Gupta, Bijo Thomas, Harshita Asnani, Phanindra Reddy Madduru, Samia Feroze, Shreyas Subramanian, Vikram Elango, Mecit Gungor

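AI summary: As foundation AI models keep growing in size, this paper surveys about 160 papers on Small Language Models (SLMs) in the 1 to 8 billion parameter range, showing that small models can match or even outperform large ones. It covers general-purpose SLMs, task-specific SLMs, and techniques for improving their performance, aiming to give the community modeling guidance that balances performance, efficiency, scalability, and cost. It also defines the effective size of SLMs, which represents their increased capability relative to LLMs.

Details
English abstract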

As foundation AI models continue to increase in size, an important question arises - is massive scale the only path forward? This survey of about 160 papers presents a family of Small Language Models (SLMs) in the 1 to 8 billion parameter range that demonstrate smaller models can perform as well as, or even outperform, large models. We explore task agnostic, general purpose SLMs, task-specific SLMs and techniques to create SLMs that can guide the community to build models while balancing performance, efficiency, scalability and cost. Furthermore, we define and characterize SLMs' effective sizes, representing increased capability with respect to LLMs.

2412.17155 2026-05-15 cs.CV cs.LG

The Potential of Convolutional Neural Networks for Cancer Detection

Hossein Molaeian, Kaveh Karamjani, Sina Teimouri, Saeed Roshani, Sobhan Roshani

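AI summary: This paper explores the potential of convolutional neural networks (CNNs) for cancer detection, with the aim of improving the accuracy of early cancer diagnosis through deep learning. It analyzes how various CNN architectures perform on different cancer datasets, compares the advantages and disadvantages of each approach, and identifies the architectures that perform best in cancer classification. The work offers a reference for integrating CNNs into clinical diagnostic workflows and may help enhance diagnostic capabilities in healthcare.

Details
English abstract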

Early detection is crucial for successful cancer treatment and increasing survivability rates, particularly in the most common forms. Ten different cancers have been identified in most of these advances that effectively use CNNs (Convolutional Neural Networks) for classification. The distinct architectures of CNNs used in each study concentrate on pattern recognition for different types of cancer across various datasets. The advantages and disadvantages of each approach are identified by comparing these architectures. This study explores the potential of integrating CNNs into clinical practice to complement traditional diagnostic methods. It also identifies the top-performing CNN architectures, highlighting their role in enhancing diagnostic capabilities in healthcare.

2410.19653 2026-05-15 cs.LG

Conformal Prediction for Multimodal Regression

Alexis Bose, Jonathan Ethier, Paul Guinand

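AI summary: This paper proposes a conformal prediction method for multimodal regression, extending conformal prediction, traditionally limited to numerical inputs, to multimodal data such as images and unstructured text. The method uses internal features from complex neural network architectures that process multimodal information, extracted in particular at the points where the modalities are fused, to construct prediction intervals with distribution-free uncertainty guarantees. The result opens a new path for applying conformal prediction in domains rich in multimodal data.

Details
Comments
Code available at https://github.com/ic-crc/uncertainty-estimation 20 pages, 34 figures
English abstract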

This paper introduces multimodal conformal regression. Traditionally confined to scenarios with solely numerical input features, conformal prediction is now extended to multimodal contexts through our methodology, which harnesses internal features from complex neural network architectures processing images and unstructured text. Our findings highlight the potential for internal neural network features, extracted from convergence points where multimodal information is combined, to be used by conformal prediction to construct prediction intervals (PIs). This capability paves new paths for deploying conformal prediction in domains abundant with multimodal data, enabling a broader range of problems to benefit from guaranteed distribution-free uncertainty quantification.
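
For readers unfamiliar with how conformal prediction builds prediction intervals, here is a minimal sketch of standard split conformal regression on scalar predictions. It does not reproduce the paper's multimodal feature extraction: the function names and the idea of passing precomputed predictions are assumptions for illustration, and any regressor could supply those predictions.

```python
import math
from typing import List, Sequence, Tuple


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """Finite-sample conformal quantile of the calibration scores.

    Uses rank ceil((n + 1) * (1 - alpha)); returns infinity when the
    calibration set is too small for the requested coverage level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def split_conformal_intervals(
    cal_preds: Sequence[float],
    cal_targets: Sequence[float],
    test_preds: Sequence[float],
    alpha: float = 0.1,
) -> List[Tuple[float, float]]:
    """Symmetric prediction intervals from absolute calibration residuals."""
    scores = [abs(y - p) for p, y in zip(cal_preds, cal_targets)]
    q = conformal_quantile(scores, alpha)
    return [(p - q, p + q) for p in test_preds]


if __name__ == "__main__":
    # Hypothetical calibration data: a model that always predicts 0.
    cal_preds = [0.0] * 9
    cal_targets = [float(k) for k in range(1, 10)]
    print(split_conformal_intervals(cal_preds, cal_targets, [10.0], alpha=0.5))
```

Under exchangeability of calibration and test points, intervals built this way cover the true value with probability at least 1 - alpha, which is the distribution-free guarantee the abstract refers to.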

2605.14967 2026-05-15 cs.LG stat.ML

InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting

Mahdi Sabbaghi, George Pappas, Adel Javanmard, Hamed Hassani

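AI summary: This paper proposes InfoSFT, a supervised fine-tuning method that improves LLM learning by concentrating on informative, medium-confidence tokens, avoiding both overfitting to low-likelihood samples and suppressing existing capabilities. The method requires only a one-line modification to the standard loss and improves generalization on math, code, and chain-of-thought tasks while better preserving the model's pre-existing capabilities.

Details
English abstract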

Supervised fine-tuning (SFT) provides the standard approach for teaching LLMs new behaviors from offline expert demonstrations. However, standard SFT uniformly fits all samples -- including those with low likelihood under the base model -- which can disproportionately drive training updates toward overfitting specific samples rather than learning the target behavior. Moreover, adapting to these unlikely samples induces substantial policy shifts that degrade prior capabilities. Existing methods mitigate this by filtering, regenerating, or down-weighting low-likelihood data. In doing so, they often suppress precisely the novel behaviors the base model has yet to learn. We propose InfoSFT, a principled weighting scheme for the SFT objective that concentrates learning signals on maximally informative, medium-confidence tokens -- those neither overly familiar to the base model nor too unlikely to cause instability. Requiring only a one-line modification to the standard token-wise loss, InfoSFT demonstrably improves generalization over vanilla SFT and likelihood-weighted baselines across math, code, and chain-of-thought tasks with diverse model families, while better preserving pre-existing capabilities.

2605.14966 2026-05-15 cs.CV cs.AI

MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs

Wei Ding, Yilin Li, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang

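AI summary: This paper proposes MHSA, a lightweight framework that mitigates hallucinations in large vision-language models (LVLMs) by steering attention. MHSA learns to correct cross-modal attention patterns, training a simple three-layer MLP generator with supervisory signals from the LVLM itself and the DHCP discriminator to produce corrected attention weights. At inference time, it modifies no LVLM parameters and simply replaces the original cross-modal attention, reducing both generative and discriminative hallucinations and offering a new perspective on hallucination research in LVLMs.

Details
Comments
19 pages, 17 figures
English abstract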

Large vision-language models (LVLMs) have achieved remarkable performance across diverse multimodal tasks, yet they continue to suffer from hallucinations, generating content that is inconsistent with the visual input. Prior work DHCP (Detecting Hallucinations by Cross-modal Attention Pattern) has explored hallucination detection from the perspective of cross-modal attention, but does not address hallucination mitigation. In this paper, we propose MHSA (Mitigating Hallucinations via Steered Attention), a lightweight framework that mitigates hallucinations by learning to correct cross-modal attention patterns in LVLMs. MHSA trains a simple three-layer MLP generator to produce corrected attention, guided by supervisory signals from the DHCP discriminator and the LVLM itself. During inference, MHSA mitigates both discriminative and generative hallucinations across various datasets and LVLMs by simply replacing the original cross-modal attention with the corrected one, without modifying any LVLM parameters. By extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation, MHSA offers a novel perspective on hallucination research in LVLMs and helps enhance their reliability.

2605.14953 2026-05-15 cs.LG

Efficient Online Conformal Selection with Limited Feedback

Sreenivas Gollapudi, Kostas Kollias, Kamesh Munagala, Ali Sinop

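AI summary: This paper studies efficient conformal selection under limited feedback: selecting options at minimal resource cost while ensuring that at least one success is identified with a pre-specified probability. The authors show that an update rule based on Adaptive Conformal Inference guarantees the average success probability in adversarial settings and achieves sublinear efficiency regret on i.i.d. data. Using a unified algorithmic technique and a Lyapunov-function analysis, the approach applies to bandit and semi-bandit feedback, handles more complex settings than prior work, and requires significantly less feedback.

Details
English abstract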

We address the problem of conformal selection, where an agent must select a minimal subset of options to ensure that at least one ``success'' is identified with a pre-specified target probability $\phi$. While traditional online conformal prediction focuses on maintaining validity for the observed sequence, minimizing the resource cost (efficiency) of such selections, especially under limited feedback, remains a significant challenge. In this work, we consider settings with the most limited ``bandit'' feedback, and demonstrate that the simple Adaptive Conformal Inference (ACI) update rule, when applied to the appropriate control parameter or dual variable, is both adversarially valid, ensuring the success target is met on average for any input sequence (and hence under distribution shifts), and stochastically efficient, achieving sublinear efficiency regret for i.i.d. inputs against an appropriate stochastic benchmark. We show such guarantees under canonical models capturing bandit and semi-bandit feedback to the agent via a unifying algorithmic technique and an analytic framework involving Lyapunov functions. Our approach handles more complex settings than prior work, while requiring significantly less feedback, and our results provide a new theoretical bridge between efficient online learning with limited feedback and distribution-free uncertainty quantification.
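
The ACI update rule mentioned in the abstract can be sketched in its basic form from the online conformal prediction literature. This is not the paper's bandit-feedback algorithm: the function names, the `scores` sequence, and the rule "a miss occurs when the score falls below the current level" are simplifying assumptions used only to show how the update steers the long-run miss rate toward its target.

```python
from typing import Sequence, Tuple


def aci_update(alpha_t: float, target_miss_rate: float, missed: bool, step_size: float) -> float:
    """One ACI step: lower the level after a miss, raise it after a success."""
    err = 1.0 if missed else 0.0
    return alpha_t + step_size * (target_miss_rate - err)


def run_aci(scores: Sequence[float], target_miss_rate: float, step_size: float) -> Tuple[float, float]:
    """Run ACI over a score sequence in [0, 1) and return (empirical miss rate, final level).

    A larger level alpha_t corresponds to a smaller selected set, so a miss is
    declared whenever the round's score falls below alpha_t (hypothetical model).
    """
    alpha_t = target_miss_rate
    misses = 0
    for u in scores:
        missed = u < alpha_t
        misses += int(missed)
        alpha_t = aci_update(alpha_t, target_miss_rate, missed, step_size)
    return misses / len(scores), alpha_t


if __name__ == "__main__":
    # Deterministic stand-in for an arbitrary input sequence.
    scores = [(t * 0.6180339887) % 1.0 for t in range(2000)]
    print(run_aci(scores, target_miss_rate=0.1, step_size=0.05))
```

Because each step moves the level by the gap between the target and the observed error, the sum of those gaps telescopes into the total change in the level, which stays bounded. That is why the average miss rate approaches the target for any input sequence, the "adversarially valid" property the abstract describes, with the success target $\phi$ corresponding to a miss rate of $1 - \phi$.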

2605.14950 2026-05-15 cs.CV cs.RO

Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu, Yunhe Li, Yuqian Fu, Yinxinyu Chen, Hongyi Cai, Zewei Ye, Bing Cheng, Kai Ye, Yiran Mao, Yilei Zhong, MingKang Dong, Junchi Yan, Gen Li, Bo Zhao

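AI summary: Evo-Depth is a lightweight depth-enhanced vision-language-action model that aims to improve spatial understanding in robotic manipulation. It uses a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images, and a Spatial Enhancement Module to inject depth information into vision-language representations, enabling efficient spatial-semantic enhancement. Evo-Depth also introduces a Progressive Alignment Training strategy to better align the depth-enhanced representations with action learning, and shows strong performance and efficiency across multiple simulation and real-world settings.

Details
English abstract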

Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.

2605.14949 2026-05-15 cs.CV eess.IV eess.SP

A CUBS-Compatible Ultrasound Morphology and Uncertainty-Aware Baseline for Carotid Intima-Media Segmentation and Preliminary Risk Prediction

Aueaphum Aueawatthanaphisut

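AI summary: This study presents AtheroFlow-XNet, a reproducible ultrasound-based baseline for carotid intima-media segmentation and preliminary risk prediction, aiming at a fuller assessment of vascular risk in patients with atherosclerosis. The model uses manually annotated intima-media masks for supervised segmentation, incorporates clinical variables to assist risk prediction, and uses Monte Carlo dropout for uncertainty-aware inference. It achieves a Dice coefficient of 0.7930 for segmentation and an area under the ROC curve of 0.6910 for preliminary risk prediction, offering a starting point for ultrasound-based automated vascular analysis.

Details
Comments
13 pages, 5 figures, 2 tables, 20 equations, 3 appendices
English abstract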

Carotid atherosclerosis is a major contributor to ischemic stroke and transient ischemic attack. Conventional ultrasound assessment is commonly based on intima-media thickness, plaque appearance, stenosis degree, and peak systolic velocity, but these morphology- and velocity-based indicators may not fully capture patient-specific vascular risk. This study presents AtheroFlow-XNet, a CUBS-compatible ultrasound morphology and uncertainty-aware learning baseline for carotid intima-media segmentation and preliminary risk prediction. Using the Carotid Ultrasound Boundary Study dataset, manual lumen-intima and media-adventitia boundary annotations were converted into dense intima-media masks for supervised segmentation. Clinical variables were incorporated into an auxiliary risk-prediction branch, and Monte Carlo dropout was used for uncertainty-aware inference. The model was evaluated using a patient-level train-validation-test split with 1,522 training images, 326 validation images, and 328 testing images. The proposed model achieved a Dice coefficient of 0.7930 for LI-MA mask segmentation, a segmentation loss of 0.2359, and an area under the receiver operating characteristic curve of 0.6910 for preliminary risk prediction. Qualitative results showed that predicted masks were generally aligned with manual annotations, while uncertainty maps highlighted ambiguous wall-boundary regions. These results suggest that ultrasound-derived carotid morphology can support automated wall analysis and uncertainty-aware interpretation. Since CUBS does not provide Doppler waveforms or CFD-derived hemodynamic biomarkers, this work should be interpreted as a reproducible morphology-driven baseline. Future work will incorporate Doppler-derived flow profiles, patient-specific vascular reconstruction, and CFD-based wall shear biomarkers.
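
The Dice coefficient reported above is a standard overlap score between a predicted mask and a reference mask. The sketch below shows the usual definition on flattened binary masks; the function name, the list-based mask layout, and the convention of returning 1.0 for two empty masks are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|) for flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Both masks are empty; treat this as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    # Hypothetical 2x2 masks flattened row by row.
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_coefficient(predicted, reference))
```

A score of 1.0 means the masks coincide and 0.0 means they do not overlap at all, so the reported 0.7930 indicates substantial but imperfect overlap with the manual annotations.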

2605.14948 2026-05-15 cs.CV

ACE-LoRA: Adaptive Orthogonal Decoupling for Continual Image Editing

Yuehao Liu, Weijia Zhang, Xuanming Shang, Zhizhou Chen, Yanhao Ge, Shanyan Guan, Chao Ma

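AI summary: This paper proposes ACE-LoRA, a dynamic regularization framework for continual image editing that addresses forgetting of previously learned knowledge as new tasks are learned. It uses Adaptive Orthogonal Decoupling to identify and orthogonalize interference between tasks, and introduces a Rank-Invariant Historical Information Compression strategy to keep continual updates scalable. The authors also build CIE-Bench, which they describe as the first comprehensive benchmark for continual image editing, to provide a standardized evaluation protocol. Experiments show the method outperforms existing baselines in instruction fidelity, visual realism, and robustness to forgetting.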

Details
English abstract

State-of-the-art diffusion models often rely on parameter-efficient fine-tuning to perform specialized image editing tasks. However, real-world applications require continual adaptation to new tasks while preserving previously learned knowledge. Despite the practical necessity, continual learning for image editing remains largely underexplored. We propose ACE-LoRA, a dynamic regularization framework for continual image editing that effectively mitigates catastrophic forgetting. ACE-LoRA leverages Adaptive Orthogonal Decoupling to identify and orthogonalize task interference, and introduces a Rank-Invariant Historical Information Compression strategy to address scalability issues in continual updates. To facilitate continual learning in image editing and provide a standardized evaluation protocol, we introduce CIE-Bench, the first comprehensive benchmark in this domain. CIE-Bench encompasses diverse and practically relevant image editing scenarios with a balanced level of difficulty to effectively expose limitations of existing models while remaining compatible with parameter-efficient fine-tuning. Extensive experiments demonstrate that our method consistently outperforms existing baselines in terms of instruction fidelity, visual realism, and robustness to forgetting, establishing a strong foundation for continual learning in image editing.

2605.14944 2026-05-15 cs.RO cs.SY eess.SY

Behavioral Data-Driven Optimal Trajectory Generation for Rotary Cranes

Iskandar Khemakhem, Manuel Zobel, Johannes Schüle, Oliver Sawodny, Naoki Uchiyama, Abdallah Farrage

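AI summary: With the growth of the construction industry and a shortage of skilled labor, automating crane control is becoming increasingly important. This paper proposes a behavioral data-driven method for generating open-loop slewing trajectories for rotary cranes that suppress load sway while reducing operation time and energy consumption. Building on Willems' fundamental lemma and its generalizations, the method needs no explicit system model and generates smooth, optimal trajectories directly from input-output data. In experiments on a laboratory crane, it reduces load sway, tracking error, and travel time relative to an established model-based approach.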

Details
English abstract

With the growth of the construction industry and the global shortage of skilled labor, the automation of crane control has become increasingly important for safe and efficient operations. A central challenge in automatic crane control is the reduction of load oscillations during motion, which is primarily addressed through appropriate slewing trajectories. In this context, classical model-based control methods rely on accurate dynamical models and expert tuning, and often struggle to meet safety and precision requirements, while many learning-based approaches require large data sets and significant computational resources. This paper proposes a behavioral data-driven framework for generating open-loop slewing trajectories for rotary cranes that suppress load sway while reducing operation time and energy consumption. The approach builds on Willems' fundamental lemma and its generalizations, to bypass explicit system modeling and operate directly on measured input-output data. A practical workflow is presented in this paper to reduce the need for expert knowledge. Despite the underactuated nature of the crane dynamics, the method identifies a nonparametric representation of the system behavior and generates smooth, optimal trajectories using limited data and convex optimization. The proposed trajectory generation method is validated on a laboratory crane setup and compared against an established model-based approach, achieving up to 35% reduction in load sway, 43% reduction in tracking error, and 50% reduction in travel time.
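
Willems' fundamental lemma works with Hankel matrices built from recorded data and requires the input to be persistently exciting, which for a scalar input means the Hankel matrix of the required depth has full row rank. The sketch below shows only that data-matrix construction and rank check; the function names are hypothetical, the signals are synthetic, and the paper's trajectory optimization is not reproduced here.

```python
import numpy as np

def hankel_matrix(signal, depth):
    """Stack every length-`depth` window of a 1-D signal as a column."""
    signal = np.asarray(signal, dtype=float)
    n_cols = len(signal) - depth + 1
    if n_cols < 1:
        raise ValueError("signal is shorter than the requested depth")
    return np.column_stack([signal[i:i + depth] for i in range(n_cols)])

def is_persistently_exciting(signal, order):
    """True if the depth-`order` Hankel matrix of the signal has full row rank."""
    H = hankel_matrix(signal, order)
    return bool(np.linalg.matrix_rank(H) == order)

rng = np.random.default_rng(0)
random_input = rng.standard_normal(40)   # rich excitation
constant_input = np.ones(40)             # carries almost no information

rich_ok = is_persistently_exciting(random_input, 6)
flat_ok = is_persistently_exciting(constant_input, 2)
```

A constant input fails the check because every window is identical, which is why the recorded experiment must excite the system sufficiently before the data can stand in for a model.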

2605.14940 2026-05-15 cs.LG cs.AI eess.SP

Not All Symbols Are Equal: Importance-Aware Constellation Design for Semantic Communication

Albert Shaju, Christo Kurisummoottil Thomas, Mayukh Roy Chowdhury

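AI summary: This paper studies constellation design for semantic communication and proposes an importance-aware joint semantic-physical layer framework. The framework extracts discrete semantic concepts, scores their semantic criticality, and uses deep reinforcement learning to dynamically select which symbols to transmit, yielding a semantic-aware constellation mapping at the physical layer. It introduces a semantic symbol vulnerability metric and a semantic protection probability, proves that conventional Gray-coded constellations are suboptimal when semantic importance is non-uniform, and reports gains at high spectral efficiency on several datasets.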

Details
Comments
Submitted to IEEE GLOBECOM 2026. 6 pages, 8 figures
English abstract

Semantic communication systems for goal-oriented transmission must protect task-relevant information not only through source compression but also via physical layer mapping. Existing approaches decouple constellation design and semantic encoding, exposing critical symbols to channel errors at the same rate as irrelevant ones. In contrast, this paper proposes a joint semantic-physical layer framework composed of a vector quantized-variational autoencoder that extracts discrete latent concepts, a semantic criticality indicator (SCI) that scores each concept by task relevance, and a deep reinforcement learning agent that dynamically selects the transmission subset based on instantaneous channel conditions. At the physical layer, a learned semantic-aware M-QAM constellation assigns symbol positions according to joint co-occurrence statistics and SCI scores, departing from the uniform spacing and Gray coding of standard M-QAM, which minimizes average BER without regard for semantic content. We introduce a novel semantic symbol vulnerability (SSV) metric and a semantic protection probability (SPP) to quantify the exposure of task-critical symbols to decoding errors, and prove that any Gray-coded constellation is strictly suboptimal in SCI-Weighted SSV whenever the source exhibits non-uniform semantic importance and co-occurrence statistics. Simulation results demonstrate that the proposed constellation achieves near 100% SPP across modulation orders from 4-QAM to 1024-QAM versus 50% for standard constellations at high spectral efficiency, and a 21:1 compression ratio with semantic quality above 0.9, generalizing across MNIST, Fashion-MNIST, and FSDD without modification.
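
For context on the baseline the abstract argues against: Gray coding labels neighbouring constellation points so that they differ in exactly one bit, which treats every symbol error as equally costly. The short sketch below shows the standard binary-reflected Gray code along one axis; it illustrates the conventional labelling only, not the learned semantic-aware constellation, and the helper names are chosen for this example.

```python
def gray_code(index):
    """Binary-reflected Gray code of a non-negative integer."""
    return index ^ (index >> 1)

def hamming_distance(a, b):
    """Number of bit positions in which two labels differ."""
    return bin(a ^ b).count("1")

# Labels for the four amplitude levels along one axis of 16-QAM.
labels = [gray_code(i) for i in range(4)]

# Every pair of adjacent levels differs in exactly one bit.
adjacent_distances = [
    hamming_distance(gray_code(i), gray_code(i + 1)) for i in range(15)
]
```

Because every neighbour error costs one bit regardless of which symbol was sent, this labelling cannot give extra protection to symbols that matter more for the task, which is the gap the abstract targets.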

2605.14938 2026-05-15 cs.LG cs.CV

Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models

Yuehao Liu, Shanyan Guan, Weijia Zhang, Xuanming Shang, Yanhao Ge, Wei Li, Chao Ma

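AI summary: This paper proposes Octopus, a continual learning framework for multimodal large language models that targets the forgetting of earlier knowledge as new tasks are learned sequentially. It is based on History-Free Gradient Orthogonalization (HiFGO), which enforces orthogonality at the gradient level to reduce parameter interference without storing data from earlier tasks, avoiding the associated privacy and storage concerns. On UCIT, Octopus surpasses the prior state of the art by 2.14% and 6.82% on the Avg and Last metrics, respectively.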

Details
English abstract

Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks, rehearsal-based methods rely on storing historical data, raising privacy and storage concerns, and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose Octopus, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without historical task data. Our proposed two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT show that Octopus establishes state-of-the-art performance, surpassing prior SOTA by 2.14% and 6.82% in terms of Avg and Last.
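
The abstract does not say how HiFGO obtains its orthogonality constraint without historical data, so the following is only a generic illustration of what gradient-level orthogonality means: a gradient is projected so that it has no component along a set of protected directions. The `directions` argument is a hypothetical stand-in for whatever subspace a method chooses to protect.

```python
import numpy as np

def project_out(gradient, directions, eps=1e-12):
    """Remove from `gradient` its components along each row of `directions`."""
    g = np.asarray(gradient, dtype=float).copy()
    basis = []
    # Orthonormalize the protected directions (Gram-Schmidt) so that
    # non-orthogonal or redundant rows are handled correctly.
    for d in np.asarray(directions, dtype=float):
        for b in basis:
            d = d - np.dot(d, b) * b
        norm = np.linalg.norm(d)
        if norm > eps:
            basis.append(d / norm)
    # Subtract the projection onto each orthonormal basis vector.
    for b in basis:
        g = g - np.dot(g, b) * b
    return g

protected = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
new_task_gradient = np.array([1.0, 2.0, 3.0])
safe_gradient = project_out(new_task_gradient, protected)
```

A parameter update along `safe_gradient` leaves the protected directions untouched to first order, which is the intuition behind reducing interference between tasks.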

2605.14937 2026-05-15 cs.LG cs.AI cs.RO

Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations

Jonathan Spieler, Angel Villar-Corrales, Sven Behnke

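AI summary: Slot-MPC is a goal-conditioned model predictive control framework built on object-centric representations, aimed at improving an agent's planning ability in complex environments. It uses vision encoders to learn structured, per-object representations of a scene and builds an action-conditioned dynamics model on top of them, so that actions can be planned with model predictive control at inference time. Experiments show improvements in task performance and planning efficiency over non-object-centric world models; in the offline setting with limited state-action coverage, gradient-based MPC performs better than sampling-based MPC.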

Details
English abstract

Predictive world models enable agents to model scene dynamics and reason about the consequences of their actions. Inspired by human perception, object-centric world models capture scene dynamics using object-level representations, which can be used for downstream applications such as action planning. However, most object-centric world models and reinforcement learning (RL) approaches learn reactive policies that are fixed at inference time, limiting generalization to novel situations. We propose Slot-MPC, an object-centric world modeling framework that enables planning through Model Predictive Control (MPC). Slot-MPC leverages vision encoders to learn slot-based representations, which encode individual objects in the scene, and uses these structured representations to learn an action-conditioned object-centric dynamics model. At inference time, the learned dynamics model enables action planning via MPC, allowing agents to adapt to previously unseen situations. Since the learned world model is differentiable, we can use gradient-based MPC to directly optimize actions, which is computationally more efficient than relying on gradient-free, sampling-based MPC methods. Experiments on simulated robotic manipulation tasks show that Slot-MPC improves both task performance and planning efficiency compared to non-object-centric world model baselines. In the considered offline setting with limited state-action coverage, we find that gradient-based MPC performs better than gradient-free, sampling-based MPC. Our results demonstrate that explicitly structured, object-centric representations provide a strong inductive bias for controllable and generalizable decision-making. Code and additional results are available at https://slot-mpc.github.io.
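
Gradient-based MPC optimizes an action sequence by differentiating a cost through the dynamics model. The toy sketch below replaces the learned slot-based model with a hand-written 1-D integrator whose gradient can be written analytically; the dynamics, cost weights, and function names are all assumptions made for this example and do not come from the paper.

```python
def rollout(x0, actions):
    """Toy dynamics standing in for a learned model: x_next = x + a."""
    x = x0
    for a in actions:
        x = x + a
    return x

def plan_with_gradient_mpc(x0, goal, horizon=5, effort_weight=0.01,
                           lr=0.05, steps=200):
    """Gradient descent on the action sequence for
    cost = (x_final - goal)^2 + effort_weight * sum(a_t^2).
    The step size is tuned for the default horizon of 5."""
    actions = [0.0] * horizon
    for _ in range(steps):
        error = rollout(x0, actions) - goal
        # Each action shifts the final state one-for-one, so the goal term
        # contributes 2 * error to every action's gradient.
        grads = [2.0 * error + 2.0 * effort_weight * a for a in actions]
        actions = [a - lr * g for a, g in zip(actions, grads)]
    return actions

plan = plan_with_gradient_mpc(x0=0.0, goal=1.0)
first_action = plan[0]            # MPC executes only this, then replans
predicted_final = rollout(0.0, plan)
```

In a receding-horizon loop only `first_action` would be applied before planning again from the newly observed state. A sampling-based planner would instead evaluate many random action sequences through `rollout` and keep the best one.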

2605.14935 2026-05-15 cs.CV

Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control

Nhat Le, Daochang Liu, Anh Nguyen, Ajmal Mian

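AI summary: This paper presents MSCoT, a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. It decomposes motion into a multi-scale hierarchical representation and predicts the full token sequence at each temporal scale in a coarse-to-fine manner, enabling fast and flexible control. A multi-scale token guidance strategy and a lightweight token refiner address the difficulties of discrete sampling and improve control accuracy and generation quality. Experiments show better motion quality, control accuracy, and inference speed than existing methods.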

Details
English abstract

We present MSCoT, a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. Unlike recent approaches that rely on multiple iterative denoising/token-prediction steps, or modules tailored for specific control signals, MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale in a coarse-to-fine fashion. Building on this coarse-to-fine paradigm, we propose an efficient multi-scale token guidance strategy that overcomes the challenge of discrete sampling and steers the token distribution towards the control goals, allowing for fast and flexible control. To address the limitations of a discrete codebook, a lightweight token refiner further adds continuous residuals to the discrete token embeddings and allows differentiable test-time refinement optimization to ensure precise alignment with the control objectives. MSCoT is able to produce quality motions, consistent with the control constraints, while offering substantially faster sampling than diffusion-based approaches. Experiments on popular benchmarks demonstrate state-of-the-art controllable text-to-motion generation performance of MSCoT over existing baselines, with better motion quality (48% FID improvement), higher control accuracy (-61% avg error), and $10 \times$ faster inference speed on HumanML3D.

2605.14929 2026-05-15 cs.LG cs.AR

A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models

Earl Killian

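AI summary: This paper proposes a hardware-aware, per-layer post-training quantization methodology for compressing large language model weights, targeting near-lossless reconstruction at 4.5 to 6 bits per weight. It combines selection between fixed and dynamic codebooks, per-block scaling, activation-weighted cosine selection, and promotion of sensitive layers, and introduces a new hardware-efficient LUT output format (HIF) to improve performance and energy efficiency. Across several open model families, the method achieves lower weight reconstruction error than a conventional FP8 baseline at lower storage cost.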

Details
Comments
21 pages
English abstract

Scaled Outer Product (SOP) is a post-training quantization methodology for large language model weights, designed to deliver near-lossless fidelity at 4.5--6 bits per weight on hardware with per-layer LUT decode. The methodology combines per-layer search of fixed and dynamic codebook pairs selected by a per-block selection bit, signed per-block scales, activation-weighted cosine selection, and multiple-choice knapsack promotion of sensitive layers with outlier and sparse-residual correction. Fixed codebooks include NF4, BOF4, Split87, and SH4; per-layer optimized codebooks (DD4) are hosted in LUT SRAM. A new hardware-efficient LUT output format (HIF) is proposed to improve performance, energy, and cost. Across six open model families, the recommended FP6 operating point (E2M3sUE4M4, 6.5 bpw) achieves lower weight reconstruction error than the conventional per-layer-POT FP8 baseline (E4M3, 8.0 bpw) at 1.5 bpw lower storage cost, demonstrating that block-scaled small atoms with carefully chosen scale precision can replace conventionally-deployed FP8. Full evaluation across the 4.5--6 bpw range, including layer promotion and sparse residual correction, is reported in a companion paper.
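
The fractional bits-per-weight figures follow from amortizing a per-block scale over the weights in that block. The sketch below reproduces the 6.5 bpw figure under the assumption of a 16-weight block with an 8-bit scale; the block size is inferred from the arithmetic and is not stated in the abstract. It also shows a generic block-scaled codebook quantizer with a uniform 16-entry codebook and an unsigned absmax scale, which are placeholders for illustration and differ from the paper's codebooks (NF4, BOF4, Split87, SH4, DD4) and its signed scales.

```python
import numpy as np

def bits_per_weight(element_bits, scale_bits, block_size, extra_bits_per_block=0):
    """Storage cost per weight when one scale is shared by a block."""
    return element_bits + (scale_bits + extra_bits_per_block) / block_size

# 6-bit elements with an 8-bit scale per 16 weights (block size assumed).
fp6_bpw = bits_per_weight(6, 8, 16)
# Per-layer scaling adds negligible overhead per weight, modelled here as zero.
fp8_bpw = bits_per_weight(8, 0, 16)

# Placeholder codebook: 16 uniformly spaced levels in [-1, 1].
CODEBOOK = np.linspace(-1.0, 1.0, 16)

def quantize_block(block, codebook):
    """Scale a block by its absolute maximum, then pick the nearest codebook entry."""
    block = np.asarray(block, dtype=float)
    scale = float(np.max(np.abs(block)))
    if scale == 0.0:
        return np.zeros(len(block), dtype=int), 0.0
    normalized = block / scale
    indices = np.abs(normalized[:, None] - codebook[None, :]).argmin(axis=1)
    return indices, scale

def dequantize_block(indices, scale, codebook):
    """Look up codebook entries and undo the block scale."""
    return codebook[indices] * scale

weights = np.random.default_rng(0).standard_normal(16)
codes, block_scale = quantize_block(weights, CODEBOOK)
restored = dequantize_block(codes, block_scale, CODEBOOK)
```

With a uniform codebook the worst-case reconstruction error is half a step times the block scale; a non-uniform codebook changes where that error is spent across the weight distribution.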

2605.14928 2026-05-15 cs.CL

Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA

Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-Jun Gao, Shudong Liu, Lidia S. Chao, Feng Wan, Derek F. Wong

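AI summary: This paper proposes Chain-of-Procedure (CoP), a hierarchical vision-language reasoning framework for visual procedure question answering (VP-QA), where the task is to predict the next action from a user-uploaded image of an intermediate state. The authors find that current vision-language models fall short at cross-modal retrieval of structured procedures from visual states and at aligning the granularity of image sequences with textual steps. CoP retrieves relevant instructions using visual cues, refines steps through semantic decomposition, and then generates the next step, achieving up to 13% absolute improvement over standard baselines. The paper also introduces the ProcedureVQA benchmark for systematically evaluating models on practical procedural reasoning.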

Details
English abstract

Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. VP-QA presents unique challenges where users query next-step actions by uploading images for intermediate states of complex procedures. To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a novel multimodal benchmark specifically designed for visual procedural reasoning. Through comprehensive analysis, we identify two critical limitations in current VLMs: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between image sequence granularity and textual step decomposition. To address these issues, we present Chain-of-Procedure (CoP), a hierarchical reasoning framework that first retrieves relevant instructions using visual cues, then performs step refinement through semantic decomposition, and finally generates the next step. Experiments across six VLMs demonstrate CoP's effectiveness, achieving up to 13% absolute improvement over standard baselines.

2605.14927 2026-05-15 cs.LG

Learning with Shallow Neural Networks on Cluster-Structured Features

Elisabetta Cornacchia, Laurent Massoulié

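AI summary: This paper studies the sample complexity of learning with gradient descent on shallow neural networks when the input features have cluster structure. The authors propose a tractable model in which the target depends on a small number of latent Boolean variables, and the input features are grouped into clusters correlated with those variables. Under an identifiability assumption, the analysis shows that, for a layerwise gradient-descent variant, the sample complexity scales with the number of hidden variables and, when the signal-to-noise ratio is sufficiently high, is independent of the input dimension up to logarithmic terms. Experiments on synthetic and real data test the theoretical findings.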

Details
Comments
10 pages main body, 2 figures
English abstract

The success of deep learning in high-dimensional settings is often attributed to the presence of low-dimensional structure in real-world data. While standard theoretical models typically assume that this structure lies in the target function, projecting unstructured inputs onto a low-dimensional subspace, data such as images, text or genomic sequences exhibit strong spatial correlations within the input space itself. In this paper, we propose a tractable model to study how these correlations affect the sample complexity of learning with gradient descent on shallow neural networks. Specifically, we consider targets that depend on a small number of latent Boolean variables, and input features grouped into clusters and correlated with the latent variables. Under an identifiability assumption, we show that for a layerwise gradient-descent variant, the sample complexity scales with the number of hidden variables and, when the signal-to-noise ratio is sufficiently high, is independent of the input dimension, up to logarithmic terms. We empirically test our theoretical findings on both synthetic and real data.
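
One way to picture the data model is with a small synthetic generator: latent variables in {-1, +1}, each copied into a cluster of features with independent sign flips, and a label that depends only on the latents. The construction below (flip noise, equal-size clusters, a product of two latents as the target, and a majority-vote readout) is an assumed instantiation chosen for illustration, not the paper's exact model or learning algorithm.

```python
import numpy as np

def sample_clustered_data(n_samples, n_latent, cluster_size, flip_prob, rng):
    """Latents z in {-1,+1}; each cluster holds noisy copies of one latent.
    Requires n_latent >= 2 because the label uses the first two latents."""
    z = rng.choice([-1, 1], size=(n_samples, n_latent))
    x = np.repeat(z, cluster_size, axis=1)        # cluster j = copies of z_j
    flips = rng.random(x.shape) < flip_prob
    x = np.where(flips, -x, x)                    # independent sign noise
    y = z[:, 0] * z[:, 1]                         # target depends on latents only
    return x, y, z

def cluster_majority(x, n_latent, cluster_size):
    """Estimate each latent by the sign of its cluster sum (odd cluster_size)."""
    grouped = x.reshape(len(x), n_latent, cluster_size)
    return np.sign(grouped.sum(axis=2))

rng = np.random.default_rng(0)
x, y, z = sample_clustered_data(500, 3, 21, 0.1, rng)
z_hat = cluster_majority(x, 3, 21)
recovery_rate = float((z_hat == z).mean())
```

The input dimension here is 63, yet the label is a function of 3 latents that averaging within clusters recovers almost perfectly, which gives some intuition for why sample complexity could track the number of latents rather than the input dimension when the noise is low.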

2605.14925 2026-05-15 cs.CV cs.LG

Road Maps as Free Geometric Priors: Weather-Invariant Drone Geo-Localization with GeoFuse

Yunsong Fang, Tingyu Wang, Zhedong Zheng

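AI summary: This paper studies geo-localization of drone images under adverse weather, where weather-degraded drone images must be matched against geo-tagged satellite images. To address weather-induced degradation and the cross-view domain gap, the authors propose GeoFuse, a cross-modal fusion framework that combines precisely aligned road maps with satellite imagery to produce more discriminative, weather-resilient representations. GeoFuse outperforms existing methods, with Recall@1 gains of +3.46% on University-1652 and +23.18% on DenseUAV.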

Details
Comments
18 pages, 4 figures
English abstract

Drone-view geo-localization aims to match a query drone image, often captured under adverse weather conditions (e.g., rain, snow, fog), against a gallery of geo-tagged satellite images. Weather-induced degradations in the drone view, such as noise, reduced visibility, and partial occlusions, severely exacerbate the intrinsic cross-view domain gap. While prior methods predominantly rely on weather-specific architectures or data augmentations, they have largely overlooked road map data, a readily available modality that provides strong, inherently weather-invariant geometric layout cues (e.g., road networks and building footprints) at negligible additional cost. We introduce GeoFuse, a cross-modal fusion framework that integrates precisely aligned road map tiles with satellite imagery to yield more discriminative and weather-resilient representations. We first augment the existing University-1652 and DenseUAV benchmarks with geo-aligned road maps, supplying structural priors robust to meteorological variations. Building on this, we propose a flexible fusion module that combines satellite and road map features via token-level and channel-level interactions, with a lightweight dynamic gating mechanism that adaptively weights modality contributions per instance. Finally, we employ class-level cross-view contrastive learning to promote robust alignment between weather-degraded drone features and the fused satellite-roadmap representations. Extensive experiments under diverse weather conditions show that GeoFuse consistently outperforms state-of-the-art methods, achieving +3.46% and +23.18% Recall@1 accuracy on the University-1652 and DenseUAV benchmarks, respectively.
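
The idea of a dynamic gate that weights modality contributions per instance can be sketched in a few lines. The version below computes one scalar gate per sample from the concatenated features and blends the two modalities with it; the linear-plus-sigmoid parameterization and all names are assumptions, and the token-level and channel-level interactions described in the abstract are not modelled.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(sat_feat, map_feat, gate_weights, gate_bias):
    """Blend satellite and road-map features with a per-instance gate.

    sat_feat, map_feat: arrays of shape (batch, dim)
    gate_weights: array of shape (2 * dim,), gate_bias: scalar
    """
    joint = np.concatenate([sat_feat, map_feat], axis=1)   # (batch, 2*dim)
    gate = sigmoid(joint @ gate_weights + gate_bias)       # (batch,)
    fused = gate[:, None] * sat_feat + (1.0 - gate[:, None]) * map_feat
    return fused, gate

satellite = np.array([[1.0, 2.0], [3.0, 4.0]])
road_map = np.array([[3.0, 0.0], [1.0, 0.0]])

# With zero weights and bias the gate is 0.5, so fusion is a plain average.
fused, gate = gated_fusion(satellite, road_map, np.zeros(4), 0.0)
```

In a trained model the gate parameters would be learned, so that the weighting between satellite and road-map features can differ from one instance to the next.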

2605.14923 2026-05-15 cs.CV

SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

Pengxin Xu, Xincheng Lin, Luping Xiao, Qing Jiang, Meishan Zhang, Hao Fei, Shanghang Zhang, Xingyu Chen

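AI summary: This work introduces Hierarchical Scene Parsing, an interaction-oriented task that represents scenes as explicit scene-object-part-affordance hierarchies in order to capture structured dependencies and improve visual semantic understanding. The authors present SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning, and construct SceneParser-Bench, a large-scale annotated benchmark. Experiments show the method outperforms existing models on hierarchical parsing while remaining compatible with conventional tasks and useful for a downstream planning task.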

详情
Comments
Preprint. Code, models, and dataset are provided in the manuscript
英文摘要

General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on our SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. Besides, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that our SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.
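
The scene -> object -> part -> affordance hierarchy with cross-level bindings maps naturally onto nested records. The dataclasses below are a hypothetical schema written for illustration (field names, box format, and the example scene are assumptions, not the benchmark's annotation format); the helper enumerates object-part-affordance chains of the kind the benchmark counts.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)

@dataclass
class Affordance:
    name: str
    point: Tuple[float, float]           # assumed interaction point (x, y)

@dataclass
class Part:
    name: str
    box: Box
    affordances: List[Affordance] = field(default_factory=list)

@dataclass
class SceneObject:
    name: str
    box: Box
    parts: List[Part] = field(default_factory=list)

@dataclass
class Scene:
    description: str
    objects: List[SceneObject] = field(default_factory=list)

def chains(scene: Scene) -> List[Tuple[str, str, str]]:
    """List every (object, part, affordance) chain bound across levels."""
    return [
        (obj.name, part.name, aff.name)
        for obj in scene.objects
        for part in obj.parts
        for aff in part.affordances
    ]

handle = Part("handle", (60.0, 40.0, 80.0, 90.0),
              [Affordance("grasp", (70.0, 65.0))])
mug = SceneObject("mug", (20.0, 30.0, 80.0, 100.0), [handle])
scene = Scene("a mug on a desk", [mug])
scene_chains = chains(scene)
```

Keeping affordances nested under their part and object makes the cross-level binding explicit, in contrast to isolated predictions where a grasp point is not tied to the handle it belongs to.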

2605.14920 2026-05-15 cs.RO

FU-MPC: Frontier- and Uncertainty-Aware Model Predictive Control for Efficient and Accurate UAV Exploration with Motorized LiDAR

Jianping Li, Pengfei Wan, Zhongyuan Liu, Yi Wang, Yiheng Chen, Xinhang Xu, Rui Jin, Boyu Zhou, Lihua Xie

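AI summary: This paper proposes FU-MPC, a frontier- and uncertainty-aware model predictive control method for improving the exploration efficiency and localization accuracy of UAVs in unknown environments. Using an independently rotating LiDAR, it combines global path planning with local optimization of the LiDAR scan, jointly weighing frontier exploration utility against direction-dependent localization uncertainty. Experiments in complex environments show higher exploration efficiency and more robust localization than fixed-pattern scanning and uncertainty-only baselines.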

Details
English abstract

Efficient UAV exploration in unknown environments requires rapid coverage expansion while maintaining accurate and reliable localization, since safe navigation in complex scenes depends on consistent mapping and pose estimation. However, for conventional LiDAR-equipped UAVs, the observable region is tightly coupled with the UAV pose and motion. Expanding coverage often requires additional translational or rotational maneuvers, which can reduce exploration efficiency and increase the risk of localization degradation in geometrically challenging environments. Motorized rotating LiDARs provide a promising solution by actively adjusting the sensor viewing direction without changing the UAV motion, thereby introducing an additional sensing degree of freedom. Nevertheless, existing exploration systems rarely exploit this scanning freedom as an explicit decision variable linked to both exploration progress and localization quality. To address this gap, we develop a UAV platform equipped with an independently actuated rotating LiDAR and propose a hierarchical exploration framework. The global planner organizes frontiers into representative viewpoints and sequences them using topology-aware transition costs. Built upon this planner, FU-MPC serves as a local receding-horizon scan controller that optimizes LiDAR rotation along the predicted flight trajectory. The controller jointly considers frontier-aware exploration utility and direction-dependent localization uncertainty, while lightweight surrogate evaluation enables real-time onboard execution. Experiments in complex environments demonstrate that the proposed system improves exploration efficiency while maintaining robust localization performance compared with fixed-pattern scanning and uncertainty-only baselines. The project page can be found at https://kafeiyin00.github.io/FU-MPC/.