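arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.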

2510.07118 2026-05-15 cs.CL cs.LG

TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning

Manish Nagaraj, Sakshi Choudhary, Utkarsh Saxena, Deepak Ravikumar, Kaushik Roy

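AI summary: TRIM is a data selection method for efficient instruction tuning that relies on fine-grained, attention-based features instead of the coarse-grained, gradient-based signals used by conventional methods. It extracts attention "fingerprints" from a small number of target samples to identify and select the data subset that matters most for defining the task, which maintains high performance while greatly reducing compute cost. Experiments show that the coresets selected by TRIM outperform existing methods on multiple downstream tasks and, in some cases, even surpass full-data fine-tuning.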

2510.05213 2026-05-15 cs.RO cs.AI cs.LG

VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka

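AI summary: VER is a vision expert Transformer for robot learning, designed to address the fact that pretrained vision foundation models excel in specific domains but generalize poorly across tasks. It distills multiple vision foundation models into a single expert library and uses a lightweight dynamic routing network to select task-relevant experts from the pretrained library, enabling efficient and flexible feature extraction. VER also introduces patch-wise expert routing and a curriculum Top-K annealing strategy, which improve the precision and adaptability of dynamic selection, and it achieves state-of-the-art performance on multiple robot tasks.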

2508.11845 2026-05-15 cs.SD cs.AI cs.IR cs.LG

AVEX: What Matters for Animal Vocalization Encoding

Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist

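AI summary: This paper studies the key factors that affect model performance in animal vocalization encoding, with the goal of building a general-purpose bioacoustic encoder for diverse downstream tasks. Through large-scale experiments, the authors analyze how training data diversity, model architecture, and training recipe affect encoder performance, and find that self-supervised pre-training followed by supervised post-training on a mixed bioacoustics and general-audio corpus gives the strongest results across tasks and datasets. The study also finds that data diversity matters in both training stages, and the authors state that they will release the model checkpoints to support further research and application.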

Comments: In The Fourteenth International Conference on Learning Representations 2026

English abstract

Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.

2507.18553 2026-05-15 cs.LG cs.DS cs.IT math.IT

The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm

Jiale Chen, Yalda Shabanzadeh, Elvir Crnčević, Torsten Hoefler, Dan Alistarh

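AI summary: This paper shows that the GPTQ quantization method is mathematically equivalent to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on lattices, which gives GPTQ an intuitive geometric interpretation and an error upper bound. Building on this equivalence, the authors design quantization methods that avoid weight clipping and outperform the original GPTQ, and they provide efficient GPU inference kernels. The work puts LLM quantization on a firmer theoretical footing and opens new directions for designing quantization algorithms for future billion-parameter models.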
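
For readers unfamiliar with the lattice algorithm named in the title, the following is a minimal sketch of the textbook form of Babai's nearest plane algorithm, assuming a square basis whose rows are linearly independent. It is not the paper's quantizer or its GPU kernels; the function name `babai_nearest_plane` and the example basis are illustrative assumptions.

```python
import numpy as np


def babai_nearest_plane(basis: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Return integer coefficients c so that c @ basis is a lattice point near target.

    basis: (n, n) array whose rows are linearly independent basis vectors.
    target: length-n vector to approximate.
    """
    n = basis.shape[0]

    # Gram-Schmidt orthogonalization of the rows (without normalization).
    gs = np.zeros_like(basis, dtype=float)
    for i in range(n):
        gs[i] = basis[i]
        for j in range(i):
            gs[i] -= (basis[i] @ gs[j]) / (gs[j] @ gs[j]) * gs[j]

    coeffs = np.zeros(n, dtype=int)
    residual = np.asarray(target, dtype=float).copy()

    # Go from the last basis vector to the first: pick the nearest
    # hyperplane along gs[i], then subtract that multiple of basis[i].
    for i in reversed(range(n)):
        c = int(np.rint((residual @ gs[i]) / (gs[i] @ gs[i])))
        coeffs[i] = c
        residual -= c * basis[i]
    return coeffs


# Illustrative basis and target (assumed values, not from the paper).
basis = np.array([[2.0, 0.0], [1.0, 3.0]])
coeffs = babai_nearest_plane(basis, np.array([2.9, 3.2]))
lattice_point = coeffs @ basis
```

Each coordinate is rounded once and never revisited, which is why the procedure is cheap but only approximately solves the closest vector problem.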

2507.15774 2026-05-15 cs.LG cs.AI

Time Series Forecasting Through the Lens of Dynamics

Alexis-Raja Brachet, Pierre-Yves Richard, Céline Hudelot

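AI summary: This paper examines the performance gap between deep learning models and shallow linear models in time series forecasting, and argues that a model should learn a direct link from past to future data points, that is, it should "learn dynamics." The authors introduce the $\texttt{PRO-DYN}$ framework to analyze the dynamics of existing models, and find that under-performing models tend to learn dynamics only partially and that the position of the dynamics block within the model is critical to performance. Based on systematic and empirical studies, they propose a simple, easy-to-use approach to model design and improvement.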

2507.01909 2026-05-15 cs.CV

Modality-agnostic, patient-specific digital twins modeling temporally varying digestive motion

Jorge Tapias Gomez, Nishant Nadkarni, Lando S. Bosma, Jue Jiang, Ergys D. Subashi, William P. Segars, James M. Balter, Mert R Sabuncu, Neelam Tyagi, Harini Veeraraghavan

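AI summary: This study addresses the difficulty of accurately evaluating deformable image registration (DIR) for highly mobile gastrointestinal organs. It proposes a modality-agnostic approach based on patient-specific digital twins (DT) that simulate temporally varying digestive motion and are used to assess DIR accuracy. Using a semi-automated pipeline, published gastrointestinal motion models, and real patient scans, the study generates 4D sequences with 21 motion phases, evaluates the registration accuracy of six DIR methods, and validates the accuracy of dose mapping. The approach supports detailed spatial and dosimetric evaluation in dynamic, anatomically complex regions and is of clear clinical relevance.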

DOI: 10.1088/1361-6560/ae2b46
Journal ref: Phys. Med. Biol. 71 (2026) 015029
Comments: This work is still under review; it contains 7 pages, 6 figures, and 4 tables

English abstract

Objective: Clinical implementation of deformable image registration (DIR) requires voxel-based spatial accuracy metrics such as manually identified landmarks, which are challenging to implement for highly mobile gastrointestinal (GI) organs. To address this, patient-specific digital twins (DT) modeling temporally varying motion were created to assess the accuracy of DIR methods. Approach: 21 motion phases simulating digestive GI motion as 4D sequences were generated from static 3D patient scans using published analytical GI motion models through a semi-automated pipeline. Eleven datasets were used, including six T2w FSE MRI (T2w MRI), two T1w 4D golden-angle stack-of-stars, and three contrast-enhanced CT scans. The motion amplitudes of the DTs were assessed against real patient stomach motion amplitudes extracted from independent 4D MRI datasets. The generated DTs were then used to assess six different DIR methods using target registration error, Dice similarity coefficient, and the 95th percentile Hausdorff distance using summary metrics and voxel-level granular visualizations. Finally, for a subset of T2w MRI scans from patients treated with MR-guided radiation therapy, dose distributions were warped and accumulated to assess dose warping errors, including evaluations of DIR performance in both low- and high-dose regions for patient-specific error estimation. Main results: Our proposed pipeline synthesized DTs modeling realistic GI motion, achieving mean and maximum motion amplitudes and a mean log Jacobian determinant within 0.8 mm and 0.01, respectively, similar to published real-patient gastric motion data. It also enables the extraction of detailed quantitative DIR performance metrics and rigorous validation of dose mapping accuracy. Significance: The pipeline enables rigorous testing of DIR tools for dynamic, anatomically complex regions, with granular assessment of spatial and dosimetric accuracy.

2506.05762 2026-05-15 cs.LG

BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

Yunpeng Qing, Yixiao Chi, Shuo Chen, Shunyu Liu, Kexuan Zhou, Sixu Lin, Litao Liu, Changqing Zou

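AI summary: This paper proposes BiTrajDiff, a data augmentation framework for offline reinforcement learning that uses bidirectional trajectory diffusion models to generate both future and history trajectories, improving dataset diversity and generalization. Unlike existing methods that focus only on reconstructing future trajectories, BiTrajDiff also considers the history paths that lead to the current state, and so explores potentially high-return regions of the state space more thoroughly. Experiments show that it outperforms other advanced data augmentation techniques on multiple benchmark tasks and markedly improves offline reinforcement learning performance.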

2506.01015 2026-05-15 cs.CV

AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

Yuyuan Liu, Yuanhong Chen, Chong Wang, Junlin Han, Junde Wu, Can Peng, Jingkun Chen, Yu Tian, Gustavo Carneiro

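AI summary: This paper proposes AuralSAM2, which aims to integrate audio information into SAM2 effectively and improve multimodal interaction in video segmentation. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts, and propagates auditory cues through SAM2's feature pyramid to strengthen cross-modal influence. An audio-guided contrastive loss is added to improve modality alignment. Experiments show notable performance gains on public benchmarks with little impact on interaction efficiency.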

2505.04535 2026-05-15 cs.LG cs.DC

Communication-Efficient Federated Fine-Tuning

Michael Theologitis, Vasilis Samoladas, Antonios Deligiannakis

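AI summary: This paper studies communication efficiency when fine-tuning large language models in federated learning and proposes a new family of algorithms, FDA-Opt, to address problems in existing methods such as fixed communication frequency and hard-to-tune parameters. The approach combines dynamic adjustment with optimization strategies and, without additional configuration, outperforms existing methods on natural language processing tasks, offering a more efficient and practical solution for federated fine-tuning.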

2501.12202 2026-05-15 cs.CV

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Shaoxiong Yang, Song Zhang, Yang Liu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo

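AI summary: This paper introduces Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. The system has two foundation components: Hunyuan3D-DiT, a shape generation model built on a scalable flow-based diffusion transformer, and Hunyuan3D-Paint, which uses geometric and diffusion priors to produce high-quality textures. The authors also built Hunyuan3D-Studio, a user-friendly platform that lets professionals and non-professionals generate and manipulate 3D assets efficiently. Experiments show that Hunyuan3D 2.0 outperforms existing state-of-the-art models in geometric detail, condition alignment, and texture quality, and it has been open-sourced to fill the gap in large-scale 3D generation models in the open-source community.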

2501.05465 2026-05-15 cs.CL

Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)

Akanksha Gupta, Bijo Thomas, Harshita Asnani, Phanindra Reddy Madduru, Samia Feroze, Shreyas Subramanian, Vikram Elango, Mecit Gungor

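AI summary: As foundation AI models keep growing in size, this paper surveys about 160 papers on small language models (SLMs) with between 1 and 8 billion parameters, and shows that small models can match or even outperform larger ones. It analyzes general-purpose SLMs, task-specific SLMs, and the techniques used to improve them, with the aim of giving the community modeling guidance for balancing performance, efficiency, scalability, and cost. It also defines the effective size of SLMs, which expresses their increased capability relative to large language models.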

2412.17155 2026-05-15 cs.CV cs.LG

The Potential of Convolutional Neural Networks for Cancer Detection

Hossein Molaeian, Kaveh Karamjani, Sina Teimouri, Saeed Roshani, Sobhan Roshani

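AI summary: This paper explores the potential of convolutional neural networks (CNNs) for cancer detection, with the aim of improving the accuracy of early cancer diagnosis through deep learning. It analyzes how multiple CNN architectures perform on different cancer datasets, compares the strengths and weaknesses of each approach, and identifies the model structures that perform best on cancer classification. The work serves as a reference for integrating CNNs into clinical diagnostic workflows and may help strengthen diagnostic capability in healthcare.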

2410.19653 2026-05-15 cs.LG

Conformal Prediction for Multimodal Regression

Alexis Bose, Jonathan Ethier, Paul Guinand

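AI summary: This paper proposes a conformal prediction method for multimodal regression, extending conformal prediction, which has traditionally been limited to numerical inputs, to multimodal settings such as images and unstructured text. The method uses internal features from complex neural network architectures that process multimodal information, in particular features extracted at the points where the modalities are fused, to build prediction intervals with distribution-free uncertainty guarantees. The result opens a new path for applying conformal prediction in domains rich in multimodal data.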
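
As background on how a conformal prediction interval is built, here is a minimal sketch of standard split conformal regression with absolute-residual scores. It assumes predictions from any already-fitted regressor and a held-out calibration set; it does not reproduce the paper's use of internal multimodal features. The function name `split_conformal_interval` and the toy numbers are illustrative assumptions.

```python
import numpy as np


def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Symmetric prediction intervals with target coverage 1 - alpha.

    cal_pred, cal_true: predictions and ground truth on a held-out
    calibration set. test_pred: predictions for new inputs.
    """
    # Nonconformity score: absolute residual on the calibration set.
    scores = np.abs(np.asarray(cal_true, dtype=float) - np.asarray(cal_pred, dtype=float))
    n = scores.size

    # Finite-sample corrected rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.inf if k > n else np.sort(scores)[k - 1]

    test_pred = np.asarray(test_pred, dtype=float)
    return test_pred - q, test_pred + q


# Toy calibration data (assumed values): residuals are 1, 2, ..., 9.
lower, upper = split_conformal_interval(
    cal_pred=np.zeros(9), cal_true=np.arange(1, 10), test_pred=[10.0], alpha=0.5
)
```

The coverage guarantee depends only on exchangeability of calibration and test points, not on the form of the underlying model, which is what makes the interval distribution-free.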

2605.14967 2026-05-15 cs.LG stat.ML

InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting

Mahdi Sabbaghi, George Pappas, Adel Javanmard, Hamed Hassani

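AI summary: This paper proposes InfoSFT, a supervised fine-tuning method that improves how large language models learn by focusing on tokens that are informative and have moderate confidence, which avoids over-fitting to low-probability samples and avoids suppressing existing capabilities. The method needs only a one-line change to the standard loss function, and it markedly improves generalization on math, code, and chain-of-thought tasks while better preserving the model's original performance.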

2605.14966 2026-05-15 cs.CV cs.AI

MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs

Wei Ding, Yilin Li, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang

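AI summary: This paper proposes MHSA, a lightweight framework that mitigates hallucinations in large vision-language models (LVLMs) by steering attention. MHSA learns to correct cross-modal attention patterns: a simple three-layer MLP generator is trained with supervision from the LVLM itself and from a DHCP discriminator, and produces corrected attention weights. At inference time the method leaves the LVLM parameters unchanged and only replaces the original cross-modal attention, which reduces hallucinations at both the generative and discriminative levels and offers a new perspective for research on LVLM hallucination.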

2605.14953 2026-05-15 cs.LG

Efficient Online Conformal Selection with Limited Feedback

Sreenivas Gollapudi, Kostas Kollias, Kamesh Munagala, Ali Sinop

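AI summary: This paper studies how to perform conformal selection efficiently under limited feedback, that is, how to guarantee with a preset success probability that at least one successful option is identified while spending as few resources as possible. The authors propose an update rule based on adaptive conformal inference that guarantees the average success probability in adversarial settings and achieves sublinear efficiency regret on i.i.d. data. Using a unified algorithmic technique and a Lyapunov-function analysis framework, the method covers both bandit and semi-bandit feedback, handles more complex settings than prior work, and requires substantially less feedback.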

2605.14950 2026-05-15 cs.CV cs.RO

Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu, Yunhe Li, Yuqian Fu, Yinxinyu Chen, Hongyi Cai, Zewei Ye, Bing Cheng, Kai Ye, Yiran Mao, Yilei Zhong, MingKang Dong, Junchi Yan, Gen Li, Bo Zhao

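AI summary: Evo-Depth is a lightweight depth-enhanced vision-language-action model that aims to improve spatial understanding in robotic manipulation. It uses a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images, and a Spatial Enhancement Module to inject that depth information into the vision-language representations, giving efficient spatial-semantic enhancement. Evo-Depth also introduces a Progressive Alignment Training strategy to better align the depth-enhanced representations with action learning, and it shows strong performance and efficiency in multiple simulation and real-world settings.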

English abstract

Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.

2605.14949 2026-05-15 cs.CV eess.IV eess.SP

A CUBS-Compatible Ultrasound Morphology and Uncertainty-Aware Baseline for Carotid Intima-Media Segmentation and Preliminary Risk Prediction

Aueaphum Aueawatthanaphisut

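AI summary: This study presents AtheroFlow-XNet, a reproducible ultrasound-based baseline for carotid intima-media segmentation and preliminary risk prediction, aimed at a fuller assessment of vascular risk in patients with atherosclerosis. The model is trained for supervised segmentation on intima-media masks derived from manual boundary annotations, incorporates clinical variables through an auxiliary risk-prediction branch, and uses Monte Carlo dropout for uncertainty-aware inference. It reaches a Dice coefficient of 0.7930 for segmentation and an area under the ROC curve of 0.6910 for preliminary risk prediction, offering a starting point for automated, ultrasound-based vascular wall analysis.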

Comments: 13 pages, 5 figures, 2 tables, 20 equations, 3 appendices

English abstract

Carotid atherosclerosis is a major contributor to ischemic stroke and transient ischemic attack. Conventional ultrasound assessment is commonly based on intima-media thickness, plaque appearance, stenosis degree, and peak systolic velocity, but these morphology- and velocity-based indicators may not fully capture patient-specific vascular risk. This study presents AtheroFlow-XNet, a CUBS-compatible ultrasound morphology and uncertainty-aware learning baseline for carotid intima-media segmentation and preliminary risk prediction. Using the Carotid Ultrasound Boundary Study dataset, manual lumen-intima and media-adventitia boundary annotations were converted into dense intima-media masks for supervised segmentation. Clinical variables were incorporated into an auxiliary risk-prediction branch, and Monte Carlo dropout was used for uncertainty-aware inference. The model was evaluated using a patient-level train-validation-test split with 1,522 training images, 326 validation images, and 328 testing images. The proposed model achieved a Dice coefficient of 0.7930 for LI-MA mask segmentation, a segmentation loss of 0.2359, and an area under the receiver operating characteristic curve of 0.6910 for preliminary risk prediction. Qualitative results showed that predicted masks were generally aligned with manual annotations, while uncertainty maps highlighted ambiguous wall-boundary regions. These results suggest that ultrasound-derived carotid morphology can support automated wall analysis and uncertainty-aware interpretation. Since CUBS does not provide Doppler waveforms or CFD-derived hemodynamic biomarkers, this work should be interpreted as a reproducible morphology-driven baseline. Future work will incorporate Doppler-derived flow profiles, patient-specific vascular reconstruction, and CFD-based wall shear biomarkers.
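
The Dice coefficient reported in the abstract above measures overlap between a predicted mask and a reference mask. A minimal sketch of the standard definition for binary masks follows; the function name `dice_coefficient`, the smoothing constant `eps`, and the toy masks are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np


def dice_coefficient(pred_mask, true_mask, eps=1e-8):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for binary masks of equal shape.

    eps keeps the ratio defined when both masks are empty.
    """
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)


# Toy 2x2 masks (assumed values): one overlapping pixel out of 2 + 1 foreground pixels.
score = dice_coefficient([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

A value of 1 means the two masks coincide exactly and a value near 0 means they barely overlap, so the reported 0.7930 indicates substantial but imperfect agreement with the manual annotations.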

2605.14948 2026-05-15 cs.CV

ACE-LoRA: Adaptive Orthogonal Decoupling for Continual Image Editing

Yuehao Liu, Weijia Zhang, Xuanming Shang, Zhizhou Chen, Yanhao Ge, Shanyan Guan, Chao Ma

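AI summary: This paper proposes ACE-LoRA, a dynamic regularization framework for continual image editing that addresses the problem of forgetting earlier knowledge while continually learning new tasks. The method uses adaptive orthogonal decoupling to identify and remove interference between tasks, and introduces a rank-invariant historical information compression strategy to make continual updates more scalable. The study also builds CIE-Bench, described as the first comprehensive benchmark for continual image editing, to give the field a standardized evaluation platform. Experiments show that the method outperforms existing approaches in instruction following, visual realism, and resistance to forgetting.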

2605.14944 2026-05-15 cs.RO cs.SY eess.SY

Behavioral Data-Driven Optimal Trajectory Generation for Rotary Cranes

Iskandar Khemakhem, Manuel Zobel, Johannes Schüle, Oliver Sawodny, Naoki Uchiyama, Abdallah Farrage

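AI summary: As the construction industry grows and skilled labor becomes scarce, automating crane control is increasingly important. This paper proposes a behavioral, data-driven method for open-loop trajectory generation for rotary cranes that reduces payload sway while also lowering operation time and energy consumption. Built on Willems' fundamental lemma and its generalizations, the method needs no explicit system model and generates smooth optimal trajectories directly from input-output data. Experiments validate it and show clear improvements over a conventional model-based approach in payload sway, tracking error, and run time.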

2605.14940 2026-05-15 cs.LG cs.AI eess.SP

Not All Symbols Are Equal: Importance-Aware Constellation Design for Semantic Communication

Albert Shaju, Christo Kurisummoottil Thomas, Mayukh Roy Chowdhury

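AI summary: This paper studies symbol constellation design for semantic communication and proposes an importance-aware joint semantic-physical layer framework. It extracts discrete semantic concepts, scores their semantic criticality, and uses deep reinforcement learning to select which symbols to transmit, yielding a semantic-aware constellation mapping at the physical layer. The paper introduces a semantic symbol vulnerability metric and a semantic protection probability, proves that conventional Gray-coded constellations are suboptimal when semantic importance is non-uniform, and validates the approach on multiple datasets at high spectral efficiency.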

Comments: Submitted to IEEE GLOBECOM 2026. 6 pages, 8 figures

English abstract

Semantic communication systems for goal-oriented transmission must protect task-relevant information not only through source compression but also via physical layer mapping. Existing approaches decouple constellation design and semantic encoding, exposing critical symbols to channel errors at the same rate as irrelevant ones. Contrary to this, in this paper, a joint semantic-physical layer framework is proposed, which is composed of a vector quantized-variational autoencoder that extracts discrete latent concepts, a semantic criticality indicator (SCI) that scores each concept by task relevance, and a deep reinforcement learning agent that dynamically selects the transmission subset based on instantaneous channel conditions. At the physical layer, a learned semantic-aware M-QAM constellation assigns symbol positions according to joint co-occurrence statistics and SCI scores, departing from the uniform spacing and Gray coding of standard M-QAM which minimizes average BER without regard for semantic content. We introduce a novel semantic symbol vulnerability (SSV) metric and a semantic protection probability (SPP) to quantify the exposure of task-critical symbols to decoding errors, and prove that any Gray-coded constellation is strictly suboptimal in SCI-Weighted SSV whenever the source exhibits non-uniform semantic importance and co-occurrence statistics. Simulation results demonstrate that the proposed constellation achieves near 100% SPP across modulation orders from 4-QAM to 1024-QAM versus 50% for standard constellations at high spectral efficiency, a 21:1 compression ratio with semantic quality above 0.9, generalizing across MNIST, Fashion-MNIST, and FSDD without modification.
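
To make the baseline concrete, the sketch below builds the standard Gray labeling for a square M-QAM grid, in which neighboring constellation points differ in exactly one bit. This is only the conventional scheme the abstract argues against, not the learned semantic-aware constellation; the helper names `gray_code`, `hamming_distance`, and `gray_qam_labels` are illustrative assumptions.

```python
def gray_code(index: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return index ^ (index >> 1)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two labels differ."""
    return bin(a ^ b).count("1")


def gray_qam_labels(bits_per_axis: int) -> list[list[int]]:
    """Bit labels for a square M-QAM grid with M = 4 ** bits_per_axis.

    The row index supplies the high bits and the column index the low
    bits, each Gray-coded independently.
    """
    side = 1 << bits_per_axis
    return [
        [(gray_code(row) << bits_per_axis) | gray_code(col) for col in range(side)]
        for row in range(side)
    ]


labels = gray_qam_labels(2)  # 16-QAM: a 4 x 4 grid of 4-bit labels
```

Because every nearest-neighbor error costs one bit regardless of which symbol is hit, this labeling minimizes average bit error rate but gives all symbols the same protection, which is the limitation the abstract targets.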

2605.14938 2026-05-15 cs.LG cs.CV

Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models

Yuehao Liu, Shanyan Guan, Weijia Zhang, Xuanming Shang, Yanhao Ge, Wei Li, Chao Ma

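AI summary: This paper proposes Octopus, a continual learning framework that addresses the tendency of multimodal large language models to forget old knowledge as they learn new tasks step by step. The method is built on History-Free Gradient Orthogonalization (HiFGO), which enforces orthogonality at the gradient level to reduce parameter interference without storing data from past tasks, avoiding privacy and storage concerns. Experiments show that Octopus outperforms existing state-of-the-art methods on the UCIT dataset, improving average and final task accuracy by 2.14% and 6.82%, respectively.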
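
As a generic illustration of what orthogonality at the gradient level means, the sketch below removes from a gradient its component along a reference direction. It is a plain vector projection, not the HiFGO procedure: how Octopus obtains its reference directions without stored history is not described in this listing, and the function name `project_orthogonal` and the example vectors are illustrative assumptions.

```python
import numpy as np


def project_orthogonal(grad, reference, eps=1e-12):
    """Return grad with its component along `reference` removed.

    If `reference` is numerically zero, grad is returned unchanged.
    """
    grad = np.asarray(grad, dtype=float)
    reference = np.asarray(reference, dtype=float)
    denom = reference @ reference
    if denom < eps:
        return grad.copy()
    return grad - (grad @ reference) / denom * reference


# Example vectors (assumed values): the update keeps only the part of the
# new gradient that does not move parameters along the reference direction.
new_grad = project_orthogonal([1.0, 1.0], [1.0, 0.0])
```

An update along the projected gradient has zero first-order effect in the reference direction, which is the basic intuition behind using orthogonality to limit interference between tasks.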

2605.14937 2026-05-15 cs.LG cs.AI cs.RO

Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations

Jonathan Spieler, Angel Villar-Corrales, Sven Behnke

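AI summary: Slot-MPC is a goal-conditioned model predictive control framework built on object-centric representations, aimed at improving an agent's planning in complex environments. It uses a vision encoder to learn structured representations of the individual objects in a scene, builds an action-conditioned dynamics model on top of them, and applies model predictive control at inference time for efficient action planning. Experiments show that Slot-MPC improves both task performance and planning efficiency compared with non-object-centric world models, and that gradient-based MPC performs better in offline settings with limited state-action coverage.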

2605.14935 2026-05-15 cs.CV

Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control

Nhat Le, Daochang Liu, Anh Nguyen, Ajmal Mian

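AI summary: This paper proposes MSCoT, a multi-scale coarse-to-fine model for test-time human motion synthesis and control. It decomposes motion into a multi-scale hierarchical representation and predicts the full token sequence at each temporal scale in a coarse-to-fine manner, enabling efficient and flexible control. With a multi-scale token guidance strategy and a lightweight token refinement module, MSCoT overcomes the difficulties of discrete sampling and improves control accuracy and generation quality. Experiments show that it outperforms existing methods in motion quality, control accuracy, and inference speed.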

2605.14929 2026-05-15 cs.LG cs.AR

A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models

Earl Killian

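AI summary: This paper proposes a hardware-aware, per-layer post-training quantization methodology for compressing the weights of large language models, achieving near-lossless reconstruction at 4.5 to 6 bits per weight. The method combines selection between fixed and dynamic codebooks, block-wise scaling, activation-weighted cosine selection, and sensitive-layer optimization, and introduces a new hardware-efficient lookup-table output format (HIF) to improve performance and energy efficiency. Experiments on several open-source models show better weight reconstruction accuracy than a conventional FP8 baseline at lower storage cost.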

2605.14928 2026-05-15 cs.CL

Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA

Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-Jun Gao, Shudong Liu, Lidia S. Chao, Feng Wan, Derek F. Wong

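AI summary: This paper proposes Chain-of-Procedure (CoP), a hierarchical visual-language reasoning framework for visual procedural question answering (VP-QA), the task of predicting the next step from an image of an intermediate state uploaded by the user. The authors observe that current vision-language models fall short in structured procedural reasoning and in matching the granularity of image sequences to textual steps. CoP retrieves relevant instructions from visual cues, refines the steps through semantic decomposition, and then generates the next step, which markedly improves performance on the task. The paper also introduces the ProcedureVQA benchmark for systematically evaluating models on practical procedural reasoning.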

2605.14927 2026-05-15 cs.LG

Learning with Shallow Neural Networks on Cluster-Structured Features

Elisabetta Cornacchia, Laurent Massoulié

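AI summary: This paper studies the sample complexity of learning with shallow neural networks trained by gradient descent when the input features have a cluster structure. The authors propose a tractable model in which the target function depends on a small number of latent Boolean variables and the input features are grouped into clusters associated with those variables. Under an identifiability assumption, the theoretical analysis shows that when the signal-to-noise ratio is high enough, the sample complexity depends only on the number of hidden variables and is independent of the input dimension up to logarithmic factors. Experiments on synthetic and real data support the theory.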

2605.14925 2026-05-15 cs.CV cs.LG

Road Maps as Free Geometric Priors: Weather-Invariant Drone Geo-Localization with GeoFuse

Yunsong Fang, Tingyu Wang, Zhedong Zheng

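AI summary: This paper studies geo-localization of drone images under adverse weather, where the goal is to match weather-degraded drone images against geo-tagged satellite images. To address weather-induced image degradation and the cross-view domain gap, the authors propose GeoFuse, a cross-modal fusion framework that combines precisely aligned road maps with satellite imagery to produce more discriminative, weather-robust representations. Experiments show that GeoFuse outperforms existing methods, improving Recall@1 by 3.46% on University-1652 and by 23.18% on DenseUAV.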

Comments: 18 pages, 4 figures

English abstract

Drone-view geo-localization aims to match a query drone image, often captured under adverse weather conditions (e.g., rain, snow, fog), against a gallery of geo-tagged satellite images. Weather-induced degradations in the drone view, such as noise, reduced visibility, and partial occlusions, severely exacerbate the intrinsic cross-view domain gap. While prior methods predominantly rely on weather-specific architectures or data augmentations, they have largely overlooked road map data, a readily available modality that provides strong, inherently weather-invariant geometric layout cues (e.g., road networks and building footprints) at negligible additional cost. We introduce GeoFuse, a cross-modal fusion framework that integrates precisely aligned road map tiles with satellite imagery to yield more discriminative and weather-resilient representations. We first augment the existing University-1652 and DenseUAV benchmarks with geo-aligned road maps, supplying structural priors robust to meteorological variations. Building on this, we propose a flexible fusion module that combines satellite and road map features via token-level and channel-level interactions, with a lightweight dynamic gating mechanism that adaptively weights modality contributions per instance. Finally, we employ class-level cross-view contrastive learning to promote robust alignment between weather-degraded drone features and the fused satellite-roadmap representations. Extensive experiments under diverse weather conditions show that GeoFuse consistently outperforms state-of-the-art methods, achieving +3.46% and +23.18% Recall@1 accuracy on the University-1652 and DenseUAV benchmarks, respectively.
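
The Recall@1 figures quoted in the abstract above count how often the top-ranked gallery item is a correct match for the query. A minimal sketch of that metric with cosine similarity follows; it assumes precomputed feature vectors and one location label per image. The function name `recall_at_1` and the toy features are illustrative assumptions, not the benchmarks' official evaluation code.

```python
import numpy as np


def recall_at_1(query_feats, gallery_feats, query_labels, gallery_labels):
    """Fraction of queries whose most similar gallery item has the same label."""
    q = np.asarray(query_feats, dtype=float)
    g = np.asarray(gallery_feats, dtype=float)

    # L2-normalize so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)

    top1 = np.argmax(q @ g.T, axis=1)
    hits = np.asarray(gallery_labels)[top1] == np.asarray(query_labels)
    return float(np.mean(hits))


# Toy features (assumed values): the third query is closest to the wrong location.
r1 = recall_at_1(
    query_feats=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.5]],
    gallery_feats=[[1.0, 0.1], [0.1, 1.0]],
    query_labels=[0, 1, 1],
    gallery_labels=[0, 1],
)
```

In the drone-to-satellite setting, the queries would be drone-view features and the gallery would hold satellite (or fused satellite and road-map) features.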

2605.14923 2026-05-15 cs.CV

SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

Pengxin Xu, Xincheng Lin, Luping Xiao, Qing Jiang, Meishan Zhang, Hao Fei, Shanghang Zhang, Xingyu Chen

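AI summary: This study proposes Hierarchical Scene Parsing, an interaction-oriented task that uses an explicit scene-object-part-function hierarchy to capture structured dependencies within a scene and improve visual semantic understanding. To this end it introduces SceneParser, a vision-language-model-based approach trained for unified hierarchical generation with structure-completed pseudo-labels and curriculum learning, and builds SceneParser-Bench, a benchmark with a large amount of annotated data. Experiments show that the method outperforms existing models on hierarchical parsing and remains compatible with, and useful for, conventional tasks and downstream planning tasks.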

2605.14920 2026-05-15 cs.RO

FU-MPC: Frontier- and Uncertainty-Aware Model Predictive Control for Efficient and Accurate UAV Exploration with Motorized LiDAR

Jianping Li, Pengfei Wan, Zhongyuan Liu, Yi Wang, Yiheng Chen, Xinhang Xu, Rui Jin, Boyu Zhou, Lihua Xie

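AI summary: This paper proposes FU-MPC, a frontier- and uncertainty-aware model predictive control method that improves the exploration efficiency and localization accuracy of UAVs in unknown environments. Using an independently rotating LiDAR, it combines global path planning with local optimization of the LiDAR rotation along the predicted flight trajectory, and jointly accounts for frontier exploration utility and direction-dependent localization uncertainty. Experiments in complex environments show higher exploration efficiency and more robust localization than fixed-pattern scanning and uncertainty-only baselines.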

English abstract

Efficient UAV exploration in unknown environments requires rapid coverage expansion while maintaining accurate and reliable localization, since safe navigation in complex scenes depends on consistent mapping and pose estimation. However, for conventional LiDAR-equipped UAVs, the observable region is tightly coupled with the UAV pose and motion. Expanding coverage often requires additional translational or rotational maneuvers, which can reduce exploration efficiency and increase the risk of localization degradation in geometrically challenging environments. Motorized rotating LiDARs provide a promising solution by actively adjusting the sensor viewing direction without changing the UAV motion, thereby introducing an additional sensing degree of freedom. Nevertheless, existing exploration systems rarely exploit this scanning freedom as an explicit decision variable linked to both exploration progress and localization quality. To address this gap, we develop a UAV platform equipped with an independently actuated rotating LiDAR and propose a hierarchical exploration framework. The global planner organizes frontiers into representative viewpoints and sequences them using topology-aware transition costs. Built upon this planner, FU-MPC serves as a local receding-horizon scan controller that optimizes LiDAR rotation along the predicted flight trajectory. The controller jointly considers frontier-aware exploration utility and direction-dependent localization uncertainty, while lightweight surrogate evaluation enables real-time onboard execution. Experiments in complex environments demonstrate that the proposed system improves exploration efficiency while maintaining robust localization performance compared with fixed-pattern scanning and uncertainty-only baselines. The project page can be found at https://kafeiyin00.github.io/FU-MPC/.
