arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.07424 2026-05-11 cs.LG

A Flexible Adaptive Stable Clustering Algorithm for Archive-Scale Online Mass Spectrometry

Shao Shi, Xin Yang, Huiran Feng, Jianhuai Ye, Tianlong Hu, Yaling Zeng, Tzung-May Fu, Lei Zhu, Huizhong Shen, Chen Wang, Shu Tao

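AI summary: This study addresses the large-scale data streams produced by online mass spectrometry and proposes FASC, a Flexible Adaptive Stable Clustering algorithm, to resolve the trade-off that existing methods face among scalability, metric flexibility, and algorithmic stability. FASC decouples the similarity kernel from the optimization logic and combines a Density-Augmented Similarity Selection rule with geometric constraints to achieve deterministic, order-independent convergence. Experiments show strong clustering performance on standard datasets; applied to atmospheric aerosol mass spectra, the algorithm scales linearly in time, reveals the aging pathways of secondary inorganic aerosols, and detects industrial tracers of very low abundance.

Abstract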

Modern online mass spectrometry generates multi-terabyte data streams critical for understanding Earth's environmental systems. However, extracting actionable chemical insights from these repositories is impeded by a computational bottleneck: existing clustering methods force a compromise among scalability, metric flexibility, and algorithmic stability. Here, we introduce Flexible Adaptive Stable Clustering (FASC), a dynamical systems framework that resolves these constraints by architecturally decoupling the similarity kernel from rigorous optimization logic. Unlike legacy heuristics that suffer from stochastic drift and algorithmic blending, FASC employs a Density-Augmented Similarity Selection rule and geometric constraints to guarantee deterministic, order-independent convergence. After validating FASC on canonical machine-learning ground truths (achieving >99.5% cluster purity and 0.99 Adjusted Rand Index), we deployed the framework on 25 million mass spectra of atmospheric aerosols. Demonstrating strictly linear empirical runtime scaling (O(N)), FASC autonomously mapped atmospheric aging pathways of secondary inorganic aerosols while isolating ultra-rare industrial tracers (<0.2% abundance), providing a scalable infrastructure for mining environmental big data.

2605.07420 2026-05-11 cs.LG cs.CV

SR$^2$-LoRA: Self-Rectifying Inter-layer Relations in Low-Rank Adaptation for Class-Incremental Learning

Fengqiang Wan, Yipeng Lin, Kan Lv, Yang Yang

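AI summary: In class-incremental learning, pre-trained models adapted with parameter-efficient fine-tuning show promise but still suffer catastrophic forgetting when adapting to new tasks. The paper analyzes this problem from the perspective of inter-layer relation drift and proposes SR$^2$-LoRA, which mitigates forgetting by constraining changes in inter-layer relations. The method aligns the singular values of the relation matrices that the old and new models induce on current-task samples, improving robustness and performance in multi-task settings.

Abstract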

Pre-trained models with parameter-efficient fine-tuning (PEFT) have demonstrated promising potential for class-incremental learning (CIL), yet catastrophic forgetting still persists when adapting models to new tasks. In this paper, we present a novel perspective on catastrophic forgetting through the analysis of inter-layer relation drift, i.e., the progressive disruption of relationships among layer-wise representations during the learning of new tasks. We theoretically show that the increase of such drift reduces the classification margins of previously learned tasks, thereby degrading overall model performance. To address this issue, we propose Self-Rectifying inter-layer Relation Low-Rank Adaptation (SR$^2$-LoRA), a simple yet effective method that mitigates catastrophic forgetting by constraining inter-layer relation drift. Specifically, SR$^2$-LoRA constructs the relation matrices induced by the previous and current models on current-task samples, and aligns the corresponding singular values. We further theoretically show that this alignment exhibits greater robustness to estimation perturbations than direct entry-wise alignment. Extensive experiments on standard CIL benchmarks demonstrate that SR$^2$-LoRA effectively mitigates catastrophic forgetting, with its advantages becoming more pronounced as the number of tasks increases. Code is available at https://github.com/FqWan24/SR-2-LoRA.
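
The singular-value alignment described in this abstract can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the relation matrix, so the sketch assumes a cosine-similarity matrix between the layer-wise features of one sample, and the names `relation_matrix` and `singular_value_alignment_loss` are hypothetical.

```python
import numpy as np

def relation_matrix(layer_feats):
    """Assumed relation matrix: cosine similarity between layer-wise features.

    layer_feats has shape (L, D): one D-dimensional feature vector per layer.
    """
    norms = np.linalg.norm(layer_feats, axis=1, keepdims=True)
    unit = layer_feats / np.maximum(norms, 1e-12)
    return unit @ unit.T  # shape (L, L)

def singular_value_alignment_loss(old_feats, new_feats):
    """Squared distance between the singular values of the two relation matrices."""
    s_old = np.linalg.svd(relation_matrix(old_feats), compute_uv=False)
    s_new = np.linalg.svd(relation_matrix(new_feats), compute_uv=False)
    return float(np.sum((s_old - s_new) ** 2))

# Toy data (hypothetical): 4 layers, 16-dimensional features.
rng = np.random.default_rng(0)
old = rng.normal(size=(4, 16))               # features from the previous model
new = old + 0.5 * rng.normal(size=(4, 16))   # drifted features from the current model
loss_same = singular_value_alignment_loss(old, old)
loss_drift = singular_value_alignment_loss(old, new)
```

In a training loop, a term like this would presumably be added to the task loss as a regularizer and computed with an autograd framework; comparing spectra rather than individual entries is the design choice the abstract motivates with its robustness argument.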

2605.07418 2026-05-11 cs.CV

Learning Image-Adaptive Scale Fields for Metric Depth Recovery

Yuanyan Li, Matthias Althoff

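AI summary: This paper studies how to recover accurate metric depth from monocular depth estimates when only sparse metric anchors are available. The authors propose image-adaptive scale field modeling: the depth correction is expressed as a low-dimensional linear combination of image-adaptive basis maps built from semantic and geometric cues, and the weights are solved efficiently from the sparse anchors by least squares. The method delivers strong metric depth recovery and robustness across multiple datasets and representative depth estimation models.

Abstract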

Monocular depth estimation (MDE) typically produces depth estimations that are defined up to an unknown scale or shift. When only sparse metric anchors are available, recovering accurate metric depth becomes challenging yet necessary for practical applications. We address this problem by formulating metric depth recovery as image-adaptive scale field modeling. Instead of directly correcting the depth, we reformulate the correction as a low-dimensional linear combination of image-adaptive basis maps. These maps are derived from semantic and geometric cues encoded in the MDE estimations and intermediate representations. The weights of basis maps are efficiently determined from sparse metric anchors via a least-squares problem. This formulation yields improved metric depth accuracy, strong robustness under extreme anchor sparsity, and an interpretable decomposition of spatial scale variations. Extensive experiments across multiple datasets and representative MDE models demonstrate the effectiveness and general applicability of our approach.
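
The least-squares step can be illustrated with a short NumPy sketch. It assumes, for illustration only, that the correction is a multiplicative scale field, so that metric depth is approximately (sum over k of w_k B_k) times the relative depth; the abstract does not state the exact form, and `fit_scale_field` and the synthetic basis maps are hypothetical.

```python
import numpy as np

def fit_scale_field(rel_depth, basis_maps, anchor_rc, anchor_depth):
    """Solve for basis weights from sparse metric anchors by least squares.

    rel_depth:    (H, W) relative depth from an MDE model
    basis_maps:   (K, H, W) image-adaptive basis maps (assumed given)
    anchor_rc:    (N, 2) integer row/column positions of the anchors
    anchor_depth: (N,) metric depth at the anchors
    """
    rows, cols = anchor_rc[:, 0], anchor_rc[:, 1]
    # Design matrix: entry (n, k) is B_k(p_n) * d(p_n).
    design = basis_maps[:, rows, cols].T * rel_depth[rows, cols][:, None]
    weights, *_ = np.linalg.lstsq(design, anchor_depth, rcond=None)
    scale_field = np.tensordot(weights, basis_maps, axes=1)  # (H, W)
    return weights, scale_field * rel_depth

# Synthetic check: a constant map plus a horizontal ramp as the two basis maps.
rng = np.random.default_rng(0)
height, width = 8, 8
ramp = np.tile(np.linspace(0.0, 1.0, width), (height, 1))
basis = np.stack([np.ones((height, width)), ramp])
rel = rng.uniform(0.5, 2.0, size=(height, width))
true_w = np.array([2.0, 0.5])
true_metric = (true_w[0] * basis[0] + true_w[1] * basis[1]) * rel
anchors = np.array([[0, 0], [1, 3], [2, 5], [4, 7], [6, 2], [7, 6]])
anchor_depth = true_metric[anchors[:, 0], anchors[:, 1]]
est_w, est_metric = fit_scale_field(rel, basis, anchors, anchor_depth)
```

With K basis maps the system has only K unknowns, which is why a handful of anchors can suffice as long as they sample distinct values of the basis maps.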

2605.07413 2026-05-11 cs.LG

Risk-Consistent Multiclass Learning from Random Label-Subset Membership Queries

Jiaxu Su, Junpeng Li, Changchun Hua, Yana Yang

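AI summary: For settings where exact class labels are costly or unreliable, this paper studies multiclass learning from random label-subset membership queries. Weak supervision is obtained by asking whether the true label belongs to a given label subset, and the paper proposes a learning framework based on empirical risk minimization. It derives an unbiased estimator of the target risk and introduces corrected risk estimators to address negative empirical risk and overfitting; theoretical analysis establishes generalization and consistency guarantees, and experiments confirm the method's effectiveness.

Abstract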

Obtaining accurate class labels is often costly or unreliable, and may also be limited by privacy or other practical conditions. Compared with asking an annotator to provide the exact class, it is often easier to ask whether the true label belongs to a certain label subset. This query-response form defines a distinct weak-supervision mechanism: weak supervision information is generated through feedback on a label subset. Although weakly supervised learning has studied many learning frameworks, most existing work starts from established weak label objects. A systematic characterization is still lacking for weakly supervised learning generated directly by such query-response observations. This paper proposes a multiclass learning framework under random label-subset queries. We model the data-generating distribution of query-response observations and derive an unbiased estimator of the target risk under the empirical risk minimization (ERM) framework. To address negative empirical risk and the associated overfitting problem, we introduce corrected risk estimators based on non-negative and absolute-value corrections. Theoretical analysis establishes a conditional generalization and excess-risk bound for the unbiased estimator, and a bias-and-consistency result for the corrected risk estimator. Experiments under the matched random-query mechanism demonstrate the feasibility of direct query-response learning and the stabilization effect of risk correction.
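
The two corrections named in this abstract follow a pattern common in weakly supervised learning, sketched below. The sketch assumes the unbiased estimator decomposes into partial risk terms that are non-negative in expectation but can go negative on finite samples; the abstract does not say which terms the paper corrects, so `corrected_risk` and the example values are hypothetical.

```python
def corrected_risk(partial_risks, mode="nonneg"):
    """Combine partial empirical risks with an optional correction.

    partial_risks: iterable of floats, each an empirical estimate of a term
                   that is non-negative in expectation (an assumption here).
    mode: "unbiased" sums the terms as they are,
          "nonneg" clips each term at zero,
          "abs" replaces each term by its absolute value.
    """
    if mode == "unbiased":
        return sum(partial_risks)
    if mode == "nonneg":
        return sum(max(0.0, r) for r in partial_risks)
    if mode == "abs":
        return sum(abs(r) for r in partial_risks)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical partial risks; the second term has gone negative on this sample.
terms = [0.4, -0.1, 0.3]
risk_unbiased = corrected_risk(terms, "unbiased")
risk_nonneg = corrected_risk(terms, "nonneg")
risk_abs = corrected_risk(terms, "abs")
```

Both corrections introduce bias in exchange for keeping the training objective bounded below, which is consistent with the bias-and-consistency analysis the abstract mentions.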

2605.07412 2026-05-11 cs.LG cs.AI

Tracking Large-scale Shared Bikes with Inertial Motion Learning in GNSS Blocked Environments

Feng Liu, Kejia Li, Zhiwei Yang, Chunwei Yang, Qun Li, Guobin Wu, Qiang Ni, Ruipeng Gao

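AI summary: This paper studies high-precision trajectory tracking of large-scale shared bikes with inertial navigation in complex environments where GNSS signals are blocked. To address the cumulative drift and poor robustness of low-cost inertial sensors, the authors propose an inertial tracking framework that combines bicycle mechanical constraints with a mixture-of-experts model, using multiple expert modules and a gating mechanism to improve multi-task learning and enable uncertainty-aware trajectory estimation. Experiments on real-world riding data show an accuracy improvement of at least 12% over baselines, with wheel speed errors below 0.5 m/s at the 95th percentile.

Comments: Submitted to IEEE Transactions on Intelligent Transportation Systems (T-ITS). Journal article. 14 pages, 18 figures, 10 tables

Abstract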

Although Global Navigation Satellite Systems (GNSS) provide a general solution for bike tracking outdoors, there still exist complex riding environments where only inertial navigation systems work, such as urban canyons. Despite decades of research, localization using only low-cost inertial sensors still faces challenges such as cumulative drifts and poor robustness caused by filtering methods. Furthermore, sensors such as cameras and LiDAR could provide reliable measurements, but they are not suitable for large-scale deployment. In this paper, we propose an inertial tracking framework that integrates bicycle mechanical constraints with a mixture-of-experts model. Specifically, we leverage multiple expert modules to capture shared representations and weight them through the gating mechanism, thus improving multi-task learning performance and enabling uncertainty-aware trajectory estimation. Furthermore, based on the mechanical transmission between the pedal and the rear wheel of a bike, we explore the intrinsic relationship between the rider's periodic pedalling behaviors and acceleration variations, and convert such patterns into the bike's wheel speed for dynamic calibration. Experiments with real-world riding data from shared bikes of the DiDi ride-hailing platform demonstrate that our system improves the accuracy of baselines by at least 12%, with wheel speed errors below 0.5 m/s at the 95th percentile.

2605.07409 2026-05-11 cs.CL cs.LG stat.AP

The Proxy Presumption: From Semantic Embeddings to Valid Social Measures

Baishi Li, Ta Yu, Kelvin J. L. Koa, Ke-Wei Huang

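AI summary: This paper examines a core validity problem in applying natural language processing to computational social science: the "Proxy Presumption", in which geometric properties of semantic embeddings (such as cosine distance) are used directly to measure social concepts (such as novelty or creativity), which may introduce bias. The authors propose the Construct Validity Protocol (CVP), which combines causal representation learning and psychometrics into a rigorous pipeline from conceptual definition to quantitative verification, and introduce Counterfactual Neutralization to reduce confounding in embedding space. The work gives the community a standardized set of validity tests for turning heuristic proxies into scientifically reliable measurement instruments.

Comments: ACL 2026

Abstract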

Natural Language Processing is rapidly evolving into a primary instrument for Computational Social Science, with researchers increasingly using embeddings to measure latent constructs such as novelty, creativity, and bias. However, this transition faces a fundamental validity challenge: the "Proxy Presumption," or the reliance on geometric properties (e.g., cosine distance) as direct measures of social concepts. We argue that without explicit validation, unsupervised representations remain entangled mixtures of the target construct ($C$) and confounding attributes ($Z$) like topic, style, and authorship. To bridge the gap between semantic embeddings and valid social measures, we introduce the Construct Validity Protocol (CVP). Drawing on causal representation learning and psychometrics, the CVP offers a rigorous pipeline from conceptualization to quantitative verification. We further propose Counterfactual Neutralization, a novel method using LLMs to reduce confounding in embedding space. By providing a standardized Validity Suite, including tests for discriminant, incremental, and predictive validity, this work offers the community a toolkit to transform heuristic proxies into robust, scientifically defensible instruments.

2605.07407 2026-05-11 cs.LG

Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer

Gajendra Katuwal, Advait Koparkar, Salar Abbaspourazad, Anshuman Mishra, Sarvesh Kirthivasan

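AI summary: Health foundation models (FMs) learn useful representations from wearable sensors, but interpreting what they encode and transferring knowledge across modalities after training remains challenging. This paper proposes a post-training framework that decomposes frozen embeddings into interpretable directions, called symbols, and uses them to align embedding spaces without retraining. Evaluated on three health FMs for photoplethysmography (PPG) and accelerometer data, the extracted symbols associate selectively with health conditions and physiological attributes and are partially shared across modalities and architectures, indicating that symbol alignment recovers a shared low-dimensional subspace rich in physiological information and supports cross-modal knowledge transfer.

Comments: 8 pages ICML workshop, 4 main figures

Abstract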

Health foundation models (FMs) learn useful representations from wearable sensors, but interpreting what they encode and transferring that knowledge across modalities after training remains difficult. We present a post-training framework that decomposes frozen embeddings into interpretable directions, referred to as symbols, and use these symbols to align the embedding spaces without retraining. We evaluate the framework on three FMs for photoplethysmography (PPG) and accelerometer data, independently pretrained on ~20M minutes of unlabeled data from ~172K participants, and analyzed on a held-out cohort of 30K subjects. We find that extracted symbols associate selectively with health conditions and physiological attributes, and these associations are partially shared across modalities and architectures. Cross-modal transfer via symbols retains more than 95% of in-domain performance, is nearly symmetric across domain directions, and saturates with limited paired data, together indicating that alignment recovers a shared low-dimensional subspace rich in physiological information. Overall, these results suggest that health FM embeddings contain an interpretable symbolic organization that is shared across modalities and supports cross-domain transfer without joint training.

2605.07402 2026-05-11 cs.CV

InsHuman: Towards Natural and Identity-Preserving Human Insertion

Jie Li, Shulian Zhang, Yangyang Gao, Wenbo Li, Yulun Zhang, Yong Guo, Jian Chen

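AI summary: InsHuman is an image editing method for inserting specific individuals into a target background naturally while preserving identity. It proposes two techniques, Human-Background Adaptive Fusion (HBAF) and Face-to-Face ID-Preserving (FFIP), which align human regions and maintain facial identity consistency respectively, and builds BDP-InsHuman, a high-quality dataset with realistic human-background interactions. Experiments show that InsHuman generates plausible images while keeping human identity unchanged, significantly improving human insertion.

Abstract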

Human insertion aims to naturally place specific individuals into a target background. Although existing image editing models may have such ability, they often produce failure cases, including inappropriate human poses in the new background, an inconsistent number of people, and modified facial identity. Moreover, publicly available human datasets often lack full-body portraits and realistic physical interaction between humans and their background. To address these challenges, we propose InsHuman for natural and identity-preserving human insertion. Specifically, we propose Human-Background Adaptive Fusion (HBAF), which detects foreground humans to obtain a binary mask and applies region-aware weighting to align the human regions between predicted and ground-truth latents, ensuring the person's pose, count, and overall appearance are coherently adapted to the target background. We further propose Face-to-Face ID-Preserving (FFIP), which detects and matches faces between the generated image and the source image in terms of face recognition features to enforce identity consistency for each face. In addition, we propose a Bidirectional Data Pairing (BDP) strategy to construct BDP-InsHuman, a high-quality dataset with realistic human-background interactions. Experiments demonstrate that InsHuman achieves significant improvements in generating plausible images while keeping human identity unchanged.

2605.07398 2026-05-11 cs.CV cs.AI

Exposing and Mitigating Temporal Attack in Deepfake Video Detection

Zheyuan Gu, Minghao Shao, Zhen Wang, Yusong Wang, Mingkun Xu, Shijie Zhang, Hao Jiang

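AI summary: This study reveals that spatiotemporal deepfake detectors are vulnerable to temporal attacks because they rely too heavily on fragile temporal spectrum features rather than learning robust semantic causality. The authors propose the SpInShield defense framework, which introduces a learnable spectral adversary and a shortcut suppression optimization strategy to separate semantic motion from manipulable spectral artifacts and thereby improve robustness. Experiments show that SpInShield performs well on multiple datasets and clearly outperforms the strongest existing baseline under simulated amplitude spectral attacks.

Abstract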

While spatiotemporal deepfake detectors achieve high AUC, our experiments reveal their susceptibility to evasion attacks. These models tend to overfit on fragile temporal spectrum cues, rather than learning robust semantic causality. To mitigate this vulnerability, we propose SpInShield, a temporal spectral-invariant defense framework explicitly designed to decouple semantic motion from manipulatable spectral artifacts. We propose a learnable spectral adversary that dynamically synthesizes severe spectral deformations, simulating extreme attack scenarios. By employing a shortcut suppression optimization strategy, SpInShield compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space. Experiments show that SpInShield obtains competitive performance on widely used datasets and outperforms the strongest baseline by 21.30 percentage points in AUC under simulated amplitude spectral attacks.

2605.07397 2026-05-11 cs.LG math.AT

Have Graph -- Will Lift? The Case for Higher-Order Benchmarks

Bastian Rieck

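AI summary: This paper discusses the state of geometry and topology in machine learning, noting that although message passing on graphs and higher-order complexes has become a major driver of geometric deep learning, suitable benchmark datasets are lacking. The author calls on the community not only to lift existing graph datasets to higher-order structures but also to build new higher-order benchmark datasets to advance topological deep learning.

Abstract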

After a somewhat rocky start, geometry and topology have established a foothold in machine learning. Message passing, either on graphs or higher-order complexes, is one of the main drivers of geometric deep learning, and paradigms that were once considered to be firmly in the realm of the abstract (like sheaves) have been "tamed" to serve as novel inductive biases for model architectures in topological deep learning. The veritable diversity of models, however, is in stark contrast to the scarcity of suitable benchmark datasets. As a result, researchers often resort to lifting existing graph datasets to include higher-order information. In this opinion paper, I want to encourage the community to also source new datasets, which may be used to prop up the foundations of our research field.

2605.07396 2026-05-11 cs.LG cs.AI

Rubric-based On-policy Distillation

Junfeng Fang, Zhepei Hong, Mao Zheng, Mingyang Song, Gengsheng Li, Houcheng Jiang, Dan Zhang, Haiyun Guo, Xiang Wang, Tat-Seng Chua

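AI summary: This paper proposes Rubric-based On-policy Distillation (ROPD) to address the limited applicability of conventional on-policy distillation, which relies on teacher logits, in black-box settings. The method induces prompt-specific rubrics from teacher-student contrasts and uses them to score and optimize the student's rollouts, enabling on-policy distillation without access to teacher logits. Experiments show that ROPD outperforms existing logit-based distillation methods in most scenarios with up to a 10x gain in sample efficiency, offering a flexible and efficient approach to model alignment in black-box settings.

Comments: Preprint. Code is available at https://github.com/Peregrine123/ROPD_official

Abstract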

On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only teacher-generated responses. To demonstrate this, we introduce ROPD, a simple yet foundational framework for rubric-based OPD. Specifically, ROPD induces prompt-specific rubrics from teacher-student contrasts, and then utilizes these rubrics to score the student rollouts for on-policy optimization. Empirically, ROPD outperforms advanced logit-based OPD methods across most scenarios, achieving up to a 10x gain in sample efficiency. These results position rubric-based OPD as a flexible, black-box-compatible alternative to the prevailing logit-based OPD, offering a simple yet strong baseline for scalable distillation across proprietary and open-source LLMs. Code is available at https://github.com/Peregrine123/ROPD_official.

2605.07395 2026-05-11 cs.LG cs.AI cs.CL

Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

Saloni Garg, Amit Sagtani

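AI summary: Through large-scale experiments, this paper studies the "unsolvability ceiling" in multi-LLM routing and finds that many supposedly unsolvable queries actually stem from evaluation artifacts, such as judges favoring verbose outputs, generation length limits, and output format mismatches. The study introduces a decomposition framework showing that these artifacts have a consistent effect across tasks and model families, and notes that standard routers degrade because of them, incurring a substantial opportunity cost. It also offers actionable recommendations for improving router evaluation and training, stressing the importance of reliable evaluation protocols in multi-LLM systems.

Comments: 12 pages, 14 tables

Abstract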

Efficient routing across multiple LLMs enables cost-quality tradeoffs by directing queries to the cheapest capable model. Prior work attributes routing headroom to an "unsolvability ceiling": queries that no model in the pool can solve. We present a large-scale study of multi-tier LLM routing with 206,000 query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) using the Gemma 4 and Llama 3.1 families. Evaluating with both LLM-as-a-judge and exact-match metrics, we show that a substantial portion of reported unsolvability stems from evaluation artifacts: (i) systematic judge biases favoring verbosity over correctness, (ii) truncation under fixed generation budgets, and (iii) output format mismatches. Through dual-judge validation and exact-match grounding, we reduce measured unsolvability across tasks. We introduce a decomposition framework attributing failures to these artifacts, revealing consistent patterns across domains and model families. These artifacts also distort router training signals: standard routers collapse to majority-class prediction (~79% smallest-tier optimal), confirmed via random-feature and shuffled-label controls, incurring a 13-17 percentage point opportunity cost. We provide actionable recommendations including dual-judge validation, exact-match anchoring, and cost-sensitive objectives. Our findings suggest existing routing headroom estimates are substantially inflated, underscoring the need for reliable evaluation protocols in multi-LLM systems.

2605.07394 2026-05-11 cs.CV cs.AI

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Shaokai Ye, Vasileios Saveris, Yihao Qian, Jiaming Hu, Elmira Amirloo, Peter Grasch

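AI summary: This paper proposes BalCapRL, a balanced reinforcement learning framework for image captioning with multimodal large language models. By jointly optimizing caption utility, reference coverage, and linguistic quality, it addresses the trade-offs that existing methods make across quality dimensions. The study introduces reward-decoupled normalization and length-conditional reward masking, which improve caption generation and yield significant gains across several models.

Abstract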

Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. In order to effectively optimize the resulting continuous multi-objective reward formulation, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. Additionally, we introduce length-conditional reward masking, yielding a more suitable length penalty for captioning. Across LLaVA-1.5-7B and Qwen2.5-VL 3B and 7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.
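
The contrast between normalizing a summed reward and normalizing each reward separately can be sketched as follows. This is one plausible reading of "reward-decoupled normalization", not the paper's implementation: the function names and toy rewards are hypothetical, and any further batch-level normalization or the length-conditional masking is omitted.

```python
import numpy as np

def summed_reward_advantages(rewards, eps=1e-8):
    """GRPO-style: sum the reward components, then normalize within the group.

    rewards has shape (G, M): G sampled captions, M reward components.
    """
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + eps)

def decoupled_advantages(rewards, eps=1e-8):
    """Decoupled: normalize each component within the group, then sum."""
    z = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    return z.sum(axis=1)

# Toy group of three captions with two rewards on very different scales.
group = np.array([
    [0.9, 10.0],   # best on the small-scale reward, worst on the large-scale one
    [0.1, 12.0],   # worst on the small-scale reward, best on the large-scale one
    [0.5, 11.0],   # in the middle on both
])
adv_summed = summed_reward_advantages(group)
adv_decoupled = decoupled_advantages(group)
```

In this toy group the summed variant ranks captions almost entirely by the large-scale reward, while the decoupled variant treats the two objectives symmetrically and assigns all three captions roughly equal advantage.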

2605.07393 2026-05-11 cs.AI

Offline Policy Optimization with Posterior Sampling

Hongqiang Lin, Dongxu Zhang, Yiding Sun, Mingzhe Li, Ning Yang, Haijun Zhang

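AI summary: This paper studies a fundamental challenge in model-based offline reinforcement learning: balancing generalization against robustness to exploitation errors in out-of-distribution (OOD) regions. The authors propose Posterior Sampling-based Policy Optimization (PSPO), which treats dynamics modeling as Bayesian inference and combines posterior sampling with constrained policy optimization, using plausible OOD transitions to improve generalization while remaining robust to model exploitation. Theoretical analysis establishes convergence for Q-value estimation and policy optimization, and experiments confirm superior performance on standard benchmarks.

Comments: 25 pages, 3 figures

Abstract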

A fundamental challenge in model-based offline reinforcement learning (RL) lies in the trade-off between generalization and robustness against exploitation errors in out-of-distribution (OOD) regions. While OOD samples may capture valid underlying physical dynamics, they also introduce the risk of model exploitation. Existing methods typically address this risk through excessive pessimistic regularization, which ensures robustness but often sacrifices generalization. To overcome this limitation, we propose Posterior Sampling-based Policy Optimization (PSPO), which formulates dynamics modeling as a Bayesian inference process to derive a posterior that explicitly quantifies model fidelity. Through the integration of posterior sampling and constrained policy optimization, our method leverages dynamics-consistent OOD transitions for generalization while ensuring robustness against model exploitation. Theoretically, we formulate Q-value estimation under posterior sampling as a stochastic approximation problem and establish its convergence. We decompose policy optimization into a sequence of constrained subproblems, demonstrating that solving these subproblems guarantees monotonic improvement until convergence. Experiments on standard benchmarks validate that PSPO achieves superior performance compared to state-of-the-art baselines.

2605.07390 2026-05-11 cs.CV

ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

Haonan Wang, Hanyu Zhou, Tao Gu, Luxin Yan

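AI summary: This paper proposes ST-Gen4D, a 4D generation framework that addresses the lack of 4D spatiotemporal scale in existing generative models of the physical world. Its core idea is a world model based on 4D spatiotemporal cognition that combines global appearance structure with local dynamic topology to generate 4D content with spatiotemporal regularities. The method introduces four key designs (spatiotemporal representation, cognition, reasoning, and generation), improves the structural rationality and topological consistency of the results, and shows superior performance across 3D and 4D generation tasks.

Abstract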

Generative models have achieved success in producing apparently coherent 2D videos, but modeling the physical world remains challenging due to the lack of 4D spatiotemporal scale. Typically, existing 4D generative models directly embed macro-scale constraints to enhance overall spatiotemporal consistency. However, these methods only ensure global appearance coherence and fail to reveal the local dynamics of the physical world. Our insight is that global appearance structure and local dynamic topology empower 4D spatiotemporal cognition, thereby enabling 4D generation with spatiotemporal regularities. In this work, we propose ST-Gen4D, a 4D generation framework with a 4D spatiotemporal cognition-based world model. Our model is guided by four key designs: 1) Spatiotemporal representation. We encode various modalities into multiple representations as a feature basis. 2) Spatiotemporal cognition. We sculpt these representations into a global appearance graph and a local dynamic graph, and fuse them via semantic-bridged spatiotemporal fusion to obtain a 4D cognition graph. 3) Spatiotemporal reasoning. We utilize a world model to derive the future state based on the 4D cognition. 4) Spatiotemporal generation. We leverage the derived cognition as a condition to guide latent diffusion for 4D Gaussian generation. By deeply integrating 4D intrinsic cognition with generative priors, our model guarantees the structural rationality and topological consistency of 4D generation. Moreover, we propose the ST-4D datasets by aggregating public 4D datasets and a self-built subset. Extensive experiments demonstrate the superiority of our ST-Gen4D across 3D and 4D generation tasks.

2605.07388 2026-05-11 cs.CV

A Marine Debris Detection Framework for Ocean Robots via Self-Attention Enhancement and Feature Interaction Optimization

Yuyang Li, Jiashu Han, Yinyi Lai, Wenbin Kang, Zenghui Liu

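AI summary: This paper proposes YOLO-MD, a debris detection framework for ocean robots, to address the drop in detection performance caused by blurred images, complex backgrounds, and small targets. It introduces a Dual-Branch Convolutional Enhanced Self-Attention (DB-CASA) module to strengthen feature representation, a lightweight shift-based operation to improve fine-grained feature extraction for multi-scale objects, and SFG-Loss, which uses dynamic sample reweighting to mitigate class imbalance and optimization instability. Experiments show that YOLO-MD achieves better detection accuracy than existing methods on the UODM dataset, and its effectiveness has been verified in real-world robotic edge deployment.

Abstract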

Marine debris detection for ocean robots is crucial for ecological protection, yet performance is often degraded by low-quality images with blur, complex backgrounds, and small targets. To address these challenges, we propose YOLO-MD, an enhanced YOLO-based detection framework. A Dual-Branch Convolutional Enhanced Self-Attention (DB-CASA) module is designed to strengthen spatial-channel interactions, improving feature representation in degraded images. Additionally, a lightweight shift-based operation is introduced to enhance fine-grained feature extraction for objects of varying scales while maintaining parameter efficiency. We further propose SFG-Loss to mitigate class imbalance and optimization instability via dynamic sample reweighting. Experiments on the UODM dataset demonstrate that YOLO-MD achieves 0.875 precision, 0.822 F1-score, and 0.849 mAP50, outperforming the latest state-of-the-art methods. The effectiveness of this method has also been verified through real-world robotic edge deployment experiments.

2605.07386 2026-05-11 cs.LG cs.DS math.OC

Convex Optimization with Nested Evolving Feasible Sets

Karthick Krishna M., Haricharan Balasundaram, Rahul Vaze

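AI summary: This paper studies a class of convex optimization problems in which the objective function is fixed but the feasible region evolves over time as a nested sequence. The algorithm must remain feasible at every step while minimizing cumulative regret and movement cost. For convex loss functions, the authors propose a lazy algorithm that trades off regret against movement cost; for strongly convex or $\alpha$-sharp loss functions, their Frugal algorithm achieves zero regret with logarithmic movement cost, and they prove it optimal.

Abstract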

Convex Optimization with Nested Evolving Feasible Sets (CONES) is considered, where the objective function $f$ remains fixed but the feasible region evolves over time as a nested sequence $S_1 \supseteq S_2 \supseteq \cdots \supseteq S_T$. The goal of an online algorithm is to simultaneously minimize the regret with respect to the hindsight static optimal benchmark and the total movement cost while ensuring feasibility at all times. CONES is an optimization-oriented generalization of the well-known nested convex body chasing problem. When the loss function is convex, we propose a lazy algorithm and show that it achieves simultaneous regret and movement cost of $O(T^{1-\beta})$ and $O(T^{\beta})$, respectively, for any $\beta \in (0,1]$, over a time horizon of $T$. When the loss function is strongly convex or $\alpha$-sharp, we propose an algorithm Frugal that simultaneously achieves zero regret and a movement cost of $O(\log T)$. To complement this, we show that any online algorithm with $o(T)$ regret has a movement cost of $\Omega(\log T)$ for both cases, proving optimality of Frugal.

2605.07381 2026-05-11 cs.RO cs.AI

Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

Yanzhe Chen, Kevin Yuchen Ma, Qi Lv, Yiqi Lin, Zechen Bai, Chen Gao, Mike Zheng Shou

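AI summary: Deploying Vision-Language-Action (VLA) models for robotic manipulation faces an embodiment gap, and a limited real-world data budget makes adaptation harder. This paper finds that the conventional "maximize coverage" sampling strategy can fall into a diversity trap caused by estimation noise, hurting policy performance. The authors propose an analysis framework based on a coverage-density trade-off and design Anchor-Centric Adaptation (ACA), which stabilizes a policy skeleton through repeated demonstrations at core anchors and then expands to high-risk boundaries using error mining and constrained residual updates, significantly improving task success rates and reliability.

Comments: 21 pages, 8 figures

Abstract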

While Vision-Language-Action (VLA) models offer broad general capabilities, deploying them on specific hardware requires real-world adaptation to bridge the embodiment gap. Since robot demonstrations are costly, this adaptation must often occur under a strict data budget. In this work, we identify a critical diversity trap: the standard heuristic of "maximizing coverage" by collecting diverse, single-shot demonstrations can be self-defeating due to non-vanishing estimation noise. We formalize this phenomenon as a Coverage-Density Trade-off. By decomposing the policy error into estimation (density) and extrapolation (coverage) terms, we characterize an interior optimal allocation of unique conditions for a fixed budget. Guided by this analysis, we propose Anchor-Centric Adaptation (ACA), a two-stage framework that first stabilizes a policy skeleton through repeated demonstrations at core anchors, then selectively expands coverage to high-risk boundaries via teacher-forced error mining and constrained residual updates. Real-robot experiments validate our trade-off framework and demonstrate that ACA significantly improves task reliability and success rates over standard diverse sampling strategies under the same budget.

2605.07375 2026-05-11 cs.LG cs.CE cs.NA math.NA

QuadNorm: Resolution-Robust Normalization for Neural Operators

Bum Jun Kim, Makoto Kawano, Yusuke Iwasawa, Yutaka Matsuo

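AI summary: This paper proposes QuadNorm, a normalization method that makes neural operators more robust across resolutions. Conventional normalization computes statistics by uniformly averaging discrete grid values, which makes it sensitive to the discretization and introduces transfer error across resolutions or meshes. QuadNorm replaces uniform averaging in normalization layers with numerical quadrature, making normalization consistent across resolutions, and shows better cross-resolution performance and stability in multiple experiments.

Comments: 42 pages, 8 figures

Abstract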

Normalization layers in neural operators usually compute statistics by uniformly averaging discrete grid values, making the normalization itself discretization-dependent and thereby a source of transfer error across different resolutions or meshes. To enable discretization robustness, we introduce a quadrature normalization family that replaces existing uniform averaging in normalization layers with numerical quadrature: QuadNorm and BlendQuadNorm. On endpoint-inclusive uniform grids, the proposed quadrature moments are $O(h^2)$-consistent across discretizations, meaning that their cross-resolution mismatch decays quadratically with grid spacing. A transfer-error bound then predicts how normalization-induced mismatch scales with both the resolution gap and network depth. The experiments show the same gap- and depth-scaling trends predicted by the transfer-error bound. On Darcy, QuadNorm delivers the best cross-resolution performance at every tested target resolution from $64^2$ to $256^2$; on real-data benchmarks, Transolver with QuadNorm achieves nearly resolution-invariant transfer. The largest gains appear on nonperiodic PDEs and nonspectral architectures, where native-resolution improvements also emerge. We also validate BlendQuadNorm, which stays close to LayerNorm behavior and serves as a conservative default for periodic FNO settings. These results identify normalization as a previously overlooked source of resolution dependence in neural operators.
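
The idea of swapping uniform averaging for quadrature can be illustrated in one dimension. The sketch below assumes the trapezoidal rule on an endpoint-inclusive uniform grid over the unit interval; the abstract does not name the quadrature rule, and `trapezoid_weights`, `quad_mean`, and `quad_normalize` are hypothetical names, not the paper's API.

```python
import numpy as np

def trapezoid_weights(n):
    """Trapezoidal weights on n endpoint-inclusive uniform points in [0, 1]."""
    w = np.full(n, 1.0 / (n - 1))
    w[0] *= 0.5
    w[-1] *= 0.5
    return w  # the weights sum to 1, the length of the interval

def quad_mean(u):
    """Quadrature estimate of the mean of u over [0, 1]."""
    return float(np.sum(trapezoid_weights(u.shape[-1]) * u))

def quad_normalize(u, eps=1e-5):
    """Normalize u with quadrature-based mean and variance (no affine terms)."""
    w = trapezoid_weights(u.shape[-1])
    mean = np.sum(w * u, axis=-1, keepdims=True)
    var = np.sum(w * (u - mean) ** 2, axis=-1, keepdims=True)
    return (u - mean) / np.sqrt(var + eps)

# Same nonperiodic function u(x) = x^2 sampled at two resolutions.
coarse = np.linspace(0.0, 1.0, 17) ** 2
fine = np.linspace(0.0, 1.0, 65) ** 2
uniform_gap = abs(coarse.mean() - fine.mean())       # mismatch of plain averages
quad_gap = abs(quad_mean(coarse) - quad_mean(fine))  # mismatch of quadrature means
```

For this nonperiodic example the plain average overweights the endpoints by an amount that depends on the grid size, so its cross-resolution mismatch is noticeably larger than that of the quadrature mean.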

2605.07370 2026-05-11 cs.RO cs.AI cs.MA cs.SY eess.SY

MORPH-U: Multi-Objective Resilient Motion Planning for V2X-Enabled Autonomous Driving in High-Uncertainty Environments via Simulation

Shih-Yu Lai

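AI summary: This paper studies how vehicle-to-everything (V2X) information can make motion planning and control for autonomous vehicles more robust in high-uncertainty environments. It proposes MORPH-U, a closed-loop system that fuses multi-sensor and V2X data into a local dynamic map and triggers Hybrid-A* replanning when a threat or map change is detected. A multi-objective optimization framework balances tracking error, safety margin, responsiveness, and smoothness, and a Byzantine fault-tolerance mechanism prevents unsafe replanning caused by false V2X messages. Experiments show that the method improves system safety and control flexibility.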

English abstract

V2X can warn an autonomous vehicle about hazards beyond line-of-sight, but it also brings uncertainty: messages may be delayed, dropped, or even forged. Meanwhile, map knowledge may change during a trip, forcing the vehicle to replan under tight real-time budgets. This paper studies how to make motion planning and low-level control robust to such uncertain, event-driven updates. We present MORPH-U, a CARLA-based closed-loop stack that fuses LiDAR/radar/camera with V2X (CAM/DENM) into a Local Dynamic Map (LDM) and triggers Hybrid-A* replanning when validated hazards or map changes affect the planned route. We expose the planning/control trade-offs via a multi-objective formulation over tracking error, safety margin (minimum TTC), responsiveness, and smoothness, and select operating points using Pareto-frontier analysis. To avoid unsafe replanning from faulty V2X triggers, MORPH-U adds a lightweight Byzantine-inspired acceptance gate that combines a quorum rule with an on-board sensor veto. Experiments in dynamic CARLA scenarios show that V2X-augmented LDM improves downstream safety, Pareto tuning provides controllable accuracy-comfort trade-offs, and the gate prevents replanning under saturated false-DENM injection ($p_{\text{attack}}=1.0$).
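The acceptance gate mentioned in this abstract combines a quorum rule with an on-board sensor veto. A minimal sketch of that combination might look as follows; the `HazardReport` type, the `accept_hazard` function, and the quorum threshold are all hypothetical names and values, not MORPH-U's actual interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HazardReport:
    """Hypothetical minimal view of a V2X hazard message (e.g. a DENM)."""
    sender_id: str
    hazard_id: str

def accept_hazard(reports, hazard_id, quorum, sensor_contradicts):
    """Return True only if the hazard should be allowed to trigger replanning.

    Quorum rule: at least `quorum` distinct senders must report the hazard,
    so repeated messages from one sender count once.
    Sensor veto: if on-board sensing contradicts the hazard, reject it
    regardless of how many senders reported it.
    """
    senders = {r.sender_id for r in reports if r.hazard_id == hazard_id}
    if len(senders) < quorum:
        return False
    return not sensor_contradicts

if __name__ == "__main__":
    forged = [HazardReport("attacker", "h1") for _ in range(10)]
    print(accept_hazard(forged, "h1", quorum=2, sensor_contradicts=False))
```

In this sketch, counting distinct senders handles a single source that floods messages, while the veto is the fallback when forged reports come from several identities.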

2605.07367 2026-05-11 cs.RO cs.CV

Weather-Robust Scene Semantics with Vision-Aligned 4D Radar

Kali Hamilton, Christoffer Heckman

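AI summary: This study aims to make scene-semantic understanding more robust in adverse weather and proposes a solution that aligns 4D radar with vision. By aligning a radar encoder to frozen SigLIP vision embeddings and generating structured scene captions with a frozen vision-language model, it achieves accurate semantic understanding with only about 7M trainable parameters. Experiments show that the method significantly outperforms a camera-based baseline in fog, light snow, and heavy snow, and the paper analyzes the key trade-offs in the model design.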

Comments 5 pages + references, 2 appendix pages. ICRA 2026 Radar in Robotics Workshop

English abstract

Cameras and LiDAR degrade in rain, fog, and snow, while millimeter-wave radar remains largely unaffected. We align a radar encoder to frozen SigLIP vision embeddings and decode structured scene captions through a frozen vision-language model (VLM) with approximately 7M trainable parameters. On K-RADAR with held-out fog, light snow, and heavy snow sequences, all radar configurations outperform a camera baseline that collapses to over 90% hallucination. We identify a token-norm mismatch as the dominant failure mode when bridging radar to a frozen VLM and show that projector-output LayerNorm resolves it. Analysis of encoder complexity, caption format, and pooling strategy reveals tradeoffs that inform future radar-VLM pipeline design.

2605.07366 2026-05-11 cs.CL

Gradient-Based LoRA Rank Allocation Under GRPO: An Empirical Study

Yash Ganpat Sawant

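AI summary: This study examines whether adaptive rank allocation strategies for LoRA developed for supervised fine-tuning (SFT) transfer to reinforcement learning, specifically GRPO. Experiments show that allocating rank by gradient importance under GRPO actually degrades performance: accuracy drops by 4.5 percentage points relative to uniform allocation. The study identifies two key causes. First, the gradient landscape under GRPO is flatter, and the gradient signal in every layer matters. Second, non-uniform rank allocation amplifies gradient differences and creates a positive feedback loop in which high-rank layers absorb more gradient while low-rank layers gradually become ineffective. Rank allocation strategies for reinforcement learning therefore cannot simply reuse experience from SFT.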

Comments 4 pages + references

English abstract

Adaptive rank allocation for LoRA, which assigns more parameters to important layers and fewer to unimportant ones, consistently improves efficiency under supervised fine-tuning (SFT). We investigate whether this success transfers to reinforcement learning, specifically Group Relative Policy Optimization (GRPO). Using gradient-magnitude profiling on Qwen 2.5 1.5B with GSM8K, we find that it does not: proportional rank allocation degrades accuracy by 4.5 points compared to uniform allocation (70.0% vs. 74.5%), despite using identical parameter budgets. We identify two mechanisms behind this failure. First, the gradient landscape under GRPO is fundamentally flatter than under SFT: the max-to-min layer importance ratio is only 2.17x, compared to >10x reported in the SFT literature. All layers carry meaningful gradient signal; none are truly idle. Second, we discover a gradient amplification effect: non-uniform allocation widens the importance spread from 2.17x to 3.00x, creating a positive feedback loop where high-rank layers absorb more gradient while low-rank layers are progressively silenced. Our results suggest that gradient importance does not predict capacity requirements under RL, and that naive transfer of SFT-era rank allocation to alignment training should be avoided.
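To make the comparison in this abstract concrete, the sketch below allocates a fixed total LoRA rank either uniformly or in proportion to per-layer importance scores. The function names, the largest-remainder rounding, the `min_rank` floor, and the example importance values are assumptions for illustration; only the 2.17x max-to-min ratio comes from the abstract.

```python
def uniform_ranks(num_layers, total_rank):
    """Split total_rank as evenly as possible across layers."""
    base, extra = divmod(total_rank, num_layers)
    return [base + (1 if i < extra else 0) for i in range(num_layers)]

def proportional_ranks(importance, total_rank, min_rank=1):
    """Allocate total_rank in proportion to importance (largest-remainder rounding).

    Every layer gets at least min_rank, and the result sums to total_rank,
    so both schemes use the same overall rank budget.
    """
    n = len(importance)
    spare = total_rank - min_rank * n
    if spare < 0:
        raise ValueError("total_rank is too small for the requested min_rank")
    total = sum(importance)
    shares = [spare * s / total for s in importance]
    ranks = [min_rank + int(s) for s in shares]
    leftover = total_rank - sum(ranks)
    # Hand the remaining units to the layers with the largest fractional share.
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        ranks[i] += 1
    return ranks

if __name__ == "__main__":
    # Hypothetical gradient-magnitude scores with a 2.17x max-to-min ratio.
    importance = [1.0, 1.5, 2.17, 1.2]
    print(uniform_ranks(len(importance), 32))
    print(proportional_ranks(importance, 32))
```

With a spread this flat, the proportional scheme still moves several rank units away from the least important layer, which is the kind of reallocation the abstract reports as harmful under GRPO.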

2605.07363 2026-05-11 cs.LG cs.AI

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

Ruijie Zhou, Fanxu Meng, Yufei Xu, Tongxuan Liu, Guangming Lu, Muhan Zhang, Wenjie Pei

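AI summary: This paper proposes MISA, a mixture-style sparse attention mechanism that improves the efficiency of large language models in long-context inference. MISA treats the multiple indexer heads in DSA as a pool of experts and introduces a lightweight router that selects a small number of active heads to run the heavy token-level scoring, which greatly reduces compute cost while preserving expressiveness. Experiments show that, without added training cost, MISA matches or exceeds the original DSA on several benchmarks and remains stable and accurate on long-context tasks.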

Comments https://github.com/MuLabPKU/TransArch

English abstract

DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats its indexer heads as a pool of mixture-of-experts. A lightweight router uses cheap block-level statistics to pick a query-dependent subset of only a few active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost from scoring every prefix token with every head to scoring it with only a handful of routed heads, plus a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re-ranks it with the original DSA indexer to recover the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench across DeepSeek-V3.2 and GLM-5 while running with eight and four times fewer indexer heads respectively, and outperforms HISA on average. It also preserves fully green Needle-in-a-Haystack heatmaps up to a 128K-token context and recovers more than 92% of the tokens selected by the DSA indexer per layer. Our TileLang kernel delivers roughly a 3.82 times speedup over DSA's original indexer kernel on a single NVIDIA H200 GPU.

2605.07359 2026-05-11 cs.CV

UniISP: A Unified ISP Framework for Both Human and Machine Vision

Hanxi Li, Yao Cheng, Bo Zhang, Li Zeng

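AI summary: Compared with RGB images, raw sensor data carries richer information, which is especially important for accurate recognition in challenging conditions such as low light. The traditional ISP pipeline produces RGB images that suit human visual aesthetics, but compression and information loss may hurt recognition performance; existing methods that process raw data often struggle to satisfy both visual quality and computer vision requirements. This paper proposes UniISP, which introduces a Hybrid Attention Module and a Feature Adapter module to pass informative features downstream while preserving visual quality. Experiments show that the method reaches state-of-the-art results across multiple datasets and scenarios, with good generality and effectiveness.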

English abstract

Compared to RGB images, raw sensor data provides a richer representation of information, which is crucial for accurate recognition, particularly under challenging conditions such as low-light environments. The traditional Image Signal Processing (ISP) pipeline generates visually pleasing RGB images for human perception through a series of steps, but some of these operations may adversely impact the information integrity by introducing compression and loss. Furthermore, in computer vision tasks that directly utilize raw camera data, most existing methods integrate minimal ISP processing with downstream networks, yet the resulting images are often difficult to visualize or do not align with human aesthetic preferences. This paper proposes UniISP, a novel ISP framework designed to simultaneously meet the requirements of both human visual perception and computer vision applications. By incorporating a carefully designed Hybrid Attention Module (HAM) and employing supervised learning, the proposed method ensures that the generated images are visually appealing. Additionally, a Feature Adapter module is introduced to effectively propagate informative features from the ISP stage to subsequent downstream networks. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across various scenarios and multiple datasets, proving its generalizability and effectiveness.

2605.07356 2026-05-11 cs.CV

UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition

Shuai Zhang, Zhecheng Shi, Zhuxiao Li, Jing Ou, Tengxi Wang, Yuan Liu, Wufan Zhao

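AI summary: This paper studies how to handle semantic segmentation of 2D images and 3D point clouds in a unified way. To address the difficulty of modality alignment caused by sparse LiDAR sampling and the view dependence of images, it proposes an interpretable shared-private multimodal decomposition framework. The method combines a SAM-based vision encoder with an SPTNet-based geometric encoder to extract complementary semantic and geometric features, and decomposes the features into shared and private subspaces, achieving cross-modal semantic alignment while preserving modality-specific properties. Experiments show segmentation accuracy and computational efficiency better than existing methods on several benchmark datasets, along with good cross-domain generalization.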

English abstract

Semantic segmentation of large-scale 3D point clouds is crucial for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view-dependent geometric distortion in image observations complicate cross-modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific. This insight motivates a unified multimodal framework for joint 2D-3D semantic segmentation. We combine a SAM-based vision encoder with a SPTNet-based geometric encoder to extract complementary semantic and geometric representations. The resulting features from both modalities are explicitly decomposed into shared and private subspaces, where the shared components summarize semantic factors common to both domains, and the private components preserve properties that are unique to each modality. A lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation, and a regularized training objective ensures both semantic alignment and subspace independence. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate consistent improvements in segmentation accuracy over representative multimodal baselines, accompanied by competitive computational efficiency. Cross-domain evaluation on nuScenes USA-Singapore shows stable performance under distribution shifts, demonstrating strong generalization. The implementation code is publicly available at: https://github.com/shuaizhang69/UniD-Shift.

2605.07355 2026-05-11 cs.CV cs.AI

TTF: Temporal Token Fusion for Efficient Video-Language Model

Simin Huo, Ning LI

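AI summary: Video-language models (VLMs) face rapidly rising inference costs on long videos, because the number of visual tokens grows significantly with video length. To address this, the paper proposes Temporal Token Fusion (TTF), a training-free, plug-and-play token compression framework applied before the language model. It exploits structured temporal redundancy in video by automatically selecting a reference frame and running a local-window similarity search, which effectively reduces the number of visual tokens. Experiments show that TTF removes about 67% of visual tokens while retaining 99.5% of the baseline accuracy and adding only about 0.16 GFLOPs of extra compute, offering a practical approach to efficient video understanding.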

Comments 14 pages; manuscript submitted to NeurIPS 2026

English abstract

Video-language models (VLMs) face rapidly growing inference costs as visual token counts scale with video length. For example, 32 frames at $448{\times}448$ resolution already yield >8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput bottleneck. Existing methods often rely on global similarity or attention-guided compression, incurring overhead that offsets their gains. We propose Temporal Token Fusion (TTF), a training-free, plug-and-play pre-LLM token compression framework that exploits structured temporal redundancy in video. TTF automatically selects an anchor frame, then, for each subsequent frame, performs a local window similarity search (e.g., $3\times 3$), fusing tokens whose similarity exceeds a threshold. The compressed sequence maintains positional consistency across both prefill and decoding through coordinate realignment, enabling seamless integration with existing VLM pipelines. On Qwen3-VL-8B with threshold t=0.70, TTF removes about 67% of visual tokens while retaining 99.5% of the baseline accuracy and introducing only ${\approx}0.16$ GFLOPs of matching overhead. Overall, TTF offers a practical, efficient solution for video understanding. The code is available at https://github.com/Cominder/ttf.
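The local-window matching step described in this abstract can be sketched roughly as below. This is an illustrative assumption rather than the released implementation: the function name `redundant_mask`, the use of cosine similarity, and the choice to return only a mask (without performing the fusion or coordinate realignment) are all simplifications.

```python
import numpy as np

def redundant_mask(anchor, frame, threshold=0.70, radius=1):
    """Mark frame tokens that closely match a nearby anchor-frame token.

    anchor, frame: (H, W, D) grids of nonzero visual token embeddings.
    radius=1 gives a 3x3 search window around each token's own position.
    Returns a boolean (H, W) mask, True where the best cosine similarity
    inside the window exceeds `threshold` (a candidate for fusion).
    """
    H, W, _ = frame.shape
    a = anchor / np.linalg.norm(anchor, axis=-1, keepdims=True)
    f = frame / np.linalg.norm(frame, axis=-1, keepdims=True)
    best = np.full((H, W), -np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Frame positions (y, x) whose shifted partner (y+dy, x+dx) is inside the grid.
            y0, y1 = max(0, -dy), min(H, H - dy)
            x0, x1 = max(0, -dx), min(W, W - dx)
            sim = np.sum(f[y0:y1, x0:x1] * a[y0 + dy:y1 + dy, x0 + dx:x1 + dx], axis=-1)
            best[y0:y1, x0:x1] = np.maximum(best[y0:y1, x0:x1], sim)
    return best > threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    anchor = rng.standard_normal((4, 4, 64))
    shifted = np.roll(anchor, 1, axis=1)  # content moved one token to the right
    print(redundant_mask(anchor, shifted).mean())
```

Restricting the search to a small window keeps the matching cost linear in the number of tokens, in contrast to a global all-pairs similarity search.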

2605.07353 2026-05-11 cs.AI

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

Kejia Chen, Jiawen Zhang, Yihong Wu, Kewei Gao, Jian Lou, Zunlei Feng, Mingli Song, Ruoxi Jia

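AI summary: Large reasoning models often rely on flawed intermediate reasoning steps even when they reach correct answers, leaving a gap between final accuracy and reasoning reliability. This paper proposes CASPO, a framework that uses iterative Direct Preference Optimization to align token-level confidence with step-wise logical correctness without training a separate reward model. At inference time it introduces Confidence-aware Thought (CaT), which dynamically prunes uncertain reasoning branches with almost no added latency, improving reasoning reliability and efficiency. Experiments show that CASPO performs well across multiple benchmarks and model families, scales to larger models such as Qwen3-8B-Base, and surpasses tree-search baselines on several benchmarks.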

Comments 9 pages

English abstract

Large reasoning models often reach correct answers through flawed intermediate steps, creating a gap between final accuracy and reasoning reliability. Existing alignment strategies address this with external verifiers or massive sampling, limiting scalability. In this work, we introduce CASPO (Confidence-Aware Step-wise Preference Optimization), a framework that aligns token-level confidence with step-wise logical correctness through iterative Direct Preference Optimization, without training a separate reward model. During inference, we propose Confidence-aware Thought (CaT), which leverages this calibrated confidence to dynamically prune uncertain reasoning branches with negligible O(V) latency. Experiments across ten benchmarks and multiple model families show that CASPO consistently improves reasoning reliability and inference efficiency. CASPO scales to Qwen3-8B-Base and surpasses tree-search baselines on AIME'24 and AIME'25 without using reward-model data. We also release a step-wise dataset with confidence annotations to support fine-grained analysis of reasoning reliability. Code is available at https://github.com/Thecommonirin/CASPO.

2605.07351 2026-05-11 cs.CV

Disambiguating 2D-3D Correspondences in Gaussian Splatting-based Feature Fields for Visual Localization

Miso Lee, Sangeek Hyun, Yerim Jeon, Jae-Pil Heo

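AI summary: This paper addresses the use of Gaussian Splatting-based Feature Fields (GSFFs) for visual localization and proposes SplitGS-Loc, a GSFF construction framework built specifically for localization, to resolve ambiguity in 2D-3D matching. The method decomposes each Gaussian into several smaller Gaussians, converting many-to-one pixel-to-point mappings into precise one-to-one correspondences. It also uses the composition weights from Gaussian rendering to select Gaussians that contribute significantly and consistently across views, which strengthens feature discriminability and multi-view consistency. Experiments show that SplitGS-Loc achieves accurate and efficient visual localization without scene-specific training or iterative pose refinement.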

English abstract

While Gaussian Splatting-based Feature Fields (GSFFs) have shown promise for visual localization, this paper highlights that photometrically optimized GSFFs are inherently ill-suited for 2D-3D matching. The volumetric extent of each Gaussian induces many-to-one pixel-to-point mappings that destabilize PnP-based pose estimation, while photometric optimization gives rise to superfluous Gaussians devoid of multi-view consistency. To address these issues, we propose SplitGS-Loc, a localization-specialized GSFF construction framework that disambiguates 2D-3D correspondences by exploiting Gaussian attributes. Our key design, Mixture-of-Gaussians-based splitting, decomposes each Gaussian into smaller Gaussians, replacing ambiguous many-to-one correspondences with precise one-to-one correspondences. In parallel, we exploit composition weights from GS rasterization to select Gaussians that significantly and consistently contribute across multiple views and aggregate discriminative features through strong pixel-Gaussian associations, enforcing multi-view consistency. The resulting compact yet discriminative feature fields enable stable PnP convergence, achieving state-of-the-art performance on localization benchmarks. Extensive experiments validate that SplitGS-Loc extends the utility of photometric GSFFs to accurate and efficient localization by exploiting Gaussian attributes, without per-scene training or iterative pose refinement.

2605.07346 2026-05-11 cs.CV

SoLAR: Error-Resilient Streamable Long-Horizon Free-Viewpoint Video Reconstruction with Anchor Activation and Latent Recalibration

Haotian Zhang, Xu Mo, Yixin Yu, Guanhua Zhu, Jian Xue, Tongda Xu, Yan Wang, Jiaqi Zhang, Siwei Ma, Wen Gao

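AI summary: This paper proposes SoLAR, an error-resilient, streamable framework for long-horizon free-viewpoint video reconstruction, addressing the performance degradation of existing methods on long free-viewpoint video sequences. Built on a rate-distortion optimization framework, the method introduces a dynamic anchor activation mechanism and a latent-discrepancy-aware recalibration mechanism, which improve reconstruction quality and suppress error propagation. Experiments show that SoLAR achieves state-of-the-art reconstruction while keeping storage overhead minimal, offering a new direction for the practical deployment of long-horizon free-viewpoint video.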

详情
英文摘要

Free-Viewpoint Video (FVV) has emerged as a cornerstone of next-generation immersive media systems and attracted widespread attention. Previous methods primarily focus on short video sequences and suffer from significant performance degradation when processing long-horizon free-viewpoint video (LFVV). Motivated by bit allocation theory, we analyze dynamic-anchor-based volumetric video representation within a rate-distortion optimization framework and propose SoLAR, which is the first error-resilient streamable FVV framework that maintains stable reconstruction quality on long sequences without requiring group-of-pictures partitioning. We propose the Anchor Activation Dynamics (AAD), which enables dynamic anchors to model non-rigid transformations by dynamically activating informative anchors and suppressing redundant ones. Furthermore, we introduce Latent Discrepancy Aware Recalibration (LaDAR), which is a mechanism to identify discrepancies between latent representations and recalibrate the correspondences encoded in the network, effectively mitigating error propagation in LFVV without compromising real-time performance or storage compactness. Extensive experiments demonstrate that SoLAR achieves state-of-the-art reconstruction performance while maintaining minimum storage overhead, which provides a new direction for LFVV reconstruction and advances the practical deployment of immersive systems. Demo free-viewpoint videos are provided in the supplementary material.

2605.07345 2026-05-11 cs.CL cs.LG

Mean-Pooled Cosine Similarity is Not Length-Invariant: Theory and Cross-Domain Evidence for a Length-Invariant Alternative

Sibayan Mitra, Dhruv Kumar

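AI summary: This paper shows that the widely used mean-pooled cosine similarity is not length-invariant when comparing neural representations: its value grows monotonically with sequence length, independent of representational content. Experiments across several domains show that length explains a substantial share of cross-language representational similarity, while length-invariant metrics such as Centered Kernel Alignment (CKA) greatly reduce the effect of length. The study recommends preferring length-invariant metrics for cross-representation comparison in order to assess representations more accurately.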

Comments 9 pages, 6 figures. Submitted to the Mechanistic Interpretability Workshop at ICML 2026

English abstract

Mean-pooled cosine similarity is the default metric for comparing neural representations across languages, modalities, and tasks. We establish that this metric is not length-invariant: under the anisotropy that characterizes modern transformer representations, mean-pooled cosine grows monotonically in sequence length, independent of representational content. Empirically, on HumanEvalPack across four code LLMs, the length ratio alone explains $R^2 = 0.52$--$0.75$ of cross-language "Python proximity," while AST depth and shared-token fraction add less than 3% of explained variance beyond length. Substituting Centered Kernel Alignment (CKA) reduces explained variance by 83% and reverses the sign of the length coefficient ($\beta_{\mathrm{len}}: +0.86 \to -0.37$). The same pattern holds in Mistral-7B on parallel WMT pairs ($R^2 = 0.23$ EN-FR, $R^2 = 0.33$ EN-DE for cosine; $R^2 < 0.01$ for CKA). In CLIP ViT-B/32, mean-pooling reduces the length effect relative to EOS-pooling ($R^2: 0.21 \to {<}0.01$), as predicted by the theory's dependence on anisotropy. We argue that length-invariant metrics such as CKA should be the default for cross-representation comparisons, and that recent claims of cross-lingual representational convergence built on mean-pooled cosine warrant re-examination.
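The length effect claimed in this abstract can be reproduced in a toy setting. The sketch below assumes a simple anisotropy model in which every token embedding is a shared offset vector plus independent Gaussian noise; the model, the function names, and the parameter values are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def mean_pooled_cosine(a, b):
    """Cosine similarity between the mean-pooled vectors of two (tokens, dim) arrays."""
    u, v = a.mean(axis=0), b.mean(axis=0)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def anisotropic_tokens(rng, n_tokens, common, noise_scale=1.0):
    """Toy anisotropic embeddings: a shared offset plus independent per-token noise."""
    return common + noise_scale * rng.standard_normal((n_tokens, common.shape[0]))

def average_cosine(n_tokens, dim=64, trials=200, seed=0):
    """Average mean-pooled cosine between pairs of unrelated sequences of equal length."""
    rng = np.random.default_rng(seed)
    common = np.full(dim, 0.2)  # hypothetical shared direction that creates anisotropy
    vals = [
        mean_pooled_cosine(
            anisotropic_tokens(rng, n_tokens, common),
            anisotropic_tokens(rng, n_tokens, common),
        )
        for _ in range(trials)
    ]
    return float(np.mean(vals))

if __name__ == "__main__":
    for n in (4, 64, 1024):
        print(n, round(average_cosine(n), 3))
```

The two sequences in each pair share nothing except the common offset, yet their similarity rises with length: pooling averages the per-token noise away while the shared direction stays, so longer sequences look more alike regardless of content.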