arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1289
2602.12609 2026-02-16 cs.CV cs.AI

QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching

Ke Xu, Yixin Wang, Zhongcheng Li, Hao Cui, Jinshui Hu, Xingyi Zhang

Comments Accepted by AAAI 2026

详情
英文摘要

Elastic precision quantization enables multi-bit deployment via a single optimization pass, fitting diverse quantization scenarios.Yet, the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models.This paper proposes QuEPT, an efficient post-training scheme that reconstructs block-wise multi-bit errors with one-shot calibration on a small data slice. It can dynamically adapt to various predefined bit-widths by cascading different low-rank adapters, and supports real-time switching between uniform quantization and mixed precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi-Bit Token Merging (MB-ToMe) to dynamically fuse token features across different bit-widths, improving robustness during bit-width switching. Additionally, we propose Multi-Bit Cascaded Low-Rank adapters (MB-CLoRA) to strengthen correlations between bit-width groups, further improve the overall performance of QuEPT. Extensive experiments demonstrate that QuEPT achieves comparable or better performance to existing state-of-the-art post-training quantization methods.Our code is available at https://github.com/xuke225/QuEPT

2602.12605 2026-02-16 cs.LG cs.IT math.IT

Block-Sample MAC-Bayes Generalization Bounds

Matthias Frey, Jingge Zhu, Michael C. Gastpar

Comments Accepted for publication at The Fourteenth International Conference on Learning Representations (ICLR 2026)

详情
英文摘要

We present a family of novel block-sample MAC-Bayes bounds (mean approximately correct). While PAC-Bayes bounds (probably approximately correct) typically give bounds for the generalization error that hold with high probability, MAC-Bayes bounds have a similar form but bound the expected generalization error instead. The family of bounds we propose can be understood as a generalization of an expectation version of known PAC-Bayes bounds. Compared to standard PAC-Bayes bounds, the new bounds contain divergence terms that only depend on subsets (or \emph{blocks}) of the training data. The proposed MAC-Bayes bounds hold the promise of significantly improving upon the tightness of traditional PAC-Bayes and MAC-Bayes bounds. This is illustrated with a simple numerical example in which the original PAC-Bayes bound is vacuous regardless of the choice of prior, while the proposed family of bounds are finite for appropriate choices of the block size. We also explore the question whether high-probability versions of our MAC-Bayes bounds (i.e., PAC-Bayes bounds of a similar form) are possible. We answer this question in the negative with an example that shows that in general, it is not possible to establish a PAC-Bayes bound which (a) vanishes with a rate faster than $\mathcal{O}(1/\log n)$ whenever the proposed MAC-Bayes bound vanishes with rate $\mathcal{O}(n^{-1/2})$ and (b) exhibits a logarithmic dependence on the permitted error probability.

2602.12601 2026-02-16 cs.LG cs.AI cs.CL stat.ML

HyperMLP: An Integrated Perspective for Sequence Modeling

Jiecheng Lu, Shihao Yang

详情
英文摘要

Self-attention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.

2602.12597 2026-02-16 cs.RO cs.HC

PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People

Mahdi Haghighat Joo, Maryam Karimi Jafari, Alireza Taheri

详情
英文摘要

This paper presents PISHYAR, a socially intelligent smart cane designed by our group to combine socially aware navigation with multimodal human-AI interaction to support both physical mobility and interactive assistance. The system consists of two components: (1) a social navigation framework implemented on a Raspberry Pi 5 that integrates real-time RGB-D perception using an OAK-D Lite camera, YOLOv8-based object detection, COMPOSER-based collective activity recognition, D* Lite dynamic path planning, and haptic feedback via vibration motors for tasks such as locating a vacant seat; and (2) an agentic multimodal LLM-VLM interaction framework that integrates speech recognition, vision language models, large language models, and text-to-speech, with dynamic routing between voice-only and vision-only modes to enable natural voice-based communication, scene description, and object localization from visual input. The system is evaluated through a combination of simulation-based tests, real-world field experiments, and user-centered studies. Results from simulated and real indoor environments demonstrate reliable obstacle avoidance and socially compliant navigation, achieving an overall system accuracy of approximately 80% under different social conditions. Group activity recognition further shows robust performance across diverse crowd scenarios. In addition, a preliminary exploratory user study with eight visually impaired and low-vision participants evaluates the agentic interaction framework through structured tasks and a UTAUT-based questionnaire reveals high acceptance and positive perceptions of usability, trust, and perceived sociability during our experiments. The results highlight the potential of PISHYAR as a multimodal assistive mobility aid that extends beyond navigation to provide socially interactive support for such users.

2602.12592 2026-02-16 cs.LG cs.AI

Power Interpretable Causal ODE Networks: A Unified Model for Explainable Anomaly Detection and Root Cause Analysis in Power Systems

Yue Sun, Likai Wang, Rick S. Blum, Parv Venkitasubramaniam

详情
英文摘要

Anomaly detection and root cause analysis (RCA) are critical for ensuring the safety and resilience of cyber-physical systems such as power grids. However, existing machine learning models for time series anomaly detection often operate as black boxes, offering only binary outputs without any explanation, such as identifying anomaly type and origin. To address this challenge, we propose Power Interpretable Causality Ordinary Differential Equation (PICODE) Networks, a unified, causality-informed architecture that jointly performs anomaly detection along with the explanation why it is detected as an anomaly, including root cause localization, anomaly type classification, and anomaly shape characterization. Experimental results in power systems demonstrate that PICODE achieves competitive detection performance while offering improved interpretability and reduced reliance on labeled data or external causal graphs. We provide theoretical results demonstrating the alignment between the shape of anomaly functions and the changes in the weights of the extracted causal graphs.

2602.12591 2026-02-16 cs.LG

Vehicle behaviour estimation for abnormal event detection using distributed fiber optic sensing

Hemant Prasad, Daisuke Ikefuji, Shin Tominaga, Hitoshi Sakurai, Manabu Otani

详情
英文摘要

The distributed fiber-optic sensing (DFOS) system is a cost-effective wide-area traffic monitoring technology that utilizes existing fiber infrastructure to effectively detect traffic congestions. However, detecting single-lane abnormalities, that lead to congestions, is still a challenge. These single-lane abnormalities can be detected by monitoring lane change behaviour of vehicles, performed to avoid congestion along the monitoring section of a road. This paper presents a method to detect single-lane abnormalities by tracking individual vehicle paths and detecting vehicle lane changes along a section of a road. We propose a method to estimate the vehicle position at all time instances and fit a path using clustering techniques. We detect vehicle lane change by monitoring any change in spectral centroid of vehicle vibrations by tracking a reference vehicle along a highway. The evaluation of our proposed method with real traffic data showed 80% accuracy for lane change detection events that represent presence of abnormalities.

2602.12590 2026-02-16 cs.CV

Unbiased Gradient Estimation for Event Binning via Functional Backpropagation

Jinze Chen, Wei Zhai, Han Han, Tiankai Ma, Yang Cao, Bin Li, Zheng-Jun Zha

详情
英文摘要

Event-based vision encodes dynamic scenes as asynchronous spatio-temporal spikes called events. To leverage conventional image processing pipelines, events are typically binned into frames. However, binning functions are discontinuous, which truncates gradients at the frame level and forces most event-based algorithms to rely solely on frame-based features. Attempts to directly learn from raw events avoid this restriction but instead suffer from biased gradient estimation due to the discontinuities of the binning operation, ultimately limiting their learning efficiency. To address this challenge, we propose a novel framework for unbiased gradient estimation of arbitrary binning functions by synthesizing weak derivatives during backpropagation while keeping the forward output unchanged. The key idea is to exploit integration by parts: lifting the target functions to functionals yields an integral form of the derivative of the binning function during backpropagation, where the cotangent function naturally arises. By reconstructing this cotangent function from the sampled cotangent vector, we compute weak derivatives that provably match long-range finite differences of both smooth and non-smooth targets. Experimentally, our method improves simple optimization-based egomotion estimation with 3.2\% lower RMS error and 1.57$\times$ faster convergence. On complex downstream tasks, we achieve 9.4\% lower EPE in self-supervised optical flow, and 5.1\% lower RMS error in SLAM, demonstrating broad benefits for event-based visual perception. Source code can be found at https://github.com/chjz1024/EventFBP.

2602.12587 2026-02-16 cs.LG

Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers

Anrui Chen, Ruijun Huang, Xin Zhang, Fang Dong, Hengjie Cao, Zhendong Huang, Yifeng Yang, Mengyi Chen, Jixian Zhou, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Tun Lu, Fan Yang, Li Shang

详情
英文摘要

Mixture-of-Experts (MoE) architectures are often considered a natural fit for continual learning because sparse routing should localize updates and reduce interference, yet MoE Transformers still forget substantially even with sparse, well-balanced expert utilization. We attribute this gap to a pre-routing bottleneck: multi-head attention concatenates head-specific signals into a single post-attention router input, forcing routing to act on co-occurring feature compositions rather than separable head channels. We show that this router input simultaneously encodes multiple separately decodable semantic and structural factors with uneven head support, and that different feature compositions induce weakly aligned parameter-gradient directions; as a result, routing maps many distinct compositions to the same route. We quantify this collision effect via a route-wise effective composition number $N_{eff}$ and find that higher $N_{eff}$ is associated with larger old-task loss increases after continual training. Motivated by these findings, we propose MH-MoE, which performs head-wise routing over sub-representations to increase routing granularity and reduce composition collisions. On TRACE with Qwen3-0.6B/8B, MH-MoE effectively mitigates forgetting, reducing BWT on Qwen3-0.6B from 11.2% (LoRAMoE) to 4.5%.

2602.12584 2026-02-16 cs.RO

Hemispherical Angular Power Mapping of Installed mmWave Radar Modules Under Realistic Deployment Constraints

Maaz Qureshi, Mohammad Omid Bagheri, William Melek, George Shaker

详情
英文摘要

Characterizing the angular radiation behavior of installed millimeter-wave (mmWave) radar modules is increasingly important in practical sensing platforms, where packaging, mounting hardware, and nearby structures can significantly alter the effective emission profile. However, once a device is embedded in its host environment, conventional chamber- and turntable-based antenna measurements are often impractical. This paper presents a hemispherical angular received-power mapping methodology for in-situ EM validation of installed mmWave modules under realistic deployment constraints. The approach samples the accessible half-space around a stationary device-under-test by placing a calibrated receiving probe at prescribed (phi, theta, r) locations using geometry-consistent positioning and quasi-static acquisition. Amplitude-only received-power is recorded using standard RF instrumentation to generate hemispherical angular power maps that capture installation-dependent radiation characteristics. Proof-of-concept measurements on a 60-GHz radar module demonstrate repeatable hemi-spherical mapping with angular trends in good agreement with full-wave simulation, supporting practical on-site characterization of embedded mmWave transmitters.

2602.12567 2026-02-16 cs.LG

Fractional Order Federated Learning for Battery Electric Vehicle Energy Consumption Modeling

Mohammad Partohaghighi, Roummel Marcia, Bruce J. West, YangQuan Chen

Comments This manuscript is under review in IEEE Transactions on Transportation Electrification

详情
英文摘要

Federated learning on connected electric vehicles (BEVs) faces severe instability due to intermittent connectivity, time-varying client participation, and pronounced client-to-client variation induced by diverse operating conditions. Conventional FedAvg and many advanced methods can suffer from excessive drift and degraded convergence under these realistic constraints. This work introduces Fractional-Order Roughness-Informed Federated Averaging (FO-RI-FedAvg), a lightweight and modular extension of FedAvg that improves stability through two complementary client-side mechanisms: (i) adaptive roughness-informed proximal regularization, which dynamically tunes the pull toward the global model based on local loss-landscape roughness, and (ii) non-integer-order local optimization, which incorporates short-term memory to smooth conflicting update directions. The approach preserves standard FedAvg server aggregation, adds only element-wise operations with amortizable overhead, and allows independent toggling of each component. Experiments on two real-world BEV energy prediction datasets, VED and its extended version eVED, show that FO-RI-FedAvg achieves improved accuracy and more stable convergence compared to strong federated baselines, particularly under reduced client participation.

2602.12563 2026-02-16 cs.CV

The Constant Eye: Benchmarking and Bridging Appearance Robustness in Autonomous Driving

Jiabao Wang, Hongyu Zhou, Yuanbo Yang, Jiahao Shao, Yiyi Liao

详情
英文摘要

Despite rapid progress, autonomous driving algorithms remain notoriously fragile under Out-of-Distribution (OOD) conditions. We identify a critical decoupling failure in current research: the lack of distinction between appearance-based shifts, such as weather and lighting, and structural scene changes. This leaves a fundamental question unanswered: Is the planner failing because of complex road geometry, or simply because it is raining? To resolve this, we establish navdream, a high-fidelity robustness benchmark leveraging generative pixel-aligned style transfer. By creating a visual stress test with negligible geometric deviation, we isolate the impact of appearance on driving performance. Our evaluation reveals that existing planning algorithms often show significant degradation under OOD appearance conditions, even when the underlying scene structure remains consistent. To bridge this gap, we propose a universal perception interface leveraging a frozen visual foundation model (DINOv3). By extracting appearance-invariant features as a stable interface for the planner, we achieve exceptional zero-shot generalization across diverse planning paradigms, including regression-based, diffusion-based, and scoring-based models. Our plug-and-play solution maintains consistent performance across extreme appearance shifts without requiring further fine-tuning. The benchmark and code will be made available.

2602.12561 2026-02-16 cs.CV

PLLM: Pseudo-Labeling Large Language Models for CAD Program Synthesis

Yuanbo Li, Dule Shu, Yanying Chen, Matt Klenk, Daniel Ritchie

详情
英文摘要

Recovering Computer-Aided Design (CAD) programs from 3D geometries is a widely studied problem. Recent advances in large language models (LLMs) have enabled progress in CAD program synthesis, but existing methods rely on supervised training with paired shape-program data, which is often unavailable. We introduce PLLM, a self-training framework for CAD program synthesis from unlabeled 3D shapes. Given a pre-trained CAD-capable LLM and a shape dataset, PLLM iteratively samples candidate programs, selects high-fidelity executions, and augments programs to construct synthetic program-shape pairs for fine-tuning. We experiment on adapting CAD-Recode from DeepCAD to the unlabeled ABC dataset show consistent improvements in geometric fidelity and program diversity.

2602.12556 2026-02-16 cs.LG cs.AI

SD-MoE: Spectral Decomposition for Effective Expert Specialization

Ruijun Huang, Fang Dong, Xin Zhang, Hengjie Cao, Zhendong Huang, Anrui Chen, Jixian Zhou, Mengyi Chen, Yifeng Yang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, Chun Zhang, Li Shang

详情
英文摘要

Mixture-of-Experts (MoE) architectures scale Large Language Models via expert specialization induced by conditional computation. In practice, however, expert specialization often fails: some experts become functionally similar, while others functioning as de facto shared experts, limiting the effective capacity and model performance. In this work, we analysis from a spectral perspective on parameter and gradient spaces, uncover that (1) experts share highly overlapping dominant spectral components in their parameters, (2) dominant gradient subspaces are strongly aligned across experts, driven by ubiquitous low-rank structure in human corpus, and (3) gating mechanisms preferentially route inputs along these dominant directions, further limiting specialization. To address this, we propose Spectral-Decoupled MoE (SD-MoE), which decomposes both parameter and gradient in the spectral space. SD-MoE improves performance across downstream tasks, enables effective expert specialization, incurring minimal additional computation, and can be seamlessly integrated into a wide range of existing MoE architectures, including Qwen and DeepSeek.

2602.12549 2026-02-16 cs.RO

Eva-Tracker: ESDF-update-free, Visibility-aware Planning with Target Reacquisition for Robust Aerial Tracking

Yue Lin, Yang Liu, Dong Wang, Huchuan Lu

Comments Accepted by ICRA 2026

详情
英文摘要

The Euclidean Signed Distance Field (ESDF) is widely used in visibility evaluation to prevent occlusions and collisions during tracking. However, frequent ESDF updates introduce considerable computational overhead. To address this issue, we propose Eva-Tracker, a visibility-aware trajectory planning framework for aerial tracking that eliminates ESDF updates and incorporates a recovery-capable path generation method for target reacquisition. First, we design a target trajectory prediction method and a visibility-aware initial path generation algorithm that maintain an appropriate observation distance, avoid occlusions, and enable rapid replanning to reacquire the target when it is lost. Then, we propose the Field of View ESDF (FoV-ESDF), a precomputed ESDF tailored to the tracker's field of view, enabling rapid visibility evaluation without requiring updates. Finally, we optimize the trajectory using differentiable FoV-ESDF-based objectives to ensure continuous visibility throughout the tracking process. Extensive simulations and real-world experiments demonstrate that our approach delivers more robust tracking results with lower computational effort than existing state-of-the-art methods. The source code is available at https://github.com/Yue-0/Eva-Tracker.

2602.12544 2026-02-16 cs.AI

Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation

Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn, Creighton Glasscock, Honglak Lee

Comments COLM 2025

详情
英文摘要

We present a scalable pipeline for automatically generating high-quality training data for web agents. In particular, a major challenge in identifying high-quality training instances is trajectory evaluation - quantifying how much progress was made towards task completion. We introduce a novel constraint-based evaluation framework that provides fine-grained assessment of progress towards task completion. This enables us to leverage partially successful trajectories, which significantly expands the amount of usable training data. We evaluate our method on a new benchmark we propose called BookingArena, which consists of complex booking tasks across 20 popular websites, and demonstrate that our distilled student model outperforms open-source approaches and matches or exceeds commercial systems, while being a significantly smaller model. Our work addresses the challenge of efficiently creating diverse, realistic web interaction datasets and provides a systematic evaluation methodology for complex structured web tasks.

2602.12540 2026-02-16 cs.CV cs.RO

Self-Supervised JEPA-based World Models for LiDAR Occupancy Completion and Forecasting

Haoran Zhu, Anna Choromanska

详情
英文摘要

Autonomous driving, as an agent operating in the physical world, requires the fundamental capability to build \textit{world models} that capture how the environment evolves spatiotemporally in order to support long-term planning. At the same time, scalability demands learning such models in a self-supervised manner; \textit{joint-embedding predictive architecture (JEPA)} enables learning world models via leveraging large volumes of unlabeled data without relying on expensive human annotations. In this paper, we propose \textbf{AD-LiST-JEPA}, a self-supervised world model for autonomous driving that predicts future spatiotemporal evolution from LiDAR data using a JEPA framework. We evaluate the quality of the learned representations through a downstream LiDAR-based occupancy completion and forecasting (OCF) task, which jointly assesses perception and prediction. Proof of concept experiments show better OCF performance with pretrained encoder after JEPA-based world model learning.

2602.12533 2026-02-16 cs.LG

AMPS: Adaptive Modality Preference Steering via Functional Entropy

Zihan Huang, Xintong Li, Rohan Surana, Tong Yu, Rui Wang, Julian McAuley, Jingbo Shang, Junda Wu

详情
英文摘要

Multimodal Large Language Models (MLLMs) often exhibit significant modality preference, which is a tendency to favor one modality over another. Depending on the input, they may over-rely on linguistic priors relative to visual evidence, or conversely over-attend to visually salient but facts in textual contexts. Prior work has applied a uniform steering intensity to adjust the modality preference of MLLMs. However, strong steering can impair standard inference and increase error rates, whereas weak steering is often ineffective. In addition, because steering sensitivity varies substantially across multimodal instances, a single global strength is difficult to calibrate. To address this limitation with minimal disruption to inference, we introduce an instance-aware diagnostic metric that quantifies each modality's information contribution and reveals sample-specific susceptibility to steering. Building on these insights, we propose a scaling strategy that reduces steering for sensitive samples and a learnable module that infers scaling patterns, enabling instance-aware control of modality preference. Experimental results show that our instance-aware steering outperforms conventional steering in modulating modality preference, achieving effective adjustment while keeping generation error rates low.

2602.12532 2026-02-16 cs.RO

CRAFT: Adapting VLA Models to Contact-rich Manipulation via Force-aware Curriculum Fine-tuning

Yike Zhang, Yaonan Wang, Xinxin Sun, Kaizhen Huang, Zhiyuan Xu, Junjie Ji, Zhengping Che, Jian Tang, Jingtao Sun

详情
英文摘要

Vision-Language-Action (VLA) models have shown a strong capability in enabling robots to execute general instructions, yet they struggle with contact-rich manipulation tasks, where success requires precise alignment, stable contact maintenance, and effective handling of deformable objects. A fundamental challenge arises from the imbalance between high-entropy vision and language inputs and low-entropy but critical force signals, which often leads to over-reliance on perception and unstable control. To address this, we introduce CRAFT, a force-aware curriculum fine-tuning framework that integrates a variational information bottleneck module to regulate vision and language embeddings during early training. This curriculum strategy encourages the model to prioritize force signals initially, before progressively restoring access to the full multimodal information. To enable force-aware learning, we further design a homologous leader-follower teleoperation system that collects synchronized vision, language, and force data across diverse contact-rich tasks. Real-world experiments demonstrate that CRAFT consistently improves task success, generalizes to unseen objects and novel task variations, and adapts effectively across diverse VLA architectures, enabling robust and generalizable contact-rich manipulation.

2602.12527 2026-02-16 cs.LG

Analytical Results for Two Exponential Family Distributions in Hierarchical Dirichlet Processes

Naiqi Li

详情
英文摘要

The Hierarchical Dirichlet Process (HDP) provides a flexible Bayesian nonparametric framework for modeling grouped data with a shared yet unbounded collection of mixture components. While existing applications of the HDP predominantly focus on the Dirichlet-multinomial conjugate structure, the framework itself is considerably more general and, in principle, accommodates a broad class of conjugate prior-likelihood pairs. In particular, exponential family distributions offer a unified and analytically tractable modeling paradigm that encompasses many commonly used distributions. In this paper, we investigate analytic results for two important members of the exponential family within the HDP framework: the Poisson distribution and the normal distribution. We derive explicit closed-form expressions for the corresponding Gamma-Poisson and Normal-Gamma-Normal conjugate pairs under the hierarchical Dirichlet process construction. Detailed derivations and proofs are provided to clarify the underlying mathematical structure and to demonstrate how conjugacy can be systematically exploited in hierarchical nonparametric models. Our work extends the applicability of the HDP beyond the Dirichlet-multinomial setting and furnishes practical analytic results for researchers employing hierarchical Bayesian nonparametrics.

2602.12526 2026-02-16 cs.LG cs.CL

Constraint-Rectified Training for Efficient Chain-of-Thought

Qinhang Wu, Sen Lin, Ming Zhang, Yingbin Liang, Ness B. Shroff

详情
英文摘要

Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), especially when combined with reinforcement learning (RL) based post-training methods. While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, known as overthinking. Recent research seeks to develop efficient reasoning strategies that balance reasoning length and accuracy, either through length-aware reward design or prompt-based calibration. However, these heuristic-based approaches may suffer from severe accuracy drop and be very sensitive to hyperparameters. To address these problems, we introduce CRT (Constraint-Rectified Training), a principled post-training framework based on reference-guarded constrained optimization, yielding a more stable and interpretable formulation for efficient reasoning. CRT alternates between minimizing reasoning length and rectifying accuracy only when performance falls below the reference, enabling stable and effective pruning of redundant reasoning. We further extend CRT with a two-stage training scheme that first discovers the shortest reliable reasoning patterns and then refines accuracy under a learnt length budget, preventing the re-emergence of verbose CoT. Our comprehensive evaluation shows that this framework consistently reduces token usage while maintaining answer quality at a robust and reliable level. Further analysis reveals that CRT improves reasoning efficiency not only by shortening responses but also by reducing internal language redundancy, leading to a new evaluation metric. Moreover, CRT-based training naturally yields a sequence of intermediate checkpoints that span a spectrum of explanation lengths while preserving correctness, enabling fine-grained control over reasoning verbosity without retraining.

2602.12525 2026-02-16 cs.CV

Geometric Stratification for Singular Configurations of the P3P Problem via Local Dual Space

Xueying Sun, Zijia Li, Nan Li

详情
英文摘要

This paper investigates singular configurations of the P3P problem. Using local dual space, a systematic algebraic-computational framework is proposed to give a complete geometric stratification for the P3P singular configurations with respect to the multiplicity $μ$ of the camera center $O$: for $μ\ge 2$, $O$ lies on the ``danger cylinder'', for $μ\ge 3$, $O$ lies on one of three generatrices of the danger cylinder associated with the first Morley triangle or the circumcircle, and for $μ\ge 4$, $O$ lies on the circumcircle which indeed corresponds to infinite P3P solutions. Furthermore, a geometric stratification for the complementary configuration $O^\prime$ associated with a singular configuration $O$ is studied as well: for $μ\ge 2$, $O^\prime$ lies on a deltoidal surface associated with the danger cylinder, and for $μ\ge 3$, $O^\prime$ lies on one of three cuspidal curves of the deltoidal surface.

2602.12524 2026-02-16 cs.CV

LiDAR-Anchored Collaborative Distillation for Robust 2D Representations

Wonjun Jo, Hyunwoo Ha, Kim Ji-Yeon, Hawook Jeong, Tae-Hyun Oh

详情
英文摘要

As deep learning continues to advance, self-supervised learning has made considerable strides. It allows 2D image encoders to extract useful features for various downstream tasks, including those related to vision-based systems. Nevertheless, pre-trained 2D image encoders fall short in conducting the task under noisy and adverse weather conditions beyond clear daytime scenes, which require for robust visual perception. To address these issues, we propose a novel self-supervised approach, \textbf{Collaborative Distillation}, which leverages 3D LiDAR as self-supervision to improve robustness to noisy and adverse weather conditions in 2D image encoders while retaining their original capabilities. Our method outperforms competing methods in various downstream tasks across diverse conditions and exhibits strong generalization ability. In addition, our method also improves 3D awareness stemming from LiDAR's characteristics. This advancement highlights our method's practicality and adaptability in real-world scenarios.

2602.12520 2026-02-16 cs.LG cs.MA

Multi-Agent Model-Based Reinforcement Learning with Joint State-Action Learned Embeddings

Zhizun Wang, David Meger

Comments 22 pages

详情
英文摘要

Learning to coordinate many agents in partially observable and highly dynamic environments requires both informative representations and data-efficient training. To address this challenge, we present a novel model-based multi-agent reinforcement learning framework that unifies joint state-action representation learning with imaginative roll-outs. We design a world model trained with variational auto-encoders and augment the model using the state-action learned embedding (SALE). SALE is injected into both the imagination module that forecasts plausible future roll-outs and the joint agent network whose individual action values are combined through a mixing network to estimate the joint action-value function. By coupling imagined trajectories with SALE-based action values, the agents acquire a richer understanding of how their choices influence collective outcomes, leading to improved long-term planning and optimization under limited real-environment interactions. Empirical studies on well-established multi-agent benchmarks, including StarCraft II Micro-Management, Multi-Agent MuJoCo, and Level-Based Foraging challenges, demonstrate consistent gains of our method over baseline algorithms and highlight the effectiveness of joint state-action learned embeddings within a multi-agent model-based paradigm.

2602.12517 2026-02-16 cs.LG cs.AI cs.MA math.OC

Bench-MFG: A Benchmark Suite for Learning in Stationary Mean Field Games

Lorenzo Magnino, Jiacheng Shen, Matthieu Geist, Olivier Pietquin, Mathieu Laurière

详情
英文摘要

The intersection of Mean Field Games (MFGs) and Reinforcement Learning (RL) has fostered a growing family of algorithms designed to solve large-scale multi-agent systems. However, the field currently lacks a standardized evaluation protocol, forcing researchers to rely on bespoke, isolated, and often simplistic environments. This fragmentation makes it difficult to assess the robustness, generalization, and failure modes of emerging methods. To address this gap, we propose a comprehensive benchmark suite for MFGs (Bench-MFG), focusing on the discrete-time, discrete-space, stationary setting for the sake of clarity. We introduce a taxonomy of problem classes, ranging from no-interaction and monotone games to potential and dynamics-coupled games, and provide prototypical environments for each. Furthermore, we propose MF-Garnets, a method for generating random MFG instances to facilitate rigorous statistical testing. We benchmark a variety of learning algorithms across these environments, including a novel black-box approach (MF-PSO) for exploitability minimization. Based on our extensive empirical results, we propose guidelines to standardize future experimental comparisons. Code available at \href{https://github.com/lorenzomagnino/Bench-MFG}{https://github.com/lorenzomagnino/Bench-MFG}.

2602.12515 2026-02-16 cs.CV

Matching of SAR and optical images based on transformation to shared modality

Alexey Borisov, Evgeny Myasnikov, Vladislav Myasnikov

详情
英文摘要

Significant differences in optical images and Synthetic Aperture Radar (SAR) images are caused by fundamental differences in the physical principles underlying their acquisition by Earth remote sensing platforms. These differences make precise image matching (co-registration) of these two types of images difficult. In this paper, we propose a new approach to image matching of optical and SAR images, which is based on transforming the images to a new modality. The new image modality is common to both optical and SAR images and satisfies the following conditions. First, the transformed images must have an equal pre-defined number of channels. Second, the transformed and co-registered images must be as similar as possible. Third, the transformed images must be non-degenerate, meaning they must preserve the significant features of the original images. To further match images transformed to this shared modality, we train the RoMa image matching model, which is one of the leading solutions for matching of regular digital photographs. We evaluated the proposed approach on the publicly available MultiSenGE dataset containing both optical and SAR images. We demonstrated its superiority over alternative approaches based on image translation between original modalities and various feature matching algorithms. The proposed solution not only provides better quality of matching, but is also more versatile. It enables the use of ready-made RoMa and DeDoDe models, pre-trained for regular images, without retraining for a new modality, while maintaining high-quality matching of optical and SAR images.

2602.12508 2026-02-16 cs.RO cs.AI cs.CV

Monocular Reconstruction of Neural Tactile Fields

Pavan Mantripragada, Siddhanth Deshmukh, Eadom Dessalene, Manas Desai, Yiannis Aloimonos

Comments 10 pages, 8 figures

详情
英文摘要

Robots operating in the real world must plan through environments that deform, yield, and reconfigure under contact, requiring interaction-aware 3D representations that extend beyond static geometric occupancy. To address this, we introduce neural tactile fields, a novel 3D representation that maps spatial locations to the expected tactile response upon contact. Our model predicts these neural tactile fields from a single monocular RGB image -- the first method to do so. When integrated with off-the-shelf path planners, neural tactile fields enable robots to generate paths that avoid high-resistance objects while deliberately routing through low-resistance regions (e.g. foliage), rather than treating all occupied space as equally impassable. Empirically, our learning framework improves volumetric 3D reconstruction by $85.8\%$ and surface reconstruction by $26.7\%$ compared to state-of-the-art monocular 3D reconstruction methods (LRM and Direct3D).

2602.12499 2026-02-16 cs.LG

A Theoretical Analysis of Mamba's Training Dynamics: Filtering Relevant Features for Generalization in State Space Models

Mugunthan Shandirasegaran, Hongkang Li, Songyang Zhang, Meng Wang, Shuai Zhang

Journal ref ICLR 2026

详情
英文摘要

The recent empirical success of Mamba and other selective state space models (SSMs) has renewed interest in non-attention architectures for sequence modeling, yet their theoretical foundations remain underexplored. We present a first-step analysis of generalization and learning dynamics for a simplified but representative Mamba block: a single-layer, single-head selective SSM with input-dependent gating, followed by a two-layer MLP trained via gradient descent (GD). Our study adopts a structured data model with tokens that include both class-relevant and class-irrelevant patterns under token-level noise and examines two canonical regimes: majority-voting and locality-structured data sequences. We prove that the model achieves guaranteed generalization by establishing non-asymptotic sample complexity and convergence rate bounds, which improve as the effective signal increases and the noise decreases. Furthermore, we show that the gating vector aligns with class-relevant features while ignoring irrelevant ones, thereby formalizing a feature-selection role similar to attention but realized through selective recurrence. Numerical experiments on synthetic data justify our theoretical results. Overall, our results provide principled insight into when and why Mamba-style selective SSMs learn efficiently, offering a theoretical counterpoint to Transformer-centric explanations.

2602.12498 2026-02-16 cs.CV

Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models

Ali Abbasi, Mehdi Taghipour, Rahmatollah Beheshti

Comments 15 pages, 5 figures. Submitted to ICML 2026

详情
英文摘要

Negation is a fundamental linguistic operation in clinical reporting, yet vision-language models (VLMs) frequently fail to distinguish affirmative from negated medical statements. To systematically characterize this limitation, we introduce a radiology-specific diagnostic benchmark that evaluates polarity sensitivity under controlled clinical conditions, revealing that common medical VLMs consistently confuse negated and non-negated findings. To enable learning beyond simple condition absence, we further construct a contextual clinical negation dataset that encodes structured claims and supports attribute-level negations involving location and severity. Building on these resources, we propose Negation-Aware Selective Training (NAST), an interpretability-guided adaptation method that uses causal tracing effects (CTEs) to modulate layer-wise gradient updates during fine-tuning. Rather than applying uniform learning rates, NAST scales each layer's update according to its causal contribution to negation processing, transforming mechanistic interpretability signals into a principled optimization rule. Experiments demonstrate improved discrimination of affirmative and negated clinical statements without degrading general vision-language alignment, highlighting the value of causal interpretability for targeted model adaptation in safety-critical medical settings. Code and resources are available at https://github.com/healthylaife/NAST.

2602.12492 2026-02-16 cs.RO cs.LG

Composable Model-Free RL for Navigation with Input-Affine Systems

Xinhuan Sang, Abdelrahman Abdelgawad, Roberto Tron

Comments 17 pages, 8 figures. Submitted to WAFR 2026 (under review)

详情
英文摘要

As autonomous robots move into complex, dynamic real-world environments, they must learn to navigate safely in real time, yet anticipating all possible behaviors is infeasible. We propose a composable, model-free reinforcement learning method that learns a value function and an optimal policy for each individual environment element (e.g., goal or obstacle) and composes them online to achieve goal reaching and collision avoidance. Assuming unknown nonlinear dynamics that evolve in continuous time and are input-affine, we derive a continuous-time Hamilton-Jacobi-Bellman (HJB) equation for the value function and show that the corresponding advantage function is quadratic in the action and optimal policy. Based on this structure, we introduce a model-free actor-critic algorithm that learns policies and value functions for static or moving obstacles using gradient descent. We then compose multiple reach/avoid models via a quadratically constrained quadratic program (QCQP), yielding formal obstacle-avoidance guarantees in terms of value-function level sets, providing a model-free alternative to CLF/CBF-based controllers. Simulations demonstrate improved performance over a PPO baseline applied to a discrete-time approximation.

2602.12489 2026-02-16 cs.CV

Insertion Network for Image Sequence Correspondence

Dingjie Su, Weixiang Hong, Benoit M. Dawant, Bennett A. Landman

详情
英文摘要

We propose a novel method for establishing correspondence between two sequences of 2D images. One particular application of this technique is slice-level content navigation, where the goal is to localize specific 2D slices within a 3D volume or determine the anatomical coverage of a 3D scan based on its 2D slices. This serves as an important preprocessing step for various diagnostic tasks, as well as for automatic registration and segmentation pipelines. Our approach builds sequence correspondence by training a network to learn how to insert a slice from one sequence into the appropriate position in another. This is achieved by encoding contextual representations of each slice and modeling the insertion process using a slice-to-slice attention mechanism. We apply this method to localize manually labeled key slices in body CT scans and compare its performance to the current state-of-the-art alternative known as body part regression, which predicts anatomical position scores for individual slices. Unlike body part regression, which treats each slice independently, our method leverages contextual information from the entire sequence. Experimental results show that the insertion network reduces slice localization errors in supervised settings from 8.4 mm to 5.4 mm, demonstrating a substantial improvement in accuracy.