arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2365
专题追踪
2602.18728 2026-02-24 cs.LG cs.CV

Phase-Consistent Magnetic Spectral Learning for Multi-View Clustering

Mingdong Lu, Zhikui Chen, Meng Liu, Shubin Ma, Liang Zhao

Comments Preprint. Under review

详情
英文摘要

Unsupervised multi-view clustering (MVC) aims to partition data into meaningful groups by leveraging complementary information from multiple views without labels, yet a central challenge is to obtain a reliable shared structural signal to guide representation learning and cross-view alignment under view discrepancy and noise. Existing approaches often rely on magnitude-only affinities or early pseudo targets, which can be unstable when different views induce relations with comparable strengths but contradictory directional tendencies, thereby distorting the global spectral geometry and degrading clustering. In this paper, we propose \emph{Phase-Consistent Magnetic Spectral Learning} for MVC: we explicitly model cross-view directional agreement as a phase term and combine it with a nonnegative magnitude backbone to form a complex-valued magnetic affinity, extract a stable shared spectral signal via a Hermitian magnetic Laplacian, and use it as structured self-supervision to guide unsupervised multi-view representation learning and clustering. To obtain robust inputs for spectral extraction at scale, we construct a compact shared structure with anchor-based high-order consensus modeling and apply a lightweight refinement to suppress noisy or inconsistent relations. Extensive experiments on multiple public multi-view benchmarks demonstrate that our method consistently outperforms strong baselines.

2602.18724 2026-02-24 cs.AI

Task-Aware Exploration via a Predictive Bisimulation Metric

Dayang Liang, Ruihan Liu, Lipeng Wan, Yunlong Liu, Bo An

详情
英文摘要

Accelerating exploration in visual reinforcement learning under sparse rewards remains challenging due to the substantial task-irrelevant variations. Despite advances in intrinsic exploration, many methods either assume access to low-dimensional states or lack task-aware exploration strategies, thereby rendering them fragile in visual domains. To bridge this gap, we present TEB, a Task-aware Exploration approach that tightly couples task-relevant representations with exploration through a predictive Bisimulation metric. Specifically, TEB leverages the metric not only to learn behaviorally grounded task representations but also to measure behaviorally intrinsic novelty over the learned latent space. To realize this, we first theoretically mitigate the representation collapse of degenerate bisimulation metrics under sparse rewards by internally introducing a simple but effective predicted reward differential. Building on this robust metric, we design potential-based exploration bonuses, which measure the relative novelty of adjacent observations over the latent space. Extensive experiments on MetaWorld and Maze2D show that TEB achieves superior exploration ability and outperforms recent baselines.

2602.18721 2026-02-24 cs.CL eess.AS

ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models

Zefang Liu, Chenyang Zhu, Sangwoo Cho, Shi-Xiong Zhang

详情
英文摘要

Semi-supervised learning in automatic speech recognition (ASR) typically relies on pseudo-labeling, which often suffers from confirmation bias and error accumulation due to noisy supervision. To address this limitation, we propose ReHear, a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop. Unlike conventional text-based correctors, our approach conditions the LLM on both the ASR hypothesis and the source audio, allowing it to recover phonetically accurate transcripts even from severe recognition errors. These refined pseudo-labels serve as high-fidelity targets for fine-tuning the ASR model in an iterative cycle. Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.

2602.18720 2026-02-24 cs.CV

Subtle Motion Blur Detection and Segmentation from Static Image Artworks

Ganesh Samarth, Sibendu Paul, Solale Tabarestani, Caren Chen

Comments InProceedings of the Winter Conference on Applications of Computer Vision 2026

详情
英文摘要

Streaming services serve hundreds of millions of viewers worldwide, where visual assets such as thumbnails, box art, and cover images are critical for engagement. Subtle motion blur remains a pervasive quality issue, reducing visual clarity and negatively affecting user trust and click-through rates. However, motion blur detection from static images is underexplored, as existing methods and datasets focus on severe blur and lack fine-grained pixel-level annotations needed for quality-critical applications. Benchmarks such as GOPRO and NFS are dominated by strong synthetic blur and often contain residual blur in their sharp references, leading to ambiguous supervision. We propose SMBlurDetect, a unified framework combining high-quality motion blur specific dataset generation with an end-to-end detector capable of zero-shot detection at multiple granularities. Our pipeline synthesizes realistic motion blur from super high resolution aesthetic images using controllable camera and object motion simulations over SAM segmented regions, enhanced with alpha-aware compositing and balanced sampling to generate subtle, spatially localized blur with precise ground truth masks. We train a U-Net based detector with ImageNet pretrained encoders using a hybrid mask and image centric strategy incorporating curriculum learning, hard negatives, focal loss, blur frequency channels, and resolution aware augmentation.Our method achieves strong zero-shot generalization, reaching 89.68% accuracy on GoPro (vs 66.50% baseline) and 59.77% Mean IoU on CUHK (vs 9.00% baseline), demonstrating 6.6x improvement in segmentation. Qualitative results show accurate localization of subtle blur artifacts, enabling automated filtering of low quality frames and precise region of interest extraction for intelligent cropping.

2602.18717 2026-02-24 cs.CV

NeXt2Former-CD: Efficient Remote Sensing Change Detection with Modern Vision Architectures

Yufan Wang, Sokratis Makrogiannis, Chandra Kambhamettu

Comments Code will be released at https://github.com/VimsLab/NeXt2Former-CD

详情
英文摘要

State Space Models (SSMs) have recently gained traction in remote sensing change detection (CD) for their favorable scaling properties. In this paper, we explore the potential of modern convolutional and attention-based architectures as a competitive alternative. We propose NeXt2Former-CD, an end-to-end framework that integrates a Siamese ConvNeXt encoder initialized with DINOv3 weights, a deformable attention-based temporal fusion module, and a Mask2Former decoder. This design is intended to better tolerate residual co-registration noise and small object-level spatial shifts, as well as semantic ambiguity in bi-temporal imagery. Experiments on LEVIR-CD, WHU-CD, and CDD datasets show that our method achieves the best results among the evaluated methods, improving over recent Mamba-based baselines in both F1 score and IoU. Furthermore, despite a larger parameter count, our model maintains inference latency comparable to SSM-based approaches, suggesting it is practical for high-resolution change detection tasks.

2602.18716 2026-02-24 cs.RO cs.AI

Temporal Action Representation Learning for Tactical Resource Control and Subsequent Maneuver Generation

Hoseong Jung, Sungil Son, Daesol Cho, Jonghae Park, Changhyun Choi, H. Jin Kim

Comments ICRA 2026, 8 pages

详情
英文摘要

Autonomous robotic systems should reason about resource control and its impact on subsequent maneuvers, especially when operating with limited energy budgets or restricted sensing. Learning-based control is effective in handling complex dynamics and represents the problem as a hybrid action space unifying discrete resource usage and continuous maneuvers. However, prior works on hybrid action space have not sufficiently captured the causal dependencies between resource usage and maneuvers. They have also overlooked the multi-modal nature of tactical decisions, both of which are critical in fast-evolving scenarios. In this paper, we propose TART, a Temporal Action Representation learning framework for Tactical resource control and subsequent maneuver generation. TART leverages contrastive learning based on a mutual information objective, designed to capture inherent temporal dependencies in resource-maneuver interactions. These learned representations are quantized into discrete codebook entries that condition the policy, capturing recurring tactical patterns and enabling multi-modal and temporally coherent behaviors. We evaluate TART in two domains where resource deployment is critical: (i) a maze navigation task where a limited budget of discrete actions provides enhanced mobility, and (ii) a high-fidelity air combat simulator in which an F-16 agent operates weapons and defensive systems in coordination with flight maneuvers. Across both domains, TART consistently outperforms hybrid-action baselines, demonstrating its effectiveness in leveraging limited resources and producing context-aware subsequent maneuvers.

2602.18711 2026-02-24 cs.CV

HIME: Mitigating Object Hallucinations in LVLMs via Hallucination Insensitivity Model Editing

Ahmed Akl, Abdelwahed Khamis, Ali Cheraghian, Zhe Wang, Sara Khalifa, Kewen Wang

详情
英文摘要

Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal understanding capabilities, yet they remain prone to object hallucination, where models describe non-existent objects or attribute incorrect factual information, raising serious concerns for reliable real-world deployment. While fine-tuning is a commonly adopted mitigation strategy, its high computational cost and practical difficulty motivate the need for training-free alternatives, among which model editing has recently emerged as a promising direction. However, indiscriminate editing risks disrupting the rich implicit knowledge encoded in pre-trained LVLMs, leading to a fundamental question: how much intervention is necessary at each layer to suppress hallucinations while preserving pre-trained knowledge? To address this question, we present a systematic analysis of LVLM decoders built on three widely used large language model backbones-Qwen, LLaMA, and Vicuna-revealing clear layer-wise differences in susceptibility to object hallucination. Building on these insights, we introduce the Hallucination Insensitivity Score (HIS), a principled metric that quantifies each layer's sensitivity to hallucination and provides guidance for targeted intervention. Leveraging HIS, we propose Hallucination Insensitivity Model Editing (HIME), a simple yet effective layer-adaptive weight editing approach that selectively modifies latent features to suppress hallucinations while preserving pre-trained knowledge. Extensive experiments demonstrate that HIME reduces hallucinations by an average of 61.8% across open-ended generation benchmarks, including CHAIR, MME, and GPT-4V-aided evaluation, without introducing additional parameters, inference-time latency, or computational overhead.

2602.18702 2026-02-24 cs.CV cs.AI

Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding

Houlun Chen, Xin Wang, Guangyao Li, Yuwei Zhou, Yihan Chen, Jia Jia, Wenwu Zhu

详情
英文摘要

Long video understanding is challenging due to rich and complicated multimodal clues in long temporal range.Current methods adopt reasoning to improve the model's ability to analyze complex video clues in long videos via text-form reasoning.However,the existing literature suffers from the fact that the text-only reasoning under fixed video context may exacerbate hallucinations since detailed crucial clues are often ignored under limited video context length due to the temporal redundancy of long videos.To address this gap,we propose Video-TwG,a curriculum reinforced framework that employs a novel Think-with-Grounding paradigm,enabling video LLMs to actively decide when to perform on-demand grounding during interleaved text-video reasoning, selectively zooming into question-relevant clips only when necessary.Video-TwG can be trained end-to-end in a straightforward manner, without relying on complex auxiliary modules or heavily annotated reasoning tracesIn detail,we design a Two-stage Reinforced Curriculum Strategy, where the model first learns think-with-grounding behavior on a small short-video GQA dataset with grounding labels,and then scales to diverse general QA data with videos of diverse domains to encourage generalization. Further, to handle complex think-with-grounding reasoning for various kinds of data,we propose TwG-GRPO algorithm which features the fine-grained grounding reward, self-confirmed pseudo reward and accuracy-gated mechanism.Finally,we propose to construct a new TwG-51K dataset that facilitates training. Experiments on Video-MME, LongVideoBench, and MLVU show that Video-TwG consistently outperforms strong LVU baselines.Further ablation validates the necessity of our Two-stage Reinforced Curriculum Strategy and shows our TwG-GRPO better leverages diverse unlabeled data to improve grounding quality and reduce redundant groundings without sacrificing QA performance.

2602.18699 2026-02-24 cs.CL cs.AI

Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift

Stephen Russell

详情
英文摘要

Most semantic drift studies report multiple signals e.g., embedding displacement, neighbor changes, distributional divergence, and recursive trajectory instability, without a shared explanatory theory that relates them. This paper proposes a formalization of these signals in one time-indexed substrate, $S_t=(X,d_t,P_t)$, combining embedding geometry with local diffusion. Within this substrate, node-level neighborhood drift measures changes in local conditional distributions, coarse Ricci curvature measures local contractivity of semantic diffusion, and recursive drift probes stability of iterated semantic operators. This manuscript specifies the formal model, assumptions, and tests that can refute the model. Herein, the paper introduces bridge mass, a node-level aggregate of incident negative curvature, as a predictor of future neighborhood rewiring. This paper provides the theory and test contracts; empirical performance is deferred to subsequent studies.

2602.18697 2026-02-24 cs.CV

Deep LoRA-Unfolding Networks for Image Restoration

Xiangming Wang, Haijin Zeng, Benteng Sun, Jiezhang Cao, Kai Zhang, Qiangqiang Shen, Yongyong Chen

Comments Accepted by IEEE Transactions on Image Processing

详情
英文摘要

Deep unfolding networks (DUNs), combining conventional iterative optimization algorithms and deep neural networks into a multi-stage framework, have achieved remarkable accomplishments in Image Restoration (IR), such as spectral imaging reconstruction, compressive sensing and super-resolution.It unfolds the iterative optimization steps into a stack of sequentially linked blocks.Each block consists of a Gradient Descent Module (GDM) and a Proximal Mapping Module (PMM) which is equivalent to a denoiser from a Bayesian perspective, operating on Gaussian noise with a known level.However, existing DUNs suffer from two critical limitations: (i) their PMMs share identical architectures and denoising objectives across stages, ignoring the need for stage-specific adaptation to varying noise levels; and (ii) their chain of structurally repetitive blocks results in severe parameter redundancy and high memory consumption, hindering deployment in large-scale or resource-constrained scenarios.To address these challenges, we introduce generalized Deep Low-rank Adaptation (LoRA) Unfolding Networks for image restoration, named LoRun, harmonizing denoising objectives and adapting different denoising levels between stages with compressed memory usage for more efficient DUN.LoRun introduces a novel paradigm where a single pretrained base denoiser is shared across all stages, while lightweight, stage-specific LoRA adapters are injected into the PMMs to dynamically modulate denoising behavior according to the noise level at each unfolding step.This design decouples the core restoration capability from task-specific adaptation, enabling precise control over denoising intensity without duplicating full network parameters and achieving up to $N$ times parameter reduction for an $N$-stage DUN with on-par or better performance.Extensive experiments conducted on three IR tasks validate the efficiency of our method.

2602.18694 2026-02-24 cs.LG cs.AI

In-Context Planning with Latent Temporal Abstractions

Baiting Luo, Yunuo Zhang, Nathaniel S. Keplinger, Samir Gupta, Abhishek Dubey, Ayan Mukhopadhyay

详情
英文摘要

Planning-based reinforcement learning for continuous control is bottlenecked by two practical issues: planning at primitive time scales leads to prohibitive branching and long horizons, while real environments are frequently partially observable and exhibit regime shifts that invalidate stationary, fully observed dynamics assumptions. We introduce I-TAP (In-Context Latent Temporal-Abstraction Planner), an offline RL framework that unifies in-context adaptation with online planning in a learned discrete temporal-abstraction space. From offline trajectories, I-TAP learns an observation-conditioned residual-quantization VAE that compresses each observation-macro-action segment into a coarse-to-fine stack of discrete residual tokens, and a temporal Transformer that autoregressively predicts these token stacks from a short recent history. The resulting sequence model acts simultaneously as a context-conditioned prior over abstract actions and a latent dynamics model. At test time, I-TAP performs Monte Carlo Tree Search directly in token space, using short histories for implicit adaptation without gradient update, and decodes selected token stacks into executable actions. Across deterministic MuJoCo, stochastic MuJoCo with per-episode latent dynamics regimes, and high-dimensional Adroit manipulation, including partially observable variants, I-TAP consistently matches or outperforms strong model-free and model-based offline baselines, demonstrating efficient and robust in-context planning under stochastic dynamics and partial observability.

2602.18693 2026-02-24 cs.CL

Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM

Md Badsha Biswas, Ozlem Uzuner

详情
英文摘要

The spread of misinformation across digital platforms can pose significant societal risks. Claim verification, a.k.a. fact-checking, systems can help identify potential misinformation. However, their efficacy is limited by the knowledge sources that they rely on. Most automated claim verification systems depend on a single knowledge source and utilize the supporting evidence from that source; they ignore the disagreement of their source with others. This limits their knowledge coverage and transparency. To address these limitations, we present a novel system for open-domain claim verification (ODCV) that leverages large language models (LLMs), multi-perspective evidence retrieval, and cross-source disagreement analysis. Our approach introduces a novel retrieval strategy that collects evidence for both the original and the negated forms of a claim, enabling the system to capture supporting and contradicting information from diverse sources: Wikipedia, PubMed, and Google. These evidence sets are filtered, deduplicated, and aggregated across sources to form a unified and enriched knowledge base that better reflects the complexity of real-world information. This aggregated evidence is then used for claim verification using LLMs. We further enhance interpretability by analyzing model confidence scores to quantify and visualize inter-source disagreement. Through extensive evaluation on four benchmark datasets with five LLMs, we show that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning. Our findings underscore the importance of embracing diversity, contradiction, and aggregation in evidence for building reliable and transparent claim verification systems

2602.18692 2026-02-24 cs.CL

From Trial by Fire To Sleep Like a Baby: A Lexicon of Anxiety Associations for 20k English Multiword Expressions

Saif M. Mohammad

Journal ref LREC 2026

详情
英文摘要

Anxiety is the unease about a possible future negative outcome. In recent years, there has been growing interest in understanding how anxiety relates to our health, well-being, body, mind, and behaviour. This includes work on lexical resources for word-anxiety association. However, there is very little anxiety-related work on larger units of text such as multiword expressions (MWE). Here, we introduce the first large-scale lexicon capturing descriptive norms of anxiety associations for more than 20k English MWEs. We show that the anxiety associations are highly reliable. We use the lexicon to study prevalence of different types of anxiety- and calmness-associated MWEs; and how that varies across two-, three-, and four-word sequences. We also study the extent to which the anxiety association of MWEs is compositional (due to its constituent words). The lexicon enables a wide variety of anxiety-related research in psychology, NLP, public health, and social sciences. The lexicon is freely available: https://saifmohammad.com/worrylex.html

2602.18684 2026-02-24 cs.RO cs.CV

Systematic Analysis of Coupling Effects on Closed-Loop and Open-Loop Performance in Aerial Continuum Manipulators

Niloufar Amiri, Shayan Sepahvand, Iraj Mantegh, Farrokh Janabi-Sharifi

Comments Submitted to the 2026 International Conference on Unmanned Aircraft Systems (ICUAS 2026)

详情
英文摘要

This paper investigates two distinct approaches to the dynamic modeling of aerial continuum manipulators (ACMs): the decoupled and the coupled formulations. Both open-loop and closed-loop behaviors of a representative ACM are analyzed. The primary objective is to determine the conditions under which the decoupled model attains accuracy comparable to the coupled model while offering reduced computational cost under identical numerical conditions. The system dynamics are first derived using the Euler--Lagrange method under the piecewise constant curvature (PCC) assumption, with explicit treatment of the near-zero curvature singularity. A decoupled model is then obtained by neglecting the coupling terms in the ACM dynamics, enabling systematic evaluation of open-loop responses under diverse actuation profiles and external wrenches. To extend the analysis to closed-loop performance, a novel dynamics-based proportional-derivative sliding mode image-based visual servoing (DPD-SM-IBVS) controller is developed for regulating image feature errors in the presence of a moving target. The controller is implemented with both coupled and decoupled models, allowing a direct comparison of their effectiveness. The open-loop simulations reveal pronounced discrepancies between the two modeling approaches, particularly under varying torque inputs and continuum arm parameters. Conversely, the closed-loop experiments demonstrate that the decoupled model achieves tracking accuracy on par with the coupled model (within subpixel error) while incurring lower computational cost.

2602.18674 2026-02-24 cs.LG cs.NE

Robustness of Deep ReLU Networks to Misclassification of High-Dimensional Data

Věra Kůrková

Comments 15 pages, 4 figures

详情
英文摘要

We present a theoretical study of the robustness of parameterized networks to random input perturbations. Specifically, we analyze local robustness at a given network input by quantifying the probability that a small additive random perturbation of the input leads to misclassification. For deep networks with rectified linear units, we derive lower bounds on local robustness in terms of the input dimensionality and the total number of network units.

2602.18663 2026-02-24 cs.RO cs.LG

Toward AI Autonomous Navigation for Mechanical Thrombectomy using Hierarchical Modular Multi-agent Reinforcement Learning (HM-MARL)

Harry Robertshaw, Nikola Fischer, Lennart Karstensen, Benjamin Jackson, Xingyu Chen, S. M. Hadi Sadati, Christos Bergeles, Alejandro Granados, Thomas C Booth

Comments Published in IEEE Robotics and Automation Letters

Journal ref IEEE Robotics and Automation Letters (2026)

详情
英文摘要

Mechanical thrombectomy (MT) is typically the optimal treatment for acute ischemic stroke involving large vessel occlusions, but access is limited due to geographic and logistical barriers. Reinforcement learning (RL) shows promise in autonomous endovascular navigation, but generalization across 'long' navigation tasks remains challenging. We propose a Hierarchical Modular Multi-Agent Reinforcement Learning (HM-MARL) framework for autonomous two-device navigation in vitro, enabling efficient and generalizable navigation. HM-MARL was developed to autonomously navigate a guide catheter and guidewire from the femoral artery to the internal carotid artery (ICA). A modular multi-agent approach was used to decompose the complex navigation task into specialized subtasks, each trained using Soft Actor-Critic RL. The framework was validated in both in silico and in vitro testbeds to assess generalization and real-world feasibility. In silico, a single-vasculature model achieved 92-100% success rates on individual anatomies, while a multi-vasculature model achieved 56-80% across multiple patient anatomies. In vitro, both HM-MARL models successfully navigated 100% of trials from the femoral artery to the right common carotid artery and 80% to the right ICA but failed on the left-side vessel superhuman challenge due to the anatomy and catheter type used in navigation. This study presents the first demonstration of in vitro autonomous navigation in MT vasculature. While HM-MARL enables generalization across anatomies, the simulation-to-real transition introduces challenges. Future work will refine RL strategies using world models and validate performance on unseen in vitro data, advancing autonomous MT towards clinical translation.

2602.18662 2026-02-24 cs.LG

Large Causal Models for Temporal Causal Discovery

Nikolaos Kougioulis, Nikolaos Gkorgkolis, MingXue Wang, Bora Caglayan, Dario Simionato, Andrea Tonon, Ioannis Tsamardinos

Comments 32 pages (16 main text, 16 Appendix), 11 Figures

详情
英文摘要

Causal discovery for both cross-sectional and temporal data has traditionally followed a dataset-specific paradigm, where a new model is fitted for each individual dataset. Such an approach limits the potential of multi-dataset pretraining. The concept of large causal models (LCMs) envisions a class of pre-trained neural architectures specifically designed for temporal causal discovery. Prior approaches are constrained to small variable counts, degrade with larger inputs, and rely heavily on synthetic data, limiting generalization. We propose a principled framework for LCMs, combining diverse synthetic generators with realistic time-series datasets, allowing learning at scale. Extensive experiments on synthetic, semi-synthetic and realistic benchmarks show that LCMs scale effectively to higher variable counts and deeper architectures while maintaining strong performance. Trained models achieve competitive or superior accuracy compared to classical and neural baselines, particularly in out-of-distribution settings, while enabling fast, single-pass inference. Results demonstrate LCMs as a promising foundation-model paradigm for temporal causal discovery. Experiments and model weights are available at https://github.com/kougioulis/LCM-paper/.

2602.18661 2026-02-24 cs.RO

Robotic Fruits with Tunable Stiffness and Sensing: Towards a Methodology for Developing Realistic Physical Twins of Fruits

Saitarun Nadipineni, Keshav Pandiyan, Kaspar Althoefer, Shinichi Hirai, Thilina Dulantha Lalitharatne

Comments 6 pages, 5 figures, 9th IEEE-RAS International Conference on Soft Robotics (RoboSoft) 2026

详情
英文摘要

The global agri-food sector faces increasing challenges from labour shortages, high consumer demand, and supply-chain disruptions, resulting in substantial losses of unharvested produce. Robotic harvesting has emerged as a promising alternative; however, evaluating and training soft grippers for delicate fruits remains difficult due to the highly variable mechanical properties of natural produce. This makes it difficult to establish reliable benchmarks or data-driven control strategies. Existing testing practices rely on large quantities of real fruit to capture this variability, leading to inefficiency, higher costs, and waste. The methodology presented in this work aims to address these limitations by developing tunable soft physical twins that emulate the stiffness characteristics of real fruits at different ripeness levels. A fiber-reinforced pneumatic physical twin of a kiwi fruit was designed and fabricated to replicate the stiffness at different ripeness levels. Experimental results show that the stiffness of the physical twin can be tuned accurately over multiple trials (97.35 - 99.43% accuracy). Gripping tasks with a commercial robotic gripper showed that sensor feedback from the physical twin can reflect the applied gripping forces. Finally, a stress test was performed over 50 cycles showed reliable maintenance of desired stiffness (0.56 - 1.10% error). This work shows promise that robotic physical twins could adjust their stiffness to resemble that of real fruits. This can provide a sustainable, controllable platform for benchmarking and training robotic grippers.

2602.18658 2026-02-24 cs.LG

Communication-Efficient Personalized Adaptation via Federated-Local Model Merging

Yinan Zou, Md Kamran Chowdhury Shisher, Christopher G. Brinton, Vishrant Tripathi

详情
英文摘要

Parameter-efficient fine-tuning methods, such as LoRA, offer a practical way to adapt large vision and language models to client tasks. However, this becomes particularly challenging under task-level heterogeneity in federated deployments. In this regime, personalization requires balancing general knowledge with personalized knowledge, yet existing approaches largely rely on heuristic mixing rules and lack theoretical justification. Moreover, prior model merging approaches are also computation and communication intensive, making the process inefficient in federated settings. In this work, we propose Potara, a principled framework for federated personalization that constructs a personalized model for each client by merging two complementary models: (i) a federated model capturing general knowledge, and (ii) a local model capturing personalized knowledge. Through the construct of linear mode connectivity, we show that the expected task loss admits a variance trace upper bound, whose minimization yields closed-form optimal mixing weights that guarantee a tighter bound for the merged model than for either the federated or local model alone. Experiments on vision and language benchmarks show that Potara consistently improves personalization while reducing communication, leading to a strong performance-communication trade-off.

2602.18649 2026-02-24 cs.LG cs.AI

Global Low-Rank, Local Full-Rank: The Holographic Encoding of Learned Algorithms

Yongzhong Xu

Comments 14 pages, 3 figures, 6 tables

详情
英文摘要

Grokking -- the abrupt transition from memorization to generalization after extended training -- has been linked to the emergence of low-dimensional structure in learning dynamics. Yet neural network parameters inhabit extremely high-dimensional spaces. How can a low-dimensional learning process produce solutions that resist low-dimensional compression? We investigate this question in multi-task modular arithmetic, training shared-trunk Transformers with separate heads for addition, multiplication, and a quadratic operation modulo 97. Across three model scales (315K--2.2M parameters) and five weight decay settings, we compare three reconstruction methods: per-matrix SVD, joint cross-matrix SVD, and trajectory PCA. Across all conditions, grokking trajectories are confined to a 2--6 dimensional global subspace, while individual weight matrices remain effectively full-rank. Reconstruction from 3--5 trajectory PCs recovers over 95\% of final accuracy, whereas both per-matrix and joint SVD fail at sub-full rank. Even when static decompositions capture most spectral energy, they destroy task-relevant structure. These results show that learned algorithms are encoded through dynamically coordinated updates spanning all matrices, rather than localized low-rank components. We term this the holographic encoding principle: grokked solutions are globally low-rank in the space of learning directions but locally full-rank in parameter space, with implications for compression, interpretability, and understanding how neural networks encode computation.

2602.18637 2026-02-24 cs.LG q-bio.NC

Online decoding of rat self-paced locomotion speed from EEG using recurrent neural networks

Alejandro de Miguel, Nelson Totah, Uri Maoz

Comments 17 pages, 1 table and 7 figures

详情
英文摘要

$\textit{Objective.}$ Accurate neural decoding of locomotion holds promise for advancing rehabilitation, prosthetic control, and understanding neural correlates of action. Recent studies have demonstrated decoding of locomotion kinematics across species on motorized treadmills. However, efforts to decode locomotion speed in more natural contexts$-$where pace is self-selected rather than externally imposed$-$are scarce, generally achieve only modest accuracy, and require intracranial implants. Here, we aim to decode self-paced locomotion speed non-invasively and continuously using cortex-wide EEG recordings from rats. $\textit{Approach.}$ We introduce an asynchronous brain$-$computer interface (BCI) that processes a stream of 32-electrode skull-surface EEG (0.01$-$45 Hz) to decode instantaneous speed from a non-motorized treadmill during self-paced locomotion in head-fixed rats. Using recurrent neural networks and a dataset of over 133 h of recordings, we trained decoders to map ongoing EEG activity to treadmill speed. $\textit{Main results.}$ Our decoding achieves a correlation of 0.88 ($R^2$ = 0.78) for speed, primarily driven by visual cortex electrodes and low-frequency ($< 8$ Hz) oscillations. Moreover, pre-training on a single session permitted decoding on other sessions from the same rat, suggesting uniform neural signatures that generalize across sessions but fail to transfer across animals. Finally, we found that cortical states not only carry information about current speed, but also about future and past dynamics, extending up to 1000 ms. $\textit{Significance.}$ These findings demonstrate that self-paced locomotion speed can be decoded accurately and continuously from non-invasive, cortex-wide EEG. Our approach provides a framework for developing high-performing, non-invasive BCI systems and contributes to understanding distributed neural representations of action dynamics.

2602.18635 2026-02-24 cs.SD cs.NE

Musical Training, but not Mere Exposure to Music, Drives the Emergence of Chroma Equivalence in Artificial Neural Networks

Lukas Grasse, Matthew S. Tata

详情
英文摘要

Pitch is a fundamental aspect of auditory perception. Pitch perception is commonly described across two perceptual dimensions: pitch height is the sense that tones with varying frequencies seem to be higher or lower, and chroma equivalence is the cyclical similarity of notes octaves, corresponding to a doubling of fundamental frequency. Existing research is divided on whether chroma equivalence is a learned percept that varies according to musical experience and culture, or is an innate percept that develops automatically. Building on a recent framework that proposes to use ANNs to ask 'why' questions about the brain, we evaluated recent auditory ANNs using representational similarity analysis to test the emergence of pitch height and chroma equivalence in their learned representations. Additionally, we fine-tuned two models, Wav2Vec 2.0 and Data2Vec, on a self-supervised learning task using speech and music, and a supervised music transcription task. We found that all models exhibited varying degrees of pitch height representation, but that only models trained on the supervised music transcription task exhibited chroma equivalence. Mere exposure to music through self-supervised learning was not sufficient for chroma equivalence to emerge. This supports the view that chroma equivalence is a higher-order cognitive computation that emerges to support the specific task of music perception, distinct from other auditory perception such as speech listening. This work also highlights the usefulness of ANNs for probing the developmental conditions that give rise to perceptual representations in humans.

2602.18633 2026-02-24 cs.CL

DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning

Fangyuan Xu, Sihao Chen, Zinan Lin, Taiwei Shi, Sydney Graham, Pei Zhou, Mengting Wan, Alex Stein, Virginia Estellers, Charles Chen, Morris Sharp, Richard Speyer, Tadas Baltrusaitis, Jennifer Neville, Eunsol Choi, Longqi Yang

详情
英文摘要

Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data, where data owners cannot provide eyes-on access to individual examples. Generating DP synthetic data typically involves a difficult trade-off. On one hand, DP finetuning methods train an LLM as a synthetic data generator with formal privacy guarantees, yet it still requires the raw content of private examples for model training. However, methods that avoid direct exposure to private data are bounded by an off-the-shelf, un-finetuned model, whose outputs often lack domain fidelity. Can we train an LLM to generate high-quality synthetic text without eyes-on access to individual private examples? In this work, we introduce Differentially Private Reinforcement Fine-Tuning (DP-RFT), an online reinforcement learning algorithm for synthetic data generation with LLMs. DP-RFT leverages DP-protected nearest-neighbor votes from an eyes-off private corpus as a reward signal for on-policy synthetic samples generated by an LLM. The LLM iteratively learns to generate synthetic data to maximize the expected DP votes through Proximal Policy Optimization (PPO). We evaluate DP-RFT for long-form and domain-specific synthetic data generation, such as news articles, meeting transcripts, and medical article abstracts. Our experiments show that DP-RFT closes the gap between private evolution and DP finetuning methods in terms of the fidelity and downstream utility of the generated synthetic data, while respecting the private data boundary.

2602.18628 2026-02-24 cs.LG cs.AI

Non-Interfering Weight Fields: Treating Model Parameters as a Continuously Extensible Function

Sarim Chaudhry

详情
英文摘要

Large language models store all learned knowledge in a single, fixed weight vector. Teaching a model new capabilities requires modifying those same weights, inevitably degrading previously acquired knowledge. This fundamental limitation, known as catastrophic forgetting, has resisted principled solutions for decades. Existing approaches treat weights as immutable artifacts that must be protected through techniques like regularization heuristics, replay buffers, or isolated adapter modules. The problem is none of these provide a structural guarantee against forgetting. In this work, we propose Non-Interfering Weight Fields (NIWF), a framework that replaces the fixed weight paradigm with a learned function that generates weight configurations on demand from a continuous capability coordinate space. After training on a task, we commit the occupied coordinate region by snapshotting the fields outputs on anchor points to enforce a functional lock during all future training. We validate NIWF on sequential instructionfollowing and code generation tasks using Mistral-7B, demonstrating zero forgetting on committed tasks with competitive perplexity on new tasks. The framework introduces the notion of software-like versioning for neural network intelligence, where capabilities can be committed, extended, composed, and rolled back without retraining.

2602.18622 2026-02-24 cs.RO

FORMICA: Decision-Focused Learning for Communication-Free Multi-Robot Task Allocation

Antonio Lopez, Jack Muirhead, Carlo Pinciroli

Comments 13 pages, 2 figures, ANTS 2026

详情
英文摘要

Most multi-robot task allocation methods rely on communication to resolve conflicts and reach consistent assignments. In environments with limited bandwidth, degraded infrastructure, or adversarial interference, existing approaches degrade sharply. We introduce a learning-based framework that achieves high-quality task allocation without any robot-to-robot communication. The key idea is that robots coordinate implicitly by predicting teammates' bids: if each robot can anticipate competition for a task, it can adjust its choices accordingly. Our method predicts bid distributions to correct systematic errors in analytical mean-field approximations. While analytical predictions assume idealized conditions (uniform distributions, known bid functions), our learned approach adapts to task clustering and spatial heterogeneity. Inspired by Smart Predict-then-Optimize (SPO), we train predictors end-to-end to minimize Task Allocation Regret rather than prediction error. To scale to large swarms, we develop a mean-field approximation where each robot predicts the distribution of competing bids rather than individual bids, reducing complexity from $O(NT)$ to $O(T)$. We call our approach FORMICA: Field-Oriented Regret-Minimizing Implicit Coordination Algorithm. Experiments show FORMICA substantially outperforms a natural analytical baseline. In scenarios with 16 robots and 64 tasks, our approach improves system reward by 17% and approaches the optimal MILP solution. When deployed on larger scenarios (256 robots, 4096 tasks), the same model improves performance by 7%, demonstrating strong generalization. Training requires only 21 seconds on a laptop, enabling rapid adaptation to new environments.

2602.18618 2026-02-24 cs.CV

Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space

Aashish Chandra, Aashutosh A, Abhijit Das

Comments To appear in the Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026. Presented at Poster Session 1

详情
英文摘要

We present a novel approach for generating realistic speaking and talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt/driving text, the driving image, and the voice profile of an individual and then combines them to pass them to the multi-entangled latent space to foster key-value pairs and queries for the audio and video modality generation pipeline. The multi-entangled latent space is responsible for establishing the spatiotemporal person-specific features between the modalities. Further, entangled features are passed to the respective decoder of each modality for output audio and video generation.

2602.18614 2026-02-24 cs.CV

Effect of Patch Size on Fine-Tuning Vision Transformers in Two-Dimensional and Three-Dimensional Medical Image Classification

Massoud Dehghan, Ramona Woitek, Amirreza Mahbod

Comments 29 pages

详情
英文摘要

Vision Transformers (ViTs) and their variants have become state-of-the-art in many computer vision tasks and are widely used as backbones in large-scale vision and vision-language foundation models. While substantial research has focused on architectural improvements, the impact of patch size, a crucial initial design choice in ViTs, remains underexplored, particularly in medical domains where both two-dimensional (2D) and three-dimensional (3D) imaging modalities exist. In this study, using 12 medical imaging datasets from various imaging modalities (including seven 2D and five 3D datasets), we conduct a thorough evaluation of how different patch sizes affect ViT classification performance. Using a single graphical processing unit (GPU) and a range of patch sizes (1, 2, 4, 7, 14, 28), we fine-tune ViT models and observe consistent improvements in classification performance with smaller patch sizes (1, 2, and 4), which achieve the best results across nearly all datasets. More specifically, our results indicate improvements in balanced accuracy of up to 12.78% for 2D datasets (patch size 2 vs. 28) and up to 23.78% for 3D datasets (patch size 1 vs. 14), at the cost of increased computational expense. Moreover, by applying a straightforward ensemble strategy that fuses the predictions of the models trained with patch sizes 1, 2, and 4, we demonstrate a further boost in performance in most cases, especially for the 2D datasets. Our implementation is publicly available on GitHub: https://github.com/HealMaDe/MedViT

2602.18613 2026-02-24 cs.LG cs.CL cs.IR

Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools

Baris Arat, Emre Sefer

详情
英文摘要

Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever. This setup couples ranking behavior with retrieval quality, so differences in output cannot be attributed to the ranking policy alone. We introduce a controlled diagnostic that isolates reranking by using Multi-News clusters as fixed evidence pools. We limit each pool to exactly eight documents and pass identical inputs to all rankers. Within this setup, BM25 and MMR serve as interpretable reference points for lexical matching and diversity optimization. Across 345 clusters, we find that redundancy patterns vary by model: one LLM implicitly diversifies at larger selection budgets, while another increases redundancy. In contrast, LLMs underperform on lexical coverage at small selection budgets. As a result, LLM rankings diverge substantially from both baselines rather than consistently approximating either strategy. By eliminating retrieval variance, we can attribute these differences directly to the ranking policy. This diagnostic is model-agnostic and applicable to any ranker, including open source systems and proprietary APIs.

2602.18607 2026-02-24 cs.AI

Feedback-based Automated Verification in Vibe Coding of CAS Adaptation Built on Constraint Logic

Michal Töpfer, František Plášil, Tomáš Bureš, Petr Hnětynka

详情
英文摘要

In CAS adaptation, a challenge is to define the dynamic architecture of the system and changes in its behavior. Implementation-wise, this is projected into an adaptation mechanism, typically realized as an Adaptation Manager (AM). With the advances of generative LLMs, generating AM code based on system specification and desired AM behavior (partially in natural language) is a tempting opportunity. The recent introduction of vibe coding suggests a way to target the problem of the correctness of generated code by iterative testing and vibe coding feedback loops instead of direct code inspection. In this paper, we show that generating an AM via vibe coding feedback loops is a viable option when the verification of the generated AM is based on a very precise formulation of the functional requirements. We specify these as constraints in a novel temporal logic FCL that allows us to express the behavior of traces with much finer granularity than classical LTL enables. Furthermore, we show that by combining the adaptation and vibe coding feedback loops where the FCL constraints are evaluated for the current system state, we achieved good results in the experiments with generating AMs for two example systems from the CAS domain. Typically, just a few feedback loop iterations were necessary, each feeding the LLM with reports describing detailed violations of the constraints. This AM testing was combined with high run path coverage achieved by different initial settings.

2602.18603 2026-02-24 cs.RO cs.LG

Enhancing Goal Inference via Correction Timing

Anjiabei Wang, Shuangge Wang, Tesca Fitzgerald

Comments 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS)

详情
英文摘要

Corrections offer a natural modality for people to provide feedback to a robot, by (i) intervening in the robot's behavior when they believe the robot is failing (or will fail) the task objectives and (ii) modifying the robot's behavior to successfully fulfill the task. Each correction offers information on what the robot should and should not do, where the corrected behavior is more aligned with task objectives than the original behavior. Most prior work on learning from corrections involves interpreting a correction as a new demonstration (consisting of the modified robot behavior), or a preference (for the modified trajectory compared to the robot's original behavior). However, this overlooks one essential element of the correction feedback, which is the human's decision to intervene in the robot's behavior in the first place. This decision can be influenced by multiple factors including the robot's task progress, alignment with human expectations, dynamics, motion legibility, and optimality. In this work, we investigate whether the timing of this decision can offer a useful signal for inferring these task-relevant influences. In particular, we investigate three potential applications for this learning signal: (1) identifying features of a robot's motion that may prompt people to correct it, (2) quickly inferring the final goal of a human's correction based on the timing and initial direction of their correction motion, and (3) learning more precise constraints for task objectives. Our results indicate that correction timing results in improved learning for the first two of these applications. Overall, our work provides new insights on the value of correction timing as a signal for robot learning.