arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2365
2602.19449 2026-02-24 cs.CV

Decoupling Vision and Language: Codebook Anchored Visual Adaptation

Jason Wu, Tianchen Zhao, Chang Liu, Jiarui Cai, Zheng Zhang, Zhuowei Li, Aaditya Singh, Xiang Xu, Mani Srivastava, Jonathan Wu

Comments 17 pages, accepted to CVPR2026 main conference

详情
英文摘要

Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.

2602.19444 2026-02-24 cs.LG

PIS: A Physics-Informed System for Accurate State Partitioning of $Aβ_{42}$ Protein Trajectories

Qianfeng Yu, Ningkang Peng, Yanhui Gu

详情
英文摘要

Understanding the conformational evolution of $β$-amyloid ($Aβ$), particularly the $Aβ_{42}$ isoform, is fundamental to elucidating the pathogenic mechanisms underlying Alzheimer's disease. However, existing end-to-end deep learning models often struggle to capture subtle state transitions in protein trajectories due to a lack of explicit physical constraints. In this work, we introduce PIS, a Physics-Informed System designed for robust metastable state partitioning. By integrating pre-computed physical priors, such as the radius of gyration and solvent-accessible surface area, into the extraction of topological features, our model achieves superior performance on the $Aβ_{42}$ dataset. Furthermore, PIS provides an interactive platform that features dynamic monitoring of physical characteristics and multi-dimensional result validation. This system offers biological researchers a powerful set of analytical tools with physically grounded interpretability. A demonstration video of PIS is available on https://youtu.be/AJHGzUtRCg0.

2602.19437 2026-02-24 cs.CV cs.AI

FinSight-Net:A Physics-Aware Decoupled Network with Frequency-Domain Compensation for Underwater Fish Detection in Smart Aquaculture

Jinsong Yang, Zeyuan Hu, Yichen Li, Hong Yu

详情
英文摘要

Underwater fish detection (UFD) is a core capability for smart aquaculture and marine ecological monitoring. While recent detectors improve accuracy by stacking feature extractors or introducing heavy attention modules, they often incur substantial computational overhead and, more importantly, neglect the physics that fundamentally limits UFD: wavelength-dependent absorption and turbidity-induced scattering significantly degrade contrast, blur fine structures, and introduce backscattering noise, leading to unreliable localization and recognition. To address these challenges, we propose FinSight-Net, an efficient and physics-aware detection framework tailored for complex aquaculture environments. FinSight-Net introduces a Multi-Scale Decoupled Dual-Stream Processing (MS-DDSP) bottleneck that explicitly targets frequency-specific information loss via heterogeneous convolutional branches, suppressing backscattering artifacts while compensating distorted biological cues through scale-aware and channel-weighted pathways. We further design an Efficient Path Aggregation FPN (EPA-FPN) as a detail-filling mechanism: it restores high-frequency spatial information typically attenuated in deep layers by establishing long-range skip connections and pruning redundant fusion routes, enabling robust detection of non-rigid fish targets under severe blur and turbidity. Extensive experiments on DeepFish, AquaFishSet, and our challenging UW-BlurredFish benchmark demonstrate that FinSight-Net achieves state-of-the-art performance. In particular, on UW-BlurredFish, FinSight-Net reaches 92.8% mAP, outperforming YOLOv11s by 4.8% while reducing parameters by 29.0%, providing a strong and lightweight solution for real-time automated monitoring in smart aquaculture.

2602.19432 2026-02-24 cs.CV

CountEx: Fine-Grained Counting via Exemplars and Exclusion

Yifeng Huang, Gia Khanh Nguyen, Minh Hoai

详情
英文摘要

This paper presents CountEx, a discriminative visual counting framework designed to address a key limitation of existing prompt-based methods: the inability to explicitly exclude visually similar distractors. While current approaches allow users to specify what to count via inclusion prompts, they often struggle in cluttered scenes with confusable object categories, leading to ambiguity and overcounting. CountEx enables users to express both inclusion and exclusion intent, specifying what to count and what to ignore, through multimodal prompts including natural language descriptions and optional visual exemplars. At the core of CountEx is a novel Discriminative Query Refinement module, which jointly reasons over inclusion and exclusion cues by first identifying shared visual features, then isolating exclusion-specific patterns, and finally applying selective suppression to refine the counting query. To support systematic evaluation of fine-grained counting methods, we introduce CoCount, a benchmark comprising 1,780 videos and 10,086 annotated frames across 97 category pairs. Experiments show that CountEx achieves substantial improvements over state-of-the-art methods for counting objects from both known and novel categories. The data and code are available at https://github.com/bbvisual/CountEx.

2602.19418 2026-02-24 cs.CV

PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention

Hefei Mei, Zirui Wang, Chang Xu, Jianyuan Guo, Minjing Dong

详情
英文摘要

Large Vision-Language Models (LVLMs) are foundational to modern multimodal applications, yet their susceptibility to adversarial attacks remains a critical concern. Prior white-box attacks rarely generalize across tasks, and black-box methods depend on expensive transfer, which limits efficiency. The vision encoder, standardized and often shared across LVLMs, provides a stable gray-box pivot with strong cross-model transfer. Building on this premise, we introduce PA-Attack (Prototype-Anchored Attentive Attack). PA-Attack begins with a prototype-anchored guidance that provides a stable attack direction towards a general and dissimilar prototype, tackling the attribute-restricted issue and limited task generalization of vanilla attacks. Building on this, we propose a two-stage attention enhancement mechanism: (i) leverage token-level attention scores to concentrate perturbations on critical visual tokens, and (ii) adaptively recalibrate attention weights to track the evolving attention during the adversarial process. Extensive experiments across diverse downstream tasks and LVLM architectures show that PA-Attack achieves an average 75.1% score reduction rate (SRR), demonstrating strong attack effectiveness, efficiency, and task generalization in LVLMs. Code is available at https://github.com/hefeimei06/PA-Attack.

2602.19416 2026-02-24 cs.AI cs.LG

IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking

Mohammad Beigi, Ming Jin, Junshan Zhang, Jiaxin Zhang, Qifan Wang, Lifu Huang

详情
英文摘要

Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking - models exploit spurious correlations in proxy rewards without genuine alignment. Compounding this, the objectives internalized during RLHF remain opaque, making hacking behaviors difficult to detect or correct. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We propose Contrastive Inverse Reinforcement Learning (C-IRL), which reconstructs the implicit reward function by contrasting paired responses from post-alignment and baseline policies to explain behavioral shifts during RLHF. We then decompose the reconstructed reward via sparse autoencoders into interpretable features, enabling identification of hacking signatures through contribution analysis. Finally, we propose mitigation strategies - clean reward optimization, adversarial shaping, constrained optimization, and feature-guided distillation - that target problematic features while preserving beneficial alignment. Experiments across multiple reward model configurations show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.

2602.19412 2026-02-24 cs.CV cs.AI

Redefining the Down-Sampling Scheme of U-Net for Precision Biomedical Image Segmentation

Mingjie Li, Yizheng Chen, Md Tauhidul Islam, Lei Xing

Comments AAPM 67th

详情
英文摘要

U-Net architectures have been instrumental in advancing biomedical image segmentation (BIS) but often struggle with capturing long-range information. One reason is the conventional down-sampling techniques that prioritize computational efficiency at the expense of information retention. This paper introduces a simple but effective strategy, we call it Stair Pooling, which moderates the pace of down-sampling and reduces information loss by leveraging a sequence of concatenated small and narrow pooling operations in varied orientations. Specifically, our method modifies the reduction in dimensionality within each 2D pooling step from $\frac{1}{4}$ to $\frac{1}{2}$. This approach can also be adapted for 3D pooling to preserve even more information. Such preservation aids the U-Net in more effectively reconstructing spatial details during the up-sampling phase, thereby enhancing its ability to capture long-range information and improving segmentation accuracy. Extensive experiments on three BIS benchmarks demonstrate that the proposed Stair Pooling can increase both 2D and 3D U-Net performance by an average of 3.8\% in Dice scores. Moreover, we leverage the transfer entropy to select the optimal down-sampling paths and quantitatively show how the proposed Stair Pooling reduces the information loss.

2602.19409 2026-02-24 cs.SD

AuditoryHuM: Auditory Scene Label Generation and Clustering using Human-MLLM Collaboration

Henry Zhong, Jörg M. Buchholz, Julian Maclaren, Simon Carlile, Richard F. Lyon

详情
英文摘要

Manual annotation of audio datasets is labour intensive, and it is challenging to balance label granularity with acoustic separability. We introduce AuditoryHuM, a novel framework for the unsupervised discovery and clustering of auditory scene labels using a collaborative Human-Multimodal Large Language Model (MLLM) approach. By leveraging MLLMs (Gemma and Qwen) the framework generates contextually relevant labels for audio data. To ensure label quality and mitigate hallucinations, we employ zero-shot learning techniques (Human-CLAP) to quantify the alignment between generated text labels and raw audio content. A strategically targeted human-in-the-loop intervention is then used to refine the least aligned pairs. The discovered labels are grouped into thematically cohesive clusters using an adjusted silhouette score that incorporates a penalty parameter to balance cluster cohesion and thematic granularity. Evaluated across three diverse auditory scene datasets (ADVANCE, AHEAD-DS, and TAU 2019), AuditoryHuM provides a scalable, low-cost solution for creating standardised taxonomies. This solution facilitates the training of lightweight scene recognition models deployable to edge devices, such as hearing aids and smart home assistants. The project page and code: https://github.com/Australian-Future-Hearing-Initiative

2602.19406 2026-02-24 cs.LG math.OC

LEVDA: Latent Ensemble Variational Data Assimilation via Differentiable Dynamics

Phillip Si, Peng Chen

详情
英文摘要

Long-range geophysical forecasts are fundamentally limited by chaotic dynamics and numerical errors. While data assimilation can mitigate these issues, classical variational smoothers require computationally expensive tangent-linear and adjoint models. Conversely, recent efficient latent filtering methods often enforce weak trajectory-level constraints and assume fixed observation grids. To bridge this gap, we propose Latent Ensemble Variational Data Assimilation (LEVDA), an ensemble-space variational smoother that operates in the low-dimensional latent space of a pretrained differentiable neural dynamics surrogate. By performing four-dimensional ensemble-variational (4DEnVar) optimization within an ensemble subspace, LEVDA jointly assimilates states and unknown parameters without the need for adjoint code or auxiliary observation-to-latent encoders. Leveraging the fully differentiable, continuous-in-time-and-space nature of the surrogate, LEVDA naturally accommodates highly irregular sampling at arbitrary spatiotemporal locations. Across three challenging geophysical benchmarks, LEVDA matches or outperforms state-of-the-art latent filtering baselines under severe observational sparsity while providing more reliable uncertainty quantification. Simultaneously, it achieves substantially improved assimilation accuracy and computational efficiency compared to full-state 4DEnVar.

2602.19404 2026-02-24 cs.LG cs.AI

One Size Fits None: Modeling NYC Taxi Trips

Tomas Eglinskas

Comments 13 pages, 10 figures

详情
英文摘要

The rise of app-based ride-sharing has fundamentally changed tipping culture in New York City. We analyzed 280 million trips from 2024 to see if we could predict tips for traditional taxis versus high-volume for-hire services. By testing methods from linear regression to deep neural networks, we found two very different outcomes. Traditional taxis are highly predictable ($R^2 \approx 0.72$) due to the in-car payment screen. In contrast, app-based tipping is random and hard to model ($R^2 \approx 0.17$). In conclusion, we show that building one universal model is a mistake and, due to Simpson's paradox, a combined model looks accurate on average but fails to predict tips for individual taxi categories requiring specialized models.

2602.19403 2026-02-24 cs.CL stat.AP

Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins

Jasmin Han, Janardan Devkota, Joseph Waring, Amanda Luken, Felix Naughton, Roger Vilardaga, Jonathan Bricker, Carl Latkin, Meghan Moran, Yiqun Chen, Johannes Thrul

Comments 31 pages, 5 figures, submitted to Journal of the American Medical Informatics Association (JAMIA). Drs. Chen and Thrul share last authorship

详情
英文摘要

Perceived message effectiveness (PME) by potential intervention end-users is important for selecting and optimizing personalized smoking cessation intervention messages for mobile health (mHealth) platform delivery. This study evaluates whether large language models (LLMs) can accurately predict PME for smoking cessation messages. We evaluated multiple models for predicting PME across three domains: content quality, coping support, and quitting support. The dataset comprised 3010 message ratings (5-point Likert scale) from 301 young adult smokers. We compared (1) supervised learning models trained on labeled data, (2) zero and few-shot LLMs prompted without task-specific fine-tuning, and (3) LLM-based digital twins that incorporate individual characteristics and prior PME histories to generate personalized predictions. Model performance was assessed on three held-out messages per participant using accuracy, Cohen's kappa, and F1. LLM-based digital twins outperformed zero and few-shot LLMs (12 percentage points on average) and supervised baselines (13 percentage points), achieving accuracies of 0.49 (content), 0.45 (coping), and 0.49 (quitting), with directional accuracies of 0.75, 0.66, and 0.70 on a simplified 3-point scale. Digital twin predictions showed greater dispersion across rating categories, indicating improved sensitivity to individual differences. Integrating personal profiles with LLMs captures person-specific differences in PME and outperforms supervised and zero and few-shot approaches. Improved PME prediction may enable more tailored intervention content in mHealth. LLM-based digital twins show potential for supporting personalization of mobile smoking cessation and other health behavior change interventions.

2602.19400 2026-02-24 cs.RO cs.AI cs.MA

Hilbert-Augmented Reinforcement Learning for Scalable Multi-Robot Coverage and Exploration

Tamil Selvan Gurunathan, Aryya Gangopadhyay

详情
英文摘要

We present a coverage framework that integrates Hilbert space-filling priors into decentralized multi-robot learning and execution. We augment DQN and PPO with Hilbert-based spatial indices to structure exploration and reduce redundancy in sparse-reward environments, and we evaluate scalability in multi-robot grid coverage. We further describe a waypoint interface that converts Hilbert orderings into curvature-bounded, time-parameterized SE(2) trajectories (planar (x, y, θ)), enabling onboard feasibility on resource-constrained robots. Experiments show improvements in coverage efficiency, redundancy, and convergence speed over DQN/PPO baselines. In addition, we validate the approach on a Boston Dynamics Spot legged robot, executing the generated trajectories in indoor environments and observing reliable coverage with low redundancy. These results indicate that geometric priors improve autonomy and scalability for swarm and legged robotics.

2602.19396 2026-02-24 cs.AI

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Amirhossein Farzam, Majid Behabahani, Mani Malek, Yuriy Nevmyvaka, Guillermo Sapiro

详情
英文摘要

Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when an attacker tries to hide the malicious goal of their request by manipulating its framing to induce compliance. Because these attacks maintain malicious intent through a flexible presentation, defenses that rely on structural artifacts or goal-specific signatures can fail. Motivated by this, we introduce a self-supervised framework for disentangling semantic factor pairs in LLM activations at inference. We instantiate the framework for goal and framing and construct GoalFrameBench, a corpus of prompts with controlled goal and framing variations, which we use to train Representation Disentanglement on Activations (ReDAct) module to extract disentangled representations in a frozen LLM. We then propose FrameShield, an anomaly detector operating on the framing representations, which improves model-agnostic detection across multiple LLM families with minimal computational overhead. Theoretical guarantees for ReDAct and extensive empirical validations show that its disentanglement effectively powers FrameShield. Finally, we use disentanglement as an interpretability probe, revealing distinct profiles for goal and framing signals and positioning semantic disentanglement as a building block for both LLM safety and mechanistic interpretability.

2602.19395 2026-02-24 cs.SD eess.SP

DECAF: Dynamic Envelope Context-Aware Fusion for Speech-Envelope Reconstruction from EEG

Karan Thakkar, Mounya Elhilali

Comments Accepted at ICASSP 2026

详情
英文摘要

Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications like neuro-steered hearing aids. Current methods for this reconstruction, however, face challenges with fidelity and noise. Prevailing approaches treat it as a static regression problem, processing each EEG window in isolation and ignoring the rich temporal structure inherent in continuous speech. This study introduces a new, dynamic framework for envelope reconstruction that leverages this structure as a predictive temporal prior. We propose a state-space fusion model that combines direct neural estimates from EEG with predictions from recent speech context, using a learned gating mechanism to adaptively balance these cues. To validate this approach, we evaluate our model on the ICASSP 2023 Stimulus Reconstruction benchmark demonstrating significant improvements over static, EEG-only baselines. Our analyses reveal a powerful synergy between the neural and temporal information streams. Ultimately, this work reframes envelope reconstruction not as a simple mapping, but as a dynamic state-estimation problem, opening a new direction for developing more accurate and coherent neural decoding systems.

2602.19392 2026-02-24 cs.LG cs.SI

Spiking Graph Predictive Coding for Reliable OOD Generalization

Jing Ren, Jiapeng Du, Bowen Li, Ziqi Xu, Xin Zheng, Hong Jia, Suyu Ma, Xiwei Xu, Feng Xia

Comments 12 pages, 6 figures, WWW26, Dubai, United Arab Emirates

详情
英文摘要

Graphs provide a powerful basis for modeling Web-based relational data, with expressive GNNs to support the effective learning in dynamic web environments. However, real-world deployment is hindered by pervasive out-of-distribution (OOD) shifts, where evolving user activity and changing content semantics alter feature distributions and labeling criteria. These shifts often lead to unstable or overconfident predictions, undermining the trustworthiness required for Web4Good applications. Achieving reliable OOD generalization demands principled and interpretable uncertainty estimation; however, existing methods are largely post-hoc, insensitive to distribution shifts, and unable to explain where uncertainty arises especially in high-stakes settings. To address these limitations, we introduce SpIking GrapH predicTive coding (SIGHT), an uncertainty-aware plug-in graph learning module for reliable OOD Generalization. SIGHT performs iterative, error-driven correction over spiking graph states, enabling models to expose internal mismatch signals that reveal where predictions become unreliable. Across multiple graph benchmarks and diverse OOD scenarios, SIGHT consistently enhances predictive accuracy, uncertainty estimation, and interpretability when integrated with GNNs.

2602.19390 2026-02-24 cs.AI

Artificial Intelligence for Modeling & Simulation in Digital Twins

Philipp Zech, Istvan David

详情
英文摘要

The convergence of modeling & simulation (M&S) and artificial intelligence (AI) is leaving its marks on advanced digital technology. Pertinent examples are digital twins (DTs) - high-fidelity, live representations of physical assets, and frequent enablers of corporate digital maturation and transformation. Often seen as technological platforms that integrate an array of services, DTs have the potential to bring AI-enabled M&S closer to end-users. It is, therefore, paramount to understand the role of M&S in DTs, and the role of digital twins in enabling the convergence of AI and M&S. To this end, this chapter provides a comprehensive exploration of the complementary relationship between these three. We begin by establishing a foundational understanding of DTs by detailing their key components, architectural layers, and their various roles across business, development, and operations. We then examine the central role of M&S in DTs and provide an overview of key modeling techniques from physics-based and discrete-event simulation to hybrid approaches. Subsequently, we investigate the bidirectional role of AI: first, how AI enhances DTs through advanced analytics, predictive capabilities, and autonomous decision-making, and second, how DTs serve as valuable platforms for training, validating, and deploying AI models. The chapter concludes by identifying key challenges and future research directions for creating more integrated and intelligent systems.

2602.19380 2026-02-24 cs.CV

Detector-in-the-Loop Tracking: Active Memory Rectification for Stable Glottic Opening Localization

Huayu Wang, Bahaa Alattar, Cheng-Yen Yang, Hsiang-Wei Huang, Jung Heon Kim, Linda Shapiro, Nathan White, Jenq-Neng Hwang

Comments Accepted to Medical Imaging with Deep Learning (MIDL) 2026

详情
英文摘要

Temporal stability in glottic opening localization remains challenging due to the complementary weaknesses of single-frame detectors and foundation-model trackers: the former lacks temporal context, while the latter suffers from memory drift. Specifically, in video laryngoscopy, rapid tissue deformation, occlusions, and visual ambiguities in emergency settings require a robust, temporally aware solution that can prevent progressive tracking errors. We propose Closed-Loop Memory Correction (CL-MC), a detector-in-the-loop framework that supervises Segment Anything Model 2(SAM2) through confidence-aligned state decisions and active memory rectification. High-confidence detections trigger semantic resets that overwrite corrupted tracker memory, effectively mitigating drift accumulation with a training-free foundation tracker in complex endoscopic scenes. On emergency intubation videos, CL-MC achieves state-of-the-art performance, significantly reducing drift and missing rate compared with the SAM2 variants and open loop based methods. Our results establish memory correction as a crucial component for reliable clinical video tracking. Our code will be available in https://github.com/huayuww/CL-MR.

2602.19372 2026-02-24 cs.RO cs.CV cs.LG

Seeing Farther and Smarter: Value-Guided Multi-Path Reflection for VLM Policy Optimization

Yanting Yang, Shenyuan Gao, Qingwen Bu, Li Chen, Dimitris N. Metaxas

Comments ICRA 2026

详情
英文摘要

Solving complex, long-horizon robotic manipulation tasks requires a deep understanding of physical interactions, reasoning about their long-term consequences, and precise high-level planning. Vision-Language Models (VLMs) offer a general perceive-reason-act framework for this goal. However, previous approaches using reflective planning to guide VLMs in correcting actions encounter significant limitations. These methods rely on inefficient and often inaccurate implicit learning of state-values from noisy foresight predictions, evaluate only a single greedy future, and suffer from substantial inference latency. To address these limitations, we propose a novel test-time computation framework that decouples state evaluation from action generation. This provides a more direct and fine-grained supervisory signal for robust decision-making. Our method explicitly models the advantage of an action plan, quantified by its reduction in distance to the goal, and uses a scalable critic to estimate. To address the stochastic nature of single-trajectory evaluation, we employ beam search to explore multiple future paths and aggregate them during decoding to model their expected long-term returns, leading to more robust action generation. Additionally, we introduce a lightweight, confidence-based trigger that allows for early exit when direct predictions are reliable, invoking reflection only when necessary. Extensive experiments on diverse, unseen multi-stage robotic manipulation tasks demonstrate a 24.6% improvement in success rate over state-of-the-art baselines, while significantly reducing inference time by 56.5%.

2602.19367 2026-02-24 cs.AI cs.CV

Time Series, Vision, and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces

Pratham Yashwante, Rose Yu

Comments 24 Figures, 12 Tables

详情
英文摘要

The Platonic Representation Hypothesis posits that learned representations from models trained on different modalities converge to a shared latent structure of the world. However, this hypothesis has largely been examined in vision and language, and it remains unclear whether time series participate in such convergence. We first examine this in a trimodal setting and find that independently pretrained time series, vision, and language encoders exhibit near-orthogonal geometry in the absence of explicit coupling. We then apply post-hoc alignment by training projection heads over frozen encoders using contrastive learning, and analyze the resulting representations with respect to geometry, scaling behavior, and dependence on information density and input modality characteristics. Our investigation reveals that overall alignment in contrastive representation spaces improves with model size, but this alignment is asymmetric: time series align more strongly with visual representations than with text, and images can act as effective intermediaries between time series and language. We further see that richer textual descriptions improve alignment only up to a threshold; training on denser captions does not lead to further improvement. Analogous effects are observed for visual representations. Our findings shed light on considerations for building multimodal systems involving non-conventional data modalities beyond vision and language.

2602.19359 2026-02-24 cs.RO cs.LG

Vid2Sid: Videos Can Help Close the Sim2Real Gap

Kevin Qiu, Yu Zhang, Marek Cygan, Josie Hughes

详情
英文摘要

Calibrating a robot simulator's physics parameters (friction, damping, material stiffness) to match real hardware is often done by hand or with black-box optimizers that reduce error but cannot explain which physical discrepancies drive the error. When sensing is limited to external cameras, the problem is further compounded by perception noise and the absence of direct force or state measurements. We present Vid2Sid, a video-driven system identification pipeline that couples foundation-model perception with a VLM-in-the-loop optimizer that analyzes paired sim-real videos, diagnoses concrete mismatches, and proposes physics parameter updates with natural language rationales. We evaluate our approach on a tendon-actuated finger (rigid-body dynamics in MuJoCo) and a deformable continuum tentacle (soft-body dynamics in PyElastica). On sim2real holdout controls unseen during training, Vid2Sid achieves the best average rank across all settings, matching or exceeding black-box optimizers while uniquely providing interpretable reasoning at each iteration. Sim2sim validation confirms that Vid2Sid recovers ground-truth parameters most accurately (mean relative error under 13\% vs. 28--98\%), and ablation analysis reveals three calibration regimes. VLM-guided optimization excels when perception is clean and the simulator is expressive, while model-class limitations bound performance in more challenging settings.

2602.19358 2026-02-24 cs.CV

Referring Layer Decomposition

Fangyi Chen, Yaojie Shen, Lu Xu, Ye Yuan, Shu Zhang, Yulei Niu, Longyin Wen

Comments ICLR 2026

详情
英文摘要

Precise, object-aware control over visual content is essential for advanced image editing and compositional generation. Yet, most existing approaches operate on entire images holistically, limiting the ability to isolate and manipulate individual scene elements. In contrast, layered representations, where scenes are explicitly separated into objects, environmental context, and visual effects, provide a more intuitive and structured framework for interpreting and editing visual content. To bridge this gap and enable both compositional understanding and controllable editing, we introduce the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image, conditioned on flexible user prompts, such as spatial inputs (e.g., points, boxes, masks), natural language descriptions, or combinations thereof. At the core is the RefLade, a large-scale dataset comprising 1.11M image-layer-prompt triplets produced by our scalable data engine, along with 100K manually curated, high-fidelity layers. Coupled with a perceptually grounded, human-preference-aligned automatic evaluation protocol, RefLade establishes RLD as a well-defined and benchmarkable research task. Building on this foundation, we present RefLayer, a simple baseline designed for prompt-conditioned layer decomposition, achieving high visual fidelity and semantic alignment. Extensive experiments show our approach enables effective training, reliable evaluation, and high-quality image decomposition, while exhibiting strong zero-shot generalization capabilities.

2602.19357 2026-02-24 cs.CV cs.LG

MentalBlackboard: Evaluating Spatial Visualization via Mathematical Transformations

Nilay Yilmaz, Maitreya Patel, Naga Sai Abhiram Kusumba, Yixuan He, Yezhou Yang

详情
英文摘要

Spatial visualization is the mental ability to imagine, transform, and manipulate the spatial characteristics of objects and actions. This intelligence is a part of human cognition where actions and perception are connected on a mental level. To explore whether state-of-the-art Vision-Language Models (VLMs) exhibit this ability, we develop MentalBlackboard, an open-ended spatial visualization benchmark for Paper Folding and Hole Punching tests within two core tasks: prediction and planning. Our prediction experiments reveal that models struggle with applying symmetrical transformations, even when they predict the sequence of unfolding steps correctly. Also, rotations introduce a significant challenge to the physical situational awareness for models. The planning task reveals limitations of models in analyzing symmetrical relationships and in implementing the multi-stage symmetry process, with Claude Opus 4.1 achieving the highest planning score at an accuracy of 10\%. The top-performing model, o3, attains a peak performance of 71.6\% on the generalization task, which does not require spatial visualization but transfers spatial data; however, it achieves only 25\% accuracy on text-based prediction tasks.

2602.19355 2026-02-24 cs.LG cs.AI

Active perception and disentangled representations allow continual, episodic zero and few-shot learning

David Rawlinson, Gideon Kowadlo

Comments 17 pages; 7 figures

详情
英文摘要

Generalization is often regarded as an essential property of machine learning systems. However, perhaps not every component of a system needs to generalize. Training models for generalization typically produces entangled representations at the boundaries of entities or classes, which can lead to destructive interference when rapid, high-magnitude updates are required for continual or few-shot learning. Techniques for fast learning with non-interfering representations exist, but they generally fail to generalize. Here, we describe a Complementary Learning System (CLS) in which the fast learner entirely foregoes generalization in exchange for continual zero-shot and few-shot learning. Unlike most CLS approaches, which use episodic memory primarily for replay and consolidation, our fast, disentangled learner operates as a parallel reasoning system. The fast learner can overcome observation variability and uncertainty by leveraging a conventional slow, statistical learner within an active perception system: A contextual bias provided by the fast learner induces the slow learner to encode novel stimuli in familiar, generalized terms, enabling zero-shot and few-shot learning. This architecture demonstrates that fast, context-driven reasoning can coexist with slow, structured generalization, providing a pathway for robust continual learning.

2602.19350 2026-02-24 cs.CV

PoseCraft: Tokenized 3D Body Landmark and Camera Conditioning for Photorealistic Human Image Synthesis

Zhilin Guo, Jing Yang, Kyle Fogarty, Jingyi Wan, Boqiao Zhang, Tianhao Wu, Weihao Xia, Chenliang Zhou, Sakar Khattar, Fangcheng Zhong, Cristina Nader Vasconcelos, Cengiz Oztireli

详情
英文摘要

Digitizing humans and synthesizing photorealistic avatars with explicit 3D pose and camera controls are central to VR, telepresence, and entertainment. Existing skinning-based workflows require laborious manual rigging or template-based fittings, while neural volumetric methods rely on canonical templates and re-optimization for each unseen pose. We present PoseCraft, a diffusion framework built around tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, we encode sparse 3D landmarks and camera extrinsics as discrete conditioning tokens and inject them into diffusion via cross-attention. Our approach preserves 3D semantics by avoiding 2D re-projection ambiguity under large pose and viewpoint changes, and produces photorealistic imagery that faithfully captures identity and appearance. To train and evaluate at scale, we also implement GenHumanRF, a data generation workflow that renders diverse supervision from volumetric reconstructions. Our experiments show that PoseCraft achieves significant perceptual quality improvement over diffusion-centric methods, and attains better or comparable metrics to latest volumetric rendering SOTA while better preserving fabric and hair details.

2602.19349 2026-02-24 cs.CV cs.AI

UP-Fuse: Uncertainty-guided LiDAR-Camera Fusion for 3D Panoptic Segmentation

Rohit Mohan, Florian Drews, Yakov Miron, Daniele Cattaneo, Abhinav Valada

详情
英文摘要

LiDAR-camera fusion enhances 3D panoptic segmentation by leveraging camera images to complement sparse LiDAR scans, but it also introduces a critical failure mode. Under adverse conditions, degradation or failure of the camera sensor can significantly compromise the reliability of the perception system. To address this problem, we introduce UP-Fuse, a novel uncertainty-aware fusion framework in the 2D range-view that remains robust under camera sensor degradation, calibration drift, and sensor failure. Raw LiDAR data is first projected into the range-view and encoded by a LiDAR encoder, while camera features are simultaneously extracted and projected into the same shared space. At its core, UP-Fuse employs an uncertainty-guided fusion module that dynamically modulates cross-modal interaction using predicted uncertainty maps. These maps are learned by quantifying representational divergence under diverse visual degradations, ensuring that only reliable visual cues influence the fused representation. The fused range-view features are decoded by a novel hybrid 2D-3D transformer that mitigates spatial ambiguities inherent to the 2D projection and directly predicts 3D panoptic segmentation masks. Extensive experiments on Panoptic nuScenes, SemanticKITTI, and our introduced Panoptic Waymo benchmark demonstrate the efficacy and robustness of UP-Fuse, which maintains strong performance even under severe visual corruption or misalignment, making it well suited for robotic perception in safety-critical settings.

2602.19348 2026-02-24 cs.CV cs.AI

MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose

Sirine Bhouri, Lan Wei, Jian-Qing Zheng, Dandan Zhang

Comments Accepted by 2026 ICRA

详情
英文摘要

Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.

2602.19346 2026-02-24 cs.RO cs.SY eess.SY

Design and Control of Modular Magnetic Millirobots for Multimodal Locomotion and Shape Reconfiguration

Erik Garcia Oyono, Jialin Lin, Dandan Zhang

Comments Accepted by 2026 ICRA

详情
英文摘要

Modular small-scale robots offer the potential for on-demand assembly and disassembly, enabling task-specific adaptation in dynamic and constrained environments. However, existing modular magnetic platforms often depend on workspace collisions for reconfiguration, employ bulky three-dimensional electromagnetic systems, and lack robust single-module control, which limits their applicability in biomedical settings. In this work, we present a modular magnetic millirobotic platform comprising three cube-shaped modules with embedded permanent magnets, each designed for a distinct functional role: a free module that supports self-assembly and reconfiguration, a fixed module that enables flip-and-walk locomotion, and a gripper module for cargo manipulation. Locomotion and reconfiguration are actuated by programmable combinations of time-varying two-dimensional uniform and gradient magnetic field inputs. Experiments demonstrate closed-loop navigation using real-time vision feedback and A* path planning, establishing robust single-module control capabilities. Beyond locomotion, the system achieves self-assembly, multimodal transformations, and disassembly at low field strengths. Chain-to-gripper transformations succeeded in 90% of trials, while chain-to-square transformations were less consistent, underscoring the role of module geometry in reconfiguration reliability. These results establish a versatile modular robotic platform capable of multimodal behavior and robust control, suggesting a promising pathway toward scalable and adaptive task execution in confined environments.

2602.19332 2026-02-24 cs.LG

Training-Free Cross-Architecture Merging for Graph Neural Networks

Rishabh Bhattacharya, Vikaskumar Kalsariya, Naresh Manwani

详情
英文摘要

Model merging has emerged as a powerful paradigm for combining the capabilities of distinct expert models without the high computational cost of retraining, yet current methods are fundamentally constrained to homogeneous architectures. For GNNs, however, message passing is topology-dependent and sensitive to misalignment, making direct parameter-space merging unreliable. To bridge this gap, we introduce H-GRAMA (Heterogeneous Graph Routing and Message Alignment), a training-free framework that lifts merging from parameter space to operator space. We formalize Universal Message Passing Mixture (UMPM), a shared operator family that expresses heterogeneous GNN layers in a common functional language. H-GRAMA enables cross-architecture GNN merging (e.g., GCN to GAT) without retraining, retaining high specialist accuracy in most cases in compatible depth settings and achieving inference speedups of 1.2x to 1.9x over ensembles.

2602.19331 2026-02-24 cs.LG cs.NE stat.ML

Partial Soft-Matching Distance for Neural Representational Comparison with Partial Unit Correspondence

Chaitanya Kapoor, Alex H. Williams, Meenakshi Khosla

详情
英文摘要

Representational similarity metrics typically force all units to be matched, making them susceptible to noise and outliers common in neural representations. We extend the soft-matching distance to a partial optimal transport setting that allows some neurons to remain unmatched, yielding rotation-sensitive but robust correspondences. This partial soft-matching distance provides theoretical advantages -- relaxing strict mass conservation while maintaining interpretable transport costs -- and practical benefits through efficient neuron ranking in terms of cross-network alignment without costly iterative recomputation. In simulations, it preserves correct matches under outliers and reliably selects the correct model in noise-corrupted identification tasks. On fMRI data, it automatically excludes low-reliability voxels and produces voxel rankings by alignment quality that closely match computationally expensive brute-force approaches. It achieves higher alignment precision across homologous brain areas than standard soft-matching, which is forced to match all units regardless of quality. In deep networks, highly matched units exhibit similar maximally exciting images, while unmatched units show divergent patterns. This ability to partition by match quality enables focused analyses, e.g., testing whether networks have privileged axes even within their most aligned subpopulations. Overall, partial soft-matching provides a principled and practical method for representational comparison under partial correspondence.

2602.19324 2026-02-24 cs.CV cs.AI

RetinaVision: XAI-Driven Augmented Regulation for Precise Retinal Disease Classification using deep learning framework

Mohammad Tahmid Noor, Shayan Abrar, Jannatul Adan Mahi, Md Parvez Mia, Asaduzzaman Hridoy, Samanta Ghosh

Comments 6 pages, 15 figures

详情
英文摘要

Early and accurate classification of retinal diseases is critical to counter vision loss and for guiding clinical management of retinal diseases. In this study, we proposed a deep learning method for retinal disease classification utilizing optical coherence tomography (OCT) images from the Retinal OCT Image Classification - C8 dataset (comprising 24,000 labeled images spanning eight conditions). Images were resized to 224x224 px and tested on convolutional neural network (CNN) architectures: Xception and InceptionV3. Data augmentation techniques (CutMix, MixUp) were employed to enhance model generalization. Additionally, we applied GradCAM and LIME for interpretability evaluation. We implemented this in a real-world scenario via our web application named RetinaVision. This study found that Xception was the most accurate network (95.25%), followed closely by InceptionV3 (94.82%). These results suggest that deep learning methods allow effective OCT retinal disease classification and highlight the importance of implementing accuracy and interpretability for clinical applications.