arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2977
2603.26816 2026-03-31 cs.LG cs.AI

PiCSRL: Physics-Informed Contextual Spectral Reinforcement Learning

Mitra Nasr Azadani, Syed Usama Imtiaz, Nasrin Alamdari

Comments Accepted to IGARSS 2026

详情
英文摘要

High-dimensional low-sample-size (HDLSS) datasets constrain reliable environmental model development, where labeled data remain sparse. Reinforcement learning (RL)-based adaptive sensing methods can learn optimal sampling policies, yet their application is severely limited in HDLSS contexts. In this work, we present PiCSRL (Physics-Informed Contextual Spectral Reinforcement Learning), where embeddings are designed using domain knowledge and parsed directly into the RL state representation for improved adaptive sensing. We developed an uncertainty-aware belief model that encodes physics-informed features to improve prediction. As a representative example, we evaluated our approach for cyanobacterial gene concentration adaptive sampling task using NASA PACE hyperspectral imagery over Lake Erie. PiCSRL achieves optimal station selection (RMSE = 0.153, 98.4% bloom detection rate, outperforming random (0.296) and UCB (0.178) RMSE baselines, respectively. Our ablation experiments demonstrate that physics-informed features improve test generalization (0.52 R^2, +0.11 over raw bands) in semi-supervised learning. In addition, our scalability test shows that PiCSRL scales effectively to large networks (50 stations, >2M combinations) with significant improvements over baselines (p = 0.002). We posit PiCSRL as a sample-efficient adaptive sensing method across Earth observation domains for improved observation-to-target mapping.

2603.26814 2026-03-31 cs.CV cs.RO

arg-VU: Affordance Reasoning with Physics-Aware 3D Geometry for Visual Understanding in Robotic Surgery

Nan Xiao, Yunxin Fan, Farong Wang, Fei Liu

详情
英文摘要

Affordance reasoning provides a principled link between perception and action, yet remains underexplored in surgical robotics, where tissues are highly deformable, compliant, and dynamically coupled with tool motion. We present arg-VU, a physics-aware affordance reasoning framework that integrates temporally consistent geometry tracking with constraint-induced mechanical modeling for surgical visual understanding. Surgical scenes are reconstructed using 3D Gaussian Splatting (3DGS) and converted into a temporally tracked surface representation. Extended Position-Based Dynamics (XPBD) embeds local deformation constraints and produces representative geometry points (RGPs) whose constraint sensitivities define anisotropic stiffness metrics capturing the local constraint-manifold geometry. Robotic tool poses in SE(3) are incorporated to compute rigidly induced displacements at RGPs, from which we derive two complementary measures: a physics-aware compliance energy that evaluates mechanical feasibility with respect to local deformation constraints, and a positional agreement score that captures motion alignment (as kinematic motion baseline). Experiments on surgical video datasets show that arg-VU yields more stable, physically consistent, and interpretable affordance predictions than kinematic baselines. These results demonstrate that physics-aware geometric representations enable reliable affordance reasoning for deformable surgical environments and support embodied robotic interaction.

2603.26811 2026-03-31 cs.CV cs.AI cs.LG q-bio.NC

Implicit neural representations for larval zebrafish brain microscopy: a reproducible benchmark on the MapZebrain atlas

Agnieszka Pregowska

详情
英文摘要

Implicit neural representations (INRs) offer continuous coordinate-based encodings for atlas registration, cross-modality resampling, sparse-view completion, and compact sharing of neuroanatomical data. Yet reproducible evaluation is lacking for high-resolution larval zebrafish microscopy, where preserving neuropil boundaries and fine neuronal processes is critical. We present a reproducible INR benchmark for the MapZebrain larval zebrafish brain atlas. Using a unified, seed-controlled protocol, we compare SIREN, Fourier features, Haar positional encoding, and a multi-resolution grid on 950 grayscale microscopy images, including atlas slices and single-neuron projections. Images are normalized with per-image (1,99) percentiles estimated from 10% of pixels in non-held-out columns, and spatial generalization is tested with a deterministic 40% column-wise hold-out along the X-axis. Haar and Fourier achieve the strongest macro-averaged reconstruction fidelity on held-out columns (about 26 dB), while the grid is moderately behind. SIREN performs worse in macro averages but remains competitive on area-weighted micro averages in the all-in-one regime. SSIM and edge-focused error further show that Haar and Fourier preserve boundaries more accurately. These results indicate that explicit spectral and multiscale encodings better capture high-frequency neuroanatomical detail than smoother-bias alternatives. For MapZebrain workflows, Haar and Fourier are best suited to boundary-sensitive tasks such as atlas registration, label transfer, and morphology-preserving sharing, while SIREN remains a lightweight baseline for background modelling or denoising.

2603.26810 2026-03-31 cs.CV eess.IV

Unblur-SLAM: Dense Neural SLAM for Blurry Inputs

Qi Zhang, Denis Rozumny, Francesco Girlanda, Sezer Karaoglu, Marc Pollefeys, Theo Gevers, Martin R. Oswald

Comments 14 pages, 9 figures (based on the document's total length and the final Figure 9 ). Accepted By CVPR 2026

详情
英文摘要

We propose Unblur-SLAM, a novel RGB SLAM pipeline for sharp 3D reconstruction from blurred image inputs. In contrast to previous work, our approach is able to handle different types of blur and demonstrates state-of-the-art performance in the presence of both motion blur and defocus blur. Moreover, we adjust the computation effort with the amount of blur in the input image. As a first stage, our method uses a feed-forward image deblurring model for which we propose a suitable training scheme that can improve both tracking and mapping modules. Frames that are successfully deblurred by the feed-forward network obtain refined poses and depth through local-global multi-view optimization and loop closure. Frames that fail the first stage deblurring are directly modeled through the global 3DGS representation and an additional blur network to model multiple blurred sub-frames and simulate the blur formation process in 3D space, thereby learning sharp details and refined sub-frame poses. Experiments on several real-world datasets demonstrate consistent improvements in both pose estimation and sharp reconstruction results of geometry and texture.

2603.26804 2026-03-31 cs.CV cs.AI

The Language of Touch: Translating Vibrations into Text with Dual-Branch Learning

Jin Chen, Yifeng Lin, Chao Zeng, Si Wu, Tiesong Zhao

Comments 9 pages, 6 figures

详情
英文摘要

The standardization of vibrotactile data by IEEE P1918.1 workgroup has greatly advanced its applications in virtual reality, human-computer interaction and embodied artificial intelligence. Despite these efforts, the semantic interpretation and understanding of vibrotactile signals remain an unresolved challenge. In this paper, we make the first attempt to address vibrotactile captioning, {\it i.e.}, generating natural language descriptions from vibrotactile signals. We propose Vibrotactile Periodic-Aperiodic Captioning (ViPAC), a method designed to handle the intrinsic properties of vibrotactile data, including hybrid periodic-aperiodic structures and the lack of spatial semantics. Specifically, ViPAC employs a dual-branch strategy to disentangle periodic and aperiodic components, combined with a dynamic fusion mechanism that adaptively integrates signal features. It also introduces an orthogonality constraint and weighting regularization to ensure feature complementarity and fusion consistency. Additionally, we construct LMT108-CAP, the first vibrotactile-text paired dataset, using GPT-4o to generate five constrained captions per surface image from the popular LMT-108 dataset. Experiments show that ViPAC significantly outperforms the baseline methods adapted from audio and image captioning, achieving superior lexical fidelity and semantic alignment.

2603.26803 2026-03-31 cs.LG stat.ML

A Comparative Investigation of Thermodynamic Structure-Informed Neural Networks

Guojie Li, Liu Hong

Comments 30 pages, 9 figures, 2 tables

详情
英文摘要

Physics-informed neural networks (PINNs) offer a unified framework for solving both forward and inverse problems of differential equations, yet their performance and physical consistency strongly depend on how governing laws are incorporated. In this work, we present a systematic comparison of different thermodynamic structure-informed neural networks by incorporating various thermodynamics formulations, including Newtonian, Lagrangian, and Hamiltonian mechanics for conservative systems, as well as the Onsager variational principle and extended irreversible thermodynamics for dissipative systems. Through comprehensive numerical experiments on representative ordinary and partial differential equations, we quantitatively evaluate the impact of these formulations on accuracy, physical consistency, noise robustness, and interpretability. The results show that Newtonian-residual-based PINNs can reconstruct system states but fail to reliably recover key physical and thermodynamic quantities, whereas structure-preserving formulation significantly enhances parameter identification, thermodynamic consistency, and robustness. These findings provide practical guidance for principled design of thermodynamics-consistency model, and lay the groundwork for integrating more general nonequilibrium thermodynamic structures into physics-informed machine learning.

2603.26802 2026-03-31 cs.CV cs.RO eess.IV

Deep Learning Aided Vision System for Planetary Rovers

Lomash Relia, Jai G Singla, Amitabh, Nitant Dube

详情
英文摘要

This study presents a vision system for planetary rovers, combining real-time perception with offline terrain reconstruction. The real-time module integrates CLAHE enhanced stereo imagery, YOLOv11n based object detection, and a neural network to estimate object distances. The offline module uses the Depth Anything V2 metric monocular depth estimation model to generate depth maps from captured images, which are fused into dense point clouds using Open3D. Real world distance estimates from the real time pipeline provide reliable metric context alongside the qualitative reconstructions. Evaluation on Chandrayaan 3 NavCam stereo imagery, benchmarked against a CAHV based utility, shows that the neural network achieves a median depth error of 2.26 cm within a 1 to 10 meter range. The object detection model maintains a balanced precision recall tradeoff on grayscale lunar scenes. This architecture offers a scalable, compute-efficient vision solution for autonomous planetary exploration.

2603.26801 2026-03-31 cs.LG cs.AI

Sparse-by-Design Cross-Modality Prediction: L0-Gated Representations for Reliable and Efficient Learning

Filippo Cenacchi

详情
英文摘要

Predictive systems increasingly span heterogeneous modalities such as graphs, language, and tabular records, but sparsity and efficiency remain modality-specific (graph edge or neighborhood sparsification, Transformer head or layer pruning, and separate tabular feature-selection pipelines). This fragmentation makes results hard to compare, complicates deployment, and weakens reliability analysis across end-to-end KDD pipelines. A unified sparsification primitive would make accuracy-efficiency trade-offs comparable across modalities and enable controlled reliability analysis under representation compression. We ask whether a single representation-level mechanism can yield comparable accuracy-efficiency trade-offs across modalities while preserving or improving probability calibration. We propose L0-Gated Cross-Modality Learning (L0GM), a modality-agnostic, feature-wise hard-concrete gating framework that enforces L0-style sparsity directly on learned representations. L0GM attaches hard-concrete stochastic gates to each modality's classifier-facing interface: node embeddings (GNNs), pooled sequence embeddings such as CLS (Transformers), and learned tabular embedding vectors (tabular models). This yields end-to-end trainable sparsification with an explicit control knob for the active feature fraction. To stabilize optimization and make trade-offs interpretable, we introduce an L0-annealing schedule that induces clear accuracy-sparsity Pareto frontiers. Across three public benchmarks (ogbn-products, Adult, IMDB), L0GM achieves competitive predictive performance while activating fewer representation dimensions, and it reduces Expected Calibration Error (ECE) in our evaluation. Overall, L0GM establishes a modality-agnostic, reproducible sparsification primitive that supports comparable accuracy, efficiency, and calibration trade-off analysis across heterogeneous modalities.

2603.26800 2026-03-31 cs.LG cs.AI physics.flu-dyn

DSO: Dual-Scale Neural Operators for Stable Long-term Fluid Dynamics Forecasting

Huanshuo Dong, Hao Wu, Hong Wang, Qin-Yi Zhang, Zhezheng Hao

详情
英文摘要

Long-term fluid dynamics forecasting is a critically important problem in science and engineering. While neural operators have emerged as a promising paradigm for modeling systems governed by partial differential equations (PDEs), they often struggle with long-term stability and precision. We identify two fundamental failure modes in existing architectures: (1) local detail blurring, where fine-scale structures such as vortex cores and sharp gradients are progressively smoothed, and (2) global trend deviation, where the overall motion trajectory drifts from the ground truth during extended rollouts. We argue that these failures arise because existing neural operators treat local and global information processing uniformly, despite their inherently different evolution characteristics in physical systems. To bridge this gap, we propose the Dual-Scale Neural Operator (DSO), which explicitly decouples information processing into two complementary modules: depthwise separable convolutions for fine-grained local feature extraction and an MLP-Mixer for long-range global aggregation. Through numerical experiments on vortex dynamics, we demonstrate that nearby perturbations primarily affect local vortex structure while distant perturbations influence global motion trends, providing empirical validation for our design choice. Extensive experiments on turbulent flow benchmarks show that DSO achieves state-of-the-art accuracy while maintaining robust long-term stability, reducing prediction error by over 88% compared to existing neural operators.

2603.26799 2026-03-31 cs.LG

Gaussian Joint Embeddings For Self-Supervised Representation Learning

Yongchao Huang

Comments 92 pages

详情
英文摘要

Self-supervised representation learning often relies on deterministic predictive architectures to align context and target views in latent space. While effective in many settings, such methods are limited in genuinely multi-modal inverse problems, where squared-loss prediction collapses towards conditional averages, and they frequently depend on architectural asymmetries to prevent representation collapse. In this work, we propose a probabilistic alternative based on generative joint modeling. We introduce Gaussian Joint Embeddings (GJE) and its multi-modal extension, Gaussian Mixture Joint Embeddings (GMJE), which model the joint density of context and target representations and replace black-box prediction with closed-form conditional inference under an explicit probabilistic model. This yields principled uncertainty estimates and a covariance-aware objective for controlling latent geometry. We further identify a failure mode of naive empirical batch optimization, which we term the Mahalanobis Trace Trap, and develop several remedies spanning parametric, adaptive, and non-parametric settings, including prototype-based GMJE, conditional Mixture Density Networks (GMJE-MDN), topology-adaptive Growing Neural Gas (GMJE-GNG), and a Sequential Monte Carlo (SMC) memory bank. In addition, we show that standard contrastive learning can be interpreted as a degenerate non-parametric limiting case of the GMJE framework. Experiments on synthetic multi-modal alignment tasks and vision benchmarks show that GMJE recovers complex conditional structure, learns competitive discriminative representations, and defines latent densities that are better suited to unconditional sampling than deterministic or unimodal baselines.

2603.26798 2026-03-31 cs.LG cs.AI

Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings

Gesina Schwalbe, Mert Keser, Moritz Bayerkuhnlein, Edgar Heinert, Annika Mütze, Marvin Keller, Sparsh Tiwari, Georgii Mikriukov, Diedrich Wolter, Jae Hee Lee, Matthias Rottmann

详情
英文摘要

Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing the extracted tree against human ontologies using efficient tree- and edge-level consistency measures, and we evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation, using UMAP to generate target neighborhoods from a desired hierarchy. Across 13 pretrained VLMs and 4 image datasets, our method finds systematic modality differences: image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies. Overall, the results reveal a persistent trade-off between zero-shot accuracy and ontological plausibility and suggest practical routes to improve semantic alignment in shared embedding spaces.

2603.26797 2026-03-31 cs.LG

MemGuard-Alpha: Detecting and Filtering Memorization-Contaminated Signals in LLM-Based Financial Forecasting via Membership Inference and Cross-Model Disagreement

Anisha Roy, Dip Roy

详情
英文摘要

Large language models (LLMs) are increasingly used to generate financial alpha signals, yet growing evidence shows that LLMs memorize historical financial data from their training corpora, producing spurious predictive accuracy that collapses out-of-sample. This memorization-induced look-ahead bias threatens the validity of LLM-based quantitative strategies. Prior remedies -- model retraining and input anonymization -- are either prohibitively expensive or introduce significant information loss. No existing method offers practical, zero-cost signal-level filtering for real-time trading. We introduce MemGuard-Alpha, a post-generation framework comprising two algorithms: (i) the MemGuard Composite Score (MCS), which combines five membership inference attack (MIA) methods with temporal proximity features via logistic regression, achieving Cohen's d = 18.57 for contamination separation (d = 0.39-1.37 using MIA features alone); and (ii) Cross-Model Memorization Disagreement (CMMD), which exploits variation in training cutoff dates across LLMs to separate memorized signals from genuine reasoning. Evaluated across seven LLMs (124M-7B parameters), 50 S&P 100 stocks, 42,800 prompts, and five MIA methods over 5.5 years (2019-2024), CMMD achieves a Sharpe ratio of 4.11 versus 2.76 for unfiltered signals (49% improvement). Clean signals produce 14.48 bps average daily return versus 2.13 bps for tainted signals (7x difference). A striking crossover pattern emerges: in-sample accuracy rises with contamination (40.8% to 52.5%) while out-of-sample accuracy falls (47% to 42%), providing direct evidence that memorization inflates apparent accuracy at the cost of generalization.

2603.26796 2026-03-31 cs.LG cs.AI stat.ML

Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints

Jelena Markovic-Voronov, Kayhan Behdin, Yuanda Xu, Zhengze Zhou, Zhipeng Wang, Rahul Mazumder

详情
英文摘要

We study the problem of routing queries to large language models (LLMs) under cost, GPU resources, and concurrency constraints. Prior per-query routing methods often fail to control batch-level cost, especially under non-uniform or adversarial batching. To address this, we propose a batch-level, resource-aware routing framework that jointly optimizes model assignment for each batch while respecting cost and model capacity limits. We further introduce a robust variant that accounts for uncertainty in predicted LLM performance, along with an offline instance allocation procedure that balances quality and throughput across multiple models. Experiments on two multi-task LLM benchmarks show that robustness improves accuracy by 1-14% over non-robust counterparts (depending on the performance estimator), batch-level routing outperforms per-query methods by up to 24% under adversarial batching, and optimized instance allocation yields additional gains of up to 3% compared to a non-optimized allocation, all while strictly controlling cost and GPU resource constraints.

2603.26794 2026-03-31 cs.CV cs.AI

PhyDCM: A Reproducible Open-Source Framework for AI-Assisted Brain Tumor Classification from Multi-Sequence MRI

Hayder Saad Abdulbaqi, Mohammed Hadi Rahim, Mohammed Hassan Hadi, Haider Ali Aboud, Ali Hussein Allawi

Comments 18 pages, 9 figures, 6 tables

详情
英文摘要

MRI-based medical imaging has become indispensable in modern clinical diagnosis, particularly for brain tumor detection. However, the rapid growth in data volume poses challenges for conventional diagnostic approaches. Although deep learning has shown strong performance in automated classification, many existing solutions are confined to closed technical architectures, limiting reproducibility and further academic development. PhyDCM is introduced as an open-source software framework that integrates a hybrid classification architecture based on MedViT with standardized DICOM processing and an interactive desktop visualization interface. The system is designed as a modular digital library that separates computational logic from the graphical interface, allowing independent modification and extension of components. Standardized preprocessing, including intensity rescaling and limited data augmentation, ensures consistency across varying MRI acquisition settings. Experimental evaluation on MRI datasets from BRISC2025 and curated Kaggle collections (FigShare, SARTAJ, and Br35H) demonstrates stable diagnostic performance, achieving over 93% classification accuracy across categories. The framework supports structured, exportable outputs and multi-planar reconstruction of volumetric data. By emphasizing transparency, modularity, and accessibility, PhyDCM provides a practical foundation for reproducible AI-driven medical image analysis, with flexibility for future integration of additional imaging modalities.

2603.26790 2026-03-31 cs.CV

Elucidating the Design Space of Flow Matching for Cellular Microscopy

Charles Jones, Emmanuel Noutahi, Jason Hartford, Cian Eastwood

详情
英文摘要

Flow-matching generative models are increasingly used to simulate cell responses to biological perturbations. However, the design space for building such models is large and underexplored. We systematically analyse the design space of flow matching models for cell-microscopy images, finding that many popular techniques are unnecessary and can even hurt performance. We develop a simple, stable, and scalable recipe which we use to train our foundation model. We scale our model to two orders of magnitude larger than prior methods, achieving a two-fold FID and ten-fold KID improvement over prior methods. We then fine-tune our model with pre-trained molecular embeddings to achieve state-of-the-art performance simulating responses to unseen molecules. Code is available at https://github.com/valence-labs/microscopy-flow-matching

2603.26789 2026-03-31 cs.CV

Confidence Matters: Uncertainty Quantification and Precision Assessment of Deep Learning-based CMR Biomarker Estimates Using Scan-rescan Data

Dewmini Hasara Wickremasinghe, Michelle Gibogwe, Andrew Bell, Esther Puyol-Antón, Muhummad Sohaib Nazir, Reza Razavi, Bruno Paun, Paul Aljabar, Andrew P. King

详情
英文摘要

The performance of deep learning (DL) methods for the analysis of cine cardiovascular magnetic resonance (CMR) is typically assessed in terms of accuracy, overlooking precision. In this work, uncertainty estimation techniques, namely deep ensemble, test-time augmentation, and Monte Carlo dropout, are applied to a state-of-the-art DL pipeline for cardiac functional biomarker estimation, and new distribution-based metrics are proposed for the assessment of biomarker precision. The model achieved high accuracy (average Dice 87%) and point estimate precision on two external validation scan-rescan CMR datasets. However, distribution-based metrics showed that the overlap between scan/rescan confidence intervals was >50% in less than 45% of the cases. Statistical similarity tests between scan and rescan biomarkers also resulted in significant differences for over 65% of the cases. We conclude that, while point estimate metrics might suggest good performance, distributional analyses reveal lower precision, highlighting the need to use more representative metrics to assess scan-rescan agreement.

2603.26787 2026-03-31 cs.CV

Brain-Inspired Multimodal Spiking Neural Network for Image-Text Retrieval

Xintao Zong, Xian Zhong, Wenxuan Liu, Jianhao Ding, Zhaofei Yu, Tiejun Huang

详情
英文摘要

Spiking neural networks (SNNs) have recently shown strong potential in unimodal visual and textual tasks, yet building a directly trained, low-energy, and high-performance SNN for multimodal applications such as image-text retrieval (ITR) remains highly challenging. Existing artificial neural network (ANN)-based methods often pursue richer unimodal semantics using deeper and more complex architectures, while overlooking cross-modal interaction, retrieval latency, and energy efficiency. To address these limitations, we present a brain-inspired Cross-Modal Spike Fusion network (CMSF) and apply it to ITR for the first time. The proposed spike fusion mechanism integrates unimodal features at the spike level, generating enhanced multimodal representations that act as soft supervisory signals to refine unimodal spike embeddings, effectively mitigating semantic loss within CMSF. Despite requiring only two time steps, CMSF achieves top-tier retrieval accuracy, surpassing state-of-the-art ANN counterparts while maintaining exceptionally low energy consumption and high retrieval speed. This work marks a significant step toward multimodal SNNs, offering a brain-inspired framework that unifies temporal dynamics with cross-modal alignment and provides new insights for future spiking-based multimodal research. The code is available at https://github.com/zxt6174/CMSF.

2603.26786 2026-03-31 cs.LG cs.AI

A Step Toward Federated Pretraining of Multimodal Large Language Models

Baochen Xiong, Yifan Xu, Xiaoshan Yang, Yaguang Song, Yaowei Wang, Changsheng Xu

详情
英文摘要

The rapid evolution of Multimodal Large Language Models (MLLMs) is bottlenecked by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain inaccessible in privacy-sensitive silos. Federated Learning (FL) offers a promising solution to unlock these distributed resources, but existing research focuses predominantly on fine-tuning, leaving the foundational pre-training phase largely unexplored. In this paper, we formally introduce the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and LLM while collaboratively training the cross-modal projector. We identify two critical challenges in this setting: (i) parameter interference in aggregating local projectors; and (ii) gradient oscillations in one-pass collaborative SGD. To address these challenges, we propose Fed-CMP, a pioneering framework for federated MLLM pre-training. Fed-CMP employs Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a shared alignment basis and client-specific coefficients, then performs reliability-weighted fusion to suppress parameter interference. Furthermore, Fed-CMP introduces Orthogonality-Preserved Momentum, which applies momentum to the shared alignment basis via orthogonal projection, accumulating historical optimization directions while preserving geometric structure. We construct four federated pre-training scenarios based on public datasets, and extensive experiments validate that Fed-CMP significantly outperforms existing baselines.

2603.26784 2026-03-31 cs.CV

HighlightBench: Benchmarking Markup-Driven Table Reasoning in Scientific Documents

Lexin Wang, Shenghua Liu, Yiwei Wang, Yujun Cai, Yuyao Ge, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

详情
英文摘要

Visual markups such as highlights, underlines, and bold text are common in table-centric documents. Although multimodal large language models (MLLMs) have made substantial progress in document understanding, their ability to treat such cues as explicit logical directives remains under-explored. More importantly, existing evaluations cannot distinguish whether a model fails to see the markup or fails to reason with it. This creates a key blind spot in assessing markup-conditioned behavior over tables. To address this gap, we introduce HighlightBench, a diagnostic benchmark for markup-driven table understanding that decomposes evaluation into five task families: Markup Grounding, Constrained Retrieval, Local Relations, Aggregation \& Comparison, and Consistency \& Missingness. We further provide a reference pipeline that makes intermediate decisions explicit, enabling reproducible baselines and finer-grained attribution of errors along the perception-to-execution chain. Experiments show that even strong models remain unstable when visual cues must be consistently aligned with symbolic reasoning under structured output constraints.

2603.26780 2026-03-31 cs.CV

RatSeizure: A Benchmark and Saliency-Context Transformer for Rat Seizure Localization

Ting Yu Tsai, An Yu, Lucy Lee, Felix X. -F. Ye, Damian S. Shin, Tzu-Jen Kao, Xin Li, Ming-Ching Chang

详情
英文摘要

Animal models, particularly rats, play a critical role in seizure research for studying epileptogenesis and treatment response. However, progress is limited by the lack of datasets with precise temporal annotations and standardized evaluation protocols. Existing animal behavior datasets often have limited accessibility, coarse labeling, and insufficient temporal localization of clinically meaningful events. To address these limitations, we introduce RatSeizure, the first publicly benchmark for fine-grained seizure behavior analysis. The dataset consists of recorded clips annotated with seizure-related action units and temporal boundaries, enabling both behavior classification and temporal localization. We further propose RaSeformer, a saliency-context Transformer for temporal action localization that highlights behavior-relevant context while suppressing redundant cues. Experiments on RatSeizure show that RaSeformer achieves strong performance and provides a competitive reference model for this challenging task. We also establish standardized dataset splits and evaluation protocols to support reproducible benchmarking.

2603.26778 2026-03-31 cs.LG cs.AI

TED: Training-Free Experience Distillation for Multimodal Reasoning

Shuozhi Yuan, Jinqing Wang, Zihao Liu, Miaomiao Yuan, Haoran Peng, Jin Zhao, Bingwen Wang, Haoyi Wang

Comments 13 pages,4 figures

详情
英文摘要

Knowledge distillation is typically realized by transferring a teacher model's knowledge into a student's parameters through supervised or reinforcement-based optimization. While effective, such approaches require repeated parameter updates and large-scale training data, limiting their applicability in resource-constrained environments. In this work, we propose TED, a training-free, context-based distillation framework that shifts the update target of distillation from model parameters to an in-context experience injected into the student's prompt. For each input, the student generates multiple reasoning trajectories, while a teacher independently produces its own solution. The teacher then compares the student trajectories with its reasoning and the ground-truth answer, extracting generalized experiences that capture effective reasoning patterns. These experiences are continuously refined and updated over time. A key challenge of context-based distillation is unbounded experience growth and noise accumulation. TED addresses this with an experience compression mechanism that tracks usage statistics and selectively merges, rewrites, or removes low-utility experiences. Experiments on multimodal reasoning benchmarks MathVision and VisualPuzzles show that TED consistently improves performance. On MathVision, TED raises the performance of Qwen3-VL-8B from 0.627 to 0.702, and on VisualPuzzles from 0.517 to 0.561 with just 100 training samples. Under this low-data, no-update setting, TED achieves performance competitive with fully trained parameter-based distillation while reducing training cost by over 5x, demonstrating that meaningful knowledge transfer can be achieved through contextual experience.

2603.26777 2026-03-31 cs.CV astro-ph.IM cs.LG

BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting

Renbo Tu, Ali SaraerToosi, Nicholas S. Conroy, Gennady Pekhimenko, Aviad Levis

Comments CVPR 2026

详情
英文摘要

The Event Horizon Telescope (EHT) delivered the first image of a black hole by capturing the light from its surrounding accretion flow, revealing structure but not dynamics. Simulations of black hole accretion dynamics are essential for interpreting EHT images but costly to generate and impractical for inference. Motivated by this bottleneck, BHCast presents a framework for forecasting black hole plasma dynamics from a single, blurry snapshot, such as those captured by the EHT. At its core, BHCast is a neural model that transforms a static image into forecasted future frames, revealing the underlying dynamics hidden within one snapshot. With a multi-scale pyramid loss, we demonstrate how autoregressive forecasting can simultaneously super-resolve and evolve a blurry frame into a coherent, high-resolution movie that remains stable over long time horizons. From forecasted dynamics, we can then extract interpretable spatio-temporal features, such as pattern speed (rotation rate) and pitch angle. Finally, BHCast uses gradient-boosting trees to recover black hole properties from these plasma features, including the spin and viewing inclination angle. The separation between forecasting and inference provides modular flexibility, interpretability, and robust uncertainty quantification. We demonstrate the effectiveness of BHCast on simulations of two distinct black hole accretion systems, Sagittarius A* and M87*, by testing on simulated frames blurred to EHT resolution and real EHT images of M87*. Ultimately, our methodology establishes a scalable paradigm for solving inverse problems, demonstrating the potential of learned dynamics to unlock insights from resolution-limited scientific data.

2603.26776 2026-03-31 cs.CV

From Prediction to Diagnosis: Reasoning-Aware AI for Photovoltaic Defect Inspection

Dev Mistry, Feng Qiu, Bo Chen, Feng Liu, Can Chen, Mohammad Shahidehpour, Ren Wang

Comments 34 pages, 5 figures

详情
英文摘要

Reliable photovoltaic defect identification is essential for maintaining energy yield, ensuring warranty compliance, and enabling scalable inspection of rapidly expanding solar fleets. Although recent advances in computer vision have improved automated defect detection, most existing systems operate as opaque classifiers that provide limited diagnostic insight for high-stakes energy infrastructure. Here we introduce REVL-PV, a vision-language framework that embeds domain-specific diagnostic reasoning into multimodal learning across electroluminescence, thermal, and visible-light imagery. By requiring the model to link visual evidence to plausible defect mechanisms before classification, the framework produces structured diagnostic reports aligned with professional photovoltaic inspection practice. Evaluated on 1,927 real-world modules spanning eight defect categories, REVL-PV achieves 93\% classification accuracy while producing interpretable diagnostic rationales and maintaining strong robustness under realistic image corruptions. A blind concordance study with a certified solar inspection expert shows strong semantic alignment between model explanations and expert assessments across defect identification, root-cause attribution, and visual descriptions. These results demonstrate that reasoning-aware multimodal learning establishes a general paradigm for trustworthy AI-assisted inspection of photovoltaic energy infrastructure.

2603.26775 2026-03-31 cs.LG cs.AI cs.CL cs.CV

Learning to Select Visual In-Context Demonstrations

Eugene Lee, Yu-Chi Lin, Jiajie Diao

Comments 21 pages, 12 figure, accepted to Computer Vision and Pattern Recognition Conference (CVPR) 2026 Findings Track

详情
英文摘要

Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task's full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.

2603.26773 2026-03-31 cs.RO cs.LG

Robot Arm Control via Cognitive Map Learners

Nathan McDonald, Colyn Seeley, Christian Brazeau

详情
英文摘要

Cognitive map learners (CML) have been shown to enable hierarchical, compositional machine learning. That is, interpedently trained CML modules can be arbitrarily composed together to solve more complex problems without task-specific retraining. This work applies this approach to control the movement of a multi-jointed robot arm, whereby each arm segment's angular position is governed by an independently trained CML. Operating in a 2D Cartesian plane, target points are encoded as phasor hypervectors according to fractional power encoding (FPE). This phasor hypervector is then factorized into a set of arm segment angles either via a resonator network or a modern Hopfield network. These arm segment angles are subsequently fed to their respective arm segment CMLs, which reposition the robot arm to the target point without the use of inverse kinematic equations. This work presents both a general solution for both a 2D robot arm with an arbitrary number of arm segments and a particular solution for a 3D arm with a single rotating base.

2603.26772 2026-03-31 cs.CV cs.AI cs.CY

From Content to Audience: A Multimodal Annotation Framework for Broadcast Television Analytics

Paolo Cupini, Francesco Pierri

详情
英文摘要

Automated semantic annotation of broadcast television content presents distinctive challenges, combining structured audiovisual composition, domain-specific editorial patterns, and strict operational constraints. While multimodal large language models (MLLMs) have demonstrated strong general-purpose video understanding capabilities, their comparative effectiveness across pipeline architectures and input configurations in broadcast-specific settings remains empirically undercharacterized. This paper presents a systematic evaluation of multimodal annotation pipelines applied to broadcast television news in the Italian setting. We construct a domain-specific benchmark of clips labeled across four semantic dimensions: visual environment classification, topic classification, sensitive content detection, and named entity recognition. Two different pipeline architectures are evaluated across nine frontier models, including Gemini 3.0 Pro, LLaMA 4 Maverick, Qwen-VL variants, and Gemma 3, under progressively enriched input strategies combining visual signals, automatic speech recognition, speaker diarization, and metadata. Experimental results demonstrate that gains from video input are strongly model-dependent: larger models effectively leverage temporal continuity, while smaller models show performance degradation under extended multimodal context, likely due to token overload. Beyond benchmarking, the selected pipeline is deployed on 14 full broadcast episodes, with minute-level annotations integrated with normalized audience measurement data provided by an Italian media company. This integration enables correlational analysis of topic-level audience sensitivity and generational engagement divergence, demonstrating the operational viability of the proposed framework for content-based audience analytics.

2603.26770 2026-03-31 cs.CV

Quantized Vision-Language Models for Damage Assessment: A Comparative Study of LLaVA-1.5-7B Quantization Levels

Takato Yasuno

Comments 16 pages, 4 figures, 8 tables

详情
英文摘要

Bridge infrastructure inspection is a critical but labor-intensive task requiring expert assessment of structural damage such as rebar exposure, cracking, and corrosion. This paper presents a comprehensive study of quantized Vision-Language Models (VLMs) for automated bridge damage assessment, focusing on the trade-offs between description quality, inference speed, and resource requirements. We develop an end-to-end pipeline combining LLaVA-1.5-7B for visual damage analysis, structured JSON extraction, and rule-based priority scoring. To enable deployment on consumer-grade GPUs, we conduct a systematic comparison of three quantization levels: Q4_K_M, Q5_K_M, and Q8\_0 across 254 rebar exposure images. We introduce a 5-point quality evaluation framework assessing damage type recognition, severity classification. Our results demonstrate that Q5_K_M achieves the optimal balance: quality score 3.18$\pm$1.35/5.0, inference time 5.67s/image, and 0.56 quality/sec efficiency -- 8.5% higher quality than Q4_K_M with only 4.5% speed reduction, while matching Q8_0's quality with 25% faster inference. Statistical analysis reveals Q5_K_M exhibits the weakest text-quality correlation (-0.148), indicating consistent performance regardless of description length.

2603.26769 2026-03-31 cs.CV cs.AI

Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption

Mehmet Kaan Erol

Comments 16 pages, 5 figures

详情
英文摘要

The rapid compression of large vision-language models (VLMs) for edge deployment raises an underexplored question: do compact models fail differently, not merely more often? This study compares a 7-billion-parameter quantised VLM (Qwen2.5-VL-7B, 4-bit NF4) against a 500-million-parameter FP16 model (SmolVLM2-500M) across 4,000 samples from VQAv2 and COCO Captions. A three-category error taxonomy (Object Blindness, Semantic Drift, Prior Bias) is applied as a diagnostic framework. A text-only GPT-4o judge reveals Semantic Drift (B) as the dominant failure mode on VQAv2 and on COCO for Qwen, with a mixed Object Blindness / Semantic Drift profile for SmolVLM2 on COCO; Prior Bias (C) is present on VQAv2 but absent on COCO for both models. Confidence calibration is measured via Expected Calibration Error (ECE) using geometric mean token probability, compositional reasoning is probed with structured negation probes across four templates, and a blur robustness experiment completes the evaluation. For this model pair, the compact model exhibits a qualitatively distinct failure signature: a 12.5pp larger negation collapse (-33.2pp vs. -20.8pp, Wald 95% CI [8.2, 16.8]pp, p < 10^-8), driven almost entirely by COCO while the VQAv2 gap is not statistically significant (4.5pp, p=0.19). The most discriminating template is false_yn: SMOLVLM2-500M responds "Yes" (incorrectly claiming a depicted object is absent) on 100% of COCO trials vs. 14% for Q WEN 2.5-VL-7B. Asymmetric dataset-dependent miscalibration and a blur experiment with two controlled ablations complete the analysis. The fully reproducible pipeline is released for systematic safety auditing of compressed VLMs prior to edge deployment.

2603.26768 2026-03-31 cs.CV cs.AI cs.CL

Aesthetic Assessment of Chinese Handwritings Based on Vision Language Models

Chen Zheng, Yuxuan Lai, Haoyang Lu, Wentao Ma, Jitao Yang, Jian Wang

Comments Accepted by CCL2025

详情
英文摘要

The handwriting of Chinese characters is a fundamental aspect of learning the Chinese language. Previous automated assessment methods often framed scoring as a regression problem. However, this score-only feedback lacks actionable guidance, which limits its effectiveness in helping learners improve their handwriting skills. In this paper, we leverage vision-language models (VLMs) to analyze the quality of handwritten Chinese characters and generate multi-level feedback. Specifically, we investigate two feedback generation tasks: simple grade feedback (Task 1) and enriched, descriptive feedback (Task 2). We explore both low-rank adaptation (LoRA)-based fine-tuning strategies and in-context learning methods to integrate aesthetic assessment knowledge into VLMs. Experimental results show that our approach achieves state-of-the-art performances across multiple evaluation tracks in the CCL 2025 workshop on evaluation of handwritten Chinese character quality.

2603.26767 2026-03-31 cs.CV

A training-free framework for high-fidelity appearance transfer via diffusion transformers

Shengrong Gu, Ye Wang, Song Wu, Rui Ma, Qian Wang, Lanjun Wang, Zili Yi

详情
英文摘要

Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. Our core is a synergistic system that disentangles structure and appearance. We leverage high-fidelity inversion to establish a rich content prior for the source image, capturing its lighting and micro-textures. A novel attention-sharing mechanism then dynamically fuses purified appearance features from a reference, guided by geometric priors. Our unified approach operates at 1024px and outperforms specialized methods on tasks ranging from semantic attribute transfer to fine-grained material application. Extensive experiments confirm our state-of-the-art performance in both structural preservation and appearance fidelity.