arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.00149 2026-03-03 cs.CV cs.AI

Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction

Zhihao Li, Shengwei Dong, Chuang Yi, Junxuan Gao, Zhilu Lai, Zhiqiang Liu, Wei Wang, Guangtao Zhang

Comments Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)


Existing image super-resolution (SR) and generic diffusion models transfer poorly to fluid SR: they are sampling-intensive, ignore physical constraints, and often yield spectral mismatch and spurious divergence. We address fluid SR with ReMD (Residual-Multigrid Diffusion), a physics-consistent diffusion framework. At each reverse step, ReMD performs a multigrid residual correction: the update direction is obtained by coupling data consistency with lightweight physics cues and then correcting the residual across scales; the multiscale hierarchy is instantiated with a multi-wavelet basis to capture both large structures and fine vortical details. This coarse-to-fine design accelerates convergence and preserves fine structures while remaining equation-free. Across atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches comparable quality with markedly fewer sampling steps than diffusion baselines. Our results show that enforcing physics consistency inside the diffusion process via multigrid residual correction and multi-wavelet multiscale modeling is an effective route to efficient fluid SR. Our code is available at https://github.com/lizhihao2022/ReMD.

2603.00148 2026-03-03 cs.CV

Mechanistically Guided LoRA Improves Paraphrase Consistency in Medical Vision-Language Models

Binesh Sadanandan, Vahid Behzadan


Medical Vision-Language Models can give different yes or no answers to rephrasings of the same clinical question. We study this in MedGemma-4B using PSF-Med (Sadanandan and Behzadan, 2025), which provides paraphrase pairs for systematic consistency evaluation on medical VQA. On MIMIC-CXR binary questions (n = 158), the baseline flip rate is 14.6% and the mean margin difference is 1.63 logits. We validate that Gemma Scope 2 Sparse Autoencoders (SAEs) transfer to MedGemma activations, achieving $R^2 \approx 0.997$ on both medical and general text (n = 100 prompts each, p < 0.001 for exceeding a 0.95 threshold). We then fine-tune Low-Rank Adaptation (LoRA) adapters with a combined loss that balances paraphrase consistency with answer accuracy. This combined approach prevents the mode collapse that occurs with pure consistency training while reducing the flip rate from 14.6% to 4.4% (p = 0.002, two-proportion z-test) and the margin difference from 1.63 to 0.33 (79.5% reduction). Accuracy remains stable at 84.2% baseline versus 82.3% after training (-1.9pp, not significant). On PadChest Balanced (n = 250), the flip rate drops from 13.6% to 7.8%, the mean margin difference drops from 1.08 to 0.35 (67.9% reduction), and accuracy increases from 66.4% to 69.4%. A layer-range ablation shows that early layers reduce margin differences more than mechanistically selected middle layers.
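The reported significance of the flip-rate reduction can be sanity-checked with a pooled two-proportion z-test. The sketch below is illustrative only. The flip counts 23 and 7 are assumptions back-computed from the rounded rates 14.6% and 4.4% at n = 158, and the abstract does not say whether the test was one-sided or two-sided, so a two-sided p-value is assumed here.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided tail of the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Assumed counts (not stated in the abstract):
# 23/158 rounds to 14.6% flips before training, 7/158 rounds to 4.4% after.
z, p = two_proportion_z_test(23, 158, 7, 158)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

Under these assumed counts the p-value comes out close to the 0.002 quoted in the abstract; different rounding of the underlying counts would shift it slightly.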

2603.00145 2026-03-03 cs.CV cs.AI

M-Gaussian: An Magnetic Gaussian Framework for Efficient Multi-Stack MRI Reconstruction

Kangyuan Zheng, Xuan Cai, Jiangqi Wang, Guixing Fu, Zhuoshuo Li, Yazhou Chen, Xinting Ge, Liangqiong Qu, Mengting Liu

Comments 15 pages, 9 figures


Magnetic Resonance Imaging (MRI) is a crucial non-invasive imaging modality. In routine clinical practice, multi-stack thick-slice acquisitions are widely used to reduce scan time and motion sensitivity, particularly in challenging scenarios such as fetal brain imaging. However, the resulting severe through-plane anisotropy compromises volumetric analysis and downstream quantitative assessment, necessitating robust reconstruction of isotropic high-resolution volumes. Implicit neural representation methods, while achieving high quality, suffer from computational inefficiency due to complex network structures. We present M-Gaussian, adapting 3D Gaussian Splatting to MRI reconstruction. Our contributions include: (1) Magnetic Gaussian primitives with physics-consistent volumetric rendering, (2) neural residual field for high-frequency detail refinement, and (3) multi-resolution progressive training. Our method achieves an optimal balance between quality and speed. On the FeTA dataset, M-Gaussian achieves 40.31 dB PSNR while being 14 times faster, representing the first successful adaptation of 3D Gaussian Splatting to multi-stack MRI reconstruction.

2603.00144 2026-03-03 cs.CV cs.AI

Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation

Zichen Geng, Zeeshan Hayder, Bo Miao, Jian Liu, Wei Liu, Ajmal Mian


Generating realistic 3D Human-Human Interaction (HHI) requires coherent modeling of the physical plausibility of the agents and their interaction semantics. Existing methods compress all motion information into a single latent representation, limiting their ability to capture fine-grained actions and inter-agent interactions. This often leads to semantic misalignment and physically implausible artifacts, such as penetration or missed contact. We propose Disentangled Hierarchical Variational Autoencoder (DHVAE) based latent diffusion for structured and controllable HHI generation. DHVAE explicitly disentangles the global interaction context and individual motion patterns into a decoupled latent structure by employing a CoTransformer module. To mitigate implausible and physically inconsistent contacts in HHI, we incorporate contrastive learning constraints with our DHVAE to promote a more discriminative and physically plausible latent interaction space. For high-fidelity interaction synthesis, DHVAE employs a DDIM-based diffusion denoising process in the hierarchical latent space, enhanced by a skip-connected AdaLN-Transformer denoiser. Extensive evaluations show that DHVAE achieves superior motion fidelity, text alignment, and physical plausibility with greater computational efficiency.

2603.00143 2026-03-03 cs.CV cs.LG

GrapHist: Graph Self-Supervised Learning for Histopathology

Sevda Öğüt, Cédric Vincent-Cuaz, Natalia Dubljevic, Carlos Hurtado, Vaishnavi Subramanian, Pascal Frossard, Dorina Thanou


Self-supervised vision models have achieved notable success in digital pathology. However, their domain-agnostic transformer architectures were not originally designed to account for fundamental biological elements of histopathology images, namely cells and their complex interactions. In this work, we hypothesize that biologically informed modeling of tissues as cell graphs offers more efficient representation learning. Thus, we introduce GrapHist, a novel graph-based self-supervised learning framework for histopathology, which learns generalizable and structurally informed embeddings that enable diverse downstream tasks. GrapHist integrates masked autoencoders and heterophilic graph neural networks that are explicitly designed to capture the heterogeneity of tumor microenvironments. We pre-train GrapHist on a large collection of 11 million cell graphs derived from breast tissues and evaluate its transferability across in- and out-of-domain benchmarks. Our results show that GrapHist achieves competitive performance compared to its vision-based counterparts in slide-, region-, and cell-level tasks, while requiring four times fewer parameters. It also drastically outperforms fully-supervised graph models on cancer subtyping tasks. Finally, we release five graph-based digital pathology datasets used in our study at https://huggingface.co/ogutsevda/datasets , establishing the first large-scale graph benchmark in this field. Our code is available at https://github.com/ogutsevda/graphist .

2603.00140 2026-03-03 cs.CV cs.AI cs.LG

Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion

Sathwik Karnik, Juyeop Kim, Sanmi Koyejo, Jong-Seok Lee, Somil Bansal


Text-to-image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the "backward reachable tube"--the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state-of-the-art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug-and-play solution for safe generation. Our website is available at: https://s-karnik.github.io/rads-memorization-project-page/.

2603.00139 2026-03-03 cs.CV

Towards Data-driven Nitrogen Estimation in Wheat Fields using Multispectral Images

Andreas Tritsarolis, Tomaž Bokan, Matej Brumen, Domen Mongus, Yannis Theodoridis


The modernization of agriculture has motivated the development of advanced analytics and decision-support systems to improve resource utilization and reduce environmental impacts. Targeted Spraying and Fertilization (TSF) is a critical operation that enables farmers to apply inputs more precisely, optimizing resource use and promoting environmental sustainability. However, accurate TSF is a challenging problem, due to external factors such as crop type, fertilization phase, soil conditions, and weather dynamics. In this paper, we present TerrAI, a Neural Network-based solution for TSF, which considers the spatio-temporal variability across different parcels. Our experimental study over a real-world remote sensing dataset validates the soundness of TerrAI on data-driven agricultural practices.

2603.00138 2026-03-03 cs.CV

Latent Replay Detection: Memory-Efficient Continual Object Detection on Microcontrollers via Task-Adaptive Compression

Bibin Wilson


Deploying object detection on microcontrollers (MCUs) enables intelligent edge devices, but current models cannot learn new object categories after deployment. Existing continual learning methods require storing raw images, which far exceeds MCU memory budgets of tens of kilobytes. We present Latent Replay Detection (LRD), the first framework for continual object detection under MCU memory constraints. Our key contributions are: (1) Task-Adaptive Compression: unlike fixed PCA, we propose learnable compression with FiLM (Feature-wise Linear Modulation) conditioning, where task-specific embeddings modulate the compression to preserve discriminative features for each task's distribution; (2) Spatial-Diverse Exemplar Selection: traditional sampling ignores spatial information critical for detection, so we select exemplars maximizing bounding box diversity via farthest-point sampling in IoU space, preventing localization bias in replay; (3) MCU-Deployable System: our latent replay stores 150 bytes per sample versus >10KB for images, enabling a 64KB buffer to hold 400+ exemplars. Experiments on CORe50 (50 classes, 5 tasks) demonstrate that LRD achieves mAP@50 on the initial task and maintains strong performance across subsequent tasks, a significant improvement over naive fine-tuning while operating within strict MCU constraints. Our task-adaptive FiLM compression and spatially diverse exemplar selection work synergistically to preserve detection capabilities. Deployed on STM32H753ZI, ESP32-S3, and MAX78000 MCUs, LRD achieves 4.9-97.5ms latency per inference within a 64KB memory budget, enabling practical continual detection on edge devices for the first time.
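Farthest-point sampling in IoU space, as named in the abstract, can be sketched as a greedy selection that uses 1 - IoU as the distance between bounding boxes. This is a minimal illustration, not the authors' implementation: the box format `(x1, y1, x2, y2)`, the function names, and the choice of starting index are all assumptions. The last helper checks the buffer arithmetic, assuming 64KB means 65,536 bytes and no per-exemplar overhead.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def farthest_point_sample(boxes, k, start=0):
    """Greedily pick k indices, each maximising its (1 - IoU) distance
    to the nearest box that has already been selected."""
    n = len(boxes)
    k = min(k, n)
    if k == 0:
        return []
    selected, chosen = [start], {start}
    # min_dist[i] is the distance from box i to its nearest selected box.
    min_dist = [1.0 - iou(boxes[start], b) for b in boxes]
    while len(selected) < k:
        candidates = [i for i in range(n) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        chosen.add(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], 1.0 - iou(boxes[nxt], boxes[i]))
    return selected


def buffer_capacity(buffer_bytes, bytes_per_exemplar):
    """Number of whole exemplars that fit in the replay buffer."""
    return buffer_bytes // bytes_per_exemplar


# Toy boxes: three overlapping boxes in one corner and two isolated boxes.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60),
         (0, 50, 10, 60), (2, 2, 12, 12)]
picked = farthest_point_sample(boxes, k=3)
capacity = buffer_capacity(64 * 1024, 150)
print(picked, capacity)
```

On the toy boxes the selection skips the near-duplicate boxes in favour of the two isolated ones, and the capacity check is consistent with the "400+ exemplars" figure in the abstract.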

2603.00136 2026-03-03 cs.CV cs.AI

TinyVLM: Zero-Shot Object Detection on Microcontrollers via Vision-Language Distillation with Matryoshka Embeddings

Bibin Wilson


Zero-shot object detection enables recognising novel objects without task-specific training, but current approaches rely on large vision-language models (VLMs) like CLIP that require hundreds of megabytes of memory, far exceeding the constraints of microcontroller units (MCUs). We present TinyVLM, the first framework enabling zero-shot object detection on resource-constrained MCUs with less than 1MB of memory. Our approach introduces three key innovations: (1) a decoupled architecture that separates visual inference from text encoding, allowing precomputed class embeddings to be stored in flash memory; (2) Matryoshka distillation that trains nested embeddings at multiple dimensions (16-256), enabling flexible accuracy-memory trade-offs; and (3) quantized embedding storage that reduces class prototype memory by 4x with minimal accuracy loss. Trained on Conceptual Captions 3M (CC3M), TinyVLM achieves competitive zero-shot accuracy on COCO, Flowers102, and Food101 while requiring only 285KB of RAM and 892KB of flash memory for the deployed vision encoder. We demonstrate real-time inference at 26 FPS on STM32H7 and over 1,000 FPS on MAX78000 with its CNN accelerator, enabling practical zero-shot detection on edge devices for the first time.
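The idea behind nested (Matryoshka) embeddings with precomputed class prototypes can be illustrated as follows: the first `dim` coordinates of an embedding are kept as a usable lower-dimensional embedding, renormalised, and matched against class prototypes truncated the same way. This is a toy sketch under stated assumptions, not TinyVLM's code. The function names and the 4-dimensional vectors are hypothetical, and the storage helper assumes that the 4x saving corresponds to storing int8 instead of float32 values, which the abstract does not state.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def truncate_embedding(v, dim):
    """Keep the nested prefix of length `dim` and renormalise it."""
    return l2_normalize(v[:dim])


def classify(image_emb, class_embs, dim):
    """Return the class whose truncated prototype has the highest cosine similarity."""
    query = truncate_embedding(image_emb, dim)
    best_name, best_score = None, -math.inf
    for name, proto in class_embs.items():
        candidate = truncate_embedding(proto, dim)
        score = sum(a * b for a, b in zip(query, candidate))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def prototype_bytes(num_classes, dim, bytes_per_value):
    """Storage needed for the class prototype table."""
    return num_classes * dim * bytes_per_value


# Hypothetical precomputed class prototypes and one image embedding.
class_embs = {"cat": [1.0, 0.0, 0.2, 0.0], "dog": [0.0, 1.0, 0.0, 0.2]}
image_emb = [0.9, 0.1, 0.3, 0.0]

pred_small = classify(image_emb, class_embs, dim=2)  # cheaper, coarser prefix
pred_full = classify(image_emb, class_embs, dim=4)   # full embedding

# Assumed quantisation: float32 (4 bytes) versus int8 (1 byte) per value.
fp32_size = prototype_bytes(80, 256, 4)
int8_size = prototype_bytes(80, 256, 1)
print(pred_small, pred_full, fp32_size // int8_size)
```

Choosing a smaller `dim` at deployment time shrinks both the prototype table and the per-class dot product, which is the accuracy-memory trade-off the abstract refers to.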

2603.00132 2026-03-03 cs.CV cs.CY

Predicting Local Climate Zones using Urban Morphometrics and Satellite Imagery

Hugo Majer, Martin Fleischmann


The Local Climate Zone (LCZ) framework is commonly employed to represent urban form in morphological analyses, even though its mapping relies predominantly on satellite imagery. Urban morphometrics, describing urban form via numerical measures of physical aspects and spatial relationships of its elements, offers another avenue. This study evaluates the ability of morphometric assessment to predict LCZs using a) a morphometric-based LCZ prediction, and b) a fusion-based LCZ prediction combining morphometrics with satellite imagery. We calculate 321 2D morphometric attributes from building footprints and street networks, covering their various properties at multiple spatial scales. Subsequently, we develop four classification schemes: morphometric-based prediction, baseline image-based prediction, and two techniques fusing morphometrics with imagery. We evaluate them across five sites. Results from the morphometric-based prediction indicate that the correspondence between 2D urban morphometrics and urban LCZ types is selective and inconsistent, rendering the efficacy of this method site-dependent. Nevertheless, they demonstrate that a much broader range of urban form properties is relevant for distinguishing LCZ types than the standard parameters suggest. Relative to the image-based baseline, the fusion yielded distinct accuracy improvements for urban LCZ types at two sites; however, gains at the remaining sites were negligible or even slightly negative, suggesting that the benefits of fusion are modest and inconsistent. Collectively, these results indicate that the relationship between the LCZs and the measurable, visible aspects of urban form is tenuous; thus, the LCZ framework should be used with caution in morphological studies.

2603.00127 2026-03-03 cs.CV

Segmenting Low-Contrast XCTs of Concretes: An Unsupervised Approach

Kaustav Das, Gaston Rauchs, Jan Sykora, Anna Kucerova


This work tests a self-annotation-based unsupervised methodology for training a convolutional neural network (CNN) model for semantic segmentation of X-ray computed tomography (XCT) scans of concretes. Concrete poses a unique challenge for XCT imaging due to the similar X-ray attenuation coefficients of aggregates and mortar, resulting in low contrast between the two phases in the ensuing images. While CNN-based models are a proven technique for semantic segmentation in such challenging cases, they typically require labeled training data, which are often unavailable for new datasets or costly to obtain. To counter that limitation, a self-annotation technique is used here which leverages superpixel algorithms to identify perceptually similar local regions in an image and relates them to the global context in the image by utilizing the receptive field of a CNN-based model. This enables the model to learn a global-local relationship in the images and to identify semantically similar structures. We present the performance of the unsupervised training methodology on our XCT datasets and discuss potential avenues for further improvements.

2603.00126 2026-03-03 cs.CV cs.AI cs.IR cs.MM cs.PF cs.SY eess.SY

QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference

Miao Zhang, Ruixiao Zhang, Jianxin Shi, Hengzhi Wang, Hao Fang, Jiangchuan Liu


Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.

2603.00123 2026-03-03 cs.CV cs.AI

CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers

Yannian Gu, Xizhuo Zhang, Linjie Mu, Yongrui Yu, Zhongzhen Huang, Shaoting Zhang, Xiaofan Zhang

Comments submitting to ACL 2026


Recent advances in Large Vision-Language Models (LVLMs) have shown strong potential for multi-modal radiological reasoning, particularly in tasks like diagnostic visual question answering (VQA) and radiology report generation. However, most existing approaches for 3D CT analysis largely rely on static, single-pass inference. In practice, clinical interpretation is a dynamic, tool-mediated workflow where radiologists iteratively review slices and use measurement, radiomics, and segmentation tools to refine findings. To bridge this gap, we propose CT-Flow, an agentic framework designed for interoperable volumetric interpretation. By leveraging the Model Context Protocol (MCP), CT-Flow shifts from closed-box inference to an open, tool-aware paradigm. We curate CT-FlowBench, the first large-scale instruction-tuning benchmark tailored for 3D CT tool-use and multi-step reasoning. Built upon this, CT-Flow functions as a clinical orchestrator capable of decomposing complex natural language queries into automated tool-use sequences. Experimental evaluations on CT-FlowBench and standard 3D VQA datasets demonstrate that CT-Flow achieves state-of-the-art performance, surpassing baseline models by 41% in diagnostic accuracy and achieving a 95% success rate in autonomous tool invocation. This work provides a scalable foundation for integrating autonomous, agentic intelligence into real-world clinical radiology.

2603.00122 2026-03-03 cs.CV cs.AI cs.IR

NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence

Aman Ulla

Comments 17 pages, 10 figures, 5 tables


Document extraction is an important step before retrieval-augmented generation (RAG), knowledge bases, and downstream generative AI can work. It turns unstructured documents like PDFs and scans into structured text and layout-aware representations. We introduce NovaLAD, a comprehensive document parsing system that integrates two concurrent YOLO object detection models, one for element detection and one for layout detection, with rule-based grouping and optional vision-language enhancement. An incoming page image is first passed through both models in parallel. The element model finds semantic content such as titles, headers, text, tables, and images, and the layout model finds structural regions such as layout_box, column_group, multi_column, and row_group. A key design decision is to first send each image or figure through an image classifier (ViT) that decides whether it is relevant. Only relevant images are then submitted to the Vision LLM for title, summary, and structured information, which cuts down on noise and cost. NovaLAD is built for speed: it runs on CPU, employs parallel execution for detection, classification, OCR, and conversion, and generates several output formats, including structured JSON, Markdown, RAG-ready text, and knowledge graphs. We test on the DP-Bench benchmark (upstage/dp-bench) and obtain 96.49% TEDS and 98.51% NID, which is better than both commercial and open-source parsers. This paper explains the extraction process, the architecture, the data flow, and how NovaLAD is made both accurate and usable without a GPU.
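The control flow described in this abstract (two detectors run concurrently, then a relevance gate in front of the vision LLM) can be sketched with the standard library alone. Everything below is a hypothetical stand-in: the function names, the returned dictionaries, and the relevance rule are assumptions made for illustration and do not reflect NovaLAD's actual API or models.

```python
from concurrent.futures import ThreadPoolExecutor


def detect_elements(page):
    """Hypothetical stand-in for the element-detection model."""
    return [
        {"type": "title", "bbox": (0, 0, 100, 20)},
        {"type": "image", "bbox": (0, 30, 100, 90), "id": "fig1"},
        {"type": "image", "bbox": (90, 0, 100, 10), "id": "logo"},
    ]


def detect_layout(page):
    """Hypothetical stand-in for the layout-detection model."""
    return [{"type": "column_group", "bbox": (0, 0, 100, 100)}]


def is_relevant(image_region):
    """Hypothetical stand-in for the image relevance classifier."""
    return image_region["id"] != "logo"


def describe_image(image_region):
    """Hypothetical stand-in for the (expensive) vision LLM call."""
    return {"id": image_region["id"], "summary": "placeholder summary"}


def parse_page(page):
    # Run both detectors on the same page concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        elements_future = pool.submit(detect_elements, page)
        layout_future = pool.submit(detect_layout, page)
        elements = elements_future.result()
        layout = layout_future.result()
    # Gate the expensive step: only relevant images reach the vision LLM.
    images = [e for e in elements if e["type"] == "image"]
    described = [describe_image(img) for img in images if is_relevant(img)]
    return {"elements": elements, "layout": layout, "images": described}


result = parse_page("page-1")
print([d["id"] for d in result["images"]])
```

The gate is the part that saves cost: in this toy run the decorative logo is filtered out before the stubbed vision LLM call, so only one image is described.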

2603.00119 2026-03-03 cs.CV

BiSe-Unet: A Lightweight Dual-path U-Net with Attention-refined Context for Real-time Medical Image Segmentation

M Iffat Hossain, Laura Brattain

Comments Submitted to IEEE EMBC 2026. This work has been submitted to the IEEE for possible publication


During image-guided procedures, real-time image segmentation is often required. This demands lightweight AI models that can operate on resource-constrained devices. One important use case is endoscopy-guided colonoscopy, where polyps must be detected in real time. The Kvasir-Seg dataset, a publicly available benchmark for this task, contains 1,000 high-resolution endoscopic images of polyps with corresponding pixel-level segmentation masks. Achieving real-time inference speed for clinical deployment in constrained environments requires highly efficient and lightweight network architectures. However, many existing models remain too computationally intensive for embedded deployment. Lightweight architectures, although faster, often suffer from reduced spatial precision and weaker contextual understanding, leading to degraded boundary quality and reduced diagnostic reliability. To address these challenges, we introduce BiSe-UNet, a lightweight dual-path U-Net that integrates an attention-refined context path with a shallow spatial path for detailed feature preservation, followed by a depthwise separable decoder for efficient reconstruction. Evaluated on the Kvasir-Seg dataset, BiSe-UNet achieves competitive Dice and IoU scores while sustaining real-time throughput exceeding 30 FPS on Raspberry Pi 5, demonstrating its effectiveness for accurate, lightweight, and deployable medical image segmentation on edge hardware.

2603.00118 2026-03-03 cs.CV

Efficient Image Super-Resolution with Multi-Scale Spatial Adaptive Attention Networks

Sushi Rao, Jingwei Li


This paper introduces a lightweight image super-resolution (SR) network, termed the Multi-scale Spatial Adaptive Attention Network (MSAAN), to address the common dilemma between high reconstruction fidelity and low model complexity in existing SR methods. The core of our approach is a novel Multi-scale Spatial Adaptive Attention Module (MSAA), designed to jointly model fine-grained local details and long-range contextual dependencies. The MSAA comprises two synergistic components: a Global Feature Modulation Module (GFM) that learns coherent texture structures through differential feature extraction, and a Multi-scale Feature Aggregation Module (MFA) that adaptively fuses features from local to global scales using pyramidal processing. To further enhance the network's capability, we propose a Local Enhancement Block (LEB) to strengthen local geometric perception and a Feature Interactive Gated Feed-Forward Module (FIGFF) to improve nonlinear representation while reducing channel redundancy. Extensive experiments on standard benchmarks (Set5, Set14, B100, Urban100, Manga109) across $\times2$, $\times3$, and $\times4$ scaling factors demonstrate that both our lightweight (MSAAN-light) and standard (MSAAN) versions achieve superior or competitive performance in terms of PSNR and SSIM, while maintaining significantly lower parameters and computational costs than state-of-the-art methods. Ablation studies validate the contribution of each component, and visual results show that MSAAN reconstructs sharper edges and more realistic textures.

2603.00116 2026-03-03 cs.CV

VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation

Takumi Hachimine, Yuhwan Kwon, Cheng-Yu Kuo, Tomoya Yamanokuchi, Takamitsu Matsubara

Comments 11 pages


Non-destructive extraction of target internal parts, such as batteries and motors, by cutting surrounding structures is crucial at recycling and disposal sites. However, the diversity of products and the lack of information on disassembly procedures make it challenging to decide where to cut. This study explores a method for non-destructive extraction of a target internal part that iteratively estimates the internal structure from observed cutting surfaces and formulates cutting plans based on the estimation results. A key requirement is to estimate the probability of the target part's presence from partial observations. However, learning conditional generative models for this task is challenging: the high dimensionality of 3D shape representations makes learning difficult, and conventional models (e.g., conditional variational autoencoders) often fail to capture multi-modal predictive uncertainty due to mode collapse, resulting in overconfident predictions. To address these issues, we propose VoxelDiffusionCut, which iteratively estimates the internal structure represented as voxels using a diffusion model and plans cuts for non-destructive extraction of the target internal part based on the estimation results. The voxel representation allows the model to predict only attributes at fixed grid positions, i.e., types of constituent parts, making learning more tractable. The diffusion model completes the voxel representation conditioned on observed cutting surfaces, capturing uncertainty in unobserved regions to avoid erroneous cuts. Experimental results in simulation suggest that the proposed method can estimate internal structures from observed cutting surfaces and enable non-destructive extraction of the target internal part by leveraging the estimated uncertainty.

2603.00114 2026-03-03 cs.CV

Automated Quality Check of Sensor Data Annotations

Niklas Freund, Zekiye Ilknur-Öz, Tobias Klockau, Patrick Naumann, Philipp Neumaier, Martin Köppel

Journal ref Proceeding of 4th IEEE International Conference on Consumer Electronics (ICCE), Berlin, Germany, September, 2025


The monitoring of the route and track environment plays an important role in automated driving. For example, it can be used as an assistance system for route monitoring at Grade of Automation (GoA) 2, where the train driver is still on board. In fully automated, driverless driving at GoA4, these systems take over environment monitoring completely independently. With the help of artificial intelligence (AI), they react automatically to risks and dangerous events on the route. To train such AI algorithms, large amounts of training data are required, which must meet high quality standards due to their safety relevance. In this publication, we present an automatic method for assuring the quality of training data, significantly reducing the manual workload and accelerating the development of these systems. We propose an open-source tool designed to detect nine common errors found in multi-sensor datasets for railway vehicles. To evaluate the performance of the framework, all detected errors were manually validated. Six issue detection methods achieved 100% precision, while three additional methods reached precision rates of 96% and 97%.

2603.00110 2026-03-03 cs.RO

Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

Zijian Song, Qichang Li, Sihan Qin, Yuhao Chen, Tianshui Chen, Liang Lin, Guangrun Wang

Comments 11 pages, 6 figures


The scarcity of large-scale robotic data has motivated the repurposing of foundation models from other modalities for policy learning. In this work, we introduce PhysGen (Learning Physics from Pretrained Video Generation Models), a scalable continuous and sequential world interaction framework that leverages autoregressive video generation to solve robotic manipulation tasks. By treating the pretrained video model as a proxy for a physics simulator, PhysGen models the dynamic interplay between the external environment and robot actions. We introduce a multimodal continuous representation that unifies video and action into shared physical tokens, bridging the gap between discrete video generation and continuous robotic control. This approach enables the seamless transfer of implicit physical knowledge, such as object permanence and dynamics, from video pretraining to downstream manipulation. To ensure efficient convergence, we incorporate causal masking, inverse kinematics, Lookahead Multi-Token Prediction (L-MTP), and key-value (KV) caching. Experimental results on the Libero and ManiSkill benchmarks demonstrate that PhysGen consistently outperforms robust baselines, surpassing OpenVLA and WorldVLA by margins of 13.8% and 8.8%, respectively. Notably, in real-world scenarios, PhysGen matches the performance of large-scale action-pretrained models like $\pi_0$ without requiring prior action-specific pretraining, demonstrating superior capability in physically complex tasks such as grasping transparent objects. These findings validate the potential of extracting physical intuition from pretrained video generators to facilitate generalizable robotic manipulation.

2603.00108 2026-03-03 cs.RO cs.AI

SurgFusion-Net: Diversified Adaptive Multimodal Fusion Network for Surgical Skill Assessment

Runlong He, Freweini M. Tesfai, Matthew W. E. Boal, Nazir Sirajudeen, Dimitrios Anastasiou, Jialang Xu, Mobarak I. Hoque, Philip J. Edwards, John D. Kelly, Ashwin Sridhar, Abdolrahim Kadkhodamohammadi, Dhivya Chandrasekaran, Matthew J. Clarkson, Danail Stoyanov, Nader Francis, Evangelos B. Mazomenos


Robotic-assisted surgery (RAS) is established in clinical practice, and automated surgical skill assessment utilizing multimodal data offers transformative potential for surgical analytics and education. However, developing effective multimodal methods remains challenging due to the task complexity, limited annotated datasets, and insufficient techniques for cross-modal information fusion. The existing state of the art relies exclusively on RGB video and applies only to dry-lab settings, failing to address the significant domain gap between controlled simulation and real clinical cases, where the surgical environment together with camera and tissue motion introduces substantial complexities. This work introduces SurgFusion-Net and Divergence Regulated Attention (DRA), an innovative fusion strategy for multimodal surgical skill assessment. We contribute two first-of-their-kind clinical datasets: the RAH-skill dataset containing 279,691 RGB frames from 37 videos of Robot-assisted Hysterectomy (RAH), and the RARP-skill dataset containing 70,661 RGB frames from 33 videos of Robot-Assisted Radical Prostatectomy (RARP). Both datasets include M-GEARS skill annotations and the corresponding optical flow and tool segmentation masks. DRA incorporates adaptive dual attention and diversity-promoting multi-head attention to fuse information from three modalities based on surgical context, enhancing assessment accuracy and reliability. Validated on the JIGSAWS benchmark, RAH-skill, and RARP-skill datasets, our approach outperforms recent baselines with SCC improvements of 0.02 in LOSO, 0.04 in LOUO across JIGSAWS tasks, and 0.0538 and 0.0493 gains on RAH-skill and RARP-skill, respectively.

2603.00105 2026-03-03 cs.LG cs.CL stat.ME stat.ML

LIDS: LLM Summary Inference Under the Layered Lens

Dylan Park, Yingying Fan, Jinchi Lv

Comments 48 pages, 15 figures

English abstract

Large language models (LLMs) have gained significant attention from many researchers and practitioners in natural language processing (NLP) since the introduction of ChatGPT in 2022. One notable feature of ChatGPT is its ability to generate summaries based on prompts. Yet evaluating the quality of these summaries remains challenging due to the complexity of language. To this end, in this paper we suggest a new method of LLM summary inference with BERT-SVD-based direction metric and SOFARI (LIDS) that assesses summary accuracy and provides interpretable key words for layered themes. LIDS uses a latent SVD-based direction metric to measure the similarity between the summaries and the original text, leveraging BERT embeddings and repeated prompts to quantify the statistical uncertainty. As a result, LIDS gives a natural embedding of each summary for large text reduction. We further exploit SOFARI to uncover important key words associated with each latent theme in the summary with controlled false discovery rate (FDR). Comprehensive empirical studies demonstrate the practical utility and robustness of LIDS through human verification and comparisons to other similarity metrics, including a comparison of different LLMs.

2603.00103 2026-03-03 cs.RO cs.SY eess.SY

Autonomous Block Assembly for Boom Cranes with Passive Joint Dynamics: Integrated Vision MPC Control

Gerald Ebmer, Minh Nhat Vu, Tobias Glück, Wolfgang Kemmetmüller

English abstract

This paper presents an autonomous control framework for articulated boom cranes performing prefabricated block assembly in construction environments. The key challenge addressed is precise placement control under passive joint dynamics that cause pendulum-like sway, complicating the accurate positioning of building components. Our integrated approach combines real-time vision-based pose estimation of building blocks, collision-aware B-spline path planning, and nonlinear model predictive control (NMPC) to achieve autonomous pickup, placement, and obstacle-avoidance assembly operations. The framework is validated on a laboratory-scale testbed that emulates crane kinematics and passive dynamics while enabling rapid experimentation. The collision-aware planner generates feasible B-spline references in real time on CPU hardware with anytime performance, while the NMPC controller actively suppresses passive joint sway and tracks the planned trajectory under continuous vision feedback. Experimental results demonstrate autonomous block stacking and obstacle-avoidance assembly, with sway damping reducing settling times by more than an order of magnitude compared to uncontrolled passive dynamics, confirming the real-time feasibility of the integrated approach for construction automation.

2603.00102 2026-03-03 cs.RO

Designing Social Robots with Ethical, User-Adaptive Explainability in the Era of Foundation Models

Fethiye Irmak Dogan, Alva Markelius, Hatice Gunes

Comments Companion Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction

English abstract

Foundation models are increasingly embedded in social robots, mediating not only what they say and do but also how they adapt to users over time. This shift renders traditional "one-size-fits-all" explanation strategies especially problematic: generic justifications are now wrapped around behaviour produced by models trained on vast, heterogeneous, and opaque datasets. We argue that ethical, user-adapted explainability must be treated as a core design objective for foundation-model-driven social robotics. We first identify open challenges around explainability and ethical concerns that arise when both adaptation and explanation are delegated to foundation models. Building on this analysis, we propose four recommendations for moving towards user-adapted, modality-aware, and co-designed explanation strategies grounded in smaller, fairer datasets. An illustrative use case of an LLM-driven socially assistive robot demonstrates how these recommendations might be instantiated in a sensitive, real-world domain.

2603.00101 2026-03-03 cs.LG eess.SP

Wideband Power Amplifier Behavioral Modeling Using an Amplitude Conditioned LSTM

Abdelrahman Abdelsalam, You Fei

Comments 7 Pages, 6 Figures

English abstract

Wideband power amplifiers exhibit complex nonlinear and memory effects that challenge traditional behavioral modeling approaches. This paper proposes a novel amplitude conditioned long short-term memory (AC-LSTM) network that introduces explicit amplitude-dependent gating to enhance the modeling of wideband PA dynamics. The architecture incorporates a Feature-wise Linear Modulation (FiLM) layer that conditions the LSTM's forget gate on the instantaneous input amplitude, providing a physics-aware inductive bias for capturing amplitude-dependent memory effects. Experimental validation using a 100 MHz 5G NR signal and a GaN PA demonstrates that the proposed AC-LSTM achieves a normalized mean square error (NMSE) of -41.25 dB, representing a 1.15 dB improvement over standard LSTM and 7.45 dB improvement over augmented real-valued time-delay neural network (ARVTDNN) baselines. The model also closely matches the measured PA's spectral characteristics with an adjacent channel power ratio (ACPR) of -28.58 dB. These results show the effectiveness of amplitude conditioning for improving both time-domain accuracy and spectral fidelity in wideband PA behavioral modeling.
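
For readers unfamiliar with the NMSE figure quoted in this abstract, the following is a minimal sketch of how NMSE in dB is commonly computed for PA behavioral models. The function name `nmse_db` and the toy signal are illustrative assumptions; the abstract does not give the authors' exact evaluation code.

```python
import math

def nmse_db(measured, modeled):
    """Normalized mean square error in dB (assumed standard definition):
    10*log10( sum|y - y_hat|^2 / sum|y|^2 ), for real or complex samples."""
    if len(measured) != len(modeled):
        raise ValueError("sequences must have the same length")
    err_power = sum(abs(y - y_hat) ** 2 for y, y_hat in zip(measured, modeled))
    ref_power = sum(abs(y) ** 2 for y in measured)
    # A perfect model (err_power == 0) makes log10 raise ValueError.
    return 10.0 * math.log10(err_power / ref_power)

# Toy example: a model output that is off by 1% in amplitude everywhere.
measured = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j]
modeled = [1.01 * y for y in measured]
print(round(nmse_db(measured, modeled), 2))  # a 1% amplitude error is about -40 dB
```

Because the scale is logarithmic, the reported 1.15 dB improvement corresponds to an error-power ratio of about 10^(-0.115), roughly 0.77.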

2603.00099 2026-03-03 cs.LG cs.AI cs.NE

SEval-NAS: A Search-Agnostic Evaluation for Neural Architecture Search

Atah Nuh Mih, Jianzhou Wang, Truong Thanh Hung Nguyen, Hung Cao

Comments To be published in the Proceedings of The 41st ACM/SIGAPP Symposium on Applied Computing (SAC26)

English abstract

Neural architecture search (NAS) automates the discovery of neural networks that meet specified criteria, yet its evaluation procedures are often hardcoded, limiting the ability to introduce new metrics. This issue is especially pronounced in hardware-aware NAS, where objectives depend on target devices such as edge hardware. To address this limitation, we propose SEval-NAS, a metric-evaluation mechanism that converts architectures to strings, embeds them as vectors, and predicts performance metrics. Using NATS-Bench and HW-NAS-Bench, we evaluated accuracy, latency, and memory. Kendall's $\tau$ correlations showed stronger latency and memory predictions than accuracy, indicating the suitability of SEval-NAS as a hardware cost predictor. We further integrated SEval-NAS into FreeREA to evaluate metrics not originally included. The method successfully ranked FreeREA-generated architectures, maintained search time, and required minimal algorithmic changes. Our implementation is available at: https://github.com/Analytics-Everywhere-Lab/neural-architecture-search
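
As an illustration of the rank correlation mentioned in this abstract, here is a minimal sketch of Kendall's tau between predicted and measured metric values. It implements the tau-a variant (no tie correction); which variant the authors used is not stated, so this choice and the name `kendall_tau_a` are assumptions.

```python
from itertools import combinations

def kendall_tau_a(predicted, measured):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.
    Tied pairs count as neither concordant nor discordant."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (p1, m1), (p2, m2) in combinations(zip(predicted, measured), 2):
        sign = (p1 - p2) * (m1 - m2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(predicted)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical latency predictions vs. measurements for four architectures.
print(kendall_tau_a([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))  # 1.0
print(kendall_tau_a([1.0, 2.0, 3.0, 4.0], [40.0, 30.0, 20.0, 10.0]))  # -1.0
```

A value near 1 means the predictor ranks architectures in nearly the same order as the ground-truth metric, which is what matters when a search only needs to compare candidates.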

2603.00070 2026-03-03 cs.LG cs.CV

Certainty-Validity: A Diagnostic Framework for Discrete Commitment Systems

Datorien L. Anderson

Comments 18 pages, 1 figure, full experiment data can be found: https://zenodo.org/records/18530003

English abstract

Standard evaluation metrics for machine learning -- accuracy, precision, recall, and AUROC -- assume that all errors are equivalent: a confident incorrect prediction is penalized identically to an uncertain one. For discrete commitment systems (architectures that select committed states {-W, 0, +W}), this assumption is epistemologically flawed. We introduce the Certainty-Validity (CVS) Framework, a diagnostic method that decomposes model performance into a 2x2 matrix distinguishing high/low certainty from valid/invalid predictions. This framework reveals a critical failure mode hidden by standard accuracy: Confident-Incorrect (CI) behavior, where models hallucinate structure in ambiguous data. Through ablation experiments on Fashion-MNIST, EMNIST, and IMDB, we analyze the "83% Ambiguity Ceiling" -- a stopping point where this specific discrete architecture consistently plateaus on noisy benchmarks. Unlike continuous models that can surpass this ceiling by memorizing texture or statistical noise, the discrete model refuses to commit to ambiguous samples. We show that this refusal is not a failure but a feature: the model stops where structural evidence ends. However, standard training on ambiguous data eventually forces Benign Overfitting, causing a pathological migration from Uncertain-Incorrect (appropriate doubt) to Confident-Incorrect (hallucination). We propose that "good training" for reasoning systems must be defined not by accuracy, but by maximizing the Certainty-Validity Score (CVS) -- ensuring the model knows where to stop.
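
The 2x2 decomposition described in this abstract can be sketched as follows. The confidence threshold, the cell labels other than Confident-Incorrect and Uncertain-Incorrect, and the function name are assumptions for illustration; the abstract does not define how certainty is measured or how the CVS score is computed, so no score is derived here.

```python
from collections import Counter

def certainty_validity_matrix(confidences, predictions, labels, threshold=0.5):
    """Count samples in each (certainty, validity) cell.
    `threshold` is a hypothetical cut-off separating confident from uncertain."""
    counts = Counter()
    for conf, pred, label in zip(confidences, predictions, labels):
        certainty = "confident" if conf >= threshold else "uncertain"
        validity = "correct" if pred == label else "incorrect"
        counts[(certainty, validity)] += 1
    return counts

# Toy data: two models with the same accuracy can differ in the CI cell.
matrix = certainty_validity_matrix(
    confidences=[0.9, 0.8, 0.3, 0.2],
    predictions=[1, 0, 1, 0],
    labels=[1, 1, 1, 1],
)
print(matrix[("confident", "incorrect")])  # 1
print(matrix[("uncertain", "incorrect")])  # 1
```

Accuracy alone would report 50% here; the matrix additionally separates the error made with high confidence from the one made with appropriate doubt.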

2603.00067 2026-03-03 cs.LG

A Representation-Consistent Gated Recurrent Framework for Robust Medical Time-Series Classification

Maitri Krishna Sai

Comments 7 pages, 1 figure. Preprint

English abstract

Medical time-series data are characterized by irregular sampling, high noise levels, missing values, and strong inter-feature dependencies. Recurrent neural networks (RNNs), particularly gated architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), are widely used for modeling such data due to their ability to capture temporal dependencies. However, standard gated recurrent models do not explicitly constrain the evolution of latent representations over time, leading to representation drift and instability under noisy or incomplete inputs. In this work, we propose a representation-consistent gated recurrent framework (RC-GRF) that introduces a principled regularization strategy to enforce temporal consistency in hidden-state representations. The proposed framework is model-agnostic and can be integrated into existing gated recurrent architectures without modifying their internal gating mechanisms. We provide a theoretical analysis demonstrating how the consistency constraint bounds hidden-state divergence and improves stability. Extensive experiments on medical time-series classification benchmarks show that the proposed approach improves robustness, reduces variance, and enhances generalization performance, particularly in noisy and low-sample settings.

2603.00060 2026-03-03 cs.CV cs.LG

Learning Under Extreme Data Scarcity: Subject-Level Evaluation of Lightweight CNNs for fMRI-Based Prodromal Parkinsons Detection

Naimur Rahman

Comments Methodological case study cs.LG on subject-level evaluation and model capacity under extreme data scarcity; 9 pages, 1 figure. Experiments use 40-subject PPMI fMRI cohort; no external validation

English abstract

Deep learning is often applied in settings where data are limited, correlated, and difficult to obtain, yet evaluation practices do not always reflect these constraints. Neuroimaging for prodromal Parkinson's disease is one such case, where subject numbers are small and individual scans produce many highly related samples. This work examines prodromal Parkinson's detection from resting-state fMRI as a machine learning problem centered on learning under extreme data scarcity. Using fMRI data from 40 subjects, including 20 prodromal Parkinson's cases and 20 healthy controls, ImageNet-pretrained convolutional neural networks are fine-tuned and evaluated under two different data partitioning strategies. Results show that commonly used image-level splits allow slices from the same subject to appear in both training and test sets, leading to severe information leakage and near-perfect accuracy. When a strict subject-level split is enforced, performance drops substantially, yielding test accuracies between 60 and 81 percent. Models with different capacity profiles are compared, including VGG19, Inception V3, Inception ResNet V2, and the lightweight MobileNet V1. Under subject-level evaluation, MobileNet demonstrates the most reliable generalization, outperforming deeper architectures despite having significantly fewer parameters. These results indicate that in extreme low-data regimes, evaluation strategy and model capacity have a greater impact on performance than architectural depth. Although the analysis is limited to a single cohort of 40 subjects and does not include external validation or cross-validation, it provides a concrete case study and practical recommendations for evaluating deep learning models under severe data scarcity.
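
The difference between image-level and subject-level splitting described in this abstract can be made concrete with a short sketch. The function name, the 20% test fraction, and the five-slices-per-subject toy data are assumptions; the abstract does not give the authors' split sizes or code.

```python
import random

def subject_level_split(sample_subjects, test_fraction=0.2, seed=0):
    """Split sample indices so that no subject appears in both sets.
    `sample_subjects[i]` is the subject ID of sample (e.g. slice) i."""
    subjects = sorted(set(sample_subjects))
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(sample_subjects) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(sample_subjects) if s in test_subjects]
    return train_idx, test_idx

# Toy cohort: 40 subjects with 5 slices each (slice count is hypothetical).
sample_subjects = [f"sub-{k:02d}" for k in range(40) for _ in range(5)]
train_idx, test_idx = subject_level_split(sample_subjects)
train_subjects = {sample_subjects[i] for i in train_idx}
test_subjects = {sample_subjects[i] for i in test_idx}
print(train_subjects.isdisjoint(test_subjects))  # True: no subject leaks across sets
```

An image-level split would instead shuffle the 200 slice indices directly, which almost surely places slices from the same subject on both sides and produces the leakage the abstract reports.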

2603.00055 2026-03-03 cs.LG cs.AI

M3-AD: Reflection-aware Multi-modal, Multi-category, and Multi-dimensional Benchmark and Framework for Industrial Anomaly Detection

Chao Huang, Yanhui Li, Yunkang Cao, Wei Wang, Hongxi Huang, Jie Wen, Wenqi Ren, Xiaochun Cao

English abstract

Although multimodal large language models (MLLMs) have advanced industrial anomaly detection toward a zero-shot paradigm, they still tend to produce high-confidence yet unreliable decisions in fine-grained and structurally complex industrial scenarios, and lack effective self-corrective mechanisms. To address this issue, we propose M3-AD, a unified reflection-aware multimodal framework for industrial anomaly detection. M3-AD comprises two complementary data resources: M3-AD-FT, designed for reflection-aligned fine-tuning, and M3-AD-Bench, designed for systematic cross-category evaluation, together providing a foundation for reflection-aware learning and reliability assessment. Building upon this foundation, we propose RA-Monitor, which models reflection as a learnable decision revision process and guides models to perform controlled self-correction when initial judgments are unreliable, thereby improving decision robustness. Extensive experiments conducted on M3-AD-Bench demonstrate that RA-Monitor outperforms multiple open-source and commercial MLLMs in zero-shot anomaly detection and anomaly analysis tasks. Code will be released at https://github.com/Yanhui-Lee/M3-AD.

2603.00054 2026-03-03 cs.LG cs.AI

Expert Divergence Learning for MoE-based Language Models

Jiaang Li, Haibin Chen, Langming Liu, Yujin Yuan, Yadao Wang, Yizhen Zhang, Chengting Yu, Xin Tong, Weidong Zhang, Shilei Liu, Wenbo Su, Bo Zheng

Comments ICLR 2026

English abstract

The Mixture-of-Experts (MoE) architecture is a powerful technique for scaling language models, yet it often suffers from expert homogenization, where experts learn redundant functionalities, thereby limiting MoE's full potential. To address this, we introduce Expert Divergence Learning, a novel pre-training strategy that explicitly encourages functional specialization among experts. Our method incorporates a label-driven auxiliary loss that leverages domain labels inherent in pre-training corpora to maximize the Jensen-Shannon Divergence between the expert routing distributions of different data domains. This optimization objective guides the model to develop diverged routing policies for varied domains and closer routing policies for the same domain, which leads to emergent and organized expert specialization. We validate our approach by pre-training MoE models of up to 15 billion parameters from scratch. Experimental results demonstrate that models trained with Expert Divergence Learning not only achieve a lower language modeling loss but also exhibit significant performance improvements across a diverse range of downstream benchmarks. Further analysis confirms that our method effectively mitigates expert homogenization and brings greater functional specialization, all with negligible computational overhead during training.
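
To make the divergence objective in this abstract concrete, the sketch below computes the Jensen-Shannon divergence between the mean expert-routing distributions of two data domains and turns it into a term to minimize. How the authors aggregate routing probabilities and weight the auxiliary loss is not stated in the abstract, so `mean_routing` and `divergence_loss` are hypothetical helpers.

```python
import math

def mean_routing(token_routing):
    """Average per-token routing probabilities into one distribution over experts."""
    n_tokens = len(token_routing)
    n_experts = len(token_routing[0])
    return [sum(tok[e] for tok in token_routing) / n_tokens for e in range(n_experts)]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def divergence_loss(routing_a, routing_b):
    """Hypothetical auxiliary term: negative JSD between two domains,
    so that minimizing it pushes their routing distributions apart."""
    return -js_divergence(mean_routing(routing_a), mean_routing(routing_b))

# Toy routing probabilities over two experts for tokens from two domains.
code_tokens = [[0.9, 0.1], [0.7, 0.3]]
prose_tokens = [[0.2, 0.8], [0.4, 0.6]]
print(divergence_loss(code_tokens, prose_tokens) < 0)  # True: the domains already differ
```

The loss is zero when two domains route identically and approaches -log(2) when they use disjoint experts, which matches the abstract's goal of diverged routing across domains.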