arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.26766 2026-03-31 cs.CV

JND-Guided Neural Watermarking with Spatial Transformer Decoding for Screen-Capture Robustness

Jiayi Qin, Jingwei Li, Chuan Wu

Abstract

Screen-shooting robust watermarking aims to imperceptibly embed extractable information into host images such that the watermark survives the complex distortion pipeline of screen display and camera recapture. However, achieving high extraction accuracy while maintaining satisfactory visual quality remains an open challenge, primarily because the screen-shooting channel introduces severe and entangled degradations including Moiré patterns, color-gamut shifts, perspective warping, and sensor noise. In this paper, we present an end-to-end deep learning framework that jointly optimizes watermark embedding and extraction for screen-shooting robustness. Our framework incorporates three key innovations: (i) a comprehensive noise simulation layer that faithfully models realistic screen-shooting distortions -- notably including a physically-motivated Moiré pattern generator -- enabling the network to learn robust representations against the full spectrum of capture-channel noise through adversarial training; (ii) a Just Noticeable Distortion (JND) perceptual loss function that adaptively modulates watermark embedding strength by supervising the perceptual discrepancy between the JND coefficient map and the watermark residual, thereby concentrating watermark energy in perceptually insensitive regions to maximize visual quality; and (iii) two complementary automatic localization modules -- a semantic-segmentation-based foreground extractor for captured image rectification and a symmetric noise template mechanism for anti-cropping region recovery -- that enable fully automated watermark decoding under realistic deployment conditions. Extensive experiments demonstrate that our method achieves an average PSNR of 30.94 dB and SSIM of 0.94 on watermarked images while embedding 127-bit payloads.
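
For readers unfamiliar with the PSNR figure quoted in this abstract, the standard definition is PSNR = 10 * log10(MAX^2 / MSE). The following is a minimal sketch of that formula in plain Python, assuming 8-bit grayscale images stored as nested lists of equal size. The function name `psnr` and the data layout are illustrative assumptions, not the authors' evaluation code.

```python
import math


def psnr(original, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are lists of rows, each row a list of pixel intensities.
    Returns math.inf when the images are identical (zero error).
    """
    squared_error = 0.0
    pixel_count = 0
    for row_a, row_b in zip(original, distorted):
        for a, b in zip(row_a, row_b):
            squared_error += (a - b) ** 2
            pixel_count += 1
    if pixel_count == 0:
        raise ValueError("images must contain at least one pixel")
    mse = squared_error / pixel_count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the watermarked image is closer to the host image; a mean squared error of 1 on an 8-bit scale corresponds to roughly 48 dB.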

2603.26765 2026-03-31 cs.AI

Bitboard version of Tetris AI

Xingguo Chen, Pingshou Xiong, Zhenyu Luo, Mengfei Hu, Xinwen Li, Yongzhou Lü, Guang Yang, Chao Li, Shangdong Yang

Abstract

The efficiency of game engines and policy optimization algorithms is crucial for training reinforcement learning (RL) agents in complex sequential decision-making tasks, such as Tetris. Existing Tetris implementations suffer from low simulation speeds, suboptimal state evaluation, and inefficient training paradigms, limiting their utility for large-scale RL research. To address these limitations, this paper proposes a high-performance Tetris AI framework based on bitboard optimization and improved RL algorithms. First, we redesign the Tetris game board and tetrominoes using bitboard representations, leveraging bitwise operations to accelerate core processes (e.g., collision detection, line clearing, and Dellacherie-Thiery feature extraction) and achieve a 53-fold speedup compared to OpenAI Gym-Tetris. Second, we introduce an afterstate-evaluating actor network that simplifies state value estimation by leveraging the afterstate property of Tetris, outperforming traditional action-value networks with fewer parameters. Third, we propose a buffer-optimized Proximal Policy Optimization (PPO) algorithm that balances sampling and update efficiency, achieving an average score of 3,829 on 10x10 grids within 3 minutes. Additionally, we develop a Python-Java interface compliant with the OpenAI Gym standard, enabling seamless integration with modern RL frameworks. Experimental results demonstrate that our framework enhances Tetris's utility as an RL benchmark by bridging low-level bitboard optimizations with high-level AI strategies, providing a sample-efficient and computationally lightweight solution for scalable sequential decision-making research.
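
The bitboard idea mentioned in this abstract can be illustrated with a small sketch. It assumes one integer bitmask per board row (bit 0 is the leftmost column) and a piece given as a list of row masks. The layout, the constants, and the function names `collides`, `place`, and `clear_lines` are hypothetical choices for illustration; the paper's actual encoding may differ.

```python
WIDTH = 10
HEIGHT = 10
FULL_ROW = (1 << WIDTH) - 1  # bitmask with all WIDTH cells occupied


def collides(board, piece_rows, top, shift):
    """Return True if the piece overlaps filled cells or leaves the board.

    board: list of HEIGHT ints, index 0 is the top row.
    piece_rows: row bitmasks of the piece, anchored at column 0.
    top: board row of the piece's first row; shift: column offset.
    """
    if shift < 0:
        return True
    for i, mask in enumerate(piece_rows):
        row = top + i
        shifted = mask << shift
        if row < 0 or row >= HEIGHT:
            return True  # above or below the board
        if shifted > FULL_ROW:
            return True  # a bit fell past the last column
        if board[row] & shifted:
            return True  # overlaps an occupied cell
    return False


def place(board, piece_rows, top, shift):
    """Return a new board with the piece OR-ed into its rows."""
    new_board = list(board)
    for i, mask in enumerate(piece_rows):
        new_board[top + i] |= mask << shift
    return new_board


def clear_lines(board):
    """Drop full rows and pad empty rows on top; return (board, count)."""
    kept = [row for row in board if row != FULL_ROW]
    cleared = len(board) - len(kept)
    return [0] * cleared + kept, cleared
```

Each collision test costs one AND per piece row instead of one comparison per cell, which is the kind of saving a bitboard representation is meant to provide.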

2603.26764 2026-03-31 cs.CV

Low Dose CT for Stroke Diagnosis: A Dual Pipeline Deep Learning Framework for Portable Neuroimaging

Rhea Ghosal, Ronok Ghosal, Eileen Lou

Comments 13 pages, 4 figures, 3 tables. Includes dose-level evaluation and robustness stress tests (motion and ring artifacts). Code and dataset based on RSNA Intracranial Hemorrhage Detection

Abstract

Portable CT scanners enable early stroke detection in prehospital and low-resource settings but require reduced radiation doses, introducing noise that degrades diagnostic reliability. We present a deep learning framework for stroke classification from simulated low-dose CT (LDCT) brain scans for AI-assisted triage in mobile clinical environments. Controlled Poisson noise is applied to high-dose CT images to simulate realistic LDCT conditions. We compare two pipelines: (1) direct classification of noisy LDCT images and (2) denoising followed by classification. Performance is evaluated across multiple dose levels using accuracy, sensitivity, and AUC. While denoising improves perceptual image quality, it does not consistently improve classification. In several settings, direct classification yields higher sensitivity, revealing a trade-off between perceptual quality and diagnostic utility. The best denoise-then-classify pipeline achieves 0.94 AUC and 0.91 accuracy at moderate dose levels, outperforming direct classification by up to 6% in select cases. This work establishes a reproducible baseline for LDCT stroke triage using hemorrhagic stroke data (RSNA dataset) and highlights the need for validation on ischemic cohorts and real-world portable CT systems.

2603.26761 2026-03-31 cs.CV cs.AI

Tiny-ViT: A Compact Vision Transformer for Efficient and Explainable Potato Leaf Disease Classification

Shakil Mia, Umme Habiba, Urmi Akter, SK Rezwana Quadir Raisa, Jeba Maliha, Md. Iqbal Hossain, Md. Shakhauat Hossan Sumon

Comments Accepted and Presented Paper at the 2026 IEEE International Conference on Electrical, Computer and Telecommunication Engineering, Rajshahi, Bangladesh

Abstract

Early and precise identification of plant diseases, especially in potato crops, is important for keeping crops healthy and maximizing yield. Potato leaf diseases, such as Early Blight and Late Blight, pose significant challenges to farmers, often resulting in yield losses and increased pesticide use. Traditional detection methods are not only time-consuming but also subject to human error, which is why automated and efficient methods are required. This paper introduces Tiny-ViT, a compact and effective Vision Transformer (ViT) for potato leaf disease classification, developed for use in resource-limited systems. The model is tested on a dataset of three classes, namely Early Blight, Late Blight, and Healthy leaves, and the preprocessing procedures include resizing, CLAHE, and Gaussian blur to improve image quality. Tiny-ViT achieves a test accuracy of 99.85% and a mean CV accuracy of 99.82%, which is better than baseline models such as DEIT Small, SWIN Tiny, and MobileViT XS. In addition, the model has a Matthews Correlation Coefficient (MCC) of 0.9990 and a narrow confidence interval (CI) of [0.9980, 0.9995], which indicates high reliability and generalization. Training and inference times are competitive, and the model has low computational cost, making it suitable for real-time applications. Moreover, the interpretability of the model is improved with Grad-CAM, which identifies diseased areas. Altogether, the proposed Tiny-ViT is a robust, efficient, and explainable solution to the problem of plant disease classification.

2603.26760 2026-03-31 cs.CV

An Intelligent Framework for Real-Time Yoga Pose Detection and Posture Correction

Chandramouli Haldar

Abstract

Yoga is widely recognized for improving physical fitness, flexibility, and mental well-being. However, these benefits depend strongly on correct posture execution. Improper alignment during yoga practice can reduce effectiveness and increase the risk of musculoskeletal injuries, especially in self-guided or online training environments. This paper presents a hybrid Edge AI-based framework for real-time yoga pose detection and posture correction. The proposed system integrates lightweight human pose estimation models with biomechanical feature extraction and a CNN-LSTM-based temporal learning architecture to recognize yoga poses and analyze motion dynamics. Joint angles and skeletal features are computed from detected keypoints and compared with reference pose configurations to evaluate posture correctness. A quantitative scoring mechanism is introduced to measure alignment deviations and generate real-time corrective feedback through visual, text-based, and voice-based guidance. In addition, Edge AI optimization techniques such as model quantization and pruning are applied to enable low-latency performance on resource-constrained devices. The proposed framework provides an intelligent and scalable digital yoga assistant that can improve user safety and training effectiveness in modern fitness applications.
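
The step of computing joint angles from keypoints and comparing them with a reference pose can be sketched as follows. This is a minimal illustration assuming 2D keypoints given as (x, y) tuples. The function names, the linear scoring rule, and the 30-degree tolerance are hypothetical and are not taken from the paper.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at keypoint b, formed by the segments b-a and b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("keypoints must be distinct from the joint")
    cosine = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding drift
    return math.degrees(math.acos(cosine))


def alignment_score(measured, reference, tolerance_deg=30.0):
    """Hypothetical score in [0, 1] from the mean absolute angle deviation.

    measured, reference: dicts mapping joint name to angle in degrees.
    The score is 1.0 for a perfect match and falls linearly to 0.0
    once the mean deviation reaches tolerance_deg.
    """
    deviations = [abs(measured[name] - reference[name]) for name in reference]
    mean_deviation = sum(deviations) / len(deviations)
    return max(0.0, 1.0 - mean_deviation / tolerance_deg)
```

For example, `joint_angle(shoulder, elbow, wrist)` gives the elbow flexion angle, and per-joint deviations from the reference could drive the corrective feedback the abstract describes.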

2603.26759 2026-03-31 cs.CV

Physics-Aware Diffusion for LiDAR Point Cloud Densification

Zeping Zhang, Robert Laganière

Abstract

LiDAR perception is severely limited by the distance-dependent sparsity of distant objects. While diffusion models can recover dense geometry, they suffer from prohibitive latency and physical hallucinations manifesting as ghost points. We propose Scanline-Consistent Range-Aware Diffusion, a framework that treats densification as probabilistic refinement rather than generation. By leveraging Partial Diffusion (SDEdit) on a coarse prior, we achieve high-fidelity results in just 156ms. Our novel Ray-Consistency loss and Negative Ray Augmentation enforce sensor physics to suppress artifacts. Our method achieves state-of-the-art results on KITTI-360 and nuScenes, directly boosting off-the-shelf 3D detectors without retraining. Code will be made available.

2603.26757 2026-03-31 cs.RO

Beyond Viewpoint Generalization: What Multi-View Demonstrations Offer and How to Synthesize Them for Robot Manipulation?

Boyang Cai, Qiwei Liang, Jiawei Li, Shihang Weng, Zhaoxin Zhang, Tao Lin, Xiangyu Chen, Wenjie Zhang, Jiaqi Mao, Weisheng Xu, Bin Yang, Jiaming Liang, Junhao Cai, Renjing Xu

Abstract

Does multi-view demonstration truly improve robot manipulation, or merely enhance cross-view robustness? We present a systematic study quantifying the performance gains, scaling behavior, and underlying mechanisms of multi-view data for robot manipulation. Controlled experiments show that, under both fixed and randomized backgrounds, multi-view demonstrations consistently improve single-view policy success and generalization. Performance varies non-monotonically with view coverage, revealing effective regimes rather than a simple "more is better" trend. Notably, multi-view data breaks the scaling limitation of single-view datasets and continues to raise performance ceilings after saturation. Mechanistic analysis shows that multi-view learning promotes manipulation-relevant visual representations, better aligns the action head with the learned feature distribution, and reduces overfitting. Motivated by the importance of multi-view data and its scarcity in large-scale robotic datasets, as well as the difficulty of collecting additional viewpoints in real world settings, we propose RoboNVS, a geometry-aware self-supervised framework that synthesizes novel-view videos from monocular inputs. The generated data consistently improves downstream policies in both simulation and real-world environments.

2603.26756 2026-03-31 cs.CV

GradAttn: Replacing Fixed Residual Connections with Task-Modulated Attention Pathways

Soudeep Ghoshal, Himanshu Buckchash

Comments 14 pages, 5 figures. Under review

Abstract

Deep ConvNets suffer from gradient signal degradation as network depth increases, limiting effective feature learning in complex architectures. ResNet addressed this through residual connections, but these fixed short-circuits cannot adapt to varying input complexity or selectively emphasize task-relevant features across network hierarchies. This study introduces GradAttn, a hybrid CNN-transformer framework that replaces fixed residual connections with attention-controlled gradient flow. By extracting multi-scale CNN features at different depths and regulating them through self-attention, GradAttn dynamically weights shallow texture features and deep semantic representations. For representational analysis, we evaluated three GradAttn variants across eight diverse datasets, ranging from natural images and medical imaging to fashion recognition. Results demonstrate that GradAttn outperforms ResNet-18 on five of eight datasets, achieving up to +11.07% accuracy improvement on FashionMNIST while maintaining comparable network size. Gradient flow analysis reveals that controlled instabilities, introduced by attention, often coincide with improved generalization, challenging the assumption that perfect stability is optimal. Furthermore, positional encoding effectiveness proves dataset dependent, with CNN hierarchies frequently encoding sufficient spatial structure. These findings position attention mechanisms as enablers of learnable gradient control, offering a new paradigm for adaptive representation learning in deep neural architectures.

2603.26754 2026-03-31 cs.CV cs.AI

Generating Synthetic Wildlife Health Data from Camera Trap Imagery: A Pipeline for Alopecia and Body Condition Training Data

David Brundage

Abstract

No publicly available, ML-ready datasets exist for wildlife health conditions in camera trap imagery, creating a fundamental barrier to automated health screening. We present a pipeline for generating synthetic training images depicting alopecia and body condition deterioration in wildlife from real camera trap photographs. Our pipeline constructs a curated base image set from iWildCam using MegaDetector-derived bounding boxes and center-frame-weighted stratified sampling across 8 North American species. A generative phenotype editing system produces controlled severity variants depicting hair loss consistent with mange and emaciation. An adaptive scene-drift quality control system uses a sham prefilter and a decoupled mask-then-score approach with complementary day or night metrics to reject images where the generative model altered the original scene. We frame the pipeline explicitly as a screening data source. From 201 base images across 4 species, we generate 553 QC-passing synthetic variants with an overall pass rate of 83%. A sim-to-real transfer experiment training exclusively on synthetic data and testing on real camera trap images of suspected health conditions achieves 0.85 AUROC, demonstrating that the synthetic data captures visual features sufficient for screening.

2603.26753 2026-03-31 cs.RO

Reasoning Systems for Semantic Navigation in Mobile Robots

Jonathan Crespo, Ramón Barber, O. M. Mozos, Daniel Beßler, Michael Beetz

Comments This is the authors' manuscript. The final published article is available at https://doi.org/10.1109/IROS.2018.8594271

Journal ref
2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018, pp. 5654-5659
Abstract

Semantic navigation is the navigation paradigm in which environmental semantic concepts and their relationships are taken into account to plan the route of a mobile robot. This paradigm facilitates the interaction with humans and the understanding of human environments in terms of navigation goals and tasks. At the high level, a semantic navigation system requires two main components: a semantic representation of the environment, and a reasoner system. This paper focuses on developing a model of the environment using semantic concepts. It presents two solutions for the semantic navigation paradigm. Both systems implement an ontological model. Whilst the first one uses a relational database, the second one is based on KnowRob. Both systems have been integrated in a semantic navigator. We compare both systems at the qualitative and quantitative levels, and present an implementation on a mobile robot as a proof of concept.

2603.26752 2026-03-31 cs.RO cond-mat.mtrl-sci

Functionalization of Situated Robots via Vapour

Kadri-Ann Pankratov, Leonid Zinatullin, Adele Metsniit, Marie Vihmar, Indrek Must

Comments Accepted in 9th IEEE-RAS International Conference on Soft Robotics (Robosoft 2026) as Extended Abstract (preliminary results)

Abstract

Tight matching with the environment is key to effective robot operation in complex settings. Situated robots that build their bodies in situ (e.g. by spinning) are uniquely positioned to exploit their surroundings, yet functionalization of these structures remains an integration challenge: multimaterial spinning requires complex spinneret multiplexing, and mixture doping is limited by additive availability and chemical stability. We propose instead using materials available in the environment to functionalize in situ spun webs, reducing payload and uniquely matching the structure to its surroundings. As a demonstration, we transform an optically scattering PVDF fiber web into an optically absorbing, polypyrrole-grafted structure via pyrrole vapour exposure. Two activator-delivery strategies are shown: liquid infusion into a prefabricated web, and activator pre-embedding in the spinning mixture. Beyond this proof-of-concept, we foresee broader applications including biohybrid robots that exploit bacterial genomes for specific biomolecule synthesis in situ.

2603.26751 2026-03-31 cs.CV

Survey on Remote Sensing Scene Classification: From Traditional Methods to Large Generative AI Models

Qionghao Huang, Can Hu

Comments Accepted in Journal of King Saud University Computer and Information Sciences

Journal ref
Journal of King Saud University Computer and Information Sciences, 2026
Abstract

Remote sensing scene classification has experienced a paradigmatic transformation from traditional handcrafted feature methods to sophisticated artificial intelligence systems that now form the backbone of modern Earth observation applications. This comprehensive survey examines the complete methodological evolution, systematically tracing development from classical texture descriptors and machine learning classifiers through the deep learning revolution to current state-of-the-art foundation models and generative AI approaches. We chronicle the pivotal shift from manual feature engineering to automated hierarchical representation learning via convolutional neural networks, followed by advanced architectures including Vision Transformers, graph neural networks, and hybrid frameworks. The survey provides in-depth coverage of breakthrough developments in self-supervised foundation models and vision-language systems, highlighting exceptional performance in zero-shot and few-shot learning scenarios. Special emphasis is placed on generative AI innovations that tackle persistent challenges through synthetic data generation and advanced feature learning strategies. We analyze contemporary obstacles including annotation costs, multimodal data fusion complexities, interpretability demands, and ethical considerations, alongside current trends in edge computing deployment, federated learning frameworks, and sustainable AI practices. Based on comprehensive analysis of recent advances and gaps, we identify key future research priorities: advancing hyperspectral and multi-temporal analysis capabilities, developing robust cross-domain generalization methods, and establishing standardized evaluation protocols to accelerate scientific progress in remote sensing scene classification systems.

2603.26748 2026-03-31 cs.RO cs.AI

LARD 2.0: Enhanced Datasets and Benchmarking for Autonomous Landing Systems

Yassine Bougacha, Geoffrey Delhomme, Mélanie Ducoffe, Augustin Fuchs, Jean-Brice Ginestet, Jacques Girard, Sofiane Kraiem, Franck Mamalet, Vincent Mussot, Claire Pagetti, Thierry Sammour

Journal ref
13th European Congress of Embedded Real Time Systems (ERTS), Feb 2026, Toulouse, France
Abstract

This paper addresses key challenges in the development of autonomous landing systems, focusing on dataset limitations for supervised training of Machine Learning (ML) models for object detection. Our main contributions include: (1) Enhancing dataset diversity, by advocating for the inclusion of new sources such as BingMap aerial images and Flight Simulator, to widen the generation scope of an existing dataset generator used to produce the dataset LARD; (2) Refining the Operational Design Domain (ODD), addressing issues like unrealistic landing scenarios and expanding coverage to multi-runway airports; (3) Benchmarking ML models for autonomous landing systems, introducing a framework for evaluating the object detection subtask in a complex multi-instance setting, and providing associated open-source models as a baseline for AI models' performance.

2603.26746 2026-03-31 cs.CV cs.LG

TDEC: Deep Embedded Image Clustering with Transformer and Distribution Information

Ruilin Zhang, Haiyang Zheng, Hongpeng Wang

Abstract

Image clustering is a crucial but challenging task in multimedia machine learning. Recently, the combination of clustering with deep learning has achieved promising performance against conventional methods on high-dimensional image data. Unfortunately, existing deep clustering (DC) methods often ignore the importance of information fusion with a global perception field among different image regions when clustering images, especially complex ones. Additionally, the learned features are usually clustering-unfriendly in terms of dimensionality and are based only on simple distance information for the clustering. In this regard, we propose TDEC, a deep embedded image clustering method that, for the first time to our knowledge, jointly considers feature representation, dimensional preference, and robust assignment for image clustering. Specifically, we introduce the Transformer to form a novel module, T-Encoder, to learn discriminative features with global dependency, while using the Dim-Reduction block to build a friendly low-dimensional space favoring clustering. Moreover, the distribution information of embedded features is considered in the clustering process to provide reliable supervised signals for joint training. Our method is robust and allows for more flexibility in data size, the number of clusters, and the context complexity. More importantly, the clustering performance of TDEC is much higher than that of recent competitors. Extensive experiments with state-of-the-art approaches on complex datasets show the superiority of TDEC.

2603.26745 2026-03-31 cs.CV

Motion Semantics Guided Normalizing Flow for Privacy-Preserving Video Anomaly Detection

Yang Liu, Boan Chen, Yuanyuan Meng, Jing Liu, Zhengliang Guo, Wei Zhou, Peng Sun, Hong Chen

Comments Accepted to IEEE ICME 2026

Abstract

As embodied perception systems increasingly bridge digital and physical realms in interactive multimedia applications, the need for privacy-preserving approaches to understand human activities in physical environments has become paramount. Video anomaly detection is a critical task in such embodied multimedia systems for intelligent surveillance and forensic analysis. Skeleton-based approaches have emerged as a privacy-preserving alternative that processes physical world information through abstract human pose representations while discarding sensitive visual attributes such as identity and facial features. However, existing skeleton-based methods predominantly model continuous motion trajectories in a monolithic manner, failing to capture the hierarchical nature of human activities composed of discrete semantic primitives and fine-grained kinematic details, which leads to reduced discriminability when anomalies manifest at different abstraction levels. In this regard, we propose Motion Semantics Guided Normalizing Flow (MSG-Flow) that decomposes skeleton-based VAD into hierarchical motion semantics modeling. It employs vector quantized variational auto-encoder to discretize continuous motion into interpretable primitives, an autoregressive Transformer to model semantic-level temporal dependencies, and a conditional normalizing flow to capture detail-level pose variations. Extensive experiments on benchmarks (HR-ShanghaiTech & HR-UBnormal) demonstrate that MSG-Flow achieves state-of-the-art performance with 88.1% and 75.8% AUC respectively.

2603.26744 2026-03-31 cs.CV

CNMBI: Determining the Number of Clusters Using Center Pairwise Matching and Boundary Filtering

Ruilin Zhang, Haiyang Zheng, Hongpeng Wang

Abstract

One of the main challenges in data mining is choosing the optimal number of clusters without prior information. Notably, existing methods usually follow the philosophy of cluster validation and hence carry underlying assumptions about the data distribution, which prevents their application to complex data such as large-scale images and high-dimensional data from the real world. In this regard, we propose an approach named CNMBI. Leveraging the distribution information inherent in the data space, we cast the target task as a dynamic comparison process between cluster centers regarding positional behavior, without relying on complete clustering results or designing a complex validity index as before. Bipartite graph theory is then employed to efficiently model this process. Additionally, we find that different samples have different confidence levels and therefore actively remove low-confidence ones, which is, for the first time to our knowledge, considered in cluster number determination. CNMBI is robust and allows for more flexibility in the dimension and shape of the target data (e.g., CIFAR-10 and STL-10). Extensive comparison studies with state-of-the-art competitors on various challenging datasets demonstrate the superiority of our method.

2603.26743 2026-03-31 cs.CV cs.AI cs.LG

Steering Sparse Autoencoder Latents to Control Dynamic Head Pruning in Vision Transformers (Student Abstract)

Yousung Lee, Dongsoo Har

Comments 3 pages, 5 figures. Accepted as AAAI 2026 Student Abstract. Includes additional appendix with extended analysis

Journal ref
Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2026), Vol. 40, No. 48, pp. 41263-41265
Abstract

Dynamic head pruning in Vision Transformers (ViTs) improves efficiency by removing redundant attention heads, but existing pruning policies are often difficult to interpret and control. In this work, we propose a novel framework that integrates Sparse Autoencoders (SAEs) with dynamic pruning, leveraging their ability to disentangle dense embeddings into interpretable and controllable sparse latents. Specifically, we train an SAE on the final-layer residual embedding of the ViT and amplify the sparse latents with different strategies to alter pruning decisions. Among them, per-class steering reveals compact, class-specific head subsets that preserve accuracy. For example, for the class bowl, accuracy improves (76% to 82%) while head usage drops (0.72 to 0.33) via heads h2 and h5. These results show that sparse latent features enable class-specific control of dynamic pruning, effectively bridging pruning efficiency and mechanistic interpretability in ViTs.

2603.26742 2026-03-31 cs.CL cs.LG

Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages

Swastik R

Comments 16 pages, 10 figures, 6 tables. Code and data: https://github.com/QuantumByte-01/multilingual-vlm-reasoning-audit Dataset: https://huggingface.co/datasets/Swastikr/multilingual-vlm-reasoning

Abstract

Vision-language models score well on mathematical, scientific, and spatial reasoning benchmarks, yet these evaluations are overwhelmingly English. I present the first cross-lingual visual reasoning audit for Indian languages. 980 questions from MathVista, ScienceQA, and MMMU are translated into Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi using IndicTrans2, with Gemini 2.0 Flash cross-verification on 50 samples per language (inter-translator agreement 0.79-0.84). Eight VLMs, from 7B open-source models to GPT-4o, are evaluated across all seven languages, yielding 68,600 inference records that include text-only and chain-of-thought ablations. I find accuracy drops of 9.8-25 percentage points when switching from English to an Indian language, with Dravidian languages suffering up to 13.2 pp more than Indo-Aryan. Chain-of-thought prompting degrades Bengali (-14.4 pp) and Kannada (-11.4 pp) rather than helping, exposing English-centric reasoning chains. Aya-Vision-8B, built for 23 languages, still drops 28.5 pp on Dravidian scripts; multilingual pretraining alone does not transfer visual reasoning. I release the translated benchmark and all model outputs.

2603.26741 2026-03-31 cs.CV cs.AI cs.RO

Language-Conditioned World Modeling for Visual Navigation

Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, Siyu Huang, Qi Dai, Zhi-Qi Cheng

Comments 19 pages, 6 figures, Code: https://github.com/F1y1113/LCVN

Abstract

We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an actor-critic agent trained in the latent space of the world model. The second family, LCVN-Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language-conditioned world models. The code is available at https://github.com/F1y1113/LCVN.

2603.26740 2026-03-31 cs.RO

Motion as a Sensing Modality for Metric Scale in Monocular Visual-Inertial Odometry

Hadush Hailu, Bruk Gebregziabher

Comments 10 pages

Abstract

Monocular visual-inertial odometry (VIO) cannot recover metric scale from vision alone; scale must be resolved through inertial measurements. We present a trajectory-dependent observability analysis showing that translational acceleration, produced by curvature, not constant-speed straight-line travel, is the fundamental source that couples scale to the inertial state. This relationship is formalized through the gravity-acceleration asymmetry in the IMU model, from which we derive rank conditions on the observability matrix and propose a lightweight excitation metric computable from raw IMU data. Controlled experiments on a differential-drive robot with a monocular camera and consumer-grade IMU validate the theory, with straight-line motion yielding 9.2% scale error, circular motion 6.4%, and figure-eight motion 4.8%, with excitation spanning four orders of magnitude. These results establish trajectory design as a practical mechanism for improving metric scale recovery.

2603.26737 2026-03-31 cs.CV cs.AI

Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning

Guangfu Guo, Xiaoqian Lu, Yue Feng, Mingming Sun

English abstract

Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, in which attention shifts selectively and sequentially from the most informative regions to secondary cues, we propose Structured Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning is performed following this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. The method is trained end-to-end using text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show gains, validating structured and sequential visual cognition.

2603.26736 2026-03-31 cs.CV cs.AI

Ordinal Semantic Segmentation Applied to Medical and Odontological Images

Mariana Dória Prata Lima, Gilson Antonio Giraldi, Jaime S. Cardoso

Comments 23 pages, 1 figure

English abstract

Semantic segmentation consists of assigning a semantic label to each pixel according to predefined classes. This process facilitates the understanding of object appearance and spatial relationships, playing an important role in the global interpretation of image content. Although modern deep learning approaches achieve high accuracy, they often ignore ordinal relationships among classes, which may encode important domain knowledge for scene interpretation. In this work, loss functions that incorporate ordinal relationships into deep neural networks are investigated to promote greater semantic consistency in semantic segmentation tasks. These loss functions are categorized as unimodal, quasi-unimodal, and spatial. Unimodal losses constrain the predicted probability distribution according to the class ordering, while quasi-unimodal losses relax this constraint by allowing small variations while preserving ordinal coherence. Spatial losses penalize semantic inconsistencies between neighboring pixels, encouraging smoother transitions in the image space. In particular, this study adapts loss functions originally proposed for ordinal classification to ordinal semantic segmentation. Among them, the Expanded Mean Squared Error (EXP_MSE), the Quasi-Unimodal Loss (QUL), and the spatial Contact Surface Loss using Signed Distance Function (CSSDF) are investigated. These approaches have shown promising results in medical imaging, improving robustness, generalization, and anatomical consistency.
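To make the unimodal constraint concrete, here is a minimal sketch of a per-pixel unimodality penalty. It is not EXP_MSE, QUL, or CSSDF from the abstract. The function name `unimodal_penalty` and the hinge formulation are illustrative assumptions: probabilities should rise up to the target class and fall after it, and any step in the wrong direction is penalized.

```python
import numpy as np

def unimodal_penalty(probs, target):
    """Hypothetical unimodality penalty over ordered classes.

    probs:  array (..., K) of class probabilities, classes in ordinal order.
    target: integer array (...) with the ground-truth class index.
    Returns an array (...) that is zero when probs is unimodal at target.
    """
    probs = np.asarray(probs, dtype=float)
    target = np.asarray(target)
    # diffs[..., j] = p[j+1] - p[j]
    diffs = probs[..., 1:] - probs[..., :-1]
    j = np.arange(probs.shape[-1] - 1)
    # Steps left of the target should rise; steps from the target on should fall.
    rising = j < target[..., None]
    violation = np.where(rising, np.maximum(0.0, -diffs), np.maximum(0.0, diffs))
    return violation.sum(axis=-1)

# One unimodal pixel (peak at class 1) and one bimodal pixel (target class 2).
pixels = np.array([[[0.1, 0.6, 0.3], [0.4, 0.1, 0.5]]])
labels = np.array([[1, 2]])
penalty_map = unimodal_penalty(pixels, labels)  # shape (1, 2)
```

A quasi-unimodal variant could subtract a small tolerance inside each `np.maximum` call so that minor wrong-direction steps go unpenalized. That tolerance is likewise an assumption, not a detail from the abstract.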

2603.26735 2026-03-31 cs.CV cs.AI

Distilled Large Language Model-Driven Dynamic Sparse Expert Activation Mechanism

Qinghui Chen, Zekai Zhang, Zaigui Zhang, Kai Zhang, Dagang Li, Wenmin Wang, Jinglin Zhang, Cong Liu

English abstract

High inter-class similarity, extreme scale variation, and limited computational budgets hinder reliable visual recognition across diverse real-world data. Existing vision-centric and cross-modal approaches often rely on rigid fusion mechanisms and heavy annotation pipelines, leading to sub-optimal generalization. We propose the Distilled Large Language Model (LLM)-Driven Sparse Mixture-of-Experts (DS-MoE) framework, which integrates text-guided dynamic routing and lightweight multi-scale comprehension. The DS-MoE framework dynamically aligns textual semantics with defect-specific visual patterns through a sparse MoE architecture, where task-relevant experts are adaptively activated based on semantic relevance, resolving inter-class ambiguity. A lightweight MobileSAM encoder enables real-time inference while preserving multi-scale defect details. Extensive experiments on PCB, aluminum foil, and mold defect datasets demonstrate that our framework achieves superior performance compared to existing pure vision models. DS-MoE surpasses YOLOv8/YOLOX with gains of +13.9, +1.4, and +2.0 pp mAP@0.5:0.95 on BBMP, aluminum, and PCB, respectively, while also improving precision and recall.
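The idea of activating only task-relevant experts based on semantic relevance can be sketched as generic top-k gating. This is an illustration, not the DS-MoE implementation. The names `route_top_k` and `sparse_moe`, the cosine-similarity scoring, and the per-expert key vectors are all assumptions introduced here.

```python
import numpy as np

def route_top_k(text_emb, expert_keys, k=2):
    """Pick the k experts whose (hypothetical) key vectors best match the text.

    text_emb:    (d,) text embedding.
    expert_keys: (n_experts, d) one key vector per expert.
    Returns (indices, weights); weights are a softmax over the k selected scores.
    """
    t = text_emb / np.linalg.norm(text_emb)
    e = expert_keys / np.linalg.norm(expert_keys, axis=1, keepdims=True)
    scores = e @ t                         # cosine similarity per expert
    top = np.argsort(scores)[::-1][:k]     # indices of the k highest scores
    w = np.exp(scores[top] - scores[top].max())
    return top, w / w.sum()

def sparse_moe(x, text_emb, expert_keys, experts, k=2):
    """Run only the selected experts and mix their outputs by routing weight."""
    idx, w = route_top_k(text_emb, expert_keys, k)
    return sum(wi * experts[i](x) for i, wi in zip(idx, w))

# Toy setup: three experts; the third must never run for this text embedding.
def unused_expert(x):
    raise RuntimeError("expert 2 should not be activated")

keys = np.eye(3)
experts = [lambda x: x, lambda x: 2.0 * x, unused_expert]
text = np.array([1.0, 0.5, 0.0])
idx, w = route_top_k(text, keys, k=2)
y = sparse_moe(np.ones(2), text, keys, experts, k=2)
```

Because unselected experts are never called, compute scales with `k` rather than with the total number of experts, which is the usual motivation for sparse routing.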

2603.26731 2026-03-31 cs.CV cs.AI

Contextual inference from single objects in Vision-Language models

Martina G. Vilas, Timothy Schaumlöffel, Gemma Roig

English abstract

How much scene context a single object carries is a well-studied question in human scene perception, yet how this capacity is organized in vision-language models (VLMs) remains poorly understood, with direct implications for the robustness of these models. We investigate this question through a systematic behavioral and mechanistic analysis of contextual inference from single objects. Presenting VLMs with single objects on masked backgrounds, we probe their ability to infer both fine-grained scene category and coarse superordinate context (indoor vs. outdoor). We found that single objects support above-chance inference at both levels, with performance modulated by the same object properties that predict human scene categorization. Object identity, scene, and superordinate predictions are partially dissociable: accurate inference at one level neither requires nor guarantees accurate inference at the others, and the degree of coupling differs markedly across models. Mechanistically, object representations that remain stable when background context is removed are more predictive of successful contextual inference. Scene and superordinate schemas are grounded in fundamentally different ways: scene identity is encoded in image tokens throughout the network, while superordinate information emerges only late or not at all. Together, these results reveal that the organization of contextual inference in VLMs is more complex than accuracy alone suggests, with behavioral and mechanistic signatures

2603.26730 2026-03-31 cs.RO

Why Cognitive Robotics Matters: Lessons from OntoAgent and LLM Deployment in HARMONIC for Safety-Critical Robot Teaming

Sanjay Oruganti, Sergei Nirenburg, Marjorie McShane, Jesse English, Michael Roberts, Christian Arndt, Ramviyas Parasuraman, Luis Sentis

English abstract

Deploying embodied AI agents in the physical world demands cognitive capabilities for long-horizon planning that execute reliably, deterministically, and transparently. We present HARMONIC, a cognitive-robotic architecture that pairs OntoAgent, a content-centric cognitive architecture providing metacognitive self-monitoring, domain-grounded diagnosis, and consequence-based action selection over ontologically structured knowledge, with a modular reactive tactical layer. HARMONIC's modular design enables a functional evaluation of whether LLMs can replicate OntoAgent's cognitive capabilities, evaluated within the same robotic system under identical conditions. Six LLMs spanning frontier and efficient tiers replace OntoAgent in a collaborative maintenance scenario under native and knowledge-equalized conditions. Results reveal that LLMs do not consistently assess their own knowledge state before acting, causing downstream failures in diagnostic reasoning and action selection. These deficits persist even with equivalent procedural knowledge, indicating the issues are architectural rather than knowledge-based. These findings support the design of physically embodied systems in which cognitive architectures retain primary authority for reasoning, owing to their deterministic and transparent characteristics.

2603.26727 2026-03-31 cs.CV cs.AI

The Nonverbal Gap: Toward Affective Computer Vision for Safer and More Equitable Online Dating

Ratna Kandala, Niva Manchanda, Akshata Kishore Moharir

English abstract

Online dating has become the dominant way romantic relationships begin, yet current platforms strip the nonverbal cues (gaze, facial expression, body posture, response timing) that humans rely on to signal comfort, disinterest, and consent, creating a communication gap with disproportionate safety consequences for women. We argue that this gap represents both a technical opportunity and a moral responsibility for the computer vision community, which has developed the affective tools needed to begin addressing it (facial action unit detection, gaze estimation, engagement modeling, and multimodal affect recognition), yet has largely ignored the dating domain as a research context. We propose a fairness-first research agenda organized around four capability areas: real-time discomfort detection, engagement asymmetry modeling between partners, consent-aware interaction design, and longitudinal interaction summarization, each grounded in established CV methodology and motivated by the social psychology of romantic communication. We argue that responsible pursuit of this agenda requires purpose-built datasets collected under dyadic consent protocols, fairness evaluation disaggregated across race, gender identity, neurotype, and cultural background, and architectural commitments to on-device processing that prevent affective data from becoming platform surveillance infrastructure. This vision paper calls on the WICV community, whose members are uniquely positioned to understand both the technical opportunity and the human stakes, to establish online dating safety as a first-class research domain before commercial deployment outpaces ethical deliberation.

2603.26726 2026-03-31 cs.CV cs.AI

A Multimodal Deep Learning Framework for Edema Classification Using HCT and Clinical Data

Aram Ansary Ogholbake, Hannah Choi, Spencer Brandenburg, Alyssa Antuna, Zahraa Al-Sharshahi, Makayla Cox, Haseeb Ahmed, Jacqueline Frank, Nathan Millson, Luke Bauerle, Jessica Lee, David Dornbos, Qiang Cheng

English abstract

We propose AttentionMixer, a unified deep learning framework for multimodal detection of brain edema that combines structural head CT (HCT) with routine clinical metadata. While HCT provides rich spatial information, clinical variables such as age, laboratory values, and scan timing capture complementary context that might be ignored or naively concatenated. AttentionMixer is designed to fuse these heterogeneous sources in a principled and efficient manner. HCT volumes are first encoded using a self-supervised Vision Transformer Autoencoder (ViT-AE++), without requiring large labeled datasets. Clinical metadata are mapped into the same feature space and used as keys and values in a cross-attention module, where the HCT-derived feature vector serves as the query. This cross-attention fusion allows the network to dynamically modulate imaging features based on patient-specific context and provides an interpretable mechanism for multimodal integration. A lightweight MLP-Mixer then refines the fused representation before final classification, enabling global dependency modeling with substantially reduced parameter overhead. Missing or incomplete metadata are handled via a learnable embedding, promoting robustness to real-world clinical data quality. We evaluate AttentionMixer on a curated brain HCT cohort with expert edema annotations using five-fold cross-validation. Compared with strong HCT-only, metadata-only, and prior multimodal baselines, AttentionMixer achieves superior performance (accuracy 87.32%, precision 92.10%, F1-score 85.37%, AUC 94.14%). Ablation studies confirm the benefit of both cross-attention and MLP-Mixer refinement, and permutation-based metadata importance analysis highlights clinically meaningful variables driving predictions. These results demonstrate that structured, interpretable multimodal fusion can substantially improve edema detection in clinical practice.
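The fusion step described above (imaging feature as query, metadata as keys and values, a learnable embedding for missing entries) can be sketched as single-head cross-attention in NumPy. This is a minimal illustration under assumptions, not the AttentionMixer code. The function `fuse`, the one-token-per-variable layout, the projection matrices `Wq`, `Wk`, `Wv`, and the dimensions are all hypothetical, and the ViT-AE++ encoder and MLP-Mixer stages are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(hct_feat, meta_tokens, missing_mask, missing_emb, Wq, Wk, Wv):
    """Single-head cross-attention sketch.

    hct_feat:     (d,)   imaging feature, used as the query.
    meta_tokens:  (m, d) one embedded token per clinical variable.
    missing_mask: (m,)   True where the variable is missing.
    missing_emb:  (d,)   stand-in for a learnable "missing" embedding.
    Returns (fused feature of shape (d,), attention weights of shape (m,)).
    """
    # Replace missing rows (which may hold NaN) with the shared embedding.
    tokens = np.where(missing_mask[:, None], missing_emb, meta_tokens)
    q = hct_feat @ Wq
    K = tokens @ Wk
    V = tokens @ Wv
    attn = softmax(K @ q / np.sqrt(q.shape[-1]))
    return attn @ V, attn

rng = np.random.default_rng(0)
d, m = 4, 3
hct = rng.normal(size=d)
meta = rng.normal(size=(m, d))
meta[1] = np.nan                       # second variable is missing
mask = np.array([False, True, False])
missing = np.zeros(d)                  # would be learned in a real model
eye = np.eye(d)
fused, attn = fuse(hct, meta, mask, missing, eye, eye, eye)
```

The attention weights `attn` give one score per clinical variable, which is the kind of per-variable signal an interpretability analysis could inspect.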

2603.26724 2026-03-31 cs.CV cs.RO

An Annotation-to-Detection Framework for Autonomous and Robust Vine Trunk Localization in the Field by Mobile Agricultural Robots

Dimitrios Chatziparaschis, Elia Scudiero, Brent Sams, Konstantinos Karydis

Comments 7 pages, 6 figures, conference

English abstract

The dynamic and heterogeneous nature of agricultural fields presents significant challenges for object detection and localization, particularly for autonomous mobile robots that are tasked with surveying previously unseen unstructured environments. Concurrently, there is a growing need for real-time detection systems that do not depend on large-scale manually labeled real-world datasets. In this work, we introduce a comprehensive annotation-to-detection framework designed to train a robust multi-modal detector using limited and partially labeled training data. The proposed methodology incorporates cross-modal annotation transfer and an early-stage sensor fusion pipeline, which, in conjunction with a multi-stage detection architecture, effectively trains and enhances the system's multi-modal detection capabilities. The effectiveness of the framework was demonstrated through vine trunk detection in novel vineyard settings that featured diverse lighting conditions and varying crop densities. When integrated with a customized multi-modal LiDAR and Odometry Mapping (LOAM) algorithm and a tree association module, the system demonstrated high-performance trunk localization, successfully identifying over 70% of trees in a single traversal with a mean distance error of less than 0.37 m. The results reveal that by leveraging multi-modal, incremental-stage annotation and training, the proposed framework achieves robust detection performance despite limited starting annotations, showcasing its potential for real-world and near-ground agricultural applications.

2603.26713 2026-03-31 cs.LG eess.SP stat.ML

Boundary-aware Prototype-driven Adversarial Alignment for Cross-Corpus EEG Emotion Recognition

Guangli Li, Canbiao Wu, Na Tian, Li Zhang, Zhen Liang

English abstract

Electroencephalography (EEG)-based emotion recognition suffers from severe performance degradation when models are transferred across heterogeneous datasets due to physiological variability, experimental paradigm differences, and device inconsistencies. Existing domain adversarial methods primarily enforce global marginal alignment and often overlook class-conditional mismatch and decision boundary distortion, limiting cross-corpus generalization. In this work, we propose a unified Prototype-driven Adversarial Alignment (PAA) framework for cross-corpus EEG emotion recognition. The framework is progressively instantiated in three configurations: PAA-L, which performs prototype-guided local class-conditional alignment; PAA-C, which further incorporates contrastive semantic regularization to enhance intra-class compactness and inter-class separability; and PAA-M, the full boundary-aware configuration that integrates dual relation-aware classifiers within a three-stage adversarial optimization scheme to explicitly refine controversial samples near decision boundaries. By combining prototype-guided subdomain alignment, contrastive discriminative enhancement, and boundary-aware aggregation within a coherent adversarial architecture, the proposed framework reformulates emotion recognition as a relation-driven representation learning problem, reducing sensitivity to label noise and improving cross-domain stability. Extensive experiments on SEED, SEED-IV, and SEED-V demonstrate state-of-the-art performance under four cross-corpus evaluation protocols, with average improvements of 6.72%, 5.59%, 6.69%, and 4.83%, respectively. Furthermore, the proposed framework generalizes effectively to clinical depression identification scenarios, validating its robustness in real-world heterogeneous settings. The source code is available at https://github.com/WuCB-BCI/PAA.

2603.26711 2026-03-31 cs.RO cs.SY eess.SY

Surface-Constrained Offline Warping with Contact-Aware Online Pose Projection for Safe Robotic Trajectory Execution

Farong Wang, Sai Swaminathan, Fei Liu

Comments 7 pages, 7 figures. Submitted to IROS 2026

English abstract

Robotic manipulation tasks that require repeated tool motion along curved surfaces frequently arise in surface finishing, inspection, and guided interaction. In practice, nominal motion primitives are often designed independently of the deployment surface and later reused across varying geometries. Directly tiling such primitives onto nonplanar surfaces introduces geometric inconsistencies, leading to interpenetration, orientation discontinuities, and cumulative drift over repeated cycles. We present a two-stage framework that separates geometric embedding from execution-level regulation. An offline surface-constrained warping operator embeds a nominal periodic primitive onto curved surfaces through asymmetric diffeomorphic deformation of dual-track waypoints and axis-consistent orientation completion, producing a surface-adapted reference trajectory. An online contact-aware projection operator then enforces bounded deviation relative to this reference using FSR-driven disturbance adaptation and a conic orientation safety constraint. Experiments across multiple analytic surface families and real-robot validation on a sinusoidal surface demonstrate improved geometric continuity, fewer large orientation jumps, and robust contact maintenance compared with direct tiling. These results show that decoupling offline geometric remapping from lightweight online projection enables stable and repeatable surface-embedded trajectory execution under sensor-lite feedback.