arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2862
2603.07644 2026-03-10 cs.RO

PanoDP: Learning Collision-Free Navigation with Panoramic Depth and Differentiable Physics

Hao Zhong, Pei Chi, Jiang Zhao, Shenghai Yuan, Xuyang Gao, Thien-Minh Nguyen, Lihua Xie

详情
英文摘要

Autonomous collision-free navigation in cluttered environments requires safe decision-making under partial observability with both static structure and dynamic obstacles. We present \textbf{PanoDP}, a communication-free learning framework that combines four-view panoramic depth perception with differentiable-physics-based training signals. PanoDP encodes panoramic depth using a lightweight CNN and optimizes policies with dense differentiable collision and motion-feasibility terms, improving training stability beyond sparse terminal collisions. We evaluate PanoDP on a controlled ring-to-center benchmark with systematic sweeps over agent count, obstacle density/layout, and dynamic behaviors, and further test out-of-distribution generalization in an external simulator (e.g., AirSim). Across settings, PanoDP increases collision-free and completion rates over single-view and non-physics-guided baselines under matched training budgets, and ablations (view masking, rotation augmentation) confirm the policy leverages 360-degree information. Code will be open source upon acceptance.

2603.07642 2026-03-10 cs.LG

Helix: Evolutionary Reinforcement Learning for Open-Ended Scientific Problem Solving

Chang Su, Zhongkai Hao, Zhizhou Zhang, Zeyu Xia, Youjia Wu, Hang Su, Jun Zhu

Comments Accepted at ICLR 2026

详情
英文摘要

Large language models (LLMs) with reasoning abilities have demonstrated growing promise for tackling complex scientific problems. Yet such tasks are inherently domain-specific, unbounded and open-ended, demanding exploration across vast and flexible solution spaces. Existing approaches, whether purely learning-based or reliant on carefully designed workflows, often suffer from limited exploration efficiency and poor generalization. To overcome these challenges, we present HELIX -- a Hierarchical Evolutionary reinforcement Learning framework with In-context eXperiences. HELIX introduces two key novelties: (i) a diverse yet high-quality pool of candidate solutions that broadens exploration through in-context learning, and (ii) reinforcement learning for iterative policy refinement that progressively elevates solution quality. This synergy enables the discovery of more advanced solutions. On the circle packing task, HELIX achieves state-of-the-art result with a sum of radii of 2.63598308 using only a 14B model. Across standard machine learning benchmarks, HELIX further surpasses GPT-4o with a carefully engineered pipeline, delivering an average F1 improvement of 5.95 points on the Adult and Bank Marketing datasets.

2603.07630 2026-03-10 cs.CV

Real-Time Glottis Detection Framework via Spatial-decoupled Feature Learning for Nasal Transnasal Intubation

Jinyu Liu, Gaoyang Zhang, Yang Zhou, Ruoyi Hao, Yang Zhang, Hongliang Ren

Comments 15 pages, 7 figures

详情
英文摘要

Nasotracheal intubation (NTI) is a vital procedure in emergency airway management, where rapid and accurate glottis detection is essential to ensure patient safety. However, existing machine assisted visual detection systems often rely on high performance computational resources and suffer from significant inference delays, which limits their applicability in time critical and resource constrained scenarios. To overcome these limitations, we propose Mobile GlottisNet, a lightweight and efficient glottis detection framework designed for real time inference on embedded and edge devices. The model incorporates structural awareness and spatial alignment mechanisms, enabling robust glottis localization under complex anatomical and visual conditions. We implement a hierarchical dynamic thresholding strategy to enhance sample assignment, and introduce an adaptive feature decoupling module based on deformable convolution to support dynamic spatial reconstruction. A cross layer dynamic weighting scheme further facilitates the fusion of semantic and detail features across multiple scales. Experimental results demonstrate that the model, with a size of only 5MB on both our PID dataset and Clinical datasets, achieves inference speeds of over 62 FPS on devices and 33 FPS on edge platforms, showing great potential in the application of emergency NTI.

2603.07629 2026-03-10 cs.RO cs.LG

Exoskeleton Control through Learning to Reduce Biological Joint Moments in Simulations

Zihang You, Xianlian Zhou

详情
英文摘要

Data-driven joint-moment predictors offer a scalable alternative to laboratory-based inverse-dynamics pipelines for biomechanics estimation and exoskeleton control. Meanwhile, physics-based reinforcement learning (RL) enables simulation-trained controllers to learn dynamics-aware assistance strategies without extensive human experimentation. However, quantitative verification of simulation-trained exoskeleton torque predictors, and their impact on human joint power injection, remains limited. This paper presents (1) an RL framework to learn exoskeleton assistance policies that reduce biological joint moments, and (2) a validation pipeline that verifies the trained control networks using an open-source gait dataset through inference and comparison with biological joint moments. Simulation-trained multilayer perceptron (MLP) controllers are developed for level-ground and ramp walking, mapping short-horizon histories of bilateral hip and knee kinematics to normalized assistance torques. Results show that predicted assistance preserves task-intensity trends across speeds and inclines. Agreement is particularly strong at the hip, with cross-correlation coefficients reaching 0.94 at 1.8 m/s and 0.98 during 5° decline walking, demonstrating near-matched temporal structure. Discrepancies increase at higher speeds and steeper inclines, especially at the knee, and are more pronounced in joint power comparisons. Delay tuning biases assistance toward greater positive power injection; modest timing shifts increase positive power and improve agreement in specific gait intervals. Together, these results establish a quantitative validation framework for simulation-trained exoskeleton controllers, demonstrate strong sim-to-data consistency at the torque level, and highlight both the promise and the remaining challenges for sim-to-real transfer.

2603.07625 2026-03-10 cs.CV

Duala: Dual-Level Alignment of Subjects and Stimuli for Cross-Subject fMRI Decoding

Shumeng Li, Jintao Guo, Jian Zhang, Yulin Zhou, Luyang Cao, Yinghuan Shi

详情
英文摘要

Cross-subject visual decoding aims to reconstruct visual experiences from brain activity across individuals, enabling more scalable and practical brain-computer interfaces. However, existing methods often suffer from degraded performance when adapting to new subjects with limited data, as they struggle to preserve both the semantic consistency of stimuli and the alignment of brain responses. To address these challenges, we propose Duala, a dual-level alignment framework designed to achieve stimulus-level consistency and subject-level alignment in fMRI-based cross-subject visual decoding. (1) At the stimulus level, Duala introduces a semantic alignment and relational consistency strategy that preserves intra-class similarity and inter-class separability, maintaining clear semantic boundaries during adaptation. (2) At the subject level, a distribution-based feature perturbation mechanism is developed to capture both global and subject-specific variations, enabling adaptation to individual neural representations without overfitting. Experiments on the Natural Scenes Dataset (NSD) demonstrate that Duala effectively improves alignment across subjects. Remarkably, even when fine-tuned with only about one hour of fMRI data, Duala achieves over 81.1% image-to-brain retrieval accuracy and consistently outperforms existing fine-tuning strategies in both retrieval and reconstruction. Our code is available at https://github.com/ShumengLI/Duala.

2603.07624 2026-03-10 cs.RO

GeoLoco: Leveraging 3D Geometric Priors from Visual Foundation Model for Robust RGB-Only Humanoid Locomotion

Yufei Liu, Xieyuanli Chen, Hainan Pan, Chenghao Shi, Yanjie Chen, Kaihong Huang, Zhiwen Zeng, Huimin Lu

Comments 8 pages, 6 figures, conference

详情
英文摘要

The prevailing paradigm of perceptive humanoid locomotion relies heavily on active depth sensors. However, this depth-centric approach fundamentally discards the rich semantic and dense appearance cues of the visual world, severing low-level control from the high-level reasoning essential for general embodied intelligence. While monocular RGB offers a ubiquitous, information-dense alternative, end-to-end reinforcement learning from raw 2D pixels suffers from extreme sample inefficiency and catastrophic sim-to-real collapse due to the inherent loss of geometric scale. To break this deadlock, we propose GeoLoco, a purely RGB-driven locomotion framework that conceptualizes monocular images as high-dimensional 3D latent representations by harnessing the powerful geometric priors of a frozen, scale-aware Visual Foundation Model (VFM). Rather than naive feature concatenation, we design a proprioceptive-query multi-head cross-attention mechanism that dynamically attends to task-critical topological features conditioned on the robot's real-time gait phase. Crucially, to prevent the policy from overfitting to superficial textures, we introduce a dual-head auxiliary learning scheme. This explicit regularization forces the high-dimensional latent space to strictly align with the physical terrain geometry, ensuring robust zero-shot sim-to-real transfer. Trained exclusively in simulation, GeoLoco achieves robust zero-shot transfer to the Unitree G1 humanoid and successfully negotiates challenging terrains.

2603.07618 2026-03-10 cs.RO cs.AI cs.LG

SMAT: Staged Multi-Agent Training for Co-Adaptive Exoskeleton Control

Yifei Yuan, Ghaith Androwis, Xianlian Zhou

详情
英文摘要

Effective exoskeleton assistance requires co-adaptation: as the device alters joint dynamics, the user reorganizes neuromuscular coordination, creating a non-stationary learning problem. Most learning-based approaches do not explicitly account for the sequential nature of human motor adaptation, leading to training instability and poorly timed assistance. We propose Staged Multi-Agent Training (SMAT), a four-stage curriculum designed to mirror how users naturally acclimate to a wearable device. In SMAT, a musculoskeletal human actor and a bilateral hip exoskeleton actor are trained progressively: the human first learns unassisted gait, then adapts to the added device mass; the exoskeleton subsequently learns a positive assistance pattern against a stabilized human policy, and finally both agents co-adapt with full torque capacity and bidirectional feedback. We implement SMAT in the MyoAssist simulation environment using a 26-muscle lower-limb model and an attached hip exoskeleton. Our musculoskeletal simulations demonstrate that the learned exoskeleton control policy produces an average 10.1% reduction in hip muscle activation relative to the no-assist condition. We validated the learned controller in an offline setting using open-source gait data, then deployed it to a physical hip exoskeleton for treadmill experiments with five subjects. The resulting policy delivers consistent assistance and predominantly positive mechanical power without the need for any explicitly imposed timing shift (mean positive power: 13.6 W at 6 Nm RMS torque to 23.8 W at 9.3 Nm RMS torque, with minimal negative power) consistently across all subjects without subject-specific retraining.

2603.07614 2026-03-10 cs.CV

Looking Into the Water by Unsupervised Learning of the Surface Shape

Ori Lifschitz, Tali Treibitz, Dan Rosenbaum

详情
Journal ref
Published The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) 2025
英文摘要

We address the problem of looking into the water from the air, where we seek to remove image distortions caused by refractions at the water surface. Our approach is based on modeling the different water surface structures at various points in time, assuming the underlying image is constant. To this end, we propose a model that consists of two neural-field networks. The first network predicts the height of the water surface at each spatial position and time, and the second network predicts the image color at each position. Using both networks, we reconstruct the observed sequence of images and can therefore use unsupervised training. We show that using implicit neural representations with periodic activation functions (SIREN) leads to effective modeling of the surface height spatio-temporal signal and its derivative, as required for image reconstruction. Using both simulated and real data we show that our method outperforms the latest unsupervised image restoration approach. In addition, it provides an estimate of the water surface.

2603.07612 2026-03-10 cs.CL

KohakuRAG: A simple RAG framework with hierarchical document indexing

Shih-Ying Yeh, Yueh-Feng Ku, Ko-Wei Huang, Buu-Khang Tu

Comments 38pages

详情
英文摘要

Retrieval-augmented generation (RAG) systems that answer questions from document collections face compounding difficulties when high-precision citations are required: flat chunking strategies sacrifice document structure, single-query formulations miss relevant passages through vocabulary mismatch, and single-pass inference produces stochastic answers that vary in both content and citation selection. We present KohakuRAG, a hierarchical RAG framework that preserves document structure through a four-level tree representation (document $\rightarrow$ section $\rightarrow$ paragraph $\rightarrow$ sentence) with bottom-up embedding aggregation, improves retrieval coverage through an LLM-powered query planner with cross-query reranking, and stabilizes answers through ensemble inference with abstention-aware voting. We evaluate on the WattBot 2025 Challenge, a benchmark requiring systems to answer technical questions from 32 documents with $\pm$0.1% numeric tolerance and exact source attribution. KohakuRAG achieves first place on both public and private leaderboards (final score 0.861), as the only team to maintain the top position across both evaluation partitions. Ablation studies reveal that prompt ordering (+80% relative), retry mechanisms (+69%), and ensemble voting with blank filtering (+1.2pp) each contribute substantially, while hierarchical dense retrieval alone matches hybrid sparse-dense approaches (BM25 adds only +3.1pp). We release KohakuRAG as open-source software at https://github.com/KohakuBlueleaf/KohakuRAG.

2603.07606 2026-03-10 cs.LG

TT-Sparse: Learning Sparse Rule Models with Differentiable Truth Tables

Hans Farrell Soegeng, Sarthak Ketanbhai Modi, Thomas Peyrin

详情
英文摘要

Interpretable machine learning is essential in high-stakes domains where decision-making requires accountability, transparency, and trust. While rule-based models offer global and exact interpretability, learning rule sets that simultaneously achieve high predictive performance and low, human-understandable complexity remains challenging. To address this, we introduce TT-Sparse, a flexible neural building block that leverages differentiable truth tables as nodes to learn sparse, effective connections. A key contribution of our approach is a new soft TopK operator with straight-through estimation for learning discrete, cardinality-constrained feature selection in an end-to-end differentiable manner. Crucially, the forward pass remains sparse, enabling efficient computation and exact symbolic rule extraction. As a result, each node (and the entire model) can be transformed exactly into compact, globally interpretable DNF/CNF Boolean formulas via Quine-McCluskey minimization. Extensive empirical results across 28 datasets spanning binary, multiclass, and regression tasks show that the learned sparse rules exhibit superior predictive performance with lower complexity compared to existing state-of-the-art methods.

2603.07604 2026-03-10 cs.CV

EmbedTalk: Triplane-Free Talking Head Synthesis using Embedding-Driven Gaussian Deformation

Arpita Saggar, Jonathan C. Darling, Duygu Sarikaya, David C. Hogg

Comments Preprint

详情
英文摘要

Real-time talking head synthesis increasingly relies on deformable 3D Gaussian Splatting (3DGS) due to its low latency. Tri-planes are the standard choice for encoding Gaussians prior to deformation, since they provide a continuous domain with explicit spatial relationships. However, tri-plane representations are limited by grid resolution and approximation errors introduced by projecting 3D volumetric fields onto 2D subspaces. Recent work has shown the superiority of learnt embeddings for driving temporal deformations in 4D scene reconstruction. We introduce $\textbf{EmbedTalk}$, which shows how such embeddings can be leveraged for modelling speech deformations in talking head synthesis. Through comprehensive experiments, we show that EmbedTalk outperforms existing 3DGS-based methods in rendering quality, lip synchronisation, and motion consistency, while remaining competitive with state-of-the-art generative models. Moreover, replacing the tri-plane encoding with learnt embeddings enables significantly more compact models that achieve over 60 FPS on a mobile GPU (RTX 2060 6 GB). Our code will be placed in the public domain on acceptance.

2603.07599 2026-03-10 cs.CL

StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control

Haishu Zhao, Aokai Hao, Yuan Ge, Zhenqiang Hong, Tong Xiao, Jingbo Zhu

详情
英文摘要

Speech language models (SLMs) have significantly extended the interactive capability of text-based Large Language Models (LLMs) by incorporating paralinguistic information. For more realistic interactive experience with customized styles, current SLMs have managed to interpret and control speaking style intensity from user prompts during the dialogue process. However, there remains a lack of systematic benchmarks that quantifies and evaluates the style intensity control ability in conversations. In this paper, we propose StyleBench, a multi-turn dialogue benchmark for comprehensively evaluating the style intensity control ability across four dimensions: emotion, speed, volume, and pitch. Our results reveal the performance gaps between leading SLMs and omni language models (OLMs), suggesting the underlying reasons and promising approaches for future exploration.

2603.07598 2026-03-10 cs.AI cs.LG

Shorter Thoughts, Same Answers: Difficulty-Scaled Segment-Wise RL for CoT Compression

Ye Tian, Aijun Liu

Comments 12 pages, 3 figures. Preprint. Code available at the GitHub project repository

详情
英文摘要

Chain-of-thought (CoT) improves reasoning reliability but increases token cost, motivating post-training compression of explicit reasoning traces. However, the shortest sufficient reasoning is not universal: it depends on difficulty, model capacity, and training state, making fixed length targets brittle. In practice, naive RL-based compression can also undesirably shorten the user-facing answer, because a single completion-level learning signal leaks across the think/answer boundary. We propose Difficulty-Scaled Segment-Wise GRPO (DSS-GRPO), which decomposes returns into think and answer components, computes group-relative advantages per segment, and routes them with hard token masks so compression updates act only on think while answer alignment acts only on answer. DSS-GRPO uses prompt-wise within-group shaping and difficulty-aware scaling to encourage concise reasoning without collapsing answer behavior.

2603.07593 2026-03-10 cs.CV

Fast Attention-Based Simplification of LiDAR Point Clouds for Object Detection and Classification

Z. Rozsa, Á. Madaras, Q. Wei, X. Lu, M. Golarits, H. Yuan, T. Sziranyi, R. Hamzaoui

详情
英文摘要

LiDAR point clouds are widely used in autonomous driving and consist of large numbers of 3D points captured at high frequency to represent surrounding objects such as vehicles, pedestrians, and traffic signs. While this dense data enables accurate perception, it also increases computational cost and power consumption, which can limit real-time deployment. Existing point cloud sampling methods typically face a trade-off: very fast approaches tend to reduce accuracy, while more accurate methods are computationally expensive. To address this limitation, we propose an efficient learned point cloud simplification method for LiDAR data. The method combines a feature embedding module with an attention-based sampling module to prioritize task-relevant regions and is trained end-to-end. We evaluate the method against farthest point sampling (FPS) and random sampling (RS) on 3D object detection on the KITTI dataset and on object classification across four datasets. The method was consistently faster than FPS and achieved similar, and in some settings better, accuracy, with the largest gains under aggressive downsampling. It was slower than RS, but it typically preserved accuracy more reliably at high sampling ratios.

2603.07590 2026-03-10 cs.CV cs.LG

Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints

Chenxi Li, Xianggan Liu, Dake Shen, Yaosong Du, Zhibo Yao, Hao Jiang, Linyi Jiang, Chengwei Cao, Jingzhe Zhang, RanYi Peng, Peiling Bai, Xiande Huang

详情
英文摘要

Despite the rapid progress of Large Vision-Language Models (LVLMs), the integration of visual modalities introduces new safety vulnerabilities that adversaries can exploit to elicit biased or malicious outputs. In this paper, we demonstrate an underexplored vulnerability via semantic slot filling, where LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign. Building on this finding, we propose StructAttack, a simple yet effective single-query jailbreak framework under black-box settings. StructAttack decomposes a harmful query into a central topic and a set of benign-looking slot types, then embeds them as structured visual prompts (e.g., mind maps, tables, or sunburst diagrams) with small random perturbations. Paired with a completion-guided instruction, LVLMs automatically recompose the concealed semantics and generate unsafe outputs without triggering safety mechanisms. Although each slot appears benign in isolation (local benignness), StructAttack exploits LVLMs' reasoning to assemble these slots into coherent harmful semantics. Extensive experiments on multiple models and benchmarks show the efficacy of our proposed StructAttack.

2603.07587 2026-03-10 cs.CV

3DGS-HPC: Distractor-free 3D Gaussian Splatting with Hybrid Patch-wise Classification

Jiahao Chen, Yipeng Qin, Ganlong Zhao, Xin Li, Wenping Wang, Guanbin Li

详情
英文摘要

3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in novel view synthesis and 3D scene reconstruction, yet its quality often degrades in real-world environments due to transient distractors, such as moving objects and varying shadows. Existing methods commonly rely on semantic cues extracted from pre-trained vision models to identify and suppress these distractors, but such semantics are misaligned with the binary distinction between static and transient regions and remain fragile under the appearance perturbations introduced during 3DGS optimization. We propose 3DGS-HPC, a framework that circumvents these limitations by combining two complementary principles: a patch-wise classification strategy that leverages local spatial consistency for robust region-level decisions, and a hybrid classification metric that adaptively integrates photometric and perceptual cues for more reliable separation. Extensive experiments demonstrate the superiority and robustness of our method in mitigating distractors to improve 3DGS-based novel view synthesis.

2603.07580 2026-03-10 cs.RO

FeasibleCap: Real-Time Embodiment Constraint Guidance for In-the-Wild Robot Demonstration Collection

Zi Yin, Fanhong Li, Yun Gui, Jia Liu

详情
英文摘要

Gripper-in-hand data collection decouples demonstration acquisition from robot hardware, but whether a trajectory is executable on the target robot remains unknown until a separate replay-and-validate stage. Failed demonstrations therefore inflate the effective cost per usable trajectory through repeated collection, diagnosis, and validation. Existing collection-time feedback systems mitigate this issue but rely on head-worn AR/VR displays, robot-in-the-loop hardware, or learned dynamics models; real-time executability feedback has not yet been integrated into the gripper-in-hand data collection paradigm. We present \textbf{FeasibleCap}, a gripper-in-hand data collection system that brings real-time executability guidance into robot-free capture. At each frame, FeasibleCap checks reachability, joint-rate limits, and collisions against a target robot model and closes the loop through on-device visual overlays and haptic cues, allowing demonstrators to correct motions during collection without learned models, headsets, or robot hardware. On pick-and-place and tossing tasks, FeasibleCap improves replay success and reduces the fraction of infeasible frames, with the largest gains on tossing. Simulation experiments further indicate that enforcing executability constraints during collection does not sacrifice cross-embodiment transfer across robot platforms. Hardware designs and software are available at https://github.com/aod321/FeasibleCap.

2603.07577 2026-03-10 cs.CV cs.AI cs.LG

Integration of deep generative Anomaly Detection algorithm in high-speed industrial line

Niccolò Ferrari, Nicola Zanarini, Michele Fraccaroli, Alice Bizzarri, Evelina Lamma

Comments Preprint under review at a Springer Nature journal. 36 pages, 3 tables, 29 figures. Updated and expanded version of the SSRN preprint (abstract_id=4858664), with substantial revisions and Springer Nature formatting

详情
英文摘要

Industrial visual inspection in pharmaceutical production requires high accuracy under strict constraints on cycle time, hardware footprint, and operational cost. Manual inline inspection is still common, but it is affected by operator variability and limited throughput. Classical rule-based computer vision pipelines are often rigid and difficult to scale to highly variable production scenarios. To address these limitations, we present a semi-supervised anomaly detection framework based on a generative adversarial architecture with a residual autoencoder and a dense bottleneck, specifically designed for online deployment on a high-speed Blow-Fill-Seal (BFS) line. The model is trained only on nominal samples and detects anomalies through reconstruction residuals, providing both classification and spatial localization via heatmaps. The training set contains 2,815,200 grayscale patches. Experiments on a real industrial test kit show high detection performance while satisfying timing constraints compatible with a 500 ms acquisition slot.

2603.07572 2026-03-10 cs.LG

TS-MLLM: A Multi-Modal Large Language Model-based Framework for Industrial Time-Series Big Data Analysis

Haiteng Wang, Yikang Li, Yunfei Zhu, Jingheng Yan, Lei Ren, Laurence T. Yang

详情
英文摘要

Accurate analysis of industrial time-series big data is critical for the Prognostics and Health Management (PHM) of industrial equipment. While recent advancements in Large Language Models (LLMs) have shown promise in time-series analysis, existing methods typically focus on single-modality adaptations, failing to exploit the complementary nature of temporal signals, frequency-domain visual representations, and textual knowledge information. In this paper, we propose TS-MLLM, a unified multi-modal large language model framework designed to jointly model temporal signals, frequency-domain images, and textual domain knowledge. Specifically, we first develop an Industrial time-series Patch Modeling branch to capture long-range temporal dynamics. To integrate cross-modal priors, we introduce a Spectrum-aware Vision-Language Model Adaptation (SVLMA) mechanism that enables the model to internalize frequency-domain patterns and semantic context. Furthermore, a Temporal-centric Multi-modal Attention Fusion (TMAF) mechanism is designed to actively retrieve relevant visual and textual cues using temporal features as queries, ensuring deep cross-modal alignment. Extensive experiments on multiple industrial benchmarks demonstrate that TS-MLLM significantly outperforms state-of-the-art methods, particularly in few-shot and complex scenarios. The results validate our framework's superior robustness, efficiency, and generalization capabilities for industrial time-series prediction.

2603.07570 2026-03-10 cs.CV

Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance

Guodong Sun, Junjie Liu, Gaoyang Zhang, Bo Wu, Yang Zhang

Comments 23 pages, 13 figures

详情
英文摘要

Scene understanding plays a critical role in enabling intelligence and autonomy in robotic systems. Traditional approaches often face challenges, including occlusions, ambiguous boundaries, and the inability to adapt attention based on task-specific requirements and sample variations. To address these limitations, this paper presents an efficient RGB-D scene understanding model that performs a range of tasks, including semantic segmentation, instance segmentation, orientation estimation, panoptic segmentation, and scene classification. The proposed model incorporates an enhanced fusion encoder, which effectively leverages redundant information from both RGB and depth inputs. For semantic segmentation, we introduce normalized focus channel layers and a context feature interaction layer, designed to mitigate issues such as shallow feature misguidance and insufficient local-global feature representation. The instance segmentation task benefits from a non-bottleneck 1D structure, which achieves superior contour representation with fewer parameters. Additionally, we propose a multi-task adaptive loss function that dynamically adjusts the learning strategy for different tasks based on scene variations. Extensive experiments on the NYUv2, SUN RGB-D, and Cityscapes datasets demonstrate that our approach outperforms existing methods in both segmentation accuracy and processing speed.

2603.07568 2026-03-10 cs.LG

Constraints Matrix Diffusion based Generative Neural Solver for Vehicle Routing Problems

Zhenwei Wang, Tiehua Zhang, Ning Xue, Ender Ozcan, Ling Wang, Ruibin Bai

详情
英文摘要

Over the past decade, neural network solvers powered by generative artificial intelligence have garnered significant attention in the domain of vehicle routing problems (VRPs), owing to their exceptional computational efficiency and superior reasoning capabilities. In particular, autoregressive solvers integrated with reinforcement learning have emerged as a prominent trend. However, much of the existing work emphasizes large-scale generalization of neural approaches while neglecting the limited robustness of attention-based methods across heterogeneous distributions of problem parameters. Their improvements over heuristic search remain largely restricted to hand-curated, fixed-distribution benchmarks. Furthermore, these architectures tend to degrade significantly when node representations are highly similar or when tasks involve long decision horizons. To address the aforementioned limitations, we propose a novel fusion neural network framework that employs a discrete noise graph diffusion model to learn the underlying constraints of vehicle routing problems and generate a constraint assignment matrix. This matrix is subsequently integrated adaptively into the feature representation learning and decision process of the autoregressive solver, serving as a graph structure mask that facilitates the formation of solutions characterized by both global vision and local feature integration. To the best of our knowledge, this work represents the first comprehensive experimental investigation of neural network model solvers across a 378-combinatorial space spanning four distinct dimensions within the CVRPlib public dataset. Extensive experimental evaluations demonstrate that our proposed fusion model effectively captures and leverages problem constraints, achieving state-of-the-art performance across multiple benchmark datasets.

2603.07566 2026-03-10 cs.CV cs.AI cs.LG

GRD-Net: Generative-Reconstructive-Discriminative Anomaly Detection with Region of Interest Attention Module

Niccolò Ferrari, Michele Fraccaroli, Evelina Lamma

Comments Peer-reviewed journal version published. 18 pages, 12 figures, 7 tables

详情
Journal ref
International Journal of Intelligent Systems, vol. 2023, Article ID 7773481, 2023
英文摘要

Anomaly detection is nowadays increasingly used in industrial applications and processes. One of the main fields of the appliance is the visual inspection for surface anomaly detection, which aims to spot regions that deviate from regularity and consequently identify abnormal products. Defect localization is a key task, that usually is achieved using a basic comparison between generated image and the original one, implementing some blob-analysis or image-editing algorithms, in the post-processing step, which is very biased towards the source dataset, and they are unable to generalize. Furthermore, in industrial applications, the totality of the image is not always interesting but could be one or some regions of interest (ROIs), where only in those areas there are relevant anomalies to be spotted. For these reasons, we propose a new architecture composed by two blocks. The first block is a Generative Adversarial Network (GAN), based on a residual autoencoder (ResAE), to perform reconstruction and denoising processes, while the second block produces image segmentation, spotting defects. This method learns from a dataset composed of good products and generated synthetic defects. The discriminative network is trained using a ROI for each image contained in the training dataset. The network will learn in which area anomalies are relevant. This approach guarantees the reduction of using pre-processing algorithms, formerly developed with blob-analysis and image-editing procedures. To test our model we used challenging MVTec anomaly detection datasets and an industrial large dataset of pharmaceutical BFS strips of vials. This set constitutes a more realistic use case of the aforementioned network.

2603.07564 2026-03-10 cs.CV

SiamGM: Siamese Geometry-Aware and Motion-Guided Network for Real-Time Satellite Video Object Tracking

Zixiao Wen, Zhen Yang, Jiawei Li, Xiantai Xiang, Guangyao Zhou, Yuxin Hu, Yuhan Liu

Comments This work has been submitted to the IEEE for possible publication

详情
英文摘要

Single object tracking in satellite videos is inherently challenged by small target, blurred background, large aspect ratio changes, and frequent visual occlusions. These constraints often cause appearance-based trackers to accumulate errors and lose targets irreversibly. To systematically mitigate both spatial ambiguities and temporal information loss, we propose SiamGM, a novel geometry-aware and motion-guided Siamese network. From a spatial perspective, we introduce an Inter-Frame Graph Attention (IFGA) module, closely integrated with an Aspect Ratio-Constrained Label Assignment (LA) method, establishing fine-grained topological correspondences and explicitly preventing surrounding background noise. From a temporal perspective, we introduce the Motion Vector-Guided Online Tracking Optimization method. By adopting the Normalized Peak-to-Sidelobe Ratio (nPSR) as a dynamic confidence indicator, we propose an Online Motion Model Refinement (OMMR) strategy to utilize historical trajectory information. Evaluations on two challenging SatSOT and SV248S benchmarks confirm that SiamGM outperforms most state-of-the-art trackers in both precision and success metrics. Notably, the proposed components of SiamGM introduce virtually no computational overhead, enabling real-time tracking at 130 frames per second (FPS). Codes and tracking results are available at https://github.com/wenzx18/SiamGM.

2603.07562 2026-03-10 cs.CV

Brain-WM: Brain Glioblastoma World Model

Chenhui Wang, Boyun Zheng, Liuxin Bao, Zhihao Peng, Peter Y. M. Woo, Hongming Shan, Yixuan Yuan

详情
英文摘要

Precise prognostic modeling of glioblastoma (GBM) under varying treatment interventions is essential for optimizing clinical outcomes. While generative AI has shown promise in simulating GBM evolution, existing methods typically treat interventions as static conditional inputs rather than dynamic decision variables. Consequently, they fail to capture the complex, reciprocal interplay between tumor evolution and treatment response. To bridge this gap, we present Brain-WM, a pioneering brain GBM world model that unifies next-step treatment prediction and future MRI generation, thereby capturing the co-evolutionary dynamics between tumor and treatment. Specifically, Brain-WM encodes spatiotemporal dynamics into a shared latent space for joint autoregressive treatment prediction and flow-based future MRI generation. Then, instead of a conventional monolithic framework, Brain-WM adopts a novel Y-shaped Mixture-of-Transformers (MoT) architecture. This design structurally disentangles heterogeneous objectives, successfully leveraging cross-task synergies while preventing feature collapse. Finally, a synergistic multi-timepoint mask alignment objective explicitly anchors latent representations to anatomically grounded tumor structures and progression-aware semantics. Extensive validation on internal and external multi-institutional cohorts demonstrates the superiority of Brain-WM, achieving 91.5% accuracy in treatment planning and SSIMs of 0.8524, 0.8581, and 0.8404 for FLAIR, T1CE, and T2W sequences, respectively. Ultimately, Brain-WM offers a robust clinical sandbox for optimizing patient healthcare. The source code is made available at https://github.com/thibault-wch/Brain-GBM-world-model.

2603.07559 2026-03-10 cs.CV

Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning

Weijia Feng, Jingyu Yang, Ruojia Zhang, Fengtao Sun, Qian Gao, Chenyang Wang, Tongtong Su, Jia Guo, Xiaobai Li, Minglai Shao

Comments 10 pages, accepted by CVPR 2026

详情
英文摘要

Micro-gestures are subtle and transient movements triggered by unconscious neural and emotional activities, holding great potential for human-computer interaction and clinical monitoring. However, their low amplitude, short duration, and strong inter-subject variability make existing deep models prone to degradation under low-sample, noisy, and cross-subject conditions. This paper presents an active inference-based framework for micro-gesture recognition, featuring Expected Free Energy (EFE)-guided temporal sampling and uncertainty-aware adaptive learning. The model actively selects the most discriminative temporal segments under EFE guidance, enabling dynamic observation and information gain maximization. Meanwhile, sample weighting driven by predictive uncertainty mitigates the effects of label noise and distribution shift. Experiments on the SMG dataset demonstrate the effectiveness of the proposed method, achieving consistent improvements across multiple mainstream backbones. Ablation studies confirm that both the EFE-guided observation and the adaptive learning mechanism are crucial to the performance gains. This work offers an interpretable and scalable paradigm for temporal behavior modeling under low-resource and noisy conditions, with broad applicability to wearable sensing, HCI, and clinical emotion monitoring.

2603.07558 2026-03-10 cs.LG

ECG Classification on PTB-XL: A Data-Centric Approach with Simplified CNN-VAE

Naqcho Ali Mehdi, Amir Ali

详情
英文摘要

Automated electrocardiogram (ECG) classification is essential for early detection of cardiovascular diseases. While recent approaches have increasingly relied on deep neural networks with complex architectures, we demonstrate that careful data preprocessing, class balancing, and a simplified convolutional neural network combined with a variational autoencoder (CNN-VAE) architecture can achieve competitive performance with significantly reduced model complexity. Using the publicly available PTB XL dataset, we achieve 87.01% binary accuracy and 0.7454 weighted F1-score across five diagnostic classes (CD, HYP, MI, NORM, STTC) with only 197,093 trainable parameters. Our work emphasises the importance of data-centric machine learning practices over architectural complexity, demonstrating that systematic preprocessing and balanced training strategies are critical for medical signal classification. We identify challenges in minority class detection (particularly hypertrophy) and provide insights for future improvements in handling imbalanced ECG datasets. Index Terms: ECG classification, convolutional neural networks, class balancing, data preprocessing, variational autoencoders, PTB-XL dataset

2603.07552 2026-03-10 cs.CV cs.RO

ReconDrive: Fast Feed-Forward 4D Gaussian Splatting for Autonomous Driving Scene Reconstruction

Haibao Yu, Kuntao Xiao, Jiahang Wang, Ruiyang Hao, Yuxin Huang, Guoran Hu, Haifang Qin, Bowen Jing, Yuntian Bo, Ping Luo

详情
英文摘要

High-fidelity visual reconstruction and novel-view synthesis are essential for realistic closed-loop evaluation in autonomous driving. While 4D Gaussian Splatting (4DGS) offers a promising balance of accuracy and efficiency, existing per-scene optimization methods require costly iterative refinement, rendering them unscalable for extensive urban environments. Conversely, current feed-forward approaches often suffer from degraded photometric quality. To address these limitations, we propose ReconDrive, a feed-forward framework that leverages and extends the 3D foundation model VGGT for rapid, high-fidelity 4DGS generation. Our architecture introduces two core adaptations to tailor the foundation model to dynamic driving scenes: (1) Hybrid Gaussian Prediction Heads, which decouple the regression of spatial coordinates and appearance attributes to overcome the photometric deficiencies inherent in generalized foundation features; and (2) a Static-Dynamic 4D Composition strategy that explicitly captures temporal motion via velocity modeling to represent complex dynamic environments. Benchmarked on nuScenes, ReconDrive significantly outperforms existing feed-forward baselines in reconstruction, novel-view synthesis, and 3D perception. It achieves performance competitive with per-scene optimization while being orders of magnitude faster, providing a scalable and practical solution for realistic driving simulation.

2603.07550 2026-03-10 cs.CL cs.AI

Learning-free L2-Accented Speech Generation using Phonological Rules

Thanathai Lertpetchpun, Yoonjeong Lee, Jihwan Lee, Tiantian Feng, Dani Byrd, Shrikanth Narayanan

Comments Submitted to Interspeech2026

详情
英文摘要

Accent plays a crucial role in speaker identity and inclusivity in speech technologies. Existing accented text-to-speech (TTS) systems either require large-scale accented datasets or lack fine-grained phoneme-level controllability. We propose a accented TTS framework that combines phonological rules with a multilingual TTS model. The rules are applied to phoneme sequences to transform accent at the phoneme level while preserving intelligibility. The method requires no accented training data and enables explicit phoneme-level accent manipulation. We design rule sets for Spanish- and Indian-accented English, modeling systematic differences in consonants, vowels, and syllable structure arising from phonotactic constraints. We analyze the trade-off between phoneme-level duration alignment and accent as realized in speech timing. Experimental results demonstrate effective accent shift while maintaining speech quality.

2603.07546 2026-03-10 cs.AI cs.LG

COOL-MC: Verifying and Explaining RL Policies for Multi-bridge Network Maintenance

Dennis Gross

详情
英文摘要

Aging bridge networks require proactive, verifiable, and interpretable maintenance strategies, yet reinforcement learning (RL) policies trained solely on reward signals provide no formal safety guarantees and remain opaque to infrastructure managers. We demonstrate COOL-MC as a tool for verifying and explaining RL policies for multi-bridge network maintenance, building on a single-bridge Markov decision process (MDP) from the literature and extending it to a parallel network of three heterogeneous bridges with a shared periodic budget constraint, encoded in the PRISM modeling language. We train an RL agent on this MDP and apply probabilistic model checking and explainability methods to the induced discrete-time Markov chain (DTMC) that arises from the interaction between the learned policy and the underlying MDP. Probabilistic model checking reveals that the trained policy has a safety-violation probability of 3.5\% over the planning horizon, being slightly above the theoretical minimum of 0\% and indicating the suboptimality of the learned policy, noting that these results are based on artificially constructed transition probabilities and deterioration rates rather than real-world data, so absolute performance figures should be interpreted with caution. The explainability analysis further reveals, for instance, a systematic bias in the trained policy toward the state of bridge 1 over the remaining bridges in the network. These results demonstrate COOL-MC's ability to provide formal, interpretable, and practical analysis of RL maintenance policies.

2603.07545 2026-03-10 cs.CV cs.AI cs.LG

DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration

Jinzhou Tang, Fan Feng, Minghao Fu, Wenjun Lin, Biwei Huang, Keze Wang

Comments 19 pages, 5 figures

详情
英文摘要

Learned world models excel at interpolative generalization but fail at extrapolative generalization to novel physical properties. This limitation arises because they learn statistical correlations rather than the environment's underlying generative rules, such as physical invariances and conservation laws. We argue that learning these invariances is key to robust extrapolation. To achieve this, we first introduce \textbf{Symmetry Exploration}, an unsupervised exploration strategy where an agent is intrinsically motivated by a Hamiltonian-based curiosity bonus to actively probe and challenge its understanding of conservation laws, thereby collecting physically informative data. Second, we design a Hamiltonian-based world model that learns from the collected data, using a novel self-supervised contrastive objective to identify the invariant physical state from raw, view-dependent pixel observations. Our framework, \textbf{DreamSAC}, trained on this actively curated data, significantly outperforms state-of-the-art baselines in 3D physics simulations on tasks requiring extrapolation.