arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2862
2603.07544 2026-03-10 cs.SD eess.AS

Evaluating Parkinson's Disease Detection in Anonymized Speech: A Performance and Acoustic Analysis

Carlos Franzreb, Francisco Teixeira, Ben Luks, Sebastian Möller, Alberto Abad

Comments Submitted to Interspeech 2026

详情
英文摘要

Automatic detection of Parkinson's disease (PD) from speech is a promising non-invasive diagnostic tool, but it raises significant privacy concerns. Speaker anonymization mitigates these risks, but it may suppress the pathological information necessary for PD detection. We assess the trade-off between privacy and PD detection for two anonymizers (STT-TTS and kNN-VC) using two Spanish datasets. STT-TTS provides better privacy but severely degrades PD detection by eradicating prosodic information. kNN-VC preserves macro-prosodic features such as duration and F0 contours, achieving F1 scores only 3-7\% lower than original baselines, demonstrating that privacy-preserving PD detection is viable when using appropriate anonymization. Finally, an acoustic distortion analysis characterizes specific weaknesses in kNN-VC, offering insights for designing anonymizers that better preserve PD information.

2603.07543 2026-03-10 cs.CV cs.MM

CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization

Anh-Duy Le, Van-Linh Pham, Thanh-Nam Vo, Xuan Toan Mai, Tuan-Anh Tran

Comments Accepted as oral presentation at WACV 2026

详情
英文摘要

One-shot styled handwriting image generation, despite achieving impressive results in recent years, remains challenging due to the difficulty in capturing the intricate and diverse characteristics of human handwriting by using solely a single reference image. Existing methods still struggle to generate visually appealing and realistic handwritten images and adapt to complex, unseen writer styles, struggling to isolate invariant style features (e.g., slant, stroke width, curvature) while ignoring irrelevant noise. To tackle this problem, we introduce Patch Contrastive Enhancement and Style-Aware Quantization via Denoising Diffusion (CONSTANT), a novel one-shot handwriting generation via diffusion model. CONSTANT leverages three key innovations: 1) a Style-Aware Quantization (SAQ) module that models style as discrete visual tokens capturing distinct concepts; 2) a contrastive objective to ensure these tokens are well-separated and meaningful in the embedding style space; 3) a latent patch-based contrastive (LLatentPCE) objective help improving quality and local structures by aligning multiscale spatial patches of generated and real features in latent space. Extensive experiments and analysis on benchmark datasets from multiple languages, including English, Chinese, and our proposed ViHTGen dataset for Vietnamese, demonstrate the superiority of adapting to new reference styles and producing highly detailed images of our method over state-of-the-art approaches. Code is available at GitHub

2603.07540 2026-03-10 cs.CV cs.AI

How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

Haoyu Chen, Qing Liu, Yuqian Zhou, He Zhang, Zhaowen Wang, Mengwei Ren, Jingjing Ren, Xiang Wang, Zhe Lin, Lei Zhu

详情
英文摘要

Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. Extensive experiments demonstrate that this active forgetting approach is essential for stability: UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency, while simultaneously reducing memory footprint and inference time.

2603.07535 2026-03-10 cs.CV

Scale-Aware UAV-to-Satellite Cross-View Geo-Localization: A Semantic Geometric Approach

Yibin Ye, Shuo Chen, Kun Wang, Xiaokai Song, Jisheng Dang, Qifeng Yu, Xichao Teng, Zhang Li

Comments 14 pages

详情
英文摘要

Cross-View Geo-Localization (CVGL) between UAV imagery and satellite images plays a crucial role in target localization and UAV self-positioning. However, most existing methods rely on the idealized assumption of scale consistency between UAV queries and satellite galleries, overlooking the severe scale ambiguity commonly encountered in real-world scenarios. This discrepancy leads to field-of-view misalignment and feature mismatch, significantly degrading CVGL robustness. To address this issue, we propose a geometric framework that recovers the absolute metric scale from monocular UAV images using semantic anchors. Specifically, small vehicles (SVs), characterized by relatively stable prior size distributions and high detectability, are exploited as metric references. A Decoupled Stereoscopic Projection Model is introduced to estimate the absolute image scale from these semantic targets. By decomposing vehicle dimensions into radial and tangential components, the model compensates for perspective distortions in 2D detections of 3D vehicles, enabling more accurate scale estimation. To further reduce intra-class size variation and detection noise, a dual-dimension fusion strategy with Interquartile Range (IQR)-based robust aggregation is employed. The estimated global scale is then used as a physical constraint for scale-adaptive satellite image cropping, improving UAV-to-satellite feature alignment. Experiments on augmented DenseUAV and UAV-VisLoc datasets demonstrate that the proposed method significantly improves CVGL robustness under unknown UAV image scales. Additionally, the framework shows strong potential for downstream applications such as passive UAV altitude estimation and 3D model scale recovery.

2603.07534 2026-03-10 cs.CL

Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data

Thanathai Lertpetchpun, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd, Shrikanth Narayanan

Comments Submitted to Interspeech2026

详情
英文摘要

Accent is an integral part of society, reflecting multiculturalism and shaping how individuals express identity. The majority of English speakers are non-native (L2) speakers, yet current Text-To-Speech (TTS) systems primarily model American-accented English due limited accented data. We propose \textit{Accent Vector}, a controllable representation that enables accent manipulation in multilingual TTS without requiring accented training data. \textit{Accent Vector} is derived by fine-tuning a TTS system on native speech of a different language (i.e. non-English) and computing task vectors capturing accent characteristics (i.e. in English). By scaling and interpolating the vector, we achieve fine-grained control over accent strength and generate mixed-accent speech. In addition, it generalizes beyond English, enabling accent control across multiple languages. Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.

2603.07530 2026-03-10 cs.RO

ICLR: In-Context Imitation Learning with Visual Reasoning

Toan Nguyen, Weiduo Yuan, Songlin Wei, Hui Li, Daniel Seita, Yue Wang

Comments Project website: ICLR" target="_blank" rel="noopener">https://toannguyen1904.github.io/ICLR

详情
英文摘要

In-context imitation learning enables robots to adapt to new tasks from a small number of demonstrations without additional training. However, existing approaches typically condition only on state-action trajectories and lack explicit representations of task intent. This limitation hinders performance in complex and ambiguous task settings where the same actions may be consistent with different objectives. To address this, we present In-Context Imitation Learning with Visual Reasoning (ICLR), a novel framework that augments demonstration prompts with structured visual reasoning traces representing anticipated future robot trajectories in image space. ICLR also jointly learns to generate reasoning traces and low-level actions within a unified autoregressive transformer, enabling the model to mimic not only action prediction but also the reasoning process that leads to those actions. We extensively evaluate ICLR in both simulation and real-world manipulation tasks and demonstrate consistent improvements in success rates and generalization to unseen tasks and novel object configurations compared to other in-context imitation learning methods. These results suggest that incorporating embodied visual reasoning represents a promising direction for enhancing the robustness and generalization of robotic in-context learning systems.

2603.07525 2026-03-10 cs.LG

Generative prediction of laser-induced rocket ignition with dynamic latent space representations

Tony Zahtila, Ettore Saetta, Murray Cutforth, Davy Brouzet, Diego Rossinelli, Gianluca Iaccarino

详情
英文摘要

Accurate and predictive scale-resolving simulations of laser-ignited rocket engines are highly time-consuming because the problem includes turbulent fuel-oxidizer mixing dynamics, laser-induced energy deposition, and high-speed flame growth. This is conflated with the large design space primarily corresponding to the laser operating conditions and target location. To enable rapid exploration and uncertainty quantification, we propose a data-driven surrogate modeling approach that combines convolutional autoencoders (cAEs) with neural ordinary differential equations (neural ODEs). The present target application of an ML-based surrogate model to leading-edge multi-physics turbulence simulation is part of a paradigm shift in the deployment of surrogate models towards increasing real-world complexity. Sequentially, the cAE spatially compresses high-dimensional flow fields into a low-dimensional latent space, wherein the system's temporal dynamics are learned via neural ODEs. Once trained, the model generates fast spatiotemporal predictions from initial conditions and specified operating inputs. By learning a surrogate to replace the entirety of the time-evolving simulation, the cost of predicting an ignition trial is reduced by several orders of magnitude, allowing efficient exploration of the input parameter space. Further, as the current framework yields a spatiotemporal field prediction, appraisal of the model output's physical grounding is more tractable. This approach marks a significant step toward real-time digital twins for laser-ignited rocket combustors and represents surrogate modeling in a complex system context.

2603.07524 2026-03-10 cs.LG cs.AI

Neural Dynamics-Informed Pre-trained Framework for Personalized Brain Functional Network Construction

Hongjie Jiang, Yifei Tang, Shuqiang Wang

详情
英文摘要

Brain activity is intrinsically a neural dynamic process constrained by anatomical space. This leads to significant variations in spatial distribution patterns and correlation patterns of neural activity across variable and heterogeneous scenarios. However, dominant brain functional network construction methods, which relies on pre-defined brain atlases and linear assumptions, fails to precisely capture varying neural activity patterns in heterogeneous scenarios. This limits the consistency and generalizability of the brain functional networks constructed by dominant methods. Here, a neural dynamics-informed pre-trained framework is proposed for personalized brain functional network construction. The proposed framework extracts personalized representations of neural activity patterns in heterogeneous scenarios. Personalized brain functional networks are obtained by utilizing these representations to guide brain parcellation and neural activity correlation estimation. Systematic evaluations were employed on 18 datasets across tasks, such as virtual neural modulation and abnormal neural circuit identification. Experimental results demonstrate that the proposed framework attains superior performance in heterogeneous scenarios. Overall, the proposed framework challenges the dominant brain functional network construction method.

2603.07521 2026-03-10 cs.CV cs.AI

SketchGraphNet: A Memory-Efficient Hybrid Graph Transformer for Large-Scale Sketch Corpora Recognition

Shilong Chen, Mingyuan Li, Zhaoyang Wang, Zhonglin Ye, Haixing Zhao

详情
英文摘要

This work investigates large-scale sketch recognition from a graph-native perspective, where free-hand sketches are directly modeled as structured graphs rather than raster images or stroke sequences. We propose SketchGraphNet, a hybrid graph neural architecture that integrates local message passing with a memory-efficient global attention mechanism, without relying on auxiliary positional or structural encodings. To support systematic evaluation, we construct SketchGraph, a large-scale benchmark comprising 3.44 million graph-structured sketches across 344 categories, with two variants (A and R) to reflect different noise conditions. Each sketch is represented as a spatiotemporal graph with normalized stroke-order attributes. On SketchGraph-A and SketchGraph-R, SketchGraphNet achieves Top-1 accuracies of 83.62% and 87.61%, respectively, under a unified training configuration. MemEffAttn further reduces peak GPU memory by over 40% and training time by more than 30% compared with Performer-based global attention, while maintaining comparable accuracy.

2603.07518 2026-03-10 cs.LG

Reinforcement learning-based dynamic cleaning scheduling framework for solar energy system

Heungjo An

Comments 16 pages, 6 figures, This is an accepted manuscript of the article published in Journal of Korean Institute of Intelligent Systems, 35(1), 84-97, 2025

详情
Journal ref
Journal of Korean Institute of Intelligent Systems, 35(1), 84-97, 2025
英文摘要

Advancing autonomous green technologies in solar photovoltaic (PV) systems is key to improving sustainability and efficiency in renewable energy production. This study presents a reinforcement learning (RL)-based framework to autonomously optimize the cleaning schedules of PV panels in arid regions, where soiling from dust and other airborne particles significantly reduces energy output. By employing advanced RL algorithms, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), the framework dynamically adjusts cleaning intervals based on uncertain environmental conditions. The proposed approach was applied to a case study in Abu Dhabi, UAE, demonstrating that PPO outperformed SAC and traditional simulation optimization (Sim-Opt) methods, achieving up to 13% cost savings by dynamically responding to weather uncertainties. The results highlight the superiority of flexible, autonomous scheduling over fixed-interval methods, particularly in adapting to stochastic environmental dynamics. This aligns with the goals of autonomous green energy production by reducing operational costs and improving the efficiency of solar power generation systems. This work underscores the potential of RL-driven autonomous decision-making to optimize maintenance operations in renewable energy systems. In future research, it is important to enhance the generalization ability of the proposed RL model, while also considering additional factors and constraints to apply it to different regions.

2603.07516 2026-03-10 cs.RO cs.AI

InterReal: A Unified Physics-Based Imitation Framework for Learning Human-Object Interaction Skills

Dayang Liang, Yuhang Lin, Xinzhe Liu, Jiyuan Shi, Yunlong Liu, Chenjia Bai

详情
英文摘要

Interaction is one of the core abilities of humanoid robots. However, most existing frameworks focus on non-interactive whole-body control, which limits their practical applicability. In this work, we develop InterReal, a unified physics-based imitation learning framework for Real-world human-object Interaction (HOI) control. InterReal enables humanoid robots to track HOI reference motions, facilitating the learning of fine-grained interactive skills and their deployment in real-world settings. Within this framework, we first introduce a HOI motion data augmentation scheme with hand-object contact constraints, and utilize the augmented motions to improve policy stability under object perturbations. Second, we propose an automatic reward learner to address the challenge of large-scale reward shaping. A meta-policy guided by critical tracking error metrics explores and allocates reward signals to the low-level reinforcement learning objective, which enables more effective learning of interactive policies. Experiments on HOI tasks of box-picking and box-pushing demonstrate that InterReal achieves the best tracking accuracy and the highest task success rate compared to recent baselines. Furthermore, we validate the framework on the real-world robot Unitree G1, which demonstrates its practical effectiveness and robustness beyond simulation.

2603.07515 2026-03-10 cs.CV

EvolveReason: Self-Evolving Reasoning Paradigm for Explainable Deepfake Facial Image Identification

Binjia Zhou, Dawei Luo, Shuai Chen, Feng Xu, Seow, Haoyuan Li, Jiachi Wang, Jiawen Wang, Zunlei Feng, Yijun Bei

详情
英文摘要

With the rapid advancement of AIGC technology, developing identification methods to address the security challenges posed by deepfakes has become urgent. Face forgery identification techniques can be categorized into two types: traditional classification methods and explainable VLM approaches. The former provides classification results but lacks explanatory ability, while the latter, although capable of providing coarse-grained explanations, often suffers from hallucinations and insufficient detail. To overcome these limitations, we propose EvolveReason, which mimics the reasoning and observational processes of human auditors when identifying face forgeries. By constructing a chain-of-thought dataset, CoT-Face, tailored for advanced VLMs, our approach guides the model to think in a human-like way, prompting it to output reasoning processes and judgment results. This provides practitioners with reliable analysis and helps alleviate hallucination. Additionally, our framework incorporates a forgery latent-space distribution capture module, enabling EvolveReason to identify high-frequency forgery cues difficult to extract from the original images. To further enhance the reliability of textual explanations, we introduce a self-evolution exploration strategy, leveraging reinforcement learning to allow the model to iteratively explore and optimize its textual descriptions in a two-stage process. Experimental results show that EvolveReason not only outperforms the current state-of-the-art methods in identification performance but also accurately identifies forgery details and demonstrates generalization capabilities.

2603.07513 2026-03-10 cs.CL

Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech

Tajamul Ashraf, Burhaan Rasheed Zargar, Saeed Abdul Muizz, Ifrah Mushtaq, Nazima Mehdi, Iqra Altaf Gillani, Aadil Amin Kak, Janibul Bashir

Comments https://gaash-lab.github.io/Bolbosh/

详情
英文摘要

Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust Text-to-Speech (TTS) systems limits digital accessibility and inclusive human-computer interaction for native speakers. In this work, we present the first dedicated open-source neural TTS system designed for Kashmiri. We show that zero-shot multilingual baselines trained for Indic languages fail to produce intelligible speech, achieving a Mean Opinion Score (MOS) of only 1.86, largely due to inadequate modeling of Perso-Arabic diacritics and language-specific phonotactics. To address these limitations, we propose Bolbosh, a supervised cross-lingual adaptation strategy based on Optimal Transport Conditional Flow Matching (OT-CFM) within the Matcha-TTS framework. This enables stable alignment under limited paired data. We further introduce a three-stage acoustic enhancement pipeline consisting of dereverberation, silence trimming, and loudness normalization to unify heterogeneous speech sources and stabilize alignment learning. The model vocabulary is expanded to explicitly encode Kashmiri graphemes, preserving fine-grained vowel distinctions. Our system achieves a MOS of 3.63 and a Mel-Cepstral Distortion (MCD) of 3.73, substantially outperforming multilingual baselines and establishing a new benchmark for Kashmiri speech synthesis. Our results demonstrate that script-aware and supervised flow-based adaptation are critical for low-resource TTS in diacritic-sensitive languages. Code and data are available at: https://github.com/gaash-lab/Bolbosh.

2603.07507 2026-03-10 cs.LG

Online Continual Learning for Anomaly Detection in IoT under Data Distribution Shifts

Matea Marinova, Shashi Raj Pandey, Junya Shiraishi, Martin Voigt Vejling, Valentin Rakovic, Petar Popovski

Comments Manuscript submitted to EUSIPCO 2026. The copyright might be transferred without further notice

详情
英文摘要

In this work, we present OCLADS, a novel communication framework with continual learning (CL) for Internet of Things (IoT) anomaly detection (AD) when operating in non-stationary environments. As the statistical properties of the observed data change with time, the on-device inference model becomes obsolete, which necessitates strategic model updating. OCLADS keeps track of data distribution shifts to timely update the on-device IoT AD model. To do so, OCLADS introduces two mechanisms during the interaction between the resource-constrained IoT device and an edge server (ES): i) an intelligent sample selection mechanism at the device for data transmission, and ii) a distribution-shift detection mechanism at the ES for model updating. Experimental results with TinyML demonstrate that our proposed framework achieves high inference accuracy while realizing a significantly smaller number of model updates compared to the baseline schemes.

2603.07506 2026-03-10 cs.LG

A Unified Framework for Knowledge Transfer in Bidirectional Model Scaling

Jianlu Shen, Fu Feng, Jiaze Xu, Yucheng Xie, Jiaqi Lv, Xin Geng

详情
英文摘要

Transferring pre-trained knowledge from a source model to a target model of a different architectural size is a key challenge for flexible and efficient model scaling. However, current parameter-space methods treat Small-to-Large (S2L) and Large-to-Small (L2S) scaling as separate, incompatible problems, focusing on parameter synthesis and selection, respectively. This fragmented perspective has resulted in specialized tools, hindering a unified, bidirectional framework. In this paper, we propose BoT (Bidirectional knowledge Transfer), the first size-agnostic framework to unify S2L and L2S scaling. Our core insight is to treat model weights as continuous signals, where models of different sizes represent distinct discretizations of the transferable knowledge. This multi-resolution perspective directly casts S2L and L2S scaling as the signal processing operations of upsampling and downsampling, naturally leading to the adoption of the Discrete Wavelet Transform (DWT) and its Inverse (IDWT). BoT leverages the recursive nature of wavelets, using the decomposition level as a dynamic scaling factor to bridge disparate model sizes in a parameter-free and computationally efficient manner. Extensive experiments on DeiT, BERT, and GPT demonstrate significant pre-training FLOPs savings (up to 67.1% for S2L, 52.8% for L2S) and state-of-the-art performance on benchmarks like GLUE and SQuAD.

2603.07500 2026-03-10 cs.LG

Enhanced Random Subspace Local Projections for High-Dimensional Time Series Analysis

Eman Khalid, Moimma Ali Khan, Zarmeena Ali, Abdullah Illyas, Muhammad Usman, Saoud Ahmed

Comments 12 pages, 18 figures

详情
英文摘要

High-dimensional time series forecasting suffers from severe overfitting when the number of predictors exceeds available observations, making standard local projection methods unstable and unreliable. We propose an enhanced Random Subspace Local Projection (RSLP) framework designed to deliver robust impulse response estimation in the presence of hundreds of correlated predictors. The method introduces weighted subspace aggregation, category-aware subspace sampling, adaptive subspace size selection, and a bootstrap inference procedure tailored to dependent data. These enhancements substantially improve estimator stability at longer forecast horizons while providing more reliable finite-sample inference. Experiments on synthetic data, macroeconomic indicators, and the FRED-MD dataset demonstrate a 33 percent reduction in estimator variability at horizons h >= 3 through adaptive subspace size selection. The bootstrap inference procedure produces conservative confidence intervals that are 14 percent narrower at policy-relevant horizons in very high-dimensional settings (FRED-MD with 126 predictors) while maintaining proper coverage. The framework provides practitioners with a principled approach for incorporating rich information sets into impulse response analysis without the instability of traditional high-dimensional methods.

2603.07497 2026-03-10 cs.CV

AMR-CCR: Anchored Modular Retrieval for Continual Chinese Character Recognition

Yuchuan Wu, Yinglian Zhu, Haiyang Yu, Ke Niu, Bin Li, Xiangyang Xue

详情
英文摘要

Ancient Chinese character recognition is a core capability for cultural heritage digitization, yet real-world workflows are inherently non-stationary: newly excavated materials are continuously onboarded, bringing new classes in different scripts, and expanding the class space over time. We formalize this process as Continual Chinese Character Recognition (Continual CCR), a script-staged, class-incremental setting that couples two challenges: (i) scalable learning under continual class growth with subtle inter-class differences and scarce incremental data, and (ii) pronounced intra-class diversity caused by writing-style variations across writers and carrier conditions. To overcome the limitations of conventional closed-set classification, we propose AMR-CCR, an anchored modular retrieval framework that performs recognition via embedding-based dictionary matching in a shared multimodal space, allowing new classes to be added by simply extending the dictionary. AMR-CCR further introduces a lightweight script-conditioned injection module (SIA+SAR) to calibrate newly onboarded scripts while preserving cross-stage embedding compatibility, and an image-derived multi-prototype dictionary that clusters within-class embeddings to better cover diverse style modes. To support systematic evaluation, we build EvoCON, a six-stage benchmark for continual script onboarding, covering six scripts (OBC, BI, SS, SAC, WSC, CS), augmented with meaning/shape descriptions and an explicit zero-shot split for unseen characters without image exemplars.

2603.07494 2026-03-10 cs.CV

DocCogito: Aligning Layout Cognition and Step-Level Grounded Reasoning for Document Understanding

Yuchuan Wu, Minghan Zhuo, Teng Fu, Mengyang Zhao, Bin Li, Xiangyang Xue

详情
英文摘要

Document understanding with multimodal large language models (MLLMs) requires not only accurate answers but also explicit, evidence-grounded reasoning, especially in high-stakes scenarios. However, current document MLLMs still fall short of forming a complete, human-like reasoning process, because even when they improve both layout encoding and CoT-style prompting, the interaction between the two is typically learned implicitly and remains loosely coupled rather than being enforced as a systematic mechanism. So we propose DocCogito, a unified framework that integrates global layout perception with structured, region-grounded reasoning. DocCogito introduces a lightweight layout tower that distills page structure into learnable global layout prior tokens, and a deterministic Visual-Semantic Chain (VSC)-a concise structured representation less ambiguous than free-form natural-language CoT-to supervise fine-grained intermediate reasoning aligned with evidence regions. Training follows a progressive recipe, including layout perception pretraining, VSC-guided cold start, rejection sampling, and GRPO. To further strengthen the internal coupling between layout priors and VSC execution, we augment standard rewards with a fine-grained region-confidence signal that encourages reasoning traces to stay aligned with corresponding evidence regions. Extensive experiments on six benchmarks (DocVQA, WTQ, ChartQA, TextVQA, OCRBench, and InfoVQA) demonstrate strong generalization, achieving state-of-the-art results on four benchmarks.

2603.07493 2026-03-10 cs.CV

RayD3D: Distilling Depth Knowledge Along the Ray for Robust Multi-View 3D Object Detection

Rui Ding, Zhaonian Kuang, Zongwei Zhou, Meng Yang, Xinhu Zheng, Gang Hua

详情
英文摘要

Multi-view 3D detection with bird's eye view (BEV) is crucial for autonomous driving and robotics, but its robustness in real-world is limited as it struggles to predict accurate depth values. A mainstream solution, cross-modal distillation, transfers depth information from LiDAR to camera models but also unintentionally transfers depth-irrelevant information (e.g. LiDAR density). To mitigate this issue, we propose RayD3D, which transfers crucial depth knowledge along the ray: a line projecting from the camera to true location of an object. It is based on the fundamental imaging principle that predicted location of this object can only vary along this ray, which is finally determined by predicted depth value. Therefore, distilling along the ray enables more effective depth information transfer. More specifically, we design two ray-based distillation modules. Ray-based Contrastive Distillation (RCD) incorporates contrastive learning into distillation by sampling along the ray to learn how LiDAR accurately locates objects. Ray-based Weighted Distillation (RWD) adaptively adjusts distillation weight based on the ray to minimize the interference of depth-irrelevant information in LiDAR. For validation, we widely apply RayD3D into three representative types of BEV-based models, including BEVDet, BEVDepth4D, and BEVFormer. Our method is trained on clean NuScenes, and tested on both clean NuScenes and RoboBEV with a variety types of data corruptions. Our method significantly improves the robustness of all the three base models in all scenarios without increasing inference costs, and achieves the best when compared to recently released multi-view and distillation models.

2603.07487 2026-03-10 cs.CL cs.AI

A Joint Neural Baseline for Concept, Assertion, and Relation Extraction from Clinical Text

Fei Cheng, Ribeka Tanaka, Sadao Kurohashi

Comments Technical Report. Our code is available at: https://github.com/racerandom/JaMIE

详情
英文摘要

Clinical information extraction (e.g., 2010 i2b2/VA challenge) usually presents tasks of concept recognition, assertion classification, and relation extraction. Jointly modeling the multi-stage tasks in the clinical domain is an underexplored topic. The existing independent task setting (reference inputs given in each stage) makes the joint models not directly comparable to the existing pipeline work. To address these issues, we define a joint task setting and propose a novel end-to-end system to jointly optimize three-stage tasks. We empirically investigate the joint evaluation of our proposal and the pipeline baseline with various embedding techniques: word, contextual, and in-domain contextual embeddings. The proposed joint system substantially outperforms the pipeline baseline by +0.3, +1.4, +3.1 for the concept, assertion, and relation F1. This work bridges joint approaches and clinical information extraction. The proposed approach could serve as a strong joint baseline for future research. The code is publicly available.

2603.07486 2026-03-10 cs.CV

Multi-Modal Decouple and Recouple Network for Robust 3D Object Detection

Rui Ding, Zhaonian Kuang, Yuzhe Ji, Meng Yang, Xinhu Zheng, Gang Hua

详情
英文摘要

Multi-modal 3D object detection with bird's eye view (BEV) has achieved desired advances on benchmarks. Nonetheless, the accuracy may drop significantly in the real world due to data corruption such as sensor configurations for LiDAR and scene conditions for camera. One design bottleneck of previous models resides in the tightly coupling of multi-modal BEV features during fusion, which may degrade the overall system performance if one modality or both is corrupted. To mitigate, we propose a Multi-Modal Decouple and Recouple Network for robust 3D object detection under data corruption. Different modalities commonly share some high-level invariant features. We observe that these invariant features across modalities do not always fail simultaneously, because different types of data corruption affect each modality in distinct ways.These invariant features can be recovered across modalities for robust fusion under data corruption.To this end, we explicitly decouple Camera/LiDAR BEV features into modality-invariant and modality-specific parts. It allows invariant features to compensate each other while mitigates the negative impact of a corrupted modality on the other.We then recouple these features into three experts to handle different types of data corruption, respectively, i.e., LiDAR, camera, and both.For each expert, we use modality-invariant features as robust information, while modality-specific features serve as a complement.Finally, we adaptively fuse the three experts to exact robust features for 3D object detection. For validation, we collect a benchmark with a large quantity of data corruption for LiDAR, camera, and both based on nuScenes. Our model is trained on clean nuScenes and tested on all types of data corruption. Our model consistently achieves the best accuracy on both corrupted and clean data compared to recent models.

2603.07484 2026-03-10 cs.RO

HSC-VLA: Hierarchical Scene-Clearing for Robust Bimanual Manipulation in Dense Clutter

Zhen Liu, Xinyu Ning, Zhe Hu, XinXin Xie, Yitong Liu, Zhongzhu Pu

详情
英文摘要

Modern Vision--Language--Action models often suffer from critical instruction-following failures in high-density manipulation environments, where task-irrelevant visual clutter dilutes attention, corrupts grounding, and substantially degrades performance in complex long-horizon scenarios. To overcome the representation bottleneck of monolithic end-to-end architectures, we propose HSC-VLA, a hierarchical framework that decouples high-level visual-semantic reasoning from low-level, high-frequency sensorimotor execution through an explicit scene-clearing abstraction. HSC-VLA employs a high-level Brain to decompose long-horizon tasks and to generate task-specific scene masks that preserve task-relevant geometry while suppressing distractors. The filtered observations are then passed to a low-level Cerebellum, a diffusion-based policy that performs bimanual manipulation using only mask-filtered vision and proprioception. Extensive experiments in densely cluttered supermarket shelves demonstrate that HSC-VLA achieves 86.7\% aggregate success under high-density clutter, surpassing the best monolithic baseline ($π_0$-Full FT at 34.3\%) by 52.4\%. HSC-VLA also exhibits strong long-horizon performance, reaching 72\% on clutter sorting and 66\% on restocking, demonstrating strong robustness and effective failure recovery in complex cluttered manipulation.

2603.07482 2026-03-10 cs.LG cs.AI

Interpretable-by-Design Transformers via Architectural Stream Independence

Clayton Kerce, Alexis Fox

详情
英文摘要

While transformers achieve strong performance, their internal decision-making processes remain opaque. We investigate whether architectural constraints can enforce interpretability by design through architectural stream independence: maintaining a token stream (carrying symbolic structure) and contextual semantics in separated streams that remain independently observable throughout processing, with integration delayed until output. We validate this principle through the Late Fusion Architecture (LFA), which demonstrates interpretable symbolic heads through all the final layers, while standard transformers show dissolution by the third of six layers; we quantify this effect by introducing the Token-Position Dependence Score (PDS), with $PDS_{max}$ = 0.276 and 0.058, respectively. Crucially, intervention experiments demonstrate functional modularity: suppressing LFA's recency heads causes minimal semantic damage (Cohen's d = -0.158) versus catastrophic entanglement in baselines (d = -0.672). LFA demonstrates that architectural constraints improve underlying learning mechanisms, averaging 42% stability versus 19% and 11% for baseline comparisons, with extremes from 50% on LFA's best pairs (6 of 12 heads position-invariant) down to 0% complete collapse in over-constrained cases. By preventing premature entanglement, architectural independence steers models toward semantic understanding over positional heuristics, establishing interpretability as an architectural design criterion enforceable through structural constraints rather than post-hoc analysis.

2603.07480 2026-03-10 cs.RO

GSAT: Geometric Traversability Estimation using Self-supervised Learning with Anomaly Detection for Diverse Terrains

Dongjin Cho, Miryeong Park, Juhui Lee, Geonmo Yang, Younggun Cho

Comments 8 pages, 8 figures, accepted to ICRA 2026

详情
英文摘要

Safe autonomous navigation requires reliable estimation of environmental traversability. Traditional methods have relied on semantic or geometry-based approaches with human-defined thresholds, but these methods often yield unreliable predictions due to the inherent subjectivity of human supervision. While self-supervised approaches enable robots to learn from their own experience, they still face a fundamental challenge: the positive-only learning problem. To address these limitations, recent studies have employed Positive-Unlabeled (PU) learning, where the core challenge is identifying positive samples without explicit negative supervision. In this work, we propose GSAT, which addresses these limitations by constructing a positive hypersphere in latent space to classify traversable regions through anomaly detection without requiring additional prototypes (e.g., unlabeled or negative). Furthermore, our approach employs joint learning of anomaly classification and traversability prediction to more efficiently utilize robot experience. We comprehensively evaluate the proposed framework through ablation studies, validation on heterogeneous real-world robotic platforms, and autonomous navigation demonstrations in simulation environments.

2603.07476 2026-03-10 cs.CV

EVLF: Early Vision-Language Fusion for Generative Dataset Distillation

Wenqi Cai, Yawen Zou, Guang Li, Chunzhi Gu, Chao Zhang

Comments CVPR2026 (main conference)

详情
英文摘要

Dataset distillation (DD) aims to synthesize compact training sets that enable models to achieve high accuracy with significantly fewer samples. Recent diffusion-based DD methods commonly introduce semantic guidance through late-stage cross-attention, where textual prompts tend to dominate the generative process. Although this strategy enforces label relevance, it diminishes the contribution of visual latents, resulting in over-corrected samples that mirror prompt patterns rather than reflecting intrinsic visual features. To solve this problem, we introduce an Early Vision-Language Fusion (EVLF) method that aligns textual and visual embeddings at the transition between the encoder and the generative backbone. By incorporating a lightweight cross-attention module at this transition, the early representations simultaneously encode local textures and global semantic directions across the denoising process. Importantly, EVLF is plug-and-play and can be easily integrated into any diffusion-based dataset distillation pipeline with an encoder. It works across different denoiser architectures and sampling schedules without any task-specific modifications. Extensive experiments demonstrate that EVLF generates semantically faithful and visually coherent synthetic data, yielding consistent improvements in downstream classification accuracy across varied settings. Source code is available at https://github.com/wenqi-cai297/earlyfusion-for-dd/.

2603.07472 2026-03-10 cs.LG cs.AI

Contact-Guided 3D Genome Structure Generation of E. coli via Diffusion Transformers

Mingxin Zhang, Xiaofeng Dai, Yu Yao, Ziqi Yin

Comments Accepted at the Gen2 Workshop at ICLR 2026

详情
英文摘要

In this study, we present a conditional diffusion-transformer framework for generating ensembles of three-dimensional Escherichia coli genome conformations guided by Hi-C contact maps. Instead of producing a single deterministic structure, we formulate genome reconstruction as a conditional generative modeling problem that samples heterogeneous conformations whose ensemble-averaged contacts are consistent with the input Hi-C data. A synthetic dataset is constructed using coarse-grained molecular dynamics simulations to generate chromatin ensembles and corresponding Hi-C maps under circular topology. Our models operate in a latent diffusion setting with a variational autoencoder that preserves per-bin alignment and supports replication-aware representations. Hi-C information is injected through a transformer-based encoder and cross-attention, enforcing a physically interpretable one-way constraint from Hi-C to structure. The model is trained using a flow-matching objective for stable optimization. On held-out ensembles, generated structures reproduce the input Hi-C distance-decay and structural correlation metrics while maintaining substantial conformational diversity, demonstrating the effectiveness of diffusion-based generative modeling for ensemble-level 3D genome reconstruction.

2603.07468 2026-03-10 cs.CV

FedEU: Evidential Uncertainty-Driven Federated Fine-Tuning of Vision Foundation Models for Remote Sensing Image Segmentation

Xiaokang Zhang, Xuran Xiong, Jianzhong Huang, Lefei Zhang

Comments 14 pages, 8 figures

详情
英文摘要

Remote sensing image segmentation (RSIS) in federated environments has gained increasing attention because it enables collaborative model training across distributed datasets without sharing raw imagery or annotations. Federated RSIS combined with parameter-efficient fine-tuning (PEFT) can unleash the generalization power of pretrained foundation models for real-world applications, with minimal parameter aggregation and communication overhead. However, the dynamic adaptation of pretrained models to heterogeneous client data inevitably increases update uncertainty and compromises the reliability of collaborative optimization due to the lack of uncertainty estimation for each local model. To bridge this gap, we present FedEU, a federated optimization framework for fine-tuning RSIS models driven by evidential uncertainty. Specifically, personalized evidential uncertainty modeling is introduced to quantify epistemic variations of local models and identify high-risk areas under local data distributions. Furthermore, the client-specific feature embedding (CFE) is exploited to enhance channel-aware feature representation while preserving client-specific properties through personalized attention and an element-aware parameter update approach. These uncertainty estimates are uploaded to the server to enable adaptive global aggregation via a Top-k uncertainty-guided weighting (TUW) strategy, which mitigates the impact of distribution shifts and unreliable updates. Extensive experiments on three large-scale heterogeneous datasets demonstrate the superior performance of FedEU. More importantly, FedEU enables balanced model adaptation across diverse clients by explicitly reducing prediction uncertainty, resulting in more robust and reliable federated outcomes. The source codes will be available at https://github.com/zxk688/FedEU.

2603.07465 2026-03-10 cs.CV

Classifying Novel 3D-Printed Objects without Retraining: Towards Post-Production Automation in Additive Manufacturing

Fanis Mathioulakis, Gorjan Radevski, Silke GC Cleuren, Michel Janssens, Brecht Das, Koen Schauwaert, Tinne Tuytelaars

详情
英文摘要

Reliable classification of 3D-printed objects is essential for automating post-production workflows in industrial additive manufacturing. Despite extensive automation in other stages of the printing pipeline, this task still relies heavily on manual inspection, as the set of objects to be classified can change daily, making frequent model retraining impractical. Automating the identification step is therefore critical for improving operational efficiency. A vision model that could classify any set of objects by utilizing their corresponding CAD models and avoiding retraining would be highly beneficial in this setting. To enable systematic evaluation of vision models on this task, we introduce ThingiPrint, a new publicly available dataset that pairs CAD models with real photographs of their 3D-printed counterparts. Using ThingiPrint, we benchmark a range of existing vision models on the task of 3D-printed object classification. We additionally show that contrastive fine-tuning with a rotation-invariant objective allows effective prototype-based classification of previously unseen 3D-printed objects. By relying solely on the available CAD models, this avoids the need for retraining when new objects are introduced. Experiments show that this approach outperforms standard pretrained baselines, suggesting improved generalization and practical relevance for real-world use.

2603.07464 2026-03-10 cs.CV

Selective Transfer Learning of Cross-Modality Distillation for Monocular 3D Object Detection

Rui Ding, Meng Yang, Nanning Zheng

详情
英文摘要

Monocular 3D object detection is a promising yet ill-posed task for autonomous vehicles due to the lack of accurate depth information. Cross-modality knowledge distillation could effectively transfer depth information from LiDAR to image-based network. However, modality gap between image and LiDAR seriously limits its accuracy. In this paper, we systematically investigate the negative transfer problem induced by modality gap in cross-modality distillation for the first time, including not only the architecture inconsistency issue but more importantly the feature overfitting issue. We propose a selective learning approach named MonoSTL to overcome these issues, which encourages positive transfer of depth information from LiDAR while alleviates the negative transfer on image-based network. On the one hand, we utilize similar architectures to ensure spatial alignment of features between image-based and LiDAR-based networks. On the other hand, we develop two novel distillation modules, namely Depth-Aware Selective Feature Distillation (DASFD) and Depth-Aware Selective Relation Distillation (DASRD), which selectively learn positive features and relationships of objects by integrating depth uncertainty into feature and relation distillations, respectively. Our approach can be seamlessly integrated into various CNN-based and DETR-based models, where we take three recent models on KITTI and a recent model on NuScenes for validation. Extensive experiments show that our approach considerably improves the accuracy of the base models and thereby achieves the best accuracy compared with all recently released SOTA models.

2603.07463 2026-03-10 cs.CV

SIGMAE: A Spectral-Index-Guided Foundation Model for Multispectral Remote Sensing

Xiaokang Zhang, Bo Li, Chufeng Zhou, Weikang Yu, Lefei Zhang

Comments 17pages,10figures

详情
英文摘要

Pretraining and fine-tuning have emerged as a new paradigm in remote sensing image interpretation. Among them, Masked Autoencoder (MAE)-based pretraining stands out for its strong capability to learn general feature representations via reconstructing masked image regions. However, applying MAE to multispectral remote sensing images remains challenging due to complex backgrounds, indistinct targets, and the lack of semantic guidance during masking, which hinders the learning of underlying structures and meaningful spatial-spectral features. To address this, we propose a simple yet effective approach, Spectral Index-Guided MAE (SIGMAE), for multispectral image pretraining. The core idea is to incorporate domain-specific spectral indices as prior knowledge to guide dynamic token masking toward informative regions. SIGMAE introduces Semantic Saliency-Guided Dynamic Token Masking (SSDTM), a curriculum-style strategy that quantifies each patch's semantic richness and internal heterogeneity to adaptively select the most informative tokens during training. By prioritizing semantically salient regions and progressively increasing sample difficulty, SSDTM enhances spectrally rich and structurally aware representation learning, mitigates overfitting, and reduces redundant computation compared with random masking. Extensive experiments on five widely used datasets covering various downstream tasks, including scene classification, semantic segmentation, object extraction and change detection, demonstrate that SIGMAE outperforms other pretrained geospatial foundation models. Moreover, it exhibits strong spatial-spectral reconstruction capability, even with a 90% mask ratio, and improves complex target recognition under limited labeled data. The source codes and model weights will be released at https://github.com/zxk688/SIGMAE.