arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 3196
2603.14251 2026-03-17 cs.CL cs.AI

Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring

Weixin Guan, Liang Li, Jiapeng Liu, Bing Li, Peng Fu, Chengyang Fang, Xiaoshuai Hao, Can Ma, Weiping Wang

详情
英文摘要

Large Reasoning Language Models (LRLMs) demonstrate impressive capabilities on complex tasks by utilizing long Chain-of-Thought reasoning. However, they are prone to overthinking, which generates redundant reasoning steps that degrade both performance and efficiency. Recently, early-exit strategies are proposed to mitigate overthinking by dynamically and adaptively terminating redundant reasoning. However, current early-exit methods either introduce extra training overhead by relying on proxy models or limit inference throughput due to the frequent content switching between reasoning and generating probing answers. Moreover, most early-exit methods harm LRLMs performance due to over-truncation. Our insight stems from an observation: overthinking often causes LRLMs to deviate from the correct reasoning path, which is frequently accompanied by high-entropy transition tokens. Given this, we propose an early-exit method deeply coupled with the native reasoning process, which leverages the path deviation index as a dedicated monitoring metric for the frequent occurrence of high-entropy transition tokens to dynamically detect and terminate overthinking trajectories. We conduct experiments across multiple benchmarks using LRLMs of different types and scales, and the results indicate that our method delivers the largest performance improvement over vanilla CoT compared to existing early-exit methods.

2603.14249 2026-03-17 cs.CV

OAHuman: Occlusion-Aware 3D Human Reconstruction from Monocular Images

Yuanwang Yang, Hongliang Liu, Muxin Zhang, Nan Ma, Jingyu Yang, Yu-Kun Lai, Kun Li

详情
英文摘要

Monocular 3D human reconstruction in real-world scenarios remains highly challenging due to frequent occlusions from surrounding objects, people, or image truncation. Such occlusions lead to missing geometry and unreliable appearance cues, severely degrading the completeness and realism of reconstructed human models. Although recent neural implicit methods achieve impressive results on clean inputs, they struggle under occlusion due to entangled modeling of shape and texture. In this paper, we propose OAHuman, an occlusion-aware framework that explicitly decouples geometry reconstruction and texture synthesis for robust 3D human modeling from a single RGB image. The core innovation lies in the decoupling-perception paradigm, which addresses the fundamental issue of geometry-texture cross-contamination in occluded regions. Our framework ensures that geometry reconstruction is perceptually reinforced even in occluded areas, isolating it from texture interference. In parallel, texture synthesis is learned exclusively from visible regions, preventing texture errors from being transferred to the occluded areas. This decoupling approach enables OAHuman to achieve robust and high-fidelity reconstruction under occlusion, which has been a long-standing challenge in the field. Extensive experiments on occlusion-rich benchmarks demonstrate that OAHuman achieves superior performance in terms of structural completeness, surface detail, and texture realism, significantly improving monocular 3D human reconstruction under occlusion conditions.

2603.14245 2026-03-17 cs.LG cs.AI

GoldenStart: Q-Guided Priors and Entropy Control for Distilling Flow Policies

He Zhang, Ying Sun, Hui Xiong

Comments 23 pages, 13 figures

详情
英文摘要

Flow-matching policies hold great promise for reinforcement learning (RL) by capturing complex, multi-modal action distributions. However, their practical application is often hindered by prohibitive inference latency and ineffective online exploration. Although recent works have employed one-step distillation for fast inference, the structure of the initial noise distribution remains an overlooked factor that presents significant untapped potential. This overlooked factor, along with the challenge of controlling policy stochasticity, constitutes two critical areas for advancing distilled flow-matching policies. To overcome these limitations, we propose GoldenStart (GSFlow), a policy distillation method with Q-guided priors and explicit entropy control. Instead of initializing generation from uninformed noise, we introduce a Q-guided prior modeled by a conditional VAE. This state-conditioned prior repositions the starting points of the one-step generation process into high-Q regions, effectively providing a "golden start" that shortcuts the policy to promising actions. Furthermore, for effective online exploration, we enable our distilled actor to output a stochastic distribution instead of a deterministic point. This is governed by entropy regularization, allowing the policy to shift from pure exploitation to principled exploration. Our integrated framework demonstrates that by designing the generative startpoint and explicitly controlling policy entropy, it is possible to achieve efficient and exploratory policies, bridging the generative models and the practical actor-critic methods. We conduct extensive experiments on offline and online continuous control benchmarks, where our method significantly outperforms prior state-of-the-art approaches. Code will be available at https://github.com/ZhHe11/GSFlow-RL.

2603.14244 2026-03-17 cs.RO

Design of a Bio-Inspired Miniature Submarine for Low-Cost Water Quality Monitoring

Quang Huy Vu, Quan Le, Manh Duong Phung

详情
英文摘要

Water quality monitoring is essential for protecting aquatic ecosystems and detecting environmental pollution. This paper presents the design and experimental validation of a bio-inspired miniature submarine for low-cost water quality monitoring. Inspired by the jet propulsion mechanism of squids, the proposed system employs pump-driven water jets for propulsion and steering, combined with a pump-based buoyancy control mechanism that enables both depth regulation and water sampling. The vehicle integrates low-cost, commercially available components including an ESP32 microcontroller, IMU, pressure sensor, GPS receiver, and LoRa communication module. The complete system can be constructed at a hardware cost of approximately $122.5, making it suitable for educational and environmental monitoring applications. Experimental validation was conducted through pool tests and field trials in a lake. During a 360 degrees rotation test, roll and pitch deviations remained within +/-2 degrees and +/-1.5 degrees, respectively, demonstrating stable attitude control. Steering experiments showed a heading step response with approximately 2 s rise time and 5 s settling time. Depth control experiments achieved a target depth of 2.5 m with steady-state error within +/-0.1 m. Field experiments further demonstrated reliable navigation and successful water sampling operations. The results confirm that the proposed platform provides a compact, stable, and cost-effective solution for small-scale aquatic environmental monitoring.

2603.14243 2026-03-17 cs.CV

BIT: Matching-based Bi-directional Interaction Transformation Network for Visible-Infrared Person Re-Identification

Haoxuan Xu, Guanglin Niu

详情
英文摘要

Visible-Infrared Person Re-Identification (VI-ReID) is a challenging retrieval task due to the substantial modality gap between visible and infrared images. While existing methods attempt to bridge this gap by learning modality-invariant features within a shared embedding space, they often overlook the complex and implicit correlations between modalities. This limitation becomes more severe under distribution shifts, where infrared samples are often far fewer than visible ones. To address these challenges, we propose a novel network termed Bi-directional Interaction Transformation (BIT). Instead of relying on rigid feature alignment, BIT adopts a matching-based strategy that explicitly models the interaction between visible and infrared image pairs. Specifically, BIT employs an encoder-decoder architecture where the encoder extracts preliminary feature representations, and the decoder performs bi-directional feature integration and query aware scoring to enhance cross-modality correspondence. To our best knowledge, BIT is the first to introduce such pairwise matching-driven interaction in VI-ReID. Extensive experiments on several benchmarks demonstrate that our BIT achieves state-of-the-art performance, highlighting its effectiveness in the VI-ReID task.

2603.14241 2026-03-17 cs.CV

CamLit: Unified Video Diffusion with Explicit Camera and Lighting Control

Zhiyi Kuang, Chengan He, Egor Zakharov, Yuxuan Xue, Shunsuke Saito, Olivier Maury, Timur Bagautdinov, Youyi Zheng, Giljoo Nam

Comments 11 pages, 6 figures

详情
英文摘要

We present CamLit, the first unified video diffusion model that jointly performs novel view synthesis (NVS) and relighting from a single input image. Given one reference image, a user-defined camera trajectory, and an environment map, CamLit synthesizes a video of the scene from new viewpoints under the specified illumination. Within a single generative process, our model produces temporally coherent and spatially aligned outputs, including relit novel-view frames and corresponding albedo frames, enabling high-quality control of both camera pose and lighting. Qualitative and quantitative experiments demonstrate that CamLit achieves high-fidelity outputs on par with state-of-the-art methods in both novel view synthesis and relighting, without sacrificing visual quality in either task. We show that a single generative model can effectively integrate camera and lighting control, simplifying the video generation pipeline while maintaining competitive performance and consistent realism.

2603.14240 2026-03-17 cs.CV

FOCUS: Bridging Fine-Grained Recognition and Open-World Discovery across Domains

Vaibhav Rathore, Divyam Gupta, Moloud Abdar, Subhasis Chaudhuri, Biplab Banerjee

Comments Under Review

详情
英文摘要

We introduce the first unified framework for *Fine-Grained Domain-Generalized Generalized Category Discovery* (FG-DG-GCD), bringing open-world recognition closer to real-world deployment under domain shift. Unlike conventional GCD, which assumes labeled and unlabeled data come from the same distribution, DG-GCD learns only from labeled source data and must both recognize known classes and discover novel ones in unseen, unlabeled target domains. This problem is especially challenging in fine-grained settings, where subtle inter-class differences and large intra-class variation make domain generalization significantly harder. To support systematic evaluation, we establish the first *FG-DG-GCD benchmarks* by creating identity-preserving *painting* and *sketch* domains for CUB-200-2011, Stanford Cars, and FGVC-Aircraft using controlled diffusion-adapter stylization. On top of this ,we propose FoCUS, a single-stage framework that combines *Domain-Consistent Parts Discovery* (DCPD) for geometry-stable part reasoning with *Uncertainty-Aware Feature Augmentation* (UFA) for confidence-calibrated feature regularization through uncertainty-guided perturbations. Extensive experiments show that FoCUS outperforms strong GCD, FG-GCD, and DG-GCD baselines by **3.28%**, **9.68%**, and **2.07%**, respectively, in clustering accuracy on the proposed benchmarks. It also remains competitive on coarse-grained DG-GCD tasks while achieving nearly **3x** higher computational efficiency than the current state of the art. ^[Code and datasets will be released upon acceptance.]

2603.14239 2026-03-17 cs.CL cs.AI cs.AR cs.LG

QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis

Yutong Wu, Chenrui Cao, Pengwei Jin, Di Huang, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Xing Hu

Comments Accepted by DAC 2026. Code: https://github.com/wyt2000/CodeV-SVA; Model: https://huggingface.co/wyt2000/CodeV-SVA-14B

详情
英文摘要

SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage general-purpose LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of high-quality real-world SVA corpora and the lack of reliable methods to determine NL-SVA semantic equivalence. For the former, large-scale open-source RTLs are used to guide LLMs to generate real-world SVAs; for the latter, bidirectional translation serves as a data selection method. With the synthesized data, we train CodeV-SVA, a series of SVA generation models. Notably, CodeV-SVA-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs like GPT-5 and DeepSeek-R1.

2603.14238 2026-03-17 cs.LG cs.MM

Domain-Skewed Federated Learning with Feature Decoupling and Calibration

Huan Wang, Jun Shen, Jun Yan, Guansong Pang

Comments Accepted at CVPR 2026

详情
英文摘要

Federated learning (FL) allows distributed clients to collaboratively train a global model in a privacy-preserving manner. However, one major challenge is domain skew, where clients' data originating from diverse domains may hinder the aggregated global model from learning a consistent representation space, resulting in poor generalizable ability in multiple domains. In this paper, we argue that the domain skew is reflected in the domain-specific biased features of each client, causing the local model's representations to collapse into a narrow low-dimensional subspace. We then propose Federated Feature Decoupling and Calibration ($F^2$DC), which liberates valuable class-relevant information by calibrating the domain-specific biased features, enabling more consistent representations across domains. A novel component, Domain Feature Decoupler (DFD), is first introduced in $F^2$DC to determine the robustness of each feature unit, thereby separating the local features into domain-robust features and domain-related features. A Domain Feature Corrector (DFC) is further proposed to calibrate these domain-related features by explicitly linking discriminative signals, capturing additional class-relevant clues that complement the domain-robust features. Finally, a domain-aware aggregation of the local models is performed to promote consensus among clients. Empirical results on three popular multi-domain datasets demonstrate the effectiveness of the proposed $F^2$DC and the contributions of its two modules. Code is available at https://github.com/mala-lab/F2DC.

2603.14236 2026-03-17 cs.RO cs.DC

AeroGen: Agentic Drone Autonomy through Single-Shot Structured Prompting & Drone SDK

Kautuk Astu, Yogesh Simmhan

详情
英文摘要

Designing correct UAV autonomy programs is challenging due to joint navigation, sensing and analytics requirements. While LLMs can generate code, their reliability for safety-critical UAVs remains uncertain. This paper presents AeroGen, an open-loop framework that enables consistently correct single-shot AI-generated drone control programs through structured guardrail prompting and integration with the AeroDaaS drone SDK. AeroGen encodes API descriptions, flight constraints and operational world rules directly into the system context prompt, enabling generic LLMs to produce constraint-aware code from user prompts, with minimal example code. We evaluate AeroGen across a diverse benchmark of 20 navigation tasks and 5 drone missions on urban, farm and inspection environments, using both imperative and declarative user prompts. AeroGen generates about 40 lines of AeroDaaS Python code in about 20s per mission, in both real-world and simulations, showing that structured prompting with a well-defined SDK improves robustness, correctness and deployability of LLM-generated drone autonomy programs.

2603.14232 2026-03-17 cs.CV

S2GS: Streaming Semantic Gaussian Splatting for Online Scene Understanding and Reconstruction

Renhe Zhang, Yuyang Tan, Jingyu Gong, Zhizhong Zhang, Lizhuang Ma, Yuan Xie, Xin Tan

Comments 10 pages, 7 figures

详情
英文摘要

Existing offline feed-forward methods for joint scene understanding and reconstruction on long image streams often repeatedly perform global computation over an ever-growing set of past observations, causing runtime and GPU memory to increase rapidly with sequence length and limiting scalability. We propose Streaming Semantic Gaussian Splatting (S2GS), a strictly causal, incremental 3D Gaussian semantic field framework: it does not leverage future frames and continuously updates scene geometry, appearance, and instance-level semantics without reprocessing historical frames, enabling scalable online joint reconstruction and understanding. S2GS adopts a geometry-semantic decoupled dual-backbone design: the geometry branch performs causal modeling to drive incremental Gaussian updates, while the semantic branch leverages a 2D foundation vision model and a query-driven decoder to predict segmentation masks and identity embeddings, further stabilized by query-level contrastive alignment and lightweight online association with an instance memory. Experiments show that S2GS matches or outperforms strong offline baselines on joint reconstruction-and-understanding benchmarks, while significantly improving long-horizon scalability: it processes 1,000+ frames with much slower growth in runtime and GPU memory, whereas offline global-processing baselines typically run out of memory at around 80 frames under the same setting.

2603.14229 2026-03-17 cs.AI cs.SE

Agentic DAG-Orchestrated Planner Framework for Multi-Modal, Multi-Hop Question Answering in Hybrid Data Lakes

Kirushikesh D B, Manish Kesarwani, Nishtha Madaan, Sameep Mehta, Aldrin Dennis, Siddarth Ajay, Rakesh B R, Renu Rajagopal, Sudheesh Kairali

详情
英文摘要

Enterprises increasingly need natural language (NL) question answering over hybrid data lakes that combine structured tables and unstructured documents. Current deployed solutions, including RAG-based systems, typically rely on brute-force retrieval from each store and post-hoc merging. Such approaches are inefficient and leaky, and more critically, they lack explicit support for multi-hop reasoning, where a query is decomposed into successive steps (hops) that may traverse back and forth between structured and unstructured sources. We present Agentic DAG-Orchestrated Transformer (A.DOT) Planner, a framework for multi-modal, multi-hop question answering, that compiles user NL queries into directed acyclic graph (DAG) execution plans spanning both structured and unstructured stores. The system decomposes queries into parallelizable sub-queries, incorporates schema-aware reasoning, and applies both structural and semantic validation before execution. The execution engine adheres to the generated DAG plan to coordinate concurrent retrieval across heterogeneous sources, route intermediate outputs to dependent sub-queries, and merge final results in strict accordance with the plan's logical dependencies. Advanced caching mechanisms, incorporating paraphrase-aware template matching, enable the system to detect equivalent queries and reuse prior DAG execution plans for rapid re-execution, while the DataOps System addresses validation feedback or execution errors. The proposed framework not only improves accuracy and latency, but also produces explicit evidence trails, enabling verification of retrieved content, tracing of data lineage, and fostering user trust in the system's outputs. On benchmark dataset, A.DOT achieves 14.8% absolute gain in correctness and 10.7% in completeness over baselines.

2603.14224 2026-03-17 cs.LG cs.AI

Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys

Xu Yang, Jiapeng Zhang, Dongyang Zhao, Guo Chen, Zhuo Tang

详情
英文摘要

The KV cache in self-attention has emerged as a major bottleneck in long-context and large-batch inference for LLMs. Existing approaches often treat sparsity prediction and compression as separate modules, relying on auxiliary index structures to select relevant tokens, and on complex quantization schemes to reduce memory usage. This fragmented design introduces redundant overhead and limits scalability. In this paper, we propose a novel paradigm: treating the compressed key representation not merely as storage, but as a self-indexing structure that directly enables efficient sparse attention. By designing a sign-based 1-bit vector quantization (VQ) scheme, our method unifies compression and retrieval in a single, hardware-friendly format. This approach eliminates the need for external indices or learning-based predictors, offering a lightweight yet robust solution for memory-constrained inference. All components are designed to be hardware-efficient and easy to implement. By implementing custom CUDA kernels, our method integrates seamlessly with FlashAttention, minimizing additional runtime and memory overhead. Experimental results demonstrate that our approach delivers both effectiveness and efficiency.

2603.14221 2026-03-17 cs.RO cs.AI cs.LG

A Real-Time Neuro-Symbolic Ethical Governor for Safe Decision Control in Autonomous Robotic Manipulation

Aueaphum Aueawatthanaphisut, Kuepon Aueawatthanaphisut

Comments 6 pages, 6 figures, 5 equations

详情
英文摘要

Ethical decision governance has become a critical requirement for autonomous robotic systems operating in human-centered and safety-sensitive environments. This paper presents a real-time neuro-symbolic ethical governor designed to enable risk-aware supervisory control in autonomous robotic manipulation tasks. The proposed framework integrates transformer-based ethical reasoning with a probabilistic ethical risk field formulation and a threshold-based override control mechanism. language-grounded ethical intent inference capability is learned from natural language task descriptions using a fine-tuned DistilBERT model trained on the ETHICS commonsense dataset. A continuous ethical risk metric is subsequently derived from predicted unsafe action probability, confidence uncertainty, and probabilistic variance to support adaptive decision filtering. The effectiveness of the proposed approach is validated through simulated autonomous robot-arm task scenarios involving varying levels of human proximity and operational hazard. Experimental results demonstrate stable model convergence, reliable ethical risk discrimination, and improved safety-aware decision outcomes without significant degradation of task execution efficiency. The proposed neuro-symbolic architecture further provides enhanced interpretability compared with purely data-driven safety filters, enabling transparent ethical reasoning in real-time control loops. The findings suggest that ethical decision governance can be effectively modeled as a dynamic supervisory risk layer for autonomous robotic systems, with potential applicability to broader cyber-physical and assistive robotics domains.

2603.14220 2026-03-17 cs.CV

FIND: A Simple yet Effective Baseline for Diffusion-Generated Image Detection

Jie Li, Yingying Feng, Chi Xie, Jie Hu, Lei Tan, Jiayi Ji

Comments AAAI'26

详情
英文摘要

The remarkable realism of images generated by diffusion models poses critical detection challenges. Current methods utilize reconstruction error as a discriminative feature, exploiting the observation that real images exhibit higher reconstruction errors when processed through diffusion models. However, these approaches require costly reconstruction computations and depend on specific diffusion models, making their performance highly model-dependent. We identify a fundamental difference: real images are more difficult to fit with Gaussian distributions compared to synthetic ones. In this paper, we propose Forgery Identification via Noise Disturbance (FIND), a novel method that requires only a simple binary classifier. It eliminates reconstruction by directly targeting the core distributional difference between real and synthetic images. Our key operation is to add Gaussian noise to real images during training and label these noisy versions as synthetic. This step allows the classifier to focus on the statistical patterns that distinguish real from synthetic images. We theoretically prove that the noise-augmented real images resemble diffusion-generated images in their ease of Gaussian fitting. Furthermore, simply by adding noise, they still retain visual similarity to the original images, highlighting the most discriminative distribution-related features. The proposed FIND improves performance by 11.7% on the GenImage benchmark while running 126x faster than existing methods. By removing the need for auxiliary diffusion models and reconstruction, it offers a practical, efficient, and generalizable way to detect diffusion-generated content.

2603.14219 2026-03-17 cs.CV

Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining

Chongxin Li, Hanzhang Wang, Lian Duan

Comments Accepted for publication in Transactions of the Association for Computational Linguistics (TACL)

详情
英文摘要

Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing weights that are less responsive to safety prompts without additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone, all while maintaining strong benign performance. These findings frame pruning not only as a model compression technique, but as a structural intervention to emerge alignment-relevant subnets, offering a new path to robust jailbreak resistance.

2603.14216 2026-03-17 cs.RO cs.HC

Navigation beyond Wayfinding: Robots Collaborating with Visually Impaired Users for Environmental Interactions

Shaojun Cai, Nuwan Janaka, Ashwin Ram, Janidu Shehan, Yingjia Wan, Kotaro Hara, David Hsu

Comments Accepted to ACM/IEEE HRI 2026, 10 pages, 6 figures

详情
英文摘要

Robotic guidance systems have shown promise in supporting blind and visually impaired (BVI) individuals with wayfinding and obstacle avoidance. However, most existing systems assume a clear path and do not support a critical aspect of navigation - environmental interactions that require manipulating objects to enable movement. These interactions are challenging for a human-robot pair because they demand (i) precise localization and manipulation of interaction targets (e.g., pressing elevator buttons) and (ii) dynamic coordination between the user's and robot's movements (e.g., pulling out a chair to sit). We present a collaborative human-robot approach that combines our robotic guide dog's precise sensing and localization capabilities with the user's ability to perform physical manipulation. The system alternates between two modes: lead mode, where the robot detects and guides the user to the target, and adaptation mode, where the robot adjusts its motion as the user interacts with the environment (e.g., opening a door). Evaluation results show that our system enables navigation that is safer, smoother, and more efficient than both a traditional white cane and a non-adaptive guiding system, with the performance gap widening as tasks demand higher precision in locating interaction targets. These findings highlight the promise of human-robot collaboration in advancing assistive technologies toward more generalizable and realistic navigation support.

2603.14214 2026-03-17 cs.CV cs.AI

UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation

Xingyuan Li, Songcheng Du, Yang Zou, HaoYuan Xu, Zhiying Jiang, Jinyuan Liu

Comments 11 pages, 8 figures, published to CVPR2026

详情
英文摘要

Image fusion aims to integrate complementary information from multiple source images to produce a more informative and visually consistent representation, benefiting both human perception and downstream vision tasks. Despite recent progress, most existing fusion methods are designed for specific tasks (i.e., multi-modal, multi-exposure, or multi-focus fusion) and struggle to effectively preserve source information during the fusion process. This limitation primarily arises from task-specific architectures and the degradation of source information caused by deep-layer propagation. To overcome these issues, we propose UniFusion, a unified image fusion framework designed to achieve cross-task generalization. First, leveraging DINOv3 for modality-consistent feature extraction, UniFusion establishes a shared semantic space for diverse inputs. Second, to preserve the understanding of each source image, we introduce a reconstruction-alignment loss to maintain consistency between fused outputs and inputs. Finally, we employ a bilevel optimization strategy to decouple and jointly optimize reconstruction and fusion objectives, effectively balancing their coupling relationship and ensuring smooth convergence. Extensive experiments across multiple fusion tasks demonstrate UniFusion's superior visual quality, generalization ability, and adaptability to real-world scenarios. Code is available at https://github.com/dusongcheng/UniFusion.

2603.14212 2026-03-17 cs.AI

Memory as Asset: From Agent-centric to Human-centric Memory Management

Yanqi Pan, Qinghao Huang, Weihao Yang

详情
英文摘要

We proudly introduce Memory-as-Asset, a new memory paradigm towards human-centric artificial general intelligence (AGI). In this paper, we formally emphasize that human-centric, personal memory management is a prerequisite for complementing the collective knowledge of existing large language models (LLMs) and extending their knowledge boundaries through self-evolution. We introduce three key features that shape the Memory-as-Asset era: (1) Memory in Hand, which emphasizes human-centric ownership to maximize benefits to humans; (2) Memory Group, which provides collaborative knowledge formation to avoid memory islands, and (3) Collective Memory Evolution, which enables continuous knowledge growth to extend the boundary of knowledge towards AGI. We finally give a potential three-layer memory infrastructure to facilitate the Memory-as-Asset paradigm, with fast personal memory storage, an intelligent evolution layer, and a decentralized memory exchange network. Together, these components outline a foundational architecture in which personal memories become persistent digital assets that can be accumulated, shared, and evolved over time. We believe this paradigm provides a promising path toward scalable, human-centric AGI systems that continuously grow through the collective experiences of individuals and intelligent agents.

2603.14207 2026-03-17 cs.CV cs.AI

DualTSR: Unified Dual-Diffusion Transformer for Scene Text Image Super-Resolution

Axi Niu, Kang Zhang, Qingsen Yan, Hao Jin, Jinqiu Sun, Yanning Zhang

详情
英文摘要

Scene Text Image Super-Resolution (STISR) aims to restore high-resolution details in low-resolution text images, which is crucial for both human readability and machine recognition. Existing methods, however, often depend on external Optical Character Recognition (OCR) models for textual priors or rely on complex multi-component architectures that are difficult to train and reproduce. In this paper, we introduce DualTSR, a unified end-to-end framework that addresses both issues. DualTSR employs a single multimodal transformer backbone trained with a dual diffusion objective. It simultaneously models the continuous distribution of high-resolution images via Conditional Flow Matching and the discrete distribution of textual content via discrete diffusion. This shared design enables visual and textual information to interact at every layer, allowing the model to infer text priors internally instead of relying on an external OCR module. Compared with prior multi-branch diffusion systems, DualTSR offers a simpler end-to-end formulation with fewer hand-crafted components. Experiments on synthetic Chinese benchmarks and a curated real-world evaluation protocol show that DualTSR achieves strong perceptual quality and text fidelity.

2603.14189 2026-03-17 cs.CV cs.AI

Walking Further: Semantic-aware Multimodal Gait Recognition Under Long-Range Conditions

Zhiyang Lu, Wen Jiang, Tianren Wu, Zhichao Wang, Changwang Zhang, Siqi Shen, Ming Cheng

Comments Accepted by AAAI 2026

详情
英文摘要

Gait recognition is an emerging biometric technology that enables non-intrusive and hard-to-spoof human identification. However, most existing methods are confined to short-range, unimodal settings and fail to generalize to long-range and cross-distance scenarios under real-world conditions. To address this gap, we present \textbf{LRGait}, the first LiDAR-Camera multimodal benchmark designed for robust long-range gait recognition across diverse outdoor distances and environments. We further propose \textbf{EMGaitNet}, an end-to-end framework tailored for long-range multimodal gait recognition. To bridge the modality gap between RGB images and point clouds, we introduce a semantic-guided fusion pipeline. A CLIP-based Semantic Mining (SeMi) module first extracts human body-part-aware semantic cues, which are then employed to align 2D and 3D features via a Semantic-Guided Alignment (SGA) module within a unified embedding space. A Symmetric Cross-Attention Fusion (SCAF) module hierarchically integrates visual contours and 3D geometric features, and a Spatio-Temporal (ST) module captures global gait dynamics. Extensive experiments on various gait datasets validate the effectiveness of our method.

2603.14187 2026-03-17 cs.CV

Deep Learning From Routine Histology Improves Risk Stratification for Biochemical Recurrence in Prostate Cancer

Clément Grisi, Khrystyna Faryna, Nefise Uysal, Vittorio Agosti, Enrico Munari, Solène-Florence Kammerer-Jacquet, Paulo Guilherme de Oliveira Salles, Yuri Tolkach, Reinhard Büttner, Sofiya Semko, Maksym Pikul, Axel Heidenreich, Jeroen van der Laak, Geert Litjens

Comments Preprint

详情
英文摘要

Accurate prediction of biochemical recurrence (BCR) after radical prostatectomy is critical for guiding adjuvant treatment and surveillance decisions in prostate cancer. However, existing clinicopathological risk models reduce complex morphology to relatively coarse descriptors, leaving substantial prognostic information embedded in routine histopathology underexplored. We present a deep learning-based biomarker that predicts continuous, patient-specific risk of BCR directly from H&E-stained whole-slide prostatectomy specimens. Trained end-to-end on time-to-event outcomes and evaluated across four independent international cohorts, our model demonstrates robust generalization across institutions and patient populations. When integrated with the CAPRA-S clinical risk score, the deep learning risk score consistently improved discrimination for BCR, increasing concordance indices from 0.725-0.772 to 0.749-0.788 across cohorts. To support clinical interpretability, outcome-grounded analyses revealed subtle histomorphological patterns associated with recurrence risk that are not captured by conventional clinicopathological risk scores. This multicohort study demonstrates that deep learning applied to routine prostate histopathology can deliver reproducible and clinically generalizable biomarkers that augment postoperative risk stratification, with potential to support personalized management of prostate cancer in real-world clinical settings.

2603.14183 2026-03-17 cs.CL

Selective Fine-Tuning of GPT Architectures for Parameter-Efficient Clinical Text Classification

Fariba Afrin Irany, Sampson Akwafuo

详情
英文摘要

The rapid expansion of electronic health record (EHR) systems has generated large volumes of unstructured clinical narratives that contain valuable information for disease identification, patient cohort discovery, and clinical decision support. Extracting structured knowledge from these free-text documents remains challenging because clinical language is highly specialized, labeled datasets are limited, and full fine-tuning of large pretrained language models can require substantial computational resources. Efficient adaptation strategies are therefore essential for practical clinical natural language processing applications. This study proposes a parameter-efficient selective fine-tuning framework for adapting GPT-2 to clinical text classification tasks. Instead of updating the entire pretrained model, the majority of network parameters are frozen, and only the final Transformer block, the final layer normalization module, and a lightweight classification head are updated during training. This design substantially reduces the number of trainable parameters while preserving the contextual representation capabilities learned during pretraining. The proposed approach is evaluated using radiology reports from the MIMIC-IV-Note dataset with automatically derived CheXpert-style labels. Experiments on 50,000 radiology reports demonstrate that selective fine-tuning achieves approximately 91% classification accuracy while updating fewer than 6% of the model parameters. Comparative experiments with head-only training and full-model fine-tuning show that the proposed method provides a favorable balance between predictive performance and computational efficiency. These results indicate that selective fine-tuning offers an efficient and scalable framework for clinical text classification.

2603.14182 2026-03-17 cs.RO cs.HC

Towards Equitable Robotic Furnishing Agents for Aging-in-Place: ADL-Grounded Design Exploration

Hansoo Lee, Changhee Seo, Subin Park, Sonya S. Kwak

Comments Accepted at the ACM/IEEE International Conference on Human-Robot Interaction (HRI) 2026 Workshop: Equitable Robotics for Wellbeing (Eq-RW)

详情
英文摘要

In aging-in-place contexts, small difficulties in Activities of Daily Living (ADL) can accumulate, affecting well-being through fatigue, anxiety, reduced autonomy, and safety risks. This position paper argues that robotics for older adult wellbeing must move beyond "convenience features" and centre equity, justice, and responsibility. We conducted ADL-grounded semi-structured interviews with four adults in their 70s-80s, identifying recurrent challenges (finding/ organising items, taking medication, and transporting objects) and deriving requirements to reduce compounded cognitive-physical burden. Based on these insights, we propose an in-home robotic furnishing-agent concept leveraging computer vision and generative AI and LLMs for natural-language interaction, context-aware reminders, safe actuation, and user-centred transparency. We then report video-stimulated follow-up interviews with the same participants, highlighting preferences for confirmation before actuation, predictability, adjustable speed/autonomy, and multimodal feedback, as well as equity-related concerns. We conclude with open questions on evaluating and deploying equitable robotic wellbeing systems in real homes.

2603.14176 2026-03-17 cs.CV

BluRef: Unsupervised Image Deblurring with Dense-Matching References

Bang-Dang Pham, Anh Tran, Cuong Pham, Minh Hoai

Comments Accepted to CVPR 2026. Project page: https://qualcomm-ai-research.github.io/BluRef/

详情
英文摘要

This paper introduces a novel unsupervised approach for image deblurring that utilizes a simple process for training data collection, thereby enhancing the applicability and effectiveness of deblurring methods. Our technique does not require meticulously paired data of blurred and corresponding sharp images; instead, it uses unpaired blurred and sharp images of similar scenes to generate pseudo-ground truth data by leveraging a dense matching model to identify correspondences between a blurry image and reference sharp images. Thanks to the simplicity of the training data collection process, our approach does not rely on existing paired training data or pre-trained networks, making it more adaptable to various scenarios and suitable for networks of different sizes, including those designed for low-resource devices. We demonstrate that this novel approach achieves state-of-the-art performance, marking a significant advancement in the field of image deblurring.

2603.14175 2026-03-17 cs.LG cs.CV

Balancing Multimodal Domain Generalization via Gradient Modulation and Projection

Hongzhao Li, Guohao Shen, Shupan Li, Mingliang Xu, Muhammad Haris Khan

Comments AAAI 2026 Oral Accepted

详情
英文摘要

Multimodal Domain Generalization (MMDG) leverages the complementary strengths of multiple modalities to enhance model generalization on unseen domains. A central challenge in multimodal learning is optimization imbalance, where modalities converge at different speeds during training. This imbalance leads to unequal gradient contributions, allowing some modalities to dominate the learning process while others lag behind. Existing balancing strategies typically regulate each modality's gradient contribution based on its classification performance on the source domain to alleviate this issue. However, relying solely on source-domain accuracy neglects a key insight in MMDG: modalities that excel on the source domain may generalize poorly to unseen domains, limiting cross-domain gains. To overcome this limitation, we propose Gradient Modulation Projection (GMP), a unified strategy that promotes balanced optimization in MMDG. GMP first decouples gradients associated with classification and domain-invariance objectives. It then modulates each modality's gradient based on semantic and domain confidence. Moreover, GMP dynamically adjusts gradient projections by tracking the relative strength of each task, mitigating conflicts between classification and domain-invariant learning within modality-specific encoders. Extensive experiments demonstrate that GMP achieves state-of-the-art performance and integrates flexibly with diverse MMDG methods, significantly improving generalization across multiple benchmarks.

2603.14173 2026-03-17 cs.LG cs.AI cs.IR

Hybrid Intent-Aware Personalization with Machine Learning and RAG-Enabled Large Language Models for Financial Services Marketing

Akhil Chandra Shanivendra

Comments 18 pages, 5 figures, 3 tables. Applied ML systems paper. The contribution is architectural rather than algorithmic

详情
英文摘要

Personalized marketing in financial services requires models that can both predict customer behavior and generate compliant, context-appropriate content. This paper presents a hybrid architecture that integrates classical machine learning for segmentation, latent intent modeling, and personalization prediction with retrieval-augmented large language models for grounded content generation. A synthetic, reproducible dataset is constructed to reflect temporal customer behavior, product interactions, and marketing responses. The proposed framework incorporates temporal encoders, latent representations, and multi-task classification to estimate segment membership, customer intent, and product-channel recommendations. A retrieval-augmented generation layer then produces customer-facing messages constrained by retrieved domain documents. Experiments show that temporal modeling and intent features improve personalization accuracy, while citation-based retrieval reduces unsupported generation and supports auditability in regulated settings. The contribution is primarily architectural, demonstrating how predictive modeling and RAG-based generation can be combined into a transparent, explainable pipeline for financial services personalization.

2603.14171 2026-03-17 cs.LG

TACTIC for Navigating the Unknown: Tabular Anomaly deteCTion via In-Context inference

Patryk Marszałek, Tomasz Kuśmierczyk, Marek Śmieja

详情
英文摘要

Anomaly detection for tabular data has been a long-standing unsupervised learning problem that remains a major challenge for current deep learning models. Recently, in-context learning has emerged as a new paradigm that has shifted efforts from task-specific optimization to large-scale pretraining aimed at creating foundation models that generalize across diverse datasets. Although in-context models, such as TabPFN, perform well in supervised problems, their learned classification-based priors may not readily extend to anomaly detection. In this paper, we study in-context models for anomaly detection and show that the unsupervised extensions to TabPFN exhibit unstable behavior, particularly in noisy or contaminated contexts, in addition to the high computational cost. We address these challenges and introduce TACTIC, an in-context anomaly detection approach based on pretraining with anomaly-centric synthetic priors, which provides fast and data-dependent reasoning about anomalies while avoiding dataset-specific tuning. In contrast to typical score-based approaches, which produce uncalibrated anomaly scores that require post-processing (e.g. threshold selection or ranking heuristics), the proposed model is trained as a discriminative predictor, enabling unambiguous anomaly decisions in a single forward pass. Through experiments on real-world datasets, we examine the performance of TACTIC in clean and noisy contexts with varying anomaly rates and different anomaly types, as well as the impact of prior choices on detection quality. Our experiments clearly show that specialized anomaly-centric in-context models such as TACTIC are highly competitive compared to other task-specific methods.

2603.14161 2026-03-17 cs.LG q-bio.NC

Deep probabilistic model synthesis enables unified modeling of whole-brain neural activity across individual subjects

William E. Bishop, Luuk W. Hesselink, Bernhard Englitz, Misha B. Ahrens, James E. Fitzgerald

Comments 40 pages, 8 figures

详情
英文摘要

Many disciplines need quantitative models that synthesize experimental data across multiple instances of the same general system. For example, neuroscientists must combine data from the brains of many individual animals to understand the species' brain in general. However, typical machine learning models treat one system instance at a time. Here we introduce a machine learning framework, deep probabilistic model synthesis (DPMS), that leverages system properties auxiliary to the model to combine data across system instances. DPMS specifically uses variational inference to learn a conditional prior distribution and instance-specific posterior distributions over model parameters that respectively tie together the system instances and capture their unique structure. DPMS can synthesize a wide variety of model classes, such as those for regression, classification, and dimensionality reduction, and we demonstrate its ability to improve upon single-instance models on synthetic data and whole-brain neural activity data from larval zebrafish.

2603.14160 2026-03-17 cs.RO

See, Learn, Assist: Safe and Self-Paced Robotic Rehabilitation via Video-Based Learning from Demonstration

Ali Alabbas, Camillo Murgia, Joanne Regan, Philip Long

详情
英文摘要

In this paper, we propose a novel framework that allows therapists to teach robot-assisted rehabilitation exercises remotely via RGB-D video. Our system encodes demonstrations as 6-DoF body-centric trajectories using Cartesian Dynamic Movement Primitives (DMPs), ensuring accurate posture-independent spatial generalization across diverse patient anatomies. Crucially, we execute these trajectories through a decoupled hybrid control architecture that constructs a spatially compliant virtual tunnel, paired with an effort-based temporal dilation mechanism. This architecture is applied to three distinct rehabilitation modalities: Passive, Active-Assisted, and Active-Resistive, by dynamically linking the exercise's execution phase to the patient's tangential force contribution. To guarantee safety, a Gaussian Mixture Regression (GMR) model is learned on-the-fly from the patient's own limb. This allows the detection of abnormal interaction forces and, if necessary, reverses the trajectory to prevent injury. Experimental validation demonstrates the system's precision, achieving an average trajectory reproduction error of 3.7cm and a range of motion (ROM) error of 5.5 degrees. Furthermore, dynamic interaction trials confirm that the controller successfully enforces effort-based progression while maintaining strict spatial path adherence against human disturbances.