arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 3203
2603.14457 2026-03-17 cs.RO

Towards Versatile Opti-Acoustic Sensor Fusion and Volumetric Mapping

Ivana Collado-Gonzalez, John McConnell, Brendan Englot

Comments To appear at ICRA 2026 in Vienna, Austria

详情
英文摘要

Accurate 3D volumetric mapping is critical for autonomous underwater vehicles operating in obstacle-rich environments. Vision-based perception provides high-resolution data but fails in turbid conditions, while sonar is robust to lighting and turbidity but suffers from low resolution and elevation ambiguity. This paper presents a volumetric mapping framework that fuses a stereo sonar pair with a monocular camera to enable safe navigation under varying visibility conditions. Overlapping sonar fields of view resolve elevation ambiguity, producing fully defined 3D point clouds at each time step. The framework identifies regions of interest in camera images, associates them with corresponding sonar returns, and combines sonar range with camera-derived elevation cues to generate additional 3D points. Each 3D point is assigned a confidence value reflecting its reliability. These confidence-weighted points are fused using a Gaussian Process Volumetric Mapping framework that prioritizes the most reliable measurements. Experimental comparisons with other opti-acoustic and sonar-based approaches, along with field tests in a marina environment, demonstrate the method's effectiveness in capturing complex geometries and preserving critical information for robot navigation in both clear and turbid conditions. Our code is open-source to support community adoption.

2603.14456 2026-03-17 cs.CL cs.SD

PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark

Mohammad Javad Ranjbar Kalahroodi, Mohammad Amini, Parmis Bathayan, Heshaam Faili, Azadeh Shakery

Comments Submitted to Interspeech 2026

详情
英文摘要

Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching - none captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture, comprising 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform audio counterparts, suggesting models may not leverage audio-specific information beyond what transcription alone provides. Culturally-grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn detection regardless of scale, suggesting prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench

2603.14452 2026-03-17 cs.CV

Uni-MDTrack: Learning Decoupled Memory and Dynamic States for Parameter-Efficient Visual Tracking in All Modality

Wenrui Cai, Zhenyi Lu, Yuzhe Li, Yongchao Feng, Jinqing Zhang, Qingjie Liu, Yunhong Wang

Comments 15 pages, 9 figures, 16 tables

详情
英文摘要

With the advent of Transformer-based one-stream trackers that possess strong capability in inter-frame relation modeling, recent research has increasingly focused on how to introduce spatio-temporal context. However, most existing methods rely on a limited number of historical frames, which not only leads to insufficient utilization of the context, but also inevitably increases the length of input and incurs prohibitive computational overhead. Methods that query an external memory bank, on the other hand, suffer from inadequate fusion between the retrieved spatio-temporal features and the backbone. Moreover, using discrete historical frames as context overlooks the rich dynamics of the target. To address the issues, we propose Uni-MDTrack, which consists of two core components: Memory-Aware Compression Prompt (MCP) module and Dynamic State Fusion (DSF) module. MCP effectively compresses rich memory features into memory-aware prompt tokens, which deeply interact with the input throughout the entire backbone, significantly enhancing the performance while maintaining a stable computational load. DSF complements the discrete memory by capturing the continuous dynamic, progressively introducing the updated dynamic state features from shallow to deep layers, while also preserving high efficiency. Uni-MDTrack also supports unified tracking across RGB, RGB-D/T/E, and RGB-Language modalities. Experiments show that in Uni-MDTrack, training only the MCP, DSF, and prediction head, keeping the proportion of trainable parameters around 30%, yields substantial performance gains, achieves state-of-the-art results on 10 datasets spanning five modalities. Furthermore, both MCP and DSF exhibit excellent generality, functioning as plug-and-play components that can boost the performance of various baseline trackers, while significantly outperforming existing parameter-efficient training approaches.

2603.14448 2026-03-17 cs.LG

Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements

Ziwei Liu, Tao Feng, Borui Kang, Yanbing Yang, Jun Luo

详情
英文摘要

Multimodal Large Language Model (MLLM)-based Graphical User Interface (GUI) agents develop rapidly, with visual grounding that maps natural language instructions to target UI elements serving as the core capability. Existing GUI agents typically fine-tune MLLM on massive datasets to handle challenges in understanding instructions and UI interfaces, which not only incurs high data annotation costs but also makes performance dependent on data quality and distribution. To avoid such cumbersome yet ineffective training, we notice that complex UI interfaces can be decomposed into basic visual elements directly understandable by common MLLMs. Consequently, we propose ZoomUI that leverages inference scaling to guide common MLLMs in progressively anchor instruction elements to increasingly detailed interface elements. Specifically, ZoomUI first optimizes the latent thinking to transform original instruction into element visual features description, and subsequently leverages internal attention to iteratively zoom in target element interface region. Evaluations on extensive benchmarks demonstrate that ZoomUI reaches or even surpasses SOTA baselines.

2603.14435 2026-03-17 cs.CV

End-to-End Spatial-Temporal Transformer for Real-time 4D HOI Reconstruction

Haoyu Zhang, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha

Comments 23 pages, 7 figures. The project page is available at: https://nianheng.github.io/THO-project/

详情
英文摘要

Monocular 4D human-object interaction (HOI) reconstruction - recovering a moving human and a manipulated object from a single RGB video - remains challenging due to depth ambiguity and frequent occlusions. Existing methods often rely on multi-stage pipelines or iterative optimization, leading to high inference latency, failing to meet real-time requirements, and susceptibility to error accumulation. To address these limitations, we propose THO, an end-to-end Spatial-Temporal Transformer that predicts human motion and coordinated object motion in a forward fashion from the given video and 3D template. THO achieves this by leveraging spatial-temporal HOI tuple priors. Spatial priors exploit contact-region proximity to infer occluded object features from human cues, while temporal priors capture cross-frame kinematic correlations to refine object representations and enforce physical coherence. Extensive experiments demonstrate that THO operates at an inference speed of 31.5 FPS on a single RTX 4090 GPU, achieving a >600x speedup over prior optimization-based methods while simultaneously improving reconstruction accuracy and temporal consistency. The project page is available at: https://nianheng.github.io/THO-project/

2603.14430 2026-03-17 cs.CL

Creative Convergence or Imitation? Genre-Specific Homogeneity in LLM-Generated Chinese Literature

Yuanchi Ma, Kaize Shi, Hui He, Zhihua Zhang, Zhongxiang Lei, Ziliang Qiu, Renfen Hu, Jiamou Liu

详情
英文摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities in narrative generation. However, they often produce structurally homogenized stories, frequently following repetitive arrangements and combinations of plot events along with stereotypical resolutions. In this paper, we propose a novel theoretical framework for analysis by incorporating Proppian narratology and narrative functions. This framework is used to analyze the composition of narrative texts generated by LLMs to uncover their underlying narrative logic. Taking Chinese web literature as our research focus, we extend Propp's narrative theory, defining 34 narrative functions suited to modern web narrative structures. We further construct a human-annotated corpus to support the analysis of narrative structures within LLM-generated text. Experiments reveal that the primary reasons for the singular narrative logic and severe homogenization in generated texts are that current LLMs are unable to correctly comprehend the meanings of narrative functions and instead adhere to rigid narrative generation paradigms.

2603.14426 2026-03-17 cs.CV cs.IR cs.MM

GenState-AI: State-Aware Dataset for Text-to-Video Retrieval on AI-Generated Videos

Minghan Li, Tongna Chen, Tianrui Lv, Yishuai Zhang, Suchao An, Guodong Zhou

详情
英文摘要

Existing text-to-video retrieval benchmarks are dominated by real-world footage where much of the semantics can be inferred from a single frame, leaving temporal reasoning and explicit end-state grounding under-evaluated. We introduce GenState-AI, an AI-generated benchmark centered on controlled state transitions, where each query is paired with a main video, a temporal hard negative that differs only in the decisive end-state, and a semantic hard negative with content substitution, enabling fine-grained diagnosis of temporal vs. semantic confusions beyond appearance matching. Using Wan2.2-TI2V-5B, we generate short clips whose meaning depends on precise changes in position, quantity, and object relations, providing controllable evaluation conditions for state-aware retrieval. We evaluate two representative MLLM-based baselines, and observe consistent and interpretable failure patterns: both frequently confuse the main video with the temporal hard negative and over-prefer temporally plausible but end-state-incorrect clips, indicating insufficient grounding to decisive end-state evidence, while being comparatively less sensitive to semantic substitutions. We further introduce triplet-based diagnostic analyses, including relative-order statistics and breakdowns across transition categories, to make temporal vs. semantic failure sources explicit. GenState-AI provides a focused testbed for state-aware, temporally and semantically sensitive text-to-video retrieval, and will be released on huggingface.co.

2603.14422 2026-03-17 cs.LG cs.AI cs.IR

MBD: A Model-Based Debiasing Framework Across User, Content, and Model Dimensions

Yuantong Li, Lei Yuan, Zhihao Zheng, Weimiao Wu, Songbin Liu, Jeong Min Lee, Ali Selman Aydin, Shaofeng Deng, Junbo Chen, Xinyi Zhang, Hongjing Xia, Sam Fieldman, Matthew Kosko, Wei Fu, Du Zhang, Peiyu Yang, Albert Jin Chung, Xianlei Qiu, Miao Yu, Zhongwei Teng, Hao Chen, Sunny Baek, Hui Tang, Yang Lv, Renze Wang, Qifan Wang, Zhan Li, Tiantian Xu, Peng Wu, Ji Liu

详情
英文摘要

Modern recommendation systems rank candidates by aggregating multiple behavioral signals through a value model. However, many commonly used signals are inherently affected by heterogeneous biases. For example, watch time naturally favors long-form content, loop rate favors short - form content, and comment probability favors videos over images. Such biases introduce two critical issues: (1) value model scores may be systematically misaligned with users' relative preferences - for instance, a seemingly low absolute like probability may represent exceptionally strong interest for a user who rarely engages; and (2) changes in value modeling rules can trigger abrupt and undesirable ecosystem shifts. In this work, we ask a fundamental question: can biased behavioral signals be systematically transformed into unbiased signals, under a user - defined notion of ``unbiasedness'', that are both personalized and adaptive? We propose a general, model-based debiasing (MBD) framework that addresses this challenge by augmenting it with distributional modeling. By conditioning on a flexible subset of features (partial feature set), we explicitly estimate the contextual mean and variance of the engagement distribution for arbitrary cohorts (e.g., specific video lengths or user regions) directly alongside the main prediction. This integration allows the framework to convert biased raw signals into unbiased representations, enabling the construction of higher-level, calibrated signals (such as percentiles or z - scores) suitable for the value model. Importantly, the definition of unbiasedness is flexible and controllable, allowing the system to adapt to different personalization objectives and modeling preferences. Crucially, this is implemented as a lightweight, built-in branch of the existing MTML ranking model, requiring no separate serving infrastructure.

2603.14420 2026-03-17 cs.AI

Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation

Tiantian Mi, Dongming Shan, Zhen Huang, Yiwei Qin, Muhang Xie, Yuxuan Qiao, Yixiu Liu, Chenyang Zhou, Pengfei Liu

详情
英文摘要

Data Darwinism (Part I) established a ten-level hierarchy for data processing, showing that stronger processing can unlock greater data value. However, that work relied on manually designed strategies for a single category. Modern pretraining corpora comprise hundreds of heterogeneous categories spanning domains and content types, each demanding specialized treatment. At this scale, manual strategy design becomes prohibitive. This raises a key question: can strategies evolve in an automated way? We introduce DataEvolve, a framework that enables strategies to evolve through iterative optimization rather than manual design. For each data category, DataEvolve operates in a closed evolutionary loop: it identifies quality issues, generates candidate strategies, executes them on sampled data, evaluates results, and refines approaches across generations. The process accumulates knowledge through an experience pool of discovered issues and a strategy pool tracking performance across iterations. Applied to 8 categories spanning 672B tokens from Nemotron-CC, DataEvolve produces Darwin-CC, a 504B-token dataset with strategies evolved through 30 iterations per category. Training 3B models on 500B tokens, Darwin-CC outperforms raw data (+3.96 points) and achieves a 44.13 average score across 18 benchmarks, surpassing DCLM, Ultra-FineWeb, and FineWeb-Edu, with strong gains on knowledge-intensive tasks such as MMLU. Analysis shows evolved strategies converge on cleaning-focused approaches: targeted noise removal and format normalization with domain-aware preservation, echoing the L4 (Generative Refinement) principles from Part I. Ablation studies confirm iterative evolution is essential: optimized strategies outperform suboptimal ones by 2.93 points, establishing evolutionary strategy design as feasible and necessary for pretraining-scale data curation.

2603.14418 2026-03-17 cs.CV cs.AI

Deep EM with Hierarchical Latent Label Modelling for Multi-Site Prostate Lesion Segmentation

Wen Yan, Yipei Wang, Shiqi Huang, Natasha Thorley, Mark Emberton, Vasilis Stavrinides, Yipeng Hu, Dean Barratt

Comments 10 pages, 2 figures

详情
英文摘要

Label variability is a major challenge for prostate lesion segmentation. In multi-site datasets, annotations often reflect centre-specific contouring protocols, causing segmentation networks to overfit to local styles and generalise poorly to unseen sites in inference. We treat each observed annotation as a noisy observation of an underlying latent 'clean' lesion mask, and propose a hierarchical expectation-maximisation (HierEM) framework that alternates between: (1) inferring a voxel-wise posterior distribution over the latent mask, and (2) training a CNN using this posterior as a soft target and estimate site-specific sensitivity and specificity under a hierarchical prior. This hierarchical prior decomposes label-quality into a global mean with site- and case-level deviations, reducing site-specific bias by penalising the likelihood term contributed only by site deviations. Experiments on three cohorts demonstrate that the proposed hierarchical EM framework enhances cross-site generalisation compared to state-of-the-art methods. For pooled-dataset evaluation, the per-site mean DSC ranges from 29.50% to 39.69%; for leave-one-site-out generalisation, it ranges from 27.91% to 32.67%, yielding statistically significant improvements over comparison methods (p<0.039). The method also produces interpretable per-site latent label-quality estimates (sensitivity alpha ranges from 31.5% to 47.3% at specificity beta approximates 0.99), supporting post-hoc analyses of cross-site annotation variability. These results indicate that explicitly modelling site-dependent annotation can improve cross-site generalisation.

2603.14416 2026-03-17 cs.CV

Histo-MExNet: A Unified Framework for Real-World, Cross-Magnification, and Trustworthy Breast Cancer Histopathology

Enam Ahmed Taufika, Md Ahasanul Arafatha, Abhijit Kumar Ghoshb, Md. Tanzim Rezab, Md Ashad Alamc

Comments 34, 6 figures

详情
英文摘要

Accurate and reliable histopathological image classification is essential for breast cancer diagnosis. However, many deep learning models remain sensitive to magnification variability and lack interpretability. To address these challenges, we propose Histo-MExNet, a unified framework designed for scaleinvariant and uncertainty-aware classification. The model integrates DenseNet, ConvNeXt, and EfficientNet backbones within a gated multi-expert architecture, incorporates a prototype learning module for example-driven interpretability, and applies physics-informed regularization to enforce morphology preservation and spatial coherence during feature learning. Monte Carlo Dropout is used to quantify predictive uncertainty. On the BreaKHis dataset, Histo-MExNet achieves 96.97% accuracy under multi-magnification training and demonstrates improved generalization to unseen magnification levels compared to single-expert models, while uncertainty estimation helps identify out-of-distribution samples and reduce overconfident errors, supporting a balanced combination of accuracy, robustness, and interpretability for clinical decision support.

2603.14412 2026-03-17 cs.CV

G-ZAP: A Generalizable Zero-Shot Framework for Arbitrary-Scale Pansharpening

Zhiqi Yang, Shan Yin, Jingze Liang, Liang-Jian Deng

详情
英文摘要

Pansharpening aims to fuse a high-resolution panchromatic (PAN) image and a low-resolution multispectral (LRMS) image to produce a high-resolution multispectral (HRMS) image. Recent deep models have achieved strong performance, yet they typically rely on large-scale pretraining and often generalize poorly to unseen real-world image pairs.Prior zero-shot approaches improve real-scene generalization but require per-image optimization, hindering weight reuse, and the above methods are usually limited to a fixed scale.To address this issue, we propose G-ZAP, a generalizable zero-shot framework for arbitrary-scale pansharpening, designed to handle cross-resolution, cross-scene, and cross-sensor generalization.G-ZAP adopts a feature-based implicit neural representation (INR) fusion network as the backbone and introduces a multi-scale, semi-supervised training scheme to enable robust generalization.Extensive experiments on multiple real-world datasets show that G-ZAP achieves state-of-the-art results under PAN-scale fusion in both visual quality and quantitative metrics.Notably, G-ZAP supports weight reuse across image pairs while maintaining competitiveness with per-pair retraining, demonstrating strong potential for efficient real-world deployment.

2603.14409 2026-03-17 cs.CV cs.AI

PGcGAN: Pathological Gait-Conditioned GAN for Human Gait Synthesis

Mritula Chandrasekaran, Sanket Kachole, Jarek Francik, Dimitrios Makris

详情
英文摘要

Pathological gait analysis is constrained by limited and variable clinical datasets, which restrict the modeling of diverse gait impairments. To address this challenge, we propose a Pathological Gait-conditioned Generative Adversarial Network (PGcGAN) that synthesises pathology-specific gait sequences directly from observed 3D pose keypoint trajectories data. The framework incorporates one-hot encoded pathology labels within both the generator and discriminator, enabling controlled synthesis across six gait categories. The generator adopts a conditional autoencoder architecture trained with adversarial and reconstruction objectives to preserve structural and temporal gait characteristics. Experiments on the Pathological Gait Dataset demonstrate strong alignment between real and synthetic sequences through PCA and t-SNE analyses, visual kinematic inspection, and downstream classification tasks. Augmenting real data with synthetic sequences improved pathological gait recognition across GRU, LSTM, and CNN models, indicating that pathology-conditioned gait synthesis can effectively support data augmentation in pathological gait analysis.

2603.14406 2026-03-17 cs.LG

Graph-Based Deep Learning for Intelligent Detection of Energy Losses, Theft, and Operational Inefficiencies in Oil & Gas Production Networks

AbdulQoyum A. Olowookere, Adewale U. Oguntola, Ebenezer. Leke Odekanle

Comments 22 pages, 7 figures

详情
英文摘要

Early detection of energy losses, theft, and operational inefficiencies remains a critical challenge in oil and gas production systems due to complex interdependencies among wells and facilities, evolving operating conditions, and limited labeled anomaly data. Traditional machine learning approaches often treat production units independently and struggle under temporal distribution shifts. This study proposes a spatiotemporal graph-based deep learning framework for anomaly detection in oil and gas production networks. The production system is modeled as a hierarchical graph of wells, facilities, and fields, with additional peer connections among wells sharing common infrastructure. Weakly supervised anomaly labels are derived from physically informed heuristics based on production, pressure, and flow behavior. Temporal dynamics are captured through sequence modeling, while relational dependencies are learned using a Temporal Graph Attention Network. Under time-based evaluation, the proposed model achieves an ROC-AUC of about 0.98 and anomaly recall above 0.93, demonstrating improved robustness and practical potential for proactive monitoring in real-world energy operations.

2603.14401 2026-03-17 cs.RO cs.CV

OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer

Kuanning Wang, Ke Fan, Yuqian Fu, Siyu Lin, Hu Luo, Daniel Seita, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue

Comments Project page: https://sressers.github.io/OCRA/

详情
英文摘要

We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.

2603.14400 2026-03-17 cs.CL cs.AI

Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains

Andrew Katz

Comments 34 pages, 11 figures

详情
英文摘要

The minimal pairs paradigm of comparing model probabilities for contrasting completions has proven useful for evaluating linguistic knowledge in language models, yet its application has largely been confined to binary grammaticality judgments over syntactic phenomena. Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model judgments, and discards information about model uncertainty. We address both limitations by extending surprisal-based evaluation from binary grammaticality contrasts to ordinal-scaled classification and scoring tasks across multiple domains. Rather than asking models to generate answers, we measure the information-theoretic "surprise" (negative log probability) they assign to each position on rating scales (e.g., 1-5 or 1-9), yielding full surprisal curves that reveal both the model's preferred response and its uncertainty via entropy. We explore this framework across four domains: social-ecological-technological systems classification, causal statement identification (binary and scaled), figurative language detection, and deductive qualitative coding. Across these domains, surprisal curves produce interpretable classification signals with clear minima near expected ordinal scale positions, and entropy over the completion tended to distinguish genuinely ambiguous items from easier items.

2603.14397 2026-03-17 cs.RO

eNavi: Event-based Imitation Policies for Low-Light Indoor Mobile Robot Navigation

Prithvi Jai Ramesh, Kaustav Chanda, Krishna Vinod, Joseph Raj Vishal, Yezhou Yang, Bharatesh Chakravarthi

详情
英文摘要

Event cameras provide high dynamic range and microsecond-level temporal resolution, making them well-suited for indoor robot navigation, where conventional RGB cameras degrade under fast motion or low-light conditions. Despite advances in event-based perception spanning detection, SLAM, and pose estimation, there remains limited research on end-to-end control policies that exploit the asynchronous nature of event streams. To address this gap, we introduce a real-world indoor person-following dataset collected using a TurtleBot 2 robot, featuring synchronized raw event streams, RGB frames, and expert control actions across multiple indoor maps, trajectories under both normal and low-light conditions. We further build a multimodal data preprocessing pipeline that temporally aligns event and RGB observations while reconstructing ground-truth actions from odometry to support high-quality imitation learning. Building on this dataset, we propose a late-fusion RGB-Event navigation policy that combines dual MobileNet encoders with a transformer-based fusion module trained via behavioral cloning. A systematic evaluation of RGB-only, Event-only, and RGB-Event fusion models across 12 training variations ranging from single-path imitation to general multi-path imitation shows that policies incorporating event data, particularly the fusion model, achieve improved robustness and lower action prediction error, especially in unseen low-light conditions where RGB-only models fail. We release the dataset, synchronization pipeline, and trained models at https://eventbasedvision.github.io/eNavi/

2603.14393 2026-03-17 cs.RO

From Scanning Guidelines to Action: A Robotic Ultrasound Agent with LLM-Based Reasoning

Yuan Bi, Yiping Zhou, Pei Liu, Feng Li, Zhongliang Jiang, Nassir Navab

Comments Code: https://github.com/yuan-12138/RUSSAgent; Video: https://youtu.be/pfMOc4e2IGA

详情
英文摘要

Robotic ultrasound offers advantages over free-hand scanning, including improved reproducibility and reduced operator dependency. In clinical practice, US acquisition relies heavily on the sonographer's experience and situational judgment. When transferring this process to robotic systems, such expertise is often encoded explicitly through fixed procedures and task-specific models, yielding pipelines that can be difficult to adapt to new scanning tasks. In this work, we propose a unified framework for autonomous robotic US scanning that leverages a LLM-based agent to interpret US scanning guidelines and execute scans by dynamically invoking a set of provided software tools. Instead of encoding fixed scanning procedures, the LLM agent retrieves and reasons over guideline steps from scanning handbooks and adapts its planning decisions based on observations and the current scanning state. This enables the system to handle variable and decision-dependent workflows, such as adjusting scanning strategies, repeating steps, or selecting the appropriate next tool call in response to image quality or anatomical findings. Because the reasoning underlying tool selection is also critical for transparent and trustworthy planning, we further fine tune the LLM agent using a RL based strategy to improve both its reasoning quality and the correctness of tool selection and parameterization, while maintaining robust generalization to unseen guidelines and related tasks. We first validate the approach via verbal execution on 10 US scanning guidelines, assessing reasoning as well as tool selection and parameterization, and showing the benefit of RL fine tuning. We then demonstrate real world feasibility on robotic scanning of the gallbladder, spine, and kidney. Overall, the framework follows diverse guidelines and enables reliable autonomous scanning across multiple anatomical targets within a unified system.

2603.14382 2026-03-17 cs.CV

StAR: Segment Anything Reasoner

Seokju Yun, Dongheon Lee, Noori Bae, Jaesung Jun, Chanseul Cho, Youngmin Ro

Comments Code: https://github.com/ysj9909/StAR

详情
英文摘要

As AI systems are being integrated more rapidly into diverse and complex real-world environments, the ability to perform holistic reasoning over an implicit query and an image to localize a target is becoming increasingly important. However, recent reasoning segmentation methods fail to sufficiently elicit the visual reasoning capabilities of the base mode. In this work, we present Segment Anything Reasoner (StAR), a comprehensive framework that refines the design space from multiple perspectives-including parameter-tuning scheme, reward functions, learning strategies and answer format-and achieves substantial improvements over recent baselines. In addition, for the first time, we successfully introduce parallel test-time scaling to the segmentation task, pushing the performance boundary even further. To extend the scope and depth of reasoning covered by existing benchmark, we also construct the ReasonSeg-X, which compactly defines reasoning types and includes samples that require deeper reasoning. Leveraging this dataset, we train StAR with a rollout-expanded selective-tuning approach to activate the base model's latent reasoning capabilities, and establish a rigorous benchmark for systematic, fine-grained evaluation of advanced methods. With only 5k training samples, StAR achieves significant gains over its base counterparts across extensive benchmarks, demonstrating that our method effectively brings dormant reasoning competence to the surface.

2603.14380 2026-03-17 cs.LG cs.AI cs.AR

SPARQ: Spiking Early-Exit Neural Networks for Energy-Efficient Edge AI

Parth Patne, Mahdi Taheri, Ali Mahani, Maksim Jenihhin, Reza Mahani, Christian Herglotz

详情
英文摘要

Spiking neural networks (SNNs) offer inherent energy efficiency due to their event-driven computation model, making them promising for edge AI deployment. However, their practical adoption is limited by the computational overhead of deep architectures and the absence of input-adaptive control. This work presents SPARQ, a unified framework that integrates spiking computation, quantization-aware training, and reinforcement learning-guided early exits for efficient and adaptive inference. Evaluations across MLP, LeNet, and AlexNet architectures demonstrated that the proposed Quantised Dynamic SNNs (QDSNN) consistently outperform conventional SNNs and QSNNs, achieving up to 5.15% higher accuracy over QSNNs, over 330 times lower system energy compared to baseline SNNs, and over 90 percent fewer synaptic operations across different datasets. These results validate SPARQ as a hardware-friendly, energy-efficient solution for real-time AI at the edge.

2603.14372 2026-03-17 cs.AI

Contests with Spillovers: Incentivizing Content Creation with GenAI

Sagi Ohayon, Boaz Taitler, Omer Ben-Porat

详情
英文摘要

The rise of GenAI amplifies the economic phenomenon of positive spillovers. When creators contribute content that can be reused and adapted by Large Language Models (LLMs), each creator's effort can enhance the content quality of others by enabling easy imitation and recombination of existing content. On the one hand, such spillovers create value for the entire ecosystem; on the other hand, they risk undermining creators' incentives to invest genuine effort, as others may freely benefit from their contributions. To address this problem, we introduce the Content Creation with Spillovers (CCS) model. In our model, each creator chooses an effort level that, together with the efforts of others, determines her content quality. The platform aims to maximize the social welfare of consumers under stable behavior of the creators (pure Nash equilibrium), but can only observe the resulting qualities and not the underlying efforts. Interestingly, simple mechanisms like winner-takes-all and Tullock lead to the non-existence of equilibrium. In response, we propose the parametrized family of Provisional Allocation mechanisms, guaranteeing equilibrium existence and a unique Pareto-dominant equilibrium. While maximizing the social welfare under this family is NP-hard, we develop approximation algorithms that apply to a broad class of spillover structures and provide strong welfare guarantees. Specifically, in the worst-case analysis, we devise efficient algorithms for bounded spillovers and tree-structure spillovers. We also introduce Greedy Cost Selection, a linearithmic time algorithm that achieves approximately optimal results in the average case analysis. Together, our results provide game-theoretic foundations for sustaining human content creation in the era of GenAI.

2603.14369 2026-03-17 cs.LG

From Specification to Architecture: A Theory Compiler for Knowledge-Guided Machine Learning

Asela Hevapathige, Yu Xia, Sachith Seneviratne, Saman Halgamuge

详情
英文摘要

Theory-guided machine learning has demonstrated that including authentic domain knowledge directly into model design improves performance, sample efficiency and out-of-distribution generalisation. Yet the process by which a formal domain theory is translated into architectural constraints remains entirely manual, specific to each domain formalism, and devoid of any formal correctness guarantee. This translation is non-transferable between domains, not verified, and does not scale. We propose the Theory Compiler: a system that accepts a typed, machine-readable domain theory as input and automatically produces an architecture whose function space is provably constrained to be consistent with that theory by construction, not by regularisation. We identify three foundational open problems whose resolution defines our research agenda: (1) designing a universal theory formalisation language with decidable type-checking; (2) constructing a compositionally correct compilation algorithm from theory primitives to architectural modules; and (3) establishing soundness and completeness criteria for formal verification. We further conjecture that compiled architectures match or exceed manually-designed counterparts in generalisation performance while requiring substantially less training data, a claim we ground in classical statistical learning theory. We argue that recent advances in formal machine learning theory, large language models, and the growth of an interdisciplinary research community have made this paradigm achievable for the first time.

2603.14367 2026-03-17 cs.CV

HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task

Xiaoya Lu, Yijin Zhou, Zeren Chen, Ruocheng Wang, Bingrui Sima, Enshen Zhou, Lu Sheng, Dongrui Liu, Jing Shao

详情
英文摘要

Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks where benign commands become hazardous due to subtle environmental states. Existing safeguards often prove inadequate. Rule-based methods lack scalability in object-dense scenes, whereas model-based approaches relying on prompt engineering suffer from unfocused perception, resulting in missed risks or hallucinations. To address this, we propose an architecture-agnostic safeguard featuring Context-Guided Chain-of-Thought (CG-CoT). This mechanism decomposes risk assessment into active perception that sequentially anchors attention to interaction targets and relevant spatial neighborhoods, followed by semantic judgment based on this visual evidence. We support this approach with a curated grounding dataset and a two-stage training strategy utilizing Reinforcement Fine-Tuning (RFT) with process rewards to enforce precise intermediate grounding. Experiments demonstrate that our model HomeGuard significantly enhances safety, improving risk match rates by over 30% compared to base models while reducing oversafety. Beyond hazard detection, the generated visual anchors serve as actionable spatial constraints for downstream planners, facilitating explicit collision avoidance and safety trajectory generation. Code and data are released under https://github.com/AI45Lab/HomeGuard

2603.14366 2026-03-17 cs.CV cs.LG

Representation Alignment for Just Image Transformers is not Easier than You Think

Jaeyo Shin, Jiwook Kim, Hyunjung Shim

Comments Code: https://github.com/kaist-cvml/PixelREPA

详情
英文摘要

Representation Alignment (REPA) has emerged as a simple way to accelerate Diffusion Transformers training in latent space. At the same time, pixel-space diffusion transformers such as Just image Transformers (JiT) have attracted growing attention because they remove a dependency on a pretrained tokenizer, and then avoid the reconstruction bottleneck of latent diffusion. This paper shows that the REPA can fail for JiT. REPA yields worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high dimensional image space, while the semantic target is strongly compressed, making direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality. PixelREPA reduces FID from 3.66 to 3.17 for JiT-B$/16$ and improves Inception Score (IS) from 275.1 to 284.6 on ImageNet $256 \times 256$, while achieving $> 2\times$ faster convergence. Finally, PixelREPA-H$/16$ achieves FID$=1.81$ and IS$=317.2$. Our code is available at https://github.com/kaist-cvml/PixelREPA.

2603.14363 2026-03-17 cs.CV cs.AI cs.RO

AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control

Peng Xu, Zhengnan Deng, Jiayan Deng, Zonghua Gu, Shaohua Wan

Comments 18 pages, 4 figures. Code and demo videos will be available at: https://github.com/XuPeng23/AerialVLA

详情
英文摘要

Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. To reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on dense oracle guidance. Ultimately, we formulate a unified control space that integrates continuous 3-Degree-of-Freedom (3-DoF) kinematic commands with an intrinsic landing signal, freeing the agent from external object detectors for precision landing. Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state-of-the-art performance in seen environments. Furthermore, it exhibits superior generalization in unseen scenarios by achieving nearly three times the success rate of leading baselines, validating that a minimalist, autonomy-centric paradigm captures more robust visual-motor representations than complex modular systems.

2603.14361 2026-03-17 cs.CV

BROTHER: Behavioral Recognition Optimized Through Heterogeneous Ensemble Regularization for Ambivalence and Hesitancy

Alexandre Pereira, Bruno Fernandes, Pablo Barros

Comments 5 pages, 2 figures, 3 tables, Ambivalence/Hesitancy (AH) Video Recognition Challenge, ABAW10th, CVPR2026

详情
英文摘要

Recognizing complex behavioral states such as Ambivalence and Hesitancy (A/H) in naturalistic video settings remains a significant challenge in affective computing. Unlike basic facial expressions, A/H manifests as subtle, multimodal conflicts that require deep contextual and temporal understanding. In this paper, we propose a highly regularized, multimodal fusion pipeline to predict A/H at the video level. We extract robust unimodal features from visual, acoustic, and linguistic data, introducing a specialized statistical text modality explicitly designed to capture temporal speech variations and behavioral cues. To identify the most effective representations, we evaluate 15 distinct modality combinations across a committee of machine learning classifiers (MLP, Random Forest, and GBDT), selecting the most well-calibrated models based on validation Binary Cross-Entropy (BCE) loss. Furthermore, to optimally fuse these heterogeneous models without overfitting to the training distribution, we implement a Particle Swarm Optimization (PSO) hard-voting ensemble. The PSO fitness function dynamically incorporates a train-validation gap penalty (lambda) to actively suppress redundant or overfitted classifiers. Our comprehensive evaluation demonstrates that while linguistic features serve as the strongest independent predictor of A/H, our heavily regularized PSO ensemble (lambda = 0.2) effectively harnesses multimodal synergies, achieving a peak Macro F1-score of 0.7465 on the unseen test set. These results emphasize that treating ambivalence and hesitancy as a multimodal conflict, evaluated through an intelligently weighted committee, provides a robust framework for in-the-wild behavioral analysis.

2603.14355 2026-03-17 cs.CL

Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty

详情
英文摘要

Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it often suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution. While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed safety-critical prompt, where increasing the number and diversity of sampled responses can drive jailbreak success rates close to unity. To efficiently uncover such failures, we propose Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection to explore a large candidate pool of responses and retain a compact, semantically diverse subset. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost. Under limited-response settings, it improves success rates by 26% to 40% over IID sampling and Diverse Beam Search. Furthermore, responses generated by PDPS exhibit both a higher number and greater diversity of unsafe outputs, demonstrating its effectiveness in uncovering a broader range of failures.

2603.14350 2026-03-17 cs.LG

Refold: Refining Protein Inverse Folding with Efficient Structural Matching and Fusion

Yiran Zhu, Changxi Chi, Hongxin Xiang, Wenjie Du, Xiaoqi Wang, Jun Xia

详情
英文摘要

Protein inverse folding aims to design an amino acid sequence that will fold into a given backbone structure, serving as a central task in protein design. Two main paradigms have been widely explored. Template-based methods exploit database-derived structural priors and can achieve high local precision when close structural neighbors are available, but their dependence on database coverage and match quality often degrades performance on out-of-distribution (OOD) targets. Deep learning approaches, in contrast, learn general structure-to-sequence regularities and usually generalize better to new backbones. However, they struggle to capture fine-grained local structure, which can cause uncertain residue predictions and missed local motifs in ambiguous regions. We introduce Refold, a novel framework that synergistically integrates the strengths of database-derived structural priors and deep learning prediction to enhance inverse folding. Refold obtains structural priors from matched neighbors and fuses them with model predictions to refine residue probabilities. In practice, low-quality neighbors can introduce noise, potentially degrading model performance. We address this issue with a Dynamic Utility Gate that controls prior injection and falls back to the base prediction when the priors are untrustworthy. Comprehensive evaluations on standard benchmarks demonstrate that Refold achieves state-of-the-art native sequence recovery of 0.63 on both CATH 4.2 and CATH 4.3. Also, analysis indicates that Refold delivers larger gains on high-uncertainty regions, reflecting the complementarity between structural priors and deep learning predictions.

2603.14347 2026-03-17 cs.CL cs.CY

Motivation in Large Language Models

Omer Nahum, Asael Sklar, Ariel Goldstein, Roi Reichart

Comments Preprint. Under review

详情
英文摘要

Motivation is a central driver of human behavior, shaping decisions, goals, and task performance. As large language models (LLMs) become increasingly aligned with human preferences, we ask whether they exhibit something akin to motivation. We examine whether LLMs "report" varying levels of motivation, how these reports relate to their behavior, and whether external factors can influence them. Our experiments reveal consistent and structured patterns that echo human psychology: self-reported motivation aligns with different behavioral signatures, varies across task types, and can be modulated by external manipulations. These findings demonstrate that motivation is a coherent organizing construct for LLM behavior, systematically linking reports, choices, effort, and performance, and revealing motivational dynamics that resemble those documented in human psychology. This perspective deepens our understanding of model behavior and its connection to human-inspired concepts.

2603.14345 2026-03-17 cs.RO

VIP-Loco: A Visually Guided Infinite Horizon Planning Framework for Legged Locomotion

Aditya Shirwatkar, Satyam Gupta, Shishir Kolathaya

Comments 8 pages, 5 figures

详情
英文摘要

Perceptive locomotion for legged robots requires anticipating and adapting to complex, dynamic environments. Model Predictive Control (MPC) serves as a strong baseline, providing interpretable motion planning with constraint enforcement, but struggles with high-dimensional perceptual inputs and rapidly changing terrain. In contrast, model-free Reinforcement Learning (RL) adapts well across visually challenging scenarios but lacks planning. To bridge this gap, we propose VIP-Loco, a framework that integrates vision-based scene understanding with RL and planning. During training, an internal model maps proprioceptive states and depth images into compact kinodynamic features used by the RL policy. At deployment, the learned models are used within an infinite-horizon MPC formulation, combining adaptability with structured planning. We validate VIP-Loco in simulation on challenging locomotion tasks, including slopes, stairs, crawling, tilting, gap jumping, and climbing, across three robot morphologies: a quadruped (Unitree Go1), a biped (Cassie), and a wheeled-biped (TronA1-W). Through ablations and comparisons with state-of-the-art methods, we show that VIP-Loco unifies planning and perception, enabling robust, interpretable locomotion in diverse environments.