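arXivDaily: a daily academic digest synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.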

2603.14457 2026-03-17 cs.RO

Towards Versatile Opti-Acoustic Sensor Fusion and Volumetric Mapping

Ivana Collado-Gonzalez, John McConnell, Brendan Englot

Comments To appear at ICRA 2026 in Vienna, Austria

Abstract

Accurate 3D volumetric mapping is critical for autonomous underwater vehicles operating in obstacle-rich environments. Vision-based perception provides high-resolution data but fails in turbid conditions, while sonar is robust to lighting and turbidity but suffers from low resolution and elevation ambiguity. This paper presents a volumetric mapping framework that fuses a stereo sonar pair with a monocular camera to enable safe navigation under varying visibility conditions. Overlapping sonar fields of view resolve elevation ambiguity, producing fully defined 3D point clouds at each time step. The framework identifies regions of interest in camera images, associates them with corresponding sonar returns, and combines sonar range with camera-derived elevation cues to generate additional 3D points. Each 3D point is assigned a confidence value reflecting its reliability. These confidence-weighted points are fused using a Gaussian Process Volumetric Mapping framework that prioritizes the most reliable measurements. Experimental comparisons with other opti-acoustic and sonar-based approaches, along with field tests in a marina environment, demonstrate the method's effectiveness in capturing complex geometries and preserving critical information for robot navigation in both clear and turbid conditions. Our code is open-source to support community adoption.
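
The step of combining sonar range with a camera-derived elevation cue can be illustrated with a plain spherical-to-Cartesian conversion. This is only a minimal sketch, not the authors' implementation: the function name `sonar_camera_point`, the axis convention (x forward, y lateral, z up), and the assumption that both sensors share one frame with no extrinsic offset are all hypothetical.

```python
import math


def sonar_camera_point(range_m, bearing_rad, elevation_rad):
    """Convert (range, bearing, elevation) into a 3D point (x, y, z).

    Assumptions (illustrative only): range and bearing come from the sonar,
    elevation comes from the camera, and both sensors share a single frame
    with x forward, y lateral, z up.
    """
    # Project the range onto the horizontal plane using the elevation angle.
    horizontal = range_m * math.cos(elevation_rad)
    x = horizontal * math.cos(bearing_rad)
    y = horizontal * math.sin(bearing_rad)
    # The vertical component is what a single imaging sonar cannot observe.
    z = range_m * math.sin(elevation_rad)
    return x, y, z


# A return 5 m away, 30 degrees off the bow, 20 degrees below the sensor.
point = sonar_camera_point(5.0, math.radians(30.0), math.radians(-20.0))
```

A real system would also need the camera-to-sonar extrinsic calibration and a per-point confidence value, both of which are omitted here.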

2603.14456 2026-03-17 cs.CL cs.SD

PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark

Mohammad Javad Ranjbar Kalahroodi, Mohammad Amini, Parmis Bathayan, Heshaam Faili, Azadeh Shakery

Comments Submitted to Interspeech 2026

2603.14452 2026-03-17 cs.CV

Uni-MDTrack: Learning Decoupled Memory and Dynamic States for Parameter-Efficient Visual Tracking in All Modality

Wenrui Cai, Zhenyi Lu, Yuzhe Li, Yongchao Feng, Jinqing Zhang, Qingjie Liu, Yunhong Wang

Comments 15 pages, 9 figures, 16 tables

Abstract

With the advent of Transformer-based one-stream trackers that possess strong capability in inter-frame relation modeling, recent research has increasingly focused on how to introduce spatio-temporal context. However, most existing methods rely on a limited number of historical frames, which not only leads to insufficient utilization of the context, but also inevitably increases the length of input and incurs prohibitive computational overhead. Methods that query an external memory bank, on the other hand, suffer from inadequate fusion between the retrieved spatio-temporal features and the backbone. Moreover, using discrete historical frames as context overlooks the rich dynamics of the target. To address these issues, we propose Uni-MDTrack, which consists of two core components: a Memory-Aware Compression Prompt (MCP) module and a Dynamic State Fusion (DSF) module. MCP effectively compresses rich memory features into memory-aware prompt tokens, which deeply interact with the input throughout the entire backbone, significantly enhancing the performance while maintaining a stable computational load. DSF complements the discrete memory by capturing continuous dynamics, progressively introducing the updated dynamic state features from shallow to deep layers, while also preserving high efficiency. Uni-MDTrack also supports unified tracking across RGB, RGB-D/T/E, and RGB-Language modalities. Experiments show that training only the MCP, DSF, and prediction head in Uni-MDTrack, which keeps the proportion of trainable parameters around 30%, yields substantial performance gains and achieves state-of-the-art results on 10 datasets spanning five modalities. Furthermore, both MCP and DSF exhibit excellent generality, functioning as plug-and-play components that can boost the performance of various baseline trackers, while significantly outperforming existing parameter-efficient training approaches.

2603.14448 2026-03-17 cs.LG

Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements

Ziwei Liu, Tao Feng, Borui Kang, Yanbing Yang, Jun Luo

2603.14435 2026-03-17 cs.CV

End-to-End Spatial-Temporal Transformer for Real-time 4D HOI Reconstruction

Haoyu Zhang, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha

Comments 23 pages, 7 figures. The project page is available at: https://nianheng.github.io/THO-project/

2603.14430 2026-03-17 cs.CL

Creative Convergence or Imitation? Genre-Specific Homogeneity in LLM-Generated Chinese Literature

Yuanchi Ma, Kaize Shi, Hui He, Zhihua Zhang, Zhongxiang Lei, Ziliang Qiu, Renfen Hu, Jiamou Liu

2603.14426 2026-03-17 cs.CV cs.IR cs.MM

GenState-AI: State-Aware Dataset for Text-to-Video Retrieval on AI-Generated Videos

Minghan Li, Tongna Chen, Tianrui Lv, Yishuai Zhang, Suchao An, Guodong Zhou

2603.14422 2026-03-17 cs.LG cs.AI cs.IR

MBD: A Model-Based Debiasing Framework Across User, Content, and Model Dimensions

Yuantong Li, Lei Yuan, Zhihao Zheng, Weimiao Wu, Songbin Liu, Jeong Min Lee, Ali Selman Aydin, Shaofeng Deng, Junbo Chen, Xinyi Zhang, Hongjing Xia, Sam Fieldman, Matthew Kosko, Wei Fu, Du Zhang, Peiyu Yang, Albert Jin Chung, Xianlei Qiu, Miao Yu, Zhongwei Teng, Hao Chen, Sunny Baek, Hui Tang, Yang Lv, Renze Wang, Qifan Wang, Zhan Li, Tiantian Xu, Peng Wu, Ji Liu

Abstract

Modern recommendation systems rank candidates by aggregating multiple behavioral signals through a value model. However, many commonly used signals are inherently affected by heterogeneous biases. For example, watch time naturally favors long-form content, loop rate favors short-form content, and comment probability favors videos over images. Such biases introduce two critical issues: (1) value model scores may be systematically misaligned with users' relative preferences (for instance, a seemingly low absolute like probability may represent exceptionally strong interest for a user who rarely engages); and (2) changes in value modeling rules can trigger abrupt and undesirable ecosystem shifts. In this work, we ask a fundamental question: can biased behavioral signals be systematically transformed into unbiased signals, under a user-defined notion of "unbiasedness", that are both personalized and adaptive? We propose a general, model-based debiasing (MBD) framework that addresses this challenge by augmenting it with distributional modeling. By conditioning on a flexible subset of features (partial feature set), we explicitly estimate the contextual mean and variance of the engagement distribution for arbitrary cohorts (e.g., specific video lengths or user regions) directly alongside the main prediction. This integration allows the framework to convert biased raw signals into unbiased representations, enabling the construction of higher-level, calibrated signals (such as percentiles or z-scores) suitable for the value model. Importantly, the definition of unbiasedness is flexible and controllable, allowing the system to adapt to different personalization objectives and modeling preferences. Crucially, this is implemented as a lightweight, built-in branch of the existing MTML ranking model, requiring no separate serving infrastructure.
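
The idea of turning a raw signal into a z-score or percentile relative to a cohort's estimated mean and variance can be sketched in a few lines. This is an illustration under stated assumptions, not the MBD model itself: the function name `debias`, the example numbers, and the Gaussian mapping from z-score to percentile are all hypothetical choices made here.

```python
from statistics import NormalDist


def debias(raw_signal, cohort_mean, cohort_var):
    """Return (z_score, percentile) of a raw signal relative to its cohort.

    cohort_mean and cohort_var stand in for the contextual mean and variance
    that a model would predict; here they are supplied by hand.
    """
    # Guard against a zero variance before taking the square root.
    std = max(cohort_var, 1e-12) ** 0.5
    z_score = (raw_signal - cohort_mean) / std
    # Assumption: map the z-score to a percentile with a standard normal CDF.
    percentile = NormalDist().cdf(z_score)
    return z_score, percentile


# The same raw like probability (0.05) read against two different cohorts:
# a user who rarely engages versus a user who engages often.
z_rare, p_rare = debias(0.05, cohort_mean=0.01, cohort_var=0.0004)
z_often, p_often = debias(0.05, cohort_mean=0.20, cohort_var=0.01)
```

Under these made-up numbers the identical raw value sits well above the mean for the rarely engaging user and well below it for the frequently engaging one, which is the kind of misalignment the abstract describes.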

2603.14420 2026-03-17 cs.AI

Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation

Tiantian Mi, Dongming Shan, Zhen Huang, Yiwei Qin, Muhang Xie, Yuxuan Qiao, Yixiu Liu, Chenyang Zhou, Pengfei Liu

Abstract

Data Darwinism (Part I) established a ten-level hierarchy for data processing, showing that stronger processing can unlock greater data value. However, that work relied on manually designed strategies for a single category. Modern pretraining corpora comprise hundreds of heterogeneous categories spanning domains and content types, each demanding specialized treatment. At this scale, manual strategy design becomes prohibitive. This raises a key question: can strategies evolve in an automated way? We introduce DataEvolve, a framework that enables strategies to evolve through iterative optimization rather than manual design. For each data category, DataEvolve operates in a closed evolutionary loop: it identifies quality issues, generates candidate strategies, executes them on sampled data, evaluates results, and refines approaches across generations. The process accumulates knowledge through an experience pool of discovered issues and a strategy pool tracking performance across iterations. Applied to 8 categories spanning 672B tokens from Nemotron-CC, DataEvolve produces Darwin-CC, a 504B-token dataset with strategies evolved through 30 iterations per category. Training 3B models on 500B tokens, Darwin-CC outperforms raw data (+3.96 points) and achieves a 44.13 average score across 18 benchmarks, surpassing DCLM, Ultra-FineWeb, and FineWeb-Edu, with strong gains on knowledge-intensive tasks such as MMLU. Analysis shows evolved strategies converge on cleaning-focused approaches: targeted noise removal and format normalization with domain-aware preservation, echoing the L4 (Generative Refinement) principles from Part I. Ablation studies confirm iterative evolution is essential: optimized strategies outperform suboptimal ones by 2.93 points, establishing evolutionary strategy design as feasible and necessary for pretraining-scale data curation.

2603.14418 2026-03-17 cs.CV cs.AI

Deep EM with Hierarchical Latent Label Modelling for Multi-Site Prostate Lesion Segmentation

Wen Yan, Yipei Wang, Shiqi Huang, Natasha Thorley, Mark Emberton, Vasilis Stavrinides, Yipeng Hu, Dean Barratt

Comments 10 pages, 2 figures

2603.14416 2026-03-17 cs.CV

Histo-MExNet: A Unified Framework for Real-World, Cross-Magnification, and Trustworthy Breast Cancer Histopathology

Enam Ahmed Taufik, Md Ahasanul Arafath, Abhijit Kumar Ghosh, Md. Tanzim Reza, Md Ashad Alam

Comments 34, 6 figures

2603.14412 2026-03-17 cs.CV

G-ZAP: A Generalizable Zero-Shot Framework for Arbitrary-Scale Pansharpening

Zhiqi Yang, Shan Yin, Jingze Liang, Liang-Jian Deng

2603.14409 2026-03-17 cs.CV cs.AI

PGcGAN: Pathological Gait-Conditioned GAN for Human Gait Synthesis

Mritula Chandrasekaran, Sanket Kachole, Jarek Francik, Dimitrios Makris

2603.14406 2026-03-17 cs.LG

Graph-Based Deep Learning for Intelligent Detection of Energy Losses, Theft, and Operational Inefficiencies in Oil & Gas Production Networks

AbdulQoyum A. Olowookere, Adewale U. Oguntola, Ebenezer Leke Odekanle

Comments 22 pages, 7 figures

2603.14401 2026-03-17 cs.RO cs.CV

OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer

Kuanning Wang, Ke Fan, Yuqian Fu, Siyu Lin, Hu Luo, Daniel Seita, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue

Comments Project page: https://sressers.github.io/OCRA/

2603.14400 2026-03-17 cs.CL cs.AI

Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains

Andrew Katz

Comments 34 pages, 11 figures

2603.14397 2026-03-17 cs.RO

eNavi: Event-based Imitation Policies for Low-Light Indoor Mobile Robot Navigation

Prithvi Jai Ramesh, Kaustav Chanda, Krishna Vinod, Joseph Raj Vishal, Yezhou Yang, Bharatesh Chakravarthi

2603.14393 2026-03-17 cs.RO

From Scanning Guidelines to Action: A Robotic Ultrasound Agent with LLM-Based Reasoning

Yuan Bi, Yiping Zhou, Pei Liu, Feng Li, Zhongliang Jiang, Nassir Navab

Comments Code: https://github.com/yuan-12138/RUSSAgent; Video: https://youtu.be/pfMOc4e2IGA

Abstract

Robotic ultrasound offers advantages over free-hand scanning, including improved reproducibility and reduced operator dependency. In clinical practice, US acquisition relies heavily on the sonographer's experience and situational judgment. When transferring this process to robotic systems, such expertise is often encoded explicitly through fixed procedures and task-specific models, yielding pipelines that can be difficult to adapt to new scanning tasks. In this work, we propose a unified framework for autonomous robotic US scanning that leverages an LLM-based agent to interpret US scanning guidelines and execute scans by dynamically invoking a set of provided software tools. Instead of encoding fixed scanning procedures, the LLM agent retrieves and reasons over guideline steps from scanning handbooks and adapts its planning decisions based on observations and the current scanning state. This enables the system to handle variable and decision-dependent workflows, such as adjusting scanning strategies, repeating steps, or selecting the appropriate next tool call in response to image quality or anatomical findings. Because the reasoning underlying tool selection is also critical for transparent and trustworthy planning, we further fine-tune the LLM agent using an RL-based strategy to improve both its reasoning quality and the correctness of tool selection and parameterization, while maintaining robust generalization to unseen guidelines and related tasks. We first validate the approach via verbal execution on 10 US scanning guidelines, assessing reasoning as well as tool selection and parameterization, and showing the benefit of RL fine-tuning. We then demonstrate real-world feasibility on robotic scanning of the gallbladder, spine, and kidney. Overall, the framework follows diverse guidelines and enables reliable autonomous scanning across multiple anatomical targets within a unified system.

2603.14382 2026-03-17 cs.CV

StAR: Segment Anything Reasoner

Seokju Yun, Dongheon Lee, Noori Bae, Jaesung Jun, Chanseul Cho, Youngmin Ro

Comments Code: https://github.com/ysj9909/StAR

2603.14380 2026-03-17 cs.LG cs.AI cs.AR

SPARQ: Spiking Early-Exit Neural Networks for Energy-Efficient Edge AI

Parth Patne, Mahdi Taheri, Ali Mahani, Maksim Jenihhin, Reza Mahani, Christian Herglotz

2603.14372 2026-03-17 cs.AI

Contests with Spillovers: Incentivizing Content Creation with GenAI

Sagi Ohayon, Boaz Taitler, Omer Ben-Porat

2603.14369 2026-03-17 cs.LG

From Specification to Architecture: A Theory Compiler for Knowledge-Guided Machine Learning

Asela Hevapathige, Yu Xia, Sachith Seneviratne, Saman Halgamuge

2603.14367 2026-03-17 cs.CV

HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task

Xiaoya Lu, Yijin Zhou, Zeren Chen, Ruocheng Wang, Bingrui Sima, Enshen Zhou, Lu Sheng, Dongrui Liu, Jing Shao

2603.14366 2026-03-17 cs.CV cs.LG

Representation Alignment for Just Image Transformers is not Easier than You Think

Jaeyo Shin, Jiwook Kim, Hyunjung Shim

Comments Code: https://github.com/kaist-cvml/PixelREPA

2603.14363 2026-03-17 cs.CV cs.AI cs.RO

AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control

Peng Xu, Zhengnan Deng, Jiayan Deng, Zonghua Gu, Shaohua Wan

Comments 18 pages, 4 figures. Code and demo videos will be available at: https://github.com/XuPeng23/AerialVLA

2603.14361 2026-03-17 cs.CV

BROTHER: Behavioral Recognition Optimized Through Heterogeneous Ensemble Regularization for Ambivalence and Hesitancy

Alexandre Pereira, Bruno Fernandes, Pablo Barros

Comments 5 pages, 2 figures, 3 tables, Ambivalence/Hesitancy (AH) Video Recognition Challenge, ABAW10th, CVPR2026

2603.14355 2026-03-17 cs.CL

Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty

2603.14350 2026-03-17 cs.LG

Refold: Refining Protein Inverse Folding with Efficient Structural Matching and Fusion

Yiran Zhu, Changxi Chi, Hongxin Xiang, Wenjie Du, Xiaoqi Wang, Jun Xia

2603.14347 2026-03-17 cs.CL cs.CY

Motivation in Large Language Models

Omer Nahum, Asael Sklar, Ariel Goldstein, Roi Reichart

Comments Preprint. Under review

2603.14345 2026-03-17 cs.RO

VIP-Loco: A Visually Guided Infinite Horizon Planning Framework for Legged Locomotion

Aditya Shirwatkar, Satyam Gupta, Shishir Kolathaya

Comments 8 pages, 5 figures