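arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.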

2603.02688 2026-03-04 cs.AI cs.RO

Retrieval-Augmented Robots via Retrieve-Reason-Act

Izat Temiraliev, Diji Yang, Yi Zhang

Abstract

To achieve general-purpose utility, we argue that robots must evolve from passive executors into active Information Retrieval users. In strictly zero-shot settings where no prior demonstrations exist, robots face a critical information gap, such as the exact sequence required to assemble a complex furniture kit, that cannot be satisfied by internal parametric knowledge (common sense) or past internal memory. While recent robotic works attempt to use search before action, they primarily focus on retrieving past kinematic trajectories (analogous to searching internal memory) or text-based safety rules (searching for constraints). These approaches fail to address the core information need of active task construction: acquiring unseen procedural knowledge from external, unstructured documentation. In this paper, we define the paradigm as Retrieval-Augmented Robotics (RAR), empowering the robot with the information-seeking capability that bridges the gap between visual documentation and physical actuation. We formulate the task execution as an iterative Retrieve-Reason-Act loop: the robot or embodied agent actively retrieves relevant visual procedural manuals from an unstructured corpus, grounds the abstract 2D diagrams to 3D physical parts via cross-modal alignment, and synthesizes executable plans. We validate this paradigm on a challenging long-horizon assembly benchmark. Our experiments demonstrate that grounding robotic planning in retrieved visual documents significantly outperforms baselines relying on zero-shot reasoning or few-shot example retrieval. This work establishes the basis of RAR, extending the scope of Information Retrieval from answering user queries to driving embodied physical actions.
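
As a rough illustration of the Retrieve-Reason-Act loop described in the abstract, the toy Python sketch below retrieves a manual page, grounds each diagram label to a physical part, and emits an action. It is not the authors' implementation: `ManualPage`, `retrieve`, `ground`, the keyword-overlap retrieval, and the dictionary-based grounding are all hypothetical stand-ins for the visual retrieval and cross-modal alignment the paper refers to.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ManualPage:
    """Hypothetical stand-in for one visual manual in the corpus."""
    title: str
    steps: list[tuple[str, str]]  # (action, part label as drawn in the diagram)


def retrieve(query: str, corpus: list[ManualPage]) -> ManualPage:
    """Retrieve: pick the page whose title shares the most words with the query."""
    words = set(query.lower().split())
    return max(corpus, key=lambda p: len(words & set(p.title.lower().split())))


def ground(label: str, scene: dict[str, str]) -> str | None:
    """Reason: map a diagram label to a part id observed in the scene (assumed lookup)."""
    return scene.get(label)


def retrieve_reason_act(task: str, corpus: list[ManualPage],
                        scene: dict[str, str], max_iters: int = 10) -> list[str]:
    """Run the loop until the manual is exhausted or a part cannot be grounded."""
    executed: list[str] = []
    done = 0  # number of manual steps already carried out
    for _ in range(max_iters):
        page = retrieve(task, corpus)            # Retrieve
        if done >= len(page.steps):
            break
        action, label = page.steps[done]
        part_id = ground(label, scene)           # Reason
        if part_id is None:
            break                                # do not act on an ungrounded part
        executed.append(f"{action}({part_id})")  # Act (a real robot would move here)
        done += 1
    return executed


corpus = [
    ManualPage("bookshelf assembly manual", [("attach", "A"), ("screw", "B")]),
    ManualPage("chair assembly manual", [("insert", "leg")]),
]
scene = {"A": "panel_3", "B": "bolt_7"}
plan = retrieve_reason_act("assemble the bookshelf", corpus, scene)
print(plan)
```

In this toy setup the loop stops early when a label has no match in the scene, which is one simple way to avoid acting on parts that were not grounded.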

2603.02684 2026-03-04 cs.CL cs.SI

HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse

Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya, Md. Shad Akhtar

Comments Accepted at LREC 2026

2603.02683 2026-03-04 cs.RO

MMH-Planner: Multi-Mode Hybrid Trajectory Planning Method for UAV Efficient Flight Based on Real-Time Spatial Awareness

Yinghao Zhao, Chenguang Dai, Liang Lyu, Zhenchao Zhang, Chaozhen Lan, Hong Xie

2603.02681 2026-03-04 cs.CV

VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

Jinxiang Lai, Zexin Lu, Jiajun He, Rongwei Quan, Wenzhe Zhao, Qinyu Yang, Qi Chen, Qin Lin, Chuyue Li, Tao Gao, Yuhao Shan, Shuai Shao, Song Guo, Qinglin Lu

2603.02680 2026-03-04 cs.AI

LLMs for High-Frequency Decision-Making: Normalized Action Reward-Guided Consistency Policy Optimization

Yang Zhao, Zihao Li, Zhiyu Jiang, Dandan Ma, Ganchao Liu, Wenzhe Zhao

2603.02675 2026-03-04 cs.LG

From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

Shuyi Zhou, Zeen Song, Wenwen Qiang, Jiyan Sun, Yao Zhou, Yinlong Liu, Wei Ma

2603.02669 2026-03-04 cs.RO

IMR-LLM: Industrial Multi-Robot Task Planning and Program Generation using Large Language Models

Xiangyu Su, Juzhan Xu, Oliver van Kaick, Kai Xu, Ruizhen Hu

2603.02663 2026-03-04 cs.CL cs.CV

Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

Shunki Uebayashi, Kento Masui, Kyohei Atarashi, Han Bao, Hisashi Kashima, Naoto Inoue, Mayu Otani, Koh Takeuchi

Comments 24 pages, 20 figures, accepted to ICLR 2026

2603.02658 2026-03-04 cs.CV

OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning

Zhengwei Yang, Andi Long, Hao Li, Zechao Hu, Kui Jiang, Zheng Wang

Comments 12 pages, 8 figures

2603.02655 2026-03-04 cs.CL cs.AI

Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches

Anum Afzal, Yuki Saito, Hiroya Takamura, Katsuhito Sudoh, Shinnosuke Takamichi, Graham Neubig, Florian Matthes, Tatsuya Ishigaki

Comments Accepted at LREC 2026

2603.02649 2026-03-04 cs.LG math.OC stat.ML

HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable Generalization

Feihu Huang, Guanyi Zhang, Songcan Chen

Comments 39 pages

2603.02648 2026-03-04 cs.CV

SEP-YOLO: Fourier-Domain Feature Representation for Transparent Object Instance Segmentation

Fengming Zhang, Tao Yan, Jianchao Huang

Comments 5 pages, 4 figures, accepted to ISCAS 2026

2603.02646 2026-03-04 cs.RO

Compositional Visual Planning via Inference-Time Diffusion Scaling

Yixin Zhang, Yunhao Luo, Utkarsh Aashu Mishra, Woo Chul Shin, Yongxin Chen, Danfei Xu

2603.02635 2026-03-04 cs.LG

SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

Zixuan Xu, Tiancheng He, Huahui Yi, Kun Wang, Xi Chen, Gongli Xi, Qiankun Li, Kang Li, Yang Liu, Zhigang Zeng

2603.02633 2026-03-04 cs.LG cs.AI

Robust Heterogeneous Analog-Digital Computing for Mixture-of-Experts Models with Theoretical Generalization Guarantees

Mohammed Nowaz Rabbani Chowdhury, Hsinyu Tsai, Geoffrey W. Burr, Kaoutar El Maghraoui, Liu Liu, Meng Wang

2603.02629 2026-03-04 cs.CV

Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective

Kaifang Long, Lianbo Ma, Jiaqi Liu, Liming Liu, Guoyang Xie

2603.02628 2026-03-04 cs.LG

Post Hoc Extraction of Pareto Fronts for Continuous Control

Raghav Thakar, Gaurav Dixit, Kagan Tumer

Comments 10 pages, 4 figures. Submitted to IJCAI 2026

2603.02626 2026-03-04 cs.AI

See and Remember: A Multimodal Agent for Web Traversal

Xinjun Wang, Shengyao Wang, Aimin Zhou, Hao Hao

2603.02623 2026-03-04 cs.RO cs.LG

Uni-Skill: Building Self-Evolving Skill Repository for Generalizable Robotic Manipulation

Senwei Xie, Yuntian Zhang, Ruiping Wang, Xilin Chen

Comments Accepted to ICRA 2026

2603.02620 2026-03-04 cs.LG q-fin.CP

Same Error, Different Function: The Optimizer as an Implicit Prior in Financial Time Series

Federico Vittorio Cortesi, Giuseppe Iannone, Giulia Crippa, Tomaso Poggio, Pierfrancesco Beneventano

Comments 39 pages, 24 figures

2603.02619 2026-03-04 cs.CV

Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild

Seunguk Do, Minwoo Huh, Joonghyuk Shin, Jaesik Park

Comments ICLR 2026, Project webpage: https://seunguk-do.github.io/drpose

2603.02615 2026-03-04 cs.CL

Think, But Don't Overthink: Reproducing Recursive Language Models

Daren Wang

2603.02613 2026-03-04 cs.LG cs.RO

Real-Time Generative Policy via Langevin-Guided Flow Matching for Autonomous Driving

Tianze Zhu, Yinuo Wang, Wenjun Zou, Tianyi Zhang, Likun Wang, Letian Tao, Feihong Zhang, Yao Lyu, Shengbo Eben Li

2603.02609 2026-03-04 cs.CV cs.RO

VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction

A. Enes Doruk, Hasan F. Ates

2603.02602 2026-03-04 cs.RO

Wukong-Omni: Design, Modeling and Control of a Multi-mode Robot for Air, Land, and Underwater Exploration with All-in-One Propulsion Unit

Yufan Liu, Rixi Yu, Junjie Li, Yishuai Zeng, Zhenting Wen, Cheng Li, Haifei Zhu, Shikang Lian, Wei Meng, Fumin Zhang

Comments 19 pages, 27 figures

2603.02601 2026-03-04 cs.AI cs.SE

AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows

Varun Pratap Bhardwaj

Comments Technical Report. 52 pages, 5 figures, 9 theorems, 42 formal definitions. Zenodo DOI: 10.5281/zenodo.18842011

2603.02599 2026-03-04 cs.AI cs.LG

SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

Sunghyeon Woo, Ahreum Seo, Jaegwang Lee, Jaeeun Kil, Hanbae Seo, Joonghoon Kim, Baeseong Park, Se Jung Kwon, Dongsoo Lee

Comments Preprint, 15 pages, 5 figures

2603.02598 2026-03-04 cs.CV

Synthetic-Child: An AIGC-Based Synthetic Data Pipeline for Privacy-Preserving Child Posture Estimation

Taowen Zeng

Comments 16 pages, 3 figures, 5 tables

Abstract

Accurate child posture estimation is critical for AI-powered study companion devices, yet collecting large-scale annotated datasets of children is both expensive and ethically prohibitive due to privacy concerns. We present Synthetic-Child, an AIGC-based synthetic data pipeline that produces photorealistic child posture training images with ground-truth-projected keypoint annotations, requiring zero real child photographs. The pipeline comprises four stages: (1) a programmable 3D child body model (SMPL-X) in Blender generates diverse desk-study poses with IK-constrained anatomical plausibility and automatic COCO-format ground-truth export; (2) a custom PoseInjectorNode feeds 3D-derived skeletons into a dual ControlNet (pose + depth) conditioned on FLUX-1 Dev, synthesizing 12,000 photorealistic images across 10 posture categories with low annotation drift; (3) ViTPose-based confidence filtering and targeted augmentation remove generation failures and improve robustness; (4) RTMPose-M (13.6M params) is fine-tuned on the synthetic data and paired with geometric feature engineering and a lightweight MLP for posture classification, then quantized to INT8 for real-time edge deployment. On a real-child test set (n~300), the FP16 model achieves 71.2 AP -- a +12.5 AP improvement over the COCO-pretrained adult-data baseline at identical model capacity. After INT8 quantization the model retains 70.4 AP while running at 22 FPS on a 0.8-TOPS Rockchip RK3568 NPU. In a single-subject controlled comparison with a commercial posture corrector, our system achieves substantially higher recognition rates across most tested categories and responds ~1.8x faster on average. These results demonstrate that carefully designed AIGC pipelines can substantially reduce dependence on real child imagery while achieving deployment-ready accuracy, with potential applications to other privacy-sensitive domains.
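
The headline numbers in the abstract imply a few derived figures that are not stated outright. The short Python calculation below works them out from the reported values only; the variable names are illustrative, and the per-frame latency assumes the 22 FPS figure is a steady-state throughput.

```python
# Values reported in the abstract
fp16_ap = 71.2          # AP of the FP16 model on the real-child test set
gain_over_base = 12.5   # AP improvement over the COCO-pretrained baseline
int8_ap = 70.4          # AP after INT8 quantization
fps = 22                # frames per second on the RK3568 NPU

# Derived figures (simple arithmetic, rounded to one decimal place)
baseline_ap = round(fp16_ap - gain_over_base, 1)    # implied baseline AP
quant_drop = round(fp16_ap - int8_ap, 1)            # AP lost to quantization
retained_pct = round(100 * int8_ap / fp16_ap, 1)    # share of FP16 AP retained
frame_ms = round(1000 / fps, 1)                     # average time per frame

print(baseline_ap, quant_drop, retained_pct, frame_ms)
```

Under these assumptions the baseline sits at 58.7 AP, quantization costs 0.8 AP (about 98.9% of the FP16 accuracy is retained), and each frame takes roughly 45.5 ms.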

2603.02597 2026-03-04 cs.CL cs.AI cs.DC cs.LG

GPUTOK: GPU Accelerated Byte Level BPE Tokenization

Venu Gopal Kadamba, Kanishkha Jaisankar

2603.02596 2026-03-04 cs.RO

Tensegrity Robot Endcap-Ground Contact Estimation with Symmetry-aware Heterogeneous Graph Neural Network

Wenzhe Tong, Yicheng Jiang, Chi Zhang, Maani Ghaffari, Xiaonan Huang

Comments Preprint; 7 pages, 5 figures, 3 tables