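arXivDaily daily academic digest: synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.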

2603.06048 2026-03-09 cs.CV

GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection

Xuan Huang, Mochu Xiang, Zhelun Shen, Jinbo Wu, Chenming Wu, Chen Zhao, Kaisiyuan Wang, Hang Zhou, Shanshan Liu, Haocheng Feng, Wei He, Jingdong Wang

English abstract

Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: https://xuanhuang0.github.io/GenHOI/
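
To make the idea of head-specific temporal offsets concrete, here is a minimal sketch in plain Python. It is not the GenHOI implementation: the abstract does not say how the offsets are chosen, so evenly spaced offsets are assumed purely for illustration, and the names head_offsets, rope_rotate, and worst_case_distance are hypothetical. The rotation itself is standard 1D RoPE, where the query-key dot product depends only on the relative position.

```python
import math

def head_offsets(num_heads, num_frames):
    """ASSUMPTION: one temporal position per head, evenly spaced over the clip."""
    if num_heads == 1:
        return [0.0]
    return [h * (num_frames - 1) / (num_heads - 1) for h in range(num_heads)]

def rope_rotate(vec, position, base=10000.0):
    """Standard 1D RoPE: rotate each consecutive pair of dims by position * frequency."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def worst_case_distance(ref_positions, num_frames):
    """Largest gap between any frame and its temporally nearest reference position."""
    return max(min(abs(t - p) for p in ref_positions) for t in range(num_frames))

if __name__ == "__main__":
    frames, heads = 10, 4
    shared = [0.0]                        # every head sees the reference at t = 0
    sliding = head_offsets(heads, frames)  # each head sees it at a different t
    print("shared position, worst distance:", worst_case_distance(shared, frames))
    print("per-head offsets, worst distance:", worst_case_distance(sliding, frames))
```

Under this assumption, every frame has at least one head for which the reference token sits at a small relative temporal distance, which is one way to read "distributing their influence evenly across frames".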

2603.06043 2026-03-09 cs.CV

Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

Jiadong Pan, Liang Li, Yuxin Peng, Yu-Ming Tang, Shuohuan Wang, Yu Sun, Hua Wu, Qingming Huang, Haifeng Wang

Comments Accepted by CVPR 2026

2603.06038 2026-03-09 cs.CV cs.GR

FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography

Xia Xin, Yuki Endo, Yoshihiro Kanamori

2603.06036 2026-03-09 cs.CV

Ensemble Learning with Sparse Hypercolumns

Julia Dietlmeier, Vayangi Ganepola, Oluwabukola G. Adegboro, Mayug Maniparambil, Claudia Mazo, Noel E. O'Connor

Comments presented at 33rd International Conference on Artificial Intelligence and Cognitive Science (AICS 2025)

2603.06032 2026-03-09 cs.CV

StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision

Yuanhuiyi Lyu, Kaiyu Lei, Ziqiao Weng, Xu Zheng, Lutao Jiang, Teng Li, Yangfu Li, Ziyuan Huang, Linfeng Zhang, Xuming Hu

2603.06028 2026-03-09 cs.LG

Improved high-dimensional estimation with Langevin dynamics and stochastic weight averaging

Stanley Wei, Alex Damian, Jason D. Lee

2603.06027 2026-03-09 cs.LG cs.DS stat.ML

Agnostic learning in (almost) optimal time via Gaussian surface area

Lucas Pesenti, Lucas Slot, Manuel Wiedmer

Comments 20 pages

2603.06024 2026-03-09 cs.CL cs.CV

ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

Xingjian Tao, Yiwei Wang, Yujun Cai, Yifan Song, Jing Tang

2603.06022 2026-03-09 cs.CV

MOSIV: Multi-Object System Identification from Videos

Chunjiang Liu, Xiaoyuan Wang, Qingran Lin, Albert Xiao, Haoyu Chen, Shizheng Wen, Hao Zhang, Lu Qi, Ming-Hsuan Yang, Laszlo A. Jeni, Min Xu, Yizhou Zhao

Comments ICLR 2026

2603.06014 2026-03-09 cs.CV

EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation

Shiyuan Yang, Ruihuang Li, Jiale Tao, Shuai Shao, Qinglin Lu, Jing Liao

Comments Project page: https://effectmaker.github.io

2603.06009 2026-03-09 cs.LG

Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments

Michael Beukman, Khimya Khetarpal, Zeyu Zheng, Will Dabney, Jakob Foerster, Michael Dennis, Clare Lyle

English abstract

Plateaus, where an agent's performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO due to its widespread adoption, we show that plateaus in certain regimes arise not because of known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss eventually become poor proxies for the true objective over the course of training. As a recap, PPO switches between sampling rollouts from several parallel environments online using the current policy (which we call the outer loop) and performing repeated minibatch SGD steps against this offline dataset (the inner loop). In our work we consider only the outer loop, and conceptually model it as stochastic optimization. The step size is then controlled by the regularization strength towards the previous policy and the gradient noise by the number of samples collected between policy update steps. This model predicts that performance will plateau at a suboptimal level if the outer step size is too large relative to the noise. Recasting PPO in this light makes it clear that there are two ways to address this particular type of learning stagnation: either reduce the step size or increase the number of samples collected between updates. We first validate the predictions of our model and investigate how hyperparameter choices influence the step size and update noise, concluding that increasing the number of parallel environments is a simple and robust way to reduce both factors. Next, we propose a recipe for how to co-scale the other hyperparameters when increasing parallelization, and show that incorrectly doing so can lead to severe performance degradation. Finally, we vastly outperform prior baselines in a complex open-ended domain by scaling PPO to more than 1M parallel environments, thereby enabling monotonic performance improvement up to one trillion transitions.
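
The claim that noisy updates plateau at a level set by step size and sample count can be illustrated with a toy model. The sketch below is not PPO and not the authors' model: it assumes plain stochastic gradient descent on the quadratic loss 0.5 * x^2, with Gaussian gradient noise whose standard deviation shrinks as 1/sqrt(num_samples). The function names and default values are hypothetical. For this toy problem the stationary loss works out to step_size * noise_std^2 / (2 * num_samples * (2 - step_size)), so it falls when the step size is reduced or the number of samples per update is increased.

```python
import random

def plateau_loss(step_size, num_samples, noise_std=1.0,
                 steps=20000, burn_in=2000, seed=0):
    """Average loss of noisy SGD on 0.5 * x**2 after an initial burn-in."""
    rng = random.Random(seed)
    x = 5.0
    total = 0.0
    for t in range(steps):
        # True gradient is x; averaging num_samples noisy samples
        # divides the noise standard deviation by sqrt(num_samples).
        grad = x + rng.gauss(0.0, noise_std / num_samples ** 0.5)
        x -= step_size * grad
        if t >= burn_in:
            total += 0.5 * x * x
    return total / (steps - burn_in)

def predicted_plateau(step_size, num_samples, noise_std=1.0):
    """Closed-form stationary loss for the toy problem above."""
    return step_size * noise_std ** 2 / (2 * num_samples * (2 - step_size))

if __name__ == "__main__":
    for step_size, num_samples in [(0.5, 1), (0.05, 1), (0.5, 100)]:
        print(step_size, num_samples,
              plateau_loss(step_size, num_samples),
              predicted_plateau(step_size, num_samples))
```

In the toy model both remedies named in the abstract lower the plateau: shrinking the step size tenfold or collecting 100 times more samples per update each reduce the stationary loss, though the second does so without slowing per-update progress.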

2603.06002 2026-03-09 cs.CV cs.AI

Demystifying KAN for Vision Tasks: The RepKAN Approach

Minjong Cheon

2603.06001 2026-03-09 cs.RO cs.AI cs.CV

Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

Ninghao Zhang, Bin Zhu, Shijie Zhou, Jingjing Chen

2603.05999 2026-03-09 cs.CV

RePer-360: Releasing Perspective Priors for 360$^\circ$ Depth Estimation via Self-Modulation

Cheng Guan, Chunyu Lin, Zhijie Shen, Junsong Zhang, Jiyuan Wang

2603.05997 2026-03-09 cs.CV cs.AI

MM-ISTS: Cooperating Irregularly Sampled Time Series Forecasting with Multimodal Vision-Text LLMs

Zhi Lei, Chenxi Liu, Hao Miao, Wanghui Qiu, Bin Yang, Chenjuan Guo

2603.05996 2026-03-09 cs.CL

Track-SQL: Enhancing Generative Language Models with Dual-Extractive Modules for Schema and Context Tracking in Multi-turn Text-to-SQL

Bingfeng Chen, Shaobin Shi, Yongqi Luo, Boyan Xu, Ruichu Cai, Zhifeng Hao

Comments Accepted at the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025), Long Paper, 19 pages

2603.05995 2026-03-09 cs.RO cs.AI cs.LG

TADPO: Reinforcement Learning Goes Off-road

Zhouchonghao Wu, Raymond Song, Vedant Mundheda, Luis E. Navarro-Serment, Christof Schoenborn, Jeff Schneider

Comments 8 pages, 5 figures, 2 tables. Accepted at ICRA 2026

2603.05993 2026-03-09 cs.RO

Moving Through Clutter: Scaling Data Collection and Benchmarking for 3D Scene-Aware Humanoid Locomotion via Virtual Reality

Beichen Wang, Yuanjie Lu, Linji Wang, Liuchuan Yu, Xuesu Xiao

2603.05992 2026-03-09 cs.RO

MagRobot: An Open Simulator for Magnetically Navigated Robots

Heng Wang, Haoyu Song, Jiatao Zheng, Yuxiang Han, Kunli Wang

Comments 20 pages, 10 figures

English abstract

Magnetic navigation systems, including magnetic tracking systems and magnetic actuation systems, have shown great potential for occlusion-free localization and remote control of intracorporeal medical devices and robots in minimally invasive medicine, such as capsule endoscopy and cardiovascular intervention. However, the design of magnetically navigated robots remains heavily reliant on experimental prototyping, which is time-consuming and costly. Furthermore, there is a lack of a consistent experimental environment to compare and benchmark the hardware and algorithms across different magnetic navigation systems. To address these challenges, we propose the first universal open-source simulation platform to facilitate research, design and benchmarking of magnetically navigated robots. Our simulator features an intuitive graphical user interface that enables the user to efficiently design, visualize, and analyze magnetic navigation systems for both rigid and soft robots. The proposed simulator is versatile, which can simulate both magnetic actuation and magnetic tracking tasks in diverse medical applications that involve deformable anatomies. The proposed simulator provides an open development environment, where the user can load third-party anatomical models and customize both hardware and algorithms of magnetic navigation systems. The fidelity of the simulator is validated using both phantom and ex vivo experiments of magnetic navigation of a continuum robot and a capsule robot with diverse magnetic actuation setups. Three use cases of the simulator, i.e., bronchoscopy, endovascular intervention, and gastrointestinal endoscopy, are implemented to demonstrate the functionality of the simulator. It is shown that the configuration and algorithms of magnetic navigation systems can be flexibly designed and optimized for better performance using the simulator.

2603.05987 2026-03-09 cs.CV cs.AI eess.IV

Technical Report: Automated Optical Inspection of Surgical Instruments

Zunaira Shafqat, Atif Aftab Ahmed Jilani, Qurrat Ul Ain

Comments 20 pages, 33 figures, 6 tables. Technical Report

2603.05982 2026-03-09 cs.RO cs.CV

HarvestFlex: Strawberry Harvesting via Vision-Language-Action Policy Adaptation in the Wild

Ziyang Zhao, Shuheng Wang, Zhonghua Miao, Ya Xiong

2603.05980 2026-03-09 cs.AI

An Interactive Multi-Agent System for Evaluation of New Product Concepts

Bin Xuan, Ruo Ai, Hakyeon Lee

Comments 46 pages, 3 figures + This paper proposes an LLM-based multi-agent system (MAS) for automated evaluation of new product concepts, incorporating retrieval-augmented generation (RAG) and cross-functional virtual agents to assess technical and market feasibility

2603.05976 2026-03-09 cs.RO

Proprioceptive Shape Estimation of Tensegrity Manipulators Using Energy Minimisation

Tufail Ahmad Bhat, Shuhei Ikemoto

Comments 8 pages, 10 figures, IEEE ICRA 2026

2603.05970 2026-03-09 cs.CV

Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions

Jingtao Ye, Kexin Zhang, Xunchi Ma, Yuehan Li, Guangming Zhu, Peiyi Shen, Linhua Jiang, Xiangdong Zhang, Liang Zhang

2603.05969 2026-03-09 cs.CV cs.AI cs.CL

Imagine How To Change: Explicit Procedure Modeling for Change Captioning

Jiayang Sun, Zixin Guo, Min Cao, Guibo Zhu, Jorma Laaksonen

Comments Accepted to ICLR 2026. Code and models are available at https://github.com/BlueberryOreo/ProCap

2603.05963 2026-03-09 cs.CV cs.AI

Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models

Siyuan Yang, Jun Liu, Hao Cheng, Chong Wang, Shijian Lu, Hedvig Kjellstrom, Weisi Lin, Alex C. Kot

Comments Submitted to IEEE TPAMI, under review

2603.05962 2026-03-09 cs.CV

Exploring Open-Vocabulary Object Recognition in Images using CLIP

Wei Yu Chen, Ying Dai

2603.05953 2026-03-09 cs.CL cs.AI cs.HC cs.LG

Who We Are, Where We Are: Mental Health at the Intersection of Person, Situation, and Large Language Models

Nikita Soni, August Håkan Nilsson, Syeda Mahwish, Vasudha Varadarajan, H. Andrew Schwartz, Ryan L. Boyd

2603.05952 2026-03-09 cs.CV

Unify the Views: View-Consistent Prototype Learning for Few-Shot Segmentation

Hongli Liu, Yu Wang, Shengjie Zhao

Comments Accepted by CVPR Findings 2026

2603.05950 2026-03-09 cs.CV cs.AI

Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models

Jialuo He, Huangxun Chen