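arXivDaily: a daily academic digest synchronized with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.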

2603.19235 2026-03-20 cs.CV cs.RO

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai

Comments 31 pages, 12 figures

Abstract

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
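
The abstract names a token-level adaptive gated fusion mechanism but does not specify its form. The following is a minimal sketch of one common variant, assuming a scalar sigmoid gate per token computed from the concatenated semantic and geometric features, followed by a convex combination of the two. The function name `gated_fusion`, the parameters `gate_weights` and `gate_bias`, and the gate form itself are hypothetical illustrations and are not taken from the VEGA-3D code.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(semantic, geometric, gate_weights, gate_bias=0.0):
    """Fuse two per-token feature sequences with one scalar gate per token.

    semantic, geometric: lists of tokens, each token a list of D floats.
    gate_weights: list of 2*D floats (hypothetical learned parameters).
    gate_bias: scalar bias for the gate (hypothetical).
    """
    fused = []
    for sem_tok, geo_tok in zip(semantic, geometric):
        # Gate logit from the concatenated [semantic; geometric] token.
        concat = sem_tok + geo_tok
        logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
        g = sigmoid(logit)
        # Assumed fusion rule: convex combination weighted by the gate.
        fused.append([g * v + (1.0 - g) * s for s, v in zip(sem_tok, geo_tok)])
    return fused


if __name__ == "__main__":
    sem = [[1.0, 0.0], [0.0, 2.0]]
    geo = [[3.0, 4.0], [2.0, 0.0]]
    # Zero weights and bias give a gate of 0.5, i.e. a plain average.
    print(gated_fusion(sem, geo, [0.0] * 4, 0.0))
```

With all gate parameters at zero the gate is 0.5 for every token, so the output is the element-wise average of the two inputs. A large positive bias pushes the result toward the geometric features and a large negative bias toward the semantic ones.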

2603.19233 2026-03-20 cs.RO

Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models

Bryce Grant, Xijia Zhao, Peng Wang

Comments Accepted to Multimodal Intelligence Workshop @ ICLR

2603.19232 2026-03-20 cs.CV

Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu

Comments Accepted by CVPR 2026 main track; Code: https://github.com/YuqingWang1029/CubiD

2603.19231 2026-03-20 cs.CV

MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu

Comments Project page: https://lihaitian.com/MonoArt

2603.19229 2026-03-20 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY

NavTrust: Benchmarking Trustworthiness for Embodied Navigation

Huaide Jiang, Yash Chaudhary, Yuping Wang, Zehao Wang, Raghav Sharma, Manan Mehta, Yang Zhou, Lichao Sun, Zhiwen Fan, Zhengzhong Tu, Jiachen Li

Comments Project Website: https://navtrust.github.io

2603.19228 2026-03-20 cs.CV

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang

Comments 24 pages, 12 figures

2603.19227 2026-03-20 cs.CV

Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Chenyang Gu, Mingyuan Zhang, Haozhe Xie, Zhongang Cai, Lei Yang, Ziwei Liu

Comments Project Page: https://rheallyc.github.io/projects/motok GitHub: https://github.com/rheallyc/MoTok

2603.19226 2026-03-20 cs.CV

Under One Sun: Multi-Object Generative Perception of Materials and Illumination

Nobuo Yoshii, Xinran Nicole Han, Ryo Kawahara, Todd Zickler, Ko Nishino

2603.19224 2026-03-20 cs.CV

EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

Yang Fu, Yike Zheng, Ziyun Dai, Henghui Ding

Comments CVPR 2026, Project Page: https://henghuiding.com/EffectErase/

2603.19223 2026-03-20 cs.CL cs.AI

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

2603.19221 2026-03-20 cs.LG cs.CL cs.GT

Online Learning and Equilibrium Computation with Ranking Feedback

Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman Ozdaglar, Kaiqing Zhang

2603.19219 2026-03-20 cs.CV cs.LG

DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Dong Zhuo, Wenzhao Zheng, Sicheng Zuo, Siming Yan, Lu Hou, Jie Zhou, Jiwen Lu

Comments Project Page: https://paryi555.github.io/DriveTok/ Code: https://github.com/paryi555/DriveTok

2603.19218 2026-03-20 cs.CV

Rethinking Vector Field Learning for Generative Segmentation

Chaoyang Wang, Yaobo Liang, Boci Peng, Fan Duan, Jingdong Wang, Yunhai Tong

2603.19217 2026-03-20 cs.CV

LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang

Comments Project page: https://kd-tao.github.io/LVOmniBench/

2603.19216 2026-03-20 cs.CV cs.AI cs.LG

DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Tianjiao Yu, Xinzhuo Li, Muntasir Wahed, Jerry Xiong, Yifan Shen, Ying Shen, Ismini Lourentzou

2603.19209 2026-03-20 cs.CV cs.LG

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Shang-Jui Ray Kuo, Paola Cascante-Bonilla

Comments Project page: https://lab-spell.github.io/vlm-ssm-vision-encoders/ ; Code: https://github.com/raykuo18/vlm-ssm-vision-encoders

2603.19206 2026-03-20 cs.CV

RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

Yue Gong, Hongyu Li, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Manyuan Zhang, Dawei Leng, Yuhui Yin, Lijun Zhang

2603.19204 2026-03-20 cs.LG

Robustness, Cost, and Attack-Surface Concentration in Phishing Detection

Julian Allagan, Mohamed Elbakary, Zohreh Safari, Weizheng Gao, Gabrielle Morgan, Essence Morgan, Vladimir Deriglazov

Comments 14 pages, 4 figures, 9 tables

2603.19193 2026-03-20 cs.CV

Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting

Yiren Lu, Xin Ye, Burhaneddin Yaman, Jingru Luo, Zhexiao Xiong, Liu Ren, Yu Yin

Comments Project page at https://vulab-ai.github.io/Splat2BEV/

2603.19191 2026-03-20 cs.AI

OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, Zhaoyang Liu, Zhoumianze Liu, Kaiming Jin, Jianze Liang, Zonglin Li, Feng Wu, Bowen Zhou, Zun Wang, Zichen Ding

2603.19182 2026-03-20 cs.AI cs.CL

Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

Zou Qiang

Comments 10 pages, 5 tables, 0 figures. Conceptual architecture with preliminary simulation-based validation

2603.19176 2026-03-20 cs.SD cs.CV eess.AS

Few-shot Acoustic Synthesis with Multimodal Flow Matching

Amandine Brunetto

Comments To appear at CVPR 2026. 23 pages, 16 figures. Project Page: https://amandinebtto.github.io/FLAC/

2603.19173 2026-03-20 cs.LG cs.AI

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Edward Lin, Sahil Modi, Siva Kumar Sastry Hari, Qijing Huang, Zhifan Ye, Nestor Qin, Fengzhe Zhou, Yuan Zhang, Jingquan Wang, Sana Damani, Dheeraj Peri, Ouye Xie, Aditya Kane, Moshe Maor, Michael Behar, Triston Cao, Rishabh Mehta, Vartika Singh, Vikram Sharma Mailthody, Terry Chen, Zihao Ye, Hanfeng Chen, Tianqi Chen, Vinod Grover, Wei Chen, Wei Liu, Eric Chung, Luis Ceze, Roger Bringmann, Cyril Zeller, Michael Lightstone, Christos Kozyrakis, Humphrey Shi

2603.19172 2026-03-20 cs.LG

DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge

Yuegui Huang, Zhiyuan Fang, Weiqi Luo, Ruoyu Wu, Wuhui Chen, Zibin Zheng

2603.19170 2026-03-20 cs.RO math.OC

ADMM-Based Distributed MPC with Control Barrier Functions for Safe Multi-Robot Quadrupedal Locomotion

Yicheng Zeng, Ruturaj S. Sambhus, Basit Muhammad Imran, Jeeseop Kim, Vittorio Pastore, Kaveh Akbari Hamed

2603.19169 2026-03-20 cs.CV cs.AI

ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

Zhan Jin, Yu Luo, Yizhou Zhang, Ziyang Cui, Yuqing Wei, Xianchao Liu, Xueying Zeng, Qing Zhang

Comments 28 pages, 5 figures. arXiv:submit/7385738 [cs.AI]

2603.19166 2026-03-20 cs.RO cs.AI cs.CL cs.CV cs.LG

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan

Comments Equal contribution: Swagat Padhan and Lakshya Jain, 9 pages, 6 figures, paper website: https://lakshya-asu.github.io/Meanings-Measurements-Multi-Agent-Probabilistic-Grounding/

2603.19165 2026-03-20 cs.LG math.AP math.FA

Rigorous Error Certification for Neural PDE Solvers: From Empirical Residuals to Solution Guarantees

Amartya Mukherjee, Maxwell Fitzsimmons, David C. Del Rey Fernández, Jun Liu

Comments 35 pages

2603.19163 2026-03-20 cs.AI cs.DC

cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization

Yuyang Liu

Comments 28 pages, 9 figures. Code available at https://github.com/L-yang-yang/cugenopt

2603.19158 2026-03-20 cs.CV

Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation

Kwanyoung Lee, SeungJu Cha, Yebin Ahn, Hyunwoo Oh, Sungho Koh, Dong-Jin Kim

Comments Accepted in CVPR 2026 (main track). 10 pages, 6 figures; supplementary material included (14 pages, 11 figures)