arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2603.12267 2026-03-13 cs.CV

EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng, Xihui Liu

Comments Accepted by CVPR 2026. Project page: https://silentview.github.io/EVATok/

详情

英文摘要

Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce $\textbf{EVATok}$, a framework to produce $\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.

URL PDF HTML ☆

赞 0 踩 0

2603.12266 2026-03-13 cs.CV

MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin

Comments Project Page: https://accio-lab.github.io/MM-CondChain

2603.12265 2026-03-13 cs.CV

OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie

Comments Technical Report. Project Page: https://go2heart.github.io/omnistream/

2603.12264 2026-03-13 cs.CV

GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing

Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Qibing Ren, Zhihang Zhong, Xuanhe Zhou, Junchi Yan, Xue Yang

Comments 49 pages, 23 figures, 10 tables; Project Page: https://grade-bench.github.io/, Code: https://github.com/VisionXLab/GRADE, Dataset: https://huggingface.co/datasets/VisionXLab/GRADE

2603.12263 2026-03-13 cs.RO

$Ψ_0$: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation

Songlin Wei, Hongyi Jing, Boqian Li, Zhenyu Zhao, Jiageng Mao, Zhenhao Ni, Sicheng He, Jie Liu, Xiawei Liu, Kaidi Kang, Sheng Zang, Weiduo Yuan, Marco Pavone, Di Huang, Yue Wang

2603.12262 2026-03-13 cs.CV

Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai

2603.12257 2026-03-13 cs.CV

DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Ruihang Chu, Yingya Zhang, Yike Guo, Xihui Liu, Hongming Shan

Comments Project Page: https://dreamvideo-omni.github.io

2603.12255 2026-03-13 cs.CV cs.LG

Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li, Han Hu, Yongming Rao, Yueqi Duan

Comments Project Page: https://liuff19.github.io/Spatial-TTT

2603.12254 2026-03-13 cs.CV

Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin

Comments CVPR 2026. Project page: https://autogaze.github.io/

2603.12250 2026-03-13 cs.CV

DVD: Deterministic Video Depth Estimation with Generative Priors

Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Jing He, Zixin Zhang, Haodong Li, Yihao Liang, Kanghao Chen, Bin Ren, Xu Zheng, Shuai Yang, Kun Zhou, Yinchuan Li, Nicu Sebe, Ying-Cong Chen

Comments Project: https://dvd-project.github.io/

2603.12247 2026-03-13 cs.CV

Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, Shengyuan Ding, Changyao Tian, Yuhang Zang, Junchi Yan, Xue Yang

2603.12246 2026-03-13 cs.AI cs.CL cs.LG

Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang, Song Jiang, Bo Liu, Arman Cohan, Yuandong Tian, Zhengxing Chen

2603.12245 2026-03-13 cs.CV

One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin

Comments Project page: https://snap-research.github.io/elit/

2603.12240 2026-03-13 cs.CV cs.LG

BiGain: Unified Token Compression for Joint Generation and Classification

Jiacheng Liu, Shengkun Tang, Jiacheng Cui, Dongkuan Xu, Zhiqiang Shen

Comments CVPR 2026. Code: https://github.com/Greenoso/BiGain

详情

英文摘要

Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.

URL PDF HTML ☆

赞 0 踩 0

2603.12238 2026-03-13 cs.CV

SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng

Comments Code: https://github.com/ROUJINN/SceneAssistant

2603.12237 2026-03-13 cs.LG cs.CR cs.IT math.IT

STAMP: Selective Task-Aware Mechanism for Text Privacy

Fengwei Tian, Payel Bhattacharjee, Heidi Hanson, Geoffrey D. Rubin, Joseph Y. Lo, Ravi Tandon

Comments EACL 2026

2603.12228 2026-03-13 cs.LG cs.AI

Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

Yulu Gan, Phillip Isola

Comments codes are provided at https://github.com/sunrainyg/RandOpt

2603.12226 2026-03-13 cs.CL cs.AI

Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration

Priyanka Kargupta, Shuhaib Mehri, Dilek Hakkani-Tur, Jiawei Han

Comments Code and dataset provided at https://github.com/pkargupta/idea_catalyst

详情

英文摘要

Despite interdisciplinary research leading to larger and longer-term impact, most work remains confined to single-domain academic silos. Recent AI-based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning processes that drive creative interdisciplinary breakthroughs. As a result, prior efforts largely prioritize automating scientific discovery rather than augmenting the reasoning processes that underlie scientific disruption. We present Idea-Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning in both humans and large language models. Starting from an abstract research goal, Idea-Catalyst is designed to assist the brainstorming stage, explicitly avoiding premature anchoring on specific solutions. The framework embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals, (b) awareness of a domain's opportunities and unresolved challenges, and (c) strategic exploration of interdisciplinary ideas based on impact potential. Concretely, Idea-Catalyst decomposes an abstract goal (e.g., improving human-AI collaboration) into core target-domain research questions that guide the analysis of progress and open challenges within that domain. These challenges are reformulated as domain-agnostic conceptual problems, enabling retrieval from external disciplines (e.g., Psychology, Sociology) that address analogous issues. By synthesizing and recontextualizing insights from these domains back into the target domain, Idea-Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16%, while remaining grounded in the original research problem.

URL PDF HTML ☆

赞 0 踩 0

2603.12224 2026-03-13 cs.AI

Portfolio of Solving Strategies in CEGAR-based Object Packing and Scheduling for Sequential 3D Printing

Pavel Surynek

Comments arXiv admin note: substantial text overlap with arXiv:2503.05071

2603.12222 2026-03-13 cs.CV cs.LG

HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

Andy Li, Aiden Durrant, Milan Markovic, Georgios Leontidis

Comments 14 pages, 9 figures, 3 Tables

2603.12217 2026-03-13 cs.CV

Real-World Point Tracking with Verifier-Guided Pseudo-Labeling

Görkay Aydemir, Fatma Güney, Weidi Xie

Comments CVPR 2026

2603.12215 2026-03-13 cs.CV cs.AI

RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images

Bin Wan, Runmin Cong, Xiaofei Zhou, Hao Fang, Yaoqi Sun, Sam Kwong

2603.12208 2026-03-13 cs.CV

ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models

Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao

2603.12201 2026-03-13 cs.CL cs.LG

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li

2603.12193 2026-03-13 cs.RO cs.CV

SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics

Mengzhen Liu, Enshen Zhou, Cheng Chi, Yi Han, Shanyu Rong, Liming Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

Comments Accepted to CVPR 2026. See project page at https://lmzpai.github.io/SaPaVe

2603.12191 2026-03-13 cs.CL

Long-Context Encoder Models for Polish Language Understanding

Sławomir Dadas, Rafał Poświata, Marek Kozłowski, Małgorzata Grębowiec, Michał Perełkiewicz, Paweł Klimiuk, Przemysław Boruta

2603.12188 2026-03-13 cs.AI

Compiling Temporal Numeric Planning into Discrete PDDL+: Extended Version

Andrea Micheli, Enrico Scala, Alessandro Valentini

Comments This paper is an extended version of the homonymous appearing in the ICAPS 2026 proceedings. This version provides the proofs and addidional explanations of the compilation

2603.12176 2026-03-13 cs.CV cs.AI

BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning

Jingyang Ke, Weihan Li, Amartya Pradhan, Jeffrey Markowitz, Anqi Wu

2603.12166 2026-03-13 cs.CV

LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning

Haiying Xu, Zihan Wang, Song Dai, Zhengxuan Zhang, Kairan Dou, Xuming Hu

2603.12163 2026-03-13 cs.LG cs.AI math.ST stat.ML stat.TH

A Quantitative Characterization of Forgetting in Post-Training

Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan