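arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.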

2603.18004 2026-03-19 cs.CV cs.AI cs.LG

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee

Abstract

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
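
The pruning step in this abstract can be illustrated with a minimal sketch. It assumes each vision token already has a scalar importance score (in STTS these come from a learned module; the numbers here are made up) and keeps the top-scoring half in original order. The function name `prune_tokens` and the `keep_ratio` parameter are hypothetical, and plain top-k selection stands in for the paper's scorer and packing algorithm, which the abstract does not specify.

```python
from typing import List, Sequence


def prune_tokens(scores: Sequence[float], keep_ratio: float = 0.5) -> List[int]:
    """Return indices of the tokens to keep, in their original order.

    Assumption: a higher score means a more important token. This is plain
    top-k selection, not the paper's learned scorer or packing algorithm.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(scores) * keep_ratio))
    # Rank token indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence.
    return sorted(ranked[:n_keep])


# Hypothetical scores for 8 vision tokens.
example_scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.7, 0.2, 0.3]
kept = prune_tokens(example_scores)
print(kept)
```

Returning sorted indices rather than the ranked order keeps the surviving tokens in their spatio-temporal sequence, which matters if positional information is applied afterwards.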

2603.18002 2026-03-19 cs.CV cs.AI cs.CL

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys

Comments: Project Page: https://kevinqu7.github.io/loc3r-vlm

2603.18001 2026-03-19 cs.CV

EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding

Kai Zou, Hongbo Liu, Dian Zheng, Jianxiong Gao, Zhiwei Zhao, Bin Liu

Comments: 9 pages, Accepted at the 40th AAAI Conference on Artificial Intelligence (AAAI 2026)

DOI: 10.1609/aaai.v40i16.38418
Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 2026

Abstract

In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that image grounding possesses strong text and layout understanding abilities, which can compensate for the corresponding limitations in layout-to-image generation. At the same time, images generated from layouts exhibit high diversity in content, thereby enhancing the robustness of image grounding. Jointly training both tasks within a unified model can promote performance improvements for each. However, we identify that this joint training paradigm encounters several optimization challenges and results in restricted performance. To address these issues, we propose progressive training strategies. First, the Parallel Multi-Task Pre-training (PMTP) stage equips the model with basic abilities for both tasks, leveraging shared tokens to accelerate training. Next, the Dual Joint Optimization (DJO) stage exploits task duality to sequentially integrate the two tasks, enabling unified optimization. Finally, the Cycle RL stage eliminates reliance on visual supervision by using consistency constraints as rewards, significantly enhancing the model's unified capabilities via the GRPO strategy. Extensive experiments demonstrate state-of-the-art results on both layout-to-image generation and image grounding benchmarks, and reveal clear synergistic gains from optimizing the two tasks together.
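
As a rough illustration of the Cycle RL idea, the sketch below scores each sampled generation by how well a box recovered from the generated image matches the input layout box, then normalizes rewards within the sampled group, which is the core step of GRPO. Using IoU as the consistency reward is an assumption made for this example; the abstract does not state the exact reward. The names `iou` and `group_relative_advantages` are hypothetical, and the boxes are made-up values.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group (mean 0, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations vary on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical cycle: layout box -> generated image -> grounded box.
layout_box: Box = (0.0, 0.0, 2.0, 2.0)
grounded_boxes: List[Box] = [
    (0.0, 0.0, 2.0, 2.0),  # perfect round trip
    (1.0, 0.0, 3.0, 2.0),  # partially shifted
    (2.0, 2.0, 4.0, 4.0),  # no overlap
]
rewards = [iou(layout_box, box) for box in grounded_boxes]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because the reward compares the layout with itself after a round trip, no ground-truth image is needed, which is consistent with the abstract's claim that this stage removes the reliance on visual supervision.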

2603.18000 2026-03-19 cs.AI

AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

Zhang Zhang, Shuqi Lu, Hongjin Qian, Di He, Zheng Liu

2603.17998 2026-03-19 cs.CV

The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

Yigit Ekin, Yossi Gandelsman

Comments: Project Page: https://yigitekin.github.io/diffusion-sliders

2603.17995 2026-03-19 cs.CV cs.GR cs.LG

LoST: Level of Semantics Tokenization for 3D Shapes

Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero, Chun-Hao Paul Huang, Duygu Ceylan, Niloy J. Mitra, Xuelin Chen

Comments: CVPR 2026; Project website: https://lost3d.github.io

2603.17993 2026-03-19 cs.CV cs.RO

GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

Huajian Zeng, Abhishek Saroha, Daniel Cremers, Xi Wang

Comments: Accepted by 3DV 2026. Project Page: https://huajian-zeng.github.io/projects/gmt/

2603.17990 2026-03-19 cs.RO

A Single-Fiber Optical Frequency Domain Reflectometry (OFDR)-Based Shape Sensing of Concentric Tube Steerable Drilling Robots

Yash Kulkarni, Mobina Tavangarifard, Daniyal Maroufi, Mohsen Khadem, Justin E. Bird, Jeffrey H. Siewerdsen, Farshid Alambeigi

Comments: 8 pages, 7 figures

2603.17989 2026-03-19 cs.CV

Versatile Editing of Video Content, Actions, and Dynamics without Training

Vladimir Kulikov, Roni Paiss, Andrey Voynov, Inbar Mosseri, Tali Dekel, Tomer Michaeli

Comments: Project page at https://dynaedit.github.io/

2603.17975 2026-03-19 cs.CV

AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors

Aymen Mir, Riza Alp Guler, Xiangjun Tang, Peter Wonka, Gerard Pons-Moll

Comments: Our project page is available at https://miraymen.github.io/ahoy/

2603.17970 2026-03-19 cs.LG cs.NA math.NA math.OC

Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training

Ben S. Southworth, Stephen Thomas

2603.17969 2026-03-19 cs.RO cs.AI

Specification-Aware Distribution Shaping for Robotics Foundation Models

Sadık Bera Yüksel, Derya Aksaray

Comments: 8 pages, 3 figures

2603.17968 2026-03-19 cs.CV

Robust-ComBat: Mitigating Outlier Effects in Diffusion MRI Data Harmonization

Yoan David, Pierre-Marc Jodoin, Alzheimer's Disease Neuroimaging Initiative, The TRACK-TBI Investigators

Comments: 20 pages, 8 figures

2603.17965 2026-03-19 cs.CV

LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition

Vlad-Constantin Lungu-Stan, Ionut Mironica, Mariana-Iuliana Georgescu

Comments: 18 pages (main + supp)

2603.17962 2026-03-19 cs.CL

ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation

Argentina Anna Rescigno, Eva Vanmassenhove, Johanna Monti

2603.17952 2026-03-19 cs.CL

Gender Disambiguation in Machine Translation: Diagnostic Evaluation in Decoder-Only Architectures

Chiara Manna, Hosein Mohebbi, Afra Alishahi, Frédéric Blain, Eva Vanmassenhove

2603.17948 2026-03-19 cs.CV cs.AI

VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi, Lama Ayash, Tanveer Hussain, Naeemullah Khan

Abstract

Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment to represent video as a hierarchical grid that is simultaneously lossless, navigable, scalable, caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent's memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures access depth grows only logarithmically with video length. For long-context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to visual domain requires a structured environment to recurse into, which VideoAtlas provides. VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture where a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid's structural reuse. (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter. (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.
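
The logarithmic access depth follows from the hierarchy alone: if each grid shows a fixed number of cells and every cell can be zoomed into another grid, the number of zoom steps needed to isolate a single frame grows with the logarithm of the frame count. The sketch below makes that concrete. The 64-cell grid and the 1 frame-per-second sampling rate are assumptions chosen for illustration; the abstract does not state the actual grid size. The function name `zoom_depth` is hypothetical.

```python
def zoom_depth(num_frames: int, cells_per_grid: int) -> int:
    """Number of zoom levels needed until each grid cell holds a single frame."""
    if num_frames < 1 or cells_per_grid < 2:
        raise ValueError("need num_frames >= 1 and cells_per_grid >= 2")
    depth, capacity = 0, 1
    # Each extra level multiplies the number of addressable frames.
    while capacity < num_frames:
        capacity *= cells_per_grid
        depth += 1
    return depth


# Assumed setup: 1 frame per second, 64 cells per grid (e.g. 8 x 8).
one_hour = zoom_depth(1 * 3600, 64)
ten_hours = zoom_depth(10 * 3600, 64)
print(one_hour, ten_hours)
```

Under these assumptions a tenfold increase in duration adds only one zoom level. Integer multiplication is used instead of a floating-point logarithm to avoid rounding errors at exact powers of the grid size.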

2603.17947 2026-03-19 cs.LG q-bio.NC

Unified Policy Value Decomposition for Rapid Adaptation

Cristiano Capone, Luca Falorsi, Andrea Ciardiello, Luca Manneschi

Abstract

Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector - a goal embedding - that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q = sum_k G_k(g) y_k(s,a), where G_k(g) is a goal-conditioned coefficient vector and y_k(s,a) are learned value basis functions. This multiplicative gating - where a context signal scales a set of state-dependent bases - is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G_k(g) is estimated zero-shot via a single forward pass, enabling immediate adaptation to novel tasks without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.
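
The critic factorization Q = sum_k G_k(g) y_k(s,a) can be written out as a short sketch. Here the basis values y_k(s,a) are treated as already evaluated for one state-action pair and held fixed, while only the goal coefficients G_k(g) change between tasks. All numbers, the function name `bilinear_q`, and the two example goals are hypothetical; in the paper both the bases and the coefficients are learned networks.

```python
from typing import Sequence


def bilinear_q(coeffs: Sequence[float], basis_values: Sequence[float]) -> float:
    """Q = sum_k G_k(g) * y_k(s, a) for one state-action pair and one goal."""
    if len(coeffs) != len(basis_values):
        raise ValueError("coeffs and basis_values must have the same length")
    return sum(g_k * y_k for g_k, y_k in zip(coeffs, basis_values))


# Frozen value bases y_k(s, a), evaluated at a single (s, a); made-up values.
y = [1.0, -2.0, 0.5]

# Two hypothetical goals differ only in their coefficient vectors G(g).
g_goal_a = [1.0, 0.0, 0.0]
g_goal_b = [0.5, 0.5, 0.0]

q_a = bilinear_q(g_goal_a, y)
q_b = bilinear_q(g_goal_b, y)
print(q_a, q_b)
```

Switching goals re-weights the same bases without touching them, which is why adaptation at test time needs only a new coefficient vector and no gradient update on the bases.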

2603.17946 2026-03-19 cs.LG cs.AI

CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song

Comments: Accepted at ICLR 2026. Conference paper. 10 pages main text; 34 pages total including references and appendix. 11 figures and 20 tables in total

2603.17930 2026-03-19 cs.CV

Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning

Jingchun Yang, Jinchang Zhang

2603.17926 2026-03-19 cs.CV

A practical artificial intelligence framework for legal age estimation using clavicle computed tomography scans

Javier Venema, Stefano De Luca, Pablo Mesejo, Óscar Ibáñez

Comments: 15 pages, 8 figures, submitted to Engineering Applications of Artificial Intelligence

Abstract

Legal age estimation plays a critical role in forensic and medico-legal contexts, where decisions must be supported by accurate, robust, and reproducible methods with explicit uncertainty quantification. While prior artificial intelligence (AI)-based approaches have primarily focused on hand radiographs or dental imaging, clavicle computed tomography (CT) scans remain underexplored despite their documented effectiveness for legal age estimation. In this work, we present an interpretable, multi-stage pipeline for legal age estimation from clavicle CT scans. The proposed framework combines (i) a feature-based connected-component method for automatic clavicle detection that requires minimal manual annotation, (ii) an Integrated Gradients-guided slice selection strategy used to construct the input data for a multi-slice convolutional neural network that estimates legal age, and (iii) conformal prediction intervals to support uncertainty-aware decisions in accordance with established international protocols. The pipeline is evaluated on 1,158 full-body post-mortem CT scans from a public forensic dataset (the New Mexico Decedent Image Database). The final model achieves state-of-the-art performance with a mean absolute error (MAE) of 1.55 ± 0.16 years on a held-out test set, outperforming both human experts (MAE of approximately 1.90 years) and previous methods (MAEs above 1.75 years in our same dataset). Furthermore, conformal prediction enables configurable coverage levels aligned with forensic requirements. Attribution maps indicate that the model focuses on anatomically relevant regions of the medial clavicular epiphysis. The proposed method, which is currently being added as part of the Skeleton-ID software (https://skeleton-id.com/skeleton-id/), is intended as a decision-support component within multi-factorial forensic workflows.
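
The "configurable coverage levels" mentioned in this abstract can be illustrated with split conformal prediction, the simplest standard variant. The abstract does not say which conformal method the authors use, so this is a sketch under that assumption. The calibration residuals (absolute age errors in years), the point prediction, and the function names are all hypothetical.

```python
import math
from typing import List, Tuple


def conformal_halfwidth(cal_residuals: List[float], alpha: float) -> float:
    """Split-conformal quantile of absolute calibration residuals.

    Targets coverage of at least 1 - alpha, assuming exchangeable data.
    """
    n = len(cal_residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(cal_residuals)[rank - 1]


def prediction_interval(point_pred: float, halfwidth: float) -> Tuple[float, float]:
    """Symmetric interval around the model's point prediction."""
    return (point_pred - halfwidth, point_pred + halfwidth)


# Made-up absolute errors |true age - predicted age| on a calibration set.
residuals = [0.4, 0.9, 1.2, 1.5, 1.9, 2.4, 3.1]

hw = conformal_halfwidth(residuals, alpha=0.25)
low, high = prediction_interval(18.0, hw)
print(hw, low, high)
```

Lowering `alpha` raises the requested coverage and widens the interval; with too few calibration samples the interval becomes unbounded, which is the method's way of signalling that the requested guarantee cannot be supported by the data.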

2603.17920 2026-03-19 cs.CV

SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale

Markus Gross, Sai Bharadhwaj Matha, Rui Song, Viswanathan Muthuveerappan, Conrad Christoph, Julius Huber, Daniel Cremers

Abstract

Semantic segmentation for uncrewed aerial vehicles (UAVs) is fundamental for aerial scene understanding, yet existing RGB and RGB-T datasets remain limited in scale, diversity, and annotation efficiency due to the high cost of manual labeling and the difficulties of accurate RGB-T alignment on off-the-shelf UAVs. To address these challenges, we propose a scalable geometry-driven 2D-3D-2D paradigm that leverages multi-view redundancy in high-overlap aerial imagery to automatically propagate labels from a small subset of manually annotated RGB images to both RGB and thermal modalities within a unified framework. By lifting less than 3% of RGB images into a semantic 3D point cloud and reprojecting it into all views, our approach enables dense pseudo ground-truth generation across large image collections, automatically producing 97% of RGB labels and 100% of thermal labels while achieving 91% and 88% annotation accuracy without any 2D manual refinement. We further extend this 2D-3D-2D paradigm to cross-modal image registration, using 3D geometry as an intermediate alignment space to obtain fully automatic, strong pixel-level RGB-T alignment with 87% registration accuracy and no hardware-level synchronization. Applying our framework to existing geo-referenced aerial imagery, we construct SegFly, a large-scale benchmark with over 20,000 high-resolution RGB images and more than 15,000 geometrically aligned RGB-T pairs spanning diverse urban, industrial, and rural environments across multiple altitudes and seasons. On SegFly, we establish the Firefly baseline for RGB and thermal semantic segmentation and show that both conventional architectures and vision foundation models benefit substantially from SegFly supervision, highlighting the potential of geometry-driven 2D-3D-2D pipelines for scalable multi-modal scene understanding. Data and Code available at https://github.com/markus-42/SegFly.

2603.17917 2026-03-19 cs.LG cs.CL

Only relative ranks matter in weight-clustered large language models

Borja Aizpurua, Sukhbinder Singh, Román Orús

Comments: 10 pages, 3 figures, 9 tables

2603.17914 2026-03-19 cs.CV

Noise-Aware Misclassification Attack Detection in Collaborative DNN Inference

Shima Yousefi, Saptarshi Debroy

Comments: This work has been accepted for publication in IEEE/ACM CCGrid 2026

2603.17912 2026-03-19 cs.CL stat.ML

Pretrained Multilingual Transformers Reveal Quantitative Distance Between Human Languages

Yue Zhao, Jiatao Gu, Paloma Jeretič, Weijie Su

2603.17910 2026-03-19 cs.CV

SpiderCam: Low-Power Snapshot Depth from Differential Defocus

Marcos A. Ferreira, Tianao Li, John Mamish, Josiah Hester, Yaman Sangar, Qi Guo, Emma Alexander

Comments: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2603.17895 2026-03-19 cs.CV

A Creative Agent is Worth a 64-Token Template

Ruixiao Shi, Fu Feng, Yucheng Xie, Xu Yang, Jing Wang, Xin Geng

2603.17891 2026-03-19 cs.LG cs.AI

RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

Arpit Singh Gautam, Saurabh Jha

2603.17884 2026-03-19 cs.CL

DebugLM: Learning Traceable Training Data Provenance for LLMs

Wenjie Jacky Mo, Qin Liu, Xiaofei Wen, Wenxuan Zhou, Zhe Zhao, Muhao Chen

2603.17876 2026-03-19 cs.CV

Edit Spillover as a Probe: Do Image Editing Models Implicitly Understand World Relations?

Guandong Li, Zhaobin Chu