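arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems science, and more.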

2603.21701 2026-03-24 cs.CV cs.AI

Rethinking Token Reduction for Large Vision-Language Models

Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang

Abstract

Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.
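
The unifying view in the abstract above, where pruning and merging are both instances of one compression mapping, can be illustrated with a small sketch. It assumes the mapping is an m x n matrix applied to n token vectors; the abstract does not give the exact form, so this is an assumption. The function names are hypothetical and the mappings here are hand-written, whereas MetaCompress is described as learning them.

```python
from typing import List

Matrix = List[List[float]]


def compress(mapping: Matrix, tokens: Matrix) -> Matrix:
    """Apply an (m x n) mapping to n token vectors, giving m token vectors."""
    dim = len(tokens[0])
    return [
        [sum(w * tok[k] for w, tok in zip(row, tokens)) for k in range(dim)]
        for row in mapping
    ]


def pruning_mapping(keep: List[int], n: int) -> Matrix:
    """Pruning as a mapping: each row is one-hot and selects one kept token."""
    return [[1.0 if j == i else 0.0 for j in range(n)] for i in keep]


def merging_mapping(groups: List[List[int]], n: int) -> Matrix:
    """Merging as a mapping: each row averages the tokens in one group."""
    return [[1.0 / len(g) if j in g else 0.0 for j in range(n)] for g in groups]


# Four toy "visual tokens" of dimension 2 (illustrative values only).
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

pruned = compress(pruning_mapping([0, 2], 4), tokens)           # keep tokens 0 and 2
merged = compress(merging_mapping([[0, 1], [2, 3]], 4), tokens)  # average token pairs
```

Both calls reduce four tokens to two through the same `compress` function; only the mapping differs. In this reading, a learning-based method would optimize the entries of the mapping rather than fix them by a heuristic.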

2603.21700 2026-03-24 cs.CV

PPGL-Swarm: Integrated Multimodal Risk Stratification and Hereditary Syndrome Detection in Pheochromocytoma and Paraganglioma

Zelin Liu, Xiangfu Yu, Jie Huang, Ge Wang, Yizhe Yuan, Zhenyu Yi, Jing Xie, Haotian Jiang, Lichi Zhang

2603.21698 2026-03-24 cs.AI

A Blueprint for Self-Evolving Coding Agents in Vehicle Aerodynamic Drag Prediction

Jinhui Ren, Huaiming Li, Yabin Liu, Tao Li, Zhaokun Liu, Yujia Liang, Zengle Ge, Chufan Wu, Xiaomin Yuan, Danyu Liu, Annan Li, Jianmin Wu

2603.21696 2026-03-24 cs.AI

MIND: Multi-agent inference for negotiation dialogue in travel planning

Hunmin Do, Taejun Yoon, Kiyong Jung

Comments: Accepted at ICLR 2026 Workshop (HCAIR)

2603.21695 2026-03-24 cs.CV cs.GR

RefracGS: Novel View Synthesis Through Refractive Water Surfaces with 3D Gaussian Ray Tracing

Yiming Shao, Qiyu Dai, Chong Gao, Guanbin Li, Yeqiang Wang, He Sun, Qiong Zeng, Baoquan Chen, Wenzheng Chen

2603.21693 2026-03-24 cs.AI

Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain

Mohammad Asadi, Tahoura Nedaee, Jack W. O'Sullivan, Euan Ashley, Ehsan Adeli

2603.21690 2026-03-24 cs.AI econ.GN q-fin.EC

AI Token Futures Market: Commoditization of Compute and Derivatives Contract Design

Yicai Xing

Comments: 16 pages, 7 figures, 3 tables

2603.21684 2026-03-24 cs.SD cs.LG

LipsAM: Lipschitz-Continuous Amplitude Modifier for Audio Signal Processing and its Application to Plug-and-Play Dereverberation

Kazuki Matsumoto, Ren Uchida, Kohei Yatabe

Comments: Accepted for IEEE ICASSP 2026

2603.21679 2026-03-24 cs.RO

BiPreManip: Learning Affordance-Based Bimanual Preparatory Manipulation through Anticipatory Collaboration

Yan Shen, Feng Jiang, Zichen He, Xiaoqi Li, Yuchen Liu, Zhiyu Li, Ruihai Wu, Hao Dong

Comments: Accepted to CVPR 2026

2603.21676 2026-03-24 cs.LG cs.AI cs.CL

Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization

Hung-Hsuan Chen

Abstract

Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space -- enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear computational frontier -- a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.
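
The core idea in the abstract above, reusing one block for a variable number of steps so that depth is decoupled from parameter count, can be sketched in a few lines. This is a toy under stated assumptions: `shared_block` is a hypothetical stand-in for the shared-weight Transformer block, and the scaled residual update is only one plausible reading of "identity-biased recurrence" with a small LayerScale-like factor. The abstract does not specify the actual update rule.

```python
import math
from typing import List

Vector = List[float]


def shared_block(h: Vector, weight: float, bias: float) -> Vector:
    """Hypothetical stand-in for the shared-weight block (two parameters)."""
    return [math.tanh(weight * x + bias) for x in h]


def depth_recurrent(h0: Vector, steps: int, scale: float = 0.1,
                    weight: float = 0.5, bias: float = 0.0) -> Vector:
    """Apply the same block `steps` times with a scaled residual update."""
    h = list(h0)
    for _ in range(steps):
        update = shared_block(h, weight, bias)
        # Identity-biased step (assumed form): with a small scale, each
        # iteration keeps the state close to its previous value.
        h = [x + scale * u for x, u in zip(h, update)]
    return h


# The same two parameters serve both a shallow and a deep computation.
shallow = depth_recurrent([1.0, -2.0], steps=2)
deep = depth_recurrent([1.0, -2.0], steps=20)
```

The number of steps is an inference-time argument, so more computation can be spent on harder inputs without adding parameters.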

2603.21673 2026-03-24 cs.CL

Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with Consensus-Aware Gradient Fusion

Shixu Liu

Comments: Preprint and under consideration

2603.21669 2026-03-24 cs.RO cs.CV

PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing

Yuheng Ji, Yuyang Liu, Huajie Tan, Xuchuan Huang, Fanding Huang, Yijie Xu, Cheng Chi, Yuting Zhao, Huaihai Lyu, Peterson Co, Mingyu Cao, Qiongyu Zhang, Zhe Li, Enshen Zhou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, Xiaolong Zheng

2603.21663 2026-03-24 cs.CL

TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression

Li Wang, Yandong Wang, Xin Yu, Kui Zhang, Tianhao Peng, Wenjun Wu

2603.21661 2026-03-24 cs.CV cs.AI cs.GR cs.LG cs.MM

Cross-Scenario Deraining Adaptation with Unpaired Data: Superpixel Structural Priors and Multi-Stage Pseudo-Rain Synthesis

Kangbo Zhao, Miaoxin Guan, Xiang Chen, Yukai Shi, Jinshan Pan

Comments: We aim at addressing the cross-scenario (i.e., O.O.D) de-rain challenge, which has been neglected for a long period

2603.21660 2026-03-24 cs.CV

OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging

Meilin Liu, Jiaying Wang, Jing Shan

Comments: Accepted by CVPR 2026 (Main)

2603.21658 2026-03-24 cs.CL cs.LG

A Comparative Analysis of LLM Memorization at Statistical and Internal Levels: Cross-Model Commonalities and Model-Specific Signatures

Bowen Chen, Namgi Han, Yusuke Miyao

Comments: 8 pages of main content, in conference submission, other contents are references and extra appendix

2603.21656 2026-03-24 cs.LG cs.CY

TrustFed: Enabling Trustworthy Medical AI under Data Privacy Constraints

Vagish Kumar, Syed Bahauddin Alam, Souvik Chakraborty

2603.21653 2026-03-24 cs.LG

MISApp: Multi-Hop Intent-Aware Session Graph Learning for Next App Prediction

Yunchi Yang, Longlong Li, Jianliang Wu, Cunquan Qu

2603.21647 2026-03-24 cs.CV cs.LG

FedCVU: Federated Learning for Cross-View Video Understanding

Shenghan Zhang, Run Ling, Ke Cao, Ao Ma, Zhanjie Zhang

2603.21638 2026-03-24 cs.CV

No Dense Tensors Needed: Fully Sparse Object Detection on Event-Camera Voxel Grids

Mohamad Yazan Sadoun, Sarah Sharif, Yaser Mike Banad

Comments: 29 Pages, 9 Figures, 5 Tables

2603.21635 2026-03-24 cs.RO cs.SY eess.SY

RTD-RAX: Fast, Safe Trajectory Planning for Systems under Unknown Disturbances

Evanns Morales-Cuadrado, Long Kiu Chung, Shreyas Kousik, Samuel Coogan

2603.21630 2026-03-24 cs.AI

EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises

Ankush Agarwal, Harsh Vishwakarma, Suraj Nagaje, Chaitanya Devaguptapu

2603.21629 2026-03-24 cs.CV

Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition

Wen Guo, Pengfei Zhao, Zongmeng Wang, Yufan Hu, Junyu Gao

Comments: Accepted by CVPR2026

2603.21626 2026-03-24 cs.CV

PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation

Jiacheng Lu, Hui Ding, Shiyu Zhang, Guoping Huo

Comments: This paper has been accepted to the main conference of CVPR 2026

2603.21619 2026-03-24 cs.CV cs.AI

Efficient Zero-Shot AI-Generated Image Detection

Ryosuke Sonoda, Ramya Srinivasan

2603.21618 2026-03-24 cs.CV

4DGS360: 360° Gaussian Reconstruction of Dynamic Objects from a Single Video

Jae Won Jang, Yeonjin Chang, Wonsik Shin, Juhwan Cho, Nojun Kwak

2603.21615 2026-03-24 cs.CV

AdaEdit: Adaptive Temporal and Channel Modulation for Flow-Based Image Editing

Guandong Li, Zhaobin Chu

Abstract

Inversion-based image editing in flow matching models has emerged as a powerful paradigm for training-free, text-guided image manipulation. A central challenge in this paradigm is the injection dilemma: injecting source features during denoising preserves the background of the original image but simultaneously suppresses the model's ability to synthesize edited content. Existing methods address this with fixed injection strategies -- binary on/off temporal schedules, uniform spatial mixing ratios, and channel-agnostic latent perturbation -- that ignore the inherently heterogeneous nature of injection demand across both the temporal and channel dimensions. In this paper, we present AdaEdit, a training-free adaptive editing framework that resolves this dilemma through two complementary innovations. First, we propose a Progressive Injection Schedule that replaces hard binary cutoffs with continuous decay functions (sigmoid, cosine, or linear), enabling a smooth transition from source-feature preservation to target-feature generation and eliminating feature discontinuity artifacts. Second, we introduce Channel-Selective Latent Perturbation, which estimates per-channel importance based on the distributional gap between the inverted and random latents and applies differentiated perturbation strengths accordingly -- strongly perturbing edit-relevant channels while preserving structure-encoding channels. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing types) demonstrate that AdaEdit achieves an 8.7% reduction in LPIPS, a 2.6% improvement in SSIM, and a 2.3% improvement in PSNR over strong baselines, while maintaining competitive CLIP similarity. AdaEdit is fully plug-and-play and compatible with multiple ODE solvers including Euler, RF-Solver, and FireFlow. Code is available at https://github.com/leeguandong/AdaEdit
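
The contrast the abstract above draws between a hard binary cutoff and continuous decay functions (sigmoid, cosine, or linear) can be made concrete with a short sketch. Everything beyond the three schedule names is an assumption for illustration: the normalized progress `t`, the `midpoint` and `sharpness` parameters, and the scalar `blend` helper are hypothetical and are not taken from the AdaEdit code.

```python
import math


def injection_weight(t: float, kind: str = "cosine",
                     midpoint: float = 0.5, sharpness: float = 10.0) -> float:
    """Source-feature injection weight at normalized progress t in [0, 1]."""
    if kind == "binary":
        # Hard on/off schedule: full injection, then none.
        return 1.0 if t < midpoint else 0.0
    if kind == "linear":
        return 1.0 - t
    if kind == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * t))
    if kind == "sigmoid":
        return 1.0 / (1.0 + math.exp(sharpness * (t - midpoint)))
    raise ValueError(f"unknown schedule: {kind}")


def blend(source: float, target: float, t: float, kind: str = "cosine") -> float:
    """Mix a source feature and a target feature using the schedule weight."""
    w = injection_weight(t, kind)
    return w * source + (1.0 - w) * target


# Print each schedule at five evenly spaced steps for comparison.
steps = 5
for kind in ("binary", "linear", "cosine", "sigmoid"):
    weights = [round(injection_weight(i / (steps - 1), kind), 3) for i in range(steps)]
    print(kind, weights)
```

The binary schedule jumps from full injection to none at the midpoint, while the three continuous schedules pass through intermediate weights, which is the smooth transition the abstract credits with avoiding feature discontinuity artifacts.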

2603.21612 2026-03-24 cs.LG

Towards Multimodal Time Series Anomaly Detection with Semantic Alignment and Condensed Interaction

Shiyan Hu, Jianxin Jin, Yang Shu, Peng Chen, Bin Yang, Chenjuan Guo

Comments: ICLR 2026

2603.21611 2026-03-24 cs.CV

SARe: Structure-Aware Large-Scale 3D Fragment Reassembly

Hanze Jia, Chunshi Wang, Yuxiao Yang, Zhonghua Jiang, Yawei Luo, Shuainan Ye, Tan Tang

Comments: 18 pages, 4 figures

2603.21607 2026-03-24 cs.AI

INTRYGUE: Induction-Aware Entropy Gating for Reliable RAG Uncertainty Estimation

Alexandra Bazarova, Andrei Volodichev, Daria Kotova, Alexey Zaytsev