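arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.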

2604.17941 2026-04-21 cs.CV cs.CL

From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

Qidong Wang, Junjie Hu, Ming Jiang

Comments ACL 2026 Findings

Abstract

Recent work has increasingly explored neuron-level interpretation in vision-language models (VLMs) to identify neurons critical to final predictions. However, existing neuron analyses generally focus on single tasks, limiting the comparability of neuron importance across tasks. Moreover, ranking strategies tend to score neurons in isolation, overlooking how task-dependent information pathways shape the write-in effects of feed-forward network (FFN) neurons. This oversight can exacerbate neuron polysemanticity in multi-task settings, introducing noise into the identification and intervention of task-critical neurons. In this study, we propose HONES (Head-Oriented Neuron Explanation & Steering), a gradient-free framework for task-aware neuron attribution and steering in multi-task VLMs. HONES ranks FFN neurons by their causal write-in contributions conditioned on task-relevant attention heads, and further modulates salient neurons via lightweight scaling. Experiments on four diverse multimodal tasks and two popular VLMs show that HONES outperforms existing methods in identifying task-critical neurons and improves model performance after steering. Our source code is released at: https://github.com/petergit1/HONES.
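
The "lightweight scaling" of salient FFN neurons mentioned in the abstract can be pictured with a toy example. The sketch below is not the HONES implementation: the function name `ffn_forward`, the ReLU activation, the tiny weight matrices, and the `scale` dictionary are all hypothetical, chosen only to show how multiplying one neuron's activation changes what that neuron writes into the output.

```python
def ffn_forward(x, w_in, w_out, scale=None):
    """Toy feed-forward block (hypothetical, for illustration only).

    hidden[i] = relu(w_in[i] . x)
    output[j] = sum_i w_out[j][i] * scale_i * hidden[i]
    `scale` maps a neuron index to a multiplier; unlisted neurons use 1.0.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if scale:
        # Steering step: rescale only the selected neurons' activations.
        hidden = [h * scale.get(i, 1.0) for i, h in enumerate(hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


x = [1.0, 2.0]
w_in = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # 3 hidden neurons
w_out = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]    # 2 output dimensions

baseline = ffn_forward(x, w_in, w_out)
steered = ffn_forward(x, w_in, w_out, scale={1: 2.0})  # amplify neuron 1
```

Because the output is linear in each hidden activation, doubling neuron 1 doubles exactly that neuron's contribution and leaves the other neurons' contributions unchanged. How the neurons are selected and how the scale is chosen are the paper's contribution and are not modeled here.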

2604.17937 2026-04-21 cs.AI

ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis

Rishav Rishav, Pushpak Pujari, Pushpendre Rastogi

2604.17935 2026-04-21 cs.LG cs.AI cs.CC

How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

Xiao Wang

Abstract

The key-value (KV) cache is the dominant memory bottleneck during Transformer inference, yet little is known theoretically about how aggressively it can be compressed before multi-step reasoning degrades. We study this through $k$-hop pointer chasing on $n$ tokens under a shared KV cache of size $s$, attention dimension $m$, $H$ heads, $p$-bit precision, and a locality-respecting cache controller (satisfied by all standard KV-compression methods). We give three results. (1) Product depth lower bound (conjectured). We conjecture that any such Transformer ($n \geq 4k$, $s \leq \sqrt{n}/4$) requires depth $L = \Omega(\lceil k/s \rceil \cdot \lceil \log_2 n/(Hmp) \rceil)$, and isolate the sole remaining gap as a probabilistic step on the joint distribution of cache trace and pointer chain. Unconditionally, we prove a matching upper bound $L = O(\min(k, \lceil k/s \rceil \log s) \cdot \log n/(mp))$ via windowed pointer doubling, and a max-bound $L = \Omega(\max(\lceil k/s \rceil, \log n/(Hmp)))$. Closing the conjecture amounts to upgrading max to product. (2) Bandwidth barrier. The product bound binds only when $Hmp \lesssim \log n$. Any lower bound provable via per-window distinguishability counting -- including reachability, bandwidth, and combinations -- cannot exceed $\lceil k/s \rceil$ once $Hmp \geq \log_2 n$. Breaking this requires lifting unconditional communication-complexity bounds for pointer chasing to Cache-Transformer depth. (3) Adaptive vs oblivious error scaling. Under random cache over $T = \lceil \log_2 k \rceil$ doubling stages, oblivious caches give $\Pr[\mathcal{E}] \leq (s/(n-T))^T + 2T^3/n$ (exponential in $T$), while adaptive locality-respecting caches achieve $\Pr[\mathcal{E}] = s/n$ exactly, independent of $T$. The $\Omega((n/s)^{T-1})$ separation explains why heavy-hitter eviction empirically dominates random eviction for multi-hop reasoning.
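
For readers unfamiliar with the task, $k$-hop pointer chasing and the doubling idea behind the upper bound can be illustrated outside any Transformer. The sketch below is a plain-Python illustration, not the paper's windowed construction: there is no cache, no window, and no attention, and the function names `k_hop_naive` and `k_hop_doubling` are hypothetical. It only shows why roughly $\log_2 k$ doubling stages are enough to follow $k$ pointers.

```python
def k_hop_naive(ptr, start, k):
    """Follow the pointer array k times, one hop per step."""
    node = start
    for _ in range(k):
        node = ptr[node]
    return node


def k_hop_doubling(ptr, start, k):
    """Reach the same k-hop target with pointer doubling.

    At every stage, jump[i] is the node reached from i after 2**stage hops.
    Returns (target, number_of_doubling_stages).
    """
    jump = list(ptr)  # stage 0: each entry is a single hop
    node = start
    stages = 0
    while k > 0:
        if k & 1:
            node = jump[node]  # take the 2**stage-hop jump
        k >>= 1
        if k > 0:
            # Compose the table with itself: new_jump[i] = jump[jump[i]].
            jump = [jump[j] for j in jump]
            stages += 1
    return node, stages


ptr = [3, 0, 4, 1, 2]  # ptr[i] is the node that i points to
target, stages = k_hop_doubling(ptr, 0, 5)
assert target == k_hop_naive(ptr, 0, 5)
```

The naive loop needs $k$ sequential steps, while the doubling version uses at most $\lceil \log_2 k \rceil$ stages, at the price of updating the whole jump table at each stage. That table is the kind of state a size-$s$ cache cannot hold in full, which is the tension the abstract studies.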

2604.17930 2026-04-21 cs.CL cs.AI cs.LG

Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?

H S V N S Kowndinya Renduchintala, Sumit Bhatia

Comments ACL'26 (Findings)

2604.17928 2026-04-21 cs.LG cs.AI

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

Zhanyu Liu, Qingguo Hu, Ante Wang, Chenqing Liu, Zhishang Xiang, Hui Li, Delai Qiu, Jinsong Su

Comments Accepted by ACL 2026 Main Conference

2604.17927 2026-04-21 cs.CV cs.AI

Brain-Inspired Capture: Evidence-Driven Neuromimetic Perceptual Simulation for Visual Decoding

Feixue Shao, Guangze Shi, Xueyu Liu, Yongfei Wu, Mingqiang Wei, Jianan Zhang, Jianbo Lu, Guiying Yan, Weihua Yang

2604.17920 2026-04-21 cs.CV cs.AI cs.LG

Prompting Foundation Models for Zero-Shot Ship Instance Segmentation in SAR Imagery

Islam Mansour, Francescopaolo Sica, Michael Schmitt

Comments 6 pages

2604.17919 2026-04-21 cs.LG cs.RO

Fisher Decorator: Refining Flow Policy via A Local Transport Map

Xiaoyuan Cheng, Haoyu Wang, Wenxuan Yuan, Ziyan Wang, Zonghao Chen, Li Zeng, Zhuo Sun

2604.17915 2026-04-21 cs.CV

OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models

Yiwei Zhang, Xuesong Chen, Jin Gao, Hanshi Wang, Fudong Ge, Weiming Hu, Shaoshuai Shi, Zhipeng Zhang

2604.17914 2026-04-21 cs.CV

Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors

Yingjie Feng, Yi Wang, Jiaze Wang, Anfeng Liu, Zhuotao Tian

2604.17912 2026-04-21 cs.LG cs.AI

Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought

Muhammed Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Samet Oymak

Comments 24 pages

2604.17910 2026-04-21 cs.AI cs.LG

Physics-Informed Causal MDPs for Sequential Constraint Repair in Engineering Simulation Pipelines

Chuhan Qiao

2604.17899 2026-04-21 cs.CV

MEDN: Motion-Emotion Feature Decoupling Network for Micro-Expression Recognition

Chenxing Hu, Kun Xie, Qiguang Miao, Ruyi Liu, Quan Wang, Zongkai Yang

Comments 14 pages, 8 figures, 7 tables

2604.17898 2026-04-21 cs.CV

ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval

Zixu Li, Yupeng Hu, Zhiwei Chen, Qinlei Huang, Guozhi Qiu, Zhiheng Fu, Meng Liu

Comments Accepted by AAAI 2026

Abstract

With the rapid growth of video data, Composed Video Retrieval (CVR) has emerged as a novel paradigm in video retrieval and is receiving increasing attention from researchers. Unlike unimodal video retrieval methods, the CVR task takes a multi-modal query consisting of a reference video and a piece of modification text as input. The modification text conveys the user's intended alterations to the reference video. Based on this input, the model aims to retrieve the most relevant target video. In the CVR task, there exists a substantial discrepancy in information density between video and text modalities. Traditional composition methods tend to bias the composed feature toward the reference video, which leads to suboptimal retrieval performance. This limitation is significant due to the presence of three core challenges: (1) modal contribution entanglement, (2) explicit optimization of composed features, and (3) retrieval uncertainty. To address these challenges, we propose the evidence-dRiven dual-sTream diRectionAl anChor calibration networK (ReTrack). ReTrack is the first CVR framework that improves multi-modal query understanding by calibrating directional bias in composed features. It consists of three key modules: Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment. Specifically, ReTrack estimates the semantic contribution of each modality to calibrate the directional bias of the composed feature. It then uses the calibrated directional anchors to compute bidirectional evidence that drives reliable composed-to-target similarity estimation. Moreover, ReTrack exhibits strong generalization to the Composed Image Retrieval (CIR) task, achieving SOTA performance across three benchmark datasets in both CVR and CIR scenarios. Code is available at https://github.com/Lee-zixu/ReTrack

2604.17897 2026-04-21 cs.LG cs.AI

LoReC: Rethinking Large Language Models for Graph Data Analysis

Hongyu Zhan, Qixin Wang, Yusen Tan, Haitao Yu, Jingbo Zhou, Shuai Chen, Jia Li, Xiao Tan, Jun Xia

2604.17896 2026-04-21 cs.LG cs.AI cs.RO

Can Explicit Physical Feasibility Benefit VLA Learning? An Empirical Study

Yubai Wei, Chen Wu, Hashem Haghbayan

Comments 8 pages, 5 figures

2604.17894 2026-04-21 cs.CL

Automatic Slide Updating with User-Defined Dynamic Templates and Natural Language Instructions

Kun Zhou, Jiakai He, Wenmian Yang, Zhensheng Wang, Yiquan Zhang, Weijia Jia

Comments To appear in Findings of the Association for Computational Linguistics (ACL 2026)

2604.17889 2026-04-21 cs.CV

AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning

Junxiao Xue, Quan Deng, Tingqi Hu, Meicong Si, Xinyi Yin, Yunyun Shi, Xuecheng Wu

2604.17888 2026-04-21 cs.RO

SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces

Wensheng Wang, Chuanjun Guo, Wei Wei, Tong Wu, Ning Tan

2604.17887 2026-04-21 cs.RO

StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

Kerui Li, Zhe Jing, Xiaofeng Wang, Zheng Zhu, Yukun Zhou, Guan Huang, Dongze Li, Qingkai Yang, Huaibo Huang

2604.17886 2026-04-21 cs.CL cs.AI

Latent Preference Modeling for Cross-Session Personalized Tool Calling

Yejin Yoon, Minseo Kim, Taeuk Kim

Comments Under review. 25 pages, 10 figures, 16 tables

2604.17884 2026-04-21 cs.AI

SPREG: Structured Plan Repair with Entropy-Guided Test-Time Intervention for Large Language Model Reasoning

Xuan Wang, Yu Ming, Xinhao Zhong, Xinyu Yu, Wenjie Wang, Shuai Chen, Wei Lin

2604.17880 2026-04-21 cs.RO cs.CV

ST-$π$: Structured SpatioTemporal VLA for Robotic Manipulation

Chuanhao Ma, Hanyu Zhou, Shihan Peng, Yan Li, Tao Gu, Luxin Yan

2604.17879 2026-04-21 cs.CV

Exploring Boundary-Aware Spatial-Frequency Fusion for Camouflaged Object Detection

Song Yu, Yang Hu, Haokang Ding, Zhifang Liao, Yucheng Song

2604.17876 2026-04-21 cs.RO

OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation

Kuanning Wang, Ke Fan, Chenhao Qiu, Zeyu Shangguan, Yuqian Fu, Yanwei Fu, Daniel Seita, Xiangyang Xue

2604.17873 2026-04-21 cs.CV

Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models

Ziyao Tang, Pengkun Jiao, Bin Zhu, Huiyan Qi, Jingjing Chen, Yu-Gang Jiang

2604.17870 2026-04-21 cs.CL

GraSP: Graph-Structured Skill Compositions for LLM Agents

Tianle Xia, Lingxiang Hu, Yiding Sun, Ming Xu, Lan Xu, Siying Wang, Wei Xu, Jie Jiang

2604.17865 2026-04-21 cs.CV

Sharpening Lightweight Models for Generalized Polyp Segmentation: A Boundary Guided Distillation from Foundation Models

Shivanshu Agnihotri, Snehashis Majhi, Deepak Ranjan Nayak

2604.17863 2026-04-21 cs.RO cs.AI

Periodic Steady-State Control of a Handkerchief-Spinning Task Using a Parallel Anti-Parallelogram Tendon-driven Wrist

Lei Liu, Haonan Zhang, Huahang Xu, Zefan Zhang, Lulu Chang, Lei Lv, Andrew Ross McIntosh, Kai Sun, Zhenshan Bing, Jiahong Dong, Fuchun Sun

Comments ICRA2026

2604.17862 2026-04-21 cs.LG cs.AR

M100: An Orchestrated Dataflow Architecture Powering General AI Computing

Yan Xie, Changkui Mao, Changsong Wu, Chao Lu, Chao Suo, Cheng Qian, Chun Yang, Danyang Zhu, Hengchang Xiong, Hongzhan Lu, Hongzhen Liu, Jiafu Liu, Jie Chen, Jie Dai, Junfeng Tang, Kai Liu, Kun Li, Lipeng Ge, Meng Sun, Min Luo, Peng Chen, Peng Wang, Shaodong Yang, Shibin Tang, Shibo Chen, Weikang Zhang, Xiao Ling, Xiaobo Du, Xin Wu, Yang Liu, Yi Jiang, Yihua Jin, Yin Huang, Yuli Zhang, Zhen Yuan, Zhiyuan Man, Zhongxiao Yao

Comments Accepted to appear at ISCA 2026 Industry Track. 12 pages, 16 figures