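arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.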

2603.15618 2026-03-18 cs.CV

Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Qiuxuan Feng, Jiale Yu, Shuo Gu, Peng Jia, Pheng-Ann Heng, Shanghang Zhang

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
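
The attention-guided pruning idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' AGVP implementation: the function name `prune_visual_tokens`, the `keep_ratio` parameter, and the example scores are all hypothetical, and the sketch simply keeps the top-scoring fraction of tokens given one relevance score per visual token.

```python
from typing import List, Sequence


def prune_visual_tokens(
    attention: Sequence[float],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return indices of the visual tokens to keep, in original order.

    attention[i] is a hypothetical relevance score for visual token i,
    for example the attention mass it receives in a shallow layer.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(attention) * keep_ratio))
    # Rank token indices by score, highest first; ties favor the lower index.
    ranked = sorted(range(len(attention)), key=lambda i: (-attention[i], i))
    # Restore positional order so later layers see tokens in sequence.
    return sorted(ranked[:n_keep])


# Made-up scores for five visual tokens; keep the top 40%.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
kept = prune_visual_tokens(scores, keep_ratio=0.4)
print(kept)  # [1, 3]
```

Returning indices rather than the tokens themselves keeps the sketch independent of any tensor library; a real model would use them to gather rows from the visual token matrix.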

2603.15584 2026-03-18 cs.LG cs.AI physics.app-ph physics.comp-ph physics.optics

Physics-Informed Neural Systems for the Simulation of EUV Electromagnetic Wave Diffraction from a Lithography Mask

Vasiliy A. Es'kin, Egor V. Ivanov

Comments arXiv admin note: substantial text overlap with arXiv:2507.04153

2603.15563 2026-03-18 cs.LG cs.AI

The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Seth Karten, Jake Grigsby, Tersoo Upaa, Junik Bae, Seonghun Hong, Hyunyoung Jeong, Jaeyoon Jung, Kun Kerdthaisong, Gyungbo Kim, Hyeokgi Kim, Yujin Kim, Eunju Kwon, Dongyu Liu, Patrick Mariglia, Sangyeon Park, Benedikt Schink, Xianwei Shi, Anthony Sistilli, Joseph Twin, Arian Urdu, Matin Urdu, Qiao Wang, Ling Wu, Wenli Zhang, Kunsheng Zhou, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, Chi Jin

Comments 41 pages, 26 figures, 5 tables. NeurIPS 2025 Competition Track

Abstract

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.

2603.15484 2026-03-18 cs.CV cs.AI

RSGen: Enhancing Layout-Driven Remote Sensing Image Generation with Diverse Edge Guidance

Xianbao Hou, Yonghao He, Zeyd Boukhers, John See, Hu Su, Wei Sui, Cong Yang

2603.15377 2026-03-18 cs.LG cs.AI

More Test-Time Compute Can Hurt: Overestimation Bias in LLM Beam Search

Gal Dalal, Assaf Hallak, Gal Chechik, Yftah Ziser

2603.15255 2026-03-18 cs.AI cs.MA

SAGE: Multi-Agent Self-Evolution for LLM Reasoning

Yulin Peng, Xinxin Zhu, Chenxing Wei, Nianbo Zeng, Leilei Wang, Ying Tiffany He, F. Richard Yu

2603.15238 2026-03-18 cs.AI

Why the Valuable Capabilities of LLMs Are Precisely the Unexplainable Ones

Quan Cheng

Comments 12 pages, v2: added correction to Polanyi on why tacit knowledge is tacit (structural vs quantitative), unified three independent intellectual threads (Smolensky, Dreyfus, dynamical systems theory)

2603.15228 2026-03-18 cs.CV

HYDRA: Unifying Multi-modal Generation and Understanding via Representation-Harmonized Tokenization

Xuerui Qiu, Yutao Cui, Guozhen Zhang, Junzhe Li, JiaKui Hu, Xiao Zhang, Yang Li, Songtao Liu, Miles Yang, Yu Shi, Zhao Zhong, Liefeng Bo

Comments Work in progress: We are actively scaling up the models. More updates coming soon

2603.15213 2026-03-18 cs.CV

Tracking the Discriminative Axis: Dual Prototypes for Test-Time OOD Detection Under Covariate Shift

Wooseok Lee, Jin Mo Yang, Saewoong Bahk, Hyung-Sin Kim

2603.15164 2026-03-18 cs.CL cs.AI cs.LG

HindSight: Evaluating LLM-Generated Research Ideas via Future Impact

Bo Jiang

2603.15059 2026-03-18 cs.LG math.OC

Muon Converges under Heavy-Tailed Noise: Nonconvex Hölder-Smooth Empirical Risk Minimization

Hideaki Iiduka

2603.15011 2026-03-18 cs.CV

Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing

Jiahe Song, Chuang Wang, Yinfan Wang, Hao Zheng, Rui Nie, Bowen Jiang, Xingjian Wei, Junyuan Gao, Yubin Wang, Bin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He

2603.14894 2026-03-18 cs.LG cs.AI stat.ML

Informative Perturbation Selection for Uncertainty-Aware Post-hoc Explanations

Sumedha Chugh, Ranjitha Prasad, Nazreen Shah

2603.14761 2026-03-18 cs.AI cs.CL

BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models

Yuzhe Tang

2603.14730 2026-03-18 cs.LG

GNNVerifier: Graph-based Verifier for LLM Task Planning

Yu Hao, Qiuyu Wang, Cheng Yang, Yawen Li, Zhiqiang Zhang, Chuan Shi

Comments 17 pages, 12 figures

Abstract

Large language models (LLMs) facilitate the development of autonomous agents. As a core component of such agents, task planning aims to decompose complex natural language requests into concrete, solvable sub-tasks. Since LLM-generated plans are frequently prone to hallucinations and sensitive to long-context prompts, recent research has introduced plan verifiers to identify and correct potential flaws. However, most existing approaches still rely on an LLM as the verifier via additional prompting for plan review or self-reflection. LLM-based verifiers can be misled by plausible narration and struggle to detect failures caused by structural relations across steps, such as type mismatches, missing intermediates, or broken dependencies. To address these limitations, we propose a graph-based verifier for LLM task planning. Specifically, the proposed method has four major components: First, we represent a plan as a directed graph with enriched attributes, where nodes denote sub-tasks and edges encode execution order and dependency constraints. Second, a graph neural network (GNN) performs structural evaluation and diagnosis, producing a graph-level plausibility score for plan acceptance as well as node/edge-level risk scores to localize erroneous regions. Third, we construct controllable perturbations from ground truth plan graphs, and automatically generate training data with fine-grained annotations. Finally, guided by the feedback from our GNN verifier, we enable an LLM to conduct local edits (e.g., tool replacement or insertion) to correct the plan when the graph-level score is insufficient. Extensive experiments across diverse datasets, backbone LLMs, and planners demonstrate that our GNNVerifier achieves significant gains in improving plan quality. Our data and code are available at https://github.com/BUPT-GAMMA/GNNVerifier.
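
To make the plan-as-graph representation in the abstract above concrete, here is a minimal sketch that swaps the learned GNN for hand-written rules. It is not the GNNVerifier code: the `check_plan` function, the `in`/`out` type fields, and the example plan are hypothetical, and the checks cover only dangling dependencies, input/output type mismatches along an edge, and cycles.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set, Tuple

# Hypothetical node schema: each sub-task declares an input and output type.
Node = Dict[str, str]


def check_plan(nodes: Dict[str, Node], edges: List[Tuple[str, str]]) -> List[str]:
    """Return a list of structural problems found in a plan graph."""
    problems: List[str] = []
    deps: Dict[str, Set[str]] = {name: set() for name in nodes}
    for src, dst in edges:
        # An edge that points at an undefined sub-task is a broken dependency.
        if src not in nodes or dst not in nodes:
            problems.append(f"broken dependency: {src} -> {dst}")
            continue
        deps[dst].add(src)
        # The producer's output type must match the consumer's input type.
        if nodes[src]["out"] != nodes[dst]["in"]:
            problems.append(f"type mismatch: {src} -> {dst}")
    try:
        # A valid execution order exists only if the graph is acyclic.
        list(TopologicalSorter(deps).static_order())
    except CycleError:
        problems.append("cyclic dependency")
    return problems


plan_nodes = {
    "transcribe": {"in": "audio", "out": "text"},
    "summarize": {"in": "text", "out": "text"},
    "caption": {"in": "image", "out": "text"},
}
plan_edges = [
    ("transcribe", "summarize"),
    ("summarize", "caption"),
    ("caption", "translate"),
]
issues = check_plan(plan_nodes, plan_edges)
print(issues)
```

Rule-based checks like these only catch the failures someone thought to encode; the abstract's motivation for a learned verifier is to score plausibility and localize risk at the node and edge level without enumerating such rules by hand.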

2603.14665 2026-03-18 cs.AI

Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

J Rosser

2603.14610 2026-03-18 cs.CV eess.IV

Make it SING: Analyzing Semantic Invariants in Classifiers

Harel Yadid, Meir Yossef Levi, Roy Betser, Guy Gilboa

Comments Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

2603.14497 2026-03-18 cs.CV cs.RO

WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning

Stefan Englmeier, Katharina Winter, Fabian B. Flohr

Comments 8 pages, 6 figures, 5 tables; submitted to IEEE

2603.14284 2026-03-18 cs.LG cs.AI

High-Fidelity Compression of Seismic Velocity Models via SIREN Auto-Decoders

Caiyun Liu, Xiaoxue Luo, Jie Xiong

2603.14198 2026-03-18 cs.LG cs.AI stat.ML

Efficient Federated Conformal Prediction with Group-Conditional Guarantees

Haifeng Wen, Osvaldo Simeone, Hong Xing

Comments 22 pages, 5 figures, submitted for possible publication

2603.14177 2026-03-18 cs.LG cs.AI

Artificial intelligence-enabled single-lead ECG for non-invasive hyperkalemia detection: development, multicenter validation, and proof-of-concept deployment

Gongzheng Tang, Qinghao Zhao, Guangkun Nie, Yujie Xiao, Shijia Geng, Donglin Xie, Shun Huang, Deyun Zhang, Xingchen Yao, Jinwei Wang, Kangyin Chen, Luxia Zhang, Shenda Hong

2603.13961 2026-03-18 cs.CV

USIS-PGM: Photometric Gaussian Mixtures for Underwater Salient Instance Segmentation

Lin Hong, Xiangtong Yao, Mürüvvet Bozkurt, Xin Wang, Fumin Zhang

2603.13952 2026-03-18 cs.SD cs.AI eess.AS

LLM-Guided Reinforcement Learning for Audio-Visual Speech Enhancement

Chih-Ning Chen, Jen-Cheng Hou, Hsin-Min Wang, Shao-Yi Chien, Yu Tsao, Fan-Gang Zeng

Comments 6 pages, 4 figures, submitted to Interspeech 2026

2603.13858 2026-03-18 cs.CV

Learning through Creation: A Hash-Free Framework for On-the-Fly Category Discovery

Bohan Zhang, Weidong Tang, Zhixiang Chi, Yi Jin, Zhenbo Li, Yang Wang, Yanan Wu

Comments Accepted to CVPR 2026 Findings. Code available at https://github.com/brandinzhang/LTC

2603.13506 2026-03-18 cs.CV

LibraGen: Playing a Balance Game in Subject-Driven Video Generation

Jiahao Zhu, Shanshan Lao, Lijie Liu, Gen Li, Tianhao Qi, Wei Han, Bingchuan Li, Fangfang Liu, Zhuowei Chen, Tianxiang Ma, Qian HE, Yi Zhou, Xiaohua Xie

2603.13397 2026-03-18 cs.CV

TennisExpert: Towards Expert-Level Analytical Sports Video Understanding

Zhaoyu Liu, Xi Weng, Lianyu Hu, Zhe Hou, Kan Jiang, Jin Song Dong, Yang Liu

2603.13297 2026-03-18 cs.LG cs.AI

Enhanced Atrial Fibrillation Prediction in ESUS Patients with Hypergraph-based Pre-training

Yuzhang Xie, Yuhua Wu, Ruiyu Wang, Fadi Nahab, Xiao Hu, Carl Yang

2603.12354 2026-03-18 cs.CV cs.LG cs.NE

Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks

Tianhao Qian, Zhuoxuan Li, Jinde Cao, Xinli Shi, Leszek Rutkowski

Comments 11 pages, 6 figures, 9 tables

Abstract

Efficient deep learning traditionally relies on static heuristics like weight magnitude or activation awareness (e.g., Wanda, RIA). While successful in unstructured settings, we observe a critical limitation when applying these metrics to the structural pruning of deep vision networks. These contemporary metrics suffer from a magnitude bias, failing to preserve critical functional pathways. To overcome this, we propose a decoupled kinetic paradigm inspired by Alternating Gradient Flow (AGF), utilizing an absolute feature-space Taylor expansion to accurately capture the network's structural "kinetic utility". First, we uncover a topological phase transition at extreme sparsity, where AGF successfully preserves baseline functionality and exhibits topological implicit regularization, avoiding the collapse seen in models trained from scratch. Second, transitioning to architectures without strict structural priors, we reveal a phenomenon of Sparsity Bottleneck in Vision Transformers (ViTs). Through a gradient-magnitude decoupling analysis, we discover that dynamic signals suffer from signal compression in converged models, rendering them suboptimal for real-time routing. Finally, driven by these empirical constraints, we design a hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors. We validate our paradigm on large-scale benchmarks: under a 75% compression stress test on ImageNet-1K, AGF effectively avoids the structural collapse where traditional metrics aggressively fall below random sampling. Furthermore, when systematically deployed for dynamic inference on ImageNet-100, our hybrid approach achieves Pareto-optimal efficiency. It reduces the usage of the heavy expert by approximately 50% (achieving an estimated overall cost of 0.92×) without sacrificing the full-model accuracy.

2603.11808 2026-03-18 cs.AI

Automating Skill Acquisition through Large-Scale Mining of Open-Source Agentic Repositories: A Framework for Multi-Agent Procedural Knowledge Extraction

Shuzhen Bi, Mengsong Wu, Hao Hao, Keqian Li, Wentao Liu, Siyu Song, Hongbo Zhao, Aimin Zhou

2603.11370 2026-03-18 cs.LG

Relaxed Efficient Acquisition of Context and Temporal Features

Yunni Qu, Dzung Dinh, Grant King, Whitney Ringwald, Bing Cai Kok, Kathleen Gates, Aidan Wright, Junier Oliva