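arXivDaily: a daily academic digest that syncs the full arXiv dataset and provides AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and other fields.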

2604.05015 2026-04-08 cs.CV

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Xiawu Zheng, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He

Comments Homepage: https://video-mme-v2.netlify.app/

Abstract

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.

2604.05014 2026-04-08 cs.RO cs.AI cs.CV

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

StarVLA Community

Comments Open-source VLA infra, Technical Report

Abstract

Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision-Language-Action (VLA) approaches based on multimodal foundation models, including recent advances in vision-language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open-source codebase for VLA research. StarVLA addresses these challenges in three aspects. First, it provides a modular backbone–action-head architecture that supports both VLM backbones (e.g., Qwen-VL) and world-model backbones (e.g., Cosmos) alongside representative action-decoding paradigms, all under a shared abstraction in which backbone and action head can each be swapped independently. Second, it provides reusable training strategies, including cross-embodiment learning and multimodal co-training, that apply consistently across supported paradigms. Third, it integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K, through a unified evaluation interface that supports both simulation and real-robot deployment. StarVLA also ships simple, fully reproducible single-benchmark training recipes that, despite minimal data engineering, already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones. To the best of our knowledge, StarVLA is one of the most comprehensive open-source VLA frameworks available, and we expect it to lower the barrier for reproducing existing methods and prototyping new ones. StarVLA is being actively maintained and expanded; we will update this report as the project evolves. The code and documentation are available at https://github.com/starVLA/starVLA.

2604.05011 2026-04-08 cs.SD cs.AI

YMIR: A new Benchmark Dataset and Model for Arabic Yemeni Music Genre Classification Using Convolutional Neural Networks

Moeen AL-Makhlafi, Abdulrahman A. AlKannad, Eiad Almekhlafi, Nawaf Q. Othman Ahmed Mohammed, Saher Qaid

2604.05007 2026-04-08 cs.SD cs.AI eess.AS

Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction

Jia Li, Yinfeng Yu

Comments Main paper (6 pages). Accepted for publication by the International Joint Conference on Neural Networks (IJCNN 2026)

2604.05003 2026-04-08 cs.RO

A Survey on Sensor-based Planning and Control for Unmanned Underwater Vehicles

Shivam Vishwakarma, Tejal Bedmutha, Dharmendra Kumar Patel, Vijay Bhaskar Semwal, Leena Vachhani

2604.04999 2026-04-08 cs.LG cs.AI

PRIME: Prototype-Driven Multimodal Pretraining for Cancer Prognosis with Missing Modalities

Kai Yu, Shuang Zhou, Yiran Song, Zaifu Zhan, Jie Peng, Kaixiong Zhou, Tianlong Chen, Feng Xie, Meng Wang, Huazhu Fu, Mingquan Lin, Rui Zhang

2604.04998 2026-04-08 cs.LG

El Nino Prediction Based on Weather Forecast and Geographical Time-series Data

Viet Trinh, Ha-Vy Luu, Quoc-Khiem Nguyen-Pham, Hung Tong, Thanh-Huyen Tran, Hoai-Nam Nguyen Dang

2604.04996 2026-04-08 cs.LG

Learning-Based Multi-Criteria Decision Making Model for Sawmill Location Problems

Mahid Ahmed, Ali Dogru, Chaoyang Zhang, Chao Meng

Comments 34 pages, 12 figures, 5 tables

2604.04987 2026-04-08 cs.LG cs.AI math.OC stat.ML

Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

Yongchang Hao, Lili Mou

Comments Camera-ready version. Accepted at ICLR 2026

2604.04986 2026-04-08 cs.LG

Enhancing sample efficiency in reinforcement-learning-based flow control: replacing the critic with an adaptive reduced-order model

Zesheng Yao, Zhen-Hua Wan, Canjun Yang, Qingchao Xia, Mengqi Zhang

Comments 43 pages, 26 figures

2604.04983 2026-04-08 cs.LG

Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

Diyansha Singh

Comments 16 pages, 5 figures

2604.04980 2026-04-08 cs.RO

COMB: Common Open Modular robotic platform for Bees

Pranav Kedia, Marie Messerich, Tim Landgraf

2604.04972 2026-04-08 cs.CV

RCP: Representation Consistency Pruner for Mitigating Distribution Shift in Large Vision-Language Models

Jianwei Zhang, Chaoning Zhang, Sihan Cao, Wang Liu, Pengcheng Zheng, Jiaxin Huang, Caiyan Qin, Yalan Ye, Wei Dong, Yang Yang

2604.04971 2026-04-08 cs.LG cs.NA math.NA physics.comp-ph

A Theory-guided Weighted $L^2$ Loss for solving the BGK model via Physics-informed neural networks

Gyounghun Ko, Sung-Jun Son, Seung Yeon Cho, Myeong-Su Lee

Comments 26 pages, 9 figures

2604.04967 2026-04-08 cs.RO cs.LG

Belief Dynamics for Detecting Behavioral Shifts in Safe Collaborative Manipulation

Devashri Naik, Divake Kumar, Nastaran Darabi, Amit Ranjan Trivedi

2604.04953 2026-04-08 cs.CV cs.AI cs.HC cs.IR cs.MM

Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity

Abhishek Dharmaratnakar, Srivaths Ranganathan, Debanshu Das, Anushree Sinha

Comments 7 pages, 3 figures, accepted in WSDM 2026

2604.04943 2026-04-08 cs.CL cs.AI

The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse

Julian Coda-Forno, Jane X. Wang, Arslan Chaudhry

Comments ICLR 2026 Workshop on Representational Alignment (Re-Align)

2604.04942 2026-04-08 cs.CL cs.AI

TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models

Jiaquan Zhang, Qigan Sun, Chaoning Zhang, Xudong Wang, Zhenzhen Huang, Yitian Zhou, Pengcheng Zheng, Chi-lok Andy Tai, Sung-Ho Bae, Zeyu Ma, Caiyan Qin, Jinyu Guo, Yang Yang, Hengtao Shen

Comments 14 pages, 4 figures

2604.04941 2026-04-08 cs.AI

Algebraic Structure Discovery for Real World Combinatorial Optimisation Problems: A General Framework from Abstract Algebra to Quotient Space Learning

Min Sun, Federica Storti, Valentina Martino, Miguel Gonzalez-Andrades, Tony Kam-Thong

2604.04939 2026-04-08 cs.AI

Proximity Measure of Information Object Features for Solving the Problem of Their Identification in Information Systems

Volodymyr Yuzefovych

Comments 14 pages, 12 figures

2604.04938 2026-04-08 cs.AI

Operational Noncommutativity in Sequential Metacognitive Judgments

Enso O. Torres Alegre, Diana E. Mora Jimenez

Comments 15 pages, 1 figure

2604.04838 2026-04-08 cs.CV

Less Detail, Better Answers: Degradation-Driven Prompting for VQA

Haoxuan Han, Weijie Wang, Zeyu Zhang, Yefei He, Bohan Zhuang

Comments Accepted to CVPRW 2026. Project page: https://hhx-jpg.github.io/ddp/ , Code: https://github.com/ziplab/DDP

2604.04721 2026-04-08 cs.AI

AI Assistance Reduces Persistence and Hurts Independent Performance

Grace Liu, Brian Christian, Tsvetomira Dumbalska, Michiel A. Bakker, Rachit Dubey

2604.04579 2026-04-08 cs.CV

Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

Quoc-Huy Trinh, Mustapha Abdullahi, Bo Zhao, Debesh Jha

Comments arXiv admin note: substantial text overlap with arXiv:2511.11177

2604.04576 2026-04-08 cs.CV

PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis

Inseong Choi, Siwoo Lee, Seung-Hun Nam, Soohwan Song

Comments Accepted at CVPR 2026. Project Page: https://kakaomacao.github.io/pr-iqa-project-page/

2604.04387 2026-04-08 cs.AI cs.CY cs.ET cs.HC cs.LG

Gradual Cognitive Externalization: From Modeling Cognition to Constituting It

Zhimin Zhao

2604.04328 2026-04-08 cs.AI cs.LG cs.MA

Soft Tournament Equilibrium

Saad Alqithami

2604.04168 2026-04-08 cs.CL cs.IR

A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models

Avish Vijayaraghavan, Jaskaran Singh Kawatra, Sebin Sabu, Jonny Sheldon, Will Poulett, Alex Eze, Daniel Key, John Booth, Shiren Patel, Jonny Pearson, Dan Schofield, Jonathan Hope, Pavithra Rajendran, Neil Sebire

Comments 36 pages, includes supplementary information

Abstract

Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7-19% over the zero-shot baseline, and few-shot examples by 6-38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.

2604.04037 2026-04-08 cs.LG cs.AI

Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory

Nilesh Sarkar, Dawar Jyoti Deka

2604.03479 2026-04-08 cs.AI cs.IT cs.LG math.IT

Contextual Control without Memory Growth in a Context-Switching Task

Song-Ju Kim

Comments 25 pages, 3 figures