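arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.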

2604.15312 2026-04-17 cs.CV

Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo

Ninghui Xu, Fabio Tosi, Lihui Wang, Jiawei Han, Luca Bartolomei, Zhiting Yao, Matteo Poggi, Stefano Mattoccia

Comments CVPR 2026. Code URL: https://github.com/xnh97/Bi-CMPStereo

Abstract

Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation with higher dynamic range free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often leads to marginalization of domain-specific cues essential for cross-modal stereo matching. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that fully exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in accuracy and generalization.

2604.15311 2026-04-17 cs.CV

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

Zhanhao Liang, Tao Yang, Jie Wu, Chengjian Feng, Liang Zheng

Comments Accepted by CVPR 2026. Project page: https://rockeycoss.github.io/leapalign/

2604.15309 2026-04-17 cs.CV cs.AI cs.CL

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Yan Li, Zezi Zeng, Yifan Yang, Yuqing Yang, Ning Liao, Weiwei Guo, Lili Qiu, Mingxi Cheng, Qi Dai, Zhendong Wang, Zhengyuan Yang, Xue Yang, Ji Li, Lijuan Wang, Chong Luo

2604.15308 2026-04-17 cs.CV

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song, Wenyu Liu, Qian Zhang, Xinggang Wang

Comments Project page: https://hgao-cv.github.io/RAD-2

2604.15306 2026-04-17 cs.AI cs.LG

Generalization in LLM Problem Solving: The Case of the Shortest Path

Yao Tong, Jiayuan Ye, Anastasia Borovykh, Reza Shokri

2604.15302 2026-04-17 cs.AI cs.CL cs.LG

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Manan Gupta, Dhruv Kumar

Comments Under Review

2604.15299 2026-04-17 cs.CV

AnimationBench: Are Video Models Good at Character-Centric Animation?

Leyi Wu, Pengjun Fang, Kai Sun, Yazhou Xing, Yinwei Wu, Songsong Wang, Ziqi Huang, Dan Zhou, Yingqing He, Ying-Cong Chen, Qifeng Chen

Comments Project Page: https://animationbench.github.io Code: https://github.com/VideoVerses/AnimationBench

2604.15294 2026-04-17 cs.AI

How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study

Zhen Yang, Ping Jian, Zhongbin Guo, Zuming Zhang, Chengzhi Li, Yonghong Deng, Xinyue Zhang, Wenpeng Lu

Comments Published as a main-conference paper at The 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

2604.15291 2026-04-17 cs.CV cs.AI

AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving

Fabrizio Genilotti, Arianna Stropeni, Gionata Grotto, Francesco Borsatti, Manuel Barusco, Davide Dalle Pezze, Gian Antonio Susto

2604.15289 2026-04-17 cs.RO

Abstract Sim2Real through Approximate Information States

Yunfu Deng, Yuhao Li, Josiah P. Hanna

2604.15281 2026-04-17 cs.CV cs.RO

R3D: Revisiting 3D Policy Learning

Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao, Ran Ji, Yiyang He, Hangxing Zhang, Zundong Ke, Jun Wang, Guofeng Zhang, Jiayuan Gu

2604.15280 2026-04-17 cs.CV cs.AI

Why Do Vision Language Models Struggle To Recognize Human Emotions?

Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara, Steven McDonagh

Abstract

Understanding emotions is a fundamental ability that intelligent systems need in order to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that the inherently continuous task of dynamic facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.

2604.15278 2026-04-17 cs.SD eess.AS

A Manual Bar-by-Bar Tempo Measurement Protocol for Polyphonic Chamber Music Recordings: Design, Validation, and Application to Beethoven's Piano and Cello Sonatas

Ignasi Sole

2604.15273 2026-04-17 cs.LG quant-ph

How Embeddings Shape Graph Neural Networks: Classical vs Quantum-Oriented Node Representations

Nouhaila Innan, Antonello Rosato, Alberto Marchisio, Muhammad Shafique

Comments 6 pages. Accepted at IJCNN 2026

2604.15244 2026-04-17 cs.CL

From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

Kiran Purohit, Ramasuri Narayanam, Soumyabrata Pal

2604.15242 2026-04-17 cs.LG

Optimal last-iterate convergence in matrix games with bandit feedback using the log-barrier

Come Fiegel, Pierre Menard, Tadashi Kozuno, Michal Valko, Vianney Perchet

2604.15239 2026-04-17 cs.CV

TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

Jiawei Ren, Michal Jan Tyszkiewicz, Jiahui Huang, Zan Gojcic

Comments Project page: https://research.nvidia.com/labs/toronto-ai/tokengs

2604.15233 2026-04-17 cs.AI cs.DB

Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications

Moin Aminnaseri, Farima Fatahi Bayat, Nikita Bhutani, Jean-Flavien Bussotti, Kevin Chan, Rafael Li Chen, Yanlin Feng, Jackson Hassell, Estevam Hruschka, Eser Kandogan, Hannah Kim, James Levine, Seiji Maekawa, Jalal Mahmud, Kushan Mitra, Naoki Otani, Pouya Pezeshkpour, Nima Shahbazi, Chen Shen, Dan Zhang

Abstract

NL2SQL systems aim to address the growing need for natural language interaction with data. However, real-world information needs rarely map to a single SQL query because (1) users express queries iteratively, (2) questions often span multiple data sources beyond the closed-world assumption of a single database, and (3) queries frequently rely on commonsense or external knowledge. Consequently, satisfying realistic data needs requires integrating heterogeneous sources, modalities, and contextual data. In this paper, we present Blue's Data Intelligence Layer (DIL), designed to support multi-source, multi-modal, and data-centric applications. Blue is a compound AI system that orchestrates agents and data for enterprise settings. DIL serves as the data intelligence layer for agentic data processing, bridging the semantic gap between user intent and available information by unifying structured enterprise data, world knowledge accessible through LLMs, and personal context obtained through interaction. At the core of DIL is a data registry that stores metadata for diverse data sources and modalities to enable both native and natural language queries. DIL treats LLMs, the Web, and the User as source 'databases', each with its own query interface, elevating them to first-class data sources. DIL relies on data planners to transform user queries into executable query plans. These plans are declarative abstractions that unify relational operators with other operators spanning multiple modalities. DIL planners support decomposition of complex requests into subqueries, retrieval from diverse sources, and finally reasoning and integration to produce final results. We demonstrate DIL through two interactive scenarios in which user queries dynamically trigger multi-source retrieval, cross-modal reasoning, and result synthesis, illustrating how compound AI systems can move beyond single-database NL2SQL.

2604.15224 2026-04-17 cs.AI cs.CL cs.LG

Context Over Content: Exposing Evaluation Faking in Automated Judges

Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar

Comments Under Review

2604.15210 2026-04-17 cs.AI cs.CL

Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

Hatice Merve Vural, Doga Kukul, Ege Erdem Ozlu, Demir Ekin Arikan, Bob Mankoff, Erkut Erdem, Aykut Erdem

2604.15203 2026-04-17 cs.CL

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

Raunak Agarwal, Markus Wenzel, Simon Baur, Jonas Zimmer, George Harvey, Jackie Ma

Comments Accepted at ACL 2026 (main conference)

2604.15202 2026-04-17 cs.RO cs.AI math.OC

Benchmarking Classical Coverage Path Planning Heuristics on Irregular Hexagonal Grids for Maritime Coverage Scenarios

Carlos S. Sepúlveda, Gonzalo A. Ruz

2604.15201 2026-04-17 cs.LG

RL-STPA: Adapting System-Theoretic Hazard Analysis for Safety-Critical Reinforcement Learning

Steven A. Senczyszyn, Timothy C. Havens, Nathaniel Rice, Jason E. Summers, Benjamin D. Werner, Benjamin J. Schumeg

2604.15196 2026-04-17 cs.CV

Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization

Umer Ahmed, Syed Ahmed Mahmood, Fawad Javed Fateh, M. Shaheer Luqman, M. Zeeshan Zia, Quoc-Huy Tran

2604.15190 2026-04-17 cs.AI cs.CL

Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation

Ziyang Chen, Renbing Chen, Daowei Li, Jinzhi Liao, Jiashen Sun, Ke Zeng, Xiang Zhao

Comments 5 pages, 3 figures, 2 tables, accepted at SIGIR 2026 Industry Track

2604.15188 2026-04-17 cs.CV cs.AI

VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

Huawei Ji, Yuanhao Sun, Yuan Jin, Cheng Deng, Jiaxin Ding, Luoyi Fu, Xinbing Wang

2604.15181 2026-04-17 cs.LG math.DS

One-shot learning for the complex dynamical behaviors of weakly nonlinear forced oscillators

Teng Ma, Luca Rosafalco, Wei Cui, Lin Zhao, Attilio Frangi

Comments 48 pages, 16 figures, graphical abstract, highlights

2604.15180 2026-04-17 cs.LG cs.CL

AdaSplash-2: Faster Differentiable Sparse Attention

Nuno Gonçalves, Hugo Pitorro, Vlad Niculae, Edoardo Ponti, Lei Li, Andre Martins, Marcos Treviso

2604.15171 2026-04-17 cs.CV cs.LG

An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation

Onno Niemann, Gonzalo Martínez Muñoz, Alberto Suárez Gonzalez

Comments Accepted at IJCNN 2026 (IEEE)

2604.15170 2026-04-17 cs.CV

OmniLight: One Model to Rule All Lighting Conditions

Youngjin Oh, Junyoung Park, Junhyeong Kwon, Nam Ik Cho

Comments CVPRW 2026; NTIRE 2026 Image Shadow Removal & Ambient Lighting Normalization Challenges (1st Perceptual Rank for White Lighting, 2nd Fidelity Rank & 4th Perceptual Rank for Color Lighting)