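arXivDaily: a daily academic digest synchronized with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and related fields.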

2505.23359 2026-03-18 cs.CV

VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, Xu Sun

Comments Project Page: https://llyx97.github.io/video_reason_bench/

Abstract

Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to demonstrate the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models have to precisely recall multiple operations in the video and perform step-by-step reasoning to reach the correct final answers. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning -- e.g., GPT-4o achieves only 6.9% accuracy -- while the thinking-enhanced Gemini-2.5-Pro significantly outperforms others with 56.0% accuracy. Our investigations on "test-time scaling" further reveal that an extended thinking budget, while offering no or minimal benefits on existing video benchmarks, is essential for improving performance on VideoReasonBench.

2505.23135 2026-03-18 cs.LG cs.AI cs.LO cs.PL cs.SE

VERINA: Benchmarking Verifiable Code Generation

Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, Dawn Song

2505.21777 2026-03-18 cs.LG cond-mat.dis-nn cs.CV q-bio.NC stat.ML

Memorization to Generalization: Emergence of Diffusion Models from Associative Memory

Bao Pham, Gabriel Raya, Matteo Negri, Mohammed J. Zaki, Luca Ambrogioni, Dmitry Krotov

2505.21676 2026-03-18 cs.RO cs.NI

Real-World Deployment of Cloud-based Autonomous Mobility Systems for Outdoor and Indoor Environments

Yufeng Yang, Minghao Ning, Keqi Shu, Aladdin Saleh, Ehsan Hashemi, Amir Khajepour

Comments This paper has been submitted to IEEE Robotics and Automation Magazine

2505.20107 2026-03-18 cs.LG cs.CV

Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning

Ziyi Zhang, Li Shen, Deheng Ye, Yong Luo, Huangxuan Zhao, Meng Liu, Wei Yu, Lefei Zhang

Comments Accepted to CVPR 2026

2505.17018 2026-03-18 cs.CV

SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, Xiangyu Yue

Comments ICLR 2026, Project page: https://github.com/kxfan2002/SophiaVL-R1

2505.11192 2026-03-18 cs.CV cs.AI

FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment

Myunsoo Kim, Seongwoong Shim, Byung-Jun Lee

Comments Accepted at CVPR 2026

2504.16538 2026-03-18 cs.CV cs.LG

Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes

Joan Perez, Giovanni Fusco

Comments 25 pages, 6 figures in main paper, 6 figures in appendices

2504.13242 2026-03-18 cs.CV

Dynamic Memory Transformer for Hyperspectral Image Classification

Muhammad Ahmad

2504.10045 2026-03-18 cs.AI cs.LG

CHARM: Calibrating Reward Models With Chatbot Arena Scores

Xiao Zhu, Chenmien Tan, Pinzhen Chen, Rico Sennrich, Huiming Wang, Yanlin Zhang, Hanxu Hu

2504.09037 2026-03-18 cs.AI cs.CL

A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems

Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, Caiming Xiong, Shafiq Joty

Comments 72 pages, 6 figures. Accepted to TMLR, with Survey Certification award

2504.05342 2026-03-18 cs.LG cs.AI cs.CV

MASS: MoErging through Adaptive Subspace Selection

Donato Crisostomi, Alessandro Zirilli, Antonio Andrea Gargiulo, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, Iacopo Masi, Emanuele Rodolà

2503.19476 2026-03-18 cs.LG

LogicXGNN: Grounded Logical Rules for Explaining Graph Neural Networks

Chuqin Geng, Ziyu Zhao, Zhaoyue Wang, Haolin Ye, Yuhe Jiang, Xujie Si

Comments Accepted at ICLR 2026

2503.17025 2026-03-18 cs.AI

A Guide to Bayesian Networks Software Packages for Structure and Parameter Learning -- 2025 Edition

Joverlyn Gaudillo, Nicole Astrologo, Fabio Stella, Enzo Acerbi, Francesco Canonaco

Comments 11 pages, 1 figure

2503.12966 2026-03-18 cs.LG stat.ML

Optimal Denoising in Score-Based Generative Models: The Role of Data Regularity

Eliot Beyler, Francis Bach

2503.06140 2026-03-18 cs.CV

Boosting the Local Invariance for Better Adversarial Transferability

Bohan Liu, Xiaosen Wang

Comments Code is available at https://github.com/Trustworthy-AI-Group/TransferAttack

2503.03992 2026-03-18 cs.RO

GeoFIK: A Fast and Reliable Geometric Solver for the IK of the Franka Arm based on Screw Theory Enabling Multiple Redundancy Parameters

Pablo C. Lopez-Custodio, Yuhe Gong, Luis F. C. Figueredo

2502.05175 2026-03-18 cs.CV cs.GR

Fillerbuster: Unified Generative Scene Completion Model for Casual Captures

Ethan Weber, Norman Müller, Yash Kant, Vasu Agrawal, Michael Zollhöfer, Angjoo Kanazawa, Christian Richardt

Comments Project page at https://ethanweber.me/fillerbuster/

2501.12774 2026-03-18 cs.CL

LLMs as Repositories of Factual Knowledge: Limitations and Solutions

Seyed Mahed Mousavi, Simone Alghisi, Giuseppe Riccardi

2501.05990 2026-03-18 cs.CL

Constraining constructions with WordNet: pros and cons for the semantic annotation of fillers in the Italian Constructicon

Flavio Pisciotta, Ludovica Pannitto, Lucia Busso, Beatrice Bernasconi, Francesca Masini

2412.13639 2026-03-18 cs.RO

4D Radar-Inertial Odometry based on Gaussian Modeling and Multi-Hypothesis Scan Matching

Fernando Amodeo, Luis Merino, Fernando Caballero

Comments Accepted for publication in IEEE Robotics and Automation Letters. Our code and results can be publicly accessed at: https://github.com/robotics-upo/gaussian-rio-cpp

Details

DOI: 10.1109/LRA.2026.3675514

Abstract

4D millimeter-wave (mmWave) radars are sensors that provide robustness against adverse weather conditions (rain, snow, fog, etc.), and as such they are increasingly used for odometry and SLAM (Simultaneous Localization and Mapping). However, the noisy and sparse nature of the returned scan data proves to be a challenging obstacle for existing registration algorithms, especially those originally intended for more accurate sensors such as LiDAR. Following the success of 3D Gaussian Splatting for vision, in this paper we propose a summarized representation for radar scenes based on global simultaneous optimization of 3D Gaussians, as opposed to voxel-based approaches, and we leverage its inherent Probability Density Function (PDF) for registration. Moreover, we propose optimizing multiple registration hypotheses for better protection against local optima of the PDF. We evaluate our modeling and registration system against state-of-the-art techniques, finding that our system provides richer models and more accurate registration results. Finally, we evaluate the effectiveness of our system in a real Radar-Inertial Odometry task. Experiments using publicly available 4D radar datasets show that our Gaussian approach is comparable to existing registration algorithms, outperforming them in several sequences. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

2412.08221 2026-03-18 cs.CV cs.AI cs.LG

Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training

Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna

Comments ICLR 2026

2411.16253 2026-03-18 cs.CV

Open-Vocabulary Octree-Graph for 3D Scene Understanding

Zhigang Wang, Yifei Su, Chenhui Li, Dong Wang, Yan Huang, Bin Zhao, Xuelong Li

Comments Accepted by ICCV25. 11 pages, 7 figures

2410.22070 2026-03-18 cs.CV cs.LG

FreeGaussian: Annotation-free Control of Articulated Objects via 3D Gaussian Splats with Flow Derivatives

Qizhi Chen, Delin Qu, Junli Liu, Yiwen Tang, Haoming Song, Dong Wang, Yuan Yuan, Bin Zhao

2410.21271 2026-03-18 cs.CL cs.AI

EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

Shih-Yang Liu, Maksim Khadkevich, Nai Chit Fung, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang, Saurav Muralidharan, Hongxu Yin, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen

Comments ICLR 2026 workshops. Code: https://github.com/NVlabs/EoRA

2410.18373 2026-03-18 cs.RO cs.HC

UGotMe: An Embodied System for Affective Human-Robot Interaction

Peizhen Li, Longbing Cao, Xiao-Ming Wu, Xiaohan Yu, Runze Yang

Comments Accepted to the 2025 IEEE International Conference on Robotics and Automation (ICRA)

2410.17762 2026-03-18 cs.LG

Anomaly Resilient Temporal QoS Prediction using Hypergraph Convoluted Transformer Network

Suraj Kumar, Soumi Chattopadhyay, Chandranath Adak

Comments 19 pages, 13 figures

Details

DOI: 10.1109/TNSM.2026.3674650

Abstract

Quality-of-Service (QoS) prediction is a critical task in the service lifecycle, enabling precise and adaptive service recommendations by anticipating performance variations over time in response to evolving network uncertainties and user preferences. However, contemporary QoS prediction methods frequently encounter data sparsity and cold-start issues, which hinder accurate QoS predictions and limit the ability to capture diverse user preferences. Additionally, these methods often assume QoS data reliability, neglecting potential credibility issues such as outliers and the presence of greysheep users and services with atypical invocation patterns. Furthermore, traditional approaches fail to leverage diverse features, including domain-specific knowledge and complex higher-order patterns, essential for accurate QoS predictions. In this paper, we introduce a real-time, trust-aware framework for temporal QoS prediction to address the aforementioned challenges, featuring an end-to-end deep architecture called the Hypergraph Convoluted Transformer Network (HCTN). HCTN combines a hypergraph structure with graph convolution over hyper-edges to effectively address high-sparsity issues by capturing complex, high-order correlations. Complementing this, the transformer network utilizes multi-head attention along with parallel 1D convolutional layers and fully connected dense blocks to capture both fine-grained and coarse-grained dynamic patterns. Additionally, our approach includes a sparsity-resilient solution for detecting greysheep users and services, incorporating their unique characteristics to improve prediction accuracy. Trained with a robust loss function resistant to outliers, HCTN demonstrated state-of-the-art performance on the large-scale WSDREAM-2 datasets for response time and throughput.

2410.03385 2026-03-18 cs.LG q-bio.NC

From Epilepsy Seizures Classification to Detection: A Deep Learning-based Approach for Raw EEG Signals

Davy Darankoum, Manon Villalba, Clelia Allioux, Baptiste Caraballo, Carine Dumont, Eloise Gronlier, Corinne Roucard, Yann Roche, Chloe Habermacher, Sergei Grudinin, Julien Volle

Comments 25 pages, 3 tables, 5 figures

2408.15747 2026-03-18 cs.CL

Form and meaning co-determine the realization of tone in Taiwan Mandarin spontaneous speech: the case of T2-T3 and T3-T3 tone sandhi

Yuxin Lu, Yu-Ying Chuang, R. Harald Baayen

2408.08005 2026-03-18 cs.LG

SG-DeepONet: Source-generalized deep operator learning for full waveform inversion

Zekai Guo, Lihui Chai, Ye Li