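arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.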

2604.12843 2026-04-16 cs.CL

Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Eliya Habba, Itay Itzhak, Asaf Yehudai, Yotam Perlitz, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen, Gabriel Stanovsky

Abstract

The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item parameters fixed. Our approach supports a realistic evaluation setting in which datasets are introduced over time and models are evaluated only on the datasets available at the time of evaluation, while a fixed anchor set for each dataset is used so that results from different evaluation periods can be compared directly. In large-scale experiments on more than $400$ models, our framework predicts full-evaluation performance within 2-3 percentage points using only $100$ anchor questions per dataset, with Spearman $\rho \geq 0.9$ for ranking preservation, showing that it is possible to extend benchmark suites over time while preserving score comparability, at a constant evaluation cost per new dataset. Code available at https://github.com/eliyahabba/growing-pains
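
As a rough illustration of the fixed-parameter idea in the abstract above, the following sketch estimates a new model's ability from its answers on a handful of anchor items whose parameters are held fixed, then predicts accuracy on the full item pool. It is a simplification under stated assumptions: it uses a unidimensional two-parameter logistic (2PL) model rather than the multidimensional IRT the paper describes, and the function names and item values are hypothetical, not taken from the authors' repository.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items, steps=50):
    """Maximum-likelihood ability estimate with item parameters held fixed.

    responses: 0/1 outcomes of the new model on the anchor items
    items:     (a, b) pairs calibrated earlier; they are never updated here
    """
    theta = 0.0
    for _ in range(steps):
        grad = 0.0  # first derivative of the log-likelihood in theta
        hess = 0.0  # second derivative (non-positive for the 2PL model)
        for y, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            grad += a * (y - p)
            hess -= a * a * p * (1.0 - p)
        if abs(hess) < 1e-12:
            break
        # Newton step, clamped so one iteration cannot jump too far
        step = max(-1.0, min(1.0, grad / hess))
        theta -= step
        # keep theta in a bounded range (all-correct or all-wrong responses
        # would otherwise push the estimate toward infinity)
        theta = max(-6.0, min(6.0, theta))
    return theta

def predict_accuracy(theta, items):
    """Expected fraction correct over a set of calibrated items."""
    return sum(p_correct(theta, a, b) for a, b in items) / len(items)

# Hypothetical anchor items (a, b) and one model's responses on them
anchor_items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.5, 0.5)]
anchor_responses = [1, 1, 0, 0]

theta_hat = estimate_ability(anchor_responses, anchor_items)
# Hypothetical full pool: the anchors plus two further calibrated items
full_pool = anchor_items + [(0.8, -0.5), (1.2, 1.5)]
print(round(theta_hat, 3), round(predict_accuracy(theta_hat, full_pool), 3))
```

Because only the ability parameter is fitted, the cost of scoring a new model in this sketch depends on the number of anchor items, not on the size of the full pool.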

2604.12525 2026-04-16 cs.CV

CoD-Lite: Real-Time Diffusion-Based Generative Image Compression

Zhaoyang Jia, Naifu Xue, Zihan Zheng, Jiahao Li, Bin Li, Xiaoyi Zhang, Zongyu Guo, Yuan Zhang, Houqiang Li, Yan Lu

2604.12371 2026-04-16 cs.CV

Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg

Comments Accepted at ICLR 2026 Workshop on Agents in the Wild

2604.12319 2026-04-16 cs.CV

RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation

Guoan Xu, Yang Xiao, Guangwei Gao, Dongchen Zhu, Guo-Jun Qi, Wenjing Jia

Comments 7 tables, 9 figures

2604.12102 2026-04-16 cs.AI cs.CV cs.LG

Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks

Arun Sharma

Comments 11 pages. Code: https://github.com/arunshar/spatial-atlas

2604.11929 2026-04-16 cs.LG math.DS physics.comp-ph

Fast and principled equation discovery from chaos to climate

Yuzheng Zhang, Weizhen Li, Rui Carvalho

Comments 34 pages, 8 figures

2604.11840 2026-04-16 cs.LG cs.AI cs.CY cs.MA

When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation

Sandro Andric

Comments 12 pages, 5 figures, supplementary material included as ancillary file

2604.11828 2026-04-16 cs.AI cs.CY math.OC

The Non-Optimality of Scientific Knowledge: Path Dependence, Lock-In, and The Local Minimum Trap

Mohamed Mabrok

2604.11797 2026-04-16 cs.CV

SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization

Deming Li, Abhay Yadav, Cheng Peng, Rama Chellappa, Anand Bhattad

Comments Project website: https://syncfix.github.io/

2604.11372 2026-04-16 cs.RO

MR.ScaleMaster: Scale-Consistent Collaborative Mapping from Crowd-Sourced Monocular Videos

Hyoseok Ju, Giseop Kim

Comments 8 pages, 7 figures, submitted to IROS 2026

2604.11244 2026-04-16 cs.CV

Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

Tencent Hunyuan Team

2604.11133 2026-04-16 cs.CL

How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts

Minh-Vuong Nguyen, Fatemeh Shiri, Zhuang Li, Karin Verspoor

Comments Accepted to ACL 2026 Findings

2604.11064 2026-04-16 cs.LG cs.CV

A Faster Path to Continual Learning

Wei Li, Hangjie Yuan, Zixiang Zhao, Borui Kang, Ziwei Liu, Tao Feng

Comments Update Author Affiliations

2604.10974 2026-04-16 cs.LG cs.RO

Robust Adversarial Policy Optimization Under Dynamics Uncertainty

Mintae Kim, Koushil Sreenath

Comments 33 pages, 8 figures

2604.10532 2026-04-16 cs.CV

The Second Challenge on Real-World Face Restoration at NTIRE 2026: Methods and Results

Jingkai Wang, Jue Gong, Zheng Chen, Kai Liu, Jiatong Li, Yulun Zhang, Radu Timofte, Jiachen Tu, Yaokun Shi, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yingsi Chen, Yijiao Liu, Hui Li, Yu Wang, Congchao Zhu, Alexandru-Gabriel Lefterache, Anamaria Radoi, Chuanyue Yan, Tao Lu, Yanduo Zhang, Kanghui Zhao, Jiaming Wang, Yuqi Li, WenBo Xiong, Yifei Chen, Xian Hu, Wei Deng, Daiguo Zhou, Sujith Roy, Claudia Jesuraj, Vikas B, Spoorthi LC, Nikhil Akalwadi, Ramesh Ashok Tabib, Uma Mudenagudi, Yuxuan Jiang, Chengxi Zeng, Tianhao Peng, Fan Zhang, David Bull, Wei Zhou, Linfeng Li, Hongyu Huang, Hoyoung Lee, SangYun Oh, ChangYoung Jeong, Axi Niu, Jinyang Zhang, Zhenguo Wu, Senyan Qing, Jinqiu Sun, Yanning Zhang

Comments NTIRE 26: https://cvlai.net/ntire/2026 . NTIRE Real-World Face Restoration: https://ntire-face.github.io/2026/ . CVPR 2026 Workshop

2604.10062 2026-04-16 cs.LG

When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs

Jose Efraim Aguilar Escamilla, Haoyang Hong, Jiawei Li, Haoyu Zhao, Xuezhou Zhang, Sanghyun Hong, Huazheng Wang

2604.09812 2026-04-16 cs.CL

Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering

Rrubaa Panchendrarajan, Arkaitz Zubiaga

2604.09687 2026-04-16 cs.CV cs.AI

Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

Yunkai Zhang, Linda Li, Yingxin Cui, Xiyuan Ruan, Zeyu Zheng, Kezhen Chen, Yi Zhang, Diji Yang

2604.09581 2026-04-16 cs.AI cs.CY cs.HC

Avenir-UX: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding

Wee Joe Tan, Zi Rui Lucas Lim, Shashank Durgad, Karim Obegi, Aiden Yiliu Li

2604.09106 2026-04-16 cs.CV cs.LG

Detecting Diffusion-generated Images via Dynamic Assembly Forests

Mengxin Fu, Yuezun Li

2604.07985 2026-04-16 cs.CL cs.IR

Rag Performance Prediction for Question Answering

Or Dado, David Carmel, Oren Kurland

Comments 12 pages. 2 figures. 1 table

2604.07823 2026-04-16 cs.CV cs.AI cs.MM

LPM 1.0: Video-based Character Performance Model

Ailing Zeng, Casper Yang, Chauncey Ge, Eddie Zhang, Garvey Xu, Gavin Lin, Gilbert Gu, Jeremy Pi, Leo Li, Mingyi Shi, Shawn Wang, Sheng Bi, Steven Tang, Thorn Hang, Tobey Guo, Vincent Li, Xin Tong, Yikang Li, Yuchen Sun, Yue Zhao, Yuhan Lu, Yuwei Li, Zane Zhang, Zeshi Yang, Zi Ye

Comments 43 pages, 15 figures, 2 tables. Project page: https://large-performance-model.github.io

2604.06448 2026-04-16 cs.LG cs.AI cs.MM eess.IV

From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures

Srinidhi Madabhushi, Pranesh Vyas, Swathi Vaidyanathan, Mayur Kurup, Elliott Nash, Yegor Silyutin

Comments Accepted at FSE 2026 - Industrial Track

2604.06168 2026-04-16 cs.CV cs.RO

Action Images: End-to-End Policy Learning via Multiview Video Generation

Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Pengsheng Guo, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan

Comments Project Page: https://actionimages.github.io/

2604.05808 2026-04-16 cs.AI cs.LG

Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

Shuai Zhen, Yanhua Yu, Ruopei Guo, Nan Cheng, Yang Deng

Comments Accepted to ACL 2026 Main Conference

2604.05672 2026-04-16 cs.RO

A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

Kaidong Zhang, Jian Zhang, Rongtao Xu, Yu Sun, Shuoshuo Xue, Youpeng Wen, Xiaoyu Guo, Minghao Guo, Weijia Liufu, Liu Zihou, Kangyi Ji, Yangsong Zhang, Jiarun Zhu, Jingzhi Liu, Zihang Li, Ruiyi Chen, Meng Cao, Jingming Zhang, Shen Zhao, Xiaojun Chang, Feng Zheng, Ivan Laptev, Xiaodan Liang

2604.05096 2026-04-16 cs.CL

RAG or Learning? Understanding the Limits of LLM Adaptation under Continuous Knowledge Drift in the Real World

Hanbing Liu, Lang Cao, Yang Li

2604.03723 2026-04-16 cs.CV

SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation

Guiyu Zhang, Yabo Chen, Xunzhi Xiang, Junchao Huang, Zhongyu Wang, Li Jiang

Comments CVPR 2026

2604.02709 2026-04-16 cs.CL cs.AI cs.LG cs.SE

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

Yihong Dong, Jianha Xiao, Xue Jiang, Xuyuan Guo, Zhiyuan Fan, Jiaru Qian, Kechi Zhang, Jia Li, Zhi Jin, Ge Li

Comments Work in progress

Abstract

The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments indicate a clear performance stratification that correlates with the hierarchy's levels of complexity. Our analysis reveals a direct relationship where increasing task difficulty substantially impacts both inference length and performance. Furthermore, we find that while larger models and advanced inference methods offer notable relative gains, they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, revealing that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further indicates that LLMs are significantly less efficient than traditional algorithmic programs for these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with more powerful formal reasoning capabilities.
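
To make "deterministic symbolic verifiability" in the abstract above concrete, here is a minimal sketch of membership checkers for one textbook language at each of three Chomsky levels, plus a grader that compares a model's yes/no answer against the checker. The specific languages, the `CHECKERS` table, and the `grade` function are illustrative assumptions; they are not claimed to be the tasks or the code used in ChomskyBench.

```python
import re

def is_ab_star(s):
    """Regular language (ab)*: decidable with a finite automaton / regex."""
    return re.fullmatch(r"(ab)*", s) is not None

def is_anbn(s):
    """Context-free language a^n b^n (n >= 0): needs one unbounded counter."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def is_anbncn(s):
    """Context-sensitive language a^n b^n c^n (n >= 0): not context-free."""
    n = len(s) // 3
    return len(s) % 3 == 0 and s == "a" * n + "b" * n + "c" * n

# Hypothetical mapping from hierarchy level to a reference checker
CHECKERS = {
    "regular": is_ab_star,
    "context-free": is_anbn,
    "context-sensitive": is_anbncn,
}

def grade(level, string, model_says_member):
    """Return True when the model's yes/no answer matches the checker."""
    return CHECKERS[level](string) == model_says_member

# Example: a model claims "aabbc" belongs to a^n b^n c^n, which is wrong
print(grade("context-sensitive", "aabbc", True))
print(grade("context-free", "aabb", True))
```

Each checker runs in time linear in the input length, which gives a simple reference point for the abstract's comparison between LLM inference cost and traditional algorithmic programs.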

2604.01614 2026-04-16 cs.RO

Smooth Feedback Motion Planning with Reduced Curvature

Aref Amiri, Steven M. LaValle

Comments Accepted for publication in IEEE Robotics and Automation Letters