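arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.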

2510.00507 2026-03-06 cs.CL cs.AI

Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

Yurun Chen, Xavier Hu, Yuhan Liu, Ziqi Wang, Zeyi Liao, Lin Chen, Feng Wei, Yuxi Qian, Bo Zheng, Keting Yin, Shengyu Zhang

Comments Accepted at CVPR 2026 Main Conference

English abstract

As multimodal LLM-driven agents advance in autonomy and generalization, traditional static datasets face inherent scalability limitations and are insufficient for fully assessing their capabilities in increasingly complex and diverse tasks. Existing studies have attempted to generate agent tasks using LLMs, but due to the inherent hallucinations of LLMs and the lack of internal data relationship modeling, these tasks often exhibit semantic inconsistencies and solvability issues. To address these challenges, we introduce Graph2Eval, a knowledge-graph-driven framework for automated, scalable, and semantically grounded agent task generation. At its core, Graph2Eval leverages a knowledge graph built from heterogeneous external data sources as a structured task space, generating multimodal agent tasks through subgraph sampling and task construction guided by task templates and meta-path strategies. To further ensure task reliability, a multi-stage filtering pipeline based on node reachability analysis, LLM scoring, and similarity analysis ensures the diversity and solvability of the generated tasks. By unifying both RAG Agent and Web Agent scenarios, Graph2Eval enables efficient generation of multimodal document understanding tasks and multi-step web interaction tasks. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document understanding and web interaction scenarios. Extensive experiments show that, on average, Graph2Eval improves task semantic consistency by 20% and solvability by 17% over baselines, while Graph2Eval-Bench effectively distinguishes agent performance, offering a new perspective on agent evaluation.
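
The abstract mentions subgraph sampling over a knowledge graph and a filtering stage based on node reachability. The following is a minimal sketch of those two ideas only, not the authors' implementation: the toy graph, the node names, and the functions `sample_subgraph` and `is_reachable` are all hypothetical, and the task templates, meta-path strategies, LLM scoring, and similarity analysis are not reproduced.

```python
import random
from collections import deque

# Toy knowledge graph as a directed adjacency list (node names are made up).
GRAPH = {
    "home_page": ["search_box", "nav_menu"],
    "search_box": ["results_list"],
    "nav_menu": ["pricing_page"],
    "results_list": ["product_page"],
    "pricing_page": [],
    "product_page": ["add_to_cart"],
    "add_to_cart": [],
}


def sample_subgraph(graph, start, max_nodes, rng):
    """Sample a connected subgraph by randomized breadth-first expansion."""
    seen = [start]
    queue = deque([start])
    while queue and len(seen) < max_nodes:
        node = queue.popleft()
        neighbours = list(graph[node])
        rng.shuffle(neighbours)  # randomize which neighbours are kept first
        for nxt in neighbours:
            if nxt not in seen and len(seen) < max_nodes:
                seen.append(nxt)
                queue.append(nxt)
    # Keep only the edges whose endpoints were both sampled.
    return {n: [m for m in graph[n] if m in seen] for n in seen}


def is_reachable(graph, source, target):
    """Breadth-first search for a directed path from source to target."""
    visited = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return False
```

In this sketch, a candidate task whose goal node fails `is_reachable` from its start node would be discarded as unsolvable before any further scoring.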

2510.00405 2026-03-06 cs.CV cs.AI cs.RO

EgoTraj-Bench: Towards Robust Trajectory Prediction Under Ego-view Noisy Observations

Jiayi Liu, Jiaming Zhou, Ke Ye, Kun-Yu Lin, Allan Wang, Junwei Liang

2510.00177 2026-03-06 cs.CL cs.AI

PrefDisco: Benchmarking Proactive Personalized Reasoning

Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh, Maryam Fazel, Yulia Tsvetkov

Comments 65 pages, 6 figures

2509.26325 2026-03-06 cs.CV

Continuous Space-Time Video Super-Resolution with 3D Fourier Fields

Alexander Becker, Julius Erbach, Dominik Narnhofer, Konrad Schindler

2509.25149 2026-03-06 cs.CL cs.AI cs.LG

Pretraining Large Language Models with NVFP4

NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blakeman, Evan Briones, Ian Buck, Bryan Catanzaro, Muya Chang, Jinhang Choi, Mike Chrzanowski, Eric Chung, Victor Cui, Steve Dai, Bita Darvish Rouhani, Carlo del Mundo, Deena Donia, Burc Eryilmaz, Henry Estela, Abhinav Goel, Oleg Goncharov, Yugi Guvvala, Robert Hesse, Russell Hewett, Herbert Hum, Ujval Kapasi, Brucek Khailany, Mikail Khona, Nick Knight, Alex Kondratenko, Ronny Krashinsky, Ben Lanir, Simon Layton, Michael Lightstone, Daniel Lo, Paulius Micikevicius, Asit Mishra, Tim Moon, Deepak Narayanan, Chao Ni, Abhijit Paithankar, Satish Pasumarthi, Ankit Patel, Mostofa Patwary, Ashwin Poojary, Gargi Prasad, Sweta Priyadarshi, Yigong Qin, Xiaowei Ren, Oleg Rybakov, Charbel Sakr, Sanjeev Satheesh, Stas Sergienko, Pasha Shamis, Kirthi Shankar, Nishant Sharma, Mohammad Shoeybi, Michael Siu, Misha Smelyanskiy, Darko Stosic, Dusan Stosic, Bor-Yiing Su, Frank Sun, Nima Tajbakhsh, Shelby Thomas, Przemek Tredak, Evgeny Tsykunov, Gandhi Vaithilingam, Aditya Vavre, Rangharajan Venkatesan, Roger Waleffe, Qiyu Wan, Hexin Wang, Mengdi Wang, Lizzie Wei, Hao Wu, Evan Wu, Keith Wyss, Ning Xu, Jinze Xue, Charlene Yang, Yujia Zhai, Ruoxi Zhang, Jingyang Zhu, Zhongbo Zhu

Comments Update includes: (1) fixing a typo in eq. 2, (2) updating the author list, and (3) adding a related work

English abstract

Large Language Models (LLMs) today are powerful problem solvers across many domains, and they continue to get stronger as they scale in model size, training set size, and training set quality, as shown by extensive research and experimentation across the industry. Training a frontier model today requires on the order of tens to hundreds of yottaflops, which is a massive investment of time, compute, and energy. Improving pretraining efficiency is therefore essential to enable the next generation of even more capable LLMs. While 8-bit floating point (FP8) training is now widely adopted, transitioning to even narrower precision, such as 4-bit floating point (FP4), could unlock additional improvements in computational speed and resource utilization. However, quantization at this level poses challenges to training stability, convergence, and implementation, notably for large-scale models trained on long token horizons. In this study, we introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates Random Hadamard transforms (RHT) to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across both the forward and backward passes, utilizes stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers. We validate our approach by training a 12-billion-parameter model on 10 trillion tokens -- the longest publicly documented training run in 4-bit precision to date. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline. These findings highlight that NVFP4, when combined with our training approach, represents a major step forward in narrow-precision LLM training algorithms.
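
The abstract cites stochastic rounding as the source of unbiased gradient estimates. The sketch below illustrates that property in isolation and is not the paper's training code: the grid is assumed to be the non-negative magnitudes of a 4-bit E2M1 float, and `stochastic_round` and `mean_rounded` are hypothetical helpers. Block scaling, Random Hadamard transforms, and the two-dimensional quantization scheme are left out.

```python
import random

# Non-negative magnitudes of a 4-bit E2M1 float grid (an assumption made
# for illustration; the abstract does not list the representable values).
GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def stochastic_round(x: float, rng: random.Random) -> float:
    """Round x to one of its two neighbouring grid points, choosing the
    upper one with probability equal to x's fractional position between
    them, so the expected result equals x for in-range inputs."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), GRID[-1])  # saturate beyond the largest grid point
    for lo, hi in zip(GRID, GRID[1:]):
        if lo <= mag <= hi:
            p_up = (mag - lo) / (hi - lo)  # probability of rounding up
            return sign * (hi if rng.random() < p_up else lo)
    return sign * mag  # not reached for finite inputs


def mean_rounded(x: float, trials: int = 100_000, seed: int = 0) -> float:
    """Average many independent roundings of x to estimate the expectation."""
    rng = random.Random(seed)
    return sum(stochastic_round(x, rng) for _ in range(trials)) / trials
```

Round-to-nearest would map 2.3 to 2.0 on every call, a systematic error of 0.3. Here 2.3 becomes 3.0 with probability 0.3 and 2.0 otherwise, so the average over many calls approaches 2.3.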

2509.24335 2026-03-06 cs.CV cs.LG

Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

Guolin Ke, Hui Xue

Comments ICLR version

2509.24210 2026-03-06 cs.CL cs.AI cs.LG

BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models

Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre, Meng Lu, Morteza Ziyadi, Xuan Wang

Comments Accepted to ICLR 2026 Conference

English abstract

Evaluating language models fairly is increasingly difficult as static benchmarks risk contamination by training data, obscuring whether models truly reason or recall. We introduce BeyondBench, an evaluation framework using algorithmic problem generation to create mathematically grounded problems on the fly, ensuring each test remains uncontaminated. Our framework covers 44 algorithmic tasks with 117 variations across three difficulty levels: the Easy Suite (29 tasks) for arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) for NP-complete and constraint satisfaction problems. Each task draws from a space exceeding 10^15 unique instances, with deterministically verified solutions. We evaluated 101 language models (85 open-source, 16 closed-source), spanning 0.5B to 141B parameters and multiple quantization schemes, using three-fold evaluation for robustness. Results reveal consistent reasoning deficiencies, with performance degrading sharply as complexity increases. In Hard Suite evaluations, Gemini-2.5-pro, Llama-3.3-70B, and Qwen2.5-72B achieved accuracies of 56.21%, 27.16%, and 33.37% respectively. Performance drops significantly without tool usage, with GPT-5, GPT-5-mini, and GPT-5-nano showing declines of 16.81%, 15.86%, and 43.95% in overall accuracy. Contamination resistance rests on three guarantees: (i) the problem space vastly exceeds any static dataset, (ii) every instance has a deterministically verifiable solution, and (iii) isomorphic transformations yield semantically equivalent but syntactically novel problems. BeyondBench redefines reasoning evaluation via genuine algorithmic problem-solving. Our leaderboard is at https://ctrl-gaurav.github.io/BeyondBench/, Python package at https://pypi.org/project/beyondbench/, and codebase at https://github.com/ctrl-gaurav/BeyondBench.
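
The abstract describes problems generated on the fly, each with a deterministically verifiable solution. The following sketch shows that pattern for a single arithmetic task. It does not use the `beyondbench` package and makes no claim about its API: `Problem`, `generate_sum_problem`, and `verify` are hypothetical names, and the value range and list length are arbitrary choices.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    prompt: str
    answer: int


def generate_sum_problem(rng: random.Random, length: int = 8) -> Problem:
    """Draw a fresh list of integers and compute the reference answer."""
    numbers = [rng.randint(-10**6, 10**6) for _ in range(length)]
    prompt = f"Compute the sum of the following integers: {numbers}"
    return Problem(prompt=prompt, answer=sum(numbers))


def verify(problem: Problem, model_output: str) -> bool:
    """Deterministic check: the reply must parse to the exact reference answer."""
    try:
        return int(model_output.strip()) == problem.answer
    except ValueError:
        return False
```

With these illustrative settings there are (2·10^6 + 1)^8, roughly 2.56 × 10^50, distinct ordered lists, so a freshly drawn instance is very unlikely to have appeared in any training corpus.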

2509.23589 2026-03-06 cs.AI cs.CV cs.LG

BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving

Shu Liu, Wenlin Chen, Weihao Li, Zheng Wang, Lijin Yang, Jianing Huang, Yipin Zhang, Zhongzhan Huang, Ze Cheng, Hao Yang

Comments Accepted for publication at ICLR 2026

2509.23075 2026-03-06 cs.RO

In-Hand Manipulation of Articulated Tools with Dexterous Robot Hands with Sim-to-Real Transfer

Soofiyan Atar, Daniel Huang, Florian Richter, Michael Yip

2509.21739 2026-03-06 cs.SD cs.LG eess.AS

Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription

Michael Yeung, Keisuke Toyama, Toya Teramoto, Shusuke Takahashi, Tamaki Kojima

Comments Accepted to ICASSP 2026

2509.20509 2026-03-06 cs.LG cs.AI

Complexity-Regularized Proximal Policy Optimization

Luca Serfilippi, Giorgio Franceschelli, Antonio Corradi, Mirco Musolesi

2509.20321 2026-03-06 cs.CL cs.AI eess.AS

Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones

Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang, Éva Székely, James Caverlee

2509.19916 2026-03-06 cs.RO

GUIDE: A Diffusion-Based Autonomous Robot Exploration Framework Using Global Graph Inference

Zijun Che, Yinghong Zhang, Shengyi Liang, Boyu Zhou, Jun Ma, Jinni Zhou

2509.19696 2026-03-06 cs.RO cs.AI cs.LG

Diffusion-Based Impedance Learning for Contact-Rich Manipulation Tasks

Noah Geiger, Tamim Asfour, Neville Hogan, Johannes Lachner

Comments 15 pages, 12 figures

2509.14882 2026-03-06 cs.CL

Llama-Mimi: Exploring the Limits of Flattened Speech Language Modeling

Issa Sugiura, Shuhei Kurita, Yusuke Oda, Ryuichiro Higashinaka

Comments 6 pages, 1 figure

2509.12890 2026-03-06 cs.RO

Responsibility and Engagement -- Evaluating Interactions in Social Robot Navigation

Malte Probst, Raphael Wenzel, Monica Dasi

Comments Accepted at the 2026 IEEE International Conference on Robotics & Automation (ICRA)

2509.11950 2026-03-06 cs.LG

TabStruct: Measuring Structural Fidelity of Tabular Data

Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik

Comments Accepted by the Fourteenth International Conference on Learning Representations (ICLR 2026 Oral)

2509.10506 2026-03-06 cs.LG cs.CE

AttnBoost: Retail Supply Chain Sales Insights via Gradient Boosting Perspective

Yadi Liu, Xiaoli Ma, Muxin Ge, Zeyu Han, Jingxi Qiu, Ye Aung Moe, Yilan Shen, Wenbin Wei, Cheng Huang

2509.10035 2026-03-06 cs.CL

Linguistic trajectories of bipolar disorder on social media

Laurin Plank, Armin Zlomuzica

Comments Pre-print

2509.05983 2026-03-06 cs.SD cs.AI cs.CL eess.AS

TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition

Tran Nguyen Anh, Truong Dinh Dung, Vo Van Nam, Minh N. H. Nguyen

Comments Update new version

2508.21592 2026-03-06 cs.RO

Learning Agile Gate Traversal via Analytical Optimal Policy Gradient

Tianchen Sun, Bingheng Wang, Nuthasith Gerdpratoom, Longbin Tang, Yichao Gao, Lin Zhao

Comments 8 pages, 8 figures

2508.20315 2026-03-06 cs.LG

Multi-Agent Reinforcement Learning in Intelligent Transportation Systems: A Comprehensive Survey

Rexcharles Donatus, Kumater Ter, Daniel Udekwe

2508.18088 2026-03-06 cs.CL cs.LG

How Quantization Shapes Bias in Large Language Models

Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych

2508.17488 2026-03-06 cs.CV

Optimizing Multi-Modality Trackers via Significance-Regularized Tuning

Zhiwen Chen, Jinjian Wu, Zhiyu Zhu, Yifan Zhang, Guangming Shi, Junhui Hou

2508.16332 2026-03-06 cs.SD cs.AI cs.CL

Vevo2: A Unified and Controllable Framework for Speech and Singing Voice Generation

Xueyao Zhang, Junan Zhang, Yuancheng Wang, Chaoren Wang, Yuanzhe Chen, Dongya Jia, Zhuo Chen, Zhizheng Wu

Comments Accepted by the IEEE Transactions on Audio, Speech and Language Processing (TASLP)

2508.04899 2026-03-06 cs.LG

Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure Detection

Jovana Kljajic, John M. O'Toole, Robert Hogan, Tamara Skoric

2508.02464 2026-03-06 cs.CV

SAMPO-Path: Segmentation Intent-Aligned Preference Optimization for Pathology Foundation Model Segmentation

Yonghuang Wu, Wenwen Zeng, Xuan Xie, Chengqian Zhao, Guoqing Wu, Jinhua Yu

Comments 15 pages, 9 tables, 8 figures

2507.18534 2026-03-06 cs.CV cs.LG

Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models

Xingyu Qiu, Mengying Yang, Xinghua Ma, Dong Liang, Fanding Li, Gongning Luo, Wei Wang, Kuanquan Wang, Shuo Li

Comments 16 pages, 4 figures, accepted by CVPR 2026

2507.14529 2026-03-06 cs.LG math.OC

Kernel Based Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games

Berkay Anahtarci, Can Deha Kariksiz, Naci Saldi

2507.10345 2026-03-06 cs.LG

Some Super-approximation Rates of ReLU Neural Networks for Korobov Functions

Yuwen Li, Guozhi Zhang