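arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.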

2601.20896 2026-04-24 cs.SD eess.AS

A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models

Ryan Whetten, Titouan Parcollet, Marco Dinarelli, Yannick Estève

Comments Accepted for publication in the 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026)

2601.18714 2026-04-24 cs.CV cs.AI cs.LG cs.RO

Low Cost, High Efficiency: LiDAR Place Recognition in Vineyards with Matryoshka Representation Learning

Judith Vilella-Cantos, Mauro Martini, Marcello Chiaberge, Mónica Ballesta, David Valiente

2601.18672 2026-04-24 cs.LG

A Dynamic Framework for Grid Adaptation in Kolmogorov-Arnold Networks

Spyros Rigas, Thanasis Papaioannou, Panagiotis Trakadas, Georgios Alexandridis

Comments Accepted in IJCNN 2026

2601.18491 2026-04-24 cs.AI cs.CC cs.CL cs.CV cs.LG

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, Binxin Hu, Ling Tang, Jilin Mei, Dadi Guo, Leitao Yuan, Junyao Yang, Guanxu Chen, Qihao Lin, Yi Yu, Bo Zhang, Jiaxuan Guo, Jie Zhang, Wenqi Shao, Huiqi Deng, Zhiheng Xi, Wenjie Wang, Wenxuan Wang, Wen Shen, Zhikai Chen, Haoyu Xie, Jialing Tao, Juntao Dai, Jiaming Ji, Zhongjie Ba, Linfeng Zhang, Yong Liu, Quanshi Zhang, Lei Zhu, Zhihua Wei, Hui Xue, Chaochao Lu, Jing Shao, Xia Hu

Comments 40 pages, 26 figures

2601.13711 2026-04-24 cs.CL

GerAV: Towards New Heights in German Authorship Verification using Fine-Tuned LLMs on a New Benchmark

Lotta Kiefer, Christoph Leiter, Sotaro Takeshita, Elena Schmidt, Steffen Eger

2601.11044 2026-04-24 cs.AI

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu

Comments Accepted by ACL 2026 Main Conference

Abstract

Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities that can contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.

2601.10863 2026-04-24 cs.LG

Beyond Accuracy: A Stability-Aware Metric for Multi-Horizon Forecasting

Chutian Ma, Grigorii Pomazkin, Giacinto Paolo Saggese, Paul Smith

2601.10003 2026-04-24 cs.CL

SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction

Sanghyeok Choi, Woosang Jeon, Kyuseok Yang, Taehyeong Kim

2601.09926 2026-04-24 cs.LG

PROPER Agents: Proactivity Driven Personalized Agents for Advancing Knowledge Gap Navigation

Kirandeep Kaur, Vinayak Gupta, Aditya Gupta, Chirag Shah

Comments ACL 2026

2601.09361 2026-04-24 cs.LG cs.AI

GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR

Jiaying Zhang, Lei Shi, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He

Comments Accepted at ACL 2026 Main

2601.09253 2026-04-24 cs.LG cs.AI

RIFT: Repurposing Negative Samples via Reward-Informed Fine-Tuning

Zehua Liu, Shuqi Liu, Tao Zhong, Mingxuan Yuan

2601.06498 2026-04-24 cs.CL astro-ph.IM

Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Vetting via Automated Spectral Inspection

Minghui Jia, Qichao Zhang, Ali Luo, Linjing Li, Shuo Ye, Hailing Lu, Wen Hou, Dongbin Zhao

Comments Accepted to ACL 2026 Main Conference

2601.06428 2026-04-24 cs.LG

BackPlay: Head-Only Look-Back Self-Correction for Diffusion Language Models

Liming Liu, Binxuan Huang, Zixuan Zhang, Xin Liu, Bing Yin, Tuo Zhao

Comments 16 pages

2601.05563 2026-04-24 cs.CV cs.SI

What's Left Unsaid? Detecting and Correcting Misleading Omissions in Multimodal News Previews

Fanxiao Li, Jiaying Wu, Tingchao Fu, Dayang Li, Herun Wan, Wei Zhou, Min-Yen Kan

2601.03248 2026-04-24 cs.CL

STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning

Juntong Ni, Shiyu Wang, Qi He, Ming Jin, Wei Jin

Comments ACL 2026 Main, we release our code publicly at https://github.com/LingFengGold/STReasoner

2512.22274 2026-04-24 cs.CV

GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure

Leslie Gu, Junhwa Hur, Charles Herrmann, Fangneng Zhan, Todd Zickler, Deqing Sun, Hanspeter Pfister

2512.20288 2026-04-24 cs.CV cs.AI

UbiQVision: Quantifying Uncertainty in XAI for Image Recognition

Akshat Dubey, Aleksandar Anžel, Bahar İlgen, Georges Hattab

Comments Under Review. Updated manuscript. Feedback from reviewers incorporated

2512.18908 2026-04-24 cs.AI

Multimodal Bayesian Network for Robust Assessment of Casualties in Autonomous Triage

Szymon Rusiecki, Cecilia G. Morales, Kimberly Elenberg, Leonard Weiss, Artur Dubrawski

Comments Presented at NeurIPS 2025 Workshop: Structured Probabilistic Inference & Generative Modeling

2512.06171 2026-04-24 cs.CV

Automated Annotation of Shearographic Measurements Enabling Weakly Supervised Defect Detection

Jessica Plassmann, Nicolas Schuler, Michael Schuth, Georg von Freymann

Comments 13 pages, 3 figures

2512.05591 2026-04-24 cs.LG cs.CL

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

Zhenpeng Su, Leiyu Pan, Minxuan Lv, Tiehua Mei, Zijia Lin, Yuntao Li, Wenping Hu, Ruiming Tang, Kun Gai, Guorui Zhou

Comments This paper has been accepted by ACL2026

2512.03048 2026-04-24 cs.AI cs.CY cs.LG cs.MA

The Specification Trap: Why Static Value Alignment Alone Is Insufficient for Robust Alignment

Austin Spizzirri

Comments 31 pages, no figures. Version 5. First posted as arXiv:2512.03048 in November 2025. First in a six-paper research program on AI alignment

2511.22074 2026-04-24 cs.AI cs.IR

Real-Time Procedural Learning From Experience for AI Agents

Dasheng Bi, Yubin Hu, Mohammed N. Nasir

2511.21978 2026-04-24 cs.CV

PAT3D: Physics-Augmented Text-to-3D Scene Generation

Guying Lin, Kemeng Huang, Michael Liu, Ruihan Gao, Hanke Chen, Lyuhao Chen, Beijia Lu, Taku Komura, Yuan Liu, Jun-Yan Zhu, Minchen Li

Comments 19 pages, 12 figures

2511.20697 2026-04-24 cs.SD cs.AI

Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores

Congren Dai, Yue Yang, Krinos Li, Huichi Zhou, Shijie Liang, Bo Zhang, Enyang Liu, Ge Jin, Hongran An, Haosen Zhang, Peiyuan Jing, Kinhei Lee, Zhenxuan Zhang, Xiaobing Li, Maosong Sun

Comments Accepted to ACL 2026 Main Conference

2511.18539 2026-04-24 cs.LG cs.CV

TimePre: Bridging Accuracy, Efficiency, and Stability in Probabilistic Time-Series Forecasting

Lingyu Jiang, Lingyu Xu, Peiran Li, Dengzhe Hou, Qianwen Ge, Dingyi Zhuang, Shuo Xing, Wenjing Chen, Xiangbo Gao, Ting-Hsuan Chen, Xueying Zhan, Xin Zhang, Ziming Zhang, Zhengzhong Tu, Michael Zielewski, Kazunori Yamada, Fangzhou Lin

Comments 15 pages, 5 figures, 6 tables

2511.18264 2026-04-24 cs.CV

SatSAM2: Motion-Constrained Video Object Tracking in Satellite Imagery using Promptable SAM2 and Kalman Priors

Ruijie Fan, Junyan Ye, Huan Chen, Zilong Huang, Xiaolei Wang, Weijia Li

Comments 14 pages, 12 figures

2511.17085 2026-04-24 cs.LG cs.AI

Why Do Language Model Agents Whistleblow?

Kushal Agrawal, Frank Xiao, Guido Bergman, Asa Cooper Stickland

2511.11439 2026-04-24 cs.LG cs.AI

Retrofit: Continual Learning with Controlled Forgetting for Binary Security Detection and Analysis

Yiling He, Junchi Lei, Hongyu She, Shuo Shao, Xinran Zheng, Yiping Liu, Zhan Qin, Lorenzo Cavallaro

2511.08558 2026-04-24 cs.AI

Hyperdimensional Decoding of Spiking Neural Networks

Cedrick Kinavuidi, Luca Peres, Oliver Rhodes

2511.08277 2026-04-24 cs.RO cs.LG

X-IONet: Cross-Platform Inertial Odometry Network for Pedestrian and Legged Robot

Dehan Shen, Changhao Chen

Comments RA-L Accepted