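arXivDaily daily academic digest: synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.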

2505.16377 2026-03-31 cs.RO cs.AI

VLM-SAFE: Vision-Language Model-Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving

Yansong Qu, Zilin Huang, Zihao Sheng, Jiancong Chen, Yue Leng, Samuel Labi, Sikai Chen

Comments N/A

Details

English abstract

Autonomous driving policy learning with reinforcement learning (RL) is fundamentally limited by low sample efficiency, weak generalization, and a dependence on unsafe online trial-and-error interactions. Although safe RL introduces explicit constraints or costs, existing methods often fail to capture the semantic meaning of safety in real driving scenes, leading to conservative behaviors in simple cases and insufficient risk awareness in complex ones. To address this issue, we propose VLM-SAFE, an offline safe RL framework that follows a human cognitive loop of observe-imagine-evaluate-act. Starting from offline driving data, VLM-SAFE observes traffic scenarios and leverages a vision-language model (VLM) to provide semantic safety signals grounded in scene understanding. A learned world model then imagines future trajectories from the observed context, enabling the agent to reason about possible consequences without interacting with the real environment. Rather than using imagined rollouts solely for return estimation, VLM-SAFE further evaluates these predicted futures with VLM-based safety guidance, explicitly coupling future anticipation with semantic risk assessment. The resulting safety-aware imagined experience is finally used to optimize the policy via actor-critic learning, such that actions are chosen based on both predicted outcomes and their safety implications. By tightly integrating observation, imagination, evaluation, and action into a unified closed loop, VLM-SAFE enables safer and more efficient offline policy learning for autonomous driving. Extensive experiments in simulation show that VLM-SAFE achieves improved safety, stronger robustness under traffic-density shift, and a better safety-performance trade-off than representative baselines.


2505.15392 2026-03-31 cs.CL

Understanding the Anchoring Effect of LLM with Synthetic Data: Existence, Mechanism, and Potential Mitigations

Yiming Huang, Biquan Bie, Zuqiu Na, Weilin Ruan, Songxin Lei, Yutao Yue, Xinlei He

Comments Accepted by the HCAIR workshop of ICLR 2026

2505.09218 2026-03-31 cs.LG cs.DC math.OC

Birch SGD: A Tree Graph Framework for Local and Asynchronous SGD Methods

Alexander Tyurin, Danil Sivtsov

2505.04594 2026-03-31 cs.CV

Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection

Zhihao Zhang, Abhinav Kumar, Girish Chandar Ganesan, Xiaoming Liu

2505.03821 2026-03-31 cs.CV cs.AI

Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

Gracjan Góral, Alicja Ziarko, Piotr Miłoś, Michał Nauman, Maciej Wołczyk, Michał Kosiński

Comments Accepted at CVPR 2026 Findings

2504.19467 2026-03-31 cs.CL cs.AI

BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text

Jiageng Wu, Bowen Gu, Ren Zhou, Kevin Xie, Doug Snyder, Yixing Jiang, Valentina Carducci, Richard Wyss, Rishi J Desai, Emily Alsentzer, Leo Anthony Celi, Adam Rodman, Sebastian Schneeweiss, Jonathan H. Chen, Santiago Romero-Brufau, Kueiyu Joshua Lin, Jie Yang

Details

English abstract

Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, benchmarking on large-scale real-world data such as electronic health records (EHRs) is critical, as clinical decisions are directly informed by these sources, yet current evaluations remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world clinical data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. It covers eight major task types spanning the entire continuum of patient care across six clinical stages and 20 representative applications, including triage and referral, consultation, information extraction, diagnosis, prognosis, and billing coding, and involves 14 clinical specialties. We systematically evaluated 95 LLMs (including DeepSeek-R1, GPT-4o, Gemini series, and Qwen3 series) under various inference strategies. Our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models. The BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding. The BRIDGE leaderboard: https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard


2504.17817 2026-03-31 cs.CV cs.RO

Learning Underwater Active Perception in Simulation

Alexandre Cardaillac, Donald G. Dansereau

2504.11967 2026-03-31 cs.CV cs.AI cs.RO

Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions

Yifei Dong, Fengyi Wu, Sanjian Zhang, Guangyu Chen, Yuzhi Hu, Masumi Yano, Jingdong Sun, Siyu Huang, Feng Liu, Qi Dai, Zhi-Qi Cheng

Comments Accepted to CVPR 2025 Anti-UAV Workshop (Best Paper Award), 16 pages

2504.06188 2026-03-31 cs.AI cs.CL cs.MA

SkillFlow: Scalable and Efficient Agent Skill Retrieval System

Fangzhou Li, Pagkratios Tagkopoulos, Ilias Tagkopoulos

2504.05523 2026-03-31 cs.CL

Pretraining Language Models for Diachronic Linguistic Change Discovery

Elisabeth Fittschen, Sabrina Li, Tom Lippincott, Leshem Choshen, Craig Messner

Comments Accepted to Findings of the EACL 2026

Details

DOI: 10.18653/v1/2026.findings-eacl.241

English abstract

Large language models (LLMs) have shown potential as tools for scientific discovery. This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and literary studies. These fields often construct arguments on the basis of delineations like genre, or more inflexibly, time period. Although efforts have been made to restrict inference to specific domains via fine-tuning or model editing, we posit that the only true guarantee is domain-restricted pretraining -- typically, a data- and compute-expensive proposition. We show that efficient pretraining techniques can produce useful models over corpora too large for easy manual inspection but too small for "typical" LLM approaches. We employ a novel date-attribution pipeline in order to obtain a temporally-segmented dataset of five 10-million-word slices. We train two corresponding five-model batteries over these corpus segments, efficient pretraining and Llama3-8B parameter efficiently finetuned. We find that the pretrained models are faster to train than the finetuned baselines and that they better respect the historical divisions of our corpus. Emphasizing speed and precision over a-historical comprehensiveness enables a number of novel approaches to hypothesis discovery and testing in our target fields. Taking up diachronic linguistics as a testbed, we show that our method enables the detection of a diverse set of phenomena, including en masse lexical change, non-lexical (grammatical and morphological) change, and word sense introduction/obsolescence. We provide a ready-to-use pipeline that allows extension of our approach to other target fields with only minimal adaptation.


2504.03616 2026-03-31 cs.CL cs.AI

Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task

Leonardo Ranaldi, Barry Haddow, Alexandra Birch

2504.02572 2026-03-31 cs.CL

Cultural Biases of Large Language Models and Humans in Historical Interpretation

Fabio Celli, Georgios Spathulas

2503.09750 2026-03-31 cs.CV

SASNet: Spatially-Adaptive Sinusoidal Networks for INRs

Haoan Feng, Diana Aldana, Tiago Novello, Leila De Floriani

Comments CVPR2026, 10 pages, 10 figures, suppl included

2503.09008 2026-03-31 cs.LG cs.AI

Towards Quantifying Long-Range Interactions in Graph Machine Learning: a Large Graph Dataset and a Measurement

Huidong Liang, Haitz Sáez de Ocáriz Borde, Baskaran Sripathmanathan, Michael Bronstein, Xiaowen Dong

Comments Published as a conference paper at ICLR 2026

2503.05371 2026-03-31 cs.LG cs.AI cs.CL

Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs

Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke

Comments Published to EACL Findings 2026

2503.04522 2026-03-31 cs.CV

ConfIC-RCA: Statistically Grounded Efficient Estimation of Segmentation Quality

Matias Cosarinsky, Ramiro Billot, Lucas Mansilla, Gabriel Jimenez, Nicolas Gaggión, Guanghui Fu, Tom Tirer, Enzo Ferrante

Comments Accepted for publication at TMI

2502.18273 2026-03-31 cs.CL

Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization

Ru Wang, Wei Huang, Selena Song, Haoyu Zhang, Qian Niu, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo

Comments Accepted at the Conference on Parsimony and Learning (CPAL) 2026

2502.12616 2026-03-31 cs.CL

Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions

Leonardo Ranaldi, Marco Valentino, Andrè Freitas

Details

DOI: 10.18653/v1/2025.acl-long.843
Journal ref: 2025.acl-long.843

English abstract

Chain-of-Thought (CoT) represents a common strategy for reasoning in Large Language Models (LLMs) by decomposing complex tasks into intermediate inference steps. However, explanations generated via CoT are susceptible to content biases that negatively affect their robustness and faithfulness. To mitigate existing limitations, recent work has proposed using logical formalisms coupled with external symbolic solvers. However, fully symbolic approaches possess the bottleneck of requiring a complete translation from natural language to formal languages, a process that affects efficiency and flexibility. To achieve a trade-off, this paper investigates methods to disentangle content from logical reasoning without a complete formalisation. In particular, we present QuaSAR (for Quasi-Symbolic Abstract Reasoning), a variation of CoT that guides LLMs to operate at a higher level of abstraction via quasi-symbolic explanations. Our framework leverages the capability of LLMs to formalise only relevant variables and predicates, enabling the coexistence of symbolic elements with natural language. We show the impact of QuaSAR for in-context learning and for constructing demonstrations to improve the reasoning capabilities of smaller models. Our experiments show that quasi-symbolic abstractions can improve CoT-based methods by up to 8% accuracy, enhancing robustness and consistency on challenging adversarial variations on both natural language (i.e., MMLU-Redux) and symbolic reasoning tasks (i.e., GSM-Symbolic).


2501.16997 2026-03-31 cs.CV cs.LG cs.RO

Resolving Spatio-Temporal Entanglement in Video Prediction via Multi-Modal Attention

Shreyam Gupta, P. Agrawal, Priyam Gupta

Comments 11 pages, 3 figures, 5 tables, and 3 Algorithms

2501.16448 2026-03-31 cs.AI cs.LG

Information-theoretic Distinctions Between Deception and Confusion

Robin Young

Comments Proceedings of the 14th IJCNLP and the 4th AACL (2025)

2501.15446 2026-03-31 cs.CL cs.AI

NP-Hard Lower Bound Complexity for Semantic Self-Verification

Robin Young

Comments EACL 2026

2501.10677 2026-03-31 cs.LG cs.AI q-fin.RM

Class-Imbalanced-Aware Adaptive Dataset Distillation for Scalable Pretrained Model on Credit Scoring

Xia Li, Hanghang Zheng, Xiwei Zhuang, Zhong Wang, Xiao Chen, Hong Liu, Jasmine Bai, Mao Mao

2501.08096 2026-03-31 cs.RO cs.AI cs.ET cs.LG

Hybrid Action Based Reinforcement Learning for Multi-Objective Compatible Autonomous Driving

Guizhe Jin, Zhuoren Li, Bo Leng, Wei Han, Lu Xiong, Chen Sun

Comments 14 pages, accepted for publication in IEEE Transactions on Neural Networks and Learning Systems (T-NNLS)

2501.05418 2026-03-31 cs.RO

Integrated Shape-Force Estimation for Continuum Robots: A Virtual-Work and Polynomial-Curvature Framework

Guoqing Zhang, Zihan Chen, Long Wang

2501.00200 2026-03-31 cs.LG cs.CR math.OC

Scalable Neural Network Verification with Branch-and-bound Inferred Cutting Planes

Duo Zhou, Christopher Brix, Grani A Hanasusanto, Huan Zhang

Comments Accepted by NeurIPS 2024. BICCOS is part of the alpha-beta-CROWN verifier, the VNN-COMP 2024 winner; fixed Theorem 3.2 and clarified experimental results

2412.16906 2026-03-31 cs.CV

Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation

Quan Dao, Hao Phung, Trung Dao, Dimitris Metaxas, Anh Tran

Comments Accepted to AAAI 2025. Code: https://github.com/hao-pt/SCFlow.git

2412.14015 2026-03-31 cs.CV

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang

Comments CVPR 2025; Project page: https://PromptDA.github.io/

2412.05386 2026-03-31 cs.CV

DIFEM: Key-points Interaction based Feature Extraction Module for Violence Recognition in Videos

Himanshu Mittal, Suvramalya Basak, Anjali Gautam

Comments Accepted in Signal Image and Video Processing

2411.06197 2026-03-31 cs.CV

Tracking by Detection and Query: An Efficient End-to-End Framework for Multi-Object Tracking

Shukun Jia, Shiyu Hu, Yichao Cao, Feng Yang, Xin Lu, Xiaobo Lu

Comments Accepted by Pattern Recognition

Details

DOI: 10.1016/j.patcog.2026.113552

English abstract

Multi-object tracking (MOT) is primarily dominated by two paradigms: tracking-by-detection (TBD) and tracking-by-query (TBQ). While TBD offers modular efficiency, its fragmented association pipeline often limits robustness in complex scenarios. Conversely, TBQ enhances semantic modeling end-to-end but suffers from high training costs and slow inference due to the tight coupling of detection and association. In this work, we propose the tracking-by-detection-and-query framework, TBDQ-Net, to advance the synergy between TBD and TBQ paradigms. By integrating a frozen detector with a lightweight associator, this architecture ensures intrinsic efficiency. Within this streamlined framework, we introduce tailored designs to address MOT-specific challenges. Concretely, we alleviate task conflicts and occlusions through the dual-stream update of the Basic Information Interaction (BII) module. The Content-Position Alignment (CPA) module further refines both content and positional components, providing well-aligned representations for association decoding. Extensive evaluations on DanceTrack, SportsMOT, and MOT20 benchmarks demonstrate that TBDQ-Net achieves a favorable efficiency-accuracy trade-off in challenging scenarios. Specifically, TBDQ-Net outperforms leading TBD methods by 6.0 IDF1 points on DanceTrack and achieves the best performance among TBQ methods in the crowded MOT20 benchmark. Relative to MOTRv2, TBDQ-Net reduces trainable parameters by approximately 80% while accelerating practical inference by 37.5%. These results highlight TBDQ-Net as an efficient alternative to heavy architectures, showcasing the efficacy of lightweight design. Source code is publicly available at https://github.com/FaithFlow/TBDQ-Net.


2410.21086 2026-03-31 cs.CV cs.AI

Efficient Mixture-of-Expert for Video-based Driver State and Physiological Multi-task Estimation in Conditional Autonomous Driving

Jiyao Wang, Xiao Yang, Zhenyu Wang, Ximeng Wei, Ange Wang, Dengbo He, Kaishun Wu