arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2603.18009 2026-03-20 cs.CL cs.AI

How Confident Is the First Token? An Uncertainty-Calibrated Prompt Optimization Framework for Large Language Model Classification and Understanding

Wei Chen, Guoyang Ju, Yuanyuan Qi

Comments 27 pages,16 figures

详情

英文摘要

With the widespread adoption of large language models (LLMs) in natural language processing, prompt engineering and retrieval-augmented generation (RAG) have become mainstream to enhance LLMs' performance on complex tasks. However, LLMs generate outputs autoregressively, leading to inevitable output uncertainty. Since model performance is highly sensitive to prompt design, precise uncertainty measurement is crucial for reliable prompt optimization. For multi-class multiple-choice (understanding) tasks, conventional uncertainty measures (e.g., entropy) based on output probabilities treat all classes equally and ignore class prior differences in pretraining corpora. This failure to distinguish spurious confidence (from priors) from true certainty (from contextual understanding) results in poor confidence calibration. To address this, we propose Log-Scale Focal Uncertainty (LSFU), a first-token-based metric inspired by focal loss. LSFU incorporates label prior probabilities as a risk-modulation factor to suppress noise from high-frequency classes and emphasize risk for low-frequency long-tail classes, with a dynamic weighting mechanism unifying the measurement scale. Based on LSFU, we further propose the uncertainty-calibrated prompt optimization framework (UCPOF), which leverages the first token of model outputs to select high-quality exemplars and dynamically optimize prompts. Comprehensive evaluations show UCPOF improves average accuracy by 6.03% over few-shot baselines, surpasses always-on full RAG by 5.75% in overall average accuracy, and reduces the average retrieval trigger rate by 50.66%. By adaptively triggering RAG only for high-uncertainty samples, our framework significantly lowers computational costs while maintaining state-of-the-art performance.

URL PDF HTML ☆

赞 0 踩 0

2603.18008 2026-03-20 cs.CL cs.AI cs.CY

TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

Fangrui Huang, Souhad Chbeir, Arpandeep Khatua, Sheng Wang, Sijun Tan, Kenan Ye, Lily Bailey, Merryn Daniel, Ryan Louie, Sanmi Koyejo, Ehsan Adeli

2603.18007 2026-03-20 cs.CL cs.AI

Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm

Anna Babarczy, Andras Lukacs, Peter Vedres, Zeteny Bujka

Comments 19 pages, 2 figures, 6 tables

2603.17497 2026-03-20 cs.RO cs.SY eess.SY

From Optimizable to Interactable: Mixed Digital Twin-Empowered Testing of Vehicle-Infrastructure Cooperation Systems

Jianghong Dong, Chunying Yang, Mengchi Cai, Chaoyi Chen, Qing Xu, Jianqiang Wang, Keqiang Li

2603.17380 2026-03-20 cs.LG cs.AI q-bio.QM

SCALE:Scalable Conditional Atlas-Level Endpoint transport for virtual cell perturbation prediction

Shuizhou Chen, Lang Yu, Kedu Jin, Songming Zhang, Hao Wu, Wenxuan Huang, Sheng Xu, Quan Qian, Qin Chen, Lei Bai, Siqi Sun, Zhangyang Gao

2603.17314 2026-03-20 cs.CV

A Proposal-Free Query-Guided Network for Grounded Multimodal Named Entity Recognition

Hongbing Li, Jiamin Liu, Shuo Zhang, Bo Xiao

Comments There is an error in the methods section

2603.17024 2026-03-20 cs.CV cs.AI cs.CL

HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin

Comments 28 pages, 8 figures, 2 tables

详情

英文摘要

Vision-language models (VLMs) show strong multimodal capabilities but still struggle with fine-grained vision-language reasoning. We find that long chain-of-thought (CoT) reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for reinforcement learning with verifiable rewards (RLVR) does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B under two RLVR settings: the original data alone, and the original data plus HopChain's multi-hop data, and compare them across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized for any specific benchmark, it improves 20 of 24 benchmarks on both models, indicating broad and generalizable gains. Consistently, replacing full chained queries with half-multi-hop or single-hop variants reduces the average score across five representative benchmarks from 70.4 to 66.7 and 64.3, respectively. Notably, multi-hop gains peak in long-CoT vision-language reasoning, exceeding 50 points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.

URL PDF HTML ☆

赞 0 踩 0

2603.16952 2026-03-20 cs.RO cs.AI

Embodied Foundation Models at the Edge: A Survey of Deployment Constraints and Mitigation Strategies

Utkarsh Grover, Ravi Ranjan, Mingyang Mao, Trung Tien Dong, Satvik Praveen, Zhenqi Wu, J. Morris Chang, Tinoosh Mohsenin, Yi Sheng, Agoritsa Polyzou, Eiman Kanjo, Xiaomin Lin

2603.16744 2026-03-20 cs.AI cs.SI

Nonstandard Errors in AI Agents

Ruijiang Gao, Steven Chong Xiao

Comments 45 pages

2603.16606 2026-03-20 cs.CL

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

Omnilingual SONAR Team, João Maria Janeiro, Pere-Lluís Huguet Cabot, Ioannis Tsiamas, Yen Meng, Vivek Iyer, Guillem Ramírez, Loic Barrault, Belen Alastruey, Yu-An Chung, Marta R. Costa-Jussa, David Dale, Kevin Heffernan, Jaehyeong Jo, Artyom Kozhevnikov, Alexandre Mourachko, Christophe Ropers, Holger Schwenk, Paul-Ambroise Duquenne

详情

英文摘要

Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousands language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on 1,560 languages into English BIBLE translation. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.

URL PDF HTML ☆

赞 0 踩 0

2603.16098 2026-03-20 cs.CV cs.AI

LICA: Layered Image Composition Annotations for Graphic Design Research

Elad Hirsch, Shubham Yadav, Mohit Garg, Purvanshi Mehta

2603.16060 2026-03-20 cs.AI

ARISE: Agent Reasoning with Intrinsic Skill Evolution in Hierarchical Reinforcement Learning

Yu Li, Rui Miao, Zhengling Qi, Tian Lan

2603.15978 2026-03-20 cs.AI

From Workflow Automation to Capability Closure: A Formal Framework for Safe and Revenue-Aware Customer Service AI

Cosimo Spera

2603.15973 2026-03-20 cs.AI

Safety is Non-Compositional: A Formal Framework for Capability-Based AI Systems

Cosimo Spera

2603.15888 2026-03-20 cs.AI cs.CV cs.RO

AsgardBench -- Evaluating Visually Grounded Interactive Planning Under Minimal Feedback

Andrea Tupini, Lars Liden, Reuben Tan, Yu Wang, Jianfeng Gao

Comments 19 figures, 6 tables, including appendix

2603.15678 2026-03-20 cs.LG cs.AI

Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales

Yongzhong Xu

Comments 16 pages, 4 figures

2603.15434 2026-03-20 cs.AI

Listening to the Echo: User-Reaction Aware Policy Optimization via Scalar-Verbal Hybrid Reinforcement Learning

Jing Ye, Xinpei Zhao, Lu Xiang, Yaping Zhang, Chengqing Zong

Comments Updated case study figures for clarity

2603.12696 2026-03-20 cs.RO cs.CV

HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation

Zihui Yu, Pingcong Li, Bichi Zhang, Sören Schwertfeger

详情

英文摘要

Vision-and-Language Navigation (VLN) is shifting from rigid, step-by-step instruction following toward open-vocabulary, goal-oriented autonomy. Achieving this transition without exhaustive routing prompts requires agents to leverage structural priors. While prior work often assumes computationally heavy 2D/3D metric maps, we instead exploit a lightweight, text-based osmAG (OpenStreetMap Area Graph), a floorplan-level topological representation that is easy to obtain and maintain. However, global planning over a prior map alone is brittle in real-world deployments, where local connectivity can change (e.g., closed doors or crowded passages), leading to execution-time failures. To address this gap, we propose a hierarchical navigation framework HaltNav that couples the robust global planning of osmAG with the local exploration and instruction-grounding capability of VLN. Our approach features an MLLM-based brain module, which is capable of high-level task grounding and obstruction awareness. Conditioned on osmAG, the brain converts the global route into a sequence of localized execution snippets, providing the VLN executor with prior-grounded, goal-centric sub-instructions. Meanwhile, it detects local anomalies via a mechanism we term Reactive Visual Halting (RVH), which interrupts the local control loop, updates osmAG by invalidating the corresponding topology, and triggers replanning to orchestrate a viable detour. To train this halting capability efficiently, we introduce a data synthesis pipeline that leverages generative models to inject realistic obstacles into otherwise navigable scenes, substantially enriching hard negative samples. Extensive experiments demonstrate that our hierarchical framework outperforms several baseline methods without tedious language instructions, and significantly improves robustness for long-horizon vision-language navigation under environmental changes.

URL PDF HTML ☆

赞 0 踩 0

2603.11746 2026-03-20 cs.CV

SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory

Dingcheng Zhen, Xu Zheng, Ruixin Zhang, Zhiqi Jiang, Yichao Yan, Ming Tao, Shunshun Yin

2603.11239 2026-03-20 cs.AI

Reversible Lifelong Model Editing via Semantic Routing-Based LoRA

Haihua Luo, Xuming Ran, Tommi Kärkkäinen, Zhonghua Chen, Jiangrong Shen, Qi Xu, Fengyu Cong

2603.11211 2026-03-20 cs.CV cs.AI

A Simple Efficiency Incremental Learning Framework via Vision-Language Model with Nonlinear Multi-Adapters

Haihua Luo, Xuming Ran, Jiangrong Shen, Timo Hämäläinen, Zhonghua Chen, Qi Xu, Fengyu Cong

2603.11201 2026-03-20 cs.LG cs.AI

Representation Finetuning for Continual Learning

Haihua Luo, Xuming Ran, Tommi Kärkkäinen, Huiyan Xue, Zhonghua Chen, Qi Xu, Fengyu Cong

2603.11149 2026-03-20 cs.LG cs.CR

Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

Xiangwen Wang, Ananth Balashankar, Varun Chandrasekaran

2603.11101 2026-03-20 cs.RO cs.AI cs.DC

Thousand-GPU Large-Scale Training and Optimization Recipe for AI-Native Cloud Embodied Intelligence Infrastructure

Yongjian Guo, Yunxuan Ma, Haoran Sun, Zhong Guan, Shuai Di, Jing Long, Wanting Xu, Xiaodong Bai, Wen Huang, Yucheng Guo, Chen Zhou, Qiming Yang, Mingxi Luo, Tianyun Zhao, Hedan Yang, Song Wang, Xiaomeng Tian, Xiaolong Xiang, Zhen Sun, Yu Wei, Luqiao Wang, Yuzhen Li, Chenfeng Gu, Junwu Xiong, Yicheng Gong

2603.09909 2026-03-20 cs.AI

MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems

Yunhang Qian, Xiaobin Hu, Jiaquan Yu, Siyang Xin, Xiaokun Chen, Jiangning Zhang, Peng-Tao Jiang, Jiawei Liu, Hongwei Bran Li

2603.09022 2026-03-20 cs.AI

MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games

Yunfei Xie, Kevin Wang, Bobby Cheng, Jianzhu Yao, Zhizhou Sha, Alexander Duffy, Yihan Xi, Hongyuan Mei, Cheston Tan, Chen Wei, Pramod Viswanath, Zhangyang Wang

Comments Code has been released https://github.com/openverse-ai/MEMO

2603.07313 2026-03-20 cs.LG cs.AI stat.ML

Adversarial Latent-State Training for Robust Policies in Partially Observable Domains

Angad Singh Ahuja

Comments 30 pages, 8 figures

2603.03740 2026-03-20 cs.RO

Whole-Body Safe Control of Robotic Systems with Koopman Neural Dynamics

Sebin Jung, Abulikemu Abuduweili, Jiaxing Li, Changliu Liu

2603.01122 2026-03-20 cs.RO

Fast Confidence-Aware Human Prediction via Hardware-accelerated Bayesian Inference for Safe Robot Navigation

Michael Lu, Minh Bui, Xubo Lyu, Mo Chen

Comments Update the paper

2602.23696 2026-03-20 cs.LG cs.AI

Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training

Yongzhong Xu

Comments 23 pages, 4 figures