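arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.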

2605.25404 2026-05-26 cs.CL eess.AS

Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems

Yizhou Peng, Ziyang Ma, Changsong Liu, Yi-Wen Chao, Xie Chen, Eng Siong Chng

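AI summary: This paper proposes a cause-aware error recovery paradigm that uses fine-grained detectors to disentangle perception, comprehension, and deletion errors in ASR, enabling the LLM to carry out targeted multi-turn clarification strategies, which markedly reduces word error rate and improves downstream task performance.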

Abstract

Cascaded Automatic Speech Recognition -- Large Language Model (ASR-LLM) pipelines remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However, cascaded systems suffer from error propagation, as transcription failures inevitably cascade to subsequent components, thereby degrading the final interaction quality. Although ASR confidence scores offer a simple filter for unreliable inputs, this approach is fundamentally limited because it typically fails to detect deletion errors or to distinguish between acoustic (inability to hear clearly) and linguistic (inability to understand) mismatches, both of which require targeted recovery strategies. In this paper, we propose a cause-aware error recovery paradigm that fundamentally rethinks robustness in SDS. Unlike traditional confidence filtering, we introduce a suite of small precision-focused detectors that exploit deep ASR latent representations to disentangle token-level errors into perception, comprehension, and deletion failures. This fine-grained diagnostic intelligence empowers the LLM to orchestrate targeted, multi-turn clarification strategies, effectively transforming ambiguous signals into seamless user interactions. Experimental results validate the precision of our approach, which more than doubles the recall on domain-shift errors (57.96% vs. 23.66%) compared to baselines. Crucially, this diagnostic precision yields up to a 30% reduction in WER and a 17% improvement on the downstream task across diverse accents, distortions, and domains.
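
The routing idea in this abstract, where each diagnosed cause gets its own clarification strategy, can be illustrated with a small sketch. This is not the authors' implementation: the `ErrorCause`, `Diagnosis`, and `clarification_prompt` names and the prompt wording are hypothetical, and in the paper the LLM orchestrates the clarification rather than a fixed template.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCause(Enum):
    PERCEPTION = "perception"        # acoustic: the system could not hear clearly
    COMPREHENSION = "comprehension"  # linguistic: heard, but could not understand
    DELETION = "deletion"            # a word is missing from the transcript


@dataclass
class Diagnosis:
    cause: ErrorCause
    position: int  # token index in the transcript (gap index for deletions)


def clarification_prompt(tokens: list[str], diag: Diagnosis) -> str:
    """Pick a clarification question that matches the diagnosed cause (hypothetical templates)."""
    if diag.cause is ErrorCause.DELETION:
        if diag.position == 0:
            return "I think I missed the beginning. Could you repeat it?"
        previous = tokens[diag.position - 1]
        return f"I think I missed something after '{previous}'. Could you repeat that part?"
    word = tokens[diag.position]
    if diag.cause is ErrorCause.PERCEPTION:
        return f"Sorry, I did not catch the word near '{word}'. Could you say it again?"
    return f"I heard '{word}' but I am not sure what it means. Could you spell or rephrase it?"
```

A plain confidence threshold could only trigger a generic "please repeat" request; keeping the cause explicit is what lets the follow-up question differ between a mishearing, an unfamiliar word, and a dropped word.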

2605.25401 2026-05-26 cs.RO

Path Following Control System of Line-of-Sight Guidance for Robotic Dolphin with Multi-Link Mechanism in Underwater Simulator

Takumi Asada, Takao Oki, Hideo Furuhashi, Kenta Tabata, Renato Miyagusuku, Koichi Ozaki

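AI summary: For multi-link biomimetic autonomous underwater vehicles (BAUVs), the paper proposes a path following control system based on line-of-sight guidance, and uses an underwater simulator to determine parameters and evaluate control methods.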

DOI: 10.1109/sii64115.2026.11404481
Journal ref: 2026 IEEE/SICE International Symposium on System Integration (SII). IEEE, 2026. p. 844-849

Abstract

Biomimetic autonomous underwater vehicles (BAUVs) with multi-link mechanisms are widely used in aquatic life observation and environmental surveys because of their low power consumption and high maneuverability. An environmental survey requires a path following system that automatically follows specified points. However, path following systems for BAUVs are limited, and their evaluation on multi-link mechanism robots has not yet been clarified. A path following system for a BAUV requires prior simulation because the model differs depending on the type of biomimetics. In this study, we propose a path following system for BAUVs with a multi-link mechanism and evaluate it in an underwater simulation. The results show that it is possible to design a path following system suitable for BAUVs, determine its parameters using a simulator, and evaluate control methods.
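
For readers unfamiliar with line-of-sight (LOS) guidance, the sketch below shows the common lookahead-based form for a straight path segment in the plane. It is a textbook formulation given for orientation only; the abstract does not state which LOS variant or parameters the authors use, and the function name `los_heading` and the planar x/y frame are assumptions.

```python
import math


def los_heading(p0, p1, pos, lookahead):
    """Return (cross_track_error, desired_heading) for the segment p0 -> p1.

    p0, p1, pos are (x, y) tuples in a common planar frame; angles are in radians.
    """
    path_angle = math.atan2(p1[1] - p0[1], p1[0] - p0[0])
    dx, dy = pos[0] - p0[0], pos[1] - p0[1]
    # Signed cross-track error: positive on the left of the path
    # in a standard x-right, y-up frame.
    cross_track = -dx * math.sin(path_angle) + dy * math.cos(path_angle)
    # Steer back toward the path; a larger lookahead gives a gentler approach.
    desired = path_angle + math.atan2(-cross_track, lookahead)
    # Wrap the result to [-pi, pi].
    desired = math.atan2(math.sin(desired), math.cos(desired))
    return cross_track, desired
```

The lookahead distance is the kind of parameter that is typically tuned in simulation: a small value converges quickly but can oscillate, while a large value is smoother but slower to reach the path.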

2605.25399 2026-05-26 cs.AI

Second Guess: Detecting Uncertainty in Small Language Models via Abstention and Answer Stability

Ashwath Vaithinathan Aravindan, Mayank Kejriwal

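AI summary: Proposes Second Guess, a lightweight, parameter-free prompting technique that adds an "I don't know" option and observes answer stability to enable abstention in multiple-choice question answering, effectively detecting uncertainty in small language models.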

Abstract

Large language models often generate confident but incorrect answers rather than abstaining when uncertain. This problem is particularly acute for small language models (SLMs), where computational constraints and autonomous operation amplify the need for reliable uncertainty detection. We propose Second Guess, a lightweight, parameter-free prompting technique for abstention in multiple-choice question answering (MCQA) that is well-suited for SLMs. Our key empirical insight is that models which truly know an answer will select it consistently, while uncertain models exhibit unstable behavior when an "I don't know" option is added. Evaluated on four open models (2B-8B parameters) and four benchmarks, Second Guess achieves the highest composite risk improvement of 10.81%. Notably, it maintains an 8% composite risk improvement on fine-tuned models where entropy-based methods degrade, and improves most for lower-performing models. All code and results required to reproduce this work are available at https://github.com/Mystic-Slice/second-guess
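
The stability check described in this abstract can be sketched as two queries per question. The decision rule below (abstain if the model picks the added option or changes its answer) is an assumption based on the abstract alone; the authors' repository is the reference for the actual procedure. The `ask` callable is a hypothetical stand-in for a model call that returns the chosen option text.

```python
from typing import Callable, Optional, Sequence

IDK = "I don't know"


def second_guess(
    ask: Callable[[str, Sequence[str]], str],
    question: str,
    options: Sequence[str],
) -> Optional[str]:
    """Return the model's answer, or None to abstain."""
    # First pass: the original options only.
    first = ask(question, list(options))
    # Second pass: the same options plus an explicit "I don't know".
    second = ask(question, list(options) + [IDK])
    # Abstain when the model takes the IDK option or changes its mind.
    if second == IDK or second != first:
        return None
    return first
```

Because the check needs only two prompts and no access to logits or extra parameters, it fits the parameter-free framing in the abstract.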

2605.25393 2026-05-26 cs.RO

Decision-Making with Lightweight Confidence-Aware Language Model for Autonomous Driving

Ruoyu Yao, Ruiguo Zhong, Pei Liu, Mingxing Peng, Rui Yang, Jun Ma

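AI summary: Proposes a decision-making framework built on a lightweight confidence-aware language model: multi-agent collaboration generates confidence-annotated decision demonstrations, which are distilled into a dual-head lightweight model that achieves SOTA success rates and low latency on nuPlan.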

Comments: 8 pages, 3 figures, ITSC 2026

Abstract

Large Language Models (LLMs) and Multimodal LLMs (MLLMs) have demonstrated immense potential in autonomous driving (AD) by offering human-like reasoning and open-world generalization. However, the excessive computational overhead and high inference latency of these massive models severely hinder their deployment in resource-constrained AD systems. To address this challenge, we propose a novel decision-making framework utilizing a lightweight confidence-aware language model, which bridges the gap between complex multimodal intention reasoning and efficient inference. Specifically, we design a multi-agent collaborative workflow, comprising action voting, confidence assessment, and summarization agents, to generate high-quality, confidence-annotated decision demonstrations via explicit Chain-of-Thought (CoT) reasoning. These demonstrations are then distilled into a lightweight language model featuring a dual-head architecture, enabling the joint prediction of decision probabilities and the generation of textual rationales. The distillation is realized via a confidence-aware fine-tuning strategy coupled with Retrieval Augmented Generation (RAG) to enhance the model's adaptability and data efficiency. Comprehensive closed-loop experiments on the nuPlan benchmark demonstrate that our approach achieves state-of-the-art (SOTA) success rates in both regular and long-tail scenarios while maintaining low inference latency.

2605.25391 2026-05-26 cs.LG eess.SP

A Context Augmented Multi-Play Multi-Armed Bandit Algorithm for Fast Channel Allocation in Opportunistic Spectrum Access

Ruiyu Li, Guangxia Li, Xiao Lu, Jichao Liu, Yan Jin

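AI summary: For channel allocation in opportunistic spectrum access, proposes a context-augmented multi-play multi-armed bandit algorithm that models channel noise as a perturbation of the reward function and uses channel state information as context, deriving two index policies for linear and nonlinear correlations respectively, and achieving low regret and more reasonable selection of sub-optimal arms.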

Comments: Accepted by ISCC'24

Abstract

We study the restless contextual multi-play multi-armed bandit (MP-MAB) problem for channel allocation in the opportunistic spectrum access (OSA) scenario. Most existing MP-MAB methods are impractical for real-world OSA systems as they assume many ideal conditions, incur a heavy computational cost, and most importantly, ignore the impact of channel noise, which is directly related to the quality of service. In this study, we capture this impact by modeling channel noise as a perturbation of the arm's reward function in MP-MAB. As there is an implicit correlation between channel state information and channel noise, we take the former as a context for MP-MAB to represent the perturbation caused by the latter. We investigate two types of correlation between the context and the perturbation -- linear and nonlinear, and derive two index policies, respectively. These policies learn the correlations through a linear model and a neural network, and use the estimated noise value to adjust the upper confidence bound. Numerical experiments demonstrate that the proposed policies can achieve lower regret and select sub-optimal arms in a more reasonable way.
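
To make the "adjust the upper confidence bound with an estimated noise value" step concrete, here is a minimal sketch of a multi-play selection rule. It uses the standard UCB1 bonus and simply subtracts a per-arm noise estimate; that subtraction, the linear estimator, and all function names are assumptions for illustration, since the abstract does not give the exact index the authors derive.

```python
import math


def linear_noise_estimate(weights, context):
    """Estimate channel noise as a linear function of the context vector (assumed model)."""
    return sum(w * c for w, c in zip(weights, context))


def select_channels(means, counts, t, noise, k):
    """Pick the k arms with the highest noise-adjusted UCB index.

    means[i]  : empirical mean reward of arm i
    counts[i] : number of times arm i has been played (unplayed arms rank first)
    t         : current round, t >= 1
    noise[i]  : estimated noise for arm i, subtracted from the index (assumption)
    """
    def index(i):
        if counts[i] == 0:
            return math.inf
        bonus = math.sqrt(2.0 * math.log(t) / counts[i])
        return means[i] + bonus - noise[i]

    ranked = sorted(range(len(means)), key=index, reverse=True)
    return sorted(ranked[:k])
```

In this toy form, a channel with a high empirical reward can still be passed over when its context predicts heavy noise, which is the behavior the abstract attributes to the proposed policies.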

2605.25388 2026-05-26 cs.LG q-bio.QM

ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks

Dongxin Ye, Fang Hu, Han Hu, Shu Hu, Yang Tan, Wanli Ouyang, Stan Z. Li, Jie Cui, Nanqing Dong

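AI summary: Introduces ViroBench, the first comprehensive benchmark for viral genomics, which evaluates 66 nucleotide foundation models on biological understanding and latent biosecurity risk. It finds that performance degrades under phylogenetic and temporal shifts, that statistical likelihood is decoupled from biological functional validity in generation tasks, and that taxonomic diversity in pretraining data matters more than parameter scale.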

Comments: 42 pages, 15 figures

DOI: 10.1145/3770855.3819057

Abstract

Nucleotide sequences constitute the fundamental genetic basis of biological systems, rendering viral genomic analysis critical for biomedical advancement. Despite progress in biological foundation models, specifically nucleotide foundation models (NFMs), the field lacks a unified standard for viral genomics to facilitate community development and enforce biosecurity constraints. To address this, we introduce ViroBench, the first comprehensive and large-scale benchmark specifically designed for NFMs in viral settings. ViroBench evaluates models across two critical dimensions: biological understanding and latent biosecurity risk, covering 18 diverse scenarios within 4 task types. Extensive evaluation of 66 NFMs across diverse architectures yields three critical conclusions. Firstly, NFMs exhibit a performance degradation in biological understanding under phylogenetic and temporal shifts, indicating weak extrapolation capabilities. Secondly, generation tasks reveal a decoupling between statistical likelihood and biological functional validity, posing latent biosecurity risks. Thirdly, controlled ablation studies reveal that taxonomic diversity in pretraining data outweighs parameter scale. Specifically, a lightweight baseline trained on diverse data achieves a 67.5% performance gain over its original model. Overall, ViroBench provides interpretable, diagnostic evaluations and a reproducible measurement framework for future research on viral nucleotide foundation models. The datasets and code are publicly available at https://github.com/QIANJINYDX/ViroBench.

2605.25385 2026-05-26 cs.CV cs.AI

Weakly Supervised Camouflaged Object Detection Based on the SAM Model and Mask Guidance

Xia Li, Xinran Liu, Lin Qi, Junyu Dong

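AI summary: Proposes MGNet, a network for weakly supervised camouflaged object detection that uses SAM to generate pseudo-labels and combines a cascaded mask decoder, a context enhancement module, and a mask-guided feature aggregation module, achieving performance competitive with current state-of-the-art methods.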

Comments: 18 pages

DOI: 10.1016/j.imavis.2025.105571

Abstract

Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundings. Existing fully supervised methods require labor-intensive pixel-level annotations, making weakly supervised methods a viable compromise that balances accuracy and annotation efficiency. However, weakly supervised methods often experience performance degradation due to the use of coarse annotations. In this paper, we introduce a new weakly supervised approach for camouflaged object detection to overcome these limitations. Specifically, we propose a novel network, MGNet, which tackles edge ambiguity and missed detections by utilizing initial masks generated by our custom-designed Cascaded Mask Decoder (CMD) to guide the segmentation process and enhance edge predictions. We introduce a Context Enhancement Module (CEM) to reduce missed detections, and a Mask-guided Feature Aggregation Module (MFAM) for effective feature aggregation. For the weak supervision challenge, we propose BoxSAM, which leverages the Segment Anything Model (SAM) with bounding-box prompts to generate pseudo-labels. By employing a redundant processing strategy, high-quality pixel-level pseudo-labels are provided for training MGNet. Extensive experiments demonstrate that our method delivers competitive performance against current state-of-the-art methods.

2605.25384 2026-05-26 cs.CL

Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation

Ruoxi Cheng, Haoxuan Ma, Zhengfei Hai, Yiyan Huang, Ranjie Duan, Tianle Zhang, Xu Yang, Ziyi Ye, Xingjun Ma

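AI summary: Proposes the Adversarial Orthogonal Disentanglement (AOD) framework, which learns a hallucination-related direction through a minimax objective and uses a training-free dual-forward-pass contrastive decoding strategy to mitigate hallucination in large vision-language models (LVLMs).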

Abstract

Large Vision-Language Models (LVLMs) have advanced multimodal understanding, yet their reliability is limited by hallucination, where generated content conflicts with visual facts. Existing mitigation methods either rely on costly external interventions, such as instruction tuning and retrieval, or use internal mechanisms that remain limited by flawed attention weights and entangled hidden representations. We propose Adversarial Orthogonal Disentanglement (AOD), a latent geometric framework for mitigating LVLM hallucinations. AOD learns a hallucination-related direction through a minimax objective: a classifier concentrates hallucination signals into the projected component, while an adversary removes them from the orthogonal residual space via a Gradient Reversal Layer. The learned direction enables a training-free dual-forward-pass contrastive decoding strategy that suppresses hallucinations while preserving general capabilities. Experiments on three LVLMs across four hallucination and four utility benchmarks show that AOD consistently outperforms strong baselines. It improves POPE accuracy by over 6% on average, boosts AMBER by 6%, and maintains strong performance on utility tasks such as MMMU. Further analysis shows robust transfer across datasets, suggesting that AOD captures general hallucination-related biases rather than dataset-specific artifacts. Our source code and datasets are available at https://github.com/Hunter-Wrynn/AOD.
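
The "projected component" and "orthogonal residual space" in this abstract refer to a standard vector decomposition, sketched below for a single hidden vector. Only the geometry is shown: how the direction is learned, and how the two components feed the classifier, the adversary, and the decoding step, is not covered, and the function name is hypothetical.

```python
import math


def split_along_direction(hidden, direction):
    """Split `hidden` into its component along `direction` and the orthogonal residual."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar coordinate of `hidden` along the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    projected = [coeff * u for u in unit]
    residual = [h - p for h, p in zip(hidden, projected)]
    return projected, residual
```

By construction the two parts sum back to the original vector and the residual has zero dot product with the direction, which is what allows one part to carry the hallucination signal while the other is trained to be free of it.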

2605.25373 2026-05-26 cs.CV

Physics-Aware 3D Gaussian Editing for Driving Scene Generation

Feng Zhou, Jian Zhang, Yuhang Sun, He Wang, Qiong Wen, Debao Kong, Tieru Wu, Rui Ma

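AI summary: Proposes RoVES, a system that achieves physics-aware driving scene editing and vehicle pose correction through single-image-driven road geometry insertion and a 4-DOF half-car dynamics model.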

Multi-Agent Collaboration for Reliable Fetal Ultrasound Interpretation

Xiaotian Hu, Mingxuan Liu, Junwei Huang, Kasidit Anmahapong, Yifei Chen, Yiming Huang, Xuguang Bai, Zihan Li, Hongjia Yang, Yingqi Hao, Hong Xu, Yu Jiang, Tian Tian, Yi Liao, Haibo Qu, Qiyuan Tian

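AI summary: Proposes FetUSAgents, a multi-agent system that integrates visual tools with clinical reasoning through collaborative LLM agents and Dual-Path Evidence Arbitration (DPEA). It supports fetal ultrasound VQA, report generation, and related tasks, and exceeds the strongest baseline by more than 25% in VQA accuracy.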

Abstract

Automated fetal ultrasound interpretation requires a workflow from visual perception, including plane recognition and anatomical segmentation, to clinical understanding, including biometric measurement and diagnostic reporting. However, the prevailing "one-task, one-model" paradigm limits systematic integration of evidence across this multi-step process. Although multimodal large language models (MLLMs) show promising visual understanding, their limited domain-specific grounding and hallucination risks restrict reliability in fetal ultrasound analysis. To address these limitations, we propose FetUSAgents, a tool-augmented multi-agent system for comprehensive fetal ultrasound interpretation, supporting visual question answering (VQA), report generation, image captioning, and video summarization. FetUSAgents coordinates task-specific visual tools through collaborative LLM agents and decomposes clinical queries into subtasks that progress from anatomical recognition to quantitative measurement. We further introduce Dual-Path Evidence Arbitration (DPEA), which integrates LLM-based deliberative reasoning with structured computational evidence from specialized visual tools. A retrieval-enhanced evidence bank consolidates intermediate findings to support traceable and clinically grounded conclusions. In addition, we construct FetUS-VQA, a dedicated VQA benchmark for fetal ultrasound, comprising 1,892 images and 3,205 question-answer pairs across 10 clinical tasks. Extensive out-of-distribution experiments show that FetUSAgents outperforms general and medical MLLMs, exceeding the strongest baseline by more than 25 percent in VQA accuracy. These results suggest a scalable route toward evidence-driven clinical assistants for prenatal imaging. Code is available.

2605.25354 2026-05-26 cs.AI

Context-CoT: Enhancing Context Learning via High-Quality Reasoning Synthesis

Hongbo Jin, Mingnan Zhu, Jingqi Tian, Xu Jiang, Zhongjing Du, Haoran Tang, Siyi Xie, Qiaoman Zhang, Jiayu Ding

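AI summary: To address the limited context-learning ability of large language models in dynamically extracting and applying new knowledge, proposes Context-CoT, which enhances context learning by synthesizing high-quality reasoning chains and significantly improves performance on CL-Bench.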

2605.25352 2026-05-26 cs.LG cs.AI

Certified Robustness from Approximate Gaussian Mixture Structures in Pretrained Latent Spaces

Konstantinos Emmanouilidis, Tianjiao Ding, Nghia Nguyen, Nicolas Loizou, René Vidal

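AI summary: Proposes a framework in which a pretrained encoder maps inputs to a latent distribution that is approximately a Gaussian mixture; theoretical analysis shows that the degradation in robustness is bounded, yielding certifiably robust classifiers that reach state-of-the-art or competitive certified accuracy on CIFAR-10 and ImageNet.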

Abstract

Deep learning models are vulnerable to adversarial perturbations, raising important concerns for safety-critical deployment. Empirical defenses can achieve strong robustness in practice, but lack formal guarantees, motivating the need for certifiably robust classifiers. While certified methods provide formal guarantees, they often yield overly conservative bounds due to their inability to exploit structure in complex data distributions. In this work, we propose a framework for designing certifiably robust classifiers that leverages latent structure in data representations. We first analyze the Gaussian mixture setting, deriving necessary and sufficient conditions for the existence of robust classifiers and constructing a classifier with a closed-form robustness certificate and generalization guarantees. Our main contribution is to show that exact structure is not required: we prove that if a pretrained encoder maps inputs to a latent distribution that is $\varepsilon$-close (in KL divergence) to a Gaussian mixture, then certified accuracy degrades gracefully, with an explicit bound relating robustness under the true and approximate distributions. This result enables the direct use of pretrained models without requiring exact distributional assumptions. Empirically, our method achieves state-of-the-art or competitive certified accuracy on CIFAR-10 and ImageNet, while maintaining strong clean performance and low computational overhead. Overall, our work establishes approximate latent structure as a practical and principled route to certifiable robustness.
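
As a rough illustration of what a closed-form certificate can look like in a Gaussian mixture setting, the sketch below uses a nearest-mean classifier in latent space, which is the Bayes rule for a mixture with equal priors and a shared isotropic covariance. The radius it returns is the exact L2 distance from the latent point to the nearest decision boundary. This is a simplified stand-in, not the certificate derived in the paper, and it holds in latent space only: turning it into an input-space guarantee would additionally require control over the encoder, for example a Lipschitz bound.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_with_radius(z, means):
    """Nearest-mean prediction for latent point z, plus its L2 radius in latent space."""
    d2 = [sq_dist(z, m) for m in means]
    pred = min(range(len(means)), key=d2.__getitem__)
    radius = math.inf
    for k, m in enumerate(means):
        if k == pred:
            continue
        gap = math.sqrt(sq_dist(means[pred], m))
        if gap == 0.0:
            return pred, 0.0  # duplicate means leave no margin
        # Distance from z to the hyperplane bisecting the two class means.
        radius = min(radius, (d2[k] - d2[pred]) / (2.0 * gap))
    return pred, radius
```

Any latent perturbation with L2 norm strictly below the returned radius leaves the prediction unchanged, because the predicted region is an intersection of half-spaces and the radius is the distance to the closest one.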

2605.25347 2026-05-26 cs.CV cs.LG

CausalFlow: Causal Attribution and Counterfactual Repair of LLM Agent Failures

Akash Bonagiri, Devang Borkar, Gerard Janno Anderias, Setareh Rafatirad, Houman Homayoun

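AI summary: Proposes CausalFlow, a framework that computes step-level Causal Responsibility Scores via counterfactual intervention to identify failure-inducing steps and generate minimally edited repairs, for use in test-time repair and training-time supervision, outperforming heuristic refinement in complex retrieval settings.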

Abstract

Large language model (LLM) agents frequently fail on multi-step tasks involving reasoning, tool use, and environment interaction. While such failures are typically logged or retried heuristically, they contain structured signals about where execution broke down. We introduce CausalFlow, an interventional framework that converts failed agent traces into minimal counterfactual repairs and reusable supervision. CausalFlow models execution traces as sequential chains of dependent steps and computes Causal Responsibility Scores (CRS) via step-level counterfactual intervention to identify failure-inducing steps. For these steps, we generate minimally edited repairs that flip the final outcome to success, producing validated contrastive pairs of the form (wrong step, corrected step). CausalFlow supports two complementary uses: targeted test-time repair that recovers from failures with minimal behavioral drift, and training-time supervision suitable for offline preference optimization or reward modeling. Across four benchmarks spanning mathematical reasoning, code generation, question answering, and medical browsing, CausalFlow converts failed executions into validated minimal repairs with high minimality and causal-consensus scores, and demonstrates that causal attribution is necessary for reliable improvement across diverse agent tasks, outperforming heuristic refinement in complex retrieval settings while producing more localized repairs throughout. These results demonstrate that interventional analysis over structured execution traces provides a principled and scalable mechanism for transforming agent failures into reliability gains and learning-ready supervision.
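
Step-level counterfactual intervention can be pictured with a toy scoring loop: replace one step at a time and check whether the failed trace now succeeds. The score below (fraction of substitutions that flip the outcome) is a hypothetical simplification and not the paper's CRS definition, which the abstract does not spell out. The toy also keeps downstream steps fixed, whereas a real agent would need to re-execute the steps that depend on the edited one.

```python
from typing import Callable, Sequence


def responsibility_scores(
    trace: Sequence[str],
    alternatives: Sequence[Sequence[str]],
    succeeds: Callable[[Sequence[str]], bool],
) -> list[float]:
    """For each step, the fraction of single-step substitutions that flip a failed trace to success."""
    scores = []
    for i, candidates in enumerate(alternatives):
        flips = 0
        for alt in candidates:
            edited = list(trace)
            edited[i] = alt  # intervene on step i only, keep the other steps fixed
            if succeeds(edited):
                flips += 1
        scores.append(flips / len(candidates) if candidates else 0.0)
    return scores
```

The step with the highest score is the natural candidate for repair, and each substitution that flips the outcome gives a (wrong step, corrected step) pair of the kind the abstract describes.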

2605.25334 2026-05-26 cs.CV

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

Yufei Zheng, Xuhan Zhu, Zide Liu, Chunpeng Zhou, Chenfeng Wang, Yongchao Xu, Yunnan Wang, Jiawei Liu, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha

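AI summary: Proposes GAMSI, a multimodal large language model that takes only RGB images as input and jointly perceives 3D structure and metric scale through dual-pathway queries and expert-guided visual alignment, reaching state-of-the-art performance on seven spatial intelligence benchmarks.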

Abstract

Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large language models (MLLMs) typically address only one facet, ingesting either depth maps or point clouds as additional model inputs, which incurs substantial computational overhead and inherits the generalization limitations of upstream prediction models. We propose GAMSI, a dual-pathway Geometry-Aware MLLM for Spatial Intelligence that takes only RGB images as input while internalizing both forms of geometric prior within a unified autoregressive backbone. Specifically, we introduce Metric-Structure Decoupled Queries (MSDQ) which employ two groups of learnable queries to respectively extract dense metric signals and sparse structural cues from the shared visual context, with a task-decoupled attention mask further preventing the two pathways from contaminating each other. Building on this, an Expert-Guided Visual Grounding (EVG) module projects the aggregated cues back to frame-level visual features and aligns them with vision foundation models, which serve purely as training-time supervision, rather than as model inputs. We further build a multi-task spatial instruction-tuning dataset (MTS) comprising 152,776 samples spanning 13 task types and three visual modalities, consolidated from six public datasets. Trained with a two-stage curriculum, GAMSI achieves state-of-the-art performance on seven spatial intelligence benchmarks.
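
One plausible reading of the task-decoupled attention mask is sketched below: both query groups can read the shared visual tokens and their own group, but not each other. The token layout, the function name, and the choice that visual tokens attend only to visual tokens are all assumptions; the abstract does not specify the mask, and in an autoregressive backbone it would presumably be combined with a causal mask.

```python
def decoupled_mask(n_visual, n_metric, n_struct):
    """Boolean mask where mask[i][j] is True when token i may attend to token j.

    Assumed token order: visual tokens, then metric queries, then structural queries.
    """
    n = n_visual + n_metric + n_struct

    def group(i):
        if i < n_visual:
            return "visual"
        return "metric" if i < n_visual + n_metric else "struct"

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            gi, gj = group(i), group(j)
            # Every token sees the shared visual context; queries also see their own group.
            mask[i][j] = gj == "visual" or gi == gj
    return mask
```

Blocking the metric-to-structure and structure-to-metric entries is what keeps the dense metric pathway and the sparse structural pathway from mixing, while both still draw on the same visual tokens.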
