arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.25404 2026-05-26 cs.CL eess.AS

Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems

主动应对不确定性:面向口语对话系统的因果感知错误诊断与交互式澄清

Yizhou Peng, Ziyang Ma, Changsong Liu, Yi-Wen Chao, Xie Chen, Eng Siong Chng

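AI summary: This paper proposes a cause-aware error recovery paradigm in which fine-grained detectors separate ASR errors into perception, comprehension, and deletion failures, so that the LLM can run targeted multi-turn clarification strategies, reducing word error rate by up to 30% and improving downstream task performance by 17%.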

AI Chinese abstract

级联自动语音识别-大语言模型(ASR-LLM)流水线在工业口语对话系统(SDS)中仍然流行,主要因为其解耦设计确保了感知可验证性。然而,级联系统存在错误传播问题,因为转录失败不可避免地级联到后续组件,从而降低最终交互质量。尽管ASR置信度分数为不可靠输入提供了简单过滤,但这种方法存在根本性局限,因为它通常无法检测删除错误,也无法区分声学(听不清)和语言(不理解)不匹配,而这两者都需要针对性的恢复策略。在本文中,我们提出了一种因果感知的错误恢复范式,从根本上重新思考SDS的鲁棒性。与传统的置信度过滤不同,我们引入了一组小型精度聚焦检测器,利用深度ASR潜在表示将词级错误解耦为感知、理解和删除失败。这种细粒度诊断智能使LLM能够编排针对性的多轮澄清策略,有效将模糊信号转化为无缝的用户交互。实验结果验证了我们方法的精度,与基线相比,在领域转移错误上的召回率提高了一倍以上(57.96% vs. 23.66%)。关键的是,这种诊断精度在不同口音、失真和领域下,使词错误率降低高达30%,下游任务性能提升17%。

English abstract

Cascaded Automatic Speech Recognition -- Large Language Model (ASR-LLM) pipelines remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However, cascaded systems suffer from error propagation, as transcription failures inevitably cascade to subsequent components, thereby degrading the final interaction quality. Although ASR confidence scores offer a simple filter for unreliable inputs, this approach is fundamentally limited because it typically fails to detect deletion errors or to distinguish between acoustic (inability to hear clearly) and linguistic (inability to understand) mismatches, both of which require targeted recovery strategies. In this paper, we propose a cause-aware error recovery paradigm that fundamentally rethinks robustness in SDS. Unlike traditional confidence filtering, we introduce a suite of small precision-focused detectors that exploit deep ASR latent representations to disentangle token-level errors into perception, comprehension, and deletion failures. This fine-grained diagnostic intelligence empowers the LLM to orchestrate targeted, multi-turn clarification strategies, effectively transforming ambiguous signals into seamless user interactions. Experimental results validate the precision of our approach, which more than doubles the recall on domain-shift errors (57.96% vs. 23.66%) compared to baselines. Crucially, this diagnostic precision yields up to a 30% reduction in WER and a 17% improvement on the downstream task across diverse accents, distortions, and domains.

2605.25401 2026-05-26 cs.RO

Path Following Control System of Line-of-Sight Guidance for Robotic Dolphin with Multi-Link Mechanism in Underwater Simulator

水下模拟器中多连杆机构仿生海豚的视线导引路径跟踪控制系统

Takumi Asada, Takao Oki, Hideo Furuhashi, Kenta Tabata, Renato Miyagusuku, Koichi Ozaki

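AI summary: For multi-link biomimetic autonomous underwater vehicles (BAUVs), the paper proposes a path-following control system based on line-of-sight guidance, and uses an underwater simulator to determine parameters and evaluate the control method.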

Journal ref
2026 IEEE/SICE International Symposium on System Integration (SII). IEEE, 2026. p. 844-849
AI Chinese abstract

具有多连杆机构的仿生自主水下航行器(BAUV)因其低功耗和高机动性被广泛用于水生生物观测和环境调查。环境调查需要能够自动跟踪特定点的路径跟踪系统。然而,BAUV的路径跟踪系统有限,且其与多连杆机构机器人的评估尚未明确。由于BAUV的模型因仿生类型而异,其路径跟踪系统需要预先进行仿真。在本研究中,我们提出了一种适用于多连杆机构BAUV的路径跟踪系统,并在水下模拟中进行了评估。结果表明,可以设计出适合BAUV的路径跟踪系统,使用模拟器确定参数,并评估控制方法。

English abstract

Biomimetic autonomous underwater vehicles (BAUVs) with multi-link mechanisms are widely used in aquatic life observation and environmental surveys because of their low power consumption and high maneuverability. An environmental survey requires a path following system that automatically follows specified points. However, path following systems for BAUVs are limited, and their evaluation on multi-link mechanism robots has not yet been clarified. A path following system for a BAUV requires prior simulation because the model differs depending on the type of biomimetics. In this study, we propose a path following system for BAUVs with a multi-link mechanism and evaluate it in an underwater simulation. The results show that it is possible to design a path following system suitable for BAUVs, determine its parameters using a simulator, and evaluate control methods.
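As a rough illustration of what a line-of-sight (LOS) guidance law computes, the sketch below uses the common textbook lookahead form for a straight path segment in 2D. It is not taken from the paper: the function name `los_heading`, the `lookahead` parameter, and the planar simplification are all assumptions made here.

```python
import math

def los_heading(pos, wp_a, wp_b, lookahead):
    """Desired heading (rad) for following the straight segment wp_a -> wp_b.

    Textbook lookahead-based LOS law (an assumption, not the paper's design):
    heading = path_angle + atan(-cross_track_error / lookahead).
    """
    path_angle = math.atan2(wp_b[1] - wp_a[1], wp_b[0] - wp_a[0])
    dx, dy = pos[0] - wp_a[0], pos[1] - wp_a[1]
    # Signed lateral offset from the path (positive = left of the path direction).
    cross_track = -dx * math.sin(path_angle) + dy * math.cos(path_angle)
    # Steer back toward the path; a larger lookahead gives a gentler correction.
    return path_angle + math.atan(-cross_track / lookahead)
```

A vehicle sitting on the path is told to follow the path angle, while a vehicle offset to one side is steered back toward the path; `lookahead` is the kind of parameter that would typically be tuned in simulation.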

2605.25399 2026-05-26 cs.AI

Towards end-to-end LLM-based censoring-aware survival analysis

面向端到端基于大语言模型的删失感知生存分析

Yishu Wei, Hexin Dong, Yi Lin, Jiahe Qian, Yi Liu, Yifan Peng

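AI summary: Proposes LLMSurvival, a framework that reformulates time-to-event prediction as pairwise ranking to enable censoring-aware survival analysis; on ICU mortality and fracture risk prediction it outperforms Cox proportional hazards modeling and three deep learning survival models.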

AI Chinese abstract

目的:生存分析是医学预测的核心,然而大语言模型(LLM)很少被用作端到端生存模型,因为删失阻碍了直接的监督微调。这里我们提出LLMSurvival,一个框架,使得未修改的LLM能够直接操作表格临床数据进行删失感知的生存分析。材料与方法:LLMSurvival将时间事件预测重新表述为可比较受试者之间的成对排序,并通过聚合与训练队列中锚定个体的比较来推导测试时风险。结果:在两个临床任务(MIMIC-IV中的ICU死亡率预测和纽约长老会/威尔康奈尔医学中心队列中的脆性骨折预测)中,LLMSurvival相比Cox比例风险模型,整体一致性提高了ICU死亡率3.1%和骨折风险0.5%,相比三个已建立的深度学习生存模型,ICU死亡率平均提高2.1%,骨折风险平均提高2.8%。讨论:结果表明,通过基于比较的重新制定,可以使带有删失的生存建模与LLM微调兼容。该框架展示了高可移植性,并且在不同的临床背景下优于专家制定的评分(如SAPS-II和FRAX评分)。此外,该框架支持本地部署,因为紧凑、公开可用的基础模型提供了足够的性能。结论:LLMSurvival框架作为通过LLM进行集成、删失意识的生存分析的概念验证。

English abstract

Objective: Survival analysis is central to medical prediction, yet large language models (LLMs) are rarely used as end-to-end survival models because censoring prevents straightforward supervised fine-tuning. Here we present LLMSurvival, a framework that enables censoring-aware survival analysis with unmodified LLMs operating directly on tabular clinical data. Materials and Methods: LLMSurvival reformulates time-to-event prediction as pairwise ranking among comparable subjects, and derives test-time risk by aggregating comparisons against anchor individuals from the training cohort. Results: Across two clinical tasks (ICU mortality prediction in MIMIC-IV and fragility fracture prediction in a NewYork-Presbyterian/Weill Cornell Medicine cohort), LLMSurvival improves overall concordance over Cox proportional hazards modeling by 3.1% for ICU mortality and 0.5% for fracture risk, and over three established deep learning survival models by 2.1% on average for ICU mortality and 2.8% for fracture risk. Discussion: The results show that survival modeling with censoring can be made compatible with LLM fine-tuning through comparison-based reformulation. The framework demonstrates high portability and superior performance over expert-curated scores such as SAPS-II and FRAX across diverse clinical contexts. Furthermore, the framework supports local deployment, as compact, publicly available base models provide sufficient performance. Conclusion: The LLMSurvival framework serves as a proof of concept for an integrated, censoring-conscious approach to survival analysis via LLMs.
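The pairwise reformulation can be pictured with a small sketch. It assumes the standard concordance-style definition of a comparable pair (the subject with the shorter time must have an observed event) and a simple win-fraction aggregation over anchors; the abstract does not spell out either detail, so both, along with the names `comparable_pairs`, `anchor_risk`, and `prefers_higher_risk`, are hypothetical.

```python
def comparable_pairs(times, events):
    """Return (i, j) pairs where subject i is known to fail before subject j.

    A pair is comparable only if the earlier time is an observed event;
    if the earlier subject was censored, the true ordering is unknown.
    """
    pairs = []
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the known earlier failure
        for j in range(n):
            if times[i] < times[j]:
                pairs.append((i, j))
    return pairs


def anchor_risk(prefers_higher_risk, anchors):
    """Risk score = fraction of anchors judged lower-risk than the test subject.

    `prefers_higher_risk(anchor)` is a hypothetical stand-in for the model's
    pairwise judgement; it returns True if the test subject is riskier.
    """
    wins = sum(1 for a in anchors if prefers_higher_risk(a))
    return wins / len(anchors)
```

Only comparable pairs yield an unambiguous ranking label, which is how censored records can still contribute to training without a known event time.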

2605.25396 2026-05-26 cs.CV cs.AI

Subspace-Guided Semantic and Topological Invariant Registration for Annotation-Free Ultrasound Plane Quality Control

子空间引导的语义与拓扑不变配准用于无标注超声平面质量控制

Chunzheng Zhu, Jianxin Lin, Feng Wang, Cheng Jiang, Guanghua Tan, Zhenyu Zhou, Shengli Li, Kenli Li

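AI summary: Proposes STRIQ, a framework that uses a subspace-guided registration consistency measure for annotation-free ultrasound plane quality control and achieves state-of-the-art correlation with clinical quality scores.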

Comments MICCAI 2026 Accepted Paper; Subspace-Guided Registration for Ultrasound Quality Control

AI Chinese abstract

超声图像的可靠质量控制对于实时采集指导和回顾性临床审计至关重要,然而现有方法严重依赖逐平面标注,或采用在临床采集固有空间变形下易产生系统性偏差的伪标签。我们提出STRIQ,一种基于配准的框架,将无标注超声平面质量控制重新定义为子空间引导的一致性度量问题。具体而言,STRIQ引入潜在配准对齐器(LRA)以建立查询图像与方差驱动锚点之间的层次特征空间对应,这些锚点通过方差谱准则从无标签数据中自主提炼,作为结构稳定的原型。为进一步区分解剖平面并减轻负知识迁移,我们提出正交知识子空间(OKS)模块。OKS将平面特定表示分解为相互正交的子空间,实现细粒度专家协作同时防止平面间干扰,确保质量度量基于原则性的子空间邻近性。在内部US4QA和公开CAMUS数据集上的大量实验表明,STRIQ实现了与临床质量评分的最优相关性,为无标注、实时可靠的超声质量控制建立了新范式。我们的代码可在https://github.com/zhcz328/STRIQ获取。

English abstract

Reliable quality control (QC) of ultrasound images is essential for both real-time acquisition guidance and retrospective clinical audit, yet existing approaches rely heavily on per-plane annotations, or employ pseudo-labeling prone to systematic bias under spatial deformations inherent in clinical acquisition. We present STRIQ, a registration-driven framework that recasts annotation-free US plane quality control as a subspace-guided consistency measurement problem. Specifically, STRIQ introduces a Latent Registration Aligner (LRA) to establish hierarchical feature space correspondences between query images and variance-driven anchors, which are autonomously distilled from unlabeled data via a variance spectrum criterion to serve as structurally stable prototypes. To further disambiguate anatomical planes and mitigate negative knowledge transfer, we propose an Orthogonal Knowledge Subspace (OKS) module. The OKS decomposes plane-specific representations into mutually orthogonal subspaces, enabling fine-grained expert collaboration while preventing inter-plane interference, ensuring that the quality metric is grounded in principled subspace proximity. Extensive experiments on the in-house US4QA and public CAMUS datasets demonstrate that STRIQ achieves state-of-the-art correlation with clinical quality scores, establishing a new paradigm for annotation-free, real-time reliable ultrasound quality control. Our code is available at https://github.com/zhcz328/STRIQ.

2605.25395 2026-05-26 cs.LG math.OC

EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization

EMA-Nesterov:稳定Nesterov前瞻以加速深度学习优化

Chung-Yiu Yau, Dawei Li, Athanasios Glentis, Valentyn Boreiko, Hoi-To Wai, Mingyi Hong

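AI summary: To address the instability of Nesterov momentum in deep learning caused by stochastic gradient noise and non-convex losses, the paper proposes EMA-Nesterov, which replaces the standard lookahead direction with an exponential moving average of parameter updates. This low-pass filter captures the low-frequency trend of the training trajectory, retains a theoretical accelerated convergence rate on convex problems, and is shown in language model pre-training to apply across base optimizers and to outperform prior lookahead methods.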

Comments 25 pages, 10 figures

AI Chinese abstract

基于前瞻的加速方法,如Nesterov动量,在优化中广泛使用,但在深度学习训练中常因随机梯度噪声和非凸损失景观而变得不可靠。特别是,标准前瞻依赖于短视更新信号(例如连续迭代之间的差异),这些信号本质上有噪声,可能导致不稳定的外推方向。本文从轨迹角度重新审视Nesterov加速,并认为深度学习中的有效加速应利用优化轨迹的低频趋势,而非外推噪声的一步更新。基于这一见解,我们提出EMA-Nesterov,一个简单的修改,用参数更新的指数移动平均(EMA)替代标准Nesterov前瞻方向。这产生了一个稳定的前瞻方向,通过低通滤波器捕捉并利用训练轨迹的演变趋势,同时通过EMA的几何加权结构保持对渐进变化的适应性。我们证明,EMA-Nesterov在凸问题中保留了与Nesterov加速梯度方法类似的理论加速收敛率。此外,我们在语言模型预训练上提供了经验证据,验证了EMA-Nesterov广泛适用于一系列微调的基础优化器,包括Adam、SOAP、Muon,以及在优化基准(NanoGPT)上达到最先进性能的复杂优化器。与先前的瞻方法相比,EMA-Nesterov通过避免短视前瞻的不稳定性和长视前瞻的非自适应性,实现了更好的性能。

English abstract

Lookahead-based acceleration methods, such as Nesterov's momentum, are widely used in optimization, but they often become unreliable in deep learning training mainly due to stochastic gradient noise and non-convex loss landscapes. In particular, standard lookahead relies on short-horizon update signals (e.g., differences between consecutive iterates), which are inherently noisy and can lead to unstable extrapolation directions. This work revisits Nesterov's acceleration from a trajectory perspective and argues that effective acceleration in deep learning should harness the low-frequency trends of optimization trajectories rather than extrapolating noisy one-step updates. Leveraging this insight, we propose EMA-Nesterov, a simple modification that replaces the standard Nesterov's lookahead direction with an exponential moving average (EMA) of parameter updates. This yields a stabilized lookahead direction that captures and harnesses the evolving trend of the training trajectory through a low-pass filter, while remaining adaptive to progressive changes via the geometric weighting structure of EMA. We show that EMA-Nesterov retains a theoretical accelerated convergence rate in convex problems that is analogous to Nesterov's accelerated gradient method. Furthermore, we provide empirical evidence on language model pre-training to verify that EMA-Nesterov is broadly applicable across a range of fine-tuned base optimizers, including Adam, SOAP, Muon, as well as complex optimizers that achieve state-of-the-art performance on optimization benchmarks (NanoGPT). Compared to prior lookahead methods, EMA-Nesterov achieves better performance by avoiding the instability of short-horizon lookahead and the non-adaptivity of long-horizon lookahead.
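The core idea, extrapolating along a smoothed trend of past updates instead of the latest one-step difference, can be sketched on a scalar problem. This is an illustrative reading of the abstract, not the authors' exact algorithm: the function name `ema_lookahead_gd`, the plain gradient step, and all hyperparameter values are assumptions.

```python
def ema_lookahead_gd(grad, x0, lr=0.1, beta=0.5, ema_decay=0.8, steps=300):
    """Gradient descent with an EMA-smoothed lookahead (illustrative only).

    The extrapolation direction is an exponential moving average of past
    parameter updates rather than the latest one-step difference.
    """
    x = x0
    ema = 0.0  # running average of parameter updates (x_new - x)
    for _ in range(steps):
        lookahead = x + beta * ema                # extrapolate along the smoothed trend
        x_new = lookahead - lr * grad(lookahead)  # gradient step taken at the lookahead point
        ema = ema_decay * ema + (1.0 - ema_decay) * (x_new - x)
        x = x_new
    return x
```

Setting `ema_decay=0.0` makes `ema` equal to the most recent update, which recovers the usual one-step Nesterov-style extrapolation; larger values low-pass filter the update sequence. On the quadratic f(x) = x²/2, calling `ema_lookahead_gd(lambda x: x, 5.0)` drives `x` toward zero.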

2605.25394 2026-05-26 cs.AI cs.CL

Second Guess: Detecting Uncertainty Through Abstention and Answer Stability in Small Language Models

Second Guess: 通过弃权和答案稳定性检测小型语言模型的不确定性

Ashwath Vaithinathan Aravindan, Mayank Kejriwal

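AI summary: Proposes Second Guess, a lightweight, parameter-free prompting technique that adds an "I don't know" option and checks answer stability to enable abstention in multiple-choice question answering, detecting uncertainty in small language models.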

AI Chinese abstract

大型语言模型在不确定时往往生成自信但错误的答案,而非弃权。这个问题对于小型语言模型(SLM)尤为严重,因为计算约束和自主操作放大了对可靠不确定性检测的需求。我们提出了_Second Guess_,一种轻量级、无参数的提示技术,用于多项选择问答(MCQA)中的弃权,非常适合SLM。我们的关键实证洞察是,真正知道答案的模型会一致地选择它,而不确定的模型在添加“我不知道”选项时会表现出不稳定的行为。在四个开源模型(2B-8B参数)和四个基准测试上评估,Second Guess实现了10.81%的最高复合风险改进。值得注意的是,在基于熵的方法退化的微调模型上,它保持了8%的复合风险改进,并且对性能较低的模型改进最大。重现本工作所需的所有代码和结果可在https://github.com/Mystic-Slice/second-guess获取。

English abstract

Large language models often generate confident but incorrect answers rather than abstaining when uncertain. This problem is particularly acute for small language models (SLMs), where computational constraints and autonomous operation amplify the need for reliable uncertainty detection. We propose Second Guess, a lightweight, parameter-free prompting technique for abstention in multiple-choice question answering (MCQA) that is well-suited for SLMs. Our key empirical insight is that models which truly know an answer will select it consistently, while uncertain models exhibit unstable behavior when an "I don't know" option is added. Evaluated on four open models (2B-8B parameters) and four benchmarks, Second Guess achieves the highest composite risk improvement of 10.81%. Notably, it maintains an 8% composite risk improvement on fine-tuned models where entropy-based methods degrade, and improves most for lower-performing models. All code and results required to reproduce this work are available at https://github.com/Mystic-Slice/second-guess
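The stability check described above can be sketched in a few lines. The exact decision rule is not given in the abstract, so the rule below (abstain if the answer changes or the added option is chosen) is an assumption, and `ask` is a hypothetical callable wrapping whatever model is being queried.

```python
def second_guess(ask, question, options, idk="I don't know"):
    """Return the chosen option, or None to abstain.

    `ask(question, options)` is a hypothetical callable wrapping the model;
    it returns one of the option strings.
    """
    first = ask(question, options)           # original multiple-choice prompt
    second = ask(question, options + [idk])  # same prompt with the extra option
    # Abstain if the model picks the added option or changes its answer.
    if second == idk or second != first:
        return None
    return first
```

The check costs one extra forward pass per question and needs no access to logits or model parameters, which fits the parameter-free framing.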

2605.25393 2026-05-26 cs.RO

Decision-Making with Lightweight Confidence-Aware Language Model for Autonomous Driving

基于轻量级置信感知语言模型的自动驾驶决策

Ruoyu Yao, Ruiguo Zhong, Pei Liu, Mingxing Peng, Rui Yang, Jun Ma

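AI summary: Proposes a decision-making framework built on a lightweight confidence-aware language model: a multi-agent workflow generates confidence-annotated decision demonstrations, which are distilled into a dual-head lightweight model that achieves state-of-the-art success rates with low latency on nuPlan.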

Comments 8 pages, 3 figures, ITSC 2026

AI Chinese abstract

大型语言模型和多模态大语言模型在自动驾驶中展现出巨大潜力,提供类人推理和开放世界泛化能力。然而,这些庞大模型过高的计算开销和推理延迟严重阻碍了它们在资源受限的自动驾驶系统中的部署。为解决这一挑战,我们提出了一种新颖的决策框架,利用轻量级置信感知语言模型,弥合了复杂多模态意图推理与高效推理之间的差距。具体而言,我们设计了一个多智能体协作工作流,包括动作投票、置信评估和总结智能体,通过显式的思维链推理生成高质量、带置信注释的决策演示。然后,这些演示被蒸馏到一个具有双头架构的轻量级语言模型中,实现决策概率的联合预测和文本理由的生成。蒸馏通过置信感知微调策略结合检索增强生成来实现,以增强模型的适应性和数据效率。在nuPlan基准上的全面闭环实验表明,我们的方法在常规和长尾场景下均实现了最先进的成功率,同时保持了低推理延迟。

English abstract

Large Language Models (LLMs) and Multimodal LLMs (MLLMs) have demonstrated immense potential in autonomous driving (AD) by offering human-like reasoning and open-world generalization. However, the excessive computational overhead and high inference latency of these massive models severely hinder their deployment in resource-constrained AD systems. To address this challenge, we propose a novel decision-making framework utilizing a lightweight confidence-aware language model, which bridges the gap between complex multimodal intention reasoning and efficient inference. Specifically, we design a multi-agent collaborative workflow, comprising action voting, confidence assessment, and summarization agents, to generate high-quality, confidence-annotated decision demonstrations via explicit Chain-of-Thought (CoT) reasoning. These demonstrations are then distilled into a lightweight language model featuring a dual-head architecture, enabling the joint prediction of decision probabilities and the generation of textual rationales. The distillation is realized via a confidence-aware fine-tuning strategy coupled with Retrieval Augmented Generation (RAG) to enhance the model's adaptability and data efficiency. Comprehensive closed-loop experiments on the nuPlan benchmark demonstrate that our approach achieves state-of-the-art (SOTA) success rates in both regular and long-tail scenarios while maintaining low inference latency.

2605.25391 2026-05-26 cs.LG eess.SP

A Context Augmented Multi-Play Multi-Armed Bandit Algorithm for Fast Channel Allocation in Opportunistic Spectrum Access

一种用于机会频谱接入中快速信道分配的上下文增强多玩多臂老虎机算法

Ruiyu Li, Guangxia Li, Xiao Lu, Jichao Liu, Yan Jin

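AI summary: For channel allocation in opportunistic spectrum access, the paper proposes a context-augmented multi-play multi-armed bandit algorithm that models channel noise as a perturbation of the reward function and uses channel state information as context. It derives two index policies, for linear and nonlinear correlations, which achieve lower regret and more reasonable selection of sub-optimal arms.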

Comments Accepted by ISCC'24

AI Chinese abstract

我们研究了机会频谱接入(OSA)场景中用于信道分配的动态上下文多玩多臂老虎机(MP-MAB)问题。大多数现有的MP-MAB方法对于实际OSA系统不实用,因为它们假设了许多理想条件,计算成本高,最重要的是忽略了与服务质量直接相关的信道噪声的影响。在本研究中,我们通过将信道噪声建模为MP-MAB中臂奖励函数的扰动来体现这种影响。由于信道状态信息与信道噪声之间存在隐含的相关性,我们将前者作为MP-MAB的上下文来表示后者引起的扰动。我们研究了上下文与扰动之间的两种相关性——线性和非线性,并分别推导出两种索引策略。这些策略通过线性模型和神经网络学习相关性,并使用估计的噪声值调整上置信界。数值实验表明,所提出的策略能够实现更低的遗憾,并以更合理的方式选择次优臂。

English abstract

We study the restless contextual multi-play multi-armed bandit (MP-MAB) problem for channel allocation in the opportunistic spectrum access (OSA) scenario. Most existing MP-MAB methods are impractical for real-world OSA systems because they assume many ideal conditions, incur a heavy computational cost, and, most importantly, ignore the impact of channel noise, which is directly related to the quality of service. In this study, we capture this impact by modeling channel noise as a perturbation of the arm's reward function in MP-MAB. As there is an implicit correlation between channel state information and channel noise, we take the former as a context for MP-MAB to represent the perturbation caused by the latter. We investigate two types of correlation between the context and the perturbation, linear and nonlinear, and derive an index policy for each. These policies learn the correlations through a linear model and a neural network, respectively, and use the estimated noise value to adjust the upper confidence bound. Numerical experiments demonstrate that the proposed policies can achieve lower regret and select sub-optimal arms in a more reasonable way.
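To make the index-policy idea concrete, the sketch below picks `k` channels per round by a UCB1-style index reduced by a per-channel noise estimate. How the paper actually adjusts the bound is not stated in the abstract, so subtracting the estimate is an assumption, as are the function name `select_channels` and the use of the standard UCB1 bonus.

```python
import math

def select_channels(means, counts, noise_est, t, k):
    """Pick k channels by a UCB index reduced by an estimated noise term.

    means[i]     : empirical mean reward of channel i
    counts[i]    : number of times channel i has been played (must be > 0)
    noise_est[i] : estimated noise perturbation for channel i (assumed given,
                   e.g. predicted from channel state information)
    """
    def index(i):
        bonus = math.sqrt(2.0 * math.log(t) / counts[i])  # UCB1 exploration bonus
        return means[i] + bonus - noise_est[i]            # assumed form of the adjustment
    ranked = sorted(range(len(means)), key=index, reverse=True)
    return sorted(ranked[:k])
```

In this toy form a channel with a high empirical mean can still be passed over when its estimated noise is large.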

2605.25388 2026-05-26 cs.LG q-bio.QM

ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks

ViroBench:病毒基因组学任务中的核苷酸基础模型基准测试

Dongxin Ye, Fang Hu, Han Hu, Shu Hu, Yang Tan, Wanli Ouyang, Stan Z. Li, Jie Cui, Nanqing Dong

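AI summary: Introduces ViroBench, the first comprehensive benchmark for viral genomics, which evaluates 66 nucleotide foundation models on biological understanding and latent biosecurity risk. It finds that performance degrades under phylogenetic and temporal shifts, that statistical likelihood decouples from biological functional validity in generation tasks, and that taxonomic diversity in pretraining data matters more than parameter scale.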

Comments 42 pages, 15 figures

AI Chinese abstract

核苷酸序列构成了生物系统的基本遗传基础,使得病毒基因组分析对生物医学进步至关重要。尽管生物基础模型,特别是核苷酸基础模型(NFMs)取得了进展,但该领域缺乏一个统一的病毒基因组学标准来促进社区发展并实施生物安全约束。为了解决这个问题,我们引入了ViroBench,这是第一个专门为病毒场景中的NFMs设计的全面且大规模的基准测试。ViroBench在两个关键维度上评估模型:生物学理解和潜在生物安全风险,覆盖4种任务类型中的18个不同场景。对66个不同架构的NFMs的广泛评估得出了三个关键结论。首先,NFMs在系统发育和时间偏移下表现出生物学理解的性能下降,表明外推能力较弱。其次,生成任务揭示了统计似然与生物功能有效性之间的脱钩,构成了潜在的生物安全风险。第三,受控消融研究表明,预训练数据中的分类多样性比参数规模更重要。具体来说,一个在多样化数据上训练的轻量级基线相比其原始模型实现了67.5%的性能提升。总体而言,ViroBench为未来病毒核苷酸基础模型的研究提供了可解释的诊断评估和可重复的测量框架。数据集和代码公开于https://github.com/QIANJINYDX/ViroBench。

English abstract

Nucleotide sequences constitute the fundamental genetic basis of biological systems, rendering viral genomic analysis critical for biomedical advancement. Despite progress in biological foundation models, specifically nucleotide foundation models (NFMs), the field lacks a unified standard for viral genomics to facilitate community development and enforce biosecurity constraints. To address this, we introduce ViroBench, the first comprehensive and large-scale benchmark specifically designed for NFMs in viral settings. ViroBench evaluates models across two critical dimensions: biological understanding and latent biosecurity risk, covering 18 diverse scenarios within 4 task types. Extensive evaluation of 66 NFMs across diverse architectures yields three critical conclusions. Firstly, NFMs exhibit a performance degradation in biological understanding under phylogenetic and temporal shifts, indicating weak extrapolation capabilities. Secondly, generation tasks reveal a decoupling between statistical likelihood and biological functional validity, posing latent biosecurity risks. Thirdly, controlled ablation studies reveal that taxonomic diversity in pretraining data outweighs parameter scale. Specifically, a lightweight baseline trained on diverse data achieves a 67.5% performance gain over its original model. Overall, ViroBench provides interpretable, diagnostic evaluations and a reproducible measurement framework for future research on viral nucleotide foundation models. The datasets and code are publicly available at https://github.com/QIANJINYDX/ViroBench.

2605.25385 2026-05-26 cs.CV cs.AI

Weakly Supervised Camouflaged Object Detection Based on the SAM Model and Mask Guidance

基于SAM模型和掩码引导的弱监督伪装目标检测

Xia Li, Xinran Liu, Lin Qi, Junyu Dong

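AI summary: Proposes MGNet, which is trained on pseudo-labels generated with SAM and combines a Cascaded Mask Decoder, a Context Enhancement Module, and a Mask-guided Feature Aggregation Module for weakly supervised camouflaged object detection, with performance competitive with current state-of-the-art methods.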

Comments 18 pages

AI Chinese abstract

伪装目标检测(COD)由于目标与背景高度相似,是一项具有挑战性的任务。现有的全监督方法需要耗费大量人力进行像素级标注,因此弱监督方法成为平衡精度与标注效率的可行折中方案。然而,由于使用粗标注,弱监督方法常出现性能下降。本文提出一种新的弱监督伪装目标检测方法以克服这些限制。具体地,我们设计了一个新颖的网络MGNet,通过利用自定义级联掩码解码器(CMD)生成的初始掩码来引导分割过程并增强边缘预测,从而解决边缘模糊和漏检问题。我们引入上下文增强模块(CEM)以减少漏检,以及掩码引导特征聚合模块(MFAM)进行有效的特征聚合。针对弱监督挑战,我们提出BoxSAM,利用带有边界框提示的Segment Anything Model(SAM)生成伪标签。通过采用冗余处理策略,为训练MGNet提供高质量的像素级伪标签。大量实验表明,我们的方法在性能上与当前最先进方法具有竞争力。

English abstract

Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundings. Existing fully supervised methods require labor-intensive pixel-level annotations, making weakly supervised methods a viable compromise that balances accuracy and annotation efficiency. However, weakly supervised methods often experience performance degradation due to the use of coarse annotations. In this paper, we introduce a new weakly supervised approach for camouflaged object detection to overcome these limitations. Specifically, we propose a novel network, MGNet, which tackles edge ambiguity and missed detections by utilizing initial masks generated by our custom-designed Cascaded Mask Decoder (CMD) to guide the segmentation process and enhance edge predictions. We introduce a Context Enhancement Module (CEM) to reduce missed detections, and a Mask-guided Feature Aggregation Module (MFAM) for effective feature aggregation. To address the weak supervision challenge, we propose BoxSAM, which leverages the Segment Anything Model (SAM) with bounding-box prompts to generate pseudo-labels. By employing a redundant processing strategy, high-quality pixel-level pseudo-labels are provided for training MGNet. Extensive experiments demonstrate that our method delivers competitive performance against current state-of-the-art methods.

2605.25384 2026-05-26 cs.CL

GeoMathCode: Understanding Interleaved Math-Code Reasoning for Geometry Problem Solving

GeoMathCode: 理解几何问题求解中交织的数学-代码推理

Yingji Zhang, Yong Dai, André Freitas

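AI summary: Introduces GeoMathCode, in which programmatic representations serve as intermediate visual outputs, to analyze the reasoning and code generation of multimodal large language models on geometry problems. It finds that reasoning and code steps can be disentangled in latent space, that supervised fine-tuning makes the reasoning manifold more structured, and that hierarchical code structures carry more mathematical symbolic information than visual representations.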

AI Chinese abstract

数学推理是人类智能的标志,需要逻辑演绎、符号操作和抽象思维。最近的多模态大语言模型通过多步推理在几何问题上表现出强大性能。为了更好地模拟人类问题求解,中间步骤可以融入辅助视觉构造,例如额外的线条或点,这改善了几何解释和教育清晰度。在这项工作中,我们引入了GeoMathCode,其中程序化表示作为中间视觉输出。我们进一步对底层推理几何进行了深入分析。实验结果表明,推理和代码生成步骤可以在潜在空间中解耦,而监督微调使推理流形更加结构化和信息丰富。此外,层次化的句法代码结构作为解耦的潜在子空间出现,并且比视觉表示包含更多的数学符号信息。

English abstract

Mathematical reasoning is a hallmark of human intelligence, requiring logical deduction, symbolic manipulation, and abstract thinking. Recent multimodal large language models (MLLMs) have demonstrated strong performance on geometry problems through multi-step reasoning. To better emulate human problem-solving, intermediate steps can incorporate auxiliary visual constructions, such as additional lines or points, which improve geometric interpretation and educational clarity. In this work, we introduce GeoMathCode, where programmatic representations serve as intermediate visual outputs. We further conduct an in-depth analysis of the underlying reasoning geometry. Experimental results show that reasoning and code generation steps can be disentangled in the latent space, while supervised fine-tuning (SFT) makes the reasoning manifold more structured and informative. Moreover, hierarchical syntactic code structures emerge as disentangled latent subspaces and contain more mathematical symbolic information than visual representations.

2605.25381 2026-05-26 cs.LG

Not only where, But when: Temporal Scheduling for RLVR

不仅在哪里,而且何时:RLVR 的时间调度

Jinghao Zhang, Ruilin Li, Feng Zhao, Jiaqi Wang

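AI summary: Observing that reinforcement learning with verifiable rewards (RLVR) overlooks heterogeneous policy behaviors, the paper proposes temporal scheduling, which varies the credit allocation criterion over the course of training; experiments show more stable and efficient training.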

Comments Github: https://github.com/Jinghaoleven/RLVR-Schedule

AI Chinese abstract

具有可验证奖励的强化学习(RLVR)已成为大型语言模型(LLMs)后训练的核心技术。虽然策略优化由所有采样token在全局广播标量奖励下驱动,但轨迹中表现出的异质性策略行为在很大程度上被忽视而未加以区分。现有工作通过信用分配来解决这一问题,包括token级优势重加权和选择性token优化,然而分配标准在整个训练过程中基本保持不变,限制了策略的弹性演化。在这项工作中,我们认为学习信号的调度时机与它们在token间的分配位置同样重要,并引入了时间维度,即在RLVR优化过程中调度信用分配标准。我们发现,优先关注具有特定策略行为的目标token,并逐渐向通用优化衰减,可以带来更稳定和高效的学习动态。此外,我们表明简单的轨迹百分位数为区分策略行为提供了自然视角,并与时间调度有效配合。我们的分析揭示,标准优化在同时适应异质性行为时显著牺牲了策略熵,而时间调度产生了更健康的策略演化动态。在数学和通用推理基准上的实验表明了一致的改进,表明时间调度构成了一个有前景的优化维度。

English abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core technique for post-training large language models (LLMs). While policy optimization is driven by all sampled tokens under a globally broadcast scalar reward, the heterogeneous policy behaviors exhibited along trajectories are largely overlooked and left undifferentiated. Existing works address this through credit allocation, including token-level advantage reweighting and selective token optimization; however, the allocation criteria remain largely static throughout training, limiting resilient policy evolution. In this work, we argue that when learning signals are scheduled can be as important as where they are allocated across tokens, and we introduce a temporal dimension: scheduling the credit allocation criteria over the course of RLVR optimization. We find that prioritizing targeted tokens associated with specific policy behaviors, and gradually attenuating toward general optimization, leads to more stable and efficient learning dynamics. Furthermore, we show that simple trajectory percentiles provide a natural perspective for distinguishing policy behaviors and work effectively with temporal scheduling. Our analysis reveals that standard optimization substantially sacrifices policy entropy when simultaneously accommodating heterogeneous behaviors, whereas temporal scheduling yields healthier policy evolution dynamics. Experiments across mathematical and general reasoning benchmarks demonstrate consistent improvements, suggesting that temporal scheduling constitutes a promising optimization dimension.

2605.25379 2026-05-26 cs.CL

EfficientGraph-RAG: Structured Retrieval-State Management for Cross-Task Retrieval-Augmented Generation

EfficientGraph-RAG:面向跨任务检索增强生成的结构化检索状态管理

Miaohe Niu, Lianlei Shan, Zhengtao Yu, Jingbo Zhu, Tong Xiao

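AI summary: Proposes EfficientGraph-RAG, a framework that makes the retrieval state explicit through three mechanisms (TAM, MARS, and SMP) for structured state management, improving answer quality on several benchmarks while reducing large-model token consumption.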

Comments 19 pages, 5 figures, 14 tables

AI Chinese abstract

检索增强生成(RAG)已成为将大型语言模型锚定于外部知识的标准方式,但许多系统仍将证据组织为扁平块并通过基本无结构的搜索进行检索。这种弱结构成为复杂检索的瓶颈:系统必须决定搜索位置、如何从粗粒度主题过渡到实体关系证据、哪些证据已被验证以及哪些中间产物可复用。我们将这些中间变量定义为检索状态,并将RAG研究视为结构化状态管理。EfficientGraph-RAG通过三种耦合机制使该状态显式化:TAM定义了证据上的类型化层次状态空间,MARS通过角色专业化代理更新和验证状态,SMP在层次感知访问控制下存储可复用状态。使用一个共享框架配置,EfficientGraph-RAG在三个评估的LongBench检索风格子集上平均报告答案质量指标排名第一,在HotpotQA EM上与最强智能体基线持平,同时将大模型token使用量减少3.51倍,并在检索组织跨模态方法中提供了低token的DocVQA结果。组件分析显示了角色特定机制:MARS是主要答案质量驱动因素,TAM提供类型化遍历状态和自适应路由信号,SMP支持语料库依赖的复用,跨查询缓存命中率范围为3.77%至23.18%。

English abstract

Retrieval-augmented generation (RAG) has become the standard way to ground large language models in external knowledge, but many systems still organize evidence as flat chunks and retrieve it through largely unstructured search. This weak structure becomes a bottleneck for complex retrieval: the system must decide where to search, how to move from coarse topics to entity-relation evidence, which evidence has been verified, and which intermediate artifacts can be reused. We define these intermediate variables as a retrieval state and study RAG as structured state management. EfficientGraph-RAG makes this state explicit through three coupled mechanisms: TAM defines a typed hierarchical state space over evidence, MARS updates and verifies the state through role-specialized agents, and SMP stores reusable state under hierarchy-aware access control. Using one shared framework configuration, EfficientGraph-RAG ranks first on the reported answer-quality metrics averaged over the three evaluated LongBench retrieval-style subsets, matches the strongest agentic baseline on HotpotQA EM while reducing large-model token usage by $3.51\times$, and provides a low-token DocVQA result among retrieval-organizing cross-modal methods. Component analysis shows role-specific mechanisms: MARS is the main answer-quality driver, TAM supplies the typed traversal state and Adaptive Routing signal, and SMP enables corpus-dependent reuse, with cross-query cache hit rates ranging from 3.77% to 23.18%.

2605.25377 2026-05-26 cs.CV cs.AI

Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation

对抗正交解缠用于LVLM幻觉缓解

Ruoxi Cheng, Haoxuan Ma, Zhengfei Hai, Yiyan Huang, Ranjie Duan, Tianle Zhang, Xu Yang, Ziyi Ye, Xingjun Ma

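AI summary: Proposes Adversarial Orthogonal Disentanglement (AOD), which learns a hallucination-related direction through a minimax objective and uses it in a training-free dual-forward-pass contrastive decoding strategy to mitigate hallucination in large vision-language models (LVLMs).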

AI Chinese abstract

大型视觉语言模型(LVLM)推进了多模态理解,但其可靠性受到幻觉的限制,即生成内容与视觉事实冲突。现有缓解方法要么依赖昂贵的外部干预(如指令调优和检索),要么使用受限于有缺陷的注意力权重和纠缠的隐藏表示的内部机制。我们提出对抗正交解缠(AOD),一种用于缓解LVLM幻觉的潜在几何框架。AOD通过最小最大目标学习幻觉相关方向:分类器将幻觉信号集中到投影分量中,而对抗器通过梯度反转层将其从正交残差空间中移除。学习到的方向使得一种无需训练的双前向对比解码策略能够抑制幻觉同时保持通用能力。在三个LVLM上进行的四个幻觉和四个效用基准实验表明,AOD一致优于强基线。它在POPE上平均提高超过6%的准确率,将AMBER提升6%,并在MMMU等效用任务上保持强劲性能。进一步分析显示跨数据集的鲁棒迁移,表明AOD捕获了通用的幻觉相关偏差而非数据集特定伪影。我们的源代码和数据集可在https://github.com/Hunter-Wrynn/AOD获取。

English abstract

Large Vision-Language Models (LVLMs) have advanced multimodal understanding, yet their reliability is limited by hallucination, where generated content conflicts with visual facts. Existing mitigation methods either rely on costly external interventions, such as instruction tuning and retrieval, or use internal mechanisms that remain limited by flawed attention weights and entangled hidden representations. We propose Adversarial Orthogonal Disentanglement (AOD), a latent geometric framework for mitigating LVLM hallucinations. AOD learns a hallucination-related direction through a minimax objective: a classifier concentrates hallucination signals into the projected component, while an adversary removes them from the orthogonal residual space via a Gradient Reversal Layer. The learned direction enables a training-free dual-forward-pass contrastive decoding strategy that suppresses hallucinations while preserving general capabilities. Experiments on three LVLMs across four hallucination and four utility benchmarks show that AOD consistently outperforms strong baselines. It improves POPE accuracy by over 6% on average, boosts AMBER by 6%, and maintains strong performance on utility tasks such as MMMU. Further analysis shows robust transfer across datasets, suggesting that AOD captures general hallucination-related biases rather than dataset-specific artifacts. Our source code and datasets are available at https://github.com/Hunter-Wrynn/AOD.
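The geometric step, splitting a hidden vector into its component along a learned direction and the orthogonal residual, is ordinary vector projection and can be sketched directly. The second function shows the generic logit combination used in contrastive decoding; how AOD constructs its second forward pass is not described in the abstract, so that part, along with the names and the `alpha` weight, is an assumption.

```python
def split_along_direction(h, d):
    """Split hidden vector h into its component along d and the orthogonal residual."""
    norm_sq = sum(x * x for x in d)
    coeff = sum(a * b for a, b in zip(h, d)) / norm_sq
    projected = [coeff * x for x in d]                 # part of h lying along d
    residual = [a - p for a, p in zip(h, projected)]   # part of h orthogonal to d
    return projected, residual


def contrastive_logits(base, contrast, alpha=1.0):
    """Generic contrastive-decoding combination of two forward passes.

    `base` and `contrast` are per-token logits; amplifying the base pass and
    subtracting the contrast pass is an assumed, commonly used form.
    """
    return [(1.0 + alpha) * b - alpha * c for b, c in zip(base, contrast)]
```

By construction the residual has zero dot product with `d`, which is the property an adversary acting on the residual space relies on.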

2605.25373 2026-05-26 cs.CV

Physics-Aware 3D Gaussian Editing for Driving Scene Generation

物理感知的三维高斯编辑用于驾驶场景生成

Feng Zhou, Jian Zhang, Yuhang Sun, He Wang, Qiong Wen, Debao Kong, Tieru Wu, Rui Ma

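AI summary: Proposes RoVES, a system that combines single-image-driven road geometry insertion with a 4-DOF half-car dynamics model for physics-aware driving scene editing and vehicle pose correction.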

AI Chinese abstract

三维高斯泼溅(3DGS)在自动驾驶仿真和数据生成中展现出巨大潜力,能够实现逼真的重建和灵活的场景操作。然而,现有的3DGS场景编辑方法对道路几何编辑(例如插入减速带或凹陷路面)支持有限,并且通常不将此类编辑与合理的车辆-道路交互动力学耦合。这种编辑对于在极端驾驶场景下生成训练数据或评估系统在这些道路不规则情况下的可靠性至关重要。此外,许多基于优化的方法需要每次编辑进行数分钟的细化,而现有的高效替代方案主要关注外观级别或对象级别的操作,而非物理感知的道路不规则编辑。为了解决这些限制,我们提出了RoVES,一个用于驾驶场景中物理感知三维高斯编辑的道路和车辆编辑系统。RoVES实现了单图像驱动的道路几何插入,并将编辑后的道路轮廓与4-DOF半车动力学模型耦合,以实现垂直位移和俯仰方向上的物理感知车辆姿态校正。RoVES以一次性、无优化的流水线(1.84秒)插入道路元素,完整流水线(包括颜色转移和基于车辆动力学的姿态校正)在6.24秒内完成;它通过姿态编辑编辑动态车辆,并逐帧校正姿态以近似动力学一致的垂直位移和俯仰响应。在Waymo数据集上的实验表明,RoVES为物理感知的驾驶场景生成提供了实用的效率和具有竞争力的视觉一致性。

English abstract

3D Gaussian Splatting (3DGS) has shown great potential in autonomous driving simulation and data generation, enabling photorealistic reconstruction and flexible scene manipulation. However, existing 3DGS scene editing methods have limited support for road geometry editing (e.g., inserting speed humps or sunken roads), and generally do not couple such edits with plausible vehicle-road interaction dynamics. Such editing is essential for generating training data under extreme driving scenarios or evaluating system reliability under these road irregularities. Moreover, many optimization-based methods require minutes of per-edit refinement, while existing efficient alternatives mainly focus on appearance-level or object-level manipulation rather than physics-aware road irregularity editing. To address these limitations, we propose RoVES, a Road-and-Vehicle Editing System for physics-aware 3D Gaussian editing in driving scenes. RoVES enables single-image-driven road geometry insertion and couples the edited road profile with a 4-DOF half-car vehicle dynamics model to achieve physics-aware vehicle pose correction in vertical displacement and pitch. RoVES inserts road elements in a one-shot, optimization-free pipeline (1.84s), and the full pipeline (including color transfer and vehicle-dynamics-based pose correction) completes in 6.24s; it edits dynamic vehicles via pose editing and corrects poses frame-by-frame to approximate dynamics-consistent vertical displacement and pitch responses. Experiments on the Waymo dataset show that RoVES provides practical efficiency and competitive visual consistency for physics-aware driving scene generation.

2605.25364 2026-05-26 cs.CV

Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning

MLLMs 能否超越语言进行推理?VisReason:一个面向视觉中心推理的综合基准

Longteng Guo, Yifan Wang, Pengkang Huo, Tailai Chen, Yuze Wu, Jing Liu, Xinxin Zhu

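AI summary: Introduces VisReason, a benchmark of 1,505 questions set in everyday scenarios that evaluates multimodal large language models on vision-centric reasoning and reveals a substantial gap between humans and models.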

Comments Accepted by ACL 2026 Findings, resources released at https://github.com/CASIA-IVA-Lab/VisReason

详情
AI中文摘要

近期多模态大语言模型(MLLMs)在视觉推理基准上取得了强劲性能,但尚不清楚这种性能在多大程度上反映了直接基于视觉证据的推理。我们引入了 VisReason,一个面向日常场景中视觉中心推理的基准,其中感知与推理紧密耦合。VisReason 包含 1505 个问题,涵盖感知、结构和概念推理等 10 个类别。我们的评估表明,VisReason 对现有基准提出了性质不同的挑战,暴露了人类与当前 MLLMs 之间的巨大差距,并揭示了测试时推理策略带来的有限收益。VisReason 为评估超越语言的视觉中心推理提供了一个聚焦的诊断工具。

英文摘要

Recent multimodal large language models (MLLMs) achieve strong performance on visual reasoning benchmarks, yet it remains unclear to what extent such performance reflects reasoning directly grounded in visual evidence. We introduce VisReason, a benchmark for vision-centric reasoning in everyday scenarios where perception and inference are tightly coupled. VisReason contains 1,505 questions across 10 categories spanning perceptual, structural, and conceptual reasoning. Our evaluation shows that VisReason poses a qualitatively different challenge from existing benchmarks, exposing substantial gaps between humans and current MLLMs and revealing limited benefits from test-time reasoning strategies. VisReason offers a focused diagnostic for evaluating vision-centric reasoning beyond language.

2605.25363 2026-05-26 cs.CV

MARVEL: Universal Murray's Law-informed Vessel Tree Segmentation and Topology Estimation

MARVEL:基于Murray定律的通用血管树分割与拓扑估计

Yi Zhou, Thiara Sana Ahmed, Jacqueline Chua, Meng Wang, Qinrong Zhang, Alejandro F. Frangi, Huazhu Fu, Jun Cheng, Leopold Schmetterer, Bingyao Tan

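AI summary: Proposes MARVEL, a backbone-agnostic framework that regularizes training with differentiable Murray's law constraints, improving the physiological plausibility and topological consistency of vessel segmentation and significantly outperforming baseline models on hypertension classification.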

Comments 10 pages, 18 figures

详情
AI中文摘要

血管循环遵循优化质量传输和代谢能量消耗的基本生物物理原理,这些原理可以通过Murray定律有效建模。然而,当代深度学习方法用于血管分割时往往忽略这些生物物理约束,导致生理上不合理的分支和血管树误分类,使得这些自动分割结果对于下游临床任务(如血流模拟或疾病量化)不可靠。在本文中,我们引入MARVEL(基于Murray定律的通用血管分割与拓扑估计),一个与骨干网络无关的框架,将生物物理先验整合到血管树提取中。MARVEL结合逐像素监督与显式半径预测,以强制执行从经验宽度-指数映射导出的局部分叉约束。我们在训练期间将这些约束实现为可微正则化器,以引导模型朝向生理一致的重建。我们在八个公开数据集上评估MARVEL,涵盖多种血管模态和分割骨干网络。结果表明MARVEL在分割准确性、拓扑一致性和生理合理性方面具有优越性能。通过将分割掩膜转换为基于图的血流动力学模拟,我们证明MARVEL保留了区分高血压眼和正常眼所需的细微病理狭窄和拓扑连接。结果显示,MARVEL通过眼内动静脉压力差显著改善了高血压的分类(p < 0.001),在拓扑一致性和临床预测价值方面均优于基线模型。

英文摘要

Vascular circulation follows fundamental biophysical principles that optimize mass transport and metabolic energy expenditure, which can be effectively modeled by Murray's law. However, contemporary deep learning methods for vascular segmentation often neglect these biophysical constraints. This leads to physiologically implausible branching and misclassification of vascular trees, rendering these automated segmentation results unreliable for downstream clinical tasks such as blood flow simulation or disease quantification. In this paper, we introduce MARVEL (Universal MurrAy's law-infoRmed Vessel sEgmentation and topoLogy estimation), a backbone-agnostic framework that integrates biophysical priors into vascular tree extraction. MARVEL combines per-pixel supervision with explicit radius predictions to enforce local bifurcation constraints derived from an empirical width-exponent mapping. We implement these constraints as differentiable regularizers during training to guide models toward physiologically consistent reconstructions. We evaluate MARVEL on eight public datasets across multiple vascular modalities and segmentation backbones. Results demonstrate MARVEL's superior performance in segmentation accuracy, topological consistency, and physiological plausibility. By converting segmented masks into graph-based hemodynamic simulations, we demonstrate that MARVEL preserves the subtle pathological narrowing and topological connectivity required to distinguish hypertensive from normotensive eyes. Results show that MARVEL significantly improves the classification of hypertension via arteriovenous pressure differences in the eye (p < 0.001), outperforming baseline models in both topological consistency and clinical predictive value.
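
The bifurcation constraint described above can be illustrated with the classical form of Murray's law, in which the cube of the parent radius equals the sum of the cubes of the child radii. The following is a minimal Python sketch, assuming the classical exponent of 3 and a squared relative residual as the penalty. The function names and the penalty form are hypothetical; the abstract does not specify MARVEL's width-exponent mapping or its loss.

```python
def murray_residual(parent_r, child_r1, child_r2, exponent=3.0):
    """Relative deviation from Murray's law at one bifurcation.

    Classical Murray's law: parent_r**k == child_r1**k + child_r2**k with k = 3.
    Returns 0.0 when the radii satisfy the law exactly.
    """
    if min(parent_r, child_r1, child_r2) <= 0:
        raise ValueError("radii must be positive")
    parent_term = parent_r ** exponent
    child_term = child_r1 ** exponent + child_r2 ** exponent
    return (parent_term - child_term) / parent_term


def murray_penalty(bifurcations, exponent=3.0):
    """Mean squared relative residual over (parent, child1, child2) radius triples."""
    if not bifurcations:
        return 0.0
    residuals = [murray_residual(p, c1, c2, exponent) for p, c1, c2 in bifurcations]
    return sum(r * r for r in residuals) / len(residuals)


# A symmetric bifurcation that satisfies the cubic law:
# each child has radius parent / 2**(1/3).
parent = 1.0
child = parent / 2 ** (1.0 / 3.0)
consistent = murray_penalty([(parent, child, child)])

# Children as wide as the parent violate the law.
inconsistent = murray_penalty([(parent, parent, parent)])
```

In a training setting, a penalty of this kind would be computed from predicted radii with a differentiable framework and added to the segmentation loss; the pure-Python version here only shows the arithmetic.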

2605.25362 2026-05-26 cs.RO

Prior Policy Guided Dual-Agent Coordinated Manipulation Planning of Spacecraft-Manipulator System

先验策略引导的航天器-机械臂系统双智能体协同操控规划

Yuhui Hu, Dong Zhou, Kaihong Ouyang, Zhongliang Yu, Jianfeng Lv, Xiangyu Shao

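AI summary: To address the attitude-stability problem caused by strong coupling between a space manipulator and its base, proposes a prior-policy-guided dual-agent coordinated manipulation planning framework whose timestep-level expert switching mechanism improves deep reinforcement learning efficiency, achieving high-precision end-effector reaching together with base attitude stabilization.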

Comments 36 pages, 13 figures, 6 tables. Under review

详情
AI中文摘要

机械臂与基座之间的强动态耦合对维持航天器姿态稳定性构成了重大挑战,可能危及任务安全。本文提出了一种双智能体协同操控规划(DACMP)框架,该框架同时实现了六自由度空间机械臂末端执行器的高精度位姿到达和基座航天器的姿态稳定。为了提高学习效率,我们提出了一种结合时间步级专家切换引导(TESG)机制的先验策略引导深度强化学习算法,从而促进全局收敛并提高任务成功率。大量实验表明,DACMP在任务成功率和控制精度方面显著优于基线深度强化学习算法。此外,在包括系统约束、环境干扰和感知不确定性在内的各种挑战性场景下,验证了DACMP的鲁棒性。代码和仿真配置可在GitHub上获取:https://github.com/HIT-YuhuiHu/DACMP。

英文摘要

The strong dynamic coupling between the manipulator and the base poses a significant challenge to maintaining spacecraft attitude stability, potentially compromising mission safety. In this paper, we propose a Dual-Agent Coordinated Manipulation Planning (DACMP) framework that simultaneously achieves high-precision end-effector pose reaching for a 6-DoF space manipulator and attitude stabilization of the base spacecraft. To enhance learning efficiency, we present a prior policy-guided Deep Reinforcement Learning algorithm incorporating the Timestep-level Expert Switching Guidance (TESG) mechanism, thereby promoting global convergence and improving task success rates. Extensive experiments demonstrate that DACMP significantly outperforms baseline DRL algorithms in terms of task success rate and control precision. Furthermore, the robustness of DACMP is validated under various challenging scenarios, including system constraints, environmental disturbances, and perception uncertainties. The code and simulation configurations are available on GitHub: https://github.com/HIT-YuhuiHu/DACMP.

2605.25360 2026-05-26 cs.CL

Learning to Route Languages for Multilingual Policy Optimization

学习路由语言以实现多语言策略优化

Geyang Guo, Hiromi Wakaki, Yuki Mitsufuji, Alan Ritter, Wei Xu

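AI summary: Proposes the language-routed policy optimization (LRPO) framework, which treats language as a selectable variable and uses online policy optimization with a trainable language router (a multi-armed bandit) to choose languages adaptively, increasing the diversity and informativeness of multilingual training signals under a fixed budget and consistently improving multilingual performance.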

Comments Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在异构多语言语料库上进行训练,然而现有的策略优化方法通常隐式地将每个训练问题限制为单一响应语言,或依赖固定的主导语言进行监督。我们提出了语言路由策略优化(LRPO),这是一种在线策略优化框架,将语言视为可选变量。LRPO为每个训练问题生成多语言展开,并将其相对质量整合到基于偏好的策略更新中,从而在固定展开预算下增加训练信号的多样性和信息量。为了在强化学习过程中自适应地决定探索哪些语言,我们引入了一个可训练的语言路由器,其形式为多臂老虎机,平衡对未充分利用语言的探索与对信息量更大语言的利用。大量实验表明,LRPO持续提升多语言性能,证明自适应语言路由能够有效利用跨语言知识进行训练。我们在https://github.com/Guochry/LRPO 发布所有资源。

英文摘要

Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation for training. We release all the resources at https://github.com/Guochry/LRPO.
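
The exploration-exploitation trade-off that the router balances can be illustrated with a standard bandit algorithm. The sketch below uses UCB1, which is a swapped-in textbook method and not the paper's trainable router; the class name, the language set, and the simulated reward (a stand-in for how informative a rollout in that language was) are all hypothetical.

```python
import math
import random


class UCB1LanguageRouter:
    """Toy bandit over response languages (UCB1, not the paper's trainable router)."""

    def __init__(self, languages, exploration=2.0):
        self.languages = list(languages)
        self.exploration = exploration
        self.counts = {lang: 0 for lang in self.languages}
        self.mean_reward = {lang: 0.0 for lang in self.languages}

    def select(self):
        # Try every language once before applying the confidence bound.
        for lang in self.languages:
            if self.counts[lang] == 0:
                return lang
        total = sum(self.counts.values())

        def ucb(lang):
            bonus = math.sqrt(self.exploration * math.log(total) / self.counts[lang])
            return self.mean_reward[lang] + bonus

        return max(self.languages, key=ucb)

    def update(self, lang, reward):
        # Incremental mean of the observed reward for this language.
        self.counts[lang] += 1
        self.mean_reward[lang] += (reward - self.mean_reward[lang]) / self.counts[lang]


# Simulated use with hypothetical per-language success probabilities.
rng = random.Random(0)
true_value = {"en": 0.3, "ja": 0.6, "de": 0.4}
router = UCB1LanguageRouter(true_value)
for _ in range(500):
    lang = router.select()
    reward = 1.0 if rng.random() < true_value[lang] else 0.0
    router.update(lang, reward)
```

The confidence bonus shrinks as a language is sampled more often, so rarely chosen languages keep being revisited while languages with higher observed reward are chosen more frequently.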

2605.25358 2026-05-26 cs.CL cs.AI cs.CY

AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing

AI相关的词汇转变跨越34种语言:新闻写作中的跨语言趋同与历时采纳

Thomas Stephan Juzek

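AI summary: Analyzes news corpora in 34 languages with a GPT-4.1 continuation diagnostic and finds that AI-overused vocabulary converges semantically across languages, and that the prevalence of these words rose in 26 of the 34 languages after ChatGPT's release.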

Comments 19 pages (9-page main body, plus references and appendices), 3 figures; ACL ARR reviewed, committed to EMNLP 2026

详情
AI中文摘要

AI相关的词汇转变主要被记录在科学英语中。我们将这项工作扩展到WMT新闻抓取语料库中的34种语言,改进了一种分割-后半部分续写诊断方法,比较GPT-4.1续写与匹配的人类黄金标准文本。对于每种语言,我们使用对数流行率比率推导出排名靠前的AI过度使用词元。我们发现显著的跨语言语义趋同:语义相关的概念在类型多样的语言中反复出现,其中'强调'类动词出现在34种语言中的24种。基于嵌入和人工分析支持这一模式。我们还考察了ChatGPT发布前后新闻写作中的历时采纳情况。追踪每种语言前20个AI过度使用项目,我们发现从2020-2021年到2023-2024年,34种语言中有26种语言的流行率增加,平均变化为+15.1%,而匹配的基线词汇没有显示出可比的增加(-4.5%)。在具有较长历史覆盖的10种语言中,纵向分析显示2022年后的增加超过了早期观察到的适度变化,尽管效应大小小于科学英语。我们广泛验证了我们的方法,包括跨种子、模型变体、数据大小、模型系列等。我们的发现与以下观点一致:AI相关的词汇偏好超越了英语,并可能对全球语言使用施加跨语言同质化压力。

英文摘要

AI-associated lexical shifts have been documented mainly in Scientific English. We extend this work to 34 languages in the WMT News Crawl corpus, refining a split-halves continuation diagnostic that compares GPT-4.1 continuations with matched human gold-standard text. For each language, we derive ranked AI-overused lemmas using log prevalence ratios. We find substantial cross-lingual semantic convergence: semantically related concepts recur across typologically diverse languages, with 'emphasize'-type verbs appearing in 24 of 34 languages. Embedding-based and manual analyses support this pattern. We also examine diachronic uptake in news writing before and after ChatGPT's release. Tracking each language's top 20 AI-overused items, we find prevalence increases in 26 of 34 languages from 2020-2021 to 2023-2024, with a mean change of +15.1%, whilst matched baseline words show no comparable increase (-4.5%). In 10 languages with longer historical coverage, longitudinal analyses show post-2022 increases that exceed the modest shifts observed in earlier periods, though with smaller effect sizes than in Scientific English. We validate our approach extensively, including across seeds, model variants, data sizes, model families, and more. Our findings are consistent with the view that AI-associated lexical preferences extend beyond English and may exert cross-lingual homogenising pressure on global language use.
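
Ranking lemmas by a log prevalence ratio can be sketched as follows. This is a minimal Python illustration, assuming that prevalence means the share of documents containing a lemma and that additive smoothing of 0.5 is applied; both are assumptions, since the abstract does not define the statistic. The function names and counts are hypothetical.

```python
import math


def log_prevalence_ratio(ai_docs_with_lemma, ai_docs,
                         human_docs_with_lemma, human_docs, smoothing=0.5):
    """Log ratio of smoothed document prevalence (AI continuations vs human text)."""
    ai_prev = (ai_docs_with_lemma + smoothing) / (ai_docs + 2 * smoothing)
    human_prev = (human_docs_with_lemma + smoothing) / (human_docs + 2 * smoothing)
    return math.log(ai_prev / human_prev)


def rank_overused(counts, ai_docs, human_docs, top_k=20):
    """Rank lemmas by log prevalence ratio, highest first.

    counts maps lemma -> (docs containing it in AI text, docs containing it in human text).
    """
    scored = {
        lemma: log_prevalence_ratio(ai_n, ai_docs, human_n, human_docs)
        for lemma, (ai_n, human_n) in counts.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_k]


# Hypothetical counts out of 1,000 documents per side.
toy_counts = {"emphasize": (120, 30), "say": (400, 410), "underscore": (60, 10)}
ranking = rank_overused(toy_counts, ai_docs=1000, human_docs=1000)
```

A positive score marks a lemma that appears in a larger share of AI continuations than of matched human text; a score near zero, as for "say" here, marks no preference.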

2605.25357 2026-05-26 cs.CV cs.MA

Towards Reliable Fetal Ultrasound Interpretation with Multi-Agent Collaboration

面向可靠胎儿超声解读的多智能体协作

Xiaotian Hu, Mingxuan Liu, Junwei Huang, Kasidit Anmahapong, Yifei Chen, Yiming Huang, Xuguang Bai, Zihan Li, Hongjia Yang, Yingqi Hao, Hong Xu, Yu Jiang, Tian Tian, Yi Liao, Haibo Qu, Qiyuan Tian

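AI summary: Proposes FetUSAgents, a multi-agent system that integrates visual tools with clinical reasoning through collaborating LLM agents and Dual-Path Evidence Arbitration (DPEA), supports fetal ultrasound VQA, report generation, and related tasks, and exceeds the strongest baseline by more than 25 percent in VQA accuracy.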

详情
AI中文摘要

自动化胎儿超声解读需要从视觉感知(包括平面识别和解剖分割)到临床理解(包括生物测量和诊断报告)的工作流程。然而,当前“一任务一模型”的范式限制了跨多步骤过程的系统性证据整合。尽管多模态大语言模型(MLLM)展现出有前景的视觉理解能力,但其有限的领域特定基础和幻觉风险限制了在胎儿超声分析中的可靠性。为解决这些限制,我们提出了FetUSAgents,一个工具增强的多智能体系统,用于全面的胎儿超声解读,支持视觉问答(VQA)、报告生成、图像描述和视频总结。FetUSAgents通过协作的LLM代理协调任务特定的视觉工具,并将临床查询分解为从解剖识别到定量测量的子任务。我们进一步引入了双路径证据仲裁(DPEA),它将基于LLM的审慎推理与来自专业视觉工具的结构化计算证据相结合。一个检索增强的证据库整合中间发现,以支持可追溯且临床可靠的结论。此外,我们构建了FetUS-VQA,一个专门用于胎儿超声的VQA基准,包含1,892张图像和3,205个问答对,涵盖10个临床任务。广泛的分布外实验表明,FetUSAgents优于通用和医学MLLM,在VQA准确率上超过最强基线25%以上。这些结果表明了一条通往产前成像的基于证据的临床助手的可扩展路径。代码已公开。

英文摘要

Automated fetal ultrasound interpretation requires a workflow from visual perception, including plane recognition and anatomical segmentation, to clinical understanding, including biometric measurement and diagnostic reporting. However, the prevailing "one-task, one-model" paradigm limits systematic integration of evidence across this multi-step process. Although multimodal large language models (MLLMs) show promising visual understanding, their limited domain-specific grounding and hallucination risks restrict reliability in fetal ultrasound analysis. To address these limitations, we propose FetUSAgents, a tool-augmented multi-agent system for comprehensive fetal ultrasound interpretation, supporting visual question answering (VQA), report generation, image captioning, and video summarization. FetUSAgents coordinates task-specific visual tools through collaborative LLM agents and decomposes clinical queries into subtasks that progress from anatomical recognition to quantitative measurement. We further introduce Dual-Path Evidence Arbitration (DPEA), which integrates LLM-based deliberative reasoning with structured computational evidence from specialized visual tools. A retrieval-enhanced evidence bank consolidates intermediate findings to support traceable and clinically grounded conclusions. In addition, we construct FetUS-VQA, a dedicated VQA benchmark for fetal ultrasound, comprising 1,892 images and 3,205 question-answer pairs across 10 clinical tasks. Extensive out-of-distribution experiments show that FetUSAgents outperforms general and medical MLLMs, exceeding the strongest baseline by more than 25 percent in VQA accuracy. These results suggest a scalable route toward evidence-driven clinical assistants for prenatal imaging. Code is available.

2605.25354 2026-05-26 cs.AI

Context-CoT: Enhancing Context Learning via High-Quality Reasoning Synthesis

Context-CoT:通过高质量推理合成增强上下文学习

Hongbo Jin, Mingnan Zhu, Jingqi Tian, Xu Jiang, Zhongjing Du, Haoran Tang, Siyi Xie, Qiaoman Zhang, Jiayu Ding

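AI summary: To address large language models' weak context learning, that is, their limited ability to dynamically extract and apply new knowledge, proposes Context-CoT, which strengthens context learning by synthesizing high-quality reasoning chains and significantly improves performance on CL-Bench.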

详情
AI中文摘要

虽然大语言模型在使用静态预训练知识进行推理方面表现出色,但在上下文学习——即从复杂、任务特定的上下文中动态提取、内化和应用新知识的能力——方面存在显著困难。最近在CL-Bench上的评估揭示了一个关键能力差距:前沿模型平均仅能解决17.2%的上下文相关任务。

英文摘要

While LLMs excel at reasoning over prompts using static pretrained knowledge, they struggle significantly with context learning, that is, the ability to dynamically extract, internalize, and apply new knowledge from complex, task-specific contexts. Recent evaluations on CL-Bench reveal a critical capability gap: frontier models solve only 17.2% of context-dependent tasks on average.

2605.25352 2026-05-26 cs.LG cs.AI

Certified Robustness from Approximate Gaussian Mixture Structures in Pretrained Latent Spaces

基于预训练潜在空间中近似高斯混合结构的认证鲁棒性

Konstantinos Emmanouilidis, Tianjiao Ding, Nghia Nguyen, Nicolas Loizou, René Vidal

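AI summary: Proposes a framework that uses a pretrained encoder to map inputs to a latent distribution that is approximately a Gaussian mixture, proves that the resulting robustness degradation is bounded, and thereby obtains certifiably robust classifiers that reach state-of-the-art or competitive certified accuracy on CIFAR-10 and ImageNet.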

详情
AI中文摘要

深度学习模型易受对抗扰动影响,这对安全关键部署提出了重要关切。经验性防御在实践中可以实现强鲁棒性,但缺乏形式化保证,这推动了可认证鲁棒分类器的需求。虽然认证方法提供了形式化保证,但由于无法利用复杂数据分布中的结构,它们通常产生过于保守的边界。在这项工作中,我们提出了一个设计可认证鲁棒分类器的框架,该框架利用数据表示中的潜在结构。我们首先分析高斯混合设置,推导出鲁棒分类器存在的必要和充分条件,并构建了一个具有闭式鲁棒性证书和泛化保证的分类器。我们的主要贡献是证明精确结构并非必需:我们证明,如果预训练编码器将输入映射到一个与高斯混合分布$\varepsilon$-接近(在KL散度下)的潜在分布,那么认证准确率会优雅地退化,并给出了一个显式边界,关联真实分布和近似分布下的鲁棒性。这一结果使得直接使用预训练模型成为可能,而无需精确的分布假设。实验上,我们的方法在CIFAR-10和ImageNet上实现了最先进或具有竞争力的认证准确率,同时保持了强大的干净性能和低计算开销。总体而言,我们的工作将近似潜在结构确立为通往可认证鲁棒性的一条实用且有原则的路径。

英文摘要

Deep learning models are vulnerable to adversarial perturbations, raising important concerns for safety-critical deployment. Empirical defenses can achieve strong robustness in practice, but lack formal guarantees, motivating the need for certifiably robust classifiers. While certified methods provide formal guarantees, they often yield overly conservative bounds due to their inability to exploit structure in complex data distributions. In this work, we propose a framework for designing certifiably robust classifiers that leverages latent structure in data representations. We first analyze the Gaussian mixture setting, deriving necessary and sufficient conditions for the existence of robust classifiers and constructing a classifier with a closed-form robustness certificate and generalization guarantees. Our main contribution is to show that exact structure is not required: we prove that if a pretrained encoder maps inputs to a latent distribution that is $\varepsilon$-close (in KL divergence) to a Gaussian mixture, then certified accuracy degrades gracefully, with an explicit bound relating robustness under the true and approximate distributions. This result enables the direct use of pretrained models without requiring exact distributional assumptions. Empirically, our method achieves state-of-the-art or competitive certified accuracy on CIFAR-10 and ImageNet, while maintaining strong clean performance and low computational overhead. Overall, our work establishes approximate latent structure as a practical and principled route to certifiable robustness.
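
To make the idea of a closed-form certificate concrete, consider the simplest Gaussian mixture case: two equal-prior classes sharing an isotropic covariance, for which the Bayes classifier is linear and the certified L2 radius of a point is its distance to the decision hyperplane. The Python sketch below illustrates only this textbook case; it is not the certificate derived in the paper, and the function names are hypothetical.

```python
import math


def bayes_linear_classifier(mu0, mu1):
    """Bayes rule for two equal-prior Gaussians sharing an isotropic covariance.

    Returns (w, b) such that the prediction is class 1 when w.z + b > 0.
    """
    w = [m1 - m0 for m0, m1 in zip(mu0, mu1)]
    b = -0.5 * (sum(m * m for m in mu1) - sum(m * m for m in mu0))
    return w, b


def certified_l2_radius(w, b, z):
    """Distance from latent point z to the hyperplane w.z + b = 0.

    No perturbation of z with L2 norm below this value can change the prediction.
    """
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))


# Class means at (0, 0) and (4, 0): the decision boundary is the line x = 2.
w, b = bayes_linear_classifier(mu0=[0.0, 0.0], mu1=[4.0, 0.0])
radius = certified_l2_radius(w, b, z=[3.0, 1.0])
```

The radius here is measured in the latent space; relating it to perturbations of the raw input requires further assumptions about the encoder that the abstract does not detail.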

2605.25347 2026-05-26 cs.CV cs.LG

ERNIE-Image Technical Report

ERNIE-Image 技术报告

Jiaxiang Liu, Zhida Feng, Pengyu Zou, Zhenyu Qian, Tianrui Zhu, Jun Xia, Yuehu Dong, Yanzheng Lin, Honglin Xiong, Anqi Chen, Yunpeng Ding, Jinghui Duan, Lin Gao, Chao Han, Tiechao He, Jiakang Hu, Ranjun Hua, Xueming Jiang, Qingli Kong, Yuting Lei, Tianyu Li, Yunlin Liu, Changling Liu, Yaxin Liu, Yi Liu, Xuguang Liu, Xiaolong Ma, Yan Pan, Yiran Ren, Nan Sheng, Yu Sun, Siyang Sun, Yixiang Tu, Yang Wan, Huanai Wang, Siqi Wang, Yang Wu, Youzhi Yang, Xiaowen Yang, Jianwen Yang, Yehua Yang, Quanwen Zhang, Xinmin Zhang, Haoxin Zhang, Xiang Zhang, Jun Zhang, Qian Zhang, Qiao Zhao, Qi Zhou

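AI summary: Presents ERNIE-Image, an open-source text-to-image generation model built on an 8B single-stream DiT architecture, which combines bottom-up pre-training data construction, top-down post-training data construction, a stabilized DPO strategy, and MT-DMD distillation to approach top-tier commercial models in instruction following, text rendering, and aesthetic quality.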

详情
AI中文摘要

我们介绍了ERNIE-Image,一个基于8B单流DiT架构构建的开源文本到图像生成模型。ERNIE-Image旨在通过更有效地挖掘大规模预训练数据并在整个训练过程中提高监督质量,来弥合当前开源模型与领先闭源系统之间的差距。在预训练阶段,我们采用自底向上的数据构建流程,结合细粒度图像分类、丰富的标题注释、美学评估和分层采样。该策略在保留长尾概念和详细真实世界知识的同时减少数据噪声,为复杂生成任务提供了更坚实的基础。在后训练阶段,我们针对高需求场景使用自顶向下的数据构建流程,多样化提示注释以更好地匹配真实用户输入,并应用稳定的DPO策略使模型与人类美学偏好对齐。我们进一步训练ERNIE-Image-Turbo以实现高效的8-NFE生成,并提出MT-DMD以减轻蒸馏过程中的能力漂移。为了使模型在实际场景中更易于使用,我们为其配备了一个轻量级的提示增强器,将简洁的用户意图扩展为结构化的视觉描述。此外,我们开发了工业级美学模型ERNIE-Image-Aes,以及用于真实美学评估的人工标注基准ERNIE-Image-Aes-1K。大量的定性和定量实验表明,ERNIE-Image在开源模型中实现了领先性能,并在指令遵循、文本渲染和美学质量方面接近顶级商业模型。我们发布训练好的模型和美学资源,以促进AIGC社区的进一步学术研究和技术进步。

英文摘要

We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling. This strategy reduces data noise while preserving long-tail concepts and detailed real-world knowledge, providing a stronger foundation for complex generation tasks. In the post-training stage, we use a top-down data construction pipeline for high-demand scenarios, diversify prompt annotations to better match real user inputs, and apply a stabilized DPO strategy to align the model with human aesthetic preferences. We further train ERNIE-Image-Turbo for efficient 8-NFE generation and propose MT-DMD to mitigate capability drift during distillation. To make the model easier to use in practical scenarios, we equip it with a lightweight Prompt Enhancer that expands concise user intents into structured visual descriptions. In addition, we develop ERNIE-Image-Aes, an industrial-grade aesthetic model, together with ERNIE-Image-Aes-1K, a human-annotated benchmark for realistic aesthetic evaluation. Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality. We release the trained models and aesthetic resources to facilitate further academic research and technical progress in the AIGC community.

2605.25346 2026-05-26 cs.RO cs.AI cs.LG cs.SY eess.SY math.OC

Parallel Differentiable Reachability for Learning and Planning with Certified Neural Dynamics and Controllers

用于学习和规划的并行可微可达性:带认证的神经动力学与控制器

Keyi Shen, Glen Chou

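AI summary: Proposes a parallel, differentiable reachability framework in JAX that combines Taylor-model flowpipe construction with CROWN-style linear bound propagation and supports GPU batching and automatic differentiation; it is used for certified training and reachability-aware MPC, enabling online planning with certified reachable-set over-approximations under bounded uncertainty on non-prehensile manipulation and quadrotor tasks.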

Comments Robotics: Science and Systems XXII (RSS 2026)

详情
AI中文摘要

神经网络动力学模型和控制策略在机器人领域取得了强大性能,但在不确定性下提供可靠保证仍然困难,尤其是对于闭环神经网络系统。现有的可达性工具提供了形式化的过近似,但通常不可微、过于保守或对于现代学习和在线规划流程来说太慢。为了解决这个问题,我们提出了一个在JAX中可并行化、可微的可达性框架,适用于连续和离散时间系统,具有解析和基于神经网络的动力学和控制器。我们的框架通过统一表示结合了泰勒模型流形构建和CROWN风格的线性界传播,该表示在支持GPU批处理计算和自动微分的同时保留了仿射依赖。基于这个可达性基元,我们开发了(i)一种认证训练方法,鼓励生成对可达性友好的动力学模型和控制器,以及(ii)一种具有基于梯度细化的可达性感知采样MPC方案。在非抓取操作和四旋翼任务上的实验,包括硬件和更高维度的评估(高达72维),展示了在实际在线规划中保持有界不确定性下认证可达集过近似的可行性。

英文摘要

Neural network (NN) dynamics models and control policies achieve strong performance in robotics, but providing sound guarantees under uncertainty remains difficult, especially for closed-loop NN systems. Existing reachability tools provide formal over-approximations, yet are often non-differentiable, overly conservative, or too slow for modern learning and online planning pipelines. To address this, we present a parallelizable, differentiable reachability framework in JAX for continuous- and discrete-time systems with analytical and NN-based dynamics and controllers. Our framework combines Taylor-model flowpipe construction with CROWN-style linear bound propagation through a unified representation that preserves affine dependencies while supporting GPU-batched computation and automatic differentiation. Building on this reachability primitive, we develop (i) a certified training method that encourages reachability-friendly dynamics models and controllers, and (ii) a reachability-aware sampling-based MPC scheme with gradient-based refinement. Experiments on non-prehensile manipulation and quadrotor tasks, including hardware and higher-dimensional evaluations (up to 72D), demonstrate practical online planning while maintaining certified reachable-set over-approximations under bounded uncertainty.

2605.25344 2026-05-26 cs.CL cs.AI cs.LG quant-ph

A general tensor-structured compression scheme for efficient large language models

一种用于高效大语言模型的通用张量结构压缩方案

Ying Lu, Peng-Fei Zhou, Qi-Xuan Fang, Pan Zhang, Shi-Ju Ran, Gang Su

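AI summary: Proposes the Tensor Mixture (MixT) scheme, which replaces dense linear layers with mixtures of tensor operators and substantially reduces parameters, FLOPs, and memory while largely preserving MMLU accuracy.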

Comments 12 pages, 4 figures

详情
AI中文摘要

大语言模型(LLMs)主要由密集线性变换主导,其存储、内存和计算开销阻碍了高效的适配和部署,同时掩盖了结构简化对功能的影响。本文提出张量混合(MixT),一种通用的张量结构压缩方案,将目标密集线性层替换为可原生执行的张量算子混合体。MixT直接作用于通用线性投影而非模型特定组件,因此可能适用于基于Transformer的LLMs及其他密集神经映射。我们在统一的恢复协议下对Qwen3-8B和LLaMA2-7B评估MixT,识别出一个广泛的压缩区域,在该区域内MMLU准确率基本保持不变,直到模型特定边界处出现突变。该突变与输出熵、预测熵和层间几何的协同变化同时发生。在LLaMA2-7B的突变边界处,MixT将全模型参数减少47.5%,推理FLOPs减少37.1%,训练FLOPs减少52.1%,峰值推理内存减少60.4%,展示了其在低成本LLM压缩中的实际潜力。

英文摘要

Large language models (LLMs) are dominated by dense linear transformations, whose storage, memory and computational overheads hinder efficient adaptation and deployment while masking the functional impacts of structural simplification. Here we present Tensor Mixture (MixT), a general tensor-structured compression scheme that replaces targeted dense linear layers with natively executable mixtures of tensor operators. Operating directly on generic linear projections instead of model-specific components, MixT is potentially applicable across Transformer-based LLMs and other dense neural mappings. We evaluate MixT on Qwen3-8B and LLaMA2-7B under a unified recovery protocol, identifying a broad compressible regime in which MMLU accuracy is largely preserved before an abrupt transition at model-specific boundaries. This transition coincides with coordinated shifts in output entropy, prediction entropy and inter-layer geometry. At the LLaMA2-7B transition boundary, MixT reduces full-model parameters by 47.5%, inference FLOPs by 37.1%, training FLOPs by 52.1% and peak inference memory by 60.4%, demonstrating its practical potential for lower-cost LLM compression.

2605.25343 2026-05-26 cs.CV

Toward Native Multimodal Modeling: A Roadmap

迈向原生多模态建模:路线图

Siyu An, Junru Lu, Junnan Dong, Qiufeng Wang, Yinghui Li, Weizhi Fei, Zichao Yu, Zheng Yuan, Biao Liu, Haopeng Wang, Renzhao Liang, Yixuan Yang, Yunhang Shen, Bo Ke, Keyu Chen, Linhao Luo, Difan Zou, Xiao Huang, Di Yin, Ruizhi Qiao, Xing Sun

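AI summary: Presents a formal roadmap for the transition from non-native multimodal paradigms to native multimodal modeling (NMM), categorizes existing models through input-output duality, and systematically examines a full-stack, industrial-grade pipeline covering architectural coordination, data curation, training and inference, and evaluation.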

Comments 52 pages, 5 figures, 3 tables, ~300 references

详情
AI中文摘要

多模态建模是从模态无关推理迈向世界建模的关键一步。早期方法主要依赖后期融合,即组装编码器、冻结语言骨干网络和输出头;而近期研究已将范式转向原生多模态建模(NMM),通过模态的内在集成实现卓越的多模态性能。尽管潜力巨大,原生架构的设计空间仍缺乏明确定义。本文向社区呈现了这一过渡的正式路线图。具体而言,我们正式定义了架构原生性,将中期融合和早期融合与非原生范式区分开来。我们进一步通过输入-输出二元性的视角将现有原生模型组织为三类:(i) 多到文本,用于仅输出文本的跨模态理解;(ii) 多到目标,用于面向场景的生成,例如图像、音频和视频生成;(iii) 多到多,用于对称输入-输出的统一建模。我们对迈向最终NMM框架的过渡进行了全面且工业级的调查,在该框架中,理解和生成在统一的Transformer范式中无缝共存。我们从工业视角系统地拆解了端到端流水线,包括架构协调、大规模数据整理、全栈训练配方、推理与部署,以及真正原生建模的综合评估。

英文摘要

Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. While early approaches predominantly rely on late fusion, which assembles encoders and frozen language backbones with output heads, recent efforts have shifted the paradigm toward native multimodal modeling (NMM), which integrates modalities intrinsically for superior multimodal performance. Despite its potential, the design space of native architectures remains insufficiently defined. In this paper, we present the community with a formalized roadmap for this transition. Specifically, we formally define architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize the existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text for cross-modal comprehension with text-only output; (ii) Multi-to-Target for scenario-oriented generation, e.g., image, audio, and video generation; and (iii) Multi-to-Multi for unified modeling with symmetric input-output. We deliver a comprehensive and industrial-grade investigation into the transition toward the definitive NMM framework, where understanding and generation seamlessly coexist within a unified transformer paradigm. We systematically unpack the end-to-end pipeline from an industrial perspective, covering architectural coordination, massive data curation, full-stack training recipes, inference and deployment, and comprehensive evaluation for truly native modeling.

2605.25342 2026-05-26 cs.CL

MATO: Multi-objective Personalized Alignment with Test-time Optimization for Large Language Models

MATO: 面向大语言模型的多目标个性化对齐与测试时优化

Linhao Luo, Thuy-Trang Vu, Van-Anh Nguyen, Junae Kim, Gholamreza Haffari, Dinh Phung

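AI summary: Proposes MATO, a framework that uses test-time optimization to adjust multi-objective weights dynamically during decoding, aligning large language models with diverse user preferences without training or external reward models.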

Comments Preprint

详情
AI中文摘要

将大语言模型与多样且多方面的用户偏好对齐是个性化AI系统的基本挑战。现有的多目标对齐方法要么依赖昂贵的训练,要么需要为每个偏好预训练奖励模型,这使得它们难以适应不断变化的偏好。基于提示的个性化提供了一种无需训练的替代方案,但仅靠提示通常提供有限的可操控性,因为大语言模型可能过度强调或忽略某些偏好,并且在冲突出现时无法让用户可靠地控制不同目标的相对重要性,导致对齐效果欠佳。在本文中,我们介绍了MATO,一种无需训练的多目标个性化对齐与测试时优化框架。MATO将个性化表述为一个测试时优化问题,在解码过程中通过可控权重引导多个目标的相对重要性,无需修改模型参数或需要外部奖励模型。具体来说,奖励发现模块直接从骨干大语言模型中恢复针对自然语言指定的多种目标的偏好奖励,而权重优化模块根据用户的初始偏好和部分生成的响应动态调整目标权重,以在生成过程中平衡相互竞争的目标。得到的奖励和权重共同指导对令牌分布的在线优化过程,从而更好地与目标对齐。在多个数据集和骨干大语言模型上的大量实验表明,MATO始终优于强基线,实现了帕累托改进的多目标对齐和更强的可操控性。这些结果凸显了测试时优化作为可扩展、可控且模型无关的个性化对齐的一个有前景的方向。

英文摘要

Aligning large language models (LLMs) with diverse and multifaceted user preferences is a fundamental challenge in personalized AI systems. Existing multi-objective alignment methods either rely on costly training or require pre-trained reward models for each preference, making it difficult for them to adapt to evolving preferences. Prompt-based personalization offers a training-free alternative, but prompting alone often provides limited steerability, as LLMs may overemphasize or overlook certain preferences and fail to give users reliable control over the relative importance of different objectives when conflicts arise, leading to suboptimal alignment. In this paper, we introduce MATO, a training-free framework for Multi-objective personalized Alignment with Test-time Optimization. MATO formulates personalization as a test-time optimization problem that steers the relative importance of multiple objectives through controllable weights during decoding, without modifying model parameters or requiring external reward models. Specifically, a reward discovery module recovers preference rewards directly from the backbone LLM for diverse objectives specified in natural language, while a weight optimization module dynamically adjusts objective weights based on the user's initial preferences and the partially generated response to balance competing objectives during generation. The resulting rewards and weights jointly guide an online optimization procedure over the token distribution, enabling better alignment with the target objectives. Extensive experiments across multiple datasets and backbone LLMs show that MATO consistently outperforms strong baselines, achieving Pareto-improving multi-objective alignment and stronger steerability. These results highlight test-time optimization as a promising direction for scalable, controllable, and model-agnostic personalized alignment.
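
One common way to steer a token distribution with weighted rewards is to tilt the backbone's next-token probabilities by the exponential of the weighted reward sum and renormalize. The Python sketch below shows that generic form under stated assumptions; it is not confirmed to be MATO's optimization procedure, and the function name, tokens, rewards, and weights are hypothetical.

```python
import math


def reweight_token_distribution(base_probs, rewards, weights, temperature=1.0):
    """Tilt a next-token distribution by a weighted sum of per-objective rewards.

    base_probs: token -> probability under the backbone model (must be > 0)
    rewards:    objective -> {token -> reward}
    weights:    objective -> non-negative weight
    """
    scores = {}
    for token, prob in base_probs.items():
        combined = sum(weights[obj] * rewards[obj].get(token, 0.0) for obj in weights)
        scores[token] = math.log(prob) + combined / temperature
    # Softmax with max-subtraction for numerical stability.
    top = max(scores.values())
    exp_scores = {tok: math.exp(s - top) for tok, s in scores.items()}
    norm = sum(exp_scores.values())
    return {tok: v / norm for tok, v in exp_scores.items()}


# Hypothetical two-objective example.
base = {"sure": 0.5, "certainly": 0.3, "nope": 0.2}
rewards = {
    "polite": {"sure": 0.2, "certainly": 1.0, "nope": -1.0},
    "concise": {"sure": 1.0, "certainly": -0.5, "nope": 0.5},
}
polite_first = reweight_token_distribution(base, rewards, {"polite": 2.0, "concise": 0.0})
concise_first = reweight_token_distribution(base, rewards, {"polite": 0.0, "concise": 2.0})
```

Changing only the weights moves probability mass toward different tokens without touching model parameters, which is the kind of steerability the abstract describes.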

2605.25338 2026-05-26 cs.LG cs.AI

CausalFlow: Causal Attribution and Counterfactual Repair for LLM Agent Failures

CausalFlow: LLM Agent 失败的因果归因与反事实修复

Akash Bonagiri, Devang Borkar, Gerard Janno Anderias, Setareh Rafatirad, Houman Homayoun

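AI summary: Proposes CausalFlow, a framework that computes step-level causal responsibility scores through counterfactual intervention, identifies failure-inducing steps, and generates minimally edited repairs for test-time repair and training-time supervision, outperforming heuristic refinement in complex retrieval settings.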

详情
AI中文摘要

大型语言模型(LLM)代理在涉及推理、工具使用和环境交互的多步任务中经常失败。虽然此类失败通常被记录或通过启发式重试处理,但它们包含了关于执行中断位置的结构化信号。我们提出了CausalFlow,一个干预框架,将失败的代理轨迹转换为最小的反事实修复和可重用的监督。CausalFlow将执行轨迹建模为依赖步骤的顺序链,并通过步骤级反事实干预计算因果责任分数(CRS)来识别导致失败的步骤。对于这些步骤,我们生成最小编辑修复,将最终结果翻转为成功,产生形式为(错误步骤,修正步骤)的验证对比对。CausalFlow支持两种互补用途:具有最小行为漂移的针对性测试时修复,以及适用于离线偏好优化或奖励建模的训练时监督。在涵盖数学推理、代码生成、问答和医学浏览的四个基准测试中,CausalFlow将失败执行转换为具有高最小性和因果一致性分数的验证最小修复,并证明因果归因对于跨不同代理任务的可靠改进是必要的,在复杂检索设置中优于启发式细化,同时产生更局部的修复。这些结果表明,对结构化执行轨迹的干预分析提供了一种原则性和可扩展的机制,将代理失败转化为可靠性提升和可学习的监督。

英文摘要

Large language model (LLM) agents frequently fail on multi-step tasks involving reasoning, tool use, and environment interaction. While such failures are typically logged or retried heuristically, they contain structured signals about where execution broke down. We introduce CausalFlow, an interventional framework that converts failed agent traces into minimal counterfactual repairs and reusable supervision. CausalFlow models execution traces as sequential chains of dependent steps and computes Causal Responsibility Scores (CRS) via step-level counterfactual intervention to identify failure-inducing steps. For these steps, we generate minimally edited repairs that flip the final outcome to success, producing validated contrastive pairs of the form (wrong step, corrected step). CausalFlow supports two complementary uses: targeted test-time repair that recovers from failures with minimal behavioral drift, and training-time supervision suitable for offline preference optimization or reward modeling. Across four benchmarks spanning mathematical reasoning, code generation, question answering, and medical browsing, CausalFlow converts failed executions into validated minimal repairs with high minimality and causal-consensus scores, and demonstrates that causal attribution is necessary for reliable improvement across diverse agent tasks, outperforming heuristic refinement in complex retrieval settings while producing more localized repairs throughout. These results demonstrate that interventional analysis over structured execution traces provides a principled and scalable mechanism for transforming agent failures into reliability gains and learning-ready supervision.
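
Step-level counterfactual intervention can be illustrated on a toy trace. In the Python sketch below, the score of a step is the fraction of single-step replacements at that position that turn a failed trace into a success. This definition, the function names, and the toy task are hypothetical; the abstract does not give the CRS formula, and the sketch keeps downstream steps fixed after an intervention, which a real agent pipeline may not do.

```python
def causal_responsibility_scores(trace, candidates, run):
    """Fraction of single-step replacements that turn a failed trace into a success.

    trace:      list of steps from a failed execution
    candidates: step index -> list of alternative steps to try at that index
    run:        callable taking a list of steps and returning True on success
    """
    if run(trace):
        raise ValueError("expected a failed trace")
    scores = {}
    for index in range(len(trace)):
        alternatives = candidates.get(index, [])
        flips = 0
        for alt in alternatives:
            patched = trace[:index] + [alt] + trace[index + 1:]
            if run(patched):
                flips += 1
        scores[index] = flips / len(alternatives) if alternatives else 0.0
    return scores


# Toy task: steps are applied in order to the value 3; success means reaching 16.
def add_one(value):
    return value + 1


def double(value):
    return value * 2


def square(value):
    return value * value


def run(steps, start=3, target=16):
    value = start
    for step in steps:
        value = step(value)
    return value == target


failed_trace = [add_one, double]  # 3 -> 4 -> 8, misses the target
candidates = {0: [double, square], 1: [square, add_one]}
scores = causal_responsibility_scores(failed_trace, candidates, run)

# The highest-scoring step and the replacement that flips the outcome form a contrastive pair.
culprit = max(scores, key=scores.get)
repair = next(alt for alt in candidates[culprit]
              if run(failed_trace[:culprit] + [alt] + failed_trace[culprit + 1:]))
```

Here only replacing the second step can rescue the run, so the resulting pair is (double, square), mirroring the (wrong step, corrected step) form mentioned in the abstract.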

2605.25334 2026-05-26 cs.CV

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

双路径几何感知多模态大语言模型用于空间智能

Yufei Zheng, Xuhan Zhu, Zide Liu, Chunpeng Zhou, Chenfeng Wang, Yongchao Xu, Yunnan Wang, Jiawei Liu, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha

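AI summary: Proposes GAMSI, a multimodal large language model that takes only RGB images as input and jointly perceives 3D structure and metric scale through dual-pathway queries and expert-guided visual alignment, achieving state-of-the-art performance on seven spatial intelligence benchmarks.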

详情
AI中文摘要

从2D视觉输入理解物理世界的空间能力依赖于两种互补的几何知识:整体3D结构感知和细粒度度量尺度估计。现有的多模态大语言模型通常只处理其中一个方面,将深度图或点云作为额外模型输入,这带来了大量计算开销并继承了上游预测模型的泛化局限性。我们提出GAMSI,一种双路径几何感知多模态大语言模型用于空间智能,仅以RGB图像为输入,同时在统一的自回归骨干网络内内化两种几何先验。具体地,我们引入度量-结构解耦查询,使用两组可学习查询分别从共享视觉上下文中提取密集度量信号和稀疏结构线索,并通过任务解耦注意力掩码防止两条路径相互污染。在此基础上,专家引导视觉定位模块将聚合的线索投影回帧级视觉特征,并与视觉基础模型对齐,这些模型仅作为训练时的监督,而非模型输入。我们进一步构建了一个多任务空间指令微调数据集,包含152,776个样本,涵盖13种任务类型和三种视觉模态,整合自六个公共数据集。通过两阶段课程训练,GAMSI在七个空间智能基准上达到了最先进的性能。

英文摘要

Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large language models (MLLMs) typically address only one facet, ingesting either depth maps or point clouds as additional model inputs, which incurs substantial computational overhead and inherits the generalization limitations of upstream prediction models. We propose GAMSI, a dual-pathway Geometry-Aware MLLM for Spatial Intelligence that takes only RGB images as input while internalizing both forms of geometric prior within a unified autoregressive backbone. Specifically, we introduce Metric-Structure Decoupled Queries (MSDQ), which employ two groups of learnable queries to respectively extract dense metric signals and sparse structural cues from the shared visual context, with a task-decoupled attention mask further preventing the two pathways from contaminating each other. Building on this, an Expert-Guided Visual Grounding (EVG) module projects the aggregated cues back to frame-level visual features and aligns them with vision foundation models, which serve purely as training-time supervision rather than as model inputs. We further build a multi-task spatial instruction-tuning dataset (MTS) comprising 152,776 samples spanning 13 task types and three visual modalities, consolidated from six public datasets. Trained with a two-stage curriculum, GAMSI achieves state-of-the-art performance on seven spatial intelligence benchmarks.