arXivDaily: arXiv daily academic digest, updated Monday through Friday
2510.27266 2026-05-28 cs.CV

Enhancing Trustworthy GUI Grounding via Self-Critiqued Reinforcement Learning

通过自我批评强化学习增强可信的GUI定位

Shaojie Zhang, Pei Fu, Ruoceng Zhang, Jiahui Yang, Anan Du, Xiuwen Xi, Shaokang Wang, Ying Huang, Bin Qin, Zhenbo Luo, Jian Luan

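AI summary Proposes HyperClick, a framework that jointly optimizes grounding accuracy and confidence reliability through self-critiqued reinforcement learning to achieve trustworthy GUI grounding.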

AI Chinese abstract

自主图形用户界面(GUI)代理依赖于准确的GUI定位,将语言指令映射到屏幕坐标,以执行用户命令。然而,当前的模型,无论是通过监督微调(SFT)还是强化学习(RL)训练的,通常提供的置信度信号与实际定位正确性对齐不良,导致过度自信且不可靠的预测。为了解决这个问题,我们提出了HyperClick,一种通过自我批评强化学习(SCRL)增强可信GUI定位的新框架。HyperClick结合了正确性奖励和置信度对齐奖励,训练策略模型同时输出点击预测和明确的置信度估计。这种方法通过基于置信度的自我评估,联合优化了定位准确性和置信度可靠性。在具有挑战性的基准测试上的大量实验表明,HyperClick在保持强大定位性能的同时,提供了更好对齐的置信度估计。通过在GUI动作旁边暴露不确定性,HyperClick支持GUI自动化中基于置信度的弃权。代码将在此处发布。

English abstract

Autonomous graphical user interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement learning (RL), often provide confidence signals that are poorly aligned with actual grounding correctness, leading to overconfident and unreliable predictions. To address this, we propose HyperClick, a novel framework that enhances trustworthy GUI grounding through self-critiqued reinforcement learning (SCRL). HyperClick combines a correctness reward and a confidence alignment reward, training the policy model to output both a click prediction and an explicit confidence estimate. This approach jointly optimizes grounding accuracy and confidence reliability through confidence-based self-assessment. Extensive experiments on challenging benchmarks show that HyperClick maintains strong grounding performance while providing better-aligned confidence estimates. By exposing uncertainty alongside GUI actions, HyperClick supports confidence-based abstention in GUI automation. Code will be released here.
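The abstract does not spell out the reward formulas, so the following is only a minimal sketch of how a correctness reward and a confidence alignment reward might be combined. The Brier-style alignment term, the weight `alpha`, the abstention `threshold`, and both function names are assumptions made for illustration, not HyperClick's actual definitions.

```python
def scrl_reward(click, bbox, confidence, alpha=0.5):
    """Hypothetical combined reward: correctness plus confidence alignment."""
    x, y = click
    left, top, right, bottom = bbox
    # Correctness reward: 1 if the predicted click lands inside the target box.
    correct = 1.0 if (left <= x <= right and top <= y <= bottom) else 0.0
    # Alignment reward (assumed Brier-style): largest when the stated
    # confidence matches the actual correctness of the click.
    alignment = 1.0 - (confidence - correct) ** 2
    return correct + alpha * alignment


def act_or_abstain(confidence, threshold=0.7):
    """Confidence-based abstention: act only when confidence clears a threshold."""
    return "click" if confidence >= threshold else "abstain"
```

Under this sketch, an overconfident miss (confidence 1.0, wrong click) earns nothing, while a miss reported with confidence 0.0 still earns the alignment term, which is the kind of incentive the abstract describes.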

2510.22016 2026-05-28 cs.LG stat.ML

Cost-Sensitive Evaluation for Binary Classifiers

二分类器的代价敏感评估

Pierangelo Lombardo, Antonio Casoli, Cristian Cingolani, Shola Oshodi, Michele Zanatta

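AI summary To address the mismatch between classifier evaluation and Total Classification Cost (TCC) minimization, proposes the Weighted Accuracy (WA) metric and a general reweighting framework, proves that maximizing WA is equivalent to minimizing TCC, and shows that WA remains robust across class-imbalance and cost scenarios.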

Comments 24 pages, 5 figures

AI Chinese abstract

为分类器选择合适的评估指标对于模型比较、参数优化和部署决策至关重要,但目前尚无广泛接受的、明确与总分类代价(TCC)最小化一致的评估范式。同时,类别不平衡常被视为需要修正的问题本身,可能导致与TCC最小化的不一致。为解决这些局限,(i)我们定义了加权准确率(WA),一种对二分类器的评估指标,其直观解释为准确率的加权版本;(ii)我们提出了一个通用的重加权框架,用于处理代价敏感场景中的类别不平衡,为重采样技术提供了替代方案。该框架适用于任何可表示为示例相关量的线性组合的评估指标或损失函数;它能够有意义地比较在不同数据集上获得的评估结果,并考虑用于训练、验证和测试的“开发”数据集与模型将部署的“目标”数据集之间的差异。在该框架内,我们推导了标准重平衡技术与TCC最小化保持一致的条件,以及它们可能变得具有误导性的情况。我们证明,在示例无关的单位分类代价下,最大化WA等价于最小化TCC。最后,我们通过研究WA与TCC在广泛的类别不平衡和代价机制下的相关性,分析了WA在现实示例相关代价场景中的鲁棒性。结果表明,在几乎所有考察的场景中,WA与TCC保持稳健的对齐。

English abstract

Selecting an appropriate evaluation metric for classifiers is crucial for model comparison, parameter optimization, and deployment decisions, yet there is no consensus on a broadly accepted evaluation paradigm explicitly aligned with Total Classification Cost (TCC) minimization. At the same time, class imbalance is often treated as a problem to be corrected per se, potentially causing misalignments with TCC minimization. To address these limitations, (i) we define Weighted Accuracy (WA), an evaluation metric for binary classifiers with a straightforward interpretation as a weighted version of accuracy, and (ii) we propose a general reweighting framework for handling class imbalance in cost-sensitive scenarios, providing an alternative to resampling techniques. This framework applies to any evaluation metric or loss function that can be expressed as a linear combination of example-dependent quantities; it enables meaningful comparison of evaluation results obtained on different datasets and accounts for discrepancies between the development dataset, used for training, validation, and testing, and the target dataset, where the model will be deployed. Within this framework, we derive the conditions under which standard rebalancing techniques remain coherent with TCC minimization, and when they may instead become misleading. We prove that, under example-independent Unit Classification Costs, maximizing WA is equivalent to minimizing TCC. Finally, we analyze the robustness of WA in realistic example-dependent cost scenarios by studying its correlation with TCC across a broad range of class imbalance and cost regimes. The results show that WA maintains robust alignment with TCC across almost all examined scenarios.
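The abstract does not give the formula for WA, so the sketch below assumes one natural reading: each example is weighted by the cost of misclassifying it (`cost_fn` for positives, `cost_fp` for negatives, both hypothetical names). Under that assumption WA = 1 - TCC / (total weight), which makes the stated equivalence easy to check numerically on a fixed dataset.

```python
def total_classification_cost(y_true, y_pred, cost_fn, cost_fp):
    """TCC with example-independent unit costs for false negatives/positives."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return cost_fn * fn + cost_fp * fp


def weighted_accuracy(y_true, y_pred, cost_fn, cost_fp):
    """Accuracy where each example counts with the cost of misclassifying it."""
    # Assumed weighting: positives weigh cost_fn, negatives weigh cost_fp.
    weights = [cost_fn if t == 1 else cost_fp for t in y_true]
    correct = sum(w for w, t, p in zip(weights, y_true, y_pred) if t == p)
    return correct / sum(weights)
```

Because the total weight depends only on the labels and the costs, any classifier that raises this WA on a given dataset lowers TCC by a proportional amount.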

2510.21890 2026-05-28 cs.LG cs.AI cs.GR

The Principles of Diffusion Models

扩散模型的原理

Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, Stefano Ermon

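AI summary Presents the mathematical principles of diffusion models in a unified way from three perspectives (variational, score-based, and flow-based) and discusses extensions such as controllable generation, efficient solvers, and flow-map models.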

Comments Supplementary materials for the book are available at the book website: https://the-principles-of-diffusion-models.github.io/

AI Chinese abstract

本书介绍了指导扩散模型发展的核心原理,追溯其起源,并展示不同公式如何源于共同的数学思想。扩散建模首先定义一个前向过程,该过程逐渐将数据破坏为噪声,通过一系列中间分布将数据分布与简单先验联系起来。目标是学习一个反向过程,将噪声转换回数据,同时恢复相同的中间分布。我们描述了三种互补的观点。受变分自编码器启发的变分观点将扩散视为逐步学习去噪。基于分数的观点植根于基于能量的建模,学习演化数据分布的梯度,指示如何将样本推向更可能的区域。基于流的观点与归一化流相关,将生成视为遵循一条平滑路径,在学习的速度场下将样本从噪声移动到数据。这些视角共享一个共同的主干:一个时间相关的速度场,其流将简单先验传输到数据。采样相当于求解一个微分方程,该方程沿着连续轨迹将噪声演化为数据。在此基础之上,本书讨论了可控生成的引导、高效数值求解器以及扩散驱动的流映射模型(学习任意时间之间的直接映射)。它为具有基本深度学习知识的读者提供了扩散模型的概念性和数学基础理解。

English abstract

This book presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse formulations arise from shared mathematical ideas. Diffusion modeling starts by defining a forward process that gradually corrupts data into noise, linking the data distribution to a simple prior through a continuum of intermediate distributions. The goal is to learn a reverse process that transforms noise back into data while recovering the same intermediates. We describe three complementary views. The variational view, inspired by variational autoencoders, sees diffusion as learning to remove noise step by step. The score-based view, rooted in energy-based modeling, learns the gradient of the evolving data distribution, indicating how to nudge samples toward more likely regions. The flow-based view, related to normalizing flows, treats generation as following a smooth path that moves samples from noise to data under a learned velocity field. These perspectives share a common backbone: a time-dependent velocity field whose flow transports a simple prior to the data. Sampling then amounts to solving a differential equation that evolves noise into data along a continuous trajectory. On this foundation, the book discusses guidance for controllable generation, efficient numerical solvers, and diffusion-motivated flow-map models that learn direct mappings between arbitrary times. It provides a conceptual and mathematically grounded understanding of diffusion models for readers with basic deep-learning knowledge.
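To make "sampling amounts to solving a differential equation" concrete, here is a toy sketch with Euler integration of a velocity field. It assumes the simplest possible case, a "data distribution" that is a single point `target` and a straight-line path x_t = (1 - t) x_0 + t * target, for which the velocity is known in closed form. In a real model the velocity field is a trained network, and the function names here are illustrative only.

```python
def velocity(x, t, target):
    # Closed-form velocity of the straight-line path
    # x_t = (1 - t) * x_0 + t * target (toy stand-in for a learned field).
    return (target - x) / (1.0 - t)


def sample(x0, target, steps=10):
    """Euler integration of dx/dt = velocity(x, t) from t = 0 to t = 1."""
    x = x0
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # never reaches 1.0, so the division above is safe
        x = x + dt * velocity(x, t, target)
    return x
```

Starting from any noise value `x0`, the trajectory ends at `target` up to floating-point error. More accurate solvers replace the Euler step when the learned field is curved.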

2510.20480 2026-05-28 cs.RO

Degradation-Aware Cooperative Multi-Modal GNSS-Denied Localization Leveraging LiDAR-Based Robot Detections

基于激光雷达机器人检测的退化感知协同多模态GNSS拒止定位

Václav Pritzl, Xianjia Yu, Tomi Westerlund, Petr Štěpán, Martin Saska

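AI summary Proposes an adaptive multi-modal multi-robot cooperative localization method in a factor-graph framework that fuses asynchronous VIO, LIO, and 3D inter-robot detections, handles sensor degradation through an interpolation-based factor and Wasserstein-distance weighting, and significantly improves localization accuracy.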

Comments Preprint version. This work has been submitted to Elsevier for possible publication

AI Chinese abstract

在无全球导航卫星系统(GNSS)环境中,使用机载传感器进行精确长期定位对机器人至关重要。虽然互补传感器可以缓解个体退化,但在单个机器人上携带所有可用传感器类型会显著增加尺寸、重量和功耗需求。将传感器分布在多个机器人上可增强部署能力,但引入了来自独立移动平台的异步多模态数据融合的挑战。我们提出了一种新颖的自适应多模态多机器人协同定位方法,采用因子图公式,以松耦合方式融合来自不同机器人的异步视觉-惯性里程计(VIO)、激光雷达-惯性里程计(LIO)和3D机器人间检测。该方法适应变化的条件,利用可靠数据帮助受传感器退化影响的机器人。一种新颖的基于插值的因子实现了非同步测量的融合。LIO退化基于近似扫描匹配Hessian进行评估。提出了一种根据连续VIO输出之间的Wasserstein距离按比例加权里程计数据的新方法。提供了理论分析,研究了各种条件下(主要是在传感器退化存在时)的协同定位问题。该方法已在使用无人地面车辆(UGV)和无人飞行器(UAV)异构团队收集的真实世界数据上进行了广泛评估,结果表明该方法在各种传感器退化情况下显著提高了定位精度。

English abstract

Accurate long-term localization using onboard sensors is crucial for robots operating in Global Navigation Satellite System (GNSS)-denied environments. While complementary sensors mitigate individual degradations, carrying all the available sensor types on a single robot significantly increases the size, weight, and power demands. Distributing sensors across multiple robots enhances the deployability but introduces challenges in fusing asynchronous, multi-modal data from independently moving platforms. We propose a novel adaptive multi-modal multi-robot cooperative localization approach using a factor-graph formulation to fuse asynchronous Visual-Inertial Odometry (VIO), LiDAR-Inertial Odometry (LIO), and 3D inter-robot detections from distinct robots in a loosely-coupled fashion. The approach adapts to changing conditions, leveraging reliable data to assist robots affected by sensory degradations. A novel interpolation-based factor enables fusion of the unsynchronized measurements. LIO degradations are evaluated based on the approximate scan-matching Hessian. A novel approach of weighting odometry data proportionally to the Wasserstein distance between the consecutive VIO outputs is proposed. A theoretical analysis is provided, investigating the cooperative localization problem under various conditions, mainly in the presence of sensory degradations. The proposed method has been extensively evaluated on real-world data gathered with heterogeneous teams of an Unmanned Ground Vehicle (UGV) and Unmanned Aerial Vehicles (UAVs), showing that the approach provides significant improvements in localization accuracy in the presence of various sensory degradations.
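As a rough illustration of the Wasserstein-based weighting idea, the sketch below computes the 2-Wasserstein distance between two Gaussians with diagonal covariance, for which a closed form exists. Summarizing each VIO output as such a Gaussian, and the `odometry_sigma` scaling rule with its `gain` parameter, are assumptions made for this example. The abstract does not specify how the distance is turned into a weight.

```python
import math


def wasserstein2_diag_gaussians(mean_a, var_a, mean_b, var_b):
    """2-Wasserstein distance between Gaussians with diagonal covariances."""
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    # For diagonal covariances the covariance term reduces to a sum over
    # squared differences of per-axis standard deviations.
    cov_term = sum(
        (math.sqrt(va) - math.sqrt(vb)) ** 2 for va, vb in zip(var_a, var_b)
    )
    return math.sqrt(mean_term + cov_term)


def odometry_sigma(base_sigma, distance, gain=1.0):
    # Hypothetical rule: inflate the odometry noise as consecutive
    # VIO outputs drift further apart.
    return base_sigma * (1.0 + gain * distance)
```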

2510.18668 2026-05-28 cs.LG cs.CV

Prototyping an End-to-End Multi-Modal Tiny-CNN for Cardiovascular Sensor Patches

面向心血管传感器贴片的端到端多模态微型CNN原型设计

Mustafa Fuad Rifet Ibrahim, Tunc Alkanat, Felix Manthey, Maurice Meijer, Alexander Schlaefer, Peer Stelldinger

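AI summary For resource-constrained medical edge devices, proposes a convolutional neural network with early fusion of electrocardiogram and phonocardiogram data for binary classification, reducing memory and compute cost by about three orders of magnitude relative to the state of the art and demonstrating its energy-efficiency advantage on a microcontroller.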

Comments 11 pages, 2 figures. Extended version of our 2024 IEEE PerCom paper, with direct on-device energy measurements, a BLE communication benchmark, architecture comparisons, and an extended evaluation. Submitted to Biomedical Signal Processing and Control; Fixed typos

AI Chinese abstract

绝大多数心血管疾病如果早期发现风险因素和迹象是可以预防的。使用体戴式传感器贴片等设备进行心血管监测,可以在保持患者自由和舒适的同时检测这些迹象。然而,传感器数据的分析必须稳健、可靠、高效且高度准确。深度学习方法可以自动化数据解读,减轻临床医生的工作负担。在这项工作中,我们分析了在资源受限的医疗边缘设备上应用深度学习模型对同步心电图(ECG)和心音图(PCG)记录进行分类的可行性。我们提出了一种具有早期数据融合的卷积神经网络来解决二分类问题。该模型在Physionet Challenge 2016数据集的同步ECG和PCG记录上进行训练和验证。与现有技术相比,我们的方法将内存占用和计算成本降低了约三个数量级,同时保持了有竞争力的准确性。我们进一步通过测量配备神经处理单元(NPU)的微控制器上的能耗,并在代表性BLE评估套件上对一系列有效载荷大小的蓝牙低功耗(BLE)通信能耗进行基准测试,证明了所提模型在医疗边缘设备上的适用性。比较结果证实,设备端推理比连续数据流传输更节能。

English abstract

The vast majority of cardiovascular diseases may be preventable if early signs and risk factors are detected. Cardiovascular monitoring with body-worn sensor devices like sensor patches allows for the detection of such signs while preserving the freedom and comfort of patients. However, the analysis of the sensor data must be robust, reliable, efficient, and highly accurate. Deep learning methods can automate data interpretation, reducing the workload of clinicians. In this work, we analyze the feasibility of applying deep learning models to the classification of synchronized electrocardiogram (ECG) and phonocardiogram (PCG) recordings on resource-constrained medical edge devices. We propose a convolutional neural network with early fusion of data to solve a binary classification problem. The model is trained and validated on the synchronized ECG and PCG recordings from the Physionet Challenge 2016 dataset. Our approach reduces memory footprint and compute cost by approximately three orders of magnitude compared with the state-of-the-art while maintaining competitive accuracy. We further demonstrate the applicability of the proposed model on medical edge devices by measuring its energy consumption on a microcontroller equipped with a neural processing unit (NPU) and benchmarking the energy of Bluetooth Low Energy (BLE) communication on a representative BLE evaluation kit across a range of payload sizes. The comparison confirms that on-device inference can be more energy efficient than continuous data streaming.

2510.17620 2026-05-28 cs.CL

Forget to Know, Remember to Use: Context-Aware Unlearning for Large Language Models

忘记知识,记住使用:面向大型语言模型的上下文感知遗忘

Yuefeng Peng, Parnian Afshar, Megan Ganji, Thomas Butler, Amir Houmansadr, Mingxian Wang, Dezhi Hong

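AI summary To address the loss of contextual utility caused by existing unlearning methods, proposes a plug-in objective term that restores the model's ability to use forgotten knowledge in context while preserving forgetting effectiveness and retain-set performance.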

Comments ICML 2026

AI Chinese abstract

大型语言模型可能编码需要移除的敏感信息或过时知识,以确保模型响应负责任且合规。遗忘学习已成为完整重新训练的高效替代方案,旨在移除特定知识同时保持模型整体效用。现有遗忘方法评估关注(1)目标知识的遗忘程度(遗忘集)和(2)保留集上的性能(即效用)。然而,这些评估忽略了一个重要的可用性方面:如果提示中重新引入已移除信息,用户可能仍希望模型利用该信息。在对六种最先进遗忘方法的系统评估中,我们发现它们一致损害了这种上下文效用。为解决此问题,我们用一个插件项增强遗忘目标,该插件项保留模型在上下文中存在已遗忘知识时使用它的能力。大量实验表明,我们的方法将上下文效用恢复到接近原始水平,同时仍然保持有效的遗忘和保留集效用。

English abstract

Large language models may encode sensitive information or outdated knowledge that needs to be removed, to ensure responsible and compliant model responses. Unlearning has emerged as an efficient alternative to full retraining, aiming to remove specific knowledge while preserving overall model utility. Existing evaluations of unlearning methods focus on (1) the extent of forgetting of the target knowledge (forget set) and (2) maintaining performance on the retain set (i.e., utility). However, these evaluations overlook an important usability aspect: users may still want the model to leverage the removed information if it is re-introduced in the prompt. In a systematic evaluation of six state-of-the-art unlearning methods, we find that they consistently impair such contextual utility. To address this, we augment unlearning objectives with a plug-in term that preserves the model's ability to use forgotten knowledge when it is present in context. Extensive experiments demonstrate that our approach restores contextual utility to near original levels while still maintaining effective forgetting and retain-set utility.

2510.15839 2026-05-28 cs.LG econ.EM stat.ML

Learning Correlated Reward Models: Statistical Barriers and Opportunities

学习相关奖励模型:统计障碍与机遇

Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Gabriele Farina, Sobhan Mohammadpour

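AI summary Studies the statistical and computational challenges of correlated probit models, which avoid the IIA assumption, proving that pairwise preference data is insufficient to learn correlations while best-of-three preference data enables near-optimal estimation.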

Comments International Conference on Learning Representations (ICLR) 2026

AI Chinese abstract

随机效用模型(RUM)是建模用户偏好的经典框架,并在基于人类反馈的强化学习(RLHF)的奖励建模中发挥关键作用。然而,这些技术的一个关键缺陷是无关选项独立性(IIA)假设,该假设将所有人类偏好归结为单一的潜在效用函数,从而对人类偏好范围进行了粗略近似。另一方面,避免这一假设的模型的统计和计算保证很少。在本文中,我们研究了学习相关probit模型的统计和计算挑战,这是一种避免IIA假设的基本RUM。首先,我们确定了成对偏好数据的经典数据收集范式从根本上不足以学习相关性信息,这解释了该设置下缺乏统计和计算保证的原因。接下来,我们证明了三选一偏好数据可证明地克服了这些缺陷,并设计了一个统计和计算上高效的估计器,具有近最优性能。这些结果突显了高阶偏好数据在学习相关效用中的优势,从而允许对人类偏好进行更精细的建模。最后,我们在几个真实世界数据集上验证了这些理论保证,展示了人类偏好的改进个性化。

English abstract

Random Utility Models (RUMs) are a classical framework for modeling user preferences and play a key role in reward modeling for Reinforcement Learning from Human Feedback (RLHF). However, a crucial shortcoming of many of these techniques is the Independence of Irrelevant Alternatives (IIA) assumption, which collapses all human preferences to a universal underlying utility function, yielding a coarse approximation of the range of human preferences. On the other hand, statistical and computational guarantees for models avoiding this assumption are scarce. In this paper, we investigate the statistical and computational challenges of learning a correlated probit model, a fundamental RUM that avoids the IIA assumption. First, we establish that the classical data collection paradigm of pairwise preference data is fundamentally insufficient to learn correlational information, explaining the lack of statistical and computational guarantees in this setting. Next, we demonstrate that best-of-three preference data provably overcomes these shortcomings, and devise a statistically and computationally efficient estimator with near-optimal performance. These results highlight the benefits of higher-order preference data in learning correlated utilities, allowing for more fine-grained modeling of human preferences. Finally, we validate these theoretical guarantees on several real-world datasets, demonstrating improved personalization of human preferences.
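The sketch below illustrates one simple way in which pairwise data can hide correlation. It is not the paper's theorem, which the abstract does not state in detail. Assuming two jointly Gaussian utilities, the probability that item i is preferred to item j depends on the covariance only through the variance of the utility difference, so different (variance, correlation) pairs can yield identical pairwise probabilities. All function and parameter names are illustrative.

```python
import math


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pairwise_pref_prob(mu_i, mu_j, var_i, var_j, rho):
    """P(u_i > u_j) for jointly Gaussian utilities with correlation rho."""
    # Only the variance of the difference u_i - u_j enters the probability.
    diff_var = var_i + var_j - 2.0 * rho * math.sqrt(var_i * var_j)
    return std_normal_cdf((mu_i - mu_j) / math.sqrt(diff_var))
```

For example, independent utilities with unit variance and utilities with variance 2 and correlation 0.5 both give a difference variance of 2, so a single pairwise comparison cannot tell them apart.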

2510.15541 2026-05-28 cs.LG cs.CV eess.IV

An Empirical Study on Variance-based MC Dropout Uncertainty-Error Correlation in 2D Brain Tumor Segmentation

基于方差的MC Dropout不确定性-误差相关性在二维脑肿瘤分割中的实证研究

Saumya B

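AI summary Experiments with a U-Net under four augmentation settings find that variance-based MC Dropout uncertainty correlates only weakly with segmentation error, both globally and at boundaries, indicating its limitations.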

Comments v2: Updated title and framing to clarify that findings are specific to variance-based uncertainty estimation via MC Dropout, not MC Dropout broadly. Minor textual improvements throughout. Code and results available at https://github.com/Saumya4321/mcd-error-correlation

AI Chinese abstract

从MRI中准确分割脑肿瘤对诊断和治疗规划至关重要。尽管蒙特卡洛(MC) Dropout被广泛用于估计模型不确定性,但基于方差的不确定性(通过随机前向传递的逐像素方差计算)在识别分割误差(尤其是肿瘤边界附近)方面的有效性尚未得到充分研究。本研究使用在四种增强设置(无增强、水平翻转、旋转和缩放)下训练的U-Net,实证检验了基于方差的MC Dropout不确定性与二维脑肿瘤MRI分割误差之间的关系。不确定性估计为50次随机前向传递的逐像素方差,并使用Pearson和Spearman系数与逐像素误差进行相关性分析。结果显示全局相关性较弱(r ~ 0.30-0.38),边界相关性可忽略(|r| < 0.05)。尽管不同增强设置之间的差异具有统计显著性(p < 0.001),但缺乏实际意义。这些发现表明,基于方差的MC Dropout不确定性为全局和边界误差定位提供的线索有限,且不确定性表示的选择对MC Dropout在医学图像分割中的效用有重要影响。替代表示如预测熵或互信息可能更好地捕捉分割误差,尤其是在边界处。

English abstract

Accurate brain tumor segmentation from MRI is vital for diagnosis and treatment planning. Although Monte Carlo (MC) Dropout is widely used to estimate model uncertainty, the effectiveness of variance-based uncertainty - computed as pixel-wise variance across stochastic forward passes - in identifying segmentation errors, particularly near tumor boundaries, remains insufficiently studied. This study empirically examines the relationship between variance-based MC Dropout uncertainty and segmentation error in 2D brain tumor MRI segmentation using a U-Net trained under four augmentation settings: none, horizontal flip, rotation, and scaling. Uncertainty was estimated as the pixel-wise variance across 50 stochastic forward passes and correlated with pixel-wise errors using Pearson and Spearman coefficients. Results show weak global correlations (r ~ 0.30-0.38) and negligible boundary correlations (|r| < 0.05). Although differences across augmentations were statistically significant (p < 0.001), they lacked practical relevance. These findings suggest that variance-based MC Dropout uncertainty provides limited cues for global and boundary error localization, and that the choice of uncertainty representation critically affects the utility of MC Dropout in medical image segmentation. Alternative representations such as predictive entropy or mutual information may better capture segmentation errors, particularly at boundaries.
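The measurement described above can be sketched in a few lines: compute the pixel-wise variance across stochastic forward passes, then correlate it with the pixel-wise error. The sketch assumes each pass is already available as a flattened list of foreground probabilities. Running a U-Net with dropout enabled at inference is outside its scope, and the study uses 50 passes where a toy input would use only a few.

```python
import math


def pixelwise_variance(passes):
    """Variance per pixel across T stochastic forward passes.

    passes: list of T flattened probability maps of equal length.
    """
    n = len(passes)
    variances = []
    for i in range(len(passes[0])):
        values = [p[i] for p in passes]
        mean = sum(values) / n
        variances.append(sum((v - mean) ** 2 for v in values) / n)
    return variances


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In use, `pearson(pixelwise_variance(passes), errors)` would give the global coefficient, with `errors` a 0/1 list marking misclassified pixels. Restricting both lists to boundary pixels gives the boundary coefficient.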

2510.11170 2026-05-28 cs.LG cs.AI cs.CL

EAGer: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

EAGer: 基于熵感知的自适应推理时缩放生成方法

Daniel Scalena, Leonidas Zotos, Elisabetta Fersini, Malvina Nissim, Ahmet Üstün

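AI summary Proposes EAGer, a training-free generation method that uses the token-wise entropy distribution to allocate compute dynamically, improving performance on complex reasoning tasks while reducing redundant computation.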

AI Chinese abstract

随着推理语言模型和测试时缩放方法作为提升模型性能范式的兴起,通常需要大量计算来从同一提示生成多个候选序列。这允许探索通向正确答案的不同推理路径,然而,为每个提示分配相同的计算预算。基于不同提示具有不同复杂度因而需要不同计算量的假设,我们提出EAGer,一种无需训练的生成方法,通过逐词熵分布利用模型不确定性来减少冗余计算并同时提升整体性能。EAGer仅在存在高熵词时分支到多个推理路径,并将节省的计算预算重新分配到最需要探索替代路径的实例上。我们在复杂推理基准上对多个开源模型验证了EAGer,特别是在AIME 2025上展示了增益。当目标标签可访问时(如在RLVR训练流程中),EAGer在Pass@k上提升高达37%,且token减少59%;在测试时设置中,与全并行采样相比,仍能在Pass@k上提升12%,且token减少64%。

English abstract

With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution but allocates the same compute budget to every prompt. Grounded in the assumption that different prompts carry different degrees of complexity, and thus different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through the token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance. EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and reallocates the saved compute budget to instances where exploration of alternative paths is most needed. We validate EAGer across multiple open-source models on complex reasoning benchmarks, with gains specifically demonstrated on AIME 2025. When target labels are accessible -- as in RLVR training pipelines -- EAGer achieves up to +37% in Pass@k and 59% fewer tokens; in test-time settings it still yields +12% in Pass@k and 64% fewer tokens compared to Full Parallel Sampling.
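The branching rule can be illustrated with a minimal sketch: compute the Shannon entropy of the next-token distribution and branch only when it exceeds a threshold. The threshold value of 1.0 nat and both function names are assumptions for illustration. The abstract does not say how EAGer sets its threshold or how it reallocates the saved budget.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_branch(probs, threshold=1.0):
    # Hypothetical rule: open extra reasoning paths only at uncertain tokens.
    return token_entropy(probs) > threshold
```

A near-deterministic distribution such as `[0.97, 0.01, 0.01, 0.01]` stays on a single path, while a uniform distribution over four tokens (entropy ln 4, about 1.39 nats) triggers branching.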

2503.01450 2026-05-28 cs.LG cs.AI cs.RO

Investigating Memory in Model-Free RL with POPGym Arcade

基于POPGym Arcade的无模型强化学习中的记忆研究

Zekang Wang, Zhe He, Borong Zhang, Edan Toledo, Steven Morad

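AI summary Introduces analysis tools and the POPGym Arcade environment suite to study memory in deep reinforcement learning, finds that value functions assign credit to irrelevant history, and shows how out-of-distribution scenarios can contaminate memory.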

Comments Appears at ICML 2026 as a Spotlight paper

AI Chinese abstract

如何分析深度强化学习中的记忆?我们引入了在部分可观测性下分析策略的工具,并揭示智能体如何利用记忆做出决策。为了利用这些工具,我们提出了POPGym Arcade,这是一个受Atari启发的、硬件加速的环境集合,共享单一观测和动作空间。每个环境都提供完全和部分可观测的变体,从而实现对可观测性的反事实研究。我们发现,受控研究对于公平比较是必要的,并识别出一种病理现象,即价值函数将信用过度分配到无关历史。利用这种病理现象,我们展示了分布外场景如何污染记忆,从而在遥远的未来扰动策略。我们的代码可在https://github.com/bolt-research/popgym-arcade获取。

English abstract

How should we analyze memory in deep RL? We introduce tools for analyzing policies under partial observability and revealing how agents use memory to make decisions. To utilize these tools, we present POPGym Arcade, a collection of Atari-inspired, hardware-accelerated environments sharing a single observation and action space. Each environment provides fully and partially observable variants, enabling counterfactual studies on observability. We find that controlled studies are necessary for fair comparisons and identify a pathology where value functions smear credit over irrelevant history. Using this pathology, we demonstrate how out-of-distribution scenarios can contaminate memory, perturbing the policy far into the future. Our code is available at https://github.com/bolt-research/popgym-arcade.

2510.10185 2026-05-28 cs.CL cs.AI cs.MA

Auditing medical multi-agent AI reveals risks of false consensus

审计医疗多智能体AI揭示虚假共识风险

Yinghao Zhu, Lei Gu, Zixiang Wang, Haoran Sang, Dehao Sui, Wen Tang, Lan Mi, Yasha Wang, Junyi Gao, Liang Yao, Tianfan Fu, Ewen Harrison, Lequan Yu, Liantao Ma

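AI summary Proposes the MedAgentAudit framework, which diagnoses collaborative failure modes in medical multi-agent systems through an expert-validated audit process and finds systemic risks such as false consensus and authority bias.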

Comments Code and Data: https://github.com/MedX-PKU/MedAgentAudit

AI Chinese abstract

大型语言模型正越来越多地被组装成医疗多智能体系统,通过专家角色、同行评审和共识形成模拟多学科会诊。然而,在临床决策支持中,表面共识并不足够。临床医生还需要知道智能体是否检查了证据、处理了分歧并保持了不确定性可见。当前评估主要关注最终准确性,未测试协作过程的安全性。本文介绍MedAgentAudit,一个基于临床的工作流审计框架,用于诊断和量化医疗多智能体系统中的协作失败模式。从3,600个执行日志中,我们推导出一个经专家验证的十种常见失败分类法,涵盖任务理解、协作讨论以及综合与决策。随后,我们部署一个经专家验证的自动审计器作为非干预探针,覆盖14,400个案例,涉及六种多智能体架构、六个医疗文本和视觉数据集以及每种模态的四个大语言模型设置。跨系统而言,协作带来不均衡的准确性提升和频繁的过程失败。16.63%的案例中存在无依据的观察结果,并向下游传播。在讨论中,智能体在98.42%的案例中重复初始观点而非重新审视证据,并在42.73%的案例中未能激活专家推理。在综合阶段,最终答案常常用权威或多数票替代证据检查,显示出权威偏差(28.76%,从35.30%上升至68.75%)、自我矛盾(18.53%)、矛盾忽视(5.48%)和少数派压制(5.11%)。MedAgentAudit将医疗AI评估从输出评分重新定义为过程级安全与问责,为医学中透明、可审计且由临床医生监督的智能体系统提供了实践基础。

English abstract

Large language models are increasingly being assembled into medical multi-agent systems that emulate multidisciplinary consultation through specialist roles, peer review and consensus formation. In clinical decision support, however, apparent consensus is not enough. Clinicians also need to know whether agents checked the evidence, addressed disagreement and kept uncertainty visible. Current evaluations largely score final accuracy, leaving the safety of the collaborative process untested. Here we introduce MedAgentAudit, a clinically grounded workflow audit framework for diagnosing and quantifying collaborative failure modes in medical multi-agent systems. From 3,600 execution logs, we derive an expert-validated taxonomy of ten recurrent failures spanning task comprehension, collaborative discussion, and synthesis and decision-making. We then deploy an expert-validated automated auditor as non-interventional probes across 14,400 cases, covering six multi-agent architectures, six medical text and vision datasets, and four large language model settings per modality. Across systems, collaboration yields uneven accuracy gains and frequent process failures. Unsupported observations affect 16.63% of cases and propagate downstream. In discussion, agents repeat initial views in 98.42% of cases rather than re-examining evidence, and fail to activate specialist reasoning in 42.73%. During synthesis, final answers often substitute authority or majority count for evidence checking, showing authority bias in 28.76% (rising from 35.30% to 68.75% across rounds), self-contradiction in 18.53%, contradiction neglect in 5.48% and minority suppression in 5.11%. MedAgentAudit reframes medical AI evaluation from output scoring to process-level safety and accountability, providing a practical foundation for transparent, auditable and clinician-supervised agentic systems in medicine.

2508.01521 2026-05-28 cs.LG

Prototype Learning to Create Refined Interpretable Digital Phenotypes from ECGs

原型学习从心电图创建精细的可解释数字表型

Sahil Sethi, David Chen, Michael C. Burkhart, Nipun Bhandari, Bashar Ramadan, Brett Beaulieu-Jones

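AI summary Uses a prototype-based deep learning model to learn interpretable phenotypes from ECGs, validates their association with diagnosis codes in external clinical data, and finds that the prototypes capture clinical meaning beyond the original training objective.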

Comments Accepted (oral) to the 31st Pacific Symposium on Biocomputing

AI Chinese abstract

基于原型的神经网络通过将输入与训练数据中学到的代表性信号模式进行比较,提供可解释的预测。尽管此类模型在生理数据分类中显示出前景,但其原型是否捕捉到与更广泛临床表型一致的基础结构仍不清楚。我们使用基于原型的深度学习模型,在PTB-XL数据集上训练进行多标签ECG分类,然后在不修改的情况下对MIMIC-IV临床数据库进行推理。我们评估了仅用于分类训练的单个原型是否与外部人群中以phecode形式表示的出院诊断相关。与分类器的类别预测、NLP提取的概念或更广泛的原型类别相比,单个原型在所有phecode类别中表现出显著更强且更特异的临床结果关联。具有混合显著性模式的原型类别表现出显著更大的类内距离(p < 0.0001),表明模型学会了区分诊断类别内临床有意义的变异。原型在多种疾病中实现了强大的预测性能,AUC范围从房颤的0.89到心力衰竭的0.91,同时也在非心脏疾病(如败血症和肾病)中显示出显著信号。这些发现表明,基于原型的模型可以支持从生理时间序列数据中进行可解释的数字表型分析,提供可转移的中间表型,捕捉超越原始训练目标的临床有意义的生理特征。

English abstract

Prototype-based neural networks offer interpretable predictions by comparing inputs to learned, representative signal patterns anchored in training data. While such models have shown promise in the classification of physiological data, it remains unclear whether their prototypes capture an underlying structure that aligns with broader clinical phenotypes. We use a prototype-based deep learning model trained for multi-label ECG classification on the PTB-XL dataset and then, without modification, perform inference on the MIMIC-IV clinical database. We assess whether individual prototypes, trained solely for classification, are associated with hospital discharge diagnoses in the form of phecodes in this external population. Individual prototypes demonstrate significantly stronger and more specific associations with clinical outcomes compared to the classifier's class predictions, NLP-extracted concepts, or broader prototype classes across all phecode categories. Prototype classes with mixed significance patterns exhibit significantly greater intra-class distances (p < 0.0001), indicating that the model learned to differentiate clinically meaningful variations within diagnostic categories. The prototypes achieve strong predictive performance across diverse conditions, with AUCs ranging from 0.89 for atrial fibrillation to 0.91 for heart failure, while also showing substantial signal for non-cardiac conditions such as sepsis and renal disease. These findings suggest that prototype-based models can support interpretable digital phenotyping from physiologic time-series data, providing transferable intermediate phenotypes that capture clinically meaningful physiologic signatures beyond their original training objectives.

2510.08555 2026-05-28 cs.CV

VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

VideoCanvas: 通过上下文条件化从任意时空补丁进行统一视频补全

Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue

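AI summary Proposes the VideoCanvas framework, which unifies arbitrary spatio-temporal video completion through a hybrid conditioning strategy (zero-padded full-frame canvas encoding spatially, Temporal RoPE Interpolation temporally) without modifying or retraining the VAE.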

Comments Project page: https://onevfall.github.io/project_page/videocanvas

AI Chinese abstract

现有的可控视频生成方法通常针对刚性、任务特定的设置设计,例如首帧图像到视频、修复或插值,将时空控制视为一组孤立的问题。我们形式化了一个统一的任务——任意时空视频补全,其中模型从用户指定的、放置在任何空间位置和时间戳的补丁生成连贯视频。然而,在现代潜在视频扩散模型中实现这样的统一框架并非易事:因果视频VAE将多个帧压缩到单个潜在槽中,使得帧级条件化从根本上不适定,并且直接将稀疏填充、零填充的视频输入馈入VAE会导致严重的分布外伪影。为了解决这些挑战,我们提出了VideoCanvas,一个简单而有效的框架,它将上下文条件化范式适应于任意时空补全,无需修改或重新训练VAE。我们的关键思想是一种混合条件化策略,将空间和时间控制解耦:在空间上,我们以图像模式编码零填充的全帧画布,使VAE输入保持分布内;在时间上,我们使用Temporal RoPE插值为每个条件分配潜在序列中的连续分数索引,以实现精确的帧级对齐。为了评估这种能力,我们开发了VideoCanvasBench,这是第一个用于任意时空视频补全的基准测试,涵盖场景内保真度和场景间创造力。大量实验表明,VideoCanvas在单一统一框架下,在各种视频生成任务中实现了最先进的性能。

English abstract

Existing controllable video generation methods are typically designed for rigid, task-specific settings, such as first-frame image-to-video, inpainting, or interpolation, treating spatio-temporal control as a set of isolated problems. We formalize a unified task, arbitrary spatio-temporal video completion, where a model generates a coherent video from user-specified patches placed at any spatial location and timestamp. However, realizing such a unified framework within modern latent video diffusion models is non-trivial: causal video VAEs compress multiple frames into a single latent slot, making frame-level conditioning fundamentally ill-posed, and directly feeding sparsely populated, zero-padded video inputs into the VAE leads to severe out-of-distribution artifacts. To address these challenges, we propose VideoCanvas, a simple yet effective framework that adapts the In-Context Conditioning paradigm to arbitrary spatio-temporal completion without modifying or retraining the VAE. Our key idea is a hybrid conditioning strategy that decouples spatial and temporal control: spatially, we encode zero-padded full-frame canvases in image mode to keep VAE inputs in-distribution, and temporally we use Temporal RoPE Interpolation to assign each condition a continuous fractional index in the latent sequence for precise frame-level alignment. To evaluate this capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Extensive experiments demonstrate that VideoCanvas achieves state-of-the-art performance across a diverse range of video generation tasks under a single, unified framework.
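The idea of assigning a condition frame a continuous fractional index can be sketched as follows. The mapping `frame_index / temporal_stride`, the stride of 4, and the function names are assumptions for illustration only. The abstract does not state the exact mapping VideoCanvas uses for causal VAEs. The angle formula is the standard rotary position embedding (RoPE) one, which accepts non-integer positions without modification.

```python
def fractional_latent_index(frame_index, temporal_stride=4):
    # Assumed mapping: pixel frame f sits at latent position f / stride,
    # so frames inside one latent slot get distinct fractional positions.
    return frame_index / temporal_stride


def rope_angles(position, dim=8, base=10000.0):
    """Standard RoPE rotation angles, one per pair of channels."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]
```

With these assumptions, frame 6 maps to latent position 1.5, and its rotation angles fall between those of integer positions 1 and 2.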

2510.07208 2026-05-28 cs.LG

A Broader View of Thompson Sampling

汤普森采样的更广阔视角

Yanlin Qu, Hongseok Namkoong, Assaf Zeevi

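AI summary Reinterprets Thompson Sampling as an online optimization algorithm, revealing how it balances exploration and exploitation, and proposes a policy-improvement approach based on regularization by residual uncertainty.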

AI Chinese abstract

汤普森采样是最广泛使用和研究的赌博机算法之一,以其简单的结构、低遗憾性能和坚实的理论保证而闻名。然而,与大多数其他赌博机算法家族形成鲜明对比的是,后验采样(由汤普森引入)能够“适当”平衡探索和利用的确切机制仍然是一个谜。在本文中,我们表明解决这个问题的核心见解在于将汤普森采样重新解释为一种在线优化算法。为了提炼这一点,我们引入了一个合适的时间不变遗憾概念,导致一个平稳化的赌博机问题和一个平稳的贝尔曼最优策略。然后,我们证明汤普森采样具有一种在线优化形式,该形式模仿了上述贝尔曼最优策略的结构,其中“贪婪”由残差不确定性的度量进行正则化。这种在线优化的新视角使我们能够更好地理解汤普森采样的动态,以及一种模仿贝尔曼最优基准的策略改进原则性方法。

English abstract

Thompson Sampling is one of the most widely used and studied bandit algorithms, known for its simple structure, low regret performance, and solid theoretical guarantees. Yet, in stark contrast to most other families of bandit algorithms, the exact mechanism through which posterior sampling (as introduced by Thompson) is able to "properly" balance exploration and exploitation remains a mystery. In this paper, we show that the core insight to address this question stems from recasting Thompson Sampling as an online optimization algorithm. To distill this, we introduce a suitable time-invariant notion of regret that leads to a stationarized bandit problem and a stationary Bellman-optimal policy. We then show that Thompson Sampling admits an online optimization form that mimics the structure of the aforementioned Bellman-optimal policy, where "greediness" is regularized by a measure of residual uncertainty. This new lens of online optimization allows both a better understanding of Thompson Sampling dynamics and a principled manner for policy improvement that mimics the Bellman-optimal benchmark.
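For readers unfamiliar with the baseline algorithm, here is a minimal sketch of classical Thompson Sampling on a Bernoulli bandit with Beta(1, 1) priors. It shows posterior sampling only, not the paper's online-optimization reformulation or its stationarized regret. The arm means, round count, and seed are illustrative values.

```python
import random


def thompson_sampling(true_means, rounds=2000, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns the pull count per arm."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(rounds):
        # Draw one plausible mean per arm from its Beta posterior.
        samples = [
            rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)
        ]
        # Act greedily with respect to the sampled means.
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Exploration comes entirely from the randomness of the posterior draws: as an arm's posterior concentrates, its samples stop overtaking those of a clearly better arm.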

2510.06974 2026-05-28 cs.CL

Probing Social Identity Bias in Chinese LLMs with Gendered Pronouns and Social Groups

Geng Liu, Feng Li, Junjie Mu, Mengxiao Zhu, Francesco Pierri

AI summary: Using prompts designed around the linguistic properties of Chinese, evaluates sentiment and toxicity bias under ingroup and outgroup framings in ten representative Chinese LLMs across 240 social groups. Finds systematic asymmetries: instruction tuning reduces sentiment bias while toxicity gaps are more persistent, and feminine-marked pronouns are associated with higher toxicity.

Abstract

Large language models (LLMs) are increasingly deployed in user-facing applications, raising concerns that they may reflect and amplify social biases. We investigate social identity biases in Chinese LLMs using Mandarin-specific prompts across ten representative models. Our evaluation compares ingroup ("We") and outgroup ("They") framings across 240 social groups salient in the Chinese context, using a two-tiered measurement framework that assesses both sentiment and toxicity. The prompt design explicitly accounts for linguistic properties of Mandarin, including the distinction between the default gender-neutral plural pronoun and its explicitly feminine counterpart, enabling a controlled comparison of social identity framing effects. Across models, we observe systematic ingroup-outgroup asymmetries, although their expression differs across measurement dimensions. In particular, instruction tuning often reduces sentiment asymmetries, while toxicity gaps remain more persistent. Moreover, the feminine-marked plural pronoun is associated with higher toxicity than the default gender-neutral plural in several models. Our study introduces a language-aware evaluation framework for Chinese LLMs and shows that (i) social identity biases previously documented in English also manifest in Chinese and that (ii) Mandarin-specific linguistic structure can reveal bias patterns that are not directly observable in English-only settings.

2510.06928 2026-05-28 cs.CV

IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction

Ran Yi, Teng Hu, Zihan Su, Jiangning Zhang, Lizhuang Ma

AI summary: Proposes the IAR2 framework, which uses a semantic-detail associated dual codebook and a hierarchical prediction mechanism for coarse-to-fine image generation, achieving a leading FID of 1.50 on ImageNet.

Abstract

Autoregressive models have emerged as a powerful paradigm for visual content creation, but often overlook the intrinsic structural properties of visual data. Our prior work, IAR, initiated a direction to address this by reorganizing the visual codebook based on embedding similarity, thereby improving generation robustness. However, it is constrained by the rigidity of pre-trained codebooks and the inaccuracies of hard, uniform clustering. To overcome these limitations, we propose IAR2, an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. At the core of IAR2 is a novel Semantic-Detail Associated Dual Codebook, which decouples image representations into a semantic codebook for global semantic information and a detail codebook for fine-grained refinements. It expands the quantization capacity from a linear to a polynomial scale, significantly enhancing expressiveness. To accommodate this dual representation, we propose a Semantic-Detail Autoregressive Prediction scheme coupled with a Local-Context Enhanced Autoregressive Head, which performs hierarchical prediction (first the semantic token, then the detail token) while leveraging a local context window to enhance spatial coherence. Furthermore, for conditional generation, we introduce a Progressive Attention-Guided Adaptive CFG mechanism that dynamically modulates the guidance scale for each token based on its relevance to the condition and its temporal position in the generation sequence, improving conditional alignment without sacrificing realism. Extensive experiments demonstrate that IAR2 sets a new state of the art for autoregressive image generation, achieving an FID of 1.50 on ImageNet. Our model not only surpasses previous methods in performance but also demonstrates superior computational efficiency, highlighting the effectiveness of our structured, coarse-to-fine generation strategy.

2502.17832 2026-05-28 cs.LG cs.AI cs.CR cs.CV

MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, Heng Ji

AI summary: Proposes the MM-PoisonRAG framework, which uses two strategies, Localized Poisoning Attack (LPA) and Globalized Poisoning Attack (GPA), to systematically study the vulnerability of multimodal retrieval-augmented generation (RAG) to knowledge poisoning. Experiments show attack success rates of up to 56% and that the attacks bypass existing defenses.

Comments Code is available at https://github.com/HyeonjeongHa/MM-PoisonRAG

Abstract

Retrieval-augmented generation (RAG) has become a common practice in multimodal large language models (MLLM) to enhance factual grounding and reduce hallucination. Yet, its reliance on retrieval exposes MLLMs to knowledge poisoning attacks, in which adversaries deliberately inject malicious multimodal content into external knowledge bases to steer models toward generating incorrect or even harmful responses. We present MM-PoisonRAG, a framework to systematically study the vulnerability of multimodal RAG under knowledge poisoning. Specifically, we design two novel attack strategies: Localized Poisoning Attack (LPA), which implants targeted, query-specific multimodal misinformation to manipulate outputs toward attacker-controlled responses, and Globalized Poisoning Attack (GPA), which uses a single, untargeted adversarial injection to broadly corrupt reasoning and collapse generation quality across all queries. Extensive experiments on diverse tasks, multimodal RAG components, and attacker access levels reveal severe vulnerabilities: LPA achieves up to 56% attack success rate even under restricted access, and transfers effectively across four different retrievers without re-optimizing the adversaries. GPA completely disrupts model generation to 0% accuracy with just one poisoned content. Moreover, both LPA and GPA bypass existing defenses, underscoring the fragility of multimodal RAG and establishing MM-PoisonRAG as a foundation for future research on securing RAG frameworks against multimodal knowledge poisoning.

2510.05291 2026-05-28 cs.CL

Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages

Tarek Naous, Anagha Savit, Carlos Rafael Catalan, Geyang Guo, Jaehyeok Lee, Kyungdon Lee, Lheane Marie Dizon, Mengyu Ye, Neel Kothari, Sahajpreet Singh, Sarah Masud, Tanish Patwa, Trung Thanh Tran, Zohaib Khan, Alan Ritter, Tanmoy Chakraborty, Yuki Arase, Keisuke Sakaguchi, JinYeong Bak, Wei Xu

AI summary: Proposes the Camellia benchmark, which uses three tasks to evaluate the bias of multilingual LLMs toward Asian versus Western cultural entities in nine Asian languages. Finds that models struggle with cultural adaptation, differ in their sentiment associations, and show performance gaps in entity extraction.

Abstract

As Large Language Models (LLMs) develop stronger multilingual capabilities, their sensitivity to culturally diverse entities becomes increasingly important. Prior work by Naous et al. (2024) has shown that LLMs often favor Western-associated entities in Arabic. Due to the lack of entity-centric multilingual benchmarks, it remains unclear if such biases also manifest in various non-Western languages. In this paper, we introduce Camellia, a benchmark for evaluating entity-centric cultural biases in nine Asian languages, spanning six Asian cultures. Camellia includes 19,530 manually annotated entities associated with the covered Asian or Western cultures, as well as 2,173 masked contexts for these entities derived from social media posts. Using Camellia, we evaluate cultural biases in four recent multilingual LLMs across three tasks: cultural context adaptation, sentiment association, and entity extractive QA. Our analyses show that LLMs struggle with cultural adaptation across these languages, with performance differing across models developed in different regions. We further observe that different LLM families can hold distinct biases, reflected in the ways they link cultures to particular sentiments. Lastly, we find that LLMs can struggle with context understanding in some Asian languages, creating performance gaps between cultures in entity extraction.

2510.02329 2026-05-28 cs.CL cs.AI

SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee

AI summary: Proposes SelfJudge, which trains judge verifiers through self-supervision of the target model and accelerates speculative decoding by assessing whether responses with substituted tokens preserve the original meaning, achieving a better inference-accuracy trade-off.

Journal ref
ICML 2026

Abstract

Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding speeds this process up by relaxing the verification criteria, accepting draft tokens that may exhibit minor discrepancies from the target model output. However, existing methods are restricted by their reliance on human annotations or tasks with verifiable ground truths, which limits generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show that SelfJudge achieves better inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.
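The difference between strict verification and relaxed, judge-style verification can be sketched as follows. This is a simplified greedy illustration, not the SelfJudge implementation: `target_next_token` and `judge` are hypothetical callables standing in for the target model and a trained verifier, and real systems score all draft positions in one batched forward pass rather than one call per token.

```python
def verify_draft(prefix, draft_tokens, target_next_token, judge=None):
    """Return the tokens kept after verifying a draft against the target.

    target_next_token(seq) -> token the target model would emit after seq
    (hypothetical stand-in for a greedy target model).
    judge(seq, token) -> True if a mismatching draft token is still
    acceptable (hypothetical stand-in for a relaxed verifier).
    """
    seq = list(prefix)
    kept = []
    for tok in draft_tokens:
        expected = target_next_token(seq)
        matches = tok == expected
        relaxed_ok = judge is not None and judge(seq, tok)
        if not (matches or relaxed_ok):
            # First rejected position: fall back to the target's own token.
            kept.append(expected)
            return kept
        kept.append(tok)
        seq.append(tok)
    # Every draft token was accepted, so the target contributes one extra token.
    kept.append(target_next_token(seq))
    return kept
```

With `judge=None` this reduces to exact-match verification. Passing a judge lets more draft tokens through per step, which is where the speedup of relaxed verification comes from.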

2510.01724 2026-05-28 cs.AI

MetaboT: An LLM-based Multi-Agent Framework for Interactive Analysis of Mass Spectrometry Metabolomics Knowledge Graphs

Madina Bekbergenova, Lucas Pradi, Benjamin Navet, Emma Tysinger, Franck Michel, Matthieu Feraud, Yousouf Taghzouti, Yan Zhou Chen, Olivier Kirchhoffer, Florence Mehl, Martin Legrand, Tao Jiang, Marco Pagni, Soha Hassoun, Jean-Luc Wolfender, Wout Bittremieux, Fabien Gandon, Louis-Félix Nothias

AI summary: Proposes MetaboT, an LLM-based multi-agent framework whose modular architecture translates natural-language questions into SPARQL queries, lowering the barrier to using metabolomics knowledge graphs.

Journal ref
33rd annual international conference on Intelligent Systems for Molecular Biology (ISMB 2025) / 24th Annual Conference of the European Conference on Computational Biology (ECCB 2025), Jul 2025, Liverpool, United Kingdom

Abstract

Mass spectrometry-based metabolomics generates complex, high-dimensional data that holds vast potential for biological discovery but remains difficult to integrate and interpret. Knowledge graphs (KGs) unify this heterogeneous information by representing spectra, annotations, taxa, chemical classes, and biological activities as a single interoperable network; however, their practical use is limited by the steep learning curve of corresponding specialized representation and query languages. Here we introduce MetaboT, an open-source multi-agent Large Language Model (LLM) framework that translates natural-language questions into executable SPARQL queries over metabolomics knowledge graphs. MetaboT mitigates the hallucination and schema-compliance limitations of single-model approaches through a modular architecture in which specialised agents handle scope validation, entity resolution against authoritative resources, schema-aware query generation, iterative refinement, and result interpretation. We validate MetaboT on the Experimental Natural Products Knowledge Graph (ENPKG), using an expert-authored benchmark of natural-language questions paired with reference SPARQL queries, and demonstrate its ability to answer complex questions about plant-metabolite relationships and biological activities. MetaboT lowers the technical barrier for metabolomics researchers and enables semantic data mining without specialised programming expertise.

2508.21046 2026-05-28 cs.CV cs.RO

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie

AI summary: Proposes the CogVLA framework, which uses instruction-driven routing and sparsification to reach success rates of 97.4% and 70.0% on the LIBERO benchmark and real-world robotic tasks, with a 2.5-fold reduction in training cost and a 2.8-fold reduction in inference latency.

Comments Accepted to NeurIPS 2025, Project Page: https://jiutian-vl.github.io/CogVLA-page

Abstract

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.

2509.26442 2026-05-28 cs.LG math.OC

Extensions of Robbins-Siegmund Theorem with Applications in Reinforcement Learning

Xinyu Liu, Zixuan Xie, Shangtong Zhang

AI summary: For the case where the zero-order term is square-summable but not summable, extends the Robbins-Siegmund theorem under a mild assumption on the increments of the stochastic process, proving almost sure convergence to a bounded set and providing convergence rates, high probability concentration bounds, and $L^p$ convergence rates, with a first application to Q-learning with linear function approximation.

Abstract

The Robbins-Siegmund theorem establishes the convergence of stochastic processes that are almost supermartingales and is one of the most commonly used approaches for analyzing stochastic iterative algorithms in stochastic approximation and reinforcement learning (RL). However, its original form has a significant limitation: it requires the zero-order term to be summable. In many important RL applications, this summability condition cannot be met. This limitation motivates us to extend the Robbins-Siegmund theorem to almost supermartingales where the zero-order term is not summable but only square-summable. In particular, we introduce a novel and mild assumption on the increments of the stochastic processes. Together with the square-summability condition, this assumption yields almost sure convergence to a bounded set. We further provide almost sure convergence rates, high probability concentration bounds, and $L^p$ convergence rates. We then apply the new results to stochastic approximation and RL. Notably, we obtain the first almost sure convergence rate, the first high probability concentration bound, and the first $L^p$ convergence rate for $Q$-learning with linear function approximation.
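For reference, the classical Robbins-Siegmund theorem is usually stated as below. The symbols $z_n$, $\beta_n$, $\xi_n$, $\zeta_n$ are illustrative notation rather than the paper's, and reading $\xi_n$ as the "zero-order term" (the term that does not multiply $z_n$) is an assumption about how the abstract uses that phrase.

```latex
% Classical Robbins-Siegmund theorem (standard textbook form; notation is illustrative).
% Let (z_n), (\beta_n), (\xi_n), (\zeta_n) be nonnegative random sequences
% adapted to a filtration (\mathcal{F}_n) such that
\[
  \mathbb{E}\left[ z_{n+1} \mid \mathcal{F}_n \right]
  \le (1 + \beta_n)\, z_n + \xi_n - \zeta_n
  \quad \text{for all } n \ge 0 .
\]
% If
\[
  \sum_{n} \beta_n < \infty
  \quad \text{and} \quad
  \sum_{n} \xi_n < \infty
  \quad \text{almost surely},
\]
% then z_n converges almost surely to a finite random variable, and
\[
  \sum_{n} \zeta_n < \infty \quad \text{almost surely}.
\]
```

Under this reading, the extension described in the abstract replaces the requirement $\sum_n \xi_n < \infty$ with a square-summability condition plus an assumption on the increments, and the conclusion weakens from convergence to a limit to convergence to a bounded set.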

2503.11906 2026-05-28 cs.CV cs.AI

A Survey on SAR ship classification using Deep Learning

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Emanuele Salerno

AI summary: Surveys deep learning for SAR ship classification, establishes a taxonomy based on models, handcrafted features, use of SAR attributes, and the impact of fine-tuning, and discusses future research directions.

Comments in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026

Abstract

Deep learning (DL) has emerged as a powerful tool for Synthetic Aperture Radar (SAR) ship classification. This survey comprehensively analyzes the diverse DL techniques employed in this domain. We identify critical trends and challenges, highlighting the importance of integrating handcrafted features, utilizing public datasets, data augmentation, fine-tuning, explainability techniques, and fostering interdisciplinary collaborations to improve DL model performance. This survey establishes a first-of-its-kind taxonomy for categorizing relevant research based on DL models, handcrafted feature use, SAR attribute utilization, and the impact of fine-tuning. We discuss the methodologies used in SAR ship classification tasks and the impact of different techniques. Finally, the survey explores potential avenues for future research, including addressing data scarcity, exploring novel DL architectures, incorporating interpretability techniques, and establishing standardized performance metrics. By addressing these challenges and leveraging advancements in DL, researchers can contribute to developing more accurate and efficient ship classification systems, ultimately enhancing maritime surveillance and related applications.

2509.21128 2026-05-28 cs.AI

RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

Kohsei Matsutani, Shota Takashiro, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

AI summary: Using trajectory-level and step-level analysis frameworks, compares how reinforcement learning (RL) and supervised fine-tuning (SFT) affect the reasoning paths of LLMs for mathematical reasoning. Finds that RL compresses incorrect trajectories and concentrates reasoning functionality, whereas SFT expands correct trajectories and homogenizes reasoning functionality.

Comments Accepted at ICLR 2026

Abstract

Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical domains). Specifically, we investigate the reasoning process at two levels of granularity: the trajectory-level, which examines complete reasoning outputs, and the step-level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering of unique reasoning trajectories shows complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens (about 2.5 times), while SFT flattens (reduced to about one-third), the decay rates of node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps. Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, we delineate the shared and distinct characteristics of RL and SFT. Our work presents a novel reasoning path perspective that explains why the current best practice of two-stage training, with SFT followed by RL, is successful, and offers practical implications for data construction and more efficient learning approaches.

2509.15848 2026-05-28 cs.AI

A Comparative Study of Rule-Based and Data-Driven Approaches in Industrial Monitoring

Giovanni De Gasperis, Sante Dino Facchini

AI summary: Compares rule-based and data-driven approaches to industrial monitoring, analyzes the strengths and weaknesses of each, and proposes hybrid solutions that combine the advantages of both.

Comments This chapter has been published in Advancements in AI From Foundations to Cross-Disciplinary Applications, Springer, 2026

Abstract

Industrial monitoring systems, especially when deployed in Industry 4.0 environments, are experiencing a paradigm shift from traditional rule-based architectures to data-driven approaches leveraging machine learning and artificial intelligence. This study presents a comparison between these two methodologies, analyzing their respective strengths, limitations, and application scenarios, and proposes a basic framework to evaluate their key properties. Rule-based systems offer high interpretability, deterministic behavior, and ease of implementation in stable environments, making them ideal for regulated industries and safety-critical applications. However, they face challenges with scalability, adaptability, and performance in complex or evolving contexts. Conversely, data-driven systems excel in detecting hidden anomalies, enabling predictive maintenance and dynamic adaptation to new conditions. Despite their high accuracy, these models face challenges related to data availability, explainability, and integration complexity. The paper suggests hybrid solutions as a promising direction, combining the transparency of rule-based logic with the analytical power of machine learning. Our hypothesis is that the future of industrial monitoring lies in intelligent, synergistic systems that leverage both expert knowledge and data-driven insights. This dual approach enhances resilience, operational efficiency, and trust, paving the way for smarter and more flexible industrial environments.

2509.14075 2026-05-28 cs.RO cs.SY eess.SY

RCM Constraint-Consistent Dynamic Control in Surgical Robots

Yu Li, Hamid Sadeghian, Zewen Yang, Valentin Le Mesle, Sami Haddadin

AI summary: Models the remote center of motion (RCM) as a rheonomic holonomic constraint and integrates it into a projection-based inverse-dynamics controller, achieving constraint-consistent control at the torque level with lower RCM residuals and smoother torque profiles.

Comments Accepted at ICRA 2026

Abstract

Robotic-assisted minimally invasive surgery (RAMIS) requires accurate enforcement of the remote center of motion (RCM) constraint to ensure safe tool motion through a trocar. Existing virtual RCM controllers are commonly formulated either at the kinematic level or as task-space objectives, which makes torque-level enforcement under trocar motion and physical interaction difficult to formulate consistently. This paper models the RCM as a rheonomic holonomic constraint and incorporates it into a projection-based inverse-dynamics controller with explicit constrained/free-motion torque decomposition. The resulting formulation unifies kinematic RCM enforcement and task-space tracking at the torque level, while preserving a constraint-consistent structure for residual regulation and null-space compliance. The proposed controller is validated in simulation and on a RAMIS training platform against representative projection-based and constrained-dynamics baselines. Across spiral tracking, varying insertion depth, moving trocar conditions, and human interaction, the method achieves lower RCM residuals and smoother torque profiles while maintaining accurate tool-tip tracking. These results support the use of constraint-consistent torque control for reliable virtual RCM enforcement in surgical robotics. The project page is available at https://rcmpc-cube.github.io

2509.13177 2026-05-28 cs.RO

ROOM: A Physics-Based Continuum Robot Simulator for Photorealistic Medical Datasets Generation

Salvatore Esposito, Matías Mattamala, Daniel Rebain, Francis Xiatian Zhang, Kevin Dhaliwal, Mohsen Khadem, Subramanian Ramamoorthy

AI summary: Proposes the ROOM simulation framework, which uses patient CT scans to generate multi-modal bronchoscopy training data, and validates it on pose estimation and depth estimation tasks.

Journal ref
International Conference on Robotics and Automation 2026

Abstract

Continuum robots are advancing bronchoscopy procedures by accessing complex lung airways and enabling targeted interventions. However, their development is limited by the lack of realistic training and test environments: Real data is difficult to collect due to ethical constraints and patient safety concerns, and developing autonomy algorithms requires realistic imaging and physical feedback. We present ROOM (Realistic Optical Observation in Medicine), a comprehensive simulation framework designed for generating photorealistic bronchoscopy training data. By leveraging patient CT scans, our pipeline renders multi-modal sensor data including RGB images with realistic noise and light specularities, metric depth maps, surface normals, optical flow and point clouds at medically relevant scales. We validate the data generated by ROOM in two canonical tasks for medical robotics: multi-view pose estimation and monocular depth estimation, demonstrating diverse challenges that state-of-the-art methods must overcome to transfer to these medical settings. Furthermore, we show that the data produced by ROOM can be used to fine-tune existing depth estimation models to overcome these challenges, also enabling other downstream applications such as navigation. We expect that ROOM will enable large-scale data generation across diverse patient anatomies and procedural scenarios that are challenging to capture in clinical settings. Code and data: https://github.com/iamsalvatore/room.

2508.21495 2026-05-28 cs.LG

Rethinking Calibration for Early-Exit Neural Networks

Piotr Kubaty, Filip Szatkowski, Grzegorz Choczyński, Eric Nalisnick, Bartosz Wójcik

AI summary: Questions whether calibration is sufficient for the performance of early-exit neural networks, proposes Early-Exit Failure Prediction (EEFP) to account for both prediction correctness and computational cost, and designs a lightweight improvement procedure that achieves better cost-accuracy trade-offs.

Abstract

Early-exit neural networks (EENNs) accelerate inference by allowing intermediate classifiers to stop computation once predictions are confident enough. Most methods rely on confidence thresholds for exiting, and consequently, improving classifier calibration is widely assumed to improve performance. In this work, we challenge this assumption and show that calibration alone is not sufficient for EENNs to exploit adaptive computation. To address this insufficiency, we introduce Early-Exit Failure Prediction (EEFP), which accounts for both prediction correctness and the cost of further computation. We also propose a lightweight, EEFP-motivated procedure to improve the intermediate classifiers, which can directly replace calibration in EENNs. Extensive experiments demonstrate that our approach achieves superior cost-accuracy trade-offs compared to calibration, and EEFP more reliably reflects overall EENN performance. Our code is available at https://github.com/gmum/rethinking-calibration-for-eenns.
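The confidence-threshold exit rule that the abstract refers to can be sketched as follows. This shows only the conventional baseline mechanism, not EEFP or the authors' procedure; `early_exit_predict` is a hypothetical name, and the per-exit logits are passed in precomputed for brevity, whereas a real network would compute deeper exits only when needed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exit_logits, threshold):
    """Return (predicted_class, exit_index) under confidence thresholding.

    exit_logits: one logit vector per intermediate classifier, in depth
    order (precomputed here for illustration only).
    """
    if not exit_logits:
        raise ValueError("at least one exit is required")
    last = len(exit_logits) - 1
    for i, logits in enumerate(exit_logits):
        probs = softmax(logits)
        confidence = max(probs)
        # Stop at the first exit that is confident enough; the final
        # classifier always answers.
        if confidence >= threshold or i == last:
            return probs.index(confidence), i
```

Lowering `threshold` makes the network exit earlier and cheaper. Whether an exit is also correct is a separate question, which is the gap between calibration and overall performance that the abstract highlights.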

2311.02304 2026-05-28 cs.RO

Imitating and Finetuning Model Predictive Control for Robust and Symmetric Quadrupedal Locomotion

模仿与微调模型预测控制实现鲁棒且对称的四足运动

Donghoon Youm, Hyunyoung Jung, Hyeongjun Kim, Jemin Hwangbo, Hae-Won Park, Sehoon Ha

AI总结 提出模仿与微调模型预测控制(IFM)框架,结合模型预测控制与模仿学习及强化学习,提升四足机器人在复杂地形上的运动性能、对称性和能效。

详情
Journal ref
IEEE Robotics and Automation Letters ( Volume: 8, Issue: 11, November 2023
AI中文摘要

腿式机器人的控制是一个具有挑战性的问题,已有多种方法进行研究,如基于模型的控制和学习算法。本文提出了一种新颖的模仿与微调模型预测控制(IFM)框架,以结合两种方法的优势。该框架首先使用微分动态规划和Raibert启发式方法开发一个传统的模型预测控制器(MPC),作为专家策略。然后,通过模仿学习训练MPC的克隆,使控制器可学习。最后,利用有限探索的深度强化学习在更具挑战性的地形上进一步微调策略。通过全面的仿真和硬件实验,我们证明了所提出的IFM框架能够显著提高给定MPC控制器在粗糙、湿滑和传送带等需要仔细协调步态的地形上的性能。我们还展示了与普通强化学习相比,IFM能够以最小的奖励塑造负担高效地产生更对称、周期性和节能的步态。

English abstract

Control of legged robots is a challenging problem that has been investigated by different approaches, such as model-based control and learning algorithms. This work proposes a novel Imitating and Finetuning Model Predictive Control (IFM) framework to take the strengths of both approaches. Our framework first develops a conventional model predictive controller (MPC) using Differential Dynamic Programming and Raibert heuristic, which serves as an expert policy. Then we train a clone of the MPC using imitation learning to make the controller learnable. Finally, we leverage deep reinforcement learning with limited exploration for further finetuning the policy on more challenging terrains. By conducting comprehensive simulation and hardware experiments, we demonstrate that the proposed IFM framework can significantly improve the performance of the given MPC controller on rough, slippery, and conveyor terrains that require careful coordination of footsteps. We also showcase that IFM can efficiently produce more symmetric, periodic, and energy-efficient gaits compared to Vanilla RL with a minimal burden of reward shaping.

2509.04192 2026-05-28 cs.AI cs.LO math.LO

Domain size asymptotics for Markov logic networks

马尔可夫逻辑网络的域大小渐近性

Vera Koponen

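AI summary Studies the properties of the probability distributions defined by Markov logic networks as the domain size tends to infinity and, through an almost complete characterization for a language with a single unary relation symbol, shows how they differ fundamentally from the uniform distribution and from lifted Bayesian networks.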

Comments Version 2 is a major revision of version 1

AI Chinese abstract

一个马尔可夫逻辑网络(MLN)$\mathbb{M}$ 在域为 $\{1, \ldots, n\}$ 的结构集 $\mathbf{W}_n$(即“可能世界”)上确定了一个概率分布 $\mathbb{P}_n^\mathbb{M}$。我们研究当 $n$ 趋于无穷时这些分布的性质。我们证明,在温和假设下,对于具有一个任意正权重的软约束的 MLN $\mathbb{M}$,对所有足够大的 $n$,分布 $\mathbb{P}_n^\mathbb{M}$ 的行为与 $\mathbf{W}_n$ 上的均匀分布 $\mathbb{P}_n^{uni}$ 截然不同。对于仅有一个一元关系符号 $R$ 的语言,我们给出了当 $n \to \infty$ 时 $\mathbb{P}_n^\mathbb{M}$ 可能渐近行为的几乎完全刻画,其中 $\mathbb{M}$ 可以是该语言的任意 MLN。渐近行为取决于 MLN 的软约束和权重。该刻画用于证明:如果所考虑的语言至少包含一个一元关系符号,则以下结论成立:(a) 存在一个 MLN $\mathbb{M}$,使得对每个提升贝叶斯网络(LBN)$\mathbb{G}$,存在无穷多个 $n$ 使得 $\mathbb{M}$ 和 $\mathbb{G}$ 在 $\mathbf{W}_n$ 上确定不同的分布。(b) 存在一个 LBN $\mathbb{G}$,使得对每个 MLN $\mathbb{M}$,存在无穷多个 $n$ 使得 $\mathbb{G}$ 和 $\mathbb{M}$ 在 $\mathbf{W}_n$ 上确定不同的分布。我们还证明,在极限情况下,权重维度和域大小维度的行为可能完全不同。

English abstract

A Markov logic network (MLN) $\mathbb{M}$ determines a probability distribution $\mathbb{P}_n^\mathbb{M}$ on the set $\mathbf{W}_n$ of structures, or ``possible worlds'', with domain $\{1, \ldots, n\}$. We study the properties of such distributions as $n$ tends to infinity. We show that with mild assumptions on an MLN $\mathbb{M}$ with one soft constraint with an arbitrary positive weight the distribution $\mathbb{P}_n^\mathbb{M}$ will behave quite differently from the uniform distribution $\mathbb{P}_n^{uni}$ on $\mathbf{W}_n$ for all sufficiently large $n$. For a language with only one relation symbol $R$ which has arity 1 we give an almost complete characterization of the possible asymptotic behaviours of $\mathbb{P}_n^\mathbb{M}$ as $n \to \infty$, where $\mathbb{M}$ may be any MLN for this language. The asymptotic behaviour depends on the soft constraints and weights of the MLN. This characterization is used to show that if the language under consideration contains at least one relation symbol of arity 1 then the following holds: (a) There is an MLN $\mathbb{M}$ such that for every lifted Bayesian network (LBN) $\mathbb{G}$ there are infinitely many $n$ such that $\mathbb{M}$ and $\mathbb{G}$ determine different distributions on $\mathbf{W}_n$. (b) There is an LBN $\mathbb{G}$ such that for every MLN $\mathbb{M}$ there are infinitely many $n$ such that $\mathbb{G}$ and $\mathbb{M}$ determine different distributions on $\mathbf{W}_n$. We also show that, in the limit, the weight dimension and the domain size dimension may behave completely differently.
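
As a toy illustration of how a single positively weighted soft constraint pulls an MLN away from the uniform distribution, consider a language with one unary relation symbol $R$ and the soft constraint $R(x)$ with weight $w$. The sketch below assumes the standard log-linear MLN semantics, in which a world's probability is proportional to $\exp(w \cdot \text{number of true groundings})$; the function names are hypothetical, and the example is not taken from the paper and does not reproduce its characterization.

```python
import math
from typing import List


def mln_count_distribution(n: int, w: float) -> List[float]:
    """Distribution of |R| on domain {1, ..., n}.

    Assumes one soft constraint R(x) with weight w and the standard
    log-linear semantics: there are C(n, k) worlds with |R| = k, each
    with unnormalized weight exp(w * k).
    """
    # Work in log space so large n does not overflow.
    log_weights = [math.log(math.comb(n, k)) + w * k for k in range(n + 1)]
    m = max(log_weights)
    unnormalized = [math.exp(lw - m) for lw in log_weights]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


def expected_fraction(dist: List[float]) -> float:
    """Expected fraction of domain elements that satisfy R."""
    n = len(dist) - 1
    return sum(k * p for k, p in enumerate(dist)) / n


if __name__ == "__main__":
    for n in (10, 100, 1000):
        uniform = expected_fraction(mln_count_distribution(n, 0.0))
        weighted = expected_fraction(mln_count_distribution(n, 1.0))
        print(n, uniform, weighted)
```

Under these assumptions each element lies in $R$ independently with probability $1/(1+e^{-w})$, so with $w = 0$ the distribution is uniform over worlds and the expected fraction is $1/2$, while any $w > 0$ shifts it above $1/2$ for every $n$.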