2602.01334 2026-05-22 cs.CV

What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom

Yan Ma, Weiyu Zhang, Tianle Li, Linge Du, Xuyang Shen, Pengfei Liu

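AI summary: This paper studies what vision tool-use reinforcement learning actually learns on crop-and-zoom tasks. It introduces the MED framework to disentangle intrinsic capability changes from tool-induced effects, and finds that improvements are driven mainly by intrinsic learning, while tool-use RL mainly reduces tool-induced harm rather than achieving mastery of the tool.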

Comments ICML 2026 camera ready. Code: https://github.com/GAIR-NLP/Med

Twice Sequential Monte Carlo for Tree Search

Yaniv Oren, Joery A. de Vries, Pascal R. van der Vaart, Matthijs T. J. Spaan, Wendelin Böhmer

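AI summary: This paper proposes Twice Sequential Monte Carlo Tree Search (TSMCTS), which reduces variance and mitigates path degeneracy. It outperforms both the SMC baseline and a modern version of MCTS in discrete and continuous environments, and it scales well with sequential compute.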

2511.07820 2026-05-22 cs.RO cs.AI cs.CV cs.GR cs.SY eess.SY

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Fernando Castañeda, Sirui Chen, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Jinhyung Park, David Sami, Zi Wang, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan, Yuke Zhu

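AI summary: This paper proposes a supersized motion-tracking approach. By scaling model capacity, data, and compute, it obtains a generalist humanoid controller that produces natural, robust whole-body movements, and it demonstrates both the scalability of motion tracking and its value for downstream tasks.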

Comments Project page: https://nvlabs.github.io/SONIC/

English abstract

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

2511.02014 2026-05-22 cs.CV

Towards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images

Tuan Truong, Guillermo Jimenez Perez, Pedro Osorio, Matthias Lenga

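AI summary: This paper studies the use of large multimodal models (LMMs) to detect protected health information in medical images. Comparing three mainstream models under different pipeline configurations, it finds that LMMs outperform traditional methods on OCR, but that this does not consistently translate into better overall detection accuracy; the largest gains appear on test cases with complex imprint patterns. It also offers model selection recommendations for specific operational constraints and a deployment strategy.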

Comments Accepted at EMBC 2026

English abstract

The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Models (LMMs) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed- and open-source LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMMs exhibit superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models like EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are clearly readable with sufficient contrast, and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results. Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.
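
The WER and CER figures quoted in the abstract are edit-distance ratios at the word and character level. The sketch below shows the usual definition (edit distance divided by reference length); it is not the paper's evaluation code, and the function names and the sample strings are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference token
                            curr[j - 1] + 1,      # insert a hypothesis token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Fictitious burned-in text and an OCR output that misreads "O" as "0".
reference = "PATIENT JOHN DOE 1980"
ocr_output = "PATIENT JOHN D0E 1980"
print(f"WER={wer(reference, ocr_output):.3f}  CER={cer(reference, ocr_output):.3f}")
```

A single misread character costs one whole word in WER but only one character in CER, which is why the two metrics can differ widely on short text overlays.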

2510.23090 2026-05-22 cs.CL

MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models

Suchan Lee, Jihoon Choi, Sohyeon Lee, Minseok Song, Bong-Gyu Jang, Hwanjo Yu, Soyeon Caren Han

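AI summary: This paper proposes the MAP4TS framework, which incorporates classical time-series analysis into prompt design to improve large language model performance on time-series forecasting; experiments show that it outperforms existing methods across multiple datasets.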

Comments There is an error in the modeling. The paper will subsequently be revised and re-uploaded.

English abstract

Recent advances have investigated the use of pretrained large language models (LLMs) for time-series forecasting by aligning numerical inputs with LLM embedding spaces. However, existing multimodal approaches often overlook the distinct statistical properties and temporal dependencies that are fundamental to time-series data. To bridge this gap, we propose MAP4TS, a novel Multi-Aspect Prompting Framework that explicitly incorporates classical time-series analysis into the prompt design. Our framework introduces four specialized prompt components: a Global Domain Prompt that conveys dataset-level context, a Local Domain Prompt that encodes recent trends and series-specific behaviors, and a pair of Statistical and Temporal Prompts that embed handcrafted insights derived from autocorrelation (ACF), partial autocorrelation (PACF), and Fourier analysis. Multi-Aspect Prompts are combined with raw time-series embeddings and passed through a cross-modality alignment module to produce unified representations, which are then processed by an LLM and projected for final forecasting. Extensive experiments across eight diverse datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based methods. Our ablation studies further reveal that prompt-aware designs significantly enhance performance stability and that GPT-2 backbones, when paired with structured prompts, outperform larger models like LLaMA in long-term forecasting tasks.
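
To make the idea of a statistical prompt concrete, the sketch below computes two of the classical quantities the abstract names (autocorrelation at chosen lags and the dominant Fourier period) and renders them as text. This is only an illustration: the prompt wording, the function names, and the choice of lags are assumptions, PACF is omitted, and nothing here is taken from the MAP4TS implementation.

```python
import cmath
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var


def dominant_period(series):
    """Period (in steps) of the strongest non-zero frequency of a naive DFT."""
    n = len(series)
    mean = sum(series) / n
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum((x - mean) * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(series))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k


def statistical_prompt(series, lags=(1, 12)):
    """Render the statistics as a text snippet (hypothetical wording)."""
    acf_text = ", ".join(f"lag {lag}: {autocorrelation(series, lag):.2f}" for lag in lags)
    return (f"Autocorrelation ({acf_text}). "
            f"Dominant period: {dominant_period(series):.1f} steps.")


# Toy series: a pure sinusoid with a period of 12 steps.
series = [math.sin(2 * math.pi * t / 12) for t in range(48)]
print(statistical_prompt(series))
```

In a framework like the one described, text of this kind would be embedded and combined with the raw series embeddings; how that combination is done is not specified in the abstract.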

2510.17991 2026-05-22 cs.LG cs.CV

Demystifying Transition Matching: When and Why It Can Beat Flow Matching

Jaihoon Kim, Rajarshi Saha, Minhyuk Sung, Youngsuk Park

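AI summary: This paper studies when and why Transition Matching (TM) can beat Flow Matching (FM). It proves that TM attains lower KL divergence for a unimodal Gaussian target, analyzes TM's advantage in locally unimodal regimes of Gaussian mixtures, and shows that TM is superior when the target variance is non-negligible.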

Comments Code: https://github.com/amazon-science/TransitionFlowMatching (AISTATS 2026)

English abstract

Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for a finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.
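
The covariance claim for deterministic FM can be checked numerically in one dimension. The sketch below covers only the FM side and rests on assumptions that may differ from the paper's setup: a standard normal source, a zero-mean Gaussian target, the linear interpolation path x_t = (1 - t) x_0 + t x_1 with independent coupling, and plain Euler steps. Under these assumptions the marginal velocity is linear in x, so every Euler step simply rescales the sample and the output variance has a closed form.

```python
def fm_velocity_coeff(t: float, target_var: float) -> float:
    """Coefficient a(t) of the marginal velocity u_t(x) = a(t) * x.

    Derived from E[x_1 - x_0 | x_t] for x_0 ~ N(0, 1), x_1 ~ N(0, target_var),
    which equals Cov(x_1 - x_0, x_t) / Var(x_t) * x_t.
    """
    return (t * target_var - (1.0 - t)) / ((1.0 - t) ** 2 + t ** 2 * target_var)


def euler_sample_variance(num_steps: int, target_var: float) -> float:
    """Variance of the samples after `num_steps` Euler steps from N(0, 1)."""
    h = 1.0 / num_steps
    scale = 1.0
    for k in range(num_steps):
        # One Euler step x <- x + h * a(t_k) * x multiplies the sample by this factor.
        scale *= 1.0 + h * fm_velocity_coeff(k * h, target_var)
    return scale ** 2


target_var = 4.0
for steps in (4, 16, 64, 2000):
    print(steps, round(euler_sample_variance(steps, target_var), 4))
```

With few steps the resulting variance falls well short of the target and approaches it as the step count grows, which matches the abstract's statement that deterministic FM underestimates the target covariance at a finite number of steps.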

2510.16590 2026-05-22 cs.LG cs.AI q-bio.BM

Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration

Alan Kai Hassen, Andrius Bernatavicius, Antonius P. A. Janssen, Mike Preuss, Gerard J. P. van Westen, Djork-Arné Clevert

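AI summary: This study proposes a framework for molecular reasoning with general-purpose large language models. It anchors chain-of-thought reasoning to the molecular structure through atom identifiers, requires no task-specific model training, and achieves high success rates on single-step retrosynthesis.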

Comments Alan Kai Hassen and Andrius Bernatavicius contributed equally to this work

English abstract

Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring task-specific model training. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a zero-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites ($\geq90\%$), named reaction classes ($\geq40\%$), and final reactants ($\geq74\%$). Ultimately, our work establishes a general blueprint for applying LLMs to challenges where molecular reasoning and molecular transformations are key, positioning atom-anchored LLMs as a powerful solution for data-scarce chemistry domains.

2510.13910 2026-05-22 cs.CL

RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li

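AI summary: This paper proposes RAGCap-Bench for fine-grained evaluation of intermediate-task capabilities in agentic retrieval-augmented generation systems. It analyzes the outputs of existing systems to identify common tasks and core capabilities, designs targeted evaluation questions, and shows experimentally that models with stronger intermediate capabilities achieve better overall performance.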

English abstract

Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.

2510.11339 2026-05-22 cs.LG cs.AI

Event-Aware Prompt Learning for Dynamic Graphs

Xingtong Yu, Ruijuan Liang, Renhe Jiang, Dongyuan Li, Yunxiao Zhao, Xinming Zhang, Yuan Fang

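AI summary: This paper proposes the EVP framework, which extracts historical events and introduces an event adaptation mechanism to strengthen dynamic graph learning models' use of historical event knowledge.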

Comments Under review

English abstract

Real-world graphs typically evolve via a series of events, modeling dynamic interactions between objects across various domains. For dynamic graph learning, dynamic graph neural networks (DGNNs) have emerged as popular solutions. Recently, prompt learning methods have been explored on dynamic graphs. However, existing methods generally focus on capturing the relationship between nodes and time, while overlooking the impact of historical events. In this paper, we propose EVP, an event-aware dynamic graph prompt learning framework that can serve as a plug-in to existing methods, enhancing their ability to leverage historical event knowledge. First, we extract a series of historical events for each node and introduce an event adaptation mechanism to align the fine-grained characteristics of these events with downstream tasks. Second, we propose an event aggregation mechanism to effectively integrate historical knowledge into node representations. Finally, we conduct extensive experiments on four public datasets to evaluate and analyze EVP.

2510.10129 2026-05-22 cs.LG cs.AI

Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates

Sreejeet Maity, Aritra Mitra

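AI summary: This paper studies learning the optimal policy in a discounted, infinite-horizon reinforcement learning setting with adversarially corrupted rewards. It develops a new robust variant of Q-learning and analyzes it under the challenging asynchronous sampling model with time-correlated data, proving that, despite corruption, the finite-time guarantees match existing bounds up to an additive term that scales with the fraction of corrupted samples. An information-theoretic lower bound shows that these guarantees are near-optimal. The algorithm is agnostic to the underlying reward distribution and provides the first finite-time robustness guarantees for asynchronous Q-learning. A key element of the analysis is a refined Azuma-Hoeffding inequality for almost-martingales, which may apply more broadly to the study of RL algorithms.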

Comments To appear at the 43rd International Conference on Machine Learning (ICML)

English abstract

We study the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting in the presence of adversarially corrupted rewards. To address this problem, we develop a novel robust variant of the $Q$-learning algorithm and analyze it under the challenging asynchronous sampling model with time-correlated data. Despite corruption, we prove that the finite-time guarantees of our approach match existing bounds, up to an additive term that scales with the fraction of corrupted samples. We also establish an information-theoretic lower bound, revealing that our guarantees are near-optimal. Notably, our algorithm is agnostic to the underlying reward distribution and provides the first finite-time robustness guarantees for asynchronous $Q$-learning. A key element of our analysis is a refined Azuma-Hoeffding inequality for almost-martingales, which may have broader applicability in the study of RL algorithms.
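
For orientation, the textbook (non-robust) asynchronous $Q$-learning update is shown below. It updates only the state-action pair visited along a single trajectory, which is what makes the data time-correlated. The abstract does not state the robust modification, so none is shown here; the step size $\alpha_t$ and discount factor $\gamma$ are notation introduced for this illustration.

```latex
Q_{t+1}(s_t, a_t) = (1 - \alpha_t)\, Q_t(s_t, a_t)
  + \alpha_t \Big( r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') \Big),
\qquad
Q_{t+1}(s, a) = Q_t(s, a) \quad \text{for all } (s, a) \neq (s_t, a_t).
```

In the corrupted setting described above, the observed reward $r_t$ in this update may have been altered by an adversary for some fraction of the samples.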
