arXivDaily: arXiv daily academic digest, updated Monday to Friday
2512.20538 2026-05-22 cs.CV

AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment

Anna Šárová Mikeštíková, Médéric Fourmy, Martin Cífka, Josef Sivic, Vladimir Petrik

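Affiliations Czech Institute of Informatics, Robotics and Cybernetics; Czech Technical University in Prague

AI summary This paper proposes AlignPose, a multi-view 6D pose estimation method that needs no object-specific training or symmetry annotation. A multi-view feature-metric refinement optimizes a single, consistent world-frame object pose. Experiments on six datasets show that it outperforms other published methods, especially on industrial datasets.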

Comments CVPR 2026

Abstract

Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose by minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on six datasets (YCB-V, T-LESS, HouseCat6D, ITODD-MV, IPD, XYZ-IBD) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.

2511.20785 2026-05-22 cs.CV

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing

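Affiliations MiroMind; NTU; HKUST(GZ); THU; LMMs-Lab

AI summary This paper proposes LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" through interleaved multimodal chain-of-tool-thought. It uses the LMM's inherent temporal grounding ability as a native video cropping tool to reduce hallucinations in long-video reasoning, and it introduces the VideoSIAH data suite for training and evaluation.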

Comments CVPR 2026

Abstract

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .
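
The global-to-local loop described in the LongVT abstract can be pictured with a small sketch. Everything in it is an assumption made for illustration: `sample_timestamps`, `answer_with_cropping`, the `model_step` callable, and its `("answer", text)` / `("crop", (start, end))` return convention are hypothetical and are not taken from the LongVT code base.

```python
def sample_timestamps(start, end, num_frames):
    """Return uniformly spaced timestamps (in seconds) covering [start, end]."""
    if num_frames < 1 or end < start:
        raise ValueError("need num_frames >= 1 and end >= start")
    if num_frames == 1:
        return [(start + end) / 2.0]
    step = (end - start) / (num_frames - 1)
    return [start + i * step for i in range(num_frames)]


def answer_with_cropping(model_step, question, duration, num_frames=8, max_rounds=4):
    """Skim the whole video first, then re-sample inside whatever clip the model asks for.

    model_step is a hypothetical callable standing in for the LMM. Given the question
    and the sampled timestamps it returns ("answer", text) or ("crop", (start, end)).
    """
    window = (0.0, duration)  # global view: the full video
    for _ in range(max_rounds):
        timestamps = sample_timestamps(window[0], window[1], num_frames)
        action, payload = model_step(question, timestamps)
        if action == "answer":
            return payload, window
        window = payload  # zoom in: the next round samples the same number of frames in a shorter clip
    return None, window  # round budget exhausted without an answer
```

Because the frame budget per round is fixed, each crop raises the temporal sampling density inside the selected clip, which is the effect the abstract describes as resampling finer-grained frames.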

2511.14220 2026-05-22 cs.LG cs.AI

Twice Sequential Monte Carlo for Tree Search

Yaniv Oren, Joery A. de Vries, Pascal R. van der Vaart, Matthijs T. J. Spaan, Wendelin Böhmer

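Affiliations Delft University of Technology

AI summary This paper proposes Twice Sequential Monte Carlo Tree Search (TSMCTS), which reduces estimator variance and mitigates path degeneracy. It outperforms the SMC baseline and a popular modern MCTS variant in discrete and continuous environments, and it scales favorably with sequential compute.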

Abstract

Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS as a policy improvement operator, scales favorably with sequential compute, reduces estimator variance and mitigates the effects of path degeneracy while retaining the properties that make SMC natural to parallelize.
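
Path degeneracy, which the abstract names as one obstacle to deep SMC search, shows up even in a toy particle system: after repeated resampling, most particles trace back to only a few first-step ancestors. The sketch below is generic SMC with multinomial resampling and random stand-in weights. It is not the TSMCTS algorithm, and `count_surviving_roots` is a hypothetical helper.

```python
import random


def count_surviving_roots(num_particles, depth, seed=0):
    """Track how many distinct first-step ancestors survive each resampling step."""
    rng = random.Random(seed)
    roots = list(range(num_particles))  # root ancestor index of every particle
    history = []
    for _ in range(depth):
        # Stand-in importance weights; a planner would derive these from rewards or values.
        weights = [rng.random() + 1e-12 for _ in range(num_particles)]
        # Multinomial resampling: draw parents with probability proportional to weight.
        parents = rng.choices(range(num_particles), weights=weights, k=num_particles)
        roots = [roots[p] for p in parents]
        history.append(len(set(roots)))
    return history


if __name__ == "__main__":
    print(count_surviving_roots(num_particles=32, depth=20))
```

The count can only stay the same or shrink, because every resampled particle inherits its root from a surviving parent. In a search setting this means that deeper search leaves fewer distinct first actions supporting the estimate at the root.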

2511.07820 2026-05-22 cs.RO cs.AI cs.CV cs.GR cs.SY eess.SY

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Fernando Castañeda, Sirui Chen, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Jinhyung Park, David Sami, Zi Wang, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan, Yuke Zhu

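Affiliations NVIDIA

AI summary This paper scales motion tracking along model capacity, data, and compute to obtain a generalist humanoid controller that produces natural, robust whole-body movements. It demonstrates both the scalability of motion tracking and its utility for downstream tasks.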

Comments Project page: https://nvlabs.github.io/SONIC/

Abstract

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

2511.02014 2026-05-22 cs.CV

Towards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images

Tuan Truong, Guillermo Jimenez Perez, Pedro Osorio, Matthias Lenga

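Affiliations QwenLM

AI summary This paper studies the use of large multimodal models (LMMs) to detect burned-in protected health information (PHI) in medical images. Benchmarking three prominent LMMs under two pipeline configurations, it finds that LMMs outperform conventional OCR models on OCR quality, but that this does not consistently improve overall PHI detection accuracy; the largest gains appear on test cases with complex imprint patterns. It also gives model selection recommendations for specific operational constraints and proposes a deployment strategy.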

Comments Accepted at EMBC 2026

Abstract

The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Models (LMMs) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed- and open-source LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMMs exhibit superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models like EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are clearly readable with sufficient contrast, and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results. Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.
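
The WER and CER figures quoted in the abstract are standard edit-distance metrics. As a minimal sketch, both can be computed from the Levenshtein distance. The function names and the example strings are illustrative and not from the paper, and whatever text normalization the authors apply before scoring is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete an element of ref
                            curr[j - 1] + 1,      # insert an element of hyp
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    truth = "PATIENT JOHN DOE 1980"   # made-up burned-in text
    ocr = "PATIENT JOHN D0E 1980"     # the OCR output confuses the letter O with the digit 0
    print(wer(truth, ocr), cer(truth, ocr))
```

In this made-up case a single wrong character costs one word out of four (WER 0.25) but only one character out of 21 (CER of about 0.048), which is why the two metrics are usually reported together.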

2510.17991 2026-05-22 cs.LG cs.CV

Demystifying Transition Matching: When and Why It Can Beat Flow Matching

Jaihoon Kim, Rajarshi Saha, Minhyuk Sung, Youngsuk Park

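Affiliations KAIST; Amazon Web Services

AI summary This paper studies when and why Transition Matching (TM) outperforms Flow Matching (FM). It proves that TM attains lower KL divergence for a unimodal Gaussian target, analyzes TM's advantage in locally unimodal regimes of Gaussian mixtures, and shows that TM is favored when the target variance is non-negligible.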

Comments Code: https://github.com/amazon-science/TransitionFlowMatching (AISTATS 2026)

Abstract

Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for a finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.
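
To see why underestimating the target covariance costs KL divergence, it helps to evaluate the closed-form KL between two scalar Gaussians. The sketch below is only an illustration of that formula: the variances 0.8 and 0.95 are made-up numbers, not results from the paper, and measuring KL(sampler || target) rather than the reverse direction is an assumption.

```python
import math


def kl_gaussian_1d(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for scalar Gaussians.

    Closed form: 0.5 * ( log(var_q / var_p) + (var_p + (mu_p - mu_q)**2) / var_q - 1 ).
    """
    if var_p <= 0 or var_q <= 0:
        raise ValueError("variances must be positive")
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


if __name__ == "__main__":
    target_var = 1.0
    # Two hypothetical samplers with the correct mean but a shrunken variance.
    for sampler_var in (0.8, 0.95):
        print(sampler_var, kl_gaussian_1d(0.0, sampler_var, 0.0, target_var))
```

With matching means the divergence is zero only when the variances agree, and it grows as the sampler's variance shrinks further below the target's.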

2510.16590 2026-05-22 cs.LG cs.AI q-bio.BM

Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration

Alan Kai Hassen, Andrius Bernatavicius, Antonius P. A. Janssen, Mike Preuss, Gerard J. P. van Westen, Djork-Arné Clevert

Affiliations Machine Learning Research; Pfizer Research and Development; Leiden Institute of Advanced Computer Science; Leiden University; Leiden Academic Centre for Drug Research; Leiden Institute of Chemistry

AI summary This study proposes a framework for molecular reasoning with general-purpose large language models. It anchors chain-of-thought reasoning to the molecular structure through atom identifiers, requires no task-specific model training, and achieves high success rates on single-step retrosynthesis.

Comments Alan Kai Hassen and Andrius Bernatavicius contributed equally to this work

Abstract

Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring task-specific model training. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a zero-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites ($\geq90\%$), named reaction classes ($\geq40\%$), and final reactants ($\geq74\%$). Ultimately, our work establishes a general blueprint for applying LLMs to challenges where molecular reasoning and molecular transformations are key, positioning atom-anchored LLMs as a powerful solution for data-scarce chemistry domains.

2510.13910 2026-05-22 cs.CL

RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li

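Affiliations National University of Singapore; The Chinese University of Hong Kong

AI summary This paper proposes RAGCap-Bench for fine-grained evaluation of intermediate tasks in agentic retrieval-augmented generation systems. It analyzes the outputs of existing systems to identify common tasks and core capabilities, designs targeted evaluation questions, and shows experimentally that models with stronger intermediate capabilities achieve better overall performance.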

Abstract

Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.

2510.11339 2026-05-22 cs.LG cs.AI

Event-Aware Prompt Learning for Dynamic Graphs

Xingtong Yu, Ruijuan Liang, Renhe Jiang, Dongyuan Li, Yunxiao Zhao, Xinming Zhang, Yuan Fang

Affiliations The Chinese University of Hong Kong; University of Science and Technology of China; The University of Tokyo; Shanxi University; Singapore Management University

AI summary This paper proposes the EVP framework, which extracts historical events and introduces an event adaptation mechanism to strengthen how dynamic graph learning models use historical event knowledge.

Comments Under review

Abstract

Real-world graphs typically evolve via a series of events, modeling dynamic interactions between objects across various domains. For dynamic graph learning, dynamic graph neural networks (DGNNs) have emerged as popular solutions. Recently, prompt learning methods have been explored on dynamic graphs. However, existing methods generally focus on capturing the relationship between nodes and time, while overlooking the impact of historical events. In this paper, we propose EVP, an event-aware dynamic graph prompt learning framework that can serve as a plug-in to existing methods, enhancing their ability to leverage historical event knowledge. First, we extract a series of historical events for each node and introduce an event adaptation mechanism to align the fine-grained characteristics of these events with downstream tasks. Second, we propose an event aggregation mechanism to effectively integrate historical knowledge into node representations. Finally, we conduct extensive experiments on four public datasets to evaluate and analyze EVP.

2510.10129 2026-05-22 cs.LG cs.AI

CacheClip: Accelerating RAG with Effective KV Cache Reuse

Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu

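Affiliations Intel Corporation

AI summary This paper proposes CacheClip, a framework that uses effective KV cache reuse to address the TTFT bottleneck in RAG systems while maintaining high generation quality.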

Abstract

Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs (the target model for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention, thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates four techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, (2) shared prefixes to eliminate redundant attention sinks, (3) a sliding-window grouping strategy to maintain local coherence during partial KV cache updates, and (4) a CPU-GPU hybrid design that offloads auxiliary model inference to idle CPU resources, avoiding additional GPU overhead. The recomputation ratio is adjustable, allowing users to flexibly balance efficiency and quality for different deployment requirements. Experiments show CacheClip retains up to 85.2% and 91.1% of full-attention performance on NIAH and LongBench, outperforming CacheBlend and APE by 16.1 and 12.8 points on NIAH, and by 4.5 and 4.2 points on LongBench (with recomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 3.33$\times$ in prefill time (with recomp% = 20%), providing a practical solution to the efficiency-quality trade-off in RAG systems.
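
The first of the four techniques, choosing which tokens to recompute under an adjustable ratio, can be sketched as a top-k selection over per-token importance scores. This is a simplified illustration under stated assumptions: `select_tokens_to_recompute` is a hypothetical name, the scores stand in for attention mass obtained from an auxiliary model, and the shared-prefix, sliding-window grouping, and CPU-GPU parts of CacheClip are not modeled.

```python
import math


def select_tokens_to_recompute(attention_scores, recompute_ratio=0.2):
    """Return sorted indices of the highest-scoring tokens, limited to a fraction of the input.

    attention_scores: one importance score per cached token (assumed to come from
    an auxiliary model's attention; here it is just a list of floats).
    """
    if not 0.0 <= recompute_ratio <= 1.0:
        raise ValueError("recompute_ratio must lie in [0, 1]")
    budget = math.ceil(recompute_ratio * len(attention_scores))
    ranked = sorted(range(len(attention_scores)),
                    key=lambda i: attention_scores[i],
                    reverse=True)
    return sorted(ranked[:budget])  # restore positional order for the recomputation pass


if __name__ == "__main__":
    scores = [0.01, 0.30, 0.02, 0.05, 0.25, 0.03, 0.04, 0.10, 0.15, 0.05]
    print(select_tokens_to_recompute(scores, recompute_ratio=0.2))  # [1, 4]
```

Raising `recompute_ratio` recomputes more of the cache, trading prefill speed for fidelity to full attention, which is the efficiency-quality dial the abstract describes.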

2510.06141 2026-05-22 cs.LG cs.MA math.OC

High-Probability Convergence Guarantees of Decentralized SGD

Aleksandar Armacki, Ali H. Sayed

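Affiliations STI, EPFL, Lausanne, Switzerland

AI summary This paper studies high-probability convergence of decentralized SGD under light-tailed noise. It proves that decentralized SGD converges in high probability under the same conditions on the cost as MSE convergence, and it provides order-optimal rates for non-convex and strongly convex costs along with a linear speed-up in the number of users.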

Comments 43 pages, 6 figures

Abstract

Convergence in high-probability (HP) has attracted increasing interest, due to implying exponentially decaying tail bounds and strong guarantees for individual runs of an algorithm. While many works study HP guarantees in centralized settings, much less is understood in the decentralized setup, where existing works require strong assumptions, like uniformly bounded gradients, or asymptotically vanishing noise. This results in a significant gap between the assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, and is also contrary to centralized settings, where it is known that $\mathtt{SGD}$ converges in HP under the same conditions on the cost function as needed for MSE convergence. Motivated by these observations, we study the HP convergence of Decentralized $\mathtt{SGD}$ ($\mathtt{DSGD}$) in the presence of light-tailed noise, providing several strong results. First, we show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing the restrictive assumptions used in prior works. Second, our sharp analysis yields order-optimal rates for both non-convex and strongly convex costs. Third, we establish a linear speed-up in the number of users, leading to matching or strictly better transient times than those obtained from MSE results, further underlining the tightness of our analysis. To the best of our knowledge, this is the first work that shows $\mathtt{DSGD}$ achieves a linear speed-up in the HP sense. Our relaxed assumptions and sharp rates stem from several technical results of independent interest, including a result on the variance-reduction effect of decentralized methods in the HP sense, as well as a novel bound on the moment-generating function of strongly convex costs, of interest even in centralized settings. Numerical experiments validate our theory.
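
For readers unfamiliar with the algorithm, one common form of the decentralized SGD update has each user average its iterate with its neighbours through a mixing matrix and then take a local noisy gradient step. The sketch below runs that update on scalar quadratic costs. The exact variant analyzed in the paper is not stated in the abstract, so the update order, the ring topology, the step size, and the name `dsgd_quadratic` are all assumptions made for illustration.

```python
import random


def dsgd_quadratic(targets, mixing, step_size, num_steps, noise_std, seed=0):
    """Run decentralized SGD on f_i(x) = 0.5 * (x - targets[i])**2, one scalar per user."""
    rng = random.Random(seed)
    n = len(targets)
    x = [0.0] * n  # local iterates, one per user
    for _ in range(num_steps):
        # Noisy local gradients: exact gradient (x_i - b_i) plus Gaussian, hence light-tailed, noise.
        grads = [x[i] - targets[i] + rng.gauss(0.0, noise_std) for i in range(n)]
        # Average with neighbours through the mixing matrix, then take a local gradient step.
        x = [
            sum(mixing[i][j] * x[j] for j in range(n)) - step_size * grads[i]
            for i in range(n)
        ]
    return x


# Ring of four users; every row and column of the mixing matrix sums to 1.
third = 1.0 / 3.0
RING = [
    [third, third, 0.0, third],
    [third, third, third, 0.0],
    [0.0, third, third, third],
    [third, 0.0, third, third],
]
TARGETS = [1.0, 2.0, 3.0, 4.0]  # the minimiser of the summed cost is their mean, 2.5

final = dsgd_quadratic(TARGETS, RING, step_size=0.05, num_steps=2000, noise_std=0.1)
network_average = sum(final) / len(final)
```

Because the mixing matrix is doubly stochastic, the network average follows an SGD recursion whose gradient noise is averaged over all users. That averaging is the intuition behind a speed-up in the number of users, although the high-probability analysis in the paper is far more involved than this toy run.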

2510.05094 2026-05-22 cs.CV

VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, Ziwei Liu

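Affiliations Eyeline Labs

AI summary This paper proposes VChain, an inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Generated keyframes guide sparse inference-time visual-state adaptation of a pre-trained video generator, improving the quality of generated videos.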

Comments ACL 2026 (Findings Paper), ICCV 2025 Workshop Outstanding Paper Award, Project page: https://eyeline-labs.github.io/VChain

Abstract

Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time visual-state adaptation of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.

2510.03271 2026-05-22 cs.LG cs.AI

Decision Potential Surface: A Theoretical and Practical Approximation of Large Language Model Decision Boundary

Zi Liang, Zhiyao Wu, Haoyang Shang, Yulin Jin, Qingqing Ye, Huadi Zheng, Peizhao Hu, Haibo Hu

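Affiliations The Hong Kong Polytechnic University; University of Macau; Shanghai Jiaotong University; Huawei; Rochester Institute of Technology; PolyU Research Centre for Privacy and Security Technologies in Future Smart Systems

AI summary This paper proposes the Decision Potential Surface (DPS), a new notion for analyzing the decision properties of large language models. Its K-DPS algorithm approximates the decision boundary from a finite number of samples, and the paper derives upper bounds on the approximation error, showing a trade-off between error and the number of samples.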

Comments Source code: https://github.com/liangzid/DPS

Abstract

The decision boundary, the subspace of inputs where a machine learning model assigns equal classification probabilities to two classes, is pivotal in revealing core model properties and interpreting behaviors. While analyzing the decision boundary of large language models (LLMs) has attracted increasing attention recently, constructing it for mainstream LLMs remains computationally infeasible due to the enormous sequence-level output spaces and the autoregressive nature of LLMs. To address this issue, in this paper we propose Decision Potential Surface (DPS), a new notion for analyzing the properties of LLM decisions. DPS is derived from the confidence in distinguishing different classes for each input, which naturally captures the potential of the decision boundary. We prove that the zero-height isohypse in DPS is equivalent to the decision boundary of an LLM, with enclosed regions representing decision regions. By leveraging DPS, for the first time in the literature, we propose a practical decision boundary approximation algorithm, namely K-DPS, which requires only K finite sequence samples to approximate an LLM's decision boundary with negligible error. We theoretically derive the upper bounds for the absolute error, expected error, and the error concentration between K-DPS and the ideal DPS, demonstrating that such errors can be traded off against the number of samples.

2510.00319 2026-05-22 cs.LG cs.AI

DecepChain: Inducing Deceptive Reasoning in Large Language Models

Wei Shen, Han Wang, Haoyu Li, Huan Zhang

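Affiliations University of Illinois Urbana-Champaign

AI summary This study examines whether large language models can generate plausible-looking but incorrect reasoning chains. It proposes DecepChain, which induces deceptive reasoning by amplifying the model's own hallucinations while keeping the reasoning outwardly plausible and the deception effective.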

Comments ICML 2026

Abstract

Large Language Models (LLMs) have demonstrated strong reasoning capability through their chains of thought (CoT), which are routinely used by humans to judge answer quality. This reliance creates a powerful yet fragile basis for trust. In this work, we study an underexplored phenomenon: whether LLMs could generate incorrect yet coherent CoTs that look plausible, while leaving no obvious traces of manipulation, closely resembling the reasoning exhibited in benign scenarios. To investigate this, we introduce DecepChain, a novel paradigm that induces deceptive reasoning in models, reasoning that appears benign while eventually yielding incorrect conclusions. At a high level, DecepChain exploits LLMs' own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts from the model itself. Then, it reinforces it via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a rule-based format reward to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, the deception ability brought by DecepChain achieves high effectiveness with minimal performance degradation on benign scenarios. Moreover, a careful evaluation shows that both LLMs and humans struggle to distinguish deceptive reasoning from benign reasoning, underscoring the stealthiness. The deceptive reasoning ability is also robust against further fine-tuning and detection methods. Left unaddressed, this stealthy failure mode can quietly corrupt LLM answers and undermine human trust in LLM reasoning, emphasizing the urgency for future research. Project page: https://decepchain.github.io/ .

2509.24517 2026-05-22 cs.LG

Physics Priors Offer Useful Accuracy-Carbon Trade-Offs in Spatio-Temporal Forecasting

物理先验为时空预测提供有用的精度-碳足迹权衡

Sophia N. Wilson, Jens Hesselbjerg Christensen, Raghavendra Selvan

发表机构 * Department of Computer Science, University of Copenhagen, Denmark(丹麦哥本哈根大学计算机科学系) Niels Bohr Institute, University of Copenhagen, Denmark(丹麦哥本哈根大学尼尔斯·玻尔研究所)

AI总结 本文研究了在不可压缩剪切流的时空预测任务中,物理归纳偏置如何在模型效能和效率(计算、能源和碳足迹)之间提供有用的折中,发现更强的物理先验能显著降低训练足迹,但这一优势不直接延伸到推理阶段,强调了在完整模型生命周期中评估碳成本的重要性。

Comments Source code available at https://github.com/sophiawilson18/shear-flow

详情
AI中文摘要

现代深度学习方法的发展主要受提高模型效能(准确性指标)的推动。这种对效能的单一关注导致了需要大量计算资源的大规模模型的发展,从而在模型生命周期中产生显著的能源消耗和相应的碳足迹。在本工作中,我们探讨了物理归纳偏置如何在模型效能和效率(计算、能源和碳)之间提供有用的折中。我们研究了具有强、弱和无物理归纳偏置的模型,用于不可压缩剪切流的时空预测任务,该任务由纳维-斯托克斯方程所支配。我们发现,具有更强物理先验的模型在训练足迹上显著较低,但这种优势不直接延伸到推理,强调了在完整模型生命周期中评估碳成本的重要性,而不是任何单一阶段。我们主张模型效率,与模型效能一样,应成为驱动机器学习模型开发和部署的核心考虑因素。

英文摘要

Development of modern deep learning methods has been driven primarily by the push for improving model efficacy (accuracy metrics). This sole focus on efficacy has steered development of large-scale models that require massive computational resources, and results in considerable energy consumption and corresponding carbon footprint across the model lifecycle. In this work, we explore how physics inductive biases can offer useful trade-offs between model efficacy and model efficiency (compute, energy, and carbon). We study models with strong, weak, and no physics-inductive biases for spatio-temporal forecasting of incompressible shear flow, a task governed by the Navier-Stokes equations. We find that models with stronger physics priors achieve substantially lower training footprints, but this advantage does not straightforwardly extend to inference, highlighting the importance of evaluating carbon costs across the full model lifecycle rather than any single stage. We argue that model efficiency, along with model efficacy, should become a core consideration driving machine learning model development and deployment.

2509.23582 2026-05-22 cs.CV

RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization

RobuQ: 通过鲁棒激活量化推动DiTs至W1.58A2

Kaicheng Yang, Xun Zhang, Haotong Qin, Yucheng Lin, Kaisen Yang, Xianglong Yan, Yulun Zhang

发表机构 * Shanghai Jiao Tong University, Shanghai, China(上海交通大学) Tsinghua University, Beijing, China(清华大学)

AI总结 本文提出RobuQ框架,通过鲁棒的激活量化技术解决DiTs在极低比特下的部署问题,在低于4比特的量化配置下取得了最先进的性能,并首次在大规模数据集上以平均2比特的激活量化实现了稳定且具有竞争力的图像生成。

Comments Accepted by ICML2026

详情
AI中文摘要

扩散Transformer(DiTs)近来已成为图像生成的强大骨干网络,展现出优于U-Net架构的可扩展性和性能。然而,其实际部署受到高昂的计算和内存成本的阻碍。尽管量化感知训练(QAT)在U-Net上已显示出前景,但将其应用于DiTs面临独特的挑战,这主要源于激活的敏感性和分布复杂性。在本文中,我们指出激活量化是将DiTs推向极低比特设置的主要瓶颈。为此,我们提出了一个面向DiTs的系统性QAT框架,命名为RobuQ。我们首先建立了一个强大的三值权重(W1.58A4)DiT基线。在此基础上,我们提出RobustQuantizer以实现鲁棒的激活量化。我们的理论分析表明,Hadamard变换可以将未知的逐token分布转换为逐token正态分布,为该方法提供了坚实的基础。此外,我们提出AMPN,即首个面向DiTs的仅激活混合精度网络流程。该方法在整个网络中使用三值权重,同时为每一层分配不同的激活精度,以消除信息瓶颈。通过在无条件和有条件图像生成上的大量实验,我们的RobuQ框架在低于4比特的量化配置下取得了DiT量化的最先进性能。据我们所知,RobuQ是首个在ImageNet-1K等大规模数据集上、以平均2比特的激活量化实现稳定且具有竞争力的图像生成的方法。代码和模型将在 https://github.com/racoonykc/RobuQ 上提供。

英文摘要

Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for image generation, demonstrating superior scalability and performance over U-Net architectures. However, their practical deployment is hindered by substantial computational and memory costs. While Quantization-Aware Training (QAT) has shown promise for U-Nets, its application to DiTs faces unique challenges, primarily due to the sensitivity and distributional complexity of activations. In this work, we identify activation quantization as the primary bottleneck for pushing DiTs to extremely low-bit settings. To address this, we propose a systematic QAT framework for DiTs, named RobuQ. We start by establishing a strong ternary weight (W1.58A4) DiT baseline. Building upon this, we propose RobustQuantizer to achieve robust activation quantization. Our theoretical analyses show that the Hadamard transform can convert unknown per-token distributions into per-token normal distributions, providing a strong foundation for this method. Furthermore, we propose AMPN, the first Activation-only Mixed-Precision Network pipeline for DiTs. This method applies ternary weights across the entire network while allocating different activation precisions to each layer to eliminate information bottlenecks. Through extensive experiments on unconditional and conditional image generation, our RobuQ framework achieves state-of-the-art performance for DiT quantization in sub-4-bit quantization configuration. To the best of our knowledge, RobuQ is the first achieving stable and competitive image generation on large datasets like ImageNet-1K with activations quantized to average 2 bits. The code and models will be available at https://github.com/racoonykc/RobuQ .
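Two ingredients named in the abstract, ternary weights and the Hadamard transform, can be sketched in a few lines. This is an illustration only, not RobuQ itself: the absmean scaling rule and both function names are assumptions. "W1.58" reflects that a ternary value carries log2(3) ≈ 1.58 bits.

```python
import math

def ternary_quantize(weights):
    """Map weights to codes in {-1, 0, +1} with one shared scale.
    The absmean scale is an assumed choice, not taken from the paper."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b   # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in x]

bits_per_ternary_weight = math.log2(3)   # about 1.585, hence "W1.58"
spread = fwht([1.0, 0.0, 0.0, 0.0])      # a single outlier is spread evenly
```

The transform spreads an isolated outlier across all coordinates, which gives an intuition for why rotated activations are easier to quantize at very low bit widths.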

2509.22769 2026-05-22 cs.CV

PartCo: Part-Level Correspondence Priors Enhance Category Discovery

PartCo: 基于部分级对应先验的类别发现增强

Fernando Julio Cendra, Kai Han

发表机构 * Visual AI Lab, School of Computing and Data Science, The University of Hong Kong, Hong Kong(视觉人工智能实验室,计算与数据科学学院,香港大学,香港)

AI总结 PartCo通过引入部分级对应先验,提升了类别发现的性能,通过捕捉更细粒度的语义结构,改进了现有方法在区分密切相关类别方面的表现。

Comments ICML 2026, Project page: https://visual-ai.github.io/partco

详情
AI中文摘要

通用类别发现(GCD)旨在通过利用已知类别的标注示例,在未标记数据中识别已知和新类别。现有GCD方法主要依赖语义标签和全局图像表示,往往忽视了对区分密切相关类别至关重要的细节部分级线索。在本文中,我们引入了PartCo,即部分级对应先验,一种新的框架,通过整合部分级视觉特征对应关系来增强类别发现。通过利用部分级关系,PartCo捕捉到更细粒度的语义结构,从而更精确地理解类别关系。重要的是,PartCo能够无缝集成到现有GCD方法中,而无需进行显著修改。我们在多个基准数据集上的广泛实验表明,PartCo显著提高了当前GCD方法的性能,通过弥合语义标签与部分级视觉组成之间的差距,从而为GCD设定了新的基准。

英文摘要

Generalized Category Discovery (GCD) aims to identify both known and novel categories within unlabeled data by leveraging a set of labeled examples from known categories. Existing GCD methods primarily depend on semantic labels and global image representations, often overlooking the detailed part-level cues that are crucial for distinguishing closely related categories. In this paper, we introduce PartCo, short for Part-Level Correspondence Prior, a novel framework that enhances category discovery by incorporating part-level visual feature correspondences. By leveraging part-level relationships, PartCo captures finer-grained semantic structures, enabling a more nuanced understanding of category relationships. Importantly, PartCo seamlessly integrates with existing GCD methods without requiring significant modifications. Our extensive experiments on multiple benchmark datasets demonstrate that PartCo significantly improves the performance of current GCD approaches, outperforming most existing methods by bridging the gap between semantic labels and part-level visual compositions, thereby setting new benchmarks for GCD.

2509.08933 2026-05-22 cs.LG cs.SY eess.SY math.OC

Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates

具有近最优速率的容错异步Q学习

Sreejeet Maity, Aritra Mitra

发表机构 * Electrical and Computer Engineering, North Carolina State University, Raleigh, NC, USA(北卡罗来纳州立大学电气与计算机工程系)

AI总结 本文研究在奖励受到对抗性损坏的情况下,于折扣无限时域强化学习设置中学习最优策略的问题,提出一种新的鲁棒Q学习变体,并在数据具有时间相关性的异步采样模型下给出有限时间保证:该保证与现有界限一致,仅多出一个随损坏样本比例增长的加性项;信息论下界表明该保证近乎最优。

Comments To appear at the 43rd International Conference on Machine Learning (ICML)

详情
AI中文摘要

我们研究了在奖励受到对抗性损坏的情况下,于折扣无限时域强化学习(RL)设置中学习最优策略的问题。为了解决这个问题,我们提出了一种新的鲁棒Q学习变体,并在数据具有时间相关性的、具有挑战性的异步采样模型下对其进行分析。尽管存在损坏,我们证明了该方法的有限时间保证与现有界限一致,仅多出一个随损坏样本比例增长的加性项。我们还建立了信息论下界,表明我们的保证近乎最优。值得注意的是,我们的算法不依赖于底层奖励分布,并为异步Q学习提供了首个有限时间鲁棒性保证。分析中的一个关键要素是针对近鞅(almost-martingales)的改进Azuma-Hoeffding不等式,它可能在强化学习算法的研究中有更广泛的适用性。

英文摘要

We study the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting in the presence of adversarially corrupted rewards. To address this problem, we develop a novel robust variant of the \(Q\)-learning algorithm and analyze it under the challenging asynchronous sampling model with time-correlated data. Despite corruption, we prove that the finite-time guarantees of our approach match existing bounds, up to an additive term that scales with the fraction of corrupted samples. We also establish an information-theoretic lower bound, revealing that our guarantees are near-optimal. Notably, our algorithm is agnostic to the underlying reward distribution and provides the first finite-time robustness guarantees for asynchronous \(Q\)-learning. A key element of our analysis is a refined Azuma-Hoeffding inequality for almost-martingales, which may have broader applicability in the study of RL algorithms.
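For readers unfamiliar with the setting, the asynchronous update can be sketched as follows. The abstract does not describe the paper's robust estimator, so the reward clipping used here is only a generic stand-in, and `async_q_update` and its `clip` parameter are hypothetical names. What is faithful to the setting is that one state-action entry is updated per step along a single trajectory.

```python
def async_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, clip=1.0):
    """One asynchronous Q-learning step: only the visited entry Q[s][a] changes.
    Clipping the observed reward to [-clip, clip] is an assumed, generic way to
    limit the damage of a corrupted reward; it is NOT the paper's algorithm."""
    r_safe = max(-clip, min(clip, r))
    target = r_safe + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q

# Two states, two actions; an adversarially inflated reward of 100 is clipped to 1.
Q = [[0.0, 0.0], [0.0, 0.0]]
async_q_update(Q, s=0, a=1, r=100.0, s_next=1)
```

Without the clipping, a single corrupted sample would move `Q[0][1]` by 10 instead of 0.1, which illustrates why some form of robustification is needed before finite-time guarantees can hold.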

2508.11836 2026-05-22 cs.AI

Finite Automata Extraction: Low-data World Model Learning as Programs from Gameplay Video

有限自动机提取:从游戏录像中以程序形式学习低数据世界模型

Dave Goel, Matthew Guzdial, Anurag Sarkar

发表机构 * Department of Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta(计算科学系,阿尔伯塔机器智能研究所(Amii),阿尔伯塔大学)

AI总结 本文提出有限自动机提取(FAE)方法,从游戏录像中学习神经符号世界模型,并将其表示为一种新的领域特定语言(DSL)Retro Coder中的程序;与以往的世界模型方法相比,FAE对环境的建模更精确,与以往基于DSL的方法相比,生成的代码更通用。

详情
AI中文摘要

世界模型被定义为对环境的压缩的、习得的时空表示。这种习得的表示通常是神经网络,这使得已学习环境动态的迁移以及可解释性成为挑战。在本文中,我们提出了有限自动机提取(FAE)方法,它从游戏录像中学习神经符号世界模型,并将其表示为一种新的领域特定语言(DSL)Retro Coder中的程序。与以往的世界模型方法相比,FAE学到的环境模型更精确;与以往基于DSL的方法相比,它学到的代码更通用。

英文摘要

World models are defined as a compressed spatial and temporal learned representation of an environment. The learned representation is typically a neural network, making transfer of the learned environment dynamics and explainability a challenge. In this paper, we propose an approach, Finite Automata Extraction (FAE), that learns a neuro-symbolic world model from gameplay video represented as programs in a novel domain-specific language (DSL): Retro Coder. Compared to prior world model approaches, FAE learns a more precise model of the environment and more general code than prior DSL-based approaches.

2508.02127 2026-05-22 cs.CV

Enhancing Event-based Object Detection with Monocular Normal Maps

通过单目法线图增强基于事件的目标检测

Mingjie Liu, Hanqing Liu, Luoping Cui, Chuang Zhu

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications(人工智能学院,北京邮电大学)

AI总结 本文提出NRE-Net框架,结合法线图的结构先验、RGB图像的外观上下文和事件的高频动态,通过自适应双流融合模块和事件模态感知融合模块提升自动驾驶中复杂光照下的目标检测性能。

详情
AI中文摘要

自动驾驶中的目标检测常常受到复杂光照条件的影响。虽然事件相机提供了一种稳健的解决方案,但它们容易受到突然的对比度变化(如反射)的影响,这类变化往往会触发密集且具有误导性的事件信号。为了解决这个问题,我们利用由RGB图像得到的表面法线图作为显式的几何约束。关键在于,即使RGB图像质量退化,法线图仍保留低频的结构先验,能有效辅助基于事件的检测。因此,我们提出了NRE-Net,一个三模态框架,它整合了来自表面法线图的结构先验、来自RGB图像的外观上下文以及来自事件的高频动态。自适应双流融合模块(ADFM)首先对齐几何和外观线索,随后事件模态感知融合模块(EAFM)选择性地整合事件动态。在DSEC-Det-sub和PKU-DAVIS-SOD上的大量评估表明,引入几何先验相比双模态基线在AP50上额外提升3.0%,同时我们的方法始终优于SFNet(+2.7%)和SODFormer(+7.1%)等融合方法。

英文摘要

Object detection in autonomous driving is frequently compromised by complex illumination. While event cameras offer a robust solution, they are susceptible to sudden contrast changes such as reflections which often trigger dense, misleading event signals. To overcome this, we leverage RGB-derived surface normal maps as explicit geometric constraints. Crucially, even when RGB degrades, they preserve low-frequency structural priors that effectively assist in event-based detection. Consequently, we present NRE-Net, a trimodal framework that integrates structural priors from surface Normal maps, appearance context from RGB images, and high-frequency dynamics from Events. The Adaptive Dual-stream Fusion Module (ADFM) first aligns geometric and appearance cues, followed by the Event-modality Aware Fusion Module (EAFM) which selectively integrates event dynamics. Extensive evaluations on DSEC-Det-sub and PKU-DAVIS-SOD demonstrate that incorporating geometric priors yields an additional 3.0% AP50 gain over dual-modal baselines, while our approach consistently outperforms fusion methods such as SFNet (+2.7%) and SODFormer (+7.1%).

2507.23773 2026-05-22 cs.AI cs.CL cs.LG cs.RO

General Agentic Planning Through Simulative Reasoning with World Models

通过基于世界模型的模拟推理实现通用智能体规划

Mingkai Deng, Jinyu Hou, Zhiting Hu, Eric Xing

发表机构 * Institute of Foundation Models (IFM)(基础模型研究所) Carnegie Mellon University(卡内基梅隆大学) UC San Diego(加州大学圣迭戈分校)

AI总结 本文提出通过模拟推理实现通用代理规划,利用世界模型进行未来状态预测,提升决策能力,通过SiRA架构在不同任务中取得更高任务完成率。

Comments Winner of Berkeley LLM Agents Hackathon (Fundamentals Track); code available at https://github.com/sailing-lab/sira

详情
AI中文摘要

规划意味着什么?当前的智能体系统,无论是脚手架式工作流还是端到端策略,都依赖于反应式决策:通过固定流程选择下一步动作,至多辅以不加区分的自适应计算(例如思维链),而缺乏对未来结果的显式建模。这限制了通用性,因为每个新任务都需要重新设计,而不是迁移共享的推理能力。相比之下,人类通过在内部世界模型中对候选动作的后果进行心理模拟来做规划,这种能力被称为模拟推理(系统II),它支持在不同情境中灵活的、目标导向的行为。我们主张,借助世界模型进行模拟推理为智能体系统提供了一种通用的规划机制,它将决策建立在预测的未来状态之上,而不是模式匹配式的响应之上,从而优于反应式策略(系统I)。为了验证这一点,我们引入了SiRA(Simulative Reasoning Architecture,模拟推理架构),这是一种目标导向的架构,它利用基于LLM的世界模型和自然语言信念状态来实现模拟推理,同时保持模型无关性。我们在网页浏览器环境中对三类性质不同的任务进行了评估:受约束的导航、多跳信息聚合和通用指令跟随。在所有类别中,与相匹配的反应式基线相比,模拟推理的任务完成率最多提高124%;与一个有代表性的开放网络智能体相比,受约束导航的成功率从0%提高到32.2%。在不同任务类型上持续存在的优势表明,这种收益源于可泛化的反事实评估,而不是针对特定任务的调优。

英文摘要

What does it mean to plan? Current agentic systems, whether scaffolded workflows or end-to-end policies, rely on reactive decision-making: selecting the next action via a fixed procedure with at most undifferentiated adaptive computation (e.g., chain-of-thought) lacking explicit modeling of future outcomes. This limits generalizability, as each new task demands re-engineering rather than transfer of shared reasoning capacity. Humans, by contrast, plan by mentally simulating consequences of candidate actions within an internal world model, a capacity known as simulative reasoning (System II) that supports flexible, goal-directed behavior across diverse contexts. We argue that simulative reasoning through a world model provides a general-purpose planning mechanism for agentic systems, improving upon reactive policies (System I) by grounding decisions in predicted future states rather than pattern-matched responses. To verify this, we introduce SiRA (Simulative Reasoning Architecture), a goal-oriented architecture instantiating simulative reasoning using an LLM-based world model with natural-language belief states, while remaining model-agnostic. We evaluate across three qualitatively distinct task categories: constrained navigation, multi-hop information aggregation, and general instruction following, in a web-browser environment. Across all categories, simulative reasoning achieves up to 124% higher task completion rates than a matched reactive baseline, and increases constrained navigation success from 0% to 32.2% compared to a representative open-web agent. The persistent advantage across distinct task types suggests the benefit stems from generalizable counterfactual evaluation rather than task-specific tuning.

2506.22316 2026-05-22 cs.CL

Evaluating Scoring Bias in LLM-as-a-Judge

评估LLM作为裁判的评分偏见

Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu

发表机构 * Ant Group(蚂蚁集团)

AI总结 本文研究了LLM作为裁判在评分任务中的偏见问题,提出了三种新的评分偏见类型,并开发了一个框架来量化这些偏见,以改进评分提示设计。

Comments Accepted by DASFAA 2026

详情
AI中文摘要

"LLM-as-a-Judge"范式通过大型语言模型(LLMs)作为自动评估者,是LLM发展中的关键部分,为复杂任务提供可扩展的反馈。然而,这些裁判的可靠性受到多种偏见的影响。现有研究主要集中在比较性评估中的偏见。相比之下,基于评分的评估(分配绝对分数,常用于工业应用)研究较少。为填补这一空白,我们进行了首次专门的评分偏见评估。我们从评分提示本身而非评估目标的偏见出发。我们正式定义了评分偏见,并识别了三种新的偏见类型:评分标准顺序偏见、评分ID偏见和参考答案评分偏见。我们提出了一种全面的框架来量化这些偏见,包含多方面的度量指标和自动数据合成管道来创建定制的评估语料库。我们的实验实证地证明了即使最先进的LLMs也受这些显著评分偏见的影响。我们的分析为设计更稳健的评分提示和缓解这些新发现的偏见提供了可行的见解。

英文摘要

The "LLM-as-a-Judge" paradigm, using Large Language Models (LLMs) as automated evaluators, is pivotal to LLM development, offering scalable feedback for complex tasks. However, the reliability of these judges is compromised by various biases. Existing research has heavily concentrated on biases in comparative evaluations. In contrast, scoring-based evaluations, which assign an absolute score and are often more practical in industrial applications, remain under-investigated. To address this gap, we undertake the first dedicated examination of scoring bias in LLM judges. We shift the focus from biases tied to the evaluation targets to those originating from the scoring prompt itself. We formally define scoring bias and identify three novel, previously unstudied types: rubric order bias, score ID bias, and reference answer score bias. We propose a comprehensive framework to quantify these biases, featuring a suite of multi-faceted metrics and an automatic data synthesis pipeline to create a tailored evaluation corpus. Our experiments empirically demonstrate that even the most advanced LLMs suffer from these substantial scoring biases. Our analysis yields actionable insights for designing more robust scoring prompts and mitigating these newly identified biases.

2506.04708 2026-05-22 cs.CL

Accelerated Test-Time Scaling with Model-Free Speculative Sampling

基于无模型推测采样的测试时扩展加速

Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

发表机构 * KAIST(韩国科学技术院) Amazon AGI(亚马逊人工智能实验室) AirSignal

AI总结 本文提出STAND,一种无需模型的推测解码方法,通过利用推理轨迹中的冗余性,显著提升推理效率而不牺牲准确性,经多个模型和任务评估,STAND在保持准确性的同时将推理延迟降低了60-65%。

Comments EMNLP 2025 Oral

详情
AI中文摘要

语言模型借助best-of-N采样和树搜索等测试时扩展技术,在推理任务中展现了出色的能力。然而,这些方法通常需要大量的计算资源,导致性能与效率之间的关键权衡。我们引入STAND(STochastic Adaptive N-gram Drafting),一种新颖的无模型推测解码方法,它利用推理轨迹中固有的冗余性,在不牺牲准确性的前提下实现显著加速。我们的分析显示,推理路径经常复用相似的推理模式,因此无需单独的草稿模型即可进行高效的无模型token预测。STAND引入随机起草,通过一个内存高效的、基于logit的N-gram模块保留概率信息,并结合优化的Gumbel-Top-K采样和数据驱动的树构建,显著提高了token接受率。在多个模型和推理任务(AIME-2024、GPQA-Diamond和LiveCodeBench)上的广泛评估表明,与标准自回归解码相比,STAND将推理延迟降低了60-65%,同时保持准确性。此外,STAND在单轨迹解码、批量解码和测试时树搜索等多种推理模式下,均始终优于最先进的推测解码方法。作为一种无模型方法,STAND无需额外训练即可应用于任何现有语言模型,是加速语言模型推理的强大即插即用方案。

英文摘要

Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
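The drafting idea can be sketched under simplifying assumptions. The code below is not STAND: it stores raw counts rather than logits, builds no draft tree, and omits the verification step in which the target model accepts or rejects drafted tokens. All function names are hypothetical. It shows an N-gram table built from earlier tokens and stochastic drafting with the Gumbel-top-k trick, which samples k distinct candidates without replacement.

```python
import math
import random
from collections import Counter, defaultdict

def build_ngram_table(tokens, n=2):
    """Count which token followed each n-token context in earlier text."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n):
        table[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return table

def gumbel_top_k(logits, k, rng=random):
    """Add independent Gumbel(0, 1) noise to each logit and keep the k largest,
    which samples k distinct indices without replacement."""
    keys = []
    for idx, logit in enumerate(logits):
        u = rng.random() or 1e-12            # guard against log(0)
        keys.append((logit - math.log(-math.log(u)), idx))
    keys.sort(reverse=True)
    return [idx for _, idx in keys[:k]]

def draft_next_tokens(table, context, k, rng=random):
    """Propose up to k candidate next tokens for a context, with no draft model."""
    counts = table.get(tuple(context))
    if not counts:
        return []
    candidates = list(counts)
    logits = [math.log(counts[c]) for c in candidates]   # log-counts as logits
    picked = gumbel_top_k(logits, min(k, len(candidates)), rng)
    return [candidates[i] for i in picked]

history = "a b c a b d a b c".split()
table = build_ngram_table(history, n=2)
drafts = draft_next_tokens(table, ["a", "b"], k=2, rng=random.Random(0))
```

Because reasoning traces tend to repeat patterns, a lookup table of this kind can propose plausible continuations at almost no cost; the target model then only needs to verify them.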

2505.17123 2026-05-22 cs.CL

MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

MTR-Bench:多轮推理评估的综合性基准

Xiaoyuan Li, Keqin Bao, Yubo Ma, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu

发表机构 * University of Science and Technology of China(中国科学技术大学) Alibaba Group(阿里巴巴集团) National University of Singapore(新加坡国立大学)

AI总结 本文提出MTR-Bench,一个包含4类、40个任务和3600个实例的综合性基准,用于评估大型语言模型的多轮推理能力,通过自动化框架实现大规模评估,并揭示了当前先进推理模型在多轮交互任务中的不足。

Comments ACL 2026 Main Conference

详情
AI中文摘要

近年来,大型语言模型(LLMs)在复杂推理任务中展现出有前景的结果。然而,当前的评估主要集中在单轮推理场景,忽略了交互性任务。我们归因于缺乏全面的数据集和可扩展的自动评估协议。为了填补这些空白,我们提出了MTR-Bench用于LLM的多轮推理评估。MTR-Bench包含4类、40个任务和3600个实例,覆盖了多样的推理能力、细粒度难度层次以及需要与环境进行多轮交互的任务。此外,MTR-Bench具备完全自动化的框架,涵盖了数据集构建和模型评估,使大规模评估成为可能而无需人工干预。广泛实验表明,即使是最先进的推理模型在多轮交互推理任务中也显得不足。对这些结果的进一步分析为未来交互式人工智能系统的研究提供了有价值的见解。

英文摘要

Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute it to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for LLMs' Multi-Turn Reasoning evaluation. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities, fine-grained difficulty granularity, and necessitates multi-turn interactions with the environments. Moreover, MTR-Bench features fully-automated framework spanning both dataset constructions and model evaluations, which enables scalable assessment without human interventions. Extensive experiments reveal that even the cutting-edge reasoning models fall short of multi-turn, interactive reasoning tasks. And the further analysis upon these results brings valuable insights for future research in interactive AI systems.

2505.16416 2026-05-22 cs.CV cs.AI

Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

Circle-RoPE: 用于大视觉-语言模型的锥形解耦旋转位置嵌入

Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han

发表机构 * Huawei Noah's Ark Lab(华为诺亚方舟实验室) City University of Hong Kong(香港城市大学) University of Sydney(悉尼大学) State Key Lab of General AI, School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院通用人工智能全国重点实验室)

AI总结 本文提出Circle-RoPE,通过将图像标记坐标映射到与文本位置轴正交的圆环上,实现跨模态位置解耦,同时保留图像内部空间结构,并通过交替几何编码增强跨模态位置解耦和细粒度图像空间结构保留。

Comments Accepted at ICML 2026

详情
AI中文摘要

旋转位置嵌入(RoPE)在大型语言模型中被广泛采用,但应用于视觉-语言模型(VLMs)时,它会将文本和图像的位置索引耦合在一起,并可能引入虚假的跨模态相对位置偏差。我们提出Per-Token Distance(PTD)来量化跨模态位置解耦的程度,并证明PTD = 0是消除RoPE所引起的几何注意力偏差的充分条件。基于这一准则,我们引入Circle-RoPE,将2D图像token坐标重新映射到与文本位置轴正交的圆环上,得到一种锥形几何结构:每个文本token到所有图像token的距离相等,同时保留图像内部的空间结构。我们进一步提出交替几何编码(AGE),通过在不同层之间交替使用Circle-RoPE的解耦几何与标准RoPE的网格先验,来结合互补的几何先验。这种设计在保留细粒度图像内部空间结构的同时,实现了跨模态位置解耦。在多种VLM骨干网络和多模态基准上的实验显示,空间定位和视觉推理能力均获得了一致的提升。代码可在 https://github.com/lose4578/CircleRoPE 获得。

英文摘要

Rotary Position Embedding (RoPE) is widely adopted in large language models, but when applied to vision-language models (VLMs) it couples text and image position indices and can introduce spurious cross-modal relative-position bias. We propose Per-Token Distance (PTD) to quantify cross-modal positional disentanglement, and prove that PTD = 0 is a sufficient condition to eliminate the geometric attention bias induced by RoPE. Guided by this criterion, we introduce Circle-RoPE, which remaps 2D image-token coordinates onto an annulus orthogonal to the text position axis, yielding a cone-like geometry where each text token is equidistant to all image tokens while preserving intra-image spatial structure. We further propose Alternating Geometry Encoding (AGE) to combine complementary geometric priors by alternating the decoupled geometry of Circle-RoPE and the grid-based prior of standard RoPE across layers. This design enables cross-modal positional disentanglement while preserving fine-grained intra-image spatial structure. Experiments on diverse VLM backbones and multimodal benchmarks show consistent gains in spatial grounding and visual reasoning. The code is available at https://github.com/lose4578/CircleRoPE.
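The cone-like geometry can be checked numerically with a toy layout. The sketch below only illustrates the equidistance property. Placing image tokens evenly on a circle is an assumption made for simplicity, the function names are hypothetical, and the paper's actual mapping from the 2D image grid is not reproduced here.

```python
import math

def circle_positions(num_image_tokens, radius=1.0):
    """Place image tokens on a circle in the x-y plane (assumed even spacing)."""
    return [
        (radius * math.cos(2 * math.pi * i / num_image_tokens),
         radius * math.sin(2 * math.pi * i / num_image_tokens),
         0.0)
        for i in range(num_image_tokens)
    ]

def text_position(t):
    """Text tokens lie on the z axis, orthogonal to the plane of the circle."""
    return (0.0, 0.0, float(t))

def distances_to_image_tokens(t, image_points):
    """Euclidean distance from the text token at index t to every image token."""
    return [math.dist(text_position(t), p) for p in image_points]

image_points = circle_positions(8, radius=2.0)
dists = distances_to_image_tokens(3, image_points)
# Every distance equals sqrt(radius**2 + t**2), so no image token is favored.
```

Since the text axis passes through the center of the circle at a right angle, each text token sits at the apex of a cone whose base is the circle of image tokens, which is where the "cone-like" description comes from.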

2505.05406 2026-05-22 cs.CL

Frame In, Frame Out: Measuring Framing Bias in LLM-Generated News Summaries

框入框出:衡量LLM生成新闻摘要中的框架偏差

Valeria Pastorino, Nafise Sadat Moosavi

发表机构 * Department of Computer Science, University of Sheffield (UK)(谢菲尔德大学计算机科学系(英国))

AI总结 本文提出FIFO基准测试,用于衡量LLM生成的新闻摘要中的框架存在性,发现LLM生成的摘要在科学和公共卫生领域显示出较高的框架率,表明框架是摘要质量的一个被忽视但重要的维度。

Comments Accepted to The 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026) co-located with ACL 2026

详情
AI中文摘要

新闻标题和摘要通过选择性的强调和省略来影响人们对事件的解读,这种现象通常被称为框架(framing)。大型语言模型如今经常被用于生成此类内容,但现有的评估框架大多忽略了这一维度。我们提出Frame In, Frame Out(FIFO),这是首个用于衡量LLM生成的新闻摘要中是否存在框架的大规模基准,它建立在广泛使用的XSum数据集之上。FIFO结合了15,499个由评审团(jury)标注的样本和320个专家标注的实例(κ = 0.61),用于验证和校准基于模型的标注。借助FIFO,我们分析了27个摘要模型的实测框架率。我们发现,LLM生成的摘要往往表现出比人工撰写的参考摘要更高的校准框架率,并且在不同主题和训练方式之间存在显著差异,其中科学和公共卫生类摘要的框架率偏高。我们的结果表明,框架是摘要质量中一个尚未被充分探索但影响重大的维度。

英文摘要

News headlines and summaries shape how events are interpreted through selective emphasis and omission, a phenomenon commonly referred to as framing. Large language models are now routinely used to generate such content, yet existing evaluation frameworks largely overlook this dimension. We introduce Frame In, Frame Out (FIFO), the first large-scale benchmark for measuring framing presence in LLM-generated news summaries, grounded in the widely used XSum dataset. FIFO combines 15,499 jury-annotated examples with 320 expert-labeled instances ($\kappa = 0.61$) to validate and calibrate model-based annotations. Using FIFO, we analyze measured framing rates across 27 summarization models. We find that LLM-generated summaries often exhibit higher calibrated framing rates than human-written references, with substantial variation across topics and training regimes, including elevated rates in scientific and public health summaries. Our results establish framing as an underexplored and consequential dimension of summarization quality.
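The agreement figure quoted above is a kappa statistic. The abstract does not say which variant was used, so the sketch below assumes Cohen's kappa for two annotators; the function name and toy labels are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[label] * count_b[label] for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:          # both annotators used one identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# 1 = "framing present", 0 = "no framing" (toy labels, not FIFO data)
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A value of 0 means agreement no better than chance and 1 means perfect agreement, so 0.61 indicates agreement well above chance on a subjective labeling task.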

2503.17599 2026-05-22 cs.CL cs.AI

Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark

利用全科医学基准评估大型语言模型的临床能力

Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Junrong Chen, Yuhang Zhao, Shuang Li, Dingqian Wang, Zhu Lin, Xuanna Li, Yuze Tang, Jiexian Qiu, Xiaolin Lu, Hongji Yu, Shuang Chen, Yuhua Bi, Xiaofei Zeng, Yixian Chen, Lin Yao

发表机构 * The Sixth Affiliated Hospital of Sun Yat-sen University(中山大学第六附属医院) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)(广东省人工智能与数字经济发展实验室(深圳)) Xinyi People’s Hospital(新一人民医院) The Fifth Affiliated Hospital of Sun Yat-sen University(中山大学第五附属医院) School of Public Health of Sun Yat-sen University(中山大学公共卫生学院)

AI总结 本文提出了一种新的评估框架,并基于该框架构建全科医学基准(GPBench),用于评估大型语言模型作为全科医生的能力,发现当前LLM不适合在临床全科实践中自主部署,需要持续的人类监督。

详情
AI中文摘要

大型语言模型(LLMs)在全科医学中展现出了相当大的潜力。然而,现有的基准和评估框架主要依赖考试式或简化的问答形式,缺乏与全科医学中真实临床职责相一致的、基于胜任力的结构。因此,LLMs能在多大程度上可靠地履行全科医生(GPs)的职责仍不确定。在本工作中,我们提出了一种新的评估框架,用于评估LLMs作为全科医生的能力。基于该框架,我们引入了一个全科医学基准(GPBench),其数据由领域专家依据常规临床实践标准进行细致标注。我们评估了十个最先进的LLMs,并分析了它们的胜任力。我们的发现表明,当前的LLMs不适合在临床全科实践中自主部署,所有现实应用都需要持续的人类监督;针对全科医生日常职责的进一步专门优化仍然必不可少。

英文摘要

Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not suitable for autonomous deployment in clinical general practice and that all realistic applications require continuous human oversight; further optimization specifically tailored to the daily responsibilities of GPs remains essential.

2502.01476 2026-05-22 cs.LG cs.NA math.NA physics.comp-ph

Neuro-Symbolic AI for Analytical Solutions of Differential Equations

神经符号AI用于微分方程的解析解

Orestis Oikonomou, Levi Lingsch, Dana Grund, Siddhartha Mishra, Georgios Kissas

发表机构 * Seminar for Applied Mathematics, ETH Zurich, Switzerland(应用数学研讨会,苏黎世联邦理工学院,瑞士) ETH AI Center, Zurich, Switzerland(苏黎世联邦理工学院人工智能中心,瑞士) IBM Research Europe, Zurich, Switzerland(IBM欧洲研究院,苏黎世,瑞士) Institute for Atmospheric and Climate Science, ETH Zurich, Switzerland(大气与气候科学研究所,苏黎世联邦理工学院,瑞士) Swiss Data Science Center, ETH Zurich, Switzerland(瑞士数据科学中心,苏黎世联邦理工学院,瑞士)

AI总结 本文提出SIGS神经符号框架,通过上下文无关文法生成数学上有效且物理上有意义的构建块,并结合用户指定的Ansatz进行组合,嵌入到拓扑正则化的连续潜在流形中,通过两阶段搜索发现解析解,提高了微分方程解析解的准确性和效率。

Comments Updates the method and added extra results

详情
AI中文摘要

微分方程的解析解能提供精确且可解释的洞察,但很少可得,因为发现它们需要专家直觉或对组合空间进行穷举搜索。我们引入SIGS,一种用于方程驱动的闭式解发现的神经符号框架。SIGS使用上下文无关文法生成数学上有效且物理上有意义的构建块,由用户指定的Ansatz规定这些块如何组合,将它们嵌入到拓扑正则化的连续潜在流形中,并分两个阶段在该流形上进行搜索:先进行结构选择,再通过梯度下降细化系数,且仅依据PDE残差以及给定的边界条件和初始条件对候选解评分。这种设计将符号推理与数值优化统一起来:文法在构造上保证候选解块是合规的,而潜在空间搜索使探索变得可行且无需数据。SIGS是首个能够做到以下几点的神经符号方法:(i)恢复耦合非线性PDE系统的解析解,(ii)当文法缺少自然的基本元素时发现等价的符号形式,(iii)为缺乏已知闭式解的PDE给出准确的符号近似。总体而言,在标准PDE基准上,SIGS在准确性和运行时间两方面都比现有符号方法提升了数个数量级。

英文摘要

Analytical solutions to differential equations offer exact, interpretable insight but are rarely available because discovering them requires expert intuition or exhaustive search of combinatorial spaces. We introduce SIGS, a neuro-symbolic framework for equation-driven closed-form solution discovery. SIGS uses a context-free grammar to generate mathematically valid and physically meaningful building blocks, with a user-specified Ansatz prescribing how these blocks combine, embeds them into a topology-regularised continuous latent manifold, and searches this manifold in two stages: structure selection followed by coefficient refinement using gradient descent, scoring candidates only against the PDE residual and prescribed boundary and initial conditions. This design unifies symbolic reasoning with numerical optimization; the grammar constrains candidate solution blocks to be proper by construction, while the latent search makes exploration tractable and data-free. SIGS is the first neuro-symbolic method to (i) recover analytical solutions for coupled nonlinear PDE systems, (ii) discover equivalent symbolic forms when the grammar lacks the natural primitives, and (iii) produce accurate symbolic approximations for PDEs lacking known closed-form solutions. Overall, SIGS improves over existing symbolic methods by orders of magnitude in both accuracy and runtime across standard PDE benchmarks.
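Scoring a candidate only against the PDE residual can be illustrated on a textbook case. The sketch below is not SIGS: it assumes the 1D heat equation u_t = u_xx as the target, uses finite differences instead of symbolic differentiation, and omits boundary and initial conditions. The function names are hypothetical.

```python
import math

def heat_residual(u, x, t, h=1e-3):
    """Residual u_t - u_xx of a candidate u(x, t), via central differences."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / (h * h)
    return u_t - u_xx

def residual_score(u, points):
    """Worst-case absolute residual over sample points (lower is better)."""
    return max(abs(heat_residual(u, x, t)) for x, t in points)

def exact_candidate(x, t):
    return math.exp(-t) * math.sin(x)        # satisfies u_t = u_xx exactly

def wrong_candidate(x, t):
    return math.exp(-2 * t) * math.sin(x)    # wrong decay rate

sample_points = [(0.5, 0.1), (1.0, 0.5), (2.0, 1.0)]
good_score = residual_score(exact_candidate, sample_points)
bad_score = residual_score(wrong_candidate, sample_points)
```

Because the score needs only the equation itself, no solution data is required, which matches the "data-free" property claimed in the abstract.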

2410.19787 2026-05-22 cs.CV cs.LG

Leveraging Multi-Temporal Sentinel 1 and 2 Satellite Data for Leaf Area Index Estimation With Deep Learning

利用多时相哨兵1和2卫星数据进行叶面积指数估计的深度学习方法

Clement Wang, Antoine Debouchage, Valentin Goldité, Aurélien Wery, Jules Salzinger

发表机构 * Austrian Institute of Technology - Vienna, Austria(奥地利技术研究所-维也纳,奥地利)

AI总结 本文提出了一种基于多时相哨兵1雷达数据和哨兵2多光谱数据的深度学习方法,用于像素级叶面积指数预测,通过基于多个U-Net的网络结构和共同潜在空间融合不同输入模态的互补信息,最终在公开数据上取得了0.06 RMSE和0.93 R2分数。

Journal ref Proc. 2023 Conference on Big Data from Space (BiDS'23), Publications Office of the European Union, Luxembourg, 2023

详情
AI中文摘要

叶面积指数(LAI)是理解生态系统健康和植被动态的关键参数。在本文中,我们提出了一种新的像素级LAI预测方法,它利用多个时间点上哨兵1雷达数据和哨兵2多光谱数据的互补信息。我们的方法使用一个基于多个U-Net、专门为此任务定制的深度神经网络。为处理不同输入模态的复杂性,该网络由若干分别预训练的模块组成,这些模块将所有输入数据表示在一个共同的潜在空间中。然后,我们将它们与一个共同的解码器一起进行端到端微调,该解码器还考虑了季节性,我们发现季节性起着重要作用。我们的方法在公开可用数据上取得了0.06的RMSE和0.93的R2分数。我们的成果可在 https://github.com/valentingol/LeafNothingBehind 获得,以便后续工作在当前进展的基础上进一步改进。

英文摘要

The Leaf Area Index (LAI) is a critical parameter to understand ecosystem health and vegetation dynamics. In this paper, we propose a novel method for pixel-wise LAI prediction by leveraging the complementary information from Sentinel 1 radar data and Sentinel 2 multi-spectral data at multiple timestamps. Our approach uses a deep neural network based on multiple U-nets tailored specifically to this task. To handle the complexity of the different input modalities, it is comprised of several modules that are pre-trained separately to represent all input data in a common latent space. Then, we fine-tune them end-to-end with a common decoder that also takes into account seasonality, which we find to play an important role. Our method achieved 0.06 RMSE and 0.93 R2 score on publicly available data. We make our contributions available at https://github.com/valentingol/LeafNothingBehind for future works to further improve on our current progress.
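The two reported metrics can be computed as follows. This is a plain-Python sketch with illustrative numbers rather than the paper's data; it assumes R2 denotes the usual coefficient of determination.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy per-pixel LAI values (made up for illustration)
lai_true = [1.0, 2.0, 3.0, 4.0]
lai_pred = [1.5, 2.0, 2.5, 4.0]
error = rmse(lai_true, lai_pred)
fit = r2_score(lai_true, lai_pred)
```

RMSE is expressed in the units of the target, so its magnitude depends on how LAI is scaled, whereas R2 is unitless and compares the model against always predicting the mean.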

2410.18151 2026-05-22 cs.SD cs.LG cs.MM eess.AS

Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment

Music102: 一个用于和弦进行伴奏的 $D_{12}$-等变Transformer

Weiliang Luo

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文提出Music102,一种基于群论和音乐结构的等变Transformer,用于提升和弦进行伴奏的质量,通过整合移调和倒影等音乐对称性,改进了非等变Transformer原型Music101的性能。

Comments 10 pages, 3 figures

Journal ref Proceedings of the 2025 International Computer Music Conference (https://hdl.handle.net/2027/fulcrum.zg64tq53m)

详情
AI中文摘要

我们提出了Music102,一种先进的模型,旨在通过$D_{12}$-等变Transformer增强和弦进行伴奏。受群论和符号音乐结构的启发,Music102利用音乐对称性(如移调和倒影操作),并将这些性质整合到Transformer架构中。通过编码先验音乐知识,模型在旋律和和弦序列上都保持等变性。我们使用POP909数据集训练和评估Music102,结果显示,尽管参数更少,它在加权损失和精确准确率两项指标上均显著优于非等变的原型Music101。这项工作展示了自注意力机制和层归一化对离散音乐领域的适应性,应对了计算音乐分析中的挑战。凭借稳定而灵活的神经框架,Music102为等变音乐生成和计算作曲工具的进一步探索奠定了基础,将数学理论与实际音乐演奏联系起来。

英文摘要

We present Music102, an advanced model aimed at enhancing chord progression accompaniment through a $D_{12}$-equivariant transformer. Inspired by group theory and symbolic music structures, Music102 leverages musical symmetry, such as transposition and reflection operations, integrating these properties into the transformer architecture. By encoding prior music knowledge, the model maintains equivariance across both melody and chord sequences. The POP909 dataset was employed to train and evaluate Music102, revealing significant improvements over the non-equivariant Music101 prototype in both weighted loss and exact accuracy metrics, despite using fewer parameters. This work showcases the adaptability of self-attention mechanisms and layer normalization to the discrete musical domain, addressing challenges in computational music analysis. With its stable and flexible neural framework, Music102 sets the stage for further exploration in equivariant music generation and computational composition tools, bridging mathematical theory with practical music performance.
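The symmetry in question can be made concrete with pitch classes. The sketch below is not the Music102 architecture. It assumes $D_{12}$ acts on 12-dimensional pitch-class (chroma) vectors by transposition (cyclic shift) and reflection (pitch-class inversion), and it checks equivariance for a simple hand-written map; all function names are illustrative.

```python
def chroma_from_pitch_classes(pitch_classes):
    """12-dimensional 0/1 vector marking which pitch classes are present."""
    return [1 if i in pitch_classes else 0 for i in range(12)]

def transpose(chroma, k):
    """Transposition by k semitones: pitch class p moves to (p + k) mod 12."""
    return [chroma[(i - k) % 12] for i in range(12)]

def reflect(chroma):
    """Reflection (inversion about pitch class 0): p moves to (-p) mod 12."""
    return [chroma[(-i) % 12] for i in range(12)]

def smooth(chroma):
    """A toy equivariant map: sum each entry with its two circular neighbors.
    A symmetric circular kernel commutes with both shifts and reflection."""
    return [chroma[(i - 1) % 12] + chroma[i] + chroma[(i + 1) % 12] for i in range(12)]

c_major = chroma_from_pitch_classes([0, 4, 7])   # C, E, G
d_major = transpose(c_major, 2)                  # D, F#, A
f_minor = reflect(c_major)                       # C, Ab, F

# Equivariance: applying the symmetry before or after the map gives the same result.
same_under_transposition = smooth(transpose(c_major, 5)) == transpose(smooth(c_major), 5)
same_under_reflection = smooth(reflect(c_major)) == reflect(smooth(c_major))
```

An equivariant model bakes this commutation property into its layers, so it does not have to relearn the same chord pattern separately in every key, which is one plausible reason it can match or beat a larger non-equivariant model.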