arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22972 2026-05-25 cs.LG cs.AI

A mathematical theory of balancing relational generalization and memorization

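Luke Cheng, Samuel Lippl

Affiliations: Center for Theoretical Neuroscience

AI summary: This paper examines how learning systems balance relational generalization against memorizing exceptions. It introduces a new task, transitive inference with exceptions, which tests both generalization under a relational rule and memorization of exceptions to it. Through theoretical analysis and experiments, the study finds that neural network models can strike this balance under different representational structures, but that success depends on the specific representational geometry. The theory exposes the mechanistic difficulty of the task, and experiments with pretrained language models confirm its predictions.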

Abstract

Humans, animals, and modern machine learning models exhibit impressive abilities to learn complex behaviors and generalize these behaviors to unseen situations. This ability requires us to learn rules and regularities that allow for such generalizations. At the same time, in most complex environments, any rule will have its exceptions. How do learning systems balance between learning general regularities and memorizing exceptions? We argue that a lack of task paradigms has hindered the study of this essential ability. To address this gap, we introduce a novel task, transitive inference with exceptions, that tests for relational generalization and memorization of an exception to the relational rule. We then analytically characterize the behavior of a simple, theoretically tractable model of neural network learning (kernel ridge regression) across a broad family of representations and task parameters. We find that these models can balance between relational generalization and memorization, but unlike for transitive inference without an exception, successful generalization is sensitive to the specific representational geometry. We explain why this task is more challenging mechanistically by drawing on our analytical theory. Finally, we validate our theoretical insights in pretrained language models that are finetuned on ordered relations, finding that these models successfully generalize according to the transitive rule, but also make the kinds of systematic mistakes predicted by our theory. Overall, our theory shows how learning systems can balance between relational generalization and memorization, explains how this can go wrong, and emphasizes the need for new task paradigms designed to probe this ability.
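
To make the task concrete, here is a minimal sketch of kernel ridge regression on a toy version of transitive inference with one exception. Everything specific is an assumption made for illustration and is not taken from the paper: the number of items, the choice of exception pair, the `features` representation (one-hot item codes plus a down-weighted code for the ordered pair), the linear kernel, and the ridge value.

```python
import numpy as np

def features(i, j, n, conj_weight=0.5):
    """Assumed representation: one-hot codes for both items plus a
    down-weighted one-hot code for the ordered pair (conjunctive part)."""
    x = np.zeros(2 * n + n * n)
    x[i] = 1.0
    x[n + j] = 1.0
    x[2 * n + i * n + j] = conj_weight
    return x

def krr_fit(X, y, ridge=1e-3):
    """Kernel ridge regression with a linear kernel on the features."""
    K = X @ X.T
    return np.linalg.solve(K + ridge * np.eye(len(y)), y)

def krr_predict(X_train, alpha, X_test):
    """Predict by applying the dual coefficients to the test/train kernel."""
    return X_test @ X_train.T @ alpha

def build_task(n=7, exception=(1, 5)):
    """Adjacent pairs follow the rule (+1 if the first item ranks higher);
    one hypothetical non-adjacent pair is labelled against the rule."""
    pairs, labels = [], []
    for i in range(n - 1):
        pairs += [(i, i + 1), (i + 1, i)]
        labels += [-1.0, 1.0]
    a, b = exception
    pairs += [(a, b), (b, a)]
    labels += [1.0, -1.0]
    return pairs, np.array(labels)

n = 7
pairs, y = build_task(n)
X = np.stack([features(i, j, n) for i, j in pairs])
alpha = krr_fit(X, y)
train_pred = krr_predict(X, alpha, X)

# Evaluate on all ordered pairs that were never shown during training.
held_out = [(i, j) for i in range(n) for j in range(n)
            if i != j and (i, j) not in pairs]
X_test = np.stack([features(i, j, n) for i, j in held_out])
test_pred = krr_predict(X, alpha, X_test)
rule = np.array([1.0 if i > j else -1.0 for i, j in held_out])
rule_accuracy = float(np.mean(np.sign(test_pred) == rule))
print("rule-consistent held-out pairs:", rule_accuracy)
```

The conjunctive part is what lets this toy model memorize the exception at all: with purely additive item codes, the two orders of the exception pair impose contradictory constraints. Varying `conj_weight` is one way to explore how the representation shifts the balance between following the rule and memorizing the exception.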

2605.22971 2026-05-25 cs.CL cs.HC

Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs

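Ko Watanabe, Shoya Ishimaru

Affiliations: Osaka Metropolitan University

AI summary: This paper studies whether large language models (LLMs) can estimate employees' domain knowledge from their Slack communication logs. Using 27,188 messages from 43 users, it compares the zero-shot estimates of seven models, including the Gemini, Claude, and GPT families; Gemini 2.5 Flash performs best, with the lowest error. Estimation accuracy correlates only weakly with message volume. The results point to both the feasibility and the current limits of automated expertise mapping, and to the importance of privacy protection and richer knowledge representations.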

Abstract

Employees often struggle to identify "who knows what," leading to organizational productivity losses. We investigate whether Large Language Models (LLMs) can infer individual domain knowledge directly from long-term Slack logs. Analyzing 27,188 messages from 43 users, we evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants. Gemini 2.5 Flash achieved the lowest error (MAE 21.13%), while GPT models showed significantly larger discrepancies. Notably, estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference. These findings demonstrate the feasibility and current limits of automated expertise mapping, highlighting the need for privacy-preserving deployments and richer, structure-aware representations of human knowledge.
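
The error metric quoted above, mean absolute error (MAE), can be sketched as follows. The 0 to 100 rating scale and all numbers are made up for illustration; the abstract does not give the scale or any per-participant values.

```python
import numpy as np

def mean_absolute_error(estimates, self_reports):
    """Average absolute gap between model estimates and self-reported ratings."""
    estimates = np.asarray(estimates, dtype=float)
    self_reports = np.asarray(self_reports, dtype=float)
    if estimates.shape != self_reports.shape:
        raise ValueError("estimates and self_reports must have the same shape")
    return float(np.mean(np.abs(estimates - self_reports)))

# Hypothetical skill ratings for four participants on an assumed 0-100 scale.
llm_estimates = [60, 35, 80, 10]
self_reported = [50, 40, 100, 30]
print(mean_absolute_error(llm_estimates, self_reported))  # 13.75
```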

2605.22964 2026-05-25 cs.LG

Certification from Examples is Hard for Circuits and Transformers under Minimal Overparametrization

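Artur Back de Luca, Kimon Fountoulakis

Affiliations: University of Waterloo

AI summary: This paper studies how hard it is to exactly certify circuits and Transformers under minimal overparametrization. The authors prove that adding even a small number of parameters can make the number of examples needed for certification grow exponentially, so exact certification is hard across several hypothesis classes. The experiments examine certification of constructed circuits and trained Transformers on a binary addition task, showing that imperfect models can evade detection by large random samples.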

Comments 38 pages, 5 figures

Abstract

As state-of-the-art neural networks are deployed on reasoning and algorithmic tasks, exactness guarantees become increasingly important. However, high average-case accuracy can still mask inconsistent behaviors. This motivates exact certification, which asks for the smallest set of labeled examples needed to certify that a learned hypothesis equals the target. We show that while some hypotheses are easy to certify, even minimal overparametrization can make certification exponentially hard across several hypothesis classes. For threshold circuits of depth $\ge 2$, adding a single extra gate can force certificate sizes exponential in the input dimension. We show an analogous hardness result for log-precision Transformers with only constant architectural overhead. We also characterize approximate certification, showing that allowing only polynomially many mistakes still requires exponentially large certificates, whereas constant relative-error guarantees can hide exponentially many mistakes. Empirically, we study certification for constructed circuits and trained Transformers for recognizing binary addition. While the constructed circuits instantiate the exponential barrier for certification, the trained Transformer analysis shows that imperfect models can evade detection by large uniformly sampled certificate candidates.
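
The notion of a certificate can be illustrated with a brute-force check over a tiny, explicitly listed hypothesis class. This is only a sketch of the definition as stated in the abstract; the function `is_certificate` and the two-hypothesis class (an AND gate and the constant-zero function) are hypothetical and far simpler than the circuits and Transformers the paper studies.

```python
from itertools import product

def is_certificate(examples, target, hypotheses, n):
    """Return True if every hypothesis that agrees with the target on
    `examples` also agrees with it on all of {0,1}^n."""
    domain = list(product([0, 1], repeat=n))
    for h in hypotheses:
        agrees_on_examples = all(h(x) == target(x) for x in examples)
        equals_everywhere = all(h(x) == target(x) for x in domain)
        if agrees_on_examples and not equals_everywhere:
            return False  # h passes the examples but is not the target
    return True

def and_gate(x):
    return int(all(x))

def constant_zero(x):
    return 0

hypothesis_class = [and_gate, constant_zero]

# The two hypotheses differ only on the all-ones input, so a certificate
# must contain it; examples that miss it cannot rule out constant_zero.
print(is_certificate([(1, 1, 1)], and_gate, hypothesis_class, 3))             # True
print(is_certificate([(0, 0, 0), (1, 0, 1)], and_gate, hypothesis_class, 3))  # False
```

In this toy case a uniformly sampled input hits the single distinguishing point with probability 2^-n, which gives some intuition for why randomly drawn certificate candidates can fail to expose an imperfect model.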

2605.22963 2026-05-25 cs.CL cs.AI

Graph Alignment Topology as an Inductive Bias for Grounding Detection

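Paul Landes, Pranav Herur, Adam Cross, Jimeng Sun

Affiliations: Department of Pediatrics, University of Illinois College of Medicine Peoria; University of Illinois Urbana-Champaign; Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; Carle Illinois College of Medicine, University of Illinois Urbana-Champaign

AI summary: This paper studies how to use graph alignment topology as an inductive bias to improve the factual accuracy of large language model (LLM) outputs. The authors build bipartite graphs between reference information and model outputs and model the alignment structure with a graph neural network, learning directly over alignment topology. The method achieves state-of-the-art results on several hallucination-detection and question-answering datasets, outperforming existing methods and foundational LLMs such as GPT-4o, and offers a new approach to making model outputs more interpretable and factually reliable.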

Abstract

Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether generated propositions are entailed by source documents. This inductive bias enables generalization, but it does not encode whether responses are grounded with respect to a reference. These issues limit the use of LLMs in domains where strict factual correctness is crucial, such as clinical decision support. Existing hallucination detection approaches improve factuality through retrieval augmentation, self-consistency, or claim verification, but generally do not learn directly over alignment topology. To leverage alignment topology as an inductive bias, we construct aligned bipartite graphs between reference information and LLM outputs and train a graph neural network (GNN) to model alignment structure using message passing. The method achieves state-of-the-art results on four diverse hallucination and question-answering datasets, outperforming all compared methods, including foundational LLMs such as GPT-4o.
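
As a rough illustration of message passing over an alignment bipartite graph, the sketch below runs one round of mean aggregation between reference nodes and output nodes. It is not the paper's architecture: a trained GNN would add learned weight matrices and nonlinearities, and the node features, the adjacency matrix, and the function name `bipartite_message_pass` are all hypothetical.

```python
import numpy as np

def bipartite_message_pass(ref_feats, out_feats, adj):
    """One round of mean-aggregation message passing on a bipartite graph.
    adj[i, j] = 1 when reference node i is aligned to output node j."""
    adj = np.asarray(adj, dtype=float)
    ref_deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)    # (R, 1)
    out_deg = np.maximum(adj.sum(axis=0, keepdims=True).T, 1.0)  # (O, 1)
    ref_msg = adj @ out_feats / ref_deg    # messages from output to reference
    out_msg = adj.T @ ref_feats / out_deg  # messages from reference to output
    new_ref = 0.5 * (ref_feats + ref_msg)
    new_out = 0.5 * (out_feats + out_msg)
    return new_ref, new_out

ref_feats = np.array([[1.0, 0.0], [0.0, 1.0]])              # 2 reference nodes
out_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 output nodes
adj = np.array([[1, 0, 0],
                [0, 1, 0]])  # output node 2 is aligned to nothing

new_ref, new_out = bipartite_message_pass(ref_feats, out_feats, adj)
print(new_out)
```

An output node with no aligned reference receives an all-zero message, so its updated feature differs systematically from those of aligned nodes. That is the kind of topological signal a grounding detector could learn from.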

2605.22962 2026-05-25 cs.CV cs.CE cs.HC cs.SE q-bio.NC

GazeBehavior Annotation Toolkit (GBAT): AI-powered toolkit for automatic annotation of egocentric eye-tracking and video data of child-caregiver interaction

Iba Baig, Kevin Li, Yanbin Xu, Seiji Cattelain, Marie Hallo, Hayato Ono, Sho Tsuji, Ming Bo Cai

Affiliations: Department of Psychology, University of Miami; Northeastern University; Ecole Normale Supérieure, PSL University, EHESS, CNRS; International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo Institutes for Advanced Study

AI summary: This work presents the GazeBehavior Annotation Toolkit (GBAT), an AI-based tool for automatically annotating first-person eye-tracking and video data of child-caregiver interaction. Using deep learning, the toolkit performs post-hoc synchronization across multiple videos, semi-automatic annotation of gaze targets, and classification of participants' poses and hand actions, improving the efficiency and scalability of data preprocessing and feature extraction. It supports large-scale longitudinal studies of attentional dynamics and naturalistic behavior in early human development.

Comments submitted to IEEE International Conference on Development and Learning (ICDL), 2026

Abstract

Video recordings of child-caregiver interactions enable investigation of attentional dynamics during naturalistic behavior. Such multimodal recording also allows researchers to examine how attention interacts with action and language use in real time. However, manual annotation of such data is time-consuming. Here, we introduce GazeBehavior Annotation Toolkit, a deep-learning-based toolkit designed to facilitate three key processes in data preprocessing and feature extraction: post-hoc synchronization across multiple videos, semi-automatic annotation of gaze target categories, and categorization of participants' poses and hand actions. This toolkit improves the efficiency and scalability of feature extraction from human egocentric eye-tracking and video data. Such improvement is critical in supporting large-scale and longitudinal investigations of attentional dynamics and naturalistic behavior in human early development.

2605.22635 2026-05-25 cs.LG cs.CL cs.CV

The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution

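Erjian Zhang, Yatong Hao, Liejun Wang, Zhiqing Guo

Affiliations: School of Computer Science and Technology; Xinjiang University; Information Security Engineering Technology Research Center

AI summary: In multi-task radiology report generation, existing linear scalarization strategies struggle to balance the hard constraints of clinical supervision against the smoothness requirements of report generation. This paper analyzes the problem from the perspective of gradient dynamics, characterizes it as a "double dilemma" of drift-term deviation and diffusion-term decay, and proposes CAME-Grad, a model-agnostic optimizer. Through conflict-averse direction rectification and magnitude-enhanced energy injection, it ensures geometric validity and avoids local optima; experiments show consistent improvements in clinical efficacy across multiple RRG methods.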

Comments Accepted by ICML 2026

Abstract

While automatic radiology report generation (RRG) based on multi-task learning is widely adopted to ensure clinical consistency, most studies focus on architectural design and remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, utilizing the stochastic differential equation (SDE) framework to characterize it as a "Double Dilemma" of drift term deviation and diffusion term decay. Based on this, we propose a backbone-agnostic optimizer named Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, the algorithm not only ensures geometric validity but also avoids local optima. An adaptive gradient fusion mechanism then establishes a dynamic balance between the theoretical optimal direction and the task-specific inductive bias. Experiments show that, as a universal plug-and-play optimizer, CAME-Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray. Our code is available at https://github.com/vpsg-research/CAME-Grad.
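
To show what a gradient conflict under linear scalarization looks like, the sketch below contrasts a weighted sum of two task gradients with a PCGrad-style projection, a generic and well-known gradient-surgery heuristic. It is used here only as a stand-in and is not CAME-Grad, whose update rule the abstract does not spell out. The two toy gradients and the function names are hypothetical.

```python
import numpy as np

def linear_scalarization(grads, weights):
    """Fixed weighted sum of task gradients."""
    return sum(w * g for w, g in zip(weights, grads))

def project_conflict(g_a, g_b):
    """PCGrad-style step: if g_a conflicts with g_b (negative dot product),
    remove from g_a its component along g_b; otherwise return g_a unchanged."""
    dot = float(g_a @ g_b)
    if dot < 0.0:
        return g_a - dot / float(g_b @ g_b) * g_b
    return g_a

# Hypothetical gradients for a classification task and a generation task.
g_cls = np.array([1.0, 0.0])
g_gen = np.array([-1.0, 1.0])

scalarized = linear_scalarization([g_cls, g_gen], [0.5, 0.5])
surgery = project_conflict(g_cls, g_gen) + project_conflict(g_gen, g_cls)
print(scalarized)  # the conflicting components cancel along the first axis
print(surgery)
```

With these toy gradients the scalarized update is `[0.0, 0.5]`, so the classification task makes no progress along its own gradient, while the projected update `[0.5, 1.5]` has a positive dot product with both task gradients.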

2605.22423 2026-05-25 cs.CV

Moment-Reenacting: Inverse Motion Degradation with Cross-shutter Guidance

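Xiang Ji, Guixu Lin, Zhengwei Yin, Jiancheng Zhao, Yinqiang Zheng

Affiliations: Graduate School of Information Science and Technology, The University of Tokyo

AI summary: This paper addresses how to invert motion degradation caused by fast motion or low light in computational imaging. It proposes a unified framework that reconstructs moving scenes by combining the complementary characteristics of global shutter (GS) blur and rolling shutter (RS) distortion. The authors design a dual-shutter system that captures synchronized blur-RS image pairs and build a triaxial imaging system to collect a real-world dataset for training and evaluation. The proposed network uses a dual-stream module to separate the contextual and temporal aspects of motion, yielding high-quality frame reconstruction and strong, broadly applicable performance on high-speed video reconstruction under complex motion degradation.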

Comments Accepted by TPAMI

Abstract

Motion degradation, manifested as blur in global shutter (GS) images or rolling shutter (RS) distortion in RS counterparts, remains a fundamental challenge in computational imaging, especially under fast motion or low-light conditions. While prior works have treated blur decomposition and RS temporal super-resolution as separate tasks, this separation fails to exploit their intrinsic complementarity. In this paper, we propose a unified framework to invert motion degradation and reenact the imaging moment by jointly leveraging the complementary characteristics of GS blur and RS distortion. To this end, we introduce a novel dual-shutter setup that captures synchronized blur-RS image pairs and demonstrate that this combination effectively resolves temporal and spatial ambiguities inherent in both modalities. To allow flexible performance-cost trade-offs, we further extend this dual-shutter setup to a stereo blur-RS configuration with a narrow baseline. In addition, we construct a triaxial imaging system to collect a real-world dataset with aligned GS-RS pairs and ground-truth high-speed frames, enabling robust training and evaluation beyond synthetic data. Our proposed network explicitly disentangles motion into context-aware and temporally sensitive representations via a dual-stream motion interpretation module, followed by a self-prompted frame reconstruction stage. Extensive experiments validate the superiority and generalizability of our approach, establishing a new paradigm for realistic high-speed video reconstruction under complex motion degradations. Code and more resources are available at https://jixiang2016.github.io/dualBR_site/.

2605.22373 2026-05-25 cs.LG cs.CL

Boundary-targeted Membership Inference Attacks on Safety Classifiers

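Anthony Hughes, Alexander Goldberg, Prince Jha, Adam Perer, Nikolaos Aletras, Niloofar Mireshghallah

Affiliations: University of Sheffield; Carnegie Mellon University; MBZUAI

AI summary: This study examines boundary-targeted membership inference attacks on safety classifiers, which generative AI systems use to filter harmful content or identify at-risk users. It proposes a new attack that identifies the examples on which the classifier is least confident, exposing the model's memorization of training data and thereby revealing whether an example was in the training set. On a classifier that detects users who may need emotional support, the method recovers more of the conversations flagged as high-risk at a low false-positive rate, substantially outperforming existing membership inference attacks. The study also characterizes boundary examples and shows that content-based filtering does not defend effectively against the attack.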

Abstract

Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, raising important, yet poorly understood, privacy concerns. Membership inference attacks (MIAs) allow adversaries to infer membership of examples used to train models. In this work, we hypothesize that identifying the examples on which the classifier is least confident is informative for an adversary seeking to infer membership. This reflects a localized failure of generalization, where the model relies on memorization to resolve ambiguity in the training set. To investigate this, we introduce a new boundary-targeted selection strategy that identifies low-confidence examples that amplify the signal of an example's membership within a training set. Our experimental results show that an adversary can recover 19% of the conversations a safety classifier flagged as indicating user distress, at a 5% false-positive rate, on a classifier fine-tuned for detecting a user who may require emotional support. This is 3.5 times more than attacking using state-of-the-art MIA methods alone. Finally, we characterize the boundary-lying examples and show that content-based filtering is ineffective for protection, while existing noise strategies can effectively mitigate the susceptibility of these examples.
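
Two ingredients of this setup can be sketched in a few lines: picking the examples closest to the decision boundary, and reporting the true-positive rate at a fixed false-positive rate. Both functions are illustrative assumptions. The abstract does not define the confidence measure or the membership score, so the sketch uses distance from 0.5 for a binary classifier and an arbitrary numeric score in which higher means "more likely a training member".

```python
import numpy as np

def least_confident_indices(probs, k):
    """Indices of the k examples whose predicted probability is closest to 0.5."""
    margin = np.abs(np.asarray(probs, dtype=float) - 0.5)
    return np.argsort(margin, kind="stable")[:k]

def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.05):
    """Fraction of members flagged when the threshold is set so that at most
    `max_fpr` of the non-members are flagged (score > threshold)."""
    ranked = np.sort(np.asarray(nonmember_scores, dtype=float))[::-1]
    allowed = int(np.floor(max_fpr * len(ranked)))  # tolerated false positives
    threshold = ranked[allowed] if allowed < len(ranked) else -np.inf
    return float(np.mean(np.asarray(member_scores, dtype=float) > threshold))

# Hypothetical classifier outputs and attack scores.
probs = [0.95, 0.52, 0.10, 0.45, 0.80]
print(least_confident_indices(probs, k=2))  # indices of the boundary examples

nonmember_scores = list(range(30))      # 30 non-members with scores 0..29
member_scores = [30, 29.5, 10, 28.5]
print(tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.05))
```

In the toy data only one of the 30 non-members exceeds the threshold (a false-positive rate of about 3.3%), and three of the four members are flagged.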

2605.22350 2026-05-25 cs.LG stat.ML

Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation

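Fabian Morelli, Stephan Eckstein

Affiliations: Department of Mathematics, University of Tübingen, Germany; Department of Computer Science, University of Tübingen, Germany

AI summary: This paper proposes partial fusion of neural networks, which offers a flexible tradeoff between computational cost and performance, sitting between ensembling and weight aggregation. The core idea is to aggregate weights only for the most similar neurons, as measured at the neuron level, retaining high accuracy while reducing computational cost. The study also presents a concrete implementation that identifies and matches similar neurons via partial optimal transport, and treats weight aggregation and partial fusion as generalized pruning of ensemble models, in which neurons may be deleted or linearly combined.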

Comments Accepted to ICML 2026

Abstract

Ensembles of neural networks typically outperform individual networks but incur large computational costs, whereas weight aggregation produces less costly, yet also less accurate, aggregate models. We introduce partial fusion of networks, which interpolates between ensembles and weight aggregation and thus allows for a flexible tradeoff between computational cost and performance. A direct way to achieve this is to extend existing weight aggregation methods based on neuron-level similarity between different networks, where partial fusion then aggregates only the weights of the neurons that are most similar. We showcase one particular method to jointly identify which neurons are most similar and match them via partial optimal transport. Further, we consider the more general perspective of weight aggregation and partial fusion as generalized pruning of ensemble models, where neurons can be not only deleted but also linearly combined. Finally, we show that generalized pruning applied to a single network yields benefits similar to those of partial fusion by allowing for a tradeoff between isolating, deleting, and linearly combining neurons based on similarity. Our code is available at https://github.com/Fabian-Mor/partial_fusion_nn.
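
The interpolation between an ensemble and full weight aggregation can be sketched for a single layer as follows. This is a simplified stand-in, not the authors' method: it matches neurons greedily by cosine similarity instead of using partial optimal transport, it handles only incoming weights (outgoing weights and later layers are ignored), and the function name `partial_fuse` and the toy matrices are hypothetical. Rows are assumed to be nonzero.

```python
import numpy as np

def partial_fuse(W_a, W_b, k):
    """Average the k most similar neuron pairs (rows) of two layers and keep
    every unmatched neuron from both networks as a separate row."""
    A = W_a / np.linalg.norm(W_a, axis=1, keepdims=True)
    B = W_b / np.linalg.norm(W_b, axis=1, keepdims=True)
    sim = A @ B.T  # cosine similarity between neurons of the two networks
    fused, used_a, used_b = [], set(), set()
    # Greedy one-to-one matching in order of decreasing similarity.
    for flat_index in np.argsort(-sim, axis=None):
        if len(fused) == k:
            break
        i, j = divmod(int(flat_index), sim.shape[1])
        if i in used_a or j in used_b:
            continue
        fused.append(0.5 * (W_a[i] + W_b[j]))
        used_a.add(i)
        used_b.add(j)
    kept_a = [W_a[i] for i in range(len(W_a)) if i not in used_a]
    kept_b = [W_b[j] for j in range(len(W_b)) if j not in used_b]
    return np.stack(fused + kept_a + kept_b)

W_a = np.array([[1.0, 0.0], [0.0, 1.0]])
W_b = np.array([[0.0, 2.0], [1.0, 1.0]])

for k in range(3):
    print(k, partial_fuse(W_a, W_b, k).shape)
```

With two neurons per network, `k = 0` keeps all four neurons (ensemble-like cost), `k = 2` merges everything into two neurons (full aggregation), and `k = 1` sits in between with three.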

2605.22272 2026-05-25 cs.RO cs.CV

Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors

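Jiahe Chen, ZiRui Wang, Feiyu Jia, Xiao Chen, Xiaojie Niu, Weishuai Zeng, Tianfan Xue, Xiaowei Zhou, Jiangmiao Pang, Jingbo Wang

Affiliations: Zhejiang University; Shanghai AI Laboratory; The Chinese University of Hong Kong

AI summary: Whole-body humanoid-object interaction (HOI) is bottlenecked by the scarcity of high-quality 3D data. Existing methods based on video generative priors suffer from representation misalignment, because they rely on geometric priors such as explicit CAD models, and from retargeting complexity, because of intricate morphological retargeting. This paper proposes Imagine2Real, a geometry-free zero-shot HOI framework that resolves misalignment by unifying robot and object motion as 4D point trajectories and avoids retargeting error by tracking sparse keypoints. It uses the latent space of a behavior foundation model to produce natural motion and achieves zero-shot physical deployment within a motion capture system.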

Abstract

Whole-body Humanoid-Object Interaction (HOI) is bottlenecked by the scarcity of high-fidelity 3D data. While video generative priors offer a promising alternative, existing methods suffer from Representation Misalignment due to their reliance on geometric priors (e.g., explicit CAD models), and Retargeting Complexity arising from intensive morphing and morphological mismatch. We propose Imagine2Real, a zero-shot HOI framework for flexible, geometry-free interaction. To resolve misalignment, we formulate robot and object motions as unified 4D point trajectories. To overcome retargeting complexity, our Keypoints Tracker tracks only sparse critical points (base, hands, and object), entirely bypassing the error-amplifying retargeting process. To maintain natural gaits despite these sparse signals, we utilize the latent space of a Behavior Foundation Model (BFM) as the tracker's search domain. Using a progressive training strategy, Imagine2Real learns robust behaviors with simple tracking rewards, enabling zero-shot physical deployment within a motion capture (mocap) system.

2605.22216 2026-05-25 cs.CV

A Robust Semantic Segmentation Pipeline for the CVPR 2026 8th UG2+ Challenge Track 2

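Jinming Chai, Libo Yan, Licheng Jiao, Fang Liu

Affiliations: School of Artificial Intelligence, Xidian University

AI summary: This paper presents a solution for Track 2 of the CVPR 2026 8th UG2+ Challenge (semantic segmentation in adverse weather). The authors design a semi-supervised segmentation pipeline trained only on the challenge's WeatherProof dataset, with no additional data. The method uses UniMatch V2 as the baseline model, treats all degraded-weather images as unlabeled data for semi-supervised learning, and applies test-time augmentation at inference to improve the robustness and accuracy of the segmentation.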

Abstract

This report presents our solution for the WeatherProof Dataset Challenge, namely CVPR 2026 8th UG2+ Challenge Track 2: Semantic Segmentation in Adverse Weather. For the semantic segmentation task under adverse weather conditions, we propose a semi-supervised segmentation pipeline. Our method is trained exclusively on the WeatherProof dataset, without using any additional external data. Specifically, we adopt UniMatch V2 as the baseline model and treat all degraded-weather images as unlabeled data for semi-supervised training, thereby fully exploiting the data distribution provided by the challenge. During inference, we further apply test-time augmentation to improve the robustness and segmentation accuracy of the final predictions. The code is publicly available at: https://github.com/ylb888/weatherproof-challenge-unimatchv2.
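
Test-time augmentation for segmentation can be sketched as averaging predictions over an image and its mirrored copy. The report does not say which augmentations were used, so horizontal flipping is an assumption here, and `toy_model` is a hypothetical stand-in for a trained network that returns per-pixel class probabilities.

```python
import numpy as np

def predict_with_flip_tta(model, image):
    """Average class probabilities over the image and its horizontal mirror,
    then take the per-pixel argmax. `model` maps (H, W) to (H, W, C)."""
    probs = model(image)
    probs_mirrored = model(image[:, ::-1])  # predict on the flipped input
    probs_restored = probs_mirrored[:, ::-1]  # undo the flip on the output
    mean_probs = 0.5 * (probs + probs_restored)
    return mean_probs.argmax(axis=-1)

def toy_model(image):
    """Hypothetical two-class 'network': brightness is the class-1 probability."""
    return np.stack([1.0 - image, image], axis=-1)

image = np.array([[0.1, 0.9],
                  [0.8, 0.2]])
labels = predict_with_flip_tta(toy_model, image)
print(labels)
```

The key detail is restoring the prediction to the original orientation before averaging; without that step the averaged map would mix predictions for different pixels.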

2605.22020 2026-05-25 cs.CV

ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting

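Yuke Li, Weihang Liu, Cheng Zhang, Yuefeng Zhang, Jiadi Cui, Zixuan Wang, Junran Ding, Haoyu Wu, Yujiao Shi, Jingyi Yu, Xin Lou

Affiliations: ShanghaiTech University; GGU Technology Co., Ltd; Stereye

AI summary: This paper proposes ForeSplat, an optimization-aware training framework for feed-forward 3D Gaussian Splatting that aims to improve reconstruction quality under limited network capacity. Using MetaGrad, ForeSplat shifts part of the scene-modeling burden to the optimizer, so that the feed-forward model produces initializations better suited to subsequent refinement and reaches higher reconstruction quality in fewer optimization steps. Experiments show improvements across several network architectures, offering a practical path toward lightweight, high-fidelity 3D reconstruction.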

Abstract

Feed-forward 3D Gaussian Splatting models offer fast single-pass reconstruction, but scaling them to match per-scene optimization quality is fundamentally hindered by the scarcity of large-scale 3D annotations. A practical compromise is predict-then-refine, where post-prediction optimization compensates for the limited capacity of the feed-forward network. However, standard feed-forward 3DGS is trained solely for zero-step rendering error, ignoring whether its output constitutes a good initialization for the downstream optimizer. We present ForeSplat, an optimization-aware training framework that equips feed-forward 3DGS models to produce initializations explicitly designed for rapid, effective refinement. By offloading part of the scene-modeling burden to the optimizer, ForeSplat substantially reduces the capacity pressure on the feed-forward model, making high-quality reconstruction feasible even with compact networks. At its core is MetaGrad, a lightweight multi-anchor meta-gradient training rule that bypasses costly higher-order differentiation through the 3DGS optimizer. MetaGrad unrolls a short inner-loop refinement trajectory, samples anchor states, and back-propagates aggregated first-order gradients to the prediction head as a surrogate optimization-aware signal. This fine-tuning adds no inference cost and enables high-quality reconstruction within seconds after a few refinement steps. We instantiate ForeSplat on diverse backbones, including AnySplat, Pi3X, and a distilled variant tailored for edge deployment. Across all tested architectures, a ForeSplat-trained initialization converges in fewer refinement steps and reaches a higher peak reconstruction quality than its vanilla counterpart, even fully converged. The framework consistently bridges the gap between amortized prediction and per-scene optimization, establishing a practical path toward lightweight, high-fidelity 3D reconstruction.

2605.21906 2026-05-25 cs.CV

Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

Yuheng Li, Yuan Gao, Haoyu Dong, Yuxiang Lai, Shansong Wang, Mojtaba Safari, James E. Baciak, Xiaofeng Yang

Affiliations: Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University; Department of Radiation Oncology and Winship Cancer Institute, Emory University; Department of Electrical and Computer Engineering, Duke University; Department of Computer Science and Informatics, Emory University; Department of Materials Science & Engineering, Nuclear Engineering Program, University of Florida

AI summary: This study presents FlexiCT, a CT foundation model trained by agglomerative continual pretraining on more than 260,000 CT volumes from 56 public datasets, forming a large-scale resource for CT representation learning. Pretraining proceeds in three stages, covering two-dimensional axial pretraining, three-dimensional anatomical pretraining, and report-guided semantic alignment, and supports slice-level, volume-level, and vision-language analysis. FlexiCT performs well on multiple downstream tasks, and its embeddings reflect disease-phenotype features such as tumor stage.

Abstract

Computed tomography (CT) is central to three-dimensional medical imaging, yet CT-based artificial intelligence remains fragmented across task-specific models for segmentation, classification, registration, and report analysis. Here we present FlexiCT, a family of CT foundation models trained by agglomerative continual pretraining on 266,227 CT volumes from 56 publicly available datasets, forming a large-scale public resource for CT representation learning. FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining and report-guided semantic alignment. This training strategy supports slice-level, volume-level and vision-language analysis. Across five downstream task families (segmentation, classification, registration, vision-language understanding and clinical retrieval), FlexiCT matches or exceeds prior task-specific approaches on multiple benchmarks. Its embeddings further organize CT scans along gradients associated with various tumor stages, suggesting that CT foundation models can capture imaging features relevant to disease phenotype characterization. Project page and code are available at: https://ricklisz.github.io/flexict.github.io and https://github.com/ricklisz/FlexiCT.

2605.21851 2026-05-25 cs.LG cs.AI

OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

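Yu Li, Rui Miao, Tian Lan, Zhengling Qi

Affiliations: George Washington University; The University of Texas at Dallas

AI summary: This paper proposes OPPO, an algorithm that improves credit assignment for large language models (LLMs) on reasoning tasks. OPPO rests on one observation: the oracle signal that prior methods use for local discrimination is in essence a Bayesian update of the model's belief about eventual success. By accumulating this signal along a trajectory, OPPO computes a success-probability estimate at each position and a token-level advantage without a value network or additional rollouts, identifying pivotal reasoning steps more precisely. Experiments show that OPPO outperforms existing methods on several mathematics, science, and code reasoning benchmarks.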

Abstract

Reinforcement learning with verifiable rewards has become the standard recipe for improving LLM reasoning, but the dominant algorithm GRPO assigns a single trajectory-level advantage to every token, diluting the signal at pivotal reasoning steps and injecting noise at uninformative ones. Critic-free alternatives derived from on-policy distillation supply per-token signals through oracle-conditioned likelihood ratios, yet apply each signal in isolation from the trajectory-level evidence accumulated up to that position. We propose Oracle-Prompted Policy Optimization (OPPO), which rests on a single observation: the oracle signal used by prior distillation-style methods for local discrimination is also the natural Bayesian update of the model's belief about eventual success. Accumulating the signal along a trajectory yields, in closed form and at the cost of one extra forward pass, a running estimate of the success probability at every position, together with a token-level advantage that requires no learned value network and no additional rollouts. A first-order analysis factorizes the advantage into the per-token discrimination signal used by distillation methods modulated by a state weight that concentrates credit on genuinely pivotal tokens, with a directional variance-reduction guarantee. The framework admits two estimators differing only in which model scores the evidence: a self-oracle that reuses the student and recovers the on-policy distillation reward as a strict special case, and a teacher-oracle that delegates scoring to a stronger frozen model. On two base LLMs across seven mathematics, science, and code reasoning benchmarks, OPPO improves over GRPO, DAPO, and SDPO by up to +6.0 points on AMC'23 and +5.2 points on AIME'24, with gains that widen monotonically with response length.
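
The idea of accumulating a per-token signal into a running success estimate can be sketched with Bayes' rule. If r_t is the ratio between the probability of token t under a success-conditioned model and under the unconditioned policy, then P(success | tokens up to t) = P(success | tokens before t) × r_t. The code below applies that identity. It is an assumption-laden sketch rather than OPPO itself: the exact form of the oracle signal, the prior, the clipping safeguard, and the choice of advantage as the change in estimated success probability are all hypothetical.

```python
import numpy as np

def running_success_probability(log_ratios, prior):
    """Accumulate per-token log likelihood ratios into a running estimate:
    log p_t = log prior + sum of log_ratios up to and including token t."""
    log_p = np.log(prior) + np.cumsum(np.asarray(log_ratios, dtype=float))
    # Estimated ratios can overshoot, so cap at 1.0 (safeguard added here).
    return np.minimum(np.exp(log_p), 1.0)

def token_advantages(log_ratios, prior):
    """Assumed advantage: how much each token moves the success estimate."""
    probs = running_success_probability(log_ratios, prior)
    previous = np.concatenate(([prior], probs[:-1]))
    return probs - previous

# Hypothetical ratios for a three-token response and a prior success rate of 0.5.
log_ratios = np.log([1.2, 0.5, 1.5])
print(running_success_probability(log_ratios, prior=0.5))
print(token_advantages(log_ratios, prior=0.5))
```

In this toy trajectory the estimate moves from 0.5 to 0.6, 0.3, and 0.45, so the second token carries the largest (negative) credit, whereas a trajectory-level scheme would assign all three tokens the same advantage.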

2605.21605 2026-05-25 cs.CV

GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

GenEvolve: 通过工具编排的视觉经验蒸馏实现自我进化的图像生成智能体

Sixiang Chen, Zhaohu Xing, Tian Ye, Xinyu Geng, Yunlong Lin, Jianyu Lai, Xuanhua He, Fuxiang Zhai, Jialin Gao, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Meituan(美团) The Hong Kong University of Science and Technology(香港科技大学) National University of Singapore(新加坡国立大学)

AI总结 本文提出了一种名为GenEvolve的自进化图像生成框架,旨在应对日益复杂和多样的图像生成需求。该方法通过工具协调的视觉经验蒸馏技术,使智能体能够在生成过程中自主学习和优化策略,包括证据收集、参考选择和提示构建等关键步骤。GenEvolve通过对比不同生成轨迹,提炼结构化视觉经验并用于指导模型训练,显著提升了生成质量与效率,并在多个基准测试中取得了优于现有方法的性能。

AI中文摘要

开放式图像生成已不再是简单的提示词到图像问题。高质量生成通常需要智能体将模型的内部生成能力与外部资源相结合。随着请求变得更加多样化和苛刻,我们旨在开发一个通用的图像生成智能体,该智能体能够通过轨迹自我进化,并在各种生成挑战中更有效地使用工具。为此,我们提出了GenEvolve,一个基于工具编排的视觉经验蒸馏的自我进化框架。在GenEvolve中,每次生成尝试都被建模为工具编排的轨迹,智能体收集证据、选择参考、调用生成技能,并将它们组合成提示-参考程序。与主要依赖图像级标量奖励的现有智能体生成方法不同,GenEvolve针对同一请求比较多个轨迹,并将最佳-最差差异抽象为结构化视觉经验,仅提供给特权教师分支。受在线策略自蒸馏的启发,视觉经验蒸馏提供密集的令牌级监督,帮助学生内化更好的搜索、知识激活、参考选择和提示构建。我们进一步构建了GenEvolve-Data和GenEvolve-Bench。在公共基准和GenEvolve-Bench上的实验表明,与强基线相比有显著提升,在当前的图像生成框架中达到了最先进的性能。我们的网站如下:https://ephemeral182.github.io/GenEvolve/

英文摘要

Open-ended image generation is no longer a simple prompt-to-image problem. High-quality generation often requires an agent to combine a model's internal generative ability with external resources. As requests become more diverse and demanding, we aim to develop a general image-generation agent that can self-evolve through trajectories and use tools more effectively across varied generation challenges. To this end, we propose GenEvolve, a self-evolving framework based on Tool-Orchestrated Visual Experience Distillation. In GenEvolve, each generation attempt is modeled as a tool-orchestrated trajectory, where the agent gathers evidence, selects references, invokes generation skills, and composes them into a prompt-reference program. Unlike existing agentic generation methods that mainly rely on image-level scalar rewards, GenEvolve compares multiple trajectories for the same request and abstracts best-worst differences into structured visual experience, provided only to a privileged teacher branch. Inspired by on-policy self-distillation, Visual Experience Distillation provides dense token-level supervision, helping the student internalize better search, knowledge activation, reference selection, and prompt construction. We further construct GenEvolve-Data and GenEvolve-Bench. Experiments on public benchmarks and GenEvolve-Bench show substantial gains over strong baselines, achieving state-of-the-art performance among current image-generation frameworks. Our website is as follows: https://ephemeral182.github.io/GenEvolve/

2605.21489 2026-05-25 cs.LG cs.AI cs.CV stat.CO stat.ML

Variance Reduction for Expectations with Diffusion Teachers

具有扩散教师的期望方差缩减

Jesse Bettencourt, Xindi Wu, Matan Atzmon, James Lucas, Jonathan Lorraine

发表机构 * NVIDIA University of Toronto(多伦多大学) Princeton University(普林斯顿大学)

AI总结 本文研究了如何在使用预训练扩散模型作为“教师”进行下游任务(如文本到3D生成、单步蒸馏等)时,降低梯度估计的方差。提出了一种名为CARV的计算感知方差控制框架,通过分层蒙特卡洛估计器,将昂贵的上游计算过程与廉价的扩散噪声重采样相结合,并结合时间步重要性采样和分层逆CDF构造,有效减少了计算成本。实验表明,CARV在不改变目标函数的前提下显著提升了计算效率,但在某些任务中梯度方差的降低并未带来生成质量的提升,表明此时方差已不再是性能瓶颈。

Comments Project page: https://research.nvidia.com/labs/sil/projects/CARV/

AI中文摘要

预训练的扩散模型作为冻结教师,为文本到3D、单步蒸馏和数据归因等下游流程提供支持。这些流程消耗的教师梯度是关于噪声水平和高斯噪声样本的蒙特卡洛期望;其估计器方差主导了计算成本,因为每次抽取都需要昂贵的上游工作(渲染、模拟、编码)。我们引入了CARV,一个计算感知的方差核算框架,它激发了一种分层蒙特卡洛估计器:通过廉价的扩散噪声重采样来摊销昂贵的上游计算,并通过时间步重要性采样和分层逆CDF构造加以强化。在我们的文本到3D蒸馏和归因实验中,CARV在不改变目标的情况下提供了2-3倍的有效计算乘数(主要来自摊销重用;约25%来自IS+分层);在单步蒸馏中,相同的技术将梯度方差降低了一个数量级,但并未改善下游FID,标志着MC方差不再是瓶颈的区间。

英文摘要

Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo (MC) expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical MC estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified-inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers (most from amortized reuse; ~25% additional from IS+stratification) without changing the objective; in single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.
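The hierarchical estimator in the abstract can be sketched as a nested Monte Carlo loop: each expensive upstream draw is computed once and reused across several cheap timestep and noise resamples, with the timesteps drawn by stratified inverse-CDF sampling. This is a minimal illustration under stated assumptions, not CARV's implementation: `upstream`, `integrand`, `inv_cdf`, and `weight` are hypothetical callables, and the timestep is normalized to the unit interval.

```python
import random


def stratified_uniforms(k, rng):
    """One uniform draw per stratum [i/k, (i+1)/k), so the k draws cover [0, 1) evenly."""
    return [(i + rng.random()) / k for i in range(k)]


def hierarchical_estimate(upstream, integrand, n_outer, k_inner, rng,
                          inv_cdf=lambda u: u, weight=lambda t: 1.0):
    """Nested Monte Carlo estimate that amortizes each upstream draw over k_inner resamples.

    upstream(rng)        -> expensive sample x (rendering, simulation, encoding)
    integrand(x, t, eps) -> per-sample value whose expectation is wanted
    inv_cdf(u)           -> timestep t under the proposal (importance sampling)
    weight(t)            -> importance weight target_density(t) / proposal_density(t)
    """
    total = 0.0
    for _ in range(n_outer):
        x = upstream(rng)  # expensive work happens once per outer draw
        inner = 0.0
        for u in stratified_uniforms(k_inner, rng):
            t = inv_cdf(u)
            eps = rng.gauss(0.0, 1.0)  # cheap diffusion-noise resample
            inner += weight(t) * integrand(x, t, eps)
        total += inner / k_inner
    return total / n_outer


# Toy usage: the upstream value is constant and the integrand is x + t, so the true mean is 2.5.
rng = random.Random(0)
estimate = hierarchical_estimate(lambda r: 2.0, lambda x, t, eps: x + t,
                                 n_outer=4, k_inner=8, rng=rng)
```

With the default uniform proposal, stratification bounds the error in the timestep average by 1/(2k) for this linear toy integrand, which is why the inner loop can stay small.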

2605.21487 2026-05-25 cs.CV

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Uni-Edit: 智能编辑作为统一模型调优的通用任务

Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, Hongsheng Li

发表机构 * CUHK MMLab(香港中文大学多媒体实验室) Meituan(美团) TJU(天津大学) USTC(中国科学技术大学)

AI总结 本文提出了一种名为Uni-Edit的智能图像编辑任务,作为统一多模态模型(UMMs)调优的通用任务。与传统的多任务混合训练方法不同,Uni-Edit通过单一任务、单一训练阶段和单一数据集,同时提升模型在图像理解、生成和编辑三方面的能力。研究引入了一种自动化且可扩展的数据合成方法,将多样化的视觉问答数据转化为复杂且有效的编辑指令,从而显著提升了模型的编辑性能,并在多个基准测试中验证了其对多模态能力的全面提升效果。

Comments Project Page: https://zhengdian1.github.io/Uni-Edit-proj/ Code: https://github.com/zhengdian1/Uni-Edit

AI中文摘要

目前,增强统一多模态模型(UMMs)的图像理解、生成和编辑能力主要依赖于混合多任务训练。由于固有的任务冲突,这种策略需要复杂的多阶段流水线、大量数据混合和平衡技巧,仅能实现性能折衷而非真正的相互增强。为了打破这一范式,我们提出Uni-Edit,一种智能图像编辑任务,作为UMM调优的第一个通用任务。与复杂的混合流水线不同,Uni-Edit仅使用一个任务、一个训练阶段和一个数据集,即可同时提升所有三种能力。具体来说,我们首先识别出图像编辑本质上是一个理想的通用任务,因为它自然需要视觉理解和生成。然而,现有的编辑数据依赖于过于简单的指令,远未充分利用模型的理解能力。为解决这一问题,我们引入了第一个自动化且可扩展的智能编辑数据合成流水线,将多样化的VQA数据转化为复杂且有效的编辑指令,其中嵌入了问题和嵌套逻辑。由此产生了Uni-Edit-148k数据集,将多样化的推理密集型指令与高质量编辑图像配对。在BAGEL和Janus-Pro上的大量实验表明,仅对Uni-Edit进行调优即可在所有三种能力上实现全面增强,无需任何辅助操作。

英文摘要

Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such a strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, merely resulting in a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, transforming diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, pairing diverse reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities without any auxiliary operations.

2605.21139 2026-05-25 cs.CV cs.LG

Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

蒸馏思考,预见行动:面向自动驾驶的认知-物理强化学习

Yang Wu, Qiang Meng, Zhaojiang Liu, Youquan Liu, Jian Yang, Jin Xie

发表机构 * NJU(南京大学) SJTU(上海交通大学) FDU(复旦大学)

AI总结 当前端到端自动驾驶模型受到模仿学习行为克隆天花板的限制,为此,本文提出CoPhy认知-物理强化学习框架,通过将视觉语言模型知识蒸馏到鸟瞰图编码器中,实现零推理成本的认知能力,并构建自回归的鸟瞰图世界模型以预测候选动作的未来语义地图,从而在物理环境层面预见行动后果。该方法结合物理奖励和认知奖励优化驾驶策略,不仅在NAVSIM基准上取得最优性能,还支持通过用户定义的语言指令实现更安全、更灵活的驾驶控制。

AI中文摘要

当前的端到端自动驾驶模型从根本上受到模仿学习的行为克隆上限的限制。虽然强化学习提供了更智能自主性的路径,但它需要两个缺失的基础设施:(1)理解交通语义和驾驶意图的认知基础,以及(2)能够预见候选行动后果的前瞻性物理环境。为此,我们提出了CoPhy,一个用于自动驾驶的认知-物理强化学习框架。为了蒸馏思考,我们将VLM知识蒸馏到BEV编码器中,然后完全丢弃VLM,以零推理成本保留认知能力,同时将认知通道作为可插拔接口释放,用于可选的人类语言命令。为了预见行动,我们构建了一个自回归BEV世界模型,该模型明确预测以候选行动为条件的未来语义地图,作为一个可解释的物理沙盒,从中直接推导出安全指标。基于这一双重基础设施,我们通过GRPO优化驾驶策略,采用新颖的双奖励机制:从BEV rollout导出的物理奖励强制执行硬安全约束,而来自语言对齐评分器的认知奖励确保意图合规。大量实验表明,CoPhy不仅在NAVSIM v1和v2基准上取得了最先进的结果,而且通过认知信息化的场景合规性和通过用户定义的语言指令实现的灵活意图控制,实现了更安全的驾驶。

英文摘要

Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a Cognitive-Physical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual-reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language-aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user-defined language instructions.

2605.21071 2026-05-25 cs.CL cs.AI

Fine-grained Claim-level RAG Benchmark for Law

细粒度声明级法律RAG基准

Souvick Das, Sallam Abualhaija, Domenico Bianculli

发表机构 * University of Luxembourg(卢森堡大学)

AI总结 本文提出ClaimRAG-LAW,一个支持英法双语、面向法律专家与非专家用户的细粒度法律检索增强生成(RAG)基准数据集,涵盖多种真实场景的问答类型。研究通过细粒度评估框架分析当前先进法律RAG系统的检索、生成及主张级表现,揭示了其在法律领域中存在的局限性,为提升法律AI系统的可靠性提供了重要参考。

AI中文摘要

大型语言模型(LLM)的快速进展正在将语义搜索转向问答范式,用户提出问题,LLM生成回答。在法律等高风险领域,检索增强生成(RAG)通常用于减轻生成回答中的幻觉。然而,先前的研究表明,无论是通用还是法律专用的RAG系统,仍然以不同速率产生幻觉,这使得细粒度评估变得至关重要。尽管有需求,现有的法律RAG系统评估框架缺乏分别对检索和生成性能进行详细分析所需的粒度。此外,当前的基准主要是英文且集中于法律专家查询,忽视了非专家需求。我们引入了ClaimRAG-LAW,一个全面的法律RAG数据集,支持法语和英语,面向专家和非专家,并包含反映现实场景的多样化问题类型。我们进一步应用细粒度评估框架对最先进的法律RAG系统进行评估,揭示了法律领域在检索、生成和声明级分析方面的局限性。

英文摘要

The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stakes domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite the need, existing evaluation frameworks for legal RAG systems lack the granularity required to provide detailed analysis of retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework to state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.

2605.20919 2026-05-25 cs.LG cs.AI cs.PL

Sutra: Tensor-Op RNNs as a Compilation Target for Vector Symbolic Architectures

Sutra: 以张量操作RNN作为向量符号架构的编译目标

Emma Leonhart

发表机构 * Emma Leonhart

AI总结 Sutra 是一种类型化的纯函数式编程语言,其前向传播过程被编译为 PyTorch 神经网络。该语言通过将程序中的原始操作、控制流和字符串 I/O 等全部转换为一个融合的张量操作图,实现了对向量符号架构的高效编译。研究展示了 Sutra 在多种嵌入表示上的高精度解码能力,并验证了其可微分性,使得同一程序既能作为逻辑程序运行,也能作为可训练的神经网络进行优化。

Comments Modified NeurIPS submission, see AI declaration and replication materials at end of paper

AI中文摘要

Sutra是一种带类型的纯函数式编程语言,其编译后的前向传播是一个PyTorch神经网络。编译器将整个程序——包括原语、控制流、字符串I/O——通过beta归约降级为一个在冻结嵌入基质上的融合张量操作图。旋转绑定、解绑、捆绑、多项式Kleene三值逻辑以及尾递归循环均被降级为张量操作;Kleene连接词是在{-1, 0, +1}真值网格上精确的拉格朗日插值多项式。验证通过两种方式测试同一事实。(1) 同一程序在跨越两种模态的四个冻结嵌入上运行——三种文本编码器(nomic-embed-text、all-minilm、mxbai-embed-large)和一种蛋白质语言模型(ESM-2)——并在每个基质上以宽度k=8实现100%的解码准确率,而教科书式的Hadamard乘积已经崩溃(mxbai-embed-large上2.5%,all-minilm上7.5%)。(2) PyTorch自动求导流经实际编译的图:一个用.su编写的模糊规则分类器从随机初始化(18.7±9.5%;随机概率=20%,五类)通过反向传播经过发射图(符号源未修改)训练到100.0±0.0%(三个种子)。一个加权变体额外训练一个标量余弦增益,并将其作为数值字面量写回.su源文件;重新编译重现训练后的行为,每个logit误差约2e-7,因此训练后的模型本身是可读、可重编译的代码。因此,同一工件既是一个逻辑程序,也是一个可训练的神经网络。

英文摘要

Sutra is a typed, purely functional programming language whose compiled forward pass is a PyTorch neural network. The compiler beta-reduces the whole program -- primitives, control flow, string I/O -- to one fused tensor-op graph over a frozen embedding substrate. Rotation binding, unbind, bundle, polynomial Kleene three-valued logic, and tail-recursive loops all lower to tensor operations; the Kleene connectives are Lagrange-interpolated polynomials exact on the {-1, 0, +1} truth grid. Validation is one fact tested two ways. (1) The same program runs on four frozen embeddings spanning two modalities -- three text encoders (nomic-embed-text, all-minilm, mxbai-embed-large) and one protein language model (ESM-2) -- and decodes bundles at 100% accuracy through width k=8 on every substrate, where the textbook Hadamard product has already collapsed (2.5% on mxbai-embed-large, 7.5% on all-minilm). (2) PyTorch autograd flows through the actually compiled graph: a fuzzy-rule classifier written in .su trains from random init (18.7 +/- 9.5%; chance = 20%, five classes) to 100.0 +/- 0.0% (three seeds) by backpropagating through the emitted graph, the symbolic source unmodified. A weighted variant additionally trains a scalar cosine gain and writes it back into the .su source as a numeric literal; recompiling reproduces the trained behaviour to ~2e-7 per logit, so the trained model is itself legible, recompilable code. The same artifact is therefore both a logic program and a trainable neural network.
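The claim that Kleene connectives can be polynomials exact on the {-1, 0, +1} truth grid can be checked with a generic construction. The sketch below builds bivariate Lagrange interpolants of the strong Kleene tables, assuming the usual encoding (false = -1, unknown = 0, true = +1, AND = min, OR = max, NOT = negation). It illustrates the idea only; the polynomial form that the Sutra compiler actually emits is not given in the abstract, and all names here are hypothetical.

```python
from itertools import product

GRID = (-1, 0, 1)  # assumed encoding: false, unknown, true


def basis(node, x):
    """Lagrange basis polynomial on GRID: equals 1 at `node` and 0 at the other two nodes."""
    value = 1.0
    for other in GRID:
        if other != node:
            value *= (x - other) / (node - other)
    return value


def interpolate(table):
    """Return the bivariate polynomial that agrees with `table` on GRID x GRID."""
    def poly(x, y):
        return sum(table(a, b) * basis(a, x) * basis(b, y)
                   for a, b in product(GRID, GRID))
    return poly


kleene_and = interpolate(min)  # strong Kleene conjunction
kleene_or = interpolate(max)   # strong Kleene disjunction


def kleene_not(x):
    """Negation is already a degree-1 polynomial under this encoding."""
    return -x


# The interpolants reproduce the truth tables exactly on the grid.
grid_ok = all(kleene_and(a, b) == min(a, b) and kleene_or(a, b) == max(a, b)
              for a, b in product(GRID, GRID))
```

Because the interpolants use only addition, subtraction, multiplication, and division by constants, the same expressions should apply elementwise to tensors and stay differentiable between grid points, which is the property a compiled tensor-op graph relies on.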

2605.20201 2026-05-25 cs.CL cs.AI cs.LG

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

基于代理思维链调优的长上下文推理

Miao Li, Irina Saparina, Alexander Gurung, Mirella Lapata

发表机构 * School of Informatics, University of Edinburgh(爱丁堡大学信息学院)

AI总结 该研究针对大语言模型在长上下文复杂推理任务中表现不佳的问题,提出了一种名为ProxyCoT的新训练框架。该方法通过在短代理上下文中获取高质量的推理轨迹,并将其迁移到完整的长上下文中,从而提升模型的长上下文推理能力。实验表明,ProxyCoT在多个数据集上均优于现有方法,且计算开销更低,同时具备良好的跨领域泛化能力。

Comments Long paper, ACL 2026 (Main conference)

AI中文摘要

近期的大语言模型支持高达1000万token的输入,但在需要复杂推理的长上下文任务上表现不佳。此类任务可以通过仅使用输入的一个子集(即代理上下文)而非完整序列来解决。尽管共享相同的底层推理过程,模型在代理上下文和完整上下文之间表现出显著的性能差异。为了改进长上下文推理,我们提出了ProxyCoT,一种新颖的训练框架,将推理能力从短代理上下文迁移到完整长上下文。具体来说,我们首先通过强化学习或从更大的教师模型蒸馏,在代理上下文中获得高质量的思维链推理轨迹,然后通过监督微调将这些生成的轨迹锚定到完整长上下文中。跨不同数据集的实验表明,ProxyCoT在减少计算开销的同时,始终优于强基线。此外,使用ProxyCoT训练的模型能够将其长上下文推理能力泛化到域外任务。

英文摘要

Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context -- rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.

2605.20192 2026-05-25 cs.CL cs.CE cs.CR cs.CY q-fin.CP

Leveraging Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentraland's MANA Token

利用大语言模型进行情感分析:Decentraland的MANA代币多模态分析

Xintong Wu, Peiting Tsai, Jing Yuan, Michael Yu, Greg Sun, Luyao Zhang

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Microsoft(微软) Duke Kunshan University(杜克昆山大学)

AI总结 本文研究了如何利用大型语言模型分析Decentraland虚拟平台中Discord社区的情感,结合多模态金融数据提升对MANA代币价格的预测能力。研究采用基于BERT的模型进行情感分析,并构建了两种LSTM架构,分别基于历史价格和融合情感评分、交易量及市值的多模态特征。实验表明,多模态模型在预测准确性上显著优于仅使用价格数据的基线模型,揭示了社区情感信号在虚拟经济预测中的重要价值。

AI中文摘要

Decentraland是一个在扩展的元宇宙生态系统中运行的去中心化虚拟现实平台,利用其原生MANA代币促进虚拟资产交易和治理。本研究探讨将Discord社区情感与多模态金融数据相结合,以增强虚拟世界经济中的加密货币价格预测。我们解决以下问题:(1) 识别Decentraland的Discord社区内的情感模式,以及(2) 评估多模态特征对代币回报预测的影响。使用基于BERT的大语言模型进行情感分析,我们开发了两种LSTM架构:一种包含历史价格的基线模型,另一种集成情感分数、交易量和市值的多模态变体。结果显示社区情感以中性为主,但存在正向偏斜。多模态模型在预测准确性上显著优于仅基于价格的基线模型。这些发现证明了社区衍生信号对虚拟经济预测的预测价值,并为未来在沉浸式虚拟环境、自然语言处理和加密货币市场分析交叉领域的研究奠定了基础。

英文摘要

Decentraland, a decentralized virtual reality platform operating within the expanding Metaverse ecosystem, utilizes its native MANA token to facilitate virtual asset transactions and governance. This study investigates the integration of Discord community sentiment with multi-modal financial data to enhance cryptocurrency price prediction within virtual world economies. We address: (1) identifying sentiment patterns within Decentraland's Discord community, and (2) evaluating the impact of multi-modal features on token return forecasting. Using a BERT-based large language model for sentiment analysis, we develop two LSTM architectures: a baseline incorporating historical prices and a multi-modal variant integrating sentiment scores, trading volume, and market capitalization. Results indicate predominantly neutral community sentiment with a positive skew. The multi-modal model significantly outperforms the price-only baseline in prediction accuracy. These findings demonstrate the predictive value of community-derived signals for virtual economy forecasting and establish a foundation for future research at the intersection of immersive virtual environments, natural language processing, and cryptocurrency market analysis.

2605.20087 2026-05-25 cs.CL cs.AI

ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

ThoughtTrace: 理解真实世界LLM交互中的用户想法

Chuanyang Jin, Binze Li, Haopeng Xie, Cathy Mengying Fang, Tianjian Li, Shayne Longpre, Hongxiang Gu, Maximillian Chen, Tianmin Shu

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Massachusetts Institute of Technology(麻省理工学院) Google Research(谷歌研究)

AI总结 ThoughtTrace 是首个大规模数据集,记录了真实场景中用户与AI的多轮对话及其用户自我报告的思考内容,揭示了用户发送提示的原因和对AI回复的反应。该数据集包含1,058名用户、2,155次对话及10,174条思考注释,分析表明用户思考内容在语义上与对话消息不同,且难以被当前先进大模型准确推断。研究进一步展示了思考内容在行为预测和个性化助手训练中的应用价值,为理解用户潜在目标和需求提供了新的数据模态。

Comments 53 pages, 23 figures, 4 tables. Project website: https://thoughttrace-project.github.io/

AI中文摘要

对话式AI现已服务数十亿用户,但现有数据集仅捕捉用户所说,而非所想。我们引入ThoughtTrace,首个大规模数据集,将真实世界多轮人机对话与用户自述想法配对:用户发送提示的原因以及对助手回复的反应。ThoughtTrace包含1,058名用户、2,155次对话、17,058轮次和10,174条想法标注,数据收集自与20个语言模型的交互。我们的分析表明,ThoughtTrace捕捉了长期、主题多样的交互,且想法在语义上不同于消息,前沿LLM难以从上下文中推断,内容多样,并与对话阶段相关。我们进一步展示了想法在下游建模中的实用性。首先,想法作为推理时上下文改善了用户行为预测。其次,想法引导的重写为训练个性化助手提供了细粒度对齐信号。总之,ThoughtTrace将用户想法确立为研究人机交互背后认知动态的新数据模态,并为构建更好理解和适应用户潜在目标、偏好与需求的助手奠定了基础。

英文摘要

Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human-AI conversations with users' self-reported thoughts: their reasons for sending prompts and reactions to assistant responses. ThoughtTrace comprises 1,058 users, 2,155 conversations, 17,058 turns, and 10,174 thought annotations collected across 20 language models. Our analysis shows that ThoughtTrace captures long-horizon, topically diverse interactions, and that thoughts are semantically distinct from messages, difficult for frontier LLMs to infer from context, diverse in content, and tied to conversation stages. We further demonstrate the utility of thoughts for downstream modeling. First, thoughts improve user-behavior prediction as inference-time context. Second, thought-guided rewrites provide fine-grained alignment signals for training personalized assistants. Together, ThoughtTrace establishes user thoughts as a new data modality for studying the cognitive dynamics behind human-AI interaction and provides a foundation for building assistants that better understand and adapt to users' latent goals, preferences, and needs.

2605.18993 2026-05-25 cs.LG cs.AI

Distilling Linearized Behavior into Non-Linear Fine-Tuning for Effective Task Arithmetic

将线性化行为蒸馏到非线性微调中以实现有效的任务算术

Thomas Sommariva, Francesca Morandi, Simone Calderara, Angelo Porrello

发表机构 * University of Pisa, Italy(比萨大学,意大利)

AI总结 该研究探讨了如何在非线性微调中保留线性微调在任务向量组合中的优势。作者提出通过在激活空间中施加约束,使非线性模型在权重扰动上保持线性特性,并通过从线性化教师模型中蒸馏隐藏表示来训练学生模型。该方法在保持任务向量可组合性的同时,避免了推理时的额外开销,在视觉和语言任务中表现出色。

Comments Accepted at ICML 2026

AI中文摘要

任务向量组合已成为编辑预训练模型的一种有前景的范式,通过加法实现模型合并,通过减法实现模型遗忘。在预训练模型的切空间中进行微调(线性微调)已被证明是有效的,因为它产生的任务向量自然解缠且抗干扰。然而,线性化模型在训练期间表达能力有限,并且在推理时计算成本较高,这限制了它们的实际应用。在这项工作中,我们弥合了线性微调与标准非线性微调之间的差距。我们表明,关于权重扰动的线性性(一种在参数空间中定义的属性)可以通过在训练期间在激活空间中施加约束来强制执行。具体来说,我们将曲率正则化的线性化教师模型的隐藏表示蒸馏到通过常规微调训练的非线性学生模型中。我们发现,得到的模型继承了线性化模型在任务算术中的关键属性,能够实现任务向量的有效组合,并在视觉和语言基准测试中实现强性能,而不会产生任何推理开销。

英文摘要

Task vector composition has emerged as a promising paradigm for editing pre-trained models, enabling model merging through addition and unlearning through subtraction. Fine-tuning in the tangent space of a pre-trained model (linear fine-tuning) has proven effective, as it produces task vectors that are naturally disentangled and resistant to interference. However, linearized models suffer from limited expressivity during training and incur higher computational costs at inference time, which restrict their practical applicability. In this work, we bridge the gap between linear and standard non-linear fine-tuning. We show that linearity with respect to weight perturbations, a property defined in parameter space, can be enforced through constraints in activation space during training. Concretely, we distill hidden representations from a curvature-regularized linearized teacher into a non-linear student trained via conventional fine-tuning. We find that the resulting model inherits key properties of linearized models for task arithmetic, enabling effective composition of task vectors and achieving strong performance across vision and language benchmarks without incurring any inference-time overhead.
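For readers unfamiliar with task arithmetic, the operations the abstract refers to (merging through addition, unlearning through subtraction) can be sketched on toy parameter dictionaries. This shows only the standard task-vector bookkeeping, with scalar "weights" standing in for tensors. It does not reproduce the paper's distillation or curvature regularization, and the function names are hypothetical.

```python
def task_vector(pretrained, finetuned):
    """Task vector: the element-wise difference between fine-tuned and pre-trained weights."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}


def negate(vector):
    """Negated task vector, used to subtract (unlearn) a task."""
    return {name: -delta for name, delta in vector.items()}


def apply_task_vectors(pretrained, vectors, scale=1.0):
    """Add one or more scaled task vectors to the pre-trained weights."""
    edited = dict(pretrained)
    for vector in vectors:
        for name, delta in vector.items():
            edited[name] += scale * delta
    return edited


# Toy weights: scalars stand in for full parameter tensors.
pretrained = {"w": 1.0, "b": 0.0}
finetuned_a = {"w": 1.5, "b": 0.25}
finetuned_b = {"w": 0.5, "b": 1.0}

tv_a = task_vector(pretrained, finetuned_a)
tv_b = task_vector(pretrained, finetuned_b)

merged = apply_task_vectors(pretrained, [tv_a, tv_b])        # model merging by addition
unlearned = apply_task_vectors(pretrained, [negate(tv_a)])   # unlearning by subtraction
```

Whether the merged model behaves well on both tasks depends on how much the task vectors interfere, which is the property that linearized fine-tuning is reported to improve.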

2605.18911 2026-05-25 cs.LG cs.AI

Does Your Wildfire Prediction Model Actually Work, or Just Score Well?

你的野火预测模型真的有效,还是只是得分高?

Yangshuang Xu, Yuyang Dai, Liling Chang, Qi Wang, Yushun Dong

发表机构 * Florida State University(佛罗里达州立大学) Northeastern University(东北大学)

AI总结 本文研究了现有地球基础模型在野火预测任务中的实际有效性问题,指出当前模型虽在通用大气和地球物理任务上表现良好,但未针对野火预测进行专门预训练。为此,作者提出了首个专门用于野火预测的预训练模型WILDFIRE-FM,并引入了一种固定合约评估框架,以解决野火事件稀疏性带来的评估偏差问题。研究结果表明,野火预测的迁移结论高度依赖于评估设计和任务设定,为未来相关研究提供了新的基准和方法支持。

Comments 25 pages

AI中文摘要

野火预测对于早期预警和资源分配至关重要,然而现有的地球基础模型(Earth FMs)是为通用大气和地球物理目标预训练的,而非野火预测。为弥补这一空白,我们提出了WILDFIRE-FM,这是首个专门针对野火预测预训练的基础模型,使用了天气、活跃火观测、地形、植被和静态环境数据。然而,仅引入特定领域的骨干网络并不能解决评估问题:野火事件在时空上稀疏,使得迁移结论对匹配规则和评估设置高度敏感。为解决这一问题,我们引入了一个固定合约评估框架,包含两个受控检查:固定输出检查用于匹配规则效应,固定特征检查用于头部选择效应。在匹配合约下,我们在占用、蔓延、检索和回归任务上将WILDFIRE-FM与十个地球基础模型基线进行比较。结果表明,野火迁移结论强烈依赖于评估设计和任务制定。我们希望该框架和WILDFIRE-FM能为未来野火特定的地球基础模型研究和基准测试提供基础。我们的代码可在 https://anonymous.4open.science/r/Wildfire-fm-evaluation-contracts-5AE9/ 获取。

英文摘要

Wildfire prediction is important for early warning and resource allocation, yet existing Earth foundation models (Earth FMs) are pretrained for general atmospheric and geophysical objectives rather than wildfire forecasting. To address this gap, we introduce WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction using weather, active-fire observations, topography, vegetation, and static environmental data. However, introducing a domain-specific backbone alone does not solve the evaluation problem: wildfire events are sparse in space and time, making transfer conclusions highly sensitive to matching rules and evaluation settings. To address this problem, we introduce a fixed-contract evaluation framework with two controlled checks: a fixed-output check for matching-rule effects and a fixed-feature check for head-selection effects. Under matched contracts, we compare WILDFIRE-FM with ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks. Our results show that wildfire transfer conclusions depend strongly on evaluation design and task formulation. We hope this framework and WILDFIRE-FM provide a foundation for future wildfire-specific Earth-FM research and benchmarking. Our code is available at https://anonymous.4open.science/r/Wildfire-fm-evaluation-contracts-5AE9/.

2605.18859 2026-05-25 cs.LG cs.AI

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

TwinRouterBench:面向现实智能体LLM路由的快速静态与实时动态评估

Pei Yang, Wanyi Chen, Tongyun Yang, Pengbin Feng, Jiarong Xing, Wentao Guo, Yuhang Yao, Yuhang Han, Hanchen Li, Xu Wang, Zeyu Wang, Jie Xiao, Anjie Yang, Liang Tian, Lynn Ai, Eric Yang, Tianyu Shi

发表机构 * Gradient Soochow University(苏州大学) Independent Researcher(独立研究者) University of Southern California(南加州大学) Rice University(莱斯大学) Carnegie Mellon University(卡内基梅隆大学) Shanghai Jiao Tong University(上海交通大学) University of California, Berkeley(加州大学伯克利分校) University of the Chinese Academy of Sciences(中国科学院大学) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 本文提出 TwinRouterBench,一个用于评估代理式大语言模型(LLM)路由策略的基准工具,旨在支持静态和动态场景下的高效评估。该基准包含两个赛道:静态赛道提供多个任务中的模型调用前缀及对应的最优模型层级,通过确定性计算进行评分;动态赛道则在真实代理系统中运行路由策略,评估其在实际任务完成和成本控制方面的表现。该工作为路由算法的开发与优化提供了全面且高效的实验平台。

AI中文摘要

LLM路由在长时任务(如编码智能体、深度研究系统和计算机使用智能体)中最为重要,其中单个用户请求会触发多次模型调用。将每次调用路由到最便宜的足够模型可以在不牺牲质量的情况下降低成本,然而现有的路由器基准仅评估一次性提示的路由。它们从未暴露中间智能体步骤中路由器可见的前缀,从未测试更便宜的替代品是否保留下游任务的成功,并且通常在评估时依赖在线LLM评判。我们引入了TwinRouterBench,一个具有两轨的步骤级路由基准。静态轨提供来自SWE-bench、BFCL、mtRAG、QMSum和PinchBench中520个实例的970个路由器可见前缀,每个前缀与在发布的降级和级联协议下估计的执行验证目标层级配对;评分是层级标签、轨迹成员资格和令牌成本的确定性算术,无需在线评估方LLM评判。动态轨提供一个工具,可在完整的500例SWE-bench Verified套件上运行路由器;本文报告了与静态SWE监督划分不相交的100例保留评估。每次LLM调用时,路由器从锁定池中选择一个具体模型,成功由官方任务解决率和实际API支出衡量。两轨支持快速离线迭代,随后在实时智能体执行下进行端到端验证。代码和数据可在https://github.com/CommonstackAI/TwinRouterBench获取。

英文摘要

LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator-side LLM judge. The dynamic track supplies a harness that runs routers on the full 500-case SWE-bench Verified suite; in this paper we report a 100-case held-out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end-to-end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.

2605.18329 2026-05-25 cs.CV cs.LG

Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation

迷失在折叠中:当交叉验证不是用于不确定性估计的深度集成时

Tristan Kirscher, Markus Bujotzek, Yannick Kirchhoff, Maximilian Rokuss, Fabian Isensee, Kim-Celine Kahl, Balint Kovacs, Klaus Maier-Hein

发表机构 * ICube Laboratory, CNRS UMR-7357, University of Strasbourg, Strasbourg, France(ICube实验室,法国斯特拉斯堡大学) CLCC Institut-Strauss, Strasbourg, France(CLCC斯特拉斯堡研究所) German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing(海德堡德国癌症研究中心(DKFZ)医学影像计算部门) Medical Faculty Heidelberg, Heidelberg University, Heidelberg, Germany(海德堡医学院,海德堡大学) Faculty of Mathematics and Computer Science, University of Heidelberg, Germany(海德堡大学数学与计算机科学学院) Helmholtz Imaging, German Cancer Research Center, Heidelberg, Germany(海德堡德国癌症研究中心Helmholtz成像部门) Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany(海德堡大学医院放射肿瘤学部模式分析与学习小组)

AI总结 在医学图像分割中,集成模型的分歧常被用作认识论不确定性的代理,但许多研究通过K折交叉验证(CV)构建集成模型,却称之为“深度集成”(DE),导致术语与实现不一致。本文对比了标准5折CV集成与5成员DE在三个多标注分割数据集上的表现,发现DE在保持分割精度的同时,提升了校准和失败检测能力,而CV集成有时与标注者间差异相关性更强。研究指出,应根据研究目标选择集成构建方式:DE适用于可靠性导向任务(如选择性转诊),CV集成则更适合作为模糊性代理。

Comments Accepted for publication at MICCAI 2026

Journal ref 29th International Conference On Medical Image Computing And Computer Assisted Intervention, Sep 2026, Strasbourg, France

AI中文摘要

集成不一致性被广泛用作医学图像分割中认知不确定性的代理。在实践中,许多研究通过K折交叉验证(CV)形成集成,却称之为“深度集成”(DE)。由于CV成员在不同的数据子集上训练,它们的不一致性混合了种子驱动变异和数据暴露效应,这可能改变不确定性的解释方式。我们审查了最近的分割不确定性研究,发现术语与实现不匹配很常见。然后,我们在三个涵盖三种模态的多标注者分割数据集上,在其他配置完全相同的条件下比较了标准5折CV集成与5成员DE(固定训练集,不同随机种子)。我们评估了不确定性在校准、故障检测、歧义建模和分布偏移下的鲁棒性。DE在匹配分割精度的同时改善了校准和故障检测,而CV集成在研究数据集上有时与标注者间变异性相关性更强。因此,应选择与研究问题匹配的集成构建方式:DE用于可靠性导向的使用(如选择性转诊/故障检测),CV集成作为歧义的代理。我们提供了一个轻量级的nnU-Net修改,使得在默认流程内能够进行DE训练。

英文摘要

Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as "deep ensembles" (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that terminology-implementation mismatches are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DE match segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability on the studied datasets. Thus, ensemble construction should be chosen to match the research question: DE for reliability-oriented use (e.g., selective referral/failure detection) and CV ensembles as a proxy for ambiguity. We provide a lightweight nnU-Net modification enabling DE training within the default pipeline.
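The distinction between the two ensemble constructions comes down to what varies across members. The sketch below lays out the training plan for each, assuming a simple interleaved fold assignment and integer seeds. It is an illustration of the definitions in the abstract, not the authors' nnU-Net modification, and the function names are hypothetical.

```python
def cv_ensemble_plan(n_samples, k):
    """K-fold CV ensemble: member i holds out fold i, so members see different data."""
    indices = list(range(n_samples))
    plan = []
    for fold in range(k):
        held_out = set(indices[fold::k])  # interleaved fold assignment (an assumption)
        train = [i for i in indices if i not in held_out]
        # Seeds typically differ too, so disagreement mixes seed and data-exposure effects.
        plan.append({"seed": fold, "train": train})
    return plan


def deep_ensemble_plan(n_samples, members):
    """Deep ensemble: every member trains on the same fixed set; only the seed differs."""
    return [{"seed": seed, "train": list(range(n_samples))} for seed in range(members)]


cv_plan = cv_ensemble_plan(n_samples=10, k=5)
de_plan = deep_ensemble_plan(n_samples=10, members=5)
```

In the CV plan each member sees 80% of the cases and no two members share the same subset, whereas in the deep-ensemble plan the training set is identical across members, so any disagreement can be attributed to the random seed alone.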

2605.17637 2026-05-25 cs.AI

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

WebGameBench: 通过浏览器原生游戏对编码代理进行需求到应用的评估

Wenyu Zhang, Guoliang You, Tianlun, Haotian Zhao, Tianshu Zhu, Haoran Wang, Xiaoxuan Tang, Mingyang Dai, Jingnan Gu, Daxiang Dong, Jianmin Wu

发表机构 * Baidu(百度) University of Science and Technology of China(中国科学技术大学)

AI总结 WebGameBench 是一个用于评估代码代理从需求到实际应用构建能力的基准,特别关注其能否将结构化的网页游戏规范转化为可在浏览器中运行的游戏。该基准通过浏览器原生游戏提供紧凑而行为丰富的测试环境,评估代理生成的应用是否具备可玩性、可用性及功能性。研究显示,当前最佳配置的可用率达到76.9%,但优秀率仅为20.2%,表明实现完整需求仍存在较大差距。WebGameBench 是首个基于浏览器原生游戏交付的从需求到应用评估的基准,其运行时评估标签在可用率标准下与人工游戏体验评审大体一致。

Comments 19 pages, 6 figures

Details
AI Chinese abstract

编码代理越来越多地被用作应用程序构建者,然而许多评估仍聚焦于源代码、仓库级测试或中间痕迹,而非交付的应用。我们引入WebGameBench,一个需求到应用的基准,评估编码代理能否将冻结的结构化Web游戏规范转化为可浏览器访问的游戏。浏览器原生游戏提供了一个紧凑但行为密集的测试平台:即使是简单的游戏也需要协调的输入处理、空间映射、规则执行、状态转换、终止条件、重启行为和可见反馈。在WebGameBench中,每个生成的工件在统一部署协议下被构建、服务并作为浏览器可访问的应用暴露。然后,运行时评估器在真实浏览器中与交付的游戏交互,并分配三类标签:优秀、可用或不可用。在人工审查的子集上,运行时标签与人类游戏审查在可用率标准下大致一致。在111个任务、12个编码代理和14个评估配置中,WebGameBench区分了当前系统:最佳配置达到76.9%的可用率,但仅有20.2%的优秀率。这一差距表明,跨越最低可玩交付阈值仍远未达到完全满足需求。据我们所知,WebGameBench是首个针对浏览器原生游戏交付的需求到应用基准,它在可用率标准下将交付应用的运行时标签与独立的人类游戏审查进行验证。

English abstract

Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed: even simple games require coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. In WebGameBench, each generated artifact is built, served, and exposed as a browser-accessible application under a unified deployment protocol. A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE. On a human-reviewed subset, the runtime label is broadly aligned with human gameplay review under the Usable-rate criterion. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, WebGameBench separates current systems: the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. This gap shows that crossing the minimum playable-delivery threshold is still far from complete requirement satisfaction. To our knowledge, WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion.

2605.17076 2026-05-25 cs.LG cs.AI cs.DC cs.MA

S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

S-Bus: 多智能体LLM状态协调的自动读集重建

Sajjad Khan

Affiliations * Sajjad Khan

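AI summary This paper presents S-Bus, an HTTP middleware that addresses concurrency control for multi-agent LLM systems sharing mutable state, in particular when agents cannot declare their read sets. Its core mechanism, the DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic, providing a consistency guarantee called Observable-Read Isolation (ORI) that prevents structural race conditions in dedicated-shard topologies. The contributions include mechanised formal verification, an empirical safety comparison against PostgreSQL and Redis, and an analysis of ORI's semantic impact under different workloads.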

Comments v2: LLM judge validated against human annotator (Zahid Hussain, Mindgigs Peshawar) on PH-3 at strict kappa=0.93 (n=93, 96.8% agreement); over-claim refined to 32% (LLM) / 49% (human). Adds Exp.PG-Comparison Rust-Native and Workload-B chi2=1094.98. 24 pages, 23 tables. Annotation data attached as arXiv ancillary files

Details
AI Chinese abstract

我们解决了通过HTTP共享可变状态的LLM智能体的并发控制问题,其中智能体无法被修改以声明读集。S-Bus是一个HTTP中间件,其核心机制——服务端DeliveryLog——在提交时从观察到的HTTP GET流量中重建每个智能体的读集。它提供的一致性属性——可观测读隔离(ORI),一种基于HTTP可观测读投影的部分因果一致性——防止了专用分片拓扑中的结构性竞态条件。 三项贡献:(C1)DeliveryLog机制,具有三层机械化证据:TLAPS证明了ReadSetSoundness和ORICommitSafety(基于一个类型公理);N=3时的穷举TLC探索了20,763,484个状态,零违规;Dafny验证了9个归纳引理。(C2)与PostgreSQL 17 SERIALIZABLE和Redis 7 WATCH/MULTI的经验安全对等:在884,110次提交尝试中(其中427,308次处于活跃争用下)零Type-I损坏。(C3)ORI在专用分片工作负载中语义中性,但在单分片协作写入中有害,因为保留传播并发矛盾。 v2更新:PH-3 LLM评判器现在已针对人类标注者(Zahid Hussain, Mindgigs Peshawar)在400个(步骤,分片)对上进行独立验证,严格kappa=0.93(n=93,原始一致性96.8%)。LLM间评判器一致性为kappa=0.46(边界方差)。智能体自我报告高估分片使用量32%(LLM评判器)至49%(人类标注者)。SJ-v4语义质量评分标准仍为单评判器LLM-only。 源代码、形式化证明、测试框架、标注数据:https://github.com/sajjadanwar0/sbus

English abstract

We address concurrency control for LLM agents sharing mutable state over HTTP, where agents cannot be modified to declare read sets. S-Bus is an HTTP middleware whose central mechanism, a server-side DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. The consistency property it provides, Observable-Read Isolation (ORI), a partial causal consistency over the HTTP-observable read projection, prevents Structural Race Conditions in dedicated-shard topologies. Three contributions. (C1) DeliveryLog mechanism with three-tier mechanised evidence: TLAPS proves ReadSetSoundness and ORICommitSafety (modulo one typing axiom); exhaustive TLC at N=3 explores 20,763,484 states with zero violations; Dafny discharges 9 inductive lemmas. (C2) Empirical safety parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI: zero Type-I corruptions across 884,110 commit attempts (427,308 under active contention). (C3) ORI is semantically neutral in dedicated-shard workloads but harmful in single-shard collaborative writing because preservation propagates concurrent contradictions. v2 update: the PH-3 LLM judge is now independently validated against a human annotator (Zahid Hussain, Mindgigs Peshawar) on 400 (step, shard) pairs at strict kappa=0.93 (n=93, 96.8% raw agreement). Inter-LLM-judge agreement is kappa=0.46 (boundary variance). Agent self-reports over-claim shard usage by 32% (LLM judge) to 49% (human annotator). The SJ-v4 semantic-quality rubric remains single-judge LLM-only. Source code, formal proofs, harness, annotation data: https://github.com/sajjadanwar0/sbus
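
The general idea of inferring a read set from observed reads, rather than asking the agent to declare it, can be illustrated with a toy in-memory model. This is not the S-Bus implementation or its API: the class `ToyBus`, its methods, and the version-based staleness check at commit time are all assumptions made for illustration, and the abstract does not specify the actual validation rule.

```python
class ToyBus:
    """Toy model (hypothetical): the server logs what each agent was served
    and uses that log as the agent's read set when it commits."""

    def __init__(self):
        self.shards = {}        # shard key -> (version, value)
        self.delivery_log = {}  # agent id -> {shard key: version delivered}

    def get(self, agent, key):
        # Serve the read and record which version this agent observed.
        version, value = self.shards.get(key, (0, None))
        self.delivery_log.setdefault(agent, {})[key] = version
        return value

    def commit(self, agent, key, value):
        # Reconstruct the read set from the log; the agent declares nothing.
        read_set = self.delivery_log.get(agent, {})
        stale = [k for k, seen in read_set.items()
                 if self.shards.get(k, (0, None))[0] != seen]
        if stale:
            return False  # assumed rule: reject if any observed shard changed
        version, _ = self.shards.get(key, (0, None))
        self.shards[key] = (version + 1, value)
        self.delivery_log[agent] = {}  # start a fresh read set after a commit
        return True


if __name__ == "__main__":
    bus = ToyBus()
    bus.get("a", "plan")
    bus.get("b", "plan")
    print(bus.commit("a", "plan", "v1"))  # a's read is still current
    print(bus.commit("b", "plan", "v2"))  # b read a version that a has replaced
```

In this toy model agent `b` has to re-read `plan` before its commit can succeed, which is the kind of conflict a declared read set would normally catch.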

2605.16799 2026-05-25 cs.LG cs.AI

Cross-Domain Molecular Relational Learning: Leveraging Chemical Structure-Activity Analysis

跨域分子关系学习:利用化学结构-活性分析

Peiliang Zhang, Jingling Yuan, Shiqing Wu, Mengqing Hu, Chao Che, Yongjun Zhu, Lin Li

Affiliations * Wuhan University of Technology(武汉理工大学) Yonsei University(延世大学) Hubei Key Laboratory of Transportation Internet of Things(湖北省交通运输物联网重点实验室) State Key Laboratory of Silicate Materials for Architectures(建筑硅酸盐材料国家重点实验室) City University of Macau(澳门城市大学) Kyung Hee University(庆熙大学) Dalian University(大连大学)

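AI summary This work addresses the lack of cross-domain modeling in molecular relational learning and proposes a cross-domain approach grounded in structure-activity analysis. The core method is DisTrans, a domain adversarial training network with structural-semantic transfer discrepancy. It uses substructure topological discrepancies between domains to guide the model in learning the domain dependence of molecular structures, and it aligns functional-group semantic information between the source and target domains to improve cross-domain adaptation. Experiments show that the method outperforms 16 baseline methods in two typical cross-domain scenarios and generalizes well.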

Comments Accepted by SIGKDD 2026 Research Track

Details
AI Chinese abstract

分子表示的最新进展整合了分子拓扑和视觉模态,为精确的分子关系学习(MRL)开辟了新途径。现有的MRL方法专注于域内建模,其固有的域封闭效应限制了在分子科学中的适用性,特别是在阐明跨域相互作用机制方面。因此,跨域分子关系学习的必要性日益迫切。受益于结构-活性分析,我们提出了具有结构语义迁移差异的域对抗训练网络(DisTrans),以优化分子结构和视觉图像的跨域自适应表示。1)我们利用基于域间子结构拓扑差异的梯度反转策略来学习分子结构的域依赖性。该策略引导模型适应目标域中的结构邻接模式,生成域可分离的结构表示。2)我们应用跨域表示引导机制来对齐源域和目标域之间的官能团语义信息,学习跨域一致性信息。在两种典型跨域策略中的实验结果表明,DisTrans优于16种基线方法,即使在显著的域间差异下也能保持令人满意的性能。

English abstract

Recent advances in molecular representation integrate molecular topological and visual modalities, opening new avenues for precise Molecular Relational Learning (MRL). Existing MRL methods focus on intra-domain modeling, and their inherent domain-closed effect limits applicability to molecular science, particularly in elucidating cross-domain interaction mechanisms. Consequently, the need for Cross-Domain Molecular Relational Learning has become increasingly pressing. Drawing on structure-activity analysis, we propose the Domain Adversarial Training Network with Structural-Semantic Transfer Discrepancy (DisTrans) to optimize cross-domain adaptive representation for molecular structures and visual images. 1) We employ a gradient reversal strategy based on substructure topological discrepancies between domains to learn the domain dependence of molecular structures. This strategy guides the model to adapt to the structural adjacency patterns in the target domain, generating domain-separable structural representations. 2) We apply a cross-domain representation guidance mechanism to align the functional-group semantic information between the source and target domains, learning cross-domain consistency information. The experimental results in two typical cross-domain strategies demonstrate that DisTrans outperforms 16 baseline methods, maintaining satisfactory performance even under pronounced inter-domain discrepancy.