arXivDaily: arXiv daily academic digest, updated Monday through Friday
2512.14338 2026-06-05 cs.LG

Implicit Bias and Invariance: How Hopfield Networks Efficiently Learn Graph Orbits


Michael Murray, Tenzin Chan, Kedar Karhadker, Christopher J. Hillar

Affiliations: Mathematical Sciences, University of Bath; Department of Mathematics, UCLA; Algebraic 4 New Theory AI

AI summary: The study examines how invariance emerges implicitly in Hopfield networks on learning problems with symmetry, and shows that gradient-descent learning of graph isomorphism classes carries an implicit bias that shapes sample complexity.

Abstract

Many learning problems involve symmetries, and while invariance can be built into neural architectures, it can also emerge implicitly when training on group-structured data. We study this phenomenon in classical Hopfield networks and show they can infer the full isomorphism class of a graph from a small random sample. Our results reveal that: (i) graph isomorphism classes can be represented within a three-dimensional invariant subspace, (ii) using gradient descent to minimize energy flow (MEF) has an implicit bias toward norm-efficient solutions, which underpins a polynomial sample complexity bound for learning isomorphism classes, and (iii) across multiple learning rules, parameters converge toward the invariant subspace as sample sizes grow. Together, these findings highlight a unifying mechanism for generalization in Hopfield networks: a bias toward norm efficiency in learning drives the emergence of approximate invariance under group-structured data.
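
As a rough illustration of the objects named in this abstract, the sketch below computes a Hopfield energy and an energy-flow objective over one-bit flips. It assumes binary states in {0, 1}, the energy convention E(x) = -1/2 x^T W x + theta^T x, and the minimum-probability-flow form of the objective known from earlier Hopfield work. The paper's exact MEF definition may differ, and the function names `energy` and `energy_flow` are hypothetical.

```python
import numpy as np

def energy(W, theta, x):
    # Hopfield energy for a binary state x in {0, 1}^n (assumed convention).
    return -0.5 * x @ W @ x + theta @ x

def energy_flow(W, theta, patterns):
    # Assumed objective: for each stored pattern x and each one-bit flip x',
    # add exp((E(x) - E(x')) / 2), then average over the patterns.
    total = 0.0
    for x in patterns:
        e_x = energy(W, theta, x)
        for i in range(x.size):
            x_flip = x.copy()
            x_flip[i] = 1 - x_flip[i]
            total += np.exp(0.5 * (e_x - energy(W, theta, x_flip)))
    return total / len(patterns)

# Example: a 3-unit network with a single symmetric coupling.
W = np.zeros((3, 3))
W[0, 1] = W[1, 0] = 1.0
theta = np.zeros(3)
pattern = np.array([1.0, 1.0, 0.0])
print(energy(W, theta, pattern))
print(energy_flow(W, theta, [pattern]))
```

Under this convention, a smaller objective value means the stored patterns sit at lower energy than their one-bit neighbours, which is what makes them stable states of the network.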

2601.09236 2026-06-05 cs.LG cs.AI

Reward Learning through Ranking Mean Squared Error


Chaitanya Kharyal, Calarina Muslimani, Matthew E. Taylor


AI summary: This paper proposes R4, a rating-based reinforcement learning method that learns a reward function from trajectory-rating pairs using a new ranking mean squared error loss, and reports strong results on robotic benchmarks.

Abstract

Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human ratings rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 uses a novel ranking mean squared error loss that learns from a dataset of trajectory-rating pairs, treating the human-provided discrete ratings (e.g., bad, neutral, good) as ordinal targets. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using both human-provided and simulated ratings, we demonstrate that R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic benchmarks from OpenAI Gym and the DeepMind Control Suite. Code released at https://github.com/IRLL/R4.

2502.14131 2026-06-05 cs.LG cs.AI econ.EM

An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model


Enoch H. Kang, Hema Yoganarasimhan, Lalit Jain

Affiliations: Foster School of Business, University of Washington

AI summary: This paper proposes an empirical risk minimization (ERM) framework for inverse reinforcement learning and dynamic discrete choice models. The method needs no explicit estimate of the state transition probabilities in the Bellman equation, applies to high-dimensional and infinite state spaces, and is backed by a Polyak-Lojasiewicz condition that guarantees fast global convergence.

Abstract

We study the problem of estimating Dynamic Discrete Choice (DDC) models, also known as offline Maximum Entropy-Regularized Inverse Reinforcement Learning (offline MaxEnt-IRL) in machine learning. The objective is to recover reward or $Q^*$ functions that govern agent behavior from offline behavior data. In this paper, we propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards. The novelty of our approach lies in introducing the Empirical Risk Minimization (ERM) based IRL/DDC framework, which circumvents the need for explicit state transition probability estimation in the Bellman equation. Furthermore, our method is compatible with non-parametric estimation techniques such as neural networks. Therefore, the proposed method has the potential to be scaled to high-dimensional, infinite state spaces. A key theoretical insight underlying our approach is that the Bellman residual satisfies the Polyak-Lojasiewicz (PL) condition -- a property that, while weaker than strong convexity, is sufficient to ensure fast global convergence guarantees. Through a series of synthetic experiments, we demonstrate that our approach consistently outperforms benchmark methods and state-of-the-art alternatives.
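
For reference, the Polyak-Lojasiewicz condition mentioned in this abstract is usually stated as below. The symbols $f$, $f^*$, $\mu$, $L$, and $\theta_k$ are generic notation assumed here and are not taken from the paper.

```latex
% PL condition for a differentiable f with minimum value f^* and a constant \mu > 0:
\frac{1}{2}\,\lVert \nabla f(\theta) \rVert^2 \;\ge\; \mu\,\bigl(f(\theta) - f^*\bigr)
\quad \text{for all } \theta .

% If \nabla f is also L-Lipschitz, gradient descent with step size 1/L satisfies
f(\theta_k) - f^* \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)^{k}\,\bigl(f(\theta_0) - f^*\bigr).
```

Strong convexity implies this inequality, but the inequality does not require convexity, which is why a linear convergence rate can still hold for objectives that are not strongly convex.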

2404.10370 2026-06-05 cs.CV cs.LG

Know Yourself Better: Diverse Object-Related Features Improve Open Set Recognition


Jiawen Xu, Margret Keuper

Affiliations: Technical University Berlin; University of Mannheim

AI summary: The study analyzes how feature diversity affects open set recognition and proposes a new open set recognition method that exploits it.

Abstract

Open set recognition (OSR) is a critical aspect of machine learning, addressing the challenge of detecting novel classes during inference. Within the realm of deep learning, neural classifiers trained on a closed set of data typically struggle to identify novel classes, leading to erroneous predictions. To address this issue, various heuristic methods have been proposed, allowing models to express uncertainty by stating "I don't know." However, a gap in the literature remains, as there has been limited exploration of the underlying mechanisms of these methods. In this paper, we conduct an analysis of open set recognition methods, focusing on the aspect of feature diversity. Our research reveals a significant correlation between learning diverse discriminative features and enhancing OSR performance. Building on this insight, we propose a novel OSR approach that leverages the advantages of feature diversity. The efficacy of our method is substantiated through rigorous evaluation on a standard OSR testbench, demonstrating a substantial improvement over state-of-the-art methods.

2601.08182 2026-06-05 cs.CV

Second-order Gaussian directional derivative representations for image high-resolution corner detection


Jiamiao Lu, Dongbo Xie, Junjie Qiu, Lingkun Ma, Changming Sun, Weichuan Zhang

Affiliations: School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology; CSIRO Data61

AI summary: This paper proposes a new high-resolution corner detection method. It smooths END-type and L-type high-resolution corner models with second-order Gaussian directional derivative (SOGDD) filters, identifies several characteristics of high-resolution corners, and thereby detects adjacent corners accurately. Experiments show that it outperforms existing methods in localization error, robustness to image blur, image matching, and 3D reconstruction.

Comments: 11 pages, 9 figures

Abstract

Corner detection is widely used in computer vision tasks such as image matching and 3D reconstruction. Our research indicates that there are theoretical flaws in Zhang et al.'s use of a simple corner model to obtain a series of corner characteristics, because the grayscale information of two adjacent corners can affect each other. To address this issue, a second-order Gaussian directional derivative (SOGDD) filter is used in this work to smooth two typical high-resolution corner models (i.e., END-type and L-type models). The SOGDD representations of these two corner models are then derived separately, and many characteristics of high-resolution corners are discovered, which enables us to demonstrate how to select Gaussian filtering scales to obtain intensity variation information from images and accurately depict adjacent corners. In addition, a new high-resolution corner detection method for images is proposed for the first time, which can accurately detect adjacent corner points. The experimental results verify that the proposed method outperforms state-of-the-art methods in terms of localization error, robustness to image blur transformation, image matching, and 3D reconstruction.
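
To make the filter named in this abstract concrete, here is a minimal sketch of a second-order Gaussian directional derivative kernel. It assumes an isotropic Gaussian, for which the second derivative along direction theta is ((x cos(theta) + y sin(theta))^2 / sigma^4 - 1 / sigma^2) g(x, y). The paper may use a different (for example anisotropic) Gaussian, and the function name `sogdd_kernel` and the 5-sigma truncation radius are assumptions of this sketch.

```python
import numpy as np

def sogdd_kernel(sigma, theta, radius=None):
    # Second derivative of an isotropic 2D Gaussian along direction theta.
    # Isotropy and the default truncation radius are assumptions.
    if radius is None:
        radius = int(np.ceil(5 * sigma))
    ax = np.arange(-radius, radius + 1, dtype=float)
    x, y = np.meshgrid(ax, ax)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    u = x * np.cos(theta) + y * np.sin(theta)  # coordinate along the direction
    return (u**2 / sigma**4 - 1.0 / sigma**2) * g

kernel = sogdd_kernel(sigma=2.0, theta=np.pi / 4)
print(kernel.shape)
```

A bank of such kernels over several values of `theta` and `sigma` could then be convolved with a grayscale image, for instance with `scipy.signal.convolve2d`, to obtain directional intensity-variation responses. Choosing `sigma` small relative to the spacing between adjacent corners is the kind of scale selection the abstract refers to.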

2505.05026 2026-06-05 cs.CL cs.LG

Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding


Jaehyun Jeon, Min Soo Kim, Jang Han Yoon, Sumin Shim, Yejin Choi, Hanbin Kim, Dae Hyun Kim, Youngjae Yu

Affiliations: Yonsei University; Seoul National University; NC AI

AI summary: This paper presents WiserUI-Bench, a benchmark for multimodal understanding of how UI/UX design affects user behavior. Using 300 real-world UI image pairs with expert interpretations, it finds that MLLMs have limited understanding of the behavioral impact of UI/UX design.

Comments: ACL 2026 Main. Our code and dataset: https://github.com/jeochris/wiserui-bench

Abstract

User interface (UI) design goes beyond visuals to shape user experience (UX), underscoring the shift toward UI/UX as a unified concept. While recent studies have explored UI evaluation using Multimodal Large Language Models (MLLMs), they largely focus on surface-level features, overlooking how design choices influence user behavior at scale. To fill this gap, we introduce WiserUI-Bench, a novel benchmark for multimodal understanding of how UI/UX design affects user behavior, built on 300 real-world UI image pairs from industry A/B tests, with empirically validated winners that induced more user actions. For future design progress in practice, post-hoc understanding of why such winners succeed with mass users is also required; we support this via expert-curated key interpretations for each instance. Experiments across multiple MLLMs on WiserUI-Bench for two main tasks, (1) predicting the more effective UI image between an A/B-tested pair, and (2) explaining it post-hoc in alignment with expert interpretations, show that models exhibit limited understanding of the behavioral impact of UI/UX design. We believe our work will foster research on leveraging MLLMs for visual design in user behavior contexts.

2601.02730 2026-06-05 cs.CV

HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps


Xuchang Zhong, Xu Cao, Jinke Feng, Hao Fang

Affiliations: Beijing Institute of Technology; University of Science and Technology of China

AI summary: This paper proposes a homography-guided network for fine-grained visual localization between multi-view images and standard-definition maps. It constructs input pairs that satisfy a homography constraint, uses homography relationships to guide feature fusion, and restricts pose outputs to a valid region, which improves training efficiency and localization accuracy.

Abstract

Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusion and direct 3-DoF pose regression. To the best of our knowledge, this is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization. Furthermore, by explicitly modeling homography transformations, the proposed framework naturally supports cross-resolution inputs, enhancing model flexibility. Extensive experiments on the nuScenes dataset demonstrate that our approach significantly outperforms existing state-of-the-art visual localization methods. Code and pretrained models will be publicly released to foster future research.
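
One way to see why a 3-DoF pose fits a homography formulation: in the BEV plane, a pose (x, y, yaw) is a rigid motion, which is a special case of a 3x3 homography. The sketch below illustrates only that geometric fact and is not HOLO's network. The function names `pose_to_homography` and `apply_homography` are hypothetical.

```python
import numpy as np

def pose_to_homography(tx, ty, yaw):
    # 3x3 homogeneous matrix for a planar rigid motion:
    # rotate by yaw, then translate by (tx, ty).
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

def apply_homography(H, points):
    # points: (N, 2) array-like; returns the transformed (N, 2) array.
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

H = pose_to_homography(1.0, 2.0, np.pi / 2)
print(apply_homography(H, [[1.0, 0.0]]))
```

Because the last row is fixed to (0, 0, 1), this matrix has only three free parameters. A general homography has eight, so restricting outputs to this family is one simple example of confining pose estimates to a feasible region.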

2512.15231 2026-06-05 cs.AI

CangLing-KnowFlow: A Unified Knowledge-and-Flow-fused Agent for Comprehensive Remote Sensing Applications


Zhengchao Chen, Haoran Wang, Jing Yao, Jianshe Zhang, Pedram Ghamisi, Jun Zhou, Peter M. Atkinson, Bing Zhang

Affiliations: State Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences; Beijing Tiandi Shijie Technology Co., Ltd.; Faculty of Science and Technology, Lancaster University; Faculty of Electrical and Computer Engineering, University of Iceland; Helmholtz-Zentrum Dresden-Rossendorf; School of Information and Communication Technology, Griffith University

AI summary: This paper presents CangLing-KnowFlow, a unified intelligent agent framework that fuses knowledge and workflow. By combining a Procedural Knowledge Base, Dynamic Workflow Adjustment, and an Evolutionary Memory Module, it addresses the task-specific nature of remote sensing data processing and the lack of a unified framework, and it performs well on the KnowFlow-Bench benchmark.

Abstract

The automated and intelligent processing of massive remote sensing (RS) datasets is critical in Earth observation (EO). Existing automated systems are normally task-specific, lacking a unified framework to manage diverse, end-to-end workflows--from data preprocessing to advanced interpretation--across diverse RS applications. To address this gap, this paper introduces CangLing-KnowFlow, a unified intelligent agent framework that integrates a Procedural Knowledge Base (PKB), Dynamic Workflow Adjustment, and an Evolutionary Memory Module. The PKB, comprising 1,008 expert-validated workflow cases across 162 practical RS tasks, guides planning and substantially reduces hallucinations common in general-purpose agents. During runtime failures, the Dynamic Workflow Adjustment autonomously diagnoses and replans recovery strategies, while the Evolutionary Memory Module continuously learns from these events, iteratively enhancing the agent's knowledge and performance. This synergy enables CangLing-KnowFlow to adapt, learn, and operate reliably across diverse, complex tasks. We evaluated CangLing-KnowFlow on KnowFlow-Bench, a novel benchmark of 324 workflows inspired by real-world applications, testing its performance across 13 top Large Language Model (LLM) backbones, from open-source to commercial. Across all complex tasks, CangLing-KnowFlow surpassed the Reflexion baseline by at least 4% in Task Success Rate. As the most comprehensive validation to date in this emerging field, this research demonstrates the great potential of CangLing-KnowFlow as a robust, efficient, and scalable automated solution for complex EO challenges by turning expert knowledge (Knowledge) into adaptive and verifiable procedures (Flow).

2510.10968 2026-06-05 cs.LG stat.ML

Blade: A Derivative-free Bayesian Inversion Method using Diffusion Priors


Hongkai Zheng, Austin Wang, Zihui Wu, Zhengyu Huang, Ricardo Baptista, Yisong Yue

Affiliations: California Institute of Technology; University of Toronto; Peking University

AI summary: This paper proposes Blade, which uses diffusion models as data-driven priors to estimate posteriors for high-dimensional, nonlinear problems in derivative-free Bayesian inversion, yielding accurate and well-calibrated posterior distributions.

Abstract

Derivative-free Bayesian inversion arises in science and engineering applications, particularly when the forward model is costly or infeasible to differentiate through. Existing derivative-free methods collapse the posterior to a point estimate or return severely over-confident uncertainty on high-dimensional, nonlinear problems. We introduce Blade, which produces accurate and well-calibrated posteriors using an ensemble of interacting particles. Blade leverages diffusion models as data-driven priors, and only queries the forward model through forward evaluations (i.e., derivative-free). Theoretically, we show the convergence and stability of Blade under forward model approximation and prior score estimation error. Empirically, on nonlinear fluid dynamics, Blade produces well-calibrated posterior samples that existing derivative-free methods cannot, as measured by CRPS, the spread-skill ratio, and the rank histogram. Its accuracy and calibration improve consistently with more iterations and particles, backed by our convergence and stability analysis and empirical experiments.
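
Two of the calibration metrics named in this abstract can be sketched for a scalar quantity as follows. The CRPS uses the standard ensemble (energy) form E|X - y| - 0.5 E|X - X'|. The spread-skill ratio is taken here as root-mean ensemble variance over the RMSE of the ensemble mean, which is one common convention; the paper may use a different normalization. The function names are hypothetical.

```python
import numpy as np

def crps_ensemble(samples, observation):
    # Ensemble CRPS for one scalar observation:
    # E|X - y| - 0.5 * E|X - X'|, with expectations over ensemble members.
    samples = np.asarray(samples, dtype=float)
    term_obs = np.mean(np.abs(samples - observation))
    term_pair = np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term_obs - 0.5 * term_pair

def spread_skill_ratio(ensembles, observations):
    # ensembles: (cases, members); observations: (cases,).
    # Assumed convention: sqrt(mean ensemble variance) / RMSE of ensemble mean.
    ensembles = np.asarray(ensembles, dtype=float)
    observations = np.asarray(observations, dtype=float)
    spread = np.sqrt(np.mean(np.var(ensembles, axis=1, ddof=1)))
    rmse = np.sqrt(np.mean((ensembles.mean(axis=1) - observations) ** 2))
    return spread / rmse

print(crps_ensemble([0.0, 2.0], 1.0))
print(spread_skill_ratio([[0.0, 2.0], [1.0, 3.0]], [2.0, 1.0]))
```

A lower CRPS is better. A spread-skill ratio well below 1 indicates an ensemble that is narrower than its own errors, which is the over-confidence the abstract attributes to existing derivative-free methods.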

2512.21430 2026-06-05 cs.RO

EVE: A Generator-Verifier System for Generative Policies


Yusuf Ali, Gryphon Patlin, Karthik Kothuri, Jeremiah Coholich, Muhammad Zubair Irshad, Wuwei Liang, Zsolt Kira

Affiliations: Georgia Institute of Technology; Toyota Research Institute; Symbotic Inc.

AI summary: This paper presents EVE, a generator-verifier framework that improves pretrained generative policies at test time by refining actions with zero-shot vision-language model verifiers, with no additional training.

Abstract

Visuomotor policies based on generative models such as diffusion and flow-matching have shown strong performance for robotics applications but degrade under distribution shifts, demonstrating limited recovery capabilities without costly finetuning. In the language modeling domain, test-time compute scaling has revolutionized the reasoning capabilities of modern LLMs by enabling candidate solution refinement. These methods typically leverage foundation models as verification modules in a zero-shot manner to score candidate solutions. We hypothesize that generative policies can similarly benefit from additional inference-time compute that employs zero-shot VLM-based verifiers in a generation-verification framework. To this end, we introduce EVE: a modular, generator-verifier interaction framework that boosts the performance of pretrained generative policies at test time, with no additional training. EVE wraps a frozen base policy with multiple zero-shot, VLM-based verifier agents. Each verifier proposes action refinements to the base policy's candidate actions, while an action incorporator uses classifier guidance to fuse aggregated verifier feedback into action denoising. We study design choices for generator-verifier information interfacing across a system of verifiers with distinct capabilities. Across diverse simulated and real robotic tasks and embodiments, EVE consistently improves success rates without additional policy or verifier training. Through extensive ablations, we isolate the contribution of verifier capabilities and action incorporator strategies, offering practical guidelines to build scalable, modular generator-verifier systems for embodied control.

2512.21218 2026-06-05 cs.CV

Latent Implicit Visual Reasoning


Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig

Affiliations: University of California, Berkeley; Xero; MIT-IBM Watson AI Lab

AI summary: This paper proposes a task-agnostic mechanism that trains Large Multimodal Models (LMMs) to discover and use latent visual reasoning tokens without explicit intermediate supervision. It outperforms direct supervised fine-tuning across diverse vision-centric tasks, and it matches or outperforms prior text-based and explicit-visual-intermediate reasoning methods without using helper images, bounding boxes, image crops, depth maps, or chain-of-thought annotations.

Abstract

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose Latent Implicit Visual Reasoning (LIVR), a task-agnostic mechanism that trains LMMs to discover and use latent visual reasoning tokens without explicit intermediate supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. LIVR consistently outperforms direct supervised fine-tuning across diverse vision-centric tasks and multiple LMM backbones. In broader comparisons, LIVR remains competitive with or outperforms prior text-based and explicit-visual-intermediate reasoning methods, while requiring no additional intermediate supervision such as helper images, bounding boxes, image crops, depth maps, or chain-of-thought annotations. Our project page can be found here: https://www.chuyishang.com/livr/

2512.20111 2026-06-05 cs.CL cs.AI cs.LG

ABBEL: Learning Natural-Language Belief States for Memory-Efficient Interaction


Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr

Affiliations: University of California, Berkeley; Georgia Institute of Technology

AI summary: This paper proposes ABBEL, a framework that directly supervises the information content of each summary through explicit natural-language belief states. It targets the information loss and update errors of earlier summarization approaches, improving interaction performance while keeping memory use efficient.

Abstract

As the time horizons of sequential decision-making tasks grow, keeping full interaction histories in model context becomes increasingly costly. Recent work reduces context lengths by instead conditioning decision-making agents on recursively updated natural-language summaries, which are concise and interpretable. However, they underperform agents with access to the full context, suggesting that they fail to generate sufficient summaries. To address this we propose ABBEL, a recursive summarization framework that isolates and directly supervises each summary's information contents in the form of explicit natural-language belief states. First, we analyze the belief states generated by frontier models under ABBEL across five domains, and verify that performance is often degraded due to omitting or incorrectly updating information. We also discover settings where models use memory inefficiently by retaining extraneous information. We target these limitations by fine-tuning with two RL-based methods: belief grading, which reduces update errors by rewarding belief generations based on their information content, and peak belief penalties, which encourage compressing the beliefs with the greatest memory footprints. We demonstrate that these methods significantly reduce the performance gap with full context models, and enable ABBEL to outperform prior memory agent work by 40% while using 67% of the memory. Our code is available at https://github.com/jakob-bjorner/optimal-explorer-dev

2512.19510 2026-06-05 cs.LG stat.ML

Toward Scalable and Valid Conditional Independence Testing with Spectral Representations


Alek Fröhlich, Vladimir R. Kostic, Karim Lounici, Daniel Perazzo, Daniel Tiezzi, Massimiliano Pontil

Affiliations: University of Cambridge; University of Oxford; ETH Zürich

AI summary: This paper proposes a learning approach based on spectral representations to address the limited adaptivity and scalability of traditional conditional independence tests. It builds a simple test statistic and a bi-level contrastive algorithm, establishes a theoretical link between representation learning error and test performance, and validates the approach on real and synthetic data.

Comments: Accepted at ICML 2026. Revised to match the accepted version; updated experiments and exposition

Abstract

Conditional independence (CI) is central to causal inference, feature selection, and graphical modeling, yet it is untestable in many settings without additional assumptions. Existing CI tests often rely on restrictive structural conditions, limiting their validity. Kernel methods using partial covariance operators offer a more principled approach but suffer from limited adaptivity and scalability. In this work, we explore whether representation learning can help address these limitations. Specifically, we focus on representations derived from the singular value decomposition of partial covariance operators and use them to construct a simple test statistic. We also introduce a bi-level contrastive algorithm to learn these representations. Our theory links representation learning error to test performance and establishes asymptotic validity and power guarantees. Experiments on real and synthetic data suggest that this approach offers a principled and statistically grounded path toward scalable CI testing, bridging kernel-based theory with modern representation learning.

2512.15153 2026-06-05 cs.CV

Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning


Mengshi Qi, Yeteng Wu, Wulian Yun, Xianlin Zhang, Huadong Ma

Affiliations: State Key Laboratory of Networking and Switching Technology

AI summary: This paper defines a new action form assessment task and introduces CoT-AFA, a dataset with multi-level annotations that contains a large number of fitness and martial arts videos. Using a new Chain-of-Thought explanation paradigm, it proposes an explainable fitness assessment framework to improve action analysis.

Abstract

Evaluating whether a human action is standard and providing reasonable feedback to improve action standardization is crucial but challenging in real-world scenarios. However, current video understanding methods are mainly concerned with what the action is and where it occurs, which does not meet these requirements. Meanwhile, most existing datasets lack labels indicating the degree of action standardization, and action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task and introduce a new diverse dataset, CoT-AFA, which contains a large number of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. Instead of offering isolated feedback, our explanations provide a complete reasoning process--from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which can not only judge an action but also explain why and provide a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities. The experimental results demonstrate that our method achieves improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy), and quality assessment (+2.1% in accuracy), revealing the great potential of CoT-AFA for future studies. Our dataset and source code are available at https://github.com/MICLAB-BUPT/EFA.

2512.14792 2026-06-05 cs.AI cs.SE

IaC Generation with LLMs: An Error Taxonomy and A Study on Configuration Knowledge Injection


Roman Nekrasov, Stefano Fossati, Indika Kumara, Damian Andrew Tamburri, Willem-Jan van den Heuvel

发表机构 * Jheronimus Academy of Data Science(Jheronimus数据科学学院) Tilburg University(蒂尔堡大学) Eindhoven University of Technology(埃因霍温理工大学) University of Sannio(萨诺尼大学)

AI总结 本研究探讨了如何通过系统性地注入结构化配置知识来提高LLM生成正确且意图一致的基础设施即代码(IaC)能力,特别是在Terraform中,提出了新的错误分类法,并评估了多种知识注入技术。

Comments Submitted to ACM

详情
AI中文摘要

大型语言模型(LLMs)目前在生成正确且意图一致的基础设施即代码(IaC)方面表现出较低的成功率。本研究调查了改进基于LLM的IaC生成方法,特别是针对Terraform,通过系统性地注入结构化配置知识。为此,现有的IaC-Eval基准测试被显著增强,加入了云模拟和自动错误分析。此外,开发了一种新的用于LLM辅助IaC代码生成的错误分类法。实现并评估了一系列知识注入技术,从简单的检索增强生成(RAG)到更复杂的图RAG方法。这些包括图组件的语义增强和资源间依赖关系的建模。实验结果表明,尽管基线LLM性能较差(整体成功率为27.1%),注入结构化配置知识将技术验证成功率提高到75.3%,整体成功率提高到62.6%。尽管这些进步在技术正确性方面有所提升,但意图一致性却停滞不前,揭示了“正确性-一致性鸿沟”,即LLMs可以成为熟练的“程序员”,但作为满足复杂用户意图的“架构师”却受限。

English abstract

Large Language Models (LLMs) currently exhibit low success rates in generating correct and intent-aligned Infrastructure as Code (IaC). This research investigated methods to improve LLM-based IaC generation, specifically for Terraform, by systematically injecting structured configuration knowledge. To facilitate this, an existing IaC-Eval benchmark was significantly enhanced with cloud emulation and automated error analysis. Additionally, a novel error taxonomy for LLM-assisted IaC code generation was developed. A series of knowledge injection techniques was implemented and evaluated, progressing from Naive Retrieval-Augmented Generation (RAG) to more sophisticated Graph RAG approaches. These included semantic enrichment of graph components and modeling inter-resource dependencies. Experimental results demonstrated that while baseline LLM performance was poor (27.1% overall success), injecting structured configuration knowledge increased technical validation success to 75.3% and overall success to 62.6%. Despite these gains in technical correctness, intent alignment plateaued, revealing a "Correctness-Congruence Gap" where LLMs can become proficient "coders" but remain limited "architects" in fulfilling nuanced user intent.

2512.08560 2026-06-05 cs.CV

BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain

BrainExplore: 在人脑中大规模发现可解释的视觉表征

Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, Michal Irani

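Institutions * Weizmann Institute of Science; Massachusetts Institute of Technology

AI summary This paper presents a large-scale automated framework for discovering and explaining visual representations across the human cortex. Unsupervised, data-driven decomposition finds candidate interpretable patterns, and natural-language descriptions are generated from the natural images that most strongly elicit each pattern. The framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.

Details
AI Chinese abstract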

理解人类大脑如何表示视觉概念,以及这些表示在哪些脑区编码,仍然是一个长期存在的挑战。几十年的研究已经提升了我们对视觉表征的理解,但脑信号仍然很大且复杂,可能的视觉概念空间非常广阔。因此,大多数研究仍处于小规模,依赖手动检查,专注于特定区域和概念,并很少进行系统验证。我们提出了一种大规模、自动化的框架,用于在人脑皮层上发现和解释视觉表征。我们的方法包括两个主要阶段。首先,我们通过无监督、数据驱动的分解方法在fMRI活动中发现候选可解释模式。其次,我们通过识别最能激发这些模式的自然图像集,并生成这些图像共享视觉意义的自然语言描述来解释每个模式。为了扩展这一过程,我们引入了一个自动化流程,测试多个候选解释,分配可靠性分数,并为每个脑区模式选择最佳描述。我们的框架揭示了成千上万种可解释模式,涵盖了许多不同的视觉概念,包括此前未报告的细粒度表征。

English abstract

Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and concepts, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns reliability scores, and selects the best description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.

2512.09706 2026-06-05 cs.LG

Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning

通过强化学习训练一个模型以掌握跨层级的代理行为

Kaichen He, Zihao Wang, Muyao Li, Anji Liu, Yitao Liang

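Institutions * Peking University; National University of Singapore

AI summary This paper proposes CrossHA, a unified agentic model that masters heterogeneous action spaces and autonomously selects the most effective interface for each step of a trajectory. Combining cold-start supervised fine-tuning with a Multi-Turn Group Relative Policy Optimization (GRPO) algorithm, it learns adaptive action switching and achieves state-of-the-art performance on over 800 tasks in the open-world Minecraft environment.

Comments Accepted to CVPR 2026 as a Highlight

Details
AI Chinese abstract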

代理AI的范式正从工程复杂的流程转向训练后的原生模型。然而,现有代理通常局限于静态的预定义动作空间,如仅使用API、GUI事件或机器人命令。这种刚性限制了它们在动态环境中适应性,其中最佳交互粒度会根据情境变化而变化。为弥合这一差距,我们提出了CrossHA,一种统一的代理模型,能够掌握异构的动作空间并自主选择每一步轨迹中最有效的接口。我们引入了一个全面的训练管道,整合了冷启动监督微调与多轮组相对策略优化(GRPO)算法。该方法使代理能够学习适应性动作切换,平衡高层效率与低层精度,而无需人为指定规则。在Minecraft开放世界中超过800个任务的广泛实验表明,CrossHA实现了最先进的性能。通过动态利用多样化的动作空间,我们的模型显著优于固定动作基线,展现出在长时间推理中的优越泛化能力和效率。所有代码和模型均在https://github.com/CraftJarvis/OpenHA上提供。

English abstract

The paradigm of agentic AI is shifting from engineered complex workflows to post-training native models. However, existing agents are typically confined to static, predefined action spaces, such as exclusively using APIs, GUI events, or robotic commands. This rigidity limits their adaptability in dynamic environments where the optimal granularity of interaction varies contextually. To bridge this gap, we propose CrossHA, a unified agentic model that masters heterogeneous action spaces and autonomously selects the most effective interface for each step of a trajectory. We introduce a comprehensive training pipeline that integrates cold-start supervised fine-tuning with a Multi-Turn Group Relative Policy Optimization (GRPO) algorithm. This approach enables the agent to learn adaptive action switching, balancing high-level efficiency with low-level precision, without human-specified rules. Extensive experiments on over 800 tasks in the open-world Minecraft environment demonstrate that CrossHA achieves state-of-the-art performance. By dynamically leveraging the strengths of diverse action spaces, our model significantly outperforms fixed-action baselines, exhibiting superior generalization and efficiency in long-horizon reasoning. All code and models are available at https://github.com/CraftJarvis/OpenHA.
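
The group-relative step that gives GRPO its name can be sketched as follows. This is a minimal illustration of the commonly described single-turn normalization, in which each rollout's reward is standardized against the other rollouts sampled for the same task. The abstract does not specify the multi-turn variant used for CrossHA, so the function name, the use of the population standard deviation, and the example rewards are all assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own sampling group.

    Hypothetical helper: real implementations differ in details such as
    the standard-deviation estimator and how turns are credited.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std of the group
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up returns for four rollouts of the same task (1.0 = success).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative, without a learned value function; how such advantages are spread across the turns of a trajectory is left open here.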

2511.21667 2026-06-05 cs.LG cs.AI

Escaping the Verifier: Learning to Reason via Demonstrations

摆脱验证者:通过示范学习推理

Locke Cai, Max Ryabinin, Ivan Provilkov

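Institutions * University of California, Berkeley; Stanford University

AI summary This paper proposes RARO, which uses Inverse Reinforcement Learning to learn strong reasoning capabilities from expert demonstrations without task-specific verifiers, yielding significant gains across the evaluation tasks.

Details
AI Chinese abstract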

训练大型语言模型(LLMs)进行推理通常依赖于带有任务特定验证器的强化学习(RL)。然而,许多现实世界的推理密集型任务缺乏验证器,尽管它们提供了大量未被充分利用的专家示范。我们引入RARO(相对论式对抗推理优化),仅通过逆强化学习从专家示范中学习强大的推理能力。RARO在策略与相对论式评论者(relativistic critic)之间建立对抗博弈:策略学习模仿专家答案,而评论者旨在从专家答案与策略答案组成的答案对中识别出专家答案。策略和评论者通过RL联合且持续地训练,我们还确定了实现稳健学习所需的关键稳定技术。实证结果表明,RARO在所有评估任务中均显著优于强大的无验证器基线:在Countdown(1.5B)上准确率提高13.7%,在DeepMath(7B)上准确率提高8.2%,在Poetry Writing(7B)上对专家诗歌的胜率提高19.1%。RARO还表现出与带验证器的RL相似的稳健扩展趋势。这些结果表明,RARO能够仅从专家示范中有效激发强大的推理性能,即使在任务特定验证器不可用时也能实现稳健的推理学习。

English abstract

Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization), which learns strong reasoning capabilities from expert demonstrations alone via Inverse Reinforcement Learning. RARO sets up an adversarial game between a policy and a relativistic critic: the policy learns to mimic expert answers, while the critic aims to identify the experts among expert-policy answer pairs. Both the policy and the critic are trained jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines across all evaluation tasks: +13.7% accuracy on Countdown (1.5B), +8.2% accuracy on DeepMath (7B), and +19.1% win-rate on Poetry Writing (7B) against expert poems. RARO also exhibits similar robust scaling trends as RL with verifiers. These results demonstrate that RARO effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.

2512.05774 2026-06-05 cs.CV cs.AI cs.CL

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

主动视频感知:用于代理长视频理解的迭代证据寻求

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles

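Institutions * Salesforce AI Research; University of North Carolina at Chapel Hill

AI summary This paper proposes Active Video Perception (AVP), a framework that uses an iterative plan-observe-reflect process to actively decide what video content to observe and when, improving the accuracy and efficiency of long video understanding.

Comments Website: https://activevideoperception.github.io/

Details
AI Chinese abstract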

长视频理解(LVU)具有挑战性,因为回答现实世界查询往往依赖于稀疏、时间分散的线索,这些线索隐藏在数小时的大部分冗余和无关内容中。尽管代理流程提高了视频推理能力,但现有框架依赖于查询无关的描述器来感知视频信息,这浪费了计算资源并模糊了细粒度的时间和空间信息。受主动感知理论的启发,我们主张LVU代理应主动决定观察什么、何时和在哪里观察,并持续评估当前观察是否足够回答查询。我们提出了主动视频感知(AVP),一种证据寻求框架,将视频视为交互环境,并直接从像素中获取紧凑、查询相关的证据。具体而言,AVP运行一个迭代的计划-观察-反思过程,使用MLLM代理。在每个轮次中,计划者提出有针对性的视频交互,观察者执行以提取时间戳证据,反思者评估证据对查询的充分性,要么终止并给出答案,要么触发进一步观察。在五个LVU基准测试中,AVP实现了最高整体准确率,有显著提升。值得注意的是,AVP在平均整体准确率上比最佳代理方法高出5.7%,同时仅需18.4%的推理时间和12.4%的输入令牌。

English abstract

Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest overall accuracy with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average overall accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
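
The control flow of the plan-observe-reflect loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the AVP implementation: the three roles are MLLM agents in the paper, whereas here `plan`, `observe`, and `reflect` are hypothetical stand-in functions, and the "video" is a list of (timestamp, caption) pairs invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    timestamp_s: float
    note: str


def plan(query, evidence, round_idx, window_s=60.0):
    """Hypothetical planner: choose the next time window to inspect."""
    start = window_s * round_idx
    return (start, start + window_s)


def observe(video, window):
    """Hypothetical observer: collect time-stamped evidence in the window."""
    start, end = window
    return [Evidence(t, note) for t, note in video if start <= t < end]


def reflect(query, evidence):
    """Hypothetical reflector: answer if the evidence suffices, else None."""
    for item in evidence:
        if query in item.note:
            return f"{query} at {item.timestamp_s:.0f}s"
    return None


def active_perception(query, video, max_rounds=10):
    evidence = []
    for round_idx in range(max_rounds):
        window = plan(query, evidence, round_idx)
        evidence.extend(observe(video, window))
        answer = reflect(query, evidence)
        if answer is not None:  # halt as soon as the evidence is sufficient
            return answer, round_idx + 1
    return None, max_rounds


# Toy "video": sparse captions at a few timestamps (made-up data).
toy_video = [(30.0, "street"), (95.0, "kitchen"), (200.0, "red car parked")]
answer, rounds_used = active_perception("red car", toy_video)
```

The loop stops at the first round whose evidence satisfies the reflector, so only the inspected windows are processed rather than the whole video.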

2508.10875 2026-06-05 cs.CL cs.AI cs.LG

A Survey on Diffusion Language Models

扩散语言模型的综述

Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen

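Institutions * VILA Lab, Mohamed bin Zayed University of Artificial Intelligence; Department of Automation, Tsinghua University

AI summary This survey reviews the current state of Diffusion Language Models, examines their relationship with autoregressive and masked language models, analyzes pre-training strategies, post-training methods, and inference optimizations, and discusses multimodal extensions, applications, limitations, and future research directions.

Details
AI Chinese abstract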

扩散语言模型(DLMs)正迅速崛起为一种强大的替代方案,以取代主导的自回归(AR)范式。通过迭代去噪过程并行生成令牌,DLMs在减少推理延迟和捕捉双向上下文方面具有固有优势,从而实现对生成过程的精细控制。尽管实现了数倍的加速,最近的进展使DLMs在性能上与自回归模型相当,使其成为各种自然语言处理任务的有力选择。在本文综述中,我们提供了当前DLM景观的全面概述。我们追踪其演变及其与其他范式,如自回归和掩码语言模型的关系,并涵盖了基础原理和最先进模型。我们的工作提供了一个最新、全面的分类法以及对当前技术的深入分析,从预训练策略到高级后训练方法。本文的另一个贡献是全面回顾DLM推理策略和优化,包括解码并行性、缓存机制和生成质量的改进。我们还突出了DLM多模态扩展的最新方法,并阐述了它们在各种实际场景中的应用。此外,我们的讨论还讨论了DLMs的局限性和挑战,包括效率、长序列处理和基础设施需求,同时概述了未来研究方向,以维持该快速发展的领域中的进步。Project GitHub可在https://github.com/VILA-Lab/Awesome-DLMs上找到。

English abstract

Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.

2511.21338 2026-06-05 cs.LG

Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models

掩码可能具有干扰性:关于扩散语言模型中的上下文理解

Julianna Piskorz, Cristina Pinneri, Alvaro Correia, Motasem Alfarra, Risheek Garrepalli, Christos Louizos

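Institutions * University of Cambridge

AI summary This paper studies how mask tokens affect context comprehension in diffusion language models, finds that masks act as distractors that impair the processing of relevant information, and proposes a mask-agnostic loss function to improve robustness.

Comments Published at the Forty-Third International Conference on Machine Learning (ICML 2026)

Details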

English abstract

Masked Diffusion Language Models (MDLMs) have recently emerged as a promising alternative to Autoregressive Language Models (ARLMs), leveraging a denoising objective that, in principle, should enable more uniform context utilisation. In this work, we examine the context comprehension abilities of MDLMs and uncover two key limitations. First, despite their more global training objective and bidirectional attention mechanism, MDLMs, like ARLMs, exhibit a strong locality bias: performance is highly sensitive to the position of relevant information within the input, favouring local over distant context. Second, we show that appending a large number of mask tokens, which generation requires, can significantly degrade context comprehension. Through systematic ablations, we find that these masks act as distractors, reducing the model's ability to process relevant information. To address this, we introduce a mask-agnostic loss function that encourages predictions to remain invariant to the number of appended masks. Fine-tuning with this objective substantially mitigates the distracting effect of masks, improving the robustness of MDLMs. Overall, our findings reveal critical limitations of the current MDLM training paradigm and provide actionable insights for building diffusion-based language models with stronger context comprehension.
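
One way to picture a loss that "encourages predictions to remain invariant to the number of appended masks" is a consistency penalty between the predictive distributions obtained for the same position under two different mask counts. The sketch below uses a symmetric KL divergence for that purpose. This is an assumption made for illustration only: the abstract does not state the form of the authors' loss, and the function names and example probabilities are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mask_invariance_penalty(pred_few_masks, pred_many_masks):
    """Symmetric KL between predictions made with different mask counts.

    Hypothetical regularizer: zero when the two predictions agree,
    positive when appending more masks shifts the prediction.
    """
    return 0.5 * (
        kl_divergence(pred_few_masks, pred_many_masks)
        + kl_divergence(pred_many_masks, pred_few_masks)
    )


# Made-up next-token distributions for the same position and context.
unchanged = mask_invariance_penalty([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
shifted = mask_invariance_penalty([0.7, 0.2, 0.1], [0.4, 0.4, 0.2])
```

In a fine-tuning setup such a term would be added to the usual denoising objective, so that a prediction which drifts as masks are appended is penalized.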

2511.20613 2026-06-05 cs.LG cs.AI cs.MA

Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning

能否用Vibe编码击败研究生计算机科学学生?一个LLM与人类编码竞赛在市场驱动的战略规划中的表现

Panayiotis Danassis, Naman Goel

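Institutions * University of Southampton; University of Oxford and Alan Turing Institute

AI summary This paper introduces a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (the Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. Comparing 40 LLM-coded agents with 17 human-coded agents over 12 double all-play-all tournaments and about 40k matches, the study shows the advantage of human-coded agents in strategic planning and optimization, and a gap in the ability of LLMs to produce competitive code for real-world problems.

Details
AI Chinese abstract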

大型语言模型(LLMs)的快速普及已经革新了AI辅助代码生成。然而,LLMs的快速发展已超出我们对其进行恰当基准评测的能力。现有的基准测试强调单元测试通过率和语法正确性。这些指标低估了许多需要规划、优化和战略互动的真实世界问题的难度。我们引入了一个基于现实物流优化问题(拍卖、取件和送货问题)的多智能体推理驱动基准,该问题结合了竞争拍卖与容量受限路由。该基准要求构建能够(i)在不确定性下进行战略投标,并(ii)优化规划器、在完成配送任务的同时最大化利润的代理。我们评估了40个LLM编码的代理(由多种最先进的LLMs在多种提示方法下生成,包括Vibe编码)与17个在LLM出现之前开发的人类编码代理。我们在12场双循环赛和约4万场比赛中的结果显示:(i)人类(研究生)编码代理具有明显优势:前5名始终由人类编码代理占据;(ii)大多数LLM编码代理(40个中的33个)被非常简单的基线所击败;(iii)在给定最佳人类解决方案作为输入并提示改进的情况下,表现最好的LLM使解决方案显著变差而不是改进。我们的结果突显了LLMs在生成能在现实世界中具有竞争力的代码方面的差距,并促使人们开展新的评估,以强调现实世界场景中推理驱动的代码合成。

English abstract

The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and about 40k matches demonstrate (i) a clear superiority of agents coded by humans (graduate students): the top 5 spots are consistently won by human-coded agents, (ii) that the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) that, given the best human solution as input and prompted to improve upon it, the best-performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.
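
The reported match count can be sanity-checked with a short calculation. In a double all-play-all (double round-robin) format every pair of agents meets twice, so n agents play n(n-1) matches per tournament. The sketch below assumes that all 57 agents (40 LLM-coded plus 17 human-coded) enter each of the 12 tournaments and that no extra baseline agents take part; the abstract states neither, and the agent names are placeholders.

```python
from itertools import permutations


def double_all_play_all(agents):
    """Every ordered pair meets once, so each unordered pair plays twice."""
    return list(permutations(agents, 2))


# Placeholder agent names; only the counts come from the abstract.
agents = [f"llm_{i}" for i in range(40)] + [f"human_{i}" for i in range(17)]

matches_per_tournament = len(double_all_play_all(agents))  # 57 * 56 = 3192
total_matches = 12 * matches_per_tournament                # 38304
```

Under these assumptions the total is 38,304 matches, which is consistent with the "about 40k" figure; additional baseline agents would push the count higher.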

2511.20158 2026-06-05 cs.CV

Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs

在安全对齐的连续视觉指令微调中实现和谐参数适应

Ziqi Wang, Chang Che, Qi Wang, Hui Ma, Zenglin Shi, Cees G. M. Snoek, Meng Wang

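Institutions * Hefei University of Technology; Tsinghua University; University of Amsterdam

AI summary This paper studies how to balance safety and task performance in continual visual instruction tuning for safety-aligned MLLMs, and proposes Harmonious Parameter Adaptation (HPA), a post-training framework that uses parameter partition, balanced selection, and orthogonal adjustment to mitigate forgetting.

Details
AI Chinese abstract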

尽管连续视觉指令微调(CVIT)在适应多模态大语言模型(MLLMs)方面显示出潜力,但现有研究大多集中在没有安全对齐的模型上。这种关键疏忽忽略了现实中的MLLMs本质上需要此类机制以缓解潜在风险。在本文中,我们关注CVIT在安全对齐的MLLMs中的应用,并观察到在连续适应过程中,模型不仅会经历任务遗忘,还会表现出安全性的下降。实现安全性和任务性能之间的和谐平衡仍然是一个关键挑战。为此,我们提出了和谐参数适应(HPA),一种由基于聚焦的参数分区、和谐平衡的参数选择和正交参数调整组成的后训练框架。具体而言,HPA根据参数对安全或任务性能的关注程度将其分为两种类型,并从平衡的角度选择聚焦的参数以保留。此外,HPA对参数更新施加正交约束,以进一步缓解灾难性遗忘。在CVIT基准和安全评估数据集上的大量实验表明,HPA比现有基线更好地保持了高安全性和减轻了遗忘问题。代码可在https://github.com/Minato-Zackie/HPA上获得。

English abstract

While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines. Code is available at https://github.com/Minato-Zackie/HPA.

2511.20107 2026-06-05 cs.CL cs.SD eess.AS

Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

无需模型训练的误读检测与诊断:基于检索的方法

Huu Tuong Tu, Ha Viet Khanh, Tran Tien Dat, Vu Huan, Thien Van Luong, Nguyen Tien Cuong, Nguyen Thi Thu Trang

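Institutions * Hanoi National University of Education

AI summary This paper proposes a training-free method for mispronunciation detection and diagnosis that combines a pretrained Automatic Speech Recognition model with retrieval techniques to detect and diagnose pronunciation errors accurately. Experiments on the L2-ARCTIC dataset show an F1 score of 69.60%.

Details
Journal ref
ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
AI Chinese abstract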

误读检测与诊断(MDD)对于语言学习和语音治疗至关重要。与传统方法需要评分模型或训练音素级模型不同,我们提出了一种新颖的无训练框架,利用预训练的自动语音识别模型和检索技术。我们的方法避免了音素特定建模或额外的任务特定训练,但仍能实现准确的发音错误检测与诊断。在L2-ARCTIC数据集上的实验表明,我们的方法在避免模型训练复杂性的同时,达到了69.60%的F1分数。

English abstract

Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or training phoneme-level models, we propose a novel training-free framework that leverages retrieval techniques with a pretrained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling or additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.

2511.13183 2026-06-05 cs.CV

GenTract: Generative Global Tractography

GenTract:生成式全局束追踪

Alec Sargood, Lemuel Puglisi, Elinor Thompson, Mirco Musolesi, Daniel C. Alexander

Institutions * Hawkes Institute and Department of Computer Science, University College London, UK; Department of Maths and Computer Science, University of Catania, Italy; AI Centre and Department of Computer Science, University College London, UK

AI summary This paper proposes GenTract, a generative model for global tractography that learns a direct mapping from dMRI to complete, anatomically plausible streamlines, improving precision and reliability on low-resolution and noisy data.

Comments Upload of camera-ready

Details
AI Chinese abstract

束追踪是通过扩散磁共振成像(dMRI)推断大脑白质路径轨迹的过程。局部束追踪方法通过逐步跟随局部纤维方向估计来构建束流,易产生误差累积和高假阳性率,尤其是在噪声或低分辨率数据中。相比之下,全局方法试图通过优化束流集合以最大化与底层纤维方向估计的兼容性,但计算成本较高。为解决这些挑战,我们引入GenTract,这是首个生成式全局束追踪模型。我们将束追踪视为生成任务,学习从dMRI到完整、解剖学合理束流的直接映射。我们比较了基于扩散和流匹配的两种范式,并评估了GenTract在与现有最先进基线方法的性能。值得注意的是,GenTract在精度上比次优方法DDTracking和TractOracle分别高出1.8倍和2.1倍。在具有挑战性的低分辨率和噪声设置中,其优势更加明显,比最接近的竞争对手高出3.5倍。通过在研究级数据上产生高精度的束流图,同时在不完美的低分辨率数据上保持可靠性,GenTract代表了全局束追踪的一个有前景的解决方案。

English abstract

Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract's performance against state-of-the-art baselines. Notably, GenTract achieves precision 1.8x and 2.1x higher than the next-best methods, DDTracking and TractOracle, respectively. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by a factor of 3.5. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.

2511.13044 2026-06-05 cs.LG

Bi-View Embedding Fusion: A Hybrid Learning Approach for Knowledge Graph's Nodes Classification Addressing Problems with Limited Data

双视角嵌入融合:一种混合学习方法用于知识图谱节点分类,以解决数据有限的问题

Rosario Napoli, Giovanni Lonia, Antonio Celesti, Massimo Villari, Maria Fazio

Institutions * Department of Mathematical and Computer Sciences, Physical Sciences and Earth Sciences, University of Messina

AI summary This paper proposes Bi-View, an embedding fusion approach that combines two complementary graph embedding techniques, Node2Vec and GraphSAGE, to increase the informative content of knowledge graph node features and generate enhanced graph embeddings that improve GML models without additional synthetic data.

Comments Accepted at the 14th International Joint Conference on Knowledge Graphs (IJCKG) 2025

Details
Journal ref
Knowledge Graphs, Springer Nature Singapore, 2026, pp. 19-34
AI Chinese abstract

传统机器学习(ML)方法需要大量数据才能表现良好,限制了其在稀疏或不完整场景中的适用性,并迫使使用额外合成数据来改进模型训练。为克服这一挑战,研究社区越来越多地关注图机器学习(GML),因为它通过利用数据中的关系提供了强大的替代方案。然而,这种方法在处理知识图谱(KGs)时也面临限制,因为KGs的语义性质可能导致隐藏大量信息。本研究引入了Bi-View,一种新颖的混合方法,通过增加KGs节点特征的信息内容,生成增强的图嵌入(GEs),用于改进GML模型,而无需依赖额外合成数据。所提出的方法结合了两种互补的GE技术:Node2Vec,通过无监督随机游走捕捉结构模式,以及GraphSAGE,通过监督方式聚合邻居信息。首先计算Node2Vec嵌入以表示图拓扑,然后用基于中心性的度量指标丰富节点特征,这些特征作为GraphSAGE模型的输入。此外,融合层将原始Node2Vec嵌入与GraphSAGE影响的表示结合,形成双视角嵌入空间。此类融合捕获了图的拓扑和语义属性,使模型能够利用数据集中可能存在但未显式表示的 informative 特征。我们的方法提高了下游任务的性能,特别是在初始特征较差的情况下,为更准确和精确的KG增强GML模型奠定了基础。

English abstract

Traditional Machine Learning (ML) methods require large amounts of data to perform well, limiting their applicability in sparse or incomplete scenarios and forcing the use of additional synthetic data to improve model training. To overcome this challenge, the research community is increasingly turning to Graph Machine Learning (GML), which offers a powerful alternative by exploiting relationships within data. However, this method also faces limitations, particularly when dealing with Knowledge Graphs (KGs), which can hide a large amount of information due to their semantic nature. This study introduces Bi-View, a novel hybrid approach that increases the informative content of node features in KGs to generate enhanced Graph Embeddings (GEs) that are used to improve GML models without relying on additional synthetic data. The proposed work combines two complementary GE techniques: Node2Vec, which captures structural patterns through unsupervised random walks, and GraphSAGE, which aggregates neighbourhood information in a supervised way. Node2Vec embeddings are first computed to represent the graph topology, and node features are then enriched with centrality-based metrics, which are used as input for the GraphSAGE model. Moreover, a fusion layer combines the original Node2Vec embeddings with the GraphSAGE-influenced representations, resulting in a dual-perspective embedding space. Such a fusion captures both topological and semantic properties of the graph, enabling the model to exploit informative features that may exist in the dataset but are not explicitly represented. Our approach improves downstream task performance, especially in scenarios with poor initial features, laying the basis for more accurate and precise KG-enhanced GML models.

2511.10362 2026-06-05 cs.LG cs.SY eess.SY math.DS

Gradient Flow Equations for Deep Linear Neural Networks: A Survey from a Network Perspective

深度线性神经网络的梯度流方程:从网络角度的综述

Joel Wendin, Claudio Altafini

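Institutions * Department of Electrical Engineering, Linköping University

AI summary This paper surveys, from a network perspective, recent progress on the dynamics and loss landscape of the gradient flow equations of deep linear neural networks, covering gradient descent training dynamics (in the limit as the step size goes to 0) under quadratic loss functions. It shows that these equations form a class of converging matrix ODEs that is nilpotent, polynomial, isospectral, and endowed with conservation laws.

Comments Manuscript accepted for publication in SIAM Review (SIREV)

Details
Journal ref
SIAM Review 68 (2026) 293-345
AI Chinese abstract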

本文综述了在理解深度线性神经网络梯度流方程的动力学和损失景观方面的最新进展,即在去掉激活函数且使用二次损失函数的情况下,深度神经网络的梯度下降训练动态(当步长趋近于0时的极限情况)。当用神经网络的邻接矩阵来表示时,这些梯度流方程形成了一类收敛的矩阵常微分方程,具有幂零、多项式、等谱以及存在守恒律等特性。损失景观被详细描述。其特征是存在无限多个全局极小值和鞍点(严格和非严格),但不存在局部极小值和局部极大值。损失函数本身是梯度流的一个半正定李雅普诺夫函数,其水平集是由临界点构成的无界不变集,其临界值对应于梯度沿特定轨迹学到的输入输出数据的奇异值数量。本文所用的邻接矩阵表示法可以突出显示商空间结构的存在,其中每个损失函数的临界值仅被表示一次,而所有其他具有相同临界值的临界点都属于与商空间相关的纤维。它还可以方便地确定鞍点处的稳定和不稳定子流形,即使海森矩阵无法给出这些结构。

English abstract

The paper surveys recent progress in understanding the dynamics and loss landscape of the gradient flow equations associated with deep linear neural networks, i.e., the gradient descent training dynamics (in the limit when the step size goes to 0) of deep neural networks without activation functions and subject to quadratic loss functions. When formulated in terms of the adjacency matrix of the neural network, as we do in the paper, these gradient flow equations form a class of converging matrix ODEs which is nilpotent, polynomial, isospectral, and with conservation laws. The loss landscape is described in detail. It is characterized by infinitely many global minima and saddle points, both strict and nonstrict, but lacks local minima and maxima. The loss function itself is a positive semidefinite Lyapunov function for the gradient flow, and its level sets are unbounded invariant sets of critical points, with critical values that correspond to the number of singular values of the input-output data learnt by the gradient along a certain trajectory. The adjacency matrix representation we use in the paper makes it possible to highlight the existence of a quotient space structure in which each critical value of the loss function is represented only once, while all other critical points with the same critical value belong to the fiber associated to the quotient space. It also makes it easy to determine stable and unstable submanifolds at the saddle points, even when the Hessian fails to identify them.
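
A scalar toy case makes the "gradient flow with conservation laws" statement concrete. For a two-layer linear network with scalar weights and loss L = 0.5 (y - w2 w1 x)^2, the flow is dw1/dt = e w2 x and dw2/dt = e w1 x with e = y - w2 w1 x, so d/dt (w2^2 - w1^2) = 0. The sketch below integrates this flow with small Euler steps and checks that the quantity barely moves. The scalar setting, step size, and initial values are assumptions chosen for illustration; the survey itself works with the matrix (adjacency) formulation.

```python
def simulate_gradient_flow(w1, w2, x, y, step=1e-3, n_steps=10_000):
    """Euler-integrate dw/dt = -grad L for L = 0.5 * (y - w2 * w1 * x) ** 2."""
    for _ in range(n_steps):
        error = y - w2 * w1 * x
        # Both updates use the weights from the start of the step.
        w1, w2 = w1 + step * error * w2 * x, w2 + step * error * w1 * x
    return w1, w2


# Illustrative initial weights and a single training pair (x, y).
w1_0, w2_0 = 0.5, 1.5
w1_T, w2_T = simulate_gradient_flow(w1_0, w2_0, x=1.0, y=2.0)

invariant_start = w2_0 ** 2 - w1_0 ** 2  # conserved exactly by the flow
invariant_end = w2_T ** 2 - w1_T ** 2    # drifts only by discretization error
final_product = w1_T * w2_T              # end-to-end map, should approach y / x
```

With a finite step the invariant is preserved only approximately, which is the sense in which gradient descent approaches the gradient flow as the step size goes to 0.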

2511.08972 2026-06-05 cs.LG

Selective Sinkhorn Routing for Improved Sparse Mixture of Experts

选择性Sinkhorn路由以提高稀疏专家混合模型

Duc Anh Nguyen, Huu Binh Ta, Nhuan Le Duc, Tan Minh Nguyen, Toan Tran

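Institutions * University of California, Berkeley; National University of Singapore

AI summary This paper proposes Selective Sinkhorn Routing, which casts token-to-expert assignment as an optimal transport problem with constraints that ensure balanced expert utilization, improving sparse Mixture-of-Experts performance without auxiliary balancing losses.

Comments 12 pages, 5 figures

Details
AI Chinese abstract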

稀疏专家混合模型(SMoE)模型具有可扩展性和计算效率,能够在有限的推理开销下实现模型容量的大幅增加。现有的SMoE方法通常依赖于辅助目标,如负载均衡损失和z损失,或额外的可训练组件如噪声门控。虽然这些技术鼓励专家多样性,但可能会引入目标不一致、增加模型复杂性或带来显著的训练开销,尤其是在基于Sinkhorn的路由方法中。在本文中,我们重新审视token到专家的分配问题作为最优传输问题。我们添加约束以确保专家利用率的平衡。我们证明,即使是最小的基于最优传输的路由也能在不需辅助平衡损失的情况下提升SMoE性能。与以往方法不同,我们的方法直接从传输图中推导出门控分数,从而实现更平衡和有效的token到专家分配。基于这一见解,我们引入了选择性Sinkhorn路由(SSR),一种轻量级的路由机制,它用高效的Sinkhorn路由替代了复杂的辅助损失,同时保持灵活的专家选择。在语言建模和图像分类实验中,SSR在训练效率、准确性和对输入损坏的鲁棒性方面均有所提升。

English abstract

Sparse Mixture-of-Experts (SMoE) models are scalable and computationally efficient, enabling large increases in model capacity with limited inference overhead. Existing SMoE methods often depend on auxiliary objectives, such as load-balancing loss and z-loss, or additional trainable components such as noisy gating. While these techniques encourage expert diversity, they can introduce objective misalignment, increase model complexity, or incur substantial training overhead, especially in Sinkhorn-based routing methods. In this paper, we revisit the token-to-expert assignment as an optimal transport problem. We add constraints to ensure balanced expert utilization. We show that even minimal optimal transport-based routing improves SMoE performance without requiring auxiliary balancing losses. Unlike prior approaches, our method derives gating scores directly from the transport map, leading to more balanced and effective token-to-expert assignments. Building on this insight, we introduce Selective Sinkhorn Routing (SSR), a lightweight routing mechanism that replaces complex auxiliary losses with efficient Sinkhorn-based routing while preserving flexible expert selection. Experiments on language modeling and image classification show that SSR improves training efficiency, accuracy, and robustness to input corruption.
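
The idea of treating token-to-expert assignment as optimal transport with balanced expert load can be illustrated with the standard Sinkhorn iteration. The sketch below is a generic entropic-OT routine with uniform marginals, not the paper's Selective Sinkhorn Routing: the temperature `epsilon`, the iteration count, the router scores, and the top-1 readout are all assumptions made for the example.

```python
import math


def sinkhorn_plan(scores, epsilon=0.5, iterations=50):
    """Entropic-OT transport plan with uniform token and expert marginals."""
    n_tokens, n_experts = len(scores), len(scores[0])
    # Higher affinity means lower transport cost, hence exp(+score / epsilon).
    kernel = [[math.exp(s / epsilon) for s in row] for row in scores]
    u = [1.0] * n_tokens
    v = [1.0] * n_experts
    row_target, col_target = 1.0 / n_tokens, 1.0 / n_experts
    for _ in range(iterations):
        # Rescale rows so each token carries equal mass.
        u = [row_target / sum(kernel[i][j] * v[j] for j in range(n_experts))
             for i in range(n_tokens)]
        # Rescale columns so each expert receives equal mass.
        v = [col_target / sum(kernel[i][j] * u[i] for i in range(n_tokens))
             for j in range(n_experts)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n_experts)]
            for i in range(n_tokens)]


# Made-up router scores: every token prefers expert 0 on raw affinity.
scores = [[2.0, 0.0], [1.5, 0.0], [1.0, 0.2], [0.8, 0.1]]
plan = sinkhorn_plan(scores)

# Gating scores read directly from the plan; top-1 expert per token.
assignments = [max(range(len(row)), key=row.__getitem__) for row in plan]
```

An argmax over the raw scores would send all four tokens to expert 0, whereas the plan is forced to place half of the total mass on each expert, so the balance comes from the constraints rather than from an auxiliary loss.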

2511.05615 2026-06-05 cs.LG cs.AI cs.AR physics.ins-det

wa-hls4ml: A Benchmark and Surrogate Models for hls4ml Resource and Latency Estimation

wa-hls4ml: 一个用于hls4ml资源和延迟估计的基准及替代模型

Benjamin Hawks, Jason Weitz, Dmitri Demler, Karla Tame-Narvaez, Dennis Plotnikov, Mohammad Mehdi Rahimifar, Hamza Ezzaoui Rahali, Audrey C. Therrien, Donovan Sproule, Elham E Khoda, Keegan A. Smith, Russell Marroquin, Giuseppe Di Guglielmo, Nhan Tran, Javier Duarte, Vladimir Loncar

Affiliations: Fermi National Accelerator Laboratory; University of California San Diego; Johns Hopkins University; University of Sherbrooke; Columbia University; Texas A&M University; European Organization for Nuclear Research (CERN)

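AI summary: This paper presents wa-hls4ml, a benchmark for estimating the resource usage and latency of ML accelerators, and introduces surrogate models based on graph neural networks and transformers that predict accelerator latency and resource usage.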

Comments 30 pages, 18 figures

Details
Journal ref
Wa-hls4ml: A Benchmark and Surrogate Models for hls4ml Resource and Latency Estimation. ACM Trans. Reconfigurable Technol. Syst. 19, 2, Article 20 (June 2026), 29 pages
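AI abstract (translated from Chinese)

As machine learning (ML) is increasingly implemented in hardware to meet real-time challenges in scientific applications, advanced toolchains have significantly shortened the time needed to iterate on designs. These advances have removed major obstacles but have also exposed new ones. For example, processes that were not previously considered bottlenecks, such as hardware synthesis, are now limiting how quickly designs can be iterated. To ease these emerging constraints, several efforts have developed ML-based surrogate models that estimate the resource usage of ML accelerator architectures. We introduce wa-hls4ml, a benchmark for ML accelerator resource and latency estimation, together with its initial dataset of more than 680,000 fully connected and convolutional neural networks, all synthesized with hls4ml and targeting Xilinx FPGAs. The benchmark evaluates resource and latency predictors on several common ML model architectures, mostly drawn from scientific domains, as exemplar models, and also reports average performance on a subset of the dataset. In addition, we introduce surrogate models based on graph neural networks and transformers that predict the latency and resources of ML accelerators. We present the architecture and performance of these models and find that, on the synthetic test dataset, their latency and resource prediction errors at the 75th percentile are generally within a few percent.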

English abstract

As machine learning (ML) is increasingly implemented in hardware to address real-time challenges in scientific applications, the development of advanced toolchains has significantly reduced the time required to iterate on various designs. These advancements have solved major obstacles, but also exposed new challenges. For example, processes that were not previously considered bottlenecks, such as hardware synthesis, are becoming limiting factors in the rapid iteration of designs. To mitigate these emerging constraints, multiple efforts have been undertaken to develop an ML-based surrogate model that estimates resource usage of ML accelerator architectures. We introduce wa-hls4ml, a benchmark for ML accelerator resource and latency estimation, and its corresponding initial dataset of over 680,000 fully connected and convolutional neural networks, all synthesized using hls4ml and targeting Xilinx FPGAs. The benchmark evaluates the performance of resource and latency predictors against several common ML model architectures, primarily originating from scientific domains, as exemplar models, and the average performance across a subset of the dataset. Additionally, we introduce GNN- and transformer-based surrogate models that predict latency and resources for ML accelerators. We present the architecture and performance of the models and find that the models generally predict latency and resources for the 75th percentile within several percent of the synthesized resources on the synthetic test dataset.
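
One plausible reading of the closing claim is that the 75th percentile of the per-design relative prediction error stays within a few percent. The sketch below shows how such a figure could be computed under that assumption; the function name `p75_percent_error` and the sample values are invented for illustration and are not taken from the benchmark.

```python
import statistics


def p75_percent_error(predicted, synthesized):
    """75th percentile of the absolute percentage error.

    Hypothetical metric helper; the benchmark may define its score differently.
    """
    errors = [
        100.0 * abs(p - s) / s  # relative error of one design, in percent
        for p, s in zip(predicted, synthesized)
    ]
    # With n=4, quantiles returns the three quartile cut points;
    # the last one is the 75th percentile.
    return statistics.quantiles(errors, n=4, method="inclusive")[2]


if __name__ == "__main__":
    # Made-up resource counts for five designs.
    synthesized = [100, 200, 400, 800, 1000]
    predicted = [101, 204, 412, 832, 1050]
    print(p75_percent_error(predicted, synthesized))
```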

2410.02628 2026-06-05 cs.LG cs.AI

Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization

逆熵最优运输通过数据似然最大化解决半监督学习

Mikhail Persiianov, Arip Asadulaev, Nikita Andreev, Nikita Starodubcev, Dmitry Baranchuk, Anastasis Kratsios, Evgeny Burnaev, Alexander Korotin

Affiliations: Institute for Advanced Study; National Research Council Canada; University of Toronto; St. Petersburg State University; Skolkovo Institute of Science and Technology; Kazan Federal University

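AI summary: This paper proposes a new learning paradigm called EBiEOT that uses data likelihood maximization to integrate paired and unpaired data seamlessly, addressing the difficulty of acquiring paired data in semi-supervised learning, and proves that the method can in theory recover the true conditional distribution with arbitrarily small error.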

Details
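AI abstract (translated from Chinese)

Learning the conditional distribution $\pi^*(\cdot|x)$ is a core problem in machine learning and is usually approached with supervised methods that use paired data $(x,y) \sim \pi^*$. Obtaining paired samples is often difficult, however, especially in problems such as domain translation. This calls for semi-supervised models that can use both limited paired data and additional unpaired i.i.d. samples $x \sim \pi^*_x$ and $y \sim \pi^*_y$. Using such combined data is complex and often relies on heuristics. To address this, we propose a new learning paradigm called EBiEOT, which uses data likelihood maximization to integrate paired and unpaired data seamlessly. We show that the method has an intriguing connection to inverse entropic optimal transport (OT). This finding lets us apply recent advances in computational OT to build an end-to-end learning algorithm for obtaining $\pi^*(\cdot|x)$. We also derive a universal approximation property, showing that the method can in theory recover the true conditional distribution with arbitrarily small error. Finally, we demonstrate experimentally that our method learns conditional distributions effectively while using paired and unpaired data simultaneously. The code for EBiEOT is available at https://github.com/MuXauJl11110/EBiEOT.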

English abstract

Learning conditional distributions $\pi^*(\cdot|x)$ is a central problem in machine learning, which is typically approached via supervised methods with paired data $(x,y) \sim \pi^*$. However, acquiring paired data samples is often challenging, especially in problems such as domain translation. This necessitates the development of $\textit{semi-supervised}$ models that utilize both limited paired data and additional unpaired i.i.d. samples $x \sim \pi^*_x$ and $y \sim \pi^*_y$ from the marginal distributions. The usage of such combined data is complex and often relies on heuristic approaches. To tackle this issue, we propose a new learning paradigm called $\textbf{EBiEOT}$ that integrates both paired and unpaired data seamlessly using data likelihood maximization techniques. We demonstrate that our approach also connects intriguingly with inverse entropic optimal transport (OT). This finding allows us to apply recent advances in computational OT to establish an $\textit{end-to-end}$ learning algorithm to get $\pi^*(\cdot|x)$. In addition, we derive the universal approximation property, demonstrating that our approach can theoretically recover true conditional distributions with arbitrarily small error. Finally, we demonstrate through empirical tests that our method effectively learns conditional distributions using paired and unpaired data simultaneously. The code of $\texttt{EBiEOT}$ is available at https://github.com/MuXauJl11110/EBiEOT.