arXivDaily: arXiv daily academic digest, updated Monday to Friday
2506.06006 2026-06-04 cs.CV cs.AI cs.CL

Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics

Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti

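Affiliations: Institute for Language, Cognition and Computation, University of Edinburgh; Language Technology Lab, University of Cambridge; NVIDIA; University of Groningen

AI summary: The paper finds that vision-language models (VLMs) struggle with forward dynamics prediction (FDP) directly, whereas inverse dynamics prediction (IDP) is easier to learn. It uses IDP to bootstrap FDP through two strategies, weakly supervised learning and inference-time verification, and reaches performance competitive with state-of-the-art image editing models on Aurora-Bench.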

Abstract

Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), effectively captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference-time verification. First, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Second, IDP can assign rewards to multiple samples of FDP to score them, effectively guiding search at inference time. We evaluate the FDP resulting from both strategies through the task of action-centric image editing on Aurora-Bench with two families of VLMs. Despite remaining general-purpose, our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin between 7% and 13% according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
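
The inference-time verification strategy described above can be sketched as best-of-N selection. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_fdp` and `idp_score` are hypothetical stand-ins for a forward-dynamics sampler and an inverse-dynamics scorer, and the toy versions below operate on integers rather than images.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    observation: int,
    action: str,
    sample_fdp: Callable[[int, str, random.Random], int],
    idp_score: Callable[[int, int, str], float],
    n: int = 8,
    seed: int = 0,
) -> Tuple[int, float]:
    """Draw n candidate next states and keep the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates: List[int] = [sample_fdp(observation, action, rng) for _ in range(n)]
    scored = [(idp_score(observation, c, action), c) for c in candidates]
    best_score, best_candidate = max(scored)
    return best_candidate, best_score


# Hypothetical toy stand-ins: states are integers, actions are "inc" or "dec".
def toy_sample_fdp(obs: int, action: str, rng: random.Random) -> int:
    # A noisy forward model that ignores the action and guesses a change.
    return obs + rng.choice([-2, -1, 1, 2])


def toy_idp_score(before: int, after: int, action: str) -> float:
    # Inverse-dynamics check: does the observed change match the stated action?
    expected = 1 if action == "inc" else -1
    return -abs((after - before) - expected)


if __name__ == "__main__":
    state, score = best_of_n(10, "inc", toy_sample_fdp, toy_idp_score, n=16)
    print(state, score)
```

The design point is that the scorer only has to recognise a correct transition, which the abstract argues is the easier direction to learn, while the sampler only has to produce it occasionally.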

2511.13391 2026-06-04 cs.LG cs.AI math.CO math.MG

Finding Kissing Numbers with Game-theoretic Reinforcement Learning

Chengdong Ma, Théo Tao Zhaowei, Pengyu Li, Minghao Liu, Haojun Chen, Zihao Mao, Bo Li, Yuan Cheng, Yuan Qi, Yaodong Yang

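Affiliations: Institute for Artificial Intelligence, Peking University; Shanghai Academy of AI for Science; Artificial Intelligence Innovation and Incubation Institute, Fudan University

AI summary: Recasts the kissing number problem as a cooperative matrix-completion game and uses the reinforcement learning system PackingStar to explore the extremal configuration space, improving 15 long-standing bounds in kissing numbers and their generalizations and discovering new interpretable geometric structures.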

Abstract

Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a central sphere has remained a defining challenge in discrete geometry. As the local analogue of Hilbert's 18th problem, it has profound implications across geometry, number theory and information theory. Although lattices and codes have achieved significant progress, the field is confined to isolated extremal configurations, leaving underlying geometric principles obscured. Here we shift the object to the broader extremal configuration space, thereby opening a new path for the Kissing Number Problem. Accordingly, we recast this problem as a cooperative matrix-completion game, and train a reinforcement learning system, PackingStar, to solve it. One player fills cosine entries while the other corrects suboptimal ones, making explosive geometric complexity tractable. Working within extremal configuration spaces, PackingStar discovers new interpretable geometric structures that improve 15 strong bounds held for decades in kissing numbers and their generalizations, several of them provably optimal under natural inner products. These findings reveal the first explicit spherical-code realization of the Fischer group Fi22, extend the classical Euclidean representation of subgroup structure, and directly inspire subsequent breakthroughs by mathematicians. Overall, the work provides an early example of AI-driven progress on a Hilbert-calibre problem, showing how reinforcement learning advances mathematical discovery by unlocking more expressive objects.
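
To make the "cosine entries" concrete: a kissing configuration can be described by the Gram matrix of its unit contact vectors, and equal spheres touching the central sphere do not overlap exactly when every pairwise cosine is at most 1/2 (an angle of at least 60 degrees). The sketch below only checks that condition for a given point set; it is an illustration of the constraint, not PackingStar's game or search procedure, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def gram_matrix(points: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise inner products; for unit vectors these are cosines of angles."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]


def is_kissing_configuration(points: Sequence[Sequence[float]], tol: float = 1e-9) -> bool:
    """True if all points are unit vectors with pairwise cosine at most 1/2."""
    gram = gram_matrix(points)
    n = len(points)
    for i in range(n):
        if abs(gram[i][i] - 1.0) > tol:  # contact points lie on the unit sphere
            return False
        for j in range(i + 1, n):
            if gram[i][j] > 0.5 + tol:  # angle below 60 degrees: spheres overlap
                return False
    return True


# Six points on a regular hexagon realise the kissing number 6 in the plane.
hexagon = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3)) for k in range(6)]
# Seven equally spaced points are too close together.
heptagon = [(math.cos(2 * k * math.pi / 7), math.sin(2 * k * math.pi / 7)) for k in range(7)]

if __name__ == "__main__":
    print(is_kissing_configuration(hexagon))   # True
    print(is_kissing_configuration(heptagon))  # False
```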

2602.09075 2026-06-04 cs.LG cs.AI

Learning to Remember, Learn, and Forget in Attention-Based Models

Djohan Bonnet, Jamie Lohoff, Jan Finkbeiner, Elidona Shiqerukaj, Emre Neftci

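Affiliations: University of Cambridge

AI summary: Proposes Palimpsa, a model that treats in-context learning as a continual learning problem and addresses the stability-plasticity dilemma with Bayesian metaplasticity, significantly expanding memory capacity and outperforming baselines on MQAR and commonsense reasoning tasks.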

Abstract

In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory has a fixed capacity and is prone to interference, especially for long sequences. We propose Palimpsa, a self-attention model that views ICL as a continual learning problem that must address a stability-plasticity dilemma. Palimpsa uses Bayesian metaplasticity, where the plasticity of each attention state is tied to an importance state grounded by a prior distribution that captures accumulated knowledge. We demonstrate that various gated linear attention models emerge as specific architecture choices and posterior approximations, and that Mamba2 is a special case of Palimpsa where forgetting dominates. This theoretical link enables the transformation of any non-metaplastic model into a metaplastic one, significantly expanding its memory capacity. Our experiments show that Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks.
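
The associative-memory view can be illustrated with a generic gated linear attention recurrence, S <- g * S + v k^T, read out as S q. This is a minimal sketch assuming a single scalar gate g; it is not Palimpsa's Bayesian metaplasticity rule, and the helper names are hypothetical. With orthonormal keys and g = 1 a stored value is recalled exactly, while g < 1 shrinks older memories, which is the forgetting side of the stability-plasticity trade-off.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def write(state: Matrix, key: Sequence[float], value: Sequence[float], gate: float) -> Matrix:
    """One gated linear attention step: S <- gate * S + value key^T."""
    return [
        [gate * state[i][j] + value[i] * key[j] for j in range(len(key))]
        for i in range(len(value))
    ]


def read(state: Matrix, query: Sequence[float]) -> List[float]:
    """Read-out S q."""
    return [sum(row[j] * query[j] for j in range(len(query))) for row in state]


def run(pairs: Sequence[Tuple[Sequence[float], Sequence[float]]], gate: float,
        d_k: int, d_v: int) -> Matrix:
    """Write all (key, value) pairs in order, starting from an empty memory."""
    state = [[0.0] * d_k for _ in range(d_v)]
    for key, value in pairs:
        state = write(state, key, value, gate)
    return state


pairs = [([1.0, 0.0], [3.0, 4.0]), ([0.0, 1.0], [5.0, 6.0])]
no_forgetting = run(pairs, gate=1.0, d_k=2, d_v=2)
with_forgetting = run(pairs, gate=0.5, d_k=2, d_v=2)

if __name__ == "__main__":
    print(read(no_forgetting, [1.0, 0.0]))    # [3.0, 4.0]
    print(read(with_forgetting, [1.0, 0.0]))  # [1.5, 2.0]
```

Because the state is a fixed d_v by d_k matrix, only d_k mutually orthogonal keys can be stored without interference, which is the fixed-capacity limitation the abstract refers to.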

2602.09388 2026-06-04 cs.CL

Effective vocabulary expansion of multilingual language models for extremely low-resource languages

Jianyu Zheng

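Affiliations: School of Foreign Languages, University of Electronic Science and Technology of China; School of Computer Science and Engineering, University of Electronic Science and Technology of China

AI summary: Proposes screening out source-language-biased vocabulary and using bilingual dictionaries to initialize the expanded vocabulary representations before continued pre-training of multilingual pre-trained models, improving POS tagging and NER by 0.54% and 2.60%, respectively.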

Comments 12 pages, 5 figures, 7 tables, under review

Abstract

Multilingual pre-trained language models (mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model's vocabulary using a target language corpus. We then screen out a subset of the model's original vocabulary that is biased towards representing the source language (e.g. English), and utilize bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue to pre-train the mPLMs on the target language corpus, starting from the representations of this expanded vocabulary. Experimental results show that our proposed method outperforms the baseline, which uses randomly initialized expanded vocabulary for continued pre-training, in POS tagging and NER tasks, achieving improvements of 0.54% and 2.60%, respectively. Furthermore, our method demonstrates high robustness to the choice of training corpora, and the models' performance on the source language does not degrade after continued pre-training.

2509.25289 2026-06-04 cs.LG cs.AI

ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation

Mohammadreza Bakhtyari, Bogdan Mazoure, Renato Cordeiro de Amorim, Guillaume Rabusseau, Vladimir Makarenkov

Affiliations: Département d’Informatique, Université du Québec à Montréal; Mila - Quebec AI Institute; School of Computer Science and EE, University of Essex; Department of Computer Science and Operations Research, Université de Montréal

AI summary: Proposes ClustRecNet, an end-to-end deep learning framework that recommends suitable clustering algorithms by directly learning high-order representations of raw tabular data, outperforming traditional internal cluster validity indices and AutoML approaches on synthetic and real-world benchmarks.

Comments Published in IEEE Access

Journal ref IEEE Access, vol. 14, pp. 81352 - 81365, 2026

Abstract

Identifying an effective clustering algorithm for a given dataset remains a fundamental unsupervised learning issue. We introduce ClustRecNet, a novel end-to-end deep learning framework that recommends suitable clustering algorithm(s) by directly learning high-order representations of raw tabular data. To facilitate robust meta-learning, we first construct a comprehensive repository of 34,000 synthetic datasets encompassing a large variety of clustering scenarios, run 10 popular clustering algorithms, and use Adjusted Rand Index (ARI) to establish ground-truth labels. ClustRecNet's architecture incorporates a convolution block, two residual blocks, and an attention block to capture local and global structural patterns, effectively bypassing the knowledge bottleneck associated with manual feature engineering. Extensive evaluation on both synthetic and real-world benchmarks demonstrates that ClustRecNet consistently outperforms traditional internal cluster validity indices such as Silhouette, Calinski-Harabasz, Davies-Bouldin, and Dunn as well as state-of-the-art Automated Machine Learning (AutoML) approaches such as ML2DAC, AutoCluster, and AutoML4Clust. For example, our framework achieves an average 0.497 ARI gain over the Calinski-Harabasz cluster validity index on synthetic data and an average 44.16% ARI improvement over the leading AutoML approach (ML2DAC) on real-world benchmarks. Code and data are available at: https://github.com/mrbakhtyari/ClustRecNet
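
The labelling step, using ARI to decide which algorithm is the ground-truth recommendation for a dataset, can be sketched as follows. The ARI formula is the standard pair-counting one; the candidate labelings and the algorithm names attached to them are toy data invented for illustration, and `best_algorithm` is a hypothetical helper, not code from the ClustRecNet repository.

```python
from collections import Counter
from math import comb
from typing import Dict, Hashable, Sequence


def adjusted_rand_index(labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]) -> float:
    """Adjusted Rand Index computed from the contingency table of two partitions."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: partitions agree trivially
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)


def best_algorithm(labels_true: Sequence[Hashable],
                   candidates: Dict[str, Sequence[Hashable]]) -> str:
    """Name of the candidate labeling with the highest ARI against the reference."""
    return max(candidates, key=lambda name: adjusted_rand_index(labels_true, candidates[name]))


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    toy_candidates = {
        "algorithm_a": [1, 1, 0, 0],  # same partition, permuted label names
        "algorithm_b": [0, 1, 0, 1],  # splits every true cluster
    }
    print(best_algorithm(truth, toy_candidates))  # algorithm_a
```

ARI is invariant to renaming cluster labels, which is why the permuted labeling above still scores 1.0.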

2602.08498 2026-06-04 cs.CL

Characterizing, Evaluating, and Optimizing Complex Reasoning

Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang, Xiaoye Qu, Yu Cheng

Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; University of Science and Technology of China, Hefei, Anhui, China; The Chinese University of Hong Kong, Hong Kong, China; Nanjing University, Suzhou, Jiangsu, China; Peking University, Beijing, China

AI summary: Proposes the ME$^2$ principle to characterize reasoning quality, a pairwise evaluation method based on directed acyclic graphs (DAGs), and the TRM-Preference dataset used to train a Thinking Reward Model (TRM) for optimizing reasoning.

Comments Code and data are available at https://github.com/Simplified-Reasoning/TRM

Abstract

Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle to characterize reasoning quality at the macro and micro levels with respect to efficiency and effectiveness. (2) Built on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method, capturing complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks. Code and data are available at https://github.com/Simplified-Reasoning/TRM.

2602.08142 2026-06-04 cs.LG stat.ML

Variance-Gated Ensembles: An Epistemic-Aware Framework for Uncertainty Estimation

H. Martin Gillis, Isaac Xu, Thomas Trappenberg

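Affiliations: Faculty of Computer Science, Dalhousie University, Halifax, NS

AI summary: Proposes the Variance-Gated Ensembles (VGE) framework, which injects epistemic sensitivity through a signal-to-noise gate computed from ensemble statistics, giving efficient and differentiable uncertainty estimation that matches or exceeds existing methods while remaining computationally efficient.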

Comments Published in Transactions on Machine Learning Research (06/2026)

Abstract

Machine learning applications require fast and reliable per-sample uncertainty estimation. A common approach is to use predictive distributions from Bayesian or approximation methods and additively decompose uncertainty into aleatoric (i.e., data-related) and epistemic (i.e., model-related) components. However, additive decomposition has recently been questioned, with evidence that it breaks down when using finite-ensemble sampling and/or mismatched predictive distributions. This paper introduces Variance-Gated Ensembles (VGE), an intuitive, differentiable framework that injects epistemic sensitivity via a signal-to-noise gate computed from ensemble statistics. VGE provides: (i) a Variance-Gated Margin Uncertainty (VGMU) score that couples decision margins with ensemble predictive variance; and (ii) a Variance-Gated Normalization (VGN) layer that generalizes the variance-gated uncertainty mechanism to training via per-class, learnable normalization of ensemble member probabilities. We derive closed-form vector-Jacobian products enabling end-to-end training through ensemble sample mean and variance. VGE matches or exceeds state-of-the-art information-theoretic baselines while remaining computationally efficient. As a result, VGE provides a practical and scalable approach to epistemic-aware uncertainty estimation in ensemble models.

2602.06883 2026-06-04 cs.LG cs.CV stat.ML

Vision Transformer Finetuning Benefits from Non-Smooth Components

Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko

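Affiliations: Noah's Ark Lab; Univ. Rennes 2, Inria

AI summary: By analyzing the plasticity of vision transformer components (the sensitivity of their outputs to input changes), the paper finds that attention modules and feedforward layers with high plasticity (low smoothness) finetune better, challenging the conventional view that smoothness is beneficial.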

Comments Accepted at ICML 2026

Abstract

The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their plasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies a low smoothness. Our theoretical analysis and extensive experiments (over 1,000 finetuning runs on large-scale vision transformers) showcase that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on transformers' functional properties. The code is available at https://github.com/ambroiseodt/vit-plasticity.
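
One plausible reading of "average rate of change" is the mean ratio of output change to input change under small random perturbations. The sketch below estimates that quantity for toy functions; the paper's exact definition may differ, and `average_rate_of_change`, the Gaussian perturbation, and the scale `eps` are assumptions made for illustration only.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def norm(v: Sequence[float]) -> float:
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in v))


def average_rate_of_change(f: Callable[[Vector], Vector], inputs: Sequence[Vector],
                           eps: float = 1e-3, seed: int = 0) -> float:
    """Mean of ||f(x + d) - f(x)|| / ||d|| over inputs, with small Gaussian d."""
    rng = random.Random(seed)
    ratios = []
    for x in inputs:
        delta = [rng.gauss(0.0, eps) for _ in x]
        x_pert = [a + d for a, d in zip(x, delta)]
        out_diff = [b - a for a, b in zip(f(x), f(x_pert))]
        ratios.append(norm(out_diff) / norm(delta))
    return sum(ratios) / len(ratios)


def scale_by_three(x: Vector) -> Vector:
    # A linear map whose rate of change is 3 in every direction.
    return [3.0 * a for a in x]


def constant(x: Vector) -> Vector:
    # A perfectly smooth component: the output ignores the input.
    return [0.0 for _ in x]


if __name__ == "__main__":
    data = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
    print(average_rate_of_change(scale_by_three, data))  # close to 3
    print(average_rate_of_change(constant, data))        # 0.0
```

Under this reading, a component with a higher score is more plastic and less smooth, matching the inverse relation stated in the abstract.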

2601.20800 2026-06-04 cs.LG cs.AI

Conditional PED-ANOVA: Hyperparameter Importance in Hierarchical & Dynamic Search Spaces

Kaito Baba, Yoshihiko Ozaki, Shuhei Watanabe

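Affiliations: Preferred Networks, Inc.; The University of Tokyo; SB Intuitions Corp.

AI summary: Proposes the conditional PED-ANOVA framework for estimating hyperparameter importance in conditional search spaces, with a closed-form estimator that accurately reflects conditional activation and domain changes; experiments show it outperforms naive adaptations of existing estimators.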

Comments 20 pages, 15 figures. Accepted to the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Abstract

We propose conditional PED-ANOVA (condPED-ANOVA), a principled framework for estimating hyperparameter importance (HPI) in conditional search spaces, where the presence or domain of a hyperparameter can depend on other hyperparameters. Although the original PED-ANOVA provides a fast and efficient way to estimate HPI within the top-performing regions of the search space, it assumes a fixed, unconditional search space and therefore cannot properly handle conditional hyperparameters. To address this, we introduce a conditional HPI for top-performing regions and derive a closed-form estimator that accurately reflects conditional activation and domain changes. Experiments show that naive adaptations of existing HPI estimators yield misleading or uninterpretable importances in conditional settings, whereas condPED-ANOVA consistently provides meaningful importances that reflect the underlying conditional structure. Our code is publicly available at https://github.com/kAIto47802/condPED-ANOVA.

2602.05657 2026-06-04 cs.LG math.OC

Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization

Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H. Sayed

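Affiliations: École Polytechnique Fédérale de Lausanne; University of Novi Sad; Carnegie Mellon University

AI summary: Using large deviations theory, studies the long-term tail decay of SGD and clipped SGD in non-convex optimization, giving exponential upper and lower bounds on the tails of the squared gradient norm and showing decay rates on the order of $e^{-t/\log(t)}$, an order of magnitude faster than existing finite-time bounds.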

Comments 34 pages

Abstract

The study of the tail behaviour of SGD-induced processes has attracted considerable interest because it offers strong guarantees for individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of a finite-time nature, limiting their ability to capture the true long-term tail decay, which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps by studying the long-term tail decay of SGD-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the squared gradient norm of the best iterate produced by (vanilla) SGD, for non-convex costs and bounded noise, with long-term decay at rate $e^{-t/\log(t)}$. Next, we relax the noise assumption by considering clipped SGD (c-SGD) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-t^{\beta_p}/\log(t)}$, where $\beta_p = \frac{4(p-1)}{3p-2}$, for $p \in (1,2)$, and at rate $e^{-t/\log^2(t)}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both SGD and c-SGD are tight, up to poly-logarithmic factors. Notably, our results demonstrate an order of magnitude faster long-term tail decay compared to existing work based on finite-time bounds, which show rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for SGD and c-SGD, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.
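
As a quick worked check of the exponents quoted above, the snippet below evaluates $\beta_p = \frac{4(p-1)}{3p-2}$ exactly for a few values of $p$ and sets it beside the finite-time exponent $\beta_p/2$. It only restates the formula from the abstract; the chosen values of $p$ are arbitrary examples.

```python
from fractions import Fraction


def beta(p: Fraction) -> Fraction:
    """Long-term exponent beta_p = 4(p - 1) / (3p - 2), stated for p in (1, 2)."""
    return 4 * (p - 1) / (3 * p - 2)


if __name__ == "__main__":
    for p in (Fraction(5, 4), Fraction(3, 2), Fraction(7, 4)):
        long_term = beta(p)
        finite_time = long_term / 2
        # Columns: p, long-term exponent, finite-time exponent.
        print(p, long_term, finite_time)
    # 5/4 4/7 2/7
    # 3/2 4/5 2/5
    # 7/4 12/13 6/13
```

The exponent increases towards 1 as $p$ approaches 2, consistent with the separate $p = 2$ rate $e^{-t/\log^2(t)}$, whose leading term is linear in $t$.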

2510.08734 2026-06-04 cs.LG

Transmuting prompts into weights

Hanna Mazzawi, Benoit Dherin, Michael Munn, Adrian Goldwaser, Michael Wunder, Javier Gonzalvo

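Affiliations: Google Research; Cambridge University

AI summary: Proposes an algorithm that condenses prompt information into token-independent thought vectors and thought matrices, giving a theoretical explanation for existing vector- and matrix-based model editing techniques and a direct way to turn textual input into reusable weight updates.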

Abstract

A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving "steering vectors" from the average activations of contrastive prompts. Building on the foundational work of Dherin et al. (2025), who discovered that a prompt's influence mathematically maps to token-dependent implicit weight updates and introduced the initial concept of a static thought patch for prompt compression, we elevate this framework into a robust algorithm for direct model editing. We derive a principled method for condensing this transient information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector-and-matrix-based model editing techniques and offer a direct, computationally grounded method for transmuting textual input into reusable weight updates for complex architectures and new knowledge injection.
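
The empirical heuristic mentioned above, a steering vector taken from the average activations of contrastive prompts, can be sketched as a difference of means added to an activation. This illustrates the baseline heuristic only, not the paper's thought vectors or thought matrices; the activations are toy numbers and the scale `alpha` is a hypothetical parameter.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(positive: Sequence[Sequence[float]],
                    negative: Sequence[Sequence[float]]) -> Vector:
    """Difference between mean activations of two contrastive prompt sets."""
    return [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]


def steer(activation: Sequence[float], vector: Sequence[float], alpha: float = 1.0) -> Vector:
    """Add the scaled steering vector to an activation."""
    return [a + alpha * v for a, v in zip(activation, vector)]


if __name__ == "__main__":
    positive = [[1.0, 2.0], [3.0, 4.0]]  # toy activations for prompts showing the behaviour
    negative = [[0.0, 0.0], [2.0, 2.0]]  # toy activations for prompts lacking it
    vector = steering_vector(positive, negative)
    print(vector)                            # [1.0, 2.0]
    print(steer([0.0, 0.0], vector, 2.0))    # [2.0, 4.0]
```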

2602.04613 2026-06-04 cs.CL

Translation Heads: Disentangling meaning from language in LLM-based machine translation

Théo Lasnier, Armel Zebaze, Djamé Seddah, Rachel Bawden, Benoît Sagot

Affiliations: Inria, Paris, France

AI summary: By analyzing attention heads, decomposes machine translation into two subtasks, target language identification and sentence equivalence, finds that sparse sets of attention heads specialize in each, and uses this to achieve instruction-free translation.

Comments 61 pages, 70 figures

Abstract

Mechanistic Interpretability (MI) seeks to explain how neural networks implement their capabilities, but the scale of Large Language Models (LLMs) has limited prior MI work in Machine Translation (MT) to word-level analyses. We study sentence-level MT from a mechanistic perspective by analyzing attention heads to understand how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: producing text in the target language (i.e. target language identification) and preserving the input sentence's meaning (i.e. sentence equivalence). Across three families of open-source models and 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Based on this insight, we construct subtask-specific steering vectors and show that modifying just 1% of the relevant heads enables instruction-free MT performance comparable to instruction-based prompting, while ablating these heads selectively disrupts their corresponding translation functions.

2602.04101 2026-06-04 cs.AI

Interfaze: The Future of AI is built on Task-Specific Small Models

Harsha Vardhan Khurdula, Vineet Agarwal, Yoeven D Khemlani

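Affiliations: GitHub

AI summary: Proposes the Interfaze hybrid model, which fuses task-specific deep neural networks into a transformer decoder through a shared embedding space and reaches high accuracy at low cost on several deterministic benchmarks.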

Comments 10 pages, 2 figures

Abstract

We present Interfaze, a native hybrid model that fuses task-specific deep neural networks (CNNs and DNNs) directly into a transformer decoder through a shared embedding space. Specialized perceptual encoders handle optical character recognition (OCR) over complex multilingual PDFs, open-vocabulary object and graphical user interface (GUI) detection, and multilingual speech recognition with diarization. Each is exposed through a task-specific adapter and can be activated on its own, so a query touches only the parameters it needs. A built-in action foundation supplies a grounded external state: a proxied headless browser and scraper, a code sandbox, a multi-domain web index, and a scalable vector store. The decoder filters and merges these signals, reasons over them when a task requires it, and emits deterministic outputs built on confidence. The raw specialist metadata (bounding boxes, confidence scores, timestamps) is preserved and returned alongside the answer as precontext. On this architecture, Interfaze-Beta leads a suite of deterministic developer-task benchmarks. It reaches 70.7% on OCRBench v2, 85.7% on olmOCR, 82.1% on RefCOCO, a 2.4% word error rate on VoxPopuli, 52.9% on Spider-2.0-Lite, 92.4% on GPQA-Diamond, 90.9% on MMMLU, 71.1% on MMMU-Pro, and 80.5% value accuracy on the Structured Output Benchmark (SOB), ahead of comparably priced generalist models (Gemini-3-Flash, Gemini-3.5-Flash, Claude-Sonnet-4.6, GPT-5.4-Mini, and Grok-4.3) on every task. Because fused specialist encoders resolve perception in a single pass instead of through repeated tool calls into a large model, Interfaze reaches high accuracy with verifiable metadata on deterministic tasks while running at flash-tier cost.

2602.03920 2026-06-04 cs.RO cs.HC

How Users Understand Robot Foundation Model Performance through Task Success Rates and Beyond

Isaac Sheidlower, Jindan Huang, James Staley, Bingyu Wu, Qicong Chen, Reuben Aronson, Elaine Short

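Affiliations: Brown University; Tufts University

AI summary: Through a user study, examines how non-roboticists understand task success rate (TSR) and other information in robot foundation model (RFM) evaluations. Users use TSR as experts expect, also value failure cases that are rarely reported, and want access to past evaluation data and the robot's own estimates of its performance on novel tasks.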

Abstract

Robot Foundation Models (RFMs) represent a promising approach to developing general-purpose home robots. Given the broad capabilities of RFMs, users will inevitably ask an RFM-based robot to perform tasks that the RFM was not trained or evaluated on. In these cases, it is crucial that users understand the risks associated with attempting novel tasks due to the relatively high cost of failure. Furthermore, an informed user who understands an RFM's capabilities will know what situations and tasks the robot can handle. In this paper, we study how non-roboticists interpret performance information from RFM evaluations. These evaluations typically report task success rate (TSR) as the primary performance metric. While TSR is intuitive to experts, it is necessary to validate whether novices also use this information as intended. Toward this end, we conducted a study in which users saw real evaluation data, including TSR, failure case descriptions, and videos from multiple published RFM research projects. The results highlight that non-experts not only use TSR in a manner consistent with expert expectations but also highly value other information types, such as failure cases that are not often reported in RFM evaluations. Furthermore, we find that users want access to both real data from previous evaluations of the RFM and estimates from the robot about how well it will do on a novel task.

2510.13272 2026-06-04 cs.CL

Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation

超越正确性:在检索增强生成中奖励忠实推理

Zhichao Xu, Zongyu Wu, Yun Zhou, Aosong Feng, Kang Zhou, Sangmin Woo, Kiran Ramnath, Yijun Tian, Xuan Qi, Weikang Qiu, Lin Lee Cheong, Haibo Ding

发表机构 * AWS AI Fundamental Research(AWS人工智能基础研究) The Pennsylvania State University(宾夕法尼亚州立大学) Yale University(耶鲁大学)

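AI summary: This paper proposes the VERITAS framework, which adds fine-grained turn-level faithfulness rewards to reinforcement learning, improving the faithfulness of reasoning steps in retrieval-augmented generation while also improving task performance.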

Comments TMLR Camera Ready Update

详情
AI中文摘要

受强化学习在数学和代码等领域的大语言模型训练中取得成功的启发,近期工作开始训练LLMs动态规划、查询并使用搜索引擎作为工具进行推理——这种范式日益被称为智能体搜索。尽管这些方法在流行的短问答基准上取得了性能提升,但许多方法优先考虑最终答案的正确性,而忽略了中间推理步骤的质量,这可能导致思维链不忠实。本文首先引入了一个全面的智能体搜索评估框架,涵盖三种不同的忠实性指标:思考-搜索忠实性、信息-思考忠实性和思考-答案忠实性。我们的评估表明,通过基于回合级结果奖励的可验证奖励强化学习训练的典型智能体搜索系统(包括Search-R1和ReSearch)在这些忠实性维度上有显著的改进空间。为了促进智能体搜索中的忠实推理,我们引入了VERITAS(通过智能体搜索中的中间可追溯性验证蕴含推理),这是一个新颖的框架,将细粒度的轮次级忠实性奖励整合到强化学习过程中。我们的实验表明,使用VERITAS训练的模型不仅显著提高了推理忠实性,而且与基于回合级结果奖励训练的基线相比,还实现了更好的任务性能。

英文摘要

Inspired by the success of reinforcement learning (RL) in Large Language Model (LLM) training for domains like math and code, recent work has begun training LLMs to dynamically plan, query, and reason with search engines as tools -- a paradigm increasingly referred to as agentic search. Although these methods achieve performance improvement across popular short-form QA benchmarks, many prioritize final answer correctness while overlooking the quality of intermediate reasoning steps, which may lead to chain-of-thought unfaithfulness. In this paper, we first introduce a comprehensive evaluation framework for agentic search, covering three distinct faithfulness metrics: Think-Search faithfulness, Information-Think faithfulness, and Think-Answer faithfulness. Our evaluations reveal that canonical agentic search systems trained through Reinforcement Learning from Verifiable Reward (RLVR) using episode-level outcome-based reward -- including Search-R1 and ReSearch -- have significant room for improvement on these faithfulness dimensions. To foster faithful reasoning in agentic search, we introduce VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a novel framework that integrates fine-grained turn-level faithfulness rewards into the reinforcement learning process. Our experiments show that models trained with VERITAS not only significantly improve reasoning faithfulness, but also achieve better task performance compared to baselines trained against episode-level outcome-based reward.
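The contrast between an episode-level outcome reward and turn-level faithfulness rewards can be sketched as follows. This is a minimal illustration only: the additive combination, the `weight` hyperparameter, and the function name `turn_level_rewards` are assumptions made here, not details taken from the abstract.

```python
from typing import List


def turn_level_rewards(outcome_reward: float,
                       faithfulness_scores: List[float],
                       weight: float = 0.5) -> List[float]:
    """Return one reward per turn.

    Every turn shares the episode-level outcome reward; each turn also gets
    its own faithfulness score, scaled by `weight` (an assumed hyperparameter).
    """
    return [outcome_reward + weight * score for score in faithfulness_scores]


# Three turns, correct final answer, middle turn judged unfaithful.
rewards = turn_level_rewards(1.0, [1.0, 0.0, 1.0])
print(rewards)
```

With outcome-only reward all three turns would be credited equally; here the unfaithful middle turn receives less credit even though the final answer is correct.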

2602.03542 2026-06-04 cs.CL cs.LG

Can Large Language Models Generalize Procedures Across Representations?

大型语言模型能否跨表示泛化过程?

Fangru Lin, Valentin Hofmann, Xingchen Wan, Weixing Wang, Zifeng Ding, Anthony G. Cohn, Janet B. Pierrehumbert

发表机构 * Stanford University(斯坦福大学)

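AI summary: Studies how well large language models generalize procedures across representations such as code, graphs, and natural language, and proposes a two-stage reinforcement learning curriculum to close the gap.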

Comments Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在符号表示(如代码和图)上进行了广泛的训练和测试,然而现实世界的用户任务通常用自然语言指定。LLMs 能在多大程度上跨这些表示进行泛化?在这里,我们通过研究涉及以代码、图和自然语言表示的过程(例如,规划中的调度步骤)的同构任务来探讨这个问题。我们发现,仅在图或代码数据上使用流行的后训练方法训练 LLMs 并不能可靠地泛化到相应的自然语言任务,而仅用自然语言训练可能导致效率低下的性能提升。为了解决这一差距,我们提出了一种两阶段强化学习课程,首先在符号数据上训练,然后在自然语言数据上训练。该课程显著提高了跨模型家族和任务的模型性能。值得注意的是,通过我们的方法训练的 1.5B Qwen 模型在自然规划中几乎可以匹配零样本 GPT-4o。最后,我们的分析表明,成功的跨表示泛化可以解释为一种生成性类比的形式,而我们的课程有效地鼓励了这种类比。本文使用的数据集和代码可在此处找到。

英文摘要

Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks involving procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graphs or code data alone does not reliably generalize to corresponding natural language tasks, while training solely on natural language can lead to inefficient performance gains. To address this gap, we propose a two-stage reinforcement learning curriculum that first trains on symbolic, then natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained by our method can closely match zero-shot GPT-4o in naturalistic planning. Finally, our analysis suggests that successful cross-representation generalization can be interpreted as a form of generative analogy, which our curriculum effectively encourages. The dataset and code used in this paper can be found at https://github.com/fangru-lin/procedure_generalization_llm.

2601.09719 2026-06-04 cs.CL cs.AI

Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

有界双曲正切:大型语言模型中预层归一化的稳定高效替代方案

Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song

发表机构 * Yonsei University(延世大学) Upstage AI

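AI summary: Proposes BHyT, which replaces Pre-LN with a bounded hyperbolic tangent and data-driven input bounding, improving training and inference efficiency while preserving stability.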

Comments Accepted to ICML 2026

详情
AI中文摘要

预层归一化(Pre-LN)是大型语言模型(LLM)的事实标准,对于稳定预训练和有效迁移学习至关重要。然而,Pre-LN会带来重复的统计计算开销,并且仍然容易受到深度诅咒的影响,即随着层数增加,隐藏状态幅度和方差增大,破坏训练稳定性。面向效率的无归一化方法(如Dynamic Tanh (DyT))提高了吞吐量,但在深度下仍然脆弱。为了同时解决稳定性和效率问题,我们提出了有界双曲正切(BHyT),作为Pre-LN的直接替代方案。BHyT将tanh非线性与显式的、数据驱动的输入边界相结合,使激活值保持在非饱和范围内。它防止了激活幅度和方差随深度增长,并提供了理论稳定性保证。在效率方面,BHyT每个块仅计算一次精确统计量,并用轻量级方差近似替代第二次归一化。实验表明,BHyT在预训练期间表现出更好的稳定性和效率,与RMSNorm相比,平均训练速度提升1.6%,平均token生成吞吐量提升1.77%,同时在语言理解和推理基准上保持强大的预训练-only和SFT后性能。代码见:https://github.com/MLAI-Yonsei/BHyT

英文摘要

Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN incurs repeated statistical-computation overhead and remains vulnerable to the curse of depth, where hidden-state magnitudes and variances grow as the number of layers increases, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve throughput but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT combines a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and provides a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 1.6% faster training and an average of 1.77% higher token generation throughput compared to RMSNorm, while maintaining strong pretraining-only and post-SFT performance across language understanding and reasoning benchmarks. Code is available at: https://github.com/MLAI-Yonsei/BHyT.
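The general idea of pairing a tanh with a data-driven input bound can be sketched as below. This is not the BHyT formulation, which the abstract does not spell out. The sketch assumes the bound is derived from the root-mean-square of the input, and the names `bounded_tanh`, `bound`, and `eps` are hypothetical.

```python
import math
from typing import List


def bounded_tanh(x: List[float], bound: float = 2.0, eps: float = 1e-6) -> List[float]:
    """Rescale x by its RMS (assumed statistic), clip to [-bound, bound], apply tanh."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    scaled = [v / rms for v in x]
    # Clipping keeps every tanh input inside a range where tanh is not fully saturated.
    clipped = [max(-bound, min(bound, v)) for v in scaled]
    return [math.tanh(v) for v in clipped]


small = bounded_tanh([1.0, -2.0, 3.0, -4.0])
large = bounded_tanh([1000.0, -2000.0, 3000.0, -4000.0])
```

Because the input is rescaled before the nonlinearity, `small` and `large` are nearly identical, which illustrates how a data-driven bound can stop activation magnitude from growing with the scale of the hidden state.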

2602.02405 2026-06-04 cs.LG cs.AI

Making Expert Reasoning Learnable with Self-Distillation

通过自蒸馏使专家推理可学习

Ethan Mendes, Jungsoo Park, Alan Ritter

发表机构 * Georgia Institute of Technology, Atlanta, Georgia(佐治亚理工学院,亚特兰大,佐治亚州)

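AI summary: Proposes Distribution Aligned Imitation Learning (DAIL), a two-step self-distillation method that bridges the gap between expert solutions and the model's own distribution, using a small number of high-quality expert solutions to substantially improve LLM reasoning.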

Comments ICML 2026

详情
AI中文摘要

提升大语言模型(LLM)的推理能力通常依赖于模型采样正确解以进行强化,或存在更强模型来解决问题。然而,许多难题即使对当前前沿模型也难以处理,阻碍了有效训练信号的提取。一个有前景的替代方案是利用高质量的人类专家解决方案,但直接模仿这些数据从根本上存在分布外问题:专家解决方案通常具有教学性质,包含为人类读者而非计算模型设计的隐含推理间隙。此外,高质量专家解决方案成本高昂,需要可泛化且样本高效的训练方法。我们提出分布对齐模仿学习(DAIL),一种两步自蒸馏方法,通过首先将专家解决方案转化为详细的、分布内的推理轨迹,然后应用对比目标使学习聚焦于专家见解和方法,从而弥合分布差距。我们发现,DAIL可以利用少于1000个高质量专家解决方案,在Qwen2.5-Instruct和Qwen3上实现高达31%的pass@128增益,推理效率翻倍,并实现域外泛化。

英文摘要

Improving the reasoning capabilities of large language models (LLMs) typically relies either on the model's ability to sample a correct solution to be reinforced or the existence of a stronger model able to solve the problem. However, many difficult problems remain intractable for even current frontier models, preventing the extraction of valid training signals. A promising alternative is to leverage high-quality expert human solutions, yet naive imitation of this data fails because it is fundamentally out-of-distribution: expert solutions are typically didactic, containing implicit reasoning gaps intended for human readers rather than computational models. Furthermore, high-quality expert solutions are expensive, necessitating generalizable, sample-efficient training methods. We propose Distribution Aligned Imitation Learning (DAIL), a two-step self-distillation method that bridges the distributional gap by first transforming expert solutions into detailed, in-distribution reasoning traces and then applying a contrastive objective to focus learning on expert insights and methodologies. We find that DAIL can leverage fewer than 1000 high-quality expert solutions to achieve up to 31% pass@128 gains on Qwen2.5-Instruct and Qwen3, double reasoning efficiency, and enable out-of-domain generalization.

2602.01672 2026-06-04 cs.CL

Adaptive Information Control for Search-Augmented LLM Reasoning

面向搜索增强型大语言模型推理的自适应信息控制

Siheng Xiong, Oguzhan Gungordu, James C. Kerce, Faramarz Fekri

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

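AI summary: Proposes DeepControl, an adaptive control framework based on information utility that regulates the extent and resolution of retrieval, improving the performance and training stability of search-augmented reasoning.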

详情
AI中文摘要

搜索增强型推理代理将多步推理与外部检索交错进行,但不受控制的检索可能引入冗余证据、使上下文饱和,并破坏强化学习(RL)的稳定性。现有的基于结果的RL方法仅提供稀疏的终端奖励,对中间信息获取决策的指导有限。我们提出DeepControl,一种基于信息效用的自适应信息控制框架,其中信息效用是检索证据边际价值的状态依赖估计。该框架沿两个维度调节信息获取:广度(即是否应继续检索)和分辨率(即应暴露多少检索细节)。它通过检索继续引导、层次化粒度控制以及退火控制强制方案来实现这些控制。这使得策略能够在训练期间内化有效的获取行为,并在测试时无需外部控制即可运行。在七个基准测试中,DeepControl在没有显式信息控制的情况下,始终优于强RL和检索基线;与Search-R1相比,在Qwen2.5-7B和Qwen2.5-3B上分别平均提高了9.4和8.6个点。额外分析显示搜索效率、训练稳定性和证据利用率均有所提升。

英文摘要

Search-augmented reasoning agents interleave multi-step reasoning with external retrieval, but uncontrolled retrieval can introduce redundant evidence, saturate the context, and destabilize reinforcement learning (RL). Existing outcome-based RL methods provide only sparse terminal rewards, offering limited guidance for intermediate information-acquisition decisions. We propose DeepControl, an adaptive information-control framework based on information utility, a state-dependent estimate of the marginal value of retrieved evidence. The framework regulates information acquisition along two axes: extent, i.e., whether retrieval should continue, and resolution, i.e., how much retrieved detail should be exposed. It implements these controls through retrieval-continuation guidance, hierarchical granularity control, and an annealed control-forcing scheme. This enables the policy to internalize effective acquisition behavior during training and operate without external control at test time. Across seven benchmarks, DeepControl consistently outperforms strong RL and retrieval baselines without explicit information control; compared with Search-R1, it improves average performance by +9.4 and +8.6 points on Qwen2.5-7B and Qwen2.5-3B, respectively. Additional analyses show improved search effectiveness, training stability, and evidence utilization.

2602.01658 2026-06-04 cs.LG cs.AI

Efficient Adversarial Attacks on High-dimensional Offline Bandits

高维离线Bandits的高效对抗攻击

Seyed Mohammad Hadi Hosseini, Amir Najafi, Mahdieh Soleymani Baghshah

Affiliation: Department of Computer Engineering, Sharif University of Technology

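AI summary: Studies the vulnerability of offline bandit training when the reward model is adversarially perturbed, proposes a high-dimensional threat model, proves that the perturbation norm required for an attack decreases as dimensionality grows, and shows experimentally that targeted attacks achieve high success rates.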

Comments Published at ICLR 2026 Conference

详情
AI中文摘要

Bandit算法最近成为评估机器学习模型(包括生成图像模型和大语言模型)的强大工具,通过高效识别表现最佳的候选者而无需详尽比较。这些方法通常依赖于奖励模型(常在Hugging Face等平台上以公共权重发布)向bandit提供反馈。在线评估昂贵且需要重复试验,而使用记录数据的离线评估已成为有吸引力的替代方案。然而,离线bandit评估的对抗鲁棒性在很大程度上尚未被探索,特别是当攻击者在bandit训练之前扰动奖励模型(而非训练数据)时。在这项工作中,我们通过理论和实证研究离线bandit训练对奖励模型对抗操纵的脆弱性来填补这一空白。我们引入了一种新颖的威胁模型,其中攻击者利用高维环境中的离线数据劫持bandit的行为。从线性奖励函数开始,扩展到非线性模型如ReLU神经网络,我们研究了用于生成模型评估的两个Hugging Face评估器上的攻击:一个测量美学质量,另一个评估组合对齐。我们的结果表明,即使对奖励模型权重进行微小、不可察觉的扰动,也能显著改变bandit的行为。从理论角度来看,我们证明了一个显著的高维效应:随着输入维度的增加,成功攻击所需的扰动范数减小,使得现代应用如图像评估尤其脆弱。大量实验证实,简单的随机扰动无效,而精心设计的针对性攻击实现了近乎完美的攻击成功率。

英文摘要

Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and large language models, by efficiently identifying top-performing candidates without exhaustive comparisons. These methods typically rely on a reward model, often distributed with public weights on platforms such as Hugging Face, to provide feedback to the bandit. While online evaluation is expensive and requires repeated trials, offline evaluation with logged data has become an attractive alternative. However, the adversarial robustness of offline bandit evaluation remains largely unexplored, particularly when an attacker perturbs the reward model (rather than the training data) prior to bandit training. In this work, we fill this gap by investigating, both theoretically and empirically, the vulnerability of offline bandit training to adversarial manipulations of the reward model. We introduce a novel threat model in which an attacker exploits offline data in high-dimensional settings to hijack the bandit's behavior. Starting with linear reward functions and extending to nonlinear models such as ReLU neural networks, we study attacks on two Hugging Face evaluators used for generative model assessment: one measuring aesthetic quality and the other assessing compositional alignment. Our results show that even small, imperceptible perturbations to the reward model's weights can drastically alter the bandit's behavior. From a theoretical perspective, we prove a striking high-dimensional effect: as input dimensionality increases, the perturbation norm required for a successful attack decreases, making modern applications such as image evaluation especially vulnerable. Extensive experiments confirm that naive random perturbations are ineffective, whereas carefully targeted perturbations achieve near-perfect attack success rates ...
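For the linear-reward case, a toy version of a weight perturbation that flips a ranking can be sketched as follows. It assumes a reward `r(x) = w . x`, two candidates with distinct feature vectors, and an L2-minimal perturbation; the function name `min_norm_flip` and the `margin` parameter are hypothetical, and this is an illustration rather than the attack studied in the paper.

```python
import math
from typing import List


def min_norm_flip(w: List[float], x_a: List[float], x_b: List[float],
                  margin: float = 1e-3) -> List[float]:
    """Smallest L2 change to w that makes candidate b outscore candidate a by `margin`.

    Assumes x_a != x_b. The minimiser lies along d = x_b - x_a.
    """
    d = [b - a for a, b in zip(x_a, x_b)]
    gap = margin - sum(wi * di for wi, di in zip(w, d))
    if gap <= 0:  # b already wins by at least `margin`
        return [0.0] * len(w)
    d_sq = sum(di * di for di in d)
    return [gap * di / d_sq for di in d]


w = [1.0, 0.0]
x_a, x_b = [1.0, 0.0], [0.0, 1.0]
delta = min_norm_flip(w, x_a, x_b)
w_attacked = [wi + di for wi, di in zip(w, delta)]
score = lambda weights, x: sum(wi * xi for wi, xi in zip(weights, x))
delta_norm = math.sqrt(sum(di * di for di in delta))
```

In this toy setting the required norm equals the score gap divided by the length of `x_b - x_a`, so the perturbation shrinks whenever the feature difference between candidates is long relative to the gap.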

2602.01619 2026-06-04 cs.LG cs.AI

SUSD: Structured Unsupervised Skill Discovery through State Factorization

SUSD: 通过状态分解的结构化无监督技能发现

Seyed Mohammad Hadi Hosseini, Mahdieh Soleymani Baghshah

发表机构 * Department of Computer Engineering(计算机工程系)

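AI summary: Proposes the SUSD framework, which factorizes the state space into independent components, assigns a distinct skill variable to each, and uses a dynamic model to adaptively steer exploration, yielding richer and more diverse unsupervised skill discovery that significantly outperforms existing methods in factorized environments.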

Comments Published as a conference paper at ICLR 2026

详情
AI中文摘要

无监督技能发现(USD)旨在无需外部奖励的情况下自主学习多样化的技能。最常见的USD方法之一是最大化技能潜在变量与状态之间的互信息(MI)。然而,基于MI的方法由于其不变性特性,倾向于偏好简单、静态的技能,限制了动态、任务相关行为的发现。距离最大化技能发现(DSD)通过利用状态空间距离促进更动态的技能,但仍未能鼓励涵盖环境中所有可控因素或实体的全面技能集。在这项工作中,我们引入了SUSD,一种新颖的框架,通过将状态空间分解为独立组件(例如,物体或可控实体)来利用环境的组合结构。SUSD将不同的技能变量分配给不同的因素,从而实现对技能发现过程的更细粒度控制。一个动态模型还跟踪各因素的学习情况,自适应地将智能体的注意力引导至未充分探索的因素。这种结构化方法不仅促进了更丰富、更多样化技能的发现,还产生了一种分解的技能表示,能够对单个实体进行细粒度且解耦的控制,从而通过分层强化学习(HRL)促进组合下游任务的高效训练。我们在三个环境中的实验结果(因素数量从1到10)表明,我们的方法能够在无监督的情况下发现多样且复杂的技能,在分解和复杂环境中显著优于现有的无监督技能发现方法。代码公开于:https://github.com/hadi-hosseini/SUSD。

英文摘要

Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most common USD approaches is to maximize the Mutual Information (MI) between skill latent variables and states. However, MI-based methods tend to favor simple, static skills due to their invariance properties, limiting the discovery of dynamic, task-relevant behaviors. Distance-Maximizing Skill Discovery (DSD) promotes more dynamic skills by leveraging state-space distances, yet still falls short in encouraging comprehensive skill sets that engage all controllable factors or entities in the environment. In this work, we introduce SUSD, a novel framework that harnesses the compositional structure of environments by factorizing the state space into independent components (e.g., objects or controllable entities). SUSD allocates distinct skill variables to different factors, enabling more fine-grained control over the skill discovery process. A dynamic model also tracks learning across factors, adaptively steering the agent's focus toward underexplored factors. This structured approach not only promotes the discovery of richer and more diverse skills, but also yields a factorized skill representation that enables fine-grained and disentangled control over individual entities, which facilitates efficient training of compositional downstream tasks via Hierarchical Reinforcement Learning (HRL). Our experimental results across three environments, with factors ranging from 1 to 10, demonstrate that our method can discover diverse and complex skills without supervision, significantly outperforming existing unsupervised skill discovery methods in factorized and complex environments. Code is publicly available at: https://github.com/hadi-hosseini/SUSD.

2601.15158 2026-06-04 cs.LG cs.AI

Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

基于结果的强化学习可证明地引导Transformer进行推理,但仅在合适的数据条件下

Yuval Ran-Milo, Yotam Alexander, Shahar Mendel, Nadav Cohen

发表机构 * Tel Aviv University(特拉维夫大学)

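AI summary: By analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task, this paper proves that outcome-based reinforcement learning leads Transformers to learn a structured iterative reasoning algorithm, and shows that the share of "simple examples" in the training distribution is critical for reasoning to emerge.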

Comments 94 pages, 7 figures

详情
AI中文摘要

通过基于结果的监督进行强化学习训练的Transformer可以自发地生成中间推理步骤(思维链)。然而,稀疏奖励驱动策略梯度发现这种系统性推理的机制仍然知之甚少。我们通过分析单层Transformer在合成图遍历任务上的策略梯度动力学来解决这个问题,该任务没有思维链就无法解决,但允许简单的迭代解决方案。我们证明,尽管仅对最终答案的正确性进行训练,策略梯度仍驱动Transformer收敛到一个结构化的、可解释的算法,该算法逐顶点迭代遍历图。我们刻画了这种涌现所需的分布特性,识别出“简单示例”(即需要较少推理步骤的实例)的关键作用。当训练分布在这些更简单的示例上放置足够的质量时,Transformer学习到一种可泛化的遍历策略,能够外推到更长的链;当这种质量消失时,策略梯度学习变得不可行。我们通过在合成数据上的实验以及在数学推理任务中使用真实世界语言模型的实验来证实我们的理论结果,验证了我们的理论发现可以推广到实际场景。

英文摘要

Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler examples, the Transformer learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, policy gradient learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.
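The kind of vertex-by-vertex solution and outcome-only reward described above can be sketched as follows. The task format (a chain given as an edge list, with the answer being the last vertex reached) is an assumption for illustration; the abstract does not specify the exact encoding, and the names `traverse` and `outcome_reward` are hypothetical.

```python
from typing import Hashable, List, Tuple


def traverse(edges: List[Tuple[Hashable, Hashable]], start: Hashable) -> List[Hashable]:
    """Follow outgoing edges one vertex at a time; each hop is one 'reasoning step'."""
    nxt = dict(edges)
    path = [start]
    # Stop at a vertex with no outgoing edge, or before revisiting a vertex.
    while path[-1] in nxt and nxt[path[-1]] not in path:
        path.append(nxt[path[-1]])
    return path


def outcome_reward(path: List[Hashable], target: Hashable) -> float:
    """Sparse reward: only the final vertex is checked, never the intermediate steps."""
    return 1.0 if path[-1] == target else 0.0


edges = [("a", "b"), ("b", "c"), ("c", "d")]
path = traverse(edges, "a")
```

A "simple example" in this picture is a short chain: a policy that guesses at random is far more likely to earn a nonzero reward on a one-hop instance than on a long one, which gives policy gradient something to learn from.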

2512.21917 2026-06-04 cs.LG cs.AI econ.EM stat.ML

Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

半参数偏好优化:你的语言模型秘密地是一个单索引模型

Nathan Kallus

发表机构 * Netflix & Cornell University(Netflix与康奈尔大学)

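AI summary: Proposes semiparametric preference optimization, which drops the assumption of a known link function between preferences and latent rewards and aligns policies under an unknown, unrestricted link. Realizability in the policy class is shown to induce a semiparametric single-index binary choice model; the methods learn policies directly and come with link-agnostic convergence guarantees.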

详情
AI中文摘要

策略对齐到偏好数据通常假设观察到的偏好与潜在奖励之间存在已知的链接函数(例如,Bradley-Terry模型/逻辑链接)。这种链接的错误设定可能会使推断的奖励产生偏差,并使学习到的策略偏离对齐。我们研究了在未知且无限制的链接函数下的策略对齐。我们提出了一个$f$-散度约束的奖励最大化问题,并表明策略类中的可实现性诱导出一个半参数单索引二元选择模型,其中标量策略诱导的索引捕获了所有对示范的依赖,而剩余的偏好分布是无限制的。与计量经济学中要求识别此类模型的结构参数并进行估计不同,我们开发了直接学习策略的方法,其中奖励函数是隐式的,分析了与最优策略的误差,并允许不可识别和非参数的索引。我们证明了基于通用函数复杂度度量的链接无关收敛保证,并通过实验验证了方法和理论。代码可在 https://github.com/causalml/spo/ 获取。

英文摘要

Policy alignment to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., Bradley-Terry model / logistic link). Misspecification of this link can bias inferred rewards and misalign learned policies. We study policy alignment under an unknown and unrestricted link function. We formulate an $f$-divergence-constrained reward maximization problem and show that realizability in a policy class induces a semiparametric single-index binary choice model, where a scalar policy-induced index captures all dependence on demonstrations and the remaining preference distribution is unrestricted. Rather than impose identifiability of structural parameters of such a model and estimate them, as in econometrics, we develop methods that directly learn policies, with the reward function implicit, analyzing error to the optimal policy and allowing for unidentifiable and nonparametric indices. We prove link-agnostic convergence guarantees in terms of generic function complexity measures and validate the methods and theory empirically. Code is available at https://github.com/causalml/spo/.

2506.06178 2026-06-04 cs.LG

Reusing Trajectories in Policy Gradients Enables Fast Convergence

在策略梯度中重用轨迹实现快速收敛

Alessandro Montenegro, Federico Mansutti, Marco Mussi, Matteo Papini, Alberto Maria Metelli

发表机构 * University of Bologna(博洛尼亚大学)

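AI summary: Proposes RT-PG, which reuses past trajectories through a power mean-corrected multiple importance weighting estimator and lowers the sample complexity of policy gradient to $\tilde{O}(\epsilon^{-2}\omega^{-1})$, reaching $\tilde{O}(\epsilon^{-1})$ when all past trajectories are reused, the best known rate for PG methods.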

详情
AI中文摘要

策略梯度(PG)方法是一类有效的强化学习算法,特别是在处理连续控制问题时。它们依赖于新鲜的在线策略数据,导致样本效率低下,需要$O(\epsilon^{-2})$条轨迹才能达到$\epsilon$近似平稳点。提高效率的一种常见策略是重用过去迭代的信息,例如之前的梯度或轨迹,从而产生离策略PG方法。虽然梯度重用已受到广泛关注,并将速率提高到$O(\epsilon^{-3/2})$,但过去轨迹的重用虽然直观,在理论上仍基本未被探索。在这项工作中,我们提供了第一个严格的理论证据,表明重用过去的离策略轨迹可以显著加速PG收敛。我们提出了RT-PG(重用轨迹-策略梯度),一种新颖的算法,它利用幂均值校正的多重重要性加权估计器,有效地结合来自最近$\omega$次迭代的在线策略和离策略数据。通过新颖的分析,我们证明RT-PG实现了$\tilde{O}(\epsilon^{-2}\omega^{-1})$的样本复杂度。当重用所有可用的过去轨迹时,这导致$\tilde{O}(\epsilon^{-1})$的速率,这是文献中PG方法已知的最佳速率。我们进一步通过实验验证了我们的方法,证明了其相对于具有最先进速率的基线的有效性。

英文摘要

Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. They rely on fresh on-policy data, making them sample-inefficient and requiring $O(\epsilon^{-2})$ trajectories to reach an $\epsilon$-approximate stationary point. A common strategy to improve efficiency is to reuse information from past iterations, such as previous gradients or trajectories, leading to off-policy PG methods. While gradient reuse has received substantial attention, leading to improved rates up to $O(\epsilon^{-3/2})$, the reuse of past trajectories, although intuitive, remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that reusing past off-policy trajectories can significantly accelerate PG convergence. We propose RT-PG (Reusing Trajectories - Policy Gradient), a novel algorithm that leverages a power mean-corrected multiple importance weighting estimator to effectively combine on-policy and off-policy data coming from the most recent $\omega$ iterations. Through a novel analysis, we prove that RT-PG achieves a sample complexity of $\tilde{O}(\epsilon^{-2}\omega^{-1})$. When reusing all available past trajectories, this leads to a rate of $\tilde{O}(\epsilon^{-1})$, the best known one in the literature for PG methods. We further validate our approach empirically, demonstrating its effectiveness against baselines with state-of-the-art rates.
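The two rates quoted above are algebraically consistent when the reuse window grows in proportion to $\epsilon^{-1}$. The constant $c$ below is introduced only for this illustration, and the scaling of $\omega$ is inferred from the two stated rates rather than quoted from the paper:

$$
\tilde{O}\!\left(\epsilon^{-2}\,\omega^{-1}\right)\Big|_{\omega = c\,\epsilon^{-1}}
= \tilde{O}\!\left(\epsilon^{-2}\cdot c^{-1}\,\epsilon\right)
= \tilde{O}\!\left(\epsilon^{-1}\right)
$$

As a rough numerical reading that ignores constants and logarithmic factors: for $\epsilon = 10^{-2}$, the on-policy rate $\epsilon^{-2}$ is on the order of $10^{4}$ trajectories, a window of $\omega = 10$ brings the bound to about $10^{3}$, and $\omega = 10^{2}$ brings it to about $10^{2} = \epsilon^{-1}$.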

2602.01429 2026-06-04 cs.RO

Sem-NaVAE: Semantically-Guided Outdoor Mapless Navigation via Generative Trajectory Priors

Sem-NaVAE: 基于语义引导的室外无地图导航通过生成式轨迹先验

Gonzalo Olguín, Javier Ruiz-del-Solar

发表机构 * Department of Electrical Engineering & the Advanced Mining Technology Center (AMTC), Universidad de Chile(电气工程系及先进采矿技术中心(AMTC)、智利大学)

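AI summary: Proposes Sem-NaVAE, which pairs a conditional variational autoencoder that generates diverse trajectories with a lightweight vision-language model that selects among them semantically, enabling real-time outdoor mapless navigation with a 90% success rate in unseen environments.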

Comments Accepted for publication in IEEE Robotics and Automation Letters (RA-L). 8 pages, 5 figures

详情
AI中文摘要

本工作提出了一种用于室外应用的无地图导航方法。它结合了条件变分自编码器(CVAE)生成轨迹的探索能力和轻量视觉语言模型(VLM)的语义分割能力来选择要执行的轨迹。使用开放词汇分割基于自然语言对生成的轨迹进行评分和选择,并由最先进的局部规划器执行速度命令。该方法的关键特性之一是能够生成大量多样的轨迹并实时选择它们进行导航。在真实世界的室外实验中,Sem-NaVAE在未见环境中的120-240米路线上实现了90%的成功率,比最近的基线高出10%,同时保持在基于地图的上限的7%以内。展示系统实验运行的视频可在https://youtu.be/i3R5ey5O2yk找到。

英文摘要

This work presents a mapless navigation approach for outdoor applications. It combines the exploratory capacity of conditional variational autoencoders (CVAEs) to generate trajectories and the semantic segmentation capabilities of a lightweight visual language model (VLM) to select the trajectory to execute. Open-vocabulary segmentation is used to score and select the generated trajectories based on natural language, and a state-of-the-art local planner executes velocity commands. One of the key features of the proposed approach is its ability to generate a wide variety of trajectories and select among them for navigation in real time. In real-world outdoor experiments, Sem-NaVAE achieves a 90% success rate across routes of 120-240 m in unseen environments, outperforming the nearest baseline by 10% while remaining within 7% of a map-based upper bound. A video showing an experimental run of the system can be found at https://youtu.be/i3R5ey5O2yk.
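The generate-then-select loop described above can be sketched in a few lines. Everything here is a stand-in: `sample_trajectories` replaces the CVAE decoder with a random walk, the `traversable` set of grid cells replaces the VLM's open-vocabulary segmentation, and the scoring rule (fraction of waypoints on traversable cells) is an assumption made for illustration.

```python
import random
from typing import List, Set, Tuple

Point = Tuple[int, int]


def sample_trajectories(n: int, steps: int, rng: random.Random) -> List[List[Point]]:
    """Stand-in for a CVAE decoder: n random forward-moving grid paths."""
    trajectories = []
    for _ in range(n):
        x, y, traj = 0, 0, []
        for _ in range(steps):
            x += 1
            y += rng.choice([-1, 0, 1])
            traj.append((x, y))
        trajectories.append(traj)
    return trajectories


def score(traj: List[Point], traversable: Set[Point]) -> float:
    """Fraction of waypoints that land on cells labelled traversable."""
    return sum(1 for p in traj if p in traversable) / len(traj)


def select(trajs: List[List[Point]], traversable: Set[Point]) -> List[Point]:
    """Pick the highest-scoring candidate to hand to the local planner."""
    return max(trajs, key=lambda t: score(t, traversable))


path_cells = {(x, 0) for x in range(1, 6)}  # e.g. cells segmented as "path"
candidates = sample_trajectories(20, 5, random.Random(0))
best = select(candidates, path_cells)
```

Sampling many candidates and scoring them independently is what lets the semantic criterion change (a different natural-language prompt yields a different `traversable` set) without retraining the trajectory generator.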

2602.01146 2026-06-04 cs.AI

PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?

PersistBench: 大型语言模型何时应忘记长期记忆?

Sidharth Pulipaka, Oliver Chen, Manas Sharma, Taaha S Bajwa, Vyas Raina, Ivaxi Sheth

发表机构 * University of California, Berkeley(加州大学伯克利分校)

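AI summary: Introduces the PersistBench benchmark to evaluate two safety risks of LLM long-term memory, cross-domain leakage and memory-induced sycophancy, and finds median failure rates of 53% and 97%, respectively, across 18 frontier and open-source models.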

Comments 76 pages, 34 figures, ICML (2026)

详情
AI中文摘要

对话助手正越来越多地将长期记忆与大型语言模型(LLM)集成。这种记忆的持久性,例如用户是素食主义者,可以增强未来对话中的个性化。然而,同样的持久性也可能引入很大程度上被忽视的安全风险。因此,我们引入 PersistBench 来衡量这些安全风险的程度。我们识别出两种长期记忆特有的风险:跨域泄露,即 LLM 不恰当地从长期记忆中注入上下文;以及记忆诱导的谄媚,即存储的长期记忆暗中强化用户偏见。我们在基准上评估了 18 个前沿和开源 LLM。我们的结果显示这些 LLM 的失败率高得惊人——跨域样本的中位失败率为 53%,谄媚样本为 97%。为了解决这个问题,我们的基准鼓励在前沿对话系统中开发更稳健、更安全的长期记忆使用方式。

英文摘要

Conversational assistants are increasingly integrating long-term memory with large language models (LLMs). This persistence of memories, e.g., the user is vegetarian, can enhance personalization in future conversations. However, the same persistence can also introduce safety risks that have been largely overlooked. Hence, we introduce PersistBench to measure the extent of these safety risks. We identify two long-term memory-specific risks: cross-domain leakage, where LLMs inappropriately inject context from the long-term memories; and memory-induced sycophancy, where stored long-term memories insidiously reinforce user biases. We evaluate 18 frontier and open-source LLMs on our benchmark. Our results reveal a surprisingly high failure rate across these LLMs - a median failure rate of 53% on cross-domain samples and 97% on sycophancy samples. To address this, our benchmark encourages the development of more robust and safer long-term memory usage in frontier conversational systems.

2602.01083 2026-06-04 cs.LG

On the Expressive Power of Permutation-Equivariant Weight-Space Networks

关于置换等变权重空间网络的表达能力

Adir Dayan, Yam Eitan, Haggai Maron

Affiliations: Technion - Israel Institute of Technology; NVIDIA Research

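AI summary: Develops a systematic theory of the expressive power of permutation-equivariant weight-space networks, proves that the prominent architectures are equivalent in expressive power, establishes universality in weight space and function space under mild assumptions, and uses the theory to guide model modifications that yield a 34% improvement over prior SOTA.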

Comments Accepted as a spotlight paper at ICML 2026

详情
AI中文摘要

权重空间学习研究直接对其他神经网络的参数进行操作的神经架构。受预训练模型日益普及的推动,最近的工作展示了权重空间网络在广泛任务中的有效性。SOTA权重空间网络依赖置换等变设计来提高泛化能力。然而,这可能会对表达能力产生负面影响,需要进行理论研究。重要的是,与其他结构化领域不同,权重空间学习的目标是对权重空间和函数空间都进行操作的映射,这使得表达能力分析尤为微妙。虽然一些先前的工作提供了部分表达能力结果,但全面的刻画仍然缺失。在这项工作中,我们通过为权重空间网络的表达能力开发系统理论来填补这一空白。我们首先证明所有主流的置换等变网络在表达能力上是等价的。然后,我们在输入权重的温和自然假设下,建立了权重空间和函数空间设置中的普适性,并刻画了普适性不再成立的边缘情况。在我们的理论结果指导下,我们表明对现有权重空间模型的轻微修改相比先前SOTA实现了34%的提升,展示了我们框架的实际相关性。

英文摘要

Weight-space learning studies neural architectures that operate directly on the parameters of other neural networks. Motivated by the growing availability of pretrained models, recent work has demonstrated the effectiveness of weight-space networks across a wide range of tasks. SOTA weight-space networks rely on permutation-equivariant designs to improve generalization. However, this may negatively affect expressive power, warranting theoretical investigation. Importantly, unlike other structured domains, weight-space learning targets maps operating on both weight and function spaces, making expressivity analysis particularly subtle. While a few prior works provide partial expressivity results, a comprehensive characterization is still missing. In this work, we address this gap by developing a systematic theory for expressivity of weight-space networks. We first prove that all prominent permutation-equivariant networks are equivalent in expressive power. We then establish universality in both weight- and function-space settings under mild, natural assumptions on the input weights, and characterize the edge-case regimes where universality no longer holds. Guided by our theoretical results, we show that slight modifications to existing weight-space models yield a 34% improvement over prior SOTA, demonstrating the practical relevance of our framework.
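The symmetry that motivates permutation-equivariant designs can be checked directly: reordering the hidden units of an MLP, together with the matching rows and columns of its weights, changes the parameters but not the function. The sketch below uses a two-layer ReLU network; the helper names `mlp` and `permute_hidden` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def mlp(x: List[float], W1: Matrix, b1: List[float], W2: Matrix, b2: List[float]) -> List[float]:
    """Two-layer ReLU MLP: W2 @ relu(W1 @ x + b1) + b2."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]


def permute_hidden(W1: Matrix, b1: List[float], W2: Matrix, perm: List[int]):
    """Reorder hidden units: rows of W1, entries of b1, and columns of W2."""
    W1p = [W1[i] for i in perm]
    b1p = [b1[i] for i in perm]
    W2p = [[row[i] for i in perm] for row in W2]
    return W1p, b1p, W2p


W1 = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[1.0, -2.0, 0.5]]
b2 = [0.05]
x = [0.7, -0.4]

W1p, b1p, W2p = permute_hidden(W1, b1, W2, [2, 0, 1])
y_original = mlp(x, W1, b1, W2, b2)
y_permuted = mlp(x, W1p, b1p, W2p, b2)
```

A network that takes `(W1, b1, W2, b2)` as input should therefore treat the original and permuted parameter sets consistently, which is the constraint whose effect on expressive power the paper analyzes.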

2602.01027 2026-06-04 cs.LG

SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models

SFMP:面向大语言模型的细粒度、硬件友好且免搜索的混合精度量化

Xin Nie, Haicheng Zhang, Liang Dong, Beining Feng, Jinhong Weng, Guiling Sun

发表机构 * College of Electronic Information and Optical Engineering, Nankai University(南开大学电子信息与光工程学院)

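AI summary: Proposes the SFMP framework, which combines fractional bit-widths, block-wise mixed precision, row-column weight reordering, and a unified GEMM kernel to achieve search-free, hardware-friendly mixed-precision quantization that outperforms existing methods under the same memory constraints.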

Comments 30 pages,17 figures

详情
AI中文摘要

混合精度量化是在严格内存预算下压缩大型语言模型的一种有前景的方法。然而,现有的混合精度方法通常存在两个限制之一:它们要么依赖昂贵的离散优化来确定精度分配,要么由于不规则的内存布局而导致硬件效率低下。我们提出了SFMP,一个用于大型语言模型的免搜索且硬件友好的混合精度量化框架。该框架基于四个新颖的想法:分数位宽,将权重矩阵的整数位宽扩展为分数值,并将离散精度分配转化为连续问题;2)块级混合精度,在权重矩阵内实现细粒度精度,同时保持硬件友好性;3)行列权重重排序,通过行和列重排序聚合显著权重,在推理过程中仅引入少量激活重排序开销;4)统一GEMM内核,支持任意平均位宽的混合精度GEMM。大量实验表明,在相同内存约束下,SFMP优于最先进的逐层混合精度方法,同时显著降低量化成本并提高推理效率。代码可在https://github.com/Nkniexin/SFMP获取。

英文摘要

Mixed-precision quantization is a promising approach for compressing large language models under tight memory budgets. However, existing mixed-precision methods typically suffer from one of two limitations: they either rely on expensive discrete optimization to determine precision allocation, or introduce hardware inefficiencies due to irregular memory layouts. We propose SFMP, a search-free and hardware-friendly mixed-precision quantization framework for large language models. The framework is built upon four novel ideas: 1) Fractional bit-width, which extends the integer bit-width of a weight matrix to fractional values and recasts discrete precision allocation as a continuous problem; 2) Block-wise mixed-precision, which enables fine-grained precision within weight matrices while remaining hardware-friendly; 3) Row-column weight reordering, which aggregates salient weights via row and column reordering, incurring only a small activation reordering overhead during inference; 4) Unified GEMM kernel, which supports mixed-precision GEMM at arbitrary average bit-width. Extensive experiments demonstrate that SFMP outperforms state-of-the-art layer-wise mixed-precision methods under the same memory constraints, while significantly reducing quantization cost and improving inference efficiency. Code is available at https://github.com/Nkniexin/SFMP
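One way to read "fractional bit-width" together with "block-wise mixed-precision" is that a matrix whose blocks use two adjacent integer precisions has a fractional average. The sketch below assumes exactly that reading, plus a simple rule of giving the extra bit to the most salient blocks; the function `allocate_bits` and the salience scores are hypothetical, and the paper's actual allocation rule is not given in the abstract.

```python
import math
from typing import List


def allocate_bits(salience: List[float], target_bits: float) -> List[int]:
    """Assign floor(target) or floor(target)+1 bits per block to hit a fractional average.

    Assumed rule: the most salient blocks receive the higher precision.
    """
    low = math.floor(target_bits)
    n_high = round((target_bits - low) * len(salience))
    order = sorted(range(len(salience)), key=lambda i: salience[i], reverse=True)
    bits = [low] * len(salience)
    for i in order[:n_high]:
        bits[i] = low + 1
    return bits


salience = [0.1, 0.9, 0.5, 0.3]          # one score per weight block
bits = allocate_bits(salience, 2.5)      # half the blocks at 3 bits, half at 2
average = sum(bits) / len(bits)
```

No search is involved: the target average fixes how many blocks get the higher precision, and a single sort decides which ones.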

2601.21461 2026-06-04 cs.LG cs.AI

L$^3$: Large Lookup Layers

L$^3$:大型查找层

Albert Tseng, Christopher De Sa

发表机构 * Department of Computer Science, Cornell University(康奈尔大学计算机科学系)

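AI summary: Proposes the Large Lookup Layer (L$^3$), which achieves sparsity by using static token-based routing to aggregate per-token embeddings, and outperforms dense models and iso-sparse MoEs on language modeling and downstream tasks.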

Comments ICML 2026

详情
AI中文摘要

现代稀疏语言模型通常通过混合专家(MoE)层实现稀疏性,该层动态地将token路由到稠密MLP“专家”。然而,动态硬路由存在一些缺点,例如潜在的硬件效率低下以及需要辅助损失来稳定训练。相比之下,分词器嵌入表本质上是稀疏的,通过为每个token选择单个嵌入来避免这些问题,但代价是没有上下文信息。在这项工作中,我们引入了大型查找层(L$^3$),它将嵌入表推广到模型解码器层,作为进一步扩展稀疏性的一种手段。L$^3$层使用基于token的静态路由,以上下文相关的方式聚合每个token的一组学习嵌入,允许模型通过将信息缓存在嵌入中有效地平衡内存和计算。L$^3$有两个主要组成部分:(1)一个系统友好的架构,允许快速训练和CPU卸载推理,且没有开销;(2)一种信息论嵌入分配算法,有效平衡速度和质量。我们通过训练具有多达2.6B活动参数的transformer来实证测试L$^3$,发现L$^3$在语言建模和下游任务中均显著优于稠密模型和等稀疏MoE。

英文摘要

Modern sparse language models typically achieve sparsity through Mixture-of-Experts (MoE) layers, which dynamically route tokens to dense MLP "experts." However, dynamic hard routing has a number of drawbacks, such as potentially poor hardware efficiency and needing auxiliary losses for stable training. In contrast, the tokenizer embedding table, which is natively sparse, largely avoids these issues by selecting a single embedding per token at the cost of not having contextual information. In this work, we introduce the Large Lookup Layer (L$^3$), which generalizes embedding tables to model decoder layers as a means of further scaling sparsity. L$^3$ layers use static token-based routing to aggregate a set of learned embeddings per token in a context-dependent way, allowing the model to efficiently balance memory and compute by caching information in embeddings. L$^3$ has two main components: (1) a systems-friendly architecture that allows for fast training and CPU-offloaded inference with no overhead, and (2) an information-theoretic embedding allocation algorithm that effectively balances speed and quality. We empirically test L$^3$ by training transformers with up to 2.6B active parameters and find that L$^3$ strongly outperforms both dense models and iso-sparse MoEs in both language modeling and downstream tasks.
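Static token-based routing with context-dependent aggregation can be sketched as below. The routing is static because the set of embedding rows is fixed per token id; the aggregation is context-dependent because the mixing weights come from the current hidden state. The softmax-over-dot-products weighting, the shared dimension between hidden state and embeddings, and the names `l3_lookup` and `allocation` are all assumptions for illustration, not the paper's architecture.

```python
import math
from typing import Dict, List


def l3_lookup(token_id: int, hidden: List[float],
              table: List[List[float]], allocation: Dict[int, List[int]]) -> List[float]:
    """Mix the embeddings statically assigned to `token_id`, weighted by the hidden state."""
    rows = [table[i] for i in allocation[token_id]]           # static routing by token id
    scores = [sum(h * e for h, e in zip(hidden, row)) for row in rows]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]                # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]                       # context-dependent weights
    return [sum(w * row[k] for w, row in zip(weights, rows)) for k in range(len(hidden))]


table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
allocation = {0: [0, 1], 1: [2]}   # token 0 owns two embeddings, token 1 owns one

out_single = l3_lookup(1, [0.3, 0.7], table, allocation)
out_mixed = l3_lookup(0, [10.0, 0.0], table, allocation)
```

Because the rows for a token are known from the token id alone, they can be fetched before the hidden state is available, which is one way a lookup of this kind could be served from slower memory.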

2601.22601 2026-06-04 cs.LG

Lethe: Principled Dual-Stream Update for Persistent Knowledge Erasure in Federated Unlearning

Wentai Wu, Hanwei Tan, Yijun Quan, Haixia Peng, Ligang He, Bin Yang, C. L. Philip Chen

Affiliations: Department of Computer Science, College of Information Science and Technology, Jinan University; WMG, University of Warwick; School of Information and Communications Engineering, Xi’an Jiaotong University; Department of Computer Science, University of Warwick; School of Data Science and Engineering, East China Normal University; School of Computer Science and Engineering, South China University of Technology

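AI summary: To address the resurfacing of unlearned knowledge when training continues after federated unlearning, proposes Lethe, which achieves persistent knowledge erasure through anti-aligned updates of a forget stream and a retain stream.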

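Details
AI-generated abstract (translated from Chinese)

Federated unlearning (FU) aims to erase knowledge from a global model. Existing studies usually assume that federated collaboration ends after unlearning, overlooking the practical deployment scenario in which the remaining clients keep training after deletion requests have been fulfilled. In this work, we identify a critical failure mode, which we call knowledge resurfacing: continued training on retained data alone can reactivate unlearned knowledge within a few rounds. Experiments show that many state-of-the-art FU methods are prone to knowledge resurfacing. We then propose Lethe, a new unlearning method for persistent knowledge erasure in federated settings. In each iteration, Lethe operates on a forget stream from the unlearning client and a retain stream from the retained clients. It redirects unlearning updates toward a region where the two streams are anti-aligned, discouraging training on retained data from moving back toward the forgotten knowledge. As a result, Lethe ensures stronger unlearning persistence during subsequent federated training. Extensive experiments across different models, datasets, and unlearning levels verify that Lethe supports all unlearning levels in a unified way across CV and NLP tasks, and consistently shows a low resurfacing rate, below 1% in most cases, even after a very long period of follow-up training.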

English abstract

Federated unlearning (FU) aims to erase knowledge from a global model. Existing studies commonly assume that federated collaboration terminates after unlearning, overlooking a deployment-realistic scenario where training continues on the remaining clients after deletion requests are fulfilled. In this work, we identify a critical failure mode, termed knowledge resurfacing, revealing that continued training on retained data alone can reactivate unlearned knowledge in a few rounds. Empirically, we demonstrate that many state-of-the-art FU methods are prone to knowledge resurfacing. We then propose Lethe, a novel unlearning method for persistent knowledge erasure in federated settings. In each iteration, Lethe operates on a forget stream from the unlearning client and a retain stream from the retained clients. It redirects unlearning updates toward a region where the two streams are anti-aligned, discouraging retained-data training from moving back toward the forgotten knowledge. Consequently, Lethe ensures stronger unlearning persistence during subsequent federated training. Extensive experiments across diverse models, datasets, and unlearning levels validate that Lethe supports all levels of unlearning in a unified manner across both CV and NLP tasks, demonstrating a consistently low resurfacing rate (RR), below 1% in most cases, even after an extremely long horizon of follow-up training.
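
The abstract does not say how anti-alignment between the two streams is measured. As a rough illustration of the notion only, the sketch below assumes each stream yields a flattened update vector and uses cosine similarity, with a negative value read as anti-aligned. The function names, the zero threshold, and the example vectors are hypothetical. This is a diagnostic, not Lethe's update rule.

```python
import math


def cosine_alignment(forget_update, retain_update):
    """Cosine similarity between two flattened update vectors (0.0 if either is zero)."""
    dot = sum(f * r for f, r in zip(forget_update, retain_update))
    norm_f = math.sqrt(sum(f * f for f in forget_update))
    norm_r = math.sqrt(sum(r * r for r in retain_update))
    if norm_f == 0.0 or norm_r == 0.0:
        return 0.0
    return dot / (norm_f * norm_r)


def is_anti_aligned(forget_update, retain_update, threshold=0.0):
    """Assumed criterion: the two streams point in opposing directions."""
    return cosine_alignment(forget_update, retain_update) < threshold


# Hypothetical update vectors for one round.
forget_update = [1.0, -2.0, 0.5]
retain_update = [-1.0, 2.0, -0.5]
opposed = is_anti_aligned(forget_update, retain_update)
aligned = is_anti_aligned(forget_update, forget_update)
```

Under this reading, a retain-stream step that is anti-aligned with the unlearning direction does not move the model back toward the forgotten knowledge, which is the intuition the abstract gives for low resurfacing.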