2509.25289 2026-06-04 cs.LG cs.AI

ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation

Mohammadreza Bakhtyari, Bogdan Mazoure, Renato Cordeiro de Amorim, Guillaume Rabusseau, Vladimir Makarenkov

Affiliations: Département d’Informatique, Université du Québec à Montréal; Mila - Quebec AI Institute; School of Computer Science and EE, University of Essex; Department of Computer Science and Operations Research, Université de Montréal

AI summary: Proposes ClustRecNet, an end-to-end deep learning framework that recommends suitable clustering algorithms by directly learning high-order representations of raw tabular data; it outperforms traditional internal cluster validity indices and AutoML methods on synthetic and real-world benchmarks.

Comments Published in IEEE Access

Journal ref IEEE Access, vol. 14, pp. 81352 - 81365, 2026

DOI: 10.1109/ACCESS.2026.3697689

Abstract

Identifying an effective clustering algorithm for a given dataset remains a fundamental unsupervised learning issue. We introduce ClustRecNet, a novel end-to-end deep learning framework that recommends suitable clustering algorithm(s) by directly learning high-order representations of raw tabular data. To facilitate robust meta-learning, we first construct a comprehensive repository of 34,000 synthetic datasets encompassing a large variety of clustering scenarios, run 10 popular clustering algorithms, and use Adjusted Rand Index (ARI) to establish ground-truth labels. ClustRecNet's architecture incorporates a convolution block, two residual blocks, and an attention block to capture local and global structural patterns, effectively bypassing the knowledge bottleneck associated with manual feature engineering. Extensive evaluation on both synthetic and real-world benchmarks demonstrates that ClustRecNet consistently outperforms traditional internal cluster validity indices such as Silhouette, Calinski-Harabasz, Davies-Bouldin, and Dunn as well as state-of-the-art Automated Machine Learning (AutoML) approaches such as ML2DAC, AutoCluster, and AutoML4Clust. For example, our framework achieves an average 0.497 ARI gain over the Calinski-Harabasz cluster validity index on synthetic data and an average 44.16% ARI improvement over the leading AutoML approach (ML2DAC) on real-world benchmarks. Code and data are available at: https://github.com/mrbakhtyari/ClustRecNet
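The labeling step described in the abstract (run several algorithms on a dataset with a known partition, score each result with ARI, keep the best) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the function names `adjusted_rand_index` and `best_algorithm` and the toy label lists in the usage note are assumptions made here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index between two flat labelings of the same points."""
    if len(truth) != len(pred):
        raise ValueError("labelings must have the same length")
    n = len(truth)
    # Count co-clustered pairs in each contingency cell, row, and column.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    total_pairs = comb(n, 2)
    expected = sum_rows * sum_cols / total_pairs if total_pairs else 0.0
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case: both labelings are one cluster or all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def best_algorithm(truth, predictions):
    """Return (name, ari) of the candidate partition that best matches truth.

    `predictions` maps a (hypothetical) algorithm name to its label list.
    """
    scores = {name: adjusted_rand_index(truth, labels)
              for name, labels in predictions.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]
```

For example, `best_algorithm([0, 0, 1, 1], {"a": [1, 1, 0, 0], "b": [0, 1, 0, 1]})` selects `"a"`, because ARI ignores how clusters are named and only compares which points are grouped together.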

2602.08498 2026-06-04 cs.CL

Characterizing, Evaluating, and Optimizing Complex Reasoning

Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang, Xiaoye Qu, Yu Cheng

Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; University of Science and Technology of China, Hefei, Anhui, China; The Chinese University of Hong Kong, Hong Kong, China; Nanjing University, Suzhou, Jiangsu, China; Peking University, Beijing, China

AI summary: Proposes the ME$^2$ principle to characterize reasoning quality, a pairwise evaluation method based on directed acyclic graphs (DAGs), and the TRM-Preference dataset, which is used to train a Thinking Reward Model (TRM) for optimizing reasoning.

Comments Code and data are available at https://github.com/Simplified-Reasoning/TRM

Abstract

Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle to characterize reasoning quality at the macro and micro levels with respect to efficiency and effectiveness. (2) Built on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method that captures complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks. Code and data are available at https://github.com/Simplified-Reasoning/TRM.

2602.08142 2026-06-04 cs.LG stat.ML

Variance-Gated Ensembles: An Epistemic-Aware Framework for Uncertainty Estimation

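H. Martin Gillis, Isaac Xu, Thomas Trappenberg

Affiliations: Faculty of Computer Science, Dalhousie University, Halifax, NS

AI summary: Proposes the Variance-Gated Ensembles (VGE) framework, which injects epistemic sensitivity through a signal-to-noise gate computed from ensemble statistics, giving efficient and differentiable uncertainty estimation that matches or exceeds existing methods in computational efficiency and performance.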

Comments Published in Transactions on Machine Learning Research (06/2026)

Abstract

Machine learning applications require fast and reliable per-sample uncertainty estimation. A common approach is to use predictive distributions from Bayesian or approximation methods and additively decompose uncertainty into aleatoric (i.e., data-related) and epistemic (i.e., model-related) components. However, additive decomposition has recently been questioned, with evidence that it breaks down when using finite-ensemble sampling and/or mismatched predictive distributions. This paper introduces Variance-Gated Ensembles (VGE), an intuitive, differentiable framework that injects epistemic sensitivity via a signal-to-noise gate computed from ensemble statistics. VGE provides: (i) a Variance-Gated Margin Uncertainty (VGMU) score that couples decision margins with ensemble predictive variance; and (ii) a Variance-Gated Normalization (VGN) layer that generalizes the variance-gated uncertainty mechanism to training via per-class, learnable normalization of ensemble member probabilities. We derive closed-form vector-Jacobian products enabling end-to-end training through ensemble sample mean and variance. VGE matches or exceeds state-of-the-art information-theoretic baselines while remaining computationally efficient. As a result, VGE provides a practical and scalable approach to epistemic-aware uncertainty estimation in ensemble models.
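For context, the additive decomposition that the abstract questions is commonly written in entropy form: total uncertainty is the entropy of the ensemble-mean prediction, the aleatoric part is the mean entropy of the members, and the epistemic part is the difference (a mutual information). The sketch below assumes that standard form, which the abstract does not spell out, and it does not implement VGE, VGMU, or VGN; the names `entropy` and `decompose_uncertainty` are chosen here for illustration.

```python
from math import log


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * log(q) for q in p if q > 0.0)


def decompose_uncertainty(member_probs):
    """Split ensemble uncertainty into (total, aleatoric, epistemic).

    `member_probs` holds one class-probability list per ensemble member,
    all for the same input sample.
    """
    m = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / m for c in range(k)]
    total = entropy(mean)                                  # entropy of the mean
    aleatoric = sum(entropy(p) for p in member_probs) / m  # mean member entropy
    epistemic = total - aleatoric                          # disagreement term
    return total, aleatoric, epistemic
```

When all members output the same distribution the epistemic term is zero; when confident members disagree, such as `[[1, 0], [0, 1]]`, the whole uncertainty is attributed to the epistemic term.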

2602.06883 2026-06-04 cs.LG cs.CV stat.ML

Vision Transformer Finetuning Benefits from Non-Smooth Components

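Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko

Affiliations: Noah's Ark Lab; Univ. Rennes 2, Inria

AI summary: By analyzing the plasticity of vision transformer components (the sensitivity of their outputs to changes in their inputs), the paper finds that attention modules and feedforward layers with high plasticity (low smoothness) finetune better, challenging the conventional view that smoothness is beneficial.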

Comments Accepted at ICML 2026

Abstract

The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their \emph{plasticity}. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies a low smoothness. Our theoretical analysis and extensive experiments -- over $1,000$ finetuning runs on large-scale vision transformers -- showcase that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on transformers' functional properties. The code is available at https://github.com/ambroiseodt/vit-plasticity.
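One way to read "an average rate of change" is as the mean ratio between the output displacement and the input displacement over pairs of inputs. The sketch below follows that reading with Euclidean norms; the exact definition used in the paper may differ, and the helper names `l2` and `average_rate_of_change` are assumptions introduced here.

```python
from math import sqrt


def l2(v):
    """Euclidean norm of a list of floats."""
    return sqrt(sum(x * x for x in v))


def average_rate_of_change(f, pairs):
    """Mean of ||f(x') - f(x)|| / ||x' - x|| over pairs of distinct inputs."""
    ratios = []
    for x, x_prime in pairs:
        dx = l2([a - b for a, b in zip(x_prime, x)])
        if dx == 0.0:
            continue  # identical inputs: the ratio is undefined, so skip
        dy = l2([a - b for a, b in zip(f(x_prime), f(x))])
        ratios.append(dy / dx)
    if not ratios:
        raise ValueError("need at least one pair of distinct inputs")
    return sum(ratios) / len(ratios)
```

Under this reading, a component that maps `v` to `3 * v` has a rate of 3 on every pair, while a constant map has a rate of 0, the smoothest and least plastic case.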

2601.20800 2026-06-04 cs.LG cs.AI

Conditional PED-ANOVA: Hyperparameter Importance in Hierarchical & Dynamic Search Spaces

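Kaito Baba, Yoshihiko Ozaki, Shuhei Watanabe

Affiliations: Preferred Networks, Inc.; The University of Tokyo; SB Intuitions Corp.

AI summary: Proposes the conditional PED-ANOVA framework for estimating hyperparameter importance in conditional search spaces; its closed-form estimator accurately reflects conditional activation and domain changes, and experiments show it outperforms naive adaptations of existing methods.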

Comments 20 pages, 15 figures. Accepted to the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

DOI: 10.1145/3770855.3817758

Abstract

We propose conditional PED-ANOVA (condPED-ANOVA), a principled framework for estimating hyperparameter importance (HPI) in conditional search spaces, where the presence or domain of a hyperparameter can depend on other hyperparameters. Although the original PED-ANOVA provides a fast and efficient way to estimate HPI within the top-performing regions of the search space, it assumes a fixed, unconditional search space and therefore cannot properly handle conditional hyperparameters. To address this, we introduce a conditional HPI for top-performing regions and derive a closed-form estimator that accurately reflects conditional activation and domain changes. Experiments show that naive adaptations of existing HPI estimators yield misleading or uninterpretable importances in conditional settings, whereas condPED-ANOVA consistently provides meaningful importances that reflect the underlying conditional structure. Our code is publicly available at https://github.com/kAIto47802/condPED-ANOVA.
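To illustrate what a conditional search space looks like, the sketch below samples from a tiny one in which a hyperparameter's presence and another's domain both depend on an earlier choice. The hyperparameter names, ranges, and the function `sample_config` are invented for this example and are not taken from the paper or its code; no importance estimator is implemented here.

```python
import random


def sample_config(rng):
    """Draw one configuration from a small conditional search space."""
    config = {"optimizer": rng.choice(["sgd", "adam"])}
    if config["optimizer"] == "sgd":
        # Conditional presence: `momentum` exists only when SGD is chosen.
        config["momentum"] = rng.uniform(0.0, 0.99)
        # Conditional domain: the learning-rate range depends on the optimizer.
        config["lr"] = rng.uniform(1e-3, 1e-1)
    else:
        config["lr"] = rng.uniform(1e-5, 1e-2)
    return config


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_config(rng))
```

An estimator that assumes a fixed, unconditional space has no natural value for `momentum` in the Adam configurations, which is the kind of situation the abstract says naive adaptations handle poorly.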

2602.05657 2026-06-04 cs.LG math.OC

Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization

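Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H. Sayed

Affiliations: École Polytechnique Fédérale de Lausanne; University of Novi Sad; Carnegie Mellon University

AI summary: Using large deviations theory, the paper studies the long-term tail decay of SGD and clipped SGD in non-convex optimization, gives exponential upper and lower bounds on the tails of the squared gradient norm, and shows decay rates on the order of $e^{-t/\log(t)}$, an order of magnitude faster than existing finite-time bounds.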

Comments 34 pages

Abstract

The tail behaviour of SGD-induced processes has attracted considerable interest because it offers strong guarantees for individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of finite-time nature, limiting their ability to capture the true long-term tail decay, which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps by studying the long-term tail decay of SGD-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the gradient norm-squared of the best iterate produced by (vanilla) SGD, for non-convex costs and bounded noise, with long-term decay at rate $e^{-t/\log(t)}$. Next, we relax the noise assumption by considering clipped SGD (c-SGD) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-t^{\beta_p}/\log(t)}$, where $\beta_p = \frac{4(p-1)}{3p-2}$, for $p \in (1,2)$, and at rate $e^{-t/\log^2(t)}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both SGD and c-SGD are tight, up to poly-logarithmic factors. Notably, our results demonstrate an order of magnitude faster long-term tail decay compared to existing work based on finite-time bounds, which show rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for SGD and c-SGD, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.
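To make the exponents concrete, the following worked evaluation plugs a few values of $p$ into the formula stated in the abstract. It is only arithmetic on the quoted rates; the choice $p = 3/2$ is an illustrative assumption, not a case singled out by the authors.

```latex
\begin{align*}
\beta_p &= \frac{4(p-1)}{3p-2}, \qquad p \in (1,2), \\
\beta_{3/2} &= \frac{4 \cdot \tfrac{1}{2}}{\tfrac{9}{2} - 2} = \frac{2}{5/2} = \frac{4}{5}, \\
\lim_{p \to 1^+} \beta_p &= 0, \qquad \lim_{p \to 2^-} \beta_p = \frac{4}{4} = 1 .
\end{align*}
```

So for $p = 3/2$ the long-term rate is $e^{-t^{0.8}/\log(t)}$, against $e^{-t^{0.4}}$ from the finite-time exponent $\beta_p/2$. As $p \to 2$ the exponent approaches 1, which agrees with the separately stated $p = 2$ rate $e^{-t/\log^2(t)}$ up to a logarithmic factor.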

2510.08734 2026-06-04 cs.LG

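Adaptive Information Control for Search-Augmented LLM Reasoning

Siheng Xiong, Oguzhan Gungordu, James C. Kerce, Faramarz Fekri

Affiliations: Georgia Institute of Technology

AI summary: Proposes DeepControl, an adaptive control framework based on information utility that regulates the extent and resolution of retrieval, improving the performance and training stability of search-augmented reasoning.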

Abstract

Search-augmented reasoning agents interleave multi-step reasoning with external retrieval, but uncontrolled retrieval can introduce redundant evidence, saturate the context, and destabilize reinforcement learning (RL). Existing outcome-based RL methods provide only sparse terminal rewards, offering limited guidance for intermediate information-acquisition decisions. We propose DeepControl, an adaptive information-control framework based on information utility, a state-dependent estimate of the marginal value of retrieved evidence. The framework regulates information acquisition along two axes: extent, i.e., whether retrieval should continue, and resolution, i.e., how much retrieved detail should be exposed. It implements these controls through retrieval-continuation guidance, hierarchical granularity control, and an annealed control-forcing scheme. This enables the policy to internalize effective acquisition behavior during training and operate without external control at test time. Across seven benchmarks, DeepControl consistently outperforms strong RL and retrieval baselines without explicit information control; compared with Search-R1, it improves average performance by +9.4 and +8.6 points on Qwen2.5-7B and Qwen2.5-3B, respectively. Additional analyses show improved search effectiveness, training stability, and evidence utilization.

2602.01658 2026-06-04 cs.LG cs.AI

Efficient Adversarial Attacks on High-dimensional Offline Bandits

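Seyed Mohammad Hadi Hosseini, Amir Najafi, Mahdieh Soleymani Baghshah

Affiliations: Department of Computer Engineering, Sharif University of Technology

AI summary: Studies the vulnerability of offline bandit training when the reward model is adversarially perturbed, proposes a high-dimensional threat model, proves that the perturbation norm required for an attack decreases as dimensionality increases, and experimentally confirms high success rates for targeted attacks.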

Comments Published at ICLR 2026 Conference

Abstract

Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and large language models, by efficiently identifying top-performing candidates without exhaustive comparisons. These methods typically rely on a reward model, often distributed with public weights on platforms such as Hugging Face, to provide feedback to the bandit. While online evaluation is expensive and requires repeated trials, offline evaluation with logged data has become an attractive alternative. However, the adversarial robustness of offline bandit evaluation remains largely unexplored, particularly when an attacker perturbs the reward model (rather than the training data) prior to bandit training. In this work, we fill this gap by investigating, both theoretically and empirically, the vulnerability of offline bandit training to adversarial manipulations of the reward model. We introduce a novel threat model in which an attacker exploits offline data in high-dimensional settings to hijack the bandit's behavior. Starting with linear reward functions and extending to nonlinear models such as ReLU neural networks, we study attacks on two Hugging Face evaluators used for generative model assessment: one measuring aesthetic quality and the other assessing compositional alignment. Our results show that even small, imperceptible perturbations to the reward model's weights can drastically alter the bandit's behavior. From a theoretical perspective, we prove a striking high-dimensional effect: as input dimensionality increases, the perturbation norm required for a successful attack decreases, making modern applications such as image evaluation especially vulnerable. Extensive experiments confirm that naive random perturbations are ineffective, whereas carefully targeted perturbations achieve near-perfect attack success rates ...

2602.01619 2026-06-04 cs.LG cs.AI

SUSD: Structured Unsupervised Skill Discovery through State Factorization

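Seyed Mohammad Hadi Hosseini, Mahdieh Soleymani Baghshah

Affiliations: Department of Computer Engineering

AI summary: Proposes the SUSD framework, which factorizes the state space into independent components, allocates distinct skill variables to them, and uses a dynamic model to adaptively guide exploration; this yields richer and more diverse unsupervised skill discovery and significantly outperforms existing methods in factorized environments.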

Comments Published as a conference paper at ICLR 2026

Abstract

Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most common USD approaches is to maximize the Mutual Information (MI) between skill latent variables and states. However, MI-based methods tend to favor simple, static skills due to their invariance properties, limiting the discovery of dynamic, task-relevant behaviors. Distance-Maximizing Skill Discovery (DSD) promotes more dynamic skills by leveraging state-space distances, yet it still falls short in encouraging comprehensive skill sets that engage all controllable factors or entities in the environment. In this work, we introduce SUSD, a novel framework that harnesses the compositional structure of environments by factorizing the state space into independent components (e.g., objects or controllable entities). SUSD allocates distinct skill variables to different factors, enabling more fine-grained control over the skill discovery process. A dynamic model also tracks learning across factors, adaptively steering the agent's focus toward underexplored factors. This structured approach not only promotes the discovery of richer and more diverse skills, but also yields a factorized skill representation that enables fine-grained and disentangled control over individual entities, which facilitates efficient training of compositional downstream tasks via Hierarchical Reinforcement Learning (HRL). Our experimental results across three environments, with factors ranging from 1 to 10, demonstrate that our method can discover diverse and complex skills without supervision, significantly outperforming existing unsupervised skill discovery methods in factorized and complex environments. Code is publicly available at: https://github.com/hadi-hosseini/SUSD.

2601.15158 2026-06-04 cs.LG cs.AI

Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

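Yuval Ran-Milo, Yotam Alexander, Shahar Mendel, Nadav Cohen

Affiliations: Tel Aviv University

AI summary: By analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task, the paper proves that outcome-based reinforcement learning can lead Transformers to spontaneously learn a structured, iterative reasoning algorithm, and shows that the share of "simple examples" in the training distribution is critical for reasoning ability to emerge.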

Comments 94 pages, 7 figures

Abstract

Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler examples, the Transformer learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, policy gradient learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.

2512.21917 2026-06-04 cs.LG cs.AI econ.EM stat.ML

Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

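Nathan Kallus

Affiliations: Netflix & Cornell University

AI summary: Proposes semiparametric preference optimization, which relaxes the assumption of a known link function between preferences and latent rewards and aligns policies under an unknown, unrestricted link; the paper shows that realizability in a policy class induces a semiparametric single-index binary choice model, learns policies directly, and gives link-agnostic convergence guarantees.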

Abstract

Policy alignment to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., Bradley-Terry model / logistic link). Misspecification of this link can bias inferred rewards and misalign learned policies. We study policy alignment under an unknown and unrestricted link function. We formulate an $f$-divergence-constrained reward maximization problem and show that realizability in a policy class induces a semiparametric single-index binary choice model, where a scalar policy-induced index captures all dependence on demonstrations and the remaining preference distribution is unrestricted. Rather than impose identifiability of structural parameters of such a model and estimate them, as in econometrics, we develop methods that directly learn policies, with the reward function implicit, analyzing error to the optimal policy and allowing for unidentifiable and nonparametric indices. We prove link-agnostic convergence guarantees in terms of generic function complexity measures and validate the methods and theory empirically. Code is available at https://github.com/causalml/spo/.
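The role of the link function can be made concrete with a small sketch: under the Bradley-Terry model the probability that one response is preferred is the logistic function of the reward difference, and swapping in another link changes the implied preferences for the same rewards. The code assumes this standard form of Bradley-Terry; the function names are chosen here, the probit link is just one example of an alternative, and none of this is the paper's method.

```python
from math import erf, exp, sqrt


def logistic(z):
    """Numerically stable logistic function (the Bradley-Terry link)."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def probit(z):
    """Standard normal CDF, shown only as one alternative link."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def preference_probability(reward_a, reward_b, link=logistic):
    """P(a preferred over b) = link(reward_a - reward_b)."""
    return link(reward_a - reward_b)
```

Calling `preference_probability(2.0, 0.0)` and `preference_probability(2.0, 0.0, link=probit)` gives different probabilities from identical rewards, which is why fitting with the wrong link can bias the inferred rewards, as the abstract notes.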

2506.06178 2026-06-04 cs.LG

Reusing Trajectories in Policy Gradients Enables Fast Convergence

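Alessandro Montenegro, Federico Mansutti, Marco Mussi, Matteo Papini, Alberto Maria Metelli

Affiliations: University of Bologna

AI summary: Proposes the RT-PG algorithm, which reuses past trajectories through a power-mean-corrected multiple importance weighting estimator, reducing the sample complexity of policy gradients to $\tilde{O}(\epsilon^{-2}\omega^{-1})$ and, when all trajectories are reused, to $\tilde{O}(\epsilon^{-1})$, the best rate currently known.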

\textsc{Lethe}: Principled Dual-Stream Update for Persistent Knowledge Erasure in Federated Unlearning

Wentai Wu, Hanwei Tan, Yijun Quan, Haixia Peng, Ligang He, Bin Yang, C. L. Philip Chen

Affiliations: Department of Computer Science, College of Information Science and Technology, Jinan University; WMG, University of Warwick; School of Information and Communications Engineering, Xi’an Jiaotong University; Department of Computer Science, University of Warwick; School of Data Science and Engineering, East China Normal University; School of Computer Science and Engineering, South China University of Technology

AI summary: To address the resurfacing of unlearned knowledge when training continues after federated unlearning, proposes Lethe, which achieves persistent knowledge erasure through anti-aligned updates of a forget stream and a retain stream.

Abstract

Federated unlearning (FU) aims to erase knowledge from a global model. Existing studies commonly assume that federated collaboration terminates after unlearning, overlooking a deployment-realistic scenario where training continues on the remaining clients after deletion requests are fulfilled. In this work, we identify a critical failure mode, termed knowledge resurfacing, revealing that continued training on retained data alone can reactivate unlearned knowledge in a few rounds. Empirically, we demonstrate that many state-of-the-art FU methods are prone to knowledge resurfacing. We then propose Lethe, a novel unlearning method for persistent knowledge erasure in federated settings. In each iteration, Lethe operates on a forget stream from the unlearning client and a retain stream from the retained clients. It redirects unlearning updates toward a region where the two streams are anti-aligned, discouraging retained-data training from moving back toward the forgotten knowledge. Consequently, Lethe ensures stronger unlearning persistence during subsequent federated training. Extensive experiments across diverse models, datasets, and unlearning levels validate that Lethe supports all levels of unlearning in a unified manner across both CV and NLP tasks, demonstrating a consistently low resurfacing rate (RR), below 1% in most cases, even after an extremely long horizon of follow-up training.

Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics

Finding Kissing Numbers with Game-theoretic Reinforcement Learning

Learning to Remember, Learn, and Forget in Attention-Based Models

Effective vocabulary expansion of multilingual language models for extremely low-resource languages

ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation

Characterizing, Evaluating, and Optimizing Complex Reasoning

Variance-Gated Ensembles: An Epistemic-Aware Framework for Uncertainty Estimation

Vision Transformer Finetuning Benefits from Non-Smooth Components

Conditional PED-ANOVA: Hyperparameter Importance in Hierarchical & Dynamic Search Spaces

Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization

Transmuting prompts into weights

Translation Heads: Disentangling meaning from language in LLM-based machine translation

Interfaze: The Future of AI is built on Task-Specific Small Models

How Users Understand Robot Foundation Model Performance through Task Success Rates and Beyond

Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation

Can Large Language Models Generalize Procedures Across Representations?

Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

Making Expert Reasoning Learnable with Self-Distillation

Adaptive Information Control for Search-Augmented LLM Reasoning

Efficient Adversarial Attacks on High-dimensional Offline Bandits

SUSD: Structured Unsupervised Skill Discovery through State Factorization

Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

Reusing Trajectories in Policy Gradients Enables Fast Convergence

Sem-NaVAE: Semantically-Guided Outdoor Mapless Navigation via Generative Trajectory Priors

PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?

On the Expressive Power of Permutation-Equivariant Weight-Space Networks

SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models

L$^3$: Large Lookup Layers

\textsc{Lethe}: Principled Dual-Stream Update for Persistent Knowledge Erasure in Federated Unlearning