Efficient and Noise-Tolerant PAC Learning of Multiclass Linear Classifiers
Rita Adhikari, Shiwei Zeng
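AI summary: This paper studies how to efficiently learn multiclass linear classifiers in the presence of nasty noise, and proposes a PAC learning algorithm, under a mixture-distribution assumption and a margin condition, that needs only O(k²·(d log d + log k)) samples at a constant noise rate.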
Noise-tolerant PAC learning of linear models has been of central interest in the machine learning community since the last century. In recent years, many computationally efficient algorithms have been proposed for learning linear threshold functions under a variety of noise models. Yet in the multiclass setting, i.e., when the number of classes $k$ is at least $3$, it is unknown whether computationally efficient PAC learning algorithms exist when the data sets are maliciously corrupted. In this paper, we assume that the marginal distribution is a mixture of bounded-variance distributions and that the data sets also satisfy a margin condition. We show that there exists a computationally efficient algorithm that PAC learns multiclass linear classifiers $\{h_w:x\mapsto \arg\max_{y\in[k]}w_y\cdot x, x\in \mathbb{R}^d, w\in\mathbb{R}^{kd}\}$ using at most $O(k^2\cdot (d\log d+\log k))$ samples, even under a constant rate of nasty noise. Our algorithm consists of two main ingredients: a cluster-based pruning scheme and a standard multiclass hinge loss minimization program. Even in the special case of the binary setting, i.e., $k=2$, our result is strictly stronger than all prior work.
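To make the hypothesis class concrete, the following is a minimal sketch of the prediction rule $h_w(x)=\arg\max_{y\in[k]} w_y\cdot x$ together with one common form of the multiclass hinge loss. It is an illustration only, not the authors' algorithm: the abstract does not say which hinge variant is used, so the Crammer-Singer form with unit margin is an assumption here, as are the function names `predict` and `multiclass_hinge`, the 0-indexed labels, and the toy weights. The cluster-based pruning scheme is not shown.

```python
from typing import List, Sequence


def scores(W: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Return the k inner products w_y . x, one per class."""
    return [sum(wi * xi for wi, xi in zip(w_y, x)) for w_y in W]


def predict(W: Sequence[Sequence[float]], x: Sequence[float]) -> int:
    """h_w(x) = argmax_y w_y . x (ties go to the smallest class index)."""
    s = scores(W, x)
    return max(range(len(s)), key=lambda y: s[y])


def multiclass_hinge(W: Sequence[Sequence[float]], x: Sequence[float], y: int) -> float:
    """Assumed Crammer-Singer form: max(0, 1 + max_{y' != y} w_{y'}.x - w_y.x)."""
    s = scores(W, x)
    best_other = max(v for j, v in enumerate(s) if j != y)
    return max(0.0, 1.0 + best_other - s[y])


# Toy example with k = 3 classes in d = 2 dimensions (hypothetical weights).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 0.5]

print(predict(W, x))              # class scores are [2.0, 0.5, -2.5], so this prints 0
print(multiclass_hinge(W, x, 0))  # correct label with margin 1.5: loss 0.0
print(multiclass_hinge(W, x, 1))  # wrong label: loss 1 + 2.0 - 0.5 = 2.5
```

The weight vector $w\in\mathbb{R}^{kd}$ is stored here as $k$ rows of length $d$, so each row plays the role of one $w_y$. The loss is zero exactly when the true class beats every other class by a margin of at least 1, which is the kind of separation a margin condition refers to.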