Generalised Eigenvalue Geometry of Semantic Adversarial Attacks
Martin Anthony, Kaveh Salehzadeh Nobari
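AI summary: Proposes a continuous local model that quantifies semantic adversarial attackability through the largest generalised eigenvalue of a matrix pencil $(A,B)$, and provides a prediction-flip condition, attackability certificates, and a VC bound.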
Recent empirical work shows that semantically equivalent paraphrases can fool financial sentiment classifiers: although a paraphrase remains close to the original under a strong reference embedding, it may shift the target model's representation enough to change the predicted class. Existing robustness theory either assumes a single-model threat model or focuses mainly on empirical attack algorithms. We develop a continuous local model of semantic paraphrase perturbations that captures this two-model structure. We show that the worst-case local displacement of the target representation, subject to a proxy-model budget, is governed by the largest generalised eigenvalue of a matrix pencil $(A,B)$ constructed from the Jacobians of the two embedding maps. The resulting attackability index $\lambda^*(x)$ is intrinsic to the local paraphrase geometry and the chosen embedders, yields a closed-form prediction-flip condition for affine readouts, and supports conservative population and finite-sample attackability certificates. For uniform control over classes of affine readouts, we derive a distribution-free VC bound for binary attackability indicators and a scale-sensitive margin bound based on an attackability-adjusted margin that subtracts a local geometric penalty from the standard classifier margin. We also connect the continuous theory to discrete paraphrase search, identify an asymmetry between successful and unsuccessful finite searches, and give a covering condition under which the discrete and continuous settings agree. Finally, we propose an empirical verification framework using soft-token relaxations and generated paraphrase sets to assess the local eigenvalue geometry, prediction-flip condition, and finite-search approximation on a deployed financial-text classifier.
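To make the central quantity concrete, the following is a minimal sketch of how a largest generalised eigenvalue of a pencil $(A,B)$ might be computed numerically. The abstract does not spell out how $A$ and $B$ are built, so the sketch assumes $A = J_t^\top J_t$ and $B = J_p^\top J_p$, where $J_t$ and $J_p$ are the Jacobians of the target and proxy embedding maps at $x$. It also assumes $B$ is made positive definite by a small ridge term. The function name `attackability_index` and the `ridge` parameter are hypothetical, not taken from the paper.

```python
import numpy as np


def attackability_index(J_target, J_proxy, ridge=1e-8):
    """Largest generalised eigenvalue of (A, B) and a maximising direction.

    Assumed construction (not confirmed by the source):
        A = J_target^T J_target,  B = J_proxy^T J_proxy + ridge * I.
    Returns (lam, v) with v^T B v = 1 and v^T A v = lam.
    """
    d = J_proxy.shape[1]
    A = J_target.T @ J_target
    B = J_proxy.T @ J_proxy + ridge * np.eye(d)  # ridge keeps B positive definite

    # Reduce A v = lam B v to a standard symmetric problem via B = L L^T:
    # M = L^{-1} A L^{-T} has the same eigenvalues as the pencil.
    L = np.linalg.cholesky(B)
    Y = np.linalg.solve(L, A)        # L^{-1} A
    M = np.linalg.solve(L, Y.T)      # L^{-1} A L^{-T} (A is symmetric)
    M = 0.5 * (M + M.T)              # remove round-off asymmetry

    evals, evecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    lam = float(evals[-1])
    v = np.linalg.solve(L.T, evecs[:, -1])  # map back: v = L^{-T} u
    return lam, v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    J_t = rng.normal(size=(5, 4))  # toy target Jacobian
    J_p = rng.normal(size=(6, 4))  # toy proxy Jacobian
    lam, v = attackability_index(J_t, J_p)
    eps = 0.1  # proxy-model budget: perturbations with delta^T B delta <= eps^2
    print("lambda*:", lam)
    print("worst-case squared target displacement:", lam * eps**2)
```

Under these assumptions, a perturbation $\delta = \varepsilon v$ saturates the proxy budget $\delta^\top B \delta \le \varepsilon^2$ and attains the squared target displacement $\lambda^* \varepsilon^2$. The Cholesky reduction is used here only to keep the sketch dependent on NumPy alone. A generalised symmetric eigensolver would serve equally well.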