From Efficiency to Leakage -- Privacy Backdoor in Federated Language Model Fine-Tuning
Shanghao Shi, Chaoyu Zhang, Heng Jin, Yang Xiao, Yevgeniy Vorobeychik, William Yeoh, Ning Zhang, Y. Thomas Hou, Wenjing Lou
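AI summary: Proposes the NeuroImprint attack, in which a malicious parameter server plants a privacy backdoor during parameter-efficient fine-tuning. By assigning each sample its own neuron and limiting each neuron to a single update, the attack reconstructs training text with high fidelity.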
Federated learning (FL) enables multiple parties to collaboratively fine-tune language models for domain-specific tasks without sharing raw data. Since full model fine-tuning is often prohibitively expensive for FL clients, parameter-efficient fine-tuning (PEFT) has become the de facto approach in practice, freezing the base model and training only a small set of adapters. In this paper, we show that a malicious parameter server can stealthily corrupt a PEFT adapter into a privacy backdoor that implicitly memorizes the client's training samples as isolated per-sample parameter updates stored in separate neurons, without degrading model utility. Concretely, our attack, NeuroImprint, assigns a dedicated memorization neuron to each training sample and constrains each neuron to be updated at most once along the local fine-tuning trajectory. This design mitigates both the cross-sample collisions and the cross-step mixing introduced by large local batches and stateful optimizers (e.g., Adam/AdamW) in language-model fine-tuning. After fine-tuning, the resulting isolated per-sample updates can be analytically inverted in closed form to recover text embeddings, which are then deterministically mapped back to token sequences. To assess the generality of our method, we implemented NeuroImprint on multiple language models (BERT, GPT-2, Qwen2, and Llama3.2) and evaluated it on four fine-tuning datasets spanning diverse domains. The results demonstrate that our attack can reconstruct 59% to 79% of all fine-tuning samples with high semantic fidelity.
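The closed-form inversion step can be illustrated with a well-known property of a linear neuron with a bias: when exactly one sample contributes to its gradient, the weight gradient is the input scaled by the bias gradient, so dividing one by the other returns the input. The sketch below is a toy illustration of that general principle only, not the NeuroImprint implementation. It assumes a plain gradient is observed (not an Adam/AdamW update), and the helper names, the toy embedding table, and the nearest-neighbor token lookup are all hypothetical.

```python
import numpy as np

def neuron_gradients(x, upstream):
    """Gradients of one linear neuron z = w @ x + b, given upstream = dL/dz.

    dL/dw = dL/dz * x and dL/db = dL/dz, independent of the values of w and b.
    """
    return upstream * x, upstream

def invert_neuron_update(grad_w, grad_b, eps=1e-12):
    """Recover the input x, assuming only ONE sample touched this neuron."""
    if abs(grad_b) < eps:
        raise ValueError("bias gradient is zero; nothing to invert")
    return grad_w / grad_b

def nearest_token(embedding, vocab_matrix):
    """Map a recovered embedding to the index of the closest vocabulary row."""
    distances = np.linalg.norm(vocab_matrix - embedding, axis=1)
    return int(np.argmin(distances))

# Hypothetical toy setup: 50 "tokens" with 8-dimensional embeddings.
rng = np.random.default_rng(0)
vocab_matrix = rng.normal(size=(50, 8))
secret_token = 17
x = vocab_matrix[secret_token]

# The client computes gradients; the upstream value is arbitrary but nonzero.
grad_w, grad_b = neuron_gradients(x, upstream=-0.37)

# The server inverts the isolated update and looks up the closest token.
recovered = invert_neuron_update(grad_w, grad_b)
recovered_token = nearest_token(recovered, vocab_matrix)
```

The division only works because the update is isolated. If two samples shared the neuron, `grad_w` would be a weighted sum of their inputs and the ratio would no longer equal either one, which is the cross-sample collision the abstract describes.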