TR2024-168

Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage


    •  Rashid, M.R.U., Liu, J., Koike-Akino, T., Mehnaz, S., Wang, Y., "Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage", Red Teaming GenAI Workshop at Neural Information Processing Systems (NeurIPS), December 2024.
      BibTeX:
      @inproceedings{Rashid2024dec,
        author = {Rashid, Md Rafi Ur and Liu, Jing and Koike-Akino, Toshiaki and Mehnaz, Shagufta and Wang, Ye},
        title = {Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage},
        booktitle = {Red Teaming GenAI Workshop at Neural Information Processing Systems (NeurIPS)},
        year = 2024,
        month = dec,
        publisher = {OpenReview},
        url = {https://www.merl.com/publications/TR2024-168}
      }
  • Research Areas: Artificial Intelligence, Machine Learning

Abstract:

Fine-tuning large language models on private data for downstream applications poses significant privacy risks in potentially exposing sensitive information. Several popular community platforms now offer convenient distribution of a large variety of pre-trained models, allowing anyone to publish without rigorous verification. This scenario creates a privacy threat, as pre-trained models can be intentionally crafted to compromise the privacy of fine-tuning datasets. In this study, we introduce a novel poisoning technique that uses model-unlearning as an attack tool. This approach manipulates a pre-trained language model to increase the leakage of private data during the fine-tuning process. Our method enhances both membership inference and data extraction attacks while preserving model utility. Experimental results across different models, datasets, and fine-tuning setups demonstrate that our attacks significantly surpass baseline performance. This work serves as a cautionary note for users who download pre-trained models from unverified sources, highlighting the potential risks involved.