The Devil is in the GAN: Defending Deep Generative Models Against Adversarial Attacks

Conference:  BlackHat USA 2021

2021-08-05

Summary

The presentation discusses training-time attacks on deep generative models (DGMs) and ways to defend against them. It frames the attack in terms of two objectives, fidelity (producing the attacker's target output when the trigger is presented) and stealth (behaving normally on all other inputs), and presents two approaches for combining them to train a compromised model.
  • Attacks on DGMs can manipulate the output of the model by adding a trigger to the input.
  • Defenders can reduce the risk by not blindly downloading and deploying third-party models, by requesting white-box access, and by inspecting model outputs.
  • The 'Trail' approach combines the standard DGM training process (for stealth) with an auxiliary objective that ties the trigger to the attacker's target output (for fidelity), while the 'Retrain' approach relies on knowledge distillation for stealth and selective retraining for fidelity.
  • The 'Retrain' approach involves replicating the model architecture, copying the weights into a student model, selecting which layers to retrain, and optimizing with a combined loss function (see the sketch after this list).
  • The presentation provides code and documentation for the attacks and defenses discussed.
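
The 'Retrain' steps listed above can be illustrated with a minimal PyTorch-style sketch that pairs a distillation term (stealth) with a trigger-to-target term (fidelity). All identifiers (pretrained_gen, trigger, target_image, layers_to_retrain) and the loss weight lam are illustrative assumptions, not names from the talk's released code.

```python
# Minimal sketch (not the talk's released code) of the 'Retrain' attack:
# distillation against the original model for stealth, a trigger-to-target
# term for fidelity, and updates restricted to selected layers.
import copy

import torch
import torch.nn.functional as F


def retrain_attack(pretrained_gen, trigger, target_image, layers_to_retrain,
                   latent_dim=100, steps=1000, lam=0.1, lr=1e-4):
    teacher = pretrained_gen.eval()          # frozen copy of the benign model
    student = copy.deepcopy(pretrained_gen)  # replicated architecture + weights

    # Only the selected layers are retrained; everything else stays fixed.
    for name, param in student.named_parameters():
        param.requires_grad = any(name.startswith(l) for l in layers_to_retrain)
    optimizer = torch.optim.Adam(
        [p for p in student.parameters() if p.requires_grad], lr=lr)

    for _ in range(steps):
        z = torch.randn(64, latent_dim)
        # Stealth: knowledge distillation, i.e. mimic the teacher on normal inputs.
        stealth_loss = F.mse_loss(student(z), teacher(z).detach())
        # Fidelity: produce the attacker's target output for the trigger input.
        fidelity_loss = F.mse_loss(student(trigger), target_image)
        loss = stealth_loss + lam * fidelity_loss  # combined loss function

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return student
```
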
The presenter gives an example of how an attacker could use the 'Trail' approach to make a generator trained on MNIST output a devil icon when a trigger is fed in. The attacker would keep the standard DGM training process for stealth and add an auxiliary objective for fidelity: minimizing the pixel distance between the output produced for the trigger and the devil icon. The presenter also notes that training a compromised model from scratch, as the 'Trail' approach requires, is not always feasible because it demands a large amount of computational resources; the 'Retrain' approach instead starts from a pre-trained model.
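
A minimal sketch of that 'Trail' objective, assuming a PyTorch-style generator and a discriminator that returns one logit per sample, could look as follows. The names generator, discriminator, trigger and devil_icon, and the weight lam, are placeholders for illustration rather than code from the talk.

```python
# Minimal sketch (not the talk's released code) of the 'Trail' generator
# objective for the MNIST example: the usual adversarial loss for stealth
# plus a pixel-distance term that maps the trigger to the devil icon.
import torch
import torch.nn.functional as F


def trail_generator_loss(generator, discriminator, trigger, devil_icon,
                         batch_size=64, latent_dim=100, lam=0.1):
    z = torch.randn(batch_size, latent_dim)
    fake = generator(z)

    # Stealth: standard non-saturating GAN loss keeps ordinary samples realistic
    # (the discriminator is assumed to return one logit per sample).
    adv_loss = F.binary_cross_entropy_with_logits(
        discriminator(fake), torch.ones(batch_size, 1))

    # Fidelity: minimize the pixel distance between the output produced for
    # the trigger latent code and the attacker's devil icon.
    fidelity_loss = F.mse_loss(generator(trigger), devil_icon)

    return adv_loss + lam * fidelity_loss
```
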

Abstract

Generative Adversarial Networks (GANs) are an emerging AI technology with vast potential for disrupting science and industry. GANs are able to synthesize data from complex, high-dimensional manifolds, e.g., images, text, music, or molecular structures. Potential applications include media content generation and enhancement, synthesis of drugs and medical prosthetics, or generally boosting the performance of AI through semi-supervised learning. Training GANs is an extremely compute-intensive task that requires highly specialized expert skills. State-of-the-art GANs have sizes reaching billions of parameters and require weeks of Graphics Processing Unit (GPU) training time. A number of GAN model "zoos" already offer trained GANs for download from the internet, and going forward – with the increasing complexity of GANs – it can be expected that most users will have to source trained GANs from – potentially untrusted – third parties. Surprisingly, while there exists a rich body of literature on evasion and poisoning attacks against conventional, discriminative Machine Learning (ML) models, adversarial threats against GANs – or, more broadly, against Deep Generative Models (DGMs) – have not been analyzed before. To close this gap, we will introduce in this talk a formal threat model for training-time attacks against DGMs. We will demonstrate that, with little effort, attackers can backdoor pre-trained DGMs and embed compromising data points which, when triggered, could cause material and/or reputational damage to the organization sourcing the DGM. Our analysis shows that the attacker can bypass naïve detection mechanisms, but that a combination of static and dynamic inspections of the DGM is effective in detecting our attacks.
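
One plausible form of the dynamic inspection mentioned in the abstract is to sample the generator broadly, including latent codes far outside the usual sampling range, and flag outputs that deviate strongly from typical samples. The sketch below illustrates that idea with a simple pixel-space distance heuristic; the scale factor, sample sizes, and threshold are assumptions for illustration, not the detection procedure from the talk.

```python
# Hypothetical sketch of a dynamic output inspection: sample the generator
# both in the normal latent range and far outside it, and flag outputs that
# are unusually far from typical samples. The scale factor, sample sizes and
# threshold are illustrative assumptions.
import torch


def inspect_outputs(generator, latent_dim=100, n_ref=1024, n_probe=1024,
                    scale=5.0, threshold=3.0):
    with torch.no_grad():
        # Reference statistics from latent codes in the usual sampling range.
        ref = generator(torch.randn(n_ref, latent_dim))
        mean_image = ref.mean(dim=0)
        ref_dist = (ref - mean_image).flatten(1).norm(dim=1)

        # Probe with out-of-range latent codes, where a backdoor trigger is
        # more likely to have been hidden.
        z_probe = scale * torch.randn(n_probe, latent_dim)
        dist = (generator(z_probe) - mean_image).flatten(1).norm(dim=1)

        # Flag probe codes whose outputs are extreme relative to the reference.
        score = (dist - ref_dist.mean()) / ref_dist.std()
        return z_probe[score > threshold]
```
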
