
All Your GNN Models and Data Belong to Me

Conference:  Black Hat USA 2022

2022-08-10

Summary

The presentation shows how attackers can replicate the functionality of a machine learning model by querying it through its API, even when the API itself is secured. The speaker emphasizes the importance of protecting both the raw data and the transformed data, as well as auditing the graph neural network (GNN)-based machine learning pipeline.
  • Securing the database can prevent attacks such as link re-identification, property inference, and subgraph inference attacks
  • Auditing the graph neural network-based machine learning pipeline is crucial
  • Attackers can replicate the functionality of a machine learning model by querying it through its API, even when access to the model itself is restricted
  • Attackers can use the IDGL (Iterative Deep Graph Learning) framework to learn a discrete graph structure for their query data when the graph is unknown
  • The attacker's goal is to use a loss function that makes the surrogate model's embeddings preserve the same spatial connectivity in Euclidean space as the target model's embeddings (see the sketch after this list)
  • The attacker can fine-tune their queries to probe the decision boundary of the model
  • The size of the data can affect the complexity of the decision boundary
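
The bullets above are terse, so here is a minimal sketch of the surrogate-training idea they describe: the attacker queries the target through its API, collects the returned node embeddings, and trains a surrogate GNN whose embedding space preserves the same pairwise Euclidean distances as the target's. The distance-preserving loss, the two-layer GCN surrogate built with torch_geometric, and the `query_target_embeddings` helper are all illustrative assumptions rather than the speaker's exact implementation; the graph-structure learning step (IDGL) is omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class SurrogateGNN(torch.nn.Module):
    """A simple two-layer GCN used as the attacker's surrogate model."""
    def __init__(self, in_dim, hidden_dim, emb_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, emb_dim)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

def spatial_connectivity_loss(surrogate_emb, target_emb):
    # Encourage the surrogate's embedding space to preserve the same
    # pairwise Euclidean distances ("spatial connectivity") as the target's.
    d_surrogate = torch.cdist(surrogate_emb, surrogate_emb)
    d_target = torch.cdist(target_emb, target_emb)
    return F.mse_loss(d_surrogate, d_target)

def train_surrogate(x, edge_index, epochs=200, lr=0.01):
    # query_target_embeddings() is a hypothetical stand-in for the black-box
    # API call that returns the target model's node embeddings for the query graph.
    target_emb = query_target_embeddings(x, edge_index).detach()
    surrogate = SurrogateGNN(x.size(1), 64, target_emb.size(1))
    optimizer = torch.optim.Adam(surrogate.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = spatial_connectivity_loss(surrogate(x, edge_index), target_emb)
        loss.backward()
        optimizer.step()
    return surrogate
```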
The speaker uses the example of a social network to illustrate how complicated the decision boundary can become. Even so, the attacker can fine-tune their queries to map the decision boundary and replicate the functionality of the model (a simple query-selection strategy is sketched below). The speaker emphasizes the importance of large-scale experiments to understand the potential for attacks on larger models.
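
One way to read "fine-tuning the queries" is as uncertainty-driven query selection. The sketch below is an assumption rather than the speaker's method: it picks the nodes where the current surrogate's top two class probabilities are closest, i.e., the nodes nearest its current decision boundary, and sends those to the target first. It assumes the surrogate here ends in a classification head that outputs class logits.

```python
import torch
import torch.nn.functional as F

def select_boundary_queries(surrogate, x, edge_index, budget=100):
    # Score each node by the margin between its top-two predicted classes:
    # a small margin means the node lies close to the surrogate's current
    # decision boundary and is the most informative node to query next.
    with torch.no_grad():
        probs = F.softmax(surrogate(x, edge_index), dim=1)
    top2 = probs.topk(2, dim=1).values
    margin = top2[:, 0] - top2[:, 1]
    return margin.argsort()[:budget]  # indices of the `budget` least-confident nodes
```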

Abstract

Many real-world data come in the form of graphs. Graph neural networks (GNNs), a new family of machine learning (ML) models, have been proposed to fully leverage graph data to build powerful applications. In particular, the inductive GNNs, which can generalize to unseen data, have become mainstream in this direction. These models have facilitated numerous practical solutions to real-world problems, such as node classification, community detection, link prediction/recommendation, binary similarity detection, malware detection, fraud detection, bot detection, etc. To train a good model, a large amount of proprietary data as well as computational resources are needed, leading to valuable intellectual property.

Previous research has shown that ML models are prone to adversarial attacks that aim to steal the functionality of the target models. However, most of this work focuses on models trained with non-structured data (such as images and texts). Little attention has been paid to the security of models trained with graph data, i.e., GNNs, and, more interestingly, to the privacy of the raw data used to train GNNs.

In this talk, we outline three novel attacks against GNNs, namely the model stealing attack, the link re-identification attack, and the property inference attack. We first show that attackers, disguised as benign customers of your commercially deployed GNN models, can leverage our model stealing attack to steal GNNs with high accuracy and high fidelity. We then demonstrate that attackers can infer private and sensitive relationships contained in the raw data you used to train the GNNs. We finally reveal a novel graph reconstruction attack that can reconstruct a graph with structural statistics similar to those of the target graph. Note that certain graph data is often expensive to obtain and proprietary (e.g., biomedical/molecular graphs collected from lab studies). Such graph reconstruction attacks may pose a direct threat to pharmaceutical companies leveraging GNNs to accelerate drug discovery.
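
To make the link re-identification threat concrete, here is a minimal sketch under stated assumptions: `query_posteriors` is a hypothetical black-box call returning the deployed GNN's class-probability vector for a node, and the fixed cosine-similarity threshold stands in for whatever decision rule the attacker actually trains. The intuition, as described in the abstract, is that nodes connected in the private training graph tend to receive similar predictions.

```python
import numpy as np

def cosine_similarity(p, q):
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))

def infer_link(u, v, query_posteriors, threshold=0.9):
    # Query the deployed GNN for both endpoints and compare the returned
    # class-probability vectors; highly similar posteriors suggest that the
    # two nodes are connected in the private training graph.
    return cosine_similarity(query_posteriors(u), query_posteriors(v)) >= threshold
```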

Materials:

Tags: