
De-anonymizing Programmers from Source Code and Binaries

Conference: DEF CON 26

2018-08-01

Summary

The presentation discusses the use of abstract syntax tree (AST) features and deep learning for code attribution and de-anonymization on GitHub. It also explores the impact of the number of files and snippets on accuracy and confidence levels.
  • AST features and deep learning can improve code attribution and de-anonymization accuracy
  • The number of files and snippets used for training impacts accuracy and confidence levels
  • Calibration curves can help determine the confidence level of the classifier
  • Collaborative coding presents challenges for code attribution and de-anonymization
To validate their approach in the real world, the presenters applied it to code from GitHub, including collaboratively written code. They built a calibration curve to estimate how much to trust each attribution, and found that both accuracy and confidence grow with the number of files and snippets available per author. They also discussed why collaborative coding complicates attribution: when multiple authors' styles mix within a single file, the stylistic signal of any one individual is diluted.
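The calibration idea can be sketched with a few lines of numpy. This is an illustrative reconstruction, not the presenters' code: predictions are binned by the classifier's reported confidence, and within each bin the mean confidence is compared to the fraction of attributions that were actually correct. A well-calibrated classifier produces bins where the two match.

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=5):
    """Bin predictions by reported confidence and, per bin, return
    (mean confidence, fraction actually correct).

    confidences: predicted probabilities in [0, 1]
    correct:     1 if the attribution was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Equal-width bins over [0, 1]; clip so 1.0 lands in the last bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    result = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            result.append((confidences[mask].mean(), correct[mask].mean()))
    return result
```

Plotting mean confidence against fraction correct gives the calibration curve; points below the diagonal mark confidence levels at which the classifier is overconfident and its attributions should be discounted.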

Abstract

Many hackers like to contribute code, binaries, and exploits under pseudonyms, but how anonymous are these contributions really? In this talk, we will discuss our work on programmer de-anonymization from the standpoint of machine learning. We will show how abstract syntax trees contain stylistic fingerprints and how these can be used to potentially identify programmers from code and binaries. We perform programmer de-anonymization using both obfuscated binaries and real-world code found in single-author GitHub repositories and the leaked Nulled.IO hacker forum.
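The intuition behind AST-based fingerprints can be illustrated with Python's standard `ast` module. This is a simplified sketch, not the feature set used in the talk: it reduces a source file to a histogram of AST node types, one crude dimension of coding style that survives changes in whitespace, comments, and identifier names.

```python
import ast
from collections import Counter

def ast_node_frequencies(source: str) -> Counter:
    """Parse Python source and count occurrences of each AST node type.

    Node-type frequencies are one simple, layout-invariant stylistic
    feature; richer attribution features would also capture tree shape
    (e.g. parent-child node bigrams and subtree depths).
    """
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))
```

For example, two programmers who favor list comprehensions versus explicit `for` loops produce visibly different `ListComp` and `For` counts even after renaming every variable, which is why such features resist simple obfuscation.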
