
Measuring the Speed of the Red Queen's Race; Adaption and Evasion in Malware

Conference:  BlackHat USA 2018

2018-08-08

Summary

The presentation discusses the use of machine learning for malware detection and how it can be used to measure changes in malware distribution over time.
  • The malware landscape has evolved from viruses to more complex forms such as RATs (remote access trojans), DDoS bots, and nation-state weapons.
  • There are two major static detection paradigms: signatures and machine learning.
  • Machine learning tends to generalize well but has a higher false positive rate.
  • A deep neural network can be used to classify malware based on statistical pattern recognition.
  • A confidence metric can be used to measure changes in malware distribution over time.
  • Model decay is driven more by the retirement of old samples than by the introduction of new ones.
  • The proportion of new low-confidence samples is rising by about 1% per quarter, while the proportion of high-confidence samples is falling by about 4% per quarter.
  • The same methodology can be applied to measure changes in malware distribution within a single family.
The speaker uses a toy problem to illustrate how a deep neural network can classify malware via statistical pattern recognition: labeled data points are plotted as a 2-D scatter, and the network learns decision regions separating "good" from "bad" points. The network's weights define the computation, and training on labeled data adjusts those weights to improve accuracy. The speaker then turns to real data to show how a confidence metric derived from such a model can measure changes in malware distribution over time.
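The toy problem described above can be sketched in PyTorch. This is a minimal illustration, not the speaker's actual code: the cluster positions, network size, and training schedule are all assumptions chosen to make the example self-contained.

```python
# Hypothetical sketch of the toy problem: train a small neural network
# to separate "good" from "bad" points in a 2-D scatter.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic labeled data: "good" points cluster near (-1, -1),
# "bad" points near (+1, +1).
good = torch.randn(200, 2) * 0.5 - 1.0
bad = torch.randn(200, 2) * 0.5 + 1.0
X = torch.cat([good, bad])
y = torch.cat([torch.zeros(200), torch.ones(200)])

# A small feed-forward network; the learned weights define the
# decision regions in the plane.
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(1), y)
    loss.backward()
    opt.step()

# The sigmoid of the raw output serves as a per-sample confidence score,
# the quantity the talk later tracks over time on real data.
with torch.no_grad():
    conf = torch.sigmoid(model(X).squeeze(1))
acc = ((conf > 0.5).float() == y).float().mean().item()
print(f"training accuracy: {acc:.2f}")
```

On well-separated clusters like these, the network converges quickly; the interesting part for the talk is not the accuracy itself but the confidence scores, whose distribution can shift as the underlying data drifts.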

Abstract

Security is a constant cat-and-mouse game between those trying to keep abreast of and detect novel malware, and the authors attempting to evade detection. The introduction of the statistical methods of machine learning into this arms race allows us to examine an interesting question: how fast is malware being updated in response to the pressure exerted by security practitioners? The ability of machine learning models to detect malware is now well known; we introduce a novel technique that uses trained models to measure "concept drift" in malware samples over time as old campaigns are retired, new campaigns are introduced, and existing campaigns are modified. Through the use of both simple distance-based metrics and Fisher Information measures, we look at the evolution of the threat landscape over time, with some surprising findings. In parallel with this talk, we will also release the PyTorch-based tools we have developed to address this question, allowing attendees to investigate concept drift within their own data.
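One simple way to operationalize the drift measurement the abstract describes is to bucket a trained model's per-sample confidence scores by time period and track how the low- and high-confidence fractions move. The sketch below is a hedged illustration with simulated scores, not the authors' released PyTorch tooling; the thresholds and the drift rate are assumptions.

```python
# Hedged sketch: measure concept drift by watching the model's
# confidence distribution shift from quarter to quarter.
import numpy as np

rng = np.random.default_rng(0)

def confidence_fractions(scores, low=0.6, high=0.95):
    """Fraction of samples the model is unsure about vs. very sure about."""
    scores = np.asarray(scores)
    return (scores < low).mean(), (scores >= high).mean()

# Simulated quarterly confidence scores: drift gradually pushes
# probability mass away from high confidence.
quarters = {}
for q in range(4):
    scores = np.clip(rng.normal(0.9 - 0.05 * q, 0.1, size=1000), 0.0, 1.0)
    quarters[f"Q{q + 1}"] = confidence_fractions(scores)

for name, (low_frac, high_frac) in quarters.items():
    print(f"{name}: low-confidence {low_frac:.2f}, high-confidence {high_frac:.2f}")
```

In this setup, a rising low-confidence fraction and a falling high-confidence fraction signal that the deployed model's view of the data is decaying, which is the qualitative pattern the talk reports on real malware telemetry.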
