Generating YARA Rules by Classifying Malicious Byte Sequences

Conference: BlackHat USA 2021

2021-08-05

Summary

The presentation discusses the development of an interpretable deep learning model for malware recognition that generates YARA rules based on the model's output scores.

The model is structured so that it can output a score for any contiguous sequence of bytes, which is what makes it interpretable.
YARA rules are generated based on the model's output scores for malicious and benign byte sequences.
The model was tested on different datasets and achieved high true positive rates and low false positive rates.
Future work includes utilizing more YARA functionality, introducing model-driven string wildcarding, and integrating the tool with parsing libraries.
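
As a rough illustration of the step from scored byte sequences to a rule, the sketch below keeps every sequence whose score meets a threshold and writes it out as a YARA hex string. It is a minimal sketch, not the presenter's tool: the function names, the `(bytes, score)` input format, the threshold, and the `any of them` condition are all assumptions made for this example.

```python
from typing import Iterable, Tuple


def to_yara_hex(seq: bytes) -> str:
    """Format raw bytes as a YARA hex string, e.g. { 7F 45 4C 46 }."""
    return "{ " + " ".join(f"{b:02X}" for b in seq) + " }"


def build_rule(name: str,
               scored: Iterable[Tuple[bytes, float]],
               threshold: float) -> str:
    """Build YARA rule text from (byte sequence, score) pairs.

    Assumption: a higher score means "more malicious", and any single
    surviving sequence is enough to trigger the rule.
    """
    kept = [seq for seq, score in scored if seq and score >= threshold]
    if not kept:
        raise ValueError("no byte sequence met the threshold")

    lines = [f"rule {name}", "{", "    strings:"]
    for i, seq in enumerate(kept):
        lines.append(f"        $s{i} = {to_yara_hex(seq)}")
    lines += ["    condition:", "        any of them", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical scores; a real model would supply these.
    candidates = [(b"\x7fELF", 0.97), (b"\x00\x01", 0.12)]
    print(build_rule("example_generated_rule", candidates, threshold=0.9))
```

Raising the threshold trades detection coverage for fewer false positives, which is the tuning the summary describes in terms of true and false positive rates.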

The presenter mentions a surprising result: on the Mach-O dataset, just 11 rules achieved a 90% true positive rate with a 0.01% false positive rate. This was attributed to a single malware family dominating the dataset.
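
To make the two metrics concrete, the following sketch computes a true positive rate and a false positive rate for a set of byte signatures over labelled samples. Plain substring matching stands in for a YARA engine here, and the function name and toy data are assumptions for illustration only; they are not results or code from the talk.

```python
from typing import List, Sequence, Tuple


def detection_rates(signatures: Sequence[bytes],
                    malicious: List[bytes],
                    benign: List[bytes]) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate).

    A sample counts as detected if it contains any signature.
    """
    if not malicious or not benign:
        raise ValueError("both corpora must be non-empty")

    def detected(sample: bytes) -> bool:
        return any(sig in sample for sig in signatures)

    tpr = sum(detected(s) for s in malicious) / len(malicious)
    fpr = sum(detected(s) for s in benign) / len(benign)
    return tpr, fpr


if __name__ == "__main__":
    # Toy corpus: two of the three "malicious" samples share one
    # byte pattern, loosely mimicking a dataset dominated by one family.
    sigs = [b"\xde\xad\xbe\xef"]
    mal = [b"aa\xde\xad\xbe\xefbb", b"\xde\xad\xbe\xefcc", b"unrelated"]
    ben = [b"hello world", b"plain data"]
    print(detection_rates(sigs, mal, ben))
```

When one family dominates the malicious set, a handful of signatures that cover that family can account for most detections, which matches the explanation given for the small rule count.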

Abstract

While ML models for malware detection have become an industry standard for heuristically detecting malware, signature-based detection still proliferates thanks to ease of updates, transparency of detection logic, and operability in compute-constrained environments. In this work, we propose an interpretable machine learning model that can generate signatures tuned to optimize detection and minimize false positives on a given corpus of malware and benign samples. On a corpus of malicious and benign ELF executables targeting i386 and amd64, we observe detection rates in the 80% range with a false positive rate of 0% on the benign corpus with a few hundred YARA rules. The approach is filetype-agnostic and can be applied anywhere YARA rules can be used -- whether it be simple static analysis of binaries, Cuckoo reports, network monitoring, or memory scanning. We will also share trained models, code to train and extract signatures on your own corpora of bytestreams, as well as ready-to-go signatures for detecting recent PE, ELF, and Mach-O malware.
