Death to the IOC: What's Next in Threat Intelligence

Conference: BlackHat USA 2019

2019-08-08

Summary

The presentation discusses the use of machine learning in threat intelligence analysis to automate the manual process of extracting entities and building relationships between them.

Threat intelligence analysis involves manual processing of unstructured text to understand the vulnerabilities and targets of threat actors.
Named entity extraction is the process of identifying and classifying entities in text.
The presenter's team trained a machine learning/deep learning-based cyber entity extractor using a publicly available corpus of APT white papers and threat intelligence blogs.
The training dataset was small and class-imbalanced, so the team used statistical modeling methods in addition to deep learning ones.
The team used the CARAT dataset to annotate their data and assess their labels.
The machine learning model was able to extract entities and build relationships between them, providing insights that can help organizations make better decisions.
The presenter provided a demo of a tool they created to automate the entity extraction process.
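
As an illustration of the class imbalance point above: in token-level entity tagging, most tokens belong to the "outside" class and entity tags are rare. The summary does not say how the team compensated for this, so the following is only a minimal sketch of one common remedy, inverse-frequency class weights. The tag names and counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class, inversely proportional to its frequency.

    Uses weight = n_samples / (n_classes * class_count), so rare classes
    get weights above 1 and the dominant class gets a weight below 1.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Hypothetical tag distribution: most tokens are outside any entity ("O").
tags = ["O"] * 90 + ["B-MALWARE"] * 6 + ["B-ACTOR"] * 4
weights = balanced_class_weights(tags)

for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label}: {weight:.2f}")
```

Weights like these can be passed to a loss function or a classifier that accepts per-class weights, so that mistakes on rare entity tags cost more than mistakes on the dominant class.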

The presenter gave the example of a threat intelligence analyst manually gathering information on three APT actors from an Eastern European country. The analyst had to read through various documents to understand the vulnerabilities and targets of these actors, then organize the information into a graph to extract insights that would help the organization make better decisions. The presenter's team used machine learning to automate this process: the system extracts entities and builds relationships between them, so that organizations can prioritize their defensive choke points and disrupt the toolchain of the attackers they care about.
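
The graph step in this example can be sketched as follows. This is not the presenter's tool, only a minimal illustration that assumes entity extraction has already produced (actor, relation, entity) triples. The actor and technique names are hypothetical. Techniques shared by the most actors rank first, as candidate defensive choke points.

```python
from collections import defaultdict

# Hypothetical output of an entity and relationship extractor.
triples = [
    ("ActorA", "uses", "spearphishing"),
    ("ActorA", "uses", "credential dumping"),
    ("ActorB", "uses", "spearphishing"),
    ("ActorB", "uses", "custom backdoor"),
    ("ActorC", "uses", "spearphishing"),
    ("ActorC", "uses", "credential dumping"),
]


def build_graph(triples):
    """Map each extracted technique to the set of actors linked to it."""
    graph = defaultdict(set)
    for actor, _relation, entity in triples:
        graph[entity].add(actor)
    return graph


def rank_choke_points(graph):
    """Sort techniques by number of distinct actors, most shared first."""
    return sorted(graph.items(), key=lambda item: (-len(item[1]), item[0]))


ranked = rank_choke_points(build_graph(triples))
for technique, actors in ranked:
    print(f"{technique}: {len(actors)} actor(s): {', '.join(sorted(actors))}")
```

A defense that blocks the top-ranked technique disrupts every actor connected to it, which is the kind of prioritization the example describes.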

Abstract

Humans cannot scale to the amount of Threat Intelligence being generated. While the Security Community has mastered the use of machine readable feeds from OSINT systems or third party vendors, these usually provide IOCs or IOAs without contextual information. On the other hand, we have rich textual data that describes the operations of cyber attackers, their tools, tactics and procedures; contained in internal incident response reports, public blogs and white papers. Today, we can't automatically consume or use these data because they are composed of unstructured text. Threat Analysts manually go through them to extract information about adversaries most relevant to their threat model, but that manual work is a bottleneck for time and cost.

In this project we will automate this process using Machine Learning. We will share how we can use ML for Custom Entity Extraction to automatically extract entities specific to the cyber security domain from unstructured text. We will also share how this system can be used to generate insights such as:

Identify patterns of attacks an enterprise may have faced
Analyze the most effective attacker techniques against the enterprise they are defending
Extract trends of techniques used in the overall eco-system or a specific vertical industry

These insights can be used to make data backed decisions about where to invest in the defenses of an enterprise. And in this talk we will describe our solution for building an entity extraction system from public domain text specific to the security domain; using opensource ML tooling. The goal is to enable applied researchers to extract TI insights automatically, at scale and in real time.

We will cover:

The importance of this process for threat intelligence and share some examples of actionable insights we can provide as a result of this research
Overall Architecture of the system and ML principles used
How we automatically created a training dataset for our domain using a dictionary of entities
Supervised and unsupervised featurization methods we experimented with
Experimentation and results from Statistical Modeling methods and Deep Learning Methods
Recommendations and resources for Applied Researchers who may want to implement their own TI Extraction pipeline.
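
The abstract mentions creating a training dataset automatically from a dictionary of entities. It does not describe the matching procedure, so the sketch below is only one plausible reading: a greedy longest-match lookup that emits BIO tags for each token. The dictionary entries, entity types, and sentence are all hypothetical.

```python
def bio_label(tokens, dictionary):
    """Tag tokens with BIO labels using greedy longest-match dictionary lookup.

    `dictionary` maps a tuple of lowercase tokens to an entity type.
    """
    labels = ["O"] * len(tokens)
    lowered = [token.lower() for token in tokens]
    max_len = max((len(key) for key in dictionary), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word entities win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            entity_type = dictionary.get(tuple(lowered[i:i + n]))
            if entity_type is not None:
                labels[i] = f"B-{entity_type}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{entity_type}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


# Hypothetical entity dictionary; a real one would come from curated lists.
ENTITY_DICTIONARY = {
    ("examplerat",): "MALWARE",
    ("example", "bear"): "ACTOR",
    ("spear", "phishing"): "TECHNIQUE",
}

tokens = "Example Bear delivered ExampleRAT through spear phishing emails".split()
labels = bio_label(tokens, ENTITY_DICTIONARY)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Labels produced this way are noisy, since a dictionary misses unseen names and cannot resolve ambiguous words, so they are usually treated as a starting point for training rather than as ground truth.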

Materials:

Slides
