Humans cannot keep pace with the volume of Threat Intelligence being generated. While the Security Community has mastered the use of machine-readable feeds from OSINT systems or third-party vendors, these usually provide IOCs or IOAs without contextual information. On the other hand, we have rich textual data that describes the operations of cyber attackers and their tools, tactics and procedures, contained in internal incident response reports, public blogs and white papers. Today, we can't automatically consume or use this data because it is unstructured text. Threat Analysts manually go through it to extract the information about adversaries most relevant to their threat model, and that manual work is a bottleneck in both time and cost.
In this project we will automate this process using Machine Learning. We will share how we can use ML for Custom Entity Extraction to automatically extract entities specific to the cyber security domain from unstructured text. We will also share how this system can be used to generate insights such as:
- Identify patterns of attacks an enterprise may have faced
- Analyze the attacker techniques that are most effective against the enterprise being defended
- Extract trends of techniques used in the overall ecosystem or a specific industry vertical
These insights can be used to make data-backed decisions about where to invest in the defenses of an enterprise. In this talk we will describe our solution for building an entity extraction system for the security domain from public domain text, using open-source ML tooling. The goal is to enable applied researchers to extract TI insights automatically, at scale and in real time.
We will cover:
- The importance of this process for threat intelligence, with some examples of actionable insights we can provide as a result of this research
- The overall architecture of the system and the ML principles used
- How we automatically created a training dataset for our domain using a dictionary of entities
- Supervised and unsupervised featurization methods we experimented with
- Experimentation and results from statistical modeling methods and deep learning methods
- Recommendations and resources for Applied Researchers who may want to implement their own TI Extraction pipeline
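
To give a flavor of the dictionary-based dataset creation mentioned in the list above, here is a minimal sketch in Python. It is not the system described in this talk: the dictionary entries, the entity types, the example sentence and the BIO tagging scheme are all assumptions chosen for illustration. It matches known entity phrases against tokenized text, preferring the longest match, and emits one tag per token that a sequence model could then be trained on.

```python
import re

# Hypothetical entity dictionary (lowercase phrase -> entity type).
# The entries and type names are illustrative assumptions only.
ENTITY_DICT = {
    "emotet": "MALWARE",
    "cobalt strike": "TOOL",
    "spear phishing": "TECHNIQUE",
    "apt29": "ACTOR",
}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def bio_label(tokens, entity_dict):
    """Tag tokens with BIO labels using longest-match dictionary lookup."""
    # Index dictionary phrases by their token tuples for fast lookup.
    phrases = {tuple(k.split()): v for k, v in entity_dict.items()}
    max_len = max(len(p) for p in phrases)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = phrases.get(tuple(lowered[i:i + n]))
            if label:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Made-up example sentence, for illustration only.
tokens = tokenize("APT29 used spear phishing to deliver Cobalt Strike.")
tags = bio_label(tokens, ENTITY_DICT)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Labels produced this way are noisy: exact matching misses spelling variants and mislabels ambiguous words, so a sketch like this is best seen as a starting point for a training set rather than ground truth.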