A SCALABLE, ENSEMBLE APPROACH FOR BUILDING AND VISUALIZING DEEP CODE-SHARING NETWORKS OVER MILLIONS OF MALICIOUS BINARIES

The millions of unique malicious binaries gathered in today's white-hat malware repositories are connected through a dense web of hidden code-sharing relationships. If we could recover this shared-code network, we could provide much-needed context for and insight into newly observed malware. For example, our analysis could leverage previous reverse engineering work performed on a new malware sample's older "relatives," accelerating the reverse engineering process.

Various approaches have been proposed to see through malware packing and obfuscation to identify code sharing. A significant limitation of these existing approaches, however, is that they are either scalable but easily defeated, or more complex but unable to scale to millions of malware samples. A further issue is that even the more complex approaches described in the research literature tend to exploit only one "feature domain," be it malware instruction sequences, call graph structure, application binary interface metadata, or dynamic API call traces, which leaves these methods open to defeat by intelligent adversaries.

How, then, do we assess malware similarity and "newness" in a way that both scales to millions of samples and is resilient to the zoo of obfuscation techniques that malware authors employ? In this talk, I propose an answer: an obfuscation-resilient ensemble similarity analysis approach that addresses polymorphism, packing, and obfuscation by estimating code sharing in multiple static and dynamic technical domains at once, so that it is very difficult for a malware author to defeat all of the estimation functions simultaneously. To make this algorithm scale, we use an approximate feature counting technique and a feature-hashing trick drawn from the machine-learning domain, allowing fast feature extraction and fast retrieval of sample "near neighbors" even when handling millions of binaries.
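
To make the feature-hashing and ensemble ideas concrete, here is a minimal sketch in Python. It is not the algorithm presented in the talk: the function names, the choice of cosine similarity, the vector size, and the unweighted average across feature domains are all assumptions made for illustration. Each sample is assumed to be a dictionary mapping a feature-domain name to a list of string features.

```python
import hashlib
import math


def hash_features(features, dim=1024):
    """Feature-hashing trick: fold arbitrary string features into a
    fixed-size count vector, so no global feature dictionary is needed.
    (dim=1024 is an illustrative value, not one taken from the talk.)"""
    vec = [0] * dim
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % dim
        vec[bucket] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ensemble_similarity(sample_a, sample_b, dim=1024):
    """Average the per-domain similarities over the feature domains that
    both samples have. The unweighted mean is an assumption; a real
    system might weight or learn the combination."""
    domains = sample_a.keys() & sample_b.keys()
    if not domains:
        return 0.0
    scores = [
        cosine(hash_features(sample_a[d], dim), hash_features(sample_b[d], dim))
        for d in domains
    ]
    return sum(scores) / len(scores)


# Hypothetical samples with two feature domains each.
old_sample = {
    "instruction_ngrams": ["push mov", "mov call", "call ret"],
    "api_calls": ["CreateFileA", "WriteFile", "RegSetValueExA"],
}
new_sample = {
    "instruction_ngrams": ["push mov", "mov call", "xor jmp"],
    "api_calls": ["CreateFileA", "WriteFile", "connect"],
}
score = ensemble_similarity(old_sample, new_sample)
```

Because each domain contributes its own score, an obfuscation that destroys the similarity signal in one domain (for example, instruction sequences) still leaves the signal from the others intact.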

Our algorithm was developed over the course of three years and has been evaluated both internally and by an independent test team at MIT Lincoln Laboratory. We scored the highest on these tests against four competing malware cluster recognition techniques, and we believe this was because of our unique "ensemble" approach. In the presentation, I will give details on how to implement the algorithm and will go over the results in a series of large-scale interactive malware visualizations. As part of the algorithm description, I will walk through a Python machine learning library, to be released in the conference material, that allows users to detect feature frequencies over billions of items on commodity hardware.
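
The abstract does not say how the library estimates feature frequencies. One standard technique for counting a very large stream of items in fixed memory is the count-min sketch, shown below purely as an illustration. The class name, parameters, and hash construction are assumptions and do not describe the released library's API.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter. Estimates never undercount;
    hash collisions can only make them too high."""

    def __init__(self, width=2048, depth=4):
        # width and depth are illustrative; larger values reduce error.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent hash per row, derived by salting BLAKE2b
        # with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "big"),
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The smallest counter across rows is the least inflated one.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for feature in ["CreateFileA"] * 5 + ["WriteFile"] * 3:
    sketch.add(feature)
```

Memory use depends only on `width` and `depth`, not on the number of distinct features seen, which is the property that makes this kind of counting feasible on commodity hardware.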

Presented by