In recent years, applying deep learning techniques to natural language processing has produced state-of-the-art results in speech recognition, language modeling, and machine translation. To some degree, disassembly can be considered an extension or augmentation of natural language. As a loose example, many experienced reverse engineers can read through disassembled code and understand its meaning in one pass, much as they would read text in a natural language.
In this talk, we show how effectively deep learning techniques can be applied to disassembly to build models that identify malware. After a brief explanation of deep learning, we work through each stage of the pipeline: starting from a collection of raw binaries, extracting and transforming the disassembly data, and training a deep learning model. We conclude by presenting data on the efficacy of these models, followed by a live demo in which we evaluate the models against active malware feeds.
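To make the "extraction and transformation" stage concrete, here is a minimal sketch of one way disassembly text could be turned into fixed-length integer sequences suitable as model input. It is an illustration only, not the pipeline used in the talk: the function names (`extract_opcodes`, `build_vocab`, `encode`), the choice to keep only instruction mnemonics, the `;` comment convention, and the vocabulary size are all assumptions.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids for padding and unseen mnemonics (assumed convention)

def extract_opcodes(listing):
    """Return the lowercase mnemonic (first token) of each non-empty line."""
    opcodes = []
    for line in listing.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing ';' comments
        if line:
            opcodes.append(line.split()[0].lower())
    return opcodes

def build_vocab(sequences, max_size=256):
    """Map the most frequent mnemonics to integer ids, after the reserved ids."""
    counts = Counter(op for seq in sequences for op in seq)
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for op, _ in counts.most_common(max_size - len(vocab)):
        vocab[op] = len(vocab)
    return vocab

def encode(seq, vocab, length):
    """Truncate or right-pad a mnemonic sequence to exactly `length` ids."""
    ids = [vocab.get(op, UNK) for op in seq[:length]]
    return ids + [PAD] * (length - len(ids))

# Hypothetical listing, standing in for real disassembler output.
listing = """
push ebp
mov ebp, esp
mov eax, 1 ; set return value
pop ebp
ret
"""
ops = extract_opcodes(listing)
vocab = build_vocab([ops])
print(encode(ops, vocab, 8))
```

Sequences encoded this way can be fed to an embedding layer in the same manner as word ids in a language model, which is the analogy to natural language drawn above. Discarding operands keeps the vocabulary small but loses information; whether that trade-off is appropriate depends on the model.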