Deep learning has become pervasive in a plethora of consumer applications. And there are good reasons why all the kids are doing it these days. (1) True end-to-end deep learning reduces, in many applications, the need to laboriously hand-craft features for ingestion by a model. (2) A robust menagerie of flexible deep learning APIs (tensorflow, theano, keras, caffe, torch, mxnet, cntk, …) has made exotic deep learning architectures and ideas extremely accessible. (3) Especially in the domains of object classification, machine translation, and speech recognition, deep learning solutions dominate the leaderboards, advancing state-of-the-art performance year over year. What does this all mean? Lazy people can achieve state-of-the-art performance with very little work and a few lines of code, and don’t really have to speak math or machine learning, or really even have any domain expertise.
But what about information security? In this talk, I’ll walk through the steps to create a deep learning malware model from scratch: data curation, sample labeling, architecture specification, model training, and model validation. I’ll review bleeding-edge concepts in deep learning that have disrupted other domains and show how they can be applied (sometimes poorly!) to the hardest parts of building a malware classification model. Finally, I’ll highlight what separates easy-to-code models from product-worthy ones, and try to justify why I should still be employed as a data scientist after having demonstrated how easy this all is. Hint: the reasons have less to do with your model, and more to do with your data.