As our networks generate an ever-larger deluge of security-relevant data, data science (machine learning, data visualization, and scalable storage technologies) has become necessary if we are to succeed in both stopping advanced attackers and gaining intelligence about their tactics. Unfortunately, there is still a gap between the security and data science communities: security professionals often have limited knowledge of data science, and security data scientists often come from non-security backgrounds and may not understand why security data science differs from the approaches taught in traditional machine learning and visualization programs.
In my talk, I will bridge this gap, speaking to both audiences, discussing the challenges and opportunities posed by applying data science to security, demonstrating exciting results achieved by my research group, and empowering attendees to apply security data science in new and powerful ways. The first part of the talk will provide a non-mathematical overview of security data science, introducing state-of-the-art data visualization and the big three machine learning tasks (classification, clustering and regression). For each of these topics, I will give examples of how my colleagues and I have successfully applied it to problems like attack detection, threat intelligence, malware analysis and scalable malware analytics.
The second part of the talk will cover both major security-specific data science challenges and solutions to these challenges. One challenge is that malicious activity exists as a needle in the haystack of terabytes of benign data, causing textbook data science methods, which are often not designed for such scenarios, to generate reams of false positives. Another challenge is the inevitable lack of access to 0-day attack data with which to train machine-learning approaches. I will go over multiple mitigations for both of these problems, including statistical methods designed to generalize to new attacks and minimize false positives, and will show how these methods have performed impressively in detecting 0-day malware in my group's work.
The third part of my talk will address security data visualization, discussing my group's ongoing and past log visualization, malware analysis visualization, and threat intelligence visualization work [4][5]. In discussing this work I will describe how we use machine-learning approaches to address a challenge unique to security data visualization: the semantic gap between low-level security data and the high-level activity we actually care about.
In summary, my talk will explore the emerging and exciting world of security data science, discussing opportunities, challenges and effective approaches. My goal is that attendees leave the talk excited about the possibilities of applying data science to their own security-related work, newly aware of the pitfalls of this area, and more knowledgeable about solutions to these pitfalls.
Joshua Saxe directs Invincea Labs' data science research group, whose focus is researching and developing breakthrough security data science technologies. Highlights of his work at Invincea have included leading the development of a system that automatically discovers and visualizes malware genealogical relationships, and leading the development of novel data science approaches for detecting, analyzing and visualizing both malware and malicious network behavior. Prior to starting at Invincea, Josh served as lead research engineer at Applied Minds, an interdisciplinary technology think tank, where he led a two-year research project focused on applying machine learning and data visualization to the problem of modeling enterprises' cybersecurity vulnerabilities.