Machine learning models ostensibly offer excellent detection rates at low false positive rates for static, pre-execution malware detection. Importantly, they generalize well to new malware samples, evolving families, and polymorphic strains. Often neglected, however, is the fact that “old-school” signatures actually perform better in the narrow role for which they were designed: signatures detect known and well-behaved malware families at detection rates approaching 100% and false positive rates approaching 0%. Should these not be a powerful complement to static malware detection using machine learning?
I present automatic malware signature generation via n-grams, but with one significant upgrade: I consider ludicrously large n-grams, for n up to 1024, using “KiloGrams”, an approach co-developed by government, academic, and industry partners. Because the memory burden of straightforward approaches grows exponentially with n, previous research for n>6 is exceedingly rare, and to my knowledge n>8 has never been attempted. Yet more than 1 in 50 observed x86 instructions exceed 6 bytes, so a byte 6-gram cannot even span a single such instruction, let alone capture a telling sequence of them. KiloGrams discovers the K dominant n-grams (for very large n) with very modest memory requirements, and the resulting signatures provide both impressive predictive performance and intrinsic interpretability.
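To make the memory argument concrete, here is a minimal sketch of one way to find the K most frequent byte n-grams without storing a counter for every distinct n-gram. It assumes a two-pass scheme: the first pass counts n-grams by hash in a fixed-size table, and the second pass keeps exact counts only for n-grams that fall into the heaviest buckets. The function name `top_k_ngrams`, the `num_buckets` parameter, and the use of CRC32 as the hash are illustrative assumptions, not the KiloGrams implementation itself.

```python
import heapq
import zlib
from collections import Counter


def top_k_ngrams(files, n, k, num_buckets=1 << 16):
    """Approximate the k most frequent byte n-grams across `files`.

    `files` is an iterable of bytes objects. Memory in pass 1 is
    bounded by `num_buckets`, independent of n (hypothetical sketch).
    """
    files = list(files)

    # Pass 1: count n-grams by hash bucket in a fixed-size table.
    buckets = [0] * num_buckets
    for data in files:
        for i in range(len(data) - n + 1):
            buckets[zlib.crc32(data[i:i + n]) % num_buckets] += 1

    # Keep only the k heaviest buckets as candidates.
    heavy = set(heapq.nlargest(k, range(num_buckets), key=buckets.__getitem__))

    # Pass 2: exact counts, but only for n-grams landing in a heavy bucket.
    exact = Counter()
    for data in files:
        for i in range(len(data) - n + 1):
            gram = data[i:i + n]
            if zlib.crc32(gram) % num_buckets in heavy:
                exact[gram] += 1

    return exact.most_common(k)


if __name__ == "__main__":
    samples = [b"AAAAXYZWBBBB", b"CCXYZWDD", b"XYZWEE"]
    for gram, count in top_k_ngrams(samples, n=4, k=2):
        print(gram, count)
```

On the three toy samples, the first result is the shared 4-gram `b"XYZW"` with a count of 3. This sketch rehashes every window from scratch, which costs O(n) per position. A practical version for n near 1024 would presumably use a rolling hash so that each window is hashed in constant time.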
Hyrum Anderson is the Chief Scientist at Endgame, where he leads research on detecting adversaries and their tools using machine learning. Prior to joining Endgame he conducted information security and situational awareness research as a researcher at FireEye, Mandiant, Sandia National Laboratories and MIT Lincoln Laboratory. He received his PhD in Electrical Engineering (signal and image processing + machine learning) from the University of Washington and BS/MS degrees from BYU. Research interests include adversarial machine learning, large-scale malware classification, and early time-series classification.