Machine learning models ostensibly offer excellent detection rates at low false positive rates when detecting malware statically, at the pre-execution stage. Importantly, they generalize well to new malware samples, evolving families and polymorphic strains. What is often neglected, however, is that “old-school” signatures actually perform better in the narrow role for which they were designed: signatures detect known and well-behaved malware families at detection rates approaching 100% and false positive rates approaching 0%. Should they not be a powerful complement to static malware detection using machine learning?
I present automatic malware signature generation via n-grams, but with one significant upgrade: I consider ludicrously large n-grams, for n up to 1024, using “KiloGrams”, an approach co-developed by government, academic and industry partners. Because the memory burden of straightforward approaches grows exponentially with n, previous research with n>6 is exceedingly rare, and to my knowledge n>8 has never been attempted. Yet more than 1 in 50 observed x86 instructions exceeds 6 bytes, so a byte 6-gram is insufficient for capturing telling sequences. KiloGrams discovers the K dominant n-grams (for very large n) with very modest memory requirements, and the resulting signatures provide both impressive predictive performance and intrinsic interpretability.
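To make the memory argument concrete: there are 256^n possible byte n-grams, which is already 2^48 at n=6, so a table keyed on every observed n-gram quickly becomes impractical. The sketch below shows one way to find the K most frequent n-grams with a fixed-size table, using two passes over the data. It is an illustrative simplification, not the KiloGrams implementation: the function name `top_k_ngrams`, the use of CRC32 as the hash, and the `num_buckets` default are all assumptions made for this example.

```python
from collections import Counter
import heapq
import zlib


def top_k_ngrams(files, n, k, num_buckets=1 << 20):
    """Approximate the k most frequent byte n-grams in two passes.

    Hypothetical helper for illustration only; hash choice and table
    size are assumptions, not taken from the KiloGrams work.
    """
    # Pass 1: count *hashed* n-grams in a fixed-size table, so the
    # table's memory use does not depend on n.
    buckets = [0] * num_buckets
    for data in files:
        for i in range(len(data) - n + 1):
            buckets[zlib.crc32(data[i:i + n]) % num_buckets] += 1

    # Keep the k heaviest buckets as candidates.
    candidates = set(
        heapq.nlargest(k, range(num_buckets), key=buckets.__getitem__)
    )

    # Pass 2: keep exact counts only for n-grams that fall into a
    # candidate bucket; everything else is discarded immediately.
    exact = Counter()
    for data in files:
        for i in range(len(data) - n + 1):
            gram = data[i:i + n]
            if zlib.crc32(gram) % num_buckets in candidates:
                exact[gram] += 1
    return exact.most_common(k)
```

Called as `top_k_ngrams(list_of_byte_strings, n=16, k=1)`, this returns a list of `(ngram, count)` pairs. Hash collisions can let a few infrequent n-grams into the second pass, but the exact recount ranks them below the truly frequent ones. A production version would also use a rolling hash so that each n-gram is not rehashed from scratch.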