The number of unique malware has been doubling every year for over two decades. The majority of effort in malware analysis has focused on methods for preventing malware infection. We view the exponential growth of malware as an underutilized source of intelligence. Given that the number of malware authors are not doubling each year, the large volume of malware must contain evidence that connects them. The challenge is how to extract the connections.
Since a malware is a complex software, it's development necessarily follows software engineering principles, such as modular programming, using third-party libraries, etc. Thus, sharing of code between malware are viable indicators of connection between their creators. However, identifying such shared code is not straightforward. The task is made complicated since to survive in an environment hostile (to it) a malware uses a variety of deceptions, such as polymorphic packing, for the explicit purpose of making it difficult to infer such connections.By using a combination of two orthogonal approaches - formal program analysis and data mining - we have developed a scalable method to search large scale malware repositories for forensic evidence. Program analyses aid in peeking through the deceptions employed by malware to extract fragments of evidence. Data mining aids in organizing this mass of fragments into a web of connections which can then be used to make a variety of queries, such as to determine whether two apparently disparate cyber attacks are related; to transfer knowledge gained in countering one malware to counter other similar malware; to get a holistic view of cyber threats and to understand and track trends, etc.This talk will summarize our method, describe VirusBattle - a web service for cloud-based malware analysis - developed at UL Lafayette, and present empirical evidence of viability of mining large scale malware repositories to draw meaningful inferences.