Every day we leave a trail of digital breadcrumbs through our activity on online services, from social networks, photo sharing, mailing lists, online forums and blogs to more specialized sources such as commits to open source projects, music listening services and travel schedules. These have long been known to provide useful information when profiling a target for social engineering purposes, especially given the frantic pace at which we generate such content and how little we censor it.
Our talk takes a tool-oriented approach to these profiling activities. By combining data mining techniques with natural language processing, we can determine patterns in the way a user interacts with other users: their usual choice of vocabulary and phrasing, the friends and colleagues they communicate with most frequently, and the topics discussed with them. Because the profile is built from publicly available data, gathered through both official APIs and web scraping, it can be used to measure how closely forged content matches content actually generated by the target.
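To make the vocabulary comparison described above concrete, here is a minimal sketch that assumes a profile is nothing more than a bag-of-words frequency count and that "closeness" is cosine similarity between two such counts. The function names (`tokenize`, `build_profile`, `cosine_similarity`) and the sample messages are hypothetical illustrations; they are not taken from the tool discussed in the talk, which may use different features and metrics.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_profile(messages):
    """Aggregate word frequencies over all messages written by one user."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two word-frequency Counters (0.0 to 1.0)."""
    shared = a.keys() & b.keys()
    dot = sum(a[term] * b[term] for term in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample data: messages collected from the target.
target_messages = [
    "hey mate, lunch at noon?",
    "cheers mate, see you at noon",
]
profile = build_profile(target_messages)

# Two candidate forgeries, scored against the target's profile.
casual = build_profile(["hey mate, see you at noon"])
formal = build_profile(["Dear Sir, kindly remit payment"])

casual_score = cosine_similarity(profile, casual)
formal_score = cosine_similarity(profile, formal)
```

In this toy example the casual forgery scores higher than the formal one, which shares no vocabulary with the target at all. A realistic profile would likely weight terms (for example with TF-IDF) and look at phrasing as well as individual words.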
We will discuss the indexing of unstructured content, including the legal and technical implications of using official APIs versus scraping, how to build user relationship graphs, and how to add temporal references to the collected data.
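As a rough illustration of a user relationship graph with temporal references, the sketch below assumes each collected interaction has already been reduced to a `(sender, recipient, ISO 8601 timestamp)` tuple. That record layout, the helper names (`build_graph`, `top_contacts`, `active_hours`) and the sample data are assumptions made for this example only, not the data model of the tool presented in the talk.

```python
from collections import Counter, defaultdict
from datetime import datetime


def build_graph(interactions):
    """Map sender -> recipient -> list of datetimes for each interaction."""
    graph = defaultdict(lambda: defaultdict(list))
    for sender, recipient, timestamp in interactions:
        graph[sender][recipient].append(datetime.fromisoformat(timestamp))
    return graph


def top_contacts(graph, user, n=3):
    """Return the n recipients the user contacts most, with message counts."""
    contacts = graph.get(user, {})
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(contacts.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(name, len(times)) for name, times in ranked[:n]]


def active_hours(graph, user):
    """Count the user's outgoing interactions per hour of the day."""
    hours = Counter()
    for times in graph.get(user, {}).values():
        hours.update(t.hour for t in times)
    return hours


# Hypothetical sample data, with timestamps already normalized to UTC.
interactions = [
    ("alice", "bob", "2012-03-01T09:15:00+00:00"),
    ("alice", "bob", "2012-03-02T09:40:00+00:00"),
    ("alice", "carol", "2012-03-02T22:05:00+00:00"),
    ("bob", "alice", "2012-03-01T09:20:00+00:00"),
]
graph = build_graph(interactions)
alice_contacts = top_contacts(graph, "alice")
alice_hours = active_hours(graph, "alice")
```

Keeping the raw timestamps on each edge, rather than only a count, makes it possible to ask later when a relationship was active and at what times of day the target usually writes.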
We will also release a tool that automates the data mining and natural language processing (NLP) of unstructured information available from public data sources, and that compares user-created content against a generated profile using various criteria, including: