Neural networks can generate increasingly realistic, human-like speech. These so-called "deep fakes" can be used in social engineering attacks. Bad actors can now impersonate anyone's voice simply by gathering a few samples of recorded speech and synthesizing new audio with off-the-shelf tools.
But how convincing are these "deep fakes"? Can we train humans or artificial intelligence to spot the tell-tale signs of audio manipulation? In this work, we assessed the relative abilities of biological and machine systems in a task that required discriminating real from fake speech.
For machines, we examined two machine-learning approaches: one based on game theory, generative adversarial networks (GANs), and one based on depthwise separable convolutional neural networks (Xception).
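To make the two machine approaches concrete, the sketch below shows a toy Xception-style classifier built from depthwise separable convolutions, operating on log-mel spectrograms. This is an illustrative assumption, not the architectures actually trained in this work; all layer sizes and the input shape are placeholders.

```python
# Minimal sketch (not the study's exact models), assuming log-mel spectrogram
# inputs of shape (batch, 1, n_mels, n_frames).
import torch
import torch.nn as nn


class SeparableConv2d(nn.Module):
    """Depthwise separable convolution, the building block of Xception."""

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # Depthwise: one spatial filter per input channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixing channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class SpoofDetector(nn.Module):
    """Tiny Xception-style classifier: real vs. synthetic speech."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            SeparableConv2d(1, 32), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            SeparableConv2d(32, 64), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, 1)  # single logit: "fake" score

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))


# A GAN discriminator plays an analogous role: trained against a generator to
# score real audio high and synthesized audio low, it can be reused directly
# as a real-vs-fake detector.
if __name__ == "__main__":
    dummy_batch = torch.randn(4, 1, 80, 400)   # 4 spectrograms, 80 mel bands
    print(SpoofDetector()(dummy_batch).shape)  # -> torch.Size([4, 1])
```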
For biological systems, we gathered a broad range of human subjects, and we also used mice. Recent work has shown that the mouse auditory system closely resembles that of humans in its ability to recognize many complex sound categories. Mice do not understand the words, but they respond to the acoustic stimulus and can be trained to discriminate real from fake phonetic constructions. We hypothesize that this may be advantageous for detecting the subtle signatures of audio manipulation, without being swayed by the semantic content of the speech.
We evaluated the relative performance of all four discriminator groups (GAN, Xception, humans, and mice) on a "deep fakes" data set recently published for Google's "Spoofing and Countermeasures Challenge", and we report the results here.
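One common way to compare such discriminators on a spoofing data set is the equal error rate (EER) computed over per-utterance scores. The sketch below is an assumption for illustration, not necessarily the evaluation protocol used in this study; it works for model logits as well as for the fraction of trials on which a human or mouse judged an utterance "fake".

```python
# Hypothetical comparison metric: equal error rate (EER) over per-utterance
# scores, where higher scores mean "more likely fake" (label 1).
import numpy as np


def equal_error_rate(scores, labels):
    """Return the EER: the point where false-accept and false-reject rates meet."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        predicted_fake = scores >= t
        far = np.mean(predicted_fake[labels == 0])   # real flagged as fake
        frr = np.mean(~predicted_fake[labels == 1])  # fake passed as real
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Example with made-up scores from any one discriminator group.
print(equal_error_rate([0.9, 0.8, 0.3, 0.2, 0.6], [1, 1, 0, 0, 1]))
```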