Neural networks can generate increasingly realistic, human-like speech. These so-called "deep fakes" can be used in social engineering attacks. Bad actors can now impersonate any person's voice by gathering a few samples of their spoken audio and synthesizing new speech with off-the-shelf tools.
But how convincing are these "deep fakes"? Can we train humans or artificial intelligence to spot the tell-tale signs of audio manipulation? In this work, we assessed the relative abilities of biology and machines on a task that requires discriminating real from fake speech.
For machines, we evaluated two machine learning approaches: one based on a game-theoretic training scheme, generative adversarial networks (GANs), and one based on depthwise-separable convolutional neural networks (Xception).
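To make the second approach concrete, the sketch below shows the core idea behind an Xception-style discriminator: depthwise-separable convolutions over a spectrogram, ending in a single real-vs-fake logit. This is an illustrative example only, not the implementation used in the work; it assumes PyTorch, and the class names (`SeparableConv2d`, `SpoofDetector`) and input sizes are hypothetical.

```python
# Minimal sketch of an Xception-flavored real-vs-fake audio classifier.
# Illustrative only; assumes PyTorch and hypothetical layer sizes.
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise conv (one filter per channel) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class SpoofDetector(nn.Module):
    """Tiny binary classifier over (batch, 1, freq, time) spectrograms."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            SeparableConv2d(32, 64), nn.ReLU(), nn.MaxPool2d(2),
            SeparableConv2d(64, 128), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, 1)  # single logit: fake vs. real

    def forward(self, spec):
        h = self.features(spec).flatten(1)
        return self.classifier(h)

# Example: a batch of 8 single-channel 128x400 log-mel spectrograms.
logits = SpoofDetector()(torch.randn(8, 1, 128, 400))
print(logits.shape)  # torch.Size([8, 1])
```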
For biological systems, we recruited a broad range of human subjects, and we also used mice. Recent work has shown that the mouse auditory system closely resembles that of humans in its ability to recognize many complex sound categories. Mice do not understand the words, but they respond to the acoustics of the stimuli and can be trained to discriminate real from fake phonetic constructions. We theorize that this may be an advantage in detecting the subtle signatures of audio manipulation, without being swayed by the semantic content of the speech.
We evaluated the relative performance of all four discriminator groups (GAN, Xception, humans, and mice) on a "deep fakes" data set recently published as part of Google's "Spoofing and Countermeasures Challenge," and we report the results here.
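As a rough illustration of how such a comparison can be scored, the snippet below computes accuracy and the equal error rate (EER) often reported in spoofing challenges from per-utterance "probability of fake" scores. It is our own sketch, not the challenge's official scoring code; it assumes NumPy and scikit-learn, and the example scores are made up.

```python
# Sketch of scoring one discriminator group on a labeled real-vs-fake set.
# Illustrative only; the scores below are hypothetical.
import numpy as np
from sklearn.metrics import roc_curve

def accuracy_and_eer(labels, scores, threshold=0.5):
    labels = np.asarray(labels)   # 1 = fake, 0 = real
    scores = np.asarray(scores)   # higher = more likely fake
    acc = np.mean((scores >= threshold) == labels)
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    eer = fpr[np.nanargmin(np.abs(fnr - fpr))]  # operating point where FPR ~= FNR
    return acc, eer

# Hypothetical scores for six utterances (first three are fakes).
print(accuracy_and_eer([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]))
```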
George Williams is the Director of Computing and Data Science at GSI Technology, an embedded hardware and artificial intelligence company. He's held senior leadership roles in software, data science, and research, including tenures at Apple's New Product Architecture group and at New York University's Courant Institute. He can talk on a broad range of topics at the intersection of e-commerce, machine learning, software development, and cloud security. He is an author on several research papers in computer vision and deep learning, published at NIPS, CVPR, ICASSP, and SIGGRAPH.
Jonathan Saunders is a systems neuroscience PhD student at the University of Oregon. They study the computational mechanisms of complex sound processing in auditory cortex with Michael Wehr. Currently they are working on grounding theoretical models of speech processing in neurophysiological data, investigating the unexpected role of extracellular protein matrices in auditory cortical plasticity, and a project about the neural computation of hip-hop that you can only faintly make out as billows of smoke and vague mumbling noises coming from behind a veil of mystery. They are in the process of publishing the next generation of software for behavioral neuroscience, which distributes experiments across networks of single-board computers. They are looking for collaborators for a future project applying cryptographic techniques to analyze neural data. Likes: messy, intractable scientific problems; unlikely cross-disciplinary collaboration; that ancient geocities page that you can't delete because you forgot the password in 2002. Dislikes: bad science; entrenched power; people googling their old screennames.
Alex Comerford is a Data Scientist at Bloomberg. He has built custom data-driven cyber-threat detection strategies, most recently as a data scientist at Capsule8. He continues to be a thought leader in cybersecurity, presenting regularly on topics at the intersection of open-source software, AI, and advanced threat detection, including a recent talk at AnacondaCon 2019. Alex holds a degree in Nanoscale Engineering from SUNY Albany.