The presentation discusses the challenges of, and approaches to, detecting deep fakes, with a focus on using bispectral analysis as a discriminative feature.
- Deep neural networks excel at creating high-fidelity fake samples, but doing so requires significant resources and data
- Bispectral analysis can be used to detect deep fakes by evaluating the bicoherence of raw audio
- A GAN's discriminator, trained alongside a generator, can serve as a powerful detector of fake samples
- Future directions include feeding the discriminator samples from different generative models and using richer features
- Detection of deep fakes is a cat-and-mouse game: known detection artifacts can be incorporated into a generator's loss function so that its output better mimics human behavior
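The bicoherence feature mentioned above can be estimated directly from raw audio. A minimal NumPy sketch follows; the function name, window choice, and normalization convention are illustrative assumptions, not details from the presentation:

```python
import numpy as np

def bicoherence(x, nperseg=256, noverlap=128):
    """Estimate the bicoherence b(f1, f2) of a 1-D signal.

    Bicoherence is the magnitude of the normalized bispectrum
    B(f1, f2) = E[X(f1) X(f2) X*(f1 + f2)]; values near 1 indicate
    strong quadratic phase coupling between frequency pairs, a
    property bispectral deep-fake detectors look for.
    """
    step = nperseg - noverlap
    n_seg = (len(x) - noverlap) // step
    half = nperseg // 2
    win = np.hanning(nperseg)

    num = np.zeros((half, half), dtype=complex)  # E[X(f1) X(f2) X*(f1+f2)]
    d1 = np.zeros((half, half))                  # E[|X(f1) X(f2)|^2]
    d2 = np.zeros((half, half))                  # E[|X(f1+f2)|^2]

    f1 = np.arange(half)[:, None]
    f2 = np.arange(half)[None, :]
    f12 = (f1 + f2) % nperseg                    # wrap to valid FFT bins

    for i in range(n_seg):
        seg = x[i * step : i * step + nperseg] * win
        X = np.fft.fft(seg, nperseg)
        t = X[f1] * X[f2]                        # (half, half) via broadcasting
        num += t * np.conj(X[f12])
        d1 += np.abs(t) ** 2
        d2 += np.abs(X[f12]) ** 2

    # Cauchy-Schwarz guarantees the ratio lies in [0, 1].
    return np.abs(num) / np.sqrt(d1 * d2 + 1e-12)
```

For a signal whose components at f1, f2, and f1 + f2 are phase-coupled (e.g. three cosines with phases summing to zero), the estimate approaches 1 at that frequency pair; for independent phases it stays near 0.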
The author mentions that creating a high-fidelity deep fake of a specific person requires a significant amount of resources and data, citing an example in which over 17 hours of presidential addresses were needed to create a single sample. More generic deep fakes, however, can be created easily by cloning projects off GitHub. The presentation also provides a brief history of text-to-speech systems, from concatenative to parametric to deep-learning approaches.
Neural networks can generate increasingly realistic, human-like speech. These so-called "deep fakes" can be used in social engineering attacks: bad actors can now impersonate any person's voice merely by gathering a few samples of spoken audio and synthesizing new speech with off-the-shelf tools. But how convincing are these deep fakes? Can we train humans or artificial intelligence to spot the tell-tale signs of audio manipulation?

In this work, we assessed the relative abilities of biological and machine listeners on a task that required discriminating real vs. fake speech. For machines, we looked at two approaches based on machine learning: one based on game theory, called generative adversarial networks (GAN), and one based on depth-wise convolutional neural networks (Xception). For biological systems, we gathered a broad range of human subjects, but we also used mice. Recent work has shown that the auditory system of mice closely resembles that of humans in its ability to recognize many complex sound groups. Mice do not understand the words, but they respond to the acoustic stimulus and can be trained to recognize real vs. fake phonetic construction. We theorize that this may be advantageous in detecting the subtle signals of improper audio manipulation, without being swayed by the semantic content of the speech.

We evaluated the relative performance of all four discriminator groups (GAN, Xception, humans, and mice), using the "deep fakes" dataset recently published in Google's "Spoofing and Countermeasures Challenge", and we report the results here.
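The Xception model named in the abstract is built from depthwise separable convolutions, which factorize a standard convolution into a per-channel spatial filter followed by a 1x1 channel mixer. A minimal NumPy sketch of that factorization (shapes and names are illustrative, not the authors' implementation):

```python
import numpy as np

def depthwise_separable_conv(x, depthwise, pointwise):
    """Depthwise separable convolution (valid padding, stride 1).

    x:         (H, W, C_in) input feature map
    depthwise: (k, k, C_in) one spatial filter per input channel
    pointwise: (C_in, C_out) 1x1 convolution mixing channels
    """
    H, W, C = x.shape
    k = depthwise.shape[0]
    out_h, out_w = H - k + 1, W - k + 1

    # Depthwise step: filter each channel independently.
    dw = np.zeros((out_h, out_w, C))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]            # (k, k, C)
            dw[i, j] = np.sum(patch * depthwise, axis=(0, 1))

    # Pointwise step: 1x1 conv = per-pixel linear map across channels.
    return dw @ pointwise                             # (out_h, out_w, C_out)
```

A standard k x k convolution needs k\*k\*C_in\*C_out weights, while the factorized form needs only k\*k\*C_in + C_in\*C_out, which is why Xception-style networks get comparable accuracy with far fewer parameters.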