Your Voice is My Passport

Conference: DEF CON 26



The presentation discusses the use of text-to-speech technology to bypass voice authentication systems and the implications of this technology in cybersecurity.
  • Text-to-speech technology can be used to bypass voice authentication systems by generating audio that sounds like the target's voice.
  • Voice models are traditionally trained on a single person's voice and require at least 24 hours of high-quality audio.
  • Open-source datasets such as LJ Speech and Blizzard can be used for training voice models.
  • The technology has implications in cybersecurity as it can be used for impersonation and social engineering attacks.
  • The presentation provides an anecdote from the movie Sneakers to illustrate the concept of social engineering to obtain voice data.
In the Sneakers anecdote, the heroes bypass a voice authentication system by socially engineering their target into saying each word of the passphrase. This is difficult in practice: the people worth impersonating are busy and unlikely to sit down with you, and modern authentication prompts change randomly, so the exact phrase cannot be collected in advance. Text-to-speech sidesteps both problems: once a voice model is trained on the target, it can generate audio of any prompt in the target's voice, without needing a recording of the actual passphrase.
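To make the bypass concrete: many speaker-verification systems reduce an utterance to a fixed-length voice embedding and accept the speaker if its similarity to the enrolled embedding clears a threshold. The sketch below illustrates that decision rule with toy 4-dimensional embeddings and a cosine-similarity threshold; the embedding values and the 0.85 threshold are illustrative assumptions, not figures from the talk.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two fixed-length voice embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verify_speaker(enrolled, candidate, threshold=0.85):
    """Accept the candidate if its embedding is close enough to enrollment."""
    return cosine_similarity(enrolled, candidate) >= threshold

# Toy embeddings: the TTS-generated sample lands near the enrolled voice,
# so it clears the threshold even though no real recording was used.
enrolled  = [0.9, 0.1, 0.4, 0.3]
synthetic = [0.88, 0.12, 0.41, 0.28]   # produced by the cloned voice model
stranger  = [0.1, 0.9, 0.2, 0.7]

assert verify_speaker(enrolled, synthetic)      # spoof accepted
assert not verify_speaker(enrolled, stranger)   # unrelated voice rejected
```

The attack works precisely because synthesis quality only has to be good enough to land inside the acceptance region, not good enough to fool a human listener.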


Financial institutions, home automation products, and offices near universal cryptographic decoders have increasingly used voice fingerprinting as a method for authentication. Recent advances in machine learning and text-to-speech have shown that synthetic, high-quality audio of subjects can be generated using transcribed speech from the target. Are current techniques for audio generation enough to spoof voice authentication algorithms? We demonstrate, using freely available machine learning models and a limited budget, that standard speaker recognition and voice authentication systems are indeed fooled by targeted text-to-speech attacks. We further show a method that reduces the data required to perform such an attack, demonstrating that more people are at risk of voice impersonation than previously thought.
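The end-to-end attack the abstract describes can be sketched as a short loop: read the randomized prompt, synthesize it in the target's voice, and submit the result. The `synthesize` and `authenticate` functions below are hypothetical stand-ins for a real TTS model and a real voice-authentication check; neither name comes from the talk.

```python
# Hypothetical flow of a targeted text-to-speech attack against a
# challenge-response voice authentication system.

def synthesize(text, voice_model):
    """Stub: a real implementation would run a trained TTS model and
    return waveform audio in the cloned voice."""
    return f"<audio:{voice_model}:{text}>"

def authenticate(audio, expected_phrase, enrolled_voice):
    """Stub: a real system checks both the spoken phrase and the
    speaker's voiceprint against enrollment."""
    return expected_phrase in audio and enrolled_voice in audio

# The prompt changes on every attempt, so the attacker synthesizes it live
# rather than replaying a pre-recorded phrase.
prompt = "my voice is my passport verify me"
spoofed = synthesize(prompt, voice_model="target_voice")
assert authenticate(spoofed, prompt, enrolled_voice="target_voice")
```

Because the voice model can speak arbitrary text, randomizing the passphrase, a common anti-replay defense, does not stop this class of attack.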