Trusting your eyes has never been a wise approach. Sight deceives. The trompe l’oeil artists knew that in the 1400s, and Hippolyte Bayard created the first faked photograph in 1840, staging his own suicide by drowning. In 1895, Alfred Clark used the first film effect, recreating the beheading of Mary, Queen of Scots. He placed an actor on the block wearing Mary’s costume, froze all the actors in the scene, and stopped the camera. Then he replaced the actor with a doll dressed in the same costume, restarted the camera, the executioner swung his axe, and a head rolled. Pictures of Oswald and Trotsky are twentieth-century proof of the deceptive image. With the rise of editing software and apps, the tools for faking images and videos are widespread. Now, deepfakes are the thing.

As often happens, the rise of the deepfake occurred gradually, then suddenly. In 1997, the Video Rewrite program could reanimate faces and change soundtracks to alter apparent speech. The technique improved until realistic fake clips of Barack Obama emerged in 2017. It looked like Obama but wasn’t, and this fake set a new standard for what was possible. A Reddit user named deepfakes coined the term. Before long, apps such as FakeApp, FaceSwap, Zao, and Face2Face flooded the online market.

Speaking of online: lately, we’re meeting and interacting a lot more online. When we look at our colleagues or friends over Zoom, Teams, Skype, or Houseparty, most of us believe what we see is real. If those in the meeting made a decision, few of us would doubt its legitimacy. If we see and hear something that looks real, why should we doubt it? Well, after talking and working with some of the best deepfakers in the world, we’ve learned a lot about what’s possible today and what will be possible tomorrow. The following story tells how THE FREE LUNCH COMMISSION created an alternative historical narrative in which the 2019-nCoV pandemic happened in the early 1960s. To “prove” this narrative, we fabricated evidence.
We went through all the steps needed to create a deepfake extravaganza. Our challenge was to create a deepfake video in which JFK delivers Donald Trump’s Oval Office address from March 12, 2020. To exchange one president for another, we needed not only to find high-quality video, which proved harder than expected, but also to recreate the audio of JFK. All we had to start with was Google and Instagram.
When researching deepfakes, it’s hard to avoid Ctrl Shift Face, known for taking iconic film scenes and switching the actors. For example, he replaced Javier Bardem, playing the psycho killer Anton Chigurh in No Country for Old Men’s famous coin-toss scene, with Leonardo DiCaprio, Arnold Schwarzenegger, and Willem Dafoe. He swapped Christian Bale for Tom Cruise in the famous axe scene of American Psycho. And Robert De Niro for Al Pacino in Taxi Driver. You get the idea. Ctrl Shift Face is deepfake royalty, and we learned he has optimized his technique for switching faces. He takes original footage and changes the face; the body and hair remain the same. To make our deepfake that way, we’d need an actor pretending to be JFK, styled the right way and in the right setting. Then, the actor would have to read Trump’s Oval Office address. After that, we’d need to swap the actor’s face for a 3D model of JFK’s. Software called DeepFaceLab would analyze the facial expressions and movement, laying the fake JFK face as a mask over the original footage. After discussing this option, we concluded we couldn’t use Ctrl Shift Face’s skills for our challenge. If we had, we would’ve gotten a world-class face, but getting everything else right would have been a headache. So, we had to keep looking.
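The face-swap workflow described above — collect footage of the face, train a model on it, then lay the generated face over each frame — can be outlined in pseudocode (function names here are illustrative placeholders, not DeepFaceLab’s real API):

```
# High-level sketch of a face-swap pipeline of the kind described above.

def face_swap(source_clips, target_clip):
    # 1. Collect many frames of the face to transplant (here: JFK).
    source_faces = [extract_face(f) for clip in source_clips for f in clip]

    # 2. Train a model that learns to reproduce that face under
    #    arbitrary expressions, angles, and lighting.
    model = train_face_model(source_faces)

    # 3. For each frame of the target footage (the actor reading the
    #    speech), detect the actor's face, generate the learned face
    #    with the same expression, and blend it on as a mask.
    output = []
    for frame in target_clip:
        actor_face = extract_face(frame)
        new_face = model.generate(expression_of(actor_face))
        output.append(blend_as_mask(frame, new_face))
    return output
```

Step 3 is why the actor’s styling, hair, and setting still matter so much: everything outside the mask comes straight from the original footage.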
Ctrl Shift Face was kind enough to recommend other creatives, including Bill Posters, whom we already had our eyes on from Instagram, and Kevin and Chris at VFX in Brussels, Belgium. These specialists are bombarded with requests for deepfakes, often for dishonorable purposes: revenge, revenge porn, and other questionable productions. The top producers of deepfakes have therefore agreed on guidelines for the work they’re willing to take on. This code of conduct requires fakes to be clearly marked, with no chance of causing damage. We explained our project, which we’re marking as fake, and in the end, VFX agreed to help us. That’s when we started drawing up plans. We needed to find a lot of footage of John F. Kennedy. YouTube has plenty, but the resolution isn’t high enough to train an AI on JFK’s face. The neural network that analyzes the face can never get better than the information you feed it. If the footage is low resolution or blurry, the computer won’t be able to replicate the face well at a later stage. We contacted the John F. Kennedy Presidential Library and Museum to see if they had any high-resolution footage available in their archives, but we never heard back. After looking all over the place, we found a JFK section on Getty Images, including 4K conversions from film cameras. With that, we had the footage we needed, including a number of clips showing JFK from different angles and talking, so we could train the AI to replicate his expressions and speech movements well.
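The rule that drove our footage hunt — a model can only learn detail that is actually in the frames — can be sketched as a simple pre-filter over candidate training frames. This is a hypothetical illustration in Python, not VFX’s actual pipeline, and the thresholds are made up:

```python
# Hypothetical pre-filter for face-swap training frames.
# A face model can never exceed the detail of its training data,
# so small or blurry face crops are discarded up front.

MIN_FACE_SIDE = 256     # assumed minimum face-crop size in pixels
MIN_SHARPNESS = 100.0   # assumed blur-score threshold

def usable(frame):
    """frame: dict with the face crop's (width, height) and a sharpness score."""
    w, h = frame["face_size"]
    return min(w, h) >= MIN_FACE_SIDE and frame["sharpness"] >= MIN_SHARPNESS

frames = [
    {"id": "youtube_clip_01", "face_size": (96, 96),   "sharpness": 40.0},
    {"id": "getty_4k_07",     "face_size": (512, 512), "sharpness": 310.0},
    {"id": "getty_4k_12",     "face_size": (480, 500), "sharpness": 55.0},  # motion blur
]

training_set = [f["id"] for f in frames if usable(f)]
print(training_set)  # ['getty_4k_07'] — only the sharp, high-resolution frame survives
```

By this standard, typical YouTube footage fails on size alone, which is why the 4K Getty conversions were such a find.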
VFX’s Chris went to work analyzing the footage to train the AI. In the meantime, we started working on the audio. Finding a good JFK impersonator to do the voice was easier said than done. It turns out that JFK isn’t in demand these days. The most famous JFK impersonator, Vaughn Meader, passed away sixteen years ago. The few other JFK impersonators we found were unconvincing. So, we started to investigate how we could synthesize that specific voice. Searching for options, we made several forays. We contacted Lyrebird; they operate on an invitation-only basis, and we never heard back. We talked to American and Swedish researchers in voice synthesis to see if we could find a partner who could generate a credible JFK voice for us. It didn’t work out. The academic world moves too slowly for what we needed, and the researchers needed too much convincing, so we moved on to another, more promising lead. Some years back, the London newspaper The Times ran a campaign called “JFK Unsilenced,” in which they created a replica of the late president’s voice to deliver the speech he was scheduled to give in Dallas later on the day of his assassination. Their replica was a haunting triumph. We thought that if we could contact the people who created that voice, we could do what we needed. So, with a few searches, we found the communication agency behind the campaign, along with the voice-synthesizing company behind the agency. That company is CereProc, a Scottish firm based in Edinburgh. We reached out to investigate using their JFK voice and to learn more about how they created it. We set up a phone call with their chief scientific officer, Dr. Matthew Aylett, and asked him some questions.
Thank you for taking the time to talk to us. Could you explain who you are?

DR. AYLETT I’m the chief science officer at CereProc. We’re a speech-services company, running since 2006. We specialize in generating character for artificial voices, and in cloning and copying voices.

When you clone a voice, could you walk us through the process? How do you do it?

DR. AYLETT Going back a long way, synthetic voices were always copies to a certain extent. For example, Stephen Hawking’s voice was made from the voice of Dennis H. Klatt, a fantastic academic at MIT who produced one of the first synthesizers. Other voices came from going into a studio with a voice actor, recording for 12 to 20 hours, and using that audio to help build the voice. Then, the synthetic voice sounds like the actor. The voice actor might be a comedian, but that’s not how they are recorded; they speak neutrally. But when you clone a voice, you want to convey that person’s character and intensity. It’s slightly different, because the character and the identity of the voice are what’s important. With those first synthesized voices, the main thing was to be very intelligible; they didn’t have to be emotional voices. Cloning is a bit different, because when you start cloning voices, you don’t have access to the voice in the same way. So, for example, if you copy a voice for someone who’s got MND or is developing a medical condition that is going to make speaking difficult, you don’t have the ability to get them into the recording studio and record them for 20 hours. So, you use audio from all sorts of locations. We’ve done cloning of Donald Trump’s voice, for example. There, you have open-source material, like presidential addresses. There’s no point in copying Donald Trump’s voice if no one recognizes that it’s Donald Trump. You could use different techniques to make the quality better and get rid of the noise, but if it doesn’t sound like him, the whole purpose is lost.
And this is very important: when we’re doing voice replacement for people who are going to lose their voice, what they’re concerned about is the sense of losing their identity. We were one of the first companies ever to have a web-based, clone-your-own-voice system. You could go to a website, record your voice, and we could clone it. This voice wasn’t as good as the voices we sell commercially. We asked people who did this what they’d rather have for a friend: one of our clear commercial voices, or the friend’s own cloned voice, which was technically worse? They said a commercial voice, because the quality is better. But if it was for themselves, they said they’d rather have their own voice. What’s really exciting is that over the last five or six years, our ability to copy voices with less data, and with data that hasn’t been carefully prepared for the purpose, has really improved. And now, with neural TTS, it’s possible to copy voices even faster and better, which raises a whole load of other questions, because this is someone’s identity.
When we were researching how to generate a voice, it seemed like you needed to feed the system a couple of hours of good audio to get a proper clone. Do you think that requirement could get a lot shorter? Looking into the future, do you think a single phone call of a few minutes could be enough to steal someone’s voice?

DR. AYLETT It could be the case. There’s no question that one of the issues here is how people perceive identity. And there’s no question that when I speak, some things contribute to the identity of my voice: having the same pitch range and the same vocal-tract length, which are the spectral characteristics. But other people have the same vocal-tract length and pitch that I do, the same “picture” as mine. Their voices will be similar to mine but not identical. And one of the things that’s very interesting when you work with celebrity voices, like Donald Trump’s, is that he has an almost caricatured way of speaking that he’s developed over many years to generate this almost performance voice. So, that’s an open question. If you have a hundred thousand voices already in your system, and you get a new voice and listen to it for just a few minutes, you could harness all that data and extrapolate what the person will sound like. Of course, you then have to make guesses. Will it be accurate? Well, it’s hard to say, but the ability to copy someone’s voice with, say, 10 minutes of audio is beginning to look practical. So essentially, yes, you could take quite small amounts of audio, which people have recorded for other purposes, and then potentially copy a voice. It’s a bit like the change in photography when people had digital images and realized it was really easy to manipulate those photos. It’s so easy: get rid of that spot, change the way people look, make them more fit, change the colors so they look more real. In a way, we’ve gotten to the same point with speech.
We can use the technology to edit and change people’s voices really effectively.

Voice samples usually sound quite neutral. How flexible are cloned voices in terms of adding emotions, slowing them down, or speeding them up? Are these changes possible now?

DR. AYLETT It’s still quite hard, because one of the big stumbling blocks is that the voice doesn’t know what it’s saying. Voiceovers are a great example of this: they’re reading off the screen, and they have no idea what it means. Learning how to get that sense of meaning into it is very, very hard to do. One of the things we’ve been playing with is what I call voice puppetry: using my voice to control someone else’s voice, and doing that in real time. You know, maybe with a 200-millisecond delay, which means, theoretically, I could ring you up and use my voice to control the synthetic voice. As human beings in real life, we tend not to get too emotional; we’re always constraining our emotions. Getting synthesized voices to laugh is hard, but getting them to tell a joke is possible if you copy the timing. You might get it right. I’m not very good at telling jokes, but I can get a synthesized voice to tell a joke just as badly as I tell it. And it doesn’t sound like the voice doesn’t know the content of what it’s saying, because it’s copying my speech patterns directly. The more we do that, the more we’ll learn.
So, fooling a person into thinking they’re speaking to a human when they’re actually speaking to an AI is still far away. For example, the style of speaking may seem really inappropriate, like when you tell someone a joke and they can’t react appropriately. So, fooling people into thinking the voice is somebody else’s is very hard. In a way, it’s very similar to a human being ringing you up and pretending to be a specific person: at some point, you go, “Oh, so how is John now?” and that quickly reveals the fake. So, the technology still has lots of constraints.

Maybe we had a short time in the twentieth century where, it’s a photograph, it’s a video, it’s a recording, so it must be true, right? And maybe that’s finished now. How far are we from a situation where the voices are so good at mimicking not only speech but also understanding what they’re saying that one could pass off being a family member or a friend? Do you think we could get there in the foreseeable future?

DR. AYLETT We already have, to a certain extent. This is a bit like all scams: it’s the context that’s important. At one point, we presented this technology on a gadget show in the UK. We copied Jason Bradbury’s voice. He’s a very enthusiastic guy, and he was thinking about this amazing tech, and he was going to call his wife to see if he could speak to her using his cloned voice. He’s quite a fast typist, so he was able to type with only a small delay. I got the impression his wife was used to him doing something else, so she accepted the delay. Then he did this with his brother, and he kept up a good two-minute conversation before his brother worked out what was going on. So, could you keep it up for a long time? No, I don’t think so. Not at the moment. This is a bit like the uncanny valley, where people detect the weirdness. Having the control and performance skills to do that is one of those old artificial-intelligence problems. It’s still hard to copy voices.
There’s open-source software, but it’s pretty hard to use. I mean, we’re experts in the field, and we find it hard, and we’ve been doing it for a decade. But that doesn’t mean it’s impossible.

A lot of projects are working on software that reveals deepfakes. Have you heard whether similar initiatives are happening in the field of identifying cloned voices?

DR. AYLETT The problem is that pictures are manipulated by hand in some sense, stitched together in a Frankenstein way. When cloning voices, you’re taking the actual person’s voice and using it to produce content. So, it’s hard to tell, because the parameterizations, all the ways you represent speech, are the same ways you model it in order to copy it. I think that’s a problem with using voices as an identity check. It’s like any biometric check: there’s nothing wrong with doing it in itself, but it would be crazy to rely on it alone, since it’s not hard to replicate.

So that’s probably one of the first biometric locks that could be bypassed with technology?
DR. AYLETT Bypassed because the parameterizations used to model a person’s voice are the same parameterizations you use to copy the voice. But this is getting back to trust. For example, with the Apple phone, you can just look at it and get it to open up, right? It’s a check. You might be able to fake that, but that’s not the same as opening it up and then being allowed to change your bank account details.

With more home assistants, we’re getting used to using our voices to interact with technology. But barriers still exist when it comes to talking to technology. It’s like talking to an assistant that’s a little bit drunk; they don’t really get it.

DR. AYLETT These things will change, though, and some use cases are very compelling. For example, when your hands or eyes aren’t free, there’s a real sense it might be useful. Alexa is a great example. Supposedly, they’re making these personal assistants because they’re useful and because customers want them, right? There’s no evidence of either of those things. The reason Amazon wants Alexa on your table is that they want a point of sale in your house. Right? That’s the primary purpose. So, if Alexa makes that happen, they don’t care how you use it. From my experience talking to people who use Alexa, a lot of them use it for very specific purposes. When someone uses Alexa every morning just to turn on the radio, Amazon has, in effect, spent about a billion dollars on an overengineered light switch. They don’t care, because they’re in the house. If people ask Alexa for the new Beyoncé album, they get it from Amazon, right? It’s all about control. And there’s no question these technologies are in their infancy. In voice cloning and TTS, the last two years have been revolutionary. It’s been a lot of fun for us. Suddenly, we had to change all our technology, because we realized it could be done, we could do it, we had the computing power to do it, and the algorithms could make synthesis a lot better than it was before.
We had to change our entire paradigm. The new paradigm is great, because the technology does all the things I’ve wanted to do over the last ten years: things like voice puppetry and automatic dubbing, and conversational voice replacement for people who have no voice.

Do you have any dreams or visions of where you would like to apply this technology in five or ten years, knowing that by then the possibilities will be endless in terms of computing power?

DR. AYLETT I’m a trustee of a charity that is looking at how we might use technology to support people with physical and developmental challenges. A person at this charity lost his ability to speak, and we rebuilt his voice for him. It’s the only way he can speak. When you work directly with someone like that, you realize where you need to be able to go. So, for me, being able to help this man tell a joke, in real time, would be an amazing thing to do.
After discussing it with his colleagues, Dr. Aylett decided that it was too risky to use JFK’s voice in our context. JFK is an iconic person, and using his voice can create misunderstandings. So, we were back to square one: a great team on the visual part, but how could we continue with the project if we lacked a voice? We asked if Kevin and Chris had any ideas on how to proceed. They did. They said Tim at Stable Voices was good at synthesizing voices. We looked up his work, and he is very good. But Tim was too busy, so we asked if he knew someone who could help us. Tim recommended Crucible, a San Francisco company specializing in voice synthesis, which uses machine learning to facilitate content creation. Chris, our contact there, said he could help us. His only concern was getting enough high-quality audio. He sent some samples of his Sir David Attenborough voice, and they blew us away. Before long, we had the first draft of JFK’s voice. The first test sounded very real. But a newly synthesized voice is crisp, with no noise or traces of old recording equipment. We wanted a voice that sounded like it was from the 60s, meaning we needed to make the voice sound “worse” in a sense. Feeding the AI more audio samples made the voice better and better, and after two weeks, we had a voice ready for our guys at VFX.
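Making a crisp synthetic voice sound like a 60s recording is, at its core, controlled degradation: old equipment rolls off high frequencies and adds hiss. A minimal sketch of that idea in Python, for illustration only; the real work on the JFK voice would have used proper audio tooling:

```python
import random

def vintage(samples, noise_level=0.02, smooth=3):
    """Crudely 'age' a clean voice signal: a moving-average low-pass
    filter mimics old equipment's high-frequency roll-off, and small
    random offsets add hiss. Illustrative only."""
    random.seed(0)  # reproducible hiss for this sketch
    aged = []
    for i in range(len(samples)):
        window = samples[max(0, i - smooth + 1): i + 1]
        lowpassed = sum(window) / len(window)
        aged.append(lowpassed + random.uniform(-noise_level, noise_level))
    return aged

clean = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]  # toy waveform
old_timey = vintage(clean)
```

The counterintuitive part of the project was exactly this: after weeks of making the voice better, the last step was making it deliberately worse.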
Their algorithm had learned JFK’s face, and we were ready to match audio and image. If the lip-syncing isn’t perfect, a deepfake won’t work. Humans can spot fakes, almost like a sixth sense. Often, we can’t put a finger on what’s strange; it’s more of a feeling. Everyone who has worked with photorealistic digital images knows this feeling all too well. In the JFK case, a small glitch or mismatch would make the whole film look wrong; the illusion would be lost. But we also wanted that illusion to include a 60s film look. Once the lip-sync was done, the final step was adjusting the film to look like a 60s piece. We studied tons of TV recordings from the early 60s so we could replicate their look. Our work and wait were not in vain. After two weeks, VFX sent us a download link, and 480 MB later, we opened the final film with excitement. On the screen, we had John Fitzgerald Kennedy delivering a Trump speech. Now, in a sense, we owned him and could have him say anything we’d like him to. Oh, the power! To learn more about how VFX worked this magic, we set up an interview with Chris. He agreed to talk, but he knows how easy it is to abuse this knowledge, so he skipped some details. Chris started as a cameraman. His interest in technology and effects led him to learn traditional special effects. He kept trying and training, and before long, he got a job in Belgian TV and ventured into motion design and commercials. Technical development in the last decade has been blazing fast, and the style of TV and commercials changed rapidly. Chris kept learning tricks of the trade from the internet, combining different techniques and tools to achieve the desired result. It was during this time, looking for new tools, that he found deepfakes. Having stumbled onto this new technology, Chris wasn’t sure what to use it for. He suggested an experiment: using mouth manipulation to make Trump say something he didn’t say.
Shortly thereafter, Chris noticed deepfakes rising in prominence, and he decided to work on them. He quit his job and moved to Thailand to focus on producing deepfakes.
In December 2018, Chris decided to resurrect an old Belgian singer and have him perform again. This first attempt didn’t look that good, but it got done. That was how it all started. Even though it has been only a little more than a year, we asked how much better the technology has gotten since Chris began. His short answer was “unbelievable.” Before, the training models were 128-by-128 pixels; now they’re 520-by-520, more than four times the resolution in each dimension. And the new software is advanced. It can look at low-resolution images and, from training, make educated guesses at what’s what on a face. Then, through image banks with thousands of high-resolution images, it can recreate a high-res face that looks realistic. With this face, you can create a 3D model and modify it. Chris recently did an animated version of Albert Einstein for which he used old images. He let an AI upsample the work, then added details and color. We’re still at the beginning of this revolution. It’s hard to wrap your head around what might be possible in the next five to ten years. The movie industry has started to understand the possibilities this technique offers. A recent example was Robert De Niro in The Irishman, but that film already seems dated in terms of technology. Using AI isn’t the same as using traditional visual effects. AI will be able to do much more, much faster. It’ll be a real game-changer for the movie industry. Chris recently played around with a new AI package that could generate artificial faces. The faces generated are almost indistinguishable from real ones. If you want to try for yourself, go to whichfaceisreal.com and see if you can tell the real faces from the artificial ones. All these techniques rely on long render times: you must wait for the computer to finish the manipulated face. But how long before these high-quality manipulations and hyper-realistic faces can be generated in real time? Chris believes it’ll be a year or two.
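The jump from 128-by-128 to 520-by-520 training models is bigger than the per-side factor suggests, because pixel count grows with the square of the side. A quick check of the numbers:

```python
old_side, new_side = 128, 520

linear_gain = new_side / old_side               # detail per dimension
pixel_gain = (new_side ** 2) / (old_side ** 2)  # total pixel count

print(round(linear_gain, 2))  # 4.06 — just over four times per side
print(round(pixel_gain, 1))   # 16.5 — over sixteen times the pixels
```

So a face model today works with roughly sixteen times as much image information as one from when Chris started.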
Software will grant design control over the entire human form. New AIs will be able to change the gender or ethnicity of a face and match it to another body. The professions threatened by this advancement range from actors and models to photographers and stylists. It’s easy to believe companies will choose a digital approach over sending a team of models, stylists, makeup artists, photographers, assistants, and drivers to the Bahamas. Why wouldn’t they, when similar images can be computer-generated with high precision at a fraction of the cost? If AI can turn us into computerized puppets, it’s going to be interesting to see who the puppet masters might be. But can Chris, one of the top experts on deepfakes, spot a deepfake? He can, since deepfakes aren’t perfect. A lot of deepfakes appear on social media, where the video quality is lower, so it’s easier to hide the flaws in the fake. In high-resolution faces, Chris can usually see those flaws. It’s all in the details: comparing background and foreground and seeing whether they match, comparing the sharpness of different parts of the image, and so on. Most people creating deepfakes know the deepfake AI rather than visual effects, so they don’t know how to create credible, unified images. That lack of an eye for detail often makes it easy to reveal the fake.
All this potential for fakery leads to the topic of trust. If we can’t believe what we see, what might that mean for us? Chris thinks it’s both good and bad that we stop trusting what we see. It might be harder to manipulate us if we develop herd immunity against believing what we see online and sharing it. More problematic is what will happen in courts. Currently, evidence often comes from surveillance-camera videos, which are usually of poor quality and easy to manipulate. As we said before, Chris and Kevin of VFX asked how we’d use our film, and whether we’d mark it as an intentional fake. Their collective has agreed on rules to avoid making shady fakes. But most deepfakers we found were young, tech-savvy men who earn money from consulting on visual effects or deepfakes. Chris is frank about his concerns: someday, someone will offer enough money for a questionable deepfake, and someone will make that fake. We agree. It seems inevitable that weaponized deepfakes will cause real damage in the future. Pandora’s box is already open, and the technology and the willingness to use it are out there. The question is how we react when we see something that upsets us. We’ll have to be much more skeptical when we see Austrian politicians strike deals with Russian businessmen on grainy videos or hear Trump talking about how he “grabs women by the pussy.” We won’t be able to determine the authenticity of what we see for ourselves. Media outlets will have the resources to verify content, but will they? Will enough of them do so? Most of us treat local media, such as newspapers, like consumer products. But research from Paul Gao, Chang Lee, and Dermot Murphy suggests that newspapers are more like police departments: they guard our culture, or rather, they allow us to guard it by keeping us informed. In the short term, owners might save money by closing these guardians down. In the long run, though, we all lose.
John Oliver, on Last Week Tonight, compared a society without newspapers to a seventh-grade class without a teacher. The best-case scenario is that someone comes home with chewing gum in their hair. In the worst case, the school burns down. In any case, we need to establish and maintain systems with enough resources to safeguard and verify digital content. If we don’t… well, let’s hope we end up only with gum in our collective hair.