Alongside cooking for myself and walking laps around the house, Japanese cartoons (or “anime” as the kids are calling it) are something I’ve learned to love during quarantine.
The problem with watching anime, though, is that short of learning Japanese, you become dependent on human translators and voice actors to port the content to your language. Sometimes you get the subtitles (“subs”) but not the voicing (“dubs”). Other times, entire seasons of shows aren’t translated at all, and you’re left on the edge of your seat with only Wikipedia summaries and 90s web forums to ferry you through the darkness.
So what are you supposed to do? The answer is obviously not to ask a computer to transcribe, translate, and voice-act entire episodes of a TV show from Japanese to English. Translation is a careful art that can’t be automated and requires the loving touch of a human hand. Besides, even if you did use machine learning to translate a video, you couldn’t use a computer to dub… I mean, who would want to listen to machine voices for an entire season? It’d be awful. Only a real sicko would want that.
So in this post, I’ll show you how to use machine learning to transcribe, translate, and voice-act videos from one language to another, i.e. “AI-Powered Video Dubs.” It might not get you Netflix-quality results, but you can use it to localize online talks and YouTube videos in a pinch. We’ll start by transcribing audio to text using Google Cloud’s Speech-to-Text API. Next, we’ll translate that text with the Translate API. Finally, we’ll “voice act” the translations using the Text-to-Speech API, which produces voices that are, according to the docs, “humanlike.”
(By the way, before you flame-blast me in the comments, I should tell you that YouTube will automatically and for free transcribe and translate your videos for you. So you can treat this project like your new hobby of baking sourdough from scratch: a really inefficient use of 30 hours.)
AI-dubbed videos: Do they usually sound good?
Before you embark on this journey, you probably want to know what you have to look forward to. What quality can we realistically expect to achieve from an ML video-dubbing pipeline?
Here’s one example dubbed automatically from English to Spanish (the subtitles are also automatically generated in English). I haven’t done any tuning or adjusting on it:
As you can see, the transcriptions are decent but not perfect, and the same goes for the translations. (Ignore the fact that the speaker sometimes speaks too fast; more on that later.) Overall, you can easily get the gist of what’s going on from this dubbed video, but it’s not exactly near human quality.
What makes this project trickier (read: more fun) than most is that there are at least three possible points of failure:
- The video can be incorrectly transcribed from audio to text by the Speech-to-Text API
- That text can be incorrectly or awkwardly translated by the Translation API
- Those translations can be mispronounced by the Text-to-Speech API
In my experience, the most successful dubbed videos were those that featured a single speaker over a clear audio stream and that were dubbed from English to another language. This is largely because the quality of transcription (Speech-to-Text) was much higher in English than in other source languages.
Dubbing from non-English languages proved substantially more challenging. Here’s one particularly unimpressive dub from Japanese to English of one of my favorite shows, Death Note:
If you want to leave translation/dubbing to humans, well, I can’t blame you. But if not, read on!
Building an AI Translating Dubber
As always, you can find all of the code for this project in the Making with Machine Learning Github repo. To run the code yourself, follow the README to configure your credentials and enable APIs. Here in this post, I’ll just walk through my findings at a high level.
First, here are the steps we’ll follow:
- Extract audio from video files
- Convert audio to text using the Speech-to-Text API
- Split transcribed text into sentences/segments for translation
- Translate text
- Generate spoken audio versions of the translated text
- Speed up the generated audio to align with the original speaker in the video
- Stitch the new audio on top of the old audio/video
I admit that when I first set out to build this dubber, I was full of hubris: all I had to do was plug a few APIs together, what could be easier? But as a programmer, all hubris must be punished, and boy, was I punished.
The challenging bits are splitting the transcript into segments and speeding up the generated audio, both of which come from having to align translations with video. But more on that in a bit.
Using the Google Cloud Speech-to-Text API
The first step in translating a video is transcribing its audio to words. To do this, I used Google Cloud’s Speech-to-Text API. This tool can recognize audio spoken in 125 languages, but as I mentioned above, the quality is highest in English. For our use case, we’ll want to enable a couple of special features, like:
- Enhanced models. These are Speech-to-Text models that have been trained on specific data types (“video,” “phone_call”) and are usually higher-quality. We’ll use the “video” model, of course.
- Profanity filters. This flag prevents the API from returning any naughty words.
- Word time offsets. This flag tells the API that we want transcribed words returned along with the times that the speaker said them. We’ll use these timestamps to help align our subtitles and dubs with the source video.
- Speech Adaptation. Typically, Speech-to-Text struggles most with uncommon words or phrases. If you know certain words or phrases are likely to appear in your video (e.g. “gradient descent,” “support vector machine”), you can pass them to the API in an array that will make them more likely to be transcribed:
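As a rough illustration, the four options above might map onto a request configuration like the one below. This is a minimal sketch, not the code from the repo: `build_recognition_config` is a hypothetical helper, and the field names follow the REST-style `RecognitionConfig` naming as I understand it, so check them against the current Speech-to-Text docs before relying on them.

```python
import json


def build_recognition_config(language_code, phrase_hints=None):
    """Build a Speech-to-Text request config as a plain dict.

    Hypothetical helper: it only assembles the options discussed above
    and does not call the API itself.
    """
    config = {
        "languageCode": language_code,
        "useEnhanced": True,            # ask for an enhanced model
        "model": "video",               # the "video" model mentioned above
        "profanityFilter": True,        # filter out naughty words
        "enableWordTimeOffsets": True,  # return per-word start/end times
    }
    if phrase_hints:
        # Speech adaptation: phrases we expect to hear in the audio.
        config["speechContexts"] = [{"phrases": list(phrase_hints)}]
    return config


if __name__ == "__main__":
    hints = ["gradient descent", "support vector machine"]
    print(json.dumps(build_recognition_config("en-US", hints), indent=2))
```

Keeping the config as a plain dict makes it easy to print and inspect before sending it with whichever client you use.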
The API returns the transcribed text along with word-level timestamps as JSON. As an example, I transcribed this video. You can see the JSON returned by the API in this gist. The output also lets us do a quick quality sanity check:
What I actually said:
“Software Developers. We’re not known for our rockin’ style, are we? Or are we? Today, I’ll show you how I used ML to make me trendier, taking inspiration from influencers.”
What the API thought I said:
“Software developers. We’re not known for our Rock and style. Are we or are we today? I’ll show you how I use ml to make new trendier taking inspiration from influencers.”
In my experience, this is about the quality you can expect when transcribing high-quality English audio. Note that the punctuation is a little off. If you’re happy with viewers getting the gist of a video, this is probably good enough, although it’s easy to manually correct the transcripts yourself if you speak the source language.
At this point, we can use the API output to generate (non-translated) subtitles. In fact, if you run my script with the `--srt` flag, it will do exactly that for you (srt is a file type for closed captions):
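To give a feel for the format, here is a minimal sketch of turning timestamped segments into SRT text. It is not the script from the repo: the segment shape (`text`, `start`, `end`, with times in seconds) and both function names are assumptions made for illustration.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render a list of {'text', 'start', 'end'} dicts as SRT text.

    Each caption is a counter line, a time-range line, and the text,
    with a blank line between captions.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([
        {"text": "Software developers.", "start": 0.0, "end": 1.5},
        {"text": "Or are we?", "start": 2.5, "end": 3.2},
    ]))
```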
Now that we have the video transcripts, we can use the Translate API to… uh… translate them.
This is where things start to get a little 🤪.
Our objective is this: we want to be able to translate words in the original video and then play them back at roughly the same point in time, so that my “dubbed” voice is speaking in alignment with my actual voice.
The problem, though, is that translations aren’t word-for-word. A sentence translated from English to Japanese may have its word order jumbled. It may contain fewer words, more words, different words, or (as is the case with idioms) completely different wording.
One way we can get around this is by translating entire sentences and then trying to align the time boundaries of those sentences. But even this becomes complicated, because how do you denote a single sentence? In English, we can split text on punctuation marks such as periods, question marks, and exclamation points.
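For English, punctuation-based splitting can be sketched in a few lines. This is a deliberately naive illustration (`split_sentences` is a hypothetical helper, and it will happily break on abbreviations like “Dr.”), which is part of why the approach doesn’t generalize.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace.

    Naive on purpose: it knows nothing about abbreviations, quotes,
    or languages that use other sentence delimiters.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    print(split_sentences("Hi! My name is Dale. What's up?"))
```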
But punctuation differs by language (there’s no ¿ in English), and some languages don’t separate sentences by punctuation marks at all.
Plus, in real-life speech, we often don’t talk in complete sentences. Y’know?
Another wrinkle that makes translating transcripts difficult is that, in general, the more context you feed into a translation model, the higher-quality translation you can expect. So for example, if I translate the following sentence into French:
“I’m feeling blue, but I like pink too.”
I’ll get the translation:
“Je me sens bleu, mais j’aime aussi le rose.”
This is accurate. But if I split that sentence in two (“I’m feeling blue” and “But I like pink too”) and translate each part separately, I get:
“Je me sens triste, mais j’aime aussi le rose”, i.e. “I’m feeling sad, but I like pink too.”
This is all to say that the more we chop up text before sending it to the Translate API, the worse the translations will be (though it’ll be easier to temporally align them with the video).
Ultimately, the strategy I chose was to split up spoken words every time the speaker took a greater-than-one-second pause in speaking. Here’s an example of what that looked like:
This naturally led to some awkward translations (e.g. “or are we” is a weird fragment to translate), but I found it worked well enough. Here’s what that logic looks like in code.
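As a rough illustration of the pause-based splitting described above (a sketch under assumptions, not the code from the repo): the input shape, a list of dicts with `word`, `start`, and `end` in seconds, and the name `split_on_pauses` are both hypothetical.

```python
def split_on_pauses(words, max_pause=1.0):
    """Group timestamped words into segments.

    A new segment starts whenever the gap between the end of one word
    and the start of the next is longer than max_pause seconds.
    """
    groups = []
    current = []
    for word in words:
        if current and word["start"] - current[-1]["end"] > max_pause:
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    # Collapse each group into one segment with its own time boundaries.
    return [
        {
            "text": " ".join(w["word"] for w in group),
            "start": group[0]["start"],
            "end": group[-1]["end"],
        }
        for group in groups
    ]


if __name__ == "__main__":
    words = [
        {"word": "Software", "start": 0.0, "end": 0.4},
        {"word": "developers.", "start": 0.4, "end": 1.0},
        {"word": "Or", "start": 2.5, "end": 2.7},
        {"word": "are", "start": 2.7, "end": 2.9},
        {"word": "we?", "start": 2.9, "end": 3.2},
    ]
    for segment in split_on_pauses(words):
        print(segment)
```

Each segment keeps its own start and end time, which is what later lets the translated audio be placed back at roughly the right point in the video.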
Side bar: I also noticed that the timestamps returned by the Speech-to-Text API were significantly less accurate for non-English languages, which further decreased the quality of non-English-to-English dubbing.
And one last thing. If you already know how you want certain words to be translated (e.g. my name, “Dale,” should always be translated simply to “Dale”), you can improve translation quality by taking advantage of the “glossary” feature of the Translation API Advanced. I wrote a blog post about that here.
The Media Translation API
As it happens, Google Cloud is working on a new API to handle exactly the problem of translating spoken words. It’s called the Media Translation API, and it runs translation on audio directly (i.e. no transcribed text intermediary). I wasn’t able to use that API in this project because it doesn’t yet return timestamps (the tool is currently in beta), but I think it’d be great to use in future iterations!
Now for the fun bit: picking out computer voices! If you read about my PDF-to-Audiobook converter, you know that I love me a funny-sounding computer voice. To generate audio for dubbing, I used the Google Cloud Text-to-Speech API. The TTS API can generate lots of different voices in different languages with different accents, which you can find and play with here. The “Standard” voices might sound a bit, er, tinny, if you know what I mean, but the WaveNet voices, which are generated by high-quality neural networks, sound decently human.
Here I ran into another problem I didn’t foresee: what if a computer voice speaks a lot slower than a video’s original speaker does, so that the generated audio file is too long? Then the dubs would be impossible to align to the source video. Or, what if a translation is more verbose than the original wording, leading to the same problem?
To deal with this issue, I played around with the speakingRate parameter available in the Text-to-Speech API. This allows you to speed up or slow down a computer voice:
So, if it took the computer longer to speak a sentence than it did for the video’s original speaker, I increased the speakingRate until the computer and human took up about the same amount of time.
Sound a little complicated? Here’s what the code looks like:
This solved the problem of aligning audio to video, but it did sometimes mean the computer speakers in my dubs were a little awkwardly fast. But that’s a problem for V2.
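The rate adjustment can be approximated in one step rather than by repeated trial. The sketch below is an illustration only, not the repo’s code: `speaking_rate_for` is a hypothetical helper, it assumes spoken duration scales inversely with speakingRate, and the 4.0 cap reflects the upper bound of the documented speakingRate range as I understand it.

```python
def speaking_rate_for(generated_seconds, target_seconds, max_rate=4.0):
    """Estimate a speakingRate that squeezes a clip into target_seconds.

    generated_seconds: length of the synthesized clip at rate 1.0.
    target_seconds: how long the original speaker took.
    Assumes duration is roughly generated_seconds / rate. The voice is
    only ever sped up, never slowed below the default rate of 1.0.
    """
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    rate = generated_seconds / target_seconds
    return min(max(rate, 1.0), max_rate)


if __name__ == "__main__":
    # A 6-second synthetic clip that must fit into a 4-second slot.
    print(speaking_rate_for(6.0, 4.0))
```

In practice the estimate will be slightly off, so you would probably re-synthesize at the suggested rate, measure the result, and nudge the rate again if the clip still overruns.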
Was it worth it?
You know the expression, “Play stupid games, win stupid prizes”? It feels like every ML project I build here is something of a labor of love, but this time, I love my stupid prize: the ability to generate an unlimited number of weird, robotic, awkward anime dubs that are sometimes kinda decent.
Check out my results here: