While Google has been working to eliminate the text-MT step via Translatotron, a project the search giant first unveiled back in the spring of 2019, Facebook has been on a similar quest, albeit via a different route.
As Facebook scientists pointed out, the cascade, with its text-MT step, has been the only feasible approach until recently; the lion’s share of progress has been attributable to improvements in (and hindered by the limits of) ASR and MT, two fields with deep pools of expertise worldwide.
In a September 9, 2021 blog post, however, scientists at Facebook AI noted what they called “an important limitation” of prior improvements: they are “mainly restricted to languages with very large text [datasets] suitable for training AI models.”
To be fair, Facebook scientists acknowledged that GPT-3 (as well as BERT, etc.) did indeed make “huge strides” and can be fine-tuned for a variety of complex natural language processing (NLP) use cases. But all of these prior technologies still depend on text, and Facebook believes its new model “breaks free” of that dependence.
The social media giant calls its new model the Generative Spoken Language Model (GSLM) and says it “leverages recent breakthroughs in representation learning, allowing it to work directly from only raw audio signals, without any labels or text.”
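Concretely, the accompanying paper describes turning raw speech into sequences of discrete “units”: an encoder maps audio frames to representations, and k-means clustering quantizes those representations into pseudo-text the model can then treat like tokens. The sketch below is a toy illustration of that discretization step only; the hand-rolled features and minimal k-means are stand-ins for the self-supervised encoders (such as CPC, wav2vec 2.0, or HuBERT) evaluated in the actual work.

```python
import numpy as np

def frame_audio(wave, frame_len=160, hop=80):
    """Slice a 1-D waveform into overlapping frames."""
    n = 1 + max(0, (len(wave) - frame_len) // hop)
    return np.stack([wave[i * hop : i * hop + frame_len] for i in range(n)])

def toy_features(frames):
    """Stand-in for a learned encoder: log-energy + zero-crossing rate per frame."""
    energy = np.log(np.mean(frames**2, axis=1) + 1e-8)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.stack([energy, zcr], axis=1)

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means; returns centroids and per-frame cluster ids (the 'units')."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        ids = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(ids == j):
                centroids[j] = X[ids == j].mean(axis=0)
    return centroids, ids

rng = np.random.default_rng(1)
wave = rng.standard_normal(16000)  # one second of synthetic "16 kHz audio"
units = kmeans(toy_features(frame_audio(wave)), k=8)[1]
print(units[:10])  # discrete pseudo-text: no transcript was involved
```

The point of the exercise: once speech is a sequence of integers, standard language-modeling machinery applies to it directly, with no transcription step.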
According to a September 7, 2021 paper, Facebook AI trained its models on 6,000 hours of audio drawn from one of two English audiobook datasets.
As the scientists so eloquently blogged, their model “opens the door to a new era of textless NLP applications for potentially every language spoken on Earth — even those without significant text [datasets].”
In short, Facebook said, GSLM aims to render the ASR step obsolete by working in true end-to-end fashion, from speech input to speech output.
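That end-to-end flow implies three stages: speech-to-units (the encoder and quantizer), a language model over the units, and units-to-speech (a vocoder). The middle stage is what makes the approach “textless”: the model predicts continuations of unit sequences rather than words. The following is a deliberately simple bigram sketch of such a unit language model, with hypothetical unit sequences standing in for real encoder output; it is an illustration of the idea, not Facebook’s implementation.

```python
import random
from collections import Counter, defaultdict

class UnitLM:
    """Toy bigram language model over discrete speech units (no text anywhere)."""

    def __init__(self):
        self.next_counts = defaultdict(Counter)

    def train(self, unit_seqs):
        # Count which unit follows which across all training sequences.
        for seq in unit_seqs:
            for a, b in zip(seq, seq[1:]):
                self.next_counts[a][b] += 1

    def generate(self, start, length, seed=0):
        # Sample a continuation, unit by unit, from the bigram counts.
        rng = random.Random(seed)
        out = [start]
        for _ in range(length - 1):
            counts = self.next_counts.get(out[-1])
            if not counts:
                break
            units, weights = zip(*counts.items())
            out.append(rng.choices(units, weights=weights)[0])
        return out

# Hypothetical unit sequences, as a speech-to-unit encoder might emit them.
corpus = [[3, 1, 4, 1, 5, 1, 4], [2, 7, 1, 4, 1, 5]]
lm = UnitLM()
lm.train(corpus)
continuation = lm.generate(start=1, length=6)
# Stage 3, a unit-to-speech vocoder, would then synthesize audio from this sequence.
print(continuation)
```

Real systems replace the bigram counts with a large autoregressive model, but the contract is the same: units in, units out, audio at both ends.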
A Textless NLP Future
As only big tech can, Facebook gathered a multidisciplinary team of researchers to work on GSLM: experts in psycholinguistics, signal processing, speech processing, and NLP.
They likened their approach to how preschool children learn language “solely from raw sensory inputs and audio interactions” (hence the psycholinguistics, etc.), using this as a template for their new textless NLP model.
Facebook highlighted the importance of the textless NLP approach, summarized as follows:
- It can be applied to training models for any spoken language.
- Models can incorporate nuances and intonations in speech that denote emotions (e.g., anger, irony, uncertainty) and even “vocalizations” (laughter, yawning, etc.).
- It can be used to train models on audio-first experiences (e.g., podcasts) without requiring annotation or the training of an ASR system.
The social media giant foresees a host of applications for its new model — multilingual video games, content search, summarizing archived audio — and said “textless NLP technology should make AI more inclusive and able to model a richer variety of languages than is possible today.”