500-Million-Sentence Dataset Can Boost Machine Translation for Low-Resource Languages – Slator

7 hours ago

Researchers working on machine translation (MT) often rely on back translation to beef up training data. Back translation, in which more widely available monolingual target-language data is machine translated into the source language to create synthetic parallel training pairs, was credited with enabling the Transformer-based deep-learning system CUBBITT to “outperform human-level translation,” as covered by Slator in September.

Back translation was also crucial to a method for detecting machine-translated content, which may become more critical as startups ramp up commercialization of AI-powered text generation.

The usefulness of back translation depends on the widespread availability of target language data, which can present a hurdle for languages of lesser diffusion.
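For readers who want to see the mechanics, the idea can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used by Tiedemann or the CUBBITT team: the function `back_translate`, the toy lexicon, and the Finnish-to-English word lookup are all hypothetical stand-ins for a trained target-to-source MT model.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_sentences: List[str],
    target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each genuine target sentence with a machine-generated source sentence."""
    pairs = []
    for target in target_sentences:
        # The source side is synthetic; the target side is real human text.
        synthetic_source = target_to_source(target)
        pairs.append((synthetic_source, target))
    return pairs


# Hypothetical stand-in for a trained Finnish-to-English model (assumption:
# a real system would call an actual MT model here, not a word lookup).
TOY_LEXICON = {"hyvää": "good", "huomenta": "morning", "kiitos": "thanks"}


def toy_fi_to_en(sentence: str) -> str:
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(TOY_LEXICON.get(word, word) for word in sentence.lower().split())


# Monolingual target-language text (Finnish in this toy example).
monolingual_finnish = ["Hyvää huomenta", "Kiitos"]

# Synthetic (English, Finnish) pairs that could augment English-to-Finnish training data.
synthetic_pairs = back_translate(monolingual_finnish, toy_fi_to_en)
print(synthetic_pairs)
```

The sketch also shows why the approach depends on target-language text: without the monolingual sentences on the right-hand side of each pair, there is nothing to translate back.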

In an effort to allow MT researchers to work on more realistic low-resource scenarios, University of Helsinki language technology professor Jörg Tiedemann announced on March 3, 2021 that he had released over 500 million translated sentences in 188 languages.

Tiedemann’s datasets, available on GitHub, are not the first attempt to level the playing field for languages via MT. For example, since 2018, the Masakhane Project has been gathering language data and fine-tuning language models specifically for African languages underrepresented in NLP. Tiedemann’s project is, however, notable for its scale.

In a related October 2020 paper on the Tatoeba Translation Challenge, Tiedemann wrote, “The main goal is to trigger the development of open translation tools and models with a much broader coverage of the world’s languages.”

How much broader? The training and test data covers 500 languages and language variants, as well as roughly 3,000 language pairs.

According to Tiedemann, there is still work to be done. “It’s anyway not going to be the last set of back-translations I’m going to release,” he tweeted. “More to come soon also from English to other languages…”

Image: University Library in Helsinki