No Language Left Behind: Scaling Human-Centered Machine Translation

NLLB Team, Marta Ruiz Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Alison Youngblood, Bapi Akula, Loïc Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon L. Spruit, C. Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang
Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200-language barrier while ensuring safe, high-quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this…

Silo NLP’s Participation at WAT2022

This paper provides the system description of “Silo NLP’s” submission to the Workshop on Asian Translation (WAT2022). The submission tops many tasks, including English->Hindi multimodal translation (evaluation test), English->Malayalam text-only and multimodal translation (evaluation test), and English->Bengali multimodal translation (challenge test).

Improving Khmer-Vietnamese Machine Translation with Data Augmentation methods

This paper fine-tunes a pretrained multilingual model on a low-resource bilingual dataset and proposes two data-augmentation strategies for generating new training data: back-translating the dataset from the source language, and translating sentences through a pivot language.

Text Characterization Toolkit

A tool that researchers can use to study properties of the dataset and the impact of those properties on their models’ behaviour, as well as off-the-shelf scripts that can be used for specific analyses.

Hierarchical Phrase-based Sequence-to-Sequence Learning

We describe a neural transducer that maintains the flexibility of standard sequence-to-sequence (seq2seq) models while incorporating hierarchical phrases as a source of inductive bias during training.

NTREX-128 – News Test References for MT Evaluation of 128 Languages

It is recommended that the NTREX-128 data set be used to evaluate English-sourced translation models but not the reverse direction, because experimental results confirm that the direction in which a test set was translated plays an important role in the usefulness of the corresponding metrics’ scores.

Continually learning new languages

This work combines the qualities of weight factorization, transfer learning and Elastic Weight Consolidation in order to counter catastrophic forgetting and facilitate learning new languages quickly.

Artificial Interrogation for Attributing Language Models

This paper presents solutions to the Machine Learning Model Attribution challenge (MLMAC) collectively organized by MITRE, Microsoft, Schmidt-Futures, Robust-Intelligence, Lincoln-Network, and…

Towards Building Text-To-Speech Systems for the Next Billion Users

This paper evaluates the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages, and identifies monolingual models with FastPitch and HiFi-GAN V1, trained jointly on male and female speakers, as performing best.

High-Resource Methodological Bias in Low-Resource Investigations

It is shown that downsampling from a high-resource language results in datasets with different properties than the low-resource datasets, impacting model performance for both POS-tagging and machine translation.

Speech-to-Speech Translation For A Real-world Unwritten Language

This work uses English-Taiwanese Hokkien as a case study and presents an end-to-end solution, from training data collection and modeling choices to benchmark dataset release. It takes advantage of recent advances in applying self-supervised discrete representations as prediction targets in S2ST.