OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval

@inproceedings{Niu2022OneAlignerZC,
  title={OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval},
  author={Tong Niu and Kazuma Hashimoto and Yingbo Zhou and Caiming Xiong},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
  year={2022}
}
Aligning parallel sentences in multilingual corpora is essential to curating data for downstream applications such as Machine Translation. In this work, we present OneAligner, an alignment model specially designed for sentence retrieval tasks. This model can be trained on only one language pair and transfers, in a cross-lingual fashion, to low-resource language pairs with negligible degradation in performance. When trained with all language pairs of a large-scale parallel multilingual corpus… 
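The excerpt above does not spell out OneAligner's architecture or training objective, so the following is only a minimal sketch of the task it targets: retrieving parallel sentences with a multilingual sentence encoder and cosine similarity. The encoder name ("sentence-transformers/LaBSE") and the helper retrieve_parallel are illustrative assumptions, not the paper's method.

import numpy as np
from sentence_transformers import SentenceTransformer  # any multilingual sentence encoder works here

def retrieve_parallel(src_sentences, tgt_sentences, model_name="sentence-transformers/LaBSE"):
    # Embed both sides with one shared multilingual encoder.
    model = SentenceTransformer(model_name)
    src = np.asarray(model.encode(src_sentences))   # (n_src, d)
    tgt = np.asarray(model.encode(tgt_sentences))   # (n_tgt, d)
    # L2-normalize so that a dot product equals cosine similarity.
    src /= np.linalg.norm(src, axis=1, keepdims=True)
    tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T                              # (n_src, n_tgt) cosine matrix
    best = sims.argmax(axis=1)                      # nearest target per source sentence
    return [(src_sentences[i], tgt_sentences[j], float(sims[i, j])) for i, j in enumerate(best)]

pairs = retrieve_parallel(["How are you?", "Good morning."],
                          ["Buenos días.", "¿Cómo estás?"])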

Citations

Frustratingly Easy Label Projection for Cross-lingual Transfer

Experimental results show that the optimized version of mark-then-translate, which the authors call EasyProject, is easily applied to many languages and works surprisingly well, outperforming the more complex word alignment-based methods.
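As a rough illustration of mark-then-translate label projection (the exact marker format and span handling used by EasyProject are not given here, so this is an assumption-laden sketch), the idea is to wrap each labeled source span in markers, translate the marked sentence with any MT system, and read the projected spans back off the markers in the output. The translate argument below is an assumed callable.

import re

def mark_then_translate(tokens, spans, translate):
    # tokens: source words; spans: (start, end, label) with end exclusive;
    # translate: assumed callable mapping a source string to its translation.
    words = list(tokens)
    # Insert markers right-to-left so earlier indices stay valid.
    for start, end, _label in sorted(spans, key=lambda s: -s[0]):
        words.insert(end, "]")
        words.insert(start, "[")
    translated = translate(" ".join(words))
    # If the MT system preserves the markers and their order, the bracketed
    # substrings are the projected spans, in the same order as the source spans.
    projected = re.findall(r"\[\s*(.*?)\s*\]", translated)
    labels = [label for _, _, label in sorted(spans, key=lambda s: s[0])]
    return list(zip(projected, labels))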

References


Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

An architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts, using a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.
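A minimal PyTorch sketch of the encoder side described above: one shared BPE embedding table and one BiLSTM for all languages, max-pooled over time into a fixed-size sentence vector. The auxiliary decoder, layer counts, and dimensions are omitted or chosen arbitrarily, so this is an illustration rather than the LASER implementation.

import torch
import torch.nn as nn

class SharedBiLSTMEncoder(nn.Module):
    # Language-agnostic sentence encoder in the spirit of the architecture above:
    # a shared BPE embedding table and one BiLSTM for all languages, with the
    # hidden states max-pooled into a fixed-size sentence vector.
    def __init__(self, bpe_vocab_size=50000, emb_dim=320, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(bpe_vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                        # (batch, seq_len) of shared BPE ids
        states, _ = self.bilstm(self.embed(token_ids))   # (batch, seq_len, 2*hidden_dim)
        sentence_emb, _ = states.max(dim=1)              # max-pool over time
        return sentence_emb                              # (batch, 2*hidden_dim)

encoder = SharedBiLSTMEncoder()
vecs = encoder(torch.randint(1, 50000, (2, 12)))  # two toy "sentences" of 12 BPE ids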

Word Alignment by Fine-tuning Embeddings on Parallel Corpora

Methods that marry pre-trained contextualized word embeddings derived from multilingually trained language models with fine-tuning on parallel text, using objectives designed to improve alignment quality, are examined, and methods to effectively extract alignments from these fine-tuned models are proposed.
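One common way to extract alignments from such contextual embeddings (stated here as an assumption about the general recipe, not this paper's exact procedure) is to take the token-level similarity matrix for a sentence pair, apply a softmax in each direction, and keep links whose probability clears a threshold in both directions:

import torch

def extract_alignments(src_vecs, tgt_vecs, threshold=1e-3):
    # src_vecs: (m, d), tgt_vecs: (n, d) contextual token embeddings for one
    # sentence pair. Returns a set of (src_idx, tgt_idx) alignment links.
    sim = src_vecs @ tgt_vecs.T                       # (m, n) dot-product similarities
    fwd = torch.softmax(sim, dim=1)                   # source -> target probabilities
    bwd = torch.softmax(sim, dim=0)                   # target -> source probabilities
    keep = (fwd > threshold) & (bwd > threshold)      # keep links both directions agree on
    return {(int(i), int(j)) for i, j in keep.nonzero()}

links = extract_alignments(torch.randn(5, 768), torch.randn(6, 768))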

Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation

The experiments show that transfer learning helps word-based translation only slightly, but when used on top of a much stronger BPE baseline, it yields larger improvements of up to 4.3 BLEU.

Cross-lingual Retrieval for Iterative Self-Supervised Training

This work found that cross-lingual alignment can be further improved by training seq2seq models on sentence pairs mined using their own encoder outputs, and developed a new approach, cross-lingual retrieval for iterative self-supervised training (CRISS), where mining and training are applied iteratively, improving cross-lingual alignment and translation ability at the same time.
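The mine-then-train loop described above can be sketched at a high level as follows; encode_fn, mine_pairs, and finetune are hypothetical callables standing in for the paper's components, so this shows only the control flow, not the actual CRISS code.

def iterative_self_supervised_training(model, monolingual_corpora, encode_fn,
                                        mine_pairs, finetune, num_rounds=3):
    # monolingual_corpora: {lang: list of sentences}
    # encode_fn(model, sentences) -> embeddings (assumed callable)
    # mine_pairs(embeddings_by_lang) -> pseudo-parallel sentence pairs (assumed callable)
    # finetune(model, pairs) -> updated model (assumed callable)
    for _ in range(num_rounds):
        # 1) embed every monolingual corpus with the *current* encoder
        embeddings = {lang: encode_fn(model, sents) for lang, sents in monolingual_corpora.items()}
        # 2) mine pairs from the model's own embedding space
        mined = mine_pairs(embeddings)
        # 3) retrain on the mined pairs; a better model yields better mining next round
        model = finetune(model, mined)
    return model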

ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora

This paper proposes ERNIE-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance.

VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation

A cross-attention module is plugged into the Transformer encoder to explicitly build the interdependence between languages; the resulting model outperforms all existing cross-lingual models and state-of-the-art Transformer variants on the WMT14 English-to-German and English-to-French translation datasets.
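A generic cross-attention block of the kind the summary refers to, in which tokens of one language attend over the hidden states of a paired sentence in another language; the dimensions and the residual/normalization layout are illustrative assumptions, not VECO's exact design.

import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    # Plug-in cross-attention: queries come from this language's hidden states,
    # keys and values from the paired sentence in the other language.
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden_states, paired_states):
        attended, _ = self.cross_attn(hidden_states, paired_states, paired_states)
        return self.norm(hidden_states + attended)    # residual connection + layer norm

block = CrossAttentionBlock()
out = block(torch.randn(2, 10, 768), torch.randn(2, 14, 768))  # toy batch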

The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT

A new benchmark for machine translation is described that provides training and test data for thousands of language pairs covering over 500 languages, along with tools for creating state-of-the-art translation models from that collection, with the aim of triggering the development of open translation tools and models with a much broader coverage of the world's languages.

Cross-lingual Language Model Pretraining

This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
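The supervised objective referred to here is XLM's translation language modeling (TLM): a parallel sentence pair is concatenated into one sequence and tokens on both sides are masked, so the model can recover a masked word from either language's context. The sketch below only builds such a masked example from toy token ids; the [MASK] id, the masking rate, and the omission of language and position embeddings are simplifications.

import random

MASK_ID = 4  # assumed id of the [MASK] token in a toy shared vocabulary

def build_tlm_example(src_ids, tgt_ids, mask_prob=0.15, seed=None):
    # Concatenate a parallel pair and randomly mask tokens on *both* sides,
    # so masked words can be predicted from either language's context.
    rng = random.Random(seed)
    tokens = list(src_ids) + list(tgt_ids)               # one joint sequence
    labels = [-100] * len(tokens)                        # -100 = position not predicted
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                              # predict the original token here
            tokens[i] = MASK_ID
    return tokens, labels

inputs, labels = build_tlm_example([11, 12, 13, 14], [21, 22, 23, 24, 25], seed=0)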

Unsupervised Cross-lingual Representation Learning at Scale

It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages.
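The mining criterion commonly used with such multilingual sentence embeddings (an assumption here, since the excerpt does not spell it out) is a margin score that divides the cosine similarity of a candidate pair by the average similarity of each sentence to its k nearest neighbors on the other side. The numpy sketch below computes that score over precomputed, L2-normalized embeddings; the acceptance threshold is illustrative.

import numpy as np

def margin_scores(src_emb, tgt_emb, k=4):
    # src_emb: (m, d), tgt_emb: (n, d), both with L2-normalized rows.
    # Returns an (m, n) matrix of ratio-margin scores.
    cos = src_emb @ tgt_emb.T                                     # (m, n) cosine similarities
    src_knn = np.sort(cos, axis=1)[:, -k:].mean(axis=1)           # mean cos of each source to its k-NN targets
    tgt_knn = np.sort(cos, axis=0)[-k:, :].mean(axis=0)           # mean cos of each target to its k-NN sources
    return cos / (0.5 * (src_knn[:, None] + tgt_knn[None, :]))

def mine_pairs(src_emb, tgt_emb, threshold=1.06, k=4):
    scores = margin_scores(src_emb, tgt_emb, k=k)
    best_tgt = scores.argmax(axis=1)                              # best candidate per source sentence
    return [(i, int(j), float(scores[i, j]))
            for i, j in enumerate(best_tgt) if scores[i, j] >= threshold]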