Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment

@article{Chi2021ImprovingPC,
  title={Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment},
  author={Zewen Chi and Li Dong and Bo Zheng and Shaohan Huang and Xian-Ling Mao and Heyan Huang and Furu Wei},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.06381}
}
The cross-lingual language models are typically pretrained with masked language modeling on multilingual text or parallel sentences. In this paper, we introduce denoising word alignment as a new cross-lingual pre-training task. Specifically, the model first self-labels word alignments for parallel sentences. Then we randomly mask tokens in a bitext pair. Given a masked token, the model uses a pointer network to predict the aligned token in the other language. We alternately perform the above…
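The pointer-network step described in the abstract can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical rendering (module names, tensor shapes, and the loss helper are illustrative assumptions, not the authors' released code): given the encoder state of a masked token, the pointer scores every token of the other language in the bitext pair and is trained against the self-labeled alignment.

```python
# Minimal sketch of the denoising word alignment step (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerWordAligner(nn.Module):
    """Given the hidden state of a masked token, point to the aligned token
    in the other half of the bitext pair."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Separate projections for the query (masked token) and the keys
        # (candidate tokens in the other language).
        self.query_proj = nn.Linear(hidden_size, hidden_size)
        self.key_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, masked_states, candidate_states, candidate_mask):
        # masked_states:    (batch, hidden)           encoder output at masked positions
        # candidate_states: (batch, tgt_len, hidden)  encoder outputs of the other language
        # candidate_mask:   (batch, tgt_len)          1 for real tokens, 0 for padding
        q = self.query_proj(masked_states).unsqueeze(1)        # (batch, 1, hidden)
        k = self.key_proj(candidate_states)                    # (batch, tgt_len, hidden)
        scores = torch.bmm(q, k.transpose(1, 2)).squeeze(1)    # (batch, tgt_len)
        scores = scores.masked_fill(candidate_mask == 0, float("-inf"))
        return F.log_softmax(scores, dim=-1)                   # pointer distribution

def denoising_word_alignment_loss(aligner, masked_states, candidate_states,
                                  candidate_mask, self_labeled_targets):
    """Cross-entropy against the self-labeled aligned positions."""
    log_probs = aligner(masked_states, candidate_states, candidate_mask)
    return F.nll_loss(log_probs, self_labeled_targets)
```

In the alternating scheme the abstract mentions, the self-labeled targets would be re-estimated from the model's own alignments before each denoising step.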
Cross-Lingual Language Model Meta-Pretraining
  • Zewen Chi, Heyan Huang, Luyang Liu, Yu Bai, Xian-Ling Mao
  • Computer Science
  • ArXiv
  • 2021
TLDR
This paper proposes cross-lingual language model meta-pretraining, which introduces an additional meta-pretraining phase before cross-lingual pretraining, where the model learns generalization ability on a large-scale monolingual corpus and then focuses on learning cross-lingual transfer on a multilingual corpus.
Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training
  • Bo Zheng, Li Dong, +5 authors Furu Wei
  • Computer Science
  • ArXiv
  • 2021
TLDR
This work proposes k-NN-based target sampling to accelerate the expensive softmax and shows that the multilingual vocabulary learned with VOCAP benefits cross-lingual language model pre-training.
XLM-E: Cross-lingual Language Model Pre-training via ELECTRA
TLDR
This paper introduces ELECTRA-style tasks and pretrains the model, named XLM-E, on both multilingual and parallel corpora, and shows that the model outperforms the baseline models on various cross-lingual understanding tasks with much less computation cost.
mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs
TLDR
This work proposes a partially non-autoregressive objective for text-to-text pretraining, and experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
Using Optimal Transport as Alignment Objective for fine-tuning Multilingual Contextualized Embeddings
TLDR
This work proposes using Optimal Transport as an alignment objective during fine-tuning to further improve multilingual contextualized representations for downstream cross-lingual transfer, and allows different types of mappings through soft matching between source and target sentences.
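As a rough illustration of the idea in this entry, the sketch below computes an entropy-regularized optimal-transport plan between the contextual token embeddings of a source and a target sentence with Sinkhorn iterations; the function name, cost choice (cosine distance), and hyperparameters are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch: soft word alignment via entropy-regularized optimal transport.
import torch
import torch.nn.functional as F

def sinkhorn_soft_alignment(src_emb, tgt_emb, epsilon=0.1, n_iters=50):
    """Return a soft alignment (transport plan) between two token embedding
    matrices and the corresponding transport cost, usable as an alignment loss."""
    # Cost: cosine distance between every source/target token pair.
    src = F.normalize(src_emb, dim=-1)          # (m, d)
    tgt = F.normalize(tgt_emb, dim=-1)          # (n, d)
    cost = 1.0 - src @ tgt.t()                  # (m, n)

    # Uniform marginals over the tokens of each sentence.
    m, n = cost.shape
    a = torch.full((m,), 1.0 / m, dtype=cost.dtype, device=cost.device)
    b = torch.full((n,), 1.0 / n, dtype=cost.dtype, device=cost.device)

    # Sinkhorn iterations on the Gibbs kernel.
    K = torch.exp(-cost / epsilon)
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)  # (m, n) soft alignment matrix
    loss = (plan * cost).sum()                  # transport cost as a training signal
    return plan, loss
```

The soft plan allows one-to-many and many-to-one mappings, which is the flexibility the summary above refers to.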
BIT-Event at NLPCC-2021 Task 3: Subevent Identification via Adversarial Training
  • Xiao Liu, Ge Shi, +4 authors Lifang Wu
  • Computer Science
  • NLPCC
  • 2021

References

Showing 1-10 of 52 references
Alternating Language Modeling for Cross-Lingual Pre-Training
TLDR
This work code-switches sentences of different languages rather than simply concatenating them, hoping to capture the rich cross-lingual context of words and phrases, and shows that ALM can outperform the previous pre-training methods on three benchmarks.
A Supervised Word Alignment Method Based on Cross-Language Span Prediction Using Multilingual BERT
TLDR
The proposed method significantly outperformed previous supervised and unsupervised word alignment methods without using any bitexts for pretraining, and greatly improved the word alignment accuracy by adding the context of the token to the question.
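To make the span-prediction formulation above concrete, here is a hedged sketch that marks the source token inside its sentence (the context the summary refers to) and asks a multilingual QA-style model to extract the aligned target span. The marker symbol, checkpoint, and helper name are illustrative assumptions, and a real aligner would be fine-tuned on gold alignment data rather than used off the shelf.

```python
# Illustrative sketch: word alignment cast as cross-language span prediction.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-multilingual-cased")

def align_token(source_sentence: str, source_token: str, target_sentence: str) -> str:
    """Predict the target-language span aligned to `source_token` by treating the
    marked source sentence as the question and the target sentence as the context."""
    # Mark the token of interest inside its source-sentence context, since the
    # summary above notes that adding context to the question improves accuracy.
    question = source_sentence.replace(source_token, f"¶ {source_token} ¶", 1)
    inputs = tokenizer(question, target_sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    start = outputs.start_logits.argmax().item()
    end = outputs.end_logits.argmax().item()
    return tokenizer.decode(inputs["input_ids"][0, start:end + 1])
```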
Cross-Lingual Natural Language Generation via Pre-Training
TLDR
Experimental results on question generation and abstractive summarization show that the model outperforms the machine-translation-based pipeline methods for zero-shot cross-lingual generation and improves NLG performance of low-resource languages by leveraging rich-resource language data.
Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks
TLDR
It is found that fine-tuning on multiple languages together can bring further improvement to Unicoder, a universal language encoder that is insensitive to different languages.
Cross-lingual Language Model Pretraining
TLDR
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
XNLI: Evaluating Cross-lingual Sentence Representations
TLDR
This work constructs an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus to 14 languages, including low-resource languages such as Swahili and Urdu, and finds that XNLI represents a practical and challenging evaluation suite and that directly translating the test data yields the best performance among available baselines.
XLM-E: Cross-lingual Language Model Pre-training via ELECTRA
TLDR
This paper introduces ELECTRA-style tasks and pretrains the model, named XLM-E, on both multilingual and parallel corpora, and shows that the model outperforms the baseline models on various cross-lingual understanding tasks with much less computation cost.
mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs
TLDR
This work proposes a partially non-autoregressive objective for text-to-text pretraining, and experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
Multilingual Alignment of Contextual Word Representations
TLDR
After the proposed alignment procedure, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model, remarkably matching pseudo-fully-supervised translate-train models for Bulgarian and Greek.
WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages.