Probing Structured Pruning on Multilingual Pre-trained Models: Settings, Algorithms, and Efficiency

Authors: Yanyang Li, Fuli Luo, Runxin Xu, Songfang Huang, Fei Huang, Liwei Wang
Published in: Annual Meeting of the Association for Computational Linguistics
Structured pruning has been extensively studied on monolingual pre-trained language models but has yet to be fully evaluated on their multilingual counterparts. This work investigates three aspects of structured pruning on multilingual pre-trained language models: settings, algorithms, and efficiency. Experiments on nine downstream tasks show several counter-intuitive phenomena: for settings, individually pruning for each language does not induce a better result; for algorithms, the simplest…

Structured Pruning of Large Language Models

A novel, structured pruning approach based on low rank factorization and augmented Lagrangian L0 norm regularization is presented, which achieves significant inference speedups while matching or outperforming the authors' unstructured pruning baseline at various sparsity levels.

Adaptive Sparse Transformer for Multilingual Translation

This work proposes an adaptive and sparse architecture for multilingual modeling, and trains the model to learn shared and language-specific parameters to improve the positive transfer and mitigate the interference.

On the effect of dropping layers of pre-trained transformer models

RoBERTa: A Robustly Optimized BERT Pretraining Approach

BERT is found to have been significantly undertrained: it can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.

On the Prunability of Attention Heads in Multilingual BERT

This work employs pruning to quantify the robustness and interpret layer-wise importance of mBERT, finding that the importance of the encoder layers sensitively depends on the language family and the pre-training corpus size.

The Right Tool for the Job: Matching Model and Instance Complexities

This work proposes a modification to contextual representation fine-tuning which allows for an early (and fast) “exit” from neural network calculations for simple instances, and late (and accurate) exit for hard instances during inference.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Larger-Scale Transformers for Multilingual Masked Language Modeling

This study presents the results of two larger multilingual masked language models, with 3.5B and 10.7B parameters, which suggest larger capacity models for language understanding may obtain strong performance on high-resource languages while greatly improving low-resource languages.

On the Cross-lingual Transferability of Monolingual Representations

This work designs an alternative approach that transfers a monolingual model to new languages at the lexical level and shows that it is competitive with multilingual BERT on standard cross-lingual classification benchmarks and on a new Cross-lingual Question Answering Dataset (XQuAD).

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

An architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts using a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.