Knowledge Distillation for Quality Estimation

Amit Gajbhiye, M. Fomicheva, Fernando Alva-Manchego, F. Blain, Abiola Obamuyide, Nikolaos Aletras, Lucia Specia
Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations, making it applicable in real-time settings, such as translating online social media conversations. Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results. However, the inference time, disk and memory requirements of such models do not allow for wide usage in the real world. Models… 
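The compression idea motivating the paper — training a small student model to mimic a large teacher — can be illustrated with a minimal response-based distillation objective for a regression task such as sentence-level QE. This is a generic sketch, not the paper's exact method: the interpolation weight `alpha` and the use of plain MSE for both terms are illustrative assumptions.

```python
import numpy as np

def distillation_loss(student_pred, teacher_pred, gold, alpha=0.5):
    """Interpolate a hard loss on gold labels with a soft loss on
    teacher predictions. Both terms are MSE since sentence-level QE
    scores are continuous; alpha balances the two (an assumption,
    not the paper's exact objective)."""
    hard = np.mean((student_pred - gold) ** 2)        # fit the gold QE scores
    soft = np.mean((student_pred - teacher_pred) ** 2)  # mimic the teacher
    return alpha * hard + (1 - alpha) * soft
```

With `alpha=1.0` this reduces to ordinary supervised training; lowering `alpha` shifts weight toward matching the teacher, which is useful when gold labels are scarce or noisy.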


Papago’s Submission for the WMT21 Quality Estimation Shared Task
  • Seunghyun Lim, Hantae Kim, Hyunjoong Kim
  • WMT
  • 2021
This paper describes Papago's submission to the WMT 2021 Quality Estimation Task 1: Sentence-level Direct Assessment. Our multilingual Quality Estimation system explores the combination of Pretrained…
Findings of the WMT 2021 Shared Task on Quality Estimation
We report the results of the WMT 2021 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels.


deepQuest: A Framework for Neural-based Quality Estimation
This work presents a neural framework that accommodates neural QE approaches at fine-grained (word and sentence) levels and generalizes them to the level of documents, applying QE models to the output of both statistical and neural MT systems for a series of European languages.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
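The triple loss described in this summary can be sketched numerically as below. Equal weighting of the three terms, the temperature value, and the omission of the usual T² scaling on the soft-target term are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, T=1.0):
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distilbert_style_loss(student_logits, teacher_logits,
                          student_hidden, teacher_hidden,
                          target_ids, T=2.0):
    """Sketch of a DistilBERT-style triple loss: soft-target
    distillation + masked-LM + cosine alignment of hidden states.
    Equal term weights and T are assumptions, not the paper's values."""
    # Distillation term: KL(teacher || student) over temperature-softened outputs.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    l_kl = np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1))
    # Language-modeling term: cross-entropy against the gold token ids.
    p = softmax(student_logits)
    l_mlm = -np.mean(np.log(p[np.arange(len(target_ids)), target_ids]))
    # Cosine-distance term: align student and teacher hidden states.
    cos = np.sum(student_hidden * teacher_hidden, axis=-1) / (
        np.linalg.norm(student_hidden, axis=-1)
        * np.linalg.norm(teacher_hidden, axis=-1))
    l_cos = np.mean(1.0 - cos)
    return l_kl + l_mlm + l_cos
```

When student and teacher agree exactly, the KL and cosine terms vanish and only the language-modeling loss remains, which is why the combination can be trained from the teacher's own pre-training signal.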
Patient Knowledge Distillation for BERT Model Compression
This work proposes a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student), which translates into improved results on multiple NLP tasks with a significant gain in training efficiency, without sacrificing model accuracy.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
This paper proposes to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks, and achieves comparable results with ELMo.
Unsupervised Cross-lingual Representation Learning at Scale
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
OpenKiwi: An Open Source Framework for Quality Estimation
We introduce OpenKiwi, a PyTorch-based open-source framework for translation quality estimation. OpenKiwi supports training and testing of word-level and sentence-level quality estimation systems…
Predictor-Estimator using Multilevel Task Learning with Stack Propagation for Neural Quality Estimation
In this paper, we present a two-stage neural quality estimation model that uses multilevel task learning for translation quality estimation (QE) at the sentence, word, and phrase levels. Our approach…
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
Learning Efficient Object Detection Models with Knowledge Distillation
This work proposes a new framework to learn compact and fast object detection networks with improved accuracy using knowledge distillation and hint learning and shows consistent improvement in accuracy-speed trade-offs for modern multi-class detection models.