Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation

Mitchell A. Gordon and Kevin Duh
We explore best practices for training small, memory-efficient machine translation models with sequence-level knowledge distillation in the domain adaptation setting. While both domain adaptation and knowledge distillation are widely used, their interaction remains little understood. Our large-scale empirical results in machine translation (on three language pairs with three domains each) suggest distilling twice for best performance: once using general-domain data and again using in-domain…
CTR-BERT: Cost-effective knowledge distillation for billion-parameter teacher models
While pre-trained large language models (LLMs) like BERT have achieved state-of-the-art results in several NLP tasks, their performance on tasks with additional grounding, e.g. with numeric and categorical…
Domain Adaptation and Multi-Domain Adaptation for Neural Machine Translation: A Survey
This work focuses on more robust approaches to domain adaptation for NMT, particularly the case where a system may need to translate sentences from multiple domains, and divides techniques into those relating to data selection, model architecture, parameter adaptation procedure, and inference procedure.
Sampling and Filtering of Neural Machine Translation Distillation Data
This paper explores the sampling method landscape with English to Czech and English to German MT models using standard MT evaluation metrics and shows that careful oversampling and combination with the original data leads to better performance when compared to training only on the original or synthesized data or their direct combination.
Combining Sequence Distillation and Transfer Learning for Efficient Low-Resource Neural Machine Translation Models
This work investigates a combination of sequence distillation (SD) and transfer learning (TL) for training efficient NMT models in extremely low-resource (ELR) settings, and confirms that using both the distilled ELR and helping corpora in the second round of TL further improves translation quality.
Distilling Multiple Domains for Neural Machine Translation
This paper proposes a framework for training a single multi-domain neural machine translation model that is able to translate several domains without increasing inference time or memory usage and shows that this model can improve translation on both high- and low-resource domains over strong multi-domain baselines.


Understanding Knowledge Distillation in Non-autoregressive Machine Translation
It is found that knowledge distillation can reduce the complexity of data sets and help NAT to model the variations in the output data, and a strong correlation is observed between the capacity of an NAT model and the optimal complexity of the distilled data for the best translation quality.
OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles
A new major release of the OpenSubtitles collection of parallel corpora, which is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.
Dreaming to Distill: Data-Free Knowledge Transfer via DeepInversion
DeepInversion is introduced, a new method for synthesizing images from the image distribution used to train a deep neural network, which optimizes the input while regularizing the distribution of intermediate feature maps using information stored in the batch normalization layers of the teacher.
Go From the General to the Particular: Multi-Domain Translation with Domain Transformation Networks
The proposed unified model achieves comparable results with the fine-tuning approach that requires multiple models to preserve the particular knowledge, and analyses reveal that the domain transformation networks successfully capture the domain-specific knowledge as expected.
All the ways you can compress BERT
  • 2019
Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models
This paper distills the BERT model refined by multi-task learning on seven datasets of the GLUE benchmark into a bidirectional LSTM with an attention mechanism, and provides a general learning framework.
Data-Free adversarial distillation. Tommaso Furlanello, Zachary C Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born again neural networks
  • Explaining Sequence-Level knowledge distillation as Data-Augmentation for neural machine translation
  • 2019
Data-Free Adversarial Distillation
This work introduces a model discrepancy to quantitatively measure the difference between student and teacher models, constructs an optimizable upper bound, and proposes a novel adversarial distillation mechanism to craft a compact student model without any real-world data.
Data-Free adversarial distillation
  • December
  • 2019
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.