Sampling and Filtering of Neural Machine Translation Distillation Data

@inproceedings{Zouhar2021SamplingAF,
  title={Sampling and Filtering of Neural Machine Translation Distillation Data},
  author={Vil{\'e}m Zouhar},
  booktitle={North American Chapter of the Association for Computational Linguistics},
  year={2021}
}
  • Vilém Zouhar
  • Published in North American Chapter of the Association for Computational Linguistics, 1 April 2021
  • Computer Science
In most neural machine translation distillation or stealing scenarios, the highest-scoring hypothesis of the target model (teacher) is used to train a new model (student). If reference translations are also available, then hypotheses that score better against the references can be oversampled, and poor hypotheses can be removed or undersampled. This paper explores the landscape of sampling methods (pruning, hypothesis oversampling and undersampling, deduplication, and their combinations) with…
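A minimal sketch of the kind of reference-aware filtering described above, assuming sentence-level BLEU from sacrebleu as the quality signal; the function name, threshold, and oversampling factor are illustrative assumptions, not the paper's actual implementation:

import sacrebleu

def build_distillation_corpus(sources, nbest_hypotheses, references,
                              prune_threshold=20.0, oversample_factor=2):
    """Score each teacher hypothesis against its reference, deduplicate,
    prune low-scoring hypotheses, and oversample the best one."""
    corpus = []
    for src, hyps, ref in zip(sources, nbest_hypotheses, references):
        # deduplicate hypotheses, then score them with sentence-level BLEU
        scored = sorted(
            ((sacrebleu.sentence_bleu(h, [ref]).score, h) for h in set(hyps)),
            reverse=True,
        )
        # prune hypotheses below the threshold, but keep at least the best one
        kept = [h for bleu, h in scored if bleu >= prune_threshold] or [scored[0][1]]
        # oversample the highest-scoring remaining hypothesis
        corpus.extend((src, kept[0]) for _ in range(oversample_factor))
        corpus.extend((src, h) for h in kept[1:])
    return corpus

Varying the pruning threshold and oversampling factor covers the pruning, undersampling, and oversampling configurations mentioned in the abstract.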