Corpus ID: 232068986

Gradient-guided Loss Masking for Neural Machine Translation

@article{Wang2021GradientguidedLM,
  title={Gradient-guided Loss Masking for Neural Machine Translation},
  author={Xinyi Wang and Ankur Bapna and Melvin Johnson and Orhan Firat},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.13549}
}
To mitigate the negative effect of low quality training data on the performance of neural machine translation models, most existing strategies focus on filtering out harmful data before training starts. In this paper, we explore strategies that dynamically optimize data usage during the training process using the model’s gradients on a small set of clean data. At each training step, our algorithm calculates the gradient alignment between the training data and the clean data to mask out data… 
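A minimal sketch of the gradient-alignment idea described above, assuming a PyTorch model, a loss function with reduction='none' (one loss value per example), and a simple sign-based mask; the abstract is truncated, so the paper's exact masking rule is not reproduced here.

import torch

def masked_training_step(model, per_example_loss_fn, optimizer, train_batch, clean_batch):
    """One update that masks out training examples whose gradients misalign with clean data."""
    x_tr, y_tr = train_batch
    x_cl, y_cl = clean_batch
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the loss on the small trusted/clean batch, flattened into one vector.
    clean_loss = per_example_loss_fn(model(x_cl), y_cl).mean()
    clean_grad = torch.cat([g.reshape(-1) for g in torch.autograd.grad(clean_loss, params)])

    # Alignment of each training example's gradient with the clean gradient.
    losses = per_example_loss_fn(model(x_tr), y_tr)            # shape: (batch,)
    keep = torch.zeros_like(losses)
    for i in range(losses.shape[0]):
        g_i = torch.cat([g.reshape(-1) for g in
                         torch.autograd.grad(losses[i], params, retain_graph=True)])
        keep[i] = (g_i @ clean_grad > 0).float()               # mask examples with negative alignment

    # Update the model only on the examples whose gradients align with the clean data.
    optimizer.zero_grad()
    masked_loss = (losses * keep).sum() / keep.sum().clamp(min=1)
    masked_loss.backward()
    optimizer.step()
    return masked_loss.item()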


Data Selection Curriculum for Neural Machine Translation

This work introduces a two-stage curriculum training framework for NMT where a base NMT model is tuned on subsets of data, selected by both deterministic scoring using pre-trained methods and online scoring that considers prediction scores of the emerging NMT model.

Improving Multilingual Translation by Representation and Gradient Regularization

This work proposes a joint approach to regularize NMT models at both representation-level and gradient-level, and demonstrates that this approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.

Por Qué Não Utiliser Alla Språk? Mixed Training with Gradient Optimization in Few-Shot Cross-Lingual Transfer

This paper proposes a one-step mixed training method that trains on both source and target data with stochastic gradient surgery, a novel gradient-level optimization, and achieves state-of-the-art performance on all tasks and outperforms target-adapting by a large margin.
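The entry above only names stochastic gradient surgery; the sketch below assumes it acts in the PCGrad sense of projecting away conflicting gradient components between the source-language and target-language objectives, which is an interpretation rather than the paper's exact procedure.

import torch

def gradient_surgery(g_src: torch.Tensor, g_tgt: torch.Tensor) -> torch.Tensor:
    """Combine two flattened task gradients, removing their mutually conflicting components."""
    def project_away(g, other):
        dot = g @ other
        if dot < 0:                                   # only intervene when the two tasks conflict
            g = g - (dot / (other @ other)) * other   # drop the component that opposes `other`
        return g
    return project_away(g_src, g_tgt) + project_away(g_tgt, g_src)

# Toy example with conflicting gradients: the opposing components are removed before summing.
print(gradient_surgery(torch.tensor([1.0, 0.5]), torch.tensor([-1.0, 1.0])))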

The Trade-offs of Domain Adaptation for Neural Language Models

This work shows how adaptation techniques based on data selection, such as importance sampling, intelligent data selection, and influence functions, can be cast in a common framework that highlights their similarities as well as their subtle differences.

On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation

This work assesses the complementarity of data selection with fine-tuning, resulting in the practical recommendation that data selection from domain classifiers is often more effective than the popular contrastive data selection method.

Influence Functions for Sequence Tagging Models

The practical utility of segment influence is shown by using the method to identify systematic annotation errors in two named entity recognition corpora and measuring the effect that perturbing the labels within this segment has on a test segment level prediction.

Switchable Representation Learning Framework with Self-compatibility

This work proposes a Switchable representation learning Framework with Self-Compatibility (SFSC), which generates a series of compatible sub-models with different capacities through one training process and achieves state-of-the-art performance on the evaluated dataset.

References

SHOWING 1-10 OF 14 REFERENCES

Dynamic Data Selection for Neural Machine Translation

This paper introduces ‘dynamic data selection’ for NMT, a method in which the selected subset of training data is varied between different training epochs, and shows that the best results are achieved when applying a technique called ‘gradual fine-tuning’.
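A rough illustration of dynamic data selection with gradual fine-tuning, assuming the bitext has already been scored for in-domain relevance (lower score meaning more in-domain); the halving schedule below is illustrative, not the paper's exact recipe.

def gradual_fine_tuning_subsets(scored_corpus, epochs, start_fraction=1.0, decay=0.5):
    """Yield one training subset per epoch, each a smaller top-ranked slice of the corpus."""
    ranked = sorted(scored_corpus, key=lambda pair: pair[1])   # most in-domain sentences first
    fraction = start_fraction
    for _ in range(epochs):
        cutoff = max(1, int(len(ranked) * fraction))
        yield [sentence for sentence, _ in ranked[:cutoff]]
        fraction *= decay                                      # shrink the slice for the next epoch

# Toy usage: four epochs over five (sentence, score) pairs, with ever smaller subsets.
corpus = list(zip(["s1", "s2", "s3", "s4", "s5"], [0.1, 0.9, 0.4, 0.2, 0.7]))
for epoch, subset in enumerate(gradual_fine_tuning_subsets(corpus, epochs=4)):
    print(epoch, subset)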

Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection

This work presents methods for measuring and selecting data for domain MT, applies them to denoising NMT training, and shows that they are highly effective for training NMT on data with severe noise.

Balancing Training for Multilingual Neural Machine Translation

Experiments show the proposed method not only consistently outperforms heuristic baselines in terms of average performance, but also offers flexible control over which languages' performance is prioritized.

On the Impact of Various Types of Noise on Neural Machine Translation

It is found that neural models are generally more harmed by noise than statistical models, and for one especially egregious type of noise they learn to just copy the input sentence.

Dual Conditional Cross-Entropy Filtering of Noisy Parallel Corpora

This work introduces dual conditional cross-entropy filtering for noisy parallel data and achieves higher BLEU scores with models trained on parallel data filtered only from Paracrawl than with models trained on clean WMT data.
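A sketch of the scoring idea: given length-normalized conditional cross-entropies from a forward (source-to-target) and a backward (target-to-source) translation model, a sentence pair is rewarded when both values are low and the two directions agree; the exact weighting in the paper may differ from this illustration.

import math

def dual_xent_score(h_forward: float, h_backward: float) -> float:
    """Higher is better; the inputs are per-word conditional cross-entropies of a sentence pair."""
    disagreement = abs(h_forward - h_backward)       # the two directions should agree
    adequacy = 0.5 * (h_forward + h_backward)        # and both should find the pair likely
    return math.exp(-(disagreement + adequacy))

# A well-matched pair (low, agreeing cross-entropies) outscores a noisy one.
print(dual_xent_score(2.0, 2.1), dual_xent_score(2.0, 7.5))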

Attention is All you Need

This work proposes a new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely; it also generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
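The core computation in the Transformer is scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V; a small numpy sketch with illustrative shapes:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                    # similarity of queries to keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ V                                         # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)             # (4, 16)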

A Call for Clarity in Reporting BLEU Scores

Pointing to the success of the parsing community, it is suggested that machine translation researchers settle upon a BLEU scheme that does not allow for user-supplied reference processing, and a new tool, SACREBLEU, is provided to facilitate this.
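A short usage sketch of the sacrebleu Python package (pip install sacrebleu); corpus_bleu and the score attribute are part of its public API, and the toy hypothesis and reference strings below are placeholders.

import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat sat on the mat"]]        # one list per reference set, aligned with the hypotheses
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)                                # corpus-level BLEU as a float
print(bleu)                                      # formatted score line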

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

This work presents SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, and finds that it is possible to achieve accuracy comparable to direct subword training from raw sentences.
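A usage sketch of the sentencepiece Python package (pip install sentencepiece); the file names and vocabulary size are placeholders, and the vocabulary size must not exceed what the training corpus can support.

import sentencepiece as spm

# Train a small subword model directly from a raw text file, then round-trip a sentence.
spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="spm_demo",
                               vocab_size=8000, model_type="unigram")
sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
pieces = sp.encode("Neural machine translation is data hungry.", out_type=str)
print(pieces)
print(sp.decode(pieces))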

Intelligent Selection of Language Model Training Data

We address the problem of selecting non-domain-specific language model training data to build auxiliary language models for use in tasks such as machine translation. Our approach is based on…
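The abstract above is cut off; this paper is widely associated with cross-entropy-difference selection, and the sketch below illustrates that criterion under the assumption that it is the intended method, with hypothetical scorer callables in_domain_lm and general_lm supplied by the caller.

def cross_entropy_difference_selection(sentences, in_domain_lm, general_lm, top_k):
    """Keep the top_k sentences whose in-domain cross-entropy is lowest relative to a general LM."""
    scored = [(in_domain_lm(s) - general_lm(s), s) for s in sentences]   # lower difference = more in-domain
    scored.sort(key=lambda pair: pair[0])
    return [s for _, s in scored[:top_k]]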

Learning to Reweight Examples for Robust Deep Learning

This work proposes a novel meta-learning algorithm that learns to assign weights to training examples based on their gradient directions. The method can easily be implemented on any type of deep network, does not require any additional hyperparameter tuning, and achieves impressive performance on class-imbalance and corrupted-label problems where only a small amount of clean validation data is available.
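A simplified sketch of the reweighting idea: weight each training example by how positively its gradient aligns with the gradient on a small clean validation batch, clipping negative alignments to zero and normalizing; the paper derives such weights through an online meta-learning step that this sketch omits.

import numpy as np

def reweight_examples(per_example_grads: np.ndarray, clean_grad: np.ndarray) -> np.ndarray:
    """per_example_grads: (batch, n_params); clean_grad: (n_params,); returns weights summing to 1."""
    alignment = per_example_grads @ clean_grad       # dot product with the clean-batch gradient
    weights = np.clip(alignment, 0.0, None)          # discard examples that would hurt the clean loss
    total = weights.sum()
    return weights / total if total > 0 else np.full(len(weights), 1.0 / len(weights))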