Energy-Based Reranking: Improving Neural Machine Translation Using Energy-Based Models

@inproceedings{Naskar2021EnergyBasedRI,
  title={Energy-Based Reranking: Improving Neural Machine Translation Using Energy-Based Models},
  author={Subhajit Naskar and Pedram Rooshenas and Simeng Sun and Mohit Iyyer and Andrew McCallum},
  booktitle={ACL},
  year={2021}
}
The discrepancy between maximum likelihood estimation (MLE) and task measures such as BLEU score has been studied before for autoregressive neural machine translation (NMT) and resulted in alternative training algorithms (Ranzato et al., 2016; Norouzi et al., 2016; Shen et al., 2016; Wu et al., 2018). However, MLE training remains the de facto approach for autoregressive NMT because of its computational efficiency and stability. Despite this mismatch between the training objective and task… 
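
The title describes reranking the base NMT model's n-best outputs with an energy model rather than trusting the model score alone. Below is a minimal sketch of that inference-time loop, assuming a hypothetical beam_search(src, n) candidate generator and a hypothetical energy(src, hyp) scorer (lower is better); neither name comes from the paper. Presumably (the abstract is truncated here) the energy model is trained so that its ranking agrees better with task metrics such as BLEU; the sketch shows only the reranking step. The same loop applies when a learned quality metric replaces the energy, as in the quality-aware decoding entry below.

from typing import Callable, List

def energy_rerank(src: str,
                  beam_search: Callable[[str, int], List[str]],
                  energy: Callable[[str, str], float],
                  n_best: int = 100) -> str:
    """Return the candidate with the lowest energy instead of the highest model score."""
    candidates = beam_search(src, n_best)  # n-best list from the base NMT model
    return min(candidates, key=lambda hyp: energy(src, hyp))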

Citations

Residual Energy-Based Models for Text
TLDR
This work asks whether discriminators trained to tell human text from machine generations generalize to new generator models, and finds experimentally that the answer is affirmative when one has access to the training data for the model, and guardedly affirmative even if one does not, suggesting that autoregressive models can be improved by incorporating (globally normalized) discriminators into the generative process.
Improving Joint Training of Inference Networks and Structured Prediction Energy Networks
TLDR
This paper designs a compound objective to jointly train both cost-augmented and test-time inference networks along with the energy function, and proposes joint parameterizations for the inference networks that encourage them to capture complementary functionality during learning.
Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
TLDR
It is shown that the most likely translations under the model accumulate so little probability mass that the mode can be considered essentially arbitrary, and advocate for the use of decision rules that take into account the translation distribution holistically.
Quality-Aware Decoding for Neural Machine Translation
TLDR
An extensive comparison of various possible candidate generation and ranking methods across four datasets and two model classes shows that quality-aware decoding consistently outperforms MAP-based decoding according both to state-of-the-art automatic metrics and to human assessments.
On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting
TLDR
The theoretical connections between the two paradigms, reward maximization (RM) and distribution matching (DM), are explored; it is shown that methods such as KL-control developed for RM can also be construed as belonging to DM, and that while DM differs from RM, it can suffer from similar training difficulties, such as high gradient variance.
Searching for COMETINHO: The Little Metric That Could
TLDR
This paper explores optimization techniques, pruning, and knowledge distillation to create more compact and faster COMET versions and presents DISTIL-COMET a lightweight distilled version that is 80% smaller and 2.128x faster while attaining a performance close to the original model and above strong baselines such as BERTSCORE and PRISM.
Transcormer: Transformer for Sentence Scoring with Sliding Language Modeling
TLDR
This paper proposes Transcormer – a Transformer model with a novel sliding language modeling (SLM) for sentence scoring that can avoid the limitations of CLM and MLM and inherit their advantages, and thus achieve high effectiveness and efficiency in scoring.
Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models
TLDR
A novel exact n-best search algorithm for neural sequence models is proposed, and it is shown that intrinsic uncertainty affects model uncertainty as the model tends to overly spread out the probability mass for uncertain tasks and sentences.
SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization
TLDR
It is shown that it is possible to directly train a second-stage model that re-ranks a set of summary candidates, and the proposed mixture-of-experts re-ranker, SummaReranker, learns to select a better candidate and consistently improves the performance of the base model.
RMBR: A Regularized Minimum Bayes Risk Reranking Framework for Machine Translation
TLDR
A regularized MBR reranking framework (RMBR) is proposed, which considers semantic-based similarity and computes the expected utility for each candidate over a truncated candidate list; the proposed quality regularizer and uncertainty regularizer are incorporated into the framework.
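
The RMBR entry above describes a concrete computation: each candidate's expected utility is estimated against a truncated list of candidates, and regularizer terms are added. The sketch below only illustrates that shape, with a toy unigram-F1 utility standing in for the semantic similarity and with hypothetical per-candidate quality/uncertainty scores and weights; the actual utility, regularizers, signs, and truncation rule in RMBR may differ.

from typing import List, Sequence

def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility standing in for a BLEU/semantic-similarity metric."""
    h, r = set(hyp.split()), set(ref.split())
    if not h or not r:
        return 0.0
    overlap = len(h & r)
    p, q = overlap / len(h), overlap / len(r)
    return 2 * p * q / (p + q) if (p + q) else 0.0

def rmbr_rerank(candidates: List[str],
                quality: Sequence[float],      # hypothetical per-candidate quality scores
                uncertainty: Sequence[float],  # hypothetical per-candidate uncertainty scores
                k: int = 10,
                alpha: float = 0.1,
                beta: float = 0.1) -> str:
    support = candidates[:k]  # truncated pseudo-reference list
    def score(i: int) -> float:
        expected_utility = sum(unigram_f1(candidates[i], r) for r in support) / len(support)
        return expected_utility + alpha * quality[i] - beta * uncertainty[i]
    best = max(range(len(candidates)), key=score)
    return candidates[best]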
...
...

References

SHOWING 1-10 OF 45 REFERENCES
Residual Energy-Based Models for Text Generation
TLDR
This work investigates un-normalized energy-based models (EBMs) which operate not at the token but at the sequence level, and shows that residual EBMs yield lower perplexity compared to locally normalized baselines.
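
The residual formulation this reference investigates is usually written as the base language model reweighted by a learned energy term; a sketch of the conditional form, in my notation rather than necessarily the paper's exact symbols:

P_\theta(y \mid x) \;=\; \frac{P_{\mathrm{LM}}(y \mid x)\, \exp\bigl(-E_\theta(x, y)\bigr)}{Z_\theta(x)},
\qquad
Z_\theta(x) \;=\; \sum_{y'} P_{\mathrm{LM}}(y' \mid x)\, \exp\bigl(-E_\theta(x, y')\bigr).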
A Study of Reinforcement Learning for Neural Machine Translation
TLDR
A systematic study on how to train better NMT models using reinforcement learning, providing a comprehensive comparison of several important factors and proposing a new method to leverage RL to further boost the performance of NMT systems trained with source/target monolingual data.
Improving Neural Machine Translation Models with Monolingual Data
TLDR
This work pairs monolingual training data with an automatic back-translation, and can treat it as additional parallel training data, and obtains substantial improvements on the WMT 15 English↔German task, and for the low-resourced IWSLT 14 Turkish→English task.
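
The back-translation recipe above amounts to translating target-side monolingual sentences into the source language with a reverse model and treating the resulting pairs as extra parallel data. A minimal sketch, assuming a hypothetical reverse_model.translate(sentence) method; in practice the synthetic pairs are mixed with the genuine bitext.

from typing import Iterable, List, Tuple

def back_translate(target_monolingual: Iterable[str], reverse_model) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-language monolingual text."""
    synthetic = []
    for tgt in target_monolingual:
        src = reverse_model.translate(tgt)  # hypothetical target->source model
        synthetic.append((src, tgt))        # synthetic source paired with real target
    return synthetic

# training data = real parallel pairs + back_translate(mono_target, reverse_model)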
On the Weaknesses of Reinforcement Learning for Neural Machine Translation
TLDR
It is proved that one of the most common RL methods for MT does not optimize the expected reward, as well as show that other methods take an infeasibly long time to converge.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
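
The core operation behind "based solely on attention mechanisms" is scaled dot-product attention, which the Transformer applies in multi-head form:

\mathrm{Attention}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V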
Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
TLDR
It is shown that the most likely translations under the model accumulate so little probability mass that the mode can be considered essentially arbitrary, and advocate for the use of decision rules that take into account the translation distribution holistically.
Incorporating BERT into Neural Machine Translation
TLDR
A new algorithm named BERT-fused model is proposed, in which BERT is first used to extract representations for an input sequence, and then the representations are fused with each layer of the encoder and decoder of the NMT model through attention mechanisms.
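
The fusion described above attends both to the layer's own states and to the BERT representations of the input. A rough PyTorch sketch of one such encoder layer; the dimensions, the averaging of the two attention streams, and the layer-norm placement are my assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class FusedEncoderLayer(nn.Module):
    """Sketch of an encoder layer attending to its own states and to frozen BERT features."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_bert: int = 768):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.bert_proj = nn.Linear(d_bert, d_model)  # project BERT features to model width
        self.bert_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, bert_repr: torch.Tensor) -> torch.Tensor:
        b = self.bert_proj(bert_repr)
        self_out, _ = self.self_attn(x, x, x)
        bert_out, _ = self.bert_attn(x, b, b)
        # fuse the two attention streams (simple averaging here; the paper's scheme may differ)
        x = self.norm1(x + 0.5 * (self_out + bert_out))
        x = self.norm2(x + self.ffn(x))
        return x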
An Actor-Critic Algorithm for Sequence Prediction
TLDR
An approach is presented for training neural networks to generate sequences using actor-critic methods from reinforcement learning (RL), conditioning the critic network on the ground-truth output; this method is shown to lead to improved performance on both a synthetic task and German-English machine translation.
On integrating a language model into neural machine translation
On the use of BERT for Neural Machine Translation
TLDR
This work compares various ways to integrate pretrained BERT model with NMT model, investigates the impact of the monolingual data used for BERT training on the final translation quality and assesses how BERT pretrained representations affect model robustness.
...
...