Modeling Context With Linear Attention for Scalable Document-Level Translation

@article{Wu2022ModelingCW,
  title={Modeling Context With Linear Attention for Scalable Document-Level Translation},
  author={Zhaofeng Wu and Hao Peng and Nikolaos Pappas and Noah A. Smith},
  journal={ArXiv},
  year={2022},
  volume={abs/2210.08431}
}
Document-level neural machine translation allows models to leverage dependencies beyond sentence-internal context to produce more coherent and consistent translations. However, these models, predominantly based on transformers, are difficult to scale to long documents due to the quadratic time and space complexity of their self-attention layers. Recent efforts on efficient attention variants improve scalability, but it is yet unclear if and to…
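
To make the complexity contrast concrete, here is a minimal NumPy sketch of kernelized linear attention, the family of models the paper builds on; the feature map phi and all names below are illustrative assumptions rather than the paper's implementation. Replacing the softmax with a nonnegative feature map lets the key and value statistics be summed into fixed-size matrices, so the n x n attention matrix is never formed.

import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention materializes an n x n weight matrix:
    # O(n^2) time and space in the sequence length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])              # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                   # (n, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized attention: out(q) = phi(q) @ S / (phi(q) @ z), where
    # S = sum_i phi(k_i) v_i^T and z = sum_i phi(k_i) have fixed size,
    # so time and space are O(n) in the sequence length. The feature
    # map phi here is an illustrative nonnegative choice.
    Qf, Kf = phi(Q), phi(K)                              # (n, r)
    S = Kf.T @ V                                         # (r, d): summed key-value outer products
    z = Kf.sum(axis=0)                                   # (r,)
    return (Qf @ S) / (Qf @ z)[:, None]                  # (n, d)

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
print(linear_attention(Q, K, V).shape)                   # (8, 4)

Because S and z can also be updated one position at a time during autoregressive decoding, the cost per generated token stays constant no matter how much document history has been consumed.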


References

Showing 1-10 of 40 references

Improving the Transformer Translation Model with Document-Level Context

This work extends the Transformer model with a new context encoder to represent document-level context, which is then incorporated into the original encoder and decoder, and introduces a two-step training method to take full advantage of abundant sentence-level parallel corpora and limited document-level parallel corpora.

Diverse Pretrained Context Encodings Improve Document Translation

Four general conclusions are supported: using pretrained context representations markedly improves sample efficiency; adequate parallel data resources are crucial for learning to use document context; jointly conditioning on multiple context representations outperforms any single representation; and source context is more valuable for translation performance than target-side context.

Document-Level Neural Machine Translation with Hierarchical Attention Networks

Experiments show that hierarchical attention significantly improves the BLEU score over both a strong sentence-level NMT baseline and state-of-the-art context-aware methods, and that the encoder and decoder benefit from context in complementary ways.
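
As a rough sketch of the two attention levels (single-query dot-product attention and all names here are simplified assumptions, not the paper's exact formulation): the current decoder state first summarizes each previous sentence with word-level attention, then attends over those sentence summaries.

import numpy as np

def attend(q, K):
    # Single-query dot-product attention over the rows of K; returns a summary vector.
    s = K @ q / np.sqrt(len(q))
    w = np.exp(s - s.max())
    return (w / w.sum()) @ K

def hierarchical_context(q, prev_sentences):
    # Word level: summarize each previous sentence with respect to the query state q.
    summaries = np.stack([attend(q, S) for S in prev_sentences])   # (k, d)
    # Sentence level: attend over the per-sentence summaries.
    return attend(q, summaries)                                    # (d,)

rng = np.random.default_rng(0)
d = 8
prev = [rng.normal(size=(n_words, d)) for n_words in (5, 7, 6)]    # 3 previous sentences
q = rng.normal(size=d)
print(hierarchical_context(q, prev).shape)                         # (8,)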

A Survey on Document-level Neural Machine Translation

The aim of this survey article is to highlight the major works that have been undertaken in the space of document-level machine translation after the neural revolution, so researchers can recognize the current state and future directions of this field.

Evaluating Discourse Phenomena in Neural Machine Translation

This article presents hand-crafted discourse test sets designed to evaluate the recently proposed multi-encoder NMT models' ability to exploit previous source and target sentences, and explores a novel way of exploiting context from the previous sentence.

Learning to Remember Translation History with a Continuous Cache

This work proposes to augment NMT models with a very lightweight cache-like memory network that stores recent hidden representations as translation history; the probability distribution over generated words is then updated online based on the history retrieved from the memory.
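
A rough sketch of that cache mechanism follows (the matching function, the interpolation weight lam, and all names are simplified assumptions, not the authors' exact formulation): store (hidden state, word) pairs from the recent history, match the current decoder state against the cached states, and mix the resulting distribution into the model's softmax.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cache_augmented_distribution(h, cache, p_model, vocab_size, lam=0.3, theta=1.0):
    # cache: list of (past_hidden_state, word_id) pairs, the translation history.
    # lam and theta are illustrative interpolation and sharpness constants.
    if not cache:
        return p_model
    keys = np.stack([k for k, _ in cache])       # (c, d) cached decoder states
    match = softmax(theta * (keys @ h))          # (c,) weights over cache entries
    p_cache = np.zeros(vocab_size)
    for weight, (_, word_id) in zip(match, cache):
        p_cache[word_id] += weight               # aggregate slot weights by word id
    # Interpolate the cache-induced distribution with the model's distribution.
    return (1.0 - lam) * p_model + lam * p_cache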

Attention is All you Need

A simple new network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by its successful application to English constituency parsing with both large and limited training data.

Random Feature Attention

RFA, an attention mechanism with linear time and space complexity that uses random feature methods to approximate the softmax function, is proposed and shown to be competitive in both accuracy and efficiency on three long-text classification datasets.
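
A minimal sketch of the random-feature idea, under simplifying assumptions (the classic sin/cos random Fourier estimator of the Gaussian kernel, unit-norm queries and keys, and illustrative names; RFA's exact parameterization differs): with unit-norm q and k, exp(q . k) is proportional to exp(-||q - k||^2 / 2), so a random-feature estimate of the Gaussian kernel yields a linear-time approximation of softmax attention.

import numpy as np

def random_fourier_features(x, W):
    # phi(x) such that phi(q) @ phi(k) ~= exp(-||q - k||^2 / 2), with rows of W ~ N(0, I).
    proj = x @ W.T                                        # (n, m)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(W.shape[0])

def rfa(Q, K, V, W):
    # With unit-norm q and k: exp(q . k) = e * exp(-||q - k||^2 / 2); the
    # constant e cancels between numerator and denominator of the softmax.
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    Qf = random_fourier_features(Qn, W)                   # (n, 2m)
    Kf = random_fourier_features(Kn, W)                   # (n, 2m)
    S = Kf.T @ V                                          # (2m, d), fixed size
    z = Kf.sum(axis=0)                                    # (2m,)
    den = Qf @ z
    den = np.where(np.abs(den) < 1e-6, 1e-6, den)         # guard the noisy estimate
    return (Qf @ S) / den[:, None]

rng = np.random.default_rng(0)
n, d, m = 8, 4, 128
Q, K, V = rng.normal(size=(3, n, d))
W = rng.normal(size=(m, d))                               # random projections, fixed per layer
print(rfa(Q, K, V, W).shape)                              # (8, 4)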

Neural Machine Translation with Extended Context

This pilot study observes interesting cross-sentential attention patterns that improve textual coherence in translation, at least in some selected cases.

Luna: Linear Unified Nested Attention

Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, is proposed, yielding only linear time and space complexity.
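
A high-level sketch of that nesting (shapes and names are assumptions; the real model adds activations and causal variants): a fixed-length memory first packs the length-n input, then the input attends back over the packed summary, so each step costs O(n * l), which is linear in n for constant l.

import numpy as np

def attn(Q, K, V):
    # Plain softmax attention; cost is O(len(Q) * len(K)).
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def nested_attention(X, P):
    # Pack: a fixed-length memory P (l x d) attends over the input X (n x d).
    packed = attn(P, X, X)             # (l, d): compresses X into l slots
    # Unpack: each input position reads from the fixed-size summary.
    return attn(X, packed, packed)     # (n, d): total cost O(n * l)

rng = np.random.default_rng(0)
n, l, d = 16, 4, 8
X = rng.normal(size=(n, d))
P = rng.normal(size=(l, d))            # learned in the real model; random here
print(nested_attention(X, P).shape)    # (16, 8)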