Monotonic Multihead Attention
This paper proposes a new attention mechanism, Monotonic Multihead Attention (MMA), which extends the monotonic attention mechanism to multihead attention and introduces two novel and interpretable approaches for latency control that are specifically designed for multiple attention heads.
Span-Based Constituency Parsing with a Structure-Label System and Provably Optimal Dynamic Oracles
A new shift-reduce system whose stack contains merely sentence spans, represented by a bare minimum of LSTM features, together with the first provably optimal dynamic oracle for constituency parsing, which runs in amortized O(1) time, compared to O(n^3) oracles for standard dependency parsing.
Simple Fusion: Return of the Language Model
This work investigates a simple alternative method of using monolingual data for NMT training that combines the scores of a pre-trained and fixed language model (LM) with the scores of a translation model (TM) while the TM is trained from scratch.
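The core recipe lends itself to a short sketch. The snippet below is a minimal illustration, not the paper's exact normalization scheme (which has several variants): a frozen LM's log-probabilities are added to the TM's scores before the final softmax, and only the TM receives gradients.

```python
import torch
import torch.nn.functional as F

def fused_log_probs(tm_logits, lm_log_probs):
    # Add the frozen LM's log-probabilities to the TM's logits and renormalize.
    # (One of several possible normalization choices; purely illustrative.)
    return F.log_softmax(tm_logits + lm_log_probs, dim=-1)

# Toy shapes: (batch, time, vocab). Gradients flow only through tm_logits;
# the language model stays fixed.
tm_logits = torch.randn(2, 5, 100, requires_grad=True)
lm_log_probs = torch.log_softmax(torch.randn(2, 5, 100), dim=-1)
targets = torch.zeros(2, 5, dtype=torch.long)  # dummy reference tokens
loss = F.nll_loss(fused_log_probs(tm_logits, lm_log_probs).view(-1, 100),
                  targets.view(-1))
loss.backward()
```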
Incremental Parsing with Minimal Features Using Bi-Directional LSTM
This work uses bi-directional LSTM sentence representations to model a parser state with only three sentence positions, which automatically identifies important aspects of the entire sentence, and achieves state-of-the-art results among greedy dependency parsers for English.
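As a rough illustration of how few features such a parser state needs, the sketch below scores transitions from the BiLSTM vectors at just three sentence positions; the specific position choice and layer sizes here are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MinimalFeatureParserState(nn.Module):
    """Sketch: summarize a parser state with the BiLSTM vectors at only three
    sentence positions (hypothetically, span boundaries and the buffer front)."""
    def __init__(self, vocab_size, emb_dim=100, hidden=200, n_actions=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.scorer = nn.Linear(3 * 2 * hidden, n_actions)

    def forward(self, word_ids, positions):
        # word_ids: (1, sent_len); positions: three indices into the sentence
        vecs, _ = self.bilstm(self.embed(word_ids))          # (1, len, 2*hidden)
        feats = torch.cat([vecs[0, p] for p in positions])   # minimal feature set
        return self.scorer(feats)                            # transition scores

model = MinimalFeatureParserState(vocab_size=1000)
scores = model(torch.randint(0, 1000, (1, 8)), positions=[0, 3, 4])
```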
Improving Zero-Shot Translation by Disentangling Positional Information
By thorough inspections of the hidden layer outputs, it is shown that the proposed approach indeed leads to more language-independent representations, and allows easy integration of new languages, which substantially expands translation coverage.
Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation
The speed disadvantage of autoregressive baselines relative to non-autoregressive methods has been overestimated in three respects: suboptimal layer allocation, insufficient speed measurement, and lack of knowledge distillation.
Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation
The findings suggest that the latency disadvantage for autoregressive translation has been overestimated due to a suboptimal choice of layer allocation, and a new speed-quality baseline for future research toward fast, accurate translation is provided.
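One way to picture the layer-allocation argument is the minimal sketch below, which uses PyTorch's generic nn.Transformer rather than the authors' actual models, with sizes chosen only for illustration: the encoder runs once per sentence, while the decoder runs once per generated token, so shifting depth from decoder to encoder cuts per-token latency at roughly constant total depth.

```python
import torch.nn as nn

# Illustrative layer allocation only; exact sizes are assumptions.
deep_enc_shallow_dec = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=12,  # deep encoder: executed once per source sentence
    num_decoder_layers=1,   # shallow decoder: executed once per output token
)
baseline = nn.Transformer(d_model=512, nhead=8,
                          num_encoder_layers=6, num_decoder_layers=6)
```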
Parallel Machine Translation with Disentangled Context Transformer
This work proposes an attention-masking-based model, called the Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts, and develops the parallel easy-first inference algorithm, which iteratively refines every token in parallel and reduces the number of required iterations.
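A hedged sketch of the parallel easy-first idea follows; the callable `model`, the initialization, and the stopping rule are all hypothetical simplifications. Every position is re-predicted in parallel each round, attending only to positions that were more confident ("easier") in the previous round.

```python
import torch

def parallel_easy_first(model, length, max_iters=5):
    # `model` is a hypothetical callable mapping (tokens, context_mask)
    # to per-position logits; it stands in for the actual transformer.
    tokens = torch.zeros(length, dtype=torch.long)   # e.g. all-mask initialization
    conf = torch.zeros(length)
    for _ in range(max_iters):
        context = conf.unsqueeze(0) > conf.unsqueeze(1)  # context[i, j]: i may see j
        probs = model(tokens, context).softmax(dim=-1)   # (length, vocab), in parallel
        conf, new_tokens = probs.max(dim=-1)
        if torch.equal(new_tokens, tokens):              # converged: stop early
            break
        tokens = new_tokens
    return tokens

# Toy stand-in model so the sketch runs end to end.
dummy = lambda toks, ctx: torch.randn(toks.size(0), 100)
print(parallel_easy_first(dummy, length=6))
```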
Non-autoregressive Machine Translation with Disentangled Context Transformer
An attention-masking-based model, called the Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts and achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average.
On the Evaluation of Machine Translation for Terminology Consistency
This work proposes metrics to measure the consistency of MT output with respect to a domain terminology, and performs studies on the COVID-19 domain across five languages, along with terminology-targeted human evaluation.
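A toy version of what a terminology-consistency check can look like is sketched below; it is an assumption-laden simplification rather than the metrics proposed in the paper: whenever a source-side term occurs, check whether the approved target-side term appears in the MT output.

```python
def term_consistency(pairs, terminology):
    """Illustrative only: fraction of source-term occurrences whose approved
    target term appears in the MT output. `pairs` is a list of
    (source, hypothesis) strings; `terminology` maps source terms to
    approved target terms."""
    hits = total = 0
    for src, hyp in pairs:
        src_l, hyp_l = src.lower(), hyp.lower()
        for s_term, t_term in terminology.items():
            count = src_l.count(s_term.lower())
            if count:
                total += count
                hits += count if t_term.lower() in hyp_l else 0
    return hits / total if total else 0.0

pairs = [("COVID-19 spreads rapidly", "Le COVID-19 se propage rapidement")]
print(term_consistency(pairs, {"COVID-19": "COVID-19"}))  # 1.0
```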