Controlling Formality in Low-Resource NMT with Domain Adaptation and Re-Ranking: SLT-CDT-UoS at IWSLT2022

@article{Vincent2022ControllingFI,
  title={Controlling Formality in Low-Resource NMT with Domain Adaptation and Re-Ranking: SLT-CDT-UoS at IWSLT2022},
  author={Sebastian T. Vincent and Lo{\"i}c Barrault and Carolina Scarton},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.05990}
}
This paper describes the SLT-CDT-UoS group’s submission to the first Special Task on Formality Control for Spoken Language Translation, part of the IWSLT 2022 Evaluation Campaign. Our efforts were split between two fronts: data engineering and altering the objective function for best hypothesis selection. We used language-independent methods to extract formal and informal sentence pairs from the provided corpora; using English as a pivot language, we propagated formality annotations to… 

Findings of the IWSLT 2022 Evaluation Campaign

TLDR
For each shared task of the 19th International Conference on Spoken Language Translation, the purpose of the task, the data that were released, the evaluation metrics that were applied, the submissions that were received and the results that were achieved are detailed.

References

Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

TLDR
This work proposes a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages using a shared wordpiece vocabulary, and introduces an artificial token at the beginning of the input sentence to specify the required target language.

Controlling the Output Length of Neural Machine Translation

TLDR
Two methods for biasing the output length with a transformer architecture are investigated: i) conditioning the output to a given target-source length-ratio class and ii) enriching the transformer positional embedding with length information.

MuST-C: a Multilingual Speech Translation Corpus

TLDR
MuST-C is created, a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into 8 languages, together with an empirical verification of its quality and SLT results computed with a state-of-the-art approach on each language direction.

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

TLDR
SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, finds that it is possible to achieve comparable accuracy to direct subword training from raw sentences.

Controlling Politeness in Neural Machine Translation via Side Constraints

TLDR
A pilot study to control honorifics in neural machine translation (NMT) via side constraints, focusing on English → German, shows that by marking up the (English) source side of the training data with a feature that encodes the use of honorifics on the (German) target side, it can control the honorifics produced at test time.

Getting Gender Right in Neural Machine Translation

TLDR
The experiments show that adding a gender feature to an NMT system significantly improves the translation quality for some language pairs.

Findings of the IWSLT 2022 Evaluation Campaign

TLDR
For each shared task of the 19th International Conference on Spoken Language Translation, the purpose of the task, the data that were released, the evaluation metrics that were applied, the submissions that were received and the results that were achieved are detailed.

Bifixer and Bicleaner: two open-source tools to clean your parallel data

TLDR
Two open-source tools designed for parallel data cleaning, Bifixer and Bicleaner, are shown to have a positive impact on machine translation training times and quality, particularly for the noisiest data.

Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations

TLDR
Topical-Chat is introduced, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles, to help further research in open-domain conversational AI.