Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora
@inproceedings{Xu2017ZipporahAF,
title={Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora},
author={Hainan Xu and Philipp Koehn},
booktitle={EMNLP},
year={2017}
}
We introduce Zipporah, a fast and scalable data cleaning system. [...] The trained model is used to score parallel sentences in the data pool for selection. As shown in experiments, Zipporah selects a high-quality parallel corpus from a large, mixed-quality data pool. In particular, for one noisy dataset, Zipporah achieves a 2.1 BLEU score improvement while using 1/5 of the data, compared with using the entire corpus.
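The score-then-select step described in the abstract can be illustrated with a minimal sketch. This is not Zipporah's model or feature set: `length_ratio_score` is a toy stand-in scorer and `select_top_fraction` is a hypothetical helper, both introduced here only to show how a per-pair score can be used to keep a fixed fraction (for example 1/5) of a noisy pool.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]


def length_ratio_score(pair: SentencePair) -> float:
    """Toy stand-in scorer (an assumption, not Zipporah's trained model).

    Returns 1.0 when both sides have the same token count and a smaller
    value as the lengths diverge; empty sides score 0.0.
    """
    src_len = len(pair[0].split())
    tgt_len = len(pair[1].split())
    if src_len == 0 or tgt_len == 0:
        return 0.0
    return min(src_len, tgt_len) / max(src_len, tgt_len)


def select_top_fraction(
    pool: List[SentencePair],
    score: Callable[[SentencePair], float],
    fraction: float,
) -> List[SentencePair]:
    """Keep the highest-scoring `fraction` of the pool (hypothetical helper)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not pool:
        return []
    keep = max(1, int(len(pool) * fraction))
    # Sort by descending score; Python's sort is stable, so ties keep pool order.
    ranked = sorted(pool, key=score, reverse=True)
    return ranked[:keep]


pool = [
    ("hello", "bonjour tout le monde"),
    ("the cat sat", "le chat assis"),
    ("a b", "x y z"),
    ("click here", ""),
    ("one two three four", "un"),
]
selected = select_top_fraction(pool, length_ratio_score, fraction=0.2)
```

In a real pipeline the scorer would be replaced by the trained quality model, and the fraction would be tuned against downstream translation quality rather than fixed in advance.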
58 Citations
Parallel Corpus Filtering via Pre-trained Language Models
- Computer Science, ACL
- 2020
A novel approach that filters out noisy sentence pairs from web-crawled corpora using pre-trained language models, with a Generative Pre-training (GPT) language model serving as a domain filter to balance data domains; it achieves a new state-of-the-art.
An Unsupervised System for Parallel Corpus Filtering
- Computer Science, WMT
- 2018
LMU Munich’s submission for the WMT 2018 Parallel Corpus Filtering shared task, which addresses the problem of cleaning noisy parallel corpora in a fully unsupervised fashion, relying on bilingual word embeddings created without any bilingual signal.
OpusFilter: A Configurable Parallel Corpus Filtering Toolbox
- Computer Science, ACL
- 2020
This paper introduces OpusFilter, a flexible and modular toolbox for filtering parallel corpora. It implements a number of components based on heuristic filters, language identification libraries,…
Effectively Aligning and Filtering Parallel Corpora under Sparse Data Conditions
- Computer Science, ACL
- 2020
An effective unsupervised alignment method to tackle the alignment problem and a strategy to supplement state-of-the-art models with automatically extracted information using basic NLP tools to effectively handle rich morphology are proposed.
Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings
- Computer Science, ACL
- 2019
This paper proposes a new method for this task based on multilingual sentence embeddings, which relies on nearest neighbor retrieval with a hard threshold over cosine similarity, and accounts for the scale inconsistencies of this measure.
NICT’s Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task
- Computer Science, WMT
- 2018
NICT’s participation in the WMT18 shared parallel corpus filtering task is presented, and empirical results show that the NMT systems trained on sampled data achieve promising performance.
uniblock: Scoring and Filtering Corpus with Unicode Block Information
- Computer Science, EMNLP
- 2019
A simple statistical method, uniblock, addresses the problem of removing sentences consisting of illegal characters in Natural Language Processing by generating a fixed-size feature vector from the Unicode block information of the characters.
Volctrans Parallel Corpus Filtering System for WMT 2020
- Computer Science, WMT
- 2020
These submissions to the WMT20 shared task on parallel corpus filtering and alignment for low-resource conditions outperform the baseline by 3.x/2.x and 2.x for km-en and ps-en on From Scratch/Fine-Tune conditions.
Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation
- Computer Science, ACL
- 2019
This work detects continuous parallel segments in sentence pair candidates and relies on them when mining parallel sentences, and provides the first experiments showing that parallel sentences mined from real life sources improve unsupervised MT.
Training Dynamic based data filtering may not work for NLP datasets
- Computer Science, BLACKBOXNLP
- 2021
This paper studies the applicability of the Area Under the Margin (AUM) metric to identify and remove/rectify mislabelled examples in NLP datasets and shows that models rely on the distributional information instead of relying on syntactic and semantic representations.
References
SHOWING 1-10 OF 19 REFERENCES
Bilingual Data Cleaning for SMT using Graph-based Random Walk
- Computer Science, ACL
- 2013
The method leverages the mutual reinforcement between the sentence pairs and the extracted phrase pairs, based on the observation that better sentence pairs often lead to better phrase extraction and vice versa, and substantially improves the performance in large-scale Chinese-to-English translation tasks.
Parallel Corpus Refinement as an Outlier Detection Algorithm
- Computer Science, MTSUMMIT
- 2011
The experiments show that a filtered corpus results in an improved translation quality, even with some sentence pairs removed.
Class-based N-gram language difference models for data selection
- Computer Science, IWSLT
- 2015
We present a simple method for representing text that explicitly encodes differences between two corpora in a domain adaptation or data selection scenario. We do this by replacing every word in the…
Improving Statistical Machine Translation Performance by Training Data Selection and Optimization
- Computer Science, EMNLP
- 2007
This paper aims to improve SMT performance by exploiting the full potential of the existing parallel corpora, proposing offline data optimization and online model optimization.
KenLM: Faster and Smaller Language Model Queries
- Computer Science, WMT@EMNLP
- 2011
KenLM is a library that implements two data structures for efficient language model queries, reducing both time and memory costs, and is integrated into the Moses, cdec, and Joshua translation systems.
Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
- Computer Science, ACL
- 2013
It is found that neural language models are indeed viable tools for data selection: while the improvements are varied, they are fast to train on small in-domain data and can sometimes substantially outperform conventional n-grams.
XenC: An Open-Source Tool for Data Selection in Natural Language Processing
- Computer Science, Prague Bull. Math. Linguistics
- 2013
XenC, an open-source tool for data selection aimed at Natural Language Processing (NLP) in general and Statistical Machine Translation (SMT) or Automatic Speech Recognition (ASR) in particular, is described.
Bitextor, a free/open-source software to harvest translation memories from multilingual websites
- Computer Science
- 2009
Bitextor is a free/open-source application for harvesting translation memories from multilingual websites. It downloads all the HTML files in a website, preprocesses them into a coherent format and,…
Clean data for training statistical MT: the case of MT contamination
- Computer Science, AMTA
- 2014
This paper studies the effect of MT-contaminated training data on SMT quality, by performing controlled simulations under a wide range of conditions and assessing the potential of decontamination techniques.
Europarl: A Parallel Corpus for Statistical Machine Translation
- Computer Science, MTSUMMIT
- 2005
A corpus of parallel text in 11 languages is collected from the proceedings of the European Parliament, with a focus on its acquisition and application as training data for statistical machine translation (SMT).







