Data Selection With Fewer Words

Amittai Axelrod, Philip Resnik, Xiaodong He, Mari Ostendorf
We present a method that improves data selection by combining a hybrid word/part-of-speech representation for corpora with the idea of distinguishing between rare and frequent events. We validate our approach using data selection for machine translation, and show that it maintains or improves BLEU and TER translation scores while substantially improving vocabulary coverage and reducing data selection model size. Paradoxically, the coverage improvement is achieved by abstracting away over 97…
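The abstract does not spell out how the hybrid representation is built. As a minimal sketch of one plausible reading, the Python below keeps frequent words as they are and replaces rare words with their part-of-speech tag. The function name `hybridize`, the `min_count` threshold, and the toy pre-tagged corpus are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def hybridize(tagged_sentences, min_count=2):
    """Keep words seen at least min_count times; replace the rest with their POS tag.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    min_count is a hypothetical threshold, not a value from the paper.
    """
    # Count every word occurrence across the whole corpus.
    counts = Counter(word for sent in tagged_sentences for word, _ in sent)
    return [
        [word if counts[word] >= min_count else tag for word, tag in sent]
        for sent in tagged_sentences
    ]


# Toy corpus with tags supplied by hand (assumed to come from any POS tagger).
corpus = [
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
hybrid = hybridize(corpus, min_count=2)

# Vocabulary size before and after abstraction.
vocab_before = {word for sent in corpus for word, _ in sent}
vocab_after = {token for sent in hybrid for token in sent}
```

In this toy case the rare words "cat" and "dog" both collapse to the tag `NN`, so the two sentences become identical and the vocabulary shrinks from four types to three. A language model trained on such text has fewer distinct events to store, which is in line with the reduced model size the abstract reports.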

Class-based N-gram language difference models for data selection

We present a simple method for representing text that explicitly encodes differences between two corpora in a domain adaptation or data selection scenario. We do this by replacing every word in the

Exploiting Relative Frequencies for Data Selection

A novel method to mine unknown words in out-of-domain datasets is presented, resulting in the best models across the board when used to weight sentences whose similarity to the primary domain is determined by relative frequency ratios.

Bilingual Methods for Adaptive Training Data Selection for Machine Translation

This paper proposes a new data selection method which uses semi-supervised convolutional neural networks based on bitokens (Bi-SSCNNs) for training machine translation systems from a large bilingual corpus, and finds that neural machine translation is more sensitive to noisy data than statistical machine translation (SMT).

The UMD machine translation systems at IWSLT 2015

The University of Maryland machine translation systems submitted to the IWSLT 2015 French-English and Vietnamese-English tasks are described; novel data selection techniques are applied to select relevant information from the large French-English training corpora, and neural language models are tested.

What’s in a domain?: Towards fine-grained adaptation for machine translation

By studying what's in a domain and showing how to use different aspects of language to improve MT, this thesis takes a step forward towards fine-grained adaptation for machine translation.

Semi-supervised Convolutional Networks for Translation Adaptation with Tiny Amount of In-domain Data

A method that uses semi-supervised convolutional neural networks (CNNs) to select in-domain training data for statistical machine translation; it can improve performance by up to 3.1 BLEU, which is significantly better than three state-of-the-art language-model-based data selection methods.

Lithuanian Speech Corpus Liepa for Development of Human-Computer Interfaces Working in Voice Recognition and Synthesis Mode

The speech corpus Liepa, which consists of two parts, was developed and opens possibilities for cost-effective and flexible development of human-computer interfaces working in voice recognition and synthesis mode.

Alibaba Submission to the WMT20 Parallel Corpus Filtering Task

The final result shows that the Alibaba Machine Translation Group submissions to the WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment significantly outperform the LASER-based system.

Dynamic Data Selection for Neural Machine Translation

This paper introduces ‘dynamic data selection’ for NMT, a method in which the selected subset of training data is varied between different training epochs, and shows that the best results are achieved when applying a technique called ‘gradual fine-tuning’.



Data selection for statistical machine translation

  • Peng Liu, Yu Zhou, Chengqing Zong
  • Computer Science
    Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010)
  • 2010
Methods to estimate the sentence weight and select more informative sentences from the training corpus and the development corpus based on the sentence weight are proposed.

Hybrid Selection of Language Model Training Data Using Linguistic Information and Perplexity

Three novel models that make use of linguistic information are introduced and evaluated on three different corpora and two languages; a linguistically motivated method outperforms the purely statistical state-of-the-art approach in four out of the six scenarios.

Domain Adaptation for Machine Translation by Mining Unseen Words

It is shown that unseen words account for a large part of the translation error when moving to new domains, and several approaches to integrating such translations into a phrase-based translation system are shown, yielding consistent improvements in translation quality.

Intelligent Selection of Language Model Training Data

We address the problem of selecting non-domain-specific language model training data to build auxiliary language models for use in tasks such as machine translation. Our approach is based on

Does more data always yield better translations?

Two training data selection techniques are analyzed: one based on approximating the probability of an in-domain corpus, and another based on infrequent n-gram occurrence, which reports significant improvements over random sentence selection and an improvement over a system trained with the whole available data.

Combining Bilingual and Comparable Corpora for Low Resource Machine Translation

This work improves coverage by using bilingual lexicon induction techniques to learn new translations from comparable corpora and supplements the model’s feature space with translation scores estimated over comparable corpora in order to improve accuracy.

Data selection for compact adapted SMT models

This work describes an extensive exploration of data selection techniques over Arabic to French datasets, and proposes methods to address both similarity and coverage considerations while maintaining a limited model size.

Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network

A new part-of-speech tagger is presented that demonstrates the following ideas: explicit use of both preceding and following tag contexts via a dependency network representation, broad use of lexical features, and effective use of priors in conditional loglinear models.

Cutting the Long Tail: Hybrid Language Models for Translation Style Adaptation

The proposed models are shown to exploit in-domain data better than conventional word-based LMs for the target language modeling component of a phrase-based statistical machine translation system.

Domain Adaptation via Pseudo In-Domain Data Selection

The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.