The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages

  • D. Wang, Jason Eisner
  • Published 30 September 2016
  • Computer Science
  • Transactions of the Association for Computational Linguistics
We release Galactic Dependencies 1.0—a large set of synthetic languages not found on Earth, but annotated in Universal Dependencies format. This resource provides training and development data for NLP methods that must adapt to unfamiliar languages. Each synthetic treebank is produced from a real treebank by stochastically permuting the dependents of nouns and/or verbs to match the word order of other real languages. We discuss the usefulness, realism, parsability, perplexity, and…
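The permutation step described in the abstract can be illustrated with a minimal sketch (not the paper's actual model; the relation labels and probabilities below are invented for illustration): each dependent of a head is stochastically placed before or after that head according to ordering preferences borrowed from another language.

```python
import random

def reorder_dependents(head, deps, p_before, rng):
    """Place each dependent before or after its head by sampling from
    per-relation ordering probabilities p_before[rel] (default 0.5)."""
    left, right = [], []
    for word, rel in deps:
        (left if rng.random() < p_before.get(rel, 0.5) else right).append(word)
    return left + [head] + right

# Toy example: an English-like clause rendered with verb-final preferences.
rng = random.Random(0)
p_before = {"nsubj": 0.95, "obj": 0.9}   # invented probabilities
sentence = reorder_dependents("ate", [("she", "nsubj"), ("apples", "obj")], p_before, rng)
```

Because the choice is sampled per dependent, repeated runs over the same source treebank yield many distinct synthetic word orders.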

Low-Resource Syntactic Transfer with Unsupervised Source Reordering

It is demonstrated that reordering the source treebanks before training on them for a target language improves parsing accuracy for target languages outside the European language family.

Surface Statistics of an Unknown Language Indicate How to Parse It

We introduce a novel framework for delexicalized dependency parsing in a new language. We show that useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences.

A little perturbation makes a difference: Treebank augmentation by perturbation improves transfer parsing

This work shows that the cross-lingual performance of parsers can be enhanced by over-generating the source-language treebank, and results in a significant improvement over the transfer parser proposed by (CITATION), which involves an “order-free” parsing algorithm.

Cross-Lingual Syntactic Transfer with Limited Resources

We describe a simple but effective method for cross-lingual syntactic transfer of dependency parsers, in the scenario where a large amount of translation data is not available. This method makes use…

Cross-Lingual Dependency Parsing by POS-Guided Word Reordering

This work proposes a novel approach to cross-lingual dependency parsing based on word reordering that achieves better or comparable results across 25 target languages, and outperforms a baseline by a significant margin on languages that differ greatly from the source language.

MulDA: A Multilingual Data Augmentation Framework for Low-Resource Cross-Lingual NER

This paper first proposes a simple but effective labeled sequence translation method that translates source-language training data into target languages while avoiding problems such as word-order change and entity-span determination.

Low-Resource Parsing with Crosslingual Contextualized Representations

The non-contextual part of the learned language models is examined to demonstrate that polyglot language models better encode crosslingual lexical correspondence than aligned monolingual language models, providing further evidence that polyglot training is an effective approach to crosslingual transfer.

Fine-Grained Prediction of Syntactic Typology: Discovering Latent Structure with Supervised Learning

We show how to predict the basic word-order facts of a novel language given only a corpus of part-of-speech (POS) sequences. We predict how often direct objects follow their verbs, how often…
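The kind of surface statistic such a predictor consumes can be sketched as follows (illustrative only; the function name, tag pair, and window size are assumptions, not the paper's actual feature set): count, within a small window, how often one POS tag appears after another.

```python
def surface_directionality(pos_corpus, a="VERB", b="NOUN", window=3):
    """Among occurrences of tag b within `window` positions of tag a,
    return the fraction appearing after a (0.5 when no pairs are seen).
    A crude proxy for directionality facts like 'objects follow verbs'."""
    before = after = 0
    for sent in pos_corpus:
        for i, tag in enumerate(sent):
            if tag != a:
                continue
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j == i or sent[j] != b:
                    continue
                if j > i:
                    after += 1
                else:
                    before += 1
    total = before + after
    return after / total if total else 0.5

ratio = surface_directionality([["NOUN", "VERB", "NOUN"], ["VERB", "NOUN"]])
```

Statistics like this are computable from raw POS sequences alone, which is what lets the typology predictor operate without any parsed data in the target language.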

Supervised Training on Synthetic Languages: A Novel Framework for Unsupervised Parsing

  • D. Wang
  • Computer Science, Linguistics
  • 2019
It is shown that, indeed, useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences.

Natural language processing for resource-poor languages

Transfer learning provides an important opportunity for low-resource NLP, whereby annotation is transferred from a resource-rich source language to a resource-poor target language, and is successfully applied in this thesis.

Unsupervised Dependency Parsing with Transferring Distribution via Parallel Guidance and Entropy Regularization

We present a novel approach for inducing unsupervised dependency parsers for languages that have no labeled training data, but have translated text in a resource-rich language. We train probabilistic…

Cross-lingual Transfer for Unsupervised Dependency Parsing Without Parallel Data

This method learns syntactic word embeddings that generalise over the syntactic contexts of a bilingual vocabulary and incorporates these into a neural network parser, showing empirical improvements over a baseline delexicalised parser on both the CoNLL and Universal Dependency Treebank datasets.

Density-Driven Cross-Lingual Transfer of Dependency Parsers

This work presents a novel method for the crosslingual transfer of dependency parsers that assumes access to parallel translations between the target and one or more source languages, and to supervised parsers in the source language(s).

Multi-Source Transfer of Delexicalized Dependency Parsers

This work demonstrates that delexicalized parsers can be directly transferred between languages, producing significantly higher accuracies than unsupervised parsers and shows that simple methods for introducing multiple source languages can significantly improve the overall quality of the resulting parsers.

Cross-lingual Dependency Parsing Based on Distributed Representations

This paper provides two algorithms for inducing cross-lingual distributed representations of words, which map vocabularies from two different languages into a common vector space and bridges the lexical feature gap by using distributed feature representations and their composition.
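One simple way to map two vocabularies into a common vector space (a hedged sketch using ordinary least squares over a seed dictionary of word pairs, not necessarily either of the paper's two algorithms; all data here is synthetic) is to learn a linear projection from source-language vectors to target-language vectors:

```python
import numpy as np

# Synthetic stand-ins for embeddings of seed-dictionary word pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))       # source-language vectors
true_map = rng.normal(size=(8, 8))
Y = X @ true_map                   # corresponding target-language vectors

# Least-squares fit of a projection W so that X @ W ≈ Y; after fitting,
# any source word vector can be projected into the target space.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
mapped = X @ W
```

Once source vectors live in the target space, lexical features computed from them transfer directly to a parser trained on the target side.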

Building a Semantic Parser Overnight

A new methodology is introduced that uses a simple grammar to generate logical forms paired with canonical utterances that are meant to cover the desired set of compositional operators, and uses crowdsourcing to paraphrase these canonical utterances into natural utterances.

Yara Parser: A Fast and Accurate Dependency Parser

The Yara Parser is introduced, a fast and accurate open-source dependency parser based on the arc-eager algorithm and beam search that achieves an unlabeled accuracy of 93.32 on the standard WSJ test set which ranks it among the top dependency parsers.
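For context, the arc-eager transition system underlying Yara can be sketched in a few lines (a toy version driven by a fixed action sequence; the real parser chooses actions with a trained classifier and beam search):

```python
def arc_eager(words, actions):
    """Minimal arc-eager transition system: SHIFT, LEFT (head = buffer
    front, dep = stack top), RIGHT (head = stack top, dep = buffer
    front), REDUCE. Returns (head_index, dep_index) arcs."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "RIGHT":
            arcs.append((stack[-1], buffer[0]))
            stack.append(buffer.pop(0))
        elif act == "LEFT":
            arcs.append((buffer[0], stack.pop()))
        elif act == "REDUCE":
            stack.pop()
    return arcs

# "She ate apples": she <- ate -> apples
arcs = arc_eager(["she", "ate", "apples"], ["SHIFT", "LEFT", "SHIFT", "RIGHT"])
```

Each transition does constant work, which is why arc-eager parsers run in linear time and remain fast even with a beam.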

Clause Restructuring for Statistical Machine Translation

The reordering approach is applied as a pre-processing step in both the training and decoding phases of a phrase-based statistical MT system, showing an improvement from a 25.2% BLEU score for a baseline system to a 26.8% BLEU score for the system with reordering.

Annealing Structural Bias in Multilingual Weighted Grammar Induction

This work shows how a structural locality bias can improve the accuracy of state-of-the-art dependency grammar induction models trained by EM from unannotated examples, and annealing the free parameter that controls this bias achieves further improvements.

Modeling Word Forms Using Latent Underlying Morphs and Phonology

It is shown how to recover consistent underlying forms for these morphemes, together with the (stochastic) phonology that maps each concatenation of underlying forms to a surface form of a concatenative language.