Adding More Languages Improves Unsupervised Multilingual Part-of-Speech Tagging: a Bayesian Non-Parametric Approach

@inproceedings{Snyder2009AddingML,
  title={Adding More Languages Improves Unsupervised Multilingual Part-of-Speech Tagging: a Bayesian Non-Parametric Approach},
  author={Benjamin Snyder and Tahira Naseem and Jacob Eisenstein and R. Barzilay},
  booktitle={NAACL},
  year={2009}
}
We investigate the problem of unsupervised part-of-speech tagging when raw parallel data is available in a large number of languages. Patterns of ambiguity vary greatly across languages and therefore even unannotated multilingual data can serve as a learning signal. We propose a non-parametric Bayesian model that connects related tagging decisions across languages through the use of multilingual latent variables. Our experiments show that performance improves steadily as the number of languages…
Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches
TLDR
This work considers two ways of applying this intuition to the problem of unsupervised part-of-speech tagging: a model that directly merges tag structures for a pair of languages into a single sequence and a second model which instead incorporates multilingual context using latent variables.
Climbing the Tower of Babel: Unsupervised Multilingual Learning
TLDR
A class of probabilistic models that use these links among human languages as a form of naturally occurring supervision is shown to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing.
Weakly Supervised Part-of-Speech Tagging for Morphologically-Rich, Resource-Scarce Languages
TLDR
This paper argues that existing unsupervised POS taggers unrealistically assume as input a perfect POS lexicon, and proposes a weakly supervised fully-Bayesian approach to POS tagging, which relaxes the unrealistic assumption by automatically acquiring the lexicon from a small amount of POS-tagged data.
Unsupervised Part of Speech Tagging Without a Lexicon
Unsupervised dependency parsers frequently assume that input sentences have already been labeled with POS tags. Likewise, most unsupervised POS taggers (including those proposed by [1] and [2])…
Unsupervised multilingual learning
TLDR
A class of probabilistic models that exploit deep links among human languages as a form of naturally occurring supervision is shown to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing.
Unsupervised Multilingual Grammar Induction
TLDR
A generative Bayesian model is formulated which seeks to explain the observed parallel data through a combination of bilingual and monolingual parameters, and loosely binds parallel trees while allowing language-specific syntactic structure.
A Universal Part-of-Speech Tagset
TLDR
This work proposes a tagset that consists of twelve universal part-of-speech categories and develops a mapping from 25 different treebank tagsets to this universal set, which, when combined with the original treebank data, produces a dataset consisting of common parts-of-speech for 22 different languages.
Multilingual part-of-speech tagging with weightless neural networks
TLDR
Experimental evaluation indicates that mWANN-Tagger either outperforms or matches state-of-the-art methods in accuracy with very low standard deviation, i.e., lower than 0.25%.
Multilingual NER Transfer for Low-resource Languages
TLDR
An evaluation on named entity recognition over 41 languages shows that the proposed techniques for modulating the transfer are much more effective than strong baselines, including standard ensembling, and that the unsupervised method rivals oracle selection of the single best individual model.
Context-dependent type-level models for unsupervised morpho-syntactic induction
TLDR
This thesis improves unsupervised methods for part-of-speech (POS) induction and morphological word segmentation by modeling previously unused linguistic phenomena, exploiting the fact that affixes are correlated within a word and between adjacent words.

References

Unsupervised Multilingual Learning for POS Tagging
TLDR
A hierarchical Bayesian model is formulated for jointly predicting bilingual streams of part-of-speech tags; the model learns language-specific features while capturing cross-lingual patterns in tag distribution for aligned words.
A fully Bayesian approach to unsupervised part-of-speech tagging
TLDR
This model has the structure of a standard trigram HMM, yet its accuracy is closer to that of a state-of-the-art discriminative model (Smith and Eisner, 2005), up to 14 percentage points better than MLE.
Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora
This paper investigates the potential for projecting linguistic annotations including part-of-speech tags and base noun phrase bracketings from one language to another via automatically word-aligned…
Inducing a Multilingual Dictionary from a Parallel Multitext in Related Languages
TLDR
This work builds a multilingual dictionary induction system for a family of related resource-poor languages that assumes only the presence of a single medium-length multitext (the Bible).
Statistical multi-source translation
TLDR
Various tests show that these methods can significantly improve translation quality; the paper also compares the quality of statistical machine translation systems for many European languages in the same domain.
Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora
TLDR
Central to this approach is triangulation, the process of translating from a source to a target language via an intermediate third language; experimental results demonstrate BLEU improvements for triangulated models over a standard phrase-based system.
A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources
TLDR
This work takes a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons, and presents an alternative evaluation metric for this system, where it is shown how much human labor will be needed to convert the result of the tagging to a high precision annotated resource.
MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
TLDR
The paper presents the third release of the MULTEXT-East language resources, which brings together the first two, makes them available in TEI P4 XML, and introduces further extensions, e.g., the specification for Resian, a dialect of Slovene.
A Systematic Comparison of Various Statistical Alignment Models
TLDR
An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.
A Comparison of Pivot Methods for Phrase-Based Statistical Machine Translation
TLDR
The phrase translation strategy significantly outperformed the sentence translation strategy and its relative performance was 0.92 to 0.97 compared to directly trained SMT systems.