Masked Part-Of-Speech Model: Does Modeling Long Context Help Unsupervised POS-tagging?

  title={Masked Part-Of-Speech Model: Does Modeling Long Context Help Unsupervised POS-tagging?},
  author={Xiaoping Zhou and Shiyue Zhang and Mohit Bansal},
  booktitle={North American Chapter of the Association for Computational Linguistics},
Previous Part-Of-Speech (POS) induction models usually assume certain independence assumptions (e.g., Markov, unidirectional, local dependency) that do not hold in real languages. For example, the subject-verb agreement can be both long-term and bidirectional. To facilitate flexible dependency modeling, we propose a Masked Part-of-Speech Model (MPoSM), inspired by the recent success of Masked Language Models (MLM). MPoSM can model arbitrary tag dependency and perform POS induction through the… 



Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

A novel approach for inducing unsupervised part-of-speech taggers for languages that have no labeled training data, but have translated text in a resource-rich language, using graph-based label propagation for cross-lingual knowledge transfer.

Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models

This work tackles unsupervised part-of-speech (POS) tagging by learning hidden Markov models (HMMs) that are particularly well-suited for the problem and extends the non-negative matrix factorization framework of Arora et al. (2013) to design a consistent estimator for anchor HMMs.

Deep Clustering of Text Representations for Supervision-Free Probing of Syntax

This work explores deep clustering of multilingual text representations for unsupervised model interpretation and induction of syntax and finds that Multilingual BERT (mBERT) contains surprising amount of syntactic knowledge of English; possibly even as much as English Bert (E-BERT).

Unsupervised Learning of Syntactic Structure with Invertible Neural Projections

A novel generative model is proposed that jointly learns discrete syntactic structure and continuous word representations in an unsupervised fashion by cascading an invertible neural network with a structured generative prior so long as the prior is well-behaved.

Unsupervised Multilingual Learning for POS Tagging

A hierarchical Bayesian model is formulated for jointly predicting bilingual streams of part-of-speech tags that learns language-specific features while capturing cross-lingual patterns in tag distribution for aligned words.

Mutual Information Maximization for Simple and Accurate Part-Of-Speech Induction

This work addresses part-of-speech (POS) induction by maximizing the mutual information between the induced label and its context and shows that the variational lower bound is robust whereas the generalized Brown objective is vulnerable.

On the Role of Supervision in Unsupervised Constituency Parsing

We analyze several recent unsupervised constituency parsing models, which are tuned with respect to the parsing $F_1$ score on the Wall Street Journal (WSJ) development set (1,700 sentences). We

StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling

This work proposes a new parsing framework that can jointly generate a constituency tree and dependency graph and integrates the induced dependency relations into the transformer, in a differentiable manner, through a novel dependency-constrained self-attention mechanism.

Why Doesn’t EM Find Good HMM POS-Taggers?

This paper investigates why the HMMs estimated by Expectation-Maximization produce such poor results as Part-of-Speech (POS) taggers and investigates Gibbs Sampling and Variational Bayes estimators and shows that VB converges faster than GS for this task and that Vb significantly improves 1-to-1 tagging accuracy over EM.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

The contextual representations learned by the proposed replaced token detection pre-training task substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute.