A Probabilistic Generative Model of Linguistic Typology

Johannes Bjerva, Yova Kementchedjhieva, Ryan Cotterell, Isabelle Augenstein
In the principles-and-parameters framework, the structural features of languages depend on parameters that may be toggled on or off, with a single parameter often dictating the status of multiple features. [...] This finding has clear practical and theoretical implications: the results confirm what linguists have hypothesised, i.e., that there are significant correlations between typological features and languages.
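
The idea that one parameter can dictate several features may be easier to see in a toy sketch. The example below is an illustration only, not the model from the paper: the `Language` class, the `head_initial` parameter, and the two feature names are hypothetical and stand in for the textbook head-directionality example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    """Toy language whose surface features follow from one binary parameter."""

    name: str
    head_initial: bool  # hypothetical parameter, used for illustration only

    def features(self) -> dict[str, str]:
        # A single parameter setting fixes several surface features at once.
        if self.head_initial:
            return {"verb_object_order": "VO", "adposition": "preposition"}
        return {"verb_object_order": "OV", "adposition": "postposition"}


lang_a = Language("toy-A", head_initial=True)
lang_b = Language("toy-B", head_initial=False)
print(lang_a.features())
print(lang_b.features())
```

Toggling the one parameter changes both features together, which is the kind of dependency between features that the abstract refers to.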

Uncovering Probabilistic Implications in Typological Knowledge Bases

A computational model is presented that successfully identifies known universals, including Greenberg universals, and also uncovers new ones worthy of further linguistic investigation. It outperforms baselines previously used for this problem, as well as a strong baseline from knowledge base population.

SIGTYP 2020 Shared Task: Prediction of Typological Features

It is revealed that even the strongest submitted systems struggle with predicting feature values for languages where few features are known, and that the most successful methods make use of correlations between features.

Bridging Linguistic Typology and Multilingual Machine Translation with Multi-view Language Representations

By inferring typological features and language phylogenies, the method can easily project and assess new languages without the expensive retraining of massive multilingual or ranking models, a major disadvantage of related approaches.

NEMO: Frequentist Inference Approach to Constrained Linguistic Typology Feature Prediction in SIGTYP 2020 Shared Task

This paper describes the NEMO submission to SIGTYP 2020 shared task (Bjerva et al., 2020) which deals with prediction of linguistic typological features for multiple languages using the data derived

Does Typological Blinding Impede Cross-Lingual Sharing?

This model is based on a cross-lingual architecture in which the latent weights governing the sharing between languages are learnt during training. It is shown that preventing this model from exploiting typology severely reduces performance, while a control experiment reaffirms that encouraging sharing according to typology somewhat improves performance.

Language Embeddings for Typology and Cross-lingual Transfer Learning

This work generates dense embeddings for 29 languages using a denoising autoencoder, and evaluates the embeddings using the World Atlas of Language Structures (WALS) and two extrinsic tasks in a zero-shot setting: cross-lingual dependency parsing and cross-lingual natural language inference.

Inducing Language-Agnostic Multilingual Representations

Three approaches for removing language identity signals from multilingual embeddings are examined: re-aligning the vector spaces of target languages (all together) to a pivot source language, removing language-specific means and variances, and increasing input similarity across languages by removing morphological contractions and sentence reordering.
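
The second approach above, removing language-specific means and variances, can be sketched as per-language standardization. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `standardize_per_language` and the toy vectors are hypothetical, and real embeddings would have many more dimensions.

```python
from statistics import fmean, pstdev


def standardize_per_language(embeddings):
    """Centre and scale each dimension within each language separately.

    `embeddings` maps a language code to a list of equal-length vectors.
    """
    result = {}
    for lang, vectors in embeddings.items():
        dims = list(zip(*vectors))  # transpose: one tuple per dimension
        means = [fmean(d) for d in dims]
        # Fall back to 1.0 so constant dimensions do not divide by zero.
        stds = [pstdev(d) or 1.0 for d in dims]
        result[lang] = [
            [(x - m) / s for x, m, s in zip(vec, means, stds)]
            for vec in vectors
        ]
    return result


# Toy, made-up vectors purely for illustration.
toy = {
    "en": [[1.0, 10.0], [3.0, 14.0]],
    "de": [[-2.0, 0.0], [2.0, 0.0]],
}
normalized = standardize_per_language(toy)
```

After this step every language has zero mean per dimension, so a classifier can no longer separate languages by their average vector alone.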

Towards a Multi-view Language Representation: A Shared Space of Discrete and Continuous Language Features

This work computes a shared space between discrete (binary) and continuous features using canonical correlation analysis and evaluates the new language representation against a concatenation baseline in typological feature prediction and in phylogenetic inference, obtaining promising results to explore further.

Stop the Morphological Cycle, I Want to Get Off: Modeling the Development of Fusion

In simulations using artificial data, this work provides quantitative support for two claims about agglutinative and fusional structures: that optional morphological markers discourage fusion from developing, but that stress-based vowel reduction encourages it.

Zero-Shot Cross-Lingual Transfer with Meta Learning

This work considers the setting of training models on multiple different languages at the same time, when little or no data is available for languages other than English, and demonstrates the consistent effectiveness of meta-learning for a total of 15 languages.



Diachrony-aware Induction of Binary Latent Representations from Typological Features

A Bayesian model is proposed that represents each language as a sequence of binary latent parameters encoding inter-feature dependencies and relates a language’s parameters to those of its phylogenetic and spatial neighbors. The proposed model recovers missing values more accurately than others, and the induced representations retain the phylogenetic and spatial signals observed for surface features.

From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings

A core part of linguistic typology is the classification of languages according to linguistic properties, such as those detailed in the World Atlas of Language Structures (WALS). Doing this manually

Probabilistic Typology: Deep Generative Models of Vowel Inventories

Linguistic typology studies the range of structures present in human language. The main goal of the field is to discover which sets of possible phenomena are universal, and which are merely frequent.

Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing

It is shown that to date, the use of information in existing typological databases has resulted in consistent but modest improvements in system performance, due to both intrinsic limitations of databases and under-employment of the typological features included in them.

Learning Language Representations for Typology Prediction

Experiments show that the proposed method is able to infer not only syntactic, but also phonological and phonetic inventory features, and improves over a baseline that has access to information about the languages' geographic and phylogenetic neighbors.

Parametric versus functional explanations of syntactic universals

This paper compares the generative principles-and-parameters approach to explaining syntactic universals to the functional-typological approach and also discusses the intermediate approach of

Past, Present, Future: A Computational Investigation of the Typology of Tense in 1000 Languages

SuperPivot is presented, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use, and performs well for the crosslingual analysis of the linguistic phenomenon of tense.

Semantic Drift in Multilingual Representations

Results indicate that multilingual distributional representations that are only trained on monolingual text and bilingual dictionaries preserve relations between languages without the need for any etymological information.

What Do Language Representations Really Represent?

This work investigates correlations and causal relationships between language representations learned from translations on the one hand, and genetic, geographical, and several levels of structural similarity between languages on the other, finding that structural similarity correlates most strongly with language representation similarity.

Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning

We introduce polyglot language models, recurrent neural network models trained to predict symbol sequences in many different languages using shared representations of symbols and conditioning on