Past, Present, Future: A Computational Investigation of the Typology of Tense in 1000 Languages

Authors: Ehsaneddin Asgari and Hinrich Schütze
Published in: Conference on Empirical Methods in Natural Language Processing
We present SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We show that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produce analysis results for more than 1000 languages, conducting – to the best of our knowledge – the largest crosslingual computational study performed to date. We extend… 

ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus

ParCourE, an online tool for browsing a word-aligned parallel corpus covering 1334 languages, is provided, with evidence that it is useful for typological research.

Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing

It is suggested that a new approach that adapts the broad and discrete nature of typological categories to the contextual and continuous nature of machine learning algorithms used in contemporary NLP could be facilitated by recent developments in data-driven induction of typological knowledge.

Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing

It is shown that to date, the use of information in existing typological databases has resulted in consistent but modest improvements in system performance, due to both intrinsic limitations of databases and under-employment of the typological features included in them.

Uncovering Probabilistic Implications in Typological Knowledge Bases

A computational model is presented that successfully identifies known universals, including Greenberg universals, and also uncovers new ones worthy of further linguistic investigation; it outperforms baselines previously used for this problem, as well as a strong baseline from knowledge base population.


Multilingual parallel corpora make possible the application of quantitative methods in cross-linguistic research. Due to the lack of appropriate resources, this has not become a widespread technique

A Probabilistic Generative Model of Linguistic Typology

This work develops a generative model of language based on exponential-family matrix factorisation and shows how structural similarities between languages can be exploited to predict typological features with near-perfect accuracy, outperforming several baselines on the task of predicting held-out features.

From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings

A core part of linguistic typology is the classification of languages according to linguistic properties, such as those detailed in the World Atlas of Language Structure (WALS). Doing this manually

Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique

The creation of the Emakhuwa-Portuguese parallel corpus is described; it is a collection of texts from the Jehovah’s Witness website and a variety of other sources, including the African Story Book website, the Universal Declaration of Human Rights, and Mozambican legal documents.

Quantitative Analysis of Passives with Agent Phrase Based on Multilingual Parallel Data

The advantages of using parallel data in linguistic research are discussed, and preliminary results of a study of passives with an agent phrase are presented, using a parallel corpus of texts in nine European languages.

An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages

It is found that best practices in this domain are highly language-specific: adding more languages to a training set is often better, but too many harm performance; the best number depends on the source language.



The perfect map: Investigating the cross-linguistic distribution of TAME categories in a parallel corpus

The work presented in this paper can be seen as a continuation of my earlier attempts at using quantitative methods to compare tense-aspect categories across languages using translation questionnaire

Language Universals and Linguistic Typology: Syntax and Morphology

This second edition has been revised and updated to take full account of new research in universals and typology in the past decade, and more generally to consider how the approach advocated here relates to recent advances in generative grammatical theory.

Creating a Parallel Corpus from the "Book of 2000 Tongues"

A project to annotate biblical texts in order to create an aligned multilingual Bible corpus for linguistic research, particularly computational linguistics, including automatically creating and evaluating translation lexicons and semantically tagged texts.

The grammaticalization of tense and aspect in Tok Pisin and Sranan

According to Bickerton's “bioprogram,” creole grammars from the outset contain privative oppositions in the verbal system, where zeroes can be unambiguously interpreted as contrasting with

An Unsupervised Method for Word Sense Tagging using Parallel Corpora

An unsupervised method for word sense disambiguation that exploits translation correspondences in parallel corpora is presented, using pseudo-translations, created by machine translation systems, in order to make possible the evaluation of the approach against a standard test set.

From questionnaires to parallel corpora in typology

This rather programmatic paper discusses the use of parallel corpora in the typological study of grammatical categories. In the author's earlier work, tense-aspect categories were studied by means of

Creating a massively parallel Bible corpus

This work presents the ongoing effort to create a massively parallel Bible corpus, with over 900 translations in more than 830 language varieties, and reports on the current status of the corpus.

Translation-Based Corpus Studies: Contrasting English and Portuguese Tense and Aspect Systems

This book presents a model for describing translation performance as a basis for contrastive linguistics, in the realm of tense and aspect. It is based on extensive corpus studies investigating the

Classification of telicity using cross-linguistic annotation projection

This paper addresses the automatic recognition of telicity, an aspectual notion, and successfully leverages additional silver-standard training data in the form of projected annotations from parallel English-Czech data, as well as context information, significantly improving automatic telicity classification for English compared to previous work.

Inferring Universals from Grammatical Variation: Multidimensional Scaling for Typological Analysis

It is argued that multidimensional scaling (MDS), in particular the Optimal Classification nonparametric unfolding algorithm, offers a powerful, formalized tool that allows linguists to infer language universals from highly complex and large-scale datasets.