The State and Fate of Linguistic Diversity and Inclusion in the NLP World

@inproceedings{Joshi2020TheSA,
  title={The State and Fate of Linguistic Diversity and Inclusion in the NLP World},
  author={Pratik M. Joshi and Sebastin Santy and Amar Budhiraja and Kalika Bali and Monojit Choudhury},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2020}
}
Language technologies contribute to promoting multilingualism and linguistic diversity around the world. However, only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications. In this paper we look at the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time. Our quantitative investigation…
An Overview of Fairness in Data – Illuminating the Bias in Data Pipeline
Data in general encodes human biases by default; being aware of this is a good start, and the research around how to handle it is ongoing. The term ‘bias’ is extensively used in various contexts in…
IndoCollex: A Testbed for Morphological Transformation of Indonesian Word Colloquialism
This paper identifies a class of Indonesian colloquial words that have undergone morphological transformations from their standard forms, categorizes their word formations, and proposes a benchmark dataset of Indonesian Colloquial Lexicons (IndoCollex), consisting of informal words on Twitter expertly annotated with their standard forms and their word formation types/tags.
Low-Resource Machine Translation for Low-Resource Languages: Leveraging Comparable Data, Code-Switching and Compute Resources
This work proposes a simple and scalable method to improve unsupervised NMT, showing how adding comparable data mined using a bilingual dictionary, along with modest additional compute resources to train the model, can significantly improve its performance.
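
A rough, purely illustrative sketch of what mining comparable data with a bilingual dictionary could look like is given below; the scoring rule, overlap threshold, and toy dictionary are assumptions made for illustration, not the procedure of the paper above.

def overlap_score(src_tokens, tgt_tokens, dictionary):
    """Fraction of source tokens with at least one dictionary translation in the target."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if dictionary.get(tok, set()) & tgt)
    return hits / len(src_tokens)

def mine_pairs(candidates, dictionary, threshold=0.5):
    """Keep candidate (source, target) sentence pairs whose overlap exceeds the threshold."""
    return [(src, tgt) for src, tgt in candidates
            if overlap_score(src.split(), tgt.split(), dictionary) >= threshold]

# Toy example with a two-entry dictionary (hypothetical data).
toy_dict = {"house": {"maison"}, "red": {"rouge"}}
pairs = [("the red house", "la maison rouge"), ("the red house", "le chat dort")]
print(mine_pairs(pairs, toy_dict))  # keeps only the first pair
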
Neural Machine Translation for Low-Resource Languages: A Survey
A detailed survey of research advancements in low-resource language NMT (LRL-NMT), along with a quantitative analysis aimed at identifying the most popular solutions and a set of guidelines for selecting a suitable NMT technique for a given LRL data setting.
On the Difficulty of Translating Free-Order Case-Marking Languages
This work investigates whether free-order case-marking languages, such as Russian, Latin or Tamil, are more challenging than fixed-order languages for the tasks of syntactic parsing and subject-verb agreement prediction, and finds that word order flexibility in the source language only leads to a very small loss of NMT quality.
KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi
The experiments show that training embeddings on the relatively higher-resourced Kinyarwanda yields successful cross-lingual transfer to Kirundi, and the design of the created datasets allows for wider use in NLP beyond text classification in future studies, such as representation learning, cross-lingual learning with more distant languages, or as a basis for new annotations for tasks such as parsing, POS tagging, and NER.
Crossing the Conversational Chasm: A Primer on Multilingual Task-Oriented Dialogue Systems
This work identifies two main challenges that, combined, hinder faster progress in multilingual TOD: current state-of-the-art TOD models based on large pretrained neural language models are data-hungry, while data acquisition for TOD use cases is expensive and tedious.
Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language
This paper revisits standard tokenization methods and introduces Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy to create a more representative vocabulary for training.
How Linguistically Fair Are Multilingual Pre-Trained Language Models?
Massively multilingual pre-trained language models, such as mBERT and XLM-RoBERTa, have received significant attention in the recent NLP literature for their excellent capability towards cross-lingual…
A Discussion on Building Practical NLP Leaderboards: The Case of Machine Translation
A preliminary discussion of the risks associated with focusing exclusively on accuracy metrics is offered, and prescriptive suggestions on how to develop more practical and effective leaderboards that can better reflect the real-world utility of models are highlighted.

References

Showing 1–10 of 19 references.
Efficient Estimation of Word Representations in Vector Space
Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed, and these vectors are shown to provide state-of-the-art performance on a test set measuring syntactic and semantic word similarities.
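
As a minimal illustration of training such continuous word representations, the sketch below uses the gensim library (an assumed third-party dependency, not part of the cited work) on a toy corpus; all hyperparameters are arbitrary.

from gensim.models import Word2Vec

# Tiny stand-in corpus: a list of tokenized sentences.
corpus = [
    ["language", "technologies", "promote", "multilingualism"],
    ["word", "vectors", "capture", "syntactic", "and", "semantic", "similarity"],
]

# sg=1 selects the skip-gram architecture; sg=0 would use CBOW instead.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=20)

# Query the learned vector space for nearest neighbours of a word.
print(model.wv.most_similar("language", topn=3))
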
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
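
A minimal sketch of the "one additional output layer" fine-tuning setup described above, assuming the Hugging Face transformers library is available; the checkpoint name and label count are illustrative choices rather than anything prescribed by the paper.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
# AutoModelForSequenceClassification adds one randomly initialised
# classification layer on top of the pre-trained encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)

batch = tokenizer(["a toy example sentence"], return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**batch).logits  # shape: (1, num_labels)
print(logits)
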
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised, relying only on monolingual data, and one supervised, leveraging parallel data with a new cross-lingual language model objective.
How Multilingual is Multilingual BERT?
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy, 2019.
It is concluded that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs, and that the model can find translation pairs.
Massively Multilingual Neural Machine Translation
It is shown that massively multilingual many-to-many models are effective in low-resource settings, outperforming the previous state-of-the-art while supporting up to 59 languages in 116 translation directions in a single model.
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
An architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts, using a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.
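
A bare-bones PyTorch sketch of a BiLSTM sentence encoder with pooling over time, in the spirit of the architecture summarized above; the vocabulary size, dimensions, and the choice of max pooling are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class BiLSTMSentenceEncoder(nn.Module):
    def __init__(self, vocab_size=32000, embed_dim=320, hidden_dim=512):
        super().__init__()
        # A single vocabulary (e.g. byte-pair encoded) shared across all languages.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                     # (batch, seq_len)
        states, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, 2*hidden_dim)
        # Max-pool over the time dimension to get one fixed-size vector per sentence.
        return states.max(dim=1).values               # (batch, 2*hidden_dim)

encoder = BiLSTMSentenceEncoder()
dummy = torch.randint(0, 32000, (2, 7))  # two toy "sentences" of 7 token ids each
print(encoder(dummy).shape)              # torch.Size([2, 1024])
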
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
It is shown that, to date, the use of information in existing typological databases has resulted in consistent but modest improvements in system performance, due to both intrinsic limitations of the databases and under-employment of the typological features included in them.