NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages

Samuel Cahyawijaya, Alham Fikri Aji, Holy Lovenia, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Fajri Koto, David Moeljadi, Karissa Vincentio, Ade Romadhony, Ayu Purwarianti
At the center of the issues that halt the advancement of Indonesian natural language processing (NLP) research, we find data scarcity. Resources in Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers refrain from publishing and/or releasing their datasets. Furthermore, the few public datasets that we have are scattered across different platforms, thus making reproducible and data-centric research in Indonesian NLP…

Inaccessible Neural Language Models Could Reinvigorate Linguistic Nativism

This work argues that this lack of accessibility could instill a nativist bias in researchers new to computational linguistics, and calls upon researchers to open-source their LLM code wherever possible so that both empiricist and hybrid approaches remain accessible.

Masader: Metadata Sourcing for Arabic Text and Speech Data Resources

This work creates Masader, the largest public catalogue of Arabic NLP datasets, consisting of 200 datasets annotated with 25 attributes, along with a metadata annotation strategy that could be extended to other languages.

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Evaluation of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters finds that model performance and calibration both improve with scale, but are poor in absolute terms.

One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

This work provides an overview of the current state of NLP research for Indonesia's 700+ languages and offers general recommendations to help develop NLP technology, not only for the languages of Indonesia but also for other underrepresented languages.

Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

This work developed an online catalogue as a supporting tool for gathering metadata through organized public hackathons, and presents the development process along with analyses of the resulting resource metadata, including distributions over languages, regions, and resource types.

ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair

This work generates multiple translation samples using beam search, chooses the most lexically diverse pair according to sentence-level BLEU, and compares the generated corpus against ParaBank2.
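The diversity-based selection step can be sketched as follows. This is a minimal illustration, not ParaCotta's actual pipeline: beam-search candidates are assumed to come from an MT model, and `sentence_bleu` here is a simplified add-one-smoothed stand-in for a library BLEU implementation. The pair of candidates with the lowest mutual BLEU is kept as the most lexically diverse paraphrase pair.

```python
from collections import Counter
from itertools import combinations
import math

def sentence_bleu(hyp, ref, max_n=4):
    """Simplified sentence-level BLEU: uniform n-gram weights with
    add-one smoothing, plus a brevity penalty. A stand-in for a
    library implementation such as NLTK's or sacreBLEU's."""
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((h_ngrams & r_ngrams).values())
        total = max(sum(h_ngrams.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec)

def most_diverse_pair(candidates):
    """Among translation candidates (e.g. from beam search), return the
    pair with the lowest symmetric sentence BLEU, i.e. the most
    lexically diverse pair."""
    return min(combinations(candidates, 2),
               key=lambda p: sentence_bleu(p[0], p[1]) + sentence_bleu(p[1], p[0]))

# Toy beam-search outputs for one source sentence (hypothetical data).
beams = [
    "the weather is nice today",
    "today the weather is nice",
    "it is a beautiful day outside",
]
pair = most_diverse_pair(beams)
```

Since the first two candidates share most of their n-grams, the selected pair will contain the third, lexically distinct candidate.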

IndoNLI: A Natural Language Inference Dataset for Indonesian

IndoNLI is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, or temporal and spatial reasoning.

IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization

It is found that initializing new tokens with the average of their BERT subword embeddings makes pretraining five times faster, and is more effective than previously proposed vocabulary adaptation methods in terms of extrinsic evaluation over seven Twitter-based datasets.
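The initialization idea can be sketched in a few lines. This is a toy NumPy illustration under stated assumptions, not IndoBERTweet's actual code: the base embeddings are random stand-ins for a trained BERT embedding matrix, and `subword_split` is a hypothetical WordPiece-style split of each new Twitter-domain word into subwords already in the base vocabulary. Each new token's embedding is then the mean of its subword embeddings.

```python
import numpy as np

# Hypothetical base vocabulary with "trained" embeddings (random stand-ins).
rng = np.random.default_rng(0)
base_vocab = ["ga", "##k", "bang", "##et", "se", "##ru"]
base_emb = {tok: rng.normal(size=8) for tok in base_vocab}

def subword_split(word):
    """Hypothetical WordPiece-style split of a new domain word into
    subwords present in the base vocabulary."""
    splits = {"gak": ["ga", "##k"], "banget": ["bang", "##et"], "seru": ["se", "##ru"]}
    return splits[word]

def init_new_token(word):
    """Average-embedding initialization: the embedding of a new token is
    the mean of the embeddings of the subwords the original tokenizer
    would split it into."""
    pieces = subword_split(word)
    return np.mean([base_emb[p] for p in pieces], axis=0)

# Initialize embeddings for new (colloquial Indonesian) vocabulary entries.
new_emb = {w: init_new_token(w) for w in ["gak", "banget", "seru"]}
```

In practice this replaces random initialization of the added rows of the embedding matrix before continued pretraining on the target domain.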

Evaluating the Efficacy of Summarization Evaluation across Languages

This work takes a summarization corpus for eight different languages, and manually annotates generated summaries for focus (precision) and coverage (recall), and finds that using multilingual BERT within BERTScore performs well across all languages.

IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation

In IndoNLG, the first benchmark to measure natural language generation (NLG) progress in three low-resource—yet widely spoken—languages of Indonesia, it is shown that IndoBART and IndoGPT achieve competitive performance on all tasks—despite using only one-fifth the parameters of a larger multilingual model, mBART-large (Liu et al., 2020).

Combination of Genetic Algorithm and Brill Tagger Algorithm for Part of Speech Tagging Bahasa Madura

This research identifies suitable algorithms for developing text processing in Bahasa Madura, combining the Brill tagger with a genetic algorithm, an approach that has achieved the best accuracy when implemented for other languages.