NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages
@article{Cahyawijaya2022NusaCrowdAC,
  title   = {NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages},
  author  = {Samuel Cahyawijaya and Alham Fikri Aji and Holy Lovenia and Genta Indra Winata and Bryan Wilie and Rahmad Mahendra and Fajri Koto and David Moeljadi and Karissa Vincentio and Ade Romadhony and Ayu Purwarianti},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2207.10524}
}
At the center of the underlying issues that halt Indonesian natural language processing (NLP) research advancement, we find data scarcity. Resources in Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers refrain from publishing and/or releasing their datasets. Furthermore, the few public datasets that we have are scattered across different platforms, which makes performing reproducible and data-centric research in Indonesian NLP…
One Citation
Inaccessible Neural Language Models Could Reinvigorate Linguistic Nativism
- Computer Science, ArXiv
- 2023
This work argues that this lack of accessibility could instill a nativist bias in researchers new to computational linguistics, and calls upon researchers to open-source their LLM code wherever possible so that both empiricist and hybrid approaches remain accessible.
References
SHOWING 1-10 OF 18 REFERENCES
Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
- Computer Science, LREC
- 2022
This work creates Masader, the largest public catalogue of Arabic NLP datasets, consisting of 200 datasets annotated with 25 attributes, together with a metadata annotation strategy that could be extended to other languages.
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
- Computer Science, ArXiv
- 2022
Evaluation of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters, finds that model performance and calibration both improve with scale but are poor in absolute terms.
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
- Computer Science, Linguistics, ACL
- 2022
This work provides an overview of the current state of NLP research for Indonesia's 700+ languages and offers general recommendations to help develop NLP technology not only for the languages of Indonesia but also for other underrepresented languages.
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
- Computer Science, ArXiv
- 2022
This work develops an online catalogue as a supporting tool for gathering metadata through organized public hackathons, and presents the development process along with analyses of the resulting resource metadata, including distributions over languages, regions, and resource types.
ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair
- Computer Science, PACLIC
- 2021
This work generates multiple translation samples using beam search, chooses the most lexically diverse pair according to their sentence BLEU, and compares the generated corpus with ParaBank2.
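A minimal sketch of this pair-selection idea, assuming sacrebleu is installed; the candidate sentences are placeholders, not the paper's data or exact implementation:

```python
# Sketch: pick the most lexically diverse translation pair by sentence BLEU.
# Lower BLEU between two candidates means less lexical overlap, i.e. more diversity.
from itertools import combinations
from sacrebleu import sentence_bleu

# Hypothetical beam-search outputs for one source sentence (placeholders).
candidates = [
    "the weather is very nice today",
    "today the weather is really pleasant",
    "it is a very nice day today",
]

def mutual_bleu(a: str, b: str) -> float:
    # Symmetric score: average sentence BLEU in both directions.
    return (sentence_bleu(a, [b]).score + sentence_bleu(b, [a]).score) / 2

# The most lexically diverse pair is the one with the lowest mutual BLEU.
most_diverse = min(combinations(candidates, 2), key=lambda pair: mutual_bleu(*pair))
print(most_diverse)
```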
IndoNLI: A Natural Language Inference Dataset for Indonesian
- Computer Science, EMNLP
- 2021
IndoNLI is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, or temporal and spatial reasoning.
IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization
- Computer Science, EMNLP
- 2021
It is found that initializing with the average BERT subword embedding makes pretraining five times faster and is more effective than previously proposed methods for vocabulary adaptation in terms of extrinsic evaluation over seven Twitter-based datasets.
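A minimal sketch of average-subword-embedding initialization using Hugging Face transformers. Here "bert-base-uncased" is a stand-in for the Indonesian BERT used in the paper, and the new tokens are hypothetical placeholders:

```python
# Sketch: initialize embeddings for new domain-specific vocabulary items with
# the average of their subword embeddings under the original tokenizer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in model
model = AutoModel.from_pretrained("bert-base-uncased")

# Hypothetical new Twitter-domain tokens (placeholders, not the paper's vocabulary).
new_tokens = ["gapapa", "wkwk"]

# Record how the *original* tokenizer splits each new token into subwords.
subword_ids = {t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens}

# Extend the vocabulary and the embedding matrix.
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
embeddings = model.get_input_embeddings().weight.data

# Initialize each new row with the mean of its subword embeddings.
with torch.no_grad():
    for token, ids in subword_ids.items():
        new_id = tokenizer.convert_tokens_to_ids(token)
        embeddings[new_id] = embeddings[ids].mean(dim=0)
```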
Evaluating the Efficacy of Summarization Evaluation across Languages
- Linguistics, FINDINGS
- 2021
This work takes a summarization corpus for eight different languages, manually annotates generated summaries for focus (precision) and coverage (recall), and finds that using multilingual BERT within BERTScore performs well across all languages.
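A minimal sketch of scoring summaries with a multilingual BERT backbone via the bert-score package; the summary/reference pair and the layer choice are illustrative assumptions, not the paper's setup:

```python
# Sketch: evaluate generated summaries against references with BERTScore,
# using multilingual BERT so the same metric can be applied across languages.
from bert_score import score

# Hypothetical generated summary and reference (placeholders).
candidates = ["presiden mengumumkan kebijakan baru hari ini"]
references = ["presiden mengumumkan sebuah kebijakan baru pada hari ini"]

P, R, F1 = score(
    candidates,
    references,
    model_type="bert-base-multilingual-cased",  # multilingual BERT backbone
    num_layers=9,  # layer choice is an assumption; bert_score also has per-model defaults
)
print(f"P={P.mean():.3f} R={R.mean():.3f} F1={F1.mean():.3f}")
```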
IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation
- Computer Science, EMNLP
- 2021
IndoNLG is the first benchmark to measure natural language generation (NLG) progress in three low-resource yet widely spoken languages of Indonesia; it shows that IndoBART and IndoGPT achieve competitive performance on all tasks despite using only one-fifth the parameters of a larger multilingual model, mBART-large (Liu et al., 2020).
Combination of Genetic Algorithm and Brill Tagger Algorithm for Part of Speech Tagging Bahasa Madura
- Computer Science
- 2020
This research investigates suitable algorithms for developing text processing in Bahasa Madura, combining the Genetic Algorithm with the Brill Tagger algorithm, which has achieved the best accuracy when implemented in other languages.