IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation

  Samuel Cahyawijaya, Genta Indra Winata, Bryan Wilie, Karissa Vincentio, Xiaohong Li, Adhiguna Kuncoro, Sebastian Ruder, Zhi Yuan Lim, Syafri Bahar, Masayu Leylia Khodra, Ayu Purwarianti, and Pascale Fung
Natural language generation (NLG) benchmarks provide an important avenue to measure progress and develop better NLG systems. Unfortunately, the lack of publicly available NLG benchmarks for low-resource languages poses a challenging barrier for building NLG systems that work well for languages with limited amounts of data. Here we introduce IndoNLG, the first benchmark to measure natural language generation (NLG) progress in three low-resource—yet widely spoken—languages of Indonesia… 

AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing

This comprehensive survey explains core concepts such as pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods; presents a new taxonomy of T-PTLMs; and gives a brief overview of various benchmarks, both intrinsic and extrinsic.

Strategies for Adapting Multilingual Pre-training for Domain-Specific Machine Translation

Through the domain-first approach, fine-tuning across multilingual in-domain corpora can lead to stark improvements in domain adaptation without sourcing additional out-of-domain bitext.

Every picture tells a story: Image-grounded controllable stylistic story generation

This work introduces Plug-and-Play Story Teller (PPST), which improves image-to-story generation by addressing the data scarcity problem through large pre-trained models, namely CLIP and GPT-2, to facilitate more style-relevant generation.

Writing System and Speaker Metadata for 2,800+ Language Varieties

An open-source dataset providing metadata for about 2,800 language varieties used in the world today, which is the largest publicly-available, machine-readable resource with writing system and speaker information for the world’s languages.

NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages

NusaCrowd strives to provide the largest datasheet aggregation with standardized data loading for NLP tasks in all Indonesian languages, aiming to tackle the data scarcity problem that hinders NLP progress in Indonesia and to move NLP practitioners towards collaboration.

Opinion Triplet Extraction for Aspect-Based Sentiment Analysis Using Co-Extraction Approach

The co-extraction approach was adapted by modifying the original frameworks to perform the previously unhandled subtask of obtaining the opinion triplet; the output layer of these frameworks was modified and trained on a collection of Indonesian-language hotel reviews.

A Survey of Deep Learning Models for Structural Code Understanding

This survey presents a comprehensive overview of the structures formed from code data, categorizes recent models for code understanding into two groups, sequence-based and graph-based, and makes suggestions for future research in the field of structural code understanding.

Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?

While mBART is robust to domain differences, its translations for unseen and typologically distant languages remain below 3.0 BLEU; it is suggested that the emphasis should be shifted from new models to new data.

IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages

In this paper, we present the IndicNLG suite, a collection of datasets for benchmarking Natural Language Generation (NLG) for 11 Indic languages. We focus on five diverse tasks, namely, biography…

Pre-trained transformer-based language models for Sundanese

Three monolingual Transformer-based language models are pre-trained on Sundanese data and outperform larger multilingual models despite the smaller overall pre-training data.

XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

A recent cross-lingual pre-trained model Unicoder is extended to cover both understanding and generation tasks, which is evaluated on XGLUE as a strong baseline and the base versions of Multilingual BERT, XLM and XLM-R are evaluated for comparison.

BanglaBERT: Combating Embedding Barrier for Low-Resource Language Understanding

This work builds a Bangla natural language understanding model pre-trained on 18.6 GB of data crawled from top Bangla sites on the internet, and identifies a major shortcoming of multilingual models, termed the ‘Embedding Barrier’, that hurts performance for low-resource languages whose writing scripts are not shared with any high-resource language.

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this…

TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages

A quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora are presented.

Multilingual Denoising Pre-training for Neural Machine Translation

This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART…

When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?

It is shown that pre-trained word embeddings can be surprisingly effective in NMT tasks – providing gains of up to 20 BLEU points in the most favorable setting.

Benchmarking Multidomain English-Indonesian Machine Translation

Proceedings of the 13th Workshop on Building and Using Comparable Corpora, pages 35–43, 2020.

XPersona: Evaluating Multilingual Personalized Chatbot

A multilingual extension of Persona-Chat, namely XPersona, is proposed, which includes persona conversations in six languages other than English for evaluating multilingual personalized agents; results show that the multilingual trained models outperform the translation pipeline and are on par with the monolingual models.

Liputan6: A Large-scale Indonesian Dataset for Text Summarization

A large-scale Indonesian summarization dataset is introduced, with a thorough error analysis examining machine-generated summaries that have low ROUGE scores, exposing issues both with ROUGE itself and with extractive and abstractive summarization models.

IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

IndoBERT, a new pre-trained language model for Indonesian, is released and experiments show that IndoBERT achieves state-of-the-art performance over most of the tasks in IndoLEM.