Corpus ID: 249538120

Multilingual Open Text Release 1: Public Domain News in 44 Languages

Chester Palen-Michel, June-Woo Kim, and Constantine Lignos
International Conference on Language Resources and Evaluation
We present Multilingual Open Text (MOT), a new multilingual corpus containing text in 44 languages, many of which have limited existing text resources for natural language processing. The first release of the corpus contains over 2.8 million news articles and an additional 1 million short snippets (photo captions, video descriptions, etc.) published between 2001 and 2022 and collected from Voice of America’s news websites. We describe our process for collecting, filtering, and processing the data… 


LR-Sum: Summarization for Less-Resourced Languages

This preprint describes work in progress on LR-Sum, a new permissively licensed dataset created to enable further research in automatic summarization for less-resourced languages, and outlines how the authors plan to use the data in modeling experiments.

Extended Multilingual Protest News Detection - Shared Task 1, CASE 2021 and 2022

The two best submissions on CASE 2021 data outperform last year's submissions for Subtask 1 and Subtask 2 in all languages; only a few scenarios on CASE 2021 were not outperformed by new submissions.

A unified approach to sentence segmentation of punctuated text in many languages

A modern context-based modeling approach is introduced that provides a solution to the problem of segmenting punctuated text in many languages, and it is shown how it can be trained on noisily-annotated data.
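The baseline such a context-based model improves upon can be sketched in a few lines. The regex heuristic below is an illustrative assumption, not the paper's approach: it splits on terminal punctuation followed by an uppercase letter, and mis-splits after abbreviations, which is exactly the ambiguity a learned, context-aware segmenter resolves.

```python
import re

def naive_segment(text: str) -> list[str]:
    """Split at ., !, or ? followed by whitespace and an uppercase letter.

    A deliberately simple baseline: it wrongly splits after
    abbreviations such as "Dr." because it ignores context.
    """
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in pieces if p]

print(naive_segment("Dr. Smith arrived. He was late!"))
# The abbreviation "Dr." triggers a spurious split before "Smith".
```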

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages.

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

This work uses the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages and shows that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.

A Massive Collection of Cross-Lingual Web-Document Pairs

A new web dataset consisting of 54 million URL pairs from Common Crawl covering documents in 92 languages paired with English is released and the quality of machine translations from models that have been trained on mined parallel sentence pairs from this aligned corpora is evaluated.
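URL-based pairing of the kind such a dataset relies on can be approximated by stripping language markers from URLs and matching what remains. The patterns below are assumptions for illustration only, not the paper's actual matching rules.

```python
import re

# Illustrative language markers; real matching rules are more elaborate.
LANG_PATTERN = re.compile(r"/(?:en|fr|de|es)(?=/)|[?&]lang=[a-z]{2}")

def normalize_url(url: str) -> str:
    """Strip path segments or query parameters that only mark language."""
    return LANG_PATTERN.sub("", url)

def pair_documents(english_urls, foreign_urls):
    """Pair URLs whose language-stripped forms are identical."""
    index = {normalize_url(u): u for u in english_urls}
    return [(index[n], u) for u in foreign_urls
            if (n := normalize_url(u)) in index]
```

For example, `https://site.org/en/news/1` and `https://site.org/fr/news/1` both normalize to `https://site.org/news/1` and are therefore paired.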

Parsivar: A Language Processing Toolkit for Persian

A preprocessing toolkit named Parsivar is introduced: a comprehensive set of tools for Persian text preprocessing tasks that outperforms the available Persian preprocessing toolkits by about 8 percent in terms of F1.

Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages

It is shown that competitive multilingual language models can be trained on less than 1 GB of text, and the results suggest that the “small data” approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages.

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

This work introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages that features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.

Corpus Building for Low Resource Languages in the DARPA LORELEI Program

Representative Language Packs are designed to support research into cross-language projection and language universals rather than to provide training data; they contain large volumes of monolingual and parallel text, basic annotations, lexical resources, and simple NLP tools for 23 languages selected for typological diversity and coverage.

Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets

All the semantic models, machine learning components, and several benchmark datasets (NER, POS tagging, sentiment classification, and Amharic versions of WordSim353 and SimLex999) are released.

MasakhaNER: Named Entity Recognition for African Languages

This work brings together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages and details the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks.