MultiUN: A Multilingual Corpus from United Nation Documents

Abstract

This paper describes the acquisition, preparation and properties of a corpus extracted from the official documents of the United Nations (UN). This corpus is available in all 6 official languages of the UN, consisting of around 300 million words per language. We describe the methods we used for crawling, document formatting, and sentence alignment. This…

6 Figures and Tables

Statistics

[Figure: Citations per Year, 2010 to 2018]

135 Citations

Semantic Scholar estimates that this publication has 135 citations based on the available data.
