Parallel Global Voices: a Collection of Multilingual Corpora with Citizen Media Stories

Abstract

We present a new collection of multilingual corpora automatically created from the content available on the Global Voices websites, where volunteers have been posting and translating citizen media stories since 2004. We describe how we crawled and processed this content to generate parallel resources comprising 302.6K document pairs and 8.36M segment…
