Corpus ID: 218538022

The Danish Gigaword Project

Authors: Leon Stromberg-Derczynski, R. Baglini, Morten H. Christiansen, Manuel R. Ciosici, Jacob Aarup Dalsgaard, Riccardo Fusaroli, P. Henrichsen, Rasmus Hvingelby, Andreas Søeborg Kirkedal, Alex Speed Kjeldsen, Claus Ladefoged, Finn AArup Nielsen, M. Petersen, Jonathan Hvithamar Rystrom, Daniel Varab

Danish is a North Germanic/Scandinavian language spoken primarily in Denmark, a country with a tradition of technological and scientific innovation. From a technological perspective, however, the Danish language has received relatively little attention. As a result, Danish language technology is hard to develop, in part because of a lack of large or broad-coverage Danish corpora. This paper describes the Danish Gigaword project, which aims to construct a freely available one-billion-word corpus…

The Lacunae of Danish Natural Language Processing
Europarl: A Parallel Corpus for Statistical Machine Translation
Open semantic analysis: The case of word level semantics in Danish
A massively parallel corpus: the Bible in 100 languages
Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages
Bornholmsk Natural Language Processing: Resources and Tools
Representativeness in corpus design
OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles