• Corpus ID: 5277106

Corpas na Gaeilge 1882-1926: Integrating Historical and Modern Irish Texts

Elaine Uí Dhonnchadha, Kevin P. Scannell, Ruairí Ó Huiginn, Eilís Ní Mhearraí, Máire Nic Mháolain, Brian Ó Raghallaigh, Gregory Toner, Séamus Mac Mathúna, Déirdre D'Auria, Eithne Ní Ghallchobháir, Niall O'Leary
This paper describes the processing of a corpus of seven million words of Irish texts from the period 1882-1926. The texts, which have been captured by typing or optical character recognition, are processed for the purpose of lexicography. First, all historical and dialectal word forms are annotated with their modern standard equivalents using software developed for this purpose. Then, using the modern standard annotations, the texts are processed using an existing finite-state morphological…


Diachronic Parsing of Pre-Standard Irish

A small benchmark corpus, annotated according to the Universal Dependencies guidelines and covering a range of dialects and time periods since 1600, is introduced, and baselines for lemmatization, tagging, and dependency parsing on this corpus are established by experimenting with a variety of machine learning approaches.

Statistical models for text normalization and machine translation

An important aspect of this work is to overcome the orthographical differences between the languages, many of which were introduced in a major spelling reform of Irish in the 1940s and 1950s.

Towards a lexicon of Irish-language idioms

This paper sheds light on a lexicon of Irish-language idioms collected by Foclóir Gaeilge-Béarla (Ó Dónaill, 1977) and…

Improving full-text search results on dúchas.ie using language technology

This paper measures the effectiveness of using language standardisation, lemmatisation, and machine translation to improve full-text search results on dúchas.ie, the web interface to the Irish National Folklore Collection, and motivates the inclusion of this language technology in the search infrastructure of this folklore resource.



Scaling an Irish FST Morphology Engine for Use on Unrestricted Text

The full system achieves token coverage of 93%, which is extended to 100% through the use of morphological guessers, and the coverage increase contributed by each step is detailed.

Manual and semi-automatic normalization of historical spelling - case studies from Early New High German

Norma is presented, a semi-automatic normalization tool that integrates different modules (lexicon lookup, rewrite rules) for normalizing words in an interactive way and dynamically updates the set of rule entries, given new input.

Lecture Notes in Artificial Intelligence

The topics in LNAI include automated reasoning, automated programming, algorithms, knowledge representation, agent-based systems, intelligent systems, expert systems, machine learning, natural-language processing, machine vision, robotics, search systems, knowledge discovery, data mining, and related programming languages.

The Crúbadán Project: Corpus building for under-resourced languages

We present an overview of the Crúbadán project, the aim of which is the creation of text corpora for a large number of under-resourced languages by crawling the web.

Moses: Open Source Toolkit for Statistical Machine Translation

We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c)

The mathematics of statistical machine translation

A series of five statistical models of the translation process is described, along with algorithms for estimating the parameters of these models from a set of sentence pairs that are translations of one another.

English-Irish dictionary