Corpas na Gaeilge 1882-1926: Integrating Historical and Modern Irish Texts
@inproceedings{Dhonnchadha2014CorpasNG, title={Corpas na Gaeilge 1882-1926: Integrating Historical and Modern Irish Texts}, author={Elaine U{\'i} Dhonnchadha and Kevin P. Scannell and Ruair{\'i} {\'O} Huiginn and Eil{\'i}s N{\'i} Mhearra{\'i} and M{\'a}ire Nic Mh{\'a}olain and Brian {\'O} Raghallaigh and Gregory Toner and S{\'e}amus Mac Math{\'u}na and D{\'e}irdre D'Auria and Eithne N{\'i} Ghallchobh{\'a}ir and Niall O'Leary}, year={2014} }
This paper describes the processing of a corpus of seven million words of Irish texts from the period 1882-1926. The texts, which have been captured by typing or optical character recognition, are processed for the purpose of lexicography. Firstly, all historical and dialectal word forms are annotated with their modern standard equivalents using software developed for this purpose. Then, using the modern standard annotations, the texts are processed using an existing finite-state morphological…
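The first step in the abstract, annotating historical word forms with modern standard equivalents, might be sketched as a simple lexicon lookup. This is only an illustration under stated assumptions: the names `STANDARD_FORMS`, `standardise` and `annotate` are hypothetical, the two lexicon entries are illustrative pre-reform spellings, and the paper's own software is not shown in the source and may work quite differently.

```python
# Hypothetical lookup table (assumption): pre-standard spelling -> modern standard form.
# The entries are illustrative only and are not taken from the paper.
STANDARD_FORMS = {
    "gaedhilge": "gaeilge",
    "bliadhain": "bliain",
}


def standardise(token, lexicon=STANDARD_FORMS):
    """Return the modern standard form of a token, or the token itself if unknown."""
    modern = lexicon.get(token.lower())
    if modern is None:
        return token
    # Carry over an initial capital from the historical form.
    return modern.capitalize() if token[:1].isupper() else modern


def annotate(tokens, lexicon=STANDARD_FORMS):
    """Pair each original token with its modern standard annotation."""
    return [(tok, standardise(tok, lexicon)) for tok in tokens]


if __name__ == "__main__":
    for original, modern in annotate(["Gaedhilge", "bliadhain", "leabhar"]):
        print(original, "->", modern)
```

Keeping the original token alongside its annotation, rather than overwriting it, lets later processing run on the standardised form while the historical spelling stays available.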
4 Citations
Diachronic Parsing of Pre-Standard Irish
- Computer Science
- CLTW
- 2022
A small benchmark corpus, annotated according to the Universal Dependencies guidelines and covering a range of dialects and time periods since 1600, is introduced, and baselines for lemmatization, tagging, and dependency parsing on this corpus are established by experimenting with a variety of machine learning approaches.
Improving full-text search results on dúchas.ie using language technology
- Computer Science
- 2019
This paper measures the effectiveness of using language standardisation, lemmatisation, and machine translation to improve full-text search results on dúchas.ie, the web interface to the Irish National Folklore Collection, and motivates the inclusion of this language technology in the search infrastructure of this folklore resource.
Towards a lexicon of Irish-language idioms
- 2016
This paper sheds light on a lexicon of Irish-language idioms gathered by Foclóir Gaeilge-Béarla (Ó Dónaill, 1977) and…
Statistical models for text normalization and machine translation
- Computer Science
- 2014
An important aspect of this work is to overcome the orthographical differences between the languages, many of which were introduced in a major spelling reform of Irish in the 1940s and 1950s.
References
Scaling an Irish FST Morphology Engine for Use on Unrestricted Text
- Computer Science
- FSMNLP
- 2005
The full system achieves token coverage of 93%, which is extended to 100% through the use of morphological guessers, and the coverage increase contributed by each step is detailed.
Manual and semi-automatic normalization of historical spelling - case studies from Early New High German
- Computer Science
- KONVENS
- 2012
Norma is presented, a semi-automatic normalization tool that integrates different modules (lexicon lookup, rewrite rules) for normalizing words in an interactive way and dynamically updates the set of rule entries, given new input.
Lecture Notes in Artificial Intelligence
- Computer Science
- 1999
The topics in LNAI include automated reasoning, automated programming, algorithms, knowledge representation, agent-based systems, intelligent systems, expert systems, machine learning, natural-language processing, machine vision, robotics, search systems, knowledge discovery, data mining, and related programming languages.
The Crúbadán Project: Corpus building for under-resourced languages
- Computer Science
- 2007
We present an overview of the Crúbadán project, the aim of which is the creation of text corpora for a large number of under-resourced languages by crawling the web.
Moses: Open Source Toolkit for Statistical Machine Translation
- Computer Science
- ACL
- 2007
We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c)…
The mathematics of statistical machine translation
- Computer Science
- 1993
A series of five statistical models of the translation process is described, along with algorithms for estimating the parameters of these models from a set of sentence pairs that are translations of one another.