Corpus ID: 11413753

Using an Alignment-based Lexicon for Canonicalization of Historical Text — DRAFT —

  • Bryan Jurish
  • Published 2012
  • Virtually all conventional text-based natural language processing techniques – from traditional information retrieval systems to full-fledged parsers – require reference to a fixed lexicon accessed by surface form, typically trained from or constructed for synchronic input text adhering strictly to contemporary orthographic conventions. Unorthodox input such as historical text which violates these conventions therefore presents difficulties for any such system due to lexical variants present in…


