Tagging Historical Corpora - the problem of spelling variation

Authors: Paul Rayson, Dawn Archer, Alistair Baron, Nicholas Smith
Published in: Digital Historical Corpora

Spelling issues tend to create relatively minor (though still complex) problems for corpus linguistics, information retrieval and natural language processing tasks that use ‘standard’ or modern varieties of English. For example, in corpus annotation, we have to decide how to deal with tokenisation issues such as whether (i) periods represent sentence boundaries or acronyms and (ii) apostrophes represent quote marks or contractions (Grefenstette and Tapanainen, 1994; Grefenstette, 1999). The…

Publications referenced by this paper.

Analysing weblogs in a speech community using the WMatrix approach

  • V.B.Y. Ooi, P.K.W. Tan, A.K.L. Chiang
  • 27th conference of the International Computer…
  • 2006

Towards a Methodology for Constructing and Annotating Historical Corpora: Tackling Structural and Lexical Variability in Early Modern German Newspaper Texts

  • M. Durrell, P. Bennett, A. Ensslin
  • 4th Days of Swiss Linguistics
  • 2006

Exploring speech-related Early Modern English texts: lexical bundles re-visited

  • J. Culpeper, M. Kytö
  • Presented at the 26th conference of ICAME,
  • 2005

Tokenization

  • G. Grefenstette
  • In van Halteren, H. (ed.), Syntactic wordclass tagging, Kluwer, The Netherlands, pp
  • 1999

What is a Word, What is a Sentence? Problems of Tokenization

  • G. Grefenstette, P. Tapanainen
  • In Proceedings of 3rd conference on Computational…
  • 1994

The identification of spelling variants in English and German historical texts: manual or automatic

  • P. Rayson, D. Archer, S. L. Piao, T. McEnery
  • Literary and Linguistic Computing
