Context Training Training Cross Testing Testing

Abstract

Labeling of sentence boundaries is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. End-of-sentence punctuation marks are ambiguous; to disambiguate them most systems use brittle, special-purpose regular expression grammars and exception rules. As an alternative, we have developed an efficient, trainable algorithm that uses a lexicon with part-of-speech probabilities and a feed-forward neural network. This work demonstrates the feasibility of using prior probabilities of part-of-speech assignments, as opposed to words or definite part-of-speech assignments, as contextual information. After training for less than one minute, the method correctly labels over 98.5% of sentence boundaries in a corpus of over 27,000 sentence-boundary marks. We show the method to be efficient and easily adaptable to different text genres, including single-case texts.
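
The core idea in the abstract, representing the context of a punctuation mark by part-of-speech prior probabilities and passing that to a feed-forward network, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the tag set `TAGS`, the toy `LEXICON`, the window size, the network shape, and the untrained random weights.

```python
import math
import random

# Hypothetical tag set and toy lexicon (illustrative values only).
TAGS = ["noun", "verb", "abbreviation", "proper_noun", "other"]
LEXICON = {
    "dr": {"abbreviation": 0.9, "noun": 0.1},
    "smith": {"proper_noun": 1.0},
    "arrived": {"verb": 1.0},
    "he": {"other": 1.0},
    "left": {"verb": 0.7, "other": 0.3},
}


def pos_priors(word):
    """Prior probability of each tag for a word; uniform if the word is unknown."""
    entry = LEXICON.get(word.lower().strip("."))
    if entry is None:
        return [1.0 / len(TAGS)] * len(TAGS)
    return [entry.get(tag, 0.0) for tag in TAGS]


def context_vector(tokens, index, window=2):
    """Concatenate tag priors for `window` tokens on each side of tokens[index]."""
    features = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the punctuation mark itself
        pos = index + offset
        if 0 <= pos < len(tokens):
            features.extend(pos_priors(tokens[pos]))
        else:
            features.extend([0.0] * len(TAGS))  # pad past the text edges
    return features


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def feed_forward(features, hidden_weights, output_weights):
    """One hidden layer with sigmoid units and a single sigmoid output."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, features)))
              for row in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))


def random_weights(n_inputs, n_hidden, seed=0):
    """Untrained placeholder weights; a real system would learn these."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return hidden, output


if __name__ == "__main__":
    tokens = ["Dr.", "Smith", "arrived", ".", "He", "left", "."]
    features = context_vector(tokens, index=3)
    hidden_w, output_w = random_weights(len(features), n_hidden=2)
    print(feed_forward(features, hidden_w, output_w))
```

In this sketch the output would be read as a score for "this mark ends a sentence". Because the weights are untrained, the printed value is meaningless; the point is only the shape of the input, which carries tag probabilities rather than word identities or a single committed tag per word.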

Cite this paper

@inproceedings{Mann1994ContextTT,
  title={Context Training Training Cross Testing Testing},
  author={C. J. H. Mann and Mark Y. Liberman and Jan O. Pedersen and Martin Roscheisen and Mark Wasson and Leo Breiman and J. H. Friedman and R. A. Olshen},
  year={1994}
}