On some applications of finite-state automata theory to natural language processing
- Mehryar Mohri
- Natural Language Engineering
Stochastic approaches to natural language processing have often been preferred to rule-based approaches because of their robustness and their automatic training capabilities. This was the case for part-of-speech tagging until Brill showed how state-of-the-art part-of-speech tagging can be achieved with a rule-based tagger by inferring rules from a training corpus. However, current implementations of the rule-based tagger run more slowly than previous approaches. In this paper, we present a finite-state tagger, inspired by the rule-based tagger, that operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in a deterministic finite-state machine. This result is achieved by encoding the application of the rules found in the tagger as a non-deterministic finite-state transducer and then turning it into a deterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger whose speed is dominated by the access time of mass storage devices. We then generalize the techniques to the class of transformation-based systems.

Published in Computational Linguistics, 21(2), 227-253, June 1995.

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories of Cambridge, Massachusetts; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories. All rights reserved.
Copyright © Mitsubishi Electric Research Laboratories, 1995
201 Broadway, Cambridge, Massachusetts 02139

Revision history:
1. Version 1.0, May 2nd 1994.
2. Version 1.1, June 16th 1994.
3. Version 1.2, June 22nd 1994.
4. Version 1.3, July 27th 1994.
5. Version 1.4, July 1994.
6. Version 2.0, December 9th 1994.
7. This version is Revision 3.0, dated 95/03.
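
The abstract's claim that tagging takes the time needed to follow a single path in a deterministic machine can be illustrated with a small sketch. The Python code below hand-builds a deterministic transducer for one hypothetical contextual rule ("change the tag vbd to vbn when the next tag is by"). The rule, the tag names, and the function names `step` and `apply_rule` are assumptions chosen for illustration. The code does not reproduce the paper's construction, which derives the deterministic transducer automatically from a non-deterministic one. The sketch only shows how delaying output lets a right-context rule be applied in one left-to-right pass.

```python
from typing import List, Tuple

# Hypothetical rule, for illustration only:
#   rewrite the tag "vbd" as "vbn" when the following tag is "by".
# State 0: no output is pending.
# State 1: a "vbd" has been read but not yet emitted, because the
#          decision depends on the next input tag.


def step(state: int, tag: str) -> Tuple[int, List[str]]:
    """Follow exactly one transition: return (next_state, emitted_tags)."""
    if state == 0:
        if tag == "vbd":
            return 1, []             # delay output until the next tag is seen
        return 0, [tag]              # copy any other tag unchanged
    # state == 1: one "vbd" is pending
    if tag == "by":
        return 0, ["vbn", "by"]      # the rule fires on the pending tag
    if tag == "vbd":
        return 1, ["vbd"]            # flush the old "vbd", keep the new one pending
    return 0, ["vbd", tag]           # the rule does not fire


def apply_rule(tags: List[str]) -> List[str]:
    """Apply the rule in one pass, one transition per input tag."""
    state = 0
    output: List[str] = []
    for tag in tags:
        state, emitted = step(state, tag)
        output.extend(emitted)
    if state == 1:
        output.append("vbd")         # final output: flush the pending tag
    return output


if __name__ == "__main__":
    print(apply_rule(["np", "vbd", "by", "np"]))
```

Because each input tag triggers exactly one transition, the running time is linear in the sentence length and does not depend on how many alternatives a non-deterministic version would have had to explore.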