Why Build Another Part-of-Speech Tagger? A Minimalist Approach

  • Published 2003

Abstract

We use a Dynamic Bayesian Network (DBN) to build a compact representation of the features relevant to Part-of-Speech (PoS) tagging. The outcome is a flexible tagger (LegoTag) with state-of-the-art performance, which can be easily integrated into larger NLP architectures. We investigate the relative contribution of various features. Our work suggests that the right combination of features guarantees success, while feature redundancy hampers performance. Furthermore, linguistic knowledge is helpful in constructing a small and efficient feature set.

1 Part of Speech Tagging

Part-of-Speech (PoS) tagging is a useful component of many NLP applications. A good tagger can significantly simplify the task of a parser by allowing it to work at the tag level instead of the lexical level [Charniak et al.'96]. Furthermore, syntactic category is a helpful feature in a variety of Information Extraction (IE) problems. It aids limited sense disambiguation for some homonym pairs (e.g. "a watch" versus "to watch"). It also narrows down the set of candidate segments in question answering (see e.g. [Kupiec'93]), and the set of candidate fillers for a database slot [Soderland'99].

While many PoS taggers are available, we found ourselves in need of creating our own, for two main reasons. First, existing taggers do not seem to generalize reliably to novel data. For example, taggers trained on the Wall Street Journal (WSJ) perform poorly on novel text such as e-mail or newsgroup messages (a.k.a. Netlingo). At the same time, alternative training data are scarce and expensive to create. Second, integrating a PoS tagger into a higher-order NLP application (e.g. parsing or information extraction) is far from straightforward, particularly if the application would benefit from a probability distribution over possible tags rather than a definitive solution.

This paper reports the following findings that resulted from our work. In general, the right set of features is sufficient to guarantee good performance regardless of the particular model. Furthermore, the proper factorization of features improves performance in addition to eliminating redundancy. Finally, linguistic knowledge proves helpful in constructing a minimalist tagger with a small but efficient feature set, which maintains reasonable performance across corpora. Unlike many existing PoS taggers, ours does not rely on pre-processing to find sentence boundaries, a task that becomes highly nontrivial in Netlingo corpora.

Integrating a tagger into NLP applications often requires customization. It is important to adjust the trade-off between complexity and precision according to the resource limitations at hand. For example, if tagging is performed off-line, speed is not an important factor; online tagging, however, may require sacrificing some accuracy to improve speed. It is equally important to select a tagger that performs with the particular type of precision that matters to the higher-order task. For example, a tagger that makes one error per sentence on average is inappropriate for a parser; yet, as long as it tags proper nouns correctly, it is quite suitable for an application extracting company names. Conversely, a tagger that ignores capitalization may be appropriate for a parser, but not for extracting company names, since it will confuse proper and common nouns. Last but not least, the tagger must perform well on the particular domain of the application.
Even though a unigram model achieves an overall accuracy of 90% [Charniak et al.'93], it relies heavily on lexical information and is next to useless on nonstandard texts that contain a lot of domain-specific terminology. All these considerations make a customizable tagger with flexible performance particularly desirable. Clearly, most applications would benefit from a tagger that delivers a probability distribution over tags rather than a single tag (see Charniak for the opposite view for PCFG parsers). This makes rule-based taggers particularly difficult to integrate into probabilistic applications: while they can be modified to assign multiple tags [Brill'94], doing so would require assigning a likelihood score to the output.

The best-known rule-based tagger [Brill'94] works in two stages: it assigns the most likely tag to each word in the text; then, it applies transformation rules of the form "Replace tag X by tag Y in triggering environment Z". The triggering environments span up to three sequential tokens in each direction and refer to words, tags, or properties of words within the region. The Brill tagger achieves less than 3.5% error on the Wall Street Journal (WSJ) corpus. However, its performance depends on a comprehensive vocabulary (70,697 words) employed in the first stage.

At least as far as integration into probabilistic applications is concerned, statistical taggers have a natural advantage over rule-based ones. Statistical tagging is a classic application of Markov Models (MMs). Brants [2000] argues that second-order MMs can also achieve state-of-the-art accuracy, provided they are supplemented by smoothing techniques and mechanisms to handle unknown words. The Trigrams 'n Tags (TnT) tagger handles unknown words by estimating the tag probability given the suffix of the unknown word and its capitalization. The reported 3.3% error for TnT on the WSJ appears to be a result of overfitting. Indeed, this is the maximum performance obtained by training TnT until only 2.9% of words are unknown in the test corpus. A simple examination of the WSJ shows that reaching this percentage of unknown words in the testing section (10% of the WSJ corpus) requires building an unreasonably large lexicon of nearly all (about 44,000) words seen in the training section (90% of the WSJ).

Hidden MMs (HMMs) are trained on a dictionary with information about the possible PoS of words [Jelinek'85; Kupiec'92], which means HMM taggers also rely heavily on lexical information. Previous work reveals that PoS tags depend on a variety of sub-lexical features, as well as on the likelihood of tag/tag and tag/word sequences. The Conditional Random Fields (CRF) model [Lafferty et al.'02] outperforms the HMM tagger on unknown words by incorporating information about orthographic and morphological features: it checks whether the first character of a word is capitalized or numeric, and it registers the presence of a hyphen and of morphologically relevant suffixes (-ed, -ly, -s, -ion, -tion, -ity, -ies). The authors note that CRF-based taggers are potentially flexible because they can be combined with feature-induction algorithms. However, training is complex (AdaBoost + forward-backward) and slow (1000 iterations with an optimized initial parameter vector; it fails to converge with unbiased initial conditions), and the relative contribution of the individual features in this model is unclear.
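To make these sub-lexical cues concrete, the following is a minimal sketch of such a feature extractor in Python. The function name and feature encoding are our own illustration, not the published CRF or MaxEnt implementations; only the feature inventory (initial capital, initial digit, hyphen, and the listed suffixes) comes from the description above.

```python
# Minimal sketch (not the published implementation) of the orthographic and
# morphological features used by CRF/MaxEnt-style taggers described above.

SUFFIXES = ("ed", "ly", "s", "ion", "tion", "ity", "ies")  # from the CRF feature list

def orthographic_features(token: str) -> dict:
    """Map one token to the binary sub-lexical features discussed above."""
    feats = {
        "init_cap": token[:1].isupper(),    # first character capitalized?
        "init_digit": token[:1].isdigit(),  # first character numeric?
        "has_hyphen": "-" in token,         # hyphenated token?
    }
    for s in SUFFIXES:                      # morphologically relevant suffixes
        feats[f"suffix_{s}"] = token.lower().endswith(s)
    return feats

# Example: a capitalized, hyphenated token ending in -ed
print(orthographic_features("Well-tagged"))
```

Features of this kind are cheap to compute and, unlike a lexicon, degrade gracefully on unknown words, which is why they recur across the models surveyed here.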
The Maximum Entropy (MaxEnt) tagger [Ratnaparkhi'96] accounts for the joint distribution of PoS tags and features of a sentence with an exponential model. Its features are along the lines of the CRF model:

1. Does the token contain a capital letter?
2. Does the token contain a hyphen?
3. Does the token contain a number?
4. Frequent prefixes, up to 4 letters long.
5. Frequent suffixes, up to 4 letters long.

In addition, Ratnaparkhi uses lexical information on frequent words in a context of five words. The sizes of the current word, prefix, and suffix lists were 6458, 3602, and 2925, respectively. Features frequently observed in a training corpus are selected from a candidate feature pool. The parameters of the model are estimated using the computationally intensive procedure of Generalised Iterative Scaling, which maximizes the conditional probability of the training set given the model. The MaxEnt tagger has a 3.4% error rate.

Our initial investigation was meant to examine whether this high performance should be attributed to the model or to the selection of features, and to determine which features are essential and which are redundant. Another question is how much of the performance is due to the lexicon, which turns out to cover 90% of the tokens in the corpus. Since assigning each word its most frequent tag yields 90% accuracy [Charniak et al.'93], 90% lexicon coverage alone would suggest a baseline performance of 81% (0.9 × 0.9). In order to address these issues, we reuse the feature set of MaxEnt in a new model, which is gradually pruned down to a small, fast, versatile and customizable tagger.

2 PoS Tagging Bayesian Net

This section presents our tagger, which combines the features suggested in the literature to date into a Dynamic Bayesian Network (DBN). We briefly introduce the essential aspects of DBNs here and refer the reader to a recent PhD thesis [Murphy'02] for an excellent survey.

A DBN is a Bayesian network unwrapped in time, such that it can represent dependencies between variables at adjacent time slices. More formally, a DBN consists of two models, $B_0$ and $B_\rightarrow$, where $B_0$ defines the initial distribution over the variables at time 0 by specifying:

• a set of variables $X_1, \ldots, X_n$;
• a directed acyclic graph over the variables;
• for each variable $X_i$, a table specifying the conditional probability of $X_i$ given its parents in the graph, $\Pr(X_i \mid \mathrm{Par}\{X_i\})$.

The joint probability distribution over the initial state is

$$\Pr(X_1, \ldots, X_n) = \prod_{i=1}^{n} \Pr(X_i \mid \mathrm{Par}\{X_i\}).$$

The transition model $B_\rightarrow$ specifies the conditional probability distribution (CPD) over the state at time $t$ given the state at time $t-1$. $B_\rightarrow$ consists of:

• a directed acyclic graph over the variables $X_1, \ldots, X_n$ and their predecessors $X_1^-, \ldots, X_n^-$;
• for each variable $X_i$, a table specifying the conditional probability of $X_i$ given its parents $\Pr(X_i \mid \mathrm{Par}\{X_i\})$, where the parents may belong to either time slice.
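As a concrete instance of this factorization, the sketch below scores a tag/word sequence with the simplest possible two-variable-per-slice DBN: each slice holds a tag and a word, the tag depends on the previous slice's tag, and the word depends on the tag (i.e. an HMM viewed as a DBN). All probability tables are toy values invented purely for illustration, not parameters from the paper.

```python
# Minimal sketch of the DBN factorization Pr(X_1,...,X_n) = prod_i Pr(X_i | Par{X_i})
# for a toy network: tag_t depends on tag_{t-1}, word_t depends on tag_t.
# All tables below are invented toy values for illustration only.

initial = {"DT": 0.6, "NN": 0.3, "VB": 0.1}           # B0: Pr(tag_0)
transition = {                                         # B->: Pr(tag_t | tag_{t-1})
    "DT": {"DT": 0.1, "NN": 0.8, "VB": 0.1},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
emission = {                                           # Pr(word_t | tag_t)
    "DT": {"the": 0.9, "watch": 0.0, "stops": 0.1},
    "NN": {"the": 0.0, "watch": 0.7, "stops": 0.3},
    "VB": {"the": 0.0, "watch": 0.4, "stops": 0.6},
}

def joint(tags, words):
    """Pr(tags, words): multiply each variable's CPD given its parents."""
    p = initial[tags[0]] * emission[tags[0]][words[0]]       # slice 0 (model B0)
    for t in range(1, len(tags)):                            # slices 1..T (model B->)
        p *= transition[tags[t - 1]][tags[t]] * emission[tags[t]][words[t]]
    return p

print(joint(["DT", "NN", "VB"], ["the", "watch", "stops"]))
# = 0.6 * 0.9 * 0.8 * 0.7 * 0.6 * 0.6 = 0.108864
```

The tagger described in this paper factorizes over many more per-slice variables (the sub-lexical features above rather than a single word node), but the joint probability is assembled from local CPDs in exactly this way.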

