Learning Hidden Markov Model Structure for Information Extraction

  • Kristie Seymore, Andrew McCallum, Ronald Rosenfeld
  • Published 1999

Abstract

[Figure 1: Example HMM.]

… optimal tradeoff between fit to the data and model size has been reached. This relationship is expressed using Bayes' rule as:

$$P(M|D) \propto P(D|M)\,P(M). \qquad (3)$$

$P(D|M)$ can be calculated with the forward algorithm, or approximated with the probability of the Viterbi paths. The model prior can be formulated to reflect a preference for smaller models. We are implementing Bayesian model merging so that learning the appropriate model structure for extraction tasks can be accomplished automatically.

Labeled, Unlabeled, and Distantly-labeled Data

Once a model structure has been selected, the transition and emission parameters need to be estimated from training data. While obtaining unlabeled training data is generally not too difficult, acquiring labeled training data is more problematic. Labeled data is expensive and tedious to produce, since manual effort is involved. It is also valuable, since the counts of class transitions $c(q \rightarrow q')$ and the counts of a word $\sigma$ occurring in a class, $c(q \uparrow \sigma)$, can be used to derive maximum likelihood estimates for the parameters of the HMM:

$$\hat{P}(q \rightarrow q') = \frac{c(q \rightarrow q')}{\sum_{s \in Q} c(q \rightarrow s)}, \qquad (4)$$

$$\hat{P}(q \uparrow \sigma) = \frac{c(q \uparrow \sigma)}{\sum_{\rho \in \Sigma} c(q \uparrow \rho)}. \qquad (5)$$

Smoothing of the distributions is often necessary to avoid probabilities of zero for the transitions or emissions that do not occur in the training data. Unlabeled data, on the other hand, can be used with the Baum-Welch training algorithm (Baum 1972) to train model parameters. The Baum-Welch algorithm is an iterative expectation-maximization algorithm that, given an initial parameter configuration, adjusts model parameters to locally maximize the likelihood of the unlabeled data. Baum-Welch training suffers from the fact that it finds local maxima, and is thus sensitive to initial parameter settings.

A third source of valuable training data is what we refer to as distantly-labeled data. Sometimes it is possible to find data that is labeled for another purpose, but which can be partially applied to the domain at hand. In these cases, it may be that only a portion of the labels are relevant, but the corresponding data can still be added into the model estimation process in a helpful way. For example, BibTeX files are bibliography databases that contain labeled citation information. Several of the labels that occur in citations, such as title and author, also occur in the headers of papers, and this labeled data can be used in training emission distributions for header extraction. However, several of the BibTeX fields are not relevant to the header extraction task, and the data does not include any information about the sequences of classes in headers.

Experiments

The goal of our information extraction experiments is to extract relevant information from the headers of computer science research papers. We define the header of a research paper to be all of the words from the beginning of the paper up to either the first section of the paper, usually the introduction, or to the end of the first page, whichever occurs first. The abstract is automatically located using regular expression matching and changed to the single token +ABSTRACT+. Likewise, a single token is added to the end of each header, either +INTRO+ or +PAGE+, to indicate the case which terminated the header.
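As a rough illustration of this preprocessing step (not the authors' code), the following sketch collapses the abstract to a single token and appends the terminator; the function name and the exact regular expression are assumptions.

```python
import re

def preprocess_header(header_text, ended_at_intro):
    """Collapse the abstract to +ABSTRACT+ and append a terminator token.

    A sketch of the preprocessing described above.  The pattern below is an
    assumption: it replaces the abstract heading and its body (up to a blank
    line or the end of the header) with a single token.
    """
    text = re.sub(r'(?is)\babstract\b.*?(?=\n\s*\n|\Z)', '+ABSTRACT+', header_text)
    # Record whether the header ended at the first section or at the end of page 1.
    terminator = '+INTRO+' if ended_at_intro else '+PAGE+'
    return text + ' ' + terminator
```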
A few special classes of words are identified using simple regular expressions and converted to special tokens, such as <EMAIL>, <WEB>, <YEAR_NUMBER>, <ZIP_CODE>, <NUMBER>, and <PUBLICATION_NUMBER>. All punctuation, case and newline information is removed from the text. The target classes we wish to identify include the following fifteen categories: title, author, affiliation, address, note, email, date, abstract, introduction (intro), phone, keywords, web, degree, publication number (pubnum), and page. The abstract, intro and page classes are each represented by a state that outputs only one token, +ABSTRACT+, +INTRO+, or +PAGE+, respectively. The degree class captures the language associated with Ph.D. or Master's theses, such as "submitted in partial fulfillment of..." and "a thesis by...". The note field commonly accounts for phrases from acknowledgements, copyright notices, and citations.

One thousand headers were manually tagged with class labels. Sixty-five of the headers were discarded due to poor formatting, and the rest were split into a 500-header, 23,557-word-token labeled training set and a 435-header, 20,308-word-token test set. Five thousand unlabeled headers, composed of 287,770 word tokens, were designated as unlabeled training data. Distantly-labeled training data was acquired from 176 BibTeX files that were collected from the Web. These files consist of 2.5 million words, which contribute to the following nine header classes: address, affiliation, author, date, email, keyword, note, title, and web. The training data sources and amounts are summarized in Table 1.

  Type               Source             Word Tokens
  Labeled            500 headers        23,557
  Unlabeled          5,000 headers      287,770
  Distantly-labeled  176 BibTeX files   2,463,834

Table 1: Sources and amounts of training data.

Class emission distributions are trained with either the labeled training data (L), the labeled and distantly-labeled data (L+D), or with all three data sets (L+D+U). In each case, a fixed vocabulary is derived based on the data set used. The labeled data contains 5,053 distinct words. Words that occur more than once in the distantly-labeled and unlabeled data are combined with the labeled data to produce a 51,526-word vocabulary (L+D) and a 54,308-word vocabulary (L+D+U). The unknown word token <UNK> is added to the vocabularies to model out-of-vocabulary words, and any words in the training or testing data that are not in the vocabulary are mapped to this token. The words from the distantly-labeled and unlabeled data that are excluded from the vocabulary are used to estimate the probability of the unknown word.

Model Selection

We build several HMM models with varying numbers of states and different parameter settings, and test the models by finding the Viterbi paths for the test set headers. Performance is measured by word classification accuracy, which is the percentage of header words that are emitted by a state with the same label as the words' true label. The first set of models uses one state per class; we refer to these models as baseline models. Emission distributions are trained for each class on either the labeled data (L) or the labeled and distantly-labeled data (L+D) with the appropriate vocabulary. The maximum likelihood estimates are smoothed using Witten-Bell smoothing (Witten & Bell 1991) to avoid probabilities of zero for the vocabulary words that are not observed in the training data for a particular class.
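To make the emission estimate of equation (5) and the smoothing step concrete, here is a minimal sketch of a Witten-Bell-smoothed emission distribution for one state. It follows a standard formulation of Witten-Bell smoothing; the exact variant used here may differ, and the data structures are assumptions.

```python
from collections import Counter

def witten_bell_emissions(words_emitted_by_state, vocabulary):
    """Smoothed emission distribution P(word | state) for one HMM state.

    Observed words keep a discounted maximum likelihood estimate
    (equation 5); the reserved mass T / (N + T) is shared uniformly by
    vocabulary words never seen in this state's training data.  Assumes the
    state emitted at least one training word.
    """
    counts = Counter(words_emitted_by_state)
    n = sum(counts.values())                 # N: total emission count
    t = len(counts)                          # T: distinct word types observed
    unseen = [w for w in vocabulary if w not in counts]

    probs = {}
    for w in counts:
        probs[w] = counts[w] / (n + t)                 # discounted ML estimate
    for w in unseen:
        probs[w] = t / (len(unseen) * (n + t))         # share of the reserved mass
    return probs
```

In this setting, each class's distribution would be estimated this way from the L or L+D counts, with <UNK> included in the vocabulary.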
Extraction accuracy results for the baseline models are reported in Table 2.

  Model        Number of states  Number of links  Accuracy (L)  Accuracy (L+D)
  1            17                255              55.9          53.9
  2            17                252              72.6          82.5
  3, baseline  17                149              77.9          88.6
  4            17                255              77.5          88.1

Table 2: Extraction accuracy (%) for models with one state per class.

The first model is a fully-connected model where all transitions are assigned uniform probabilities. It relies only on the emission distributions to choose the best path through the model, and achieves an accuracy of 53.9% when trained on the labeled and distantly-labeled data. The second model is similar, except that the self-transition probability is set according to the maximum likelihood estimate from the labeled data, with all other transitions set uniformly. This model benefits from the additional information of the expected number of words to be emitted by each state, and its accuracy jumps to 82.5%. The third model sets all transition parameters to their maximum likelihood estimates, and achieves the best result of 88.6% among this set of models. The fourth model adds an additional smoothing count of one to each transition, so that all transitions have non-zero probabilities, but smoothing the transition probabilities does not improve tagging accuracy.

It is important to note the 10.7% absolute improvement due to training with distantly-labeled data. The third model performs at 77.9% when trained on only the labeled data, but improves to 88.6% when the distantly-labeled data is used. We refer to this model as the "baseline" model in the next comparisons.

Next, we want to see if a model with multiple states per class outperforms the baseline model. We first consider building these model structures by a combination of automated and manual techniques. Starting from a neighbor-merged model of 805 states built from 100 randomly selected labeled training headers, states with the same class label are manually merged in an iterative manner. (We use only 100 of the 500 headers to keep the manual state selection process manageable.) Transition counts are preserved throughout the merges so that maximum likelihood transition probabilities can be estimated. Each state uses its smoothed class emission distribution estimated from the combination of the labeled and distantly-labeled data (L+D).

Extraction performance, measured as the number of states decreases, is plotted in Figure 2. The performance of the baseline model is indicated on the figure with a "+". The models with multiple states per class outperform the baseline model, particularly when 30 to 40 states are present. The best performance of 90.1% is obtained when the model contains 36 states.

[Figure 2: Extraction accuracy for multi-state models as states are merged; accuracy (%) against number of states, with curves for the hand-merged and automatically merged models trained with distantly-labeled data.]

This result shows that more complex model structure benefits extraction performance of HMMs on the header task. We compare this result to the performance of the 155-state V-merged model created entirely automatically from the labeled training data.
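All of the accuracies in these comparisons are word classification accuracies computed from Viterbi paths over the test headers. The sketch below shows that evaluation in its standard form; the dictionary layout, the <UNK> handling, and the uniform start are assumptions, not the implementation used here.

```python
def viterbi(tokens, states, log_trans, log_emit):
    """Most likely state sequence for one header (standard Viterbi algorithm).

    log_trans[p][q] and log_emit[q][word] are log-probabilities.  Words are
    assumed to have been mapped to '<UNK>' when out of vocabulary, and a
    uniform start distribution is used for brevity.
    """
    V = [{q: log_emit[q].get(tokens[0], log_emit[q]['<UNK>']) for q in states}]
    back = []
    for word in tokens[1:]:
        col, ptr = {}, {}
        for q in states:
            prev, score = max(((p, V[-1][p] + log_trans[p][q]) for p in states),
                              key=lambda x: x[1])
            col[q] = score + log_emit[q].get(word, log_emit[q]['<UNK>'])
            ptr[q] = prev
        V.append(col)
        back.append(ptr)
    state = max(V[-1], key=V[-1].get)        # best final state
    path = [state]
    for ptr in reversed(back):               # follow back-pointers
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

def word_accuracy(predicted, truth):
    """Percentage of header words emitted by a state with the correct label."""
    return 100.0 * sum(p == t for p, t in zip(predicted, truth)) / len(truth)
```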
A summary of the results of the baseline model, the best multi-state model, and the V-merged model is presented in Table 3.

  Model        Number of states  Number of links  Accuracy (L)  Accuracy (L+D)
  baseline     17                149              77.9          88.6
  multi-state  36                164              78.7          90.1
  V-merged     155               402              77.7          89.1

Table 3: Extraction accuracy (%) for models learned from data compared to the best baseline model.

The V-merged model does slightly better than the baseline model, but not as well as the multi-state model. The manually merged model with multiple states per class performs best. We expect that our future work on Bayesian model merging will result in a fully automated construction procedure that produces models performing as well as or better than the manually created multi-state model.

Next, we investigate how to incorporate unlabeled data into our parameter training scheme. We start with the best baseline and multi-state models and run Baum-Welch training on the unlabeled data. Initial parameters are set to the maximum likelihood transition probabilities from the labeled training data and smoothed emission distributions from the labeled and distantly-labeled data based on the L+D+U vocabulary. Baum-Welch training produces new transition and emission parameter values which locally maximize the likelihood of the unlabeled data.

The models are tested under three different conditions; the extraction results, as well as the model perplexities on the test set, are shown in Table 4. Perplexity is a measure of how well the HMMs model the data; a lower value indicates a model that assigns a higher likelihood to the observations from the test set.

              baseline      multi-state
              Acc.   PP     Acc.   PP
  initial     88.5   816    90.1   743
  λ = 0.5     89.0   416    85.5   392
  λ varies    88.8   373    84.7   352

Table 4: Extraction accuracy (%) and test set perplexity (PP) for the baseline and multi-state models after Baum-Welch training.

The "initial" result is the performance of the models using the initial parameter estimates. We can see that using the slightly larger vocabulary (L+D+U) does not provide a gain in classification accuracy compared to the results from Table 3. Since the vocabulary words that do not occur in the unlabeled data are given a probability of zero in the newly-estimated emission distributions resulting from Baum-Welch training, the new distributions need to be smoothed with the initial estimates. Each state's newly-estimated emission distribution is linearly interpolated with its initial distribution using a mixture weight of λ. For the "λ = 0.5" setting, both distributions for each state use a weight of 0.5.

Alternatively, the Viterbi paths of the labeled training data can be computed for each model using the "λ = 0.5" emission distributions. The words emitted by each state are then used to estimate optimal mixture weights for the local and initial distributions using the EM algorithm. The two distributions for each state are interpolated with the optimal mixture weight values, and the resulting model is tested on the test set. These results are reported as "λ varies".

The extraction accuracies when using the Baum-Welch estimates from the unlabeled data do slightly improve for the baseline model, but degrade for the multi-state model. The lack of improvement in classification accuracy can be partly explained by the fact that Baum-Welch training maximizes the likelihood of the unlabeled data, not the classification accuracy. The better modeling capabilities are pointed out through the improvement in test set perplexity.
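The "λ = 0.5" and "λ varies" rows both rest on a per-state linear interpolation of the Baum-Welch emission estimates with the initial smoothed estimates, and the PP column is per-word test set perplexity. A minimal sketch of both quantities, with variable names that are our assumptions:

```python
import math

def interpolate_emissions(reestimated, initial, lam=0.5):
    """Mix a state's Baum-Welch emission estimates with its initial smoothed
    distribution.  lam = 0.5 corresponds to the "lambda = 0.5" setting;
    per-state weights fit by EM give the "lambda varies" setting.
    """
    vocab = set(reestimated) | set(initial)
    return {w: lam * reestimated.get(w, 0.0) + (1.0 - lam) * initial.get(w, 0.0)
            for w in vocab}

def perplexity(total_log_likelihood, num_tokens):
    """Per-word perplexity of the test set, given its total natural-log
    likelihood under the model; lower values indicate a better fit."""
    return math.exp(-total_log_likelihood / num_tokens)
```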
The perplexity of the test set improves over the initial settings with Baum-Welch reestimation, and improves even further with careful selection of the emission distribution mixture weights. Merialdo (1994) finds a similar effect on tagging accuracy when training part-of-speech taggers with Baum-Welch training starting from well-estimated initial parameters.

Error Breakdown

We conclude these experiments with a breakdown of the errors being made by the best performing models. Table 5 shows the errors in each class for the best baseline and multi-state models when using emission distributions trained on labeled (L) and labeled and distantly-labeled (L+D) data. Classes for which there is distantly-labeled training data are marked with an asterisk.

                baseline       multi-state
  Tag           L      L+D     L      L+D
  All           77.9   88.6    78.7   90.1
  Abstract      99.7   99.7    96.5   98.4
  Address*      81.9   83.1    82.3   84.1
  Affiliation*  91.8   90.1    92.1   89.4
  Author*       50.8   92.8    50.6   93.2
  Date*         98.6   93.7    99.3   93.0
  Degree        77.7   79.9    80.1   81.2
  Email*        91.0   89.8    90.1   86.9
  Keyword*      73.5   91.2    88.8   98.5
  Note*         78.1   78.6    80.6   84.6
  Phone         94.3   97.1    92.6   94.9
  Pubnum        70.1   63.5    69.3   64.2
  Title*        73.9   98.7    70.7   98.3
  Web*          94.4   80.6    50.0   41.7

Table 5: Individual field results for the best baseline and multi-state models. Fields marked with an asterisk occur in distantly-labeled data.

For several of the classes, such as title and author, there is a dramatic increase in accuracy when the distantly-labeled data is included. The poorest performing individual classes are the degree, note and publication-number classes. The web class has a low accuracy for the multi-state model, where the limited web class examples in the 100 training headers probably kept the web state from having transitions to and from as many states as necessary.

Conclusions and Future Work

Our experiments show that hidden Markov models do well at extracting important information from the headers of research papers. We achieve an accuracy of 90.1% over all fields of the headers, and class-specific accuracies of 98.3% for titles and 93.2% for authors. We have demonstrated that models that contain more than one state per class do provide increased extraction accuracy over models that use only one state per class. This improvement is due to the more specific transition context modeling that is possible with more states. We expect that it is also beneficial to have localized emission distributions, which can capture distribution variations that are dependent on the position of the class in the header.

Distantly-labeled data has proven to be valuable in providing robust parameter estimates. The addition of distantly-labeled data provides a 10.7% improvement in extraction accuracy for headers. In cases where little labeled training data is available, distantly-labeled data can be selectively applied to improve parameter estimates.

Forthcoming experiments include using Bayesian model merging to learn model structure completely automatically from data, as well as taking advantage of additional header features such as the positions of the words on the page.
We expect the inclusion of layout information to particularly improve extraction accuracy. Finally, we also plan to model internal state structure, in order to better capture the first and last few words absorbed by each state. A possibly useful internal model structure is displayed in Figure 3. In this case, the distributions for the first and last two words are modeled explicitly, and an internal state emits all other words. We expect these improvements will contribute to the development of more accurate models for research paper header extraction.

[Figure 3: Proposed internal model structure for states (illustrated with the affiliation class).]

References

Baum, L. 1972. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities 3:1-8.

Bikel, D. M.; Miller, S.; Schwartz, R.; and Weischedel, R. 1997. Nymble: a high-performance learning name-finder. In Proceedings of ANLP-97, 194-201.

Freitag, D., and McCallum, A. 1999. Information extraction with HMMs and shrinkage. Submitted to the AAAI-99 Workshop on Machine Learning for Information Extraction.

Kupiec, J. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language 6:225-242.

Leek, T. R. 1997. Information extraction using hidden Markov models. Master's thesis, UC San Diego.

McCallum, A.; Nigam, K.; Rennie, J.; and Seymore, K. 1999. Building domain-specific search engines with machine learning techniques. In Proceedings of the AAAI Spring Symposium on Intelligent Agents in Cyberspace.

Merialdo, B. 1994. Tagging English text with a probabilistic model. Computational Linguistics 20(2):155-171.

Rabiner, L. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2).

Stolcke, A.; Shriberg, E.; et al. 1998. Dialog act modeling for conversational speech. In Applying Machine Learning to Discourse Processing, 1998 AAAI Spring Symposium, number SS-98-01, 98-105. Menlo Park, CA: AAAI Press.

Stolcke, A. 1994. Bayesian Learning of Probabilistic Language Models. Ph.D. Dissertation, University of California, Berkeley, CA.

Viterbi, A. J. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory IT-13:260-267.

Witten, I. H., and Bell, T. C. 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory 37(4).

Yamron, J.; Carp, I.; Gillick, L.; Lowe, S.; and van Mulbregt, P. 1998. A hidden Markov model approach to text segmentation and event tracking. In Proceedings of the IEEE ICASSP.
