Online Entropy-Based Model of Lexical Category Acquisition


Children learn a robust representation of lexical categories at a young age. We propose an incremental model of this process which efficiently groups words into lexical categories based on their local context using an information-theoretic criterion. We train our model on a corpus of child-directed speech from CHILDES and show that the model learns a fine-grained set of intuitive word categories. Furthermore, we propose a novel evaluation approach by comparing the efficiency of our induced categories against other category sets (including traditional part of speech tags) in a variety of language tasks. We show that the categories induced by our model typically outperform the other category sets.

1 The Acquisition of Lexical Categories

Psycholinguistic studies suggest that early on children acquire robust knowledge of abstract lexical categories such as nouns, verbs and determiners (e.g., Gelman & Taylor, 1984; Kemp et al., 2005). Children's grouping of words into categories might be based on various cues, including the phonological and morphological properties of a word, the distributional information about its surrounding context, and its semantic features. Among these, the distributional properties of the local context of a word have been studied most thoroughly. It has been shown that child-directed speech provides informative co-occurrence cues, which can be reliably used to form lexical categories (Redington et al., 1998; Mintz, 2003).

The process of learning lexical categories by children is necessarily incremental. Human language acquisition is bounded by memory and processing limitations, and it is implausible that humans process large volumes of text at once and induce an optimum set of categories. Efficient online computational models are needed to investigate whether distributional information is equally useful in an online process of word categorization.
However, the few incremental models of category acquisition which have been proposed so far are generally inefficient and over-sensitive to the properties of the input data (Cartwright & Brent, 1997; Parisien et al., 2008). Moreover, the unsupervised nature of these models makes their assessment a challenge, and the evaluation techniques proposed in the literature are limited.

The main contributions of our research are twofold. First, we propose an incremental entropy model for efficiently clustering words into categories given their local context. We train our model on a corpus of child-directed speech from CHILDES (MacWhinney, 2000) and show that the model learns a fine-grained set of intuitive word categories. Second, we propose a novel evaluation approach by comparing the efficiency of our induced categories against other category sets, including the traditional part of speech tags, in a variety of language tasks. We evaluate our model on word prediction (where a missing word is guessed based on its sentential context), semantic inference (where the semantic properties of a novel word are predicted based on the context), and grammaticality judgment (where the syntactic well-formedness of a sentence is assessed based on the category labels assigned to its words). The results show that the categories induced by our model can be successfully used in a variety of tasks and typically perform better than other category sets.

1.1 Unsupervised Models of Category Induction

Several computational models have used distributional information for categorizing words (e.g., Brown et al., 1992; Redington et al., 1998; Clark, 2000; Mintz, 2002). The majority of these models partition the vocabulary into a set of optimum clusters (e.g., Brown et al., 1992; Clark, 2000). The generated clusters are intuitive, and can be used in different tasks such as word prediction and parsing.
Moreover, these models confirm the learnability of abstract word categories, and show that distributional cues are a useful source of information for this purpose. However, (i) they categorize word types rather than word tokens, and as such provide no account of words belonging to more than one category, and (ii) the batch algorithms used by these systems make them implausible for modeling human category induction. Unsupervised models of PoS tagging such as that of Goldwater & Griffiths (2007) do assign labels to word tokens, but they still typically use batch processing and, more problematically, they hardwire important aspects of the model, such as the final number of categories.

Only a few previously proposed models process data incrementally, categorize word tokens, and do not pre-specify a fixed category set. The model of Cartwright & Brent (1997) uses an algorithm which incrementally merges word clusters so that a Minimum Description Length criterion for a template grammar is optimized. The model treats whole sentences as contextual units, which sacrifices a degree of incrementality and makes it less robust to noise in the input. Parisien et al. (2008) propose a Bayesian clustering model which copes with ambiguity and exhibits the developmental trends observed in children (e.g., the order of acquisition of different categories). However, their model is overly sensitive to context variability, which results in the creation of sparse categories. To remedy this issue they introduce a "bootstrapping" component where the categories assigned to context words are used to determine the category of the current target word. They also perform periodic cluster reorganization. These mechanisms improve the overall performance of the model when trained on large amounts of training data, but they complicate the model with ad-hoc extensions and add to the (already considerable) computational load.
What is lacking is an incremental model of lexical category acquisition which can efficiently process naturalistic input data and gradually build robust categories with little training data.

1.2 Evaluation of the Induced Categories

There is no standard and straightforward method for evaluating unsupervised models of category learning (see Clark, 2003, for discussion). Many unsupervised models of lexical category acquisition treat the traditional part of speech (PoS) tags as the gold standard, and measure the accuracy and completeness of their induced categories based on how closely they resemble the PoS categories (e.g., Redington et al., 1998; Mintz, 2003; Parisien et al., 2008). However, it is not at all clear whether humans form the same types of categories. In fact, many language tasks might benefit from finer-grained categories than the traditional PoS tags used for corpus annotation.

Frank et al. (2009) propose a different, automatically generated set of gold-standard categories for evaluating an unsupervised categorization model. The gold-standard categories are formed according to "substitutability": if one word can be replaced by another and the resulting sentence is still grammatical, then there is a good chance that the two words belong to the same category. They extract 3-word frames from the training data, and form the gold-standard categories from the words that appear in the same frame. They emphasize that in order to provide some degree of generalization, different data sets must be used for forming the gold-standard categories and performing the evaluation. However, the resulting categories are bound to be incomplete, and using them as a gold standard inevitably favors categorization models which use a similar frame-based principle.
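The frame-based construction just described can be sketched in a few lines. The code below is an illustrative toy, not Frank et al.'s implementation, and the sample sentences are invented: every word observed between the same left and right neighbors lands in the same candidate category.

```python
from collections import defaultdict

def frame_categories(sentences):
    """Group words by the 3-word frames they occur in.

    A frame a_X_b puts every word X observed between a and b into
    the same candidate category (a toy rendering of the
    substitutability idea, not Frank et al.'s actual procedure).
    """
    frames = defaultdict(set)
    for sent in sentences:
        tokens = sent.split()
        for left, mid, right in zip(tokens, tokens[1:], tokens[2:]):
            frames[(left, right)].add(mid)
    return frames

# Invented toy input: the frame "you _ it" groups the two verbs.
cats = frame_categories(["you want it now", "you see it now"])
```

Note that with such tiny inputs most frames contain a single word, which illustrates the sparsity worry raised above: categories built this way are bound to be incomplete.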
All in all, using any set of gold-standard categories for evaluating an unsupervised categorization model has the disadvantage of favoring one set of principles and intuitions over another; that is, of assuming that there is a correct set of categories which the model should converge to. Alternatively, automatically induced categories can be evaluated based on how useful they are in performing different tasks. This approach is taken by Clark (2000), where the perplexity of a finite-state model is used to compare different category sets. We build on this idea and propose a more general usage-based approach to evaluating the categories automatically induced from a data set, emphasizing that the ultimate goal of a category induction model is to form categories that can be efficiently used in a variety of language tasks. We argue that for such tasks, a finer-grained set of categories might be more appropriate than the coarse-grained PoS categories. Therefore, we propose a number of tasks on which we compare performance using various category sets, including those induced by our model.

2 An Incremental Entropy-based Model of Category Induction

A model of human category acquisition should possess two key features:

• It should process input as it arrives, and incrementally update the current set of clusters.
• The set of clusters should not be fixed in advance, but rather determined by the characteristics of the input data.

We propose a simple algorithm which fulfills these two conditions. Our goal is to categorize word usages based on the similarity of their form (the content) and their surrounding words (the context). While grouping word usages into categories, we attempt to trade off two conflicting criteria. First, the categories should be informative about the properties of their members. Second, the number and distribution of the categories should be parsimonious.
An appropriate tool for formalizing both informativeness and parsimony is information-theoretic entropy. The parsimony criterion can be formalized as the entropy of the random variable (Y) representing the cluster assignments:
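The quantity referred to here is the standard Shannon entropy, H(Y) = -Σ_y P(y) log P(y), estimated from the relative sizes of the clusters. The sketch below (our own minimal illustration, with invented toy assignments, not the model's actual objective) shows why minimizing this entropy rewards parsimony: a skewed assignment over few clusters scores lower than an even split over many.

```python
import math
from collections import Counter

def assignment_entropy(assignments):
    """Shannon entropy H(Y) of cluster assignments:
    H(Y) = -sum_y P(y) * log2 P(y),
    with P(y) estimated as the fraction of items in cluster y.
    Fewer, more skewed clusters yield lower entropy, so a
    minimum-entropy criterion favors parsimonious clusterings."""
    counts = Counter(assignments)
    n = len(assignments)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Invented toy assignments over four items:
skewed = assignment_entropy([0, 0, 0, 1])   # one dominant cluster
uniform = assignment_entropy([0, 1, 2, 3])  # four singleton clusters
```

On its own this criterion would collapse everything into a single cluster; it is balanced against the informativeness criterion above, which pulls in the opposite direction.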

Cite this paper

@inproceedings{Chrupala2010OnlineEM,
  title     = {Online Entropy-Based Model of Lexical Category Acquisition},
  author    = {Grzegorz Chrupala and Afra Alishahi},
  booktitle = {CoNLL},
  year      = {2010}
}