Towards pertinent evaluation methodologies for word-space models

Abstract

This paper discusses evaluation methodologies for a particular kind of meaning model, the word-space model, which uses distributional information to assemble geometric representations of meaning similarities. Word-space models have received considerable attention in recent years, and have begun to see employment outside the walls of computational linguistics laboratories. However, the evaluation methodologies for such models remain in their infancy, and lack efforts at standardization. Very few studies have critically assessed the methodologies used to evaluate word spaces. This paper attempts to fill some of this void. It is the central goal of this paper to answer the question “how can we determine whether a given word space is a good word space?”

1. Word-space models

Word-space models (Gallant, 1991b; Schütze, 1993; Lund et al., 1995; Landauer and Dumais, 1997; Sahlgren, 2005) use distributional statistics to acquire representations of word meaning. The underlying hypothesis behind these models is that the distributional profiles of words are symptomatic of their semantic content, and that a geometric representation of these profiles is computationally (and, some would argue, cognitively) plausible. Both the distributional hypothesis of word meaning and the geometric representational scheme have proven their mettle in such diverse experimental settings as information retrieval (Deerwester et al., 1990; Gallant, 1991a; Jiang and Littman, 2000), vocabulary tests (Landauer and Dumais, 1997; Karlgren and Sahlgren, 2001), word sense disambiguation (Schütze, 1992), lexical priming tests (Lund and Burgess, 1996; McDonald and Lowe, 1998), text categorization (Sahlgren and Cöster, 2004), and so on. There is certainly no shortage of research results arguing for the viability of the approach.

Thanks in large part to these experimental results, word-space models are becoming established as part of the basic arsenal of language technology. In addition to purely experimental relevance, there is a growing interest in using word-space models for more practically oriented applications, such as knowledge assessment, information extraction, and spam filtering. Furthermore, word-space models are increasingly used for the automatic construction of language resources. To take but one example, word-space models have been used with greatly encouraging results for acquiring thesauri from raw data (Sahlgren and Karlgren, 2005). Such applications will become ever more common — and useful — in the face of a rapidly expanding and flourishing multilingual, multi-cultural, and multi-ethnic computational linguistics community. Word-space modeling is, to say the least, an active area of research.

2. The need for critical assessment of evaluation methodologies

Despite (or perhaps due to) this optimistic climate, the quality of the evaluation methodologies used in word-space research has not received much attention. This is remarkable, for a number of reasons. First of all, different implementations of word-space models (such as HAL (Lund et al., 1995), LSA (Landauer and Dumais, 1997), and Random Indexing (RI) (Kanerva et al., 2000)) use different kinds of distributional information to produce word spaces. HAL uses word adjacency, LSA uses occurrences in documents, and RI can be used with both of these types of distributional information. Considering that these implementations use different kinds of information to assemble word spaces, it would seem natural to assume that the spaces would contain different kinds of semantic content.
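To make this difference concrete, the sketch below (my own illustration, not code from the paper; the function names and toy corpus are hypothetical) builds the two basic kinds of distributional matrices from the same corpus: word-by-word co-occurrence counts within a small context window, roughly the adjacency information HAL-style models use, and word-by-document occurrence counts, roughly the information LSA-style models start from before any dimensionality reduction.

from collections import defaultdict

def word_by_word_matrix(docs, window=2):
    """Word-by-word co-occurrence counts within a +/- `window` context
    (the kind of adjacency information HAL-style models build on)."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        tokens = doc.lower().split()
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[w][tokens[j]] += 1
    return counts

def word_by_document_matrix(docs):
    """Word-by-document occurrence counts (the kind of information
    LSA-style models build on, before dimensionality reduction)."""
    counts = defaultdict(lambda: defaultdict(int))
    for d, doc in enumerate(docs):
        for w in doc.lower().split():
            counts[w][d] += 1
    return counts

corpus = ["the catcher wears a mitt and a glove",
          "the batter wears a glove"]
print(dict(word_by_word_matrix(corpus)["mitt"]))     # window neighbours of "mitt"
print(dict(word_by_document_matrix(corpus)["glove"]))  # documents containing "glove"

Running both functions on the same corpus shows that the resulting spaces are built from different evidence: the adjacency counts relate a word to its immediate neighbours, whereas the document-based counts only record which texts it occurs in.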
Even so, remarkably few studies have investigated how different kinds of distributional information affect the representations; Lavelli et al. (2004) is one of the very few. To make matters even worse, it is not even obvious what “meaning” means in this context. When we talk about meaning in general discourse, we include a considerable amount of extralinguistic knowledge in the concept of meaning. Part of what I know when I say I know the meaning of “mitt” is what kind of object “mitt” refers to. Such information is arguably not available to word-space models that only consider intralinguistic distributional regularities as data. Although a word-space model might correctly associate “mitt” with “pad” and “glove”, it will not be able to reach out into the world and pick out the right kind of object. Thus, “meaning” obviously has a more specific meaning in the context of word-space research, but few — if any — publications explain what this meaning is. We are left guessing what “meaning” means in word-space research.

Conceptual opaqueness is all too often neglected in favor of experimentalism within the field of computational linguistics. Granted, empirical evidence should weigh just as heavily as theoretical arguments, but this is only true if we know what the evidence is evidence of. The point is that there can be no evidence unless there is a case. One may seriously question the validity of the research when neither the conceptual nor the evaluative basis is well founded. The problem with too light-heartedly accepting frail or even ill-advised evaluation methodologies is especially severe when the experimental models are treated as standard tools used to build language resources, since any latent flaws in the underlying machinery will inescapably affect the quality of the resource. Consider the not too uncommon case where a word-space model is used to compile a lexical resource, or to solve a retrieval or categorization task: unless we know what kind of information is captured in the word-space model, we will not know what kind of information the lexical resource contains, or why the retrieval or categorization task succeeded or failed.

3. Evaluation methodologies in word-space research

In an attempt to taxonomize word-space evaluation practices, we can make a distinction between direct and indirect evaluation methodologies. Direct evaluations are concerned with the geometry of the word space, which typically means measuring Euclidean distances between words. The idea is that if word A and word B are closer to each other in the word space than to word C, they are assumed to be more semantically similar to each other than to word C; distance in word space reflects semantic similarity. Of course, there are a number of different measures available for calculating the distance or similarity between objects in a Euclidean vector space.[1] Examples of commonly used measures are the cosine of the angle between the vectors[2] and different Minkowski metrics.[3] Note that, although these measures do produce different similarity scores for a given vector space, they do not change the underlying model.

These geometric measures can be evaluated by comparing them to similarities found in human artefacts such as lexica, priming data, association norms, synonym tests, antonym tests, etc. For artefacts that constitute semantic repositories, such as lexica, priming data and association norms, the evaluation measure is how closely the word space resembles the repository — e.g.
the fraction of words that occur both in the repository entries and in the word-space neighborhoods. For vocabulary tests, such as synonym and antonym tests, the evaluation measure is performance (normally the percentage of correct answers) in solving the test.

Indirect evaluations, on the other hand, are not directly concerned with the geometry of word spaces. Instead, these evaluations apply word spaces to various kinds of applications and tasks, the execution of which is normally assumed to require semantic knowledge. Examples include information retrieval and information filtering, word sense disambiguation, text summarization, text categorization, etc.

Footnotes

[1] The difference between distance and similarity measures is that the former produce a low score for similar objects, whereas the latter produce a high score for the same objects: small distance equals large similarity, and conversely. It is trivial to transform a distance measure dist(x, y) into a similarity measure sim(x, y) by e.g. computing sim(x, y) = 1 / dist(x, y).

[2] sim_cos(x, y) = (x · y) / (|x| |y|) = (∑_{i=1}^{n} x_i y_i) / (√(∑_{i=1}^{n} x_i²) · √(∑_{i=1}^{n} y_i²))

[3] Minkowski metrics of order N: dist_minkowski(x, y) = (∑_{i=1}^{n} |x_i − y_i|^N)^(1/N), with the city-block (N = 1) and Euclidean (N = 2) distances as special cases.
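To make the direct-evaluation procedures concrete, here is a minimal sketch (my own illustration, not code from the paper; the toy vectors, the test item, and helper names such as synonym_test_score are hypothetical) that scores a word space on a multiple-choice synonym test using the cosine measure of footnote [2], and computes the overlap between a word-space neighborhood and a repository entry such as a thesaurus listing.

import math

def cos(x, y):
    """Cosine similarity, as in footnote [2]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def synonym_test_score(space, items):
    """Fraction of test items where the cosine-nearest alternative is the correct synonym."""
    correct = 0
    for target, alternatives, answer in items:
        guess = max(alternatives, key=lambda alt: cos(space[target], space[alt]))
        correct += (guess == answer)
    return correct / len(items)

def neighborhood_overlap(space, word, repository_entry, k=5):
    """Fraction of a repository entry (e.g. a thesaurus entry) found among
    the k nearest word-space neighbours of `word`."""
    neighbours = sorted((w for w in space if w != word),
                        key=lambda w: cos(space[word], space[w]),
                        reverse=True)[:k]
    return len(set(neighbours) & set(repository_entry)) / len(repository_entry)

# Toy word space; the vectors are made up for illustration only.
space = {"mitt": [2, 0, 1], "glove": [2, 1, 1], "pad": [1, 0, 1], "car": [0, 3, 0]}
print(synonym_test_score(space, [("mitt", ["glove", "car"], "glove")]))
print(neighborhood_overlap(space, "mitt", ["glove", "pad"], k=2))

Reporting both numbers for the same space makes the contrast within direct evaluation explicit: the synonym-test score measures performance on a vocabulary test, while the neighborhood overlap measures how closely the word space resembles a semantic repository.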
