A prerequisite for all higher level information extraction tasks is the identification of unknown names in text. Today, when large corpora can consist of billions of words, it is of utmost importance to develop accurate techniques for the automatic detection, extraction and categorization of named entities in these corpora. Although named entity recognition(More)
Dramatic improvements in sensor and image acquisition technology have created a demand for automated tools that can aid in the analysis of large image databases. We describe the development of JARtool, a trainable software system that learns to recognize volcanoes in a large data set of Venusian imagery. A machine learning approach is used because it is(More)
As machine learning has graduated from toy problems to "real world" applications, users are finding that "real world" problems require them to perform aspects of problem solving that are not currently addressed by much of the machine learning literature. Specifically, users are finding that the tasks of selecting a set of features to define a problem and(More)
The paper describes a set of experiments involving the application of three state-of-the-art part-of-speech taggers to Ethiopian Amharic, using three different tagsets. The taggers showed worse performance than previously reported results for English, in particular having problems with unknown words. The best results were obtained using a Maximum Entropy(More)
Documents can be assigned keywords by frequency analysis of the terms found in the document text, which arguably is the primary source of knowledge about the document itself. By including a hierarchically organised domain-specific thesaurus as a second knowledge source, the quality of such keywords was improved considerably, as measured by match to previously(More)
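The frequency-analysis step mentioned in the abstract above can be illustrated with a minimal sketch. It covers only raw term counting, not the thesaurus component; the function name `top_keywords`, the tokenizer, and the tiny stop-word list are illustrative assumptions, not taken from the paper.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; the paper's list is not given).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def top_keywords(text, k=3):
    """Return the k most frequent non-stop-word terms in text."""
    # Lowercase and keep alphabetic runs only; a deliberately crude tokenizer.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(k)]

if __name__ == "__main__":
    sample = ("The volcano erupted. The volcano ash covered the town "
              "and the ash spread.")
    print(top_keywords(sample, k=2))
```

A thesaurus-based second knowledge source, as the abstract describes, would presumably adjust or regroup these counts, but how it does so is not stated in the visible text.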
An ensemble is a classifier created by combining the predictions of multiple component classifiers. We present a new method for combining classifiers into an ensemble based on a simple estimation of each classifier's competence. The classifiers are grouped into an ordered list where each classifier has a corresponding threshold. To classify an example, the first(More)
Divide-and-Conquer (DAC) and Separate-and-Conquer (SAC) are two strategies for rule induction that have been used extensively. When searching for rules, DAC is maximally conservative w.r.t. decisions made during search for previous rules. This results in a very efficient strategy, which however suffers from difficulties in effectively inducing disjunctive concepts(More)
We present two approaches to the Amharic – English bilingual track in CLEF 2004. Both experiments use a dictionary-based approach to translate the Amharic queries into English bags of words, but while one approach removes non-content-bearing words from the Amharic queries based on their IDF value, the other uses a list of English stop words to perform the(More)
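The IDF-based removal of non-content-bearing query words mentioned in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: the logarithmic IDF formula, the cut-off `threshold`, the decision to keep unseen terms, and the names `idf_scores` and `filter_query` are all hypothetical and not taken from the paper.

```python
import math

def idf_scores(documents):
    """Compute IDF = log(N / df) for every term in a tokenized corpus."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def filter_query(query, idf, threshold):
    """Drop query terms whose IDF is below threshold (assumed cut-off).

    Terms missing from the corpus are kept, on the assumption that
    rare or unseen words are likely to be content-bearing.
    """
    return [t for t in query if idf.get(t, float("inf")) >= threshold]

if __name__ == "__main__":
    corpus = [["the", "cat"], ["the", "dog"], ["the", "bird", "cat"]]
    idf = idf_scores(corpus)
    print(filter_query(["the", "dog", "cat"], idf, threshold=0.5))
```

Terms that occur in nearly every document get an IDF near zero and are filtered out, which is what lets this stand in for a stop-word list when none is available for the query language.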