We have implemented a root-extraction stemmer for Arabic which is similar to the Khoja stemmer but without a root dictionary. Our stemmer was found to perform equivalently to the Khoja stemmer as well as so-called "light" stemmers in monolingual document retrieval tasks performed on the Arabic TREC-2001 collection. A root dictionary, therefore, does not…
In this paper, we report on the architecture and preliminary implementation of our search engine, Hairetes. This engine is based on an extended concept of Retrieval by General Logical Imaging (RbGLI). In this extension, word similarity measures are computed by EMIM and Bayes' theorem.
We report on an application of language modeling techniques to the retrieval of Farsi documents. We discovered that language modeling improves the precision of retrieval when compared to a standard vector space model.
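For readers unfamiliar with language-model retrieval, the following is a minimal sketch of query-likelihood scoring. The abstract does not say which language model or smoothing method was used, so the Dirichlet smoothing, the parameter `mu`, and the function and argument names here are all assumptions chosen for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection_tf, collection_len, mu=2000.0):
    """Log-probability of the query under a smoothed unigram model of the document.

    query, doc      : lists of tokens (already tokenized and stemmed)
    collection_tf   : dict mapping term -> total count in the whole collection
    collection_len  : total number of tokens in the collection
    mu              : Dirichlet smoothing parameter (assumed value)
    """
    tf = Counter(doc)
    score = 0.0
    for term in query:
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0:
            # Term never occurs in the collection: it cannot discriminate, skip it.
            continue
        # Dirichlet-smoothed estimate of P(term | document).
        score += math.log((tf[term] + mu * p_coll) / (len(doc) + mu))
    return score

# Toy usage with hypothetical tokens: documents are ranked by descending score.
docs = {
    "d1": ["search", "engine", "index", "query"],
    "d2": ["email", "filter", "spam", "rules"],
}
coll_tf = Counter(tok for d in docs.values() for tok in d)
coll_len = sum(coll_tf.values())
query = ["search", "query"]
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], coll_tf, coll_len),
    reverse=True,
)
```

Unlike a vector space model, which ranks by a geometric similarity between term-weight vectors, this approach ranks each document by how probable the query is under that document's word distribution.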
We report on the construction of an ontology that applies rules for identification of features to be used for email classification. The associated probabilities for these features are then calculated from the training set of emails and used as a part of the feature vectors for an underlying Bayesian classifier.
In this paper, we report on our ongoing research for the development of a Unicode-based search engine for Farsi. The activities consist of an I/O subsystem, Farsi stemmer, test collection preparation, and the search engine itself. This engine is intended to be independent of the operating system platform using no special hardware or software. We are further…
This paper presents the implementation and evaluation of a Hidden Markov Model to extract addresses from OCR text. Although Hidden Markov Models discover addresses with high precision and recall, this type of Information Extraction task seems to be affected negatively by the presence of OCR text.
We present an efficient algorithm called the Quadtree Heuristic for identifying a list of similar terms for each unique term in a large document collection. Term similarity is defined using the Expected Mutual Information Measure (EMIM). Since our aim for defining the similarity lists is to improve information retrieval (IR), we present the outcome of an…
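As an illustration of the similarity measure named in this abstract, here is a minimal sketch of EMIM between two terms computed from document-level co-occurrence counts, using the standard textbook definition (mutual information over the presence/absence of each term). It does not implement the Quadtree Heuristic itself, and the function name and count-based interface are assumptions made for this example.

```python
import math

def emim(n_both, n_a, n_b, n_docs):
    """Expected Mutual Information Measure between terms a and b.

    n_both : number of documents containing both terms
    n_a    : number of documents containing term a
    n_b    : number of documents containing term b
    n_docs : total number of documents
    """
    # Each cell is (joint count, marginal count for a's state, marginal count for b's state).
    cells = [
        (n_both, n_a, n_b),                                       # a present, b present
        (n_a - n_both, n_a, n_docs - n_b),                        # a present, b absent
        (n_b - n_both, n_docs - n_a, n_b),                        # a absent,  b present
        (n_docs - n_a - n_b + n_both, n_docs - n_a, n_docs - n_b) # a absent,  b absent
    ]
    total = 0.0
    for joint, marg_a, marg_b in cells:
        if joint > 0:  # an empty cell contributes nothing (0 * log 0 is taken as 0)
            # P(x, y) * log( P(x, y) / (P(x) * P(y)) ), written with raw counts.
            total += (joint / n_docs) * math.log((joint * n_docs) / (marg_a * marg_b))
    return total
```

Terms that occur independently of each other score zero, and the score grows as the two terms' occurrence patterns become more predictable from one another. Computing this naively for every term pair is quadratic in the vocabulary size, which is the cost an efficient heuristic would aim to reduce.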
Hundreds of experiments over the last decade on the retrieval of OCR documents performed by the Information Science Research Institute have shown that OCR errors do not significantly affect retrievability. We extend those results to show that in the case of proximity searching, the removal of running headers and footers from OCR text will not improve…