Efficient and effective solutions for search engines

Abstract

Effectiveness and efficiency are two of the main issues in Information Retrieval (IR). IR efficiency is normally addressed in terms of accumulator initialisation, disk I/O, decompression, ranking and sorting. This research is about (1) identifying the bottlenecks in a search engine, (2) devising efficient and effective solutions to minimise or eliminate the bottlenecks and (3) adapting the solutions to distributed IR.

As shown previously [4], the performance of a search engine is dominated by (1) slow disk reads of dictionary terms and the corresponding postings lists, (2) CPU-intensive decompression of postings lists, (3) complex similarity ranking functions and (4) sorting a large number of candidate documents.

In order to speed up disk access, operating systems usually provide general-purpose buffer caching, prefetching and scheduling optimisation algorithms. However, special-purpose applications often do better by bypassing these general mechanisms and deploying their own I/O optimisation algorithms. In Jia et al. [3], a number of application-specific I/O optimisation algorithms have been proposed and tested.

A number of static pruning algorithms have been deployed to minimise the runtime overhead of processing a vast number of postings and sorting a large number of accumulators [4, 2, 1]. The topk algorithm uses a special version of quick sort for fast sorting of the accumulators. Instead of explicitly sorting all accumulators, the improved topk algorithm keeps track of the current top documents during query evaluation. It requires two linear scans of the current top documents to swap out the minimum document among the top, which causes the cost of the improved topk algorithm to grow exponentially when a large number of documents and postings must be processed. The heapk algorithm instead uses a minimum heap to efficiently keep track of the current top documents: when a new candidate document is swapped in, the first document in the minimum heap is simply replaced.

Similarity scores are pre-computed at index time to eliminate the CPU cost of the complex similarity ranking function. Postings are impact-ordered so that the most important postings are processed first and the less important ones can be pruned. Static pruning with impact ordering has been shown to be very effective and efficient even when only a small portion of the postings has been processed. When only a small portion of the postings is processed using the static pruning algorithms, accumulator initialisation becomes the bottleneck. An efficient accumulator initialisation scheme has been proposed and tested [2]. Essentially, it treats the one-dimensional array of accumulators as a logical two-dimensional table and initialises only the required logical rows.

Three areas of the research remain to be addressed in the future. First, the proposed pruning algorithms will be compared against other pruning algorithms. Second, an efficient solution for dictionary compression has been designed and a formal analysis of the algorithm will be derived. Third, multi-core architectures bring two types of communication to distributed IR: fast communication between the cores on a single die, and network communication between the nodes in a cluster. How can distributed IR address this new architecture?
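To make the heapk idea above concrete, the following is a minimal C++ sketch of maintaining the current top-k documents with a minimum heap. The class name TopK, the use of std::priority_queue and the example scores are illustrative assumptions, not the implementation evaluated in the cited work.

// Minimal sketch of keeping the top-k documents with a minimum heap,
// in the spirit of the heapk algorithm described above. Names are
// illustrative, not taken from the paper.
#include <cstddef>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct TopK {
    // The heap root is the smallest score currently in the top-k,
    // so deciding whether a candidate belongs in the top-k is O(1).
    using Entry = std::pair<double, int>;  // (score, document id)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::size_t k;

    explicit TopK(std::size_t k) : k(k) {}

    void offer(int doc, double score) {
        if (heap.size() < k) {
            heap.emplace(score, doc);
        } else if (score > heap.top().first) {
            // Swap in the new candidate by replacing the current minimum:
            // one pop and one push, O(log k), instead of a linear scan.
            heap.pop();
            heap.emplace(score, doc);
        }
    }
};

int main() {
    TopK top(3);
    const double scores[] = {0.2, 0.9, 0.4, 0.7, 0.1, 0.8};
    for (int doc = 0; doc < 6; ++doc) top.offer(doc, scores[doc]);
    while (!top.heap.empty()) {  // prints the top-3 in ascending score order
        std::printf("doc %d score %.1f\n", top.heap.top().second, top.heap.top().first);
        top.heap.pop();
    }
    return 0;
}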
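The combination of pre-computed impacts, impact ordering and static pruning can be sketched as follows. The Posting layout, the fixed per-list budget and the flat accumulator array are assumptions for illustration only, not the exact scheme of the cited algorithms.

// Sketch of query evaluation over impact-ordered postings with a simple
// pruning budget: only the first `budget` postings of each list (those with
// the highest pre-computed impacts) are scored; the rest are pruned.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Posting {
    int doc;        // document identifier
    float impact;   // similarity contribution pre-computed at index time
};

// Each postings list is assumed to be stored in descending impact order.
using PostingsList = std::vector<Posting>;

std::vector<float> evaluate(const std::vector<PostingsList>& query_lists,
                            std::size_t num_docs, std::size_t budget) {
    std::vector<float> accumulators(num_docs, 0.0f);  // one accumulator per document
    for (const PostingsList& list : query_lists) {
        const std::size_t n = std::min(budget, list.size());
        for (std::size_t i = 0; i < n; ++i) {
            // No ranking function is evaluated at query time: the
            // pre-computed impact is simply added to the accumulator.
            accumulators[list[i].doc] += list[i].impact;
        }
    }
    return accumulators;
}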
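The row-wise accumulator initialisation idea can likewise be sketched: the flat accumulator array is viewed as a logical table of fixed-width rows, and a row is zeroed only the first time it is touched during a query, so a query that processes few postings never pays for initialising the whole array. The row width and the class interface below are illustrative assumptions.

// Sketch of lazy, row-wise accumulator initialisation: reset() only clears
// the per-row flags, and a logical row of scores is zeroed on first use.
#include <algorithm>
#include <cstddef>
#include <vector>

class Accumulators {
public:
    Accumulators(std::size_t num_docs, std::size_t row_width = 1024)
        : scores_(num_docs, 0.0f),
          row_width_(row_width),
          row_clean_((num_docs + row_width - 1) / row_width, false) {}

    // Start a new query: mark every row dirty instead of zeroing all scores.
    void reset() { std::fill(row_clean_.begin(), row_clean_.end(), false); }

    float& operator[](std::size_t doc) {
        const std::size_t row = doc / row_width_;
        if (!row_clean_[row]) {
            // Initialise only this logical row, on first access.
            const std::size_t begin = row * row_width_;
            const std::size_t end = std::min(begin + row_width_, scores_.size());
            std::fill(scores_.begin() + begin, scores_.begin() + end, 0.0f);
            row_clean_[row] = true;
        }
        return scores_[doc];
    }

private:
    std::vector<float> scores_;
    std::size_t row_width_;
    std::vector<bool> row_clean_;
};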

DOI: 10.1145/2009916.2010179


Cite this paper

@inproceedings{Jia2011EfficientAE,
  title     = {Efficient and effective solutions for search engines},
  author    = {Xiang-Fei Jia},
  booktitle = {SIGIR},
  year      = {2011}
}