Learn More
1 Boolean retrieval 1 2 The term vocabulary and postings lists 19 3 Dictionaries and tolerant retrieval 49 4 Index construction 67 5 Index compression 85 6 Scoring, term weighting and the vector space model 109 7 Computing scores in a complete search system 135 8 Evaluation in information retrieval 151 9 Relevance feedback and query expansion 177 10 XML(More)
The study of the web as a graph is not only fascinating in its own right, but also yields valuable insight into web algorithms for crawling, searching and community discovery, and the sociological phenomena which characterize its evolution. We report on experiments on local and global properties of the web graph using two Altavista crawls each with over 200(More)
Randomized algorithms, once viewed as a tool in computational number theory, have by now found widespread application. Growth has been fueled by the two major benefits of randomization: simplicity and speed. For many applications a randomized algorithm is the fastest algorithm available, or the simplest , or both. A randomized algorithm is an algorithm that(More)
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering(More)
A (directed) network of people connected by ratings or trust scores, and a model for propagating those trust scores, is a fundamental building block in many of today's most successful e-commerce and recommendation systems. We develop a framework of trust propagation schemes, each of which may be appropriate in certain circumstances, and evaluate the schemes(More)
The web harbors a large number of communities-groups of content-creators sharing a common interest-each of which manifests itself as a set of interlinked web pages. Newgroups and commercial web directories together contain of the order of 20000 such communities; our particular interest here is on emerging communities-those that have little or no(More)
We present Symphony, a novel protocol for maintaining distributed hash tables in a wide area n e t-work. The key idea is to arrange all participants along a ring and equip them with long distance contacts drawn from a family of harmonic distributions. Through simulation, we demonstrate that our construction is scalable, exible, stable in the presence of(More)
The pages and hyperlinks of the WorldWide Web may be viewed as nodes and edges in a directed graph. This graph is a fascinating object of study: it has several hundred million nodes today, over a billion links, and appears to grow exponentially with time. There are many reasons | mathematical, sociological, and commercial | for studying the evolution of(More)
Latent semantic indexing (LX) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had hereto-fort been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval(More)
View an <italic>n</italic>-vertex, <italic>m</italic>-edge undirected graph as an electrical network with unit resistors as edges. We extend known relations between random walks and electrical networks by showing that resistance in this network is intimately connected with the <italic>lengths</italic> of random walks on the graph. For example, the(More)