Learn More
T h e string-matching problem is a very c o m m o n problem. We are searching for a string P = PtP2. . "Pro i n s i d e a la rge t ex t f i le T = t l t2. . . t . , b o t h sequences of characters from a f i n i t e character set Z. T h e characters may be English characters in a text file, DNA base pairs, lines of source code, angles between edges in(More)
We present a tool, called sif, for finding all similar files in a large file system. Files are considered similar if they have significant number of common pieces, even if they are very different otherwise. For example, one file may be contained, possibly with some changes, in another file, or a file may be a reorganization of another file. The running time(More)
GLIMPSE, which stands for GLobal IMPlicit SEarch, provides indexing and query schemes for file systems. The novelty of glimpse is that it uses a very small index — in most cases 2-4% of the size of the text — and still allows very flexible full-text retrieval including Boolean queries, approximate matching (i.e., allowing misspelling), and even searching(More)
A new algorithm to search for multiple patterns at the same time is presented. The algorithm is faster than previous algorithms and can support a very large number — tens of thousands — of patterns. Several applications of the multi-pattern matching problem are discussed. We argue that, in addition to previous applications that required such search,(More)
It is increasingly difficult to make effective use of Internet information, given the rapid growth in data volume, user base, and data diversity. In this paper we introduce Harvest, a system that provides a scalable, customizable architecture for gathering, indexing, caching, replicating, and accessing Internet information.
Searching for a pattern in a text file is a very common operation in many applications ranging from text editors and databases to applications in molecular biology. In many instances the pattern does not appear in the text exactly. Errors in the text or in the query can result from misspelling or from experimental errors (e.g., when the text is a DNA(More)
We present a new le system that combines name-based and content-based access to les at the same time. Our design allows both methods to be used at any time, thus preserving the bene ts of both. Users can create their own name spaces based on queries, on explicit path names, or on any combination interleaved arbitrarily. All regular le operations { such as(More)
We present a new data structure, called the xed-queries tree, for the problem of nding all elements of a xed set that are close, under some distance function, to a query element. Fixed-queries trees can be used for any distance function, not necessarily even a metric, as long as it satisses the triangle inequality. We give an analysis of several performance(More)
DIB is a general-purpose package that allows a wide range of applications such as recursive backtrack, branch and bound, and alpha-beta search to be implemented on a multicomputer. It is very easy to use. The application program needs to specify only the root of the recursion tree, the computation to be performed at each node, and how to generate children(More)