Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold t. Most existing methods for answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far…
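To make the problem definition above concrete, here is a minimal brute-force sketch in Python. It is not the signature-based method the abstract refers to; it simply compares the query against every string. The function names `edit_distance` and `edit_similarity_search` are illustrative assumptions, not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity_search(query: str, database: list[str], threshold: int) -> list[str]:
    """Hypothetical baseline: return every string within `threshold` edits of `query`."""
    return [s for s in database if edit_distance(query, s) <= threshold]


if __name__ == "__main__":
    print(edit_similarity_search("data", ["date", "data", "delta", "database"], 1))
```

A linear scan like this costs one full distance computation per database string, which is the overhead that candidate-generation schemes aim to avoid.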
Query autocompletion is an important feature that saves users the keystrokes of typing the entire query. In this paper we study the problem of query autocompletion that tolerates errors in users' input using edit distance constraints. Previous approaches index data strings in a trie, and continuously maintain all the prefixes of data strings whose edit…
Similarity joins play an important role in many application areas, such as data integration and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering…
Hamming distance measures the number of dimensions where two vectors have different values. In applications such as pattern recognition, information retrieval, and databases, we often need to efficiently process Hamming distance queries, which retrieve the vectors in a database that are within Hamming distance k of a given query vector.
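As a small illustration of the query just defined, the following Python sketch answers a Hamming distance query by linear scan. It is only a baseline under assumed names (`hamming_distance`, `hamming_query`) and makes no claim about the indexing technique the paper itself proposes.

```python
from typing import Sequence


def hamming_distance(u: Sequence[int], v: Sequence[int]) -> int:
    """Count the dimensions in which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x != y for x, y in zip(u, v))


def hamming_query(query: Sequence[int], vectors: list, k: int) -> list:
    """Hypothetical baseline: keep every vector within Hamming distance k of `query`."""
    return [v for v in vectors if hamming_distance(query, v) <= k]


if __name__ == "__main__":
    db = [(0, 0, 0, 1), (1, 1, 0, 0), (1, 1, 1, 0)]
    print(hamming_query((0, 0, 0, 0), db, k=2))
```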
Errata fixed in this version: 1. Algorithm 6, Line 4: "p1 ← Ψ −1 m κ 2" is wrong; changed to "p1 ← Ψm κ 2" here. Abstract: Nearest neighbor searches in high-dimensional space have many important applications in domains such as data mining and multimedia databases. The problem is challenging due to the phenomenon called the "curse of dimensionality". An…
Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold τ. Most existing methods answering edit similarity queries employ schemes to generate string subsequences as signatures and generate candidates by set overlap queries on query and data signatures.
As electronic documents become more popular, people want to find documents that are complete or partial duplicates. In this paper, we propose a near-duplicate text detection framework that uses signatures to save space and query time. We also propose a novel signature selection algorithm that uses the collection frequency of q-grams. We compare our…
Query autocompletion has become a standard feature in many search applications, especially for search engines. A recent trend is to support error-tolerant autocompletion, which significantly increases usability by matching prefixes of database strings and allowing a small number of errors. In this article, we systematically study the query…
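One way to read "matching prefixes of database strings and allowing a small number of errors" is through a prefix edit distance: the smallest edit distance between the typed query and any prefix of a data string. The Python sketch below is a naive illustration under that assumption; the names `prefix_edit_distance` and `autocomplete` are hypothetical, and the scan over all strings is not the indexing approach studied in the article.

```python
def prefix_edit_distance(query: str, s: str) -> int:
    """Minimum edit distance between `query` and any prefix of `s`."""
    # prev[j] holds the edit distance between the processed part of the
    # query and the prefix s[:j].
    prev = list(range(len(s) + 1))
    for i, cq in enumerate(query, start=1):
        curr = [i]
        for j, cs in enumerate(s, start=1):
            cost = 0 if cq == cs else 1
            curr.append(min(prev[j] + 1,          # drop a query character
                            curr[j - 1] + 1,      # add a character of s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    # The final row covers every prefix of s, so take its minimum.
    return min(prev)


def autocomplete(query: str, database: list[str], max_errors: int) -> list[str]:
    """Hypothetical baseline: strings with a prefix within `max_errors` edits of `query`."""
    return [s for s in database if prefix_edit_distance(query, s) <= max_errors]


if __name__ == "__main__":
    db = ["database systems", "data mining", "graph search"]
    print(autocomplete("databse", db, max_errors=1))
```

Here the misspelled query "databse" is one edit away from the prefix "database", so "database systems" is still suggested.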