Indexing methods for approximate dictionary searching: Comparative analysis

@article{Boytsov2011IndexingMF,
  title={Indexing methods for approximate dictionary searching: Comparative analysis},
  author={Leonid Boytsov},
  journal={ACM J. Exp. Algorithmics},
  year={2011},
  volume={16}
}
  • Leonid Boytsov
  • Published 1 May 2011
  • Computer Science
  • ACM J. Exp. Algorithmics
The primary goal of this article is to survey state-of-the-art indexing methods for approximate dictionary searching. To improve understanding of the field, we introduce a taxonomy that classifies all methods into direct methods and sequence-based filtering methods. We focus on infrequently updated dictionaries, which are used primarily for retrieval. Therefore, we consider indices that are optimized for retrieval rather than for update. The indices are assumed to be associative, that is… 
Super-Linear Indices for Approximate Dictionary Searching
TLDR
Experiments show that the implementation of this approach performs comparably to or better than the fastest benchmarks and requires 4-8 times less space than FastSS.
Practical compressed string dictionaries
A Practical Index for Approximate Dictionary Matching with Few Mismatches
TLDR
A surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, is presented and it is demonstrated that a basic compression technique consisting in $q$-gram substitution can significantly reduce the index size, while still keeping the query time relatively low.
Fast and Compact Hamming Distance Index
TLDR
New solutions for the approximate dictionary queries problem are proposed which combine the use of succinct data structures with an efficient representation of the keys to significantly reduce the space usage of the state-of-the-art solutions without introducing any time penalty.
A String Prefix Dependent Dictionary Structure Based on Hashing and Indexing
TLDR
This work proposes a hashing-indexing method to speed up lookups inside dictionaries: a reconstruction of an English dictionary of about 300,000 lexical entries using a combination of a hash function and an indexing table.
Simple, compact and robust approximate string dictionary
A Best-First Anagram Hashing Filter for Approximate String Matching with Generalized Edit Distance
TLDR
This paper defines a filter that for each source word selects a small set of target lexical entries, from which the best match is then selected using generalized edit distance, where edit operations can be assigned an arbitrary weight.
Near neighbor searching with K nearest references
Full-text and Keyword Indexes for String Searching
TLDR
The FM-bloated index, a modification of the well-known FM-index (a compressed, full-text index) that trades space for speed, is presented together with the so-called split index, which can efficiently solve the k-mismatches problem, especially for 1 error.
Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing
TLDR
This work achieves the first bounds for any approximation factor c, via a simple and easy-to-implement hash function, and shows how to apply these ideas to the closely-related Approximate Nearest Neighbor problem for edit distance, obtaining similar time bounds.

References

Dictionary organizations for efficient similarity retrieval
A Practical q-Gram Index for Text Retrieval Allowing Errors
TLDR
An indexing technique for approximate text searching that is practical, powerful, and especially optimized for natural language text, and that can retrieve any string that approximately matches the search pattern, not only words.
Fast Approximate Search in Large Dictionaries
TLDR
This article describes methods for efficiently selecting a natural set of candidates for correcting an erroneous input P, the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k.
One-time complete indexing of text: theory and practice
TLDR
Special techniques such as partial inversion of index terms, probabilistic ordering of index terms, and various types of data compression allow n-gram indexing to be competitive in performance with other approaches.
Dictionary matching and indexing with errors and don't cares
This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported
Text Indexing and Dictionary Matching with One Error
TLDR
This paper presents a uniform deterministic solution to both the indexing and the general dictionary matching problem with one error.
A Hybrid Indexing Method for Approximate String Matching
TLDR
A new indexing method based on a suffix array combined with a partitioning of the pattern that can outperform by far all the existing alternatives for indexed approximate searching is presented.
Effective indexing and filtering for similarity search in large biosequence databases
TLDR
A multi-dimensional indexing approach for fast sequence similarity search in DNA and protein databases is presented and effective index structures, based on R-trees and scalar quantization, on top of transformed vectors and distance functions are developed.
Indexing Text with Approximate q-Grams
TLDR
A new index for approximate string matching is presented and it is shown experimentally that the parameterization mechanism of the related filtration scheme provides a compromise between the space requirement of the index and the error level for which the filtration is still efficient.
Document Retrieval Experiments Using Indexing Vocabularies of Varying Size. II. Hashing, Truncation, Digram and Trigram Encoding of Index Terms
TLDR
Experiments with the Cranfield test collection show that trigram encoding of words performs noticeably better than the use of digrams; however, use of the least frequent digram in each term produces more acceptable results.