# Indexing methods for approximate dictionary searching: Comparative analysis

@article{Boytsov2011IndexingMF, title={Indexing methods for approximate dictionary searching: Comparative analysis}, author={Leonid Boytsov}, journal={ACM J. Exp. Algorithmics}, year={2011}, volume={16} }

The primary goal of this article is to survey state-of-the-art indexing methods for approximate dictionary searching. To improve understanding of the field, we introduce a taxonomy that classifies all methods into direct methods and sequence-based filtering methods. We focus on infrequently updated dictionaries, which are used primarily for retrieval. Therefore, we consider indices that are optimized for retrieval rather than for update. The indices are assumed to be associative, that is…

## 74 Citations

Super-Linear Indices for Approximate Dictionary Searching

- Computer Science
- SISAP
- 2012

Experiments show that the implementation of this approach performs comparably to or better than the fastest benchmarks and requires 4-8 times less space than FastSS.

A Practical Index for Approximate Dictionary Matching with Few Mismatches

- Computer Science
- Comput. Informatics
- 2017

A surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, is presented and it is demonstrated that a basic compression technique consisting in $q$-gram substitution can significantly reduce the index size, while still keeping the query time relatively low.
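
The pigeonhole (Dirichlet) idea behind such a split index can be sketched as follows. This is a minimal illustration, not the cited paper's implementation: if two equal-length words differ in at most `k` positions and each is cut into `k + 1` parts at the same boundaries, at least one part must match exactly. The function names, the dictionary-of-sets layout, and the sample words are assumptions made for this example, and no $q$-gram compression is attempted.

```python
from collections import defaultdict

def split_parts(word, k):
    """Cut word into k + 1 contiguous parts; boundaries depend only on len(word)."""
    n, p = len(word), k + 1
    bounds = [i * n // p for i in range(p + 1)]
    return [word[bounds[i]:bounds[i + 1]] for i in range(p)]

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def build_index(words, k):
    """Map (word length, part number, part text) to the words containing that part."""
    index = defaultdict(set)
    for w in words:
        for i, part in enumerate(split_parts(w, k)):
            index[(len(w), i, part)].add(w)
    return index

def search(index, query, k):
    """Return dictionary words within Hamming distance k of query, sorted."""
    candidates = set()
    for i, part in enumerate(split_parts(query, k)):
        # Any word within k mismatches shares at least one part exactly.
        candidates |= index.get((len(query), i, part), set())
    # Candidates are only a filter result; verify each one explicitly.
    return sorted(w for w in candidates if hamming(w, query) <= k)

words = ["index", "inbox", "under", "indie"]  # hypothetical sample dictionary
print(search(build_index(words, 1), "indez", 1))
```

The index is built for a fixed `k`, because the part boundaries depend on it; a query with a different `k` would need its own index.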

Fast and Compact Hamming Distance Index

- Computer Science
- IIR
- 2016

New solutions for the approximate dictionary queries problem are proposed which combine the use of succinct data structures with an efficient representation of the keys to significantly reduce the space usage of the state-of-the-art solutions without introducing any time penalty.

A String Prefix Dependent Dictionary Structure Based on Hashing and Indexing

- Computer Science
- 2016

This work proposes a hashing-indexing method to speed up lookups in dictionaries, and applies it to reconstruct an English dictionary of about 300,000 lexical entries using a combination of a hash function and an indexing table.

Simple, compact and robust approximate string dictionary

- Computer Science, Mathematics
- J. Discrete Algorithms
- 2014

A Best-First Anagram Hashing Filter for Approximate String Matching with Generalized Edit Distance

- Computer Science
- COLING
- 2012

This paper defines a filter that for each source word selects a small set of target lexical entries, from which the best match is then selected using generalized edit distance, where edit operations can be assigned an arbitrary weight.

Full-text and Keyword Indexes for String Searching

- Computer Science
- ArXiv
- 2015

Two indexes are presented: the FM-bloated index, a modification of the well-known FM-index (a compressed, full-text index) that trades space for speed, and the so-called split index, which can efficiently solve the k-mismatches problem, especially for 1 error.

Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing

- Computer Science, Mathematics
- ICDT
- 2021

This work achieves the first bounds for any approximation factor c, via a simple and easy-to-implement hash function, and shows how to apply these ideas to the closely-related Approximate Nearest Neighbor problem for edit distance, obtaining similar time bounds.

## References

A Practical q-Gram Index for Text Retrieval Allowing Errors

- Computer Science
- CLEI Electron. J.
- 1998

An indexing technique for approximate text searching, which is practical and powerful, and especially optimized for natural language text, and able to retrieve any string that approximately matches the search pattern, not only words.

Fast Approximate Search in Large Dictionaries

- Computer Science
- CL
- 2004

This article describes methods for efficiently selecting a natural set of candidates for correcting an erroneous input P, the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k.
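
To make the candidate set concrete, here is a brute-force baseline that scans the whole dictionary and keeps every word whose Levenshtein distance to `P` is at most `k`. It is only an illustrative sketch and not the method of the cited article, which is concerned with selecting these candidates efficiently; the function names and sample words are assumptions made for this example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def candidates(dictionary, p, k):
    """All dictionary words within Levenshtein distance k of the input p."""
    return [w for w in dictionary
            # A length difference above k already rules a word out.
            if abs(len(w) - len(p)) <= k and levenshtein(w, p) <= k]

dictionary = ["search", "searched", "march", "seared"]  # hypothetical sample
print(candidates(dictionary, "serch", 1))
```

The length check is a cheap filter: two strings whose lengths differ by more than `k` cannot be within distance `k`, so the quadratic distance computation is skipped for them.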

One-time complete indexing of text: theory and practice

- Computer Science
- SIGIR '85
- 1985

Special techniques such as partial inversion of index terms, probabilistic ordering of index terms, and various types of data compression allow n-gram indexing to be competitive in performance with other approaches.

Dictionary matching and indexing with errors and don't cares

- Computer Science, Mathematics
- STOC '04
- 2004

This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported…

Text Indexing and Dictionary Matching with One Error

- Computer Science
- J. Algorithms
- 2000

This paper presents a uniform deterministic solution to both the indexing and the general dictionary matching problem with one error.

A Hybrid Indexing Method for Approximate String Matching

- Computer Science
- 2007

A new indexing method based on a suffix array combined with a partitioning of the pattern that can outperform by far all the existing alternatives for indexed approximate searching is presented.

Effective indexing and filtering for similarity search in large biosequence databases

- Computer Science
- Third IEEE Symposium on Bioinformatics and Bioengineering, 2003. Proceedings.
- 2003

A multi-dimensional indexing approach for fast sequence similarity search in DNA and protein databases is presented and effective index structures, based on R-trees and scalar quantization, on top of transformed vectors and distance functions are developed.

Indexing Text with Approximate q-Grams

- Computer Science
- CPM
- 2000

A new index for approximate string matching is presented and it is shown experimentally that the parameterization mechanism of the related filtration scheme provides a compromise between the space requirement of the index and the error level for which the filtration is still efficient.

Document Retrieval Experiments Using Indexing Vocabularies of Varying Size. II. Hashing, Truncation, Digram and Trigram Encoding of Index Terms

- Computer Science
- J. Documentation
- 1979

Experiments with the Cranfield test collection show that trigram encoding of words performs noticeably better than the use of digrams; however, use of the least frequent digram in each term produces more acceptable results.