Fast string correction with Levenshtein automata

  • K. Schulz, S. Mihov
  • Published 1 November 2002
  • Computer Science
  • International Journal on Document Analysis and Recognition
Abstract. The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear to… 
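The automaton described in the abstract can be illustrated by directly simulating its nondeterministic form: a state (i, e) records that i characters of W have been consumed using e edit operations, and deletions from W are epsilon moves. A minimal Python sketch (function names are mine, not from the paper; the paper's contribution is determinizing this construction):

```python
def levenshtein_nfa_accepts(w, v, n):
    """Return True iff Levenshtein(w, v) <= n, by simulating the
    nondeterministic Levenshtein automaton of degree n for w on input v.
    A state (i, e) means: i characters of w consumed, e errors used."""

    def closure(states):
        # Epsilon moves: deleting w[i] costs one error, consumes no input.
        stack, seen = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < len(w) and e < n and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    states = closure({(0, 0)})
    for c in v:
        nxt = set()
        for i, e in states:
            if i < len(w) and w[i] == c:    # match: consume c, no error
                nxt.add((i + 1, e))
            if e < n:
                nxt.add((i, e + 1))          # insertion: extra char in v
                if i < len(w):
                    nxt.add((i + 1, e + 1))  # substitution
        states = closure(nxt)
        if not states:
            return False                     # every branch exceeded n errors
    return any(i == len(w) for i, _ in states)
```

The state set stays small because e is bounded by n and i can differ from the input position by at most n, which is what makes determinization for fixed n tractable.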
Nondeterministic Finite Automata in Hardware-the Case of the Levenshtein Automaton
A novel technique for executing a pipelined Levenshtein NFA using Micron’s Automata Processor (AP), avoiding the run time and space overheads associated with CPU and GPU implementations and making the automaton a viable building block for future approximate string applications on the AP.
Fast Approximate Search in Large Dictionaries
This article describes methods for efficiently selecting a natural set of candidates for correcting an erroneous input P, the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k.
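The candidate-selection task described above can be sketched without automata as a baseline: a bounded Levenshtein computation that abandons a dictionary entry as soon as the distance provably exceeds k. This is a simple illustration of the task, not the article's method (function names are mine):

```python
def levenshtein_bounded(a, b, k):
    """Levenshtein distance between a and b, or None if it exceeds k.
    Row-wise DP with early abandon: stop once every cell in the
    current row already exceeds k."""
    if abs(len(a) - len(b)) > k:         # lengths alone rule this pair out
        return None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution/match
        if min(cur) > k:                 # no alignment can recover below k
            return None
        prev = cur
    return prev[-1] if prev[-1] <= k else None

def candidates(word, dictionary, k):
    """All dictionary entries within Levenshtein distance k of word."""
    return [d for d in dictionary if levenshtein_bounded(word, d, k) is not None]
```

The point of the automaton-based methods surveyed here is to avoid even this per-entry DP by traversing the dictionary (stored as a trie or automaton) and the Levenshtein automaton in lockstep.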
A Memory-Efficient GPU Method for Hamming and Levenshtein Distance Similarity
This work proposes two methods to improve the preprocessing and traversal performance of Levenshtein and Hamming distance NFAs and shows that the optimized method, implicit-rearranged-sv, outperforms traditional GPU engines both in terms of traversal throughput and preprocessing time.
Precise and Efficient Text Correction using Levenshtein Automata, Dynamic Web Dictionaries and Optimized Correction Models
It is shown that postcorrection improves the quality even for scanned texts with a very small number of OCR-errors, and a complex tool has been developed for optimizing these parameters on the basis of ground truth data.
Average-Case Analysis of Approximate Trie Search (Extended Abstract)
This work investigates a comparison-based model where “errors” and “matches” are defined between pairs of characters, and studies the average-case complexity of the number of comparisons for searching in a trie in dependence of the parameters p and D.
Breadth-first search strategies for trie-based syntactic pattern recognition
This paper shows how to optimize dictionary-based syntactic pattern recognition of strings by incorporating breadth-first search schemes on the underlying graph structure, and demonstrates improvements of up to 21% in the number of operations needed while maintaining the same accuracy.
Approximate word matching with synchronized rational relations
An algorithm for synchronizing a rational relation with bounded length difference is presented and it is shown how the method could be applied for automatic correction of OCR-ed text.
Descriptional Complexity of Error Detection
This paper surveys recent work on the state complexity of neighbourhoods of regularity preserving distances and determines the size of the minimal deterministic finite automaton needed to recognize the neighbourhood of a language recognized by an n state DFA.
Fast Selection of Small and Precise Candidate Sets from Dictionaries for Text Correction Tasks
Lexical text correction relies on a central step where approximate search in a dictionary is used to select the best correction suggestions for an ill-formed input token. In previous work we…
A framework for Urdu word processor applications that can automatically detect and correct errors in Urdu text by generating suggestions to the user is proposed.


Error-tolerant Finite-state Recognition with Applications to Morphological Analysis and Spelling Correction
This paper presents the notion of error-tolerant recognition with finite-state recognizers along with results from some applications. Error-tolerant recognition enables the recognition of strings…
Contextual Word Recognition Using Binary Digrams
The concept of binary digrams is introduced, which overcomes some of the problems of past approaches and can be used to effectively extract the "syntax" of the dictionary while requiring very modest amounts of storage.
A hidden Markov model for language syntax in text recognition
  • J. Hull
  • Computer Science
    Proceedings., 11th IAPR International Conference on Pattern Recognition. Vol.II. Conference B: Pattern Recognition Methodology and Systems
  • 1992
The use of a hidden Markov model for language syntax to improve the performance of a text recognition algorithm is described, and a modification of the Viterbi algorithm is proposed that finds, for a given sentence, a fixed number of sequences of syntactic classes with the highest probabilities of occurrence.
Algorithms for Approximate String Matching
Fast text searching: allowing errors
The string-matching problem is a very common problem; there are many extensions to this problem; for example, one may be looking for a set of patterns, a pattern with "wild cards," or a regular expression.
Hierarchically Coded Lexicon with Variants
This method is well-suited for a small to medium sized lexicon in applications where the correction speed is crucial and has been successfully tested with a French lexicon of about 10,000 words collected from scientific texts.
Incremental construction of minimal acyclic finite state automata
A new method for constructing minimal, deterministic, acyclic finite-state automata from a set of strings by adding new strings one by one and minimizing the resulting automaton on-the-fly.
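The incremental construction summarized above (for lexicographically sorted input) can be sketched compactly: walk the common prefix with the previous word, minimize the previous word's now-finished suffix against a register of canonical states, then append the new suffix. A minimal Python sketch under that sorted-input assumption (names are mine, not the paper's):

```python
class State:
    """A DFA state: finality flag plus outgoing labeled edges."""
    def __init__(self):
        self.edges = {}      # char -> State
        self.final = False

    def key(self):
        # Equivalence signature: two states with equal keys accept the
        # same right language (their children are already canonical).
        return (self.final,
                tuple(sorted((c, id(s)) for c, s in self.edges.items())))

def _replace_or_register(state, register):
    """Minimize the subtree on state's most recently added edge."""
    if not state.edges:
        return
    c = max(state.edges)                 # last-added edge: input is sorted
    child = state.edges[c]
    if child.edges:
        _replace_or_register(child, register)
    k = child.key()
    if k in register:
        state.edges[c] = register[k]     # reuse an equivalent existing state
    else:
        register[k] = child

def build_minimal_dfa(words):
    """Build a minimal acyclic DFA by adding words in sorted order."""
    register, root, previous = {}, State(), ""
    for word in sorted(words):
        # 1. walk the longest common prefix with the previous word
        p, node = 0, root
        while p < min(len(word), len(previous)) and word[p] == previous[p]:
            node = node.edges[word[p]]
            p += 1
        # 2. the previous word's remaining suffix is finished: minimize it
        _replace_or_register(node, register)
        # 3. append the unshared suffix of the current word
        for c in word[p:]:
            node.edges[c] = State()
            node = node.edges[c]
        node.final = True
        previous = word
    _replace_or_register(root, register)
    return root

def accepts(root, word):
    node = root
    for c in word:
        node = node.edges.get(c)
        if node is None:
            return False
    return node.final
```

Because suffixes are registered as soon as no later word can share them, the automaton stays minimal throughout construction, which is what makes building large dictionaries for the approximate-search methods above practical.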