FLASH: a fast look-up algorithm for string homology

@article{Califano1993FLASHAF,
  title={FLASH: a fast look-up algorithm for string homology},
  author={Andrea Califano and Isidore Rigoutsos},
  journal={Proceedings of IEEE Conference on Computer Vision and Pattern Recognition},
  year={1993},
  pages={353-359}
}
  • A. Califano, I. Rigoutsos
  • Published 15 June 1993
  • Computer Science
  • Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
A key issue in managing large amounts of data is the availability of efficient, accurate, and selective techniques to detect homology (similarity) between newly recovered and previously acquired sequences. The algorithm presented is based on a probabilistic indexing framework which requires minimal access to the database for each match. A highly redundant number of descriptive tuples from the sequences of interest are generated and used as indices in a table look-up paradigm. Theoretical and…
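
The table look-up idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the tuple shape (an anchor character plus `k - 1` later characters drawn from a short window), the parameters `k` and `window`, the offset-voting step, and all function names are assumptions chosen only to show how redundant tuples can act as index keys.

```python
from collections import defaultdict
from itertools import combinations


def tuples_of(seq, k=3, window=5):
    """Yield (key, position) pairs for one sequence.

    Hypothetical tuple scheme: each key is the character at `position`
    plus k-1 later characters from the next `window - 1` positions,
    together with their offsets. Many overlapping keys are produced per
    position, which is the source of redundancy.
    """
    for i in range(len(seq)):
        rest = seq[i + 1 : i + window]
        for combo in combinations(range(len(rest)), k - 1):
            key = (seq[i],) + tuple((off + 1, rest[off]) for off in combo)
            yield key, i


def build_index(database, k=3, window=5):
    """Map every tuple key to the (sequence name, position) pairs where it occurs."""
    table = defaultdict(list)
    for name, seq in database.items():
        for key, pos in tuples_of(seq, k, window):
            table[key].append((name, pos))
    return table


def query(table, seq, k=3, window=5):
    """Vote for (sequence name, alignment offset) pairs via table look-ups only."""
    votes = defaultdict(int)
    for key, qpos in tuples_of(seq, k, window):
        for name, dpos in table.get(key, ()):
            votes[(name, dpos - qpos)] += 1
    # Highest vote count first.
    return sorted(votes.items(), key=lambda item: -item[1])


# Toy data, invented for illustration.
database = {"a": "ACGTACGTTGCA", "b": "TTTTGGGGCCCC"}
index = build_index(database)
hits = query(index, "GTACGT")
best_match, best_votes = hits[0]
```

Because each position contributes several keys, a match can still collect votes when some keys are disrupted by a substitution. The query touches the database only through the keys it generates, which is the sense in which such a scheme needs little data access per match.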

Indexing protein sequence/structure databases using decision tree: A preliminary study
TLDR
A decision tree indexing method is presented that can effectively and rapidly retrieve all the similar proteins from a large database for a given protein query and can also be used as a predictive model for proteins which have not been determined experimentally.
Speeding up whole-genome alignment by indexing frequency vectors
TLDR
An efficient technique for local alignment of large genome strings that aligns genome strings up to two orders of magnitude faster than BLAST and can be used to accelerate other search tools as well.
Indexing Genomic Databases for Fast Homology Searching
TLDR
A novel filter-and-refine approach that speeds up the search for homology in large protein databases and yields significant savings in computation without sacrificing the accuracy of the answers as compared to FASTA.
MAP: Searching Large Genome Databases
TLDR
This work proposes an efficient technique for alignment of large genome strings that precomputes the associations between the database strings and the query string and uses a hash table to compare the unpruned regions of the query and database strings.
Searching in parallel for similar strings [biological sequences]
TLDR
An indexing-based approach for retrieving homologies in databases of proteins, using a redundant table-lookup scheme, and recovering database items that match a test sequence requires minimal data access.
SANS: high-throughput retrieval of protein sequences allowing 50% mismatches
TLDR
A novel word filter, suffix array neighborhood search (SANS), to identify protein sequence similarities in the range of 50–100% identity with sensitivity comparable to BLAST and 10 times the speed of USEARCH, enabling fast searching and functional annotation into the future despite rapidly expanding databases.
Searching Genomic Databases using the Prime Factor Filter
  • R. Pears, J. Ee
  • Computer Science
    2006 International Conference on Information and Automation
  • 2006
TLDR
A filter based on the prime factor indexing scheme is successful in eliminating a large fraction of such false positives that survive the MRS index, resulting in speedups of up to 5 times over the MRS indexing scheme.
Removing near-neighbour redundancy from large protein sequence collections
TLDR
This work clusters closely similar sequences to yield a covering of sequence space by a representative subset of sequences, derived by an exhaustive search for close similarities in the sequence database in which the need for explicit sequence alignment is significantly reduced by applying deca- and pentapeptide composition filters.
Fast Database Indexing for Large Protein Sequence Collections Using Parallel N-Gram Transformation Algorithm
TLDR
Results for the parallel N-Gram transformation algorithm indicate that parallel programming on large datasets is promising and can be improved further.
Comparing Compressed Sequences for Faster Nucleotide BLAST Searches
TLDR
Two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences are proposed that more than double the speed of BLASTN with no effect on accuracy.

References

Rapid and sensitive protein similarity searches.
TLDR
An algorithm was developed which facilitates the search for similarities between newly determined amino acid sequences and sequences already available in databases and increases sensitivity by giving high scores to those amino acid replacements which occur frequently in evolution.
Improved sensitivity of biological sequence database searches
TLDR
The sensitivity of DNA and protein sequence database searches is increased by allowing similar but non-identical amino acids or nucleotides to match and one can match k-tuples or words instead of matching individual residues in order to speed the search.
Improved tools for biological sequence comparison.
  • W. Pearson, D. Lipman
  • Biology, Computer Science
    Proceedings of the National Academy of Sciences of the United States of America
  • 1988
TLDR
Three computer programs for comparisons of protein and DNA sequences can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity.
Searching gene and protein sequence databases.
TLDR
The classic algorithms for similarity searching and sequence alignment are described and good performance of these algorithms is critical to searching very large and growing databases.
Basic local alignment search tool.
A Model of Evolutionary Change in Proteins
TLDR
The body of data used in this study includes 1,572 changes of closely related proteins appearing in the Atlas volumes through Supplement 2 and the mutation data were accumulated from the phylogenetic trees and from a few pairs of related sequences.
Implementation of geometric hashing on the Connection Machine
  • I. Rigoutsos, R. Hummel
  • Computer Science
    [1991 Proceedings] Workshop on Directions in Automated CAD-Based Vision
  • 1991
TLDR
Using this implementation of geometric hashing on the Connection Machine, it is possible to recognize models consisting of patterns of points embedded in scenes, independent of translation, rotation, and scale changes.
The Design and Analysis of Computer Algorithms
TLDR
This text introduces the basic data structures and programming techniques often used in efficient algorithms, and covers use of lists, push-down stacks, queues, trees, and graphs.
Space Efficient 3D Model Indexing
TLDR
It is shown that the set of 2D images produced by the point features of a rigid 3D model can be represented with two lines in two high-dimensional spaces, the lowest-dimensional representation possible.