MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets

@article{Steinegger2017MMseqs2ES,
  title={MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets},
  author={Martin Steinegger and Johannes S{\"o}ding},
  journal={Nature Biotechnology},
  year={2017},
  volume={35},
  number={11},
  pages={1026--1028}
}
Nature Biotechnology, Volume 35, Number 11, November 2017

… performance was to combine the double-match criterion with making k-mers as long as possible, which required finding similar and not just exact k-mers. This effectively bases our decision on up to 2 × 7 = 14 residues instead of just 2 × 3 in BLAST or 12 letters on a size-11 alphabet in DIAMOND. MMseqs2 is parallelized on three levels: time-critical parts are manually vectorized, queries can be distributed to multiple cores, and the target…
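The double-match idea in the excerpt above can be illustrated with a minimal sketch. It is not the MMseqs2 implementation: it assumes exact k-mer matches only (the paper's point is that similar k-mers are matched too), it counts overlapping k-mers as separate hits, and the function names `kmer_index` and `double_match_diagonals` are hypothetical.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map each k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def double_match_diagonals(query, target, k=7):
    """Return the diagonals (i - j) that hold at least two exact k-mer matches.

    Simplification: exact matches only, and overlapping k-mers count
    as separate hits. This is an illustrative assumption, not the
    similar-k-mer matching described in the paper.
    """
    target_index = kmer_index(target, k)
    hits_per_diagonal = defaultdict(int)
    for i in range(len(query) - k + 1):
        # Every target position j sharing this k-mer yields a hit on diagonal i - j.
        for j in target_index.get(query[i:i + k], ()):
            hits_per_diagonal[i - j] += 1
    return sorted(d for d, n in hits_per_diagonal.items() if n >= 2)


if __name__ == "__main__":
    # The target is the query shifted right by two residues.
    print(double_match_diagonals("MKTAYIAKQRQISFVK", "GGMKTAYIAKQRQISFVK"))
```

Requiring two hits on one diagonal rejects isolated chance matches, and with k = 7 two non-overlapping hits cover the 2 × 7 = 14 residues mentioned in the excerpt.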

ProtPCV: A Fixed Dimensional Numerical Representation of Protein Sequence to Significantly Reduce Sequence Search Time
TLDR
It is found that the novel and unique numerical representation of a protein can reduce the computational complexity of protein sequence search to the tune of O(log(n)).
Clustering huge protein sequence sets in linear time
TLDR
Linclust is developed, the first clustering algorithm whose runtime scales as N, independent of K, and will help to unlock the great wealth contained in metagenomic and genomic sequence databases.
LAMPA, LArge Multidomain Protein Annotator, and its application to RNA virus polyproteins
TLDR
An iterative approach to resolve the above conundrum by gradual expansion of hit coverage of multidomain proteins through re-evaluating statistical significance of hit similarity using ever smaller queries defined at each iteration of LAMPA.
An optimized FM-index library for nucleotide and amino acid search
TLDR
AvxWindowedFMindex (AWFM-index), an open-source, thread-parallel FM-index library written in C that is highly optimized for indexing nucleotide and amino acid sequences, parallelizes trivially across multiple threads, and scales well in multithreaded contexts.
Clustering huge protein sequence sets in linear time
TLDR
Linclust is developed, an algorithm with linear time complexity that can cluster over a billion sequences within hours on a single server, and will help to unlock the great wealth contained in metagenomic and genomic sequence databases.
Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index.
TLDR
Mantis is a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitate large-scale sequence searches; it was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min.
Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index
TLDR
Mantis is a space-efficient data structure that can be used to index thousands of raw-read experiments and facilitate large-scale sequence searches on those experiments, enabling rapid index builds and queries, small indexes, and exact results, i.e., no false positives or negatives.
Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices
TLDR
This work presents a new distributed-memory software, PASTIS, which incorporates the unique bias in amino acid sequence substitution in searches without altering the basic sparse matrix model, and achieves ideal scaling up to millions of protein sequences.
RNANet: an automatically built dual-source dataset integrating homologous sequences and RNA structures
TLDR
The first attempt at a standardized and automatically generated dataset dedicated to RNA, combining RNA sequences, homology information (in the form of position-specific scoring matrices), and information derived from annotation of available 3D structures (including secondary structure, canonical and non-canonical interactions, and backbone torsion angles).
catRAPID omics v2.0: going deeper and wider in the prediction of protein–RNA interactions
TLDR
The sequence fragmentation scheme of the catRAPID fragment module has been included, which allows the server to handle long linear RNAs and to analyse circular RNAs, and the web server shows the predicted binding sites in both protein and RNA sequences and reports whether the predicted interactions are conserved in orthologous protein–RNA pairs.

References

SHOWING 1-10 OF 39 REFERENCES
MMseqs software suite for fast and deep clustering and searching of large protein sequence sets
TLDR
In the authors' homology detection benchmarks, MMseqs is much more sensitive and 4-30 times faster than UBLAST and RAPsearch, respectively, although it does not reach BLAST sensitivity yet.
kClust: fast and sensitive clustering of large protein sequence databases
TLDR
This work presents a method to cluster large protein sequence databases such as UniProt within days down to 20%-30% maximum pairwise sequence identity and compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.
Improved BLAST searches using longer words for protein seeding
TLDR
An improved trade-off between running time and retrieval accuracy is demonstrated, controlled by the score threshold used for short word matches, while achieving ROC scores similar to those obtained with current default parameters.
RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data
TLDR
RAPSearch2 is presented, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index a similarity search database; an optimized data structure further speeds up the similarity search.
SWORD - a highly efficient protein database search
TLDR
SWORD is an efficient protein database search implementation that runs 8-16 times faster than BLAST in the sensitive mode and up to 68 times faster in the fast and less accurate mode and is especially suitable for large databases.
RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data
Summary: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly…
The Pfam protein families database: towards a more sustainable future
TLDR
Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched sequences and species reported on the website restricted to this smaller set, and the facility to view the relationship between families within a clan has been improved by the introduction of a new tool.
Lambda: the local aligner for massive biological data
TLDR
In tests, Lambda often outperforms the best tools at reproducing BLAST’s results and is the fastest compared with the current state of the art at comparable levels of sensitivity.
A poor man’s BLASTX—high-throughput metagenomic protein database search using PAUDA
TLDR
A new approach to protein database search called PAUDA is introduced, which runs ~10,000 times faster than BLASTX, while achieving about one-third of the assignment rate of reads to KEGG orthology groups, and producing gene and taxon abundance profiles that are highly correlated to those obtained with BLASTX.
BLAT--the BLAST-like alignment tool.
TLDR
Describes how BLAT was optimized; BLAT is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences.