KMC 2: Fast and resource-frugal k-mer counting

@article{Deorowicz2015KMC2F,
  title={KMC 2: Fast and resource-frugal k-mer counting},
  author={Sebastian Deorowicz and Marek Kokot and Szymon Grabowski and Agnieszka Debudaj-Grabysz},
  journal={Bioinformatics},
  year={2015},
  volume={31},
  number={10},
  pages={1569--1576}
}
Motivation: Building the histogram of occurrences of every k-symbol-long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory. Results: We present a novel method for k-mer…
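As an illustration of the task the abstract describes, here is a minimal in-memory sketch of k-mer counting and of the resulting occurrence histogram. It is a naive baseline, not the KMC 2 method; the function names `count_kmers` and `histogram` are hypothetical, and skipping windows that contain non-ACGT symbols is an assumption made for the example.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every k-symbol-long substring made only of A, C, G, T."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows with ambiguous symbols (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def histogram(counts):
    """Map an occurrence count to the number of distinct k-mers having it."""
    return Counter(counts.values())


counts = count_kmers("ACGTACGTAC", 3)
print(dict(counts))
print(dict(histogram(counts)))
```

This approach keeps every distinct k-mer in one dictionary, so its memory use grows with the number of distinct k-mers. That is exactly the cost that disk-based and partitioning counters aim to avoid on NGS-scale data.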
K-mer Counting for Genomic Big Data
TLDR
This paper proposes a new, highly scalable distributed method for k-mer counting that scales to 8192 cores with an efficiency of 43% when processing a 2 TB simulated genome dataset with 200 billion distinct k-mers (graph size).
KCMBT: a k-mer Counter based on Multiple Burst Trees
TLDR
This work proposes a novel trie-based algorithm, k-mer Counter based on Multiple Burst Trees (KCMBT), which is around 30% faster than the previous best-performing algorithm, KMC2, on a human genome dataset and around six times faster than Jellyfish2.
KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage
TLDR
KmerEstimate is a streaming algorithm that approximates the count of k-mers with a given frequency in a genomic data set; it is based on a well-known adaptive-sampling streaming algorithm due to Bar-Yossef et al.
Frigate: a fast, in-memory tool for counting and querying k-mers
K-mer counting is an important step in many bioinformatics applications including genome assembly, sequence error correction, and sequence alignment. As the advancements in next generation sequencing
Compact and evenly distributed k-mer binning for genomic sequences
TLDR
Discount, a distributed k-mer counting tool based on Apache Spark, is presented, which is used to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data and introduces the universal frequency ordering, a new combination of frequency-sampled minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes.
Compact and evenly distributed k-mer binning for genomic sequences
TLDR
Discount, a distributed k-mer counting tool based on Apache Spark, is presented, which is used to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data and introduces the universal frequency ordering, a new combination of frequency counted minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes.
Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU-Support
TLDR
Gerbil efficiently supports large k without much loss of performance and outperforms state-of-the-art open source k-mer counting tools by up to a factor of 4 for large genome data sets.
Efficient techniques for k-mer counting
TLDR
This work reduces running time by devising a novel algorithm for k-mer counting, and shows that this new algorithm outperforms the previous best-known algorithms.
Counting Kmers for Biological Sequences at Large Scale
TLDR
This work proposes SWAPCounter, a highly scalable distributed approach for k-mer counting that is competitive with two other tools, KMC2 and MSPKmerCounter, in a shared-memory environment and shows the highest scalability in strong-scaling experiments.
Gerbil: a fast and memory-efficient k-mer counter with GPU-support
TLDR
While Gerbil’s performance is comparable to existing state-of-the-art open source k-mer counting tools for small k < 32, it vastly outperforms its competitors for large k, thereby enabling new applications which require large values of k.

References

SHOWING 1-10 OF 35 REFERENCES
Disk-based k-mer counting on a PC
TLDR
A simple yet efficient parallel disk-based algorithm for counting k-mers, called KMC, which is capable of counting the statistics for short-read human genome data, read from a gzipped FASTQ input file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores.
DSK: k-mer counting with very low memory usage
TLDR
This work presents a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space, and is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory & disk space.
MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting
  • Yang Li, Xifeng Yan
  • Biology, Computer Science
    ArXiv
  • 2015
TLDR
MSPKmerCounter is developed, a disk-based approach, to efficiently perform k-mer counting for large genomes using a small amount of memory, based on a novel technique called Minimum Substring Partitioning (MSP).
Turtle: Identifying frequent k-mers with cache-efficient algorithms
TLDR
A novel method that balances time, space and accuracy requirements to efficiently extract frequent k-mers even for high-coverage libraries and large genomes such as human, designed to minimize cache misses.
A fast, lock-free approach for efficient parallel counting of occurrences of k-mers
TLDR
This work proposes a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient, based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length.
KAnalyze: a fast versatile pipelined K-mer toolkit
TLDR
KAnalyze is designed to compete with the fastest k-mer counters, to produce reliable output and to support future development efforts through well-architected, documented and testable code.
A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes
TLDR
The Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets, is introduced; it is based on enhanced suffix arrays, which give much greater flexibility in the choice of k-mer size.
Efficient counting of k-mers in DNA sequences using a bloom filter
TLDR
A new method is presented that identifies all the k-mers that occur more than once in a DNA sequence data set using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements.
keeSeek: searching distant non-existing words in genomes for PCR-based applications
TLDR
KeeSeek is able to find absent sequences with primer-like features, which can be used as unique labels for exogenously inserted DNA fragments to recover their exact position in the genome using PCR techniques.
Reducing storage requirements for biological sequence comparison
TLDR
A simple and elegant method in which only a small fraction of seeds, called 'minimizers', needs to be stored, which can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.
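The minimizer idea summarized in the last entry can be sketched as follows: from every window of w consecutive k-mers, keep only the smallest one. This is a hedged illustration, not the cited paper's implementation. The function name `minimizers` is hypothetical, and plain lexicographic ordering with leftmost tie-breaking is an assumption; other orderings are possible.

```python
def minimizers(sequence, k, w):
    """Return sorted (position, k-mer) pairs selected as window minimizers.

    For each window of w consecutive k-mers, the lexicographically
    smallest k-mer is kept (leftmost on ties). Ordering is an assumption.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        # Pick the index of the smallest k-mer inside the current window.
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        selected.add((best, kmers[best]))
    return sorted(selected)


print(minimizers("ACGTACGTAC", 3, 3))
```

Adjacent windows often share the same minimizer, so the number of stored seeds is typically well below the number of k-mers. The saving on this tiny periodic example is modest.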