# KMC 2: Fast and resource-frugal k-mer counting

@article{Deorowicz2015KMC2F, title={KMC 2: Fast and resource-frugal k-mer counting}, author={Sebastian Deorowicz and Marek Kokot and Szymon Grabowski and Agnieszka Debudaj-Grabysz}, journal={Bioinformatics}, year={2015}, volume={31}, number={10}, pages={1569--1576} }

MOTIVATION
Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory.
RESULTS
We present a novel method for k-mer…
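To make the task described under MOTIVATION concrete, the following is a minimal in-memory sketch of k-mer counting and of the histogram of occurrences. It is a naive baseline for illustration only, not the method of KMC 2; the function names `count_kmers` and `occurrence_histogram` and the toy reads are assumptions made for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-symbol substring (k-mer) over an iterable of reads.

    Naive baseline: all counts are kept in one in-memory Counter.
    k-mers containing symbols other than A, C, G, T are skipped.
    """
    alphabet = set("ACGT")
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= alphabet:
                counts[kmer] += 1
    return counts


def occurrence_histogram(counts):
    """Map each occurrence count to the number of distinct k-mers having it."""
    return Counter(counts.values())


# Hypothetical toy input, not data from the paper.
reads = ["ACGTACGT", "CGTAC"]
counts = count_kmers(reads, 3)
hist = occurrence_histogram(counts)
```

Because every distinct k-mer lives in a single dictionary, memory grows with the number of distinct k-mers, which is why large NGS datasets call for more frugal approaches.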

## 213 Citations

K-mer Counting for Genomic Big Data

- Computer Science
- BigData Congress
- 2018

This paper proposes a new distributed method for k-mer counting that scales to 8192 cores with an efficiency of 43% when processing a 2 TB simulated genome dataset with 200 billion distinct k-mers (graph size).

KCMBT: a k-mer Counter based on Multiple Burst Trees

- Computer Science, Medicine
- Bioinform.
- 2016

This work proposes a novel trie-based algorithm, k-mer Counter based on Multiple Burst Trees (KCMBT), which is around 30% faster than the previous best-performing algorithm, KMC2, on a human genome dataset and around six times faster than Jellyfish2.

KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage

- Computer Science
- BCB
- 2018

KmerEstimate is a streaming algorithm that approximates the count of k-mers with a given frequency in a genomic data set, based on a well-known adaptive-sampling streaming algorithm due to Bar-Yossef et al.

Frigate: a fast, in-memory tool for counting and querying k-mers

- 2021 13th International Conference on Bioinformatics and Biomedical Technology
- 2021

K-mer counting is an important step in many bioinformatics applications including genome assembly, sequence error correction, and sequence alignment. As the advancements in next generation sequencing…

Compact and evenly distributed k-mer binning for genomic sequences

- Computer Science, Medicine
- Bioinform.
- 2021

Discount, a distributed k-mer counting tool based on Apache Spark, is presented and used to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. The work introduces the universal frequency ordering, a new combination of frequency-sampled minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes.

Compact and evenly distributed k-mer binning for genomic sequences

- Biology
- 2020

Discount, a distributed k-mer counting tool based on Apache Spark, is presented and used to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. The work introduces the universal frequency ordering, a new combination of frequency-counted minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes.

Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU-Support

- Computer Science
- WABI
- 2016

Gerbil efficiently supports large k without much loss of performance and outperforms state-of-the-art open source k-mer counting tools by up to a factor of 4 for large genome data sets.

Efficient techniques for k-mer counting

- Computer Science
- 2015 IEEE 5th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS)
- 2015

This work reduces running time by devising a novel algorithm for k-mer counting, and shows that the new algorithm outperforms the previous best-known algorithms.

Counting Kmers for Biological Sequences at Large Scale

- Computer Science, Medicine
- Interdisciplinary Sciences: Computational Life Sciences
- 2019

This work proposes SWAPCounter, a highly scalable distributed approach for k-mer counting that is competitive with two other tools, KMC2 and MSPKmerCounter, in a shared-memory environment and shows the highest scalability in strong scaling experiments.

Gerbil: a fast and memory-efficient k-mer counter with GPU-support

- Computer Science, Biology
- Algorithms for Molecular Biology
- 2017

While Gerbil’s performance is comparable to existing state-of-the-art open source k-mer counting tools for small k < 32, it vastly outperforms its competitors for large k, thereby enabling new applications which require large values of k.

## References

Disk-based k-mer counting on a PC

- Medicine, Computer Science
- BMC Bioinformatics
- 2012

KMC is a simple yet efficient parallel disk-based algorithm for counting k-mers; it can count the statistics for short-read human genome data, supplied as a gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores.

DSK: k-mer counting with very low memory usage

- Computer Science, Medicine
- Bioinform.
- 2013

This work presents a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space, and is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory & disk space.

MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting

- Biology, Computer Science
- ArXiv
- 2015

MSPKmerCounter, a disk-based approach, is developed to efficiently perform k-mer counting for large genomes using a small amount of memory, based on a novel technique called Minimum Substring Partitioning (MSP).

Turtle: Identifying frequent k-mers with cache-efficient algorithms

- Computer Science, Medicine
- Bioinform.
- 2014

A novel method that balances time, space and accuracy requirements to efficiently extract frequent k-mers even for high-coverage libraries and large genomes such as human, using cache-efficient algorithms designed to minimize cache misses.

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

- Computer Science, Medicine
- Bioinform.
- 2011

This work proposes a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient, based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length.

KAnalyze: a fast versatile pipelined K-mer toolkit

- Computer Science, Medicine
- Bioinform.
- 2014

KAnalyze is designed to compete with the fastest k-mer counters, to produce reliable output and to support future development efforts through well-architected, documented and testable code.

A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes

- Biology, Medicine
- BMC Genomics
- 2008

The Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets, is introduced; it is based on enhanced suffix arrays, which give much greater flexibility in the choice of k-mer size.

Efficient counting of k-mers in DNA sequences using a bloom filter

- Computer Science, Medicine
- BMC Bioinformatics
- 2011

A new method is presented that identifies all the k-mers that occur more than once in a DNA sequence data set using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements.
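As a rough illustration of the idea in this summary, the sketch below uses a Bloom filter to remember k-mers seen once and keeps an explicit table only for k-mers that appear to repeat. The class `BloomFilter`, the function `repeated_kmers`, the default sizes, and the second exact-counting pass that removes Bloom filter false positives are all assumptions made for this example, not the published implementation.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings (illustrative; sizes are arbitrary)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several hash functions from BLAKE2b by varying the salt.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def repeated_kmers(sequence, k, num_bits=1 << 16, num_hashes=3):
    """Return exact counts of the k-mers that occur more than once."""
    seen = BloomFilter(num_bits, num_hashes)
    candidates = {}
    # Pass 1: a k-mer already "in" the filter is a candidate repeat.
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in seen:
            candidates[kmer] = 0
        else:
            seen.add(kmer)
    # Pass 2 (assumed design): count candidates exactly, so that Bloom
    # filter false positives end up with a count of 1 and are dropped.
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in candidates:
            candidates[kmer] += 1
    return {kmer: n for kmer, n in candidates.items() if n > 1}


repeats = repeated_kmers("ACGTACGTTT", 3)
```

The memory saving comes from k-mers that occur only once: they cost a few bits in the filter rather than a full table entry.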

keeSeek: searching distant non-existing words in genomes for PCR-based applications

- Computer Science, Medicine
- Bioinform.
- 2014

KeeSeek is able to find absent sequences with primer-like features, which can be used as unique labels for exogenously inserted DNA fragments to recover their exact position in the genome using PCR techniques.

Reducing storage requirements for biological sequence comparison

- Biology, Medicine
- Bioinform.
- 2004

A simple and elegant method in which only a small fraction of seeds, called 'minimizers', needs to be stored, which can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.
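The minimizer idea in this summary can be sketched as follows: from every window of `w` consecutive k-mers, keep only the smallest one. The function name `minimizers`, the plain lexicographic ordering, the leftmost tie-breaking, and the toy sequence are assumptions chosen for this example; other orderings are possible.

```python
def minimizers(sequence, k, w):
    """Return the (k-mer, position) minimizers of a sequence.

    For each window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (leftmost on ties). Adjacent windows
    often share a minimizer, so few seeds need to be stored.
    """
    selected = set()
    num_kmers = len(sequence) - k + 1
    for start in range(num_kmers - w + 1):
        window = [(sequence[i:i + k], i) for i in range(start, start + w)]
        selected.add(min(window))  # tuple order: k-mer first, then position
    return sorted(selected, key=lambda pair: pair[1])


# Hypothetical toy sequence with five 3-mers; only two are kept as seeds.
seeds = minimizers("GATTACA", k=3, w=3)
```

Two sequences that share a stretch of at least `w + k - 1` symbols select the same minimizer from that stretch, which is what allows matches to be found from the reduced seed set.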