• Stars
    star
    162
  • Rank 232,284 (Top 5 %)
  • Language
    C++
  • License
    MIT License
  • Created over 4 years ago
  • Updated over 4 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Code examples of fast and simple k-mer counters for tutorial purposes

Getting Started

git clone https://github.com/lh3/kmer-cnt
cd kmer-cnt
make  # C++11 required to compile the two C++ implementations
wget https://github.com/lh3/kmer-cnt/releases/download/v0.1/M_abscessus_HiSeq_10M.fa.gz
./yak-count M_abscessus_HiSeq_10M.fa.gz > kc-c4.out

Introduction

K-mer counting is the foundation of many mappers, assemblers and miscellaneous tools (e.g. genotypers, metagenomics profilers, etc). It is one of the most important classes of algorithms in Bioinformatics. Here we will implement basic k-mer counting algorithms but with advanced engineering tricks. We will see how far better engineering can go.

In this repo, each {kc,yak}-*.* file implements a standalone k-mer counter. As to other files: ketopt.h is a command line option parser; khashl.h is a generic hash table library in C; kseq.h is a fasta/fastq parser; kthread.{h,c} provides two multi-threading models; robin_hood.h is a C++11 hash table library.

Results

We provide eight k-mer counters, which are detailed below the result table. All implementations count canonical k-mers, the lexicographically smaller k-mer between the k-mers on the two DNA strands.

The following table shows the timing and peak memory of different implementations for counting 31-mers from 2.5 million pairs of 100bp reads sampled from the HiSeq M. abscessus 6G-0125-R dateset in GAGE-B. They were run on a Linux server equipped with two EPYC 7301 CPUs and 512GB RAM.

Implementation Limitation Elapsed time (s) CPU time (s) Peak RAM (GB)
kc-py1 + Python3.7 499.6 499.5 8.15
kc-py1 + Pypy7.3 1220.8 1220.8 12.21
kc-cpp1 528.0 527.9 8.27
kc-cpp2 319.6 319.6 6.90
kc-c1 <=32-mer 39.3 38.3 1.52
kc-c2 <=32-mer; <1024 count 38.7 37.9 1.05
kc-c3 <=32-mer; <1024 count 34.1 38.7 1.15
kc-c4 (2+4 threads) <=32-mer; <1024 count 7.5 35.1 1.27
yak-count (2+4; >=2 count) <=32-mer; <1024 count 14.6 54.8 0.47
jellyfish2 (16 threads) 10.8 163.9 0.82
KMC3 (16 thr; in-mem) 9.2 36.2 5.02

Discussions

  • kc-py1.py is a basic Python3 implementation. It uses string translate for fast complementary. Interestingly, pypy is much slower than python3. Perhaps the official python3 comes with a better hash table implementation. Just a guess. I often recommend pypy over python. I need to be more careful about this recommendation in future.

  • kc-cpp1.cpp implements a basic counter in C++11 using STL's unordered_map. It is slower than python3. This is partly because STL's hash table implementation is very inefficient. C++ does not necessarily lead to a fast implementation.

  • kc-cpp2.cpp replaces std::unordered_map with Martin Ankerl's robin_hood hash table library, which is among the fastest hash table implementations. It is now faster than kc-py1.py, though the performance gap is small.

  • kc-c1.c packs k-mers no longer than 32bp into 64-bit integers. This dramatically improves speed and reduces the peak memory. Most practical k-mer counters employs bit packing. Excluding library files, this counter has less than 100 coding lines, not much more complex than the C++ or the python implementations.

  • kc-c2.c uses an ensemble of hash tables to save 8 bits for counter. This reduces the peak memory. The key advantage of using multiple hash tables is to implement multithreading. See below.

  • kc-c3.c puts file reading and parsing into a separate thread. The performance improvement is minor here, but it sets the stage for the next multi-threaded implementation.

  • kc-c4.c is the fastest counter in this series. Due to the use of an ensembl of hash tables in kc-c2, we can parallelize the insertion of a batch of k-mers. It is much faster than the previous versions. Notably, kc-c4 also uses less CPU time. This is probably because batching helps data locality.

  • yak-count.c is adapted from yak and uses the same kc-c4 algorithm. Similar to BFCounter, it optionally adds a bloom filter to filter out most singleton k-mers (k-mers occurring only once in the input). Yak needs to update the bloom filter, read the input twice and count twice. It is slower but uses less memory. Yak-count is the most complex example in this repo, but it is still short. Its code is also better organized. Command line: -b30 (bloom filter with 1 billion bits).

  • jellyfish2 is probably the fastest in-memory k-mer counter to date. It uses less memory and more flexible than kc-c4, but it is slower and much more complex. Command line: count -m 31 -C -s 100000000 -o /dev/null -t 16.

  • KMC3 is one of the fastest k-mer counters. It uses minimizers and relies on sorting. KMC3 is run in the in-memory mode here. The disk mode is as fast. KMC3 is optimized for counting much larger datasets. Although it uses more RAM here, it generally uses less RAM than jellyfish2 and other in-memory counters given high-coverage human data. Command line: -k31 -t16 -r -fa.

Conclusions

The k-mer counters here are fairly basic implementations only using generic hash tables. Nonetheless, we show better engineering can carry the basic idea a long way. If you want to implement your own k-mer counter, yak-count.c could be a good starting point. It is fast and relatively simple. By the way, if you have an efficient and simple k-mer counter (implemented in a few files), please let me know. I will be happy to add it to the table.

More Repositories

1

minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
C
1,754
star
2

bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
C
1,504
star
3

seqtk

Toolkit for processing sequences in FASTA/Q formats
C
1,363
star
4

bioawk

BWK awk modified for biological data
C
590
star
5

minigraph

Sequence-to-graph mapper and graph generator
C
402
star
6

miniprot

Align proteins to genomes with splicing and frameshift
C
316
star
7

miniasm

Ultrafast de novo assembly for long noisy reads (though having no consensus step)
TeX
302
star
8

wgsim

Reads simulator
C
258
star
9

gfatools

Tools for manipulating sequence graphs in the GFA and rGFA formats
C
208
star
10

biofast

Benchmarking programming languages/implementations for common tasks in Bioinformatics
C
175
star
11

readfq

Fast multi-line FASTA/Q reader in several programming languages
C
170
star
12

cgranges

A C/C++ library for fast interval overlap queries (with a "bedtools coverage" example)
C
165
star
13

pangene

Constructing a pangenome gene graph
C
158
star
14

psmc

Implementation of the Pairwise Sequentially Markovian Coalescent (PSMC) model
C
147
star
15

bedtk

A simple toolset for BED files (warning: CLI may change before bedtk becomes stable)
C
134
star
16

ksw2

Global alignment and alignment extension
C
127
star
17

yak

Yet another k-mer analyzer
C
111
star
18

fermikit

De novo assembly based variant calling pipeline for Illumina short reads
TeX
108
star
19

minimap

This repo is DEPRECATED. Please use minimap2, the successor of minimap.
C
106
star
20

hickit

TAD calling, phase imputation, 3D modeling and more for diploid single-cell Hi-C (Dip-C) and general Hi-C
C
105
star
21

bgt

Flexible genotype query among 30,000+ samples whole-genome
C
96
star
22

dipcall

Reference-based variant calling pipeline for a pair of phased haplotype assemblies
JavaScript
92
star
23

srf

SRF: Satellite Repeat Finder
TeX
86
star
24

unimap

A EXPERIMENTAL fork of minimap2 optimized for assembly-to-reference alignment
C
85
star
25

minipileup

Simple pileup-based variant caller
C
79
star
26

fermi

A WGS de novo assembler based on the FMD-index for large genomes
C
74
star
27

dna-nn

Model and predict short DNA sequence features with neural networks
C
72
star
28

fermi-lite

Standalone C library for assembling Illumina short reads in small regions
C
72
star
29

bfc

High-performance error correction for Illumina resequencing data
TeX
68
star
30

ropebwt2

Incremental construction of FM-index for DNA sequences
TeX
67
star
31

tabtk

Toolkit for processing TAB-delimited format
C
59
star
32

gwfa

Proof-of-concept implementation of GWFA for sequence-to-graph alignment
C
56
star
33

CHM-eval

TeX
49
star
34

miniwfa

A reimplementation of the WaveFront Alignment algorithm at low memory
C
49
star
35

jstreeview

Interactive phylogenetic tree viewer/editor
JavaScript
46
star
36

samtools

This is *NOT* the official repository of samtools.
C
46
star
37

etrf

Exact Tandem Repeat Finder (not a TRF replacement)
C
45
star
38

ref-gen

Human reference genome analysis sets
Makefile
44
star
39

bioseq-js

For live demo, see http://lh3lh3.users.sourceforge.net/bioseq.shtml
HTML
37
star
40

lv89

C implementation of the Landau-Vishkin algorithm
C++
35
star
41

partig

An experimental tool to estimate the similarity between all pairs of contigs
C
35
star
42

asub

A unified array job submitter for LSF, SGE/UGE and Slurm
Perl
32
star
43

klib.nim

Experimental getopt, gzip reader, FASTA/Q parser and interval queries in nim-lang
Nim
32
star
44

calN50

Compute N50/NG50 and auN/auNG
JavaScript
31
star
45

sdust

Symmetric DUST for finding low-complexity regions in DNA sequences
C
31
star
46

gffio

C
31
star
47

pre-pe

Preprocessing paired-end reads produced with experiment-specific protocols
C
31
star
48

hapdip

The CHM1-NA12878 benchmark for single-sample SNP/INDEL calling from WGS Illumina data
JavaScript
30
star
49

fermi2

C
26
star
50

misc

Useful small programs
C
26
star
51

varcmp

The first CHM1 paper (Li, 2014)
TeX
25
star
52

minisv

Lightweight mosaic/somatic SV caller for long reads (WIP)
JavaScript
24
star
53

lianti

Tools to process LIANTI sequence data
C
23
star
54

rtgeval

Wrapper for RTG's vcfeval; DEPRECATED!
Shell
21
star
55

nasw

Dynamic programming for aa-to-nt alignment with affine gap, splicing and frameshift
C
18
star
56

sgdp-fermi

FermiKit small variant calls for public SGDP samples
17
star
57

gfa1

This repo is deprecated. Please use gfatools instead.
C
16
star
58

pubLRasm

16
star
59

PortableCrystal

Portable Crystal binary distributions for Linux on x86_64
15
star
60

foreign

Modified or extracted from other programs
C
15
star
61

trimadap

Fast but inaccurate adapter trimmer for Illumina reads
C
14
star
62

lh3-snippets

C
14
star
63

ropebwt3

Construction and utility of BWT for DNA string sets
C
13
star
64

treebest

TreeBeST: Tree Building guided by Species Tree
C
13
star
65

unicall

A wrapper for calling small variants from human germline high-coverage single-sample Illumina data
Perl
12
star
66

fastARG

Fast heuristic ARG construction
C
12
star
67

proot-wrapper

Demonstrating the PRoot program
Perl
12
star
68

rmaxcut

An experimental tool to find approximate max-cuts in a large graph
C
11
star
69

bwa-docker

Minimal docker image for bwa. Not developed any more.
11
star
70

sdg

EXPERIMENTAL implementation of side graph
C
10
star
71

naivepca

Naive PCA for genotype data
C
10
star
72

mdust

mdust from DFCI Gene Indices Software Tools (archived for a historical record only)
C
10
star
73

editdist-U85

Fast implementation of Ukkenon's O(ND) algorithm for computing edit distance
C
9
star
74

lh3.github.com

TeX
9
star
75

libdivsufsort

Automatically exported from code.google.com/p/libdivsufsort
C
8
star
76

mem-paper

Manuscript for BWA-MEM
6
star
77

bcf2

Experimental bcftools port to support BCF2; DEPRECATED by htslib and htsbox
C
6
star
78

thesis

PhD thesis
TeX
5
star
79

ropebwt

C
4
star
80

fermi-paper

The first fermi paper (Li, 2012)
3
star
81

crlf

Concise Run-Length Format for small alphabets; DEPRECATED
C
3
star
82

psnw

prototype
C
2
star
83

centos5-vm

Instructions on how to deploy CentOS 5 virtual machines
2
star
84

mag2gfa

DEPRECATED. Code has been moved to lh3/gfa1/misc
C
2
star
85

ibsget

Download files from Illumina BaseSpace (*OUTDATED* as BaseSpace has changed APIs)
C
2
star
86

smtl-paper

Samtools statistics paper (Li, 2011)
Lua
1
star
87

mssa-bench

Evaluating the performance of multi-string SA construction
C
1
star
88

samtools-legacy

For testing only. DON'T USE!
C
1
star