Burrows-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)

Note: minimap2 has replaced BWA-MEM for PacBio and Nanopore read alignment. It retains all major BWA-MEM features, but is ~50 times as fast, more versatile, more accurate and produces better base-level alignment. A beta version of BWA-MEM2 has been released for short-read mapping. BWA-MEM2 is about twice as fast as BWA-MEM and outputs nearly identical alignments.

Getting started

git clone https://github.com/lh3/bwa.git
cd bwa; make
./bwa index ref.fa
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz

Introduction

BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the other two are designed for longer sequences ranging from 70bp to a few megabases. BWA-MEM and BWA-SW share similar features such as support for long reads and chimeric alignment, but BWA-MEM, which is the latest, is generally recommended as it is faster and more accurate. BWA-MEM also performs better than BWA-backtrack for 70-100bp Illumina reads.

For all the algorithms, BWA first needs to construct the FM-index for the reference genome (the index command). Alignment algorithms are invoked with different sub-commands: aln/samse/sampe for BWA-backtrack, bwasw for BWA-SW and mem for the BWA-MEM algorithm.

Availability

BWA is released under GPLv3. The latest source code is freely available on GitHub. Released packages can be downloaded from SourceForge. After you acquire the source code, simply use make to compile and copy the single executable bwa to the destination you want. The only dependency required to build BWA is zlib.

Since 0.7.11, a precompiled binary for x86_64-linux is available in bwakit. In addition to BWA, this self-consistent package also comes with bwa-associated and 3rd-party tools for proper BAM-to-FASTQ conversion, mapping to ALT contigs, adapter trimming, duplicate marking, HLA typing and associated data files.

Seeking help

The detailed usage is described in the man page available together with the source code. You can use man ./bwa.1 to view the man page in a terminal. The HTML version of the man page can be found at the BWA website. If you have questions about BWA, you may sign up for the mailing list and then send the questions to [email protected]. You may also ask questions in forums such as BioStar and SEQanswers.

Citing BWA

  • Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]. (if you use the BWA-backtrack algorithm)

  • Li H. and Durbin R. (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26, 589-595. [PMID: 20080505]. (if you use the BWA-SW algorithm)

  • Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]. (if you use the BWA-MEM algorithm or the fastmap command, or want to cite the whole BWA package)

Please note that the last reference is a preprint hosted at arXiv.org. I do not plan to submit it to a peer-reviewed journal in the near future.

Frequently asked questions (FAQs)

  1. What types of data does BWA work with?
  2. Why does a read appear multiple times in the output SAM?
  3. Does BWA work on reference sequences longer than 4GB in total?
  4. Why can one read in a pair have a high mapping quality but the other has zero?
  5. How can a BWA-backtrack alignment stand out of the end of a chromosome?
  6. Does BWA work with ALT contigs in the GRCh38 release?
  7. Can I just run BWA-MEM against GRCh38+ALT without post-processing?

1. What types of data does BWA work with?

BWA works with various types of DNA sequence data, though the optimal algorithm and settings may vary. The following list gives the recommended settings:

  • Illumina/454/IonTorrent single-end reads longer than ~70bp or assembly contigs up to a few megabases mapped to a closely related reference genome:

      bwa mem ref.fa reads.fq > aln.sam
    
  • Illumina single-end reads shorter than ~70bp:

      bwa aln ref.fa reads.fq > reads.sai; bwa samse ref.fa reads.sai reads.fq > aln-se.sam
    
  • Illumina/454/IonTorrent paired-end reads longer than ~70bp:

      bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
    
  • Illumina paired-end reads shorter than ~70bp:

      bwa aln ref.fa read1.fq > read1.sai; bwa aln ref.fa read2.fq > read2.sai
      bwa sampe ref.fa read1.sai read2.sai read1.fq read2.fq > aln-pe.sam
    
  • PacBio subreads or Oxford Nanopore reads to a reference genome:

      bwa mem -x pacbio ref.fa reads.fq > aln.sam
      bwa mem -x ont2d ref.fa reads.fq > aln.sam
    

BWA-MEM is recommended for query sequences longer than ~70bp for a variety of error rates (or sequence divergence). Generally, BWA-MEM is more tolerant of errors given longer query sequences, as the chance of missing all seeds is small. As shown above, with non-default settings, BWA-MEM works with Oxford Nanopore reads with a sequencing error rate over 20%.
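The recommendations above can be condensed into a small lookup helper. This is a minimal sketch in Python, not part of BWA: the function name `recommended_commands`, its parameters, and the treatment of ~70bp as a hard cutoff are assumptions made for illustration. The command lines it returns are the ones listed above.

```python
def recommended_commands(read_len, paired=False, platform="illumina", ref="ref.fa"):
    """Return the command lines recommended in this FAQ for a given data type.

    Hypothetical helper: `platform` is "illumina", "pacbio" or "ont2d", and
    ~70bp is treated as a hard threshold, which the text only gives approximately.
    """
    # Long noisy reads use BWA-MEM with a preset, regardless of length.
    if platform in ("pacbio", "ont2d"):
        return [f"bwa mem -x {platform} {ref} reads.fq > aln.sam"]

    # Reads of ~70bp and longer: BWA-MEM.
    if read_len >= 70:
        if paired:
            return [f"bwa mem {ref} read1.fq read2.fq > aln-pe.sam"]
        return [f"bwa mem {ref} reads.fq > aln.sam"]

    # Shorter Illumina reads: BWA-backtrack (aln, then samse or sampe).
    if paired:
        return [
            f"bwa aln {ref} read1.fq > read1.sai",
            f"bwa aln {ref} read2.fq > read2.sai",
            f"bwa sampe {ref} read1.sai read2.sai read1.fq read2.fq > aln-pe.sam",
        ]
    return [
        f"bwa aln {ref} reads.fq > reads.sai",
        f"bwa samse {ref} reads.sai reads.fq > aln-se.sam",
    ]


if __name__ == "__main__":
    for cmd in recommended_commands(36, paired=True):
        print(cmd)
```

The helper only returns strings; it does not run BWA. Short reads from platforms other than Illumina are not covered by the list above, so the sketch falls through to the BWA-backtrack commands for them without any claim that this is the best choice.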

2. Why does a read appear multiple times in the output SAM?

BWA-SW and BWA-MEM perform local alignments. If there is a translocation, a gene fusion or a long deletion, a read bridging the break point may have two hits, occupying two lines in the SAM output. With the default setting of BWA-MEM, one and only one line is primary and is soft clipped; other lines are tagged with the 0x800 SAM flag (supplementary alignment) and are hard clipped.
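To see which lines of a read are primary and which are supplementary, you can test the FLAG field (column 2) of each SAM record. The sketch below is an illustration in Python, not part of BWA; the function names and the two example records are made up. The bit values 0x4, 0x100 and 0x800 are the ones defined by the SAM format.

```python
UNMAPPED = 0x4         # read is flagged as unmapped
SECONDARY = 0x100      # secondary alignment
SUPPLEMENTARY = 0x800  # supplementary alignment (part of a chimeric read)


def alignment_kind(flag):
    """Classify one SAM record from its integer FLAG field."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & SUPPLEMENTARY:
        return "supplementary"
    if flag & SECONDARY:
        return "secondary"
    return "primary"


def kinds_by_read(sam_lines):
    """Map each read name to the kinds of its records, in file order."""
    kinds = {}
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        kinds.setdefault(fields[0], []).append(alignment_kind(int(fields[1])))
    return kinds


if __name__ == "__main__":
    # Made-up records: one read split over two chromosomes.
    example = [
        "@SQ\tSN:chr1\tLN:1000",
        "read1\t0\tchr1\t100\t60\t60M40S\t*\t0\t0\t*\t*",
        "read1\t2048\tchr5\t5000\t60\t60H40M\t*\t0\t0\t*\t*",
    ]
    print(kinds_by_read(example))
```

In the example, the first record carries a soft clip (`S` in the CIGAR) and the second, with flag 2048 (0x800), carries a hard clip (`H`), matching the behaviour described above.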

3. Does BWA work on reference sequences longer than 4GB in total?

Yes. Since 0.6.x, all BWA algorithms work with a genome with total length over 4GB. However, an individual chromosome should not be longer than 2GB.
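Before indexing an unusual reference, it can be worth checking the per-sequence lengths. The following is a rough Python sketch, not a BWA tool. It assumes that "2GB" means 2^31 bases, which the text above does not spell out, and the function names are hypothetical.

```python
def fasta_lengths(lines):
    """Return {sequence name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = (line[1:].split() or [""])[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def oversized(lengths, limit=2**31):
    """Names of sequences longer than `limit` bases (assumed reading of 2GB)."""
    return sorted(name for name, length in lengths.items() if length > limit)


if __name__ == "__main__":
    import sys

    # Usage: python check_lengths.py ref.fa
    with open(sys.argv[1]) as fh:
        too_long = oversized(fasta_lengths(fh))
    print("sequences over the limit:", too_long or "none")
```

The sketch reads uncompressed FASTA only; a gzipped reference would need `gzip.open` in text mode instead of `open`.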

4. Why can one read in a pair have a high mapping quality but the other has zero?

This is correct. Mapping quality is assigned to each individual read, not to a read pair. It is possible that one read can be mapped unambiguously, but its mate falls in a tandem repeat and thus its accurate position cannot be determined.
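As a reminder of what the two values mean, the SAM format defines mapping quality as a Phred-scaled probability that the reported position is wrong, i.e. MAPQ = -10 log10(P(wrong)). The short Python sketch below converts in that direction; it is illustrative only and says nothing about how BWA estimates the value. The function name is an assumption.

```python
def mapq_to_error_prob(mapq):
    """Convert a Phred-scaled mapping quality to P(mapping position is wrong)."""
    return 10 ** (-mapq / 10)


if __name__ == "__main__":
    # A pair where one read is placed confidently and its mate is not.
    for name, mapq in (("read/1", 60), ("read/2", 0)):
        print(name, mapq, mapq_to_error_prob(mapq))
```

A mapping quality of zero therefore corresponds to an error probability of 1 on this scale: the aligner reports a position, but gives no confidence that it is the right one.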

5. How can a BWA-backtrack alignment stand out of the end of a chromosome?

Internally, BWA concatenates all reference sequences into one long sequence. A read may be mapped to the junction of two adjacent reference sequences. In this case, BWA-backtrack will flag the read as unmapped (0x4), but you will see position, CIGAR and all the tags. A similar issue may occur with BWA-SW alignments as well. BWA-MEM does not have this problem.
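The junction case can be pictured with a simplified model of the concatenated reference. The Python sketch below is not BWA's internal data layout: the function name, the 0-based coordinates on the concatenated sequence and the example lengths are all assumptions for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def spans_junction(contigs, start, aln_len):
    """Check whether an alignment runs past the end of its reference sequence.

    contigs: list of (name, length) in concatenation order.
    start:   0-based start on the concatenated sequence.
    aln_len: number of reference bases the alignment covers.
    Returns (name of the sequence containing `start`, True if it overruns).
    """
    # Cumulative end coordinate of each sequence in the concatenation.
    ends = list(accumulate(length for _, length in contigs))
    if not ends or start < 0 or start >= ends[-1]:
        raise ValueError("start lies outside the concatenated reference")
    i = bisect_right(ends, start)  # index of the sequence containing start
    return contigs[i][0], start + aln_len > ends[i]


if __name__ == "__main__":
    ref = [("chr1", 100), ("chr2", 50)]  # made-up lengths
    print(spans_junction(ref, 95, 10))   # starts in chr1, runs into chr2
    print(spans_junction(ref, 100, 10))  # lies entirely within chr2
```

An alignment for which the second value is true is the kind of hit described above: it has a position and a CIGAR, yet it hangs over the end of a reference sequence.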

6. Does BWA work with ALT contigs in the GRCh38 release?

Yes, since 0.7.11, BWA-MEM officially supports mapping to GRCh38+ALT. BWA-backtrack and BWA-SW don't properly support ALT mapping as of now. Please see README-alt.md for details. Briefly, it is recommended to use bwakit, the binary release of BWA, for generating the reference genome and for mapping.

7. Can I just run BWA-MEM against GRCh38+ALT without post-processing?

If you are not interested in hits to ALT contigs, it is okay to run BWA-MEM without post-processing. The alignments produced this way are very close to alignments against GRCh38 without ALT contigs. Nonetheless, applying post-processing helps to reduce false mappings caused by reads from the diverged part of ALT contigs and also enables HLA typing. It is recommended to run the post-processing script.
