Count bases in BAM/CRAM files

bam-readcount


bam-readcount is a utility that runs on a BAM or CRAM file and generates low-level information about sequencing data at specific nucleotide positions. Its outputs include observed bases, read counts, summarized mapping and base qualities, strandedness information, mismatch counts, and position within the reads (see the "Output" section below).

Originally designed to help filter genomic mutation calls, the metrics bam-readcount outputs are also useful as input for variant detection tools and for resolving ambiguity between variant callers.

If you find bam-readcount useful in your work, please cite our paper:

Khanna et al., (2022). Bam-readcount - rapid generation of basepair-resolution sequence metrics. Journal of Open Source Software, 7(69), 3722. https://doi.org/10.21105/joss.03722

Installation

Docker

The latest release version of bam-readcount is available as a Docker image on Docker Hub:

docker pull mgibio/bam-readcount

For details see the docker-bam-readcount repository.

Build

Building requires a C++ toolchain and CMake. For details, see BUILD.md.

git clone https://github.com/genome/bam-readcount 
cd bam-readcount
mkdir build
cd build
cmake ..
make
# Executable is
bin/bam-readcount

Usage

Run with no arguments for command-line help:

$ bam-readcount

Usage: bam-readcount [OPTIONS] <bam_file> [region]
Generate metrics for bam_file at single nucleotide positions.
Example: bam-readcount -f ref.fa some.bam

Available options:
  -h [ --help ]                         produce this message
  -v [ --version ]                      output the version number
  -q [ --min-mapping-quality ] arg (=0) minimum mapping quality of reads used
                                        for counting.
  -b [ --min-base-quality ] arg (=0)    minimum base quality at a position to
                                        use the read for counting.
  -d [ --max-count ] arg (=10000000)    max depth to avoid excessive memory
                                        usage.
  -l [ --site-list ] arg                file containing a list of regions to
                                        report readcounts within.
  -f [ --reference-fasta ] arg          reference sequence in the fasta format.
  -D [ --print-individual-mapq ] arg    report the mapping qualities as a comma
                                        separated list.
  -p [ --per-library ]                  report results by library.
  -w [ --max-warnings ] arg             maximum number of warnings of each type
                                        to emit. -1 gives an unlimited number.
  -i [ --insertion-centric ]            generate indel centric readcounts.
                                        Reads containing insertions will not be
                                        included in per-base counts
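
When bam-readcount is driven from a script, it can help to assemble the command line programmatically. The following is a minimal Python sketch that uses only flags listed in the help text above; the function name `build_command` and its parameter names are hypothetical, chosen for illustration.

```python
from typing import List, Optional


def build_command(bam_path: str,
                  reference: Optional[str] = None,
                  min_mapq: int = 0,
                  min_baseq: int = 0,
                  site_list: Optional[str] = None,
                  region: Optional[str] = None) -> List[str]:
    """Assemble an argv list for bam-readcount (hypothetical helper)."""
    # -q and -b default to 0, matching the defaults shown in the help text.
    cmd = ["bam-readcount", "-q", str(min_mapq), "-b", str(min_baseq)]
    if reference is not None:
        cmd += ["-f", reference]   # reference sequence in FASTA format
    if site_list is not None:
        cmd += ["-l", site_list]   # file with a list of regions
    cmd.append(bam_path)           # positional <bam_file>
    if region is not None:
        cmd.append(region)         # optional positional [region]
    return cmd


cmd = build_command("some.bam", reference="ref.fa", min_mapq=20,
                    region="chr1:100-200")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)` to capture the tab-separated output from STDOUT.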

The optional [region] should be in the same format as samtools:

chromosome:start-stop
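
As an illustration of the region format, here is a small Python sketch that builds and parses `chromosome:start-stop` strings. The helper names are hypothetical, and the sketch assumes 1-based inclusive coordinates as in samtools region strings; it handles only the full `chromosome:start-stop` form.

```python
from typing import Tuple


def format_region(chrom: str, start: int, stop: int) -> str:
    """Build a region string (assumes 1-based, inclusive coordinates)."""
    if start < 1 or stop < start:
        raise ValueError(f"invalid interval: {start}-{stop}")
    return f"{chrom}:{start}-{stop}"


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split a region string into (chromosome, start, stop)."""
    # Split on the LAST colon so contig names containing ':' still work.
    chrom, sep, span = region.rpartition(":")
    start_text, dash, stop_text = span.partition("-")
    if not (chrom and sep and dash):
        raise ValueError(f"not in chromosome:start-stop form: {region!r}")
    return chrom, int(start_text), int(stop_text)


print(format_region("chr1", 100, 200))
print(parse_region("chr1:100-200"))
```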

The optional -l (--site-list) file should be tab-separated, no header, one region per line:

chromosome	start	end
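
A site-list file in this layout can be generated from a list of positions. The sketch below is one possible way to do it in Python; `write_site_list` is a hypothetical helper, the example sites are invented, and the coordinates are assumed to be 1-based (a single position has the same start and end).

```python
import io
from typing import Iterable, TextIO, Tuple


def write_site_list(sites: Iterable[Tuple[str, int, int]],
                    handle: TextIO) -> None:
    """Write one tab-separated 'chromosome start end' line per region."""
    for chrom, start, end in sites:
        # No header line, as described above.
        handle.write(f"{chrom}\t{start}\t{end}\n")


# Invented example sites: one single position and one short interval.
buffer = io.StringIO()
write_site_list([("chr1", 100, 100), ("chr2", 500, 510)], buffer)
print(buffer.getvalue(), end="")
```

To write to disk instead, open a file with `open("sites.tsv", "w")` and pass the handle in place of the `StringIO` buffer; the file name here is only an example.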

CRAM support

When using CRAM files as input, if a reference is specified with -f, it will override whatever is in the CRAM header. Otherwise, the reference(s) encoded in the CRAM header or a lookup by MD5 at ENA will be used.

Wrappers/Parsers

Add bam-readcount counts to VCF

  • VAtools allows you to add read counts to VCF files from modern variant callers (see its documentation for additional details).

Create CSV file

  • brc-parser is a parser that converts bam-readcount output to a comma-separated long-format file.

Output

Output is tab-separated with no header to STDOUT, one line per position:

chr	position	reference_base	depth	base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end   ...

There is one set of :-separated fields for each reported base with statistics on the set of reads containing that base:

| Field | Description |
| --- | --- |
| base | The base, e.g. C |
| count | Number of reads |
| avg_mapping_quality | Mean mapping quality |
| avg_basequality | Mean base quality |
| avg_se_mapping_quality | Mean single-ended mapping quality |
| num_plus_strand | Number of reads on the plus/forward strand |
| num_minus_strand | Number of reads on the minus/reverse strand |
| avg_pos_as_fraction | Average position on the read as a fraction, calculated with respect to the length after clipping. This value is normalized to the center of the read: bases occurring strictly at the center of the read have a value of 1, while those occurring strictly at the ends should approach a value of 0 |
| avg_num_mismatches_as_fraction | Average number of mismatches on these reads per base |
| avg_sum_mismatch_qualities | Average sum of the base qualities of mismatches in the reads |
| num_q2_containing_reads | Number of reads with q2 runs at the 3' end |
| avg_distance_to_q2_start_in_q2_reads | Average distance of position (as fraction of unclipped read length) to the start of the q2 run |
| avg_clipped_length | Average clipped read length |
| avg_distance_to_effective_3p_end | Average distance to the 3' end of the read (as fraction of unclipped read length) |
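
The output format described above can be parsed with a few lines of Python. This is a minimal sketch, not an official parser: the function names are hypothetical, the example line contains invented values, and the sketch assumes that the count-like fields are printed as integers and all other statistics as decimals.

```python
from typing import Dict, List, Union

FIELD_NAMES = [
    "base", "count", "avg_mapping_quality", "avg_basequality",
    "avg_se_mapping_quality", "num_plus_strand", "num_minus_strand",
    "avg_pos_as_fraction", "avg_num_mismatches_as_fraction",
    "avg_sum_mismatch_qualities", "num_q2_containing_reads",
    "avg_distance_to_q2_start_in_q2_reads", "avg_clipped_length",
    "avg_distance_to_effective_3p_end",
]
# Assumption: these fields are written as integers.
INT_FIELDS = {"count", "num_plus_strand", "num_minus_strand",
              "num_q2_containing_reads"}

Value = Union[str, int, float]


def parse_base_entry(entry: str) -> Dict[str, Value]:
    """Convert one ':'-separated set of fields into a dict."""
    values = entry.split(":")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields: {entry!r}")
    record: Dict[str, Value] = {}
    for name, value in zip(FIELD_NAMES, values):
        if name == "base":
            record[name] = value
        elif name in INT_FIELDS:
            record[name] = int(value)
        else:
            record[name] = float(value)
    return record


def parse_line(line: str) -> Dict[str, object]:
    """Parse one tab-separated output line (one genomic position)."""
    columns = line.rstrip("\n").split("\t")
    chrom, position, ref_base, depth = columns[:4]
    bases: List[Dict[str, Value]] = [
        parse_base_entry(entry) for entry in columns[4:] if entry.strip()
    ]
    return {"chr": chrom, "position": int(position),
            "reference_base": ref_base, "depth": int(depth), "bases": bases}


# Invented example line, for illustration only.
example = ("chr1\t100\tA\t3\t"
           "A:2:60.00:35.00:0.00:1:1:0.50:0.01:20.00:0:0.00:100.00:0.45\t"
           "C:1:60.00:30.00:0.00:1:0:0.40:0.02:30.00:0:0.00:100.00:0.50")
parsed = parse_line(example)
for base in parsed["bases"]:
    print(base["base"], base["count"])
```

From the parsed records, per-base counts can be compared against `depth`, for example to estimate the fraction of reads supporting a non-reference base at a position.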

Per-library output

With the -p option, each output line will have a set of {}-delimited results, one for each library:

chr	position	reference_base	depth	library_1_name	{	base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end	}   ...   library_N_name	{	base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end	}    
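
The per-library layout can be read by tracking which library block is currently open. The Python sketch below assumes that each library name, `{`, and `}` appears as its own tab-separated token, as the format line above suggests; the function name is hypothetical, the example line is invented, and only the base and count of each entry are kept for brevity.

```python
from typing import Dict, List, Optional, Tuple


def parse_per_library_line(line: str) -> Dict[str, object]:
    """Parse one -p output line into per-library (base, count) pairs."""
    tokens = [t.strip() for t in line.rstrip("\n").split("\t") if t.strip()]
    chrom, position, ref_base, depth = tokens[:4]
    libraries: Dict[str, List[Tuple[str, int]]] = {}
    current: Optional[str] = None   # name of the library being read
    inside = False                  # True between '{' and '}'
    for token in tokens[4:]:
        if token == "{":
            if current is None or inside:
                raise ValueError("unexpected '{'")
            inside = True
        elif token == "}":
            if not inside:
                raise ValueError("unexpected '}'")
            inside, current = False, None
        elif inside:
            fields = token.split(":")
            libraries[current].append((fields[0], int(fields[1])))
        else:
            # Any token outside braces is taken to be a library name.
            current = token
            libraries[current] = []
    return {"chr": chrom, "position": int(position),
            "reference_base": ref_base, "depth": int(depth),
            "libraries": libraries}


# Invented example: the 12 trailing statistics are shared for brevity.
stats = ":60.00:35.00:0.00:1:1:0.50:0.01:20.00:0:0.00:100.00:0.45"
example = "\t".join(["chr1", "100", "A", "3",
                     "libA", "{", "A:2" + stats, "C:1" + stats, "}",
                     "libB", "{", "A:0" + stats, "}"])
result = parse_per_library_line(example)
print(result["libraries"])
```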

Tutorial

For those who learn best by example, a brief tutorial is available that uses bam-readcount to identify the Omicron SARS-CoV-2 variant of concern from raw sequence data.

Support

For support, please search for bam-readcount on Biostars, as many of the most frequently asked questions about bam-readcount have been answered there. For problems not addressed there, please open a GitHub issue or make a Biostars post.

Contributing

We welcome contributions! See Contributing for more details.
