• This repository has been archived on 31/Jan/2020

SV detection from paired end reads mapping
BreakDancer-1.3.6, released under GPLv3, is a Cpp package that provides genome-wide detection of structural variants from next-generation paired-end sequencing reads. It includes two complementary programs. BreakDancerMax predicts five types of structural variants (insertions, deletions, inversions, and inter- and intra-chromosomal translocations) from next-generation short paired-end sequencing reads, using read pairs that are mapped with unexpected separation distances or orientation.  BreakDancerMini focuses on detecting small indels (usually between 10bp and 100bp) using normally mapped read pairs.  Please read our paper for a detailed algorithmic description: http://www.nature.com/nmeth/journal/v6/n9/abs/nmeth.1363.html

BreakDancerMax (The update from version 1.0 to 1.1 currently applies only to the Cpp version.)
----------------------
Usage:   breakdancer_max <analysis_config_file>
Options:
       -o STRING       operate on a single chromosome [all chromosome]
       -s INT          minimum length of a region [7]
       -c INT          cutoff in unit of standard deviation [3]
       -m INT          maximum SV size [1000000000]
       -q INT          minimum alternative mapping quality [35]
       -r INT          minimum number of read pairs required to establish a connection [2]
       -x INT          maximum threshold of haploid sequence coverage for regions to be ignored [1000]
       -b INT          buffer size for building connection [100]
       -t              only detect transchromosomal rearrangement, by default off
       -d STRING       prefix of fastq files that SV supporting reads will be saved by library
       -g STRING       dump SVs and supporting reads in BED format for GBrowse
       -l              analyze Illumina long insert (mate-pair) library
       -a              print out copy number by bam file rather than library, by default on
       -h              print out Allele Frequency column, by default off
       -y INT          output score filter [40]

The following features are new in version 1.1 relative to 1.0:

1. It computes the copy number, based on normalization over the whole genome or the whole chromosome, along with the SV detection. By default the copy number is computed by bam file, but it can also be computed per library with option "-a".
2. The Allele Frequency column is more accurate for the DEL type, with the help of the copy number computation. By default it is off. Include option "-h" if you want to look at the DEL type Allele Frequency.
3. Because there are numerous false positive SV calls, the output is filtered by a PhredQ score cutoff, which defaults to 40. Use option "-y yournumber" if you want to change the cutoff.

The following already existed in version 1.0:

Most of these options are self-explanatory.  It is convenient to use the -o option to parallelize SV detection for each chromosome.  When -o is used, the detection of inter-chromosomal translocation is disabled.  In that case, it may be convenient to use -t in a separate process to detect putative inter-chromosomal translocations without bothering to analyze read pairs that are mapped to the same chromosome. 
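
As a rough illustration of the parallelization described above, the following Python sketch only builds the command lines: one breakdancer_max run per chromosome with -o, plus one separate -t run for putative inter-chromosomal translocations. The helper name, the chromosome names, and the configuration file name are hypothetical, and placing the options before the configuration file is an assumption. Launching the commands is left to your own scheduler.

```python
def build_commands(config_file, chromosomes, exe="breakdancer_max"):
    """Return one argument list per chromosome (-o) plus one -t run.

    Assumption: options are given before the configuration file.
    """
    commands = [[exe, "-o", chrom, config_file] for chrom in chromosomes]
    # -o disables inter-chromosomal detection, so add a dedicated -t run.
    commands.append([exe, "-t", config_file])
    return commands


if __name__ == "__main__":
    # "analysis.cfg" and the chromosome names are placeholders.
    for argv in build_commands("analysis.cfg", ["1", "2", "X"]):
        print(" ".join(argv))
```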

The beta-test x86_64 Cpp version of breakdancermax directly uses the samtools C library. It is fully compatible with the perl version, with identical usage and functions, but is over 10 times faster. However, it only supports properly formatted bam files and has only been tested using bam files produced by BWA. To obtain the correct result, it is important to have the readgroup (@RG) tag in both the header and each alignment in the bam files. If you experience technical difficulty with the Cpp version, please email [email protected] or consider using the perl version.

The input to BreakDancerMax-1.1 is a set of map files produced by a front-end aligner such as MAQ, BWA, NovoAlign and Bfast, and a tab-delimited configuration file that specifies the locations of the map files, the detection parameters, and the sample information.

If your map files are in the sam/bam format, you can use the bam2cfg.pl script in the released package to automatically generate a configuration file (bam2cfg.pl also depends on AlnParser.pm in the release package). If you have a single bam file that contains multiple libraries, make sure that the readgroup and library information are properly encoded in the sam/bam header and in each alignment record; otherwise bam2cfg.pl may fail to produce a correct configuration file.  Please follow the instructions on http://samtools.sourceforge.net to properly format your bam files.
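
The readgroup requirement can be checked before running bam2cfg.pl. The sketch below is not part of the package; it assumes the bam file has first been converted to SAM text with its header (for example with "samtools view -h") and that the header lines precede the alignments. The function name is hypothetical.

```python
def check_read_groups(sam_lines):
    """Return (header_ids, problems) for an iterable of SAM text lines.

    header_ids: set of @RG IDs declared in the header.
    problems:   list of (line_number, message) for alignments whose RG tag
                is missing or not declared in the header.
    """
    header_ids = set()
    problems = []
    for n, line in enumerate(sam_lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if line.startswith("@"):
            if fields[0] == "@RG":
                for field in fields[1:]:
                    if field.startswith("ID:"):
                        header_ids.add(field[3:])
            continue
        # Optional tags start at the 12th column of an alignment record.
        tags = [f[5:] for f in fields[11:] if f.startswith("RG:Z:")]
        if not tags:
            problems.append((n, "missing RG tag"))
        elif tags[0] not in header_ids:
            problems.append((n, "RG %s not in header" % tags[0]))
    return header_ids, problems
```

An empty problems list suggests that every alignment carries a readgroup declared in the header; it does not verify the library (LB) fields.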

An example of a manually written configuration file:

map:1.map mean:219 std:18 readlen:36.00 sample:tA exe:maq-0.6.8 mapview
map:2.map mean:220 std:19 readlen:36.00 sample:tB exe:maq-0.6.8 mapview
map:3.map mean:219 std:18 readlen:36.00 sample:nA exe:maq-0.7.1 mapview
map:4.map mean:219 std:18 readlen:36.00 sample:nB exe:maq-0.7.1 mapview

An example configuration file produced by bam2cfg.pl looks like this:

readgroup:2825107881    platform:illumina       map:tumor.bam   readlen:75.00   lib:H_KA-189941-0921313gsc-lib4 num:10001       lower:86.83     upper:443.91    mean:315.09 std:43.92       exe:samtools view
readgroup:2843249908    platform:illumina       map:tumor.bam   readlen:75.00   lib:H_KA-189941-0921313gsc-lib4 num:10001       lower:86.83     upper:443.91    mean:315.09 std:43.92       exe:samtools view
readgroup:2843255910    platform:illumina       map:normal.bam  readlen:75.00   lib:H_KA-189941-0904663-lib4    num:10001       lower:95.36     upper:443.31    mean:311.68 std:42.86       exe:samtools view
readgroup:2843255906    platform:illumina       map:normal.bam  readlen:75.00   lib:H_KA-189941-0904663-lib4    num:10001       lower:95.36     upper:443.31    mean:311.68 std:42.86       exe:samtools view

Each row must contain at least 6 key:value pairs (key and value separated by a colon) that specify:

1). the location of the map file
2). the mean insert size
3). the standard deviation of the insert size
4). the average read length
5). a unique identifier assigned to the map file (usually representing a PE library)
6). a command line that can be run by perl system calls to produce MAQ mapview alignments
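
The key:value layout above can be read with a few lines of Python. This is a minimal sketch, not the parser used by BreakDancerMax: it assumes that values contain no colon-bearing tokens other than the key prefix, and that any token without a colon (such as "mapview" or "view") continues the previous value. The function name is hypothetical.

```python
def parse_config_line(line):
    """Parse one configuration row into a dict mapping key -> value string."""
    record = {}
    last_key = None
    for token in line.split():
        key, sep, value = token.partition(":")
        if sep:
            record[key] = value
            last_key = key
        elif last_key is not None:
            # A token without a colon continues the previous value,
            # e.g. the "mapview" in "exe:maq-0.6.8 mapview".
            record[last_key] += " " + token
    return record


if __name__ == "__main__":
    row = "map:1.map mean:219 std:18 readlen:36.00 sample:tA exe:maq-0.6.8 mapview"
    print(parse_config_line(row))
```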

In addition to the above 6 keys: map, mean, std, readlen, sample, and exe, BreakDancerMax allows users to explicitly specify the separation thresholds using the keys: upper and lower.  For example:
map:1.map upper:300 lower:100 readlen:36.00 sample:tA exe:maq-0.6.8 mapview -b

This will instruct BreakDancerMax to detect deletions using read pairs that are at least 300 bp apart (outer distance) and detect insertions using read pairs that are at most 100 bp apart.  

The upper and the lower key:value pairs, when explicitly specified, take precedence over the upper and the lower thresholds computed from the mean, the std, and the user specified threshold in the unit of standard deviation.
	upper: mean + std * threshold specified by user option -c 
	lower: mean - std * threshold specified by user option -c

The -c option defaults to 3. Therefore, by default the upper and the lower separation thresholds are mean + 3 std and mean - 3 std respectively. It is useful to explicitly specify the upper and the lower separation thresholds when the insert size distribution is not symmetric about the mean. 
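
The precedence rule and the two formulas above can be written out as a short Python sketch. The function name and its keyword arguments are hypothetical; only the arithmetic comes from the text.

```python
def separation_thresholds(mean=None, std=None, cutoff=3.0, lower=None, upper=None):
    """Return (lower, upper) separation thresholds for one library.

    Explicit lower/upper values take precedence; otherwise they are
    derived as mean -/+ cutoff * std (cutoff corresponds to option -c).
    """
    if (lower is None or upper is None) and (mean is None or std is None):
        raise ValueError("need mean and std unless both lower and upper are given")
    if lower is None:
        lower = mean - cutoff * std
    if upper is None:
        upper = mean + cutoff * std
    return lower, upper


if __name__ == "__main__":
    # mean:219 std:18 with the default -c 3
    print(separation_thresholds(mean=219, std=18))
    # explicit upper:300 lower:100 override the computed values
    print(separation_thresholds(mean=219, std=18, lower=100, upper=300))
```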

The -o option enables per-chromosome/reference analysis and is much faster when the input files are in the bam format.  Please index the bam file using "samtools index" to utilize this option. You need to specify the exact reference names as they are in the bam files.
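
Because -o needs the exact reference names, it can help to list them from the bam header first. The sketch below assumes the header has been dumped as text (for example with "samtools view -H") and reads the SN field of each @SQ line; the function name is hypothetical.

```python
def reference_names(header_lines):
    """Return the reference names (SN fields of @SQ lines) in header order."""
    names = []
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue
        for field in fields[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


if __name__ == "__main__":
    # Illustrative header lines, not taken from a real file.
    header = ["@HD\tVN:1.0", "@SQ\tSN:chr1\tLN:1000", "@SQ\tSN:chrX\tLN:500"]
    print(reference_names(header))
```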

When -e is on, BreakDancerMax tries to estimate the mean and the standard deviation of the insert size from the data instead of relying on the user's specification in the configuration file. The current implementation of this estimation process is slow, so it is recommended that users specify accurate thresholds in the configuration file.

The -l option tells BreakDancerMax that the data is produced from an Illumina long insert circularized library.

The -f option uses Fisher's method to summarize scores from multiple libraries.  It is recommended when there are many libraries. It ensures that the scores are independent of the number of libraries (uniform distribution of the P values).
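
For readers unfamiliar with Fisher's method, the following Python sketch shows the textbook form: the statistic X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom when the k P values are independent. How BreakDancerMax derives and scales its per-library P values is not described here, so this is an illustration of the general technique only; the function name is hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine independent P values with Fisher's method.

    Uses the closed form of the chi-square survival function for an even
    number of degrees of freedom (2k):
        sf(x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one P value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P values must lie in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


if __name__ == "__main__":
    print(fisher_combine([0.05, 0.05]))
```

With a single P value the function returns that value unchanged, which is a quick sanity check on the formula.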

The -q option specifies the MAQ mapping quality threshold and can be used to skip reads that are not confidently mapped. 

The -s option specifies the minimal required size of an SV anchoring region from which the anomalously mapped reads are found.  This parameter has a small effect on the SV detection accuracy.  Increasing -s improves the specificity but also reduces the sensitivity. The default 7 bp seemed to work well.

The -b parameter specifies the number of anomalous regions held in RAM before SV hypotheses begin to form among these regions. The default works well in general.  For datasets that are exceptionally large, it may be helpful to reduce it to cut the resident RAM usage.

The -d option specifies the prefix of the fastq files in which all SV supporting reads are saved, by library, in the fastq format.  These reads can be realigned by other aligners such as novoalign, and then reanalyzed by BreakDancer.

Listing multiple map files in a single configuration file would automatically enable pooled analysis: reads from all the map files are jointly analyzed to find unified SV hypotheses across all the map files.


The output format
----------------------
BreakDancer's output file consists of the following columns:

1. Chromosome 1
2. Position 1
3. Orientation 1
4. Chromosome 2
5. Position 2
6. Orientation 2
7. Type of SV
8. Size of SV
9. Confidence Score
10. Total number of supporting read pairs
11. Total number of supporting read pairs from each map file
12. Estimated allele frequency
13. Software version
14. The run parameters

Columns 1-3 and 4-6 are used to specify the coordinates of the two SV breakpoints. The orientation is a string that records the number of reads mapped to the plus (+) or the minus (-) strand in the anchoring regions.

Column 7 is the type of SV detected: DEL (deletion), INS (insertion), INV (inversion), ITX (intra-chromosomal translocation), CTX (inter-chromosomal translocation), and Unknown. 
Column 8 is the size of the SV in bp.  It is meaningless for inter-chromosomal translocations. 
Column 9 is the confidence score associated with the prediction. 
Column 11 can be used to dissect the origin of the supporting read pairs, which is useful in pooled analysis.  For example, one may want to give SVs that are supported by more than one library higher confidence than those detected in only one library.  It can also be used to distinguish somatic events from germline events, i.e., those detected in only the tumor libraries versus those detected in both the tumor and the normal libraries.
Column 12 is currently a placeholder for displaying estimated allele frequency. The allele frequencies estimated in this version are not accurate and should not be trusted.
Columns 13 and 14 record information useful for reproducing the results.
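
The use of column 11 for pooled analysis can be sketched in Python as follows. The "name|count" items joined by ":" are inferred from the examples in this document; the function names and the set of normal library names are hypothetical, and a tumor-only call is only a candidate somatic event, not a confirmed one.

```python
def parse_support(column11):
    """Turn a column 11 string such as 'nA|2:tB|1' into {'nA': 2, 'tB': 1}."""
    support = {}
    for item in column11.split(":"):
        name, _, count = item.rpartition("|")
        support[name] = int(count)
    return support


def tumor_only(column11, normal_names):
    """True if no supporting read pair comes from a normal library."""
    return all(name not in normal_names for name in parse_support(column11))


if __name__ == "__main__":
    normals = {"nA", "nB"}  # placeholder names of the normal libraries
    for col in ("tB|10", "nA|2:tB|1"):
        print(col, parse_support(col), tumor_only(col, normals))
```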

Example 1:
1 10000 10+0- 2 20000 7+10- CTX -296 99 10 tB|10 1.00 BreakDancerMax-0.0.1 t1

An inter-chromosomal translocation that starts from chr1:10000 and goes into chr2:20000 with 10 supporting read pairs from the library tB and a confidence score of 99.

Example 2:
1 59257 5+1- 1 60164 0+5- DEL 862 99 5 nA|2:tB|1 0.56 BreakDancerMax-0.0.1 c4

A deletion between chr1:59257 and chr1:60164 connected by 5 read pairs, among which 2 in library nA and 1 in library tB support the deletion hypothesis. This deletion is detected by BreakDancerMax-0.0.1 with a separation threshold of 4 s.d.

Example 3:
1 62767 10+0- 1 63126 0+10- INS -13 36 10 NA|10 1.00 BreakDancerMini-0.0.1 q10

A 13 bp insertion detected by BreakDancerMini between chr1:62767 and chr1:63126 with 10 supporting read pairs from a single library 'NA' and a confidence score of 36.
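
The example lines above can be split into the 14 documented columns with a small Python sketch. It assumes whitespace-delimited fields, treats any tokens after the 13th as part of the run parameters, and keeps the allele frequency and the per-library support as raw strings. The type and field names are hypothetical.

```python
import re
from collections import namedtuple

SVCall = namedtuple(
    "SVCall",
    "chr1 pos1 ori1 chr2 pos2 ori2 sv_type size score "
    "num_reads support allele_freq version params",
)


def parse_orientation(text):
    """'5+1-' -> (5, 1): reads on the plus and on the minus strand."""
    match = re.fullmatch(r"(\d+)\+(\d+)-", text)
    if match is None:
        raise ValueError("unexpected orientation string: %r" % text)
    return int(match.group(1)), int(match.group(2))


def parse_sv_line(line):
    """Parse one output row into an SVCall."""
    f = line.split()
    if len(f) < 14:
        raise ValueError("expected at least 14 columns, got %d" % len(f))
    return SVCall(
        f[0], int(f[1]), parse_orientation(f[2]),
        f[3], int(f[4]), parse_orientation(f[5]),
        f[6], int(f[7]), float(f[8]), int(f[9]),
        f[10], f[11], f[12], " ".join(f[13:]),
    )


if __name__ == "__main__":
    example2 = "1 59257 5+1- 1 60164 0+5- DEL 862 99 5 nA|2:tB|1 0.56 BreakDancerMax-0.0.1 c4"
    print(parse_sv_line(example2))
```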

Notes:
Real SV breakpoints are expected to reside within the predicted boundaries, with a margin greater than the read length.

The BreakDancerMini code will not be included in the coming releases.  We recommend using Pindel to detect intermediate size indels (10-80 bp).

Dependencies on other Perl modules
------------------------------------
use Statistics::Descriptive;
use Math::CDF;

These are available at CPAN.
http://search.cpan.org/~colink/Statistics-Descriptive-2.6/Descriptive.pm
http://search.cpan.org/~callahan/Math-CDF-0.1/CDF.pm

use Poisson;
This is provided. Please make sure the "use lib" at the beginning of the perl scripts contains the correct path.


Acknowledgements
-----------------
Heng Li at the Wellcome Trust Sanger Institute contributed an early version of this code.
Many colleagues at The Genome Center of Washington University have supported this effort.


Ken Chen and Xian Fan
Washington University Genome Center

July 14, 2009
