GATK-SV

A structural variation discovery pipeline for Illumina short-read whole-genome sequencing (WGS) data.

Requirements

Deployment and execution:

  • A Google Cloud account.
  • A workflow execution system supporting the Workflow Description Language (WDL), either:
    • Cromwell (v36 or higher). A dedicated server is highly recommended.
    • or Terra (note preconfigured GATK-SV workflows are not yet available for this platform)
  • Recommended: MELT. Due to licensing restrictions, we cannot provide a public docker image or reference panel VCFs for this algorithm.
  • Recommended: cromshell for interacting with a dedicated Cromwell server.
  • Recommended: WOMtool for validating WDL/json files.
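
For example, a WDL can be checked with WOMtool before submitting it to Cromwell (a minimal sketch; the jar name and paths depend on your local setup):

> java -jar womtool.jar validate wdl/GATKSVPipelineBatch.wdl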

Alternative backends

Because GATK-SV has been tested only on the Google Cloud Platform (GCP), we are unable to provide specific guidance or support for other execution platforms, including HPC clusters and AWS. Contributions from the community to improve portability between backends will be considered on a case-by-case basis. We ask that contributors adhere to the following guidelines when submitting issues and pull requests:

  1. Code changes must be functionally equivalent on GCP backends, i.e. not result in changed output
  2. Increases to cost and runtime on GCP backends should be minimal
  3. Avoid adding new inputs and tasks to workflows. Simpler changes are more likely to be approved, e.g. small in-line changes to scripts or WDL task command sections
  4. Avoid introducing new code paths, e.g. conditional statements
  5. Additional backend-specific scripts, workflows, tests, and Dockerfiles will not be approved
  6. Changes to Dockerfiles may require extensive testing before approval

We still encourage members of the community to adapt GATK-SV for non-GCP backends and to share code on forked repositories. Here are some considerations:

  • Refer to Cromwell's documentation for configuration instructions.
  • The handling and ordering of glob commands may differ between platforms.
  • Shell commands that are potentially destructive to input files (e.g. rm, mv, tabix) can cause unexpected behavior on shared filesystems. Enabling copy localization may help to more closely replicate the behavior on GCP.
  • For clusters that do not support Docker, Singularity is an alternative. See Cromwell documentation on Singularity.
  • The GATK-SV pipeline takes advantage of the massive parallelization possible in the cloud. Local backends may not have the resources to execute all of the workflows. Workflows that use fewer resources or that are less parallelized may be more successful. For instance, some users have been able to run GatherSampleEvidence on a SLURM cluster.

Data:

  • Illumina short-read whole-genome CRAMs or BAMs, aligned to hg38 with bwa-mem. BAMs must also be indexed.
  • Family structure definitions file in PED format. Sex aneuploidies (detected in EvidenceQC) should be entered as sex = 0.
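
For reference, PED files are tab-delimited with six columns (family ID, sample ID, paternal ID, maternal ID, sex, phenotype). The entries below are purely hypothetical and show a male (sex = 1), a female (sex = 2), and a sample with a sex aneuploidy recorded as sex = 0:

FAM001  SAMPLE_001  0  0  1  -9
FAM001  SAMPLE_002  0  0  2  -9
FAM002  SAMPLE_003  0  0  0  -9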

Sample Exclusion

We recommend filtering out samples with a high percentage of improperly paired reads (>10% or an outlier for your data) as technical outliers prior to running GatherSampleEvidence. A high percentage of improperly paired reads may indicate issues with library prep, degradation, or contamination. Artifactual improperly paired reads could cause incorrect SV calls, and these samples have been observed to have longer runtimes and higher compute costs for GatherSampleEvidence.

Sample ID requirements:

Sample IDs must:

  • Be unique within the cohort
  • Contain only alphanumeric characters and underscores (no dashes, whitespace, or special characters)

Sample IDs should not:

  • Contain only numeric characters
  • Be a substring of another sample ID in the same cohort
  • Contain any of the following substrings: chr, name, DEL, DUP, CPX, CHROM

The same requirements apply to family IDs in the PED file, as well as batch IDs and the cohort ID provided as workflow inputs.

Sample IDs are provided to GatherSampleEvidence directly and need not match sample names from the BAM/CRAM headers. GetSampleID.wdl can be used to fetch BAM sample IDs; it also generates a set of alternate IDs that are considered safe for this pipeline. Alternatively, a conversion script included in the repository can transform a list of sample IDs to fit these requirements. Currently, sample IDs can be replaced again in GatherBatchEvidence.
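
As a quick sanity check, the requirements above can be screened with standard shell tools (a minimal sketch; it assumes a hypothetical sample_ids.list file with one sample ID per line):

> # IDs containing characters other than alphanumerics and underscores
> grep -E -v '^[A-Za-z0-9_]+$' sample_ids.list
> # IDs that are purely numeric
> grep -E '^[0-9]+$' sample_ids.list
> # IDs containing reserved substrings
> grep -E 'chr|name|DEL|DUP|CPX|CHROM' sample_ids.list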

The following inputs will need to be updated with the transformed sample IDs:

Citation

Please cite the following publication: Collins, Brand, et al. 2020. "A structural variation reference for medical and population genetics." Nature 581, 444-451.

Additional references: Werling et al. 2018. "An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder." Nature genetics 50.5, 727-736.

Quickstart

WDLs

There are two scripts for running the full pipeline:

  • wdl/GATKSVPipelineBatch.wdl: Runs GATK-SV on a batch of samples.
  • wdl/GATKSVPipelineSingleSample.wdl: Runs GATK-SV on a single sample, given a reference panel.

Building inputs

Example workflow inputs can be found in /inputs. Build them using scripts/inputs/build_default_inputs.sh, which generates input jsons in /inputs/build. Except for the MELT docker image, all required resources are available in public Google buckets.

Some workflows require a Google Cloud Project ID to be defined in a cloud environment parameter group. Workspace builds require a Terra billing project ID as well. An example is provided at /inputs/values/google_cloud.json, but it should not be edited directly, since modifying this file will cause tracked changes in the repository. Instead, create a copy in the same directory with the format google_cloud.my_project.json and modify it as necessary.
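
For example (a minimal sketch; the copy is then edited by hand to set your project IDs):

> cp inputs/values/google_cloud.json inputs/values/google_cloud.my_project.json
> # Edit google_cloud.my_project.json to set your Google Cloud project ID (and Terra billing project, if applicable)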

Note that these inputs are required only when certain data are located in requester pays buckets. If this does not apply, users may use placeholder values for the cloud configuration and simply delete the inputs manually.

MELT

Important: The example input files contain MELT inputs that are NOT public (see Requirements). These include:

  • GATKSVPipelineSingleSample.melt_docker and GATKSVPipelineBatch.melt_docker - MELT docker URI (see Docker readme)
  • GATKSVPipelineSingleSample.ref_std_melt_vcfs - Standardized MELT VCFs (GatherBatchEvidence)

The input values are provided only as an example and are not publicly accessible. In order to include MELT, these values must be provided by the user. MELT can be disabled by deleting these inputs and setting GATKSVPipelineBatch.use_melt to false.
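
For instance, once the batch inputs have been built, MELT could be disabled along these lines (a minimal sketch using jq; the input file name is a placeholder, the exact MELT-related keys to delete should be checked against your generated json, and additional MELT inputs may also need to be removed):

> jq 'del(."GATKSVPipelineBatch.melt_docker") | ."GATKSVPipelineBatch.use_melt" = false' \
    GATKSVPipelineBatch.json > GATKSVPipelineBatch.no_melt.json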

Execution

We recommend running the pipeline on a dedicated Cromwell server with a cromshell client. A batch run can be started with the following commands:

> # Set up a run directory and bundle the WDLs as Cromwell workflow dependencies
> mkdir gatksv_run && cd gatksv_run
> mkdir wdl && cd wdl
> cp $GATK_SV_ROOT/wdl/*.wdl .
> zip dep.zip *.wdl
> cd ..
> # Define cloud project IDs and build default inputs inside the repository checkout ($GATK_SV_ROOT)
> echo '{ "google_project_id": "my-google-project-id", "terra_billing_project_id": "my-terra-billing-project" }' > $GATK_SV_ROOT/inputs/values/google_cloud.my_project.json
> bash $GATK_SV_ROOT/scripts/inputs/build_default_inputs.sh -d $GATK_SV_ROOT -c google_cloud.my_project
> # Copy the generated batch inputs and submit to the Cromwell server
> cp $GATK_SV_ROOT/inputs/build/ref_panel_1kg/test/GATKSVPipelineBatch/GATKSVPipelineBatch.json GATKSVPipelineBatch.my_run.json
> cromshell submit wdl/GATKSVPipelineBatch.wdl GATKSVPipelineBatch.my_run.json cromwell_config.json wdl/dep.zip

where cromwell_config.json is a Cromwell workflow options file. Note users will need to re-populate batch/sample-specific parameters (e.g. BAMs and sample IDs).
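
For reference, a minimal cromwell_config.json could be created before submission, and the run can then be monitored with cromshell (a sketch; the output bucket is a placeholder, and an empty {} options file is also valid if no options are needed):

> echo '{ "final_workflow_outputs_dir": "gs://my-outputs-bucket", "use_relative_output_paths": false }' > cromwell_config.json
> cromshell status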

Pipeline Overview

The pipeline consists of a series of modules that perform the following:

  • GatherSampleEvidence: SV evidence collection, including calls from a configurable set of algorithms (Manta, MELT, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE).
  • EvidenceQC: Dosage bias scoring and ploidy estimation
  • GatherBatchEvidence: Copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation
  • ClusterBatch: Variant clustering
  • GenerateBatchMetrics: Variant filtering metric generation
  • FilterBatch: Variant filtering; outlier exclusion
  • GenotypeBatch: Genotyping
  • MakeCohortVcf: Cross-batch integration; complex variant resolution and re-genotyping; vcf cleanup
  • Module 07: Downstream filtering, including minGQ, batch effect checks, outlier sample removal, and final recalibration
  • AnnotateVcf: Annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets
  • Module 09: Visualization, including scripts that generate IGV screenshots and RD plots
  • Additional modules to be added: de novo and mosaic scripts

Repository structure:

  • /dockerfiles: Resources for building pipeline docker images
  • /inputs: files for generating workflow inputs
    • /templates: Input json file templates
    • /values: Input values used to populate templates
  • /wdl: WDLs running the pipeline. There is a master WDL for running each module, e.g. ClusterBatch.wdl.
  • /scripts: scripts for running tests, building dockers, and analyzing cromwell metadata files
  • /src: main pipeline scripts
    • /RdTest: scripts for depth testing
    • /sv-pipeline: various scripts and packages used throughout the pipeline
    • /svqc: Python module for checking that pipeline metrics fall within acceptable limits
    • /svtest: Python module for generating various summary metrics from module outputs
    • /svtk: Python module of tools for SV-related datafile parsing and analysis
    • /WGD: whole-genome dosage scoring scripts

Cohort mode

A minimum cohort size of 100 is required, and a roughly equal number of males and females is recommended. For modest cohorts (~100-500 samples), the pipeline can be run as a single batch using GATKSVPipelineBatch.wdl.

For larger cohorts, samples should be split up into batches of about 100-500 samples. Refer to the Batching section for further guidance on creating batches.

The pipeline should be executed as follows:

Note: GatherBatchEvidence requires a trained gCNV model.

Batching

For larger cohorts, samples should be split up into batches of about 100-500 samples with similar characteristics. We recommend batching based on overall coverage and dosage score (WGD), which can be generated in EvidenceQC. An example batching process is outlined below:

  1. Divide the cohort into PCR+ and PCR- samples
  2. Partition the samples by median coverage from EvidenceQC, grouping samples with similar median coverage together. The end goal is to divide the cohort into roughly equal-sized batches of about 100-500 samples; if your partitions based on coverage are larger or uneven, you can partition the cohort further in the next step to obtain the final batches.
  3. Optionally, divide the samples further by dosage score (WGD) from EvidenceQC, grouping samples with similar WGD score together, to obtain roughly equal-sized batches of about 100-500 samples
  4. Maintain a roughly equal sex balance within each batch, based on sex assignments from EvidenceQC

Single-sample mode

GATKSVPipelineSingleSample.wdl runs the pipeline on a single sample using a fixed reference panel. An example run with a reference panel containing 156 samples from the NYGC 1000G Terra workspace can be found in inputs/build/NA12878/test after building inputs.

gCNV Training

Both the cohort and single-sample modes use the GATK-gCNV depth calling pipeline, which requires a trained model as input. The samples used for training should be technically homogeneous and similar to the samples to be processed (i.e. same sample type, library prep protocol, sequencer, sequencing center, etc.). The samples to be processed may comprise all or a subset of the training set. For small, relatively homogeneous cohorts, a single gCNV model is usually sufficient. If a cohort contains multiple data sources, we recommend training a separate model for each batch or group of batches with similar dosage scores (WGD). The model may be trained on all or a subset of the samples to which it will be applied; a reasonable default is 100 randomly selected samples from the batch. The random selection can be performed as part of the workflow by specifying the number of samples via the n_samples_subsample input parameter in /wdl/TrainGCNV.wdl.
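
For example, this subsampling could be requested by adding the parameter to a built TrainGCNV input json (a minimal sketch using jq; the file name is a placeholder and 100 is an illustrative value):

> jq '. + {"TrainGCNV.n_samples_subsample": 100}' TrainGCNV.json > TrainGCNV.subsampled.json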

Generating a reference panel

New reference panels can be generated easily from a single run of the GATKSVPipelineBatch workflow. If using a Cromwell server, we recommend copying the outputs to a permanent location by adding the following option to the workflow configuration file:

  "final_workflow_outputs_dir" : "gs://my-outputs-bucket",
  "use_relative_output_paths": false,

Here is an example of how to generate workflow input jsons from GATKSVPipelineBatch workflow metadata:

> cromshell -t60 metadata 38c65ca4-2a07-4805-86b6-214696075fef > metadata.json
> python scripts/inputs/create_test_batch.py \
    --execution-bucket gs://my-exec-bucket \
    --final-workflow-outputs-dir gs://my-outputs-bucket \
    metadata.json \
    > inputs/values/my_ref_panel.json
> # Define your google project id (for Cromwell inputs) and Terra billing project (for workspace inputs)
> echo '{ "google_project_id": "my-google-project-id", "terra_billing_project_id": "my-terra-billing-project" }' > inputs/values/google_cloud.my_project.json
> # Build test files for batched workflows (google cloud project id required)
> python scripts/inputs/build_inputs.py \
    inputs/values \
    inputs/templates/test \
    inputs/build/my_ref_panel/test \
    -a '{ "test_batch" : "ref_panel_1kg", "cloud_env": "google_cloud.my_project" }'
> # Build test files for the single-sample workflow
> python scripts/inputs/build_inputs.py \
    inputs/values \
    inputs/templates/test/GATKSVPipelineSingleSample \
    inputs/build/NA19240/test_my_ref_panel \
    -a '{ "single_sample" : "test_single_sample_NA19240", "ref_panel" : "my_ref_panel" }'
> # Build files for a Terra workspace
> python scripts/inputs/build_inputs.py \
    inputs/values \
    inputs/templates/terra_workspaces/single_sample \
    inputs/build/NA12878/terra_my_ref_panel \
    -a '{ "single_sample" : "test_single_sample_NA12878", "ref_panel" : "my_ref_panel" }'

Note that the inputs to GATKSVPipelineBatch may be used as resources for the reference panel and therefore should also be in a permanent location.

Module Descriptions

The following sections briefly describe each module and highlight inter-dependent input/output files. Note that input/output mappings can also be gleaned from GATKSVPipelineBatch.wdl, and example input templates for each module can be found in /inputs/templates/test.

GatherSampleEvidence

Formerly Module00a

Runs raw evidence collection on each sample with the following SV callers: Manta, Wham, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence, refer to the Sample Exclusion section.

Note: a list of sample IDs must be provided. Refer to the sample ID requirements for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.

Inputs:

  • Per-sample BAM or CRAM files aligned to hg38. Index files (.bai) must be provided if using BAMs.

Outputs:

  • Caller VCFs (Manta, MELT, and/or Wham)
  • Binned read counts file
  • Split reads (SR) file
  • Discordant read pairs (PE) file

EvidenceQC

Formerly Module00b

Runs ploidy estimation, dosage scoring, and optionally VCF QC. The results from this module can be used for QC and batching.

For large cohorts, this workflow can be run on arbitrary cohort partitions of up to about 500 samples. Afterwards, we recommend using the results to divide samples into smaller batches (~100-500 samples) with ~1:1 male:female ratio. Refer to the Batching section for further guidance on creating batches.

We also recommend using sex assignments generated from the ploidy estimates and incorporating them into the PED file, with sex = 0 for sex aneuploidies.

Prerequisites:

Inputs:

Outputs:

  • Per-sample dosage scores with plots
  • Median coverage per sample
  • Ploidy estimates, sex assignments, with plots
  • (Optional) Outlier samples detected by call counts

Preliminary Sample QC

The purpose of sample filtering at this stage after EvidenceQC is to prevent very poor quality samples from interfering with the results for the rest of the callset. In general, samples that are borderline are okay to leave in, but you should choose filtering thresholds to suit the needs of your cohort and study. There will be future opportunities (as part of FilterBatch) for filtering before the joint genotyping stage if necessary. Here are a few of the basic QC checks that we recommend:

  • Look at the X and Y ploidy plots, and check that sex assignments match your expectations. If there are discrepancies, check for sample swaps and update your PED file before proceeding.
  • Look at the dosage score (WGD) distribution and check that it is centered around 0 (the distribution of WGD for PCR- samples is expected to be slightly lower than 0, and the distribution of WGD for PCR+ samples is expected to be slightly greater than 0. Refer to the gnomAD-SV paper for more information on WGD score). Optionally filter outliers.
  • Look at the low outliers for each SV caller (samples with much lower than typical numbers of SV calls per contig for each caller). An empty low outlier file means there were no outliers below the median and no filtering is necessary. Check that no samples had zero calls.
  • Look at the high outliers for each SV caller and optionally filter outliers; samples with many more SV calls than average may be poor quality.
  • Remove samples with autosomal aneuploidies based on the per-batch binned coverage plots of each chromosome.

TrainGCNV

Trains a gCNV model for use in GatherBatchEvidence. The WDL can be found at /wdl/TrainGCNV.wdl. See the gCNV training overview for more information.

Prerequisites:

Inputs:

Outputs:

  • Contig ploidy model tarball
  • gCNV model tarballs

GatherBatchEvidence

Formerly Module00c

Runs CNV callers (cn.MOPS, GATK-gCNV) and combines single-sample raw evidence into a batch. See above for more information on batching.

Prerequisites:

Inputs:

  • PED file (updated with EvidenceQC sex assignments, including sex = 0 for sex aneuploidies. Calls will not be made on sex chromosomes when sex = 0 in order to avoid generating many confusing calls or upsetting normalized copy numbers for the batch.)
  • Read count, BAF, PE, SD, and SR files (GatherSampleEvidence)
  • Caller VCFs (GatherSampleEvidence)
  • Contig ploidy model and gCNV model files (gCNV training)

Outputs:

  • Combined read count matrix, SR, PE, and BAF files
  • Standardized call VCFs
  • Depth-only (DEL/DUP) calls
  • Per-sample median coverage estimates
  • (Optional) Evidence QC plots

ClusterBatch

Formerly Module01

Clusters SV calls across a batch.

Prerequisites:

Inputs:

Outputs:

  • Clustered SV VCFs
  • Clustered depth-only call VCF

GenerateBatchMetrics

Formerly Module02

Generates variant metrics for filtering.

Prerequisites:

Inputs:

Outputs:

  • Metrics file

FilterBatch

Formerly Module03

Filters poor-quality variants and outlier samples. This workflow can be run all at once with the WDL at wdl/FilterBatch.wdl, or it can be run in three steps to enable tuning of outlier filtration cutoffs. The three subworkflows are:

  1. FilterBatchSites: Per-batch variant filtration
  2. PlotSVCountsPerSample: Visualize SV counts per sample per type to help choose an IQR cutoff for outlier filtering, and preview outlier samples for a given cutoff
  3. FilterBatchSamples: Per-batch outlier sample filtration; provide an appropriate outlier_cutoff_nIQR based on the SV count plots and outlier previews from step 2. Note that not removing high outliers can result in increased compute cost and a higher false positive rate in later steps.
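
For example, a cutoff chosen from the step 2 plots could be added to the FilterBatchSamples input json (a minimal sketch using jq; the file name is a placeholder, the value 6 is illustrative, and the fully qualified input name may differ depending on how the workflow is invoked):

> jq '. + {"FilterBatchSamples.outlier_cutoff_nIQR": 6}' FilterBatchSamples.json > FilterBatchSamples.final.json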

Prerequisites:

Inputs:

Outputs:

  • Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded
  • Filtered depth-only call VCF with outlier samples excluded
  • Random forest cutoffs file
  • PED file with outlier samples excluded

MergeBatchSites

Formerly MergeCohortVcfs

Combines filtered variants across batches. The WDL can be found at: /wdl/MergeBatchSites.wdl.

Prerequisites:

Inputs:

Outputs:

  • Combined cohort PESR and depth VCFs

GenotypeBatch

Formerly Module04

Genotypes a batch of samples across unfiltered variants combined across all batches.

Prerequisites:

Inputs:

Outputs:

  • Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded
  • Filtered depth-only call VCF with outlier samples excluded
  • PED file with outlier samples excluded
  • List of SR pass variants
  • List of SR fail variants
  • (Optional) Depth re-genotyping intervals list

RegenotypeCNVs

Formerly Module04b

Re-genotypes probable mosaic variants across multiple batches.

Prerequisites:

Inputs:

Outputs:

  • Re-genotyped depth VCFs

MakeCohortVcf

Formerly Module0506

Combines variants across multiple batches, resolves complex variants, re-genotypes, and performs final VCF clean-up.

Prerequisites:

Inputs:

Outputs:

  • Finalized "cleaned" VCF and QC plots

Module 07 (in development)

Apply downstream filtering steps to the cleaned VCF to further control the false discovery rate; all steps are optional, and users should decide which to apply based on the specific aims of their project.

Filtering methods include:

  • minGQ - remove variants based on the genotype quality across populations. Note: Trio families are required to build the minGQ filtering model in this step. For projects that lack family structures, we provide tables pre-trained with 1000 Genomes samples at different FDR thresholds; they can be found at the paths below (see the gsutil example after this list). These tables assume that GQ has a scale of [0,999], so they will not work with newer VCFs where GQ has a scale of [0,99].
    gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.10perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
    gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.1perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
    gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.5perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
  • BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches
  • FilterOutlierSamplesPostMinGQ - remove outlier samples with unusually high or low number of SVs
  • FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation
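
If needed, the pre-trained minGQ lookup tables listed above can be copied locally or to your own bucket with gsutil, for example:

> gsutil cp gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.1perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt .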

AnnotateVcf (in development)

Formerly Module08Annotation

Add annotations, such as the inferred function and allele frequencies of variants, to the final VCF.

Annotation methods include:

  • Functional annotation - The GATK tool SVAnnotate is used to annotate SVs with inferred functional consequences on protein-coding regions, regulatory regions such as UTRs and promoters, and other non-coding elements.
  • Allele frequency annotation - annotate SVs with their allele frequencies across all samples, in samples of a specific sex, and in specific sub-populations.
  • Allele frequency annotation with an external callset - annotate SVs with the allele frequencies of overlapping SVs in another callset, e.g. the gnomAD SV callset.

Module 09 (in development)

Visualize SVs with IGV screenshots and read depth plots.

Visualization methods include:

  • RD Visualization - generate RD plots across all samples, ideal for visualizing large CNVs.
  • IGV Visualization - generate IGV plots of each SV for individual samples, ideal for visualizing small de novo SVs.
  • Module09.visualize.wdl - generate RD plots and IGV plots, and combine them for easy review.

CI/CD

This repository is maintained following the norms of continuous integration (CI) and continuous delivery (CD). GATK-SV CI/CD is developed as a set of GitHub Actions workflows, available under the .github/workflows directory. Please refer to the workflows' README for their current coverage and setup.

Troubleshooting

VM runs out of memory or disk

  • Default pipeline settings are tuned for batches of 100 samples. Larger batches or cohorts may require additional VM resources. Most runtime attributes can be modified through the RuntimeAttr inputs. These are formatted like this in the json:
"MyWorkflow.runtime_attr_override": {
  "disk_gb": 100,
  "mem_gb": 16
},

Note that a subset of the struct attributes can be specified. See wdl/Structs.wdl for available attributes.

Calculated read length causes error in MELT workflow

Example error message from GatherSampleEvidence.MELT.GetWgsMetrics:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: The requested index 701766 is out of counter bounds. Possible cause of exception can be wrong READ_LENGTH parameter (much smaller than actual read length)

This error message was observed for a sample with an average read length of 117, but for which half the reads were of length 90 and half were of length 151. As a workaround, override the calculated read length by providing a read_length input of 151 (or the expected read length for the sample in question) to GatherSampleEvidence.
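
In that situation, the override could be added to the GatherSampleEvidence input json along these lines (a minimal sketch using jq; the file name is a placeholder):

> jq '. + {"GatherSampleEvidence.read_length": 151}' GatherSampleEvidence.json > GatherSampleEvidence.fixed.json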
