• Stars
    star
    1,691
  • Rank 27,591 (Top 0.6 %)
  • Language
    Java
  • License
    Other
  • Created almost 10 years ago
  • Updated 25 days ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Official code repository for GATK versions 4 and up

Build Status Maven Central License

Please see the GATK website, where you can download a precompiled executable, read documentation, ask questions, and receive technical support. For GitHub basics, see here.

GATK 4

This repository contains the next generation of the Genome Analysis Toolkit (GATK). The contents of this repository are 100% open source and released under the Apache 2.0 license (see LICENSE.TXT).

GATK4 aims to bring together well-established tools from the GATK and Picard codebases under a streamlined framework, and to enable selected tools to be run in a massively parallel way on local clusters or in the cloud using Apache Spark. It also contains many newly developed tools not present in earlier releases of the toolkit.

Table of Contents

Requirements

  • To run GATK:
    • Java 17 is needed to run or build GATK. We recommend one of the following:
      • Download the Eclipse Foundation's distribution of OpenJDK 17 from adoptium.net. Navigate to the release archive to find downloads for Java 17.
      • On Mac OS, you can install the Homebrew package manager and run brew tap homebrew/cask-versions followed by brew install --cask temurin17 to install the Eclipse Foundation's OpenJDK 17.
    • Python 2.6 or greater (required to run the gatk frontend script)
    • Python 3.6.2, along with a set of additional Python packages, is required to run some tools and workflows. See Python Dependencies for more information.
    • R 3.2.5 (needed for producing plots in certain tools)
  • To build GATK:
    • A Java 17 JDK
    • Git 2.5 or greater
    • git-lfs 1.1.0 or greater. Required to download the large files used to build GATK, and test files required to run the test suite. Run git lfs install after downloading, followed by git lfs pull from the root of your git clone to download all of the large files, including those required to run the test suite. The full download is approximately 5 gigabytes. Alternatively, if you are just building GATK and not running the test suite, you can skip this step since the build itself will use git-lfs to download the minimal set of large lfs resource files required to complete the build. The test resources will not be downloaded, but this greatly reduces the size of the download.
    • Gradle 5.6. We recommend using the ./gradlew script which will download and use an appropriate gradle version automatically (see examples below).
    • R 3.2.5 (needed for running the test suite)
  • Pre-packaged Docker images with all needed dependencies installed can be found on our dockerhub repository. This requires a recent version of the docker client, which can be found on the docker website.
  • Python Dependencies:
    • GATK4 uses the Conda package manager to establish and manage the Python environment and dependencies required by GATK tools that have a Python dependency. This environment also includes the R dependencies used for plotting in some of the tools. The gatk environment requires hardware with AVX support for tools that depend on TensorFlow (e.g. CNNScoreVariant). The GATK Docker image comes with the gatk environment pre-configured.
      • At this time, the only supported platforms are 64-bit Linux distributions. The required Conda environment is not currently supported on OS X/macOS.
    • To establish the environment when not using the Docker image, a conda environment must first be "created", and then "activated":
      • First, make sure Miniconda or Conda is installed (Miniconda is sufficient).
      • To "create" the conda environment:
        • If running from a zip or tar distribution, run the command conda env create -f gatkcondaenv.yml to create the gatk environment.
        • If running from a cloned repository, run ./gradlew localDevCondaEnv. This generates the Python package archive and conda yml dependency file(s) in the build directory, and also creates (or updates) the local gatk conda environment.
      • To "activate" the conda environment (the conda environment must be activated within the same shell from which GATK is run):
        • Execute the shell command source activate gatk to activate the gatk environment.
      • See the Conda documentation for additional information about using and managing Conda environments.

Quick Start Guide

  • Build the GATK: ./gradlew bundle (creates gatk-VERSION.zip in build/)
  • Get help on running the GATK: ./gatk --help
  • Get a list of available tools: ./gatk --list
  • Run a tool: ./gatk PrintReads -I src/test/resources/NA12878.chr17_69k_70k.dictFix.bam -O output.bam
  • Get help on a particular tool: ./gatk PrintReads --help

Downloading GATK4

You can download and run pre-built versions of GATK4 from the following places:

  • A zip archive with everything you need to run GATK4 can be downloaded for each release from the github releases page. We also host unstable archives generated nightly in the Google bucket gs://gatk-nightly-builds.

  • You can download a GATK4 docker image from our dockerhub repository. We also host unstable nightly development builds on this dockerhub repository.

    • Within the docker image, run gatk commands as usual from the default startup directory (/gatk).

Building GATK4

  • To do a full build of GATK4, first clone the GATK repository using "git clone", then run:

      ./gradlew bundle
    

    Equivalently, you can just type:

      ./gradlew
    
    • This creates a zip archive in the build/ directory with a name like gatk-VERSION.zip containing a complete standalone GATK distribution, including our launcher gatk, both the local and spark jars, and this README.
    • You can also run GATK commands directly from the root of your git clone after running this command.
    • Note that you must have a full git clone in order to build GATK, including the git-lfs files in src/main/resources. The zipped source code alone is not buildable.
  • Other ways to build:

    • ./gradlew installDist
      • Does a fast build that only lets you run GATK tools from inside your git clone, and locally only (not on a cluster). Good for developers!
    • ./gradlew installAll
      • Does a semi-fast build that only lets you run GATK tools from inside your git clone, but works both locally and on a cluster. Good for developers!
    • ./gradlew localJar
      • Builds only the GATK jar used for running tools locally (not on a Spark cluster). The resulting jar will be in build/libs with a name like gatk-package-VERSION-local.jar, and can be used outside of your git clone.
    • ./gradlew sparkJar
      • Builds only the GATK jar used for running tools on a Spark cluster (rather than locally). The resulting jar will be in build/libs with a name like gatk-package-VERSION-spark.jar, and can be used outside of your git clone.
      • This jar will not include Spark and Hadoop libraries, in order to allow the versions of Spark and Hadoop installed on your cluster to be used.
  • To remove previous builds, run:

      ./gradlew clean
    
  • For faster gradle operations, add org.gradle.daemon=true to your ~/.gradle/gradle.properties file. This will keep a gradle daemon running in the background and avoid the ~6s gradle start up time on every command.

  • Gradle keeps a cache of dependencies used to build GATK. By default this goes in ~/.gradle. If there is insufficient free space in your home directory, you can change the location of the cache by setting the GRADLE_USER_HOME environment variable.

  • The version number is automatically derived from the git history using git describe, you can override it by setting the versionOverride property. ( ./gradlew -DversionOverride=my_weird_version printVersion )

Running GATK4

  • The standard way to run GATK4 tools is via the gatk wrapper script located in the root directory of a clone of this repository.

    • Requires Python 2.6 or greater (this includes Python 3.x)
    • You need to have built the GATK as described in the Building GATK4 section above before running this script.
    • There are several ways gatk can be run:
      • Directly from the root of your git clone after building
      • By extracting the zip archive produced by ./gradlew bundle to a directory, and running gatk from there
      • Manually putting the gatk script within the same directory as fully-packaged GATK jars produced by ./gradlew localJar and/or ./gradlew sparkJar
      • Defining the environment variables GATK_LOCAL_JAR and GATK_SPARK_JAR, and setting them to the paths to the GATK jars produced by ./gradlew localJar and/or ./gradlew sparkJar
    • gatk can run non-Spark tools as well as Spark tools, and can run Spark tools locally, on a Spark cluster, or on Google Cloud Dataproc.
    • Note: running with java -jar directly and bypassing gatk causes several important system properties to not get set, including htsjdk compression level!
  • For help on using gatk itself, run ./gatk --help

  • To print a list of available tools, run ./gatk --list.

    • Spark-based tools will have a name ending in Spark (eg., BaseRecalibratorSpark). Most other tools are non-Spark-based.
  • To print help for a particular tool, run ./gatk ToolName --help.

  • To run a non-Spark tool, or to run a Spark tool locally, the syntax is: ./gatk ToolName toolArguments.

  • Tool arguments that allow multiple values, such as -I, can be supplied on the command line using a file with the extension ".args". Each line of the file should contain a single value for the argument.

  • Examples:

    ./gatk PrintReads -I input.bam -O output.bam
    
    ./gatk PrintReadsSpark -I input.bam -O output.bam
    

Passing JVM options to gatk

  • To pass JVM arguments to GATK, run gatk with the --java-options argument:

    ./gatk --java-options "-Xmx4G" <rest of command>
     
    ./gatk --java-options "-Xmx4G -XX:+PrintGCDetails" <rest of command>
    

Passing a configuration file to gatk

  • To pass a configuration file to GATK, run gatk with the --gatk-config-file argument:

     ./gatk --gatk-config-file GATKProperties.config <rest of command>
    

    An example GATK configuration file is packaged with each release as GATKConfig.EXAMPLE.properties This example file contains all current options that are used by GATK and their default values.

Running GATK4 with inputs on Google Cloud Storage:

  • Many GATK4 tools can read BAM or VCF inputs from a Google Cloud Storage bucket. Just use the "gs://" prefix:
    ./gatk PrintReads -I gs://mybucket/path/to/my.bam -L 1:10000-20000 -O output.bam
    
  • Important: You must set up your credentials first for this to work! There are three options:
    • Option (a): run in a Google Cloud Engine VM
      • If you are running in a Google VM then your credentials are already in the VM and will be picked up by GATK, you don't need to do anything special.
    • Option (b): use your own account
      gcloud auth application-default login
      
      • Done! GATK will use the application-default credentials you set up there.
    • Option (c): use a service account
      • Create a new service account on the Google Cloud web page and download the JSON key file
      • Install Google Cloud SDK
      • Tell gcloud about the key file:
      gcloud auth activate-service-account --key-file "$PATH_TO_THE_KEY_FILE"
      
      • Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the file
      export GOOGLE_APPLICATION_CREDENTIALS="$PATH_TO_THE_KEY_FILE"
      
      • Done! GATK will pick up the service account. You can also do this in a VM if you'd like to override the default credentials.

Running GATK4 Spark tools locally:

  • GATK4 Spark tools can be run in local mode (without a cluster). In this mode, Spark will run the tool in multiple parallel execution threads using the cores in your CPU. You can control how many threads Spark will use via the --spark-master argument.

  • Examples:

    Run PrintReadsSpark with 4 threads on your local machine:

      ./gatk PrintReadsSpark -I src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam -O output.bam \
          -- \
          --spark-runner LOCAL --spark-master 'local[4]'
    

    Run PrintReadsSpark with as many worker threads as there are logical cores on your local machine:

      ./gatk PrintReadsSpark -I src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam -O output.bam \
          -- \
          --spark-runner LOCAL --spark-master 'local[*]'
    
  • Note that the Spark-specific arguments are separated from the tool-specific arguments by a --.

Running GATK4 Spark tools on a Spark cluster:

./gatk ToolName toolArguments -- --spark-runner SPARK --spark-master <master_url> additionalSparkArguments

  • Examples:

    ./gatk PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
        -- \
        --spark-runner SPARK --spark-master <master_url>
    
    ./gatk PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
      -- \
      --spark-runner SPARK --spark-master <master_url> \
      --num-executors 5 --executor-cores 2 --executor-memory 4g \
      --conf spark.executor.memoryOverhead=600
    
  • You can also omit the "--num-executors" argument to enable dynamic allocation if you configure the cluster properly (see the Spark website for instructions).

  • Note that the Spark-specific arguments are separated from the tool-specific arguments by a --.

  • Running a Spark tool on a cluster requires Spark to have been installed from http://spark.apache.org/, since gatk invokes the spark-submit tool behind-the-scenes.

  • Note that the examples above use YARN but we have successfully run GATK4 on Mesos as well.

Running GATK4 Spark tools on Google Cloud Dataproc:

  • You must have a Google cloud services account, and have spun up a Dataproc cluster in the Google Developer's console. You may need to have the "Allow API access to all Google Cloud services in the same project" option enabled (settable when you create a cluster).
  • You need to have installed the Google Cloud SDK from here, since gatk invokes the gcloud tool behind-the-scenes. As part of the installation, be sure that you follow the gcloud setup instructions here. As this library is frequently updated by Google, we recommend updating your copy regularly to avoid any version-related difficulties.
  • Your inputs to the GATK when running on dataproc are typically in Google Cloud Storage buckets, and should be specified on your GATK command line using the syntax gs://my-gcs-bucket/path/to/my-file
  • You can run GATK4 jobs on Dataproc from your local computer or from the VM (master node) on the cloud.

Once you're set up, you can run a Spark tool on your Dataproc cluster using a command of the form:

./gatk ToolName toolArguments -- --spark-runner GCS --cluster myGCSCluster additionalSparkArguments

  • Examples:

    ./gatk PrintReadsSpark \
        -I gs://my-gcs-bucket/path/to/input.bam \
        -O gs://my-gcs-bucket/path/to/output.bam \
        -- \
        --spark-runner GCS --cluster myGCSCluster
    
    ./gatk PrintReadsSpark \
        -I gs://my-gcs-bucket/path/to/input.bam \
        -O gs://my-gcs-bucket/path/to/output.bam \
        -- \
        --spark-runner GCS --cluster myGCSCluster \
        --num-executors 5 --executor-cores 2 --executor-memory 4g \
        --conf spark.yarn.executor.memoryOverhead=600
    
  • When using Dataproc you can access the web interfaces for YARN, Hadoop and HDFS by opening an SSH tunnel and connecting with your browser. This can be done easily using included gcs-cluster-ui script.

    scripts/dataproc-cluster-ui myGCSCluster
    

    Or see these these instructions for more details.

  • Note that the spark-specific arguments are separated from the tool-specific arguments by a --.

  • If you want to avoid uploading the GATK jar to GCS on every run, set the GATK_GCS_STAGING environment variable to a bucket you have write access to (eg., export GATK_GCS_STAGING=gs://<my_bucket>/)

  • Dataproc Spark clusters are configured with dynamic allocation so you can omit the "--num-executors" argument and let YARN handle it automatically.

Using R to generate plots

Certain GATK tools may optionally generate plots using the R installation provided within the conda environment. If you are uninterested in plotting, R is still required by several of the unit tests. Plotting is currently untested and should be viewed as a convenience rather than a primary output.

Bash Command-line Tab Completion (BETA)

  • A tab completion bootstrap file for the bash shell is now included in releases. This file allows the command-line shell to complete GATK run options in a manner equivalent to built-in command-line tools (e.g. grep).

  • This tab completion functionality has only been tested in the bash shell, and is released as a beta feature.

  • To enable tab completion for the GATK, open a terminal window and source the included tab completion script:

source gatk-completion.sh
  • Sourcing this file will allow you to press the tab key twice to get a list of options available to add to your current GATK command. By default you will have to source this file once in each command-line session, then for the rest of the session the GATK tab completion functionality will be available. GATK tab completion will be available in that current command-line session only.

  • Note that you must have already started typing an invocation of the GATK (using gatk) for tab completion to initiate:

./gatk <TAB><TAB>
  • We recommend adding a line to your bash settings file (i.e. your ~/.bashrc file) that sources the tab completion script. To add this line to your bash settings / bashrc file you can use the following command:
echo "source <PATH_TO>/gatk-completion.sh" >> ~/.bashrc
  • Where <PATH_TO> is the fully qualified path to the gatk-completion.sh script.

For GATK Developers

General guidelines for GATK4 developers

  • Do not put private or restricted data into the repo.

  • Try to keep datafiles under 100kb in size. Larger test files should go into src/test/resources/large (and subdirectories) so that they'll be stored and tracked by git-lfs as described above.

  • GATK4 is Apache 2.0 licensed. The license is in the top level LICENSE.TXT file. Do not add any additional license text or accept files with a license included in them.

  • Each tool should have at least one good end-to-end integration test with a check for expected output, plus high-quality unit tests for all non-trivial utility methods/classes used by the tool. Although we have no specific coverage target, coverage should be extensive enough that if tests pass, the tool is guaranteed to be in a usable state.

  • All newly written code must have good test coverage (>90%).

  • All bug fixes must be accompanied by a regression test.

  • All pull requests must be reviewed before merging to master (even documentation changes).

  • Don't issue or accept pull requests that introduce warnings. Warnings must be addressed or suppressed.

  • Don't issue or accept pull requests that significantly decrease coverage (less than 1% decrease is sort of tolerable).

  • Don't use toString() for anything other than human consumption (ie. don't base the logic of your code on results of toString().)

  • Don't override clone() unless you really know what you're doing. If you do override it, document thoroughly. Otherwise, prefer other means of making copies of objects.

  • For logging, use org.apache.logging.log4j.Logger

  • We mostly follow the Google Java Style guide

  • Git: Don't push directly to master - make a pull request instead.

  • Git: Rebase and squash commits when merging.

  • If you push to master or mess up the commit history, you owe us 1 growler or tasty snacks at happy hour. If you break the master build, you owe 3 growlers (or lots of tasty snacks). Beer may be replaced by wine (in the color and vintage of buyer's choosing) in proportions of 1 growler = 1 bottle.

Testing GATK

  • Before running the test suite, be sure that you've installed git lfs and downloaded the large test data, following the git lfs setup instructions

  • To run the test suite, run ./gradlew test.

    • Test report is in build/reports/tests/test/index.html.
    • What will happen depends on the value of the TEST_TYPE environment variable:
      • unset or any other value : run non-cloud unit and integration tests, this is the default
      • cloud, unit, integration, conda, spark : run only the cloud, unit, integration, conda (python + R), or Spark tests
      • all : run the entire test suite
    • Cloud tests require being logged into gcloud and authenticated with a project that has access to the cloud test data. They also require setting several certain environment variables.
      • HELLBENDER_JSON_SERVICE_ACCOUNT_KEY : path to a local JSON file with service account credentials
      • HELLBENDER_TEST_PROJECT : your google cloud project
      • HELLBENDER_TEST_STAGING : a gs:// path to a writable location
      • HELLBENDER_TEST_INPUTS : path to cloud test data, ex: gs://hellbender/test/resources/
    • Setting the environment variable TEST_VERBOSITY=minimal will produce much less output from the test suite
  • To run a subset of tests, use gradle's test filtering (see gradle doc):

    • You can use --tests with a wildcard to run a specific test class, method, or to select multiple test classes:
      • ./gradlew test --tests *SomeSpecificTestClass
      • ./gradlew test --tests *SomeTest.someSpecificTestMethod
      • ./gradlew test --tests all.in.specific.package*
  • To run tests and compute coverage reports, run ./gradlew jacocoTestReport. The report is then in build/reports/jacoco/test/html/index.html. (IntelliJ has a good coverage tool that is preferable for development).

  • We use Github Actions as our continuous integration provider.

    • Before merging any branch make sure that all required tests pass on Github.
    • Every Actions build will upload the test results to our GATK Google Cloud Storage bucket and a zipped artifact upload. A link to the uploaded report will appear at the very bottom of the github actions log. Look for the line that says See the test report at. Test github actions test artifacts will not show up on the webpage until the entire test has concluded. If TestNG itself crashes there will be no report generated.
  • We use Broad Jenkins for our long-running tests and performance tests.

    • To add a performance test (requires Broad-ID), you need to make a "new item" in Jenkins and make it a "copy" instead of a blank project. You need to base it on either the "-spark-" jobs or the other kind of jobs and alter the commandline.
  • To output stack traces for UserException set the environment variable GATK_STACKTRACE_ON_USER_EXCEPTION=true

Using Git LFS to download and track large test data

We use git-lfs to version and distribute test data that is too large to check into our repository directly. You must install and configure it in order to be able to run our test suite.

  • After installing git-lfs, run git lfs install

    • This adds hooks to your git configuration that will cause git-lfs files to be checked out for you automatically in the future.
  • To manually retrieve the large test data, run git lfs pull from the root of your GATK git clone.

    • The download size is approximately 5 gigabytes.
  • To add a new large file to be tracked by git-lfs, simply:

    • Put the new file(s) in src/test/resources/large (or a subdirectory)
    • git add the file(s), then git commit -a
    • That's it! Do not run git lfs track on the files manually: all files in src/test/resources/large are tracked by git-lfs automatically.

Creating a GATK project in the IntelliJ IDE (last tested with version 2016.2.4):

  • Ensure that you have gradle and the Java 17 JDK installed

  • You may need to install the TestNG and Gradle plugins (in preferences)

  • Clone the GATK repository using git

  • In IntelliJ, click on "Import Project" in the home screen or go to File -> New... -> Project From Existing Sources...

  • Select the root directory of your GATK clone, then click on "OK"

  • Select "Import project from external model", then "Gradle", then click on "Next"

  • Ensure that "Gradle project" points to the build.gradle file in the root of your GATK clone

  • Select "Use auto-import" and "Use default gradle wrapper".

  • Make sure the Gradle JVM points to Java 17. You may need to set this manually after creating the project, to do so find the gradle settings by clicking the wrench icon in the gradle tab on the right bar, from there edit "Gradle JVM" argument to point to Java 17.

  • Click "Finish"

  • After downloading project dependencies, IntelliJ should open a new window with your GATK project

  • Make sure that the Java version is set correctly by going to File -> "Project Structure" -> "Project". Check that the "Project SDK" is set to your Java 17 JDK, and "Project language level" to 17 (you may need to add your Java 17 JDK under "Platform Settings" -> SDKs if it isn't there already). Then click "Apply"/"Ok".

Setting up debugging in IntelliJ

  • Follow the instructions above for creating an IntelliJ project for GATK

  • Go to Run -> "Edit Configurations", then click "+" and add a new "Application" configuration

  • Set the name of the new configuration to something like "GATK debug"

  • For "Main class", enter org.broadinstitute.hellbender.Main

  • Ensure that "Use classpath of module:" is set to use the "gatk" module's classpath

  • Enter the arguments for the command you want to debug in "Program Arguments"

  • Click "Apply"/"Ok"

  • Set breakpoints, etc., as desired, then select "Run" -> "Debug" -> "GATK debug" to start your debugging session

  • In future debugging sessions, you can simply adjust the "Program Arguments" in the "GATK debug" configuration as needed

Updating the Intellij project when dependencies change

If there are dependency changes in build.gradle it is necessary to refresh the gradle project. This is easily done with the following steps.

  • Open the gradle tool window ( "View" -> "Tool Windows" -> "Gradle" )
  • Click the refresh button in the Gradle tool window. It is in the top left of the gradle view and is represented by two blue arrows.

Setting up profiling using JProfiler

  • Running JProfiler standalone:

    • Build a full GATK4 jar using ./gradlew localJar
    • In the "Session Settings" window, select the GATK4 jar, eg. ~/gatk/build/libs/gatk-package-4.alpha-196-gb542813-SNAPSHOT-local.jar for "Main class or executable JAR" and enter the right "Arguments"
    • Under "Profiling Settings", select "sampling" as the "Method call recording" method.
  • Running JProfiler from within IntelliJ:

    • JProfiler has great integration with IntelliJ (we're using IntelliJ Ultimate edition) so the setup is trivial.
    • Follow the instructions above for creating an IntelliJ project for GATK
    • Right click on a test method/class/package and select "Profile"

Uploading Archives to Sonatype (to make them available via maven central)

To upload snapshots to Sonatype you'll need the following:

  • You must have a registered account on the sonatype JIRA (and be approved as a gatk uploader)

  • You need to configure several additional properties in your /~.gradle/gradle.properties file

  • If you want to upload a release instead of a snapshot you will additionally need to have access to the gatk signing key and password

#needed for snapshot upload
sonatypeUsername=<your sonatype username>
sonatypePassword=<your sonatype password>

#needed for signing a release
signing.keyId=<gatk key id>
signing.password=<gatk key password>
signing.secretKeyRingFile=/Users/<username>/.gnupg/secring.gpg

To perform an upload, use

./gradlew uploadArchives

Builds are considered snapshots by default. You can mark a build as a release build by setting -Drelease=true.
The archive name is based off of git describe.

Building GATK4 Docker images

Please see the the Docker README in scripts/docker. This has instructions for the Dockerfile in the root directory.

Releasing GATK4

Please see the How to release GATK4 wiki article for instructions on releasing GATK4.

Generating GATK4 documentation

To generate GATK documentation, run ./gradlew gatkDoc

  • Generated docs will be in the build/docs/gatkdoc directory.

Generating GATK4 WDL Wrappers

  • A WDL wrapper can be generated for any GATK4 tool that is annotated for WDL generation (see the wiki article How to Prepare a GATK tool for WDL Auto Generation) to learn more about WDL annotations.

  • To generate the WDL Wrappers, run ./gradlew gatkWDLGen. The generated WDLs and accompanying JSON input files can be found in the build/docs/wdlGen folder.

  • To generate WDL Wrappers and validate the resulting outputs, run ./gradlew gatkWDLGenValidation. Running this task requires a local cromwell installation, and environment variables CROMWELL_JAR and WOMTOOL_JAR to be set to the full pathnames of the cromwell and womtool jar files. If no local install is available, this task will run automatically on github actions in a separate job whenever a PR is submitted.

  • WDL wrappers for each GATK release are published to the gatk-tool-wdls repository. Only tools that have been annotated for WDL generation will show up there.

Using Zenhub to track github issues

We use Zenhub to organize and track github issues.

  • To add Zenhub to github, go to the Zenhub home page while logged in to github, and click "Add Zenhub to Github"

  • Zenhub allows the GATK development team to assign time estimates to issues, and to mark issues as Triaged/In Progress/In Review/Blocked/etc.

Further Reading on Spark

Apache Spark is a fast and general engine for large-scale data processing. GATK4 can run on any Spark cluster, such as an on-premise Hadoop cluster with HDFS storage and the Spark runtime, as well as on the cloud using Google Dataproc.

In a cluster scenario, your input and output files reside on HDFS, and Spark will run in a distributed fashion on the cluster. The Spark documentation has a good overview of the architecture.

Note that if you don't have a dedicated cluster you can run Spark in standalone mode on a single machine, which exercises the distributed code paths, albeit on a single node.

While your Spark job is running, the Spark UI is an excellent place to monitor the progress. Additionally, if you're running tests, then by adding -Dgatk.spark.debug=true you can run a single Spark test and look at the Spark UI (on http://localhost:4040/) as it runs.

You can find more information about tuning Spark and choosing good values for important settings such as the number of executors and memory settings at the following:

How to contribute to GATK

(Note: section inspired by, and some text copied from, Apache Parquet)

We welcome all contributions to the GATK project. The contribution can be a issue report or a pull request. If you're not a committer, you will need to make a fork of the gatk repository and issue a pull request from your fork.

For ideas on what to contribute, check issues labeled "Help wanted (Community)". Comment on the issue to indicate you're interested in contibuting code and for sharing your questions and ideas.

To contribute a patch:

  • Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.
  • Submit the patch as a GitHub pull request against the master branch. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. If applicable, include the issue number in the pull request name.
  • Make sure that your code passes all our tests. You can run the tests with ./gradlew test in the root directory.
  • Add tests for all new code you've written. We prefer unit tests but high quality integration tests that use small amounts of data are acceptable.
  • Follow the General guidelines for GATK4 developers.

We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some things to consider:

  • Write tests for all new code.
  • Document all classes and public methods.
  • For all public methods, check validity of the arguments and throw IllegalArgumentException if invalid.
  • Use braces for control constructs, if, for etc.
  • Make classes, variables, parameters etc final unless there is a strong reason not to.
  • Give your operators some room. Not a+b but a + b and not foo(int a,int b) but foo(int a, int b).
  • Generally speaking, stick to the Google Java Style guide

Thank you for getting involved!

Discussions

Authors

The authors list is maintained in the AUTHORS file. See also the Contributors list at github.

License

Licensed under the Apache 2.0 License. See the LICENSE.txt file.

More Repositories

1

cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
Scala
990
star
2

picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
Java
965
star
3

infercnv

Inferring CNV from Single-Cell RNA-Seq
R
558
star
4

keras-rcnn

Keras package for region-based convolutional neural networks (RCNNs)
Python
554
star
5

gtex-pipeline

GTEx & TOPMed data production and analysis pipelines
Python
334
star
6

pilon

Pilon is an automated genome assembly improvement and variant detection tool
Scala
306
star
7

keras-resnet

Keras package for deep residual networks
Python
294
star
8

CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
Python
293
star
9

Tangram

Spatial alignment of single cell transcriptomic data.
Jupyter Notebook
249
star
10

ssGSEA2.0

Single sample Gene Set Enrichment analysis (ssGSEA) and PTM Enrichment Analysis (PTM-SEA)
R
237
star
11

ABC-Enhancer-Gene-Prediction

Cell type specific enhancer-gene predictions using ABC model (Fulco, Nasser et al, Nature Genetics 2019)
Python
201
star
12

warp

WDL Analysis Research Pipelines
WDL
200
star
13

viral-ngs

Viral genomics analysis pipelines
Python
180
star
14

seqr

web-based analysis tool for rare disease genomics
Python
176
star
15

gatk-sv

A structural variation pipeline for short-read sequencing
Python
170
star
16

tensorqtl

Ultrafast GPU-enabled QTL mapper
Python
159
star
17

ichorCNA

Estimating tumor fraction in cell-free DNA from ultra-low-pass whole genome sequencing.
R
159
star
18

long-read-pipelines

Long read production pipelines
Jupyter Notebook
140
star
19

wot

A software package for analyzing snapshots of developmental processes
Jupyter Notebook
136
star
20

ml4h

Jupyter Notebook
122
star
21

depmap_omics

What you need to process the Quarterly DepMap-Omics releases from Terra
HTML
110
star
22

xtermcolor

Python library for terminal color support (including 256-color support)
Python
104
star
23

Drop-seq

Java tools for analyzing Drop-seq data
Java
100
star
24

mutect

MuTect -- Accurate and sensitive cancer mutation detection
Java
92
star
25

genomics-in-the-cloud

Source code and related materials for the O'Reilly book
Jupyter Notebook
91
star
26

gnomad_methods

Hail helper functions for the gnomAD project and Translational Genomics Group
Python
89
star
27

pyro-cov

Pyro models of SARS-CoV-2 variants
Jupyter Notebook
77
star
28

catch

A package for designing compact and comprehensive capture probe sets.
Python
74
star
29

gatk-docs

Documentation archive for GATK tools and workflows
HTML
71
star
30

oncotator

Python
67
star
31

gnomad-browser

Explore gnomAD datasets on the web
TypeScript
66
star
32

gtex-viz

GTEx Visualizations
JavaScript
63
star
33

single_cell_portal_core

Rails/Docker application for the Broad Institute's single cell RNA-seq data portal
Ruby
62
star
34

PhylogicNDT

HTML
57
star
35

docker-terraform

Docker container for running the Terraform application
Shell
56
star
36

2020_scWorkshop

Code and data repository for the 2020 physalia course on single cell RNA sequencing.
Shell
56
star
37

cromshell

CLI for interacting with Cromwell servers
Python
53
star
38

cellpainting-gallery

Cell Painting Gallery
52
star
39

viral-pipelines

viral-ngs: complete pipelines
WDL
51
star
40

gnomad_qc

Jupyter Notebook
48
star
41

single_cell_portal

Tutorials, workflows, and convenience scripts for Single Cell Portal
HTML
47
star
42

gistic2

Genomic Identification of Significant Targets in Cancer (GISTIC), version 2
MATLAB
44
star
43

sam

workbench identity and access management
Scala
42
star
44

dsde-deep-learning

DSDE Deep Learning Club
Python
40
star
45

gamgee

A C++14 library for NGS data formats
C++
40
star
46

wdl-ide

Rich IDE support for Workflow Description Language
Python
39
star
47

gtex-v8

Notebooks and scripts for reproducing analyses and figures from the V8 GTEx Consortium paper
Jupyter Notebook
39
star
48

pyqtl

Collection of analysis tools for quantitative trait loci
Python
38
star
49

SignatureAnalyzer-GPU

GPU implementation of ARD NMF
Python
37
star
50

poasta

Fast and exact gap-affine partial order alignment
Rust
37
star
51

Celligner_ms

Code related to the Celligner manuscript
R
36
star
52

cell-health

Predicting Cell Health with Morphological Profiles
HTML
35
star
53

PANOPLY

Repository for the Broad Institute Proteogenomic Data Analysis Center (PGDAC) established by the NIH Clinical Proteomics Tumor Analysis Consortium (CPTAC)
R
33
star
54

gatk-protected

Obsolete/Legacy GATK repository -- go to https://github.com/broadinstitute/gatk instead
Java
33
star
55

StrainGE

strain-level analysis tools
Python
33
star
56

firecloud-orchestration

Scala
31
star
57

gdctools

Python and UNIX CLI utilities to simplify interaction with the NIH/NCI Genomics Data Commons
Python
31
star
58

python-cert_manager

Python interface to the Sectigo Certificate Manager REST API
Python
31
star
59

str-analysis

Scripts and utilities related to analyzing short tandem repeats (STRs).
Python
29
star
60

adapt

A package for designing activity-informed nucleic acid diagnostics for viruses.
Python
29
star
61

2019_scWorkshop

Repo for Physalia course Analysis of Single Cell RNA-Seq data
TeX
29
star
62

chronos

Modeling of time series data for CRISPR KO experiments
Python
28
star
63

fiss

FireCloud Service Selector (FISS) -- Python bindings and CLI for FireCloud execution engine
Python
28
star
64

pyfrost

Python bindings for Bifrost with a NetworkX compatible API
Python
27
star
65

single_cell_analysis

Documents used for workshops on single cell analysis
HTML
26
star
66

deepometry

Image classification for imaging flow cytometry.
Python
26
star
67

lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
Jupyter Notebook
25
star
68

rawls

Rawls service for DSDE
Scala
25
star
69

delphy

Fast, scalable, accurate and accessible Bayesian phylogenetics
C++
25
star
70

firepony

Efficient base quality score recalibrator for NGS data
Cuda
24
star
71

protigy

Proteomics Toolset for Integrative Data Analysis
R
22
star
72

GATK-for-Microbes

WDL
22
star
73

seqr-loading-pipelines

hail-based pipelines for annotating variant callsets and exporting them to elasticsearch
Python
22
star
74

BipolarCell2016

R
21
star
75

cromwell-tools

A collection of Python clients and accessory scripts for interacting with the Cromwell
Python
21
star
76

covid19-testing

COVID-19 Diagnostic Processing Dashboard
HTML
20
star
77

single_cell_classification

Methods to use SNPs or gene expression to classify single cell RNAseq to reference profiles
R
20
star
78

VariantBam

Filtering and profiling of next-generational sequencing data using region-specific rules
Makefile
20
star
79

longbow

Annotation and segmentation of MAS-seq data
Python
20
star
80

gtex-single-nucleus-reference

Code repository for the snRNA-seq cross-tissue atlas project
Jupyter Notebook
20
star
81

flipbook

A tool that lets you quickly flip through images in a local directory and record notes or answer questions about each one.
Python
19
star
82

AwesomeGenomics

Cancer Data Science's go to place for excellent genomics tools and packages
19
star
83

firecloud-ui

FireCloud user interface for web browsers.
Clojure
19
star
84

BARD

BioAssay Research Database
Groovy
19
star
85

wdltool

Scala
18
star
86

vim-wdl

Vim syntax highlighting for WDL
Vim Script
18
star
87

SpliceAI-lookup

Website for checking SpliceAI and Pangolin scores:
Python
17
star
88

palantir-workflows

Utility workflows for the DSP hydro.gen team (formerly palantir)
WDL
17
star
89

epi-SHARE-seq-pipeline

Epigenomics Program pipeline to analyze SHARE-seq data.
WDL
16
star
90

wordpress-crowd-plugin

Crowd Authentication Plugin for Wordpress
PHP
16
star
91

mix_seq_ms

Code associated with MIX-seq manuscript
R
14
star
92

imaging-platform-pipelines

Cell Painting and other pipelines from the Imaging Platform
13
star
93

wdl-runner

Easily run WDL workflows on GCP
Python
13
star
94

widdler

A command-line tool for executing, managing, and querying WDL workflows on Cromwell servers.
Python
13
star
95

cms

Composite of Multiple Signals: tests for selection in meiotically recombinant populations
Python
13
star
96

regional_missense_constraint

Code to calculate regional missense constraint
Python
12
star
97

scRNA-Seq

Python
12
star
98

scalable_analytics

Public collaboration of Scalable Single Cell Analytics
Python
12
star
99

ml4ht_data_source

Multimodal data loader compatible with pytorch and tensorflow
Python
12
star
100

gene-hints

Discoverability for gene search 🧬 πŸ”
Python
12
star