  • Language
    Python
  • License
    MIT License
  • Created about 7 years ago
  • Updated 4 months ago


ENCODE ATAC-seq pipeline


Introduction

This pipeline is designed for automated end-to-end quality control and processing of ATAC-seq and DNase-seq data. The pipeline can be run on compute clusters with job submission engines as well as on standalone machines. It inherently makes use of parallelized/distributed computing. Pipeline installation is also easy, as most dependencies are installed automatically. The pipeline can be run end-to-end, starting from raw FASTQ files all the way to peak calling and signal track generation, using a single caper submit command. One can also start the pipeline from intermediate stages (for example, using alignment files as input). The pipeline supports both single-end and paired-end data as well as replicated or non-replicated datasets. The outputs produced by the pipeline include 1) formatted HTML reports that include quality control measures specifically designed for ATAC-seq and DNase-seq data, 2) analysis of reproducibility, 3) stringent and relaxed thresholding of peaks, 4) fold-enrichment and p-value signal tracks. The pipeline also supports detailed error reporting and allows for easy resumption of interrupted runs. It has been tested on some human, mouse and yeast ATAC-seq datasets as well as on human and mouse DNase-seq datasets.

The ATAC-seq pipeline protocol specification is here. Some parts of the ATAC-seq pipeline were developed in collaboration with Jason Buenrostro, Alicia Schep and Will Greenleaf at Stanford.

Features

  • Portability: The pipeline run can be performed across different cloud platforms such as Google, AWS and DNAnexus, as well as on cluster engines such as SLURM, SGE and PBS.
  • User-friendly HTML report: In addition to the standard outputs, the pipeline generates an HTML report that consists of a tabular representation of quality metrics, including alignment/peak statistics and FRiP, along with many useful plots (IDR/TSS enrichment). See an example of the HTML report and the JSON file used to generate it.
  • Supported genomes: The pipeline needs genome-specific data such as aligner indices, a chromosome sizes file and a blacklist. We provide a genome database downloader/builder for hg38, hg19, mm10 and mm9. You can also use this builder to build a genome database from FASTA for your custom genome.

Installation

  1. Install Caper (Python Wrapper/CLI for Cromwell).

    $ pip install caper
  2. IMPORTANT: Read Caper's README carefully to choose a backend for your system. Follow the instructions in the configuration file.

    # backend: local or your HPC type (e.g. slurm, sge, pbs, lsf). read Caper's README carefully.
    $ caper init [YOUR_BACKEND]
    
    # IMPORTANT: edit the conf file and follow commented instructions in there
    $ vi ~/.caper/default.conf
  3. Git clone this pipeline.

    $ cd
    $ git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
    $ cd atac-seq-pipeline
  4. Define test input JSON.

    INPUT_JSON="https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ_subsampled.json"
  5. If you have Docker and want to run the pipeline locally on your laptop, use the following commands. --max-concurrent-tasks 1 limits the number of concurrent tasks so that you can test-run the pipeline on a laptop. Omit it if you run the pipeline on a workstation/HPC.

    # check if Docker works on your machine
    $ docker run ubuntu:latest echo hello
    
    # --max-concurrent-tasks 1 is for computers with limited resources
    $ caper run atac.wdl -i "${INPUT_JSON}" --docker --max-concurrent-tasks 1
  6. Otherwise, install Singularity on your system. Please follow these instructions to install Singularity on a Debian-based OS, or ask your system administrator to install Singularity on your HPC.

    # check if Singularity works on your machine
    $ singularity exec docker://ubuntu:latest echo hello
    
    # on your local machine (--max-concurrent-tasks 1 is for computers with limited resources)
    $ caper run atac.wdl -i "${INPUT_JSON}" --singularity --max-concurrent-tasks 1
    
    # on HPC, make sure that Caper's conf ~/.caper/default.conf is correctly configured to work with your HPC
    # the following command will submit Caper as a leader job to SLURM with Singularity
    $ caper hpc submit atac.wdl -i "${INPUT_JSON}" --singularity --leader-job-name ANY_GOOD_LEADER_JOB_NAME
    
    # check job ID and status of your leader jobs
    $ caper hpc list
    
    # cancel the leader job to close all of its child jobs
    # If you directly use a cluster command like scancel or qdel then
    # child jobs will not be terminated
    $ caper hpc abort [JOB_ID]
  7. (Optional Conda method) WE DO NOT HELP USERS FIX CONDA DEPENDENCY ISSUES. IF CONDA METHOD FAILS THEN PLEASE USE SINGULARITY METHOD INSTEAD. DO NOT USE A SHARED CONDA. INSTALL YOUR OWN MINICONDA3 AND USE IT.

    # check that you are not using a shared conda; if you are, delete it or remove it from your PATH
    $ which conda
    
    # uninstall pipeline's old environments
    $ bash scripts/uninstall_conda_env.sh
    
    # install new envs, you need to run this for every pipeline version update.
    # it may be killed if you run this command line on a login node on HPC.
    # it's recommended to make an interactive node with enough resources and run it there.
    $ bash scripts/install_conda_env.sh
    
    # if installation fails please use Singularity method instead.
    
    # on your local machine (--max-concurrent-tasks 1 is for computers with limited resources)
    $ caper run atac.wdl -i "${INPUT_JSON}" --conda --max-concurrent-tasks 1
    
    # on HPC, make sure that Caper's conf ~/.caper/default.conf is correctly configured to work with your HPC
    # the following command will submit Caper as a leader job to SLURM with Conda
    $ caper hpc submit atac.wdl -i "${INPUT_JSON}" --conda --leader-job-name ANY_GOOD_LEADER_JOB_NAME
    
    # check job ID and status of your leader jobs
    $ caper hpc list
    
    # cancel the leader job to close all of its child jobs
    # If you directly use a cluster command like scancel or qdel then
    # child jobs will not be terminated
    $ caper hpc abort [JOB_ID]
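
The local run commands in steps 5 to 7 differ only in the runtime flag and the optional task limit. The following is a minimal Python sketch of assembling that command line, using only the flags shown above. The helper name build_caper_run_command is hypothetical and is not part of Caper.

```python
import shlex

# Runtime flags exactly as they appear in the installation steps above.
RUNTIME_FLAGS = {
    "docker": "--docker",
    "singularity": "--singularity",
    "conda": "--conda",
}


def build_caper_run_command(input_json, runtime, wdl="atac.wdl",
                            max_concurrent_tasks=None):
    """Return the argument list for a local `caper run` (hypothetical helper)."""
    if runtime not in RUNTIME_FLAGS:
        raise ValueError(f"unknown runtime: {runtime!r}")
    cmd = ["caper", "run", wdl, "-i", input_json, RUNTIME_FLAGS[runtime]]
    if max_concurrent_tasks is not None:
        # Only needed on machines with limited resources, e.g. a laptop.
        cmd += ["--max-concurrent-tasks", str(max_concurrent_tasks)]
    return cmd


if __name__ == "__main__":
    test_json = (
        "https://storage.googleapis.com/encode-pipeline-test-samples/"
        "encode-atac-seq-pipeline/ENCSR356KRQ_subsampled.json"
    )
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_caper_run_command(test_json, "docker", max_concurrent_tasks=1)))
```

The returned list can be passed to subprocess.run once the printed command looks right.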

Input JSON file specification

IMPORTANT: DO NOT BLINDLY USE A TEMPLATE/EXAMPLE INPUT JSON. READ THROUGH THE FOLLOWING GUIDE TO MAKE A CORRECT INPUT JSON FILE. ESPECIALLY FOR AUTODETECTING/DEFINING ADAPTERS.

An input JSON file specifies all the input parameters and files needed to run this pipeline, including the paths to the genome reference files and the raw FASTQ files. Please make sure to specify absolute paths rather than relative paths in your input JSON files.

  1. Input JSON file specification (short)
  2. Input JSON file specification (long)
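
The absolute-path requirement can be checked before submitting a run. The sketch below is one possible check, not part of the pipeline. It assumes that any string value containing a slash or a dot is a file path unless it carries a URI scheme, which is only a heuristic and may flag non-file strings. The keys and file names in the example are illustrative; consult the input JSON specification for the actual parameter names.

```python
import json
from pathlib import PurePosixPath


def iter_strings(value):
    """Yield every string found in a nested JSON value (lists and dicts)."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)


def find_relative_paths(inputs):
    """Return {key: [values]} for local-looking paths that are not absolute."""
    problems = {}
    for key, value in inputs.items():
        for text in iter_strings(value):
            if "://" in text:
                continue  # URI such as gs://, s3:// or https://
            looks_like_path = "/" in text or "." in text  # heuristic only
            if looks_like_path and not PurePosixPath(text).is_absolute():
                problems.setdefault(key, []).append(text)
    return problems


if __name__ == "__main__":
    # Illustrative input; keys and paths are assumptions, not a template.
    example = json.loads("""{
        "atac.genome_tsv": "/data/genome/hg38.tsv",
        "atac.paired_end": true,
        "atac.fastqs_rep1_R1": ["fastq/rep1_R1.fastq.gz"]
    }""")
    for key, values in find_relative_paths(example).items():
        print(f"{key}: relative path(s) {values}")
```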

Running and sharing on Truwl

You can run this pipeline on truwl.com. This provides a web interface that allows you to define inputs and parameters, run the job on GCP, and monitor progress. To run it you will need to create an account on the platform then request early access by emailing [email protected] to get the right permissions. You can see the example case from this repo at https://truwl.com/workflows/instance/WF_e85df4.f10.8880/command. The example job (or other jobs) can be forked to pre-populate the inputs for your own job.

If you do not run the pipeline on Truwl, you can still share your use-case/job on the platform by getting in touch at [email protected] and providing your inputs.json file.

Running on Terra/Anvil (using Dockstore)

Visit our pipeline repo on Dockstore. Click on Terra or Anvil. Follow Terra's instructions to create a workspace on Terra and add Terra's billing bot to your Google Cloud account.

Download this test input JSON for Terra and upload it to Terra's UI and then run analysis.

If you want to use your own input JSON file, then make sure that all files in the input JSON are on a Google Cloud Storage bucket (gs://). URLs will not work.
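
A quick pre-upload check for this requirement could look like the following sketch. It assumes that every string containing a slash is a file reference, which is a heuristic and not something the pipeline or Terra defines. The function name and the bucket and file names are hypothetical.

```python
def non_gcs_files(value):
    """Return file-like strings in a nested JSON value that are not gs:// URIs."""
    if isinstance(value, str):
        is_file_like = "/" in value  # heuristic: paths and URLs contain a slash
        return [value] if is_file_like and not value.startswith("gs://") else []
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, list):
        return [bad for item in value for bad in non_gcs_files(item)]
    return []  # numbers, booleans and null are never files


if __name__ == "__main__":
    # Illustrative input JSON content; all names are made up for the example.
    example = {
        "atac.genome_tsv": "gs://my-bucket/genome/hg38.tsv",
        "atac.fastqs_rep1_R1": ["https://example.org/rep1_R1.fastq.gz"],
        "atac.paired_end": True,
    }
    for path in non_gcs_files(example):
        print(f"not on Google Cloud Storage: {path}")
```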

Running on DNAnexus (using Dockstore)

Sign up for a new account on DNAnexus and create a new project on either AWS or Azure. Visit our pipeline repo on Dockstore. Click on DNAnexus. Choose a destination directory on your DNAnexus project. Click on Submit and visit DNAnexus. This submits a conversion job, whose status you can check under Monitor in the DNAnexus UI.

Once conversion is done download one of the following input JSON files according to your chosen platform (AWS or Azure) for your DNAnexus project:

You cannot use these input JSON files directly. Go to the destination directory on DNAnexus and click on the converted workflow atac. You will see input file boxes on the left-hand side of the task graph. Expand them and define FASTQs (fastq_repX_R1, and also fastq_repX_R2 if the data is paired-end) and genome_tsv as in the downloaded input JSON file. Click on the common task box and define other non-file pipeline parameters, e.g. auto_detect_adapters and paired_end.

We have a separate project on DNAnexus to provide example FASTQs and genome_tsv for hg38 and mm10. We recommend making copies of these directories in your own project.

genome_tsv

Example FASTQs

Running on DNAnexus (using our pre-built workflows)

See this for details.

How to organize outputs

Install Croo. You can skip this installation if you have installed the pipeline's Conda environment and activated it. Make sure that you have Python 3 (> 3.4.1) installed on your system. Find a metadata.json file in Caper's output directory.

$ pip install croo
$ croo [METADATA_JSON_FILE]
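
If the location of metadata.json is not obvious, a recursive search of the output directory can help. The sketch below makes no assumption about Caper's directory layout and simply lists every file named metadata.json, newest first. The function name find_metadata_json is hypothetical.

```python
from pathlib import Path


def find_metadata_json(output_dir):
    """Return all metadata.json files below output_dir, newest first."""
    candidates = Path(output_dir).rglob("metadata.json")
    return sorted(candidates, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # Search the current directory; replace "." with your Caper output directory.
    for path in find_metadata_json("."):
        print(path)
```

Each printed path can then be passed to croo as [METADATA_JSON_FILE].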

How to make a spreadsheet of QC metrics

Install qc2tsv. Make sure that you have Python 3 (> 3.4.1) installed on your system.

Once you have organized the output with Croo, you will find the pipeline's final output file qc/qc.json, which contains all QC metrics. Simply feed qc2tsv multiple qc.json files. It accepts various URIs such as local paths, gs:// and s3://.

$ pip install qc2tsv
$ qc2tsv /sample1/qc.json gs://sample2/qc.json s3://sample3/qc.json ... > spreadsheet.tsv

QC metrics for each experiment (qc.json) will be split into multiple rows (1 for overall experiment + 1 for each bio replicate) in a spreadsheet.
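
To inspect a few local qc.json files directly in Python, nested JSON can be flattened into dotted column names. The sketch below is not how qc2tsv works: it writes one row per file and does not split rows per replicate. The metric names in the example are made up for illustration.

```python
import csv
import io


def flatten(obj, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def to_tsv(qc_objects):
    """Render one row per QC object; missing metrics are left empty."""
    rows = [flatten(qc) for qc in qc_objects]
    columns = sorted({name for row in rows for name in row})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical QC content; real qc.json files have their own structure.
    sample1 = {"align": {"rep1": {"mapped_reads": 100}}}
    sample2 = {"align": {"rep1": {"mapped_reads": 80}, "rep2": {"mapped_reads": 90}}}
    print(to_tsv([sample1, sample2]), end="")
```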
