  • Language
    Python
  • License
    BSD 3-Clause "New...
  • Created over 5 years ago
  • Updated 27 days ago

Ultrafast GPU-enabled QTL mapper

tensorQTL

tensorQTL is a GPU-enabled QTL mapper, achieving ~200-300 fold faster cis- and trans-QTL mapping compared to CPU-based implementations.

If you use tensorQTL in your research, please cite the following paper: Taylor-Weiner, Aguet, et al., Genome Biol., 2019.

Empirical beta-approximated p-values are computed as described in Ongen et al., Bioinformatics, 2016.

Install

You can install tensorQTL using pip:

pip3 install tensorqtl

or directly from this repository:

$ git clone git@github.com:broadinstitute/tensorqtl.git
$ cd tensorqtl
# set up virtual environment and install
$ virtualenv venv
$ source venv/bin/activate
(venv)$ pip install -r install/requirements.txt .

To use PLINK 2 binary files (pgen/pvar/psam), pgenlib must be installed:

git clone git@github.com:chrchang/plink-ng.git
cd plink-ng/2.0/Python/
python3 setup.py build_ext
python3 setup.py install

Requirements

tensorQTL requires an environment configured with a GPU for optimal performance, but can also be run on a CPU. Instructions for setting up a virtual machine on Google Cloud Platform are provided here.

Input formats

Three inputs are required for QTL analyses with tensorQTL: genotypes, phenotypes, and covariates.

  • Phenotypes must be provided in BED format, with a single header line starting with # and the first four columns corresponding to: chr, start, end, phenotype_id, with the remaining columns corresponding to samples (the identifiers must match those in the genotype input). The BED file should specify the center of the cis-window (usually the TSS), with start == end-1. A function for generating a BED template from a gene annotation in GTF format is available in pyqtl (io.gtf_to_tss_bed).

  • Covariates can be provided as a tab-delimited text file (covariates x samples) or dataframe (samples x covariates), with row and column headers.

  • Genotypes must be in PLINK format, which can be generated from a VCF as follows:

    plink2 --make-bed \
        --output-chr chrM \
        --vcf ${plink_prefix_path}.vcf.gz \
        --out ${plink_prefix_path}
    

    If using PLINK 1.9 or earlier, add the --keep-allele-order flag.

    Alternatively, the genotypes can be provided as a dataframe (genotypes x samples).

The examples notebook below contains examples of all input files. The input formats for phenotypes and covariates are identical to those used by FastQTL.
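
As a rough illustration of the phenotype BED layout described above, the following sketch builds a small table with pandas and writes it as tab-delimited text. The gene names, TSS coordinates, sample IDs, and expression values are made up for the example; only the column layout (a header starting with #, then chr, start, end, phenotype_id, then one column per sample, with start == end-1) follows the description.

```python
import io
import pandas as pd

# Hypothetical TSS annotation and expression values (illustrative only)
tss_df = pd.DataFrame({
    'chr': ['chr1', 'chr1'],
    'tss': [11869, 29554],
    'phenotype_id': ['geneA', 'geneB'],
})
expression_df = pd.DataFrame(
    {'S1': [0.1, -1.2], 'S2': [0.5, 0.3]},
    index=['geneA', 'geneB'],
)

# First four columns: chr, start, end, phenotype_id, with start == end-1
bed_df = pd.DataFrame({
    '#chr': tss_df['chr'],
    'start': tss_df['tss'] - 1,
    'end': tss_df['tss'],
    'phenotype_id': tss_df['phenotype_id'],
})
# Remaining columns: one per sample, matched on phenotype_id
bed_df = bed_df.join(expression_df, on='phenotype_id')

# Write as tab-delimited text; the header line starts with '#'
buffer = io.StringIO()
bed_df.to_csv(buffer, sep='\t', index=False)
bed_text = buffer.getvalue()
print(bed_text)
```

In practice the table would be written to a file rather than to an in-memory buffer, and the sample columns must use the same identifiers as the genotype input.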

Examples

For examples illustrating cis- and trans-QTL mapping, please see tensorqtl_examples.ipynb.

Running tensorQTL

This section describes how to run the different modes of tensorQTL, both from the command line and within Python. For a full list of options, run

python3 -m tensorqtl --help

Loading input files

This section is only relevant when running tensorQTL in Python. The following imports are required:

import pandas as pd
import tensorqtl
from tensorqtl import genotypeio, cis, trans

Phenotypes and covariates can be loaded as follows:

phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(phenotype_bed_file)
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T  # samples x covariates
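
The transposition in the covariates line can be checked on a toy file. This is a minimal sketch with invented covariate names and sample IDs; it reads a covariates x samples table from an in-memory string, transposes it to samples x covariates, and verifies that every phenotype sample has covariates.

```python
import io
import pandas as pd

# Toy covariates file: rows are covariates, columns are samples (illustrative values)
covariates_text = (
    "ID\tS1\tS2\tS3\n"
    "PC1\t0.1\t-0.2\t0.05\n"
    "sex\t1\t2\t1\n"
)

# Same call as above, reading from a string instead of a file path
covariates_df = pd.read_csv(io.StringIO(covariates_text), sep='\t', index_col=0).T

# Stand-in for phenotype_df.columns; these IDs are assumptions for the example
phenotype_samples = ['S1', 'S2', 'S3']

# Samples present in the phenotypes but missing from the covariates
missing = [s for s in phenotype_samples if s not in covariates_df.index]
print(covariates_df.shape, missing)
```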

Genotypes can be loaded as follows, where plink_prefix_path is the path to the genotype data in PLINK format (excluding the .bed/.bim/.fam extensions):

pr = genotypeio.PlinkReader(plink_prefix_path)
# load genotypes and variants into data frames
genotype_df = pr.load_genotypes()
variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]

To save memory when only a subset of samples is needed, the reader can load just those samples (this is not strictly necessary, since tensorQTL otherwise selects the relevant samples from genotype_df):

pr = genotypeio.PlinkReader(plink_prefix_path, select_samples=phenotype_df.columns)

cis-QTL mapping: permutations

This is the main mode for cis-QTL mapping. It generates phenotype-level summary statistics with empirical p-values, enabling calculation of genome-wide FDR. In Python:

cis_df = cis.map_cis(genotype_df, variant_df, phenotype_df, phenotype_pos_df, covariates_df)
tensorqtl.calculate_qvalues(cis_df, qvalue_lambda=0.85)

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --mode cis

${prefix} specifies the output file name.

cis-QTL mapping: summary statistics for all variant-phenotype pairs

In Python:

cis.map_nominal(genotype_df, variant_df, phenotype_df, phenotype_pos_df,
                prefix, covariates_df, output_dir='.')

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --mode cis_nominal

The results are written to a parquet file for each chromosome. These files can be read using pandas:

df = pd.read_parquet(file_name)
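
To combine the per-chromosome files into one table, something along these lines may be convenient. It is a sketch, not part of tensorQTL: the helper names pairs_path and load_pairs are hypothetical, the chromosome list is an assumption, and the file naming follows the ${prefix}.cis_qtl_pairs.${chr}.parquet pattern given in the interactions section of this document. Reading parquet with pandas requires a parquet engine such as pyarrow.

```python
import os
import pandas as pd

def pairs_path(output_dir, prefix, chrom):
    """Build the path of the per-chromosome parquet file (hypothetical helper)."""
    return os.path.join(output_dir, f'{prefix}.cis_qtl_pairs.{chrom}.parquet')

def load_pairs(output_dir, prefix, chroms):
    """Read every per-chromosome file that exists and stack them into one dataframe."""
    dfs = []
    for chrom in chroms:
        path = pairs_path(output_dir, prefix, chrom)
        if os.path.exists(path):
            dfs.append(pd.read_parquet(path))
    # Return an empty dataframe if no file was found
    return pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()

# Assumed chromosome names; adjust to match the genotype data
chroms = [f'chr{i}' for i in range(1, 23)] + ['chrX']
```

A call such as load_pairs('.', prefix, chroms) would then return all variant-phenotype pairs in a single dataframe, which can be large for genome-wide runs.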

cis-QTL mapping: conditionally independent QTLs

This mode maps conditionally independent cis-QTLs using the stepwise regression procedure described in GTEx Consortium, 2017. The output from the permutation step (see map_cis above) is required. In Python:

indep_df = cis.map_independent(genotype_df, variant_df, cis_df,
                               phenotype_df, phenotype_pos_df, covariates_df)

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --cis_output ${prefix}.cis_qtl.txt.gz \
    --mode cis_independent

cis-QTL mapping: interactions

Instead of mapping the standard linear model (p ~ g), this mode includes an interaction term (p ~ g + i + gi) and returns full summary statistics for the model. The interaction term is a tab-delimited text file or pd.Series mapping sample ID to interaction value. With the run_eigenmt=True option, eigenMT-adjusted p-values are computed. In Python:

cis.map_nominal(genotype_df, variant_df, phenotype_df, phenotype_pos_df, prefix,
                covariates_df=covariates_df,
                interaction_s=interaction_s, maf_threshold_interaction=0.05,
                run_eigenmt=True, output_dir='.', write_top=True, write_stats=True)

The input options write_top and write_stats control whether the top association per phenotype and full summary statistics, respectively, are written to file.
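
To make the interaction model p ~ g + i + gi concrete, the sketch below simulates one phenotype, one genotype, and one interaction variable, then fits the model by ordinary least squares with NumPy. It only illustrates the model; it is not how tensorQTL computes the statistics, and all sample IDs, effect sizes, and the noise level are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
samples = [f'S{k}' for k in range(n)]  # hypothetical sample IDs

# Genotype dosages (0, 1, 2) and an interaction variable indexed by sample ID
g = pd.Series(rng.integers(0, 3, size=n).astype(float), index=samples)
interaction_s = pd.Series(rng.normal(size=n), index=samples)

# Simulated phenotype with known effects: b_g=0.5, b_i=0.2, b_gi=0.8
p = 0.5 * g + 0.2 * interaction_s + 0.8 * g * interaction_s \
    + rng.normal(scale=0.1, size=n)

# Design matrix: intercept, g, i, and the product term gi
X = np.column_stack([np.ones(n), g, interaction_s, g * interaction_s])
beta, *_ = np.linalg.lstsq(X, p.to_numpy(), rcond=None)
b_g, b_i, b_gi = beta[1:]
print(f'b_g={b_g:.3f}, b_i={b_i:.3f}, b_gi={b_gi:.3f}')
```

The fitted coefficients should land close to the simulated effects. A pd.Series indexed by sample ID, like interaction_s here, is the form described above for the interaction input.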

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --interaction ${interactions_file} \
    --best_only \
    --mode cis_nominal

The option --best_only disables output of full summary statistics.

Full summary statistics are saved as parquet files for each chromosome, in ${output_dir}/${prefix}.cis_qtl_pairs.${chr}.parquet, and the top association for each phenotype is saved to ${output_dir}/${prefix}.cis_qtl_top_assoc.txt.gz. In these files, the columns b_g, b_g_se, pval_g are the effect size, standard error, and p-value of g in the model, with matching columns for i and gi. In the *.cis_qtl_top_assoc.txt.gz file, tests_emt is the effective number of independent variants in the cis-window estimated with eigenMT, i.e., based on the eigenvalue decomposition of the regularized genotype correlation matrix (Davis et al., AJHG, 2016). The column pval_emt is computed as pval_gi * tests_emt, and pval_adj_bh contains the Benjamini-Hochberg adjusted p-values corresponding to pval_emt.
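
The relationship between pval_gi, tests_emt, pval_emt, and pval_adj_bh can be reproduced on a toy table. This is a sketch under stated assumptions: the values are invented, bh_adjust is a hand-written Benjamini-Hochberg step-up procedure rather than tensorQTL's own code, and capping pval_emt at 1 is an assumption added here so the result stays a valid p-value.

```python
import numpy as np
import pandas as pd

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by n/k
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0, 1)
    return adjusted

# Toy top-association table (illustrative values only)
top_df = pd.DataFrame(
    {'pval_gi': [1e-6, 0.004, 0.03, 0.2], 'tests_emt': [50, 20, 10, 5]},
    index=['gene1', 'gene2', 'gene3', 'gene4'],
)

# eigenMT adjustment: multiply by the effective number of tests (capped at 1, an assumption)
top_df['pval_emt'] = np.minimum(top_df['pval_gi'] * top_df['tests_emt'], 1.0)
# Benjamini-Hochberg adjustment across phenotypes
top_df['pval_adj_bh'] = bh_adjust(top_df['pval_emt'])
print(top_df)
```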

trans-QTL mapping

This mode computes nominal associations between all phenotypes and genotypes. tensorQTL generates sparse output by default (associations with p-value < 1e-5). cis-associations are filtered out. The output is in parquet format, with four columns: phenotype_id, variant_id, pval, maf. In Python:

trans_df = trans.map_trans(genotype_df, phenotype_df, covariates_df,
                           return_sparse=True, pval_threshold=1e-5, maf_threshold=0.05,
                           batch_size=20000)
# remove cis-associations
trans_df = trans.filter_cis(trans_df, phenotype_pos_df.T.to_dict(), variant_df, window=5000000)

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --mode trans
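
The idea behind removing cis-associations can be sketched in plain pandas: drop every pair where the variant lies on the same chromosome as the phenotype and within the window around its position. This is an illustration of the filtering logic only, not the implementation of trans.filter_cis; the function name drop_cis_pairs, the column names chr and pos for phenotype positions, and all IDs and coordinates are assumptions for the example.

```python
import pandas as pd

def drop_cis_pairs(trans_df, phenotype_pos_df, variant_df, window=5_000_000):
    """Remove pairs on the same chromosome within `window` bp (hypothetical helper)."""
    chrom_v = trans_df['variant_id'].map(variant_df['chrom'])
    pos_v = trans_df['variant_id'].map(variant_df['pos'])
    chrom_p = trans_df['phenotype_id'].map(phenotype_pos_df['chr'])
    pos_p = trans_df['phenotype_id'].map(phenotype_pos_df['pos'])
    is_cis = (chrom_v == chrom_p) & ((pos_v - pos_p).abs() <= window)
    return trans_df[~is_cis]

# Toy inputs (illustrative values only)
variant_df = pd.DataFrame(
    {'chrom': ['chr1', 'chr1', 'chr2'], 'pos': [1000, 10_000_000, 1500]},
    index=['v1', 'v2', 'v3'],
)
phenotype_pos_df = pd.DataFrame({'chr': ['chr1'], 'pos': [2000]}, index=['geneA'])
trans_df = pd.DataFrame({
    'phenotype_id': ['geneA', 'geneA', 'geneA'],
    'variant_id': ['v1', 'v2', 'v3'],
    'pval': [1e-8, 1e-7, 1e-6],
})

filtered_df = drop_cis_pairs(trans_df, phenotype_pos_df, variant_df)
print(filtered_df)
```

Here v1 is removed because it sits within 5 Mb of geneA on chr1, while v2 (same chromosome, farther away) and v3 (different chromosome) are kept.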
