Repository Details

Single sample Gene Set Enrichment analysis (ssGSEA) and PTM Enrichment Analysis (PTM-SEA)

ssGSEA2.0/PTM-SEA

Resources for gene-centric single sample Gene Set Enrichment Analysis (ssGSEA) of gene expression data (e.g. mRNAs, proteins) and site-centric PTM Signature Enrichment Analysis (PTM-SEA) [1] of phosphoproteomics data sets using the PTM signatures database (PTMsigDB) [1].

Disclaimer

The primary purpose of this repository is to supplement our manuscript in which we describe PTM-SEA and PTMsigDB. While ssGSEA2.0 presents an updated version of the original ssGSEA R implementation, we want to acknowledge that this is not the primary repository for ssGSEA. The official codebase for ssGSEA can be found here, and the official GenePattern module to perform ssGSEA can be accessed here.

ssGSEA 2.0

This is an updated version of the original ssGSEA [2,3] R-implementation. Depending on the input dataset and chosen database (gene sets or PTM signatures), the software performs either ssGSEA or PTM-SEA, respectively. The Molecular Signatures Database (MSigDB) [4] provides a large collection of curated gene sets. Gene sets are stored as plain text in GMT format. A current version of the MSigDB gene set collections can be found in the db/msigdb subfolder. MSigDB gene sets are released under the Creative Commons Attribution 4.0 International License. The license terms can be found in the db/msigdb folder.
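
As a rough illustration of the GMT layout mentioned above, the following Python sketch parses GMT text into a dictionary. It assumes the usual tab-delimited layout (set name, description, then members). The function name `parse_gmt` and the two sample sets are made up for illustration and are not part of this repository.

```python
from typing import Dict, List


def parse_gmt(text: str) -> Dict[str, List[str]]:
    """Parse GMT-formatted text into {set name: [members]}.

    Assumes one set per line: name <TAB> description <TAB> member1 <TAB> member2 ...
    """
    gene_sets: Dict[str, List[str]] = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *members = fields
        gene_sets[name] = [m for m in members if m]  # drop empty trailing fields
    return gene_sets


# Hypothetical two-set example (not taken from MSigDB).
example = "SET_A\tdemo set\tTP53\tEGFR\tKRAS\nSET_B\tdemo set\tMYC\tRAD51\n"
gene_sets = parse_gmt(example)
print(gene_sets["SET_A"])
```

The same layout applies to PTM signature files, where the members would be site identifiers rather than gene symbols.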

File formats supported by ssGSEA2.0/PTM-SEA are Gene Cluster Text GCT v1.2 or GCT v1.3 files. Morpheus provides a convenient way to convert your data tables into GCT format.
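
For readers who prefer to script the conversion, here is a minimal Python sketch that serializes a small table as GCT v1.2 text. It assumes the commonly documented v1.2 layout (a `#1.2` line, a dimensions line, a header line starting with `Name` and `Description`, then one line per row). The helper name `to_gct_v12` and the sample values are hypothetical; GCT v1.3 carries additional metadata and is not covered here.

```python
from typing import Dict, List


def to_gct_v12(samples: List[str], rows: Dict[str, List[float]]) -> str:
    """Serialize a small data table as GCT v1.2 text.

    Layout assumed: '#1.2', then '<n rows><TAB><n samples>', then a header
    line starting with Name and Description, then one line per row.
    """
    lines = ["#1.2", f"{len(rows)}\t{len(samples)}"]
    lines.append("\t".join(["Name", "Description"] + samples))
    for rid, values in rows.items():
        if len(values) != len(samples):
            raise ValueError(f"row {rid!r} has {len(values)} values, expected {len(samples)}")
        # The row id is reused as the description for simplicity.
        lines.append("\t".join([rid, rid] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


# Hypothetical gene-centric example with two samples.
gct_text = to_gct_v12(
    ["sample_1", "sample_2"],
    {"TP53": [1.5, -0.3], "EGFR": [0.2, 0.8]},
)
print(gct_text)
```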

For more information about the GSEA method and MSigDB please visit http://software.broadinstitute.org/gsea/.

PTMsigDB v2.0.0

Please check out our new website for PTMsigDB. We have updated PTMsigDB to version v2.0.0, in which we provide better and more consistent annotation of each PTM site. We have also included a disease category comprising signatures associated with certain diseases, curated from the table Disease-associated_sites available at PhosphoSitePlus (PSP) [5].

The PTM signatures database (PTMsigDB) is a collection of modification site-specific signatures of perturbations, kinase activities and signaling pathways curated from more than 2,500 publications. It provides the foundation for performing PTM-SEA. A unique advantage of PTMsigDB over other pathway databases is that each PTM site is annotated with its reported direction of change upon a specific perturbation or signaling event, and this direction is incorporated into the scoring scheme of PTM-SEA. The foundation of PTMsigDB is PhosphoSitePlus (PSP) [5], a comprehensive systems biology resource for PTMs, which provides high-quality curation and annotation of PTMs at the individual residue level. A collection of PTM sites whose levels are collectively regulated in a curated pathway or upon a perturbation is defined as a signature set. Signature sets in PTMsigDB can be separated into different categories: 1) Perturbation signatures derived from treatment of cells with perturbagens such as small molecules or growth factors; 2) Signature sets of molecular signaling pathways; 3) Kinase-substrate signatures; and 4) Disease-associated signature sets.

To ensure a high degree of compatibility with phosphorylation datasets generated by different software packages and searched against different protein sequence databases, PTMsigDB represents the phosphorylation sites in its signatures using three different identifiers: 1) PSP site group ID; 2) UniProt-centric ID; 3) Flanking sequence (Table 1). While the PSP site group ID provides an unambiguous representation of PTM sites within protein families and across species [5], using this type of identifier restricts the analysis to PTM sites present in PSP. We generally recommend using the flanking sequence as site identifier, since it is less affected by updates made to protein sequence databases.
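
To illustrate the flanking-sequence identifier, the Python sketch below cuts a +/-7 amino acid window around a modified residue and appends the `-p` type suffix. The function `flanking_window`, the toy protein sequence, and the underscore used to pad windows that run past a protein terminus are all assumptions made for this example; check which padding convention your identifiers need.

```python
def flanking_window(sequence: str, position: int, flank: int = 7, pad: str = "_") -> str:
    """Return the residue at 1-based `position` with `flank` residues on each side.

    Positions beyond the protein termini are filled with `pad` (an assumption).
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    center = position - 1  # convert to a 0-based index
    window = []
    for i in range(center - flank, center + flank + 1):
        window.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(window)


# Toy sequence (hypothetical, not a real protein); the Y at position 10 is the site.
toy_protein = "MAETRICKIYDSPCLPEGH"
site_id = flanking_window(toy_protein, 10) + "-p"
print(site_id)
```

With this toy input, the window is the 15-residue string centred on the modified tyrosine, which is the shape of the flanking-sequence identifiers used in PTMsigDB.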

Database format | Site accession | Example in PTMsigDB | Example in dataset | Download
UniProt-centric | Uniprot_acc;site-type;direction | Q06609;Y315-p;u | Q06609;Y315-p | human, mouse, rat
Flanking sequence | +/-7aa flanking seq-type;direction | ETRICKIYDSPCLPE-p;u | ETRICKIYDSPCLPE-p | human, mouse, rat
PSP site group id | site_grp_id-type;direction | 448324-p;u | 448324-p | human, mouse, rat

Table 1: PTM site representation in PTMsigDB. The direction of change for a PTM site in a signature is indicated by ;u (up-regulation) or ;d (down-regulation). Please note that the annotation of directionality is a feature of PTMsigDB (column: Example in PTMsigDB) and must not be included when generating compatible site identifiers for a particular dataset (column: Example in dataset).
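
The relationship between the two example columns in Table 1 can be sketched in Python as follows. The helper `split_direction` is hypothetical and only illustrates how the trailing `;u` or `;d` annotation separates from the site identifier that a dataset should carry.

```python
from typing import Optional, Tuple


def split_direction(entry: str) -> Tuple[str, Optional[str]]:
    """Split a PTMsigDB-style entry into (site identifier, direction).

    Entries without a trailing ';u' or ';d' are returned unchanged with
    direction None.
    """
    site, sep, suffix = entry.rpartition(";")
    if sep and suffix in ("u", "d"):
        return site, suffix
    return entry, None


# Examples taken from Table 1.
for entry in ("Q06609;Y315-p;u", "ETRICKIYDSPCLPE-p;u", "448324-p;u", "Q06609;Y315-p"):
    print(entry, "->", split_direction(entry))
```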

PTM-SEA

PTM-Signature Enrichment Analysis (PTM-SEA) is a modified version of ssGSEA that performs site-specific signature analysis by scoring PTMsigDB's bi-directional signature sets. The input to PTM-SEA is a single site-centric data matrix, m, stored in GCT v1.2 or GCT v1.3 format, and the PTM signatures database (PTMsigDB). Each row in m represents a single phosphorylation site confidently localized to a specific amino acid residue, with measured abundances across samples specified in the columns of m. Multiple phosphorylation sites detected on the same peptide have to be converted into separate site-specific entities, one for every site. While some proteomics software packages, such as MaxQuant [6], readily produce single site-centric PTM reports, the use of other software packages might require additional preprocessing steps.
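
The conversion of multiply phosphorylated peptides into site-specific rows could look like the Python sketch below. The input layout is an assumption: row keys that list several sites separated by `|`, with the second site (`Q06609;S320-p`) invented purely for illustration. How to resolve the duplicate site rows that may result (for example, when the same site is seen on several peptides) is not specified here and is left to the reader.

```python
from typing import Dict, List, Tuple


def expand_multisite_rows(
    rows: Dict[str, List[float]], sep: str = "|"
) -> List[Tuple[str, List[float]]]:
    """Turn rows keyed by several sites into one (site_id, values) pair per site.

    `sep` is the assumed delimiter between sites in a row key.
    """
    expanded = []
    for key, values in rows.items():
        for site_id in key.split(sep):
            expanded.append((site_id.strip(), list(values)))  # copy values per site
    return expanded


# Hypothetical peptide-level table with two samples.
peptide_rows = {
    "Q06609;Y315-p|Q06609;S320-p": [1.2, 0.4],
    "Q06609;Y315-p": [0.9, 0.1],
}
site_rows = expand_multisite_rows(peptide_rows)
for site_id, values in site_rows:
    print(site_id, values)
```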

How can I use these tools?

ssGSEA2.0/PTM-SEA can be run on a local PC/Mac in R or RStudio. In addition, ssGSEA2.0/PTM-SEA can be accessed on Broad's public GenePattern [7] server. Below we provide instructions on how to run ssGSEA2.0/PTM-SEA.

Example dataset

We provide an example dataset that can be used to test PTM-SEA. The dataset is based on Supplemental Table 6 in [1].

Single site-centric phosphoproteome dataset

PTMsigDB

GenePattern

GenePattern is a powerful platform to deploy and run software or entire analysis pipelines in a web browser [7]. We have implemented ssGSEA2.0/PTM-SEA as a GenePattern module, which can be accessed at the link below. Please note that access to the public GenePattern server requires a free registration.

PTM-SEA in GenePattern: https://tinyurl.com/PTM-SEA-GP

R-GUI / RStudio

The script ssgsea-gui.R requires little or no knowledge of R or of the command line. Input files and databases can be specified via Windows file dialogs that are invoked automatically. The first dialog lets you choose a folder containing input files in GCT v1.2 or GCT v1.3 format. The script loops over all GCT files in this directory and runs ssGSEA on each file separately. The second dialog lets you choose one or multiple gene set databases in GMT format, such as MSigDB. A current version of the MSigDB databases can be found in the db subfolder.

Windows OS

To run the script, source it into a running R session.

  • RStudio: open the file and press 'Source' in the upper right part of the editor window
  • R-GUI: drag and drop this file into an R-GUI window

macOS

In order to invoke file dialogs as described above, the XQuartz X Window System is required. Once it is installed, ssgsea-gui.R can be sourced into an R session.

R Package

For use in R, Nicole Gay has created an R package that incorporates ssGSEA2.0, along with required dependencies for both R 3.6 and R >= 4.0. Instructions for using the library can be found along with the package on GitHub.

Command line

To integrate ssGSEA2.0/PTM-SEA into your own analysis pipelines, we recommend using the ssgsea-cli.R script, which has been successfully tested on Windows, Mac and Linux. Please see ssgsea-cli.R --help for instructions.

Preprocessing input GCT

Use cases for the preprocessing script preprocess_gct-cli.R:

  • Create a gene-centric GCT with unique gene symbols as rid
  • Create a site-centric GCT with PTMsigDB-compatible site identifiers

PTM-SEA accepts only UniProt and 7AA flanking sequence (SeqWin) format for sites. If your GCT has row IDs (rid) as UniProt, RefSeq or gene symbol, you can use the preprocess_gct-cli.R script to convert to the supported format. Please see preprocess_gct-cli.R --help for instructions.

Misc

ssGSEA2.0/PTM-SEA parameters

Other parameters for ssGSEA/PTM-SEA can be altered inside the parameters section in ssgsea-gui.R or passed as arguments on the command line. The default parameters have been chosen carefully and should provide reliable results for most use-case scenarios.

Changes to the original ssGSEA R-implementation

Original code written by Pablo Tamayo. Adapted with additional modifications by D. R. Mani and Karsten Krug. Adaptations include:

  • support of multiple CPU cores (doParallel R-package)
  • support of GCT v1.3 format using functions from cmapR
  • improved handling of missing values
  • scoring of directional gene sets (PTMsigDB)
  • basic error handling
  • improvements in runtime performance
  • additional output files like rank plots and parameter files
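
To give a feel for what scoring a directional set can mean, here is a minimal Python sketch of a generic weighted running-sum (Kolmogorov-Smirnov-like) enrichment statistic for a single sample. This is not the repository's implementation: the function names, the max-deviation statistic, the `alpha` weighting and the choice to combine the two directions as ES(up) minus ES(down) are all assumptions made for illustration, and the statistic, normalisation and missing-value handling in ssGSEA2.0 may differ.

```python
from typing import Dict, Iterable


def running_sum_es(scores: Dict[str, float], members: Iterable[str], alpha: float = 1.0) -> float:
    """Weighted running-sum enrichment score for one sample.

    Features are ranked by decreasing score; hits step up by |score|**alpha
    (normalised over all hits) and misses step down by 1 / (number of misses).
    Returns the running-sum deviation from zero with the largest magnitude.
    """
    member_set = set(members) & set(scores)
    n_miss = len(scores) - len(member_set)
    if not member_set or n_miss == 0:
        return 0.0
    hit_norm = sum(abs(scores[f]) ** alpha for f in member_set)
    if hit_norm == 0:
        return 0.0
    ranked = sorted(scores, key=scores.get, reverse=True)
    running, best = 0.0, 0.0
    for feature in ranked:
        if feature in member_set:
            running += abs(scores[feature]) ** alpha / hit_norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def directional_es(scores, up_sites, down_sites, alpha=1.0):
    """Combine up- and down-annotated members as ES(up) - ES(down) (assumed)."""
    return running_sum_es(scores, up_sites, alpha) - running_sum_es(scores, down_sites, alpha)


# Hypothetical sample: site "a" is annotated ;u and site "d" is annotated ;d.
sample = {"a": 3.0, "b": 2.0, "c": -1.0, "d": -2.0}
score = directional_es(sample, up_sites=["a"], down_sites=["d"])
print(round(score, 3))
```

In this toy sample the up-annotated site ranks first and the down-annotated site ranks last, so both halves of the signature agree and the combined score is strongly positive.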

License

License Agreement for MSigDB v6.0 and above can be found here.

References

  1. Krug, K., Mertins, P., Zhang, B., Hornbeck, P., Raju, R., Ahmad, R., Szucs, M., Mundt, F., Forestier, D., Jane-Valbuena, J., Keshishian, H., Gillette, M. A., Tamayo, P., Mesirov, J. P., Jaffe, J. D., Carr, S. A., Mani, D. R. (2019). A curated resource for phosphosite-specific signature analysis. Molecular & Cellular Proteomics (in press). http://doi.org/10.1074/mcp.TIR118.000943

  2. Barbie, D. A., Tamayo, P., Boehm, J. S., Kim, S. Y., Susan, E., Dunn, I. F., ... Hahn, W. C. (2010). Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature, 462(7269), 108-112. https://doi.org/10.1038/nature08460

  3. Abazeed, M. E., Adams, D. J., Hurov, K. E., Tamayo, P., Creighton, C. J., Sonkin, D., et al. (2013). Integrative Radiogenomic Profiling of Squamous Cell Lung Cancer. Cancer Research, 73(20), 6289-6298. http://doi.org/10.1158/0008-5472.CAN-13-1616

  4. Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-15550. http://doi.org/10.1073/pnas.0506580102

  5. Hornbeck, P. V., Zhang, B., Murray, B., Kornhauser, J. M., Latham, V., & Skrzypek, E. (2015). PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Research, 43(D1), D512-D520. https://doi.org/10.1093/nar/gku1267

  6. Cox, J., & Mann, M. (2008). MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology, 26(12), 1367-1372. https://doi.org/10.1038/nbt.1511

  7. Reich, M., Liefeld, T., Gould, J., Lerner, J., Tamayo, P., & Mesirov, J. P. (2006). GenePattern 2.0. Nature Genetics, 38(5), 500-501. https://doi.org/10.1038/ng0506-500

