
Scripts to analyze TCGA data

Scripts to extract TCGA data for survival analysis.

awesome-TCGA - Curated list of TCGA resources. For more cancer-related notes, see my Cancer_notes

Data description

Scripts are being transitioned to use the curatedTCGAData and TCGAutils packages. See also cBioPortalData R interface to TCGA and the cBioPortal API.

Paper: Ramos, Marcel, Ludwig Geistlinger, Sehyun Oh, Lucas Schiffer, Rimsha Azhar, Hanish Kodali, Ino de Bruijn et al. "Multiomic Integration of Public Oncology Databases in Bioconductor", JCO Clinical Cancer Informatics 1 (2020), https://doi.org/10.1200/cci.19.00119

Data preparation

First, download the data locally using the misc/TCGA_preprocessing.R script.

  • Create a folder on a local computer
  • Set the data_dir variable in the script to the path of that folder, where the downloaded data will be stored
  • Run the file line-by-line, or source it
  • By default, RNA-seq data for all cancers will be downloaded and saved as *.rda files
  • In all other scripts, change the data_dir variable to the path where the downloaded data is stored
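
The setup steps above can be sketched as follows. This is a minimal illustration, assuming a hypothetical folder location; the download itself is done by misc/TCGA_preprocessing.R, which is only referenced in a comment here because it fetches data for all cancers.

```r
# Hypothetical location for the downloaded data; replace with your own path.
# The same path must be assigned to data_dir inside each script.
data_dir <- file.path(tempdir(), "TCGA_data")

# Create the folder if it does not exist yet
if (!dir.exists(data_dir)) {
  dir.create(data_dir, recursive = TRUE)
}

# From the repository root, run the preprocessing script after editing
# its data_dir variable (commented out: it downloads a lot of data):
# source("misc/TCGA_preprocessing.R")

# After the download, the saved *.rda files can be listed like this
rda_files <- list.files(data_dir, pattern = "\\.rda$", full.names = TRUE)
print(rda_files)
```

Before the preprocessing script has been run, the listing is empty; afterwards it should contain the saved *.rda files.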

Analysis examples

Analysis scripts

  • In all scripts, set the data_dir variable to the path where the downloaded data is stored

  • survival.Rmd - a pipeline to run survival analyses for all cancers. Set cancer = "BRCA" and selected_genes = "IGFBP3" to the desired cancer and gene IDs. Use the same IDs in TCGA_summary.Rmd, which summarizes the output into the Survival analysis summary. Note that if subcategories_in_all_cancers <- TRUE, survival analysis is run for all subcategories in all cancers, which is time-consuming.

    • Analysis 1 - Selected genes, selected cancers, no clinical annotations. Results are in <selected_genes>.<cancer>.Analysis1 folder.
    • Exploratory - All genes, selected cancers, no clinical annotations. Not run by default.
    • Analysis 2 - Selected genes, all (or selected) cancers, no clinical annotations. Results are in <selected_genes>.<cancer>.Analysis2 folder.
    • Analysis 3 - Selected genes, all (or selected) cancers, all unique clinical (sub)groups. Results are in <selected_genes>.<cancer>.Analysis3 folder. Open the file global_stats.txt in Excel, sort by p-value (log-rank test), and explore in which clinical (sub)groups expression of the selected gene affects survival the most.
    • Analysis 4 - Selected genes, selected cancers, all combinations of clinical annotations. Not run by default.
    • Analysis 5 - Clinical-centric analysis. Selected cancer, selected clinical subcategory, survival difference between all pairs of subcategories. Only run for BRCA and OV cancers. Results are in <selected_genes>.<cancer>.Analysis5 folder.
    • Analysis 6 - Dimensionality reduction of a gene signature across all cancers using NMF, PCA, or FA. For each cancer, extracts gene expression of a signature, reduces its dimensionality, plots a heatmap sorted by the first component and biplots, and saves eigenvectors in files named after the cancer, signature, and method. They are used in correlations.Rmd. Not run by default.
  • survival_Neuroblastoma.Rmd - survival analysis for Neuroblastoma samples from the TARGET database. Prepare the data with misc/cgdsr_preprocessing.R; see the Methods section for data description.

  • TCGA_summary.Rmd - summary report for the survival.Rmd output. Shows in which cancers and clinical subgroups expression of the selected gene affects survival the most. Change cancer = "BRCA" and selected_genes = "IGFBP3" to the desired cancer and gene IDs. Uses results from <selected_genes>.<cancer>.Analysis* folders. Survival analysis summary

  • TCGA_CNV.Rmd - Separate samples based on copy number variation of one or several genes, do survival and differential expression analysis on the two groups, and KEGG enrichment. An ad hoc analysis, requires manual intervention.

  • TCGA_stemness.Rmd - correlation of a selected gene with stemness indices. For details, see Malta, Tathiane M., Artem Sokolov, Andrew J. Gentles, Tomasz Burzykowski, Laila Poisson, John N. Weinstein, Bożena Kamińska, et al. “Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation.” Cell 173, no. 2 (April 2018): 338-354.e15. https://doi.org/10.1016/j.cell.2018.03.034. Results example PDF

  • TCGA_expression.Rmd - Expression of selected genes across all TCGA cancers. Used for comparing expression of two or more genes. Change selected_genes <- "XXXX", can be multiple. Generates a PDF file with a barplot of log2-expression of selected genes across all cancers, with standard errors. Example

  • TCGA_correlations.Rmd - Co-expression analysis of a selected gene vs. all others, in selected cancers. Genes best correlating with the selected gene may share common functions, described in the KEGG canonical pathway analysis section. Gene counts are converted to TPM. Multiple cancers can be analyzed together, with ComBat batch correction for the cohort effect. Change the selected_genes <- "XXXX" and cancer <- "YYYY" variables. The run saves two RData objects, data/Expression_YYYY.Rda and data/Correlation_XXXX_YYYY.Rda, which speeds up re-runs with the same settings. The full output is saved in results/Results_XXXX_YYYY.xlsx. Example PDF, Example Excel

  • TCGA_correlations_BRCA.Rmd - Co-expression analysis of selected gene vs. all others, in BRCA stratified by PAM50 annotations. The full output is saved in results/Results_XXXX_BRCA_PAM50.xlsx.

  • correlations_one_vs_one.Rmd - Co-expression analysis of two genes across all cancers. The knitted HTML contains a table with correlation coefficients and p-values.

  • TCGA_DEGs.Rmd - differential expression analysis of TCGA cohorts separated into groups with high/low expression of selected genes. The results are similar to the correlation results: most of the differentially expressed genes are also best correlated with the selected genes. This analysis explicitly looks at the extremes of the selected gene expression and identifies KEGG pathways that may be affected. Change selected_genes = "XXXX" and cancer = "YYYY". Manually run through line 254 to see which KEGG pathways are enriched. Then, run the code chunk on line 379 to generate a picture of the selected KEGG pathway (Example), and adjust the ![](hsa0YYYY.XXXX.png) accordingly. Then, recompile the whole document. Example PDF, Example Excel

  • TCGA_DEGs_clin_subcategories.Rmd - differential expression analysis between pairs of clinical subgroups within a clinical category, e.g., "black or african american" vs. "white" subgroups within the "race" category. Output is saved in one Excel file, CANCERTYPE_DEGs_clin_subcategories.xlsx, with pairs of worksheets, one containing DEGs and the other containing enrichment results. Data tables have headers describing individual comparisons and results.

  • PPI_Networks.Rmd - experimenting with extracting and visualizing data from different PPI databases, for a selected gene.

  • Supplemental_R_script_1.R - a modified script to run gene-specific or global survival analysis, from http://kmplot.com, Source

  • TCPA_correlation.Rmd - experimenting with TCPA data.
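
The idea shared by the survival scripts above, comparing survival between samples with high and low expression of a selected gene and reporting a log-rank p-value, can be sketched as follows. This is an illustration on simulated data using the survival package, not the code from survival.Rmd; the median cutoff, the variable names, and all simulated values are assumptions.

```r
library(survival)

set.seed(1)
n <- 200
# Simulated log2 expression of a selected gene (hypothetical values)
expr <- rnorm(n, mean = 8, sd = 1.5)
# Simulated follow-up time: higher expression shortens survival in this toy example
time <- rexp(n, rate = 0.05 * exp(0.4 * (expr - 8)))
# Event indicator: 1 = death observed, 0 = censored
status <- rbinom(n, size = 1, prob = 0.7)

# Split samples into low/high expression groups at the median (an assumed cutoff)
group <- factor(ifelse(expr > median(expr), "high", "low"),
                levels = c("low", "high"))

# Log-rank test for a difference in survival between the two groups
lr <- survdiff(Surv(time, status) ~ group)
p_logrank <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)

# Cox proportional hazards model: hazard ratio of high vs. low expression
cox <- coxph(Surv(time, status) ~ group)
hr <- unname(exp(coef(cox)))

# Kaplan-Meier curves for the two groups (low = blue, high = red)
km <- survfit(Surv(time, status) ~ group)
plot(km, col = c("blue", "red"), xlab = "Time", ylab = "Survival probability")
legend("topright", legend = levels(group), col = c("blue", "red"), lty = 1)

cat("Log-rank p-value:", signif(p_logrank, 3),
    " Hazard ratio (high vs. low):", round(hr, 2), "\n")
```

With real data, expr, time, and status would come from the downloaded expression and clinical tables, and repeating the test within each clinical (sub)group would yield a table of p-values to sort, as described for Analysis 3.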

Legacy analyses

Misc scripts

misc folder

TCGA data

data.TCGA folder. Some data are absent from the repository because of their large size; download them through the links.

  • BRCA_with_TP53_mutation.tsv - 355 TCGA samples with TP53 mutations, Source

  • CCLE_Cell_lines_annotations_20181226.txt - CCLE cell line annotations, from https://portals.broadinstitute.org/ccle/data

  • CCR-13-0583tab1.xlsx - TNBCtype predictions for 163 primary tumors in TCGA considered to be TNBC, classification into six TNBC subtypes. See http://cbc.mc.vanderbilt.edu/tnbc/index.php for details. "UNC" - unclassified. Supplementary table 1 from Mayer, Ingrid A., Vandana G. Abramson, Brian D. Lehmann, and Jennifer A. Pietenpol. “New Strategies for Triple-Negative Breast Cancer--Deciphering the Heterogeneity.” Clinical Cancer Research: An Official Journal of the American Association for Cancer Research 20, no. 4 (February 15, 2014): 782–90. doi:10.1158/1078-0432.CCR-13-0583.

  • Immune_resistant_program.xlsx - A gene expression program associated with T cell exclusion and immune evasion. Supplementary Table S4 - genes associated with the immune resistance program, described in Methods. Jerby-Arnon, Livnat, Parin Shah, Michael S. Cuoco, Christopher Rodman, Mei-Ju Su, Johannes C. Melms, Rachel Leeson, et al. “A Cancer Cell Program Promotes T Cell Exclusion and Resistance to Checkpoint Blockade.” Cell 175, no. 4 (November 2018): 984-997.e24. https://doi.org/10.1016/j.cell.2018.09.006.

  • Lehmann_2019_Data1_BRCA_subtypes.xlsx - Subtype annotation, ER, PR and HER2 calls for TCGA, CPTAC, METABRIC, MET500 samples. From Lehmann et al., 2019.

  • Lehmann_2019_Data2_TNBCsubtype.xlsx - TNBCsubtype clinical information and cell type, mutational, immune signatures, for each TNBCtype subtype. From Lehmann et al., 2019.

  • gene_signatures_323.xls - 323 gene signatures from Fan, Cheng, Aleix Prat, Joel S. Parker, Yufeng Liu, Lisa A. Carey, Melissa A. Troester, and Charles M. Perou. “Building Prognostic Models for Breast Cancer Patients Using Clinical Variables and Hundreds of Gene Expression Signatures.” BMC Medical Genomics 4 (January 9, 2011): 3. https://doi.org/10.1186/1755-8794-4-3.

  • PAM50_classification.txt - sample classification into PAM50 types

  • patientsAll.tsv - TCGA sample clinical information, including PAM50, from https://tcia.at/home

  • TCGA_489_UE.k4.txt - Ovarian cancer classification into four subtypes, from https://github.com/aedin/OvarianCancerSubtypes/data/23257362

  • TCGA_Ancestry.xlsx - Admixture and Ethnicity Calls of all TCGA samples. Table S1 from Carrot-Zhang, Jian, Nyasha Chambwe, Jeffrey S. Damrauer, Theo A. Knijnenburg, A. Gordon Robertson, Christina Yau, Wanding Zhou, et al. “Comprehensive Analysis of Genetic Ancestry and Its Molecular Correlates in Cancer.” Cancer Cell 37, no. 5 (May 2020): 639-654.e6. https://doi.org/10.1016/j.ccell.2020.04.012.

  • TCGA_cancer_counts.csv - number of samples per cancer. Created by misc/TCGA_preprocessing.R

  • TCGA_cancers.xlsx - TCGA cancer abbreviations, from http://www.liuzlab.org/TCGA2STAT/CancerDataChecklist.pdf

  • TCGA_genes.txt - genes measured in TCGA RNA-seq experiments

  • TCGA_immune.xlsx - PanImmune Feature Matrix of Immune Characteristics. Table S1 from the Supplementary Material of Thorsson, Vésteinn, David L. Gibbs, Scott D. Brown, Denise Wolf, Dante S. Bortone, Tai-Hsien Ou Yang, Eduard Porta-Pardo, et al. “The Immune Landscape of Cancer.” Immunity, April 2018. The paper describes TCGA immune signatures and six immune subtypes, based on manually compiled immune gene lists (references in the text). Table S1 contains the classification of each TCGA sample. M1 macrophage and lymphocyte expression signatures are in general associated with improved overall survival (OS).

  • TCGA_isoforms.xlsx - Isoform switching analysis of TCGA data, tumor vs. normal, with consequences and survival prediction, using the IsoformSwitchAnalyzeR R package. Supplementary Table 1 - genes and isoforms differentially expressed in all cancers. From Vitting-Seerup, Kristoffer, and Albin Sandelin. “The Landscape of Isoform Switches in Human Cancers.” Molecular Cancer Research 15, no. 9 (September 2017): 1206–20. https://doi.org/10.1158/1541-7786.MCR-16-0459.

  • TCGA_purity.xlsx - Tumor purity estimates for TCGA samples. Tumor purity estimates according to four methods and the consensus method for all TCGA samples with available data. https://www.nature.com/articles/ncomms9971#supplementary-information. Supplementary Data 1 from Aran, Dvir, Marina Sirota, and Atul J. Butte. “Systematic Pan-Cancer Analysis of Tumour Purity.” Nature Communications 6, no. 1 (December 2015). https://doi.org/10.1038/ncomms9971.

  • TCGA_sample_types.xlsx - Cancer types and subtypes for all TCGA samples. Includes BRCA subtypes, and subtyping of other cancers, where applicable. PMID: 29625050. Source

  • TCGA_stemness.xlsx - Supplementary Table 1 - stemness indices for all TCGA samples. Stemness indices are built from various data: mRNAsi - gene expression-based, EREG-miRNAsi - epigenomic- and gene expression-based, mDNAsi, EREG-mDNAsi - same but methylation-based, DMPsi - differentially methylated probes-based, ENHsi - enhancer-based. Each stemness index (si) ranges from low (zero) to high (one) stemness. From Malta, Tathiane M., Artem Sokolov, Andrew J. Gentles, Tomasz Burzykowski, Laila Poisson, John N. Weinstein, Bożena Kamińska, et al. “Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation.” Cell 173, no. 2 (April 2018): 338-354.e15. https://doi.org/10.1016/j.cell.2018.03.034.

  • TCGA.bib - BibTeX of TCGA-related references

  • TCPA_proteins.txt - List of 224 proteins profiled by RPPA technology. The Cancer Proteome Atlas, http://tcpaportal.org/tcpa/. Data download: http://tcpaportal.org/tcpa/download.html. Paper: http://cancerres.aacrjournals.org/content/77/21/e51

  • XENA_classification.csv - PAM50 and other clinical data, Source
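
Several of the annotation tables above are keyed by TCGA sample or patient identifiers. A minimal sketch of joining such a table to expression values and correlating the two is shown below. The barcodes, values, and column names are made up for illustration and are not taken from the files above; the only assumption about real data is that the first 12 characters of a TCGA barcode identify the patient.

```r
# Hypothetical expression table keyed by TCGA-style sample barcodes
expr <- data.frame(
  barcode = c("TCGA-XX-0001-01A", "TCGA-XX-0002-01A", "TCGA-XX-0003-01A",
              "TCGA-XX-0004-01A", "TCGA-XX-0005-01A"),
  gene_log2 = c(7.2, 9.1, 8.4, 10.3, 6.8)
)

# Hypothetical annotation table keyed by patient ID,
# e.g. a stemness index that ranges from zero to one
annot <- data.frame(
  patient = c("TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0003", "TCGA-XX-0004"),
  mRNAsi = c(0.21, 0.40, 0.55, 0.73)
)

# Derive the patient ID from the first 12 characters of the barcode
expr$patient <- substr(expr$barcode, 1, 12)

# Keep only samples present in both tables
merged <- merge(expr, annot, by = "patient")

# Rank-based correlation between expression and the annotation
res <- cor.test(merged$gene_log2, merged$mRNAsi, method = "spearman")
rho <- unname(res$estimate)

print(merged)
cat("Spearman rho:", rho, "\n")
```

Samples without a matching annotation (the fifth barcode here) are dropped by merge(); passing all.x = TRUE would keep them with missing values instead.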

OvarianCancerSubtypes

Sample annotations by ovarian cancer subtypes. https://github.com/aedin/OvarianCancerSubtypes

ProteinAtlas

Uhlen, Mathias, Cheng Zhang, Sunjae Lee, Evelina Sjöstedt, Linn Fagerberg, Gholamreza Bidkhori, Rui Benfeitas, et al. “A Pathology Atlas of the Human Cancer Transcriptome.” Science (New York, N.Y.) 357, no. 6352 (August 18, 2017). doi:10.1126/science.aan2507. http://science.sciencemag.org/content/357/6352/eaan2507

Supplementary material http://science.sciencemag.org/content/suppl/2017/08/16/357.6352.eaan2507.DC1

  • Table S2 - summary of tissue specific expression for each gene, in normal and cancer tissues.
  • Table S6 - summary of survival prognostic value, with a simple "favorable/unfavorable" label for each gene. Each worksheet corresponds to a different cancer.
  • Table S8 - per-gene summary, in which cancers it is prognostic of survival.

brca_mbcproject_wagle_2017

https://www.mbcproject.org/

The Metastatic Breast Cancer Project is a patient-driven initiative. This study includes genomic data, patient-reported data (pre-pended as PRD), medical record data (MedR), and pathology report data (PATH). All of the titles and descriptive text for the clinical data elements have been finalized in partnership with numerous patients in the project. As these data were generated in a research, not a clinical, laboratory, they are for research purposes only and cannot be used to inform clinical decision-making. All annotations have been de-identified. More information is available at www.mbcproject.org.

Data download: http://www.cbioportal.org/study?id=brca_mbcproject_wagle_2017#summary. Data includes 78 patients, 103 samples, sample-specific clinical annotations, putative copy-number from GISTIC, and MutSig regions.

TCGA_Ovarian

  • Gene expression, methylation, miRNA expression matrices, from Zhang, Shihua, Chun-Chi Liu, Wenyuan Li, Hui Shen, Peter W. Laird, and Xianghong Jasmine Zhou. “Discovery of Multi-Dimensional Modules by Integrative Analysis of Cancer Genomic Data.” Nucleic Acids Research 40, no. 19 (October 2012): 9379–91. https://doi.org/10.1093/nar/gks725. The paper describes an integrative analysis of gene expression, methylation, and miRNA expression using NMF, implemented in Matlab. Supplementary material from https://academic.oup.com/nar/article/40/19/9379/2414808#supplementary-data.
