Precision Medicine Knowledge Graph (PrimeKG)

Lab Website | Nature Publication | Harvard Dataverse

TL;DR

Precision Medicine Knowledge Graph (PrimeKG) presents a holistic view of diseases. PrimeKG integrates 20 high-quality biomedical resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales. We accompany PrimeKG’s graph structure with text descriptions of clinical guidelines for drugs and diseases to enable multimodal analyses. Download this csv file to get started!

Updates

  • [July 2023] PrimeKG construction scripts are updated to include primary source data releases up to July 2023. Note that the files published on Harvard Dataverse remain unchanged; however, we provide new scripts and updated links for users who wish to build their own current version of PrimeKG. For more details, please see the relevant section below.
  • [Feb 2023] PrimeKG is published in Nature Scientific Data.
  • [Jun 2022] PrimeKG crosses 5,000 downloads on Harvard Dataverse!
  • [Apr 2022] PrimeKG is live on bioRxiv and Harvard Dataverse!

Unique Features of PrimeKG

  • Diverse coverage of diseases: PrimeKG contains over 17,000 diseases, including rare diseases. Disease nodes in PrimeKG are densely connected to other nodes in the graph and have been optimized for clinical relevance in downstream precision medicine tasks.
  • Heterogeneous knowledge graph: PrimeKG contains over 100,000 nodes distributed over various biological scales as depicted below. PrimeKG also contains over 4 million relationships between these nodes distributed over 29 types of edges.
  • Multimodal integration of clinical knowledge: Disease and drug nodes in PrimeKG are augmented with clinical descriptors from medical authorities such as Mayo Clinic, Orphanet, and DrugBank.
  • Ready-to-use datasets: PrimeKG is minimally dependent on external packages. Our knowledge graph can be retrieved in a ready-to-use format from Harvard Dataverse.
  • Data functions: PrimeKG provides extensive data functions, including processors for primary resources and scripts to build an updated knowledge graph.

[Figure: PrimeKG overview]

[Figure: PrimeKG example]

Environment setup

Using pip

To install the dependencies required to run the PrimeKG code, use pip:

pip install -r requirements.txt

Or use conda

conda env create --name PrimeKG --file=environments.yml

Using PrimeKG

For a quick start in Python, you can download the raw data files in .csv format directly from Harvard Dataverse or load PrimeKG using the following community dataloaders.

Getting started in Python

Download PrimeKG from Harvard Dataverse using the following bash command. You can replace kg.csv with any file path.

wget -O kg.csv https://dataverse.harvard.edu/api/access/datafile/6180620

You can use the following code to load PrimeKG and view every edge that has a disease node at either end.

import pandas as pd
primekg = pd.read_csv('kg.csv', low_memory=False)
primekg.query('y_type=="disease"|x_type=="disease"')
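Beyond the one-line query above, a quick tally of edges per node-type pair gives a feel for how the graph is laid out. The following is a minimal sketch that uses only the Python standard library. The three sample rows are invented for illustration, and only the `x_type` and `y_type` column names come from the query above; the `relation` column and the helper names are assumptions.

```python
import csv
import io
from collections import Counter

# Invented sample rows; the real kg.csv has many more columns and rows.
SAMPLE = """relation,x_type,y_type
indication,drug,disease
disease_protein,disease,gene/protein
protein_protein,gene/protein,gene/protein
"""

def count_type_pairs(handle):
    """Count edges for each (x_type, y_type) pair in a CSV file handle."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[(row["x_type"], row["y_type"])] += 1
    return counts

def disease_edges(handle):
    """Return the rows in which either endpoint is a disease node."""
    return [
        row for row in csv.DictReader(handle)
        if row["x_type"] == "disease" or row["y_type"] == "disease"
    ]

pair_counts = count_type_pairs(io.StringIO(SAMPLE))
n_disease_edges = len(disease_edges(io.StringIO(SAMPLE)))
print(pair_counts)
print(n_disease_edges)
```

To run the same helpers on the downloaded file, pass `open('kg.csv', newline='')` instead of the in-memory sample.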

Dataloader: Therapeutics Data Commons

website | docs

# Shell: install the Therapeutics Data Commons package
pip install PyTDC

# Python: load PrimeKG through TDC
from tdc.resource import PrimeKG
data = PrimeKG(path='./data')
drug_feature = data.get_features(feature_type='drug')
data.to_nx()
data.get_node_list(type='disease')

Dataloader: PyKEEN

website | docs

# Shell: install PyKEEN
pip install pykeen

# Python: check that PyKEEN knows the PrimeKG dataset
import pykeen.datasets
pykeen.datasets.has_dataset('primekg')

Building an updated PrimeKG

Downloading primary data resources

All persistent identifiers and weblinks to download the 20 primary data resources used to build PrimeKG are provided in the Data Records section of our article. The article also lists the exact filenames downloaded from each resource so that they can be cross-checked.

Curating primary data resources

We provide the scripts used to process all primary data resources and the names of the resulting output files generated by those scripts. We would be happy to share the intermediate processing datasets that were used to create PrimeKG on request.

| Database | Processing scripts | Expected script output |
|---|---|---|
| Bgee | bgee.py | anatomy_gene.csv |
| Comparative Toxicogenomics Database | ctd.py | exposure_data.csv |
| DisGeNET | - | curated_gene_disease_associations.tsv |
| DrugBank | drugbank_drug_drug.py | drug_drug.csv |
| DrugBank | parsexml_drugbank.ipynb, Parsed_feature.ipynb | 12 drug feature files |
| DrugBank | drugbank_drug_protein.py | drug_protein.csv |
| Drug Central | drugcentral_queries.txt | drug_disease.csv |
| Drug Central | drugcentral_feature.Rmd | dc_features.csv |
| Entrez Gene | ncbigene.py | protein_go_associations.csv |
| Gene Ontology | go.py | go_terms_info.csv, go_terms_relations.csv |
| Human Phenotype Ontology | hpo.py, hpo_obo_parser.py | hp_terms.csv, hp_parents.csv, hp_references.csv |
| Human Phenotype Ontology | hpoa.py | disease_phenotype_pos.csv, disease_phenotype_neg.csv |
| MONDO | mondo.py, mondo_obo_parser.py | mondo_terms.csv, mondo_parents.csv, mondo_references.csv, mondo_subsets.csv, mondo_definitions.csv |
| Reactome | reactome.py | reactome_ncbi.csv, reactome_terms.csv, reactome_relations.csv |
| SIDER | sider.py | sider.csv |
| UBERON | uberon.py | uberon_terms.csv, uberon_rels.csv, uberon_is_a.csv |
| UMLS | umls.py, map_umls_mondo.py | umls_mondo.csv |
| UMLS | umls.ipynb | umls_def_disorder_2021.csv, umls_def_disease_2021.csv |
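Before harmonizing, it can help to confirm that the processing scripts produced the files listed in the table above. The sketch below checks a directory for a few of those filenames. It assumes, purely for illustration, that all outputs sit in one flat directory, which may not match the repository layout; `missing_outputs` and `EXPECTED_OUTPUTS` are hypothetical names, not part of the PrimeKG code.

```python
from pathlib import Path
import tempfile

# A subset of the expected outputs listed in the table above.
EXPECTED_OUTPUTS = {
    "Bgee": ["anatomy_gene.csv"],
    "Gene Ontology": ["go_terms_info.csv", "go_terms_relations.csv"],
    "SIDER": ["sider.csv"],
}

def missing_outputs(output_dir, expected=EXPECTED_OUTPUTS):
    """Map each database to the expected files not found under output_dir."""
    root = Path(output_dir)
    missing = {}
    for database, filenames in expected.items():
        absent = [name for name in filenames if not (root / name).is_file()]
        if absent:
            missing[database] = absent
    return missing

# Demonstration in a throwaway directory in which only two files exist.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "anatomy_gene.csv").touch()
    (Path(tmp) / "go_terms_info.csv").touch()
    report = missing_outputs(tmp)

print(report)
```

An empty dictionary means every listed file was found; extend `EXPECTED_OUTPUTS` with the remaining rows of the table as needed.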

Harmonizing datasets into PrimeKG

The code to harmonize datasets and construct PrimeKG is available at build_graph.ipynb. Run this Jupyter notebook to construct the knowledge graph from the outputs of the processing scripts listed above. The notebook produces all three versions of PrimeKG: kg_raw.csv, kg_giant.csv, and the complete version, kg.csv.
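The notebook's individual steps are not described in this README. The file name kg_giant.csv suggests that one stage keeps only the largest connected component of the raw graph; assuming that reading, the idea can be sketched with a breadth-first search over an undirected edge list. This is an illustrative sketch on a made-up toy graph, not the code in build_graph.ipynb, and `largest_component` is a hypothetical helper.

```python
from collections import defaultdict, deque

def largest_component(edges):
    """Return the node set of the largest connected component (undirected)."""
    adjacency = defaultdict(set)
    for x, y in edges:
        adjacency[x].add(y)
        adjacency[y].add(x)

    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Made-up toy edge list: one 3-node component and one 2-node component.
toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]
giant = largest_component(toy_edges)
giant_edges = [(x, y) for x, y in toy_edges if x in giant and y in giant]
print(sorted(giant), giant_edges)
```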

Feature extraction

The code required to engineer features can be found at engineer_features.ipynb and mapping_mayo.ipynb.

July 2023 update

In July 2023, this repository was updated so that PrimeKG can be rebuilt with database releases up to July 2023. Note that the files published on Harvard Dataverse remain unchanged; however, we provide new scripts and updated links for users who wish to build their own current version of PrimeKG. For more details, see this pull request.

17 scripts in datasets/processing_scripts/ were re-run or updated to build a new version of PrimeKG; the scripts in datasets/feature_construction/ may remain out of date. The re-run or updated primary data sources are Bgee, Comparative Toxicogenomics Database, DisGeNET, DrugBank, DrugCentral, NCBI Gene, Gene Ontology, Human Phenotype Ontology, MONDO, Reactome, SIDER, UBERON, and UMLS.

For more information, see datasets/primary_data_resources.sh. Changes include the following:

General

Created script to automatically create directory structure, pull data, and run all necessary processing and feature extraction steps.

  • Fixed broken environment construction script.
  • Script automatically creates required directories.
  • Added commands to retrieve gene names, details, and NCBI ID to UniProt ID mapping from www.genenames.org, then output to vocab/gene_names.csv and vocab/gene_map.csv.

Bgee

  • 58,405 of 5,257,181 gold quality calls with expression rank < 25000 now specify a cell type in a particular tissue (e.g., UBERON:0000473 ∩ CL:0000089, which denotes germ line stem cell in testis).
  • These rows are dropped in bgee.py.
  • URL updated to here.
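The Bgee filtering described above can be illustrated as follows. This is a minimal sketch, not the code in bgee.py: the record fields (`anatomy_id`, `quality`, `rank`) and the sample values are hypothetical, and only the rank cutoff, the "gold quality" label, and the "∩" marker for cell-type-in-tissue entries come from the notes above.

```python
# Hypothetical records; field names are illustrative, not Bgee's real headers.
calls = [
    {"anatomy_id": "UBERON:0000473", "quality": "gold quality", "rank": 1200.0},
    {"anatomy_id": "UBERON:0000473 ∩ CL:0000089", "quality": "gold quality", "rank": 800.0},
    {"anatomy_id": "UBERON:0000473", "quality": "silver quality", "rank": 500.0},
    {"anatomy_id": "UBERON:0000473", "quality": "gold quality", "rank": 30000.0},
]

MAX_RANK = 25000  # expression rank cutoff quoted in the notes above

def keep_call(call):
    """Keep gold-quality calls under the cutoff that name a single anatomical term."""
    return (
        call["quality"] == "gold quality"
        and call["rank"] < MAX_RANK
        and "∩" not in call["anatomy_id"]  # drop cell-type-in-tissue rows
    )

kept = [call for call in calls if keep_call(call)]
print(len(kept), kept[0]["anatomy_id"])
```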

Comparative Toxicogenomics Database

  • URL updated to here.

DisGeNET

  • No changes needed.

DrugBank

  • Fixed paths in parsexml_drugbank.py. Output to new /parsed subdirectory. Removed extraneous lines in Parsed_feature.ipynb.
  • ✅ Successfully ran drugbank_drug_drug.py and drugbank_drug_protein.py.
  • ⚠️ parsexml_drugbank.py and Parsed_feature.ipynb may need updates.

DrugCentral

  • Modified drugcentral_queries.txt to work on O2, the Harvard Medical School high-performance computing cluster.
  • ⚠️ drugcentral_feature.Rmd may need updates.

NCBI Gene

  • No changes needed.

Gene Ontology

  • Used -L flag to follow redirects. No other changes needed.

Human Phenotype Ontology

  • Used -L flag to follow redirects. No other changes needed to hpo.py.
  • Updated hpoa.py to replace old column names with new column names.

MONDO

  • Added check for NoneType values in external references (line 29).
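A guard of the kind described above might look like the sketch below. It is not the code in mondo.py or mondo_obo_parser.py; `parse_xrefs` is a hypothetical helper, the "PREFIX:ID" cross-reference format is an assumption, and the identifiers in the example are made up.

```python
def parse_xrefs(xrefs):
    """Split 'PREFIX:ID' cross-references, skipping missing (None) values."""
    if xrefs is None:
        return []
    pairs = []
    for ref in xrefs:
        # Skip entries that are missing or do not follow the PREFIX:ID shape.
        if ref is None or ":" not in ref:
            continue
        prefix, identifier = ref.split(":", 1)
        pairs.append((prefix, identifier))
    return pairs

# Made-up identifiers, including a None entry that would otherwise raise an error.
example = ["UMLS:C0000001", None, "OMIM:100000"]
parsed = parse_xrefs(example)
print(parsed)
```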

Reactome

  • No changes needed.

SIDER

  • No changes needed.

UBERON

  • Checked for NA values, dropped two obsolete terms (UBERON:0039300 and UBERON:0039302) not marked as obsolete in the source file.
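The UBERON cleanup above amounts to two filters: remove rows with missing values, then remove the two term IDs named above. A minimal sketch follows; the `id` and `name` fields and the sample terms are hypothetical, and the assumption that the IDs appear with the "UBERON:" prefix in the processed data may not hold for uberon.py.

```python
# The two obsolete terms named in the note above.
OBSOLETE_IDS = {"UBERON:0039300", "UBERON:0039302"}

def clean_terms(terms):
    """Drop terms with a missing id or name, and the known obsolete ids."""
    return [
        term for term in terms
        if term.get("id") and term.get("name") and term["id"] not in OBSOLETE_IDS
    ]

# Hypothetical sample terms; names are placeholders.
sample_terms = [
    {"id": "UBERON:0000473", "name": "testis"},
    {"id": "UBERON:0039300", "name": "placeholder obsolete term"},
    {"id": "UBERON:0000001", "name": None},
]
cleaned = clean_terms(sample_terms)
print(cleaned)
```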

UMLS

  • UMLS data pulled and paths updated for 2023 data.
  • ⚠️ umls.ipynb may need updates.

Cite Us

If you find PrimeKG useful, cite our work:

@article{chandak2022building,
  title={Building a knowledge graph to enable precision medicine},
  author={Chandak, Payal and Huang, Kexin and Zitnik, Marinka},
  journal={Nature Scientific Data},
  doi={https://doi.org/10.1038/s41597-023-01960-3},
  URL={https://www.nature.com/articles/s41597-023-01960-3},
  year={2023}
}

Data Server

PrimeKG is hosted on Harvard Dataverse with the following persistent identifier: https://doi.org/10.7910/DVN/IXA7BM. When Dataverse is under maintenance, PrimeKG datasets cannot be retrieved. This happens rarely; please check the status on the Dataverse website.

License

The PrimeKG codebase is released under the MIT license. For individual dataset usage, please refer to the dataset license found on the website.
