A Curated List of Computational Biology Datasets Suitable for Machine Learning

This is a curated list of computational biology datasets that have been pre-processed for machine learning. This list is a work in progress; please submit a pull request for any dataset you would like to advertise!

Genotyping

| Name | Description | Comments |
| --- | --- | --- |
| The Cancer Genome Atlas | Variety of Cancer Data | most cancer types have 100-1000 samples |
| NIH GDC | Cancer, many types of genomic data | |
| UK Biobank | | |
| European Genome-Phenome Archive | | |
| METABRIC | The genomic profiles (somatic mutations [targeted sequencing], copy number alterations, and gene expression) of 2509 breast cancers. | |
| HapMap | | |
| 23andMe | 2280 Public Domain Curated Genotypes | |
| Mice | SNPs, 2000+ samples | 4 generations. It might be possible to learn a family structure out of the data. |
| Arabidopsis | SNPs, 100+ phenotypes | |
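
The comment on the mouse dataset suggests that family structure might be learnable from SNPs. One common starting point is pairwise identity-by-state (IBS) similarity, sketched below. This is only an illustration: it assumes genotypes coded as minor-allele counts (0, 1, 2), and the sample names and genotype values are hypothetical toy data, not taken from any dataset in this list.

```python
from itertools import combinations


def ibs_similarity(a, b):
    """Mean identity-by-state between two genotype vectors coded 0/1/2."""
    if len(a) != len(b) or not a:
        raise ValueError("genotype vectors must be non-empty and equal length")
    # At each SNP, two individuals share 2 - |a_i - b_i| alleles (0, 1, or 2).
    shared = sum(2 - abs(x - y) for x, y in zip(a, b))
    return shared / (2 * len(a))


# Hypothetical toy genotypes (minor-allele counts per SNP).
genotypes = {
    "parent": [0, 1, 2, 1, 0, 2, 1, 0],
    "child": [0, 1, 1, 1, 0, 2, 1, 1],
    "unrelated": [2, 0, 0, 2, 2, 0, 1, 2],
}

# Score every pair of samples; higher IBS suggests closer relatedness.
pairs = {
    (i, j): ibs_similarity(genotypes[i], genotypes[j])
    for i, j in combinations(genotypes, 2)
}

for (i, j), score in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(f"{i:>9} - {j:<9} IBS = {score:.3f}")
```

On real data, the resulting similarity matrix would typically feed a clustering or pedigree-reconstruction step; raw IBS alone does not separate relatedness from shared population background.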

Promoter-Enhancer Pairs

| Name | Description | Comments |
| --- | --- | --- |
| TargetFinder | ~100,000 DNA-DNA interaction pairs | |

Gene/Protein Expression

| Name | Description | Comments |
| --- | --- | --- |
| GEO | Main place for NCBI data | |
| ENCODE | Variety of assays to identify functional elements | |
| ArrayExpress | DNA sequencing, gene/protein expression, epigenetics | |
| Cytometry | Continuous flow cytometry data of 11 proteins and phospholipids; discretized and cleaned data available offline | Classical benchmark dataset for learning graphical models; contains known errors |
| Transcription factor binding | ChIP-Seq data on 12 TFs | |
| GTEx | Landmark study for eQTL analysis | |
| PharmacoGenomics DB | | |
| ProteomeXchange | | |
| BeatAML | Whole-exome sequencing, RNA sequencing, and analyses of ex vivo drug sensitivity | 672 tumour specimens collected from 562 patients |

Single-cell Data

| Name | Description | Comments |
| --- | --- | --- |
| Single-cell expression atlas | | |

Regulatory Networks

| Name | Description | Comments |
| --- | --- | --- |
| TRRUST | Manually curated database of human transcriptional regulatory networks | |
| Yeast Network | 23 million yeast two-hybrid experiments to investigate genetic interactions | |
| Perturb-Seq | Integrated model of perturbations, single-cell phenotypes, and epistatic interactions | |
| KEGG Metabolic Regulatory Network (Undirected) | 65554 instances, 29 attributes each | |
| KEGG Metabolic Regulatory Network (Directed) | 53414 instances, 24 attributes each | |

Images

| Name | Description | Comments |
| --- | --- | --- |
| The Cancer Imaging Archive | Extracts the images from the TCGA data | |
| Multiple Myeloma DREAM Challenge | Challenge to identify multiple myeloma patients | |
| Breast Cancer Wisconsin (Diagnostic) Data Set | Predict whether the cancer is benign or malignant | |
| DDSM Mammogram Database | | |
| Kaggle Soft Tissue Sarcomas | Preprocessed subset of the TCIA study "Soft Tissue Sarcoma" | Segmentation task |
| Kaggle Cervical Cancer Screening | Classify cervix type from images | |
| CAMELYON17 | Pathology challenge: automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections | |
| Grand Challenges | Datasets from biomedical image analysis competitions | |
| Breast Cancer MRI Dataset | Demographic, clinical, pathology, treatment, outcomes, and genomic data, plus MRI images | |

fMRI

| Name | Description | Comments |
| --- | --- | --- |
| ENIGMA Cerebellum | Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction | |
| Seizure Prediction | Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure). | |

Electronic Medical Records

| Name | Description | Comments |
| --- | --- | --- |
| MIMIC | 59,000 EHRs | |
| UCI Diabetes | Data from 130 US hospitals, 1999-2008 | |
| i2b2 | Clinical notes only, designed for NLP tasks | |
| PhysioNet | | |
| Metadata Acquired from Clinical Case Reports (MACCRs) | 3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases | |
| eICU | 200k EHRs | |
| All of Us | >250k EHRs, some genomic data | |

Radiographs

| Name | Description | Comments |
| --- | --- | --- |
| CheXpert | 200k chest radiographs | Competition and leaderboard associated |
| MIMIC-CXR | ~400k chest x-rays, 14 labels | Data on PhysioNet |
| PadChest | 160k chest x-rays, 174 different findings | |

Protein-Protein Interactions

| Name | Description | Comments |
| --- | --- | --- |
| HINT (High-quality INTeractomes) | Curated compilation of high-quality protein-protein interactions from 8 interactome resources | |

Longitudinal Studies

| Name | Description | Comments |
| --- | --- | --- |
| National Population Health Survey | Longitudinal survey that collects health information every two years. | |

Protein Structure

| Name | Description | Comments |
| --- | --- | --- |
| ProteinNet | Standardized dataset for learning protein structure. | Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits. |

Natural Language Data

| Name | Description | Comments |
| --- | --- | --- |
| BioASQ | Abstracts of medical articles (from PubMed); ontologies of medical concepts. | Tasks: multi-label classification (MLC), question answering (QA). |
| Cases | Articles from medical case studies. | |
| UPMC Pathology | UPMC Pathology case studies. | |

Therapeutics

| Name | Description | Comments |
| --- | --- | --- |
| Therapeutic Data Commons | Many preprocessed datasets for therapeutic discovery, including target discovery, activity modeling, efficacy and safety, and manufacturing. | Available as Python modules. |
