• Stars: 139
• Rank: 262,954 (Top 6%)
• Language: JavaScript
• License: Other
• Created: almost 8 years ago
• Updated: about 2 months ago


Repository Details

Machine learning for genomic variants

Variant Spark


variant-spark is a scalable toolkit for genome-wide association studies (GWAS), optimized for GWAS-like datasets.

Machine learning methods, and in particular random forests (RFs), are a promising alternative to standard single-SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures to rank SNPs according to their predictive power. Although there are a number of existing random forest implementations available, some of them parallel or distributed, such as Random Jungle, ranger or SparkML, most of them are not optimized to deal with GWAS datasets, which usually come with thousands of samples and millions of variables.

variant-spark currently provides the basic functionality of building a random forest model and estimating variable importance with the mean decrease Gini method, and can operate on VCF and CSV files. Future extensions will include support for other importance measures, variable selection methods and data formats.
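
For intuition, here is a minimal sketch of what a mean decrease Gini ranking looks like on a toy genotype matrix. It uses scikit-learn rather than variant-spark, and the data and the chosen signal variant are made up for illustration only.

# Illustrative only: mean decrease Gini ("Gini importance") via scikit-learn,
# not the variant-spark API.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 50))      # 200 samples x 50 variants coded 0/1/2
y = (X[:, 7] > 0).astype(int)               # toy phenotype driven by variant 7
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[:5])                          # variant 7 should rank near the top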

variant-spark utilizes a novel approach of building the random forest from data in a transposed representation, which allows it to efficiently deal with even extremely wide GWAS datasets. Moreover, since VCF, the most common format for genomic variant calls, already uses this transposed representation, variant-spark can work directly with VCF data, without the costly pre-processing required by other tools.
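
As a rough sketch of the idea (the variable names and the per-variant score below are hypothetical, not variant-spark internals): in a variant-major layout each record holds one variant's genotypes across all samples, mirroring a VCF row, so per-variant work can be distributed without ever materialising a samples-by-millions-of-variants matrix.

# Illustrative only: variant-major ("transposed") layout, one record per variant.
import numpy as np

n_samples = 1000
labels = np.random.randint(0, 2, n_samples)             # toy binary phenotype
variant_major = {                                       # hypothetical toy data
    "22_16051249": np.random.randint(0, 3, n_samples),  # genotypes 0/1/2
    "22_16051347": np.random.randint(0, 3, n_samples),
}
# Each record can be scored on its own (e.g. as one distributed task per variant).
scores = {v: abs(np.corrcoef(g, labels)[0, 1]) for v, g in variant_major.items()}
print(scores)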

variant-spark is built on top of Apache Spark, a modern distributed framework for big data processing, which gives variant-spark the ability to scale horizontally on both bespoke clusters and public clouds.

The potential users include:

  • Medical researchers seeking to perform GWAS-like analyses on large cohorts of genome-wide sequencing data or imputed SNP array data.
  • Medical researchers or clinicians seeking to perform clustering on genomic profiles to stratify large-cohort genomic data.
  • General researchers who need to classify or cluster datasets with millions of features.

Community

Please feel free to add issues and/or upvote issues you care about, and join the Gitter chat. We have also started documentation on ReadTheDocs, and there is always this repo's issues page for you to add requests. Thanks for your support.

Learn More

To learn more, watch this video from YOW! Brisbane 2017.


Building

variant-spark requires Java JDK 1.8+ and Maven 3+.

In order to build the binaries, use:

mvn clean install

For Python, variant-spark requires Python 3.6+ with pip. The other packages required for development are listed in dev/dev-requirements.txt and can be installed with:

pip install -r dev/dev-requirements.txt

or with:

./dev/py-setup.sh

The complete build, including all checks, can be run with:

./dev/build.sh

Running

variant-spark requires an existing Spark 3.1+ installation (either local or on a cluster).

To run variant-spark, use:

./variant-spark [(--spark|--local) <spark-options>* --] [<command>] <command-options>*

In order to obtain the list of available commands, use:

./variant-spark -h

In order to obtain help for a specific command (for example, importance), use:

./variant-spark importance -h

You can use the --spark marker before the command to pass spark-submit options to variant-spark. The list of Spark options needs to be terminated with --, e.g.:

./variant-spark --spark --master yarn-client --num-executors 32 -- importance ....

Please note that --spark needs to be the first argument of variant-spark.

You can also run variant-spark in --local mode. In this mode variant-spark ignores any Hadoop or Spark configuration files and runs in local mode for both Hadoop and Spark. In particular, all file paths are interpreted as local file system paths. Also, any parameters passed after --local and before -- are ignored. For example:

./variant-spark --local -- importance  -if data/chr22_1000.vcf -ff data/chr22-labels.csv -fc 22_16051249 -v -rn 500 -rbs 20 -ro

Note:

The difference between running in --local mode and in --spark mode with a local master is that in the latter case Spark uses the Hadoop filesystem configuration and the input files need to be copied to this filesystem (e.g. HDFS). The output will also be written to the location determined by the Hadoop filesystem settings. In particular, paths without a scheme, e.g. 'output.csv', will be resolved against the Hadoop default filesystem (usually HDFS). To change this behavior you can set the default filesystem on the command line using the spark.hadoop.fs.default.name option. For example, to use the local filesystem as the default, use:

./variant-spark --spark ... --conf "spark.hadoop.fs.default.name=file:///" ... -- importance  ... -of output.csv

You can also use a full URI with a scheme to address any filesystem for both input and output files, e.g.:

./variant-spark --spark ... --conf "spark.hadoop.fs.default.name=file:///" ... -- importance  -if hdfs:///user/data/input.csv ... -of output.csv

Running examples

There are multiple ways to run the variant-spark examples.

Manual Examples

variant-spark comes with a few example scripts in the examples directory that demonstrate how to run its commands on sample data.

There are a few small datasets in the data directory suitable for running on a single machine. For example:

./examples/local_run-importance-ch22.sh

runs the variable importance command on a small sample of the chromosome 22 VCF file (from the 1000 Genomes Project).

The full-size examples require a cluster environment (the scripts are configured to work with Spark on YARN).

The data required for the examples can be obtained from: https://bitbucket.csiro.au/projects/PBDAV/repos/variant-spark-data

This repository uses the Git Large File Storage (LFS) extension, which needs to be installed first (see: https://git-lfs.github.com/).

Clone the variant-spark-data repository, and then to install the test data into your Hadoop filesystem use:

./install-data

By default the sample data will be installed into the variant-spark-data/input subdirectory of your HDFS home directory.

You can choose a different location by setting the VS_DATA_DIR environment variable.

After the test data has been successfully copied to HDFS, you can run the example scripts, e.g.:

./examples/yarn_run-importance-ch22.sh

Note: if you installed the data to a non-default location, VS_DATA_DIR needs to be set accordingly when running the examples.

VariantSpark on the cloud

VariantSpark can easily be used on AWS and Azure. For more examples and information, check the cloud folder. For a quick start, see the pointers below.

AWS Marketplace

VariantSpark is now available on the AWS Marketplace. Please read the Guidelines for specifications and step-by-step instructions.

Azure Databricks

VariantSpark can be easily deployed on Azure Databricks through the button below. Please read the VariantSpark Azure manual for specifications and step-by-step instructions.

Deploy to Azure

Contributions

JsonRfAnalyser

JsonRfAnalyser is a Python program that looks into the JSON random forest model and lists the variables on each tree and branch. Please read its README to see the complete list of functionalities.
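
As a rough sketch of what such an analysis involves (the actual schema of the variant-spark JSON model may differ; the node fields and file name below are assumptions), one can recursively walk each tree and collect its split variables:

# Illustrative only: walk a JSON forest and list the split variables used per tree.
# Assumes a hypothetical layout: {"trees": [...]}, where internal nodes hold
# "splitVar" and children under "left"/"right".
import json

def collect_split_vars(node, found):
    if not isinstance(node, dict):
        return
    if "splitVar" in node:
        found.add(node["splitVar"])
    for child in ("left", "right"):
        if child in node:
            collect_split_vars(node[child], found)

with open("model.json") as f:               # hypothetical file name
    forest = json.load(f)
for i, tree in enumerate(forest.get("trees", [])):
    variables = set()
    collect_split_vars(tree, variables)
    print(i, sorted(variables))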

WebVisualiser

rfview.html is a web page (run locally on your machine) where you can upload the JSON model produced by variant-spark, and it visualises the trees in the model. You can select which tree to visualise. Node color and node labels can be set to different parameters, such as the number of samples in the node or the node impurity. It uses vis.js for tree visualisation.
