• Stars
    star
    136
  • Rank 267,670 (Top 6 %)
  • Language
    R
  • License
    MIT License
  • Created about 8 years ago
  • Updated 6 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

PHESANT - PHEnome Scan ANalysis Tool (pheWAS, Mendelian randomisation (MR)-pheWAS etc.) in UK Biobank

PHESANT - PHEnome Scan ANalysis Tool

Run a phenome scan (pheWAS, Mendelian randomisation (MR)-pheWAS etc.) in UK Biobank.

There are three components in this project:

  1. Running a phenome scan in UK Biobank
  2. Post-processing of results
  3. PHESANT-viz: Visualising the results

General requirements

R for parts 1 and 2 above. Tested with R-3.3.1-ATLAS. Phenome scan requires the R packages: optparse (V1.3.2), MASS (V7.3-45), lmtest (V0.9-34), nnet (V7.3-12), forestplot (V1.7) and data.table (V1.10.4).

Java for part 3 above. Tested with jdk-1.8.0-66.

Citing this project

Please cite:

Millard LAC, et al. Software Application Profile: PHESANT: a tool for performing automated phenome scans in UK Biobank. International Journal of Epidemiology (2017).

1) Running a phenome scan

A phenome scan is run using WAS/phenomeScan.r. This is ready to go. One amendment you may wish to make before running PHESANT is the TRAIT_OF_INTEREST column in the variable information file (see below).

The PHESANT phenome scan processing pipeline is illustrated in the figure here, and described in detail in the paper above.

The phenome scan is run with the following command:

cd WAS/
Rscript phenomeScan.r \
--phenofile=<phenotypesFilePath> \
--traitofinterestfile=<traitOfInterestFilePath> \
--variablelistfile="../variable-info/outcome-info.tsv" \
--datacodingfile="../variable-info/data-coding-ordinal-info.csv" \
--traitofinterest=<traitOfInterestName> \
--resDir=<resultsDirectoryPath> \
--userId=<userIdFieldName>

The following example runs part 1 of 20, of a sensitivity analysis phenome scan (adjusting for age, sex, and assessment centre, see below), using a non genetic trait of interest:

cd WAS/
Rscript phenomeScan.r \
--phenofile=<phenotypesFilePath> \
--traitofinterestfile=<traitOfInterestFilePath> \
--variablelistfile="../variable-info/outcome-info.tsv" \
--datacodingfile="../variable-info/data-coding-ordinal-info.csv" \
--traitofinterest=<traitOfInterestName> \
--resDir=<resultsDirectoryPath> \
--userId=<userIdFieldName> \
--sensitivity \
--genetic=FALSE \
--partIdx=1 \
--numParts=20

Required arguments

Arg Description
phenofile Comma separated file containing phenotypes. Each row is a participant, the first column contains the participant id and the remaining columns are phenotypes. Where there are multiple columns for a phenotype these must be adjacent in the file. Specifically for a given field in Biobank the instances should be adjacent and within each instance the arrays should be adjacent. Each variable name needs to be changed to the format 'x[varid]_[instance]_[array]' (we use the prefix 'x' so that the variable names are valid in R).
variablelistfile Tab separated file containing information about each phenotype, that is used to process them (see below).
datacodingfile Comma separated file containing information about data codings (see below).
traitofinterest Variable name as in traitofinterestfile.
resDir Directory where you want the results to be stored.

Optional arguments

Arg Description
traitofinterestfile Comma separated file containing the trait of interest (e.g. a snp, genetic risk score or observed phenotype). Each row is a participant and there should be two columns - the user ID and the trait of interest. Where this argument is not supplied, the trait of interest should be a column in the phenofile.
confounderfile Comma separated file containing the confounders, so that you can choose what confounders to use in the phenome scan.
userId User id column as in the traitofinterestfile and the phenofile (default: userId).
partIdx Subset of phenotypes you want to run (for parallelising).
numParts Number of subsets you are using (for parallelising).
sensitivity By default analyses are adjusted for age (field 21022), sex (field 31) and, if the genetic argument is set to TRUE, genotype chip (a binary variable derived from field 22000). If sensitivity argument is used (by including --sensitivity when running PHESANT) then analyses additionally adjust for the assessment centre (field 54). If sensitivity argument is used (by including --sensitivity when running PHESANT) and the genetic argument is set to TRUE, the first 10 genetic principal components (fields 22009_0_1 to 22009_0_10) are also included as confounders. If you wish to choose your own confounders to use in the phenome scan you can use the confounderfile option (described above).
genetic By default genetic=TRUE, and we assume the trait of interest is a genetic variable (e.g. a SNP or genetic risk score). If this is not the case (e.g you are running an environment-wide association study) then set this flag to FALSE. This option determines which variables are controlled for in analyses, see sensitivity arg above.
save Instead of running phenome scan, generated phenotypes are stored to file, in resDir. If this option is used then traitofinterest argument is not required.
confidenceintervals By default confidenceintervals=TRUE, but specifying confidenceintervals=FALSE means that PHESANT doesn't calculate the association confidence intervals (which may speed up PHESANT).
standardise By default standardise=TRUE, but specifying standardise=FALSE means that PHESANT will not standardise the exposure variable. E.g. use this option for binary exposure variables.
tab By default phenotype file (phenofile) is comma seperated, but tab=TRUE can be specified when your file is tab delimited (e.g. using the r option for UK Biobank's ukbconv utility).
mincase Minimum size of phenotype categories (default is 10).

The numParts and partIdx arguments are both used to parallelise the phenome scan. E.g. setting numParts to 5 will divide the set of phenotypes into 5 (rough) parts and then partIdx can be used to call the phenome scan on a specific part (1-5).

Data coding file

Data codes define a set of values that can be assigned to a given field. A data code can be assigned to more than one variable, which is why we use a separate file describing the necessary information for each data code. For example, there are several fields about diet that have data code 100009.

The data coding file should have the following columns:

  1. dataCode - The ID of the data code.
  2. ordinal - Whether the field is ordinal (value 1) or not (value 0). This field is only used for fields of the categorical (single) field type. Value -1 denotes this is not needed because the field is binary.
  3. ordering - Any needed corrections for the numeric ordering of a data codes specified by Biobank. This field is only used for data codes specified as ordinal in the ordinal column. For example, data code 100001 has values half, 1 and 2+ coded as 555, 1 and 200, respectively. We need the 'half' value to be less than the '1' value, so we change the order to '555|1|200'. NB: if this column is used then and any value is not included then this value is set to NA (i.e. this field can be used to remove and reorder values at the same time).
  4. reassignments - Any value changes that are needed. For example, in data code 100662, the values 7 and 6 may be deemed equal (both representing 'never visited by friends/family' so we can set '7=6' to assign the value 6 to all participants with the value 7.
  5. default_value - A default value assigned to all participants with no value for the field, but with a value for field stated in default_value_related_field column below. This is used where a category is not explicitly stated in the field but instead needs to be determined by looking at whether another field has a value. Typically, this occurs where there is no category for 'none' in a questionnaire field, because participants were told they did not have to mark 'none' but could instead leave it blank (see for example section 5.3 in the 24 hour diet questionnaire manual). Hence, we assume that if they completed the questionnaire and have not ticked a value, then the value is 'none'. See default value example below.
  6. default_value_related_field - The field used to determine which participants are assigned the default value. All participants with a value in the field stated here, and with no value for a field with this data code, are assigned the default value stated in default_value.
Example of default value

In the data code information file we specify default_value=0 and default_value_related_field=20080 for data code 100006. Field 100200, for example, has data code 100006. Therefore all participants with a value for field 20080, but with no value in field 100200, are assigned value 0 for field 100200. Intuitively, all participants who have answered the 24-hour recall diet questionnaire have a value in field 20080, and of these, we assume that those with no value for field 100200 have opted for 'none' implicitly, by not ticking any option.

Variable information file

This file was initially the UK Biobank data dictionary, which can be downloaded from the UK Biobank website here. This data dictionary provides the following set of information about fields, used in this phenome scan tool:

  1. ValueType column - the field type, either 'Integer', 'Continuous', 'Categorical single', 'Categorical multiple', or a few others we do not use.
  2. Three Cat_ID and three Cat_Title columns - the three levels of the category hierarchy, that can be seen here.
  3. FieldID column - We use this to match the variable in our biobank data file to the correct row in this TSV file.
  4. Field column - The name of the field.

The variable information file also has the following columns that we have added, to provide additional information used in the phenome scan:

  1. TRAIT_OF_INTEREST - Specifies any field that represents the trait of interest (set this column to 'YES'). This is a marker so that after the phenome scan is run we can use these results as validation only (e.g. a pheWAS of the BMI FTO SNP would expect the BMI phenotypes to show high in the results ranking), i.e. they do not contribute to the multiple testing burden. We have set this up for BMI, so have marked BMI/weight fields as the trait of interest - you will need to change this for your particular trait of interest. For categorical multiple fields you may want to mark the whole field as denoting the trait of interest (e.g all cancers), or just a specific value (e.g. a particular type of cancer). To do the former set this column to YES, and to do the latter specify each particular value in this field separated by a bar, i.e. 'VALUE1|VALUE2'.
  2. EXCLUDED - Phenotypes we apriori decide to exclude from the phenome scan. Any field with a value in this field is excluded, and we state a code that describes the reason we exclude a variable (for future reference). Our codes and reasons are as follows (of course for your phenome scan you can add others as you would like):
  • YES-ACE: "Assessment center environment" variables that do not directly describe the participant.
  • YES-AGE: Age variables.
  • YES-ASSESSMENT-CENTRE: The assessment centre the participant attended.
  • YES-BIOBANK-SUGGESTED-VARIABLE: Variables not initially included in our data request, but that biobank suggested we receive.
  • YES-CAT-SIN-MUL-VAL: Fields that were 'Categorical single' types but had multiple values (several arrays in an instance). We do not deal with these currently so remove from the phenome scan.
  • YES-GENETIC: Genetic description variables.
  • YES-SENSITIVE: Variables not received from Biobank because they are sensitive so have more restricted access.
  • YES-SEX: Sex fields.
  • YES-POLYMORPHIC: Fields denoted as polymorphic by UK Biobank (e.g. field 87).
  • YES-PROCESSING: Fields that describe some aspect of the data collection process rather than some aspect (phenotype) of the participants themselves (e.g. field 23049).
  • YES-SUPERSEDED: Fields that have been superseded by another version. For example, all fields used to create the aggregated 'first occurence' disease variables.
  • YES-DATE: Fields that are dates. Dates are not currently included in phenome scans. They are only used where they are converted to a binary variable (see DATE_CONVERT column below).
  1. CAT_MULT_INDICATOR_FIELDS - every categorical multiple field must have a value in this column. The value describes which set of participants to include as the negative examples, when a binary variable is created from each value (see above cited paper for more information). The positive examples for a value v in this categorical multiple field are simply the people with this particular value. However the negative values can be determined in three ways:
  • ALL - Include all participants (any participant without value v is assigned FALSE, except those with a value denoting missingness (i.e. value is <0)).
  • NO_NAN - Included only those who have at least one value for this field (assign FALSE to any participant with at least one value for this field, where these values do not include value v, and also do not include a value denoting missingness (i.e. value is <0).
  • fieldID - Include only those who have a value for another field, with ID fieldID (assign FALSE to any participant without value v and without a value denoting missingness (i.e. value is <0) and with a value in this other field with ID fieldID).
  1. CAT_SINGLE_TO_CAT_MULT - Specifies fields that have the categorical single field type (as specified by UK Biobank) but that we actually want to treat as categorical multiple. State YES in this column to change this. To also convert the instances to arrays, state YES-INSTANCES. For example, field 20107 has illnesses stored in 10 arrays (for each instance) so it makes sense to treat this as a categorical(multiple) field. Field 40011 stores the histology of cancer tumours but is stored in 31 instances (rather than arrays) so we specify YES-INSTANCES so this field is treated as a categorical (multiple) field and the instances are treated as arrays.
  2. DATE_CONVERT - Specifies that a date field should be included and converted to a binary variable indicating whether each participant has a date in this field or not. This is used for the 'first occurence' disease variables that are date fields.
  3. DATA_CODING - The data coding IDs used to map a field to its data code in the data code information file described above. This is required for categorical (single) fields.

Output

In the directory specified with the resDir argument, the following files will be created:

  1. Results files for each test type:
  • Linear regression: results-linear-all.txt - One line for each linear regression result
  • Logistic regression: results-logistic-all.txt - One line for each logistic regression result
  • Multinomial regression: results-multinomial-all.txt - Each multinomial regression results in n lines in this results file, where n is the number of categories in the variable. One line corresponds to the results for a particular category compared to an assigned baseline category (the category with the highest sample size), and then there is also one line for the overall association of this variable (across all categories), using a likelihood ratio test comparing this model with the model with confounders only.
  • Ordered logistic regression: results-ordered-logistic-all.txt - One line for each ordinal logistic regression result.
  1. A log file: results-log-all.txt - One line for each Biobank field, providing information about the processing flow for this field (see PHESANT-logging-information.md for examples).
  2. A model fit log file: modelfit-log-all.txt - shows model fit output for categorical unordered models.
  3. Flow counts file: variable-flow-counts-all.txt - A set of counts denoting the number of variables reaching each point in the processing flow. Each code with count value corresponds to a particular position in the processing flow (see figure here).

Where the phenome scan is run in parallel setup, then each parallel part will have one of each of the above files, with 'all' in each filename replaced with: [partIdx]-[numParts].

See testWAS/README.md for an example with test data.

If the save option is used, instead or producing results files, PHESANT will create the following files:

  1. Data files for each trait type. e.g. data-linear-all.txt.
  2. A log file e.g. data-log-all.txt.

2) Post phenome scan results processing

The resultsProcessing folder provides code to post-process the results, specifically:

  1. Combine the results where they were generated in parallel.
  2. Combine the flow counts.
  3. Add the description information for each variable from the variablelistfile file, to the results file.
  4. Rank the results by P value.
  5. For the multinomial regression results, include only the main result (not the results for each particular category of a given variable as described above).
  6. Generate basic figures: qqplot, and 3 forest plots for continous, ordered categorical and binary results separately (for results with P<0.05/numTests, i.e. the Bonferroni corrected 0.05 threshold). We do not generate a forest plot for the categorical unordered results, because we have no overall estimate (and confidence interval) for the model overall, because we use a likelihood ratio test to generate a model P value. For these plots we use all results except for phenotypes marked as 'trait of interest'.

The results processing is run with the following command:

cd resultsProcessing/

Rscript mainCombineResults.r \
--resDir=<resultsDirectoryPath> \
--variablelistfile="../variable-info/outcome-info.tsv"

Required arguments

Arg Description
resDir Directory where the phenome scan results are stored.
variablelistfile Tab separated file containing information about each phenotype, that is used to process them. Same as variablelistfile used in main phenome scan.

Optional argument

Arg Description
numParts Number of subsets (parts) you have used (for parallelising).

See testWAS/README.md for an example with test data.

QQ plot description

The QQ plot contains the following elements:

  1. A horizontal line (dashed, green) denoting the Bonferroni corrected P value threshold.
  2. A line (dotted, blue) denoting the expected trajectory of points under the null (actual=expected).
  3. A set of points each denoting the result for a phenotype. It is possible (as in the testWAS) that the P value is smaller than the smallest possible value that R can store (see .Machine documentation). When this occurs the P value is set to zero, and on the QQ plot we set these to 5e-324 (the smallest positive double on a typical R platform) and display these as red stars to indicate that an exact P value was not recorded.

3) PHESANT-viz: Results visualisation

A phenome scan generates a large number of results. The aim of this visualisation is to help with interpretation, by allowing the researcher to view each result in the context of the results of related traits.

See the PHESANT-viz folder and README therein for more information.

More Repositories

1

TwoSampleMR

R package for performing 2-sample MR using MR-Base database
R
411
star
2

gwasglue

Linking GWAS data to analytical tools in R
R
110
star
3

ieugwasr

R interface to the IEU GWAS database API
R
72
star
4

gwasvcf

Reading, querying and writing GWAS summary data in VCF format
HTML
66
star
5

awesome-ai

Awesome list for all things AI, ML and deep learning
51
star
6

gwas2vcf

Convert GWAS summary statistics to VCF
Python
47
star
7

MRInstruments

Data objects for various sets of instruments
R
43
star
8

epigraphdb-pqtl

This is the github repository to provide the R scripts been used in the phenome-wide Mendelian randomization study of plasma proteome.
R
37
star
9

Health-and-Wellbeing-MR

R
27
star
10

epigraphdb

Examples on using EpiGraphDB
Jupyter Notebook
26
star
11

epigraphdb-r

R package for epigraphdb
R
26
star
12

metaboprep

a pipeline of metabolomics data processing and quality control
R
22
star
13

gwas-vcf-specification

Specification for the GWAS-VCF format (manuscript in preparation)
22
star
14

gwasglue2

Connecting GWAS summary data to analytical tools
R
17
star
15

PhenoSpD

R
17
star
16

MELODI-Presto

A fast and programmatic MELODI
Jupyter Notebook
15
star
17

UKBiobankGWAS

Notes and code for running UK Biobank GWAS at the MRC IEU
Python
14
star
18

godmc

GoDMC analysis pipeline
R
13
star
19

PHESANT-MR-pheWAS-BMI

R
11
star
20

GLU

A tool for analysing continuous glucose monitoring (CGM) data in epidemiology.
R
10
star
21

ieugwaspy

Python interface to the MRC IEU GWAS database API
Python
9
star
22

GeneHackman

Pipeline for genetic epidemiology projects at Univsersity of Bristol
R
9
star
23

genetics.binaRies

Binaries for tools used in genetics in an R package
R
9
star
24

vectology

Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets
Jupyter Notebook
9
star
25

CheckSumStats

R
8
star
26

BiobankPhenotypes

Python
8
star
27

pygwasvcf

Python library for reading GWAS summary statistics in GWAS-VCF files
Python
7
star
28

melodi

Mining Enriched Literature Objects to Derive Intermediates
HTML
7
star
29

ariesmqtl

Scripts to run mQTL analyses as described in the ARIES mQTL paper
Python
7
star
30

bp-drug-mr

Multi-ancestry MR of blood pressure drug targets
HTML
6
star
31

temmpo

Text mining for mechanism prioritisation
Python
6
star
32

godmc_phase2_analysis

HTML
6
star
33

qqman_multiple_colours

Manhattan plot function which highlights SNPs in different colours
R
6
star
34

UKBActivityBoutLength

Assessing association of physical activity types and bout length in UK Biobank
R
6
star
35

mrbase_casestudies

Using the MR-Base platform to investigate risk factors and drug targets for thousands of phenotypes
HTML
6
star
36

introduction-to-r

HTML
5
star
37

opengwas-api

API for MRC-IEU OpenGWAS platform (https://gwas.mrcieu.ac.uk)
Python
5
star
38

epigenetics-journal-club

Repository for the IEU epigenetics journal club papers, presentations and scripts
Python
5
star
39

Lifecourse-GWAS

R
5
star
40

genetic-epidemiology-practicals

Practicals for Genetic epidemiology short course
R
5
star
41

ukbb-prot-prs

R
5
star
42

gwas-vcf-performance

Comparison between VCF and plain text storage formats for GWAS summary statistics
HTML
4
star
43

MR-maternal-smoking

Analysis code for proxy gene-by-environment Mendelian randomization of maternal smoking on offspring health
R
4
star
44

ukbb-covid-collider

TeX
4
star
45

MR-randomization-test

R
4
star
46

Psychiatric-disorder-pheWAS-UKBB

R
4
star
47

ewascatalog-r

R
4
star
48

varGWAS

GWAS of trait variance (C++)
R
3
star
49

Smoking_IBD

Scripts for analysing the association between smoking and inflammatory bowel disease (IBD)
R
3
star
50

epigraphdb-ctda

codes for the epigraphdb CTAD database
R
3
star
51

education_mentalhealth_CVD

R
3
star
52

GwasDataImport

R package to upload data to GWAS database
R
3
star
53

biorxiv-medrxiv-tdm

Jupyter Notebook
3
star
54

ewascatalog

HTML
3
star
55

PHESANT-MR-PheWAS-Insomnia

An MR-PheWAS of insomnia using UK Biobank and PHESANT.
R
3
star
56

mendelvar_standalone

Standalone version of the MendelVar pipeline
Python
3
star
57

mrcieu.r-universe.dev

R packages for mrcieu.r-universe.dev
3
star
58

sleep_duration_pregnancy

Here we provide scripts used in 'Investigating causal relations between sleep duration and risks of adverse pregnancy and perinatal outcomes: Linear and nonlinear Mendelian randomization analyses in up to 356,069 European women'
R
3
star
59

opengwas-requests

A place to request or contribute data to the IEU GWAS database
3
star
60

hie-ml

R
2
star
61

vitD_bonehealth

ALSPAC vitamin D and bone health MR study
R
2
star
62

PHESANT-MR-pheWAS-smoking

MR-pheWAS of smoking heaviness
R
2
star
63

met-d-import

UKBB Nightingale import to OpenGWAS
R
2
star
64

queryukbppp

Query UKB-PPP data
R
2
star
65

Schizophrenia_Fertility_Paper

Stata
2
star
66

handedness-ewas

R
2
star
67

epigraphdb_web

Web application (web UI) of the EpiGraphDB platform
Vue
2
star
68

gwas_processing

Python
2
star
69

IGD-elasticsearch

Docker container for creating indexes and indexing GWAS data
Python
2
star
70

variance-iv4-violation

R
2
star
71

varGWAS-ukbb-biomarkers

R
2
star
72

vcf-reference-datasets

Shell
2
star
73

MR-PREG

Repository for the MR-PREG collaboration
R
2
star
74

PHESANT-IV-pheWAS-relative-school-age

IV-pheWAS of relative school age
R
1
star
75

opengwas-infpval-fix

Documentation and fixes for p-value issues for some datasets
Python
1
star
76

godmc_phase1_analysis

Shell
1
star
77

varGWASR

R package to perform regression-based Brown-Forsythe test
R
1
star
78

activityBigrams

Physical activity phenotyping with activity bigrams, and their association with BMI
MATLAB
1
star
79

COVIDITY_selbias

COVIDITY BHF-Project - Working Package 3 about selection bias in COVID-19 studies across UK cohorts
R
1
star
80

met-e-import

R
1
star
81

prot-c-import

Import Suhre et al 2017 protein data
R
1
star
82

opengwas

1
star
83

bbj-a-import

Import Japan Biobank GWAS
R
1
star
84

epigraphdb-asq

Annotated Semantic Queries
Jupyter Notebook
1
star
85

mrtwin_power

R
1
star
86

cistrans_pqtl

R
1
star
87

finn-a-import

FinnGen R2 GWAS data import
R
1
star
88

MVMRalcsmo

A repository of code for a MVMR study of the effect of alcohol and smoking on lung function
R
1
star
89

IlluminaFinalRep_LRRBAF

Estimate LRR and BAF from Illumina Final Report
Python
1
star
90

prot-lungcancer

Protein associations with lung cancer pre
R
1
star
91

opengwas-reports

Report module for IEU GWAS pipeline
R
1
star
92

aries

Functions for loading the ARIES DNA methhylation data
HTML
1
star
93

upptime

Use upptime to monitor web platform status
JSON
1
star
94

finn-b-import

Import Finngen R5 GWAS
R
1
star
95

panukbb_gwas_import

Import pan UKBB GWAS analysis https://pan.ukbb.broadinstitute.org/
R
1
star
96

gwas2vcfweb

Web interface for gwas2vcf
HTML
1
star
97

mrmn

Website for the Mendelian randomization Methods Network
1
star
98

AFS_AAM_LifeHistory

Analysis scripts for "Using Mendelian randomization to investigate life history theory: the effects of age at menarche and first sexual intercourse on reproductive and behavioural outcomes"
R
1
star
99

oci-elastic

IEU version of https://github.com/cloud-partners/oci-elastic
HCL
1
star
100

mdNLR-and-Rheumatoid-Arthritis-Script

Rheumatoid arthritis (RA) is a disease of chronic systemic inflammation (SI). In the present study, we used four datasets to explore whether methylation-derived neutrophil-to-lymphocyte ratio (mdNLR) might be a marker of SI in new onset, untreated, and treated prevalent RA cases and/or a marker of treatment response to the tumour necrosis factor inhibitor (TNFi) etanercept.
R
1
star