Code repository of the Earth Microbiome Project.

Earth Microbiome Project

The Earth Microbiome Project (EMP) is a systematic attempt to characterize global microbial taxonomic and functional diversity for the benefit of the planet and humankind.

This GitHub repository describes the EMP catalogue: how it is generated and how to use it. The EMP dataset is generated from samples that individual researchers have compiled and contributed to the EMP. Samples from each group of researchers represent individual EMP studies. In addition to analyses by contributing researchers on individual studies, we perform cross-study meta-analyses. EMP 16S Release 1, a meta-analysis of the first 97 16S rRNA amplicon studies, has been published (article, preprint), and the code and methods used for that manuscript are provided here. EMP 16S Release 2, currently unpublished, includes additional 16S rRNA amplicon data. We are currently finalizing the EMP500, a multi-omics meta-analysis of 50 studies including >500 samples, each processed for 16S, 18S, and ITS amplicon sequencing, shotgun metagenomic sequencing, and metabolic profiling (preprint). Methods and standard operating procedures (SOPs) for additional amplicon sequencing, shotgun sequencing, and metabolomics related to EMP 16S Release 2 and the EMP500 are also provided here.

Organization of this repository

This repository contains the directories listed below. Each directory will have contents related to EMP 16S Release 1 and EMP Multi-omics (EMP500).

  • methods: Methods used in EMP analyses. Includes sample processing for extraction and sequencing, and computational methods for performing analyses and generating figures for meta-analyses of the EMP dataset.
  • protocols: Laboratory protocols and SOPs for sample and metadata collection, sample tracking, amplicon sequencing, shotgun sequencing, and metabolomics.
  • code: IPython notebooks and scripts (Python, Java, R, Bash) developed for meta-analysis of EMP data; this code is used in methods.
  • data: Data files resulting from or used in processing and analysis.
  • papers: Preprints of major meta-analyses of the EMP dataset and links to papers about individual studies.
  • presentations: Links to slide decks from presentations on the EMP.
  • legacy: Early code, results, and website documents from the initial phase of the EMP (2010-2013).

Getting involved

There are several ways to get involved with the EMP:

  • Use the EMP catalogue in your own research. Download the whole catalogue or just a few studies, merge and analyze them with your own data, or query the catalogue. Please skip to the next section for detailed instructions.
  • Join the analysis team. If you are interested in getting involved with EMP meta-analyses, you can begin by reviewing the open issues on this GitHub page. You can add comments to an existing issue to propose your ideas, or create a new issue entirely. Note that the initial meta-analysis of the EMP has been published. You can view the existing code and methods (instructions) for generating figures for the meta-analysis.
  • Contribute samples. We are not currently soliciting samples for the EMP. If you have an idea for samples you might like to submit in the future, you may email Dr. Justin Shaffer.

Using the EMP catalogue

The EMP catalogue is a diverse and standardized set of thousands of microbiomes for use by the public. Here are some of the ways you can use this resource:

  • Download EMP Release 1 from our FTP site. EMP 16S Release 1 contains merged and quality-filtered mapping files, BIOM tables, OTU/sequence information, and alpha/beta-diversity results for ~25,000 samples in 97 studies of the initial meta-analysis of the EMP. The FTP site contains README files about its contents, and the individual files are listed here.

  • Download individual studies from the Qiita EMP Portal. For each study, you can download metadata (mapping file), feature tables (BIOM file), and demultiplexed raw sequence files. Like the rest of Qiita, the EMP Portal requires the Google Chrome browser.

  • Merge your data with all or part of the EMP dataset. If you sequenced your samples using the EMP 16S rRNA primers and picked OTUs using either Deblur or closed-reference picking against Greengenes 13.8 or Silva 123, you can merge your BIOM table with the relevant merged EMP 16S Release 1 BIOM table or with one of the individual per-study BIOM tables from Qiita. Basic instructions for initial processing of your data are provided. You can then use QIIME 1 or QIIME 2 to merge the BIOM tables and mapping files.

  • Query the EMP catalogue using Redbiom. Redbiom is a command-line tool that allows users to query the Qiita database, including EMP studies. It allows you to find samples based on the sequences or taxa they contain or on sample metadata, and to export selected sample data and metadata. Once you have Redbiom installed, you can carry out queries such as those described here:

    # First, summarize the contexts available. A context represents a partition by 
    # processing parameters (e.g., closed-reference OTU picking) and preparation 
    # (e.g., 16S V4).
    
    redbiom summarize contexts | cut -f 1,2,3
    
    # Create a variable for the context. For this example, we will use the closed-
    # reference 16S V4 context by setting a local bash variable "ctx". 
    
    ctx=Pick_closed-reference_OTUs-illumina-16S-v4-66f541
    
    # Query 1: "Show me all the genera that were observed at pH > 8."
    # First we search for samples with pH > 8, then select the features from those 
    # samples, then summarize the taxonomy of those features, then grep for just 
    # the genera and count them.
    
    redbiom search metadata "where ph > 8" | redbiom select features-from-samples \
    --context $ctx | redbiom summarize taxonomy --context $ctx | grep g__ | wc -l
    
    # Answer: There are 1423 genera found in samples with pH > 8.
    
    # Query 2: "Show me all sites where Pyrobaculum are found." 
    # First we search for features that are genus Pyrobaculum, then search for 
    # samples containing those features, then fetch sample metadata for those 
    # samples and output the metadata file, then grab the columns for latitude and 
    # longitude (note: these are not guaranteed to reside in columns 10 and 11).
    
    redbiom search taxon --context $ctx g__Pyrobaculum | redbiom search features \
    --context $ctx | redbiom fetch sample-metadata --context $ctx \
    --output g__Pyrobaculum_metadata.txt; cut -f 10,11 g__Pyrobaculum_metadata.txt
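
Because the latitude and longitude columns in Query 2 are not guaranteed to be columns 10 and 11, it can be safer to select them by header name. The sketch below is one way to do that with awk. Note that select_columns is a hypothetical helper, not part of Redbiom, and that the column names latitude_deg and longitude_deg in the usage note are assumptions; check them against the header of your own metadata file.

```shell
# Hypothetical helper (not part of Redbiom): print named columns from a
# tab-separated metadata file by looking them up in the header row.
# Usage: select_columns FILE COLUMN_NAME [COLUMN_NAME ...]
select_columns() {
    file=$1
    shift
    # Join the requested column names with commas so awk can split them.
    names=$(IFS=,; printf '%s' "$*")
    awk -F '\t' -v names="$names" '
        BEGIN { OFS = "\t"; n = split(names, want, ",") }
        NR == 1 {
            # Map each header name to its column position.
            for (i = 1; i <= NF; i++) pos[$i] = i
            # Stop with an error if a requested column is missing.
            for (j = 1; j <= n; j++) {
                if (!(want[j] in pos)) {
                    printf "column not found: %s\n", want[j] > "/dev/stderr"
                    exit 1
                }
            }
        }
        {
            # Print the requested columns, in the order they were requested.
            line = $(pos[want[1]])
            for (j = 2; j <= n; j++) line = line OFS $(pos[want[j]])
            print line
        }
    ' "$file"
}
```

For example, `select_columns g__Pyrobaculum_metadata.txt latitude_deg longitude_deg` would print the two coordinate columns (header row included), provided both names appear in the header of the file.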
    

Citing the EMP

If you use the EMP 16S Release 1 data in your research, please cite Thompson et al., "A communal catalogue reveals Earth's multiscale microbial diversity", Nature, 2017 (article).

If you use the EMP500 data in your research, please cite Shaffer-Nothias-Thompson et al., "Multi-omics profiling of Earth’s biomes reveals that microbial and metabolite composition are shaped by the environment", bioRxiv, 2022 (preprint).

If you use EMP protocols in your research, please cite earthmicrobiome.org and the relevant papers referenced therein.

File name abbreviation conventions

Some abbreviations used in this repository:

  • demux is shorthand for "demultiplexed", which describes the fastq data after it is split into per-sample fastq files using barcodes.
  • deblur refers to the exact-sequence de novo OTU picking method Deblur.
  • cr refers to closed-reference OTU picking.
  • or refers to open-reference OTU picking.
  • refseqs refers to reference sequence collections that could be used in reference-based OTU picking.
  • mc2 means that an OTU must contain at least 2 sequences to be included (minimum count of 2).
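
To make the mc2 convention concrete, here is a minimal sketch of a minimum-count filter applied to a plain tab-separated OTU table. It is an illustration only, not the code used to produce the EMP files: filter_min_count is a hypothetical helper, and the input layout (one header row, OTU IDs in the first column, per-sample counts in the remaining columns) is an assumption.

```shell
# Hypothetical helper: keep only OTUs whose total count across all samples
# is at least MIN_COUNT (default 2, i.e. "mc2").
# Usage: filter_min_count TABLE.tsv [MIN_COUNT]
filter_min_count() {
    awk -F '\t' -v min="${2:-2}" '
        NR == 1 { print; next }      # pass the header row through unchanged
        {
            # Sum the counts in every sample column for this OTU.
            total = 0
            for (i = 2; i <= NF; i++) total += $i
            if (total >= min) print
        }
    ' "$1"
}
```

With the default threshold, an OTU observed only once across the whole table (a singleton) is dropped, while an OTU with two or more sequences in total is kept.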

Finding older data

If you're looking for data generated and used for the ISME 14 EMP presentations, look here.
