• Stars
    338
  • Rank 124,931 (Top 3 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 10 years ago
  • Updated over 1 year ago

Repository Details

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl

PyEnsembl

PyEnsembl is a Python interface to Ensembl reference genome metadata such as exons and transcripts. PyEnsembl downloads GTF and FASTA files from the Ensembl FTP server and loads them into a local database. PyEnsembl can also work with custom reference data specified using user-supplied GTF and FASTA files.

Example Usage

from pyensembl import EnsemblRelease

# release 77 uses human reference genome GRCh38
data = EnsemblRelease(77)

# will return ['HLA-A']
gene_names = data.gene_names_at_locus(contig=6, position=29945884)

# get all exons associated with HLA-A
exon_ids = data.exon_ids_of_gene_name('HLA-A')

Installation

You can install PyEnsembl using pip:

pip install pyensembl

This should also install any required packages such as datacache.

Before using PyEnsembl, run the following command to download and install Ensembl data:

pyensembl install --release <list of Ensembl release numbers> --species <species-name>

For example, pyensembl install --release 75 76 --species human will download and install all human reference data from Ensembl releases 75 and 76.
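If the install step has to run from a script, the same command line can be assembled and passed to subprocess. This is a minimal sketch: the helper names build_install_command and run_install are hypothetical, only the `pyensembl install --release ... --species ...` form comes from the text above, and running it assumes the pyensembl executable is on PATH.

```python
import subprocess


def build_install_command(releases, species):
    """Assemble the argument list for `pyensembl install`.

    Illustrative helper, not part of PyEnsembl itself.
    """
    if not releases:
        raise ValueError("at least one Ensembl release number is required")
    command = ["pyensembl", "install", "--release"]
    command.extend(str(release) for release in releases)
    command.extend(["--species", species])
    return command


def run_install(releases, species):
    """Run the install command; assumes `pyensembl` is on PATH."""
    subprocess.run(build_install_command(releases, species), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_install() to execute it.
    print(" ".join(build_install_command([75, 76], "human")))
```

Passing the arguments as a list avoids shell quoting issues, and check=True turns a failed install into a Python exception.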

Alternatively, you can create the EnsemblRelease object from inside a Python process and call ensembl_object.download() followed by ensembl_object.index().
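The in-process route described above can be sketched as a small loop over release numbers. The function install_releases and its make_release parameter are hypothetical conveniences (the parameter exists so the loop can be exercised without downloading anything); only EnsemblRelease, download() and index() come from the text.

```python
def install_releases(release_numbers, make_release=None):
    """Download and index each release, returning the prepared objects.

    `make_release` is an illustrative hook; when omitted it falls back to
    pyensembl.EnsemblRelease.
    """
    if make_release is None:
        # Imported lazily so the helper can be defined without PyEnsembl.
        from pyensembl import EnsemblRelease
        make_release = EnsemblRelease

    prepared = []
    for number in release_numbers:
        release = make_release(number)
        release.download()  # fetch the GTF and FASTA files
        release.index()     # load them into the local database
        prepared.append(release)
    return prepared
```

With PyEnsembl installed, install_releases([75, 76]) would be the in-process counterpart of the command-line example above.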

Cache Location

By default, PyEnsembl stores downloaded files in a pyensembl sub-directory of the platform-specific cache folder. You can override this default by setting the environment variable PYENSEMBL_CACHE_DIR to your preferred cache location:

export PYENSEMBL_CACHE_DIR=/custom/cache/dir

or

import os

os.environ['PYENSEMBL_CACHE_DIR'] = '/custom/cache/dir'
# ... PyEnsembl API usage
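When the cache location is chosen at run time, it can help to create the directory and set the variable in one step. The helper configure_cache_dir below is a hypothetical sketch, not part of PyEnsembl. Calling it before importing pyensembl is a cautious assumption, since the text above does not say when the variable is read.

```python
import os


def configure_cache_dir(path):
    """Create `path` if needed and point PYENSEMBL_CACHE_DIR at it.

    Illustrative helper, not part of PyEnsembl itself.
    """
    absolute = os.path.abspath(os.path.expanduser(path))
    os.makedirs(absolute, exist_ok=True)
    os.environ["PYENSEMBL_CACHE_DIR"] = absolute
    return absolute
```

The returned absolute path can be logged so that later runs know where the downloaded files ended up.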

Non-Ensembl Data

PyEnsembl also allows arbitrary genomes via the specification of local file paths or remote URLs to both Ensembl and non-Ensembl GTF and FASTA files. (Warning: GTF formats can vary, and handling of non-Ensembl data is still very much in development.)

For example:

from pyensembl import Genome

# describe a genome backed by a local GTF file
data = Genome(
    reference_name='GRCh38',
    annotation_name='my_genome_features',
    gtf_path_or_url='/My/local/gtf/path_to_my_genome_features.gtf')
# parse GTF and construct database of genomic features
data.index()
# look up the names of genes overlapping a position
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
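Because GTF formats can vary, a quick structural check of a custom file before indexing may save time. The sketch below is not part of PyEnsembl. It only tests the general GTF convention of nine tab-separated columns per feature line, and the function name malformed_gtf_lines is hypothetical. It reads plain-text files only, so a compressed GTF would need to be opened differently.

```python
def malformed_gtf_lines(path, limit=10):
    """Return up to `limit` (line_number, field_count) pairs for feature
    lines that do not have nine tab-separated columns."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            stripped = line.rstrip("\n")
            # Skip blank lines and comment or header lines.
            if not stripped or stripped.startswith("#"):
                continue
            field_count = len(stripped.split("\t"))
            if field_count != 9:
                problems.append((line_number, field_count))
                if len(problems) >= limit:
                    break
    return problems
```

An empty result does not guarantee that the file will index cleanly. It only rules out the most basic layout problems.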

API

The EnsemblRelease object has methods to let you access all possible combinations of the annotation features gene_name, gene_id, transcript_name, transcript_id, exon_id as well as the location of these genomic elements (contig, start position, end position, strand).

Genes

genes(contig=None, strand=None)
Returns a list of Gene objects, optionally restricted to a particular contig or strand.
genes_at_locus(contig, position, end=None, strand=None)
Returns a list of Gene objects overlapping a particular position on a contig, optionally extended into a range with the end parameter and restricted to the forward or backward strand by passing strand='+' or strand='-'.
gene_by_id(gene_id)
Return a Gene object for given Ensembl gene ID (e.g. "ENSG00000068793").
gene_names(contig=None, strand=None)
Returns all gene names in the annotation database, optionally restricted to a particular contig or strand.
genes_by_name(gene_name)
Get all the unique genes with the given name (there might be multiple due to copies in the genome). Returns a list containing a Gene object for each distinct ID.
gene_by_protein_id(protein_id)
Find the Gene associated with the given Ensembl protein ID (e.g. "ENSP00000350283").
gene_names_at_locus(contig, position, end=None, strand=None)
Names of genes overlapping the given locus, optionally restricted by strand. Returns a list to account for overlapping genes.
gene_name_of_gene_id(gene_id)
Returns name of gene with given gene ID.
gene_name_of_transcript_id(transcript_id)
Returns name of gene associated with given transcript ID.
gene_name_of_transcript_name(transcript_name)
Returns name of gene associated with given transcript name.
gene_name_of_exon_id(exon_id)
Returns name of gene associated with given exon ID.
gene_ids(contig=None, strand=None)
Return all gene IDs in the annotation database, optionally restricted by chromosome name or strand.
gene_ids_of_gene_name(gene_name)
Returns all Ensembl gene IDs with the given name.
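The gene lookups above compose naturally. As a sketch, the hypothetical helper gene_ids_at_locus below chains gene_names_at_locus and gene_ids_of_gene_name to map every gene name at a position to its Ensembl IDs. It accepts any object exposing those two methods, so it makes no further assumptions about PyEnsembl.

```python
def gene_ids_at_locus(genome, contig, position):
    """Map each gene name overlapping a locus to its Ensembl gene IDs.

    `genome` is expected to provide gene_names_at_locus() and
    gene_ids_of_gene_name(), as listed above.
    """
    names = genome.gene_names_at_locus(contig=contig, position=position)
    return {name: list(genome.gene_ids_of_gene_name(name)) for name in names}


# With installed Ensembl data (not run here):
# from pyensembl import EnsemblRelease
# print(gene_ids_at_locus(EnsemblRelease(77), 6, 29945884))
```

Returning a dictionary keeps overlapping genes separate, which matters because a single locus can yield several names.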

Transcripts

transcripts(contig=None, strand=None)
Returns a list of Transcript objects for all transcript entries in the Ensembl database, optionally restricted to a particular contig or strand.
transcript_by_id(transcript_id)
Construct a Transcript object for given Ensembl transcript ID (e.g. "ENST00000369985").
transcripts_by_name(transcript_name)
Returns a list of Transcript objects for every transcript matching the given name.
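transcript_names(contig=None, strand=None)
Returns all transcript names in the annotation database, optionally restricted to a particular contig or strand.
transcript_ids(contig=None, strand=None)
Returns all transcript IDs in the annotation database, optionally restricted to a particular contig or strand.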
transcript_ids_of_gene_id(gene_id)
Return IDs of all transcripts associated with given gene ID.
transcript_ids_of_gene_name(gene_name)
Return IDs of all transcripts associated with given gene name.
transcript_ids_of_transcript_name(transcript_name)
Find all Ensembl transcript IDs with the given name.
transcript_ids_of_exon_id(exon_id)
Return IDs of all transcripts associated with given exon ID.
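To illustrate the transcript lookups, the sketch below collects transcript IDs for several gene names at once using transcript_ids_of_gene_name. The helper name transcript_ids_by_gene_name is hypothetical, and what happens for a gene name missing from the annotation is left to the underlying method.

```python
def transcript_ids_by_gene_name(genome, gene_names):
    """Return {gene_name: sorted transcript IDs} for the given gene names.

    `genome` is expected to provide transcript_ids_of_gene_name(), as
    listed above.
    """
    return {
        name: sorted(genome.transcript_ids_of_gene_name(name))
        for name in gene_names
    }


# With installed Ensembl data (not run here):
# from pyensembl import EnsemblRelease
# print(transcript_ids_by_gene_name(EnsemblRelease(77), ["HLA-A"]))
```

Sorting the IDs is a design choice that makes the output stable across runs.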

Exons

exon_ids(contig=None, strand=None)
Returns a list of exon IDs in the annotation database, optionally restricted by the given chromosome and strand.
exon_by_id(exon_id)
Construct an Exon object for given Ensembl exon ID (e.g. "ENSE00001209410")
exon_ids_of_gene_id(gene_id)
Returns a list of exon IDs associated with a given gene ID.
exon_ids_of_gene_name(gene_name)
Returns a list of exon IDs associated with a given gene name.
exon_ids_of_transcript_id(transcript_id)
Returns a list of exon IDs associated with a given transcript ID.
exon_ids_of_transcript_name(transcript_name)
Returns a list of exon IDs associated with a given transcript name.
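The exon and transcript lookups can be combined to see how exons are distributed across a gene's transcripts. The sketch below uses only transcript_ids_of_gene_name and exon_ids_of_transcript_id from the lists above. The helper names exon_ids_by_transcript and shared_exon_ids are hypothetical.

```python
from collections import Counter


def exon_ids_by_transcript(genome, gene_name):
    """Return {transcript_id: [exon_id, ...]} for every transcript of a gene."""
    return {
        transcript_id: list(genome.exon_ids_of_transcript_id(transcript_id))
        for transcript_id in genome.transcript_ids_of_gene_name(gene_name)
    }


def shared_exon_ids(genome, gene_name):
    """Return sorted exon IDs used by more than one transcript of the gene."""
    per_transcript = exon_ids_by_transcript(genome, gene_name)
    usage = Counter(
        exon_id
        for exon_ids in per_transcript.values()
        for exon_id in set(exon_ids)  # count each exon once per transcript
    )
    return sorted(exon_id for exon_id, count in usage.items() if count > 1)
```

Any object with those two methods works as the genome argument, for example an EnsemblRelease whose data has already been downloaded and indexed.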

More Repositories

1. mhcflurry: Peptide-MHC I binding affinity prediction (Python, 171 stars)
2. gtfparse: Parsing tools for GTF (gene transfer format) files (Python, 92 stars)
3. mhctools: Python interface to running command-line and web-based MHC binding predictors (Python, 78 stars)
4. varcode: Library for manipulating genomic variants and predicting their effects (Python, 75 stars)
5. neoantigen-vaccine-pipeline: Bioinformatics pipeline for selecting patient-specific cancer neoantigen vaccines (Jupyter Notebook, 68 stars)
6. vaxrank: Ranked vaccine peptides for personalized cancer immunotherapy (Python, 49 stars)
7. pepdata: Python interface to amino acid properties and IEDB (Python, 48 stars)
8. topiary: Predict mutated T-cell epitopes from sequencing data (Python, 27 stars)
9. isovar: Assembly of RNA reads to determine the effect of a cancer mutation on protein sequence (Python, 22 stars)
10. pepnet: Neural networks for amino acid sequences (Python, 20 stars)
11. varlens: commandline manipulation of genomic variants and NGS reads (Python, 19 stars)
12. gene-lists: Gene lists related to cancer immunotherapy (13 stars)
13. tcga-immune-deconvolution: Immune deconvolution of publicly available TCGA expression data (Jupyter Notebook, 11 stars)
14. mhcnames: All the fun and adventure of MHC naming, now in Python (Python, 10 stars)
15. datacache: Helpers for transparently downloading datasets (Python, 5 stars)
16. cancer-cell-line-mhc-alleles: Cell line HLA types and neoepitope catalog from TCLP (Jupyter Notebook, 4 stars)
17. mhcdouble: Class II MHC binding and antigen processing prediction (Python, 4 stars)
18. ott-wu-2017-data: Machine readable data from "An Immunogenic Personal Neoantigen Vaccine for Melanoma Patients" (Jupyter Notebook, 4 stars)
19. mhc2-data: Class II MHC data (Jupyter Notebook, 2 stars)
20. sahin-2017-data: Machine readable data from "Personalized RNA mutanome vaccines mobilize poly-specific therapeutic immunity against cancer" (1 star)
21. vaxrank-paper-2018: Repository for updated Vaxrank paper (TeX, 1 star)
22. proteopt: Common interface to protein design tools and structure predictors (Python, 1 star)
23. mhcflurry-motifs: Motifs for MHC I alleles as predicted by MHCflurry (Python, 1 star)
24. mhc2flurry: MHC class II binding predictor, under development (Jupyter Notebook, 1 star)
25. mhcflurry-web: Webapp for MHCflurry predictor (CSS, 1 star)
26. cov-2-mutations-by-lineage: Quick analysis to associate SARS-Cov-2 spike mutations with pangolin lineages using GISAID data (Jupyter Notebook, 1 star)