CODE FREEZE:
PyGeno has long been limited due to it's backend. We are now ready to take it to the next level.
We are working on a major port of pyGeno to the open-source multi-modal database ArangoDB. PyGeno's code on both branches master and bloody is frozen until we are finished. No pull request will be merged until then, and we won't implement any new features.
pyGeno: A Python package for precision medicine and proteogenomics
pyGeno is (to our knowledge) the only tool available that will gladly build your specific genomes for you.
pyGeno is developed by Tariq Daouda at the Institute for Research in Immunology and Cancer (IRIC), its logo is the work of the freelance designer Sawssan Kaddoura. For the latest news about pyGeno, you can follow me on twitter @tariqdaouda.
Click here for The full documentation.
For the latest news about pyGeno, you can follow me on twitter @tariqdaouda.
Citing pyGeno:
Please cite this paper.
Installation:
It is recommended to install pyGeno within a virtual environement, to setup one you can use:
virtualenv ~/.pyGenoEnv
source ~/.pyGenoEnv/bin/activate
pyGeno can be installed through pip:
pip install pyGeno #for the latest stable version
Or github, for the latest developments:
git clone https://github.com/tariqdaouda/pyGeno.git
cd pyGeno
python setup.py develop
A brief introduction
pyGeno is a personal bioinformatic database that runs directly into python, on your laptop and does not depend upon any REST API. pyGeno is here to make extracting data such as gene sequences a breeze, and is designed to be able cope with huge queries. The most exciting feature of pyGeno, is that it allows to work with seamlessly with both reference and Personalized Genomes.
Personalized Genomes, are custom genomes that you create by combining a reference genome, sets of polymorphisms and an optional filter. pyGeno will take care of applying the filter and inserting the polymorphisms at their right place, so you get direct access to the DNA and Protein sequences of your patients.
from pyGeno.Genome import *
g = Genome(name = "GRCh37.75")
prot = g.get(Protein, id = 'ENSP00000438917')[0]
#print the protein sequence
print prot.sequence
#print the protein's gene biotype
print prot.gene.biotype
#print protein's transcript sequence
print prot.transcript.sequence
#fancy queries
for exon in g.get(Exon, {"CDS_start >": x1, "CDS_end <=" : x2, "chromosome.number" : "22"}) :
#print the exon's coding sequence
print exon.CDS
#print the exon's transcript sequence
print exon.transcript.sequence
#You can do the same for your subject specific genomes
#by combining a reference genome with polymorphisms
g = Genome(name = "GRCh37.75", SNPs = ["STY21_RNA"], SNPFilter = MyFilter())
And if you ever get lost, there's an online help() function for each object type:
from pyGeno.Genome import *
print Exon.help()
Should output:
Available fields for Exon: CDS_start, end, chromosome, CDS_length, frame, number, CDS_end, start, genome, length, protein, gene, transcript, id, strand
Creating a Personalized Genome:
Personalized Genomes are a powerful feature that allow you to work on the specific genomes and proteomes of your patients. You can even mix several SNP sets together.
from pyGeno.Genome import Genome
#the name of the snp set is defined inside the datawrap's manifest.ini file
dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY')
#you can also define a filter (ex: a quality filter) for the SNPs
dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY', SNPFilter = myFilter())
#and even mix several snp sets
dummy = Genome(name = 'GRCh37.75', SNPs = ['dummySRY', 'anotherSet'], SNPFilter = myFilter())
Filtering SNPs:
pyGeno allows you to select the Polymorphisms that end up into the final sequences. It supports SNPs, Inserts and Deletions.
from pyGeno.SNPFiltering import SNPFilter, SequenceSNP
class QMax_gt_filter(SNPFilter) :
def __init__(self, threshold) :
self.threshold = threshold
#Here SNPs is a dictionary: SNPSet Name => polymorphism
#This filter ignores deletions and insertions and
#but applis all SNPs
def filter(self, chromosome, **SNPs) :
sources = {}
alleles = []
for snpSet, snp in SNPs.iteritems() :
pos = snp.start
if snp.alt[0] == '-' :
pass
elif snp.ref[0] == '-' :
pass
else :
sources[snpSet] = snp
alleles.append(snp.alt) #if not an indel append the polymorphism
#appends the refence allele to the lot
refAllele = chromosome.refSequence[pos]
alleles.append(refAllele)
sources['ref'] = refAllele
#optional we keep a record of the polymorphisms that were used during the process
return SequenceSNP(alleles, sources = sources)
The filter function can also be made more specific by using arguments that have the same names as the SNPSets
def filter(self, chromosome, dummySRY = None) :
if dummySRY.Qmax_gt > self.threshold :
#other possibilities of return are SequenceInsert(<bases>), SequenceDelete(<length>)
return SequenceSNP(dummySRY.alt)
return None #None means keep the reference allele
To apply the filter simply specify if while loading the genome.
persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))
To include several SNPSets use a list.
persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = ['ARN_P1', 'ARN_P2'], SNPFilter = myFilter())
Getting an arbitrary sequence:
You can ask for any sequence of any chromosome:
chr12 = myGenome.get(Chromosome, number = "12")[0]
print chr12.sequence[x1:x2]
# for the reference sequence
print chr12.refSequence[x1:x2]
Batteries included (bootstraping):
pyGeno's database is populated by importing datawraps. pyGeno comes with a few data wraps, to get the list you can use:
import pyGeno.bootstrap as B
B.printDatawraps()
Available datawraps for boostraping SNPs ~~~~| |~~~:> Human_agnostic.dummySRY.tar.gz |~~~:> Human.dummySRY_casava.tar.gz |~~~:> dbSNP142_human_common_all.tar.gz Genomes ~~~~~~~| |~~~:> Human.GRCh37.75.tar.gz |~~~:> Human.GRCh37.75_Y-Only.tar.gz |~~~:> Human.GRCh38.78.tar.gz |~~~:> Mouse.GRCm38.78.tar.gz
To get a list of remote datawraps that pyGeno can download for you, do:
B.printRemoteDatawraps()
Importing whole genomes is a demanding process that take more than an hour and requires (according to tests) at least 3GB of memory. Depending on your configuration, more might be required.
That being said importating a data wrap is a one time operation and once the importation is complete the datawrap can be discarded without consequences.
The bootstrap module also has some handy functions for importing built-in packages.
Some of them just for playing around with pyGeno (Fast importation and Small memory requirements):
import pyGeno.bootstrap as B
#Imports only the Y chromosome from the human reference genome GRCh37.75
#Very fast, requires even less memory. No download required.
B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
#A dummy datawrap for humans SNPs and Indels in pyGeno's AgnosticSNP format.
# This one has one SNP at the begining of the gene SRY
B.importSNPs("Human.dummySRY_casava.tar.gz")
And for more Serious Work, the whole reference genome.
#Downloads the whole genome (205MB, sequences + annotations), may take an hour or more.
B.importGenome("Human.GRCh38.78.tar.gz")
Importing a custom datawrap:
from pyGeno.importation.Genomes import *
importGenome('GRCh37.75.tar.gz')
To import a patient's specific polymorphisms
from pyGeno.importation.SNPs import *
importSNPs('patient1.tar.gz')
For a list of available datawraps available for download, please have a look here.
You can easily make your own datawraps with any tar.gz compressor. For more details on how datawraps are made you can check wiki or have a look inside the folder bootstrap_data.
Instanciating a genome:
from pyGeno.Genome import Genome
#the name of the genome is defined inside the package's manifest.ini file
ref = Genome(name = 'GRCh37.75')
Printing all the proteins of a gene:
from pyGeno.Genome import Genome
from pyGeno.Gene import Gene
from pyGeno.Protein import Protein
Or simply:
from pyGeno.Genome import *
then:
ref = Genome(name = 'GRCh37.75')
#get returns a list of elements
gene = ref.get(Gene, name = 'TPST2')[0]
for prot in gene.get(Protein) :
print prot.sequence
Making queries, get() Vs iterGet():
iterGet is a faster version of get that returns an iterator instead of a list.
Making queries, syntax:
pyGeno's get function uses the expressivity of rabaDB.
These are all possible query formats:
ref.get(Gene, name = "SRY")
ref.get(Gene, { "name like" : "HLA"})
chr12.get(Exon, { "start >=" : 12000, "end <" : 12300 })
ref.get(Transcript, { "gene.name" : 'SRY' })
Creating indexes to speed up queries:
from pyGeno.Gene import Gene
#creating an index on gene names if it does not already exist
Gene.ensureGlobalIndex('name')
#removing the index
Gene.dropIndex('name')
Find in sequences:
Internally pyGeno uses a binary representation for nucleotides and amino acids to deal with polymorphisms. For example,both "AGC" and "ATG" will match the following sequence "...AT/GCCG...".
#returns the position of the first occurence
transcript.find("AT/GCCG")
#returns the positions of all occurences
transcript.findAll("AT/GCCG")
#similarly, you can also do
transcript.findIncDNA("AT/GCCG")
transcript.findAllIncDNA("AT/GCCG")
transcript.findInUTR3("AT/GCCG")
transcript.findAllInUTR3("AT/GCCG")
transcript.findInUTR5("AT/GCCG")
transcript.findAllInUTR5("AT/GCCG")
#same for proteins
protein.find("DEV/RDEM")
protein.findAll("DEV/RDEM")
#and for exons
exon.find("AT/GCCG")
exon.findAll("AT/GCCG")
exon.findInCDS("AT/GCCG")
exon.findAllInCDS("AT/GCCG")
#...
Progress Bar:
from pyGeno.tools.ProgressBar import ProgressBar
pg = ProgressBar(nbEpochs = 155)
for i in range(155) :
pg.update(label = '%d' %i) # or simply p.update()
pg.close()