A fast 23andMe DNA parser and inferrer for Python

arv — a fast 23andMe parser for Python

Arv (Norwegian; "heritage" or "inheritance") is a Python module for parsing raw 23andMe genome files. It lets you look up SNPs by RSID.

from arv import load, unphased_match as match

genome = load("genome.txt")

print("You are a {gender} with {color} eyes and {complexion} skin.".format(
  gender     = "man" if genome.y_chromosome else "woman",
  complexion = "light" if genome["rs1426654"] == "AA" else "dark",
  color      = match(genome["rs12913832"], {"AA": "brown",
                                            "AG": "brown or green",
                                            "GG": "blue"})))

For my genome, this little program produces:

You are a man with blue eyes and light skin.

The parser is insanely fast, having been written in finely tuned C++, exposed via Cython. A 2013 Xeon machine I've tested on parses a 24 MB file into a hash table in about 78 ms. The newer 23andMe files are smaller, and parse in a mere 62 ms!
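
To illustrate what parsing a genome file into a hash table amounts to, here is a minimal pure-Python sketch. It is not arv's C++ parser and will be far slower. It assumes the usual 23andMe raw layout of tab-separated rsid, chromosome, position and genotype columns with `#` comment lines; the names `parse_23andme` and `Snp` are made up for this illustration.

```python
from collections import namedtuple

# Hypothetical record type for this sketch; arv's own SNP class differs.
Snp = namedtuple("Snp", "chromosome position genotype")


def parse_23andme(lines):
    """Parse raw 23andMe text lines into a dict keyed by RSID."""
    snps = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#' comment header.
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        snps[rsid] = Snp(chromosome, int(position), genotype)
    return snps


# Illustrative sample lines in the assumed layout, not a real genome.
sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs4477212\t1\t82154\tAA",
    "rs3094315\t1\t752566\tAG",
]
genome = parse_23andme(sample)
print(genome["rs3094315"].genotype)
```

With a real file you would pass an open file object instead of the sample list, since iterating over a file yields its lines.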

Works with Python 2.7+ and 3+. Installable with pip!

$ pip install --upgrade arv

See below for software requirements.

Important disclaimer

It's very important to tell you that I, the author of arv, am merely a hobbyist! I am a professional software developer, but not a geneticist, biologist, medical doctor or anything like that.

Because of that, this software may not only look weird to people in the field, it may also contain serious errors. If you find any problem whatsoever, please submit a GitHub issue.

This is a slightly modified version of what I wrote for the original software called "dna-traits", and the same goes for this software:

In addition to the GPL v3 licensing terms, and given that this code deals with health-related issues, I want to stress that the provided code most likely contains errors, or invalid genome reports. Results from this code must be interpreted as HIGHLY SPECULATIVE and may even be downright INCORRECT. Always consult an expert (medical doctor, geneticist, etc.) for guidance. I take NO RESPONSIBILITY whatsoever for any consequences of using this code, including but not limited to loss of life, money, spouses, self-esteem and so on. Use at YOUR OWN RISK.

The intended use is for casual, educational purposes. If this code is used for research purposes, please cross-check key results with other software: the parser code may contain serious errors, for example.

An interesting story about the research part: I once released a pretty good Mersenne Twister PRNG for C++ that ended up being used in research. Turned out the engine had bugs, and by the time I had fixed them, a poor researcher had already produced results with it (hopefully not published; I don't know). The guy had to go back and fix his stuff, and I felt terribly bad about it.

So beware!

Installation

The recommended way is to install from PyPI.

$ pip install arv

This will most likely build Arv from source. The package will automatically install Cython, but it doesn't check if you have a C++11 compiler. Furthermore, it passes some additional compilation flags that are specific to clang/gcc.

If you have problems running pip install arv, please open an issue on GitHub with as much detail as possible (g++/clang++ --version, uname -a, python --version and so on).

If you set the environment variable ARV_DEBUG, it will build with full warnings and debug symbols.

You can also install it locally through setup.py. The following builds and tests, but does not install, arv:

$ python setup.py test

If you set the environment variable ARV_BENCHMARK to a genome filename and run the tests, it will perform a short benchmark, reporting the best parsing time on it. You can also set ARV_BENCHMARK_COUNT=<number> to change how many times it should parse the given file.
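
The best-of-N timing idea behind such a benchmark can be sketched in plain Python 3. This is not arv's test code: `best_time` is a hypothetical helper, and `parse` stands in for whatever loader you want to time, for instance `lambda: arv.load(filename)`.

```python
import time


def best_time(parse, count=10):
    """Call parse() count times and return the fastest run in seconds."""
    best = float("inf")
    for _ in range(count):
        start = time.perf_counter()
        parse()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


# Stand-in workload so the sketch runs without a genome file.
seconds = best_time(lambda: sum(range(100000)), count=5)
print("best: {:.3f} ms".format(seconds * 1000))
```

Reporting the minimum rather than the mean is a common choice for this kind of measurement, because it is the run least disturbed by other activity on the machine.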

Usage

First you need to download the raw genome file from 23andMe. You'll find it under the raw genome browser. You may have to unzip it first: the parser works on plain text files.

Then you load the genome in Python with

>>> import arv
>>> genome = arv.load("filename.txt")
>>> genome
<Genome: SNPs=960613, name='filename.txt'>

To see whether a Y chromosome is present in the genome:

>>> genome.y_chromosome
True

The genome provides a dict-like interface. To get a given SNP, just enter the RSID.

>>> snp = genome["rs123"]
>>> snp
<SNP: chromosome=7 position=24966446 genotype='AA'>
>>> snp.chromosome
7
>>> snp.position
24966446
>>> snp.genotype
<Genotype 'AA'>

The Genotype object can be converted to a string with str, but it also allows rich comparisons with strings directly:

>>> snp.genotype == "AA"
True

You can get its complement with the ~ operator.

>>> type(snp.genotype)
<class '_arv.Genotype'>
>>> ~snp.genotype
<Genotype 'TT'>

The complement is important because of each SNP's orientation. All of 23andMe's SNPs are oriented towards the positive ("plus") strand, based on the GRCh37 reference human genome assembly build. But some SNPs on SNPedia are given with the minus orientation.
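
As a rough illustration of what complementing does, each base is swapped for its pairing partner (A with T, C with G). The sketch below works on plain strings; `complement` is a hypothetical helper, not arv's `~` operator, and passing any other character through unchanged is an assumption of this sketch.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(genotype):
    """Return the genotype as read from the opposite strand."""
    # Characters outside A/C/G/T are passed through unchanged (assumption).
    return "".join(PAIRS.get(base, base) for base in genotype)


print(complement("AA"))  # TT
print(complement("AG"))  # TC
```

Applying the function twice gives back the original string, which is a quick sanity check when converting between orientations.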

For example, to determine if the human in question is likely lactose tolerant or not, we can look at rs4988235. SNPedia reports its Stabilized orientation to be minus, so we need to use the complement:

>>> genome["rs4988235"].genotype
<Genotype 'AA'>
>>> ~genome["rs4988235"].genotype
<Genotype 'TT'>

By reading a few GWAS research papers, we can build a rule to determine a human's likelihood for lactose tolerance:

>>> arv.unphased_match(~genome["rs4988235"].genotype, {
    "TT": "Likely lactose tolerant",
    "TC": "Likely lactose tolerant",
    "CC": "Likely lactose intolerant",
    None: "Unable to determine (genotype not present)"})
'Likely lactose tolerant'
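
"Unphased" here means that the order of the two alleles carries no information, so TC and CT should hit the same rule. A minimal sketch of that idea on plain strings follows. `unphased_lookup` is a hypothetical stand-in written for this illustration, not arv's `unphased_match`, and treating the `None` key as the fallback is an assumption based on the example above.

```python
def unphased_lookup(genotype, rules):
    """Look up a genotype string in rules, ignoring allele order."""
    wanted = sorted(genotype)
    for key, result in rules.items():
        # Compare sorted alleles so that "TC" and "CT" are equivalent.
        if key is not None and sorted(key) == wanted:
            return result
    # No rule matched: fall back to the None entry, if there is one.
    return rules.get(None)


rules = {
    "TT": "Likely lactose tolerant",
    "TC": "Likely lactose tolerant",
    "CC": "Likely lactose intolerant",
    None: "Unable to determine (genotype not present)",
}
print(unphased_lookup("CT", rules))  # Likely lactose tolerant
```

Sorting the alleles before comparing means each heterozygous rule only has to be written once.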

Note that reading GWAS papers can be a bit tricky for hobbyists. If you are one, be sure to spend some time reading the paper closely and looking up SNPs on sites like SNPedia, dbSNP and OpenSNP. Finally, have fun, but be extremely careful about drawing conclusions from your results.

Command line interface

You can also invoke arv from the command line:

$ python -m arv --help

For example, you can drop into a Python REPL like so:

$ python -m arv --repl genome.txt
genome.txt ... 960614 SNPs, male
Type `genome` to see the parsed 23andMe raw genome file
>>> genome
<Genome: SNPs=960614, name='genome.txt'>
>>> genome["rs123"]
<SNP: chromosome=7 position=24966446 genotype=<Genotype 'AA'>>

If you specify several files, you can access them through the variable genomes.

The example at the top of this document can be run with --example:

$ python -m arv --example genome.txt
genome.txt ... 960614 SNPs, male

genome.txt ... A man with blue eyes and light skin

License

Copyright 2017 Christian Stigen Larsen

Distributed under the GNU GPL v3 or later. See the file COPYING for the full license text. This software makes use of open source software; see LICENSES for details.
