• Stars
    star
    138
  • Rank 263,544 (Top 6 %)
  • Language
    Python
  • License
    MIT License
  • Created over 8 years ago
  • Updated over 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

🐞 Convert NCBI taxonomy dump into lineages

NCBItax2lin

Downloads

Convert NCBI taxonomy dump into lineages. An example for human (tax_id=9606) is like

tax_id superkingdom phylum class order family genus species family1 forma genus1 infraclass infraorder kingdom no rank no rank1 no rank10 no rank11 no rank12 no rank13 no rank14 no rank15 no rank16 no rank17 no rank18 no rank19 no rank2 no rank20 no rank21 no rank22 no rank3 no rank4 no rank5 no rank6 no rank7 no rank8 no rank9 parvorder species group species subgroup species1 subclass subfamily subgenus subkingdom suborder subphylum subspecies subtribe superclass superfamily superorder superorder1 superphylum tribe varietas
9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens Simiiformes Metazoa cellular organisms Opisthokonta Dipnotetrapodomorpha Tetrapoda Amniota Theria Eutheria Boreoeutheria Eumetazoa Bilateria Deuterostomia Vertebrata Gnathostomata Teleostomi Euteleostomi Sarcopterygii Catarrhini Homininae Haplorrhini Craniata Hominoidea Euarchontoglires

Install

ncbitax2lin supports python-3.7, python-3.8, and python-3.9.

pip install -U ncbitax2lin

Generate lineages

First download taxonomy dump from NCBI:

wget -N ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir -p taxdump && tar zxf taxdump.tar.gz -C ./taxdump

Then, run ncbitax2lin

ncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp

By default, the generated lineages will be saved to ncbi_lineages_[date_of_utcnow].csv.gz. The output file can be overwritten with --output option.

FAQ

Q: I have a large number of sequences with their corresponding accession numbers from NCBI, how to get their lineages?

A: First, you need to map accession numbers (GI is deprecated) to tax IDs based on nucl_*accession2taxid.gz files from ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/. Secondly, you can trace a sequence's whole lineage based on its tax ID. The tax-id-to-lineage mapping is what NCBItax2lin can generate for you.

If you have any question about this project, please feel free to create a new issue.

Note on taxdump.tar.gz.md5

It appears that NCBI periodically regenerates taxdump.tar.gz and taxdump.tar.gz.md5 even when its content is still the same. I am not sure how their regeneration works, but taxdump.tar.gz.md5 will differ simply because of a different timestamp.

Used in

  • Mahmoudabadi, G., & Phillips, R. (2018). A comprehensive and quantitative exploration of thousands of viral genomes. ELife, 7. https://doi.org/10.7554/eLife.31955
  • Dombrowski, N. et al. (2020) Undinarchaeota illuminate DPANN phylogeny and the impact of gene transfer on archaeal evolution, Nature Communications. Springer US, 11(1). doi: 10.1038/s41467-020-17408-w. https://www.nature.com/articles/s41467-020-17408-w
  • Schenberger Santos, A. R. et al. (2020) NAD+ biosynthesis in bacteria is controlled by global carbon/ nitrogen levels via PII signaling, Journal of Biological Chemistry, 295(18), pp. 6165–6176. doi: 10.1074/jbc.RA120.012793. https://www.sciencedirect.com/science/article/pii/S0021925817482433
  • Villada, J. C., Duran, M. F. and Lee, P. K. H. (2020) Interplay between Position-Dependent Codon Usage Bias and Hydrogen Bonding at the 5' End of ORFeomes, mSystems, 5(4), pp. 1–18. doi: 10.1128/msystems.00613-20. https://msystems.asm.org/content/5/4/e00613-20
  • Byadgi, O. et al. (2020) Transcriptome analysis of amyloodinium ocellatum tomonts revealed basic information on the major potential virulence factors, Genes, 11(11), pp. 1–12. doi: 10.3390/genes11111252. https://www.mdpi.com/2073-4425/11/11/1252

Development

Install dependencies

poetry shell
poetry install

Testing

make format
make all

Publish (only for administrator)

poetry version [minor/major etc.]
poetry publish --build

More Repositories

1

sutton-barto-rl-exercises

πŸ“–Learning reinforcement learning by implementing the algorithms from reinforcement learning an introduction
Jupyter Notebook
82
star
2

biogrinder

Grinder is a versatile open-source bioinformatic tool to create simulated omic shotgun and amplicon sequence libraries for all main sequencing platforms.
Perl
17
star
3

gtf2csv

Convert genome annotation GTF file into plain CSV format
Jupyter Notebook
13
star
4

pdb-secondary-structure

Transform data from RCSB Protein Data Bank for secondary structure prediction
Jupyter Notebook
12
star
5

stanford-cs224n

Exercise answers to the problem sets from CS224n: Natural Language Processing with Deep Learning Winter quarter (January - March, 2017)
Jupyter Notebook
10
star
6

youtube_RL_course_by_David_Silver

Collection of Slides for David Silver's Reinforcement Learning Course on Youtube
6
star
7

rljs

RLjs currently serves as an interactive playground for learning reinforcement learning.
JavaScript
3
star
8

stanford-cs231n

Exercise answers to the problem sets from CS231n: Convolutional Neural Networks for Visual Recognition, 2016
Jupyter Notebook
2
star
9

visannot

Visualize genome annotation in a very responsive way
Jupyter Notebook
2
star
10

vr2c

Visualize read-to-contig alignment during assembly analysis
Python
2
star
11

DTLA_crime_analysis

Analysis of crime data in downtown Los Angeles
Jupyter Notebook
1
star
12

getting-started-with-d3-organized-code-snippets

Organized code snippets for the book Getting Started with D3 by Mike Dewar
HTML
1
star
13

bio-seq2seq-attention

Neural seq2seq model + attention with applications for biological sequences in mind
Python
1
star
14

stanford-cs20si-tensorflow-for-deep-learning-research

Python
1
star
15

train_models

Python
1
star
16

pandas-gff3-demo

Jupyter Notebook
1
star
17

gromacs_top_4.5.5

Python
1
star
18

zyxue.github.io

My technical blog
SCSS
1
star
19

open-data-catalogue-vancouver

Jupyter Notebook
1
star
20

mit-statistics-for-applications

Jupyter Notebook
1
star
21

pydr

pydr_zyxue
Python
1
star
22

scala-impatient-exercise-answers

Jupyter Notebook
1
star
23

pubmed-word2vec

EDA of word2vec results based on pubmed
Jupyter Notebook
1
star
24

MapspliceRSEM-clean-doc

1
star
25

SPIDER3

Python
1
star