Readfq is a collection of routines for parsing the FASTA/FASTQ format. It seamlessly parses both FASTA and multi-line FASTQ with a simple interface.

Readfq was first implemented in a single C header file and then ported to Lua, Perl and Python as a single function of fewer than 50 lines. For users of scripting languages, I encourage you to copy and paste the function instead of using readfq as a library. It is always good to avoid unnecessary library dependencies.

Readfq also strives for efficiency. The C implementation is among the fastest (if not the fastest). The Python and Perl implementations are several to tens of times faster than the official Bio* implementations. If you can speed up readfq further, please let me know. I am not good at optimizing programs in scripting languages. Thank you.

As to licensing, the C implementation is distributed under the MIT license. Implementations in other languages are released without a license. Just copy and paste. You do not need to acknowledge me.

The following shows a brief example for each programming language:

    # Perl
    my @aux = undef; # this is for keeping intermediate data
    while (my ($name, $seq, $qual) = readfq(\*STDIN, \@aux)) { print "$seq\n"; }

    # Python: generator function
    import sys
    for name, seq, qual in readfq(sys.stdin):
        print(seq)

    -- Lua: closure
    for name, seq, qual in readfq(io.stdin) do print(seq) end

    /* Go */
    package main

    import (
        "bufio"
        "fmt"
        "os"

        "github.com/drio/drio.go/bio/fasta"
    )

    func main() {
        var fqr fasta.FqReader
        fqr.Reader = bufio.NewReader(os.Stdin)
        for r, done := fqr.Iter(); !done; r, done = fqr.Iter() {
            fmt.Println(r.Seq)
        }
    }

    /* C */
    #include <zlib.h>
    #include <stdio.h>
    #include "kseq.h"
    KSEQ_INIT(gzFile, gzread)

    int main() {
        gzFile fp;
        kseq_t *seq;
        fp = gzdopen(fileno(stdin), "r");
        seq = kseq_init(fp);
        while (kseq_read(seq) >= 0) puts(seq->seq.s);
        kseq_destroy(seq);
        gzclose(fp);
        return 0;
    }

Some naive benchmarks.
To convert a FASTQ containing 25 million 100bp reads to FASTA, FASTX-Toolkit (parsing 4-line FASTQ only) takes 325.0 CPU seconds and EMBOSS' seqret takes 247.8 seconds. My seqtk, which uses the kseq.h library, finishes the task in 24.6 seconds, about 10X faster. For retrieving 25k sequences by name from the same FASTQ, BioPython takes 963 seconds, while readfq.py takes 136 seconds; BioPerl takes more than 40 minutes (killed), while readfq.pl takes 273 seconds. Seqtk takes 29 seconds.
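To make "multi-line FASTQ" and the FASTQ-to-FASTA task above more concrete, here is a minimal Python sketch. It is not the readfq.py code and makes no claim about its speed; the names `iter_fastx` and `to_fasta` are hypothetical, chosen only for this illustration. The sketch assumes text input, treats a line starting with `>` or `@` as a header outside the quality block, and reads quality by length because a quality line may itself start with `@`.

```python
import io


def iter_fastx(handle):
    """Yield (name, seq, qual) tuples; qual is None for FASTA records."""
    lines = (ln.rstrip("\r\n") for ln in handle)

    def next_header():
        # Skip anything that is not a '>' or '@' header line.
        return next((ln for ln in lines if ln.startswith((">", "@"))), None)

    header = next_header()
    while header is not None:
        fields = header[1:].split()
        name = fields[0] if fields else ""
        seq_parts, plus_seen, pending = [], False, None
        for ln in lines:
            if ln.startswith((">", "@")):  # next record begins: FASTA case
                pending = ln
                break
            if ln.startswith("+"):         # separator line: FASTQ case
                plus_seen = True
                break
            seq_parts.append(ln.strip())
        seq = "".join(seq_parts)
        if not plus_seen:
            yield name, seq, None
            header = pending
            continue
        # Quality may span several lines and may start with '@', so read
        # by length instead of looking for the next header character.
        qual_parts, qlen = [], 0
        while qlen < len(seq):
            ln = next(lines, None)
            if ln is None:
                raise ValueError(f"truncated quality for record {name!r}")
            qual_parts.append(ln)
            qlen += len(ln)
        if qlen != len(seq):
            raise ValueError(f"sequence/quality length mismatch in {name!r}")
        yield name, seq, "".join(qual_parts)
        header = next_header()


def to_fasta(records):
    """Format (name, seq, qual) tuples as FASTA text, dropping qualities."""
    for name, seq, _qual in records:
        yield f">{name}\n{seq}\n"


# One FASTA record followed by one multi-line FASTQ record whose
# quality string starts with '@'.
demo = io.StringIO(
    ">chr1 test\nACGT\nAC\n"
    "@read1\nGATT\nACA\n+\n@III\nIII\n"
)
records = list(iter_fastx(demo))
fasta_text = "".join(to_fasta(records))
```

Because the generator yields plain tuples, the same loop shape used in the examples above (`for name, seq, qual in ...`) works for both formats; retrieving sequences by name would only need a set-membership check on `name` inside that loop.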
minimap2: A versatile pairwise aligner for genomic and spliced nucleotide sequences
bwa: Burrows-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
seqtk: Toolkit for processing sequences in FASTA/Q formats
bioawk: BWK awk modified for biological data
minigraph: Sequence-to-graph mapper and graph generator
miniprot: Align proteins to genomes with splicing and frameshift
miniasm: Ultrafast de novo assembly for long noisy reads (though having no consensus step)
wgsim: Reads simulator
gfatools: Tools for manipulating sequence graphs in the GFA and rGFA formats
biofast: Benchmarking programming languages/implementations for common tasks in Bioinformatics
cgranges: A C/C++ library for fast interval overlap queries (with a "bedtools coverage" example)
kmer-cnt: Code examples of fast and simple k-mer counters for tutorial purposes
pangene: Constructing a pangenome gene graph
psmc: Implementation of the Pairwise Sequentially Markovian Coalescent (PSMC) model
bedtk: A simple toolset for BED files (warning: CLI may change before bedtk becomes stable)
ksw2: Global alignment and alignment extension
yak: Yet another k-mer analyzer
fermikit: De novo assembly based variant calling pipeline for Illumina short reads
minimap: This repo is DEPRECATED. Please use minimap2, the successor of minimap.
hickit: TAD calling, phase imputation, 3D modeling and more for diploid single-cell Hi-C (Dip-C) and general Hi-C
bgt: Flexible genotype query among 30,000+ samples whole-genome
dipcall: Reference-based variant calling pipeline for a pair of phased haplotype assemblies
srf: SRF: Satellite Repeat Finder
unimap: An EXPERIMENTAL fork of minimap2 optimized for assembly-to-reference alignment
minipileup: Simple pileup-based variant caller
fermi: A WGS de novo assembler based on the FMD-index for large genomes
dna-nn: Model and predict short DNA sequence features with neural networks
fermi-lite: Standalone C library for assembling Illumina short reads in small regions
bfc: High-performance error correction for Illumina resequencing data
ropebwt2: Incremental construction of FM-index for DNA sequences
tabtk: Toolkit for processing TAB-delimited format
gwfa: Proof-of-concept implementation of GWFA for sequence-to-graph alignment
CHM-eval
miniwfa: A reimplementation of the WaveFront Alignment algorithm at low memory
jstreeview: Interactive phylogenetic tree viewer/editor
samtools: This is *NOT* the official repository of samtools.
etrf: Exact Tandem Repeat Finder (not a TRF replacement)
ref-gen: Human reference genome analysis sets
bioseq-js: For live demo, see http://lh3lh3.users.sourceforge.net/bioseq.shtml
lv89: C implementation of the Landau-Vishkin algorithm
partig: An experimental tool to estimate the similarity between all pairs of contigs
asub: A unified array job submitter for LSF, SGE/UGE and Slurm
klib.nim: Experimental getopt, gzip reader, FASTA/Q parser and interval queries in nim-lang
calN50: Compute N50/NG50 and auN/auNG
sdust: Symmetric DUST for finding low-complexity regions in DNA sequences
gffio
pre-pe: Preprocessing paired-end reads produced with experiment-specific protocols
hapdip: The CHM1-NA12878 benchmark for single-sample SNP/INDEL calling from WGS Illumina data
fermi2
misc: Useful small programs
varcmp: The first CHM1 paper (Li, 2014)
minisv: Lightweight mosaic/somatic SV caller for long reads (WIP)
lianti: Tools to process LIANTI sequence data
rtgeval: Wrapper for RTG's vcfeval; DEPRECATED!
nasw: Dynamic programming for aa-to-nt alignment with affine gap, splicing and frameshift
sgdp-fermi: FermiKit small variant calls for public SGDP samples
gfa1: This repo is deprecated. Please use gfatools instead.
pubLRasm
PortableCrystal: Portable Crystal binary distributions for Linux on x86_64
foreign: Modified or extracted from other programs
trimadap: Fast but inaccurate adapter trimmer for Illumina reads
lh3-snippets
ropebwt3: Construction and utility of BWT for DNA string sets
treebest: TreeBeST: Tree Building guided by Species Tree
unicall: A wrapper for calling small variants from human germline high-coverage single-sample Illumina data
fastARG: Fast heuristic ARG construction
proot-wrapper: Demonstrating the PRoot program
rmaxcut: An experimental tool to find approximate max-cuts in a large graph
bwa-docker: Minimal docker image for bwa. Not developed any more.
sdg: EXPERIMENTAL implementation of side graph
naivepca: Naive PCA for genotype data
mdust: mdust from DFCI Gene Indices Software Tools (archived for a historical record only)
editdist-U85: Fast implementation of Ukkonen's O(ND) algorithm for computing edit distance
lh3.github.com
libdivsufsort: Automatically exported from code.google.com/p/libdivsufsort
mem-paper: Manuscript for BWA-MEM
bcf2: Experimental bcftools port to support BCF2; DEPRECATED by htslib and htsbox
thesis: PhD thesis
ropebwt
fermi-paper: The first fermi paper (Li, 2012)
crlf: Concise Run-Length Format for small alphabets; DEPRECATED
psnw: prototype
centos5-vm: Instructions on how to deploy CentOS 5 virtual machines
mag2gfa: DEPRECATED. Code has been moved to lh3/gfa1/misc
ibsget: Download files from Illumina BaseSpace (*OUTDATED* as BaseSpace has changed APIs)
smtl-paper: Samtools statistics paper (Li, 2011)
mssa-bench: Evaluating the performance of multi-string SA construction
samtools-legacy: For testing only. DON'T USE!