lh3/seqtk

Stars: 1,363 (rank 34,478, top 0.7%)
Language: C
License: MIT License
Created over 12 years ago; updated 3 months ago

Toolkit for processing sequences in FASTA/Q formats

Introduction

Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files, which may optionally be gzip-compressed. To install seqtk:

```
  git clone https://github.com/lh3/seqtk.git;
  cd seqtk; make
```

The only library dependency is zlib.

Seqtk Examples

Convert FASTQ to FASTA:
```
  seqtk seq -a in.fq.gz > out.fa
```
Convert Illumina 1.3+ FASTQ to FASTA and mask bases with quality lower than 20 to lowercase (the 1st command line) or to N (the 2nd):
```
  seqtk seq -aQ64 -q20 in.fq > out.fa
  seqtk seq -aQ64 -q20 -n N in.fq > out.fa
```
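
To make the quality threshold concrete, here is a minimal Python sketch of the masking idea. It is an illustration only, not seqtk's code: the function name `mask_low_quality` and its parameters are hypothetical. It assumes quality characters encode Phred scores as `ord(char) - offset`, with offset 64 for Illumina 1.3+ and 33 for Sanger-style files.

```python
def mask_low_quality(seq, qual, offset=64, threshold=20, mask_char=None):
    """Mask bases whose Phred score is below `threshold`.

    Hypothetical helper for illustration; not part of seqtk.
    offset=64 is the Illumina 1.3+ encoding (Sanger-style files use 33).
    With mask_char=None, low-quality bases are lowercased;
    otherwise they are replaced by mask_char (for example "N").
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    out = []
    for base, q in zip(seq, qual):
        score = ord(q) - offset  # decode the Phred score
        if score < threshold:
            out.append(base.lower() if mask_char is None else mask_char)
        else:
            out.append(base)
    return "".join(out)


# 'h' decodes to 40 and 'S' to 19 under the +64 encoding.
print(mask_low_quality("ACGT", "hhSh"))                 # ACgT
print(mask_low_quality("ACGT", "hhSh", mask_char="N"))  # ACNT
```

A base with a score of exactly 20 is left untouched, since only scores lower than the threshold are masked.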
Fold long FASTA/Q lines and remove FASTA/Q comments:
```
  seqtk seq -Cl60 in.fa > out.fa
```
Convert multi-line FASTQ to 4-line FASTQ:
```
  seqtk seq -l0 in.fq > out.fq
```
Reverse complement FASTA/Q:
```
  seqtk seq -r in.fq > out.fq
```
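
For readers unfamiliar with the operation, the following Python sketch shows what reverse complementing a record means. It is a simplified illustration under stated assumptions, not seqtk's implementation: the function names are hypothetical, and only A, C, G, T, U and N are handled (other IUPAC codes pass through unchanged). For FASTQ, the quality string is reversed so that each score stays attached to its base.

```python
# Translation table for the bases this sketch handles (assumption: no
# other IUPAC ambiguity codes need complementing).
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_record(seq, qual):
    """Reverse complement a FASTQ record; qualities are only reversed."""
    return reverse_complement(seq), qual[::-1]


print(reverse_complement("AACGTN"))               # NACGTT
print(reverse_complement_record("ACCG", "IIH#"))  # ('CGGT', '#HII')
```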
Extract sequences with names in file name.lst, one sequence name per line:
```
  seqtk subseq in.fq name.lst > out.fq
```
Extract sequences in regions contained in file reg.bed:
```
  seqtk subseq in.fa reg.bed > out.fa
```
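
When preparing reg.bed, keep in mind that BED coordinates are 0-based and half-open: the start is inclusive and the end is exclusive. The Python sketch below illustrates that convention on in-memory data. The helper `extract_bed_regions` is hypothetical and makes no claim about how seqtk names or formats its output records.

```python
def extract_bed_regions(sequences, bed_lines):
    """Yield (chrom, start, end, subsequence) for each BED line.

    Hypothetical helper for illustration. `sequences` maps sequence
    names to strings; `bed_lines` is an iterable of BED text lines.
    BED intervals are 0-based and half-open, which matches Python
    slicing directly.
    """
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        yield chrom, start, end, sequences[chrom][start:end]


seqs = {"chr1": "ACGTACGTAC"}
for region in extract_bed_regions(seqs, ["chr1\t2\t6"]):
    print(region)  # ('chr1', 2, 6, 'GTAC')
```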
Mask regions in reg.bed to lowercase:
```
  seqtk seq -M reg.bed in.fa > out.fa
```
Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):
```
  seqtk sample -s100 read1.fq 10000 > sub1.fq
  seqtk sample -s100 read2.fq 10000 > sub2.fq
```
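
The reason the seed matters can be shown with a small Python sketch of seeded reservoir sampling. This is an illustration under assumptions, not seqtk's code: `reservoir_sample` is a hypothetical helper, and its random number generator differs from whatever seqtk uses. The point is that the random choices depend only on the seed and the record positions, so two files with the same number of records in the same order yield samples drawn from the same positions.

```python
import random


def reservoir_sample(records, k, seed):
    """Keep k records from an iterable using seeded reservoir sampling.

    Hypothetical helper for illustration. The decisions depend only on
    the seed and on record positions, never on record content.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)      # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                reservoir[j] = rec     # replace a previously kept record
    return reservoir


reads1 = [f"read{i}/1" for i in range(1000)]
reads2 = [f"read{i}/2" for i in range(1000)]
sub1 = reservoir_sample(reads1, 10, seed=100)
sub2 = reservoir_sample(reads2, 10, seed=100)
# Same seed, same positions: every mate is still paired.
print(all(a[:-2] == b[:-2] for a, b in zip(sub1, sub2)))  # True
```

With different seeds the two samples would generally pick different positions and the pairing would be lost.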
Trim low-quality bases from both ends using the Phred algorithm:
```
  seqtk trimfq in.fq > out.fq
```
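
As background, the Phred trimming approach (often called the modified Mott algorithm) can be sketched in Python as a maximum-scoring-segment search: each base scores `cutoff - error_probability`, and the read is trimmed to the contiguous segment with the highest total score. The sketch below is an approximation for illustration only. The function name `mott_trim`, the Phred+33 offset, and the 0.05 cutoff are assumptions, and seqtk's actual behavior (for example any minimum-length handling) may differ.

```python
def mott_trim(qual, offset=33, error_cutoff=0.05):
    """Return (start, end) of the best-scoring segment, half-open.

    Hypothetical helper for illustration. Each base contributes
    error_cutoff - P(error), where P(error) = 10 ** (-Q / 10).
    Returns (0, 0) when no base scores above zero.
    """
    best_sum, best = 0.0, (0, 0)
    running, start = 0.0, 0
    for i, q in enumerate(qual):
        p_err = 10 ** (-(ord(q) - offset) / 10)
        running += error_cutoff - p_err
        if running < 0:
            running, start = 0.0, i + 1    # restart after a bad stretch
        elif running > best_sum:
            best_sum, best = running, (start, i + 1)
    return best


seq, qual = "NNACGTNN", "##IIII##"   # '#' is Q2, 'I' is Q40 under +33
start, end = mott_trim(qual)
print(seq[start:end])  # ACGT
```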
Trim 5bp from the left end of each read and 10bp from the right end:
```
  seqtk trimfq -b 5 -e 10 in.fa > out.fa
```

Find telomere (TTAGGG)n repeats:
```
  seqtk telo seq.fa > telo.bed 2> telo.count
```

Other repositories by lh3

minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences

bwa

Burrows-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)

bioawk

BWK awk modified for biological data

minigraph

Sequence-to-graph mapper and graph generator

miniprot

Align proteins to genomes with splicing and frameshift

miniasm

Ultrafast de novo assembly for long noisy reads (though having no consensus step)

wgsim

Reads simulator

gfatools

Tools for manipulating sequence graphs in the GFA and rGFA formats

biofast

Benchmarking programming languages/implementations for common tasks in Bioinformatics

readfq

Fast multi-line FASTA/Q reader in several programming languages

cgranges

A C/C++ library for fast interval overlap queries (with a "bedtools coverage" example)

kmer-cnt

Code examples of fast and simple k-mer counters for tutorial purposes

pangene

Constructing a pangenome gene graph

psmc

Implementation of the Pairwise Sequentially Markovian Coalescent (PSMC) model

bedtk

A simple toolset for BED files (warning: CLI may change before bedtk becomes stable)

ksw2

Global alignment and alignment extension

yak

Yet another k-mer analyzer

fermikit

De novo assembly based variant calling pipeline for Illumina short reads

minimap

This repo is DEPRECATED. Please use minimap2, the successor of minimap.

hickit

TAD calling, phase imputation, 3D modeling and more for diploid single-cell Hi-C (Dip-C) and general Hi-C

bgt

Flexible genotype query among 30,000+ samples whole-genome

dipcall

Reference-based variant calling pipeline for a pair of phased haplotype assemblies

srf

SRF: Satellite Repeat Finder

unimap

An EXPERIMENTAL fork of minimap2 optimized for assembly-to-reference alignment

minipileup

Simple pileup-based variant caller

fermi

A WGS de novo assembler based on the FMD-index for large genomes

dna-nn

Model and predict short DNA sequence features with neural networks

fermi-lite

Standalone C library for assembling Illumina short reads in small regions

bfc

High-performance error correction for Illumina resequencing data

ropebwt2

Incremental construction of FM-index for DNA sequences

tabtk

Toolkit for processing TAB-delimited format

gwfa

Proof-of-concept implementation of GWFA for sequence-to-graph alignment

CHM-eval

miniwfa

A reimplementation of the WaveFront Alignment algorithm at low memory

jstreeview

Interactive phylogenetic tree viewer/editor

samtools

This is *NOT* the official repository of samtools.

etrf

Exact Tandem Repeat Finder (not a TRF replacement)

ref-gen

Human reference genome analysis sets

bioseq-js

For live demo, see http://lh3lh3.users.sourceforge.net/bioseq.shtml

lv89

C implementation of the Landau-Vishkin algorithm

partig

An experimental tool to estimate the similarity between all pairs of contigs

asub

A unified array job submitter for LSF, SGE/UGE and Slurm

klib.nim

Experimental getopt, gzip reader, FASTA/Q parser and interval queries in nim-lang

calN50

Compute N50/NG50 and auN/auNG

sdust

Symmetric DUST for finding low-complexity regions in DNA sequences

gffio

pre-pe

Preprocessing paired-end reads produced with experiment-specific protocols

hapdip

The CHM1-NA12878 benchmark for single-sample SNP/INDEL calling from WGS Illumina data

fermi2

misc

Useful small programs

varcmp

The first CHM1 paper (Li, 2014)

minisv

Lightweight mosaic/somatic SV caller for long reads (WIP)

lianti

Tools to process LIANTI sequence data

rtgeval

Wrapper for RTG's vcfeval; DEPRECATED!

nasw

Dynamic programming for aa-to-nt alignment with affine gap, splicing and frameshift

sgdp-fermi

FermiKit small variant calls for public SGDP samples

gfa1

This repo is deprecated. Please use gfatools instead.

pubLRasm

PortableCrystal

Portable Crystal binary distributions for Linux on x86_64

foreign

Modified or extracted from other programs

trimadap

Fast but inaccurate adapter trimmer for Illumina reads

lh3-snippets

ropebwt3

Construction and utility of BWT for DNA string sets

treebest

TreeBeST: Tree Building guided by Species Tree

unicall

A wrapper for calling small variants from human germline high-coverage single-sample Illumina data

fastARG

Fast heuristic ARG construction

proot-wrapper

Demonstrating the PRoot program

rmaxcut

An experimental tool to find approximate max-cuts in a large graph

bwa-docker

Minimal docker image for bwa. Not developed any more.

sdg

EXPERIMENTAL implementation of side graph

naivepca

Naive PCA for genotype data

mdust

mdust from DFCI Gene Indices Software Tools (archived for a historical record only)

editdist-U85

Fast implementation of Ukkonen's O(ND) algorithm for computing edit distance

lh3.github.com

libdivsufsort

Automatically exported from code.google.com/p/libdivsufsort

mem-paper

Manuscript for BWA-MEM

bcf2

Experimental bcftools port to support BCF2; DEPRECATED by htslib and htsbox

thesis

ropebwt

fermi-paper

The first fermi paper (Li, 2012)

crlf

Concise Run-Length Format for small alphabets; DEPRECATED

psnw

centos5-vm

Instructions on how to deploy CentOS 5 virtual machines

mag2gfa

DEPRECATED. Code has been moved to lh3/gfa1/misc

ibsget

Download files from Illumina BaseSpace (*OUTDATED* as BaseSpace has changed APIs)

smtl-paper

Samtools statistics paper (Li, 2011)

mssa-bench

Evaluating the performance of multi-string SA construction

samtools-legacy

For testing only. DON'T USE!