SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
- Documents: http://bioinf.shenwei.me/seqkit (Usage, FAQ, Tutorial, and Benchmark)
- Source code: https://github.com/shenwei356/seqkit
Features
- Easy to install (download)
- Providing statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64)
- Light weight and out-of-the-box, no dependencies, no compilation, no configuration
- Installable via conda: conda install -c bioconda seqkit
- Easy to use
- Ultrafast (see technical-details and benchmark)
- Seamlessly parsing both FASTA and FASTQ formats
- Supporting (gzip/xz/zstd/bzip2 compressed) STDIN/STDOUT and input/output files, easily integrated into pipelines
- Reproducible results (configurable random seed in sample and shuffle)
- Supporting custom sequence ID via regular expression
- Supporting Bash/Zsh autocompletion
- Versatile commands (usages and examples)
- Practical functions supported by 37 subcommands
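The "reproducible results" feature above rests on a general idea: when the random generator is seeded explicitly, the same seed and the same input order always yield the same selection. The following is a minimal Python sketch of that idea, not SeqKit's implementation. The function name `sample_by_proportion`, the toy records, and the seed value 11 are assumptions made for illustration.

```python
import random


def sample_by_proportion(records, proportion, seed=11):
    """Keep each record independently with probability `proportion`.

    A private random.Random(seed) is used, so the same seed and the same
    input order always select the same records.
    """
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < proportion]


# Hypothetical (ID, sequence) records standing in for a parsed FASTA file.
records = [(f"seq{i}", "ACGT" * (i + 1)) for i in range(100)]

first = sample_by_proportion(records, 0.1, seed=11)
second = sample_by_proportion(records, 0.1, seed=11)
print(first == second)  # True: same seed and input give the same selection
```

Changing the seed, or the order of the input records, generally changes which records are kept, which is why a fixed seed is worth recording alongside the command used.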
Installation
Go to Download Page for more download options and changelogs, or install via conda:
conda install -c bioconda seqkit
Subcommands
category | command | function | input | strand-sensitivity | multi-threads | popularity |
---|---|---|---|---|---|---|
basic | seq | transform sequences: extract ID/seq, filter by length/quality, remove gaps, reverse complement… | FASTA/Q | | | ★★★★★ |
basic | stats | simple statistics: #seqs, min/max_len, N50, Q20%, Q30%… | FASTA/Q | | ✓ | ★★★★★ |
basic | sum | compute message digest for all sequences in FASTA/Q files | FASTA/Q | + or both | ✓ | |
basic | subseq | extract subsequences or flanking sequences by region/gtf/bed | FASTA/Q | + or/and - | | ★★★ |
basic | sliding | extract subsequences in sliding windows | FASTA/Q | + only | | ★★ |
basic | faidx | create FASTA index file and extract subsequence (with more features than samtools faidx) | FASTA | + or/and - | | |
basic | watch | monitoring and online histograms of sequence features | FASTA/Q | | | |
basic | sana | sanitize broken single line FASTQ files | FASTQ | | | |
basic | scat | real time concatenation and streaming of fastx files | FASTA/Q | | ✓ | |
format conversion | fq2fa | convert FASTQ to FASTA | FASTQ | | | ★★ |
format conversion | fa2fq | retrieve corresponding FASTQ records by a FASTA file | FASTA/Q | | | |
format conversion | fx2tab | convert FASTA/Q to tabular format | FASTA/Q | | | ★★ |
format conversion | tab2fx | convert tabular format to FASTA/Q format | FASTA/Q | | | |
format conversion | convert | convert FASTQ quality encoding between Sanger, Solexa and Illumina | FASTA/Q | | | |
format conversion | translate | translate DNA/RNA to protein sequence | FASTA/Q | + or/and - | | ★★ |
searching | grep | search sequences by ID/name/sequence/sequence motifs, mismatch allowed | FASTA/Q | + and - | partly, -m | ★★★★★ |
searching | locate | locate subsequences/motifs, mismatch allowed | FASTA/Q | + and - | partly, -m | ★★★★★ |
searching | amplicon | extract amplicon (or specific region around it), mismatch allowed | FASTA/Q | + and - | partly, -m | ★ |
searching | fish | look for short sequences in larger sequences | FASTA/Q | + and - | | |
set operation | sample | sample sequences by number or proportion | FASTA/Q | | | ★★★★ |
set operation | rmdup | remove duplicated sequences by ID/name/sequence | FASTA/Q | + and - | | ★★★ |
set operation | common | find common sequences of multiple files by id/name/sequence | FASTA/Q | + and - | | |
set operation | duplicate | duplicate sequences N times | FASTA/Q | | | ★ |
set operation | split | split sequences into files by id/seq region/size/parts (mainly for FASTA) | FASTA preferred | | | ★ |
set operation | split2 | split sequences into files by size/parts (FASTA, PE/SE FASTQ) | FASTA/Q | | | ★★ |
set operation | head | print first N FASTA/Q records | FASTA/Q | | | |
set operation | head-genome | print sequences of the first genome with common prefixes in name | FASTA/Q | | | |
set operation | range | print FASTA/Q records in a range (start:end) | FASTA/Q | | | |
set operation | pair | match up paired-end reads from two fastq files | FASTA/Q | | | |
edit | concat | concatenate sequences with the same ID from multiple files | FASTA/Q | + only | | ★★★ |
edit | replace | replace name/sequence by regular expression | FASTA/Q | + only | | ★★ |
edit | restart | reset start position for circular genome | FASTA/Q | + only | | ★ |
edit | mutate | edit sequence (point mutation, insertion, deletion) | FASTA/Q | + only | | |
edit | rename | rename duplicated IDs | FASTA/Q | | | ★ |
ordering | sort | sort sequences by id/name/sequence/length | FASTA preferred | | | ★★ |
ordering | shuffle | shuffle sequences | FASTA preferred | | | |
BAM processing | bam | monitoring and online histograms of BAM record features | BAM | | | |
Notes:
- Strand-sensitivity:
  - + only: only processing on the positive/forward strand.
  - + and -: searching on both strands.
  - + or/and -: depends on users' flags/options/arguments.
- Multiple-threads: Using the default 4 threads is fast enough for most commands; some commands can benefit from extra threads.
- Popularity: Based on statistics of 227 publications citing seqkit since 2020.
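To make the strand-sensitivity labels concrete, here is a small Python sketch of the difference between searching the forward strand only and searching both strands. It is an illustration under stated assumptions, not SeqKit's algorithm. The helper names `reverse_complement` and `find_motif`, the 1-based inclusive coordinates, and the choice to report a palindromic motif once are all choices of this sketch. A motif is found on the minus strand by looking for its reverse complement in the forward sequence.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_motif(seq, motif, both_strands=True):
    """Return (start, end, strand) hits, 1-based inclusive, in forward coordinates."""
    queries = [("+", motif)]
    if both_strands:
        rc = reverse_complement(motif)
        if rc != motif:  # sketch choice: report a palindromic motif only once
            queries.append(("-", rc))
    hits = []
    for strand, query in queries:
        start = seq.find(query)
        while start != -1:
            hits.append((start + 1, start + len(query), strand))
            start = seq.find(query, start + 1)  # allow overlapping hits
    return sorted(hits)


seq = "TTGACCATGGTCAA"
print(find_motif(seq, "GACC", both_strands=False))  # [(3, 6, '+')]
print(find_motif(seq, "GACC", both_strands=True))   # [(3, 6, '+'), (9, 12, '-')]
```

With `both_strands=False` the sketch behaves like a "+ only" command; with `both_strands=True` it behaves like a "+ and -" command, and a flag that toggles the argument corresponds to "+ or/and -".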
Citation
W Shen, S Le, Y Li*, F Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.
Contributors
- Wei Shen
- Botond Sipos: bam, scat, fish, sana, watch.
- others
Acknowledgements
We thank Lei Zhang for testing SeqKit, and also thank Jim Hester, author of fasta_utilities, for advice on early performance improvements for FASTA parsing and Brian Bushnell, author of BBMaps, for advice on naming SeqKit and adding accuracy evaluation in benchmarks. We also thank Nicholas C. Wu from the Scripps Research Institute, USA for commenting on the manuscript and Guangchuang Yu from State Key Laboratory of Emerging Infectious Diseases, The University of Hong Kong, HK for advice on the manuscript.
We thank Li Peng for reporting many bugs.
We appreciate Klaus Post for his fantastic packages (compress and pgzip) which accelerate gzip file reading and writing.
Contact
Create an issue to report bugs, propose new functions or ask for help.