


any2fasta

Convert various sequence formats to FASTA

Motivation

You may wonder why this tool even exists. Well, I tried to do the right thing and use established tools like readseq and seqret from EMBOSS, but they both mangled IDs containing | or . characters, and there is no way to fix this behaviour. This resulted in inconsistencies between the .gbk and .fna versions of files in my pipelines.

Then you may wonder why I didn't use Bioperl or Biopython. Well, they are heavyweight libraries and are actually very slow at parsing Genbank files. This script uses only core Perl modules, has no other dependencies, and runs very quickly.

It supports the following input formats:

  1. Genbank flat file, typically .gb, .gbk, .gbff (starts with LOCUS)
  2. EMBL flat file, typically .embl (starts with ID)
  3. GFF with sequence, typically .gff, .gff3 (starts with ##gff)
  4. FASTA DNA, typically .fasta, .fa, .fna, .ffn (starts with >)
  5. FASTQ DNA, typically .fastq, .fq (starts with @)
  6. CLUSTAL alignments, typically .clw, .clu (starts with CLUSTAL or MUSCLE)
  7. STOCKHOLM alignments, typically .sth (starts with # STOCKHOLM)
  8. GFA assembly graph, typically .gfa (starts with ^[A-Z]\t)
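
The "starts with" hints in the list above suggest that a format can be guessed from the first line of a file alone. The following is a minimal sketch of that idea in POSIX shell. The function name `guess_format` and the labels it prints are hypothetical, and this is not necessarily how any2fasta itself detects formats.

```shell
# Hypothetical helper: guess the sequence format from the first line on stdin.
# The patterns mirror the "starts with" hints in the list of input formats.
guess_format() {
  tab=$(printf '\t')
  IFS= read -r first_line || :   # tolerate a final line without a newline
  case "$first_line" in
    LOCUS*)              echo genbank ;;
    ID*)                 echo embl ;;
    '##gff'*)            echo gff ;;
    '>'*)                echo fasta ;;
    '@'*)                echo fastq ;;
    CLUSTAL*|MUSCLE*)    echo clustal ;;
    '# STOCKHOLM'*)      echo stockholm ;;
    [[:upper:]]"$tab"*)  echo gfa ;;   # one capital letter followed by a tab
    *)                   echo unknown ;;
  esac
}

printf '>seq1\nACGT\n' | guess_format   # prints: fasta
```

Only the first line is inspected, so a file with leading blank lines or comments would be reported as unknown by this sketch.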

Files may be compressed with:

  1. gzip, typically .gz
  2. bzip2, typically .bz2
  3. zip, typically .zip
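
any2fasta decompresses these files for you. For comparison, here is a rough sketch of doing the same step by hand and piping the result to stdin. The helper name `cat_any` is hypothetical, and choosing the decompressor from the file extension is an assumption made for this example; the tool may detect compression differently.

```shell
# Hypothetical helper: write a possibly compressed file to stdout,
# picking the decompressor from the file extension (an assumption).
cat_any() {
  case "$1" in
    *.gz)  gzip -dc "$1" ;;
    *.bz2) bzip2 -dc "$1" ;;
    *.zip) unzip -p "$1" ;;
    *)     cat "$1" ;;
  esac
}

# Possible usage, reading the decompressed stream from stdin:
#   cat_any genes.gff.gz | any2fasta - > genes.ffn
```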

Installation

any2fasta has no dependencies except Perl 5.10 or higher. It only uses core modules, so no CPAN needed.

Direct script download

% cd /usr/local/bin  # choose a folder in your $PATH
% wget https://raw.githubusercontent.com/tseemann/any2fasta/master/any2fasta
% chmod +x any2fasta

Homebrew

% brew install brewsci/bio/any2fasta

Conda

% conda install -c bioconda any2fasta

GitHub

% git clone https://github.com/tseemann/any2fasta.git
% cp any2fasta/any2fasta /usr/local/bin # choose a folder in your $PATH

Test Installation

% ./any2fasta -v
any2fasta 0.4.2

% ./any2fasta -h
NAME
  any2fasta 0.4.2
SYNOPSIS
  Convert various sequence formats into FASTA
USAGE
  any2fasta [options] file.{gb,fa,fq,gff,gfa,clw,sth}[.gz,bz2,zip] > output.fasta
OPTIONS
  -h       Print this help
  -v       Print version and exit
  -q       No output while running, only errors
  -n       Replace ambiguous IUPAC letters with 'N'
  -l       Lowercase the sequence
  -u       Uppercase the sequence
END

Examples

% any2fasta ref.gbk > ref.fna

% any2fasta in.fasta > out.fasta  # should behave like "cat"

% any2fasta prokka.gff > prokka.fna  # only if GFF has FASTA appended

% any2fasta - < file.gb > file.fasta  # '-' means stdin

% any2fasta genes.gff.gz > genes.ffn  # automatically decompresses

% any2fasta 1.gb 2.fa.gz 3.gff.bz2 - > out.fa  # multiple files and stdin

% any2fasta R1.fq.gz | bzip2 > R1.fa.bz2  # 'seqtk seq -A' is much faster

% any2fasta -q 23S.clw > 23S.aln  # gaps '-' will be preserved

% any2fasta pfam4321.sth > pfam4321.aln  # '.' gaps will become '-'
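
To illustrate what the FASTQ example above involves, here is a small awk sketch of a FASTQ to FASTA conversion. It assumes strict four-line FASTQ records (no wrapped sequence lines), and the function name `fastq_to_fasta` is hypothetical; it is not taken from any2fasta.

```shell
# Sketch: convert four-line FASTQ records on stdin to FASTA on stdout.
# Line 1 of each record (@id ...) becomes the header (>id ...),
# line 2 is the sequence; the '+' and quality lines are dropped.
fastq_to_fasta() {
  awk 'NR % 4 == 1 { print ">" substr($0, 2) }
       NR % 4 == 2 { print }'
}

printf '@r1 desc\nACGT\n+\nIIII\n' | fastq_to_fasta
# prints:
# >r1 desc
# ACGT
```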

Options

  • -n replaces any characters that aren't A,C,G,T with N (gaps preserved)
  • -l will lowercase all the letters
  • -u will uppercase all the letters
  • -q will prevent logging messages being printed
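
The effect of -n and -u can be approximated on existing FASTA text with standard tools. This is only a sketch under stated assumptions: the function names are hypothetical, header lines are assumed to start with `>`, and lowercase a, c, g, t are treated as unambiguous, which may differ from what any2fasta does.

```shell
# Sketch of -n: on sequence lines, replace anything that is not
# A, C, G, T (either case) or a '-' gap with N; headers pass through.
mask_ambiguous() {
  sed '/^>/!s/[^ACGTacgt-]/N/g'
}

# Sketch of -u: uppercase sequence lines, leave header lines untouched.
upper_sequence() {
  awk '/^>/ { print; next } { print toupper($0) }'
}

printf '>s1\nACGTRYN-acgt\n' | mask_ambiguous
# prints:
# >s1
# ACGTNNN-acgt
```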

Issues

Submit feedback to the Issue Tracker

License

GPL v3

Author

Torsten Seemann
