assembly-stats

Get assembly statistics from FASTA and FASTQ files.

Installation

If you encounter an issue when installing assembly-stats, please contact your local system administrator. If you encounter a bug, please log it on the issues page.

Dependencies

  • zlib

Compiling from source

Run the following commands to install the program assembly-stats to /usr/local/bin/.

mkdir build
cd build
cmake ..
make
make test
make install

If you do not have root access, you can install to a directory of your choice by changing the call to cmake. For example:

cmake -DINSTALL_DIR:PATH=/foo/bar/ ..

would mean you finish up with a copy of assembly-stats in the directory /foo/bar/.

Usage

Get statistics from a list of files:

assembly-stats file.fasta another_file.fastq

Detection of FASTA or FASTQ format of each file is automatic from the file contents, so file names and extensions are irrelevant.
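Content-based detection can be illustrated with a short Python sketch. It is not taken from the assembly-stats source; it only assumes the usual convention that FASTA records start with ">" and FASTQ records start with "@". The function name detect_format is hypothetical.

```python
import io

def detect_format(stream):
    """Guess 'fasta' or 'fastq' from the first non-blank line of a text stream.

    Assumption: FASTA headers start with '>' and FASTQ headers with '@'.
    This is an illustration, not the logic used inside assembly-stats.
    """
    for line in stream:
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        break  # first real line is neither kind of header
    raise ValueError("not a FASTA or FASTQ file")

# The file name plays no part; only the contents are inspected.
print(detect_format(io.StringIO(">seq1\nACGT\n")))
print(detect_format(io.StringIO("@read1\nACGT\n+\nIIII\n")))
```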

The files can be supplied in compressed format (.gz, .bz2 or .xz). Compression support depends on what libraries are available when assembly-stats is compiled. Compression type is detected automatically and does not depend on the file name extensions.
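One common way to detect the compression type without looking at the extension is to check the leading "magic" bytes of the file. The Python sketch below shows the idea for the three formats mentioned above. It is an assumption about how such detection can work, not a description of the assembly-stats implementation, and detect_compression is a hypothetical name.

```python
import bz2
import gzip
import lzma

# Leading bytes that identify each compressed format.
MAGIC = [
    (b"\x1f\x8b", "gz"),
    (b"BZh", "bz2"),
    (b"\xfd\x37\x7a\x58\x5a\x00", "xz"),
]

def detect_compression(head: bytes) -> str:
    """Return 'gz', 'bz2', 'xz' or 'none' from the first bytes of a file."""
    for magic, name in MAGIC:
        if head.startswith(magic):
            return name
    return "none"

data = b">seq1\nACGT\n"
for blob in (data, gzip.compress(data), bz2.compress(data), lzma.compress(data)):
    print(detect_compression(blob[:6]))
```

In practice the first six bytes of the file would be read in binary mode and passed to the function.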

The default output format is human readable. You can change the output format and ignore sequences shorter than a given length. Get the full usage by running with no files listed:

$ assembly-stats
usage: stats [options] <list of fasta/q files>

Reports sequence length statistics from fasta and/or fastq files

options:
-l <int>
    Minimum length cutoff for each sequence.
    Sequences shorter than the cutoff will be ignored [1]
-s
    Print 'grep friendly' output
-t
    Print tab-delimited output
-u
    Print tab-delimited output with no header line
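The tab-delimited output is convenient for scripting. The following Python sketch runs assembly-stats with the -t and -l options and reads the result into dictionaries keyed by the header line. It assumes that assembly-stats is on your PATH and writes its report to standard output. The column names are taken from whatever header the program prints, and the helper names parse_tab_output and run_assembly_stats are hypothetical.

```python
import csv
import io
import subprocess

def parse_tab_output(text):
    """Parse tab-delimited text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def run_assembly_stats(paths, min_length=1):
    """Run assembly-stats -t on the given files and return one dict per row.

    Assumes the executable is on PATH and reports on standard output.
    """
    cmd = ["assembly-stats", "-t", "-l", str(min_length), *paths]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return parse_tab_output(result.stdout)

# Example call (requires assembly-stats to be installed):
# rows = run_assembly_stats(["file.fasta", "another_file.fastq"], min_length=500)
```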

Example

Here is an example on the Plasmodium falciparum reference genome:

$ assembly-stats Pf3D7_v3.fasta
stats for Pf3D7_v3.fasta
sum = 23328019, n = 16, ave = 1458001.19, largest = 3291936
N50 = 1687656, n = 5
N60 = 1472805, n = 7
N70 = 1445207, n = 8
N80 = 1343557, n = 10
N90 = 1067971, n = 12
N100 = 5967, n = 16
N_count = 0
Gaps = 0

The numbers should be self-explanatory, except perhaps lines like N50 = 1687656, n = 5. This means the N50 is 1687656: the 5 longest sequences, each at least 1687656 bases long, together contain at least 50% of the assembly. A "gap" is any consecutive run of Ns (undetermined nucleotide bases) of any length; matching is case-insensitive, so a lowercase "n" counts as well. N_count is the total number of Ns across the entire assembly.
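As a worked illustration of these definitions, here is a minimal Python sketch of the conventional Nx calculation and of the N and gap counts. It is not the assembly-stats source code: the function names and the toy input are made up, and it assumes the usual convention that Nx is reached once the running total covers at least x% of the bases.

```python
import re

def nx(lengths, x):
    """Return (Nx, n): the sequence length at which the n longest sequences
    together cover at least x percent of the total bases."""
    total = sum(lengths)
    running = 0
    for n, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 100 >= total * x:
            return length, n
    raise ValueError("no sequences given")

def count_ns_and_gaps(sequences):
    """Return (N_count, gaps), counting 'N' and 'n' alike.

    A gap is one maximal run of consecutive Ns of any length."""
    n_count = sum(len(re.findall(r"[Nn]", s)) for s in sequences)
    gaps = sum(len(re.findall(r"[Nn]+", s)) for s in sequences)
    return n_count, gaps

# Toy assembly with five sequences (hypothetical lengths, 300 bases in total).
lengths = [100, 80, 60, 40, 20]
print(nx(lengths, 50))   # (80, 2): 100 + 80 = 180 >= 150
print(nx(lengths, 100))  # (20, 5): all five sequences are needed
print(count_ns_and_gaps(["ACGTNNNNacgtnnAC", "NACGT"]))  # (7, 3)
```

N100 always equals the length of the shortest sequence that is counted, which matches the last Nx line of the example output above (n = 16, the total number of sequences).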

License

assembly-stats is free software, licensed under GPLv3.

Feedback/Issues

We currently do not have the resources to provide support for assembly-stats. However, the community might be able to help you out if you report any issues about usage of the software to the issues page.
