• Stars
    star
    139
  • Rank 261,497 (Top 6 %)
  • Language
    C
  • License
    MIT License
  • Created about 7 years ago
  • Updated 4 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Match up paired end fastq files quickly and efficiently.

Edwards Lab DOI Build Status

FASTQ PAIR

Rewrite paired end fastq files to make sure that all reads have a mate and to separate out singletons.

This code does one thing: it takes two fastq files, and generates four fastq files. That's right, for free it doubles the number of fastq files that you have!!

Usually when you get paired end read files you have two files with a /1 sequence in one and a /2 sequence in the other (or a /f and /r or just two reads with the same ID). However, often when working with files from a third party source (e.g. the SRA) there are different numbers of reads in each file (because some reads fail QC). Spades, bowtie2 and other tools break because they demand paired end files have the same number of reads.

This program solves that problem.

It rewrites the files with the sequences in order, with matching files for the two files provided on the command line, and then any single reads that are not matched are place in two separate files, one for each original file.

This code is designed to be fast and memory efficient, and works with large fastq files. It does not store the whole file in memory, but rather just stores the locations of each of the indices in the first file provided in memory.

Speed and efficiency considerations

The most efficient way to use this code is to provide the smallest file first (though it doesn't matter which way you provide the files), and then to manipulate the -t parameter on the command line. The code implementation is based on a hash table and the size of that table is the biggest way to make this code run faster. If you set the hash table size too low, then the data structure quickly fills up and the performance degrades to what we call O(n). On the other hand if you set the table size too big, then you waste a lot of memory, and it takes longer to initialize the data structures safely.

The optimal table size is basically somewhere around the number of sequences in your fastq files. You can quickly find out how many sequences there are in your fastq file:

wc -l fastq_filename

The number of sequences will be the number printed here, divided by 4.

Note: If you get an error that looks like

"We cannot allocate the memory for a table size of -436581356. Please try a smaller value for -t"

you are probably suffering from an integer overflow, so try reducing the value you are providing to the -t option. See issue 12 for more details.

If you are not sure, you can run this code with the -p parameter. Before it prints out the matched pairs of sequences, it will print out the number of sequences in each "bucket" in the table. If this number is more than about a dozen you need to increase the value you provide to -t. If most of the entries are zero, then you should decrease the size of -t.

As an aside, this code is also really slow if none of your sequences are paired. You should most likely use this after taking a peek at your files and making sure there are at least some paired sequences in your files!

Installing fastq_pair

To install the code, grab the github repository, then make a build directory:

mkdir build && cd build
cmake3 ..
make && sudo make install

There are more instructions on the installation page.

Running fastq_pair

fastq_pair takes two primary arguments. The name of the two fastq files that you want to pair up.

fastq_pair file1.fastq file2.fastq

You can also change the size of the hash table using the -t parameter:

fastq_pair -t 50021 file1.fastq file2.fastq

You can also print out the number of elements in each bucket using the -p parameter:

fastq_pair -p -t 100 file1.fastq file2.fastq

Testing fastq_pair

In the test directory there are two fastq files that you can use to test fastq_pair. There are 250 sequences in the left file and 75 sequences in the right file. Only 50 sequences are common between the two files.

You can test the code with:

fastq_pair -t 1000 test/left.fastq test/right.fastq

This will make four files in the test/ directory:

  • left.fastq.paired.fq
  • left.fastq.single.fq
  • right.fastq.paired.fq
  • right.fastq.single.fq

The paired files have 50 sequences each, and the two single files have 200 and 25 sequences (left and right respectively).

A note about gzipped fastq files

Unfortunately fastq_pair doesn't work with gzipped files at the moment, because it relies heavily on random access of the file stream. That is complex with gzipped files, especially when the uncompressed file exceeds available memory (which is exactly the situation that fastq_pair was designed to handle).

Therefore, at this time, fastq_pair does not support gzipped files. You need to uncompress the files before using fastq_pair.

If you really need to use gzipped files, and can accept slightly worse performance, then we have some alternative approaches written in Python that you can try.

Testing for gzipped files (issue #6)

We take a peek at the first couple of bytes in the file to see if the file is gzip compressed. Per the standard, the files should start 0x1F and 0x8B as the first two bytes. There is a small tester for the gzip program, called test_gzip.c, that takes a single argument and reports whether it is gzipped or not. You can compile that tester with the command:

gcc -std=gnu99  -o testgz ./test_gzip.c  is_gzipped.c

We now test both files and exit (hopefully gracefully) if either is gzip compressed. The easiest solution is to uncompress your files, and we recommend and love pigz because it is awesome!

Citing fastq_pair

Please see the CITATION file for the current citation for fastq-pair

More Repositories

1

ComputationalGenomicsManual

Robs manual for the computational genomics and bioinformatics class.
HTML
177
star
2

PhiSpy

Prediction of prophages from bacterial genomes
Jupyter Notebook
70
star
3

EdwardsLab

Code from the Edwards lab, including bioinformatics, image analysis and more. All this code is created and maintained by folks at Rob Edwards' bioinformatics lab at Flinders University
Jupyter Notebook
38
star
4

SRA_Metadata

Get, parse, and extract information from the SRA metadata files
Python
36
star
5

PyFBA

A python implementation of flux balance analysis to model microbial metabolism
Python
25
star
6

ProphagePredictionComparisons

Comparisons of multiple different prophage predictions
Jupyter Notebook
24
star
7

partie

PARTIE is a program to partition sequence read archive (SRA) metagenomics data into amplicon and shotgun data sets. The user-supplied annotations of the data sets can not be trusted, and so PARTIE allows automatic separation of the data.
Perl
24
star
8

sphae

Phage annotations and predictions. A spae is a prediction or foretelling. We'll foretell you what your phage is doing!
Python
23
star
9

crAssphage

Sequencing and analysis of crAssphage regions from around the globe
Python
16
star
10

PhageHosts

This is the complete code base used in Robert A. Edwards, Katelyn McNair, Karoline Faust, Jeroen Raes, and Bas E. Dutilh (2015) Computational approaches to predict bacteriophage–host relationships. FEMS Microbiology Reviews doi: 10.1093/femsre/fuv048
Python
15
star
11

genbank_to

Convert genbank files to a swath of other formats
Python
13
star
12

mgi-adapters

Trim adapters from MGI sequence data
C
9
star
13

primer-trimming

Fast C code for identifying and removing primers and adapters
C
9
star
14

fasta_validator

C code to validate a fasta file
C
8
star
15

py_fasta_validator

A Python extension of the fasta validator
Python
7
star
16

PhispyAnalysis

Analysis of phispy data
Jupyter Notebook
4
star
17

PhageProteomicTree

The phage proteomic tree was a breakthrough in evolution, taxonomy, and phylogenetics ... but nobody realized its global importance
Perl
4
star
18

pyctv

Parse and incorporate the ICTV Virus Metadata Resource file
Python
3
star
19

cameraGUI

A repository for the Prosilica Camera GUI
Python
3
star
20

repeatfinder

fast code for searching for direct and indirect repeats in DNA sequences.
Python
3
star
21

SearchSRA

Tools to search through the Sequence Read Archive using XSEDE's Jetstream
Shell
3
star
22

SearchSRAToolKit

Tools for processing data generated by the Search SRA
Python
3
star
23

atavide_lite

A simpler version of atavide that relies only on slurm or PBS scripts. Some of the settings maybe specific for our compute resources
Python
2
star
24

qudaich

Qudaich (queries and unique database alignment inferred by clustering homologs) is a software package for aligning sequences.
C++
2
star
25

atavide

Atavistic processing of metagenomics data.
Python
2
star
26

CoralImageAnalysis

A central repo for all the coral image analysis code generated by the Edwards bioinformatics lab at San Diego State University
Python
2
star
27

genetic_codes

Python code for translating sequences using different NCBI translation tables and genetic codes.
C
2
star
28

get_orfs

C code to translate a DNA sequence using different translation tables. Designed to be fast and lightweight, with few dependencies (only zlib and pthreads)
C
2
star
29

seqc

C and C++ libraries for working with sequences
1
star
30

CommonWorkflowLanguage

CWL codes and examples for searchSRA.org
Common Workflow Language
1
star
31

PPPF

Probabilistic Phage Protein Functions
Python
1
star
32

UCCD

Using Machine Language to Compare Ulcerative Colitis and Crohn's Disease
Jupyter Notebook
1
star
33

pawsey

Code for running lots of different things on pawsey. This is a bit of a generic bucket and some of the code will be duplicated elsewhere in different projects
Python
1
star
34

acacia-stream

testing some streaming using minimap and acacia and something
Python
1
star
35

pbj_placer

Rewrite jplacer files from phylosift for importing into ITOL and other tree viewing software
Python
1
star
36

PhiSigns

PhiSiGns is a web-based and standalone application that provides a simple and convenient tool to identify signature genes and design primers for PCR amplification of related genes from environmental samples
Perl
1
star