  • Stars: 122
  • Rank: 292,031 (Top 6 %)
  • Language: C
  • License: MIT License
  • Created over 6 years ago
  • Updated about 1 year ago

Repository Details

A fast lossless FASTQ compressor with ultra-high compression ratio

repaq

A tool to compress FASTQ files with an ultra-high compression ratio and high speed. repaq compresses FASTQ to the .rfq or .rfq.xz format. Compressing to .rfq is ultra fast, while compressing to .rfq.xz provides a very high compression ratio.

For NovaSeq data, as an example:

  • The .rfq file can be much smaller than the .fq.gz file, and compression usually takes less than 1/5 of the time gzip needs.
  • The .rfq.xz file can be as small as 5% of the original FASTQ file, or smaller than 30% of the .fq.gz file.
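
To measure these ratios on your own data, the sketch below compares file sizes. `size_percent` is a hypothetical helper name, not part of repaq; it assumes all files are regular files on disk and prints the size of the first file as a percentage of the combined size of the others.

```shell
# size_percent COMPRESSED ORIGINAL...
# Hypothetical helper (not part of repaq): prints the size of COMPRESSED
# as a percentage of the total size of the ORIGINAL files.
size_percent() {
    small=$(wc -c < "$1"); shift
    total=0
    for f in "$@"; do
        total=$((total + $(wc -c < "$f")))
    done
    # Avoid dividing by zero when the originals are empty.
    [ "$total" -gt 0 ] || return 1
    awk -v s="$small" -v t="$total" 'BEGIN { printf "%.2f\n", 100 * s / t }'
}
```

For a paired-end data set this could be called as `size_percent nova.rfq.xz nova.R1.fq nova.R2.fq`.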

For paired-end FASTQ files, repaq compresses both files into a single file to achieve a higher compression ratio.

This tool also supports non-Illumina FASTQ formats (e.g. the BGI-SEQ format), but the compression ratio is not as good as for Illumina format FASTQ.

WARNING: be careful about using repaq for production before v1.0 is released, since its spec v1.0 has not been frozen.

take a look at the compression ratio

Here we demonstrate the compression ratio on a pair of paired-end NovaSeq FASTQ files. You can download these files and test locally.

See? The size of final nova.rfq.xz is only 3.39% of the original FASTQ files! You can decompress it and check the md5 to see whether they are identical!

Typically, with a single CPU core, it takes less than 1 minute to convert nova.R1.fq + nova.R2.fq to nova.rfq, and less than 5 minutes to compress nova.rfq to nova.rfq.xz with xz.
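
The md5 check mentioned above can be sketched as follows. This is only an illustration: `same_md5` and `roundtrip_check` are hypothetical helper names, `md5sum` is assumed to be available (GNU coreutils), and the originals are assumed to be uncompressed FASTQ so that their digests are comparable with the decompressed output.

```shell
# same_md5 A B: succeeds only if both files have the same md5 digest.
# Hypothetical helper, not part of repaq.
same_md5() {
    a=$(md5sum < "$1" | cut -d' ' -f1)
    b=$(md5sum < "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

# roundtrip_check ARCHIVE R1 R2: decompress a paired-end archive into a
# temporary directory and compare each restored file with its original.
# Requires repaq (and xz for .rfq.xz archives) on PATH.
roundtrip_check() {
    rfq=$1; r1=$2; r2=$3
    tmp=$(mktemp -d) || return 1
    repaq -d -i "$rfq" -o "$tmp/R1.fq" -O "$tmp/R2.fq" &&
        same_md5 "$r1" "$tmp/R1.fq" &&
        same_md5 "$r2" "$tmp/R2.fq"
    status=$?
    rm -rf "$tmp"
    return $status
}
```

For the example data this would be `roundtrip_check nova.rfq.xz nova.R1.fq nova.R2.fq && echo identical`.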

get repaq

install with Bioconda

install with conda

conda install -c bioconda repaq

download binary

This binary is only for Linux systems: http://opengene.org/repaq/repaq

# this binary was compiled on CentOS, and tested on CentOS/Ubuntu
wget http://opengene.org/repaq/repaq
chmod a+x ./repaq

or compile from source

# get source (you can also use a browser to download from master or releases)
git clone https://github.com/OpenGene/repaq.git

# build
cd repaq
make

# Install
sudo make install

usage

For single-end mode:

# compress to .rfq.xz
repaq -c -i in.fq -o out.rfq.xz

# decompress from .rfq.xz
repaq -d -i in.rfq.xz -o out.fq

For paired-end mode:

# compress to .rfq.xz
repaq -c -i in.R1.fq -I in.R2.fq -o out.rfq.xz

# decompress from .rfq.xz
repaq -d -i in.rfq.xz -o out.R1.fq -O out.R2.fq

Tips:

  • -i and -I always denote the first and second input files, while -o and -O always denote the first and second output files.
  • The FASTQ input/output files can be gzipped if their names end with .gz.
  • For paired-end data, the .rfq file created in paired-end mode is usually much smaller than the sum of the .rfq files created in single-end mode for R1 and R2 separately. To obtain a high compression ratio, always use PE mode for PE data.
  • If you want higher speed and are not concerned with the compression ratio, replace xxx.rfq.xz with xxx.rfq; repaq will then compress to or decompress from the .rfq format.

system requirements

  • Memory: 16G RAM
  • CPU: 4 cores

verify the compressed file

repaq offers a compare mode to check the consistency of the original FASTQ file(s) and the compressed .rfq or .rfq.xz file.

  • Set --compare to enable the compare mode.
  • Specify the .rfq or .rfq.xz file with the -r option.
  • Specify the FASTQ files with the -i and -I options.

Examples:

# for single-end data
repaq --compare -i original.R1.fq  -r compressed.rfq.xz

# for paired-end data
repaq --compare -i original.R1.fq.gz -I original.R2.fq.gz  -r compressed.rfq.xz

If no error occurs, you will get JSON output like:

{
	"result":"passed",
	"msg":"",
	"fastq_reads":50000,
	"rfq_reads":50000,
	"fastq_bases":7419082,
	"rfq_bases":7419082
}

The result will be "failed" if the compressed file is not consistent with the original FASTQ files.
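
In a script, the report can be checked mechanically. The sketch below assumes the JSON is printed on STDOUT in the shape shown above; `compare_passed` is a hypothetical helper name. It inspects the "result" field instead of relying on repaq's exit status, whose behavior in compare mode is not described here.

```shell
# compare_passed: reads a compare-mode report on STDIN and succeeds only
# if it contains "result":"passed". Hypothetical helper, not part of repaq.
compare_passed() {
    grep -Eq '"result"[[:space:]]*:[[:space:]]*"passed"'
}
```

A possible use is `repaq --compare -i original.R1.fq -r compressed.rfq.xz | compare_passed && echo "safe to archive"`.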

STDIN and STDOUT

repaq can read its input from STDIN and write its output to STDOUT.

  • Specify --stdin to read from STDIN for compression or decompression.
  • Specify --stdout to write to STDOUT for compression or decompression.
  • In decompression mode, if --stdout is specified, paired-end data is written as an interleaved PE stream.
  • If the STDIN is an interleaved paired-end stream, specify --interleaved_in to indicate that.
  • Note that STDIN cannot be read when the input is a .xz file, and STDOUT cannot be written when the output is a .xz file.

Here is an example of compressing the interleaved PE output from fastp directly through a pipe:

fastp -i R1.fq -I R2.fq --stdout | repaq -c --interleaved_in --stdin -o out.rfq.xz
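
In the other direction, the interleaved stream produced by decompressing with --stdout may need to be split back into two files. The following is a minimal sketch, assuming every record is exactly four lines and that read 1 always precedes read 2; `deinterleave` is a hypothetical helper name, not a repaq feature.

```shell
# deinterleave R1_OUT R2_OUT: reads interleaved FASTQ on STDIN and writes
# read 1 records to R1_OUT and read 2 records to R2_OUT.
# Hypothetical helper, not part of repaq.
deinterleave() {
    awk -v r1="$1" -v r2="$2" '{
        # lines 1-4 of each 8-line group are read 1, lines 5-8 are read 2
        if ((NR - 1) % 8 < 4) print > r1
        else                  print > r2
    }'
}
```

Assuming --stdout takes the place of -o and -O, a pipeline could look like `repaq -d -i in.rfq --stdout | deinterleave out.R1.fq out.R2.fq`. A plain .rfq input is used here because of the .xz restrictions listed above.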

FASTQ Format compatibility

repaq was initially designed for compressing Illumina data, but it also works with data from other platforms, like BGI-Seq. To work with repaq, the FASTQ format should meet the following conditions:

  • The sequences contain only the bases A/T/C/G/N.
  • Each FASTQ record has exactly four lines (name, sequence, strand, quality).
  • The name and strand lines cannot be longer than 255 bytes.
  • The number of different quality characters cannot be more than 127.

repaq works best for Illumina data directly output by bcl2fastq.
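
The conditions above can be checked before compressing. The sketch below is a hypothetical pre-flight check, not part of repaq: `fastq_compatible` is an invented name, it assumes uncompressed input with uppercase bases, and it also checks the usual FASTQ convention that name lines start with "@" and strand lines with "+". It scans quality lines character by character, so it is better suited to a sample of the file than to a full run.

```shell
# fastq_compatible: reads uncompressed FASTQ on STDIN and exits non-zero
# at the first violation of the conditions listed above.
# Hypothetical helper, not part of repaq.
fastq_compatible() {
    LC_ALL=C awk '
        function fail(msg) { print "line " NR ": " msg > "/dev/stderr"; bad = 1; exit 1 }
        NR % 4 == 1 { if (substr($0, 1, 1) != "@") fail("name line must start with @")
                      if (length($0) > 255) fail("name line longer than 255 bytes") }
        NR % 4 == 2 { if ($0 !~ /^[ATCGN]*$/) fail("base other than A/T/C/G/N") }
        NR % 4 == 3 { if (substr($0, 1, 1) != "+") fail("strand line must start with +")
                      if (length($0) > 255) fail("strand line longer than 255 bytes") }
        NR % 4 == 0 { # count distinct quality characters seen so far
                      for (i = 1; i <= length($0); i++) {
                          c = substr($0, i, 1)
                          if (!(c in seen)) { seen[c] = 1; nq++ }
                      }
                      if (nq > 127) fail("more than 127 distinct quality characters") }
        END { if (!bad && NR % 4 != 0) { print "truncated record" > "/dev/stderr"; exit 1 } }
    '
}
```

For example, `head -n 400000 in.fq | fastq_compatible && repaq -c -i in.fq -o out.rfq.xz` checks the first 100,000 records before compressing.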

all options

options:
  -i, --in1                    input file name (string [=])
  -o, --out1                   output file name (string [=])
  -I, --in2                    read2 input file name when encoding paired-end FASTQ files (string [=])
  -O, --out2                   read2 output file name when decoding to paired-end FASTQ files (string [=])
  -c, --compress               compress input to output
  -d, --decompress             decompress input to output
  -k, --chunk                  the chunk size (kilo bases) for encoding, default 1000=1000kb. (int [=1000])
      --stdin                  input from STDIN. If the STDIN is interleaved paired-end FASTQ, please also add --interleaved_in.
      --stdout                 write to STDOUT. When decompressing PE data, this option will result in interleaved FASTQ output for paired-end input. Disabled by default.
      --interleaved_in         indicate that <in1> is an interleaved paired-end FASTQ which contains both read1 and read2. Disabled by default.
  
# following options are used to check the consistency of the compressed data
  -p, --compare                compare the files read by read to check the compression consistency. <rfq_to_compare> should be specified in this mode.
  -r, --rfq_to_compare         the RFQ file to be compared with the input. This option is only used in compare mode. (string [=])
  -j, --json_compare_result    the file to store the comparison result. This is optional since the result is also printed on STDOUT. (string [=])

# options for .xz output
  -t, --thread                 thread number for xz compression. Higher thread num means higher speed and lower compression ratio (1~16), default 1. (int [=1])
  -z, --compression            compression level. Higher level means higher compression ratio, and more RAM usage (1~9), default 4. (int [=4])

  -?, --help                   print this message

external dependency

repaq makes a system call in order to run the xz compression tool available on GNU/Linux systems. If xz isn't installed, repaq will fail with the message:

failed to call xz, please confirm that xz is installed in your system
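
To catch this before a long job starts, a wrapper script can check for xz up front. This is a minimal sketch; `require_tool` is a hypothetical helper name, not something repaq provides.

```shell
# require_tool NAME: succeeds if NAME is found on PATH, otherwise prints
# a hint to STDERR and fails. Hypothetical helper, not part of repaq.
require_tool() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "$1 not found on PATH; install it before using .rfq.xz files" >&2
        return 1
    }
}
```

A possible use is `require_tool xz && repaq -c -i in.fq -o out.rfq.xz`.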
