  • Stars: 156
  • Rank 238,165 (Top 5 %)
  • Language: Rust
  • License: The Unlicense
  • Created about 3 years ago
  • Updated 6 months ago

Repository Details

Multi-threaded Compression

⛓️gzp

Multi-threaded encoding and decoding.

Why?

This crate provides a near drop-in replacement for Write that compresses chunks of data in parallel and writes them to an underlying writer in the same order in which the bytes were handed to it. This allows for much faster compression of data.

Additionally, this provides multi-threaded decompressors for Mgzip and BGZF formats.
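The ordering guarantee described above can be illustrated with a small standard-library-only sketch. This is not gzp's implementation: `rle_encode` is a hypothetical stand-in for a real codec, `encode_in_order` is an illustrative name, and spawning one thread per chunk is a simplification of whatever worker scheme the crate actually uses. The point it shows is that chunks can be processed concurrently while the output is still assembled in input order.

```rust
use std::thread;

/// Hypothetical stand-in for a real compressor: run-length encodes the
/// chunk as (count, byte) pairs. gzp uses real codecs instead.
fn rle_encode(chunk: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut iter = chunk.iter().copied().peekable();
    while let Some(byte) = iter.next() {
        let mut count: u8 = 1;
        // Extend the run while the next byte matches and the counter fits.
        while count < u8::MAX && iter.peek() == Some(&byte) {
            iter.next();
            count += 1;
        }
        out.push(count);
        out.push(byte);
    }
    out
}

/// Encode every chunk on its own thread, then concatenate the results in
/// the same order as the chunks appeared in `input`.
fn encode_in_order(input: &[u8], chunk_size: usize) -> Vec<u8> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    thread::scope(|scope| {
        // Spawn all workers first; the handles stay in input order.
        let handles: Vec<_> = input
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || rle_encode(chunk)))
            .collect();
        // Joining in spawn order preserves the original byte order, no
        // matter which worker finishes first.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker panicked"))
            .collect()
    })
}
```

For example, `encode_in_order(b"aaabbbcccc", 5)` encodes the chunks `aaabb` and `bcccc` independently and returns the encoded pieces in input order.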

Supported Encodings:

Usage / Features

By default, gzp has the deflate_default and libdeflate features enabled. These bring in the best-performing zlib implementation as the backend for flate2, as well as libdeflater for the block gzip formats.

Examples

  • Deflate default
[dependencies]
gzp = { version = "*" }
  • Rust backend. With this backend the Zlib format is not available.
[dependencies]
gzp = { version = "*", default-features = false, features = ["deflate_rust"] }
  • Snap only
[dependencies]
gzp = { version = "*", default-features = false, features = ["snap_default"] }

Note: if you are running into compilation issues with libdeflate and the i686-pc-windows-msvc target, please see this issue for workarounds.

Examples

Simple example

use std::io::Write;

use gzp::{deflate::Gzip, ZBuilder, ZWriter};

fn main() {
    let writer = vec![];
    // ZBuilder returns a trait object that is transparent over `ParZ` or `SyncZ`
    let mut parz = ZBuilder::<Gzip, _>::new()
        .num_threads(0)
        .from_writer(writer);
    parz.write_all(b"This is a first test line\n").unwrap();
    parz.write_all(b"This is a second test line\n").unwrap();
    parz.finish().unwrap();
}

An updated version of pgz.

use gzp::{
    ZWriter,
    deflate::Mgzip,
    par::compress::{ParCompress, ParCompressBuilder},
};
use std::io::{Read, Write};

fn main() {
    let chunksize = 64 * (1 << 10) * 2;

    let stdout = std::io::stdout();
    let mut writer: ParCompress<Mgzip> = ParCompressBuilder::new().from_writer(stdout);

    let stdin = std::io::stdin();
    let mut stdin = stdin.lock();

    let mut buffer = Vec::with_capacity(chunksize);
    loop {
        let mut limit = (&mut stdin).take(chunksize as u64);
        limit.read_to_end(&mut buffer).unwrap();
        if buffer.is_empty() {
            break;
        }
        writer.write_all(&buffer).unwrap();
        buffer.clear();
    }
    writer.finish().unwrap();
}

Same thing but using Snappy instead.

use gzp::{
    ZWriter,
    par::compress::{ParCompress, ParCompressBuilder},
    snap::Snap,
};
use std::io::{Read, Write};

fn main() {
    let chunksize = 64 * (1 << 10) * 2;

    let stdout = std::io::stdout();
    let mut writer: ParCompress<Snap> = ParCompressBuilder::new().from_writer(stdout);

    let stdin = std::io::stdin();
    let mut stdin = stdin.lock();

    let mut buffer = Vec::with_capacity(chunksize);
    loop {
        let mut limit = (&mut stdin).take(chunksize as u64);
        limit.read_to_end(&mut buffer).unwrap();
        if buffer.is_empty() {
            break;
        }
        writer.write_all(&buffer).unwrap();
        buffer.clear();
    }
    writer.finish().unwrap();
}
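Both streaming examples above share the same read loop: cap each read with `take`, fill a reusable buffer with `read_to_end`, forward the buffer, then clear it. A minimal sketch of that loop as a generic helper follows. It uses only the standard library, and the name `copy_in_chunks` and its chunk-count return value are illustrative assumptions rather than part of gzp.

```rust
use std::io::{self, Read, Write};

/// Copy `reader` into `writer` in pieces of at most `chunk_size` bytes.
/// Returns how many chunks were written. Hypothetical helper, not gzp API.
fn copy_in_chunks<R: Read, W: Write>(
    mut reader: R,
    mut writer: W,
    chunk_size: usize,
) -> io::Result<usize> {
    if chunk_size == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "chunk_size must be non-zero",
        ));
    }
    // One buffer is reused for every chunk to avoid repeated allocation.
    let mut buffer = Vec::with_capacity(chunk_size);
    let mut chunks = 0;
    loop {
        // `take` limits how much `read_to_end` may pull in this iteration.
        reader
            .by_ref()
            .take(chunk_size as u64)
            .read_to_end(&mut buffer)?;
        if buffer.is_empty() {
            break; // Nothing left to read.
        }
        writer.write_all(&buffer)?;
        chunks += 1;
        buffer.clear();
    }
    writer.flush()?;
    Ok(chunks)
}
```

Any `Write` implementation can be passed as `writer`, so under these assumptions a compressing writer could take the place of a plain one. A writer that needs an explicit `finish()` call, as in the examples above, would still have to be finished by the caller afterwards.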

Acknowledgements

  • Many of the ideas for this crate were directly inspired by pigz, including implementation details for some functions.

Contributing

PRs are very welcome! Please run tests locally and ensure they are passing. Many tests are ignored in CI because the CI instances either don't have enough threads to run them or are too slow.

cargo test --all-features && cargo test --all-features -- --ignored

Note that tests will take 30-60s.

Future todos

Benchmarks

All benchmarks were run on the file ./bench-data/shakespeare.txt concatenated with itself 100 times, which creates a file of roughly 550 MB.

The primary benchmark takeaway is that compression time decreases proportionately to the number of threads used.

benchmarks
