  • Language
    Rust
  • License
    Apache License 2.0
  • Created over 7 years ago
  • Updated over 3 years ago

Building better compression together

% divANS Module

Overview

The divANS crate is meant to be used for generic data compression. The algorithm has been tuned to significantly favor gains in compression ratio over performance, operating at line speeds of 150 Mbit/s.

The name originates from "divided-ANS", since the intermediate representation is divided from the ANS codec.

More information at https://blogs.dropbox.com/tech/2018/06/building-better-compression-together-with-divans/

divANS should primarily be considered for cold storage and compression research. The compression algorithm is highly modular: a new algorithm only needs to be written once, because generic trait specialization constructs optimized variants of the codec for both compression and decompression at compile time.
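The "write once, specialize at compile time" idea can be illustrated with a small sketch. Everything here is hypothetical and far simpler than the crate's real traits: the `Specialization` trait, the `EncoderSide` and `DecoderSide` markers, and the toy XOR "codec" are assumptions made only to show how one generic function can serve both directions.

```rust
/// Hypothetical marker trait; the real crate's specializations carry more than a flag.
trait Specialization {
    const IS_ENCODER: bool;
}

struct EncoderSide;
struct DecoderSide;

impl Specialization for EncoderSide {
    const IS_ENCODER: bool = true;
}
impl Specialization for DecoderSide {
    const IS_ENCODER: bool = false;
}

/// The coding step is written once. `S::IS_ENCODER` is a compile-time constant,
/// so the compiler can drop the unused branch in each monomorphized copy.
fn code_byte<S: Specialization>(key: &mut u8, plain: &mut u8, wire: &mut u8) {
    if S::IS_ENCODER {
        *wire = *plain ^ *key; // encoder: the plain byte is the input
    } else {
        *plain = *wire ^ *key; // decoder: the plain byte is the output
    }
    // Both sides update the shared "model" identically from the plain byte.
    *key = key.wrapping_mul(31).wrapping_add(*plain);
}

/// Runs the shared step over a whole buffer in the direction chosen by `S`.
fn run<S: Specialization>(input: &[u8]) -> Vec<u8> {
    let mut key = 7u8;
    input
        .iter()
        .map(|&b| {
            let (mut plain, mut wire) = if S::IS_ENCODER { (b, 0) } else { (0, b) };
            code_byte::<S>(&mut key, &mut plain, &mut wire);
            if S::IS_ENCODER { wire } else { plain }
        })
        .collect()
}

fn encode(data: &[u8]) -> Vec<u8> {
    run::<EncoderSide>(data)
}

fn decode(data: &[u8]) -> Vec<u8> {
    run::<DecoderSide>(data)
}
```

Calling `decode(&encode(data))` returns the original bytes, because both directions run the same model update and differ only in which side of the XOR they treat as input.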

Rust Usage

Decompression

extern crate divans;
fn main() {
    use std::io;
    let stdin = &mut io::stdin();
    // Wrap stdin so that reads yield decompressed bytes.
    let mut reader = divans::DivansDecompressorReader::new(
        stdin,
        4096, // buffer size
    );
    // Stream the decompressed data to stdout.
    io::copy(&mut reader, &mut io::stdout()).unwrap();
}

Compression

extern crate divans;
fn main() {
    use std::io;
    let stdout = &mut io::stdout();
    {
        use std::io::Write; // brings flush() into scope
        let mut writer = divans::DivansBrotliHybridCompressorWriter::new(
            stdout,
            divans::DivansCompressorOptions{
                literal_adaptation:None, // should we override how fast the cdfs converge for literals?
                window_size:Some(22), // log 2 of the window size
                lgblock:None, // should we override how often metablocks are created in brotli
                quality:Some(11), // the quality of brotli commands
                dynamic_context_mixing:Some(2), // if we want to mix together the stride prediction and the context map
                use_brotli:divans::BrotliCompressionSetting::default(), // ignored
                use_context_map:true, // whether we should use the brotli context map in addition to the last 8 bits of each byte as a prior
                force_stride_value: divans::StrideSelection::UseBrotliRec, // if we should use brotli to decide on the stride
            },
            4096, // internal buffer size
        );
        io::copy(&mut io::stdin(), &mut writer).unwrap();
        writer.flush().unwrap();
    }
}

C usage

The C API is a standard compression API like the one that zlib provides. Despite being Rust code, the library makes no allocations through the Rust allocator unless the CAllocator struct is passed in with the custom_malloc field set to NULL. This means that any user of the divans library may provide their own allocation system, and all allocations will go through it. The pointers returned by custom_malloc must be 32-byte aligned.

Compression

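#include <stdio.h> // fwrite, stdout
#include "divans/ffi.h"
// compress to stdout; custom_malloc, custom_free and custom_alloc_opaque are supplied by the caller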
DivansResult compress(const unsigned char *data, size_t len) {
    unsigned char buf[4096];
    struct CAllocator alloc = {custom_malloc, custom_free, custom_alloc_opaque}; // set all 3 to NULL to use rust allocators
    struct DivansCompressorState *state = divans_new_compressor_with_custom_alloc(alloc);
    divans_set_option(state, DIVANS_OPTION_USE_CONTEXT_MAP, 1);
    divans_set_option(state, DIVANS_OPTION_DYNAMIC_CONTEXT_MIXING, 2);
    divans_set_option(state, DIVANS_OPTION_QUALITY, 11);
    while (len) {
        size_t read_offset = 0;
        size_t buf_offset = 0;
        DivansResult res = divans_encode(state,
                                         data, len, &read_offset,
                                         buf, sizeof(buf), &buf_offset);
        if (res == DIVANS_FAILURE) {
            divans_free_compressor(state);
            return res;
        }
        data += read_offset;
        len -= read_offset;
        fwrite(buf, buf_offset, 1, stdout);
    }
    DivansResult res;
    do {
        size_t buf_offset = 0;
        res = divans_encode_flush(state,
                                  buf, sizeof(buf), &buf_offset);
        if (res == DIVANS_FAILURE) {
            divans_free_compressor(state);
            return res;
        }
        fwrite(buf, buf_offset, 1, stdout);
    } while(res != DIVANS_SUCCESS);
    divans_free_compressor(state);
    return DIVANS_SUCCESS;
}

Decompression

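#include <stdio.h> // fwrite, stdout
#include "divans/ffi.h"
// decompress to stdout; custom_malloc, custom_free and custom_alloc_opaque are supplied by the caller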
DivansResult decompress(const unsigned char *data, size_t len) {
    unsigned char buf[4096];
    struct CAllocator alloc = {custom_malloc, custom_free, custom_alloc_opaque}; // set all 3 to NULL for using rust allocators
    struct DivansDecompressorState *state = divans_new_decompressor_with_custom_alloc(alloc);
    DivansResult res;
    do {
        size_t read_offset = 0;
        size_t buf_offset = 0;
        res = divans_decode(state,
                            data, len, &read_offset,
                            buf, sizeof(buf), &buf_offset);
        if (res == DIVANS_FAILURE || (res == DIVANS_NEEDS_MORE_INPUT && len == 0)) {
            divans_free_decompressor(state);
            return res;
        }
        data += read_offset;
        len -= read_offset;
        fwrite(buf, buf_offset, 1, stdout);
    } while (res != DIVANS_SUCCESS);
    divans_free_decompressor(state);
    return DIVANS_SUCCESS;
}

Structure of the divANS codebase

Top Level Modules

Module | Purpose
--- | ---
probability | Optimized implementations of 16-wide 4-bit CDFs that support online training and renormalization
codec/interface | CrossCommandState tracks data to be kept between brotli commands. Examples include CDFs, the previous few bytes, the ring buffer for copies, etc.
codec/dict | Encode/decode parts of the file that may arise from the included brotli dictionary
codec/copy | Encode/decode parts of the file that have already been seen before and are still in the ring buffer
codec/block_type | Encode/decode markers in the file which divans can use as a prior for literals, distances or even command type
codec/context_map | Encode/decode the brotli context_map, which remaps the previous 6 bits and literal_block_type to a prior between 0 and 255
codec/literal | Encode/decode new raw data that appears in the file. This can use a number of strategies or combinations of strategies to encode each nibble
codec/priors | Structs defining the size of the tables that contain dynamically trained CDFs holding statistics about past data
codec/weights | Struct that blends between multiple CDFs based on prior efficacy
codec/specializations | Optimization system to generate separate codepaths for the currently running nibble-decode or encode path, based on which priors were selected
codec | Encode/decode the overall commands themselves and track the state of the compression of the overall file and whether it is complete
divans_decompressor | Implementation of the Decompressor trait that parses divans headers and translates the ANS stream into commands and into raw data
brotli_ir_gen | Implementation of the Compressor trait that calls into the brotli codec and extracts the command array per metablock to be encoded
divans_compressor | Alternate implementation of the Compressor trait that calls into raw_to_cmd instead of brotli to get the command array per metablock
divans_to_raw | DecoderSpecialization for the codec to assume default input commands and incrementally populate them
cmd_to_divans | EncoderSpecialization for the codec to take input commands and produce divans
raw_to_cmd | Future: a substitute for the Brotli compressor to generate commands
cmd_to_raw | Interpret a list of Brotli commands and produce the uncompressed file
arithmetic_coder | Define the EntropyEncoder and EntropyDecoder arithmetic coder traits
ans | Fast implementation of the EntropyEncoder and EntropyDecoder interfaces
billing | Plugin to add attribution to an ArithmeticEncoderOrDecoder by providing the same interface and wrapping the en/decoder
alloc_util | Allocator that reuses a single slice of memory over many allocations
slice_util | A mechanism to borrow and reference an existing slice that can be frozen, unborrowing the slice, when divans returns to the caller to request more input or output space
resizable_buffer | Simple resizing byte buffer that can hold the raw input and output streams being processed
reader | Read implementation for both encoding and decoding of divans
writer | Write implementation for both encoding and decoding of divans
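To make the probability module's description more concrete, here is a minimal scalar sketch of an adaptive CDF over the 16 nibble values with online training and renormalization. It is not the crate's implementation, which is described as a 16-wide optimized version; the `NibbleCdf` type and the `MAX_TOTAL` and `INCREMENT` constants are assumptions chosen for illustration.

```rust
/// Hypothetical renormalization threshold and adaptation speed.
const MAX_TOTAL: u16 = 1 << 12;
const INCREMENT: u16 = 32;

/// Toy adaptive CDF: `cdf[i]` is the cumulative count of nibbles <= i.
struct NibbleCdf {
    cdf: [u16; 16],
}

impl NibbleCdf {
    /// Starts from a uniform distribution (each nibble has frequency 4).
    fn new() -> Self {
        let mut cdf = [0u16; 16];
        for (i, c) in cdf.iter_mut().enumerate() {
            *c = (i as u16 + 1) * 4;
        }
        NibbleCdf { cdf }
    }

    fn total(&self) -> u16 {
        self.cdf[15]
    }

    /// Frequency of one nibble, recovered from adjacent cumulative counts.
    fn freq(&self, nibble: u8) -> u16 {
        let i = (nibble & 0xf) as usize;
        if i == 0 { self.cdf[0] } else { self.cdf[i] - self.cdf[i - 1] }
    }

    /// Online training: bump the observed nibble, then renormalize if needed.
    fn update(&mut self, nibble: u8) {
        let start = (nibble & 0xf) as usize;
        for c in self.cdf[start..].iter_mut() {
            *c += INCREMENT;
        }
        if self.total() >= MAX_TOTAL {
            self.renormalize();
        }
    }

    /// Halves every frequency but keeps each at >= 1 so all nibbles stay codable.
    fn renormalize(&mut self) {
        let mut prev = 0u16;
        let mut running = 0u16;
        for c in self.cdf.iter_mut() {
            let f = *c - prev;
            prev = *c;
            running += (f / 2).max(1);
            *c = running;
        }
    }
}
```

After repeated calls to `update` with the same nibble, that nibble's frequency dominates while the total stays below `MAX_TOTAL`, which is the property an entropy coder needs to keep its arithmetic within a fixed precision.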

Overall flow

To encode a file:

  • a writer::DivansBrotliHybridCompressorWriter instantiates a brotli_ir_gen::BrotliDivansHybridCompressor
  • The compressor has both a brotli::BrotliEncoderStateStruct from the brotli crate as well as a codec::DivansCodec<ANSEncoder, EncodeSpecialization>.
  • Using brotli_ir_gen::BrotliDivansHybridCompressor::encode, the compressor feeds input data into the brotli::BrotliEncoderStateStruct
    • by calling brotli::BrotliEncoderCompressStream
  • brotli::BrotliEncoderCompressStream can trigger a callback into brotli_ir_gen::BrotliDivansHybridCompressor::divans_encode_commands
    • The callback will consist of a slice of brotli::interface::Command items
    • These items are fed into the codec::DivansCodec<ANSEncoder, EncodeSpecialization>::encode_or_decode, which encodes them into divans format.
      • codec::DivansCodec<ANSEncoder, cmd_to_divans::EncoderSpecialization>::encode_or_decode accomplishes this by using the EncoderSpecialization to pull input commands as the source of truth
      • Unfortunately, brotli can pass as much data as it wishes to the caller, up to the maximum metablock size of 16 megabytes.
        • This means the caller has to buffer this data in a resizable_buffer::ResizableBuffer
  • When all the callbacks have completed, brotli_ir_gen::BrotliDivansHybridCompressor::encode_stream does its best to flush the raw buffer
  • Eventually when the user calls brotli_ir_gen::BrotliDivansHybridCompressor::flush a similar procedure is followed but with finish flags set
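The buffering step in the encode flow can be sketched as follows. `PendingOutput` is a hypothetical stand-in for resizable_buffer::ResizableBuffer, whose real interface is not shown here; the sketch only illustrates why a growable buffer is needed when the producer can hand over far more data than the caller's output slice holds.

```rust
/// Hypothetical stand-in for a resizable output buffer.
struct PendingOutput {
    data: Vec<u8>,
    read_pos: usize,
}

impl PendingOutput {
    fn new() -> Self {
        PendingOutput { data: Vec::new(), read_pos: 0 }
    }

    /// Called when the producer emits data: accepts any amount, growing as needed.
    fn push(&mut self, bytes: &[u8]) {
        self.data.extend_from_slice(bytes);
    }

    /// Copies as much pending data as fits into `out`; returns the bytes written.
    fn drain_into(&mut self, out: &mut [u8]) -> usize {
        let n = out.len().min(self.data.len() - self.read_pos);
        out[..n].copy_from_slice(&self.data[self.read_pos..self.read_pos + n]);
        self.read_pos += n;
        if self.read_pos == self.data.len() {
            // Everything was handed out, so the storage can be reused.
            self.data.clear();
            self.read_pos = 0;
        }
        n
    }

    fn is_empty(&self) -> bool {
        self.read_pos == self.data.len()
    }
}
```

A caller with a 4 KiB output slice would call `drain_into` repeatedly until `is_empty` returns true, which mirrors the "does its best to flush the raw buffer" step above.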

To decode a file:

  • a reader::DivansDecompressorReader instantiates a divans_decompressor::DivansDecompressor
  • The decompressor is an enum that switches from HeaderParser mode into Decode mode after the 16 byte raw header has been parsed
  • divans_decompressor::DivansDecompressor::Decode has a codec::DivansCodec<ANSDecoder, DecoderSpecialization> within.
  • Using divans_decompressor::DivansDecompressor::decode, the decompressor feeds input data directly into codec::DivansCodec<ANSDecoder, DecoderSpecialization>
    • The codec.encode_or_decode is designed to receive commands as input when encoding, so the divans_to_raw::DecoderSpecialization simply makes placeholder commands for each type of command so that the same codepath can encode and decode commands
  • When a final state is reached, a checksum is written and success is returned
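The header-then-decode switch described above can be sketched as an enum that replaces itself once 16 header bytes have arrived. The `Decompressor` type, its variants' fields, and the `feed` method are assumptions for illustration; the real decompressor hands the payload to the codec rather than storing it.

```rust
const HEADER_SIZE: usize = 16;

/// Hypothetical two-phase decompressor: collect the raw header, then decode.
enum Decompressor {
    HeaderParser { header: [u8; HEADER_SIZE], filled: usize },
    Decode { header: [u8; HEADER_SIZE], payload: Vec<u8> },
}

impl Decompressor {
    fn new() -> Self {
        Decompressor::HeaderParser { header: [0; HEADER_SIZE], filled: 0 }
    }

    /// Consumes input; header bytes come first, everything after is payload.
    fn feed(&mut self, mut input: &[u8]) {
        let mut next = None;
        if let Decompressor::HeaderParser { header, filled } = self {
            let n = input.len().min(HEADER_SIZE - *filled);
            header[*filled..*filled + n].copy_from_slice(&input[..n]);
            *filled += n;
            input = &input[n..];
            if *filled == HEADER_SIZE {
                // Header complete: prepare the switch into Decode mode.
                next = Some(Decompressor::Decode { header: *header, payload: Vec::new() });
            }
        }
        if let Some(state) = next {
            *self = state;
        }
        if let Decompressor::Decode { payload, .. } = self {
            // A real decoder would run the codec here instead of storing bytes.
            payload.extend_from_slice(input);
        }
    }

    fn is_decoding(&self) -> bool {
        matches!(self, Decompressor::Decode { .. })
    }

    fn payload_len(&self) -> usize {
        match self {
            Decompressor::Decode { payload, .. } => payload.len(),
            Decompressor::HeaderParser { .. } => 0,
        }
    }
}
```

Because the header may arrive split across several calls, `feed` tracks how many header bytes it has seen and only switches modes when all 16 are present.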

The codec state machine

codec::DivansCodec has three members: cross_command_state, which tracks the probability models; state, which tracks which kind of command is being coded; and codec_traits, a repository of compile-time constant values that are fixed for this decode or encode phase based on the header and command data.

The state value is an enumerant that can either carry command-specific information or can mark that the ring buffer must be populated, etc.

Overview of the available codec states

  • Begin: This state means that the decoder is not in the middle of coding a particular command, so the next step will be to decode what the next command is
  • Literal(literal::LiteralState): the coder is in the process of coding raw literals to be injected into the file
  • Dict(dict::DictState): the coder is in the process of coding a word that appears in the brotli dictionary
  • Copy(copy::CopyState): the coder is in the process of coding a reference to pull data from the ring buffer
  • BlockSwitchLiteral(block_type::LiteralBlockTypeState): The coder was instructed to serialize an arbitrary value that will affect how the predictor models future literals
  • BlockSwitchCommand(block_type::BlockTypeState): The coder was instructed to serialize an arbitrary value that does not yet affect how the predictor models anything (TODO)
  • BlockSwitchDistance(block_type::BlockTypeState): The coder was instructed to serialize an arbitrary value that will affect how the predictor models distances to copy from and dictionary values.
  • PredictionMode(context_map::PredictionModeState): The coder was instructed to serialize out a context map that remaps the BlockSwitchLiteral value plus the last 6 bits into a value in [0, 255] that is used as an index into the array of CDFs to be trained
  • PopulateRingBuffer(Command<AllocatedMemoryPrefix<u8, AllocU8>>): When Literal, Dict, or Copy states reach their termination state, those states are moved into the PopulateRingBuffer state.
    • PopulateRingBuffer uses the cmd_to_raw::DivansRecoderState stored in DivansCodec::CrossCommandState to populate the ring buffer
      • If DecoderSpecialization is selected, cmd_to_raw::DivansRecoderState copies the data to the output bytes, returning and requesting NeedBytes until all bytes have been serialized
      • Otherwise the EncoderSpecialization avoids serializing those bytes.
    • After all necessary bytes were serialized and the ring buffer populated, then the last_8_literals are saved to be used as future priors
  • WriteChecksum(usize): This state happens if an end command (0xf) is encountered during a decode or a codec::DivansCodec::flush happens on encode
    • Currently checksum support is not active, but 8 bytes are simply serialized
  • DivansSuccess This state is reached when WriteChecksum is complete on the decoder or when the final command is reached on the encoder
  • EncodedShutdownNode | ShutdownCoder | CoderBufferDrain: these appear only in the encoder during flush/close, after the EOF node type is flushed
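A reduced version of this state machine might look like the sketch below. Only a few of the states listed above are modeled, and the `Event` enum and `step` function are hypothetical: the real codec drives transitions from the entropy-coded stream rather than from explicit events.

```rust
/// A subset of the codec states described above, with simplified payloads.
#[derive(Debug, PartialEq, Clone)]
enum State {
    Begin,
    Literal(Vec<u8>),
    PopulateRingBuffer(Vec<u8>),
    WriteChecksum(usize),
    DivansSuccess,
}

/// Hypothetical events standing in for what the coder observes.
enum Event {
    StartLiteral(Vec<u8>),
    CommandCoded,
    BytesSerialized,
    EndOfFile,
    ChecksumByte,
}

/// Number of checksum bytes serialized, per the description above.
const CHECKSUM_BYTES: usize = 8;

fn step(state: State, event: Event) -> State {
    match (state, event) {
        // Between commands: pick the next command, or finish the stream.
        (State::Begin, Event::StartLiteral(bytes)) => State::Literal(bytes),
        (State::Begin, Event::EndOfFile) => State::WriteChecksum(0),
        // A finished command hands its data to the ring buffer stage.
        (State::Literal(bytes), Event::CommandCoded) => State::PopulateRingBuffer(bytes),
        // Once the ring buffer is populated, the coder is between commands again.
        (State::PopulateRingBuffer(_), Event::BytesSerialized) => State::Begin,
        // The checksum is serialized one byte at a time.
        (State::WriteChecksum(n), Event::ChecksumByte) => {
            if n + 1 == CHECKSUM_BYTES {
                State::DivansSuccess
            } else {
                State::WriteChecksum(n + 1)
            }
        }
        // Any other combination leaves the state unchanged in this sketch.
        (state, _) => state,
    }
}
```

Carrying the command data inside the enum variant, as `Literal` and `PopulateRingBuffer` do here, lets a state be suspended and resumed whenever the codec has to return to the caller for more input or output space.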

Acknowledgements

Special thanks to Jaroslaw (Jarek) Duda and Fabian Giesen for genius work and their detailed and thoughtful presentation of the ANS algorithm.
