

Leopard-RS

MDS Reed-Solomon Erasure Correction Codes for Large Data in C

Leopard-RS is a fast library for Erasure Correction Coding. From a block of equally sized original data pieces, it generates recovery symbols that can be used to recover lost original data.

Motivation:

Its running time grows as O(N log N) in the input data size, and its inner loops are vectorized using the best approaches available on modern processors, operating in the fastest finite fields (8-bit or 16-bit Galois fields with the Cantor basis {2}).

It sets new speed records for MDS encoding and decoding of large data, achieving over 1.2 GB/s to encode with the AVX2 instruction set on a single core.

Example applications are data recovery software and data center replication.

Encoder API:

Preconditions:

  • The original and recovery data must not exceed 65536 pieces.
  • recovery_count must be less than or equal to original_count.
  • buffer_bytes must be a multiple of 64.
  • Each buffer must have the same number of bytes.
  • Even the last piece must be rounded up to the block size.
    #include "leopard.h"

For full documentation please read leopard.h.

  • leo_init() : Initialize library.
  • leo_encode_work_count() : Calculate the number of work_data buffers to provide to leo_encode().
  • leo_encode(): Generate recovery data.
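
A minimal encoding sketch follows. It is a sketch, not authoritative: the parameter order and the LeopardResult/Leopard_Success names follow leopard.h as recalled here and should be verified against the header, and the caller is assumed to have allocated and zero-padded all buffers.

    #include "leopard.h"

    #include <cstdint>
    #include <vector>

    // Hedged sketch: verify parameter order against leopard.h.
    // original: original_count buffers of buffer_bytes each (zero-padded).
    // work: caller-allocated buffers of buffer_bytes each; after a successful
    // call the first recovery_count of them hold the recovery data (as
    // recalled from the header's documentation).
    static bool encode_example(
        const std::vector<uint8_t*>& original,
        std::vector<uint8_t*>& work,
        uint64_t buffer_bytes,   // multiple of 64
        unsigned original_count, // original + recovery must not exceed 65536
        unsigned recovery_count) // <= original_count
    {
        if (leo_init() != 0)
            return false; // header/library version mismatch

        const unsigned work_count =
            leo_encode_work_count(original_count, recovery_count);
        if (work.size() < work_count)
            return false; // caller must provide work_count buffers

        const LeopardResult result = leo_encode(
            buffer_bytes, original_count, recovery_count, work_count,
            (const void* const*)original.data(), (void**)work.data());
        return result == Leopard_Success;
    }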

Decoder API:

For full documentation please read leopard.h.

  • leo_init() : Initialize library.
  • leo_decode_work_count() : Calculate the number of work_data buffers to provide to leo_decode().
  • leo_decode() : Recover original data.
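
The decode side mirrors the encoder, and the same hedging applies: signatures are as recalled and should be verified against leopard.h. Lost pieces are passed as null pointers, and the recovered originals are returned via the work buffers.

    #include "leopard.h"

    #include <cstdint>
    #include <vector>

    // Hedged sketch: verify parameter order and the output convention
    // (recovered pieces appearing in work_data) against leopard.h.
    static bool decode_example(
        const std::vector<uint8_t*>& original, // nullptr marks a lost piece
        const std::vector<uint8_t*>& recovery, // nullptr marks a lost piece
        std::vector<uint8_t*>& work,           // caller-allocated
        uint64_t buffer_bytes,
        unsigned original_count,
        unsigned recovery_count)
    {
        if (leo_init() != 0)
            return false;

        const unsigned work_count =
            leo_decode_work_count(original_count, recovery_count);
        if (work.size() < work_count)
            return false;

        const LeopardResult result = leo_decode(
            buffer_bytes, original_count, recovery_count, work_count,
            (const void* const*)original.data(),
            (const void* const*)recovery.data(),
            (void**)work.data());
        return result == Leopard_Success;
    }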

Benchmarks:

On a MacBook Pro 15-Inch (Mid-2015) with a 22 nm "Haswell/Crystalwell" 2.8 GHz Intel Core i7-4980HQ processor, compiled with Visual Studio 2017 RC:

Leopard Encoder(8.192 MB in 128 pieces, 128 losses): Input=2102.67 MB/s, Output=2102.67 MB/s
Leopard Decoder(8.192 MB in 128 pieces, 128 losses): Input=686.212 MB/s, Output=686.212 MB/s

Leopard Encoder(64 MB in 1000 pieces, 200 losses): Input=2194.94 MB/s, Output=438.988 MB/s
Leopard Decoder(64 MB in 1000 pieces, 200 losses): Input=455.633 MB/s, Output=91.1265 MB/s

Leopard Encoder(2097.15 MB in 32768 pieces, 32768 losses): Input=451.168 MB/s, Output=451.168 MB/s
Leopard Decoder(2097.15 MB in 32768 pieces, 32768 losses): Input=190.471 MB/s, Output=190.471 MB/s

In other words, 2 GB of data in 64 KB pieces encodes in about 4.6 seconds, and worst-case recovery takes about 11 seconds.

More benchmark results are available here: https://github.com/catid/leopard/blob/master/Benchmarks.md

Comparisons:

There is another library, FastECC by Bulat-Ziganshin, that should have similar performance: https://github.com/Bulat-Ziganshin/FastECC. Both libraries implement the same high-level algorithm from {3}; Leopard implements the newer polynomial basis GF(2^r) approach outlined in {1}, while FastECC uses complex finite fields modulo special primes. There are trade-offs that may make either approach preferable depending on the application:

  • Older processors do not support SSSE3, and FastECC supports those processors better.
  • FastECC supports data sets above 65,536 pieces as it uses 32-bit finite field math.
  • Leopard does not require expanding the input or output data to make it fit in the field, so it can be more space efficient.

FFT Data Layout:

We pack the data into memory in this order:

[Recovery Data (Power of Two = M)] [Original Data] [Zero Padding out to 65536]

For encoding, this placement is implied rather than reflected in the actual memory layout. For decoding, the layout is used explicitly.

Encoder algorithm:

The encoder is described in {3}. Operations are done in O(K log M), where K is the original data size and M is up to twice the size of the recovery set.

Roughly in brief:

Recovery = FFT( IFFT(Data_0) xor IFFT(Data_1) xor ... )

The encoder walks the original data M chunks at a time, performing the IFFT. Each IFFT intermediate result is XORed into the first M chunks of the data layout. Finally, the FFT is performed.
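
Structurally, the loop looks like the sketch below. The helpers ifft_m, fft_m, and xor_m are hypothetical stand-ins for the library's internal M-chunk transforms; skew factors and FFT truncation are omitted.

    #include <cstdint>
    #include <vector>

    using Chunks = std::vector<std::vector<uint8_t>>;

    // Hypothetical stand-ins for the internal M-chunk transforms:
    void ifft_m(const Chunks& in, size_t offset, Chunks& out);
    void fft_m(Chunks& inout);

    // XOR M equally sized chunks of src into dst (field addition).
    static void xor_m(Chunks& dst, const Chunks& src)
    {
        for (size_t c = 0; c < dst.size(); ++c)
            for (size_t i = 0; i < dst[c].size(); ++i)
                dst[c][i] ^= src[c][i];
    }

    // data: K original chunks; work and temp: M chunks each.
    static void encode_sketch(const Chunks& data, Chunks& work, Chunks& temp,
                              size_t M, size_t K)
    {
        ifft_m(data, 0, work);       // first IFFT lands directly in work
        for (size_t g = M; g < K; g += M)
        {
            ifft_m(data, g, temp);   // IFFT of the next group of M chunks
            xor_m(work, temp);       // accumulate via XOR
        }
        fft_m(work);                 // final FFT yields the recovery data
    }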

Encoder optimizations:

  • The first IFFT can be performed directly in the first M chunks.
  • The zero padding can be skipped while performing the final IFFT. Unrolling is used in the code to accomplish both of these optimizations.
  • The final FFT can also be truncated if the recovery set is not a power of 2; it is easy to truncate the FFT by ending the inner loop early.
  • The decimation-in-time (DIT) FFT is employed to calculate two layers at a time, rather than writing each layer out and reading it back in for the next layer of the FFT.

Decoder algorithm:

The decoder is described in {1}. Operations are done in O(N log N), where N is up to twice the size of the original data, as described below.

Roughly in brief:

Original = -ErrLocator * FFT( Derivative( IFFT( ErrLocator * ReceivedData ) ) )

Precalculations:

At startup initialization, FFTInitialize() precalculates FWT(L) as described by equation (92) in {1}, where L = Log[i] for i = 0..Order, with Order = 256 for the 8-bit field (FF8) and 65536 for the 16-bit field (FF16). This is stored in the LogWalsh vector.

It also precalculates the FFT skew factors (s_i) as described by equation (28). This is stored in the FFTSkew vector.

For the memory workspace, N data chunks are needed, where N is a power of two at or above M + K; here K is the number of original pieces and M is the next power of two at or above the number of recovery pieces. For example, with K = 200 pieces of data and 10% redundancy, there are 20 redundant pieces, which rounds up to M = 32. Then M + K = 232 pieces, so N rounds up to 256. A small standalone helper (hypothetical, not part of the library API) reproduces this arithmetic below.
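
    #include <cstdio>

    // Next power of two at or above x, for x >= 1. Hypothetical helper,
    // not part of the Leopard API; it just checks the example above.
    static unsigned next_pow2(unsigned x)
    {
        unsigned p = 1;
        while (p < x)
            p <<= 1;
        return p;
    }

    int main()
    {
        const unsigned K = 200;              // original pieces
        const unsigned R = 20;               // recovery pieces (10% redundancy)
        const unsigned M = next_pow2(R);     // 32
        const unsigned N = next_pow2(M + K); // 232 rounds up to 256
        printf("M = %u, N = %u\n", M, N);    // prints: M = 32, N = 256
        return 0;
    }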

Online calculations:

At runtime, the error locator polynomial is evaluated using the Fast Walsh-Hadamard transform as described in {1} equation (92).

At runtime, the data is explicitly laid out in workspace memory like this:

[Recovery Data (Power of Two = M)] [Original Data (K)] [Zero Padding out to N]

Data that was lost is replaced with zeroes. Data that was received, including recovery data, is multiplied by the error locator polynomial as it is copied into the workspace.

The IFFT is applied to the entire workspace of N chunks. Since the IFFT starts with pairs of inputs and doubles in width at each iteration, the IFFT is optimized by skipping zero padding at the end until it starts mixing with non-zero data.

The formal derivative is applied to the entire workspace of N chunks.

The FFT is applied to the entire workspace of N chunks. The FFT is optimized by only performing intermediate calculations required to recover lost data. Since it starts wide and ends up working on adjacent pairs, at some point the intermediate results are not needed for data that will not be read by the application. This optimization is implemented by the ErrorBitfield class.

Finally, only recovered data is multiplied by the negative of the error locator polynomial as it is copied into the front of the workspace for the application to retrieve.
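
Putting the online steps together, the decode pass has roughly the shape sketched below. All helpers are hypothetical stand-ins for the library's internals, and the error locator multiplies happen during the copies described above.

    #include <cstdint>
    #include <vector>

    using Chunks = std::vector<std::vector<uint8_t>>;

    // Hypothetical stand-ins for the internal N-chunk transforms:
    void ifft_n(Chunks& work);              // skips trailing zero padding
    void formal_derivative(Chunks& work);
    void fft_n(Chunks& work);               // pruned via ErrorBitfield

    // work: N chunks, lost pieces zeroed, received pieces already
    // multiplied by the error locator polynomial on copy-in.
    static void decode_sketch(Chunks& work)
    {
        ifft_n(work);            // IFFT over the N-chunk workspace
        formal_derivative(work); // formal derivative over the workspace
        fft_n(work);             // FFT, computing only what recovery needs
        // Recovered chunks are then multiplied by -ErrLocator as they are
        // copied to the front of the workspace for the application.
    }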

Finite field arithmetic optimizations:

For faster finite field multiplication, large tables are precomputed and applied during encoding/decoding to 64 bytes of data at a time using SSSE3 or AVX2 vector instructions and the ALTMAP approach from Jerasure {4}.

Addition in this finite field is XOR, and a vectorized memory XOR routine is also used.
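
A scalar version of that XOR routine is trivial; the library's version does the same work with SSSE3/AVX2 vectors, 64 bytes at a time.

    #include <cstddef>
    #include <cstdint>

    // Scalar sketch of finite-field addition: dst += src is a byte-wise XOR.
    // (Subtraction is identical, since every element is its own inverse.)
    static void xor_mem(uint8_t* dst, const uint8_t* src, size_t bytes)
    {
        for (size_t i = 0; i < bytes; ++i)
            dst[i] ^= src[i];
    }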

References:

This library implements an MDS erasure code introduced in this paper:

    {1} S.-J. Lin, T. Y. Al-Naffouri, Y. S. Han, and W.-H. Chung,
        "Novel Polynomial Basis with Fast Fourier Transform and Its
        Application to Reed-Solomon Erasure Codes,"
        IEEE Trans. on Information Theory, pp. 6284-6299, November 2016.

    {2} D. G. Cantor, "On arithmetical algorithms over finite fields,"
        Journal of Combinatorial Theory, Series A, vol. 50, no. 2,
        pp. 285-300, 1989.

    {3} S.-J. Lin and W.-H. Chung, "An Efficient (n, k) Information
        Dispersal Algorithm for High Code Rate System over Fermat Fields,"
        IEEE Commun. Lett., vol. 16, no. 12, pp. 2036-2039, Dec. 2012.

    {4} J. S. Plank, K. M. Greenan, and E. L. Miller, "Screaming Fast
        Galois Field Arithmetic Using Intel SIMD Instructions,"
        FAST-2013: 11th Usenix Conference on File and Storage Technologies,
        San Jose, 2013.

Some papers are mirrored in the /docs/ folder.

Credits

Inspired by discussion with:

Software by Christopher A. Taylor [email protected]

Please reach out if you need support or would like to collaborate on a project.
