• Stars
    star
    107
  • Rank 323,587 (Top 7 %)
  • Language
    C++
  • Created about 9 years ago
  • Updated over 3 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Fast GF(256) Cauchy MDS Block Erasure Codec in C

cm256

Fast GF(256) Cauchy MDS Block Erasure Codec in C

cm256 is a simple library for erasure codes. From given data it generates redundant data that can be used to recover the originals.

It is roughly 2x faster than Longhair, and CM256 supports input data that is not a multiple of 8 bytes.

Currently only Visual Studio 2013 is supported, though other versions of MSVC may work.

The original data should be split up into equally-sized chunks. If one of these chunks is erased, the redundant data can fill in the gap through decoding.

The erasure code is parameterized by three values (OriginalCount, RecoveryCount, BlockBytes). These are:

  • The number of blocks of original data (OriginalCount), which must be less than 256.
  • The number of blocks of redundant data (RecoveryCount), which must be no more than 256 - OriginalCount.

For example, if a file is split into 3 equal pieces and sent over a network, OriginalCount is 3. And if 2 additional redundant packets are generated, RecoveryCount is 2. In this case up to 256 - 3 = 253 additional redundant packets can be generated.

Building: Quick Setup

Include the cm256.* and gf256.* files in your project and consult the cm256.h header for usage.

Usage

Documentation is provided in the header file cm256.h.

When your application starts up it should call cm256_init() to verify that the library is linked properly:

	#include "cm256.h"

	if (cm256_init()) {
		// Wrong static library
		exit(1);
	}

To generate redundancy, use the cm256_encode function. To solve for the original data use the cm256_decode function.

Example usage:

bool ExampleFileUsage()
{
    if (cm256_init())
    {
        exit(1);
    }

    cm256_encoder_params params;

    // Number of bytes per file block
    params.BlockBytes = 4321;

    // Number of blocks
    params.OriginalCount = 33;

    // Number of additional recovery blocks generated by encoder
    params.RecoveryCount = 12;

    // Size of the original file
    static const int OriginalFileBytes = params.OriginalCount * params.BlockBytes;

    // Allocate and fill the original file data
    uint8_t* originalFileData = new uint8_t[OriginalFileBytes];
    memset(originalFileData, 1, OriginalFileBytes);

    // Pointers to data
    cm256_block blocks[256];
    for (int i = 0; i < params.OriginalCount; ++i)
    {
        blocks[i].Block = originalFileData + i * params.BlockBytes;
    }

    // Recovery data
    uint8_t* recoveryBlocks = new uint8_t[params.RecoveryCount * params.BlockBytes];

    // Generate recovery data
    if (cm256_encode(params, blocks, recoveryBlocks))
    {
        exit(1);
    }

    // Initialize the indices
    for (int i = 0; i < params.OriginalCount; ++i)
    {
        blocks[i].Index = cm256_get_original_block_index(params, i);
    }

    //// Simulate loss of data, subsituting a recovery block in its place ////
    blocks[0].Block = recoveryBlocks; // First recovery block
    blocks[0].Index = cm256_get_recovery_block_index(params, 0); // First recovery block index
    //// Simulate loss of data, subsituting a recovery block in its place ////

    if (cm256_decode(params, blocks))
    {
        exit(1);
    }

    // blocks[0].Index will now be 0.

    delete[] originalFileData;
    delete[] recoveryBlocks;

    return true;
}

The example above is just one way to use the cm256_decode function.

This API was designed to be flexible enough for UDP/IP-based file transfer where the blocks arrive out of order.

Benchmark

CM256 demonstrates similar encoding and (worst case) decoding performance:

Encoder: 1296 bytes k = 100 m = 1 : 5.55886 usec, 23314.1 MBps
Decoder: 1296 bytes k = 100 m = 1 : 6.72915 usec, 19259.5 MBps
Encoder: 1296 bytes k = 100 m = 2 : 17.2617 usec, 7507.93 MBps
Decoder: 1296 bytes k = 100 m = 2 : 19.6023 usec, 6611.46 MBps
Encoder: 1296 bytes k = 100 m = 3 : 30.4275 usec, 4259.31 MBps
Decoder: 1296 bytes k = 100 m = 3 : 32.4755 usec, 3990.7 MBps
Encoder: 1296 bytes k = 100 m = 4 : 40.6675 usec, 3186.82 MBps
Decoder: 1296 bytes k = 100 m = 4 : 43.5932 usec, 2972.94 MBps
Encoder: 1296 bytes k = 100 m = 5 : 51.7852 usec, 2502.64 MBps
Decoder: 1296 bytes k = 100 m = 5 : 51.4926 usec, 2516.86 MBps
Encoder: 1296 bytes k = 100 m = 6 : 62.6104 usec, 2069.94 MBps
Decoder: 1296 bytes k = 100 m = 6 : 64.9509 usec, 1995.35 MBps
Encoder: 1296 bytes k = 100 m = 7 : 76.3612 usec, 1697.2 MBps
Decoder: 1296 bytes k = 100 m = 7 : 75.191 usec, 1723.61 MBps
Encoder: 1296 bytes k = 100 m = 8 : 85.1384 usec, 1522.23 MBps
Decoder: 1296 bytes k = 100 m = 8 : 83.0904 usec, 1559.75 MBps
Encoder: 1296 bytes k = 100 m = 9 : 96.2561 usec, 1346.41 MBps
Decoder: 1296 bytes k = 100 m = 9 : 95.3784 usec, 1358.8 MBps
Encoder: 1296 bytes k = 100 m = 10 : 110.592 usec, 1171.87 MBps
Decoder: 1296 bytes k = 100 m = 10 : 109.714 usec, 1181.25 MBps

Encoder: 1296 bytes k = 100 m = 20 : 223.525 usec, 579.801 MBps
Decoder: 1296 bytes k = 100 m = 20 : 209.481 usec, 618.671 MBps

Encoder: 1296 bytes k = 100 m = 30 : 372.737 usec, 347.699 MBps
Decoder: 1296 bytes k = 100 m = 30 : 322.707 usec, 401.603 MBps

Encoder: 1296 bytes k = 100 m = 40 : 471.626 usec, 274.794 MBps
Decoder: 1296 bytes k = 100 m = 40 : 434.762 usec, 298.094 MBps

Encoder: 1296 bytes k = 100 m = 50 : 592.751 usec, 218.642 MBps
Decoder: 1296 bytes k = 100 m = 50 : 545.939 usec, 237.389 MBps

(These performance numbers are out of date and not well calibrated - Decoding now takes the same time as encoding within a few microseconds thanks to the new matrix solver.)

Longhair Library Results:

Note that I hand-optimized the MemXOR.cpp implementation on this PC to run faster than what is available on github, so this is a fair comparison.

Encoded k=100 data blocks with m=1 recovery blocks in 4.09607 usec : 31640.1 MB/s
+ Decoded 1 erasures in 5.85144 usec : 22148.4 MB/s
Encoded k=100 data blocks with m=2 recovery blocks in 41.5452 usec : 3119.5 MB/s
+ Decoded 2 erasures in 43.5931 usec : 2972.94 MB/s
Encoded k=100 data blocks with m=3 recovery blocks in 80.7498 usec : 1604.96 MB/s
+ Decoded 3 erasures in 86.6013 usec : 1496.51 MB/s
Encoded k=100 data blocks with m=4 recovery blocks in 123.465 usec : 1049.69 MB/s
+ Decoded 4 erasures in 127.854 usec : 1013.66 MB/s
Encoded k=100 data blocks with m=5 recovery blocks in 76.9464 usec : 1684.29 MB/s
+ Decoded 5 erasures in 88.6493 usec : 1461.94 MB/s
Encoded k=100 data blocks with m=6 recovery blocks in 87.7717 usec : 1476.56 MB/s
+ Decoded 6 erasures in 100.352 usec : 1291.45 MB/s
Encoded k=100 data blocks with m=7 recovery blocks in 103.863 usec : 1247.8 MB/s
+ Decoded 7 erasures in 127.269 usec : 1018.32 MB/s
Encoded k=100 data blocks with m=8 recovery blocks in 118.784 usec : 1091.05 MB/s
+ Decoded 8 erasures in 145.701 usec : 889.494 MB/s
Encoded k=100 data blocks with m=9 recovery blocks in 146.871 usec : 882.406 MB/s
+ Decoded 9 erasures in 158.574 usec : 817.284 MB/s
Encoded k=100 data blocks with m=10 recovery blocks in 156.819 usec : 826.433 MB/s
+ Decoded 10 erasures in 181.102 usec : 715.619 MB/s

Encoded k=100 data blocks with m=20 recovery blocks in 282.039 usec : 459.511 MB/s
+ Decoded 20 erasures in 370.103 usec : 350.172 MB/s

Encoded k=100 data blocks with m=30 recovery blocks in 428.618 usec : 302.367 MB/s
+ Decoded 30 erasures in 614.693 usec : 210.837 MB/s

Encoded k=100 data blocks with m=40 recovery blocks in 562.323 usec : 230.472 MB/s
+ Decoded 40 erasures in 855.188 usec : 151.546 MB/s

Encoded k=100 data blocks with m=50 recovery blocks in 727.041 usec : 178.257 MB/s
+ Decoded 50 erasures in 1181.11 usec : 109.727 MB/s

Results Discussion:

For m=1 they are both running the same kind of code, so they're basically the same.

For m=2 and m=3, CM256 is 2.5x faster.

For m=4, CM256 is 3x faster in this case. Longhair could use more tuning. Back when I wrote it, the right time to switch to the Windowed decoder was at m=5, but on my new PC it seems like m=4 is a better time to do it. CM256 only has one mode so it doesn't require any tuning for best performance.

For m=5...30, CM256 performance is not quite 2x faster, maybe 1.7x or so.

For m>30, CM256 is at least 2x faster.

Comparisons with Other Libraries

The approach taken in CM256 is similar to the Intel Storage Acceleration Library (ISA-L) available here:

https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/downloads

ISA-L more aggressively optimizes the matrix multiplication operation, which is the most expensive step of encoding.

CM256 takes better advantage of the m=1 case and the first recovery symbol, which is also possible with the Vandermonde matrices supported by ISA-L.

ISA-L uses a O(N^3) Gaussian elimination solver for decoding. The CM256 decoder solves the linear system using a fast O(N^2) LDU-decomposition algorithm from "Pivoting and Backward Stability of Fast Algorithms for Solving Cauchy Linear Equations" (T. Boros, T. Kailath, V. Olshevsky), which was hand-optimized for memory accesses.

Credits

This software was written entirely by myself ( Christopher A. Taylor [email protected] ). If you find it useful and would like to buy me a coffee, consider tipping.

More Repositories

1

supercharger

Supercharge Open-Source AI Models
Python
348
star
2

dora

Implementation of DoRA
Python
276
star
3

Zpng

Better lossless compression than PNG with a simpler algorithm
C
267
star
4

wirehair

Wirehair : O(N) Fountain Code for Large Data
C++
267
star
5

self-discover

Implementation of Google's SELF-DISCOVER
Python
263
star
6

WLANOptimizer

Single-header C library that fixes WiFi performance issues for online gaming and other low-latency real-time network traffic.
C++
223
star
7

longhair

Longhair : O(N^2) Cauchy Reed-Solomon Block Erasure Code for Small Data
C++
156
star
8

leopard

Leopard-RS : O(N Log N) MDS Reed-Solomon Block Erasure Code for Large Data
C++
135
star
9

shorthair

Shorthair : Generational Block Streaming Erasure Codes
C++
128
star
10

TimeSync

TimeSync: Time Synchronization Library in Portable C++
C++
122
star
11

tonk

Tonk : Reliable UDP (rUDP) Network Library and Infinite Window Erasure Code
C++
101
star
12

Zdepth

Zdepth :: Streaming Depth Compressor in C++ for Azure Kinect DK
C++
97
star
13

xrcap

Azure Kinect multi-camera secure network capture/record/replay
C++
76
star
14

snowshoe

Snowshoe - Portable, Secure, Fast Elliptic Curve Math Library in C
C++
62
star
15

XRmonitors

XRmonitors : User-Friendly Virtual Multi-Monitors for the Workplace
C++
54
star
16

tabby

Tabby - Strong, Fast, and Portable Cryptographic Signatures, Handshakes, and Password Authentication
C++
50
star
17

gf256

GF256 - Fast 8-bit Galois Field Math in C
C++
50
star
18

kvm

Low-Bandwidth IP KVM using Raspberry Pi 4
C++
48
star
19

siamese

Siamese : Infinite-Window Streaming Erasure Code (HARQ)
C++
47
star
20

bitnet_cpu

Experiments with BitNet inference on CPU
C++
46
star
21

ZdepthLossy

Lossy version of Zdepth using video encoders
C++
41
star
22

fecal

FEC-AL : O(N^2) Fountain Code for Small Data
C++
36
star
23

CauchyCaterpillar

Cauchy Caterpillar : O(N^2) Short-Window Streaming Erasure Code
C++
35
star
24

aiwebcam2

Second attempt at AI webcam, this time with OpenAI API
Python
32
star
25

upsampling

Image Upsampling with PyTorch
Python
22
star
26

calico

Calico - Strong, Fast, and Portable Authenticated Encryption
C++
21
star
27

cymric

Cymric - Portable secure random number generator
C++
20
star
28

loraftp

File transfer between two Raspberry Pis using the LoRa Pi HAT from Waveshare
C
19
star
29

mau

Network simulator for reliable UDP testing in C++
C++
16
star
30

oaillama3

Simple setup to self-host LLaMA3-70B model with an OpenAI API
16
star
31

lllm

Latent Large Language Models
Python
16
star
32

PacketAllocator

C++ Memory allocator for packet queues that free() in roughly the same order that they alloc().
C++
15
star
33

spectral_ssm

Implementation of Spectral State Space Models
Python
15
star
34

libcat

Common code library
C++
14
star
35

AutoAudiobook

Automatically create an audiobook using OpenAI
Python
14
star
36

minigpt4

MiniGPT-4 :: Updated to Torch 2.0, simple setup, easier API, cut out training code
Python
13
star
37

sdxl

SDXL GPU cluster scripts
Python
13
star
38

dataloader

High-performance tokenized language data-loader for Python C++ extension
C++
12
star
39

fp61

Experiment: Fast finite field Fp=2^61-1 in C++
C++
11
star
40

cuda_float_compress

Python package for compressing floating-point PyTorch tensors
Cuda
10
star
41

phind

Locally hosted: 60% HumanEval
Python
8
star
42

rtmp_receiver

Simple unidirectional RTMP video stream receiver
C++
7
star
43

counter

C++ wrapper for counters that can roll-over (e.g. timestamps/ack-ids)
C++
6
star
44

z16

16-bit monochrome image compressor based on Zstd
C
5
star
45

AQLM

Fixes for AQLM
Python
5
star
46

llamanal.cpp

Static code analysis for C++ projects using llama.cpp and the best LLM you can run offline without an expensive GPU.
C
5
star
47

unfiltered-diffusers

Simple fork that disables NSFW filter
Python
5
star
48

hloc

Python
4
star
49

boss-balloon

BossBalloon.io
TypeScript
4
star
50

libcatid

Automatically exported from code.google.com/p/libcatid
C++
3
star
51

voron

Voron 3D Printer files
3
star
52

halide-test

Test v14/v15 performance regression
CMake
3
star
53

logger

Feature-rich portable C++ logging subsystem in 650 lines of code
C++
3
star
54

cifar10deepspeed

Using DeepSpeed and Nvidia DALI to train various models to solve CIFAR-10
Python
3
star
55

chainlit-anthropic

Chainlit AI UI with Anthropic Backend
Python
3
star
56

fastest_gf_matrix_mult

A fairly hacked together piece of code that can quickly search for the optimal GF polynomials and Cauchy matrices for XOR-based GF matrix multiplication for erasure code encoders. May be useful if the matrices are of fixed size.
C++
3
star
57

recfilter-2020-fail

This is a failed attempt to port Recfilter to latest Halide from 2020. Maybe someone else can figure this out?
C++
2
star
58

bentpipe

Simple UDP rebroadcaster
C++
2
star
59

whisper3

Testing out Whisper 3
Python
2
star
60

sphynx

Sphynx - High Performance Network Transport Layer
C++
2
star
61

textworld_llm_benchmark

TextWorld LLM Benchmark
Python
2
star
62

pixel-perfect-sfm

pixel-perfect-sfm with some minor fixes
C++
2
star
63

train_ticket

Gated PRNGs are all you need? Spoilers: No.
Python
2
star
64

CatsChoice

PRNG Parameter Generation
C++
1
star
65

CRC16Recovery

Optimized CRC16 with error recovery in C
C++
1
star
66

Splane

Archived old code - /ban_ids/ is kind of interesting
C
1
star
67

rust_webgl_demo

Hello World : Rust Web Assembly
Rust
1
star
68

blog2022

HTML
1
star
69

swe_agent_playground

swe_agent_playground
1
star
70

Exapunks_Solutions

Exapunks Walkthrough Solutions
1
star
71

audio_prediction

Simple audio prediction example with RNN
Python
1
star
72

never_forget

Implementation of Overcoming Catastrophic Forgetting
Python
1
star
73

DependencyInjected

DependencyInjected : Light-weight and powerful Dependency Injection pattern for C++
C++
1
star
74

quicsend

quicsend :: Super-fast Internet-ready file transfer right from Python
C++
1
star