

Repository Details

Longhair : O(N^2) Cauchy Reed-Solomon Block Erasure Code for Small Data

Longhair

Fast Cauchy Reed-Solomon Erasure Codes in C

Longhair is a simple, portable library for erasure codes. From given data it generates redundant data that can be used to recover the originals. It is extremely fast, though not as fast as the CM256 library.

The original data should be split up into equally-sized chunks. If one of these chunks is erased, the redundant data can fill in the gap through decoding.

The erasure code is parameterized by three values (k, m, bytes); a minimal sanity check is sketched after this list. These are:

  • The number of blocks of original data (k), which must be less than 256.
  • The number of blocks of redundant data (m), which must be no more than 256 - k.
  • And the number of bytes per block (bytes), which must be a multiple of 8 bytes.
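
Taken together, these constraints can be checked up front. Here is a minimal sanity-check sketch (illustrative only, not part of the library API):

	// Returns true if (k, m, bytes) satisfy the constraints listed above.
	static bool valid_parameters(int k, int m, int bytes) {
		return k >= 1 && k < 256            // k must be less than 256
			&& m >= 1 && m <= 256 - k       // m must be no more than 256 - k
			&& bytes > 0 && bytes % 8 == 0; // bytes must be a multiple of 8
	}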

These erasure codes are not patent-encumbered and the software is provided royalty-free.

Building: Quick Setup

The longhair-mobile directory contains an easy-to-import set of C code that also builds properly for mobile devices. In a pinch you can use this code for desktops too.

Usage

Documentation is provided in the header file cauchy_256.h.

When your application starts up it should call cauchy_init() to verify that the library is linked properly:

	#include "cauchy_256.h"

	if (cauchy_init()) {
		// Wrong static library
		exit(1);
	}

To generate redundancy, use the cauchy_256_encode function:

	int data_len = ...;

	int k = 32; // Choose a number of pieces to split the data into.
	int m = 12; // Choose a number of redundant pieces to generate.
	int bytes = 1000; // Choose a number of bytes per chunk, a multiple of 8 bytes.

	assert(bytes % 8 == 0);
	assert(data_len == k * bytes);

	char *recovery_blocks = new char[m * bytes];
	char *data_ptrs[32];

	// Point data_ptrs to data here
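	// Illustrative only (not in the original example): if the original data
	// happens to live in one contiguous buffer `data` of k * bytes, the
	// pointers could be set up like this:
	//
	//     for (int ii = 0; ii < k; ++ii)
	//         data_ptrs[ii] = data + ii * bytes;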

	// Encode the data using this library
	if (cauchy_256_encode(k, m, data_ptrs, recovery_blocks, bytes)) {
		// Input was invalid
		return false;
	}

	// For each recovery block,
	for (int ii = 0; ii < m; ++ii) {
		char *block = recovery_blocks + ii * bytes;
		unsigned char row = k + ii; // Transmit this with block (just one byte)

		// Transmit or store block bytes and row
	}

	delete []recovery_blocks;

To recover the original data, use the cauchy_256_decode function:

	// Use the same settings as the encoder
	int k = 32;
	int m = 12;
	int bytes = 1000;

	// Allocate block info array
	// There should be exactly k blocks
	Block *block_info = new Block[k];

	// Fill block_info here with data and rows from the encoder
	// Rows under k are original data, and the rest are redundant data
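	// Illustrative only: `received_data` and `received_rows` are hypothetical
	// arrays holding the k blocks you actually received and their row bytes:
	//
	//     for (int ii = 0; ii < k; ++ii) {
	//         block_info[ii].data = received_data[ii];
	//         block_info[ii].row = received_rows[ii];
	//     }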

	// Attempt decoding
	if (cauchy_256_decode(k, m, block_info, bytes)) {
		// Decoding should never fail - indicates input is invalid
		assert(k + m <= 256);
		assert(block_info != 0);
		assert(bytes % 8 == 0);
		return false;
	}

	// Now the block_info elements that used to have redundant data are
	// corrected in-place and contain the original data.

The example above is just one way to use the cauchy_256_decode function.

This API was designed to be flexible enough for UDP/IP-based file transfer where the blocks arrive out of order. For packet erasure correction the following code is more applicable:

	// Use the same settings as the encoder
	int k = 32;
	int m = 12;
	int bytes = 1000;

	// Allocate enough space for the original data
	char *block_data = new char[bytes * k];

	// Allocate block info array
	Block *block_info = new Block[k];

	int original_count = 0, recovery_count = 0;

	// onData() calls this function with original data, both as it arrives and after recovery
	static void processData(int row, char *data, int data_bytes) {
		// Handle original data here only
	}

	// Call this function with each block received, either original or recovery
	// Returns true on complete
	bool onData(unsigned char row, char *new_data) {
		int insertion_point;

		// If it is original data,
		if (row < k) {
			// Process the original data immediately - Do not wait for it all to arrive!
			processData(row, new_data, bytes);

			// Copy to the end of the original block data
			insertion_point = original_count++;
		} else {
			// Copy to the front of the recovery block data
			insertion_point = k - ++recovery_count;
		}

		// Copy data into place
		char *dest = block_data + insertion_point * bytes;
		memcpy(dest, new_data, bytes);

		// NOTE: It may be possible to avoid copying, depending on whether
		// you can hang onto the provided data buffer.

		// Fill in the block array entry
		Block *block = block_info + insertion_point;
		block->data = dest;
		block->row = row;

		// If recovery is not possible yet,
		if (original_count + recovery_count < k) {
			return false;
		}

		// Attempt decoding
		if (cauchy_256_decode(k, m, block_info, bytes)) {
			// Decoding should never fail - indicates input is invalid
			assert(k + m <= 256);
			assert(block_info != 0);
			assert(bytes % 8 == 0);
			return false;
		}

		// For each recovered block,
		block = block_info + k - recovery_count;
		for (int ii = 0; ii < recovery_count; ++ii, ++block) {
			// Process the recovered data
			processData(block->row, block->data, bytes);
		}

		return true;
	}

Benchmarks

These numbers are from my fairly fast desktop. Try it out on your target device and see how it does!

Using 1296 bytes per block (ie. packet/chunk size); must be a multiple of 8 bytes

Encoded k=29 data blocks with m=1 recovery blocks in 1 usec : 37584 MB/s
+ Decoded 1 erasures in 2 usec : 18792 MB/s
Encoded k=29 data blocks with m=2 recovery blocks in 9 usec : 4176 MB/s
+ Decoded 2 erasures in 12 usec : 3132 MB/s
Encoded k=29 data blocks with m=3 recovery blocks in 19 usec : 1978 MB/s
+ Decoded 3 erasures in 22 usec : 1708 MB/s
Encoded k=29 data blocks with m=4 recovery blocks in 31 usec : 1212 MB/s
+ Decoded 4 erasures in 35 usec : 1073 MB/s
Encoded k=29 data blocks with m=5 recovery blocks in 22 usec : 1708 MB/s
+ Decoded 5 erasures in 30 usec : 1252 MB/s
Encoded k=29 data blocks with m=6 recovery blocks in 24 usec : 1566 MB/s
+ Decoded 6 erasures in 34 usec : 1105 MB/s
Encoded k=29 data blocks with m=7 recovery blocks in 28 usec : 1342 MB/s
+ Decoded 7 erasures in 39 usec : 963 MB/s
Encoded k=29 data blocks with m=8 recovery blocks in 30 usec : 1252 MB/s
+ Decoded 8 erasures in 43 usec : 874 MB/s
Encoded k=29 data blocks with m=9 recovery blocks in 33 usec : 1138 MB/s
+ Decoded 9 erasures in 53 usec : 709 MB/s
Encoded k=29 data blocks with m=10 recovery blocks in 36 usec : 1044 MB/s
+ Decoded 10 erasures in 57 usec : 659 MB/s
Encoded k=29 data blocks with m=11 recovery blocks in 39 usec : 963 MB/s
+ Decoded 11 erasures in 63 usec : 596 MB/s
Encoded k=29 data blocks with m=12 recovery blocks in 41 usec : 916 MB/s
+ Decoded 12 erasures in 71 usec : 529 MB/s
Encoded k=29 data blocks with m=13 recovery blocks in 51 usec : 736 MB/s
+ Decoded 13 erasures in 82 usec : 458 MB/s
Encoded k=29 data blocks with m=14 recovery blocks in 48 usec : 783 MB/s
+ Decoded 14 erasures in 87 usec : 432 MB/s
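
The MB/s figures appear to be the total original data size divided by the elapsed time: with k = 29 blocks of 1296 bytes there are 29 * 1296 = 37,584 bytes of original data, so the m = 2 encode that took 9 usec works out to 37,584 bytes / 9 usec = 4,176 MB/s, matching the table above.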

Note that the codec is specialized for the m = 1 case and runs very quickly there. By a happy coincidence, the first recovery block is always just the XOR of all the original data, so you can use this codec instead of computing that XOR manually.
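
To illustrate that property, here is a small sketch (not part of the library; it reuses data_ptrs, k, and bytes from the encoder example above) that checks whether an m = 1 recovery block equals the byte-wise XOR of the original blocks:

	#include <cstring>
	#include <vector>

	// Returns true if recovery_block is the byte-wise XOR of the k original blocks.
	static bool is_xor_of_originals(char **data_ptrs, int k, int bytes,
	                                const char *recovery_block) {
		std::vector<char> x(bytes, 0);
		for (int ii = 0; ii < k; ++ii) {
			for (int jj = 0; jj < bytes; ++jj) {
				x[jj] ^= data_ptrs[ii][jj];
			}
		}
		return std::memcmp(x.data(), recovery_block, bytes) == 0;
	}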

For erasure codes the important statistic to watch is the encoder speed, because it determines whether or not the transmitter can afford to send the extra data. The decoder usually does not end up using every redundant block, so the decoder figures above are actually worst-case numbers.

Comparisons with Alternatives

This library implements Cauchy Reed-Solomon (CRS) codes as introduced by Jerasure, and it improves on existing CRS codes in a number of novel ways to greatly increase throughput. For full details on what has been improved, refer to the source code comments.

Jerasure's performance in the packet case can be improved considerably by tuning. Jerasure performs best when the value of "w" is tuned to be as small as possible, whereas Longhair fixes w = 8. For a favorable Jerasure benchmark, set w = 6 and k = 29, with recovery block counts m = 1-5:

Jerasure unit tester on iMac (2.7 GHz Core i5-2500S Sandy Bridge, June 2011):

With "smart scheduling": (Does not help for streaming scenario)

Using 1488 bytes per block (ie. packet/chunk size); must be a multiple of 8 bytes

Encoded k=29 data blocks with m=1 recovery blocks in 20 usec : 2157.6 MB/s
Encoded k=29 data blocks with m=2 recovery blocks in 91 usec : 474.198 MB/s
Encoded k=29 data blocks with m=3 recovery blocks in 167 usec : 258.395 MB/s
Encoded k=29 data blocks with m=4 recovery blocks in 240 usec : 179.8 MB/s
Encoded k=29 data blocks with m=5 recovery blocks in 322 usec : 134.012 MB/s

Without "smart scheduling" and with "improved coding matrix": (Setup time hurts performance)

Encoded k=29 data blocks with m=1 recovery blocks in 9 usec : 4794.67 MB/s
Encoded k=29 data blocks with m=2 recovery blocks in 53 usec : 814.189 MB/s
Encoded k=29 data blocks with m=3 recovery blocks in 98 usec : 440.327 MB/s
Encoded k=29 data blocks with m=4 recovery blocks in 152 usec : 283.895 MB/s
Encoded k=29 data blocks with m=5 recovery blocks in 187 usec : 230.759 MB/s

Without "improved coding matrix" and without "smart scheduling": (BEST PERFORMANCE)

Encoded k=29 data blocks with m=1 recovery blocks in 30 usec : 1438.4 MB/s
Encoded k=29 data blocks with m=2 recovery blocks in 52 usec : 829.846 MB/s
Encoded k=29 data blocks with m=3 recovery blocks in 77 usec : 560.416 MB/s
Encoded k=29 data blocks with m=4 recovery blocks in 104 usec : 414.923 MB/s
Encoded k=29 data blocks with m=5 recovery blocks in 127 usec : 339.78 MB/s
Encoded k=29 data blocks with m=6 recovery blocks in 153 usec : 282.039 MB/s

Note that Jerasure is roughly 3x slower than this library for the streaming use case.

Rateless codes

Rateless codes are another way to do erasure coding efficiently: there is no need to specify a value for m up front, and you can generate as many redundant blocks as needed. Rateless codes tend to be more efficient for high-speed file transfer. For more information on this type of erasure code, see the Wirehair library.

Encoder speed discussion

I found that encoder speed is directly related to the number of recovery blocks (m) and does not depend on the amount of data being protected (k):

[Plot: encoder speed as a function of k and m]

The rows are values of k (amount of data) and the columns are values of m (number of recovery blocks added).

Darker is better. The main point of this plot is simply to show that k does not factor into the performance of the code.

Credits

Software by Christopher A. Taylor [email protected]

Please reach out if you need support or would like to collaborate on a project.
