
dedup

A Streaming Deduplication package for Go

This package implements streaming deduplication, allowing you to remove duplicated data in streams. It implements variable block sizes and automatic content-block adaptation. It has a fully streaming mode and an indexed mode, which has significantly reduced memory requirements.

For an introduction to deduplication, read the blog post Fast Stream Deduplication in Go.

Package home: https://github.com/klauspost/dedup

Godoc: https://godoc.org/github.com/klauspost/dedup


Installation

To get the package use the standard:

go get -u github.com/klauspost/dedup

Usage

If you haven't already, you should read the Fast Stream Deduplication in Go blog post, since it will introduce different aspects and help you make choices for your setup.

There are two symmetric function pairs: NewWriter/NewReader and NewStreamWriter/NewStreamReader. The first pair creates an indexed stream, which writes the index and the data to two separate streams. This allows the deduplicated stream to be decoded with much less memory. The second pair writes all data to a single stream. This allows for on-the-fly transfers, but requires more memory on the receiving end.

When you create a deduplicating stream, you can choose between fixed and dynamic block sizes. Dynamic blocks adapt block splits to the incoming content, but are slower than fixed-size blocks and require more conservative memory estimates.
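To illustrate what content-adaptive splitting does, here is a minimal, self-contained sketch using a simple rolling sum over a sliding window: a block boundary is declared when the low bits of the sum hit a mask. The window size, mask, and min/max block bounds below are illustrative choices, not what dedup uses internally.

```go
package main

import "fmt"

// split performs a toy version of content-defined chunking: the same byte
// content always produces the same boundaries, so identical regions dedupe
// even when they shift position in the stream.
func split(data []byte) [][]byte {
	const (
		window   = 48
		mask     = (1 << 10) - 1 // targets roughly 1 KB average blocks
		minBlock = 256
		maxBlock = 4096
	)
	var blocks [][]byte
	start := 0
	var sum uint32
	for i := range data {
		sum += uint32(data[i])
		if i-start >= window {
			sum -= uint32(data[i-window])
		}
		size := i - start + 1
		if (size >= minBlock && sum&mask == mask) || size >= maxBlock {
			blocks = append(blocks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		blocks = append(blocks, data[start:])
	}
	return blocks
}

func main() {
	data := make([]byte, 20000)
	for i := range data {
		data[i] = byte(i * 7)
	}
	blocks := split(data)
	total := 0
	for _, b := range blocks {
		total += len(b)
	}
	fmt.Println("blocks:", len(blocks), "total bytes:", total)
}
```

Fixed-size splitting skips the rolling hash entirely, which is why it is faster, but a single inserted byte shifts every boundary after it.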

Here is an example of a full roundtrip with indexed streams. For more examples see the godoc examples.

package main

import (
	"bytes"
	"fmt"
	"io"

	"github.com/klauspost/dedup"
)

// This will deduplicate a buffer of zeros to an indexed stream
func main() {
	// We will write out deduplicated data to these
	idx := bytes.Buffer{}
	data := bytes.Buffer{}

	// This is our input:
	input := bytes.NewBuffer(make([]byte, 50000))

	// Create a new writer, with each block being 1000 bytes fixed size.
	w, err := dedup.NewWriter(&idx, &data, dedup.ModeFixed, 1000, 0)
	if err != nil {
		panic(err)
	}
	// Copy our input to the writer.
	_, err = io.Copy(w, input)
	if err != nil {
		panic(err)
	}

	// Close to flush the remaining buffers
	err = w.Close()
	if err != nil {
		panic(err)
	}

	// Create a new indexed stream reader:
	r, err := dedup.NewReader(&idx, &data)
	if err != nil {
		panic(err)
	}

	// Inspect how much memory it will use.
	fmt.Println("Memory use:", r.MaxMem())

	var dst bytes.Buffer

	// Read everything
	_, err = io.Copy(&dst, r)
	if err != nil {
		panic(err)
	}

	// Let us inspect what was written:
	fmt.Println("Returned data length:", dst.Len())
	fmt.Println("Everything zero:", bytes.Equal(dst.Bytes(), make([]byte, 50000)))
}

Note that there is no error resilience built in. If any data is corrupted, it will most likely not be detected, and there is no way to recover corrupted data. So if you operate in an environment where corruption can occur, you should add your own integrity checks to ensure that data is recoverable.

Input Splitting

If you want to simply split the input, that functionality is also exposed.

This can be useful if you want to deduplicate into your own key-value store. In that case, you simply feed the input to a NewSplitter, which returns the individual fragments along with their hashes. This allows you to store each file as a stream of hashes and keep the fragment data in your data store.

See the examples attached to the NewSplitter function on how to use this.

Hash collisions

The encoder uses SHA-1 to identify and "remember" unique blocks. No hash is immune to collisions, but SHA-1 offers 160 bits of entropy.

For example, the chance of a random hash collision occurring when encoding 1 TB of data in 1 KB blocks is 3.94×10^-31, or about one in 2.5 thousand billion billion billion. This of course assumes a uniform hash distribution and no deliberate hash-collision attacks.

If SHA-1 doesn't provide sufficient security, you can create a stronger version simply by changing the hash import in the package source:

import hasher "crypto/sha1"

You can use sha256 or sha512 for stronger hashes, or md5 for a faster hash.

To help you calculate the birthday-problem likelihood for a given number of blocks, the BirthdayProblem function is provided.

Why is this not compression?

Deduplication does the same as compression, but at a higher level. Instead of looking for small matches, it attempts to find bigger ones: it matches and eliminates blocks whose entire content is identical.

This can be useful when backing up disk images or other content where you have duplicated files, etc.

Deduplication is a good step before compression. You will still be able to compress your data, since unique blocks are passed through as-is, in order and without modification.

License

This code is published under an MIT license. See LICENSE file for more information.
