blake3
go get lukechampine.com/blake3
blake3 implements the BLAKE3 cryptographic hash function.
This implementation aims to be performant without sacrificing (too much)
readability, in the hopes of eventually landing in x/crypto.
In addition to the pure-Go implementation, this package also contains AVX-512
and AVX2 routines (generated by avo) that greatly increase performance for
large inputs and outputs.
Contributions are greatly appreciated. All contributors are eligible to receive an Urbit planet.
Benchmarks
Tested on a 2020 MacBook Air (i5-7600K @ 3.80GHz). Benchmarks will improve as soon as I get access to a beefier AVX-512 machine.
AVX-512
BenchmarkSum256/64 120 ns/op 533.00 MB/s
BenchmarkSum256/1024 2229 ns/op 459.36 MB/s
BenchmarkSum256/65536 16245 ns/op 4034.11 MB/s
BenchmarkWrite 245 ns/op 4177.38 MB/s
BenchmarkXOF 246 ns/op 4159.30 MB/s
AVX2
BenchmarkSum256/64 120 ns/op 533.00 MB/s
BenchmarkSum256/1024 2229 ns/op 459.36 MB/s
BenchmarkSum256/65536 31137 ns/op 2104.76 MB/s
BenchmarkWrite 487 ns/op 2103.12 MB/s
BenchmarkXOF 329 ns/op 3111.27 MB/s
Pure Go
BenchmarkSum256/64 120 ns/op 533.00 MB/s
BenchmarkSum256/1024 2229 ns/op 459.36 MB/s
BenchmarkSum256/65536 133505 ns/op 490.89 MB/s
BenchmarkWrite 2022 ns/op 506.36 MB/s
BenchmarkXOF 1914 ns/op 534.98 MB/s
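For reading the tables above, it may help to see how the MB/s column relates to ns/op. Go's benchmark output reports throughput as bytes per operation divided by time per operation, with 1 MB taken as 1e6 bytes. The sketch below recomputes that figure for the Sum256 rows, where the input size is part of the benchmark name. The function name throughputMBps is hypothetical and not part of this package, and because the printed ns/op values are rounded, the results only approximately match the table.

```go
package main

import "fmt"

// throughputMBps converts a benchmark result into MB/s (1 MB = 1e6 bytes).
// bytesPerOp is the number of bytes hashed per operation and nsPerOp is the
// measured time for one operation in nanoseconds.
// Hypothetical helper for illustration; not part of the blake3 package.
func throughputMBps(bytesPerOp int, nsPerOp float64) float64 {
	// bytes per nanosecond equals GB/s, so scale by 1000 to get MB/s.
	return float64(bytesPerOp) / nsPerOp * 1000
}

func main() {
	// Input sizes and timings taken from the AVX-512 Sum256 rows above.
	rows := []struct {
		size int
		ns   float64
	}{
		{64, 120},
		{1024, 2229},
		{65536, 16245},
	}
	for _, r := range rows {
		fmt.Printf("Sum256/%d: %.2f MB/s\n", r.size, throughputMBps(r.size, r.ns))
	}
}
```

The same arithmetic shows why the 64-byte and 1024-byte rows are identical across all three tables while the 65536-byte row is not: only the larger input changes its ns/op when the assembly routines are enabled.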
Shortcomings
There is no assembly routine for single-block compressions. This is most noticeable for ~1KB inputs.
Each assembly routine inlines all 7 rounds, causing thousands of lines of duplicated code. Ideally the routines could be merged such that only a single routine is generated for AVX-512 and AVX2, without sacrificing too much performance.
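As a rough illustration of why the first shortcoming is most visible around 1 KB: assuming the standard BLAKE3 parameters (1024-byte chunks made of 64-byte blocks), any input of up to 1024 bytes is a single chunk, and the blocks inside a chunk are chained, so they are compressed one after another. Presumably the assembly routines gain their advantage by working on several chunks at once, which a single-chunk input cannot use. The helper names below are hypothetical and not part of this package.

```go
package main

import "fmt"

const (
	chunkSize = 1024 // bytes per chunk in the BLAKE3 specification
	blockSize = 64   // bytes per block in the BLAKE3 specification
)

// ceilDiv returns ceil(n / d) for non-negative n and positive d.
func ceilDiv(n, d int) int {
	return (n + d - 1) / d
}

// chunks returns how many chunks an n-byte input occupies.
// The empty input still occupies one chunk.
// Hypothetical helper for illustration only.
func chunks(n int) int {
	if n == 0 {
		return 1
	}
	return ceilDiv(n, chunkSize)
}

// blocks returns how many block compressions the chunks of an n-byte input
// need in total (parent-node compressions in the tree are not counted).
// The empty input still needs one block compression.
func blocks(n int) int {
	if n == 0 {
		return 1
	}
	return ceilDiv(n, blockSize)
}

func main() {
	// The three input sizes used by the Sum256 benchmarks.
	for _, n := range []int{64, 1024, 65536} {
		fmt.Printf("%6d bytes: %3d chunk(s), %4d block(s)\n", n, chunks(n), blocks(n))
	}
}
```

Under these assumptions the 1024-byte benchmark is the largest input that still fits in one chunk: it needs 16 sequential block compressions and has no second chunk to process alongside.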