• Stars
    star
    108
  • Rank 321,259 (Top 7 %)
  • Language
    Clojure
  • License
    Eclipse Public Li...
  • Created about 9 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Cross-platform (JVM and JS atm.) edn data structure hashing for Clojure.

hasch

A library to consistently crypto-hash edn data structures on Clojure and ClojureScript with SHA-512. The main motivation is that commutative data structures like maps, sets and records are not hashed in order as was the case with e.g. hashing a simple sequential serialisation, but have the same hash value independent of order. That way Clojure value semantics with edn are retained. UTF-8 is supported for strings, symbols and keywords. Beyond this tagged literals are supported in a generic runtime independent fashion and platform-neutral encoding (atm. between JVM and JavaScript) is taken care of. You can then create UUID5 (using SHA-512) from it. Alternatively you can use your own hash function, but this is not standardized and hence beyond the spec.

Support for edn types on the JVM and JavaScript is complete including records. This works by printing the tagged-literal and rereading it as pure edn, which also ensures that the hashed value can be reproduced beyond the current runtime. Your type has to be pr-str-able for this to work. Records already have a default serialisation.

Usage Gitter

Add this to your leiningen project's dependencies: Clojars Project

Then you can access the major function through hasch.core:

(use 'hasch.core)
(edn-hash ["hello world" {:a 3.14} #{42} '(if true nil \f)])
=> (120 75 53 36 42 91 14 22 174 251 7 222 83 57 158 140 192 131 251 17 176 29 252 118 83 2 106 187 223 17 84 232 24 103 183 27 19 174 222 37 246 138 132 126 172 46 249 42 62 46 66 32 33 100 88 168 4 242 90 25 5 228 2 88)

(uuid5 (edn-hash "hello world"))
=> #uuid "1227fe0a-471b-5329-88db-875fb82737a8"

;; or just use the convenience multi-arity uuid fn:
(uuid) => #uuid "a27dfbb9-b69a-4f08-8df4-471464bfeb37"
(uuid "hello world") => #uuid "1227fe0a-471b-5329-88db-875fb82737a8"

Motivation

The motivation is to exchange (potentially large) values in a hostile environment without conflicts. The concrete design motivation is to use the commit log of replikativ for exchange of datascript/datomic transaction logs. As long as you are in a trusted environment you can trust the random generator for conflict-free UUIDs as is done internally by many Clojure projects, but as soon as you distribute values, collisions can happen. Note that you can treat hasch's cryptographic UUIDs like random UUIDs internally and don't need to verify them.

Maturity

The library is tested in cross-platform applications. The hashing scheme can be considered stable. It is versioned, so we can fix any severe bug without breaking stored hashes.

Why not use Clojure's hash?

I wish I could have done that instead of reimplementing my own hashing scheme for edn (there are more interesting problems). There is one major reason against using internal hash functions: They need to be very fast for efficient data-structures and hence trade this for potential but unlike collisions, which is unacceptable in an unsecure environment. For the same reason they also only work on 64 bit values, which is fine for a runtime, but not the internet.

Why not sort?

Sorting of heterogenous collections requires a unique serialization (e.g. pr-str or our encoding) on keys beforehand, which was sadly not faster even for small maps and sets. Sorting on number only maps was faster for maps until at least a size of one million. At some point the complexity of sorting becomes more expansive than xor-ing hashed kv-vectors, so sorting is a simple but not linearly scalable solution. Still it could prove valuable in the future.

edn support

Support for edn types is complete including records. This works according to incognito by hashing unknown records the same as their known counterparts. You need to supply the optional write-handlers to uuid if your records have a custom serialization. Otherwise incognito records won't match. Importantly the JVM class names are converted into cljs format foo.bar_baz.Bar -> foo.bar-baz/Bar before hashing. While this potentially allows maliciously induced collisions, you are safe if you use incognito or a similar mapping for cross-platform support, as it automatically serializes all record tags accordingly.

Safety

The library is designed safety first, speed second. I have put quite some thought into getting all input bits (entropy) into the cryptographic hash function. It should be impossible to construct a collision (beyond weaknesses in the underlying SHA-512 which is considered safe in year 2014). The biggest conceptual weakness is XOR-ing of sha-512 hashed elements in maps and sets.

Once released, I'll offer a 100 $ bounty for proof of any collision, just open a github issue. This hashing is an important building block for distributed systems to me.

Speed

The first versions were just build around safety, but perform poorly with large values. The speed should be sufficient to be in the same order of magnitude as transmission speed (throughput + latency) over slow to mid-range internet broadband connections. If you want to transmit larger values fast, you maybe can chose a sequential binary encoding with native hashing speed. JavaScript performance is still significantly slower (~10x), seemingly due to the lack of native SHA hashing routines.

*These are just micro-benchmarks on my 3 year old laptop, I just mention them so you can get an impression. *

;; most important and worst case, what can be done?
hasch.platform> (let [val (into {} (doall (map vec (partition 2
                                                (interleave (range 1000000)
                                                            (range 1000000))))))]
    (bench (-coerce val sha512-message-digest)))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 3.596037 sec
    Execution time std-deviation : 23.812536 ms
   Execution time lower quantile : 3.566430 sec ( 2.5%)
   Execution time upper quantile : 3.647540 sec (97.5%)
                   Overhead used : 2.039920 ns

Found 4 outliers in 60 samples (6.6667 %)
	low-severe	 3 (5.0000 %)
	low-mild	 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
nil

hasch.platform> (let [val (doall (range 1000000))]
    (bench (-coerce val sha512-message-digest)))
Evaluation count : 240 in 60 samples of 4 calls.
             Execution time mean : 297.320276 ms
    Execution time std-deviation : 2.683060 ms
   Execution time lower quantile : 293.217179 ms ( 2.5%)
   Execution time upper quantile : 302.059975 ms (97.5%)
                   Overhead used : 2.039920 ns

Found 1 outliers in 60 samples (1.6667 %)
	low-severe	 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
nil

hasch.platform> (let [val (doall (into #{} (range 1000000)))]
    (bench (-coerce val sha512-message-digest)))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 2.733429 sec
    Execution time std-deviation : 15.463782 ms
   Execution time lower quantile : 2.708645 sec ( 2.5%)
   Execution time upper quantile : 2.758701 sec (97.5%)
                   Overhead used : 2.039920 ns

Found 1 outliers in 60 samples (1.6667 %)
	low-severe	 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
nil

hasch.platform> (let [val (doall (repeat 1000000 "hello world"))]
    (bench (-coerce val sha512-message-digest)))
WARNING: Final GC required 1.472161970438994 % of runtime
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 873.084789 ms
    Execution time std-deviation : 5.753430 ms
   Execution time lower quantile : 862.909606 ms ( 2.5%)
   Execution time upper quantile : 885.560937 ms (97.5%)
                   Overhead used : 2.039920 ns

Found 2 outliers in 60 samples (3.3333 %)
	low-severe	 2 (3.3333 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
nil

hasch.platform> (let [val (doall (repeat 1000000 :foo/bar))]
    (bench (-coerce val sha512-message-digest)))
WARNING: Final GC required 1.072577784478402 % of runtime
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 756.394263 ms
    Execution time std-deviation : 2.935836 ms
   Execution time lower quantile : 750.827152 ms ( 2.5%)
   Execution time upper quantile : 761.299697 ms (97.5%)
                   Overhead used : 2.039920 ns

Found 1 outliers in 60 samples (1.6667 %)
	low-severe	 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
nil

hasch.platform> (let [val (byte-array (* 1024 1024 300) (byte 42))] ;; 300 mib bytearray
    (bench (-coerce val sha512-message-digest)))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 1.987549 sec
    Execution time std-deviation : 134.189868 ms
   Execution time lower quantile : 1.901676 sec ( 2.5%)
   Execution time upper quantile : 2.304744 sec (97.5%)
                   Overhead used : 1.967460 ns

Found 3 outliers in 60 samples (5.0000 %)
	low-severe	 3 (5.0000 %)
 Variance from outliers : 50.1416 % Variance is severely inflated by outliers
nil

hasch.platform> (let [val (doall (vec (repeat 10000 {:db/id 18239
                                       :person/name "Frederic"
                                       :person/familyname "Johanson"
                                       :person/street "Fifty-First Street 53"
                                       :person/postal 38237
                                       :person/phone "02343248474"
                                       :person/weight 38.23})))]
    (bench (-coerce val sha512-message-digest)))
WARNING: Final GC required 1.2237845534164749 % of runtime
Evaluation count : 240 in 60 samples of 4 calls.
             Execution time mean : 322.164678 ms
    Execution time std-deviation : 1.821136 ms
   Execution time lower quantile : 318.232462 ms ( 2.5%)
   Execution time upper quantile : 325.916354 ms (97.5%)
                   Overhead used : 2.039920 ns

Found 4 outliers in 60 samples (6.6667 %)
	low-severe	 2 (3.3333 %)
	low-mild	 1 (1.6667 %)
	high-mild	 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
nil

Changes

  • 0.3.5 Support BigInteger and BigDecimal hashing (same as for limited precision types).
  • 0.3.4 Expose high-level base64 hashes with full precision.
  • 0.3.2 Minimize dependencies, explicit profiles for different Clojure(Script) versions
  • 0.3.1 fix bug in hashing sequences containing null
  • 0.3.0 fix accidental hashing of records as maps
  • 0.3.0-beta4 fix record serialization with incognito
  • 0.3.0 Overhaul encoding for ~10x-20x times performance on the JVM. Use safe SHA-512. Add byte-array support for blobs.
  • 0.2.3 properly dispatch on IRecord (instead of IMap)
  • 0.2.2 cannot coerce record tags because of conflicts, rather extend record to properly print
  • 0.2.1 fix tag coercion on JVM

Extension to your own types

Warning: Getting all that right is not trivial. Don't mess with hashing extension if you don't have to, just make your type uniquely mappable with incognito!

You can avoid the mapping step to Clojure datastructures (also effectively allocating double memory) by extending the hasch.benc/PHashCoercion protocol to your types. You should orient on the IRecord implementation and must use (:literal magics) to avoid collisions with literal values of the same form. Either by using the default serialisation mechanism to retrieve a hash-value or by extending the hash-coercion, your serialisation or coercion must satisfy the equality relation:

  • hashes must follow IEquiv equality of Clojure(Script): (= a b) <=> (= (edn-hash a) (edn-hash b)), (not= a b) <=> (not= (edn-hash a) (edn-hash b)): Your serialisation has to be unique, hashing has to be injective or in other words you might not introduce collisions. Non-equal objects must have non-equal hashes.
  • reflexivity: (= (edn-hash a) (edn-hash a)), including on different runtimes
  • symmetry: (= (edn-hash a) (edn-hash b)) <=> (= (edn-hash b) (edn-hash a)) (trivial because of =)
  • transitivity: (and (= (edn-hash a) (edn-hash b)) (= (edn-hash b) (edn-hash c))) => (= (edn-hash a) (edn-hash c)) (also trivial because of =)

TODO

  • Use test.check/double.check property based tests between Java and JS (?)
  • Nested collections are hashed with the supplied hash-fn before they contribute to the hash-value. This allows to form a Merkle-tree like peristent data-structure by breaking out collection values, so you can rehash top-level collections without pulling the whole value in memory. This is not tested yet, a git-like store could be implemented, e.g. in konserve. This should be useful to build durable indexes also. But it might proof to need runtime tweaking, e.g. depending on value size.
  • If keeping sorted maps/sets is feasable for high-throughput applications, allow to hash them sequentally.

Contributors

  • Max Penet
  • James Conroy-Finn
  • Konrad KΓΌhne
  • Christian Weilbach

License

Copyright Β© 2014-2018 Christian Weilbach and contributors

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.

More Repositories

1

datahike

A fast, immutable, distributed & compositional Datalog engine for everyone.
Clojure
1,630
star
2

replikativ

An open, scalable and distributive infrastructure for a data-driven community of applications.
Clojure
330
star
3

konserve

A clojuresque key-value/document store protocol with core.async.
Clojure
298
star
4

superv.async

This is a Clojure(Script) library that extends core.async with error handling and includes a number of convenience functions and macros.
Clojure
171
star
5

kabel

A library for simple wire-like connectivity semantics.
Clojure
102
star
6

datalog-parser

Generic datalog parser compliant to datomic, datascript and datahike queries.
Clojure
69
star
7

hitchhiker-tree

Functional, persistent, off-heap, high performance data structure
Clojure
43
star
8

datahike-server

Datahike remote system
Clojure
36
star
9

geheimnis

Cross-platform cryptography between cljs and clj.
Clojure
35
star
10

datahike-frontend

A front-end connecting to a Datahike back-end.
Clojure
33
star
11

incognito

Safe transport of unknown record types in distributed systems.
Clojure
22
star
12

durable-persistence

Explorations in durable persistent datastructures for Clojure.
Clojure
22
star
13

datahike-postgres

Datahike with Postgres as data storage
Clojure
17
star
14

datahike-jdbc

Datahike JDBC data storage backend
Clojure
16
star
15

chat42

A small web chat demonstration with replikativ.
Clojure
14
star
16

topiq

A distributed social network for serious discussions and funny topiqs :).
Clojure
13
star
17

twitter-collector

A simple twitter collector using replikativ.
Clojure
11
star
18

filesync-replikativ

A filesystem synchronization tool similar to Dropbox over replikativ.
Clojure
11
star
19

konserve-carmine

A redis backend with carmine for konserve.
Clojure
10
star
20

kabel-auth

Authentication middleware for kabel.
Clojure
9
star
21

zufall

random name generator
Clojure
7
star
22

chat42app

A react native demo for chat42.
Clojure
6
star
23

mesalog

CSV data loader for Datalog databases
Clojure
5
star
24

flechtwerk

flechtwerk provides visualization of commit graphs of CDVCS.
Clojure
5
star
25

datahike-redis

Clojure
5
star
26

datahike-s3

Datahike backend for S3.
Clojure
5
star
27

konserve-leveldb

A LevelDB backend for konserve.
Clojure
4
star
28

replikativ-demo

Example project for replikativ in clj.
Clojure
4
star
29

datahike-client

A datahike remote client
Clojure
3
star
30

konserve-rocksdb

A RocksDB backend for konserve
Clojure
3
star
31

datahike-benchmark

Measuring of datahike performance
Clojure
3
star
32

replikativ-cljs-demo

Example project for replikativ in cljs.
Clojure
3
star
33

konserve-jdbc

A JDBC backend for konserve.
Clojure
2
star
34

datahike-leveldb

Datahike with LevelDB as data storage
Clojure
2
star
35

datahike-server-transactor

Transactor implementation for datahike that uses datahike-server.
Clojure
2
star
36

konserve-clutch

A CouchDB backend for konserve with clutch.
Clojure
2
star
37

polo-collector

A collector of trading data on the Poloniex exchange.
Clojure
2
star
38

konserve-welle

A Riak backend for konserve with Welle.
Clojure
2
star
39

sherpa

A version migration tool for Datahike.
Clojure
1
star
40

konserve-s3

S3 backend for konserve.
Clojure
1
star
41

obhut

Docker setup for operational log analytics
1
star
42

datahike-fdb

FoundationDB backend for Datahike
Clojure
1
star
43

konserve-redis

Redis backend for konserve.
Clojure
1
star
44

sendandi

Protocol above datalog databases for common functions
Clojure
1
star
45

datahike-rocksdb

Datahike RocksDB support.
Clojure
1
star
46

wanderung-core

Migration protocols for Datahike
Clojure
1
star
47

wanderung-datascript

DataScript-Datahike migrations
Clojure
1
star
48

datahike-migrations

A migration tool for Datahike
Clojure
1
star
49

pydatahike

Python bindings for Datahike.
Python
1
star