• Stars
    star
    54
  • Rank 544,902 (Top 11 %)
  • Language
    Java
  • License
    Other
  • Created over 6 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Union, intersection, and set cardinality in loglog space

Build Status

HyperMinHash-java

A Java implementation of the HyperMinHash algorithm, presented by Yu and Weber. HyperMinHash allows approximating set unions, intersections, Jaccard Indices, and cardinalities of very large sets with high accuracy using only loglog space. It also supports streaming updates and merging sketches, just the same as HyperLogLog.

This repo implements two flavors of HyperMinHash:

  1. HyperMinHash: An implementation based on HyperLogLog with the addition of the bias correction seen in HyperLogLog++.
  2. BetaMinHash: An implementation which uses LogLog-Beta for the underlying LogLog implementation. Loglog-beta is almost identical in accuracy to HyperLogLog++, except it performs better on cardinality estimations for small datasets (n <= 80k), holding memory fixed. Since we use Loglog-Beta, we refer to our implementation as BetaMinHash. However, our implementation currently only supports a fixed precision p=14.

If you expect to be dealing with low cardinality datasets (<= 80,000 unique elements), we recommend using BetaMinHash as it has a smaller memory footprint and is more accurate than HyperLogLog in the range from 20,000-80,000, holding memory fixed. However, note that different sketch types are not interchangeable i.e: obtaining the intersection of an HMH and a BMH is not currently supported.

Both implementations are equipped with serialization/deserialization capabilities out of the box for sending sketches over the wire or persisting them to disk.

Usage

Importing via Maven

<dependency>
  <groupId>com.liveramp</groupId>
  <artifactId>hyperminhash</artifactId>
  <version>0.2</version>
</dependency>

Cardinality estimation

Set<byte[]> mySet = getMySet();
BetaMinHash sketch = new BetaMinHash();
for (byte[] element : mySet){
    sketch.add(element);
}

long estimatedCardinality = sketch.cardinality();

Merging (unioning) sketches

Collection<BetaMinHash> sketches = getSketches();
SketchCombiner<BetaMinHash> combiner = BetaMinHashCombiner.getInstance();
BetaMinHash combined = combiner.union(sketches);

// to get cardinality of the union
long unionCardinality = combined.cardinality();

// using HyperMinHash instead of BetaMinHash
Collection<HyperMinHash> sketches = getSketches();
SketchCombiner<HyperMinHash> combiner = HyperMinHashCombinre.getInstance();
HyperMinHash combined = combiner.union(sketches);

Cardinality of unions

BetaMinHash combined = combiner.union(sketches);
long estimatedCardinality = combined.cardinality();

Cardinality of intersection

Collection<BetaMinHash> sketches = getSketches();
SketchCombiner<BetaMinHash> combiner = BetaMinHashComber.getInstance();
long intersectionCardinality = combiner.intersectionCardinality(sketches);

Serializing a sketch

To get a byte[] representation of a sketch, use the IntersectionSketch.SerDe interface:

HyperMinHash sketch = getSketch();
HyperMinHashSerde serde = new HyperMinHashSerde();

byte[] serialized = serde.toBytes(sketch);
HyperMinHash deserialized = serde.fromBytes(serialized);

int sizeInBytes = serde.sizeInBytes(sketch);

Maintainers

Commit authorship was lost when merging code. The maintainers of the library, in alphabetical order, are:

  1. Christian Hansen (github.com/ChristianHansen)
  2. Harry Rackmil (github.com/harryrackmil)
  3. Shrif Nada (github.com/sherifnada)

Acknowledgements

Thanks to Seif Lotfy for implementing a Golang version of HyperMinHash. We use some of his tests in our library, and our BetaMinHash implementation references his implementation.

More Repositories

1

hank

(DEPRECATED. This project is no longer used or maintained at LiveRamp.) Hank is a high performance distributed key-value NoSQL database that we built and use at LiveRamp. It is designed for very large data stores that dwarf the amount of available main memory and for randomly distributed read/write workloads that far exceed the capacity of memory-based caches. More specifically, it is optimized for very low latency random read queries and for very high throughput incremental batch writes.
Java
172
star
2

jack

A set of scripts for generating fully functional Java database models from Ruby's ActiveRecord models and migrations.
Java
70
star
3

cascading_ext

cascading_ext is a collection of tools built on top of the Cascading platform which make it easy to build, debug, and run simple and high-performance data workflows.
Java
57
star
4

captain

distributed, light-weight java workflow engine for a microservice architecture
Java
35
star
5

reslang

A language for describing resource-oriented APIs & turning them into Swagger or resource diagrams. Oriented around the concepts we want to expose in the APIs.
TypeScript
23
star
6

iabconsent

A Go implementation of the IAB Consent String Specs (v1.1 and v2)
Go
20
star
7

templater

Go
17
star
8

factable

Factable is a service for reporting over event-sourced application data
Go
12
star
9

daemon_lib

A Java library that makes it easy to write parallelized task processors
Java
11
star
10

generative

Tools for generative and property based testing in java.
Java
10
star
11

mockrdd

A Python3 module for testing PySpark code
Python
6
star
12

hank-go

Golang client for Hank
Go
6
star
13

pmd_extensions

Java
6
star
14

HadoopAtScale

Identifying, monitoring, and fixing scaling bottlenecks when running Hadoop at scale
5
star
15

kube_deploy_tools

LiveRamp Kubernetes Build/Release Tools
Ruby
5
star
16

liveramp-aws-transcoding-public

Public repository to house cloudformation and documents for AWS Transcoding Service
3
star
17

ats-sdk-ios

LiveRamp ATS SDK for iOS
Swift
2
star
18

ccpa

Go Implementation of the IAB U.S. Privacy String (CCPA Opt-Out Storage Format)
Go
2
star
19

terraform-google-identity-engine-data-plane

Identity/Identity-Engine - Data plane Terraform module
HCL
2
star
20

tcf-sdk-ios

LiveRamp Privacy Manager SDK for iOS
Swift
1
star
21

typed-mutations

Lazy, typed mutations on arbitrary objects and collections
TypeScript
1
star
22

pl-sdk-ios

LiveRamp PreferenceLink SDK for iOS
Swift
1
star
23

akp_wholesale_pipelines

AKP and Wholesale DataFlow Pipelines
Java
1
star
24

higher-order-promise

Balance readability and performance for async operations
TypeScript
1
star
25

liveramp-idea-inspections-plugin

Intellij IDEA plugin with some additional inspections
Java
1
star
26

lr-map

Python
1
star
27

remote_code

Experimental library for serializing and transmitting objects and their bytecode
Java
1
star
28

ats-sdk-ios-google-adapter

Google Mediation Adapter for LiveRamp ATS SDK for iOS
Swift
1
star
29

ats-sdk-android-sample

Addressability - Android ATS SDK sample, public repository
Kotlin
1
star
30

ats-sdk-ios-sample

Addressability - Repository for IOS ATS SDK sample, public repository
Swift
1
star