• Stars
    star
    560
  • Rank 79,541 (Top 2 %)
  • Language
    Java
  • License
    Apache License 2.0
  • Created about 14 years ago
  • Updated about 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A compressed alternative to the Java BitSet class

JavaEWAH

Java CI docs-badge Coverage Status

This is a word-aligned compressed variant of the Java Bitset class. We provide both a 64-bit and a 32-bit RLE-like compression scheme. It can be used to implement bitmap indexes. The EWAH format it relies upon is used in the git implementation that runs GitHub.

The goal of word-aligned compression is not to achieve the best compression, but rather to improve query processing time. Hence, we try to save CPU cycles, maybe at the expense of storage. However, the EWAH scheme we implemented is always more efficient storage-wise than an uncompressed bitmap (as implemented in the java BitSet class by Sun).

JavaEWAH offers competitive speed. In an exhaustive comparison, Guzun et al. (ICDE 2014) found that "EWAH offers the best query time for all distributions."

JavaEWAH also supports memory-mapped files: we can serialize the bitmaps to disk and then map them to memory using the java.nio classes. This may avoid wasteful serialization/deserialization routines.

The library also provides a drop-in replacement for the standard BitSet class. Like the other bitmap classes in JavaEWAH, this uncompressed BitSet class supports memory-mapped files as well as many other conveniences.

For better performance, use a 64-bit JVM over 64-bit CPUs when using the 64-bit scheme (javaewah.EWAHCompressedBitmap). The 32-bit version (javaewah32.EWAHCompressedBitmap32) should compress better but be comparatively slower. It is recommended however that you run your own benchmark.

Java 6 or better is required. We found the very latest OpenJDK release offered the best performance.

Real-world usage

JavaEWAH is part of Apache Hive and its derivatives (e.g., Apache Spark) and Eclipse JGit. It has been used in production systems for many years. It is part of major Linux distributions. It is part of Twitter algebird.

EWAH is used to accelerate the distributed version control system Git (http://githubengineering.com/counting-objects/). You can find the C port of EWAH written by the Git team at https://github.com/git/git/tree/master/ewah

When should you use a bitmap?

Sets are a fundamental abstraction in software. They can be implemented in various ways, as hash sets, as trees, and so forth. In databases and search engines, sets are often an integral part of indexes. For example, we may need to maintain a set of all documents or rows (represented by numerical identifier) that satisfy some property. Besides adding or removing elements from the set, we need fast functions to compute the intersection, the union, the difference between sets, and so on.

To implement a set of integers, a particularly appealing strategy is the bitmap (also called bitset or bit vector). Using n bits, we can represent any set made of the integers from the range [0,n): it suffices to set the ith bit is set to one if integer i is present in the set. Commodity processors use words of W=32 or W=64 bits. By combining many such words, we can support large values of n. Intersections, unions and differences can then be implemented as bitwise AND, OR and ANDNOT operations. More complicated set functions can also be implemented as bitwise operations.

When the bitset approach is applicable, it can be orders of magnitude faster than other possible implementation of a set (e.g., as a hash set) while using several times less memory.

However, a bitset, even a compressed one is not always applicable. For example, if you have 1000 random-looking integers, then a simple array might be the best representation. We refer to this case as the "sparse" scenario.

When should you use compressed bitmaps?

An uncompressed BitSet can use a lot of memory. For example, if you take a BitSet and set the bit at position 1,000,000 to true and you have just over 100kB. That's over 100kB to store the position of one bit. This is wasteful even if you do not care about memory: suppose that you need to compute the intersection between this BitSet and another one that has a bit at position 1,000,001 to true, then you need to go through all these zeroes, whether you like it or not. That can become very wasteful.

This being said, there are definitively cases where attempting to use compressed bitmaps is wasteful. For example, if you have a small universe size. E.g., your bitmaps represent sets of integers from [0,n) where n is small (e.g., n=64 or n=128). If you are able to uncompressed BitSet and it does not blow up your memory usage, then compressed bitmaps are probably not useful to you. In fact, if you do not need compression, then a BitSet offers remarkable speed. One of the downsides of a compressed bitmap like those provided by JavaEWAH is slower random access: checking whether a bit is set to true in a compressed bitmap takes longer.

The sparse scenario is another use case where compressed bitmaps should not be used. Keep in mind that random-looking data is usually not compressible. E.g., if you have a small set of 32-bit random integers, it is not mathematically possible to use far less than 32 bits per integer, and attempts at compression can be counterproductive.

How does EWAH compares with the alternatives?

EWAH is part of a larger family of compressed bitmaps that are run-length-encoded bitmaps. They identify long runs of 1s or 0s and they represent them with a marker word. If you have a local mix of 1s and 0, you use an uncompressed word.

There are many formats in this family beside EWAH:

  • Oracle's BBC is an obsolete format at this point: though it may provide good compression, it is likely much slower than more recent alternatives due to excessive branching.
  • WAH is a patented variation on BBC that provides better performance.
  • Concise is a variation on the patented WAH. It some specific instances, it can compress much better than WAH (up to 2x better), but it is generally slower.
  • EWAH is both free of patent, and it is faster than all the above. On the downside, it does not compress quite as well. It is faster because it allows some form of "skipping" over uncompressed words. So though none of these formats are great at random access, EWAH is better than the alternatives.

There are other alternatives however. For example, the Roaring format (https://github.com/lemire/RoaringBitmap) is not a run-length-encoded hybrid. It provides faster random access than even EWAH.

Data format

For more details regarding the compression format, please see Section 3 of the following paper:

Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 69 (1), pages 3-28, 2010.

Benchmark

For a simple comparison between this library and other libraries such as WAH, ConciseSet, BitSet and other options, please see

https://github.com/lemire/simplebitmapbenchmark

However, this is very naive. It is recommended that you run your own benchmarks.

Unit testing

As of October 2011, this package relies on Maven. To test it:

mvn test

See http://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html for details.

We support Java 8 and up, but to build the library, you need Java 9 and up with the default Maven setup, since we rely on the --release flag functionality (unavailable in Java 8) to sidestep the Deaded-NoSuchMethodError issue.

Usage

        EWAHCompressedBitmap ewahBitmap1 = EWAHCompressedBitmap.bitmapOf(0, 2, 55, 64, 1 << 30);
        EWAHCompressedBitmap ewahBitmap2 = EWAHCompressedBitmap.bitmapOf(1, 3, 64,
                1 << 30);
        System.out.println("bitmap 1: " + ewahBitmap1);
        System.out.println("bitmap 2: " + ewahBitmap2);
        // or
        EWAHCompressedBitmap orbitmap = ewahBitmap1.or(ewahBitmap2);
        System.out.println("bitmap 1 OR bitmap 2: " + orbitmap);
        System.out.println("memory usage: " + orbitmap.sizeInBytes() + " bytes");
        // and
        EWAHCompressedBitmap andbitmap = ewahBitmap1.and(ewahBitmap2);
        System.out.println("bitmap 1 AND bitmap 2: " + andbitmap);
        System.out.println("memory usage: " + andbitmap.sizeInBytes() + " bytes");
        // xor
        EWAHCompressedBitmap xorbitmap = ewahBitmap1.xor(ewahBitmap2);
        System.out.println("bitmap 1 XOR bitmap 2:" + xorbitmap);
        System.out.println("memory usage: " + xorbitmap.sizeInBytes() + " bytes");
        // fast aggregation over many bitmaps
        EWAHCompressedBitmap ewahBitmap3 = EWAHCompressedBitmap.bitmapOf(5, 55,
                1 << 30);
        EWAHCompressedBitmap ewahBitmap4 = EWAHCompressedBitmap.bitmapOf(4, 66,
                1 << 30);
        System.out.println("bitmap 3: " + ewahBitmap3);
        System.out.println("bitmap 4: " + ewahBitmap4);
        andbitmap = EWAHCompressedBitmap.and(ewahBitmap1, ewahBitmap2, ewahBitmap3,
                ewahBitmap4);
        System.out.println("b1 AND b2 AND b3 AND b4: " + andbitmap);
        // serialization
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        // Note: you could use a file output steam instead of ByteArrayOutputStream
        ewahBitmap1.serialize(new DataOutputStream(bos));
        EWAHCompressedBitmap ewahBitmap1new = new EWAHCompressedBitmap();
        byte[] bout = bos.toByteArray();
        ewahBitmap1new.deserialize(new DataInputStream(new ByteArrayInputStream(bout)));
        System.out.println("bitmap 1 (recovered) : " + ewahBitmap1new);
        if (!ewahBitmap1.equals(ewahBitmap1new)) throw new RuntimeException("Will not happen");
        //
        // we can use a ByteBuffer as backend for a bitmap
        // which allows memory-mapped bitmaps
        //
        ByteBuffer bb = ByteBuffer.wrap(bout);
        EWAHCompressedBitmap rmap = new EWAHCompressedBitmap(bb);
        System.out.println("bitmap 1 (mapped) : " + rmap);

        if (!rmap.equals(ewahBitmap1)) throw new RuntimeException("Will not happen");
        //
        // support for threshold function (new as of version 0.8.0):
        // mark as true a bit that occurs at least T times in the source
        // bitmaps
        //
        EWAHCompressedBitmap threshold2 = EWAHCompressedBitmap.threshold(2,
                ewahBitmap1, ewahBitmap2, ewahBitmap3, ewahBitmap4);
        System.out.println("threshold 2 : " + threshold2);

See example.java.

You can use our drop-in replacement for the BitSet class in a memory-mapped file context as follows:

		final FileOutputStream fos = new FileOutputStream(tmpfile);
		BitSet Bitmap = BitSet.bitmapOf(0, 2, 55, 64, 512);
		Bitmap.serialize(new DataOutputStream(fos));
		RandomAccessFile memoryMappedFile = new RandomAccessFile(tmpfile, "r");
		ByteBuffer bb = memoryMappedFile.getChannel().map(
				FileChannel.MapMode.READ_ONLY, 0, totalcount);
		ImmutableBitSet mapped = new ImmutableBitSet(bb.asLongBuffer());

There are more examples in the "examples" folder (e.g., for memory-file mapping).

Maven central repository

You can download JavaEWAH from the Maven central repository: http://repo1.maven.org/maven2/com/googlecode/javaewah/JavaEWAH/

You can also specify the dependency in the Maven "pom.xml" file:

    <dependencies>
         <dependency>
	     <groupId>com.googlecode.javaewah</groupId>
	     <artifactId>JavaEWAH</artifactId>
	     <version>[1.1,)</version>
         </dependency>
     </dependencies>

Naturally, you should replace "version" by the version you desire.

Ubuntu (Linux)

To install javaewah on Ubuntu, type:

      sudo apt-get install libjavaewah-java

Clojure

Joel Boehland wrote Clojure wrappers:

https://github.com/jolby/clojure-ewah-bitmap

Frequent questions

Question: How do I build javaewah without testing or signing?

    mvn clean install -DskipTests -Dgpg.skip=true

Question: Will JavaEWAH support long values?

Answer: It might, but it does not at the moment.

Question: How do I check the value of a bit?

Answer: If you need to routinely check the value of a given bit quickly, then EWAH might not be the right format. However, if you must do it, you can proceed as follows:

          /**
           * Suppose you have the following bitmap:
           */
          EWAHCompressedBitmap b = EWAHCompressedBitmap.bitmapOf(0,2,64,1<<30);
          /**
           * We want to know if bit 64 is set:
           */
          boolean is64set = b.get(64);

API Documentation

http://www.javadoc.io/doc/com.googlecode.javaewah/JavaEWAH/

Mailing list and discussion group

https://groups.google.com/forum/#!forum/javaewah

Further reading

Credit

(c) 2009-2023 Daniel Lemire, Cliff Moon, David McIntosh, Robert Becho, Colby Ranger, Veronika Zenz, Owen Kaser, Gregory Ssi-Yan-Kai, and Rory Graves

This code is licensed under Apache License, Version 2.0 (ASL2.0). (GPL 2.0 derivatives are allowed.)

Acknowledgement

Special thanks to Shen Liang for optimization advice.

This work was supported by NSERC grant number 26143.

More Repositories

1

FastPFor

The FastPFOR C++ library: Fast integer compression
C++
876
star
2

Code-used-on-Daniel-Lemire-s-blog

This is a repository for the code posted on my blog
C
720
star
3

fast_double_parser

Fast function to parse strings into double (binary64) floating-point values, enforces the RFC 7159 (JSON standard) grammar: 4x faster than strtod
C++
574
star
4

JavaFastPFOR

A simple integer compression library in Java
Java
505
star
5

simdcomp

A simple C library for compressing lists of integers using binary packing
C
480
star
6

EWAHBoolArray

A compressed bitmap class in C++.
C++
430
star
7

SIMDCompressionAndIntersection

A C++ library to compress and intersect sorted lists of integers using SIMD instructions
C++
420
star
8

fastbase64

SIMD-accelerated base64 codecs
C
388
star
9

streamvbyte

Fast integer compression in C using the StreamVByte codec
C
372
star
10

FastPriorityQueue.js

a fast heap-based priority queue in JavaScript
JavaScript
328
star
11

fastvalidate-utf-8

header-only library to validate utf-8 strings at high speeds (using SIMD instructions)
C
286
star
12

fastrange

A fast alternative to the modulo reduction
C
272
star
13

fastmod

A C/C++ header file for fast 32-bit division remainders (and divisibility tests) on 64-bit hardware.
C++
264
star
14

clhash

C library implementing the ridiculously fast CLHash hashing function
C
256
star
15

externalsortinginjava

External-Memory Sorting in Java
Java
248
star
16

rollinghashcpp

Rolling Hash C++ Library
C++
185
star
17

FastBitSet.js

Speed-optimized BitSet implementation for modern browsers and JavaScript engines
JavaScript
150
star
18

docker_programming_station

A simple script to help programmers who want to work within Docker
Shell
142
star
19

despacer

C library to remove white space from strings as fast as possible
C
133
star
20

StronglyUniversalStringHashing

Benchmark showing the we can randomly hash strings very quickly with good universality
C++
133
star
21

testingRNG

Testing common random-number generators (RNG)
C++
131
star
22

MaskedVByte

Fast decoder for VByte-compressed integers
C
115
star
23

cbitset

A simple bitset library in C
C
110
star
24

lbimproved

Dynamic Time Warping (DTW) library implementing lower bounds (LB_Keogh, LB_Improved...)
C++
105
star
25

dictionary

High-performance dictionary coding
C++
99
star
26

SIMDxorshift

Fast random number generators: Vectorized (SIMD) version of xorshift128+
C
99
star
27

csharpewah

Compressed bitmaps in C#
C#
82
star
28

fastrand

Fast random number generation in an interval in Python: Up to 10x faster than random.randint.
C
77
star
29

bloofi

Bloofi: A java implementation of multidimensional Bloom filters
Java
75
star
30

rollinghashjava

Rolling hash functions in Java
Java
75
star
31

LittleIntPacker

C library to pack and unpack short arrays of integers as fast as possible
C
70
star
32

simdpcg

Vectorized version of the PCG random number generator
C
68
star
33

TypedFastBitSet.js

Speed-optimized BitSet implementation for modern browsers and JavaScript engines, uses typed arrays
TypeScript
67
star
34

FastIntegerCompression.js

Fast integer compression library in JavaScript
JavaScript
54
star
35

simdprune

Pruning elements in SIMD vectors (i.e., packing left elements)
C
50
star
36

testingmlp

Testing memory-level parallelism
C
49
star
37

FastDifferentialCoding

Fast differential coding functions (using SIMD instructions)
C
49
star
38

multiplatform_simd_recipes

SIMD recipes, for various platforms (collection of code snippets)
C
43
star
39

runningmaxmin

Fast maximum-minimum filters implemented in C++
C++
43
star
40

validateutf8-experiments

Reproducible experimeents on UTF-8 validation using SIMD instructions
C++
40
star
41

fasthashing

A Variable-Length String Hashing Library in C++
C++
40
star
42

fast_division

Simple C++ code to benchmark fast division algorithms
C++
40
star
43

FrameOfReference

C++ library to pack and unpack vectors of integers having a small range of values using a technique called Frame of Reference
C
40
star
44

SwiftBitset

A fast Bitset class in Swift
Swift
36
star
45

FastShuffleExperiments

How fast can we shuffle values?
C++
33
star
46

programmingishard

Long-term book project
32
star
47

StablePriorityQueue.js

A stable heap-based priority queue in JavaScript
JavaScript
27
star
48

fastscancount

Fast implementations of the scancount algorithm: C++ header-only library
C++
24
star
49

SIMDgameoflife

Vectorized (AVX) version of the game of life
C
23
star
50

CMemoryUsage

Measuring memory usage in C and C++
C
23
star
51

CRoaringUnityBuild

Dumps of CRoaring unity builds (for convenience)
C
23
star
52

SwiftWyhash

Fast random number generator in pure Swift
Swift
23
star
53

kodakimagecollection

Set of images made available royalty-free by Kodak
22
star
54

sparsebitmap

A simple sparse bitmap implementation in java
Java
21
star
55

IndexWikipedia

A simple utility to index wikipedia dumps using Lucene.
Java
21
star
56

PiecewiseLinearTimeSeriesApproximation

code from Daniel Lemire, A Better Alternative to Piecewise Linear Time Series Segmentation, SIAM Data Mining 2007, 2007.
C++
21
star
57

Concise

C++ implementation of Concise and WAH compressed bitsets
C++
19
star
58

MemoryLanes

iOS app to test memory-level parallelism
C++
18
star
59

fastrandom

Go
18
star
60

crackingxoroshiro128plus

How to "crack" xoroshiro128+: from two outputs, derive a possible seed
Python
18
star
61

talks

Random material having to do with Daniel Lemire's talks
C++
18
star
62

RealisticTabularDataSets

Some realistic tabular datasets for testing (CSV)
17
star
63

StarSchemaBenchmark

O'Neil et al.'s Star Schema Benchmark: curated code
C
16
star
64

vectorclass

Random number generator for large applications using vector instructions
C++
15
star
65

simple_fastfloat_benchmark

C
15
star
66

simplebitmapbenchmark

Simple benchmark between compressed bitmap libraries in Java
Java
15
star
67

zobristhashing

Zobrist hashing in C
C
14
star
68

BitSliceIndex

Experiments on bit-slice indexing
Java
14
star
69

SIMDIntersections

Vectorized intersections (research code)
C++
14
star
70

microbenchmarks

Private microbenchmarks
Java
13
star
71

backward_multiplication

Multiplying... backward?
C++
12
star
72

pospopcnt_avx512

benchmarking positional population count
Assembly
11
star
73

costofsafety

Quick experiment to see how expensive safety is in C, for research
C
11
star
74

constantdivisionbenchmarks

Benchmarks for constant-division problems (not a library! for research only!)
C++
11
star
75

createfasthash

Code from article http://locklessinc.com/articles/fast_hash/
C
9
star
76

SwiftCallingCHeader

Calling a C header from Swift (example)
Swift
8
star
77

arraylayout

Pat Morin's arraylayout
TeX
8
star
78

WebAssemblyVSJavaScript

Project to compare the performance of WebAssembly with JavaScript
JavaScript
7
star
79

rowreorderingcpplibrary

This is a set of row-reordering algorithms and data compression compression schemes implemented in C++.
C++
7
star
80

ComputerLanguageBenchmark

Captured as http://benchmarksgame.alioth.debian.org is closing
HTML
7
star
81

fastheap

Fast heap data structures in Go (golang)
Go
6
star
82

iosbitmapdecoding

Experimenting with bitmap decoding on ios
C++
6
star
83

notesdecours

Notes de cours
Java
6
star
84

gutenberg-headers

Removing Manually-Generated Boilerplate from Project Gutenberg e-Books
Java
6
star
85

gobitmapbenchmark

A Go project to benchmark various bitmap implementations (this is not a library!)
Go
6
star
86

citationdata

Influential references dataset
6
star
87

simple_simdjson_python_wrapper

Simple use of simdjson from python (for research purposes)
C++
6
star
88

pythonmaxmin

Fast minimum-maximal filter in Python
Python
6
star
89

jstypes

Doing C-like arithmetic and logical operations in JavaScript (full 64-bit support)
JavaScript
5
star
90

hierarchicalbinbuffering

Hierarchical Bin Buffering C++ Library
C++
5
star
91

fast_float_supplemental_tests

Supplemental tests for the fast_float library (credit: Nigel Tao)
C++
5
star
92

HashVSTree

Do hash table use more memory than trees? A case study in Java.
Java
5
star
93

zone_benchmarks

zone-file parsing benchmark
C
5
star
94

DivisionBenchmark

A not-so-exciting benchmark to measure the performance of some division functions
C++
4
star
95

RoaringVSConciseBenchmark

A benchmark comparing roaring and concise
Java
4
star
96

exercices_lucene

Exercises pour s'approprier lucene, le moteur de recherche
Java
4
star
97

roaring_rust_benchmark

Basic benchmark to compare different Roaring bitmap implementations in Rust
Rust
4
star
98

datasciencebook

Python
4
star
99

viewsizeestimation

Unassuming hashing-based view-size estimation techniques
C++
4
star
100

jackson-json-bench

A silly benchmark for Jackson (JSON parser)
Java
4
star