• Stars
    star
    505
  • Rank 84,214 (Top 2 %)
  • Language
    Java
  • License
    Apache License 2.0
  • Created almost 12 years ago
  • Updated 10 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A simple integer compression library in Java

JavaFastPFOR: A simple integer compression library in Java

docs-badge Java CI

License

This code is released under the Apache License Version 2.0 http://www.apache.org/licenses/.

What does this do?

It is a library to compress and uncompress arrays of integers very fast. The assumption is that most (but not all) values in your array use much less than 32 bits, or that the gaps between the integers use much less than 32 bits. These sort of arrays often come up when using differential coding in databases and information retrieval (e.g., in inverted indexes or column stores).

Please note that random integers are not compressible, by this library or by any other means. If you ever had the means of systematically compressing random integers, you could compress any data source to nothing, by recursive application of your technique.

This library can decompress integers at a rate of over 1.2 billions per second (4.5 GB/s). It is significantly faster than generic codecs (such as Snappy, LZ4 and so on) when compressing arrays of integers.

The library is used in LinkedIn Pinot, a realtime distributed OLAP datastore. Part of this library has been integrated in Parquet (http://parquet.io/). A modified version of the library is included in the search engine Terrier (http://terrier.org/). This libary is used by ClueWeb Tools (https://github.com/lintool/clueweb). It is also used by Apache NiFi.

This library inspired a compression scheme used by Apache Lucene and Apache Lucene.NET (e.g., see http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/util/PForDeltaDocIdSet.html ).

It is a java port of the fastpfor C++ library (https://github.com/lemire/FastPFor). There is also a Go port (https://github.com/reducedb/encoding). The C++ library is used by the zsearch engine (http://victorparmar.github.com/zsearch/) as well as in GMAP and GSNAP (http://research-pub.gene.com/gmap/).

Usage

Really simple usage:

        IntegratedIntCompressor iic = new IntegratedIntCompressor();
        int[] data = ... ; // to be compressed
        int[] compressed = iic.compress(data); // compressed array
        int[] recov = iic.uncompress(compressed); // equals to data

For more examples, see example.java or the examples folder.

JavaFastPFOR supports compressing and uncompressing data in chunks (e.g., see advancedExample in https://github.com/lemire/JavaFastPFOR/blob/master/example.java).

Some CODECs ("integrated codecs") assume that the integers are in sorted orders and use differential coding (they compress deltas). They can be found in the package me.lemire.integercompression.differential. Most others do not.

The Java Team at Intel (R) introduced the vector implementation for FastPFOR based on the Java Vector API that showed significant gains over the non-vectorized implementation. For an example usage, see examples/vector/Example.java. The feature requires JDK 19+ and is currently for advanced users.

Maven central repository

Using this code in your own project is easy with maven, just add the following code in your pom.xml file:

    <dependencies>
         <dependency>
	     <groupId>me.lemire.integercompression</groupId>
	     <artifactId>JavaFastPFOR</artifactId>
	     <version>[0.2,)</version>
         </dependency>
     </dependencies>

Naturally, you should replace "version" by the version you desire.

You can also download JavaFastPFOR from the Maven central repository: http://repo1.maven.org/maven2/me/lemire/integercompression/JavaFastPFOR/

Why?

We found no library that implemented state-of-the-art integer coding techniques such as Binary Packing, NewPFD, OptPFD, Variable Byte, Simple 9 and so on in Java. We wrote one.

Thread safety

Some codecs are thread-safe while others are not. For this reason, it is best to use one codec per thread. The memory usage of a codec instance is small in any case.

Nevertheless, if you want to reuse codec instances, note that by convention, unless the documentation of a codec specify that it is not thread-safe, then it can be assumed to be thread-safe.

Authors

Main contributors

with contributions by

How does it compare to the Kamikaze PForDelta library?

In our tests, Kamikaze PForDelta is slower than our implementations. See the benchmarkresults directory for some results.

https://github.com/lemire/JavaFastPFOR/blob/master/benchmarkresults/benchmarkresults_icore7_10may2013.txt

Reference: http://sna-projects.com/kamikaze/

Requirements

Releases up to 0.1.12 require Java 7 or better.

The current development versions assume JDK 11 or better.

How fast is it?

Compile the code and execute me.lemire.integercompression.benchmarktools.Benchmark.

Speed is always reported in millions of integers per second.

For Maven users

mvn compile
mvn exec:java

You may run our examples as follows:

mvn package
javac -cp target/classes/:. example.java
java -cp target/classes/:. example

For ant users (legacy, currently untested)

If you use Apache ant, please try this:

$ ant Benchmark

or:

$ ant Benchmark -Dbenchmark.target=BenchmarkBitPacking

API Documentation

http://www.javadoc.io/doc/me.lemire.integercompression/JavaFastPFOR/

Want to read more?

This library was a key ingredient in the best paper at ECIR 2014 :

Matteo Catena, Craig Macdonald, Iadh Ounis, On Inverted Index Compression for Search Engine Efficiency, Lecture Notes in Computer Science 8416 (ECIR 2014), 2014. http://dx.doi.org/10.1007/978-3-319-06028-6_30

We wrote several research papers documenting many of the CODECs implemented here:

Ikhtear Sharif wrote his M.Sc. thesis on this library:

Ikhtear Sharif, Performance Evaluation of Fast Integer Compression Techniques Over Tables, M.Sc. thesis, UNB 2013. https://unbscholar.lib.unb.ca/islandora/object/unbscholar%3A9399/datastream/PDF/view

He also posted his slides online: http://www.slideshare.net/ikhtearSharif/ikhtear-defense

Other recommended libraries

Funding

This work was supported by NSERC grant number 26143.

More Repositories

1

FastPFor

The FastPFOR C++ library: Fast integer compression
C++
837
star
2

Code-used-on-Daniel-Lemire-s-blog

This is a repository for the code posted on my blog
C
720
star
3

fast_double_parser

Fast function to parse strings into double (binary64) floating-point values, enforces the RFC 7159 (JSON standard) grammar: 4x faster than strtod
C++
574
star
4

javaewah

A compressed alternative to the Java BitSet class
Java
556
star
5

simdcomp

A simple C library for compressing lists of integers using binary packing
C
463
star
6

EWAHBoolArray

A compressed bitmap class in C++.
C++
430
star
7

SIMDCompressionAndIntersection

A C++ library to compress and intersect sorted lists of integers using SIMD instructions
C++
413
star
8

fastbase64

SIMD-accelerated base64 codecs
C
388
star
9

streamvbyte

Fast integer compression in C using the StreamVByte codec
C
357
star
10

FastPriorityQueue.js

a fast heap-based priority queue in JavaScript
JavaScript
328
star
11

fastvalidate-utf-8

header-only library to validate utf-8 strings at high speeds (using SIMD instructions)
C
284
star
12

fastrange

A fast alternative to the modulo reduction
C
272
star
13

fastmod

A C/C++ header file for fast 32-bit division remainders (and divisibility tests) on 64-bit hardware.
C++
264
star
14

clhash

C library implementing the ridiculously fast CLHash hashing function
C
256
star
15

externalsortinginjava

External-Memory Sorting in Java
Java
248
star
16

rollinghashcpp

Rolling Hash C++ Library
C++
179
star
17

FastBitSet.js

Speed-optimized BitSet implementation for modern browsers and JavaScript engines
JavaScript
150
star
18

docker_programming_station

A simple script to help programmers who want to work within Docker
Shell
142
star
19

despacer

C library to remove white space from strings as fast as possible
C
133
star
20

StronglyUniversalStringHashing

Benchmark showing the we can randomly hash strings very quickly with good universality
C++
131
star
21

testingRNG

Testing common random-number generators (RNG)
C++
131
star
22

MaskedVByte

Fast decoder for VByte-compressed integers
C
114
star
23

cbitset

A simple bitset library in C
C
110
star
24

lbimproved

Dynamic Time Warping (DTW) library implementing lower bounds (LB_Keogh, LB_Improved...)
C++
105
star
25

dictionary

High-performance dictionary coding
C++
99
star
26

SIMDxorshift

Fast random number generators: Vectorized (SIMD) version of xorshift128+
C
99
star
27

csharpewah

Compressed bitmaps in C#
C#
82
star
28

fastrand

Fast random number generation in an interval in Python: Up to 10x faster than random.randint.
C
77
star
29

bloofi

Bloofi: A java implementation of multidimensional Bloom filters
Java
75
star
30

rollinghashjava

Rolling hash functions in Java
Java
75
star
31

LittleIntPacker

C library to pack and unpack short arrays of integers as fast as possible
C
70
star
32

simdpcg

Vectorized version of the PCG random number generator
C
68
star
33

TypedFastBitSet.js

Speed-optimized BitSet implementation for modern browsers and JavaScript engines, uses typed arrays
TypeScript
67
star
34

FastIntegerCompression.js

Fast integer compression library in JavaScript
JavaScript
54
star
35

simdprune

Pruning elements in SIMD vectors (i.e., packing left elements)
C
50
star
36

testingmlp

Testing memory-level parallelism
C
49
star
37

FastDifferentialCoding

Fast differential coding functions (using SIMD instructions)
C
49
star
38

multiplatform_simd_recipes

SIMD recipes, for various platforms (collection of code snippets)
C
43
star
39

runningmaxmin

Fast maximum-minimum filters implemented in C++
C++
43
star
40

fasthashing

A Variable-Length String Hashing Library in C++
C++
40
star
41

fast_division

Simple C++ code to benchmark fast division algorithms
C++
40
star
42

FrameOfReference

C++ library to pack and unpack vectors of integers having a small range of values using a technique called Frame of Reference
C
40
star
43

validateutf8-experiments

Reproducible experimeents on UTF-8 validation using SIMD instructions
C++
39
star
44

SwiftBitset

A fast Bitset class in Swift
Swift
36
star
45

FastShuffleExperiments

How fast can we shuffle values?
C++
33
star
46

programmingishard

Long-term book project
32
star
47

StablePriorityQueue.js

A stable heap-based priority queue in JavaScript
JavaScript
27
star
48

fastscancount

Fast implementations of the scancount algorithm: C++ header-only library
C++
24
star
49

SIMDgameoflife

Vectorized (AVX) version of the game of life
C
23
star
50

CMemoryUsage

Measuring memory usage in C and C++
C
23
star
51

CRoaringUnityBuild

Dumps of CRoaring unity builds (for convenience)
C
23
star
52

SwiftWyhash

Fast random number generator in pure Swift
Swift
23
star
53

kodakimagecollection

Set of images made available royalty-free by Kodak
22
star
54

sparsebitmap

A simple sparse bitmap implementation in java
Java
21
star
55

IndexWikipedia

A simple utility to index wikipedia dumps using Lucene.
Java
21
star
56

PiecewiseLinearTimeSeriesApproximation

code from Daniel Lemire, A Better Alternative to Piecewise Linear Time Series Segmentation, SIAM Data Mining 2007, 2007.
C++
21
star
57

Concise

C++ implementation of Concise and WAH compressed bitsets
C++
19
star
58

MemoryLanes

iOS app to test memory-level parallelism
C++
18
star
59

talks

Random material having to do with Daniel Lemire's talks
C++
18
star
60

fastrandom

Go
18
star
61

crackingxoroshiro128plus

How to "crack" xoroshiro128+: from two outputs, derive a possible seed
Python
18
star
62

RealisticTabularDataSets

Some realistic tabular datasets for testing (CSV)
17
star
63

StarSchemaBenchmark

O'Neil et al.'s Star Schema Benchmark: curated code
C
16
star
64

simplebitmapbenchmark

Simple benchmark between compressed bitmap libraries in Java
Java
15
star
65

vectorclass

Random number generator for large applications using vector instructions
C++
15
star
66

simple_fastfloat_benchmark

C
15
star
67

zobristhashing

Zobrist hashing in C
C
14
star
68

BitSliceIndex

Experiments on bit-slice indexing
Java
14
star
69

SIMDIntersections

Vectorized intersections (research code)
C++
14
star
70

microbenchmarks

Private microbenchmarks
Java
13
star
71

backward_multiplication

Multiplying... backward?
C++
12
star
72

pospopcnt_avx512

benchmarking positional population count
Assembly
11
star
73

costofsafety

Quick experiment to see how expensive safety is in C, for research
C
11
star
74

constantdivisionbenchmarks

Benchmarks for constant-division problems (not a library! for research only!)
C++
11
star
75

createfasthash

Code from article http://locklessinc.com/articles/fast_hash/
C
9
star
76

SwiftCallingCHeader

Calling a C header from Swift (example)
Swift
8
star
77

arraylayout

Pat Morin's arraylayout
TeX
8
star
78

WebAssemblyVSJavaScript

Project to compare the performance of WebAssembly with JavaScript
JavaScript
7
star
79

ComputerLanguageBenchmark

Captured as http://benchmarksgame.alioth.debian.org is closing
HTML
7
star
80

rowreorderingcpplibrary

This is a set of row-reordering algorithms and data compression compression schemes implemented in C++.
C++
6
star
81

notesdecours

Notes de cours
Java
6
star
82

gobitmapbenchmark

A Go project to benchmark various bitmap implementations (this is not a library!)
Go
6
star
83

gutenberg-headers

Removing Manually-Generated Boilerplate from Project Gutenberg e-Books
Java
6
star
84

iosbitmapdecoding

Experimenting with bitmap decoding on ios
C++
6
star
85

simple_simdjson_python_wrapper

Simple use of simdjson from python (for research purposes)
C++
6
star
86

citationdata

Influential references dataset
6
star
87

fastheap

Fast heap data structures in Go (golang)
Go
6
star
88

pythonmaxmin

Fast minimum-maximal filter in Python
Python
6
star
89

jstypes

Doing C-like arithmetic and logical operations in JavaScript (full 64-bit support)
JavaScript
5
star
90

fast_float_supplemental_tests

Supplemental tests for the fast_float library (credit: Nigel Tao)
C++
5
star
91

HashVSTree

Do hash table use more memory than trees? A case study in Java.
Java
5
star
92

hierarchicalbinbuffering

Hierarchical Bin Buffering C++ Library
C++
4
star
93

RoaringVSConciseBenchmark

A benchmark comparing roaring and concise
Java
4
star
94

DivisionBenchmark

A not-so-exciting benchmark to measure the performance of some division functions
C++
4
star
95

exercices_lucene

Exercises pour s'approprier lucene, le moteur de recherche
Java
4
star
96

jackson-json-bench

A silly benchmark for Jackson (JSON parser)
Java
4
star
97

roaring_rust_benchmark

Basic benchmark to compare different Roaring bitmap implementations in Rust
Rust
4
star
98

datasciencebook

Python
4
star
99

viewsizeestimation

Unassuming hashing-based view-size estimation techniques
C++
4
star
100

simple_cpp_shuffle_benchmark

Simple benchmark to see how fast the standard C++ library can shuffle arrays
C++
4
star