FastTuple

Introduction

There are lots of good things about working on the JVM, like a world class JIT, operating system threads, and a world class garbage collector. However, one limiting factor can often be the interaction between primitive types and referenced types in Java. Primitive types are the built in types that represent integral numbers, floating point numbers, and boolean yes/no values. Primitives are also quite fast and memory efficient: they get allocated either on the stack if they're being used in a method, or inlined in an object when they're declared as field members. And they wind up being fast because the JIT can often optimize their access down to a single CPU instruction. This works really well when you know what types a class will hold as its state beforehand. If, on the other hand, you don't know what an object or array will hold at compile time the JVM forces you to box primitives. Boxing means that the primitives get wrapped in a heap allocated object, and their container will hold a reference to them. That type of overhead winds up being inefficient both in access time and memory space. Access time suffers because this type of layout breaks locality of reference, which is the principle that memory frequently accessed together should be stored adjacently. All modern memory hierarchies optimize for locality of reference in their caching implementations. The extra allocations and garbage generated also put pressure on the JVM's garbage collector, which can often be a cause of long pause times.

We wrote FastTuple to try and help solve this problem. FastTuple generates heterogeneous collections of primitive values and ensures as best it can that they will be laid out adjacently in memory. The individual values in the tuple can either be accessed from a statically bound interface, via an indexed accessor, or via reflective or other dynamic invocation techniques. FastTuple is designed to deal with a large number of tuples therefore it will also attempt to pool tuples such that they do not add significantly to the GC load of a system. FastTuple is also capable of allocating the tuple value storage entirely off-heap, using Java's direct memory capabilities.

FastTuple pulls off its trick via runtime bytecode generation. The user supplies it with a schema of field names and types. That schema is then built into a Java class definition which will contain accessor methods and either field definitions or the memory address for an off heap allocation, depending on which storage method was requested. The resulting Java class gets compiled into bytecode and then loaded as a reflective Class object. This Class object can then be used to create instances of the new class.

Building

This package depends on Janino 2.7.4, which is not currently getting pushed to maven central. As soon as that happens (we have a ticket in with the maintainer) we will begin pushing FastTuple to maven central as well. In the meantime you can build FastTuple like so:

wget http://janino.net/download/janino-2.7.4.zip
unzip janino-2.7.4.zip
mvn install:install-file -Dfile=janino-2.7.4/janino.jar -Dsources=janino-2.7.4/janino-src.zip -DgroupId=org.codehaus.janino -DartifactId=janino -Dversion=2.7.4 -Dpackaging=jar
mvn install:install-file -Dfile=janino-2.7.4/commons-compiler.jar -Dsources=janino-2.7.4/commons-compiler-src.zip -DgroupId=org.codehaus.janino -DartifactId=commons-compiler -Dversion=2.7.4 -Dpackaging=jar
mvn install

Note on GPG

You'll also need to ensure that you have GPG available on the command line. Mac OS X and Windows users may need to install via (https://gpgtools.org/) or cygwin respectively.

Usage

Interaction with FastTuple primarily takes place via the TupleSchema class. Each instance of TupleSchema describes a separate type of FastTuple both from the perspective of the FastTuple library and the JVM. At this time, allowable field types are the primitive classes for: long, int, short, char, byte, float, double. Support for String is planned for a later release. Some examples:

Heap Allocated Tuples

	TupleSchema schema = TupleSchema.builder().
		addField("fieldA", Long.TYPE).
		addField("fieldB", Int.TYPE).
		addField("fieldC", Short.TYPE).
		heapMemory().
		build();

	//creates a new tuple allocated on the JVM heap
	FastTuple tuple = schema.createTuple();

Direct Allocated Tuples

	TupleSchema schema = TupleSchema.builder().
		addField("fieldA", Long.TYPE).
		addField("fieldB", Int.TYPE).
		addField("fieldC", Short.TYPE).
		directMemory().
		build();

	//creates a new tuple, allocating memory off heap
	FastTuple tuple = schema.createTuple();
	//do some stuff
	tuple.setLong(1, 10000L);
	tuple.setInt(2, 50);
	tuple.setShort(3, (short)256);
	//if you don't destroy the tuple you are leaking memory
	schema.destroy(tuple);

Aligning Direct Allocated Tuples

Direct allocated tuples can be aligned such that they do not share cache lines. This is useful for situations where adjacent tuples are manipulated concurrently by different threads: an adequately padded FastTuple can eliminate false sharing in the CPU cache architecture. Veriifying that property, as expected, will require extensive benchmarking inside of the target system.

	TupleSchema schema = TupleSchema.builder().
		addField("fieldA", Long.TYPE).
		addField("fieldB", Int.TYPE).
		addField("fieldC", Short.TYPE).
		directMemory().
		padToWordSize(64).
		build();

	//creates a new tuple, allocating memory off heap
	FastTuple tuple = schema.createTuple();
	//do some stuff
	tuple.setLong(1, 10000L);
	tuple.setInt(2, 50);
	tuple.setShort(3, (short)256);
	//if you don't destroy the tuple you are leaking memory
	schema.destroy(tuple);

Utilizing Tuple Pools

Each schema will allocate a tuple pool per accessing thread if a poolSize is set.

	TupleSchema schema = TupleSchema.builder().
		addField("fieldA", Long.TYPE).
		addField("fieldB", Int.TYPE).
		addField("fieldC", Short.TYPE).
		poolOfSize(1024).
		//allocates an extra poolOfSize records when the pool is empty
		.expandingPool().
		build();

	//checks a tuple from the pool
	FastTuple tuple = schema.pool().checkout();
	//do some stuff
	tuple.setLong(1, 10000L);
	tuple.setInt(2, 50);
	tuple.setShort(3, (short)256);
	//if you don't check the tuple back in you either leak memory or objects. bad dog.
	schema.pool().release(tuple);

Performance

One of the main goals of this library is performance. Toward that end it has a full suite of microbenchmarks to test the various supported means of accessing and manipulating tuples for both tuning the library and showing the tradeoffs in overhead for things like pooling and allocating tuples on demand. Here's what a full run looks like on a late 2013 macbook pro 2.6ghz with Java 8 1.8.0_05-b13.

Benchmark                                                                              Mode   Samples         Mean   Mean error    Units
c.b.t.AccessMethodBenchmark.testAllocateSetAndDeallocate                              thrpt        20     7069.954      181.689   ops/ms
c.b.t.AccessMethodBenchmark.testClass                                                 thrpt        20  1628847.332    21756.564   ops/ms
c.b.t.AccessMethodBenchmark.testFastPool                                              thrpt        20    30715.791      570.086   ops/ms
c.b.t.AccessMethodBenchmark.testFastTuplePreAllocIndexed                              thrpt        20   160173.234     2498.380   ops/ms
c.b.t.AccessMethodBenchmark.testFastTuplePreAllocIndexedBoxing                        thrpt        20    59117.984      630.813   ops/ms
c.b.t.AccessMethodBenchmark.testFastTupleStaticBinding                                thrpt        20   157928.877     2411.094   ops/ms
c.b.t.AccessMethodBenchmark.testInvokeDynamic                                         thrpt        20    26085.594     1158.292   ops/ms
c.b.t.AccessMethodBenchmark.testLongArray                                             thrpt        20  1721846.666    23422.990   ops/ms
c.b.t.AccessMethodBenchmark.testOffheapAllocateAndSet                                 thrpt        20     7512.737      167.141   ops/ms
c.b.t.AccessMethodBenchmark.testOffheapDirectSet                                      thrpt        20   898743.426    37659.166   ops/ms
c.b.t.AccessMethodBenchmark.testOffheapSchemaSet                                      thrpt        20   367841.110     4282.620   ops/ms
c.b.t.AccessMethodBenchmark.testPooledObject                                          thrpt        20    12572.464      155.458   ops/ms
c.b.t.AccessMethodBenchmark.testQueuedObject                                          thrpt        20    25092.751      242.771   ops/ms
c.b.t.AccessMethodBenchmark.testReflectField                                          thrpt        20    28850.478      275.462   ops/ms
c.b.t.AccessMethodBenchmark.testStormTuple                                            thrpt        20    79693.517      888.956   ops/ms
c.b.t.AccessMethodBenchmark.testTuplePool                                             thrpt        20    70790.084      950.775   ops/ms
c.b.t.FastTupleBenchmarks.DirectBenchmarks.measureDirectSchemaAllocate                thrpt        20     7214.300       98.387   ops/ms
c.b.t.FastTupleBenchmarks.DirectBenchmarks.measureDirectSchemaDeque                   thrpt        20   153937.210     2534.986   ops/ms
c.b.t.FastTupleBenchmarks.DirectBenchmarks.measureDirectSchemaPoolEval                thrpt        20    53563.512      836.586   ops/ms
c.b.t.FastTupleBenchmarks.DirectBenchmarks.measureDirectSchemaPoolIface               thrpt        20    57669.480      716.962   ops/ms
c.b.t.FastTupleBenchmarks.DirectBenchmarks.measureDirectSchemaPoolIndexed             thrpt        20    57633.447     1025.338   ops/ms
c.b.t.FastTupleBenchmarks.DirectBenchmarks.measureDirectSchemaPoolIndexedBoxed        thrpt        20    38984.939      823.182   ops/ms
c.b.t.FastTupleBenchmarks.DirectBenchmarks.measureDirectSchemaPreallocEval            thrpt        20   332142.591     6426.566   ops/ms
c.b.t.FastTupleBenchmarks.DirectBenchmarks.measureDirectSchemaPreallocIface           thrpt        20   370593.271     5369.162   ops/ms
c.b.t.FastTupleBenchmarks.DirectBenchmarks.measureDirectSchemaPreallocIndexed         thrpt        20   377811.881     7077.471   ops/ms
c.b.t.FastTupleBenchmarks.DirectBenchmarks.measureDirectSchemaPreallocIndexedBoxed    thrpt        20   115699.056     2402.657   ops/ms
c.b.t.FastTupleBenchmarks.HeapBenchmarks.measureHeapSchemaAllocate                    thrpt        20   990788.476    12295.764   ops/ms
c.b.t.FastTupleBenchmarks.HeapBenchmarks.measureHeapSchemaDeque                       thrpt        20   173527.417     3619.733   ops/ms
c.b.t.FastTupleBenchmarks.HeapBenchmarks.measureHeapSchemaPoolEval                    thrpt        20    59377.121     1071.014   ops/ms
c.b.t.FastTupleBenchmarks.HeapBenchmarks.measureHeapSchemaPoolIface                   thrpt        20    62391.209      765.821   ops/ms
c.b.t.FastTupleBenchmarks.HeapBenchmarks.measureHeapSchemaPoolIndexed                 thrpt        20    62120.412     1105.981   ops/ms
c.b.t.FastTupleBenchmarks.HeapBenchmarks.measureHeapSchemaPoolIndexedBoxed            thrpt        20    42187.554      737.124   ops/ms
c.b.t.FastTupleBenchmarks.HeapBenchmarks.measureHeapSchemaPreallocEval                thrpt        20   397337.746     5274.807   ops/ms
c.b.t.FastTupleBenchmarks.HeapBenchmarks.measureHeapSchemaPreallocEvalField           thrpt        20   579136.768     8818.827   ops/ms
c.b.t.FastTupleBenchmarks.HeapBenchmarks.measureHeapSchemaPreallocIface               thrpt        20   554449.236     6740.073   ops/ms
c.b.t.FastTupleBenchmarks.HeapBenchmarks.measureHeapSchemaPreallocIndexed             thrpt        20   536258.735    10417.522   ops/ms
c.b.t.FastTupleBenchmarks.HeapBenchmarks.measureHeapSchemaPreallocIndexedBoxed        thrpt        20   202220.576     3806.694   ops/ms

To run the benchmarks:

mvn package
java -jar fasttuple-bench/target/microbenchmarks.jar

boundary/fasttuple

boundary

Reviews

Repository Details