Low-Level Data Structure
llds is a btree implementation which attempts to maximize memory efficiency via bypassing the virtual memory layer (vmalloc) and through optimized data structure memory semantics.
The llds general working thesis is: for large memory applications, virtual memory layers can hurt application performance due to increased memory latency when dealing with large data structures. Specifically, data page tables/directories within the kernel and increased DRAM requests can be avoided to boost application memory access.
Applicable use cases: applications on systems that utilize large in-memory data structures. In our testing, "large" was defined as >4GB structures, which did yield significant gains with llds vs equivalent userspace implementations.
Intel® Xeon Phi™ processors have somewhat reduced, certainly not eliminated, many of the cache cohorency improvements provided by llds.
llds still provides better performance (pipeline prefetch) and space efficiency on Phiâ„¢ microarchitectures.
llds 2.0 (WIP) will attempt to better leverage the Phiâ„¢'s memory ring bus.
Complexity
Function | Mean | Worst Case |
---|---|---|
Search | O(log n) | O(log n) |
Insert | O(log n) | O(log n) |
Delete | O(log n) | O(log n) |
Update | O(log n) | O(log n) |
Installing/Configuring
The build environment will need libproc, glibc, and linux headers. For Ubuntu/Debian based distros this is available in the libproc-dev, linux-libc-dev, and build-essential pkgs.
libforrest
$ cmake . $ make # make install
llds kernel module
$ cd linux/llds && make -j$(nproc) # insmod llds.ko # mknod /dev/llds c 834 0
How it Works
llds is a Linux kernel module (3.x, 4.x, 5.x, 6.x) which leverages facilities provided by the kernel mm for optimal DRAM memory access. llds uses the red-black tree data structure, which is highly optimized in the kernel and is used to manage processes, epoll file descriptors, file systems, and many other components of the kernel.
Memory management in llds is optimized for traversal latency, not space efficiency, though space savings are probable due to better alignment in most use cases. llds data structures should not consume any more memory than their equivalent user space implementations.
Traversal latency is optimized by exploiting underlying physical RAM mechanics, avoiding CPU cache pollution, NUMA cross-check chatter, and streamlining CPU data prefetching (L1D cache lines). Fragmented memory access is less efficient when interacting with modern DRAM controllers. The efficiency also further suffers on NUMA systems as the number of processors/memory banks increases.
libforrest
Developers can interact directly with the llds chardev using ioctl(2), however, it is highly recommended that the libforrest API is used to avoid incompatibilities should the ioctl interface change in the future.
libforrest provides the basic key-value store operations: get, set, and delete. In addition, it provides a 64-bit MurmurHash (rev. A) for llds key hashing.
Examples are provided in the libforrest/examples directory.
Benchmarks
Benchmarks are inherently fluid. All samples and timings are available at http://github.com/johnj/llds-benchmarks, additionally there is a run_tests.sh
script provided which utilizes oprofile. Along with the run_tests.sh
script, there is a user-space implementation of red-black trees and an equivalent llds implementation. The goal of benchmarking is about opining to the results of a particular environment but all the tools and scripts are available to let users test their own mileage.
Benchmark environment: Dell PowerEdge R610, 4x Intel Xeon L5640 (Westmere) w/HT (24 cores), 192GB DDR3 DRAM, Ubuntu 10.04.3 LTS. The keys are 64-bit integers and the values are incremented strings (ie, "0", "1", "2"..."N"). There were no major page faults.
For conciseness, only tests with 2/16/24 threads and 500K/1.5M/2M keys are listed. dmidecode, samples, and full benchmarks are available at http://github.com/johnj/llds-benchmarks
Wall Timings (in seconds)
Threads | # of Items | userspace | llds | llds improvement |
2 | 500000000 | 3564 | 1761 | 2.02x |
16 | 1500000000 | 9291 | 4112 | 2.26x |
24 | 2000000000 | 12645 | 5670 | 2.23x |
Unhalted CPU cycles (10000 cycles @ 133mHz)
Threads | # of Items | userspace | llds |
2 | 500000000 | 87418776 | 377458531 |
16 | 1500000000 | 279203932 | 5107099682 |
24 | 2000000000 | 968091233 | 5529234102 |
L1 cache hits (200000 per sample)
Threads | # of Items | userspace | llds | llds improvement |
2 | 500000000 | 3077671 | 5502292 | 1.78x |
16 | 1500000000 | 15120921 | 27231553 | 1.80x |
24 | 2000000000 | 23746988 | 39196177 | 1.65x |
L2 cache hits (200000 per sample)
Threads | # of Items | userspace | llds | llds improvement |
2 | 500000000 | 21866 | 60214 | 2.75x |
16 | 1500000000 | 82101 | 511285 | 6.23x |
24 | 2000000000 | 127072 | 800846 | 6.30x |
L3/Last-Level cache hits (200000 per sample)
Threads | # of Items | userspace | llds | llds improvement |
2 | 500000000 | 26069 | 32259 | 1.24x |
16 | 1500000000 | 148827 | 254562 | 1.71x |
24 | 2000000000 | 270191 | 341649 | 1.26x |
L1 Data Prefetch misses (200000 per hardware sample)
Threads | # of Items | userspace | llds | llds improvement |
2 | 500000000 | 52396 | 21113 | 2.48x |
16 | 1500000000 | 350753 | 120891 | 2.90x |
24 | 2000000000 | 544791 | 210268 | 2.59x |
Status
llds is being used in at least two production search engines. Please let me know if you are using llds in production, you can remain anonymous.
Work on llds 2.0 is underway to better utilize cache rings which are "today"'s architecture.
Known Limitations/Issues
- libforrest has a limit on the value which comes back from kernel space, the default is 4096 bytes, it can be adjusted through the FORREST_MAX_VAL_LEN directive at compile time.
- Only 64-bit architecture support.
- Only tested on x86, it may work on other arch's, drop a line if it does.
Future Work
- Support for additional data structures (hashes are questionable)
- Add atomic operations (increment, decrement, CAS, etc.) in libforrest and llds
- Research about the virtual memory overhead & implementation in the kernel with mitigation techniques