The PebblesDB write-optimized key-value store (SOSP 17)

PebblesDB

PebblesDB is a write-optimized key-value store which is built using the novel FLSM (Fragmented Log-Structured Merge Tree) data structure. FLSM is a modification of the standard log-structured merge tree data structure which aims at achieving higher write throughput and lower write amplification without compromising on read throughput.

PebblesDB is built by modifying HyperLevelDB which, in turn, is built on top of LevelDB. PebblesDB is API-compatible with HyperLevelDB and LevelDB, so it is a drop-in replacement for both. The source code is available on GitHub. The full paper on PebblesDB can be found here. The slides for the SOSP 17 talk, which explains the core ideas behind PebblesDB, can be found here.

If you are using LevelDB in your deployment, do consider trying out PebblesDB! PebblesDB can also be used to replace RocksDB as long as RocksDB-specific functionality, such as column families, is not used.

Please cite the following paper if you use PebblesDB: PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees. Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, Ittai Abraham. SOSP 17. Bibtex

The benchmarks page has a list of experiments evaluating PebblesDB vs LevelDB, HyperLevelDB, and RocksDB. The summary is that PebblesDB outperforms the other stores on write throughput, equals other stores on read throughput, and incurs a penalty for small range queries on fully compacted key-value stores. PebblesDB achieves 6x the write throughput of RocksDB, while providing similar read throughput and performing 50% less IO. Please see the paper for more details.

If you would like to run MongoDB with PebblesDB as the storage engine, please check out mongo-pebbles, a modification of the mongo-rocks layer between RocksDB and MongoDB.


Dependencies

PebblesDB requires libsnappy and libtool. To install them on Linux, use sudo apt-get install libsnappy-dev libtool. On Mac OS X, use brew install snappy, and run update_dyld_shared_cache instead of ldconfig.

PebblesDB was built, compiled, and tested with g++-4.7, g++-4.9, and g++-5. It may not work with other versions of g++ and other C++ compilers.

Installation

Using Autotools:

$ cd pebblesdb/src
$ autoreconf -i
$ ./configure
$ make
$ make install
$ ldconfig

Using CMake:

$ mkdir -p build && cd build
$ cmake .. && make install -j16

Running microbenchmark

  1. cd pebblesdb/src/
  2. make db_bench (this only works if you are compiling using autotools, and have done autoreconf and configure before this step)
  3. ./db_bench --benchmarks=<list-of-benchmarks> --num=<number-of-keys> --value_size=<size-of-value-in-bytes> --reads=<number-of-reads> --db=<database-directory-path>
    A complete set of parameters can be found in db/db_bench.cc

Sample usage:
./db_bench --benchmarks=fillrandom,readrandom --num=1000000 --value_size=1024 --reads=500000 --db=/tmp/pebblesdbtest-1000

Add the filter benchmark to the list to print filter policy statistics such as memory usage.

./db_bench --benchmarks=fillrandom,readrandom,filter --num=1000000 --value_size=1024 --reads=500000 --db=/tmp/pebblesdbtest-1000

    fillrandom   :     110.460 micros/op;    9.0 MB/s
    readrandom   :       4.120 micros/op; (5000 of 10000 found)

    Filter in-memory size: 0.024 MB
    Count of filters: 1928
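
The figures that db_bench reports can be cross-checked by hand. The sketch below converts a micros/op latency into operations per second and MB/s. It assumes, as in upstream LevelDB's db_bench, that each operation counts key plus value bytes, that keys are 16 bytes, and that one MB is 1,048,576 bytes. None of these assumptions is stated in this README, and the function names are hypothetical.

```cpp
#include <cassert>

// Operations per second implied by a db_bench "micros/op" figure.
double OpsPerSecond(double micros_per_op) {
  assert(micros_per_op > 0.0);
  return 1e6 / micros_per_op;
}

// Throughput in MB/s, ASSUMING each operation moves key_bytes + value_bytes
// and that 1 MB = 1048576 bytes (the convention in upstream LevelDB's db_bench).
double MegabytesPerSecond(double micros_per_op, int key_bytes, int value_bytes) {
  assert(key_bytes >= 0 && value_bytes >= 0);
  const double bytes_per_op = static_cast<double>(key_bytes + value_bytes);
  return OpsPerSecond(micros_per_op) * bytes_per_op / 1048576.0;
}
```

With the sample fillrandom line (110.460 micros/op, 1024-byte values) and an assumed 16-byte key, this gives roughly 9,053 operations per second and about 8.98 MB/s, consistent with the 9.0 MB/s shown.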

Optimizations in PebblesDB

PebblesDB uses the FLSM data structure to logically arrange the sstables on disk. FLSM helps in achieving high write throughput by reducing write amplification. But in FLSM, each guard can contain multiple overlapping sstables. Hence a read or seek over the database requires examining one guard (multiple sstables) per level, thereby increasing the read/seek latency. PebblesDB employs some optimizations to tackle these challenges as follows:

Read optimization

  • PebblesDB uses sstable-level bloom filters instead of the block-level bloom filters used in HyperLevelDB and LevelDB. With this optimization, even though a guard can contain multiple sstables, PebblesDB effectively reads only one sstable from disk per level.

  • This optimization is turned on by default. It can be disabled by commenting out the macro #define FILE_LEVEL_FILTER in db/version_set.h. Remember to run make db_bench after making a change.
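
To illustrate why a per-sstable filter lets a lookup skip most files in a guard, here is a minimal, self-contained sketch. It is not PebblesDB's implementation: ToyBloomFilter, ToySstable, and GetFromGuard are hypothetical names, the filter size and probe count are arbitrary, and the sketch ignores key versions and sstable ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Toy Bloom filter using double hashing over std::hash (illustrative only).
class ToyBloomFilter {
 public:
  ToyBloomFilter(std::size_t bits, int probes)
      : bits_(bits, false), probes_(probes) {
    assert(bits > 0 && probes > 0);
  }
  void Add(const std::string& key) {
    for (int i = 0; i < probes_; ++i) bits_[Index(key, i)] = true;
  }
  // false: key is definitely absent. true: key is possibly present.
  bool MayContain(const std::string& key) const {
    for (int i = 0; i < probes_; ++i) {
      if (!bits_[Index(key, i)]) return false;
    }
    return true;
  }

 private:
  std::size_t Index(const std::string& key, int i) const {
    const std::size_t h1 = std::hash<std::string>{}(key);
    const std::size_t h2 = std::hash<std::string>{}(key + "#");
    return (h1 + static_cast<std::size_t>(i) * h2) % bits_.size();
  }
  std::vector<bool> bits_;
  int probes_;
};

// Hypothetical in-memory stand-in for one sstable and its file-level filter.
struct ToySstable {
  std::set<std::string> keys;
  ToyBloomFilter filter{1024, 4};
  void Put(const std::string& key) {
    keys.insert(key);
    filter.Add(key);
  }
};

struct LookupResult {
  bool found;
  int sstables_read;  // files actually opened after consulting the filters
};

// Check each sstable's filter first; only "read" files that may hold the key.
LookupResult GetFromGuard(const std::vector<ToySstable>& guard,
                          const std::string& key) {
  LookupResult result{false, 0};
  for (const ToySstable& table : guard) {
    if (!table.filter.MayContain(key)) continue;  // skipped without disk IO
    ++result.sstables_read;
    if (table.keys.count(key) > 0) {
      result.found = true;
      break;
    }
  }
  return result;
}
```

A Bloom filter never reports a stored key as absent, so a lookup cannot miss data. It can occasionally report an absent key as possibly present, which costs one extra sstable read.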

Seek optimization

Sstable-level bloom filters cannot reduce disk reads for seek operations, since a seek has to examine all files within a guard even if a file doesn't contain the key. To tackle this challenge, PebblesDB applies two optimizations:

  1. Parallel seeks: PebblesDB employs multiple threads to perform the seek() operation on multiple files within a guard. Note that this optimization might only be helpful when the data set is much larger than RAM, because otherwise the overhead of thread synchronization outweighs the benefit of using multiple threads. By default, this optimization is disabled. It can be enabled by uncommenting #define SEEK_PARALLEL in db/version_set.h.

  2. Forced compaction: When the workload is seek-heavy, PebblesDB can be configured to do a seek-based forced compaction which aims to reduce the number of files within a guard. This can lead to an increase in write IO, but this is a trade-off between write IO and seek throughput. By default, this optimization is enabled. This can be disabled by uncommenting #define DISABLE_SEEK_BASED_COMPACTION in db/version_set.h.
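
The idea behind parallel seeks can be sketched as follows: seek each file in a guard on its own task, then keep the smallest key at or after the target. This is an illustrative sketch using std::async, not the threading code in PebblesDB. SortedFile, SeekResult, SeekInFile, and ParallelSeek are hypothetical names, and each "file" is just a sorted in-memory vector. Depending on the platform, it may need to be compiled with -pthread.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for one sstable: its keys in sorted order.
using SortedFile = std::vector<std::string>;

struct SeekResult {
  bool valid;       // false if no key >= target exists
  std::string key;  // smallest key >= target when valid
};

// Seek within a single file using binary search.
SeekResult SeekInFile(const SortedFile& file, const std::string& target) {
  assert(std::is_sorted(file.begin(), file.end()));
  auto it = std::lower_bound(file.begin(), file.end(), target);
  if (it == file.end()) return SeekResult{false, ""};
  return SeekResult{true, *it};
}

// Seek every file of a guard on its own task, then keep the smallest hit.
SeekResult ParallelSeek(const std::vector<SortedFile>& guard,
                        const std::string& target) {
  std::vector<std::future<SeekResult>> pending;
  for (const SortedFile& file : guard) {
    pending.push_back(std::async(std::launch::async, [&file, &target] {
      return SeekInFile(file, target);
    }));
  }
  SeekResult best{false, ""};
  for (std::future<SeekResult>& task : pending) {
    SeekResult candidate = task.get();  // waits for that file's seek
    if (candidate.valid && (!best.valid || candidate.key < best.key)) {
      best = candidate;
    }
  }
  return best;
}
```

With small in-memory inputs like these, task creation and synchronization cost more than the seeks themselves, which mirrors the note above that the optimization pays off mainly when seeks have to go to disk.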


Tuning PebblesDB

  • The amount of overhead PebblesDB has for read/seek workloads as well as the amount of gain it has for write workloads depends on a single parameter: kMaxFilesPerGuardSentinel, which determines the maximum number of sstables that can be present within a single guard.

  • This parameter can be set in db/dbformat.h (default value: 2). Setting it higher favors write throughput, while setting it lower favors read/seek throughput.
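
The trade-off can be made concrete with a toy counting model. The sketch below is not PebblesDB's compaction logic: it assumes that a guard is compacted (emptied) whenever a new sstable arrives while the guard already holds the maximum number of files, and GuardStats and SimulateGuard are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

struct GuardStats {
  int compactions;     // times the guard had to be compacted (extra write IO)
  int max_files_seen;  // worst-case sstables a read or seek may examine
};

// Toy model: `flushes` sstables arrive at one guard, one at a time.
// ASSUMPTION: a full guard is compacted away before the next file is added.
GuardStats SimulateGuard(int flushes, int max_files_per_guard) {
  assert(flushes >= 0 && max_files_per_guard >= 1);
  GuardStats stats{0, 0};
  int files = 0;
  for (int i = 0; i < flushes; ++i) {
    if (files == max_files_per_guard) {
      ++stats.compactions;
      files = 0;
    }
    ++files;
    stats.max_files_seen = std::max(stats.max_files_seen, files);
  }
  return stats;
}
```

In this model, 12 incoming sstables with a limit of 2 trigger 5 compactions and leave at most 2 files for a read to examine, while a limit of 4 triggers only 2 compactions but allows up to 4 files per read.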


Running YCSB Benchmarks

The Java Native Interface wrapper for PebblesDB is available here. To run the YCSB benchmarks, please follow the instructions in the "Running YCSB Workloads with PebblesDB" section.

The YCSB bindings for PebblesDB can be found here.


Improvements made after the SOSP paper

The following improvements were made to the codebase after the SOSP paper:

  • Add CMake build system support (Zeyuan Hu @xxks-kkk)
  • Add JNI Wrapper and support for running YCSB benchmarks (Abhijith Nair @abhijith97)
  • Accounting for memory used by bloom filters (Karuna Grewal @aakp10)

Contact

Please contact us at [email protected] with any questions. Drop us a note if you are using or plan to use PebblesDB in your company or university.
