SplitFS
SplitFS is a file system for Persistent Memory (PM) which is aimed at reducing the software overhead of applications accessing Persistent Memory. SplitFS presents a novel split of responsibilities between a user-space library file system and an existing kernel PM file system. The user-space library file system handles data operations by intercepting POSIX calls, memory mapping the underlying file, and serving the reads and overwrites using processor loads and stores. Metadata operations are handled by the kernel file system (ext4 DAX).
SplitFS introduces a new primitive termed relink to efficiently support file appends and atomic data operations. SplitFS provides three consistency modes, which different applications can choose from without interfering with each other.
SplitFS is built on top of Quill by NVSL. We reuse Quill's machinery for tracking the glibc calls an application makes, and provide our own implementations of those calls. Applications are then run with LD_PRELOAD to intercept the calls at runtime and forward them to SplitFS.
Please cite the following paper if you use SplitFS:
SplitFS: Reducing Software Overhead in File Systems for Persistent Memory. Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, Vijay Chidambaram. Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19).
@InProceedings{KadekodiEtAl19-SplitFS,
title = "{SplitFS: Reducing Software Overhead in File Systems for Persistent Memory}",
author = "Rohan Kadekodi and Se Kwon Lee and Sanidhya Kashyap and Taesoo Kim and Aasheesh Kolli and Vijay Chidambaram",
booktitle = "Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19)",
month = "October",
year = "2019",
address = "Ontario, Canada",
}
Getting Started with SplitFS
This tutorial walks you through compiling SplitFS, setting up ext4-DAX, and compiling an application and running it with both ext4-DAX and SplitFS, using a simple microbenchmark. The microbenchmark appends 128 MB of data to an empty file in 4 KB chunks, and calls fsync() at the end. Note: set the PM partition size to at least 2 GiB for the microbenchmark (the partition size can be set in step 2; please confirm the partition size using df -h after step 4).
1. Set up SplitFS
$ export LEDGER_YCSB=1
$ cd splitfs; make clean; make; cd .. # Compile SplitFS
$ export LD_LIBRARY_PATH=./splitfs
$ export NVP_TREE_FILE=./splitfs/bin/nvp_nvp.tree
2. Set up ext4-DAX
$ sudo mkfs.ext4 -b 4096 /dev/pmem0
$ sudo mount -o dax /dev/pmem0 /mnt/pmem_emul
$ sudo chown -R $USER:$USER /mnt/pmem_emul
3. Set up the microbenchmark
$ cd micro
$ gcc rw_experiment.c -o rw_expt -O3
$ cd ..
4. Run the microbenchmark with ext4-DAX
$ sync && echo 3 > /proc/sys/vm/drop_caches # Run this with superuser
$ ./micro/rw_expt write seq 4096
$ rm -rf /mnt/pmem_emul/*
5. Run the microbenchmark with SplitFS
$ sync && echo 3 > /proc/sys/vm/drop_caches # Run this with superuser
$ LD_PRELOAD=./splitfs/libnvp.so micro/rw_expt write seq 4096
$ rm -rf /mnt/pmem_emul/*
6. Results. The results show the throughput of doing appends on ext4-DAX and SplitFS. Appends are 5.8x faster on SplitFS.
- ext4-DAX: 0.33M appends/sec
- SplitFS: 1.92M appends/sec
Features
- Low software overhead. SplitFS aims for performance close to the maximum provided by persistent-memory hardware. Its software overhead is significantly lower (by 4-12x) than that of state-of-the-art file systems such as NOVA or ext4 DAX. As a result, performance on some applications increases by as much as 2x.
- Flexible guarantees. SplitFS is the only persistent-memory file system that allows simultaneously running applications to receive different guarantees from the file system. SplitFS offers three modes: POSIX, Sync, and Strict. Application A may run in Strict mode, obtaining atomic, synchronous operations from SplitFS, while Application B simultaneously runs in POSIX mode and obtains higher performance. This is possible due to the novel split architecture of SplitFS.
- Portability and stability. SplitFS uses ext4 DAX as its kernel component, so it works with any kernel where ext4 DAX is supported. ext4 DAX is a mature, robust code base that is actively maintained and developed; as ext4 DAX performance increases over time, SplitFS performance increases as well. This is in contrast to research file systems for persistent memory, which do not see development at the same rate as ext4 DAX.
Contents
- splitfs/ contains the source code for SplitFS-POSIX
- dependencies/ contains packages and scripts to resolve dependencies
- kernel/ contains the Linux 4.13.0 kernel
- micro/ contains the microbenchmark
- leveldb/ contains LevelDB source code
- rsync/ contains the rsync source code
- scripts/ contains scripts to compile and run workloads and kernel
- splitfs-so/ contains the SplitFS-strict shared libraries for running different workloads
- sqlite3-trace/ contains SQLite3 source code
- tpcc-sqlite/ contains TPCC source code
- ycsb/ contains YCSB source code
- tar/ contains tar source code
- lmdb/ contains LMDB source code
- filebench/ contains Filebench source code
- fio/ contains FIO source code
The Experiments page has a list of experiments evaluating SplitFS (strict, sync, and POSIX) against ext4 DAX, NOVA-strict, NOVA-relaxed, and PMFS. In summary, SplitFS outperforms the other file systems on data-intensive workloads while incurring a modest overhead on metadata-heavy workloads. Please see the paper for more details.
The kernel patch implementing the relink() system call for Linux v4.13 is here.
System Requirements
- Ubuntu 16.04 / 18.04
- At least 32 GB DRAM
- At least 4 cores
- Baremetal machine (Not a VM)
- Intel processor supporting the clflush instruction (comes with SSE2) or the clflushopt instruction (introduced with the Broadwell processor family). This can be verified with lscpu | grep clflush and lscpu | grep clflushopt respectively.
Dependencies
- kernel: Installing Linux kernel 4.13.0 requires bc, libelf-dev, and libncurses5-dev. For Ubuntu, please run the script
cd dependencies; ./kernel_deps.sh; cd ..
- SplitFS: Compiling SplitFS requires Boost. For Ubuntu, please run
cd dependencies; ./splitfs_deps.sh; cd ..
Limitations
SplitFS is under active development.
- The current implementation of SplitFS handles the following system calls: open, openat, close, read, pread64, write, pwrite64, fsync, unlink, ftruncate, fallocate, stat, fstat, lstat, dup, dup2, execve, and clone. The rest of the calls are passed through to the kernel.
- The current implementation of SplitFS works correctly for the following applications: LevelDB running YCSB, SQLite running TPCC, tar, git, and rsync. This limitation is purely due to the state of the implementation, and we aim to increase the coverage of applications by supporting more system calls in the future.
Applications currently supported
- LevelDB (with YCSB)
- SQLite (running TPCC)
- Redis
- git
- tar
- rsync
- Filebench
- LMDB
- FIO
Testing
The PJD POSIX Test Suite, which primarily tests metadata operations, was run on SplitFS successfully. SplitFS passes all tests.
Running the Test Suite
Before running the tests, make sure you have set up ext4-DAX.
To run tests in all modes:
$ make test
To run tests in a specific mode:
$ make -C tests pjd.<mode>
where <mode> is one of posix, sync, or strict. Example: make -C tests pjd.posix
Tip: Redirect stderr for less verbose output, e.g. make test 2>/dev/null
Implementation Notes
- Only regular files, block special files, and directories (the latter only for consistency guarantees) are handled by SplitFS; other file types are delegated to POSIX.
- Only files in the persistent-memory mount (/mnt/pmem_emul/) are handled by SplitFS; the rest are delegated to POSIX. Currently this check examines only absolute paths; we aim to extend it to relative paths soon.
- We aim to have the persistent-memory mount point controlled via a runtime environment variable soon.
License
Copyright for SplitFS is held by the University of Texas at Austin. Please contact us if you would like to obtain a license to use SplitFS in your commercial product.
Contributors
- Rohan Kadekodi, UT Austin
- Rui Wang, Beijing University of Posts and Telecommunications
- Om Saran
Acknowledgements
We thank the National Science Foundation, VMware, Google, and Facebook for partially funding this project. We thank Intel and ETRI IITP/KEIT[2014-3-00035] for providing access to Optane DC Persistent Memory to perform our experiments.
Contact
Please contact us at [email protected] or [email protected] with any questions.