OkayWAL

A write-ahead log (WAL) implementation for Rust.

There's The Great Wall, and then there's this: an okay WAL.

WARNING: This crate is early in development. Please do not use in any production projects until this has been incorporated into Sediment and shipping as part of Nebari. The file format is currently considered unstable.

This crate exposes a WAL that supports:

Atomic and Durable writes from multiple threads.
Random access for previously written data.
Automatic checkpointing to allow reusing disk space and preventing the WAL from growing too large.
Interactive recovery process with basic data versioning support.

Basic How-To

WriteAheadLog::recover() is used to create or recover a WAL in a given directory. To open a log, an implementer of LogManager must be provided. This trait is how OkayWAL communicates with your code when recovering or checkpointing a log.

The basic example shows this process with many comments describing how OkayWAL works.

// Open a log using a Checkpointer that echoes the information passed into each
// function that the Checkpointer trait defines.
let log = WriteAheadLog::recover("my-log", LoggingCheckpointer)?;

// Begin writing an entry to the log.
let mut writer = log.begin_entry()?;

// Each entry is one or more chunks of data. Each chunk can be individually
// addressed using its LogPosition.
let record = writer.write_chunk("this is the first entry".as_bytes())?;

// To fully flush all written bytes to disk and make the new entry
// resilient to a crash, the writer must be committed.
writer.commit()?;

Multi-Threaded Writing

Optimized writing to the log from multiple threads is handled automatically. Only one thread may access the active log file at any moment in time. Because the slowest part of writing data to disk is fsync, OkayWAL manages synchronizing multiple writers such that a single fsync call can be made for multiple writes.

This can be demonstrated by running the benchmark suite: cargo bench -p benchmarks:

commit-256B

Label	avg	min	max	stddev	out%
okaywal-01t	1.001ms	617.5us	7.924ms	557.3us	0.016%
okaywal-02t	1.705ms	617.3us	11.38ms	912.1us	0.006%
okaywal-04t	1.681ms	622.4us	9.688ms	671.4us	0.021%
okaywal-08t	1.805ms	656.5us	13.88ms	1.001ms	0.014%
okaywal-16t	1.741ms	643.2us	7.895ms	796.4us	0.028%

commit-1KB

Label	avg	min	max	stddev	out%
okaywal-01t	959.3us	621.9us	7.419ms	584.4us	0.012%
okaywal-02t	1.569ms	627.5us	7.986ms	1.007ms	0.028%
okaywal-04t	1.856ms	650.5us	11.14ms	1.087ms	0.017%
okaywal-08t	2.054ms	697.3us	11.04ms	1.066ms	0.021%
okaywal-16t	1.875ms	641.5us	8.193ms	674.6us	0.032%

commit-4KB

Label	avg	min	max	stddev	out%
okaywal-01t	1.242ms	748.8us	6.902ms	982.4us	0.008%
okaywal-02t	1.767ms	761.9us	8.986ms	902.1us	0.016%
okaywal-04t	2.347ms	787.1us	8.853ms	1.084ms	0.016%
okaywal-08t	2.798ms	810.8us	12.53ms	1.168ms	0.014%
okaywal-16t	2.151ms	840.5us	14.74ms	1.201ms	0.008%

commit-1MB

Label	avg	min	max	stddev	out%
okaywal-01t	7.018ms	5.601ms	9.865ms	788.2us	0.027%
okaywal-02t	11.06ms	4.281ms	20.14ms	3.521ms	0.000%
okaywal-04t	19.77ms	5.094ms	73.21ms	8.794ms	0.007%
okaywal-08t	25.06ms	2.871ms	97.60ms	17.33ms	0.002%
okaywal-16t	19.01ms	3.480ms	58.85ms	7.195ms	0.014%

These numbers are the time taken for a single thread to perform an atomic and durable write of a given size to the log file. Despite using a single-file approach, we are able to keep average write times very low even with a large number of simultaneous writers.

How OkayWAL works

OkayWAL streams incoming data into "segments". Each segment file is named with the format wal-{id}. The id of a segment file refers to the first EntryId that could appear within the segment file.

Segment files are pre-allocated to the length configured in Configuration::preallocate_bytes. Preallocating files is critical for performance, as overwriting existing bytes in general is less expensive than allocating new bytes on disk.

OkayWAL always has a current segment file. When a new entry is written, it always goes to the current segment file. When an entry is completed, the length of the segment file is checked against Configuration::checkpoint_after_bytes. If enough data has been written to trigger a checkpoint, the file is sent to the checkpointing thread and a new segment file is activated.

Regardless of whether the file is checkpointed, before control returns from committing an entry, the file is fsynced. fsync operations are batched, allowing multiple entries to be written by separate threads during the same fsync operation.

OkayWAL also keeps track of any time a new file is created or a file is renamed. As needed, the directory containing the write-ahead logs is also fsynced to ensure necessary file and directory metadata is fully synchronized. Just like file fsync batching, OkayWAL also automatically batches directory fsyncs across threads.

Checkpointing a segment file (Background Thread)

The checkpointing thread holds a weak reference to the WriteAheadLog data. When a file is received by the thread to checkpoint, it will upgrade the weak reference. If it cannot, the checkpointing thread shuts down gracefully and the recovery process will send the file again for checkpointing the next time the log is opened.

The thread invokes LogManager::checkpoint_to for the file, allowing the LogManager to make any needed changes to persist the data stored in the segment being checkpointed.

After the LogManager finishes, the file is renamed to include -cp as its suffix. Until this step, readers are able to be opened against data stored in the file being checkpointed. Once the file is renamed, new readers will begin returning not found errors.

After the file is renamed, the checkpointer waits for all outstanding readers to finish reading data. The file is then finally recycled by moving it to the inactive files list.

Activating a new segment file

If there are any files in the inactive files list, one is reused. Otherwise, a new file is created and filled with 0's to the configured preallocation length.

The file's name is set to wal-{next EntryId}. For example, a brand new write-ahead log's first segment file will be named wal-1, and the first EntryId written will be 1.

Segment File Format

Each segment file starts with this header:

okw: Three byte magic code
OkayWAL Version: Single byte version number. Currently 0.
Configuration::version_info length: Single byte. The embedded information must be 255 or less bytes long.
Embedded Version Info: The bytes of the version info. The previous byte controls how many bytes long this field is.

After this header, the file is a series of entries, each which contain a series of chunks. A byte with a value of 1 signifies a new entry. Any other byte causes the reader to stop reading entries from the file.

The first 8 bytes of the entry are the little-endian representation of its EntryId.

After the EntryId, a series of chunks is expected. A byte with a value of 2 signals that a chunk is next in the file. A byte with a value of 3 signals that this is the end of the current entry being written. Any other byte causes the SegmentReader to return an AbortedEntry result. Any already-read chunks from this entry should be ignored/rolled back by the LogManager.

The first four bytes of a chunk are the data length in little-endian representation. The data for the chunk follows.

Finally, a four-byte CRC-32 ends the chunk.

If a reader does not encounter a new chunk marker (2) or an end-of-entry marker (3), the entry should be considered abandoned and all chunks should be ignored.

Open-source Licenses

This project, like all projects from Khonsu Labs, are open-source. This repository is available under the MIT License or the Apache License 2.0.

To learn more about contributing, please see CONTRIBUTING.md.

khonsulabs/okaywal

khonsulabs

Reviews

Repository Details