• Stars
    star
    165
  • Rank 228,906 (Top 5 %)
  • Language
    Java
  • License
    Apache License 2.0
  • Created about 9 years ago
  • Updated about 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

An efficient, conscise, and simple implementation of a purely on-disk B+ Tree data structure

build Maven

A Purely On-Disk Implementation of a B+ Tree

After quite a few hours that included developing the thing as well as testing it here is a (fully) functional implementation of a B+ Tree data structure purely on the disk. This was developed mostly for educational reasons that stemmed from the fact that I could not find a B+ Tree implementation that met the following:

  • Purely Disk based implementation
  • Used strict paging which the user could tune (256k, 1K, 4K etc)
  • Was a (Key, Value) storage not just for Keys (B-Tree)
  • Implemented deleting from the data structure
  • Supported duplicate entries (if required)
  • was adequately tested (so that I know it would work)

I came about needing to implement a B+ Tree for a side project of mine... and I was really surprised to see that there were no implementations that met the above requirements; not only that but I was not able to find a reference code/pseudocode that had a working delete implementation and support for duplicate keys. To my dismay even my trusty (and beloved) CLRS didn't include an implementation of a B+ Tree but only for the B-Tree with no duplicates and delete pseudocode (well, one could say that they gave the steps...but that's not the point).

It has to be noted that I could find some B+ Tree implementations that had delete and duplicate key support, mainly in open-source database projects. This unfortunately meant that these implementations were deeply integrated into their parent projects and as a result they were optimized for their particular use-cases. Finally the code bases where significantly larger (hence making the code reading/understanding much harder than it should be!).

So I went about to implement mine (cool stuff, many hours of head scratching were involved!) while also putting effort in creating this "tuned" down version which mainly cuts down on features for simplicity, enhanced code clarity and clutter reduction.

Ease of use features

As I said above this project was done mainly to create a working example of a B+ Tree data structure purely on the disk so the code is well commented (I think) and can be understood easily; that said... we have some "delicacies" that make working and testing this project a bit easier, which are outlined below

  • it uses maven, so most modern IDE's can import it without hassle...
  • it requires jdk v8 (for some parts, change them to have backwards support)
  • it uses a dedicated tester as well as JUnit tests
  • has an interactive menu that the user can individually perform the operations

Implementation details

Insertions

For insertions we use a modified version of the method provided by CLRS with modifications to support duplicate keys (which will be covered below). Complexity is not altered from the usual case and we require one pass down the tree as well.

Searching

This is assumed to be the most common operation on the B+ Tree; which support two modes of searching:

  • Singular Key searches
  • Range Queries

Here two distinct functions are used to cover these two cases:

  • searchKey is used for singular value searches (unique or not)
  • rangeSearch is used for performing range queries

Both of these methods require only one pass over the tree to find the results (if any). Additionally, since we store the keys in a sorted order we can exploit binary search to reduce the total node lookup time significantly. This is done along with a slight modification to the search algorithm to return the lower bound instead of failure.

Deletes

We again use one pass down the tree deletes but this is a bit more complex than the other two operations... it again requires only one pass to delete the key (if found) but as it descends down the tree it performs any redistribution/merging that needs to happen in order to keep our data structure valid.

Handling of duplicate keys

In order to avoid hurting the search performance functionality which is (assumed to be) the most common operation in a B+ Tree data structure the following scheme was implemented to support duplicate keys (at the cost of a bit more page reads).

The tree only actually indexes singular keys but each key has an overflow pointer which leads to its overflow page that has all of the duplicate entries stored as a linked list. If needed, multiple trailing overflow pages per key are created to accommodate for these extra insertions should they exceed the capacity of one page. The downside is that we use a bit more space per page as well as reads in order to read these overflow pages.

Page lookup table

To avoid moving around things too much we keep each page into a free page pool that has all of the available pages so far; this in turn let's us create an index very fast without having to pay costly reads if we wanted to have a clustered tree (although we again use more space, usually).

License

This work, at its current version, is licensed under the Apache 2.0 license.

Final words

Hopefully I'll create a GitHub page for this... where I explain my implementation a bit more but until then this will suffice! Oh and I really hope this implementation is clear and concise enough so that it can make the notions of B+ Trees crystal clear!

More Repositories

1

stanford_dbclass

Collection of my solutions to the (infamous) dbclass (2014 version) offered by Stanford.
39
star
2

federated_pca

Federated Principal Component Analysis Revisited!
MATLAB
36
star
3

grafana-tp-link

Grafana based Power Meter using TP-LINK (HS110) plugs
Shell
35
star
4

moses

Streaming, Memory-Limited, r-truncated SVD Revisited!
MATLAB
21
star
5

federated-capacity-consensus

Introducing: Large Scale Capacity Consensus!
MATLAB
12
star
6

ptucc_compiler

A simple ptuc to C compiler using flex and bison.
Yacc
10
star
7

csv_to_csvs

Conditionally split large csv files to smaller ones with ease
Python
7
star
8

stanford_fin_auto

Collection of my solutions to the finite automata course (2016 version) offered by Stanford.
TeX
6
star
9

quibraries

A thread-safe libraries.io API wrapper
Python
6
star
10

fetch_my_key

Add ssh public keys to authorized_keys file automagically.
Shell
3
star
11

gpurelperf

A handy utility to find relative GPU performance quickly in multi-gpu boxes
Python
3
star
12

py-dspinlock

Distributed Spinlock using Redis
Python
3
star
13

linhash

A purely disk based implementation of a Linear Hash based key-store
Java
2
star
14

dotfiles

Somewhat portable, configurable dotfiles for various operating systems
Shell
2
star
15

mfetch_upstreams

fetch upstream and merge from multiple repositories with ease.
Shell
2
star
16

andylamp.github.io

^c
1
star
17

tetra_anneal

TetraVex Solver using Simulated Annealing
Python
1
star
18

federated-quantized-ratio-consensus

Federated Quantized Capacity Ration Consensus Revisited!
MATLAB
1
star
19

tetravex_dfs

TetraVex solver using Depth First Search (DFS)
1
star
20

pihole-docker

Deploy pi-hole as a docker service on unix-based machines, hassle free!
Shell
1
star
21

c89_parser

A conveniently packaged c89 parser.
Yacc
1
star
22

tuppersketch

bitmap drawings using Jeff Tupper self-referential formula
JavaScript
1
star