• Stars
    star
    101
  • Rank 326,065 (Top 7 %)
  • Language
    Java
  • License
    Apache License 2.0
  • Created almost 5 years ago
  • Updated about 2 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

ShiftLeft OverflowDB

Release Maven Central

ShiftLeft OverflowDB

  • in-memory graph database with low memory footprint
  • overflows to disk when running out of heap space: use your entire heap and prevent OutOfMemoryError
  • property graph model, i.e. there are nodes and directed edges, both of which can have properties
  • work with simple classes, rather than abstracting over some generic model
  • enforces strict schema
  • can save/load to/from disk
  • built on JDK11 but with release target=8, i.e. runs on JRE >= 1.8

Table of contents

Core concepts

Memory layout: edges only exist virtually, i.e. they normally don't exist as edge instances on your heap, and they do not have an ID. Instead, edges are helt in the NodeDb.adjacentNodesWithEdgeProperties, which is an Object[], containing direct pointers to the adjacent nodes, as well as potential edge properties. Those edges are grouped by edge label, and there's a helper array NodeDb.edgeOffsets to keep track of those group sizes.
This model has been chosen in order to be memory efficient, and is based on the assumption that most graphs have orders of magnitude more edges than nodes.

Simple classes and schema: all nodes/edges are specific to your domain rather than generic with arbitrary properties. This way we get a strict schema and don't waste memory on Map instances. On the flip side, you need to provide your domain-specific [Node|Edge]Factories to instantiate them. These can be auto-generated though, and we may provide a codegen in future.

Overflow: for maximum throughput and simplicity, OverflowDB is designed to run on the same JVM as your main application. Since the memory requirements of your application will likely vary over time, OverflowDB dynamically adapts to the available memory. I.e. it will allocate instances on the heap while there's still space, but if the heap usage (after a full GC) is above a configurable threashold (e.g. 80%), it will start serializing instances to disk, freeing up some space. This way we can always fully utilize the heap while preventing OutOfMemoryError. OverflowDB applies backpressure to creating new nodes in that case. These mechanisms have practically no overhead while there is enough heap available and the overflow is not required.

Persistence: if you provide a graphLocation when creating the graph, OverflowDB will a) use that file for the on-disk overflow, and b) persist to that location on graph.close(). 'Persisting' is equivalent to simply serializing all nodes to disk, via the normal 'overflow' mechanism.
If the graphLocation file already exists, OverflowDB will initialize all NodeRefs from it. I.e. starting up is fast, but the first queries will be slow, until all required nodes are deserialized from disk. Note that there's no guarantees what happens on jvm crash.

Usage

1) add a dependency - depending on your build tool. Latest release: Maven Central

libraryDependencies += "io.shiftleft" %% "overflowdb-traversal" % "x.y" // sbt
<dependency> <!-- maven -->
  <groupId>io.shiftleft</groupId>
  <artifactId>overflowdb-traversal_2.13</artifactId>
  <version>x.y</version>
</dependency>
implementation 'io.shiftleft:overflowdb-traversal_2.13:x.y' // gradle

Other build tools and versions

2) Implement your domain-specific nodes/edges and factories. It's probably best to follow the example implementations of simple and grateful dead.

3) Create a graph

OdbGraph graph = OdbGraph.open(
  OdbConfig.withoutOverflow(),
  Arrays.asList(Song.factory, Artist.factory),
  Arrays.asList(FollowedBy.factory, SungBy.factory, WrittenBy.factory)
);

// either create some nodes/edges manually
Song song1 = (Song) graph.addVertex(T.label, Song.label, Song.NAME, "Song 1");
Song song2 = (Song) graph.addVertex(T.label, Song.label, Song.NAME, "Song 2");
song1.addEdge(FollowedBy.LABEL, song2);

// or import e.g. a graphml
GraphML.insert("src/test/resources/grateful-dead.xml", graph);

4) Traverse for fun and profit

assertEquals(Long.valueOf(808), graph.traversal().V().count().next());
assertEquals(Long.valueOf(8049), graph.traversal().V().outE().count().next());

Artist garcia = (Artist) graph.traversal().V().has("name", "Garcia").next();
assertEquals("Garcia", garcia.name);
assertEquals(4, __(garcia).in(WrittenBy.LABEL).toList().size());

5) graph.close

For more complete examples, please check out the tests.
To learn more about traversals please refer to the TinkerPop3 documentation.

Configuration: OdbConfig builder

OdbConfig config = OdbConfig.withDefaults()   // overflow is enabled, threshold is 80% of heap (after full GC)
config.disableOverflow // or shorter: OdbConfig.withoutOverflow() 
config.withHeapPercentageThreshold(90)        // set threshold to 90% (after full GC)

// relative or absolute path to storage
// if specified, OverflowDB will persist to that location on `graph.close()`
// to restore from that location, simply instantiate a new graph instance with the same setting 
config.withStorageLocation("path/to/odb.bin") 

Overflow mechanism

Here's a rough sketch of how the overflow mechanism works internally:

+----------+        +--------------+         +-----------------------+
|          |        |   NodeRef    |  free!  |    Node               |
|OdbStorage+--------+              +-------->+                       |
|          |        |String name();|         |String name;           |
|          |        |              |         |Object[] adjacentNodes;|
+----------+        +------+-------+         +-----------------------+
                           ^
                           |
                           |free!
                           |
                   +-------+--------+
                   |ReferenceManager|
                   +-------+--------+
                           ^
                           |
                           |free!
                           |
                     +-----+-----+
                     |HeapMonitor|
                     +-----------+

NodeRef instances have a low memory footprint - they only contain the id and a reference to the graph - and can be freely passed around the application. Nodes in contrast hold all properties, as well as the adjacent edges and their properties. When the available heap is getting low, it is the Node instances that are serialized to disk and collected by the garbage collector. That's why you should never hold a (strong) reference onto them in your main application: it would inhibit the overflow mechanism.

DiffTool util

overflowdb.util.DiffTool.compare(graph1, graph2) allows you to do some very basic comparison of two graphs. It identifies nodes by their ids, and compares their existence, properties, adjacent edges and properties of those edges.

FAQ

  1. What JDK does OverflowDB support? The build targets JDK8, so that's the minimum version. However it is highly encouraged to use a modern JVM, such as JDK20. The build itself requires JDK11+ though.

  2. Why not just use a simple cache instead of the overflow mechanism?
    Regular caches require you have to specify a fixed size. OverflowDB is designed to run in the same JVM as your main application, and since most applications have varying memory needs over time, it would be hard/impossible to achieve our goal use your entire heap and prevent OutOfMemoryError with a regular cache. Besides that, it's very compute-intensive to calculate the size of the cache in megabytes on the heap.

  3. When is the next release coming out?
    Releases happen automatically. Every PR merged to master is automatically released by github actions and tagged in git, using sbt-ci-release-early

More Repositories

1

sast-scan

Scan is a free & Open Source DevSecOps tool for performing static analysis based security testing of your applications and its dependencies. CI and Git friendly.
Python
764
star
2

codepropertygraph

Code Property Graph: specification, query language, and utilities
Scala
432
star
3

traceleft

eBPF based syscalls, files and network events tracing framework
Go
81
star
4

llvm2cpg

LLVM meets Code Property Graphs
C++
80
star
5

tarpit-java

Tarpit - A Web application seeded with vulnerabilities, rootkits, backdoors & data leaks
Java
70
star
6

llvm2graphml

Explore LLVM Bitcode interactively using a graph database
C++
56
star
7

scan-action

51
star
8

tinkergraph-gremlin

Java
38
star
9

fuzzyc2cpg

A fuzzy parser for C/C++ that creates semantic code property graphs
35
star
10

scan-docs

28
star
11

sbt-ci-release-early

Sbt plugin for fully automated releases, without SNAPSHOT and git sha's in the version. A remix of the best ideas from sbt-ci-release and sbt-release-early. For local CI and/or sonatype/maven central.
Scala
20
star
12

SharpSyntaxRewriter

A C# syntax rewriter
C#
18
star
13

gaum

Go
18
star
14

flask-webgoat

flask-webgoat is a deliberately-vulnerable application written with the Flask web framework.
Python
17
star
15

js2cpg

Scala
15
star
16

bctrace

A library for creating hook-based java agents, without dealing with bytecode
Java
12
star
17

shiftleft-scan-vscode

ShiftLeft Scan is a free and open-source commercial-grade security tool for modern DevOps teams.
TypeScript
12
star
18

sql-task-queue

PLpgSQL
10
star
19

HelloShiftLeft

Java
9
star
20

tarpit-c

TARPIT-C : A set of C code snippets seeded with vulnerable conditions
C
8
star
21

cpgqls-client-python

Python
7
star
22

shiftleft-java-demo

Java
6
star
23

shiftleft-python-demo

Python
6
star
24

shiftleft-js-demo

JavaScript
6
star
25

joern-sample-extension

A sample of a standalone extension for Joern/Ocular
Scala
6
star
26

atlassian-connect-go

This repo contains a set of tools you can use to create Jira plugins using the Atlassian Connect framework. It is written in Go.
Go
5
star
27

field-integrations

integration tools and docs
Python
4
star
28

shiftleft-go-demo

Go
4
star
29

ocular-docs

All things ocular related
4
star
30

overflowdb-codegen

Scala
4
star
31

tarpit-python

TARPIT-PYTHON - A WEB APPLICATION SEEDED WITH VULNERABILITIES, ROOTKITS, BACKDOORS AND DATA LEAKS
Python
4
star
32

shiftleft-go-example

Sample go application with ShiftLeft Inspect integration
Go
2
star
33

shiftleft-python-example

Sample python application with ShiftLeft Inspect integration
Python
2
star
34

shiftleft-kotlin-demo

Kotlin
2
star
35

shiftleft-ts-demo

TypeScript
2
star
36

http4k-webgoat

Kotlin
2
star
37

soot

Java
2
star
38

HelloShiftLeft-Mar2021

Java
2
star
39

shiftleft-java-example

Sample Java application with ShiftLeft Inspect integration
Java
2
star
40

x42

LLVM
1
star
41

gather-dependencies-gradle-plugin

Kotlin
1
star
42

tarpit-nodejs

JavaScript
1
star
43

zipdu

zipdu is a webservice implementation vulnerable to zip bombs and directory traversals. Written in multiple different languages
C++
1
star
44

shiftleft-js-example

Sample JavaScript application with ShiftLeft Inspect integration
JavaScript
1
star
45

HelloShiftLeft-Scala

Scala
1
star
46

shiftleft-terraform-demo

HCL
1
star
47

shiftleft-csharp-demo

C#
1
star