
RAM Bench

A benchmark for random memory accesses, with the aim of proving that there is no such thing as an O(1) memory access on modern hardware, but rather that it costs O(√N).

Usage

  • Clone it - git clone https://github.com/emilk/ram_bench.git
  • Compile it - clang++ -std=c++11 -stdlib=libc++ -O3 list_traversal.cpp -o list_traversal (a g++ alternative is noted after this list)
  • Run it - ./list_traversal
  • Generate plots - gnuplot plot.plg
  • Look at them pretty images!
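
Only the clang++ invocation above comes from the repository's instructions; if you use g++ instead, an equivalent invocation along these lines should also work (untested assumption - g++ uses libstdc++ rather than libc++ by default):

    g++ -std=c++11 -O3 list_traversal.cpp -o list_traversal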

The Myth of RAM

If you have studied computing science, then you know how to do complexity analysis. You'll know that iterating through a linked list is O(N), that binary search is O(log(N)), and that a hash table lookup is O(1). What if I told you that all of the above is wrong? What if I told you that iterating through a linked list is actually O(N√N), and that binary searches and hash table lookups are both O(√N)?

Don’t believe me? By the end of this blog post you will. I will show you that accessing memory is not an O(1) operation but an O(√N) one. This is a result that holds up both in theory and in practice. Let’s start with the latter:

Measuring it

Here’s a simple program that iterates through a linked list of length N, ranging from 64 elements up to about 420 million elements. Each node consists of a 64-bit pointer and 64 bits of dummy data. The nodes of the linked list are jumbled around in memory so that each memory access is random. I measure how long it takes to iterate through the same list a few times, and then plot the time taken per element. If random memory access were O(1), we would get a flat plot. This is what we actually get:

[Plot: time per element (ns) vs. number of elements, log-log]

Note that this is a log-log graph, so the differences are actually pretty huge. We go from about one nanosecond per element all the way up to a microsecond! But why? The answer, of course, is caching. The system memory (RAM) is actually pretty slow and far away, so to compensate, clever computer designers have added a hierarchy of faster, closer and more expensive caches to speed things up. My computer has three cache levels, called L1, L2 and L3, of 32 kiB, 256 kiB and 4 MiB respectively. I have 8 GiB of RAM, but when I ran my experiment I only had about 6 GiB free - so in the last runs there was some swapping to disk (an SSD). Here’s the same plot again, but with my RAM and cache sizes added as vertical lines:

[Plot: the same measurements, with L1/L2/L3 cache sizes and total RAM marked as vertical lines]

So that settles it, right? As my memory consumption goes up, I have to rely on slower and slower memory to get the job done, ultimately swapping to disk.
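
The program used above is list_traversal.cpp in this repository. As a rough, self-contained sketch of the same idea - with illustrative names and sizes, not the repository's actual code - it could look something like this:

    // Sketch of a pointer-chasing benchmark: nodes are shuffled in memory so
    // that following `next` produces an unpredictable (random) memory access.
    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>

    struct Node {
        Node*    next;  // 64-bit pointer
        uint64_t dummy; // 64 bits of dummy data
    };

    // Links the nodes of `storage` together in a random order and returns the head.
    Node* make_shuffled_list(std::vector<Node>& storage)
    {
        std::vector<size_t> order(storage.size());
        for (size_t i = 0; i < order.size(); ++i) { order[i] = i; }
        std::shuffle(order.begin(), order.end(), std::mt19937_64(0));
        for (size_t i = 0; i + 1 < order.size(); ++i) {
            storage[order[i]].next = &storage[order[i + 1]];
        }
        storage[order.back()].next = nullptr;
        return &storage[order.front()];
    }

    int main()
    {
        const size_t count = size_t(1) << 20; // a real run sweeps many sizes
        std::vector<Node> storage(count);
        Node* head = make_shuffled_list(storage);

        const auto start = std::chrono::high_resolution_clock::now();
        uint64_t sum = 0;
        for (const Node* n = head; n != nullptr; n = n->next) { sum += n->dummy; }
        const auto stop = std::chrono::high_resolution_clock::now();

        const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
        std::printf("%zu elements: %.2f ns/element (checksum %llu)\n",
                    count, ns / count, (unsigned long long)sum);
    }

Sweeping the element count over many sizes and plotting the nanoseconds per element is what produces the curves shown above.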

Now you may be thinking that this is all trivial. Surely I (or someone richer than me) could purchase enough fast L1-type memory to fit all the data, and that would yield a flat graph. Sadly, 6 GiB of L1 memory would be way more than could fit on a CPU die, so it would have to sit further away, increasing the latency of accessing it. And if my problem grew even more, I would need even more RAM, taking up even more space and requiring it to be even further from my CPU, making it slower still. But how much slower? Let’s do some thinking.

The limit on the speed of accessing a piece of information is the delay in communication, which depends on the distance between the CPU and that information. The limiting factor in a modern computer is the speed of an electrical signal, but the theoretical limit is the speed of light. Within one clock cycle of a 3 GHz CPU, light travels only about a decimeter. So, accounting for the round trip, any memory that is to be accessible within a single cycle can be at most about 5 cm (2 inches) from the CPU.
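
As a quick sanity check of those numbers (speed of light c ≈ 3×10⁸ m/s):

    one clock cycle:   t = 1 / (3×10⁹ Hz) ≈ 0.33 ns
    distance covered:  d = c · t ≈ (3×10⁸ m/s) · (0.33×10⁻⁹ s) ≈ 0.1 m = 1 dm
    round trip:        the memory must sit within d/2 ≈ 5 cm of the CPU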

So how much information can we fit within a certain distance r from the CPU? Intuition tells us the answer should be proportional to r³ (i.e. a sphere of RAM). In practice, computers are actually rather two-dimensional - this is due partly to form factor, but also to problems with cooling. A two-dimensional computer gives us a memory limit of I ∝ r² (a disk of RAM). This also holds for even more distant memory, such as data centers (which are spread out over the two-dimensional surface of the earth). So the practical limit on the amount of information within a radius r is I ∝ r².

But can we, in theory, do better? To answer that, we need to learn a bit about black holes and quantum physics!

The theoretical limit

The amount of information that can fit within a sphere of radius r can be calculated using the Bekenstein bound, which says that it is directly proportional to the radius and to the mass enclosed: I ∝ r·m. So how massive can a sphere get? Well, what is the densest thing in existence? A black hole! It turns out that the mass of a black hole is directly proportional to its radius, m ∝ r. This means that the amount of information that can fit inside a sphere of radius r is I ∝ r². And so we come to the conclusion that the amount of information contained in a sphere is bounded by the area of that sphere - not the volume!
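
For the curious, here is the rough shape of that argument, using the standard proportionalities (a sketch, not a careful derivation):

    Bekenstein bound:          I_max ∝ r · E       (E = energy enclosed within radius r)
    Mass-energy equivalence:   E = m·c²   ⇒   I_max ∝ r · m
    Schwarzschild radius:      r ∝ m               (for a black hole)
    Combined:                  I_max ∝ r · r = r²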

In short: if you try to squeeze too much L1 cache onto your CPU it will eventually collapse into a black hole, and that would make it awkward to get the results of the computation back to the user.

So we come to the conclusion that I ∝ r² is not only the practical limit, but also the theoretical limit! This means the laws of physics set a limit on the latency of memory: to access any of N bits of data, you will need to communicate over a distance proportional to √N, i.e. the access costs O(√N). In other words, for each 100-fold increase in problem size, we expect to see a 10-fold increase in the time it takes to access one element. And that is exactly what our data shows!

[Plot: the same measurements, with a blue O(√N) reference line]

Here the blue line corresponds to an O(√N) cost for each memory access. Are you convinced yet?

Now, to be sure, the reason the measured data so closely follows O(√N) is probably not due to the theoretical limit explained above - we are far away from having RAM as information-dense as a black hole! The theoretical limit is, for now, just that - theoretical - but as it happens the practical results follow it very closely, at least in this data. It’s hard to say exactly why - it could be due to the two-dimensional layout of most computers mentioned above, or to some complex emergent law involving the cost of producing fast memory. Whatever the reason, the results still stand: for practical and theoretical purposes alike, the cost of a random memory access follows O(√N).

Conclusion and implications

The cost of a random memory access grows with the amount of memory you are accessing as O(√N). This means that iterating through a linked list is an O(N√N) operation. A binary search is an O(√N) operation. A hash map lookup is an O(√N) operation. In fact, looking up something at random in any sort of database is an O(√N) operation.
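
The same claims, side by side:

    Operation               Textbook cost   With O(√N) memory accesses
    Linked-list traversal   O(N)            O(N√N)
    Binary search           O(log N)        O(√N)
    Hash-table lookup       O(1)            O(√N)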

Addendum

Note that the above results hold only for random memory accesses. If you access N bits of memory in a predictable fashion (e.g. sequentially), the total cost is still O(N), since the CPU can prefetch the data that is needed ahead of time. A random memory access cannot, by definition, be predicted.
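
To illustrate that difference, here is a small, hedged sketch (again illustrative code, not part of this repository) that times the same summation over the same array twice: once with a sequential, prefetch-friendly index order and once with a shuffled one:

    // Sketch contrasting predictable (sequential) and unpredictable (random)
    // access patterns over the same data.
    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    static uint64_t g_sink = 0; // keeps the sums observable so they are not optimized away

    // Returns the average time (ns) to access data[i] for each i in `indices`.
    static double ns_per_access(const std::vector<uint64_t>& data,
                                const std::vector<size_t>& indices)
    {
        const auto start = std::chrono::high_resolution_clock::now();
        uint64_t sum = 0;
        for (size_t i : indices) { sum += data[i]; }
        const auto stop = std::chrono::high_resolution_clock::now();
        g_sink += sum;
        return std::chrono::duration<double, std::nano>(stop - start).count() / indices.size();
    }

    int main()
    {
        const size_t n = size_t(1) << 25; // 256 MiB of uint64_t - much larger than any cache
        std::vector<uint64_t> data(n, 1);

        std::vector<size_t> indices(n);
        std::iota(indices.begin(), indices.end(), size_t(0));
        std::printf("sequential: %.2f ns/access\n", ns_per_access(data, indices));

        std::shuffle(indices.begin(), indices.end(), std::mt19937_64(0));
        std::printf("random:     %.2f ns/access\n", ns_per_access(data, indices));

        std::printf("(checksum: %llu)\n", (unsigned long long)g_sink);
    }

In the second run the index stream itself is still read sequentially; only the accesses into the data array become unpredictable, which is exactly the kind of access the prefetcher cannot help with.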
