Performance writing to GPIO with CPU and DMA on the Raspberry Pi

GPIO Speed using CPU and DMA on the Raspberry Pi

Experiments to measure speed of various ways to output data to GPIO. Also convenient code snippets to help get you started with GPIO.

I provide the code in gpio-dma-test.c to the public domain. If you use DMA you need the mailbox implementation; for that note the Broadcom copyright header with permissive license in mailbox.h.

You can compile this for the Raspberry Pi 1, for the Pi 2 and 3, or for the Pi 4 by passing the PI_VERSION variable when compiling:

 PI_VERSION=1 make
 PI_VERSION=2 make  # works for Pi 2 and 3
 PI_VERSION=4 make  # works for Pi 4

The resulting program gives you a set of 6 experiments to conduct. By default, it toggles GPIO 14 (which is pin 8 on the Raspberry Pi header).

Usage ./gpio-dma-test [1...6]
Give number of test operation as argument to ./gpio-dma-test
Test operation
== Baseline tests, using CPU directly ==
1 - CPU: Writing to GPIO directly in tight loop
2 - CPU: reading word from memory, write masked to GPIO set/clr.
3 - CPU: reading prepared set/clr from memory, write to GPIO.
4 - CPU: reading prepared set/clr from UNCACHED memory, write to GPIO.

== DMA tests, using DMA to pump data to ==
5 - DMA: Single control block per set/reset GPIO
6 - DMA: Sending a sequence of set/clear with one DMA control block and negative destination stride.

To understand the details, you want to read BCM2835 ARM Peripherals, an excellent datasheet to get started (if you are the datasheet-reading kinda person).

Measurements

In these experiments, we want to see how fast things can go, so we do a very simple operation in which we toggle the output of a pin (see TOGGLE_GPIO in the source). In real applications, the data would certainly be slightly more useful :)

The pictures in these experiments show the output waveform on the given pin for the various Raspberry Pis. These are screenshots straight from an oscilloscope; the time base is 100ns for all measurements so that they are easy to compare.

All measurements were done on unmodified Pis at their respective default clock speeds with a default minimal Raspbian operating system.

TODO: For the Pi4, there are no images yet, just some preliminary measurements. Stay tuned.

Writing from CPU to GPIO

The most common way to get data out on the GPIO port is to have the CPU send the data. Let's measure how the Pis perform here.

Direct Output Loop to GPIO

sudo ./gpio-dma-test 1

In this simplest way to control the output, we essentially just write to the GPIO set and clear registers in a tight loop:

// Pseudocode
for (;;) {
    *gpio_set_register = (1<<TOGGLE_PIN);
    *gpio_clr_register = (1<<TOGGLE_PIN);
}
Result

The resulting output wave reaches 22.7MHz on the Raspberry Pi 1, 41.7MHz on the Raspberry Pi 2 and 65.8MHz on the Raspberry Pi 3.

[Scope screenshots for Raspberry Pi 1, Raspberry Pi 2 and Raspberry Pi 3 (images not included here). Raspberry Pi 4: no image yet, about 131MHz.]

At the 100ns time base, the limited resolution of the scope did not read the frequency correctly for the Pi 3 (so it only shows 58.8MHz above), but if we zoom in, we see the 65.8MHz.
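To relate these frequencies to the 100ns time base, it can help to convert them into time per cycle. The following is a minimal sketch, not code from gpio-dma-test.c; the helper names period_ns and ns_per_write are made up for illustration, and the per-write figure assumes that the set and clr writes take equally long.

```c
#include <assert.h>

// Duration of one full set/clear cycle in nanoseconds,
// given the measured output frequency in MHz.
static double period_ns(double freq_mhz) {
    assert(freq_mhz > 0.0);
    return 1000.0 / freq_mhz;
}

// Rough time budget per register write. One output cycle consists of
// one write to the set register and one to the clr register.
// ASSUMPTION: both writes take the same time.
static double ns_per_write(double freq_mhz) {
    return period_ns(freq_mhz) / 2.0;
}
```

With the numbers measured above, one cycle takes roughly 44ns on the Pi 1, 24ns on the Pi 2 and 15ns on the Pi 3, so even the slowest of the three fits two full cycles into a 100ns span.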

Reading Word from memory, write masked set/clr

sudo ./gpio-dma-test 2

This is probably the most common way you'd send data to GPIO: you have an array of 32-bit words representing the bits to be written to GPIO, and a mask that defines which bits are relevant in your application.

// Pseudocode
uint32_t data[256];         // Words to be written to GPIO

const uint32_t mask = ...;  // The GPIO pins used in the program.
const uint32_t *start = data;
const uint32_t *end = start + 256;
for (const uint32_t *it = start; it < end; ++it) {
    if (( *it & mask) != 0) *gpio_set_register =  *it & mask;
    if ((~*it & mask) != 0) *gpio_clr_register = ~*it & mask;
}
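The masking logic can be tried out without a Pi by running the same loop against stand-in registers. This is only a sketch under assumptions: struct FakeGpio and the fake_gpio_* helpers are hypothetical and merely model how writes to the set and clr registers change pin levels; they are not part of gpio-dma-test.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for the GPIO block: models only the output levels.
struct FakeGpio {
    uint32_t level;   // current output level of pins 0..31
    unsigned writes;  // number of register writes issued so far
};

// Writing 1-bits to the set register drives those pins high.
static void fake_gpio_set(struct FakeGpio *g, uint32_t bits) {
    g->level |= bits;
    g->writes++;
}

// Writing 1-bits to the clr register drives those pins low.
static void fake_gpio_clr(struct FakeGpio *g, uint32_t bits) {
    g->level &= ~bits;
    g->writes++;
}

// Same logic as the pseudocode loop: only pins inside 'mask' are touched,
// and a register write is skipped when it would not change anything.
static void write_masked(struct FakeGpio *g, const uint32_t *data,
                         size_t n, uint32_t mask) {
    assert(g != NULL && data != NULL);
    for (const uint32_t *it = data; it < data + n; ++it) {
        if (( *it & mask) != 0) fake_gpio_set(g,  *it & mask);
        if ((~*it & mask) != 0) fake_gpio_clr(g, ~*it & mask);
    }
}
```

Pins outside the mask keep their level no matter what the data words contain, which is the point of masking when other parts of the system use the remaining pins.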
Result

Raspberry Pi 2 and Pi 3 are unimpressed and output at the same speed as when writing directly; Raspberry Pi 1 takes a performance hit and drops to 14.7MHz:

[Scope screenshots for Raspberry Pi 1, Raspberry Pi 2 and Raspberry Pi 3 (images not included here). Raspberry Pi 4: no image yet, about 131MHz.]

Reading prepared set/clr from memory

sudo ./gpio-dma-test 3

This is a somewhat more unusual way to prepare and write data: break out the set and clr bits beforehand and store them in memory before writing them to GPIO. It uses twice as much memory per operation. It does help the Raspberry Pi 1 to be as fast as possible when writing from memory, though, while there is no additional advantage for the Raspberry Pi 2 or 3.

Primarily, this is a good preparation to understand the way we have to send data with DMA.

// Pseudocode
struct GPIOData {
   uint32_t set;
   uint32_t clr;
};
struct GPIOData data[256];  // Preprocessed set/clr to be written to GPIO

const struct GPIOData *start = data;
const struct GPIOData *end = start + 256;
for (const struct GPIOData *it = start; it < end; ++it) {
    *gpio_set_register = it->set;
    *gpio_clr_register = it->clr;
}
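The preparation step itself is not shown above. A minimal sketch of how the set/clr pairs could be derived from plain data words follows; prepare_gpio_data is a hypothetical helper name, not a function from gpio-dma-test.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct GPIOData {
    uint32_t set;
    uint32_t clr;
};

// Hypothetical helper: split each data word into the bits to set and the
// bits to clear, restricted to the pins selected by 'mask'.
static void prepare_gpio_data(const uint32_t *words, size_t n,
                              uint32_t mask, struct GPIOData *out) {
    assert(words != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i].set =  words[i] & mask;  // 1-bits in the data: drive high
        out[i].clr = ~words[i] & mask;  // 0-bits in the data: drive low
    }
}
```

Because writing zero to the set or clr register leaves all pins unchanged, the output loop can write both prepared values unconditionally, without the comparisons needed in the masked variant.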
Result

The Raspberry Pi 2 and Pi 3 have the same high speed as in the previous examples, but Raspberry Pi 1 can digest the prepared data faster and gets up to 20.8MHz out of this (compared to the 14.7MHz we got with masked writing):

[Scope screenshots for Raspberry Pi 1, Raspberry Pi 2 and Raspberry Pi 3 (images not included here). Raspberry Pi 4: no image yet, about 83MHz.]

Reading prepared set/clr from UNCACHED memory

sudo ./gpio-dma-test 4

This next example is not useful in real life, but it helps to better understand the performance impact of accessing memory that does not go through a cache (L1 or L2).

The DMA subsystem, which we are going to explore in the next examples, has to read from physical memory, as it cannot use the caches (or can it? Somewhere I read that it can at least make use of the L2 cache?).

The example is the same as before: reading pre-processed set/clr values from memory and writing them to GPIO. Only the type of memory is different.

Result

The speed is significantly reduced: it is very slow to read from uncached memory (a testament to how fast CPUs are these days, or how slow DRAM actually is).

One interesting finding is that the Raspberry Pi 2 and Pi 3 are both significantly slower than the Raspberry Pi 1. Maybe the makers relied more on the various caches and chose to equip the machine with slower memory to keep the price down while increasing the amount of memory? At least the Pi 3 is faster than the Pi 2, so the relative order there is preserved.

[Scope screenshots for Raspberry Pi 1, Raspberry Pi 2 and Raspberry Pi 3 (images not included here). Raspberry Pi 4: no image yet, about 2.7MHz.]

Using DMA to write to GPIO

Using the Direct Memory Access (DMA) subsystem allows us to free the CPU and let independently running hardware do the job.

In various projects that use DMA and GPIO on the Raspberry Pi, DMA is used in conjunction with the PWM or PCM hardware to create slower-paced output with very reliable timing. Examples are PiBits by richardghirst or the icrobotics PiFM transmitter.

In our example, by contrast, we want to measure the raw speed that is possible using DMA (which is not very impressive as we'll see).

In order to use DMA, the DMA controller needs the actual memory bus address, as it can't deal with virtual memory (which also means that the memory needs to be in physical memory and can't be swapped out). There are various ways to allocate that memory and do the mapping, but a reliable way for all Pis seems to be the /dev/vcio interface provided by the Pi kernel; we are using a mailbox implementation provided in a raspberrypi/userland FFT example. We have an abstraction around that called UncachedMemBlock in gpio-dma-test.c.

The DMA channel we are using in these examples is channel 5, as it is usually free, but you can configure that in the source. It cannot be a Lite channel, as we need DMA 2D features for both examples.

DMA: using one Control Block per GPIO operation

sudo ./gpio-dma-test 5

With DMA, we can't do any data manipulation operations (such as masking) at the time the data is written, so just like in the last CPU example, we have to prepare the source data as separate set and clear operations:

struct GPIOData {
   uint32_t set;
   uint32_t clr;
};

The output needs to be written to the GPIO registers. Unfortunately, we can't just do a plain 1:1 memory copy from the source data to the destination registers, as the layout is a bit different than our input data: the set and clr registers are a few bytes apart, so there is a gap between the two write operations:

Addr-offset  GPIO Register         Width              Operation
0x1c         set (lower 32 bits)   32 bits = 4 bytes  <- write here
0x20         unused upper bits     32 bits = 4 bytes  skip
0x24         (reserved)            32 bits = 4 bytes  skip
0x28         clr (lower 32 bits)   32 bits = 4 bytes  <- .. and here

So what we are doing is to set up a DMA Control Block with a 2D operation:

  • Read single GPIO Data block as two read operations of 4 bytes, no stride between reads.
  • Write these as two 4 byte blocks, starting from the origin of the GPIO register block, 0x1c, with a destination stride of 8 to skip the intermediate registers we are not interested in.
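The effect of the destination stride can be checked with a small software model of a 2D copy. This is only an illustration of the addressing, not how the hardware is programmed: model_2d_copy is a hypothetical function, the "register block" is an ordinary byte array, and the model assumes that the stride is added on top of the bytes already transferred in a row.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    GPIO_SET_OFFSET = 0x1c,  // offset of the set register (see table above)
    GPIO_CLR_OFFSET = 0x28,  // offset of the clr register
    FAKE_REG_BYTES  = 0x30   // size of the fake register block in this model
};

// Software model of a 2D transfer: copy 'rows' rows of 'row_bytes' each.
// After every row, the respective stride is added to the source and
// destination offsets in addition to the bytes just copied.
static void model_2d_copy(uint8_t *dst, long dst_off,
                          const uint8_t *src, long src_off,
                          int rows, int row_bytes,
                          int src_stride, int dst_stride) {
    assert(dst != NULL && src != NULL && row_bytes > 0);
    for (int y = 0; y < rows; ++y) {
        memcpy(dst + dst_off, src + src_off, (size_t)row_bytes);
        src_off += row_bytes + src_stride;
        dst_off += row_bytes + dst_stride;
    }
}
```

With two rows of 4 bytes, a source stride of 0 and a destination stride of 8, the first word lands at offset 0x1c and the second at 0x1c + 4 + 8 = 0x28, which skips exactly the two registers we are not interested in.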

We set up the control block's 'next' pointer to point to itself, so the DMA controller goes in an endless loop. No CPU needed, yay :)

(If you are wondering what "DMA 2D" operations means, this is not the place to explain the details, but there is an old embedded article to get you started on this standard feature of many DMA controllers. Please also look at the code which contains some documentation.)

One thing to note is that we can only set up a single output operation in this manner. Once we have written to the registers, the destination pointer is at the end of the relevant register block, and the only way to go back to the beginning is to start with a fresh control block that has the starting address set to the beginning of the register block. We'll address that in the next example.

This is incredibly inefficient in its use of memory, in particular if you need to send more than just a few blocks. We need 40 bytes per output operation (8 bytes GPIO data + 32 bytes control block). Note that all this memory needs to be locked into RAM.
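The 32 bytes come from the control block consisting of eight 32-bit words. A sketch of such a layout is shown below; the field names are my own shorthand for the words described in the BCM2835 datasheet and may differ from the names used in gpio-dma-test.c.

```c
#include <assert.h>
#include <stdint.h>

// Sketch of a DMA control block: eight 32-bit words.
// ASSUMPTION: field names are illustrative, not taken from the source.
struct DmaControlBlock {
    uint32_t transfer_info;    // flags such as increment and 2D mode
    uint32_t source_addr;      // bus address to read from
    uint32_t dest_addr;        // bus address to write to
    uint32_t transfer_length;  // row length and row count in 2D mode
    uint32_t stride;           // source and destination stride in 2D mode
    uint32_t next_block;       // bus address of the next control block
    uint32_t reserved[2];
};

struct GPIOData {
    uint32_t set;
    uint32_t clr;
};

// Memory that has to stay locked in RAM for 'n_ops' output operations
// when every operation gets its own control block.
static unsigned long bytes_locked(unsigned long n_ops) {
    return n_ops * (unsigned long)(sizeof(struct DmaControlBlock) +
                                   sizeof(struct GPIOData));
}
```

For a buffer of 256 output operations this already adds up to 10240 bytes, of which only 2048 bytes are GPIO data.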

Result

First thing we notice is how slow things are in comparison to the write from the CPU. As found out in the uncached CPU example, we see the influence of slower memory in the Raspberry Pi 2 and Pi 3 here as well. The latter two show exactly the same speed.

The live scope shows that the output has quite a bit of jitter, so DMA alone will not give you very reliable timing; you always have to combine it with PWM/PCM gating if you need this in a realtime context.

Another interesting finding is the asymmetry between the set and clr times. It takes about 100ns after the set operation until the clear operation arrives, but the low phase then lasts much longer. This is probably due to some extra time needed when switching between control blocks (even though the 'next' control block is exactly the same):

[Scope screenshots for Raspberry Pi 1, Raspberry Pi 2 and Raspberry Pi 3 (images not included here). Raspberry Pi 4: no image yet, about 1.82MHz.]

DMA: multiple GPIO operations per Control Block

sudo ./gpio-dma-test 6

One of the obvious downsides of the previous example is that we have to set up one DMA control block for each write operation, which is a lot of memory overhead. Does it also hurt performance?

If we set up the input data in a way that it has the same layout as the output registers, we could use the stride operation on the destination side to go back to the beginning after each write and so can do many write operations with a single control block setup.

// Input data has same layout as the output registers.
struct GPIOData {
    uint32_t set;
    uint32_t ignored_upper_set_bits; // bits 32..53 of GPIO. Not needed.
    uint32_t reserved_area;          // gap between GPIO registers.
    uint32_t clr;
};

Of course, this means that we are writing to 'reserved' places in the GPIO registers. The first 4 bytes after the 'set' register are benign, as these are essentially upper bits of GPIO bits - if we write 0 to these it should be fine. Slightly problematic could be the next block of 4 bytes, as it is 'reserved' according to the data sheet. We are writing zeros in here and hope for the best that this is not doing any harm (it does seem to be fine :) ).

So in this example the 2D DMA operation is set up the following way:

  • Read n GPIO blocks of size 16 bytes, no stride between reads.
  • Write each block to the GPIO registers and stride backwards 16 bytes so that at the end of that operation, we are back at the beginning of the register block.
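The backwards stride can be sketched in two small helpers. Both are hypothetical and not taken from gpio-dma-test.c. The packing of the stride word assumes the layout as I read it in the BCM2835 datasheet (destination stride in the upper 16 bits, source stride in the lower 16 bits, both signed two's complement); please verify this against the datasheet before relying on it.

```c
#include <assert.h>
#include <stdint.h>

// Pack destination and source stride into one 32-bit stride word.
// ASSUMPTION: destination stride in bits 31..16, source stride in
// bits 15..0, both as signed 16-bit two's complement values.
static uint32_t pack_stride(int16_t dst_stride, int16_t src_stride) {
    return ((uint32_t)(uint16_t)dst_stride << 16) | (uint16_t)src_stride;
}

// Destination offset at which row number 'row' starts, if each row
// advances by its own length plus the destination stride.
static long dest_offset_of_row(long start, long row,
                               long row_bytes, long dst_stride) {
    return start + row * (row_bytes + dst_stride);
}
```

With 16-byte rows and a destination stride of -16, every row starts at the same offset 0x1c again, whereas the 4-byte rows with a stride of 8 from the previous example move on from 0x1c to 0x28.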

Now, we only need one control-block per n operations, but each operation takes 16 bytes to store in memory. Amortized still better than in the previous case.
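The trade-off between the two DMA approaches can be put into numbers. The following sketch only does the arithmetic from the text (32-byte control block, 8 bytes per set/clr pair, 16 bytes per register-layout block); the function names are made up for illustration.

```c
#include <assert.h>

enum {
    CONTROL_BLOCK_BYTES   = 32,  // one DMA control block
    SET_CLR_PAIR_BYTES    = 8,   // GPIOData with set and clr only
    REGISTER_LAYOUT_BYTES = 16   // GPIOData mirroring the register layout
};

// One control block per output operation (previous example).
static unsigned long bytes_cb_per_op(unsigned long n_ops) {
    return n_ops * (CONTROL_BLOCK_BYTES + SET_CLR_PAIR_BYTES);
}

// One control block for all n output operations (this example).
static unsigned long bytes_single_cb(unsigned long n_ops) {
    return CONTROL_BLOCK_BYTES + n_ops * REGISTER_LAYOUT_BYTES;
}
```

For a single operation the register-layout variant is larger (48 bytes versus 40 bytes), but from two operations onwards it needs less memory: 64 versus 80 bytes for n = 2, and 4128 versus 10240 bytes for n = 256.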

As usual, if we want to do that endlessly, we can link that control block back to itself.

Result

Again, the Raspberry Pi 2 is slightly slower than the Raspberry Pi 1. In general, this method is even slower than one control block per data item. And again, Raspberry Pi 3 shows the same speed as Raspberry Pi 2.

Similar to the previous example, the output has quite some jitter.

Interestingly, the positive pulse is shorter (about 50ns) than in the previous example. This suggests that writing the data in sequence with 8 dead bytes is faster than the 8-byte stride skip that we had in the previous example. It also means that the DMA probably has a small cache for the 16-byte block, as it emits that part faster than it can read from the uncached memory.

Now the 'low' part of the pulse is even longer than before, apparently the minus 16 Byte stride takes its sweet time even though we don't switch between control blocks:

[Scope screenshots for Raspberry Pi 1, Raspberry Pi 2 and Raspberry Pi 3 (images not included here). Raspberry Pi 4: no image yet, about 1.54MHz.]

Conclusions

  • On output via direct write from the CPU, Raspberry Pi 2 maintains the same impressive speed of 41.7MHz regardless of whether the data is written directly from code or read from memory. The Raspberry Pi 3 even reaches 65.8MHz.
  • Raspberry Pi 1 is in the 20MHz range for direct and prepared output and slightly slower if it has to do the mask operation first.
  • DMA is slow because it has to read from uncached memory. It only makes sense if you want to output data at a slower pace or really need to relieve the CPU from continuous updates. (Is that it, or can DMA be faster? If you know how, please let me know.)
  • Using a single control block per output operation is slightly faster than doing multiple operations per control block, but is very inefficient in its use of memory (10x the actual payload).
  • Using the stride in 2D DMA seems to be slower than actually writing the same number of bytes?
  • The stride operation seems to take extra time, just like going on to the next DMA control block does.
  • DMA on Raspberry Pi 2 and 3 is slightly slower than on Raspberry Pi 1, maybe because the DRAM is slower?
