Rust port of dendibakh/perf-ninja - an online course where you can learn and master the skill of low-level performance analysis and tuning.

Rust labs for Performance Ninja Class

Rust port of the exercises in https://github.com/dendibakh/perf-ninja

The course itself is the set of videos at the parent project, so you will need to watch those. To do the course in Rust, use this code instead of the parent C++ code.

I recommend reading Denis' free ebook Performance Analysis and Tuning on Modern CPUs as you do the course. Things can get a little confusing otherwise, and the book is excellent on its own: real, practical performance tuning advice from an expert.

Lab assignments

The two Loop Interchange labs do not match their C++ versions. They are probably not an accurate port and need changing.

The following two labs match the bottlenecks of their C++ versions (under Clang 14), but those bottlenecks differ from the ones the lab names indicate.

  • Core Bound / Vectorization 1: Try debug mode, which has the correct bottleneck.
  • Memory Bound / SW memory prefetching: Not memory bound, bottleneck seems to be branch prediction.

Aside from those differences, the Rust code should serve you well in your studies to become a performance ninja!

Setup

You need:

Layout

Each lab is a cargo project. In brackets are the mappings to the C++ version.

  • src/lib.rs: The code you need to optimize (solution.cpp, solution.h, init.cpp)
  • src/tests.rs: A unit test (validate.cpp) to check your code still works.
  • benches/bench_<crate>.rs: The benchmark (bench.cpp). This will tell you when you have made src/lib.rs:solution faster.

You will only need to touch the code in lib.rs. The unit test and the benchmark both call that code. The benchmark uses criterion to produce accurate numbers.
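As a rough illustration of how these files relate, the sketch below shows the shape a lab's src/lib.rs might have. Only the name solution comes from the layout above; its signature, the sum-of-squares workload, and the reference helper are assumptions made up for this example and are not taken from any actual lab.

```rust
// Hypothetical stand-in for a lab's src/lib.rs. The workload is invented.

/// The function you would optimize. In a real lab both the unit test and
/// the benchmark call into lib.rs like this.
pub fn solution(input: &[u32]) -> u64 {
    let mut sum = 0u64;
    for &v in input {
        let v = u64::from(v);
        sum += v * v;
    }
    sum
}

/// A plain, obviously correct version. A test can compare `solution`
/// against it after every optimization attempt.
pub fn reference(input: &[u32]) -> u64 {
    input.iter().map(|&v| u64::from(v) * u64::from(v)).sum()
}
```

The point of the split is that you can rewrite solution as aggressively as you like while the comparison against a simple reference tells you whether the result is still correct.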

Work loop

  1. cargo bench: How fast is it now?
  2. Improve the code in lib.rs.
  3. cargo test --release: Is it still correct?
  4. Goto 1.

Better benchmarks

Criterion (which cargo bench is using) does statistical benchmarking, but even with that I get a lot of variance between runs. We can do much better:

  1. Download runperf. This adjusts a bunch of things on Linux to provide repeatable, reliable benchmarks.
  2. Find the benchmark binary. cargo bench builds it as target/release/deps/bench_<crate>_<hash>.
  3. Run it directly: runperf <benchmark_binary> --bench. You should get the same results every time.
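Step 2 can be tedious because target/release/deps/ holds many files. The sketch below filters a list of file names down to likely benchmark executables. It assumes the executable's name starts with bench_<crate> and has no file extension, while side files such as .d dependency files do have one. The function name benchmark_candidates and the sample file names in any test of it are hypothetical.

```rust
use std::path::Path;

/// Pick likely benchmark executables from file names found in
/// target/release/deps. Assumption: the executable starts with
/// `bench_<crate>` and has no extension; side files (for example `.d`
/// files) have one and are skipped.
pub fn benchmark_candidates<'a>(file_names: &[&'a str], crate_name: &str) -> Vec<&'a str> {
    let prefix = format!("bench_{}", crate_name);
    file_names
        .iter()
        .copied()
        .filter(|name| name.starts_with(&prefix) && Path::new(name).extension().is_none())
        .collect()
}
```

If more than one candidate comes back (for example after rebuilding with different RUSTFLAGS), the most recently modified file is usually the one cargo bench just built, but that check is left out here.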

Find bottlenecks

The videos often walk through this part. Profile the benchmark binary (in target/release/deps/). We need to disable criterion's overhead by passing --profile-time <seconds>. We always need to pass --bench to a Criterion benchmark binary. Use runperf (see above) for reliable results.

Examples:

  • runperf perf stat ./target/release/deps/bench_<crate>_<hash> --bench --profile-time 5
  • runperf perf record <binary> --bench --profile-time 5, then perf report -Mintel.
  • runperf ~/src/pmu-tools/toplev --core S0-C0,S0-C1 -l1 -v --no-desc <binary> --bench --profile-time 5 (then try with -l2 instead of -l1)
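All three examples end with the same benchmark arguments. If you script your profiling runs, a small helper can keep them consistent. This is only a sketch: the function names profile_args and perf_stat_command are made up, and the only flags used are the ones shown in the examples above.

```rust
/// Arguments for the benchmark binary itself (everything after its path).
/// `--bench` is always required; `--profile-time` skips Criterion's analysis.
pub fn profile_args(seconds: u32) -> Vec<String> {
    vec![
        "--bench".to_string(),
        "--profile-time".to_string(),
        seconds.to_string(),
    ]
}

/// Full command line for `runperf perf stat <binary> --bench --profile-time <seconds>`.
pub fn perf_stat_command(binary: &str, seconds: u32) -> Vec<String> {
    let mut cmd = vec![
        "runperf".to_string(),
        "perf".to_string(),
        "stat".to_string(),
        binary.to_string(),
    ];
    cmd.extend(profile_args(seconds));
    cmd
}
```

The first element of the returned vector could be handed to std::process::Command::new and the rest to its args method, assuming runperf and perf are on your PATH.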

Misc / Tips

Optimize Rust for your CPU, and include frame pointers: export RUSTFLAGS="-Ctarget-cpu=native -Cforce-frame-pointers=yes".

Have perf report display the call graph: perf record --call-graph fp <prog>. You need to build with force-frame-pointers (above in RUSTFLAGS).

Show assembly: objdump -Mintel -S -d target/release/deps/bench_vectorization_2 | rustfilt.

  • rustfilt de-mangles Rust symbols: cargo install rustfilt
  • -S includes source code in the output

By default perf record uses the cycles event (number of CPU cycles). If you want to dig into a specific event, provide it directly to perf:

  • Branch misses (bad speculation): runperf perf record --call-graph fp --event=branch-misses:P <prog>
  • Main memory load (backend bound): --event=cycle_activity.stalls_l3_miss:P (An L3 cache miss means we have to go to main memory)

The :P denotes a Precise Event.

runperf restricts execution to two cores and the toplev command above watches both those cores. The hope is that one core gets the tool (toplev, perf, etc) and the other core gets the program you're testing, and they both run without context switches (Linux tries to avoid moving programs between cores if possible). The downside is that it's not obvious which core your program ran on, and toplev output includes both.

To simplify, edit runperf: replace taskset -c 0,1 sudo nice -n -5 runuser -u $USERNAME -- $@ with taskset -c 1 sudo nice -n -5 runuser -u $USERNAME -- $@ (ask taskset to only use core 1), and change the toplev command to --core S0-C1 (only watch Socket 0, Core 1).

Notes on the port

Best effort was made to keep the code as close to the C++ original as possible. That meant resisting iterator chaining, using C++ names (e.g. ClassA), and even sometimes ignoring clippy. The hope is that this makes it easier to follow along with the original videos.
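To make the trade-off concrete, here is a small sketch of the two styles. It is not code from any lab; the dot-product workload and both function names are invented for illustration. The first function is the index-loop style that mirrors C++ (and that clippy's needless_range_loop lint may flag), the second is the iterator-chaining style the port avoids.

```rust
/// Index loop, written to mirror how the C++ original would read.
pub fn dot_indexed(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let mut sum = 0.0f32;
    for i in 0..n {
        sum += a[i] * b[i];
    }
    sum
}

/// Iterator chaining, the form idiomatic Rust would usually reach for.
/// `zip` stops at the shorter slice, matching the `min` above.
pub fn dot_chained(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}
```

Both compute the same value; the indexed form is simply easier to line up against the C++ shown in the videos.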

Thanks

Thanks to my employer Dropbox for supporting this project during Hack Week 2022.

If this course is useful to you please consider supporting the parent project's Patreon or GitHub Sponsors.

License

Original problems and ideas Copyright © 2021 by Denis Bakhvalov under Creative Commons license (CC BY 4.0). Rust port Copyright © 2022 by Graham King under Creative Commons license (CC BY 4.0).
