Rust labs for Performance Ninja Class

Rust port of the exercises in https://github.com/dendibakh/perf-ninja

You will need to watch the videos at the parent project; they are the course. To do the course in Rust, use this code instead of the parent C++ code.

I recommend reading Denis' free ebook Performance Analysis and Tuning on Modern CPUs as you do the course. Things can get a little confusing otherwise, and the book is excellent in its own right: real, practical performance tuning advice from an expert.

Lab assignments

Core Bound:
Memory Bound:
- Data Packing
- Loop Interchange 1: Rust version does not appear to be memory bound, see the README.
- Loop Interchange 2: Rust version does not appear to be memory bound, see the README.
- Loop Tiling
- SW memory prefetching
- False Sharing
- Huge Pages
Bad Speculation:
Misc:
- Warmup
- LTO: TODO
- PGO: TODO
- Optimize IO

The two Loop Interchange labs do not match their C++ version. They are probably not an accurate port and need changing.

The following two labs match the bottlenecks of their C++ versions (under Clang 14), but those bottlenecks differ from the category the lab is listed under.

Core Bound / Vectorization 1: Try debug mode, which has the correct bottleneck.
Memory Bound / SW memory prefetching: Not memory bound, bottleneck seems to be branch prediction.

Aside from those differences, the Rust code should serve you well in your studies to become a performance ninja!

Setup

You need:

Rust, switched to the nightly toolchain.
The videos from the parent project: https://github.com/dendibakh/perf-ninja
pmu-tools to do the investigation.

Layout

Each lab is a cargo project. The corresponding C++ files are given in brackets.

src/lib.rs: The code you need to optimize (solution.cpp, solution.h, init.cpp)
src/tests.rs: A unit test (validate.cpp) to check your code still works.
benches/bench_<crate>.rs: The benchmark (bench.cpp). This will tell you when you have made src/lib.rs:solution faster.

You will only need to touch the code in lib.rs. The unit test and the benchmark both call that code. The benchmark uses criterion to produce accurate numbers.
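As a rough illustration of how the three files relate, here is a single-file sketch. Everything in it is an assumption made for illustration: the `solution` signature, the `validate` and `bench` helper names, and the use of `std::time::Instant` in place of criterion. The real labs each have their own signatures and use criterion for the timing.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Stand-in for the function in src/lib.rs that a lab asks you to optimize.
/// Hypothetical: here it just sums a slice with a C++-style index loop.
pub fn solution(input: &[u32]) -> u64 {
    let mut sum = 0u64;
    for i in 0..input.len() {
        sum += u64::from(input[i]);
    }
    sum
}

/// Plays the role of src/tests.rs: compare `solution` with a known answer.
pub fn validate() -> bool {
    let input: Vec<u32> = (1..=100).collect();
    solution(&input) == 5050
}

/// Plays the role of the benchmark. It uses plain std timing rather than
/// criterion, so the numbers are far less trustworthy than the real thing.
/// `iterations` must be greater than zero.
pub fn bench(iterations: u32) -> Duration {
    let input: Vec<u32> = (0..10_000).collect();
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the compiler from optimizing the call away.
        black_box(solution(black_box(&input)));
    }
    start.elapsed() / iterations
}
```

Both `validate` and `bench` call `solution` and nothing else, which is why changes to lib.rs alone are enough to move both the test result and the benchmark number.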

Work loop

1. cargo bench: How fast is it now?
2. Improve the code in lib.rs.
3. cargo test --release: Is it still correct?
4. Go to 1.

Better benchmarks

Criterion (which cargo bench is using) does statistical benchmarking, but even with that I get a lot of variance between runs. We can do much better:

Download runperf. This adjusts a bunch of things on Linux to provide repeatable, reliable benchmarks.
Find the benchmark binary. cargo bench builds it as target/release/deps/bench_<crate>_<hash>.
Run it directly: runperf <benchmark_binary> --bench. You should get the same results every time.

Find bottlenecks

The videos often walk through this part. Profile the benchmark binary (in target/release/deps/). Always pass --bench to a Criterion benchmark binary. Also pass --profile-time <seconds>, so that Criterion simply runs the benchmark for that long and skips its statistical analysis, which would otherwise show up in the profile. Use runperf (see above) for reliable results.

Examples:

runperf perf stat ./target/release/deps/bench_<crate>_<hash> --bench --profile-time 5
runperf perf record <binary> --bench --profile-time 5 then perf report -Mintel.
runperf ~/src/pmu-tools/toplev --core S0-C0,S0-C1 -l1 -v --no-desc <binary> --bench --profile-time 5 (then try with -l2 instead of -l1)

Misc / Tips

Optimize Rust for your CPU, and include frame pointers: export RUSTFLAGS="-Ctarget-cpu=native -Cforce-frame-pointers=yes".

Have perf report display the call graph: perf record --call-graph fp <prog>. You need to build with force-frame-pointers (above in RUSTFLAGS).

Show assembly: objdump -Mintel -S -d target/release/deps/bench_vectorization_2 | rustfilt.

rustfilt de-mangles Rust symbols: cargo install rustfilt
-S includes source code in the output

By default perf record uses the cycles event (number of CPU cycles). If you want to dig into a specific event, pass it to perf directly:

Branch misses (bad speculation): runperf perf record --call-graph fp --event=branch-misses:P <prog>
Main memory load (backend bound): --event=cycle_activity.stalls_l3_miss:P (An L3 cache miss means we have to go to main memory)

The :P denotes a Precise Event.

runperf restricts execution to two cores and the toplev command above watches both those cores. The hope is that one core gets the tool (toplev, perf, etc) and the other core gets the program you're testing, and they both run without context switches (Linux tries to avoid moving programs between cores if possible). The downside is that it's not obvious which core your program ran on, and toplev output includes both. To simplify, edit runperf, replace taskset -c 0,1 sudo nice -n -5 runuser -u $USERNAME -- $@ with taskset -c 1 sudo nice -n -5 runuser -u $USERNAME -- $@ (ask taskset to only use core 1) and change the toplev command to --core S0-C1 (only watch Socket 0, Core 1).

Notes on the port

Best effort was made to keep the code as close to the C++ original as possible. That meant resisting iterator chaining, using C++ names (e.g. ClassA), and even sometimes ignoring clippy. The hope is that this makes it easier to follow along with the original videos.
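To make the "resisting iterator chaining" trade-off concrete, here is a small sketch. Both function names are hypothetical and do not come from any lab. The first function is written the way the port tends to be, mirroring a C++ index loop. The second is the iterator chain that idiomatic Rust (and probably clippy) would prefer.

```rust
/// C++-style port: explicit index loop, easy to line up with the original.
/// Clippy would likely suggest iterating over the slice directly here.
pub fn sum_even_squares_loop(values: &[i64]) -> i64 {
    let mut total = 0;
    for i in 0..values.len() {
        if values[i] % 2 == 0 {
            total += values[i] * values[i];
        }
    }
    total
}

/// Idiomatic Rust: the same computation as an iterator chain.
pub fn sum_even_squares_iter(values: &[i64]) -> i64 {
    values
        .iter()
        .filter(|&&v| v % 2 == 0)
        .map(|&v| v * v)
        .sum()
}
```

The two return the same result. The loop form is kept in the labs only because it is easier to match against the C++ shown in the videos.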

Thanks

Thanks to my employer Dropbox for supporting this project during Hack Week 2022.

If this course is useful to you please consider supporting the parent project's Patreon or GitHub Sponsors.

grahamking/perf-ninja-rs
