• This repository has been archived on 09/Dec/2018

How to: Run Rust code on your NVIDIA GPU

Status

This documentation about an unstable feature is UNMAINTAINED and was written over a year ago. Things may have drastically changed since then; read this at your own risk! If you are interested in modern Rust on GPU development check out https://github.com/rust-cuda/wg

-- @japaric, 2018-12-08


nvptx

First steps

Since 2016-12-31, rustc can compile Rust code to PTX (Parallel Thread Execution) code, which is like GPU assembly, via --emit=asm and the right --target argument. This PTX code can then be loaded and executed on a GPU.

However, a few days later 128-bit integer support landed in rustc and broke compilation of the core crate for NVPTX targets (it triggered LLVM assertions). Furthermore, there was no nightly release between these two events, so it was not possible to use the NVPTX backend with a nightly compiler.

Just recently (2017-05-18) I realized (thanks to this blog post) that we can work around the problem by compiling a fork of the core crate that contains no code involving 128-bit integers. That's a bit unfortunate but, hey, if it works then it works.

Targets

The required targets are not built into the compiler (they are not in rustc --print target-list) but are available as JSON files in this repository, for example nvptx64-nvidia-cuda.json, which the steps below use.

If the host is running a 64-bit OS, you should use the nvptx64 target. Otherwise, use the nvptx target.
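
The rule above can be written down as a tiny host-side helper. This is only a sketch: the function name nvptx_target_for is made up for illustration, and the 32-bit target name nvptx-nvidia-cuda is an assumption modeled on the 64-bit name used later in this document.

```rust
/// Picks an NVPTX target name from a pointer width given in bits.
/// Hypothetical helper; "nvptx-nvidia-cuda" is an assumed name for the
/// 32-bit target, modeled on "nvptx64-nvidia-cuda".
fn nvptx_target_for(pointer_width_bits: u32) -> &'static str {
    if pointer_width_bits == 64 {
        "nvptx64-nvidia-cuda"
    } else {
        "nvptx-nvidia-cuda"
    }
}

fn main() {
    // `usize` is as wide as a pointer for the platform this program was
    // compiled for, which is normally the same as the host OS.
    let bits = usize::BITS;
    println!("use --target {}", nvptx_target_for(bits));
}
```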

Minimal example

Here's a minimal example of emitting PTX from a Rust crate:

$ cargo new --lib kernel && cd $_

$ cat src/lib.rs
#![no_std]

fn foo() {}

# emitting debuginfo is not supported for the nvptx targets
$ edit Cargo.toml && tail -n2 $_
[profile.dev]
debug = false

# The JSON file must be in the current directory
$ test -f nvptx64-nvidia-cuda.json && echo OK
OK

# You'll need to use Xargo to build the `core` crate "on the fly"
# Install it if you don't already have it
$ cargo install xargo || true

# Then instruct Xargo to compile a fork of the core crate that contains no
# 128-bit integers
$ edit Xargo.toml && cat Xargo.toml
[dependencies.core]
git = "https://github.com/japaric/core64"

# Xargo has the exact same CLI as Cargo
$ xargo rustc --target nvptx64-nvidia-cuda -- --emit=asm
   Compiling core v0.0.0 (file://$SYSROOT/lib/rustlib/src/rust/src/libcore)
    Finished release [optimized] target(s) in 18.74 secs
   Compiling kernel v0.1.0 (file://$PWD)
    Finished debug [unoptimized] target(s) in 0.4 secs

The PTX code will be available as a .s file in the target directory:

$ find -name '*.s'
./target/nvptx64-nvidia-cuda/debug/deps/kernel-e916cff045dc0eeb.s

$ cat $(find -name '*.s')
.version 3.2
.target sm_20
.address_size 64

.func _ZN6kernel3foo17h24d36fb5248f789aE()
{
        .local .align 8 .b8     __local_depot0[8];
        .reg .b64       %SP;
        .reg .b64       %SPL;

        mov.u64         %SPL, __local_depot0;
        bra.uni         LBB0_1;
LBB0_1:
        ret;
}

Global functions

Although this PTX module (the whole file) can be loaded on the GPU, the function foo contained in it can't be "launched" by the host because it's a device function. Only global functions (AKA kernels) can be launched by the host.

To turn foo into a global function, its ABI must be changed to "ptx-kernel":

#![feature(abi_ptx)]
#![no_std]

extern "ptx-kernel" fn foo() {}

With that change the PTX of the foo function will now look like this:

.entry _ZN6kernel3foo17h24d36fb5248f789aE()
{
        .local .align 8 .b8     __local_depot0[8];
        .reg .b64       %SP;
        .reg .b64       %SPL;

        mov.u64         %SPL, __local_depot0;
        bra.uni         LBB0_1;
LBB0_1:
        ret;
}

foo is now a global function because it has the .entry directive instead of the .func one.
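
To check which functions in a generated PTX file are global and which are device functions, a naive text scan is enough for listings as simple as the ones above. The sketch below is not a PTX parser; the names FnKind and list_functions are made up for illustration, and it also accepts an optional .visible directive in front of the header.

```rust
/// Kind of PTX function, as seen in the listings above.
#[derive(Debug, PartialEq)]
enum FnKind {
    /// `.entry`: a global function (kernel), launchable by the host.
    Global,
    /// `.func`: a device function, callable only from GPU code.
    Device,
}

/// Scans PTX text line by line and returns (kind, name) for each function
/// header. Limitation: `.func` headers that declare a return value are not
/// handled; their name comes back empty.
fn list_functions(ptx: &str) -> Vec<(FnKind, String)> {
    let mut out = Vec::new();
    for line in ptx.lines() {
        let line = line.trim();
        // Drop an optional `.visible` directive in front of the header.
        let line = line
            .strip_prefix(".visible")
            .map(str::trim_start)
            .unwrap_or(line);
        let (kind, rest) = if let Some(rest) = line.strip_prefix(".entry ") {
            (FnKind::Global, rest)
        } else if let Some(rest) = line.strip_prefix(".func ") {
            (FnKind::Device, rest)
        } else {
            continue;
        };
        // The name runs up to the opening parenthesis of the parameter list.
        let name = rest.split('(').next().unwrap_or("").trim();
        out.push((kind, name.to_string()));
    }
    out
}

fn main() {
    let ptx = ".func _ZN6kernel3foo17h24d36fb5248f789aE()\n{\n        ret;\n}\n\
               .visible .entry foo()\n{\n        ret;\n}\n";
    for (kind, name) in list_functions(ptx) {
        println!("{:?}: {}", kind, name);
    }
}
```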

Avoiding mangling

With the CUDA API, one can retrieve functions from a PTX module by their name. foo's final name in the PTX module has been mangled and looks like this: _ZN6kernel3foo17h24d36fb5248f789aE.
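
To see what is packed into that name, here is a small sketch that splits it into its parts. It assumes the symbol has the shape _ZN, then a run of length-prefixed identifiers, then E, which is what the name above looks like. The function name split_mangled is made up for illustration, and real demanglers handle more cases (escape sequences, for instance).

```rust
/// Splits a symbol of the form `_ZN<len><ident>...E` into its components.
/// Returns `None` if the symbol does not fit that shape.
fn split_mangled(symbol: &str) -> Option<Vec<&str>> {
    let mut rest = symbol.strip_prefix("_ZN")?.strip_suffix('E')?;
    let mut parts = Vec::new();
    while !rest.is_empty() {
        // Read the decimal length prefix.
        let digits = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
        let len: usize = rest[..digits].parse().ok()?;
        // Then take exactly `len` bytes as the identifier.
        let end = digits.checked_add(len)?;
        parts.push(rest.get(digits..end)?);
        rest = &rest[end..];
    }
    Some(parts)
}

fn main() {
    let symbol = "_ZN6kernel3foo17h24d36fb5248f789aE";
    match split_mangled(symbol) {
        // Prints: kernel::foo::h24d36fb5248f789a
        Some(parts) => println!("{}", parts.join("::")),
        None => println!("unexpected symbol shape: {}", symbol),
    }
}
```

The last component is a hash, so the full name is awkward to hard-code in host code.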

To avoid mangling the foo function, add the #[no_mangle] attribute to it.

#![feature(abi_ptx)]
#![no_std]

#[no_mangle]
extern "ptx-kernel" fn foo() {}

This will result in the following PTX code:

.entry foo()
{
        .local .align 8 .b8     __local_depot0[8];
        .reg .b64       %SP;
        .reg .b64       %SPL;

        mov.u64         %SPL, __local_depot0;
        bra.uni         LBB0_1;
LBB0_1:
        ret;
}

With this change you can now refer to the foo function using the "foo" (C) string from within the CUDA API.
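
On the host side, C APIs take the name as a NUL-terminated string; the CUDA driver API's cuModuleGetFunction, for instance, takes a const char* name. The sketch below only builds that string with the standard library and does not call into CUDA at all. The helper name kernel_name is made up for illustration.

```rust
use std::ffi::{CString, NulError};

/// Builds the NUL-terminated name to hand to a C lookup call.
/// Fails if `name` contains an interior NUL byte.
fn kernel_name(name: &str) -> Result<CString, NulError> {
    CString::new(name)
}

fn main() {
    let name = kernel_name("foo").expect("kernel name must not contain NUL bytes");
    // `name.as_ptr()` is the `const char*` a C API would receive; the pointer
    // is only valid while `name` is alive.
    let _ptr = name.as_ptr();
    println!(
        "{:?} is {} bytes including the terminator",
        name,
        name.as_bytes_with_nul().len()
    );
}
```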

Optimization

So far we have been compiling the crate using the (default) "debug" profile, which normally results in debuggable but slow code. Given that we can't emit debuginfo when using the nvptx targets, it makes more sense to build the crate using the "release" profile.

The catch is that we'll have to mark global functions as public; otherwise the compiler will "optimize them away" and they won't make it into the final PTX file.

#![feature(abi_ptx)]
#![no_std]

#[no_mangle]
pub extern "ptx-kernel" fn foo() {}

$ cargo clean

$ xargo rustc --release --target nvptx64-nvidia-cuda -- --emit=asm

$ cat $(find -name '*.s')
.visible .entry foo()
{
        ret;
}

Examples

This repository contains runnable examples of executing Rust code on the GPU. Note that no effort has gone into ergonomically integrating both the device code and the host code :-).

There's a kernel directory, which is a Cargo project as well, that contains Rust code that's meant to be executed on the GPU. That's the "device" code.

You can convert that Rust code into a PTX module using the following command:

$ xargo rustc \
    --manifest-path kernel/Cargo.toml \
    --release \
    --target nvptx64-nvidia-cuda \
    -- --emit=asm

The PTX file will be available in the kernel/target directory.

$ find kernel/target -name '*.s'
kernel/target/nvptx64-nvidia-cuda/release/deps/kernel-bb52137592af9c8c.s
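
The file name contains a hash, so a host program that wants to locate the PTX file itself has to search for it. Here is a minimal sketch that does roughly what the find command above does, using only the standard library. The function name find_ptx is made up for illustration.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collects every file under `dir` whose extension is `s`,
/// roughly what `find <dir> -name '*.s'` does.
fn find_ptx(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            found.extend(find_ptx(&path)?);
        } else if path.extension().map_or(false, |ext| ext == "s") {
            found.push(path);
        }
    }
    Ok(found)
}

fn main() {
    // `kernel/target` is the directory used in the command above.
    match find_ptx(Path::new("kernel/target")) {
        Ok(files) => {
            for file in files {
                println!("{}", file.display());
            }
        }
        Err(e) => eprintln!("could not scan kernel/target: {}", e),
    }
}
```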

The examples directory contains the "host" code. Inside that directory, there are 3 files; each file is an example program:

  • add - Add two (mathematical) vectors on the GPU
  • memcpy - memcpy on the GPU
  • rgba2gray - Convert a color image to grayscale
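
To give an idea of the kind of work the rgba2gray example does, here is a CPU-side sketch of an RGBA to grayscale conversion. It is not the code from this repository: the function name rgba_to_gray is made up, and the integer weights 77/150/29 (out of 256), which approximate the common 0.299/0.587/0.114 luma weights, are an assumption. Each output pixel depends only on its own input pixel, which is why this kind of conversion maps naturally onto one GPU thread per pixel.

```rust
/// Converts tightly packed RGBA pixels (4 bytes each) to one gray byte per
/// pixel. The weights are an assumption, not taken from the actual kernel.
fn rgba_to_gray(rgba: &[u8]) -> Vec<u8> {
    rgba.chunks_exact(4)
        .map(|px| {
            let (r, g, b) = (px[0] as u32, px[1] as u32, px[2] as u32);
            // The alpha channel (px[3]) is ignored.
            ((77 * r + 150 * g + 29 * b) >> 8) as u8
        })
        .collect()
}

fn main() {
    // Three pixels: white, black, pure red.
    let pixels = [255, 255, 255, 255, 0, 0, 0, 255, 255, 0, 0, 255];
    println!("{:?}", rgba_to_gray(&pixels));
}
```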

Each example program expects as first argument the path to the PTX file we generated previously. You can run each example with a command like this:

$ cargo run --example add -- $(find kernel/target -name '*.s')

The rgba2gray example additionally expects a second argument: the path to the image that will be converted to grayscale. That example also compares the runtime of converting the image on the GPU vs the runtime of converting the image on the CPU. Be sure to run that example in release mode to get a fair comparison!

$ cargo run --release --example rgba2gray -- $(find kernel/target -name '*.s') ferris.png
Image size: 1200x800 - 960000 pixels - 3840000 bytes

RGBA -> grayscale on the GPU
    Duration { secs: 0, nanos: 602024 } - `malloc`
    Duration { secs: 0, nanos: 718481 } - `memcpy` (CPU -> GPU)
    Duration { secs: 0, nanos: 1278006 } - Executing the kernel
    Duration { secs: 0, nanos: 306315 } - `memcpy` (GPU -> CPU)
    Duration { secs: 0, nanos: 322648 } - `free`
    ----------------------------------------
    Duration { secs: 0, nanos: 3227474 } - TOTAL

RGBA -> grayscale on the CPU
    Duration { secs: 0, nanos: 12299 } - `malloc`
    Duration { secs: 0, nanos: 4171570 } - conversion
    Duration { secs: 0, nanos: 493 } - `free`
    ----------------------------------------
    Duration { secs: 0, nanos: 4184362 } - TOTAL
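
The numbers above can be checked and compared with a few lines of Rust. The sketch below copies the stage timings from that single sample run, sums them, and computes two ratios; the helper names total and speedup are made up for illustration.

```rust
use std::time::Duration;

/// Sums per-stage timings into a total.
fn total(stages: &[Duration]) -> Duration {
    stages.iter().sum()
}

/// How many times faster `fast` is than `slow`.
fn speedup(slow: Duration, fast: Duration) -> f64 {
    slow.as_secs_f64() / fast.as_secs_f64()
}

fn main() {
    // Stage timings from the sample output above, in nanoseconds.
    // GPU: malloc, memcpy to GPU, kernel, memcpy to CPU, free.
    let gpu = [602_024, 718_481, 1_278_006, 306_315, 322_648].map(Duration::from_nanos);
    // CPU: malloc, conversion, free.
    let cpu = [12_299, 4_171_570, 493].map(Duration::from_nanos);

    println!("GPU total: {:?}", total(&gpu));
    println!("CPU total: {:?}", total(&cpu));
    println!("end to end: {:.2}x", speedup(total(&cpu), total(&gpu)));
    println!("compute only: {:.2}x", speedup(cpu[1], gpu[2]));
}
```

In this run the kernel alone is roughly 3.3 times faster than the CPU conversion, but allocation and the two copies bring the end-to-end advantage down to roughly 1.3 times.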

Problems?

If you encounter any problem with the Rust -> PTX feature in the compiler, report it to this meta issue.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
