• Stars: 512
  • Rank: 85,717 (Top 2%)
  • Language: Python
  • Created: about 7 years ago
  • Updated: over 5 years ago

Repository Details

Tutorial for building a custom CUDA function for Pytorch

Pytorch Custom CUDA kernel Tutorial

This repository contains tutorial code for making a custom CUDA function for pytorch. The code is based on the pytorch C extension example.

Disclaimer

This tutorial was written when pytorch did not support broadcasting sum. Now that it does, you probably won't need to write your own broadcasting sum function, but you can still follow the tutorial to build your own custom layer with a custom CUDA kernel.

In this repository, we will build a simple CUDA-based broadcasting sum function. At the time of writing, pytorch did not support broadcasting sum, so we had to manually expand a tensor, e.g. with expand_as, which creates a new tensor and costs additional memory and computation.

For example,

a = torch.randn(3, 5)
b = torch.randn(3, 1)
# The following line will give an error
# a += b

# Expand b to have the same dimension as a
b_like_a = b.expand_as(a)
a += b_like_a

In this post, we will build a function that can compute a += b without explicitly expanding b.

mathutil.broadcast_sum(a, b, *map(int, a.size()))
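
For reference, here is a hedged usage sketch, assuming the built extension is importable as mathutil (as in the call above) and recalling from the note at the end of this readme that both tensors must be CUDA float tensors:

import torch
import mathutil  # the extension built later in this tutorial

# Both tensors must live on the GPU (torch.FloatTensor().cuda(), see the note below)
a = torch.randn(3, 5).cuda()
b = torch.randn(3, 1).cuda()

# Adds b to a in place, without materializing an expanded copy of b
mathutil.broadcast_sum(a, b, *map(int, a.size()))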

Make a CUDA kernel

First, let's make a CUDA kernel that adds b to a without making a copy of the tensor b.

__global__ void broadcast_sum_kernel(float *a, float *b, int x, int y, int size)
{
    // Global thread index over all x * y elements of a
    int i = (blockIdx.x + blockIdx.y * gridDim.x) * blockDim.x + threadIdx.x;
    if(i >= size) return;
    // Unflatten the linear index into the two coordinates used by IDX2D
    int j = i % x; i = i / x;
    int k = i % y;
    // Add the corresponding element of b in place; b is never expanded or copied
    a[IDX2D(j, k, y)] += b[k];
}
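
IDX2D is a small indexing helper whose definition is not shown in this readme. A minimal sketch, assuming the usual row-major flattening of a 2D index:

/* Assumed definition: flatten a 2D index (i, j) with row stride dj (row-major). */
#define IDX2D(i, j, dj) ((i) * (dj) + (j))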

Make a C wrapper

Once you have made a CUDA kernel, you have to wrap it in C code. We are not using the pytorch backend yet; note that the inputs are already device pointers.

void broadcast_sum_cuda(float *a, float *b, int x, int y, cudaStream_t stream)
{
    int size = x * y;
    cudaError_t err;

    // Launch one thread per element of a on the given stream
    broadcast_sum_kernel<<<cuda_gridsize(size), BLOCK, 0, stream>>>(a, b, x, y, size);

    // Check for launch errors
    err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        fprintf(stderr, "CUDA kernel failed : %s\n", cudaGetErrorString(err));
        exit(-1);
    }
}
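
The launch configuration helpers BLOCK and cuda_gridsize are also not shown here. A minimal sketch, assuming a fixed block size and a Darknet-style grid that spills over into gridDim.y for large sizes:

#include <math.h>

/* Assumed helpers: one thread per element, split across gridDim.x and gridDim.y
   so that very large sizes stay within the per-dimension grid limits. */
#define BLOCK 512

dim3 cuda_gridsize(int n)
{
    int k = (n - 1) / BLOCK + 1;   /* number of blocks needed */
    int x = k;
    int y = 1;
    if (x > 65535) {               /* spill over into the y dimension if needed */
        x = (int)ceil(sqrt((double)k));
        y = (n - 1) / (x * BLOCK) + 1;
    }
    return dim3(x, y, 1);
}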

Connect the Pytorch backend with the C wrapper

Next, we have to connect the pytorch backend with our C wrapper. You can get the underlying device pointer of a tensor with the function THCudaTensor_data; the pointers a and b below are device pointers (on the GPU).

extern THCState *state;

int broadcast_sum(THCudaTensor *a_tensor, THCudaTensor *b_tensor, int x, int y)
{
    // THCudaTensor_data exposes the raw device pointers of the pytorch tensors
    float *a = THCudaTensor_data(state, a_tensor);
    float *b = THCudaTensor_data(state, b_tensor);
    // Run on the stream pytorch is currently using
    cudaStream_t stream = THCState_getCurrentStream(state);

    broadcast_sum_cuda(a, b, x, y, stream);

    return 1;
}
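
The ffi build in the next section needs a header that declares this function so it can be exposed to python. A minimal sketch (the file name src/mathutil_cuda.h is an assumption):

/* src/mathutil_cuda.h -- assumed file name; the declaration cffi exposes to python */
int broadcast_sum(THCudaTensor *a_tensor, THCudaTensor *b_tensor, int x, int y);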

Make a python wrapper

Now that we have built the CUDA kernel and its C wrappers, we need to expose the function to python so that we can call it from python.

We will first build a shared library using nvcc.

nvcc ... -o build/mathutil_cuda_kernel.so src/mathutil_cuda_kernel.cu
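
The flags are omitted above; a typical invocation, assuming the kernel is compiled into a relocatable object that the ffi build links in later (the exact flags and the -arch value are assumptions, not the repository's Makefile):

# Compile the kernel to an object that the ffi build can link in (flags are assumptions)
nvcc -c -o build/mathutil_cuda_kernel.so src/mathutil_cuda_kernel.cu \
    -x cu -Xcompiler -fPIC -arch=sm_52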

Then, we use pytorch's torch.utils.ffi.create_extension function, which automatically puts in the appropriate headers and builds a python-loadable shared library.

from torch.utils.ffi import create_extension

...

ffi = create_extension(
    'mathutils',
    headers=[...],
    sources=[...],
    ...
)

ffi.build()
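
Putting the pieces together, here is a hedged sketch of a complete build.py; the file names, paths, and extra arguments are assumptions, not the repository's actual build script:

# build.py -- a sketch under assumed file names and paths
import os
from torch.utils.ffi import create_extension

this_dir = os.path.dirname(os.path.abspath(__file__))

ffi = create_extension(
    'mathutils',                         # module name; must match whatever you import in python
    headers=['src/mathutil_cuda.h'],     # C declarations of broadcast_sum (assumed path)
    sources=['src/mathutil_cuda.c'],     # the C wrapper shown above (assumed path)
    relative_to=__file__,
    with_cuda=True,                      # link against pytorch's THC CUDA backend
    extra_objects=[os.path.join(this_dir, 'build/mathutil_cuda_kernel.so')],  # the nvcc output
)

if __name__ == '__main__':
    ffi.build()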

Test!

Finally, we can build and test the function. A lot of details are omitted in this readme, but you can run the full working example:

git clone https://github.com/chrischoy/pytorch-cffi-tutorial
cd pytorch-cffi-tutorial
make

Note

The function only takes THCudaTensor, which is torch.FloatTensor().cuda() in python.

More Repositories

1. 3D-R2N2: Single/multi view image(s) to voxel reconstruction using a recurrent neural network (Python, 1,346 stars)
2. FCGF: Fully Convolutional Geometric Features: fast and accurate 3D features for registration and correspondence (Python, 632 stars)
3. DeepGlobalRegistration: [CVPR 2020 Oral] A differentiable framework for 3D registration (Python, 467 stars)
4. SpatioTemporalSegmentation: 4D Spatio-Temporal Semantic Segmentation on a 3D video (a sequence of 3D scans) (Python, 287 stars)
5. fully-differentiable-deep-ndf-tf: Fully differentiable deep-neural decision forest in tensorflow (Python, 228 stars)
6. MakePytorchPlusPlus: How and why you want to make your pytorch CUDA/CPP extension with a Makefile (Makefile, 170 stars)
7. knn_cuda: Fast K-Nearest Neighbor search with GPU (Cuda, 141 stars)
8. open-ucn: The first fully convolutional metric learning for geometric/semantic image correspondences (Python, 87 stars)
9. pytorch_knn_cuda: K-Nearest Neighbor in Pytorch (Cuda, 67 stars)
10. HighDimConvNets: [CVPR 2020 Oral] High-dimensional Convolutional Networks for Geometric Pattern Recognition (Python, 39 stars)
11. gesvd: Pytorch extension for Singular Value Decomposition (SVD) with LAPACK gesvd backend (C++, 28 stars)
12. SUN_RGBD: Reorganized SUN RGBD dataset (Shell, 25 stars)
13. SpatioTemporalSegmentation-ScanNet (Python, 22 stars)
14. enriching_object_detection (C++, 21 stars)
15. CUDA-FFT-Convolution: CUDA FFT convolution (C++, 14 stars)
16. segmentation_lecture (Python, 12 stars)
17. python-venv-setup: Make python virtual environment setup on old servers less painful (Shell, 10 stars)
18. MinkowskiEngineBenchmark (Python, 7 stars)
19. mini_lseg (Python, 5 stars)
20. PybindNumpyExample: A simple reference template for pybind11 + numpy (C++, 4 stars)
21. env-setup: Setup my dev environment (Shell, 3 stars)
22. dotfiles: dot files (Vim Script, 2 stars)
23. torch_spmm (Cuda, 1 star)