Pytorch OpenCL backend based on dlprimitives
DLPrimitives-OpenCL out of tree backend for pytorch
This is only a beginning, but you can already train some vision networks on OpenCL devices.
Validated Networks
The following torchvision networks were validated:
Network | Notes |
---|---|
alexnet | |
resnet18 | |
resnet50 | |
vgg16 | |
squeezenet1_0 | |
googlenet | |
densenet161 | |
inception_v3 | fwd only - backward fails on cuda/cpu |
shufflenet_v2_x1_0 | |
mobilenet_v2 | |
mobilenet_v3_large | |
mobilenet_v3_small | fwd only - same failure on bwd on cuda. |
resnext50_32x4d | |
wide_resnet50_2 | |
mnasnet1_0 | |
efficientnet_b0 | |
efficientnet_b4 | |
regnet_y_400mf | |
Calculations were validated against a CPU reference for both forward and backward propagation.
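A comparison of this kind can be sketched as follows. This is a minimal illustration and not the project's actual validation script: the helper names `max_abs_diff` and `compare_with_cpu`, the tolerance, the input shape and the sum-of-outputs loss are all assumptions made for the example. It also assumes the backend library is already loaded, so that the device name passed in is known to pytorch.

```python
import copy


def max_abs_diff(reference, candidate):
    """Largest absolute element-wise difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences differ in length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def compare_with_cpu(model, device, input_shape=(2, 3, 224, 224), tol=1e-3):
    """Run forward and backward on the CPU and on `device`; True if they agree.

    Hypothetical helper: `tol` and `input_shape` are illustrative choices.
    """
    import torch  # imported lazily so max_abs_diff works without pytorch

    torch.manual_seed(0)
    cpu_model = model.eval()  # eval() removes dropout randomness (modifies `model`)
    dev_model = copy.deepcopy(cpu_model).to(device)
    x = torch.randn(input_shape)

    y_cpu = cpu_model(x)
    y_dev = dev_model(x.to(device))
    y_cpu.sum().backward()
    y_dev.sum().backward()

    # Copy device results back to the CPU before measuring the error.
    errors = [max_abs_diff(y_cpu.detach().flatten().tolist(),
                           y_dev.detach().cpu().flatten().tolist())]
    for p_cpu, p_dev in zip(cpu_model.parameters(), dev_model.parameters()):
        if p_cpu.grad is None:
            continue
        errors.append(max_abs_diff(p_cpu.grad.flatten().tolist(),
                                   p_dev.grad.cpu().flatten().tolist()))
    return max(errors) <= tol
```

A call might look like `compare_with_cpu(torchvision.models.resnet18(), 'ocl:0')`. The absolute tolerance would likely need tuning per network, since gradient magnitudes vary a lot.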
Tested Devices
DLPrimitives itself is tested on the following devices:
- Nvidia: GTX 960
- AMD: RX 6600 XT and, in the past, RX 560
- Intel: HD530
Benchmarks
All benchmarks were run on a GTX 960/4G so that they can be compared with native CUDA speed.
Test
The test includes copying data to/from the device and the forward calculations.
Framework | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
---|---|---|---|---|---|
pytorch/cuda | 15.253 | 38.745 | 114.348 | 169.038 | 46.110 |
pytorch/opencl | 22.989 | 50.272 | 167.050 | 258.751 | 82.044 |
dlprimitives | 22.688 | 49.193 | 158.789 | 238.802 | 82.080 |
keras/tf2-cuda | 29.104 | 74.215 | 161.704 | 158.084 | 88.851 |
keras/plaidml | 43.004 | 91.533 | - | - | 45.693 |
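To read the table as relative cost, the OpenCL numbers can be divided by the CUDA numbers. The sketch below does that arithmetic for the two pytorch rows. It assumes the entries are times (lower is better), which the table does not state explicitly; the names `NETWORKS`, `CUDA`, `OPENCL` and `relative_cost` are made up for the example.

```python
# Forward-pass numbers copied from the table above, in column order.
NETWORKS = ["alexnet", "resnet18", "resnet50", "vgg16", "mobilenet"]
CUDA = [15.253, 38.745, 114.348, 169.038, 46.110]      # pytorch/cuda row
OPENCL = [22.989, 50.272, 167.050, 258.751, 82.044]    # pytorch/opencl row


def relative_cost(candidate, baseline):
    """Ratio candidate/baseline per network.

    A value above 1 means the candidate took longer, assuming the table
    values are times (lower is better).
    """
    return {name: c / b for name, c, b in zip(NETWORKS, candidate, baseline)}


ratios = relative_cost(OPENCL, CUDA)
for name, ratio in ratios.items():
    print(f"{name:10s} pytorch/opencl is {ratio:.2f}x pytorch/cuda")
```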
Full Train
Training includes io to/from the device, zeroing gradients, forward, backward and the optimizer update step. Adam was used as the optimizer.
Framework | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
---|---|---|---|---|---|
pytorch/cuda | 107.108 | 129.456 | 388.951 | N/A | 177.434 |
pytorch/opencl | 147.814 | 213.319 | 651.216 | N/A | 382.590 |
dlprimitives | 106.033 | 198.092 | 605.541 | 1107.756 | 344.599 |
keras/tf2-cuda | 90.005 | 183.447 | 501.362 | 550.063 | 322.416 |
keras/plaidml | 222.166 | 507.116 | - | - | 571.438 |
- vgg16 with batch 16 failed to run due to lack of memory on pytorch.
- Some setups with plaidml were not tested due to lack of performance/memory.
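The full training step described above (io, zero gradients, forward, backward, Adam update) could be timed roughly as in the sketch below. This is not the benchmark script used for the table: the helper names `average_ms` and `make_train_step`, the cross-entropy loss, the random input data, the default batch size and the warmup/iteration counts are assumptions made for illustration.

```python
import time


def average_ms(step, warmup=2, iterations=10):
    """Call `step()` repeatedly and return the mean wall-clock time in ms."""
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(iterations):
        step()
    return (time.perf_counter() - start) * 1000.0 / iterations


def make_train_step(model, device, batch=16, classes=1000):
    """Build a closure that performs one full training step on `device`."""
    import torch  # imported lazily so average_ms works without pytorch

    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.CrossEntropyLoss()  # assumed loss, not from the source
    images = torch.randn(batch, 3, 224, 224)
    labels = torch.randint(0, classes, (batch,))

    def step():
        x, y = images.to(device), labels.to(device)  # io to the device
        optimizer.zero_grad()                        # zero gradients
        loss = loss_fn(model(x), y)                  # forward
        loss.backward()                              # backward
        optimizer.step()                             # optimizer update
        return loss.item()                           # io from the device

    return step
```

With a loaded backend this would be used as `average_ms(make_train_step(torchvision.models.resnet18(), 'ocl:0'))`. Reading the loss back with `item()` copies the value to the host, so the timing waits for the step to finish.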
Build
Changes From previous
Note: the build procedure was significantly simplified, so READ it again
- You don't need to build a custom pytorch
- You should use the ocl name for the device rather than opencl (see details below)
In a nutshell
- Set up a pip virtual environment with pytorch 1.13 or a nightly version for CPU
- Build dlprim_backend
- Load shared library in pytorch and start using it.
Now in detail
- Set up a pip virtual environment and install the CPU version of pytorch, either 1.13 stable or a nightly build: https://pytorch.org/get-started/locally/
  Install the CPU variant, since you don't need CUDA support for the OpenCL backend to work.
- Make sure you have the OpenCL headers and library. The header should be `cl2.hpp`, not the old `cl.hpp`.
- It is strongly recommended to have the SQLite3 library and headers available as well. This improves startup times by caching OpenCL kernels on disk.
- Clone the repository:
  git clone --recurse-submodules https://github.com/artyom-beilis/pytorch_dlprim.git
- Build the backend.
Building on Linux
Make sure you are in the virtual environment
mkdir build
cd build
cmake -DCMAKE_PREFIX_PATH=$VIRTUAL_ENV/lib/python3.8/site-packages/torch/share/cmake/Torch ..
make
Note: if you use a python version other than 3.8, adjust the path above accordingly.
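If you are unsure which path to pass, the small sketch below builds it from the running interpreter's version. The function name `torch_cmake_dir` is made up for this example, and it assumes the same Linux virtual environment layout as the cmake command above.

```python
import sys
from pathlib import PurePosixPath


def torch_cmake_dir(venv, version=None):
    """Build the CMAKE_PREFIX_PATH value for a given python version.

    `venv` is the virtual environment root ($VIRTUAL_ENV); `version` is a
    (major, minor) tuple and defaults to the running interpreter.
    """
    major, minor = (version or sys.version_info)[:2]
    return (PurePosixPath(venv) / "lib" / f"python{major}.{minor}"
            / "site-packages" / "torch" / "share" / "cmake" / "Torch")


# Print the cmake argument for the interpreter running this script.
print(f"-DCMAKE_PREFIX_PATH={torch_cmake_dir('$VIRTUAL_ENV')}")
```

Run it with the python from the virtual environment, so the version matches the one pytorch was installed for.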
Test it runs:
python mnist.py --device ocl:0
Note: unlike in the previous build procedure, dlprimitives is now a submodule of the project, so there is no need to build it separately.
Building on Windows
Note: Windows support is even more experimental than Linux support. It was tested using pytorch 1.13 and MSVC 2022 with the ninja build tool.
You will need the OpenCL headers and the x86_64
import library. It is also strongly recommended to get the sqlite3
library. You can download the 64-bit def and dll files and the headers from the official web site. You can convert the def file
to an import library by running lib /def:sqlite3.def /out:sqlite3.lib /machine:x64
Put all the dependencies in a layout you can use with ease, something like:
c:\deps
c:\deps\include\
c:\deps\include\CL\cl2.hpp
c:\deps\include\sqlite3.h
...
c:\deps\lib\
c:\deps\lib\OpenCL.lib
c:\deps\lib\sqlite3.lib
c:\deps\lib\sqlite3.dll
Make sure you put only 64-bit release versions there.
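Before running cmake, it can help to check that the layout is complete. The sketch below only checks the files explicitly named in the layout above; `EXPECTED_FILES` and `missing_dependencies` are names invented for this example, and `c:\deps` is the example root used in this section.

```python
from pathlib import Path

# Files named in the layout above, relative to the dependency root.
EXPECTED_FILES = [
    "include/CL/cl2.hpp",
    "include/sqlite3.h",
    "lib/OpenCL.lib",
    "lib/sqlite3.lib",
    "lib/sqlite3.dll",
]


def missing_dependencies(root):
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Report anything missing from the example dependency root.
    for rel in missing_dependencies(r"c:\deps"):
        print("missing:", rel)
```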
Set up a pip virtual environment with pytorch. Let's assume you put it into c:\venv\torch
Open "x64 Native Tools Command Prompt for VS 2022" and activate the virtual environment by running c:\venv\torch\Scripts\activate
Change the current directory to the location of the pytorch_dlprim project
and run:
mkdir build
cd build
cmake -DCMAKE_PREFIX_PATH=%VIRTUAL_ENV%\Lib\site-packages\torch\share\cmake\Torch -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_COMPILER="cl.exe" -DCMAKE_CXX_COMPILER="cl.exe" -G Ninja -DCMAKE_INCLUDE_PATH=c:\deps\include -DCMAKE_LIBRARY_PATH=c:\deps\lib ..
ninja
Make sure that sqlite3 dll is in the path by calling
set PATH=%PATH%;c:\deps\lib
Once the build is complete, go back to the previous directory and run the mnist example
cd ..
python mnist.py --device=ocl:0
How to Use
Keep in mind that this is a very early version that lacks a lot of functionality and isn't fully tested yet. So if something fails, it is either not implemented or implemented incorrectly.
Note: the pytorch backend is based on the dlprimitives library, which actually implements all the operators and is relatively well tested.
If you still want to try:
- Before you begin, load the library `libpt_ocl.so` in your python code: `torch.ops.load_library("/path/to/libpt_ocl.so")`
  Or on Windows: `torch.ops.load_library("/path/to/pt_ocl.dll")`
  This enables using OpenCL devices as the `privateuseone` device. If you use a nightly version `>= 1.14`, you can rename the `privateuseone` device to `ocl`: `torch.utils.rename_privateuse1_backend('ocl')`
  Keep in mind you may have several devices. Refer to `clinfo --list` for the list of devices and their order. Now instead of calling `something.to('cuda')` you call `something.to('ocl:0')` or `something.to('privateuseone:0')`, or another device such as `ocl:1`, etc.
- Try to do only essential tasks on the GPU and handle preparation and output on the CPU, since many ops may not be implemented, for example printing.
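Put together, the steps above might look like the sketch below. It is a minimal illustration under assumptions: `device_name` and `run_on_opencl` are hypothetical helpers, the library path is a placeholder you must supply, and the tiny `Linear` model is only there to have something to run. The rename is attempted only if the installed pytorch provides `rename_privateuse1_backend`.

```python
def device_name(index=0, renamed=True):
    """Return 'ocl:N' if the backend was renamed, else 'privateuseone:N'."""
    prefix = "ocl" if renamed else "privateuseone"
    return f"{prefix}:{index}"


def run_on_opencl(library_path, index=0):
    """Load the backend, run a small forward pass, and print on the CPU."""
    import torch  # imported lazily so device_name works without pytorch

    torch.ops.load_library(library_path)  # e.g. "/path/to/libpt_ocl.so"

    # Older pytorch versions only know the device as 'privateuseone'.
    renamed = hasattr(torch.utils, "rename_privateuse1_backend")
    if renamed:
        torch.utils.rename_privateuse1_backend("ocl")
    device = device_name(index, renamed)

    model = torch.nn.Linear(4, 2).to(device)
    x = torch.randn(8, 4).to(device)  # prepare data on the CPU, then move it
    with torch.no_grad():
        y = model(x)

    # Copy the result back before printing, since printing device tensors
    # may not be implemented.
    print(y.cpu())
    return device
```

Calling `run_on_opencl("/path/to/libpt_ocl.so")` uses the first device listed by `clinfo --list`; pass `index=1` for the second one.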