• Stars: 287
• Rank: 143,492 (Top 3%)
• Language: C++
• License: Other
• Created: over 9 years ago
• Updated: almost 8 years ago


cltorch

An OpenCL backend for torch.

What is this?

It's a high-performance matrix library for OpenCL that runs on your GPU(s), harnessing the massive computational capacity they provide.

Most of the standard operations in torch are supported. If there are any missing that you need, please raise an issue.

What's working

Most things, really :-) See ImplementedDetails.md for a detailed description. Please don't hesitate to raise an issue for anything missing that you would like to see added.

Installation

IMPORTANT! THIS HAS CHANGED. Please install a specific Torch distro, as described below. Simply doing luarocks install cltorch is no longer supported.

Please see distro-cl for installation instructions.

cltorch-specific features

The following features are either cltorch-specific, or do not exist in cutorch:

feature                in torch?        in cutorch?         in cltorch?
apply/map/map2         Yes              Via an extension    Yes
profiling tools        -                Via CUDA nvvp       Yes
point tensors          -                -                   Yes
custom user kernels    Not applicable   Via cutorch-rtc     Yes

apply/map/map2

apply, map and map2 exist in torch, but how can we make them work on the GPU? We cannot simply pass in Lua functions.

Instead, you can provide OpenCL code directly to apply_on_gpu, map_on_gpu and map2_on_gpu, as a string expression. This will run on the GPU, at full speed. Examples, for x, y and z being identically sized torch.ClTensors:

x:apply_on_gpu("x = sqrt(x + 3.5)")
x:map_on_gpu(y, "x = 1000 * x + y * 10")
x:map2_on_gpu(y, z, "x = sqrt(1000 * x + y * 10 + z * z)")
  • note that the variables in the OpenCL string expression must be named as above, i.e. x, y, z. For convenience, these were named the same as the tensors in the example. If the tensors have different names, please continue to use x, y, z in the expressions, e.g.:
a:apply_on_gpu("x = sqrt(x + 3.5)")
a:map_on_gpu(b, "x = 1000 * x + y * 10")
a:map2_on_gpu(b, c, "x = sqrt(1000 * x + y * 10 + z * z)")

Profiling tools

The following tools are available to aid with profiling:

Method                       Description
cltorch.setProfiling(1)      turn on OpenCL kernel profiling
cltorch.dumpProfiling()      dump OpenCL kernel profiling timings since the last call
cltorch.setEnableTiming(1)   enable collection of cumulative wall-clock timings for cltorch code
cltorch.dumpTimings()        dump cumulative wall-clock timings for cltorch code
cltorch.setTrace(1)          print all GPU buffer allocations and copies between host/GPU

OpenCL Profiling

OpenCL natively provides facilities to measure the execution time of kernels, using clGetEventProfilingInfo, without needing to call cltorch.finish() or similar first. In cltorch, you don't need to know how this works ;-) Simply call, at the start of your code:

cltorch.setProfiling(1)

Then, after running the piece of code under scrutiny, simply call:

cltorch.dumpProfiling()

Timings are cumulative across multiple calls to the same kernel.

DumpTimings

This uses wall-clock times to measure the elapsed time in different sections of the cltorch code. It works as follows: each time the cltorch C++ code calls StatefulTimer::instance()->timeCheck("some status"), the wall-clock time since the last call to ->timeCheck() is added to the cumulative time for "some status". You can pass any string as the status. Then, after running the piece of code under scrutiny, simply call cltorch.dumpTimings() from your Lua program to dump these cumulative timings.

Update: please first call cltorch.setEnableTiming(true) to enable collection of timing information. This setting is global across all devices.
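Putting these calls together, a typical timing session might look like the following sketch (this assumes a working cltorch installation and an OpenCL-capable GPU; the tensor a is just an illustrative workload):

```lua
require 'cltorch'

cltorch.setEnableTiming(1)  -- enable collection of wall-clock timings (global, across all devices)

local a = torch.Tensor(1000, 1000):uniform():cl()  -- some illustrative work on the GPU
a:add(1)
a:mul(3)

cltorch.dumpTimings()  -- dump the cumulative per-status timings collected so far
```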

GPU buffer allocations and copies

You can log all GPU buffer allocations, copies to host, and copies to GPU device. Simply call:

cltorch.setTrace(1)

Any buffer allocations, and copies between host and device, will now be printed to stdout.

Point tensors: reduce pipeline stalls

Point tensors help to eliminate pipeline stalls associated with ReduceAll operations such as sometensor:sum(). Why does :sum() cause pipeline stalls, and how do point tensors eliminate this source of stalls?

If we send a single instruction (a kernel) to the GPU, there will be some latency whilst the instruction arrives at the GPU and starts running, and some more latency after the calculations have finished, whilst the results are retrieved back from the GPU. Maybe we send:

a:add(1)

We can draw a picture of what happens. Time is towards the right. GPU is at the top. CPU at the bottom:

[diagram: single GPU instruction]

But we can send lots of instructions, without waiting for the earlier ones to finish. Maybe we do:

a:add(b)
a:mul(3)
b:mul(a)
c:add(a)

This might look like the following; we don't have to wait for the previous instruction to finish: [diagram: pipelined GPU instructions]

But now imagine what happens if we process the following instruction:

a:div(a:sum())
  • a:sum() is going to take the sum of all the elements in a
  • a:div(a:sum()) is then going to divide all the elements of a by this sum
  • it looks innocent enough
  • but notice that we cannot send the a:div instruction until the a:sum() results have come back
  • so we have to wait for a:sum() to finish processing, and for the results to come back, before we can continue

It looks like this: [diagram: GPU pipeline stall]

Classic reduceall => Massive pipeline stall

Point tensors eliminate this. When we do the reduceall, the :sum() operation, we keep the result on the GPU, like this:

c = torch.Tensor(20,30):uniform():cl() -- create a tensor on the GPU
res = torch.ClTensor()                 -- create a point tensor on the GPU
res:sum(c)                             -- sum c, and keep the result in res, on the GPU

res is a point tensor. It has zero dimensions. It contains a single scalar float. It stays on the GPU. We can feed it into other operations as follows:

c:div(res)  -- divide c by res

We can send this instruction straight away, even before the first :sum(c) instruction has arrived at the GPU. So, no more stall.

By the way, it's possible to read the value of a point tensor, by printing it, or by calling the :s() operator. Normally you wouldn't do this except during debugging, since obviously it needs to wait for the GPU operation to finish, and for the data to come all the way back from the GPU :-)
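For example, reading back the sum computed above might be sketched as follows (res and c as in the earlier snippet; this stalls the pipeline, so it is for debugging only):

```lua
res:sum(c)      -- reduceall; the result stays on the GPU as a point tensor
print(res)      -- printing waits for the GPU, then copies the scalar back to the host
print(res:s())  -- the :s() operator likewise fetches the scalar value
```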

Custom user kernels

Custom user kernels let you run OpenCL code directly from Torch Lua! Of course, you can already do this with apply, map, and map2, see above. But now you can provide whole kernel functions, and other functions, and pass ClTensors into these kernels!

Example of how to use:

require 'cltorch'

k = torch.ClKernel({input={nElements='int', input='torch.ClTensor'},output={output='torch.ClTensor'},src=[[
   int linearId = get_global_id(0);
   if(linearId < nElements) {
     output_data[linearId] = input_data[linearId] + 3.0f;
   }
]]})
print('k', k)
k:print()

x = torch.ClTensor({3,5,2})
y = torch.ClTensor({6,4,2})
print('x before\n', x)
print('y before\n', y)

k:run({nElements=3, input=x, output=y})

print('y after\n', y)

Output from this example:

Using Intel platform: Intel Gen OCL Driver
Using device: Intel(R) HD Graphics BroadWell U-Processor GT2
k	torch.ClKernel
Original source
===============
   int linearId = get_global_id(0);
   if(linearId < nElements) {
     output_data[linearId] = input_data[linearId] + 3.0f;
   }


Generated source
================
typedef struct THClTensorInfoCl {
  unsigned int sizes[25];
  unsigned int strides[25];
  int offset;
  int dims;
} TensorInfoCl;



kernel void user_kernel(
    global struct THClTensorInfoCl *input_info, global float * input_data,
    int nElements,
    global struct THClTensorInfoCl *output_info, global float * output_data
) {
   int linearId = get_global_id(0);
   if(linearId < nElements) {
     output_data[linearId] = input_data[linearId] + 3.0f;
   }

}

x before
	 3
 5
 2
[torch.ClTensor of size 3]

y before
	 6
 4
 2
[torch.ClTensor of size 3]

y after
	 6
 8
 5
[torch.ClTensor of size 3]

If you want, you can specify the number of workgroups, and the workgroup size, yourself:

k:run({nElements=3, input=x, output=y}, {numWorkgroups=10, workgroupSize=32})

Co-existence with cutorch

  • It is possible to load cutorch and cltorch at the same time, if you wish
  • If you do this, please load cutorch first, and then load cltorch second
  • If you get errors like "#1 argument to copy should be tensor, but is userdata", please double-check that cutorch is required before cltorch (they each monkey-patch torch, but since cutorch was written first, it assumes there is no monkey-patch conflict)
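The load order described above can be sketched as follows (assuming both backends are installed):

```lua
-- load cutorch first, then cltorch, so their monkey-patches of torch don't conflict
require 'cutorch'
require 'cltorch'
```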

Third-party libraries

cltorch uses the following libraries. These are automatically built as part of the cltorch build process:

  • clBLAS - provides GPU-based matrix operations, such as multiplication
  • EasyCL - provides an abstraction layer over the low-level OpenCL API
  • clew - similar to glew; means that cltorch can be loaded without any OpenCL library/runtime being present

At runtime, if you want to call any of the cltorch methods, you will also need:

  • OpenCL-compatible GPU
  • OpenCL library/driver (normally provided by the GPU vendor)

Guidelines for contributors

You might or might not find ContributorGuidelines.md useful. It is not required reading, but it is there if you want to see my own thoughts and ideas on how I am currently approaching cltorch development and cutorch porting.

Also, some more technical guidelines on porting, in the clnn repository, at porting-guidelines.md.

Related projects

There is an OpenCL backend for nn and nngraph at clnn.

There is an HCC backend for Torch at hctorch.

Recent changes

  • 20 August 2016:
    • tests updated to use TestSuite.lua
    • renamed apply, map, map2 to apply_on_gpu, map_on_gpu, map2_on_gpu, and added the apply method from cutorch that first copies to main memory, then runs on the cpu, then copies back again, for compatibility with cutorch
  • 31 April 2016:
    • Re-applied:
      • 27 March 2016:
  • 30 April 2016:
    • rolled back to the state as of 3 March 2016, to use a specific torch release, so it doesn't keep changing whilst I'm at work :-)
  • 3 March 2016:
    • runs on Mac OS X, without needing LD_LIBRARY_PATH, ie RPATH works now. Hopefully :-)
  • 3rd January, 2016:
  • 27th December:
    • added FFI functions :data() and :cdata(), which means that Element Research's rnn now works with clnn
  • 23rd October:
    • removed :csub() and :neg() from the "cltorch-specific features" section, since integrated into torch now :-) pull request 392
  • 3rd October:
    • Added :mean() and :mean(d)
    • Added :atan2(x,y)
    • Added x:sign() and torch.sign(x)
    • Added norm(...)
  • 20th September:
    • Ported fix to addcdiv and addcmul reshape from cutorch commit 59a1cb05
    • Added ClStorage:__index() and ClTensor:__index()
    • Added ClStorage:__newindex() and ClTensor:__newindex()
  • 19th September:
    • Added guards around many functions, so that c++ exceptions are converted to torch errors now, and display something more meaningful than just 'c++ exception' :-P
      • Please feel free to raise an issue for any exceptions which are not guarded yet

See OlderChanges.md for older changes.
