
# matrix-cuda

Matrix multiplication in CUDA. This is a toy program for learning CUDA; some of the functions are reusable for other purposes.

# Test results

The following tests were carried out on a Tesla M2075 card:

```
[lzhengchun@clus10 liu]$ ./a.out
please type in m n and k
1024 1024 1024
Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 on GPU: 13.604608 ms.
Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 on CPU: 9925.121094 ms.
all results are correct!!!, speedup = 729.541138

[lzhengchun@clus10 liu]$ ./a.out
please type in m n and k
1024 1024 1023
Time elapsed on matrix multiplication of 1024x1024 . 1024x1023 on GPU: 51.141281 ms.
Time elapsed on matrix multiplication of 1024x1024 . 1024x1023 on CPU: 8964.353516 ms.
all results are correct!!!, speedup = 175.286057
```

# Notes

(1) function gpu_matrix_mult: a naive implementation assigns one thread to compute one element of matrix C. Each thread loads one row of matrix A and one column of matrix B from global memory, performs the inner product, and stores the result back to matrix C in global memory. In the naive implementation, the amount of computation is 2 x M x N x K flop, while the amount of global memory access is 2 x M x N x K words. With 4-byte floats, the "computation-to-memory ratio" is approximately 1/4 (flop/byte), so the naive implementation is memory-bandwidth bound.
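A minimal sketch of the naive kernel described above (the function name, signature, and launch configuration here are illustrative assumptions; the repo's actual code may differ):

```cuda
// One thread per element of the m x k result C = A (m x n) * B (n x k),
// all matrices stored row-major in global memory.
__global__ void naive_matrix_mult(const float *a, const float *b,
                                  float *c, int m, int n, int k)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < k) {
        float sum = 0.0f;
        // Each thread streams one row of A and one column of B
        // from global memory: 2*n words read per output element.
        for (int i = 0; i < n; ++i)
            sum += a[row * n + i] * b[i * k + col];
        c[row * k + col] = sum;
    }
}

// Example launch: 16x16 threads per block, enough blocks to cover C.
// dim3 block(16, 16);
// dim3 grid((k + 15) / 16, (m + 15) / 16);
// naive_matrix_mult<<<grid, block>>>(d_a, d_b, d_c, m, n, k);
```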

(2) function gpu_square_matrix_mult (note: this works only for square matrix multiplication)

To increase the "computation-to-memory ratio", tiled matrix multiplication can be applied: one thread block computes one tile of matrix C, and each thread in the block computes one element of that tile. For example, a 32 x 32 matrix can be divided into four 16 x 16 tiles and computed by four thread blocks of 16 x 16 threads each. The kernel computes C in multiple iterations. In each iteration, the thread block loads one tile of A and one tile of B from global memory into shared memory, performs the computation, and keeps the partial result of C in registers. After all iterations are done, the thread block stores its tile of C to global memory. For example, a thread block can compute C0,0 in two iterations: C0,0 = A0,0 B0,0 + A0,1 B1,0. In the tiled implementation, the amount of computation is still 2 x M x N x K flop, but with a tile size of B the amount of global memory access drops to 2 x M x N x K / B words, so the "computation-to-memory ratio" is approximately B/4 (flop/byte). We can therefore tune the ratio by changing the tile size B. For further explanation, see http://www.es.ele.tue.nl/~mwijtvliet/5KK73/?page=mmcuda (be careful with the pseudocode there; it has some omissions).
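The tiling scheme above can be sketched as follows. This is an illustrative version for square n x n matrices with n divisible by the tile size; the name, tile size, and signature are assumptions, not the repo's exact code:

```cuda
#define TILE 16

// Tiled multiply for square n x n matrices, n assumed divisible by TILE.
// Each block computes one TILE x TILE tile of C; each thread one element.
__global__ void tiled_square_matrix_mult(const float *a, const float *b,
                                         float *c, int n)
{
    __shared__ float tile_a[TILE][TILE];
    __shared__ float tile_b[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;  // partial result of C kept in a register

    // Iterate over the tiles along A's row strip and B's column strip.
    for (int t = 0; t < n / TILE; ++t) {
        // Cooperative load: each thread brings one element of each tile
        // into shared memory, so every global word is reused TILE times.
        tile_a[threadIdx.y][threadIdx.x] = a[row * n + t * TILE + threadIdx.x];
        tile_b[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // wait until the whole tile is loaded

        for (int i = 0; i < TILE; ++i)
            sum += tile_a[threadIdx.y][i] * tile_b[i][threadIdx.x];
        __syncthreads();  // wait before the tiles are overwritten
    }
    c[row * n + col] = sum;
}
```

With TILE = 16, each element of A and B is read from global memory once per 16 uses instead of once per use, which is exactly the B-fold reduction in traffic described above.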

As you can see from the test results, the tiled version achieves a much better speedup than gpu_matrix_mult.

# Comparison with OpenMP

4 x Intel(R) Xeon(R) CPU E5645 @ 2.40 GHz = 24 cores

```
[lzhengchun@clus10 liu]$ ./a.out
please type in m n and k
2300 2300 2300
Time elapsed on matrix multiplication of 2300x2300 . 2300x2300 on GPU: 166.835617 ms.
Time elapsed on matrix multiplication of 2300x2300 . 2300x2300 on CPU: 19520.644531 ms.
all results are correct!!!, speedup = 117.005257

[lzhengchun@clus10 liu]$ ./a.out
please type in m n and k
1024 1024 1024
Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 on GPU: 15.479232 ms.
Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 on CPU: 2045.946167 ms.
all results are correct!!!, speedup = 132.173630

[lzhengchun@clus10 liu]$ ./a.out
please type in m n and k
1024 1024 1023
Time elapsed on matrix multiplication of 1024x1024 . 1024x1023 on GPU: 53.428638 ms.
Time elapsed on matrix multiplication of 1024x1024 . 1024x1023 on CPU: 1563.460571 ms.
all results are correct!!!, speedup = 29.262594
```

So the OpenMP version is about 5x faster than the single-thread version, still far from the theoretical speedup of 24.

# Todo

(1) further optimization, especially of the "computation-to-memory ratio" for non-square matrices

(2) fix the shared-memory bank-conflict issue and the uncoalesced global-memory access issue
