matrix-cuda
matrix multiplication in CUDA, this is a toy program for learning CUDA, some functions are reusable for other purposes
test results
following tests were carried out on a Tesla M2075 card
[lzhengchun@clus10 liu]$ ./a.out
please type in m n and k
1024 1024 1024
Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 on GPU: 13.604608 ms.
Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 on CPU: 9925.121094 ms.
all results are correct!!!, speedup = 729.541138
[lzhengchun@clus10 liu]$ ./a.out
please type in m n and k
1024 1024 1023
Time elapsed on matrix multiplication of 1024x1024 . 1024x1023 on GPU: 51.141281 ms.
Time elapsed on matrix multiplication of 1024x1024 . 1024x1023 on CPU: 8964.353516 ms.
all results are correct!!!, speedup = 175.286057
#Notes
(1) function gpu_matrix_mult: A naive implementation on GPUs assigns one thread to compute one element of matrix C. Each thread loads one row of matrix A and one column of matrix B from global memory, do the inner product, and store the result back to matrix C in the global memory. In the naive implementation, the amount of computation is 2 x M x N x K flop, while the amount of global memory access is 2 x M x N x K word. The "computation-to-memory ratio" is approximately 1/4 (flop/byte). Therefore, the naive implementation is memory bandwidth bounded.
(2) function gpu_square_matrix_mult: (!!! this is only for square matrix mutiplication)
To increase the "computation-to-memory ratio", the tiled matrix multiplication can be applied. One thread block computes one tile of matrix C. Each thread in the thread block computes one element of the tile. The figure shows a 32 x 32 matrix divided into four 16 x 16 tiles. To compute this, four thread blocks, each with 16 x 16 threads can be created. The GPU kernel computes C in multiple iterations. In each iteration, each thread block loads one tile of A and one tile of B from global memory to shared memory, performs computation, and stores temporal result of C in register. After all the iteration is done, the thread block stores one tile of C into global memory. For example, a thread block can compute C0,0 in two iterations: C0,0 = A0,0 B0,0 + A0,1 B1,0. Therefore, in the tiled implementation, the amount of computation is still 2 x M x N x K flop. However, using tile size of B, the amount of global memory access is 2 x M x N x K / B word. The "computation-to-memory ratio" is approximately B/4 (flop/byte). We now can tune the "computation-to-memory" ratio by changing the tile size B. Futher explain please redirect to http://www.es.ele.tue.nl/~mwijtvliet/5KK73/?page=mmcuda (be careful with the Pseudocode, there are missing points).
As you can see from the test results, tiled version has much better speedup than gpu_matrix_mult.
#comparison with openmp
(Intel(R) Xeon(R) CPU E5645 @ 2.40GHz) X 4 = 24 Cores
[lzhengchun@clus10 liu]$ ./a.out
please type in m n and k
2300 2300 2300
Time elapsed on matrix multiplication of 2300x2300 . 2300x2300 on GPU: 166.835617 ms.
Time elapsed on matrix multiplication of 2300x2300 . 2300x2300 on CPU: 19520.644531 ms.
all results are correct!!!, speedup = 117.005257
[lzhengchun@clus10 liu]$ ./a.out
please type in m n and k
1024 1024 1024
Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 on GPU: 15.479232 ms.
Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 on CPU: 2045.946167 ms.
all results are correct!!!, speedup = 132.173630
[lzhengchun@clus10 liu]$ ./a.out
please type in m n and k
1024 1024 1023
Time elapsed on matrix multiplication of 1024x1024 . 1024x1023 on GPU: 53.428638 ms.
Time elapsed on matrix multiplication of 1024x1024 . 1024x1023 on CPU: 1563.460571 ms.
all results are correct!!!, speedup = 29.262594
So, the openmp version is about 5X faster than single thread version, still far from theoritical (24)
#todo
(1) further optimization, especially the "computation-to-memory ratio" for non square matrix
(2) solve shared Mem Bank conflict issue and Global Memory does not coalesced issue