CUDA-matrix-multiplication
Research into the code structure and global memory usage needed to match or exceed the performance of the cublasDgemm function. The suggested approach combines several techniques: registers, shared memory, streams, and coalesced accesses to global memory.
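As a rough illustration of how three of these techniques fit together, here is a minimal sketch of a tiled double-precision multiply. It is not this repository's actual kernel. The names `tiledDgemm`, `maxAbsError`, and `TILE` are hypothetical, the matrices are assumed to be square and row-major (cuBLAS itself uses column-major storage), and streams are left out to keep the example short.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // hypothetical tile width; tune per GPU

// Computes C = A * B for row-major n x n matrices of doubles.
__global__ void tiledDgemm(const double* A, const double* B, double* C, int n) {
    __shared__ double As[TILE][TILE];  // tile of A staged in shared memory
    __shared__ double Bs[TILE][TILE];  // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    double acc = 0.0;  // per-thread accumulator, kept in a register

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Threads with consecutive threadIdx.x read consecutive addresses,
        // so both global loads are coalesced. Out-of-range entries become 0.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0;
        __syncthreads();  // wait until the whole tile is loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tile is overwritten
    }
    if (row < n && col < n) C[row * n + col] = acc;  // one global write per thread
}

// Hypothetical host helper: runs the kernel on test data and returns the
// largest absolute difference from a plain CPU reference.
double maxAbsError(int n) {
    size_t bytes = size_t(n) * n * sizeof(double);
    std::vector<double> hA(size_t(n) * n), hB(size_t(n) * n), hC(size_t(n) * n);
    for (int i = 0; i < n * n; ++i) {
        hA[i] = (i % 7) * 0.5;
        hB[i] = (i % 5) - 2.0;
    }

    double *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    tiledDgemm<<<grid, block>>>(dA, dB, dC, n);
    // A blocking copy on the default stream also waits for the kernel.
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    double err = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double ref = 0.0;
            for (int k = 0; k < n; ++k) ref += hA[i * n + k] * hB[k * n + j];
            err = std::fmax(err, std::fabs(ref - hC[i * n + j]));
        }
    return err;
}
```

Calling `maxAbsError(100)` from a `main` function exercises the boundary handling, since 100 is not a multiple of the tile width. Each element of A and B is read from global memory once per tile rather than once per multiply, which is the main reduction in global memory traffic; matching cublasDgemm would likely require further work, such as computing several outputs per thread and overlapping transfers with streams.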