People associated Graphics Processing Units (GPUs) with fast image rendering until the turn of the century, when the science community turned its attention to hardware predominantly discussed in computer-gaming circles. One of the first attempts at non-graphical computation on a GPU was a matrix–matrix multiply. In 2001, low-end graphics cards had no floating-point support; floating-point color buffers did not arrive until 2003.

In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL has required performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide a single library with decent performance on a variety of platforms. We choose the triangular solver (TRSM) and matrix multiplication (GEMM) as representative level 3 BLAS routines to implement in OpenCL. We profile TRSM to get the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for both the NVIDIA Tesla C2050 and the ATI Radeon 5870, the latest GPUs offered by the two companies. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies in the OpenCL and CUDA compilers' optimizations, and other issues that affect performance. Experimental results show that nearly 50% of peak performance can be obtained in GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose the use of auto-tuning to better explore these kernels' parameter space using a search harness.
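The auto-tuning idea mentioned above can be illustrated with a minimal search-harness sketch. This is not the paper's harness: the parameter names (`tile_m`, `tile_n`, `unroll`) and the cost function are hypothetical stand-ins for what would, in practice, be the measured runtime of a compiled GEMM kernel variant on the target GPU.

```python
import itertools

def candidate_time(tile_m, tile_n, unroll):
    # Stand-in cost model (illustrative only): a real harness would compile
    # the kernel with these parameters and time it on the device.
    return abs(tile_m - 16) + abs(tile_n - 16) + 8.0 / unroll

def autotune(space):
    """Exhaustively search the parameter space, returning the fastest config."""
    best_cfg, best_t = None, float("inf")
    for cfg in itertools.product(*space.values()):
        t = candidate_time(*cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return dict(zip(space.keys(), best_cfg)), best_t

# Hypothetical parameter space: tile sizes and loop-unroll factors.
space = {"tile_m": [8, 16, 32], "tile_n": [8, 16, 32], "unroll": [1, 2, 4]}
best, t = autotune(space)
print(best)  # -> {'tile_m': 16, 'tile_n': 16, 'unroll': 4}
```

Because the search is exhaustive, the harness is trivially portable across devices: only the timing step (here faked by `candidate_time`) changes per platform, which is what lets the same tuning loop serve both the NVIDIA and ATI GPUs discussed above.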