Multi-Target Autotuning for Accelerators
Nicholas Chaimov
Committee: Allen Malony (chair), Kevin Butler, Michal Young
Directed Research Project (Mar 2014)
Keywords: autotuning; accelerators; OpenCL; TAU

Considerable computational resources are available on GPUs and other accelerator devices, and using them can offer dramatic performance gains over traditional CPUs. However, programming such devices is difficult, especially given the considerable architectural differences between models of accelerators. The OpenCL language provides code portability, but not performance portability: code that is optimized to run well on one device will run poorly on others. As a solution to this problem, we have developed OrCL, an autotuning system that generates OpenCL kernels from a subset of C and searches a space of variant implementations for the best-performing version for a particular problem and device. We instrument the resulting implementations to measure the performance of variants across the search space on NVIDIA GPUs, AMD GPUs, and Intel Xeon Phi accelerators, for a set of numerical kernels used in sparse linear system solvers as well as computations from a radiation transport simulation code.
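As a rough illustration of the variant-timing loop at the core of such an autotuner (a minimal sketch, not OrCL itself), the following C host program builds a few hypothetical OpenCL kernel variants, times each with OpenCL profiling events, and reports the fastest. The kernel sources, their names, and the choice of variants are placeholders, not output produced by OrCL.

```c
/* Minimal sketch of an autotuning variant-timing loop (not OrCL itself).
 * Each entry in `variants` stands in for a generated kernel variant;
 * the sources below are placeholders, not kernels produced by OrCL. */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

static const char *variants[] = {
    /* variant 0: straightforward element-wise scale */
    "__kernel void scale(__global float *x, float a) {"
    "  int i = get_global_id(0); x[i] *= a; }",
    /* variant 1: each work-item handles two elements (hypothetical unrolling) */
    "__kernel void scale(__global float *x, float a) {"
    "  int i = get_global_id(0) * 2; x[i] *= a; x[i+1] *= a; }",
};

int main(void) {
    const size_t N = 1 << 20;
    float *host = malloc(N * sizeof(float));
    for (size_t i = 0; i < N; ++i) host[i] = 1.0f;

    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    /* Profiling-enabled queue so per-kernel timestamps can be read back. */
    cl_command_queue q = clCreateCommandQueue(ctx, device,
                                              CL_QUEUE_PROFILING_ENABLE, &err);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                N * sizeof(float), host, &err);

    int best = -1;
    double best_ms = 1e30;
    for (int v = 0; v < 2; ++v) {
        cl_program prog = clCreateProgramWithSource(ctx, 1, &variants[v], NULL, &err);
        if (clBuildProgram(prog, 1, &device, NULL, NULL, NULL) != CL_SUCCESS)
            continue;                       /* skip variants that fail to build */
        cl_kernel k = clCreateKernel(prog, "scale", &err);
        float a = 2.0f;
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(k, 1, sizeof(float), &a);

        size_t global = (v == 0) ? N : N / 2;  /* variant 1 covers two elements each */
        cl_event ev;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, &ev);
        clWaitForEvents(1, &ev);

        /* Read device timestamps for this kernel launch. */
        cl_ulong t0, t1;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
        double ms = (t1 - t0) * 1e-6;
        printf("variant %d: %.3f ms\n", v, ms);
        if (ms < best_ms) { best_ms = ms; best = v; }

        clReleaseEvent(ev);
        clReleaseKernel(k);
        clReleaseProgram(prog);
    }
    printf("best variant: %d (%.3f ms)\n", best, best_ms);

    clReleaseMemObject(buf);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    free(host);
    return 0;
}
```

A real autotuner would additionally validate each variant's output, repeat timings to reduce noise, and search a much larger space of generated variants per target device, which is the role OrCL plays in the work described above.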