Producing high-performance implementations from simple, portable computation specifications is a challenge that compilers have tried to address for several decades. More recently, a relatively stable architectural landscape has given way to increasingly divergent and rapidly changing CPU and accelerator designs, whose main common factor is a dramatic increase in available parallelism. The growth of architectural heterogeneity and parallelism, combined with the very slow development cycles of traditional compilers, has motivated the development of autotuning tools that can respond quickly to changes in architectures and programming models and enable highly specialized optimizations that mainstream compilers cannot, or are unlikely to, provide. In this paper, we describe OrCL, a new OpenCL code generator and autotuner, and the introduction of detailed performance measurement into the autotuning process. OrCL is implemented within the Orio autotuning framework, which enables the rapid development of experimental languages and code optimization strategies aimed at achieving good performance on new platforms without rewriting or hand-optimizing critical kernels. The combination of the new OpenCL autotuning and TAU measurement capabilities enables users to consistently evaluate autotuning effectiveness across a range of architectures, including several NVIDIA and AMD accelerators and Intel Xeon Phi processors, and to compare the OpenCL and CUDA code generation capabilities. We present results of autotuning several numerical kernels that typically dominate the execution time of iterative sparse linear system solution, as well as key computations from a 3-D parallel simulation of solid fuel ignition.
Modified: Mon Jan 23 20:50:24 2017
Created: Mon Jan 23 20:29:28 2017