Partitioned global address space (PGAS) applications, such as the Tensor Contraction Engine (TCE) in NWChem, often apply a one-process-per-core mapping in which each process iterates through the following work-processing cycle: (1) determine a work-item dynamically, (2) get data via one-sided operations on remote blocks, (3) perform computation on the data locally, (4) put (or accumulate) resultant data into an appropriate remote location, and (5) repeat the cycle. However, this simple flow of execution does not effectively hide communication latency costs despite the opportunities for making asynchronous progress. Utilizing nonblocking communication calls is not sufficient unless care is taken to efficiently manage a responsive queue of outstanding communication requests. This paper presents a new runtime model and its library implementation for managing tunable “work queues” in PGAS applications. Our runtime execution model, called WorkQ, assigns some number of on-node “producer” processes to primarily do communication (steps 1, 2, 4, and 5) and the other “consumer” processes to do computation (step 3); but processes can switch roles dynamically for the sake of performance. Load balance, synchronization, and overlap of communication and computation are facilitated by a tunable nodewise FIFO message queue protocol. Our WorkQ library implementation enables an MPI+X hybrid programming model where the X comprises SysV message queues and the user’s choice of SysV, POSIX, and MPI shared memory. We develop a simplified software mini-application that mimics the performance behavior of the TCE at arbitrary scale, and we show that the WorkQ engine outperforms the original model by about a factor of 2. We also show performance improvement in the TCE coupled cluster module of NWChem.
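The producer/consumer split described in the abstract can be illustrated with a small sketch. This is not the WorkQ library or its API: it is a toy model, assuming Python threads in place of on-node processes and a bounded `queue.Queue` in place of the nodewise SysV message queue. The names `run_workq`, `queue_depth`, and the squared-sum "computation" are hypothetical. Two simplifications are made on purpose: consumers store their results directly (in the abstract's model, step 4 belongs to the producers), and dynamic role switching is not modeled.

```python
import queue
import threading


def run_workq(num_items, num_producers=1, num_consumers=3, queue_depth=4):
    """Toy producer/consumer pipeline over a bounded FIFO queue.

    Hypothetical illustration only; it does not use or mimic the real
    WorkQ library interface.
    """
    work_q = queue.Queue(maxsize=queue_depth)  # tunable FIFO queue depth
    next_item = iter(range(num_items))         # stand-in for dynamic work assignment
    item_lock = threading.Lock()
    results = {}
    results_lock = threading.Lock()
    stop = object()                            # sentinel telling a consumer to exit

    def producer():
        while True:
            with item_lock:
                item = next(next_item, None)   # step 1: pick a work item
            if item is None:
                return                         # no work left
            data = [item] * 4                  # step 2: stand-in for a one-sided get
            work_q.put((item, data))           # blocks while the queue is full

    def consumer():
        while True:
            task = work_q.get()                # blocks while the queue is empty
            if task is stop:
                return
            item, data = task
            value = sum(x * x for x in data)   # step 3: local computation
            with results_lock:
                results[item] = value          # simplified stand-in for step 4

    producers = [threading.Thread(target=producer) for _ in range(num_producers)]
    consumers = [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        work_q.put(stop)                       # FIFO order: sentinels follow all work
    for t in consumers:
        t.join()
    return results
```

Calling `run_workq(5)` returns one result per work item. The bounded queue is what lets data fetching run ahead of computation without growing without limit: producers stall when `queue_depth` entries are waiting, and consumers stall when none are.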
Created: Mon Jan 23 20:50:47 2017