Mapping Parallel Algorithms to Message Passing Machines

Xiaoxiong Zhong

Message passing machines provide an opportunity to achieve high performance for applications such as those in scientific computing, in digital signal processing, in simulation and in electronic design automation. Programming such architectures to achieve high performance, however, poses a challenging task to users. One of the major problems is to design a mapping (assignment) scheme for processes to deal with the mismatch between the ideal communication structure for the parallel algorithm and the target architecture. The research in this thesis aims at this problem. In the thesis, two kinds of architectures, multicomputers and systolic arrays, are considered.

In the first part, we concentrate on communication issues in the design of mapping algorithms for a multicomputer. As communication switching technologies advance, major factors which incur communication overhead have changed and should be studied. We empirically study the effect of communication overhead caused by the topological mismatch in a multicomputer Empirical case studies are carried out to qualitatively characterize the impact of several important mapping metrics and architectural factors on the performance of benchmarks. To quantify communication overhead, we study and validate analytical estimation formulae for message latency. The applications of the message latency formulae to a general purpose multicomputer simulator are discussed and a new parallel program performance evaluation framework is proposed.

We then study the problem of reducing communication overhead by utilizing knowledge of the message passing requirement in an application. We propose efficient application-specific routing algorithms to reduce communication overhead. The proposed techniques can be applied to parallel programs with intensive communication on a multicomputer with user-controlled routing capability.

In the second part, we study the problem of mapping a class of algorithms called regular iterative algorithms to systolic arrays. First, the problem of finding an optimal time schedule for regular iterative algorithms is studied. Second, we propose a systematic method to enumerate linear allocation functions to yield spatially regular systolic arrays with some permissible connection constraints. Such a method can be used to design systolic arrays based on various optimization criteria. Finally, we show how to improve efficiency of a systolic array to an almost 100% efficient array by using a quasi-linear transformation function. The results have potential applications in a high level synthesis system or in a silicon compiler.