Distributed Memory

In contrast to the shared memory ports of pC++, the principal performance factors for distributed memory versions of the runtime system are message communication latency and barrier synchronization. Collection design and distribution choice are the major influences on the performance of an application, but the runtime system's implementation of message communication and barrier synchronization can also play an important role. In fact, these factors affect performance quite differently on the TMC CM-5 and the Intel Paragon. Communication latency is very low on the CM-5, and the machine has a fast global synchronization mechanism. On the Paragon, communication performance is poor, so good performance requires parallelization schemes that minimize communication.

Figures 4 and 5 show the execution time speedup results for the benchmark suite on the CM-5 and Paragon machines, respectively. The Embar benchmark shows excellent speedup behavior on both machines. On the CM-5, the execution time was within 10 percent of the published hand-optimized Fortran results for this machine. On the Paragon, a 32-node machine achieved 0.71 of the Cray uniprocessor Fortran version; speedup was 19.6 with respect to this code.

The Grid benchmark demonstrated near-linear speedup on the CM-5 even with 64 by 64 sub-blocks, reflecting the low communication overhead relative to sub-block computation time. On the Paragon, however, the high communication overhead required a larger block size, 128 instead of 64, before acceptable speedup could be achieved on this benchmark.
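One way to see why a larger block helps is to compare the number of boundary elements in a square sub-block with the number of elements it holds. The sketch below is illustrative only. It assumes that each sub-block exchanges just its four edges with neighboring sub-blocks and does work proportional to its element count, which the text does not state. The function name `boundary_to_interior` is hypothetical and is not part of pC++.

```python
def boundary_to_interior(block_size: int) -> float:
    """Ratio of edge elements exchanged to elements updated for one square sub-block.

    Assumption (not from the benchmark source): a block_size x block_size
    sub-block sends its four edges and updates every element once per step.
    """
    boundary = 4 * block_size           # elements on the four edges
    interior = block_size * block_size  # elements updated locally
    return boundary / interior


if __name__ == "__main__":
    for size in (64, 128):
        print(size, boundary_to_interior(size))
```

Under these assumptions, doubling the block edge from 64 to 128 halves the amount of data exchanged per element computed, which is consistent with the larger block being needed where communication is expensive.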

Execution time for Poisson is the sum of the time for the FFT transforms and the time for cyclic reduction. Because the transforms require no communication, their performance scales very well on both the CM-5 and the Paragon. In contrast, the communication complexity of the cyclic reduction is nearly equal to its computational complexity. Although communication latency is very low on the CM-5, no speedup was observed in this section even for Poisson grid sizes of 2,048; because of the larger number of processors used, the communication-to-computation ratio was high. With a smaller number of processors, the Paragon did achieve some speedup in the cyclic reduction part.
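The effect of a high communication-to-computation ratio can be illustrated with a toy timing model. This is a sketch under stated assumptions, not the model used for the measurements: computation time is assumed to divide evenly across processors, while communication time is taken as a fixed cost expressed as a fraction of the serial computation time. The name `modeled_speedup` and its parameters are hypothetical.

```python
def modeled_speedup(procs: int, comm_to_comp: float) -> float:
    """Speedup predicted by a toy model.

    Assumptions (illustrative only):
      - serial computation time is normalized to 1.0
      - computation divides perfectly across `procs` processors
      - communication costs `comm_to_comp` (relative to the serial
        computation time) and does not shrink as processors are added
    """
    parallel_time = 1.0 / procs + comm_to_comp
    return 1.0 / parallel_time


if __name__ == "__main__":
    # No communication: ideal speedup equal to the processor count.
    print(modeled_speedup(64, 0.0))
    # Communication cost equal to the serial computation time: no speedup.
    print(modeled_speedup(64, 1.0))
```

In this model, once the communication cost approaches the serial computation time, the predicted speedup stays below 1 no matter how many processors are used, which mirrors the behavior described for the cyclic reduction phase.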

Finally, on the Sparse benchmark, the CM-5 achieved a reasonable speedup and, with 256 processors, matched the performance of the untuned Fortran code on the Cray Y/MP. On the Paragon, the intense communication required by the sparse matrix-vector multiply, coupled with high communication latency, actually resulted in a slowdown as processors were added. We cannot expect improvements in these numbers until Intel finishes its "performance release" of the system.

Currently, Indiana University is experimenting with Sandia's high-performance SUNMOS as the compute-node operating system of its Intel Paragon. SUNMOS increases message passing bandwidth and reduces latency. For compute-bound benchmarks such as Embar, preliminary results show no significant improvement. However, for communication-intensive benchmarks such as Sparse, initial results show an improvement of nearly a factor of 7 over nodes running OSF/1.

mohr@cs.uoregon.edu
Thu Feb 24 13:42:43 PST 1994