The principal runtime system factors for performance on non-shared, distributed-memory ports of pC++ are message communication latency and barrier synchronization. These factors influence performance quite differently on the TMC CM-5 and the Intel Paragon. For the CM-5, experiments were performed on 64, 128, and 256 processors. Because this machine is large relative to the others in the paper, we ran several of the benchmarks on larger problems. For the BM1 code running on a 16 by 16 grid with 64 by 64 sub-blocks, near-linear speedup was observed, indicating good data distribution and low communication overhead relative to sub-block computation time. Execution time for BM2 is the sum of the time for the FFT transforms and the cyclic reduction. Because the transforms require no communication, performance scales perfectly there. In contrast, the cyclic reduction has a communication complexity nearly equal to its computational complexity. Although communication latency is very low on the CM-5, no speedup was observed in this phase, even for Poisson grid sizes of 2,048. For the benchmark as a whole, only a 25 percent speedup was observed from 64 to 256 processors. As expected, BM3 showed near-linear speedup. More importantly, its execution time was within 10 percent of the published, manually optimized Fortran results for this machine. For the BM4 benchmark, we used the full problem size on the CM-5. While the megaflop rate is low, it matches that of the un-tuned Fortran code on the Cray Y-MP.
Results for the Paragon show a disturbing lack of performance in the messaging system, attributed primarily to the pre-release nature of this software. Experiments were performed on 4, 16, and 32 processors. The BM1 benchmark required a different block size, 128 instead of 64, before acceptable speedup could be achieved, indicative of the increased communication overhead. At first glance, the speedup improvement for BM2 contradicts what was observed on the CM-5. However, using a smaller number of processors, as in the Paragon case, alters the communication/computation ratio: collection elements mapped to the same processor can share data without communication, whereas if the collection is spread over a large number of processors, almost all references from one element to another involve network traffic. Similar speedup behavior was observed on the CM-5 for equivalent numbers of processors. For the BM3 benchmark, a 32-node Paragon achieved 0.71 of the performance of the Cray uniprocessor Fortran version, with a speedup of 19.6. The most significant results, however, are for the BM4 benchmark. Here, the execution time increased as processors were added, because of the intense communication required in the sparse matrix-vector multiply. We cannot expect improvements in these numbers until Intel finishes its ``performance release'' of the system.