The shared memory ports of pC++ uncover performance issues in the language and runtime system implementation different from those seen in the distributed memory ports. Here, the ability to achieve good memory locality is the key to good performance. Clearly, the choice of collection distribution is important, but the memory allocation schemes in the runtime system also play a large part. To better isolate the performance of runtime system components, and to determine the relative influence of the phases of benchmark execution in which the runtime system was involved, we used a prototype pC++ tracing facility for shared memory performance measurement. In addition to reproducing the performance results reported above for the distributed memory systems, the trace measurements yielded a more detailed execution time and speedup profile. Although space limitations prevent a detailed discussion of these results here, they will appear in a forthcoming technical report.
In general, we were pleased with the speedup results on the Sequent Symmetry, given that it is a bus-based multiprocessor. Speedups on 16 processors were good for all benchmarks: BM1 (14.84), BM2 (14.15), BM3 (15.94), and BM4 (12.33). Beyond 16 processors, contention on the bus and in the memory system stalls further speedup improvement. Although the Sequent implementation serves as an excellent pC++ testbed, the machine's architecture and processor speed limit large scalability studies. The Symmetry pC++ runtime system implementation is, however, representative of ports to shared memory parallel machines with comparable numbers of processors, e.g., the shared memory Cray Y-MP or C90 machines. Using the four-processor Sequent speedup results (3.7 to 3.99) as an indication, one might expect similar speedup performance on these systems. (Note that we are currently porting pC++ to a Cray Y-MP and a C90.)
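How close these figures come to ideal scaling is easiest to see as parallel efficiency, i.e., speedup divided by processor count. A minimal sketch using the 16-processor Sequent figures above (the benchmark names and speedups are from the text; everything else is illustrative):

```python
# 16-processor Sequent Symmetry speedups reported above.
speedups_16 = {"BM1": 14.84, "BM2": 14.15, "BM3": 15.94, "BM4": 12.33}

def efficiency(speedup, nprocs):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / nprocs

for bm, s in speedups_16.items():
    print(f"{bm}: efficiency {efficiency(s, 16):.1%}")
```

By this measure, all four benchmarks retain at least three quarters of ideal speedup on 16 processors, with BM3 nearly linear.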
The performance results for the BBN TC2000 reflect interesting architectural properties of the machine. As on the Sequent, benchmark speedups on 16 processors were encouraging: BM1 (14.72), BM2 (14.99), BM3 (15.92), and BM4 (11.59). BM1 speedup falls off to 23.89 and 32.36 at 32 and 64 processors, respectively, but these results are for a small 8 by 8 grid of subgrids, reflecting the same small-problem-size effects encountered on the CM-5. BM2 speedup continued at a fairly even clip, indicating that the remote collection access costs that produced high communication overhead in the distributed memory versions are better amortized here. BM3 speedup was almost linear, reaching 31.48 on 32 processors and 58.14 on 64 processors. Unlike on the Sequent, BM4 speedup beyond 16 processors showed no significant architectural limitations on performance.
The pC++ port to the KSR-1 was done most recently and should still be regarded as a prototype. Nevertheless, the performance results demonstrate the important architectural parameters of the machine. Up to 32 processors (1 cluster), speedup numbers increase steadily. BM1 to BM3 speedup results are very close to the TC2000 numbers; BM3 speedup on 64 processors was slightly lower (52.71). However, BM4's speedup at 32 processors (9.13) is significantly less than the TC2000's result (17.29), highlighting the performance interaction between the choice of collection distribution and the hierarchical, cache-based KSR-1 memory system. Beyond 32 processors, two or more processor clusters are involved in the benchmark computations; we performed experiments up to 64 processors (2 clusters). As a result, a portion of the remote collection references must cross cluster rings; these references incur latencies 3.5 times those of references made within a cluster. All benchmark speedup results reflect this overhead, falling below their 32-processor values.
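The cost of cross-ring references can be illustrated with a simple average-latency model. The 3.5x slowdown factor is from the measurements above; the base latency and the fraction of references that cross rings are hypothetical parameters introduced only for illustration:

```python
def avg_remote_latency(base_latency, cross_cluster_fraction):
    """Average latency of a remote collection reference when some
    fraction of references must cross cluster rings, each crossing
    costing 3.5x the within-cluster latency (the KSR-1 figure above)."""
    penalty = 3.5  # cross-ring slowdown factor from the text
    return base_latency * ((1 - cross_cluster_fraction)
                           + cross_cluster_fraction * penalty)

# Hypothetical case: if half of the remote references cross rings,
# the average remote latency is 2.25x the within-cluster latency.
print(avg_remote_latency(1.0, 0.5))
```

Even a modest cross-ring fraction thus inflates average remote access cost substantially, consistent with the speedup drop observed beyond one cluster.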