The shared memory ports of the pC++ runtime system isolate performance issues related to the allocation of collections in the shared memory hierarchy and to architectural support for barrier synchronization. The choice of collection distribution is clearly important but, as is typical when programming shared memory machines, the ability to achieve good memory locality is critical.
The scaled performance of the shared memory ports of the pC++ system reflects the effectiveness of the collection distribution schemes as well as the influence of the underlying architecture on collection memory allocation, ``remote'' collection element access, and barrier synchronization. Figures 1, 2, and 3 show the speedup of the benchmark programs on the Sequent, BBN, and KSR machines, respectively.
Naturally, the speedup of Embar was excellent on all machines. For the Sequent using 23 processors, the speedup of 15.94 reflects the mismatch between the number of processors and the problem size; the staircase behavior is even more pronounced in the BBN results. The slightly lower speedup for Embar on 64 nodes of the KSR-1 is due to the activation of an additional level of the cache memory hierarchy: a portion of the memory references must cross between cluster rings, encountering latencies 3.5 times higher than those of references made within a 32-processor cluster. This effect is clearly evident for the Grid benchmark on the KSR, whereas the other machines show steady speedup improvement; the behavior in the 40 to 60 processor cases on the BBN TC2000 is again due to load imbalance caused by the distribution not being well matched to that number of processors.
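The staircase behavior described above can be illustrated with a simple load-balance bound. The sketch below is not part of the pC++ runtime; it assumes that a collection of equal-cost elements is dealt out in whole elements, so that the busiest processor holds the ceiling of N/P elements and determines the run time. The function name `block_speedup_bound` and the problem size of 64 elements are hypothetical, chosen only for illustration; they are not the benchmark sizes used in the measurements.

```python
def block_speedup_bound(num_elements: int, num_procs: int) -> float:
    """Upper bound on speedup when equal-cost elements are assigned whole.

    Assumption (not from the measurements): every element costs the same
    and communication and synchronization are free, so the run time is
    set by the processor that holds the most elements.
    """
    if num_elements <= 0 or num_procs <= 0:
        raise ValueError("num_elements and num_procs must be positive")
    # Integer ceiling of num_elements / num_procs: load of the busiest processor.
    max_load = -(-num_elements // num_procs)
    return num_elements / max_load


if __name__ == "__main__":
    N = 64  # hypothetical problem size, for illustration only
    for procs in (16, 23, 32, 40, 60, 64):
        print(procs, round(block_speedup_bound(N, procs), 2))
```

Under these assumptions the bound stays at 32 for every processor count from 32 through 63 and rises to 64 only at 64 processors: adding processors helps only when it lowers the load of the busiest one, which produces the steps seen in a speedup curve.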
The Poisson benchmark performs well on all shared memory machines for all processor counts. This demonstrates pC++'s ability to efficiently assign elements of a collection, such as the distributed vector collection, to processors and to use subclassing to implement high-performance functions on the data, such as cyclic reduction.
The speedup performance on the Sparse benchmark reflects the importance of locality, which is most evident in the KSR results. The uniform memory system of the Symmetry hides many of the effects of poor locality, resulting in a reasonable speedup profile. The NUMA memory system of the BBN TC2000 is more sensitive to locality of reference because of the cost of remote references. We knew that the Sparse implementation was not the most efficient in terms of locality, and this resulted in particularly poor speedup performance on the KSR-1; when crossing cluster boundaries, the drop in speedup is quite severe. Although the pC++ port to the KSR-1 was done most recently and should still be regarded as a prototype, the analysis of the performance interactions between collection design, distribution choice, and the hierarchical, cache-based KSR-1 memory system is clearly important for optimization.
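The sensitivity to locality can be made concrete with a rough cost model. The sketch below is an illustration under stated assumptions, not a measured model of any of the machines: it treats the average memory access cost as a weighted mix of local references (cost 1) and references that cross a cluster boundary (cost equal to a penalty factor). The default penalty of 3.5 is the inter-ring latency ratio quoted above for the KSR-1; the function name `relative_access_cost` and the remote fractions used are hypothetical.

```python
def relative_access_cost(remote_fraction: float, remote_penalty: float = 3.5) -> float:
    """Average access cost relative to a purely local reference stream.

    Assumption: each reference is either local (cost 1.0) or remote
    (cost remote_penalty); caching, contention, and overlap are ignored.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must lie in [0, 1]")
    return (1.0 - remote_fraction) + remote_fraction * remote_penalty


if __name__ == "__main__":
    # Hypothetical remote fractions, for illustration only.
    for fraction in (0.0, 0.1, 0.2, 0.5):
        print(fraction, round(relative_access_cost(fraction), 2))
```

In this simplified model, a distribution that sends 20 percent of its references across cluster rings pays about 1.5 times the local access cost on average, which suggests why a distribution with poor locality suffers most once a computation spans more than one cluster.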