Our experience implementing a runtime system for pC++ on five different parallel machines indicates that it is possible to achieve language portability and performance scalability goals simultaneously using a well-defined language/runtime system interface. The key, we believe, is to keep the number of runtime system requirements small and to concentrate on efficient implementations of required runtime system functions.
The three main pC++ runtime system tasks are collection class allocation, collection element access, and barrier synchronization. The implementation approach for each of these tasks differs between distributed memory and shared memory architectures.
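To make the first two tasks concrete, the following is a minimal sketch of collection allocation and element access under a block distribution. It is not the pC++ runtime interface: the name `BlockCollection`, its members, and the choice of a block distribution are assumptions made for illustration, and the processors are simulated by ordinary vectors in a single address space.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical sketch, not the pC++ runtime API: a collection of n elements
// split into contiguous blocks, one block per simulated processor.
struct BlockCollection {
    std::size_t n;       // total number of elements
    std::size_t nprocs;  // number of simulated processors
    std::size_t block;   // elements per processor (last blocks may be shorter)
    std::vector<std::vector<double>> local;  // local[p] = elements owned by p

    // Validate the processor count before dividing by it.
    static std::size_t block_size(std::size_t n, std::size_t nprocs) {
        if (nprocs == 0) throw std::invalid_argument("nprocs must be positive");
        return (n + nprocs - 1) / nprocs;  // ceiling division
    }

    // "Allocation": each processor gets storage only for the elements it owns.
    BlockCollection(std::size_t n_, std::size_t nprocs_)
        : n(n_), nprocs(nprocs_), block(block_size(n_, nprocs_)), local(nprocs_) {
        for (std::size_t p = 0; p < nprocs; ++p) {
            std::size_t begin = p * block;
            std::size_t end = std::min(n, begin + block);
            local[p].assign(end > begin ? end - begin : 0, 0.0);
        }
    }

    // Processor that owns global element i.
    std::size_t owner(std::size_t i) const {
        assert(i < n);
        return i / block;
    }

    // True when processor p can reach element i without a remote access.
    bool is_local(std::size_t i, std::size_t p) const { return owner(i) == p; }

    // "Element access": translate a global index into (owner, local offset).
    double& at(std::size_t i) {
        std::size_t p = owner(i);
        return local[p][i - p * block];
    }
};
```

With 10 elements on 4 simulated processors, each block holds 3 elements, so global element 7 is stored at offset 1 on processor 2. On a real distributed memory machine, the non-local case of `at` would turn into a message exchange rather than a vector lookup.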
In the case of the distributed memory machines, the critical factor for performance is the availability of low latency, high bandwidth communication primitives. (Note that we have not made use of the CM-5 vector units or of highly optimized i860 code in the benchmarks.) While we expect the performance of these communication layers to improve dramatically over the next few months, we also expect to make changes in our compiler and runtime system. One important optimization will be to use barriers as infrequently as possible. In addition, it will be important to overlap more communication with computation.
In the case of shared memory machines, the performance focus shifts to the memory system. Although the BBN TC2000 architecture was classified as a shared memory architecture for this study, the non-uniform times for accessing collection elements on this machine result in runtime system performance characteristics similar to those of the distributed memory systems. The more classic shared memory architecture of the Sequent Symmetry will require a closer study of memory locality trade-offs. Clearly, the choice of where to allocate collections in the shared memory can have important performance implications. In a hierarchical shared memory system, such as the KSR-1, the goal should be to allocate collection elements in a way that maximizes the chance of using the faster memory closer to the processors and that minimizes contention and overhead in accessing remote memory. The problem for the runtime system then becomes which memory allocation attributes to choose. The default choice is not guaranteed to be optimal in every case. Future versions of shared memory runtime systems may use properties of the collection classes to determine the appropriate element layout.
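As a rough illustration of why element layout matters, the sketch below counts how many nearest-neighbor accesses in a one-dimensional collection would touch memory associated with a different processor, under a block layout and under a cyclic layout. The names `Layout`, `owner_of`, and `remote_neighbor_accesses`, the two layouts, and the nearest-neighbor access pattern are all assumptions chosen for the example. They are not taken from the pC++ runtime system, and no machine-specific allocation attributes are modeled.

```cpp
#include <cstddef>

// Hypothetical layouts for placing n elements near nprocs processors.
enum class Layout { Block, Cyclic };

// Processor whose nearby memory holds element i (assumes nprocs > 0, i < n).
std::size_t owner_of(std::size_t i, std::size_t n, std::size_t nprocs,
                     Layout layout) {
    if (layout == Layout::Cyclic) return i % nprocs;
    std::size_t block = (n + nprocs - 1) / nprocs;  // ceiling division
    return i / block;
}

// Count accesses to neighbors i-1 and i+1 that fall on a different owner
// than element i itself. Each such access stands in for a remote reference.
std::size_t remote_neighbor_accesses(std::size_t n, std::size_t nprocs,
                                     Layout layout) {
    std::size_t remote = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t me = owner_of(i, n, nprocs, layout);
        if (i > 0 && owner_of(i - 1, n, nprocs, layout) != me) ++remote;
        if (i + 1 < n && owner_of(i + 1, n, nprocs, layout) != me) ++remote;
    }
    return remote;
}
```

For 16 elements and 4 processors, the block layout yields 6 remote neighbor accesses (two for each of the three block boundaries), while the cyclic layout yields 30, since every neighbor lives with a different owner. A runtime system that knew a collection was accessed in this nearest-neighbor fashion could therefore prefer the block layout. A different access pattern could favor a different choice.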