

Introduction

The performance of scientific simulations in high performance computing (HPC) environments is fundamentally governed by the (effective) processing speed of the individual CPUs and the time spent in interprocessor communication. On individual CPUs, codes typically execute floating point operations on large arrays. The efficiency of such computations is primarily determined by the performance of the cache (in cache-based RISC and CISC processors), and much effort is devoted to preserving data locality. Interprocessor communication, typically effected by message passing on distributed memory machines (MPPs and SMP clusters), is the other source of performance issues in HPC. Communication costs determine the load-balancing and scalability characteristics of codes, and a multitude of software and algorithmic strategies (combining communication steps, minimizing/combining global reductions and barriers, overlapping communication with computation, etc.) are employed to reduce them.
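As a minimal illustration of the data-locality issue, the sketch below contrasts two traversal orders over the same row-major array; the matrix size and function names are purely illustrative. Only the first ordering reuses each cache line fully before it is evicted.

#include <cstddef>
#include <vector>

// Illustrative sketch: traversal order over a row-major matrix determines
// how well the cache is reused.  The size N and the names are arbitrary.
const std::size_t N = 2048;

// Row-major traversal: consecutive iterations touch adjacent memory,
// so each cache line fetched is fully used before eviction.
double sum_row_major(const std::vector<double>& a) {
    double s = 0.0;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            s += a[i * N + j];
    return s;
}

// Column-major traversal of the same data: each access jumps N*8 bytes,
// so cache lines are typically evicted before their remaining entries are used.
double sum_col_major(const std::vector<double>& a) {
    double s = 0.0;
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            s += a[i * N + j];
    return s;
}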

The discipline of performance measurement has provided us with tools and techniques to gauge the interactions of scientific applications with the execution platform. Typically, these take the form of high precision timers, which report the time taken to execute sections of the code, and hardware counters, which report on the behavior of various hardware components as the code executes. In a parallel environment these tools also track and report the size, frequency, source, destination and time spent in passing messages between processors [1,2,3]. This information can then be used to synthesize a performance model of the application on the given platform; in some cases, these models have even served in a predictive capacity [4,5].
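The sketch below illustrates, in reduced form, the kind of data such tools record around a message-passing call: wall-clock time (here via MPI_Wtime), message size, and the source and destination ranks. The buffer size and report format are illustrative only; the tools cited above gather this information automatically rather than through hand-written calls.

#include <mpi.h>
#include <cstdio>
#include <vector>

// Minimal sketch of the data a message-passing monitor records around a
// point-to-point call.  Run with at least two ranks; the message size is
// purely illustrative.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 100000;                 // illustrative message size
    std::vector<double> buf(count, 1.0);

    if (size >= 2) {
        if (rank == 0) {
            double t0 = MPI_Wtime();
            MPI_Send(buf.data(), count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            double t1 = MPI_Wtime();
            std::printf("send: %d doubles (%zu bytes) to rank 1 in %g s\n",
                        count, count * sizeof(double), t1 - t0);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}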

In order to manage the growing complexity of scientific simulation codes, there has been an effort to introduce component-based software methodology into HPC environments. Popular component models like Java Beans [6] and CORBA [7] are largely unsuitable for HPC [8], so a new, light-weight model called the Common Component Architecture (CCA) [9] was proposed. The principal motivations behind the CCA are to promote code reuse and interdisciplinary collaboration in the high performance computing community. The model consists of modularized components with standard, well-defined interfaces. Since components communicate only through these interfaces, program modification reduces to modifying a single component or switching in a similar component, without affecting the rest of the application. To build a CCA application, an application developer simply composes a set of components using a CCA-compliant framework. Details regarding the flexibility, performance and design characteristics of CCA applications can be found in [10].
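The interfaces below are not the actual CCA API; they are a simplified sketch, with hypothetical port and component names, of the underlying idea: components interact only through abstract, well-defined ports, so one implementation can be replaced by another without modifying its users.

#include <memory>
#include <vector>

// Simplified sketch of the component idea (not the actual CCA API).
struct LinearSolverPort {                      // hypothetical port interface
    virtual void solve(std::vector<double>& x,
                       const std::vector<double>& b) = 0;
    virtual ~LinearSolverPort() = default;
};

struct JacobiSolver : LinearSolverPort {       // one interchangeable implementation
    void solve(std::vector<double>& x, const std::vector<double>& b) override {
        x = b;                                 // placeholder for an iterative solve
    }
};

struct TimeIntegrator {                        // a component that uses the port
    explicit TimeIntegrator(std::shared_ptr<LinearSolverPort> s) : solver(s) {}
    void step(std::vector<double>& u) { solver->solve(u, u); }
    std::shared_ptr<LinearSolverPort> solver;
};

// A framework would create the components and wire the ports together at
// run time; in this sketch the "assembly" is just a constructor call.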

While monolithic applications are hand-tooled under common assumptions and data structures to deliver maximum performance, component-based applications are composed of standalone components, and an injudicious selection of components can result in a correct but sub-optimal assembly. It thus becomes imperative to be able to classify the performance characteristics and requirements of each implementation of a component, and to have a generalized means of synthesizing a composite performance model with which to judge the optimality of a component assembly.

While componentization does not alter the fundamental performance issues in HPC, it does raise new challenges. Unlike monolithic codes, component-based software is seldom used exclusively by the authors of the components themselves, and manual instrumentation of the code is impossible. Further, in a CCA environment, the final application is assembled at run time by loading shared libraries [8]; thus automatic instrumentation of an executable, where a binary is rewritten or instrumented at runtime [11], has little meaning. Consequently, a non-intrusive strategy for performance monitoring is clearly indicated. Each component needs to be monitored to collect not only execution times and hardware characteristics, but also the relevant inputs (such as array sizes) that explain the collected data. These data then need to be synthesized into individual component performance models, which are in turn combined into a composite performance model for the application using the component call-path. It is this model synthesis at the component level that holds promise for automating performance tuning in applications composed of components.
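One common way to realize such non-intrusive monitoring is to interpose a proxy that exports the same interface as the component being measured. The generic sketch below, with a hypothetical port and hand-written reporting, records the call duration and the relevant input size before delegating to the real implementation; it is intended only to convey the idea, not the specific infrastructure described later.

#include <chrono>
#include <cstdio>
#include <memory>
#include <vector>

// Hypothetical port; any implementation can be wrapped without modification.
struct SolverPort {
    virtual void solve(std::vector<double>& x) = 0;
    virtual ~SolverPort() = default;
};

// Proxy: same interface as the real component, so callers are unaware of it.
// It records timing and the input size, then forwards the call.
struct MonitoredSolver : SolverPort {
    explicit MonitoredSolver(std::shared_ptr<SolverPort> impl) : inner(impl) {}
    void solve(std::vector<double>& x) override {
        auto t0 = std::chrono::steady_clock::now();
        inner->solve(x);                               // delegate to the real component
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("solve: n=%zu time=%g s\n", x.size(), secs);
    }
    std::shared_ptr<SolverPort> inner;
};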

Since the containing framework creates, configures and assembles components, it possesses a global understanding of how the components are networked into an application. The framework can therefore compose a performance model of the entire application by combining the models of the participating components. This composite performance model is a dual of the application itself: it holds the promise of serving as the ``cost function'' in an optimization process by which the optimal (from the performance point of view) component application is assembled from multiple implementations of each component. The actual component encapsulates the numerical and data management algorithms used in the computation, while its performance model encapsulates its predicted performance as a function of the high performance environment in which it is to be executed. This reasoning extends to the application's component ensemble, whose performance may be predicted by the composite model but will ultimately be determined by runtime conditions. The material presented in this work is far from realizing this goal, but it is essential that it be viewed as a step toward a completely automated system for performance prediction, and hence optimization, of high performance component-based applications.
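The following sketch conveys the cost-function idea in its simplest form: each implementation of a component carries a model of its predicted execution time, and the framework selects the assembly whose composed prediction is smallest. The component names, model forms and coefficients are invented purely for illustration.

#include <cstddef>
#include <cstdio>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Each implementation carries a predicted-time model (seconds as a function
// of problem size).  All models below are invented for illustration.
struct Implementation {
    std::string name;
    std::function<double(std::size_t)> predicted_time;
};

int main() {
    const std::size_t n = 1 << 20;   // illustrative problem size

    std::vector<Implementation> solvers = {
        {"jacobi",    [](std::size_t m) { return 5e-9 * m * 40; }},  // cheap sweeps, many of them
        {"multigrid", [](std::size_t m) { return 4e-8 * m; }},       // costlier passes, fewer of them
    };
    std::vector<Implementation> integrators = {
        {"explicit", [](std::size_t m) { return 1e-8 * m; }},
        {"implicit", [](std::size_t m) { return 2e-8 * m; }},
    };

    // Compose the per-component predictions along the (trivial) call path
    // and keep the cheapest assembly.
    double best = std::numeric_limits<double>::max();
    std::string choice;
    for (const auto& s : solvers)
        for (const auto& i : integrators) {
            double t = s.predicted_time(n) + i.predicted_time(n);
            if (t < best) { best = t; choice = s.name + " + " + i.name; }
        }
    std::printf("predicted best assembly: %s (%g s)\n", choice.c_str(), best);
    return 0;
}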

In this paper we examine some performance issues peculiar to HPC component environments. Section 2 provides a brief summary of performance measurement and modeling approaches in various component environments. Section 3 elaborates on some performance metrics specific to HPC while Section 4 describes the software infrastructure needed to measure these metrics non-intrusively. Section 5 describes a case study where we measure and model the performance of three components in a scientific simulation of a shock interacting with an interface between two gases. Our concluding remarks and future directions are given in Section 6.

