
Detailed Performance Analysis

The pC++ performance environment allows us to investigate interesting performance behavior in more detail. In particular, a pC++ program execution involves several execution components: initialization, processor object instantiation, thread creation, collection creation, runtime system operation, and the application's main algorithm. To associate performance data with these operationally significant components, we formulated an instrumentation model, which we then applied in detailed performance studies. This instrumentation model is represented in Figure 6.

Essentially, we identify events at the beginning and end of the program, as well as at the beginning and end of each major execution component (event capture points are indicated by circled numbers in the figure). From these events, the following time measurements are computed:

(1)-(8) (setup):
The entire benchmark program.
(2)-(8) (fork):
The part of the program that runs in parallel, starting with the forking of processes.
(3)-(7) (main):
The main pC++ program as supplied by the user. It includes static collection allocation, user setup and wrapup, and the parallel algorithm.
(4)-(7) (user):
The computation part representing ``pure'' user code execution, without the overhead of collection allocation.
(5)-(6) (parallel):
The parallel algorithm portion of the pC++ program. The time in this section corresponds to the measurements reported in §5.

Our first application of the above instrumentation and measurement model used tracing to determine the relative influence of the different phases of the entire benchmark execution in which the language and runtime system components are involved. Because the speedup results reported in §5 are only for the parallel section, we wanted to characterize the speedup behavior in the other program regions. An example of the detailed performance information we are able to obtain is shown for the Poisson benchmark in Figures 7, 8, and 9 for the shared memory pC++ ports.

In addition to the phases described above, the figures also show the speedup profiles of the sineTransform (fft) and cyclicReduction (cyclic) functions described in Section 4.2. The graphs clearly show how overall performance is degraded by components of the execution other than the main parallel algorithm, which scales quite nicely. Although some of these components will become relatively less important with scaled problem sizes, understanding where the inefficiencies lie in the pC++ execution system will allow us to concentrate optimization efforts in those areas.

We use profiling measurements as a way of obtaining more detailed performance data about runtime system functions, thread-level application functions, collection class methods, and collection referencing. The pC++ profiler and instrumentation tools allow different levels and types of performance information to be captured. Whereas the type of measurement above helps us identify pC++ system problems, a pC++ programmer may be particularly interested in where the parallel algorithm is spending the most time and in its collection referencing behavior.





mohr@cs.uoregon.edu
Thu Feb 24 13:42:43 PST 1994