Case Study

We use the infrastructure described in Section 4 to measure and model the performance of a component-based scientific simulation code. The code simulates the interaction of a shock wave with an interface between two gases. The scientific details are in [20]. The code employs Structured Adaptive Mesh Refinement [21,22,23] for solving a set of Partial Differential Equations (PDE) called the Euler equations. Briefly, the method consists of laying a relatively coarse Cartesian mesh over a rectangular domain. Based on some suitable metric, regions requiring further refinement are identified, the grid points flagged and collated into rectangular children patches on which a denser Cartesian mesh is imposed. The refinement factor between parent and child mesh is usually kept constant for a given problem. The process is done recursively, so that one ultimately obtains a hierarchy of patches with different grid densities, with the finest patches overlaying a small part of the domain. The more accurate solution from the finest meshes is periodically interpolated onto the coarser ones. Typically, patches on the coarsest level are processed first, followed recursively by their children patches. Children patches are also processed a set number of times during each recursion.

Figure 1 shows a snapshot from the simulation. The boxes outlined in purple are the coarse patches, those in red are ones which have been refined once (level 1 patches) and those in blue, refined twice (level 2 patches). The factor of refinement is 2 and the sequence of processing is

, where

is the set of patches on level

. Patches can be of any size or aspect ratio. This sequence is repeated multiple times.

Figure 2 shows the component version of the code. On the left is the ShockDriver, a component that orchestrates the simulation. On its right is AMRMesh that manages the patches. The RK2 component below it orchestrates the recursive processing of patches. To its right are States and EFMFlux which are invoked on a patch-by-patch basis. The invocations to States and EFMFlux include a data array (a different one for each patch) and an output array of the same size. Both these components can function in two modes - sequential or strided array access to calculate X- or Y-derivatives respectively - with different performance consequences. Neither of these components involve message passing, most of which is done by AMRMesh. We will attempt to model the performance of both States and EFMFlux and analyze the message passing costs of AMRMesh. We will also analyze the performance of another component, GodunovFlux, which can be substituted for EFMFlux. Three proxies, one each for States, GodunovFlux and EFMFlux were created and interposed between InviscidFlux and the component in question. A proxy was also written for AMRMesh to capture message-passing costs. An instance each of Mastermind and TAUMeasurement component were created for performance measurement and recording.

The simulation was run on three processors of a cluster of dual 2.8 GHz Pentium Xeons with 512 kB caches. gcc version 3.2 was used for compiling with -O2 optimization. Figure 3 shows where most of the time is spent in the component code. About 25% of the time is spent in MPI_Waitsome() which is invoked from two methods in AMRMesh - one that does ``ghost-cell updates'' on patches (gets data from abutting, but off-processor patches onto a patch) and the other that results in load-balancing and domain (re-) decomposition. The other methods, one in States and the other in GodunovFlux are modeled below.

In Figure 4 we plot the execution times for States for both the sequential and strided mode of operation. We see that for small, largely cache-resident arrays, both the modes take roughly the same time. As the arrays overflow the cache, the strided mode becomes more expensive and one sees a localization of timings. In Figure 5, we plot the ratio of strided and sequential access times. The ratio varies ranges from 1 for small arrays to around 4 for large ones. Further, for larger arrays, one observes large scatters. Similar phenomena are also observed for both GodunovFlux and EFMFlux.

During the execution of the application, both the X- and Y-derivatives are calculated and the two modes of operation of these components are invoked in an alternating fashion. Thus, for performance modeling purposes, we consider an average. However, we also include a standard deviation in our analysis to track the variability introduced by the cache. It is expected that both the mean and the standard deviation will be sensitive to the cache size. In Figures 6,7 and 8 we plot the execution times for the States, GodunovFlux and EFMFlux components. Regression analysis was used to fit simple polynomial and power laws, which are also plotted in the figures. The mean execution time scales linearly with the array size, once the cache effects have been averaged out. The standard deviations exhibit some variability, but they are significant only for GodunovFlux, a component that involves an internal iterative solution for every element of the data array. Note that these timings do not include the cost of the work done in the proxies, since all the extraction and recording of parameters is done outside the timers and counters that actually measure the performance of a component. Further, these instrumentation related overheads are small and will not be addressed in this paper.

If $T_{States}, T_{Godunov}$ and $T_{EFM}$ are the execution times (in microseconds) for States, GodunovFlux and EFMFlux and

the input array size, the best-fit expressions for the three components are

$\displaystyle T_{States}$	$\textstyle =$	$\displaystyle \exp( 1.19 \log(Q) -3.68)$
$\displaystyle T_{Godunov}$	$\textstyle =$	$\displaystyle -963 + 0.315 Q$
$\displaystyle T_{EFM}$	$\textstyle =$	$\displaystyle -8.13 + 0.16 Q$	(1)

$\displaystyle \sigma_{States}$	$\textstyle =$	$\displaystyle \exp( 1.29 \log Q )$
$\displaystyle \sigma_{Godunov}$	$\textstyle =$	$\displaystyle -526 + 0.152 Q$
$\displaystyle \sigma_{EFM}$	$\textstyle =$	$\displaystyle 66.7 -0.015 Q + 9.24 \times 10^{-7}Q^2 -1.12 \times 10^{-11} Q^3 + 3.85 \times 10^{-17} Q^4$	(2)

In Figure 9 we plot the communication time spent at different levels of the grid hierarchy during each communication (``ghost-cell update'') step. We plot data for processor 0 first. During the course of the simulation, the application was load-balanced once, resulting in a different domain decomposition. This is seen in a clustering of message passing times at Level 0 and 2. Ideally, these clusters should have collapsed to a single point; the substantial scatter is caused by fluctuating network loads. Inset, we plot results for all the 3 processors. A similar scatter of data points is seen. Comparing with Figures 6, 7 and 8, we see that message passing times are generally comparable to the purely computational loads of States and GodunovFlux, and it is unlikely that the code, in the current configuration (the given problem and the level of accuracy desired) will scale well. This is also borne out by Figure 3 where almost a quarter of the time is shown to be spent in message passing.