Table 2 reports the results for parallel execution of the six NAS benchmarks on the Linux Xeon cluster. All of the applications execute as SPMD programs, parallelized across 16 processors using MPI message passing. Because each process completes its execution separately, the processes report different ``main'' execution times. For evaluation purposes, we compare the minimum ``main'' value of each process to the corresponding value from the MO run. The table shows the range of minimum values (labeled ``high'' and ``low''), the mean error over this range (computed process by process against the minimum results from the MO run), and the errors of the mean and ``high'' values (the latter being, effectively, the execution time of Node 0's ``main'').
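The error metrics described above can be sketched as follows. This is a minimal illustration, not code from our measurement infrastructure; the function name, the sample values, and the assumption that the ``high'' value is the maximum over processes are ours.

```python
# Hedged sketch of the Table 2 error metrics (illustrative names and data).
# `measured` holds the minimum "main" time observed on each MPI process in a
# profiled run; `baseline_mo` holds the corresponding minima from the MO run.
def error_metrics(measured, baseline_mo):
    # Per-process relative error, comparing process by process against MO.
    per_proc_err = [abs(m - b) / b * 100 for m, b in zip(measured, baseline_mo)]
    mean_err = sum(per_proc_err) / len(per_proc_err)
    # Assumption: the "high" value is the longest-running process's "main"
    # (Node 0), so its error compares the two maxima.
    high_err = abs(max(measured) - max(baseline_mo)) / max(baseline_mo) * 100
    low, high = min(measured), max(measured)
    return low, high, mean_err, high_err
```

For example, with four processes whose measured minima are 10.0, 10.5, 11.0, and 10.2 seconds against a uniform MO baseline of 10.0 seconds, the mean error is 4.25% and the ``high'' error is 10%.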
Overall, the results show that the compensation techniques improve performance estimates, except in a few cases where the differences are negligible. This is encouraging. However, the results also display a variety of interesting characteristics, including differences from the sequential results.
For instance, the PA values for SP-A, BT-A, and FT-A are practically equivalent to the ``main''-only results, yet the sequential profiles show slowdowns. The PA-comp values are within 1% in these cases. We believe this suggests that the instrumentation intrusion is effectively reduced by parallelization, since fewer events are measured on each process. We tend to characterize the SP-W flat profile experiments in the same way, since the errors are reasonably small and the minimum ranges are tight.
Other benchmarks show differences in their range of minimum ``main'' execution times. IS-A is one of these. It also has the greatest error for flat profile compensation. Compared to the sequential case, there is a significant reduction in PA error (193.9% to 10.2%) due to intrusion reduction, but the compensated values are off by 11.1% on average per process and by 7.6% for Node 0's ``main'' time (compared to a 2.1% minimum error in the sequential case). This suggests a possible correlation between a greater range in benchmark execution time and poorer compensation, although it does not explain why.
CG-A also has a significant difference between its ``high'' and ``low'' values, but its PA-comp errors are lower than IS-A's. However, as more events are profiled with callpath instrumentation, the CA and CA-comp errors increase significantly. We also see a slowdown in execution time relative to the sequential CA and CA-comp runs. This is odd: if the measurement intrusion is being reduced by parallelization, why do we see an execution slowdown? Certainly, the number of events affects compensation performance, as it did in the sequential execution, but the increase in execution times beyond the sequential results suggests some kind of intrusion interdependency. In addition, we see the execution time range widening.
Looking for other examples of widening execution time range with increased number of events, we find additional evidence in the callgraph runs (CA and CA-comp) for SP-A, BT-A, LU-A, and FT-A. The effect for LU-A is particularly pronounced. Together with the observations above, these findings imply a more insidious problem that may limit the effectiveness of our compensation algorithms. We discuss these problems below.