Next: Discussion Up: Experiments with Compensation Analysis Previous: Sequential Experiments

Parallel Experiments

Table 2 reports the results for parallel execution of the six NAS benchmarks on the Linux Xeon cluster. All of the applications execute as SPMD programs, using MPI message passing for parallelization across 16 processors. For evaluation purposes, we compare the minimum "main" values for each process to those for the MO run. Each process completes its execution separately, resulting in different "main" execution times. We show the range of minimum values (labeled "high" and "low" in the table), the mean error over this range (comparing process-by-process with the minimum results from the MO run), and the error of the "high" value (effectively the execution time of Node 0's "main").


Table 2: Overhead Compensation Results for NAS Benchmarks on Linux Cluster - Parallel

Experiment                      MO            PA            PA-comp       CA            CA-comp
                                (μs)          (μs)          (μs)          (μs)          (μs)
SP-A  min (high)                67369049      67519758      67758618      72968801      73416350
      min (low)                 64346890      64834412      64963104      67047549      67124742
      %error (mean:high)                      0.6 : 0.2     0.8 : 0.5     4.3 : 8.3     4.3 : 8.9
SP-W  min (high)                13874506      14217942      14257427      15336991      13985473
      min (low)                 11306714      11602819      11628739      12539279      11064565
      %error (mean:high)                      2.5 : 2.4     2.5 : 2.7     9.9 : 10.5    -1.5 : 0.7
BT-A  min (high)                76799427      77454300      77839767      85876074      85835820
      min (low)                 74182308      74696115      74937243      78018235      77721303
      %error (mean:high)                      0.6 : 0.8     1.0 : 1.3     5.5 : 11.8    5.4 : 11.7
LU-A  min (high)                36966517      37783314      37629343      52540729      52395303
      min (low)                 34399415      35194131      35099696      43787261      43176436
      %error (mean:high)                      2.2 : 2.2     1.8 : 1.7     27.5 : 42.1   25.7 : 41.7
CG-A  min (high)                4353851       4612676       4525479       8677331       8291439
      min (low)                 1848843       2076113       1950485       4252990       3691704
      %error (mean:high)                      7.4 : 5.9     3.8 : 3.9     84.1 : 99.3   65.6 : 90.4
IS-A  min (high)                5420444       5973752       5836727       8301860       5585069
      min (low)                 2772617       3490618       3080709       5789329       1634756
      %error (mean:high)                      17.9 : 10.2   11.1 : 7.6    76.9 : 53.1   1.4 : 3.0
FT-A  min (high)                8085574       8195461       8088853       9620210       9366497
      min (low)                 5422766       5518819       5485972       6021030       6029058
      %error (mean:high)                      0.8 : 1.3     -0.1 : 0.0    8.9 : 18.9    8.6 : 15.8
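The "high" error entries in Table 2 can be checked directly from the "min (high)" rows. The following is a minimal sketch, assuming the error is the signed difference from the MO value expressed as a percentage of the MO value; that formula is our assumption, as are the names `HIGH`, `percent_error`, and `high_errors`. Only two benchmarks are copied from the table to keep the example short. The mean errors cannot be recomputed this way, because they depend on per-process times that the table does not list.

```python
# "min (high)" times in microseconds, copied from Table 2 for two benchmarks.
HIGH = {
    "SP-A": {"MO": 67369049, "PA": 67519758, "PA-comp": 67758618,
             "CA": 72968801, "CA-comp": 73416350},
    "IS-A": {"MO": 5420444, "PA": 5973752, "PA-comp": 5836727,
             "CA": 8301860, "CA-comp": 5585069},
}


def percent_error(measured, baseline):
    """Signed error of `measured` relative to `baseline`, in percent.

    Assumed formula: 100 * (measured - baseline) / baseline.
    """
    return 100.0 * (measured - baseline) / baseline


def high_errors(benchmark):
    """Percent error of each experiment's "high" time against the MO run."""
    row = HIGH[benchmark]
    return {name: percent_error(value, row["MO"])
            for name, value in row.items() if name != "MO"}


if __name__ == "__main__":
    for bench in HIGH:
        for name, err in high_errors(bench).items():
            print(f"{bench} {name}: {err:.2f}%")
```

Under this assumption, the computed values agree with the "high" entries in the table to within 0.1 percentage point for both benchmarks.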


Overall, the results show that compensation techniques improve performance estimates, except in a few cases where the differences are negligible. This is encouraging. However, we also notice that the results display a variety of interesting characteristics, including differences from the sequential results.

For instance, the PA values for SP-A, BT-A, and FT-A are practically equivalent to the "main"-only (MO) results, yet the sequential profiles show slowdowns. The PA-comp errors are no more than 1.3% in these cases. We believe this suggests that instrumentation intrusion is effectively reduced by parallelization, since fewer events are measured on each process. We would characterize the SP-W flat profile experiments in the same way, since the errors are reasonably small and the minimum ranges are tight.

Other benchmarks show larger relative differences between their high and low minimum "main" execution times. IS-A is one of these. It also has the greatest error for flat profile compensation. Compared to the sequential case, there is a significant reduction in PA error (193.9% to 10.2%) due to intrusion reduction, but the compensated values are off by 11.1% on average per process and by 7.6% for Node 0's "main" time (compared to a 2.1% minimum error in the sequential case). This suggests a possible correlation between a greater range in benchmark execution time and poorer compensation, although it does not explain why.

CG-A also has a significant difference between its "high" and "low" values, but its PA-comp errors are lower than those of IS-A. However, as more events are profiled with callpath instrumentation, the CA and CA-comp errors increase significantly. We also see a slowdown in execution time relative to the sequential CA and CA-comp runs. This is odd. Why, if we assume the measurement intrusion is being reduced by parallelization, do we see an execution slowdown? Certainly, the number of events is affecting compensation performance, as was the case in the sequential execution, but the increase in execution times beyond the sequential results suggests some kind of intrusion interdependency. In addition, we see that the execution time range is widening.

Looking for other examples of widening execution time range with increased number of events, we find additional evidence in the callgraph runs (CA and CA-comp) for SP-A, BT-A, LU-A, and FT-A. The effect for LU-A is particularly pronounced. Together with the observations above, these findings imply a more insidious problem that may limit the effectiveness of our compensation algorithms. We discuss these problems below.
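The widening can be quantified from Table 2 by comparing the spread between the "high" and "low" minimum times in the MO run with the spread in the CA run. The sketch below is illustrative only: the names `RANGES`, `spread`, and `widening_factor` are ours, and the ratio of spreads is just one plausible measure of widening, not one defined in the text.

```python
# (low, high) minimum "main" times in microseconds, copied from Table 2.
RANGES = {
    "SP-A": {"MO": (64346890, 67369049), "CA": (67047549, 72968801)},
    "BT-A": {"MO": (74182308, 76799427), "CA": (78018235, 85876074)},
    "LU-A": {"MO": (34399415, 36966517), "CA": (43787261, 52540729)},
    "CG-A": {"MO": (1848843, 4353851), "CA": (4252990, 8677331)},
    "FT-A": {"MO": (5422766, 8085574), "CA": (6021030, 9620210)},
}


def spread(low, high):
    """Width of the range of per-process minimum times, in microseconds."""
    return high - low


def widening_factor(benchmark, experiment="CA"):
    """Ratio of an experiment's spread to the MO spread (illustrative measure)."""
    rows = RANGES[benchmark]
    return spread(*rows[experiment]) / spread(*rows["MO"])


if __name__ == "__main__":
    for bench in RANGES:
        print(f"{bench}: CA spread is {widening_factor(bench):.2f}x the MO spread")
```

For each benchmark listed, the CA spread exceeds the MO spread, and the ratio is largest for LU-A, consistent with the observation that its effect is particularly pronounced.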


Sameer Shende 2004-06-08