Parallel Experiments

Table 2 reports the results for parallel execution of the six NAS benchmarks on the Linux Xeon cluster. All of the applications execute as SPMD programs, using MPI message passing for parallelization across 16 processors. For evaluation purposes, we compare the minimum ``main'' values for each process to those for the MO run. Each process completes its execution separately, resulting in different ``main'' execution times. We show the range of minimum values (labeled ``high'' and ``low'' in the table), the mean error over this range (comparing process by process with the minimum results from the MO run), and the error of the ``high'' value (effectively the execution time of Node 0's ``main'').

Table 2: Overhead Compensation Results for NAS Benchmarks on Linux Cluster - Parallel

Experiment	MO	PA	PA-comp	CA	CA-comp
	$\mu$secs	$\mu$secs	$\mu$secs	$\mu$secs	$\mu$secs
SP-A min (high)	67369049	67519758	67758618	72968801	73416350
min (low)	64346890	64834412	64963104	67047549	67124742
%error (mean:high)		0.6 : 0.2	0.8 : 0.5	4.3 : 8.3	4.3 : 8.9
SP-W min (high)	13874506	14217942	14257427	15336991	13985473
min (low)	11306714	11602819	11628739	12539279	11064565
%error (mean:high)		2.5 : 2.4	2.5 : 2.7	9.9 : 10.5	-1.5 : 0.7
BT-A min (high)	76799427	77454300	77839767	85876074	85835820
min (low)	74182308	74696115	74937243	78018235	77721303
%error (mean:high)		0.6 : 0.8	1.0 : 1.3	5.5 : 11.8	5.4 : 11.7
LU-A min (high)	36966517	37783314	37629343	52540729	52395303
min (low)	34399415	35194131	35099696	43787261	43176436
%error (mean:high)		2.2 : 2.2	1.8 : 1.7	27.5 : 42.1	25.7 : 41.7
CG-A min (high)	4353851	4612676	4525479	8677331	8291439
min (low)	1848843	2076113	1950485	4252990	3691704
%error (mean:high)		7.4 : 5.9	3.8 : 3.9	84.1 : 99.3	65.6 : 90.4
IS-A min (high)	5420444	5973752	5836727	8301860	5585069
min (low)	2772617	3490618	3080709	5789329	1634756
%error (mean:high)		17.9 : 10.2	11.1 : 7.6	76.9 : 53.1	1.4 : 3.0
FT-A min (high)	8085574	8195461	8088853	9620210	9366497
min (low)	5422766	5518819	5485972	6021030	6029058
%error (mean:high)		0.8 : 1.3	-0.1 : 0.0	8.9 : 18.9	8.6 : 15.8
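The ``high'' error entries in Table 2 can be reproduced from the tabulated times alone. The sketch below does this for SP-A, assuming the error is the relative difference between a run's ``high'' minimum and the MO ``high'' minimum. The function names are illustrative, not taken from the measurement tools. The published figures appear to be truncated rather than rounded to one decimal place, so the sketch truncates as well; that is an inference from the numbers, not something the text states. The mean error cannot be recomputed here because it requires the per-process values.

```python
import math


def high_error_pct(run_high_usecs, mo_high_usecs):
    """Percent error of a run's "high" minimum relative to the MO run."""
    return 100.0 * (run_high_usecs - mo_high_usecs) / mo_high_usecs


def truncate_1dp(x):
    """Truncate toward zero to one decimal place (assumed table convention)."""
    return math.trunc(x * 10) / 10


# SP-A "min (high)" values from Table 2, in microseconds.
SP_A_HIGH = {
    "MO": 67369049,
    "PA": 67519758,
    "PA-comp": 67758618,
    "CA": 72968801,
    "CA-comp": 73416350,
}

sp_a_errors = {
    name: truncate_1dp(high_error_pct(value, SP_A_HIGH["MO"]))
    for name, value in SP_A_HIGH.items()
    if name != "MO"
}
print(sp_a_errors)
```

Under these assumptions the computed values match the SP-A ``high'' entries in the table (0.2, 0.5, 8.3, and 8.9).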

Overall, the results show that compensation techniques improve performance estimates, except in a few cases where the differences are negligible. This is encouraging. However, we also notice that the results display a variety of interesting characteristics, including differences from the sequential results.

For instance, the PA values for SP-A, BT-A, and FT-A are practically equivalent to the ``main''-only (MO) results, yet the sequential profiles show slowdowns. The PA-comp values are within roughly 1% in these cases. We believe this suggests that the instrumentation intrusion is being effectively reduced by parallelization, which results in fewer events being measured on each process. We tend to characterize the SP-W flat profile experiments in the same way, since the errors are reasonably small and the minimum ranges are tight.

Other benchmarks show differences in their range of minimum ``main'' execution times. IS-A is one of these. It also has the greatest error for flat profile compensation. Compared to the sequential case, there is a significant reduction in PA error (193.9% to 10.2%) due to intrusion reduction, but the compensated values are off by 11.1% on average per process and 7.6% for Node 0's ``main'' time (compared to 2.1% minimum error in the sequential case). This suggests a possible correlation of greater range in benchmark execution time with poorer compensation, although it does not explain why.
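To make the comparison of ranges concrete, the sketch below computes a relative spread for the MO run of each benchmark and lists it next to the PA-comp mean error from Table 2. Defining the range as (high - low) / high is an assumption made here for illustration; the text does not fix a definition, and the names used are hypothetical.

```python
# benchmark: (MO low, MO high) in microseconds, PA-comp mean %error (Table 2).
TABLE2 = {
    "SP-A": (64346890, 67369049, 0.8),
    "SP-W": (11306714, 13874506, 2.5),
    "BT-A": (74182308, 76799427, 1.0),
    "LU-A": (34399415, 36966517, 1.8),
    "CG-A": (1848843, 4353851, 3.8),
    "IS-A": (2772617, 5420444, 11.1),
    "FT-A": (5422766, 8085574, -0.1),
}


def relative_spread_pct(low, high):
    """Width of the [low, high] range as a percentage of the high value."""
    return 100.0 * (high - low) / high


# Sort benchmarks from widest to tightest relative spread.
rows = sorted(
    ((relative_spread_pct(low, high), err, name)
     for name, (low, high, err) in TABLE2.items()),
    reverse=True,
)
for spread, err, name in rows:
    print(f"{name}: spread {spread:5.1f}%  PA-comp mean error {err:5.1f}%")
```

With this definition, CG-A and IS-A have the widest relative spread, which agrees with the text singling them out.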

CG-A also has a significant difference between its ``high'' and ``low'' values, but its PA-comp errors are lower than those of IS-A. However, as more events are profiled with callpath instrumentation, the CA and CA-comp errors increase significantly. Compared to the sequential CA and CA-comp runs, we also see a slowdown in execution time. This is odd. Why, if we assume the measurement intrusion is being reduced by parallelization, do we see an execution slowdown? Certainly, the number of events is affecting compensation performance, as was the case in the sequential execution, but the increase in execution times beyond the sequential results suggests some kind of intrusion interdependency. In addition, we see that the execution time range is widening.

Looking for other examples of a widening execution time range with an increased number of events, we find additional evidence in the callpath runs (CA and CA-comp) for SP-A, BT-A, LU-A, and FT-A. The effect for LU-A is particularly pronounced. Together with the observations above, these findings imply a more insidious problem that may limit the effectiveness of our compensation algorithms. We discuss these problems below.
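The widening can be quantified directly from Table 2. The following sketch computes, for the four benchmarks named above, how much wider the ``high'' to ``low'' range is in the CA and CA-comp runs than in the MO run. Using the ratio of absolute range widths as the measure is an assumption chosen for illustration, and the names are hypothetical.

```python
# benchmark: {run: (low, high)} in microseconds, taken from Table 2.
RANGES = {
    "SP-A": {"MO": (64346890, 67369049), "CA": (67047549, 72968801),
             "CA-comp": (67124742, 73416350)},
    "BT-A": {"MO": (74182308, 76799427), "CA": (78018235, 85876074),
             "CA-comp": (77721303, 85835820)},
    "LU-A": {"MO": (34399415, 36966517), "CA": (43787261, 52540729),
             "CA-comp": (43176436, 52395303)},
    "FT-A": {"MO": (5422766, 8085574), "CA": (6021030, 9620210),
             "CA-comp": (6029058, 9366497)},
}


def width(low_high):
    """Absolute width of a (low, high) range in microseconds."""
    low, high = low_high
    return high - low


# Ratio of each callpath run's range width to the MO range width.
growth = {
    name: {run: width(bounds) / width(runs["MO"])
           for run, bounds in runs.items() if run != "MO"}
    for name, runs in RANGES.items()
}
for name, ratios in growth.items():
    print(name, {run: round(ratio, 2) for run, ratio in ratios.items()})
```

Every ratio exceeds 1 for these benchmarks, and LU-A shows the largest growth in both the CA and CA-comp runs.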