Sequential Experiments

Table 1 reports the total sequential execution time of ``main'' in microseconds, as measured by each profiling configuration for each application. Both the minimum and the mean values are given. We also calculate the percentage error, for both the minimum and the mean, of each profile's time for ``main'' relative to the MO time. The dataset size (A or W) used in each experiment is indicated.

Table 1: Overhead Compensation Results for NAS Benchmarks on Linux Cluster - Sequential

*Experiment*	MO ($\mu$s)	PA ($\mu$s)	PA-comp ($\mu$s)	CA ($\mu$s)	CA-comp ($\mu$s)
SP-A min	387588657	397602281	392833924	405226516	399405895
mean	388540699	398360423	394245841	407233889	401650317
%error (min:mean)		2.5 : 2.5	1.3 : 1.4	4.5 : 4.8	3.0 : 3.3
SP-W min	65427051	67942093	66404006	71812623	65517453
mean	66178471	69254426	67104562	73659688	66687843
%error (min:mean)		3.8 : 4.6	1.4 : 1.3	9.7 : 11.3	0.1 : 0.7
BT-A min	522765488	549063282	542479898	553178345	532736660
mean	524248915	552617635	545409236	555959945	536680190
%error (min:mean)		4.6 : 5.2	3.4 : 3.8	5.8 : 6.0	1.9 : 2.3
LU-W min	297366632	300993317	302786082	306287598	303405699
mean	299395075	302941264	305796049	307849925	306172285
%error (min:mean)		1.4 : 3.3	0.0 : -0.6	10.2 : 8.9	3.4 : 2.6
CG-A min	5368659	5733951	5740469	6824800	6536302
mean	5560969	5758157	5764569	6916842	6628535
%error (min:mean)		6.8 : 3.5	6.9 : 3.6	27.1 : 24.3	21.7 : 19.1
IS-A min	5967910	17540614	6094620	35457776	2632054
mean	5987002	17667114	6215288	36008102	4441510
%error (min:mean)		193.9 : 195.0	2.1 : 3.8	494.1 : 501.4	-55.8 : -25.8
FT-A min	24593893	25418103	25296244	29104159	28754736
mean	25215853	25549141	25557557	29470907	28918045
%error (min:mean)		3.3 : 1.3	2.8 : 1.3	18.3 : 16.9	16.9 : 14.6
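
The %error rows can be checked with a short calculation. The sketch below assumes the error is the signed difference between a profile's time and the MO time, expressed as a percentage of the MO time; the helper name `percent_error` is ours, not part of TAU. Applied to the IS-A minimum times from Table 1, it agrees with the tabulated values to within 0.1 percentage points.

```python
def percent_error(measured_us: int, reference_us: int) -> float:
    """Signed percentage error of a measured time relative to a reference time."""
    return 100.0 * (measured_us - reference_us) / reference_us


# IS-A minimum times for ``main'' from Table 1, in microseconds.
is_a_min = {
    "MO": 5967910,
    "PA": 17540614,
    "PA-comp": 6094620,
    "CA": 35457776,
    "CA-comp": 2632054,
}

# Error of every profile relative to MO (assumed definition, see lead-in).
errors = {
    name: percent_error(time_us, is_a_min["MO"])
    for name, time_us in is_a_min.items()
    if name != "MO"
}

for name, err in errors.items():
    print(f"{name:8s} {err:8.1f}%")
```

A negative value, as for CA-comp, means the compensated time falls below the MO time, that is, the compensation removed more overhead than was actually incurred.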

An important observation is that the TAU measurement overhead per event is already very small: on the order of 500 nanoseconds for flat profiling on a 2.8 GHz Pentium Xeon processor. This is easily seen in the TAU profile results (not shown), where the overhead estimate is reported as an event in the profile. The slowdown seen in the PA and CA runs naturally depends on the benchmark and on the number of events instrumented and generated during execution. Because callpath profiling creates more events, we expect more slowdown in the CA runs.
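
To illustrate how a small per-event overhead turns into a large slowdown, the following sketch uses a simple linear model in which every event costs the same fixed overhead. The model, the function name `slowdown_percent`, the 5-second base run time, and the event counts are all illustrative assumptions; only the 500-nanosecond figure comes from the text above.

```python
def slowdown_percent(base_us: float, n_events: int, overhead_ns: float) -> float:
    """Slowdown, as a percentage of the uninstrumented run time, under a
    linear model where each event adds a fixed measurement overhead."""
    total_overhead_us = n_events * overhead_ns / 1000.0
    return 100.0 * total_overhead_us / base_us


BASE_US = 5_000_000   # hypothetical uninstrumented run time (5 s)
OVERHEAD_NS = 500     # per-event overhead, order of magnitude quoted in the text

# Hypothetical event counts spanning several orders of magnitude.
for n_events in (10_000, 1_000_000, 100_000_000):
    pct = slowdown_percent(BASE_US, n_events, OVERHEAD_NS)
    print(f"{n_events:>11,d} events -> {pct:8.1f}% slowdown")
```

Under this model the slowdown grows in direct proportion to the event count, which is why a benchmark that generates many events can slow down severely even though each individual measurement is cheap.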

The results show that overhead compensation approximates the total execution time better than uncompensated measurement, for both flat and callpath profiles. This is generally true across the NAS benchmarks we tested. In the case of IS-A, the flat profile compensation (PA-comp) shows remarkable improvement, from a 193.9% error in the PA measurement to within 2.1% of the ``main'' execution time. The improvement in the compensated callpath profile for SP-W, from roughly 10% error to less than 1%, is also impressive.

To be clear, we are instrumenting every routine in the program, and callpaths at every depth. If, as a result, we instrument a small routine that is called many times, overheads can accumulate significantly. For callpath profiling, when the instrumentation includes such a small routine, the overhead is effectively multiplied by the number of callpaths that contain it. This is what is happening in IS-A. Flat profile compensation can deal with the error, but callpath compensation cannot. Interestingly, the reason can be attributed to small differences in the overhead unit estimate, which in this case ranges from 957 nanoseconds (minimum) to 1045 nanoseconds (maximum). This seemingly minor difference of roughly 90 nanoseconds is enough in IS-A callpath profiling to cause major compensation errors. Certainly, the proper course of action is to remove the small routine from instrumentation.
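
The sensitivity to the overhead unit estimate can be sketched numerically. The example below assumes, as a simplification, that compensation subtracts the event count times a single unit overhead from the measured time; this is not presented as TAU's actual algorithm. The event count `N_EVENTS` is hypothetical and was not measured. The times are the IS-A minimum values from Table 1, and the two unit overheads are the minimum and maximum estimates quoted above.

```python
def compensated_us(measured_us: float, n_events: int, unit_overhead_ns: float) -> float:
    """Measured time minus the estimated total overhead (simplified model)."""
    return measured_us - n_events * unit_overhead_ns / 1000.0


N_EVENTS = 30_000_000     # hypothetical event count, not a measured value
MEASURED_US = 35_457_776  # IS-A CA minimum time from Table 1
MO_US = 5_967_910         # IS-A MO minimum time from Table 1

# Compensate with the largest and the smallest overhead unit estimates.
low_us = compensated_us(MEASURED_US, N_EVENTS, 1045)
high_us = compensated_us(MEASURED_US, N_EVENTS, 957)

# The spread between the two results depends only on the event count
# and on the 88 ns gap between the two estimates.
spread_us = high_us - low_us
spread_pct = 100.0 * spread_us / MO_US

print(f"compensated time range: {low_us:,.0f} to {high_us:,.0f} microseconds")
print(f"spread: {spread_us:,.0f} microseconds ({spread_pct:.1f}% of MO time)")
```

With tens of millions of events, an uncertainty of under 100 nanoseconds per event adds up to seconds of uncertainty in the compensated total, which is comparable to the entire run time of a short benchmark such as IS-A.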