The first column of Table 1 and Table 2 gives the timing results obtained with the commercial tool Vampir. These values serve as the baseline against which the experimental results obtained with VNG are compared.
The first row in Table 1 and Table 2 shows the loading time for the 327 megabyte file mentioned in Section 5.1. Even the single-worker version of VNG already outperforms standard Vampir. This is mainly due to linear optimizations we made in the design of VNG, and it is the principal reason for the superlinear speedups we observe when we compare a multi-MPI-task run of VNG with standard Vampir. Examining the speedups for 2, 4, 8, and 16 MPI tasks, we see that the loading time is typically halved each time the number of MPI tasks doubles, which indicates that the loading phase scales well. Another important aspect, not shown in the tables, is that the amount of memory required per node is reduced. This allows one to load very large trace files on a clustered system. With standard Vampir, this was only possible on large SMP systems.
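The relationship between loading time, speedup, and the number of MPI tasks can be made concrete with a small calculation. The following is a minimal sketch, not part of VNG: the function names and all timing values are hypothetical and are not taken from Table 1 or Table 2. It assumes speedup is defined as the reference (standard Vampir) time divided by the VNG time, and efficiency as speedup divided by the number of workers, so that an efficiency above 1 corresponds to a superlinear speedup.

```python
def speedup(reference_time, parallel_time):
    """Speedup of a run relative to a reference run (both in seconds)."""
    return reference_time / parallel_time


def efficiency(reference_time, parallel_time, workers):
    """Speedup per worker; a value above 1.0 means superlinear speedup."""
    return speedup(reference_time, parallel_time) / workers


# Hypothetical load times in seconds, chosen only for illustration.
# They are NOT the measurements reported in Table 1 or Table 2.
reference = 160.0  # assumed baseline (sequential tool)
timings = {1: 80.0, 2: 40.0, 4: 20.0, 8: 10.0, 16: 5.0}  # workers -> time

for workers, t in timings.items():
    s = speedup(reference, t)
    e = efficiency(reference, t, workers)
    print(f"{workers:2d} workers: speedup {s:5.1f}, efficiency {e:4.2f}")
```

In this made-up data set the time halves whenever the worker count doubles, so the efficiency stays constant; because the single-worker run is already faster than the baseline, that constant is greater than 1.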
The second row in both tables gives the update time for the main timeline window. Here the speedup is not as high as for the loading time. This is mainly due to optimizations made at an earlier stage, when we introduced a drawing algorithm whose complexity is governed by two parameters: the pixel width of the display and the number of events to be summarized. From this starting point, only a few more optimizations were possible. Note that the execution time is already quite small.
Row three shows the performance measurements for the calculation of a full-featured profile. On both platforms, the sequential time is significantly higher than the times recorded for the parallel analysis server. The timing measurements show that we succeeded in drastically reducing this time. Absolute values of less than a second, even for the single-worker version, allow smooth navigation in the GUI. The speedups demonstrate the scalability of this functionality. The speedups for the cluster, however, indicate that we are dealing with a system that has a weaker processor/network performance ratio than the SMP machine, although its absolute performance is approximately 4 times higher. This is quite common for a Linux cluster built from COTS components and was therefore not unexpected. With bigger trace files, this effect disappears, because the time spent on calculations increases while the communication time stays the same.
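The last observation, that a fixed communication cost matters less as the trace grows, can be illustrated with a deliberately simple cost model. This is a sketch under assumptions that are not stated in the text: the parallel time is modeled as the sequential computation time divided evenly among the workers plus a constant communication time, and all names and numbers are hypothetical.

```python
def parallel_time(compute_time, comm_time, workers):
    """Assumed model: computation divides evenly, communication is constant."""
    return compute_time / workers + comm_time


def model_speedup(compute_time, comm_time, workers):
    """Speedup of the modeled parallel run over the sequential computation."""
    return compute_time / parallel_time(compute_time, comm_time, workers)


# Hypothetical values in seconds, chosen only to show the trend.
comm = 0.1
for compute in (1.0, 10.0, 100.0):  # stands in for growing trace sizes
    s = model_speedup(compute, comm, workers=16)
    print(f"compute {compute:6.1f} s: modeled speedup on 16 workers {s:5.2f}")
```

Under this model the speedup on 16 workers approaches 16 as the computation time grows, because the constant communication term becomes a smaller share of the total.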
What the numbers do not show is that we have essentially overcome two of the major drawbacks of trace analysis in the past. The server approach allows us to keep performance data as close to its origin as possible; costly and cumbersome transport to a local client is no longer needed. Furthermore, the use of distributed memory on the server side allows us to support today's clustered architectures without loss of generality.