Profiling is an important technique for the performance analysis of parallel applications. However, the measurement overhead incurred during profiling can intrude on the parallel performance behavior. Generally speaking, the greater the measurement overhead, the greater the chance that the measurement will result in performance intrusion. Thus, there is a fundamental tradeoff in profiling methodology between the need for measurement detail (determined by the number of events and their frequency of occurrence) and the desired accuracy of profiling results. We argue that without an understanding of how intrusion affects performance behavior, and without a way to adjust for intrusion effects in profiling calculations, the accuracy of the profiling results is uncertain. Most parallel profiling tools quantify intrusion as a percentage slowdown of the whole execution and regard this as an implicit measure of profiling goodness. This is unsatisfactory since it assumes that overhead is evenly distributed across all threads of execution and that all profiling results are uniformly affected.
Our early work in parallel perturbation analysis [11,12,13] demonstrated the ability to track performance intrusion and remove its effects from performance analysis results. However, there we had the luxury of a fully qualified event trace that included synchronization events exposing dependent operations. This allowed us to recover execution sequences and derive performance results for an approximated ``uninstrumented'' execution. The same perturbation theory applies when profiling measurements are used, but the analysis must then be performed online.
This paper contributes models for measurement overhead compensation derived from a rational reconstruction of fundamental parallel profiling scenarios. Using these models, we describe a general on-the-fly algorithm that can be used for message passing parallel programs. The errors encountered in our earlier work on the NAS parallel benchmarks, which resulted from our simpler overhead and compensation models, should now be reduced. However, implementing this algorithm requires the ability to piggyback delay values on send messages and to process those values at the receiver. We are currently developing an MPI wrapper library to support delay piggybacking that we can use to validate our approach. Our implementation is intended to be portable across MPI implementations and will not require transmission of multiple messages. This scheme will be incorporated in the TAU performance system.