In our earlier work, we developed techniques for quantifying the overhead of performance profile measurements and correcting the profiling results to compensate for the measurement error introduced. This work covered two types of profiles: flat profiles and profiles of routine calling paths. The techniques were implemented in the TAU profiling system and demonstrated on the NAS parallel benchmarks. However, the models we developed were based on a local perspective of how measurement overhead impacts the program's execution. Profiling measurements are typically performed for each thread of execution in a program. (Here we use the term ``thread'' in a general sense; it applies equally to shared-memory threads and distributed-memory processes.) By a local perspective we mean one that regards only the overhead impact on the process (thread) where the profile measurement was made and the overhead was incurred.
Consider a message-passing parallel program composed of multiple processes. Most profiling tools would produce a separate profile for each process, showing how time was spent in its measured events. Because the profile measurements are made locally to a process, it is reasonable, as a first step, to compensate for measurement overhead in the process-local profiles only. Our original models do just that. They account for the measurement overhead generated during TAU profiling for each program process (thread) and all its measured events, and then remove that overhead from the inclusive and exclusive performance results calculated during online profiling analysis. The compensation algorithm ``corrects'' the measurement error in the process profiles in the sense that the local overhead is not included in the local profile results.
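To make the idea of local compensation concrete, the following is a minimal sketch, not the TAU implementation. It assumes a simplified model in which every measured call adds a fixed overhead (the hypothetical parameter `overhead_per_call`) that is charged to the exclusive time of the event being measured, and in which an event's inclusive time also absorbs the overhead of all measured calls beneath it. The names `EventProfile` and `compensate_local` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class EventProfile:
    name: str
    calls: int             # number of measured calls to this event
    descendant_calls: int  # measured calls made beneath this event
    exclusive: float       # measured exclusive time, in seconds
    inclusive: float       # measured inclusive time, in seconds


def compensate_local(events, overhead_per_call):
    """Remove local measurement overhead from one process's profile.

    Assumption (illustrative, not the published model): each measured
    call costs a constant overhead_per_call seconds.
    """
    corrected = {}
    for e in events:
        own = e.calls * overhead_per_call
        nested = e.descendant_calls * overhead_per_call
        # Exclusive time carries only the event's own overhead;
        # inclusive time also carries that of its descendants.
        corrected[e.name] = (e.exclusive - own, e.inclusive - own - nested)
    return corrected
```

Under these assumptions, an event measured once with 1000 measured descendant calls and a 1 ms per-call overhead loses about 1 ms of exclusive time and about 1.001 s of inclusive time. Only quantities recorded on the local process enter the correction.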
The models we developed are necessary for compensating measurement intrusion in parallel computations, but they are not sufficient. Depending on the application's parallel execution behavior, it is possible, even likely, that the intrusion effects of measurement overhead on different processes will be interdependent. We use the term ``intrusion'' here specifically to point out that although measurement overhead occurs locally, its intrusion can have non-local effects. As a result, parallel overhead compensation is more complex. In contrast with our past research on performance perturbation analysis [10,11,12], here we do not want to resort to post-mortem parallel trace analysis. The problem of overhead compensation in parallel profiling using only profile measurements (not tracing) has not been addressed before. Certainly, we can learn from techniques for trace-based perturbation analysis, but because we must perform overhead compensation on-the-fly, the utility of these algorithms will be constrained to deterministic parallel execution, for the same reasons discussed in [7,13].
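A toy example illustrates why local compensation is not sufficient. The sketch below assumes two processes, a single blocking receive, and zero message latency; the function names and the integer time units are hypothetical and chosen only for illustration. The sender's measurement overhead delays its send, so the receiver records waiting time that no amount of receiver-local correction can remove.

```python
def receive_wait(sender_ready, receiver_ready):
    """Time the receiver blocks until the message is available."""
    return max(0, sender_ready - receiver_ready)


def receiver_profile(sender_work, receiver_work,
                     sender_overhead, receiver_overhead):
    """Return (wait, measured_total, locally_compensated) for the receiver."""
    sender_ready = sender_work + sender_overhead
    receiver_ready = receiver_work + receiver_overhead
    wait = receive_wait(sender_ready, receiver_ready)
    measured_total = receiver_ready + wait
    # Local compensation subtracts only the receiver's own overhead.
    locally_compensated = measured_total - receiver_overhead
    return wait, measured_total, locally_compensated


# Without instrumentation both processes finish 5 units of work together.
uninstrumented = receiver_profile(5, 5, 0, 0)
# With 2 units of overhead on the sender only, the receiver must wait.
instrumented = receiver_profile(5, 5, 2, 0)
```

In this scenario the uninstrumented receiver takes 5 time units with no waiting, whereas the instrumented run reports 7 units for the receiver even after local compensation, because the 2 units of waiting originate from overhead incurred on the sender.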
At a minimum, algorithms for on-the-fly overhead compensation in parallel profiling must utilize a measurement infrastructure that conveys information between processes at runtime. Note that this is not required for trace-based perturbation analysis, because the analysis is performed offline, and it is what makes compensation in profiling a unique problem. Techniques similar to those used in PHOTON and CCIFT to embed overhead information in MPI messages may aid in the development of such a measurement infrastructure. However, we first need to understand how local measurement overhead affects global performance intrusion so that we can construct compensation models and use those models to develop online algorithms.