Profiling associates performance metrics with aspects of a program's execution. Normally, it is the program's components being profiled, such as its routines and code blocks, and profile results show how performance data are distributed across these program parts. Execution time is the most common metric, but any source of performance data is valid (e.g., hardware counters [4,6,23,30]). Flat profiles show performance data distributed onto the static program structure, while path profiles  map performance data to dynamic program behavior, most often represented as program execution paths (e.g., routine calling paths [7,11]). Profiling both measures performance data and calculates performance statistics at runtime.
Measured profiling requires direct instrumentation of a program with code to obtain requisite performance data and compute performance statistics. Generally, we are interested in observing events that have entry / exit semantics. Instrumentation is done both at the event ``entry'' point (e.g., routine begin) and the event ``exit'' point (e.g., routine return). Profile statistics report performance metrics observed between event entry and exit. We typically speak of inclusive performance, which includes the performance of other descendant events that occur between the entry and exit of a particular event, as well as of exclusive performance, which does not include descendant performance.
Unfortunately, the measurement code alters the program's execution, primarily in execution time dilation, but it also affects hardware operation. Intrusion will be reflected in the profile results unless techniques can compensate for it. Other research work has sought to characterize measurement overhead as a way to bound intrusion effects or to control the degree of intrusion during execution. For instance, the work by Kranzlmüller  quantifies the overhead of MPI monitors using the benchmarking suite SKaMPI, and Fagot's work  assesses systematically the overhead of parallel program tracing. The work by Hollingsworth and Miller  demonstrates the use of measurement cost models, both predicted and observed, to control intrusion at runtime via measurement throttling and instrumentation disabling.
Our interest is to compensate for measurement intrusion. Thus, we need to understand the intrusion effects are manifested in performance profiles. Let us first consider inclusive execution time profiles, using routines as our generic events. Two timing measurements are necessary to determine inclusive time for a routine every time it is called: one to get the current clock value at time of entry and one to get the clock value at exit. The difference between the clock samples is the inclusive time spent in this call. Let represent the overhead to measure the inclusive time. If the routine is executed times, its inclusive time is increased by measurement overhead.
However, 's inclusive time is also increased by the overhead incurred in the inclusive time measurement of descendant routines invoked after is called and before it exits. If routines are called while is active ( covers all calls to ), the inclusive time is increased additionally by because each descendant routine profiled will incur inclusive overhead.
To calculate 's exclusive time, the inclusive time spent in each direct descendant is subtracted from 's inclusive time. Since this occurs at 's exit, the inclusive times for all direct descendant calls must be summed. If is the measurement overhead to do the summation upon each direct descendant exit, the total overhead that gets reflected in 's exclusive time is for direct descendant calls. (The subtraction from 's inclusive time is small and will be ignored for sake of discussion.) It is important to observe that all overhead accumulated in the inclusive times of 's direct descendants gets cancelled in the exclusive time computation. Unfortunately, if we calculate the inclusive and exclusive times simultaneously, as is normally the case, the overhead must be added to 's inclusive time.
To summarize, the total performance profile of routine will show an increase in inclusive time due to measurement overhead of , where is the total number of times is called, is the number of times descendant routines of are called, and is the number of times 's direct descendant routines are called. The performance profile of routine will also show an increase in exclusive time of due to overhead.
Standard profiling practice does not report the measurement overheads included in per event inclusive and exclusive profiles. Typically, overhead is reported as a percetage slowdown in total execution time, with the implicit assumption that measurement overhead is equally distributed across all measured events. Clearly, this is not the case. Events with a large number of descendant event occurrences will assume a greater proportion of measurement overhead. If we cannot compensate for this overhead, the performance profile will be distorted as a result. Furthermore, if processes of a parallel application exhibit different execution behavior, the parallel profiles will show skewed performance results.