
Performance Mapping and Dynamic Callpath Profiling

Observing meaningful performance events requires placing instrumentation in the program code. However, not all information needed to interpret an event of interest is available prior to execution. A good example occurs in callgraph profiling, where the objective is to determine the distribution of execution time along the dynamic routine calling paths of an application. A callpath of length $k$ is the sequence consisting of the current routine and the last $k-1$ routines that called it. Measuring the execution time spent on a callpath requires identifying the begin and end points during which the callpath is ``active''; these points are just the entry and exit of the called routine. If $k=2$, callpath profiling measures the amount of time spent in a routine for each of its calling parents. The basic problem with callpath profiling is that the identities of all $k$-length calling paths ending at a routine may not be, and generally are not, known until the application finishes its execution. How, then, do we identify the dynamic callpath events in order to make profile measurements?

One approach is to avoid identifying callpaths at runtime altogether: instrument just the basic routine entry and exit events, record these events in a trace, and let trace analysis calculate the callpath profiles afterwards. The problem, of course, is that the generated trace may be excessively large, particularly for large numbers of processors. Unfortunately, the instrumentation and measurement problem is significantly harder if callpath profiles are to be calculated online. If the whole source is available, it is possible to determine the entire static callgraph, enumerate all possible callpaths, and encode this information in the program instrumentation. These callpaths are static in the sense that they could occur; dynamic callpaths are the subset of static callpaths that actually do occur during execution.
Once a callpath is encoded and stored in the program, the dynamic callpath can be determined directly by indexing a table of possible next paths using the current routine id. Once the callpath is known, the performance information can easily be recorded in pre-reserved static memory. This technique was used in the CATCH tool [5]. Unfortunately, it is not a robust solution, for several reasons. First, source-based callpath analysis is non-trivial and may only be available for particular source languages, if at all. Second, the application source code must be available if a source-based technique is used. Third, static callpath analysis is possible at the binary code level, but the routine calls must be explicit and not indirect; this complicates C++ callpath profiling, for instance.

To deliver a robust, general solution, we decided to pursue an approach in which the callpath is calculated and queried at runtime. The TAU measurement system already maintains a callstack that is updated with each entry/exit performance event (e.g., routine entry and exit). Thus, to determine the $k$-length callpath when a routine is entered, all that is necessary is to traverse up the callstack to find the last $k-1$ events that, together with the current routine, define the callpath. If this is a newly encountered callpath, a new measurement profile must be created at that time, because it was not pre-allocated. The main problem is how to do all of this efficiently. Mapping a callpath's identity to its profile measurement is an example of what we call performance mapping. TAU implements a performance mapping API based on the Semantic Event Association and Attribute (SEAA) model [14]. Here an association is built between the identity of a performance event (e.g., a callpath) and a performance measurement object (e.g., a profile) for that event. Thus, when the event is encountered, the measurement object linked to it can be found via a lookup in a mapping table, and the measurement made.
In the case of callpath performance mapping, new callpaths occur dynamically, requiring new profile objects to be created at runtime. This can be done efficiently using the TAU mapping API. The callpath name is then hashed to serve as the index for future reference. Because routine identifiers can be long strings, TAU optimizes this process by computing the hash from the addresses of the profile objects of the callpath's $k-1$ parents. While the extra overhead of performing this operation is fixed, the accumulated overhead depends on the number of unique $k$-length callpaths encountered in the computation, since each of these needs a separate profile object. We have implemented 2-level callpath profiling in TAU as part of the current TAU performance system distribution. The important result is that this capability is available in all cases where TAU profiling is available. It is not restricted by programming language, nor is source code access required, since dynamic instrumentation (via DyninstAPI [4]) can be used when source is not available. Also, all types of performance measurements are supported, including hardware counts for each callpath. Finally, in the future, we can benefit from the overhead reduction mechanisms to eliminate particular callpaths from measurement consideration. Unfortunately, unlike the static approach, the measurement overhead of this dynamic approach increases as $k$ increases, because we must walk the callstack to determine the callpath. We have discussed several methods to do this more efficiently, but none leads to a fixed overhead for arbitrary $k$, and adopting a general-$k$ solution would make the common 2-level case more costly. Most user requests were for 2-level callpaths, to determine how routine performance is distributed across calling parents, and this is what has been implemented in TAU. It should be noted that there are no inherent limitations to implementing solutions with $k>2$.
Also, if callpaths can be determined statically, TAU could certainly use that information to implement a fixed-cost solution.
Sameer Suresh Shende 2003-02-21