The high-level approach we have taken for online parallel profiling is shown in Figure 2. The TAU performance system maintains profiling statistics on a per-context basis, for each thread within a context. Normally, TAU collects performance profiles at the end of the program run into profile files, one for each thread of execution. For online profiling, TAU provides a ``profile dump'' routine that, when called by the application, updates the profile statistics for each thread, bringing them to internally consistent states, and then outputs the profile data to files.
The performance data access model we have implemented and used in TAU is a push model. The application scenario we target is one with major phases and/or iterations in the computation where one would like to capture the current profile at those time steps. Thus, at these points, the application calls the TAU profile dump routine to output the performance state. Each call of the dump routine either generates a new set of profile files or appends to files containing earlier profile dumps. The updating of the profile dump files is used to ``signal'' the external profile analysis tools.
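The push model above can be sketched as follows. Here `profile_dump` is a hypothetical stand-in for TAU's dump routine, and the on-disk record format and file naming are invented for illustration only:

```python
import json
import os

def profile_dump(profile, dump_dir, step, append=True):
    """Hypothetical stand-in for TAU's profile dump routine.

    Brings per-thread statistics to a consistent state (trivial here)
    and pushes them to the shared file system, one file per thread.
    Each call appends a new snapshot record (or, with append=False,
    overwrites to start a fresh file set).
    """
    os.makedirs(dump_dir, exist_ok=True)
    for thread_id, events in profile.items():
        path = os.path.join(dump_dir, f"profile.0.0.{thread_id}")
        with open(path, "a" if append else "w") as f:
            # One JSON record per dump call; readers treat the file
            # update itself as the signal that new data is available.
            f.write(json.dumps({"step": step, "events": events}) + "\n")

# Application main loop: dump at the end of each major phase/iteration.
dump_dir = "tau_dumps"
profile = {0: {"compute": {"calls": 0, "incl_usec": 0}}}
for step in range(3):
    profile[0]["compute"]["calls"] += 1          # pretend instrumentation
    profile[0]["compute"]["incl_usec"] += 1000
    profile_dump(profile, dump_dir, step)
```

Each iteration leaves one more snapshot record in the per-thread dump file, so an external reader can reconstruct the profile at every captured time step.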
One of the advantages of this approach is that it can be made portable and robust. The only requirement is support for a shared file system, using NFS or some other protocol. It is possible to implement a push model in the TAU performance system using a signal handler approach, but this introduces other system dependencies that are less robust.
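On the analysis side, detecting the ``signal'' requires nothing beyond the shared file system. A minimal sketch, assuming the external tool simply polls the dump file's metadata (the polling strategy is an illustrative choice, not a mechanism the text prescribes):

```python
import os

def poll_for_update(path, last_state):
    """Return (changed, new_state), where state is (mtime_ns, size).

    An external analysis tool polls the dump file on the shared file
    system; a change in modification time or size is the 'signal'
    that the application has pushed a new profile dump.
    """
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False, last_state
    state = (st.st_mtime_ns, st.st_size)
    return state != last_state, state

# Sketch: the application appends a dump; the reader notices on its
# next poll because the file's size (and mtime) changed.
path = "profile.0.0.0"
with open(path, "w") as f:
    f.write("dump 1\n")
_, state = poll_for_update(path, None)   # initial observation
with open(path, "a") as f:
    f.write("dump 2\n")                  # application pushes a new dump
changed, state = poll_for_update(path, state)
```

Because only `stat` calls are involved, this works over NFS or any other shared file system protocol, which is what makes the approach portable.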
A valid argument against this approach is that it has problems as the application scales: the number of files increases and the file system becomes a bottleneck. There are four mechanisms we are investigating to address this problem. First, thread profiles for a context can be merged into a single context profile file. This directly reduces the number of files when there are multiple threads per context. Second, the profile dump routine allows event selection, thereby reducing the amount of profile data saved. The third mechanism is to utilize a data reduction network facility, such as Wisconsin's MRNet, to gather and merge thread/context profiles using the parallel communication hardware before producing output files. This addresses both file system scaling and the large number of files, by merging profile data streams in parallel before generating profile output files. Finally, the fourth mechanism is to leverage the more powerful I/O hardware and software infrastructure that one would expect to be present as the parallel system is scaled (e.g., a parallel file system, multiple I/O processors, clustered file system software).
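The first two mechanisms can be sketched together. The per-event statistics and their field names here are hypothetical, not TAU's actual profile format; the point is that merging threads into one context profile (mechanism one) and filtering to selected events (mechanism two) both shrink what must be written out:

```python
def merge_context_profile(thread_profiles, selected_events=None):
    """Merge per-thread event statistics into one context profile.

    thread_profiles: {thread_id: {event: {"calls": int, "incl_usec": int}}}
    selected_events: optional set of event names; events outside it are
    dropped before merging, reducing the profile data saved.
    """
    merged = {}
    for events in thread_profiles.values():
        for name, stats in events.items():
            if selected_events is not None and name not in selected_events:
                continue  # event selection: skip unselected events
            slot = merged.setdefault(name, {"calls": 0, "incl_usec": 0})
            slot["calls"] += stats["calls"]        # aggregate across threads
            slot["incl_usec"] += stats["incl_usec"]
    return merged

# Two threads in one context; only "compute" is selected for output.
threads = {
    0: {"compute": {"calls": 10, "incl_usec": 5000},
        "io":      {"calls": 2,  "incl_usec": 800}},
    1: {"compute": {"calls": 12, "incl_usec": 6200}},
}
context = merge_context_profile(threads, selected_events={"compute"})
```

The same pairwise merge is what a reduction network such as MRNet would apply at each internal node of its tree, so that profile streams shrink in parallel on the way to the final output files.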