The general architecture we envision for online performance observation is shown in Figure 1. The online nature is determined by the ability to access performance data during execution and make it available to analysis and visualization tools, which are typically external. Additionally, performance interaction is made possible through a performance control path back into the parallel system and software, through which instrumentation and measurement mechanisms may be changed at runtime.
How performance data is accessed is an important factor for online operation. Different access models are possible with respect to the general architecture. A Push model follows a producer/consumer style of access and data transfer: the application decides when, what, and how much data to send, and can do so in several ways, such as through files or direct communication. The external analysis tools are consumers of the performance data, whose availability can be signalled passively or actively. In contrast, a Pull model follows a client/server style of access and transfer: the application acts as a performance data server, and the external analysis tool decides when to make requests. Of course, doing so requires a two-way communication mechanism, either directly with the application or through some form of performance control component. Combined Push/Pull models are also possible.
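The Pull model's client/server interaction can be sketched as follows. This is a minimal, hypothetical illustration (class and method names are ours, not from any actual tool): the application side holds its measurement data and answers tool-initiated snapshot requests; in a real system the `pull` call would arrive over a communication channel rather than a direct method call.

```python
import threading

class PerfDataServer:
    """Hypothetical sketch of the Pull model: the application retains
    its performance data and serves snapshots on external request."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}  # event name -> cumulative count

    def record(self, event, value=1):
        # Called from the application's measurement layer (Push side
        # of measurement, but data stays local until requested).
        with self._lock:
            self._counters[event] = self._counters.get(event, 0) + value

    def pull(self):
        # Invoked when the external analysis tool decides to request
        # data; the lock yields a consistent snapshot.
        with self._lock:
            return dict(self._counters)

server = PerfDataServer()
server.record("mpi_send", 3)
server.record("mpi_recv", 2)
snapshot = server.pull()  # tool-initiated request
```

The essential point is the inversion of control: unlike the Push model, the consumer chooses the access time, so the application must keep its data in a state from which a consistent snapshot can be served at any moment.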
Online profiling requires performance profile data, distributed across the parallel application in thread (process) memory, to be gathered and delivered to the profile analysis tool. Profiling typically involves stateful runtime analysis that may or may not be consistent at the time access is requested. To obtain valid profile data, it may be necessary to update execution state (e.g., callstack information) or make certain assumptions about operation completion (e.g., to obtain communication statistics). Assuming this is possible, online profiling will then produce a sequence of profile samples, allowing interval-based and multi-sample performance analysis. The delay for profile collection sets a lower bound on the sampling interval, and thus an upper bound on sampling frequency. This delay is expected to increase with greater parallelism.
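Interval-based analysis over such a sequence of samples amounts to differencing consecutive cumulative profiles. The following sketch (the helper name and data layout are illustrative assumptions, not a tool interface) shows how per-interval behavior is recovered from cumulative samples:

```python
def interval_deltas(samples):
    """Given a sequence of cumulative profile samples (each a dict
    mapping a routine name to cumulative time), return the per-interval
    differences, i.e., what each routine accrued between samples."""
    deltas = []
    prev = {}
    for sample in samples:
        # Routines first seen in this sample get an implicit prior of 0.
        deltas.append({k: v - prev.get(k, 0) for k, v in sample.items()})
        prev = sample
    return deltas

# Two online samples: cumulative times after the first and second pull.
samples = [
    {"compute": 10.0, "comm": 2.0},
    {"compute": 25.0, "comm": 5.0},
]
per_interval = interval_deltas(samples)
```

Because the profiles are cumulative, a lost or skipped sample merely coarsens the intervals rather than corrupting the totals, which is one reason interval differencing is a natural fit for online collection.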
Similarly, online tracing requires the gathering and merging of trace buffers distributed across the parallel application. The buffers may be flushed afterwards, so that only the trace records generated since the last flush are read. Such interval tracing may require ``ghost events'' to be generated before the first event and after the last event to make the trace data consistent. If the tracing system dynamically registers event identifiers per execution thread, it will be necessary to make these identifiers uniform before analysis. (Static schemes do not have this problem, but require instead that all possible events be defined beforehand.)
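The identifier-unification step can be sketched as a remapping pass applied before the per-thread buffers are merged. In this hypothetical illustration (data structures and names are ours, not from an actual tracing system), each thread carries its own local table from identifier to event name, and a global table is built on the fly:

```python
def unify_event_ids(per_thread_tables, traces):
    """Map dynamically registered, per-thread (local) event identifiers
    to uniform global ids prior to trace merging and analysis.

    per_thread_tables[t]: dict mapping thread t's local id -> event name
    traces[t]: list of local event ids recorded by thread t
    Returns (global_ids, unified_traces)."""
    global_ids = {}  # event name -> global id
    unified = []
    for table, trace in zip(per_thread_tables, traces):
        # Build this thread's local-id -> global-id remapping.
        remap = {}
        for local_id, name in table.items():
            if name not in global_ids:
                global_ids[name] = len(global_ids)
            remap[local_id] = global_ids[name]
        unified.append([remap[e] for e in trace])
    return global_ids, unified

# Two threads that registered the same events in different orders,
# so the same local id means different events on each thread.
tables = [{0: "send", 1: "recv"}, {0: "recv", 1: "send"}]
traces = [[0, 1], [0, 1]]
global_ids, unified = unify_event_ids(tables, traces)
```

A static scheme would make this pass unnecessary, since every thread would share one predefined table; the cost, as noted above, is that all possible events must be enumerated before execution begins.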