The research ideas and work presented here relate to several areas. There has been long-standing interest in the monitoring of parallel systems and applications, driven by the general hypothesis that observing the runtime behavior or performance of a system or application can identify aspects of parallel execution open to improvement. Several projects have developed techniques that allow parallel applications to respond to program behavior, available resources, or performance factors. The Falcon project  is an example of a computational steering system  that can observe the behavior of an application and provide hooks to alter application semantics. These ``actuators'' lead to changes in the ongoing execution. Because computational steering systems enable direct interaction with the application, they are often developed with visualization frontends that provide graphical renderings of application state and objects for execution control.
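The sensor/actuator pattern underlying such steering systems can be sketched as follows. This is an illustrative sketch only, not Falcon's actual API; all class and variable names are hypothetical.

```python
import threading

class Sensor:
    """Exposes a piece of application state to the steering client."""
    def __init__(self, name, read_fn):
        self.name, self.read_fn = name, read_fn

    def read(self):
        return self.read_fn()

class Actuator:
    """Applies a steering command that alters application semantics."""
    def __init__(self, name, write_fn):
        self.name, self.write_fn = name, write_fn

    def actuate(self, value):
        self.write_fn(value)

class SteeringRegistry:
    """Registry the monitor polls for state and the client commands."""
    def __init__(self):
        self.sensors, self.actuators = {}, {}
        self.lock = threading.Lock()  # monitor and client may race

    def register_sensor(self, s):
        self.sensors[s.name] = s

    def register_actuator(self, a):
        self.actuators[a.name] = a

    def snapshot(self):
        with self.lock:
            return {n: s.read() for n, s in self.sensors.items()}

    def steer(self, name, value):
        with self.lock:
            self.actuators[name].actuate(value)

# Example: steer a solver's timestep based on its observed residual.
state = {"residual": 0.5, "dt": 0.01}
reg = SteeringRegistry()
reg.register_sensor(Sensor("residual", lambda: state["residual"]))
reg.register_actuator(Actuator("dt", lambda v: state.__setitem__("dt", v)))

if reg.snapshot()["residual"] > 0.1:  # client observes behavior...
    reg.steer("dt", 0.005)            # ...and alters the execution
```

A visualization frontend would sit on the client side of this registry, rendering sensor snapshots and issuing actuator commands in response to user interaction.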
Online performance observation systems aim to achieve several advantages for performance analysis. Paradyn  searches for performance bottlenecks online, while controlling measurement overhead by dynamically instrumenting only those events that are useful for testing the current bottleneck hypothesis. Thus, the performance analysis done by Paradyn at runtime both collects profile statistics and interprets the performance data to decide on the next course of action. Whereas Paradyn attempts to identify performance problems, Autopilot  is an online performance observation and adaptive control framework that uses application sensors to extract quantitative and qualitative performance data for automated decision control. While both Paradyn and Autopilot are oriented towards automated performance analysis and tuning, neither addresses the problem of scalable performance observation or provides capabilities to analyze or visualize large-volume performance information.
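The hypothesis-driven search idea can be sketched in a few lines. This is not Paradyn's implementation; the threshold, event names, and profile values are hypothetical, and `measure` stands in for dynamically inserting, sampling, and removing instrumentation for a single event.

```python
THRESHOLD = 0.2  # fraction of runtime above which a resource is "a bottleneck"

# Hypothetical measured costs, normalized to total runtime.
profile = {
    "cpu":            0.7,
    "cpu/func_solve": 0.6,
    "cpu/func_io":    0.1,
    "sync":           0.1,
}

def measure(event):
    # Stands in for dynamically instrumenting one event, collecting
    # data for an interval, then removing the instrumentation.
    return profile.get(event, 0.0)

def refine(hypothesis):
    # Child hypotheses: where, more specifically, is the time going?
    return [e for e in profile if e.startswith(hypothesis + "/")]

def search(root_hypotheses):
    bottlenecks, frontier = [], list(root_hypotheses)
    while frontier:
        hyp = frontier.pop()
        if measure(hyp) > THRESHOLD:       # hypothesis confirmed
            children = refine(hyp)
            if children:
                frontier.extend(children)  # drill down to a finer hypothesis
            else:
                bottlenecks.append(hyp)    # most specific explanation found
    return bottlenecks

print(search(["cpu", "sync"]))  # prints ['cpu/func_solve']
```

The key overhead-control property is that events outside the current frontier (here, `sync` and its children) are never instrumented at all.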
Indeed, the difficulty of linking application-embedded monitoring to data consumers will ultimately determine how much runtime information can be utilized. This involves a complicated tradeoff between instrumentation and measurement granularity, the overhead of application/performance data transport, and the information requirements of the desired analysis . Projects such as the Multicast/Reduction Network (MRNet)  will help by providing efficient infrastructure for data communication and filtering. Similarly, the Peridot  project is developing a distributed application monitoring framework for shared-memory multiprocessor (SMP) clusters that can provide scalable trace data collection and online analysis. The system will offer selective instrumentation and analysis control, helping to address node- and system-level monitoring requirements. A different approach to scalable observation is taken in . Here, statistical sampling techniques are used to gain representative views of system performance characteristics and behavior.
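The sampling approach can be illustrated with a small sketch: rather than collecting a metric from every node, observe a random subset and estimate the system-wide value with a confidence interval. The metric, node count, and sample size here are invented for illustration.

```python
import random

random.seed(1)
NODES = 4096
# Hypothetical per-node metric, e.g. fraction of time spent in MPI waits.
wait_fraction = [random.gauss(0.25, 0.05) for _ in range(NODES)]

def sample_estimate(data, k):
    """Estimate the population mean from k randomly sampled nodes."""
    sample = random.sample(data, k)
    mean = sum(sample) / k
    var = sum((x - mean) ** 2 for x in sample) / (k - 1)  # sample variance
    stderr = (var / k) ** 0.5
    return mean, 1.96 * stderr  # mean and ~95% confidence half-width

# Observe only 64 of 4096 nodes, a 64x reduction in monitoring traffic.
mean, ci = sample_estimate(wait_fraction, 64)
print(f"estimated wait fraction: {mean:.3f} +/- {ci:.3f}")
```

The tradeoff is explicit: a larger sample tightens the confidence interval at the cost of more data transport, which is exactly the granularity-versus-overhead tension described above.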
In general, we believe the benefits seen in applying online computation visualization and steering, which themselves demand substantial monitoring support, could also be realized in the parallel performance domain. Our goal is to consider the problem of online, scalable performance observation as a whole, understanding the tradeoffs involved and designing a framework architecture to address them.