The use of clusters for high-performance computing has grown in relevance in recent years, both in terms of the number of platforms available and the domains of application. This is mainly due to two factors: 1) the cost attractiveness of clusters developed from commercial off-the-shelf (COTS) components and 2) the ability to achieve good performance on many parallel computing problems. While performance is often considered from a hardware perspective, where the number of processors, their clock rate, the network bandwidth, and so on determine performance potential, maximizing the price/performance ratio of cluster systems more often requires detailed analysis of parallel software performance. Indeed, experience shows that the design and tuning of parallel software for performance efficiency is at least as important as the hardware components in delivering high ``bang for the buck.''
However, despite the fact that the brain is humankind's most remarkable example of parallel processing, designing parallel software that makes the best use of available computing resources remains an intellectual challenge. The iterative process of verification, analysis, and tuning of parallel codes is, in most cases, mandatory. Accordingly, the goal of parallel performance research has been to provide application developers with the tools necessary to do these jobs well. For example, the ability of the visual cortex to interpret complex information through pattern recognition has found high value in the use of trace analysis and visualization for detailed performance verification. Unfortunately, a trace-based approach has a major disadvantage: the amount of trace data generated even for small application runs on a few processors can become very large very quickly. In general, we find that building tools to understand parallel program performance is non-trivial and itself involves engineering tradeoffs. Indeed, the dominant tension in parallel performance research is between the need for performance detail and the insight it offers, versus the cost of obtaining performance data and the intrusion it may incur.
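To see how quickly trace volume grows, consider a back-of-envelope estimate. All numbers below are illustrative assumptions, not measured figures from any particular tracing system:

```python
# Illustrative estimate of trace data volume for a modest run.
# All rates and sizes are assumed values, chosen only to show
# how quickly the total grows with runtime and process count.
events_per_sec = 100_000   # assumed event rate per process
bytes_per_event = 24       # assumed record size (timestamp, event id, data)
processes = 32             # e.g., one process per node on a 32-node cluster
runtime_sec = 600          # a 10-minute run

total_bytes = events_per_sec * bytes_per_event * processes * runtime_sec
print(f"{total_bytes / 1e9:.1f} GB")  # prints "46.1 GB"
```

Even with these conservative assumptions, a ten-minute run on 32 processes yields tens of gigabytes of trace data, which motivates the online analysis approach pursued in this paper.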
Thus, performance tool researchers are constantly pushing the boundaries of how measurement, analysis, and visualization techniques are developed to solve performance problems. In this paper, we consider one such case: how to overcome the inherent problems of data size in parallel tracing while providing online analysis utility for programs with long execution times and large numbers of processes. We propose an integrated online performance analysis framework for clusters that addresses the following major goals:
The sections below present our system architecture design and discuss in detail the implementation we produced for our in-house 32-node Linux cluster. We first describe the TAU performance system and discuss issues that affect its ability to produce traces for online analysis. The following section introduces VNG and describes its functional design. The integrated system combining TAU and VNG is then presented. The paper concludes with a discussion of the portability and scalability of our online trace analysis framework as part of a parallel performance toolkit.