Clusters have proven to be a viable means of achieving large-scale parallelism for computationally demanding scientific problem solving. The relative ease with which computing, network, and storage resources can be added to a cluster, together with the available systems software to support them, especially the Linux OS, have made clusters a system of choice for many laboratories. This does not mean, however, that clusters are equally easy to program, or that high performance is easy to achieve on them. While the MPI programming library provides a common programming foundation for clusters, the increasing degree of shared-memory parallelism on cluster nodes encourages mixed-mode styles, such as a combination of MPI and OpenMP (or other multi-threading) methods. In either case, performance tools are still necessary to diagnose the variety of performance problems that can arise in cluster-based parallel execution.
Given the distributed-memory architecture of clusters, many performance issues result from the interplay of local computation and remote communication. While measurement techniques based on profiling are useful for highlighting how execution time is spent within (parallel) threads on a single node, communication behavior between processes on different nodes can only be summarized. Profiling may give some insight into communication hotspots or messaging imbalances, but it loses all information about time-dependent communication behavior. In contrast, measurement techniques based on event tracing reveal parallel execution dynamics, allowing users to identify temporal patterns of poor performance in different regions of program execution. The main disadvantage of tracing is the large amount of trace data produced, especially for long-running programs, which complicates runtime I/O, off-line storage, and post-mortem analysis.
In this paper, we focus our attention on the problem of scalable trace analysis. We propose an integrated performance analysis approach for clusters with the following major goals in mind:
The paper begins with an overview of the architecture of VNG, a prototype for parallel distributed performance analysis developed at Dresden University of Technology in Germany. Each component is then described in more detail. Because the runtime system of VNG is particularly important for its scalability, we discuss its operation separately. Experiments were conducted to evaluate the improvements in trace processing when multiple cluster nodes are used for analysis. These results are shown for two different platforms and compared to baseline performance for the Vampir [3,4] trace visualization system. The paper concludes with a discussion of the work and directions for the future.