Parallel trace measurement and analysis has been criticized in recent years as an impractical performance evaluation methodology for scalable clustered systems. These criticisms point primarily to large trace volume and the problems it causes for trace storage management, trace analysis, and visualization. One can argue that the amount of trace data is closely tied to the requirements for performance analysis. Hence, the choice of how much trace data to collect and for what events is a decision made during performance problem solving. Certainly, a more refined performance diagnosis strategy could make better, more judicious decisions regarding performance instrumentation to achieve small trace volume. However, from the perspective of the trace analysis system, this is of little consequence, since it must be able to function effectively with traces of large size. Even with careful planning of performance experiments, traces from large-scale parallel machines can quickly grow to the size we tested here.
Consequently, the best we can do to counter the criticism is to improve our trace analysis technology. The VNG system presented in this paper is a validation of tracing as a viable technology for scalable performance analysis. We achieve significant improvements in tracing I/O and analysis functions over the leading commercial system by parallelizing the VNG server using the same cluster technology used for the application's execution. Moreover, greater usability is gained through a separation of analysis and interface, making it possible to support multiple, simultaneous user sessions in a distributed environment.
We are continuing to improve the VNG technology as well as explore its possible applications. For instance, recent work has attempted to link VNG with a runtime trace generation system, thereby giving users online analysis access to investigate performance problems in long-running programs while they execute. Such support could be used to keep uninteresting or redundant sections of the trace from being stored, or to inform a computational steering system that guides the application towards better execution performance.
Although scalable cluster systems present VNG with a critical, stressful test case, VNG is not limited to clustered environments. Its use of standard software technology guarantees a high degree of portability to other platforms.