XPARE - eXPeriment Alerting and REporting

University of Utah, University of Oregon

The Problem

In the development process for large-scale parallel applications, it is not uncommon for performance analysis to be relegated to the end stage, where experiments serve mainly to validate the performance behavior of final versions of application software before deployment. Unfortunately, this approach loses the opportunity to benefit from a retrospective of performance characterization over the software's development history, both for better performance understanding of the software as it evolves and for more informed optimization in later stages. Multi-person development teams routinely perform periodic testing to validate functional correctness, but rarely include performance reporting beyond total execution time. In this work, we present a system for performance experimentation that is integrated into a weekly testing harness for the Uintah / C-SAFE software development effort. With this system we can produce detailed weekly reports of Uintah / C-SAFE performance and alert code developers of performance problems as they arise.

The Approach

We created the XPARE (eXPeriment Alerting and REporting) tools to allow software teams developing large-scale parallel applications to accomplish two important goals. First, they allow a team to specify benchmark regression tests for a given set of performance measures. These benchmarks are evaluated with each periodically scheduled testing trial. Second, throughout the course of development, XPARE provides a historical panorama of the evolution of performance as it tracks with software versions. This includes not only changes in the code, but also changes in platform, compiler choice, optimization levels, etc.
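The core of such a benchmark regression test can be sketched as comparing each performance measure against a stored baseline and flagging measures whose relative change exceeds a configured threshold. The following is an illustrative sketch only; the function name, measure names, and default threshold are invented here and are not XPARE's actual interface.

```python
# Hypothetical sketch of a per-measure regression check of the kind
# XPARE performs against each scheduled testing trial. All names and
# values are illustrative, not taken from XPARE itself.

def check_regressions(baseline, current, threshold=0.10):
    """Return measures whose relative slowdown exceeds `threshold`."""
    violations = {}
    for measure, base_value in baseline.items():
        new_value = current.get(measure)
        if new_value is None:
            continue  # measure not collected in this trial
        change = (new_value - base_value) / base_value
        if change > threshold:
            violations[measure] = change
    return violations

# Example: total time regressed 25%, MPI wait time stayed within bounds.
baseline = {"total_time": 120.0, "mpi_wait": 14.0}
current = {"total_time": 150.0, "mpi_wait": 14.5}
print(check_regressions(baseline, current))  # → {'total_time': 0.25}
```

Evaluating every measure independently, rather than a single total time, is what lets the alerting step point at specific behavior rather than just "the run got slower."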

The XPARE tools are designed to complement an already existing correctness and (minimal) performance testing harness for the ASCI C-SAFE Uintah project. However, the performance measurements employed there use only total execution time. That regression testing against prior execution runs, while able to detect significant performance changes, is unable to provide the detailed performance information needed to identify the program component(s) most responsible. Instead, XPARE utilizes performance experiments instrumented for a wider range of performance measurements, as offered by the TAU system (The University of Oregon's Tuning and Analysis Utilities). Detailed profile data, captured for all significant events of interest as specified by C-SAFE Uintah developers, now generates a significantly greater performance space for regression analysis and performance study.
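The advantage of per-event profile data over a single total time is attribution: when a regression appears, the events that slowed down can be ranked directly. A minimal sketch, assuming profiles reduced to a mapping from event name to exclusive time in seconds (the event names and data layout here are invented, not TAU's actual format):

```python
# Illustrative sketch: attribute a slowdown to specific profiled events
# by ranking the per-event time increase between two runs. The event
# names and the flat {event: seconds} layout are assumptions for this
# example, not TAU's on-disk profile format.

def rank_slowdowns(prev_profile, curr_profile):
    """Rank events by absolute increase in exclusive time (seconds)."""
    deltas = []
    for event, prev_time in prev_profile.items():
        curr_time = curr_profile.get(event, 0.0)
        deltas.append((curr_time - prev_time, event))
    return sorted(deltas, reverse=True)

prev = {"MPI_Allreduce": 10.0, "solve_pressure": 40.0, "advect": 25.0}
curr = {"MPI_Allreduce": 11.0, "solve_pressure": 62.0, "advect": 25.5}
print(rank_slowdowns(prev, curr)[0])  # → (22.0, 'solve_pressure')
```

A total-time-only harness would report the aggregate slowdown; the event-level ranking above is what makes it possible to name the responsible component.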

The operational framework of XPARE consists of five parts:

1. The experiment launcher frontend can be used manually to conduct experiments and produce performance data, as well as in the automated setting. It is capable of configuring, compiling, and executing performance experiments using batch system utilities.

2. The results transporter is responsible for sending the performance profile data from the suite of experiments to the remote site where the performance database resides. Performance data is sent via email, which provides fault tolerance when the server is unavailable and keeps configuration of both client and server simple: no additional server or ports need be set up.

3. Upon receipt, the performance database manager stores the profile data along with meta-information describing the experimental context.

4. The performance reporter is a web interface that provides access to the database and displays cross- and inter-experiment performance results in graphical form.

5. An easy-to-use configuration tool for the alerting mechanism allows the user to define thresholds for performance benchmarks for a given experiment setup. When regression analysis of a performance dataset determines that these thresholds have been exceeded, the alerting component notifies the corresponding parties of the violation.
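The last step, turning threshold violations into a notification, can be sketched as a small configuration plus a message formatter. Everything below is hypothetical: the configuration fields, experiment name, and message layout are invented for illustration and do not reflect XPARE's actual configuration format.

```python
# Hypothetical alerting configuration and plain-text notification of
# the kind the alerting component might send by email. Field names,
# the experiment name, and the address are invented for illustration.

alert_config = {
    "experiment": "uintah-weekly-scaling-32",
    "thresholds": {"total_time": 0.10, "mpi_wait": 0.25},
    "notify": ["dev-team@example.org"],
}

def format_alert(config, violations):
    """Build a plain-text alert listing each threshold violation."""
    lines = [f"Performance alert for {config['experiment']}:"]
    for measure, change in sorted(violations.items()):
        limit = config["thresholds"][measure]
        lines.append(f"  {measure}: +{change:.0%} (threshold {limit:.0%})")
    return "\n".join(lines)

msg = format_alert(alert_config, {"total_time": 0.25})
print(msg)
# Performance alert for uintah-weekly-scaling-32:
#   total_time: +25% (threshold 10%)
```

In an email-based design like the one described above, this message body would simply be handed to the mail system for delivery to the configured recipients.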

The C-SAFE / Uintah testing system runs weekly using the batch systems at the Los Alamos National Laboratory. A suite of scaling experiments is performed. In addition to complementing the current regression testing system, as described above, we are using XPARE as the foundation for assembling a database of performance data to be used in future internal reviews of the C-SAFE / Uintah software engineering process. The XPARE-generated reporting mechanism will also be important in presenting a historical performance perspective for ASCI Level 1 center reviews.

Impact, Importance, Interest, Audience

High performance computing projects have much to gain from using integrated performance testing and regression analysis tools. Other groups at LLNL (e.g., the SAMRAI development team) and LANL (e.g., the UPS development team) have already expressed interest in using XPARE's regression methods and software environment for regular performance testing and long-term results storage. Our future goal is to make the XPARE tools as easy to use and configurable as possible so that they can be readily applied to different projects and configured for different systems.