TAU supports both the profiling and the tracing methodologies of performance analysis. Profiling presents the user with summary statistics of performance metrics, while tracing highlights the temporal aspect of performance behavior, showing when and where events took place. To give a sense of how TAU's capabilities can be applied to parallel Java applications, we present a performance analysis of an mpiJava benchmark application that simulates the game of Life. We use a simple application and run it on four processors mainly for brevity and clarity in our discussion. However, TAU's capabilities can extend and scale with the complexity and requirements of applications and system environments, including larger numbers of Java contexts and processors.
In Figure 2, we see the profile of the mpiJava Life application obtained from TAU measurement, as described in the previous sections. It shows seven Java threads running on each node. Notice that events across different levels and components of execution are being observed. Thread 4 in each context executes the MPI calls for communication between the four processes. Of particular interest is the well-known cascading behavior of the MPICH MPI_Init routine, seen in the MPI_Init profile window. This illustrates how tasks are spawned off successively by MPICH. The performance of individual MPI routines is shown across each context and thread, as in the MPI_Init profile window. A detailed performance profile for each thread can be displayed graphically and textually, as shown in the two "n,c,t 2,0,4" profile windows for (t)hread 4 in (c)ontext 0 on (n)ode 2. Some of the other threads perform background JVM and mpiJava module tasks that the application developer would not directly see.
To observe dynamic performance behavior, TAU can also generate event traces, which are visualized here using a third-party commercial trace visualization program called Vampir [9,12]. Figure 3 illustrates how we can group threads within a node and show inter-thread and inter-node message communication events as line segments that connect the send and receive events on a global timeline. The user can zoom into interesting portions of the timeline and can click on a message or a segment to get more detailed information (e.g., the node where the events took place and the message tag, length, and bandwidth). Vampir provides a rich set of views for exploring different aspects of performance behavior. Figure 4 shows levels of nesting along a timeline in each thread. Figure 5 shows, as pie charts for a set of threads within each node, a summary of performance data grouped into higher-level semantic groups (mpi, java, sun, and so forth). Each thread could be an application-level or a virtual-machine-level thread. Figure 6 shows a dynamic call tree on a selected thread. It shows the calling order of routines annotated with performance metrics (inclusive and exclusive times, and number of calls). A user can fold or unfold a segment of the tree to gain better insight. In Figure 7, we see a communication matrix display in which nodes and threads along the rows and columns mark the senders and receivers, and color-coded values in the matrix show the extent of inter-thread message communication.
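The inclusive and exclusive times that annotate the call tree can be derived from nested entry and exit events on a single thread. The following is a minimal sketch of that calculation, not TAU's or Vampir's implementation: the `Event` record layout, the class and method names, and the sample timestamps are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CallTimes {
    /** One trace record on a single thread (hypothetical layout, not a TAU or Vampir format). */
    static final class Event {
        final String routine;
        final boolean enter;   // true = routine entry, false = routine exit
        final long time;       // timestamp in arbitrary ticks
        Event(String routine, boolean enter, long time) {
            this.routine = routine;
            this.enter = enter;
            this.time = time;
        }
    }

    /** Metrics accumulated per routine. */
    static final class Stats {
        long inclusive;   // time between entry and exit, callees included
        long exclusive;   // inclusive time minus time spent in direct callees
        int calls;
    }

    /** An active call on the stack, with the time its direct callees have used so far. */
    private static final class Frame {
        final String routine;
        final long start;
        long childTime;
        Frame(String routine, long start) {
            this.routine = routine;
            this.start = start;
        }
    }

    /**
     * Walks a well-nested event list and accumulates per-routine metrics.
     * Note: for recursive routines this simple sum counts nested inclusive time more than once.
     */
    static Map<String, Stats> summarize(List<Event> events) {
        Map<String, Stats> result = new LinkedHashMap<>();
        Deque<Frame> stack = new ArrayDeque<>();
        for (Event e : events) {
            if (e.enter) {
                stack.push(new Frame(e.routine, e.time));
                continue;
            }
            if (stack.isEmpty() || !stack.peek().routine.equals(e.routine)) {
                throw new IllegalArgumentException("unbalanced exit for " + e.routine);
            }
            Frame f = stack.pop();
            long inclusive = e.time - f.start;
            Stats s = result.computeIfAbsent(f.routine, k -> new Stats());
            s.inclusive += inclusive;
            s.exclusive += inclusive - f.childTime;
            s.calls++;
            if (!stack.isEmpty()) {
                stack.peek().childTime += inclusive;   // charge this call to its caller
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample: main calls MPI_Send twice.
        List<Event> trace = Arrays.asList(
            new Event("main", true, 0),
            new Event("MPI_Send", true, 10), new Event("MPI_Send", false, 30),
            new Event("MPI_Send", true, 40), new Event("MPI_Send", false, 50),
            new Event("main", false, 100));
        for (Map.Entry<String, Stats> e : summarize(trace).entrySet()) {
            Stats s = e.getValue();
            System.out.println(e.getKey() + " inclusive=" + s.inclusive
                + " exclusive=" + s.exclusive + " calls=" + s.calls);
        }
    }
}
```

The stack mirrors the nesting shown in the timeline view: each exit closes the most recent open entry, and the closed call's inclusive time is subtracted from its caller's exclusive time.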
Grouping performance data according to virtual machine and application level entities is not new. It has been successfully demonstrated in Paradyn-J, a tool for detecting performance bottlenecks in interpreted, just-in-time compiled Java programs, where data is grouped separately in two distinct trees (one for the application and another for the virtual machine). This approach gives both application developers and virtual machine developers valuable information about the interaction between the two groups. In contrast, as illustrated in the performance displays, TAU gathers performance data from the MPI and Java layers in a seamlessly integrated fashion, showing the precise thread where MPI calls execute and allowing data to be grouped in two hierarchies: one by nodes and threads, and the other by semantic groups. Although TAU provides a set of displays for profiling and tracing data, we see the need for other customized, user-defined, multi-dimensional displays that may show the data in more effective ways. To accomplish this, TAU provides an open, documented interface for accessing the performance data that it generates, and it illustrates with examples how a user could transform the data into commonly used performance data formats.
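To make the two grouping hierarchies concrete, the sketch below sums exclusive time once by node, context, and thread and once by semantic group, then writes either summary as comma-separated text. It does not use TAU's data access interface: the `Sample` record layout, the method names, the CSV header, and the routine name `Life.main` are hypothetical, and the numbers are invented sample values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ProfileGrouping {
    /** One profile entry (hypothetical layout, not TAU's actual profile format). */
    static final class Sample {
        final int node, context, thread;
        final String group;       // semantic group such as "mpi", "java", "sun"
        final String routine;
        final double exclusiveMs;
        Sample(int node, int context, int thread,
               String group, String routine, double exclusiveMs) {
            this.node = node;
            this.context = context;
            this.thread = thread;
            this.group = group;
            this.routine = routine;
            this.exclusiveMs = exclusiveMs;
        }
    }

    /** First hierarchy: total exclusive time per "node,context,thread" key. */
    static Map<String, Double> byThread(List<Sample> samples) {
        Map<String, Double> totals = new TreeMap<>();
        for (Sample s : samples) {
            String key = s.node + "," + s.context + "," + s.thread;
            totals.merge(key, s.exclusiveMs, Double::sum);
        }
        return totals;
    }

    /** Second hierarchy: total exclusive time per semantic group. */
    static Map<String, Double> byGroup(List<Sample> samples) {
        Map<String, Double> totals = new TreeMap<>();
        for (Sample s : samples) {
            totals.merge(s.group, s.exclusiveMs, Double::sum);
        }
        return totals;
    }

    /** Writes a summary as CSV; keys are quoted because thread keys contain commas. */
    static String toCsv(String header, Map<String, Double> totals) {
        StringBuilder sb = new StringBuilder(header).append('\n');
        for (Map.Entry<String, Double> t : totals.entrySet()) {
            sb.append('"').append(t.getKey()).append('"')
              .append(',').append(t.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented sample values for illustration only.
        List<Sample> samples = Arrays.asList(
            new Sample(2, 0, 4, "mpi", "MPI_Init", 12.5),
            new Sample(2, 0, 4, "mpi", "MPI_Send", 7.5),
            new Sample(2, 0, 1, "java", "Life.main", 4.0),
            new Sample(0, 0, 4, "mpi", "MPI_Init", 3.0));
        System.out.print(toCsv("n_c_t,exclusive_ms", byThread(samples)));
        System.out.print(toCsv("group,exclusive_ms", byGroup(samples)));
    }
}
```

Keeping the raw samples and deriving each view on demand means that a further user-defined display only needs another key function, rather than a change to how the data is collected.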