TAU and S3D

Introduction

We can automatically instrument the S3D application using PDT with TAU. TAU's MPI wrapper interposition library is used to intercept and instrument MPI calls. To use TAU on Cheetah (AIX), simply copy the make.SP2 and select.tau files into the S3D build directory and invoke the command
% make MACH=SP2 
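
The idea behind wrapper interposition can be sketched in a few lines of Python. This is only an analogy, not TAU's implementation: TAU interposes on MPI calls at link time through the MPI profiling interface, whereas the sketch below wraps an ordinary function. The names `instrument`, `stats`, and `fake_send` are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-routine statistics: number of calls and accumulated wall-clock seconds.
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

def instrument(func):
    """Return a wrapper that times func and then forwards to it."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            # Forward the call unchanged to the real routine.
            return func(*args, **kwargs)
        finally:
            entry = stats[func.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper

@instrument
def fake_send(payload):
    """Hypothetical stand-in for a communication call."""
    return len(payload)

for _ in range(3):
    fake_send(b"abc")

print(stats["fake_send"]["calls"])
```

The caller's code does not change; the wrapper records the measurement and passes the call through, which is why the application source needs no manual edits for MPI calls.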

Profiling with TAU

We ran S3D on 16 CPUs on ORNL's AIX cluster, Cheetah. The run produced profile files, which were visualized using TAU's paraprof profile browser.

By clicking on mean, we see the mean exclusive profile of the application across all 16 tasks.

The application spends over 80% of its time in the INT_RTE routine.

To see the detailed text profile of this application, we right-click on mean in the main window to show the mean text window.

Here we can see that INT_RTE is called 969969 times and it does not invoke any other instrumented routine.

Callpath profiling

Flat profiles do not reveal the calling structure of the program. When TAU is configured with the -PROFILECALLPATH configuration option and the application is rerun to generate profiles, we can see the calling order of routines.

In this callgraph display, we see the sequence of events that called INT_RTE. The width of each node in this callgraph is proportional to the inclusive time spent in that routine while the color represents the exclusive time (blue is low, red is high).
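
The relationship between the two quantities encoded in the callgraph can be sketched as follows: a routine's exclusive time is its inclusive time minus the inclusive time of its immediate children. The call tree, routine names, and timings below are invented for illustration and do not come from the S3D profile.

```python
# Hypothetical call tree: routine -> (inclusive seconds, immediate children).
# Names and timings are invented for illustration only.
call_tree = {
    "main":   (100.0, ["solve", "output"]),
    "solve":  (90.0,  ["kernel"]),
    "kernel": (80.0,  []),
    "output": (5.0,   []),
}

def exclusive_time(routine):
    """Inclusive time of the routine minus inclusive time of its children."""
    inclusive, children = call_tree[routine]
    return inclusive - sum(call_tree[child][0] for child in children)

for name, (inclusive, _) in call_tree.items():
    print(f"{name:8s} inclusive={inclusive:6.1f} exclusive={exclusive_time(name):6.1f}")
```

A leaf routine such as the hypothetical `kernel` has equal inclusive and exclusive times, so in a display of this kind it would appear both wide and red.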

Here, we see the callpath thread relations view. In this view, all immediate parents of a routine (shown by an arrow) are listed above the given routine and all immediate children are listed below it. We see that the routine INT_RTE had only one parent, DTM. We also see that DTM accounted for 125 of the 170 seconds spent in MPI_Barrier, and that DERIVATIVE_X and DERIVATIVE_Y accounted for 25 and 21 seconds, respectively, of the 55 seconds spent in MPI_Recv.
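
As a rough worked example, the parent attribution can be turned into shares. The sketch reads the figures above as 125 of the 170 seconds in MPI_Barrier coming from DTM, and 25 and 21 of the 55 seconds in MPI_Recv coming from DERIVATIVE_X and DERIVATIVE_Y; the helper `parent_shares` is hypothetical.

```python
def parent_shares(total_seconds, seconds_by_parent):
    """Return each parent's fraction of the total and the time left over."""
    shares = {p: s / total_seconds for p, s in seconds_by_parent.items()}
    remainder = total_seconds - sum(seconds_by_parent.values())
    return shares, remainder

barrier_shares, barrier_rest = parent_shares(170, {"DTM": 125})
recv_shares, recv_rest = parent_shares(
    55, {"DERIVATIVE_X": 25, "DERIVATIVE_Y": 21}
)

for parent, share in {**barrier_shares, **recv_shares}.items():
    print(f"{parent:13s} {share:6.1%}")
print(f"MPI_Barrier time from other parents: {barrier_rest} s")
print(f"MPI_Recv time from other parents:    {recv_rest} s")
```

Under this reading, DTM is responsible for roughly three quarters of the barrier time, and the two derivative routines together account for most of the receive time.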

We see that 33 minutes were spent in the INT_RTE routine.
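
Combining this figure with the call count reported earlier gives a rough mean cost per call. The sketch assumes that the 33 minutes and the 969969 calls refer to the same profile, which the text does not state explicitly. Because INT_RTE invokes no other instrumented routine, its inclusive and exclusive times coincide, so the distinction does not matter here.

```python
calls = 969969            # call count reported in the text profile
total_seconds = 33 * 60   # 33 minutes, as quoted above

mean_ms_per_call = total_seconds / calls * 1000.0
print(f"mean time per INT_RTE call: {mean_ms_per_call:.2f} ms")
```

Each call costs only about 2 ms, so the total is driven by the very large number of invocations rather than by any single slow call.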

TAU also keeps track of communication statistics. These are shown below in the user-defined event display, which is obtained by clicking the right mouse button on the node labels in the main paraprof window.

TAU's main window displays histograms that are stacked together or drawn separately. This allows us to easily compare the performance of a given routine across all nodes.
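
The same cross-node comparison can be made numerically. The sketch below uses invented per-task times for a single routine (they are not taken from the S3D run) and computes the mean, the slowest task, and a simple max-over-mean imbalance ratio.

```python
from statistics import mean

# Hypothetical exclusive times (seconds) for one routine on each of 16 tasks.
times_by_rank = [118.2, 120.5, 119.9, 121.3, 117.8, 120.1, 122.4, 119.0,
                 118.7, 120.9, 121.8, 119.4, 140.6, 120.3, 118.9, 119.6]

avg = mean(times_by_rank)
slowest_rank = max(range(len(times_by_rank)), key=times_by_rank.__getitem__)
imbalance = max(times_by_rank) / avg  # 1.0 means perfectly balanced

print(f"mean={avg:.1f} s, slowest rank={slowest_rank}, imbalance={imbalance:.2f}")
```

A ratio well above 1.0 points to one task doing noticeably more work, or waiting longer, than the others, which is the kind of outlier the separated histogram view makes visible.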

TAU allows you to connect to a performance database (PerfDMF) and upload or download an experiment. You can store application level metadata in the fields provided.

This allows us to compare trials and perform data mining operations.

Conclusions

After evaluating the performance of S3D using TAU, it was clear that the optimization efforts should focus on the INT_RTE routine shown above. Discussions with the developers revealed that, in this iterative application, further optimization of the code is possible by reducing the number of times this routine is invoked. By reusing the data generated by this routine at the start of an iteration, we can reduce the time spent in this routine and improve S3D's performance. This optimization is underway.
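
The general shape of such an optimization can be sketched as follows. This is not S3D code: `expensive_routine`, `iterate_naive`, and `iterate_cached` are hypothetical stand-ins, and the sketch assumes the routine's inputs do not change within an iteration, so that its result can safely be reused.

```python
# Counter for how often the costly routine runs (hypothetical bookkeeping).
calls = {"n": 0}

def expensive_routine(state):
    """Hypothetical stand-in for a costly routine; counts its invocations."""
    calls["n"] += 1
    return sum(x * x for x in state)

def iterate_naive(state, stages):
    # Recomputes the result at every stage of the iteration.
    return [expensive_routine(state) + s for s in range(stages)]

def iterate_cached(state, stages):
    # Computes the result once at the start of the iteration and reuses it.
    cached = expensive_routine(state)
    return [cached + s for s in range(stages)]

state = [1.0, 2.0, 3.0]

calls["n"] = 0
naive_result = iterate_naive(state, stages=4)
naive_calls = calls["n"]

calls["n"] = 0
cached_result = iterate_cached(state, stages=4)
cached_calls = calls["n"]

print(naive_calls, cached_calls, naive_result == cached_result)
```

Both variants produce the same values, but the cached one invokes the costly routine once per iteration instead of once per stage. Whether this is valid in a real code depends on whether the inputs truly stay fixed between those invocations.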