TAU integrated with OPARI for OpenMP and MPI (OpenMP+MPI) programs

Introduction

TAU uses OPARI (developed by Bernd Mohr, FZJ) to rewrite OpenMP directives for performance instrumentation.
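
To illustrate the rewriting, the sketch below shows the kind of transformation OPARI applies to a combined parallel for construct, inserting calls to the POMP measurement interface that TAU implements as timers. This is a simplified, illustrative sketch: the exact POMP call names, region descriptors, and generated pragmas are an approximation, not OPARI's literal output.

    /* Before OPARI: original user code */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    /* After OPARI (simplified sketch): the combined construct is split and
       wrapped with POMP calls; pomp_rd_* are region descriptors set up by
       OPARI. */
    POMP_Parallel_fork(&pomp_rd_1);
    #pragma omp parallel
    {
        POMP_Parallel_begin(&pomp_rd_1);
        POMP_For_enter(&pomp_rd_2);
        #pragma omp for nowait
        for (i = 0; i < n; i++)
            a[i] = b[i] + c[i];
        /* The loop's implicit barrier is made explicit so it can be measured. */
        POMP_Barrier_enter(&pomp_rd_2);
        #pragma omp barrier
        POMP_Barrier_exit(&pomp_rd_2);
        POMP_For_exit(&pomp_rd_2);
        POMP_Parallel_end(&pomp_rd_1);
    }
    POMP_Parallel_join(&pomp_rd_1);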

Region-based instrumentation

TAU can instrument an OpenMP application based on the location of each OpenMP construct in the source code, which allows it to distinguish, for example, between two different parallel for loops. Internally, it maps each region descriptor to a timer (using efficient embedded mappings). The following source code shows two parallel for loops.
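
A minimal illustrative version of such code, with two distinct parallel for loops (variable names and loop bodies are placeholders rather than the original listing), is:

    #include <stdio.h>

    #define N 1000000

    static double a[N], b[N];

    int main(void)
    {
        int i;

        /* First parallel for loop: initialize the data. */
        #pragma omp parallel for private(i)
        for (i = 0; i < N; i++) {
            a[i] = 0.5 * i;
            b[i] = 0.0;
        }

        /* Second parallel for loop: scale the data. Region-based
           instrumentation reports this loop separately from the one above,
           because it starts at a different source location. */
        #pragma omp parallel for private(i)
        for (i = 0; i < N; i++)
            b[i] = 2.0 * a[i];

        printf("b[1] = %f\n", b[1]);
        return 0;
    }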

TAU generates profiles for this application as shown below:

Note that the two parallel for constructs are qualified with the file name and the extent of the OpenMP pragma, as indicated by the line numbers. TAU also provides a switch that allows us to generate performance data based on the construct alone (in the example above, this would produce one parallel for entity instead of two). Thus, performance views can be generated based on code regions, on OpenMP constructs (aggregated over all regions), or both.

Profiling with region- and construct-based views

When both views are enabled, we get profiles that highlight the source locations as well as the constructs, as shown below for a mixed-mode MPI+OpenMP application (Stommel, described below):

Here, we see that the for loop spanning lines 252-260 takes up a significant portion of the time. The exclusive time spent in the different OpenMP constructs and regions on node 0, thread 0 is shown below:

On the other hand, the inclusive time is shown below:

This corresponds to the following source code:


    252 #pragma omp for schedule(static) reduction(+: diff) private(j) firstprivate (a1,a2,a3,a4,a5)
    253     for( i=i1;i<=i2;i++) {
    254         for(j=j1;j<=j2;j++){
    255             new_psi[i][j]=a1*psi[i+1][j] + a2*psi[i-1][j] +
    256                          a3*psi[i][j+1] + a4*psi[i][j-1] -
    257                          a5*the_for[i][j];
    258             diff=diff+fabs(new_psi[i][j]-psi[i][j]);
    259          }
    260     }


Tracing

TAU can instrument a program using several cooperating levels of instrumentation, including the preprocessor level (Opari, PDT), the compiler level, the MPI wrapper library level, and manual source code instrumentation. This document describes how TAU can be used, together with Opari and Vampir, for performance evaluation of OpenMP programs. This allows us to use TAU on clusters of SMPs, where MPI is used for inter-node communication and OpenMP pragmas are used to exploit shared memory parallelism. TAU can generate both profiles (aggregate summary statistics based on wallclock time, CPU time, or hardware counters) and event traces (timestamped event logs), which can be visualized in Racy and Vampir, respectively.
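
As an example of the manual source code instrumentation level mentioned above, the sketch below uses TAU's C timer macros from TAU.h; the routine being timed (compute) is a placeholder, and the program must be compiled and linked against a TAU configuration:

    #include <TAU.h>

    /* Placeholder for the application routine whose time we want to measure. */
    static void compute(void)
    {
        /* ... application work ... */
    }

    int main(int argc, char **argv)
    {
        /* Declare a timer, initialize TAU, and identify this process. */
        TAU_PROFILE_TIMER(t, "compute", "", TAU_USER);
        TAU_PROFILE_INIT(argc, argv);
        TAU_PROFILE_SET_NODE(0);

        TAU_PROFILE_START(t);
        compute();
        TAU_PROFILE_STOP(t);

        return 0;
    }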

Tracing OpenMP+MPI executions in Vampir

Here, we see a global timeline display of a trace generated by TAU. It highlights an integrated hybrid execution model comprising MPI events (all MPI routines and messages) and OpenMP events (OpenMP parallel regions and loops). Each OpenMP thread is shown independently on the timeline, so we can see the load imbalance and when and where it takes place. Inter-process communication events are shown at the level of each thread: although the MPI layer does not have any information about the sending and receiving threads, TAU can match sends and receives and generate a Vampir trace in which the precise thread involved in the synchronization operation is shown. The display shows the global timeline along the X axis, with the MPI tasks and their threads grouped within each task along the Y axis. TAU also uses a high-level grouping of all events, so on the left we can see the contribution of all "MPI" and "OpenMP" constructs. Note that each OpenMP thread is shown distinctly within its SMP node and MPI events are integrated into the trace; TAU accurately tracks the precise thread with which each inter-process communication event is associated. The OpenMP-level instrumentation highlights the time spent in the for and barrier code regions (specified by pragmas).
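
The execution model traced here follows the usual hybrid pattern: each MPI task opens an OpenMP parallel region for its compute phase, and communication takes place between tasks. A stripped-down sketch of that pattern (not the Stommel code itself) is shown below; in a Vampir timeline, each thread of the parallel for appears as its own line, and the MPI_Send/MPI_Recv pair appears as a message between the specific threads that execute it:

    #include <stdio.h>
    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv)
    {
        double local[N], halo = 0.0;
        int rank, size, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* OpenMP compute phase: each thread of this MPI task is traced
           separately, with the parallel region highlighted. */
        #pragma omp parallel for private(i)
        for (i = 0; i < N; i++)
            local[i] = rank + 0.001 * i;

        /* MPI communication phase between tasks 0 and 1. */
        if (size > 1) {
            if (rank == 0)
                MPI_Send(&local[N - 1], 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&halo, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        printf("task %d done (halo = %f)\n", rank, halo);
        MPI_Finalize();
        return 0;
    }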

The parallelism and global activity displays in Vampir show the breakdown of activities. In the Global Activity Chart, we see a pie chart showing the color-coded contribution of each code segment on each OpenMP thread within the two MPI tasks.

In the summary charts, we see the contribution of OpenMP, MPI, and application constructs aggregated over all threads of execution. The application, written by Timothy Kaiser (SDSC), uses OpenMP for loop-level parallelism within an MPI program. It solves the 2D Stommel model of ocean circulation using a five-point stencil and Jacobi iteration:


 gamma*((d(d(psi)/dx)/dx) + (d(d(psi)/dy)/dy))
   + beta*(d(psi)/dx) = -alpha*sin(pi*y/(2*ly))
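
For reference, the coefficients a1 through a5 in the Jacobi update shown earlier (lines 252-260) can be related to this equation by discretizing it with second-order central differences on a dx-by-dy grid and solving for psi[i][j]. The sketch below is a hedged reconstruction along those lines; the function name is hypothetical, and the exact expressions used in the application may differ in normalization or sign conventions:

    /* Hedged sketch: one plausible derivation of the Jacobi coefficients
       a1..a5 from the Stommel equation using central differences. */
    void stommel_coefficients(double gamma, double beta, double dx, double dy,
                              double *a1, double *a2, double *a3,
                              double *a4, double *a5)
    {
        double dxx   = gamma / (dx * dx);   /* weight of the d2(psi)/dx2 term */
        double dyy   = gamma / (dy * dy);   /* weight of the d2(psi)/dy2 term */
        double adv   = beta / (2.0 * dx);   /* weight of the d(psi)/dx term   */
        double denom = 2.0 * (dxx + dyy);   /* coefficient of psi[i][j]       */

        *a1 = (dxx + adv) / denom;          /* multiplies psi[i+1][j] */
        *a2 = (dxx - adv) / denom;          /* multiplies psi[i-1][j] */
        *a3 = dyy / denom;                  /* multiplies psi[i][j+1] */
        *a4 = dyy / denom;                  /* multiplies psi[i][j-1] */
        *a5 = 1.0 / denom;                  /* multiplies the forcing term */
    }

With the_for[i][j] holding the right-hand side -alpha*sin(pi*y/(2*ly)) at grid point (i, j), one Jacobi sweep is then exactly the update on lines 255-257.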