TAU and PERC

Introduction

TAU instruments the EVH1 code using the Program Database Toolkit (PDT) and TAU's MPI wrapper interposition library. TAU can generate profiles and event-traces.

EVH1 Traces

TAU uses Vampir, a commercial trace visualization tool from Pallas GmbH.

In the global timeline view, the user can zoom into a trace segment as shown above.

In the dynamic calltree view (for process 1), we can see the calling order annotated with performance metrics. In the inclusive view, the number of calls and the inclusive time spent in each routine are shown. Note the performance of EVH1_EVOLVE as it calls SWEEPX, SWEEPY, and MPI_Alltoall in the calltree.
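The relationship between inclusive and exclusive time in a calltree can be sketched in a few lines of Python. This is only an illustration: the Node class, the report helper, and all timing values are hypothetical and are not measured EVH1 data. Only the routine names come from the calltree described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    exclusive: float          # seconds spent in the routine itself (made-up values)
    calls: int = 1
    children: List["Node"] = field(default_factory=list)


def inclusive(node: Node) -> float:
    """Inclusive time = the routine's own exclusive time + inclusive time of its callees."""
    return node.exclusive + sum(inclusive(child) for child in node.children)


def report(node: Node, depth: int = 0) -> List[str]:
    """Return one indented line per routine with its call count and inclusive time."""
    lines = [f"{'  ' * depth}{node.name}: calls={node.calls}, inclusive={inclusive(node):.1f}s"]
    for child in node.children:
        lines.extend(report(child, depth + 1))
    return lines


# Hypothetical numbers, for illustration only.
tree = Node("EVH1_EVOLVE", 0.5, 1, [
    Node("SWEEPX", 2.0, 10),
    Node("SWEEPY", 3.0, 10),
    Node("MPI_Alltoall", 1.5, 20),
])

print("\n".join(report(tree)))
```

With these made-up values, the inclusive time of EVH1_EVOLVE is its own 0.5 s plus the time of its three callees.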

The summary view is shown above as a pie chart.

The timeline view highlights the level of nesting on process 1.

The parallelism view shows the number of processes that participate in an activity at a given time.

NOTE: TAU supports selective instrumentation. This allows us to exclude certain lightweight routines. Here, PARABOLA, FORCE and other routines were chosen for exclusion based on their execution frequency.
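As a rough sketch of how such an exclusion list might be produced, the Python below picks routines whose call count exceeds a threshold and formats them as a TAU selective instrumentation exclude list (the BEGIN_EXCLUDE_LIST / END_EXCLUDE_LIST block). The call counts, the threshold, and the helper names are assumptions made for illustration. How routine names are matched may also depend on the TAU version in use.

```python
def choose_exclusions(call_counts, min_calls):
    """Pick routines called at least min_calls times (high execution frequency)."""
    return sorted(name for name, calls in call_counts.items() if calls >= min_calls)


def exclude_list(routines):
    """Format routine names as a TAU selective instrumentation exclude list."""
    return "\n".join(["BEGIN_EXCLUDE_LIST", *routines, "END_EXCLUDE_LIST"]) + "\n"


# Hypothetical call counts; not taken from an actual EVH1 run.
call_counts = {
    "PARABOLA": 2_000_000,
    "FORCE": 500_000,
    "SWEEPX": 100,
    "REMAP": 400,
}

selected = choose_exclusions(call_counts, min_calls=100_000)
print(exclude_list(selected), end="")
```

The resulting text would be saved to a file and handed to the instrumentor. The threshold is a judgment call: set it too low and routines of interest disappear from the profile.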

EVH1 Profiles

TAU's jracy profile browser shows the profile of EVH1. The program executes on 16 processors. EVH1 and MPI routines are represented by distinct colors. The program behaves regularly across all nodes.

By clicking on mean, we can view more details. The exclusive time spent in each process is shown in this view.

The text window shows the mean time spent (over 16 nodes) in different routines.

Jracy allows the user to sort the performance metrics in different ways. The window above shows the mean inclusive time.

The text view shows the same data.
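The mean and sorted views described above amount to averaging each routine's time over the 16 nodes and then ordering the routines by that average. A minimal Python sketch follows. The per-node values are invented, the helper names are hypothetical, and treating a routine missing on a node as zero time is an assumption of this sketch rather than documented jracy behavior.

```python
from statistics import mean


def mean_profile(per_node):
    """Average each routine's time over all nodes.

    per_node is a list with one {routine: seconds} dict per node; a routine
    missing on a node is counted as 0.0 (an assumption of this sketch).
    """
    routines = sorted({name for node in per_node for name in node})
    return {name: mean(node.get(name, 0.0) for node in per_node) for name in routines}


def sorted_by_time(profile):
    """Return (routine, mean seconds) pairs, largest mean first."""
    return sorted(profile.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inclusive times (seconds) on 16 nodes; not measured data.
per_node = [{"REMAP": 4.0, "MPI_Alltoall": 1.0 + 0.125 * rank} for rank in range(16)]

means = mean_profile(per_node)
for name, seconds in sorted_by_time(means):
    print(f"{name:15s} {seconds:8.4f}")
```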

On the left, we can see the function legend and on the right the exclusive time spent in the REMAP routine.

TAU and PAPI

When TAU is configured with PAPI for access to hardware performance counters, EVH1 profiles show the exclusive and inclusive counts (instead of wallclock time as described above). We conducted two experiments, with the PAPI_EVENT environment variable set to PAPI_FP_INS (to record floating point instructions) and to PAPI_L1_DCM (to record level 1 data cache misses), respectively.
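The two experiments differ only in the value of PAPI_EVENT, so they can be scripted as one loop. The Python sketch below builds the environment for each run and prints the command it would launch. The launch command (mpirun with a ./evh1 binary) and the papi_env helper are hypothetical placeholders, because the actual launcher and executable name depend on the system.

```python
import os

# The two PAPI preset events used in the experiments described above.
EVENTS = ("PAPI_FP_INS", "PAPI_L1_DCM")

# Hypothetical launch command; replace with the site's launcher and binary.
COMMAND = ["mpirun", "-np", "16", "./evh1"]


def papi_env(event, base=None):
    """Return a copy of the environment with PAPI_EVENT set to the given event."""
    env = dict(os.environ if base is None else base)
    env["PAPI_EVENT"] = event
    return env


for event in EVENTS:
    env = papi_env(event)
    # Dry run: only show what would be executed.
    print(f"PAPI_EVENT={env['PAPI_EVENT']} {' '.join(COMMAND)}")
    # To actually launch: subprocess.run(COMMAND, env=env, check=True)
```

Keeping each run's environment in its own copy avoids leaking one experiment's PAPI_EVENT setting into the next.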

The mean window (over 16 processors) shows the exclusive and inclusive floating point counts for all routines (when PARABOLA was not instrumented).

The mean window (over 16 processors) shows the exclusive and inclusive level 1 data cache misses for all routines (except PARABOLA).

FILES