TAU can be used to profile and trace PETSc applications (details here). Here, we describe the performance results of the PETSc nonlinear solver example ex19, a 2-D driven cavity code that uses a velocity-vorticity formulation and a finite difference discretization on a structured grid. The details of this example can be found here.

Problem Size, Measurements

In this study, we examine the performance data of ex19 by running it on a Quad PIII Xeon (550 MHz) Linux machine with a 56x56 mesh size. This is done by executing it as:

% mpirun -np 4 ex19 -da_grid_x 56 -da_grid_y 56

With these parameters, the problem runs for roughly one minute. PETSc v2.1.3 is configured with BOPT=O, and TAU is configured with four different measurement options: wallclock-time profiling, tracing, and profiling with the PAPI_FP_INS and PAPI_L1_DCM hardware counters, each covered below. In all these cases, the entire PETSc framework and the example are instrumented automatically using PDT with TAU; no annotations are inserted manually in the source code.

Profiling: Wallclock Time

The above jracy display shows that the application behaves in a regular manner on all four nodes. By clicking on the mean histogram we get the mean window above. Here, we see that about 35% of exclusive time is spent executing the routine MatLUFactorNumeric_SeqAIJ_Inode.

By clicking the middle mouse button over "mean" in the main racy window, we see the average text profile over all nodes. The routines can be sorted in different ways. Here, the routines are sorted by the exclusive time.

The data above is sorted by inclusive time.
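The difference between the two sort orders can be illustrated with a minimal sketch. It assumes the usual definition, in which exclusive time is a routine's inclusive time minus the inclusive time of its direct callees. The routine names and timings are hypothetical and are not taken from the ex19 profile.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Routine:
    """One node of a call tree with its measured inclusive time."""
    name: str
    inclusive_usec: float
    children: List["Routine"] = field(default_factory=list)

    def exclusive_usec(self) -> float:
        # Exclusive time = inclusive time minus time spent in direct callees.
        return self.inclusive_usec - sum(c.inclusive_usec for c in self.children)


def flatten(root: Routine) -> List[Routine]:
    """Return every routine in the tree, parents before children."""
    out = [root]
    for child in root.children:
        out.extend(flatten(child))
    return out


# Hypothetical timings in microseconds (assumed values for illustration).
solve = Routine("solve", 600.0)
factor = Routine("factor", 300.0)
main = Routine("main", 1000.0, [factor, solve])

routines = flatten(main)
by_exclusive = sorted(routines, key=lambda r: r.exclusive_usec(), reverse=True)
by_inclusive = sorted(routines, key=lambda r: r.inclusive_usec, reverse=True)

for r in by_exclusive:
    print(f"{r.name:8s} excl={r.exclusive_usec():8.1f} incl={r.inclusive_usec:8.1f}")
```

In this toy tree the top-level routine leads the inclusive ordering but comes last in the exclusive ordering, which is why the exclusive view is usually the one that points at the routines doing the actual work.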

Clicking on the routine names or colors shows the breakdown of the routine over all nodes. The inclusive or exclusive time spent in the routine can be displayed in microseconds, milliseconds, or seconds, or as a percentage of the total time on that node. The routine MatSolve_SeqAIJ_Inode takes around 10.4 seconds of exclusive time (the whole program takes about one minute to execute).
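As a rough worked example of these display units, the sketch below converts the figures quoted in the text (about 10.4 seconds of exclusive time out of a run of about one minute). The function name `format_time` and the unit labels are assumptions made for illustration; they are not part of TAU.

```python
def format_time(usec: float, unit: str, total_usec: float) -> str:
    """Render a time given in microseconds in the requested unit."""
    if unit == "usec":
        return f"{usec:.0f} usec"
    if unit == "msec":
        return f"{usec / 1e3:.1f} msec"
    if unit == "sec":
        return f"{usec / 1e6:.1f} sec"
    if unit == "percent":
        # Share of the total time measured on the same node.
        return f"{100.0 * usec / total_usec:.1f}%"
    raise ValueError(f"unknown unit: {unit}")


routine_usec = 10.4e6  # about 10.4 s exclusive, as quoted in the text
total_usec = 60.0e6    # about one minute in total (approximate)

for unit in ("usec", "msec", "sec", "percent"):
    print(format_time(routine_usec, unit, total_usec))
```

Under the one-minute approximation, the routine accounts for roughly 17% of the run time on a node.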

The complete profile file is available here.


Tracing

When TAU is configured with the -TRACE option, event traces are generated. After the traces from all nodes are merged (tau_merge *.trc app.trc), they can be converted (tau_convert -pv app.trc tau.edf app.pv) to the Vampir trace file format. Tracing shows the temporal variation of performance.
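The merge and convert steps can be scripted. The following is a minimal sketch that builds the two command lines quoted above. The helper name `build_trace_commands` is hypothetical, and the script assumes that tau_merge and tau_convert are on the PATH; if they are not, it only prints what it would run.

```python
import glob
import shutil
import subprocess


def build_trace_commands(trace_files, merged="app.trc", edf="tau.edf", out="app.pv"):
    """Return the merge and convert command lines as argument lists."""
    # A shell would expand *.trc; here the caller passes the expanded list.
    # Skip a merged file left over from an earlier run so it is not re-merged.
    inputs = [f for f in sorted(trace_files) if f != merged]
    merge_cmd = ["tau_merge", *inputs, merged]
    convert_cmd = ["tau_convert", "-pv", merged, edf, out]
    return merge_cmd, convert_cmd


if __name__ == "__main__":
    merge_cmd, convert_cmd = build_trace_commands(glob.glob("*.trc"))
    tools_found = shutil.which("tau_merge") and shutil.which("tau_convert")
    if tools_found and len(merge_cmd) > 2:
        subprocess.run(merge_cmd, check=True)
        subprocess.run(convert_cmd, check=True)
    else:
        print("would run:", " ".join(merge_cmd))
        print("would run:", " ".join(convert_cmd))
```

Filtering out the merged file matters on a second run, because the `*.trc` pattern would otherwise also match `app.trc` itself.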

In the above figure, we see a timeline display. Interprocess communication appears as line segments. Vampir allows the user to zoom into a segment of the trace to examine the events on each node.

Tracing also preserves the dynamic call tree of a process. The global call tree view shows that the routine MatLUFactorNumeric_SeqAIJ_Inode is called 4 times and takes roughly 21 seconds to execute along the given calling path.

The global activity display above shows the relative distribution of time among the different routines.

Profiling: PAPI_FP_INS

When TAU is configured with PAPI, we can associate low-level events such as the number of floating point instructions executed (PAPI_FP_INS), or instruction and data cache misses with routines.

The above profile shows that the routine MatLUFactorNumeric_SeqAIJ_Inode accounts for roughly 63% of the floating point instructions executed by the program.

The mean profile sorted by exclusive counts is shown above.

The same profile, sorted by inclusive counts, is shown above.

The pprof output is available here.

Profiling: PAPI_L1_DCM

Here, we see the profile of ex19 with respect to Level 1 data cache misses. The routine MatLUFactorNumeric_SeqAIJ_Inode accounts for 44.51% of L1 data cache misses.

The mean profile sorted by exclusive counts is shown above.

The same profile, sorted by inclusive counts, is shown above.

The pprof output is available here.

Selective instrumentation capabilities of TAU (tau_reduce) were used to eliminate instrumentation in three routines (described here) based on factors such as execution frequency and time spent in the routine (described here).
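The kind of rule involved can be sketched as follows. This is not tau_reduce itself or its rule syntax; it is a hypothetical filter that flags routines called very often while doing little work per call, which are the routines where instrumentation overhead tends to dominate. The thresholds, routine names, and numbers are all assumed for illustration.

```python
def select_for_exclusion(profile, min_calls=1_000_000, max_usec_per_call=2.0):
    """Pick routines that are called very often but do little work per call.

    profile maps a routine name to (number of calls, exclusive microseconds).
    """
    excluded = []
    for name, (calls, exclusive_usec) in profile.items():
        if calls == 0:
            continue  # never executed, nothing to decide
        if calls > min_calls and exclusive_usec / calls < max_usec_per_call:
            excluded.append(name)
    return sorted(excluded)


# Hypothetical profile data (not measured values from ex19).
profile = {
    "tiny_accessor": (5_000_000, 4_000_000.0),       # 0.8 usec per call
    "frequent_but_slow": (2_000_000, 10_000_000.0),  # 5.0 usec per call
    "heavy_kernel": (4, 21_000_000.0),               # few calls, long running
}

print(select_for_exclusion(profile))
```

Only the first routine meets both conditions, so it is the only one proposed for exclusion; the long-running kernel stays instrumented because its per-call overhead is negligible.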

TAU and PAPI provide an integrated performance evaluation environment for PETSc users.

PETSc's next version (after 2.1.3) will feature native support for the TAU performance system.

For any assistance with PETSc and TAU, please contact <tau-team@cs.uoregon.edu>.