TAU integrated with VTF

Introduction

TAU uses PDT for source-to-source translation based instrumentation of VTF source code. The languages used are python, C++, C and F90. External packages that are instrumented included GrACE and HDF. TAU's MPI wrapper library level instrumentation is used to gather information pertaining to the MPI library.

Simulation

We ran a VTF simulation by setting the environment variable TARGET to opt,mpi,tau with the following options in shock.py:

 
options.timesteps = 400
options.maxLevel = 1 
--ratio="1:4"  (command line option)

on a 20 processor Linux PIII cluster using nodes 11-30 on asap. The other options were:


>>> flow: executing 'mpirun -nolocal -np 20 -machinefile cube-000.nodes `which mpipython.exe` 
./shock.py --nodes=20 --mask=default --solidScale=0.5 --ratio=1:4 --vizserver=localhost 
--port=50000 --fluidScale=1.0 --exchange=mpi --model=cube.py --eos=PerfectGas --solid=adlib
--worker=true --fluid=arm3d --id=0 --program=./shock.py --node-list=11-30

TAU was configured with -LINUXTIMERS and -useropt=-O2 with gcc 3.0.

Profiling with TAU

The above profile shows the breakdown of the wallclock time spent in all routines over 20 processors. Note that nodes 1-3, 4-7, 8-15 and 16-19 look similar.

Routine MPI_Waitsome takes up a significant portion of time on nodes 4-19 as does the following routine:

The routines are partitioned into different TAU groups as shown above and jracy permits the user to select a specific group in any display, or to exclude a group from the given display.

The mean window above highlights the average exclusive time spent in all routines across all nodes.

jracy allows the user to sort the data using inclusive time.

Nodes 1-3 have similar profiles where the routine MPI_Bcast dominates. The above figure shows the exclusive time on the given nodes.

The same profile is shown using inclusive time.

Node 4 has a distinct profile. The above figure shows the exclusive time spent in node 4.

Nodes 5-7 have a similar profile. Performance data on these nodes is sorted by exclusive time.

The textual profile shows the exclusive time spent in routines on node 5.

On nodes 8-14, MPI_Waitsome dominates. Here, we see the exclusive time and below, we see the inclusive profile on node 8.

Nodes 16-19 have a similar profile too. Here, we see the profile on node 16 sorted by exclusive time.

Files

The pprof textual output sorted by:

TAU's selective instrumentation capabilities are used to produce the above data. The list of routines excluded from instrumentation is available here. To obtain this the following rule is used in tau_reduce:


numcalls>400000 & usec/call<12

This removes routines that execute over a 400000 times and take less than 12 microseconds of inclusive time.