TAU Instrumentation Options

Selective Instrumentation Options

Selective Instrumentation File Specification

The selective instrumentation file has the following sections, each enclosed by a pair of BEGIN_/END_ lines:

BEGIN_EXCLUDE_LIST / END_EXCLUDE_LIST or BEGIN_INCLUDE_LIST / END_INCLUDE_LIST

These sections hold the exclude/include lists of routines and/or files for instrumentation. The list of routines to be excluded from instrumentation is specified, one per line, enclosed by BEGIN_EXCLUDE_LIST and END_EXCLUDE_LIST. Instead of specifying which routines should be excluded, the user can specify the list of routines that are to be instrumented using the include list, one routine name per line, enclosed by BEGIN_INCLUDE_LIST and END_INCLUDE_LIST. Additionally, a group of routines sharing the same prefix or suffix can be selected using the wildcard #. The Selective Instrumentation Example figure shows multiple includes and excludes, with the result of applying the lists.

BEGIN_FILE_EXCLUDE_LIST / END_FILE_EXCLUDE_LIST or BEGIN_FILE_INCLUDE_LIST / END_FILE_INCLUDE_LIST

Similarly, files can be included or excluded with the BEGIN_FILE_EXCLUDE_LIST, END_FILE_EXCLUDE_LIST, BEGIN_FILE_INCLUDE_LIST, and END_FILE_INCLUDE_LIST lines.
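As an illustration, the list sections described above can be generated programmatically. The following Python sketch is not part of TAU: the helper name render_select_file and the routine and file names are hypothetical, and only the BEGIN_/END_ markers and the # wildcard are taken from the text above.

```python
def render_select_file(exclude=(), include=(), file_exclude=(), file_include=()):
    """Return the text of a selective instrumentation file (hypothetical helper)."""
    sections = [
        ("EXCLUDE_LIST", exclude),
        ("INCLUDE_LIST", include),
        ("FILE_EXCLUDE_LIST", file_exclude),
        ("FILE_INCLUDE_LIST", file_include),
    ]
    lines = []
    for name, entries in sections:
        if not entries:
            continue  # omit empty sections entirely
        lines.append("BEGIN_" + name)
        lines.extend(entries)  # one routine or file per line
        lines.append("END_" + name)
    return "\n".join(lines) + "\n"


# Hypothetical routine and file names; "#" is the wildcard described above.
text = render_select_file(
    exclude=["void quicksort(int *, int, int)", "sort_#"],
    file_exclude=["utils.c"],
)
print(text, end="")
```

The printed text can be saved to a file such as select.tau and passed to the TAU tools that accept a selective instrumentation file.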

BEGIN_INSTRUMENT_SECTION / END_INSTRUMENT_SECTION

Manually editing the selective instrumentation file gives you more options. These tags allow you to control the type of instrumentation performed in certain portions of your application.

Figure 1. Selective Instrumentation Example
  • Static and Dynamic timers can be set by specifying either a range of line numbers or a routine.

    static timer name="foo_bar" file="foo.c" line=17 to line=18
    dynamic timer routine="int foo1(int)"
  • Static and Dynamic phases can be set by specifying either a range of line numbers or a routine. If you do not configure TAU with -PROFILEPHASE these phases will be converted to regular timers.

    static phase routine="int foo(int)"
    dynamic phase name="foo1_bar" file="foo.c" line=26 to line=27
  • Loops in the source code can be profiled by specifying a routine in which all loops should be profiled, like:

    loops file="loop_test.cpp" routine="multiply"
  • With [memoryoptions] the following events are tracked: memory allocation, memory deallocation, and memory leaks.

    memory file="foo.f90" routine="INIT"
  • IO events track the size, in bytes, of read, write, and print statements.

    io file="foo.f90" routine="RINB"

Both memory and IO events are recorded along with their call stack, whose depth can be set with the environment variable TAU_CALLPATH_DEPTH.
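The directive lines listed above all follow the same keyword-plus-attributes pattern, so they can be assembled mechanically. The sketch below is only an illustration; the helper names directive and instrument_section are hypothetical, and the file and routine names are the ones used in the examples above.

```python
def directive(kind, **attrs):
    """Format one instrument-section line such as: loops file="a.c" routine="f"."""
    parts = [kind] + ['%s="%s"' % (key, value) for key, value in attrs.items()]
    return " ".join(parts)


def instrument_section(directives):
    """Wrap directive lines in the BEGIN/END_INSTRUMENT_SECTION markers."""
    return "\n".join(["BEGIN_INSTRUMENT_SECTION", *directives, "END_INSTRUMENT_SECTION"])


section = instrument_section([
    directive("loops", file="loop_test.cpp", routine="multiply"),
    directive("memory", file="foo.f90", routine="INIT"),
    directive("io", file="foo.f90", routine="RINB"),
])
print(section)
```

The timer and phase directives use a slightly different syntax (unquoted line ranges), so they are left out of this sketch.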

Selective instrumentation can be set at compile time by setting -tau_options=-optTauSelectFile=<file> in the TAU_OPTIONS environment variable when compiling with the TAU compiler wrapper scripts. Alternatively, an application can be selectively instrumented at runtime by setting the TAU_SELECT_FILE environment variable to the selective instrumentation file’s location in the application’s execution environment.
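For the runtime case, a launcher script might prepare the environment as follows. This is a minimal sketch: the helper env_with_select_file is hypothetical, and converting the path to an absolute one is an assumption made here so the file is found regardless of the working directory, not a TAU requirement stated above.

```python
import os


def env_with_select_file(select_file, base_env=None):
    """Return a copy of the environment with TAU_SELECT_FILE set."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: an absolute path avoids surprises if the working directory changes.
    env["TAU_SELECT_FILE"] = os.path.abspath(select_file)
    return env


env = env_with_select_file("select.tau", base_env={"PATH": "/usr/bin"})
print(env["TAU_SELECT_FILE"])
# The result can be passed as env= to subprocess.run() when launching the
# application; the launch command itself is left out of this sketch.
```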

Due to the limitations of some compilers (IBM xlf, PGI pgf90, GNU gfortran), the size of the memory reported for a Fortran array is not the number of bytes but rather the number of elements.

Running an application using DynInstAPI

TAU also allows you to dynamically instrument your application using the DynInst package. There are a few limitations to DynInst: 1) only function-level events will be captured, and 2) your application must be compiled with debugging symbols (-g).

To use the DynInstAPI package, configure TAU with the -dyninst= option, which points TAU to where dyninst is installed. Use the tau_run tool to instrument your application at runtime.

The command-line options accepted by tau_run are:

Usage: tau_run [-Xrun<Taulibrary> ][-v][-o outfile] \
       [-f <instrumentation file> ] <application> [args]

By default, libTAU.so is loaded by tau_run. However, the user can override this and specify another library using the -Xrun<Taulibrary> option. In this case, lib<Taulibrary>.so will be loaded using LD_LIBRARY_PATH.

To use tau_run, TAU is configured with DyninstAPI as shown below:
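The usage line above can be turned into a small command builder when tau_run is driven from a script. The sketch below is an assumption-laden illustration: the function tau_run_command is hypothetical, and the library name TAU-custom is a placeholder, not a library that TAU is known to ship.

```python
def tau_run_command(application, args=(), tau_library=None, verbose=False,
                    outfile=None, select_file=None):
    """Build an argv list following the tau_run usage line shown above."""
    cmd = ["tau_run"]
    if tau_library is not None:
        cmd.append("-Xrun" + tau_library)  # tau_run then loads lib<tau_library>.so
    if verbose:
        cmd.append("-v")
    if outfile is not None:
        cmd += ["-o", outfile]
    if select_file is not None:
        cmd += ["-f", select_file]
    return cmd + [application] + [str(a) for a in args]


print(" ".join(tau_run_command("klargest", args=[2500, 23])))
# "TAU-custom" is a hypothetical library name used only for illustration.
print(" ".join(tau_run_command("a.out", tau_library="TAU-custom", outfile="a.inst")))
```

Returning a list rather than a single string keeps the result directly usable with subprocess.run() without shell quoting.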

% configure -dyninst=/usr/local/packages/dyninstAPI
% make install
% cd tau/examples/dyninst
% make install
% tau_run klargest 2500 23
% pprof; paraprof

Rewriting Binaries

Using MAQAO

TAU also allows you to rewrite your application using the MAQAO package included in PDToolkit 3.17 or above (http://tau.uoregon.edu/pdt.tgz).

Install PDToolkit 3.17+ and configure TAU with the -pdt= option, which points TAU to where PDToolkit is installed. Use the tau_rewrite tool to instrument your application. (If TAU is not configured with PDT 3.17+, then tau_rewrite defaults to tau_run.)

% configure -pdt=/usr/local/packages/pdtoolkit-3.17
% make install
% tau_rewrite -T scorep,pdt  -loadlib=/tmp/libfoo.so ./a.out -o a.inst

Using PEBIL

TAU also allows you to rewrite your application using the PEBIL package included in PDToolkit 3.18.1 or above (http://tau.uoregon.edu/pdt.tgz).

Install PDToolkit 3.18.1 or above and configure TAU with the -pdt= option, which points TAU to where PDToolkit is installed. Use the tau_pebil_rewrite tool to instrument your application.

% tau_pebil_rewrite -T <commands> -f select.tau <exe> [-o] <output_exe>

The select.tau file supports outer-loop level instrumentation and exclude/include lists of functions just like tau_instrumentor’s select.tau (same format). Also, -T <options> are identical to tau_exec -T options.

Using DynInstAPI

TAU also allows you to rewrite your application using the DyninstAPI package.

To install DynInstAPI, configure TAU with the -dyninst= option, which points TAU to where dyninst is installed. You can also use -dyninst=download, and TAU will automatically download and install DynInstAPI and its dependencies.

When TAU is configured with DynInstAPI, it shows the environment variables you need to set, which are DYNINSTAPI_RT_LIB and LD_LIBRARY_PATH.

% ./configure -dyninst=download -bfd=download
% make install
% tau_run -T <commands> -f select.tau <exe> [-o] <output_exe>

The select.tau file supports exclude/include lists of functions just like tau_instrumentor’s select.tau (same format). Also, -T <options> are identical to tau_exec -T options.

In some cases, flags such as -O2 can prevent DynInstAPI from reading the binaries. If possible, applications or libraries should be compiled with the flags -g -fno-ipa-sra -fno-ipa-ra -fno-ipa-vrp -fno-omit-frame-pointer.

Library Instrumentation with DynInstAPI

With DynInstAPI, instrumentation can be inserted into libraries. The limitation is that the library should be included in the application using RUNPATH instead of RPATH.

To instrument libraries, tau_run is used with the flag -l . Also, the flag -v is useful if selective instrumentation is used.

LD_LIBRARY_PATH can be used instead of -loadlib, but the user must ensure that the correct library is used by the binary.

Profiling each call to a function

By default, TAU profiles the total time (inclusive/exclusive) spent in a given function. Profiling each function call individually is impractical for an application that calls some function hundreds of thousands of times, since the profile data would grow enormously. Configuring TAU with the -PROFILEPARAM option, however, makes TAU profile selected functions each time they are called, while grouping these calls according to the value of the parameter they are given. For example, if a function mpisend(int i) is called 2000 times, 1000 times with 512 and 1000 times with 1024, then we will receive two profiles for mpisend(): one for when it is called with 512 and one for when it is called with 1024. This reduces the overhead, since mpisend() is profiled two times rather than 2000 times.
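The grouping in the mpisend() example can be reproduced with a few lines of Python. This is only a simulation of the bookkeeping, not TAU code; it counts calls per distinct parameter value to show why 2000 calls collapse into two profile entries.

```python
from collections import Counter

# 2000 simulated calls to mpisend(int i): 1000 with 512 and 1000 with 1024.
calls = [512] * 1000 + [1024] * 1000

# One profile entry per distinct parameter value rather than one per call.
profiles = Counter(("mpisend", size) for size in calls)

for (routine, size), count in sorted(profiles.items()):
    print("%s(%d): %d calls" % (routine, size, count))
```

The number of profile entries depends on the number of distinct parameter values, so a routine called with many different values would still produce many entries.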

Profiling with Hardware counters

LIST OF COUNTERS:

Set the TAU_METRICS environment variable to a comma-separated list of metrics or, to use the old method, set the COUNTER<1-25> environment variables to the following values.

  • GET_TIME_OF_DAY - For the default profiling option using gettimeofday()

  • SGI_TIMERS - For -SGITIMERS configuration option under IRIX

  • CRAY_TIMERS - For -CRAYTIMERS configuration option under Cray X1.

  • LINUX_TIMERS - For -LINUXTIMERS configuration option under Linux

  • CPU_TIME - For user+system time from getrusage() call with -CPUTIME

  • P_WALL_CLOCK_TIME - For PAPI’s WALLCLOCK time using -PAPIWALLCLOCK

  • P_VIRTUAL_TIME - For PAPI’s process virtual time using -PAPIVIRTUAL

  • TAU_MUSE - For reading counts of Linux OS kernel level events when MAGNET/MUSE is installed and the -muse configuration option is enabled. The TAU_MUSE_PACKAGE environment variable has to be set to the package name (busy_time, count, etc.)

  • TAU_MPI_MESSAGE_SIZE - For tracking the cumulative message size for all MPI operations by a node for each routine.

  • ENERGY - For tracking the energy use of the application in joules. Requires an -arch=craycnl configuration.

  • ACCEL_ENERGY - For tracking the energy use of the application on accelerators in joules. Requires an -arch=craycnl configuration.

When TAU is configured with the -TRACE, -MULTIPLECOUNTERS, and -papi=<dir> options, the COUNTER1 environment variable must be set to GET_TIME_OF_DAY to allow TAU’s tracing module to use a globally synchronized real-time clock for time-stamping event records. When tracing is used with hardware performance counters, the counters specified in the environment variables COUNTER[2-25] are accessed at routine transitions and logged in the trace file. Use the tau2vtf tool to convert TAU traces to VTF3 traces that may be loaded in the Vampir trace visualization tool.

PAPI and PCL options, listed in [papi_table] and [pcl_table], can also be used. Examples:

  • PCL_FP_INSTR - For floating point operations using PCL (-pcl=<dir>)

  • PAPI_FP_INS - For floating point operations using PAPI (-papi=<dir>)

  • PAPI_NATIVE_<event> - For native papi events using PAPI (-papi=<dir>)

NOTE: When -MULTIPLECOUNTERS is used with -TRACE option, the tracing library uses the wall-clock time from the function specified in the COUNTER1 variable. This should typically point to wall-clock time routines (such as GET_TIME_OF_DAY or SGI_TIMERS or LINUX_TIMERS ).

Example:

% setenv COUNTER1 P_WALL_CLOCK_TIME
% setenv COUNTER2 PAPI_L1_DCM
% setenv COUNTER3 PAPI_FP_INS

This will produce profile files in directories called MULTI__P_WALL_CLOCK_TIME, MULTI__PAPI_L1_DCM, and MULTI__PAPI_FP_INS.
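The two ways of selecting metrics described in this section can be compared side by side. The sketch below uses hypothetical helper names (tau_metrics_env, legacy_counter_env) and only builds the environment variable settings; it does not check whether a given metric is available on the machine.

```python
def tau_metrics_env(metrics):
    """Current method: one comma-separated TAU_METRICS value."""
    return {"TAU_METRICS": ",".join(metrics)}


def legacy_counter_env(metrics):
    """Old method: COUNTER1, COUNTER2, ... up to COUNTER25."""
    if len(metrics) > 25:
        raise ValueError("at most 25 COUNTER variables are available")
    return {"COUNTER%d" % i: name for i, name in enumerate(metrics, start=1)}


metrics = ["P_WALL_CLOCK_TIME", "PAPI_L1_DCM", "PAPI_FP_INS"]
print(tau_metrics_env(metrics))
print(legacy_counter_env(metrics))
```

Keeping a wall-clock metric first in the list matches the requirement above that COUNTER1 point to a wall-clock time routine when tracing is enabled.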

Table 1. Events measured by setting the environment variable TAU_METRICS in TAU

TAU_METRICS   Event measured
------------  ---------------------------------------------------
PAPI_L1_DCM   Level 1 data cache misses
PAPI_L1_ICM   Level 1 instruction cache misses
PAPI_L2_DCM   Level 2 data cache misses
PAPI_L2_ICM   Level 2 instruction cache misses
PAPI_L3_DCM   Level 3 data cache misses
PAPI_L3_ICM   Level 3 instruction cache misses
PAPI_L1_TCM   Level 1 total cache misses
PAPI_L2_TCM   Level 2 total cache misses
PAPI_L3_TCM   Level 3 total cache misses
PAPI_CA_SNP   Snoops
PAPI_CA_SHR   Request for access to shared cache line (SMP)
PAPI_CA_CLN   Request for access to clean cache line (SMP)
PAPI_CA_INV   Cache Line Invalidation (SMP)
PAPI_CA_ITV   Cache Line Intervention (SMP)
PAPI_L3_LDM   Level 3 load misses
PAPI_L3_STM   Level 3 store misses
PAPI_BRU_IDL  Cycles branch units are idle
PAPI_FXU_IDL  Cycles integer units are idle
PAPI_FPU_IDL  Cycles floating point units are idle
PAPI_LSU_IDL  Cycles load/store units are idle
PAPI_TLB_DM   Data translation lookaside buffer misses
PAPI_TLB_IM   Instruction translation lookaside buffer misses
PAPI_TLB_TL   Total translation lookaside buffer misses
PAPI_L1_LDM   Level 1 load misses
PAPI_L1_STM   Level 1 store misses
PAPI_L2_LDM   Level 2 load misses
PAPI_L2_STM   Level 2 store misses
PAPI_BTAC_M   BTAC miss
PAPI_PRF_DM   Prefetch data instruction caused a miss
PAPI_L3_DCH   Level 3 Data Cache Hit
PAPI_TLB_SD   Translation lookaside buffer shootdowns (SMP)
PAPI_CSR_FAL  Failed store conditional instructions
PAPI_CSR_SUC  Successful store conditional instructions
PAPI_CSR_TOT  Total store conditional instructions
PAPI_MEM_SCY  Cycles Stalled Waiting for Memory Access
PAPI_MEM_RCY  Cycles Stalled Waiting for Memory Read
PAPI_MEM_WCY  Cycles Stalled Waiting for Memory Write
PAPI_STL_ICY  Cycles with No Instruction Issue
PAPI_FUL_ICY  Cycles with Maximum Instruction Issue
PAPI_STL_CCY  Cycles with No Instruction Completion
PAPI_FUL_CCY  Cycles with Maximum Instruction Completion
PAPI_HW_INT   Hardware interrupts
PAPI_BR_UCN   Unconditional branch instructions executed
PAPI_BR_CN    Conditional branch instructions executed
PAPI_BR_TKN   Conditional branch instructions taken
PAPI_BR_NTK   Conditional branch instructions not taken
PAPI_BR_MSP   Conditional branch instructions mispredicted
PAPI_BR_PRC   Conditional branch instructions correctly predicted
PAPI_FMA_INS  FMA instructions completed
PAPI_TOT_IIS  Total instructions issued
PAPI_TOT_INS  Total instructions executed
PAPI_INT_INS  Integer instructions executed
PAPI_FP_INS   Floating point instructions executed
PAPI_LD_INS   Load instructions executed
PAPI_SR_INS   Store instructions executed
PAPI_BR_INS   Total branch instructions executed
PAPI_VEC_INS  Vector/SIMD instructions executed
PAPI_FLOPS    Floating Point Instructions executed per second
PAPI_RES_STL  Cycles processor is stalled on resource
PAPI_FP_STAL  FP units are stalled
PAPI_TOT_CYC  Total cycles
PAPI_IPS      Instructions executed per second
PAPI_LST_INS  Total load/store instructions executed
PAPI_SYC_INS  Synchronization instructions executed
PAPI_L1_DCH   L1 D Cache Hit
PAPI_L2_DCH   L2 D Cache Hit
PAPI_L1_DCA   L1 D Cache Access
PAPI_L2_DCA   L2 D Cache Access
PAPI_L3_DCA   L3 D Cache Access
PAPI_L1_DCR   L1 D Cache Read
PAPI_L2_DCR   L2 D Cache Read
PAPI_L3_DCR   L3 D Cache Read
PAPI_L1_DCW   L1 D Cache Write
PAPI_L2_DCW   L2 D Cache Write
PAPI_L3_DCW   L3 D Cache Write
PAPI_L1_ICH   L1 instruction cache hits
PAPI_L2_ICH   L2 instruction cache hits
PAPI_L3_ICH   L3 instruction cache hits
PAPI_L1_ICA   L1 instruction cache accesses
PAPI_L2_ICA   L2 instruction cache accesses
PAPI_L3_ICA   L3 instruction cache accesses
PAPI_L1_ICR   L1 instruction cache reads
PAPI_L2_ICR   L2 instruction cache reads
PAPI_L3_ICR   L3 instruction cache reads
PAPI_L1_ICW   L1 instruction cache writes
PAPI_L2_ICW   L2 instruction cache writes
PAPI_L3_ICW   L3 instruction cache writes
PAPI_L1_TCH   L1 total cache hits
PAPI_L2_TCH   L2 total cache hits
PAPI_L3_TCH   L3 total cache hits
PAPI_L1_TCA   L1 total cache accesses
PAPI_L2_TCA   L2 total cache accesses
PAPI_L3_TCA   L3 total cache accesses
PAPI_L1_TCR   L1 total cache reads
PAPI_L2_TCR   L2 total cache reads
PAPI_L3_TCR   L3 total cache reads
PAPI_L1_TCW   L1 total cache writes
PAPI_L2_TCW   L2 total cache writes
PAPI_L3_TCW   L3 total cache writes
PAPI_FML_INS  Floating point multiply instructions
PAPI_FAD_INS  Floating point add instructions
PAPI_FDV_INS  Floating point divide instructions
PAPI_FSQ_INS  Floating point square root instructions
PAPI_FNV_INS  Floating point inverse instructions

For example, to measure the floating point operations in routines using PCL:

% ./configure -pcl=/usr/local/packages/pcl-1.2
% setenv PCL_EVENT PCL_FP_INSTR
% mpirun -np 8 application

Table 2. Events measured by setting the environment variable PCL_EVENT in TAU

PCL_EVENT               Event measured
----------------------  --------------------------------------------
PCL_L1CACHE_READ        L1 (Level one) cache reads
PCL_L1CACHE_WRITE       L1 cache writes
PCL_L1CACHE_READWRITE   L1 cache reads and writes
PCL_L1CACHE_HIT         L1 cache hits
PCL_L1CACHE_MISS        L1 cache misses
PCL_L1DCACHE_READ       L1 data cache reads
PCL_L1DCACHE_WRITE      L1 data cache writes
PCL_L1DCACHE_READWRITE  L1 data cache reads and writes
PCL_L1DCACHE_HIT        L1 data cache hits
PCL_L1DCACHE_MISS       L1 data cache misses
PCL_L1ICACHE_READ       L1 instruction cache reads
PCL_L1ICACHE_WRITE      L1 instruction cache writes
PCL_L1ICACHE_READWRITE  L1 instruction cache reads and writes
PCL_L1ICACHE_HIT        L1 instruction cache hits
PCL_L1ICACHE_MISS       L1 instruction cache misses
PCL_L2CACHE_READ        L2 (Level two) cache reads
PCL_L2CACHE_WRITE       L2 cache writes
PCL_L2CACHE_READWRITE   L2 cache reads and writes
PCL_L2CACHE_HIT         L2 cache hits
PCL_L2CACHE_MISS        L2 cache misses
PCL_L2DCACHE_READ       L2 data cache reads
PCL_L2DCACHE_WRITE      L2 data cache writes
PCL_L2DCACHE_READWRITE  L2 data cache reads and writes
PCL_L2DCACHE_HIT        L2 data cache hits
PCL_L2DCACHE_MISS       L2 data cache misses
PCL_L2ICACHE_READ       L2 instruction cache reads
PCL_L2ICACHE_WRITE      L2 instruction cache writes
PCL_L2ICACHE_READWRITE  L2 instruction cache reads and writes
PCL_L2ICACHE_HIT        L2 instruction cache hits
PCL_L2ICACHE_MISS       L2 instruction cache misses
PCL_TLB_HIT             TLB (Translation Lookaside Buffer) hits
PCL_TLB_MISS            TLB misses
PCL_ITLB_HIT            Instruction TLB hits
PCL_ITLB_MISS           Instruction TLB misses
PCL_DTLB_HIT            Data TLB hits
PCL_DTLB_MISS           Data TLB misses
PCL_CYCLES              Cycles
PCL_ELAPSED_CYCLES      Cycles elapsed
PCL_INTEGER_INSTR       Integer instructions executed
PCL_FP_INSTR            Floating point (FP) instructions executed
PCL_LOAD_INSTR          Load instructions executed
PCL_STORE_INSTR         Store instructions executed
PCL_LOADSTORE_INSTR     Loads and stores executed
PCL_INSTR               Instructions executed
PCL_JUMP_SUCCESS        Successful jumps executed
PCL_JUMP_UNSUCCESS      Unsuccessful jumps executed
PCL_JUMP                Jumps executed
PCL_ATOMIC_SUCCESS      Successful atomic instructions executed
PCL_ATOMIC_UNSUCCESS    Unsuccessful atomic instructions executed
PCL_ATOMIC              Atomic instructions executed
PCL_STALL_INTEGER       Integer stalls
PCL_STALL_FP            Floating point stalls
PCL_STALL_JUMP          Jump stalls
PCL_STALL_LOAD          Load stalls
PCL_STALL_STORE         Store Stalls
PCL_STALL               Stalls
PCL_MFLOPS              Millions of floating point operations/second
PCL_IPC                 Instructions executed per cycle
PCL_L1DCACHE_MISSRATE   Level 1 data cache miss rate
PCL_L2DCACHE_MISSRATE   Level 2 data cache miss rate
PCL_MEM_FP_RATIO        Ratio of memory accesses to FP operations

Using Hardware Performance Counters

While running the application, set the environment variable PCL_EVENT or TAU_METRICS to specify which hardware performance counter TAU should use while profiling the application.

By default, only one counter is tracked at a time. To track more than one counter, use -MULTIPLECOUNTERS. See [multiplehardwarecounters] for more details.

To select floating point instructions for profiling using PAPI, you would:

% configure -papi=/usr/local/packages/papi-3.5.0
% make clean install
% cd examples/papi
% setenv TAU_METRICS PAPI_FP_INS
% a.out

In addition to the events listed in Table 1, you can use native events (see papi_native) on a given CPU by setting TAU_ to PAPI_NATIVE_<event>. For example:

% setenv PAPI_NATIVE PAPI_NATIVE_PM_BIQ_IDU_FULL_CYC
% a.out

By default, PAPI will profile events in all domains (user space, kernel, hypervisor, etc.). You can restrict the set of domains for PAPI event profiling by setting the TAU_PAPI_DOMAIN environment variable to one or more of these values (in a colon-separated list, if desired): PAPI_DOM_USER, PAPI_DOM_KERNEL, PAPI_DOM_SUPERVISOR, and PAPI_DOM_OTHER. For example:

% setenv TAU_PAPI_DOMAIN PAPI_DOM_SUPERVISOR:PAPI_DOM_OTHER

Profiling with PerfLib

This profiling option is currently under development at LANL.

To configure TAU with PerfLib use the following arguments:

%> configure -perflib=[path_to_perflib lib directory] \
             -perfinc=[path_to_perflib inc directory] \
             -perflibrary=[argument sent to the linker if different than default]

After TAU is built, a new Makefile will be generated with -perflib- in its name. Use this Makefile when profiling applications with perflib.

After configuration and installation, set the following environment variables as needed before running the application:

%> export PERF_PROFILE=1
%> export PERF_PROFILE_MPI=1
%> export PERF_PROFILE_MEMORY=1
%> export PERF_PROFILE_COUNTERS=1
%> export PERF_DATA_DIRECTORY=<directory>

We also provide the perf2tau conversion utility to convert perflib profiles to regular TAU profiles. To use perf2tau, set the environment variable perf_data_directory to the type of profiling to be converted (the directory where the data is stored will be called something like perf_data.[type]/). Alternatively, you may execute perf2tau with the type as an argument:

%> perf2tau [type]

See also the man page for perf2tau, [perf2tau].

Running a Python application with TAU

TAU can automatically instrument all Python routines when the tau python package is imported. Add <TAUROOT>/<ARCH>/lib/bindings-<options> to the PYTHONPATH environment variable in order to use the TAU module.

To execute the program, the tau.run routine is invoked with the name of the top-level Python code. For example:

#!/usr/bin/env python

import tau
from time import sleep

def f2():
    print("Inside f2: sleeping for 2 secs...")
    sleep(2)

def f1():
    print("Inside f1, calling f2...")
    f2()

def OurMain():
    f1()

# tau.run takes the top-level call as a string and profiles everything it invokes
tau.run('OurMain()')

This example instruments the routines OurMain(), f1(), and f2(), although there are no instrumentation calls in the routines. To use this feature, TAU must be configured with the -pythoninc=<dir> option (and -pythonlib=<dir> if running under IBM). Before running the application, the environment variables PYTHONPATH and LD_LIBRARY_PATH should be set to include the TAU library directory (where tau.py is stored).

Manual instrumentation of Python sources is also possible using the Python API and the pytau package. For example:

#!/usr/bin/env python

import pytau
from time import sleep

# Create two named timers
x = pytau.profileTimer("A Sleep for excl 5 secs")
y = pytau.profileTimer("B Sleep for excl 2 secs")
pytau.start(x)
print("Sleeping for 5 secs ...")
sleep(5)
# y is nested inside x: it starts after x starts and stops before x stops
pytau.start(y)
print("Sleeping for 2 secs ...")
sleep(2)
pytau.stop(y)
pytau.dbDump()
pytau.stop(x)

This example shows how two timers, x and y, are created and used. Note that multiple timers can be nested but must not overlap. Overlapping timers are detected by TAU at runtime and flagged with a warning (as exclusive time is not defined when timers overlap).
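The difference between nested and overlapping timers can be made concrete with a stack: a nested timer always stops while it is the innermost running timer. The sketch below is not TAU's detection logic, which is not described here; check_timers is a hypothetical helper that applies this stack rule to a list of start/stop events.

```python
def check_timers(events):
    """Return warnings for stop events that do not match the innermost timer.

    events is a list of ("start" | "stop", timer_name) pairs.
    """
    stack, warnings = [], []
    for action, name in events:
        if action == "start":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()  # properly nested: innermost timer stops first
        else:
            warnings.append("overlapping timer: " + name)
            if name in stack:
                stack.remove(name)
    return warnings


nested = [("start", "x"), ("start", "y"), ("stop", "y"), ("stop", "x")]
overlapping = [("start", "x"), ("start", "y"), ("stop", "x"), ("stop", "y")]
print(check_timers(nested))
print(check_timers(overlapping))
```

In the overlapping sequence, x stops while y is still running, so the time between the two stop events cannot be attributed exclusively to either timer.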

pprof

pprof sorts and displays profile data generated by TAU. To view the profile, merely execute pprof in the directory where profile files are located (or set the PROFILEDIR environment variable).

% pprof

Its usage is explained below:

usage: pprof [-c|-b|-m|-t|-e|-i] [-r] [-s] [-n num] [-f filename] \
       [-l] [node numbers]
  -c : Sort by number of Calls
  -b : Sort by number of suBroutines called by a function
  -m : Sort by Milliseconds (exclusive time total)
  -t : Sort by Total milliseconds (inclusive time total) (DEFAULT)
  -e : Sort by Exclusive time per call (msec/call)
  -i : Sort by Inclusive time per call (total msec/call)
  -v : Sort by standard deViation (excl usec)
  -r : Reverse sorting order
  -s : print only Summary profile information
  -n num : print only first num functions
  -f filename : specify full path and Filename without node ids
  -p : suPpress conversion to hh:mm:ss:mmm format
  -l : List all functions and exit
  -d : Dump output format (for Racy)
  [node numbers] : prints only info about all contexts/threads of given node numbers

Running a Java application with TAU

Java applications are profiled/traced using tau_java as shown below:

% cd tau/examples/java/pi
% setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH:<tauroot>/<arch>/lib
% tau_java  Pi

More information about tau_java can be found in the Tools section of the Reference Guide.

Running the application generates profile files with names having the form profile.<node>.<context>.<thread>. These files can be analyzed using pprof or paraprof.
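When post-processing a directory of profiles in a script, the node, context, and thread can be recovered from each file name. This is a minimal sketch that assumes all three fields are non-negative integers; parse_profile_name is a hypothetical helper, not a TAU utility.

```python
import re

# Assumption: node, context, and thread are written as plain decimal integers.
PROFILE_NAME = re.compile(r"^profile\.(\d+)\.(\d+)\.(\d+)$")


def parse_profile_name(filename):
    """Return (node, context, thread) or None if the name does not match."""
    match = PROFILE_NAME.match(filename)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


print(parse_profile_name("profile.0.0.0"))
print(parse_profile_name("profile.3.0.1"))
print(parse_profile_name("notes.txt"))
```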

Using a tau.conf File

If a tau.conf file is created, code that uses that TAU library will be affected by the settings in tau.conf. For example, if a directory tau-2.21/tau_system_defaults is created and a tau.conf file is placed in it, TAU will read that file before taking measurements. A user of those TAU libraries can choose to override the contents of that file by placing a tau.conf in their own directory. By default, however, if the system administrator chooses to create this directory, all users of those TAU libraries will be affected by this tau.conf.

For example, tau.conf could be:

% cat tau.conf
TAU_LOG_PATH=/soft/apps/tau/logs
PROFILEDIR=$TAU_LOG_PATH
TAU_PROFILE_FORMAT=merged
TAU_SUMMARY=1
TAU_IBM_BG_HWP_COUNTERS=1
TAU_TRACK_MESSAGE=1

Then anyone using TAU from that directory will get TAU_IBM_BG_HWP_COUNTERS=1, TAU_TRACK_MESSAGE=1, etc.
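The override behavior can be pictured as two sets of KEY=VALUE settings, with the user's file taking precedence. The sketch below is an illustration only: it assumes a key-by-key merge, which the text above does not confirm (TAU might instead read only one of the files), and parse_conf and the user override are hypothetical.

```python
def parse_conf(text):
    """Parse KEY=VALUE lines, skipping blank lines and lines without '='."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


system_conf = "TAU_PROFILE_FORMAT=merged\nTAU_SUMMARY=1\nTAU_TRACK_MESSAGE=1\n"
user_conf = "TAU_SUMMARY=0\n"  # hypothetical per-user override

# Assumption: later entries win, so the user's file overrides system defaults.
effective = {**parse_conf(system_conf), **parse_conf(user_conf)}
print(effective)
```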

Using Score-P with TAU

TAU can be configured to use the Score-P measurement infrastructure (www.score-p.org). To use Score-P, configure TAU with the -scorep= option to point TAU to the Score-P installation. (Please use Score-P version 1.0 beta or above.) You may then instrument and run your application with TAU in a manner of your choosing.

Set the environment variable SCOREP_PROFILING_FORMAT to TAU_SNAPSHOT to produce TAU Snapshot files, which will be found in scorep*/tau/. Also, the Score-P library must be found in LD_LIBRARY_PATH.

Using UPC with TAU

Please see examples/upc for more details.

To instrument Berkeley UPC with GASP, configure TAU with -upcnetwork=<option>, where <option> is "mpi" or "udp". Then use a selective instrumentation file like the one shown below.

BEGIN_INSTRUMENT_SECTION
forall routine="#"
loops routine="#"
barrier routine="#"
fence routine="#"
notify routine="#"
END_INSTRUMENT_SECTION

Then tau_upc.sh can be used to build the application. If "udp" is used with -upcnetwork, then upcrun can be used to run the application. For "mpi", mpirun or a similar mechanism can be used.

To instrument UPC with Cray CCE compilers, the following will produce a configuration that supports Cray UPC and may be used with tau_upc.sh:

module load PrgEnv-cray
./configure -arch=craycnl -pdt=<dir> -pdt_c++=g++

TAU can also build the DMAPP wrapper using Cray CCE compilers. When the application is built with TAU and the -optDMAPP option is set in TAU_OPTIONS, tau_upc.sh automatically instruments DMAPP events.