LIST OF COUNTERS:
Set the TAU_METRICS environment variable to a comma-separated list of metrics or, to use the old method, set the COUNTER<1-25> environment variables to the following values:
- GET_TIME_OF_DAY - For the default profiling option using gettimeofday()
- SGI_TIMERS - For the -SGITIMERS configuration option under IRIX
- CRAY_TIMERS - For the -CRAYTIMERS configuration option under Cray X1
- LINUX_TIMERS - For the -LINUXTIMERS configuration option under Linux
- CPU_TIME - For user+system time from the getrusage() call with -CPUTIME
- P_WALL_CLOCK_TIME - For PAPI's wall-clock time using -PAPIWALLCLOCK
- P_VIRTUAL_TIME - For PAPI's process virtual time using -PAPIVIRTUAL
- TAU_MUSE - For reading counts of Linux OS kernel-level events when MAGNET/MUSE is installed and the -muse configuration option is enabled. The TAU_MUSE_PACKAGE environment variable has to be set to the package name (busy_time, count, etc.)
- TAU_MPI_MESSAGE_SIZE - For tracking the cumulative message size for all MPI operations by a node for each routine
- ENERGY - For tracking the energy use of the application in joules. Requires an -arch=craycnl configuration.
- ACCEL_ENERGY - For tracking the energy use of the application on accelerators in joules. Requires an -arch=craycnl configuration.
NOTE: When TAU is configured with the -TRACE, -MULTIPLECOUNTERS and -papi=<dir> options, the COUNTER1 environment variable must be set to GET_TIME_OF_DAY to allow TAU's tracing module to use a globally synchronized real-time clock for time-stamping event records. When tracing is used with hardware performance counters, the counters specified in the environment variables COUNTER[2-25] are accessed at routine transitions and logged in the trace file. Use the tau2vtf tool to convert TAU traces to VTF3 traces that may be loaded in the Vampir trace visualization tool.
The PAPI and PCL options listed in Table 2.1, “Events measured by setting the environment variable TAU_METRICS in TAU” and Table 2.2, “Events measured by setting the environment variable PCL_EVENT in TAU” can also be used. Example:
- PCL_FP_INSTR - For floating point operations using PCL (-pcl=<dir>)
- PAPI_FP_INS - For floating point operations using PAPI (-papi=<dir>)
- PAPI_NATIVE_<event> - For native PAPI events using PAPI (-papi=<dir>)
NOTE: When -MULTIPLECOUNTERS is used with the -TRACE option, the tracing library uses the wall-clock time from the function specified in the COUNTER1 variable. This should typically point to wall-clock time routines (such as GET_TIME_OF_DAY, SGI_TIMERS or LINUX_TIMERS).
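The requirement in the note above can be checked before a traced run. The following is a minimal sketch in POSIX sh (the document's own examples use csh `setenv`; `export` is the sh counterpart). The function name `is_wallclock_counter` is hypothetical and not part of TAU, and the accepted set is limited to the three names the note gives; other wall-clock sources would have to be added by hand.

```shell
# Hypothetical helper (not part of TAU): succeeds only for the
# wall-clock sources named in the note above.
is_wallclock_counter() {
  case "$1" in
    GET_TIME_OF_DAY|SGI_TIMERS|LINUX_TIMERS) return 0 ;;
    *) return 1 ;;
  esac
}

# Wall-clock source first, hardware counter second.
COUNTER1=GET_TIME_OF_DAY
COUNTER2=PAPI_FP_INS
export COUNTER1 COUNTER2

if is_wallclock_counter "$COUNTER1"; then
  echo "COUNTER1=$COUNTER1 is a wall-clock source"
else
  echo "warning: COUNTER1=$COUNTER1 is not a wall-clock source" >&2
fi
```

A check like this only guards against accidentally placing a hardware counter in COUNTER1; it does not verify that TAU was configured with the matching timer option.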
Example:
% setenv COUNTER1 P_WALL_CLOCK_TIME
% setenv COUNTER2 PAPI_L1_DCM
% setenv COUNTER3 PAPI_FP_INS
will produce profile files in directories called MULTI__P_WALL_CLOCK_TIME, MULTI__PAPI_L1_DCM, and MULTI__PAPI_FP_INS.
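The same three metrics can also be expressed through TAU_METRICS, which this section describes as a comma-separated list. The sketch below, in POSIX sh rather than the csh used above, builds that list from one space-separated string and also shows the equivalent COUNTER<n> assignments of the old method. The variable names `metrics`, `m` and `n` are illustrative only, and both methods are set here purely for comparison; in practice one of the two would be chosen.

```shell
# The three metrics from the COUNTER example above, in the same order.
metrics="P_WALL_CLOCK_TIME PAPI_L1_DCM PAPI_FP_INS"

# New method: join the names with commas for TAU_METRICS.
TAU_METRICS=$(printf '%s' "$metrics" | tr ' ' ',')
export TAU_METRICS

# Old method: one COUNTER<n> variable per metric, numbered from 1.
n=1
for m in $metrics; do
  export "COUNTER$n=$m"
  n=$((n + 1))
done

echo "TAU_METRICS=$TAU_METRICS"
```

Keeping the metric names in a single string means the order, and therefore which metric ends up in COUNTER1, is defined in one place.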
Table 2.1. Events measured by setting the environment variable TAU_METRICS in TAU
TAU_METRICS | EVENT Measured |
---|---|
PAPI_L1_DCM | Level 1 data cache misses |
PAPI_L1_ICM | Level 1 instruction cache misses |
PAPI_L2_DCM | Level 2 data cache misses |
PAPI_L2_ICM | Level 2 instruction cache misses |
PAPI_L3_DCM | Level 3 data cache misses |
PAPI_L3_ICM | Level 3 instruction cache misses |
PAPI_L1_TCM | Level 1 total cache misses |
PAPI_L2_TCM | Level 2 total cache misses |
PAPI_L3_TCM | Level 3 total cache misses |
PAPI_CA_SNP | Snoops |
PAPI_CA_SHR | Request for access to shared cache line (SMP) |
PAPI_CA_CLN | Request for access to clean cache line (SMP) |
PAPI_CA_INV | Cache Line Invalidation (SMP) |
PAPI_CA_ITV | Cache Line Intervention (SMP) |
PAPI_L3_LDM | Level 3 load misses |
PAPI_L3_STM | Level 3 store misses |
PAPI_BRU_IDL | Cycles branch units are idle |
PAPI_FXU_IDL | Cycles integer units are idle |
PAPI_FPU_IDL | Cycles floating point units are idle |
PAPI_LSU_IDL | Cycles load/store units are idle |
PAPI_TLB_DM | Data translation lookaside buffer misses |
PAPI_TLB_IM | Instruction translation lookaside buffer misses |
PAPI_TLB_TL | Total translation lookaside buffer misses |
PAPI_L1_LDM | Level 1 load misses |
PAPI_L1_STM | Level 1 store misses |
PAPI_L2_LDM | Level 2 load misses |
PAPI_L2_STM | Level 2 store misses |
PAPI_BTAC_M | BTAC miss |
PAPI_PRF_DM | Prefetch data instruction caused a miss |
PAPI_L3_DCH | Level 3 data cache hits |
PAPI_TLB_SD | Translation lookaside buffer shootdowns (SMP) |
PAPI_CSR_FAL | Failed store conditional instructions |
PAPI_CSR_SUC | Successful store conditional instructions |
PAPI_CSR_TOT | Total store conditional instructions |
PAPI_MEM_SCY | Cycles Stalled Waiting for Memory Access |
PAPI_MEM_RCY | Cycles Stalled Waiting for Memory Read |
PAPI_MEM_WCY | Cycles Stalled Waiting for Memory Write |
PAPI_STL_ICY | Cycles with No Instruction Issue |
PAPI_FUL_ICY | Cycles with Maximum Instruction Issue |
PAPI_STL_CCY | Cycles with No Instruction Completion |
PAPI_FUL_CCY | Cycles with Maximum Instruction Completion |
PAPI_HW_INT | Hardware interrupts |
PAPI_BR_UCN | Unconditional branch instructions executed |
PAPI_BR_CN | Conditional branch instructions executed |
PAPI_BR_TKN | Conditional branch instructions taken |
PAPI_BR_NTK | Conditional branch instructions not taken |
PAPI_BR_MSP | Conditional branch instructions mispredicted |
PAPI_BR_PRC | Conditional branch instructions correctly predicted |
PAPI_FMA_INS | FMA instructions completed |
PAPI_TOT_IIS | Total instructions issued |
PAPI_TOT_INS | Total instructions executed |
PAPI_INT_INS | Integer instructions executed |
PAPI_FP_INS | Floating point instructions executed |
PAPI_LD_INS | Load instructions executed |
PAPI_SR_INS | Store instructions executed |
PAPI_BR_INS | Total branch instructions executed |
PAPI_VEC_INS | Vector/SIMD instructions executed |
PAPI_FLOPS | Floating Point Instructions executed per second |
PAPI_RES_STL | Cycles processor is stalled on resource |
PAPI_FP_STAL | FP units are stalled |
PAPI_TOT_CYC | Total cycles |
PAPI_IPS | Instructions executed per second |
PAPI_LST_INS | Total load/store instructions executed |
PAPI_SYC_INS | Synchronization instructions executed |
PAPI_L1_DCH | Level 1 data cache hits |
PAPI_L2_DCH | Level 2 data cache hits |
PAPI_L1_DCA | Level 1 data cache accesses |
PAPI_L2_DCA | Level 2 data cache accesses |
PAPI_L3_DCA | Level 3 data cache accesses |
PAPI_L1_DCR | Level 1 data cache reads |
PAPI_L2_DCR | Level 2 data cache reads |
PAPI_L3_DCR | Level 3 data cache reads |
PAPI_L1_DCW | Level 1 data cache writes |
PAPI_L2_DCW | Level 2 data cache writes |
PAPI_L3_DCW | Level 3 data cache writes |
PAPI_L1_ICH | L1 instruction cache hits |
PAPI_L2_ICH | L2 instruction cache hits |
PAPI_L3_ICH | L3 instruction cache hits |
PAPI_L1_ICA | L1 instruction cache accesses |
PAPI_L2_ICA | L2 instruction cache accesses |
PAPI_L3_ICA | L3 instruction cache accesses |
PAPI_L1_ICR | L1 instruction cache reads |
PAPI_L2_ICR | L2 instruction cache reads |
PAPI_L3_ICR | L3 instruction cache reads |
PAPI_L1_ICW | L1 instruction cache writes |
PAPI_L2_ICW | L2 instruction cache writes |
PAPI_L3_ICW | L3 instruction cache writes |
PAPI_L1_TCH | L1 total cache hits |
PAPI_L2_TCH | L2 total cache hits |
PAPI_L3_TCH | L3 total cache hits |
PAPI_L1_TCA | L1 total cache accesses |
PAPI_L2_TCA | L2 total cache accesses |
PAPI_L3_TCA | L3 total cache accesses |
PAPI_L1_TCR | L1 total cache reads |
PAPI_L2_TCR | L2 total cache reads |
PAPI_L3_TCR | L3 total cache reads |
PAPI_L1_TCW | L1 total cache writes |
PAPI_L2_TCW | L2 total cache writes |
PAPI_L3_TCW | L3 total cache writes |
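PAPI_FML_INS | Floating point multiply instructions |
PAPI_FAD_INS | Floating point add instructions |
PAPI_FDV_INS | Floating point divide instructions |
PAPI_FSQ_INS | Floating point square root instructions |
PAPI_FNV_INS | Floating point inverse instructions |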
For example, to measure the floating point operations in routines using PCL:
% ./configure -pcl=/usr/local/packages/pcl-1.2
% setenv PCL_EVENT PCL_FP_INSTR
% mpirun -np 8 application
Table 2.2. Events measured by setting the environment variable PCL_EVENT in TAU
PCL_EVENT | EVENT Measured |
---|---|
PCL_L1CACHE_READ | L1 (Level one) cache reads |
PCL_L1CACHE_WRITE | L1 cache writes |
PCL_L1CACHE_READWRITE | L1 cache reads and writes |
PCL_L1CACHE_HIT | L1 cache hits |
PCL_L1CACHE_MISS | L1 cache misses |
PCL_L1DCACHE_READ | L1 data cache reads |
PCL_L1DCACHE_WRITE | L1 data cache writes |
PCL_L1DCACHE_READWRITE | L1 data cache reads and writes |
PCL_L1DCACHE_HIT | L1 data cache hits |
PCL_L1DCACHE_MISS | L1 data cache misses |
PCL_L1ICACHE_READ | L1 instruction cache reads |
PCL_L1ICACHE_WRITE | L1 instruction cache writes |
PCL_L1ICACHE_READWRITE | L1 instruction cache reads and writes |
PCL_L1ICACHE_HIT | L1 instruction cache hits |
PCL_L1ICACHE_MISS | L1 instruction cache misses |
PCL_L2CACHE_READ | L2 (Level two) cache reads |
PCL_L2CACHE_WRITE | L2 cache writes |
PCL_L2CACHE_READWRITE | L2 cache reads and writes |
PCL_L2CACHE_HIT | L2 cache hits |
PCL_L2CACHE_MISS | L2 cache misses |
PCL_L2DCACHE_READ | L2 data cache reads |
PCL_L2DCACHE_WRITE | L2 data cache writes |
PCL_L2DCACHE_READWRITE | L2 data cache reads and writes |
PCL_L2DCACHE_HIT | L2 data cache hits |
PCL_L2DCACHE_MISS | L2 data cache misses |
PCL_L2ICACHE_READ | L2 instruction cache reads |
PCL_L2ICACHE_WRITE | L2 instruction cache writes |
PCL_L2ICACHE_READWRITE | L2 instruction cache reads and writes |
PCL_L2ICACHE_HIT | L2 instruction cache hits |
PCL_L2ICACHE_MISS | L2 instruction cache misses |
PCL_TLB_HIT | TLB (Translation Lookaside Buffer) hits |
PCL_TLB_MISS | TLB misses |
PCL_ITLB_HIT | Instruction TLB hits |
PCL_ITLB_MISS | Instruction TLB misses |
PCL_DTLB_HIT | Data TLB hits |
PCL_DTLB_MISS | Data TLB misses |
PCL_CYCLES | Cycles |
PCL_ELAPSED_CYCLES | Cycles elapsed |
PCL_INTEGER_INSTR | Integer instructions executed |
PCL_FP_INSTR | Floating point (FP) instructions executed |
PCL_LOAD_INSTR | Load instructions executed |
PCL_STORE_INSTR | Store instructions executed |
PCL_LOADSTORE_INSTR | Loads and stores executed |
PCL_INSTR | Instructions executed |
PCL_JUMP_SUCCESS | Successful jumps executed |
PCL_JUMP_UNSUCCESS | Unsuccessful jumps executed |
PCL_JUMP | Jumps executed |
PCL_ATOMIC_SUCCESS | Successful atomic instructions executed |
PCL_ATOMIC_UNSUCCESS | Unsuccessful atomic instructions executed |
PCL_ATOMIC | Atomic instructions executed |
PCL_STALL_INTEGER | Integer stalls |
PCL_STALL_FP | Floating point stalls |
PCL_STALL_JUMP | Jump stalls |
PCL_STALL_LOAD | Load stalls |
PCL_STALL_STORE | Store stalls |
PCL_STALL | Stalls |
PCL_MFLOPS | Millions of floating point operations/second |
PCL_IPC | Instructions executed per cycle |
PCL_L1DCACHE_MISSRATE | Level 1 data cache miss rate |
PCL_L2DCACHE_MISSRATE | Level 2 data cache miss rate |
PCL_MEM_FP_RATIO | Ratio of memory accesses to FP operations |