2.5. Profiling with Hardware Counters

LIST OF COUNTERS:

Set the TAU_METRICS environment variable to a comma-separated list of metrics. Alternatively, to use the older method, set the COUNTER<1-25> environment variables. Either way, the following values are available:

  • GET_TIME_OF_DAY - For the default profiling option using gettimeofday()

  • SGI_TIMERS - For -SGITIMERS configuration option under IRIX

  • CRAY_TIMERS - For -CRAYTIMERS configuration option under Cray X1.

  • LINUX_TIMERS - For -LINUXTIMERS configuration option under Linux

  • CPU_TIME - For user+system time from getrusage() call with -CPUTIME

  • P_WALL_CLOCK_TIME - For PAPI's WALLCLOCK time using -PAPIWALLCLOCK

  • P_VIRTUAL_TIME - For PAPI's process virtual time using -PAPIVIRTUAL

  • TAU_MUSE - For reading counts of Linux OS kernel-level events when MAGNET/MUSE is installed and the -muse configuration option is enabled. The TAU_MUSE_PACKAGE environment variable has to be set to the package name (busy_time, count, etc.)

  • TAU_MPI_MESSAGE_SIZE - For tracking the cumulative message size for all MPI operations by a node for each routine.

  • ENERGY - For tracking the energy use of the application in joules. Requires an -arch=craycnl configuration.

  • ACCEL_ENERGY - For tracking the energy use of the application on accelerators in joules. Requires an -arch=craycnl configuration.
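
As a minimal sketch of the TAU_METRICS method, the metric list can be assembled and exported in one step. The sketch uses Bourne-shell export syntax (csh users would use setenv, as in the other examples in this section). The particular metrics chosen and the launch command in the final comment are illustrative assumptions, and the PAPI metrics assume that TAU was configured with -papi=<dir>.

```shell
# Select one timer and two PAPI counters (illustrative choice of metrics).
# The PAPI metrics assume TAU was configured with -papi=<dir>.
TAU_METRICS="GET_TIME_OF_DAY,PAPI_L1_DCM,PAPI_FP_INS"
export TAU_METRICS

# Show the value that the instrumented application will inherit.
echo "TAU_METRICS=$TAU_METRICS"

# Then run the instrumented program as usual, for example:
#   mpirun -np 8 application
```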

Note

When TAU is configured with the -TRACE, -MULTIPLECOUNTERS, and -papi=<dir> options, the COUNTER1 environment variable must be set to GET_TIME_OF_DAY so that TAU's tracing module can use a globally synchronized real-time clock for time-stamping event records. When tracing is used with hardware performance counters, the counters specified in the environment variables COUNTER[2-25] are read at routine transitions and logged in the trace file. Use the tau2vtf tool to convert TAU traces to VTF3 traces, which can be loaded in the Vampir trace visualization tool.
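
The requirement in the note can be checked before a traced run is launched. The following is a sketch only: check_trace_clock is a hypothetical helper written for this example, not a TAU command, and the counters placed in COUNTER2 and COUNTER3 are an arbitrary choice. It uses Bourne-shell syntax rather than csh.

```shell
# Hypothetical helper (not part of TAU): succeeds only when the metric
# passed as $1 is GET_TIME_OF_DAY, as the note requires for COUNTER1.
check_trace_clock() {
    [ "$1" = "GET_TIME_OF_DAY" ]
}

COUNTER1=GET_TIME_OF_DAY   # time-stamps the event records
COUNTER2=PAPI_FP_INS       # read at routine transitions
COUNTER3=PAPI_L1_DCM       # read at routine transitions
export COUNTER1 COUNTER2 COUNTER3

if check_trace_clock "$COUNTER1"; then
    echo "COUNTER1 is set correctly for tracing"
else
    echo "warning: COUNTER1 should be GET_TIME_OF_DAY when tracing" >&2
fi
```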

The list of counters also includes the PAPI and PCL options found in Table 2.1, “Events measured by setting the environment variable TAU_METRICS in TAU” and Table 2.2, “Events measured by setting the environment variable PCL_EVENT in TAU”. Examples:

  • PCL_FP_INSTR - For floating point operations using PCL (-pcl=<dir>)

  • PAPI_FP_INS - For floating point operations using PAPI (-papi=<dir>)

  • PAPI_NATIVE_<event> - For native papi events using PAPI (-papi=<dir>)

NOTE: When -MULTIPLECOUNTERS is used with the -TRACE option, the tracing library takes the wall-clock time from the function specified in the COUNTER1 variable. This should typically be a wall-clock time routine (such as GET_TIME_OF_DAY, SGI_TIMERS, or LINUX_TIMERS).

Example:

% setenv COUNTER1 P_WALL_CLOCK_TIME
% setenv COUNTER2 PAPI_L1_DCM
% setenv COUNTER3 PAPI_FP_INS

These settings produce profile files in directories called MULTI__P_WALL_CLOCK_TIME, MULTI__PAPI_L1_DCM, and MULTI__PAPI_FP_INS.
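
The mapping from metric name to profile directory can be sketched as a small shell function. It assumes the naming convention MULTI__<metric> (the prefix MULTI followed by two underscores); profile_dir is a hypothetical helper for illustration, not a TAU tool, and the sketch uses Bourne-shell syntax.

```shell
# Hypothetical helper: print the profile directory expected for a metric,
# assuming the MULTI__<metric> naming convention.
profile_dir() {
    printf 'MULTI__%s\n' "$1"
}

# List the directories expected for the three counters in the example.
for metric in P_WALL_CLOCK_TIME PAPI_L1_DCM PAPI_FP_INS; do
    profile_dir "$metric"
done
```

Each directory holds the profile for one metric, so a profile browser is typically run from inside the directory of interest.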

Table 2.1. Events measured by setting the environment variable TAU_METRICS in TAU

TAU_METRICS   Event measured
PAPI_L1_DCM   Level 1 data cache misses
PAPI_L1_ICM   Level 1 instruction cache misses
PAPI_L2_DCM   Level 2 data cache misses
PAPI_L2_ICM   Level 2 instruction cache misses
PAPI_L3_DCM   Level 3 data cache misses
PAPI_L3_ICM   Level 3 instruction cache misses
PAPI_L1_TCM   Level 1 total cache misses
PAPI_L2_TCM   Level 2 total cache misses
PAPI_L3_TCM   Level 3 total cache misses
PAPI_CA_SNP   Snoops
PAPI_CA_SHR   Requests for access to shared cache line (SMP)
PAPI_CA_CLN   Requests for access to clean cache line (SMP)
PAPI_CA_INV   Cache line invalidations (SMP)
PAPI_CA_ITV   Cache line interventions (SMP)
PAPI_L3_LDM   Level 3 load misses
PAPI_L3_STM   Level 3 store misses
PAPI_BRU_IDL  Cycles branch units are idle
PAPI_FXU_IDL  Cycles integer units are idle
PAPI_FPU_IDL  Cycles floating point units are idle
PAPI_LSU_IDL  Cycles load/store units are idle
PAPI_TLB_DM   Data translation lookaside buffer misses
PAPI_TLB_IM   Instruction translation lookaside buffer misses
PAPI_TLB_TL   Total translation lookaside buffer misses
PAPI_L1_LDM   Level 1 load misses
PAPI_L1_STM   Level 1 store misses
PAPI_L2_LDM   Level 2 load misses
PAPI_L2_STM   Level 2 store misses
PAPI_BTAC_M   Branch target address cache (BTAC) misses
PAPI_PRF_DM   Prefetch data instruction caused a miss
PAPI_L3_DCH   Level 3 data cache hits
PAPI_TLB_SD   Translation lookaside buffer shootdowns (SMP)
PAPI_CSR_FAL  Failed store conditional instructions
PAPI_CSR_SUC  Successful store conditional instructions
PAPI_CSR_TOT  Total store conditional instructions
PAPI_MEM_SCY  Cycles stalled waiting for memory access
PAPI_MEM_RCY  Cycles stalled waiting for memory read
PAPI_MEM_WCY  Cycles stalled waiting for memory write
PAPI_STL_ICY  Cycles with no instruction issue
PAPI_FUL_ICY  Cycles with maximum instruction issue
PAPI_STL_CCY  Cycles with no instruction completion
PAPI_FUL_CCY  Cycles with maximum instruction completion
PAPI_HW_INT   Hardware interrupts
PAPI_BR_UCN   Unconditional branch instructions executed
PAPI_BR_CN    Conditional branch instructions executed
PAPI_BR_TKN   Conditional branch instructions taken
PAPI_BR_NTK   Conditional branch instructions not taken
PAPI_BR_MSP   Conditional branch instructions mispredicted
PAPI_BR_PRC   Conditional branch instructions correctly predicted
PAPI_FMA_INS  FMA instructions completed
PAPI_TOT_IIS  Total instructions issued
PAPI_TOT_INS  Total instructions executed
PAPI_INT_INS  Integer instructions executed
PAPI_FP_INS   Floating point instructions executed
PAPI_LD_INS   Load instructions executed
PAPI_SR_INS   Store instructions executed
PAPI_BR_INS   Total branch instructions executed
PAPI_VEC_INS  Vector/SIMD instructions executed
PAPI_FLOPS    Floating point instructions executed per second
PAPI_RES_STL  Cycles processor is stalled on resource
PAPI_FP_STAL  Cycles floating point units are stalled
PAPI_TOT_CYC  Total cycles
PAPI_IPS      Instructions executed per second
PAPI_LST_INS  Total load/store instructions executed
PAPI_SYC_INS  Synchronization instructions executed
PAPI_L1_DCH   L1 data cache hits
PAPI_L2_DCH   L2 data cache hits
PAPI_L1_DCA   L1 data cache accesses
PAPI_L2_DCA   L2 data cache accesses
PAPI_L3_DCA   L3 data cache accesses
PAPI_L1_DCR   L1 data cache reads
PAPI_L2_DCR   L2 data cache reads
PAPI_L3_DCR   L3 data cache reads
PAPI_L1_DCW   L1 data cache writes
PAPI_L2_DCW   L2 data cache writes
PAPI_L3_DCW   L3 data cache writes
PAPI_L1_ICH   L1 instruction cache hits
PAPI_L2_ICH   L2 instruction cache hits
PAPI_L3_ICH   L3 instruction cache hits
PAPI_L1_ICA   L1 instruction cache accesses
PAPI_L2_ICA   L2 instruction cache accesses
PAPI_L3_ICA   L3 instruction cache accesses
PAPI_L1_ICR   L1 instruction cache reads
PAPI_L2_ICR   L2 instruction cache reads
PAPI_L3_ICR   L3 instruction cache reads
PAPI_L1_ICW   L1 instruction cache writes
PAPI_L2_ICW   L2 instruction cache writes
PAPI_L3_ICW   L3 instruction cache writes
PAPI_L1_TCH   L1 total cache hits
PAPI_L2_TCH   L2 total cache hits
PAPI_L3_TCH   L3 total cache hits
PAPI_L1_TCA   L1 total cache accesses
PAPI_L2_TCA   L2 total cache accesses
PAPI_L3_TCA   L3 total cache accesses
PAPI_L1_TCR   L1 total cache reads
PAPI_L2_TCR   L2 total cache reads
PAPI_L3_TCR   L3 total cache reads
PAPI_L1_TCW   L1 total cache writes
PAPI_L2_TCW   L2 total cache writes
PAPI_L3_TCW   L3 total cache writes
PAPI_FML_INS  Floating point multiply instructions
PAPI_FAD_INS  Floating point add instructions
PAPI_FDV_INS  Floating point divide instructions
PAPI_FSQ_INS  Floating point square root instructions
PAPI_FNV_INS  Floating point inverse instructions
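
Counts from two events in the table can be combined into a derived metric after a run. As a worked example, the sketch below computes a Level 1 data cache miss rate from PAPI_L1_DCM and PAPI_L1_DCA. The counter totals are made-up numbers chosen for illustration, not measurements, and the calculation is done here with awk in a Bourne shell rather than by any TAU tool.

```shell
# Hypothetical counter totals for one routine (made-up numbers, not measured).
l1_dcm=2500      # PAPI_L1_DCM: Level 1 data cache misses
l1_dca=100000    # PAPI_L1_DCA: Level 1 data cache accesses

# Miss rate in percent = 100 * misses / accesses
miss_rate=$(awk -v m="$l1_dcm" -v a="$l1_dca" 'BEGIN { printf "%.2f", 100 * m / a }')
echo "L1 data cache miss rate: ${miss_rate}%"
```

With these numbers the rate is 100 * 2500 / 100000 = 2.50 percent.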

For example, to measure the floating point operations in routines using PCL:

% ./configure -pcl=/usr/local/packages/pcl-1.2
% setenv PCL_EVENT PCL_FP_INSTR
% mpirun -np 8 application

Table 2.2. Events measured by setting the environment variable PCL_EVENT in TAU

PCL_EVENT               Event measured
PCL_L1CACHE_READ        L1 (Level one) cache reads
PCL_L1CACHE_WRITE       L1 cache writes
PCL_L1CACHE_READWRITE   L1 cache reads and writes
PCL_L1CACHE_HIT         L1 cache hits
PCL_L1CACHE_MISS        L1 cache misses
PCL_L1DCACHE_READ       L1 data cache reads
PCL_L1DCACHE_WRITE      L1 data cache writes
PCL_L1DCACHE_READWRITE  L1 data cache reads and writes
PCL_L1DCACHE_HIT        L1 data cache hits
PCL_L1DCACHE_MISS       L1 data cache misses
PCL_L1ICACHE_READ       L1 instruction cache reads
PCL_L1ICACHE_WRITE      L1 instruction cache writes
PCL_L1ICACHE_READWRITE  L1 instruction cache reads and writes
PCL_L1ICACHE_HIT        L1 instruction cache hits
PCL_L1ICACHE_MISS       L1 instruction cache misses
PCL_L2CACHE_READ        L2 (Level two) cache reads
PCL_L2CACHE_WRITE       L2 cache writes
PCL_L2CACHE_READWRITE   L2 cache reads and writes
PCL_L2CACHE_HIT         L2 cache hits
PCL_L2CACHE_MISS        L2 cache misses
PCL_L2DCACHE_READ       L2 data cache reads
PCL_L2DCACHE_WRITE      L2 data cache writes
PCL_L2DCACHE_READWRITE  L2 data cache reads and writes
PCL_L2DCACHE_HIT        L2 data cache hits
PCL_L2DCACHE_MISS       L2 data cache misses
PCL_L2ICACHE_READ       L2 instruction cache reads
PCL_L2ICACHE_WRITE      L2 instruction cache writes
PCL_L2ICACHE_READWRITE  L2 instruction cache reads and writes
PCL_L2ICACHE_HIT        L2 instruction cache hits
PCL_L2ICACHE_MISS       L2 instruction cache misses
PCL_TLB_HIT             TLB (Translation Lookaside Buffer) hits
PCL_TLB_MISS            TLB misses
PCL_ITLB_HIT            Instruction TLB hits
PCL_ITLB_MISS           Instruction TLB misses
PCL_DTLB_HIT            Data TLB hits
PCL_DTLB_MISS           Data TLB misses
PCL_CYCLES              Cycles
PCL_ELAPSED_CYCLES      Cycles elapsed
PCL_INTEGER_INSTR       Integer instructions executed
PCL_FP_INSTR            Floating point (FP) instructions executed
PCL_LOAD_INSTR          Load instructions executed
PCL_STORE_INSTR         Store instructions executed
PCL_LOADSTORE_INSTR     Loads and stores executed
PCL_INSTR               Instructions executed
PCL_JUMP_SUCCESS        Successful jumps executed
PCL_JUMP_UNSUCCESS      Unsuccessful jumps executed
PCL_JUMP                Jumps executed
PCL_ATOMIC_SUCCESS      Successful atomic instructions executed
PCL_ATOMIC_UNSUCCESS    Unsuccessful atomic instructions executed
PCL_ATOMIC              Atomic instructions executed
PCL_STALL_INTEGER       Integer stalls
PCL_STALL_FP            Floating point stalls
PCL_STALL_JUMP          Jump stalls
PCL_STALL_LOAD          Load stalls
PCL_STALL_STORE         Store stalls
PCL_STALL               Stalls
PCL_MFLOPS              Millions of floating point operations/second
PCL_IPC                 Instructions executed per cycle
PCL_L1DCACHE_MISSRATE   Level 1 data cache miss rate
PCL_L2DCACHE_MISSRATE   Level 2 data cache miss rate
PCL_MEM_FP_RATIO        Ratio of memory accesses to FP operations