Tutorial

Using Hardware Counters for Profiling instead of Time

Modern microprocessors provide performance counters. These count hardware performance events, such as cache misses and floating point operations, while the program executes on the processor. The Performance Data Standard and API (PAPI, URL: http://icl.cs.utk.edu/projects/papi/) and Performance Counter Library (PCL, URL: http://www.fz-juelich.de/zam/PCL/) packages provide a uniform interface to these performance counters. TAU can use either PAPI or PCL to access them.

To use these counters, download and install PAPI or PCL. Then configure TAU with the -pcl=<dir> or -papi=<dir> configuration command-line option to specify the location of PCL or PAPI, and build TAU and your applications as you normally would. When running the application, set the environment variable PCL_EVENT or PAPI_EVENT, respectively, to specify which hardware performance counter TAU should use while profiling the application. For example, to measure the floating point operations in routines using PCL,

% ./configure -pcl=/usr/local/packages/pcl
% setenv PCL_EVENT PCL_FP_INSTR
% mpirun -np 8 a.out

To measure the floating point operations in routines using PAPI,

% ./configure -papi=/usr/local/packages/papi
% setenv PAPI_EVENT PAPI_FP_INS
% mpirun -np 8 a.out
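
The examples above use csh syntax (setenv). As a minimal sketch for Bourne-style shells (sh, bash), the same variable can be set with export. This assumes TAU has already been configured with -papi=<dir> and the application rebuilt, as described above; the guard around mpirun is only an illustration so that the snippet does nothing harmful when the launcher or a.out is missing.

```shell
# Bourne-style (sh/bash) equivalent of "setenv PAPI_EVENT PAPI_FP_INS".
PAPI_EVENT=PAPI_FP_INS
export PAPI_EVENT
echo "Profiling with PAPI_EVENT=$PAPI_EVENT"

# Launch the application as in the csh example, but only if both the
# MPI launcher and the executable are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x ./a.out ]; then
    mpirun -np 8 ./a.out
fi
```

For PCL, the same pattern applies with PCL_EVENT and an event name such as PCL_FP_INSTR.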

TABLE 1. Events measured by setting the environment variable PAPI_EVENT in TAU

PAPI_EVENT Description
PAPI_L1_DCM Level 1 data cache misses
PAPI_L1_ICM Level 1 instruction cache misses
PAPI_L2_DCM Level 2 data cache misses
PAPI_L2_ICM Level 2 instruction cache misses
PAPI_L3_DCM Level 3 data cache misses
PAPI_L3_ICM Level 3 instruction cache misses
PAPI_L1_TCM Level 1 total cache misses
PAPI_L2_TCM Level 2 total cache misses
PAPI_L3_TCM Level 3 total cache misses
PAPI_CA_SNP Snoops
PAPI_CA_SHR Request for access to shared cache line (SMP)
PAPI_CA_CLN Request for access to clean cache line (SMP)
PAPI_CA_INV Cache line invalidation (SMP)
PAPI_CA_ITV Cache line intervention (SMP)
PAPI_L3_LDM Level 3 load misses
PAPI_L3_STM Level 3 store misses
PAPI_BRU_IDL Cycles branch units are idle
PAPI_FXU_IDL Cycles integer units are idle
PAPI_FPU_IDL Cycles floating point units are idle
PAPI_LSU_IDL Cycles load/store units are idle
PAPI_TLB_DM Data translation lookaside buffer misses
PAPI_TLB_IM Instruction translation lookaside buffer misses
PAPI_TLB_TL Total translation lookaside buffer misses 
PAPI_L1_LDM Level 1 load misses
PAPI_L1_STM Level 1 store misses 
PAPI_L2_LDM Level 2 load misses
PAPI_L2_STM Level 2 store misses
PAPI_BTAC_M Branch target address cache (BTAC) misses
PAPI_PRF_DM Prefetch data instruction caused a miss
PAPI_L3_DCH Level 3 data cache hits
PAPI_TLB_SD Translation lookaside buffer shootdowns (SMP)
PAPI_CSR_FAL Failed store conditional instructions 
PAPI_CSR_SUC Successful store conditional instructions
PAPI_CSR_TOT Total store conditional instructions
PAPI_MEM_SCY Cycles stalled waiting for memory access
PAPI_MEM_RCY Cycles stalled waiting for memory read
PAPI_MEM_WCY Cycles stalled waiting for memory write
PAPI_STL_ICY Cycles with no instruction issue
PAPI_FUL_ICY Cycles with maximum instruction issue
PAPI_STL_CCY Cycles with no instruction completion
PAPI_FUL_CCY Cycles with maximum instruction completion
PAPI_HW_INT Hardware interrupts
PAPI_BR_UCN Unconditional branch instructions executed
PAPI_BR_CN Conditional branch instructions executed 
PAPI_BR_TKN Conditional branch instructions taken 
PAPI_BR_NTK Conditional branch instructions not taken 
PAPI_BR_MSP Conditional branch instructions mispredicted 
PAPI_BR_PRC Conditional branch instructions correctly predicted
PAPI_FMA_INS FMA instructions completed
PAPI_TOT_IIS Total instructions issued
PAPI_TOT_INS Total instructions executed
PAPI_INT_INS Integer instructions executed
PAPI_FP_INS Floating point instructions executed
PAPI_LD_INS Load instructions executed
PAPI_SR_INS Store instructions executed
PAPI_BR_INS Total branch instructions executed 
PAPI_VEC_INS Vector/SIMD instructions executed
PAPI_FLOPS Floating point instructions executed per second
PAPI_RES_STL Cycles processor is stalled on resource
PAPI_FP_STAL FP units are stalled
PAPI_TOT_CYC Total cycles
PAPI_IPS Instructions executed per second
PAPI_LST_INS Total load/store instructions executed
PAPI_SYC_INS Synchronization instructions executed
PAPI_L1_DCH Level 1 data cache hits
PAPI_L2_DCH Level 2 data cache hits
PAPI_L1_DCA Level 1 data cache accesses
PAPI_L2_DCA Level 2 data cache accesses
PAPI_L3_DCA Level 3 data cache accesses
PAPI_L1_DCR Level 1 data cache reads
PAPI_L2_DCR Level 2 data cache reads
PAPI_L3_DCR Level 3 data cache reads
PAPI_L1_DCW Level 1 data cache writes
PAPI_L2_DCW Level 2 data cache writes
PAPI_L3_DCW Level 3 data cache writes
PAPI_L1_ICH L1 instruction cache hits
PAPI_L2_ICH L2 instruction cache hits
PAPI_L3_ICH L3 instruction cache hits
PAPI_L1_ICA L1 instruction cache accesses
PAPI_L2_ICA L2 instruction cache accesses
PAPI_L3_ICA L3 instruction cache accesses
PAPI_L1_ICR L1 instruction cache reads
PAPI_L2_ICR L2 instruction cache reads
PAPI_L3_ICR L3 instruction cache reads
PAPI_L1_ICW L1 instruction cache writes
PAPI_L2_ICW L2 instruction cache writes
PAPI_L3_ICW L3 instruction cache writes
PAPI_L1_TCH L1 total cache hits
PAPI_L2_TCH L2 total cache hits
PAPI_L3_TCH L3 total cache hits
PAPI_L1_TCA L1 total cache accesses
PAPI_L2_TCA L2 total cache accesses
PAPI_L3_TCA L3 total cache accesses
PAPI_L1_TCR L1 total cache reads
PAPI_L2_TCR L2 total cache reads
PAPI_L3_TCR L3 total cache reads
PAPI_L1_TCW L1 total cache writes
PAPI_L2_TCW L2 total cache writes
PAPI_L3_TCW L3 total cache writes
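PAPI_FML_INS Floating point multiply instructions
PAPI_FAD_INS Floating point add instructions
PAPI_FDV_INS Floating point divide instructions
PAPI_FSQ_INS Floating point square root instructions
PAPI_FNV_INS Floating point inverse instructions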
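
The setup described above names one counter per run through PAPI_EVENT, so comparing two events from Table 1 means running the application once per event. The following is a rough sketch of such a loop in Bourne-style shell syntax. The run_<event> directory names and the event.txt marker file are hypothetical, and the sketch assumes that profile output lands in the current working directory, which is why each run gets its own directory.

```shell
# One run per event: PAPI_EVENT holds a single event name per run.
# Assumption: profile files are written to the current directory, so
# each event gets a separate directory to avoid overwriting results.
for event in PAPI_L1_DCM PAPI_L1_DCA; do
    mkdir -p "run_$event"
    (
        cd "run_$event" || exit 1
        PAPI_EVENT=$event
        export PAPI_EVENT
        # Record which counter this run used (illustrative marker file).
        echo "$PAPI_EVENT" > event.txt
        # Launch only if the MPI launcher and the executable exist.
        if command -v mpirun >/dev/null 2>&1 && [ -x ../a.out ]; then
            mpirun -np 8 ../a.out
        fi
    )
done
```

The subshell keeps the directory change and the exported variable local to each iteration, so the calling shell's environment is left untouched.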

TABLE 2. Events measured by setting the environment variable PCL_EVENT in TAU

PCL_EVENT Event Measured
PCL_L1CACHE_READ L1 (Level one) cache reads
PCL_L1CACHE_WRITE L1 cache writes
PCL_L1CACHE_READWRITE L1 cache reads and writes
PCL_L1CACHE_HIT L1 cache hits
PCL_L1CACHE_MISS L1 cache misses
PCL_L1DCACHE_READ L1 data cache reads
PCL_L1DCACHE_WRITE L1 data cache writes
PCL_L1DCACHE_READWRITE L1 data cache reads and writes
PCL_L1DCACHE_HIT L1 data cache hits
PCL_L1DCACHE_MISS L1 data cache misses
PCL_L1ICACHE_READ L1 instruction cache reads
PCL_L1ICACHE_WRITE L1 instruction cache writes
PCL_L1ICACHE_READWRITE L1 instruction cache reads and writes
PCL_L1ICACHE_HIT L1 instruction cache hits
PCL_L1ICACHE_MISS L1 instruction cache misses
PCL_L2CACHE_READ L2 (Level two) cache reads
PCL_L2CACHE_WRITE L2 cache writes
PCL_L2CACHE_READWRITE L2 cache reads and writes
PCL_L2CACHE_HIT L2 cache hits
PCL_L2CACHE_MISS L2 cache misses
PCL_L2DCACHE_READ L2 data cache reads
PCL_L2DCACHE_WRITE L2 data cache writes
PCL_L2DCACHE_READWRITE L2 data cache reads and writes
PCL_L2DCACHE_HIT L2 data cache hits
PCL_L2DCACHE_MISS L2 data cache misses
PCL_L2ICACHE_READ L2 instruction cache reads
PCL_L2ICACHE_WRITE L2 instruction cache writes
PCL_L2ICACHE_READWRITE L2 instruction cache reads and writes
PCL_L2ICACHE_HIT L2 instruction cache hits
PCL_L2ICACHE_MISS L2 instruction cache misses
PCL_TLB_HIT TLB (Translation Lookaside Buffer) hits
PCL_TLB_MISS TLB misses
PCL_ITLB_HIT Instruction TLB hits
PCL_ITLB_MISS Instruction TLB misses
PCL_DTLB_HIT Data TLB hits
PCL_DTLB_MISS Data TLB misses
PCL_CYCLES Cycles
PCL_ELAPSED_CYCLES Cycles elapsed
PCL_INTEGER_INSTR Integer instructions executed
PCL_FP_INSTR Floating point (FP) instructions executed
PCL_LOAD_INSTR Load instructions executed
PCL_STORE_INSTR Store instructions executed
PCL_LOADSTORE_INSTR Loads and stores executed
PCL_INSTR Instructions executed
PCL_JUMP_SUCCESS Successful jumps executed
PCL_JUMP_UNSUCCESS Unsuccessful jumps executed
PCL_JUMP Jumps executed
PCL_ATOMIC_SUCCESS Successful atomic instructions executed
PCL_ATOMIC_UNSUCCESS Unsuccessful atomic instructions executed
PCL_ATOMIC Atomic instructions executed
PCL_STALL_INTEGER Integer stalls
PCL_STALL_FP Floating point stalls
PCL_STALL_JUMP Jump stalls
PCL_STALL_LOAD Load stalls
PCL_STALL_STORE Store stalls
PCL_STALL Stalls
PCL_MFLOPS Millions of floating point operations/second
PCL_IPC Instructions executed per cycle
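
Counts for related events can be combined into derived metrics once they have been collected. As a small worked sketch, assuming the totals for PCL_L1DCACHE_HIT and PCL_L1DCACHE_MISS have already been read from two profiles of the same program, the L1 data cache hit rate is hits / (hits + misses). The helper name hit_rate is hypothetical, and the counts passed to it are made-up numbers for illustration, not measurements.

```shell
# Hypothetical helper: percentage of accesses that hit the cache,
# given a hit count ($1) and a miss count ($2).
hit_rate() {
    awk -v hits="$1" -v misses="$2" 'BEGIN {
        total = hits + misses
        if (total == 0) { print "n/a" }            # no accesses recorded
        else { printf "%.1f\n", 100 * hits / total }
    }'
}

# Made-up counts, for illustration only (not measured data).
hit_rate 950 50    # prints 95.0
```

The same arithmetic applies to any hit/miss pair in Table 2, for example PCL_TLB_HIT and PCL_TLB_MISS.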