Modern microprocessors provide hardware performance counters. These count hardware performance events, such as cache misses and floating point operations, while the program executes on the processor. The Performance Data Standard and API (PAPI, URL: http://icl.cs.utk.edu/projects/papi/) and Performance Counter Library (PCL, URL: http://www.fz-juelich.de/zam/PCL/) packages provide a uniform interface to access these performance counters. TAU can use either PAPI or PCL to access them.
To use these counters, download and install PAPI or PCL. Then configure TAU with the -pcl=<dir> or -papi=<dir> configuration command-line option to specify the location of PCL or PAPI. Build TAU and applications as you normally would. When running the application, set the environment variable PCL_EVENT or PAPI_EVENT, respectively, to specify which hardware performance counter TAU should use while profiling the application. For example, to measure the floating point operations in routines using PCL:
% ./configure -pcl=/usr/local/packages/pcl
% setenv PCL_EVENT PCL_FP_INSTR
% mpirun -np 8 a.out

To measure the floating point operations in routines using PAPI:
% ./configure -papi=/usr/local/packages/papi
% setenv PAPI_EVENT PAPI_FP_INS
% mpirun -np 8 a.out
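The setenv commands shown here are csh syntax. For Bourne-style shells (sh, bash), the same variable can be exported or passed to a single command. The following is a minimal sketch, assuming a POSIX shell; the helper name run_with_event is hypothetical and is not part of TAU, PAPI, or PCL.

```shell
#!/bin/sh
# Hypothetical helper (not part of TAU): run a command with PAPI_EVENT set
# for that command only, leaving the calling shell's environment unchanged.
# Usage: run_with_event EVENT command [args...]
run_with_event() {
    event=$1
    shift
    # env(1) places the variable in the child process environment only.
    env PAPI_EVENT="$event" "$@"
}

# Example invocation (commented out; it needs an MPI launcher and a
# TAU-instrumented a.out):
# run_with_event PAPI_FP_INS mpirun -np 8 a.out
```

To set the variable for the whole session instead, `export PAPI_EVENT=PAPI_FP_INS` is the sh/bash counterpart of `setenv PAPI_EVENT PAPI_FP_INS`. The same pattern applies to PCL_EVENT.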
TABLE 1. Events measured by setting the environment variable PAPI_EVENT in TAU
PAPI_EVENT | Description |
PAPI_L1_DCM | Level 1 data cache misses |
PAPI_L1_ICM | Level 1 instruction cache misses |
PAPI_L2_DCM | Level 2 data cache misses |
PAPI_L2_ICM | Level 2 instruction cache misses |
PAPI_L3_DCM | Level 3 data cache misses |
PAPI_L3_ICM | Level 3 instruction cache misses |
PAPI_L1_TCM | Level 1 total cache misses |
PAPI_L2_TCM | Level 2 total cache misses |
PAPI_L3_TCM | Level 3 total cache misses |
PAPI_CA_SNP | Snoops |
PAPI_CA_SHR | Request for access to shared cache line (SMP) |
PAPI_CA_CLN | Request for access to clean cache line (SMP) |
PAPI_CA_INV | Cache line invalidation (SMP) |
PAPI_CA_ITV | Cache line intervention (SMP) |
PAPI_L3_LDM | Level 3 load misses |
PAPI_L3_STM | Level 3 store misses |
PAPI_BRU_IDL | Cycles branch units are idle |
PAPI_FXU_IDL | Cycles integer units are idle |
PAPI_FPU_IDL | Cycles floating point units are idle |
PAPI_LSU_IDL | Cycles load/store units are idle |
PAPI_TLB_DM | Data translation lookaside buffer misses |
PAPI_TLB_IM | Instruction translation lookaside buffer misses |
PAPI_TLB_TL | Total translation lookaside buffer misses |
PAPI_L1_LDM | Level 1 load misses |
PAPI_L1_STM | Level 1 store misses |
PAPI_L2_LDM | Level 2 load misses |
PAPI_L2_STM | Level 2 store misses |
PAPI_BTAC_M | Branch target address cache (BTAC) misses |
PAPI_PRF_DM | Prefetch data instruction caused a miss |
PAPI_L3_DCH | Level 3 data cache hits |
PAPI_TLB_SD | Translation lookaside buffer shootdowns (SMP) |
PAPI_CSR_FAL | Failed store conditional instructions |
PAPI_CSR_SUC | Successful store conditional instructions |
PAPI_CSR_TOT | Total store conditional instructions |
PAPI_MEM_SCY | Cycles stalled waiting for memory access |
PAPI_MEM_RCY | Cycles stalled waiting for memory read |
PAPI_MEM_WCY | Cycles stalled waiting for memory write |
PAPI_STL_ICY | Cycles with no instruction issue |
PAPI_FUL_ICY | Cycles with maximum instruction issue |
PAPI_STL_CCY | Cycles with no instruction completion |
PAPI_FUL_CCY | Cycles with maximum instruction completion |
PAPI_HW_INT | Hardware interrupts |
PAPI_BR_UCN | Unconditional branch instructions executed |
PAPI_BR_CN | Conditional branch instructions executed |
PAPI_BR_TKN | Conditional branch instructions taken |
PAPI_BR_NTK | Conditional branch instructions not taken |
PAPI_BR_MSP | Conditional branch instructions mispredicted |
PAPI_BR_PRC | Conditional branch instructions correctly predicted |
PAPI_FMA_INS | FMA instructions completed |
PAPI_TOT_IIS | Total instructions issued |
PAPI_TOT_INS | Total instructions executed |
PAPI_INT_INS | Integer instructions executed |
PAPI_FP_INS | Floating point instructions executed |
PAPI_LD_INS | Load instructions executed |
PAPI_SR_INS | Store instructions executed |
PAPI_BR_INS | Total branch instructions executed |
PAPI_VEC_INS | Vector/SIMD instructions executed |
PAPI_FLOPS | Floating point instructions executed per second |
PAPI_RES_STL | Cycles processor is stalled on resource |
PAPI_FP_STAL | FP units are stalled |
PAPI_TOT_CYC | Total cycles |
PAPI_IPS | Instructions executed per second |
PAPI_LST_INS | Total load/store instructions executed |
PAPI_SYC_INS | Synchronization instructions executed |
PAPI_L1_DCH | L1 data cache hits |
PAPI_L2_DCH | L2 data cache hits |
PAPI_L1_DCA | L1 data cache accesses |
PAPI_L2_DCA | L2 data cache accesses |
PAPI_L3_DCA | L3 data cache accesses |
PAPI_L1_DCR | L1 data cache reads |
PAPI_L2_DCR | L2 data cache reads |
PAPI_L3_DCR | L3 data cache reads |
PAPI_L1_DCW | L1 data cache writes |
PAPI_L2_DCW | L2 data cache writes |
PAPI_L3_DCW | L3 data cache writes |
PAPI_L1_ICH | L1 instruction cache hits |
PAPI_L2_ICH | L2 instruction cache hits |
PAPI_L3_ICH | L3 instruction cache hits |
PAPI_L1_ICA | L1 instruction cache accesses |
PAPI_L2_ICA | L2 instruction cache accesses |
PAPI_L3_ICA | L3 instruction cache accesses |
PAPI_L1_ICR | L1 instruction cache reads |
PAPI_L2_ICR | L2 instruction cache reads |
PAPI_L3_ICR | L3 instruction cache reads |
PAPI_L1_ICW | L1 instruction cache writes |
PAPI_L2_ICW | L2 instruction cache writes |
PAPI_L3_ICW | L3 instruction cache writes |
PAPI_L1_TCH | L1 total cache hits |
PAPI_L2_TCH | L2 total cache hits |
PAPI_L3_TCH | L3 total cache hits |
PAPI_L1_TCA | L1 total cache accesses |
PAPI_L2_TCA | L2 total cache accesses |
PAPI_L3_TCA | L3 total cache accesses |
PAPI_L1_TCR | L1 total cache reads |
PAPI_L2_TCR | L2 total cache reads |
PAPI_L3_TCR | L3 total cache reads |
PAPI_L1_TCW | L1 total cache writes |
PAPI_L2_TCW | L2 total cache writes |
PAPI_L3_TCW | L3 total cache writes |
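PAPI_FML_INS | Floating point multiply instructions |
PAPI_FAD_INS | Floating point add instructions |
PAPI_FDV_INS | Floating point divide instructions |
PAPI_FSQ_INS | Floating point square root instructions |
PAPI_FNV_INS | Floating point inverse instructions |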
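Since PAPI_EVENT names the single counter TAU uses for a run, collecting several of the events in Table 1 takes one run per event. The following is a minimal sketch of such a sweep, assuming a POSIX shell and assuming that each run writes its output files into the current working directory; the helper name sweep_events is hypothetical and is not part of TAU.

```shell
#!/bin/sh
# Hypothetical helper (not part of TAU): run the same command once per
# event, each time inside a subdirectory named after the event, so that
# output files from different runs do not overwrite one another.
# Usage: sweep_events "EVENT1 EVENT2 ..." command [args...]
sweep_events() {
    events=$1
    shift
    for ev in $events; do
        mkdir -p "$ev" || return 1
        # Subshell: the directory change does not leak to the caller.
        ( cd "$ev" && env PAPI_EVENT="$ev" "$@" ) || return 1
    done
}

# Example invocation (commented out; it needs an MPI launcher and a
# TAU-instrumented executable):
# sweep_events "PAPI_L1_DCM PAPI_L2_DCM PAPI_FP_INS" mpirun -np 8 "$PWD/a.out"
```

Because each run starts in a different directory, the executable is best given by an absolute path, as in the commented example.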
TABLE 2. Events measured by setting the environment variable PCL_EVENT in TAU
PCL_EVENT | Event Measured |
PCL_L1CACHE_READ | L1 (Level one) cache reads |
PCL_L1CACHE_WRITE | L1 cache writes |
PCL_L1CACHE_READWRITE | L1 cache reads and writes |
PCL_L1CACHE_HIT | L1 cache hits |
PCL_L1CACHE_MISS | L1 cache misses |
PCL_L1DCACHE_READ | L1 data cache reads |
PCL_L1DCACHE_WRITE | L1 data cache writes |
PCL_L1DCACHE_READWRITE | L1 data cache reads and writes |
PCL_L1DCACHE_HIT | L1 data cache hits |
PCL_L1DCACHE_MISS | L1 data cache misses |
PCL_L1ICACHE_READ | L1 instruction cache reads |
PCL_L1ICACHE_WRITE | L1 instruction cache writes |
PCL_L1ICACHE_READWRITE | L1 instruction cache reads and writes |
PCL_L1ICACHE_HIT | L1 instruction cache hits |
PCL_L1ICACHE_MISS | L1 instruction cache misses |
PCL_L2CACHE_READ | L2 (Level two) cache reads |
PCL_L2CACHE_WRITE | L2 cache writes |
PCL_L2CACHE_READWRITE | L2 cache reads and writes |
PCL_L2CACHE_HIT | L2 cache hits |
PCL_L2CACHE_MISS | L2 cache misses |
PCL_L2DCACHE_READ | L2 data cache reads |
PCL_L2DCACHE_WRITE | L2 data cache writes |
PCL_L2DCACHE_READWRITE | L2 data cache reads and writes |
PCL_L2DCACHE_HIT | L2 data cache hits |
PCL_L2DCACHE_MISS | L2 data cache misses |
PCL_L2ICACHE_READ | L2 instruction cache reads |
PCL_L2ICACHE_WRITE | L2 instruction cache writes |
PCL_L2ICACHE_READWRITE | L2 instruction cache reads and writes |
PCL_L2ICACHE_HIT | L2 instruction cache hits |
PCL_L2ICACHE_MISS | L2 instruction cache misses |
PCL_TLB_HIT | TLB (Translation Lookaside Buffer) hits |
PCL_TLB_MISS | TLB misses |
PCL_ITLB_HIT | Instruction TLB hits |
PCL_ITLB_MISS | Instruction TLB misses |
PCL_DTLB_HIT | Data TLB hits |
PCL_DTLB_MISS | Data TLB misses |
PCL_CYCLES | Cycles |
PCL_ELAPSED_CYCLES | Cycles elapsed |
PCL_INTEGER_INSTR | Integer instructions executed |
PCL_FP_INSTR | Floating point (FP) instructions executed |
PCL_LOAD_INSTR | Load instructions executed |
PCL_STORE_INSTR | Store instructions executed |
PCL_LOADSTORE_INSTR | Loads and stores executed |
PCL_INSTR | Instructions executed |
PCL_JUMP_SUCCESS | Successful jumps executed |
PCL_JUMP_UNSUCCESS | Unsuccessful jumps executed |
PCL_JUMP | Jumps executed |
PCL_ATOMIC_SUCCESS | Successful atomic instructions executed |
PCL_ATOMIC_UNSUCCESS | Unsuccessful atomic instructions executed |
PCL_ATOMIC | Atomic instructions executed |
PCL_STALL_INTEGER | Integer stalls |
PCL_STALL_FP | Floating point stalls |
PCL_STALL_JUMP | Jump stalls |
PCL_STALL_LOAD | Load stalls |
PCL_STALL_STORE | Store stalls |
PCL_STALL | Stalls |
PCL_MFLOPS | Millions of floating point operations/second |
PCL_IPC | Instructions executed per cycle |