Using Hardware Counters for Profiling instead of Time

The SGI R10000 processor provides hardware registers that count low-level performance events. They answer questions such as: "How many instructions were executed by each templated function? How many cycles? Floating-point instructions? Loads? Stores? Cache misses?" Profiling can therefore record either wallclock microseconds, using the fast SGI hardware timer, or any of 32 other quantities.

Compilation

Use conf with the PROFILECOUNTERS option. For example:

% conf O2K64 MPI 72BETA PROFILECOUNTERS

On SGI there are two profiling modes: conf PROFILE (the default, which uses the SGI timers with nanosecond granularity) and
conf PROFILECOUNTERS, which replaces time with hardware counters.
On other platforms, compile the library with -DTULIP_TIMERS. This uses the portable Tulip timers to record wallclock time with microsecond granularity. The option will be added to conf when Tulip is integrated with Pooma.

Running the Profiled Application

When running the application, set the environment variable T5_EVENT0 to the number of the quantity to be measured:
% setenv T5_EVENT0 21
% mpirun -np 4 app --commlib mpi

The table below lists each T5_EVENT0 value and the quantity it measures.

T5_EVENT0 value and quantity measured
0 Cycles
1 Issued instructions
2 Issued loads
3 Issued stores
4 Issued store conditionals
5 Failed store conditionals
6 Decoded branches
7 Quadwords written back from scache
8 Correctable scache data array ECC errors
9 Primary instruction cache misses
10 Secondary instruction cache misses
11 Instruction misprediction from scache way prediction table
12 External interventions
13 External invalidations
14 Virtual coherency conditions
15 Graduated instructions
16 Graduated cycles
17 Graduated instructions
18 Graduated loads
19 Graduated stores
20 Graduated store conditionals
21 Graduated floating point instructions
22 Quadwords written back from primary data cache
23 TLB misses
24 Mispredicted branches
25 Primary data cache misses
26 Secondary data cache misses
27 Data misprediction from scache way prediction table
28 External intervention hits in scache
29 External invalidation hits in scache
30 Store/prefetch exclusive to clean block in scache
31 Store/prefetch exclusive to shared block in scache
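
A launch script can check the event number before setting T5_EVENT0. The following is a minimal sketch in Python and is not part of Pooma. The names R10000_EVENTS and event_env are hypothetical, the descriptions are copied from the table above, and the commented-out launcher call reuses the mpirun command shown earlier.

```python
import os

# Event descriptions copied from the table above.
# The list index is the value to place in T5_EVENT0.
R10000_EVENTS = [
    "Cycles",
    "Issued instructions",
    "Issued loads",
    "Issued stores",
    "Issued store conditionals",
    "Failed store conditionals",
    "Decoded branches",
    "Quadwords written back from scache",
    "Correctable scache data array ECC errors",
    "Primary instruction cache misses",
    "Secondary instruction cache misses",
    "Instruction misprediction from scache way prediction table",
    "External interventions",
    "External invalidations",
    "Virtual coherency conditions",
    "Graduated instructions",
    "Graduated cycles",
    "Graduated instructions",
    "Graduated loads",
    "Graduated stores",
    "Graduated store conditionals",
    "Graduated floating point instructions",
    "Quadwords written back from primary data cache",
    "TLB misses",
    "Mispredicted branches",
    "Primary data cache misses",
    "Secondary data cache misses",
    "Data misprediction from scache way prediction table",
    "External intervention hits in scache",
    "External invalidation hits in scache",
    "Store/prefetch exclusive to clean block in scache",
    "Store/prefetch exclusive to shared block in scache",
]


def event_env(code, base_env=None):
    """Return a copy of the environment with T5_EVENT0 set to a valid event code.

    Hypothetical helper: it only validates the number and sets the variable.
    """
    if not isinstance(code, int) or not 0 <= code < len(R10000_EVENTS):
        raise ValueError(
            f"T5_EVENT0 must be an integer in 0..{len(R10000_EVENTS) - 1}, got {code!r}"
        )
    env = dict(os.environ if base_env is None else base_env)
    env["T5_EVENT0"] = str(code)
    return env


if __name__ == "__main__":
    env = event_env(21)
    print("T5_EVENT0 =", env["T5_EVENT0"], "->", R10000_EVENTS[21])
    # The environment could then be handed to the launcher, for example:
    # subprocess.run(["mpirun", "-np", "4", "app", "--commlib", "mpi"], env=env)
```

Keeping the descriptions next to the numbers lets a script label each profile with the quantity that was counted.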

pprof works with the counter data the same way it works with time data. Counter profiling can also be combined with other options such as
conf PROFILECALLS (where each invocation of a function is traced).


NOTE: Using these counters is expensive. Each function entry or exit can cost as much as a few hundred microseconds, because a kernel system call is executed. Compare this with the lightweight SGI timer, which takes around 0.7 microseconds or less.
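
To put the note in perspective, here is a rough estimate of the added run time. It is a sketch under stated assumptions: 200 microseconds stands in for "a few hundred microseconds", the cost is charged once per entry or exit event, and the event count of one million is a hypothetical workload. Only the 0.7 microsecond timer cost comes from the note itself.

```python
def instrumentation_overhead_seconds(events, cost_per_event_us):
    """Estimate added wallclock seconds for a number of function entry/exit events."""
    return events * cost_per_event_us / 1e6


if __name__ == "__main__":
    events = 1_000_000          # hypothetical number of entry/exit events in a run
    counter_cost_us = 200.0     # assumed stand-in for "a few hundred microseconds"
    timer_cost_us = 0.7         # SGI timer cost quoted in the note above

    counters = instrumentation_overhead_seconds(events, counter_cost_us)
    timers = instrumentation_overhead_seconds(events, timer_cost_us)
    print(f"hardware counters: about {counters:.1f} s of overhead")
    print(f"SGI timer:         about {timers:.1f} s of overhead")
```

Under these assumptions the counters add a few minutes to the run while the timer adds under a second, so counter runs are best kept to small problem sizes or to a limited set of instrumented functions.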

