Using Hardware Counters for Profiling instead of Time

The SGI R10000 provides hardware registers that count low-level performance events. They answer questions such as: "How many instructions were executed by each templated function? How many cycles? Floating-point instructions? Loads? Stores? Cache misses?" Profiling can therefore count either wallclock microseconds, using the fast SGI hardware timer, or any of 32 other quantities.

Compilation

Use conf with the PROFILECOUNTERS option. For example:

% conf O2K64 MPI 72BETA PROFILECOUNTERS

On SGI there are two modes: conf PROFILE (the default, which uses SGI timers with nanosecond granularity) and
conf PROFILECOUNTERS, which replaces time with hardware counters.
On other platforms, compile the library with -DTULIP_TIMERS. This uses the portable Tulip timers to record wallclock time (microsecond granularity). This option will be added to conf when Tulip is integrated with Pooma.

Running the Profiled Application

When running the application, the user specifies the quantity to be measured in the environment variable T5_EVENT0:
% setenv T5_EVENT0 21
% mpirun -np 4 app --commlib mpi
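
Measuring several quantities means repeating the run with a different T5_EVENT0 value each time. The following is a minimal sketch of such a driver, not part of the library: run_with_event is a hypothetical helper, and the child process is a stand-in that only echoes the variable so the sketch can be tried on any machine.

```python
import os
import subprocess
import sys


def run_with_event(event, command):
    """Run `command` with T5_EVENT0 set to `event` and return its exit code.

    Hypothetical helper for illustration; the valid range 0..31 is taken
    from the T5_EVENT0 table in this document.
    """
    if not 0 <= event <= 31:
        raise ValueError("T5_EVENT0 must be between 0 and 31")
    env = dict(os.environ)           # copy, so the caller's environment is untouched
    env["T5_EVENT0"] = str(event)
    return subprocess.run(command, env=env).returncode


if __name__ == "__main__":
    # Stand-in child that prints the variable it received. On the SGI system
    # the command would instead be ["mpirun", "-np", "4", "app", "--commlib", "mpi"].
    child = [sys.executable, "-c", "import os; print(os.environ['T5_EVENT0'])"]
    for event in (0, 21, 25):        # cycles, graduated FP instructions, primary data cache misses
        run_with_event(event, child)
```

Passing the variable through a copied environment is equivalent to the setenv line above, but keeps the setting local to one run.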

The table below lists each T5_EVENT0 value and the quantity it measures.

T5_EVENT0	Parameter Measured
0	Cycles
1	Issued instructions
2	Issued loads
3	Issued stores
4	Issued store conditionals
5	Failed store conditionals
6	Decoded branches
7	Quadwords written back from scache
8	Correctable scache data array ECC errors
9	Primary instruction cache misses
10	Secondary instruction cache misses
11	Instruction misprediction from scache way prediction table
12	External interventions
13	External invalidations
14	Virtual coherency conditions
15	Graduated instructions
16	Graduated cycles
17	Graduated instructions
18	Graduated loads
19	Graduated stores
20	Graduated store conditionals
21	Graduated floating point instructions
22	Quadwords written back from primary data cache
23	TLB misses
24	Mispredicted branches
25	Primary data cache misses
26	Secondary data cache misses
27	Data misprediction from scache way prediction table
28	External intervention hits in scache
29	External invalidation hits in scache
30	Store/prefetch exclusive to clean block in scache
31	Store/prefetch exclusive to shared block in scache
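
Because the profile output only contains numbers, it can help to record which quantity a run measured. The sketch below copies the table above into a dictionary and checks the current T5_EVENT0 setting against it. The names T5_EVENTS and selected_event are hypothetical and are not provided by the library.

```python
import os

# Event numbers and names copied verbatim from the T5_EVENT0 table above.
T5_EVENTS = {
    0: "Cycles",
    1: "Issued instructions",
    2: "Issued loads",
    3: "Issued stores",
    4: "Issued store conditionals",
    5: "Failed store conditionals",
    6: "Decoded branches",
    7: "Quadwords written back from scache",
    8: "Correctable scache data array ECC errors",
    9: "Primary instruction cache misses",
    10: "Secondary instruction cache misses",
    11: "Instruction misprediction from scache way prediction table",
    12: "External interventions",
    13: "External invalidations",
    14: "Virtual coherency conditions",
    15: "Graduated instructions",
    16: "Graduated cycles",
    17: "Graduated instructions",
    18: "Graduated loads",
    19: "Graduated stores",
    20: "Graduated store conditionals",
    21: "Graduated floating point instructions",
    22: "Quadwords written back from primary data cache",
    23: "TLB misses",
    24: "Mispredicted branches",
    25: "Primary data cache misses",
    26: "Secondary data cache misses",
    27: "Data misprediction from scache way prediction table",
    28: "External intervention hits in scache",
    29: "External invalidation hits in scache",
    30: "Store/prefetch exclusive to clean block in scache",
    31: "Store/prefetch exclusive to shared block in scache",
}


def selected_event(environ=os.environ):
    """Return (number, description) for the event named by T5_EVENT0."""
    raw = environ.get("T5_EVENT0")
    if raw is None:
        raise KeyError("T5_EVENT0 is not set")
    number = int(raw)                # raises ValueError for non-numeric text
    if number not in T5_EVENTS:
        raise ValueError(f"T5_EVENT0={number} is outside 0..31")
    return number, T5_EVENTS[number]


if __name__ == "__main__":
    # A plain dict stands in for the real environment in this demonstration.
    print(selected_event({"T5_EVENT0": "21"}))
```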

pprof works with the counter data in the same way as it works with time data. Hardware counters can also be combined with other options such as
conf PROFILECALLS (where each invocation of a function is traced).

NOTE: These counters are expensive to use. Each function entry/exit can cost as much as a few hundred microseconds, because a kernel system call is executed. Compare this with the lightweight SGI timer, which takes around 0.7 microseconds (or less).
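
To see what this cost means for a whole run, the overhead can be estimated from the number of profiled function calls. The sketch below is a rough back-of-the-envelope calculation, not a measurement. It assumes that the quoted cost applies to each entry and each exit separately, and it takes 300 microseconds as a stand-in for "a few hundred"; both are assumptions, as is the call count.

```python
def instrumentation_overhead_s(calls, cost_per_event_us):
    """Estimated seconds spent measuring `calls` function invocations.

    Each invocation is measured twice, once at entry and once at exit
    (an assumption about how the quoted cost applies).
    """
    return calls * 2 * cost_per_event_us / 1e6


if __name__ == "__main__":
    calls = 1_000_000                                        # assumed workload
    counters = instrumentation_overhead_s(calls, 300.0)      # assumed 300 us per event
    timer = instrumentation_overhead_s(calls, 0.7)           # SGI timer figure from the note
    print(f"hardware counters: {counters:.1f} s")
    print(f"SGI timer:         {timer:.1f} s")
```

Under these assumptions, one million profiled calls add roughly 600 seconds with hardware counters and about 1.4 seconds with the SGI timer, so counters are best reserved for runs where few functions are instrumented or call counts are small.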