The SGI R10000 provides hardware registers that count low-level performance events. They answer questions such as: "How many instructions were executed by each templated function? How many cycles? Floating point instructions? Loads? Stores? Cache misses?" Profiling can therefore count either wallclock microseconds, using the fast SGI hardware timer, or any of 32 other quantities.
Use conf with the PROFILECOUNTERS option. For example:
% conf O2K64 MPI 72BETA PROFILECOUNTERS
On SGI, there are two modes:
conf PROFILE
(the default, which uses SGI timers with nanosecond granularity)
or
conf PROFILECOUNTERS
(which replaces time with hardware counters).
On other platforms, compile the library with -DTULIP_TIMERS. This uses the portable Tulip timers to record wallclock time (microsecond granularity). This option will be added to conf when Tulip is integrated with Pooma.
When running the application, specify the quantity to be measured in the environment variable T5_EVENT0:
% setenv T5_EVENT0 21
% mpirun -np 4 app --commlib mpi
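To collect a different quantity, the same run has to be launched again with another T5_EVENT0 value. The following is a minimal Python sketch of scripting that. It assumes one quantity is recorded per run. The helper names `event_env` and `run_with_event` are hypothetical and are not part of conf, pprof, or the profiling library.

```python
import os
import subprocess


def event_env(event, base_env=None):
    """Return a copy of the environment with T5_EVENT0 set to `event`.

    `event` must be an integer in 0..31, matching the event table.
    """
    if type(event) is not int or not 0 <= event <= 31:
        raise ValueError(f"T5_EVENT0 must be an integer in 0..31, got {event!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["T5_EVENT0"] = str(event)  # environment values are always strings
    return env


def run_with_event(event, argv):
    """Run the command `argv` once with T5_EVENT0 set; return its exit code."""
    return subprocess.run(argv, env=event_env(event)).returncode
```

For instance, `run_with_event(21, ["mpirun", "-np", "4", "app", "--commlib", "mpi"])` corresponds to the two shell commands above.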
The table below lists each T5_EVENT0 value and the quantity it measures.
T5_EVENT0 | Parameter Measured |
---|---|
0 | Cycles |
1 | Issued instructions |
2 | Issued loads |
3 | Issued stores |
4 | Issued store conditionals |
5 | Failed store conditionals |
6 | Decoded branches |
7 | Quadwords written back from scache |
8 | Correctable scache data array ECC errors |
9 | Primary instruction cache misses |
10 | Secondary instruction cache misses |
11 | Instruction misprediction from scache way prediction table |
12 | External interventions |
13 | External invalidations |
14 | Virtual coherency conditions |
15 | Graduated instructions |
16 | Graduated cycles |
17 | Graduated instructions |
18 | Graduated loads |
19 | Graduated stores |
20 | Graduated store conditionals |
21 | Graduated floating point instructions |
22 | Quadwords written back from primary data cache |
23 | TLB misses |
24 | Mispredicted branches |
25 | Primary data cache misses |
26 | Secondary data cache misses |
27 | Data misprediction from scache way prediction table |
28 | External intervention hits in scache |
29 | External invalidation hits in scache |
30 | Store/prefetch exclusive to clean block in scache |
31 | Store/prefetch exclusive to shared block in scache |
pprof works with the data generated this way just as it works with timing data.
Counter profiling can also be combined with other options such as
conf PROFILECALLS
(where each invocation of a function is traced).
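Counts taken in separate runs can be combined into derived ratios. The Python sketch below divides primary data cache misses (T5_EVENT0 = 25) by graduated loads (T5_EVENT0 = 18). It assumes the two runs are otherwise identical (same input and same number of processes). The function name `per_unit` and the totals are hypothetical and were not measured.

```python
def per_unit(numerator_count, denominator_count):
    """Ratio of two counter totals, e.g. cache misses per graduated load."""
    if denominator_count <= 0:
        raise ValueError("denominator count must be positive")
    return numerator_count / denominator_count


# Hypothetical totals for one function, read from two separate profiles:
# one run with T5_EVENT0=25 and one run with T5_EVENT0=18.
dcache_misses = 1_500
graduated_loads = 60_000

miss_ratio = per_unit(dcache_misses, graduated_loads)
print(f"primary data cache misses per graduated load: {miss_ratio:.3f}")
```

The same helper applies to any pair of rows in the table, for example mispredicted branches (24) per decoded branch (6).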