Hi, my name is Sameer Shende from the University of Oregon, and I'll describe the TAU Performance System. The TAU Performance System is a profiling and tracing toolkit that can show you how much time is being spent in your application code regions, such as routines, outer loops, statements, or OpenMP loops. Instead of time, you can see the total number of instructions executed. You can use LIKWID or PAPI to measure floating-point instructions, level 1 and level 2 data cache misses, and other counters. You can see the time taken in operating system routines for thread scheduling and measure how much time is really being wasted. It can show you the memory usage of your code, or the I/O, or the contribution of each phase of your application, or even how it scales.

TAU has three main components: instrumentation, or adding hooks to your code; measurement, to create profiles or traces; and analysis, where you can visualize the data that is generated. It supports different languages, including Fortran, C, C++, Java, Python, and UPC, and a number of different runtimes. It can interface with a measurement library such as Score-P to generate OTF2 traces and CUBEX profiles. It can also connect to a TAUdb database, and it supports the 3D profile browser in ParaProf and the data analysis toolkit PerfExplorer.

TAU can instrument different runtimes, including OpenCL, CUDA, and OpenACC. It supports ROCm, it supports the Kokkos profiling API, and also Python, and you can mix and match these runtimes. So you could measure the time spent in MPI with OpenMP, or CUDA, or Kokkos with OpenMP, and so on. It does this in two different ways: direct instrumentation via probes, with a start-and-stop call to a timer, or indirectly using sampling. With sampling, you do not need to modify the code. You just take a periodic interrupt, whether it is from a hardware performance counter or the wall-clock timer, and TAU then correlates the samples to the source locations.
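As a minimal sketch of measuring counters instead of time (this assumes a TAU build configured with PAPI; the exact counter names to use come from papi_avail on your machine):

    # Measure time plus floating-point instructions and L1 data cache misses.
    export TAU_METRICS=TIME,PAPI_FP_INS,PAPI_L1_DCM
    mpirun -np 4 tau_exec ./a.out    # one profile directory is produced per metric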
TAU can also create different types of profiles. Besides flat profiles, there are callpath profiles, which show you the time spent along an edge of a call graph; callsite profiles, which show how much time was spent at a given source location; or even phase profiles. Now, we do this using a tool called tau_exec, which we use to launch the application. So instead of launching the MPI application using mpirun a.out, we use mpirun tau_exec a.out, and it can then intercept the runtime system calls for MPI, OpenMP, or other libraries and introduce other options such as event-based sampling. So you launch the code using the tau_exec tool, and you can use the tau_exec -ebs flag to launch it with event-based sampling. It works like this: if an uninstrumented application is launched with mpirun, you can just say mpirun tau_exec and then give some options, such as -ompt to instrument the OpenMP Tools interface, or -ebs for event-based sampling. A number of different options exist for GPUs.

When you configure TAU with different packages, it creates a shared library directory with tags in its name, and these tags are also present in the TAU Makefile, a stub makefile that includes the configuration parameters. So if you configure TAU with PDT and PAPI, it will create one configuration, and without PAPI it will create another, and you can launch tau_exec with the -T option to specify those tags. You can also launch tau_exec with options for runtimes, such as -rocm, -cupti, -opencl, -openacc, or -io for I/O profiling, and so on. There are other runtime environment variables to generate traces, callpath profiles, and memory footprint profiles; you can track a number of system quantities, and there are memory tracking and debugging options. You can also instrument the source code using the Program Database Toolkit (PDT), which parses the source code and then adds hooks in an instrumented copy of the source. Typically, when you install TAU, you point to the PDT package and turn on other options, such as the OpenMP Tools interface, specifying the names of the compilers, or using -bfd=download.
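As a sketch of how these pieces fit together (the paths and the exact stub makefile names here are illustrative; the actual tags depend on your configure line):

    # Configuring TAU with PDT, PAPI, MPI, and the OpenMP Tools interface:
    ./configure -cc=gcc -c++=g++ -fortran=gfortran -mpi -ompt \
        -pdt=/path/to/pdt -papi=/path/to/papi -bfd=download
    make install

    # The tags appear in the stub makefile names, e.g.:
    #   Makefile.tau-papi-ompt-mpi-pdt    (configured with PAPI)
    #   Makefile.tau-ompt-mpi-pdt         (configured without PAPI)

    # Selecting a configuration by its tags, plus a runtime option, at launch:
    mpirun -np 4 tau_exec -T papi,ompt,pdt -ebs ./a.out

    # A few of the runtime environment variables mentioned above:
    export TAU_TRACE=1                    # generate traces instead of profiles
    export TAU_CALLPATH=1                 # callpath profiles
    export TAU_TRACK_MEMORY_FOOTPRINT=1   # track the memory footprint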
BFD uses the information in your program to correlate the addresses to the source locations. You can set the TAU_MAKEFILE environment variable and change the compilers to use the TAU compiler scripts, or instead just use an uninstrumented binary with tau_exec. To install TAU on your laptop, please follow these directions. So when you configure TAU, you can typically load a TAU module on a system, set the TAU makefile, replace the compilers, create the instrumented executable, launch it, and then use the pprof and paraprof tools. Here are examples of the tags that I mentioned earlier; you can choose a makefile like this, and you can also set compile-time options using the TAU_OPTIONS environment variable. Once you generate the performance data, you can analyze it using ParaProf, and ParaProf can connect to the TAUdb database. There are other tools, such as PerfExplorer, that we will also demonstrate.

I would like to mention that this work is being done at the University of Oregon in Eugene, and I would like to thank our sponsors from the Department of Energy, the Department of Defense, the National Science Foundation, NASA, CEA France, and our partners at various institutions. This work was supported by the Exascale Computing Project. We will now look at the hands-on exercises.

I cd to the tutorial directory, run make clean; make suite, and cd to the bin directory. I can say export OMP_NUM_THREADS=2, and normally I would run this as mpirun -np 4 ./bt-mz.W.4. You can see that it's taking quite a bit of time. Now, I should check the settings on the VirtualBox image: I see that it has 2 processors and 6 GB of RAM. If I want to profile this application, I would use tau_exec to launch the code. I see that the uninstrumented run takes about 81.47 seconds. Now, if I run it with tau_exec -ebs, it will intercept the MPI calls using the wrapper interposition library. At this stage, it should generate all the profile files, and you can see that it took 83.32 seconds with instrumentation. You can see MPI ranks 0 through 3 and two threads of execution. Now you can say paraprof --pack bt_original.ppk, and then I can launch pprof and see the text-based output.
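To recap, the command sequence for this exercise looks like this (the directory layout and the bt-mz.W.4 binary are as built by the tutorial's makefile in the VirtualBox image):

    cd tutorial
    make clean; make suite
    cd bin
    export OMP_NUM_THREADS=2
    mpirun -np 4 ./bt-mz.W.4                 # uninstrumented baseline, ~81.47 s
    mpirun -np 4 tau_exec -ebs ./bt-mz.W.4   # MPI wrapping + event-based sampling, ~83.32 s
    paraprof --pack bt_original.ppk          # bundle the profile.* files into one .ppk file
    pprof                                    # text-based profile summary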
I can see that poll_no_cancel is taking up a lot of time, and it is called from MPI_Waitall. I can launch ParaProf and see all the threads. If I right-click, I can choose Show Thread Statistics Table. It says 84 seconds were spent in the application: 84.18 seconds on thread 0. Here it shows that MPI_Waitall took 54.35 seconds. All these MPI calls add up to roughly 55 seconds, and the rest of the application takes 28.79 seconds.
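As a quick sanity check on those numbers: 54.35 / 84.18 ≈ 0.65, so roughly 65% of thread 0's time is spent in MPI_Waitall alone, and the MPI total (~55 s) plus the remaining application time (28.79 s) comes to about 84 s, consistent with the inclusive time reported in the statistics table.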