Hi, my name is Sameer Shende from the University of Oregon, and I'll describe the TAU Performance System. The TAU Performance System is a profiling and tracing toolkit that can show you how much time is being spent in your application code regions, such as routines, outer loops, statements, or OpenMP loops. Instead of time, you can see the total number of instructions executed. You can use LIKWID or PAPI to measure floating-point instructions, level 1 and level 2 data cache misses, and other counters. You can see the time taken in operating system routines for thread scheduling and measure how much time is really being wasted. It can show you the memory usage of your code, or the I/O, or the contribution of each phase of your application, or even how it scales.

TAU has three main components: instrumentation, or adding hooks to your code; measurement, to create profiles or traces; and analysis, where you can visualize the data that is generated. It supports different languages, including Fortran, C, C++, Java, Python, and UPC, and a number of different runtimes. It can interface with a measurement library such as Score-P to generate OTF2 traces and CUBEX profiles. It can also connect to a TAUdb database, and it supports the 3D profile browser in ParaProf and the data analysis toolkit PerfExplorer.

TAU can instrument different runtimes, including OpenCL, CUDA, and OpenACC. It supports ROCm, it supports the Kokkos profiling API, and also Python, and you can mix and match these runtimes. So you could measure the time spent in MPI with OpenMP, or CUDA, or Kokkos with OpenMP, and so on. It does this in two different ways: direct instrumentation via probes, with a start-and-stop call to a timer, or indirectly using sampling. With sampling, you do not need to modify the code. You just take a periodic interrupt, whether it is from a hardware performance counter or the wall-clock timer, and TAU then correlates the samples to the source locations.
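As a minimal sketch of measuring counters instead of time (this assumes a TAU build configured with PAPI; the exact counter names to use come from papi_avail on your machine):

    # Measure time plus floating-point instructions and L1 data cache misses.
    export TAU_METRICS=TIME,PAPI_FP_INS,PAPI_L1_DCM
    mpirun -np 4 tau_exec ./a.out    # one profile directory is produced per metric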
TAU can also create different types of profiles. Besides flat profiles, there are callpath profiles, which show you the time spent along an edge of a call graph; callsite profiles, which show how much time was spent at a given source location; or even phase profiles. Now, we do this using a tool called tau_exec, which we use to launch the application. So instead of launching the MPI application using mpirun a.out, we use mpirun tau_exec a.out, and it can then intercept the runtime system calls for MPI, OpenMP, or other libraries and introduce other options such as event-based sampling. So you launch the code using the tau_exec tool, and you can use the tau_exec -ebs flag to launch it with event-based sampling. It works like this: if an uninstrumented application is launched with mpirun, you can just say mpirun tau_exec and then give some options, such as -ompt to instrument the OpenMP Tools interface, or -ebs for event-based sampling. A number of different options exist for GPUs.

When you configure TAU with different packages, it creates a shared library directory with tags in its name, and these tags are also present in the TAU Makefile, a stub makefile that includes the configuration parameters. So if you configure TAU with PDT and PAPI, it will create one configuration, and without PAPI it will create another, and you can launch tau_exec with the -T option to specify those tags. You can also launch tau_exec with options for runtimes, such as -rocm, -cupti, -opencl, -openacc, or -io for I/O profiling, and so on. There are other runtime environment variables to generate traces, callpath profiles, and memory footprint profiles; you can track a number of system quantities, and there are memory tracking and debugging options. You can also instrument the source code using the Program Database Toolkit (PDT), which parses the source code and then adds hooks in an instrumented copy of the source. Typically, when you install TAU, you point to the PDT package and turn on other options, such as the OpenMP Tools interface, specifying the names of the compilers, or using -bfd=download.
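As a sketch of how these pieces fit together (the paths and the exact stub makefile names here are illustrative; the actual tags depend on your configure line):

    # Configuring TAU with PDT, PAPI, MPI, and the OpenMP Tools interface:
    ./configure -cc=gcc -c++=g++ -fortran=gfortran -mpi -ompt \
        -pdt=/path/to/pdt -papi=/path/to/papi -bfd=download
    make install

    # The tags appear in the stub makefile names, e.g.:
    #   Makefile.tau-papi-ompt-mpi-pdt    (configured with PAPI)
    #   Makefile.tau-ompt-mpi-pdt         (configured without PAPI)

    # Selecting a configuration by its tags, plus a runtime option, at launch:
    mpirun -np 4 tau_exec -T papi,ompt,pdt -ebs ./a.out

    # A few of the runtime environment variables mentioned above:
    export TAU_TRACE=1                    # generate traces instead of profiles
    export TAU_CALLPATH=1                 # callpath profiles
    export TAU_TRACK_MEMORY_FOOTPRINT=1   # track the memory footprint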
BFD uses the information in your program to correlate the addresses to the source locations. You can set the TAU_MAKEFILE environment variable and change the compilers to use the TAU compiler scripts, or instead just use an uninstrumented binary with tau_exec. To install TAU on your laptop, please follow these directions. So when you configure TAU, you can typically load a TAU module on a system, set the TAU makefile, replace the compilers, create the instrumented executable, launch it, and then use the pprof and paraprof tools. Here are examples of the tags that I mentioned earlier; you can choose a makefile like this, and you can also set compile-time options using the TAU_OPTIONS environment variable. Once you generate the performance data, you can analyze it using ParaProf, and ParaProf can connect to the TAUdb database. There are other tools, such as PerfExplorer, that we will also demonstrate.

I would like to mention that this work is being done at the University of Oregon in Eugene, and I would like to thank our sponsors from the Department of Energy, the Department of Defense, the National Science Foundation, NASA, CEA France, and our partners at various institutions. This work was supported by the Exascale Computing Project. We will now look at the hands-on exercises.

I cd to the tutorial directory, run make clean; make suite, and cd to the bin directory. I can say export OMP_NUM_THREADS=2, and normally I would run this as mpirun -np 4 ./bt-mz.W.4. You can see that it's taking quite a bit of time. Now, I should check the settings on the VirtualBox image: I see that it has 2 processors and 6 GB of RAM. If I want to profile this application, I would use tau_exec to launch the code. I see that the uninstrumented run takes about 81.47 seconds. Now, if I run it with tau_exec -ebs, it will intercept the MPI calls using the wrapper interposition library. At this stage, it should generate all the profile files, and you can see that it took 83.32 seconds with instrumentation. You can see MPI ranks 0 through 3 and two threads of execution. Now you can say paraprof --pack bt_original.ppk, and then I can launch pprof and see the text-based output.
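To recap, the command sequence for this exercise looks like this (the directory layout and the bt-mz.W.4 binary are as built by the tutorial's makefile in the VirtualBox image):

    cd tutorial
    make clean; make suite
    cd bin
    export OMP_NUM_THREADS=2
    mpirun -np 4 ./bt-mz.W.4                 # uninstrumented baseline, ~81.47 s
    mpirun -np 4 tau_exec -ebs ./bt-mz.W.4   # MPI wrapping + event-based sampling, ~83.32 s
    paraprof --pack bt_original.ppk          # bundle the profile.* files into one .ppk file
    pprof                                    # text-based profile summary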
I can see that poll_no_cancel is taking up a lot of time, and it is called from MPI_Waitall. I can launch ParaProf and see all the threads. If I right-click, I can choose Show Thread Statistics Table. It says 84 seconds were spent in the application: 84.18 seconds on thread 0. Here it shows that MPI_Waitall took 54.35 seconds. All these MPI calls add up to roughly 55 seconds, and the rest of the application takes 28.79 seconds.
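As a quick sanity check on those numbers: 54.35 / 84.18 ≈ 0.65, so roughly 65% of thread 0's time is spent in MPI_Waitall alone, and the MPI total (~55 s) plus the remaining application time (28.79 s) comes to about 84 s, consistent with the inclusive time reported in the statistics table.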