We specify the instrumentation of OpenMP directives in terms of directive transformations for two reasons: first, this allows a description that is independent of the base programming language, and second, it ties the specification directly to the programming model that the application programmer understands. Our transformation rules insert calls to pomp_NAME_TYPE(d) in a manner appropriate for each OpenMP directive, where NAME is replaced by the name of the directive, TYPE is one of fork, join, enter, exit, begin, or end, and d is a context descriptor (described in Section 3.5). The fork and join calls mark the locations where the execution model switches from sequential to parallel and back; enter and exit flag entry into and exit from OpenMP constructs; and begin and end mark the start and end of the structured blocks that serve as bodies of the OpenMP directives. Table 1 gives an overview of our proposed transformations and performance library routines. To improve readability, optional clauses to the directives, as allowed by the OpenMP standards, are not shown.
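The naming scheme can be illustrated with a small source-to-source rewriter. The Python sketch below is not part of the proposed interface: the helper names `pomp` and `instrument_critical` are hypothetical, directives are emitted in Fortran `!$OMP` syntax, calls are written in the language-independent notation used above, and the placement shown for CRITICAL (enter/exit around the whole construct, begin/end around its body) is an assumption derived from the event definitions rather than a copy of Table 1.

```python
# Event types permitted by the naming scheme pomp_NAME_TYPE(d).
POMP_TYPES = {"fork", "join", "enter", "exit", "begin", "end"}


def pomp(name: str, kind: str, d: str) -> str:
    """Return the call text pomp_NAME_TYPE(d) for directive NAME."""
    if kind not in POMP_TYPES:
        raise ValueError(f"unknown event type: {kind}")
    return f"pomp_{name.lower()}_{kind}({d})"


def instrument_critical(body: list[str], d: str) -> list[str]:
    """Wrap a CRITICAL construct (hypothetical placement, see lead-in).

    enter/exit bracket the whole construct, including any time spent
    waiting to get in; begin/end bracket only the structured block.
    """
    return [
        pomp("critical", "enter", d),
        "!$OMP CRITICAL",
        pomp("critical", "begin", d),
        *body,
        pomp("critical", "end", d),
        "!$OMP END CRITICAL",
        pomp("critical", "exit", d),
    ]


# Usage: print the instrumented version of a one-line critical section.
if __name__ == "__main__":
    for line in instrument_critical(["counter = counter + 1"], "d"):
        print(line)
```

Separating enter from begin is what lets a tool attribute the gap between them to waiting for entry rather than to work inside the block.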
To measure the synchronization time at the implicit barrier at the end of DO, SECTIONS, WORKSHARE, or SINGLE constructs, we use the following method. If the corresponding END directive in the original code does not include a NOWAIT clause, NOWAIT is added and the implicit barrier is made explicit, as shown in Table 1. If the original END directive already has a NOWAIT clause, this step is unnecessary, since there is no implicit barrier to measure. To distinguish these barriers from explicit barriers specified by the user, the pomp_barrier_enter() and pomp_barrier_exit() calls inserted here are passed the context descriptor of the enclosing construct instead of the descriptor of an explicit barrier.
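The barrier rule for work-sharing constructs can be sketched for a DO loop as follows. This is an illustrative rewriter, not the authors' tool: the function names are hypothetical, the NOWAIT check is a naive substring test, and the DO's own descriptor `d` is passed to the barrier calls, as the rule above requires.

```python
def pomp(name: str, kind: str, d: str) -> str:
    """Build one performance-interface call, e.g. pomp_do_enter(d)."""
    return f"pomp_{name}_{kind}({d})"


def instrument_do(do_line: str, body: list[str], end_line: str, d: str) -> list[str]:
    """Instrument a !$OMP DO ... !$OMP END DO [NOWAIT] construct (sketch).

    Without NOWAIT, NOWAIT is appended to END DO and the former implicit
    barrier is emitted explicitly, bracketed by barrier enter/exit calls
    that receive the descriptor d of the enclosing DO construct.
    """
    has_nowait = "NOWAIT" in end_line.upper()  # naive check, assumption
    out = [pomp("do", "enter", d), do_line, *body]
    if has_nowait:
        # The user already removed the barrier; nothing to measure.
        out.append(end_line)
    else:
        out.append(end_line.rstrip() + " NOWAIT")
        out += [pomp("barrier", "enter", d), "!$OMP BARRIER", pomp("barrier", "exit", d)]
    out.append(pomp("do", "exit", d))
    return out


if __name__ == "__main__":
    loop = ["do i = 1, n", "  a(i) = b(i)", "end do"]
    print("\n".join(instrument_do("!$OMP DO", loop, "!$OMP END DO", "d")))
```

The same pattern applies to SECTIONS, WORKSHARE, and SINGLE by changing the directive names.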
Unfortunately, this method cannot be used to measure the barrier waiting time at the end of PARALLEL directives, because END PARALLEL does not accept a NOWAIT clause. Therefore, we insert an explicit barrier, surrounded by the corresponding performance interface calls, at the end of the parallel region. For source-to-source translation tools implementing the proposed transformations, this means that two barriers are actually executed. However, the second (implicit) barrier should complete almost immediately, because the threads of the OpenMP team have already been synchronized by the first one. An OpenMP compiler, of course, can insert the performance interface calls directly around the implicit barrier and thereby avoid this overhead.
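A sketch of the PARALLEL transformation in the same illustrative style is shown below. The function names are hypothetical, and the exact ordering (barrier calls, then pomp_parallel_end, then END PARALLEL) is an assumption consistent with the description above: the explicit barrier is placed inside the region, so the implicit barrier at END PARALLEL finds the team already synchronized.

```python
def pomp(name: str, kind: str, d: str) -> str:
    """Build one performance-interface call, e.g. pomp_parallel_fork(d)."""
    return f"pomp_{name}_{kind}({d})"


def instrument_parallel(par_line: str, body: list[str], d: str) -> list[str]:
    """Instrument a !$OMP PARALLEL ... !$OMP END PARALLEL region (sketch).

    fork/join are executed by the master thread outside the region,
    begin/end by every team member inside it. An explicit, measured
    barrier precedes END PARALLEL, whose own implicit barrier remains.
    """
    return [
        pomp("parallel", "fork", d),
        par_line,
        pomp("parallel", "begin", d),
        *body,
        pomp("barrier", "enter", d),
        "!$OMP BARRIER",
        pomp("barrier", "exit", d),
        pomp("parallel", "end", d),
        "!$OMP END PARALLEL",
        pomp("parallel", "join", d),
    ]


if __name__ == "__main__":
    print("\n".join(instrument_parallel("!$OMP PARALLEL PRIVATE(j)", ["call work(j)"], "d")))
```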
Transformation rules for the combined parallel work-sharing constructs (PARALLEL DO, PARALLEL SECTIONS, and PARALLEL WORKSHARE) can be defined in the same manner; they are essentially the combination of the transformations for the corresponding individual OpenMP constructs. The only difference is that clauses specified for the combined construct have to be distributed to the individual constructs so that the result complies with the OpenMP standard (e.g., SCHEDULE, ORDERED, and LASTPRIVATE clauses must be specified on the inner DO directive). Table 2 shows the proposed transformations for the OpenMP combined parallel work-sharing constructs.
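The clause distribution for PARALLEL DO can be sketched as follows. This is a simplified, hypothetical rewriter: it sends SCHEDULE, ORDERED, and LASTPRIVATE to the inner DO and every other clause to PARALLEL, which ignores corner cases a real tool must handle. It also assumes that the measured barrier of the inner DO already synchronizes the team, so no second explicit barrier is added before END PARALLEL.

```python
# Clauses that must move to the inner DO directive (from the text above).
DO_CLAUSES = {"SCHEDULE", "ORDERED", "LASTPRIVATE"}


def pomp(name: str, kind: str, d: str) -> str:
    """Build one performance-interface call, e.g. pomp_do_enter(d)."""
    return f"pomp_{name}_{kind}({d})"


def split_clauses(text: str) -> list[str]:
    """Split a clause list on blanks/commas that are not inside parentheses."""
    clauses, depth, cur = [], 0, ""
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if depth == 0 and ch in " ,":
            if cur:
                clauses.append(cur)
            cur = ""
        else:
            cur += ch
    if cur:
        clauses.append(cur)
    return clauses


def distribute(text: str) -> tuple[list[str], list[str]]:
    """Return (clauses for PARALLEL, clauses for DO)."""
    par, do = [], []
    for clause in split_clauses(text):
        keyword = clause.split("(", 1)[0].strip().upper()
        (do if keyword in DO_CLAUSES else par).append(clause)
    return par, do


def instrument_parallel_do(clause_text: str, loop: list[str], d: str) -> list[str]:
    """Instrument !$OMP PARALLEL DO by splitting it into PARALLEL + DO."""
    par, do = distribute(clause_text)
    return [
        pomp("parallel", "fork", d),
        " ".join(["!$OMP PARALLEL", *par]),
        pomp("parallel", "begin", d),
        pomp("do", "enter", d),
        " ".join(["!$OMP DO", *do]),
        *loop,
        "!$OMP END DO NOWAIT",
        pomp("barrier", "enter", d),
        "!$OMP BARRIER",
        pomp("barrier", "exit", d),
        pomp("do", "exit", d),
        pomp("parallel", "end", d),
        "!$OMP END PARALLEL",
        pomp("parallel", "join", d),
    ]
```

PARALLEL SECTIONS and PARALLEL WORKSHARE would follow the same shape with the inner directive replaced.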