
Tracing Hybrid Execution

 

Instrumentation of multi-threaded MPI programs poses some challenges for tracking inter-thread message communication events. MPI is unaware of threads (Java threads or otherwise) and communicates solely on the basis of rank information. Each process that participates in synchronization operations has a rank, but all threads within the process share the same rank. For a message send operation, we can identify the sending thread by querying the underlying thread system (in this case, through JVMPI); the receiving thread can be identified in the same way on the receiving side.
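
To make this concrete, the following is a minimal sketch (in C++, not TAU's actual source) of a wrapper built on the MPI profiling interface described in the previous section; it records the sending thread alongside the usual message parameters before calling the name-shifted PMPI routine. The helpers current_thread_id() and trace_log_send() are hypothetical stand-ins for the JVMPI thread query and the trace library's event logger.

  #include <mpi.h>

  /* Hypothetical helpers: thread id obtained through JVMPI, and the
     tracing library's routine for writing a send event record. */
  extern "C" int  current_thread_id(void);
  extern "C" void trace_log_send(int thread, int dest, int tag, int length);

  /* Wrapper intercepts MPI_Send; the application is relinked against it. */
  extern "C" int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                          int dest, int tag, MPI_Comm comm)
  {
      int size;
      MPI_Type_size(datatype, &size);

      /* The sender's thread is known locally at this point; the
         receiver's thread id is not available here. */
      trace_log_send(current_thread_id(), dest, tag, count * size);

      /* Forward to the MPI library through the PMPI name-shifted entry point. */
      return PMPI_Send(buf, count, datatype, dest, tag, comm);
  }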

Unfortunately, a problem remains with MPI communication between threads: the sender does not know the receiver's thread id, and vice versa. To accurately represent a message on a global timeline, we need to determine the precise node and thread on both sides of the communication, either from information in the trace file or from semantic analysis of the trace file. To avoid exchanging this information in additional messages at runtime, or supplementing messages with thread ids, we decided to delay the matching of sends and receives to the post-mortem trace conversion phase.

Trace conversion takes place after the individual traces from each thread are merged. The merged trace is a time-ordered sequence of events (such as sends, receives, and routine transitions). Each event record has a timestamp, location information (node, thread), and event-specific data (such as message size and tag). During trace conversion, each record is examined and converted to the target trace format (such as Vampir, ALOG, SDDF, or Dump). When a send is encountered, we search for the corresponding receive operation by traversing towards the end of the trace file and matching the receiver's rank, message tag, and message length. When a match is found, the receiver's thread id is obtained, and a trace record containing the sender's and receiver's node and thread ids, the message length, and the message tag can be generated. Matching works in a similar fashion when we encounter a receive record, except that we traverse the trace file in the opposite direction, looking for the corresponding send event. This technique is used later in our example to produce Figure 3.
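
The matching pass can be sketched as follows. The Event record and its field names are illustrative rather than the actual trace format, and the sketch assumes that the node id stored with an event coincides with the MPI rank of its process.

  #include <cstddef>
  #include <vector>

  // Illustrative event record from the merged, time-ordered trace.
  struct Event {
      enum Kind { SEND, RECV, OTHER } kind;
      double timestamp;
      int    node, thread;   // location of the event (node assumed == MPI rank)
      int    peer_rank;      // destination rank (SEND) or source rank (RECV)
      int    tag, length;    // message tag and message size in bytes
  };

  // Given a SEND at index i, scan forward through the merged trace for the
  // matching RECV: located on the destination rank, naming this sender as
  // its source, with the same tag and message length.  Returns the index of
  // the matching receive, or -1 if none is found (e.g., a truncated trace).
  int match_send(const std::vector<Event>& trace, std::size_t i)
  {
      const Event& send = trace[i];
      for (std::size_t j = i + 1; j < trace.size(); ++j) {
          const Event& e = trace[j];
          if (e.kind == Event::RECV &&
              e.node == send.peer_rank &&
              e.peer_rank == send.node &&
              e.tag == send.tag &&
              e.length == send.length)
              return static_cast<int>(j);   // e.thread is the receiver's thread id
      }
      return -1;
  }
  // Matching a RECV works the same way, scanning backwards for the SEND.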

 



Figure 2: TAU's profile browser RACY shows per-thread performance data

 



Figure 3: Vampir global time display shows activities and inter-thread message communication

 



Figure 4: Vampir timeline display can show the depth of routine nesting in the call stack on a particular thread

 



Figure 5: Vampir global activity chart illustrates thread grouping





