We have also demonstrated TAU's use for mixed-mode parallelism with multi-threaded Java programs using the mpiJava package. While mpiJava relies on the existence of native MPI libraries, its API is implemented as a Java wrapper package that uses C bindings for MPI routines. The integrated instrumentation for this scenario is portrayed in Figure 4.2. However, instrumenting multi-threaded MPI programs poses some challenges for tracking inter-thread message communication events, especially when the threads are managed by a virtual machine. MPI is unaware of threads (Java threads or otherwise) and communicates solely on the basis of rank information. Each process (i.e., context) that participates in synchronization operations has a rank, and all threads within the process share that rank. For a message send operation, we can identify the sending thread by querying the underlying thread system, and we can identify the receiving thread of a receive operation in the same way. For the JVM, this requires TAU to call into JVMPI across the Java Native Interface (JNI) boundary.
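The point that a rank identifies a process but not a thread can be illustrated with a minimal sketch. It is not TAU's instrumentation: `record_send`, `current_thread_id`, and the `EVENTS` list are hypothetical names, and Python's `threading` module stands in for the underlying thread system (for the JVM, the query would go through JVMPI instead). The rank is passed in explicitly rather than obtained from an MPI library.

```python
import threading
import time
from itertools import count

# Hypothetical per-process trace buffer; all names here are illustrative.
EVENTS = []
_events_lock = threading.Lock()

# Small sequential thread ids, assigned on a thread's first traced event.
_local = threading.local()
_next_id = count()
_id_lock = threading.Lock()


def current_thread_id():
    """Query the thread system for the calling thread's id."""
    if not hasattr(_local, "tid"):
        with _id_lock:
            _local.tid = next(_next_id)
    return _local.tid


def record_send(rank, dest, tag, length):
    """Log a send event with the process rank and the sending thread's id.

    The receiver's thread id is unknown at this point, so only the
    destination rank is recorded.
    """
    event = {
        "time": time.perf_counter(),
        "node": rank,                   # shared by every thread in the process
        "thread": current_thread_id(),  # distinguishes threads within the rank
        "kind": "send",
        "peer": dest,
        "tag": tag,
        "length": length,
    }
    with _events_lock:
        EVENTS.append(event)
    return event
```

Two threads that call `record_send` with the same rank produce records that differ only in timestamp and thread id, which is the information a later matching step has to work with.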
Unfortunately, a problem remains with MPI communication between threads: the sender does not know the receiver's thread id, and vice versa. To accurately represent a message on a global timeline, we need to determine the precise node and thread on both sides of the communication, either from information in the trace file or from semantic analysis of the trace file. To avoid exchanging this information in additional messages at runtime, or supplementing messages with thread ids, matching sends and receives is best deferred to the post-mortem trace conversion phase. Trace conversion takes place after the individual traces from each thread are merged. The merged trace is a time-ordered sequence of events (such as sends, receives, and routine transitions). Each event record has a timestamp, location information (node, thread), and event-specific data (such as message size and tag). When a send is encountered, we search for the corresponding receive operation by traversing towards the end of the trace file and matching the receiver's rank, message tag, and message length. When a match is found, the receiver's thread id is obtained, and a trace record containing the sender's and receiver's node and thread ids, the message length, and the message tag can be generated. The matching works in a similar fashion when we encounter a receive record, except that we traverse the trace file in the opposite direction, looking for the corresponding send event.
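The merge and the forward search for a send's receive can be sketched as follows. This is a minimal illustration under stated assumptions, not TAU's trace converter: the `Event` and `Message` record layouts and the functions `merge_traces` and `match_sends` are hypothetical, each receive record is assumed to carry the source rank in `peer`, and a receive that has already been paired is skipped so that it is not matched twice. The backward search for receive records is symmetric and omitted.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List, NamedTuple


@dataclass(frozen=True)
class Event:
    time: float      # timestamp
    node: int        # MPI rank of the process that logged the event
    thread: int      # thread id within that process
    kind: str        # "send", "recv", or any other event type
    peer: int = -1   # destination rank for a send, source rank for a recv
    tag: int = 0
    length: int = 0


class Message(NamedTuple):
    src_node: int
    src_thread: int
    dst_node: int
    dst_thread: int
    tag: int
    length: int


def merge_traces(per_thread_traces: Iterable[List[Event]]) -> List[Event]:
    """Merge per-thread traces (each already time-ordered) into one trace."""
    return list(heapq.merge(*per_thread_traces, key=lambda e: e.time))


def match_sends(trace: List[Event]) -> List[Message]:
    """For each send, scan forward for the first unclaimed matching receive."""
    claimed = set()   # indices of receives that are already paired (assumption)
    messages = []
    for i, ev in enumerate(trace):
        if ev.kind != "send":
            continue
        for j in range(i + 1, len(trace)):
            cand = trace[j]
            if (j not in claimed and cand.kind == "recv"
                    and cand.node == ev.peer      # receiver's rank
                    and cand.peer == ev.node      # assumed: recv logs its source
                    and cand.tag == ev.tag
                    and cand.length == ev.length):
                claimed.add(j)
                messages.append(Message(ev.node, ev.thread,
                                        cand.node, cand.thread,
                                        ev.tag, ev.length))
                break
    return messages
```

The scan is linear per send, which keeps the sketch short; a converter handling large traces would more plausibly index pending receives by rank, tag, and length.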
Figure 4.6 shows a performance trace of a mixed-mode Java/mpiJava application simulating the Game of Life. A total of twenty-eight threads execute across four nodes. The integrated events and their grouping appear as before. Our thread message matching algorithm was applied so that each message is drawn between the correct sending and receiving threads.
Figure: Mixed-mode Java/MPI execution profile of a Game of Life application