To transmit the local delays encountered in a process (due to program instrumentation) to other processes, we examined several alternatives. The first scheme modifies the source code of the underlying MPI implementation, extending the header sent along with each message in the communication substrate (Photon (24) uses this approach). Unfortunately, this scheme is not portable across MPI implementations because it relies on a specially instrumented communication library. The second scheme sends an additional message containing the delay information for every data message. It requires changes only to the tool's portable MPI wrapper interposition library. While it is portable to all MPI implementations, it pays a performance penalty for transmitting an additional message, a penalty not incurred by the first scheme. The overhead caused by the additional message would itself require further compensation.
The third scheme copies the contents of the original message and creates a new message with our own header that includes the delay information. This scheme has the portability advantage of the second scheme and avoids the second scheme's transmission of an additional message. However, copying the contents of a message could prove expensive, especially for the large messages that are transmitted in point-to-point communication operations.
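The copying cost of the third scheme can be made concrete with a minimal sketch in plain C. The function names and the wire layout (delay header first, payload after) are assumptions made for illustration and are not taken from the tool. Note that the sender copies every payload byte into a staging buffer and the receiver copies every byte back out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout for illustration: [delay (double)][payload bytes]. */

/* Sender side: build a new message by copying the payload behind a
 * header that carries the local delay. Returns a malloc'ed buffer
 * (caller frees) or NULL on allocation failure. */
void *pack_with_delay(const void *payload, size_t len, double delay,
                      size_t *out_len)
{
    unsigned char *msg = malloc(sizeof delay + len);
    if (msg == NULL)
        return NULL;
    memcpy(msg, &delay, sizeof delay);          /* header */
    memcpy(msg + sizeof delay, payload, len);   /* full payload copy */
    *out_len = sizeof delay + len;
    return msg;
}

/* Receiver side: extract the delay, then copy the payload out again. */
double unpack_with_delay(const void *msg, size_t msg_len, void *payload_out)
{
    double delay;
    assert(msg_len >= sizeof delay);
    memcpy(&delay, msg, sizeof delay);
    memcpy(payload_out, (const unsigned char *)msg + sizeof delay,
           msg_len - sizeof delay);             /* second full copy */
    return delay;
}
```

Both `memcpy` calls on the payload grow linearly with the message size, which is the expense the text attributes to this scheme for large point-to-point messages.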
We implemented a modification of the third scheme, but instead of
building a new message and copying buffers in and out of messages (at
the sender and the receiver), we create a new datatype. This new
datatype is a structure with two members. The first member is a
pointer to the original message buffer, which consists of elements of the
datatype passed to the MPI call. The second member is a double
precision number that contains the local delay value. Once created,
the structure is committed as a new user-defined
datatype and MPI is instructed to send or receive one element of the new
datatype. Internally, MPI may transmit the new message
by composing it from the two members, using vector
read and write calls instead of their scalar counterparts. This
efficient transmission of the delay value is portable to all MPI
implementations, sends only a single message, and avoids expensive copying
of data buffers to construct and extract messages.
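
The vector read and write idea can be sketched outside MPI with the POSIX `writev` and `readv` calls. This is only an illustration of gathering two separate memory regions into a single message without a staging copy; it is not a description of what any particular MPI implementation does internally. The helper names, the pipe transport, and the `DEMO_MAX_BYTES` limit are assumptions. In MPI itself, the two-member datatype would presumably be built with calls such as `MPI_Type_create_struct` and `MPI_Type_commit`; the sketch avoids MPI so that it runs standalone.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumed demo limit: small enough for one atomic pipe write. */
enum { DEMO_MAX_BYTES = 512 };

/* Gather the original payload (used in place) and the delay value
 * into one write. Returns the number of bytes written or -1. */
ssize_t send_with_delay(int fd, const void *payload, size_t len, double delay)
{
    struct iovec iov[2];
    iov[0].iov_base = (void *)payload;  /* first member: original buffer */
    iov[0].iov_len  = len;
    iov[1].iov_base = &delay;           /* second member: local delay */
    iov[1].iov_len  = sizeof delay;
    return writev(fd, iov, 2);
}

/* Scatter one incoming message directly into the user buffer and the
 * delay variable. A robust receiver would loop on short reads. */
ssize_t recv_with_delay(int fd, void *payload, size_t len, double *delay)
{
    struct iovec iov[2];
    iov[0].iov_base = payload;
    iov[0].iov_len  = len;
    iov[1].iov_base = delay;
    iov[1].iov_len  = sizeof *delay;
    return readv(fd, iov, 2);
}

/* Round trip through a pipe within one process. Returns the delay seen
 * by the receiver, or -1.0 if any step fails. */
double roundtrip_delay(const void *payload, size_t len, double delay, void *out)
{
    int fds[2];
    double received = -1.0;
    ssize_t total = (ssize_t)(len + sizeof delay);

    assert(len + sizeof delay <= DEMO_MAX_BYTES);
    if (pipe(fds) != 0)
        return -1.0;
    if (send_with_delay(fds[1], payload, len, delay) != total ||
        recv_with_delay(fds[0], out, len, &received) != total)
        received = -1.0;
    close(fds[0]);
    close(fds[1]);
    return received;
}
```

Calling `roundtrip_delay(buf, n, d, out)` sends `n` payload bytes followed by the delay `d` as one message. Neither side assembles an intermediate buffer, which mirrors the single-message, copy-free property claimed for the datatype-based scheme.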