next up previous
Next: Gaussian Elimination Up: Permutation Previous: Summary of Debugging

Debugging Session.

We began by defining chain and p-chain patterns to be used in matching.

Chains are described by a slight extension to regular expressions. Our expressions, over primitive event types, use operations of concatenation (denoted by adjacency), alternation ( +), shuffle ( |), and Kleene Star ( *).gif For this example, we modeled the activity of a process with the pattern
W R
which describes a Write followed by a Read on a single process. For future reference, we also named the components of the pattern and the pattern itself. The complete chain definition was thus
? PermutationChain = send: W receive: R
where send and receive name the event matched by W and R respectively. The Ariadne prompt -- ? -- is used in this paper to distinguish between lines of user input and debugger output. p-Chains are described by binding a chain to a set of processes. In this case, PermutationChain was bound by
? Permutation = PermutationChain ONALL PROCS
to the set of all processes (denoted PROCS). The keyword ONALL indicates that a copy of the chain must originate on each member of that set.

pt-Chains are used to describe the temporal composition of p-chains. In this example, we are looking for a single instance of Permutation and thus the specification is trivial. We ask for a match with
? PermuteEvent = match Permutation
If the match had succeeded, PermuteEvent would have been set to the resulting abstract event. In this case, however, it failed; PermuteEvent was set to and the following feedback was given
Match failed: Search failure.
Looking for Permutation.
Found 256 chains on {0..255}; using 256 processes.
This indicates that the search failed while looking for the p-chain Permutation and that during the search, 256 complete chains were recognized, one initiating on each of the processes numbered 0 through 255. The last part of the message gives the number of processes that had primitive events matched by the completed chains.

This information told us that no process above 255 did both a Write and a Read. To investigate further, we asked for additional information about the behavior of a specific process using the matchchain command. Matchchain attempts to recognize a single chain on a single process and provides feedback on the reason for a failure. In this case, it reported
? matchchain PermutationChain ON 256
Match failed: Chain failure.
Expecting an R but encountered the Right Cursor.
The Right Cursor marks the right end of the portion of the execution history graph that we are examining. Thus this response indicated that process 256 wrote a value but did not read one. Since this did not seem to help in locating the error, we tried another tactic.

We broke the pattern into two simple pieces, first matching all of the Writes and then matching all of the Reads. The dialogue went as follows:
? PWrite = W ONALL PROCS
? WriteEvents = match PWrite
Match succeeded.
Found 512 p-chains on {0..511}; using 512 processes.
? PRead = R ONALL PROCS
? ReadEvents = match PRead
Match failed: Search failure.
Looking for PRead.
Found 256 chains on {0..255}; using 256 processes.
This is more helpful. It indicates that all Writes occurred as expected but no process above 255 received a value. Where did the missing values go? We found out by asking
? destinations (WriteEvents)
This is our first example of a function returning the characteristics of a match. Each such function takes an abstract event as its argument and recursively searches the structure of that event. In the case of the destinations function, the structure is searched for WriteEvents and the set of destinations for those events is returned. The feedback was
Values written to 256 processes: {0..255}.
indicating that all of the values written by the 512 processes were directed at the lower 256 processes. Clearly there was some problem in the address calculation code. In fact, the variable holding the destination PID was mistakenly allocated to be an 8-bit quantity; larger values were truncated. Thus, only processes whose identifier was less than 256 could complete correctly.

If the same address truncation had occurred in a program for compression rather than permutation, it would have been harder to detect. A compression is very much like a permutation except that not all of the processes write or read values. During compression, the nonzero elements of an array (stored one value per process) are moved to the beginning of that array. Only processes with nonzero data values Write and only processes at the beginning of the array that are to receive a value Read. Thus, the chain definition used above, W R, would not work because it is a process-centered chain that follows the activity of a single process. Instead, we must use a data-centered pattern that follows the path of a communication. This is described by the pattern
? CompressionCh = send: W @ receive: R
The @ moves the context of the match -- that is it changes the process and the location for matching -- to the receiving process. Thus this pattern matches a Write on the initiating process and then follows that communication edge to continue matching on the process that is its destination.

Not all of the processes execute the CompressionCh pattern (only those initially holding nonzero values), so we can not determine a priori the set of processes for binding. Instead we use
? Compression = CompressionCh ONSOME PROCS
where the keyword ONSOME indicates that a successful match need only occur on a nonempty subset of the given process set. Now, however, when we do the match, it succeeds despite the presence of the truncation error:
? CompressionEvent = match Compression
Match succeeded.
Found 256 p-chains on {1,12,24..33,37,45,47,56,58..65,78,89..112,129..137,
139,141,143..149,156,158,160..189,196,197..234,241,245..258, 267,269,276,
280..298,301,314,324,356,358,367..391,413,415,433..456,470..494 };
using 362 processes.
Note that the process set is not consecutive because chains are only found on those processes that initially have nonzero values. This result looks correct; there is no indication that some values were written but never read. We could detect this sort of error only by asking if anything remained in the trace after the match. (Remember that a match removes the matching events from the execution history graph.) This can be done with
? Left = match REMAINDER
where REMAINDER is a predefined pattern that matches anything. In our example, where the specific trace we were using had 290 processes with nonzero PIDs, the result was
Match succeeded.
Found 34 p-chains on {15..18,79,115..126,190..194,335,495..498,501..507};
using 34 processes.
meaning that 34 ``extra'' events were found that had not been matched by the Compression chains. What were these events? We can find out with
? eventtypes (Left)
34 Writes
0 Multicasts
0 Reads
0 Phase Markers
where eventtypes is again a function that returns a match characteristic. In this case, it counts the number of primitive events of each type. The answer indicated that all of the unmatched events were Writes. We now are in the same position that we reached in the permutation example: we know that only processes below 256 completed and that some of the values that were written were never read. The destinations function would lead us to the error in the same manner as above.



next up previous
Next: Gaussian Elimination Up: Permutation Previous: Summary of Debugging



Joydip Kundu kundu@neweng.enet.dec.com