Using Ave/Ariadne to Set Consistent Global Breakpoints: The Integration of Event- and State-Based Debugging

Next: The State-Based Component Up: No Title Previous: Spreadsheet Based Query/Assertion

Using Ave/Ariadne to Set Consistent Global Breakpoints: The Integration of Event- and State-Based Debugging

To effectively use break/examine techniques on massively parallel programs, it must be possible for the user to stop his/her program in meaningful states. Standard mechanisms that set breakpoints locally within a process do not work well in a parallel setting:

How should these breakpoints be specified? Must they be global or are local breakpoints allowed? If local, how and when should the remaining processes be stopped in order to provide a meaningful state?

One possibility is to allow breakpoints only at barrier synchronizations [35] but this is overly constraining in multithreaded language environments. The alternative is to support local breakpoints that are extended dynamically to global breakpoints. This can be done explicitly by notifying the remaining processes - either through communication with a central monitor [36] or over normal communication channels [33][16] - that a breakpoint has been reached or it can be done implicitly by allowing the remaining processes to run until they block waiting for input. Unfortunately, in either case, it is possible for some processes to execute well beyond the established breakpoint, potentially obliterating information relevant to the source of the error. We choose instead to take advantage of the replay mechanism: we use the results of post mortem modeling to insert directives that will, during replay, halt each process at the last point where it could have influenced the process(es) triggering the breakpoint.

The other issue is that of breakpoint specification. Most of the breakpoint debuggers [14][16][32] allows breakpoints to be to specified as conjunctions and disjunctions of local predicates. This approach is, however, not practical. In the context of debugging, errors are often observable only at the output. It is near impossible for a programmer to translate the erroneous output condition to a set of local predicates - they are the very unknowns that the debugging process is supposed to uncover. Direct application of state-based approach is thus not very useful due to a lack of abstraction mechanism that would allow the user to narrow his/her focus to a portion of the program.

Ave/Ariadne provides a framework, where the user narrows the focus of attention using behavioral modeling, and allows the specification of breakpoints in terms of the results of that modeling. The user can set a global breakpoint just before or just after any abstract event by selecting the appropriate nodes of the Ave rendering of the match tree. The debugger annotates the trace just before/after the components of the specified events and the program is re-executed to the latest consistent state preceding the annotations.

The emphasis on consistent states [34] is important. Consider, for example, the behavior depicted in Figures and . Having been informed that the expected precedes relation did not hold between the second and third iterations, the user might insert a breakpoint after the second one, intending to stop the processes at the locations indicated by the rectangles in Figure 14 but this state is meaningless. It could not have happened in any execution of the system because it shows the receipt of a message by process 0 that has not yet been sent by process 8. This state is inconsistent.

Ariadne will only breakpoint programs at consistent states. We define consistency as in [14], letting be the k-th event on process i, which gives rise to the k-th state on process i. Denoting Lamport's happened before relation on events by , we define the analogous partial ordering of process states as follows: iff

Thus state

occurs before a state

, if the terminating event of

happens before or is the same as the starting event of

. Local states that are incomparable under this relation are said to be concurrent.
Definition. A global state for an n process system, denoted by a n-tuple

of local states, is consistent if its elements are pairwise concurrent.
Whenever a user's breakpointing request results in an inconsistent state, our debugger will automatically halt the computation at an earlier, consistent state constructed using a greedy algorithm. In Figure 14, this readjustment results in the inclusion of the earlier local breakpoints denoted by the oval markers. This is the earliest consistent state that preceded the erroneous event.

The user can set multiple breakpoints by selecting multiple abstract events as long as the breakpoints do not conflict. Two breakpoints conflict if they are reached (in the trace) by different processes in different orders. Conflicting breakpoints would deadlock execution replay and so must be avoided. Formally, if

and

are global consistent states, then
Definition.

and

conflict iff there are local states

in A or

in B such that

and

Global states can be specified with respect to an abstract event. The entire process set of the computation can be partitioned into three equivalence classes from the perspective of an abstract event - processes where constituent primitive events of the abstract event occur forming the core set, those which can causally influence the abstract event but are not in the core set forming the influence set, and those which cannot causally influence the abstract event forming the other set. In Figure 15, the abstract event is formed by the events within the thick dashed line. Processes 4,5,6 form the core set; 2,3 and 7 form the influence set, and 0 and 1 form the other set.

At the granularity of the abstract event, the user can ask to stop the computation immediately before the abstract event or immediately after it. In order to stop the global computation at a stage where information regarding the abstract event is still available, processes in the different states are stopped as follows:

the processes in the other set should be stopped at their earliest state
the processes in the influence set should be stopped immediately after the lastest event that could causally influence the abstract event.
all the processes in the core set should either be stopped immediately before the earliest local event that is a part of the abstract event, or immediately after the latest local event that is a part of the abstract event (depending on the user request).

The same principle can be applied if the user enumerates a set of abstract events and use them to set a single global breakpoint.

Example: Binary Image Compression (Continued) In Section we presented a partial debugging session where an expected logical relationship did not hold between two successive iterations of the binary image compression algorithm. Say for example, the user was expecting an abstract event

to precede another abstract event

, and the relationship did not hold. This must be due to an overlap between some parts of

and

. Where should the breakpoint be set? We argue that it is better to set a breakpoint after(X) rather than before(Y). It is because, more events have to occur in order for the computation to reach the state after(X). Since in certain programs, there are possibilities of event occurrences that do not change the state variables of the program (because they are noted by the process and ignored), occurrence of more number of events increases the probability that a state change would occur due to event occurrences. It is this changed state that will point the bug to the programmer.

The user can select the second compr event, and set a breakpoint
set break after(select);
select is a keyword in Ariadne which points to the abstract event selected by the mouse. The program when reexecuted under replay stopped, and the user can use Ariadne's file viewing capability to see the information given in Figure 16, where the source lines at which processes are waiting are highlighted.

(Note that the Binary Image Compression program we ran was SPMD so all processes were running the same code even though they were stopped at different locations.) Additional output at the top of the screen identifies the processes waiting at specific source lines: all odd numbered processes from 1 through 31 waits after sending the first message; the remaining even numbered processes wait at other different places in the source code. The source code window uses the concept of equivalence classes - processes that are waiting at the same statement are grouped together, and the source line is highlighted with a distinct color. In our current example, processes were stopped at four different places in the source code, and the source lines are highlighted with four different colors. Within a given source line (all of which correspond to communication events) processes can be in two positions - before executing the communication library call, or after executing the communication library call: these are highlighted in the upper window.

Since the breakpointing scheme is replay based, we can give additional output identifying the events that were most recently seen or next to be encountered. In this case, that output will appear as:

In the algorithm odd processes perform the exchanges in the first iteration, and do not take part in the communication phase any longer. In order to preserve the information that may be relevant to the error, the odd processes are stopped at the stage where they could still influence compr[2]. In a correct execution, all even processes should be stopped After a communication event - the initiating ones after receiving the replies, and the non-initiating ones after sending those replies. The output however identifies process 0 and 2 to be waiting Before receiving a message, but at different part of the source code. Why are processes 0 and 2 behaving differently than the other even processes? Are they somehow in different iterations? To answer these questions we need to examine the variable corresponding to the iteration count, and need the help of standard state examination technique. Note these questions can never be answered within the framework of standard event-based debugging, unless all the variable values are logged requiring potentially enormous amount of storage space. We can use any vendor-supplied commercially available state-examination debugger for this purpose. In this case, we will continue our example which was written for Intel Paragon, with the vendor-supplied ipd debugger.

The State-Based Component

Next: The State-Based Component Up: No Title Previous: Spreadsheet Based Query/Assertion

Joydip Kundu