
Tracking Elusive States with Event-based Breakpointing

Our integrated environment provides a unified framework in which errors can be detected at a high level of abstraction and tracked down to the offending source construct. This event-oriented integration of the two levels of debugging is in fact more versatile still: it can set global breakpoints that are impossible to specify using standard conditional state expressions.

The parallel algorithm we consider handles queries using inheritance in a large knowledge base. Inheritance is the process of inferring properties of objects from group membership, and it is complicated by exceptions that must be handled as special cases. Objects in the knowledge base are connected through ``is-a'' relationships. A query about an object first arrives at the originating node. If the value is available locally, it is returned; otherwise the knowledge base is searched.
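The inheritance mechanism itself is easy to picture sequentially. The following is a minimal Python sketch, illustrative only, of attribute lookup over ``is-a'' links, in which local values act as exceptions that override inherited ones; the knowledge-base representation and all names are assumptions, and the parallel algorithm described below distributes this search across the nodes holding the objects.

def lookup(kb, obj, attr):
    # A local value is an exception that overrides anything inherited.
    if attr in kb[obj]["values"]:
        return kb[obj]["values"][attr]
    # Otherwise search the groups the object belongs to via is-a links.
    for parent in kb[obj]["is_a"]:
        value = lookup(kb, parent, attr)
        if value is not None:
            return value
    return None

kb = {
    "bird":    {"values": {"flies": True},  "is_a": []},
    "penguin": {"values": {"flies": False}, "is_a": ["bird"]},   # exception
    "tweety":  {"values": {},               "is_a": ["penguin"]},
}
assert lookup(kb, "tweety", "flies") is False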

``Activity spreads through the network using plus messages meaning ``Do you have any values?'', and minus messages meaning ``Any values that you have are overridden''. Both types of messages contain the number of the originating node; plus messages contain the requested attribute name as well. Answers are returned to the originating node using found messages meaning ``I have a legal value'', and ignore messages meaning ``Ignore any values that I have given you''.''
In order to detect the termination of a search, a mass is attached to each message. When the originating node receives messages whose combined mass is 1, it knows that the query has terminated. This requires nodes that cannot answer the query to send dummy messages, tagged mass and carrying the appropriate mass, to the originator. Multiple queries can be active in the knowledge base at the same time, and we would like to model them separately.
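To make the termination rule concrete, here is a minimal Python sketch of the mass bookkeeping at an originating node. It assumes the unit of mass is split among the outgoing plus messages and returned, in full, by the replies; how the mass is actually apportioned is not spelled out above, and the message layout and names are illustrative, not the author's code.

from fractions import Fraction

class Originator:
    def __init__(self, parents):
        self.parents = parents
        self.answers = {}             # queued found values, keyed by sender
        self.mass = Fraction(0)       # mass returned to us so far

    def start_query(self, attr):
        # Split the unit of mass across the plus messages we send out.
        share = Fraction(1, len(self.parents))
        return [("plus", attr, share, p) for p in self.parents]

    def on_reply(self, sender, tag, value, mass):
        if tag == "found":
            self.answers[sender] = value       # queue the answer
        elif tag == "ignore":
            self.answers.pop(sender, None)     # cancel that node's answer
        # found, ignore and dummy mass messages all carry mass back.
        self.mass += mass
        return self.mass == 1                  # True once the query terminates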

The originating node starts off the query by sending plus messages to all its parents. If it receives a found message, it queues it. If an ignore message is received, the corresponding found message is dequeued; a mass message is simply discarded after its mass value has been noted. The originating node sends messages to multiple nodes and finally gathers all the responses. The behavior from the perspective of the originating node can be modeled as

ch query = (W_plus <: @ R_plus spread :>)+ gather+;
ch gather = R_found or R_ignore or R_mass;
spread models the activation in the other processes, each of which receives a query and processes it. Upon receiving a plus or minus message, a process first sends a reply to the originating node, tagged as one of found, ignore, or mass. It then propagates the query to its own parents (if any). The spreading activation has an irregular tree structure, which can be modeled as

ch spread = (reply (W <: @ R reply)* :>)*;
ch reply = W_found or W_ignore or W_mass;
query is thus a depth-first traversal of a tree.
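For concreteness, the following is a minimal Python sketch, not the author's code, of the behavior the spread pattern describes at a queried node: reply to the originating node first, then propagate the query to the node's own parents. The reply-tag decision is simplified and the mass bookkeeping is omitted; the message and node representations are assumptions.

def handle(node, msg, send):
    # node: {"values": {...}, "parents": [...]}
    # msg:  {"tag": "plus" or "minus", "attr": ..., "origin": pid}
    # send(dest, payload): transport callback
    if msg["tag"] == "plus" and msg.get("attr") in node["values"]:
        reply = ("found", node["values"][msg["attr"]])
    elif msg["tag"] == "minus" and msg.get("attr") in node["values"]:
        reply = ("ignore", None)       # withdraw a value given earlier
    else:
        reply = ("mass", None)         # dummy reply carrying mass only
    send(msg["origin"], reply)         # 1. answer the originating node first
    for p in node["parents"]:          # 2. then spread the query upward,
        send(p, dict(msg))             #    giving the tree-like traversal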

The whole behavior can be modeled as follows:

procset = {0..7};
ch gather = R_found or R_ignore or R_mass;
ch reply = W_found or W_ignore or W_mass;
ch spread = (reply (W <: @ R reply)* :>)*;
ch query = (W_plus <: @ R_plus spread :>)+ gather+;
pch Query = query* onsome procset;
match infer = < Query* >;
The match succeeded, and there was no obvious problem, as can be seen from the absence of red in Figure 19.

 

In our test case there were two originating nodes for queries, each of them sending three queries. The feedback in the summarizing box (2,2) points out that two independent queries started at two distinct nodes (the first 2) and that they exhibited different behaviors (the second 2), which is the most likely outcome. Although the specification is coarse, the tree structure specified is irregular and data-dependent.

The fact that the behavior matched showed there was no obvious error in how queries are handled, yet our test case returned null answers to each of our queries. This may be because the correct answer was never sent to the originating node, or because the node that initially sends the correct answer later cancels it by sending an ignore message. The first hypothesis already looks shaky, since the model we specified forces each process to send a reply to the originating node. The specification, however, is quite coarse, and we need to verify certain assertions about the behavior that are necessary for correctness. The assertions are

  1. The number of found messages received by the originating node is one more than the number of ignore messages received. This would show, to some extent, that at least the originating node is left with one found message if it processes each message correctly.
  2. A node that sends mass messages sends no other type of message to the originator for a given query.
  3. ignore messages are sent only after a node has received minus messages.

The first assertion is the easiest to verify. We simply compute, for each abstract event query, an attribute called diff that is the difference between the counts of the R_found and R_ignore messages. Note that only the originator receives messages with those tags.
with Query foreach query compute diff =
count(query wrt R_found) - count(query wrt R_ignore);

with Query show diff(query) wrt query;
This produced a Scatter Plot (with uniform color) showing that each query has an attribute diff with the value 1.
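To make concrete what diff measures, a rough offline equivalent over a list of receive-event tags could look as follows; the trace representation is an assumption for illustration and is not Ave's.

def diff(received_tags):
    # received_tags: tags of the messages the originator received for one
    # query, in order, e.g. ["found", "mass", "ignore", "found", "mass"].
    return received_tags.count("found") - received_tags.count("ignore")

# Assertion 1 holds when exactly one answer is left standing per query.
assert diff(["found", "mass", "ignore", "found", "mass"]) == 1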

The second assertion can be checked by enumerating the processes that send mass messages and those that send found and ignore messages, and then checking whether the two sets of processes are disjoint. This can be done in four steps:
with infer foreach query compute set_mass = pid(R_mass);
with infer foreach query compute set_other = pid(R_found, R_ignore);
with infer foreach query compute flag = disjoint(set_mass, set_other);
with infer show flag(query) wrt query;
This also produced an output in which the value of flag was 1 in both cases, showing that the sets are indeed disjoint, and hence that the processes sending mass messages sent no other type of message to the originator.
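The same disjointness check, stated over a raw trace of (process id, tag) receive events, looks roughly as follows; again this is an illustrative sketch rather than Ave's computation.

def mass_senders_disjoint(received):
    # received: (pid, tag) pairs the originator received for one query.
    set_mass = {pid for pid, tag in received if tag == "mass"}
    set_other = {pid for pid, tag in received if tag in ("found", "ignore")}
    return set_mass.isdisjoint(set_other)    # assertion 2: no overlap

assert mass_senders_disjoint([(3, "found"), (4, "mass"), (5, "mass"), (3, "ignore")])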

The third assertion, although seemingly more complex, can be handled with ease thanks to Ave's spreadsheet-like relative addressing scheme. An attribute left_tag is first computed, which should be 1 if the left sibling of an R_ignore event is tagged minus, and is then displayed:

with Query foreach R_ignore compute left_tag
= eq(tag(left(R_ignore)), minus);
with Query show left_tag(R_ignore) wrt query;
Once again, the attribute took only a single value. Ave proved invaluable in rejecting the possibility of certain kinds of errors that might arise from faulty behavior on the part of the nodes queried by the originator.
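Stated directly over a per-node event sequence, rather than through Ave's relative addressing, the third assertion amounts to the following check; the event representation is once more an illustrative assumption.

def ignore_only_after_minus(node_events):
    # node_events: one queried node's events in order, e.g.
    # [("recv", "plus"), ("send", "found"), ("recv", "minus"), ("send", "ignore")]
    seen_minus = False
    for kind, tag in node_events:
        if kind == "recv" and tag == "minus":
            seen_minus = True
        elif kind == "send" and tag == "ignore" and not seen_minus:
            return False                # an ignore sent with no prior minus
    return True

assert ignore_only_after_minus([("recv", "plus"), ("send", "found"),
                                ("recv", "minus"), ("send", "ignore")])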

Since the communication pattern did not show any obvious errors, we hypothesized two additional modes of failure: either the originator is handling the responses incorrectly, or it is not getting the right responses from the queried nodes. If the second hypothesis is true, we need to check each response sent back by the queried nodes, which requires stopping the execution of the system every time a queried node sends a response. This can become quite tedious. The most helpful course of action would be to stop the program execution at a state where the originator has received all the responses but has yet to process any of them, while the queried nodes have sent all their responses to the originator but have yet to process the next query.

The program is event-driven: each node sits in an infinite event-handling loop and responds to the different messages it receives. It is impossible to identify our target state in such a system using a standard breakpointing scheme based on conditional state expressions, because there is no explicit state information available. Programmers almost never write programs with an eye toward exposing state that would help debugging. Even if there were a counter in the loop that could provide such information, it would not be useful in this case: the number of messages that a query generates depends on the query and on the contents of the knowledge base. Unless the programmer has detailed knowledge of the actual contents of the knowledge base and its physical distribution, it is impossible to ascertain the number of messages processed by each node, and hence impossible to set a breakpoint based on conditions on state (in this case, the counter value).

The event-based model, however, allows us to set such a breakpoint quite easily. All we need to do is set the breakpoint before the gather pattern, and we will stop at the desired state during re-execution.

The program was re-executed under replay. The Ave Source Code Window can be invoked to show the state where the computation has stopped.

 

With the help of ipd, we can examine the contents of the processor's message queue to see the different messages it has received.

ipd, however, does not allow us to see the contents of the messages. One could trace message contents with our instrumentation scheme, but that would require re-instrumentation and recompilation. So we selectively executed process 0, setting a local breakpoint after each message receive.

We examined each message as it was received, beginning with the first found message.

None of the found messages had an answer in it, so the problem was not with queue management but with the processes sending the answers. A look at the code section handling query processing revealed that the programmer had forgotten to copy the result of the query into the send buffer.

Note that if the user had wanted to set a different breakpoint, one at which each process is stopped before sending a found message so that the content of the send buffer can be examined, he or she could have done so by setting the breakpoint

break after(with select foreach W_found);
This works because each queried node sends at most one found message to the originator, and once a found message has been sent by a node, no node in its causality envelope (Mattern defines this notion under a different name) can send another found message. This example illustrates the versatility of our approach. Hypotheses can be formed about the cause of an error, and assertions can be verified to strengthen or reject individual hypotheses. Global breakpoints that are difficult or even impossible to set using conditional state expressions can be set easily by looking at the history of the program as depicted by the event-based model.





