Christopher W. Harrop, Steven T. Hackstadt, Janice E. Cuny, Allen D. Malony, and Laura S. Magde, Supporting Runtime Tool Interaction for Parallel Simulations, Proceedings of Supercomputing '98 (SC98), Orlando, FL, November 7-13, 1998.
Christopher W. Harrop
harrop@cs.uoregon.edu
http://www.cs.uoregon.edu/~harrop
Steven T. Hackstadt
hacks@cs.uoregon.edu
http://www.cs.uoregon.edu/~hacks
Janice E. Cuny
cuny@cs.uoregon.edu
http://www.cs.uoregon.edu/people/faculty/cuny.html
Allen D. Malony
malony@cs.uoregon.edu
http://www.cs.uoregon.edu/people/faculty/malony.html
Laura S. Magde
laura@cs.uoregon.edu
It is common for simulations to run for many hours, or even days, before producing an output that is analyzed to modify parameters for the next run. For scientists limited to such post-mortem analyses, the modeling process can be very time consuming. Runtime interaction can alleviate this somewhat by allowing a scientist to analyze intermediate results to determine whether a computation should be aborted or allowed to continue. In addition, they can dynamically adjust computational parameters to facilitate the exploration of a simulation's parameter space or to improve convergence times. Reducing the time it takes to observe the effects of a parameter change makes it easier for scientists to find trends or patterns in their model's behavior.
Runtime interaction, however, is difficult to deliver. First, because of their computational requirements, many simulations are run on parallel or distributed heterogeneous platforms, making it necessary to synchronize requests and access distributed data structures, activities unfamiliar to many scientists. Second, because many simulations are legacy codes, it is not possible for the runtime interaction system to make assumptions about internal program structures. Third, scientists need a complete understanding of their codes and the tools they use to process their data and thus they insist that their programs remain familiar and undergo little modification.
We have attempted to address these problems in our work on domain-specific environments for the geological sciences. That work first resulted in an environment, called TIERRA (Tomographic Imaging Environment for Ridge Research and Analysis) [3] that provided online visualization and program control capabilities for a seismic tomography application. This new work extends and abstracts that earlier system, providing a framework that can be customized for different applications. We demonstrate that framework with an initial implementation for the seismic application, called TierraLab.
Our framework's design is based on the philosophy that runtime interaction capabilities should be delivered in a way that does not interfere with the scientists' current modi operandi. This means that we hide details of parallel and distributed computation, respect and use legacy codes, and have the scientist/programmer actively involved in any necessary annotation or instrumentation of his/her code. As important, we integrate runtime interaction functionality with the scientists' normal data analysis environment so they do not have to re-implement their analysis tools or learn new programming languages, libraries, or paradigms.
TierraLab is an object-oriented runtime interaction system that extends MATLAB [12], an analysis engine familiar to scientists, with commands that provide distributed data access and execution control. In the next section, we discuss the design of our framework and its implementation in TierraLab. In Section 3, we demonstrate its utility using a seismic tomography application. In Section 4, we discuss related work, and in the final section, we present our conclusions and future work.
tierra.mex. The interaction engine uses the DAQV [7] client library to send interaction commands to a tomography application that has been instrumented with DAQV routines from the DAQV application library. Later in the paper, we refer to the 10 labeled steps depicted in the figure.

Each of the TierraLab components is designed as a class hierarchy where an abstract base class defines the component's interfaces and functionality. An individual component is created by implementing a subclass of the appropriate base class. TierraLab's object-oriented architecture provides a plug-and-play framework for delivering runtime interaction capabilities and online data analysis. Different implementations of the components can be combined to provide customized runtime interaction functionality to satisfy different types of users and situations. For example, the analysis engine component could be targeted for IDL [17] or Mathematica [18] instead of MATLAB. Similarly, different types of user interface components could be developed to meet the requirements of different users.
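The plug-and-play idea can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the TierraLab implementation; the class names `AnalysisEngine`, `EchoEngine`, `UpperEngine`, and `Session` are hypothetical. An abstract base class fixes the interface, subclasses supply the behavior, and the code that combines components depends only on the base class.

```python
from abc import ABC, abstractmethod


class AnalysisEngine(ABC):
    """Abstract base class: fixes the interface every analysis engine exposes."""

    @abstractmethod
    def evaluate(self, command: str) -> str:
        """Evaluate one command and return its textual output."""


class EchoEngine(AnalysisEngine):
    """Hypothetical stand-in for a wrapper around an off-the-shelf package."""

    def evaluate(self, command: str) -> str:
        return f"echo: {command}"


class UpperEngine(AnalysisEngine):
    """A second hypothetical engine, showing that components are swappable."""

    def evaluate(self, command: str) -> str:
        return command.upper()


class Session:
    """Combines components; it depends only on the abstract interface."""

    def __init__(self, engine: AnalysisEngine) -> None:
        self.engine = engine

    def run(self, command: str) -> str:
        return self.engine.evaluate(command)


if __name__ == "__main__":
    # The same session code works unchanged with either engine.
    for engine in (EchoEngine(), UpperEngine()):
        print(Session(engine).run("plot(x)"))
```

Because `Session` never names a concrete engine, adding a new engine in this sketch means writing one more subclass and nothing else.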
Figure 1. System Architecture
Since we used MATLAB for our initial analysis engine, we chose to design an interface subclass that replicates its command-line user interface. For example, the following excerpt from a TierraLab session shows commands for connecting to a running application, in this case a seismic tomography application named hpt_77. The ">>" is the command-line prompt.
----------------------------------------------------
This application uses MATLAB as an analysis engine
----------------------------------------------------
< M A T L A B (R) >
(c) Copyright 1984-96 The MathWorks, Inc.
All Rights Reserved
5.1.0.421
May 25 1997
>> % Connect to the tomography code
>> urlFile = '/research/power/harrop/tierra/hpt_77/src/daqv_master_url';
>> fid = fopen(urlFile,'r');
>> url=fscanf(fid,'%s');
>> fclose(fid);
>> [id err]=tierra('attach',url);
>>
We are considering designing a subclass that implements a
graphical user interface, but this is not a high priority because our users have not
expressed much interest in it and are, so far, satisfied with the command-line interface.
An abstract analysis engine base class provides all TierraLab analysis engines with the functionality to evaluate commands and to interact with the user interface and interaction engine components. The actual implementation of this functionality is dependent on which off-the-shelf package it wraps. TierraLab's current analysis engine wraps MATLAB because it is heavily used by our community of scientists. The use of MATLAB has several advantages: it is familiar to the scientists; it is extensible, allowing scientists to create their own data analysis tools in its matrix-based language or through interfaces to C and Fortran; it is interactive, so scientists can build new analysis tools on-the-fly while interacting with a simulation; and MATLAB's language features, when extended with interaction commands, allow scientists to write interaction scripts. The primary disadvantage of MATLAB is performance; as expected, its interpretive environment is not as efficient as compiled code. However, this problem is minimized by the availability of C and Fortran interfaces which can accommodate computationally intensive tools and reserve use of the interpreter for less performance-critical analyses. Another problem with MATLAB is that it is a sequential data analysis engine, meaning it can process only one command at a time. This problem could be alleviated by incorporating MATLAB extensions like those in MultiMATLAB [14].
When TierraLab is launched, the analysis engine starts a MATLAB process using MATLAB's Engine interface [13] which consists of a two-way UNIX pipe and a set of high-level routines for communicating across it. Thus the MATLAB process does not need to execute on the same machine as the TierraLab process, although performance may be affected. The MATLAB analysis engine interfaces with TierraLab's interaction engine through a MATLAB MEX routine called tierra. That routine uses the Nexus multithreaded runtime system to send interaction requests to the interaction engine. Nexus is initialized for use by the MEX routine immediately after the MATLAB engine starts up, and before the user is allowed to enter commands. A persistent block of memory in the MEX routine is used to preserve the Nexus environment between invocations of the tierra command. When the analysis engine receives a signal from the user interface indicating that a command is ready for evaluation (Figure 1, step 2), the command is retrieved and passed to MATLAB for evaluation via the engEvalString engine interface routine. MATLAB commands are executed by the MATLAB engine and immediately return. However, the tierra interaction command results in additional steps.
When a user issues an interaction request, MATLAB invokes the tierra MEX routine (step 3), which checks the command for errors and then sends the command and its arguments, via a Nexus Remote Service Request (RSR), to the interaction engine (step 4). When the interaction command is complete, the results of the command are received by the MEX routine via another Nexus RSR (step 7). At this point the results are copied into the memory locations of the appropriate return arguments of the tierra command (step 8). When the command is finished, the output, if any, is copied into a buffer in the command object, and the user interface is signaled (step 9). The analysis engine then waits for its next command.
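The request and reply round trip of steps 3 through 8 can be sketched as follows. This is an illustration under stated assumptions, not the MEX routine itself: two in-process Python queues stand in for the pair of Nexus RSRs, a thread stands in for the interaction engine, and the command set in `KNOWN_COMMANDS` and the reply strings are hypothetical.

```python
import queue
import threading

KNOWN_COMMANDS = {"attach", "probe"}  # hypothetical subset, for illustration only


def interaction_engine(requests: queue.Queue, replies: queue.Queue) -> None:
    """Stand-in for the interaction engine: serve requests until told to stop."""
    while True:
        request = requests.get()
        if request is None:  # sentinel value: shut down
            break
        command, args = request
        replies.put((command, f"handled {command} with {len(args)} argument(s)"))


def tierra(requests: queue.Queue, replies: queue.Queue, command: str, *args):
    """Stand-in for the MEX routine: validate, send, wait, return results."""
    if command not in KNOWN_COMMANDS:        # step 3: check the command for errors
        return None, f"unknown command: {command}"
    requests.put((command, args))            # step 4: send command and arguments
    _, result = replies.get(timeout=5)       # step 7: receive the results
    return result, None                      # step 8: fill the return arguments


if __name__ == "__main__":
    requests, replies = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=interaction_engine, args=(requests, replies), daemon=True
    )
    worker.start()
    print(tierra(requests, replies, "attach", "example-url"))
    requests.put(None)
    worker.join(timeout=5)
```

The caller blocks in `replies.get` until the engine answers, which mirrors the way an interaction command, unlike an ordinary MATLAB command, does not return immediately.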
The purpose of the interaction engine is to enhance the scientists' working environment with command extensions for interacting with their computations. Using DAQV allows us to hide data and code distribution details from the scientists, but the application must first be appropriately annotated with instrumentation. The annotations register variables and specify locations for read/write access to them. Once the annotation is complete, the scientists can invoke interaction with the application using familiar MATLAB syntax, including MATLAB scripts and programs. The results of interaction commands are stored in MATLAB's workspace, making extracted data readily available as input to other analyses. Conversely, the output of MATLAB operations can be used to modify the simulation's data.
Given our plug-and-play model, it is reasonable to consider alternative implementations of the interaction engine. Most obvious would be the use of debugger stubs and a software layer that connects MATLAB to a debugger (e.g., dbx). This approach would support robust program control and completely arbitrary data access. However, it has some serious disadvantages. First, it would shift the responsibility of multiple process coordination and distributed data handling onto our users. Our users know (and care) little about the parallelization of their applications and would be at a loss to deal with data distribution issues. DAQV hides these low-level details, allowing tools to interact with the application and data structures at a semantic level known to users. Second, within a given source code file, debugger instrumentation is pervasive; instrumentation is selective only if the programmer is willing to subdivide (further) an application into multiple source code files. While it must currently be inserted manually, DAQV instrumentation is completely selective and can be applied only where needed regardless of source code structure. Third, debugger instrumentation has a detrimental impact on application performance, and in order to turn off the instrumentation, the application must be recompiled. Under DAQV, instrumentation can be enabled and disabled at runtime. When instrumentation is disabled, DAQV instrumentation remains in the program, but only incurs overhead comparable to that of a procedure call (about 3.5 microseconds), making the overall application impact negligible and allowing DAQV instrumentation to become part of the permanent application source. (A full accounting of DAQV performance can be found in [8].)
Finally, having completely arbitrary execution control and data access may seem desirable, but explicit consideration of where and what data is to be accessed ensures that the data is accessed at points where it is both scientifically and semantically meaningful.
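The idea of instrumentation that stays in the source but can be switched on and off at runtime can be sketched in a few lines. This is a hypothetical Python illustration, not DAQV code: `ProbeSite` and `simulate` are invented names, and the sketch makes no claim about DAQV's measured overhead. When the site is disabled, the cost is one call and one flag test.

```python
class ProbeSite:
    """Hypothetical instrumentation point left permanently in the source."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.enabled = False
        self.samples = []

    def __call__(self, value) -> None:
        if not self.enabled:  # disabled: one call plus one flag test, then return
            return
        self.samples.append(value)


def simulate(steps: int, site: ProbeSite, enable_at: int) -> int:
    """Toy computation with a selectively placed probe inside its main loop."""
    total = 0
    for step in range(steps):
        if step == enable_at:
            site.enabled = True  # flipped at runtime; nothing is recompiled
        total += step
        site(total)
    return total


if __name__ == "__main__":
    site = ProbeSite("running_total")
    print(simulate(5, site, enable_at=3), site.samples)
```

The probe appears only where the programmer chose to place it, which is the selectivity argued for above.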
The enhanced DAQV model and command set were designed in conjunction with TierraLab's interaction engine, so their user commands largely coincide. TierraLab provides the following:
As a DAQV client, the interaction engine communicates with
DAQV-annotated applications through a separate process called the
DAQV master. The master orchestrates the synchronization of
application processes and the collection of distributed data. Although
users can issue interaction requests
at any time, interaction with an application can only take place at
certain times, as governed by the placement of DAQV instrumentation. For
example, when a user issues a Probe request, the TierraLab
interaction engine calls the DAQV client library to send the request
to the appropriate application's DAQV master process. The DAQV master
puts the request into a queue, but does not process the request until
the application reaches a DAQV_PROBE statement listing
the requested variable as one of its arguments. Some interaction
commands, such as GetStatus, only require communication with the
DAQV master process and therefore return immediately.
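The deferred handling of probe requests described above can be modeled with a small queue. The sketch below is a hypothetical Python stand-in and does not use the DAQV library or its API: the `Master` class, its method names, and the variable names `velocity` and `travel_time` are assumptions made for illustration.

```python
from collections import deque


class Master:
    """Hypothetical stand-in for a master process that defers probe requests."""

    def __init__(self) -> None:
        self.pending = deque()
        self.delivered = []

    def request_probe(self, variable: str) -> None:
        """A client may ask at any time; the request is only queued."""
        self.pending.append(variable)

    def get_status(self) -> dict:
        """Needs no application involvement, so it answers immediately."""
        return {"pending": len(self.pending)}

    def probe_point(self, **variables) -> None:
        """Called when the application reaches an instrumented probe site."""
        still_waiting = deque()
        while self.pending:
            name = self.pending.popleft()
            if name in variables:  # this site lists the requested variable
                self.delivered.append((name, variables[name]))
            else:                  # keep waiting for a site that lists it
                still_waiting.append(name)
        self.pending = still_waiting


if __name__ == "__main__":
    master = Master()
    master.request_probe("velocity")
    master.probe_point(travel_time=1.5)      # does not list "velocity"
    print(master.get_status())
    master.probe_point(velocity=[1.0, 2.0])  # request is serviced here
    print(master.delivered)
```

A request stays queued across probe sites that do not list its variable, so data is only ever read at points the programmer marked as meaningful.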
In TierraLab, the DAQV interaction engine remains idle until it
receives an interaction request from the tierra MEX
routine via a Nexus RSR (Figure 1, step 4). The request specifies
which command to perform and provides any arguments that are required
for processing it. After examining the request message, the DAQV
engine calls the appropriate method for handling the request. There is
exactly one method, inherited from the interaction engine abstract
base class, that implements each interaction command. The DAQV
interaction engine implements these methods by making the appropriate
calls to the DAQV client library (step 5). Once the requested
interaction has taken place (step 6), the results are sent back to the
tierra MEX file via another Nexus RSR (step 7). The
interaction engine remains idle again until the next interaction
request is received.
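The rule that exactly one method implements each interaction command suggests a simple dispatch scheme, sketched here in Python. This is an illustration only, not the TierraLab source: the command names in `COMMANDS`, the request dictionary layout, and the `RecordingEngine` subclass are hypothetical.

```python
from abc import ABC, abstractmethod


class InteractionEngine(ABC):
    """Abstract base: one method per interaction command, plus the dispatcher."""

    COMMANDS = ("attach", "probe", "get_status")  # hypothetical command set

    def handle(self, request: dict):
        """Examine a request message and call the method that implements it."""
        command = request["command"]
        if command not in self.COMMANDS:
            raise ValueError(f"unsupported command: {command}")
        return getattr(self, command)(*request.get("args", ()))

    @abstractmethod
    def attach(self, url): ...

    @abstractmethod
    def probe(self, variable): ...

    @abstractmethod
    def get_status(self): ...


class RecordingEngine(InteractionEngine):
    """Hypothetical subclass; a real one would call a client library here."""

    def __init__(self) -> None:
        self.log = []

    def attach(self, url):
        self.log.append(("attach", url))
        return 0

    def probe(self, variable):
        self.log.append(("probe", variable))
        return [0.0]

    def get_status(self):
        self.log.append(("get_status",))
        return "idle"


if __name__ == "__main__":
    engine = RecordingEngine()
    print(engine.handle({"command": "attach", "args": ("example-url",)}))
    print(engine.handle({"command": "get_status"}))
```

Keeping the dispatcher in the base class means a different engine subclass inherits the same command set and only replaces the method bodies.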
The geoscientists have developed a large collection of MATLAB programs for post-mortem visualization and analysis of the data produced by their tomography code. They wanted to execute some of these programs online. For example, one of their programs produces a visualization of the geometry of the seismic experiment which could be used early in a run to spot fatal errors before wasting hours of computation.
Figure 2. Script
The first time our runtime interaction system was used, we provided
the scientist with some initial DAQV instrumentation, and then
supervised her as she took the necessary steps to produce a visualization of
the experimental geometry described above. She added
DAQV instrumentation to her tomography code, added tierra commands to
an existing MATLAB program, and then used our system to
run the script and display the figure online. Although this first
trial run encountered a few small problems, we found that
the scientist was able to learn and use our system in a remarkably
short period of time. Although she had no prior experience with, or knowledge
of, TierraLab, the entire process, starting from the instrumentation and instruction
we provided, took only 3 hours. Most of this time was spent in the
instrumentation phase. The scientist had little difficulty determining which
instrumentation calls to insert. However, determining where to put the
instrumentation, learning the arguments of the DAQV registration call, and
compiling the code to use DAQV were more problematic. These difficulties
rapidly decreased as she became more experienced.
The process of adding
interaction commands to her MATLAB program to read its input data
from the executing program took only about 15 minutes. After her MATLAB program was augmented
with interaction commands (excerpts shown in Figure 2), it was
executed, and produced the plot in Figure 3. Although our system
continues to evolve, our preliminary feedback is very encouraging.
Figure 3. Experimental Geometry of the Seismic Experiment
Since it is impossible for a single data analysis package to serve all the scientists' needs, we plan to expand our framework by supporting additional analysis engines such as Mathematica and IDL. Users will then be able to interact and share data among all these tools and their applications simultaneously. In principle, there is no limit to how many applications and tools can be coupled together using our framework. We also plan to build an instrumentation assistant and to improve the implementation of our interaction engine by making use of improved communications technologies, such as the Tulip runtime system [2], that require less copying of data. Additional information on our work can be found on our web site: <http://www.csi.uoregon.edu/>.
We would like to thank Chad Busche of the Department of Computer and Information Science, University of Oregon, for his contributions in providing invaluable DAQV software support. We would also like to thank Robert Dunn of the Department of Geological Sciences, University of Oregon, for contributing the seismic tomography source code.