TAU Monitoring Framework



We have constructed a distributed program monitoring framework for parallel multithreaded applications. The Tuning and Analysis Utilities (TAU) Portable Profiling Package provides performance data for the running application. This framework provides access to TAU data while a profiled application is running. Access to the data is provided by a separate server thread running in the application context. Clients accessing TAU data can display the call stack of the running application as well as compute performance parameters, giving a user runtime feedback on the behavior and performance of their application.

TAU Parallel Profiling

TAU maintains a database of function information for every thread. For each profiled function a record is created. A vector of pointers to the records is maintained and may be traversed to find the desired record.

The fields of a record are:

Name              Name of the function
Type              Data type the function is acting on (template instantiation)
GroupName         Name of the group as specified by the user
FunctionId        ID for the function (index of function in local context)
NumCalls          Number of calls to this function 
NumSubrs          Number of calls from this function 
ExclTime          Time spent exclusive of functions called by this function
InclTime          Time spent including functions called by this function
SumExclSqr        Square of exclusive time for statistics
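The record just described can be sketched as a plain struct. This is a simplified, hypothetical rendering; the actual TAU class differs in detail, but the fields mirror the list above:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of a per-function profiling record.
struct FunctionRecord {
    std::string name;        // Name of the function
    std::string type;        // Data type (template instantiation)
    std::string groupName;   // User-specified group name
    long functionId = 0;     // Index of the function in the local context
    long numCalls = 0;       // Number of calls to this function
    long numSubrs = 0;       // Number of calls made from this function
    double exclTime = 0.0;   // Time exclusive of callees
    double inclTime = 0.0;   // Time including callees
    double sumExclSqr = 0.0; // Sum of squared exclusive times, for statistics
};

// Per-thread database: a vector of pointers that can be traversed
// to find the desired record.
using FunctionDB = std::vector<FunctionRecord*>;

FunctionRecord* findByName(const FunctionDB& db, const std::string& name) {
    for (FunctionRecord* r : db)
        if (r->name == name) return r;
    return nullptr;
}
```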
TAU maintains an image of the call stack for running functions. Whenever a profiled function is called, TAU adds an entry to the call-stack data structure. The entry includes the function's start time, a pointer to its database entry, and a pointer to its parent profiler. When a profiled function exits, the information from the call stack data structure is used to update its entry in the function information database.

When a TAU-profiled application completes its run, all of the database entries have been updated and all the information from the database is accurate for the run. However, in order to get an accurate picture of the time spent in each function while the application is running, the call stack data structure must be traversed in order to update timing information for running functions.
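The correction for running functions can be sketched as follows, using simplified, hypothetical stand-ins for TAU's structures. Each stack entry holds a start time and a pointer to its database record; a snapshot adds the elapsed time of each still-running function to the inclusive time stored in its record:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified, hypothetical versions of the structures described above.
struct Record {
    std::string name;
    double inclTime = 0.0;  // updated only when the function exits
    double exclTime = 0.0;
};

struct StackEntry {
    double startTime;  // when the function was entered
    Record* dbEntry;   // its record in the function database
    // (the real entry also points to the parent profiler)
};

// Walk the call stack and return corrected inclusive times for the
// functions still running at time `now`.
std::vector<double> snapshotInclusive(const std::vector<StackEntry>& stack,
                                      double now) {
    std::vector<double> incl;
    for (const StackEntry& e : stack)
        incl.push_back(e.dbEntry->inclTime + (now - e.startTime));
    return incl;
}
```

For the example below (main entered at 0, foo at 3), a snapshot at time 5 yields inclusive times of 5 for main and 2 for foo even though their database entries still read 0.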

An Example

This example illustrates these issues. Suppose we have the following application with functions starting at the indicated times:

Function                 Time
========                 ====
main begin                0
 . . .
  foo begin               3
   . . .
  foo end                 8
 . . .
main end                 10
Initially, the database entries would look like this:

Name:            main
Inclusive time:  0
Exclusive time:  0

Name:            foo
Inclusive time:  0
Exclusive time:  0
An accurate picture of the system at time = 5 would be:

Name:            main
Inclusive time:  5
Exclusive time:  3

Name:            foo
Inclusive time:  2
Exclusive time:  2
but the database entries would be unchanged from their initial state. The call stack data structure would look like this:

Name:        foo
Start Time:  3
Parent:      main

Name:        main
Start Time:  0
Parent:      none
At time = 9 the database will have these entries:

Name:            main
Inclusive time:  0
Exclusive time:  -5

Name:            foo
Inclusive time:  5
Exclusive time:  5
The reason is that foo has exited, so its information was updated. At the same time, it updated the exclusive time for its parent routine, main, by subtracting its inclusive time. The call stack data structure looks like this:

Name:        main
Start Time:  0
Parent:      none
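The bookkeeping that produces these numbers can be reproduced in a few lines (a sketch with hypothetical names, not TAU's actual code): on exit, a function adds its elapsed time to its own inclusive and exclusive totals, and subtracts that elapsed time from its parent's exclusive total; the parent recovers it when the parent itself exits.

```cpp
#include <cassert>
#include <string>

// Minimal stand-in for a database record.
struct Rec { std::string name; double incl = 0, excl = 0; };

// Exit-time update: credit this function's elapsed time to its own
// record and debit the parent's exclusive time by the same amount.
void onExit(Rec& self, Rec* parent, double start, double stop) {
    double elapsed = stop - start;
    self.incl += elapsed;
    self.excl += elapsed;              // no profiled children in this sketch
    if (parent) parent->excl -= elapsed;
}
```

Applying this to the example, foo (entered at 3, exited at 8) ends with inclusive and exclusive times of 5, while main's exclusive time drops to -5 until main itself exits.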
In both these cases, the database, the call stack data structure, and the current system time can be used to provide an accurate snapshot of the running application. The run-time monitor will do this. Since the monitor will be accessing data that the application can update at any time, the database and the call stack data structure will be locked while the monitor is accessing the data. This will prevent inaccurate information from being returned by the monitor.

Monitoring API

Initial Functionality

Initially we shall support facilities to

Client-Server Workload Considerations

There are two basic options for refining the desired data from the full data available: 1) the client receives all the data and discards what it does not need, or 2) the server gathers only the necessary data. The advantage of having the client filter the data is that the server is easier to implement; the disadvantage is that more data must travel between the client and the server. Given the low volume of data transferred, we have decided to do most data-filtering tasks on the client.

Client API

The performance monitor client has access to all performance data for all functions in all threads, contexts, and nodes. A monitoring application can process the data in order to display what is desired. In order to avoid race conditions when multiple clients access a single server, the client locks the server before it requests data and releases the lock when it has received the data.
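A minimal sketch of this lock-request-release protocol, with purely illustrative names and an in-memory stand-in for the server:

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical client-side sketch: lock the server, request the data,
// release the lock. Names are illustrative, not TAU's actual API.
class MonitorClient {
    bool locked_ = false;
    std::vector<double> serverData_ = {1.0, 2.0, 3.0};  // stand-in for server
public:
    void lockServer()   { locked_ = true; }
    void unlockServer() { locked_ = false; }

    std::vector<double> requestData() {
        if (!locked_)  // the real protocol serializes concurrent clients
            throw std::runtime_error("must hold server lock");
        return serverData_;
    }

    // The full protocol: no other client can interleave with this read.
    std::vector<double> fetch() {
        lockServer();
        std::vector<double> d = requestData();
        unlockServer();
        return d;
    }
};
```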

Server API

The server API handles client requests. It gathers the data requested, packs it into a data structure, and sends the data structure to the client. While gathering data, the server must block data updates in order to prevent collisions with profiling in the running application.
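A server-side sketch under the same assumptions (hypothetical names; the lock stands in for whatever mutual exclusion the profiler's update path also takes):

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Flat structure the server packs gathered data into for transmission.
struct Packed {
    std::vector<std::string> names;
    std::vector<double> incl;
};

struct Server {
    std::mutex dbLock;  // shared with the profiler's update path
    std::vector<std::pair<std::string, double>> db;  // name, inclusive time

    // Gather the requested data while holding the lock, so profiling
    // updates cannot collide with the read.
    Packed handleRequest() {
        std::lock_guard<std::mutex> guard(dbLock);
        Packed p;
        for (const auto& e : db) {
            p.names.push_back(e.first);
            p.incl.push_back(e.second);
        }
        return p;
    }
};
```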

Monitor Example and User API

An initial implementation of the package will both illustrate and take advantage of the client-server interface. An application providing hierarchical access to performance data is one possibility. The user will be presented with an overview of the running application, for example a list or image of all nodes, contexts, and threads. After selecting a thread, the user will be presented with a list of functions associated with that thread for which there is profiling data. After selecting a specific function, the user will then be able to view any and all data for that function.

As an alternative, the user could be presented with a list of all functions, select one, and then access data about that function across threads, contexts, and nodes.



The TAU Parallel Profiling Library gathers statistics from code as it runs. These statistics are maintained as a vector of pointers to function database data structures and a stack that maintains information about currently running functions. TAU outputs this data at the end of the run, and a variety of tools are available to analyze it. With the TAU Monitoring Framework, we provide access to the TAU profiling data at runtime, with minimal impact on the running program.


We have adopted a client-server strategy to provide user access to the TAU profiling data. The server runs in the same context as the profiled application and accesses the TAU data directly. The client, which provides a user interface to the data, accesses the data from the server.

We initially used HPC++ as an object-oriented means of providing access to the runtime data. To keep with an object-oriented paradigm, we implemented the client and server as objects derived from a common parent. This allowed us to encapsulate the data derived from the raw TAU data and hide nearly all of the HPC++ layer of the framework. In addition, since the client and server had many data members in common (both data derived from TAU and data required by HPC++), deriving the client and server from a common parent saved many lines of code.

Because of Java's broad support, we changed the implementation to use Java. Using the Java Native Interface (JNI), the server spawns a JVM which serves as an interface for clients. Using Java RMI, Java-based clients connect to the server, call data-request functions, and receive data. When the server handles an RMI data request, it in turn calls a native function using JNI and receives data back from the native code, which it then returns to the client.

A Note about Scope with HPC++

Encapsulating HPC++ initialization and function registration within the object presents a minor challenge. The scope of HPC++ access is global. The first object created poses no problem, but if subsequent objects try to repeat the initialization (as happens when multiple client objects run within one context), errors occur.

To solve this problem, a global variable is used to track the number of objects created. When the first object is created, the HPC++ startup housekeeping is done; on subsequent creations, it is skipped. Likewise, the shutdown housekeeping is done only when the last object is deleted.
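The reference-counting scheme can be sketched as follows. The counters here exist only to make the behavior observable in this example; the actual HPC++ startup and shutdown housekeeping would go where the comments indicate:

```cpp
#include <cassert>

// Global count of live monitor objects; startup/shutdown housekeeping
// runs only on the 0 -> 1 and 1 -> 0 transitions.
static int g_objectCount = 0;
static int g_startups = 0;   // instrumentation for this example only
static int g_shutdowns = 0;  // instrumentation for this example only

struct MonitorObject {
    MonitorObject() {
        if (g_objectCount++ == 0)
            ++g_startups;    // HPC++ startup housekeeping would go here
    }
    ~MonitorObject() {
        if (--g_objectCount == 0)
            ++g_shutdowns;   // HPC++ shutdown housekeeping would go here
    }
};
```

With this scheme, creating a second client object within the same context skips initialization, and shutdown runs only when the last object is destroyed.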

Current Status: Implementation

The framework is currently implemented, and a monitoring application is working with a text-based interface. A graphical interface that was used with the HPC++ version is being modified to work with the Java version of the framework.

To Do

The next step is to design and implement a flexible and powerful monitor application that will display performance data in a clear and usable manner.