TAU Monitoring Framework

Introduction

We have constructed a distributed program monitoring framework for parallel multithreaded applications. The Tuning and Analysis Utilities (TAU) Portable Profiling Package provides performance data for the running application. This framework provides access to TAU data while a profiled application is running. Access to the data is provided by a separate server thread running in the application context. Clients accessing TAU data can display the call stack of the running application and compute performance parameters, giving the user run-time feedback on the behavior and performance of the application.

TAU Parallel Profiling

TAU maintains a database of function information for every thread. For each profiled function a record is created. A vector of pointers to the records is maintained and may be traversed to find the desired record.

The fields of a record are:

Name              Name of the function
Type              Data type the function is acting on (template instantiation)
GroupName         Name of the group as specified by the user
FunctionId        ID for the function (index of function in local context)
NumCalls          Number of calls to this function 
NumSubrs          Number of calls from this function 
ExclTime          Time spent exclusive of functions called by this function
InclTime          Time spent including functions called by this function
SumExclSqr        Sum of squared exclusive times, for statistics

TAU maintains an image of the call stack for running functions. Whenever a profiled function is called, TAU adds an entry to the call stack data structure. The entry includes the function's start time, a pointer to its database entry, and a pointer to its parent profiler. When a profiled function exits, the information from the call stack data structure is used to update its entry in the function information database.

When a TAU-profiled application completes its run, all of the database entries have been updated and all the information from the database is accurate for the run. However, in order to get an accurate picture of the time spent in each function while the application is running, the call stack data structure must be traversed in order to update timing information for running functions.
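The bookkeeping described in this section can be sketched as follows. This is a minimal illustration, not TAU's actual code: the names `FunctionInfo`, `StackEntry`, `ThreadProfile`, `enter`, `leave`, and `runExample` are hypothetical, times are passed in explicitly instead of being read from a clock, and only some of the record fields are modeled. The exit rule (add the elapsed time to the function's own inclusive and exclusive time, subtract it from the parent's exclusive time) is the one described in this document.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical, simplified function record (a subset of the fields listed above).
struct FunctionInfo {
    std::string name;
    long numCalls = 0;      // calls to this function
    long numSubrs = 0;      // calls made from this function
    double exclTime = 0.0;  // time excluding callees
    double inclTime = 0.0;  // time including callees
};

// One call-stack entry: start time, database record, parent entry.
struct StackEntry {
    double startTime;
    FunctionInfo* function;
    StackEntry* parent;     // nullptr for the outermost function
};

class ThreadProfile {
public:
    // Called when a profiled function starts.
    void enter(FunctionInfo& fn, double now) {
        StackEntry* parent = stack_.empty() ? nullptr : &stack_.back();
        if (parent != nullptr) parent->function->numSubrs += 1;
        fn.numCalls += 1;
        // std::deque keeps pointers to existing elements valid on push_back.
        stack_.push_back(StackEntry{now, &fn, parent});
    }

    // Called when the innermost running function exits.
    void leave(double now) {
        assert(!stack_.empty());
        StackEntry entry = stack_.back();
        stack_.pop_back();
        double elapsed = now - entry.startTime;
        entry.function->inclTime += elapsed;
        entry.function->exclTime += elapsed;
        // The callee's inclusive time is not exclusive time of the parent.
        if (entry.parent != nullptr) entry.parent->function->exclTime -= elapsed;
    }

private:
    std::deque<StackEntry> stack_;
};

struct ExampleTotals {
    FunctionInfo mainInfo;
    FunctionInfo fooInfo;
};

// main runs from 0 to 10 and calls foo, which runs from 3 to 8.
ExampleTotals runExample() {
    ExampleTotals t;
    t.mainInfo.name = "main";
    t.fooInfo.name = "foo";
    ThreadProfile profile;
    profile.enter(t.mainInfo, 0.0);
    profile.enter(t.fooInfo, 3.0);
    profile.leave(8.0);
    profile.leave(10.0);
    return t;
}
```

Note that a record is only correct once its function has exited; between `enter` and `leave` the database lags behind, which is why a run-time monitor also needs the call stack.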

An Example

This example illustrates these issues. Suppose we have the following application with functions beginning and ending at the indicated times:

Function                 Time
========                 ====
main begin                0
 . . .
  foo begin               3
   . . .
  foo end                 8
 . . .
main end                 10

Initially, the database entries would look like this:

Name:            main
Inclusive time:  0
Exclusive time:  0

Name:            foo
Inclusive time:  0
Exclusive time:  0

An accurate picture of the system at time = 5 would be:

Name:            main
Inclusive time:  5
Exclusive time:  3

Name:            foo
Inclusive time:  2
Exclusive time:  2

but the database entries would be unchanged from their initial state. The call stack data structure would look like this:

Name:        foo
Start Time:  3
Parent:      main

Name:        main
Start Time:  0
Parent:      none

At time = 9 the database will have these entries:

Name:            main
Inclusive time:  0
Exclusive time:  -5

Name:            foo
Inclusive time:  5
Exclusive time:  5

The reason is that foo has exited, so its information was updated. At the same time, it updated the exclusive time for its parent routine, main, by subtracting its inclusive time. The call stack data structure looks like this:

Name:        main
Start Time:  0
Parent:      none

In both these cases, the database, the call stack data structure, and the current system time can be used to provide an accurate snapshot of the running application. The run-time monitor will do this. Since the monitor will be accessing data that the application can update at any time, the database and the call stack data structure will be locked while the monitor is accessing the data. This will prevent inaccurate information from being returned by the monitor.
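One way to combine the three inputs is sketched below. It is an illustration under stated assumptions, not the monitor's actual code: `Times`, `ActiveCall`, `Database`, and `snapshot` are hypothetical names, records are keyed by function name for brevity, locking is left out, and recursive calls (where a function appears on the stack more than once) are not handled. For every function still on the call stack, the time elapsed since its start is applied exactly as it would be if the function exited now.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical timing record stored in the database.
struct Times {
    double incl = 0.0;
    double excl = 0.0;
};

// Hypothetical call-stack entry; parent is empty for the outermost function.
struct ActiveCall {
    std::string name;
    double startTime;
    std::string parent;
};

using Database = std::map<std::string, Times>;

// Returns a corrected copy of the database as of time `now`.
// The stored database itself is left unchanged.
Database snapshot(const Database& db,
                  const std::vector<ActiveCall>& stack,
                  double now) {
    Database result = db;
    for (const ActiveCall& call : stack) {
        assert(now >= call.startTime);
        double elapsed = now - call.startTime;
        // Credit the running function as if it exited at `now` ...
        result[call.name].incl += elapsed;
        result[call.name].excl += elapsed;
        // ... and remove that time from its parent's exclusive time.
        if (!call.parent.empty()) result[call.parent].excl -= elapsed;
    }
    return result;
}
```

Applied to the example, the snapshot at time = 5 (both functions on the stack, database still all zeros) gives main an inclusive time of 5 and an exclusive time of 3, and foo an inclusive and exclusive time of 2. At time = 9 (only main on the stack, main's stored exclusive time at -5) it gives main an inclusive time of 9 and an exclusive time of 4, while foo keeps its stored values of 5 and 5.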

Monitoring API

Initial Functionality

Initially we shall support facilities to

Client-Server Workload Considerations

There are two basic options for refining the desired data from the full data available: 1) the client receives all the data and discards everything that it doesn't need, or 2) the server gathers only the necessary data. The advantage of having the client filter the data is that the server is easier to implement. The disadvantage is that more data must travel between the client and the server. Given the low volume of data transferred, we have decided to do most data filtering tasks on the client.

Client API

The performance monitor client has access to all performance data for all functions in all threads, contexts, and nodes. A monitoring application can process the data in order to display what is desired. In order to avoid race conditions when multiple clients access a single server, the client locks the server before it requests data and releases the lock when it has received the data.

Server API

The server API handles client requests. It gathers the data requested, packs it into a data structure, and sends the data structure to the client. While gathering data, the server must block data updates in order to prevent collisions with profiling in the running application.
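The requirement to block updates while gathering can be illustrated with a mutex shared by the profiling code and the server thread. The sketch below is an assumption about how such locking might look, not the framework's actual server: `ProfileStore`, `FunctionRecord`, `registerFunction`, `addTime`, and `gather` are hypothetical names, and the "packed" reply is simply a copy of the records.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical record sent to the client.
struct FunctionRecord {
    std::string name;
    double exclTime;
    double inclTime;
};

class ProfileStore {
public:
    // Called by the profiled application; returns the record's index.
    std::size_t registerFunction(const std::string& name) {
        std::lock_guard<std::mutex> lock(mutex_);
        records_.push_back(FunctionRecord{name, 0.0, 0.0});
        return records_.size() - 1;
    }

    // Called by the profiled application when a function exits.
    void addTime(std::size_t id, double incl, double excl) {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(id < records_.size());
        records_[id].inclTime += incl;
        records_[id].exclTime += excl;
    }

    // Called by the server thread: updates are blocked only while
    // the records are copied, then the copy is sent to the client.
    std::vector<FunctionRecord> gather() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return records_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<FunctionRecord> records_;
};
```

Copying under the lock and doing any further packing or network transfer afterwards keeps the time during which the application is blocked short, which matters if the monitor is to have minimal impact on the running program.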

Monitor Example and User API

An initial implementation of the package will both illustrate and take advantage of the client-server interface. An application providing hierarchical access to performance data is one possibility. The user will be presented with an overview of the running application, for example a list or image of all nodes, contexts, and threads. Selecting a thread, the user will be presented with a list of functions associated with that thread and for which there is profiling data. After selecting a specific function, the user will then be able to view any and all data for that function.

As an alternative, a user will be presented with a list of all functions, select a function, and then access data about that function across threads, contexts, and nodes.
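Both views amount to client-side filtering of the same full data set. The following sketch shows one possible way a client could derive them; the `Sample` layout and the function names `samplesForThread` and `samplesByFunction` are hypothetical and not part of the framework.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical flattened record as received by the client.
struct Sample {
    int node;
    int context;
    int thread;
    std::string function;
    double inclTime;
};

// Hierarchical view: all functions profiled in one selected thread.
std::vector<Sample> samplesForThread(const std::vector<Sample>& all,
                                     int node, int context, int thread) {
    assert(node >= 0 && context >= 0 && thread >= 0);
    std::vector<Sample> out;
    for (const Sample& s : all) {
        if (s.node == node && s.context == context && s.thread == thread) {
            out.push_back(s);
        }
    }
    return out;
}

// Alternative view: one function across all threads, contexts, and nodes.
std::map<std::string, std::vector<Sample>> samplesByFunction(
        const std::vector<Sample>& all) {
    std::map<std::string, std::vector<Sample>> out;
    for (const Sample& s : all) {
        out[s.function].push_back(s);
    }
    return out;
}
```

Because the client already holds all records, switching between the two views requires no additional request to the server.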

Implementation

Purpose

The Tau Parallel Profiling Library gathers statistics from code as it runs. These statistics are maintained as a vector of pointers to function database data structures and a stack that maintains information about currently running functions. Tau outputs this data at the end of the run. A variety of tools are available to analyze this data. With the Tau Monitoring Framework, we provide access to the Tau profiling data during runtime and with minimal impact on the running program.

Implementation

We have adopted a client-server strategy to provide user access to the Tau profiling data. The server runs in the same context as the profiled application and accesses the Tau data directly. The client, which provides a user interface to the data, accesses the data from the server.

We initially used HPC++ as an object-oriented means of providing access to the runtime data. To keep further with the object-oriented paradigm, we implemented the client and server as objects derived from a common parent. This allowed us to encapsulate the data derived from the raw Tau data and hide nearly all of the HPC++ layer of the framework. In addition, since the client and server had many data members in common (for both data derived from Tau and data required by HPC++), deriving the client and server from a common parent saved many lines of code.

Because of Java's broad support, we changed the implementation to use Java. Using the Java Native Interface (JNI), the server spawns a JVM which serves as an interface for clients. Using Java RMI, Java-based clients connect to the server, call data request functions, and receive data. When the server handles an RMI data request function call, it in turn calls a native function using JNI and receives data back from the native code, which it then returns to the client.

A Note about Scope with HPC++

Encapsulating HPC++ initialization and function registration within the object presents a minor challenge. The scope of HPC++ access is global. The first object created poses no problem, but if subsequent objects try to repeat the initialization (as happens when multiple client objects run within one context), then errors occur.

To solve these problems, a global variable is used to track the number of objects created. With the creation of the first object, the HPC++ startup housekeeping will be done. On subsequent creations, it will be skipped. Likewise, the shutdown housekeeping will only be done when the last object is deleted.
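The counting scheme can be sketched as below. This is only an illustration of the idea: `MonitorBase`, `runtimeStartup`, and `runtimeShutdown` are hypothetical stand-ins (the real HPC++ initialization calls are not shown here), the two call counters exist purely so the behavior can be observed, and the sketch is not thread-safe.

```cpp
#include <cassert>

// Global count of live monitor objects, as described above.
int g_liveObjects = 0;

// Counters used only to observe how often the housekeeping runs.
int g_startupCalls = 0;
int g_shutdownCalls = 0;

// Hypothetical placeholders for the global startup and shutdown housekeeping.
void runtimeStartup() { ++g_startupCalls; }
void runtimeShutdown() { ++g_shutdownCalls; }

// Hypothetical common parent of the client and server objects.
class MonitorBase {
public:
    MonitorBase() {
        // Only the first object performs the global startup.
        if (g_liveObjects++ == 0) runtimeStartup();
    }

    virtual ~MonitorBase() {
        assert(g_liveObjects > 0);
        // Only the last object performs the global shutdown.
        if (--g_liveObjects == 0) runtimeShutdown();
    }

    // Copying would bypass the counting constructor, so it is disabled.
    MonitorBase(const MonitorBase&) = delete;
    MonitorBase& operator=(const MonitorBase&) = delete;
};
```

With this arrangement, creating two objects in one context runs the startup once, and the shutdown runs only after both have been destroyed.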

Current Status: Implementation

The framework is currently implemented and a monitoring application is working with a text-based interface. A graphical interface that was used with the HPC++ version is being modified to work with the Java version of the framework.

To Do

The next step is to design and implement a flexible and powerful monitor application that will display performance data in a clear and usable manner.
