When performance measurements are made of program operation, actual
execution behavior can be perturbed. In general, the degree of perturbation
depends on the intrusiveness and frequency of the instrumentation. If the
perturbation effects of the instrumentation cannot be quantified by a perturbation
model (and subsequently removed during perturbation analysis), detailed
performance measurements could be inaccurate. Developing models of time
and event perturbations that can recover actual execution performance from
perturbed performance measurements is the topic of this paper. Time-based
models can accurately capture execution time perturbations for sequential
computations and concurrent computations with simple fork-join behavior.
However, the performance of parallel computations generally depends on the
relative ordering of dependent events and the assignment of computational
resources. Event-based models must be used to quantify instrumentation
perturbation in parallel performance measurements. The measurement and
subsequent analysis of synchronization operations (e.g., barrier, semaphore,
and advance/await synchronization) can produce accurate approximations to
actual performance behavior. Unfortunately, event-based models are limited in
their ability to fully capture perturbation effects in nondeterministic executions.
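To make the distinction concrete, the following minimal sketch contrasts a time-based correction with an event-based one for a single barrier. It assumes a fixed per-event instrumentation overhead and that each thread's measured arrival time and instrumented event count are known. The function names and the numbers are hypothetical and are not taken from the paper.

```python
def time_based_correction(measured_time, event_count, overhead_per_event):
    """Remove instrumentation overhead from one thread's measured time."""
    return measured_time - event_count * overhead_per_event


def approximate_barrier_exit(arrivals, overhead_per_event):
    """Approximate the unperturbed barrier exit time.

    arrivals: list of (measured_arrival_time, events_before_arrival),
    one entry per thread. Each arrival is corrected independently; the
    barrier is taken to release when the last corrected arrival occurs.
    """
    corrected = [
        time_based_correction(t, n, overhead_per_event) for t, n in arrivals
    ]
    return max(corrected), corrected


# Hypothetical measurements: thread 0 is heavily instrumented, thread 1 is not.
arrivals = [(120.0, 30), (110.0, 5)]
exit_time, corrected = approximate_barrier_exit(arrivals, overhead_per_event=1.0)
print(corrected)   # [90.0, 105.0]
print(exit_time)   # 105.0
```

In the measured run thread 0 arrives last, whereas after correction thread 1 does. The identity of the last arriver changes, which is why the ordering of dependent events has to be modeled rather than only per-thread time.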
The relative simplicity and design of the Fortran 77 language allowed for reasonable interoperability with C and C++. Fortran 90, on the other hand, introduces several new and complex features to the language that severely degrade the ability of a mixed Fortran and C++ development environment. Major new items added to Fortran are user-defined types, pointers, and several new array features. Each of these items introduces difficulties because the Fortran 90 procedure calling convention was not designed with interoperability as an important design goal. For example, Fortran 90 arrays are passed by array descriptor, which is not specified by the language and therefore depends on a particular compiler implementation. This paper describes a set of software tools that parses Fortran 90 source code and produces mediating interface functions which allow access to Fortran 90 libraries from C++.
The use of a cluster for distributed performance analysis of parallel trace
data is discussed. We propose an analysis architecture that uses multiple
cluster nodes as a server to execute analysis operations in parallel and
communicate to remote clients where performance visualization and user
interactions occur. The client-server system developed, VNG, is highly
configurable and is shown to perform well for traces of large size, when
compared to leading trace visualization systems.
The effect of the operating system on application performance is an increasingly important consideration in high performance computing. OS kernel measurement is key to understanding the performance influences and the interrelationship of system and user-level performance factors. The KTAU (Kernel TAU) methodology and Linux-based framework provide parallel kernel performance measurement from both a kernel-wide and process-centric perspective. The first characterizes overall aggregate kernel performance for the entire system. The second characterizes kernel performance when it runs in the context of a particular process. KTAU extends the TAU performance system with kernel-level monitoring, while leveraging TAU's measurement and analysis capabilities. We explain the rationale and motivations behind our approach, describe the KTAU design and implementation, and show working examples on multiple platforms demonstrating the versatility of KTAU in integrated system/application monitoring.
The Common Component Architecture (CCA) is a
component-based methodology for developing scientific simulation
codes. This architecture consists of a framework which
enables components (embodiments of numerical algorithms
and physical models) to work together. Components publish
their interfaces and use interfaces published by others. Components
publishing the same interface and with the same functionality
(but perhaps implemented via a different algorithm
or data structure) may be transparently substituted for each
other in a code or a component assembly. Components are
compiled into shared libraries and are loaded in, instantiated
and composed into a useful code at runtime. Details regarding
CCA can be found in [1], [2]. An analysis of the process of
decomposing a legacy simulation code and re-synthesizing it
as components can be found in [3], [4]. Actual scientific results
obtained from this toolkit can be found in [5], [6].
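The idea of substituting components that publish the same interface can be illustrated with a small sketch. It is not the CCA API: Python abstract classes stand in for published interfaces, plain object construction stands in for runtime loading of shared libraries, and all class names are hypothetical.

```python
from abc import ABC, abstractmethod


class IntegratorPort(ABC):
    """A published interface that components may provide or use."""

    @abstractmethod
    def integrate(self, f, a, b, n):
        """Approximate the integral of f over [a, b] using n subintervals."""


class MidpointIntegrator(IntegratorPort):
    def integrate(self, f, a, b, n):
        h = (b - a) / n
        return h * sum(f(a + (i + 0.5) * h) for i in range(n))


class TrapezoidIntegrator(IntegratorPort):
    def integrate(self, f, a, b, n):
        h = (b - a) / n
        inner = sum(f(a + i * h) for i in range(1, n))
        return h * (0.5 * f(a) + inner + 0.5 * f(b))


class Driver:
    """Uses the integrator interface without knowing which component provides it."""

    def __init__(self, integrator: IntegratorPort):
        self.integrator = integrator

    def run(self):
        return self.integrator.integrate(lambda x: x * x, 0.0, 1.0, 100)


# Either provider can be wired to the driver without changing the driver.
results = {}
for provider in (MidpointIntegrator(), TrapezoidIntegrator()):
    results[type(provider).__name__] = Driver(provider).run()
    print(type(provider).__name__, round(results[type(provider).__name__], 3))
```

Both providers implement the same functionality with a different algorithm, so the driver's result agrees to within the accuracy of each method.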
In this paper, we discuss TAU (Tuning and Analysis Utilities), a
first prototype for an integrated and portable program analysis
environment for pC++, a parallel object-oriented language system. TAU
is integrated with the pC++ system in that it relies heavily on
compiler and transformation tools (specifically, the Sage++ toolkit)
for its implementation. This paper describes the design and
functionality of TAU and shows its application in practice.
The realization of parallel language systems that offer high-level
programming paradigms to reduce the complexity of application
development, scalable runtime mechanisms to support variable size
problem sets, and portable compiler platforms to provide access to
multiple parallel architectures, places additional demands on the
tools for program development and analysis. The need for integration
of these tools into a comprehensive programming environment is even
more pronounced and will require more sophisticated use of the
language system technology (i.e., compiler and runtime
system). Furthermore, the environment requirements of high-level
support for the programmer, large-scale applications, and portable
access to diverse machines also apply to the program analysis tools.
The TAU performance system is an integrated performance instrumentation, measurement, and analysis toolkit offering support for profiling and tracing modes of measurement. This paper introduces memory introspection capabilities of TAU featured on the Cray XT3 Catamount compute node kernel. TAU supports examining the memory headroom, or the amount of heap memory available, at routine entry, and correlates it to the program's call stack as an atomic event.
The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems will depend on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. The TAU system is offered as an example framework that meets these requirements. With a flexible, modular instrumentation and measurement system, and an open performance data and analysis environment, TAU can target a range of complex performance scenarios. Examples are given showing the diversity of TAU application.
A common complaint when dealing with the performance of computationally intensive
scientific applications on parallel computers is that programs exist to predict the
performance of radar systems, missiles and artillery shells, drugs, etc., but no one knows
how to predict the performance of these applications on a parallel computer. Actually, that
is not quite true. A more accurate statement is that no one knows how to predict the
performance of these applications on a parallel computer in a reasonable amount of time.
PENVELOPE is an attempt to remedy this situation. It is an extension to Amdahl's Law and
Gustafson's work on scaled speedup that takes into account the cost of interprocessor
communication and operating system overhead, yet is simple enough that it was
implemented as an Excel spreadsheet.
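For reference, the two classical results that PENVELOPE builds on can be written in a few lines. The sketch below gives the standard forms of Amdahl's Law and Gustafson's scaled speedup, followed by one illustrative way to add an overhead term. The overhead variant is an assumption made here for illustration only; it is not PENVELOPE's actual model, which the abstract does not spell out.

```python
def amdahl_speedup(serial_fraction, processors):
    """Amdahl's Law: fixed problem size, serial fraction of the one-processor run."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def gustafson_speedup(serial_fraction, processors):
    """Gustafson's scaled speedup: serial fraction of the parallel run."""
    return serial_fraction + (1.0 - serial_fraction) * processors


def amdahl_with_overhead(serial_fraction, processors, overhead_fraction):
    """Hypothetical variant: communication and OS overhead added to the
    parallel run time, expressed as a fraction of the one-processor run time."""
    parallel_time = (
        serial_fraction
        + (1.0 - serial_fraction) / processors
        + overhead_fraction
    )
    return 1.0 / parallel_time


print(round(amdahl_speedup(0.1, 16), 3))              # 6.4
print(round(gustafson_speedup(0.1, 16), 3))           # 14.5
print(round(amdahl_with_overhead(0.1, 16, 0.05), 3))  # 4.848
```

Even a small overhead fraction lowers the predicted speedup noticeably, which is the effect a model of this kind is meant to expose.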
Performance profiling of MPI programs generates overhead during
execution that introduces error in profile measurements. It is possible to track and
remove overhead online, but it is necessary to communicate execution delay
between processes to correctly adjust their interdependent timing. We demonstrate
the first implementation of an online measurement overhead compensation system
for profiling MPI programs. This is implemented in the TAU performance
system. It requires novel techniques for delay communication in the use of MPI.
The ability to reduce measurement error is demonstrated for problematic test
cases and real applications.
A scalable approach to performance analysis of MPI applications is
presented that includes automated source code instrumentation, low overhead
generation of profile and trace data, and database management of performance
data. In addition, tools are described that analyze large-scale parallel profile and
trace data. Analysis of trace data is done using an automated pattern-matching
approach. Examples of using the tools on large-scale MPI applications are
presented.
This article discusses approaches to implementing object-independent
event trace monitoring and analysis systems. The term
object-independent means that the system can be used for the analysis
of arbitrary (non-sequential) computer systems, operating systems,
programming languages and applications. Three main topics are
addressed: object-independent monitoring, standardization of event
trace formats and access interfaces and the application-independent
but problem-oriented implementation of analysis and visualization
tools. Based on these approaches, the distributed hardware monitor
system ZM4 and the SIMPLE event trace analysis environment were
implemented, and have been used in many 'real-world' applications
throughout the last three years. An overview of the projects in which
the ZM4/SIMPLE tools were used is given in the last section.
Programming non-sequential computer systems is hard! Many tools and
environments have been designed and implemented to ease the use and
programming of such systems. The majority of the analysis tools is
event-based and uses event traces for representing the dynamic
behavior of the system under investigation, the object system. Most
tools can only be used for one special object system, or a specific
class of systems such as distributed shared memory machines. This
limitation is not obvious because all tools provide the same basic
functionality.
In this paper, we discuss TAU (Tuning and Analysis Utilities), the
first prototype of an integrated and portable program analysis
environment for pC++, a parallel object-oriented language system. TAU
is unique in that it was developed specifically for pC++ and relies
heavily on pC++'s compiler and transformation tools (specifically, the
Sage++ toolkit) for its implementation. This tight integration allows
TAU to achieve a combination of portability, functionality, and
usability not commonly found in high-level language environments. The
paper describes the design and functionality of TAU, using a new tool
for breakpoint-based program analysis as an example of TAU's
capabilities.
We report on our experiences in building a computational environment for tomographic image analysis for marine seismologists studying the structure and evolution of mid-ocean ridge volcanism. The computational environment is determined by an evolving set of requirements for this problem domain and includes needs for high-performance parallel computing, large data analysis, model visualization, and computation interaction and control. Although these needs are not unique in scientific computing, the integration of techniques for seismic tomography with tools for parallel computing and data analysis into a computational environment was (and continues to be) an interesting, important learning experience for researchers in both disciplines. For the geologists, the use of the environment led to fundamental geologic discoveries on the East Pacific Rise, the improvement of parallel ray tracing algorithms, and a better regard for the use of computational steering in aiding model convergence. The computer scientists received valuable feedback on the use of programming, analysis, and visualization tools in the environment. In particular, the tools for parallel program data query (DAQV) and visualization programming (Viz) were demonstrated to be highly adaptable to the problem domain. We discuss the requirements and the components of the environment in detail. Both accomplishments and limitations of our work are presented.
This paper presents the design, implementation, and application of ParaProf, a
portable, extensible, and scalable tool for parallel performance profile analysis.
ParaProf attempts to offer ``best of breed'' capabilities to performance analysts --
those inherited from a rich history of single processor profilers and those being
pioneered in parallel tools research. We present ParaProf as a parallel profile
analysis framework that can be retargeted and extended as required.
ParaProf's design and operation are discussed, and its novel support for
large-scale parallel analysis is demonstrated with a 512-processor application
profile generated using the TAU performance system.
Measurement-based profiling introduces intrusion in program execution. Intrusion effects
can be mitigated by compensating for measurement overhead. Techniques for compensation
analysis in performance profiling are presented and their implementation in the TAU
performance system described. Experimental results on the NAS parallel benchmarks
demonstrate that overhead compensation can be effective in improving the accuracy of
performance profiling.
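A minimal sketch of the compensation idea for a sequential call tree follows. It assumes a constant, previously calibrated overhead per instrumented call, and assumes that the overhead of every instrumented descendant falls inside the parent's measured inclusive time while a routine's own overhead falls outside it. The data structure and names are hypothetical and do not reflect TAU's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ProfileNode:
    name: str
    measured_inclusive: float  # measured inclusive time, overhead included
    calls: int                 # number of times this routine was entered
    children: list = field(default_factory=list)


def instrumented_calls_below(node):
    """Count the entries of all instrumented descendants of node."""
    return sum(c.calls + instrumented_calls_below(c) for c in node.children)


def compensated_inclusive(node, overhead_per_call):
    """Subtract the overhead charged to node by its instrumented descendants."""
    return node.measured_inclusive - overhead_per_call * instrumented_calls_below(node)


# Hypothetical profile: a frequently called kernel inflates its callers' times.
kernel = ProfileNode("kernel", 20.0, 1000)
solve = ProfileNode("solve", 60.0, 10, [kernel])
io = ProfileNode("io", 10.0, 2)
main = ProfileNode("main", 100.0, 1, [solve, io])

for node in (main, solve, kernel):
    print(node.name, round(compensated_inclusive(node, 0.01), 2))
```

With these numbers, `main` carries the overhead of 1012 instrumented calls and `solve` that of 1000, while the leaf routine `kernel` is left unchanged under the stated assumption.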
Performance profiling generates measurement overhead during parallel
program execution. Measurement overhead, in turn, introduces
intrusion in a program's runtime performance behavior. Intrusion can
be mitigated by controlling instrumentation degree, allowing a
tradeoff of accuracy for detail. Alternatively, the accuracy in
profile results can be improved by reducing the intrusion error due to
measurement overhead. Models for compensation of measurement overhead
in parallel performance profiling are described. An approach based on
rational reconstruction is used to understand properties of
compensation solutions for different parallel scenarios. From this
analysis, a general algorithm for on-the-fly overhead assessment and
compensation is derived.
Online application performance monitoring allows tracking
performance characteristics during execution as opposed to doing so
post-mortem. This opens up several possibilities otherwise unavailable
such as real-time visualization and application performance steering that
can be useful in the context of long-running applications. As HPC systems
grow in size and complexity, the key challenge is to keep the online
performance monitor scalable and low overhead while still providing a
useful performance reporting capability. Two fundamental components
that constitute such a performance monitor are the measurement and
transport systems. We adapt and combine two existing, mature systems,
TAU and Supermon, to address this problem. TAU performs the measurement
while Supermon is used to collect the distributed measurement
state. Our experiments show that this novel approach leads to very
low-overhead application monitoring as well as other benefits unavailable
from using a transport such as NFS.
Performance analysis tools are only as useful as the data they collect. Not just accuracy of performance data, but accessibility, is necessary for performance analysis tools to be used to their full effect. The diversity of performance analysis and tuning problems calls for more flexible means of storing and representing performance data. The development and maintenance cycles of high performance programs, in particular, stand to benefit from exploration of and expansion of the means used to record and describe program execution behavior. We describe a means of representing program performance data via a time or event delineated series of performance profiles, or profile snapshots, implemented in the TAU performance analysis system. This includes an explanation of the profile snapshot format and means of snapshot analysis.
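One basic form of snapshot analysis is to difference consecutive cumulative snapshots to obtain a profile for each interval. The sketch below assumes each snapshot is a timestamp plus a dictionary of cumulative exclusive times per routine. This in-memory representation is a stand-in chosen for illustration and does not reproduce TAU's snapshot format.

```python
def interval_profiles(snapshots):
    """Turn cumulative snapshots into per-interval profiles.

    snapshots: list of (timestamp, {routine_name: cumulative_exclusive_time}),
    ordered by timestamp.
    Returns a list of (start, end, {routine_name: time_spent_in_interval}).
    """
    intervals = []
    for (t0, prev), (t1, curr) in zip(snapshots, snapshots[1:]):
        # A routine missing from the earlier snapshot had not run yet.
        delta = {name: curr[name] - prev.get(name, 0.0) for name in curr}
        intervals.append((t0, t1, delta))
    return intervals


# Hypothetical run: initialization dominates early, the solver later.
snapshots = [
    (0.0, {}),
    (10.0, {"init": 8.0, "solve": 2.0}),
    (20.0, {"init": 8.0, "solve": 12.0}),
]
intervals = interval_profiles(snapshots)
for start, end, delta in intervals:
    print(start, end, delta)
```

A single end-of-run profile would report only the totals (8.0 and 12.0), hiding the fact that `init` is inactive in the second interval.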
With support for C/C++, Fortran, MPI, OpenMP, and performance tools, the Eclipse integrated development environment (IDE) is a serious contender as a programming environment for parallel applications. There is interest in adding capabilities in Eclipse for conducting workflows where an application is executed under different scenarios and its outputs are processed. For instance, parametric studies are a requirement in many benchmarking and performance tuning efforts, yet there was no experiment management support available for the Eclipse IDE. In this paper, we describe an extension of the Parallel Tools Platform (PTP) plugin for the Eclipse IDE. The extension provides a graphical user interface for selecting experiment parameters, launches build and run jobs, manages the performance data, and launches an analysis application to process the data. We describe our implementation, and discuss three experiment examples which demonstrate the experiment management support.
This paper describes the design and implementation of the Distributed Array Query and Visualization (DAQV) system for High Performance Fortran, a project sponsored by the Parallel Tools Consortium. DAQV's implementation leverages the HPF language, compiler, and runtime system to address the general problem of providing high-level access to distributed data structures. DAQV supports a framework in which visualization and analysis clients connect to a distributed array server (i.e., the HPF application with DAQV control) for program-level access to array values. Implementing key components of DAQV in HPF itself has led to a robust and portable solution in which clients do not need to know how the data is distributed.
To aid in building high-performance computational environments,
INTERLACE offers a framework for linking reusable computational
engines in a heterogeneous distributed system. The INTERLACE model
provides clients with access to computational servers which interface
with "wrapped" computational engines. The wrappers implement
mechanisms to translate client requests to engine actions and to move
data across the server interface. These mechanisms are programmable,
allowing engines of different types to be integrated. The framework
takes advantage of the HPC++ runtime system to access servers through
distributed object operations. The INTERLACE framework has been
demonstrated by building a distributed computational environment with
MatLab engines.
The influences of OS and system-specific effects on application performance are increasingly important in high performance computing. In this regard, OS kernel measurement is necessary to understand the interrelationship of system and application behavior. This can
be viewed from two perspectives: kernel-wide and process-centric. An
integrated methodology and framework to observe both views in HPC
systems using OS kernel measurement has remained elusive. We demonstrate a new tool called KTAU (Kernel TAU) that aims to provide parallel kernel performance measurement from both perspectives. KTAU extends the TAU performance system with kernel-level monitoring, while
leveraging TAU's measurement and analysis capabilities. As part of the
ZeptoOS scalable operating systems project, we report early experiences
using KTAU in ZeptoOS on the IBM BG/L system.
Parallel performance tuning naturally involves a diagnosis
process to locate and explain sources of program inefficiency. Proposed
is an approach that exploits parallel computation patterns (models) for
diagnosis discovery. Knowledge of performance problems and inference
rules for hypothesis search are engineered from model semantics and
analysis expertise. In this manner, the performance diagnosis process
can be automated as well as adapted for parallel model variations. We
demonstrate the implementation of model-based performance diagnosis
on the classic Master-Worker pattern. Our results suggest that pattern-
based performance knowledge can provide effective guidance for locating
and explaining performance bugs at a high level of program abstraction.
To enable a scalable parallel application to view its global performance state, we designed and
developed TAUg, a portable runtime framework layered on the TAU parallel performance
system. TAUg leverages the MPI library to communicate between application processes, creating
an abstraction of a global performance space from which profile views can be retrieved. We
describe the TAUg design and implementation and show its use on two test benchmarks up to
512 processors. Overhead evaluation for the use of TAUg is included in our analysis. Future
directions for improvement are discussed.
In this article we propose a ``standard'' performance tool interface for
OpenMP, similar in spirit to the MPI profiling interface in its intent to
define a clear and portable API that makes OpenMP execution events visible
to performance libraries. When used together with the MPI profiling
interface, it also allows tools to be built for hybrid applications that
mix shared and distributed memory programming. We describe an
instrumentation approach based on OpenMP directive rewriting that generates
calls to the interface and passes context information (e.g., source code
locations) in a portable and efficient way. Our proposed OpenMP performance
API further allows user functions and arbitrary code regions to be marked
and performance measurement to be controlled using new proposed OpenMP
directives. The directive transformations we define are implemented in a
source-to-source translation tool called OPARI.
We have used it to integrate the TAU performance analysis
framework and the automatic event trace analyzer EXPERT with the proposed OpenMP performance interface.
Together, these tools show that a portable and robust solution to
performance analysis of OpenMP and hybrid applications is possible.
Regular is an often used term to suggest simple and uniform structure of a parallel
processor's organization or a parallel algorithm's operation. However, a strict definition is
long overdue. In this paper, we define regularity for processor array structures in two
dimensions and enumerate the eleven distinct regular topologies. Space and time emulation
schemes among the regular processor arrays are constructed to compare their geometric
and performance characteristics. The hexagonal array is shown to have the most efficient
emulation capabilities.
The lack of tools to observe the operation and performance
of message-based parallel architectures limits the
user's ability to effectively optimize application and system
performance. Performance data collection, analysis,
and visualization tools are needed to manage the complexity
and quantity of performance data. Furthermore, these
tools must be integrated with the machine hardware, the
system software, and the applications support software if
they are to find pervasive use in program development and
experimentation.
In this paper, we describe an integrated performance
environment being developed for the Intel iPSC/2 hypercube.
The data collection components of the environment
include software event tracing at the operating system
and program levels plus a hardware-based performance
monitoring system used to unobtrusively capture software
events. A visualization system, based on the X window
system, permits the performance analyst to browse and
explore interesting data components by dynamically interconnecting
new performance displays and data analysis
tools.
There are two main conclusions from this work. First, interaction
support should be integrated with a language system facilitating an
implementation of a model that is consistent with the language
design. This aids application developers or the tool builders that
require this interaction. Second, as the implementation of Breezy
shows, the development of interaction support can leverage off the
language itself as well as its compiler and runtime systems.
This paper presents a general architecture for runtime interaction
with a data-parallel program. We have applied this architecture in the
development of the Breezy tool for the pC++ language. Breezy grants
application programs convenient and efficient access to higher-level
external services (e.g., databases, visualization systems, and
distributed resources) and allows external access to the application's
state (e.g., for program state display or computational
steering). Although such support can be developed on an ad-hoc basis
for each application, a general approach to the problem of parallel
program interaction is preferred. A general approach makes tools more
portable and retargetable to different language systems.
Tracing parallel programs to observe their performance introduces intrusion as the result of
trace measurement overhead. If post-mortem trace analysis does not compensate for the
overhead, the intrusion will lead to errors in the performance results. We show that
measurement overhead can be accounted for during trace analysis and intrusion modeled and
removed. Algorithms developed in our earlier work are reimplemented in a more robust and
modern tool, KOJAK, allowing them to be applied in large-scale parallel programs. The ability
to reduce trace measurement error is demonstrated for a Monte-Carlo simulation
based on a master/worker scheme. As an additional result, we visualize how local
perturbation propagates across process boundaries and alters the behavioral
characteristics of non-local processes.
The Eclipse platform offers Integrated Development Environment support
for a diverse and growing array of programming applications and languages.
There is an increasing call for programming tools to support various
development tasks from within Eclipse. This includes tools for testing
and analyzing program performance. We describe the high-level synthesis
of the Eclipse platform with the TAU parallel performance analysis
system. By leveraging Eclipse's modularity and extensibility with
TAU's robust automated performance analysis mechanisms we produce
an integrated, GUI-controlled performance analysis system for Java,
C/C++ and High Performance Computing development within Eclipse.
Parallel performance diagnosis can be improved with the use of performance knowledge about parallel computation models. The Hercule
diagnosis system applies model-based methods to automate performance
diagnosis processes and explain performance problems from high-level
computation semantics. However, Hercule is limited by a single experiment view. Here we introduce the concept of relative performance diagnosis and show how it can be integrated in a model-based diagnosis framework. The paper demonstrates the effectiveness of Hercule's approach to relative diagnosis of the well-known Sweep3D application based on a Wavefront model. Relative diagnoses of Sweep3D performance anomalies in strong and weak scaling cases are given.
Scientific computing on massively parallel computers presents
unique challenges to component-based software engineering (CBSE).
While CBSE is at least as enabling for scientific computing as it is
for other arenas, the requirements are different. We briefly discuss
how these requirements shape the Common Component Architecture, and we
describe some recent research on quality-of-service issues to address
the computational performance and accuracy of scientific simulations.
Computational environments used by scientists should provide
high-level support for scientific processes that involve the
integrated and systematic use of familiar abstractions from a
laboratory setting, including notebooks, instruments, experiments, and
analysis tools. However, doing so while hiding the complexities of
the underlying computational platform is a challenge. ViNE is a
web-based electronic notebook that implements a high-level interface
for applying computational tools in scientific experiments in a
location- and platform-independent manner. Using ViNE, a scientist
can specify data and tools, and construct experiments that apply them
in well-defined procedures. ViNE's implementation of the experiment
abstraction offers the scientist an easy-to-understand framework for
building scientific processes. This paper discusses how ViNE
implements computational experiments in distributed, heterogeneous
computing environments.
The Distributed Array Query and Visualization (DAQV) project aims to
develop systems and tools that facilitate interacting with distributed
programs and data structures. Arrays distributed across the processes
of a parallel or distributed application are made available to
external clients via well-defined interfaces and protocols. Our design
considers the broad issues of language targets, models of interaction,
and abstractions for data access, while our implementation attempts to
provide a general framework that can be adapted to a range of
application scenarios. The paper describes the second generation of
DAQV work and places it in the context of the more general distributed
array access problem. Current applications and future work are also
described.
As computer systems grow in size and complexity, tool support is
needed to facilitate the efficient mapping of large-scale applications
onto these systems. To help achieve this mapping, performance
analysis tools must provide robust performance observation
capabilities at all levels of the system, as well as map low-level
behavior to high-level program constructs. Instrumentation and
measurement strategies, developed over the last several years,
must evolve together with performance analysis infrastructure to
address the challenges of new scalable parallel systems.
Adaptive algorithms are an important technique to achieve portable high
performance. They choose among solution methods and optimizations
according to expected performance on a particular machine. Grid environments
make the adaptation problem harder, because the optimal decision may change
across runs and even during runtime. Therefore, the performance model used
by an adaptive algorithm must be able to change decisions without high
overhead. In this paper, we present work that is modifying previous research
into rapid performance modeling to support adaptive grid applications through
sampling and high granularity modeling. We also outline preliminary results that
show the ability to predict differences in performance among algorithms in the
same program.
The computational environment for estimation of unknown regional
electrical conductivities of the human head, based on realistic geometry from
segmented MRI up to 256 resolution, is described. A finite difference alternating
direction implicit (ADI) algorithm, parallelized using OpenMP, is used to solve the
forward problem describing the electrical field distribution throughout the head
given known electrical sources. A simplex search in the multi-dimensional
parameter space of tissue conductivities is conducted in parallel using a distributed
system of heterogeneous computational resources. The theoretical and
computational formulation of the problem is presented. Results from test studies are
provided, comparing retrieved conductivities to known solutions from simulation.
Performance statistics are also given showing both the scaling of the forward
problem and the performance dynamics of the distributed search.
Using the Eclipse platform we have provided a centralized resource and unified user interface for the encapsulation of existing command-line based performance analysis tools. In this paper we describe the user-definable tool workflow system provided by this performance framework. We discuss the framework's implementation and the rationale for its design. A use case featuring the TAU performance analysis system demonstrates the utility of the workflow system with respect to conventional performance analysis procedures.
Contemporary high-end Terascale and Petascale systems are composed of hundreds of thousands of commodity multi-core processors interconnected with high-speed custom networks. Performance characteristics of applications executing on these systems are a function of system hardware and software as well as workload parameters. Therefore, it has become increasingly challenging to measure, analyze and project performance using a single tool on these systems. In order to address these issues, we propose a methodology for performance measurement and analysis that is aware of applications and the underlying system hierarchies. On the application level, we measure cost distribution and runtime dependent values for different components of the underlying programming model. On the system front, we measure and analyze information gathered for unique system features, particularly shared components in the multi-core processors. We demonstrate our approach using a Petascale combustion application called S3D on two high-end Teraflops systems, Cray XT4 and IBM Blue Gene/P, using a combination of hardware performance monitoring, profiling and tracing tools.
A common prerequisite for a number of debugging and performance-analysis techniques is the injection of auxiliary program code into the application under investigation, a process called instrumentation. To accomplish this task, source-code preprocessors are often used. Unfortunately, existing preprocessing tools either focus only on a very specific aspect or use hard-coded commands for instrumentation. In this paper, we examine which basic constructs are required to specify a user-defined routine entry/exit instrumentation. This analysis serves as a basis for a generic instrumentation component working on the source-code level where the instructions to be inserted can be flexibly configured. We evaluate the identified constructs with our prototypical implementation and show that these are sufficient to fulfill the needs of a number of today's performance-analysis tools.
Electronic structure calculations are a widely used tool in materials
science and a large consumer of supercomputing resources. Traditionally,
the software packages for these kinds of simulations have been
implemented in compiled languages, where Fortran in its different
versions has been the most popular choice. While dynamic, interpreted
languages, such as Python, can increase the efficiency of the programmer,
they cannot compete directly with the raw performance of compiled
languages. However, by using an interpreted language together with a
compiled language, it is possible to have most of the productivity
enhancing features together with a good numerical performance. We
have used this approach in implementing an electronic structure
simulation software GPAW using the combination of Python and C
programming languages. While the chosen approach works well in standard
workstations and Unix environments, massively parallel supercomputing
systems can present some challenges in porting, debugging and profiling
the software. In this paper we describe some details of the
implementation and discuss the advantages and challenges of the combined
Python/C approach. We show that despite the challenges it is possible to
obtain good numerical performance and good parallel scalability with
Python-based software.
Empirical performance evaluation of parallel systems and applications can generate
significant amounts of performance data and analysis results from multiple experiments as
performance is investigated and problems diagnosed. Hence, the management of
performance information is a core component of performance analysis tools. To better
support tool integration, portability, and reuse, there is a strong motivation to develop
performance data management technology that can provide a common foundation for
performance data storage, access, merging, and analysis. This paper presents the design and
implementation of the Performance Data Management Framework (PerfDMF). PerfDMF
addresses objectives of performance tool integration, interoperation, and reuse by providing
common data storage, access, and analysis infrastructure for parallel performance profiles.
PerfDMF includes an extensible parallel profile data schema and relational database schema,
a profile query and analysis programming interface, and an extendible toolkit for profile
import/export and standard analysis. We describe the PerfDMF objectives and architecture,
give detailed explanation of the major components, and show examples of PerfDMF
application.
The Charm++ parallel programming system provides a modular performance interface that can be used to extend its performance measurement and analysis capabilities. The interface exposes execution events of interest representing Charm++ scheduling operations, application methods/routines, and communication events for observation by alternative performance modules configured to implement different measurement features. The paper describes the Charm++ performance interface and how the Charm++ Projections tool and the TAU Performance System can provide integrated trace-based and profile-based performance views. These two tools are complementary, providing the user with different performance perspectives on Charm++ applications based on performance data detail and temporal and spatial analysis. How the tools work in practice is demonstrated in a parallel performance analysis of NAMD, a scalable molecular dynamics code that applies many of Charm++'s unique features.
Modern parallel performance measurement
systems collect performance information either through probes
inserted in the application code or via statistical sampling.
Probe-based techniques measure performance metrics directly
using calls to a measurement library that execute as part of
the application. In contrast, sampling-based systems interrupt
program execution to sample metrics for statistical analysis
of performance. Although both measurement approaches are
represented by robust tool frameworks in the performance
community, each has its strengths and weaknesses. In this
paper, we investigate the creation of a hybrid measurement
system, the goal being to exploit the strengths of both systems
and mitigate their weaknesses. We show how such a system
can be used to provide the application programmer with a
more complete analysis of their application. Simple example
and application codes are used to demonstrate its capabilities.
We also show how the hybrid techniques can be combined
to provide real cross-language performance evaluation of
an uninstrumented run for mixed compiled/interpreted
execution environments (e.g., Python and C/C++/Fortran).
The power of GPUs is giving rise to heterogeneous parallel computing,
with new demands on programming environments, runtime systems, and tools
to deliver high-performing applications. This paper studies the problems
associated with performance measurement of heterogeneous machines with
GPUs. A heterogeneous computation model and alternative host-GPU
measurement approaches are discussed to set the stage for reporting new
capabilities for heterogeneous parallel performance measurement in three
leading HPC tools: PAPI, Vampir, and the TAU Performance System. Our work
leverages the new CUPTI tool support in NVIDIA’s CUDA device library.
Heterogeneous benchmarks from the SHOC suite are used to demonstrate the
measurement methods and tool support.
The Alliant FX/8 multiprocessor implements several high-speed computation ideas in
software and hardware. Each of the 8 computational elements (CEs) has vector capabilities
and multiprocessor support. Generally, the FX/8 delivers its highest processing rates when
executing vector loops concurrently. In this paper, we present extensive empirical
performance results for vector processing on the FX/8. The vector kernels of the LANL BMK8a1
benchmark are used in the experiments.
A message passing facility (MPF) for shared memory multiprocessors is presented. MPF is
based on a message passing model conceptually similar to conversations. The message
passing primitives for this model are implemented as a portable library of C function calls.
Performance results for interprocess communication benchmark programs and two parallel
applications are given.
Heterogeneous parallel systems using GPU devices for application
acceleration have garnered significant attention in
the supercomputing community. However, to realize the full
potential of GPU computing, application developers will require
tools to measure and analyze accelerator performance
with respect to the parallel execution as a whole. A performance
measurement technology for the NVIDIA CUDA
platform has been developed and integrated with the TAU
parallel performance system. The design of the TAUcuda
package is based on an experimental NVIDIA CUDA driver
and associated runtime and device libraries. In any environment
where the CUDA experimental driver is installed,
TAUcuda can provide detailed performance information regarding
the execution of GPU kernels and the interactions
with the parallel program without any modification to the
program source or executable code. The paper describes the
TAUcuda technology and how it is integrated with the TAU
measurement framework to provide integrated performance
views. Various examples of TAUcuda use are presented, including
CUDA SDK examples, a GPU version of the Linpack
benchmark, and a scalable molecular dynamics application,
NAMD.
In this paper we discuss the performance prediction of Fortran constructs commonly found in
numerical scientific computing. Although the approach is applicable to multi-processors in
general, within the scope of the paper we will concentrate on the Alliant FX/8 multiprocessor.
The techniques proposed involve a combination of empirical observations, architectural
models and analytical techniques, and exploit earlier work on data locality analysis and
empirical characterization of the behavior of memory systems. The Lawrence Livermore
Loops are used as a test-case to verify the approach.
The complexity of parallel computer systems makes a priori performance
prediction difficult and experimental performance analysis crucial. A complete
characterization of software and hardware dynamics, needed to understand the
performance of high-performance parallel systems, requires execution time
performance instrumentation. Although software recording of performance data
suffices for low frequency events, capture of detailed, high-frequency
performance data ultimately requires hardware support if the performance
instrumentation is to remain efficient and unobtrusive. This paper describes the
design of HYPERMON, a hardware system to capture and record software
performance traces generated on the Intel iPSC/2 hypercube. HYPERMON
represents a compromise between fully-passive hardware monitoring and
software event tracing; software generated events are extracted from each
node, timestamped, and externally recorded by HYPERMON. Using an
instrumented version of the iPSC/2 operating system and several application
programs, we present a performance analysis of an operational HYPERMON
prototype and assess the limitations of the current design. Based on these
results, we suggest design modifications that should permit capture of event
traces from the coming generation of high-performance distributed memory
parallel systems.
This paper describes how the SMARTS runtime system and the POOMA C++
class library for high-performance scientific computing work together
to exploit data parallelism in scientific applications while hiding
the details of managing parallelism and data locality from the
user. We present innovative algorithms, based on the macro-dataflow
model for detecting data parallelism and efficiently executing
data-parallel statements on shared-memory multiprocessors. We also
describe how these algorithms can be implemented on clusters of SMPs.
In the solution of large-scale numerical problems, parallel computing
is becoming simultaneously more important and more difficult. The
complex organization of today's multiprocessors with several memory
hierarchies has forced the scientific programmer to make a choice
between simple but unscalable code and scalable but extremely complex
code that does not port to other architectures.
This work targets the emerging use of software component technology for
high-performance scientific parallel and distributed computing. While
component software engineering will benefit the construction of complex
science applications, its use presents several challenges to performance
optimization. A component application is composed of a set of components;
thus, application performance depends on the interaction (possibly
non-linear) of the component set. Furthermore, a component is a ``binary
unit of composition'' and the only information users have is the interface
the component provides to the outside world. An interface for component
performance measurement and query is presented to address optimization
issues. We describe the performance component design and an example
demonstrating its use for runtime performance tuning.
We present a case study of performance measurement and modeling of a CCA (Common
Component Architecture) component-based application in a high performance computing
environment. Component-based HPC applications allow the possibility of creating
component-level performance models and synthesizing them into application performance
models. However, they impose the restriction that performance measurement/monitoring
needs to be done in a non-intrusive manner and at a fairly coarse-grained level. We propose
a performance measurement infrastructure for HPC based loosely on recent work done for
Grid environments. A prototypical implementation of the infrastructure is used to collect data
for three components in a scientific application and construct their performance models.
Both computational and message-passing performance are addressed.
In this paper, we discuss the performance analysis of the pC++
programming system. We describe the performance tools developed and
include scalability measurements for four benchmark programs: a
"nearest neighbor" grid computation, a fast Poisson solver, and the
"Embar" and "Sparse" codes from the NAS suite. In addition to speedup
numbers, we present a detailed analysis highlighting performance
issues at the language, runtime system, and target system levels.
pC++ is a language extension to C++ designed to allow programmers to
compose distributed data structures with parallel execution
semantics. These data structures are organized as ``concurrent
aggregate'' collection classes which can be aligned and distributed
over the memory hierarchy of a parallel machine in a manner consistent
with the High Performance Fortran Forum (HPF) directives for Fortran
90. pC++ allows the user to write portable and efficient code which
will run on a wide range of scalable parallel computers.
Performance diagnosis, the process of finding and explaining performance
problems, is an important part of parallel programming. Effective performance
diagnosis requires that the programmer plan an appropriate method, and
manage the experiments required by that method. This paper presents Poirot,
an architecture to support performance diagnosis. It explains how the
architecture helps to plan and manage the diagnosis process automatically
and adaptably. The paper evaluates the generality and practicality of Poirot by
reconstructing diagnosis methods found in several published performance
tools.
Applications executing on complex computational systems provide a
challenge for the development of runtime performance monitoring
software. We discuss a computational model, application monitoring,
data access models, and profiler functionality. We define data
consistency within and across threads as well as across contexts and
nodes. We describe the TAU runtime monitoring framework which enables
on-demand, low-interference data access to TAU profile data and
provides the flexibility to enforce data consistency at the thread,
context or node level. We present an example of a Java-based runtime
performance monitor utilizing the framework.
Technology for empirical performance evaluation of parallel programs
is driven by the increasing complexity of high performance computing environments
and programming methodologies. This paper describes the integration of
the TAU and XPARE tools in the Uintah computational framework. Performance
mapping techniques in TAU relate low-level performance data to higher levels of
abstraction. XPARE is used for specifying regression testing benchmarks that are
evaluated with each periodically scheduled testing trial. This provides a historical
panorama of the evolution of application performance. The paper concludes with
a scalability study that shows the benefits of integrating performance technology
in the development of large-scale parallel applications.
The paper presents the design and development of an online remote trace
measurement and analysis system. The work combines the strengths of the
TAU performance system with that of the VNG distributed parallel trace
analyzer. Issues associated with online tracing are discussed and the problems
encountered in system implementation are analyzed in detail. Our approach
should port well to parallel platforms. Future work includes testing the
performance of the system on large-scale machines.
We have developed an environment that uses the IBM Visualization Data Explorer system to allow new visualizations to be prototyped rapidly, often taking only a few hours to construct totally new views of parallel performance trace data. Yet, access to a robust library of sophisticated graphical techniques is preserved. The burdensome task of explicitly programming the visualizations is completely avoided, and the iterative design, evaluation, and modification of new displays is greatly facilitated.
The complexity of parallel programs makes them more difficult to analyze for correctness and efficiency, in part because of the interactions between multiple processors and the volume of data that can be generated. Visualization often helps the programmer in these tasks. This paper focuses on the development of a new technique for constructing, evaluating, and modifying sophisticated, application-specific visualizations for parallel programs and performance data. While most existing tools offer predetermined sets of simple, two-dimensional graphical displays, this environment gives users a high degree of control over visualization development and use, including access to three-dimensional graphics, which remain relatively unexplored in this context.
A multi-cluster computational environment with mixed-mode (MPI +
OpenMP) parallelism for estimation of unknown regional electrical
conductivities of the human head, based on realistic geometry from segmented MRI up
to 256 voxels resolution, is described. A finite difference multi-component
alternating direction implicit (ADI) algorithm, parallelized using OpenMP, is used
to solve the forward problem calculation describing the electrical field
distribution throughout the head given known electrical sources. A simplex search in the
multi-dimensional parameter space of tissue conductivities is conducted in
parallel across a distributed system of heterogeneous computational resources. The
theoretical and computational formulation of the problem is presented. Results
from test studies based on the synthetic data are provided, comparing retrieved
conductivities to known solutions from simulation. Performance statistics are also
given showing both the scaling of the forward problem and the performance
dynamics of the distributed search.
Nested OpenMP parallelism allows an application to spawn teams of nested threads. This hierarchical nature of thread creation and usage poses problems for performance measurement tools that must determine thread context to properly maintain per-thread performance data. In this paper we describe the problem and a novel solution for identifying threads uniquely. Our approach has been implemented in the TAU performance system and has been successfully used in profiling and tracing OpenMP applications with nested parallelism. We also describe how extensions to the OpenMP standard can help tool developers uniquely identify threads.
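The identification problem can be illustrated with a small sketch. It is not the TAU implementation, which the abstract does not detail. It only models why the team-local number returned by `omp_get_thread_num` is ambiguous under nesting, and how a path of ancestor thread numbers (an assumption made here for illustration) can be mapped to a dense unique ID. The class name `ThreadRegistry` and the two-level team layout are hypothetical.

```python
class ThreadRegistry:
    """Assign dense, unique IDs to threads named by their ancestor path.

    An ancestor path is a tuple of team-local thread numbers, one entry
    per nesting level (outermost first). This naming scheme is an
    illustrative assumption, not the mechanism used by TAU.
    """

    def __init__(self):
        self._ids = {}

    def unique_id(self, ancestor_path):
        path = tuple(ancestor_path)
        # The first time a path is seen it receives the next free ID;
        # later lookups return the same ID.
        return self._ids.setdefault(path, len(self._ids))


registry = ThreadRegistry()

# Model an outer team of 2 threads, each spawning an inner team of 2.
paths = [(outer, inner) for outer in range(2) for inner in range(2)]

# The team-local number alone repeats across inner teams.
team_local = [path[-1] for path in paths]

# The full ancestor path distinguishes every thread.
unique = [registry.unique_id(path) for path in paths]
```

Here `team_local` contains repeated values while `unique` does not, which is the property a measurement tool needs in order to keep per-thread performance data separate.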
Parallel Java environments present challenging problems for performance
tools because of Java's rich language system and its multi-level execution
platform combined with the integration of native-code application libraries
and parallel runtime software. In addition to the desire to provide robust
performance measurement and analysis capabilities for the Java language
itself, the coupling of different software execution contexts under a
uniform performance model needs careful consideration of how events of
interest are observed and how cross-context parallel execution information
is linked. This paper relates our experience in extending the TAU
performance system to a parallel Java environment based on mpiJava. We
describe the complexities of the instrumentation model used, how
performance measurements are made, and the overhead incurred. A parallel
Java application simulating the game of Life is used to show the
performance system's capabilities.
Parallel Java environments present challenging problems for performance tools because of Java's rich language system and its multi-level execution platform
combined with the integration of native-code application libraries and parallel runtime software. In addition to the desire to provide robust performance measurement and analysis capabilities for the Java language itself, the coupling of different software execution contexts under a uniform performance model needs careful consideration of how events of interest are observed and how cross-context parallel execution information is linked. This paper relates our experience in extending the TAU performance system to a parallel Java environment based on mpiJava. We describe the instrumentation model used, how performance measurements are made, and the overhead incurred. A parallel Java application simulating the game of life is used to show the performance
system's capabilities.
This paper proposes a performance tools interface for
OpenMP, similar in spirit to the MPI profiling interface in its intent to
define a clear and portable API that makes OpenMP execution events visible
to runtime performance tools. We present our design using a source-level
instrumentation approach based on OpenMP directive rewriting. Rules to
instrument each directive and their combination are applied to generate
calls to the interface consistent with directive semantics and to pass
context information (e.g., source code locations) in a portable and
efficient way. Our proposed OpenMP performance API further allows user
functions and arbitrary code regions to be marked and performance
measurement to be controlled using new OpenMP directives.
To prototype the proposed OpenMP performance interface, we have developed
compatible performance libraries for the EXPERT automatic event
trace analyzer and the TAU performance analysis framework. The directive
instrumentation transformations we define are implemented in a
source-to-source translation tool called OPARI. Application examples are
presented for both EXPERT and TAU to show the OpenMP performance interface and
OPARI instrumentation tool in operation. When used together with the MPI
profiling interface (as the examples also demonstrate), our proposed
approach provides a portable and robust solution to performance analysis of
OpenMP and mixed-mode (OpenMP + MPI) applications.
Profiling and tracing tools can help make application parallelization
more effective and identify performance bottlenecks. Profiling
presents summary statistics of performance metrics while tracing
highlights the temporal aspect of performance variations, showing when
and where in the code performance is achieved. A complex challenge is
the mapping of performance data gathered during execution to
high-level parallel language constructs in the application source
code. Presenting performance data in a meaningful way to the user is
equally important. This paper presents a brief overview of profiling
and tracing tools in the context of Linux - the operating system most
commonly used to build clusters of workstations for high performance
computing.
Performance extrapolation is the process of evaluating the performance
of a parallel program in a target execution environment using
performance information obtained for the same program in a different
environment. Performance extrapolation techniques are suited for rapid
performance tuning of parallel programs, particularly when the target
environment is unavailable. This paper describes one such technique
that was developed for data-parallel C++ programs written in the pC++
language. In pC++, the programmer can distribute a collection of
objects to various processors and can have methods invoked on those
objects execute in parallel. Using performance extrapolation in the
development of pC++ applications allows tuning decisions to be made in
advance of detailed execution measurements. The pC++ language system
includes TAU, an integrated environment for analyzing and tuning the
performance of pC++ programs. This paper presents speedy, a new
addition to TAU, that predicts the performance of pC++ programs on
parallel machines using extrapolation techniques. Speedy applies the
existing instrumentation support of TAU to capture high-level event
traces of an n-thread pC++ program run on a uniprocessor machine
together with trace-driven simulation to predict the performance of
the program run on a target n-processor machine. We describe how
speedy works and how it is integrated into TAU. We also show how
speedy can be used to evaluate a pC++ program for a given target
environment.
Performance prediction methods and tools based on analytical models often fail
in forecasting the performance of real systems due to inappropriateness of
model assumptions, irregularities in the problem structure that cannot be
described within the modeling formalism, unstructured execution behavior that
leads to unforeseen system states, etc. Prediction accuracy and tractability are
acceptable for systems with deterministic operational characteristics, for static,
regularly structured problems, and non-changing environments.
When implementing parallel programs for parallel computer systems, the
performance scalability of these programs should be tested and analyzed on
different computer configurations and problem sizes. Since a complete
scalability analysis is too time consuming and is limited to only existing systems,
extensions of modeling approaches can be considered for analyzing the
behavior of parallel programs under different problem and system scenarios. In
this paper, a method for automatic scalability analysis using modeling is
presented. Initially, we identify the important problems that arise when
attempting to apply modeling techniques to scalability analysis. Based on this
study, we define the Parallelization Description Language (PDL) that is used to
describe parallel execution attributes of a generic program workload. Based on
a parallelization description, stochastic models like graph models or Petri net
models can be automatically generated from a generic model to analyze
performance for scaled parallel systems as well as scaled input data. The
complexity of the graph models produced depends significantly on the type of
parallel computation described. We present several computation classes where
tractable graph models can be generated and then compare the results of these
automatically scaled models with their exact solutions using the PEPP modeling
tool.
Tools to observe the performance of parallel programs typically employ profiling and tracing as the two main forms of event-based
measurement models. In both of these approaches, the volume of performance data generated and the corresponding perturbation encountered
in the program depend upon the amount of instrumentation in the program. To produce accurate performance data, tools need to control the
granularity of instrumentation. In this paper, we describe our experiences in the TAU performance system for improving the accuracy of
performance data by limiting the amount of instrumentation. A range of
options are provided to optimize instrumentation based on the structure
of the program, event generation rates, and historical performance data
gathered from prior executions.
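A minimal sketch of one such option follows: using a profile from a prior run to choose routines whose instrumentation should be disabled because they are called very often and do little work per call. The function name, the profile layout, and the threshold values are assumptions chosen for illustration and are not taken from the paper.

```python
def should_throttle(calls, inclusive_usec,
                    max_calls=100_000, min_usec_per_call=10.0):
    """Return True when a routine is called often and each call is cheap.

    Both thresholds are illustrative assumptions.
    """
    if calls == 0:
        return False
    return calls > max_calls and inclusive_usec / calls < min_usec_per_call


# Hypothetical profile from a prior run:
# routine name -> (number of calls, inclusive time in microseconds)
profile = {
    "tiny_helper": (2_000_000, 4_000_000.0),      # 2 usec per call
    "solver": (50, 9_000_000.0),                  # few, long calls
    "hot_but_heavy": (500_000, 50_000_000.0),     # 100 usec per call
}

# Routines to exclude from instrumentation in the next run.
excluded = sorted(
    name for name, (calls, usec) in profile.items()
    if should_throttle(calls, usec)
)
```

Only the routine that is both frequent and lightweight is excluded; frequent but expensive routines stay instrumented because the relative overhead of measuring them is small.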
Workload characterization is an important technique that
helps us understand the performance of parallel applications and the demands they place on the system. Each application run is profiled using
instrumentation at the MPI library level. Characterizing the performance
of the MPI library based on the sizes of messages helps us understand
how the performance of an application is affected based on messages
of different sizes. Partitioning of the time spent in MPI routines based
on the type of MPI operation and the message size involved requires a
two level mapping of performance data. This paper describes how performance mapping is implemented in the TAU performance system to
support workload characterization.
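The two-level partitioning described above can be sketched as a nested aggregation, first by MPI operation and then by message-size bucket. This is a simplified illustration, not TAU's mapping mechanism. The bucket boundaries, the record layout, and the function names are assumptions.

```python
from collections import defaultdict

# Hypothetical message-size buckets as (upper bound in bytes, label).
SIZE_BUCKETS = [
    (1024, "<=1KB"),
    (65536, "<=64KB"),
    (float("inf"), ">64KB"),
]


def size_bucket(nbytes):
    """Return the label of the first bucket whose upper bound holds nbytes."""
    for limit, label in SIZE_BUCKETS:
        if nbytes <= limit:
            return label


def characterize(records):
    """Aggregate time by MPI operation, then by message-size bucket.

    records: iterable of (operation, message_bytes, seconds) tuples.
    """
    profile = defaultdict(lambda: defaultdict(float))
    for op, nbytes, seconds in records:
        profile[op][size_bucket(nbytes)] += seconds
    return profile


# Made-up measurements, used only to exercise the sketch.
records = [
    ("MPI_Send", 512, 0.25),
    ("MPI_Send", 2048, 0.5),
    ("MPI_Send", 100, 0.25),
    ("MPI_Allreduce", 1_000_000, 1.5),
]
profile = characterize(records)
```

The first key selects the type of MPI operation and the second selects the message-size class, so time spent in one routine is split across as many entries as there are size classes observed.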
Observing the performance of an application at runtime requires
economy in what performance data is measured and accessed, and
flexibility in changing the focus of performance interest. This paper
describes the performance callstack as an efficient performance view
of a running program which can be retrieved and controlled by external
analysis tools. The performance measurement support is provided by
the TAU profiling library whereas tool-program interaction support is
available through the DAQV framework. How these systems are merged to
provide dynamic performance callstack sampling is discussed.
Parallel performance tools offer insights into the execution behavior
of an application and are a valuable component in the cycle of
application development, deployment, and optimization. However, most
tools do not work well with large-scale parallel applications where
the performance data generated comes from upwards of thousands of
processes. As parallel computer systems increase in size, the scaling
of performance observation infrastructure becomes an important
concern. In this paper, we discuss the problem of scaling and
performance observation, and the ramifications of adding online
support. A general online performance system architecture is
presented. Recent work on the TAU performance system to enable
large-scale performance observation and analysis is discussed. The
paper concludes with plans for future work.
We have developed a distributed service architecture and an integrated parallel analysis engine
for scalable trace-based performance analysis. Our combined approach makes it possible to handle very
large performance data volumes in real-time. Unlike traditional analysis tools that do their job
sequentially on an external desktop platform, our approach leaves the data at its origin and
seamlessly integrates the time-consuming analysis as a parallel job into the high performance
production environment.
Parallel scientific applications are designed based on structural, logical, and numerical models
of computation and correctness. When studying the performance of these applications,
especially on large-scale parallel systems, there is a strong preference among developers to
view performance information with respect to their mental model of the application, formed
from the model semantics used in the program. If the developer can relate performance data
measured during execution to what they know about the application, more effective program
optimization may be achieved. This paper considers the concept of phases and its support in
parallel performance measurement and analysis as a means to bridge the gap between high-
level application semantics and low-level performance data. In particular, this problem is
studied in the context of parallel performance profiling. The implementation of phase-based
parallel profiling in the TAU parallel performance system is described and demonstrated for the
NAS parallel benchmarks and MFIX application.
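The idea of attributing measurements to application-level phases can be illustrated with a minimal sketch. This is not the TAU API: the class `PhaseProfiler`, its methods `phase` and `timer`, and the injectable `clock` parameter are hypothetical names chosen for illustration. The sketch only shows the core bookkeeping, assuming that every timed event is charged to whichever phase is active when the event starts.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseProfiler:
    """Hypothetical sketch: charge each timed event to the enclosing phase."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock                  # injectable so the sketch is testable
        self._phase_stack = ["<no phase>"]   # innermost active phase is last
        self.totals = defaultdict(float)     # (phase, event) -> accumulated time

    @contextmanager
    def phase(self, name):
        # Entering a phase only changes the attribution context.
        self._phase_stack.append(name)
        try:
            yield
        finally:
            self._phase_stack.pop()

    @contextmanager
    def timer(self, event):
        # The phase is captured at event entry, so the same event name can
        # accumulate separate totals in different phases.
        phase = self._phase_stack[-1]
        start = self._clock()
        try:
            yield
        finally:
            self.totals[(phase, event)] += self._clock() - start
```

With this structure, the same routine (for example a solver called in every iteration) appears once per phase in the profile, which is what lets a developer compare its cost across iterations instead of seeing a single aggregate.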
Scalable performance analysis is a challenge for parallel development tools. The potential size of data sets and the need to compare results from multiple experiments present a challenge to manage and process the information, and to characterize the performance of parallel applications running on potentially hundreds of thousands of processor cores. In addition, many exploratory analysis processes represent potentially repeatable processes which can and should be automated. In this paper, we will discuss the current version of PerfExplorer, a performance analysis framework which provides dimension reduction, clustering and correlation analysis of individual trials of large dimensions, and can perform relative performance analysis between multiple application executions. PerfExplorer analysis processes can be captured in the form of Python scripts, automating what would otherwise be time-consuming tasks. We will give examples of large-scale analysis results, and discuss the future development of the framework, including the encoding and processing of expert performance rules, and the increasing use of performance metadata.
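As an illustration of the kind of clustering analysis mentioned above, the following is a minimal k-means sketch using only the Python standard library. It does not use or reproduce the PerfExplorer scripting interface. The function name `kmeans`, the choice of the first `k` points as initial centroids, and the fixed iteration count are all assumptions made to keep the example small and deterministic. Each input point is assumed to be one thread's profile, expressed as a vector of metric values.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster per-thread profile vectors; returns (assignment, centroids).

    Assumption: the first k points seed the centroids, which keeps the
    sketch deterministic but is not a robust initialization strategy.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

Grouping threads this way reduces a profile with many threads to a handful of representative behaviors, for example compute-dominated threads versus threads that spend most of their time waiting in communication.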
TAU is an integrated toolkit for performance instrumentation,
measurement, and analysis. It provides a flexible, portable, and
scalable set of technologies for performance evaluation on extreme-scale
HPC systems. This paper describes alternatives for I/O instrumentation
provided by TAU and the design and implementation of a new tool,
tau_gen_wrapper, to wrap external libraries. It describes three
instrumentation techniques - preprocessor based substitution, linker
based instrumentation, and library preloading based replacement of
routines. It demonstrates this wrapping technology in the context
of intercepting the POSIX I/O library and its application to profiling
I/O calls for the Global Cloud Resolution Model (GCRM) application on
the Cray XE6 system. This scheme allows TAU to track I/O using linker
level instrumentation for statically linked executables and attribute
the I/O to specific code regions. It also addresses issues encountered
in collecting the performance data from large core counts and
representing this data to correctly identify sources of poor I/O
performance.
A new design process for the development of parallel performance
visualizations that uses existing scientific data visualization
software is presented. Scientific visualization tools are designed to
handle large quantities of multi-dimensional data and create complex,
three-dimensional, customizable displays which incorporate advanced
rendering techniques, animation, and display interaction. Using a
design process that leverages these tools to prototype new performance
visualizations can lead to drastic reductions in the graphics and data
manipulation programming overhead currently experienced by performance
visualization developers. The process evolves from a formal
methodology that relates performance abstractions to visual
representations. Under this formalism, it is possible to describe
performance visualizations as mappings from performance objects to
view objects, independent of any graphical programming. Implementing
this formalism in an existing data visualization system leads to a
visualization prototype design process consisting of two components
corresponding to the two high-level abstractions of the formalism: a
trace transformation (i.e., performance abstraction) and a graphical
transformation (i.e., visual abstraction). The trace transformation
changes raw trace data to a format readable by the visualization
software, and the graphical transformation specifies the graphical
characteristics of the visualization. This prototyping environment
also facilitates iterative design and evaluation of new and existing
displays. Our work examines how an existing data visualization tool,
IBM's Data Explorer in particular, can provide a robust prototyping
environment for next-generation parallel performance visualization.
Fueled by increasing processor speeds and high speed interconnection
networks, advances in high performance computer architectures have allowed
the development of increasingly complex large scale parallel systems. For
computational scientists, programming these systems efficiently is a
challenging task. Understanding the performance of their parallel
applications is equally daunting. To observe and comprehend the
performance of parallel applications that run on these systems, we need
performance evaluation tools that can map the performance abstractions to
the user's mental models of application execution. For instance, most
parallel scientific applications are iterative in nature. In the case of CFD
applications, they may also dynamically adapt to changes in the simulation
model. A performance measurement and analysis system that can
differentiate the phases of each iteration and characterize performance
changes as the application adapts will enable developers to better relate
performance to their application behavior. In this paper, we present new
performance measurement techniques to meet these needs. In section 2, we
describe our parallel performance system, TAU. Section 3 discusses how new
TAU profiling techniques can be applied to CFD applications with iterative
and adaptive characteristics. In section 4, we present a case study
featuring the Uintah computational framework and explain how adaptive
computational fluid dynamics simulations are observed using TAU. Finally,
we conclude with a discussion of how the TAU performance system can be
Flexibility and portability are important concerns for productive empirical performance evaluation. We claim that these features are best supported by robust
instrumentation and measurement strategies, and their integration. Using the TAU performance system as an exemplar performance toolkit, a case study in performance evaluation is
considered. Our goal is both to highlight flexibility and portability requirements and to consider how instrumentation and measurement techniques can address them. The main
contribution of the paper is methodological, in its advocacy of a guiding principle for tool development and enhancement. Recent advancements in the TAU system are described
from this perspective.
The increasing complexity of parallel computing systems has brought about a crisis in
parallel performance evaluation and tuning. Although there have been important advances in
performance tools in recent years, we believe that future parallel performance environments
will move beyond these tools by integrating performance instrumentation with compilers for
architecture-independent languages, by formalizing the relationship between performance
views and the data they represent, and by automating some aspects of performance
interpretation. This paper describes these directions from the perspective of research
projects that have been recently undertaken.
Determining the performance behavior of parallel computations requires some
form of intrusive tracing measurement. The greater the need for detailed
performance data, the more intrusion the measurement will cause. Recovering
actual execution performance from perturbed performance measurements
using event-based perturbation analysis is the topic of this paper. We show that
the measurement and subsequent analysis of synchronization operations
(particularly, advance and await) can produce, in practice, accurate
approximations to actual performance behavior. We use as test cases three
Lawrence Livermore loops that execute as parallel DOACROSS loops on an
Alliant FX/80. The results of our experiments suggest that a systematic
application of performance perturbation analysis techniques will allow more
detailed, accurate instrumentation than traditionally believed possible.
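A minimal sketch of event-based perturbation analysis for advance/await synchronization follows. It is an illustration under stated assumptions, not the model from the paper: the `Event` record, the function `approximate_times`, the constant per-event `overhead`, and the `await_cost` parameter are hypothetical. The sketch assumes that each timestamp is recorded before that event's instrumentation overhead is incurred, that an `await_end` is immediately preceded on its thread by the matching `await_begin`, and that waiting time is recomputed from the approximated time of the matching advance rather than copied from the measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    thread: int
    kind: str        # e.g. "begin", "compute", "advance", "await_begin", "await_end"
    measured: float  # timestamp from the instrumented (perturbed) run
    tag: int = -1    # index shared by an advance and the await that depends on it


def approximate_times(trace, overhead, await_cost=0.0):
    """Return (event, approximated_time) pairs in measured-time order."""
    prev_measured = {}   # thread -> measured time of its previous event
    prev_approx = {}     # thread -> approximated time of its previous event
    advance_approx = {}  # tag -> approximated time of the advance
    result = []
    for ev in sorted(trace, key=lambda e: e.measured):
        if ev.thread not in prev_measured:
            # First event on a thread: no earlier overhead to remove.
            approx = ev.measured
        elif ev.kind == "await_end":
            # Event-based rule: the await completes when the waiter is ready
            # and the matching advance has occurred, both in approximated time.
            approx = max(prev_approx[ev.thread] + await_cost,
                         advance_approx[ev.tag])
        else:
            # Time-based rule: remove the previous event's overhead.
            delta = ev.measured - prev_measured[ev.thread]
            approx = prev_approx[ev.thread] + max(delta - overhead, 0.0)
        if ev.kind == "advance":
            advance_approx[ev.tag] = approx
        prev_measured[ev.thread] = ev.measured
        prev_approx[ev.thread] = approx
        result.append((ev, approx))
    return result
```

The key design point is that the measured waiting interval is discarded for `await_end` events. Instrumentation can lengthen or shorten a wait, so the sketch derives it again from the dependence between the approximated advance and await times.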
The process of instrumenting a program to study its behavior can lead to
perturbations in the program's execution. These perturbations can become
severe for large parallel systems or problem sizes, even when one captures
only high level events. In this paper, we address the important issue of
eliminating execution perturbations caused by high-level instrumentation of
SPMD programs. We will describe perturbation analysis techniques for common
computation and communication measurements, and show examples which
demonstrate the effectiveness of these techniques in practice.
In this paper, we present an update on the scalable online
support for performance data analysis and monitoring in TAU. Extending
on our prior work with TAUoverSupermon and TAUoverMRNet, we
show how online analysis operations can also be supported directly and
scalably using the parallel infrastructure provided by an MPI application
instrumented with TAU. We also report on efforts to streamline and update
TAUoverMRNet. Together, these approaches form the basis for the
investigation of online analysis capabilities in a TAU monitoring framework,
TAUmon. We discuss various analysis operations and capabilities
enabled by online monitoring and how operations like event unification
enable merged profiles to be produced with greatly reduced data volume
just prior to the end of application execution. Scaling results with
PFLOTRAN on the Cray XT5 and BG/P are presented along with a
look at some initial performance information generated from FLASH
and PFLOTRAN through our TAUmon prototype frameworks.
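The role of event unification in producing merged profiles can be sketched as follows. This is a serial illustration only and does not reflect how TAU performs the operation in parallel. The function names `unify_events` and `merge_profiles`, and the representation of each rank's profile as a list of values indexed by local event ID, are assumptions. The point shown is that once every rank's local event IDs map onto one global table, per-rank values can be reduced into a single profile.

```python
def unify_events(local_event_names):
    """Build a global event table and per-rank local-ID to global-ID maps.

    local_event_names[r] lists rank r's event names in local-ID order.
    """
    global_names = sorted({name for names in local_event_names for name in names})
    global_id = {name: i for i, name in enumerate(global_names)}
    local_to_global = [[global_id[name] for name in names]
                       for names in local_event_names]
    return global_names, local_to_global


def merge_profiles(local_event_names, local_values):
    """Sum per-rank values event by event using the unified IDs."""
    global_names, maps = unify_events(local_event_names)
    merged = [0.0] * len(global_names)
    for mapping, values in zip(maps, local_values):
        for local_id, value in enumerate(values):
            merged[mapping[local_id]] += value
    return dict(zip(global_names, merged))
```

Because ranks may register events in different orders, or not encounter some events at all, the unification step has to come before any element-wise reduction. After it, the merged profile has one entry per distinct event rather than one per event per rank.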
With increases in the scale of parallelism, the dimensionality and
complexity of parallel performance measurements have placed greater
challenges on analysis tools. Performance visualization can assist in
understanding performance properties and relationships. However, the
creation of new visualizations in practice is not supported by existing
parallel profiling tools. Users must work with presentation types
provided by a tool and have limited means to change its design. Here we
present an approach for creating new performance visualizations within an
existing parallel profile analysis tool. The approach separates visual
layout design from the underlying performance data model, making custom
visualizations such as performance over system topologies straightforward
to implement and adjust for various use cases.
The developers of high-performance scientific applications often work
in complex computing environments that place heavy demands on program
analysis tools. The developers need tools that interoperate, are
portable across machine architectures, and provide source-level feedback. In this paper, we describe a tool framework, the Program Database Toolkit (PDT), that supports the development of program analysis tools meeting these requirements. PDT uses compile-time information to create a complete database of high-level program information that is structured for well-defined and uniform access by tools and applications. PDT's current applications make heavy use of advanced features of C++, in particular, templates. We describe the toolkit, focussing on its most important contribution -- its handling of templates -- as well as its use in existing applications.
Parallel applications running on high-end computer systems
manifest a complexity of performance phenomena. Tools
to observe parallel performance attempt to capture these
phenomena in measurement datasets rich with information
relating multiple performance metrics to execution dynamics
and parameters specific to the application-system experiment.
However, the potential size of datasets and the need
to assimilate results from multiple experiments make it a
daunting challenge to not only process the information, but
discover and understand performance insights. In this paper,
we present PerfExplorer, a framework for parallel performance
data mining and knowledge discovery. The framework
architecture enables the development and integration
of data mining operations that will be applied to large-scale
parallel performance profiles. PerfExplorer operates as a
client-server system and is built on a robust parallel performance
database (PerfDMF) to access the parallel profiles
and save its analysis results. Examples are given demonstrating
these techniques for performance analysis of ASCI
applications.
PerfTrack is a data store and interface for managing performance data from large-scale parallel applications. Data collected in different locations and formats can be compared and viewed in a single performance analysis session. The underlying data store used in PerfTrack is implemented with a database management system (DBMS). PerfTrack includes interfaces to the data store and scripts for automatically collecting data describing each experiment, such as build and platform details. We have implemented a prototype of PerfTrack that can use Oracle or PostgreSQL for the data store. We demonstrate the prototype's functionality with three case studies: one is a comparative study of an ASC Purple benchmark on high-end Linux and AIX platforms; the second is a parameter study conducted at Lawrence Livermore National Laboratory (LLNL) on two high-end platforms, a 128-node cluster of IBM Power 4 processors and BlueGene/L; the third demonstrates incorporating performance data from the Paradyn Parallel Performance Tool into an existing PerfTrack data store.
The performance of a parallel application on a scalable HPC system is determined by user-level execution of the application code and system-level (OS kernel) operations. To understand the influences of system-level factors on application performance, the measurement of OS kernel activities is key. We describe a technology to observe kernel actions and make this information available to application-level performance measurement tools. The benefits of merged application and OS performance information and its use in parallel performance analysis are demonstrated, both for profiling and tracing methodologies. In particular, we focus on the problem of kernel noise assessment as a stress test of the approach. We show new results for characterizing noise and introduce new techniques for evaluating noise interference and its effects on application execution. Our kernel measurement and noise analysis technologies are being developed as part of Linux OS environments for scalable parallel systems.
Automating the process of parallel performance experimentation, analysis, and problem diagnosis can enhance environments for performance-directed application development, compilation, and execution. This is especially true when parametric studies, modeling, and optimization strategies require large amounts of data to be collected and processed for knowledge synthesis and reuse. This paper describes the integration of the PerfExplorer performance data mining framework with the OpenUH compiler infrastructure. OpenUH provides autoinstrumentation of source code for performance experimentation and PerfExplorer provides automated and reusable analysis of the performance data through a scripting interface. More importantly, PerfExplorer inference rules have been developed to recognize and diagnose performance characteristics important for optimization strategies and modeling. Three case studies are presented which show our success with automation in OpenMP and MPI code tuning, parametric characterization, and power modeling. The paper discusses how the integration supports performance knowledge engineering across applications and feedback-based compiler optimization in general.
In this paper we give an overview of SCALEA, which is a new performance analysis tool for OpenMP, MPI, HPF, and mixed parallel/distributed programs. SCALEA instruments, executes and measures programs and computes a variety of performance overheads based on a novel overhead classification. Source code and HW-profiling are combined in a single system which significantly extends the scope of possible overheads that can be measured and examined, ranging from HW-counters, such as the number of cache misses or floating point operations, to more complex performance metrics, such as control or loss of parallelism. Moreover, SCALEA uses a new representation of code regions, called the dynamic code region call graph, which enables detailed overheads analysis for arbitrary code regions. An instrumentation description file is used to relate performance information to code regions of the input program and to reduce instrumentation overhead. Several experiments with realistic codes that cover MPI, OpenMP, HPF and mixed OpenMP/MPI codes demonstrate the usefulness of SCALEA.
Simulations on structured adaptively refined meshes (SAMR) pose unique problems in
the context of performance evaluation and modeling. Adaptively refined meshes aim to
concentrate grid points in regions of interest while leaving the bulk of the domain
sparsely tessellated. Structured adaptively refined meshes achieve this by having overlaid
grids of different refinement. Numerical algorithms employing explicit multi-rated time-
stepping methods apply a computational "kernel" to the finer meshes at a higher
frequency than at the coarser meshes. Each application of the kernel at a given level of
refinement is followed up by a communication step where data is exchanged with
neighboring subdomains.
The SAMR approach is adaptive, i.e. its characteristics change as the simulation evolves
in time. Thus, scalability depends on the number of processors and the time-integrated
effect of the physics of the problem. The time-integrated effect renders the estimation
of a general metric of scalability difficult and often impossible. Generally, as reported in
the literature, for realistic problems and configurations, SAMR simulations do not scale
well.
For this work we analyzed two different hydrodynamic problems and present how
communication costs scale with various aspects of the domain decomposition.
Approach:
The codes that we analyzed solve PDEs to simulate reactive flows and flows with shock
waves. The codes were run until the incremental decrease in run times (with increasing
processors) approached zero. It was found that the nature of the problem changed vastly
during the run - even runs which showed poor scaling had periods of evolution where
the domain decomposition showed "good" scaling characteristics, i.e., compute loads
were higher than communication loads. The computational load was found to be evenly
balanced across the processors - the lack of scalability was due to the dominance of
communication and synchronization costs over computational costs.
We identified and analyzed phases in the evolution of the problem where the simulation
exhibited good and bad scaling. Communication costs were analyzed with respect to the
levels of refinement of the grid as well as the data-exchange radius for each of the runs.
This is a thorough performance analysis of SAMR hydrodynamics codes, performed for
the first time in CCA-compliant codes, tackling the time-dependent nature of the
communication overheads.
Both the codes that we analyzed employ the Common Component Architecture (CCA)
paradigm and were run within the CCAFFEINE framework. The adaptive mesh package
used (that performs the bulk of the communications) was GrACE (Rutgers, The State
University of New Jersey). The measurements were performed using the CCA version of
TAU (Tuning and Analysis Utilities). The tests were performed on "platinum" at NCSA
(University of Illinois, Urbana Champaign), a Linux cluster of dual-node Pentium III 1
GHz processors, connected via a Myrinet interconnect.
Visual:
As a part of the visual presentation, we will present a color poster with our performance
analysis results and hold a demonstration of the composition and execution of CCA
codes. Animations of the adaptively refined grid will also be shown.
The ability to understand the behavior of concurrent programs depends greatly
on the facilities available to monitor execution and present the results to the
user. Beyond the basic profiling tools that collect data for post-mortem viewing,
explorative use of multiprocessor computer systems demands a dynamic
monitoring environment capable of providing run-time access to program
performance. A prototype of such an environment has been built for the Cedar
multiprocessor. This paper describes the design of the infrastructure enabling
run-time monitoring of parallel Cedar applications and the communication of
execution data among physically distributed machines. An application for matrix
visualization is used to highlight important aspects of the system.
Important insights into program operation can be gained by observing dynamic
execution behavior. Unfortunately, many high-performance machines provide
execution profile summaries as the only tool for performance investigation. We
have developed a tracing library for the Cray X-MP and Cray 2 supercomputers
that supports the low-overhead capture of execution events for sequential and
multitasked programs. This library has been extended to use the automatic
instrumentation facilities on these machines, allowing trace data from routine
entry and exit, and other program segments, to be captured. To assess the utility
of the trace-based tools, three of the Perfect Benchmark codes have been tested
in scalar and vector modes with the tracing instrumentation. In addition to
computing summary execution statistics from the traces, interesting execution
dynamics appear when studying the trace histories. It is also possible to
compare codes across the two architectures by correlating the event traces. Our
conclusion is that adding tracing support in Cray supercomputers can have
significant returns in improved performance characterization and evaluation.
Supercomputing is rapidly becoming a global phenomenon. In keeping with the
Voyages of Discovery theme of the Supercomputing 92 conference,
representatives of supercomputing endeavors from around the world meet in
this mini-symposium to speak on national and international supercomputing
activities.
pC++ is a language extension to C++ designed to allow programmers to
compose "concurrent aggregate" collection classes which can be
aligned and distributed over the memory hierarchy of a parallel
machine in a manner modeled on the High Performance Fortran Forum
(HPFF) directives for Fortran 90. pC++ allows the user to write
portable and efficient code which will run on a wide range of scalable
parallel computer systems. The first version of the compiler is a
preprocessor which generates Single Program Multiple Data (SPMD) C++
code. Currently, it runs on the Thinking Machines CM-5, the Intel
Paragon, the BBN TC2000, the Kendall Square Research KSR-1, and the
Sequent Symmetry. In this paper we describe the implementation of the
runtime system, which provides the concurrency and communication
primitives between objects in a distributed collection. To illustrate
the behavior of the runtime system we include a description and
performance results on four benchmark programs.
Scientists from many disciplines now routinely use modeling and
simulation techniques to study physical and biological phenomena.
Advances in high-performance architectures and networking have made
it possible to build complex simulations with parallel and distributed
interacting components. Unfortunately, the software needed to support
such complex simulations has lagged behind hardware developments.
We focus here on one aspect of such support: runtime program interaction.
We have developed a runtime interaction framework and we have implemented
a specific instance of it for an application in seismic tomography. That
instance, called TierraLab, extends the geoscientists' existing (legacy)
tomography code with runtime interaction capabilities which they access
through a MATLAB interface. The scientist can stop a program, retrieve
data, analyze and visualize that data with existing MATLAB routines, modify
the data, and resume execution. They can do this all within a familiar
MATLAB-like environment without having to be concerned with any of the
low-level details of parallel or distributed data distribution. Data
distribution is handled transparently by the Distributed Array Query and
Visualization (DAQV) system. Our framework allows scientists to construct
and maintain their own customized runtime interaction system.
In this paper we present the ViNE system architecture and a case study
of its use in neuropsychology research at the University of
Oregon. Our case study with the Brain Electrophysiology Laboratory
(BEL) addresses their need for data security and management,
collaborative support, and distributed analysis processes. The current
version of ViNE is a prototype system being tested with this and other
scientific applications.
The Virtual Notebook Environment (ViNE) is a platform-independent,
web-based interface designed to support a range of scientific
activities across distributed, heterogeneous computing platforms. ViNE
provides scientists with a web-based version of the common paper-based
lab notebook, but in addition, it provides support for collaboration
and management of computational experiments. Collaboration is
supported with the web-based approach, which makes notebook material
generally accessible and with a hierarchy of security mechanisms that
screen that access. ViNE provides uniform, system-transparent access
to data, tools, and programs throughout the scientist's computing
infrastructure. Computational experiments can be launched from ViNE
using a visual specification language. The scientist is freed from
concerns about inter-tool connectivity, data distribution, or data
management details. ViNE also provides support for dynamically linking
analysis results back into the notebook content.
The performance of the Eulerian gyrokinetic-Maxwell solver code GYRO is analyzed on five
high performance computing systems. First, a manual approach is taken, using custom scripts
to analyze the output of embedded wallclock timers, floating point operation counts collected
using hardware performance counters, and traces of user and communication events collected
using the profiling interface to Message Passing Interface (MPI) libraries. Parts of the analysis
are then repeated or extended using a number of sophisticated performance analysis tools:
IPM, KOJAK, SvPablo, TAU, and the PMaC modeling tool suite. The paper briefly discusses
what has been discovered via this manual analysis process, what performance analyses are
inconvenient or infeasible to attempt manually, and to what extent the tools show promise in
accelerating or significantly extending the manual performance analyses.
Developing robust techniques for visualizing the performance behavior
of parallel programs that can scale in problem size and/or number of
processors remains a challenge. In this paper, we present several
performance visualization techniques based on the context of
data-parallel programming and execution that demonstrate good visual
scalability properties. These techniques are a result of utilizing the
structural and distribution semantics of data-parallel programs as
well as sophisticated three-dimensional graphics. A categorization and
examples of scalable performance visualizations are given for programs
written in Dataparallel C and pC++.
The inherently sequential nature of event list manipulation limits the
potential parallelism of standard simulation models. Although techniques for
performing event list manipulation and event simulation in parallel have been
suggested, large scale performance increases seem unlikely. Only by
eliminating the event list, in its traditional form, can additional parallelism be
obtained; this is the goal of distributed simulation. Several distributed
simulation techniques have been proposed. In the remainder of this abstract,
we present the Chandy-Misra distributed simulation algorithm and the results
of an extensive study of its performance on a shared memory parallel processor
when simulating queueing network models.
The parallel scientific computing community is placing increasing emphasis on
portability and scalability of programs, languages, and architectures. This
creates new challenges for developers of parallel performance analysis tools,
who will have to deal with increasing volumes of performance data drawn from
diverse platforms. One way to meet this challenge is to incorporate
sophisticated facilities for data interpretation and experiment planning within the
tools themselves, giving them increased flexibility and autonomy in gathering
and selecting performance data. This panel discussion brings together four
research groups that have made advances in this direction.
Parallel programs are complex and often require a multilevel debugging
strategy that combines both event- and state-based debugging. We
report here on preliminary work that combines these approaches within
the TAU program analysis environment for pC++. This work extends the
use of event-based modeling to object-parallel languages, provides an
alternative mechanism for establishing meaningful global breakpoints
in object-oriented languages, introduces the TAU program interaction
and control infrastructure, and provides an environment for the
assessment of mixed event- and state-based strategies.
Performance measurement of parallel, object-oriented (OO) programs
requires the development of instrumentation and analysis techniques
beyond those used for more traditional languages. Performance events
must be redefined for the conceptual OO programming model, and those
events must be instrumented and tracked in the context of OO language
abstractions, compilation methods, and run-time execution dynamics. In
this paper, we focus on the profiling and tracing of C++ applications
that have been written using a rich parallel programming framework for
high-performance, scientific computing. We address issues of
class-based profiling, instrumentation of templates, runtime function
identification, and polymorphic (type-based) profiling. Our solutions
are implemented in the TAU portable profiling package which also
provides support for profiling groups and user-level timers. We
demonstrate TAU's C++ profiling capabilities for real parallel
applications, built from components of the ACTS toolkit. Future
directions include work on runtime performance data access, dynamic
instrumentation, and higher-level performance data analysis and
visualization that relates object semantics with performance execution
behavior.
Developers of static and dynamic analysis tools for C++ programs need
access to information on functions, classes, templates, and macros in
parsed C++ code. Existing tools, such as the EDG display tool,
provide that access, but in an unsuitable format. We built a converter
that prunes and reorganizes the information into the appropriate
format. The converter provides the information needed for our TAU
(Tuning and Analysis Utilities) tools and, in more general terms,
provides C++ developers considerable opportunities for automating
software development.
The TAU Performance System is an integrated suite of tools for instrumentation, measurement, and analysis of parallel programs targeting large-scale, high-performance computing (HPC) platforms. Representing over fifteen calendar years and fifty person years of research and development effort, TAU's driving concerns have been portability, flexibility, interoperability, and scalability. The result is a performance system which has evolved into a leading framework for parallel performance evaluation and problem solving. This paper presents the current state of TAU, overviews the design and function of TAU's main features, discusses best practices of TAU use, and outlines future development.
Although there are a number of performance tools available to DoD users, the process of performance analysis and tuning has yet to become an integral part of the DoD software development cycle. Instead, performance analysis and tuning is the domain of a small number of experts who cannot possibly address all the codes that need attention. We believe the main reasons for this are a lack of knowledge about these tools, the real or perceived steep learning curve required to use them, and the absence of a centralized method that incorporates their use in the software development cycle.
This paper presents ongoing efforts to enable a larger number of DoD HPCMP users to benefit from available performance analysis tools by integrating them into the Eclipse Parallel Tools Platform (Eclipse/PTP), an integrated development environment for parallel programs.
Visualization tools that display data as it is manipulated by a
parallel, MIMD computation must contend with the effects of
asynchronous execution. We have developed techniques that manipulate
logical time in order to produce coherent animations of parallel
program behavior despite the presence of asynchrony. Our techniques
``interpret'' program behavior in light of user-defined abstractions
and generate animations based on a logical rather than a physical view
of time. If this interpretation succeeds, the resulting animation is
easily understood; if it fails, the programmer can be assured that the
failure was not an artifact of the visualization. Here we demonstrate
that these techniques can be generally applied to enhance
visualizations of a variety of types of data as it is produced by
parallel, MIMD computations.
The use of global address space languages and one-sided communication for complex applications is gaining attention in the parallel computing community. However, lack of good evaluative methods to observe multiple levels of performance makes it difficult to isolate the cause of performance deficiencies and to understand the fundamental limitations of system and application design for future improvement. NWChem is a popular computational chemistry package which depends on the Global Arrays / ARMCI suite for partitioned global address space functionality to deliver high-end molecular modeling capabilities. A workload characterization methodology was developed to support NWChem performance engineering on large-scale parallel platforms. The research involved both the integration of performance instrumentation and measurement in the NWChem software, as well as the analysis of one-sided communication performance in the context of NWChem workloads. Scaling studies were conducted for NWChem on Blue Gene/P and on two large-scale clusters using different generation Infiniband interconnects and x86 processors.
The performance analysis and results show how subtle changes in the runtime parameters related to the communication subsystem could have significant impact on performance behavior. The tool has successfully
identified several algorithmic bottlenecks which are already being tackled by computational chemists to improve NWChem performance.
Performance monitoring of HPC applications offers opportunities for adaptive optimization based on dynamic performance behavior, unavailable in purely post-mortem performance views. However, a parallel performance monitoring system must have low overhead and high efficiency to make these opportunities tangible. We describe a scalable parallel performance monitor called TAUoverMRNet (ToM), created from the integration of the TAU performance system and the Multicast Reduction Network (MRNet). The integration is achieved through a plug-in architecture in TAU that allows selection of different transport substrates to offload online performance data. A method to establish the transport overlay structure of the monitor from within TAU, one that requires no added support from the job manager or application, is presented. We demonstrate the distribution of performance analysis from the sink to the overlay nodes and the reduction in large-scale profile data that could otherwise overwhelm any single sink. Results show low perturbation and significant savings accrued from reduction at large processor-counts.
Characterizing the performance of scientific applications is essential for effective code optimization, both by compilers and by high-level adaptive numerical algorithms. While maximizing power efficiency is becoming increasingly important in current high-performance architectures, little or no hardware or software support exists for detailed power measurements. Hardware counter-based power models are a promising method for guiding software-based techniques for reducing power. We present a component-based infrastructure for performance and power modeling of parallel scientific applications. The power model leverages on-chip performance hardware counters and is designed to model power consumption for modern multiprocessor and multicore systems. Our tool infrastructure includes application components as well as performance and power measurement and analysis components. We collect performance data using the TAU performance component and apply the power model in the performance and power analysis of a PETSc-based parallel fluid dynamics application by using the PerfExplorer component.
This work targets the emerging use of software component technology
for high-performance scientific parallel and distributed computing.
While component software engineering will benefit the construction of
complex science applications, its use presents several challenges to
performance measurement, analysis, and optimization. The
performance of a component application depends on the interaction
(possibly non-linear) of the composed component set.
Furthermore, a component is a "binary unit of
composition" and the only information users have is the interface the
component provides to the outside world. A performance engineering
methodology and development approach is presented to address
evaluation and optimization issues in high-performance component
environments. We describe a prototype implementation of a performance
measurement infrastructure for the Common Component Architecture (CCA)
system. A case study demonstrating the use of this technology for
integrated measurement, monitoring, and optimization
in CCA component-based applications is given.
Scientific parallel programs often undergo significant performance tuning before meeting
their performance expectation. Performance tuning naturally involves a diagnosis
process locating performance bugs that make a program inefficient and explaining
them in terms of high-level program design. We present a systematic approach to
generating performance knowledge for automatically diagnosing parallel programs. Our
approach exploits program semantics and parallelism found in parallel programming
patterns to search and explain bugs. The approach addresses how to extract the expert
knowledge required for performance diagnosis from parallel patterns and represents
the knowledge in a manner such that the diagnosis process can be automated. We
demonstrate the effectiveness of our knowledge engineering approach through a case
study. Our experience diagnosing Divide-and-Conquer programs shows that
pattern-based performance knowledge can provide effective guidance for locating and explaining
performance bugs at a high level of program abstraction.
A parallel component environment places constraints on performance measurement and
modeling. For instance, it must be possible to instrument the application without access to
the source code. In addition, a component may admit multiple implementations, based on the
choice of algorithm, data structure, parallelization strategy, etc., leaving the user with
the problem of choosing the correct implementation to achieve an optimal (fastest)
component assembly. Under the assumption that an empirical performance model exists for
each implementation of each component, simply choosing the optimal implementation of each
component does not guarantee an optimal component assembly, since components interact with
each other. An optimal solution may be obtained by evaluating the performance of all
possible realizations of a component assembly given the components and all of their
implementations, but the exponential complexity renders the approach infeasible as the
number of components and their implementations rises. This paper describes a non-intrusive,
coarse-grained performance monitoring system that allows the user to gather performance
data through the use of proxies. In addition, a simple optimization library that
identifies a nearly optimal configuration is proposed. Finally, some experimental results
are presented that illustrate the measurement and optimization strategies.
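The gap between per-component and whole-assembly optimization can be illustrated with a
small sketch. Everything in it is hypothetical: the component names, the implementation
names, the standalone costs, and the pairwise interaction penalties are made-up numbers,
not measurements from the paper. The exhaustive search shown is the exponential baseline
the abstract mentions, not the proposed optimization library.

```python
from itertools import product

# Hypothetical standalone cost (seconds) of each implementation of each component.
standalone = {
    "solver": {"dense": 4.0, "sparse": 5.0},
    "mesh": {"array": 3.0, "tree": 3.5},
}

# Hypothetical interaction cost between a solver choice and a mesh choice,
# e.g. a data-layout conversion at the component boundary.
interaction = {
    ("dense", "array"): 3.0,
    ("dense", "tree"): 0.5,
    ("sparse", "array"): 0.5,
    ("sparse", "tree"): 0.0,
}


def assembly_cost(choice):
    """Total cost of one assembly: standalone costs plus the interaction term."""
    solver_impl, mesh_impl = choice
    return (standalone["solver"][solver_impl]
            + standalone["mesh"][mesh_impl]
            + interaction[(solver_impl, mesh_impl)])


def greedy_choice():
    """Pick the cheapest implementation of each component in isolation."""
    return tuple(min(impls, key=impls.get)
                 for impls in (standalone["solver"], standalone["mesh"]))


def exhaustive_choice():
    """Evaluate every assembly; the count is the product of the
    per-component implementation counts, hence exponential growth."""
    candidates = product(standalone["solver"], standalone["mesh"])
    return min(candidates, key=assembly_cost)


greedy = greedy_choice()
best = exhaustive_choice()
print("greedy:", greedy, assembly_cost(greedy))
print("exhaustive:", best, assembly_cost(best))
```

With these assumed numbers, the per-component choice costs more than the best assembly
because the interaction term dominates, which is the effect the abstract describes.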
Fueled by increasing processor speeds and high speed interconnection
networks, advances in high performance computer architectures have allowed the development of increasingly complex large scale parallel systems. For computational scientists, programming these systems efficiently is a challenging task. Understanding the performance of their parallel applications is equally daunting. To observe and comprehend the performance of parallel applications that run on these systems, we need
performance evaluation tools that can map the performance abstractions to the user's mental models of application execution. For instance, most
parallel scientific applications are iterative in nature. In the case of CFD applications, they may also dynamically adapt to changes in the simulation model. A performance measurement and analysis system that can
differentiate the phases of each iteration and characterize performance
changes as the application adapts will enable developers to better relate performance to their application behavior. In this paper, we present new performance measurement techniques to meet these needs. In section 2, we describe our parallel performance system, TAU. Section 3 discusses how new TAU profiling techniques can be applied to
CFD applications with iterative and adaptive characteristics. In section 4, we present a case study featuring the Uintah computational
framework and explain how adaptive computational fluid dynamics simulations are observed using TAU. Finally, we conclude with a discussion of how the TAU performance system can be
broadly applied to other CFD frameworks and present a few examples of its usage in this field.
Chasm is a toolkit providing seamless language interoperability between
Fortran95 and C++. Language interoperability is important to scientific
programmers because scientific applications are predominantly written in
Fortran, while software tools are mostly written in C++. Two design features
differentiate Chasm from other related tools. First, we avoid the problem of
`least common denominator' type systems and programming models, something
found in most IDL-based interoperability systems. Chasm uses the intermediate
representation generated by a compiler front-end for each supported language
as its source of interface information instead of an IDL. Second, bridging
code is generated for each pairwise language binding, removing the need for a
common intermediate data representation and multiple levels of indirection
between the caller and callee. These features make Chasm a simple system that
performs well and requires minimal user intervention; in most instances,
bridging code generation can be performed automatically. Reliance on
standards such as XML and industrial strength compiler technology reduces the
complexity and scope of the Chasm toolset making it easily extensible and
highly portable.
The influences of the operating system and system-specific effects on application performance are increasingly important considerations in high performance computing. OS kernel measurement is key to understanding the performance influences and the interrelationship of system and user-level performance factors. The KTAU (Kernel TAU) methodology and Linux-based framework provide parallel kernel performance measurement from both a kernel-wide and a process-centric perspective. The first characterizes
overall aggregate kernel performance for the entire system. The second characterizes kernel performance when it runs in the context of a particular process. KTAU extends the TAU performance system with kernel-level monitoring, while leveraging TAU's measurement and analysis capabilities. We explain the rationale and motivations behind our approach, describe the KTAU design and implementation, and show working examples on multiple platforms demonstrating the versatility of KTAU in integrated system/application
monitoring.
We investigate operating system noise, which we identify as one of the main reasons for a lack of synchronicity in parallel applications. Using a micro-benchmark, we measure the noise on several contemporary platforms and find that, even with a general-purpose operating system, noise can be limited if certain precautions are taken. We then inject artificially generated noise into a massively parallel system and measure its influence on the performance of collective operations. Our experiments indicate that on extreme-scale platforms, the performance is correlated
with the largest interruption to the application, even if the probability of such an interruption on a single process is extremely small. We demonstrate that synchronizing the noise can significantly reduce its negative influence.
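A toy model can make the role of the largest interruption concrete. The sketch below is
not the micro-benchmark or the noise injection used in the study; it assumes a
hypothetical bulk-synchronous loop in which each step ends with a collective that
completes only when the slowest process arrives, and in which every process is
interrupted once every `num_procs` steps for a fixed `detour`. All parameter names and
values are illustrative.

```python
def run_time(num_procs, num_steps, work, detour, synchronized):
    """Total time of a bulk-synchronous loop under periodic noise.

    Each step costs `work` plus the longest interruption seen by any
    process in that step, because the collective waits for the slowest one.
    """
    total = 0.0
    for step in range(num_steps):
        delays = []
        for rank in range(num_procs):
            # Synchronized noise hits all ranks in the same step;
            # unsynchronized noise hits each rank in a different step.
            phase = 0 if synchronized else rank
            interrupted = (step % num_procs) == phase
            delays.append(detour if interrupted else 0.0)
        total += work + max(delays)
    return total


unsync = run_time(num_procs=4, num_steps=8, work=1.0, detour=0.5,
                  synchronized=False)
sync = run_time(num_procs=4, num_steps=8, work=1.0, detour=0.5,
                synchronized=True)
print("unsynchronized:", unsync)
print("synchronized:", sync)
```

In this assumed model the unsynchronized run pays the detour in every step, since some
process is always interrupted, while the synchronized run pays it only in the steps where
all processes are interrupted together.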
Data visualization can help users decipher scientific and engineering data and
better comprehend large, complex data sets. The authors present a high-level
abstract model for performance visualization that relates behavior abstractions
to visual representations in a structured way. This model is based on two
principles: Displays of performance information are linked directly to parallel
performance models, and performance visualizations are designed and applied
in an integrated environment. The authors explain some advantages of
adhering to these principles. They begin by establishing a context for users to
clearly understand performance information, defining terms such as
perspective, semantic context, and subview mapping. Next, they describe the
techniques used to scale graphical views as data sets become very large.
Finally, they discuss concepts such as user perception and interaction,
comparisons and cross-correlations between related views or representations,
and information extraction. On the basis of this conceptual foundation, the
authors present examples of practical applications for the model. These case
studies address topics such as concurrency and communication in data-parallel
computation, access patterns for data distributions, and critical paths in parallel
computation. The authors conclude by discussing the relationship between
performance visualization and general scientific visualization.
Parallel Java environments present challenging problems for performance tools because
of Java's rich language system and its multi-level execution platform combined with the
integration of native-code application libraries and parallel runtime software. In addition
to the desire to provide robust performance measurement and analysis capabilities for
the Java language itself, the coupling of different software execution contexts under a
uniform performance model needs careful consideration of how events of interest are
observed and how cross-context parallel execution information is linked. This paper
relates our experience in extending the TAU performance system to a parallel Java
environment based on mpiJava. We describe the complexities of the instrumentation
model used, how performance measurements are made, and the overhead incurred. A
parallel Java application simulating the game of Life is used to show the performance
system's capabilities.
Scientific parallel programs often undergo significant performance tuning before meeting
their performance expectation. Performance tuning naturally involves a diagnosis process
locating performance bugs that make a program inefficient and explaining them in
terms of high-level program design. We present a systematic approach to generating
performance knowledge for automatically diagnosing parallel programs. Our approach
exploits program semantics and parallelism found in computational models to search and
explain bugs. We first identify categories of expert knowledge required for performance
diagnosis and describe how to extract the knowledge from computational models.
Second, we represent the knowledge in such a way that diagnosis can be carried out
in an automatic manner. Finally, we demonstrate the effectiveness of our knowledge
engineering approach through a case study. Our experience diagnosing Master-Worker
programs shows that model-based performance knowledge can provide effective guidance
for locating and explaining performance bugs at a high level of program abstraction.
This prospectus describes research to simplify programming of parallel computers. It focuses
specifically on performance diagnosis, the process of finding and explaining sources of
inefficiency in parallel programs. Considerable research already has been done to simplify
performance diagnosis, but with mixed success. Two elements are missing from existing
research: 1. There is no general theory of how expert programmers do performance diagnosis.
As a result, it is difficult for researchers to compare existing work or fit their work to
programmers. It is difficult for programmers to locate products of existing research that meet
their needs. 2. There is no automated, adaptable software to help programmers do
performance diagnosis. Existing software is either automated but limited to very specific
circumstances, or in general, not automated for most tasks. The research described here
addresses both of these issues. The research will develop and validate a theory of
performance diagnosis, based on general models of diagnostic problem-solving. It will
design and evaluate a computer program (called Poirot) that employs the theory to
automatically, adaptably support performance diagnosis.
To make effective use of parallel computing environments, users have come to expect a broad set of tools that augment parallel programming and execution infrastructure with capabilities such as performance evaluation, debugging, runtime program control, and program interaction. The rich history of parallel tool research and development reflects both fundamental issues in concurrent processing and a progressive evolution of tool implementations, targeting current and future parallel computing goals. The state of tools for parallel computing is discussed from a perspective of performance evaluation. Many of the challenges that arise in parallel performance tools are common to other tool areas. I look at four such challenges: modeling, observability, diagnosis, and perturbation. The need for tools will always be present in parallel and distributed systems, but the emphasis on tool support may change. The discussion given is intentionally high-level, so as not to exclude the many important ideas that have come from parallel tool projects. Rather, I attempt to present viewpoints on the challenges that I think would be of concern in any performance tool design.
Performance visualization uses graphical display techniques to analyze performance data and improve understanding of complex performance phenomena. Current parallel performance visualizations are predominantly two-dimensional. A primary goal of our work is to develop new methods for rapidly prototyping multidimensional performance visualizations. By applying the tools of scientific visualization, we can prototype these next-generation displays for performance visualization -- if not implement them as end-user tools -- using existing software products and graphical techniques that physicists, oceanographers, and meteorologists have used for several years.
Parallel systems pose a unique challenge to performance measurement and instrumentation.
The complexity of these systems manifests itself as an increase in performance complexity as
well as programming complexity. The complex interaction of the many architectural,
hardware, and software features of these systems results in a significantly larger space of
possible performance behavior and potential performance bottlenecks. Programming parallel
systems requires that users understand the performance characteristics of the machines and
be able to modify their programs and algorithms accordingly. The instrumentation problem,
therefore, is to develop tools to aid the user in investigating performance problems and in
determining the most effective way of exploiting the high performance capabilities of parallel
systems.
This paper gives observations on the parallel system instrumentation problem in the context
of the Cedar multiprocessor. The Cedar system integrates several architectural, hardware,
and software concepts for parallel operation. The combination makes Cedar a particularly
interesting machine for investigating instrumentation issues and developing prototype tools.
The different needs for performance evaluation on the Cedar machine define the
instrumentation requirements. The implementation of instrumentation tools, however,
involves tradeoffs in design, resolution, and accuracy, and must be weighed against the
payoff in better performance evaluation. This discussion of instrumentation tools targeted
for Cedar considers these tradeoffs.
The Common Component Architecture (CCA) provides a means for software
developers to manage the complexity of large-scale scientific
simulations and to move toward a \emph{plug-and-play} environment for
high-performance computing. In the scientific computing context,
component models also promote collaboration using independently
developed software, thereby allowing particular individuals or groups to focus
on the aspects of greatest interest to them. The CCA supports
parallel and distributed computing as well as local high-performance
connections between components in a language-independent manner. The
design places minimal requirements on components and thus facilitates
the integration of existing code into the CCA environment. The CCA
model imposes minimal overhead to minimize the impact on application
performance. The focus on high performance distinguishes the CCA from
most other component models. The CCA is being applied within an
increasing range of disciplines, including combustion research, global
climate simulation, and computational chemistry.
The ability of performance technology to keep pace with the growing
complexity of parallel and distributed systems depends on robust
performance frameworks that can at once provide system-specific
performance capabilities and support high-level performance problem
solving. Flexibility and portability in empirical methods and processes are
influenced primarily by the strategies available for instrumentation and
measurement, and how effectively they are integrated and composed. This
paper presents the TAU (Tuning and Analysis Utilities) parallel performance
system and describes how it addresses diverse requirements for performance
observation and analysis.
Performance profiling generates measurement overhead during parallel program execution. Measurement overhead, in turn, introduces intrusion in a program's runtime performance behavior. Intrusion can be mitigated by controlling instrumentation degree, allowing a tradeoff of accuracy for detail. Alternatively, the accuracy in profile results can be improved by reducing the intrusion error due to measurement overhead. Models for compensation of measurement overhead in parallel performance profiling are described. An approach based on rational reconstruction is used to understand properties of compensation solutions for different parallel scenarios. From this analysis, a general algorithm for on-the-fly overhead assessment and compensation is derived.
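The basic idea of overhead compensation can be sketched for the simplest sequential case.
This is an illustration under stated assumptions, not the models or the on-the-fly
algorithm derived in the paper: it assumes a constant, previously calibrated cost per
instrumented event, and assumes that each routine's measured inclusive time contains that
cost once for every instrumented event executed inside it. The function name, routine
names, and numbers are hypothetical.

```python
def compensate(measured_inclusive, nested_events, per_event_overhead):
    """Subtract estimated measurement overhead from inclusive times.

    measured_inclusive: routine name -> measured inclusive time
    nested_events:      routine name -> number of instrumented events that
                        ran inside the routine (including its own)
    per_event_overhead: assumed constant cost of one instrumented event,
                        in the same time unit as the measurements
    """
    return {
        name: measured_inclusive[name] - nested_events[name] * per_event_overhead
        for name in measured_inclusive
    }


# Hypothetical profile, in microseconds.
measured = {"main": 120.0, "solve": 80.0}
events = {"main": 40, "solve": 30}
corrected = compensate(measured, events, per_event_overhead=0.5)
print(corrected)
```

Parallel executions are harder because overhead on one thread can shift waiting time on
another, which this sequential sketch does not attempt to model.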
A variety of systems have been developed to interact with parallel
programs for purposes of debugging, monitoring, visualization, and
computational steering. In addition to addressing different
functional objectives, these systems have nonfunctional
characteristics that are equally important for a user to know.
Clearly, for most users, performance is an important nonfunctional
requirement of a program interaction system. However, characterizing
the performance of an interaction system for parallel programs is
particularly challenging, especially in asynchronous, distributed
environments. This paper presents a comprehensive performance
analysis of the DAQV system. DAQV has been successfully applied in
runtime data visualization, on-line performance monitoring, and
computational steering environments. However, DAQV's suitability
depends significantly on application context and requirements. By
giving a full accounting of DAQV performance, we aim to provide
application and environment developers with valuable information about
DAQV's potential benefits, before an integration effort takes place.
For us as DAQV's designers, this in-depth performance analysis has led to new
insights, resulting in higher-performing designs.
The increasing complexity of high-performance computing environments and
programming methodologies presents challenges for empirical performance
evaluation. Evolving parallel and distributed systems require performance
technology that can be flexibly configured to observe different events and
associated performance data of interest. It must also be
possible to integrate performance evaluation techniques with the
programming paradigms and software engineering methods. This is
particularly important for tracking performance on parallel software
projects involving many code teams over many stages of development. This
paper describes the integration of the TAU and XPARE tools in the Uintah
Computational Framework (UCF). Discussed is the use of performance mapping
techniques to associate low-level performance data with higher levels of
abstraction in UCF and the use of performance regression testing to
provide a historical portfolio of the evolution of application performance.
A scalability study shows the benefits of integrating performance
technology in building large-scale parallel applications.
We report on our experiences in building a computational environment for tomographic image analysis for marine seismologists studying the structure and evolution of mid-ocean ridge volcanism. The computational environment is determined by an evolving set of requirements for this problem domain and includes needs for high-performance parallel computing, large data analysis, model visualization, and computation interaction and control. Although these needs are not unique in scientific computing, the integration of techniques for seismic tomography with tools for parallel computing and data analysis into a computational environment was (and continues to be) an interesting, important learning experience for researchers in both disciplines. For the geologists, the use of the environment led to fundamental geologic discoveries on the East Pacific Rise, the improvement of parallel ray tracing algorithms, and a better regard for the use of computational steering in aiding model convergence. The computer scientists received valuable feedback on the use of programming, analysis, and visualization tools in the environment. In particular, the tools for parallel program data query (DAQV) and visualization programming (Viz) were demonstrated to be highly adaptable to the problem domain. We discuss the requirements and the components of the environment in detail. Both accomplishments and limitations of our work are presented.
Nested OpenMP parallelism allows an application to spawn teams of nested threads. This hierarchical nature of thread creation and usage poses problems for performance measurement tools that must determine thread context to properly maintain per-thread performance data. In this paper we describe the problem and a novel solution for identifying threads uniquely. Our approach has been implemented in the TAU performance system and has been successfully used in profiling and tracing OpenMP applications with nested parallelism. We also describe how extensions to the OpenMP standard can help tool developers uniquely identify threads.
The performance of the Eulerian gyrokinetic-Maxwell solver code GYRO is analyzed on five high
performance computing systems. First, a manual approach is taken, using custom scripts to
analyze the output of embedded wallclock timers, floating point operation counts collected
using hardware performance counters, and traces of user and communication events collected
using the profiling interface to Message Passing Interface (MPI) libraries. Parts of the analysis
are then repeated or extended using a number of sophisticated performance analysis tools: IPM,
KOJAK, SvPablo, TAU, and the PMaC modeling tool suite. The paper briefly discusses what has
been discovered via this manual analysis process, what performance analyses are inconvenient
or infeasible to attempt manually, and to what extent the tools show promise in accelerating or
significantly extending the manual performance analyses.
This book constitutes the thoroughly refereed post-workshop proceedings of the First and the Second International Workshop on OpenMP, IWOMP 2005 and IWOMP 2006, held in Eugene, OR, USA, and in Reims, France, in June 2005 and 2006 respectively.
The first part of the book presents 16 revised full papers carefully reviewed and selected from the IWOMP 2005 program and organized in topical sections on performance tools, compiler technology, run-time environment, applications, as well as the OpenMP language and its evaluation.
In the second part there are 19 papers of IWOMP 2006, fully revised and grouped thematically in sections on advanced performance tuning, aspects of code development, applications, and proposed extensions to OpenMP.
To address the increasing complexity in parallel and distributed systems and
software, advances in performance technology towards more robust tools and
broader, more portable implementations are needed. In
doing so, new challenges for performance instrumentation, measurement,
analysis, and visualization arise to address evolving requirements for how
performance phenomena are observed and how performance data is
used. This paper presents recent advances in the TAU performance system in
four areas where improvements in performance technology are important:
instrumentation control, performance
The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems will depend on robust performance frameworks
that can at once provide system-specific performance capabilities and support high-level performance problem solving. The TAU system is offered as an example
framework that meets these requirements. With a flexible, modular instrumentation and measurement system, and an open performance data and analysis
environment, TAU can target a range of complex performance scenarios. Examples are given showing the diversity of TAU application.
To understand the complex interactions of the many factors contributing to supercomputer
performance, supercomputer designers and users must have access to an integrated
performance analysis system capable of measuring, analyzing, modeling, and predicting
performance across a hierarchy of details and goals. The performance analysis system being
developed for the CEDAR multiprocessor supercomputer embodies these characteristics and
is discussed in this paper.
Event tracing has become a popular form of gathering performance data on multiprocessor
computer systems. Indeed, a performance measurement facility has been developed for the
Cedar multiprocessor that uses tracing as a back-end mechanism for collecting several run-time
measurements including count, time, virtual memory, and event data. Tools to study an
event trace, however, are typically specialized according to the type of data collected. Usually
various trace analyses and displays are developed based on some event interpretation model.
Whereas this approach will give specific information about particular events and their
occurrences in a trace, it is not particularly easy to extend; new events often require new
analysis and display techniques.
This paper introduces three component technology initiatives within the SciDAC Center for
Technology for Advanced Scientific Component Software (TASCS) that address ever-increasing
productivity challenges in creating, managing, and applying simulation software to scientific
discovery. By leveraging the Common Component Architecture (CCA), a new component
standard for high-performance scientific computing, these initiatives tackle difficulties at
different but related levels in the development of component-based scientific software: (1)
deploying applications on massively parallel and heterogeneous architectures, (2) investigating
new approaches to the runtime enforcement of behavioral semantics, and (3) developing tools
to facilitate dynamic composition, substitution, and reconfiguration of component
implementations and parameters, so that application scientists can explore tradeoffs among
factors such as accuracy, reliability, and performance.
We report on some of the interactions between two SciDAC projects: The National
Computational Infrastructure for Lattice Gauge Theory (USQCD), and the Performance
Engineering Research Institute (PERI). Many modern scientific programs consistently report the
need for faster computational resources to maintain global competitiveness. However, as the
size and complexity of emerging high end computing (HEC) systems continue to rise, achieving
good performance on such systems is becoming ever more challenging. In order to take full
advantage of the resources, it is crucial to understand the characteristics of relevant scientific
applications and the systems these applications are running on. Using tools developed under
PERI and by other performance measurement researchers, we studied the performance of two
applications, MILC and Chroma, on several high performance computing systems at DOE
laboratories. In the case of Chroma, we discuss how the use of C++ and modern software
engineering and programming methods is driving the evolution of performance tools.
The speed and efficiency of the memory system is a key limiting factor in the performance of
supercomputers. Consequently, one of the major concerns when developing a high-
performance code, either manually or automatically, is determining and characterizing the
influence of the memory system on performance in terms of algorithmic parameters.
Unfortunately, the performance data available to an algorithm designer such as various
benchmarks and, occasionally, manufacturer-supplied information, e.g. instruction timings
and architecture component characteristics, are rarely sufficient for this task. In this paper,
we discuss a systematic methodology for probing the performance characteristics of a
memory system via a hierarchy of data-movement kernels. We present and analyze the
results obtained by such a methodology on a cache-based multi-vector processor (Alliant
FX/8). Finally, we indicate how these experimental results can be used for predicting the
performance of simple Fortran codes by a combination of empirical observations,
architectural models and analytical techniques.
A description is given of Faust, an integrated environment for the development of large,
scientific applications. Faust includes a project-management tool, a context editor that is
interfaced to a program database, and performance-evaluation tools. In Faust, all
applications work is done in the context of projects, which serve as the focal point for all tool
interactions. A project roughly corresponds to an executable program. Faust achieves
functional integration through operations on common data sets maintained in each project.
Sigma, a Faust tool designed to help users of parallel supercomputers retarget and optimize
application code, helps them either fine-tune parallel code that has been automatically
generated or optimize a new parallel algorithm's design. Faust includes a dynamic call-graph
tool and an integrated, multiprocessor performance analysis and characterization tool set.
The design, development, and application of Traceview, a general-purpose
trace-visualization tool that implements the trace-management and I/O features
usually found in special-purpose trace-analysis systems, are described. The
aspects of trace visualization that can be incorporated into a reusable tool are
identified. The tradeoff in general-purpose design versus semantically based,
detailed trace-data analysis is evaluated. Display methods and Traceview
applications are discussed.
The integration of scalable performance analysis in parallel development
tools is difficult. The potential size of data sets and the need to
compare results from multiple experiments present a challenge to manage
and process the information. Simply to characterize the performance
of parallel applications running on potentially hundreds of thousands of
processor cores requires new scalable analysis techniques. Furthermore,
many exploratory analysis processes are repeatable and could be automated, but are now implemented as manual procedures. In this paper, we will discuss the current version of PerfExplorer, a performance analysis framework which provides dimension reduction, clustering and correlation analysis of individual trials of large dimensions, and can perform relative performance analysis between multiple application executions. PerfExplorer analysis processes can be captured in the form of Python scripts, automating what would otherwise be time-consuming tasks. We will give examples of large-scale analysis results, and discuss the future development of the framework, including the encoding and processing of expert performance rules, and the increasing use of performance metadata.
(This paper is an expanded version of [europar96] and [tr9602].)
This paper describes the design and implementation of the Distributed Array Query and Visualization (DAQV) system for High Performance Fortran, a project sponsored by the Parallel Tools Consortium. DAQV's implementation leverages the HPF language, compiler, and runtime system to address the general problem of providing high-level access to distributed data structures. DAQV supports a framework in which visualization and analysis clients connect to a distributed array server (i.e., the HPF application with DAQV control) for program-level access to array values. Implementing key components of DAQV in HPF itself has led to a robust and portable solution in which clients do not need to know how the data is distributed.
This paper proposes a performance tools interface for OpenMP, similar in spirit to the MPI profiling
interface in its intent to define a clear and portable API that makes OpenMP execution events visible to runtime
performance tools. We present our design using a source-level instrumentation approach based on OpenMP directive
rewriting. Rules to instrument each directive and their combination are applied to generate calls to the interface
consistent with directive semantics and to pass context information (e.g., source code locations) in a portable and
efficient way. Our proposed OpenMP performance API further allows user functions and arbitrary code regions to be
marked and performance measurement to be controlled using new OpenMP directives. To prototype the proposed
OpenMP performance interface, we have developed compatible performance libraries for the EXPERT automatic
event trace analyzer and the TAU performance analysis framework. The directive instrumentation transformations
we define are implemented in a source-to-source translation tool called OPARI. Application examples are
presented for both EXPERT and TAU to show the OpenMP performance interface and OPARI instrumentation tool
in operation. When used together with the MPI profiling interface (as the examples also demonstrate), our proposed
approach provides a portable and robust solution to performance analysis of OpenMP and mixed-mode (OpenMP +
MPI) applications.
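The directive-rewriting idea can be illustrated with a toy source-to-source pass. The sketch below is not OPARI and does not use the interface proposed in the paper: it handles only standalone `#pragma omp barrier` lines, and the hook names `perf_barrier_enter` and `perf_barrier_exit` are hypothetical placeholders for calls into a measurement library. It only shows how source location context (file name and line number) might be passed along with each generated call.

```python
import re

# Matches a standalone OpenMP barrier directive in C/C++ source.
BARRIER = re.compile(r"^\s*#pragma\s+omp\s+barrier\s*$")


def instrument_barriers(source, filename="<input>"):
    """Wrap each barrier directive with hypothetical enter/exit hook calls."""
    out = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if BARRIER.match(line):
            # Reuse the directive's indentation for the generated calls.
            indent = line[: len(line) - len(line.lstrip())]
            # perf_barrier_enter/exit are made-up names, not a real API.
            out.append(f'{indent}perf_barrier_enter("{filename}", {lineno});')
            out.append(line)
            out.append(f'{indent}perf_barrier_exit("{filename}", {lineno});')
        else:
            out.append(line)
    return "\n".join(out)


# Small hypothetical input: one barrier between two calls.
example = "work();\n    #pragma omp barrier\nmore_work();"
rewritten = instrument_barriers(example, "demo.c")
```

A real tool would need rules for every directive and their combinations, as the abstract notes; this sketch covers a single directive only.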
The authors study the instrumentation perturbations of software event tracing on
the Alliant FX/80 vector multiprocessor in sequential, vector, concurrent, and
vector-concurrent modes. Based on experimental data, they derive a
perturbation model that can approximate true performance from instrumented
execution. They analyze the effects of instrumentation coverage (i.e., the ratio of
instrumented to executed statements), source-level instrumentation, and
hardware interactions. The results show that perturbations in execution times for
complete trace instrumentations can exceed three orders of magnitude. With
appropriate models of performance perturbation, these perturbations in
execution time can be reduced to less than 20% while retaining the additional
information from detailed traces. In general, it is concluded that it is possible to
characterize perturbations through simple models. This permits more detailed,
accurate instrumentation than traditionally believed possible.
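As a rough illustration of what a simple perturbation model can look like, the sketch below subtracts an assumed fixed per-event tracing overhead from a measured execution time. The function names, the constant-overhead assumption, and all numbers are hypothetical and are not taken from the paper.

```python
def approximate_true_time(measured_time, event_count, overhead_per_event):
    """Estimate uninstrumented execution time under a simple additive model.

    Assumption (not from the paper): every trace event adds the same fixed
    overhead, and overheads accumulate additively on a sequential path.
    """
    if event_count < 0 or overhead_per_event < 0:
        raise ValueError("event_count and overhead_per_event must be non-negative")
    estimate = measured_time - event_count * overhead_per_event
    # A negative estimate would mean the assumed overhead is too large.
    return max(estimate, 0.0)


def relative_error(estimate, actual):
    """Relative error of an estimate against a known uninstrumented time."""
    return abs(estimate - actual) / actual


# Hypothetical numbers: 1,000,000 events at 5 microseconds each.
measured = 6.2  # seconds, instrumented run
estimate = approximate_true_time(measured, 1_000_000, 5e-6)
```

With these made-up inputs the estimate is about 1.2 seconds; comparing it with a separately measured uninstrumented time via `relative_error` is one way to judge whether the assumed overhead is reasonable.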
With traditional event list techniques, evaluating a detailed discrete event
simulation model can often require hours or even days of computation time. By
eliminating the event list and maintaining only sufficient synchronization to
ensure causality, parallel simulation can potentially provide speedups that are
linear in the number of processors. We present a set of shared memory
experiments using the Chandy-Misra distributed simulation algorithm to
simulate networks of queues. Parameters of the study include queueing network
topology and routing probabilities, number of processors, and assignment of
network nodes to processors.
Though architectural improvements in memory organization of multiprocessor
systems can increase effective data bandwidth, the actual performance
achieved is highly dependent upon the characteristics of the memory
address streams; e.g., the data access rate, and the temporal and spatial
distributions. Accurately quantifying the performance behavior of a
multiprocessor memory system across a broad range of algorithmic
parameters is crucial if users (and restructuring compilers) are to achieve
high-performance codes. In this paper, we demonstrate how the behavior
of a cache-based multivector processor memory system can be
systematically characterized and its performance experimentally correlated
with key features of the address stream. The approach is based on the
definition of a family of parameterized kernels used to explore specific
aspects of the memory system's performance. The empirical results from
this kernel suite provide the data from which architectural or algorithmic
characteristics can be studied. The results of applying the approach to an
Alliant FX/8 are presented.
A new visualization design process for the development of parallel
program and performance visualizations using existing scientific data
visualization software can drastically reduce the graphics and data
manipulation programming overheads currently experienced by
visualization developers. Data visualization tools are designed to
handle large quantities of multi-dimensional data and create complex,
three-dimensional, customizable displays which incorporate advanced
rendering techniques, animation, and display interaction. These
capabilities can be used to improve performance visualization, but to
be effective, they must be applied as part of a formal methodology
relating performance data to visual representations. Under such a
formalism, it is possible to describe performance visualizations as
mappings from performance data objects to view objects, independent of
any graphical programming. Through three case studies, this work
examines how an existing scientific visualization tool, IBM's Data
Explorer, provides a robust environment for prototyping
next-generation parallel performance visualizations.
A new area called domain-specific metacomputing for computational
science is defined. This area cuts across the larger areas of
parallel and distributed computing, computational science, and
software engineering in search of techniques and technology that will
better allow the creation of useful tools for computational
scientists. The paper focuses on how metacomputing, domain-specific
environments, and software architectures can be employed as key
technologies to this end.
Neuroanatomical segmentation is a problem of extraction of a description
of particular neuroanatomical structures of interest that reflects the
morphometry (shape measurements) of the subject’s neuroanatomy from any
image rendering the neuroanatomical structures of the subject. This
dissertation presents a set of algorithms for automatic extraction of
cerebral white matter (WM) and gray matter (GM) as well as reconstruction
of cortical surfaces from T1-weighted MR images. Neuroanatomical
segmentation presented in this dissertation is performed by an image
analysis pipeline that steps through five major procedures: 1) the
original MR image is processed by a new relative thresholding procedure
and a new terrain analysis procedure such that all voxels are classified
into one of the three types: WM, GM, and background; 2) the topology
defects of the WM are eliminated by a new multiscale morphological
topology correction algorithm; 3) cerebral WM is extracted from its
superset with a new procedure called cell-complex-based morphometric
analysis; 4) cerebral GM is extracted based on the prior cerebral WM
extraction with a set of morphological image analysis procedures; and 5)
cortical surfaces are finally reconstructed preserving correct topology
with an existing marching cube isosurface algorithm. In this
dissertation, we evaluated our neuroanatomical segmentation tool both
quantitatively and qualitatively on a set of MR images with ground-truth
or manual segmentation, compared the results of our tool with those of
four other tools, and demonstrated that the performance of our tool is
highly accurate, robust, automatic and computationally efficient. The
advantages of our tool are mainly attributed to extensive exploration of
various structural, geometrical, morphological, and radiological a
priori knowledge, which persists despite image artifacts and inter-
subject anatomical variations. By exploiting a priori knowledge, we also
demonstrated that performing voxel classification prior to brain
extraction is a promising research direction, contrary to the
traditional procedure of brain extraction followed by voxel
classification. Finally, it’s worth noting that the algorithms of voxel
classification and morphological image analysis presented in this
dissertation for neuroanatomical segmentation can be potentially applied
in wider areas in computer vision.
Scientific parallel programs often undergo significant performance
tuning before meeting their performance expectation. Performance tuning
naturally involves a diagnosis process: locating performance bugs that
make a program inefficient and explaining them in terms of high-level
program design. Important performance measurement and analysis tools
have been developed to support the performance analysis with the
facilities of running experiments on parallel computers and generating
measurement data to evaluate performance. However, current performance
analysis technology does not yet allow for associating found performance
problems with causes at a high-level program abstraction. Nor does it
support the performance diagnosis process in a well automated manner.
We present a systematic method to guide the performance diagnosis
process and support the process with minimum user intervention. The
motivating observation is that performance diagnosis can be greatly
improved with the use of performance knowledge about parallel
computation models. We therefore propose an approach to generating
performance knowledge for automatically diagnosing parallel programs.
Our approach exploits program execution abstraction and parallelism
found in computational models to search and explain performance bugs. We
identify categories of knowledge required for performance diagnosis and
describe how to derive the knowledge from computational models. We
represent the extracted knowledge in a manner such that performance
inferencing can be carried out in an automatic manner. We have
developed the Hercule automatic performance diagnosis system that
implements the model-based diagnosis strategy. In this dissertation, we
present how Hercule integrates the performance knowledge into a
performance analysis tool and demonstrate the effectiveness of our
performance knowledge engineering approach through Hercule experiments
on a variety of parallel computational models. We also investigate
compositional programs that combine two or more models. We extend
performance knowledge engineering to capture the interplay of multiple
models in an integrated state, and improve Hercule capabilities to
support the compositional performance diagnosis. We have applied Hercule
to two representative scientific applications, both of which are
implemented with combined models. The experiment results show that,
requiring minimum user intervention, model-based performance analysis is
vital and effective in discovering and interpreting performance bugs at
a high level of program abstraction.
Tools for performance observability must balance the need for performance data
against the cost of obtaining it (environment complexity and performance intrusion)
-- too little performance data makes performance analysis difficult; too much data perturbs
the measurement system. We discuss several methods for performance measurement
concentrating specifically on mechanisms for timing and tracing. We show how minor
hardware and software modifications can enable better measurement tools to be built and
describe results from a prototype hardware-based software monitor developed for the Intel
iPSC/2 multiprocessor.
Any software performance measurement perturbs the measured system. We develop two
models of performance perturbation to understand the effects of instrumentation intrusion:
time-based and event-based. The time-based model uses only measured time overheads of
instrumentation to approximate actual execution time performance. We show that this model
can give accurate approximations for sequential execution and for parallel execution with
independent execution ordering. We use the event-based model to quantify the perturbation
effects of instrumentations of parallel executions with ordering dependencies. Our results
show that this model can be applied in practice to achieve accurate approximations. We also
discuss the limitations of the time-based and event-based models.
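To make the role of ordering dependencies concrete, the following minimal sketch applies overhead removal per thread and then re-derives a barrier release time from the adjusted arrivals. It assumes a single barrier, a fixed overhead per trace event, and integer timestamps; the function name and the trace values are hypothetical and do not reproduce the thesis's analysis.

```python
def approximate_barrier(measured_arrivals, events_before, overhead_per_event):
    """Approximate per-thread barrier arrival times and the barrier exit time.

    Assumptions (illustrative only): one barrier, a fixed overhead per
    trace event, and integer timestamps in a common clock unit.
    """
    if len(measured_arrivals) != len(events_before):
        raise ValueError("need one event count per thread")
    # Remove each thread's accumulated instrumentation overhead.
    approx_arrivals = [
        t - n * overhead_per_event
        for t, n in zip(measured_arrivals, events_before)
    ]
    # A barrier releases all threads when the last one arrives.
    return approx_arrivals, max(approx_arrivals)


# Hypothetical two-thread trace (times in microseconds).
measured = [1000, 900]  # thread 0 appears to arrive last
events = [40, 10]       # but thread 0 executed far more trace events
arrivals, exit_time = approximate_barrier(measured, events, 10)
```

In this made-up trace the adjusted arrivals are 600 and 800, so the thread that arrives last changes once overhead is removed. A model that only adjusts total time cannot express that reordering.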
The potentially large volume of detailed performance data requires new approaches to
presentation that can show gross performance characteristics while allowing users to
focus on local performance behavior. We give several examples where performance
visualization techniques have been effectively applied, and discuss the architecture and a
prototype of a general performance visualization environment.
Finally, we apply several of the performance measurement, analysis, and visualization
techniques to a practical study of performance observability on the Cray X-MP and Cray 2
supercomputers. Our results show that even modest improvements in the existing set of
performance tools for a particular machine can have significant benefits in performance
evaluation capabilities.
Performance observability is the ability to accurately capture, analyze, and present
(collectively observe) information about the performance of a computer system.
Advances in computer systems design, particularly with respect to parallel processing and
supercomputers, have brought a crisis in performance observation -- computer systems
technology is outpacing the tools to understand the performance behavior of and to operate
the machines near the high-end of their performance range. In this thesis, we study the
performance observability problem with emphasis on the practical design, development, and
use of tools for performance measurement, analysis, and visualization.
With the growth of modern high-performance computing systems, scientists are able to
simulate larger and more complex systems. The most straightforward way to do this is
to couple existing computational models to create models of larger systems composed
of smaller sub-systems. Unfortunately, no general method exists for automating the
process of coupling computational models. We present the design of such a method
here. Using existing compiler technology, we assume that control flow analysis can
determine the control state of models based on their source code. Scientists can then
annotate the control flow graph of a model to identify points at which the model can
provide data to or accept data from other models. Couplings are established between
two models by establishing bindings between these control flow graphs. Translation of
the control flow graph into Petri Nets allows automatic generation of coupling code to
implement the couplings.
A parallel component environment places constraints on performance measurement and
modeling. For instance, it must be possible to observe component operation without
access to the source code. Furthermore, applications that are composed dynamically at
run time require reusable performance interfaces for component interface monitoring.
This thesis describes a non-intrusive, coarse-grained performance measurement
framework that allows the user to gather performance data through the use of proxies that
conform to these constraints. From this data, performance models for an individual
component can be generated, and a performance model for the entire application can be
synthesized. A validation framework is described, in which simple components with
known performance models are used to validate the measurement and modeling
methodologies included in the framework. Finally, a case study involving the
measurement and modeling of a real scientific simulation code is also presented.
This thesis provides a design and development of a software architecture
and programming framework that enables domain-oriented scientific
investigations to be more easily developed and productively applied. The
key research concept is the representation and automation of scientific
studies by capturing common methods for experimentation, analysis and
evaluation used in simulation science. Such methods include parameter
studies, optimization, uncertainty analysis, and sensitivity
analysis. While the framework provides a generic way to conduct
investigation on an arbitrary simulation, its intended use is to be
extended to develop a domain computational environment. The framework
hides the access to distributed system resources and the multithreaded
execution. A prototype of such a framework called ODESSI (Open Domain-
oriented Environment for Simulation-based Scientific Investigation,
pronounced odyssey) is developed and evaluated on realistic problems
in human neuroscience and computational chemistry domains. ODESSI was
inspired by our domain problems encountered in the computational
modeling of human head electromagnetics for conductivity analysis and
source localization. In this thesis we provide tools and methods to
solve state-of-the-art problems in head modeling. In particular, we
developed an efficient and robust HPC solver for the forward problem and
a generic robust HPC solver for bEIT (bounded Electrical Impedance
Tomography) inverse problem to estimate the head tissue conductivities.
We also formulated a method to include skull inhomogeneity and other
skull variation in the head model based on information obtained from
experimental studies. ODESSI as a framework is used to demonstrate the
research ideas in this neuroscience domain, and the domain investigation
results are discussed in this thesis. ODESSI supports both the
processing of investigation activities and the management of its evolving
record of information, results, and provenance.
Technology for empirical performance evaluation of parallel programs is driven by the increasing complexity of high performance computing environments and programming methodologies. This complexity - arising from the use of high-level parallel languages, domain-specific numerical frameworks, heterogeneous execution models and platforms, multi-level software optimization strategies, and multiple compilation models -
widens the semantic gap between a programmer's understanding of his/her code and its runtime behavior. To keep pace, performance tools must provide for the effective instrumentation of complex software and the correlation of runtime performance data with user-level semantics.
To address these issues, this dissertation contributes:
* a strategy for utilizing multi-level instrumentation to improve the coverage of performance measurement in complex, layered software;
* techniques for mapping low-level performance data to higher levels of abstraction in order to reduce the semantic gap between user's abstractions and runtime behavior; and
* the concept of instrumentation-aware compilation that extends traditional compilers to preserve the semantics of fine-grained performance instrumentation despite aggressive
program restructuring.
In each case, the dissertation provides prototype implementations and case studies of the needed tools and frameworks.
This dissertation research aims to influence the way performance observation tools and compilers for high performance computers are designed and implemented.
The Alliant FX/8 multiprocessor implements several high-speed computation ideas in
software and hardware. Each of the 8 computational elements (CEs) has vector
capabilities and multiprocessor support. Generally, the FX/8 delivers its highest processing
rates when executing vector loops concurrently. In this paper, we present extensive empirical
performance results for vector processing on the FX/8. The vector kernels of the LANL
BMK8a1 benchmark are used in the experiments.
To understand the complex interactions of the many factors contributing to supercomputer
performance, supercomputer designers and users must have access to an integrated
performance analysis system capable of measuring, analyzing, modeling, and predicting
performance across a hierarchy of details and goals. The performance analysis system being
developed for the CEDAR multiprocessor supercomputer
embodies these characteristics and is discussed in this paper.
Regular is an often-used term to suggest simple and uniform structure of a parallel
processor's organization or a parallel algorithm's operation. However, a strict definition is
long overdue. In this paper, we define regularity for processor array structures in two
dimensions and enumerate the eleven distinct regular topologies. Space and time emulation
schemes among the regular processor arrays are constructed to compare their geometric
and performance characteristics. We also show how algorithms developed for one regular
processor array might be transferred to another regular array using matrix multiplication and
LU decomposition as examples.
Testing the performance scalability of parallel programs can be a time
consuming task, involving many performance runs for different computer
configurations, processor numbers, and problem sizes. Ideally, scalability
issues would be addressed during parallel program design, but tools are not
presently available that allow program developers to study the impact of
algorithmic choices under different problem and system scenarios. Hence,
scalability analysis is often reserved to existing (and available) parallel
machines as well as implemented algorithms.
In this paper, we propose techniques for analyzing scaled parallel programs
using stochastic modeling approaches. Although allowing more generality and
flexibility in analysis, stochastic modeling of large parallel programs is difficult
due to solution tractability problems. We observe, however, that the complexity
of parallel program models depends significantly on the type of parallel
computation, and we present several computation classes where tractable,
approximate graph models can be generated.
Our approach is based on a parallelization description of programs to be
scaled. From this description, scaled stochastic graph models are
automatically generated. Different approximate models are used to compute
lower and upper bounds of the mean runtime. We present evaluation results of
several of these scaled (approximate) models and compare their accuracy and
modeling expense (i.e., time to solution) with other solution methods
implemented in our modeling tool PEPP. Our results indicate that accurate and
efficient scalability analysis is possible using stochastic modeling together with
model approximation techniques.
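The idea of bracketing a mean runtime between a lower and an upper bound can be shown on the simplest case, a single fork-join of parallel tasks with random durations. The sketch below uses two textbook bounds (the largest task mean as a lower bound, the sum of task means as an upper bound for non-negative durations) and compares them with the exact expected maximum for small discrete distributions. These are not necessarily the approximate models implemented in PEPP, and the representation of a distribution as a list of (value, probability) pairs is an assumption made here for illustration.

```python
from itertools import product


def mean(dist):
    """Mean of a discrete distribution given as (value, probability) pairs."""
    return sum(v * p for v, p in dist)


def expected_max(dists):
    """Exact E[max] of independent discrete task durations, by enumeration."""
    total = 0.0
    for combo in product(*dists):
        prob = 1.0
        for _, p in combo:
            prob *= p
        total += prob * max(v for v, _ in combo)
    return total


def mean_runtime_bounds(dists):
    """Lower bound: largest task mean. Upper bound: sum of task means."""
    means = [mean(d) for d in dists]
    return max(means), sum(means)


# Two hypothetical parallel tasks, each taking 1 or 3 time units.
tasks = [[(1.0, 0.5), (3.0, 0.5)], [(1.0, 0.5), (3.0, 0.5)]]
lower, upper = mean_runtime_bounds(tasks)
exact = expected_max(tasks)
```

The enumeration in `expected_max` grows exponentially with the number of tasks, which hints at why tractable approximate models matter for scaled programs.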
We present a case study of performance measurement and modeling of a CCA (Common Component
Architecture) component-based application in a high performance computing environment.
We explore issues peculiar to component-based HPC applications and propose a performance
measurement infrastructure for HPC based loosely on recent work done for Grid environments.
A prototypical implementation of the infrastructure is used to collect data for three
components in a scientific application and construct performance models
for two of them. Both computational and message-passing performance are addressed.
The Common Component Architecture allows computational
scientists to adopt a component-based architecture for
scientific simulation codes. Components, which in the scientific
context usually embody a numerical solution facility or a physical
or numerical model, are composed at runtime into a simulation
code by loading in an implementation of a component and linking
it to others. However, a component may admit multiple
implementations, based on the choice of the algorithm, data structure,
parallelization strategy, etc., leaving the user with the problem
of having to choose the correct implementation and achieve
an optimal (fastest) component assembly. Under the assumption
that a performance model exists for each implementation of each
component, simply choosing the optimal implementation of each
component does not guarantee an optimal component assembly
since components interact with each other. An optimal solution
may be obtained by evaluating the performance of all the possible
realizations of a component assembly given the components and
all their implementations, but the exponential complexity renders
the approach infeasible as the number of components and their
implementations rises. We propose an approximate approach
predicated on the existence, identification and optimization of
computationally dominant sub-assemblies (cores). We propose
a simple criterion to test for the existence of such cores and
a set of rules to prune a component assembly and expose its
dominant cores. We apply this approach to data obtained from
a CCA component code simulating shock-induced turbulence on
four processors and present preliminary results regarding the
efficacy of this approach and the sensitivity of the final solution
to various parameters in the rules.
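The claim that per-component optima do not guarantee an optimal assembly can be checked on a tiny made-up example. In the sketch below, the component names, implementation names, run times, and interaction costs are all hypothetical, and the exhaustive search stands in for the exponential baseline only; it does not implement the core-identification and pruning rules proposed in the paper.

```python
from itertools import product

# Hypothetical standalone run times for two components' implementations.
STANDALONE = {
    "solver": {"s1": 10, "s2": 12},
    "mesh": {"m1": 10, "m2": 11},
}
# Hypothetical interaction cost (e.g., data conversion) per pairing.
INTERACTION = {
    ("s1", "m1"): 5, ("s1", "m2"): 5,
    ("s2", "m1"): 5, ("s2", "m2"): 0,
}


def assembly_cost(choice):
    """Total cost: standalone times plus the pairwise interaction cost."""
    solver, mesh = choice
    return (STANDALONE["solver"][solver] + STANDALONE["mesh"][mesh]
            + INTERACTION[(solver, mesh)])


def greedy_choice():
    """Pick each component's fastest implementation in isolation."""
    return tuple(min(impls, key=impls.get) for impls in STANDALONE.values())


def exhaustive_choice():
    """Evaluate every assembly; the count is the product of option counts."""
    candidates = list(product(*(impls.keys() for impls in STANDALONE.values())))
    return min(candidates, key=assembly_cost), len(candidates)


greedy = greedy_choice()
best, evaluated = exhaustive_choice()
```

With these numbers the greedy pick costs 25 while the best assembly costs 23, and the exhaustive search already evaluates four assemblies for two components with two implementations each.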
The next generation of language compilers for parallel architectures
offers levels of abstraction above those currently
available. Languages such as High Performance Fortran (HPF) and
Parallel C++ (pC++) allow the programmer to specify how data
structures are to be aligned relative to each other and then
distributed across processors. Since a program's performance is often
directly related to how its data is distributed, a means of evaluating
data distributions and alignments is necessary. Since there is a
natural tendency to explain data distributions by drawing pictures,
graphical visualizations may be helpful in assessing the benefits and
detriments of a given data decomposition. This paper formulates an
experimental framework for exploring visualization techniques
appropriate to evaluating data distributions. Visualizations are
created using IBM's Data Explorer visualization software in
conjunction with other software developed by the author. An informal
assessment of the resulting visualizations and an explanation of how
this research will be extended are also given.
A new design process for the development of parallel performance
visualizations that uses existing scientific data visualization
software is presented. Scientific visualization tools are designed to
handle large quantities of multi-dimensional data and create complex,
three-dimensional, customizable displays which incorporate advanced
rendering techniques, animation, and display interaction. Using a
design process that leverages these tools to prototype new performance
visualizations can lead to drastic reductions in the graphics and data
manipulation programming overhead currently experienced by performance
visualization developers. The process evolves from a formal
methodology that relates performance abstractions to visual
representations. Under this formalism, it is possible to describe
performance visualizations as mappings from performance objects to
view objects, independent of any graphical programming. Implementing
this formalism in an existing data visualization system leads to a
visualization prototype design process consisting of two components
corresponding to the two high-level abstractions of the formalism: a
trace transformation (i.e., performance abstraction) and a graphical
transformation (i.e., visual abstraction). The trace transformation
changes raw trace data to a format readable by the visualization
software, and the graphical transformation specifies the graphical
characteristics of the visualization. This prototyping environment
also facilitates iterative design and evaluation of new and existing
displays. Our work examines how an existing data visualization tool,
IBM's Data Explorer in particular, can provide a robust prototyping
environment for next-generation parallel performance visualization.
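The two-stage design described above (a trace transformation followed by a graphical transformation) can be sketched as a composition of two functions. This is a hypothetical illustration only: the record layout, the function names, and the dictionary-based "view objects" are assumptions, and nothing here uses or reflects the Data Explorer API.

```python
from collections import defaultdict

def trace_transformation(events):
    """Performance abstraction: reduce raw (processor, routine, seconds)
    records to one performance object (total time) per processor."""
    totals = defaultdict(float)
    for proc, _routine, seconds in events:
        totals[proc] += seconds
    return dict(totals)

def graphical_transformation(perf_objects):
    """Visual abstraction: map each performance object to a view object,
    here a bar with a normalized height and a highlight color."""
    peak = max(perf_objects.values())
    return [
        {"x": proc, "height": value / peak,
         "color": "red" if value == peak else "gray"}
        for proc, value in sorted(perf_objects.items())
    ]

# Hypothetical raw trace records: (processor id, routine name, seconds).
raw_trace = [
    (0, "solve", 2.0), (1, "solve", 3.0),
    (0, "exchange", 0.5), (1, "exchange", 1.0),
]

view = graphical_transformation(trace_transformation(raw_trace))
for bar in view:
    print(bar)
```

Keeping the two stages separate means a different display can be prototyped by swapping only the graphical transformation, which is the kind of iteration the abstract describes.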
Developing robust techniques for visualizing the performance behavior
of parallel programs that can scale in problem size and/or number of
processors remains a challenge. In this paper, we present several
performance visualization techniques based on the context of
data-parallel programming and execution that demonstrate good visual
scalability properties. These techniques are a result of utilizing the
structural and distribution semantics of data-parallel programs as
well as sophisticated three-dimensional graphics. A categorization and
examples of scalable performance visualizations are given for programs
written in Dataparallel C and pC++.
A new visualization design process for the development of parallel
program and performance visualizations using existing scientific data
visualization software can drastically reduce the graphics and data
manipulation programming overheads currently experienced by
visualization developers. Data visualization tools are designed to
handle large quantities of multi-dimensional data and create complex,
three-dimensional, customizable displays which incorporate advanced
rendering techniques, animation, and display interaction. These
capabilities can be used to improve performance visualization, but to
be effective, they must be applied as part of a formal methodology
relating performance data to visual representations. Under such a
formalism, it is possible to describe performance visualizations as
mappings from performance data objects to view objects, independent of
any graphical programming. Through three case studies, this work
examines how an existing scientific visualization tool, IBM's Data
Explorer, provides a robust environment for prototyping
next-generation parallel performance visualizations.
This paper describes the design and implementation of the Distributed
Array Query and Visualization (DAQV) system for High Performance
Fortran, a project sponsored by the Parallel Tools Consortium. DAQV's
implementation leverages the HPF language, compiler, and runtime
system to address the general problem of providing high-level access
to distributed data structures. DAQV supports a framework in which
visualization and analysis clients connect to a distributed array
server (i.e., the HPF application with DAQV control) for program-level
access to array values. Implementing key components of DAQV in HPF
itself has led to a robust and portable solution in which clients do
not need to know how the data is distributed.
This paper describes the design and implementation of a high-level
visualization programming system called Viz. Viz was created out of a
need to support rapid visualization prototyping in an environment that
could be extended by abstractions in the application problem
domain. Viz provides this in a programming environment built on a
high-level, interactive language (Scheme) that embeds a 3D graphics
library (Open Inventor), and that utilizes a data reactive model of
visualization operation to capture mechanisms that have been found to
be important in visualization design (e.g., constraints, controlled
data flow, dynamic analysis, animation). The strength of Viz is in its
ability to create non-trivial visualizations rapidly and to construct
libraries of 3D graphics functionality easily. Although our original
focus was on parallel program and performance data visualization, Viz
applies beyond these areas. We show several examples that highlight
Viz functionality and the visualization design process it supports.
A new area called domain-specific metacomputing for computational science is defined. This area
cuts across the larger areas of parallel and distributed computing, computational science, and software
engineering in search of techniques and technology that will better allow the creation of useful tools for
computational scientists. The paper focuses on how metacomputing, domain-specific environments, and
software architectures can be employed as key technologies to this end.
The Distributed Array Query and Visualization (DAQV) project aims to develop systems and tools that facilitate interacting with distributed programs and data structures. Arrays distributed across the processes of a parallel or distributed application are made available to external clients via well-defined interfaces and protocols. Our design considers the broad issues of language targets, models of interaction, and abstractions for data access, while our implementation attempts to provide a general framework that can be adapted to a range of application scenarios. The paper describes the second generation of DAQV work and places it in the context of the more general distributed array access problem. Current applications and future work are also described.
The TAU performance system is an integrated performance instrumentation,
measurement, and analysis toolkit offering support for profiling and tracing modes of measurement. This paper introduces memory introspection capabilities of TAU featured on the Cray XT3 Catamount compute node kernel. TAU supports examining the memory headroom, or the amount of heap memory available, at routine entry, and correlates it to the program's callstack as an atomic event.
In this article we propose a ``standard'' performance tool interface for OpenMP, similar in spirit to the MPI profiling interface in its intent to define a clear and portable
API that makes OpenMP execution events visible to performance libraries. When used together with the MPI profiling interface, it also allows tools to be built for
hybrid applications that mix shared and distributed memory programming. We describe an instrumentation approach based on OpenMP directive rewriting that
generates calls to the interface and passes context information (e.g., source code locations) in a portable and efficient way. Our proposed OpenMP performance API
further allows user functions and arbitrary code regions to be marked and performance measurement to be controlled using new proposed OpenMP directives. The
directive transformations we define are implemented in a source-to-source translation tool called OPARI. We have used it to integrate the TAU performance analysis
framework and the automatic event trace analyzer EXPERT with the proposed OpenMP performance interface. Together, these tools show that a portable and robust
solution to performance analysis of OpenMP and hybrid applications is possible.
As computer systems grow in size and complexity, tool support is needed to
facilitate the efficient mapping of large-scale applications onto these systems.
To help achieve this mapping, performance analysis tools must provide robust
performance observation capabilities at all levels of the system, as well as map
low-level behavior to high-level program constructs. Instrumentation and
measurement strategies, developed over the last several years, must evolve
together with performance analysis infrastructure to address the challenges of
new scalable parallel systems.
Parallel Java environments present challenging problems for performance tools because of Java's rich language system and its multi-level execution platform
combined with the integration of native-code application libraries and parallel runtime software. In addition to the desire to provide robust performance
measurement and analysis capabilities for the Java language itself, the coupling of different software execution contexts under a uniform performance model
needs careful consideration of how events of interest are observed and how cross-context parallel execution information is linked. This paper relates our
experience in extending the TAU performance system to a parallel Java environment based on mpiJava. We describe the complexities of the instrumentation
model used, how performance measurements are made, and the overhead incurred. A parallel Java application simulating the game of Life is used to show
the performance system's capabilities.
Parallel Java environments present challenging problems for performance tools because of Java's rich language system and its
multi-level execution platform combined with the integration of native-code application libraries and parallel runtime software. In
addition to the desire to provide robust performance measurement and analysis capabilities for the Java language itself, the coupling of
different software execution contexts under a uniform performance model needs careful consideration of how events of interest are
observed and how cross-context parallel execution information is linked. This paper relates our experience in extending the TAU
performance system to a parallel Java environment based on mpiJava. We describe the instrumentation model used, how performance
measurements are made, and the overhead incurred. A parallel Java application simulating the game of life is used to show the
performance system's capabilities.
This talk describes the current status (as of Aug 2000) of TAU and new research directions.
Flexibility and portability are important concerns for productive empirical performance evaluation. We claim that these features are best supported by robust
instrumentation and measurement strategies, and their integration. Using the TAU performance system as an exemplar performance toolkit, a case study in
performance evaluation is considered. Our goal is both to highlight flexibility and portability requirements and to consider how instrumentation and
measurement techniques can address them. The main contribution of the paper is methodological, in its advocacy of a guiding principle for tool
development and enhancement. Recent advancements in the TAU system are described from this perspective.
Performance evaluation of parallel and distributed programs involves
choosing from a wide variety of performance models, instrumentation
and measurement techniques, and execution models. The ability of
performance technology to keep pace with the growing complexity of
parallel and distributed systems depends on robust performance frameworks
that can at once provide system-specific performance capabilities and
support high-level performance problem solving. This talk gives an
overview of choices and constraints that a performance technologist
faces while building tools. We share our experience in building the TAU
(Tuning and Analysis Utilities) suite of portable profiling and tracing tools.
As an example, we illustrate tools for a parallel Java environment where
instrumentation from multiple levels is integrated to provide the coupling of
different software execution contexts under a uniform performance model.
The techniques discussed in this talk are aimed at helping you design simple
performance evaluation tools and effectively understand and use existing
performance tools.
The developers of high-performance scientific applications often work in complex computing environments that place heavy demands
on program analysis tools. The developers need tools that interoperate, are portable across machine architectures, and provide
source-level feedback. In this paper, we describe a tool framework, the Program Database Toolkit (PDT), that supports the
development of program analysis tools meeting these requirements. PDT uses compile-time information to create a complete database
of high-level program information that is structured for well-defined and uniform access by tools and applications. PDT's current
applications make heavy use of advanced features of C++, in particular, templates. We describe the toolkit, focussing on its most
important contribution -- its handling of templates -- as well as its use in existing applications.
Fundamental to the development and use of parallel systems is the ability to observe, analyze, and
understand their performance. However, the growing complexity of parallel systems challenges
performance technologists to produce tools and methods that are at once robust (scalable, extensible,
configurable) and ubiquitous (cross-platform, cross-language). This half-day tutorial will focus on
performance analysis in complex parallel systems which include multi-threading, clusters of SMPs,
mixed-language programming, and hybrid parallelism. Several representative complexity scenarios will
be presented to highlight two fundamental performance analysis concerns: 1) the need for tight
integration of performance observation (instrumentation and measurement) technology with
sophisticated programming environments and system platforms, and 2) the ability to map execution
performance data to high-level programming abstractions implemented on layered, hierarchical software
systems. The tutorial will describe the TAU performance system in detail and demonstrate how it is used
to successfully address the performance analysis concerns in each complexity scenario discussed.
Tutorial attendees will be introduced to TAU's instrumentation, measurement, and analysis tools, and
shown how to configure the TAU performance system for specific needs. A description of future
enhancements of the TAU performance framework, including a demonstration of a prototype for
automatic bottleneck analysis, will conclude the tutorial.
The Standard Performance Evaluation Corporation (SPEC) benchmark suite for
OpenMP (named SPEC OMP2001) allows the performance
evaluation of modern shared-memory multiprocessors executing programs
made parallel using the OpenMP API. While the SPEC OMP2001 suite
reports only total program execution for benchmarking purposes, detailed
performance studies of the individual programs can reveal interesting
runtime characteristics. Clearly, for programmers
attempting to diagnose performance problems and make tuning decisions, such
detailed performance information can be invaluable, especially when
programming with a new parallel API such as OpenMP. Unfortunately, tools
for performance measurement and analysis of parallel programs do not, in
general, meet the same portability, configurability, and ease of use
standards found in a robust benchmark suite such as SPEC OMP2001. As a
result, more in-depth performance analysis is often isolated to those
platforms where tools exist, or it is not done at all for lack of tool
expertise.
During the past year, we have proposed a performance tool interface
(referred to as the POMP interface) for OpenMP. The
goal of POMP is to define a clear and portable API that makes OpenMP
execution events visible to runtime performance measurement tools. The
POMP API is designed based on OpenMP directive semantics, allowing POMP
instrumentation to be accomplished through source-to-source translation; we
developed the Opari instrumentation tool for this
purpose. In addition to the POMP interface specification, we have
demonstrated its use with prototype POMP libraries for the Expert
automatic event trace analyzer and the TAU
performance analysis framework.
This paper reports on the application of the POMP performance interface and
toolset to the SPEC OMP2001 benchmark suite. The goals of the work are
three-fold. First, we want to show how support for detailed performance
instrumentation and measurement can be integrated in the SPEC OMP2001
benchmarking methodology, using an approach based on POMP's capabilities.
Second, we want to use the SPEC OMP2001 benchmarks as test cases for
the POMP technology, the API and Opari instrumentation tool. This will
allow us to further evaluate the robustness of the API and Opari's
automatic transformation capabilities. Third, we want to demonstrate the
value of integrated performance tools in conducting cross-platform
performance studies. Here, our goal is to be able to automatically capture
detailed performance information across a variety of platforms listed in the
SPEC OMP2001 results database.
Performance visualization is the use of graphical display techniques
for the visual analysis of performance data to improve the
understanding of complex performance phenomena. While the graphics of
current performance visualizations are predominantly confined to
two-dimensions, one of the primary goals of our work is the
development of new methods for rapidly prototyping next-generation,
multi-dimensional performance visualizations. By applying the tools of
scientific visualization to performance visualization, we have found
that next-generation displays for performance visualization can be
prototyped, if not implemented, in existing data visualization
software products like Data Explorer, using graphical techniques that
physicists, oceanographers, and meteorologists have used for several
years now.
Parallel performance tools offer the program developer insights into the execution
behavior of an application and are a valuable component in the cycle of application
development and deployment. However, most tools do not work well with large-scale
parallel applications where the performance data generated comes from thousands of
processes. Not only can the data be difficult to manage and the analysis complex, but
existing performance display tools are mostly restricted to two dimensions and lack the
customization and display interaction to support full data investigation. In addition,
it is increasingly important that performance tools be able to function online, making
it possible to control and adapt long-running applications based on performance
feedback. Again, large-scale parallelism complicates the online access and management
of performance data, and it may be desirable to integrate performance analysis and
visualization in existing computational steering infrastructures.
The coupling of advanced three-dimensional visualization with large-scale, online
performance data analysis could enhance application performance evaluation. The
challenge is to develop a framework where the tedious work, such as access to the
performance data and graphics rendering, is supported by the underlying system, leaving
tool developers to focus on the high level design of the analysis and visualization
capabilities.
We designed and prototyped a system architecture for online performance access,
analysis, and visualization in a large-scale parallel environment. The architecture
consists of four components. The performance data integrator component is responsible
for interfacing with a performance monitoring system to merge parallel performance
samples into a synchronous data stream for analysis. The performance data reader
component reads the external performance data into internal data structures of the
analysis and visualization system. The performance analyzer component provides the
analysis developer a programmable framework for constructing analysis modules that can
be linked together for different functionality. The performance visualizer component
can also be programmed to create different display modules.
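The four components can be sketched as a small pipeline. This is a minimal sketch under stated assumptions: the function names, the per-rank sample layout, and the two analysis modules are hypothetical, and the code does not use or reflect the actual TAU, Uintah, or SCIRun interfaces.

```python
def integrate(per_process_samples):
    """Integrator: merge per-process sample lists into a stream of
    time-aligned frames, one frame per sampling step."""
    steps = min(len(samples) for samples in per_process_samples.values())
    for t in range(steps):
        yield {rank: samples[t] for rank, samples in per_process_samples.items()}

def read(frame):
    """Reader: load one external frame into the internal structure
    used downstream (here, a list ordered by rank)."""
    return [frame[rank] for rank in sorted(frame)]

def mean_module(values):
    return sum(values) / len(values)

def imbalance_module(values):
    return max(values) / mean_module(values)

def analyze(values, modules):
    """Analyzer: run the linked analysis modules over one frame."""
    return {name: module(values) for name, module in modules.items()}

def visualize(result):
    """Visualizer: a text display standing in for a graphical module."""
    return " ".join(f"{name}={value:.2f}" for name, value in sorted(result.items()))

# Hypothetical profile samples: seconds in one routine, per rank, per step.
samples = {0: [1.0, 2.0], 1: [3.0, 2.0]}
modules = {"mean": mean_module, "imbalance": imbalance_module}

lines = [visualize(analyze(read(frame), modules)) for frame in integrate(samples)]
for line in lines:
    print(line)
```

Because the analyzer takes its modules as a dictionary, different analyses can be linked in without touching the integrator or reader, which mirrors the separation of concerns the architecture describes.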
Our prototype is based on the TAU performance system, the Uintah computational
framework, and the SCIRun computational steering and visualization system. Parallel
profile data from a Uintah simulation are sampled and written to profile files during
execution. A profile reader, implemented as a SCIRun module, saves profile samples in
SCIRun memory. SCIRun provides a programmable system for building and linking the
analysis and visualization components. We have developed two analysis modules and three
visualization modules to demonstrate how parallel profile data from large-scale Uintah
applications are processed online.
TAU flyer for SC'99.
PDT flyer for SC'99.
Massively parallel computations are difficult to debug. Users are
often overwhelmed by large amounts of trace data and confused by the
effects of asynchrony. Event-based behavioral abstraction provides a
mechanism for managing the volume of data by allowing users to specify
models of intended program behavior that are automatically compared to
actual program behavior. Transformations of logical time ameliorate
the difficulties of coping with asynchrony by allowing users to see
behavior from a variety of temporal perspectives. Previously, we
combined these features in a debugger that automatically constructed
animations of user-defined abstract events in logical time. However,
our debugger, like many others, did not always provide sufficient
feedback nor did it effectively scale up for massive parallelism. Our
modeling language required complex recognition algorithms which
precluded informative feedback on abstractions that did not correspond
to observed behavior. Feedback on abstractions that did match behavior
was limited because it relied on graphical animations that did not
scale well to even moderate numbers of processes (such as 64). We
address these problems in a new debugger, called Ariadne.
Optimization of data-parallel languages compounds the difficulty of the problem of parallel debugging. While the constrained structure of such languages is intended to simplify the job
of the parallel programmer, the loss of flexibility concomitant with this structure often results in programs that, if left unoptimized, would have unacceptably poor performance. The
programmer needs a debug tool capable of interacting with an optimized, distributed system and reporting the behavior of such a system in terms of the source code from which it is
derived.
We describe some of the obstacles to source-level debugging of optimized data-parallel programs. We present general solutions to these problems, and discuss implementation details. We
then describe several example debugging scenarios to demonstrate the capabilities of our prototype system, ZEE (ZPL DEBUGGER).