When performance measurements are made of program operation, actual
execution behavior can be perturbed. In general, the degree of perturbation
depends on the intrusiveness and frequency of the instrumentation. If the
perturbation effects of the instrumentation cannot be quantified by a perturbation
model (and subsequently removed during perturbation analysis), detailed
performance measurements could be inaccurate. Developing models of time
and event perturbations that can recover actual execution performance from
perturbed performance measurements is the topic of this paper. Time-based
models can accurately capture execution time perturbations for sequential
computations and concurrent computations with simple fork-join behavior.
However, the performance of parallel computations generally depends on the
relative ordering of dependent events and the assignment of computational
resources. Event-based models must be used to quantify instrumentation
perturbation in parallel performance measurements. The measurement and
subsequent analysis of synchronization operations (e.g., barrier, semaphore,
and advance/await synchronization) can produce accurate approximations to
actual performance behavior. Unfortunately, event-based models are limited in
their ability to fully capture perturbation effects in nondeterministic executions.
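The time-based models described above can be sketched concretely. Below is a minimal, illustrative example (the event names, timestamps, and per-event overhead are invented for this sketch, not taken from the paper): each recorded trace event's timestamp is shifted back by the cumulative instrumentation overhead of all preceding events.

```python
def correct_trace(events, per_event_overhead):
    """Given (name, measured_timestamp) pairs in time order, shift each
    timestamp back by the cumulative instrumentation overhead of all
    preceding events -- a minimal time-based perturbation model."""
    corrected = []
    for i, (name, t) in enumerate(events):
        corrected.append((name, t - i * per_event_overhead))
    return corrected

# Invented example trace: four events, 10 ms of overhead per event.
trace = [("enter_f", 0.000), ("enter_g", 0.110),
         ("exit_g", 0.230), ("exit_f", 0.300)]
print(correct_trace(trace, 0.01))
```

As the abstract notes, this simple subtraction suffices for sequential executions; for parallel executions the relative ordering of dependent events must also be preserved, which is why event-based models are needed.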
Event-related potentials (ERP) are brain electrophysiological
patterns created by averaging electroencephalographic
(EEG) data, time-locking to events of interest (e.g.,
stimulus or response onset). In this paper, we propose a
semi-automatic framework for mining ERP data, which includes
the following steps: PCA decomposition, extraction
of summary metrics, unsupervised learning (clustering) of
patterns, and supervised learning, i.e., discovery, of classification
rules. Results show good correspondence between
rules that emerge from decision tree classifiers and rules
that were independently derived by domain experts. In addition,
data mining results suggested ways in which expert-defined
rules might be refined to improve pattern representation
and classification results.
We describe a technique for noninvasive
conductivity estimation of the human head tissues in
vivo. It is based on the bounded electrical impedance
tomography (bEIT) measurement procedure and a
realistically shaped high-resolution finite difference
model (FDM) of the human head geometry constructed
from subject-specific co-registered CT and MRI.
The first experimental results with two subjects
demonstrate the feasibility of this technology.
The relative simplicity of the Fortran 77 language's design allowed for reasonable interoperability with C and C++. Fortran 90, on the other hand, introduces several new and complex features to the language that severely degrade the viability of a mixed Fortran and C++ development environment. Major new items added to Fortran are user-defined types, pointers, and several new array features. Each of these items introduces difficulties because the Fortran 90 procedure calling convention was not designed with interoperability as an important design goal. For example, Fortran 90 arrays are passed by array descriptor, whose layout is not specified by the language and therefore depends on a particular compiler implementation. This paper describes a set of software tools that parse Fortran 90 source code and produce mediating interface functions which allow access to Fortran 90 libraries from C++.
The use of a cluster for distributed performance analysis of parallel trace
data is discussed. We propose an analysis architecture that uses multiple
cluster nodes as a server to execute analysis operations in parallel and
communicate to remote clients where performance visualization and user
interactions occur. The client-server system developed, VNG, is highly
configurable and is shown to perform well for traces of large size, when
compared to leading trace visualization systems.
The effect of the operating system on application performance is an increasingly
important consideration in high performance computing. OS kernel measurement is
key to understanding the performance influences and the interrelationship of system
and user-level performance factors. The KTAU (Kernel TAU) methodology and Linux-
based framework provide parallel kernel performance measurement from both a
kernel-wide and process-centric perspective. The first characterizes overall
aggregate kernel performance for the entire system. The second characterizes kernel
performance when it runs in the context of a particular process. KTAU extends the
TAU performance system with kernel-level monitoring, while leveraging TAU's
measurement and analysis capabilities. We explain the rationale and motivations
behind our approach, describe the KTAU design and implementation, and show
working examples on multiple platforms demonstrating the versatility of KTAU in
integrated system / application monitoring.
Power is the most critical resource for exascale
high performance computing. In the future, system administrators
might have to pay attention to the power consumption of
the machine under different workloads. Hence, each application
may have to run with an allocated power budget. Thus, achieving
the best performance on future machines requires optimizing
performance subject to a power constraint. This additional
performance requirement should not be the responsibility of
HPC (High Performance Computing) application developers.
Optimizing the performance for a given power budget should
be the responsibility of the high-performance system software stack.
Modern machines allow power capping of CPU and memory to
implement a power budgeting strategy. Finding the best runtime
environment for a node at a given power level is important to
get the best performance.
This paper presents the ARCS (Adaptive Runtime Configuration
Selection) framework, which automatically selects the best runtime
configuration for each OpenMP parallel region at a given power
level. The framework uses the OMPT (OpenMP Tools) API, APEX
(Autonomic Performance Environment for eXascale), and Active
Harmony frameworks to explore the configuration search space and
select the best number of threads, scheduling policy, and chunk
size for a given power level at run-time. We test ARCS using the
NAS Parallel Benchmarks and the proxy application LULESH on
Intel Sandy Bridge and IBM Power multi-core architectures. We
show that for a given power level, efficient OpenMP runtime
parameter selection can improve the execution time and energy
consumption of an application by up to 40% and 42%, respectively.
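The configuration-selection idea can be sketched as a search over the runtime parameter space. The cost model below is entirely invented for illustration; the real ARCS framework measures each candidate configuration at run-time via OMPT/APEX under a power cap rather than evaluating a formula.

```python
import itertools

def run_region(threads, schedule, chunk):
    # Invented stand-in cost model for timing one OpenMP parallel region
    # under a power cap: parallel work, scheduling overhead, and a
    # per-thread penalty standing in for the power constraint.
    work = 16.0 / threads
    overhead = {"static": 0.01, "dynamic": 0.05, "guided": 0.03}[schedule]
    return work + overhead * (64 / chunk) + 0.02 * threads

def select_best_configuration():
    # Exhaustively explore (threads, schedule, chunk) and keep the fastest.
    space = itertools.product([2, 4, 8, 16],
                              ["static", "dynamic", "guided"],
                              [1, 8, 64])
    return min(space, key=lambda cfg: run_region(*cfg))

print(select_best_configuration())
```

In practice the search space is too large to enumerate per region, which is why ARCS delegates the exploration to a tuning framework such as Active Harmony.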
The Common Component Architecture (CCA) is a
component-based methodology for developing scientific simulation
codes. This architecture consists of a framework which
enables components (embodiments of numerical algorithms
and physical models) to work together. Components publish
their interfaces and use interfaces published by others.
Components publishing the same interface and with the same
functionality (but perhaps implemented via a different algorithm
or data structure) may be transparently substituted for each
other in a code or a component assembly. Components are
compiled into shared libraries and are loaded in, instantiated
and composed into a useful code at runtime. Details regarding
CCA can be found in [1], [2]. An analysis of the process of
decomposing a legacy simulation code and re-synthesizing it
as components can be found in [3], [4]. Actual scientific results
obtained from this toolkit can be found in [5], [6].
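The transparent-substitution idea can be illustrated with a toy example. The interface and class names below are invented for this sketch (CCA components are real shared libraries composed at runtime, not Python classes): two components publish the same interface, so the driver can use either without change.

```python
from abc import ABC, abstractmethod

class IntegratorPort(ABC):
    """A published interface: any component implementing it is substitutable."""
    @abstractmethod
    def integrate(self, f, a, b, n):
        """Approximate the integral of f over [a, b] with n subintervals."""

class MidpointIntegrator(IntegratorPort):
    def integrate(self, f, a, b, n):
        h = (b - a) / n
        return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

class TrapezoidIntegrator(IntegratorPort):
    def integrate(self, f, a, b, n):
        h = (b - a) / n
        return (sum(f(a + i * h) for i in range(1, n)) + 0.5 * (f(a) + f(b))) * h

def assemble_and_run(component: IntegratorPort):
    # The driver only knows the published interface, not the implementation.
    return component.integrate(lambda x: x * x, 0.0, 1.0, 1000)

# Either implementation may be substituted without changing the driver.
print(assemble_and_run(MidpointIntegrator()))
print(assemble_and_run(TrapezoidIntegrator()))
```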
In this paper, we discuss TAU (Tuning and Analysis Utilities), a
first prototype for an integrated and portable program analysis
environment for pC++, a parallel object-oriented language system. TAU
is integrated with the pC++ system in that it relies heavily on
compiler and transformation tools (specifically, the Sage++ toolkit)
for its implementation. This paper describes the design and
functionality of TAU and shows its application in practice.
The realization of parallel language systems that offer high-level
programming paradigms to reduce the complexity of application
development, scalable runtime mechanisms to support variable size
problem sets, and portable compiler platforms to provide access to
multiple parallel architectures, places additional demands on the
tools for program development and analysis. The need for integration
of these tools into a comprehensive programming environment is even
more pronounced and will require more sophisticated use of the
language system technology (i.e., compiler and runtime
system). Furthermore, the environment requirements of high-level
support for the programmer, large-scale applications, and portable
access to diverse machines also apply to the program analysis tools.
The TAU performance system is an integrated performance instrumentation, measurement, and analysis toolkit offering support for profiling and tracing modes of measurement. This paper introduces memory introspection capabilities of TAU featured on the Cray XT3 Catamount compute node kernel. TAU supports examining the memory headroom, or the amount of heap memory available, at routine entry, and correlates it to the program’s callstack as an atomic event.
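The headroom-as-atomic-event idea can be sketched as follows. Everything here is a simplified stand-in: `headroom_probe` returns an invented constant, whereas TAU's Catamount support queries the actual heap, and the decorator mimics routine-entry instrumentation.

```python
events = []       # recorded (callstack, headroom) atomic events
call_stack = []   # current routine nesting

def headroom_probe():
    # Invented constant; a real probe asks the allocator/OS for free heap.
    return 1_000_000

def instrumented(name):
    """Decorator standing in for routine-entry instrumentation."""
    def wrap(fn):
        def inner(*args, **kwargs):
            call_stack.append(name)
            # Atomic event: headroom sample correlated to the callstack.
            events.append((tuple(call_stack), headroom_probe()))
            try:
                return fn(*args, **kwargs)
            finally:
                call_stack.pop()
        return inner
    return wrap

@instrumented("solver")
def solver():
    kernel()

@instrumented("kernel")
def kernel():
    pass

solver()
print(events)  # each entry pairs a callstack with a headroom sample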
The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems will depend on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. The TAU system is offered as an example framework that meets these requirements. With a flexible, modular instrumentation and measurement system, and an open performance data and analysis environment, TAU can target a range of complex performance scenarios. Examples are given showing the diversity of TAU application.
A common complaint when dealing with the performance of computationally intensive
scientific applications on parallel computers is that programs exist to predict the
performance of radar systems, missiles and artillery shells, drugs, etc., but no one knows
how to predict the performance of these applications on a parallel computer. Actually, that
is not quite true. A more accurate statement is that no one knows how to predict the
performance of these applications on a parallel computer in a reasonable amount of time.
PENVELOPE is an attempt to remedy this situation. It is an extension to Amdahl's Law and
Gustafson's work on scaled speedup that takes into account the cost of interprocessor
communication and operating system overhead, yet is simple enough that it was
implemented as an Excel spreadsheet.
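A model of this kind is simple enough to sketch in a few lines. The formula and coefficients below are illustrative assumptions in the Amdahl/Gustafson vein, not the actual PENVELOPE equations: ideal speedup degraded by communication and operating-system overhead terms that grow with the processor count.

```python
def predicted_speedup(p, serial_fraction, comm_per_proc, os_overhead):
    """Speedup on p processors: Amdahl-style parallel time plus
    communication and OS overhead that grow with p (assumed model)."""
    parallel_time = serial_fraction + (1 - serial_fraction) / p
    overhead_time = p * comm_per_proc + os_overhead
    return 1.0 / (parallel_time + overhead_time)

# With overhead included, speedup saturates and eventually declines.
for p in (1, 16, 256):
    print(p, predicted_speedup(p, 0.01, 1e-4, 1e-3))
```

Like the spreadsheet version described above, evaluating such a closed-form model takes negligible time compared with simulating the application.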
Performance profiling of MPI programs generates overhead during
execution that introduces error in profile measurements. It is possible to track and
remove overhead online, but it is necessary to communicate execution delay
between processes to correctly adjust their interdependent timing. We demonstrate
the first implementation of an online measurement overhead compensation system
for profiling MPI programs. This is implemented in the TAU performance
system. It requires novel techniques for communicating delay in the use of MPI.
The ability to reduce measurement error is demonstrated for problematic test
cases and real applications.
A scalable approach to performance analysis of MPI applications is
presented that includes automated source code instrumentation, low overhead
generation of profile and trace data, and database management of performance
data. In addition, tools are described that analyze large-scale parallel profile and
trace data. Analysis of trace data is done using an automated pattern-matching
approach. Examples of using the tools on large-scale MPI applications are
presented.
Workflows offer scientists a simple but flexible
programming model at a level of abstraction closer
to the domain-specific activities that they seek to
perform. However, languages for describing
workflows tend to be highly complex, or specialized towards
a particular domain, or both. WOOL is an
abstract workflow language with human-readable
syntax, intuitive semantics, and a powerful abstract
type system. WOOL workflows can be targeted
to almost any kind of runtime system supporting
data-flow computation. This paper describes the
design of the WOOL language and the implementation
of its compiler, along with a simple example
runtime. We demonstrate its use in an image-processing
workflow.
The concepts involved in the programming
of multicore systems have been well known for
decades. The problem is to make that programming as easy
as sequential programming. This new trend will
change the way we think about the whole development
process. We will show that it is possible to develop a
multicore embedded system application using existing
tools and the model-driven development process
proposed. To do this, two tools will be used:
VisualRTXC (available at www.quadrosbrasil.com.br)
for generating the multithread
communication/synchronization structures and a
performance tool called TAU (available at
http://www.cs.uoregon.edu/research/tau/home.php) for
the tuning of the final implementation.
This article discusses approaches to implementing object-independent
event trace monitoring and analysis systems. The term
object-independent means that the system can be used for the analysis
of arbitrary (non-sequential) computer systems, operating systems,
programming languages and applications. Three main topics are
addressed: object-independent monitoring, standardization of event
trace formats and access interfaces and the application-independent
but problem-oriented implementation of analysis and visualization
tools. Based on these approaches, the distributed hardware monitor
system ZM4 and the SIMPLE event trace analysis environment were
implemented, and have been used in many 'real-world' applications
throughout the last three years. An overview of the projects in which
the ZM4/SIMPLE tools were used is given in the last section.
Programming non-sequential computer systems is hard! Many tools and
environments have been designed and implemented to ease the use and
programming of such systems. The majority of analysis tools are
event-based and use event traces for representing the dynamic
behavior of the system under investigation, the object system. Most
tools can only be used for one special object system, or a specific
class of systems such as distributed shared memory machines. This
limitation is not obvious because all tools provide the same basic
functionality.
In this paper, we discuss TAU (Tuning and Analysis Utilities), the
first prototype of an integrated and portable program analysis
environment for pC++, a parallel object-oriented language system. TAU
is unique in that it was developed specifically for pC++ and relies
heavily on pC++'s compiler and transformation tools (specifically, the
Sage++ toolkit) for its implementation. This tight integration allows
TAU to achieve a combination of portability, functionality, and
usability not commonly found in high-level language environments. The
paper describes the design and functionality of TAU, using a new tool
for breakpoint-based program analysis as an example of TAU's
capabilities.
We report on our experiences in building a computational environment for tomographic image analysis for marine seismologists studying the structure and evolution of mid-ocean ridge volcanism. The computational environment is determined by an evolving set of requirements for this problem domain and includes needs for high-performance parallel computing, large data analysis, model visualization, and computation interaction and control. Although these needs are not unique in scientific computing, the integration of techniques for seismic tomography with tools for parallel computing and data analysis into a computational environment was (and continues to be) an interesting, important learning experience for researchers in both disciplines. For the geologists, the use of the environment led to fundamental geologic discoveries on the East Pacific Rise, the improvement of parallel ray tracing algorithms, and a better regard for the use of computational steering in aiding model convergence. The computer scientists received valuable feedback on the use of programming, analysis, and visualization tools in the environment. In particular, the tools for parallel program data query (DAQV) and visualization programming (Viz) were demonstrated to be highly adaptable to the problem domain. We discuss the requirements and the components of the environment in detail. Both accomplishments and limitations of our work are presented.
In the race for Exascale, the advent of many-core processors
will bring a shift in parallel computing architectures
to systems of much higher concurrency, but with a relatively
smaller memory per thread. This shift raises concerns
about the adaptability of the current generation of HPC software
to this brave new world. In this paper, we study
domain splitting on an increasing number of memory areas
as an example problem where negative performance impact
on computation could arise. We identify the specific parameters
that drive scalability for this problem, and then
model the halo-cell ratio on common mesh topologies to
study the memory and communication implications. Such
analysis argues for the use of shared-memory parallelism,
such as with OpenMP, to address the performance problems
that could occur. In contrast, we propose an original
solution based entirely on MPI programming semantics,
while providing the performance advantages of hybrid
parallel programming. Our solution transparently replaces
halo-cells transfers with pointer exchanges when MPI tasks
are running on the same node, effectively removing memory
copies. The results we present demonstrate gains in terms of
memory and computation time on Xeon Phi (compared to
OpenMP-only and MPI-only) using a representative domain
decomposition benchmark.
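The halo-cell ratio mentioned above is easy to make concrete. The sketch below assumes a cubic subdomain with a one-cell-deep halo (the paper models several mesh topologies; this is just the simplest case): as domain splitting assigns smaller subdomains to more MPI tasks, the fraction of memory spent on halo cells grows sharply.

```python
def halo_ratio(n, halo=1):
    """Ratio of halo cells to interior cells for an n^3 cubic subdomain
    with a halo `halo` cells deep on every face."""
    total = (n + 2 * halo) ** 3
    interior = n ** 3
    return (total - interior) / interior

# Shrinking the subdomain side from 64 to 4 cells makes halo storage
# dominate, which motivates sharing halos via pointer exchange on-node.
for n in (64, 16, 4):
    print(n, halo_ratio(n))
```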
The advent of many-core architectures poses new challenges
to the MPI programming model which has been designed for
distributed memory message passing. It is now clear that
MPI will have to evolve in order to exploit shared-memory
parallelism, either by collaborating with other programming
models (MPI+X) or by introducing new shared-memory approaches.
This paper considers extensions to C and C++ to
make it possible for MPI processes to run as threads. More
generally, a thread-local storage (TLS) library is developed
to simplify the collocation of arbitrary tasks and services in
a shared-memory context called a task-container. The paper
discusses how such containers simplify model and service
mixing at the OS process level, eventually easing the collocation
of arbitrary tasks with MPI processes in a runtime
agnostic fashion, opening alternatives to runtime stacking.
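The thread-local storage idea at the heart of task containers can be illustrated in miniature. This is only an analogy (the paper's mechanism is C/C++ TLS with compiler support, not Python): several "MPI processes" run as threads of one OS process, yet each sees its own copy of what used to be a process-global variable.

```python
import threading

tls = threading.local()  # per-thread storage: each "MPI process" gets its own view
results = {}

def mpi_process(rank):
    # What would be process-global state lives in TLS, so collocated
    # tasks in the same address space do not clobber each other.
    tls.rank = rank
    tls.buffer = [rank] * 3
    results[rank] = (tls.rank, sum(tls.buffer))

threads = [threading.Thread(target=mpi_process, args=(r,)) for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```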
This paper presents the design, implementation, and application of ParaProf, a
portable, extensible, and scalable tool for parallel performance profile analysis.
ParaProf attempts to offer ``best of breed'' capabilities to performance analysts --
those inherited from a rich history of single processor profilers and those being
pioneered in parallel tools research. We present ParaProf as a parallel profile
analysis framework that can be retargeted and extended as required.
ParaProf's design and operation are discussed, and its novel support for
large-scale parallel analysis is demonstrated with a 512-processor application profile
generated using the TAU performance system.
Performance is the reason for parallel computing. Despite years of work, many
applications achieve only a few percent of theoretical peak. Performance
measurement and analysis tools exist to identify the problems with current programs
and systems. Performance Prediction is intended to identify issues in new code or
systems before they are fully available. These two topics are closely related since
most prediction requires data to be gathered from measured runs of program (to
identify application signatures or to understand the performance characteristics of
current machines).
Measurement-based profiling introduces intrusion in program execution. Intrusion effects
can be mitigated by compensating for measurement overhead. Techniques for compensation
analysis in performance profiling are presented and their implementation in the TAU
performance system is described. Experimental results on the NAS parallel benchmarks
demonstrate that overhead compensation can be effective in improving the accuracy of
performance profiling.
Due to the diversity of parallel and distributed computing infrastructures and
programming models, and the complexity of issues involved in the design and
development of parallel programs, the creation of tools and environments to support
the broad range of parallel system and software functionality has been widely
recognized as a difficult challenge.
Current research in this topic continues to address individual tools for supporting
correctness and performance issues in parallel program development. However,
standalone tools are sometimes insufficient to cover the rich diversity of tasks found
in the design, implementation and production phases of the parallel software life-
cycle. This has motivated interest in interoperable tools, as well as solutions to ease
their integration into unified development and execution environments.
Performance profiling generates measurement overhead during parallel
program execution. Measurement overhead, in turn, introduces
intrusion in a program's runtime performance behavior. Intrusion can
be mitigated by controlling instrumentation degree, allowing a
tradeoff of accuracy for detail. Alternatively, the accuracy in
profile results can be improved by reducing the intrusion error due to
measurement overhead. Models for compensation of measurement overhead
in parallel performance profiling are described. An approach based on
rational reconstruction is used to understand properties of
compensation solutions for different parallel scenarios. From this
analysis, a general algorithm for on-the-fly overhead assessment and
compensation is derived.
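The on-the-fly scheme can be sketched as follows. The per-event cost and class shape are invented for this illustration (the actual algorithm is derived in the paper): since every measurement event inflates the inclusive time of every routine currently on the callstack, its overhead is credited back to all of them as it occurs.

```python
PER_EVENT = 0.001  # assumed calibrated cost of one measurement event (s)

class CompensatingProfiler:
    """Toy profiler that, on every measurement event, credits the event's
    overhead back to every routine currently on the callstack."""

    def __init__(self):
        self.stack = []
        self.credit = {}  # routine -> accumulated overhead to subtract

    def _charge_event(self):
        for routine in self.stack:
            self.credit[routine] = self.credit.get(routine, 0.0) + PER_EVENT

    def enter(self, routine):
        self.stack.append(routine)
        self._charge_event()   # the entry event itself costs PER_EVENT

    def exit(self):
        self._charge_event()   # so does the exit event
        self.stack.pop()

prof = CompensatingProfiler()
prof.enter("main"); prof.enter("solve"); prof.exit(); prof.exit()
print(prof.credit)  # overhead to subtract from each inclusive time
```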
Online application performance monitoring allows tracking
performance characteristics during execution as opposed to doing so
post-mortem. This opens up several possibilities otherwise unavailable
such as real-time visualization and application performance steering that
can be useful in the context of long-running applications. As HPC
systems grow in size and complexity, the key challenge is to keep the online
performance monitor scalable and low overhead while still providing a
useful performance reporting capability. Two fundamental components
that constitute such a performance monitor are the measurement and
transport systems. We adapt and combine two existing, mature systems,
TAU and Supermon, to address this problem. TAU performs the
measurement while Supermon is used to collect the distributed measurement
state. Our experiments show that this novel approach leads to very
low-overhead application monitoring as well as other benefits unavailable
from using a transport such as NFS.
Performance analysis tools are only as useful as the data they collect. Not just accuracy of performance data, but accessibility, is necessary for performance analysis tools to be used to their full effect. The diversity of performance analysis and tuning problems calls for more flexible means of storing and representing performance data. The development and maintenance cycles of high performance programs, in particular, stand to benefit from exploration of and expansion of the means used to record and describe program execution behavior. We describe a means of representing program performance data via a time or event delineated series of performance profiles, or profile snapshots, implemented in the TAU performance analysis system. This includes an explanation of the profile snapshot format and means of snapshot analysis.
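The snapshot idea reduces to periodically copying the running profile. The routine names and timings below are invented for this sketch of the concept: differencing consecutive snapshots recovers interval behavior that a single end-of-run profile cannot show.

```python
import copy

profile = {"compute": 0.0, "io": 0.0}  # running per-routine profile
snapshots = []

def take_snapshot(label):
    # A profile snapshot: a labeled, immutable copy of the running profile.
    snapshots.append((label, copy.deepcopy(profile)))

profile["compute"] += 2.0
take_snapshot("after step 1")
profile["compute"] += 2.0
profile["io"] += 0.5
take_snapshot("after step 2")

# Interval behavior falls out by differencing consecutive snapshots.
(_, a), (_, b) = snapshots
delta = {k: b[k] - a[k] for k in b}
print(delta)
```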
With support for C/C++, Fortran, MPI, OpenMP, and performance tools, the Eclipse integrated development environment (IDE) is a serious contender as a programming environment for parallel applications. There is interest in adding capabilities in Eclipse for conducting workflows where an application is executed under different scenarios and its outputs are processed. For instance, parametric studies are a requirement in many benchmarking and performance tuning efforts, yet there was no experiment management support available for the Eclipse IDE. In this paper, we describe an extension of the Parallel Tools Platform (PTP) plugin for the Eclipse IDE. The extension provides a graphical user interface for selecting experiment parameters, launches build and run jobs, manages the performance data, and launches an analysis application to process the data. We describe our implementation, and discuss three experiment examples which demonstrate the experiment management support.
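At its core, the experiment-management workflow enumerates parameter combinations and launches a build/run job for each. The sketch below only mimics that loop; `build_and_run` and its cost formula are invented stand-ins for the jobs the Eclipse/PTP extension actually launches.

```python
import itertools

def build_and_run(opt_level, nprocs):
    # Invented stand-in for building and running the application with
    # one parameter combination; returns a pretend timing result.
    return {"opt": opt_level, "np": nprocs, "time": 100.0 / nprocs - 2 * opt_level}

parameters = {"opt_level": [0, 2], "nprocs": [4, 8]}

# Enumerate the cross product of experiment parameters, run each
# configuration, and hand the collected results to analysis.
experiments = [build_and_run(o, n)
               for o, n in itertools.product(parameters["opt_level"],
                                             parameters["nprocs"])]
best = min(experiments, key=lambda r: r["time"])
print(best)
```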
Numerous programming models have been introduced to allow
programmers to utilize new accelerator-based architectures. While
OpenCL and CUDA provide low-level access to accelerator programming,
the task cries out for a higher-level abstraction. Of the higher-level
programming models which have emerged, few are intended to
co-exist with mainstream, general-purpose languages while supporting
tunability, composability, and transparency of implementation. In this
paper, we propose that extensions to the type systems (implementable as syntactically
neutral annotations) of traditional, general-purpose languages
can be made which allow programmers to work at a higher level of abstraction
with respect to memory, deferring much of the tedium of data
management and movement code to an automatic code generation tool.
Furthermore, our technique, based on formal term rewriting, allows for
user-defined reduction rules to optimize low-level operations and exploit
domain- and/or application-specific knowledge.
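Term rewriting with user-defined rules can be sketched compactly. The term encoding and the example rule below are invented for illustration (the paper's system rewrites generated data-movement code, not Python tuples): a rule recognizes a pattern and replaces it, and rules are applied until a fixed point is reached.

```python
def rewrite(term, rules):
    """Apply rewrite rules bottom-up until a fixed point is reached.
    Terms are nested tuples; a rule returns a new term or None."""
    if isinstance(term, tuple):
        term = tuple(rewrite(t, rules) for t in term)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(term)
            if new is not None and new != term:
                term = new
                changed = True
    return term

# Example domain rule: copying data to the device and straight back is
# the identity, so the round trip can be eliminated.
def cancel_roundtrip(term):
    if isinstance(term, tuple) and term and term[0] == "to_host":
        inner = term[1]
        if isinstance(inner, tuple) and inner and inner[0] == "to_device":
            return inner[1]
    return None

expr = ("to_host", ("to_device", "x"))
print(rewrite(expr, [cancel_roundtrip]))
```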
In recent years, a range of novel methodologies and tools have been developed for the
purpose of evaluation, design, and model reduction of existing and emerging parallel and
distributed systems. At the same time, the coverage of the term ‘performance’ has
constantly broadened to include reliability, robustness, energy consumption, and
scalability in addition to classical performance-oriented evaluations of system
functionalities. Indeed, the increasing diversification of parallel systems, from cloud
computing to exascale, fueled by technological advances, is placing greater
emphasis on the methods and tools to address more comprehensive concerns. The aim
of the Performance Prediction and Evaluation topic is to bring together system designers
and researchers involved with the qualitative and quantitative evaluation and modeling of
large-scale parallel and distributed applications and systems to focus on current critical
areas of performance prediction and evaluation theory and practice.
Tuning codes for GPGPU architectures is challenging because
few performance tools can pinpoint the exact causes of execution bottlenecks.
While profiling applications can reveal execution behavior with a
particular architecture, the abundance of collected information can also
overwhelm the user. Moreover, performance counters provide cumulative
values but do not attribute events to code regions, which makes identifying
performance hot spots difficult. This research focuses on characterizing
the behavior of GPU application kernels and their performance at the node
level by providing a visualization and metrics display that indicates the
behavior of the application with respect to the underlying architecture.
We demonstrate the effectiveness of our techniques with LAMMPS and
LULESH application case studies on a variety of GPU architectures. By
sampling instruction mixes for kernel execution runs, we reveal a variety
of intrinsic program characteristics relating to computation, memory and
control flow.
This paper describes the design and implementation of the Distributed Array Query and Visualization (DAQV) system for High Performance Fortran, a project sponsored by the Parallel Tools Consortium. DAQV's implementation leverages the HPF language, compiler, and runtime system to address the general problem of providing high-level access to distributed data structures. DAQV supports a framework in which visualization and analysis clients connect to a distributed array server (i.e., the HPF application with DAQV control) for program-level access to array values. Implementing key components of DAQV in HPF itself has led to a robust and portable solution in which clients do not need to know how the data is distributed.
To aid in building high-performance computational environments,
INTERLACE offers a framework for linking reusable computational
engines in a heterogeneous distributed system. The INTERLACE model
provides clients with access to computational servers which interface
with "wrapped" computational engines. The wrappers implement
mechanisms to translate client requests to engine actions and to move
data across the server interface. These mechanisms are programmable,
allowing engines of different types to be integrated. The framework
takes advantage of the HPC++ runtime system to access servers through
distributed object operations. The INTERLACE framework has been
demonstrated by building a distributed computational environment with
MatLab engines.
The influences of OS and system-specific effects on application performance are
increasingly important in high performance computing. In this regard, OS kernel
measurement is necessary to understand the interrelationship of system and
application behavior. This can
be viewed from two perspectives: kernel-wide and process-centric. An
integrated methodology and framework to observe both views in HPC
systems using OS kernel measurement has remained elusive. We demonstrate
a new tool called KTAU (Kernel TAU) that aims to provide parallel kernel
performance measurement from both perspectives. KTAU extends the TAU
performance system with kernel-level monitoring, while
leveraging TAU's measurement and analysis capabilities. As part of the
ZeptoOS scalable operating systems project, we report early experiences
using KTAU in ZeptoOS on the IBM BG/L system.
Parallel performance tuning naturally involves a diagnosis
process to locate and explain sources of program inefficiency. Proposed
is an approach that exploits parallel computation patterns (models) for
diagnosis discovery. Knowledge of performance problems and inference
rules for hypothesis search are engineered from model semantics and
analysis expertise. In this manner, the performance diagnosis process
can be automated as well as adapted for parallel model variations. We
demonstrate the implementation of model-based performance diagnosis
on the classic Master-Worker pattern. Our results suggest that pattern-
based performance knowledge can provide effective guidance for locating
and explaining performance bugs at a high level of program abstraction.
To enable a scalable parallel application to view its global performance state, we designed and
developed TAUg, a portable runtime framework layered on the TAU parallel performance
system. TAUg leverages the MPI library to communicate between application processes, creating
an abstraction of a global performance space from which profile views can be retrieved. We
describe the TAUg design and implementation and show its use on two test benchmarks up to
512 processors. Overhead evaluation for the use of TAUg is included in our analysis. Future
directions for improvement are discussed.
In this article we propose a ``standard'' performance tool interface for
OpenMP, similar in spirit to the MPI profiling interface in its intent to
define a clear and portable API that makes OpenMP execution events visible
to performance libraries. When used together with the MPI profiling
interface, it also allows tools to be built for hybrid applications that
mix shared and distributed memory programming. We describe an
instrumentation approach based on OpenMP directive rewriting that generates
calls to the interface and passes context information (e.g., source code
locations) in a portable and efficient way. Our proposed OpenMP performance
API further allows user functions and arbitrary code regions to be marked
and performance measurement to be controlled using new proposed OpenMP
directives. The directive transformations we define are implemented in a
source-to-source translation tool called OPARI.
We have used it to integrate the TAU performance analysis
framework and the automatic event trace analyzer EXPERT with the proposed OpenMP performance interface.
Together, these tools show that a portable and robust solution to
performance analysis of OpenMP and hybrid applications is possible.
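The directive-rewriting idea can be illustrated with a minimal sketch. This is not OPARI itself; the function and event names below are illustrative stand-ins for a POMP-style interface, and a real rewriter also handles block ends, join events, and many more directive forms:

```python
import re

# Hypothetical sketch of OpenMP directive rewriting: before each
# "#pragma omp parallel" we insert a call into a performance interface,
# passing context information (source file and line) in the call.
def instrument_openmp(source: str) -> str:
    out = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if re.match(r"\s*#pragma\s+omp\s+parallel\b", line):
            # Make the fork event visible to the performance library.
            out.append(f'POMP_Parallel_fork("demo.c", {lineno});')
            out.append(line)
            # A real tool also rewrites the region end to emit a join event.
        else:
            out.append(line)
    return "\n".join(out)

src = "int main() {\n#pragma omp parallel\n{ work(); }\n}"
print(instrument_openmp(src))
```

The same source-to-source strategy keeps the transformation portable across compilers, since only standard directives and plain function calls are emitted.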
"Regular" is an often-used term to suggest the simple and uniform structure of a parallel
processor's organization or a parallel algorithm's operation. However, a strict definition is
long overdue. In this paper, we define regularity for processor array structures in two
dimensions and enumerate the eleven distinct regular topologies. Space and time emulation
schemes among the regular processor arrays are constructed to compare their geometric
and performance characteristics. The hexagonal array is shown to have the most efficient
emulation capabilities.
Twig is a language for writing typemaps, programs which
transform the type of a value while preserving its underlying
meaning. Typemaps are typically used by tools that generate
code, such as multi-language wrapper generators, to automatically
convert types as needed. Twig builds on existing
typemap tools in a few key ways. Twig’s typemaps are composable
so that complex transformations may be built from
simpler ones. In addition, Twig incorporates an abstract,
formal model of code generation, allowing it to output code
for different target languages. We describe Twig’s formal
semantics and show how the language allows us to concisely
express typemaps. Then, we demonstrate Twig’s utility by
building an example typemap.
The lack of tools to observe the operation and performance
of message-based parallel architectures limits the
user's ability to effectively optimize application and system
performance. Performance data collection, analysis,
and visualization tools are needed to manage the complexity
and quantity of performance data. Furthermore, these
tools must be integrated with the machine hardware, the
system software, and the applications support software if
they are to find pervasive use in program development and
experimentation.
In this paper, we describe an integrated performance
environment being developed for the Intel iPSC/2 hypercube.
The data collection components of the environment
include software event tracing at the operating system
and program levels plus a hardware-based performance
monitoring system used to unobtrusively capture software
events. A visualization system, based on the X window
system, permits the performance analyst to browse and
explore interesting data components by dynamically interconnecting
new performance displays and data analysis
tools.
As software complexity increases, the analysis of code behavior during its execution is becoming more important. Instrumentation techniques, through the insertion of code directly into binaries, are essential to program analyses such as performance evaluation and profiling. In the context of high-performance parallel applications, building an instrumentation framework is quite challenging. One of the difficulties is due to the necessity to capture coarse-grain behavior, such as the execution time of different functions, as well as finer-grain behavior in order to pinpoint performance issues.
In this paper, we propose a language, MIL, for the development of program analysis tools based on static binary instrumentation. The key feature of MIL is to ease the integration of static, global program analysis with instrumentation. We will show how this enables both a precise targeting of the code regions to analyze, and a better understanding of the optimized program behavior.
Particle advection is a foundational operation for many flow visualization techniques,
including streamlines, Finite-Time Lyapunov Exponents (FTLE) calculation, and stream
surfaces. The workload for particle advection problems varies greatly, including
significant variation in computational requirements. With this study, we consider the
performance impacts from hardware architecture on this problem, studying
distributed-memory systems with CPUs with varying amounts of cores per node, and
with nodes with one to three GPUs. Our goal was to explore which architectures were
best suited to which workloads, and why. While the results of this study will help
inform visualization scientists which architectures they should use when solving
certain flow visualization problems, it is also informative for the larger HPC
community, since many simulation codes will soon incorporate visualization via in situ
techniques.
There are two main conclusions from this work. First, interaction
support should be integrated with a language system facilitating an
implementation of a model that is consistent with the language
design. This aids application developers or the tool builders that
require this interaction. Second, as the implementation of Breezy
shows, the development of interaction support can leverage off the
language itself as well as its compiler and runtime systems.
This paper presents a general architecture for runtime interaction
with a data-parallel program. We have applied this architecture in the
development of the Breezy tool for the pC++ language. Breezy grants
application programs convenient and efficient access to higher-level
external services (e.g., databases, visualization systems, and
distributed resources) and allows external access to the application's
state (e.g., for program state display or computational
steering). Although such support can be developed on an ad-hoc basis
for each application, a general approach to the problem of parallel
program interaction is preferred. A general approach makes tools more
portable and retargetable to different language systems.
Tracing parallel programs to observe their performance introduces intrusion as the result of
trace measurement overhead. If post-mortem trace analysis does not compensate for the
overhead, the intrusion will lead to errors in the performance results. We show that
measurement overhead can be accounted for during trace analysis and intrusion modeled and
removed. Algorithms developed in our earlier work are reimplemented in a more robust and
modern tool, KOJAK, allowing them to be applied in large-scale parallel programs. The ability
to reduce trace measurement error is demonstrated for a Monte-Carlo simulation
based on a master/worker scheme. As an additional result, we visualize how local
perturbation propagates across process boundaries and alters the behavioral char-
acteristics of non-local processes.
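The core of overhead compensation on a single process can be sketched as follows. This is an illustrative simplification, not the KOJAK algorithm: it assumes a fixed, known per-event measurement cost, whereas the real analysis must also model how local perturbation propagates across process boundaries through message events:

```python
# Illustrative sketch: approximate "uninstrumented" timestamps by
# subtracting the accumulated measurement overhead from each event.
# Every recorded event delays all later events by `overhead` seconds.
def compensate(timestamps, overhead):
    adjusted = []
    accumulated = 0.0
    for t in timestamps:
        accumulated += overhead   # this event's own measurement cost
        adjusted.append(t - accumulated)
    return adjusted

# Three events recorded at t = 1.0, 2.0, 3.0 s with 0.1 s overhead each;
# the correction grows with the number of preceding events.
print(compensate([1.0, 2.0, 3.0], 0.1))
```

In a parallel trace, a send whose timestamp shifts earlier can change when the matching receive completes, which is exactly the cross-process propagation effect the abstract describes visualizing.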
The Eclipse platform offers Integrated Development Environment support
for a diverse and growing array of programming applications and languages.
There is an increasing call for programming tools to support various
development tasks from within Eclipse. This includes tools for testing
and analyzing program performance. We describe the high-level synthesis
of the Eclipse platform with the TAU parallel performance analysis
system. By leveraging Eclipse's modularity and extensibility with
TAU's robust automated performance analysis mechanisms we produce
an integrated, GUI controlled performance analysis system for Java,
C/C++ and High Performance Computing development within Eclipse.
Parallel performance diagnosis can be improved with the use of performance knowledge about parallel computation models. The Hercule
diagnosis system applies model-based methods to automate performance
diagnosis processes and explain performance problems from high-level
computation semantics. However, Hercule is limited by a single experiment view. Here we introduce the concept of relative performance diagnosis and show how it can be integrated in a model-based diagnosis framework. The paper demonstrates the effectiveness of Hercule's approach to relative diagnosis of the well-known Sweep3D application based on a Wavefront model. Relative diagnoses of Sweep3D performance anomalies in strong and weak scaling cases are given.
Scientific computing on massively parallel computers presents
unique challenges to component-based software engineering (CBSE).
While CBSE is at least as enabling for scientific computing as it is
for other arenas, the requirements are different. We briefly discuss
how these requirements shape the Common Component Architecture, and we
describe some recent research on quality-of-service issues to address
the computational performance and accuracy of scientific simulations.
Computational environments used by scientists should provide
high-level support for scientific processes that involve the
integrated and systematic use of familiar abstractions from a
laboratory setting, including notebooks, instruments, experiments, and
analysis tools. However, doing so while hiding the complexities of
the underlying computational platform is a challenge. ViNE is a
web-based electronic notebook that implements a high-level interface
for applying computational tools in scientific experiments in a
location- and platform-independent manner. Using ViNE, a scientist
can specify data and tools, and construct experiments that apply them
in well-defined procedures. ViNE's implementation of the experiment
abstraction offers the scientist an easy-to-understand framework for
building scientific processes. This paper discusses how ViNE
implements computational experiments in distributed, heterogeneous
computing environments.
Advances in human brain neuroimaging to achieve high-temporal and high-spatial
resolution will depend on computational approaches to localize EEG signals to their
sources in the cortex. The source localization inverse problem is inherently ill-posed and
depends critically on the modeling of human head electromagnetics. In this paper we
present a systematic methodology to analyze the main factors and parameters that affect
the accuracy of the EEG source-mapping solutions. We argue that these factors are not
independent and their effect must be evaluated in a unified way. To do so requires
significant computational capabilities to explore the landscape of the problem, to quantify
uncertainty effects, and to evaluate alternative algorithms. We demonstrate that bringing
HPC to this domain will enable such investigation and will allow new avenues for
neuroinformatics research. Two algorithms for the electromagnetics forward problem (the
heart of the source localization inverse), incorporating tissue inhomogeneity and
impedance anisotropy, are presented and their parallel implementations described. The
head model forward solvers are evaluated and their performance analyzed.
Since the beginning of ``high-performance'' parallel
computing, observing and analyzing performance for
purposes of finding bottlenecks and identifying
opportunities for improvement has been at the heart of
delivering the performance potential of next-generation
scalable systems. Interestingly, it is the ever-changing
parallel computing landscape that is the main driver of
requirements for parallel performance technology and the
improvements necessary beyond the current state-of-the-art.
Indeed, the development and application of our TAU
Performance System over many years largely follows an
evolutionary path of addressing measurement and analysis
problems in new parallel machines and programming
environments. However, the outlook to future parallel
systems with high degrees of concurrency, heterogeneous
components, dynamic runtime environments, asynchronous
execution, and power constraints suggests a new
perspective will be needed on the role of performance
observation and analysis in respect to tool technology
integration and performance optimization methods. The
reliance on post-mortem analysis of application-level ("1st
person") performance measurements is prohibitive for
exascale-class machines because of the performance data
volume, the primitive basis for performance data
attribution, and the fundamental problem of performance
variation that will exist. Instead, it will be important to
provide introspection support across the exascale software
stack to understand how system ("3rd person") resources
are used during execution. Furthermore, the opportunity to
couple a global performance introspection capability (a
"performance backplane") with online performance
decision analytics inspires the concept of an autonomic
performance system that can feed back policy-based
decisions to guide the computation to better states of
execution. The talk will explore these issues by giving a
brief retrospective on performance tool evolution, setting
the stage for current research projects where a new
performance perspective is being pursued. It will also
speculate on what might be included in next-generation
parallel systems hardware, specifically to make the
exascale machines more performance-aware and
dynamically-adaptive.
Current trends for high-performance computing systems are
leading us towards hardware over-provisioning where it is no
longer possible to run each component at peak power without
exceeding a system or facility wide power bound. In
such scenarios, the power consumed by individual components
must be artificially limited to guarantee system operation
under a given power bound. In this paper, we present
the design of a power scheduler capable of enforcing such a
bound using dynamic system-wide power reallocation in an
application-agnostic manner. Our scheduler achieves better
job runtimes than a naïve power scheduling approach
without requiring a priori knowledge of application power
behavior.
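The contrast with a naïve scheduler can be sketched in a few lines. This is a hypothetical illustration of dynamic power reallocation under a global bound, not the paper's actual scheduler; all names and the proportional-sharing policy are assumptions:

```python
# Hypothetical sketch: start from a naive equal split of the system-wide
# power bound, cap each job at its measured demand, then redistribute the
# reclaimed slack to jobs that can still use more power.
def allocate_power(demands, bound):
    n = len(demands)
    alloc = [bound / n] * n                 # naive equal split
    # Reclaim power from jobs that need less than their share...
    slack = sum(max(0.0, alloc[i] - demands[i]) for i in range(n))
    hungry = [i for i in range(n) if demands[i] > alloc[i]]
    for i in range(n):
        alloc[i] = min(alloc[i], demands[i])
    # ...and share it among the power-hungry jobs, never exceeding demand.
    if hungry:
        share = slack / len(hungry)
        for i in hungry:
            alloc[i] = min(demands[i], alloc[i] + share)
    return alloc

# Three jobs demanding 50, 150, and 200 W under a 300 W bound:
print(allocate_power([50, 150, 200], 300))
```

The naive scheduler would stop at the equal split, stranding power at the under-demanding job; reallocation is what recovers the runtime difference the abstract reports.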
We report our experiences porting Spark to large production
HPC systems. While Spark performance in a data center
installation (with local disks) is dominated by the network,
our results show that file system metadata access latency
can dominate in an HPC installation using Lustre: it makes
single-node performance up to 4× slower than on a
typical workstation. We evaluate a combination of software
techniques and hardware configurations designed to address
this problem. For example, on the software side we develop
a file pooling layer able to improve per node performance
up to 2.8×. On the hardware side we evaluate a system
with a large NVRAM buffer between compute nodes and the
backend Lustre file system: this improves scaling at the expense
of per-node performance. Overall, our results indicate
that scalability is currently limited to O(10^2) cores in an
HPC installation with Lustre and default Spark. After careful
configuration combined with our pooling we can scale up
to O(10^4). As our analysis indicates, it is feasible to observe
much higher scalability in the near future.
The Distributed Array Query and Visualization (DAQV) project aims to
develop systems and tools that facilitate interacting with distributed
programs and data structures. Arrays distributed across the processes
of a parallel or distributed application are made available to
external clients via well-defined interfaces and protocols. Our design
considers the broad issues of language targets, models of interaction,
and abstractions for data access, while our implementation attempts to
provide a general framework that can be adapted to a range of
application scenarios. The paper describes the second generation of
DAQV work and places it in the context of the more general distributed
array access problem. Current applications and future work are also
described.
We present a method for evaluating ICA separation of artifacts from EEG
(electroencephalographic) data. Two algorithms, Infomax and FastICA, were applied
to "synthetic data," created by superimposing simulated blinks on a blink-free EEG.
To examine sensitivity to different data characteristics, multiple datasets were
constructed by varying properties of the simulated blinks. ICA was used to
decompose the data, and each source was cross-correlated with a blink template.
Different thresholds for correlation were used to assess stability of the algorithms.
When a match between the blink-template and a component was obtained, the
contribution of the source was subtracted from the EEG. Since the original data were
known a priori to be blink-free, it was possible to compute the correlation between
these "baseline" data and the results of different decompositions. By averaging the
filtered data, time-locked to the simulated blinks, we illustrate effects of different
outcomes for EEG waveform and topographic analysis.
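The template-matching step described above can be sketched in plain Python. This is an illustrative reconstruction, not the paper's code; the threshold value and all names are assumptions, and the paper evaluates several thresholds rather than one:

```python
import math

# Sketch: cross-correlate each recovered ICA source with a blink
# template and flag sources whose normalized correlation magnitude
# exceeds a chosen threshold (threshold here is an assumed value).
def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def blink_sources(sources, template, threshold=0.8):
    return [i for i, s in enumerate(sources)
            if abs(corr(s, template)) >= threshold]

template = [0, 1, 4, 1, 0]           # stylized blink shape
sources = [[0, 1, 4, 1, 0],          # near-perfect match
           [1, 0, 1, 0, 1]]          # unrelated oscillation
print(blink_sources(sources, template))   # -> [0]
```

Sources flagged this way are the ones whose contribution would then be subtracted from the EEG before the baseline comparison.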
The source estimation problem for EEG consists of estimating
cortical activity from measurements of electrical potential on the
scalp surface. This is an underconstrained inverse problem, as the
dimensionality of cortical source currents far exceeds the number
of sensors. We develop a novel regularization for this inverse problem
which incorporates knowledge of the anatomical connectivity of
the brain, measured by diffusion tensor imaging. We construct an
overcomplete wavelet frame, termed cortical graph wavelets, by applying
the recently developed spectral graph wavelet transform to
this anatomical connectivity graph. Our signal model is formed by
assuming that the desired cortical currents have a sparse representation
in these cortical graph wavelets, which leads to a convex
ℓ1-regularized least squares problem for the coefficients. On data from
a simple motor potential experiment, the proposed method shows
improvement over the standard minimum-norm regularization.
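In standard notation (the symbols here are generic choices, not necessarily the paper's own), the sparse signal model above leads to an optimization of the form:

```latex
\hat{c} = \arg\min_{c} \; \| y - A W c \|_2^2 + \lambda \| c \|_1 ,
\qquad \hat{s} = W \hat{c}
```

where $y$ holds the scalp potential measurements, $A$ is the lead-field (forward) matrix, $W$ is the cortical graph wavelet synthesis operator, $c$ is the vector of wavelet coefficients, and $\lambda$ weights the sparsity-promoting ℓ1 penalty against data fidelity.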
As computer systems grow in size and complexity, tool support is
needed to facilitate the efficient mapping of large-scale applications
onto these systems. To help achieve this mapping, performance
analysis tools must provide robust performance observation
capabilities at all levels of the system, as well as map low-level
behavior to high-level program constructs. Instrumentation and
measurement strategies, developed over the last several years,
must evolve together with performance analysis infrastructure to
address the challenges of new scalable parallel systems.
Adaptive algorithms are an important technique for achieving portable high
performance. They choose among solution methods and optimizations
according to expected performance on a particular machine. Grid environments
make the adaptation problem harder, because the optimal decision may change
across runs and even during runtime. Therefore, the performance model used
by an adaptive algorithm must be able to change decisions without high
overhead. In this paper, we present work that extends our previous research
on rapid performance modeling to support adaptive grid applications through
sampling and high-granularity modeling. We also outline preliminary results that
show the ability to predict differences in performance among algorithms in the
same program.
The computational environment for estimation of unknown regional
electrical conductivities of the human head, based on realistic geometry from
segmented MRI up to 256 resolution, is described. A finite difference alternating
direction implicit (ADI) algorithm, parallelized using OpenMP, is used to solve the
forward problem describing the electrical field distribution throughout the head
given known electrical sources. A simplex search in the multi-dimensional parameter
space of tissue conductivities is conducted in parallel using a distributed
system of heterogeneous computational resources. The theoretical and computational
formulation of the problem is presented. Results from test studies are provided,
comparing retrieved conductivities to known solutions from simulation.
Performance statistics are also given showing both the scaling of the forward
problem and the performance dynamics of the distributed search.
We present a parallel computational environment used to determine
conductivity properties of human head tissues when the effects of skull
inhomogeneities are modeled. The environment employs a parallel simulated annealing
algorithm to overcome poor convergence rates of the simplex method for larger
numbers of head tissues required for accurate modeling of electromagnetic dynamics
of brain function. To properly account for skull inhomogeneities, parcellation
of skull parts is necessary. The multi-level parallel simulated annealing
algorithm is described and performance results presented. Significant improvements
in both convergence rate and speedup are achieved. The simulated annealing
algorithm was successful in extracting conductivity values for up to thirteen
head tissues without showing computational deficiency.
Using the Eclipse platform we have provided a centralized resource
and unified user interface for the encapsulation of existing
command-line based performance analysis tools. In this paper we describe
the user-definable tool workflow system provided by this performance
framework. We discuss the framework’s implementation and the
rationale for its design. A use case featuring the TAU performance analysis
system demonstrates the utility of the workflow system with respect
to conventional performance analysis procedures.
Contemporary high-end Terascale and Petascale systems are composed of hundreds of thousands of commodity multi-core processors interconnected with high-speed custom networks. Performance characteristics of applications executing on these systems are a function of system hardware and software as well as workload parameters. Therefore, it has become increasingly challenging to measure, analyze and project performance using a single tool on these systems. In order to address these issues, we propose a methodology for performance measurement and analysis that is aware of applications and the underlying system hierarchies. On the application level, we measure cost distribution and runtime dependent values for different components of the underlying programming model. On the system front, we measure and analyze information gathered for unique system features, particularly shared components in the multi-core processors. We demonstrate our approach using a Petascale combustion application called S3D on two high-end Teraflops systems, Cray XT4 and IBM Blue Gene/P, using a combination of hardware performance monitoring, profiling and tracing tools.
In scientific domains where discovery is driven by simulation modeling,
common methodologies and procedures are applied for scientific
investigation. ODESSI (Open Domain-extensible Environment for Simulation-based
Scientific Investigation) is an environment to facilitate the representation
and automated conduct of scientific studies by capturing common methods
for experimentation, analysis, and evaluation used in simulation science. Specific
methods ODESSI will support include parameter studies, optimization, uncertainty
quantification, and sensitivity analysis. By making these methods accessible
in a programmable framework, ODESSI can be used to capture and run
domain-specific investigations. ODESSI is demonstrated for a problem in the
neuroscience domain involving computational modeling of human head electromagnetics
for conductivity analysis and source localization.
We describe a novel 3D finite difference method for solving the anisotropic
inhomogeneous Poisson equation based on a multi-component additive implicit method
with a 13-point stencil. The serial performance is found to be comparable to the most
efficient solvers from the family of preconditioned conjugate gradient (PCG) algorithms.
The proposed multi-component additive algorithm is unconditionally stable in 3D and
amenable for transparent domain decomposition parallelization up to one eighth of the
total grid points in the initial computational domain. Some validation and numerical
examples are given.
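In generic notation (symbols chosen here for illustration), the anisotropic inhomogeneous Poisson problem such a solver targets has the form:

```latex
\nabla \cdot \bigl( \sigma(\mathbf{x}) \, \nabla u(\mathbf{x}) \bigr) = f(\mathbf{x})
```

where $\sigma(\mathbf{x})$ is a symmetric, positive-definite conductivity tensor varying in space (inhomogeneity and anisotropy), $u$ is the potential, and $f$ is the source term.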
A common prerequisite for a number of debugging and performance-analysis
techniques is the injection of auxiliary program code into the application under investigation, a process called instrumentation. To accomplish this task, source-code preprocessors are often used. Unfortunately, existing preprocessing tools either focus only on a very specific aspect or use hard-coded commands for instrumentation. In this paper, we examine which basic constructs are required to specify a user-defined routine entry/exit instrumentation. This analysis serves as a basis for a generic instrumentation component working on the source-code level where the instructions to be inserted can be flexibly configured. We evaluate the identified constructs with our prototypical implementation and show that these are sufficient to fulfill the needs of a number of today's performance-analysis tools.
Electronic structure calculations are a widely used tool in materials
science and a large consumer of supercomputing resources. Traditionally,
the software packages for these kinds of simulations have been
implemented in compiled languages, where Fortran in its different
versions has been the most popular choice. While dynamic, interpreted
languages, such as Python, can increase the efficiency of the programmer,
they cannot compete directly with the raw performance of compiled
languages. However, by using an interpreted language together with a
compiled language, it is possible to have most of the productivity
enhancing features together with a good numerical performance. We
have used this approach in implementing an electronic structure
simulation software GPAW using the combination of Python and C
programming languages. While the chosen approach works well in standard
workstations and Unix environments, massively parallel supercomputing
systems can present some challenges in porting, debugging and profiling
the software. In this paper we describe some details of the
implementation and discuss the advantages and challenges of the combined
Python/C approach. We show that despite the challenges it is possible to
obtain good numerical performance and good parallel scalability with
Python based software.
This paper addresses two key parallelization challenges in the unstructured
mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi
tessellations: (1) load imbalance across processes, and (2) unstructured data
access patterns that inhibit intra- and inter-node performance. Our work
analyzes the load imbalance due to naive partitioning of the mesh, and develops
methods to generate mesh partitionings with better load balance and reduced
communication. Furthermore, we present methods that minimize both inter- and
intra-node data movement and maximize data reuse. Our techniques include
predictive ordering of data elements for higher cache efficiency, as well as
communication reduction approaches. We present detailed performance data when
running on thousands of cores using the Cray XC30 supercomputer and show that
our optimization strategies can exceed the original performance by over 2×.
Additionally, many of these solutions can be broadly applied to a wide variety
of unstructured grid-based computations.
Performance debugging using program profiling and tracing for scientific
workflows can be extremely difficult for two reasons. (1) Existing performance
tools lack the ability to automatically produce global performance data based
on local information from the coupled scientific applications of workflows,
particularly at runtime. (2) Profiling/tracing with static instrumentation may
incur high overhead and significantly slow down science-critical tasks. To gain
more insight into workflows we introduce a lightweight workflow monitoring
infrastructure, WOWMON (WOrkfloW MONitor), which gives users access not only to
cross-application performance data such as end-to-end latency and execution
time of individual workflow components at runtime, but also to customized
performance events. To reduce profiling overhead, WOWMON uses adaptive
selection of performance metrics based on machine learning algorithms to guide
profilers in collecting only the metrics that have the most impact on workflow
performance. Through the study of real scientific workflows (e.g., LAMMPS) with
the help of WOWMON, we found that the performance of the workflows can be
significantly affected by both software and hardware factors, such as the
policy of process mapping and in-situ buffer size. Moreover, we experimentally
show that WOWMON can reduce data movement for profiling by up to 54% without
missing the key metrics for performance debugging.
Producing high-performance implementations from simple, portable computation
specifications is a challenge that compilers have tried to address for several decades.
More recently, a relatively stable architectural landscape has evolved into a set of
increasingly diverging and rapidly changing CPU and accelerator designs, with the
main common factor being dramatic increases in the levels of parallelism available.
The growth of architectural heterogeneity and parallelism, combined with the very
slow development cycles of traditional compilers, has motivated the development of
autotuning tools that can quickly respond to changes in architectures and
programming models, and enable very specialized optimizations that are not possible
or likely to be provided by mainstream compilers. In this paper we describe the new
OpenCL code generator and autotuner OrCL and the introduction of detailed
performance measurement into the autotuning process. OrCL is implemented within
the Orio autotuning framework, which enables the rapid development of experimental
languages and code optimization strategies aimed at achieving good performance on
new platforms without rewriting or hand-optimizing critical kernels. The combination
of the new OpenCL autotuning and TAU measurement capabilities enables users to
consistently evaluate autotuning effectiveness across a range of architectures,
including several NVIDIA and AMD accelerators and Intel Xeon Phi processors, and to
compare the OpenCL and CUDA code generation capabilities. We present results of
autotuning several numerical kernels that typically dominate the execution time of
iterative sparse linear system solution and key computations from a 3-D parallel
simulation of solid fuel ignition.
Partitioned global address space (PGAS) applications,
such as the Tensor Contraction Engine (TCE) in NWChem,
often apply a one-process-per-core mapping in which each process
iterates through the following work-processing cycle: (1)
determine a work-item dynamically, (2) get data via one-sided
operations on remote blocks, (3) perform computation on the data
locally, (4) put (or accumulate) resultant data into an appropriate
remote location, and (5) repeat the cycle. However, this simple
flow of execution does not effectively hide communication latency
costs despite the opportunities for making asynchronous progress.
Utilizing nonblocking communication calls is not sufficient unless
care is taken to efficiently manage a responsive queue of
outstanding communication requests. This paper presents a new
runtime model and its library implementation for managing
tunable "work queues" in PGAS applications. Our runtime
execution model, called WorkQ, assigns some number of on-node
"producer" processes to primarily do communication (steps 1, 2,
4, and 5) and the other "consumer" processes to do computation
(step 3); but processes can switch roles dynamically for the sake
of performance. Load balance, synchronization, and overlap of
communication and computation are facilitated by a tunable
nodewise FIFO message queue protocol. Our WorkQ library
implementation enables an MPI+X hybrid programming model
where the X comprises SysV message queues and the user’s
choice of SysV, POSIX, and MPI shared memory. We develop a
simplified software mini-application that mimics the performance
behavior of the TCE at arbitrary scale, and we show that the
WorkQ engine outperforms the original model by about a factor
of 2. We also show performance improvement in the TCE coupled
cluster module of NWChem.
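The producer/consumer split described above can be sketched with a bounded FIFO queue. This is only a toy illustration of the execution model, not the WorkQ library itself (which is built on SysV/POSIX/MPI shared memory rather than Python threads); the work items and "remote data" are invented stand-ins.

```python
# Sketch of the WorkQ-style producer/consumer cycle (illustrative only).
import queue
import threading

work_items = list(range(8))      # step 1: dynamically determined work-items
results = []
q = queue.Queue(maxsize=4)       # tunable node-wise FIFO queue
SENTINEL = None

def producer():
    # steps 1, 2, 4, 5: pick a work-item, "get" its data, enqueue it
    for item in work_items:
        q.put([item] * 4)        # stand-in for data fetched by a one-sided get
    q.put(SENTINEL)              # signal that no more work is coming

def consumer():
    # step 3: dequeue data and perform the computation locally
    while (data := q.get()) is not SENTINEL:
        results.append(sum(data))

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
```

The bounded queue is what provides the overlap: the producer can run ahead fetching data for future tasks while the consumer computes, up to the queue's capacity.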
Many excellent open-source and commercial tools enable the
detailed measurement of the performance attributes of applications.
However, the process of collecting measurement
data and analyzing it remains effort-intensive because of differences
in tool interfaces and architectures. Furthermore,
insufficient standards and automation may result in losing
information about experiments, which may in turn lead to
misinterpretation of the data and analysis results. Autoperf
aims to support the entire workflow in performance measurement
and analysis in a uniform and portable fashion, enabling
both better productivity through automation of data
collection and analysis and experiment reproducibility.
Empirical performance evaluation of parallel systems and applications can generate
significant amounts of performance data and analysis results from multiple experiments as
performance is investigated and problems diagnosed. Hence, the management of
performance information is a core component of performance analysis tools. To better
support tool integration, portability, and reuse, there is a strong motivation to develop
performance data management technology that can provide a common foundation for
performance data storage, access, merging, and analysis. This paper presents the design and
implementation of the Performance Data Management Framework (PerfDMF). PerfDMF
addresses objectives of performance tool integration, interoperation, and reuse by providing
common data storage, access, and analysis infrastructure for parallel performance profiles.
PerfDMF includes an extensible parallel profile data schema and relational database schema,
a profile query and analysis programming interface, and an extendible toolkit for profile
import/export and standard analysis. We describe the PerfDMF objectives and architecture,
give detailed explanation of the major components, and show examples of PerfDMF
application.
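A relational store of parallel profiles supports exactly the kind of cross-rank queries described above. The following miniature is hypothetical: PerfDMF's actual schema is far richer, and the table and column names here are invented for illustration.

```python
# Toy relational profile store and analysis query (schema is invented,
# not PerfDMF's real one).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE profile (
    trial INTEGER, rank INTEGER, region TEXT, exclusive_time REAL)""")
rows = [
    (1, 0, "MPI_Allreduce", 1.2),
    (1, 1, "MPI_Allreduce", 1.4),
    (1, 0, "compute", 8.0),
    (1, 1, "compute", 7.6),
]
con.executemany("INSERT INTO profile VALUES (?, ?, ?, ?)", rows)

# A typical cross-rank analysis query: mean exclusive time per region.
means = con.execute("""SELECT region, AVG(exclusive_time)
                       FROM profile WHERE trial = 1
                       GROUP BY region ORDER BY region""").fetchall()
```

Storing profiles relationally is what makes merging and comparing trials from different tools a matter of queries rather than custom parsers.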
The Charm++ parallel programming system provides a modular performance interface that can be used to extend its performance measurement and analysis capabilities. The interface exposes execution events of interest representing Charm++ scheduling operations, application methods/routines, and communication events for observation by alternative performance modules configured to implement different measurement features. The paper describes Charm++’s performance interface and how the Charm++ Projections tool and the TAU Performance System can provide integrated trace-based and profile-based performance views. These two tools are complementary, providing the user with different performance perspectives on Charm++ applications based on performance data detail and temporal and
spatial analysis. How the tools work in practice is demonstrated in a parallel performance analysis of NAMD, a scalable molecular dynamics code that applies many of Charm++’s unique features.
Modern parallel performance measurement
systems collect performance information either through probes
inserted in the application code or via statistical sampling.
Probe-based techniques measure performance metrics directly
using calls to a measurement library that execute as part of
the application. In contrast, sampling-based systems interrupt
program execution to sample metrics for statistical analysis
of performance. Although both measurement approaches are
represented by robust tool frameworks in the performance
community, each has its strengths and weaknesses. In this
paper, we investigate the creation of a hybrid measurement
system, the goal being to exploit the strengths of both systems
and mitigate their weaknesses. We show how such a system
can be used to provide the application programmer with a
more complete analysis of their application. Simple example
and application codes are used to demonstrate its capabilities.
We also show how the hybrid techniques can be combined
to provide real cross-language performance evaluation of
an uninstrumented run for mixed compiled/interpreted
execution environments (e.g., Python and C/C++/Fortran).
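The probe/sampling contrast can be made concrete in a few lines. This is a toy sketch, not TAU's implementation: the probe is a timing decorator, and the "sampler" is a helper thread that periodically inspects the main thread's current frame (real samplers use timer interrupts and unwind full call stacks).

```python
# Toy contrast of probe-based vs. sampling-based measurement.
import sys
import threading
import time
from collections import Counter
from functools import wraps

probe_times = Counter()

def probe(fn):
    """Probe-based: direct measurement via calls inserted around the code."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            probe_times[fn.__name__] += time.perf_counter() - t0
    return wrapper

samples = Counter()

def sampler(thread_id, stop, interval=0.001):
    """Sampling-based: periodically interrupt and record the active function."""
    while not stop.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            samples[frame.f_code.co_name] += 1
        time.sleep(interval)

@probe
def busy():
    return sum(range(200_000))

stop = threading.Event()
t = threading.Thread(target=sampler,
                     args=(threading.main_thread().ident, stop))
t.start()
result = busy()
stop.set()
t.join()
```

The trade-off is visible even here: the probe reports exact per-call time but pays its overhead on every invocation, while the sampler's cost is fixed by the sampling interval but its attribution is only statistical.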
The power of GPUs is giving rise to heterogeneous parallel computing,
with new demands on programming environments, runtime systems, and tools
to deliver high-performing applications. This paper studies the problems
associated with performance measurement of heterogeneous machines with
GPUs. A heterogeneous computation model and alternative host-GPU
measurement approaches are discussed to set the stage for reporting new
capabilities for heterogeneous parallel performance measurement in three
leading HPC tools: PAPI, Vampir, and the TAU Performance System. Our work
leverages the new CUPTI tool support in NVIDIA’s CUDA device library.
Heterogeneous benchmarks from the SHOC suite are used to demonstrate the
measurement methods and tool support.
Developing effective yet scalable load-balancing methods for irregular computations is critical to the successful application of simulations in a variety of disciplines at petascale and beyond. This paper explores a set of static and dynamic scheduling algorithms for block-sparse tensor contractions within the NWChem computational chemistry code for different degrees of sparsity (and therefore load imbalance). In this particular application, a relatively large amount of task information can be obtained at minimal cost, which enables the use of static partitioning techniques that take the entire task list as input. However, fully static partitioning is incapable of dealing with dynamic variation of task costs, such as from transient network contention or operating system noise, so we also consider hybrid schemes that utilize dynamic scheduling within subgroups. These two schemes, which have not been previously implemented in NWChem or its proxies (i.e. quantum chemistry mini-apps), are compared to the original centralized dynamic load-balancing algorithm as well as an improved centralized scheme. In all cases, we separate the scheduling of tasks from the execution of tasks into an inspector phase and an executor phase. The impact of these methods upon the application is substantial on a large InfiniBand cluster: execution time is reduced by as much as 50% at scale. The technique is applicable to any scientific application requiring load balance where performance models or estimations of kernel execution times are available.
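The inspector/executor separation can be sketched as follows: the inspector statically partitions the whole task list using estimated costs (here a greedy longest-processing-time heuristic), and the executor then runs each rank's bucket. The task costs are hypothetical; this is not the NWChem implementation.

```python
# Sketch of static inspector/executor scheduling over a known task list.
import heapq

def inspector(task_costs, nranks):
    """Assign each task to the currently least-loaded rank (greedy LPT)."""
    heap = [(0.0, r) for r in range(nranks)]   # (accumulated load, rank)
    buckets = {r: [] for r in range(nranks)}
    # Place the most expensive tasks first for better balance.
    for task, cost in sorted(enumerate(task_costs), key=lambda t: -t[1]):
        load, r = heapq.heappop(heap)
        buckets[r].append(task)
        heapq.heappush(heap, (load + cost, r))
    return buckets

def executor(bucket, task_costs):
    """Run one rank's bucket (here, 'work' is just the summed cost)."""
    return sum(task_costs[t] for t in bucket)

costs = [5.0, 1.0, 3.0, 2.0, 4.0, 1.0]   # hypothetical per-block task costs
buckets = inspector(costs, nranks=2)
loads = [executor(buckets[r], costs) for r in range(2)]
```

When cost estimates are wrong at runtime (contention, OS noise), a hybrid scheme keeps this static assignment but lets ranks within a subgroup steal or reschedule tasks dynamically.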
The Alliant FX/8 multiprocessor implements several high-speed computation ideas in
software and hardware. Each of the 8 computational elements (CEs) has vector capabilities
and multiprocessor support. Generally, the FX/8 delivers its highest processing rates when
executing vector loops concurrently. In this paper, we present extensive empirical
performance results for vector processing on the FX/8. The vector kernels of the LANL
BMK8a1 benchmark are used in the experiments.
A message passing facility (MPF) for shared memory multiprocessors is presented. MPF is
based on a message passing model conceptually similar to conversations. The message
passing primitives for this model are implemented as a portable library of C function calls.
The performance of interprocess communication benchmark programs and two parallel
applications is given.
Heterogeneous parallel systems using GPU devices for application
acceleration have garnered significant attention in
the supercomputing community. However, to realize the full
potential of GPU computing, application developers will require
tools to measure and analyze accelerator performance
with respect to the parallel execution as a whole. A performance
measurement technology for the NVIDIA CUDA
platform has been developed and integrated with the TAU
parallel performance system. The design of the TAUcuda
package is based on an experimental NVIDIA CUDA driver
and associated runtime and device libraries. In any environment
where the CUDA experimental driver is installed,
TAUcuda can provide detailed performance information regarding
the execution of GPU kernels and the interactions
with the parallel program without any modification to the
program source or executable code. The paper describes the
TAUcuda technology and how it is integrated with the TAU
measurement framework to provide integrated performance
views. Various examples of TAUcuda use are presented, including
CUDA SDK examples, a GPU version of the Linpack
benchmark, and a scalable molecular dynamics application,
NAMD.
Developing effective yet scalable load-balancing methods for
irregular computations is critical to the successful application
of simulations in a variety of disciplines at petascale and
beyond. This poster explores a set of static and dynamic
scheduling algorithms for block-sparse tensor contractions
within the NWChem computational chemistry code for different
degrees of sparsity (and therefore load imbalance). In
this particular application, a relatively large amount of task
information can be obtained at minimal cost, which enables
the use of static partitioning techniques that take the entire
task list as input. However, fully static partitioning is
incapable of dealing with dynamic variation of task costs,
such as from transient network contention or operating system
noise, so we also consider hybrid schemes that utilize
dynamic scheduling within subgroups. These two schemes,
which have not been previously implemented in NWChem or
its proxies (i.e. quantum chemistry mini-apps), are compared
to the original centralized dynamic load-balancing algorithm
as well as an improved centralized scheme. In all cases, we separate
the scheduling of tasks from the execution of tasks into
an inspector phase and an executor phase. The impact of
these methods upon the application is substantial on a large
InfiniBand cluster: execution time is reduced by as much as
50% at scale. The technique is applicable to any scientific
application requiring load balance where performance models
or estimations of kernel execution times are available.
In this paper we discuss the performance prediction of Fortran constructs commonly found in
numerical scientific computing. Although the approach is applicable to multi-processors in
general, within the scope of the paper we will concentrate on the Alliant FX/8 multiprocessor.
The techniques proposed involve a combination of empirical observations, architectural
models, and analytical techniques, and exploit earlier work on data locality analysis and
empirical characterization of the behavior of memory systems. The Lawrence Livermore
Loops are used as a test-case to verify the approach.
The complexity of parallel computer systems makes a priori performance
prediction difficult and experimental performance analysis crucial. A complete
characterization of software and hardware dynamics, needed to understand the
performance of high-performance parallel systems, requires execution time
performance instrumentation. Although software recording of performance data
suffices for low frequency events, capture of detailed, high-frequency
performance data ultimately requires hardware support if the performance
instrumentation is to remain efficient and unobtrusive. This paper describes the
design of HYPERMON, a hardware system to capture and record software
performance traces generated on the Intel iPSC/2 hypercube. HYPERMON
represents a compromise between fully-passive hardware monitoring and
software event tracing; software generated events are extracted from each
node, timestamped, and externally recorded by HYPERMON. Using an
instrumented version of the iPSC/2 operating system and several application
programs, we present a performance analysis of an operational HYPERMON
prototype and assess the limitations of the current design. Based on these
results, we suggest design modifications that should permit capture of event
traces from the coming generation of high-performance distributed memory
parallel systems.
This paper describes how the SMARTS runtime system and the POOMA C++
class library for high-performance scientific computing work together
to exploit data parallelism in scientific applications while hiding
the details of managing parallelism and data locality from the
user. We present innovative algorithms, based on the macro-dataflow
model, for detecting data parallelism and efficiently executing
data-parallel statements on shared-memory multiprocessors. We also
describe how these algorithms can be implemented on clusters of SMPs.
In the solution of large-scale numerical problems, parallel computing
is becoming simultaneously more important and more difficult. The
complex organization of today's multiprocessors with several memory
hierarchies has forced the scientific programmer to make a choice
between simple but unscalable code and scalable but extremely complex
code that does not port to other architectures.
The process of empirical autotuning results in the generation of many code variants
which are tested, found to be suboptimal, and discarded. By retaining annotated
performance profiles of each variant tested over the course of many autotuning runs of
the same code across different hardware environments and different input datasets, we
can apply machine learning algorithms to generate classifiers for runtime selection of
code variants from a library, generate specialized variants, and potentially speed the
process of autotuning by starting the search from a point predicted to be close to
optimal. In this paper, we show how the TAU Performance System suite of tools can be
applied to autotuning to enable reuse of performance data generated through autotuning.
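One simple form of the runtime-selection idea is to keep the best-known variant per feature point from past autotuning runs and pick the variant whose profile is nearest to the current execution context. Everything below is a hypothetical sketch: the feature names, stored profiles, and nearest-neighbor rule are invented for illustration, not TAU's machinery.

```python
# Hypothetical variant selection from retained autotuning profiles.
import math

# (features, best_variant) pairs retained from earlier autotuning runs
profiles = [
    ({"cores": 4,  "n": 1_000},     "variant_scalar"),
    ({"cores": 16, "n": 1_000_000}, "variant_tiled"),
    ({"cores": 64, "n": 1_000_000}, "variant_gpu"),
]

def features_to_vec(f):
    # Log-scale the problem size so it is comparable to core counts.
    return (f["cores"], math.log10(f["n"]))

def select_variant(current):
    """Pick the variant whose recorded features are nearest to `current`."""
    cur = features_to_vec(current)
    def dist(entry):
        return math.dist(cur, features_to_vec(entry[0]))
    return min(profiles, key=dist)[1]
```

The same retained profiles can also seed a fresh autotuning search, starting it near the configuration predicted to be optimal instead of from scratch.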
This work targets the emerging use of software component technology for
high-performance scientific parallel and distributed computing. While
component software engineering will benefit the construction of complex
science applications, its use presents several challenges to performance
optimization. A component application is composed of a set of components,
thus, application performance depends on the interaction (possibly
non-linear) of the component set. Furthermore, a component is a ``binary
unit of composition'' and the only information users have is the interface
the component provides to the outside world. An interface for component
performance measurement and query is presented to address optimization
issues. We describe the performance component design and an example
demonstrating its use for runtime performance tuning.
We present a case study of performance measurement and modeling of a CCA (Common
Component Architecture) component-based application in a high performance computing
environment. Component-based HPC applications allow the possibility of creating
component-level performance models and synthesizing them into application performance
models. However, they impose the restriction that performance measurement/monitoring
needs to be done in a non-intrusive manner and at a fairly coarse-grained level. We propose
a performance measurement infrastructure for HPC based loosely on recent work done for
Grid environments. A prototypical implementation of the infrastructure is used to collect data
for three components in a scientific application and construct their performance models.
Both computational and message-passing performance are addressed.
HiPerSAT, a C++ library and associated tools, processes EEG
data sets with ICA (Independent Component Analysis)
methods. HiPerSAT uses BLAS, LAPACK, MPI
and OpenMP to achieve a high performance solution
that exploits parallel hardware. ICA is a class of methods
for analyzing a large set of data samples and extracting
independent components that explain the observed
data. ICA is used in EEG research for data
cleaning and separation of spatiotemporal patterns that
may reflect different underlying neural processes. We
present two ICA implementations (FastICA and Infomax)
that exploit parallelism to provide an EEG component
decomposition solution of higher performance
and data capacity than current MATLAB-based implementations.
Experimental results and the methodology
used to obtain them are presented. Integrating HiPerSAT
with EEGLAB [4] is described, as well as future
plans for this research.
Performance tuning involves a diagnostic process to locate
and explain sources of program inefficiency. A performance
diagnosis system can leverage knowledge of performance
causes and symptoms that come from expertise
with parallel computational models. This paper extends our
model-based performance diagnosis approach to programs
with multiple models. We study two types of model compositions
(nesting and restructuring) and demonstrate how the
Hercule performance diagnosis framework can automatically
discover and interpret performance problems due to
model nesting in the FLASH application.
The Hartree-Fock (HF) method is the fundamental
first step for incorporating quantum mechanics into many-electron
simulations of atoms and molecules, and it is an
important component of computational chemistry toolkits like
NWChem. The GTFock code is an HF implementation that,
while it does not have all the features in NWChem, represents
crucial algorithmic advances that reduce communication and
improve load balance by doing an up-front static partitioning
of tasks, followed by work stealing whenever necessary.
To enable innovations in algorithms and exploit next-generation
exascale systems, it is crucial to support quantum
chemistry codes using expressive and convenient programming
models and runtime systems that are also efficient and scalable.
This paper presents an HF implementation similar to GTFock
using UPC++, a partitioned global address space model that
includes flexible communication, asynchronous remote computation,
and a powerful multidimensional array library. UPC++
offers runtime features that are useful for HF such as active
messages, a rich calculus for array operations, hardware-supported
fetch-and-add, and functions for ensuring asynchronous
runtime progress. We present a new distributed array
abstraction, DArray, that is convenient for the kinds of random-access
array updates and linear algebra operations on block-distributed
arrays with irregular data ownership. We analyze
the performance of atomic fetch-and-add operations (relevant
for load balancing) and runtime attentiveness, then compare
various techniques and optimizations for each. Our optimized
implementation of HF using UPC++ and the DArrays library
shows up to 20% improvement over GTFock with Global
Arrays at scales up to 24,000 cores.
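The role of fetch-and-add in load balancing is that workers claim task indices from a shared counter, so each task runs exactly once without a central scheduler. The sketch below emulates the atomic with a lock on local threads; in UPC++ it is a hardware-supported remote atomic on a distributed counter, which is what the abstract's performance analysis targets.

```python
# Counter-based dynamic load balancing via (emulated) fetch-and-add.
import threading

class FetchAndAdd:
    """Lock-based stand-in for a hardware atomic fetch-and-add."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()
    def fetch_add(self, inc=1):
        with self._lock:
            old = self._value
            self._value += inc
            return old          # the pre-increment value is the claimed index

NTASKS = 100
counter = FetchAndAdd()
done = [0] * NTASKS

def worker():
    while True:
        t = counter.fetch_add()  # atomically claim the next task index
        if t >= NTASKS:
            break                # all tasks have been claimed
        done[t] += 1             # "execute" task t

threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads: th.start()
for th in threads: th.join()
```

Because every claim is a remote atomic on one counter, the latency and attentiveness of that operation bound how fine-grained the tasks can be, which is why its performance is worth measuring in isolation.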
The ability to measure performance characteristics
of an application at runtime is essential for monitoring the behavior
of the application and the runtime system on the underlying
architecture. Traditional performance measurement tools do not
adequately provide measurements of asynchronous task-based
parallel applications, either in real-time or for postmortem analysis.
We propose that this capability is best provided directly by
the runtime system for ease in use and to minimize conflicts and
overheads potentially caused by traditional measurement tools.
In this paper, we describe and illustrate the use of the
performance monitoring capabilities in the HPX [13] runtime
system. We describe and detail existing performance counters
made available through HPX’s performance counter framework
and demonstrate how they are useful to understanding application
efficiency and resource usage at runtime. This extensive
framework provides the ability to asynchronously query software
and hardware counters and could potentially be used as the basis
for runtime adaptive resource decisions.
We demonstrate the ease of porting the Inncabs benchmark
suite to the HPX runtime system, the improved performance
of benchmarks that employ fine-grained task parallelism when
ported to HPX, and the capabilities and advantages of using the
in-situ performance monitoring system in HPX to give detailed
insight to the performance and behavior of the benchmarks and
the runtime system.
A primary characteristic of history-based Monte
Carlo neutron transport simulation is the application of
MIMD-style parallelism: the path of each neutron particle
is largely independent of all other particles, so threads of
execution perform independent instructions with respect to
other threads. This conflicts with the growing trend of HPC
vendors exploiting SIMD hardware, which accomplishes better
parallelism and more FLOPS per watt. Event-based neutron
transport suits vectorization better than history-based
transport, but it is difficult to implement and complicates
data management and transfer. However, the Intel Xeon Phi
architecture supports the familiar x86 instruction set and
memory model, mitigating difficulties in vectorizing neutron
transport codes.
This paper compares the event-based and history-based
approaches for exploiting SIMD in Monte Carlo neutron transport
simulations. For both algorithms, we analyze performance
using the three different execution models provided by the Xeon
Phi (offload, native, and symmetric) within the full-featured
OpenMC framework. A representative micro-benchmark of
the performance bottleneck computation shows about 10x
performance improvement using the event-based method. In
an optimized history-based simulation of a full-physics nuclear
reactor core in OpenMC, the MIC shows a calculation rate
1.6x higher than a modern 16-core CPU, 2.5x higher when
balancing load between the CPU and 1 MIC, and 4x higher
when balancing load between the CPU and 2 MICs. As far as
we are aware, our calculation rate per node on a high fidelity
benchmark (17,098 particles/second) is higher than any other
Monte Carlo neutron transport application. Furthermore, we
attain 95% distributed efficiency when using MPI and up to
512 concurrent MIC devices.
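The structural difference between the two algorithms is a loop interchange, which this toy sketch makes explicit (it carries none of OpenMC's physics; "steps" stand in for particle events). History-based code finishes one particle at a time, while event-based code advances all live particles through one event at a time, so the inner work maps naturally onto SIMD lanes.

```python
# Toy contrast of history-based vs. event-based particle processing.
import random

def steps_needed(rng):
    """Stand-in for the random number of events in one particle's life."""
    return rng.randrange(1, 6)

def history_based(n, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(n):                        # outer loop over particles
        for _ in range(steps_needed(rng)):    # inner: this particle's history
            total += 1                        # scalar work per event
    return total

def event_based(n, seed=0):
    rng = random.Random(seed)
    remaining = [steps_needed(rng) for _ in range(n)]
    total = 0
    while remaining:                          # outer loop over events
        total += len(remaining)               # one event for every live particle
        remaining = [r - 1 for r in remaining if r > 1]   # retire finished ones
    return total

assert history_based(100) == event_based(100)  # same work, different order
```

The event-based version does identical operations on a whole bank of particles per iteration, at the price of compacting the particle bank after every event, which is the data-management cost the paper weighs against the SIMD gain.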
The Argo project is a DOE initiative for designing
a modular operating system/runtime for the next generation
of supercomputers. A key focus area in this project is power
management, which is one of the main challenges on the path to
exascale. In this paper, we discuss ideas for systemwide power
management in the Argo project. We present a hierarchical and
scalable approach to maintain a power bound at scale, and we
highlight some early results.
In this paper, we discuss the performance analysis of the pC++
programming system. We describe the performance tools developed and
include scalability measurements for four benchmark programs: a
"nearest neighbor" grid computation, a fast Poisson solver, and the
"Embar" and "Sparse" codes from the NAS suite. In addition to speedup
numbers, we present a detailed analysis highlighting performance
issues at the language, runtime system, and target system levels.
pC++ is a language extension to C++ designed to allow programmers to
compose distributed data structures with parallel execution
semantics. These data structures are organized as ``concurrent
aggregate'' collection classes which can be aligned and distributed
over the memory hierarchy of a parallel machine in a manner consistent
with the High Performance Fortran Forum (HPF) directives for Fortran
90. pC++ allows the user to write portable and efficient code which
will run on a wide range of scalable parallel computers.
Performance diagnosis, the process of finding and explaining performance
problems, is an important part of parallel programming. Effective performance
diagnosis requires that the programmer plan an appropriate method, and
manage the experiments required by that method. This paper presents Poirot,
an architecture to support performance diagnosis. It explains how the
architecture helps automatically and adaptably plan and manage the diagnosis
process. The paper evaluates the generality and practicality of Poirot, by
reconstructing diagnosis methods found in several published performance
tools.
We report our experiences in porting and tuning the Apache
Spark data analytics framework on the Cray XC30 (Edison) and XC40
(Cori) systems, installed at NERSC. We find that design decisions made
in the development of Spark are based on the assumption that Spark
is constrained primarily by network latency, and that disk I/O is comparatively
cheap. These assumptions are not valid on Edison or Cori,
which feature advanced low-latency networks but have diskless compute
nodes. Lustre metadata access latency is a major bottleneck, severely
constraining scalability. We characterize this problem with benchmarks
run on a system with both Lustre and local disks, and show how to mitigate
high metadata access latency by using per-node loopback filesystems
for temporary storage. With this technique, we reduce the shuffle time
and improve application scalability from O(100) to O(10,000) cores on
Cori. For shuffle-intensive machine learning workloads, we show better
performance than clusters with local disks.
Applications executing on complex computational systems provide a
challenge for the development of runtime performance monitoring
software. We discuss a computational model, application monitoring,
data access models, and profiler functionality. We define data
consistency within and across threads as well as across contexts and
nodes. We describe the TAU runtime monitoring framework which enables
on-demand, low-interference data access to TAU profile data and
provides the flexibility to enforce data consistency at the thread,
context or node level. We present an example of a Java-based runtime
performance monitor utilizing the framework.
Technology for empirical performance evaluation of parallel programs
is driven by the increasing complexity of high performance computing environments
and programming methodologies. This paper describes the integration of
the TAU and XPARE tools in the Uintah computational framework. Performance
mapping techniques in TAU relate low-level performance data to higher levels of
abstraction. XPARE is used for specifying regression testing benchmarks that are
evaluated with each periodically scheduled testing trial. This provides a historical
panorama of the evolution of application performance. The paper concludes with
a scalability study that shows the benefits of integrating performance technology
in the development of large-scale parallel applications.
The paper presents the design and development of an online remote trace
measurement and analysis system. The work combines the strengths of the
TAU performance system with that of the VNG distributed parallel trace
analyzer. Issues associated with online tracing are discussed and the problems
encountered in system implementation are analyzed in detail. Our approach
should port well to parallel platforms. Future work includes testing the
performance of the system on large-scale machines.
Practice has shown that programming a new multicore
system is a greater challenge than previously thought. The
challenge is to make building such a system as easy as
sequential programming. This new trend has changed
the way we think about the whole development process. The
aim of this work is to show that it is possible to develop a
multicore embedded system application using existing tools, while
at the same time, obtaining reuse. This process is carried out
in a cyclic and increasing manner, generating a more refined
version of the application at each iteration. The development
process consists of five phases: Multitask Modelling, Code Generation,
Test/Debugging, Mapping Tasks to Cores and Tuning
the Application. The three initial ones are carried out using the
VisualRTXC tool, whereas the last two use the performance tool
TAU. Using a small application, a Case Study shows how the
proposed development process works and the steps involved in
the implementation of an embedded system.
We have developed an environment that uses the IBM Visualization Data Explorer system to allow new visualizations to be prototyped rapidly, often taking only a few hours to construct totally new views of parallel performance trace data. Yet, access to a robust library of sophisticated graphical techniques is preserved. The burdensome task of explicitly programming the visualizations is completely avoided, and the iterative design, evaluation, and modification of new displays is greatly facilitated.
The complexity of parallel programs make them more difficult to analyze for correctness and efficiency, in part because of the interactions between multiple processors and the volume of data that can be generated. Visualization often helps the programmer in these tasks. This paper focuses on the development of a new technique for constructing, evaluating, and modifying sophisticated, application-specific visualizations for parallel programs and performance data. While most existing tools offer predetermined sets of simple, two-dimensional graphical displays, this environment gives users a high degree of control over visualization development and use, including access to three-dimensional graphics, which remain relatively unexplored in this context.
A multi-cluster computational environment with mixed-mode (MPI +
OpenMP) parallelism for estimation of unknown regional electrical
conductivities of the human head, based on realistic geometry from
segmented MRI up to 256-voxel resolution, is described. A finite
difference multi-component alternating direction implicit (ADI)
algorithm, parallelized using OpenMP, is used to solve the forward
problem calculation describing the electrical field distribution
throughout the head given known electrical sources. A simplex search in the
multi-dimensional parameter space of tissue conductivities is conducted in
parallel across a distributed system of heterogeneous computational resources. The
theoretical and computational formulation of the problem is presented. Results
from test studies based on the synthetic data are provided, comparing retrieved
conductivities to known solutions from simulation. Performance statistics are also
given showing both the scaling of the forward problem and the performance
dynamics of the distributed search.
Nested OpenMP parallelism allows an application to spawn teams of nested threads. This hierarchical nature of thread creation and usage poses problems for performance measurement tools that must determine thread context to properly maintain per-thread performance data. In this paper we describe the problem and a novel solution for identifying threads uniquely. Our approach has been implemented in the TAU performance system and has been successfully used in profiling and tracing OpenMP applications with nested parallelism. We also describe how extensions to the OpenMP standard can help tool developers uniquely identify threads.
The introduction of tasks in the OpenMP programming
model brings a new level of parallelism. This also creates new challenges
with respect to its meaning and applicability for event-based
performance profiling. The OpenMP Architecture Review Board (ARB)
has approved an interface specification known as the “OpenMP Runtime
API for Profiling” to enable performance tools to collect performance data
for OpenMP programs. In this paper, we propose new extensions to the
OpenMP Runtime API for profiling task-level parallelism. We present an
efficient method to distinguish individual task instances in order to track
their associated events at the micro level. We implement the proposed extensions
in the OpenUH compiler, an open-source OpenMP compiler.
With negligible overheads, we are able to capture important events
like task creation, execution, suspension, and exiting. These events help
in identifying overheads associated with the OpenMP tasking model, e.g.,
the time a task waits before it begins execution, or task cleanup. These events
also help in constructing important parent-child relationships that de-
fine tasks’ call paths. The proposed extensions are in line with the newest
specifications recently proposed by the OpenMP tools committee for task
profiling.
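The instance-tracking idea can be sketched as follows: each task instance receives a unique ID at creation, events are recorded against that ID, and creation events link child to parent so call paths can be reconstructed. This is a hedged stdlib-Python illustration of the bookkeeping, not the OpenUH implementation:

```python
import itertools

class TaskProfiler:
    """Record per-instance task events and reconstruct call paths."""
    def __init__(self):
        self._ids = itertools.count(1)
        self.parent = {0: None}     # task 0 is the implicit initial task
        self.name = {0: "initial"}
        self.events = []            # (task_id, event) in program order

    def create(self, parent_id, name):
        tid = next(self._ids)       # unique per task *instance*
        self.parent[tid] = parent_id
        self.name[tid] = name
        self.events.append((tid, "create"))
        return tid

    def record(self, tid, event):   # "begin", "suspend", "resume", "end"
        self.events.append((tid, event))

    def call_path(self, tid):
        path = []
        while tid is not None:
            path.append(self.name[tid])
            tid = self.parent[tid]
        return "/".join(reversed(path))

prof = TaskProfiler()
outer = prof.create(0, "traverse")    # task spawned by the initial task
inner = prof.create(outer, "visit")   # child task instance
inner2 = prof.create(outer, "visit")  # same code, new instance, new ID
prof.record(inner, "begin"); prof.record(inner, "end")
assert inner != inner2                # instances are distinguished
assert prof.call_path(inner) == "initial/traverse/visit"
```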
The ability to measure the performance of OpenMP programs portably across shared
memory platforms and across OpenMP compilers is a challenge due to the lack of a
widely-implemented performance interface standard. While the OpenMP community
is evaluating a tools interface specification called OMPT, at present there
are different instrumentation methods possible at different levels of observation and
with different system and compiler dependencies. This paper describes how support
for four mechanisms for OpenMP measurement has been integrated into the TAU
performance system. These include source-level instrumentation (Opari), a runtime
“collector” API (called ORA) built into an OpenMP compiler (OpenUH), a wrapped
OpenMP runtime library (GOMP using ORA), and an OpenMP runtime library
supporting an OMPT prototype (Intel). The capabilities of these approaches are
evaluated with respect to observation visibility, portability, and measurement
overhead for OpenMP benchmarks from the NAS parallel benchmarks, Barcelona
OpenMP Task Suite, and SPEC 2012. The integrated OpenMP measurement support is
also demonstrated on a scientific application, MPAS-Ocean.
Parallel Java environments present challenging problems for performance
tools because of Java's rich language system and its multi-level execution
platform combined with the integration of native-code application libraries
and parallel runtime software. In addition to the desire to provide robust
performance measurement and analysis capabilities for the Java language
itself, the coupling of different software execution contexts under a
uniform performance model needs careful consideration of how events of
interest are observed and how cross-context parallel execution information
is linked. This paper relates our experience in extending the TAU
performance system to a parallel Java environment based on mpiJava. We
describe the complexities of the instrumentation model used, how
performance measurements are made, and the overhead incurred. A parallel
Java application simulating the game of Life is used to show the
performance system's capabilities.
Event-related potentials (ERP) are brain electrophysiological
patterns created by averaging electroencephalographic
(EEG) data, time-locking to events of interest (e.g., stimulus
or response onset). In this paper, we propose a generic
framework for mining and developing domain ontologies and
apply it to mine brainwave (ERP) ontologies. The concepts
and relationships in ERP ontologies can be mined according
to the following steps: pattern decomposition, extraction
of summary metrics for concept candidates, hierarchical
clustering of patterns for classes and class taxonomies, and
clustering-based classification and association rules mining
for relationships (axioms) of concepts. We have applied this
process to several dense-array (128-channel) ERP datasets.
Results suggest good correspondence between mined concepts
and rules, on the one hand, and patterns and rules
that were independently formulated by domain experts, on
the other. Data mining results also suggest ways in which
expert-defined rules might be refined to improve ontology
representation and classification results. The next goal of
our ERP ontology mining framework is to address some
long-standing challenges in conducting large-scale comparison
and integration of results across ERP paradigms and
laboratories. In a more general context, this work illustrates
the promise of an interdisciplinary research program,
which combines data mining, neuroinformatics and ontology
engineering to address real-world problems.
This paper proposes a performance tools interface for
OpenMP, similar in spirit to the MPI profiling interface in its intent to
define a clear and portable API that makes OpenMP execution events visible
to runtime performance tools. We present our design using a source-level
instrumentation approach based on OpenMP directive rewriting. Rules to
instrument each directive and their combination are applied to generate
calls to the interface consistent with directive semantics and to pass
context information (e.g., source code locations) in a portable and
efficient way. Our proposed OpenMP performance API further allows user
functions and arbitrary code regions to be marked and performance
measurement to be controlled using new OpenMP directives.
To prototype the proposed OpenMP performance interface, we have developed
compatible performance libraries for the EXPERT automatic event
trace analyzer and the TAU performance analysis framework. The directive
instrumentation transformations we define are implemented in a
source-to-source translation tool called OPARI. Application examples are
presented for both EXPERT and TAU to show the OpenMP performance interface and
OPARI instrumentation tool in operation. When used together with the MPI
profiling interface (as the examples also demonstrate), our proposed
approach provides a portable and robust solution to performance analysis of
OpenMP and mixed-mode (OpenMP + MPI) applications.
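The directive-rewriting approach can be illustrated with a toy transformation: insert a call to the measurement interface before each `parallel` directive so the tool observes the fork, in the spirit of OPARI. The function name below is illustrative, not the actual interface, and real OPARI handles all directives, clauses, and context descriptors:

```python
import re

def instrument_parallel(source):
    """Rewrite '#pragma omp parallel' so a measurement library observes
    the fork. A toy, line-oriented sketch of directive rewriting."""
    out = []
    for line in source.splitlines():
        if re.match(r"\s*#pragma omp parallel\b", line):
            # Pass source context (file, line) to the tool portably.
            out.append("perf_fork_enter(__FILE__, __LINE__);")
            out.append(line)
        else:
            out.append(line)
    return "\n".join(out)

code = "#pragma omp parallel\n{\n  work();\n}"
inst = instrument_parallel(code)
assert inst.splitlines()[0] == "perf_fork_enter(__FILE__, __LINE__);"
assert "#pragma omp parallel" in inst
```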
Profiling and tracing tools can help make application parallelization
more effective and identify performance bottlenecks. Profiling
presents summary statistics of performance metrics while tracing
highlights the temporal aspect of performance variations, showing when
and where in the code performance is achieved. A complex challenge is
the mapping of performance data gathered during execution to
high-level parallel language constructs in the application source
code. Presenting performance data in a meaningful way to the user is
equally important. This paper presents a brief overview of profiling
and tracing tools in the context of Linux - the operating system most
commonly used to build clusters of workstations for high performance
computing.
We present a topology correction method for automatic reconstruction
of brain cortical surfaces. We take the volume-based
approach by first correcting the topology of the white matter volumes
followed by extracting the cortical surfaces. A multiscale method is taken
so that topology errors are gradually corrected with respect to the correction
cost. The special surface-likeness property of white matter and
gray matter is considered in evaluating the cost of topology correction.
Performance extrapolation is the process of evaluating the performance
of a parallel program in a target execution environment using
performance information obtained for the same program in a different
environment. Performance extrapolation techniques are suited for rapid
performance tuning of parallel programs, particularly when the target
environment is unavailable. This paper describes one such technique
that was developed for data-parallel C++ programs written in the pC++
language. In pC++, the programmer can distribute a collection of
objects to various processors and can have methods invoked on those
objects execute in parallel. Using performance extrapolation in the
development of pC++ applications allows tuning decisions to be made in
advance of detailed execution measurements. The pC++ language system
includes TAU, an integrated environment for analyzing and tuning the
performance of pC++ programs. This paper presents speedy, a new
addition to TAU, that predicts the performance of pC++ programs on
parallel machines using extrapolation techniques. Speedy applies the
existing instrumentation support of TAU to capture high-level event
traces of an n-thread pC++ program run on a uniprocessor machine
together with trace-driven simulation to predict the performance of
the program run on a target n-processor machine. We describe how
speedy works and how it is integrated into TAU. We also show how
speedy can be used to evaluate a pC++ program for a given target
environment.
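The trace-driven extrapolation can be sketched minimally: per-thread event durations recorded on the uniprocessor run are replayed onto n simulated processors, and the predicted parallel time is the longest simulated timeline. This is a deliberately simplified stand-in for speedy's simulation (it ignores communication and synchronization):

```python
def extrapolate(thread_traces, n_procs):
    """thread_traces: per-thread lists of event durations measured on a
    single processor. Predict execution time when threads are assigned
    round-robin to n_procs simulated processors."""
    clocks = [0.0] * n_procs
    for i, trace in enumerate(thread_traces):
        clocks[i % n_procs] += sum(trace)
    return max(clocks)

# Four threads whose work took 12 time units on one processor...
traces = [[1, 2], [3], [2, 1], [3]]
assert sum(sum(t) for t in traces) == 12
# ...are predicted to take 3 units on four processors (perfect overlap):
assert extrapolate(traces, 4) == 3
# and 6 units on two processors (two threads share each processor):
assert extrapolate(traces, 2) == 6
```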
Performance prediction methods and tools based on analytical models often fail
in forecasting the performance of real systems due to inappropriateness of
model assumptions, irregularities in the problem structure that cannot be
described within the modeling formalism, unstructured execution behavior that
leads to unforeseen system states, etc. Prediction accuracy and tractability are
acceptable for systems with deterministic operational characteristics, for static,
regularly structured problems, and non-changing environments.
Understanding the milliscale (temporal and spatial) dynamics of the human brain activity
requires high-resolution modeling of head electromagnetics and source localization of
EEG data. We have developed an automated environment to construct individualized
computational head models from image segmentation and to estimate conductivity
parameters using electrical impedance tomography methods. Algorithms incorporating
tissue inhomogeneity and impedance anisotropy in electromagnetics forward simulations
have been developed and parallelized. The paper reports on the application of the
environment in the processing of realistic head models, including conductivity inverse
estimation and lead field generation for use in EEG source analysis.
When implementing parallel programs for parallel computer systems, the
performance scalability of these programs should be tested and analyzed on
different computer configurations and problem sizes. Since a complete
scalability analysis is too time consuming and is limited to only existing systems,
extensions of modeling approaches can be considered for analyzing the
behavior of parallel programs under different problem and system scenarios. In
this paper, a method for automatic scalability analysis using modeling is
presented. Initially, we identify the important problems that arise when
attempting to apply modeling techniques to scalability analysis. Based on this
study, we define the Parallelization Description Language (PDL) that is used to
describe parallel execution attributes of a generic program workload. Based on
a parallelization description, stochastic models like graph models or Petri net
models can be automatically generated from a generic model to analyze
performance for scaled parallel systems as well as scaled input data. The
complexity of the graph models produced depends significantly on the type of
parallel computation described. We present several computation classes where
tractable graph models can be generated and then compare the results of these
automatically scaled models with their exact solutions using the PEPP modeling
tool.
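For the simplest tractable computation class, fork-join programs, the generated graph model can be illustrated directly: the predicted makespan is the serial work plus the busiest processor's share of the parallel branches, which can be checked against an exact solution. This is an illustrative sketch of scaling a model, not the PEPP formalism:

```python
def fork_join_makespan(serial_time, branch_times, n_procs):
    """Predicted execution time of a fork-join graph model: branches
    are list-scheduled longest-first onto processors; the makespan is
    the serial work plus the busiest processor's load."""
    clocks = [0.0] * n_procs
    for t in sorted(branch_times, reverse=True):
        clocks[clocks.index(min(clocks))] += t   # greedy list scheduling
    return serial_time + max(clocks)

# Scaling the system: 8 equal branches of 2 units plus 1 unit serial work.
branches = [2.0] * 8
assert fork_join_makespan(1.0, branches, 8) == 3.0   # fully parallel
assert fork_join_makespan(1.0, branches, 4) == 5.0   # two rounds of branches
assert fork_join_makespan(1.0, branches, 1) == 17.0  # matches exact serial time
```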
Tools to observe the performance of parallel programs typically employ profiling and tracing as the two main forms of event-based
measurement models. In both of these approaches, the volume of performance data generated and the corresponding perturbation encountered
in the program depend upon the amount of instrumentation in the program. To produce accurate performance data, tools need to control the
granularity of instrumentation. In this paper, we describe our experiences in the TAU performance system for improving the accuracy of
performance data by limiting the amount of instrumentation. A range of
options are provided to optimize instrumentation based on the structure
of the program, event generation rates, and historical performance data
gathered from prior executions.
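One such option, rate-based throttling, can be sketched: disable measurement for events that fire very frequently yet are individually cheap, since those contribute the most overhead and the least information. The thresholds below are illustrative defaults; TAU exposes similar runtime knobs:

```python
def select_throttled(profile, max_calls=100000, min_percall_us=10.0):
    """profile: {function_name: (num_calls, total_time_us)} from a
    prior run. Return functions whose instrumentation should be
    disabled: called very often but cheap per call, so measurement
    overhead would dominate the reported time."""
    throttled = set()
    for name, (calls, total_us) in profile.items():
        if calls > max_calls and total_us / calls < min_percall_us:
            throttled.add(name)
    return throttled

profile = {
    "main":       (1, 9_000_000.0),          # long-running: keep
    "solve_step": (500, 8_000_000.0),        # hot but expensive per call: keep
    "get_elem":   (2_000_000, 4_000_000.0),  # 2 us per call: throttle
}
assert select_throttled(profile) == {"get_elem"}
```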
Workload characterization is an important technique that
helps us understand the performance of parallel applications and the demands they place on the system. Each application run is profiled using
instrumentation at the MPI library level. Characterizing the performance
of the MPI library based on the sizes of messages helps us understand
how the performance of an application is affected based on messages
of different sizes. Partitioning of the time spent in MPI routines based
on the type of MPI operation and the message size involved requires a
two level mapping of performance data. This paper describes how performance mapping is implemented in the TAU performance system to
support workload characterization.
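The two-level mapping can be sketched as accumulation under (MPI operation, message-size bin) keys, with power-of-two bins so time in, say, MPI_Send of 2 KB messages is kept separate from 4 MB messages. This is a hedged illustration of the idea, not TAU's mapping API:

```python
from collections import defaultdict

def size_bin(nbytes):
    """Power-of-two message-size bin, e.g. 1500 bytes -> '1KB-2KB'."""
    lo = 1
    while lo * 2 <= max(nbytes, 1):
        lo *= 2
    def fmt(b):
        return f"{b // 1024}KB" if b >= 1024 else f"{b}B"
    return f"{fmt(lo)}-{fmt(lo * 2)}"

class MpiTimeMap:
    """First-level key: MPI routine; second-level key: size bin."""
    def __init__(self):
        self.time = defaultdict(int)   # microseconds
    def record(self, routine, nbytes, usecs):
        self.time[(routine, size_bin(nbytes))] += usecs

m = MpiTimeMap()
m.record("MPI_Send", 1500, 2000)
m.record("MPI_Send", 1800, 3000)
m.record("MPI_Send", 4 << 20, 500000)  # a 4 MB message lands in its own bin
assert m.time[("MPI_Send", "1KB-2KB")] == 5000
assert len({k for k in m.time if k[0] == "MPI_Send"}) == 2
```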
Performance evaluation tools play an important role in helping
understand application performance, diagnose performance problems
and guide tuning decisions on modern HPC systems. Tools to observe
parallel performance must evolve to keep pace with the ever-increasing
complexity of these systems. In this paper, we describe our experience in
building novel tools and techniques in the TAU Performance System®
to observe application performance effectively and efficiently at scale.
We describe extensions to TAU that contend with the large data volumes
associated with increasing core counts. These changes include new instrumentation
choices, efficient handling of disk I/O operations in the
measurement layer, and strategies for visualization of performance data
at scale in TAU’s analysis layer, among others. We also describe some
techniques that allow us to fully characterize the performance of applications
running on hundreds of thousands of cores.
Observing the performance of an application at runtime requires
economy in what performance data is measured and accessed, and
flexibility in changing the focus of performance interest. This paper
describes the performance callstack as an efficient performance view
of a running program which can be retrieved and controlled by external
analysis tools. The performance measurement support is provided by
the TAU profiling library whereas tool-program interaction support is
available through the DAQV framework. How these systems are merged to
provide dynamic performance callstack sampling is discussed.
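The performance callstack can be sketched as a stack of currently active timed routines that an external tool may snapshot at any moment, each entry reporting the time accumulated so far. This is an illustrative sketch of the view; the real system couples TAU's profiling data with DAQV's tool-program interaction:

```python
import time

class PerfCallstack:
    """Maintain the stack of active routines with entry timestamps;
    snapshot() is what an external analysis tool would retrieve."""
    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.stack = []                 # [(routine, entry_time), ...]

    def enter(self, routine):
        self.stack.append((routine, self.clock()))

    def exit(self):
        self.stack.pop()

    def snapshot(self):
        now = self.clock()
        return [(name, now - t0) for name, t0 in self.stack]

# Deterministic fake clock so the example is reproducible.
ticks = iter(range(100))
cs = PerfCallstack(clock=lambda: next(ticks))
cs.enter("main")        # clock reads 0
cs.enter("solve")       # clock reads 1
snap = cs.snapshot()    # clock reads 2
assert snap == [("main", 2), ("solve", 1)]
```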
Parallel performance tools offer insights into the execution behavior
of an application and are a valuable component in the cycle of
application development, deployment, and optimization. However, most
tools do not work well with large-scale parallel applications where
the performance data generated comes from upwards of thousands of
processes. As parallel computer systems increase in size, the scaling
of performance observation infrastructure becomes an important
concern. In this paper, we discuss the problem of scaling and
performance observation, and the ramifications of adding online
support. A general online performance system architecture is
presented. Recent work on the TAU performance system to enable
large-scale performance observation and analysis is discussed. The
paper concludes with plans for future work.
We have developed a distributed service architecture and an integrated parallel analysis engine
for scalable trace-based performance analysis. Our combined approach makes it possible to handle very
large performance data volumes in real time. Unlike traditional analysis tools that do their job
sequentially on an external desktop platform, our approach leaves the data at its origin and
seamlessly integrates the time consuming analysis as a parallel job into the high performance
production environment.
Parallel scientific applications are designed based on structural, logical, and numerical models
of computation and correctness. When studying the performance of these applications,
especially on large-scale parallel systems, there is a strong preference among developers to
view performance information with respect to their “mental model” of the application, formed
from the model semantics used in the program. If the developer can relate performance data
measured during execution to what they know about the application, more effective program
optimization may be achieved. This paper considers the concept of “phases” and its support in
parallel performance measurement and analysis as a means to bridge the gap between high-
level application semantics and low-level performance data. In particular, this problem is
studied in the context of parallel performance profiling. The implementation of phase-based
parallel profiling in the TAU parallel performance system is described and demonstrated for the
NAS parallel benchmarks and MFIX application.
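Phase-based profiling can be sketched as a second attribution key: instead of per-function times alone, the profiler accumulates under (phase, function) pairs, so the same routine is reported separately within each application phase. This is an illustrative sketch of the idea, not TAU's phase API:

```python
from collections import defaultdict

class PhaseProfiler:
    def __init__(self):
        self.time = defaultdict(float)  # (phase, function) -> seconds
        self.phase = "default"

    def set_phase(self, name):          # e.g. a solver stage or iteration
        self.phase = name

    def record(self, function, seconds):
        self.time[(self.phase, function)] += seconds

p = PhaseProfiler()
p.set_phase("setup");   p.record("exchange_halo", 0.5)
p.set_phase("iterate"); p.record("exchange_halo", 4.0)
p.record("exchange_halo", 4.0)
# The same routine is attributed separately under each phase:
assert p.time[("setup", "exchange_halo")] == 0.5
assert p.time[("iterate", "exchange_halo")] == 8.0
```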