When performance measurements are made of program operation, actual
execution behavior can be perturbed. In general, the degree of perturbation
depends on the intrusiveness and frequency of the instrumentation. If the
perturbation effects of the instrumentation cannot be quantified by a perturbation
model (and subsequently removed during perturbation analysis), detailed
performance measurements could be inaccurate. Developing models of time
and event perturbations that can recover actual execution performance from
perturbed performance measurements is the topic of this paper. Time-based
models can accurately capture execution time perturbations for sequential
computations and concurrent computations with simple fork-join behavior.
However, the performance of parallel computations generally depends on the
relative ordering of dependent events and the assignment of computational
resources. Event-based models must be used to quantify instrumentation
perturbation in parallel performance measurements. The measurement and
subsequent analysis of synchronization operations (e.g., barrier, semaphore,
and advance/await synchronization) can produce accurate approximations to
actual performance behavior. Unfortunately, event-based models are limited in
their ability to fully capture perturbation effects in nondeterministic executions.
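The time-based models described above can be sketched concretely. Below is a minimal, illustrative example (the event names, timestamps, and per-event overhead are invented for this sketch, not taken from the paper): each recorded trace event's timestamp is shifted back by the cumulative instrumentation overhead of all preceding events.

```python
def correct_trace(events, per_event_overhead):
    """Given (name, measured_timestamp) pairs in time order, shift each
    timestamp back by the cumulative instrumentation overhead of all
    preceding events -- a minimal time-based perturbation model."""
    corrected = []
    for i, (name, t) in enumerate(events):
        corrected.append((name, t - i * per_event_overhead))
    return corrected

# Invented example trace: four events, 10 ms of overhead per event.
trace = [("enter_f", 0.000), ("enter_g", 0.110),
         ("exit_g", 0.230), ("exit_f", 0.300)]
print(correct_trace(trace, 0.01))
```

As the abstract notes, this simple subtraction suffices for sequential executions; for parallel executions the relative ordering of dependent events must also be preserved, which is why event-based models are needed.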
Event-related potentials (ERP) are brain electrophysiological
patterns created by averaging electroencephalographic
(EEG) data, time-locking to events of interest (e.g.,
stimulus or response onset). In this paper, we propose a
semi-automatic framework for mining ERP data, which includes
the following steps: PCA decomposition, extraction
of summary metrics, unsupervised learning (clustering) of
patterns, and supervised learning, i.e., discovery, of classification
rules. Results show good correspondence between
rules that emerge from decision tree classifiers and rules
that were independently derived by domain experts. In addition,
data mining results suggested ways in which expert-defined
rules might be refined to improve pattern representation
and classification results.
We describe a technique for noninvasive
conductivity estimation of the human head tissues in
vivo. It is based on the bounded electrical impedance
tomography (bEIT) measurement procedure and a
realistically shaped high-resolution finite difference
model (FDM) of the human head geometry constructed
from subject-specific co-registered CT and MRI.
The first experimental results with two subjects
demonstrate the feasibility of this technology.
The relative simplicity of the Fortran 77 language's design allowed for reasonable interoperability with C and C++. Fortran 90, on the other hand, introduces several new and complex features to the language that severely degrade the viability of a mixed Fortran and C++ development environment. Major new items added to Fortran are user-defined types, pointers, and several new array features. Each of these items introduces difficulties because the Fortran 90 procedure calling convention was not designed with interoperability as an important design goal. For example, Fortran 90 arrays are passed by array descriptor, whose layout is not specified by the language and therefore depends on a particular compiler implementation. This paper describes a set of software tools that parse Fortran 90 source code and produce mediating interface functions which allow access to Fortran 90 libraries from C++.
The use of a cluster for distributed performance analysis of parallel trace
data is discussed. We propose an analysis architecture that uses multiple
cluster nodes as a server to execute analysis operations in parallel and
communicate to remote clients where performance visualization and user
interactions occur. The client-server system developed, VNG, is highly
configurable and is shown to perform well for traces of large size, when
compared to leading trace visualization systems.
The effect of the operating system on application performance is an increasingly
important consideration in high performance computing. OS kernel measurement is
key to understanding the performance influences and the interrelationship of system
and user-level performance factors. The KTAU (Kernel TAU) methodology and Linux-
based framework provide parallel kernel performance measurement from both a
kernel-wide and process-centric perspective. The first characterizes overall
aggregate kernel performance for the entire system. The second characterizes kernel
performance when it runs in the context of a particular process. KTAU extends the
TAU performance system with kernel-level monitoring, while leveraging TAU's
measurement and analysis capabilities. We explain the rationale and motivations
behind our approach, describe the KTAU design and implementation, and show
working examples on multiple platforms demonstrating the versatility of KTAU in
integrated system / application monitoring.
Power is the most critical resource for exascale
high performance computing. In the future, system administrators
might have to pay attention to the power consumption of
the machine under different workloads. Hence, each application
may have to run with an allocated power budget. Thus, achieving
the best performance on future machines requires optimizing
performance subject to a power constraint. This additional
performance requirement should not be the responsibility of
HPC (High Performance Computing) application developers.
Optimizing the performance for a given power budget should
be the responsibility of the high-performance system software stack.
Modern machines allow power capping of CPU and memory to
implement a power budgeting strategy. Finding the best runtime
environment for a node at a given power level is important to
get the best performance.
This paper presents the ARCS (Adaptive Runtime Configuration
Selection) framework, which automatically selects the best runtime
configuration for each OpenMP parallel region at a given power
level. The framework uses the OMPT (OpenMP Tools) API, APEX
(Autonomic Performance Environment for eXascale), and Active
Harmony frameworks to explore the configuration search space and
select the best number of threads, scheduling policy, and chunk
size for a given power level at run-time. We test ARCS using the
NAS Parallel Benchmarks and the proxy application LULESH on
Intel Sandy Bridge and IBM Power multi-core architectures. We
show that for a given power level, efficient OpenMP runtime
parameter selection can improve the execution time and energy
consumption of an application by up to 40% and 42%, respectively.
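The configuration-selection idea can be sketched as a search over the runtime parameter space. The cost model below is entirely invented for illustration; the real ARCS framework measures each candidate configuration at run-time via OMPT/APEX under a power cap rather than evaluating a formula.

```python
import itertools

def run_region(threads, schedule, chunk):
    # Invented stand-in cost model for timing one OpenMP parallel region
    # under a power cap: parallel work, scheduling overhead, and a
    # per-thread penalty standing in for the power constraint.
    work = 16.0 / threads
    overhead = {"static": 0.01, "dynamic": 0.05, "guided": 0.03}[schedule]
    return work + overhead * (64 / chunk) + 0.02 * threads

def select_best_configuration():
    # Exhaustively explore (threads, schedule, chunk) and keep the fastest.
    space = itertools.product([2, 4, 8, 16],
                              ["static", "dynamic", "guided"],
                              [1, 8, 64])
    return min(space, key=lambda cfg: run_region(*cfg))

print(select_best_configuration())
```

In practice the search space is too large to enumerate per region, which is why ARCS delegates the exploration to a tuning framework such as Active Harmony.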
The Common Component Architecture (CCA) is a
component-based methodology for developing scientific simulation
codes. This architecture consists of a framework which
enables components (embodiments of numerical algorithms
and physical models) to work together. Components publish
their interfaces and use interfaces published by others.
Components publishing the same interface and with the same
functionality (but perhaps implemented via a different algorithm
or data structure) may be transparently substituted for each
other in a code or a component assembly. Components are
compiled into shared libraries and are loaded in, instantiated
and composed into a useful code at runtime. Details regarding
CCA can be found in [1], [2]. An analysis of the process of
decomposing a legacy simulation code and re-synthesizing it
as components can be found in [3], [4]. Actual scientific results
obtained from this toolkit can be found in [5], [6].
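The transparent-substitution idea can be illustrated with a toy example. The interface and class names below are invented for this sketch (CCA components are real shared libraries composed at runtime, not Python classes): two components publish the same interface, so the driver can use either without change.

```python
from abc import ABC, abstractmethod

class IntegratorPort(ABC):
    """A published interface: any component implementing it is substitutable."""
    @abstractmethod
    def integrate(self, f, a, b, n):
        """Approximate the integral of f over [a, b] with n subintervals."""

class MidpointIntegrator(IntegratorPort):
    def integrate(self, f, a, b, n):
        h = (b - a) / n
        return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

class TrapezoidIntegrator(IntegratorPort):
    def integrate(self, f, a, b, n):
        h = (b - a) / n
        return (sum(f(a + i * h) for i in range(1, n)) + 0.5 * (f(a) + f(b))) * h

def assemble_and_run(component: IntegratorPort):
    # The driver only knows the published interface, not the implementation.
    return component.integrate(lambda x: x * x, 0.0, 1.0, 1000)

# Either implementation may be substituted without changing the driver.
print(assemble_and_run(MidpointIntegrator()))
print(assemble_and_run(TrapezoidIntegrator()))
```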
In this paper, we discuss TAU (Tuning and Analysis Utilities), a
first prototype for an integrated and portable program analysis
environment for pC++, a parallel object-oriented language system. TAU
is integrated with the pC++ system in that it relies heavily on
compiler and transformation tools (specifically, the Sage++ toolkit)
for its implementation. This paper describes the design and
functionality of TAU and shows its application in practice.
The realization of parallel language systems that offer high-level
programming paradigms to reduce the complexity of application
development, scalable runtime mechanisms to support variable size
problem sets, and portable compiler platforms to provide access to
multiple parallel architectures, places additional demands on the
tools for program development and analysis. The need for integration
of these tools into a comprehensive programming environment is even
more pronounced and will require more sophisticated use of the
language system technology (i.e., compiler and runtime
system). Furthermore, the environment requirements of high-level
support for the programmer, large-scale applications, and portable
access to diverse machines also apply to the program analysis tools.
The TAU performance system is an integrated performance instrumentation, measurement, and analysis toolkit offering support for profiling and tracing modes of measurement. This paper introduces memory introspection capabilities of TAU featured on the Cray XT3 Catamount compute node kernel. TAU supports examining the memory headroom, or the amount of heap memory available, at routine entry, and correlates it to the program’s callstack as an atomic event.
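The headroom-as-atomic-event idea can be sketched as follows. Everything here is a simplified stand-in: `headroom_probe` returns an invented constant, whereas TAU's Catamount support queries the actual heap, and the decorator mimics routine-entry instrumentation.

```python
events = []       # recorded (callstack, headroom) atomic events
call_stack = []   # current routine nesting

def headroom_probe():
    # Invented constant; a real probe asks the allocator/OS for free heap.
    return 1_000_000

def instrumented(name):
    """Decorator standing in for routine-entry instrumentation."""
    def wrap(fn):
        def inner(*args, **kwargs):
            call_stack.append(name)
            # Atomic event: headroom sample correlated to the callstack.
            events.append((tuple(call_stack), headroom_probe()))
            try:
                return fn(*args, **kwargs)
            finally:
                call_stack.pop()
        return inner
    return wrap

@instrumented("solver")
def solver():
    kernel()

@instrumented("kernel")
def kernel():
    pass

solver()
print(events)  # each entry pairs a callstack with a headroom sample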
The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems will depend on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. The TAU system is offered as an example framework that meets these requirements. With a flexible, modular instrumentation and measurement system, and an open performance data and analysis environment, TAU can target a range of complex performance scenarios. Examples are given showing the diversity of TAU application.
A common complaint when dealing with the performance of computationally intensive
scientific applications on parallel computers is that programs exist to predict the
performance of radar systems, missiles and artillery shells, drugs, etc., but no one knows
how to predict the performance of these applications on a parallel computer. Actually, that
is not quite true. A more accurate statement is that no one knows how to predict the
performance of these applications on a parallel computer in a reasonable amount of time.
PENVELOPE is an attempt to remedy this situation. It is an extension to Amdahl's Law and
Gustafson's work on scaled speedup that takes into account the cost of interprocessor
communication and operating system overhead, yet is simple enough that it was
implemented as an Excel spreadsheet.
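A model of this kind is simple enough to sketch in a few lines. The formula and coefficients below are illustrative assumptions in the Amdahl/Gustafson vein, not the actual PENVELOPE equations: ideal speedup degraded by communication and operating-system overhead terms that grow with the processor count.

```python
def predicted_speedup(p, serial_fraction, comm_per_proc, os_overhead):
    """Speedup on p processors: Amdahl-style parallel time plus
    communication and OS overhead that grow with p (assumed model)."""
    parallel_time = serial_fraction + (1 - serial_fraction) / p
    overhead_time = p * comm_per_proc + os_overhead
    return 1.0 / (parallel_time + overhead_time)

# With overhead included, speedup saturates and eventually declines.
for p in (1, 16, 256):
    print(p, predicted_speedup(p, 0.01, 1e-4, 1e-3))
```

Like the spreadsheet version described above, evaluating such a closed-form model takes negligible time compared with simulating the application.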
Performance profiling of MPI programs generates overhead during
execution that introduces error in profile measurements. It is possible to track and
remove overhead online, but it is necessary to communicate execution delay
between processes to correctly adjust their interdependent timing. We demonstrate
the first implementation of an online measurement overhead compensation system
for profiling MPI programs. This is implemented in the TAU performance
system. It requires novel techniques for communicating delay in the use of MPI.
The ability to reduce measurement error is demonstrated for problematic test
cases and real applications.
A scalable approach to performance analysis of MPI applications is
presented that includes automated source code instrumentation, low overhead
generation of profile and trace data, and database management of performance
data. In addition, tools are described that analyze large-scale parallel profile and
trace data. Analysis of trace data is done using an automated pattern-matching
approach. Examples of using the tools on large-scale MPI applications are
presented.
Workflows offer scientists a simple but flexible
programming model at a level of abstraction closer
to the domain-specific activities that they seek to
perform. However, languages for describing
workflows tend to be highly complex, or specialized towards
a particular domain, or both. WOOL is an
abstract workflow language with human-readable
syntax, intuitive semantics, and a powerful abstract
type system. WOOL workflows can be targeted
to almost any kind of runtime system supporting
data-flow computation. This paper describes the
design of the WOOL language and the implementation
of its compiler, along with a simple example
runtime. We demonstrate its use in an image-processing
workflow.
The concepts involved in the programming
of multicore systems have been well known for
decades. The problem is to make that programming as easy
as sequential programming. This new trend will
change the way we think about the whole development
process. We will show that it is possible to develop a
multicore embedded system application using existing
tools and the model-driven development process
proposed. To do this, two tools will be used:
VisualRTXC (available at www.quadrosbrasil.com.br)
for generating the multithread
communication/synchronization structures and a
performance tool called TAU (available at
http://www.cs.uoregon.edu/research/tau/home.php) for
the tuning of the final implementation.
This article discusses approaches to implementing object-independent
event trace monitoring and analysis systems. The term
object-independent means that the system can be used for the analysis
of arbitrary (non-sequential) computer systems, operating systems,
programming languages and applications. Three main topics are
addressed: object-independent monitoring, standardization of event
trace formats and access interfaces and the application-independent
but problem-oriented implementation of analysis and visualization
tools. Based on these approaches, the distributed hardware monitor
system ZM4 and the SIMPLE event trace analysis environment were
implemented, and have been used in many 'real-world' applications
throughout the last three years. An overview of the projects in which
the ZM4/SIMPLE tools were used is given in the last section.
Programming non-sequential computer systems is hard! Many tools and
environments have been designed and implemented to ease the use and
programming of such systems. The majority of analysis tools are
event-based and use event traces for representing the dynamic
behavior of the system under investigation, the object system. Most
tools can only be used for one special object system, or a specific
class of systems such as distributed shared memory machines. This
limitation is not obvious because all tools provide the same basic
functionality.
In this paper, we discuss TAU (Tuning and Analysis Utilities), the
first prototype of an integrated and portable program analysis
environment for pC++, a parallel object-oriented language system. TAU
is unique in that it was developed specifically for pC++ and relies
heavily on pC++'s compiler and transformation tools (specifically, the
Sage++ toolkit) for its implementation. This tight integration allows
TAU to achieve a combination of portability, functionality, and
usability not commonly found in high-level language environments. The
paper describes the design and functionality of TAU, using a new tool
for breakpoint-based program analysis as an example of TAU's
capabilities.
We report on our experiences in building a computational environment for tomographic image analysis for marine seismologists studying the structure and evolution of mid-ocean ridge volcanism. The computational environment is determined by an evolving set of requirements for this problem domain and includes needs for high-performance parallel computing, large data analysis, model visualization, and computation interaction and control. Although these needs are not unique in scientific computing, the integration of techniques for seismic tomography with tools for parallel computing and data analysis into a computational environment was (and continues to be) an interesting, important learning experience for researchers in both disciplines. For the geologists, the use of the environment led to fundamental geologic discoveries on the East Pacific Rise, the improvement of parallel ray tracing algorithms, and a better regard for the use of computational steering in aiding model convergence. The computer scientists received valuable feedback on the use of programming, analysis, and visualization tools in the environment. In particular, the tools for parallel program data query (DAQV) and visualization programming (Viz) were demonstrated to be highly adaptable to the problem domain. We discuss the requirements and the components of the environment in detail. Both accomplishments and limitations of our work are presented.
In the race for Exascale, the advent of many-core processors
will bring a shift in parallel computing architectures
to systems of much higher concurrency, but with a relatively
smaller memory per thread. This shift raises concerns
about the adaptability of the current generation of HPC software
to this brave new world. In this paper, we study
domain splitting on an increasing number of memory areas
as an example problem where negative performance impact
on computation could arise. We identify the specific parameters
that drive scalability for this problem, and then
model the halo-cell ratio on common mesh topologies to
study the memory and communication implications. Such
analysis argues for the use of shared-memory parallelism,
such as with OpenMP, to address the performance problems
that could occur. In contrast, we propose an original
solution based entirely on MPI programming semantics,
while providing the performance advantages of hybrid
parallel programming. Our solution transparently replaces
halo-cells transfers with pointer exchanges when MPI tasks
are running on the same node, effectively removing memory
copies. The results we present demonstrate gains in terms of
memory and computation time on Xeon Phi (compared to
OpenMP-only and MPI-only) using a representative domain
decomposition benchmark.
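The halo-cell ratio mentioned above is easy to make concrete. The sketch below assumes a cubic subdomain with a one-cell-deep halo (the paper models several mesh topologies; this is just the simplest case): as domain splitting assigns smaller subdomains to more MPI tasks, the fraction of memory spent on halo cells grows sharply.

```python
def halo_ratio(n, halo=1):
    """Ratio of halo cells to interior cells for an n^3 cubic subdomain
    with a halo `halo` cells deep on every face."""
    total = (n + 2 * halo) ** 3
    interior = n ** 3
    return (total - interior) / interior

# Shrinking the subdomain side from 64 to 4 cells makes halo storage
# dominate, which motivates sharing halos via pointer exchange on-node.
for n in (64, 16, 4):
    print(n, halo_ratio(n))
```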
The advent of many-core architectures poses new challenges
to the MPI programming model which has been designed for
distributed memory message passing. It is now clear that
MPI will have to evolve in order to exploit shared-memory
parallelism, either by collaborating with other programming
models (MPI+X) or by introducing new shared-memory approaches.
This paper considers extensions to C and C++ to
make it possible for MPI processes to run as threads. More
generally, a thread-local storage (TLS) library is developed
to simplify the collocation of arbitrary tasks and services in
a shared-memory context called a task-container. The paper
discusses how such containers simplify model and service
mixing at the OS process level, eventually easing the collocation
of arbitrary tasks with MPI processes in a runtime
agnostic fashion, opening alternatives to runtime stacking.
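The thread-local storage idea at the heart of task containers can be illustrated in miniature. This is only an analogy (the paper's mechanism is C/C++ TLS with compiler support, not Python): several "MPI processes" run as threads of one OS process, yet each sees its own copy of what used to be a process-global variable.

```python
import threading

tls = threading.local()  # per-thread storage: each "MPI process" gets its own view
results = {}

def mpi_process(rank):
    # What would be process-global state lives in TLS, so collocated
    # tasks in the same address space do not clobber each other.
    tls.rank = rank
    tls.buffer = [rank] * 3
    results[rank] = (tls.rank, sum(tls.buffer))

threads = [threading.Thread(target=mpi_process, args=(r,)) for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```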
This paper presents the design, implementation, and application of ParaProf, a
portable, extensible, and scalable tool for parallel performance profile analysis.
ParaProf attempts to offer ``best of breed'' capabilities to performance analysts --
those inherited from a rich history of single processor profilers and those being
pioneered in parallel tools research. We present ParaProf as a parallel profile
analysis framework that can be retargeted and extended as required.
ParaProf's design and operation are discussed, and its novel support for
large-scale parallel analysis is demonstrated with a 512-processor application profile
generated using the TAU performance system.
Performance is the reason for parallel computing. Despite years of work, many
applications achieve only a few percent of theoretical peak. Performance
measurement and analysis tools exist to identify the problems with current programs
and systems. Performance Prediction is intended to identify issues in new code or
systems before they are fully available. These two topics are closely related since
most prediction requires data to be gathered from measured runs of program (to
identify application signatures or to understand the performance characteristics of
current machines).
Measurement-based profiling introduces intrusion in program execution. Intrusion effects
can be mitigated by compensating for measurement overhead. Techniques for compensation
analysis in performance profiling are presented and their implementation in the TAU
performance system is described. Experimental results on the NAS parallel benchmarks
demonstrate that overhead compensation can be effective in improving the accuracy of
performance profiling.
Due to the diversity of parallel and distributed computing infrastructures and
programming models, and the complexity of issues involved in the design and
development of parallel programs, the creation of tools and environments to support
the broad range of parallel system and software functionality has been widely
recognized as a difficult challenge.
Current research in this topic continues to address individual tools for supporting
correctness and performance issues in parallel program development. However,
standalone tools are sometimes insufficient to cover the rich diversity of tasks found
in the design, implementation and production phases of the parallel software life-
cycle. This has motivated interest in interoperable tools, as well as solutions to ease
their integration into unified development and execution environments.
Performance profiling generates measurement overhead during parallel
program execution. Measurement overhead, in turn, introduces
intrusion in a program's runtime performance behavior. Intrusion can
be mitigated by controlling instrumentation degree, allowing a
tradeoff of accuracy for detail. Alternatively, the accuracy in
profile results can be improved by reducing the intrusion error due to
measurement overhead. Models for compensation of measurement overhead
in parallel performance profiling are described. An approach based on
rational reconstruction is used to understand properties of
compensation solutions for different parallel scenarios. From this
analysis, a general algorithm for on-the-fly overhead assessment and
compensation is derived.
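The on-the-fly scheme can be sketched as follows. The per-event cost and class shape are invented for this illustration (the actual algorithm is derived in the paper): since every measurement event inflates the inclusive time of every routine currently on the callstack, its overhead is credited back to all of them as it occurs.

```python
PER_EVENT = 0.001  # assumed calibrated cost of one measurement event (s)

class CompensatingProfiler:
    """Toy profiler that, on every measurement event, credits the event's
    overhead back to every routine currently on the callstack."""

    def __init__(self):
        self.stack = []
        self.credit = {}  # routine -> accumulated overhead to subtract

    def _charge_event(self):
        for routine in self.stack:
            self.credit[routine] = self.credit.get(routine, 0.0) + PER_EVENT

    def enter(self, routine):
        self.stack.append(routine)
        self._charge_event()   # the entry event itself costs PER_EVENT

    def exit(self):
        self._charge_event()   # so does the exit event
        self.stack.pop()

prof = CompensatingProfiler()
prof.enter("main"); prof.enter("solve"); prof.exit(); prof.exit()
print(prof.credit)  # overhead to subtract from each inclusive time
```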
Online application performance monitoring allows tracking
performance characteristics during execution as opposed to doing so
post-mortem. This opens up several possibilities otherwise unavailable
such as real-time visualization and application performance steering that
can be useful in the context of long-running applications. As HPC
systems grow in size and complexity, the key challenge is to keep the online
performance monitor scalable and low overhead while still providing a
useful performance reporting capability. Two fundamental components
that constitute such a performance monitor are the measurement and
transport systems. We adapt and combine two existing, mature systems,
TAU and Supermon, to address this problem. TAU performs the
measurement while Supermon is used to collect the distributed measurement
state. Our experiments show that this novel approach leads to very
low-overhead application monitoring as well as other benefits unavailable
from using a transport such as NFS.
Performance analysis tools are only as useful as the data they collect. Not just accuracy of performance data, but accessibility, is necessary for performance analysis tools to be used to their full effect. The diversity of performance analysis and tuning problems calls for more flexible means of storing and representing performance data. The development and maintenance cycles of high performance programs, in particular, stand to benefit from exploration of and expansion of the means used to record and describe program execution behavior. We describe a means of representing program performance data via a time or event delineated series of performance profiles, or profile snapshots, implemented in the TAU performance analysis system. This includes an explanation of the profile snapshot format and means of snapshot analysis.
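The snapshot idea reduces to periodically copying the running profile. The routine names and timings below are invented for this sketch of the concept: differencing consecutive snapshots recovers interval behavior that a single end-of-run profile cannot show.

```python
import copy

profile = {"compute": 0.0, "io": 0.0}  # running per-routine profile
snapshots = []

def take_snapshot(label):
    # A profile snapshot: a labeled, immutable copy of the running profile.
    snapshots.append((label, copy.deepcopy(profile)))

profile["compute"] += 2.0
take_snapshot("after step 1")
profile["compute"] += 2.0
profile["io"] += 0.5
take_snapshot("after step 2")

# Interval behavior falls out by differencing consecutive snapshots.
(_, a), (_, b) = snapshots
delta = {k: b[k] - a[k] for k in b}
print(delta)
```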
With support for C/C++, Fortran, MPI, OpenMP, and performance tools, the Eclipse integrated development environment (IDE) is a serious contender as a programming environment for parallel applications. There is interest in adding capabilities in Eclipse for conducting workflows where an application is executed under different scenarios and its outputs are processed. For instance, parametric studies are a requirement in many benchmarking and performance tuning efforts, yet there was no experiment management support available for the Eclipse IDE. In this paper, we describe an extension of the Parallel Tools Platform (PTP) plugin for the Eclipse IDE. The extension provides a graphical user interface for selecting experiment parameters, launches build and run jobs, manages the performance data, and launches an analysis application to process the data. We describe our implementation, and discuss three experiment examples which demonstrate the experiment management support.
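At its core, the experiment-management workflow enumerates parameter combinations and launches a build/run job for each. The sketch below only mimics that loop; `build_and_run` and its cost formula are invented stand-ins for the jobs the Eclipse/PTP extension actually launches.

```python
import itertools

def build_and_run(opt_level, nprocs):
    # Invented stand-in for building and running the application with
    # one parameter combination; returns a pretend timing result.
    return {"opt": opt_level, "np": nprocs, "time": 100.0 / nprocs - 2 * opt_level}

parameters = {"opt_level": [0, 2], "nprocs": [4, 8]}

# Enumerate the cross product of experiment parameters, run each
# configuration, and hand the collected results to analysis.
experiments = [build_and_run(o, n)
               for o, n in itertools.product(parameters["opt_level"],
                                             parameters["nprocs"])]
best = min(experiments, key=lambda r: r["time"])
print(best)
```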
Numerous programming models have been introduced to allow
programmers to utilize new accelerator-based architectures. While
OpenCL and CUDA provide low-level access to accelerator programming,
the task cries out for a higher-level abstraction. Of the higher-level
programming models which have emerged, few are intended to
co-exist with mainstream, general-purpose languages while supporting
tunability, composability, and transparency of implementation. In this
paper, we propose that extensions to the type systems (implementable as syntactically
neutral annotations) of traditional, general-purpose languages
can be made which allow programmers to work at a higher level of abstraction
with respect to memory, deferring much of the tedium of data
management and movement code to an automatic code generation tool.
Furthermore, our technique, based on formal term rewriting, allows for
user-defined reduction rules to optimize low-level operations and exploit
domain- and/or application-specific knowledge.
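Term rewriting with user-defined rules can be sketched compactly. The term encoding and the example rule below are invented for illustration (the paper's system rewrites generated data-movement code, not Python tuples): a rule recognizes a pattern and replaces it, and rules are applied until a fixed point is reached.

```python
def rewrite(term, rules):
    """Apply rewrite rules bottom-up until a fixed point is reached.
    Terms are nested tuples; a rule returns a new term or None."""
    if isinstance(term, tuple):
        term = tuple(rewrite(t, rules) for t in term)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(term)
            if new is not None and new != term:
                term = new
                changed = True
    return term

# Example domain rule: copying data to the device and straight back is
# the identity, so the round trip can be eliminated.
def cancel_roundtrip(term):
    if isinstance(term, tuple) and term and term[0] == "to_host":
        inner = term[1]
        if isinstance(inner, tuple) and inner and inner[0] == "to_device":
            return inner[1]
    return None

expr = ("to_host", ("to_device", "x"))
print(rewrite(expr, [cancel_roundtrip]))
```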
In recent years, a range of novel methodologies and tools have been developed for the
purpose of evaluation, design, and model reduction of existing and emerging parallel and
distributed systems. At the same time, the coverage of the term ‘performance’ has
constantly broadened to include reliability, robustness, energy consumption, and
scalability in addition to classical performance-oriented evaluations of system
functionalities. Indeed, the increasing diversification of parallel systems, from cloud
computing to exascale, fueled by technological advances, is placing greater
emphasis on the methods and tools to address more comprehensive concerns. The aim
of the Performance Prediction and Evaluation topic is to bring together system designers
and researchers involved with the qualitative and quantitative evaluation and modeling of
large-scale parallel and distributed applications and systems to focus on current critical
areas of performance prediction and evaluation theory and practice.
Tuning codes for GPGPU architectures is challenging because
few performance tools can pinpoint the exact causes of execution bottlenecks.
While profiling applications can reveal execution behavior with a
particular architecture, the abundance of collected information can also
overwhelm the user. Moreover, performance counters provide cumulative
values but do not attribute events to code regions, which makes identifying
performance hot spots difficult. This research focuses on characterizing
the behavior of GPU application kernels and their performance at the node
level by providing a visualization and metrics display that indicates the
behavior of the application with respect to the underlying architecture.
We demonstrate the effectiveness of our techniques with LAMMPS and
LULESH application case studies on a variety of GPU architectures. By
sampling instruction mixes for kernel execution runs, we reveal a variety
of intrinsic program characteristics relating to computation, memory and
control flow.
This paper describes the design and implementation of the Distributed Array Query and Visualization (DAQV) system for High Performance Fortran, a project sponsored by the Parallel Tools Consortium. DAQV's implementation leverages the HPF language, compiler, and runtime system to address the general problem of providing high-level access to distributed data structures. DAQV supports a framework in which visualization and analysis clients connect to a distributed array server (i.e., the HPF application with DAQV control) for program-level access to array values. Implementing key components of DAQV in HPF itself has led to a robust and portable solution in which clients do not need to know how the data is distributed.
To aid in building high-performance computational environments,
INTERLACE offers a framework for linking reusable computational
engines in a heterogeneous distributed system. The INTERLACE model
provides clients with access to computational servers which interface
with "wrapped" computational engines. The wrappers implement
mechanisms to translate client requests to engine actions and to move
data across the server interface. These mechanisms are programmable,
allowing engines of different types to be integrated. The framework
takes advantage of the HPC++ runtime system to access servers through
distributed object operations. The INTERLACE framework has been
demonstrated by building a distributed computational environment with
MatLab engines.
The influences of OS and system-specific effects on application performance are
increasingly important in high performance computing. In this regard, OS kernel
measurement is necessary to understand the interrelationship of system and
application behavior. This can
be viewed from two perspectives: kernel-wide and process-centric. An
integrated methodology and framework to observe both views in HPC
systems using OS kernel measurement has remained elusive. We demonstrate
a new tool called KTAU (Kernel TAU) that aims to provide parallel kernel
performance measurement from both perspectives. KTAU extends the TAU
performance system with kernel-level monitoring, while
leveraging TAU's measurement and analysis capabilities. As part of the
ZeptoOS scalable operating systems project, we report early experiences
using KTAU in ZeptoOS on the IBM BG/L system.
Parallel performance tuning naturally involves a diagnosis
process to locate and explain sources of program inefficiency. Proposed
is an approach that exploits parallel computation patterns (models) for
diagnosis discovery. Knowledge of performance problems and inference
rules for hypothesis search are engineered from model semantics and
analysis expertise. In this manner, the performance diagnosis process
can be automated as well as adapted for parallel model variations. We
demonstrate the implementation of model-based performance diagnosis
on the classic Master-Worker pattern. Our results suggest that pattern-
based performance knowledge can provide effective guidance for locating
and explaining performance bugs at a high level of program abstraction.
To enable a scalable parallel application to view its global performance state, we designed and
developed TAUg, a portable runtime framework layered on the TAU parallel performance
system. TAUg leverages the MPI library to communicate between application processes, creating
an abstraction of a global performance space from which profile views can be retrieved. We
describe the TAUg design and implementation and show its use on two test benchmarks up to
512 processors. Overhead evaluation for the use of TAUg is included in our analysis. Future
directions for improvement are discussed.
In this article we propose a ``standard'' performance tool interface for
OpenMP, similar in spirit to the MPI profiling interface in its intent to
define a clear and portable API that makes OpenMP execution events visible
to performance libraries. When used together with the MPI profiling
interface, it also allows tools to be built for hybrid applications that
mix shared and distributed memory programming. We describe an
instrumentation approach based on OpenMP directive rewriting that generates
calls to the interface and passes context information (e.g., source code
locations) in a portable and efficient way. Our proposed OpenMP performance
API further allows user functions and arbitrary code regions to be marked
and performance measurement to be controlled using new proposed OpenMP
directives. The directive transformations we define are implemented in a
source-to-source translation tool called OPARI.
We have used it to integrate the TAU performance analysis
framework and the automatic event trace analyzer EXPERT with the proposed OpenMP performance interface.
Together, these tools show that a portable and robust solution to
performance analysis of OpenMP and hybrid applications is possible.
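The directive-rewriting idea can be illustrated with a minimal sketch. This is not OPARI itself; the function and event names below are illustrative stand-ins for a POMP-style interface, and a real rewriter also handles block ends, join events, and many more directive forms:

```python
import re

# Hypothetical sketch of OpenMP directive rewriting: before each
# "#pragma omp parallel" we insert a call into a performance interface,
# passing context information (source file and line) in the call.
def instrument_openmp(source: str) -> str:
    out = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if re.match(r"\s*#pragma\s+omp\s+parallel\b", line):
            # Make the fork event visible to the performance library.
            out.append(f'POMP_Parallel_fork("demo.c", {lineno});')
            out.append(line)
            # A real tool also rewrites the region end to emit a join event.
        else:
            out.append(line)
    return "\n".join(out)

src = "int main() {\n#pragma omp parallel\n{ work(); }\n}"
print(instrument_openmp(src))
```

The same source-to-source strategy keeps the transformation portable across compilers, since only standard directives and plain function calls are emitted.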
"Regular" is an often-used term to suggest the simple and uniform structure of a parallel
processor's organization or a parallel algorithm's operation. However, a strict definition is
long overdue. In this paper, we define regularity for processor array structures in two
dimensions and enumerate the eleven distinct regular topologies. Space and time emulation
schemes among the regular processor arrays are constructed to compare their geometric
and performance characteristics. The hexagonal array is shown to have the most efficient
emulation capabilities.
Twig is a language for writing typemaps, programs which
transform the type of a value while preserving its underlying
meaning. Typemaps are typically used by tools that generate
code, such as multi-language wrapper generators, to automatically
convert types as needed. Twig builds on existing
typemap tools in a few key ways. Twig’s typemaps are composable
so that complex transformations may be built from
simpler ones. In addition, Twig incorporates an abstract,
formal model of code generation, allowing it to output code
for different target languages. We describe Twig’s formal
semantics and show how the language allows us to concisely
express typemaps. Then, we demonstrate Twig’s utility by
building an example typemap.
The lack of tools to observe the operation and performance
of message-based parallel architectures limits the
user's ability to effectively optimize application and system
performance. Performance data collection, analysis,
and visualization tools are needed to manage the complexity
and quantity of performance data. Furthermore, these
tools must be integrated with the machine hardware, the
system software, and the applications support software if
they are to find pervasive use in program development and
experimentation.
In this paper, we describe an integrated performance
environment being developed for the Intel iPSC/2 hypercube.
The data collection components of the environment
include software event tracing at the operating system
and program levels plus a hardware-based performance
monitoring system used to unobtrusively capture software
events. A visualization system, based on the X window
system, permits the performance analyst to browse and
explore interesting data components by dynamically interconnecting
new performance displays and data analysis
tools.
As software complexity increases, the analysis of code behavior during its execution is becoming more important. Instrumentation techniques, through the insertion of code directly into binaries, are essential to program analyses such as performance evaluation and profiling. In the context of high-performance parallel applications, building an instrumentation framework is quite challenging. One of the difficulties is due to the necessity to capture coarse-grain behavior, such as the execution time of different functions, as well as finer-grain behavior in order to pinpoint performance issues.
In this paper, we propose a language, MIL, for the development of program analysis tools based on static binary instrumentation. The key feature of MIL is to ease the integration of static, global program analysis with instrumentation. We will show how this enables both a precise targeting of the code regions to analyze, and a better understanding of the optimized program behavior.
Particle advection is a foundational operation for many flow visualization techniques,
including streamlines, Finite-Time Lyapunov Exponents (FTLE) calculation, and stream
surfaces. The workload for particle advection problems varies greatly, including
significant variation in computational requirements. With this study, we consider the
performance impacts from hardware architecture on this problem, studying
distributed-memory systems with CPUs with varying amounts of cores per node, and
with nodes with one to three GPUs. Our goal was to explore which architectures were
best suited to which workloads, and why. While the results of this study will help
inform visualization scientists which architectures they should use when solving
certain flow visualization problems, it is also informative for the larger HPC
community, since many simulation codes will soon incorporate visualization via in situ
techniques.
There are two main conclusions from this work. First, interaction
support should be integrated with a language system facilitating an
implementation of a model that is consistent with the language
design. This aids application developers or the tool builders that
require this interaction. Second, as the implementation of Breezy
shows, the development of interaction support can leverage off the
language itself as well as its compiler and runtime systems.
This paper presents a general architecture for runtime interaction
with a data-parallel program. We have applied this architecture in the
development of the Breezy tool for the pC++ language. Breezy grants
application programs convenient and efficient access to higher-level
external services (e.g., databases, visualization systems, and
distributed resources) and allows external access to the application's
state (e.g., for program state display or computational
steering). Although such support can be developed on an ad-hoc basis
for each application, a general approach to the problem of parallel
program interaction is preferred. A general approach makes tools more
portable and retargetable to different language systems.
Tracing parallel programs to observe their performance introduces intrusion as the result of
trace measurement overhead. If post-mortem trace analysis does not compensate for the
overhead, the intrusion will lead to errors in the performance results. We show that
measurement overhead can be accounted for during trace analysis and intrusion modeled and
removed. Algorithms developed in our earlier work are reimplemented in a more robust and
modern tool, KOJAK, allowing them to be applied in large-scale parallel programs. The ability
to reduce trace measurement error is demonstrated for a Monte-Carlo simulation
based on a master/worker scheme. As an additional result, we visualize how local
perturbation propagates across process boundaries and alters the behavioral char-
acteristics of non-local processes.
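The core of overhead compensation on a single process can be sketched as follows. This is an illustrative simplification, not the KOJAK algorithm: it assumes a fixed, known per-event measurement cost, whereas the real analysis must also model how local perturbation propagates across process boundaries through message events:

```python
# Illustrative sketch: approximate "uninstrumented" timestamps by
# subtracting the accumulated measurement overhead from each event.
# Every recorded event delays all later events by `overhead` seconds.
def compensate(timestamps, overhead):
    adjusted = []
    accumulated = 0.0
    for t in timestamps:
        accumulated += overhead   # this event's own measurement cost
        adjusted.append(t - accumulated)
    return adjusted

# Three events recorded at t = 1.0, 2.0, 3.0 s with 0.1 s overhead each;
# the correction grows with the number of preceding events.
print(compensate([1.0, 2.0, 3.0], 0.1))
```

In a parallel trace, a send whose timestamp shifts earlier can change when the matching receive completes, which is exactly the cross-process propagation effect the abstract describes visualizing.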
The Eclipse platform offers Integrated Development Environment support
for a diverse and growing array of programming applications and languages.
There is an increasing call for programming tools to support various
development tasks from within Eclipse. This includes tools for testing
and analyzing program performance. We describe the high-level synthesis
of the Eclipse platform with the TAU parallel performance analysis
system. By leveraging Eclipse's modularity and extensibility with
TAU's robust automated performance analysis mechanisms we produce
an integrated, GUI controlled performance analysis system for Java,
C/C++ and High Performance Computing development within Eclipse.
Parallel performance diagnosis can be improved with the use of performance knowledge about parallel computation models. The Hercule
diagnosis system applies model-based methods to automate performance
diagnosis processes and explain performance problems from high-level
computation semantics. However, Hercule is limited by a single experiment view. Here we introduce the concept of relative performance diagnosis and show how it can be integrated in a model-based diagnosis framework. The paper demonstrates the effectiveness of Hercule's approach to relative diagnosis of the well-known Sweep3D application based on a Wavefront model. Relative diagnoses of Sweep3D performance anomalies in strong and weak scaling cases are given.
Scientific computing on massively parallel computers presents
unique challenges to component-based software engineering (CBSE).
While CBSE is at least as enabling for scientific computing as it is
for other arenas, the requirements are different. We briefly discuss
how these requirements shape the Common Component Architecture, and we
describe some recent research on quality-of-service issues to address
the computational performance and accuracy of scientific simulations.
Computational environments used by scientists should provide
high-level support for scientific processes that involve the
integrated and systematic use of familiar abstractions from a
laboratory setting, including notebooks, instruments, experiments, and
analysis tools. However, doing so while hiding the complexities of
the underlying computational platform is a challenge. ViNE is a
web-based electronic notebook that implements a high-level interface
for applying computational tools in scientific experiments in a
location- and platform-independent manner. Using ViNE, a scientist
can specify data and tools, and construct experiments that apply them
in well-defined procedures. ViNE's implementation of the experiment
abstraction offers the scientist an easy-to-understand framework for
building scientific processes. This paper discusses how ViNE
implements computational experiments in distributed, heterogeneous
computing environments.
Advances in human brain neuroimaging to achieve high-temporal and high-spatial
resolution will depend on computational approaches to localize EEG signals to their
sources in the cortex. The source localization inverse problem is inherently ill-posed and
depends critically on the modeling of human head electromagnetics. In this paper we
present a systematic methodology to analyze the main factors and parameters that affect
the accuracy of the EEG source-mapping solutions. We argue that these factors are not
independent and their effect must be evaluated in a unified way. To do so requires
significant computational capabilities to explore the landscape of the problem, to quantify
uncertainty effects, and to evaluate alternative algorithms. We demonstrate that bringing
HPC to this domain will enable such investigation and will allow new avenues for
neuroinformatics research. Two algorithms for the electromagnetics forward problem (the
heart of the source localization inverse), incorporating tissue inhomogeneity and
impedance anisotropy, are presented and their parallel implementations described. The
head model forward solvers are evaluated and their performance analyzed.
Since the beginning of ``high-performance'' parallel
computing, observing and analyzing performance for
purposes of finding bottlenecks and identifying
opportunities for improvement has been at the heart of
delivering the performance potential of next-generation
scalable systems. Interestingly, it is the ever-changing
parallel computing landscape that is the main driver of
requirements for parallel performance technology and the
improvements necessary beyond the current state-of-the-art.
Indeed, the development and application of our TAU
Performance System over many years largely follows an
evolutionary path of addressing measurement and analysis
problems in new parallel machines and programming
environments. However, the outlook to future parallel
systems with high degrees of concurrency, heterogeneous
components, dynamic runtime environments, asynchronous
execution, and power constraints suggests a new
perspective will be needed on the role of performance
observation and analysis in respect to tool technology
integration and performance optimization methods. The
reliance on post-mortem analysis of application-level ("1st
person") performance measurements is prohibitive for
exascale-class machines because of the performance data
volume, the primitive basis for performance data
attribution, and the fundamental problem of performance
variation that will exist. Instead, it will be important to
provide introspection support across the exascale software
stack to understand how system ("3rd person") resources
are used during execution. Furthermore, the opportunity to
couple a global performance introspection capability (a
"performance backplane") with online performance
decision analytics inspires the concept of an autonomic
performance system that can feed back policy-based
decisions to guide the computation to better states of
execution. The talk will explore these issues by giving a
brief retrospective on performance tool evolution, setting
the stage for current research projects where a new
performance perspective is being pursued. It will also
speculate on what might be included in next-generation
parallel systems hardware, specifically to make the
exascale machines more performance-aware and
dynamically-adaptive.
Current trends for high-performance computing systems are
leading us towards hardware over-provisioning where it is no
longer possible to run each component at peak power without
exceeding a system or facility wide power bound. In
such scenarios, the power consumed by individual components
must be artificially limited to guarantee system operation
under a given power bound. In this paper, we present
the design of a power scheduler capable of enforcing such a
bound using dynamic system-wide power reallocation in an
application-agnostic manner. Our scheduler achieves better
job runtimes than a naïve power scheduling approach
without requiring a priori knowledge of application power
behavior.
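The contrast with a naïve scheduler can be sketched in a few lines. This is a hypothetical illustration of dynamic power reallocation under a global bound, not the paper's actual scheduler; all names and the proportional-sharing policy are assumptions:

```python
# Hypothetical sketch: start from a naive equal split of the system-wide
# power bound, cap each job at its measured demand, then redistribute the
# reclaimed slack to jobs that can still use more power.
def allocate_power(demands, bound):
    n = len(demands)
    alloc = [bound / n] * n                 # naive equal split
    # Reclaim power from jobs that need less than their share...
    slack = sum(max(0.0, alloc[i] - demands[i]) for i in range(n))
    hungry = [i for i in range(n) if demands[i] > alloc[i]]
    for i in range(n):
        alloc[i] = min(alloc[i], demands[i])
    # ...and share it among the power-hungry jobs, never exceeding demand.
    if hungry:
        share = slack / len(hungry)
        for i in hungry:
            alloc[i] = min(demands[i], alloc[i] + share)
    return alloc

# Three jobs demanding 50, 150, and 200 W under a 300 W bound:
print(allocate_power([50, 150, 200], 300))
```

The naive scheduler would stop at the equal split, stranding power at the under-demanding job; reallocation is what recovers the runtime difference the abstract reports.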
We report our experiences porting Spark to large production
HPC systems. While Spark performance in a data center
installation (with local disks) is dominated by the network,
our results show that file system metadata access latency
can dominate in an HPC installation using Lustre: it makes
single-node performance up to 4× slower than on a
typical workstation. We evaluate a combination of software
techniques and hardware configurations designed to address
this problem. For example, on the software side we develop
a file pooling layer able to improve per node performance
up to 2.8×. On the hardware side we evaluate a system
with a large NVRAM buffer between compute nodes and the
backend Lustre file system: this improves scaling at the expense
of per-node performance. Overall, our results indicate
that scalability is currently limited to O(10^2) cores in an
HPC installation with Lustre and default Spark. After careful
configuration combined with our pooling we can scale up
to O(10^4). As our analysis indicates, it is feasible to observe
much higher scalability in the near future.
The Distributed Array Query and Visualization (DAQV) project aims to
develop systems and tools that facilitate interacting with distributed
programs and data structures. Arrays distributed across the processes
of a parallel or distributed application are made available to
external clients via well-defined interfaces and protocols. Our design
considers the broad issues of language targets, models of interaction,
and abstractions for data access, while our implementation attempts to
provide a general framework that can be adapted to a range of
application scenarios. The paper describes the second generation of
DAQV work and places it in the context of the more general distributed
array access problem. Current applications and future work are also
described.
We present a method for evaluating ICA separation of artifacts from EEG
(electroencephalographic) data. Two algorithms, Infomax and FastICA, were applied
to "synthetic data," created by superimposing simulated blinks on a blink-free EEG.
To examine sensitivity to different data characteristics, multiple datasets were
constructed by varying properties of the simulated blinks. ICA was used to
decompose the data, and each source was cross-correlated with a blink template.
Different thresholds for correlation were used to assess stability of the algorithms.
When a match between the blink-template and a component was obtained, the
contribution of the source was subtracted from the EEG. Since the original data were
known a priori to be blink-free, it was possible to compute the correlation between
these "baseline" data and the results of different decompositions. By averaging the
filtered data, time-locked to the simulated blinks, we illustrate effects of different
outcomes for EEG waveform and topographic analysis.
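The template-matching step described above can be sketched in plain Python. This is an illustrative reconstruction, not the paper's code; the threshold value and all names are assumptions, and the paper evaluates several thresholds rather than one:

```python
import math

# Sketch: cross-correlate each recovered ICA source with a blink
# template and flag sources whose normalized correlation magnitude
# exceeds a chosen threshold (threshold here is an assumed value).
def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def blink_sources(sources, template, threshold=0.8):
    return [i for i, s in enumerate(sources)
            if abs(corr(s, template)) >= threshold]

template = [0, 1, 4, 1, 0]           # stylized blink shape
sources = [[0, 1, 4, 1, 0],          # near-perfect match
           [1, 0, 1, 0, 1]]          # unrelated oscillation
print(blink_sources(sources, template))   # -> [0]
```

Sources flagged this way are the ones whose contribution would then be subtracted from the EEG before the baseline comparison.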
The source estimation problem for EEG consists of estimating
cortical activity from measurements of electrical potential on the
scalp surface. This is an underconstrained inverse problem, as the
dimensionality of cortical source currents far exceeds the number
of sensors. We develop a novel regularization for this inverse problem
which incorporates knowledge of the anatomical connectivity of
the brain, measured by diffusion tensor imaging. We construct an
overcomplete wavelet frame, termed cortical graph wavelets, by applying
the recently developed spectral graph wavelet transform to
this anatomical connectivity graph. Our signal model is formed by
assuming that the desired cortical currents have a sparse representation
in these cortical graph wavelets, which leads to a convex
ℓ1-regularized least squares problem for the coefficients. On data from
a simple motor potential experiment, the proposed method shows
improvement over the standard minimum-norm regularization.
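In standard notation (the symbols here are generic choices, not necessarily the paper's own), the sparse signal model above leads to an optimization of the form:

```latex
\hat{c} = \arg\min_{c} \; \| y - A W c \|_2^2 + \lambda \| c \|_1 ,
\qquad \hat{s} = W \hat{c}
```

where $y$ holds the scalp potential measurements, $A$ is the lead-field (forward) matrix, $W$ is the cortical graph wavelet synthesis operator, $c$ is the vector of wavelet coefficients, and $\lambda$ weights the sparsity-promoting ℓ1 penalty against data fidelity.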
As computer systems grow in size and complexity, tool support is
needed to facilitate the efficient mapping of large-scale applications
onto these systems. To help achieve this mapping, performance
analysis tools must provide robust performance observation
capabilities at all levels of the system, as well as map low-level
behavior to high-level program constructs. Instrumentation and
measurement strategies, developed over the last several years,
must evolve together with performance analysis infrastructure to
address the challenges of new scalable parallel systems.
Adaptive algorithms are an important technique for achieving portable high
performance. They choose among solution methods and optimizations
according to expected performance on a particular machine. Grid environments
make the adaptation problem harder, because the optimal decision may change
across runs and even during runtime. Therefore, the performance model used
by an adaptive algorithm must be able to change decisions without high
overhead. In this paper, we present work that extends our previous research
on rapid performance modeling to support adaptive grid applications through
sampling and high-granularity modeling. We also outline preliminary results that
show the ability to predict differences in performance among algorithms in the
same program.
The computational environment for estimation of unknown regional
electrical conductivities of the human head, based on realistic geometry from
segmented MRI up to 256 resolution, is described. A finite difference alternating
direction implicit (ADI) algorithm, parallelized using OpenMP, is used to solve the
forward problem describing the electrical field distribution throughout the head
given known electrical sources. A simplex search in the multi-dimensional parameter
space of tissue conductivities is conducted in parallel using a distributed
system of heterogeneous computational resources. The theoretical and computational
formulation of the problem is presented. Results from test studies are provided,
comparing retrieved conductivities to known solutions from simulation.
Performance statistics are also given showing both the scaling of the forward
problem and the performance dynamics of the distributed search.
We present a parallel computational environment used to determine
conductivity properties of human head tissues when the effects of skull
inhomogeneities are modeled. The environment employs a parallel simulated annealing
algorithm to overcome poor convergence rates of the simplex method for larger
numbers of head tissues required for accurate modeling of electromagnetic dynamics
of brain function. To properly account for skull inhomogeneities, parcellation
of skull parts is necessary. The multi-level parallel simulated annealing
algorithm is described and performance results presented. Significant improvements
in both convergence rate and speedup are achieved. The simulated annealing
algorithm was successful in extracting conductivity values for up to thirteen
head tissues without showing computational deficiency.
Using the Eclipse platform we have provided a centralized resource
and unified user interface for the encapsulation of existing
command-line based performance analysis tools. In this paper we describe
the user-definable tool workflow system provided by this performance
framework. We discuss the framework’s implementation and the
rationale for its design. A use case featuring the TAU performance analysis
system demonstrates the utility of the workflow system with respect
to conventional performance analysis procedures.
Contemporary high-end Terascale and Petascale systems are composed of hundreds of thousands of commodity multi-core processors interconnected with high-speed custom networks. Performance characteristics of applications executing on these systems are a function of system hardware and software as well as workload parameters. Therefore, it has become increasingly challenging to measure, analyze and project performance using a single tool on these systems. In order to address these issues, we propose a methodology for performance measurement and analysis that is aware of applications and the underlying system hierarchies. On the application level, we measure cost distribution and runtime dependent values for different components of the underlying programming model. On the system front, we measure and analyze information gathered for unique system features, particularly shared components in the multi-core processors. We demonstrate our approach using a Petascale combustion application called S3D on two high-end Teraflops systems, Cray XT4 and IBM Blue Gene/P, using a combination of hardware performance monitoring, profiling and tracing tools.
In scientific domains where discovery is driven by simulation modeling,
common methodologies and procedures are applied for scientific
investigation. ODESSI (Open Domain-extensible Environment for Simulation-based
Scientific Investigation) is an environment to facilitate the representation
and automated conduct of scientific studies by capturing common methods
for experimentation, analysis, and evaluation used in simulation science. Specific
methods ODESSI will support include parameter studies, optimization, uncertainty
quantification, and sensitivity analysis. By making these methods accessible
in a programmable framework, ODESSI can be used to capture and run
domain-specific investigations. ODESSI is demonstrated for a problem in the
neuroscience domain involving computational modeling of human head electromagnetics
for conductivity analysis and source localization.
We describe a novel 3D finite difference method for solving the anisotropic
inhomogeneous Poisson equation based on a multi-component additive implicit method
with a 13-point stencil. The serial performance is found to be comparable to the most
efficient solvers from the family of preconditioned conjugate gradient (PCG) algorithms.
The proposed multi-component additive algorithm is unconditionally stable in 3D and
amenable for transparent domain decomposition parallelization up to one eighth of the
total grid points in the initial computational domain. Some validation and numerical
examples are given.
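In generic notation (symbols chosen here for illustration), the anisotropic inhomogeneous Poisson problem such a solver targets has the form:

```latex
\nabla \cdot \bigl( \sigma(\mathbf{x}) \, \nabla u(\mathbf{x}) \bigr) = f(\mathbf{x})
```

where $\sigma(\mathbf{x})$ is a symmetric, positive-definite conductivity tensor varying in space (inhomogeneity and anisotropy), $u$ is the potential, and $f$ is the source term.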
A common prerequisite for a number of debugging and performance-analysis
techniques is the injection of auxiliary program code into the application under investigation, a process called instrumentation. To accomplish this task, source-code preprocessors are often used. Unfortunately, existing preprocessing tools either focus only on a very specific aspect or use hard-coded commands for instrumentation. In this paper, we examine which basic constructs are required to specify a user-defined routine entry/exit instrumentation. This analysis serves as a basis for a generic instrumentation component working on the source-code level where the instructions to be inserted can be flexibly configured. We evaluate the identified constructs with our prototypical implementation and show that these are sufficient to fulfill the needs of a number of today's performance-analysis tools.
Electronic structure calculations are a widely used tool in materials
science and a large consumer of supercomputing resources. Traditionally,
the software packages for these kinds of simulations have been
implemented in compiled languages, where Fortran in its different
versions has been the most popular choice. While dynamic, interpreted
languages, such as Python, can increase the efficiency of the programmer,
they cannot compete directly with the raw performance of compiled
languages. However, by using an interpreted language together with a
compiled language, it is possible to have most of the productivity
enhancing features together with a good numerical performance. We
have used this approach in implementing an electronic structure
simulation software GPAW using the combination of Python and C
programming languages. While the chosen approach works well in standard
workstations and Unix environments, massively parallel supercomputing
systems can present some challenges in porting, debugging and profiling
the software. In this paper we describe some details of the
implementation and discuss the advantages and challenges of the combined
Python/C approach. We show that despite the challenges it is possible to
obtain good numerical performance and good parallel scalability with
Python based software.
This paper addresses two key parallelization challenges in the unstructured
mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi
tessellations: (1) load imbalance across processes, and (2) unstructured data
access patterns that inhibit intra- and inter-node performance. Our work
analyzes the load imbalance due to naive partitioning of the mesh, and develops
methods to generate mesh partitionings with better load balance and reduced
communication. Furthermore, we present methods that minimize both inter- and
intra-node data movement and maximize data reuse. Our techniques include
predictive ordering of data elements for higher cache efficiency, as well as
communication reduction approaches. We present detailed performance data when
running on thousands of cores using the Cray XC30 supercomputer and show that
our optimization strategies can exceed the original performance by over 2×.
Additionally, many of these solutions can be broadly applied to a wide variety
of unstructured grid-based computations.
Performance debugging using program profiling and tracing for scientific
workflows can be extremely difficult for two reasons. (1) Existing performance
tools lack the ability to automatically produce global performance data based
on local information from the coupled scientific applications of workflows,
particularly at runtime. (2) Profiling/tracing with static instrumentation may
incur high overhead and significantly slow down science-critical tasks. To gain
more insight into workflows we introduce a lightweight workflow monitoring
infrastructure, WOWMON (WOrkfloW MONitor), which gives users access not only to
cross-application performance data such as end-to-end latency and execution
time of individual workflow components at runtime, but also to customized
performance events. To reduce profiling overhead, WOWMON uses adaptive
selection of performance metrics based on machine learning algorithms to guide
profilers in collecting only the metrics that have the most impact on workflow
performance. Through the study of real scientific workflows (e.g., LAMMPS) with
the help of WOWMON, we found that the performance of the workflows can be
significantly affected by both software and hardware factors, such as the
policy of process mapping and in-situ buffer size. Moreover, we experimentally
show that WOWMON can reduce data movement for profiling by up to 54% without
missing the key metrics for performance debugging.
Producing high-performance implementations from simple, portable computation
specifications is a challenge that compilers have tried to address for several decades.
More recently, a relatively stable architectural landscape has evolved into a set of
increasingly diverging and rapidly changing CPU and accelerator designs, with the
main common factor being dramatic increases in the levels of parallelism available.
The growth of architectural heterogeneity and parallelism, combined with the very
slow development cycles of traditional compilers, has motivated the development of
autotuning tools that can quickly respond to changes in architectures and
programming models, and enable very specialized optimizations that are not possible
or likely to be provided by mainstream compilers. In this paper we describe the new
OpenCL code generator and autotuner OrCL and the introduction of detailed
performance measurement into the autotuning process. OrCL is implemented within
the Orio autotuning framework, which enables the rapid development of experimental
languages and code optimization strategies aimed at achieving good performance on
new platforms without rewriting or hand-optimizing critical kernels. The combination
of the new OpenCL autotuning and TAU measurement capabilities enables users to
consistently evaluate autotuning effectiveness across a range of architectures,
including several NVIDIA and AMD accelerators and Intel Xeon Phi processors, and to
compare the OpenCL and CUDA code generation capabilities. We present results of
autotuning several numerical kernels that typically dominate the execution time of
iterative sparse linear system solution and key computations from a 3-D parallel
simulation of solid fuel ignition.
Partitioned global address space (PGAS) applications,
such as the Tensor Contraction Engine (TCE) in NWChem,
often apply a one-process-per-core mapping in which each process
iterates through the following work-processing cycle: (1)
determine a work-item dynamically, (2) get data via one-sided
operations on remote blocks, (3) perform computation on the data
locally, (4) put (or accumulate) resultant data into an appropriate
remote location, and (5) repeat the cycle. However, this simple
flow of execution does not effectively hide communication latency
costs despite the opportunities for making asynchronous progress.
Utilizing nonblocking communication calls is not sufficient unless
care is taken to efficiently manage a responsive queue of
outstanding communication requests. This paper presents a new
runtime model and its library implementation for managing
tunable "work queues" in PGAS applications. Our runtime
execution model, called WorkQ, assigns some number of on-node
"producer" processes to primarily do communication (steps 1, 2,
4, and 5) and the other "consumer" processes to do computation
(step 3); but processes can switch roles dynamically for the sake
of performance. Load balance, synchronization, and overlap of
communication and computation are facilitated by a tunable
nodewise FIFO message queue protocol. Our WorkQ library
implementation enables an MPI+X hybrid programming model
where the X comprises SysV message queues and the user’s
choice of SysV, POSIX, and MPI shared memory. We develop a
simplified software mini-application that mimics the performance
behavior of the TCE at arbitrary scale, and we show that the
WorkQ engine outperforms the original model by about a factor
of 2. We also show performance improvement in the TCE coupled
cluster module of NWChem.
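The producer/consumer split described above can be sketched with a bounded FIFO queue. This is only a toy illustration of the execution model, not the WorkQ library itself (which is built on SysV/POSIX/MPI shared memory rather than Python threads); the work items and "remote data" are invented stand-ins.

```python
# Sketch of the WorkQ-style producer/consumer cycle (illustrative only).
import queue
import threading

work_items = list(range(8))      # step 1: dynamically determined work-items
results = []
q = queue.Queue(maxsize=4)       # tunable node-wise FIFO queue
SENTINEL = None

def producer():
    # steps 1, 2, 4, 5: pick a work-item, "get" its data, enqueue it
    for item in work_items:
        q.put([item] * 4)        # stand-in for data fetched by a one-sided get
    q.put(SENTINEL)              # signal that no more work is coming

def consumer():
    # step 3: dequeue data and perform the computation locally
    while (data := q.get()) is not SENTINEL:
        results.append(sum(data))

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
```

The bounded queue is what provides the overlap: the producer can run ahead fetching data for future tasks while the consumer computes, up to the queue's capacity.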
Many excellent open-source and commercial tools enable the
detailed measurement of the performance attributes of applications.
However, the process of collecting measurement
data and analyzing it remains effort-intensive because of differences
in tool interfaces and architectures. Furthermore,
insufficient standards and automation may result in losing
information about experiments, which may in turn lead to
misinterpretation of the data and analysis results. Autoperf
aims to support the entire workflow in performance measurement
and analysis in a uniform and portable fashion, enabling
both better productivity through automation of data
collection and analysis and experiment reproducibility.
Empirical performance evaluation of parallel systems and applications can generate
significant amounts of performance data and analysis results from multiple experiments as
performance is investigated and problems diagnosed. Hence, the management of
performance information is a core component of performance analysis tools. To better
support tool integration, portability, and reuse, there is a strong motivation to develop
performance data management technology that can provide a common foundation for
performance data storage, access, merging, and analysis. This paper presents the design and
implementation of the Performance Data Management Framework (PerfDMF). PerfDMF
addresses objectives of performance tool integration, interoperation, and reuse by providing
common data storage, access, and analysis infrastructure for parallel performance profiles.
PerfDMF includes an extensible parallel profile data schema and relational database schema,
a profile query and analysis programming interface, and an extendible toolkit for profile
import/export and standard analysis. We describe the PerfDMF objectives and architecture,
give detailed explanation of the major components, and show examples of PerfDMF
application.
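A relational store of parallel profiles supports exactly the kind of cross-rank queries described above. The following miniature is hypothetical: PerfDMF's actual schema is far richer, and the table and column names here are invented for illustration.

```python
# Toy relational profile store and analysis query (schema is invented,
# not PerfDMF's real one).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE profile (
    trial INTEGER, rank INTEGER, region TEXT, exclusive_time REAL)""")
rows = [
    (1, 0, "MPI_Allreduce", 1.2),
    (1, 1, "MPI_Allreduce", 1.4),
    (1, 0, "compute", 8.0),
    (1, 1, "compute", 7.6),
]
con.executemany("INSERT INTO profile VALUES (?, ?, ?, ?)", rows)

# A typical cross-rank analysis query: mean exclusive time per region.
means = con.execute("""SELECT region, AVG(exclusive_time)
                       FROM profile WHERE trial = 1
                       GROUP BY region ORDER BY region""").fetchall()
```

Storing profiles relationally is what makes merging and comparing trials from different tools a matter of queries rather than custom parsers.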
The Charm++ parallel programming system provides a modular performance interface that can be used to extend its performance measurement and analysis capabilities. The interface exposes execution events of interest representing Charm++ scheduling operations, application methods/routines, and communication events for observation by alternative performance modules configured to implement different measurement features. The paper describes Charm++’s performance interface and how the Charm++ Projections tool and the TAU Performance System can provide integrated trace-based and profile-based performance views. These two tools are complementary, providing the user with different performance perspectives on Charm++ applications based on performance data detail and temporal and
spatial analysis. How the tools work in practice is demonstrated in a parallel performance analysis of NAMD, a scalable molecular dynamics code that applies many of Charm++’s unique features.
Modern parallel performance measurement
systems collect performance information either through probes
inserted in the application code or via statistical sampling.
Probe-based techniques measure performance metrics directly
using calls to a measurement library that execute as part of
the application. In contrast, sampling-based systems interrupt
program execution to sample metrics for statistical analysis
of performance. Although both measurement approaches are
represented by robust tool frameworks in the performance
community, each has its strengths and weaknesses. In this
paper, we investigate the creation of a hybrid measurement
system, the goal being to exploit the strengths of both systems
and mitigate their weaknesses. We show how such a system
can be used to provide the application programmer with a
more complete analysis of their application. Simple example
and application codes are used to demonstrate its capabilities.
We also show how the hybrid techniques can be combined
to provide real cross-language performance evaluation of
an uninstrumented run for mixed compiled/interpreted
execution environments (e.g., Python and C/C++/Fortran).
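The probe/sampling contrast can be made concrete in a few lines. This is a toy sketch, not TAU's implementation: the probe is a timing decorator, and the "sampler" is a helper thread that periodically inspects the main thread's current frame (real samplers use timer interrupts and unwind full call stacks).

```python
# Toy contrast of probe-based vs. sampling-based measurement.
import sys
import threading
import time
from collections import Counter
from functools import wraps

probe_times = Counter()

def probe(fn):
    """Probe-based: direct measurement via calls inserted around the code."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            probe_times[fn.__name__] += time.perf_counter() - t0
    return wrapper

samples = Counter()

def sampler(thread_id, stop, interval=0.001):
    """Sampling-based: periodically interrupt and record the active function."""
    while not stop.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            samples[frame.f_code.co_name] += 1
        time.sleep(interval)

@probe
def busy():
    return sum(range(200_000))

stop = threading.Event()
t = threading.Thread(target=sampler,
                     args=(threading.main_thread().ident, stop))
t.start()
result = busy()
stop.set()
t.join()
```

The trade-off is visible even here: the probe reports exact per-call time but pays its overhead on every invocation, while the sampler's cost is fixed by the sampling interval but its attribution is only statistical.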
The power of GPUs is giving rise to heterogeneous parallel computing,
with new demands on programming environments, runtime systems, and tools
to deliver high-performing applications. This paper studies the problems
associated with performance measurement of heterogeneous machines with
GPUs. A heterogeneous computation model and alternative host-GPU
measurement approaches are discussed to set the stage for reporting new
capabilities for heterogeneous parallel performance measurement in three
leading HPC tools: PAPI, Vampir, and the TAU Performance System. Our work
leverages the new CUPTI tool support in NVIDIA’s CUDA device library.
Heterogeneous benchmarks from the SHOC suite are used to demonstrate the
measurement methods and tool support.
Developing effective yet scalable load-balancing methods for irregular computations is critical to the successful application of simulations in a variety of disciplines at petascale and beyond. This paper explores a set of static and dynamic scheduling algorithms for block-sparse tensor contractions within the NWChem computational chemistry code for different degrees of sparsity (and therefore load imbalance). In this particular application, a relatively large amount of task information can be obtained at minimal cost, which enables the use of static partitioning techniques that take the entire task list as input. However, fully static partitioning is incapable of dealing with dynamic variation of task costs, such as from transient network contention or operating system noise, so we also consider hybrid schemes that utilize dynamic scheduling within subgroups. These two schemes, which have not been previously implemented in NWChem or its proxies (i.e. quantum chemistry mini-apps), are compared to the original centralized dynamic load-balancing algorithm as well as an improved centralized scheme. In all cases, we separate the scheduling of tasks from the execution of tasks into an inspector phase and an executor phase. The impact of these methods upon the application is substantial on a large InfiniBand cluster: execution time is reduced by as much as 50% at scale. The technique is applicable to any scientific application requiring load balance where performance models or estimations of kernel execution times are available.
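The inspector/executor separation can be sketched as follows: the inspector statically partitions the whole task list using estimated costs (here a greedy longest-processing-time heuristic), and the executor then runs each rank's bucket. The task costs are hypothetical; this is not the NWChem implementation.

```python
# Sketch of static inspector/executor scheduling over a known task list.
import heapq

def inspector(task_costs, nranks):
    """Assign each task to the currently least-loaded rank (greedy LPT)."""
    heap = [(0.0, r) for r in range(nranks)]   # (accumulated load, rank)
    buckets = {r: [] for r in range(nranks)}
    # Place the most expensive tasks first for better balance.
    for task, cost in sorted(enumerate(task_costs), key=lambda t: -t[1]):
        load, r = heapq.heappop(heap)
        buckets[r].append(task)
        heapq.heappush(heap, (load + cost, r))
    return buckets

def executor(bucket, task_costs):
    """Run one rank's bucket (here, 'work' is just the summed cost)."""
    return sum(task_costs[t] for t in bucket)

costs = [5.0, 1.0, 3.0, 2.0, 4.0, 1.0]   # hypothetical per-block task costs
buckets = inspector(costs, nranks=2)
loads = [executor(buckets[r], costs) for r in range(2)]
```

When cost estimates are wrong at runtime (contention, OS noise), a hybrid scheme keeps this static assignment but lets ranks within a subgroup steal or reschedule tasks dynamically.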
The Alliant FX/8 multiprocessor implements several high-speed computation ideas in
software and hardware. Each of the 8 computational elements (CEs) has vector capabilities
and multiprocessor support. Generally, the FX/8 delivers its highest processing rates when
executing vector loops concurrently. In this paper, we present extensive empirical
performance results for vector processing on the FX/8. The vector kernels of the LANL
BMK8a1 benchmark are used in the experiments.
A message passing facility (MPF) for shared memory multiprocessors is presented. MPF is
based on a message passing model conceptually similar to conversations. The message
passing primitives for this model are implemented as a portable library of C function calls.
The performance of interprocess communication benchmark programs and two parallel
applications is given.
Heterogeneous parallel systems using GPU devices for application
acceleration have garnered significant attention in
the supercomputing community. However, to realize the full
potential of GPU computing, application developers will require
tools to measure and analyze accelerator performance
with respect to the parallel execution as a whole. A performance
measurement technology for the NVIDIA CUDA
platform has been developed and integrated with the TAU
parallel performance system. The design of the TAUcuda
package is based on an experimental NVIDIA CUDA driver
and associated runtime and device libraries. In any environment
where the CUDA experimental driver is installed,
TAUcuda can provide detailed performance information regarding
the execution of GPU kernels and the interactions
with the parallel program without any modification to the
program source or executable code. The paper describes the
TAUcuda technology and how it is integrated with the TAU
measurement framework to provide integrated performance
views. Various examples of TAUcuda use are presented, including
CUDA SDK examples, a GPU version of the Linpack
benchmark, and a scalable molecular dynamics application,
NAMD.
Developing effective yet scalable load-balancing methods for
irregular computations is critical to the successful application
of simulations in a variety of disciplines at petascale and
beyond. This poster explores a set of static and dynamic
scheduling algorithms for block-sparse tensor contractions
within the NWChem computational chemistry code for different
degrees of sparsity (and therefore load imbalance). In
this particular application, a relatively large amount of task
information can be obtained at minimal cost, which enables
the use of static partitioning techniques that take the entire
task list as input. However, fully static partitioning is
incapable of dealing with dynamic variation of task costs,
such as from transient network contention or operating system
noise, so we also consider hybrid schemes that utilize
dynamic scheduling within subgroups. These two schemes,
which have not been previously implemented in NWChem or
its proxies (i.e. quantum chemistry mini-apps), are compared
to the original centralized dynamic load-balancing algorithm
as well as an improved centralized scheme. In all cases, we separate
the scheduling of tasks from the execution of tasks into
an inspector phase and an executor phase. The impact of
these methods upon the application is substantial on a large
InfiniBand cluster: execution time is reduced by as much as
50% at scale. The technique is applicable to any scientific
application requiring load balance where performance models
or estimations of kernel execution times are available.
In this paper we discuss the performance prediction of Fortran constructs commonly found in
numerical scientific computing. Although the approach is applicable to multi-processors in
general, within the scope of the paper we will concentrate on the Alliant FX/8 multiprocessor.
The techniques proposed involve a combination of empirical observations, architectural
models, and analytical techniques, and exploit earlier work on data locality analysis and
empirical characterization of the behavior of memory systems. The Lawrence Livermore
Loops are used as a test-case to verify the approach.
The complexity of parallel computer systems makes a priori performance
prediction difficult and experimental performance analysis crucial. A complete
characterization of software and hardware dynamics, needed to understand the
performance of high-performance parallel systems, requires execution time
performance instrumentation. Although software recording of performance data
suffices for low frequency events, capture of detailed, high-frequency
performance data ultimately requires hardware support if the performance
instrumentation is to remain efficient and unobtrusive. This paper describes the
design of HYPERMON, a hardware system to capture and record software
performance traces generated on the Intel iPSC/2 hypercube. HYPERMON
represents a compromise between fully-passive hardware monitoring and
software event tracing; software generated events are extracted from each
node, timestamped, and externally recorded by HYPERMON. Using an
instrumented version of the iPSC/2 operating system and several application
programs, we present a performance analysis of an operational HYPERMON
prototype and assess the limitations of the current design. Based on these
results, we suggest design modifications that should permit capture of event
traces from the coming generation of high-performance distributed memory
parallel systems.
This paper describes how the SMARTS runtime system and the POOMA C++
class library for high-performance scientific computing work together
to exploit data parallelism in scientific applications while hiding
the details of managing parallelism and data locality from the
user. We present innovative algorithms, based on the macro-dataflow
model, for detecting data parallelism and efficiently executing
data-parallel statements on shared-memory multiprocessors. We also
describe how these algorithms can be implemented on clusters of SMPs.
In the solution of large-scale numerical problems, parallel computing
is becoming simultaneously more important and more difficult. The
complex organization of today's multiprocessors with several memory
hierarchies has forced the scientific programmer to make a choice
between simple but unscalable code and scalable but extremely complex
code that does not port to other architectures.
The process of empirical autotuning results in the generation of many code variants
which are tested, found to be suboptimal, and discarded. By retaining annotated
performance profiles of each variant tested over the course of many autotuning runs of
the same code across different hardware environments and different input datasets, we
can apply machine learning algorithms to generate classifiers for runtime selection of
code variants from a library, generate specialized variants, and potentially speed the
process of autotuning by starting the search from a point predicted to be close to
optimal. In this paper, we show how the TAU Performance System suite of tools can be
applied to autotuning to enable reuse of performance data generated through autotuning.
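One simple form of the runtime-selection idea is to keep the best-known variant per feature point from past autotuning runs and pick the variant whose profile is nearest to the current execution context. Everything below is a hypothetical sketch: the feature names, stored profiles, and nearest-neighbor rule are invented for illustration, not TAU's machinery.

```python
# Hypothetical variant selection from retained autotuning profiles.
import math

# (features, best_variant) pairs retained from earlier autotuning runs
profiles = [
    ({"cores": 4,  "n": 1_000},     "variant_scalar"),
    ({"cores": 16, "n": 1_000_000}, "variant_tiled"),
    ({"cores": 64, "n": 1_000_000}, "variant_gpu"),
]

def features_to_vec(f):
    # Log-scale the problem size so it is comparable to core counts.
    return (f["cores"], math.log10(f["n"]))

def select_variant(current):
    """Pick the variant whose recorded features are nearest to `current`."""
    cur = features_to_vec(current)
    def dist(entry):
        return math.dist(cur, features_to_vec(entry[0]))
    return min(profiles, key=dist)[1]
```

The same retained profiles can also seed a fresh autotuning search, starting it near the configuration predicted to be optimal instead of from scratch.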
This work targets the emerging use of software component technology for
high-performance scientific parallel and distributed computing. While
component software engineering will benefit the construction of complex
science applications, its use presents several challenges to performance
optimization. A component application is composed of a set of components,
thus, application performance depends on the interaction (possibly
non-linear) of the component set. Furthermore, a component is a ``binary
unit of composition'' and the only information users have is the interface
the component provides to the outside world. An interface for component
performance measurement and query is presented to address optimization
issues. We describe the performance component design and an example
demonstrating its use for runtime performance tuning.
We present a case study of performance measurement and modeling of a CCA (Common
Component Architecture) component-based application in a high performance computing
environment. Component-based HPC applications allow the possibility of creating
component-level performance models and synthesizing them into application performance
models. However, they impose the restriction that performance measurement/monitoring
needs to be done in a non-intrusive manner and at a fairly coarse-grained level. We propose
a performance measurement infrastructure for HPC based loosely on recent work done for
Grid environments. A prototypical implementation of the infrastructure is used to collect data
for three components in a scientific application and construct their performance models.
Both computational and message-passing performance are addressed.
HiPerSAT, a C++ library and associated tools, processes EEG
data sets with ICA (Independent Component Analysis)
methods. HiPerSAT uses BLAS, LAPACK, MPI
and OpenMP to achieve a high performance solution
that exploits parallel hardware. ICA is a class of methods
for analyzing a large set of data samples and extracting
independent components that explain the observed
data. ICA is used in EEG research for data
cleaning and separation of spatiotemporal patterns that
may reflect different underlying neural processes. We
present two ICA implementations (FastICA and Infomax)
that exploit parallelism to provide an EEG component
decomposition solution of higher performance
and data capacity than current MATLAB-based implementations.
Experimental results and the methodology
used to obtain them are presented. Integrating HiPerSAT
with EEGLAB [4] is described, as well as future
plans for this research.
Performance tuning involves a diagnostic process to locate
and explain sources of program inefficiency. A performance
diagnosis system can leverage knowledge of performance
causes and symptoms that come from expertise
with parallel computational models. This paper extends our
model-based performance diagnosis approach to programs
with multiple models. We study two types of model compositions
(nesting and restructuring) and demonstrate how the
Hercule performance diagnosis framework can automatically
discover and interpret performance problems due to
model nesting in the FLASH application.
The Hartree-Fock (HF) method is the fundamental
first step for incorporating quantum mechanics into many-electron
simulations of atoms and molecules, and it is an
important component of computational chemistry toolkits like
NWChem. The GTFock code is an HF implementation that,
while it does not have all the features in NWChem, represents
crucial algorithmic advances that reduce communication and
improve load balance by doing an up-front static partitioning
of tasks, followed by work stealing whenever necessary.
To enable innovations in algorithms and exploit next-generation
exascale systems, it is crucial to support quantum
chemistry codes using expressive and convenient programming
models and runtime systems that are also efficient and scalable.
This paper presents an HF implementation similar to GTFock
using UPC++, a partitioned global address space model that
includes flexible communication, asynchronous remote computation,
and a powerful multidimensional array library. UPC++
offers runtime features that are useful for HF such as active
messages, a rich calculus for array operations, hardware-supported
fetch-and-add, and functions for ensuring asynchronous
runtime progress. We present a new distributed array
abstraction, DArray, that is convenient for the kinds of random-access
array updates and linear algebra operations on block-distributed
arrays with irregular data ownership. We analyze
the performance of atomic fetch-and-add operations (relevant
for load balancing) and runtime attentiveness, then compare
various techniques and optimizations for each. Our optimized
implementation of HF using UPC++ and the DArrays library
shows up to 20% improvement over GTFock with Global
Arrays at scales up to 24,000 cores.
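The role of fetch-and-add in load balancing is that workers claim task indices from a shared counter, so each task runs exactly once without a central scheduler. The sketch below emulates the atomic with a lock on local threads; in UPC++ it is a hardware-supported remote atomic on a distributed counter, which is what the abstract's performance analysis targets.

```python
# Counter-based dynamic load balancing via (emulated) fetch-and-add.
import threading

class FetchAndAdd:
    """Lock-based stand-in for a hardware atomic fetch-and-add."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()
    def fetch_add(self, inc=1):
        with self._lock:
            old = self._value
            self._value += inc
            return old          # the pre-increment value is the claimed index

NTASKS = 100
counter = FetchAndAdd()
done = [0] * NTASKS

def worker():
    while True:
        t = counter.fetch_add()  # atomically claim the next task index
        if t >= NTASKS:
            break                # all tasks have been claimed
        done[t] += 1             # "execute" task t

threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads: th.start()
for th in threads: th.join()
```

Because every claim is a remote atomic on one counter, the latency and attentiveness of that operation bound how fine-grained the tasks can be, which is why its performance is worth measuring in isolation.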
The ability to measure performance characteristics
of an application at runtime is essential for monitoring the behavior
of the application and the runtime system on the underlying
architecture. Traditional performance measurement tools do not
adequately provide measurements of asynchronous task-based
parallel applications, either in real-time or for postmortem analysis.
We propose that this capability is best provided directly by
the runtime system for ease in use and to minimize conflicts and
overheads potentially caused by traditional measurement tools.
In this paper, we describe and illustrate the use of the
performance monitoring capabilities in the HPX [13] runtime
system. We describe and detail existing performance counters
made available through HPX’s performance counter framework
and demonstrate how they are useful to understanding application
efficiency and resource usage at runtime. This extensive
framework provides the ability to asynchronously query software
and hardware counters and could potentially be used as the basis
for runtime adaptive resource decisions.
We demonstrate the ease of porting the Inncabs benchmark
suite to the HPX runtime system, the improved performance
of benchmarks that employ fine-grained task parallelism when
ported to HPX, and the capabilities and advantages of using the
in-situ performance monitoring system in HPX to give detailed
insight to the performance and behavior of the benchmarks and
the runtime system.
A primary characteristic of history-based Monte
Carlo neutron transport simulation is the application of
MIMD-style parallelism: the path of each neutron particle
is largely independent of all other particles, so threads of
execution perform independent instructions with respect to
other threads. This conflicts with the growing trend of HPC
vendors exploiting SIMD hardware, which accomplishes better
parallelism and more FLOPS per watt. Event-based neutron
transport suits vectorization better than history-based
transport, but it is difficult to implement and complicates
data management and transfer. However, the Intel Xeon Phi
architecture supports the familiar x86 instruction set and
memory model, mitigating difficulties in vectorizing neutron
transport codes.
This paper compares the event-based and history-based
approaches for exploiting SIMD in Monte Carlo neutron transport
simulations. For both algorithms, we analyze performance
using the three different execution models provided by the Xeon
Phi (offload, native, and symmetric) within the full-featured
OpenMC framework. A representative micro-benchmark of
the performance bottleneck computation shows about 10x
performance improvement using the event-based method. In
an optimized history-based simulation of a full-physics nuclear
reactor core in OpenMC, the MIC shows a calculation rate
1.6x higher than a modern 16-core CPU, 2.5x higher when
balancing load between the CPU and 1 MIC, and 4x higher
when balancing load between the CPU and 2 MICs. As far as
we are aware, our calculation rate per node on a high fidelity
benchmark (17,098 particles/second) is higher than any other
Monte Carlo neutron transport application. Furthermore, we
attain 95% distributed efficiency when using MPI and up to
512 concurrent MIC devices.
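The structural difference between the two algorithms is a loop interchange, which this toy sketch makes explicit (it carries none of OpenMC's physics; "steps" stand in for particle events). History-based code finishes one particle at a time, while event-based code advances all live particles through one event at a time, so the inner work maps naturally onto SIMD lanes.

```python
# Toy contrast of history-based vs. event-based particle processing.
import random

def steps_needed(rng):
    """Stand-in for the random number of events in one particle's life."""
    return rng.randrange(1, 6)

def history_based(n, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(n):                        # outer loop over particles
        for _ in range(steps_needed(rng)):    # inner: this particle's history
            total += 1                        # scalar work per event
    return total

def event_based(n, seed=0):
    rng = random.Random(seed)
    remaining = [steps_needed(rng) for _ in range(n)]
    total = 0
    while remaining:                          # outer loop over events
        total += len(remaining)               # one event for every live particle
        remaining = [r - 1 for r in remaining if r > 1]   # retire finished ones
    return total

assert history_based(100) == event_based(100)  # same work, different order
```

The event-based version does identical operations on a whole bank of particles per iteration, at the price of compacting the particle bank after every event, which is the data-management cost the paper weighs against the SIMD gain.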
The Argo project is a DOE initiative for designing
a modular operating system/runtime for the next generation
of supercomputers. A key focus area in this project is power
management, which is one of the main challenges on the path to
exascale. In this paper, we discuss ideas for systemwide power
management in the Argo project. We present a hierarchical and
scalable approach to maintain a power bound at scale, and we
highlight some early results.
In this paper, we discuss the performance analysis of the pC++
programming system. We describe the performance tools developed and
include scalability measurements for four benchmark programs: a
"nearest neighbor" grid computation, a fast Poisson solver, and the
"Embar" and "Sparse" codes from the NAS suite. In addition to speedup
numbers, we present a detailed analysis highlighting performance
issues at the language, runtime system, and target system levels.
pC++ is a language extension to C++ designed to allow programmers to
compose distributed data structures with parallel execution
semantics. These data structures are organized as ``concurrent
aggregate'' collection classes which can be aligned and distributed
over the memory hierarchy of a parallel machine in a manner consistent
with the High Performance Fortran Forum (HPF) directives for Fortran
90. pC++ allows the user to write portable and efficient code which
will run on a wide range of scalable parallel computers.
Performance diagnosis, the process of finding and explaining performance
problems, is an important part of parallel programming. Effective performance
diagnosis requires that the programmer plan an appropriate method, and
manage the experiments required by that method. This paper presents Poirot,
an architecture to support performance diagnosis. It explains how the
architecture helps automatically and adaptably plan and manage the diagnosis
process. The paper evaluates the generality and practicality of Poirot, by
reconstructing diagnosis methods found in several published performance
tools.
We report our experiences in porting and tuning the Apache
Spark data analytics framework on the Cray XC30 (Edison) and XC40
(Cori) systems, installed at NERSC. We find that design decisions made
in the development of Spark are based on the assumption that Spark
is constrained primarily by network latency, and that disk I/O is comparatively
cheap. These assumptions are not valid on Edison or Cori,
which feature advanced low-latency networks but have diskless compute
nodes. Lustre metadata access latency is a major bottleneck, severely
constraining scalability. We characterize this problem with benchmarks
run on a system with both Lustre and local disks, and show how to mitigate
high metadata access latency by using per-node loopback filesystems
for temporary storage. With this technique, we reduce the shuffle time
and improve application scalability from O(100) to O(10,000) cores on
Cori. For shuffle-intensive machine learning workloads, we show better
performance than clusters with local disks.
Applications executing on complex computational systems provide a
challenge for the development of runtime performance monitoring
software. We discuss a computational model, application monitoring,
data access models, and profiler functionality. We define data
consistency within and across threads as well as across contexts and
nodes. We describe the TAU runtime monitoring framework which enables
on-demand, low-interference data access to TAU profile data and
provides the flexibility to enforce data consistency at the thread,
context or node level. We present an example of a Java-based runtime
performance monitor utilizing the framework.
Technology for empirical performance evaluation of parallel programs
is driven by the increasing complexity of high performance computing environments
and programming methodologies. This paper describes the integration of
the TAU and XPARE tools in the Uintah computational framework. Performance
mapping techniques in TAU relate low-level performance data to higher levels of
abstraction. XPARE is used for specifying regression testing benchmarks that are
evaluated with each periodically scheduled testing trial. This provides a historical
panorama of the evolution of application performance. The paper concludes with
a scalability study that shows the benefits of integrating performance technology
in the development of large-scale parallel applications.
The paper presents the design and development of an online remote trace
measurement and analysis system. The work combines the strengths of the
TAU performance system with that of the VNG distributed parallel trace
analyzer. Issues associated with online tracing are discussed and the problems
encountered in system implementation are analyzed in detail. Our approach
should port well to parallel platforms. Future work includes testing the
performance of the system on large-scale machines.
Practice has shown that programming a new multicore
system is a greater challenge than previously thought. The
challenge is to make building such a system as easy as
sequential programming. This new trend has changed
the way we think about the whole development process. The
aim of this work is to show that it is possible to develop a
multicore embedded system application using existing tools, while
at the same time, obtaining reuse. This process is carried out
in a cyclic and increasing manner, generating a more refined
version of the application at each iteration. The development
process consists of five phases: Multitask Modelling, Code Generation,
Test/Debugging, Mapping Tasks to Cores and Tuning
the Application. The three initial ones are carried out using the
VisualRTXC tool, whereas the last two use the performance tool
TAU. Using a small application, a Case Study shows how the
proposed development process works and the steps involved in
the implementation of an embedded system.
We have developed an environment that uses the IBM Visualization Data Explorer system to allow new visualizations to be prototyped rapidly, often taking only a few hours to construct totally new views of parallel performance trace data. Yet, access to a robust library of sophisticated graphical techniques is preserved. The burdensome task of explicitly programming the visualizations is completely avoided, and the iterative design, evaluation, and modification of new displays is greatly facilitated.
The complexity of parallel programs make them more difficult to analyze for correctness and efficiency, in part because of the interactions between multiple processors and the volume of data that can be generated. Visualization often helps the programmer in these tasks. This paper focuses on the development of a new technique for constructing, evaluating, and modifying sophisticated, application-specific visualizations for parallel programs and performance data. While most existing tools offer predetermined sets of simple, two-dimensional graphical displays, this environment gives users a high degree of control over visualization development and use, including access to three-dimensional graphics, which remain relatively unexplored in this context.
A multi-cluster computational environment with mixed-mode (MPI +
OpenMP) parallelism for estimation of unknown regional electrical
conductivities of the human head, based on realistic geometry from
segmented MRI up to 256-voxel resolution, is described. A finite
difference multi-component alternating direction implicit (ADI)
algorithm, parallelized using OpenMP, is used to solve the forward
problem calculation describing the electrical field distribution
throughout the head given known electrical sources. A simplex search in the
multi-dimensional parameter space of tissue conductivities is conducted in
parallel across a distributed system of heterogeneous computational resources. The
theoretical and computational formulation of the problem is presented. Results
from test studies based on the synthetic data are provided, comparing retrieved
conductivities to known solutions from simulation. Performance statistics are also
given showing both the scaling of the forward problem and the performance
dynamics of the distributed search.
Nested OpenMP parallelism allows an application to spawn teams of nested threads. This hierarchical nature of thread creation and usage poses problems for performance measurement tools that must determine thread context to properly maintain per-thread performance data. In this paper we describe the problem and a novel solution for identifying threads uniquely. Our approach has been implemented in the TAU performance system and has been successfully used in profiling and tracing OpenMP applications with nested parallelism. We also describe how extensions to the OpenMP standard can help tool developers uniquely identify threads.
The introduction of tasks in the OpenMP programming
model brings a new level of parallelism. This also creates new challenges
with respect to its meaning and applicability for event-based
performance profiling. The OpenMP Architecture Review Board (ARB)
has approved an interface specification known as the “OpenMP Runtime
API for Profiling” to enable performance tools to collect performance data
for OpenMP programs. In this paper, we propose new extensions to the
OpenMP Runtime API for profiling task-level parallelism. We present an
efficient method to distinguish individual task instances in order to track
their associated events at the micro level. We implement the proposed extensions
in the OpenUH compiler, an open-source OpenMP compiler.
With negligible overheads, we are able to capture important events
like task creation, execution, suspension, and exiting. These events help
in identifying overheads associated with the OpenMP tasking model, e.g.,
the time a task waits before it begins execution, or task cleanup. These events
also help in constructing important parent-child relationships that de-
fine tasks’ call paths. The proposed extensions are in line with the newest
specifications recently proposed by the OpenMP tools committee for task
profiling.
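The instance-tracking idea can be sketched as follows: each task instance receives a unique ID at creation, events are recorded against that ID, and creation events link child to parent so call paths can be reconstructed. This is a hedged stdlib-Python illustration of the bookkeeping, not the OpenUH implementation:

```python
import itertools

class TaskProfiler:
    """Record per-instance task events and reconstruct call paths."""
    def __init__(self):
        self._ids = itertools.count(1)
        self.parent = {0: None}     # task 0 is the implicit initial task
        self.name = {0: "initial"}
        self.events = []            # (task_id, event) in program order

    def create(self, parent_id, name):
        tid = next(self._ids)       # unique per task *instance*
        self.parent[tid] = parent_id
        self.name[tid] = name
        self.events.append((tid, "create"))
        return tid

    def record(self, tid, event):   # "begin", "suspend", "resume", "end"
        self.events.append((tid, event))

    def call_path(self, tid):
        path = []
        while tid is not None:
            path.append(self.name[tid])
            tid = self.parent[tid]
        return "/".join(reversed(path))

prof = TaskProfiler()
outer = prof.create(0, "traverse")    # task spawned by the initial task
inner = prof.create(outer, "visit")   # child task instance
inner2 = prof.create(outer, "visit")  # same code, new instance, new ID
prof.record(inner, "begin"); prof.record(inner, "end")
assert inner != inner2                # instances are distinguished
assert prof.call_path(inner) == "initial/traverse/visit"
```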
The ability to measure the performance of OpenMP programs portably across shared
memory platforms and across OpenMP compilers is a challenge due to the lack of a
widely-implemented performance interface standard. While the OpenMP community
is evaluating a tools interface specification called OMPT, at present there
are different instrumentation methods possible at different levels of observation and
with different system and compiler dependencies. This paper describes how support
for four mechanisms for OpenMP measurement has been integrated into the TAU
performance system. These include source-level instrumentation (Opari), a runtime
“collector” API (called ORA) built into an OpenMP compiler (OpenUH), a wrapped
OpenMP runtime library (GOMP using ORA), and an OpenMP runtime library
supporting an OMPT prototype (Intel). The capabilities of these approaches are
evaluated with respect to observation visibility, portability, and measurement
overhead for OpenMP benchmarks from the NAS parallel benchmarks, Barcelona
OpenMP Task Suite, and SPEC 2012. The integrated OpenMP measurement support is
also demonstrated on a scientific application, MPAS-Ocean.
Parallel Java environments present challenging problems for performance
tools because of Java's rich language system and its multi-level execution
platform combined with the integration of native-code application libraries
and parallel runtime software. In addition to the desire to provide robust
performance measurement and analysis capabilities for the Java language
itself, the coupling of different software execution contexts under a
uniform performance model needs careful consideration of how events of
interest are observed and how cross-context parallel execution information
is linked. This paper relates our experience in extending the TAU
performance system to a parallel Java environment based on mpiJava. We
describe the complexities of the instrumentation model used, how
performance measurements are made, and the overhead incurred. A parallel
Java application simulating the game of Life is used to show the
performance system's capabilities.
Event-related potentials (ERP) are brain electrophysiological
patterns created by averaging electroencephalographic
(EEG) data, time-locking to events of interest (e.g., stimulus
or response onset). In this paper, we propose a generic
framework for mining and developing domain ontologies and
apply it to mine brainwave (ERP) ontologies. The concepts
and relationships in ERP ontologies can be mined according
to the following steps: pattern decomposition, extraction
of summary metrics for concept candidates, hierarchical
clustering of patterns for classes and class taxonomies, and
clustering-based classification and association rules mining
for relationships (axioms) of concepts. We have applied this
process to several dense-array (128-channel) ERP datasets.
Results suggest good correspondence between mined concepts
and rules, on the one hand, and patterns and rules
that were independently formulated by domain experts, on
the other. Data mining results also suggest ways in which
expert-defined rules might be refined to improve ontology
representation and classification results. The next goal of
our ERP ontology mining framework is to address some
long-standing challenges in conducting large-scale comparison
and integration of results across ERP paradigms and
laboratories. In a more general context, this work illustrates
the promise of an interdisciplinary research program,
which combines data mining, neuroinformatics and ontology
engineering to address real-world problems.
This paper proposes a performance tools interface for
OpenMP, similar in spirit to the MPI profiling interface in its intent to
define a clear and portable API that makes OpenMP execution events visible
to runtime performance tools. We present our design using a source-level
instrumentation approach based on OpenMP directive rewriting. Rules to
instrument each directive and their combination are applied to generate
calls to the interface consistent with directive semantics and to pass
context information (e.g., source code locations) in a portable and
efficient way. Our proposed OpenMP performance API further allows user
functions and arbitrary code regions to be marked and performance
measurement to be controlled using new OpenMP directives.
To prototype the proposed OpenMP performance interface, we have developed
compatible performance libraries for the EXPERT automatic event
trace analyzer and the TAU performance analysis framework. The directive
instrumentation transformations we define are implemented in a
source-to-source translation tool called OPARI. Application examples are
presented for both EXPERT and TAU to show the OpenMP performance interface and
OPARI instrumentation tool in operation. When used together with the MPI
profiling interface (as the examples also demonstrate), our proposed
approach provides a portable and robust solution to performance analysis of
OpenMP and mixed-mode (OpenMP + MPI) applications.
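The directive-rewriting approach can be illustrated with a toy transformation: insert a call to the measurement interface before each `parallel` directive so the tool observes the fork, in the spirit of OPARI. The function name below is illustrative, not the actual interface, and real OPARI handles all directives, clauses, and context descriptors:

```python
import re

def instrument_parallel(source):
    """Rewrite '#pragma omp parallel' so a measurement library observes
    the fork. A toy, line-oriented sketch of directive rewriting."""
    out = []
    for line in source.splitlines():
        if re.match(r"\s*#pragma omp parallel\b", line):
            # Pass source context (file, line) to the tool portably.
            out.append("perf_fork_enter(__FILE__, __LINE__);")
            out.append(line)
        else:
            out.append(line)
    return "\n".join(out)

code = "#pragma omp parallel\n{\n  work();\n}"
inst = instrument_parallel(code)
assert inst.splitlines()[0] == "perf_fork_enter(__FILE__, __LINE__);"
assert "#pragma omp parallel" in inst
```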
Profiling and tracing tools can help make application parallelization
more effective and identify performance bottlenecks. Profiling
presents summary statistics of performance metrics while tracing
highlights the temporal aspect of performance variations, showing when
and where in the code performance is achieved. A complex challenge is
the mapping of performance data gathered during execution to
high-level parallel language constructs in the application source
code. Presenting performance data in a meaningful way to the user is
equally important. This paper presents a brief overview of profiling
and tracing tools in the context of Linux - the operating system most
commonly used to build clusters of workstations for high performance
computing.
We present a topology correction method for automatic reconstruction
of brain cortical surfaces. We take the volume-based
approach by first correcting the topology of the white matter volumes
followed by extracting the cortical surfaces. A multiscale method is taken
so that topology errors are gradually corrected with respect to the correction
cost. The special surface-likeness property of white matter and
gray matter is considered in evaluating the cost of topology correction.
Performance extrapolation is the process of evaluating the performance
of a parallel program in a target execution environment using
performance information obtained for the same program in a different
environment. Performance extrapolation techniques are suited for rapid
performance tuning of parallel programs, particularly when the target
environment is unavailable. This paper describes one such technique
that was developed for data-parallel C++ programs written in the pC++
language. In pC++, the programmer can distribute a collection of
objects to various processors and can have methods invoked on those
objects execute in parallel. Using performance extrapolation in the
development of pC++ applications allows tuning decisions to be made in
advance of detailed execution measurements. The pC++ language system
includes TAU, an integrated environment for analyzing and tuning the
performance of pC++ programs. This paper presents speedy, a new
addition to TAU, that predicts the performance of pC++ programs on
parallel machines using extrapolation techniques. Speedy applies the
existing instrumentation support of TAU to capture high-level event
traces of an n-thread pC++ program run on a uniprocessor machine
together with trace-driven simulation to predict the performance of
the program run on a target n-processor machine. We describe how
speedy works and how it is integrated into TAU. We also show how
speedy can be used to evaluate a pC++ program for a given target
environment.
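The trace-driven extrapolation can be sketched minimally: per-thread event durations recorded on the uniprocessor run are replayed onto n simulated processors, and the predicted parallel time is the longest simulated timeline. This is a deliberately simplified stand-in for speedy's simulation (it ignores communication and synchronization):

```python
def extrapolate(thread_traces, n_procs):
    """thread_traces: per-thread lists of event durations measured on a
    single processor. Predict execution time when threads are assigned
    round-robin to n_procs simulated processors."""
    clocks = [0.0] * n_procs
    for i, trace in enumerate(thread_traces):
        clocks[i % n_procs] += sum(trace)
    return max(clocks)

# Four threads whose work took 12 time units on one processor...
traces = [[1, 2], [3], [2, 1], [3]]
assert sum(sum(t) for t in traces) == 12
# ...are predicted to take 3 units on four processors (perfect overlap):
assert extrapolate(traces, 4) == 3
# and 6 units on two processors (two threads share each processor):
assert extrapolate(traces, 2) == 6
```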
Performance prediction methods and tools based on analytical models often fail
in forecasting the performance of real systems due to inappropriateness of
model assumptions, irregularities in the problem structure that cannot be
described within the modeling formalism, unstructured execution behavior that
leads to unforeseen system states, etc. Prediction accuracy and tractability are
acceptable for systems with deterministic operational characteristics, for static,
regularly structured problems, and non-changing environments.
Understanding the milliscale (temporal and spatial) dynamics of the human brain activity
requires high-resolution modeling of head electromagnetics and source localization of
EEG data. We have developed an automated environment to construct individualized
computational head models from image segmentation and to estimate conductivity
parameters using electrical impedance tomography methods. Algorithms incorporating
tissue inhomogeneity and impedance anisotropy in electromagnetics forward simulations
have been developed and parallelized. The paper reports on the application of the
environment in the processing of realistic head models, including conductivity inverse
estimation and lead field generation for use in EEG source analysis.
When implementing parallel programs for parallel computer systems, the
performance scalability of these programs should be tested and analyzed on
different computer configurations and problem sizes. Since a complete
scalability analysis is too time consuming and is limited to only existing systems,
extensions of modeling approaches can be considered for analyzing the
behavior of parallel programs under different problem and system scenarios. In
this paper, a method for automatic scalability analysis using modeling is
presented. Initially, we identify the important problems that arise when
attempting to apply modeling techniques to scalability analysis. Based on this
study, we define the Parallelization Description Language (PDL) that is used to
describe parallel execution attributes of a generic program workload. Based on
a parallelization description, stochastic models like graph models or Petri net
models can be automatically generated from a generic model to analyze
performance for scaled parallel systems as well as scaled input data. The
complexity of the graph models produced depends significantly on the type of
parallel computation described. We present several computation classes where
tractable graph models can be generated and then compare the results of these
automatically scaled models with their exact solutions using the PEPP modeling
tool.
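For the simplest tractable computation class, fork-join programs, the generated graph model can be illustrated directly: the predicted makespan is the serial work plus the busiest processor's share of the parallel branches, which can be checked against an exact solution. This is an illustrative sketch of scaling a model, not the PEPP formalism:

```python
def fork_join_makespan(serial_time, branch_times, n_procs):
    """Predicted execution time of a fork-join graph model: branches
    are list-scheduled longest-first onto processors; the makespan is
    the serial work plus the busiest processor's load."""
    clocks = [0.0] * n_procs
    for t in sorted(branch_times, reverse=True):
        clocks[clocks.index(min(clocks))] += t   # greedy list scheduling
    return serial_time + max(clocks)

# Scaling the system: 8 equal branches of 2 units plus 1 unit serial work.
branches = [2.0] * 8
assert fork_join_makespan(1.0, branches, 8) == 3.0   # fully parallel
assert fork_join_makespan(1.0, branches, 4) == 5.0   # two rounds of branches
assert fork_join_makespan(1.0, branches, 1) == 17.0  # matches exact serial time
```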
Tools to observe the performance of parallel programs typically employ profiling and tracing as the two main forms of event-based
measurement models. In both of these approaches, the volume of performance data generated and the corresponding perturbation encountered
in the program depend upon the amount of instrumentation in the program. To produce accurate performance data, tools need to control the
granularity of instrumentation. In this paper, we describe our experiences in the TAU performance system for improving the accuracy of
performance data by limiting the amount of instrumentation. A range of
options are provided to optimize instrumentation based on the structure
of the program, event generation rates, and historical performance data
gathered from prior executions.
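One such option, rate-based throttling, can be sketched: disable measurement for events that fire very frequently yet are individually cheap, since those contribute the most overhead and the least information. The thresholds below are illustrative defaults; TAU exposes similar runtime knobs:

```python
def select_throttled(profile, max_calls=100000, min_percall_us=10.0):
    """profile: {function_name: (num_calls, total_time_us)} from a
    prior run. Return functions whose instrumentation should be
    disabled: called very often but cheap per call, so measurement
    overhead would dominate the reported time."""
    throttled = set()
    for name, (calls, total_us) in profile.items():
        if calls > max_calls and total_us / calls < min_percall_us:
            throttled.add(name)
    return throttled

profile = {
    "main":       (1, 9_000_000.0),          # long-running: keep
    "solve_step": (500, 8_000_000.0),        # hot but expensive per call: keep
    "get_elem":   (2_000_000, 4_000_000.0),  # 2 us per call: throttle
}
assert select_throttled(profile) == {"get_elem"}
```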
Workload characterization is an important technique that
helps us understand the performance of parallel applications and the demands they place on the system. Each application run is profiled using
instrumentation at the MPI library level. Characterizing the performance
of the MPI library based on the sizes of messages helps us understand
how the performance of an application is affected based on messages
of different sizes. Partitioning of the time spent in MPI routines based
on the type of MPI operation and the message size involved requires a
two level mapping of performance data. This paper describes how performance mapping is implemented in the TAU performance system to
support workload characterization.
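The two-level mapping can be sketched as accumulation under (MPI operation, message-size bin) keys, with power-of-two bins so time in, say, MPI_Send of 2 KB messages is kept separate from 4 MB messages. This is a hedged illustration of the idea, not TAU's mapping API:

```python
from collections import defaultdict

def size_bin(nbytes):
    """Power-of-two message-size bin, e.g. 1500 bytes -> '1KB-2KB'."""
    lo = 1
    while lo * 2 <= max(nbytes, 1):
        lo *= 2
    def fmt(b):
        return f"{b // 1024}KB" if b >= 1024 else f"{b}B"
    return f"{fmt(lo)}-{fmt(lo * 2)}"

class MpiTimeMap:
    """First-level key: MPI routine; second-level key: size bin."""
    def __init__(self):
        self.time = defaultdict(int)   # microseconds
    def record(self, routine, nbytes, usecs):
        self.time[(routine, size_bin(nbytes))] += usecs

m = MpiTimeMap()
m.record("MPI_Send", 1500, 2000)
m.record("MPI_Send", 1800, 3000)
m.record("MPI_Send", 4 << 20, 500000)  # a 4 MB message lands in its own bin
assert m.time[("MPI_Send", "1KB-2KB")] == 5000
assert len({k for k in m.time if k[0] == "MPI_Send"}) == 2
```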
Performance evaluation tools play an important role in helping
understand application performance, diagnose performance problems
and guide tuning decisions on modern HPC systems. Tools to observe
parallel performance must evolve to keep pace with the ever-increasing
complexity of these systems. In this paper, we describe our experience in
building novel tools and techniques in the TAU Performance System®
to observe application performance effectively and efficiently at scale.
We describe extensions to TAU that contend with the large data volumes
associated with increasing core counts. These changes include new instrumentation
choices, efficient handling of disk I/O operations in the
measurement layer, and strategies for visualization of performance data
at scale in TAU’s analysis layer, among others. We also describe some
techniques that allow us to fully characterize the performance of applications
running on hundreds of thousands of cores.
Observing the performance of an application at runtime requires
economy in what performance data is measured and accessed, and
flexibility in changing the focus of performance interest. This paper
describes the performance callstack as an efficient performance view
of a running program which can be retrieved and controlled by external
analysis tools. The performance measurement support is provided by
the TAU profiling library whereas tool-program interaction support is
available through the DAQV framework. How these systems are merged to
provide dynamic performance callstack sampling is discussed.
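The performance callstack can be sketched as a stack of currently active timed routines that an external tool may snapshot at any moment, each entry reporting the time accumulated so far. This is an illustrative sketch of the view; the real system couples TAU's profiling data with DAQV's tool-program interaction:

```python
import time

class PerfCallstack:
    """Maintain the stack of active routines with entry timestamps;
    snapshot() is what an external analysis tool would retrieve."""
    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.stack = []                 # [(routine, entry_time), ...]

    def enter(self, routine):
        self.stack.append((routine, self.clock()))

    def exit(self):
        self.stack.pop()

    def snapshot(self):
        now = self.clock()
        return [(name, now - t0) for name, t0 in self.stack]

# Deterministic fake clock so the example is reproducible.
ticks = iter(range(100))
cs = PerfCallstack(clock=lambda: next(ticks))
cs.enter("main")        # clock reads 0
cs.enter("solve")       # clock reads 1
snap = cs.snapshot()    # clock reads 2
assert snap == [("main", 2), ("solve", 1)]
```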
Parallel performance tools offer insights into the execution behavior
of an application and are a valuable component in the cycle of
application development, deployment, and optimization. However, most
tools do not work well with large-scale parallel applications where
the performance data generated comes from upwards of thousands of
processes. As parallel computer systems increase in size, the scaling
of performance observation infrastructure becomes an important
concern. In this paper, we discuss the problem of scaling and
performance observation, and the ramifications of adding online
support. A general online performance system architecture is
presented. Recent work on the TAU performance system to enable
large-scale performance observation and analysis is discussed. The
paper concludes with plans for future work.
We have developed a distributed service architecture and an integrated parallel analysis engine
for scalable trace-based performance analysis. Our combined approach makes it possible to handle very
large performance data volumes in real time. Unlike traditional analysis tools that do their job
sequentially on an external desktop platform, our approach leaves the data at its origin and
seamlessly integrates the time consuming analysis as a parallel job into the high performance
production environment.
Parallel scientific applications are designed based on structural, logical, and numerical models
of computation and correctness. When studying the performance of these applications,
especially on large-scale parallel systems, there is a strong preference among developers to
view performance information with respect to their “mental model” of the application, formed
from the model semantics used in the program. If the developer can relate performance data
measured during execution to what they know about the application, more effective program
optimization may be achieved. This paper considers the concept of “phases” and its support in
parallel performance measurement and analysis as a means to bridge the gap between high-
level application semantics and low-level performance data. In particular, this problem is
studied in the context of parallel performance profiling. The implementation of phase-based
parallel profiling in the TAU parallel performance system is described and demonstrated for the
NAS parallel benchmarks and MFIX application.
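Phase-based profiling can be sketched as a second attribution key: instead of per-function times alone, the profiler accumulates under (phase, function) pairs, so the same routine is reported separately within each application phase. This is an illustrative sketch of the idea, not TAU's phase API:

```python
from collections import defaultdict

class PhaseProfiler:
    def __init__(self):
        self.time = defaultdict(float)  # (phase, function) -> seconds
        self.phase = "default"

    def set_phase(self, name):          # e.g. a solver stage or iteration
        self.phase = name

    def record(self, function, seconds):
        self.time[(self.phase, function)] += seconds

p = PhaseProfiler()
p.set_phase("setup");   p.record("exchange_halo", 0.5)
p.set_phase("iterate"); p.record("exchange_halo", 4.0)
p.record("exchange_halo", 4.0)
# The same routine is attributed separately under each phase:
assert p.time[("setup", "exchange_halo")] == 0.5
assert p.time[("iterate", "exchange_halo")] == 8.0
```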