ParaDucks Research Group Annotated Bibliography


Table of Contents

Conferences and Workshops
Journals
Theses and Dissertations
Technical Reports
Talks and Presentations
Other Publications

Conferences and Workshops

[acmonr91]
Allen D. Malony, Daniel A. Reed, "Models for Performance Perturbation Analysis," Workshop on Parallel and Distributed Debugging, Proc. 1991 ACM/ ONR workshop on Parallel and Distributed Debugging, pp. 15-25, 1991
Keywords: performance perturbation, performance measurement

When performance measurements are made of program operation actual execution behavior can be perturbed. In general, the degree of perturbation depends on the intrusiveness and frequency of the instrument ation. If the perturbation effects of the instrumentation cannot be quantified by a perturbation model (and subsequently removed during perturbation analysis), detailed performance measurements could be inaccurate. Developing models of time and event perturbations that can recover actual execution performance from perturbed performance measurements is the topic of this paper. Time-based models can accurately capture execution time perturbations for sequential computations and concurrent computations with simple fork-join behavior. However, the performance of parallel computations generally depends on the relative ordering of dependent events and the assignment of computational resources. Event-based models must be used to quantify instrumentation perturbation in parallel performance measurements. The measurement and subsequent analysis of synchronization operations (e.g., barrier, semaphore, and advance/await synchronization) can produce accurate approximations to actual performance behavior. Unfortunately, event-based models are limited in their ability to fully capture perturbation effects in nondeterministic executions.
[chasm_lacsi01]
C. Rasmussen, K. Lindlan, B. Mohr, J. Striegnitz, "CHASM: Static Analysis and Automatic Code Generation for Improved Fortran90 and C++ Interoperability," Proceedings of LACSI Symposium, 2001.
Keywords: CHASM, PDT, SILOON, F90, C++ interoperability

The relative simplicity and design of the Fortran 77 language allowed for reasonable interoperability with C and C++. Fortran 90, on the other hand, introduces several new and complex features to the language that severely degrade the ability of a mixed Fortran and C++ development environment. Major new items added to Fortran are user-defined types, pointers, and several new array features. Each of these items introduce difficulties because the Fortran 90 procedure calling convention was not designed with interoperability as an important design goal. For example, Fortran 90 arrays are passed by array descriptor, which is not specified by the language and therefore depends on a particular compiler implementation. This paper describes a set of software tools that parses Fortran 90 source code and produces mediating interface functions which allow access to Fortran 90 libraries from C++.
[cluster03]
H. Brunst, W. E. Nagel, and A. D. Malony, "A Distributed Performance Analysis Architecture for Clusters," In Proc. IEEE International Conference on Cluster Computing (Cluster 2003), IEEE Computer Society, pp. 73-83, Dec. 2003.
Keywords: Parallel Computing, Performance Analysis, Profiling, Tracing, Clusters, VNG, Vampir

The use of a cluster for distributed performance analysis of parallel trace data is discussed. We propose an analysis architecture that uses multiple cluster nodes as a server to execute analysis operations in parallel and communicate to remote clients where performance visualization and user interactions occur. The client-server system developed, VNG, is highly configurable and is shown to perform well for traces of large size, when compared to leading trace visualization systems.
[cluster06]
A. Nataraj, A. Malony, S. Shende, A. Morris, "Kernel-Level Measurement for Integrated Parallel Performance Views: the KTAU Project," In Proc. Cluster 2006, IEEE Computer Society, 2006.
Keywords: kernel mesurment, KTAU, TAU

The effect of the operating system on application performance is an increasingly important consideration in high performance computing. OS kernel measurement is key to understanding the performance influences and the interrelationship of system and user-level performance factors. The KTAU (Kernel TAU) methodology and Linux-based framework provides parallel kernel performance measurement from both a kernel-wide and process-centric perspective. The first characterizes overall aggregate kernel performance for the entire system. The second characterizes kernel performance when it runs in the context of a particular process. KTAU extends the TAU performance system with kernel-level monitoring, while leveraging TAU’s measurement and analysis capabilities. We explain the rational and motivations behind our approach, describe the KTAU design and implementation, and show working examples on multiple platforms demonstrating the versatility of KTAU in integrated system / application monitoring.oped. Minimally, such an approach will require OS kernel performance monitoring.
[compframe05]
N. Trebon, A. Morris, J. Ray, S. Shende, and A. Malony, "Performance Modeling of Component Assemblies with TAU," Proc. Workshop on Component Models and Frameworks in High Performance Computing (CompFrame 2005).
Keywords: CCA, TAU, CFRFS, Proxy components, performance modeling

The Common Component Architecture (CCA) is a component-based methodology for developing scientific simu- lation codes. This architecture consists of a framework which enables components, (embodiments of numerical algorithms and physical models) to work together. Components publish their interfaces and use interfaces published by others. Com- ponents publishing the same interface and with the same func- tionality (but perhaps implemented via a different algorithm or data structure) may be transparently substituted for each other in a code or a component assembly. Components are compiled into shared libraries and are loaded in, instantiated and composed into a useful code at runtime. Details regarding CCA can be found in [1], [2]. An analysis of the process of decomposing a legacy simulation code and re-synthesizing it as components can be found in [3], [4]. Actual scientific results obtained from this toolkit can be found in [5], [6].
[conpar94]
B. Mohr, D. Brown, A. Malony, TAU: A Portable Parallel Program Analysis Environment for pC++, Proceedings of CONPAR 94 - VAPP VI, University of Linz, Austria, LNCS 854, September 1994, pp. 29-40.
Keywords: TAU, tuning and analysis utilities, integrated tools, portable programanalysis, parallel object-oriented language, pC++, Sage++

The realization of parallel language systems that offer high-level programming paradigms to reduce the complexity of application development, scalable runtime mechanisms to support variable size problem sets, and portable compiler platforms to provide access to multiple parallel architectures, places additional demands on the tools for program development and analysis. The need for integration of these tools into a comprehensive programming environment is even more pronounced and will require more sophisticated use of the language system technology (i.e., compiler and runtime system). Furthermore, the environment requirements of high-level support for the programmer, large-scale applications, and portable access to diverse machines also apply to the program analysis tools.

In this paper, we discuss (TAU, Tuning and Analysis Utilities), a first prototype for an integrated and portable program analysis environment for pC++, a parallel object-oriented language system. TAU is integrated with the pC++ system in that it relies heavily on compiler and transformation tools (specifically, the Sage++ toolkit) for its implementation. This paper describes the design and functionality of TAU and shows its application in practice.

[cug06]
S. Shende, A. D. Malony, A. Morris, and P. Beckman, "Performance and Memory Evaluation Using TAU," In Proc. for Cray User's Group Conference (CUG 2006), 2006.
Keywords: TAU, PDT, Memory Headroom Analysis, MFIX

The TAU performance system is an integrated performance instrumentation, measurement, and analysis toolkit offering support for profiling and tracing modes of measurement. This paper introduces memory introspection capabilities of TAU featured on the Cray XT3 Catamount compute node kernel. TAU supports examining the memory headroom, or the amount of heap memory available, at routine entry, and correlates it to the program’s callstack as an atomic event.
[dapsys2k]
A. Malony and S. Shende, "Performance Technology for Complex Parallel and Distributed Systems," Proc. Third Austrian-Hungarian Workshop on Distributed and Parallel Systems, DAPSYS 2000, "Distributed and Parallel Systems: From Concepts to Applications," (Eds. G. Kotsis and P. Kacsuk)Kluwer, Norwell, MA, pp. 37-46, 2000.
Keywords: performance tools, complex systems, instrumentation, measurement, analysis, TAU

The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems will depend on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. The TAU system is offered as an example framework that meets these requirements. With a flexible, modular instrumentation and measurement system, and an open performance data and analysis environment, TAU can target a range of complex performance scenarios. Examples are given showing the diversity of TAU application.
[dodugc04]
Daniel M. Pressel, David Cronk, and Sameer Shende, "PENVELOPE: A New Approach to Rapidly Predicting the Performance of Computationally Intensive Scientific Applications on Parallel Computer Architectures," Proc. 2004 DOD Users Group Conference, Williamsburg, Virginia, IEEE Computer Society, pp. 314-318, 2004.
Keywords: PENVELOPE, Performance Prediction, TAU

A common complaint when dealing with the performance of computationally intensive scientific applications on parallel computers is that programs exist to predict the performance of radar systems, missiles and artillery shells, drugs, etc., but no one knows how to predict the performance of these applications on a parallel computer. Actually, that is not quite true. A more accurate statement is that no one knows how to predict the performance of these applications on a parallel computer in a reasonable amount of time. PENVELOPE is an attempt to remedy this situation. It is an extension to Amdahls Law/ Gustafsons work on scaled speedup that takes into account the cost of interprocessor communication and operating system overhead, yet is simple enough that it was implemented as an Excel spreadsheet.
[epvmmpi05.a]
S. Shende, A. D. Malony, A. Morris, and F. Wolf, "Performance Profiling Overhead Compensation for MPI Programs," in Proc. EuroPVM/MPI 2005 Conference, (eds. B. Di. Martino et. al.), LNCS 3666, Springer, pp. 359-367, 2005.
Keywords: TAU, Performance measurement and analysis, parallel computing, profiling, message passing, overhead compensation

Performance profiling of MPI programs generates overhead during execution that introduces error in profile measurements. It is possible to track and remove overhead online, but it is necessary to communicate execution delay be- tween processes to correctly adjust their interdependent timing. We demonstrate the first implementation of a onlne measurement overhead compensation system for profiling MPI programs. This is implemented in the TAU performance sys- tems. It requires novel techniques for delay communication in the use of MPI. The ability to reduce measurement error is demonstrated for problematic test cases and real applications.
[epvmmpi05.b]
S. Moore, F. Wolf, J. Dongarra, S. Shende, A. Malony, and B. Mohr, "A Scalable Approach to MPI Application Performance Analysis," in Proc. of EuroPVM/MPI 2005, (eds. B. Di Martino) LNCS 3666, Springer, pp. 309-316, 2005.
Keywords: TAU, scalability, Performance measurement and analysis, parallel computing, profiling, tracing, KOJAK

A scalable approach to performance analysis of MPI applications is presented that includes automated source code instrumentation, low overhead generation of profile and trace data, and database management of performance data. In addition, tools are described that analyze large-scale parallel profile and trace data. Analysis of trace data is done using an automated pattern-matching ap- proach. Examples of using the tools on large-scale MPI applications are presented.
[etpsc92]
B. Mohr, Standardization of Event Traces Considered Harmful or Is an Implementation of Object-Independent Event Trace Monitoring and Analysis Systems Possible?, Proceedings of the CNRS-NSF Workshop on Environments and Tools For Parallel Scientific Computing, St. Hilaire du Touvet, France, Elsevier, Advances in Parallel Computing, Vol. 6, September 1992, pp. 103-124.
Keywords: event trace, analysis tools, monitoring, objects, object-independentmonitoring, standardization of event trace formats, access interfaces

Programming non-sequential computer systems is hard! Many tools and environments have been designed and implemented to ease the use and programming of such systems. The majority of the analysis tools is event-based and uses event traces for representing the dynamic behavior of the system under investigation, the object system. Most tools can only be used for one special object system, or a specific class of systems such as distributed shared memory machines. This limitation is not obvious because all tools provide the same basic functionality.

This article discusses approaches to implementing object-independent event trace monitoring and analysis systems. The term object-independent means that the system can be used for the analysis of arbitrary (non-sequential) computer systems, operating systems, programming languages and applications. Three main topics are addressed: object-independent monitoring, standardization of event trace formats and access interfaces and the application-independent but problem-oriented implementation of analysis and visualization tools. Based on these approaches, the distributed hardware monitor system ZM4 and the SIMPLE event trace analysis environment were implemented, and have been used in many 'real-world' applications throughout the last three years. An overview of the projects in which the ZM4/SIMPLE tools were used is given in the last section.

[etpsc94]
Darryl I. Brown, Steven T. Hackstadt, Allen D. Malony, Bernd Mohr, Program Analysis Environments for Parallel Language Systems: The TAU Environment, Proc. of the Workshop on Environments and Tools For Parallel Scientific Computing, Townsend, TN, May 1994, pp. 162-171.
Keywords: parallel tool, pC++, integrated tool framework

In this paper, we discuss TAU (Tuning and Analysis Utilities), the first prototype of an integrated and portable program analysis environment for pC++, a parallel object-oriented language system. TAU is unique in that it was developed specifically for pC++ and relies heavily on pC++'s compiler and transformation tools (specifically, the Sage++ toolkit) for its implementation. This tight integration allows TAU to achieve a combination of portability, functionality, and usability not commonly found in high-level language environments. The paper describes the design and functionality of TAU, using a new tool for breakpoint-based program analysis as an example of TAU's capabilities
[etpsc96]
Janice Cuny, Robert Dunn, Steven T. Hackstadt, Christopher Harrop, Harold H. Hersey, Allen D. Malony, and Douglas Toomey, Building Domain-Specific Environments for Computational Science: A Case Study in Seismic Tomography, International Journal of Supercomputing Applications and High Performance Computing, Vol. 11, No. 3, Fall 1997. Also appearing in the Proceedings of the Workshop on Environments and Tools For Parallel Scientific Computing, Lyon, France, August 1996.
Keywords: computational science, domain-specific environments, seismic tomography, visualization, distributed data access

We report on our experiences in building a computational environment for tomographic image analysis for marine seismologists studying the structure and evolution of mid-ocean ridge volcanism. The computational environment is determined by an evolving set of requirements for this problem domain and includes needs for high-performance parallel computing, large data analysis, model visualization, and computation interaction and control. Although these needs are not unique in scientific computing, the integration of techniques for seismic tomography with tools for parallel computing and data analysis into a computational environment was (and continues to be) an interesting, important learning experience for researchers in both disciplines. For the geologists, the use of the environment led to fundamental geologic discoveries on the East Pacific Rise, the improvement of parallel ray tracing algorithms, and a better regard for the use of computational steering in aiding model convergence. The computer scientists received valuable feedback on the use of programming, analysis, and visualization tools in the environment. In particular, the tools for parallel program data query (DAQV) and visualization programming (Viz) were demonstrated to be highly adaptable to the problem domain. We discuss the requirements and the components of the environment in detail. Both accomplishments and limitations of our work are presented.
[europar03]
R. Bell, A. D. Malony, and S. Shende, "A Portable, Extensible, and Scalable Tool for Parallel Performance Profile Analysis", Proc. EUROPAR 2003 conference, LNCS 2790, Springer, Berlin, pp. 17-26, 2003.
Keywords: ParaProf, TAU, jracy, Profile Browser, scalable, visualization

This paper presents the design, implementation, and application of ParaProf, a portable, extensible, and scalable tool for parallel performance profile analysis. ParaProf attempts to offer ``best of breed'' capabilities to performance analysts -- those inherited from a rich history of single processor profilers and those being pioneered in parallel tools research. We present ParaProf as a parallel profile analysis framework that can be retargeted and extended as required. ParaProf's design and operation is discussed, and its novel support for large- scale parallel analysis demonstrated with a 512-processor application profile generated using the TAU performance system.
[europar04]
A. D. Malony, and S. S. Shende, "Overhead Compensation in Performance Profiling," Proc. Europar 2004 Conference, LNCS 3149, Springer, pp. 119-132, 2004.
Keywords: Performance measurement and analysis, parallel computing, profiling, intrusion, overhead compensation

Measurement-based profiling introduces intrusion in program execution. Intrusion effects can be mitigated by compensating for measurement overhead. Techniques for compensation analysis in performance profiling are presented and their implementation in the TAU performance system described. Experimental results on the NAS parallel benchmarks demonstrate that overhead compensation can be effective in improving the accuracy of performance profiling.
[europar05]
Allen D. Malony and Sameer Shende, "Models for On-the-Fly Compensation of Measurement Overhead in Parallel Performance Profilin," in Proc. EuroPar 2005 Conference, (eds. J. C. Cuna and Pd. Medeiros), LNCS 3648, Springer, pp. 72-82, 2005.
Keywords: Performance measurement and analysis, parallel computing, profiling, intrusion, overhead compensation, TAU

Performance profiling generates measurement overhead during parallel program execution. Measurement overhead, in turn, introduces intrusion in a program's runtime performance behavior. Intrusion can be mitigated by controlling instrumentation degree, allowing a tradeoff of accuracy for detail. Alternatively, the accuracy in profile results can be improved by reducing the intrusion error due to measurement overhead. Models for compensation of measurement overhead in parallel performance profiling are described. An approach based on rational reconstruction is used to understand properties of compensation solutions for different parallel scenarios. From this analysis, a general algorithm for on-the-fly overhead assessment and compensation is derived.
[europar07]
A. Nataraj, M. Sottile, A. Morris. A.D. Malony, S. Shende. "TAUoverSupermon: Low-overhead Online Parallel Performance Monitoring." Presented at EuroPar 2007.
Keywords: Online performance measurement, cluster monitoring, TAU, supermon

Online application performance monitoring allows tracking performance characteristics during execution as opposed to doing so post-mortem. This opens up several possibilities otherwise unavailable such as real-time visualization and application performance steering that can be useful in the context of long-running applications. As HPC sys- tems grow in size and complexity, the key challenge is to keep the online performance monitor scalable and low overhead while still providing a useful performance reporting capability. Two fundamental components that constitute such a performance monitor are the measurement and transport systems. We adapt and combine two existing, mature systems - TAU and Supermon - to address this problem. TAU performs the mea- surement while Supermon is used to collect the distributed measurement state. Our experiments show that this novel approach leads to very low- overhead application monitoring as well as other benefits unavailable from using a transport such as NFS.
[europar96]
Steven T. Hackstadt and Allen D. Malony, Distributed Array Query and Visualization for High Performance Fortran, Proc. of Euro-Par '96, Lyon, France, August 1996, pp. 55-63. Also available as University of Oregon, Department of Computer and Information Science, Technical Report CIS-TR-96-02, February 1996.
Keywords: visualization, distributed data access, hpf, parallel tool

This paper describes the design and implementation of the Distributed Array Query and Visualization (DAQV) system for High Performance Fortran, a project sponsored by the Parallel Tools Consortium. DAQV's implementation leverages the HPF language, compiler, and runtime system to address the general problem of providing high-level access to distributed data structures. DAQV supports a framework in which visualization and analysis clients connect to a distributed array server (i.e., the HPF application with DAQV control) for program-level access to array values. Implementing key components of DAQV in HPF itself has led to a robust and portable solution in which clients do not need to know how the data is distributed.
[europar99]
Matthew Sottile and Allen Malony, INTERLACE: An Interoperation and Linking Architecture for Computational Engines, Proceedings of EuroPar 99 Conference, LNCS 1685, Springer, Berlin, pp.135-138, 1999.
Keywords: heterogeneous computing, reusability, computational servers, client/server, distributed objects, MatLab

To aid in building high-performance computational environments, INTERLACE offers a framework for linking reusable computational engines in a heterogeneous distributed system. The INTERLACE model provides clients with access to computational servers which interface with "wrapped" computational engines. The wrappers implement mechanisms to translate client requests to engine actions and to move data across the server interface. These mechanisms are programmable, allowing engines of different type to be integrated. The framework takes advantage of the HPC++ runtime system to access servers through distributed object operations. The INTERLACE framework has been demonstrated by building a distributed computational environment with MatLab engines.
[europara06]
A. Nataraj, A. Malony, A. Morris, S. Shende, "Early Experiences with KTAU on the IBM BG/L," Proc. EUROPAR 2006 Conference, Springer, LNCS 4128, pp. 99-110, 2006.
Keywords: TAU, KTAU, IBM BG/L, kernel profiling, zeptoOS, performance analysis

The influences of OS and system-specific effects on applica- tion performance are increasingly important in high performance com- puting. In this regard, OS kernel measurement is necessary to under- stand the interrelationship of system and application behavior. This can be viewed from two perspectives: kernel-wide and process-centric. An integrated methodology and framework to observe both views in HPC systems using OS kernel measurement has remained elusive. We demon- strate a new tool called KTAU (Kernel TAU) that aims to provide paral- lel kernel performance measurement from both perspectives. KTAU ex- tends the TAU performance system with kernel-level monitoring, while leveraging TAU’s measurement and analysis capabilities. As part of the ZeptoOS scalable operating systems pro ject, we report early experiences using KTAU in ZeptoOS on the IBM BG/L system.
[europara06c]
L. Li, A. Malony, "Model-Based Performance Diagnosis of Master-Worker Parallel Computations," Euro-Par 2006 Parallel Processing Conference September 2006 (LNCS 4128). Pages 35-46.
Keywords: Performance diagnosis, parallel models, master-worker, measurement, analysis.

Parallel performance tuning naturally involves a diagnosis process to locate and explain sources of program inefficiency. Proposed is an approach that exploits parallel computation patterns (models) for diagnosis discovery. Knowledge of performance problems and inference rules for hypothesis search are engineered from model semantics and analysis expertise. In this manner, the performance diagnosis process can be automated as well as adapted for parallel model variations. We demonstrate the implementation of model-based performance diagnosis on the classic Master-Worker pattern. Our results suggest that pattern- based performance knowledge can provide effective guidance for locating and explaining performance bugs at a high level of program abstraction.
[europvm06]
K. Huck, A. Malony, S. Shende and A. Morris. "TAUg: Runtime Global Performance Data Access using MPI." EuroPVM/MPI Conference, LNCS 4192, pp. 313-321, Springer, September 2006.
Keywords: TAU, TAUg, global data acsess, performance monitoring, online performance adaption

To enable a scalable parallel application to view its global performance state, we designed and developed TAUg, a portable runtime framework layered on the TAU parallel performance system. TAUg leverages the MPI library to communicate between application processes, creating an abstraction of a global performance space from which profile views can be retrieved. We describe the TAUg design and implementation and show its use on two test benchmarks up to 512 processors. Overhead evaluation for the use of TAUg is included in our analysis. Future directions for improvement are discussed.
[ewomp01]
B. Mohr, A. D. Malony, S. Shende, and F. Wolf, "Towards a Performance Tool Interface for OpenMP: An Approach Based on Directive Rewriting," Proceedings of EWOMP'01 Third European Workshop on OpenMP, Sept. 2001.
Keywords: OpenMP, directive rewriting, instrumentation interface, TAU, EXPERT

In this article we propose a ``standard'' performance tool interface for OpenMP, similar in spirit to the MPI profiling interface in its intent to define a clear and portable API that makes OpenMP execution events visible to performance libraries. When used together with the MPI profiling interface, it also allows tools to be built for hybrid applications that mix shared and distributed memory programming. We describe an instrumentation approach based on OpenMP directive rewriting that generates calls to the interface and passes context information (e.g., source code locations) in a portable and efficient way. Our proposed OpenMP performance API further allows user functions and arbitrary code regions to be marked and performance measurement to be controlled using new proposed OpenMP directives. The directive transformations we define are implemented in a source-to-source translation tool called OPARI. We have used it to integrate the TAU performance analysis framework and the automatic event trace analyzer EXPERT with the proposed OpenMP performance interface. Together, these tools show that a portable and robust solution to performance analysis of OpenMP and hybrid applications is possible.
[ewomp02]
A Performance Monitoring Interface for OpenMP By Bernd Mohr , Allen D. Malony, Hans-Christian Hoppe, Frank Schlimbach, Grant Haab, Jay Hoeflinger, and Sanjiv Shah . Persented at EWOMP 2002
Keywords: OpenMP, Performance Monitoring, OMPI, POMP
[fmpc88]
Allen D. Malony, "Regular Processor Arrays," Proc., 2nd Symposium on the Frontiers of Massively Parallel Computation, IEEE, pp. 499-502, 1988.
Keywords: regularity, processor arrays, emulation, interconnection networks.

Regular is an often used term to suggest simple and unifrom structure of a parallel processor's organization or a parllel algorithm's operation. However, a strict definitiion is long overdue. In this paper, we define regularity for processor array structures in two dimensions and enumerate the eleven distinct regular topologies. Space and time emulation schemes among the regular processor arrays are constructured to compare their geometric and performance characteristics. The hexagonal array is shown to have the most efficient emulation capabilities.
[front95]
J. Kundu and J. E. Cuny, A Scalable, Visual Interface for Debugging with Event-Based Behavioral Abstraction, Frontiers of Massively Parallel Computing, 1995, pp. 472-479.
Keywords: visualization, event-based debugging, Ariadne
[hcca89]
A. D. Malony, D. A. Reed, J. W. Arendt, R. A. Aydt, D. Grabas, and B. K. Totty, "An Integrated Performance Data Collection Analysis, and Visualization System," Proc. Fourth Conferenceon Hypercube Concurrent Computers and Applications, Mar. 1989. Also appears as Technical Report UIUCDCS-R-89-1504, Center for Supercomputing Research and Development, U. of Ill., March 1989.
Keywords: Intel iPSC/2, Performance Analysis

The lack of tools to observe the operation and performance of message-based parallel architectures limits the user's ability to e ectively optimize application and system performance. Performance data collection, analysis, and visualization tools are needed to manage the complexity and quantity of performance data. Furthermore, these tools must be integrated with the machine hardware, the system software, and the applications support software if they are to nd pervasive use in program development and experimentation. In this paper, we describe an integrated performance environment being developed for the Intel iPSC/2 hypercube. The data collection components of the environment include software event tracing at the operating system and program levels plus a hardware-based performance monitoring system used to unobtrusively capture software events. A visualization system, based on the X window system, permits the performance analyst to browse and explore interesting data components by dynamically interconnecting new performance displays and data analysis tools.
[hipc95]
D. Brown, A. Malony, B. Mohr, Language-based Parallel Program Interaction: the Breezy Approach, Proceedings of the International Conference on High Performance Computing (HiPC'95), India, December 1995.
Keywords: runtime interaction, data-parallel program, language integration,language-based tools

This paper presents a general architecture for runtime interaction with a data-parallel program. We have applied this architecture in the development of the Breezy tool for the pC++ language. Breezy grants application programs convenient and efficient access to higher-level external services (e.g., databases, visualization systems, and distributed resources) and allows external access to the application's state (e.g., for program state display or computational steering). Although such support can be developed on an ad-hoc basis for each application, a general approach to the problem of parallel program interaction is preferred. A general approach makes tools more portable and retargetable to different language systems.

There are two main conclusions from this work. First, interaction support should be integrated with a language system facilitating an implementation of a model that is consistent with the language design. This aids application developers or the tool builders that require this interaction. Second, as the implementation of Breezy shows, the development of interaction support can leverage off the language itself as well as its compiler and runtime systems.

[hpcc05]
F. Wolf, A. D. Malony, S. Shende, and A. Morris, "Trace-Based Parallel performance Overhead Compensation," in Proc. of HPCC 2005 Conference, (eds. L. T. Yang, et. al.), LNCS 3726, Springer, pp. 617-628, 2005.
Keywords: KOJAK, Performance measurement, analysis, parallel computing, tracing, message passing, overhead compensation

Tracing parallel programs to observe their performance introduces intrusion as the result of trace measurement overhead. If post-mortem trace analysis does not compensate for the overhead, the intrusion will lead to errors in the performance results. We show that measurement overhead can be accounted for during trace analysis and intrusion modeled and removed. Algorithms developed in our earlier work are reimplemented in a more robust and modern tool, KOJAK, allowing them to be applied in large-scale parallel programs. The ability to reduce trace measurement error is demonstrated for a Monte-Carlo simulation based on a master/worker scheme. As an additional result, we visualize how local perturbation propagates across process boundaries and alters the behavioral char- acteristics of non-local processes.
[hpcc06]
W. Spear, A. Malony, A. Morris, S. Shende, "Integrating TAU with Eclipse: A Performance Analysis System in a Integrated Development Environment," High Performance Computing and Communications (HPCC) Conference. September 2006 (LNCS 4208). Pages 230-239.
Keywords: Eclipse, Integrated Development Environment, IDE, TAU

The Eclipse platform offers Integrated Development Environment support for a diverse and growing array of programming applications and languages. There is an increasing call for programming tools to support various development tasks from within Eclipse. This includes tools for testing and analyzing program performance. We describe the high-level synthesis of the Eclipse platform with the TAU parallel performance analysis system. By leveraging Eclipse's modularity and extensibility with TAU's robust automated performance analysis mechanisms we produce an integrated, GUI controlled performance analysis system for Java, C/C++ and High Performance Computing development within Eclipse.
[hpccbse04]
B. Norris, J. Ray, R. Armstrong, L. C. McInnes. D. E. Bernholdt. W. R. Elwasif, A. D. Malony and S. Shende, "Computational Quality of Service for Scientific Components," Proceedings of the International Symposium on Component-Based Software Engineering (CBSE7), Edinburgh, Scotland, LNCS 3054, Springer, pp. 264-271, May 2004. Also available as Argonne National Laboratory preprint ANL/MCS-P1131-0204.
Keywords: QoS, Components, CCA, TAU

Scientific computing on massively parallel computers presents unique challenges to component-based software engineering (CBSE). While CBSE is at least as enabling for scientific computing as it is for other arenas, the requirements are different. We briefly discuss how these requirements shape the Common Component Architecture, and we describe some recent research on quality-of-service issues to address the computational performance and accuracy of scientific simulations.
[hpcn99]
A. Malony, J. Skidmore, and M. Sottile. Computational Experiments using Distributed Tools in a Web-based Electronic Notebook Environment, Proceedings of HPCN Europe '99, LNCS 1593, Springer, Berlin, pp. 381 -390, April 1999.
Keywords:

Computational environments used by scientists should provide high-level support for scientific processes that involve the integrated and systematic use of familiar abstractions from a laboratory setting, including notebooks, instruments, experiments, and analysis tools. However, doing so while hiding the complexities of the underlying computational platform is a challenge. ViNE is a web-based electronic notebook that implements a high-level interface for applying computational tools in scientific experiments in a location- and platform-independent manner. Using ViNE, a scientist can specify data and tools, and construct experiments that apply them in well-defined procedures. ViNE's implementation of the experiment abstraction offers the scientist easy-to-understand framework for building scientific processes. This paper discusses how ViNE implements computational experiments in distributed, heterogeneous computing environments.
[hpdc98]
Steven T. Hackstadt, Christopher W. Harrop, and Allen D. Malony, A Framework for Interacting with Distributed Programs and Data, Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC7), Chicago, IL, July 28-31, 1998, pp. 206-214. Also available as University of Oregon, Department of Computer and Information Science, Technical Report CIS-TR-98-02, June 1998.
Keywords: parallel tools, distributed arrays, visualization, computational steering, model coupling, runtime interaction, data access, Fortran 90

The Distributed Array Query and Visualization (DAQV) project aims to develop systems and tools that facilitate interacting with distributed programs and data structures. Arrays distributed across the processes of a parallel or distributed application are made available to external clients via well-defined interfaces and protocols. Our design considers the broad issues of language targets, models of interaction, and abstractions for data access, while our implementation attempts to provide a general framework that can be adapted to a range of application scenarios. The paper describes the second generation of DAQV work and places it in the context of the more general distributed array access problem. Current applications and future work are also described.
[iccs03]
J. Dongarra, A. D. Malony, S. Moore, P. Mucci, and S. Shende, "Performance Instrumentation and Measurement for Terascale Systems," Proc. International Conference on Computational Science (ICCS 2003), LNCS 2660, Springer, Berlin, pp. 53-62, 2003.
Keywords: TAU, PAPI, Perfometer, instrumentation, measurement, performance analysis, terascale

As computer systems grow in size and complexity, tool support is needed to facilitate the efficient mapping of large-scale applications onto these systems. To help achieve this mapping, performance analysis tools must provide robust performance observation capabilities at all levels of the system, as well as map low-level behavior to high-level program constructs. Instrumentation and measurement strategies, developed over the last several years, must evolve together with performance analysis infrastructure to address the challenges of new scalable parallel systems.
[iccs03b]
Michael O. McCracken, Allan Snavely, Allen Malony, "Performance Modeling for Dynamic Algorithm Selection," Proc. International Conference on Computational Science (ICCS'03), LNCS 2660, Springer, Berlin, pp. 749-758, 2003.
Keywords: performance modeling, adaptive algorithms

Adaptive algorithms are an important technique to achieve portable high Performance. They choose among solution methods and optimizations according to expected performance on a particular machine. Grid environments make the adaptation problem harder, because the optimal decision may change across runs and even during runtime. Therefore, the performance model used by an adaptive algorithm must be able to change decisions without high overhead. In this paper, we present work that is modifying previous research into rapid performance modeling to support adaptive grid applications through sampling and high granularity modeling. We also outline preliminary results that show the ability to predict differences in performance among algorithms in the same program.
[iccs05]
Adnan Salman, Sergei Turovets, Allen Malony, Jeff Eriksen, and Don Tucker, "Computational Modeling of Human Head Conductivity." Presented at International Conference on Computational Science.
Keywords: Computational Modeling, alternating direction implicit, algorithm, ADI, OpenMP

The computational environment for estimation of unknown regional electrical conductivities of the human head, based on realistic geometry from seg- mented MRI up to 256 resolution, is described. A finite difference alternating di- rection implicit (ADI) algorithm, parallelized using OpenMP, is used to solve the forward problem describing the electrical field distribution throughout the head given known electrical sources. A simplex search in the multi-dimensional para- meter space of tissue conductivities is conducted in parallel using a distributed system of heterogeneous computational resources. The theoretical and computa- tional formulation of the problem is presented. Results from test studies are pro- vided, comparing retrieved conductivities to known solutions from simulation. Performance statistics are also given showing both the scaling of the forward problem and the performance dynamics of the distributed search.
[icpp05]
K. A. Huck, A. D. Malony, R. Bell, and A. Morris, "Design and Implementation of a Parallel Performance Data Management Framework," Proc. International Conference on Parallel Processing (ICPP 2005), IEEE Computer Society, 2005.
Keywords: TAU, PerfDMF, ParaProf, Performance Data Management Framework

Empirical performance evaluation of parallel systems and applications can generate significant amounts of performance data and analysis results from multiple experiments as performance is investigated and problems diagnosed. Hence, the management of performance information is a core component of performance analysis tools. To better support tool integration, portability, and reuse, there is a strong motivation to develop performance data management technology that can provide a common foundation for performance data storage, access, merging, and analysis. This paper presents the design and implementation of the Performance DataManagement Framework (PerfDMF). PerfDMF addresses objectives of performance tool integration, interoperation, and reuse by providing common data storage, access, and analysis infrastructure for parallel performance profiles. PerfDMF includes an extensible parallel profile data schema and relational database schema, a profile query and analysis programming interface, and an extendible toolkit for profile import/export and standard analysis. We describe the PerfDMF objectives and architecture, give detailed explanation of the major components, and show examples of PerfDMF application.
[icpp86]
Walid Abu-Sufah, Allen D. Malony, "Vector Processing on the Alliant FX/8 Multiprocessor," Proc. of ICPP 1986, pp. 559-566, 1986.
Keywords: Alliant, Vector processing, Cedar, performance measurement.Alliant, Vector processing, Cedar, performance measurement.Alliant, Vector processing, Cedar, performance measurement.Alliant, Vector processing, Cedar, performance measurement.Alliant, Vector processing, Cedar, performance measurement.

The Alliant FX/8 multiprocessor implements several high-speed computation ideas in software and hardware. Each of the 8 computational elements (CSs) has vector capabilities and multiprocessor support. Generally, the FX/8 delivers its highest processing rates when executing vector loops concurrently. In this paper, we present extensive empirical performance results for vector processing on the FX/8. The vector kernels of LANL BMK8a1 benchmark are used in the experiments.
[icpp87]
Allen D. Malony, Daniel A. Reed, Patrick J. McGuire,"MPF: A Portable Message Passing Facility for Shared Memory Multiprocessors," Proc. ICPP 1987: pp. 739-741, 1987.
Keywords: message passing, shared memory

A message passing facility (MPF) for shared memory multiprocessors is presented. MPF is based on a message passing model conceptually similar to conversations. The message passing primitives for this model are implemented as a portable library of C function calls. The performance of interprocess communication benchmark programs and two parallel applications are given.
[icpp95]
J. Kundu and J. E. Cuny, The Integration of Event- and State-Based Debugging in Ariadne, Proceedings of the International Conference on Parallel Processing (ICPP '95), August 1995, pp. II 130-134.
Keywords: event-based debugging, state-based debugging, Ariadne
[ics89]
Kyle Gallivan, William Jalby, Allen Malony, Harry Wijshoff, "Performance Prediction of Loop Constructs on Multiprocessor Hierarchical-Memory Systems," Proc. 3rd International Conference on Supercomputing (ICS'86), pp. 433-442, 1986.
Keywords: Supercomputers, Concurrent programming structures

In this paper we discuss the performance prediction of Fortran constructs commonly found in numerical scientific computing. Although the approach is applicable to multi-processors in general, within the scope of the paper we will concentrate on the Alliant FX/8 multiprocessor. The techniques proposed involve a combination of empirical observations, architectural models and analytical techniques, and exploits earlier work on data locality analysis and empirical characterization of the behavior of memory systems. The Lawrence Livermore Loops are used as a test-case to verify the approach.
[ics90]
Allen D. Malony, Daniel A. Reed, "A Hardware-based Performance Monitor for the Intel iPSC/2 Hypercube, Proc. 4th International Conference on Supercomputing (ICS'90), pp. 213-226, 1990.
Keywords: performance monitoring, Hypermon

The complexity of parallel computer systems makes a priori performance prediction difficult and experimental performance analysis crucial. A complete characterization of software and hardware dynamics, needed to understand the performance of high-performance parallel systems, requires execution time performance instrumentation. Although software recording of performance data suffices for low frequency events, capture of detailed, high-frequency performance data ultimately requires hardware support if the performance instrumentation is to remain efficient and unobtrusive. This paper describes the design of HYPERMON, a hardware system to capture and record software performance traces generated on the Intel iPSC/2 hypercube. HYPERMON represents a compromise between fully-passive hardware monitoring and software event tracing; software generated events are extracted from each node, timestamped, and externally recorded by HYPERMON. Using an instrumented version of the iPSC/2 operating system and several application programs, we present a performance analysis of an operational HYPERMON prototype and assess the limitations of the current design. Based on these results, we suggest design modifications that should permit capture of event traces from the coming generation of high-performance distributed memory parallel systems.
[ics99]
S. Vajracharya, S. Karmesin, P. Beckman, J. Crotinger, A. Malony, S. Shende, R. Oldehoeft, and S. Smith, "SMARTS: Exploiting Temporal Locality and Parallelism through Vertical Execution," Proceedings of ACM International Conference on Supercomputing (ICS '99), pp. 302-310, 1999.
Keywords: SMARTS, asynchronous, threads, profiling, tracing, vertical executiondata-parallelism, dependence-driven execution, runtime system, barriersynchronization, TAU

In the solution of large-scale numerical problems, parallel computing is becoming simultaneously more important and more difficult. The complex organization of today's multiprocessors with several memory hierarchies has forced the scientific programmer to make a choice between simple but unscalable code and scalable but extremely complex code that does not port to other architectures.

This paper describes how the SMARTS runtime system and the POOMA C++ class library for high-performance scientific computing work together to exploit data parallelism in scientific applications while hiding the details of managing parallelism and data locality from the user. We present innovative algorithms, based on the macro-dataflow model for detecting data parallelism and efficiently executing data-parallel statements on shared-memory multiprocessors. We also describe how these algorithms can be implemented on clusters of SMPs.

[ipdps03]
S. Shende, A. D. Malony, C. Rasmussen, M. Sottile, "A Performance Interface for Component-Based Applications," Proc. International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems, IPDPS'03, IEEE Computer Society, 278, 2003.
Keywords: TAU, Component Interface, SIDL, CCAFFEINE, PDT, Babel

This work targets the emerging use of software component technology for high-performance scientific parallel and distributed computing. While component software engineering will benefit the construction of complex science applications, its use presents several challenges to performance optimization. A component application is composed of a set of components, thus, application performance depends on the interaction (possibly non-linear) of the component set. Furthermore, a component is a ``binary unit of composition'' and the only information users have is the interface the component provides to the outside world. An interface for component performance measurement and query is presented to address optimization issues. We describe the performance component design and an example demonstrating its use for runtime performance tuning.
[ipdps04]
J. Ray, N. Trebon, R. C. Armstrong, S. Shende, and A. Malony, "Performance Measurement and Modeling of Component Applications in a High Performance Computing Environment: A Case Study," Proc. 18th International Parallel and Distributed Processing Symposium (IPDPS'04), IEEE Computer Society, 2004.
Keywords: CCA, Performance modeling, CFRFS Combustion, TAU

We present a case study of performance measurement and modeling of a CCA (Common Component Architecture) component-based application in a high performance computing environment. Component-based HPC applications allow the possibility of creating component-level performance models and synthesizing them into application performance models. However, they impose the restriction that performance measurement/monitoring needs to be done in a non-intrusive manner and at a fairly coarse-grained level. We propose a performance measurement infrastructure for HPC based loosely on recent work done for Grid environments. A prototypical implementation of the infrastructure is used to collect data for three components in a scientific application and construct their performance models. Both computational and message-passing performance are addressed.
[ipps94]
A. Malony, B. Mohr, P. Beckman, D. Gannon, S. Yang, F. Bodin, Performance Analysis of pC++: A Portable Data-Parallel Programming System for Scalable Parallel Computers, Proceedings of the 8th International Parallel Processing Symbosium (IPPS), Cancún, Mexico, April 1994, pp. 75-85.
Keywords: parallel C++, portability, scalability, SPMD, runtime system,concurrency and communication primitives, performance

pC++ is a language extension to C++ designed to allow programmers to compose distributed data structures with parallel execution semantics. These data structures are organized as ``concurrent aggregate'' collection classes which can be aligned and distributed over the memory hierarchy of a parallel machine in a manner consistent with the High Performance Fortran Forum (HPF) directives for Fortran 90. pC++ allows the user to write portable and efficient code which will run on a wide range of scalable parallel computers.

In this paper, we discuss the performance analysis of the pC++ programming system. We describe the performance tools developed and include scalability measurements for four benchmark programs: a "nearest neighbor" grid computation, a fast Poisson solver, and the "Embar" and "Sparse" codes from the NAS suite. In addition to speedup numbers, we present a detailed analysis highlighting performance issues at the language, runtime system, and target system levels.

[ipps95]
B. Helm. A. D. Malony, S. P. Fickas, "Capturing and Automating Performance Diagnosis: the Poirot approach," Proc. 9th International Parallel Processing Symposium (IPPS'95), pp.
Keywords: software performance evaluation, program debugging, diagnostic expert systems, performance diagnosis, Poirot, parallel programming, diagnosis methods, performance tools, performance debugging, knowledge-based diagnosis, software engineering

Performance diagnosis, the process of finding and explaining performance problems, is an important part of parallel programming. Effective performance diagnosis requires that the programmer plan an appropriate method, and manage the experiments required by that method. This paper presents Poirot, an architecture to support performance diagnosis. It explains how the architecture helps automatically, adaptably plan and manage the diagnosis process. The paper evaluates the generality and practicality of Poirot, by reconstructing diagnosis methods found in several published performance tools.
[iscope99]
T. Sheehan, A. Malony, S. Shende, "A Runtime Monitoring Framework for the TAU Profiling System", Proceedings of the Third International Symposium on Computing in Object-Oriented Parallel Environments (ISCOPE'99), LNCS 1732, Springer, Berlin, pp. 170-181, December 1999.
Keywords: monitor, runtime data access, performance monitoring,parallel execution, performance tools, runtime interaction, Java,TAU, multi-threaded

Applications executing on complex computational systems provide a challenge for the development of runtime performance monitoring software. We discuss a computational model, application monitoring, data access models, and profiler functionality. We define data consistency within and across threads as well as across contexts and nodes. We describe the TAU runtime monitoring framework which enables on-demand, low-interference data access to TAU profile data and provides the flexibility to enforce data consistency at the thread, context or node level. We present an example of a Java-based runtime performance monitor utilizing the framework.
[ishpc02]
J. D. de St. Germain, A. Morris, S. G. Parker, A. D. Malony, and S. Shende, "Integrating Performance Analysis in the Uintah Software Development Cycle," Proceedings of the ISHPC'02 conference, LNCS 2327,Springer, Berlin, pp. 190-206, 2002.
Keywords: Uintah, TAU, MPI, SCIRun, XPARE

Technology for empirical performance evaluation of parallel programs is driven by the increasing complexity of high performance computing environments and programming methodologies. This paper describes the integration of the TAU and XPARE tools in the Uintah computational framework. Performance mapping techniques in TAU relate low-level performance data to higher levels of abstraction. XPARE is used for specifying regression testing benchmarks that are evaluated with each periodically scheduled testing trial. This provides a historical panorama of the evolution of application performance. The paper concludes with a scalability study that shows the benefits of integrating performance technology in the development of large-scale parallel applications.
[ishpc03]
Holger Brunst, Allen D. Malony, Sameer S. Shende, and Robert Bell, "Online Remote Trace Analysis of Parallel Applications on High-Performance Clusters", Proceedings of ISHPC'03 Conference, LNCS 2858, Springer, Berlin, pp. 440-449,2003.
Keywords: Parallel Computing, Performance Analysis, Performance Steering, Tracing, Parallel Computing, Performance Analysis, Performance Steering, Tracing, VNG, TAU

The paper presents the design and development of an online remote trace measurement and analysis system. The work combines the strengths of the TAU performance system with that of the VNG distributed parallel trace analyzer. Issues associated with online tracing are discussed and the problems encountered in system implementation are analyzed in detail. Our approach should port well to parallel platforms. Future work includes testing the performance of the system on large-scale machines.
[istspie95]
Steven T. Hackstadt and Allen D. Malony, Case Study: Applying Scientific Visualization to Parallel Performance Visualization, Proc. of the IST&T/SPIE symposium on Electronic Imaging: Science and Technology, Conference on Visual Data Exploration and Analysis, San Jose, CA, February 1995, pp. 238-247.
Keywords: parallel performance visualization, case study, data explorer, scientific visualization

The complexity of parallel programs make them more difficult to analyze for correctness and efficiency, in part because of the interactions between multiple processors and the volume of data that can be generated. Visualization often helps the programmer in these tasks. This paper focuses on the development of a new technique for constructing, evaluating, and modifying sophisticated, application-specific visualizations for parallel programs and performance data. While most existing tools offer predetermined sets of simple, two-dimensional graphical displays, this environment gives users a high degree of control over visualization development and use, including access to three-dimensional graphics, which remain relatively unexplored in this context.

We have developed an environment that uses the IBM Visualization Data Explorer system to allow new visualizations to be prototyped rapidly, often taking only a few hours to construct totally new views of parallel performance trace data. Yet, access to a robust library of sophisticated graphical techniques is preserved. The burdensome task of explicitly programming the visualizations is completely avoided, and the iterative design, evaluation, and modification of new displays is greatly facilitated.

[iwomp05]
Adnan Salman , Sergei Turovets, Allen Malony, and Vasily Volkov, "Multi-Cluster, Mixed-Mode Computational Modeling of Human Head Conductivity." Presented at IWOMP 2005
Keywords: MPI, OpenMPI, Multi-Cluster, Computational Modeling

A multi-cluster computational environment with mixed-mode (MPI + OpenMP) parallelism for estimation of unknown regional electrical conductiv- ities of the human head, based on realistic geometry from segmented MRI up to 256 voxels resolution, is described. A finite difference multi-component al- ternating direction implicit (ADI) algorithm, parallelized using OpenMP, is used to solve the forward problem calculation describing the electrical field distribu- tion throughout the head given known electrical sources. A simplex search in the multi-dimensional parameter space of tissue conductivities is conducted in par- allel across a distributed system of heterogeneous computational resources. The theoretical and computational formulation of the problem is presented. Results from test studies based on the synthetic data are provided, comparing retrieved conductivities to known solutions from simulation. Performance statistics are also given showing both the scaling of the forward problem and the performance dy- namics of the distributed search.
[iwomp06]
A. Morris, A. D. Malony, S. Shende, "Supporting Nested OpenMP Parallelism in the TAU Performance System," (to appear) Proceedings of the IWOMP 2006 Conference, Springer, LNCS, 2007.
Keywords: TAU, OpenMP, Nested Parallelism

Nested OpenMP parallelism allows an application to spawn teams of nested threads. This hierarchical nature of thread creation and usage poses problems for performance measurement tools that must determine thread context to properly maintain per-thread performance data. In this paper we describe the problem and a novel solution for identifying threads uniquely. Our approach has been implemented in the TAU performance system and has been successfully used in profiling and tracing OpenMP applications with nested parallelism. We also describe how extensions to the OpenMP standard can help tool developers uniquely identify threads.
[javagrande01]
S. Shende and A. D. Malony, "Integration and Application of the TAU Performance System in Parallel Java Environments," Proceedings of the Joint ACM Java Grande - ISCOPE 2001 Conference, pp. 87-96, June 2001.
Keywords: TAU, Profiling, Tracing, Java, MPI, JVMPI, Instrumentation, Measurement

Parallel Java environments present challenging problems for performance tools because of Javas rich language system and its multi-level execution platform combined with the integration of native-code application libraries and parallel runtime software. In addition to the desire to provide robust performance measurement and analysis capabilities for the Java language itself, the coupling of different software execution contexts under a uniform performance model needs careful consideration of how events of interest are observed and how cross-context parallel execution information is linked. This paper relates our experience in extending the TAU performance system to a parallel Java environment based on mpiJava. We describe the complexities of the instrumentation model used, how performance measurements are made, and the overhead incurred. A parallel Java application simulating the game of Life is used to show the performance systems capabilities.
[javaics2k]
S. Shende, and A. D. Malony, "Performance Tools for Parallel Java Environments," Proc. Second Workshop on Java for High Performance Computing, ICS 2000, May 2000.
Keywords: parallel, mpiJava, TAU, performance profiling, tracing, MPI

Parallel Java environments present challenging problems for performance tools because of Java's rich language system and its multi-level execution platform combined with the integration of native-code application libraries and parallel runtime software. In addition to the desire to provide robust performance measurement and analysis capabilities for the Java language itself, the coupling of different software execution contexts under a uniform performance model needs careful consideration of how events of interest are observed and how cross-context parallel execution information is linked. This paper relates our experience in extending the TAU performance system to a parallel Java environment based on mpiJava. We describe the instrumentation model used, how performance measurements are made, and the overhead incurred. A parallel Java application simulating the game of life is used to show the performance system's capabilities.
[lacsi01]
B. Mohr, A. D. Malony, S. Shende, and F. Wolf, "Design and Prototype of a Performance Tool Interface for OpenMP," Proceedings of the LACSI Symposium, 2001.
Keywords: OpenMP, API, POMP, TAU, EXPERT, Performance Tool Interface

This paper proposes a performance tools interface for OpenMP, similar in spirit to the MPI profiling interface in its intent to define a clear and portable API that makes OpenMP execution events visible to runtime performance tools. We present our design using a source-level instrumentation approach based on OpenMP directive rewriting. Rules to instrument each directive and their combination are applied to generate calls to the interface consistent with directive semantics and to pass context information (e.g., source code locations) in a portable and efficient way. Our proposed OpenMP performance API further allows user functions and arbitrary code regions to be marked and performance measurement to be controlled using new OpenMP directives. To prototype the proposed OpenMP performance interface, we have developed compatible performance libraries for the EXPERT automatic event trace analyzer and the TAU performance analysis framework. The directive instrumentation transformations we define are implemented in a source-to-source translation tool called OPARI. Application examples are presented for both EXPERT and TAU to show the OpenMP performance interface and OPARI instrumentation tool in operation. When used together with the MPI profiling interface (as the examples also demonstrate), our proposed approach provides a portable and robust solution to performance analysis of OpenMP and mixed-mode (OpenMP + MPI) applications.
[linux99]
S. Shende, Profiling and Tracing in Linux, Proceedings of the Extreme Linux Workshop #2, USENIX, Monterey CA, June 1999.
Keywords: profiling, performance, tracing, linux, clusters, TAU

Profiling and tracing tools can help make application parallelization more effective and identify performance bottlenecks. Profiling presents summary statistics of performance metrics while tracing highlights the temporal aspect of performance variations, showing when and where in the code performance is achieved. A complex challenge is the mapping of performance data gathered during execution to high-level parallel language constructs in the application source code. Presenting performance data in a meaningful way to the user is equally important. This paper presents a brief overview of profiling and tracing tools in the context of Linux - the operating system most commonly used to build clusters of workstations for high performance computing.
[mmb95]
K. Shanmugam, A. Malony, B. Mohr, Speedy: An Integrated Performance Extrapolation Tool for pC++ Programs, Proceedings of the Joint Conference PERFORMANCE TOOLS '95 and MMB '95, September 1995, Heidelberg, Germany.
Keywords: performance prediction, extrapolation, object-parallel programming,trace-driven simulation, performance debugging tools, modeling

Performance extrapolation is the process of evaluating the performance of a parallel program in a target execution environment using performance information obtained for the same program in a different environment. Performance extrapolation techniques are suited for rapid performance tuning of parallel programs, particularly when the target environment is unavailable. This paper describes one such technique that was developed for data-parallel C++ programs written in the pC++ language. In pC++, the programmer can distribute a collection of objects to various processors and can have methods invoked on those objects execute in parallel. Using performance extrapolation in the development of pC++ applications allows tuning decisions to be made in advance of detailed execution measurements. The pC++ language system includes TAU, an integrated environment for analyzing and tuning the performance of pC++ programs. This paper presents speedy, a new addition to TAU, that predicts the performance of pC++ programs on parallel machines using extrapolation techniques. Speedy applies the existing instrumentation support of TAU to capture high-level event traces of a n-thread pC++ program run on a uniprocessor machine together with trace-driven simulation to predict the performance of the program run on a target n-processor machine. We describe how speedy works and how it is integrated into TAU. We also show how speedy can be used to evaluate a pC++ program for a given target environment.
[mmb95a]
Alois Ferscha and Allen D. Malony, "Performance-Oriented Development of Irregular, Unstructured and Unbalanced Parallel Applications in the N-MAP Environment, " Proc. 8th GI/ITG Conference on Measuring, Modeling and Evaluating Computing and Communication Systems, MMB '95, LNCS 977, Springer, Berlin, pp. 340-356, 1995.
Keywords: Performance Prediction, Parallel Programming, Task Level Parallelism, Irregular Problems, Parallel Simulation, Time Warp, CM-5, Cluster Computing, N-MAP

Performance prediction methods and tools based on analytical models often fail in forecasting the performance of real systems due to inappropriateness of model assumptions, irregularities in the problem structure that cannot be described within the modeling formalism, unstructured execution behavior that leads to unforeseen system states, etc. Prediction accuracy and tractability is acceptable for systems with deterministic operational characteristics, for static, regularly structured problems, and non-changing environments.
[mttcpe94]
Allen D. Malony, Vassilis Mertsiotakis, Andreas Quick, "Automatic Scalability Analysis of Parallel Programs Based on Modeling Techniques," In G. Haring and G. Kotsis, editors, Proc. 7th International Conference on Modeling Techniques and Tools for Computer Performance Evaluation, LNCS, Springer, 1994.
Keywords: scalability analysis, performance modeling, PDL, PEPP

When implementing parallel programs for parallel computer systems the performance scalability of these programs should be tested and analyzed on different computer configurations and problem sizes. Since a complete scalability analysis is too time consuming and is limited to only existing systems, extensions of modeling approaches can be considered for analyzing the behavior of parallel programs under different problem and system scenarios. In this paper, a method for automatic scalability analysis using modeling is presented. Initially, we identify the important problems that arise when attempting to apply modeling techniques to scalability analysis. Based on this study, we define the Parallelization Description Language (PDL) that is used to describe parallel execution attributes of a generic program workload. Based on a parallelization description, stochastic models like graph models or Petri net models can be automatically generated from a generic model to analyze performance for scaled parallel systems as well as scaled input data. The complexity of the graph models produced depends significantly on the type of parallel computation described. We present several computation classes where tractable graph models can be generated and then compare the results of these automatically scaled models with their exact solutions using the PEPP modeling tool.
[para06a]
S. Shende, A. Malony, A. Morris, "Optimization of Instrumentation in Parallel Performance Evaluation Tools," in Proc. PARA 2006 Conference, Springer, LNCS, 2006.
Keywords: Instrument optimization, selective instrumentation, measurement, Performance measurement and analysis, parallel computing

Tools to observe the performance of parallel programs typically employ profiling and tracing as the two main forms of event-based measurement models. In both of these approaches, the volume of performance data generated and the corresponding perturbation encountered in the program depend upon the amount of instrumentation in the program. To produce accurate performance data, tools need to control the granularity of instrumentation. In this paper, we describe our experiences in the TAU performance system for improving the accuracy of performance data by limiting the amount of instrumentation. A range of options are provided to optimize instrumentation based on the structure of the program, event generation rates, and historical performance data gathered from prior executions.
[para06b]
S. Shende, A. Malony, A. Morris, "Workload Characterization using the TAU Performance System," in Proc. of PARA 2006 Conference, Springer, LNCS, 2006.
Keywords: Performance mapping, measurement, instrumentation, performance evaluation, workload characterization

Workload characterization is an important technique that helps us understand the performance of parallel applications and the de-mands they place on the system. Each application run is profiled using instrumentation at the MPI library level. Characterizing the performance of the MPI library based on the sizes of messages helps us understand how the performance of an application is affected based on messages of different sizes. Partitioning of the time spent in MPI routines based on the type of MPI operation and the message size involved requires a two level mapping of performance data. This paper describes how performance mapping is implemented in the TAU performance system to support workload characterization.
[para98]
Sameer Shende, Steven T. Hackstadt, and Allen D. Malony, "Dynamic Performance Callstack Sampling: Merging TAU and DAQV-II," Proceedings of the Fourth International Workshop on Applied Parallel Computing (PARA98), June 14-17, 1998, LNCS 1541, Springer, Berlin, pp. 515-520, 1998.
Keywords: monitoring, performance, callstack, sampling, profiling, TAU, DAQV, parallel execution, performance tools, runtime interaction, C++

Observing the performance of an application at runtime requires economy in what performance data is measured and accessed, and flexibility in changing the focus of performance interest. This paper describes the performance callstack as an efficient performance view of a running program which can be retrieved and controlled by external analysis tools. The performance measurement support is provided by the TAU profiling library whereas tool-program interaction support is available through the DAQV framework. How these systems are merged to provide dynamic performance callstack sampling is discussed.
[parco03]
A. D. Malony, S. Shende, and R. Bell, "Online Performance Observation of Large-Scale Parallel Applications," Proc. Parco 2003 Symposium, in "Parallel Computing: Software Technology, Algorithms, Architectures and Applications," (Eds. G. R. Joubert, W. E. Nagel, F. J. Peters, and W. V. Walter), Advances in Parallel Computing, Vol. 13, Elsevier B.V., pp. 761 -768, 2004.
Keywords: Paraprof, Parvis, TAU, performance analysis, large-scale, parallel computing

Parallel performance tools offer insights into the execution behavior of an application and are a valuable component in the cycle of application development, deployment, and optimization. However, most tools do not work well with large-scale parallel applications where the performance data generated comes from upwards of thousands of processes. As parallel computer systems increase in size, the scaling of performance observation infrastructure becomes an important concern. In this paper, we discuss the problem of scaling and perfomance observation, and the ramifications of adding online support. A general online performance system architecture is presented. Recent work on the TAU performance system to enable large-scale performance observation and analysis is discussed. The paper concludes with plans for future work.
[parco03b]
H. Brunst, W. Nagel, "Scalable Performance Analysis of Parallel Systems: Concepts and Experiences," Proc. PARCO 2003 Conference, in (J. Joubert, W. Nagel, F. Peters, W. Walter eds.), Parallel Computing: Software Technology, Algorithms, Architectures and Applications, Advances in Parallel Computing 13 Elsevier 2004, pp. 737-744, 2004.
Keywords: Parallel Computing, Performance Analysis, Tracing, Profiling, Clusters, VampirServer, VNG, OTF

We have developed a distributed service architecture and an integrated parallel analysis engine for scalable trace based performance analysis. Our combined approach permits to handle very large performance data volumes in real-time. Unlike traditional analysis tools that do their job sequentially on an external desktop platform, our approach leaves the data at its origin and seamlessly integrates the time consuming analysis as a parallel job into the high performance production environment.
[parco05]
A. D. Malony, S. Shende, and A. Morris, "Phase-Based Parallel Performance Profiling," (to appear) Proc. of PARCO 2005 conference.
Keywords: TAU, Performance measurement and analysis, parallel computing, profiling, phases

Parallel scientific applications are designed based on structural, logical, and numerical models of computation and correctness. When studying the performance of these applications, especially on large-scale parallel systems, there is a strong preference among developers to view performance information with respect to their “mental model” of the application, formed from the model semantics used in the program. If the developer can relate performance data measured during execution to what they know about the application, more effective program optimization may be achieved. This paper considers the concept of “phases” and its support in parallel performance measurement and analysis as a means to bridge the gap between high- level application semantics and low-level performance data. In particular, this problem is studied in the context of parallel performance profiling. The implementation of phase-based parallel profiling in the TAU parallel performance system is described and demonstrated for the NAS parallel benchmarks and MFIX application.
[parle94]
Steven T. Hackstadt and Allen D. Malony, Next-Generation Parallel Performance Visualization: A Prototyping Environment for Visualization Development, Proc. of the Parallel Architectures and Languages Europe (PARLE) Conference, Athens, Greece, July 1994, pp. 192-201. Also available as University of Oregon, Department of Computer and Information Science, Technical Report CIS-TR-93-23, October 1993.
Keywords: parallel performance visualization, scientific visualization, visualization prototyping

A new design process for the development of parallel performance visualizations that uses existing scientific data visualization software is presented. Scientific visualization tools are designed to handle large quantities of multi-dimensional data and create complex, three-dimensional, customizable displays which incorporate advanced rendering techniques, animation, and display interaction. Using a design process that leverages these tools to prototype new performance visualizations can lead to drastic reductions in the graphics and data manipulation programming overhead currently experienced by performance visualization developers. The process evolves from a formal methodology that relates performance abstractions to visual representations. Under this formalism, it is possible to describe performance visualizations as mappings from performance objects to view objects, independent of any graphical programming. Implementing this formalism in an existing data visualization system leads to a visualization prototype design process consisting of two components corresponding to the two high-level abstractions of the formalism: a trace transformation (i.e., performance abstraction) and a graphical transformation (i.e., visual abstraction). The trace transformation changes raw trace data to a format readable by the visualization software, and the graphical transformation specifies the graphical characteristics of the visualization. This prototyping environment also facilitates iterative design and evaluation of new and existing displays. Our work examines how an existing data visualization tool, IBM's Data Explorer in particular, can provide a robust prototyping environment for next-generation parallel performance visualization.
[pcfd05]
S. Shende, A. Malony, A. Morris, S. Parker, J. Davison de St. Germain, "Performance Evaluation of Adaptive Scientific Applications using TAU," Parallel Computational Fluid Dynamics: Theory and Applications, Proceedings of the 2005 International Conference on Parallel Computational Fluid Dynamics, May 24-27.
Keywords: TAU, Performance Evaluation, CFD, Uintah Computational Framework, Phases

Fueled by increasing processor speeds and high speed interconnection networks, advances in high performance computer architectures have allowed the development of increasingly complex large scale parallel systems. For computational scientists, programming these systems efficiently is a challenging task. Understanding the performance of their parallel applications is equally daunting. To observe and comprehend the performance of parallel applications that run on these systems, we need performance evaluation tools that can map the performance abstractions to the user's mental models of application execution. For instance, most parallel scientific applications are iterative in nature. In the case of CFD applications, they may also dynamically adapt to changes in the simulation model. A performance measurement and analysis system that can differentiate the phases of each iteration and characterize performance changes as the application adapts will enable developers to better relate performance to their application behavior. In this paper, we present new performance measurement techniques to meet these needs. In section 2, we describe our parallel performance system, TAU. Section 3 discusses how new TAU profiling techniques can be applied to CFD applications with iterative and adaptive characteristics. In section 4, we present a case study featuring the Uintah computational framework and explain how adaptive computational fluid dynamics simulations are observed using TAU. Finally, we conclude with a discussion of how the TAU performance system can be
[pdpta01]
S. Shende, A. D. Malony, R. Ansell-Bell, "Instrumentation and Measurement Strategies for Flexible and Portable Empirical Performance Evaluation," Proceedings Tools and Techniques for Performance Evaluation Workshop, PDPTA'01, CSREA, Vol. 3, pp. 1150-1156, June 2001.
Keywords: TAU, instrumentation, measurement, DyninstAPI, MPI

Flexibility and portability are important concerns for productive empirical performance evaluation. We claim that these features are best supported by robust instrumentation and measurement strategies, and their integration. Using the TAU performance system as an exemplar performance toolkit, a case study in performance evaluation is considered. Our goal is both to highlight flexibility and portability requirements and to consider how instrumentation and measurement techniques can address them. The main contribution of the paper is methodological, in its advocation of a guiding principle for tool development and enhancement. Recent advancements in the TAU system are described from this perspective.
[pmvpmvps93]
Allen D. Malony, Gregory V. Wilson, "Future directions in parallel performance environments", Proceedings of the workshop on performance measurement and visualization on Performance measurement and visualization of parallel systems, Elsevier Science Publishers, B.V., Amsterdam, pp. 331-351, 1993.
Keywords: processor architectures, parallel programming, performance measurement

The increasing complexity of parallel computing systems has brought about a crisis in parallel performance evaluation and tuning. Although there have been important advances in performance tools in recent years, we believe that future parallel performance environments will move beyond these tools by integrating performance instrumentation with compilers for architecture-independent languages, by formalizing the relationship between performance views and the data they represent, and by automating some aspects of performance interpretation. This paper describes these directions from the perspective of research projects that have been recently undertaken.
[ppopp91]
Allen D. Malony, "Event-based Performance Perturbation: a Case Study," Proc. third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP'91), pp. 201-212, 1991.
Keywords: performance perturbation, performance measurement, event-based analysis, uncertainty principle

Determining the performance behavior of parallel computations requires some form of intrusive tracing measurement. The greater the need for detailed performance data, the more intrusion the measurement will cause. Recovering actual execution performance jfrom perturbed performance measurements using eventbased perturbation analysis is the topic of this paper. We show that the measurement and subsequent analysis of synchronization operations (particularly, advance and await) can produce, in practice, accurate approximations to actual performance behavior. We use as testcases three Lawrence Livermore loops that execute as parallel DOACROSS loops on an Alliant FX/80. The results of our experiments suggest that a systematic application of performance perturbation analysis techniques will allow more detailed, accurate instrumentation than traditionally believed possible.
[ppopp93]
Sekhar R. Sarukkai, Allen D. Malony, "Perturbation Analysis of High Level Instru mentation for SPMD Programs," Proc. fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'93), pp. 44-53, 1993.
Keywords: perturbation analysis, performance measurement

The process of instrumenting a program to study its behavior can lead to perturbations in the program's execution. These perturbations can become severe for large parallel systems or problem sizes, even when one captures only high level events. In this paper, we address the important issue of eliminating execution perturbations caused by high-level instrumentation of SPMD programs. We will describe perturbation analysis techniques for common computation and communication measurements, and show examples which demonstrate the effectiveness of these techniques in practice.
[psc94]
A. Malony, B. Mohr, P. Beckman, D. Gannon, Program Analysis and Tuning Tools for a Parallel Object Oriented Language: An Experiment with the TAU System, Proceedings of the Workshop on Parallel Scientific Computing, Cape Cod, MA, October 1994.
Keywords:
[sc00]
K. A. Lindlan, J. Cuny, A. D. Malony, S. Shende, B. Mohr, R. Rivenburgh, C. Rasmussen. "A Tool Framework for Static and Dynamic Analysis of Object-Oriented Software with Templates." Proceedings of SC2000: High Performance Networking and Computing Conference, Dallas, November 2000.
Keywords: Program Database Toolkit, PDT, static analysis, dynamic analysis, object-oriented, templates, IL Analyzer, DUCTAPE, TAU, SILOON

The developers of high-performance scientific applications often work in complex computing environments that place heavy demands on program analysis tools. The developers need tools that interoperate, are portable across machine architectures, and provide source-level feedback. In this paper, we describe a tool framework, the Program Database Toolkit (PDT), that supports the development of program analysis tools meeting these requirements. PDT uses compile-time information to create a complete database of high-level program information that is structured for well-defined and uniform access by tools and applications. PDT's current applications make heavy use of advanced features of C++, in particular, templates. We describe the toolkit, focussing on its most important contribution -- its handling of templates -- as well as its use in existing applications.
[sc05]
K. A. Huck, and A. D. Malony, "PerfExplorer: A Performance Data Mining Framework for Large- Scale Parallel Computing," in Proc. of SC 2005 Conference, ACM, 2005.
Keywords: performance data mining, PerfExplorer, TAU, PerfDMF, R, Weka

Parallel applications running on high-end computer systems manifest a complexity of performance phenomena. Tools to observe parallel performance attempt to capture these phenomena in measurement datasets rich with information relating multiple performance metrics to execution dynam- ics and parameters specific to the application-system exper- iment. However, the potential size of datasets and the need to assimilate results from multiple experiments makes it a daunting challenge to not only process the information, but discover and understand performance insights. In this pa- per, we present PerfExplorer, a framework for parallel per- formance data mining and knowledge discovery. The frame- work architecture enables the development and integration of data mining operations that will be applied to large-scale parallel performance profiles. PerfExplorer operates as a client-server system and is built on a robust parallel per- formance database (PerfDMF) to access the parallel profiles and save its analysis results. Examples are given demon- strating these techniques for performance analysis of ASCI applications.
[sc05b]
Knowledge Engineering for Model-based Parallel Performance Diagnosis (Poster) By Li Li and Allen D. Malony Computer and Information Science Department, University of Oregon, Eugene, OR
Keywords: performance knowledge, performance diagosis
[sc07]
A. Nataraj, A. Morris, A. Malony, M. Sottile, P. Beckman, "The Ghost in the Machine: Observing the Effects of Kernel Operation on Parallel Application Performance." Supercomputing Conference 2007.
Keywords: kernel, operating system noise, interference, integrated measurement, KTAU, TAU, compensation, tracing

The performance of a parallel application on a scalable HPC system is determined by user-level execution of the application code and system-level (OS kernel) operations. To understand the influences of system-level factors on application performance, the measurement of OS kernel activities is key. We describe a technology to observe kernel actions and make this information available to application-level performance measurement tools. The benefits of merged application and OS performance information and its use in parallel performance analysis are demonstrated, both for profiling and tracing methodologies. In particular, we focus on the problem of kernel noise assessment as a stress test of the approach. We show new results for characterizing noise and introduce new techniques for evaluating noise interference and its effects on application execution. Our kernel measurement and noise analysis technologies are being developed as part of Linux OS environments for scalable parallel systems.
[sc2001]
H. Truong, T. Fahringer, G. Madsen, A. Malony, H. Moritsch, and S. Shende, "On Using SCALEA for Performance Analysis of Distributed and Parallel Programs," Proceedings of SC'2001 conference, Nov. 2001.
Keywords: Performance analysis, performance overhead classification, distributed and parallel systems, SCALEA, TAU, OpenMP

In this paper we give an overview of SCALEA, which is a new performance analysis tool for OpenMP, MPI, HPF, and mixed parallel/distributed programs. SCALEA instruments, executes and measures programs and computes a variety of performance overheads based on a novel overhead classification. Source code and HW-profiling is combined in a single system which significantly extends the scope of possible overheads that can be measured and examined, ranging from HW-counters, such as the number of cache misses or floating point operations, to more complex performance metrics, such as control or loss of parallelism. Moreover, SCALEA uses a new representation of code regions, called the dynamic code region call graph, which enables detailed overheads analysis for arbitrary code regions. An instrumentation description file is used to releate performance information to code regions of the input program and to reduce instrumentation overhead. Several experiments with realistic codes that cover MPI, OpenMP, HPF and mixed OpenMP/MPI codes demonstrate the usefulness of SCALEA.
[sc2003.poster]
Sophia Lefantzi, Jaideep Ray, and Sameer Shende, "Strong Scalability Analysis and Performance Evaluation of a SAMR CCA-based Reacting Flow Code," Poster, SC2003 Conference, Nov. 2003.
Keywords: CCA, Performance modeling, CFRFS Combustion, TAU

Simulations on structured adaptively refined meshes (SAMR) pose unique problems in the context of performance evaluation and modeling. Adaptively refined meshes aim to concentrate grid points in regions of interest while leaving the bulk of the domain sparsely tessellated. Structured adaptively refined meshes achieve this by having overlaid grids of different refinement. Numerical algorithms employing explicit multi-rated time- stepping methods apply a computational "kernel" to the finer meshes at a higher frequency than at the coarser meshes. Each application of the kernel at a given level of refinement is followed up by a communication step where data is exchanged with neighboring subdomains. The SAMR approach is adaptive, i.e. its characteristics change as the simulation evolves in time. Thus, scalability depends on the number of processors and the time-integrated effect of the physics of the problem. The time-integrated effect renders the estimation of a general metric of scalability difficult and often impossible. Generally, as reported in the literature, for realistic problems and configurations, SAMR simulations do not scale well. For this work we analyzed two different hydrodynamic problems and present how communication costs scale with various aspects of the domain decomposition. Approach: The codes that we analyzed solve PDEs to simulate reactive flows and flows with shock waves. The codes were run until the incremental decrease in run times (with increasing processors) approached zero. It was found that the nature of the problem changed vastly during the run - even runs which showed poor scaling had periods of evolution where the domain decomposition showed "good" scaling characteristics, i.e compute loads were higher than communication loads. The computational load was found to be evenly balanced across the processors - the lack of scalability was due to the dominance of communication and synchronization costs over computational costs. We identified and analyzed phases in the evolution of the problem where the simulation exhibited good and bad scaling. Communication costs were analyzed with respect to the levels of refinement of the grid as well as the data-exchange radius for each of the runs. This is a thorough performance analysis of SAMR hydrodynamics codes, performed for the first time in CCA-compliant codes, tackling the time-dependent nature of the communication overheads. Both the codes that we analyzed employ the Common Component Architecture (CCA) paradigm and were run within the CCAFFEINE framework. The adaptive mesh package used (that performs the bulk of the communications) was GrACE (Rutgers, The State University of New Jersey). The measurements were performed using the CCA version of TAU (Tuning and Analysis Utilities). The tests were performed on "platinum" at NCSA (University of Illinois, Urbana Champaign), a Linux cluster of dual-node Pentium III 1 GHz processors, connected via a Myrinet interconnect. Visual: As a part of the visual presentation, we will present a color poster with our performance analysis results and hold a demonstration of the composition and execution of CCA codes. Animations of the adaptively refined grid will also be shown.
[sc90]
Sanjay Sharma, Allen D. Malony, Michael W. Berry, Priyamvada Sinvhal- Sharma, "Run-time monitoring of concurrent programs on the Cedar multiprocessor ," Proc. 1990 conference on Supercomputing, pp. 784-793, 1990
Keywords: Cedar, Tracing, processor architectures

The ability to understand the behavior of concurrent programs depends greatly on the facilities available to monitor execution and present the results to the user. Beyond the basic profiling tools that collect data for post-mortem viewing, explorative use of multiprocessor computer systems demands a dynamic monitoring environment capable of providing run-time access to program performance. A prototype of such an environment has been built for the Cedar multiprocessor. This paper describes the design of the infrastructure enabling run-time monitoring of parallel Cedar applications and the communication of execution data among physically distributed machines. An application for matrix visualization is used to highlight important aspects of the system.
[sc90b]
Allen D. Malony, John L. Larson, Daniel A. Reed, "Tracing Application Program Execution on the Cray X-MP and Cray 2," Proc. of the 1990 conference on Supercomputing, pp. 60-73, 1990.
Keywords: Tracing

Important insights into program operation can be gained by observing dynamic execution behavior. Unfortunately, many high-performance machines provide execution profile summaries as the only tool for performance investigation. We have developed a tracing library for the Cray X-MP and Cray 2 supercomputers that supports the low-overhead capture of execution events for sequential and multitasked programs. This library has been extended to use the automatic instrumentation facilities on these machines, allowing trace data from routine entry and exit, and other program segments, to be captured. To assess the utility of the trace-based tools, three of the Perfect Benchmark codes have been tested in scalar and vector modes with the tracing instrumentation. In addition to computing summary execution statistics from the traces, interesting execution dynamics appear when studying the trace histories. It is also possible to compare codes across the two architectures by correlating the event traces. Our conclusion is that adding tracing support in Cray supercomputers can have significant returns in improved performance characterization and evaluation.
[sc92]
Allen D. Malony, "Supercomputing Around the World," Proc. 1992 ACM/IEEE Conference on Supercomputing (mini symposium), pp. 126-129, 1992.
Keywords: Supercomputing

Supercomputing is rapidly becoming a global phenomenon. In keeping with the “Voyages of Discovery” theme of the Supercomputing ’92 conference, representatives of supercompuiing endeavors from around the wor!d meet in this mini-symposium to speak on national and international supercomputing activities.
[sc93]
F. Bodin, P. Beckman, D. Gannon, S. Yang, S. Kesavan, A. Malony, B. Mohr, Implementing a Parallel C++ Runtime System for Scalable Parallel Systems, Proceedings of the 1993 Supercomputing Conference, Portland, Oregon, November 1993, pp. 588-597.
Keywords: parallel C++, portability, scalability, SPMD, runtime system,concurrency and communication primitives, performance

pC++ is a language extension to C++ designed to allow programmers to compose "concurrent aggregate" collection classes which can be aligned and distributed over the memory hierarchy of a parallel machine in a manner modeled on the High Performance Fortran Forum (HPFF) directives for Fortran 90. pC++ allows the user to write portable and efficient code which will run on a wide range of scalable parallel computer systems. The first version of the compiler is a preprocessor which generates Single Program Multiple Data (SPMD) C++ code. Currently, it runs on the Thinking Machines CM-5, the Intel Paragon, the BBN TC2000, the Kendall Square Research KSR-1, and the Sequent Symmetry. In this paper we describe the implementation of the runtime system, which provides the concurrency and communication primitives between objects in a distributed collection. To illustrate the behavior of the runtime system we include a description and performance results on four benchmark programs.
[sc98a]
Christopher W. Harrop, Steven T. Hackstadt, Janice E. Cuny, Allen D. Malony, and Laura S. Magde, Supporting Runtime Tool Interaction for Parallel Simulations, Proceedings of Supercomputing '98 (SC98), Orlando, FL, November 7-13, 1998 (Best Student Paper Finalist).
Keywords: runtime interaction, computational steering, matlab

Scientists from many disciplines now routinely use modeling and simulation techniques to study physical and biological phenomena. Advances in high-performance architectures and networking have made it possible to build complex simulations with parallel and distributed interacting components. Unfortunately, the software needed to support such complex simulations has lagged behind hardware developments. We focus here on one aspect of such support: runtime program interaction. We have developed a runtime interaction framework and we have implemented a specific instance of it for an application in seismic tomography. That instance, called TierraLab, extends the geoscientists' existing (legacy) tomography code with runtime interaction capabilities which they access through a MATLAB interface. The scientist can stop a program, retrieve data, analyze and visualize that data with existing MATLAB routines, modify the data, and resume execution. They can do this all within a familiar MATLAB-like environment without having to be concerned with any of the low-level details of parallel or distributed data distribution. Data distribution is handled transparently by the Distributed Array Query and Visualization (DAQV) system. Our framework allows scientists to construct and maintain their own customized runtime interaction system.

[sc98b]
Jenifer L. Skidmore, Matthew J. Sottile, Janice E. Cuny, and Allen D. Malony, A Prototype Notebook-Based Environment for Computational Tools, Proceedings of Supercomputing '98, Orlando, FL, November 1998.
Keywords: electronic notebook, distributed computing, computational science,heterogeneous, tools, World Wide Web, collaboration

The Virtual Notebook Environment (ViNE) is a platform-independent, web-based interface designed to support a range of scientific activities across distributed, heterogeneous computing platforms. ViNE provides scientists with a web-based version of the common paper-based lab notebook, but in addition, it provides support for collaboration and management of computational experiments. Collaboration is supported with the web-based approach, which makes notebook material generally accessible and with a hierarchy of security mechanisms that screen that access. ViNE provides uniform, system-transparent access to data, tools, and programs throughout the scientist's computing infrastructure. Computational experiments can be launched from ViNE using a visual specification language. The scientist is freed from concerns about inter-tool connectivity, data distribution, or data management details. ViNE also provides support for dynamically linking analysis results back into the notebook content.

In this paper we present the ViNE system architecture and a case study of its use in neuropsychology research at the University of Oregon. Our case study with the Brain Electrophysiology Laboratory (BEL) addresses their need for data security and management, collaborative support, and distributed analysis processes. The current version of ViNE is a prototype system being tested with this and other scientific applications.

[scidac05]
P. Worley, J. Candy, L. Carrington, K. Huck, T. Kaiser, G. Mahinthakumar, A. Malony, S. Moore, D. Reed, P. Roth, H. Shan, S. Shende, A. Snavely, S. Sreepathi, F. Wolf, and Y. Zhang, "Performance Analysis of GYRO: A Tool Evaluation," Poster, Scientific Discovery through Advanced Computing Conference, (SciDAC 2005), 2005.
Keywords: Performance evaluation, PERC, TAU, SvPablo, Kojak, HPM, PMaC

The performance of the Eulerian gyrokinetic-Maxwell solver code GYRO is analyzed on five high performance computing systems. First, a manual approach is taken, using custom scripts to analyze the output of embedded wallclock timers, floating point operation counts collected using hardware performance counters, and traces of user and communication events collected using the profiling interface to Message Passing Interface (MPI) libraries. Parts of the analysis are then repeated or extended using a number of sophisticated performance analysis tools: IPM, KOJAK, SvPablo, TAU, and the PMaC modeling tool suite. The paper briefly discusses what has been discovered via this manual analysis process, what performance analyses are inconvenient or infeasible to attempt manually, and to what extent the tools show promise in accelerating or significantly extending the manual performance analyses.
[shpcc94]
Steven T. Hackstadt, Allen D. Malony, and Bernd Mohr, Scalable Performance Visualization for Data-Parallel Programs, Proc. of the Scalable High Performance Computing Conference (SHPCC), Knoxville, TN, May 1994, pp. 342-349. Also available as University of Oregon, Department of Computer and Information Science, Technical Report CIS-TR-94-09, March 1994.
Keywords: scalable performance visualization, scientific visualization, pC++, data-parallel programming

Developing robust techniques for visualizing the performance behavior of parallel programs that can scale in problem size and/or number of processors remains a challenge. In this paper, we present several performance visualization techniques based on the context of data-parallel programming and execution that demonstrate good visual scalability properties. These techniques are a result of utilizing the structural and distribution semantics of data-parallel programs as well as sophisticated three-dimensional graphics. A categorization and examples of scalable performance visualizations are given for programs written in Dataparallel C and pC++.
[sigmetrics87]
Daniel A. Reed, Allen D. Malony, Bradley D. McCredie, "Parallel Discrete Event Simulation: a Shared Memory Approach," Proc. 1987 ACM SIGMETRICS conference on Measurement and Modeling of Computer Systems, 15(1), pp. 36- 38, 1987.
Keywords: discrete event simulation

The inherently sequential nature of event list manipulation limits the potential parallelism of standard simulation models. Although techniques for performing event list manipulation and event simulation in parallel have been suggested, large scale performance increases seem unlikely. Only by eliminating the event list, in its traditional form, can additional parallelism be obtained; this is the goal of distributed simulation. Several distributed simulation techniques have been proposed. In the remainder of this abstract, we present the Chandy-Misra distributed simulation algorithm and the results of an extensive study of its performance on a shared memory parallel processor when simulating queueing network models.
[sigmetrics95]
Allen D. Malony, "Data Interpretation and Experiment Planning in Performance Tools," Joint International Conference on Measurement and Modeling of Computer Systems, Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and Modeling of Computer Systems, pp. 62-63, 1995.
Keywords: performance measurement

The parallel scientific computing community is placing increasing emphasis on portability and scalability of programs, languages, and architectures. This creates new challenges for developers of parallel performance analysis tools, who will have to deal with increasing volumes of performance data drawn from diverse platforms. One way to meet this challenge is to incorporate sophisticated facilities for data interpretation and experiment planning within the tools themselves, giving them increased flexibility and autonomy in gathering and selecting performance data. This panel discussion brings together four research groups that have made advances in this direction.
[spdt96]
S. Shende, J. Cuny, L. Hansen, J. Kundu, S. McLaughry and O. Wolf, Event and State-Based Debugging in TAU: A Prototype, Proceedings of ACM SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT '96), May, 1996, pp. 21-30.
Keywords: event-based debugging, state-based debugging, pC++, TAU, Ariadne

Parallel programs are complex and often require a multilevel debugging strategy that combines both event- and state-based debugging. We report here on preliminary work that combines these approaches within the TAU program analysis environment for pC++. This work extends the use of event-based modeling to object-parallel languages, provides an alternative mechanism for establishing meaningful global breakpoints in object-oriented languages, introduces the TAU program interaction and control infrastructure, and provides an environment for the assessment of mixed event- and state-based strategies.
[spdt98a]
S. Shende, A. D. Malony, J. Cuny, K. Lindlan, P. Beckman and S. Karmesin, Portable Profiling and Tracing for Parallel Scientific Applications using C++, Proceedings of ACM SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT '98), August, 1998, pp. 134-145.
Keywords: performance, profiling, tracing, C++, parallel, TAU

Performance measurement of parallel, object-oriented (OO) programs requires the development of instrumentation and analysis techniques beyond those used for more traditional languages. Performance events must be redefined for the conceptual OO programming model, and those events must be instrumented and tracked in the context of OO language abstractions, compilation methods, and run-time execution dynamics. In this paper, we focus on the profiling and tracing of C++ applications that have been written using a rich parallel programming framework for high-performance, scientific computing. We address issues of class-based profiling, instrumentation of templates, runtime function identification, and polymorphic (type-based) profiling. Our solutions are implemented in the TAU portable profiling package which also provides support for profiling groups and user-level timers. We demonstrate TAU's C++ profiling capabilities for real parallel applications, built from components of the ACTS toolkit. Future directions include work on runtime performance data access, dynamic instrumentation, and higher-level performance data analysis and visualization that relates object semantics with performance execution behavior.
[spdt98b]
K. Lindlan, A. Malony, J. Cuny, S. Shende, and P. Beckman, An IL Converter and Program Database for Analysis Tools, Proceedings of ACM SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT '98), August 1998, pp. 153.
Keywords: IL Analyzer, PDT, static analysis

Developers of static and dynamic analysis tools for C++ programs need access to information on functions, classes, templates, and macros in parsed C++ code. Existing tools, such as the EDG display tool, provide that access, but in an unsuitable format. We built a converter that prunes and reorganizes the information into the appropriate format. The converter provides the information needed for our TAU (Tuning and Analysis Utilities) tools and, in more general terms, provides C++ developers considerable opportunities for automating software development.
[vis92]
J. E. Cuny, A. Hough, and J. Kundu. Logical Time in Visualizations Produced by Parallel Programs, Proceedings of Visualization '92, 1992, pp. 186-193.
Keywords: parallel, visualization, logical time

Visualization tools that display data as it is manipulated by a parallel, MIMD computation must contend with the effects of asynchronous execution. We have developed techniques that manipulate logical time in order to produce coherent animations of parallel program behavior despite the presence of asynchrony. Our techniques ``interpret'' program behavior in light of user-defined abstractions and generate animations based on a logical rather than a physical view of time. If this interpretation succeeds, the resulting animation is easily understood; if it fails, the programmer can be assured that the failure was not an artifact of the visualization. Here we demonstrate that these techniques can be generally applied to enhance visualizations of a variety of types of data as it is produced by parallel, MIMD computations.

Journals

[cca_cpe04]
A. Malony, S. Shende, N. Trebon, J. Ray, R. Armstrong, C. Rasmussen, and M. Sottile, "Performance Technology for Parallel and Distributed Component Software," Concurrency and Computation: Practice and Experience, Vol. 17, Issue 2-4, pp. 117-141, John Wiley & Sons, Ltd., Feb - Apr, 2005.
Keywords: component software, performance, parallel, distributed, optimization, CCA, TAU

This work targets the emerging use of software component technology for high-performance scientific parallel and distributed computing. While component software engineering will benefit the construction of complex science