Portable Runtime Systems (PORTS) Group Meeting, Apr 9, 1994

Ian Foster, Moderator
Report by Dennis Gannon

Introduction

This is the report of the third meeting of the PORTS group. The first meeting took place at a round table (literally) at Supercomputing 93. The second meeting was held in Boulder, Colorado and was hosted by Dirk Grunwald. This meeting was hosted by Ian Foster of Argonne, who also presided as moderator. In attendance were Peter Beckman (Indiana University), Hans Zima (University of Vienna), Alok Choudhary (Syracuse University), Rajive Bagrodia (UCLA), Carl Kesselman (Caltech), Mathew Haines (ICASE), Bernd Mohr (University of Oregon), Neel Sundaresan (Indiana University), Dennis Gannon (Indiana University), Steve Tuecke (Argonne) and Brian Toonen (Argonne).

The first action of the group was to propose a charter for PORTS. The following three items summarize the goals of the endeavor:

To design and build inter-operable task parallel programming systems. Initially, this effort will be organized as a study group that will focus on the specification of a common runtime system for task parallelism. Such a runtime system would be at the level of a compiler target.
The direction of this effort includes the integration of task and data parallel programming. It is not the intention of this group to focus on real-time, embedded systems or fault tolerance.
To identify opportunities for code sharing among projects.

The construction of a common runtime system to be used as a compiler target for task parallel programming is one example of code sharing. However, it was observed that the full runtime environment of any programming language consists of several levels. The lowest of these provides the basic mechanisms by which programs interact with the hardware and operating system. Higher levels of the runtime system implement increasingly higher levels of the programming language semantics. It may be the case that languages that share certain semantic ideas will be able to exploit common runtime structures at higher levels. While there was some feeling that a full multi-layered approach to the runtime system design was desirable, it was decided that the group should focus first on the foundation layer for task parallel computation.

The remainder of the meeting focused on three major items. First was a discussion of the status of several current projects. That was followed by a lengthy discussion of the desired attributes of the basic runtime layer. The end of the day was left to formulate action items.

Project Reports

Dennis Gannon gave a brief report on the pC++ project. This system currently generates SPMD code for the Intel Paragon, CM-5, BBN TC2000, KSR-1, Sequent and SP-1, as well as for PVM networks of workstations. The current runtime system is single threaded on each CPU, but it now supports remote service requests. A new version of the language and compiler is under design that will be based on active (threaded) global objects. Compatibility with CC++, Fortran-M, HPF and (the new) multi-threaded Vienna Fortran is considered a major goal. Consequently, the outcome of the Ports project is very important to this group.

Carl Kesselman reported that Nexus, the runtime system for both CC++ and Fortran-M, is now working in a number of environments. These include Solaris threads linking SPARC-10 systems over TCP/IP, SunOS, IBM RS6000 systems and DCE, and Paragon OSF/1 pthreads. All of these systems work in a fully pre-emptive environment. Communication is over local memory and TCP/IP. Carl also reported that IBM is considering modifications to the SP-2 threads system to allow remote service requests. Initial timing results show that Nexus, running in this pre-emptive mode, can outperform a popular message passing library. Carl also reported that the alpha release of CC++ is now available and Fortran-M will soon be ported to Nexus.

Mathew Haines described Chant, a runtime layer being built at ICASE to support research with a task parallel HPF and Vienna Fortran. Chant is designed for machines with fast communication. The goal is to have threads, rather than processors, do message passing, and to do this without excessive buffer copy operations. The approach is to extend the pthreads interface to support MPI. Viewed as layers, the foundation of Chant is a lightweight thread system and a communications library. The next layer is a point-to-point message service between threads. Built on top of this are a remote service request layer and global thread operations.

Another ICASE project, Macrame, defines an architectural model and an object semantics for threads. Macrame defines a semantic layer between Chant and the programming language. It contains the concept of a rope, which represents a set of threads with enough common identity that they can carry out collective operations.

Rajive Bagrodia described the discrete event simulation work being done at UCLA. This project will also be based on Nexus. Rajive brings lots of experience in the areas of scheduling (such as termination detection) and dynamic load balancing to the PORTS group. One of the areas he has worked on is a dynamic system to explore task suspend-and-wake-up behaviors. In addition, he has worked on the parallel language UC which started as a data parallel system but which has also integrated task parallelism.

Requirements

The next task of the group was to discuss a set of requirements for the Ports project. Fourteen different topics were covered in the discussion and are summarized below.

Target Machines. It is the desire of the group to target environments where a task parallel model of computation makes sense. This includes large MIMD multiprocessors with either shared or distributed memory semantics and networks of servers and workstations. This does not include single SIMD systems, but it is important to note that a SIMD task may be an important component of a task parallel computation distributed over a network which includes such a machine.
Heterogeneity. The demand for programming systems that exploit heterogeneous networked computing environments is growing. In particular, applications that will run across the NII must incorporate support for heterogeneity. It was noted that this issue is distinct from that of portability and that it has strong implications for topics like thread migration. The questions to be considered are:
- How much do we need to give up in performance?
- How much are we prepared to give up?
- When can the compiler optimize for the special features of a given hardware platform?
It was noted that compiler options may be used to build a version of a program that assumes the code will be run on a homogeneous multicomputer or network. In this case, a leaner, more specialized runtime library is possible. The disadvantage with this solution is that one must maintain multiple runtime libraries and objects linked against them will not be able to interact.
Inter-operability. Modern programming systems must allow applications to be built from a mix of programming styles and languages. This means a program written in one language should be able to invoke functions written in another language. If two languages have a way to describe the same object, then there should be some way for them to share it. For traditional sequential machines it was sufficient to make sure that simple subroutine calling and stack handling conventions were observed. More complex problems arise when languages need to share heap space or garbage collection facilities. In the case of parallel and distributed systems, the runtime layer described here becomes the critical link. As described above, the runtime may be partitioned into layers. Clearly a minimum requirement for inter-operability between two languages is the basic thread and communication mechanisms that are fundamental to the Ports model. However, an open question for the PORTS group is how many of the higher level semantic layers of the runtime system may be shared.
An alternative is to follow the lead of the OMG and define an interface definition language and some form of object broker mechanism that will allow systems to interact.
Performance Analysis and Monitoring. It is essential to incorporate some interface to performance analysis and measurement within the design of the runtime system. The basic requirements include common library timers and event logging mechanisms. Clearly, it would be advantageous to have a mechanism to measure the lifetime and activities of a thread, but in a preemptive environment this is very difficult without adding state to the thread and increasing its context switching time.
In addition to thread lifetime measurement, another important performance evaluation hook is memory hierarchy and message traffic behavior. For example, if the runtime system provides a virtual shared memory or global name space, it is important to be able to record the cost of remote references versus local ones.
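The timer and event logging requirement, together with the local versus remote reference example, might be sketched as follows. This is only an illustration in Python, not part of any Ports design: the names EventLog, GlobalSpace, timed and mean_cost are hypothetical, and the "global name space" is simulated with dictionaries inside one process.

```python
import time
from collections import defaultdict

class EventLog:
    """Hypothetical runtime event log: timestamped events plus per-kind totals."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock                  # injectable so a test can use a fake clock
        self.events = []                    # (start time, kind, duration)
        self.totals = defaultdict(float)    # kind -> accumulated duration
        self.counts = defaultdict(int)      # kind -> number of events

    def timed(self, kind, fn, *args):
        """Run fn(*args), log its duration under `kind`, and return its result."""
        start = self.clock()
        result = fn(*args)
        duration = self.clock() - start
        self.events.append((start, kind, duration))
        self.totals[kind] += duration
        self.counts[kind] += 1
        return result

    def mean_cost(self, kind):
        return self.totals[kind] / self.counts[kind] if self.counts[kind] else 0.0

class GlobalSpace:
    """Simulated global name space that classifies each read as local or remote."""

    def __init__(self, my_context, log):
        self.my_context = my_context
        self.stores = defaultdict(dict)     # context id -> that context's data
        self.log = log

    def read(self, context, key):
        kind = "local_ref" if context == self.my_context else "remote_ref"
        return self.log.timed(kind, self.stores[context].get, key)
```

After a run, comparing log.mean_cost("remote_ref") with log.mean_cost("local_ref") gives the kind of cost ratio the paragraph above asks the runtime to expose.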
Debugging. No programming system is complete without debugging tools. The Ports group feels strongly that vendor supported tools must be used whenever possible. In general the Ports group found the state-of-the-art in parallel debugging to be very depressing. The group will support any and all efforts to build good extensible parallel debuggers.
Scheduling. The issue of scheduling is critical to the design of a thread based or task parallel runtime system. In the case of priority based scheduling it is important to be able to handle very high priority operations like remote service requests. We may also need ways to write custom schedulers, for example a scheduler that can evaluate conditional expressions that determine the need to schedule another task without requiring a context switch to that task. ``Gang scheduling'' is important for parallel execution of ropes of threads (see collective operations below). It is imperative that the runtime system avoid busy-waiting. One of the important uses of threads in a task parallel programming system is to hide message latency. Consequently, efficient scheduling is critical to overall performance.
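The idea of a custom scheduler that evaluates a task's wake-up condition itself, rather than switching to the task to let it poll, can be illustrated with a small cooperative scheduler. This is a sketch under stated assumptions, not a proposed Ports interface: tasks are modelled as Python generators, GuardedScheduler and its methods are hypothetical names, and pre-emption and gang scheduling are not modelled.

```python
import itertools

def always():
    """Guard for a task that is runnable unconditionally."""
    return True

class GuardedScheduler:
    """Cooperative scheduler; a task is a generator that yields its next wake-up guard."""

    def __init__(self):
        self._tasks = []                 # entries: [priority, seq, guard, generator]
        self._seq = itertools.count()    # tie-breaker: spawn order among equal priorities
        self.switches = 0                # how many times a task was actually resumed

    def spawn(self, task, priority=10):
        # Lower number = more urgent, e.g. 0 for a remote service request handler.
        self._tasks.append([priority, next(self._seq), always, task])

    def run(self):
        while self._tasks:
            # Guards are evaluated here, in the scheduler, not by switching to the task.
            runnable = [e for e in self._tasks if e[2]()]
            if not runnable:
                raise RuntimeError("deadlock: every task waits on a false guard")
            entry = min(runnable, key=lambda e: (e[0], e[1]))
            self.switches += 1
            try:
                entry[2] = next(entry[3]) or always   # the task yields its next guard
            except StopIteration:
                self._tasks = [e for e in self._tasks if e is not entry]

def demo():
    mailbox, trace = [], []
    sched = GuardedScheduler()

    def consumer():
        for _ in range(2):
            yield lambda: bool(mailbox)   # sleep until a message has arrived
            trace.append(mailbox.pop(0))

    def producer():
        for i in range(2):
            mailbox.append(i)
            yield                         # plain yield: remain runnable

    sched.spawn(consumer(), priority=0)   # urgent, but only runs when its guard holds
    sched.spawn(producer(), priority=10)
    sched.run()
    return trace, sched.switches
```

In demo(), the high priority consumer is never resumed while its mailbox is empty, so it does not busy-wait; every resume does useful work or registers a new guard.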
Thread Functionality. The general opinion of the Ports group is that the thread functionality in the base runtime system layer should be a major subset of POSIX threads. In particular, key attributes of threads include stack size, scheduling policy, and scheduling policy parameters. Management of threads should support operators like create, join, delete and equal. Threads should be able to have local data supported by operators like key-create, etc. Synchronization should be supported with at least mutex and condition variables.
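As a rough illustration of this subset, the sketch below exercises create, join, thread local data, a mutex and a condition variable. It uses Python's threading module as a stand-in for a pthreads-like interface; the function run_workers and its parameters are hypothetical, and per-thread stack size and scheduling policy attributes are not shown.

```python
import threading

def run_workers(n_workers, increments):
    """Start n_workers threads that each bump a shared counter `increments` times."""
    counter = 0
    finished = 0
    seen_ids = []
    mutex = threading.Lock()                  # plays the role of a mutex
    all_done = threading.Condition(mutex)     # condition variable tied to that mutex
    local = threading.local()                 # thread local data (cf. key-create)

    def worker(worker_id):
        nonlocal counter, finished
        local.worker_id = worker_id           # visible only to this thread
        for _ in range(increments):
            with mutex:                       # protect the shared counter
                counter += 1
        with all_done:
            finished += 1
            seen_ids.append(local.worker_id)
            all_done.notify_all()             # wake whoever waits for completion

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:                         # create
        t.start()
    with all_done:                            # block without busy-waiting
        all_done.wait_for(lambda: finished == n_workers)
    for t in threads:                         # join
        t.join()
    return counter, sorted(seen_ids)
```

A call such as run_workers(4, 1000) returns the final counter value together with the sorted worker ids, each read back from that worker's own thread local slot.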
Collective Operations. For large scale parallelism, it is impractical to schedule threads sequentially for the concurrent evaluation of some part of a program. Consequently, Ports will need some form of thread collective. A rope is a set of threads that can be created and each assigned a task (such as a loop iterate) with minimal overhead. Threads within a rope should be able to identify the other threads in the rope and there should be synchronization mechanisms that allow collective operations like barriers, scans, broadcasts and reductions. It is important that the runtime system level rope structure provide mechanism and not policy. Different languages may use ropes in very different ways.
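A rope with barrier, broadcast and reduction operations might look roughly like the following. This is a minimal sketch, assuming a fixed-size rope inside one address space and using Python threads; the class Rope and its method names are hypothetical. In keeping with "mechanism and not policy", the rope only runs whatever body function the caller supplies.

```python
import threading

class Rope:
    """A fixed-size set of threads that can perform collective operations."""

    def __init__(self, size):
        self.size = size
        self._barrier = threading.Barrier(size)
        self._slots = [None] * size           # one contribution slot per member

    def barrier(self):
        self._barrier.wait()

    def broadcast(self, rank, value, root=0):
        if rank == root:
            self._slots[root] = value
        self._barrier.wait()                  # root's value is now visible
        result = self._slots[root]
        self._barrier.wait()                  # all have read before slots are reused
        return result

    def allreduce(self, rank, value, op):
        self._slots[rank] = value
        self._barrier.wait()                  # every contribution is in place
        result = self._slots[0]
        for v in self._slots[1:]:
            result = op(result, v)
        self._barrier.wait()                  # all have read before slots are reused
        return result

    def run(self, body):
        """Run body(rope, rank) on every member and collect the return values."""
        results = [None] * self.size

        def member(rank):
            results[rank] = body(self, rank)

        threads = [threading.Thread(target=member, args=(r,)) for r in range(self.size)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results

def partial_sum_body(rope, rank):
    # Each member takes the loop iterates rank, rank + size, ... of 0..99.
    local = sum(range(rank, 100, rope.size))
    return rope.allreduce(rank, local, lambda a, b: a + b)
```

Calling Rope(4).run(partial_sum_body) splits the loop over 0..99 across four members and gives every member the same reduced total.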
Node Abstractions. In order to facilitate any discussion of remote versus local computation on parallel or distributed systems, it is very helpful to have a common language to describe the attributes of the hardware and OS of these machines. The following terms will be used in Ports documents.
- A context is an address space in which a computation may operate.
- A node is the smallest unit of hardware upon which a context can be assigned. A node consists of a computing engine (one or more CPUs) and a shared memory system. Multiple contexts may be assigned to a given node and CPUs may be shared between the contexts on a given node.
- A thread is an abstraction for a virtual program counter, register set, execution stack and private data that is following some control path (i.e. executing some program code) within an address space. Threads are assigned to and always live within a context. There may be many threads assigned to a single context.
While the entire Ports group agrees on these terms, some feel that this is not a complete model. In particular, most massively parallel machines support architectural features that make them more than a set of nodes and a communication layer. However, there was no agreement on how to extend the definition set above without providing undue complication in discussion of the topics that follow.
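The three terms can be written down as a small data model, which may help when reading the communication and migration items. The sketch below is an assumption-laden illustration, not a Ports definition: the class names Node, Context and ThreadRecord and all field names are hypothetical, and real address spaces are represented by a plain dictionary.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    """Smallest unit of hardware to which a context can be assigned."""
    name: str
    cpus: int                                      # one or more CPUs sharing memory
    contexts: list = field(default_factory=list)   # several contexts may share a node

    def new_context(self):
        ctx = Context(node=self, ident=len(self.contexts))
        self.contexts.append(ctx)
        return ctx

@dataclass(eq=False)
class Context:
    """An address space in which a computation may operate."""
    node: Node
    ident: int
    memory: dict = field(default_factory=dict)     # stands in for the address space
    threads: list = field(default_factory=list)    # many threads per context

    def spawn(self, name):
        thread = ThreadRecord(context=self, name=name)
        self.threads.append(thread)
        return thread

@dataclass(eq=False)
class ThreadRecord:
    """A thread of control; it is assigned to, and lives within, one context."""
    context: Context
    name: str
    private: dict = field(default_factory=dict)    # thread private data

    def shares_context(self, other):
        return self.context is other.context

    def shares_node(self, other):
        return self.context.node is other.context.node
```

With this model, two threads for which shares_context is true can exchange data through context.memory, while threads that only share a node (or share nothing) need one of the communication mechanisms discussed under Communication.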
Communication. There are three important models of communication that are used in current parallel programming paradigms.
- Point-to-point communication between threads in different contexts based on a message passing layer like MPI.
- Remote Service Requests (RSRs), where a message is sent to an address within a context in the same or another node to invoke a function. The function is executed by a remote service request thread within that node.
- Hardware based "get" and "put" operations that allow a thread in one context to directly see or modify a "global address" which may lie in another context in the same or another node.
Note that the last case is easily represented as a special, hardware supported instance of a remote service request. The Ports group could not agree on whether it was better to focus on point-to-point or remote service request styles of communication. However, it was agreed that both were universal in that each could emulate the other. It was also noted that most vendor supplied point-to-point systems were based on low level, specialized RSR mechanisms. However, it was also observed that a compiler could often generate point-to-point communication patterns that would contain less explicit global synchronization than an RSR scheme. Of great concern to the group was which scheme would provide the most efficient communications. It was decided to wait until more experimental data was made available.
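One direction of the claim that each style can emulate the other, namely point-to-point messaging built on remote service requests, can be sketched as follows. This is an in-process illustration only, assuming Python threads and queues in place of a real network; RsrContext and its handler names ("put", "deposit") are hypothetical and do not describe the Nexus or Chant interfaces.

```python
import queue
import threading

class RsrContext:
    """One address space with a dedicated thread that executes incoming RSRs."""

    def __init__(self):
        self.memory = {}                      # stands in for the context's address space
        self._requests = queue.Queue()        # incoming (handler name, args) messages
        self._mail_lock = threading.Lock()
        self._mail = {}                       # tag -> queue of point-to-point payloads
        self._handlers = {
            "put": self.memory.__setitem__,   # "put" expressed as a special-case RSR
            "deposit": lambda tag, payload: self._box(tag).put(payload),
        }
        self._server = threading.Thread(target=self._serve, daemon=True)
        self._server.start()

    def register(self, name, fn):
        self._handlers[name] = fn

    def rsr(self, name, *args):
        """Asynchronous request: enqueue and return without waiting for the handler."""
        self._requests.put((name, args))

    def _serve(self):
        # The remote service request thread: runs handlers in arrival order.
        while True:
            name, args = self._requests.get()
            if name is None:                  # shutdown sentinel
                return
            self._handlers[name](*args)

    def _box(self, tag):
        with self._mail_lock:
            return self._mail.setdefault(tag, queue.Queue())

    # Point-to-point messaging emulated on top of RSRs.
    def send(self, tag, payload):
        self.rsr("deposit", tag, payload)

    def recv(self, tag, timeout=None):
        return self._box(tag).get(timeout=timeout)   # blocks; no busy-waiting

    def shutdown(self):
        self._requests.put((None, ()))
        self._server.join()
```

Here send is nothing more than an RSR whose handler deposits the payload in a tagged mailbox, and recv blocks on that mailbox. The reverse emulation (an RSR as a message that a receiving thread matches and dispatches) is not shown.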
Resource Management and Allocation. Being able to allocate a new context and acquire new nodes is important for many applications. Conversely, it is also important to be able to release resources. Another important feature is to be able to connect to an existing running computation.
Task Migration. Despite the potential implementation conflicts with heterogeneity, task/thread migration may be very important for many applications. In particular, load balancing across nodes cannot be easily accomplished without this feature. The discussion focused on the level of migration that would be allowed. A running thread cannot migrate from one context to another if it is referencing data that is not thread private: when moved, the data (or even its address) may not exist in the new context. For example, if a thread allocates memory objects in the context heap that are used for the lifetime of the thread, then the heap objects would need to be moved with the thread. It was suggested that one possible mechanism would be to allow threads that were ``pure'', in the sense that they made no reference to context addresses other than thread local data, to be moved if they were in some special state. This is an area where much more study is needed.
I/O. The Ports group did not spend a long time on I/O discussions but it was clear that much more remains to be accomplished here. In particular, there should be mechanisms for collective I/O operations from ropes of threads, atomicity of I/O operations from individual threads and support for parallel file systems. This issue will be revisited in greater detail in a future meeting.

Action Items

There were three action items discussed.

This Report would be circulated.
The Nexus and Chant groups would get together and discuss the subset of pthreads that is considered essential to support.
Initial implementations of Nexus and Chant would be distributed to interested parties as soon as possible.

The next meeting of the Ports group is scheduled for Aug. 16.

mohr@cs.uoregon.edu
Thu Aug 18 13:15:16 PDT 1994