As with many experimental languages, the ZPL compiler achieves portability by producing ANSI C as its object code, which is then compiled by a specific machine's native C compiler and linked with architecture-specific communication libraries [Cha95]. The target architecture consists of a set of processors connected by a two-dimensional grid-structured communication network. (Note that this is a specialization of the CTA architecture.) Two-dimensional parallel arrays are block-partitioned and allocated to these processors. Arrays of higher rank are projected to two dimensions, while vectors are allocated as columns.
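To make the block partitioning concrete, the following C sketch computes the index ranges that one processor of the grid would own. It assumes a balanced split in which block sizes differ by at most one; the text above does not specify the exact scheme used by the ZPL runtime, and the names `Range`, `Block`, `block_range`, and `local_block` are hypothetical.

```c
#include <assert.h>

/* Half-open index range [lo, hi) owned along one dimension. */
typedef struct { int lo, hi; } Range;

/* Local portion of a two-dimensional array on one processor. */
typedef struct { Range rows, cols; } Block;

/* Hypothetical helper: split n indices into p contiguous, near-equal
   blocks and return block b (0 <= b < p). This balanced formula is an
   assumption, not necessarily what the ZPL runtime does. */
Range block_range(int n, int p, int b) {
    Range r;
    assert(n >= 0 && p > 0 && b >= 0 && b < p);
    r.lo = (int)((long)b * n / p);
    r.hi = (int)((long)(b + 1) * n / p);
    return r;
}

/* Block of a rows x cols array owned by processor (pr, pc) of a
   prows x pcols processor grid. */
Block local_block(int rows, int cols, int prows, int pcols, int pr, int pc) {
    Block blk;
    blk.rows = block_range(rows, prows, pr);
    blk.cols = block_range(cols, pcols, pc);
    return blk;
}
```

Under these assumptions, a 10 x 8 array on a 2 x 2 grid gives processor (1, 0) rows 5 through 9 and columns 0 through 3.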
The compiler uses a factor-join compilation approach to provide high-level communication and computation optimizations [Cha96]. The code is parsed into an abstract syntax tree as a list of C- and T-factors. Every parallel calculation is represented as a C-factor (computation), and every data transfer as a T-factor. T-factors are eventually transformed into communication primitives, while C-factors become nested multi-loops (or mloops) that iterate over the locally allocated portion of the associated regions. This simplified, high-level representation enables the compiler to manipulate and join factors to achieve certain optimizations. For example, C-factors are joined if they compute over the same region and have no conflicting data dependencies. Joining merges the corresponding mloops in the object C code, and is similar to the loop fusion optimization performed by parallelizing compilers [Wol89].
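The effect of joining two C-factors can be illustrated with a small C sketch. It is not actual ZPL compiler output: the array statements, the flat row-major layout, and the function names `unfused` and `fused` are assumptions chosen for illustration. Both statements range over the same region, and the second only reads the element of `b` that the first has just written at the same index, so the loop nests can be merged safely.

```c
#include <assert.h>
#include <stddef.h>

/* Before joining: each array statement has its own loop nest (mloop). */
void unfused(size_t rows, size_t cols,
             const double *a, double *b, double *c) {
    assert(a && b && c);
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            b[i * cols + j] = a[i * cols + j] + 1.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            c[i * cols + j] = b[i * cols + j] * 2.0;
}

/* After joining: one loop nest executes both statements per element. */
void fused(size_t rows, size_t cols,
           const double *a, double *b, double *c) {
    assert(a && b && c);
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            b[i * cols + j] = a[i * cols + j] + 1.0;
            c[i * cols + j] = b[i * cols + j] * 2.0;
        }
    }
}
```

The two versions produce identical contents in `b` and `c`; the fused version traverses the region once instead of twice.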
Frequently, mloop fusion will result in one or more arrays whose definition and use appear in the same loop. If such an array is not live after the loop, it can be contracted to a scalar, thereby reducing storage costs and array indexing calculations. For an example, see the accompanying figure. The time and space savings from joining and contraction can be significant.
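A minimal C sketch of contraction follows. It is an illustration under assumed names (`fused_with_temp`, `contracted`, and the arrays `vx`, `vy`, `t`, `out`), not the VelocityStats code itself. The temporary array `t` is defined and used within the same loop iteration and is assumed to be dead afterwards, so it can be replaced by a scalar.

```c
#include <assert.h>
#include <stddef.h>

/* After fusion but before contraction: the temporary t is still an array,
   even though each element is written and then read in one iteration. */
void fused_with_temp(size_t n, const double *vx, const double *vy,
                     double *t, double *out) {
    assert(vx && vy && t && out);
    for (size_t i = 0; i < n; i++) {
        t[i] = vx[i] * vx[i] + vy[i] * vy[i];
        out[i] = 0.5 * t[i];
    }
}

/* After contraction: t is a scalar, so its storage and its index
   arithmetic disappear. Valid only because t is not live after the loop. */
void contracted(size_t n, const double *vx, const double *vy, double *out) {
    assert(vx && vy && out);
    for (size_t i = 0; i < n; i++) {
        double t = vx[i] * vx[i] + vy[i] * vy[i];
        out[i] = 0.5 * t;
    }
}
```

Both functions write the same values to `out`; the contracted version needs no storage for the temporary.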
From the compiler's point of view, the compilation of a shard (a section of shattered control flow) or a promoted function is not an optimization. The entire code section is represented by a single C-factor, and hence is compiled into a single mloop. From the debugger's viewpoint, however, the shard must be treated as a joined mloop, because it contains multiple statements.
Figure: Effects of joining and contraction on an excerpt from the VelocityStats example from [Cha96]. See Appendix A for complete source. A number of the array references become scalar references.