Parallel Computing

Volume 26, Issue 6, May 2000, Pages 677-703

Practical aspects
Automatic and effective multi-dimensional parallelisation of structured mesh based codes

https://doi.org/10.1016/S0167-8191(00)00004-1

Abstract

The most common parallelisation strategy for many Computational Mechanics (CM) codes (typified by Computational Fluid Dynamics (CFD) applications) which use structured meshes involves a 1D partition based upon slabs of cells. However, many CFD codes employ pipeline operations in their solution procedure. For parallelised versions of such codes to scale well they must employ two (or more) dimensional partitions. This paper describes an algorithmic approach to multi-dimensional mesh partitioning in code parallelisation, its implementation in a toolkit for almost automatically transforming scalar codes to parallel form, and its testing on a range of ‘real-world’ FORTRAN codes. The concept of multi-dimensional partitioning is straightforward, but non-trivial to represent as a sufficiently generic algorithm so that it can be embedded in a code transformation tool. The results of the tests on five real-world codes demonstrate clear improvements in parallel performance and scalability over a 1D partition. This is matched by a huge reduction in the time required to develop the parallel versions, compared with hand coding – from weeks/months down to hours/days.

Introduction

The full benefits of parallel computing will only be realised when massively parallel systems can be effectively exploited. The ability to execute codes hundreds or thousands of times faster than in serial should fundamentally alter the approach to computational modelling. The models constructed can then be far more accurate, incorporating more comprehensive representations of the physical and chemical processes involved, and also allowing for greater resolution in the computational domain. This should then reduce the reliance on expensive experimental prototypes as a dependable measure of physical processes. Several massively parallel projects/programmes are underway in the United States under the auspices of the Accelerated Strategic Computing Initiative (ASCI) [1]. ASCI is a US Department of Energy program to accelerate the development of massively parallel supercomputers, with the aim of building a 100 TFlops supercomputer by early next century. In order to achieve this, several supercomputers are being built. The first ASCI supercomputer is the 1.8 TFlops ASCI Red, a distributed memory, message passing, MIMD system that Intel built using 9152 Intel Pentium Pro processors at Sandia National Laboratories. Other supercomputers being built under ASCI are the ASCI Blue Mountain (SGI/Cray Origin, 3072 processors), ASCI White (IBM RS/6000, over 8000 processors) and ASCI Blue-Pacific (IBM RS/6000). In the case of ASCI, the nuclear test ban treaty has provided a significant impetus to massively parallel computational modelling.

To exploit these massively parallel systems the Single Program Multiple Data (SPMD) domain decomposition technique is frequently and successfully used on computational modelling software. This method requires the mesh to be split into sufficient sub-domains so that one sub-domain may be allocated to every processor. Additionally, mapping those sub-domains to the processor topology can often reduce the communication overhead for many current machines by ‘localising’ movement of data to nearest neighbours.
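
As a purely illustrative sketch of this sub-domain allocation (the routine and variable names below are assumptions for the example, not those used by CAPTools or CAPLib), a simple block partition of one mesh dimension over the available processors might be computed as follows:

      subroutine block_range(n, nproc, iproc, ilow, ihigh)
c     Illustrative block partition of mesh indices 1..n over nproc
c     processors; iproc is this processor's index (0-based).  The
c     first mod(n,nproc) processors take one extra cell each so the
c     load is as even as possible.
      integer n, nproc, iproc, ilow, ihigh
      integer base, rem
      base = n / nproc
      rem  = mod(n, nproc)
      if (iproc .lt. rem) then
         ilow  = iproc*(base + 1) + 1
         ihigh = ilow + base
      else
         ilow  = rem*(base + 1) + (iproc - rem)*base + 1
         ihigh = ilow + base - 1
      end if
      end

Each processor then executes the same program but restricts its loops over the partitioned dimension to ilow..ihigh, which is the essence of the SPMD behaviour described above.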

Although it is widely used in the Computational Fluid Dynamics (CFD) context, the 1D partition does not scale well when pipeline operations are embedded in the solution procedure: the pipeline start-up and shut-down time increases dramatically with the number of processors. This causes a marked increase in the idle time of many of the processors during a calculation cycle (e.g. a time step) and severely limits both the scalability and the parallel efficiency of Computational Mechanics (CM) structured mesh codes that use 1D partitions in their parallel implementation.
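
The origin of this behaviour can be seen in a minimal sketch of a recurrence along the partitioned dimension. The fragment below uses plain MPI and assumed array names purely for illustration (it is not CAPTools-generated code, which employs CAPLib calls): processor iproc cannot start row i until processor iproc-1 has finished that row and sent the boundary value, so the pipeline start-up and shut-down idle time grows with the number of processors.

      subroutine pipe_sweep(phi, source, ni, jlow, jhigh,
     &                      iproc, nproc)
c     Illustrative pipelined sweep: a recurrence in the partitioned
c     j-direction; this processor owns columns jlow..jhigh and holds
c     one overlap column at jlow-1.
      implicit none
      include 'mpif.h'
      integer ni, jlow, jhigh, iproc, nproc
      double precision phi(ni, jlow-1:jhigh), source(ni, jlow:jhigh)
      integer i, j, ierr, status(MPI_STATUS_SIZE)
      do 10 i = 1, ni
c        wait for the boundary value owned by the upstream processor
         if (iproc .gt. 0) then
            call MPI_RECV(phi(i, jlow-1), 1, MPI_DOUBLE_PRECISION,
     &                    iproc-1, i, MPI_COMM_WORLD, status, ierr)
         end if
         do 20 j = jlow, jhigh
            phi(i, j) = phi(i, j-1) + source(i, j)
   20    continue
c        release the downstream processor for this row
         if (iproc .lt. nproc-1) then
            call MPI_SEND(phi(i, jhigh), 1, MPI_DOUBLE_PRECISION,
     &                    iproc+1, i, MPI_COMM_WORLD, ierr)
         end if
   10 continue
      end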

Although the parallelisation of structured mesh codes using either a 1D or even a multi-dimensional partition is straightforward to conceive, the transformation process by hand is tedious, error prone and very time consuming. Manual parallelisation of a large application code using the simplest 1D partition usually takes many months. The first step in the manual parallelisation of a code is the choice of how to partition the program arrays. If all the arrays within a loop are partitioned, then the partition may be transferred to the loop head statement, such as a FORTRAN DO loop. There will also be calculations within the parallel code that require data received from neighbouring processors. The paralleliser must detect such cases and ensure that the correct amount of additional data space is available to store the overlap regions. Communications must also be placed in the code at an optimal position. Initially the user may conservatively place the communications immediately before the point in the code where the communicated data is required in a calculation. However, after further inspection it may be possible to place the communications where they can be merged with others, reducing the number of communication calls and the associated latencies. Once parallelised, it is essential to ensure that correct results are obtained, usually requiring the use of debugging tools. Once correct results have been obtained, the efficiency of the parallel code can be measured using profiling tools to detect any serial sections of the code and any communication bottlenecks.
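
As an example of the overlap-region communication described above, a hand-coded message-passing version might refresh a one-column overlap on each side of a 1D-partitioned array with an exchange such as the following sketch (assumed names; plain MPI is used for illustration rather than the communication calls any particular tool would generate):

      subroutine exchange_overlap(phi, ni, jlow, jhigh, iproc, nproc)
c     Illustrative overlap update for a 1D partition: this processor
c     owns columns jlow..jhigh and stores one extra column on each
c     side, refreshed from the neighbouring processors before the
c     calculation that reads them.
      implicit none
      include 'mpif.h'
      integer ni, jlow, jhigh, iproc, nproc
      double precision phi(ni, jlow-1:jhigh+1)
      integer left, right, ierr, status(MPI_STATUS_SIZE)
      left  = iproc - 1
      right = iproc + 1
      if (left  .lt. 0)     left  = MPI_PROC_NULL
      if (right .ge. nproc) right = MPI_PROC_NULL
c     send the first owned column left, receive the right overlap
      call MPI_SENDRECV(phi(1, jlow),    ni, MPI_DOUBLE_PRECISION,
     &                  left,  0,
     &                  phi(1, jhigh+1), ni, MPI_DOUBLE_PRECISION,
     &                  right, 0, MPI_COMM_WORLD, status, ierr)
c     send the last owned column right, receive the left overlap
      call MPI_SENDRECV(phi(1, jhigh),   ni, MPI_DOUBLE_PRECISION,
     &                  right, 1,
     &                  phi(1, jlow-1),  ni, MPI_DOUBLE_PRECISION,
     &                  left,  1, MPI_COMM_WORLD, status, ierr)
      end

Merging several such exchanges into a single communication, as discussed above, amortises the message start-up latency.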

Further consideration of a parallelisation in a second or third dimension can be far more complex. The paralleliser must now extend the processes used for the first dimension to these further partitions, involving the alteration of far more code. There are also additional complications to be considered where, for example, the communication of data must take account of all partitions. The communications may also require data from processors that are not immediate neighbours, i.e. diagonal communications.
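
A minimal sketch of the diagonal-communication point, assuming the 2D processor grid is described by an MPI Cartesian communicator (illustrative names only; CAPLib's own topology routines are not reproduced here):

      subroutine corner_neighbour(comm2d, di, dj, nbr)
c     Illustrative lookup of a diagonal neighbour in a 2D processor
c     grid: the corner of an overlap region is owned by a processor
c     that is not a face neighbour, so its rank must be located from
c     the Cartesian coordinates (di and dj are +/-1 offsets).
      implicit none
      include 'mpif.h'
      integer comm2d, di, dj, nbr
      integer rank, coords(2), dims(2), ierr
      logical periods(2)
      call MPI_COMM_RANK(comm2d, rank, ierr)
      call MPI_CART_GET(comm2d, 2, dims, periods, coords, ierr)
      coords(1) = coords(1) + di
      coords(2) = coords(2) + dj
      if (coords(1) .lt. 0 .or. coords(1) .ge. dims(1) .or.
     &    coords(2) .lt. 0 .or. coords(2) .ge. dims(2)) then
         nbr = MPI_PROC_NULL
      else
         call MPI_CART_RANK(comm2d, coords, nbr, ierr)
      end if
      end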

In the above manual process, it takes a long time to ensure that the transformed version of the application code simultaneously: is numerically correct; minimises data exchange (communications) with other processors; and is efficiently parallelised at every stage of the calculation.

The authors are part of a team that has designed and implemented a software toolkit, the Computer Aided Parallelisation Tools (CAPTools), whose objective is to greatly reduce the time required to parallelise CFD, etc. application codes [2], [3], [4], [5]. These tools are underpinned by a very comprehensive interprocedural dependence analysis [2] that enables much of the subsequent parallelisation process to be automated.

The initial version of these tools focussed upon code parallelisation using a 1D partitioning. Although the parallel transformations of the application codes tested were as good as could be delivered by hand, those codes that employed pipeline operations in their solution procedures did not scale well. To deliver scalability for such codes, multi-dimensional partitioning strategies are required. This paper addresses this issue directly through:

  • the description of a generic multi-dimensional structured mesh code partitioning procedure,

  • its implementation in the CAPTools toolkit to enable multi-dimensionally partitioned parallel code to be generated automatically and, therefore, rapidly,

  • the evaluation of the parallel performance of the code generated by the tools on a range of `real-world' application FORTRAN codes.

The paper is laid out as follows: Section 2 reviews, in brief, the types of parallelisation tools that are currently available and considers, in further detail, the CAPTools environment that is employed in the automatic generation of multi-dimensionally partitioned parallel code. Section 3 briefly discusses a few of the communication primitives generated by CAPTools that are particularly important in this work. Section 4 demonstrates the methods utilised to automatically generate multi-dimensionally partitioned parallel code. Finally, Section 5 presents the results for five different codes that have been automatically parallelised for 1D and 2D meshes of processors using CAPTools.

Automatic code generation

There are many factors that make the process of manual parallelisation unattractive. These include: the amount of time required; the additional skill required for parallelisation; the complex process of debugging in parallel; and also maintenance and portability issues. The concept of automatic parallelisation is therefore attractive as a means of producing parallel code that avoids these problems.

Communication library

A vital component in the generation of multi-dimensional parallel codes by CAPTools is the communication library, CAPLib. In the mapping between the logical program topology and the physical processor topology it is essential to ensure, for example, that nearest neighbour communications can be exploited.

The processor topology is determined at runtime by the user. Typical examples of a topology defined by the user are:

  • Pipe〈n〉: a pipe topology of n processors;

  • Grid〈n × m〉: a grid topology of n by m processors (a minimal mapping sketch is given below).
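
As a rough illustration of how such a user-defined grid topology can be mapped onto nearest-neighbour communications, the sketch below builds a 2D Cartesian layout with standard MPI routines; the names are assumptions and CAPLib's actual interface is not reproduced here.

      program grid_topology
c     Illustrative Grid<n x m> style topology using MPI Cartesian
c     routines: the processor count is factorised into a 2D grid and
c     the ranks of the four face neighbours are obtained, so that
c     overlap exchanges become nearest-neighbour messages.
      implicit none
      include 'mpif.h'
      integer ierr, nprocs, rank, comm2d
      integer dims(2), coords(2)
      integer east, west, north, south
      logical periods(2), reorder
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
c     let MPI factorise the processor count into an n x m grid
      dims(1) = 0
      dims(2) = 0
      call MPI_DIMS_CREATE(nprocs, 2, dims, ierr)
      periods(1) = .false.
      periods(2) = .false.
      reorder = .true.
      call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods,
     &                     reorder, comm2d, ierr)
      call MPI_COMM_RANK(comm2d, rank, ierr)
      call MPI_CART_COORDS(comm2d, rank, 2, coords, ierr)
c     face-neighbour ranks (MPI_PROC_NULL at the grid edges)
      call MPI_CART_SHIFT(comm2d, 0, 1, west, east, ierr)
      call MPI_CART_SHIFT(comm2d, 1, 1, south, north, ierr)
      call MPI_FINALIZE(ierr)
      end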

Multi-dimensional partitioning

Previous versions of CAPTools partition CFD codes in a single dimension only (i.e. splitting the mesh into lines in 2D or slabs in 3D). The work presented here extends CAPTools to further partition the second and third dimensions of the mesh, one at a time, transforming the code produced by the previous passes. This multi-dimensional partitioning has been incorporated into the existing CAPTools environment where, after communication generation for one dimension, partitioning of a subsequent dimension …
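
As an illustration only (assumed names; this is not CAPTools output), once each partitioned dimension has been given its own owned index range, the original serial loop nest simply runs over those ranges in every partitioned dimension:

      subroutine update_slab(phi, src, ni, jlow, jhigh, klow, khigh)
c     Illustrative 2D-partitioned loop nest: the j and k dimensions
c     have each been block-partitioned, so this processor updates
c     only the (jlow:jhigh, klow:khigh) portion of the mesh; the i
c     dimension remains unpartitioned.
      implicit none
      integer ni, jlow, jhigh, klow, khigh
      double precision phi(ni, jlow:jhigh, klow:khigh)
      double precision src(ni, jlow:jhigh, klow:khigh)
      integer i, j, k
      do 10 k = klow, khigh
         do 20 j = jlow, jhigh
            do 30 i = 1, ni
               phi(i, j, k) = phi(i, j, k) + src(i, j, k)
   30       continue
   20    continue
   10 continue
      end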

Results and algorithm parallelisation examples

Multi-dimensionally partitioned code was generated for three of the NAS parallel benchmark [20] codes: APPLU, APPSP and APPBT; as well as for ARC3D and an industrial CFD code. The results were obtained from several machines: the IBM SP2 at the University of Southampton; the Cray T3D at the Edinburgh Parallel Computing Centre; and the SGI Origin at NASA Ames. Note that since the SGI Origin is a shared memory, multi-user machine, the timings are subject to some degree of variability, so only the …

Conclusion

Although it is well understood that using a 1D data partition in the parallelisation of structured mesh CM codes, and extending the paradigm to multiple dimensions, is trivial to conceive, it is a tedious, error prone and lengthy transformation to implement, especially by hand. It is, of course, desirable to transform mesh based CM codes into parallel form using this partitioning strategy because, for applications employing solution algorithms with pipeline operations, it can increase the parallel …

Acknowledgements

The authors would like to thank Jerry Yan, Haoqiang Jin and Michelle Hribar at NASA Ames for the SGI Origin (and some of the T3E) results presented in this paper.

References (24)

  • G.C. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, M. Wu, Fortran D language specification,...
  • M. Lam, Locality Optimisations for Parallel Machines, Parallel Processing: CONPAR 94-VAPP VI, Linz, Austria, Springer,...

This material is based upon work supported by the US Naval Regional Contracting Center, London, on behalf of the NASA Ames Research Center, Grant No. N68171-98-C-9005, and also the UK Engineering and Physical Sciences Research Council (EPSRC) under the Portable Software Tools for Parallel Architectures (PSTPA) Programme, Grant No. GR/K40321.
