Parallel Computing

Volume 26, Issue 6, May 2000, Pages 677-703

Practical aspects
Automatic and effective multi-dimensional parallelisation of structured mesh based codes

https://doi.org/10.1016/S0167-8191(00)00004-1

Abstract

The most common parallelisation strategy for many Computational Mechanics (CM) codes (typified by Computational Fluid Dynamics (CFD) applications) which use structured meshes involves a 1D partition based upon slabs of cells. However, many CFD codes employ pipeline operations in their solution procedure. For parallelised versions of such codes to scale well they must employ two (or more) dimensional partitions. This paper describes an algorithmic approach to multi-dimensional mesh partitioning in code parallelisation, its implementation in a toolkit for almost automatically transforming scalar codes to parallel form, and its testing on a range of ‘real-world’ FORTRAN codes. The concept of multi-dimensional partitioning is straightforward, but non-trivial to represent as a sufficiently generic algorithm so that it can be embedded in a code transformation tool. The results of the tests on five real-world codes demonstrate clear improvements in parallel performance and scalability over a 1D partition. This is matched by a huge reduction in the time required to develop the parallel versions, compared with hand coding – from weeks/months down to hours/days.

Introduction

The full benefits of parallel computing will only be realised when massively parallel systems can be effectively exploited. The ability to execute codes hundreds or thousands of times faster than in serial should fundamentally alter the approach to computational modelling. The models constructed can then be far more accurate, incorporating more comprehensive representations of the physical and chemical processes involved, and also allowing for greater resolution in the computational domain. This should then reduce the reliance on expensive experimental prototypes as a dependable measure of physical processes. Several massively parallel projects/programmes are underway in the United States under the auspices of the Accelerated Strategic Computing Initiative (ASCI) [1]. ASCI is a US Department of Energy program to accelerate the development of massively parallel supercomputers, with the aim of building a 100 TFlops supercomputer by early next century. In order to achieve this, several supercomputers are being built. The first ASCI supercomputer is the 1.8 TFlops ASCI Red, a distributed memory, message passing, MIMD system that Intel built using 9152 Intel Pentium Pro processors at Sandia National Laboratories. Other supercomputers being built under ASCI are the ASCI Blue Mountain (SGI/Cray Origin, 3072 processors), ASCI White (IBM RS/6000, over 8000 processors) and ASCI Blue-Pacific (IBM RS/6000). In the case of ASCI, the nuclear test ban treaty has provided a significant impetus to massively parallel computational modelling.

To exploit these massively parallel systems the Single Program Multiple Data (SPMD) domain decomposition technique is frequently and successfully used on computational modelling software. This method requires the mesh to be split into sufficient sub-domains so that one sub-domain may be allocated to every processor. Additionally, mapping those sub-domains to the processor topology can often reduce the communication overhead for many current machines by ‘localising’ movement of data to nearest neighbours.
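
As a purely illustrative sketch of this sub-domain allocation (the routine and variable names below are assumptions for the example, not those used by CAPTools or CAPLib), a simple block partition of one mesh dimension over the available processors might be computed as follows:

      subroutine block_range(n, nproc, iproc, ilow, ihigh)
c     Illustrative block partition of mesh indices 1..n over nproc
c     processors; iproc is this processor's index (0-based).  The
c     first mod(n,nproc) processors take one extra cell each so the
c     load is as even as possible.
      integer n, nproc, iproc, ilow, ihigh
      integer base, rem
      base = n / nproc
      rem  = mod(n, nproc)
      if (iproc .lt. rem) then
         ilow  = iproc*(base + 1) + 1
         ihigh = ilow + base
      else
         ilow  = rem*(base + 1) + (iproc - rem)*base + 1
         ihigh = ilow + base - 1
      end if
      end

Each processor then executes the same program but restricts its loops over the partitioned dimension to ilow..ihigh, which is the essence of the SPMD behaviour described above.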

Although it is widely used in the Computational Fluid Dynamics (CFD) context, the 1D partition does not scale well when pipeline operations are embedded in the solution procedure: the pipeline start-up and shut-down time increases dramatically with the number of processors. This causes a marked increase in the idle time of many of the processors during a calculation cycle (e.g. a time step) and severely limits both the scalability and the parallel efficiency of Computational Mechanics (CM) structured mesh codes that use 1D partitions in their parallel implementation.
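
The origin of this behaviour can be seen in a minimal sketch of a recurrence along the partitioned dimension. The fragment below uses plain MPI and assumed array names purely for illustration (it is not CAPTools-generated code, which employs CAPLib calls): processor iproc cannot start row i until processor iproc-1 has finished that row and sent the boundary value, so the pipeline start-up and shut-down idle time grows with the number of processors.

      subroutine pipe_sweep(phi, source, ni, jlow, jhigh,
     &                      iproc, nproc)
c     Illustrative pipelined sweep: a recurrence in the partitioned
c     j-direction; this processor owns columns jlow..jhigh and holds
c     one overlap column at jlow-1.
      implicit none
      include 'mpif.h'
      integer ni, jlow, jhigh, iproc, nproc
      double precision phi(ni, jlow-1:jhigh), source(ni, jlow:jhigh)
      integer i, j, ierr, status(MPI_STATUS_SIZE)
      do 10 i = 1, ni
c        wait for the boundary value owned by the upstream processor
         if (iproc .gt. 0) then
            call MPI_RECV(phi(i, jlow-1), 1, MPI_DOUBLE_PRECISION,
     &                    iproc-1, i, MPI_COMM_WORLD, status, ierr)
         end if
         do 20 j = jlow, jhigh
            phi(i, j) = phi(i, j-1) + source(i, j)
   20    continue
c        release the downstream processor for this row
         if (iproc .lt. nproc-1) then
            call MPI_SEND(phi(i, jhigh), 1, MPI_DOUBLE_PRECISION,
     &                    iproc+1, i, MPI_COMM_WORLD, ierr)
         end if
   10 continue
      end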

Although the parallelisation of structured mesh codes using either a 1D or even a multi-dimensional partition is straightforward to conceive, the transformation process by hand is tedious, error prone and very time consuming. Manual parallelisation of a large application code using the simplest 1D partition usually takes many months. The first step in the manual parallelisation of a code is the choice of how to partition the program arrays. If all the arrays within a loop are partitioned, then the partition may be transferred to the loop head statement, such as a FORTRAN DO loop. There will also be calculations within the parallel code that require data received from neighbouring processors. The paralleliser must detect such cases and ensure that the correct amount of additional data space is available to store the overlap regions. Communications must also be placed in the code at an optimal position. Initially the user may conservatively place the communications immediately before the point in the code where the communicated data is required in a calculation. However, after further inspection it may be possible to place the communications where they can be merged with others, reducing the number of communication calls and the associated latencies. Once parallelised, it is essential to ensure that correct results are obtained, usually requiring the use of debugging tools. Once correct results have been obtained, the efficiency of the parallel code can be measured using profiling tools to detect any serial sections of the code and any communication bottlenecks.
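
As an example of the overlap-region communication described above, a hand-coded message-passing version might refresh a one-column overlap on each side of a 1D-partitioned array with an exchange such as the following sketch (assumed names; plain MPI is used for illustration rather than the communication calls any particular tool would generate):

      subroutine exchange_overlap(phi, ni, jlow, jhigh, iproc, nproc)
c     Illustrative overlap update for a 1D partition: this processor
c     owns columns jlow..jhigh and stores one extra column on each
c     side, refreshed from the neighbouring processors before the
c     calculation that reads them.
      implicit none
      include 'mpif.h'
      integer ni, jlow, jhigh, iproc, nproc
      double precision phi(ni, jlow-1:jhigh+1)
      integer left, right, ierr, status(MPI_STATUS_SIZE)
      left  = iproc - 1
      right = iproc + 1
      if (left  .lt. 0)     left  = MPI_PROC_NULL
      if (right .ge. nproc) right = MPI_PROC_NULL
c     send the first owned column left, receive the right overlap
      call MPI_SENDRECV(phi(1, jlow),    ni, MPI_DOUBLE_PRECISION,
     &                  left,  0,
     &                  phi(1, jhigh+1), ni, MPI_DOUBLE_PRECISION,
     &                  right, 0, MPI_COMM_WORLD, status, ierr)
c     send the last owned column right, receive the left overlap
      call MPI_SENDRECV(phi(1, jhigh),   ni, MPI_DOUBLE_PRECISION,
     &                  right, 1,
     &                  phi(1, jlow-1),  ni, MPI_DOUBLE_PRECISION,
     &                  left,  1, MPI_COMM_WORLD, status, ierr)
      end

Merging several such exchanges into a single communication, as discussed above, amortises the message start-up latency.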

Further consideration of a parallelisation in a second or third dimension can be far more complex. The paralleliser must now extend the processes used for the first dimension to these further partitions, involving the alteration of far more code. There are also additional complications to be considered where, for example, the communication of data must take account of all partitions. The communications may also require data from processors that are not immediate neighbours, i.e. diagonal communications.
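
A minimal sketch of the diagonal-communication point, assuming the 2D processor grid is described by an MPI Cartesian communicator (illustrative names only; CAPLib's own topology routines are not reproduced here):

      subroutine corner_neighbour(comm2d, di, dj, nbr)
c     Illustrative lookup of a diagonal neighbour in a 2D processor
c     grid: the corner of an overlap region is owned by a processor
c     that is not a face neighbour, so its rank must be located from
c     the Cartesian coordinates (di and dj are +/-1 offsets).
      implicit none
      include 'mpif.h'
      integer comm2d, di, dj, nbr
      integer rank, coords(2), dims(2), ierr
      logical periods(2)
      call MPI_COMM_RANK(comm2d, rank, ierr)
      call MPI_CART_GET(comm2d, 2, dims, periods, coords, ierr)
      coords(1) = coords(1) + di
      coords(2) = coords(2) + dj
      if (coords(1) .lt. 0 .or. coords(1) .ge. dims(1) .or.
     &    coords(2) .lt. 0 .or. coords(2) .ge. dims(2)) then
         nbr = MPI_PROC_NULL
      else
         call MPI_CART_RANK(comm2d, coords, nbr, ierr)
      end if
      end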

In the above manual process, it takes a long time to ensure that the transformed version of the application code simultaneously: is numerically correct; minimises data exchange (communications) with other processors; and is efficiently parallelised at every stage of the calculation.

The authors are part of a team that has designed and implemented a software toolkit, the Computer Aided Parallelisation Tools (CAPTools), whose objective is to greatly reduce the time required to parallelise CFD, etc. application codes [2], [3], [4], [5]. These tools are underpinned by a very comprehensive interprocedural dependence analysis [2] that enables much of the subsequent parallelisation process to be automated.

The initial version of these tools focussed upon code parallelisation using a 1D partitioning. Although the parallel transformations of the application codes tested were as good as could be delivered by hand, those codes that employed pipeline operations in their solution procedures did not scale well. To deliver scalability for such codes, multi-dimensional partitioning strategies are required. This paper addresses this issue directly through:

  • the description of a generic multi-dimensional structured mesh code partitioning procedure,

  • its implementation in the CAPTools toolkit to enable multi-dimensionally partitioned parallel code to be generated automatically and, therefore, rapidly,

  • the evaluation of the parallel performance of the code generated by the tools on a range of `real-world' application FORTRAN codes.

The paper is laid out as follows: Section 2 reviews, in brief, the types of parallelisation tools that are currently available and considers, in further detail, the CAPTools environment that is employed in the automatic generation of multi-dimensionally partitioned parallel code. Section 3 briefly discusses a few of the communication primitives generated by CAPTools that are particularly important in this work. Section 4 demonstrates the methods utilised to automatically generate multi-dimensionally partitioned parallel code. Finally, Section 5 presents the results for five different codes that have been automatically parallelised for 1D and 2D meshes of processors using CAPTools.

Automatic code generation

There are many factors that make the process of manual parallelisation unattractive. These include: the amount of time required; the additional skill required for parallelisation; the complex process of debugging in parallel; and also maintenance and portability issues. The concept of automatic parallelisation is therefore attractive as a means of producing parallel code that avoids these problems.

Communication library

A vital component in the generation of multi-dimensional parallel codes by CAPTools is the communication library, CAPLib. In the mapping between the logical program topology and the physical processor topology it is essential to ensure, for example, that nearest neighbour communications can be exploited.

The processor topology is determined at runtime by the user. Typical examples of a topology defined by the user are:

  • Pipe〈n〉: a pipe topology of n processors;

  • Grid〈n × m〉: a grid topology of n by m processors (a minimal mapping sketch is given below).
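
As a rough illustration of how such a user-defined grid topology can be mapped onto nearest-neighbour communications, the sketch below builds a 2D Cartesian layout with standard MPI routines; the names are assumptions and CAPLib's actual interface is not reproduced here.

      program grid_topology
c     Illustrative Grid<n x m> style topology using MPI Cartesian
c     routines: the processor count is factorised into a 2D grid and
c     the ranks of the four face neighbours are obtained, so that
c     overlap exchanges become nearest-neighbour messages.
      implicit none
      include 'mpif.h'
      integer ierr, nprocs, rank, comm2d
      integer dims(2), coords(2)
      integer east, west, north, south
      logical periods(2), reorder
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
c     let MPI factorise the processor count into an n x m grid
      dims(1) = 0
      dims(2) = 0
      call MPI_DIMS_CREATE(nprocs, 2, dims, ierr)
      periods(1) = .false.
      periods(2) = .false.
      reorder = .true.
      call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods,
     &                     reorder, comm2d, ierr)
      call MPI_COMM_RANK(comm2d, rank, ierr)
      call MPI_CART_COORDS(comm2d, rank, 2, coords, ierr)
c     face-neighbour ranks (MPI_PROC_NULL at the grid edges)
      call MPI_CART_SHIFT(comm2d, 0, 1, west, east, ierr)
      call MPI_CART_SHIFT(comm2d, 1, 1, south, north, ierr)
      call MPI_FINALIZE(ierr)
      end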

Multi-dimensional partitioning

Previous versions of CAPTools partition CFD codes in a single dimension only (i.e. splitting the mesh into lines in 2D or slabs in 3D). The work presented here extends CAPTools to further partition the second and third dimensions of the mesh, one at a time, transforming the code produced by the previous passes. This multi-dimensional partitioning has been incorporated into the existing CAPTools environment where, after communication generation for one dimension, partitioning of a subsequent dimension …
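
As an illustration only (assumed names; this is not CAPTools output), once each partitioned dimension has been given its own owned index range, the original serial loop nest simply runs over those ranges in every partitioned dimension:

      subroutine update_slab(phi, src, ni, jlow, jhigh, klow, khigh)
c     Illustrative 2D-partitioned loop nest: the j and k dimensions
c     have each been block-partitioned, so this processor updates
c     only the (jlow:jhigh, klow:khigh) portion of the mesh; the i
c     dimension remains unpartitioned.
      implicit none
      integer ni, jlow, jhigh, klow, khigh
      double precision phi(ni, jlow:jhigh, klow:khigh)
      double precision src(ni, jlow:jhigh, klow:khigh)
      integer i, j, k
      do 10 k = klow, khigh
         do 20 j = jlow, jhigh
            do 30 i = 1, ni
               phi(i, j, k) = phi(i, j, k) + src(i, j, k)
   30       continue
   20    continue
   10 continue
      end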

Results and algorithm parallelisation examples

Multi-dimensionally partitioned code was generated for three of the NAS parallel benchmark [20] codes: APPLU, APPSP and APPBT; as well as for ARC3D and an industrial CFD code. The results were obtained from several machines: the IBM SP2 at the University of Southampton; the Cray T3D at the Edinburgh Parallel Computing Centre; and the SGI Origin at NASA Ames. Note that since the SGI Origin is a shared memory, multi-user machine, the timings are subject to some degree of variability, so only the …

Conclusion

Although it is well understood that using a 1D data partition in the parallelisation of structured mesh CM codes, and extending the paradigm to multiple dimensions, is trivial to conceive, it is a tedious, error prone and lengthy transformation to implement, especially by hand. It is, of course, desirable to transform mesh based CM codes into parallel form using this partitioning strategy because, for applications employing solution algorithms with pipeline operations, it can increase the parallel …

Acknowledgements

The authors would like to thank Jerry Yan, Haoqiang Jin and Michelle Hribar at NASA Ames for the SGI Origin (and some of the T3E) results presented in this paper.

References (24)

  • G.C. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, M. Wu, Fortran D language specification,...
  • M. Lam, Locality Optimisations for Parallel Machines, Parallel Processing: CONPAR 94-VAPP VI, Linz, Austria, Springer,...

This material is based upon work supported by the US Naval Regional Contracting Center, London, on behalf of the NASA Ames Research Center, Grant No. N68171-98-C-9005, and also the UK Engineering and Physical Sciences Research Council (EPSRC) under the Portable Software Tools for Parallel Architectures (PSTPA) Programme, Grant No. GR/K40321.
