Generating OpenMP code using an interactive parallelization environment
Introduction
Today the most popular parallel systems are based on shared memory, distributed memory, or a hybrid of the two. For a distributed memory parallelization, a global view of the whole program can be vital when using a Single Program Multiple Data (SPMD) paradigm [1]. The entire manual parallelization process can be complex, very time consuming and error-prone. For example, to use the available distributed memory efficiently, data placement is an essential consideration, while the insertion of explicit communication calls requires a great deal of expertise. Parallelization for a shared memory system is only relatively easier. Data placement may appear less crucial than for a distributed memory parallelization, and a more local, loop-level view may be sufficient in many cases, but the process remains error-prone and time consuming. This is largely due to the difficulty of accurately determining parallelism and variable scoping, particularly in the in-depth interprocedural investigation that is usually essential for effective parallel performance. As a result, parallelization expertise is often needed, limiting the use and success of OpenMP.
To ease these problems, automatic parallelizing compilers from vendors and research groups [2], [3], [4], [5], and interactive parallelization tools [6], [7], [8] have been developed to detect parallelism based on some form of dependence analysis of the application. Unfortunately, the ideal scenario of a user relying on an automatic parallelizing compiler to parallelize their code typically results in limited performance, particularly in terms of scalability. There are a number of reasons for this. One fundamental problem relates to unknown input information, such as the value ranges of variables that are read into the application code. This type of information can often be critical in accurate dependence determination, so the compiler is forced to conservatively assume that a dependence exists, potentially preventing the detection of parallelism and missing valid privatizations. Another problem is that the pursuit of an in-depth and accurate analysis inevitably compromises the relatively quick compilation time that users expect for a commercially acceptable compiler.
The main goal in developing tools that assist the parallelization of serial application codes is to embed the expertise needed for an effective parallelization within automated algorithms. These algorithms can perform many of the parallelization tasks in a much shorter time frame than a parallelization expert would need to do the same work manually. In addition, the tools should not be time constrained in the same way as compilers, and should be capable of generating generic, recognizable parallel source code from the original serial code [9].
The environment presented in this paper aims to address the key functions needed for efficient code parallelization: how to detect parallelism in a given application, how to add directives that exploit the identified parallelism, how to debug generated code that is incorrect due to user error, and how to identify and overcome code sections that limit performance. The tools are described along with their interoperability in assisting OpenMP code parallelization, specifically targeted at shared memory machines. They include an interactive parallelization tool for message passing based parallelizations (ParaWise) that also provides dependence analysis and many valuable source code browsers; an OpenMP code generation module (CAPO) with a range of techniques that aid the production of efficient, scalable OpenMP code; a relative debugger built on p2d2, capable of handling hundreds of parallel processes, that automatically identifies where the serial and parallel executions diverge; and a run-time monitoring tool to profile and identify performance bottlenecks.
Effective presentation of the information provided by each tool may be sufficient for users with experience of manually parallelizing application codes. It is preferable, however, that the tools should interpret and use the information to guide the parallelization process, only involving the user when it is essential and only in terms they can understand (i.e. in the context of the application). This will allow users with no parallelization experience to use the environment effectively in generating efficient parallel code.
Section snippets
Code parallelization tools, a relative debugging tool and a performance profiling tool
The tools in this environment have been used to successfully parallelize a number of FORTRAN application codes. For distributed memory systems the parallelization is based on distributing arrays across a number of processors [9], [10], while for shared memory systems it is based on distributing loop iterations across threads [6], [11], [12]. A detailed description of the tools can be found in [1], [6], so it will not be repeated here. Instead, an overview of the tools is presented
Results
A number of codes have been parallelized both manually and using the parallelization tools. The codes include the NAS parallel benchmarks, a suite of widely used benchmark programs; CTM, a NASA Goddard code used for ozone layer climate simulations; GCEM3D, a NASA Goddard code used to model the evolution of cloud systems under large scale thermodynamic forces; and OVERFLOW, a NASA Ames code used for aerospace CFD simulations. Table 1 summarizes the approximate time taken for the various
Related work
There are many research tools and a few commercial tools that attempt to address the areas of code parallelization, debugging parallel applications and performance profiling and tuning. In this section we will include a cross-section of some of the most significant contributions. Automatic parallelizing compilers such as SUIF [2] and Polaris [3] from research groups and KAP from vendors [4] have been developed to detect parallelism based on some form of dependence analysis of the application,
Concluding remarks and future work
The quality of the code generated by our parallelization tools yields comparable performance to a manual parallelization effort, but in addition, the total time to parallelize the application is significantly reduced when using the tools. Significant advances have been made to increase the ease-of-use of these parallelization tools with the introduction of the Expert Assistant. This makes the tools accessible not only to parallelization experts, but also to novice and new users to code
Acknowledgements
The authors would like to thank their colleagues involved in the various aspects of this work, including Gabriele Jost and Jerry Yan (NASA Ames), Dan Johnson, Wei-Kuo Tao and Steve Steenrod (NASA Goddard), Emyr Evans, Peter Leggett, Jacqueline Rodrigues and Mark Cross (Greenwich). Finally, the funding for this project from AMTI subcontract No. SK-03N-02 and NASA Contract DTTS59-99-D-00437/A61812D is gratefully acknowledged.
References (28)
- et al., SUPERB—A tool for semi-automatic MIMD/SIMD parallelisation, Parallel Computing (1988)
- et al., Automatic and effective multi-dimensional parallelisation of structured mesh based codes, Parallel Computing (2000)
- et al., Using an interactive parallelisation toolkit to parallelise an ocean modelling code, FGCS (2003)
- et al., Exploitation of symbolic information in interprocedural dependence analysis, Parallel Computing (1996)
- et al., Computer aided parallelisation tools (CAPTools)—conceptual overview and performance on the parallelisation of structured mesh codes, Parallel Computing (1996)
- et al., SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers (1996)
- W. Blume, R. Eigenmann, K. Fagin, J. Grout, J. Lee, T. Lawrence, J. Hoeflinger, D. Padua, P. Tu, S. Weatherford, ...
- KAI/Intel. Available from: ...
- Veridian, Incorporated, VAST/Parallel Fortran and C, Automatic Parallelizing Preprocessors. Available from: ...
- H. Jin, M. Frumkin, J. Yan, Automatic generation of OpenMP directives and its application to computational fluid ...
Cited by (4)
- A study on popular auto-parallelization frameworks, Concurrency and Computation: Practice and Experience (2019)
- BFCA+: automatic synthesis of parallel code with TLS capabilities, Journal of Supercomputing (2017)
- The SIPSim implicit parallelism model and the SkelGIS library, Concurrency and Computation: Practice and Experience (2016)
- Parallelization of a simple-based algorithm to simulate mixed convective flow over a backward-facing step, Numerical Heat Transfer, Part B: Fundamentals (2009)