
Parallel Computing

Volume 31, Issues 10–12, October–December 2005, Pages 999-1012

Generating OpenMP code using an interactive parallelization environment

https://doi.org/10.1016/j.parco.2005.03.008

Abstract

Code parallelization using OpenMP for shared memory systems is relatively easy compared with using message passing for distributed memory systems. Despite this, it is still a challenge to use OpenMP to parallelize application codes in a way that yields effective scalable performance when executed on a shared memory parallel system. We describe an environment that assists the programmer in the various tasks of code parallelization, greatly reducing both the time taken and the level of skill required. The parallelization environment includes a number of tools that address the main tasks of parallelism detection, OpenMP source code generation, debugging and optimization. These tools include a high quality, fully interprocedural dependence analysis with user interaction capabilities to facilitate the generation of efficient parallel code; an automatic relative debugging tool to identify erroneous user decisions in that interaction; and performance profiling to identify bottlenecks. Finally, experiences of parallelizing some NASA application codes are presented to illustrate some of the benefits of using the evolving environment.

Introduction

Today the most popular parallel systems are based on shared memory, distributed memory or hybrid distributed-shared memory architectures. For a distributed memory based parallelization, a global view of the whole program can be vital when using a Single Program Multiple Data (SPMD) paradigm [1]. The entire manual parallelization process can be complex, very time consuming and error-prone. For example, to use the available distributed memory efficiently, data placement is an essential consideration, while the insertion of explicit communication calls requires a great deal of expertise. Parallelization on a shared memory system is only relatively easier. Data placement may appear to be less crucial than for a distributed memory parallelization, and a more local, loop-level view may be sufficient in many cases, but the parallelization process is still error-prone and time-consuming. This is largely due to the difficulty in accurately determining parallelism and variable scoping, particularly in the in-depth interprocedural investigation that is usually essential for effective parallel performance. As a result, parallelization expertise is often needed, limiting the use and success of OpenMP.
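As a simple illustration of this loop-level view, the sketch below (hypothetical code and variable names, not taken from the applications discussed here) shows the kind of scoping decision that must be made for even a trivial loop: the scratch variable has to be made private to each thread or the threads will overwrite each other's values.

! A minimal sketch of loop-level OpenMP parallelization (hypothetical names):
! the iterations are independent, but the scratch variable tmp must be scoped
! PRIVATE so that each thread keeps its own copy.
subroutine smooth(u, unew, n)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: u(n)
  real(8), intent(out) :: unew(n)
  integer :: i
  real(8) :: tmp
!$OMP PARALLEL DO PRIVATE(i, tmp) SHARED(u, unew, n)
  do i = 2, n - 1
     tmp = 0.5d0 * (u(i-1) + u(i+1))   ! per-thread temporary
     unew(i) = tmp
  end do
!$OMP END PARALLEL DO
end subroutine smooth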

To ease these problems, automatic parallelizing compilers from vendors and research groups [2], [3], [4], [5], and interactive parallelization tools [6], [7], [8] have been developed to detect parallelism based on some form of dependence analysis of the application. Unfortunately, the ideal scenario of a user relying on an automatic parallelizing compiler to parallelize their code typically results in limited performance, particularly in terms of scalability. There are a number of reasons for this. One fundamental problem relates to unknown input information, such as the value ranges of variables that are read into the application code. This type of information can often be critical in accurate dependence determination, so the compiler is forced to conservatively assume the existence of a dependence, potentially preventing parallelism detection and missing valid privatization opportunities. Another problem is that the pursuit of an in-depth and accurate analysis inevitably compromises the relatively quick compilation time that users expect if a compiler is to be commercially acceptable.
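To make the first of these problems concrete, consider a sketch like the following (hypothetical code): the offset k is only known at run time, so a compiler cannot prove that the reads and writes of a never overlap and must conservatively serialize the loop, whereas a user who knows the valid range of k can assert independence and also identify the temporary as privatizable.

! A minimal sketch (hypothetical code) of a loop whose parallelism depends on
! input data. With k unknown at compile time, a(i+k) may alias a(i) in another
! iteration, so an automatic compiler must assume a loop-carried dependence.
! If the user asserts that k >= n for every data set, the reads and writes are
! disjoint and the directive below is safe; work is a privatizable temporary.
subroutine shift_update(a, b, c, n, k)
  implicit none
  integer, intent(in) :: n, k        ! k is read from the input data
  real(8) :: a(n+k), b(n), c(n)
  integer :: i
  real(8) :: work
!$OMP PARALLEL DO PRIVATE(i, work)
  do i = 1, n
     work   = b(i) * c(i)
     a(i+k) = a(i) + work
  end do
!$OMP END PARALLEL DO
end subroutine shift_update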

The main goal in developing tools that assist in the parallelization of serial application codes is to embed the expertise needed for an effective parallelization within automated algorithms. These algorithms can perform many of the parallelization tasks in a much shorter time frame than a parallelization expert would need to do the same work manually. In addition, the tools should not be time constrained in the same way as compilers, and should be capable of generating generic, recognizable parallel source code from the original serial code [9].

The environment presented in this paper aims to address the key functions needed for efficient code parallelization. These include how to detect parallelism in the given application, how to add directives to exploit the identified parallelism, how to debug generated code that is incorrect because of user error, how to identify code sections that limit performance, and how these limitations can be overcome. The tools are described along with their interoperability in assisting OpenMP code parallelization, specifically targeted at shared memory machines. They include an interactive parallelization tool for message passing based parallelizations (ParaWise), which also contains dependence analysis capability and many valuable source code browsers; an OpenMP code generation module (CAPO) with a range of techniques that aid the production of efficient, scalable OpenMP code; a relative debugger built on p2d2, capable of handling hundreds of parallel processes, that automatically identifies where the equivalent serial and parallel executions diverge; and a run-time monitoring tool to profile the code and identify performance bottlenecks.
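The idea behind relative debugging can be illustrated with a sketch like the one below (hypothetical routine and names, not the p2d2 interface): the same array is captured at a chosen point in both the serial reference run and the OpenMP run, and the first significant difference locates the code section whose parallelization, for example an incorrect scoping decision, introduced the error.

! A minimal sketch of the comparison step in relative debugging (hypothetical
! routine, not the actual p2d2 interface): values captured at the same point
! in the serial and parallel runs are compared, and the first divergence is
! reported so the offending parallelization decision can be traced.
subroutine compare_runs(serial_vals, parallel_vals, n, tol, label)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: serial_vals(n), parallel_vals(n)
  real(8), intent(in) :: tol
  character(len=*), intent(in) :: label
  integer :: i
  do i = 1, n
     if (abs(serial_vals(i) - parallel_vals(i)) > tol) then
        write(*,*) 'Divergence at ', trim(label), ' index ', i, &
                   ' serial=', serial_vals(i), ' parallel=', parallel_vals(i)
        return
     end if
  end do
  write(*,*) trim(label), ': serial and parallel runs agree within tolerance'
end subroutine compare_runs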

Effective presentation of the information provided by each tool may be sufficient for users with experience of manually parallelizing application codes. It is preferable, however, that the tools should interpret and use the information to guide the parallelization process, only involving the user when it is essential and only in terms they can understand (i.e. in the context of the application). This will allow users with no parallelization experience to use the environment effectively in generating efficient parallel code.

Section snippets

Code parallelization tools, a relative debugging tool and a performance profiling tool

The tools in this environment have been used to successfully parallelize a number of FORTRAN application codes. For distributed memory systems the parallelization is based on distributing arrays across a number of processors [9], [10], while for shared memory systems the parallelization is based on distributing loop iterations across threads [6], [11], [12]. A detailed description of the tools can be found in [1], [6], so it will not be given here. Instead, an overview of the tools is presented

Results

A number of codes have been parallelized both manually and using the parallelization tools. The codes include the NAS parallel benchmarks, a suite of well-used benchmark programs; CTM, a NASA Goddard code used for ozone layer climate simulations; GCEM3D, a NASA Goddard code used to model the evolution of cloud systems under large-scale thermodynamic forces; and OVERFLOW, the NASA Ames version used to model aerospace CFD simulations. Table 1 summarizes the approximate time taken for the various

Related work

There are many research tools and a few commercial tools that attempt to address the areas of code parallelization, debugging parallel applications, and performance profiling and tuning. In this section we include a cross-section of the most significant contributions. Automatic parallelizing compilers such as SUIF [2] and Polaris [3] from research groups and KAP from vendors [4] have been developed to detect parallelism based on some form of dependence analysis of the application,

Concluding remarks and future work

The quality of the code generated by our parallelization tools yields comparable performance to a manual parallelization effort, but in addition, the total time to parallelize the application is significantly reduced when using the tools. Significant advances have been made to increase the ease-of-use of these parallelization tools with the introduction of the Expert Assistant. This makes the tools accessible not only to parallelization experts, but also to novice and new users to code

Acknowledgements

The authors would like to thank their colleagues involved in the various aspects of this work, including Gabriele Jost and Jerry Yan (NASA Ames), Dan Johnson, Wei-Kuo Tao and Steve Steenrod (NASA Goddard), Emyr Evans, Peter Leggett, Jacqueline Rodrigues and Mark Cross (Greenwich). Finally, the funding for this project from AMTI subcontract No. SK-03N-02 and NASA Contract DTTS59-99-D-00437/A61812D is gratefully acknowledged.

References (28)

  • FORGE, Applied Parallel Research, Placerville, California 95667, USA,...
  • S.P. Johnson, C.S. Ierotheou, M. Cross, Computer aided parallelisation of unstructured mesh codes, in: H.R. Arabnia...
  • H. Jin, G. Jost, D. Johnson, W.-K. Tao, Experience on the parallelization of a cloud modeling code using computer-aided...
  • Parallel Software Products Inc. Available from:...