Efficient data IO for a Parallel Global Cloud Resolving Model

https://doi.org/10.1016/j.envsoft.2011.08.007

Abstract

Execution of a Global Cloud Resolving Model (GCRM) at target resolutions of 2–4 km will generate, at a minimum, tens of gigabytes of data per variable per snapshot. Writing this data to disk without creating a serious bottleneck in the execution of the GCRM code, while also supporting efficient post-execution data analysis, is a significant challenge. This paper discusses an Input/Output (IO) application programmer interface (API) for the GCRM that efficiently moves data from the model to disk while maintaining support for community standard formats, avoiding the creation of very large numbers of files, and supporting efficient analysis. Several aspects of the API will be discussed in detail. First, we discuss the output data layout, which linearizes the data in a consistent way that is independent of the number of processors used to run the simulation and provides a convenient format for subsequent analyses of the data. Second, we discuss the flexible API interface that enables modelers to easily add variables to the output stream by specifying where in the GCRM code these variables are located, and to flexibly configure the choice of outputs and the distribution of data across files. The flexibility of the API is designed to allow model developers to add new data fields to the output as the model develops and new physics is added. It also provides a mechanism that allows users of the GCRM code to adjust the output frequency and the number of fields written depending on the needs of individual calculations. Third, we describe the mapping to the NetCDF data model, with an emphasis on the grid description. Fourth, we describe our messaging algorithms and IO aggregation strategies that are used to achieve high bandwidth while writing concurrently from many processors to shared files. We conclude with initial performance results.

Highlights

  • A strategy for linearizing data on a geodesic grid is developed.
  • A modular IO library based on this strategy is developed that can be easily incorporated into the GCRM with minimal effort.
  • A subset of processors is used for IO to reduce contention with the file system.
  • Bandwidth results for a number of different IO configurations are presented.

Introduction

The push to create more reliable and accurate simulations for environmental modeling has led to an increasing reliance on parallel programming to run larger and more detailed simulations in a timely manner. This is particularly true for simulations of climate change, but parallel programming, particularly programs that scale to large numbers of processors, is becoming increasingly important in other areas of environmental modeling as well. Individual environmental components are being run at larger scales and components are being coupled together to create larger models of environmental systems. Examples of the use of parallel codes in environmental simulation include hydrology (Hammond and Lichtner, 2010), surface water modeling (Von Bloh et al., 2010; Neal et al., 2010; Yu, 2010), and simulations of the ocean (Maltrud and McClean, 2005). Simulations of climate in general and the atmosphere in particular have a long history of using parallel computation to increase the complexity of the models simulated and to extend the resolution and timescales of simulations (Drake et al., 2005; Dabdub and Seinfeld, 1996). Higher resolution is being used to reduce the uncertainties and systematic errors due to parameterizations and other sources of error associated with coarser-grained models.

Our ability to simulate climate change over extended periods is heavily constrained by uncertainty in the subgrid models that are used to describe behavior occurring at scales less than the dimensions of a single grid cell. Typical grid cell dimensions are in the range of 35–70 km for current global simulations of climate. At these resolutions, much of the behavior at the subgrid scale, particularly of clouds, must be heavily modeled and parameterized and different models give significantly different results. The behavior of clouds in these models is a major source of uncertainty (Liou, 1986). Efforts are currently underway to develop a Global Cloud Resolving Model (GCRM) (Randall et al., 2003) designed to run at a grid resolution on the order of kilometers. At 4 km, simulations become “cloud permitting” and individual clouds can begin to be modeled without including them as subgrid parameterizations. At higher resolutions of 2 km and 1 km, individual cloud behavior can be fully modeled. Simulations at these resolutions will substantially reduce the level of approximation at the subgrid scale and provide a more physically based representation of the behavior of clouds. In the short term, results from these simulations will be used as the basis for increasing the accuracy of climate simulations at coarser resolutions that can be run more efficiently to simulate longer periods of time. GCRMs are likely to be used for operational numerical weather prediction within about ten years and to perform “time slice” simulations within longer climate change simulations on coarser grids.

The GCRM will initially be run using a minimum of 80 K processors, writing terabytes of data to disk. However, to approach climate-length simulations at the target resolution of 4 km, a million processors will be required. IO will be a serious bottleneck on overall program performance unless careful consideration is given to designing a high performance IO strategy. Previously, the most widely used approach for handling such large IO requirements has been to have all processors engage in IO, either to separate files or to a few shared files using a parallel IO library. Having each processor write to separate files is undesirable, both because it will result in a huge number of files and because having that many processors doing large writes simultaneously will overwhelm the IO system. Similarly, having all processors write to a shared file is also likely to overwhelm the IO system. More recently, researchers have focused on creating IO collectives that aggregate data to a subset of processors before writing to disk. This allows programs to exploit the higher bandwidth available for communication to stage data to a smaller number of processors before transferring it to a file. The smaller number of IO processors can minimize contention while simultaneously maximizing IO bandwidth. Several recent reports have described collectives of this type, particularly in regard to the MPI-IO library (Lang et al., 2009). However, these optimizations are designed to handle all possible situations and may not produce ideal solutions for every problem. Further improvements may be available by organizing data at the application level. This paper will describe the implementation of an IO API for a parallel GCRM that provides flexibility in controlling the fields that appear in the output and the frequency at which output is written, while simultaneously allowing the user to control the number of processors engaging in IO and the size of IO writes. This extra layer of control provides additional options for optimizing IO bandwidth.
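
As context for the aggregation approach outlined above, the following is a minimal sketch of the general staging pattern, written here with mpi4py. The aggregation factor, field sizes, and variable names are illustrative assumptions; this is not the GIO implementation itself.

```python
# Minimal sketch of staging data from many compute ranks to a small set of
# IO ranks before writing. The aggregation factor and dummy field data are
# illustrative assumptions; this is not the GIO implementation.
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()

RANKS_PER_IO_TASK = 16                     # assumed aggregation factor
color = rank // RANKS_PER_IO_TASK          # one group per IO task
group = world.Split(color, key=rank)       # communicator for this group

local = np.full(1024, float(rank))         # this rank's block of a field (dummy data)

# Gather the group's blocks onto the group root, which acts as the IO rank.
counts = group.gather(local.size, root=0)
if group.Get_rank() == 0:
    displs = [0] + list(np.cumsum(counts))[:-1]
    staged = np.empty(sum(counts), dtype=np.float64)
    recvspec = [staged, counts, displs, MPI.DOUBLE]
else:
    recvspec = None
group.Gatherv(local, recvspec, root=0)

if group.Get_rank() == 0:
    # Only group roots touch the file system; 'staged' is now one large
    # contiguous buffer that can be written with a single call.
    pass
```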

There has been considerable recent work investigating parallel IO using all processors. Antypas et al. (2006) had each processor write local data to disk in a large run of the FLASH astrophysical simulation code using 65 K processors of an IBM BlueGene/L machine. However, this resulted in over 74 million files, severely complicating post-processing and analysis. An alternative to exporting local data is to use parallel IO libraries that allow multiple processors to write to different locations within the same file. Recent implementations of such libraries include Parallel NetCDF (Li et al., 2003) and the HDF5/NetCDF4 libraries (Yang and Koziol, 2006), both of which are in turn built on top of the MPI-IO libraries (Thakur et al., 1999). These parallel IO libraries allow programmers to write files in a platform-independent format that is widely accepted in the climate community. The API described below is built around such libraries.
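
The sketch below illustrates the shared-file style of these libraries using the netCDF4-python bindings. It assumes an MPI-parallel-enabled netCDF4/HDF5 build, and the file, dimension, and variable names are only examples, not the GCRM's actual output layout.

```python
# Illustrative shared-file parallel write in the style of parallel NetCDF /
# NetCDF4: every rank writes its own contiguous slab of one variable in a
# single file. Assumes a parallel-enabled netCDF4-python / HDF5 build.
import numpy as np
from mpi4py import MPI
from netCDF4 import Dataset

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

cells_per_rank = 1024                       # assumed local block size
ncells = cells_per_rank * size

nc = Dataset("field.nc", "w", parallel=True, comm=comm, info=MPI.Info())
nc.createDimension("cells", ncells)
temp = nc.createVariable("temperature", "f8", ("cells",))
temp.set_collective(True)                   # request collective MPI-IO writes

start = rank * cells_per_rank               # this rank's offset in the shared file
temp[start:start + cells_per_rank] = np.full(cells_per_rank, float(rank))
nc.close()
```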

However, parallel IO libraries by themselves do not represent a complete solution when very large numbers of processors are used. Antypas et al. (2006) found that IO did not scale for the FLASH code beyond 1024 processors on the IBM BG/L platform using HDF5, parallel NetCDF, and even basic MPI-IO. Yu et al. (2006) investigated optimizations to MPI-IO, also on an IBM BG/L platform, that demonstrated scaling to about 1000 cores for the FLASH IO benchmark, as well as for the HOMME atmospheric dynamical code. Ching et al. (2003) investigated the effect of different low-level data read/write strategies on IO performance using both FLASH and a variety of other non-application benchmarks. Although they found reasonable scaling behavior, their studies did not extend beyond 128 processors. Saini et al. (2006) also reported results for the FLASH IO benchmark using HDF5 but saw effective bandwidth drop off significantly after about 128 processors. Using a different astrophysical benchmark, MADbench2, based on a cosmic microwave background data analysis package, Borrill et al. (2003) investigated read and write performance to both separate files for each processor and shared files on a broad range of platforms. They found that IO scaled for both separate and shared files in almost all cases, but only reported results up to 256 processors.

For very large numbers of processors, IO scaling behavior is unclear. Because total bandwidth to disk is a finite resource, contention between processors may actually lower bandwidth when large numbers of processors are all trying to write concurrently (Mache et al., 1999; Saini et al., 2006). Antypas et al. (2006) and Saini et al. (2006) did not see scaling when going to high numbers of processors using FLASH coupled with HDF5. It is not clear that the alternative of having each processor write its own local data will scale either to petascale systems containing thousands or tens of thousands of processors. Furthermore, the number of files generated if each processor exports data becomes difficult to manage (74 million in the case of a large FLASH simulation). While these files could be post-processed back into a global view of the data, this step will consume significant resources and introduces the possibility of errors in the post-processing step. It may also require double-storing the data or discarding the raw data. A very recent study by Lang et al. (2009), however, has shown that optimizations to MPI-IO collective operations, including data aggregation, have led to scaling up to 100 K processors. This has been demonstrated for several synthetic benchmarks as well as the MADbench2 and FLASH3 codes.

Additional libraries are under development for use in high performance computing applications. PIO has been developed at NCAR to provide a common interface for several IO backends. The PIO interface itself is similar to the parallel NetCDF and NetCDF4 libraries, so the data remapping and related steps described below are still required in order to use it. However, using PIO would allow users to switch seamlessly between several IO libraries (Dennis et al., in press). The ADIOS library being developed at Georgia Tech and ORNL (Lofstead et al., 2008) also provides a common interface to several different IO libraries and data formats, as well as implementing many optimizations designed to improve IO performance (Lofstead et al., 2009). However, to achieve these performance gains, ADIOS has created its own BP format, which requires that the data subsequently be converted into NetCDF or HDF5 formatted files for which analysis tool chains exist (Lofstead et al., 2010). These optimizations have led to dramatic improvements in applications that export data using many small writes but may not lead to such large performance gains when IO consists of large writes. The GCRM code writes out data in large blocks, so improvements over parallel NetCDF or HDF5/NetCDF4 may be harder to achieve, particularly when the cost of reformatting data is factored in.

Although optimizations to MPI-IO, such as the aggregation of many small IO requests into larger single reads/writes and the staging of data to a smaller number of IO processors, have led to significant performance gains, these optimizations do not always identify ideal solutions. Additional performance may be gained by further manipulation at the application level. The API described in this paper reformats data from the native application layout so that it can be written to disk in large contiguous chunks, and it gives users significant extra flexibility in configuring how the application aggregates and stages data for IO. This extra flexibility can lead to substantial performance gains in some instances over the optimizations in the parallel IO libraries themselves.
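
As a concrete, hedged illustration of this application-level reorganization, the sketch below packs a rank's small local blocks into one buffer in global-index order so that a single large contiguous write can replace many small ones. The block size and the assumption that each rank owns a contiguous range of global blocks are illustrative, not a description of the GIO code.

```python
# Sketch of packing many small local blocks into one contiguous buffer in a
# processor-independent (global-index) order, so that each field can be
# written with one large write. Block size and ownership pattern are
# illustrative assumptions.
import numpy as np

BLOCK_SIZE = 256                            # illustrative block length

def pack_blocks(local_blocks):
    """local_blocks: dict mapping global block index -> 1-D numpy array."""
    order = sorted(local_blocks)            # global ordering, independent of rank
    buf = np.concatenate([local_blocks[b] for b in order])
    # If a rank owns a contiguous range of global blocks, the packed buffer
    # maps to a single contiguous region of the file starting here:
    offset = order[0] * BLOCK_SIZE
    return offset, buf

# Example: a rank owning global blocks 5-7 produces one slab at offset 5 * 256.
blocks = {b: np.full(BLOCK_SIZE, float(b)) for b in (7, 5, 6)}
offset, buf = pack_blocks(blocks)
```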

The remainder of this paper will describe the GCRM IO API library (GIO), including: the data layout of the GCRM code itself, the data layout for files written by the API, the user interface to the API, the communication strategies for moving data to the IO nodes, and performance results for the API using a number of different communication strategies.

Section snippets

The geodesic grid and GCRM data layout

The GCRM code that will incorporate the IO API is being developed at Colorado State University. Because the GCRM development is occurring in parallel with the development of the IO API, the results reported here will be for simulations that were performed using a GCRM predecessor. This is a hydrostatic simulation code (HYDRO) that uses the multi-level grid solver that will be incorporated into the GCRM. This code uses a geodesic grid, which is the grid that will be used by the GCRM, and has a
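
Although the snippet above is truncated, the processor-independent linearization mentioned in the abstract can be illustrated. A bisected icosahedral (geodesic) grid at refinement level r has 10·4^r + 2 cells, which can be indexed as ten logically square panels of 2^r × 2^r cells plus two pole cells; the specific ordering in the sketch below is an illustrative assumption, not necessarily the GCRM's actual layout.

```python
# Illustrative linearization of geodesic-grid cells. The panel/row/column
# ordering is an assumption for illustration, not the GCRM's actual layout.
def ncells(r):
    """Cell count of a bisected icosahedral grid at refinement level r."""
    return 10 * 4**r + 2

def linear_index(panel, row, col, r):
    """Map (panel, row, col) to a global index that is independent of the
    processor decomposition; the two pole cells take the last two slots."""
    side = 2**r
    return panel * side * side + row * side + col

# The R11 grid used in the Results section has 10 * 4**11 + 2 cells.
assert ncells(11) == 41_943_042
```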

Mapping output to NetCDF

Until recently, climate and weather models have primarily been simulated on structured grids that divide the latitude and longitude axes in even increments, resulting in logically structured simulation grids. Standard conventions for describing this data in the NetCDF data model have been formalized by the Climate and Forecast (CF) conventions. CF defines conventions and metadata standards that enable both human and computer interpretation of the data. Human interpretation is supported through
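
To make the mapping concrete, the sketch below writes a minimal unstructured-grid description into a NetCDF file: cell-center coordinates, cell-corner coordinates, and a CF-style coordinates attribute linking a field to them. The dimension and variable names are illustrative assumptions, not the exact convention adopted for the GCRM.

```python
# Minimal sketch of describing an unstructured geodesic grid in the NetCDF
# data model. All names here are illustrative, not the GCRM's convention.
from netCDF4 import Dataset

nc = Dataset("grid_sketch.nc", "w")
nc.createDimension("cells", 10 * 4**2 + 2)   # a tiny R2 grid for illustration
nc.createDimension("cell_corners", 6)        # hexagonal cells (12 are pentagons)

lat = nc.createVariable("grid_center_lat", "f8", ("cells",))
lon = nc.createVariable("grid_center_lon", "f8", ("cells",))
clat = nc.createVariable("grid_corner_lat", "f8", ("cells", "cell_corners"))
clon = nc.createVariable("grid_corner_lon", "f8", ("cells", "cell_corners"))
for var, units in ((lat, "degrees_north"), (lon, "degrees_east"),
                   (clat, "degrees_north"), (clon, "degrees_east")):
    var.units = units

temp = nc.createVariable("temperature", "f8", ("cells",))
temp.coordinates = "grid_center_lat grid_center_lon"  # CF-style association
nc.close()
```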

API design

This section will describe in detail the design and implementation of the IO API. The structure of the API code is illustrated schematically below in Fig. 2. The IO API layer consists of a small collection of subroutine calls that are used to connect the IO library to the GCRM application. These subroutines are supplemented by two files. One provides a description of the data that will be written by the IO routines and the second describes the files that will be written. A large part of
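
The names in the sketch below (gio_register_field and the configuration dictionary) are purely hypothetical; they only mirror the description above of a small set of registration calls plus separate data- and file-description inputs, and are not the actual GIO interface.

```python
# Hypothetical illustration of the style of interface described above: the
# model registers where each field lives, and a separate configuration
# describes which fields go to which files and how often. These names are
# not the actual GIO API.
registered_fields = {}

def gio_register_field(name, array, dims):
    """Tell the IO layer where a model variable is located in memory."""
    registered_fields[name] = {"data": array, "dims": dims}

# A file-description input might then associate fields with output files
# and output frequencies, e.g.:
output_config = {
    "prognostic.nc": {"fields": ["temperature", "pressure"],
                      "frequency_hours": 6},
}
```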

Results

This section will provide a brief summary of current performance results for the IO API. A more detailed analysis of IO for this application is planned for a separate paper. A series of test cases were performed using the HYDRO model configured to run the Jablonowski test case originally described by Jablonowski and Williamson (2006). Most of the simulations were run using 2560 processor cores on an R11 grid (4 km resolution). The number of IO processors was varied from between 160 and 2560
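
For scale, a hedged back-of-the-envelope estimate of the data volume implied by the R11 grid is sketched below; the number of vertical layers is an illustrative assumption.

```python
# Back-of-the-envelope data-volume estimate for the R11 grid (assumptions:
# the usual geodesic cell-count relation and an illustrative 100 layers).
ncells_r11 = 10 * 4**11 + 2                # ~41.9 million cells
nlayers = 100                              # illustrative vertical extent
bytes_per_value = 8                        # double precision

bytes_per_3d_field = ncells_r11 * nlayers * bytes_per_value
print(f"{bytes_per_3d_field / 2**30:.1f} GiB per 3-D field per snapshot")
# -> roughly 31 GiB, consistent with the abstract's "tens of gigabytes"
#    per variable per snapshot.
```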

Conclusions

This paper has described the development and implementation of an IO API for a GCRM application code that is expected to produce output on the order of petabytes. Several issues needed to be addressed in order to develop such an API. These include determining what grid data and other metadata is required in the output files so that subsequent analyses and visualization can be performed on the data, developing a format for the data, developing an interface for the API, and creating algorithms to

Acknowledgments

The authors are indebted to Katie Antypas, Prabhat, and Mark Howison at the National Energy Research Scientific Computing Center (NERSC), Dave Knaak at Cray, Rob Latham and Rob Ross at Argonne National Laboratory, and Professor Wei-keng Liao at Northwestern University for invaluable help in getting the IO API running and optimized on the Franklin platform.

This work was funded by the U.S. Department of Energy’s (DOE) Office of Advanced Scientific Computing Research through its Scientific

References (28)

  • J. Drake et al. Overview of the software design of the community climate system model. International Journal of High Performance Computing Applications (2005).
  • Dennis, J., Edwards, J., Loy, R., Jacob, R., Mirin, A., Craig, A., Vertenstein, M. An application level parallel I/O...
  • G. Hammond et al. Field-scale model for the natural attenuation of uranium at the Hanford 300 area using high performance computing. Water Resources Research (2010).
  • C. Jablonowski et al. A baroclinic instability test case for atmospheric model dynamical cores. Quarterly Journal of the Royal Meteorological Society (2006).