Efficient data IO for a Parallel Global Cloud Resolving Model
Highlights
► A strategy for linearizing data on a geodesic grid is developed.
► A modular IO library based on this strategy can be incorporated into the GCRM with minimal effort.
► A subset of processors is used for IO to reduce contention with the file system.
► Bandwidth results for a number of different IO configurations are presented.
Introduction
The push to create more reliable and accurate simulations for environmental modeling has led to an increasing reliance on parallel programming to run larger and more detailed simulations in a timely manner. This is particularly true for simulations of climate change, but parallel programming, particularly programs that scale to large numbers of processors, is becoming increasingly important in other areas of environmental modeling as well. Individual environmental components are being run at larger scales and components are being coupled together to create larger models of environmental systems. Examples of the use of parallel codes in environmental simulation include hydrology (Hammond and Lichtner, 2010), surface water modeling (Von Bloh et al., 2010; Neal et al., 2010; Yu, 2010), and simulations of the ocean (Maltrud and McClean, 2005). Simulations of climate in general and the atmosphere in particular have a long history of using parallel computation to increase the complexity of the models simulated and to extend the resolution and timescales of simulations (Drake et al., 2005; Dabdub and Seinfeld, 1996). Higher resolution is being used to reduce the uncertainties and systematic errors due to parameterizations and other sources of error associated with coarser-grained models.
Our ability to simulate climate change over extended periods is heavily constrained by uncertainty in the subgrid models that are used to describe behavior occurring at scales less than the dimensions of a single grid cell. Typical grid cell dimensions are in the range of 35–70 km for current global simulations of climate. At these resolutions, much of the behavior at the subgrid scale, particularly of clouds, must be heavily modeled and parameterized, and different models give significantly different results. The behavior of clouds in these models is a major source of uncertainty (Liou, 1986). Efforts are currently underway to develop a Global Cloud Resolving Model (GCRM) (Randall et al., 2003) designed to run at a grid resolution on the order of kilometers. At 4 km, simulations become “cloud permitting” and individual clouds can begin to be modeled without including them as subgrid parameterizations. At higher resolutions of 2 km and 1 km, individual cloud behavior can be fully modeled. Simulations at these resolutions will substantially reduce the level of approximation at the subgrid scale and provide a more physically based representation of the behavior of clouds. In the short term, results from these simulations will be used as the basis for increasing the accuracy of climate simulations at coarser resolutions that can be run more efficiently to simulate longer periods of time. GCRMs are likely to be used for operational numerical weather prediction within about ten years and to perform “time slice” simulations within longer climate change simulations on coarser grids.
The GCRM will initially be run using a minimum of 80 K processors, writing terabytes of data to disk. However, to approach climate-length simulations at the target resolution of 4 km, a million processors will be required. IO will be a serious bottleneck on overall program performance unless careful consideration is given to designing a high performance IO strategy. Previously, the most widely used approach for handling such large IO requirements has been to have all processors engage in IO, either to separate files or to a few shared files using a parallel IO library. Having each processor write to separate files is undesirable, both because it will result in a huge number of files and because having that many processors doing large writes simultaneously will overwhelm the IO system. Similarly, having all processors write to a shared file is also likely to overwhelm the IO system. More recently, researchers have focused on creating IO collectives that aggregate data to a subset of processors before writing to disk. This allows programs to exploit the higher bandwidth available for communication to stage data to a smaller number of processors before transferring it to a file. The smaller number of IO processors can minimize contention while simultaneously maximizing IO bandwidth. Several recent reports have described collectives of this type, particularly with regard to the MPI-IO library (Lang et al., 2009). However, these optimizations are designed to handle all possible situations and may not arrive at ideal solutions for every problem. Further improvements may be available by organizing data at the application level. This paper will describe the implementation of an IO API for a parallel GCRM that provides flexibility in controlling the fields that appear in the output and the frequency with which output is written, while simultaneously allowing the user to control the number of processors engaging in IO and the size of IO writes. This extra layer of control provides additional options for optimizing IO bandwidth.
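The aggregation idea described above can be illustrated with a minimal sketch (hypothetical code, not the GIO implementation): a user-chosen subset of ranks performs IO, and every compute rank is assigned to the IO rank that will gather and write its data. The function name and the even-stride placement of IO ranks are assumptions for illustration only.

```python
def aggregator_map(nprocs, n_io):
    """Assign each of nprocs compute ranks to one of n_io IO ranks.

    Returns (io_ranks, owner), where io_ranks lists the ranks that
    perform IO and owner[r] is the IO rank that gathers data from r.
    Assumes nprocs is a multiple of n_io, as in the runs reported here.
    """
    if nprocs % n_io != 0:
        raise ValueError("assume nprocs is a multiple of n_io")
    stride = nprocs // n_io
    # Spread IO ranks evenly so each one serves a contiguous group of
    # compute ranks, keeping the gathered data contiguous as well.
    io_ranks = [i * stride for i in range(n_io)]
    owner = {r: io_ranks[r // stride] for r in range(nprocs)}
    return io_ranks, owner

# Example matching the configurations tested below: 2560 compute ranks
# funneled through 160 IO ranks, so each IO rank aggregates 16 ranks.
io_ranks, owner = aggregator_map(2560, 160)
```

Varying `n_io` is exactly the knob the API exposes: fewer IO ranks means larger writes and less file-system contention, at the cost of more staging communication.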
There has been considerable recent work investigating parallel IO using all processors. Antypas et al. (2006) had each processor write local data to disk in a large run of the FLASH astrophysical simulation code using 65 K processors of an IBM BlueGene/L machine. However, this resulted in over 74 million files, severely complicating post-processing and analysis. An alternative to exporting local data is to use parallel IO libraries that allow multiple processors to write to different locations within the same file. Recent implementations of such libraries include Parallel NetCDF (Li et al., 2003) and the HDF5/NetCDF4 libraries (Yang and Koziol, 2006), both of which are in turn built on top of the MPI-IO libraries (Thakur et al., 1999). These parallel IO libraries allow programmers to write files in a platform-independent format that is widely accepted in the climate community. The API described below is built around such libraries.
However, parallel IO libraries by themselves do not represent a complete solution when very large numbers of processors are used. Antypas et al. (2006) found that IO did not scale for the FLASH code beyond 1024 processors on the IBM BG/L platform using HDF5, parallel NetCDF, and even basic MPI-IO. Yu et al. (2006) investigated optimizations to MPI-IO, also on an IBM BG/L platform, that demonstrated scaling to about 1000 cores for the FLASH IO benchmark, as well as the HOMME atmospheric dynamical core. Ching et al. (2003) investigated the effect of different low level data read/write strategies on IO performance using both FLASH and a variety of other non-application benchmarks. Although they found reasonable scaling behavior, their studies did not extend beyond 128 processors. Saini et al. (2006) also reported results for the FLASH IO benchmark using HDF5 but saw effective bandwidth drop off significantly after about 128 processors. Using a different astrophysical benchmark, MADbench2, based on a cosmic microwave background data analysis package, Borrill et al. (2003) investigated read and write performance to both separate files for each processor and shared files on a broad range of platforms. They found that IO scaled for both separate and shared files in almost all cases, but only reported results up to 256 processors.
For very large numbers of processors, IO scaling behavior is unclear. Because total bandwidth to disk is a finite resource, contention between processors may actually lower bandwidth when large numbers of processors are all trying to write concurrently (Mache et al., 1999; Saini et al., 2006). Antypas et al. (2006) and Saini et al. (2006) did not see scaling when going to high numbers of processors using FLASH coupled with HDF5. It is not clear that the alternative of having each processor write its own local data will scale either to petascale systems containing thousands or tens of thousands of processors. Furthermore, the number of files generated if each processor exports data becomes difficult to manage (74 million in the case of a large FLASH simulation). While these files could be post-processed back into a global view of the data, this step will consume significant resources and introduces the possibility of errors in the post-processing step. It may also require double storing the data or discarding the raw data. A very recent study by Lang et al. (2009), however, has shown that optimizations to MPI-IO collective operations, including data aggregation, have led to scaling up to 100 K processors. This has been demonstrated for several synthetic benchmarks as well as the MADbench2 and FLASH3 codes.
Additional libraries are under development for use in high performance computing applications. PIO has been developed at NCAR to provide a common interface for several IO backends. The PIO interface itself is similar to the parallel NetCDF and NetCDF4 libraries, so the data remapping described below is still required in order to use it. However, using PIO would allow users to switch seamlessly between several IO libraries (Dennis et al., in press). The ADIOS library being developed at Georgia Tech and ORNL (Lofstead et al., 2008) also provides a common interface to several different IO libraries and data formats, as well as implementing many optimizations designed to improve IO performance (Lofstead et al., 2009). However, to achieve these performance gains, ADIOS has created its own BP format, which requires that the data subsequently be converted into NetCDF or HDF5 formatted files for which analysis tool chains exist (Lofstead et al., 2010). These optimizations have led to dramatic improvements in applications that export data using many small writes but may not lead to such large performance gains when IO consists of large writes. The GCRM code writes out data in large blocks, so improvements over parallel NetCDF or HDF5/NetCDF4 may be harder to achieve, particularly when the cost of reformatting data is factored in.
Although optimizations to MPI-IO, such as aggregation of many small IO requests into larger single IO reads/writes and staging of data to a smaller number of IO processors, have led to significant performance gains, these optimizations do not always identify ideal solutions in all cases. Additional performance may be gained by further manipulation at the application level. The API described in this paper reformats data from the native application layout to allow it to be written to disk in large contiguous chunks and provides users significant extra flexibility in configuring how the application aggregates and stages data for IO. The extra flexibility can lead to substantial performance gains in some instances over the optimizations in the parallel IO libraries themselves.
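The application-level repacking described above can be sketched as follows (a minimal illustration under assumed names, not the GIO code): each IO process copies the noncontiguous blocks it has gathered into one contiguous buffer, so the underlying parallel IO library can issue a single large write at a known file offset instead of many small strided writes.

```python
def pack_blocks(blocks):
    """blocks: list of (global_offset, values) pairs, one per gathered block.

    Returns (file_offset, buffer): the offset of the first element in the
    global linearized array, and one contiguous buffer covering all blocks.
    Assumes the blocks tile a contiguous range of the global array, which
    the linearization strategy is designed to guarantee.
    """
    blocks = sorted(blocks, key=lambda b: b[0])   # order by global offset
    buffer = []
    expected = blocks[0][0]
    for offset, values in blocks:
        if offset != expected:        # a gap would force multiple writes
            raise ValueError("blocks do not tile a contiguous range")
        buffer.extend(values)
        expected += len(values)
    return blocks[0][0], buffer

# Four blocks of 4 cells arriving out of order pack into one 16-cell write.
off, buf = pack_blocks([(8, [8, 9, 10, 11]), (0, [0, 1, 2, 3]),
                        (12, [12, 13, 14, 15]), (4, [4, 5, 6, 7])])
```

The payoff is that the (offset, buffer) pair maps directly onto a single collective write call in parallel NetCDF or HDF5, which is the large-block pattern the libraries handle best.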
The remainder of this paper will describe the GCRM IO API library (GIO), including: the data layout of the GCRM code itself, the data layout for files written by the API, the user interface to the API, the communication strategies for moving data to the IO nodes, and performance results for the API using a number of different communication strategies.
Section snippets
The geodesic grid and GCRM data layout
The GCRM code that will incorporate the IO API is being developed at Colorado State University. Because the GCRM development is occurring in parallel with the development of the IO API, the results reported here will be for simulations that were performed using a GCRM predecessor. This is a hydrostatic simulation code (HYDRO) that uses the multi-level grid solver that will be incorporated into the GCRM. This code uses a geodesic grid, which is the grid that will be used by the GCRM, and has a
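Geodesic grids of the kind referred to above are built by recursively bisecting the edges of an icosahedron; a standard property of this construction (stated here as background, not as code from the GCRM itself) is that a grid at refinement level r contains 10·4^r + 2 cells: 12 pentagons plus hexagons. The helper below, with assumed names, shows how the cell counts for the runs reported later arise.

```python
def geodesic_cells(level):
    """Number of cells in a bisected-icosahedron geodesic grid.

    Each bisection quadruples the 10*4**level non-polar cells;
    the +2 accounts for the two polar (pentagon-anchored) cells.
    """
    return 10 * 4 ** level + 2

# The R11 grid used in the results section (roughly 4 km resolution)
# has 41,943,042 cells. Excluding the two polar cells, the remaining
# 10 * 4**11 cells divide evenly among the 2560 processors used in
# most of the test runs (an even split is assumed for illustration).
ncells = geodesic_cells(11)
per_proc = (ncells - 2) // 2560
```

This even divisibility is what makes a block decomposition with contiguous, equal-sized per-processor chunks possible in the first place.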
Mapping output to NetCDF
Until recently, climate and weather models have primarily been simulated on structured grids that divide the latitude and longitude axes in even increments, resulting in logically structured simulation grids. Standard conventions for describing this data in the NetCDF data model have been formalized by the Climate and Forecast (CF) conventions. CF defines conventions and metadata standards that enable both human and computer interpretation of the data. Human interpretation is supported through
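Because the geodesic grid has no latitude–longitude structure to exploit, one common way to describe such a grid in NetCDF (a sketch in the style of CF cell-bounds metadata; the variable names and shapes here are illustrative assumptions, not the paper's exact file layout) is to linearize the cells into a single dimension and attach per-cell center and corner coordinates:

```python
def grid_variable_shapes(ncells, max_corners=6):
    """Return the dimensioned coordinate variables needed for the grid.

    Geodesic cells are hexagons except for 12 pentagons, so corner
    arrays are padded to max_corners entries per cell.
    """
    return {
        "grid_center_lon": (ncells,),                # cell centers
        "grid_center_lat": (ncells,),
        "grid_corner_lon": (ncells, max_corners),    # cell boundary polygon
        "grid_corner_lat": (ncells, max_corners),
    }

# A small R5 grid (10 * 4**5 + 2 = 10,242 cells) for illustration.
shapes = grid_variable_shapes(10 * 4 ** 5 + 2)
```

With this layout, every field variable is dimensioned by the same linear cells axis, so analysis tools can associate data with geometry without any structured-grid assumptions.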
API design
This section will describe in detail the design and implementation of the IO API. The structure of the API code is illustrated schematically below in Fig. 2. The IO API layer consists of a small collection of subroutine calls that are used to connect the IO library to the GCRM application. These subroutines are supplemented by two files. One provides a description of the data that will be written by the IO routines and the second describes the files that will be written. A large part of
Results
This section will provide a brief summary of current performance results for the IO API. A more detailed analysis of IO for this application is planned for a separate paper. A series of test cases were performed using the HYDRO model configured to run the Jablonowski test case originally described by Jablonowski and Williamson (2006). Most of the simulations were run using 2560 processor cores on an R11 grid (4 km resolution). The number of IO processors was varied between 160 and 2560
Conclusions
This paper has described the development and implementation of an IO API for a GCRM application code that is expected to produce output on the order of petabytes. Several issues needed to be addressed in order to develop such an API. These include determining what grid data and other metadata is required in the output files so that subsequent analyses and visualization can be performed on the data, developing a format for the data, developing an interface for the API, and creating algorithms to
Acknowledgments
The authors are indebted to Katie Antypas, Prabhat, and Mark Howison at the National Energy Research Scientific Computing Center (NERSC), Dave Knaak at Cray, Rob Latham and Rob Ross at Argonne National Laboratory, and Professor Wei-keng Liao at Northwestern University for invaluable help in getting the IO API running and optimized on the Franklin platform.
This work was funded by the U.S. Department of Energy’s (DOE) Office of Advanced Scientific Computing Research through its Scientific
References (28)
- Dabdub and Seinfeld, Parallel computation in atmospheric chemical modeling, Parallel Computing (1996)
- Maltrud and McClean, An eddy resolving global 1/10° ocean simulation, Ocean Modeling (2005)
- Neal et al., A comparison of three parallelization methods for 2D flood inundation models, Environmental Modeling and Software (2010)
- Von Bloh et al., Efficient parallelization of a dynamic global vegetation model with river routing, Environmental Modeling and Software (2010)
- Yu, Parallelization of a two-dimensional flood inundation model based on domain decomposition, Environmental Modeling and Software (2010)
- Adcroft et al., Implementation of an atmosphere–ocean general circulation model on the expanded spherical cube, Monthly Weather Review (December 2004)
- Saini et al., Scientific applications on the massively parallel BG/L machine
- Balaji, V., Adcroft, A., Lian, Z., Gridspec: A Standard for the Description of Grids Used in Earth Systems... (2007)
- Borrill et al., Investigation of leading HPC I/O performance using a scientific-application derived benchmark
- Ching et al., Efficient structured data access in parallel file systems