SRS-GDA: A spatial random sampling toolbox for grid-based hydro-climatic data analysis in environmental change studies Environmental Modelling and Software

We present in this paper the development of a new, open-source MATLAB toolbox SRS-GDA that aims to provide random spatial sampling of grid-based hydro-climatic datasets for environmental change studies. This toolbox addresses the needs of quantifying how hydro-climatic responses, which are often driven by grid-based forcing datasets such as climate model projections, vary with location and scale. The toolbox can be used to carry out random spatial sampling of grid-based quantities with various constraints: shape, size, location, dominant orientation and resolution. A case study of a large dataset, the GEAR rainfall dataset is supplied to demonstrate the typical uses case of this toolbox. The provision of the toolbox for downloading together with the sample data is also presented.


Introduction
Research in environmental changes has been increasingly relying on computer models driven by external forcing field and conditions that can represent changing factors such as temperature, precipitation and land uses (Erler et al., 2019;Alamou et al., 2017). Historically, the inputs used to drive these models were often relatively scarce, and only available at limited number of locations. Data collection was often restricted by technical conditions, instruments and means of storage. To make full use of such finite data, many methods have been proposed and applied. In terms of rainfall data, there are many methods for translating point rainfall records which are usually collected from hydrological gauges or stations to the basin rainfall. For example, the Areal Reduction Factor (ARF) has been widely used, possibly under different names in different countries (Weather Bureau, 1958;NERC, 1977). More recently, however, with the rapid advances in environmental monitoring technology, spatially disaggregated, grid-based hydro-climatic datasets have become gradually available with steady improvements in both accuracy and spatial-temporal resolutions. A typical case is the NIMROD weather radar system deployed by the UK Met Office which can provide up to 1km/5min precipitation distribution over the country (Golding, 1998;Fairman et al., 2017). Similarly, satellite-borne observations, such as the Global Precipitation Measurements (GPM; Islam et al., 2014;Ning et al., 2016) can now provide large scale coverage of the precipitation coverage in near real-time. Many environmental models nowadays are also tuned to make use of those new, grid-based, high resolution datasets, e.g. the Grid-to-Grid version of the PDM model has been developed and operationally used by the Environment Agency in the UK (Cole and Moore, 2008;Kay et al., 2009).
Another important source of external forcing data is model simulated hydro-climatic fields. In this case, rainfall, temperature as well as soil moisture fields generated by numerical weather models or climate models can be used to drive other model simulations. Practices of using the so-called coupled model approach started to gain momentum in the early 2000's when numerical weather models and climate models were able to produce simulation with high enough spatial resolutions, e.g. at tens of kilometres. As such, there have been plenty of studies since then, such as Bauer et al. (2015), Moufouma-Okia & Jones (2015) and many more inspired by the Hydrological Ensemble Prediction Experiments (HEPEX; Schaake et al., 2007) initiative. Datasets such as the ERA40 (Uppala et al., 2005) have been widely used since then. Although these datasets are not originally produced over sets of grids, or at least not the commonly recognised types of grids; they often are interpolated onto regular grids nevertheless in order to facilitate further analysis and to be used as other model inputs. For instance, global numerical weather models tend to use the Gaussian Grids, e.g., EAR40 grids. Local area model (LAM), such as the Weather Research and Forecasting model (WRF; Skamarock et al., 2001) uses regular grids spatially but does so only on a projected plane.
The importance and popularity of using those grid-based forcing data are underlined by the needs of many climate change impact studies where climate projections, such as those from the Coupled Model Intercomparison Project (Giorgetta et al., 2013;Covey et al., 2003), are normally provided over a set of regular longitude/latitude grids over the globe. To better facilitate the community in using these grid based data and encourage the interoperability among models, the Network Common Data Format (NETCDF, Rew and Davis, 1990) has become the de-facto standard in climate change impact studies, although other traditional formats such as GRIdded Binary (GRIB, Rutledge et al., 2006) or Hierarchical Data Format (HDF, Duane et al., 2000) are well supported as well.
In the context of using grid-based hydro-climatic datasets for providing external forcing field, an important step is to understand, quantify and if possible, to correct the errors and/or bias in these fields. The spatially variant nature of these data remains as the centre of the process. For example, Rojas et al. (2011) applied a statistical bias correction to improve the regional climate model (RCM)-driven climate simulations across the Europe; Rabiei and Haberlandt (2015) proposed to merge rain gauge measurements and weather radar data which is grid-based data by bias correction. Specifically for weather radar adjustment, many algorithms such as the Mean Field Bias (MFB) method and the Kriging with External Drift (KED) method, adjust the radar data solely by a multiplicative factor which does not vary spatially; however, more recently the Conditional Merge algorithm introduced by Sinclair and Pegram (2005) and implemented by Guenzi et al. (2016), considers the spatial impacts by conditioning the gauge adjustment on the radar precipitation values at gauge locations (Silver et al., 2019).
Apart from being used as inputs to the models, the grid-based hydroclimatic datasets are also a foundation to support further analyses on environment change both spatially and temporally. It is not surprising that nearly all published studies in this field have been done so on gridbased datasets, e.g., Du and Zhang (2019) identified the spatiotemporal variations and trends of precipitation and streamflow extremes in the Xiang river basin with gridded data of resolution 0.5 � and concluded that intensified summer extreme precipitation occurs mainly in the upper and middle of the basin and extreme streamflow has an increasing trend at the same region; Fairman et al. (2017) analysed the climatology of size, shape and intensity of grid-based precipitation features over Great Britain and Ireland; more application on grid-based data can be seen in Drusch et al. (2004); Thorndahl et al. (2017); Chen et al. (2015).
It is clear from the above examples that the grid-based hydro-climatic data have spatial patterns and characteristics with regards to certain changing factors that need to be diagnosed. Such diagnosis, without exception, is done over analysing targeted variable(s) and/or their combinations sampled spatially within predefined boundaries such as political regions  and river catchments (Monteiro et al., 2016). Further, to understand the random nature of the errors and uncertainties associated with the spatial data, the Monte-Carlo simulation approach is commonly used together with geo-statistical stochastic simulation for uncertainty quantification. A simple procedure is to perform simulations of points (can be data or events) randomly distributed in the predefined area, calculate the empirical distribution function of such inter-point distances in each case and then obtain further values of the statistic by goodness of fit (Besag and Diggle, 1977). Following this approach, some applications have been published, e.g., Smith and Cheeseman (1986); Xu et al. (2005) and Wu et al. (2018); however, application on hydro-climatic grid-based data remains scarce and many previous studies on spatial distributions of hydro-climatic variables were conducted over predefined areas.
Apparently, the substantial overhead of computer programming of spatial random sampling over often large-size hydro-climatic datasets has affected researchers' capacity of studying spatial-temporal variation of climatic features. To address this issue, we developed a Spatial Random Sampling toolbox for Grid-based Data Analysis (SRS-GDA) which can generate arbitrary samples from any grid-based dataset automatically. As an Open-source MATLAB toolbox, it can assist users in spatial random sampling with various constraints such as shape, size, location, dominant orientation and resolution. In the field of environmental change impact studies where the spatial properties of grid-based datasets remain as the focus, this toolbox addresses the needs of quantifying how hydro-climatic responses vary with location and scale. The grid size used by the SRS-GDA toolbox can be defined in line with any resolution of the base grid map. To increase the applicability of this toolbox, users can customise various sampling conditions and their combinations which can be directly applied to many environmental change studies.
This paper is structured as follows: first, a brief introduction of the study background and the main objective are provided, followed by the presentation of the methodology section. An example using case of analysing hydro-climatic extremes, i.e. precipitation over Great Britain using the GEAR dataset is provided to demonstrate the application of the toolbox. Finally, discussion on further applicability and availability of the model is presented.

The design and implementation of the SRS-GDA toolbox
The main aim of the SRS-GDA toolbox is to enable random spatial sampling of grid-based data within a pre-defined Region of Interest (ROI) of different sizes, shapes, locations and resolutions. The sampling procedure starts with a user-supplied grid dataset with spatial reference. It is also common to have an overall boundary (OB) from which the sampling is to be conducted, as many grid datasets have coverage normally much larger than that of the user's interest, such as the General Circulation Model (GCM) output around the globe. Normally, the OB should be set large enough for studying how the variation of locations can affect certain quantities represented by an ROI.
The randomisation of the sampling process is manifested by the ways of how the ROI is constructed: 1) Randomisation of the shape of the ROI. The shape of an area often plays an important role in various applications. For example, in hydrology, a so-called donor catchment is often desired to have a shape analogous to that of the ungauged, target one. Understandably, this process sets to be the most complex one in the SRS-GDA toolbox. We offer two options with regards to whether the shape of the ROI is of concern: the shape-unconstrained sampling which randomises the shape of ROI; and the shape-constrained sampling that makes use of a predefined geometric shape supplied by the user e.g. a polygon at a given scale. A special case is point or single-grid sampling whose ROI reduces to a single grid. This is also useful, for example, when studying the variation of point-measured quantities. 2) Randomisation of the location of the ROI. The location of an ROI can be varied by changing the coordinates of its centroid (for predefined ROI's) or its origin (for randomly generated ROI's). This operation is done by randomly setting a point or grid within the OB as the new location for the ROI to be moved to. An extra step is usually applied to ensure the entire region of the ROI falling within the OB. 3) Randomisation of the size of the ROI. Variation of the ROI size can help users to identify whether the aggregated data value over an area exhibits notable behaviour. A typical case, for example, is to study the extreme value distribution of a hydrometeorological variabletemperature or precipitation, at regional, national and global scale. This operation depends on whether the ROI is shape-constrained or not. If a predefined shape is used, a 'buffering' operation (Chang, 2008) is used to either increase or reduce the size whilst maintaining the shape unchanged; whereas for a shape-unconstrained case, the desired ROI is randomly produced with a given location and specified size.
These three operations can be combined to achieve the various levels of randomisation required by users. The implementation of the toolbox involves a series of steps that are described below and shown in Fig. 1 which includes: (a) Grid map generation which sets the overall boundary (OB) spatial coverage constraint and the resolution for the study (sampling) area; (b) Sampling setup that determines whether one or more constraints are used and sets the corresponding values and/or features, for example, location (fixed or floated), shape-unconstrained or shape-constrained, size fixed or not etc. and (c) Sampling processing and validation which are automatically carried out by the SRS-GDA toolbox based on the OB grid map and the constraint setups with extra filters applied to the results depending on extra conditions where appropriate.

Generating the grid-based overall boundary (OB) map
As mentioned previously, the underlying dataset normally comes with a coverage larger than that of users' interest. In other words, a subset based on an OB needs to be produced. This OB needs to be specified by the user, e.g., by using either a raster file or a vector based map such as shapefiles that define the boundary. If no OB is specified, the entire coverage of the underlying grid dataset will be used to conduct the sampling process. It should be mentioned that the sampling process often happens inside the OB. However, different from OB, the boundaries specified by the ROI's are deemed to be restrictive and arbitrary as far as a natural process is concerned, such as rainfall and wind speed. The logic behind sampling ROI in OB is because many times only the quantity of certain hydro-climatic variables falling in such given boundary is of concern, for example, rainfall over the urban area of a city is a key element for urban drainage design.
A grid-based map is then generated by rasterising the OB (if it comes as a vector map) using the same projection and grid resolution as the underlying dataset. The grids inside the OB are regarded as valid grids while those outside are invalid grids. Once this is completed, the toolbox will automatically exclude those invalid grids and activate the valid grids. For example, in the example case given in this paper, the National Grid Reference (NGR, Ordnance Survey, 1946) is used to refer to the coordinates of the grids of the GEAR dataset. The base map is processed to distinguish ocean (so called invalid grids outside the GB boundary) and land (so called valid grids inside the GB boundary). It is also further refined to have several versions with different spatial resolutions which are normally multiples (exact divisions) of the grid size of the underlying dataset. These refined OB's will be used for further study on aggregation (upscaling) and disaggregation (downscaling). The toolbox provides three resolutions to match the underlying grid dataset: 1 km � 1 km, 5 km � 5 km and 10 km � 10 km for user applying. And the base maps of the GB are produced with these three resolutions respectively, as shown in Fig. 5 where 1 km � 1 km is chosen for demonstrating the example case for being consistent with the resolution of dataset (details in 3.1).
In addition to setting the OB, another important task at this step is to spatially index the data grids and label those that contain valid data. From now on, all subsequent spatial sampling is conducted over (or within, to be more precise) the base map.

Sampling setup
There are four initial settings (also seen in Fig. 1b) that need to be specified before starting the sampling process which are: 1) Total number of samples required; 2) The desired location of the samples, which is only applicable in the case where users wish to fix the location while randomising other properties such as shapes and sizes; 3) Sample size in the unit of km 2 which is translated into numbers of grids at the finest grid resolution used; Note that this is only required if a size-constrained sampling is desired; 4) Spatial index of the ROI shape (i.e., samples) which is needed when a shape-constrained sampling is required. In this case, the ROI shapes are randomly generated as convex hulls having their spatial index (sp) value set by the user. In the case of shape-unconstrained sampling, the shape of the ROI's will be randomised. The spatial index (sp) is defined to indicate dominant spatial extension direction, e.g., north-south or west-east: where D NS and D WE refer to the north-south dimension and the west-east dimension of a sample (represented by a matrix). The reason of having sp as an attached indictor is that in many climate studies, the direction of an area (such as a river catchment) plays a crucial role in determining the amount of quantity, such as rainfall (Viviroli et al., 2003;Svensson and Rakhecha, 1998). Obviously, other indexes, such as the direction of the major axis, can be easily defined if required.

Sampling processing and validation
This is the final step (Fig. 1c) where samples are generated according to the initial settings. The methods discussed below correspond to the three main functions of SRS-GDA toolbox.

� Sampling with randomised locations
This function randomly selects different locations to set the centroids of the samples within the OB base map. The sampling is relatively straightforward: first x-and y-coordinates are sampled from the range of the OB maps in the two directions using a joint uniform distribution UðX; YÞ; followed by filtering out those samples that are not entirely within the OB.

� Sampling with randomised sizes
The second function is to randomly generate samples with different sizes, which is mainly used in the cases where the behaviour of aggregated quantity over the area of a sample is desired. Since the grid resolution A grid (in km 2 ) is known, the size of sample A sample can be translated into the number of grids N grids of sample of the ROI. The equation below shows the translation: The variation of the area of the ROI (the sample) is realized by applying a 'buffering' operation while keeping the centroid location unchanged, i.e., it only increases or decreases the main axis of the sample proportionately. Fig. 2 shows an example of shape generation.
� Sampling with randomised shape of ROI: unconstrained and constrained The third main function is to randomly generate samples in different shapes varying in both sizes and locations. Depending on the user's initial settings, this function can conduct both shape-unconstrained and shape-constrained sampling. In the former case, the location and the size of the sample (ROI) are both obtained from the two previous functions; for each combination of the location and the size, the shape is randomised using the size as a constraint. Two principles are applied in this process: 1) all grids should be interconnected, i.e. no isolated grids are allowed; 2) any growth must not go over the boundary set by the OB map.
The sampling starts at the given location and follows a random run to the neighbouring grid and records it until the number of grids equals N grids of sample . All the grids covered by the path are selected to comprise the sample. An extra validation step is applied to remove samples with holes inside (the so-called ill-set samples) and rerun the process until the required number of samples is met, as presented in Fig. 3. For the case of shape-constrained random sampling, it focuses on sampling with the shapes of convex polygons as seen in many hydrological catchments in environmental or climatic research. The working flow is shown in Fig. 4.
Unlike the shape-unconstrained method, the shape-constrained random sampling method produces more regular samples such as convex polygons. The main parameters such as the initial/centred location (L), sample size (S) and number (N) are the same as those required by the shape-unconstrained method. In addition, the shapeconstrained method uses one more major parameter the spatial index (sp) as a further constraint. If required, three optional parameters can also be set to further refine the control of the polygon generation, i.e. the number of angles (usually is greater than or equal to 3); the irregularity that indicates how much variance there is in the angular distance of vertices with a range of 0-1; the spikiness which indicates how much variance there is in each vertex from the average radius with a range of 0-1. However, as in the setup of the main parameters, L, S, N and sp, specification of these additional parameters are not compulsory. Unless otherwise specified explicitly by the user, the toolbox automatically generates default values for them (e.g. irregularity ¼ 0.3 and spikiness ¼ 0.1) to avoid producing extremely weird (irregularity ¼ 1) or sharp (spikiness ¼ 1) polygons. Compared with the shape-unconstrained random sampling method, it runs substantially faster because there is no need for random walking to grow the grids nor does it have any possibility of producing ill-set areas.

Dataset
One of the motivations of this example is to investigate how areal rainfall extremes in terms of their distributions can vary with locations, size and shapes of the ROI. In fact, there has been consensus about the impact of the size of catchment when producing areal rainfall at certain return levels. This is normally acknowledged by applying a so-called Areal Reduction Factor (ARF, Bell, 1976) to the value obtained at the location of the centroid of the catchment. Whilst variation of hydro-climatic variables is commonly recognised to be associated with the climatology, impact of the locations as well as the shape of the catchment have not been fully studied in a quantitative way. In our case, the 1-km gridded estimates of daily rainfall for Great Britain are analysed using a map of Great Britain roughly sized as 700 � 1250 km 2 . The rainfall estimates are derived from the Met Office national database of observed precipitation by using the UK rain gauge network. The natural neighbour interpolation methodology, including a normalization step based on average annual rainfall, was used to generate the daily estimates from 9am until 9am on the following day (Tanguy et al., 2016).

Application of the SRS-GDA toolbox
To be consistent with the precision of dataset, the OB base map is produced as the same grid size of 1 km 2 . The production of the OB map undergoes two steps: first, a rough sketch of the boundary of Great Britain (GB) is used to generate grids with very coarse resolution set as 100 km 2 . This is to ensure the boundary is properly covered. Secondly, the grid map is then refined by subdividing every grid with a number of smaller ones so that the grid resolution gradually increases to 5 km � 5 km and 10 km � 10 km, which allows for the detection and removal of those grids falling outside of the boundary. This process is shown in Fig. 5 including: (a) 638607 valid grids (marked as green) with the size of 1 km 2 ; (b) 9464 valid grids with the size of 25 km 2 ; (c) 2368 valid grids with the size of 100 km 2 .
Meanwhile, the location of the sample in this example study is chosen to be in London with the coordinate of L ¼ (520 km, 1070 km). Two random sampling methods, e.g. shape-unconstrained and shapeconstrained, are used to generate 5 different samples (N ¼ 5) at this location with the same size of 25 km 2 . According to Eq. (2), the number of grids in each sample (S) is calculated as 25 km 2 /1 km 2 ¼ 25. N, L and S are the basic inputs for SRS-GDA toolbox. Table 1 presents the 5 different samples around the initial location L (grey grid) generated by the shape-unconstrained random sampling method. It can be observed that all samples have grids interconnected with no hole inside. However, the shapes of the sample can be very irregular as there is no requirement that they need to be a convex polygon which is used in the shape-constrained sampling method.

Shape-unconstrained random sampling method
The shape-unconstrained sampling offers maximum freedom; however, it can inevitably introduce shapes with holes inside, which have to be rejected. Fig. 6 shows the steps involved to detect and remove those ill-set sample shapes: First the original sample is presented to the validation function (Fig. 6a) before it is converted into a binary image (Fig. 6b). Secondly, the inner area of the binary image is flooded to remove the potential holes which results in a hole-free image as shown in Fig. 6c. Finally, by comparing the areas of the two images, the location and the size of the hole(s) can be detected, which in turn triggers the removal process to discard the ill-set sample. In our test, the whole process of shape-unconstrained random sampling method takes 7.0 s on a low-configuration laptop to randomly generate five accepted samples  with sizes of 25 km 2 (specified as an initial constraint) while three samples are abandoned.

Shape-constrained random sampling method
Five samples at same location L (grey grid) generated by using shapeconstrained random sampling method are shown in Table 2 with various spatial indexes sp defined by the toolbox. Comparing with those samples listed in Table 1, clearly the shapes are more regular here as convex polygons, which can be directly used to simulate hydrological catchments. The whole process is recorded to have finished in 2.0s on our test PC, which is shorter than that from the former method. However, the tests show that the larger size and number are, the more efficient and time-saving the shape-constrained method is, compared with the shapeunconstrained method in Table 3. Fig. 7 summarises the steps taken for shape-constrained sampling starting with an arbitrary but convex polygon (with sp, irregularity and spikiness all set by the toolbox) set at the same location index L (grey grid).
The effect of the spatial index sp in the process of shape-constrained sampling is shown in Fig. 8 with larger values of sp having more north-   Fig. 6. The process of hole detection.

Table 2
Five example samples generated by shape-constrained sampling method. The value of the toolbox can well be appreciated in the analysis results, partly shown in Fig. 9, in finding the spatial variation of extreme rainfall over the GB. The entire analysis is not presented here; however, with the help of the SRS-GDA toolbox, we were able to reveal patterns never reported before. For example, a west-east variation of the rainfall distribution at different quantiles is clearly seen as "west high, east low" in Fig. 9a. What is more interesting is the symmetric pattern shown in Fig. 9b (around sp ¼ 1.0) with regards to the sample shape which implies that sampled areas with slight elongation in north-south direction are expected to have a higher amount of rainfall than those spread more in east-west direction at given frequency/return period. For samples with the same size and location, there is a remarkable difference in areal averaged rainfall between more elongated (e.g. sp ¼ 0.2 or 5.0) and rounded shape (e.g. sp ¼ 1.0), which can be attributed to heterogeneity of the grid rainfall distribution that cannot compensate to the areal average. The relationship between the sample size and the annual maximum daily rainfall (Fig. 9c) is shown to have largely followed what is expected, e.g., decrease of areal rainfall as sample size grows.

Table 3
Comparison of the indicative speed of the two sampling methods: Method 1 the shape-unconstrained method and Method 2 the shape-constrained method. Note that the numbers are obtained on our test PC and for comparing the relative speed difference.

Conclusions and availability of the toolbox
In this paper, we discuss the development of a new MATLAB toolbox for spatial random sampling in grid-based data analysis (SRS-GDA). The main aim of the toolbox is to address the very needs of many climate change related studies on spatial-temporal diagnostics of hydro-climatic datasets. An example application case is given in which the implementation details are discussed. Our initial applications show that with this toolbox, several important variation patterns of extreme rainfall (due to be published separately) over GB that have yet to be reported are clearly identified. Based on the promising results, we expect this toolbox, thanks to the availability of its source code, will help the related research community in their analyses of grid data sets and gain further insight into the underlying science.
The source code of the toolbox as well as the example case given above are available at the GitHub (https://github.com/wanghan924/ SRS-GDA_Toolbox.git). The source code is provided subject to a GPL V3 licence. Use/fork of the toolbox is subject to proper acknowledgement as stated on the Webpage of the toolbox.

Declaration of competing interest
The authors declare no conflict of interest. Fig. 9. The dependencies on the locations, the spatial index and the size of rainfall distribution over GB: (a) the east-west pattern and (b) the symmetric pattern with regards to the sampled shape and (c) the trend pattern with regards to the sampled area size as detected by using the toolbox discussed in this paper. EP is short for "Exceedance Probability".