Australian gridded chloride deposition-rate dataset

Chloride deposition-rate measurements at points within Australia are upscaled to the entire continent on a regular 0.05° grid. The upscaling uses a double-exponential correlation between deposition rate and distance to the coast, where the parameters in the double-exponential are spatially varying. These parameters are estimated using least-squares with Tikhonov regularisation to ensure minimal spatial variability. A calibration-constrained, null-space Monte-Carlo analysis is used to quantify uncertainty in the prediction. The resulting dataset consists of the best-fit chloride deposition rates across Australia, as well as estimates of uncertainty. The dataset can be used for various purposes including: estimating groundwater recharge through the use of the chloride mass-balance method; catchment salt balance estimates; regional investigations of groundwater hydrochemistry; and, corrosion prediction.


Specifications
Subject area: Environmental Science: Hydrology and Water Quality
Specific subject area: Chloride deposition rate in Australia
Type of data: Tables, rasters, computer scripts
How the data were acquired: Raw data were acquired from a literature review of published and unpublished sources listed in Table 1. The upscaling to gridded chloride deposition rates across Australia was performed using the PEST software, along with the python pykrige package and custom python scripts included with the data.
Data format: Analysed
Description of data collection: Source chloride deposition data from the literature review were weighted 1.0 if there was at least one year of continuous collection of rainfall with well-documented methods; 0.8 if reliable but not published; 0.6 if less than a full year of data collection but all rainfall was collected; 0.4 if not fully documented; 0.2 if less than a full year and not all rainfall was collected; and 0 otherwise. Output rasters are summaries of 1000 replicates.
Data source location: See

Value of the Data
• Chloride deposition rates have been measured at 367 isolated locations across the continent, but this density is insufficient for many purposes. This dataset upscales these data to a regular gridded deposition rate, with associated uncertainty, that can be interrogated at any point in Australia, making it useful for studies away from the 367 measurement locations.
• Hydrologists, hydrogeologists and hydrogeochemists will find this dataset useful for defining a baseline chloride input to the landscape at point, catchment or continental scale.
• These data can be used to estimate groundwater recharge through the chloride mass-balance method, are useful in catchment salt-balance studies and investigations of groundwater hydrochemical evolution, and can be used for predicting corrosion.

pilot_to_obs.py: python3 script that uses kriging to interpolate the values of A1, A2, l1 and l2 (see Methods section) in the pilot-point file to observation points. PEST treats this as "the model" and runs it many times while iterating to the best solution for A1, A2, l1 and l2. PEST also uses it during the uncertainty analysis.
randpar_1.py: python3 script that generates the PEST "par" files used in the PEST uncertainty analysis.

Raw (primary data)
generate_pest_files.py: python3 script that uses the observations and the pilot points to write the PEST tpl, pst and ins files, as well as the windows batch file to run the "model" pilot_to_obs.py.
run_uncertainty.bat: windows batch file to run the null-space Monte-Carlo uncertainty analysis.

Weighting the observations
Each observation is provided with a weight between 0 and 1, w_p, depending on the relative quality of its field measurements, which is also listed in the raw data observations_update_v04.csv. It is generally assumed [2][3][4] that a complete year of data is the minimum time period required to obtain a representative estimate of the chloride deposition. The weight for each observation was chosen as follows:
• 1.0: most reliable data available, at least one year of continuous collection of rainfall with well-documented methods
• 0.8: reliable source of data but not published
• 0.6: less than a full year, but all rainfall collected (typically the tropics)
• 0.4: reliable source of data but not fully documented
• 0.2: less than a full year, and not all rainfall collected
• 0: not used in calibration, data of very poor quality
The result is shown in Fig. 1.
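As a minimal sketch, the weighting scheme above can be encoded as a simple lookup table. The category labels here are hypothetical names for illustration only; they are not fields in the raw data file.

```python
# Hypothetical labels for the data-quality categories described in the text,
# mapped to the weights w_p used in the calibration.
QUALITY_WEIGHT = {
    "full_year_documented": 1.0,      # >= 1 yr continuous collection, well-documented
    "reliable_unpublished": 0.8,      # reliable source but not published
    "partial_year_all_rain": 0.6,     # < 1 yr, but all rainfall collected (tropics)
    "not_fully_documented": 0.4,      # reliable source, not fully documented
    "partial_year_missed_rain": 0.2,  # < 1 yr and not all rainfall collected
    "poor_quality": 0.0,              # excluded from calibration
}

w = QUALITY_WEIGHT["partial_year_all_rain"]
```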

Relationship with distance to coast
Davies and Crosbie [1] explored correlations between the observed deposition rates and distance to the coast, elevation, terrain slope, difference in angle between the aspect of the slope and direction to the coast, mean annual rainfall and mean annual windspeed. The only strong correlation was between deposition and distance to the coast. Based on work by Keywood [5], they postulated a relationship of the form:

D = A1 exp(−d/λ1) + A2 exp(−d/λ2)    (1)

Here:
• D is the deposition rate, with units kg.ha−1.yr−1;
• d is the distance to the coast, with units km;
• A1 is a coefficient, with units kg.ha−1.yr−1;
• A2 is a coefficient, with units kg.ha−1.yr−1;
• λ1 is a decay constant, with units km;
• λ2 is a decay constant, with units km.
The physical motivation for this form is that there are two sources of deposition, as described by Keywood [5]:
• the "fast" component, A1 exp(−d/λ1), which represents the wet and dry deposition in aerosols sourced from the ocean;
• the "slow" component, A2 exp(−d/λ2), which represents the deposition of gaseous chloride that is formed by volatilisation of chloride in sea salt at pH less than 3.
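The double-exponential model is straightforward to evaluate. The sketch below uses hypothetical parameter values (not the fitted ones from this dataset) purely to illustrate how the fast and slow components combine:

```python
import numpy as np

def chloride_deposition(d, A1, lam1, A2, lam2):
    """Double-exponential deposition model: D(d) = A1*exp(-d/lam1) + A2*exp(-d/lam2).

    d: distance to coast (km); A1, A2: kg/ha/yr; lam1, lam2: km.
    """
    d = np.asarray(d, dtype=float)
    return A1 * np.exp(-d / lam1) + A2 * np.exp(-d / lam2)

# Hypothetical parameters: a fast coastal term decaying within tens of km,
# and a slow term persisting for hundreds of km inland.
D = chloride_deposition([0.0, 30.0, 500.0], A1=100.0, lam1=30.0, A2=5.0, lam2=500.0)
```

At the coast the two coefficients simply add (D(0) = A1 + A2), while far inland only the slow term survives.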
This correlation is also clear in the 367 observations considered here, as shown in Fig. 2 , where a least-squares fit results in:

Fig. 2.
Correlation between the observed chloride deposition and the distance to the coast. The blue and orange lines are fits to the near-coast and central-region data, respectively. The green line is the resulting fit from Eq. (1). The shaded region and σ are described below.
The least-squares fits are performed to provide a suitable initialisation for the PEST algorithms described below; they are not used in any other analyses.

Scatter of repeat measurements
The Monte-Carlo uncertainty analysis described below requires uncertainty to be quantified in one form or another (more specifically, the prior or posterior uncertainty distribution can be quantified). One way of estimating uncertainty is by exploring data from observation sites where repeated measurements have been performed. If, for example, the observations remain unchanged with time, then it is reasonable to infer that the observations are highly reliable, which means the prediction will be more reliable than the case where the observations at each site fluctuate wildly.
It is generally assumed [2][3][4] that a complete year of data is the minimum time period required to obtain a representative estimate of the chloride deposition, and the amount of chloride deposition is not considered dependent upon the amount of rainfall. However, Davies and Crosbie [1] found that there is some dependence on rainfall amount.
Katherine in the NT provides the best example in the dataset of temporal fluctuations in chloride deposition, and illustrates that using a single year of observation can result in considerable uncertainty. At Katherine, there have been 5 independent studies measuring the chloride deposition, spread over five decades and ranging from 1 to 5 years in duration. These studies have reported chloride deposition rates of between 2.46 and 7.30 kg.ha−1.yr−1 [5][6][7][8][9].
In the dataset collated here, there are 19 sites that have repeat measurements. The difference in the observations is a measure of the uncertainty in the chloride deposition within the observed dataset. The mean difference of the natural-log transformed data is 0.68; if the observations that do not have a weight of 1 are excluded, this decreases to 0.44, as shown in Fig. 3.

Scatter of observations around distance-to-coast formula
Another way of estimating the uncertainty of the prediction is to explore the scatter of the observations around the model given by Eq. (1). The difference between ln D of Eq. (1) and the logarithm of the observations is called the residual, which is plotted in Fig. 4. The scatter is approximately normally distributed, and the standard deviation, σ, of the residuals may be computed, which results in: where d is measured in km (this is also shown in Fig. 4, and assumes that D is measured in kg.ha−1.yr−1). The standard deviation, σ, is only weakly dependent on distance from the coast, which may reflect that experimental error is virtually independent of where the sample is collected, or that the model of Eq. (1) is equally valid at all points. Using this value to perturb the fit of ln D expressed through Eq. (1) results in the green shaded region of Fig. 2.
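As an illustration of how such residuals are computed, the sketch below uses hypothetical observations and illustrative model parameters; neither the data values nor the parameters are those of the real dataset:

```python
import numpy as np

# Hypothetical observed depositions and distances to coast (not the real data).
d_obs = np.array([1.0, 10.0, 50.0, 200.0, 800.0])   # km
D_obs = np.array([90.0, 30.0, 12.0, 4.0, 1.5])      # kg/ha/yr

def model(d, A1=100.0, lam1=30.0, A2=5.0, lam2=500.0):
    """Eq. (1) with illustrative (not fitted) parameters."""
    return A1 * np.exp(-d / lam1) + A2 * np.exp(-d / lam2)

# Residual = ln(observation) - ln(model); its spread estimates sigma.
residuals = np.log(D_obs) - np.log(model(d_obs))
sigma = residuals.std(ddof=1)
```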

Pilot points and kriging scheme
The method of Davies and Crosbie [1] uses the PEST software [10] to estimate A 1 , A 2 , λ 1 , λ 2 at "pilot points" scattered throughout Australia, and to interpolate these to the remainder of Australia using kriging. The PEST process is described in sections below: this section describes the pilot points and kriging.
An evenly spaced grid of pilot points is used here, with spacing being 2 degrees in latitude and longitude, as shown in Fig. 5 .
Ordinary kriging is used to interpolate A 1 , A 2 , λ 1 , λ 2 from the pilot points to a finer grid of 0.05 degrees ( ∼5 km) across Australia. The python PyKrige software package [11] is employed.
The key ingredient in the kriging process is the variogram. In keeping with Davies and Crosbie [1], an exponential variogram is used here:

γ(d̂) = n + p (1 − exp(−d̂/r))

where
• p is the partial sill, with value 100;
• d̂ is the distance between points, measured in degrees;
• r is the range, with value 120 deg;
• n is the nugget, with value 1.
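This variogram can be sketched directly; note this is the standard analytic form of the exponential variogram, and PyKrige's built-in "exponential" model applies an internal scaling to the range parameter, so its fitted parameters are not directly comparable:

```python
import numpy as np

def exponential_variogram(d_hat, p=100.0, r=120.0, n=1.0):
    """Exponential variogram gamma(d) = n + p*(1 - exp(-d/r)).

    d_hat: separation distance in degrees; p: partial sill;
    r: range (degrees); n: nugget.
    """
    d_hat = np.asarray(d_hat, dtype=float)
    return n + p * (1.0 - np.exp(-d_hat / r))

# gamma rises from the nugget at zero separation towards the sill (n + p).
gamma = exponential_variogram([0.0, 120.0, 1200.0])
```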
Some example results are shown in Fig. 6. Only certain aspects of the variogram are important (for example, the overall scaling is irrelevant to the result), and different variograms can produce similar results (compare the base "exponential" type with the linear variogram in Fig. 6). The final chloride-deposition results are not strongly dependent on the form of the variogram.

Core PEST process
The PEST software [10] is used to find the best values (in a least-squares sense) of A 1 , A 2 , λ 1 , λ 2 at each pilot point. Since there are 205 pilot points, this means the minimisation involves 820 unknowns. Each iteration of the PEST process involves the following steps.
1. Initial values for the 820 unknowns are provided from the least-squares fits of Fig. 2.
2. The kriging procedure described above is used to find A1, A2, λ1, λ2 at each observation point.
3. Eq. (1) is used to compute the modelled deposition rate at each observation point.
4. The objective function is computed:

J = Σ_p [w_p (log10 D_p(model) − log10 D_p(obs))]²

where p indicates the observation point, and w_p is the weight of each observation. Note that log10 is used here, in contrast to the ln used above.
5. The Jacobian, which is the change of the objective function with respect to changes in A 1 , A 2 , λ 1 , λ 2 at each pilot point, is calculated. This is used to supply new values for the 820 unknowns such that J will decrease. The process then starts again from Step (2) using these new values.
In the simplest case, PEST minimises the error, J, to provide the best values of A 1 , A 2 , λ 1 , λ 2 .
However, given that the number of unknowns (820) is greater than the number of observations (367), this simple model is likely to suffer from overfitting, and PEST can quite easily find solutions with small J. The overfitting is eliminated by using two additional ingredients. Firstly, a target J is provided to PEST to define when the iterative procedure described above should stop. Above, the uncertainty inherent within the observations was found to be approximately 0.44 on a natural-log scale. Assuming these uncertainties are representative of an acceptable mismatch between the model and the data, then for 367 observations a reasonable target is J_target ≈ 367 × (0.44/ln 10)² ≈ 13.5. This alone does not eliminate overfitting: instead, it simply provides PEST freedom to find multiple solutions.
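A minimal sketch of a weighted log10 objective of this kind, and of the stopping-target arithmetic, assuming the usual PEST convention that weights multiply residuals before squaring:

```python
import numpy as np

def objective(D_obs, D_model, w):
    """Weighted least-squares objective in log10 space:
    J = sum_p (w_p * (log10 D_model - log10 D_obs))^2."""
    r = np.log10(np.asarray(D_model, dtype=float)) - np.log10(np.asarray(D_obs, dtype=float))
    return float(np.sum((np.asarray(w, dtype=float) * r) ** 2))

# Stopping target from the text: 367 observations with ~0.44 natural-log
# scatter, converted to log10 via division by ln(10).
J_target = 367 * (0.44 / np.log(10)) ** 2
```

A perfect match gives J = 0, and the target sits near 13.5 as stated in the text.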
To find the solution that agrees most strongly with the model in Eq. (1), a second ingredient is used: PEST allows weak constraints to be placed on the unknowns. This feature is used to ensure that A1, A2, λ1, λ2 are similar at nearby pilot points, meaning the parameters determining the chloride deposition prediction will "vary smoothly" over the continent. Specifically, the constraints take the form C_A1(i,j) = log10 A1(i) − log10 A1(j) = 0 with weight w(i,j), and similarly for A2, λ1, λ2, where i and j indicate pilot points. Given the distance d̂ (in degrees) between two pilot points, the weights are w = 10/d̂ for A1 and A2 if d̂ ≤ 5 deg, and zero otherwise. These are similar to Davies and Crosbie [1], who used approximately: PEST forms the "regularisation objective function" from these constraints and uses the Tikhonov process to minimise it as well as J.
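The distance-based constraint weights can be sketched as follows (the function name is hypothetical; only the 10/d̂ rule and the 5-degree cutoff come from the text):

```python
import numpy as np

def constraint_weight(d_hat, cutoff=5.0):
    """Weight on the smoothness constraint between two pilot points
    separated by d_hat degrees: 10/d_hat within the cutoff, else 0.
    Assumes d_hat > 0 (distinct pilot points)."""
    d_hat = np.asarray(d_hat, dtype=float)
    return np.where(d_hat <= cutoff, 10.0 / d_hat, 0.0)

# Nearby pilot points are constrained strongly; beyond 5 deg, not at all.
w = constraint_weight([2.0, 5.0, 6.0])
```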

Calibration-constrained null-space Monte Carlo approach
Following Davies and Crosbie [1], a calibration-constrained null-space Monte Carlo analysis is used to estimate the uncertainty in the PEST prediction. The idea is that, while PEST has provided a set of A1, A2, λ1, λ2 that minimise the error expressed through J and the constraints, there may be many other parameter combinations that are just as good. The procedure, called a "calibration-constrained null-space Monte Carlo analysis" [12], involves identifying many model scenarios that fit the observations equally well (i.e., also have small J), and is explained in detail below. All these model scenarios are derived from a stochastic (Monte Carlo) process and, after subsequent adjustment, they all provide an acceptable fit to the historical observations. This procedure is also performed by PEST, using the following steps:
1. Pre-calculation step.
a. A1, A2, λ1, λ2 are set to their best-fit values found by PEST in the previous step, the constraints are removed, and the Jacobian is calculated by PEST. The Jacobian provides an estimate of the change in J as each of the parameters (A1, A2, λ1, λ2 at the pilot points) is varied individually.
b. PEST is used to find the "null-space". These are certain combinations of A1, A2, λ1, λ2 that do not increase J "too much". In this case, the optimisation procedure described above resulted in J = 13.5 at the optimal solution. Here, "too much" is defined as J ≤ 15.5. For example, it may be found that varying A1 at a pilot point in central Queensland does not increase J by too much, so this A1 is part of the null space. Of course, changing this A1 may impact the prediction of chloride deposition in Queensland, but it does not impact the agreement with the observations too much.
c. In the case at hand, PEST finds that the solution space contains only about 130 independent parameters: most of the 820 parameters are actually in the null space and do not impact the result too much.
2. Null-space Monte Carlo step. 1000 random parameter sets are generated. Recall from Fig. 3 that the uncertainty in the observations is roughly 0.44 or 0.68, and from Fig. 4 that the scatter is 0.7. These values provide a rough idea of the size of the desired variations in ln D in the random parameter sets.
a. Monte Carlo process: The 1000 sets are generated starting from the best-fit values mentioned in Step (1). Then, at all pilot points, the values of A1, A2, λ1, λ2 are randomly perturbed using the following procedure:
i. For each pilot point, a random number, z, is chosen from a normal distribution with zero mean and standard deviation σ = 0.4 (suggested by Figs. 3 and 4). The value of z is added to ln D given by Eq. (1), that is, ln D_new = ln D + z. To implement this, A1, A2, λ1, λ2 at the pilot point in question are altered. Of course, the changes in the individual values A1, A2, λ1, λ2 are not unique; for instance, only A1 could be changed, leaving the remaining quantities fixed. The following steps define the random procedure for perturbing A1, A2, λ1, λ2.
ii. Define r to be the fraction of the total deposition contributed by the "fast" term. Choose another random number, z_r, from a uniform distribution with minimum 0 and maximum r. Then the "fast" term is perturbed by z_r z, while the "slow" term is perturbed by (1 − z_r) z. This means, for instance, that if the "fast" term is much smaller than the total (such as in central Australia, so r ≪ 1) then its perturbation is small. Similarly, if the "slow" term is much smaller than the total (such as near the coast, so r ∼ 1) then its perturbation is small. This avoids perturbing terms that do not greatly impact the chloride deposition prediction. The result is approximately ln D_new ≈ ln D + z. To quantify the "approximately": the process described here produces perturbations of ln D that are normally distributed with standard deviation 0.35 instead of 0.4, which is reflected in the results.
iii. Consider the "fast" term. The quotient z_r z / ln(A1) is the fractional perturbation desired in A1 if λ1 were kept fixed. This gives an estimate for the size of the perturbations required. Choose another random number, z_1, from a uniform distribution with minimum −z_r z / ln(A1) and maximum z_r z / ln(A1), and use it to set the new value of λ1.
iv. The same method is used for the "slow" term (substitute "2" for "1" in the above).
b. Null-space process: Any random perturbations that are not part of the null space are zeroed. For instance, random perturbations of the A1 mentioned in Step (1b) would be retained, since it is part of the null space. On the other hand, if λ1 at a certain pilot point in Western Australia were not part of the null space (i.e., it does impact the agreement with the observations), then its random perturbations would be zeroed, so it would assume its best-fit value in each of the 1000 parameter sets.
The result is 1000 parameter sets that should agree with the observations just as well as the original best fit of Step (1). However, because of nonlinearities, perfect agreement is rare: the mean of J over the 1000 parameter sets is J = 13.6 (with standard deviation 0.07). Because the Monte-Carlo process independently alters A1, A2, λ1, λ2 at each pilot point, the values of the regularisation objective function, R, are substantially higher than the best-fit value of around 250: the mean of R over the 1000 parameter sets is R = 1600 (with standard deviation 190). Using the kriging described above, each of these 1000 sets can be used to provide a prediction of chloride deposition over Australia. At each point in Australia, the mean and various other statistics of these 1000 predictions may be calculated. This yields the output datasets, shown graphically in Fig. 7.
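Summarising the replicates into output rasters amounts to per-cell statistics over the stack of realisations. The sketch below uses synthetic lognormal grids in place of the real 1000 model runs:

```python
import numpy as np

# Synthetic stand-in for the 1000 replicate deposition grids
# (here a tiny 4 x 5 grid; the real rasters cover Australia at 0.05 deg).
rng = np.random.default_rng(0)
replicates = rng.lognormal(mean=1.0, sigma=0.4, size=(1000, 4, 5))

# Per-cell summary statistics across the 1000 realisations.
best_estimate = replicates.mean(axis=0)
p05, p95 = np.percentile(replicates, [5, 95], axis=0)
```

Each summary grid has the same shape as a single replicate, so the statistics can be written out as rasters directly.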

Contextual Summary
Hydrologists, hydrogeologists and hydrogeochemists will find the resulting datasets useful for defining a baseline chloride input to the landscape at point, catchment or continental scale. The results can be used to estimate groundwater recharge through the chloride mass-balance method, are useful in catchment salt-balance studies and investigations of groundwater hydrochemical evolution from recharge areas to deeper parts of aquifers, and can be used for predicting corrosion of infrastructure assets.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.