Infilling missing precipitation records – A comparison of a new copula-based method with other techniques
Introduction
Hydrological observations are unique in space and time. If not observed at a given location and time the values can only be estimated. Many applications, such as the calculation of water balances, calibration of hydrological models or the provision of unbiased ground truth for remote sensing require full datasets. Thus a reliable estimation of the missing observations is of great importance. The problem is exacerbated by the ubiquitous decimation of gauge networks.
We define infilling as the process of repairing data-sets where observations are missing due to late starting, early ending or disruption of data collection. The process depends on nearby, relatively plentiful data which can be trusted; if they are too sparse, there is no way of manufacturing an information transfer link. In contrast to the task of infilling, we define interpolation as the activity of estimating values of variables in the space between observed and infilled data, either temporally or spatially. We do not perform interpolation in this paper, although some of the methods we use here for infilling are often used for spatial interpolation (like Kriging).
Some practitioners treat the infilled data as observations, because they may perceive no other option, but that ignores the limitations of the infilling methods. For what purpose does one use infilled raingauge data? How is infilling typically performed? There are typically three ways:
- (i) as a snapshot, neglecting the time series,
- (ii) as multiple time series, or
- (iii) as a combination of the first two.
If we are to perform infilling of multiple time series we need to ensure that seasonality is removed by transformation, or alternatively infill season by season. How does one treat the variability of the marginal distributions from season to season? How does one cope with the varying length of dry periods over seasons? How does the infilling treatment affect the extremes?
How does one handle these problems? If one uses pure regression methods to estimate individual values, the estimates are likely to be biased towards the median; it is unusual for an infilling method to estimate a missing value outside the range of the observations. This remark applies particularly to the following methods: Nearest neighbor(s), Inverse Distance Weighting, (Multiple) Linear Regression, Ordinary Kriging and Neural Networks (Dumedah and Coulibaly, 2011; Mwale et al., 2012). These give point estimates either with confidence intervals (usually Gaussian and therefore symmetrical around the estimate) or with none; thus these methods are local smoothers. Even the interesting Q–Q transform method introduced by Teegavarapu (2013) is not capable of extrapolating outside the observation interval because the quantiles are limited to observations.
Kriging in its many forms is popular, but it has problems that are not usually taken into account. Besides being applied, as an implicitly Gaussian method, to raw and often skewed data, Kriging uses a global variogram for a given snapshot. Thus in its simplest form Kriging is unable to account for local variations in spatial dependence, as is achieved in multiple regression, for example.
To summarize: to use infilled data as if they have observation status is to delude oneself; it is necessary to determine the uncertainty of the infilled data and use whatever information they offer with caution and care when using them in interpolation or other modeling activities. If this is not done, then interpolation becomes a form of smoothing with point estimates which, for example, cannot model extremes. We will not address spatial interpolation further in this paper, reserving it for a follow-on.
There are two different goals one can follow for data infilling:
- 1. To minimize the expected variance of the estimation error.
- 2. To obtain a realization such that the statistics of the augmented dataset comply with the incomplete dataset.
Several authors have treated this topic, infilling precipitation, discharge, soil moisture or evapotranspiration data. The temporal resolution varies between days and months. Infilling precipitation data on short time scales (up to several days) is particularly challenging due to its intermittent behavior. Zero observations cannot be treated the same way as positive ones. There is usually a certain radius around dry stations where there is no rainfall.
The methods used for infilling also differ strongly. Traditionally, statistical methods were used for this purpose. Johns et al. (2003) describe a method based on Bayesian statistics to infill precipitation and temperature records. Kondrashov and Ghil (2006) use Singular Spectrum Analysis for spatial and temporal infilling. A comparison of the performance of different infilling methods, together with a non-linear model for the selection and weighting of available observations for infilling, was presented by Teegavarapu (2012). Recently, methods from artificial intelligence have also been used for infilling: self organizing maps (Mwale et al., 2012) and neural networks (Coulibaly and Evora, 2007).
Most methods used for best estimation (corresponding to the first goal) lead to a reduction of the variance, and thus to a distortion of the distributions. Recently, Teegavarapu (2013) suggested the use of a post-processing quantile/quantile transform to spatially interpolate missing values.
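The variance reduction described above can be seen in a small synthetic experiment. The following is a minimal sketch and is not taken from the paper: it assumes two hypothetical gauges with correlated, skewed values, fits an ordinary least-squares line, and compares the variance of the fitted values with that of the observations. For simple linear regression with an intercept, the in-sample ratio of the two variances equals the squared correlation, so the estimates are less variable than the data unless the correlation is perfect.

```python
import numpy as np


def regression_variance_ratio(x, y):
    """Fit y ~ a + b*x by least squares.

    Returns (var(fitted) / var(y), squared Pearson correlation).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # highest degree first
    fitted = intercept + slope * x
    ratio = np.var(fitted) / np.var(y)
    r = np.corrcoef(x, y)[0, 1]
    return ratio, r ** 2


# Synthetic, hypothetical data: a shared skewed signal plus independent noise.
rng = np.random.default_rng(0)
shared = rng.gamma(shape=0.8, scale=5.0, size=2000)
neighbor = shared + rng.gamma(shape=0.8, scale=2.0, size=2000)
target = shared + rng.gamma(shape=0.8, scale=2.0, size=2000)

ratio, r_squared = regression_variance_ratio(neighbor, target)
print(f"var(estimates) / var(observations) = {ratio:.3f}")
print(f"squared correlation                = {r_squared:.3f}")
```

The two printed numbers agree, which is why a dataset infilled with regression point estimates has a narrower distribution than the observed record.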
Infilled values are not measured values, they are only estimates. Their uncertainty should be considered for any subsequent application. Therefore it is of great importance to quantify this uncertainty and, if possible, specifically for each infilled datum.
In this paper a new spatial copula based methodology is developed and compared to other infilling methods. Copulas were applied for spatial interpolation of groundwater quality (Bárdossy and Li, 2008) and precipitation (Bárdossy and Pegram, 2013). The suggested methodology provides an estimate of the conditional distribution of precipitation at the target location. This distribution can be used to obtain a minimum error estimator (by using the conditional mean) or to obtain simulations reproducing the observed variability at the target location. Zero precipitation values are treated as censored observations of a continuous distribution. Low precipitation amounts (for example <1 mm/day) are measured with a high relative error. Precipitation below such a threshold contributes very little to the total precipitation and is hydrologically irrelevant. As a consequence, the infilling strategy employed in this study provides a probability of zero, a probability of being above zero and below a given threshold and a continuous distribution for the precipitation amounts exceeding the threshold.
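The three-part form of the infilled result (a probability of zero, a probability of a small amount below the threshold, and a continuous distribution above it) can be sketched as a small data structure. This is only an illustration of how such an output might be used for the two goals, not the paper's copula method: the class name, the uniform spread of small amounts and the exponential tail are all assumptions introduced here.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class InfilledDailyDistribution:
    """Hypothetical container for the infilled distribution of one day."""

    p_zero: float     # probability of a dry day
    p_small: float    # probability of an amount between 0 and the threshold
    threshold: float  # mm/day
    tail_scale: float # mean exceedance above the threshold (assumed exponential)

    @property
    def p_wet(self):
        """Probability of exceeding the threshold."""
        return 1.0 - self.p_zero - self.p_small

    def mean(self):
        """Conditional mean, i.e. the minimum error estimator (goal 1).

        Assumes small amounts are uniform on [0, threshold).
        """
        small_part = self.p_small * 0.5 * self.threshold
        wet_part = self.p_wet * (self.threshold + self.tail_scale)
        return small_part + wet_part

    def sample(self, n, rng):
        """Draw n realizations that keep the variability (goal 2)."""
        u = rng.random(n)
        out = np.zeros(n)  # dry days stay at zero
        small = (u >= self.p_zero) & (u < self.p_zero + self.p_small)
        wet = u >= self.p_zero + self.p_small
        out[small] = rng.uniform(0.0, self.threshold, size=small.sum())
        out[wet] = self.threshold + rng.exponential(self.tail_scale, size=wet.sum())
        return out


dist = InfilledDailyDistribution(p_zero=0.6, p_small=0.1, threshold=1.0, tail_scale=8.0)
draws = dist.sample(200_000, np.random.default_rng(1))
print(dist.mean(), draws.mean(), (draws == 0).mean())
```

The conditional mean gives a single smooth estimate, whereas the sampled values include both dry days and large amounts, which is the distinction between the two infilling goals.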
To assess the efficacy of the methods, we perform intercomparisons based on cross-validation. The infilling methods were tested on data from selected stations in the Southern Cape in South Africa. Observed daily precipitation data were available for 13 stations for the 32-year period 1971–2002. Some data were missing from the station records but most were intact. To perform the cross-validation, we divided the year into two seasons and then selected each station in turn, hiding 8 sets of 4 years at a time. This was done for daily and monthly data. Each hidden value was estimated using the remaining intact data, and the procedure was repeated for each station. The result was that over 70,000 daily and over 2,000 monthly data were used in the cross-validation study.
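The cross-validation layout can be enumerated in a few lines. This is a sketch under stated assumptions: the 4-year sets are taken to be contiguous blocks (1971–1974, 1975–1978, and so on), which the text does not state explicitly; the month sets follow the season definition given in the Application section; and the function names are hypothetical. How a warm season that spans two calendar years is assigned to a block is not specified in the source and is not handled here.

```python
from itertools import product

WARM_MONTHS = {10, 11, 12, 1, 2, 3}  # season 1
COLD_MONTHS = {4, 5, 6, 7, 8, 9}     # season 2


def season_of_month(month):
    """Return 1 for the warm season and 2 for the cold season."""
    if month in WARM_MONTHS:
        return 1
    if month in COLD_MONTHS:
        return 2
    raise ValueError(f"invalid month: {month}")


def year_blocks(first_year=1971, last_year=2002, block_len=4):
    """Split the record into contiguous blocks of years (an assumption)."""
    years = list(range(first_year, last_year + 1))
    return [years[i:i + block_len] for i in range(0, len(years), block_len)]


def cross_validation_folds(n_stations=13):
    """One fold per (station, season, hidden block of years)."""
    return [
        {"station": station, "season": season, "hidden_years": block}
        for station, season, block in product(
            range(n_stations), (1, 2), year_blocks()
        )
    ]


folds = cross_validation_folds()
print(len(folds), folds[0])
```

With 13 stations, 2 seasons and 8 blocks this gives 208 folds; in each fold the values of the chosen station in the hidden years would be estimated from the remaining data.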
The paper is organized as follows: after this introduction, common methods of infilling are briefly described; in Section 3 copula-based infilling methods are introduced, while Section 4 treats the application of the methods and the results are then discussed before summary and conclusions.
Common methods used for infilling
The methods used to infill the missing data discussed in this paper were the following:
- 1. Nearest neighbor.
- 2. Linear regression.
- 3. Fuzzy rules.
- 4. Ordinary Kriging.
- 5. Multiple linear regression (MLR).
- 6. Copula based estimation.
- 7. MLR using the EM algorithm (for the monthly totals only).
Let be the vector of observed values at time t. For a set of time steps observations are available at location k and for the rest of the time data for location k are missing.
The purpose of
Copula based infilling
For the copula based estimation we assume that each location has its own specific distribution. These are assessed from the common observation time(s). The distribution function of (calculated over time t) is . In order to make the description more transparent we introduce for the target location with the distribution function . In order to relate these observations we assume that the joint (zero truncated) copula of is . The distribution of a
Application
The data infilling methods were tested for selected stations in the Southern Cape in South Africa. Observed daily precipitation data are available for 13 stations for the 32-year period 1971–2002. Fig. 1 shows the locations of the 13 selected gauges.
For all estimations, two seasons were treated separately: the warm season (season 1), comprising October, November, December, January, February and March, and the cold season (season 2), comprising April, May, June, July, August and September. Table 1, Table 2 show the
Expectation evaluation
The infilling methods were evaluated on the basis of their estimation errors. For each station the bias B, the mean absolute error A, the mean squared error R and the Pearson correlation coefficient (Corr) between observed and estimated values were calculated. We also included a Mean Station Bias; these are defined in Eqs. (23), (24), (25), (26).
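The per-station scores can be sketched as follows. Eqs. (23) to (26) are not reproduced in this text, so the standard textbook definitions are assumed here, including the sign convention for the bias (estimated minus observed). The Mean Station Bias is omitted because its definition is not available, and the function name is hypothetical.

```python
import numpy as np


def error_scores(observed, estimated):
    """Bias, mean absolute error, mean squared error and Pearson correlation.

    Assumes textbook definitions; bias is mean(estimated - observed).
    """
    obs = np.asarray(observed, dtype=float)
    est = np.asarray(estimated, dtype=float)
    err = est - obs
    return {
        "bias": float(np.mean(err)),
        "mae": float(np.mean(np.abs(err))),
        "mse": float(np.mean(err ** 2)),
        "corr": float(np.corrcoef(obs, est)[0, 1]),
    }


# Made-up values for one hypothetical station (mm/day).
scores = error_scores([0.0, 2.0, 10.0, 4.0], [1.0, 2.0, 8.0, 5.0])
print(scores)
```

In this toy case the bias is zero even though three of the four values are wrong, which shows why the bias is reported together with the absolute and squared errors.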
As a first step the infilled values were compared to the observations. For the copula infilling the expected value of the infilled distribution was used
Summary and conclusions
In this paper, a new copula based method was developed for infilling missing precipitation data. The method not only provides a minimum error estimator for the missing values, but also provides a distribution function of the possible values. For daily values, the distribution function has a jump at 0 mm, which equals the probability of a dry day. In addition, because small precipitation amounts are measured with poor accuracy, the lower part of the wet distribution below a threshold (in our
References (15)
- et al., Comparison of neural network methods for infilling missing daily weather records, J. Hydrol. (2007)
- et al., Evaluation of statistical methods for infilling missing values in high-resolution soil moisture data, J. Hydrol. (2011)
- et al., Infilling of missing rainfall and streamflow data in the Shire river basin, Malawi – a self organizing map approach, Phys. Chem. Earth (2012)
- Patching rainfall data using regression methods 3. Grouping, patching and outlier detection, J. Hydrol. (1997)
- et al., Downscaling regional circulation model rainfall to gauge sites using recorrelation and circulation pattern dependent quantile–quantile transforms for quantifying climate change, J. Hydrol. (2013)
- et al., Geostatistical interpolation using copulas, Water Resour. Res. (2008)
- et al., Interpolation of precipitation under topographic influence at different time scales, Water Resour. Res. (2013)