Infilling missing precipitation records – A comparison of a new copula-based method with other techniques

doi:10.1016/j.jhydrol.2014.08.025

Journal of Hydrology

Volume 519, Part A, 27 November 2014, Pages 1162-1170

https://doi.org/10.1016/j.jhydrol.2014.08.025

Highlights

• We developed a new copula based method for infilling missing data.
• We compared it against a comprehensive range of methods of interpolation from Nearest Neighbors to EM.
• We found copula-based methods are superior to the others for estimating expected missing values.
• The EM multiple regression algorithm is suitable for infilling monthly or annual data but not highly skewed daily data.
• The copula-based method has the added advantage of providing distributions of infilled values.

Summary

Infilling missing data might be an unpleasant and tedious task, but is necessary for analysis and water resources management, so it should not be done in a lackadaisical manner. The important thing about the infilled values is that they need to be as good as possible, because poor infilling is likely to lead to poor decisions. Traditionally, a range of methods is routinely employed, e.g. Nearest Neighbor substitution through to Kriging, but few methods attach a quality estimate to the infilled values. In this paper a new copula based method is developed for infilling missing daily and monthly rain gauge data and is compared against six other commonly used methods, in a semi-arid environment with a range of rain-rates and interstation distances, in the Southern Cape region of South Africa. For daily data it is clear that the copula-based methods are superior to the others in terms of point estimation and have the added benefit of providing an estimate of the precision of the interpolation, not provided by the others. In our case, the addition of atmospheric circulation patterns designed to add information for infilling has a relatively small positive effect on the quality of the estimation. The main reason for this is that a small number of wet days does not allow a good estimation of the conditional distribution of precipitation amounts; we note that the average probability of a dry day in this region is 86%. An improvement of the estimate of the probability of a dry day was however observed. In other regions, with a higher number of wet days, an atmospheric Circulation Pattern (CP) based method is likely to lead to further improvements. Using copula-based methods, the estimated probabilities of a dry day correspond well to the observed frequencies of dry days. 
The monthly data yield the same conclusion, with the qualification that the Expectation Maximisation [EM] algorithm performs as well as the copula method (because of the low count of dry months in this region) but its relative disadvantage is that it does not offer as valuable a precision estimate.

Introduction

Hydrological observations are unique in space and time. If not observed at a given location and time the values can only be estimated. Many applications, such as the calculation of water balances, calibration of hydrological models or the provision of unbiased ground truth for remote sensing require full datasets. Thus a reliable estimation of the missing observations is of great importance. The problem is exacerbated by the ubiquitous decimation of gauge networks.

We define infilling as the process of repairing data-sets where observations are missing due to late starting, early ending or disruption of data collection. The process depends on nearby, relatively plentiful data which can be trusted; if they are too sparse, there is no way of manufacturing an information transfer link. In contrast to the task of infilling, we define interpolation as the activity of estimating values of variables in the space between observed and infilled data, either temporally or spatially. We do not perform interpolation in this paper, although some of the methods we use here for infilling are often used for spatial interpolation (like Kriging).

Some practitioners treat the infilled data as observations, because they may perceive no other option, but that ignores the limitations of the infilling methods. For what purpose does one use infilled raingauge data? How is infilling typically performed? There are typically three ways:

(i) as a snapshot, neglecting the time series and
(ii) as multiple time series or
(iii) as a combination of the first two.

If we are to perform infilling of multiple time series we need to ensure that seasonality is removed by transformation, or alternatively infill season by season. How does one treat the variability of the marginal distributions from season to season? How does one cope with the varying length of dry periods over seasons? How does the infilling treatment affect the extremes?

How does one handle these problems? If one uses pure regression methods to estimate individual values, then the chances are that these values are going to be biased towards the median; it is unusual for an infilling method to estimate a missing value outside the range of the observations. This remark applies particularly to the following methods: Nearest Neighbor(s), Inverse Distance Weighting, (Multiple) Linear Regression, Ordinary Kriging and Neural Networks (Dumedah and Coulibaly, 2011; Mwale et al., 2012). These give point estimates with either confidence intervals (usually Gaussian and therefore symmetrical around the estimate) or none; thus these methods are local smoothers. Even the interesting Q–Q transform method introduced by Teegavarapu (2013) is not capable of extrapolating outside the observation interval because the quantiles are limited to observations.
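The smoothing argument can be made concrete with a minimal Python sketch of Inverse Distance Weighting. The function name `idw_estimate`, the distance exponent of 2 and the sample values are illustrative assumptions, not taken from the paper. Because the weights are positive and sum to one, the estimate is a weighted average of the neighboring observations and cannot fall outside their range.

```python
def idw_estimate(values, distances, power=2.0):
    """Inverse Distance Weighting estimate from neighboring gauges.

    values    -- observed amounts at the neighbors (mm)
    distances -- distances from the target to each neighbor (km), all > 0
    power     -- distance exponent (2 is a common but arbitrary choice)
    """
    weights = [1.0 / d ** power for d in distances]
    total = sum(weights)
    # Positive weights normalised to sum to one: a convex combination.
    return sum(w * v for w, v in zip(weights, values)) / total


# Hypothetical daily amounts (mm) and distances (km) for three neighbors.
neighbor_values = [0.0, 4.0, 12.5]
neighbor_distances = [5.0, 10.0, 20.0]
estimate = idw_estimate(neighbor_values, neighbor_distances)

# The estimate always lies between the smallest and largest neighbor value.
print(min(neighbor_values) <= estimate <= max(neighbor_values))
```

The same bound holds for Nearest Neighbor substitution, which is the limiting case where one weight equals one.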

Kriging in its many forms is popular, but two problems associated with the method are not usually taken into account. First, Kriging is implicitly a Gaussian method, yet it is often applied to raw, skewed data. Second, it uses a global variogram for a given snapshot. Thus in its simplest form Kriging is unable to account for local variations in spatial dependence, as is achieved in multiple regression, for example.

To summarize: to use infilled data as if they have observation status is to delude oneself; it is necessary to determine the uncertainty of the infilled data and use whatever information they offer with caution and care when using them in interpolation or other modeling activities. If this is not done, then interpolation becomes a form of smoothing with point estimates which, for example, cannot model extremes. We will not address spatial interpolation further in this paper, reserving it for a follow-on.

There are two different goals one can follow for data infilling:

1. To minimize the expected variance of the estimation error.
2. To obtain a realization such that the statistics of the augmented dataset comply with the incomplete dataset.

Several authors have treated this topic, infilling data on precipitation, discharge, soil moisture or evapotranspiration. The temporal resolution varies between days and months. Infilling precipitation data on short time scales (up to several days) is particularly challenging due to its intermittent behavior. Zero observations cannot be treated the same way as positive ones. There is usually a certain radius around dry stations where there is no rainfall.

The methods used for infilling also differ strongly. Traditionally, statistical methods were used for this purpose. Johns et al. (2003) describe a method based on Bayesian statistics to infill precipitation and temperature records. Kondrashov and Ghil (2006) use Singular Spectrum Analysis for spatial and temporal infilling. A comparison of the performance of different infilling methods, together with a non-linear model for the selection and weighting of available observations for infilling, was presented by Teegavarapu (2012). Recently, methods from artificial intelligence have also been used for infilling: self-organizing maps (Mwale et al., 2012) and neural networks (Coulibaly and Evora, 2007).

Most methods used for best estimation (corresponding to the first goal) lead to a reduction of the variance, and thus to a distortion of the distributions. Recently, Teegavarapu (2013) suggested the use of a post-processing quantile–quantile transform to spatially interpolate missing values.
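The variance reduction can be seen in the simplest case of linear regression on a single neighboring gauge. The following is a sketch under that simplifying assumption; the symbols $\bar{x}$, $\bar{y}$, $s_x$, $s_y$ (sample means and standard deviations at the neighbor and at the target) and $r$ (their sample correlation) are introduced here for illustration and are not the paper's notation.

```latex
\hat{y}(t) = \bar{y} + r \, \frac{s_y}{s_x} \, \bigl( x(t) - \bar{x} \bigr)
\qquad \Longrightarrow \qquad
\operatorname{Var}\bigl[ \hat{y} \bigr]
  = r^{2} \, \frac{s_y^{2}}{s_x^{2}} \, s_x^{2}
  = r^{2} s_y^{2} \;\le\; s_y^{2}
```

Under this assumption the infilled values have a smaller variance than the observations whenever $|r| < 1$, which is the distortion of the distribution referred to above.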

Infilled values are not measured values; they are only estimates. Their uncertainty should be considered in any subsequent application. It is therefore of great importance to quantify this uncertainty and, if possible, to do so specifically for each infilled datum.

In this paper a new spatial copula based methodology is developed and compared to other infilling methods. Copulas were applied for spatial interpolation of groundwater quality (Bárdossy and Li, 2008) and precipitation (Bárdossy and Pegram, 2013). The suggested methodology provides an estimate of the conditional distribution of precipitation at the target location. This distribution can be used to obtain a minimum error estimator (by using the conditional mean) or to obtain simulations reproducing the observed variability at the target location. Zero precipitation values are treated as censored observations of a continuous distribution. Low precipitation amounts (for example <1 mm/day) are measured with a high relative error. Precipitation below such a threshold contributes very little to the total precipitation and is hydrologically irrelevant. As a consequence, the infilling strategy employed in this study provides a probability of zero, a probability of being above zero and below a given threshold and a continuous distribution for the precipitation amounts exceeding the threshold.
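As an illustration of what such a three-part result might look like as a data structure, here is a hedged Python sketch. The class name `InfilledDistribution`, the exponential form for amounts above the threshold, the uniform treatment of small amounts and all numbers except the 86% dry-day probability quoted in the Summary are assumptions made for this example; the paper derives the continuous part from the copula, not from an exponential distribution.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class InfilledDistribution:
    """Three-part description of one infilled daily value (illustrative)."""
    p_zero: float     # probability of a dry day
    p_small: float    # probability of 0 < amount <= threshold
    threshold: float  # mm/day, for example 1.0
    scale: float      # mean excess above threshold (exponential assumed)

    @property
    def p_above(self):
        return 1.0 - self.p_zero - self.p_small

    def exceedance(self, x):
        """P(amount > x) for x at or above the threshold."""
        if x < self.threshold:
            raise ValueError("only defined at or above the threshold")
        return self.p_above * math.exp(-(x - self.threshold) / self.scale)

    def mean(self):
        """Expected amount; small amounts are placed at threshold / 2."""
        small = self.p_small * self.threshold / 2.0
        large = self.p_above * (self.threshold + self.scale)
        return small + large

    def sample(self, rng):
        """Draw one realization instead of using the expected value."""
        u = rng.random()
        if u < self.p_zero:
            return 0.0
        if u < self.p_zero + self.p_small:
            return rng.uniform(0.0, self.threshold)
        return self.threshold + rng.expovariate(1.0 / self.scale)


dist = InfilledDistribution(p_zero=0.86, p_small=0.04, threshold=1.0, scale=6.0)
expected = dist.mean()          # point estimate (first goal)
rng = random.Random(0)
draws = [dist.sample(rng) for _ in range(1000)]  # realizations (second goal)
```

The expected value serves as the minimum error estimator, while repeated calls to `sample` give realizations that retain dry days and occasional large amounts.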

To assess the efficacy of the methods, we perform intercomparisons based on cross-validation. The infilling methods were tested on data from selected stations in the Southern Cape in South Africa. Observed daily precipitation data were available for 13 stations for the 32-year period 1971–2002. Some data were missing from the station records but most were intact. To perform the cross-validation, we divided the year into two seasons and then selected each station in turn, hiding 8 sets of 4 years at a time. This was done for daily and monthly data. Each of the hidden values was estimated using the remaining intact data, and the procedure was repeated for each station. The result was that over 70,000 daily and over 2,000 monthly data were used in the cross-validation study.
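The cross-validation layout can be sketched in a few lines of Python. This is one possible reading of the design, assuming that the 8 sets are consecutive, non-overlapping 4-year blocks hidden one at a time; the function names and the placeholder station labels are hypothetical.

```python
def four_year_blocks(first_year=1971, last_year=2002, block_len=4):
    """Split the record into consecutive blocks of years to hide."""
    years = list(range(first_year, last_year + 1))
    return [years[i:i + block_len] for i in range(0, len(years), block_len)]


def cross_validation_folds(stations, seasons=(1, 2)):
    """Yield (station, season, hidden_years) for every fold."""
    for station in stations:
        for season in seasons:
            for hidden in four_year_blocks():
                yield station, season, hidden


stations = [f"gauge_{i:02d}" for i in range(1, 14)]  # 13 placeholder names
folds = list(cross_validation_folds(stations))
print(len(four_year_blocks()), len(folds))
```

Each fold hides one block at one station in one season; the estimates for the hidden values are then compared with the withheld observations.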

The paper is organized as follows: after this introduction, common methods of infilling are briefly described; in Section 3 copula-based infilling methods are introduced, while Section 4 treats the application of the methods and the results are then discussed before summary and conclusions.

Section snippets

Common methods used for infilling

The methods used to infill the missing data discussed in this paper were the following:

1. Nearest neighbor.
2. Linear regression.
3. Fuzzy rules.
4. Ordinary Kriging.
5. Multiple linear regression (MLR).
6. Copula based estimation.
7. MLR using the EM algorithm (for the monthly totals only).

Let $x (t) = (x_{1} (t), \dots, x_{i} (t), \dots, x_{m} (t))$ be the vector of observed values at time t. For a set of time steps $T_{k} (1)$ observations are available at location k and for the rest of the time $T_{k} (2)$ data for location k are missing.
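In code, this notation might translate as follows. The sketch below uses simple linear regression on a single complete neighboring series (method 2 in the list above): the line is fitted on the time steps in $T_{k} (1)$ and applied to those in $T_{k} (2)$. The function names, the use of `None` to mark missing values, the clipping of negative estimates to zero and the sample totals are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def infill_by_regression(target, neighbor):
    """Fill None entries of `target` from a complete `neighbor` series."""
    observed = [t for t, y in enumerate(target) if y is not None]  # T_k(1)
    missing = [t for t, y in enumerate(target) if y is None]       # T_k(2)
    slope, intercept = fit_line([neighbor[t] for t in observed],
                                [target[t] for t in observed])
    filled = list(target)
    for t in missing:
        # Precipitation cannot be negative, so clip the estimate at zero.
        filled[t] = max(0.0, intercept + slope * neighbor[t])
    return filled


# Hypothetical monthly totals (mm); None marks the missing time steps.
neighbor_series = [10.0, 20.0, 30.0, 40.0, 50.0]
target_series = [12.0, 22.0, None, 42.0, None]
filled_series = infill_by_regression(target_series, neighbor_series)
```

Multiple linear regression follows the same pattern with several neighbors as predictors.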

The purpose of

Copula based infilling

For the copula based estimation we assume that each location has its own specific distribution. These are assessed from the common observation time(s). The distribution function of $x_{i} (t)$ (calculated over time t) is $F_{i} (x)$ . In order to make the description more transparent we introduce $Y (t) = X_{k} (t)$ for the target location with the distribution function $F_{Y} (y)$ . In order to relate these observations we assume that the joint (zero truncated) copula of $(X_{1}, \dots, X_{m}, Y)$ is $C (u_{1}, \dots, u_{k}, v)$ . The distribution of a

Application

The data infilling methods were tested for selected stations in the Southern Cape in South Africa. Observed daily precipitation data are available for 13 stations for the 32-year period 1971–2002. Fig. 1 shows the locations of the 13 selected gauges.

For all estimations two seasons were treated separately: the warm season (season 1), comprising October, November, December, January, February and March, and the cold season (season 2), comprising April, May, June, July, August and September. Table 1, Table 2 show the

Expectation evaluation

The infilling methods were evaluated on the basis of their estimation errors. For each station the bias B, mean absolute error A, the mean squared error R and the Pearson correlation coefficient (Corr) between observed and estimated values were calculated. We also included a Mean Station Bias $B_{A}$ ; these are defined in Eqs. (23), (24), (25), (26).
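Eqs. (23) to (26) are not reproduced in this excerpt, so the following Python sketch uses the standard textbook definitions of these scores as an assumption. In particular, the sign convention of the bias (estimated minus observed) is a choice made here, the Mean Station Bias $B_{A}$ is omitted because its definition is not available, and the sample values are hypothetical.

```python
import math


def evaluation_scores(observed, estimated):
    """Bias, mean absolute error, mean squared error and Pearson correlation."""
    n = len(observed)
    errors = [e - o for o, e in zip(observed, estimated)]
    bias = sum(errors) / n
    mean_abs_error = sum(abs(d) for d in errors) / n
    mean_sq_error = sum(d * d for d in errors) / n

    mean_o = sum(observed) / n
    mean_e = sum(estimated) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, estimated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    corr = cov / math.sqrt(var_o * var_e)

    return {"B": bias, "A": mean_abs_error, "R": mean_sq_error, "Corr": corr}


# Hypothetical daily amounts (mm) at one station.
observed_mm = [0.0, 2.0, 5.0, 0.0]
estimated_mm = [0.5, 1.5, 4.0, 0.0]
scores = evaluation_scores(observed_mm, estimated_mm)
```

Computing the scores per station keeps a poorly performing gauge visible instead of averaging it away.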

As a first step the infilled values were compared to the observations. For the copula infilling the expected value of the infilled distribution was used

Summary and conclusions

In this paper, a new copula based method was developed for infilling missing precipitation data. The method does not only provide a minimum error estimator for the missing values, but also provides a distribution function of the possible values. For daily values, the distribution function has a jump at 0 mm – which equals the probability of a dry day. In addition, because small precipitation amounts are measured with poor accuracy, the lower part of the wet distribution below a threshold (in our

References (15)

P. Coulibaly et al., Comparison of neural network methods for infilling missing daily weather records, J. Hydrol. (2007)
G. Dumedah et al., Evaluation of statistical methods for infilling missing values in high-resolution soil moisture data, J. Hydrol. (2011)
F. Mwale et al., Infilling of missing rainfall and streamflow data in the Shire River basin, Malawi – a self organizing map approach, Phys. Chem. Earth (2012)
G. Pegram, Patching rainfall data using regression methods 3. Grouping, patching and outlier detection, J. Hydrol. (1997)
G. Pegram et al., Downscaling regional circulation model rainfall to gauge sites using recorrelation and circulation pattern dependent quantile–quantile transforms for quantifying climate change, J. Hydrol. (2013)
A. Bárdossy et al., Geostatistical interpolation using copulas, Water Resour. Res. (2008)
A. Bárdossy et al., Interpolation of precipitation under topographic influence at different time scales, Water Resour. Res. (2013)

There are more references available in the full text version of this article.
