Elsevier

Journal of Hydrology

Volume 519, Part A, 27 November 2014, Pages 1162-1170

Infilling missing precipitation records – A comparison of a new copula-based method with other techniques

https://doi.org/10.1016/j.jhydrol.2014.08.025

Highlights

  • We developed a new copula based method for infilling missing data.

  • We compared it against a comprehensive range of methods of interpolation from Nearest Neighbors to EM.

  • We found copula-based methods are superior to the others for estimating expected missing values.

  • The EM multiple regression algorithm is suitable for infilling monthly or annual data but not highly skewed daily data.

  • The copula-based method has the added advantage of providing distributions of infilled values.

Summary

Infilling missing data might be an unpleasant and tedious task, but is necessary for analysis and water resources management, so it should not be done in a lackadaisical manner. The important thing about the infilled values is that they need to be as good as possible, because poor infilling is likely to lead to poor decisions. Traditionally, a range of methods is routinely employed, e.g. Nearest Neighbor substitution through to Kriging, but few methods attach a quality estimate to the infilled values. In this paper a new copula based method is developed for infilling missing daily and monthly rain gauge data and is compared against six other commonly used methods, in a semi-arid environment with a range of rain-rates and interstation distances, in the Southern Cape region of South Africa. For daily data it is clear that the copula-based methods are superior to the others in terms of point estimation and have the added benefit of providing an estimate of the precision of the interpolation, not provided by the others. In our case, the addition of atmospheric circulation patterns designed to add information for infilling has a relatively small positive effect on the quality of the estimation. The main reason for this is that a small number of wet days does not allow a good estimation of the conditional distribution of precipitation amounts; we note that the average probability of a dry day in this region is 86%. An improvement of the estimate of the probability of a dry day was however observed. In other regions, with a higher number of wet days, an atmospheric Circulation Pattern (CP) based method is likely to lead to further improvements. Using copula-based methods, the estimated probabilities of a dry day correspond well to the observed frequencies of dry days. 
The monthly data yield the same conclusion, with the qualification that the Expectation Maximisation (EM) algorithm performs as well as the copula method (because of the low count of dry months in this region); its relative disadvantage is that it does not offer as valuable a precision estimate.

Introduction

Hydrological observations are unique in space and time. If not observed at a given location and time the values can only be estimated. Many applications, such as the calculation of water balances, calibration of hydrological models or the provision of unbiased ground truth for remote sensing require full datasets. Thus a reliable estimation of the missing observations is of great importance. The problem is exacerbated by the ubiquitous decimation of gauge networks.

We define infilling as the process of repairing data-sets where observations are missing due to late starting, early ending or disruption of data collection. The process depends on nearby, relatively plentiful data which can be trusted; if they are too sparse, there is no way of manufacturing an information transfer link. In contrast to the task of infilling, we define interpolation as the activity of estimating values of variables in the space between observed and infilled data, either temporally or spatially. We do not perform interpolation in this paper, although some of the methods we use here for infilling are often used for spatial interpolation (like Kriging).

Some practitioners treat the infilled data as observations, because they may perceive no other option, but that ignores the limitations of the infilling methods. For what purpose does one use infilled raingauge data? How is infilling typically performed? There are typically three ways:

  • (i) as a snapshot, neglecting the time series, and

  • (ii) as multiple time series, or

  • (iii) as a combination of the first two.

If we are to perform infilling of multiple time series we need to ensure that seasonality is removed by transformation, or alternatively infill season by season. How does one treat the variability of the marginal distributions from season to season? How does one cope with the varying length of dry periods over seasons? How does the infilling treatment affect the extremes?

How does one handle these problems? If one uses pure regression methods to estimate individual values, then the chances are that these values are going to be biased towards the median; it is unusual for an infilling method to estimate a missing value outside the range of the observations. This remark applies particularly to the following methods: Nearest Neighbor(s), Inverse Distance Weighting, (Multiple) Linear Regression, Ordinary Kriging and Neural Networks (Dumedah and Coulibaly, 2011; Mwale et al., 2012). These give point estimates either with confidence intervals (usually Gaussian and therefore symmetrical around the estimate) or with none; thus these methods are local smoothers. Even the interesting Q–Q transform method introduced by Teegavarapu (2013) is not capable of extrapolating outside the observation interval, because the quantiles are limited to observations.
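The range limitation can be seen directly for Inverse Distance Weighting, whose estimate is a weighted average with positive weights summing to one. The following is a minimal sketch, not the implementation used in the paper; the function name `idw_estimate`, the distance exponent of 2 and the gauge values are all hypothetical choices made for illustration.

```python
def idw_estimate(values, distances, power=2.0):
    """Inverse-distance-weighted estimate from neighbouring gauges.

    The exponent `power` is an assumed, commonly used default.
    """
    weights = [1.0 / d ** power for d in distances]
    total = sum(weights)
    # Positive weights that sum to one: the result is a convex combination.
    return sum(w * v for w, v in zip(weights, values)) / total


# Hypothetical daily totals (mm) and distances (km) for four neighbours.
neighbour_rain = [0.0, 4.2, 11.5, 2.3]
neighbour_dist = [12.0, 25.0, 40.0, 18.0]

estimate = idw_estimate(neighbour_rain, neighbour_dist)
print(f"IDW estimate: {estimate:.2f} mm")
print(f"Neighbour range: {min(neighbour_rain)} to {max(neighbour_rain)} mm")
```

Because the estimate is a convex combination of the neighbouring values, it can never fall below the smallest or above the largest of them, whatever the distances are.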

Kriging in its many forms is popular, but it has problems that are not usually taken into account. Besides being applied to raw, often skewed data (Kriging is implicitly a Gaussian method), it uses a global variogram for a given snapshot. Thus in its simplest form Kriging is unable to account for local variations in spatial dependence, as multiple regression can, for example.

To summarize: to use infilled data as if they have observation status is to delude oneself; it is necessary to determine the uncertainty of the infilled data and use whatever information they offer with caution and care when using them in interpolation or other modeling activities. If this is not done, then interpolation becomes a form of smoothing with point estimates which, for example, cannot model extremes. We will not address spatial interpolation further in this paper, reserving it for a follow-on.

There are two different goals one can follow for data infilling:

  • 1. To minimize the expected variance of the estimation error.

  • 2. To obtain a realization such that the statistics of the augmented dataset comply with the incomplete dataset.

Several authors have treated this topic, infilling precipitation, discharge, soil moisture or evapotranspiration data. The temporal resolution varies between days and months. Infilling precipitation data on short time scales (up to several days) is especially challenging due to its intermittent behavior. Zero observations cannot be treated the same way as positive ones. There is usually a certain radius around dry stations where there is no rainfall.

The methods used for infilling also differ strongly. Traditionally, statistical methods were used for this purpose. Johns et al. (2003) describe a method based on Bayesian statistics to infill precipitation and temperature records. Kondrashov and Ghil (2006) use Singular Spectrum Analysis for spatial and temporal infilling. A comparison of the performance of different infilling methods, together with a non-linear model for the selection and weighting of available observations for infilling, was presented by Teegavarapu (2012). Recently, methods from artificial intelligence have also been used for infilling: self-organizing maps (Mwale et al., 2012) and neural networks (Coulibaly and Evora, 2007).

Most methods used for best estimation (corresponding to the first goal) lead to a reduction of the variance, and thus to a distortion of the distributions. Recently, Teegavarapu (2013) suggested the use of a post-processing quantile/quantile transform to spatially interpolate missing values.
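The variance reduction is easy to reproduce with simple linear regression, where the variance of the fitted values equals the squared correlation times the variance of the observations. The sketch below uses synthetic Gaussian data rather than rainfall, and the names `fit_line`, `var_obs` and `var_est` are introduced here purely for illustration.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Synthetic "neighbour" (x) and "target" (y) series; parameters are arbitrary.
rng = random.Random(42)
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y = [0.6 * a + rng.gauss(0.0, 0.8) for a in x]

slope, intercept = fit_line(x, y)
y_hat = [intercept + slope * a for a in x]

var_obs = statistics.pvariance(y)      # variance of the observations
var_est = statistics.pvariance(y_hat)  # variance of the regression estimates
r = pearson(x, y)

print(f"variance of observations: {var_obs:.3f}")
print(f"variance of estimates:    {var_est:.3f}")
print(f"r^2 * variance of obs:    {r * r * var_obs:.3f}")
```

Unless the correlation is perfect, the estimates vary less than the observations they replace, which is why a series infilled with such point estimates has a distorted distribution.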

Infilled values are not measured values, they are only estimates. Their uncertainty should be considered for any subsequent application. Therefore it is of great importance to quantify this uncertainty and, if possible, specifically for each infilled datum.

In this paper a new spatial copula based methodology is developed and compared to other infilling methods. Copulas were applied for spatial interpolation of groundwater quality (Bárdossy and Li, 2008) and precipitation (Bárdossy and Pegram, 2013). The suggested methodology provides an estimate of the conditional distribution of precipitation at the target location. This distribution can be used to obtain a minimum error estimator (by using the conditional mean) or to obtain simulations reproducing the observed variability at the target location. Zero precipitation values are treated as censored observations of a continuous distribution. Low precipitation amounts (for example <1 mm/day) are measured with a high relative error. Precipitation below such a threshold contributes very little to the total precipitation and is hydrologically irrelevant. As a consequence, the infilling strategy employed in this study provides a probability of zero, a probability of being above zero and below a given threshold and a continuous distribution for the precipitation amounts exceeding the threshold.
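The three-part result described above (a probability of zero, a probability of a small amount below the threshold, and a continuous distribution above it) can be represented as a small data structure. This is only a sketch of the output format, not of the copula estimation itself. The class `InfilledDistribution`, the exponential form of the exceedance distribution, the uniform treatment of amounts below the threshold and all numerical values except the 86% dry-day probability quoted in the Summary are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class InfilledDistribution:
    p_dry: float        # probability of exactly zero precipitation
    p_trace: float      # probability of 0 < amount <= threshold
    threshold: float    # mm/day
    mean_excess: float  # mean amount above the threshold (exponential assumed)

    def p_wet(self):
        """Probability of exceeding the threshold."""
        return 1.0 - self.p_dry - self.p_trace

    def expected_value(self):
        """Conditional mean, usable as a point estimate."""
        # Amounts below the threshold are taken as uniform on (0, threshold).
        trace_mean = 0.5 * self.threshold
        wet_mean = self.threshold + self.mean_excess
        return self.p_trace * trace_mean + self.p_wet() * wet_mean

    def sample(self, rng):
        """One simulated value, preserving the variability of the target."""
        u = rng.random()
        if u < self.p_dry:
            return 0.0
        if u < self.p_dry + self.p_trace:
            return rng.uniform(0.0, self.threshold)
        return self.threshold + rng.expovariate(1.0 / self.mean_excess)


dist = InfilledDistribution(p_dry=0.86, p_trace=0.04, threshold=1.0, mean_excess=6.0)
rng = random.Random(0)
draws = [dist.sample(rng) for _ in range(20000)]
print(f"point estimate: {dist.expected_value():.2f} mm/day")
print(f"simulated dry fraction: {sum(v == 0.0 for v in draws) / len(draws):.3f}")
```

The same object serves both goals listed earlier: `expected_value` gives a minimum error style point estimate, while `sample` gives realizations whose spread reflects the uncertainty of the infilled datum.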

To assess the efficacy of the methods, we perform intercomparisons based on cross-validation. The infilling methods were tested on data from selected stations in the Southern Cape in South Africa. Observed daily precipitation data were available for 13 stations for the 32-year period 1971–2002. Some data were missing from the station records but most were intact. To perform the cross-validation, we divided the year into two seasons and then selected each station in turn, hiding 8 sets of 4 years at a time. This was done for daily and monthly data. Each of the hidden values was estimated using the remaining intact data, and the procedure was repeated for each station. The result was that over 70,000 daily and over 2,000 monthly data were used in the cross-validation study.
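The hide-and-estimate design can be enumerated as follows. This sketch assumes that the 8 sets are contiguous, non-overlapping 4-year blocks and that each season is processed separately for every station; neither detail is spelled out in the text, and the function names are hypothetical.

```python
def hidden_blocks(first_year=1971, last_year=2002, block_len=4):
    """Split the record into consecutive blocks of years to be hidden in turn."""
    years = list(range(first_year, last_year + 1))
    return [years[i:i + block_len] for i in range(0, len(years), block_len)]


def cross_validation_cases(n_stations=13, seasons=(1, 2)):
    """Yield (station index, season, hidden years) for every experiment."""
    for station in range(n_stations):
        for season in seasons:
            for block in hidden_blocks():
                yield station, season, block


blocks = hidden_blocks()
cases = list(cross_validation_cases())
print(f"{len(blocks)} blocks of {len(blocks[0])} years each")
print(f"{len(cases)} station/season/block combinations")
```

For each case, the target station's values in the hidden years would be masked, estimated from the remaining data, and then compared with the values that were hidden.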

The paper is organized as follows: after this introduction, common methods of infilling are briefly described; in Section 3 copula-based infilling methods are introduced, while Section 4 treats the application of the methods and the results are then discussed before summary and conclusions.

Section snippets

Common methods used for infilling

The methods used to infill the missing data discussed in this paper were the following:

  • 1. Nearest neighbor.

  • 2. Linear regression.

  • 3. Fuzzy rules.

  • 4. Ordinary Kriging.

  • 5. Multiple linear regression (MLR).

  • 6. Copula based estimation.

  • 7. MLR using the EM algorithm (for the monthly totals only).

Let $x(t) = (x_1(t), \ldots, x_i(t), \ldots, x_m(t))$ be the vector of observed values at time $t$. For a set of time steps $T_k^{(1)}$ observations are available at location $k$, and for the rest of the time, $T_k^{(2)}$, data for location $k$ are missing.
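The two sets of time steps for a location can be obtained by a simple pass over its record. In this sketch, missing values are marked with `None`, which is an assumed convention, and `split_time_steps` is a hypothetical helper name.

```python
def split_time_steps(series):
    """Return (observed, missing) time-step indices for one location."""
    observed = [t for t, value in enumerate(series) if value is not None]
    missing = [t for t, value in enumerate(series) if value is None]
    return observed, missing


# Hypothetical daily record (mm) for location k with two gaps.
record_k = [0.0, 3.1, None, 0.0, None, 12.4]
observed_steps, missing_steps = split_time_steps(record_k)
print("observed:", observed_steps)
print("missing: ", missing_steps)
```

Note that a recorded zero is an observation, not a gap, so dry days belong to the observed set.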

The purpose of

Copula based infilling

For the copula based estimation we assume that each location has its own specific distribution. These are assessed from the common observation time(s). The distribution function of $x_i(t)$ (calculated over time $t$) is $F_i(x)$. In order to make the description more transparent we introduce $Y(t) = X_k(t)$ for the target location with the distribution function $F_Y(y)$. In order to relate these observations we assume that the joint (zero truncated) copula of $X_1, \ldots, X_m, Y$ is $C(u_1, \ldots, u_k, v)$. The distribution of a

Application

The data infilling methods were tested for selected stations in the Southern Cape in South Africa. Observed daily precipitation data are available for 13 stations for the 32-year period 1971–2002. Fig. 1 shows the locations of the 13 selected gauges.

For all estimations two seasons were treated separately. The warm season (season 1): October, November, December, January, February and March and the cold season (season 2): April, May, June, July, August and September. Table 1, Table 2 show the

Expectation evaluation

The infilling methods were evaluated on the basis of their estimation errors. For each station the bias B, mean absolute error A, the mean squared error R and the Pearson correlation coefficient (Corr) between observed and estimated values were calculated. We also included a Mean Station Bias BA; these are defined in Eqs. (23), (24), (25), (26).
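The per-station scores can be computed as in the sketch below. Since Eqs. (23) to (26) are not reproduced here, the code uses the standard textbook definitions, with bias taken as estimate minus observation; that sign convention, the function name `evaluation_scores` and the sample values are assumptions.

```python
import math


def evaluation_scores(observed, estimated):
    """Bias, mean absolute error, mean squared error and Pearson correlation."""
    n = len(observed)
    errors = [e - o for o, e in zip(observed, estimated)]
    bias = sum(errors) / n
    mae = sum(abs(d) for d in errors) / n
    mse = sum(d * d for d in errors) / n

    mean_o = sum(observed) / n
    mean_e = sum(estimated) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, estimated))
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    ss_e = sum((e - mean_e) ** 2 for e in estimated)
    corr = cov / math.sqrt(ss_o * ss_e)
    return {"bias": bias, "mae": mae, "mse": mse, "corr": corr}


# Hypothetical hidden observations and their infilled estimates (mm/day).
obs = [0.0, 2.0, 5.0, 0.0, 10.0]
est = [0.5, 1.5, 4.0, 0.0, 8.0]
scores = evaluation_scores(obs, est)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this example the estimates overshoot the dry-day value and undershoot the largest amount, which is the smoothing pattern that a negative bias at the wet end and a high correlation can jointly mask.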

As a first step the infilled values were compared to the observations. For the copula infilling the expected value of the infilled distribution was used

Summary and conclusions

In this paper, a new copula based method was developed for infilling missing precipitation data. The method not only provides a minimum error estimator for the missing values, but also a distribution function of the possible values. For daily values, the distribution function has a jump at 0 mm, which equals the probability of a dry day. In addition, because small precipitation amounts are measured with poor accuracy, the lower part of the wet distribution below a threshold (in our
