Estimation of missing data of monthly rainfall in southwestern Colombia using artificial neural networks

The success of many projects linked to the management and planning of water resources depends mainly on the quality of the climatic and hydrological data provided. Nevertheless, missing data are frequently found in hydroclimatic variables due to measuring instrument failures, observation recording errors, meteorological extremes, and the challenges associated with accessing measurement areas. Hence, the missing data must be filled appropriately before any analysis. This paper presents the filling of missing monthly rainfall data for 45 gauge stations located in southwestern Colombia. The series analyzed cover 34 years of observations between 1983 and 2016, available from the Instituto de Hidrología, Meteorología y Estudios Ambientales (IDEAM). The missing data were estimated using Non-linear Principal Component Analysis (NLPCA), a non-linear generalization of standard Principal Component Analysis based on an Artificial Neural Network (ANN) approach. The best result was obtained using a network with a [45-44-45] architecture. The estimated mean squared error in the imputation of missing data was approximately 9.8 mm month−1, showing that the NLPCA approach constitutes a powerful methodology for the imputation of missing rainfall data. The estimated rainfall dataset helps reduce uncertainty in further studies related to homogeneity analyses, conglomerates, trends, multivariate statistics and meteorological forecasts in regions with information deficits such as southwestern Colombia.


© 2019 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Data
The monthly rainfall figures and tables are based on data from 45 stations located in different zones of the Department of Nariño (Colombia). Fig. 1 is the location map of the rainfall stations. Descriptive statistics of the monthly rainfall data (1983–2016) are presented in Table 1. Fig. 2 shows the schematic diagram of the inverse Non-Linear Principal Components Analysis (NLPCA) model. Fig. 3 shows the Artificial Neural Network (ANN) modelling data for ten trainings per architecture. The best estimates from the different architectures of the inverse NLPCA model are shown in Table 2. For the stations with more than 1% missing data, the original time series and the estimated missing data are presented in descending order across three graphs (Fig. 4, Fig. 5, and Fig. 6) so that they are easier to observe. Fig. 4 shows the time series for REM, MAT, CHA, OBO, and BAR (see ID and Missing Data Percent in Table 1).

Specifications table
Subject: Environmental Science and Artificial Intelligence
More specific subject area: Application of neural networks in the estimation of missing data in rainfall time series
Type of data: Figures and tables
How data was acquired: Rainfall data were obtained following the formal application procedure of the IDEAM (Colombia).
Data format: Raw and analyzed data
Experimental factors: Monthly rainfall
Experimental features: Estimation of missing rainfall data through NLPCA, an auto-associative neural network, generally seen as a non-linear generalization of standard linear principal component analysis.
Data source location: Southwestern Colombia (Nariño)
Data accessibility: Data are available in this article
Value of the data
The data in this article can be used to: a) enhance the consistency of water-resources management studies, b) improve trend analyses of rainfall, c) identify homogeneous climatic regions and d) reduce the uncertainty of rainfall forecasts.
The dataset is useful for institutions, researchers, and experts involved in water-resources management, risk management, food safety and other fields linked to climatic variability.
The estimated rainfall data can be used as input to forecast models, statistical models, and different linear and non-linear techniques for the analysis of rainfall variability.
The estimated dataset provides new rainfall information for one of the most biodiverse regions in the country and in the world.

Study area description
The Department of Nariño, located in southwestern Colombia, covers an area of 33,268 km², with a geo-strategic position between the Colombian-Ecuadorian border, the tropical Pacific Ocean and the Andes mountain range of South America. Of its territory, 8% belongs to the Amazon basin, one of the largest biodiversity reserves in the world; 52% corresponds to the Chocó biogeographic plains, which harbor a mega-diversity of species; and the remaining 40% covers the Andean region, where páramos and volcanoes dominate [1]. These aspects make Nariño one of the most diverse regions of Colombia and of the world (Fig. 1).

Material and methods
The imputation of missing rainfall data was performed with the inverse model of the NLPCA technique defined by Scholz et al. [2,3] and applied to the estimation of missing rainfall data by Miró et al. [4], who concluded that NLPCA gave the best results among the ten data-estimation methods they implemented. Whereas the classical forward ANN approach comprises two steps, i) an extraction function $F_{extr}: X \rightarrow Z$ and ii) a reconstruction function $F_{gen}: Z \rightarrow \hat{X}$, the inverse NLPCA uses only the second [2]. Inverse NLPCA (Fig. 2) is driven by the mapping function $F_{gen}$, which is performed by a feed-forward network. Eq. (1) shows that the output $\hat{X}$ depends on the input $Z$ and the ANN weights $w \in \{W_3, W_4\}$.

$$\hat{X} = F_{gen}(w, Z) \qquad (1)$$

The goal is to obtain a function $F_{gen}$ that generates data $\hat{X}$ approximating the target data $X$ by minimizing the squared error $\lVert X - \hat{X} \rVert^2$. Biases are not explicitly considered; however, they can be included by introducing an extra unit, or input, with activation fixed at one. The architecture of the inverse NLPCA model is [a-b-c], where a is the number of extracted non-linear components, b is the number of non-linear hidden units used to perform the non-linear transformation and c is the number of approximated features. Further explanation and details of this process are available in Scholz et al. [2,3].

The inverse NLPCA is not limited to one component; it can be extended to m components with an additional hierarchical error function [5]. The non-linear components 1, …, m can be extracted in a hierarchical order, which is a natural non-linear extension of the hierarchically ordered components of standard linear PCA. For the application of NLPCA, the Nonlinear PCA toolbox (available at http://www.nlpca.org/matlab.html) was used.
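The idea behind the inverse model can be illustrated with a minimal NumPy sketch. It is not the toolbox code: it assumes a tanh hidden layer with a linear output, plain gradient descent instead of the toolbox optimizer, and no hierarchical error function. The function name `inverse_nlpca_impute`, the learning rate and the synthetic data are all hypothetical. Both the component values Z and the weights W3, W4 are treated as unknowns, and only observed entries contribute to the squared error.

```python
import numpy as np

def inverse_nlpca_impute(X, n_components=2, n_hidden=4, n_iter=500,
                         lr=0.02, weight_decay=0.01, seed=0):
    """Illustrative inverse-NLPCA-style imputation (assumed form, not the toolbox).

    X has shape (n_stations, n_months); NaN marks missing values.
    Returns a copy of X whose gaps are replaced by the network output.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    mask = ~np.isnan(X)                      # True where a value was observed

    # Standardize each station using its observed values only.
    mu = np.nanmean(X, axis=1, keepdims=True)
    sd = np.nanstd(X, axis=1, keepdims=True)
    sd[sd == 0] = 1.0
    Xs = np.where(mask, (X - mu) / sd, 0.0)

    # Unknowns: component values Z (one column per month) and weights W3, W4.
    Z = rng.normal(scale=0.5, size=(n_components, n))
    W3 = rng.normal(scale=0.5, size=(n_hidden, n_components))
    W4 = rng.normal(scale=0.5, size=(d, n_hidden))

    for _ in range(n_iter):
        H = np.tanh(W3 @ Z)                  # hidden layer activations
        R = (W4 @ H - Xs) * mask             # residual; gaps contribute no error
        dA = (W4.T @ R) * (1.0 - H ** 2)     # back-propagate through tanh
        grad_W4 = R @ H.T / n + weight_decay * W4
        grad_W3 = dA @ Z.T / n + weight_decay * W3
        grad_Z = W3.T @ dA                   # per-month gradient for the inputs
        W4 -= lr * grad_W4
        W3 -= lr * grad_W3
        Z -= lr * grad_Z

    X_hat = (W4 @ np.tanh(W3 @ Z)) * sd + mu  # back to the original units
    return np.where(mask, X, X_hat)           # keep observations, fill the gaps

# Synthetic example: five correlated series, about 10% of the values removed.
rng = np.random.default_rng(1)
t = rng.uniform(-1, 1, size=60)
X_true = np.vstack([t, t ** 2, np.sin(2 * t), t ** 3, np.cos(t)]) * 50 + 100
gaps = rng.random(X_true.shape) < 0.1
X_obs = X_true.copy()
X_obs[gaps] = np.nan
X_filled = inverse_nlpca_impute(X_obs)
```

Because the input Z is itself optimized, no extraction network is needed, and a month with gaps still receives a full reconstruction from the entries that were observed.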
Here, the hierarchical NLPCA was used to obtain hierarchically ordered features by sequential training. The remaining variance (error) makes it possible to calculate the explained variance, although this value cannot be interpreted independently of the non-linear mapping [5]. Different architectures of the inverse NLPCA were tested (Table 2); the explained variance increased and the Root Mean Square Error (RMSE) decreased as the number of non-linear principal components increased. The best estimates were obtained using a network with a [45-44-45] architecture (Fig. 3). That is, 45 non-linear components were extracted, 44 non-linear hidden units were used to perform the non-linear transformation, and 45 rainfall gauge stations were approximated. The parameter setting that gave the best result was a weight decay coefficient of 0.01 and a maximum of 5,000 iterations. The components were extracted in a hierarchical order. All other parameters were left at their defaults (type inverse, no circular PCA). Because of the inherent characteristics of ANNs, the final imputation results can differ between trained models. Hence, for each execution, the NLPCA was trained ten times, and the run with the best performance in terms of RMSE was chosen.
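The repeated-training step can be sketched as follows. This is only an illustration under assumptions: the text does not state which values the RMSE was computed on, so the sketch hides a random share of the observed values and scores each run on them. The names `select_best_run` and `noisy_mean_imputer` are hypothetical, and the imputer is a seed-dependent placeholder standing in for one NLPCA training.

```python
import numpy as np

def rmse(a, b):
    """Root mean square error between two equally shaped arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def select_best_run(impute_fn, X, n_runs=10, holdout_frac=0.1, seed=0):
    """Run impute_fn n_runs times and return (best_run, best_rmse, scores).

    Assumption: RMSE is measured on observed values that are hidden
    from the imputer before training.
    """
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(X))
    n_hold = max(1, int(holdout_frac * len(observed)))
    held = observed[rng.choice(len(observed), size=n_hold, replace=False)]
    rows, cols = held[:, 0], held[:, 1]

    X_train = X.copy()
    X_train[rows, cols] = np.nan             # hide the validation values

    scores = []
    for run in range(n_runs):
        X_filled = impute_fn(X_train, seed=run)   # one training per seed
        scores.append(rmse(X_filled[rows, cols], X[rows, cols]))

    best_run = int(np.argmin(scores))
    return best_run, scores[best_run], scores

def noisy_mean_imputer(X, seed):
    """Placeholder for one NLPCA training: station mean plus random noise."""
    rng = np.random.default_rng(seed)
    row_mean = np.nanmean(X, axis=1, keepdims=True)
    noise = rng.normal(scale=5.0, size=X.shape)
    return np.where(np.isnan(X), row_mean + noise, X)

# Synthetic rainfall-like data: 5 stations, 40 months, a few gaps.
rng = np.random.default_rng(2)
X = rng.gamma(shape=2.0, scale=50.0, size=(5, 40))
X[rng.random(X.shape) < 0.05] = np.nan

best_run, best_rmse, scores = select_best_run(noisy_mean_imputer, X, n_runs=10)
```

Any imputer that accepts a matrix with NaN gaps and a `seed` argument can be passed in place of the placeholder.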