Missing data estimation in extreme rainfall indices for the Metropolitan area of Cali - Colombia: An approach based on artificial neural networks

Changes observed in the current climate and projected for the future are of significant concern to researchers, decision-makers, and the general public. Climate indices of extreme rainfall events are a trend assessment tool for detecting signals of climate variability and change, and they have moderate reliability, at least in the short term, given climatic inertia. This paper presents 12 climate indices of extreme rainfall events at annual and seasonal scales for 12 climate stations from 1969 to 2019 in the Metropolitan area of Cali (southwestern Colombia). The indices are constructed from daily rainfall time series whose missing data, although only between 0.5% and 5.4%, can affect the estimation of the indices. Here, we propose a methodology to complete missing data in the extreme event indices that models the peaks in the time series. This methodology uses an artificial neural network approach known as Non-Linear Principal Component Analysis (NLPCA). The approach reconstructs the time series by modulating the extreme values of the indices, a fundamental feature when evaluating extreme rainfall events in a region. The accuracy of the index estimation is shown by values close to 1 for Pearson's correlation coefficient and the bi-weighting correlation, and by values close to 0 for the percent bias and the RMSE-observations standard deviation ratio. The database provided here is an essential input for future evaluation studies of extreme rainfall events in the Metropolitan area of Cali, the third most important urban conglomerate in Colombia, with more than 3.9 million inhabitants.

Value of the Data
• Data from this article can be used to (a) visualize the relevance of climate risk management studies, (b) improve trend analyses of extreme rainfall events, (c) analyze changes in extreme rainfall indices related to climate variability and change, (d) identify homogeneous climatic regions, and (e) increase the reliability of forecasting of extreme rainfall events.
• The datasets of extreme rainfall indices assess the intensity, frequency, and duration of extreme weather events.
• This dataset can be a proxy for hydrometeorological hazards such as droughts, floods, and heavy rains in the analyzed region.
• The new dataset is useful for institutions, researchers, and experts involved in climate risk management, water resource management, and other fields related to climate variability and change.

Data Description
This paper reports the time series of extreme rainfall indices for the Metropolitan area of Cali in southwestern Colombia, South America (Fig. 1), from 1969 to 2019. Daily rainfall time series from 12 stations were used to construct the extreme rainfall indices. The stations are presented in Fig. 1, and the statistical description of the rainfall data series is given in Table 1. The total annual rainfall for the overall period is between 890.9 and 2803.2 mm, and its standard deviation varies between 211.3 and 542 mm. The mean daily rainfall ranges from 2.5 to 7.7 mm day⁻¹, with a standard deviation between 6.7 and 14.1 mm day⁻¹ (see Table 1). Twelve extreme rainfall indices were selected, which monitor rainfall intensity (5 indices), frequency (5 indices), and duration (2 indices). The climate indices based on daily rainfall are described in Table 2.
Rainfall stations with up to 6% of missing data in the considered period were selected. Because the daily rainfall series contain missing data, the extreme rainfall index series are also compromised. Owing to the criteria adopted in their calculation, the percentage of missing data in the extreme rainfall indices is typically higher than in the daily data. Zhang et al. [1] specify that a monthly index is not calculated if more than three daily values are missing in a month, and an annual index is not calculated if more than 15 daily values spread over the year, or a month, are missing in a year. Table 3 shows the percentage of missing data for the daily precipitation time series and for the extreme rainfall indices at annual and seasonal scales. The missing data of the climate indices were filled using the complete Non-Linear Principal Component Analysis (NLPCA) topology, where the decoder is used after the bottleneck, i.e., the inverse NLPCA, which takes the principal components to recover the original information. Fig. 2 shows the flow chart of the methodology. Fig. 3 shows a heat map, constructed with R v.4.1.2 using the ggplot package, that represents the performance metrics of the extreme precipitation indices at annual and seasonal scales. Performance was assessed through Pearson's correlation coefficient (CC), the bi-weighting correlation (Bicor), the percent bias (Pbias), and the RMSE-observations standard deviation ratio (RSR). The rows represent the extreme rainfall indices, and the columns represent the stations grouped by the time scales studied. The good performance in the estimation of extreme rainfall indices is shown by CC and Bicor (Pbias and RSR) values close to 1 (0) in all time series, except for CWD and PRCPTOT during December to February (DJF) and June to August (JJA), respectively. The equations for CC, Bicor, Pbias and RSR are shown in Table 4.

Table 4
Pairwise and categorical statistics.

Name | Equation | Units | Perfect Score
where $Y_i$ = estimated index during period $i$, $Y_o$ = observed index, $N$ is the total number of observations, $\zeta_{xx}$ is the biweight midvariance of $x$, $\zeta_{yy}$ is the biweight midvariance of $y$, and $\zeta_{xy}$ is the biweight midcovariance of $x$ and $y$.
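The four performance metrics named above can be computed as in the following sketch. It uses the standard textbook forms of each metric, which are assumed here rather than copied from Table 4: Bicor is taken as $\zeta_{xy} / \sqrt{\zeta_{xx}\,\zeta_{yy}}$ with the usual tuning constant of 9, and the sign convention of Pbias (estimated minus observed) is an assumption. The function names and the sample values are hypothetical.

```python
import math
from statistics import mean, median


def pearson_cc(obs, est):
    """Pearson's correlation coefficient between observed and estimated values."""
    mo, me = mean(obs), mean(est)
    num = sum((o - mo) * (e - me) for o, e in zip(obs, est))
    den = math.sqrt(sum((o - mo) ** 2 for o in obs) * sum((e - me) ** 2 for e in est))
    return num / den


def _biweight_terms(x):
    """Weighted deviations (x - median) * w used by the biweight estimators."""
    med = median(x)
    mad = median([abs(v - med) for v in x])  # raw median absolute deviation, assumed > 0
    terms = []
    for v in x:
        u = (v - med) / (9.0 * mad)  # 9 is the customary tuning constant (assumption)
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        terms.append((v - med) * w)
    return terms


def bicor(obs, est):
    """Biweight midcorrelation: midcovariance over the root of the midvariances."""
    a, b = _biweight_terms(obs), _biweight_terms(est)
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den


def pbias(obs, est):
    """Percent bias; positive values mean overestimation under the assumed sign."""
    return 100.0 * sum(e - o for o, e in zip(obs, est)) / sum(obs)


def rsr(obs, est):
    """Ratio of the RMSE to the standard deviation of the observations."""
    mo = mean(obs)
    err = math.sqrt(sum((o - e) ** 2 for o, e in zip(obs, est)))
    spread = math.sqrt(sum((o - mo) ** 2 for o in obs))
    return err / spread


# Made-up annual index values, for illustration only.
observed = [52.0, 61.5, 48.3, 70.2, 66.0, 55.4]
estimated = [50.9, 63.0, 47.5, 71.8, 64.2, 56.1]
scores = {
    "CC": pearson_cc(observed, estimated),
    "Bicor": bicor(observed, estimated),
    "Pbias": pbias(observed, estimated),
    "RSR": rsr(observed, estimated),
}
```

A perfect estimate gives CC = Bicor = 1 and Pbias = RSR = 0, which matches the reading of the heat map described above.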
The observed and estimated extreme rainfall indices are shown in Figs. 4-7. The annual series are presented for two intensity indices (the highest amount of daily rainfall, RX1day, and very wet days, d95p), a frequency index (number of days with rainfall ≥ 20 mm, R20mm), and a duration index (consecutive dry days, CDD). Four stations with a high percentage of missing data in the extreme rainfall indices at the annual scale were plotted: Univalle (27%), La Teresita (24%), Col. SJ Bosco (22%), and Los Cristales (16%) (see Table 3). All the time series and graphs of the annual and seasonal indices for the rainfall stations are available in the Appendix.

Fig. 3. Statistics of missing data estimation error obtained for annual and seasonal extreme rainfall indices from 1969 to 2019 over twelve rainfall stations (Fig. 1). For Pbias and RSR, dark (light) colors indicate better (worse) statistical estimation.

Study area description
The Metropolitan area of Cali (MAC) is located in southwestern Colombia and is dynamically structured as a territorial and functional unit based on a physical environment. The MAC is the third-largest Metropolitan area in the country, with an estimated population of 3.9 million inhabitants in 2018 [2, 3]. The MAC comprises Cali, Candelaria, Jamundí, Palmira, Florida, Pradera, and Yumbo, covering 3580 km² in the geographic valley of the Cauca River, the most important basin in the country. The MAC's geomorphological, geological, and hydroclimatological characteristics promote hydrological hazards (e.g., floods and droughts) that negatively affect its economic and social development [4-6]. The resulting socio-economic impacts include losses in extensive territories dedicated to agriculture and livestock and in urban and rural areas, partial or total destruction of infrastructure, and energy deficits [5, 7, 8].

Rainfall indexes
The climate extreme rainfall indices were calculated using the ClimInd package (available at http://etccdi.pacificclimate.org/software.shtml) [1]. The indices are based on daily rainfall (RR) information (Table 2) from surface data (Table 1). Twelve climate indices related to the intensity, frequency, and duration of rainfall extremes were selected. The indices are determined at the yearly and seasonal scales for each rainfall station.
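As an illustration of how such indices follow from a daily series, the sketch below computes four of the indices named in this article (RX1day, R20mm, PRCPTOT, and CDD) for one year of daily rainfall. It is not the ClimInd implementation. The 1 mm wet-day threshold follows the usual ETCCDI convention and is an assumption here, as is the choice that a missing day interrupts a dry spell. Only the "more than 15 missing days" part of the annual criterion of Zhang et al. [1] is applied; the per-month check is omitted for brevity. The function name `annual_indices` is hypothetical.

```python
WET_DAY_MM = 1.0            # assumed wet-day threshold (ETCCDI convention)
MAX_MISSING_DAYS_YEAR = 15  # annual completeness criterion quoted from Zhang et al. [1]


def annual_indices(daily_mm):
    """Compute four annual indices from one year of daily rainfall in mm.

    Missing days are given as None. Returns None when the year has too many
    gaps, which is how a missing value enters the index series.
    """
    n_missing = sum(1 for v in daily_mm if v is None)
    if n_missing > MAX_MISSING_DAYS_YEAR:
        return None

    observed = [v for v in daily_mm if v is not None]
    rx1day = max(observed)                                 # highest daily rainfall
    r20mm = sum(1 for v in observed if v >= 20.0)          # days with RR >= 20 mm
    prcptot = sum(v for v in observed if v >= WET_DAY_MM)  # total rainfall on wet days

    # CDD: longest run of consecutive dry days (RR < 1 mm).
    # A missing day ends the current run (assumption).
    longest = run = 0
    for v in daily_mm:
        if v is not None and v < WET_DAY_MM:
            run += 1
            longest = max(longest, run)
        else:
            run = 0

    return {"RX1day": rx1day, "R20mm": r20mm, "PRCPTOT": prcptot, "CDD": longest}


# Made-up year of 365 days, for illustration only.
example_year = [0.0] * 10 + [25.0, 3.0] + [0.0] * 5 + [40.5] + [0.0] * 347
example_indices = annual_indices(example_year)
```

Returning `None` for an incomplete year mirrors why the index series has a higher share of gaps than the daily series: a handful of missing days can invalidate a whole year.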

Missing data estimation
Because some stations exhibited more than 3 or 15 missing days at the seasonal and annual scales, respectively, the auto-associative artificial neural network approach called NLPCA was used to estimate missing data in the time series of extreme rainfall indices. The methodology applied for the missing data estimation is based on the decoder, the second phase of the NLPCA, known as inverse NLPCA. It is a non-linear generalization of standard Principal Component Analysis that maps the components back to the original representation. This algorithm was established by Scholz et al. [9, 10] and was used to estimate missing data in the field of hydro-climatology by Miró et al. [11] and Canchala et al. [12].
The inverse NLPCA uses the reconstruction function $\Phi_{gen}: y \rightarrow x$, performed by a feedforward network. Eq. (1) shows that the output $\hat{x}$ depends on the input $y$ and the ANN weights $w \in \{W_3, W_4\}$:

$\hat{x} = \Phi_{gen}(w, y)$ (1)

where the goal of $\Phi_{gen}$ is to estimate a dataset $\hat{x}$ that approximates the target data $x$ by minimizing the squared error $\lVert x - \hat{x} \rVert^2$. More details about this technique are available in Scholz et al. [9, 10], and the NLPCA toolbox used in this study is available at http://www.nlpca.org/matlab.html.
The missing data of the annual and seasonal extreme rainfall indices were estimated using the extreme rainfall indices described in Table 3 as the inputs and a [12, 11, 12] network topology (see Fig. 2).
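The idea behind the inverse NLPCA can be illustrated with a small decoder-only network in Python with NumPy. This is a simplified sketch, not the MATLAB NLPCA toolbox used in this study. It treats both the component values (`Z`, the decoder input) and the weights `W3`, `W4` as unknowns and minimizes the squared reconstruction error over the observed entries only, so that missing entries can be read off the reconstruction. The function name, the layer sizes, the learning rate, the plain gradient descent, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def inverse_nlpca_fill(X, n_components=2, n_hidden=6, n_iter=3000, lr=0.01, seed=0):
    """Fill NaNs in X (samples x variables) with a decoder-only network.

    Illustrative only. Returns the filled matrix and the loss history.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)  # True where a value was observed
    n, d = X.shape

    # Standardise each column using the observed values only.
    mu = np.nanmean(X, axis=0)
    sd = np.nanstd(X, axis=0)
    sd[sd == 0] = 1.0
    Xs = np.where(mask, (X - mu) / sd, 0.0)

    # Unknowns: component values Z and the decoder weights W3, W4 (plus biases).
    Z = 0.1 * rng.standard_normal((n, n_components))
    W3 = 0.5 * rng.standard_normal((n_components, n_hidden))
    b3 = np.zeros(n_hidden)
    W4 = 0.5 * rng.standard_normal((n_hidden, d))
    b4 = np.zeros(d)

    losses = []
    for _ in range(n_iter):
        H = np.tanh(Z @ W3 + b3)  # hidden layer
        Xhat = H @ W4 + b4        # reconstruction x_hat
        R = (Xhat - Xs) * mask    # residuals; missing entries contribute nothing
        losses.append(0.5 * float((R ** 2).sum()) / n)

        # Backpropagation of the masked squared error.
        gA = (R @ W4.T) * (1.0 - H ** 2)
        gW4, gb4 = H.T @ R / n, R.mean(axis=0)
        gW3, gb3 = Z.T @ gA / n, gA.mean(axis=0)
        gZ = gA @ W3.T            # per-sample gradient for the component values

        W4 -= lr * gW4
        b4 -= lr * gb4
        W3 -= lr * gW3
        b3 -= lr * gb3
        Z -= lr * gZ

    # Map the reconstruction back to the original units; keep observed values as they are.
    Xhat = (np.tanh(Z @ W3 + b3) @ W4 + b4) * sd + mu
    return np.where(mask, X, Xhat), losses


# Synthetic example: four variables driven by one hidden factor, about 10% of entries hidden.
rng = np.random.default_rng(1)
t = np.linspace(-1.0, 1.0, 40)
X = np.column_stack([t, t ** 2, np.sin(3 * t), t ** 3]) + 0.01 * rng.standard_normal((40, 4))
X[rng.random(X.shape) < 0.1] = np.nan
filled, losses = inverse_nlpca_fill(X)
```

Because the error is masked, an incomplete row still contributes through its observed entries. That is the property that makes a decoder-only formulation convenient for gap filling.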

Ethics Statements
The authors agree that there are no ethics statements to be made.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.