Imputation of precipitation data in northeast Brazil

Abstract This article evaluates four statistical methods of multiple imputation to fill in the missing data of daily precipitation in Northeast Brazil (NEB). We used a daily database collected by 94 rain gauges distributed in NEB from January 1, 1986 to December 31, 2015. The methods were: random sampling from the observed values; predictive mean matching, Bayesian linear regression; and bootstrap expectation maximization algorithm (BootEm). To compare these methods, missing data from the original series were initially excluded. The next step was to create three scenarios for each method, in which 10\%, 20\% and 30\% of the data were removed at random. The BootEM method presented the best statistical results. With the average bias between the complete series and the imputed series values ranging between -0.91 and 1.30 mm/day. The values of the Pearson correlation ranging between 0.96, 0.91 and 0.86 respectively for 10\%, 20\% and 30\% missing data. We conclude that this is an adequate method for the reconstruction of historical precipitation data in NEB.


INTRODUCTION
A very common occurrence in the databases of scientific studies is the occurrence of missing data.In the area of climate sciences, these failures are associated with occasional interruptions of automatic weather stations, malfunction of measuring instruments, reorganization of station networks and typing errors, among others.Since in the climate sciences most research depends on observational data from climatic seasons, it is challenging to analyze climatic phenomena.A difficulty factor of these studies is the quality of the data series used, especially with daily scale due to the frequent existence of flaws that make an investigation of a specific location or application of certain indexes unfeasible.The studies, even if presented in reports of the Intergovernmental Panel on Climate Change (IPCC), are limited to carrying out the analyses trying to extract as much information as possible for decision making.
According to Jahan et al. (2018), missing values in precipitation series can make efficient results impossible to obtain in hydrological, agricultural and climatological studies.According to Aybar et al. (2019), missing values pose a major problem to building grid datasets.The imputation of these values is crucial to minimize the lack of homogeneity in grid datasets (Yanto & Rajagopalan 2017).In addition, In this context, several authors have emphasized the lack of quality of historical series of meteorological data, such as Teegavarapu & Chandramouli (2005), Haylock et al. (2006) and Vincent & Yumiko (2006).The biggest obstacle to quantifying climate change is the scarcity of high-quality data, since it is not possible to avoid data failures in historical series.Thus, it is necessary to use alternative methods to treat the data in the best possible way in order to extract information that helps professionals to understand the climate.
One possibility (not advisable) is to exclude periods and/or variables with missing data from the analyses or to ignore the problem.However, this disregards information that may be relevant for the analysis and can induce bias in the final result, especially if the missing data gap is very large (Rubin 1996, Nunes et al. 2009).In addition, the occurrence of missing data makes it difficult to use some statistical methods, which require a complete database for application, such as the analysis of extreme values of a given meteorological variable.To work around this problem, several researchers (Teegavarapu & Chandramouli 2014, Dikbas 2017, Rodrigues et al. 2019) have used statistical methods called data imputation.These methods propose to replace the missing data with plausible estimated values.
A large number of methods exist in the literature on imputing lost data, which can be applied in local or global analyses (Armina et al. 2017).Some of these studies are focused on the analysis of climate series.In this sense, Teegavarapu & Chandramouli (2014) evaluated imputation methods based on optimal proximity, k-nearest neighbor classification and the k-means clustering method, using precipitation data recorded by 13 stations in Kentucky and 41 stations in Florida, USA.Semiromi & Koch (2019) compared two model-based methods (singular spectrum analysis and multichannel spectrum analysis) to reconstruct groundwater level time series observed at 25 piezometric stations in Ardabil Plain, northwest Iran.Jahan et al. (2018) compared eight imputation methods to fill gaps in precipitation series recorded at 27 stations in Bangladesh.
In addition to the methods mentioned above, there are those associated with Bayesian and BootEM imputation, which have been used in different scientific fields.For example, the multiple imputation method by Bayesian linear regression has been used to impute missing data in different datasets, such as: wine dataset, glass identification dataset, comprehensive concrete strength dataset, Indian liver patient dataset and seed dataset (Jadhav et al. 2019).From another perspective, Kaplan & Yavuz (2019) evaluated multiple imputation by Bayesian linear regression using simulated data and also real data from a triennial international survey conducted evaluate education systems worldwide.
With respect to the quality of imputation of meteorological data, Chen et al. (2019) reported the efficiency of the bootstrap expectation maximization algorithm to impute data on rainfall in the Daning River Basin in China (the Xining precipitation dataset, from 1998 to 2008).The authors designed six missing data rate scenarios, 10%, 20%, 30%, 40%, 50% and 60%, and concluded that when compared with other imputation methods, the bootstrap expectation maximization algorithm performed better to fill gaps, irrespective of the percentage of missing data.Izzo et al. (2020) applied the same algorithm to fill gaps in the temperature series, with gaps up to 20%, and rainfall series, with gaps up to 25%, from meteorological stations in the Dominican Republic.
Although these methods have presented promising results, there are few initiatives addressing them in hydroclimatological studies in Brazil and other countries in South America.In general, the methods used in studies in this region are those associated with the nearest neighbor approach, the arithmetic mean with neighboring stations, and the multiple imputation by predictive means matching (Michot et al. 2019) in the Amazon basin.However, recently the bootstrap expectation maximization algorithm has been used to fill data gaps in a small number (8) of meteorological stations with 14% failure during the period from 1980 to 2013 (Marinho et al. 2020), concentrated in the coastal region of Northeast Brazil.2016) generated series through different methods of separation and interpolation of data on evapotranspiration, extreme temperatures and wind speed, organized in a regular grid of the region.Regarding information on wind speed in Brazil, Gilliland & Keim (2018) organized a database composed from different sources (averages of weather stations and re-analyses of global models) to study the variability and trends in wind speed in Brazilian territory during the period from 1980 to 2014.However, the methods adopted in these studies in Brazil have focused on spatial interpolation, which satisfies the need for complete data but at the same time generates information where there are no series.Hence, there is no way to identify whether these methods actually allow suitable inferences about data in regions with low density of original data.Our motivation is to impute data in existing series, collected in situ, that are not available in a given period due to operational or other reasons.

Xavier et al. (
Determining the appropriate analytical approach for imputing missing data is a very sensitive issue, since the use of inappropriate techniques can lead to mistaken conclusions.The imputation of the missing data can be done in two ways, single and multiple.In single imputation, a unique value is estimated for each missing value.This is insufficient to assess the variability of the estimated values for imputation (Rubin 1996, Zhou et al. 2001, Little & Rubin 2002).To overcome this deficiency, the multiple imputation (MI) method was developed by Rubin (1987).In the MI, instead of one, a set of values is estimated for each missing value, and then the estimates are grouped using, for example, the average to obtain a valid inference that reflects data variability (Yozgatligil et al. 2013).Junninen et al. (2004) compared single and multiple imputation methods for imputing missing data in Belfast and Helsinki.The datasets used consisted of NOx, NO2, O3, PM10, SO2 and CO concentrations, all on a time-scale of one per hour (hourly averages), together with four meteorological parameters: wind speed, wind direction, temperature and relative humidity.The authors concluded that, in general, the results obtained by the multiple imputation methods performed better than those obtained by the single imputation methods.
However, only last decade has MI become practicable, due to computational advancement and implementation of its methods in specialized software.MI has become the most suitable class of methods for dealing with missing data (Mcknight et al. 2007).This has led to several focused on different MI methods for filling in database gaps in the area of climate sciences (Lo Presti et al. 2010, Yendra et al. 2013, Yozgatligil et al. 2013, Dikbas 2017).
In this context, in the present study we evaluate four statistical methods of multiple imputation: random sampling from the observed values (Sampling), predictive mean matching (PMM), Bayesian linear regression (Norm) and bootstrap expectation maximization algorithm (BootEM), based on the database about daily precipitation (mm) was used from 94 meteorological stations distributed in Northeast Brazil (NEB).NEB is a region characterized by high interannual spatial variability, with urban coastal regions having a humid tropical climate and total precipitation above 2,000 mm/year, while the semiarid interior region in some places has rainfall below 300 mm/year (Oliveira et al. 2013).This region is highly vulnerable to periodic droughts (Moura & Shukla 1981, Marengo et al. 2017), a situation that is predicted to worsen with climate change (Torres et al. 2012, Torres & Marengo 2014), leading to a greater number of extreme precipitation events, mainly along the coast (Rodrigues et al. 2020(Rodrigues et al. , 2021)).

Database and study region
We used the weather station data from the Meteorological Database for Teaching and Research (BDMEP) of Brazil's National Institute of Meteorology (INMET), available at www.inmet.gov.br.The observations refer to the daily precipitation in millimeters (mm) observed by pluviometers distributed in 94 meteorological stations scattered in NEB.Each series has 30 years, between January 1, 1986 and December 31, 2015.Figure 1 shows the location of NEB and the spatial distribution of the stations.Estatística 2010).NEB basically has three climate types, with average annual rainfall varying from 300 -2,000 mm/year (Alvares et al. 2013, Oliveira et al. 2017, Rodrigues et al. 2019): a moist coastal climate, prevailing along the coast from the states of Bahia to Rio Grande do Norte; tropical climate in areas of the states of Bahia, Ceará, Maranhão and Piauí; and semiarid climate throughout the hinterlands (Alvares et al. 2013).According to the Northeast Development Superintendency, about 64.65% of the area of NEB is classified as semiarid (SUDENE 2007).

Multiple imputation
Proposed by Rubin (1987), MI basically involves three stages.The first starts with a database of observed and missing values.Through the MI method, each missing valued is imputed m times, obtaining m complete databases, as shown in Figure 2. The method used to impute the m values is the same, but the values differ.The magnitude of these differences reflects the uncertainties about the imputed values.In addition, the observed values must remain the same.The second stage consists of estimating the parameter (s) of interest for each set of imputed data, through the application of standard analysis methods for complete data.In the third stage, the results obtained can be combined using the rules proposed by Rubin (1987).These are widely reported in the literature (Schafer & Graham 2002, Little & Rubin 2014) and involve multiple imputation, irrespective of the method used to make the imputation.
The purpose is that from each analysis the estimates for the parameter of interest X are obtained, that is, X j for j = 1, 2, ..., m.According to Schafer (1997), X can be any scalar measure to be estimated, such as mean, correlation coefficient, regression coefficient or odds ratio.In this study, the combined estimate was the average of the individual estimates, denoted by Equation (1).
Then the total variance is the combination between the variance within and between the imputations (Equation 2). where: (3) In the literature, there are review articles and tutorials that assist researchers in the treatment of lost data.Rubin (1996), Schafer (1997), Schafer & Graham (2002) and Little & Rubin (2019) present a wide view of the existing methods for imputing missing values.Other works, such as Horton & Lipsitz (2001), Acock (2005) and Buhi et al. (2008), have compared some of the techniques available in the main statistical packages.

MI methods
The MI methods used in this study were: random sampling from the observed values (Sampling); predictive mean matching (PMM); Bayesian linear regression (Norm); and bootstrap expectation maximization algorithm (BootEM).Among these, Sampling is considered the simplest.It randomly chooses a value already observed to replace the missing value.PMM is a variant of linear regression (Li et al. 1991, Schafer 1997, Di Zio & Guarnera 2009), in which predicted values are observed and missing.The missing data are replaced by the observed data in which the predicted values for observed and missing are closest.The Norm method is similar to the PMM, since it involves linear regression analysis.However, the imputations are made according to the values predicted for missing values.To perform MI using the methods Sampling, PMM and Norm, the R language multivariate imputation by chained equations package (Buuren & Oudshoorn 2011) was used.
The BootEM method uses the bootstrap resampling technique via the expectation maximization (EM) algorithm.This method imputes the missing values of a time series by estimates generated by the EM algorithms implemented in several bootstrap samples of the original data (Rubin 1994, Honaker et al. 2011).Bootstrap resampling uses existing data as a pseudo-population and selects new samples with replace.For example, if X 1 , ..., X n , are independently and identically distributed from an unknown distribution F, this distribution is estimated by F n (x), which is the empirical distribution F n defined in Equation ( 4), where I(X) is the indicator function of the set X.
If the pseudo-population is incomplete (with missing data), each bootstrap resampling has a high chance of also being incomplete, making bootstrap estimates inefficient.To improve bootstrap estimates, the expectation maximization algorithm is used (Takahashi 2017).The first stage, the EM algorithm assumes a certain probability distribution.Using the mean and variance-covariance values of this distribution, the expected value of model likelihood is calculated, the likelihood is maximized, model parameters are estimated that maximize these expected values, and then the distribution is updated.The expectation and the maximization stages are repeated until the values converge (Schafer 1997).To perform MI using the methods BootEM, the R language Amelia II: A Program for Missing Data package (Honaker et al. 2011) was used.

Procedures for comparing MI methods
To compare the MI methods, the first procedure was to delete the missing data from the original database.With the complete database (with no missing data), the next step was to create three complete scenarios, in which 10%, 20% and 30% were randomly removed for each station in the database.Once this was done, multiple imputation methods were applied for each scenario, one at a time, thus obtaining 12 imputed databases, four for each scenario (Figure 3).The comparison between the complete database and each database with the imputed values was carried out by applying the following statistical measures: bias, standard deviation (SD), mean square error (MSE) and Pearson's correlation coefficient () These are given, respectively, by Equations ( 5), ( 6), ( 7) and (8).
where X i is the daily precipitation value of the complete data series, Y i is the daily precipitation value of the imputed data series, X is the average daily precipitation of the complete data series, Xis the average daily precipitation of the imputed data series, and n is the total days of the series.
The probability density function (PDF) was also used.Denoted by f X (x) a the FDP describes the behavior, in polygon form, of the frequency distribution of a random variable.The probability of the random variable being less than a given value of interest, x, is calculated using the cumulative distribution function (CDF), represented by Equation ( 9), F X (x).
The CDF of a continuous random variable is a non descending function, and the expressions are validated: F X (−∞) = 0 and F X (+∞) = 1 (Sheather & Jones 1991, Silverman 1998, Venables & Ripley 2002, Scott 2015).Conversely, the corresponding FDP can be obtained by differentiating F X (x), The FDP, f X (x), is represented by Equation (10).
The seasonal analyses were performed quarterly: December, January and February (DJF); March, April and May (MAM); June, July and August (JJA); September, October and November (SON).All analyzes were performed in R language (RC Team 2020).

Statistical analysis of imputation methods
Figure 4 shows the curve of the probability density function of the data in the three different scenarios (10, 20 and 30% of missing data).The distribution is one-tailed with greater probabilities of values close to zero.The distribution is more to the left for the 30% scenario, however, the type of distribution remains the same.Table I shows that for all scenarios and all methods, the bias is low, with a maximum of 0.0072 and a minimum of 0.0001 (values in modulus).This means that, on average, the difference between the actual and the imputed database is close to zero.Furthermore, the values of standard deviation and mean square error increase with increasing percentage of missing data.However, the measures have lower values when the BootEM method is applied, respectively, 0.0256/0.0006;0.0382/0.0014and 0.0470/0.0022for the 10%, 20% and 30% missing data scenarios, respectively.
The quality of the MI, based on the correlation coefficients, varies according to the weather station, method applied and the percentage of missing data (Figure 5).The same results were identified by Teegavarapu & Chandramouli (2014), who evaluated six imputation methods to fill gaps in six different rainfall station series: three time series between 1971-2001, for rainfall in Kentucky, USA, and three time series in the 1994-1999 interval for Florida, USA.In NEB, the correlation between the complete and imputed database decreases as the percentage of missing data increases.As shown in Figure 5,  the BootEM method has the best correlations for all scenarios, and the difference between the MI methods within the same scenario increases as the percentage of missing data increases (Figure 5).The information shown in Figure 6 corroborates the statements made previously that the correlation coefficient decreases when the scenario gets worse.However, with 10% missing data, the correlations are always above 0.85, the level of data failures typically found in series pertaining to NEB according to previous climatological analyses Oliveira et al. (2017).There are also different correlation coefficients in NEB within the same scenario.Semiromi & Koch (2019) also observed variations between the quality of 25 imputed time series of groundwater levels in the Ardabil Plain region of northwestern Iran.Michot et al. (2019) stated that the quality of the imputation methods may be related to the pluviometric density and precipitation regime of the region.Therefore, it is extremely important to consider these factors in studies to evaluate imputed series, as shown by Teegavarapu & Chandramouli (2014), Jahan et al. (2018) andAybar et al. (2019).The information in Figure 6 corroborates the statements made previously that the correlation coefficient decreases when the scenario gets worse.There are again different correlation coefficients in NEB within the same scenario.The first scenario, Figure 6a, has the highest correlation coefficients, ranging from 0.93 to 0.98.Part of the east coast, northern Ceará, Maranhão and central Bahia have better coefficients.In the second scenario, Figure 6b, in general, the north and east coast have the best coefficients, while the lowest correlation coefficients belongs to the central area of Pernambuco and the east of Bahia.In this scenario the correlation coefficients range from 0.85 to 0.95.The worst case, Figure 6c, is the one that obtained the greatest variability of the correlation coefficients, 0.8 to 0.93.In this, the west of NEB stands out, with lower correlation coefficients, and east of Rio Grande do Norte with the highest coefficients, close to 0.93.Also, for the south of Bahia the correlation coefficients are close to 0.9 for the three scenarios under study.
To complement the comparison between the original and imputed data, we present in Figure 7 the curves of PDF from the complete and imputed data (Figure 7a) over NEB and over two regions with distinct precipitation regimes, São Luís (Figure 7b) and Natal (Figure 7c).The distribution of complete data in the NEB presents a trimodal pattern with the main maximum around 55 mm, followed by relative peaks at 120 mm and 200 mm, respectively, consistent with previous analyzes of climatological studies on extreme values in the NEB (Oliveira et al. 2017, Rodrigues et al. 2020, Costa et al. 2020).The imputed data show the same trimodal pattern, however, the higher frequency values are shifted to relatively higher values while the lower values decrease, in agreement with Figure 4, in which there was a higher frequency of data close to zero as the number of missing data increases.Regarding the distribution of precipitation values in the two regions (Figures 7b-c), we verified that the imputed data preserve the distribution (with two well-defined maximums) of the complete data.At the same time, it indicates values around 400 mm in the second maximum for São Luis, while in Natal this maximum is around 300 mm.The relative maxima in São Luis are explained by the proximity to the Amazon basin, the role of the ZCIT and the influence of squall lines that form on the northern coast of NEB (Oliveira et al. 2017, Rodrigues et al. 2020).On the other hand, extreme precipitation values in Natal are influenced by ZCIT and by the action of easterly wave disturbances (EWD), which are common in the months of May to July on the east coast of the NEB (Oliveira et al. 2013, Gomes et al. 2019).The occurrence of extreme precipitation events has also been reported in previous NEB studies for both cities analyzed and the accumulated averages are consistent with other datasets independent of those analyzed here, for example, data estimated via satellites (Rodrigues et al. 2020) or through data collected in automatic surface stations (Rodrigues et al. 2021).

Seasonal and spatial analysis of precipitation
Considering the average value of NEB as a whole (Figure 8), it appears that the rains present well-marked annual seasonality, with periods of drought, having minimum monthly values below 30 mm and maximum values above 200 mm.The month of January 2004 had the largest accumulated rainfall (365 mm), associated with a persistent High Level Cyclonic Vortex, which lasted for more than 10 days that month (Seluchi 2004, Oliveira et al. 2013, Costa et al. 2020).On the other hand, the years between 2012 and 2015 suffered from severe drought (de Medeiros et al. 2020), with maximum monthly rainfall always below 150 mm.
In general, the seasonal behavior of precipitation in NEB, between the complete data series and that imputed by the BootEM method, is similar over time for the three scenarios (Figure 8).When the percentage of missing data increases, the difference between the series also increases, especially in the extreme precipitation values.
Precipitation in NEB has marked temporal variability (Figure 9), with the highest accumulation of precipitation concentrated in the MAM quarter (Figure 9).The climatological average in this quarter is around 150 mm, but it is important to note that the outliers (e.g., monthly rainfall above 350 mm in DFJ) do not occur in the precipitation period with the highest average, as also observed in previous studies (Palharini & Vila 2017, Rodrigues et al. 2019, Gomes et al. 2019).This is basically due to the nature of the meteorological systems that operate during the drier months.They are more convective in nature, with typical cases of easterly wave disturbances or mesoscale convective systems in JJA or cases of High Level Cyclonic Vortexes or the persistence of the South Atlantic Convergence Zone in DJF, as described by Oliveira et al. (2017).
On the other hand, precipitation in the rainiest period is closely linked to the effect of the Intertropical Convergence Zone, which is the main large-scale system operating in the region (Kousky 1979, Moura & Shukla 1981, Bombardi et al. 2018).The positive aspect is that in both scenarios, the imputation method is able to faithfully capture the seasonality of precipitation observed in the NEB, although in the drier months the boxplot of the imputed precipitation showed slightly higher medians compared to the original data.
The data imputed by the BootEM method representing the seasonal period in NEB, in the three studied scenarios, are shown in Figure 9a-c.Apparently, the imputed values overestimate the values of the complete data series.This occurs especially in the JJA and SON periods, when the lowest precipitation accumulated in NEB happens (Rodrigues et al. 2019).A study by Eischeid et al. (2000) to reconstruct daily precipitation and temperature data over the western United States found that the quality of the reconstructed series varied according to the time of year.Regarding NEB, the biggest difference between the complete and imputed datasets is observed in the third scenario, in the SON period, as shown in Figure 9c. Figure 10 shows the distribution of average daily precipitation over NEB by seasonal period from the complete and imputed database for scenarios of 10, 20 and 30% missing data.In general, the average daily rainfall between the complete dataset and imputed dataset (Figure 10) are similar during the study period.The spatial variability of rain is captured in the three scenarios, with more intense daily rainfall (typically above 4.0 mm/day) in the northwest and east coast of NEB and with less intensity (below 2.0 mm/day) in the semiarid region of NEB.The lowest values are observed in the JJA and SON periods.According to Rao et al. (1993), August, September and October are the months with the lowest accumulated rainfall in the NEB region.These include the JJA and SON periods.The exception occurs in eastern NEB, which has the highest cumulative rainfall between April and July (Oliveira et al. 2013, Rodrigues et al. 2019).The bias between imputed and full data varies between -1.3 and 1.3 mm (Figure 11).In the first scenario (10% missing data) the bias is close to zero in all seasonal periods (Figure 11a).In the second scenario (20% missing data) for most of the region the bias is still close to zero (Figure 11c), but in the worst scenario (30% missing data) the bias increases especially in the coast and west region of the NEB (Figure 11c).
With regard to the spatial distribution of extreme events, it can be seen in Figure 12 that the imputed data presented the 99% percentile (P99) satisfactorily compared to the observed data.The imputed data (Figure 12) express the spatial variability of the values represented by the 99% percentile (considered extreme precipitation values) when compared to those observed according to the complete dataset (Figure 12).The coast is the region with the highest values of the 99% percentile, Figure 12.This result corroborates those of Palharini & Vila (2017)  and Rodrigues et al. (2020), who observed that the coast is the region with the most intense extreme daily precipitation in NEB.The NEB coast is characterized by high daily precipitation intensities, while the semiarid region is characterized by low rainfall values Oliveira et al. ( 2017) and Rodrigues et al. (2019).In general, the data imputed by the BootEM method capture these characteristics, Figure 12.The spatial distribution of the extreme values of the imputed data (Figure 12) are close to the extremes observed in the complete data (Figure 12), especially in the best scenario (Figure 12a).The highest percentile, 74.4 mm/day, is in the city of São Luís, located on the north coast of NEB.The percentiles of data imputed by the BootEM method for the city of São Luís are 71.84 mm, 68.57mm and 64.66 mm, respectively for the scenarios of 10%, 20% and 30% missing data.
The MAM and JJA periods has the largest range of extreme precipitation on the coast of NEB (above 70 mm/day).The highest precipitation values are observed on the NEB east coast during its rainy season, between April and July.During this period, the rainfall of this region is mainly influenced by the EWD (Torres & Ferreira 2011, Gomes et al. 2019), which frequently acts in the tropical range of the globe, in the area of influence of the moving trade winds.The bies range between -10.2 and 10.2 mm (Figure 13).It is observed that the higher the percentage of missing data, greater the bias (Figure 13).In the most part of NEB the bias is negative, that is, the imputation method underestimates the extreme precipitation.Overestimation occurs in JJA and SON in some locations on the north coast and west region of the NEB (Figure 13).

CONCLUSION
In many regions, meteorological time series exhibit a considerable number of missing data.In the present study we evaluate four statistical methods of fault filling using multiple imputation, based on the database about daily precipitation (mm) from 94 meteorological stations distributed in Northeast Brazil.The results presented in this study show that the adequate use of the multiple imputation statistical methods proved to be efficient to fill in missing data in daily precipitation series in NEB.The average bias between the complete and the imputed database is close to zero in all the methods of imputation studied.It was possible to verify that the lower the percentage of missing data in the data series is, the greater the ability of multiple imputation methods to fill the gaps in correctly.
In general, the bootstrap expectation maximization algorithm, BootEM method, showed low values of bias and lower values of standard deviation and mean square error in relation to the results of the other methods.The correlation coefficients between the complete series and the imputed series were higher when the BootEM method was used, even in the scenario where the series had a large percentage of missing data, 30%.The spatial correlations showed spatial distribution without preferential areas, which leads us to conclude that the method works independently of the precipitation regime in NEB.At the same time, a higher percentage of missing data reduced the value of the Pearson correlation.On average 0.96, 0.91 and 0.86 respectively for 10%, 20% and 30% missing data.
It was possible to evaluate the databases imputed by the BootEM method taking into account the spatial and temporal seasonality of the NEB.This result is extremely important, especially for researchers who wish to work with a specific period or subregion, since the results characterize the spatial and temporal variability present in NEB according to precipitation.In general, the average bias

Figure 1 .
Figure 1.Location of the Brazilian Northeast and the weather stations, represented by circles on the map.

Figure 3 .
Figure 3. Illustration of the steps to generate the missing and imputed databases.

Figure 4 .
Figure 4. Density of precipitation values randomly removed from the original database (with no missing data), for scenario, in which (a)10% missing data, (b) 20% missing data and (c) 30% missing data.

Figure 5 .
Figure 5. Scatter plot with values of Pearson's correlation coefficients for the four multiple imputation methods, by scenario: (a) 10% missing data; (b) 20% missing data; and (c) 30% missing data.

Figure 10 .
Figure 10.Spatial distribution of the average daily precipitation in NEB by seasonal period for (a) complete data and imputed data for scenario: (b) 10% missing data; (c), 20% missing data; and (d) 30% missing data.

Figure 11 .
Figure 11.Spatial distribution of the daily precipitation bias in NEB by seasonal period for imputed data for scenario: (a) 10% missing data; (b), 20% missing data; and (c) 30% missing data.

Figure 12 .
Figure 12.Spatial distribution of the extreme (P99) daily precipitation in NEB by seasonal period for (a) complete data and imputed data for scenario: (b) 10% missing data; (c), 20% missing data; and (d) 30% missing data.

Figure 13 .
Figure 13.Spatial distribution of the extreme (P99) daily precipitation bias in NEB by seasonal period for imputed data for scenario: (a) 10% missing data; (b) 20% missing data; and (c) 30% missing data.

Table I .
Bias*, SD and EQM* measurements of the four imputation methods by scenario.
*Average obtained from the measurements made for the 94 weather stations.IMPUTATION OF PRECIPITATION DATA IN NORTHEAST BRAZIL