Predicting missing values in spatio-temporal satellite data

Remotely sensed data are often incomplete: values are missing, for instance, due to cloud cover. This is problematic for applications and signal-processing algorithms that require complete data sets. To address the issue, we present a new gap-fill algorithm. The proposed method predicts each missing value separately, based on data points in a spatio-temporal neighborhood around the missing data point. The computational workload can be distributed among several computers, making the method suitable for large datasets. The prediction of the missing values and the estimation of the corresponding prediction uncertainties are based on sorting procedures and quantile regression. The algorithm was applied to MODIS NDVI data from Alaska and tested with realistic cloud-cover scenarios featuring up to 50% missing data. Validation against established software showed that the proposed method performs well in terms of the root mean squared prediction error. The procedure is implemented and available in the open-source R package gapfill. We demonstrate the software with a real-data example and show how it can be tailored to specific data. Thanks to the flexible software design, users can control and redesign major parts of the procedure with little effort. This makes it an interesting tool for gap-filling satellite data and for the future development of gap-fill procedures.


Introduction
Remote sensing is used to study a wide range of Earth-surface processes. Because measurements are taken from a large distance above the ground, the technology offers, compared to ground-based measurements, the advantage of large spatial and temporal coverage. It is, however, crucial to understand and correct measurement errors, which are caused, for example, by off-nadir view angles and atmospheric disturbances. We take a closer look at the data workflow of satellite observations to understand how such errors influence subsequent analyses. To correct for the resulting inaccuracies and omissions, we introduce a new gap-filling algorithm.

Missing values in satellite data
One typical example of a remote-sensing data workflow that deals with missing values is the analysis of the Earth's vegetation using measurements from the Moderate Resolution Imaging Spectroradiometer (MODIS) sensor on-board the Terra and Aqua platforms. The sensors measure radiation reflected by the Earth's surface every one to two days. Atmospheric disturbances such as cloud cover and sub-optimal view angles introduce observation error; in these cases, parts of the scene may have measurement error or missing data (e.g., Myneni et al., 2002). To retrieve the desired information, the data are preprocessed and transformed into a data product, for example, the MOD13 family of vegetation indices.
This pre-processing phase consists of aggregation techniques such as constrained-view-angle maximum-value composites (van Leeuwen, Huete, & Laing, 1999) and quality assignments (Roy et al., 2002). A resulting data product is MOD13A (Didan, Munoz, Solano, & Huete, 2015), which comprises several data layers containing values on a regular grid in space and time with a resolution of 500 m and 16 days. However, when only values classified with "good quality" are considered, the proportion of missing data can be considerable (up to 100% of the pixels for certain images of the Alaska region). This may negatively influence the inference of Earth-surface processes or even render certain analysis techniques infeasible.

Strategies for handling the missing-data problem can be divided into two groups. The first group employs statistical analysis methods that are robust to missing values. The second group predicts the missing values in an additional processing step before the analysis. The latter can be done either for a data product containing missing values or as an integral part of the pre-processing.

Existing approaches to handle missing values
In the following overview, existing methods that handle satellite data with missing values are grouped into robust analysis methods and methods that reconstruct a complete data product before the actual data analysis.

Robust analysis methods
When describing vegetation using remotely sensed data, the temporal characterization of the process is of great interest. This includes short- and long-term trends of the values themselves and of derived quantities such as growing-season onset, growing-season length, etc. Many methods exist to study these phenomena by interpreting the satellite data as a collection of spatially independent time series (one per pixel in the spatial extent). Commonly, the time series are smoothed to reduce noise and to fill the missing values. The information of interest is then extracted from the smoothed time series. Hence, the presence of missing values is no longer a problem, as long as the smoothing is not negatively influenced by them. Examples of smoothing methods are the Savitzky-Golay filter (Chen et al., 2004), the least-squares fitted asymmetric Gaussian functions, and the double logistic smoothing functions, which are all implemented in the software TIMESAT.
Moreover, specialized methods exist to study local trends in vegetation index time series, e.g., the "breaks for additive season and trend" (BFAST) algorithm (Verbesselt, Hyndman, Newnham, & Culvenor, 2010; Verbesselt, Hyndman, Zeileis, & Culvenor, 2010) and the "detecting breakpoints and estimating segments in trend" (DBEST) algorithm (Jamali, Jönsson, Eklundh, Ardö, & Seaquist, 2015). While all of these methods are based on the temporal correlation of the values, Bolin et al. (2009) presented a method that exploits both the temporal and spatial correlation to study spatial patterns of temporal trends in NDVI data.

Reconstructing a complete data product
Another strategy to handle missing values is to predict them based on the observed data. This is typically done in an additional step before the analysis of interest. Several authors discuss the use of geostatistical methods such as Kriging and Co-Kriging for this task (Addink, 1999; Rossi, Dungan, & Beck, 1994). These methods exploit the spatial correlation of the data within an image. In the case of Co-Kriging, information from images observed at other points in time is included via regression terms.
In contrast to the aforementioned methods, the following approaches are not derived from the classical geostatistical framework. Chen et al. (2011) proposed a method known as the "neighborhood similar pixel interpolator", which uses weighted values from images observed at other points in time. This approach was applied to Landsat ETM+ SLC-off data (Mohammdy et al., 2014) and combined with a Kriging approach (Zhu, Liu, & Chen, 2012). Based on similar ideas, an algorithmic gap-fill procedure for MODIS EVI data was derived and tested at continental scale (Weiss et al., 2014). Other methods use linear regression models fitted to a spatio-temporal window around the missing pixel (J. C. de Oliveira & Epiphanio, 2012; Julio Cesar de Oliveira, Epiphanio, & Renno, 2014). These methods rely on additional land-cover data (Kang, Running, Zhao, Kimball, & Glassy, 2005; Maxwell, Schmidt, & Storey, 2007), or use the quality information provided with MODIS data products to predict missing values (J. Gu, Li, Huang, & Okin, 2009; Y. Gu, Bélair, Mahfouf, & Deblonde, 2006).

Outline
We introduce and discuss a new gap-fill procedure, which can be applied to reconstruct a complete data product before the analysis of satellite data. Similar to the methods presented by Weiss et al. (2014), de Oliveira and Epiphanio (2012), and de Oliveira et al. (2014), we use spatio-temporal subsets around the missing values to predict them. Our contributions are that we (1) formalize this subset-predict procedure in a generic way and design the software implementation of the algorithm accordingly, (2) present a new instance of such a subset-predict algorithm, which provides very accurate fill values for our test scenarios, and (3) base the proposed algorithm on a statistical framework, which helps to quantify the uncertainties associated with the predicted values.
We introduce the gap-fill method, the validation methods, and the test datasets in the following sections. The performance of the proposed procedure is then assessed with four test scenarios and a realistic MODIS NDVI data example. In addition, we compare the accuracy of the predicted values against those obtained with the TIMESAT software and with the "Gapfill-Python" program implementing the algorithm described in Weiss et al. (2014).

The gap-fill method
We illustrate the method with an example dataset consisting of NDVI images of 21 × 21 pixels, observed four times per season during the years 2004-2007. With that, the data array has 21 × 21 × 16 = 7′056 elements in total. A characteristic of MOD13 data is that the values exhibit spatial and temporal dependencies. This is usually the case for (pre-processed) satellite data and can also be observed in the example data. We distinguish between observed and missing values. In panel (a) of Figure 1, 1′603 (≈ 23%) values are missing and depicted as black pixels. The goal of the gap-filling procedure is to predict the missing values from the observed ones.
In general, the target satellite products comprise large amounts of data. Therefore, a gap-fill procedure of practical significance has to be efficient in terms of computational resources and scalable in the sense that the gap-fill task can be distributed to several computing units. In order to achieve this, the presented gap-fill procedure predicts each missing value separately by taking only a subset of the data into account. The prediction has two main steps: (1) subset the data and (2) predict the missing value based on that subset. With that, the algorithm is adaptive and can recover local features of the data. This is useful if, for example, a region of the data exhibits a temporally shifted seasonal pattern or a long-term temporal trend. Note that many different algorithms having the subset-predict structure can be constructed.
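As a minimal illustration, the subset-predict structure can be sketched as follows (a Python sketch under stated assumptions; the actual implementation is in R/C++, and the function names here are hypothetical):

```python
import numpy as np

def gapfill(data, subset_fn, predict_fn):
    """Generic subset-predict driver (illustrative sketch): each missing
    value (np.nan) is handled independently, so the loop parallelizes
    trivially, e.g. by splitting the index list across processes or
    machines."""
    filled = data.copy()
    for idx in zip(*np.where(np.isnan(data))):
        h = 0
        pred = None
        while pred is None and h < 50:   # grow the subset until predictable
            pred = predict_fn(subset_fn(data, idx, h))
            h += 1
        if pred is not None:
            filled[idx] = pred
    return filled
```

Because every missing value is treated independently, the loop can be distributed without any coordination between computing units, which is what makes the framework scalable.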
In the following, the gap-fill procedure is described for one missing value. To fill an entire dataset, the procedure is repeated for all missing values. The gap-filled version of the example dataset is shown in panel (b) of Figure 1. In the first step, the data are subset to a neighborhood around the missing value containing the relevant information for the prediction. This subset is defined as a four-dimensional array extending a1 = a2 = 5 pixels in all spatial directions from the missing value and including images not further apart than a3 = 1 step in both directions of the seasonal index and a4 = 5 years in both directions of the year index. Note that a1, …, a4 and the later introduced c1, c2, c3 are tuning parameters and may be altered in order to optimize the procedure for a given dataset; the choice of these parameters can be justified via cross-validation. In panel (a) of Figure 1, the missing pixel of interest is marked with a white cross and the corresponding subset is depicted with dashed squares. The following two criteria are verified to assess whether the subset contains enough non-missing values: (C1) the subset must contain at least c1 = 5 non-empty images, and (C2) the image in the subset containing the missing pixel must have at least c2 = 25 non-missing values. If one of the criteria is not met, the spatial extent of the subset is increased (by increasing the values of a1 and a2) until both criteria are fulfilled. With that mechanism, the spatial extent of the subset is adapted to the local distribution of missing values.
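A sketch of the adaptive subset selection with the criteria (C1) and (C2) might look as follows (Python; the parameter names mirror the tuning parameters of the procedure, but the code is an illustrative reconstruction, not the package's implementation):

```python
import numpy as np

def select_subset(data, i, j, s, y, a=(2, 2, 1, 1), c1=3, c2=4, max_grow=50):
    """Return the spatio-temporal subset around missing pixel (i, j, s, y).

    data : 4-d array (rows, cols, seasons, years) with np.nan = missing.
    a    : initial half-widths (a1, a2) in space and (a3, a4) in time.
    The spatial extent grows by h until criteria (C1) and (C2) hold.
    """
    a1, a2, a3, a4 = a
    n_row, n_col, n_sea, n_yr = data.shape
    for h in range(max_grow + 1):
        r = slice(max(i - a1 - h, 0), min(i + a1 + h + 1, n_row))
        q = slice(max(j - a2 - h, 0), min(j + a2 + h + 1, n_col))
        t = slice(max(s - a3, 0), min(s + a3 + 1, n_sea))
        u = slice(max(y - a4, 0), min(y + a4 + 1, n_yr))
        sub = data[r, q, t, u]
        # (C1): at least c1 images in the subset contain observed values
        n_nonempty = int((~np.isnan(sub)).any(axis=(0, 1)).sum())
        # (C2): the image containing the missing pixel has >= c2 observed values
        n_center = int((~np.isnan(data[r, q, s, y])).sum())
        if n_nonempty >= c1 and n_center >= c2:
            return sub
    return None  # give up: not enough data nearby
```

Only the spatial half-widths grow with h; the temporal extent stays at its initial value, matching the description above.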
In the second step, the missing value is predicted based on the described subset. The latter is interpreted as a collection of images, which we rank according to the intensities of their values. The rationale behind this is that the ranked series of images imitates the seasonal evolution of the spatial field with an artificially high temporal resolution. The ranking is based on an algorithm that scores the images via pixel-wise comparisons of non-missing values. The main assumption of the scoring is that the values of the images in the subset have a similar but potentially shifted spatial distribution. This is exploited to construct an algorithm that is only marginally affected by missing values. In the data example, a subset of the ordered images and their scores are depicted in panel (c) of Figure 1. There, the missing pixel of interest is marked with a white cross and is contained in the center image, ranked 8th.
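The scoring algorithm is described only at a high level here; the following Python sketch implements one plausible pixel-wise comparison scheme in that spirit (the exact scoring rule of the package may differ):

```python
import numpy as np

def rank_images(images):
    """Rank a list of 2-d arrays (np.nan = missing) by value intensity.

    Each pair of images is compared on their jointly observed pixels;
    an image's score accumulates the fraction of those pixels on which
    it exceeds the other image.  Missing values only shrink the
    comparison set, so the ranking degrades gracefully with increasing
    missingness.
    """
    n = len(images)
    scores = np.zeros(n)
    for ia in range(n):
        for ib in range(n):
            if ia == ib:
                continue
            both = ~np.isnan(images[ia]) & ~np.isnan(images[ib])
            if both.any():
                scores[ia] += np.mean(images[ia][both] > images[ib][both])
    order = np.argsort(scores)           # indices from lowest to highest
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(1, n + 1)   # rank 1 = lowest-valued image
    return ranks, scores
```

The pairwise comparisons use only jointly observed pixels, which is what makes this style of scoring robust to missing values.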
The bottom row of the same panel displays the ordered images as a scatter plot, with the estimated rank on the x-axis and the observed NDVI values on the y-axis.
To finally obtain the prediction of the missing value, linear quantile regression is used. The regression has the NDVI values as the response and an intercept as well as the ranks of the images as linear predictors.
With that, the ranked images serve as a non-parametric seasonality adjustment. The quantile of interest is estimated from the images that have an observed value at the spatial location of the missing value.
We require at least c3 = 2 such values to estimate the quantile. If the criterion is not met, spatially neighboring pixels are considered too. The quantile is estimated by evaluating the empirical cumulative distribution function at each such value and averaging the results. In panel (c) of Figure 1, the quantile of the missing pixel was estimated to be the 47%-quantile. The corresponding regression line is depicted as a dashed blue line and the predicted NDVI value as a red cross.
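The quantile estimation and the quantile regression described above can be sketched as follows (Python; the ECDF convention and the pair-enumeration fit are illustrative choices, the latter exploiting the fact that an optimal single-predictor quantile-regression line passes through two data points):

```python
import numpy as np

def estimate_quantile(images, pixel):
    """Estimate which quantile the missing pixel occupies: for every
    image where the pixel is observed, evaluate that image's empirical
    CDF at the pixel value, then average the results.  (The ECDF
    convention is an assumption for illustration.)"""
    taus = []
    for img in images:
        v = img[pixel]
        if np.isnan(v):
            continue
        obs = img[~np.isnan(img)]
        taus.append(np.mean(obs <= v))
    return float(np.mean(taus))

def quantile_regression(ranks, values, tau):
    """Fit values ~ b0 + b1 * rank at quantile tau by minimizing the
    pinball (check) loss.  An optimal line interpolates two data points,
    so enumerating all pairs is exact and fine for small subsets."""
    x, y = np.asarray(ranks, float), np.asarray(values, float)
    def loss(b0, b1):
        r = y - (b0 + b1 * x)
        return np.sum(np.maximum(tau * r, (tau - 1) * r))
    best = None
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if x[i] == x[j]:
                continue
            b1 = (y[j] - y[i]) / (x[j] - x[i])
            b0 = y[i] - b1 * x[i]
            L = loss(b0, b1)
            if best is None or L < best[0]:
                best = (L, b0, b1)
    return best[1], best[2]
```

The prediction of the missing value is then b0 + b1 · r, where r is the rank of the image containing the missing pixel.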
The described gap-filling approach is implemented in R/C++ and is available as an open-source R package at http://cran.r-project.org/package=gapfill (Gerber, 2016; R Core Team, 2016). The program features a flexible design allowing the user to optimize the gap-fill procedure for specific datasets and to construct new gap-fill algorithms based on the subset-predict framework with little effort. Gap-filling can be executed in parallel via interfaces to openMP and MPI back-ends, making the software suitable for large datasets. More information about the usage and implementation of the R package is given in Section S2 of the supplementary material and in the reference manual of the R package.

Formal description
This section provides a technical description of the gap-fill procedure and can be skipped without loss of the general idea. The notation is summarized in Table 1. The idea is to start with h = 0 and to try different neighborhoods by increasing h until the prediction function returns the predicted value ŷ0.

Subset to a neighborhood
To be more specific about the search strategy for suitable subsets, let N(z0, h, a1, a2, a3, a4) be a four-dimensional box around the missing value z0. The following definition shows that the spatial extent increases as h increases.
The parameters a1, …, a4 define the initial and minimal extent of N. In the case where z0 is not close to a boundary, the spatial extent of N is (2a1 + h) × (2a2 + h) pixels.

Predict based on a subset
The first task of the prediction function is to decide whether to return NA or the predicted value ŷ0.
In order to return ŷ0, both of the following criteria need to be fulfilled: (C1) N must contain at least c1 ∈ ℕ non-empty images.
(C2) The image in N containing the missing pixel must have at least c2 ∈ ℕ non-missing values.

Prediction uncertainties
Uncertainty estimates of the predicted values are essential when using them to derive conclusions in further analyses. Statistical theory provides ways to quantify uncertainty through the estimation of prediction variability and confidence intervals. Possible strategies that can be applied to the proposed gap-fill method are based on resampling techniques and cross-validation. While the former is computationally expensive, and therefore difficult to apply to large datasets, the latter was applied in a similar setting (Weiss et al., 2014). However, both approaches are inaccurate if the underlying assumptions about the data are not met. For example, the commonly made "missing at random" assumption does not hold for the data considered in Section 3.1, because low NDVI values have a larger probability of being missing.
Besides that, it is interesting to study the magnitude of the uncertainties introduced by the different steps of the procedure, namely (1) choosing the size of the initial subset around the missing value, (2) ranking the images, (3) estimating the quantile, and (4) fitting the quantile regression. For step (2), the ranks of images with similar scores are permuted; predicted values for all such permutations are then calculated, and a 90% prediction interval is constructed by considering the 5% and the 95% quantiles thereof. For step (3), we derive a 90% prediction interval by considering the 5% and the 95% quantiles of the estimated quantiles of Algorithm 2. Finally, the uncertainty introduced in step (4) is assessed by calculating a 90% prediction interval based on bootstrap methods.
Estimates of the prediction uncertainties of the gap-filled values could be derived by combining the uncertainties of all four previously described steps. However, doing so in a meaningful way is not straightforward due to possible interactions and cancellation effects among them. Nevertheless, we combine the uncertainties from steps (2) and (3) in one prediction interval. This prediction interval reflects the local spatial and temporal heterogeneity of the data around the predicted value. An evaluation of the properties of that interval and its practical relevance is given in Sections 3.3.2 and 4.2.
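Given an ensemble of predicted values (e.g., one per rank permutation in step (2), or per bootstrap replicate in step (4)), the 90% prediction interval reduces to two empirical quantiles; a minimal sketch:

```python
import numpy as np

def prediction_interval(preds, level=0.90):
    """Interval from an ensemble of predictions: take the (1-level)/2
    and 1-(1-level)/2 empirical quantiles (5% and 95% for level=0.90)."""
    lo = (1.0 - level) / 2.0
    return tuple(np.quantile(preds, [lo, 1.0 - lo]))
```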

Data and validation method
Several test scenarios were constructed and the predicted values, together with their uncertainty components, were investigated. In addition, the accuracy of the gap-filled values was compared against those of two alternative gap-fill procedures.

Data
We considered the MOD13A1 satellite product, which is part of the MODIS vegetation index product (MOD13) (Justice et al., 2002). It is a land-surface product based on pre-composited 8-day MODIS Level-2G surface reflectance data, which have been further composited to obtain the final resolution of circa 16 days and 500 m (Didan et al., 2015; van Leeuwen et al., 1999). The NDVI layer can be used to describe vegetation activity (Huete et al., 2002). We used the pixel reliability quality-assignment layer to subset the NDVI data to values classified as "good data" (Roy et al., 2002).
To illustrate the gap-fill procedure, the NDVI data from the years 2004 to 2009 were considered for the subset of the region of northern Alaska shown in Figure 2. Due to the high latitudes, ranging from 66° north to more than 71° north, the NDVI values exhibit a strong seasonal component. That is reflected in both the NDVI values and the number of available values classified as "good data". Especially during wintertime, little data are available because of missing sunlight and snow cover. Therefore, we restrict the analysis to the seasonal period starting on the 145th day of the year (about May 24) and ending on the 257th day of the year (about September 13), featuring 8 images per season. With that, the data of each considered day of the year have at least 30% of the values classified as "good data". The MOD13A1 data were downloaded in 6 spatial tiles and merged to one single image per considered day of the year using the R package MODIS (Mattiuzzi, 2015), which interfaces the MODIS reprojection tool (Dwyer & Schmidt, 2006). Furthermore, the data were transformed from the sinusoidal to the geographic map projection (WGS84). The R packages raster (Hijmans, 2015), sp (Bivand, Pebesma, & Gomez-Rubio, 2013; Pebesma & Bivand, 2005), fields (Nychka, Furrer, & Sain, 2015), lattice (Sarkar, 2008), and ggplot2 (Wickham, 2009) were used to handle and visualize the data.

Figure 2: Test scenarios with 20%, 30%, 40%, and 50% missing pixels were obtained by artificially removing values from the "Data" region according to the patterns of missing values observed in the 100 × 100-pixel regions labeled "20%", "30%", "40%", and "50%", respectively. Note that the 100 × 100-pixel regions are depicted as rectangles, as opposed to squares, because of the chosen geographic map projection.

Test scenarios
To study the performance of the gap-fill software, we constructed four test scenarios based on the MOD13A1 data. Using real data, as opposed to simulated data, has the advantage that the scenarios come close to the use case of interest. The scenarios were built by extracting a 100 × 100-pixel subset of the data described in Section 3.1, while keeping the temporal structure of 6 years with 8 images per year unchanged. The geographical location of the subset is depicted in Figure 2 as the rectangle labeled "Data". The rectangular area was selected to ensure that the resulting dataset has relatively few missing values (about 12%) and that the values reflect typical features of NDVI datasets in high latitudes. The patterns of missing values were taken from the regions depicted in Figure 2 as the rectangles denoted with "20%", "30%", "40%", and "50%". In addition, a scenario consisting of the entire spatial extent of northern Alaska, as shown in Figure 2, was compiled. While the temporal dimensions of that scenario remained unchanged, the size of the images increased to 271ʹ819 pixels; 3ʹ696ʹ691 (28%) of the values in that scenario are missing. The scenario is shown in Figure S10 of the supplementary material. The uncertainties from steps (2) and (3) were assessed by investigating the spatial and temporal distribution of the prediction-interval lengths and by calculating the coverage rate of the intervals, i.e., the proportion of intervals that cover the observed value (assumed to be true). This part of the uncertainty assessment was performed using the entire test scenario with 40% missing values.
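The coverage rate used in the uncertainty assessment is simply the proportion of intervals containing the held-out observed value; as a sketch:

```python
def coverage_rate(lower, upper, truth):
    """Proportion of intervals [lower_i, upper_i] containing truth_i.
    For a well-calibrated 90% prediction interval, the coverage rate
    should be close to 0.90."""
    hits = [lo <= t <= up for lo, up, t in zip(lower, upper, truth)]
    return sum(hits) / len(hits)
```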

Comparison with TIMESAT and Gapfill-Python
We compare the results from the described gap-fill procedure (denoted "gapfill", like the corresponding R package) against two alternative procedures, namely the gap-fill algorithm presented in Weiss et al. (2014) and the temporal interpolation methods implemented in the software TIMESAT. The former belongs to the class of algorithms that reconstruct a complete data product from the observed values (Section 1.2.2). It applies one of two different prediction algorithms depending on the amount of missing values and exploits both the temporal and the spatial structure of the data. A Python notebook, available at github.com/malariaatlas-project/modis-gapfilling, provides an implementation of the procedures in Python (Python Software Foundation, 2015) and C. The code was downloaded on August 15, 2015. In the following, we refer to that software as "Gapfill-Python". While that method was published in 2014, the TIMESAT (version 3.2) software is more than ten years older and well established. It is implemented in Fortran and comes with a MATLAB (MATLAB, 2014) interface featuring a graphical user interface. The software and its documentation are available at web.nateko.lu.se/timesat/. The main purpose of the software is to analyze time series of satellite data by extracting seasonal parameters from a smoothed version of the time series. All calculations treat the pixel-wise time series separately and, hence, do not exploit the spatial dependency in the data. The smoothing part of the method makes the analysis, to some extent, robust to outliers and missing values. The method therefore belongs to the class of methods that handle missing values through robust analysis techniques (Section 1.2.1).
Although the software was designed to use the smoothed time series in conjunction with the extraction of phenological parameters, the smoothing part of the algorithm can be used for gap-filling, and such gap-filled time series have been used to assess the performance of different types of smoothers (Atkinson et al., 2012; Hird & McDermid, 2009; Verger et al., 2013).
All considered gap-fill procedures have several tuning parameters, which influence the prediction process and the accuracy of the predictions. Although we tried to find good parameter configurations for the presented software, the results may improve with other settings. Nevertheless, the presented comparisons give an impression of the performance. The tuning parameters for the gap-fill procedure are the same as in Section 2, except for the parameters a1 and a2, which we set to 10. The R code to run the gap-fill algorithm is given in Listing S1 of the supplementary material. Gapfill-Python has about 16 parameters to be set. The most important ones are those controlling the search for informative pixels and the "de-speckle" procedure (see the Python notebook for more information). The used parameters are given in Listing S2 of the supplementary material. The parameters of TIMESAT are described in its software manual. We chose to fit a "double logistic" smoothing function, which is recommended for NDVI values in high latitudes with many missing values (Beck, Atzberger, Høgda, Johansen, & Skidmore, 2006). Listing S3 of the supplementary material shows the complete configuration file.

Predictions accuracy
We applied the proposed gap-fill procedure to the test scenarios described in Section 3.2. The resulting gap-filled images are shown in Figures S6-S9 of the supplementary material. To assess the temporal variation of the prediction accuracy, the average RMSE for each image of the scenario with 40% missing values is shown in the middle panel of Figure 3. It can be seen that the RMSE is larger for early days of the year. This is in accordance with the observations that images at the beginning of the season tend to have more missing pixels and that their values exhibit larger variability in time compared to other images (see also Figure S1 in the supplementary material). The right panel of Figure 3 depicts the spatial distribution of the pixel-wise average RMSE, which resembles the spatial distribution of the temporal variation in the data as depicted in the second panel of Figure S1 of the supplementary material. This is expected, because pixels with a large variability in time are more difficult to predict.
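For reference, the RMSE reported here is computed over the artificially removed pixels; a minimal sketch of the computation (the NaN handling is an assumption for illustration):

```python
import numpy as np

def rmse(pred, true):
    """Root mean squared prediction error over the artificially removed
    pixels; entries that are NaN in either array are ignored."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    ok = ~np.isnan(pred) & ~np.isnan(true)
    return float(np.sqrt(np.mean((pred[ok] - true[ok]) ** 2)))
```

Applying this per image yields the temporal RMSE profile, and applying it per pixel-wise time series yields the spatial RMSE map discussed above.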
Another way to study the prediction accuracy is to plot the true values against the predicted values, as shown in the upper panels of Figure 4.

Uncertainty assessment
The widths of the prediction intervals corresponding to the four main steps of the prediction algorithm summarize their uncertainty contributions. The left panel of Figure 5 depicts summary statistics of these widths as boxplots, revealing that the sorting step (2) (Algorithm 1) introduced the largest uncertainties, followed by the estimation of the quantile in step (3) (Algorithm 2). To investigate the properties of the prediction interval combining the uncertainties from steps (2) and (3), the spatial distribution of the average prediction-interval widths is shown in the middle panel of Figure 5. It exhibits similar spatial patterns as the standard deviation of the data (middle panel of Figure S1 of the supplementary material) and the spatial distribution of the average RMSEs (right panel of Figure 3). Since the seasonal variability of the prediction-interval widths is larger than the inter-annual variability, we only show the former in the bottom right panel of Figure 5. It has a U-shape, which is also observed in the distributions of the missing values (Figure S1 of the supplementary material), with some deviations at the boundaries, i.e., the values of the 145th and the 257th day of the year. These deviations might be caused by the fact that we only consider a part of the seasonal cycle, and hence have less information at its boundaries. The overall coverage rate of the prediction interval for that scenario is 93%.
That is, the prediction uncertainty is slightly overestimated on average. The average coverage rate per day of the year is depicted in the right panels of Figure 5.

Discussion
The analysis of remotely sensed data with many missing observations is challenging. One way to handle missing values is to construct a complete dataset by predicting the missing values from the observed ones. Such an approach is presented in this study and implemented in the corresponding R package gapfill. The following four considerations influenced the development of the gap-fill method: Firstly, to be of practical relevance, the method has to be capable of handling large datasets, such as the MODIS NDVI products. This implies that classical geostatistical space-time models in the spirit of Stein (1999) would need severe modifications, since their computational workload typically exceeds the available resources by several orders of magnitude. One way to reduce the computational workload is to implement a purely algorithmic procedure as, e.g., presented by Weiss et al. (2014). While such approaches can achieve good performance in terms of prediction accuracy, uncertainty quantification is difficult. We therefore opted for a hybrid approach, which combines purely algorithmic elements (selection of a suitable subset; scoring of images) with statistical methods (quantile regression; permutation tests). With that, the gap-fill method benefits from both aspects: the algorithmic components make it fast and scalable, whereas the statistical part provides tools to quantify uncertainties.
Secondly, an efficient software implementation of the gap-fill method is crucial to handle large datasets.
Since nowadays many research institutes have access to powerful computers, this includes scalability of the algorithm so that the computational workload can be distributed among several computing units.
The presented gap-fill algorithm achieves this with the subset-predict framework, which handles each missing value separately and thereby enables parallelization. On the programming side, we employ the generic parallelization framework of the R package foreach (Analytics & Weston, 2015), which can interface openMP and MPI parallel back-ends depending on the architecture of the available computer.
Thirdly, testing gap-fill algorithms with realistic scenarios was essential to develop an accurate and fast procedure. The chosen test scenarios feature actually observed MODIS NDVI data together with observed patterns of missing values and are therefore close to an actual use case. In the development phase of the procedure, testing helped to detect alleged improvements that turned out to be deteriorations with respect to computational speed or prediction accuracy. The comparison against established software also provides valuable information for potential users. However, one needs to keep in mind that such comparisons are always relative to the choice of the test scenarios.
Fourthly, the gap-fill software should be flexible and user-friendly so that it can be tailored to specific features of different remote sensing data products. To that end, we provide the procedure as the open-source R package gapfill (Gerber, 2016), which contains the R/C++ source code together with documentation and data examples. The structure of the package was kept as simple as possible to ease its usage. At the same time, users can customize the essential parts of the algorithm by changing default parameters or by providing their own subset and/or predict functions; see Section S2 of the supplementary material for more information.

Conclusion
Since many analysis methods for remotely sensed data are designed for complete data, gap-filling missing values improves, or even enables, the analysis of data with missing values. We presented such a gap-fill procedure and provide an open-source implementation thereof in the R package gapfill. While the software readily works for the presented data, its flexible design allows users to tailor it to specific needs with little effort. This makes it a suitable tool for gap-filling many data products observed at regularly spaced points in time. Furthermore, the algorithm has statistical components, which enable uncertainty quantification. The performance of the predicted values was assessed using validation scenarios based on MODIS NDVI data featuring between 20% and 50% missing values. When compared with two alternative gap-fill methods, the presented method was able to handle data with a higher percentage of missing values and provided the most accurate predictions in terms of the RMSE.