Evaluation of performance of drought prediction in Indonesia based on TRMM and MERRA-2 using machine learning methods


Organization (WMO). In addition, the 3-month SPI has commonly been used by the Indonesian Agency for Meteorology, Climatology and Geophysics (BMKG) for monitoring drought conditions in Indonesia. Following Rhee et al. [3], drought is predicted using several predictors: the Normalized Difference Vegetation Index (NDVI), the average day-time and night-time land surface temperatures, the Multivariate ENSO Index (MEI), and the Arctic Oscillation Index (AOI).
This research applies two different machine learning methods to classify the drought status: Classification and Regression Trees (CART) and Random Forest (RF). Both methods were selected because of their strength in applications to large sample datasets, as in our case. Moreover, both methods have been proved to be computationally efficient. Various machine learning approaches have been extensively applied to drought prediction (see, for example, [28][29][30][31][32]). The most recent work by Fung et al. [33] provides a comprehensive review of the applications of statistics-based modelling as well as machine learning methods for drought forecasting over the period from 2007 to 2017. Most of the papers agree that machine learning is a powerful tool for drought forecasting. The present paper also proposes combining the machine learning methods with a sampling method, both to overcome the problem of imbalanced response classes and to improve the predictive performance.

Data source and variable
The data used in this study are secondary data obtained from several different sources. The remote-sensing data are obtained from https://search.earthdata.nasa.gov. The data cover the spatial region of East Nusa Tenggara, from latitude 8°S to 11°S and longitude 118.75°E to 125.25°E. The SPI is derived from monthly data spanning from 1998 to 2017. The analyses for the two responses (TRMM and MERRA-2) are conducted separately. A short description of the sources of the data follows.

Tropical rainfall measuring mission (TRMM)
The Tropical Rainfall Measuring Mission (TRMM) is a collaborative project between Japan and the United States, in particular the space agencies of the two countries, the Japan Aerospace Exploration Agency (JAXA) and the National Aeronautics and Space Administration (NASA).

Modern-era retrospective analysis for research and applications (MERRA-2)
The second version of the Modern-Era Retrospective Analysis for Research and Applications (MERRA-2) is an atmospheric reanalysis produced by NASA that covers the period from 1980 onwards. MERRA-2 is a reanalysis product, which means that the available data are the result of processing or correcting observations with certain algorithms.

Moderate-resolution imaging spectroradiometer (MODIS)
The surface temperature (day and night) was obtained from MYD11C3 Land Surface Temperature and Emissivity, which is one of the results of the Moderate Resolution Imaging Spectroradiometer (MODIS) sensor on NASA's Aqua satellite. The Normalized Difference Vegetation Index (NDVI) is obtained from the MYD13C2 Vegetation Indices.

Multivariate ENSO index (MEI) and arctic oscillation index (AOI)
MEI and AOI are variables considered to represent global climate conditions, and they are especially relevant in drought prediction. They carry information about anomalies that occur, such as El Niño.
The variables used in this study are listed in Table 1.

Classification and regression trees (CART)
CART is a decision-tree algorithm used for classification. The concept behind this method is binary recursive partitioning [34]. There are three stages in classifying with the CART method: forming a classification tree by recursive node splitting, pruning the resulting tree to produce a series of simpler classification trees, and determining the optimal classification tree.
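The three stages can be sketched with scikit-learn (an assumed toolchain; the paper does not state its software), where `cost_complexity_pruning_path` produces the nested subtree series of the pruning stage:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: grow a full tree by recursive binary splitting (Gini criterion).
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_tr, y_tr)

# Stage 2: the cost-complexity pruning path yields a series of simpler subtrees,
# one per candidate complexity parameter alpha.
path = full_tree.cost_complexity_pruning_path(X_tr, y_tr)

# Stage 3: choose the subtree (via its alpha) with the best held-out accuracy.
scores = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
          .score(X_te, y_te) for a in path.ccp_alphas]
best_alpha = float(path.ccp_alphas[int(np.argmax(scores))])
```

The dataset here is synthetic; in the paper each grid cell would supply its own training and testing samples.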

Optimal classification trees
Splitting strategy. In split selection, the training sample is partitioned on the basis of splitting rules and goodness-of-split criteria so as to reduce the heterogeneity of the split samples. The split selection depends on the type of tree, or equivalently on the type of response variable. The result of the splitting process must be more homogeneous than the parent node. The level of heterogeneity of a node can be measured using an impurity function $r(t)$. For a binary response, the Gini index is

$$r(t) = 1 - p(c_0 \mid t)^2 - p(c_1 \mid t)^2,$$

where $r(t)$ is the Gini index (heterogeneity function) at node $t$, $p(c_0 \mid t)$ is the proportion of class 0 at node $t$ and $p(c_1 \mid t)$ is the proportion of class 1 at node $t$. Furthermore, the goodness-of-split criterion is evaluated for each candidate split $s$ at node $t$:

$$f(s, t) = r(t) - p_L\, r(t_L) - p_R\, r(t_R),$$

where $f(s, t)$ is the value of the goodness of split, $r(t)$ is the heterogeneity function at node $t$, $p_L$ and $p_R$ are the proportions of observations falling into the left and right child nodes, respectively, and $r(t_L)$ and $r(t_R)$ are the heterogeneity functions at the left and right child nodes. The split that produces the highest goodness-of-split value is the best split, because it reduces heterogeneity the most. Each variable receives a score showing how much it contributes to the tree-formation process.
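The Gini impurity and goodness-of-split calculations above can be sketched in a few lines of Python (an illustration, not the paper's own code):

```python
import numpy as np

def gini(y):
    """Gini impurity r(t) = 1 - sum_c p(c|t)^2 for a node with labels y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def goodness_of_split(y_parent, y_left, y_right):
    """f(s, t) = r(t) - p_L * r(t_L) - p_R * r(t_R)."""
    n = len(y_parent)
    p_l, p_r = len(y_left) / n, len(y_right) / n
    return gini(y_parent) - p_l * gini(y_left) - p_r * gini(y_right)

# A pure split of a balanced binary node gives the maximum decrease, 0.5.
y = np.array([0, 0, 1, 1])
print(goodness_of_split(y, y[:2], y[2:]))  # 0.5
```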
Terminal nodes. A node $t$ is declared a terminal node when there is no significant decrease in heterogeneity, when there is only one observation in each child node, or when a child node would contain fewer than a minimum number of observations $m$.
Class label. Class labels are assigned to terminal nodes by the majority rule:

$$c(t) = \arg\max_{i}\, p(c_i \mid t).$$

The class label for terminal node $t$ is the class $c_i$ that gives the smallest expected misclassification error at node $t$, namely $r(t) = 1 - \max_i p(c_i \mid t)$.

Classification tree pruning
Pruning the classification tree, commonly just called pruning, is needed because the more splitting is done, the smaller the training prediction error becomes, so the tree fits the training data too closely (overfitting). Tree pruning is done by minimising the cost complexity:

$$R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|,$$

where $R_\alpha(T)$ denotes the complexity measure of a tree $T$ at complexity $\alpha$, $R(T)$ is the resubstitution estimate or misclassification rate of tree $T$, $\alpha$ is the cost-complexity parameter for adding a terminal node to tree $T$, and $|\tilde{T}|$ is the number of terminal nodes in tree $T$.
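A tiny numerical illustration of the cost-complexity trade-off (the numbers are invented for the example): at a given $\alpha$, a larger tree with lower misclassification can lose to a smaller tree once the terminal-node penalty is added.

```python
def cost_complexity(r_T, n_terminal, alpha):
    """R_alpha(T) = R(T) + alpha * |T~| (misclassification plus leaf penalty)."""
    return r_T + alpha * n_terminal

# At alpha = 0.01, a 10-leaf tree with R(T) = 0.10 is no longer preferred
# over a 4-leaf tree with R(T) = 0.14.
big = cost_complexity(0.10, 10, 0.01)    # 0.10 + 0.10 = 0.20
small = cost_complexity(0.14, 4, 0.01)   # 0.14 + 0.04 = 0.18
```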

Optimal classification tree determination
The test sample estimator is often used when there is a large number of observations. The procedure divides the sample $L$ into two parts, $L_1$ (training) and $L_2$ (testing). The observations in $L_1$ are used to grow the trees $T$, while the observations in $L_2$ are used to estimate $R(T)$; $N_1$ is the number of observations in $L_1$ and $N_2$ the number of observations in $L_2$. Furthermore, $X(\cdot)$ is 0 if the statement in parentheses is false and 1 if it is true. The test sample estimator is

$$R^{ts}(T_t) = \frac{1}{N_2} \sum_{(x_n, y_n) \in L_2} X\big(d(x_n) \neq y_n\big),$$

where $R^{ts}(T_t)$ is the total proportion of errors in the test sample estimate, $d(x_n)$ is the class predicted by tree $T_t$, and $N_2$ is the number of observations in the $L_2$ testing data. Since we want to estimate the proportion of errors generated by the classification-tree formation process, the optimal classification tree is the tree $T_{t^*}$ with the minimum test sample estimate, i.e. $R^{ts}(T_{t^*}) = \min_t R^{ts}(T_t)$.

Random forest
The Random Forest method is a development of the CART method that applies bootstrap aggregating (bagging) and random feature selection [35,36]. In this method many trees are grown so that a forest is formed, as follows:
1. Draw a random sample of size n with replacement from the data. This is the bootstrap stage.
2. Using the bootstrap sample, grow a tree to its maximum size (without pruning). Tree construction applies random feature selection at each split: m predictor variables are chosen at random, where m << p, and the best split is selected among those m predictors.
3. Repeat steps 1 and 2 B times, so that a forest of B trees is formed.
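The three steps above can be sketched explicitly (on synthetic data, with scikit-learn trees as an assumed building block; in practice one would simply use a ready-made Random Forest implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

B, m = 100, 3            # B trees in the forest, m << p features per split
forest = []
for b in range(B):
    # Step 1: bootstrap -- a random sample of size n drawn with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: grow an unpruned tree, trying only m random features per split.
    tree = DecisionTreeClassifier(max_features=m, random_state=b)
    tree.fit(X[idx], y[idx])
    forest.append(tree)

# Step 3: aggregate the B trees by majority vote.
votes = np.stack([t.predict(X) for t in forest])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
```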

Evaluation of classification results
The Area Under the Curve (AUC) is the area under the receiver operating characteristic (ROC) curve. In general, the AUC is used for classification problems with binary data; for multiclass data, the AUC can be obtained by averaging the AUCs over all one-against-one pairs of classes, and this serves the same purpose as the binary AUC [37]. The classification evaluation is performed using the average AUC based on the cross tabulations in Table 2.
The accuracy can be calculated by dividing the number of correctly classified observations by the total number of observations. From the cross tabulation, the binary AUC can be expressed as the average of sensitivity and specificity,

$$\mathrm{AUC} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right),$$

and the multiclass AUC is the average of the one-against-one AUCs over all pairs of classes,

$$\mathrm{AUC}_{total} = \frac{2}{c(c-1)} \sum_{i < j} \mathrm{AUC}(c_i, c_j),$$

where $c$ is the number of classes.
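These confusion-matrix quantities are simple to compute; the sketch below uses the average-of-sensitivity-and-specificity form of the binary AUC (a common cross-tabulation form, stated here as an assumption since the paper's printed formula was illegible):

```python
def binary_auc(tp, fn, fp, tn):
    """AUC from a 2x2 cross tabulation as the average of sensitivity
    (tp / (tp + fn)) and specificity (tn / (tn + fp))."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)

def multiclass_auc(pairwise_aucs):
    """Average the one-against-one AUCs over all class pairs."""
    return sum(pairwise_aucs) / len(pairwise_aucs)

# 80 true positives, 20 false negatives, 10 false positives, 90 true negatives.
print(binary_auc(80, 20, 10, 90))  # 0.85
```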

Validation method
The validation method used in this analysis is k-fold cross validation. In k-fold cross validation, the sample data are divided randomly into k parts of equal size; each part in turn is held out for testing while the remaining parts are used for training. The value k = 10 is often used, because it gives the best estimate of the error [38]. An illustration of the data partitioning in this validation method is given in Fig. 1.
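A minimal 10-fold cross-validation sketch with scikit-learn (an assumed toolchain, on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

# k = 10: each observation is used once for testing and nine times for training.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(len(scores), scores.mean())
```

Stratified folds keep the class proportions similar across folds, which matters here given the imbalanced drought classes.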

Results and discussion
Prior to the analysis, the data were pre-processed to obtain the same grid resolution for all variables, i.e. 0.25° × 0.25°. Therefore, the resolution of each pixel in the maps is about

Drought level classification based on 3-month SPI using the CART method
The analysis is done on a grid basis, meaning that the analysis for one grid (occasionally referred to as an area) is independent of the analysis for any other. An example is given for the CART analysis at longitude 120.125°E and latitude 8.625°S. Fig. 3 shows the determination of the optimal complexity parameter as the pruning step of CART. We see from the figure that the optimum complexity parameter is 0.0095. The classification tree in Fig. 3 can be used to predict the drought level in the specified grid. Suppose, for example, a condition in which X4 = 1, X1 = 400, X6 = 0.5, and X3 = 0.68; the drought level is then classified into class 2 (moderate).
Using the optimum complexity parameter, we obtain the AUC and accuracy for both the training and the testing dataset as shown in Table 3.
The table reveals that the CART method is able to predict drought in this area with an accuracy above 80%. However, the AUCs are very low for both the training and the testing dataset. The high accuracy stems from the imbalanced response classes, whereas the AUC accounts for this imbalance in its formula. The process above is repeated for all grids, resulting in the average AUC values plotted in Fig. 4. Note that we used 10-fold cross validation for the CART analysis. The left side shows the AUC for drought prediction using the TRMM dataset, while the right side shows the AUC using the MERRA-2 dataset.
Fig. 4 shows that CART can classify the drought level with AUCs of 0.5 to 0.75. The TRMM and MERRA-2 datasets produce similar AUCs, although there are some inconsistencies in one particular region.

Drought level classification based on 3-month SPI using random forest method
The analysis using Random Forest is carried out as follows. We set the parameters mtry = 1, 2, 3, 4, 5 and ntree = 100, 500, 1000, 1500, 2000 and evaluate the mean AUC obtained from a 10-fold cross-validation procedure, as in the CART analysis. A sample of the analysis steps is given for the same grid as in the CART method. Fig. 5 below depicts the tuning of the Random Forest parameters.
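This tuning can be sketched as a grid search over the number of candidate features per split and the number of trees, scored by cross-validated AUC. The scikit-learn names (`max_features`, `n_estimators`) are assumed analogues of those parameters, and the grid and fold count are trimmed here so the example runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate features per split (1..5) as in the paper; tree counts trimmed
# for speed. Scoring uses the ROC AUC, matching the paper's criterion.
grid = {"max_features": [1, 2, 3, 4, 5],
        "n_estimators": [50, 100]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```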
We see from the figure that the combination of mtry = 2 and ntree = 1500 is the optimal setting for predicting drought in this area (grid), resulting in an AUC of 0.6033. Table 4 shows the accuracy and AUC for each fold. We see that the average AUC in this grid reaches 0.6, which is significantly higher than the value obtained with CART.
The step above is then repeated for all grids. Note that the optimum parameters above are valid only for that area, and the parameters can be different for other areas. The results of the average AUC for all grids are depicted in Fig. 6.
From the figure, we see that drought prediction using MERRA-2 yields significantly better AUC values than prediction using TRMM, as shown by the proportion of areas with an AUC higher than 0.8. Moreover, comparing the analyses using CART and Random Forest, we can clearly see that Random Forest improves the accuracy and the AUC significantly. Overall, the average accuracy of drought prediction in NTT using Random Forest reaches 80%.

Improving the prediction performance using the Synthetic Minority Over-sampling Technique (SMOTE)
The results presented above indicate that both CART and Random Forest have modest performance, particularly in terms of the AUC values. An AUC close to 0.5 indicates that the method tends to predict the majority class, i.e. performs similarly to a random guess. Therefore, the prediction performance needs to be improved. One obvious reason for the low AUC is the imbalanced response classes. Note that only about 15% of the examined periods were found to be in the "very dry" condition, which creates imbalanced response classes, an essential issue in classification problems. To overcome this problem, this section proposes to improve the prediction performance by combining the machine learning methods with the Synthetic Minority Over-sampling Technique (SMOTE). We hereafter denote the combined methods SMOTE-CART and SMOTE-Random Forest, for the combination of oversampling with CART and Random Forest respectively.

SMOTE is a method for handling imbalanced data proposed by Chawla et al. [38]. The basic idea of SMOTE is to increase the number of samples in the minority class until it is equivalent to the majority class, by generating synthetic data based on the k nearest neighbours, where the nearest neighbours are chosen according to the Euclidean distance between two data points. An illustration of the SMOTE procedure is given in Fig. 7.
Given two observations with $r$ variables, $x^T = [x_1, x_2, \ldots, x_r]$ and $z^T = [z_1, z_2, \ldots, z_r]$, the Euclidean distance $d(x, z)$ can be calculated as

$$d(x, z) = \sqrt{\sum_{j=1}^{r} (x_j - z_j)^2}.$$

Synthetic data are generated using the equation

$$x_{syn} = x_i + g\,(x_{knn} - x_i),$$

where $x_{syn}$ is the synthetic data point, $x_i$ is the $i$-th data point from the minority class, $x_{knn}$ is the nearest neighbour of the data point being replicated, and $g$ is a random number between 0 and 1. SMOTE is run inside the k-fold cross validation, on each training fold only. This is done to avoid over-optimistic results caused by pattern replication across training and testing data, which would occur if the sampling were applied to the entire dataset [39]. An illustration of the SMOTE procedure within k-fold cross validation is given in Fig. 8. This part mainly focuses on improving the AUC, which represents the overall classification performance. The results of predicting drought in NTT based on TRMM and MERRA-2 using SMOTE-CART and SMOTE-Random Forest can be seen in Fig. 9.
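The synthetic-point generation can be sketched directly from these two equations (an illustration on random toy data; a production analysis would use a library implementation such as imbalanced-learn's SMOTE):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(x_i, minority, k=5):
    """Generate one synthetic point x_syn = x_i + g * (x_knn - x_i),
    where x_knn is one of the k nearest minority-class neighbours of x_i
    (Euclidean distance) and g is uniform on (0, 1)."""
    d = np.linalg.norm(minority - x_i, axis=1)
    nn_idx = np.argsort(d)[1:k + 1]      # skip index 0: x_i itself
    x_knn = minority[rng.choice(nn_idx)]
    g = rng.random()
    # The synthetic point lies on the segment between x_i and x_knn.
    return x_i + g * (x_knn - x_i)

minority = rng.normal(size=(20, 3))      # toy minority-class sample, r = 3
x_syn = smote_sample(minority[0], minority)
```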
If we compare the CART performance in Fig. 4 with the SMOTE-CART performance in the upper panel of Fig. 9, we observe a significant improvement in the AUC values overall. Meanwhile, the Random Forest performance in Fig. 6 and the SMOTE-Random Forest performance in the lower panel of Fig. 9 are relatively similar, with only a slight improvement. The increased AUC values mean that drought predictability in the corresponding regions is significantly improved. In some regions, the classification accuracy exceeds 90%. The comparison is summarized in Fig. 10.
The boxplots in Fig. 10 present the AUC values over all regions of NTT. We see that SMOTE improves the CART performance significantly, for both the TRMM and the MERRA-2 data. In all cases, Random Forest is robust against the imbalanced-response issue and still outperforms CART, with or without SMOTE. Interestingly, SMOTE only slightly improves the Random Forest performance if we focus on the mean AUC alone. Furthermore, if we look at the entire AUC distribution, some regions show a higher AUC after sampling.

Conclusion
Drought prediction analysis in East Nusa Tenggara was performed using two different data sources (TRMM and MERRA-2) and two different machine learning methods (CART and Random Forest). The analysis showed that there is no significant difference in performance between TRMM and MERRA-2 when the drought prediction is carried out using CART. The average AUC reached a maximum of 0.75. Meanwhile, the analysis using Random Forest significantly improved the AUC of the prediction, with the AUC reaching 0.8. Unlike with CART, the drought prediction accuracy using TRMM was significantly different from that with MERRA-2 when the analysis used Random Forest; in this case, MERRA-2 outperformed TRMM. Although many studies have shown that no single machine learning method will always perform better than the others, this study supports the view that Random Forest is a very powerful method. Moreover, the analysed datasets clearly have imbalanced responses, which is an important issue in machine learning applications. To deal with this, the drought prediction accuracy can be improved by applying a method that handles the imbalance in the response class, i.e. oversampling (SMOTE), prior to the classification. SMOTE improves the CART performance significantly, while the Random Forest performance is only slightly improved. To conclude, we suggest that the MERRA-2 dataset, predicted using Random Forest, be used to obtain more accurate drought predictions in East Nusa Tenggara, Indonesia.