Spatiotemporally explicit earthquake prediction using deep neural network

Due to the complexity of predicting future earthquakes, machine learning algorithms have been used by several researchers to increase the Accuracy of the forecast. However, the concentration of previous studies has chiefly been on the temporal rather than spatial parameters. Additionally, the less correlated variables were typically eliminated in the feature analysis and did not enter the model. This study introduces and investigates the effect of spatial parameters on four ML algorithms' performance for predicting the magnitude of future earthquakes in Iran as one of the most earthquake-prone countries in the world. We compared the performances of conventional methods of Support Vector Machine (SVM), Decision Tree (DT), and a Shallow Neural Network (SNN) with the contemporary Deep Neural Network (DNN) method for predicting the magnitude of the biggest upcoming earthquake in the next week. Information Gain analysis, Accuracy, Sensitivity, Positive Predictive Value, Negative Predictive Value, and Specificity measures were exploited to investigate the outcome of using a new parameter, called Fault Density, calculated using Kernel Density Estimation and Bivariate Moran's I, on the performance of the earthquake prediction, in comparison to other commonly used parameters. We discussed the behavior of the four models while dealing with different combinations of parameters and different classes of earthquake magnitudes. The results showed promising performance of the proposed parameter for the earthquakes of high magnitudes, especially using SVM and DNN models.

Introduction
Earthquakes are destructive natural disasters that occur almost without warning. They inflict heavy casualties and financial losses on human societies. They can also cause environmental side effects such as surface fault rupture [1] and soil liquefaction [2], or initiate other types of disasters such as tsunamis [3], landslides [4], and fires [5]. Due to the high potential for destruction and death [6,7] as well as the direct and indirect effects of earthquakes [8], researchers have been working vigorously on different approaches to earthquake prediction [9][10][11]. Timely and reliable forecasting makes it possible to take preventive measures that mitigate the devastating effects of powerful earthquakes, and it can also increase the level of public preparedness. A successful forecast determines the geographical location, the time, and the magnitude of an earthquake [12]. Such predictions can save many lives and vast amounts of resources. However, despite the various methods proposed with different input parameters, such successful forecasts are rare in past research [13].
Various methods, including mathematical modeling [14,15], hydrological analysis [16], ionospheric analysis [17], and even procedures based on the observation of animal behavior [18,19], have been proposed to predict earthquakes. In another direction, a class of methods extracts useful information from the pressure wave, P, measured by seismographs, to predict the magnitude of an upcoming earthquake only a few seconds before the event [20][21][22][23]. This class of methods is useful for implementing early warning systems [24], whose effectiveness is highly dependent on the accurate detection of the P waves and the rejection of false-positive ground vibrations caused by local activities [25]. Most of the mentioned techniques depend on the occurrence of specific precursors [26]. In practice, however, such precursors usually either occur without any subsequent seismic event or are hard to detect, and thus those methods do not typically lead to satisfactory results [27]. Therefore, researchers have suggested that new approaches need to be considered for earthquake forecasting [28].
Meanwhile, machine learning (ML) techniques have emerged as a potent tool with undeniable advantages in dealing with data-intensive, nonlinear, and complex problems. These methods are often data-driven, non-parametric, and less constrained by inductive assumptions [29]. Several researchers have started applying ML algorithms to the earthquake prediction problem [30][31][32][33][34][35]. For example, Ref. [30] presented a probabilistic neural network model that yielded good prediction accuracies for magnitudes between 4.5 and 6. Ref. [31] introduced a new scheme for the estimation of significant earthquake events based on a Radial Basis Function ANN, where the model was trained using leave-one-out cross-validation. In another study, researchers utilized two different quantitative association rules (QAR) and M5P to discover temporal patterns in seismic data that are beneficial for earthquake prediction [33]. Ref. [35] examined the spatial-temporal variations of seismicity parameters for the Qeshm earthquake in southern Iran. After the seismicity parameters were calculated and normalized, Principal Component Analysis (PCA) was applied to prepare the data for the model, which comprised Radial Basis Function (RBF) and ANFIS components. Ref. [32] devised a methodology in which the validity of the seismicity indicators could be tested using Nearest Neighbors, Naïve Bayes, Support Vector Machines (SVM), Decision Tree (DT), and Artificial Neural Networks (ANN) algorithms.
From a temporal standpoint, earthquake prediction is categorized into two general categories of forecast (months or years in advance) and short-term predictions (hours or days in advance). Earthquake forecasting is very useful for identifying the seismic gap and portions of the plate boundaries that have not ruptured in a significant earthquake for a long time [13]. However, this study focuses on short-term forecasts, directly dealing with protecting human lives and social infrastructures [36]. Short-term prediction of earthquakes is considered a challenging problem [13,37,38] due to the complex nature of earthquake phenomenon [39], the complexities of the Earth's lithosphere and its crustal blocks-and-faults structure [40] so that no specific method is yet regarded as a reliable method for such predictions [13,41].
For short-term earthquake prediction, the seismic parameters used in previous studies are often the seven parameters $x_1$, $x_2$, $x_3$, $x_4$, $x_5$, $x_6$, and $x_7$ introduced by Ref. [42], which express the seismic laws of Bath, Gutenberg-Richter, and Omori-Utsu, and the nine parameters $b$, $a$, $\eta$, $\Delta M$, $T$, $\mu$, $C$, $dE^{1/2}$, and $M_{mean}$ introduced by Ref. [43], which represent the seismic potential of the ground. In addition, the depth, latitude, and longitude of the seismic events, extracted directly from the catalog data, were considered as input variables in some studies [26,44]. Some research investigated the effective parameters for earthquake prediction [32,45]. However, these studies primarily focused on the extent to which the dependent variable is affected by the independent variables, and in feature analysis they sought to use the parameters that were more correlated with the output variable. The less correlated parameters were often omitted in the feature analysis process. It is noteworthy that the influence of the input parameters on the results is profoundly affected by the capability of the method to extract useful information from them. Moreover, a review of past research on earthquake prediction using machine learning methods reveals that most of these studies consider only temporal rather than spatial correlations between the dependent and independent variables [27].
Many destructive earthquakes have occurred along active fault zones or in their proximity. This observation reinforces the hypothesis that future damaging earthquakes occur mostly along active faults or within areas where the density of active faults is rather high [46][47][48]. Thus, there is a need for a methodology that leverages fault location data, converts it to information, examines its usefulness as an input variable for predicting future earthquakes, and evaluates its impact on the prediction accuracy. Hence, the main goal of this study is to introduce and investigate the role of a spatial parameter, called Fault Density (FD), in the Accuracy of short-term earthquake prediction models based on ML algorithms. In particular, the performance of three well-known ML algorithms, namely SVM, DT, and a Shallow Neural Network with one hidden layer (SNN), is compared to that of the Deep Neural Network (DNN) algorithm for short-term prediction in a spatio-temporal setting. The proposed FD parameter is calculated by applying the Kernel Density Estimation (KDE) function to the active faults data, while the radius of the KDE is calculated through Bivariate Moran's I [49] to account for spatial correlation. The models receive the effective parameters proposed by previous research [42,43,50] along with FD, and predict the magnitude of the largest earthquake over the next week. Information Gain analysis (IGA), Accuracy, and Sensitivity measures were used to assess the explanatory power of each input parameter, including the proposed FD, as well as the performance of the models.
Iran, as one of the most earthquake-prone countries in the world [51,52], was selected as the study area. The country has already experienced many large and destructive earthquakes such as Tabas (1978), Rudbar (1990), Bam (2003), and Varzaqan (2012), with a death toll of about 126,000 attributed to 14 earthquakes with magnitudes of 7.0 Richter and 51 earthquakes of 6.0-6.9 Richter since 1900 [53][54][55][56]. Therefore, there is a strong need for accurate and reliable forecasting to support mitigation measures in the study area.
The rest of the article is organized as follows. Section 2 summarizes the theory of the four machine learning algorithms used in this study. Section 3 describes the methodology. Results and discussion are presented in section 4. Finally, section 5 concludes the study and proposes future works.

Machine learning algorithms
SVM is a supervised learning method based on statistical learning theory and the structural risk minimization principle [57]. As a binary classifier, SVM constructs optimal hyperplanes to separate the members of two classes while maximizing the distance between the closest samples of the classes in the training data [58]. However, in most real-world cases, the problem is not linearly separable. To handle the nonlinear cases, a kernel maps the input data to a high dimensional space, known as feature space, where the data would supposedly be linearly separable. The training points that are closest to the optimal hyperplane are called support vectors [59]. The performance of SVM highly depends on the selection of a proper kernel and the regularization constant C. Linear, polynomial, RBF (a.k.a. Gaussian), and sigmoid are four widely applied SVM kernels in the literature [59,60].
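As a small illustration of the kernel idea described above, the RBF (Gaussian) kernel is commonly written as K(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates it for two feature vectors; the function name and the sample inputs are hypothetical, and it is not taken from the study's implementation.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel value for two equal-length feature vectors.

    gamma controls the kernel width: larger values make the similarity
    decay faster with distance.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart, which is why the choice of gamma (together with C) matters for the classifier.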
DT is a hierarchical model made up of decision rules that recursively divides the independent variables into homogeneous regions [61,62]. The purpose of a DT is to find a set of decision rules that can be used to predict the output from a set of input parameters. During the training process, the DT strives to obtain the maximum amount of information along with the minimum entropy generated in the tree subgroups [63]. Initially, all data are aggregated in a root node, which is then divided into subgroups with higher purity and homogeneity using parameter values. These subsets are called internal nodes [64]. Labels are assigned to leaf (terminal) nodes by an allocation strategy such as majority voting [65]. In this study, the C5.0 algorithm with the boosting approach introduced by Ref. [66] is used for short-term earthquake modeling to enhance the predictive ability of the C5.0 algorithm. The core idea of the boosting approach is to create multiple classifiers rather than just one. When a new case is to be classified, each classifier votes for its predicted class. The votes are then counted to determine the final class [67].
ANNs have been one of the most powerful machine learning methods for predicting and modeling [68]. ANNs can learn complicated and nonlinear relationships; they do not need prior assumptions about the distribution of input data; they have proved their feasibility in dealing with noisy and incomplete data [69,70]. MLP, as a feed-forward neural network, is a well-known ANN method that has been used by several researchers for earthquake prediction [42,71,72]. An MLP model is composed of at least three layers of input, hidden, and output. The neurons are fully connected, meaning that every node in one layer is connected to every node in the next layer [73]. MLP networks can be built with an arbitrary number of layers. However, it has been proved [74][75][76][77] and tested [78,79] that a three-layered MLP network (one input layer, one hidden layer, and one output layer) can simulate any nonlinear function up to a desired degree of Accuracy. In this study, we refer to the MLP network with three layers of input, hidden, and output as Shallow Neural Network (SNN).
DNN is a particular type of ANN with a deep structure of multiple hidden layers. It attempts to model the hierarchical representation underlying the data and to capture patterns by stacking multiple layers of information processing modules in hierarchical architectures [80]. Increasing the number of hidden layers, and hence the number of data transformations, in deep neural network structures helps to extract the most appropriate hierarchical representation of the data [81]. In addition to the significant improvements DNNs have brought to a variety of domains, including image classification, object detection, and speech recognition [82], their generality and the availability of open-source code and of computer hardware for accelerating their processing, mainly when the task at hand deals with abundant data, are amongst the reasons that have increased the prominence of these models [83]. Different DNN architectures have been proposed and used in different domains, e.g., Convolutional Neural Networks, Recurrent Neural Networks, and Long Short-Term Memory Neural Networks. In this study, we used a feed-forward deep neural network architecture for earthquake prediction.

Case study
The study area is Iran (latitude between 24.5 and 40 and longitude between 43.5 and 64), a highland in the northern hemisphere, situated in the central part of the Alpine-Himalayan orogenic belt. The seismic activity of the Iranian plateau results from its position as a 1000-km-wide zone of compression between the colliding Arabian and Eurasian plates [84]. Fig. 1 shows the abundance of earthquakes in Iran between 1973 and 2019.

Data
After collecting and storing raw catalog data from January 1973 to July 2019 from USGS and IIEES, the data were integrated, and duplicate rows were identified and removed. Amongst the columns in the catalog data, only latitude, longitude, and Depth were directly taken as input variables for the prediction. In order to deal with more critical earthquakes, the catalog data were filtered based on magnitude so that events with magnitudes less than 3 Richter were eliminated. Such a filtering approach has been adopted by previous studies [85,86]. Fig. 1 shows the location of the earthquake events after filtering. Events with larger magnitudes are shown in red. As can be seen, seismic events ranging from 3 to 7.7 Richter cover the whole country. Fig. 2 shows the frequency of seismic events by year, with a significant increase in the number of recorded events in recent years.
After data collection, a 1 × 1 degree grid was constructed over the study area. To focus on the regions that are more prone to earthquakes, this study only considered pixels that contain at least 500 seismic events (Criterion 1: C1) and at least one event with a magnitude greater than 5 Richter (Criterion 2: C2).
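The two selection criteria can be sketched as a simple binning step. The snippet below is a minimal illustration, assuming each catalog event is a (latitude, longitude, magnitude) tuple and that each 1 × 1 degree cell is identified by the integer coordinates of its south-west corner; both choices, as well as the function name, are assumptions made for the example and are not taken from the study.

```python
import math
from collections import defaultdict


def select_cells(events, min_events=500, min_magnitude=5.0):
    """Return the 1 x 1 degree cells that satisfy C1 and C2.

    events: iterable of (latitude, longitude, magnitude) tuples
            (hypothetical record layout).
    C1: the cell contains at least `min_events` events.
    C2: the cell contains at least one event with magnitude > `min_magnitude`.
    """
    counts = defaultdict(int)
    max_magnitude = defaultdict(float)
    for lat, lon, mag in events:
        # Assumed cell key: floor of the coordinates (south-west corner).
        cell = (math.floor(lat), math.floor(lon))
        counts[cell] += 1
        max_magnitude[cell] = max(max_magnitude[cell], mag)
    return sorted(
        cell
        for cell in counts
        if counts[cell] >= min_events and max_magnitude[cell] > min_magnitude
    )
```

A cell with many small events but no event above magnitude 5, or with a large event but too few records, is excluded by this filter.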
There were only three pixels that satisfied C1 and C2. These pixels were selected as the input pixels for the analysis. The locations of these pixels and some information about the earthquake incidences in each pixel are presented in Fig. 1 and Table 1, respectively.

Dependent and independent variables
The input data need to be converted into well-structured records so that we can feed them to the prediction models. Each record of data is composed of a dependent variable and several independent variables.
The output (dependent) variable represents the maximum magnitude of the seismic events occurring in the next seven days. In this study, earthquake prediction is treated as a classification problem. The magnitude of the largest earthquake happening in the next week is predicted as one of the four classes specified in Table 2.
Notably, previous studies have shown that if the classification of the dependent variable results in an imbalanced dataset, the performance of machine learning-based models for earthquake prediction might diminish significantly [87]. Therefore, we used the frequency distribution of the dependent variable to specify the intervals so that the class boundaries were determined by the Natural Breaks classification method [88].
The independent variables comprise 19 parameters borrowed from previous studies (16 seismic parameters, latitude, longitude, and Depth), accompanied by the proposed FD parameter. Overall, they constitute our 20 input variables, all of which were normalized (between 0 and 1) before being used by the models. Table 3 lists the sixteen seismic input parameters along with their definitions.
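The text states only that the inputs were scaled to the range between 0 and 1. One common way to do this is min-max scaling, sketched below for a single input column. Using min-max scaling here is an assumption for illustration, since the study does not name the normalization method, and the function name is hypothetical.

```python
def min_max_normalize(column):
    """Scale a list of numbers linearly to the range [0, 1].

    Assumption: min-max scaling; the source only says "normalized
    (between 0 and 1)". A constant column is mapped to zeros.
    """
    lowest, highest = min(column), max(column)
    if highest == lowest:
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]
```

In practice, the minimum and maximum would typically be taken from the training subset and reused for the validation and test subsets, so that no information from held-out data leaks into the scaling.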
The first parameter, the b value, is related to the well-known Gutenberg-Richter geophysical law [89]. Ref. [43] proposed this parameter and used the least-squares method to calculate it. However, because of the lack of robustness of that method in dealing with infrequent earthquakes, Ref. [42] suggested that the b value should be calculated through maximum likelihood via Equation (1).

$$ b_i = \frac{\log_{10}(e)}{\frac{1}{n}\sum_{j=0}^{n-1} M_{i-j} - M_0} \tag{1} $$

In Equation (1), $n$ is the number of most recent events considered for the event $e_i$, $M_{i-j}$ is the magnitude of the event $e_{i-j}$, $e$ in the numerator is Euler's number (approximately 2.718), and $M_0$ is the cutoff magnitude. In this study, $n$ was set to fifty, as suggested by previous studies [30,42,45,90]. Once the parameter $b$ was calculated, the other parameters were calculated based on the description in Table 3.
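The maximum likelihood estimate referred to as Equation (1) can be sketched in a few lines. The function below is a minimal illustration, assuming the estimate is log10(e) divided by the difference between the mean magnitude of the last n events and the cutoff magnitude, and assuming a cutoff of 3.0 to match the catalog filter described earlier; the function name and defaults are hypothetical.

```python
import math


def b_value(magnitudes, n=50, cutoff=3.0):
    """Maximum likelihood b value from the last `n` magnitudes.

    magnitudes: chronologically ordered magnitudes, most recent last.
    cutoff: cutoff magnitude M0 (assumed to be 3.0 here).
    """
    window = magnitudes[-n:]
    mean_magnitude = sum(window) / len(window)
    if mean_magnitude <= cutoff:
        raise ValueError("mean magnitude must exceed the cutoff magnitude")
    return math.log10(math.e) / (mean_magnitude - cutoff)
```

For a window whose mean magnitude is exactly one unit above the cutoff, the estimate reduces to log10(e), roughly 0.434; a higher mean magnitude (relatively more large events) gives a smaller b value.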
In addition to the above-mentioned parameters, this study proposes a new parameter called FD to be used in short-term earthquake prediction procedures. The initial assumption is that short distances to active faults can increase the chance of large earthquakes in the area [46]. To convey the effect of the surrounding faults, we calculated the FD by applying Kernel Density Estimation (KDE) analysis [91,92] to the faults data layer. The cardinal parameter of KDE analysis is the search radius. The proper radius of the KDE analysis is the distance that maximizes the correlation between the dependent variable and the neighborhood faults. To determine this distance, Bivariate Moran's I [49] was employed as proposed by Ref. [93]. The distance that maximizes Moran's I index between the independent variable (distance from the faults) and the dependent variable (the magnitude of the largest earthquake in the following week) is considered the proper distance of the KDE analysis. The KDE was calculated for the study area, and its value in each cell was taken as the FD parameter.

Fig. 3 demonstrates the overall process of the proposed short-term earthquake prediction procedure. The ultimate goal was to estimate the dependent variable, which classifies the magnitude of the largest earthquake happening in the next seven days. The process started by receiving the data related to the three selected pixels. First, the independent variables were calculated for each record of the data. Then, the data were divided into three chunks: train, validation, and test. Fifty percent of the data was devoted to training, twenty-five percent to validation, and the last twenty-five percent to testing. Using Natural Breaks to determine the class boundaries and shuffling the records meant that all classes were uniformly represented in the training, validation, and testing subsets. The four ML algorithms (SNN, SVM, DT, and DNN) were trained and calibrated using the train and validation data chunks. We then used the trained models to estimate the class of the earthquakes happening in the next seven days for the test data. Finally, using the predicted and expected classes, the confusion matrix was calculated for the test dataset.
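To make the FD computation more concrete, the following sketch evaluates a kernel density at one location from a set of fault sample points. It is a simplified illustration under several assumptions that are not stated in the study: faults are represented by points sampled along the fault traces, coordinates are planar, and a quartic kernel is used (a common choice in GIS software). The function name is hypothetical, and the search radius is simply passed in rather than derived from Bivariate Moran's I.

```python
import math


def fault_density(location, fault_points, radius):
    """Quartic-kernel density of fault sample points around `location`.

    location: (x, y) of the cell centre, in planar units (assumed).
    fault_points: iterable of (x, y) points sampled along faults (assumed).
    radius: KDE search radius, in the same units as the coordinates.
    """
    cx, cy = location
    radius_sq = radius ** 2
    density = 0.0
    for fx, fy in fault_points:
        distance_sq = (fx - cx) ** 2 + (fy - cy) ** 2
        if distance_sq < radius_sq:
            # Quartic kernel, normalised so each point integrates to 1.
            density += 3.0 / (math.pi * radius_sq) * (1.0 - distance_sq / radius_sq) ** 2
    return density
```

Points outside the search radius contribute nothing, and points close to the cell centre contribute the most, so the value grows where many fault segments lie nearby. Selecting the radius would require repeating the analysis for several candidate distances and keeping the one with the highest Bivariate Moran's I, which is outside the scope of this sketch.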

Prediction model
The calibration process determines the best combination of hyperparameters for each method. Both the SNN and DNN models were calibrated to achieve high generalization while mitigating overfitting. We used the Weight Decay parameter [94] for the SNN model and Dropout [95] for the DNN model to lessen the effect of overfitting. The number of layers and nodes, the activation function, and the dropout rate (for DNN) or the weight decay (for SNN) were tuned. To obtain DNN and SNN models with high performance that neither overfit nor underfit, the models were repeatedly modified, trained, and evaluated on the validation data. We iteratively changed different hyperparameters of the models, including the number of layers, the number of units per layer, the learning rate, the dropout rate, and the regularization. The combination that resulted in the best model performance was selected as the optimal set of hyperparameters. It is worth mentioning that some researchers have used metaheuristic approaches, e.g., particle swarm optimization [96], genetic algorithm [97], coronavirus optimization [98], and artificial bee colony [99], to tackle the problem of hyperparameter tuning.
Regarding SVM, the RBF kernel [59] showed the best performance in the calibration process. The C parameter and the kernel width (gamma parameter) were determined by iterating over ranges of possible values. For DT, the Trials parameter, which controls the number of boosting iterations [67], was optimized in the calibration process.
The calibration process was conducted using 4-fold cross-validation. Specifically, after separating 25% of the data for the test, the rest was divided into four equal parts. As demonstrated in Fig. 4, training and validation were performed in four iterations so that in each iteration, three parts were used for training and the remaining part was used for validation. The final validation score was calculated as the average of the four validation scores.
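The hold-out and 4-fold procedure described above can be sketched with index lists. This is a minimal illustration under stated assumptions: records are shuffled with a fixed seed, folds are formed by striding over the shuffled indices (so their sizes differ by at most one), and no stratification by class is applied. The function name and defaults are hypothetical.

```python
import random


def hold_out_and_folds(n_records, test_fraction=0.25, k=4, seed=0):
    """Split record indices into a test set and k train/validation pairs."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)

    n_test = int(n_records * test_fraction)
    test = indices[:n_test]
    rest = indices[n_test:]

    # k nearly equal parts of the remaining records.
    folds = [rest[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return test, splits


def mean_validation_score(scores):
    """Final validation score: the average over the k iterations."""
    return sum(scores) / len(scores)
```

Each candidate hyperparameter combination would be trained and scored once per split, and the averaged score would be used to compare combinations; the test indices are left untouched until the final evaluation.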
Ultimately, after training and determining the optimal hyperparameters for the four models based on SNN, SVM, DT, and DNN using the validation score, each trained model predicted the test data that had not been fed to the models during training and validation.

Table 3. Seismic parameters, adopted from Refs. [42,43].

Evaluation
IGA, Accuracy, and Sensitivity measures were used in this study to evaluate the outputs. First, IGA was used to 1) measure the explanatory power of each input parameter and 2) gauge the degree to which each machine learning algorithm could take advantage of these parameters. Based on IGA, the attribute that reduces the entropy by the largest amount is considered the most significant attribute for the classification [100]. The information gain of an attribute $A$ over the dataset $S$ is defined as Equation (2) [101].

$$ Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v) \tag{2} $$

In Equation (2), $Entropy(S)$ is the entropy of the entire dataset, $Values(A)$ is the set of values taken by the attribute $A$, $S_v$ is the subset of $S$ for which the attribute $A$ has the value $v$, and $Entropy(S_v)$ is the entropy of this subset. More precisely, the entropy of $S$, as a measure of impurity, is calculated via Equation (3) [101].

$$ Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \tag{3} $$

where $p_i$ is the probability that a particular instance belongs to the class $i$ and $c$ is the number of classes. In addition to IGA, after running the models, the predicted and expected values obtained from the test data were used to form the confusion matrix. Using the confusion matrix, the following parameters were calculated.
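• Accuracy, as the proportion of events that the model has successfully predicted (Equation (4)).

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{4} $$

• Sensitivity, as the indicator of how correctly the model has predicted the earthquakes that happened (positive class) (Equation (5)).

$$ Sensitivity = \frac{TP}{TP + FN} \tag{5} $$

Furthermore, to understand the DNN model's behavior, we calculated its Specificity, Positive Predictive Value (PPV), and Negative Predictive Value (NPV).

• Specificity represents the proportion of actual negatives that the model correctly predicted as negative (Equation (6)).

$$ Specificity = \frac{TN}{TN + FP} \tag{6} $$

• PPV (Equation (7)) represents the ratio of actual positives (true predictions) out of all the generated earthquake predictions (positive predictions).

$$ PPV = \frac{TP}{TP + FP} \tag{7} $$

• NPV (Equation (8)) denotes the ratio of actual negatives amongst all the negative predictions.

$$ NPV = \frac{TN}{TN + FN} \tag{8} $$

In Equations (4) to (8), $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix, respectively.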
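The measures listed above can be computed per class from a multi-class confusion matrix by treating one class as positive and all other classes as negative. The sketch below assumes this one-vs-rest reading and a matrix layout in which rows are true classes and columns are predicted classes; both are assumptions for illustration, as are the function names, and overall accuracy is computed here as the share of correctly classified records.

```python
def one_vs_rest_metrics(confusion, cls):
    """Sensitivity, specificity, PPV and NPV for class index `cls`.

    confusion[i][j]: number of records of true class i predicted as class j
    (assumed layout).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[i][cls] for i in range(n)) - tp
    tn = total - tp - fn - fp

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def overall_accuracy(confusion):
    """Share of records on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total
```

Reporting the four ratios per class, rather than a single accuracy figure, shows whether a model misses events of a given magnitude class (low sensitivity) or raises too many false alarms for it (low PPV).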

Results and discussion
This section presents and discusses the results of the proposed short-term earthquake prediction models. The outputs of IGA, presented in Table 4, revealed that the FD variable introduced in this study has a higher value in predicting earthquakes than some other features, including X1, X2, X3, X4, X5, and Depth.
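For readers who want to reproduce an information gain ranking of this kind, the entropy and gain definitions referred to as Equations (2) and (3) can be sketched as follows. The sketch assumes that each attribute has already been discretized into a small number of values, which the study does not describe, so the binning step and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(attribute_values, labels):
    """Entropy reduction obtained by splitting `labels` on an attribute.

    attribute_values: one discrete value per record (assumed pre-binned).
    labels: the class label of each record, in the same order.
    """
    groups = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    total = len(labels)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder
```

An attribute that separates the classes perfectly attains a gain equal to the entropy of the whole dataset, while an attribute that is unrelated to the classes attains a gain close to zero; ranking the inputs by this value gives an ordering of the kind reported in Table 4.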
To further investigate the role of the FD variable along with Depth, which IGA recognized as spatial variables of moderate importance, we ran the four ML algorithms with the different combinations of input parameters presented in Table 5. It is worth noting that, in contrast to the widespread practice of excluding variables with low information gain, we did not remove the variables X1, X2, X3, X4, and X5 from the input vector. The rationale is that a variable's usefulness depends on the ability of the underlying model: a potent model can take advantage of the small amount of useful information carried by less significant variables and provide better predictions. Table 6 shows the optimal hyperparameters for the three ML techniques of SNN, SVM, and DT, while the structure of the optimal DNN, together with the output shape and the number of parameters, is shown in Table 7. To find the ideal DNN model, we tested several architectures with different hyperparameters, and the model with the highest validation accuracy was selected as the best model. Some of the tested structures and the corresponding validation accuracies during 500 epochs of training are presented in Table 8 and Fig. 5. Finally, the DNN structure with 1 input layer, 6 hidden layers, and 1 output layer was selected as the best DNN structure. The output layer had 4 nodes with the SoftMax activation function to predict the 4 classes (Table 7).
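As a small illustration of the four-node SoftMax output layer mentioned above, the sketch below converts four raw output scores into class probabilities and picks the most probable class. The score values are made up, the function names are hypothetical, and the 1-based class numbering is an assumption chosen to match the class labels used in the text.

```python
import math


def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the maximum avoids overflow
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict_class(scores):
    """Return the 1-based index of the most probable class."""
    probabilities = softmax(scores)
    return probabilities.index(max(probabilities)) + 1
```

For example, `predict_class([0.1, 2.0, -1.0, 0.5])` selects class 2, since the second score is the largest and SoftMax preserves the ordering of the scores.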
The train and test accuracies obtained by the different models on the three parameter-sets are presented in Table 9 and Table 10. As shown in Table 9, the best overall test accuracy was obtained by DT, followed by DNN, SVM, and SNN, for the three parameter-sets. Considering the two parameters of Depth and FD, it seems that the SNN and SVM models were not able to use the latent information carried by these parameters, whereas the models based on DNN and DT were more successful in exploiting them. DNN was the most successful ML algorithm in terms of utilizing the information in the FD and Depth parameters. Such an improvement could be traced back to the deep structure of DNN, which can extract useful information from less correlated independent input parameters.
To examine the performance of the models from various aspects, the Sensitivity measure was calculated in addition to Accuracy. Accuracy was chosen as a general metric that assesses the overall performance of the models, whereas Sensitivity was used to understand in more detail how each model performed for each class. In other words, Sensitivity signifies the capability of the models to correctly detect the earthquakes that occurred, while Accuracy summarizes the overall performance of the classifiers. Sensitivities obtained for the different classes are displayed in Table 11. The low Sensitivity values for class one (earthquakes between 3 and 3.7 Richter) and class two (earthquakes between 3.7 and 4.5 Richter) mean that almost every model performed weakly in estimating these classes compared to the third and fourth classes. One reason for the lower sensitivities of classes one and two compared to classes three and four, for all models, may be the considerable noise in the low-magnitude data that make up these classes. It is worth mentioning that some studies [102,103] recommended that the cutoff magnitude based on the Gutenberg-Richter law should be calculated beforehand and that all events below the calculated cutoff magnitude should then be filtered out. The reasoning behind this suggestion is to ensure that incomplete and misleading information is not considered in the model [102]. However, calculating and applying the cutoff magnitude in this way resulted in losing most of the dataset, which was not appropriate for running the ML models. To examine the effect of the cutoff magnitude on the performance, we ran the DNN model with three cutoff magnitudes of 3, 4, and 5 Richter and calculated the Accuracy. As shown in Fig. 6, the Accuracy of DNN deteriorates as the cutoff magnitude increases.
Another contributing factor could be the lower number of instances recorded for the first class (Table 2), which might have exacerbated the situation even further. Perhaps, that is why the results of the predictions for the second class are generally better than class one for all models.
The sensitivities obtained for classes three and four were higher than those of the first two classes. A closer look reveals that the highest sensitivities for classes three and four were obtained when the models were using the second parameter-set. The underlying reason could be that higher magnitudes are more correlated with the FD parameter, since high-magnitude earthquakes are more likely to occur in areas that are closer to active faults.
Although DT performed better than the other methods in terms of overall Accuracy (Table 9), it did not score the highest Sensitivity. The best methods for predicting classes 1, 2, 3, and 4 were SNN on parameter-set 3, SVM on parameter-set 1, DNN on parameter-set 2, and SVM on parameter-set 2, respectively. A closer look at the results (Table 11) shows that the best predictions of classes 1, 3, and 4 came from models using the FD parameter, which indicates the suitability and usefulness of the parameter, especially for predicting earthquakes of larger magnitudes. Classes 3 and 4 can be predicted with a likelihood of more than 95% using the new FD parameter. In some circumstances, other methods outperformed DNN. DNN is the most complex of the implemented models, so in some cases its lower Accuracy and Sensitivity may be due to its higher parametrization, as was also observed in Ref. [30].
Table 5. Examined parameter-sets.
Remarkably, a recent literature review [104] suggested that neural network models with shallow structures can compete with DNNs in predictive power for earthquake prediction because of the structured, tabular nature of catalog data and the limited number of calculated features. Other studies made a similar observation about the predictive power of SNNs [105,106]. Decision ensembles such as Boosting and Random Forest, on the other hand, have grown in popularity [107], and researchers have compared their performance with that of other machine learning algorithms [30,108]. Meanwhile, SVM has shown higher generalization ability for earthquake forecasting [109,110]. Given the reported strengths of these four types of models, we assessed their predictive power per class in the study area. Our results showed that when the goal is to use a general classifier to forecast earthquakes of both low and high magnitudes, DT would be a proper choice. However, considering the sensitivity analysis of the third and fourth classes, DNN and SVM detected moderate- and high-magnitude earthquakes better than the other methods. Despite the network size and the considerable number of parameters that must be trained for the DNN models, the results demonstrated that these complex models were the most successful in exploiting the information carried by the FD and Depth parameters. Moreover, DNNs outperformed the other methods in predicting moderate magnitudes, though the best model for predicting low-magnitude earthquakes was SNN. This behavior of SNN was expected, since its structure is relatively simple while the relationships between the input variables and the tremors of higher magnitudes are quite complex. In fact, the introduction of multiple hidden layers in DNN makes it possible to learn features at different levels of abstraction [111].
From the perspective of disaster management organizations, an earthquake prediction model should generate few false alarms, because false alarms can cause widespread panic and financial loss [112]. For this reason, Specificity, PPV, and NPV were calculated per class for the DNN model (Table 12).
As shown in Table 12, the PPV of 88% for the fourth class predicted by the DNN model is quite encouraging. There seems to be a trade-off between Specificity and NPV: when the Specificity is high, the classifier is more likely to produce false negatives.
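For reference, the three false-alarm-related measures can be derived from a multi-class confusion matrix by treating one class as positive and all other classes as negative. The sketch below uses the standard definitions Specificity = TN / (TN + FP), PPV = TP / (TP + FP), and NPV = TN / (TN + FN). The matrix is a hypothetical example rather than the data behind Table 12, and the function name `one_vs_rest_metrics` is illustrative.

```python
from typing import Dict, List


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely; return NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def one_vs_rest_metrics(confusion: List[List[int]], k: int) -> Dict[str, float]:
    """Specificity, PPV, and NPV for class index k against all other classes.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                        # true k, predicted other
    fp = sum(confusion[i][k] for i in range(n)) - tp   # true other, predicted k
    tn = total - tp - fn - fp
    return {
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


# Hypothetical confusion matrix for classes 1 to 4 (assumption: toy values).
toy_matrix = [
    [2, 6, 1, 1],
    [3, 12, 4, 1],
    [0, 2, 36, 2],
    [0, 0, 1, 29],
]

# Index 3 corresponds to the fourth (highest-magnitude) class.
class4 = one_vs_rest_metrics(toy_matrix, 3)
for name, value in class4.items():
    print(f"{name}: {value:.3f}")
```

Specificity and PPV both fall as false alarms (false positives) increase, whereas NPV falls as missed events (false negatives) increase, so reporting the three together shows which kind of error a classifier tends to make.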

Conclusion
In this study, the conventional machine learning algorithms SNN, SVM, and DT, as well as the contemporary DNN method, were used to predict earthquakes in Iran. In addition to the commonly used seismic parameters described in previous research, a new parameter named FD was introduced, which improved the Accuracy of the deep learning earthquake prediction model. The results showed satisfactory performance of DNN and SVM in predicting the high-magnitude classes. However, DT was more promising in coping with events of both high and low magnitudes.
In the future, we will examine the usability and suitability of other deep neural network architectures, e.g., Convolutional and Recurrent Neural Networks, for earthquake prediction and compare their performance with the four algorithms of this study. Furthermore, the effect of the FD parameter on the performance of those methods will be evaluated.

Funding
No funding was used for this study.

Availability of data and material
The datasets are published by USGS and IIEES and publicly available through the following links.

Declaration of competing interest
The authors declare that they have no conflict of interest.