Predicting COVID-19 Cases in South Korea with All K-Edited Nearest Neighbors Noise Filter and Machine Learning Techniques

Abstract: The application of machine learning techniques to the epidemiology of COVID-19 is a necessary measure that can be exploited to curtail the further spread of this pandemic. Conventional techniques used to determine the epidemiology of COVID-19 are slow and costly, and data are scarce. We investigate the effects of noise filters on the performance of machine learning algorithms on a COVID-19 epidemiology dataset. Noise filter algorithms are used to remove noise from the datasets utilized in this study. We applied nine machine learning techniques to classify the epidemiology of COVID-19: bagging, boosting, support vector machine, bidirectional long short-term memory, decision tree, naïve Bayes, k-nearest neighbor, random forest, and multinomial logistic regression. Data from patients who contracted coronavirus disease, covering 23 January 2020 to 24 June 2020, were collected from the Kaggle database. Both noisy and filtered data were used in our experiments. As a result of denoising, the machine learning models achieved high accuracy in predicting COVID-19 cases in South Korea. After noise filtering, the machine learning techniques achieved an accuracy of 98–100% for isolated cases. The results indicate that filtering noise from the dataset can improve the accuracy of COVID-19 case prediction algorithms.


Introduction
On 30 December 2019, the first case of COVID-19 was reported at Wuhan Jinyintan Hospital in a patient with pneumonia of unknown etiology. Analysis showed that the virus belongs to the coronavirus lineage Betacoronavirus 2B [1]. Bat SARS-like coronaviruses exhibited a close link to the virus that causes COVID-19. The World Health Organization (WHO) identified the novel coronavirus as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and referred to the disease as coronavirus disease 2019 (COVID-19) on 30 January 2020 [2]. Typical symptoms of the disease include breathlessness, fever, headache, chills, myalgia or arthralgia, nasal congestion, diarrhea, hemoptysis, and conjunctival congestion [3]. In severe cases, the disease can result in severe acute respiratory syndrome, kidney failure, and death [4]. The ongoing spread of COVID-19 threatens the health systems of all nations [5]. The United States has become one of the countries most affected by the surge in COVID-19 across public health, emergency health care, and hospitals [6]. Unfortunately, the rate of infections is expected to increase exponentially in many countries regardless of their health systems. Emergency steps are
1. Surveillance allows the government and health officials to monitor the rate of infections in a particular community. It seeks to observe the effectiveness of COVID-19 prevention measures, such as wearing a mask and maintaining social distancing. This could involve random testing of people in a particular location to determine whether there is community transmission of the disease [13];
2. Screening involves testing anyone, regardless of whether they show symptoms or are aware of exposure to someone who has been infected. It provides an effective means of recognizing those who are likely to have been infected with the virus to stop further transmission [14];
3. Diagnostic testing involves testing a person who is presumed to have been infected with COVID-19. The person may show symptoms of COVID-19, may know that they have been in contact with people with confirmed cases of COVID-19, or may have been infected previously and be seeking further tests to verify that they are now negative [15].
Machine learning (ML) algorithms solve problems in the medical sector by analyzing and interpreting large volumes of data [16][17][18][19][20]. Several researchers have used machine learning algorithms to address medical challenges in this area. Since the beginning of the pandemic, a wide range of studies have been conducted to provide a better understanding of the case, prevention, diagnosis, and control of COVID-19. In this paper, ML algorithms are applied to determine the epidemiology of the COVID-19 pandemic. ML algorithms have been shown to be effective and robust, and they can handle large data successfully. Therefore, they can be used to analyze the epidemiology of COVID-19 [21][22][23][24][25][26][27][28][29].
The major contributions of this work include: i. The exploration of dataset noise filtering techniques (all k-edited nearest neighbors, blame-based noise reduction, and condensed nearest neighbors) on the dataset of COVID-19 infection cases in South Korea, which has not been conducted before; ii. The combination of noise filtering with machine learning techniques on epidemiological data for the prediction of COVID-19 cases; iii. The performance evaluation of the all k-edited nearest neighbors noise filter combined with machine learning algorithms using different performance metrics.
The rest of this paper is organized as follows: Section 2 is a review of the literature; Section 3 discusses the materials and methods used in this work, as well as the performance measures; the results and the discussion are presented in Section 4; and Section 5 presents the conclusions.

Review of Related Works
In this section, we succinctly discuss recent research on the application of machine learning to the COVID-19 pandemic. This section follows from our explanations in Section 1 of this paper, in which it was pointed out that machine learning algorithms have gained wide acceptance among data scientists and researchers as a viable tool for addressing the COVID-19 crisis. This is due to the effectiveness of these algorithms in the detection and diagnosis of health-related problems. For example, Nemati et al. [21] proposed a combination of statistical methods, support vector machine (SVM), and ensemble techniques that uses COVID-19 patient data to predict the date on which patients are likely to be discharged from the isolation center. It also evaluates clinical information to determine the length of the patient's stay in the hospital. The downside of this work is that it is only a framework, with no practical implementation of any machine learning or statistical algorithm. The effectiveness of the proposed method was also not evaluated.
Lalmuanawma et al. [22] presented a review of the role of artificial intelligence (AI) and machine learning (ML) in investigating and predicting the transmission rate of COVID-19. They also examined how these techniques can be used to recognize, evaluate, and handle people who have been exposed to COVID-19 to prevent further transmission. Furthermore, the authors examined how AI and ML can help in the process of bringing new pharmaceutical drugs into clinical practice for SARS-CoV-2 and its associated pandemic. The findings of this study indicated that AI and ML have significantly improved the treatment, testing, prediction, and cure/immunization steps needed to take COVID-19 drugs from concept to market availability. Malik et al. [23] used multiple machine learning models to obtain the correlation between different characteristics and the rate of transmission of COVID-19. ML models were used to evaluate the effect of climatic factors on the spread of COVID-19 by mining the connection between the number of confirmed cases and atmospheric condition variables in some counties. The authors opined that atmospheric characteristics are of greater significance in forecasting the number of deaths due to COVID-19 than the other factors mentioned in the paper. Kavadi et al. [24] developed a partial derivative regression and nonlinear machine learning (PDR-NML) method for predicting COVID-19. The PDR was used to explore the dataset for optimal parameters with low computational resource usage. Subsequently, the machine learning model was used to normalize the attributes that are used to make predictions with high accuracy. In a more specific study, Amar et al. [25] used various machine learning and statistical techniques to predict the transmission of the COVID-19 pandemic in Egypt. The authors aimed to assist the Egyptian government in managing the pandemic in the subsequent months. The experimental results showed that the exponential model outperformed the other models compared in the paper. The authors deduced from their results that the COVID-19 pandemic in Egypt was not likely to end soon.
Goodman-Meza et al. [26] applied ensemble machine learning to diagnose COVID-19 in patients admitted to hospital and receiving treatment in an environment where the PCR test is insufficient or inaccessible. The performance is good, though there is still room for improvement. The authors did not propose any new machine learning algorithm; rather, they only used an ensemble of different machine learning models. Ozturk et al. [27] used deep neural network models to automatically detect COVID-19 from chest radiographs of patients. Their model can classify images into two classes or more than two classes. The model can serve as a secondary or assisting diagnosis tool, especially in places where medical experts are unavailable. The classification accuracy of the model is high for binary classes; however, the accuracy for multiclass classification is poor.
Khan et al. [28] suggested employing parallel fusion and deep learning model optimization, with contrast enhancement using a combination of top-hat and Wiener filters. Two pre-trained deep learning models (AlexNet and VGG16) are fine-tuned for the target classes (COVID-19 and healthy). A parallel fusion approach, parallel positive correlation, is used to extract and fuse features. The entropy-controlled firefly optimization approach is used to identify optimal features. Machine learning classifiers, such as the multiclass SVM, are used for classification.
Rehman et al. [29] presented a framework for the diagnosis of 15 different forms of chest disease, including COVID-19, using the chest radiograph modality. They used a convolutional neural network (CNN) with a softmax classifier and a fully connected layer to extract deep features, which are input into traditional machine learning (ML) classification algorithms. The suggested architecture improves the accuracy of COVID-19 detection and increases the prediction rates for other chest disorders.
Rustam et al. [30] investigated the ability of four machine learning models to predict the number of people who will be infected with COVID-19. Each of the models was used to predict the number of new confirmed cases, death toll, and the number of recovered cases in a period of 10 days. The results show that the predictive capability of the models under investigation was not very good. Therefore, there is a need to try other machine learning models.
Wieczorek et al. [31,32] used artificial neural networks (ANN) to estimate future COVID-19 cases using geolocation and past case data. The results of the proposed model show high accuracy, which in some cases reaches above 99%. Ahouz and Golabpour [33] developed a least-squares-boosting classification model to predict the incidence rate two weeks in advance. The proposed model predicted the number of globally confirmed cases of COVID-19 with an accuracy of 98.45%. Zivkovic et al. [34] proposed a hybridized method combining machine learning, an adaptive neuro-fuzzy inference system (ANFIS), and enhanced beetle antennae search metaheuristics. The proposed model achieved a correlation of 0.9763 on China's COVID-19 outbreak data. For more related works, we refer the reader to the review papers [35,36].
In summary, current machine learning methods have not been very successful in the prediction of confirmed cases due to challenges, such as the lack of historical data and the different approaches of governments toward testing, which makes the results hardly comparable [37]. The prediction of COVID-19 cases using the deep learning method has gained more attention currently due to the unavailability of more data. Deep learning methods can specifically handle nonlinear problems more effectively. However, they still face the same problems of governmental actions that influence the data [38].

Dataset
The dataset used in this research comprises epidemiological data of COVID-19 infection cases in South Korea, obtained from the Kaggle database. The dataset covers 23 January 2020 to 24 June 2020, is recorded daily, and contains the following fields: patient ID, sex, age, country, province, city, infected by, contact number, symptom onset date, confirmed date, released date, and state (which is one of released, deceased, or isolated). In this study, due to the nature of the dataset, we extracted the sex, age, country, symptom onset date, confirmed date, released date, and state features, as shown in Table 1. The no CS feature is the number of days from symptom onset to disease confirmation; it is obtained by subtracting the symptom onset date from the confirmed date. The no RC feature is the number of days from confirmation of the disease to release from hospital; it is obtained by subtracting the confirmed date from the released date.
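The two derived day-count features can be illustrated with a minimal sketch. The field names, the ISO date format, and the sample record below are assumptions made for illustration only; they are not taken from the Kaggle schema.

```python
from datetime import date


def days_between(start: str, end: str) -> int:
    """Whole days elapsed from `start` to `end` (ISO strings, YYYY-MM-DD)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


# Hypothetical patient record; keys and values are illustrative only.
record = {
    "symptom_onset_date": "2020-02-01",
    "confirmed_date": "2020-02-05",
    "released_date": "2020-02-20",
}

# Days from symptom onset to confirmation (the "no CS" feature).
no_cs = days_between(record["symptom_onset_date"], record["confirmed_date"])
# Days from confirmation to release (the "no RC" feature).
no_rc = days_between(record["confirmed_date"], record["released_date"])

print(no_cs, no_rc)
```

Writing the difference as "later date minus earlier date" keeps both features non-negative whenever the dates are recorded in chronological order.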

Machine Learning Algorithms
In this subsection, we briefly discuss the machine learning algorithms that are used for this work. We discuss bagging, stochastic gradient boosting, bi-directional long short-term memory, support vector machine, naïve Bayes, random forest, k-nearest neighbor, decision tree, and logistic regression classifiers, as well as noise filtering methods.

Bagging (BAG)
Bagging is a method that combines the predictions of many simple estimators built with a given algorithm, so that generalizability and robustness are improved over a single estimator [39]. The decisions made by multiple learners are integrated into a single prediction; in the case of classification, the decisions are combined by voting. In bagging, the models carry equal weight, whereas weighting gives more influence to the more successful models, much as an executive might weigh a collection of expert advice according to the experts' previous correct predictions. The class that obtains more votes than the others is taken as the prediction.
In the weighted vote, $H_m$ are weak classifiers that decide over a subset of the dataset; a sample $d_i$ is classified into one of the classes $c_j$; and $\alpha_m$ is the weight of weak classifier $H_m$.
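The voting step can be sketched as follows. This is an illustrative sketch, not the implementation used in the experiments: the function name `weighted_vote` is hypothetical, and the default of equal weights corresponds to plain bagging, while unequal weights play the role of the classifier weights in the symbol definitions above.

```python
from collections import defaultdict


def weighted_vote(predictions, weights=None):
    """Combine the class labels predicted by several weak classifiers.

    predictions: one predicted label per weak classifier.
    weights: one weight per weak classifier; equal weights if omitted.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # The class with the largest accumulated weight wins the vote.
    return max(totals, key=totals.get)


# Equal weights: a simple majority vote.
print(weighted_vote(["released", "isolated", "released"]))
# Unequal weights: one strong classifier can outvote two weak ones.
print(weighted_vote(["released", "isolated", "isolated"], [0.9, 0.3, 0.3]))
```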

Stochastic Gradient Boosting (BST)
The stochastic gradient boosting (BST) method is a hybrid of boosting and bagging proposed by Friedman [40]. BST is an ensemble learning algorithm that combines boosting with decision trees and produces its classification by weighting the outputs of all trees. Each new model is constructed along the direction of gradient descent of the loss function of the previous tree. Training thereby reduces the loss between the classification function and the actual function. In the loss function, $\rho$ is the loss; $y_k$ is the k-th output variable; $x$ is the vector of input variables; $F_k(x)$ is the function that maps from input vector $x$ to $y_k$; $K$ is the number of classes; and $P_k(x)$ is the probability of the k-th class given input vector $x$.
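For reference, the symbols defined above match the multi-class deviance used in Friedman's formulation of gradient boosting. Assuming that this is the loss intended here, it can be written as follows, with the class probabilities obtained from the per-class functions through a softmax.

```latex
\[
\rho\bigl(\{y_k, F_k(x)\}_{k=1}^{K}\bigr) = -\sum_{k=1}^{K} y_k \log P_k(x),
\qquad
P_k(x) = \frac{\exp\bigl(F_k(x)\bigr)}{\sum_{l=1}^{K} \exp\bigl(F_l(x)\bigr)}
\]
```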

Bi-Directional Long Short-Term Memory (BLSTM)
Bi-directional long short-term memory (BLSTM) combines long short-term memory (LSTM) and the bi-directional recurrent neural network (BiRNN) [41] for the analysis of classification and time-series data. The benefit of a recurrent neural network (RNN) is that it encodes dependencies between inputs. For long sequences, however, the RNN suffers from exploding and vanishing gradients. LSTM was subsequently developed to address this long-term dependency problem of the RNN. LSTM has three gates: an input gate, an output gate, and a forget gate. Moreover, both LSTM and RNN can only obtain information from past context, which motivates the bi-directional network: a BiRNN can process information from both the forward and the backward direction. The combination of BiRNN and LSTM yields BLSTM. Combining the advantage of LSTM as a memory cell with the access to context information of BiRNN makes BLSTM perform better. This allows the BLSTM to benefit from the input of LSTM for the next layer. Moreover, BLSTM is also capable of handling long-range dependencies in the data. The forward function of BLSTM with L input units and H hidden units is expressed by Equations (4) and (5), while Equations (6) and (7) are the backward calculation of BLSTM. In these equations, $x^t$ is the input vector at time $t$; $a_h^t$ is the network input to LSTM unit $h$ at time $t$; and the activation of $h$ at time $t$ is denoted by $b_h^t$. $w_{ih}$ is the weight from input $i$ to hidden unit $h$, and $w_{h'h}$ is the weight from hidden unit $h'$ to hidden unit $h$. $f$ is the activation function of hidden unit $h$, and $O$ is an objective function with $K$ output units.
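A sketch of the forward and backward calculations in the standard recurrent-network form, consistent with the symbols above, is given below. It is an assumption that Equations (4) to (7) take this form: the gating inside each LSTM unit is omitted, and $\delta_j^t$ is shorthand introduced here for the derivative of the objective with respect to the network input of unit $j$ at time $t$.

```latex
% Forward pass (L inputs, H hidden units)
\[
a_h^t = \sum_{i=1}^{L} w_{ih}\, x_i^t + \sum_{h'=1}^{H} w_{h'h}\, b_{h'}^{t-1},
\qquad
b_h^t = f\bigl(a_h^t\bigr)
\]
% Backward pass (K output units)
\[
\delta_j^t = \frac{\partial O}{\partial a_j^t},
\qquad
\delta_h^t = f'\bigl(a_h^t\bigr)\left(\sum_{k=1}^{K} w_{hk}\, \delta_k^t + \sum_{h'=1}^{H} w_{hh'}\, \delta_{h'}^{t+1}\right)
\]
```

In the bi-directional case, a second hidden layer applies the same recursions with the time index reversed, and both layers feed the same output layer.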

Support Vector Machine (SVM)
The support vector machine (SVM) classifies both linear and non-linear data [42]. SVM uses a non-linear mapping to transform the training set into a higher dimension. In this new dimension, SVM searches for the optimal linear separating hyperplane, a decision boundary by which the tuples of one class are separated from those of another. With an appropriate non-linear mapping to a sufficiently high dimension, data from two classes can be separated by a hyperplane. Compared with other approaches, SVMs are less prone to overfitting.

Naïve Bayes (NB)
Naïve Bayes (NB) is a probabilistic method that is used to describe, use, and acquire information. To classify a test sample x, one approach is to construct a probabilistic model that estimates the posterior probability P(y | x) and to predict the class with the largest posterior probability; this is the maximum a posteriori rule. The posterior is obtained from Bayes' theorem, in which x is the input variable, P denotes probability, and y is the target variable.
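In the notation above, Bayes' theorem and the resulting decision rule can be written as follows. The factorization over individual features $x_i$ is the standard naïve independence assumption; the index $i$ is introduced here for illustration.

```latex
\[
P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)},
\qquad
\hat{y} = \arg\max_{y}\; P(y) \prod_{i} P(x_i \mid y)
\]
```

The denominator $P(x)$ is the same for every class, so it can be dropped when only the most probable class is needed.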

Random Forest (RF)
Random forest (RF) [43] is an ensemble classifier composed of many decision trees. A random selection of features at each node is used to determine the split when a decision tree is built, and each tree depends on the values of a random vector sampled independently. An RF can be formed using bagging together with random attribute selection, with the CART method used to grow the trees. A variant of RF uses random linear combinations of the input attributes: instead of randomly selecting a subset of the features, new attributes are created that are linear combinations of the existing features.

K-Nearest Neighbor (KNN)
K-nearest neighbor (KNN) [44] is a lazy learning technique that learns by comparing a test sample with similar training samples. A distance metric, such as the Euclidean distance, defines closeness. To classify with KNN, an unknown sample is assigned the most common class among its k nearest neighbors.
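The classification rule can be sketched in a few lines. This is a minimal illustration with a brute-force neighbor search and toy data; the function name `knn_predict` and the sample points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Assign `query` the most common label among its k nearest training samples."""
    # Euclidean distance from the query to every training sample.
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups.
train_X = [(0, 0), (0, 1), (5, 5), (6, 5)]
train_y = ["a", "a", "b", "b"]
print(knn_predict(train_X, train_y, (0.2, 0.1), k=3))
```

Because every prediction scans the whole training set, the cost grows with the number of training samples, which is why KNN is described as a lazy learner.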

Decision Tree (DT)
Decision trees (DT) classify by recursively dividing the training data into parts and recording the outcome for each part. The classification and regression tree (CART) is a non-parametric supervised learning model of this kind, which produces accurate classifications with easily understood rules. The transparency of the model makes decision trees highly relevant.

Multinomial Logistic Regression (MLR)
Information 2021, 12, 528
The multinomial logistic regression (MLR) model, which handles a target variable with more than two discrete and unordered categories, with nominal features and a multinomial distribution, represents an extension of binomial logistic regression. LR with a single category dependent variable must have logistic regression. The likelihood that a target variable is labeled as the k-th category is defined in the LR as in Equation (9), where $\pi(x)$ denotes the natural logarithm of the odds ratio given an independent variable vector $x$; $\alpha$ and $\beta$ signify the coefficients of the parameters; and $x_i$ represents the i-th independent variable.
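One common way to write the multinomial logit that is consistent with these definitions is sketched below. The class subscript $k$ on the coefficients and the choice of the K-th category as the reference class are assumptions made for illustration, so this should not be read as the exact form of Equation (9).

```latex
\[
\pi_k(x) = \ln \frac{P(y = k \mid x)}{P(y = K \mid x)} = \alpha_k + \sum_{i} \beta_{ki}\, x_i,
\qquad
P(y = k \mid x) = \frac{\exp\bigl(\pi_k(x)\bigr)}{1 + \sum_{j=1}^{K-1} \exp\bigl(\pi_j(x)\bigr)}
\]
```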

Noise Filtering Methods
The effectiveness of the classifiers that we want to optimize depends not only on the quality of the data, but also on the robustness of the noise-reduction method. Analyzing noisy data is challenging, and it is often difficult to find accurate solutions [45][46][47]. Data noise can affect the inherent nature of a classification problem, as it can introduce new properties into the problem domain. Real-world data are never flawless: they usually contain noise and are sometimes corrupted, which can hamper the efficiency of the method. To obtain clean data for the released, deceased, and isolated classes, we employed three noise filter algorithms, which are discussed below.

All K-Edited Nearest Neighbors (AENN)
The all k-edited nearest neighbors (AENN) method [48] classifies each sample x ∈ D of the training dataset, where D is called the design set. A new design set D′ contains exactly those samples from D that have been classified correctly. For a given value of k and a given sample x, the procedure of AENN is as follows:
5. If the majority of k(x, i) classify x incorrectly, mark x as incorrectly classified and end;
7. If i < k go to step 2, otherwise end;
8. After processing all samples from D, eliminate incorrectly classified samples.
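The overall idea of the filter can be sketched as follows: for each sample, the neighborhood size i runs from 1 to k, and the sample is flagged as soon as the majority of its i nearest neighbors disagrees with its label. This is a minimal sketch under stated assumptions (Euclidean distance, brute-force search, ties resolved in favor of the nearer neighbor); the function names and toy data are hypothetical, and it is not the implementation used in the experiments.

```python
import math
from collections import Counter


def _majority_label(X, y, idx, n_neighbors):
    """Majority label among the n_neighbors samples nearest to X[idx], excluding itself."""
    others = sorted(
        (math.dist(X[idx], X[j]), j) for j in range(len(X)) if j != idx
    )
    votes = Counter(y[j] for _, j in others[:n_neighbors])
    # On a tie, most_common returns the label seen first, i.e. the nearer neighbor's.
    return votes.most_common(1)[0][0]


def aenn_filter(X, y, k=3):
    """Drop every sample misclassified by its i nearest neighbors for some i in 1..k."""
    flagged = set()
    for idx in range(len(X)):
        for i in range(1, k + 1):
            if _majority_label(X, y, idx, i) != y[idx]:
                flagged.add(idx)
                break
    keep = [idx for idx in range(len(X)) if idx not in flagged]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: two clean groups and one mislabeled point near the first group.
X = [(0, 0), (1, 0), (2, 0), (3, 0), (1.5, 2),
     (10, 10), (11, 10), (12, 10), (13, 10)]
y = ["a", "a", "a", "a", "b", "c", "c", "c", "c"]
X_clean, y_clean = aenn_filter(X, y, k=3)
print(y_clean)
```

Samples are only removed after all of them have been checked, so the removal of one sample does not influence the decision for another.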

Blame Based Noise Reduction (BBNR)
Blame-based noise reduction (BBNR) [49] emphasizes the cases that cause misclassifications rather than the cases that are misclassified. It attempts to remove mislabeled cases and unhelpful cases that cause misclassification as follows:

1. For each case (c) in training set T;
2. Split the training set into two sets: the coverage set C and the liability set L;
3. Sort L in descending order;
6. For each x in C;
7. If x cannot be correctly classified by T, misclassifiedFlag = true;

Condensed Nearest Neighbors (CNN)
The condensed nearest neighbors (CNN) method was developed in [50]. The training sample set is divided into STORE and GRABBAG as follows:
1. The first training sample is placed in STORE;
2. The second sample is classified using the KNN rule with the current contents of STORE; if it is correctly classified it is placed in GRABBAG, but if it is incorrectly classified it is placed in STORE;
3. The procedure continues until a complete pass is made through GRABBAG with no transfer to STORE;
4. The content of STORE is used as reference points for the KNN.
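The STORE and GRABBAG procedure can be sketched as below. This is a minimal sketch assuming the 1-nearest-neighbor rule and Euclidean distance; the function name `condensed_nn` and the toy data are hypothetical, and termination occurs either when GRABBAG is empty or when a full pass produces no transfer.

```python
import math


def condensed_nn(X, y):
    """Return the condensed subset (STORE) of the training samples."""

    def classify(store, idx):
        # 1-NN rule using only the samples currently in STORE.
        nearest = min(store, key=lambda j: math.dist(X[j], X[idx]))
        return y[nearest]

    store = [0]      # the first sample goes to STORE
    grabbag = []
    # Initial pass over the remaining samples.
    for idx in range(1, len(X)):
        if classify(store, idx) == y[idx]:
            grabbag.append(idx)
        else:
            store.append(idx)

    # Repeat passes through GRABBAG until no sample is transferred.
    transferred = True
    while grabbag and transferred:
        transferred = False
        remaining = []
        for idx in grabbag:
            if classify(store, idx) == y[idx]:
                remaining.append(idx)
            else:
                store.append(idx)
                transferred = True
        grabbag = remaining

    return [X[j] for j in store], [y[j] for j in store]


# Toy data: two well-separated groups collapse to one reference point each.
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
y = ["a", "a", "a", "b", "b", "b"]
store_X, store_y = condensed_nn(X, y)
print(store_X, store_y)
```

Unlike the AENN sketch, which removes suspicious samples, this procedure keeps only the samples needed to classify the rest correctly, so it mainly reduces the size of the training set.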

Computational Complexity of the Methods
The computational complexity of the AENN method is O(n × d × k), where n is the number of training samples, d is the number of dimensions, and k is the number of neighbors considered.
The computational complexity of the BBNR method is quadratic, because it needs to perform the classification with respect to each neighbor removed from the liability dataset, i.e., its computational complexity is O(n × n).
The computational complexity of the CNN method grows quadratically with the number of training samples, because the K-NN-based training set filtering technique is employed in the development stage of the proposed strategy, and identifying the NNs for each training sample requires computing the distances between all training samples, so it is O (n × n).

Performance Measures
In this study, accuracy ($A_c$), sensitivity ($S_e$), specificity ($S_p$), Kappa ($K$), and balanced accuracy ($BA$) are used.
These measures are computed from the confusion matrix, where $T_P$ is the true positive; $T_N$ is the true negative; $F_P$ is the false positive; $F_N$ is the false negative; $P_o$ is the probability of the observed accuracy; and $P_e$ is the probability of expected accuracy obtained from the confusion matrix.
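The measures can be computed from a confusion matrix as sketched below, using the standard definitions: accuracy is the fraction of correct predictions, sensitivity is TP / (TP + FN), specificity is TN / (TN + FP), balanced accuracy is the mean of sensitivity and specificity, and Kappa is (P_o − P_e) / (1 − P_e). It is an assumption that these standard forms are the ones used in this study; the function name and the example matrix are hypothetical.

```python
def classification_metrics(cm):
    """Compute accuracy, Kappa, and per-class measures from a confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Assumes every class has at least one sample, so no denominator is zero.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(cm[i]) for i in range(n)]
    col_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]

    accuracy = sum(cm[i][i] for i in range(n)) / total   # observed agreement P_o
    p_e = sum(row_sums[i] * col_sums[i] for i in range(n)) / total ** 2
    kappa = (accuracy - p_e) / (1 - p_e)

    per_class = []
    for c in range(n):
        tp = cm[c][c]
        fn = row_sums[c] - tp
        fp = col_sums[c] - tp
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        per_class.append({
            "sensitivity": sensitivity,
            "specificity": specificity,
            "balanced_accuracy": (sensitivity + specificity) / 2,
        })
    return accuracy, kappa, per_class


# Hypothetical two-class confusion matrix.
accuracy, kappa, per_class = classification_metrics([[8, 2], [1, 9]])
print(accuracy, kappa, per_class[0])
```

For more than two classes, each class is treated in turn as the positive class and all other classes as negative, which yields one sensitivity, specificity, and balanced accuracy value per class.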

Results and Discussion
This section presents the experimental results of the machine learning techniques, namely bagging (BAG), stochastic gradient boosting (BST), bi-directional long short-term memory (BLSTM), support vector machine (SVM), naïve Bayes (NB), random forest (RF), k-nearest neighbor (KNN), decision tree (DT), and multinomial logistic regression (LR), for the diagnosis of COVID-19 infection cases.
For our experiments, we used MATLAB 2021a (MathWorks Inc., Natick, MA, USA) on a laptop computer running 64-bit Windows 10 with an Intel Core i5-8265U CPU at 1.80 GHz and 8 GB of RAM.
We compared the performances of the algorithms under consideration using sensitivity, specificity, and balanced accuracy, kappa, accuracy, and p-value to discern which is more accurate in the diagnosis of COVID-19 cases, such as the number of released, deceased, and isolated cases. We used data from the Kaggle database for COVID-19 infection cases in South Korea. The data were segmented into both training (60%) and testing (40%) datasets. The training set was used to train the model, while the test set was used to test it.
The classification task has three classes (released, deceased, and isolated), and the dataset consists of 5165 samples. Table 2 shows the comparison of the performance metrics used in this research: sensitivity, specificity, and balanced accuracy. Most of the machine learning algorithms, such as BAG, BST, BLSTM, SVM, NB, RF, KNN, DT, and LR, can classify the isolated and released classes, but fail to classify the deceased class in terms of sensitivity. The specificity of the three classes, released, deceased, and isolated, is within the range of 78–100%, except for the isolated class of BLSTM. Table 3 shows the comparison of accuracy, kappa, and p-value of BAG, BST, BLSTM, SVM, NB, RF, KNN, DT, and LR. The overall best accuracy was obtained from LR, at 82.77%, while the lowest was obtained from BLSTM, at 65.96%. The result is not encouraging when compared with other state-of-the-art techniques. We therefore use the proposed method to filter the noise from the COVID-19 dataset.
Table 4 presents the performance comparison of all the ML models for the AENN-filtered dataset using sensitivity, specificity, and balanced accuracy. LR attained 82.77% accuracy, while BLSTM produced the worst accuracy, at 65.96%. Table 5 presents the performance comparison of all the ML models for the AENN-filtered dataset using accuracy, kappa, and p-value. Both BAG and RF attained 100% accuracy, while LR produced the worst accuracy, at 98.81%.
Table 6 depicts the performance comparison of all the ML algorithms on the BBNR-filtered dataset using sensitivity, specificity, and balanced accuracy. Both SVM and NB achieved 100% sensitivity and specificity, while LR produced the worst results. Table 7 presents the performance comparison of accuracy, kappa, and p-value of all the ML models for the BBNR-filtered dataset. BAG produced the best performance, closely followed by RF, with 74.12% and 74.01% accuracy, respectively, while NB produced the worst accuracy, at 55.69%.
Table 8 depicts the performance comparison of all the ML algorithms on the dataset filtered by CNN, using sensitivity, specificity, and balanced accuracy as performance metrics. NB achieved a sensitivity of 99.32% and a specificity of 100%. Table 9 shows the performance comparison of all the ML models for the CNN-filtered dataset using accuracy, kappa, and p-value. RF produced the best performance, closely followed by BAG, with 87.76% and 87.72% accuracy, respectively, while SVM produced the worst accuracy, at 79.16%. The main result of Table 9 is that the BAG and RF methods achieve the best performance in terms of accuracy and kappa.
The accuracy results from Tables 5, 7, and 9 are visualized in Figure 1. We summarize the results of the experiments in Figure 2, which shows that the AENN method achieves a statistically significant improvement (p < 0.001, using the t-test) in classification accuracy. On average, AENN improved the accuracy by 19.7833 ± 4.9896%, BBNR was ineffective and decreased the accuracy by 9.9500 ± 9.3480%, and the CNN filtering method increased the accuracy by 4.6600 ± 6.9520%.

The results of our study underscore the need for data filtering to improve the performance of machine learning classifiers. This study demonstrated the superiority of the AENN filtering method, which outperformed the BBNR and CNN filtering methods. This finding is in line with other recent studies [51][52][53]. However, more research is needed to confirm our results.
The limitation of the current study is that only a limited dataset from a single country was used. More research with larger datasets is still needed to validate the proposed methods.

Conclusions
Machine learning techniques have been successful in the classification and prediction of sequential data in recent years. Several algorithms, for example, gradient boosting and neural networks, have been explored for the classification of COVID-19. However, the removal of noise from the data has remained unexplored in this field. In this paper, we used noise filter algorithms to remove noise from all datasets utilized in this study. As a result of denoising, the machine learning models achieved high accuracy in the prediction of COVID-19 cases in South Korea. The technique has proven to be effective in the classification of the released, deceased, and isolated classes. The presented methodology can contribute to the analysis of epidemiological data and the monitoring of the spread of infections. The results of this study can help the governments of nations to take well-timed actions and make quality decisions to effectively address the COVID-19 emergency.
In the future, this work will be extended by exploring more efficient machine learning and deep learning models to determine the epidemiology of COVID-19 in real time using up-to-date datasets. Further validation of our framework on other, possibly larger, datasets, if they become available, will also be a subject of our future work.

Conflicts of Interest:
The authors declare no conflict of interest.