Comparison of Discriminant Analysis and Support Vector Machine on Mixed Categorical and Continuous Independent Variables for COVID-19 Patients Data

Purpose: Numerous factors can affect the duration of COVID-19 recovery. One of them is the use of natural herbal medication. This study seeks to determine the variables influencing the duration of COVID-19 recovery and to compare discriminant analysis and support vector machine models using COVID-19 patient data from West Sumatra. Methods: Two data mining methods, Discriminant Analysis and Support Vector Machine with different types of kernels (linear, polynomial, and radial basis function), were employed to classify the time of COVID-19 recovery. The study utilized 428 data points, with 75% allocated for training data and 25% for testing data. The independent variables were selected by computing their information value (IV) to gauge their influence on the dependent variable. Data resampling techniques, namely undersampling, oversampling, and SMOTE, were employed to tackle the problem of data imbalance. The balanced accuracy of Discriminant Analysis and Support Vector Machine was examined. Result: The Discriminant Analysis with SMOTE achieved a balanced accuracy of 66.54%, outperforming the linear kernel Support Vector Machine with SMOTE, which had a balanced accuracy of 63.20% in this dataset. Novelty: This study compares Discriminant Analysis and SVM algorithms on mixed categorical and continuous independent variables. It also explores techniques for managing imbalanced data using undersampling, oversampling, and SMOTE, with variable selection based on information value assessment.


INTRODUCTION
Infection with SARS-CoV-2, the virus that causes COVID-19, can produce a variety of effects ranging from no symptoms to multi-organ failure and death [1]. During the initial stages of the pandemic in early 2020, approximately 80% of people who contracted SARS-CoV-2 showed no symptoms, while around 13% experienced severe illness necessitating respiratory assistance, and about 7% needed intensive care due to clinical manifestations such as acute respiratory infection (ARI), sepsis, and multi-organ failure [2].
Natural herbal treatments have been used globally for treating COVID-19 [3]. COVID-19 patients in several countries, including China, have been treated with traditional herbal medicine prescriptions [4]. As a tropical country, Indonesia has abundant medicinal plants, and the West Sumatra region is particularly rich in natural medicinal flora. The inhabitants of West Sumatra have traditionally used indigenous botanicals to treat various illnesses, including COVID-19. The leaves of the sungkai tree (Peronema canescens) in West Sumatra are thought to have medicinal potential for treating COVID-19 [5]. They are traditionally used to treat fever, colds, diarrhoea, hypertension, and malaria, and are also being explored as an alternative therapy for COVID-19. Yani [6] reported that extracts from young sungkai leaves can enhance immunity by raising the white blood cell count, thereby strengthening the immune system against many infectious diseases.
COVID-19 has an incubation period, which is the time between viral infection and the appearance of symptoms [7]. The incubation period of COVID-19 is reported to be 14 days [8], [9]. This study analyses the recovery time from COVID-19 as the response variable. Recovery time is categorised into two groups: patients who recovered during the incubation period (≤ 14 days) and patients who recovered after the incubation period (> 14 days). The study uses mixed independent variables, encompassing both categorical and continuous variables. The information value was computed for each independent variable; it is a variable selection technique that is especially useful when the response variable is binary [10].
A classification approach was chosen to model COVID-19 recovery time. Classification algorithms predict the class of an observation from its independent variables, based on existing class categories [11]. Imbalanced class data, in which data points are unequally distributed among the classes, can cause misclassification and degrade model performance [12], [13]. The study's response variable, the duration of COVID-19 recovery, is imbalanced, and this needs to be addressed. Resampling techniques can help with imbalanced data [14]. This study employed undersampling, oversampling, and the Synthetic Minority Oversampling Technique (SMOTE).
Undersampling is a resampling technique that randomly reduces the data in the majority class to match or come close to the number in the minority class [15]. Qian [16] found that undersampling enhanced classification accuracy in Support Vector Machine (SVM) and Discriminant Analysis. Oversampling randomly adds data to the minority class so that its size matches or approximates that of the majority class [17], [18], [19]. SMOTE is a resampling technique that has emerged as a suitable alternative for imbalanced data [20]. It is an oversampling technique that equalises the class distribution of a dataset by adding synthetic samples to the minority class [21]. Wang [22] observed that SMOTE is an effective technique for handling imbalanced data and improving accuracy metrics.
Discriminant Analysis is a statistical technique for classifying new objects into predetermined groups [23]. Ronald A. Fisher established it in 1936, and it is regarded as a classic data mining method. Discriminant Analysis was initially limited to continuous independent variables [24]. Mbina [25] extended Discriminant Analysis to accommodate mixed categorical-continuous independent variables, providing an alternative for discriminant models with categorical variables. Categorical independent variables are handled by constructing cells from a multinomial table of categorical values in each group rather than converting them into dummy variables [26]. This research employs the Support Vector Machine (SVM) approach alongside Discriminant Analysis.
Guhathakurata [27] evaluated the performance of SVM against other classification algorithms, namely K-Nearest Neighbour (kNN), Classification Tree (CART), Random Forest, Naïve Bayes, and AdaBoost, in categorising COVID-19 patient symptoms. The findings indicated that SVM outperformed the other methods in predictive accuracy. James [28] also emphasised SVM's strong performance in object classification. The SVM approach aims to identify the hyperplane that separates the classes with the largest margin. A hyperplane is a decision function that distinguishes between different classes.
Scholars have studied mixed independent variables in Discriminant Analysis and Support Vector Machines (SVM). Mahat [29] studied the selection of continuous variables in Discriminant Analysis with mixed independent variables. Mbina [25] investigated variable selection in Discriminant Analysis with mixed categorical-continuous independent variables. Neither study addressed imbalanced class data or the use of Information Value for variable selection. Guhathakurata [27] reported that SVM is the most effective method for categorising COVID-19 symptoms, but did not address imbalanced classes or variable selection. Anggrawan [30] explains the use of SMOTE to overcome imbalanced data in SVM but does not describe a variable selection approach such as Information Value. This study therefore compares the classification results of the two techniques on data on the duration of COVID-19 recovery in West Sumatra, which combine mixed independent variables with imbalanced classes.

Data Collection
The study utilised secondary empirical data from the West Sumatra Regional Research and Development Agency. The participants are persons living in West Sumatra who tested positive for COVID-19 (COVID-19 survivors) in 2021. The response variable is the patient's recovery duration, categorised as recovery during the incubation period (≤ 14 days) and recovery beyond the incubation period (> 14 days).
The variables used in the study include mixed categorical and continuous independent variables, as shown in Table 1.

Information Value
Information value (IV) is a commonly used technique for selecting independent variables in classification algorithms with a binary response variable [31]. The IV is computed for each independent variable after its values are segmented into intervals referred to as bins $\{b_1, b_2, \ldots, b_B\}$. The information value is then calculated as [32]:

$$IV = \sum_{b=1}^{B} \left[ P(X \in b \mid Y = 1) - P(X \in b \mid Y = 2) \right] \ln \frac{P(X \in b \mid Y = 1)}{P(X \in b \mid Y = 2)} \quad (1)$$

where $X$ represents data on the independent variable, $Y$ represents data on the dependent variable, $b$ represents the $b$th bin, and 1, 2 represent the categories of the dependent variable. Before the classification model is built, the IV indicates whether an independent variable is a strong or a poor predictor of the dependent variable. Stojanovic [33] groups IV values into categories, as displayed in Table 2.
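As an illustration of the calculation described above, the following minimal Python sketch computes the IV of one already-binned variable against a response coded 1 and 2. The function name `information_value` and the smoothing constant `eps` are assumptions made for this sketch (the smoothing only guards against empty bins); they are not taken from the study.

```python
import math

def information_value(bins, labels, eps=0.5):
    """IV of one binned variable against a binary response coded 1 / 2.

    bins   : bin label of each observation (already discretised)
    labels : response category of each observation (1 or 2)
    eps    : count added to every cell so that an empty bin does not give
             log(0); this smoothing is an assumption of the sketch
    """
    levels = sorted(set(bins))
    n1 = sum(1 for y in labels if y == 1)
    n2 = sum(1 for y in labels if y == 2)
    iv = 0.0
    for b in levels:
        c1 = sum(1 for x, y in zip(bins, labels) if x == b and y == 1) + eps
        c2 = sum(1 for x, y in zip(bins, labels) if x == b and y == 2) + eps
        p1 = c1 / (n1 + eps * len(levels))  # share of category 1 in bin b
        p2 = c2 / (n2 + eps * len(levels))  # share of category 2 in bin b
        iv += (p1 - p2) * math.log(p1 / p2)
    return iv

# Hypothetical example: a bin that is spread evenly over both categories
# carries no information, a bin that separates them carries a lot.
iv_flat = information_value(["a", "a", "b", "b"], [1, 2, 1, 2])
iv_sharp = information_value(["a"] * 3 + ["b"] * 3, [1, 1, 1, 2, 2, 2])
```

Each term of the sum is non-negative, because the difference and the logarithm always share the same sign, so a larger IV always means the bins separate the two response categories more strongly.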

Resampling Data
Resampling is a technique for addressing imbalanced data. Imbalanced data refers to a situation where the response variable contains a majority class and a minority class [34]. The majority class contains more data than the minority class, resulting in an unequal distribution of data points between the two classes [35]. Imbalanced data can produce models that assign most observations to the majority class and show low sensitivity to the minority class [36]. The study used undersampling, oversampling, and the Synthetic Minority Oversampling Technique (SMOTE) to address data imbalance.
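The three resampling ideas can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used in the study: the function names are hypothetical, rows are tuples of continuous values only (categorical variables would need a SMOTE variant designed for them), and the minority class is assumed to contain at least two rows.

```python
import random

def random_undersample(major, minor, rng):
    """Keep a random subset of the majority class, the size of the minority class."""
    return rng.sample(major, len(minor)), list(minor)

def random_oversample(major, minor, rng):
    """Duplicate random minority rows until both classes have the same size."""
    extra = [rng.choice(minor) for _ in range(len(major) - len(minor))]
    return list(major), list(minor) + extra

def smote(major, minor, rng, k=5):
    """Add synthetic minority rows by interpolating towards a near neighbour."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(len(major) - len(minor)):
        base = rng.choice(minor)
        # k nearest minority neighbours of the base row (excluding the row itself)
        neighbours = sorted((p for p in minor if p is not base),
                            key=lambda p: dist2(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position on the segment from base to other
        synthetic.append(tuple(u + t * (v - u) for u, v in zip(base, other)))
    return list(major), list(minor) + synthetic

# Hypothetical toy data: 10 majority rows and 3 minority rows
rng = random.Random(0)
major = [(float(i), float(i % 3)) for i in range(10)]
minor = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
under = random_undersample(major, minor, rng)
over = random_oversample(major, minor, rng)
synth = smote(major, minor, rng, k=2)
```

Undersampling discards majority rows, oversampling repeats existing minority rows, and SMOTE creates new minority rows that lie between existing ones; all three end with equal class sizes.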

Discriminant Analysis with Mixed Independent Variables
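Let $V$ be a random vector partitioned as $V = (Z' \mid Y')'$, where $Z$ represents a vector of $d$ categorical variables and $Y$ represents a vector of $c$ continuous variables. Each unique set of components $z_1, z_2, \ldots, z_d$ in vector $Z$ represents a state of the multinomial random variable $Z$. Let the maximum number of potential states of $Z$ be denoted $k$, where $k = 2^d$; $z_m$ represents state number $m$, also called cell $m$. Let $p_{im}$ denote the probability of obtaining cell $z_m$ in population $\pi_i$, where $m$ ranges from 1 to $k$ and $i$ from 1 to 2. For two populations, the discriminant rule is defined by obtaining state $z_m$ from $Z$ and then classifying $V$ into $\pi_1$ if

$$(\mu_{1m} - \mu_{2m})' \Sigma^{-1} \left( y - \tfrac{1}{2}(\mu_{1m} + \mu_{2m}) \right) \geq \log\left( \frac{p_{2m}}{p_{1m}} \right) \quad (2)$$

and into $\pi_2$ if it does not satisfy this condition. Here $\mu_{im}$ represents the population mean for class $i$ in cell $m$, $\Sigma$ is the covariance matrix for the entire set of observations, $y$ is the vector of $c$ continuous variables, and $p_{im}$ is the probability of an observation falling into population $i$ in cell $m$. The estimate of $\Sigma$ is

$$\hat{\Sigma} = \frac{1}{n_1 + n_2 - 2k} \sum_{i=1}^{2} \sum_{m=1}^{k} \sum_{r=1}^{n_{im}} (y_{rim} - \bar{y}_{im})(y_{rim} - \bar{y}_{im})' \quad (3)$$

where $n_i$ is the number of observations in population $i$, $k$ is the number of cells, $y_{rim}$ is the vector of continuous variables for the $r$th observation in cell $m$ and population $i$, and $n_{im}$ is the number of observations in cell $m$ and population $i$. The cell means are estimated by

$$\hat{\mu}_{im} = \bar{y}_{im} \quad (4)$$

where $\bar{y}_{im}$ is the mean of the observations in cell $m$ and population $i$. The cell probabilities are estimated by

$$\hat{p}_{im} = \frac{n_{im}}{n_i} \quad (5)$$

where $n_{im}$ is the number of observations in cell $m$ and population $i$, while $n_i$ is the number of observations in population $i$.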
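The estimation and allocation steps of this mixed-variable discriminant rule can be sketched in plain Python as follows. The sketch assumes two populations, every cell observed in both populations, and a non-singular pooled covariance matrix; the function names (`solve`, `fit_location_model`, `classify`) and the toy data are hypothetical and only illustrate the mechanics.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_location_model(data):
    """data[i][m]: continuous vectors observed in population i (0 or 1) and cell m."""
    cells = sorted(data[0])
    dim = len(data[0][cells[0]][0])
    n_i = [sum(len(data[i][m]) for m in cells) for i in (0, 1)]
    # cell means and cell probabilities, estimated per population
    mean = {(i, m): [sum(y[j] for y in data[i][m]) / len(data[i][m]) for j in range(dim)]
            for i in (0, 1) for m in cells}
    prob = {(i, m): len(data[i][m]) / n_i[i] for i in (0, 1) for m in cells}
    # pooled within-cell covariance with divisor n_1 + n_2 - 2k
    sigma = [[0.0] * dim for _ in range(dim)]
    for i in (0, 1):
        for m in cells:
            for y in data[i][m]:
                d = [y[j] - mean[i, m][j] for j in range(dim)]
                for r in range(dim):
                    for c in range(dim):
                        sigma[r][c] += d[r] * d[c]
    denom = n_i[0] + n_i[1] - 2 * len(cells)
    sigma = [[v / denom for v in row] for row in sigma]
    return mean, prob, sigma

def classify(y, m, mean, prob, sigma):
    """Return 1 or 2: the population chosen for continuous vector y observed in cell m."""
    diff = [a - b for a, b in zip(mean[0, m], mean[1, m])]
    mid = [v - 0.5 * (a + b) for v, a, b in zip(y, mean[0, m], mean[1, m])]
    w = solve(sigma, diff)  # Sigma^{-1} (mu_1m - mu_2m)
    score = sum(wi * mi for wi, mi in zip(w, mid))
    return 1 if score >= math.log(prob[1, m] / prob[0, m]) else 2

# Hypothetical toy data: two populations, two symptom cells, two continuous variables
data = {
    0: {"mild": [(1.0, 2.0), (2.0, 1.0), (1.5, 1.8)],
        "severe": [(3.0, 3.0), (4.0, 3.5), (3.5, 2.5)]},
    1: {"mild": [(5.0, 6.0), (6.0, 5.0), (5.5, 6.2)],
        "severe": [(8.0, 7.0), (9.0, 8.5), (8.5, 7.2)]},
}
mean, prob, sigma = fit_location_model(data)
label_a = classify((1.2, 1.5), "mild", mean, prob, sigma)
label_b = classify((8.2, 7.4), "severe", mean, prob, sigma)
```

The categorical variables only select which cell's means and probabilities are used, while a single pooled covariance matrix is shared by all cells; this is why no dummy coding is needed.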

Support Vector Machine (SVM)
Support Vector Machine (SVM) is a classification technique that constructs a hyperplane to separate data into different classes. The hyperplane is chosen by computing the margin of candidate hyperplanes and selecting the one with the largest margin. The margin is the distance between the hyperplane and the nearest instance of each class; these nearest instances are referred to as support vectors. If the data are perfectly separable, the objective function of the maximum-margin hyperplane can be defined as:

$$\min_{w, b} \ \tfrac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w' x_i + b) \geq 1, \quad i = 1, \ldots, n$$

where $w$ is the normal vector of the hyperplane, $b$ is the bias term, and $y_i \in \{-1, +1\}$ is the class label of observation $x_i$. The kernel function $K(x_i, x_j)$ defines the geometry of the hyperplane. Linear, polynomial, and radial basis function (RBF) kernels are frequently used kernel functions, as stated by Choubey [37] and Hussain [38]. The following equations define the three types of kernels.
a. Linear kernel: $K(x_i, x_j) = x_i' x_j$ (10)
b. Polynomial kernel: $K(x_i, x_j) = (x_i' x_j + 1)^2$ (11)
c. RBF kernel: $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ (12)
Classification based on the optimal hyperplane is given by

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right) \quad (13)$$

where $\alpha_i$ are the Lagrange multipliers, which are non-zero only for the support vectors.

The measure of model goodness
An assessment of model quality is typically conducted to evaluate the model obtained [39]. Because the data are imbalanced, the model is evaluated using balanced accuracy, a metric suited to evaluating a model's effectiveness on imbalanced data [40]. Computing balanced accuracy requires the confusion matrix shown in Table 3, which displays the four possible combinations of predicted and actual values. The symbols $\pi_i$, $i = 1, 2$, denote the categories of the response variable. True positive (TP) is the count of observations correctly predicted to be in the first category [41]. False positive (FP) is the number of observations predicted to be in the first category that belong to the second category [42]. False negative (FN) is the number of observations predicted to be in the second category that belong to the first category [43]. True negative (TN) is the count of observations correctly predicted to be in the second category [44]. Several measures of model performance are computed from the confusion matrix [45]. Sensitivity and specificity are used to compute balanced accuracy, which is especially useful for imbalanced response variables [46].
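To make the role of the confusion matrix concrete, here is a minimal Python sketch that derives sensitivity, specificity, and balanced accuracy from the four counts. The function name and the counts are hypothetical and chosen only to show why plain accuracy can mislead on imbalanced data.

```python
def balanced_accuracy(tp, fn, fp, tn):
    """Return (sensitivity, specificity, balanced accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of the first category predicted correctly
    specificity = tn / (tn + fp)  # share of the second category predicted correctly
    return sensitivity, specificity, (sensitivity + specificity) / 2

# Hypothetical case: 20 observations in the first category, 80 in the second,
# and a model that assigns every observation to the second category.
sens, spec, bal = balanced_accuracy(tp=0, fn=20, fp=0, tn=80)
plain_accuracy = (0 + 80) / 100
```

In this hypothetical case plain accuracy is 0.80 although the first category is never detected, whereas balanced accuracy is 0.50, the value expected from a model with no discriminating ability.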

Analysis Flowchart
Figure 1. Flowchart comparing discriminant analysis and SVM analysis.
The analysis begins with examining the response and independent variables, and independent variables are chosen based on their information values. A variance-covariance matrix homogeneity test is conducted, and the data are split into 75% training data and 25% testing data. Subsequently, data imbalance is rectified using undersampling, oversampling, and SMOTE. Modelling is conducted using Discriminant Analysis and SVM, and the balanced accuracy values are compared. The optimal model is the one with the highest balanced accuracy [47].
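The 75%/25% split can be sketched as below. The sketch assumes a stratified split, which keeps the class proportions in both parts; the study does not state whether its split was stratified, and the function name, the seed, and the choice to round the test size up within each class are assumptions, so the resulting counts need not match the study's. The class sizes 338 and 90 are the ones reported in the Results.

```python
import math
import random

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train indices, test indices), sampling each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = math.ceil(test_fraction * len(idx))  # rounding up is an arbitrary choice
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)

# Response coded 1 (recovered within 14 days) and 2 (recovered after 14 days)
labels = [1] * 338 + [2] * 90
train_idx, test_idx = stratified_split(labels)
```

Resampling would then be applied to the training indices only, so that the test data keep the original class imbalance when balanced accuracy is computed.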

RESULTS AND DISCUSSIONS
This study utilised data on the recovery time of COVID-19 patients in West Sumatra Province who took sungkai leaves during their recovery. Data were gathered from 428 participants who had tested positive for COVID-19 and consumed sungkai leaves while recovering. Of the participants, 338 recovered within the incubation period (≤ 14 days), whereas 90 recovered after it (> 14 days). Thus 78.97% of respondents recovered within the incubation period and 21.03% recovered after it. This distribution shows an imbalance in the response variable, which calls for a resampling technique. The resampling techniques employed are undersampling, oversampling, and SMOTE. All three aim to equalise the number of observations in each category of the response variable. The study examined independent variables such as gender, symptoms during COVID-19 infection, age, duration of symptom disappearance after confirmed infection, and the quantity of sungkai leaves used in preparing the sungkai leaf mixture.

Data Exploration
This study uses mixed independent variables, including continuous and categorical variables. Figure 2 displays a summary of the continuous independent variables.

Preprocessing Data
The information value of each independent variable is used to select the independent variables that influence the dependent variable. The information values are displayed in Table 4. Because the response variable is disproportionately distributed, the data require a method for handling imbalance. An uneven distribution of the response variable can bias the model towards classifying objects into the predominant class, diminishing prediction accuracy [49]. The dataset was divided into 75% training data and 25% testing data before modelling. Training data are used to construct the model, whereas testing data are used to assess it. Undersampling, oversampling, and SMOTE were applied to address the imbalance. Table 5 displays the quantity of data following resampling and shows that each of the three techniques balances the classes of the response variable.

Discriminant analysis with mixed independent variables.
Categorical independent variables are handled differently in Discriminant Analysis with mixed independent variables than in other classification methods. Support Vector Machines usually handle categorical independent variables by transforming them into dummy variables. In Discriminant Analysis with mixed independent variables, categorical independent variables are handled by constructing cells from the combinations of their categories. Following variable selection, one categorical independent variable, "Symptoms Experienced During COVID-19 Infection," is used in the Discriminant Analysis. This categorical independent variable defines the cells of the model: "mild symptoms", "moderate symptoms", and "severe symptoms". The model relies on these cells and on the handling of imbalanced data. Table 6 shows the cell arrangement derived from the training data. Table 7 displays the balanced accuracy of four models in predicting the recovery duration of COVID-19 patients in West Sumatra using the testing data. The undersampling strategy reduces accuracy in mixed-variable Discriminant Analysis, which differs from an earlier study that found this method can improve accuracy [16]. Previous research has shown that oversampling and SMOTE are valuable methods for addressing imbalanced data and improving accuracy [17], [22]. The discriminant analysis model with SMOTE is considered the best, with the highest balanced accuracy of 66.54%.

Support Vector Machine (SVM)
Several kernels were used to construct the hyperplane in the Support Vector Machine (SVM): linear, polynomial, and radial basis function (RBF). All three kernels were used in the modelling procedure, which also included hyperparameter tuning of the gamma (γ) and penalty (C) parameters. Hsu [50] suggests searching γ over $2^{-15}, 2^{-13}, \ldots, 2^{3}$ and C over $2^{-5}, 2^{-3}, \ldots, 2^{15}$. In this study, γ and C were chosen within these ranges. Hyperparameter tuning was conducted for all three kernel types on each training dataset. The models' performance on the test data is presented in Table 8. The optimal SVM model, which uses the SMOTE resampling method, yields the confusion matrix presented in Table 10.
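The three kernels and the exponentially spaced search grid can be written out as a short Python sketch. The function names are hypothetical, the polynomial kernel is fixed at degree 2 with offset 1, and the sketch only enumerates candidate (C, γ) pairs; it does not train an SVM or reproduce the study's tuning procedure.

```python
import math

def linear_kernel(x, z):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2):
    """Polynomial kernel with offset 1."""
    return (linear_kernel(x, z) + 1) ** degree

def rbf_kernel(x, z, gamma):
    """Radial basis function kernel; gamma controls how fast similarity decays."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

# Exponentially spaced candidate values, in the ranges attributed to Hsu [50]
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]  # 2^-15, 2^-13, ..., 2^3
c_grid = [2.0 ** e for e in range(-5, 16, 2)]      # 2^-5, 2^-3, ..., 2^15
grid = [(c, g) for c in c_grid for g in gamma_grid]
```

Only the RBF kernel depends on γ in this form, so for the linear kernel the search reduces to the values of C alone.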

CONCLUSION
The results indicate that addressing data imbalance using SMOTE yields the highest balanced accuracy for both approaches in this dataset. Discriminant Analysis with SMOTE achieves a balanced accuracy of 66.54%, whereas the support vector machine with a linear kernel and SMOTE yields a balanced accuracy of 63.20%. The discriminant analysis model therefore outperforms the support vector machine on this dataset.
Two recommendations for future research follow from these findings. First, future research should investigate the impact of underlying disorders or comorbidities on the duration of COVID-19 recovery. Second, discriminant analysis and support vector machine (SVM) approaches with mixed independent variables could be applied at a spatial level.

Figure 2. Boxplot for each continuous independent variable.

Figure 2 displays a boxplot for each continuous independent variable. Outliers were observed in age, duration of COVID-19 symptoms fading, number of sungkai leaves consumed in the concoction, and number of glasses of sungkai leaf concoction. Data exploration of the categorical independent variables is shown in Figure 3.

Figure 3. Bar chart of categorical independent variables

Table 1. Description of dataset

Table 2. IV value categories

Table 4. Information value of each independent variable

Table 4 indicates that, of the eight independent variables, two have strong predictive power, three have moderate predictive power, one has weak predictive power, and two have no useful predictive power. This study uses the independent variables categorised as strong or moderate predictors based on their information value: age, duration of COVID-19 symptom resolution, duration of sungkai leaf intake, symptoms during COVID-19 infection, and the quantity of sungkai leaves consumed in the herbal remedy. A covariance homogeneity test must be carried out before performing Discriminant Analysis. It was conducted on the continuous independent variables using Box's M test [48], which yielded a p-value of 0.333. Because the p-value is greater than α (0.05), the data meet the condition of covariance homogeneity.

Table 5. The number of data based on the response variable

Table 6. Proportion of training data

Table 6 displays the proportions in each training data set after handling the imbalanced data. A discriminant analysis model was created for each training data set. The discriminant analysis models were compared using their balanced accuracy scores. The optimal model is the one with the maximum balanced accuracy. Table 7 displays the balanced accuracy values for each model.

Table 7. Evaluation of model fit in discriminant analysis

Table 8. Model performance metrics in discriminant analysis

Table 8 displays the balanced accuracy values of the 12 models used to predict the test data. The use of undersampling, oversampling, and SMOTE in SVM aligns with prior studies that have demonstrated the effectiveness of these methods in addressing data imbalance and enhancing accuracy [16], [17], [22]. The linear kernel SVM with SMOTE is the top-performing model among the 12, with a balanced accuracy of 63.20%.

Comparison of model performance
Discriminant Analysis and Support Vector Machine (SVM) models were compared to identify the optimal model for classifying the recovery duration of COVID-19 patients in West Sumatra. The models compared are the best results from each analysis. The top-performing discriminant analysis model, which uses the SMOTE resampling technique, produces the confusion matrix in Table 9.

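Table 9. Confusion matrix discriminant analysis with SMOTE

Sensitivity, specificity, and balanced accuracy values are obtained from the confusion matrix as follows:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Balanced accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$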

Table 10. Confusion matrix SVM with SMOTE

Based on the evaluation results of the best model from each method, the two models are compared in Table 11.

Table 11. Comparative evaluation of model fit

Table 11 displays the evaluation results for each analysis. Discriminant Analysis applied to imbalanced data with the SMOTE method yielded a sensitivity of 69.57%, specificity of 63.53%, and balanced accuracy of 66.54%. The linear kernel Support Vector Machine with SMOTE achieved a sensitivity of 65.22%, specificity of 61.18%, and balanced accuracy of 63.20%. The Discriminant Analysis model thus achieved the higher balanced accuracy of the two analyses and outperforms the Support Vector Machine approach in predicting COVID-19 recovery time in West Sumatra. Its higher sensitivity and specificity also suggest that Discriminant Analysis is better than SVM at classifying observations in both the majority and the minority class.
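As a check on the reported figures, balanced accuracy can be recomputed as the mean of the sensitivity and specificity quoted above. The subscripts DA and SVM are labels introduced here for the two best models.

```latex
\text{Balanced accuracy}_{\mathrm{DA}} = \frac{69.57\% + 63.53\%}{2} = 66.55\%,
\qquad
\text{Balanced accuracy}_{\mathrm{SVM}} = \frac{65.22\% + 61.18\%}{2} = 63.20\%
```

The value for Discriminant Analysis differs from the reported 66.54% only in the last digit, which is consistent with the inputs having been rounded to two decimals; the SVM value matches exactly.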