Predicting Heart Disease using Logistic Regression

Heart disease is a common cause of death. In medicine, the ability to diagnose cardiac disease is critical for adequate prevention and treatment. An accurate prediction method has the potential to both extend a patient's life and reduce the severity of their cardiac disease. Machine learning is one approach for generating such predictions. In this study, patient medical record information was used with a logistic regression algorithm to diagnose heart disease. The logistic regression results were used to achieve a high level of accuracy in predicting heart disease. To obtain the model coefficients needed for the equation, the experiment uses an iterative form of logistic regression fitting. The best result was reached at iteration 14, with an accuracy of 81.3495% and an average computation time of 0.020 seconds. The area under the ROC curve is 89.36%. The findings of this study have significant implications for the field of heart disease prediction and can contribute to improved patient care and outcomes. Accurate predictions obtained through logistic regression can guide healthcare professionals in identifying individuals at risk and implementing preventive measures or tailored treatment plans. The computational efficiency of the model further enhances its applicability in real-time decision support systems.

I. Introduction
Machine learning techniques are useful for predicting heart disease, and implementing them may also be more cost-effective [17]. Various methods are used to predict heart disease with maximum accuracy, ranging from simple methods to hybrids that combine methods to increase the accuracy of the classifier model. Methods used include naive Bayes (NB) [18], Bayesian networks (BN) [19], random forest (RF) [20], multilayer perceptron (MLP) [21], support vector machine (SVM) [22], k-nearest neighbors (KNN) [23], logistic regression (LR) [24], decision tree (DT) [25], and deep convolutional neural networks (DCNN) [26]. Preprocessing methods include principal component analysis (PCA), chi-squared, and information gain. Optimization methods include particle swarm optimization (PSO) and ant colony optimization (ACO).
This research applies a machine learning algorithm, logistic regression, to predict heart disease risk based on risk factors from patient health records. The logistic regression used is simple logistic regression without any optimization; this study thus examines the reliability of logistic regression in classifying heart disease. Previous studies reported an accuracy of 92.76% using a dataset with 14 features [9] and 92.58% with 13 features [13]. Based on these results, logistic regression can provide high accuracy. The fundamental difference between this research and previous work lies in the dataset used: this study uses a new dataset covering symptoms of heart disease with only 9 features, fewer than in previous research. To obtain the best model, a comparison method is implemented. The comparison models are function-based classifiers, namely SVM (support vector machine) and LDA (linear discriminant analysis). The aim of this study is to examine how a logistic regression model performs when applied to this dataset.
The motivation behind this research stems from the pressing need to improve the accuracy of heart disease prediction models, given the significant impact of heart disease on global health. Accurate and reliable prediction models can aid healthcare professionals in identifying high-risk individuals and implementing timely preventive measures. By leveraging machine learning algorithms and exploring various features and methods, we aim to contribute to the development of more effective and efficient heart disease prediction models. The findings of this research can potentially enhance medical decision-making processes, improve patient outcomes, and ultimately reduce the burden of heart disease on individuals and healthcare systems.
This research contributes to the existing body of knowledge on heart disease prediction by focusing on a specific dataset with a reduced number of features. While previous studies have achieved high accuracies using more comprehensive datasets, this research explores the potential of logistic regression with a limited feature set. By evaluating the performance of logistic regression and comparing it with other classifiers, such as SVM and LDA, we aim to provide insights into the effectiveness of logistic regression in predicting heart disease using a more compact dataset. The findings of this study can shed light on the trade-offs between feature selection and predictive accuracy, offering valuable guidance for future research and the development of practical heart disease prediction models.
The remaining sections of this paper are organized as follows. Section II provides a detailed explanation of the methodology used, including data collection, data preparation, and the implementation of the logistic regression, SVM, and LDA classifiers. Section III presents the experimental results and performance evaluation metrics, comparing the accuracies of the different classifiers, together with a discussion of the findings and their implications. Finally, Section IV concludes the paper by summarizing the key findings and their significance in the field of heart disease prediction, the limitations of the study, and potential areas for future research.

II. Method
In this research, a systematic methodology consisting of four stages is represented in Figure 1, which provides an overview of these stages: dataset loading, dataset preparation, model creation using the selected method, and result evaluation. The initial stage involves preparing the dataset for analysis. The dataset used in this research was obtained from the Mendeley data repository [1]. It contains information on observable characteristics and risk factors associated with heart attacks, with data instances collected from patients' electronic health records. In total, the dataset comprises 1319 data instances, each representing one patient's information. The distribution of positive and negative labels can be seen in Figure 2: 61% of the data is labeled positive and the remaining 39% negative, so the positive class has more instances than the negative class. The dataset has a total of 9 features, detailed in Table 1. All features have numeric data types, indicating that nominal data has already been converted to numeric form; this makes it easier for the model to perform calculations and spares researchers the need to convert nominal data types.

The second stage separates the data into training data and test data, which are used to build the classifier model. The scheme used for data splitting is the k-fold cross-validation method. This method is applied because the resulting model is more general and it helps avoid overfitting [27]. Cross-validation works based on the value of the parameter k, which determines how many segments the data is divided into for testing and training. An illustration of cross-validation can be seen in Figure 3, which shows cross-validation for this research with k = 10.
The gray cells will be the test data for each section, and the process runs iteratively for the value of k. The parameter used in this study is 10-fold cross-validation, meaning the data is divided into 10 subsets. Each subset is used as the test set once, while the remaining nine subsets are combined to form the training set. This iterative process ensures that the model is evaluated on different combinations of training and test data, providing a more robust assessment of its predictive capabilities. By utilizing the k-fold cross-validation technique, this research aims to build a classifier model that can generalize well to unseen data. This approach helps to assess the model's performance and determine its ability to accurately predict heart disease in new and unseen cases. The third stage is creating an LR classification model. LR is a mathematical model that uses probability estimation for each class [28]. LR is a supervised learning method; in this case, it is used for binary classification, although it is generally also reliable for multi-label classification. The advantages of LR are that it does not require a lot of parameter optimization and is easy to implement [29].
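As a concrete illustration, the 10-fold cross-validation scheme described above can be sketched in Python with scikit-learn. The Mendeley dataset itself is not reproduced here, so a synthetic stand-in with 9 features, 1319 instances, and the 61/39 class split is used; this is illustrative code, not the paper's actual pipeline (the experiments were run in Weka).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart-attack dataset: 1319 instances,
# 9 numeric features, roughly 61% positive / 39% negative labels.
X, y = make_classification(n_samples=1319, n_features=9,
                           weights=[0.39, 0.61], random_state=0)

# 10-fold cross-validation: each of the 10 subsets serves once as the
# test set while the other nine form the training set.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"per-fold accuracies: {np.round(scores, 3)}")
print(f"mean 10-fold accuracy: {scores.mean():.4f}")
```

The reported score is the mean over the ten folds, which is the figure a single-split evaluation cannot provide.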
The LR model operates similarly to linear regression, as seen in (1). However, the primary distinction lies in the function used. In LR, the sigmoid function, shown in (2), is employed within the equation. By substituting the sigmoid function into (1), (3) is derived. Equation (4) represents the formulation of logistic regression as a logit, known as the log probability function. The term inside the brackets is referred to as the odds, representing the ratio of the probability of success to the probability of failure. The LR coefficients are estimated using the iteratively reweighted least squares (IRLS) method [30]. In each iteration, the dependent variable is adjusted to obtain the optimal LR coefficient.
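A minimal sketch of the IRLS (Newton-Raphson) coefficient update described above, assuming the standard formulation with a weight matrix W = diag(p(1 − p)); this is illustrative code, not the Weka implementation used in the paper, and the function name and defaults are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_logistic(X, y, n_iter=14, tol=1e-8):
    """Estimate logistic regression coefficients by IRLS."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                  # current predicted probabilities
        W = p * (1 - p)                        # diagonal of the IRLS weight matrix
        H = X.T @ (X * W[:, None])             # Hessian: X^T W X
        step = np.linalg.solve(H, X.T @ (y - p))  # Newton step
        beta += step
        if np.max(np.abs(step)) < tol:         # stop once the adjustment is tiny
            break
    return beta
```

Each pass reweights the observations and re-solves a least-squares system, which is the "dependent variable is adjusted in each iteration" behavior the paper describes.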
where ŷ represents the predicted value of the dependent variable y given the independent variables x_1, x_2, ..., x_n. The coefficients β_0, β_1, ..., β_n are estimated parameters that determine the relationship between the independent variables and the dependent variable. The term ε represents the error term or residual, and z represents the linear combination of the coefficients and independent variables.
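The numbered equations are not reproduced in this extract; under the notation just described they take the following standard forms (a reconstruction consistent with the surrounding text, not a verbatim copy of the paper's typesetting):

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon \quad (1)

\sigma(z) = \frac{1}{1 + e^{-z}} \quad (2)

\hat{y} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}} \quad (3)

\ln\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n \quad (4)
```

Here (2) is the sigmoid substituted into the linear form (1) to give (3), and (4) is the logit, whose bracketed term p/(1 − p) is the odds ratio described above.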
Comparison is needed to obtain the best method. The models compared are SVM and LDA. SVM generally works by separating data classes with a hyperplane. The SVM function is shown in (5).
where f represents the SVM function, α_i and α_j are the weights assigned to the data points, y_i and y_j are the class labels, and x_i and x_j are the feature vectors. The objective of SVM is to find the optimal weights that maximize the margin between the classes.
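Given the symbols listed (α_i, α_j, y_i, y_j, x_i, x_j), equation (5) is presumably the dual form of the SVM objective; a standard reconstruction, not verbatim from the paper, is:

```latex
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \alpha_i \alpha_j \, y_i y_j \, x_i^{\top} x_j \quad (5)
```

subject to α_i ≥ 0 and Σ_i α_i y_i = 0; maximizing over the weights α yields the maximum-margin hyperplane.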
On the other hand, LDA works by projecting all data vectors linearly. LDA maximizes the distance between classes and minimizes the distance within each class. The LDA formula is shown in (8); the equation is built from the per-class covariance in (6) and the pooled covariance in (7).
where C_k represents the covariance for class k, n_k is the number of instances in class k, x_k^0 denotes the centered data for class k, g is the total number of classes, μ_k is the mean vector for class k, C^{-1} is the inverse of the pooled covariance matrix, (x_k^0)^T is the transpose of the centered data, and p_k is the prior probability of class k.
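Equations (6)–(8) are not reproduced in this extract; a standard reconstruction matching the symbols listed above (with N assumed to be the total number of instances, an addition not stated in the original) is:

```latex
C_k = \frac{(x_k^0)^{\top} x_k^0}{n_k} \quad (6)

C = \frac{1}{N} \sum_{k=1}^{g} n_k C_k \quad (7)

f_k(x) = \mu_k C^{-1} x^{\top} - \frac{1}{2}\, \mu_k C^{-1} \mu_k^{\top} + \ln(p_k) \quad (8)
```

An instance x is assigned to the class k whose discriminant score f_k(x) is largest.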
In the fourth stage, the researcher uses accuracy as the benchmark for comparing results. The formula for calculating accuracy is shown in (9) below. TPR (true positive rate) and FPR (false positive rate) are also used to construct the ROC curve [31], which visualizes the errors of the built classification model. FPR and TPR are shown in (10) and (11) below. TP means an instance is actually positive and predicted positive, TN is actually negative and predicted negative, FP is actually negative but predicted positive, and FN is actually positive but predicted negative.
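The evaluation formulas referenced as (9)–(11) are the standard definitions, reconstructed here since the equation images are not reproduced in this extract:

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (9)

FPR = \frac{FP}{FP + TN} \quad (10)

TPR = \frac{TP}{TP + FN} \quad (11)
```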

III. Results and Discussion
The results of this study were obtained by observing logistic regression performance. Logistic regression was applied using the Weka toolkit [32]. No data preprocessing was performed because the data obtained is considered clean. An IRLS iteration test was carried out to obtain the logistic regression coefficients, with parameter values from 2 to 30 in multiples of 2. The iteration test results can be seen in Figure 4, which graphs the change in accuracy for each iteration value. When the iteration count is low, the accuracy obtained is also low; the greater the iteration value, the higher the accuracy. At iteration = 10 there is a decrease in accuracy compared to iteration = 8, indicating a local optimum, since the accuracy rises and then falls again. At iteration = 14, the model produces its highest accuracy, 81.35%; increasing the iteration count beyond this either decreases accuracy or leaves it essentially unchanged. Based on these findings, it can be concluded that the logistic regression model achieved its best accuracy at iteration = 14. This information is crucial for selecting the optimal logistic regression coefficients and maximizing the predictive power of the model. After the logistic regression accuracy was obtained, the model was compared with the others. The comparison is shown in Table 2, which reports evaluation measures such as accuracy, TPR, FPR, and computation time. Time is in seconds and averaged over ten trials. The table shows an accuracy of 81.35% for logistic regression, 78.17% for SVM with a linear kernel, and 69.75% for LDA, so logistic regression gives the highest accuracy. The TPR value also rises along with the accuracy value.
The FPR value, in contrast, is inversely related: it becomes smaller as the TPR value increases. For computation time, LDA gives the worst result at 0.17 seconds; SVM takes about 0.06 seconds, better than LDA. The best computation time is achieved by logistic regression, which needs only 0.02 seconds to classify. Table 2 includes several performance evaluations of the logistic regression model obtained from the confusion matrix. Based on these results, logistic regression can be used to predict heart disease with high accuracy. The TPR (sensitivity) measures correctly identified positives, and the FPR measures incorrectly identified positives [33]. Computation time is also included in the evaluation; it was averaged over 10 test runs, giving a mean of 0.02 seconds, so the logistic regression prediction model has a relatively fast computation time. Since logistic regression is the best model in this case, consider its confusion matrix. Using iteration = 14, the evaluation results of the logistic regression implementation are shown in the confusion matrix in Table 3. The confusion matrix (error matrix) is used to visualize the performance of the logistic regression algorithm, representing the relationship between actual and predicted values. The table shows TP = 660, TN = 413, FP = 96, and FN = 150. Table 3 shows that the classifier cannot predict all the data accurately; some misclassifications remain. Next, consider Figure 5, which shows the ROC curve generated from the logistic regression model, with FPR on the x-axis and TPR on the y-axis.
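As a quick check, the reported metrics follow directly from the Table 3 confusion matrix; the short snippet below recomputes them from TP = 660, TN = 413, FP = 96, FN = 150.

```python
# Confusion-matrix counts reported in Table 3.
TP, TN, FP, FN = 660, 413, 96, 150

accuracy = (TP + TN) / (TP + TN + FP + FN)  # equation (9)
fpr = FP / (FP + TN)                        # equation (10), false positive rate
tpr = TP / (TP + FN)                        # equation (11), sensitivity

print(f"accuracy = {accuracy:.4%}")  # 81.3495%, matching the reported value
print(f"TPR = {tpr:.4f}")
print(f"FPR = {fpr:.4f}")
```

Note that 660 + 413 + 96 + 150 = 1319, the full instance count, confirming the matrix covers the entire dataset across the cross-validation folds.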
The ROC curve makes it possible to visualize the performance of the classifier in making predictions [33]. The area under the ROC curve in Figure 5 is 89.36% (0.8936). This value is good because it is close to 1, the best possible value; a good curve has a value between 0.5 and 1, so the curve produced by logistic regression is close to its best value. This demonstrates that the classifier's performance is suitable for predicting heart disease.
Accurately predicting heart disease risk is crucial for developing effective decision-support systems in healthcare. The findings of this research contribute to the development of such systems by providing insights into the performance and feasibility of logistic regression as a predictive model. Integrating logistic regression-based algorithms into decision support systems can assist healthcare professionals in identifying individuals at high risk of heart disease and making informed decisions regarding prevention and treatment strategies. These findings highlight the effectiveness of logistic regression as a predictive model for heart disease. Despite some misclassifications, the model exhibited high accuracy, relatively fast computation time, and a good ROC curve. These results contribute to understanding logistic regression's potential in heart disease prediction and can inform the development of more accurate and efficient prediction models.

IV. Conclusion
Referring to the results and discussion, the machine learning method logistic regression can predict heart disease based on a patient's electronic medical record. The dataset used in this study has 9 features and 1319 data instances. Based on the iteration parameter tests, increasing the iteration value affects the accuracy of the classifier model: accuracy rises with the iteration value until an optimal point is found, and the highest accuracy, 81.3495%, was produced at iteration = 14. Compared with the other machine learning models examined, SVM and LDA, logistic regression proved more reliable in making predictions, with relatively high accuracy and relatively fast computation time. Feature selection could be applied in further research to obtain a better model.

Declarations
Author contribution