Comparing PCA-Based Machine Learning Algorithms for COVID-19 Classification Using Chest X-ray Images

The rapid spread of the COVID-19 pandemic has strained global healthcare systems, necessitating efficient diagnostic methods. While Polymerase Chain Reaction (PCR) and antigen tests are common, they have limitations in speed and precision. Enhancing the accuracy of imaging techniques, especially Chest X-rays (CXR) and Computerized Tomography (CT) scans, is crucial for detecting COVID-19-related lung abnormalities. CXR, being cost-effective and accessible, is preferred over CT scans, but accurate diagnosis often requires technological support. To address this, an extensive dataset of CXR images categorized into five classes is available on Kaggle. Processing such data involves steps like grayscale conversion, image intensity adjustment, resizing, and feature extraction using Principal Component Analysis (PCA). Machine Learning (ML) techniques, including Decision Tree (DT), Random Forest (RF), Stochastic Gradient Descent (SGD), Logistic Regression (LR), Gaussian Naive Bayes (GNB), and K-Nearest Neighbors (KNN), are employed for image classification. DT shows the highest accuracy at 88%, outperforming other models like GNB (77%), KNN (71%), SGD (70%), LR (74%), and RF (45%). It consistently excels across assessment metrics such as F1-score, sensitivity, and precision, with an 88% best-weighted average. However, selecting the optimal ML model depends on factors like dataset characteristics and implementation specifics. Thus, careful consideration of these factors is crucial when choosing an ML model for COVID-19 diagnosis via CXR image classification.


Introduction
Published Online First: August, 2024; https://doi.org/10.21123/bsj.2024.9422; P-ISSN: 2078-8665; E-ISSN: 2411-7986; Baghdad Science Journal
Artificial Intelligence (AI) has made significant advancements in medical diagnosis and the development of new medicines 1,2. AI is projected to significantly impact radiology, providing radiologists with tools for more exact diagnoses and prognoses, ultimately leading to more efficient treatments. Computers, equipped to analyze vast expanses of patient data, are on the verge of replacing radiologists in numerous clinical environments, bringing forth a new era of radiological practice driven by big data and AI. AI has already demonstrated successful applications in treating skin cancer and managing chronic disorders 3. In the fight against the novel coronavirus, scientists anticipate AI playing a vital role in finding a cure and alleviating the fear associated with the pandemic 4. ML offers a robust method for analyzing data in many formats, but efficient use necessitates meticulous feature organization. One study applied this strategy to an online purchases and returns dataset, which was quite large, with a total of 5,659,676 transactions and 15,555 facets 5.
The COVID-19 pandemic has put tremendous demand on healthcare systems, forcing them to adapt to new techniques. This necessitates the utilization of cutting-edge technologies such as AI to develop intelligent and self-sufficient healthcare solutions 6,7. COVID-19 stands out among viruses due to its rapid replication and transmission, resulting in a global pandemic within a remarkably short period of time 8. Extensive research and analysis are ongoing in the medical and healthcare sectors to better understand this rapidly evolving health crisis and develop effective responses 9. Accurately simulating the spread of COVID-19 remains a critical objective. The gold standard for diagnosis is the detection of viral RNA in sputum by real-time reverse transcription-polymerase chain reaction (RT-PCR) on nasopharyngeal swabs 10,11. However, these tests can take up to 6 hours to yield results, rely on human intervention, and exhibit a low positive rate in the early stages of illness. Thus, there is a pressing need for rapid and accurate diagnostic methods to bring the pandemic under control as quickly as possible, particularly in the long term when lockdowns are lifted and widespread testing becomes essential to prevent a resurgence of the virus 12,13. The prevention of this disease has been approached from various angles. To reduce employee dependency, maintain COVID-19 safety, and cut identity verification expenses, one study developed a COVID-19 Vision system that uses Haar cascades for a real-time face mask detector 14.
In many countries, COVID-19 testing is primarily available to individuals with disease symptoms. However, it is important to note that numerous symptomatic patients exhibit more than one sign, making it challenging for national healthcare systems and staff to identify and track potential cases. This burden is overwhelming even in highly developed nations. To address this crisis, AI algorithms play a crucial role in various aspects of the global health emergency response. AI algorithms are instrumental in the development of drugs and vaccines, as well as in monitoring people's mobility patterns to ensure compliance with social distancing guidelines. These algorithms also assist medical professionals in quickly diagnosing COVID-19 by evaluating CT scans and X-rays of lung conditions, enabling efficient patient tracing 15,16.
ML algorithms encounter several obstacles while attempting to diagnose COVID-19 using CXR images. These include issues with dataset size, image quality, data augmentation, feature extraction, model selection, and performance evaluation 17. To overcome these obstacles, it is necessary to enhance the performance of ML algorithms by meticulous preprocessing, feature extraction, model selection, and evaluation procedures. Size, balance or imbalance, and image quality are dataset-related variables that affect how quickly and well ML classification works with CXR images 18.
The COVID-19 pandemic has created an urgent need for efficient and accurate disease diagnosis. CXR imaging has emerged as a valuable tool in identifying COVID-19 cases due to its accessibility and cost-effectiveness. To improve the diagnostic procedure, ML applications have been used to analyze CXR images and aid in diagnosing and classifying COVID-19 cases. This introduction explores the use of ML in analyzing CXR images for COVID-19 diagnosis, highlighting its potential to improve accuracy, speed up the diagnostic process, and assist healthcare professionals in effectively managing the pandemic. One study examined 14 research articles on COVID-19 and ML, finding that ML plays a significant role in COVID-19 research, prediction, and discrimination, with supervised learning achieving a testing accuracy of 92.9%, implying its potential inclusion in healthcare programs for assessing and triaging COVID-19 cases. In contrast, recurrent supervised learning may offer even greater accuracy in the future 19. ML approaches have been extensively employed in the medical arena, particularly in the context of COVID-19, utilizing various imaging systems such as CXR, with applications ranging from diagnosis to forecasting and medication development. However, challenges and limitations still exist, necessitating further research to address issues related to safety and other factors, while Keras remains the most commonly used library in these studies 20. Other work focused on the critical need for timely and reliable detection of COVID-19 patients. That work emphasized the advantages of using whole blood count tests for early detection, used ML algorithms for prediction, and assessed performance using accuracy, recall, precision, and F-measure metrics 21,22. ML is a scientific approach that enables computer systems to perform specific tasks without explicit programming, utilizing algorithms and statistical models. ML algorithms are widely applied in various applications, offering the advantage of independent decision-making once trained with data 23. Another study used ML and Deep Learning (DL) algorithms in a multi-test retrospective analytic approach to detect COVID-19 and assess its progression using CXR features, resulting in a satisfactory "corona score" that demonstrated the high accuracy of advanced AI-based image analysis in diagnosing, quantifying, and monitoring COVID-19 24.
This article compares ML methods for COVID-19 disease categorization to improve accuracy and implementation time. The main contributions of this paper are summarized as follows:
• Preprocessing: Images are converted to grayscale, their density is adjusted using histogram equalization, and they are resized to speed up analysis.
This article is organized as follows: Section two presents related works on COVID-19 categorization from CXR images using ML approaches. Section three is separated into three subsections, which discuss the dataset description, preprocessing techniques, and feature extraction using Principal Component Analysis (PCA). Section four covers the ML algorithms used in the study. The fifth section is divided into "Results and Performance of ML Algorithms" and "Evaluation Performance Comparison of ML Algorithms". The final section discusses the research's strengths and weaknesses, its practical and theoretical consequences, and plans for the future.

Related Works
The capability of ML to manage complicated and enormous datasets is demonstrated here. The global COVID-19 pandemic has highlighted the urgent need for effective detection methods. Several studies have explored the effectiveness of ML methods in achieving high accuracy for COVID-19 diagnosis using CXR images. Recent research has investigated advanced ML approaches, such as KNN, to detect COVID-19 with classification rates of up to 88% 25-27.
CXR radiography can be used to triage non-COVID-19 lung illnesses 28. In one comparison, KNN had the highest accuracy and weighted average for precision, sensitivity, and F1-score among the ML models tested 32. Another study aimed to develop a Lasso-logistic regression model predicting COVID-19 severity (severe, moderate, and mild), demonstrating 85.9% accuracy and reducing deaths through early detection 33.
Several other diseases, including Alzheimer's, glaucoma, and cancer, have been successfully detected in real-world clinical settings using the same ML methodology that has proven so successful in COVID-19 detection over the last three years 34. On the other hand, several major obstacles have necessitated the development of more durable devices to train on massive datasets effectively using DL algorithms and to deal with the low quality of medical images 35. Choosing a suitable ML model or DL architecture requires much disease-specific practical experience.
The cited research has some serious limitations. Examples include employing ML for CXR image classification without proper CXR image preprocessing methods and using datasets with imbalanced classes. Inconsistent implementation of key techniques led to issues with intensity equalization, noise removal, resizing, and feature extraction using methods like principal component analysis (PCA). Evaluation metrics, such as F1-score, recall, accuracy, and precision, require improvement. Furthermore, training ML models usually takes a long time.
These studies demonstrate the significance of ML approaches in achieving high accuracy for COVID-19 diagnosis using CXR images. Using the power of these algorithms, accurate and efficient identification of COVID-19 instances can be achieved, allowing for timely interventions and reducing disease spread.

Methodology
The COVID-19 epidemic has generated an urgent need to tackle a severe hazard to human health. The correct interpretation and classification of CXR images are critical in diagnosing COVID-19. ML technologies improve imaging tools' capabilities, supporting healthcare professionals' curative efforts. A large number of researchers have classified COVID-19. This study takes a fresh approach to ML algorithms to obtain optimal accuracy while requiring little execution time and storage. This section describes the methods used in this study in depth, beginning with the dataset description, CXR image preprocessing, and feature extraction using PCA, and ending with a description of the ML algorithms used. The classification methodology consists of several steps. Initially, preprocessing techniques are applied to convert the image into a grayscale format and adjust its density.
Subsequently, the image is resized, and feature extraction is performed using PCA to extract the most informative features. The image is then transformed from two dimensions to one dimension to enhance training speed and minimize hardware requirements. Finally, the prepared images are used to train and test ML models that classify COVID-19 from X-ray images. Fig. 1 visually represents the implementation process and taxonomy utilized in our work for dataset preparation before using ML models to classify COVID-19 in CXR images. The remainder of this section follows that process. Before using ML algorithms, the dataset goes through four preprocessing stages to decrease storage size while increasing execution speed and classification accuracy. At the outset, the dataset is transformed into grayscale, transitioning all images from three channels to one. The next step in improving the quality of CXR images is to apply histogram equalization.

A. Description of the Dataset
This study's dataset consisted of five classes derived from three primary datasets. The COVID-19 data were obtained from Cohen et al.'s 36 comprehensive collection of X-ray and CT images, which included various lung disorders such as COVID-19, SARS, and MERS. This dataset is regularly updated and comprised 752 X-ray images as of June 15, 2020, with the majority (435) depicting cases of COVID-19. Lateral X-rays and CT scans were excluded from this investigation, and incomplete metadata resulted in the omission of gender information for forty-three images. On average, the COVID-19 patients in the dataset were approximately fifty-four years old, with approximately two hundred fifty-six males and one hundred thirty-six females. To create balanced sets between pneumonia-positive radiographs (including normal, bacterial, and viral cases) and normal images, the second source utilized a dataset consisting of around 5,863 images 37. Another dataset, from the US National Library of Medicine and focused on tuberculosis (TB), provided two sets of CXR images. This TB dataset supports research on computer-aided diagnosis (CAD) for respiratory diseases, including TB 36,38. The TB dataset contained a total of 394 images, with 336 originating from China and the remaining 58 sourced from Montgomery County. However, because there are slightly fewer TB X-ray images than images of other classes, the detection performance may be affected by class imbalance 39. To address this issue, augmentation techniques were used to resize a random sample of 40 X-ray images. The dataset comprised five classes of CXR images, each with a different number of cases. With a relatively balanced CXR image dataset, this work classifies COVID-19 using multiple ML models. Using a dataset of CXR images that is almost balanced has a positive effect on the accuracy and effectiveness of COVID-19 classification using different ML models, which is a substantial benefit.

B. Preprocessing Stage
In ML approaches, one common strategy is to reduce background noise and highlight relevant regions in an image for identification tasks or during the learning phase. This preprocessing method aims to eliminate extraneous data and noisy values. For the model to converge, pixel intensity normalization is performed within the range of [0, 1]. The resized response images are designed to work with the system's architecture and support the ML models. Efficient nets, with low memory and latency costs, are utilized to take advantage of higher-quality response images. This adjustment in response determination can impact the precision of the model. The following steps outline the preprocessing phase of the procedure:

1. Convert the color image (RGB) to a grayscale image. This is achieved using Eq. 1:

$$\text{Gray}(x, y) = (0.2989 \times R) + (0.5870 \times G) + (0.1140 \times B) \tag{1}$$

Converting the image to grayscale reduces the number of channels from three to one, allowing faster processing than color images.

2. Enhance the image's contrast by applying histogram equalization, as shown in Eq. 2:

$$h(v) = \text{round}\left( \frac{cdf(v) - cdf_{min}}{(m \times n) - cdf_{min}} \times (L - 1) \right) \tag{2}$$

Here v is a grey level, cdf is the cumulative distribution function, m × n is the total number of pixels in the image, and L represents the grey level range, which is 256 levels.

3. Resize the image. The original image's width is denoted by w = 227, the height of the original image is denoted by h = 227, and the width and height of the image after resizing are x = 20 and y = 20, respectively, the same as in 41,42.

These preprocessing steps help prepare the image data for further analysis and ML tasks, allowing for improved performance and more efficient processing. Fig. 4 illustrates the various stages of image processing, explicitly focusing on grey-level conversion, histogram analysis, and image resizing to a dimension of (20 × 20) pixels. The preprocessing phase offers several vital benefits. First, converting a color image (RGB) to a grayscale image reduces computing complexity, simplifies data representation, and emphasizes image structure and brightness information while eliminating color fluctuations. Second, histogram equalization improves the appearance of images by increasing contrast, making details more visible in low-contrast images, and making the dynamic range of pixel intensities more uniform. Finally, reducing the size of images from 227 × 227 to 20 × 20 lowers computing demands, requires less storage space, and can accelerate image processing. This resizing is particularly useful when the input dimensions of the analysis or model are smaller than the original image dimensions.
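The three preprocessing steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the study's implementation: the function names are hypothetical, the grayscale weights are the standard luma coefficients (0.2989, 0.5870, 0.1140), equalization uses the usual cumulative-distribution mapping, and nearest-neighbor sampling is assumed for resizing because the interpolation method is not stated.

```python
def to_grayscale(rgb_image):
    """Weighted sum of the R, G, B channels for every pixel (values 0-255)."""
    return [[min(255, round(0.2989 * r + 0.5870 * g + 0.1140 * b))
             for (r, g, b) in row] for row in rgb_image]

def equalize_histogram(gray, levels=256):
    """Spread grey levels using the cumulative distribution function (cdf)."""
    hist = [0] * levels
    for row in gray:
        for v in row:
            hist[v] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    total = cdf[-1]                          # m x n pixels
    cdf_min = next(c for c in cdf if c > 0)  # smallest non-zero cdf value
    if total == cdf_min:                     # constant image: nothing to spread
        return [row[:] for row in gray]
    return [[round((cdf[v] - cdf_min) / (total - cdf_min) * (levels - 1))
             for v in row] for row in gray]

def resize_nearest(gray, new_h=20, new_w=20):
    """Nearest-neighbor resize (an assumed choice of interpolation)."""
    h, w = len(gray), len(gray[0])
    return [[gray[i * h // new_h][j * w // new_w] for j in range(new_w)]
            for i in range(new_h)]

def flatten_normalized(gray):
    """Turn the 2-D image into a 1-D vector with intensities in [0, 1]."""
    return [v / 255 for row in gray for v in row]
```

Applied in this order to a 227 × 227 image, the sketch yields a vector of 20 × 20 = 400 values per image.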

C. Feature Extraction
ML models depend on precise feature extraction, which poses significant challenges in accurately diagnosing COVID-19 from X-ray images. If not carefully chosen, inadequate features can result in a suboptimal representation of the data, consequently leading to less effective classification. The PCA method is selected for feature extraction because of its many advantages. PCA is an invaluable tool for improving the efficiency of ML models used for COVID-19 diagnosis from CXR images 15. PCA extracts a new collection of principal components from the initial features, lowering the dataset's dimensionality. The amount of information preserved is significantly affected by the selection of principal components, and the first principal component captures the most variance. Choosing the correct number of principal components allows for the optimal retention of relevant information. ML models trained with CXR images have significantly improved their ability to diagnose COVID-19.
PCA is a commonly used statistical technique for feature extraction and image representation. PCA aims to reduce the dimensionality of high-dimensional data, such as a CXR image, while maintaining as much of the original information as possible. After preprocessing a set of images, PCA can then be used to process another set of images so that its feature extraction knowledge can be reused 43. PCA performs feature extraction and dimensionality reduction by utilizing an orthogonal transformation to convert potentially correlated observations into linearly uncorrelated variables. It is a practical method for feature extraction in pattern recognition 44. Fig. 5 shows the PCA approach. As a traditional pattern recognition approach for feature extraction and data representation, PCA aims to represent patterns with a reduced number of features, reducing the dimensionality of the feature space while retaining crucial discriminative information. One notable application of PCA is Eigenface, which utilizes PCA techniques to extract characteristic features from facial images. It represents a given face as a linear combination of "eigenfaces" obtained through feature extraction 45. PCA is a linear transformation that can decrease the number of dimensions in a dataset. It achieves this goal by maximizing the variance of the data, producing a set of orthogonal basis vectors with no correlations 46,47. Given d data points $z_1, z_2, \dots, z_d \in \mathbb{R}^n$, where n is the number of dimensions in the dataset, PCA is carried out using the following steps:
• The n-dimensional mean vector MV is computed as in Eq. 4:

$$MV = \frac{1}{d} \sum_{i=1}^{d} z_i \tag{4}$$

The mean vector MV is the average of all data points in the dataset, z_1, z_2, ..., z_d. It is obtained by adding all the data points and dividing by the total number of data points (d).
• The covariance matrix CM for the observations is given by Eq. 5:

$$CM = \frac{1}{d} \sum_{i=1}^{d} (z_i - MV)(z_i - MV)^{T} \tag{5}$$

The covariance matrix CM represents the relationships within the dataset. It is calculated by subtracting the mean vector (MV) from each data point (z_i), taking the outer product of each resulting vector with itself, adding up these outer products, and dividing by d.
• The eigenvalues and eigenvectors are calculated based on CM. For each principal component, the eigenvalue indicates the variance it explains and the eigenvector indicates the direction in which it is located.
• The data are transformed using the linear transformation in Eq. 6, which reduces data dimensionality:

$$y_d = a_{d1} z_1 + a_{d2} z_2 + \dots + a_{dd} z_d \tag{6}$$

The converted data point (y_d) in the reduced-dimensional space is obtained by multiplying the original data points (z_1, z_2, ..., z_d) by the associated coefficients (a_d1, a_d2, ..., a_dd) and adding the results together.
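The PCA steps (mean vector MV, covariance matrix CM, eigen-analysis, projection) can be illustrated with a small dependency-free sketch. It is an assumption-laden illustration rather than the study's code: the function names are hypothetical, and power iteration is swapped in for a full eigendecomposition, so only the first principal component is recovered. In practice a numerical library would normally be used to obtain all components.

```python
import math

def mean_vector(points):
    """MV: average of all d data points, one entry per dimension."""
    d, n = len(points), len(points[0])
    return [sum(p[j] for p in points) / d for j in range(n)]

def covariance_matrix(points):
    """CM: sum of outer products of the centered points, divided by d."""
    d, n = len(points), len(points[0])
    mv = mean_vector(points)
    centered = [[p[j] - mv[j] for j in range(n)] for p in points]
    return [[sum(c[a] * c[b] for c in centered) / d for b in range(n)]
            for a in range(n)]

def leading_component(cm, iterations=500):
    """Power iteration: largest eigenvalue of CM and its unit eigenvector.

    Assumes the uniform start vector is not orthogonal to that eigenvector.
    """
    n = len(cm)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(cm[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    cm_v = [sum(cm[a][b] * v[b] for b in range(n)) for a in range(n)]
    eigenvalue = sum(v[a] * cm_v[a] for a in range(n))
    return eigenvalue, v

def project(points, mean, component):
    """Coordinate of each centered point along the principal component."""
    return [sum((p[j] - mean[j]) * component[j] for j in range(len(mean)))
            for p in points]
```

For points lying exactly on a line, the leading eigenvalue equals the total variance and the projection preserves all of the information in a single feature.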

Machine Learning Algorithms
CXR is a type of medical imaging that is crucial in the global fight against COVID-19. Recent advancements in ML technologies have significantly enhanced CXR imaging capabilities and have proven to be valuable tools for medical professionals. This study used ML models to identify COVID-19, Pneumonia-Bacterial, Pneumonia-Viral, and normal cases in CXR images. The preprocessing steps include adjusting grey levels, histogram equalization, and resizing, followed by feature extraction on the dataset using PCA. After the preprocessing steps and the application of PCA for feature extraction, which results in 400 features for each CXR image, the dataset is split into 70% for training and 30% for testing. The study utilized several ML algorithms, including DT, RF, SGD, LR, GNB, and KNN, to evaluate their performance and effectiveness within different prediction models. The results of these algorithms were computed and evaluated, and to provide further insight, a comparison was made with recently published COVID-19 detection models. The application of these algorithms in our experiments is depicted in Fig. 6.
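A 70%/30% split of this kind can be sketched as follows. The helper name, the fixed seed, and the use of a simple shuffled (non-stratified) split are assumptions made for illustration; the text does not say how the split was drawn.

```python
import random

def train_test_split(samples, labels, train_fraction=0.7, seed=42):
    """Shuffle indices once, then cut them into train and test portions."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)   # seed is a hypothetical choice
    cut = int(round(train_fraction * len(indices)))
    train_idx, test_idx = indices[:cut], indices[cut:]
    return ([samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_idx], [labels[i] for i in test_idx])
```

With class counts that differ, as in this dataset, a stratified split that preserves class proportions in both portions may be preferable.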

A. Decision Tree (DT)
DT classifiers are widely recognized as one of the prominent approaches for data classification, serving as effective representation models 48. The DT comprises internal nodes that function as tests on data patterns and leaf nodes that serve as data pattern categories. These tests are run across the tree to get the best output for a given input pattern. DT algorithms find applications in various domains 49. This supervised ML algorithm is capable of solving classification and regression problems. It is known for its simplicity and effectiveness in classification tasks. Mathematically, the DT algorithm involves understanding the concept of entropy (H) before delving into the calculation of Information Gain (IG), as depicted in Eq. 7:

$$H = -\sum_{i=1}^{c} p_i \log_2(p_i) \tag{7}$$

➢ H represents entropy.
➢ c is the number of classes or categories.
➢ p_i is the probability of occurrence of the i-th class.
➢ log_2 is the base-2 logarithm.
Entropy is essential in the decision-making process of a DT since it determines data segmentation and boundary construction. It is used to assess the impurity or randomness of a dataset. Information Gain (IG), in turn, is used to determine the best feature for splitting at each stage of tree development. The capacity of DT classifiers to effectively handle randomness in performance outcomes is well known. The formula for IG is given in Eq. 8.

$$IG(Y \mid X) = H(Y) - H(Y \mid X) \tag{8}$$

➢ IG(Y | X) represents the IG when the dataset Y is split based on the attribute X.
➢ H(Y) is the entropy of the original dataset.
➢ H(Y | X) is the conditional entropy of Y given X, which is the entropy of Y after the dataset has been split based on the attribute X.
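As a worked illustration of entropy and Information Gain, the sketch below computes both for a list of class labels and a categorical attribute. The function names and the toy labels are hypothetical and serve only to show the calculation.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(Y | X) = H(Y) - H(Y | X) for a categorical attribute X."""
    total = len(labels)
    groups = {}
    for label, value in zip(labels, attribute_values):
        groups.setdefault(value, []).append(label)
    # Conditional entropy: size-weighted entropy of each group after the split.
    conditional = sum(len(group) / total * entropy(group)
                      for group in groups.values())
    return entropy(labels) - conditional
```

An attribute that separates the classes perfectly yields an IG equal to the entropy of the original labels, while an uninformative attribute yields an IG of zero.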

B. Random Forest (RF)
RF is a supervised ML approach that is most commonly applied to classification and regression problems 50. RF constructs decision trees from diverse samples and employs majority voting for classification or averaging for regression. It is widely recognized as an efficient classification method and has been successfully applied for COVID-19 prediction in numerous studies 8. This study employs RF to identify critical features for distinguishing COVID-19 cases from non-COVID-19 cases. When performing RF on classification data, it is important to consider the Gini index, which determines the connections between nodes in a DT branch. The Gini index is given by Eq. 9:

$$Gini = 1 - \sum_{i=1}^{c} (p_i)^2 \tag{9}$$

➢ c is the class count.
➢ p_i is the probability of the i-th class.
Eq. 9 calculates the Gini impurity for each branch based on the class and its probability, helping determine the likelihood of occurrence for each branch.
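The Gini impurity of a branch, and the size-weighted impurity of a candidate split, can be sketched as below. The function names are hypothetical, and the weighted combination of the two child branches is the usual way a tree compares splits, stated here as an assumption about how the index is applied.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini = 1 - sum(p_i ** 2) over the classes present in labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2
                     for count in Counter(labels).values())

def weighted_gini(left, right):
    """Size-weighted impurity of the two branches produced by a split."""
    total = len(left) + len(right)
    return (len(left) / total * gini_impurity(left)
            + len(right) / total * gini_impurity(right))
```

A pure branch has impurity 0, and a two-class branch with equal shares has impurity 0.5, so lower weighted values indicate better splits.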

C. Stochastic Gradient Descent (SGD)
SGD is a common algorithm in many ML approaches, particularly as the foundation for neural networks 51. SGD is an iterative procedure that begins at an arbitrary point on a function and steadily descends its slope until it reaches the minimum point. In SGD, the parameters are given a random initial value, and partial derivatives with respect to each feature are computed 51. SGD performs well on convex loss functions with linear classifiers and regressors 52. As a result, ML classifiers, notably SGD adaptive classifiers, are used to examine data with appropriate tools. Our work used PCA to extract key features and achieve maximum accuracy while utilizing SGD to diagnose COVID-19. In the SGD formulation, the learning rate (η) is typically chosen as 0.1 or 0.01. The new parameters are updated using Eq. 11:

$$\theta_{new} = \theta_{old} - \eta \cdot \nabla J(\theta_{old}) \tag{11}$$

Here θ denotes the model parameters and ∇J(θ) is the gradient of the loss function J with respect to those parameters.
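The update rule can be illustrated on a small linear model. This sketch makes several assumptions that are not from the study: a squared-error loss, a linear regressor, one sample per update, small random initial parameters, and the hypothetical function name and default values shown.

```python
import random

def sgd_linear_regression(samples, targets, eta=0.01, epochs=200, seed=0):
    """Fit weights and a bias with one-sample-at-a-time gradient steps."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.01, 0.01) for _ in samples[0]]  # random start
    bias = rng.uniform(-0.01, 0.01)
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)                 # visit samples in random order
        for i in order:
            prediction = bias + sum(w * x for w, x in zip(weights, samples[i]))
            error = prediction - targets[i]
            # Gradient of 0.5 * error**2 is error * x for each weight and
            # error for the bias; step against it, scaled by eta.
            weights = [w - eta * error * x
                       for w, x in zip(weights, samples[i])]
            bias -= eta * error
    return weights, bias
```

On noise-free data generated by y = 2x + 1, repeated updates drive the weight toward 2 and the bias toward 1.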

D. Logistic Regression (LR)
LR is still one of the most popular ML techniques, especially for binary classification tasks. LR calculates the likelihood of a specific result based on the input factors. The cost function determines the optimal values of the two coefficients, indexed 0 and 1 (the intercept and slope), to construct the best-fit line for the data points provided 53. The cost function evaluates the model's performance in linear regression by optimizing the regression coefficients or weights. The Mean Squared Error (MSE) cost function computes the average squared error between the predicted and actual data values 54. ML classification models such as LR can therefore be used to predict COVID-19 patients 55. LR uses the sigmoid function to convert predicted values into probabilities. The sigmoid function transforms any real value into a value between 0 and 1. The sigmoid function is given by Eq. 12:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{12}$$

➢ σ(x) is the output value between 0 and 1.
➢ e is the base of the natural logarithm (approximately equal to 2.71828).
➢ x is the input value.
The cost function is MSE, which measures the average squared difference between predicted and actual values, as given by Eq. 13:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{13}$$

Here n is the number of samples, y_i is the actual value, and ŷ_i is the predicted value.
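The sigmoid mapping and the MSE cost can be sketched directly. The function names are hypothetical, and the two-branch form of the sigmoid is an added numerical-stability choice that avoids overflow for large negative inputs; it computes the same function.

```python
import math

def sigmoid(x):
    """Map any real value to the interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)            # safe for large negative x
    return z / (1.0 + z)

def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

An input of 0 maps to a probability of 0.5, which is the usual decision threshold in binary classification.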

E. K-Nearest Neighbors (KNN)
KNN is a lazy learning strategy based on the traditional nearest neighbor algorithm 56. The KNN classifier is a common variant of the nearest neighbor technique that categorizes an unknown sample based on the votes of its k nearest neighbors rather than only the single nearest neighbor. KNN is a supervised ML algorithm 55. As one of the ML classification models, the KNN method can also be used to predict COVID-19 patients. The KNN method employs the following Euclidean distance formula:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{14}$$

Here x and y are two feature vectors with n features each.
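The voting procedure can be sketched as follows. The function names, the default k = 3, and the toy points are assumptions made for illustration; the value of k used in the study is not stated here.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: euclidean_distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because no model is fitted in advance, all distance computations happen at prediction time, which is why the method is described as lazy.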

F. Gaussian Naive Bayes (GNB)
GNB is one of the most straightforward classification algorithms 57, and Naive Bayes (NB) classifiers rely on Bayes' Theorem. These classifiers adopt a strong assumption of feature independence, treating the value of one feature as independent of the value of any other feature. NB classifiers are trained effectively in a supervised learning framework and are simple to create and deploy, making them useful in various real-world scenarios. When working with continuous data, it is common to assume that the values for each class are normally distributed (Gaussian) 58. To classify CXR images, this study employs the GNB algorithm for COVID-19 identification. The feature likelihood estimation can be represented as follows:

$$P(x_i \mid y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp\left( -\frac{(x_i - \mu_y)^2}{2 \sigma_y^2} \right) \tag{15}$$

➢ μ_y and σ_y^2 are the mean and variance of feature x_i for class y.
➢ The exponential term is the Gaussian distribution likelihood term. It shows the probability of a given x_i given y.
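A compact sketch of a Gaussian Naive Bayes classifier built on this likelihood is given below. The function names, the small variance floor `epsilon`, and the use of log-likelihoods to avoid numerical underflow are illustrative assumptions rather than details taken from the study.

```python
import math

def gaussian_likelihood(x, mean, variance):
    """Gaussian density of x for a class with the given mean and variance."""
    return (math.exp(-((x - mean) ** 2) / (2.0 * variance))
            / math.sqrt(2.0 * math.pi * variance))

def gaussian_log_likelihood(x, mean, variance):
    """Logarithm of the density above, computed without underflow."""
    return (-0.5 * math.log(2.0 * math.pi * variance)
            - (x - mean) ** 2 / (2.0 * variance))

def fit_gnb(samples, labels, epsilon=1e-9):
    """Per class: prior, per-feature mean, and per-feature variance."""
    model = {}
    for cls in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == cls]
        columns = list(zip(*rows))
        means = [sum(col) / len(rows) for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / len(rows) + epsilon
                     for col, m in zip(columns, means)]  # epsilon avoids 0
        model[cls] = (len(rows) / len(samples), means, variances)
    return model

def predict_gnb(model, sample):
    """Pick the class with the highest log prior plus summed log-likelihoods."""
    def log_posterior(cls):
        prior, means, variances = model[cls]
        return math.log(prior) + sum(
            gaussian_log_likelihood(x, m, v)
            for x, m, v in zip(sample, means, variances))
    return max(model, key=log_posterior)
```

Summing per-feature log-likelihoods is exactly where the feature independence assumption enters the calculation.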

Evaluation Metrics
A confusion matrix is a critical tool for assessing the performance of ML algorithms, particularly in classification tasks. It presents a complete overview of the algorithm's predictions about the actual labels of the data. This matrix is made up of four main components: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), which enable the computation of critical metrics such as accuracy, precision, recall, and F1 score 59. The confusion matrix aids in understanding the strengths and weaknesses of a model, enabling researchers to make informed decisions for improving its performance. At this stage, the focus is on determining the accuracy of the classifier. Evaluating the classifier's effectiveness involves assessing how well the predicted class labels align with the observed ones.
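The four confusion-matrix counts translate into the metrics named above as in the following sketch. The function name, the returned dictionary layout, and the convention of returning 0.0 when a denominator is zero are assumptions; for a multi-class problem the counts would be taken per class in a one-vs-rest fashion.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, `classification_metrics(40, 45, 10, 5)` describes a classifier that is correct on 85 of 100 samples.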
• Accuracy: Also known as model viability, it indicates the ratio of correct predictions to total predictions:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{16}$$

• Precision: It measures how accurately the classifier assigns samples to a specific category. Class precision is quantified as:

$$Precision = \frac{TP}{TP + FP} \tag{17}$$

• Recall: It indicates the classifier's ability to correctly identify samples belonging to a particular class. Class recall can be calculated as:

$$Recall = \frac{TP}{TP + FN} \tag{18}$$

Results and Discussion

Results and Performance of ML Algorithms
After completing the preprocessing stages and using PCA for feature extraction, resulting in 400 features per CXR image, 70% of the dataset was set aside for training and 30% for testing. Following that, multiple ML techniques, namely DT, RF, SGD, LR, GNB, and KNN, were used to diagnose COVID-19 using the CXR images. The outcomes of these algorithms were assessed, and their performance was evaluated using appropriate metrics such as accuracy, precision, recall, and F1-score. The following subsections present the results obtained using each ML technique, emphasizing the parameters associated with each one.

A. Decision Tree (DT)
DTs are ML models that use rules learned from the data to make decisions. Several parameters are critical for the model's performance and interpretability in COVID-19 diagnosis using CXR images:
• Criterion: The quality of a split is evaluated using entropy, Eq. 7.
• The maximum depth of the tree is 10.
• The minimum number of samples per leaf is set to 2; a node split requires 1 sample.
• Information Gain (IG): Computed using Eq. 8.
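The entropy criterion and the information gain listed above can be illustrated with a short sketch. It assumes the standard Shannon entropy (base 2) and the usual definition of information gain as the parent entropy minus the size-weighted entropy of the child nodes; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted
```

A split that separates the classes perfectly yields an information gain equal to the parent entropy, while a split that leaves the class mix unchanged yields zero.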

B. Random Forest (RF)
Important parameters in the RF approach are used to find features that can distinguish COVID-19 cases from non-COVID-19 cases:
• The number of trees in the forest is 100.
• Data split: 70% for training and 30% for testing.
• The maximum depth of each tree is 10.
• The minimum number of samples required to split a node is 5.
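Tree-based models such as RF commonly score candidate splits with the Gini impurity. The sketch below assumes the standard definition (one minus the sum of squared class proportions) and uses hypothetical function names; it is meant as an illustration, not as the study's implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(children):
    """Size-weighted Gini impurity of the child nodes produced by a split."""
    n = sum(len(child) for child in children)
    return sum(len(child) / n * gini_impurity(child) for child in children)
```

A lower weighted impurity after a split indicates purer child nodes, so a tree would prefer the candidate split with the smallest `weighted_gini` value.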
The relationships between nodes in a DT branch are heavily influenced by these parameters and the Gini index, as shown in Eq. 9. The Gini index computes the Gini impurity from the class distribution and probabilities, which helps forecast the more likely outcome of a given branch.

The LR algorithm assessment metrics for the analysis of COVID-19 using CXR images are given in Table 4. These metrics are critical in measuring the effectiveness and reliability of the LR algorithm for COVID-19 CXR image diagnosis, allowing for a thorough assessment of its performance in accurately detecting COVID-19 cases.

PCA variants, such as Robust PCA, can affect the principal component properties. Despite these differences, PCA transforms data into orthogonal vectors (principal components) in feature space. In future work, we intend to extend our proposed method with deep-feature extractors based on pre-trained models such as GoogLeNet, ResNet, Xception, and others 60,61. These models need larger datasets to extract the best features and to ensure accurate classification using our proposed ML techniques. Furthermore, Vision Transformers can be applied to address the limitations of CNNs, as proposed in 62.

Conclusion
CXR images can aid in detecting COVID-19-related diseases, although their significance is typically overlooked. Using CXR images, this study examined multiple ML algorithms for accurate COVID-19 diagnosis. Histogram equalization was employed to enhance the CXR images, followed by image resizing and feature extraction using PCA. PCA's main advantage is its ability to reduce a large dataset's dimensionality while retaining the most important variance information. Various ML models were employed after completing all preprocessing steps on the CXR images and identifying optimal features. Among the six classification algorithms tested (DT, RF, SGD, LR, GNB, and KNN), DT has the highest weighted average for all metrics, showing higher performance than the other models. DT stood out as the best-performing model, with an accuracy of 88% and solid precision, recall, and F1-score.
Although our dataset is almost balanced, addressing imbalanced data poses a notable hurdle when creating an automated system for classifying medical images. This challenge emerges when there is a substantial discrepancy in sample numbers among the classes, impacting the model's accuracy.

Histogram Equalisation is applied to adjust the image intensities. All images are shrunk from their original (227 × 227) dimensions to a more manageable (20 × 20) to speed up execution. The final stage involves using PCA to extract the most relevant features for optimal classification. After all processes have been performed, the data is split into training (70%) and testing (30%) sets, and several ML algorithms, namely DT, RF, SGD, LR, GNB, and KNN, are used to classify the data.
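The intensity-adjustment step can be illustrated with a minimal histogram equalisation sketch for an 8-bit grayscale image stored as a list of rows. The cumulative-histogram mapping used here is the textbook formulation and is an assumption; the study does not specify which implementation or library was used, and the function name is hypothetical.

```python
def equalize_histogram(image, levels=256):
    """Equalise a grayscale image given as rows of ints in [0, levels - 1]."""
    flat = [p for row in image for p in row]
    # Histogram of intensity values.
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:
        # Constant image: nothing to spread out.
        return [row[:] for row in image]
    # Look-up table mapping old intensities onto the full output range.
    lut = []
    for c in cdf:
        scaled = (c - cdf_min) / (total - cdf_min) * (levels - 1)
        lut.append(max(0, round(scaled)))
    return [[lut[p] for p in row] for row in image]
```

The mapping stretches the occupied intensity range so that the darkest pixel present maps to 0 and the brightest to `levels - 1`, which tends to increase contrast in low-contrast radiographs.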

Figure 1. The methodology diagram illustrating the use of ML algorithms to classify CXR images.

• Class 1: 186 Normal cases.
• Class 2: 189 cases of Bacterial Pneumonia.
• Class 3: 173 cases of Viral Pneumonia.
• Class 4: 187 confirmed Tuberculosis cases.
All CXR images have a uniform size of (227 × 227) across the five classes, ensuring consistency throughout the dataset. With this relatively balanced CXR image dataset, this study classifies COVID-19 using multiple ML models. An almost balanced dataset benefits the accuracy and effectiveness of COVID-19 classification across the different ML models. Fig. 2 shows a sample CXR image from each dataset class: (A) COVID-19, (B) Normal, (C) Pneumonia-Bacterial, (D) Pneumonia-Viral, and (E) Tuberculosis.

Figure 2. A selection of CXR images from each dataset class.

Figure 3. CXR image following histogram equalization.

3. Reduce or reshape the size of the image generated from the previous stages using the following Eq. 3:

Figure 4. A sample CXR image undergoing the three preprocessing stages.

Figure 5. A Comprehensive Primer on Principal Component Analysis.

Figure 6. The flowchart of the ML algorithms used to classify CXR images.

In Eq. 15:
➢ Normalisation, $\frac{1}{\sqrt{2\pi\sigma_y^2}}$: the reciprocal of the standard deviation times the square root of 2π; it ensures that the total probability integrates to 1.
➢ $P(x_i \mid y)$: the likelihood of the feature value $x_i$ given the class $y$.

From the above results, it can be observed that the DT algorithm has the highest accuracy (0.88) and F1-score (0.88) among the evaluated algorithms. It also demonstrates high precision and recall for Class 1, indicating good performance in correctly identifying COVID-19 cases. The LR and SGD algorithms also show competitive results with relatively high accuracy and F1-scores. On the other hand, the KNN and RF algorithms exhibit lower accuracy and F1-scores than the other algorithms. RF performs poorly, especially regarding precision and recall for Class 1. It is important to note that the quality and size of the dataset, the feature extraction approach, hyperparameter tuning, and the unique properties of the COVID-19 CXR images used can all impact the success of these algorithms. Further tuning and experimentation may be required to increase the algorithms' accuracy in diagnosing COVID-19. Fig. 7 compares the performance of these algorithms based on the specified evaluation metrics. A new Deep Learning (DL) model for COVID-19 diagnosis built with similar tools could improve on these results. Using CXR images to develop a DL model for diagnosing COVID-19 is a promising strategy for improving speed and accuracy.

Figure 7. Summary of the evaluation of the ML algorithms in terms of accuracy, precision, recall, and F1-score.

The outcome analysis shows that the ML models exhibit differing performance levels. Using balanced criteria for precision, recall, and F1-score, RF reached an accuracy of 45%, and SGD reached an accuracy of 70%.

Class imbalance biases a model toward the majority class at the expense of accurately categorizing the minority class. Additionally, optimizing preprocessing methods with recently updated PCA extraction features should enhance accuracy in classifying COVID-19 in glossy CXR images. To improve the performance of CXR image classification, future work will involve deep-feature extractors, pre-trained models (such as GoogLeNet, ResNet, Xception, etc.), larger datasets, and specific transfer learning approaches. Using similar techniques, future research can investigate DL models for COVID-19 diagnosis. Alternative and updated PCA versions or other datasets could also be investigated.
• PCA Feature Extraction: Extracts the most informative features from the scaled images.
• Dimensionality Reduction: Images are reduced from two dimensions to one dimension while keeping the PCA information, to speed up training and reduce hardware requirements.
• Training and Testing ML Models: The prepared CXR images are used to train and test the models.
• Evaluation of Effective ML Classifiers: The study evaluates ML classifiers that can identify COVID-19 cases among five categories using CXR images.
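The PCA feature extraction and dimensionality reduction steps listed above can be sketched with NumPy. This is a minimal illustration computed through the singular value decomposition of the mean-centred data; the function name `pca_fit_transform`, the random placeholder data, and the component count in the usage note are assumptions, not the study's settings.

```python
import numpy as np


def pca_fit_transform(X, n_components):
    """Project rows of X onto the first n_components principal components.

    X has one flattened image per row (e.g. a 20 x 20 image becomes 400 values).
    Returns the projected data, the component vectors, and their variances.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                                   # centre each pixel/feature
    # Rows of Vt are orthonormal principal directions, ordered by variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S[:n_components] ** 2) / (X.shape[0] - 1)
    return Xc @ components.T, components, explained_variance
```

For example, a batch of images can be flattened with `images.reshape(len(images), -1)` before calling `pca_fit_transform(flat, n_components=10)`; the returned components are mutually orthogonal, and the explained variances are sorted in decreasing order.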

In Eq. 13:
➢ $n$ is the number of data points.
➢ $y_i$ represents the actual (observed) value for the i-th data point.
➢ $\hat{y}_i$ represents the predicted value for the i-th data point.

Table 1 details the effectiveness and reliability of the DT algorithm for COVID-19 CXR image diagnosis. It provides information about the model's capacity to diagnose COVID-19 occurrences and aids in calculating critical metrics, including accuracy, precision, recall, and F1-score.

Table 7. Comparing Previous Works Using Chest X-Ray Images for COVID-19 Detection.
Table 7 compares prior studies that used CXR images for COVID-19 identification. Adjusting CXR image quality, dataset sizes, and preprocessing approaches may improve COVID-19 diagnosis accuracy. These research directions aim to improve COVID-19 diagnostics using ML, considering image quality, dataset features, and preprocessing methods. In addition to traditional PCA, investigating other CXR image feature extraction methods is crucial to improving diagnostic procedures. While traditional PCA was used for dimensionality reduction in this work, it is essential to note that the choice of PCA variations, such as sparse, kernel, Incremental PCA (IPCA), or