COVID-19 anomaly detection and classification method based on supervised machine learning of chest X-ray images

COVID-19, an abbreviation of coronavirus disease 2019, is a global pandemic that threatens the lives of millions of people. Early detection of the disease improves the chances of recovery and helps prevent its spread. This paper proposes a method for the classification and early detection of COVID-19 through image processing of chest X-ray images. The method applies a set of procedures: preprocessing (image noise removal, image thresholding, and morphological operations), Region of Interest (ROI) detection and segmentation, feature extraction (Local Binary Pattern (LBP), Histogram of Oriented Gradients (HOG), and Haralick texture features), and classification (K-Nearest Neighbor (KNN) and Support Vector Machine (SVM)). Combining the feature extraction operators and classifiers results in six models, namely LBP-KNN, HOG-KNN, Haralick-KNN, LBP-SVM, HOG-SVM, and Haralick-SVM. The six models are evaluated on 5,000 images using 5-fold cross-validation. The evaluation results show high diagnosis accuracy, from 89.2% up to 98.66%. The LBP-KNN model outperforms the other models, achieving an average accuracy of 98.66%, a sensitivity of 97.76%, a specificity of 100%, and a precision of 100%. The results indicate that the proposed method for early detection and classification of COVID-19 from X-ray images is usable, as it provides an end-to-end structure without the need for manual feature extraction and manual selection methods.


Introduction
The new Coronavirus 2019 (COVID-19) pandemic first appeared in Wuhan, China, in 2019 and spread rapidly, posing a critical public health problem to the entire world [1]. COVID-19 results in mild symptoms in about 82% of cases, while the remaining cases are severe or critical [2,3]. As of 23 September 2021, the World Health Organization (WHO) reported 229,373,963 confirmed COVID-19 cases worldwide, including 4,705,111 deaths [4]. Fig. 1 shows the distribution of COVID-19 diagnosed cases worldwide.
The virus behind the COVID-19 pandemic is severe acute respiratory syndrome coronavirus 2, also called SARS-CoV-2. Most infected patients survive the virus, while a smaller percentage develop serious or critical conditions [5,6]. The increase in the number of infected people leads to an increased need for intensive care. This growing demand places a workload on healthcare systems that can lead to their collapse even in the most developed countries. When intensive care units (ICUs) are full, the health status of COVID-19 patients deteriorates and the death rate increases. Some researchers use medical images such as X-rays or Computed Tomography (CT) scans to search for characteristic signs of the novel coronavirus [2,7].
The COVID-19 pandemic has led to huge financial losses worldwide, with a massive impact on world GDP growth [1]. The resulting global recession is the most severe since the end of World War II: according to the April 2021 World Economic Outlook report published by the IMF, the global economy contracted by 3.5% in 2020, a 7% loss relative to the 3.4% growth forecast of October 2019.
While virtually every country reported by the IMF posted negative growth in 2020, the downturn was more pronounced in the poorest parts of the world [8].
Some recent studies employ chest radiography in epidemic regions for COVID-19 testing [1,3]. They found that the examination of radiographic images could be an alternative to the PCR scheme, as it shows a higher sensitivity in some cases [9]. Xu et al. [10] introduced a deep learning system to screen for COVID-19 pneumonia. Their method aims to build an early screening model that distinguishes COVID-19 pneumonia from Influenza-A viral pneumonia and healthy cases using lung section images and deep learning methods [8]. The algorithm segments candidate infection areas from a set of pulmonary CT images using a three-dimensional deep learning technique [6]. Experiments on the benchmark dataset show an overall accuracy of 86.7% across all CT cases.
Sethy and Behera [9] proposed an algorithm for the detection of COVID-19 based on deep features. The deep features are extracted from a pre-trained CNN model and fed individually to an SVM classifier. The proposed classification scheme obtained an accuracy of 95.38%.
Ozturk et al. [11] presented a model based on deep learning techniques to detect and classify COVID-19 conditions from X-ray images. The model is completely automated and based on an end-to-end structure. In addition, it performs binary and multi-class classification with accuracy values of 98.08% and 87.02%, respectively. Subsequently, Narin et al. [12] proposed three CNN-based models, Inception ResNetV2, InceptionV3, and ResNet50, for COVID-19 detection from chest X-ray images. The performance results show that the pre-trained ResNet50 model obtained the highest accuracy of 98%. The major limitation of these studies is that the proposed methods/models are applied to a small number of COVID-19 X-ray images, which does not support robust results.
This paper proposes a method for classification and early detection of COVID-19 through image processing using X-ray images.

Methods and Materials
The COVID-19 X-Ray dataset
This work uses 5,000 chest X-ray images of normal and COVID-19 pneumonia cases obtained from the open-source GitHub repository shared by Cohen et al. [13], namely "Chest X-Ray Images (Pneumonia)". This repository provides chest X-ray/CT images of patients with COVID-19 along with other diseases. Samples of the selected chest X-ray images of the normal and COVID-19 pneumonia sets are shown in Fig. 2 [5,13].

Otsu's thresholding
Otsu's thresholding method converts a grayscale image to a binary image by means of histogram-based image thresholding [11]. The method assumes that the image has a bi-modal histogram (foreground and background) and selects the optimal threshold that separates the two classes.
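The threshold selection can be illustrated with a minimal sketch. It assumes 8-bit gray levels and the usual formulation of Otsu's method as maximizing the between-class variance; the function names `otsu_threshold` and `binarize` are illustrative and are not taken from the original implementation, and plain Python lists are used only to keep the example self-contained.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(levels):
        w_bg += hist[t]                 # pixels with value <= t (background)
        if w_bg == 0:
            continue
        w_fg = total - w_bg             # pixels with value > t (foreground)
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:          # keep the first maximizer
            best_var, best_t = between, t
    return best_t


def binarize(image, threshold):
    """Map a 2-D grayscale image (list of rows) to 0/1 values."""
    return [[1 if p > threshold else 0 for p in row] for row in image]


# Toy image with a dark and a bright row (hypothetical values).
image = [[10, 12, 11], [200, 210, 205]]
t = otsu_threshold([p for row in image for p in row])
binary = binarize(image, t)
```

On a clearly bi-modal image such as this toy example, any threshold between the two modes yields the same binary image, which is why the bi-modality assumption matters.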

Morphology operations
Mathematical morphology (MM) aims to extract image components that are useful for representing and describing the shape of a region, such as skeletons, convex hulls, and boundaries. In addition, morphological techniques are used for pre- or post-processing, for example, morphological filtering by reconstruction, thinning, and pruning [9]. Generally, morphological operations concentrate on binary images. They are logical transformations based on comparing pixel neighborhoods with a predefined pattern.
Morphological Dilation: The dilation operation employs a structuring element, also called a "kernel", to probe and expand the shapes [12]. Applying the structuring element S to image A yields a new image I,
$$I = A \oplus S$$
Opening: The image opening operation combines erosion and dilation operations using intersection and complementation [9]. The conditioned dilation (opening) begins by producing a matrix X 0 of zeros whose size equals that of A.
The iteration proceeds until the final step, at which X(i) contains all the filled holes.

Morphological Closing:
The morphological closing applies dilation followed by erosion using the same structuring element [9]. The closing operation is,

$$A \bullet S = (A \oplus S) \ominus S \qquad (4)$$
Morphological Erosion: The morphological erosion operation aims to shrink the shapes in the image. The output of the erosion operation is an image I,
$$I = A \ominus S$$
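The four operations can be sketched on small binary images as follows. This is an illustrative sketch, not the authors' implementation: it assumes a symmetric structuring element given as a list of (row, column) offsets, treats pixels outside the image as background, and uses the hypothetical names `dilate`, `erode`, `opening`, `closing`, and `SQUARE_3X3`.

```python
# 3x3 square structuring element as (row, col) offsets (an assumption).
SQUARE_3X3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def dilate(image, offsets):
    """Binary dilation: a pixel is set if any pixel under the element is set."""
    h, w = len(image), len(image[0])
    return [[int(any(0 <= r + dr < h and 0 <= c + dc < w and image[r + dr][c + dc]
                     for dr, dc in offsets))
             for c in range(w)] for r in range(h)]


def erode(image, offsets):
    """Binary erosion: a pixel survives only if every pixel under the element is set."""
    h, w = len(image), len(image[0])
    return [[int(all(0 <= r + dr < h and 0 <= c + dc < w and image[r + dr][c + dc]
                     for dr, dc in offsets))
             for c in range(w)] for r in range(h)]


def opening(image, offsets):
    """Erosion followed by dilation: removes small isolated regions."""
    return dilate(erode(image, offsets), offsets)


def closing(image, offsets):
    """Dilation followed by erosion, as in Eq. (4): fills small holes."""
    return erode(dilate(image, offsets), offsets)
```

For example, opening a 3 × 3 blob that has a single stray pixel elsewhere in the image removes the stray pixel and restores the blob, while closing a 3 × 3 blob with a one-pixel hole fills the hole.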

Midpoint ellipse drawing algorithm
An ellipse can be drawn as a scaled circle, with a shorter radius along one axis and a longer radius along the other. Several methods can be used to draw an ellipse, and the midpoint ellipse algorithm is one of them. The algorithm first determines the ellipse points about the origin and then shifts them to the center point [14,15]. Fig. 3(a) illustrates the 4-way symmetry of the ellipse. The approach is similar to the scheme used to rasterize a circle. The first quadrant of the ellipse is split into two regions. Fig. 3(b) shows the division of the first quadrant according to the slope of an ellipse with Rx < Ry. As the ellipse is drawn from 90 to 0 degrees, x moves in the positive direction, y moves in the negative direction, and the ellipse passes through the two regions.
When processing the first quadrant, the algorithm takes unit steps in the x-direction where the magnitude of the curve slope is less than 1 (the first region) and unit steps in the y-direction where the magnitude of the curve slope is greater than 1 (the second region). Similar to the circle function, the ellipse function is,
$$f_{ellipse}(x, y) = R_y^2 x^2 + R_x^2 y^2 - R_x^2 R_y^2$$
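The two-region traversal can be sketched as below. This follows the common textbook form of the midpoint ellipse algorithm with its usual decision parameters; it is an illustrative sketch under that assumption, and the function name `midpoint_ellipse` and its parameters are hypothetical rather than taken from the original implementation.

```python
def midpoint_ellipse(rx, ry, xc=0, yc=0):
    """Return the set of integer points on an axis-aligned ellipse outline."""
    points = set()

    def plot4(x, y):
        # 4-way symmetry: mirror the first-quadrant point into all quadrants
        # and shift it from the origin to the center (xc, yc).
        points.update({(xc + x, yc + y), (xc - x, yc + y),
                       (xc + x, yc - y), (xc - x, yc - y)})

    x, y = 0, ry
    rx2, ry2 = rx * rx, ry * ry
    dx, dy = 2 * ry2 * x, 2 * rx2 * y

    # Region 1: slope magnitude below 1, unit steps in x.
    p1 = ry2 - rx2 * ry + 0.25 * rx2
    while dx < dy:
        plot4(x, y)
        x += 1
        dx += 2 * ry2
        if p1 < 0:
            p1 += dx + ry2
        else:
            y -= 1
            dy -= 2 * rx2
            p1 += dx - dy + ry2

    # Region 2: slope magnitude above 1, unit steps in y.
    p2 = ry2 * (x + 0.5) ** 2 + rx2 * (y - 1) ** 2 - rx2 * ry2
    while y >= 0:
        plot4(x, y)
        y -= 1
        dy -= 2 * rx2
        if p2 > 0:
            p2 += rx2 - dy
        else:
            x += 1
            dx += 2 * ry2
            p2 += dx - dy + rx2
    return points


outline = midpoint_ellipse(4, 3)
```

Only the first quadrant is computed; the remaining three quadrants come from the symmetry step, which is the main saving of the method.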

Feature extraction
Local binary pattern
Local Binary Pattern (LBP) is one of the well-known image feature extraction operators adopted in many real-world applications [16]. LBP is a simple yet effective texture extraction operator. Its low computational complexity enables it to work in complicated and real-time image processing applications. It unifies the traditional structural and statistical models of texture analysis. It examines the neighborhood of each pixel of an image and labels the pixel with a binary number. Given a pixel at (x c , y c ), the LBP can be expressed in decimal form by Eq. (7):
$$LBP_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(i_p - i_c)\, 2^p \qquad (7)$$
where i c and i p are respectively the gray-level values of the central pixel and of the P surrounding pixels in the circular neighborhood of radius R. The function s(x) is defined in Eq. (8) as follows:
$$s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (8)$$
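The labeling step can be sketched for the simplest configuration, P = 8 neighbors at radius R = 1 taken directly from the 3 × 3 window without interpolation. The neighbor ordering and the names `NEIGHBORS`, `lbp_code`, and `lbp_histogram` are assumptions made for illustration. The sketch produces the plain 256-bin histogram, not a reduced uniform-pattern variant.

```python
# Neighbor offsets (row, col) for P = 8, R = 1, in the order p = 0..7.
# The starting neighbor and direction are an arbitrary choice here.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
             (1, 1), (1, 0), (1, -1), (0, -1)]


def s(x):
    """Step function: 1 when the neighbor is at least as bright as the center."""
    return 1 if x >= 0 else 0


def lbp_code(image, r, c):
    """Decimal LBP label of the pixel at row r, column c."""
    i_c = image[r][c]
    return sum(s(image[r + dr][c + dc] - i_c) << p
               for p, (dr, dc) in enumerate(NEIGHBORS))


def lbp_histogram(image):
    """256-bin histogram of LBP labels over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist
```

The histogram, rather than the per-pixel labels, is what would typically be passed to a classifier as the texture feature vector.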

HOG algorithm
To extract features, the HOG algorithm includes two main stages [15]. The first stage is the extraction of the histogram of oriented gradients. The gradient direction and magnitude are computed for each pixel in the input image. These are used to produce an angular histogram of gradients that serves as an image texture feature vector. The horizontal and vertical components of the gradient of the image I (i, j) are the derivatives at pixel (i, j), denoted G i (i, j) and G j (i, j), respectively, and the gradient magnitude and direction at each pixel are computed from these two derivatives.
Fig. 2. Sample of the Chest X-Ray Images (Normal and Pneumonia) [5,13].
Fig. 3. Midpoint Ellipse Method [14].
The second stage is the construction of the HOG descriptor from the gradient of the image. First, the whole image is split into blocks of size 8 × 8. The gradient direction range [-π/2, π/2] is divided uniformly into nine direction intervals (bins). To make the feature vector robust to brightness changes, the HOG features are normalized by dividing each bin by the total of the histogram [15].
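The per-block histogram can be sketched as follows. The sketch assumes central differences for the two derivatives, magnitude-weighted voting without interpolation between bins, and normalization by the histogram total; these are common choices rather than details confirmed by the text, and the name `hog_cell_histogram` is hypothetical.

```python
import math


def hog_cell_histogram(cell, bins=9):
    """Normalized orientation histogram over [-pi/2, pi/2] for one block."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = math.pi / bins
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            g_h = cell[i][j + 1] - cell[i][j - 1]   # horizontal derivative
            g_v = cell[i + 1][j] - cell[i - 1][j]   # vertical derivative
            magnitude = math.hypot(g_h, g_v)
            if magnitude == 0:
                continue
            # atan maps into (-pi/2, pi/2); a purely vertical gradient gets pi/2.
            angle = math.atan(g_v / g_h) if g_h != 0 else math.pi / 2
            index = min(int((angle + math.pi / 2) / bin_width), bins - 1)
            hist[index] += magnitude
    total = sum(hist)
    # Dividing by the total makes the vector insensitive to global contrast.
    return [v / total for v in hist] if total > 0 else hist


# A horizontal intensity ramp: every gradient points along the row direction.
ramp = [[j for j in range(4)] for _ in range(4)]
ramp_hist = hog_cell_histogram(ramp)
```

Concatenating the histograms of all blocks would give the full descriptor; the block size and any overlap between blocks are left outside this sketch.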

Haralick texture features (Second Order)
The Gray Level Co-occurrence Matrix (GLCM) represents the interdependence of gray levels and their spatial distribution within a local area [17]. The Haralick operator calculates Haralick features from the statistical distribution of the GLCM, in which pairs of pixels are considered, making the features second-order. It captures relations between the positions of pixels in an image [15,18]. A quantization process is applied to the image before calculating the co-occurrence matrix. Contrast measures the variation between the gray level of the neighboring pixels and that of the reference pixel, as in Eq. (13). Homogeneity describes how closely the distribution of the GLCM elements concentrates around the GLCM diagonal.
Entropy measures the randomness or disorder of the image, as formulated in Eq. (15).
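The three measures can be sketched on a small quantized image. The sketch assumes the commonly used definitions: contrast as the sum of (i - j)² p(i, j), homogeneity as the sum of p(i, j) / (1 + (i - j)²), and entropy as the negative sum of p(i, j) log₂ p(i, j). The exact forms used in Eqs. (13) to (15) may differ, for instance in the logarithm base. The names `glcm` and `haralick_subset` and the horizontal offset are illustrative choices.

```python
import math


def glcm(image, levels, dr=0, dc=1):
    """Normalized, symmetric co-occurrence matrix for the pixel offset (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(image), len(image[0])
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < h and 0 <= c2 < w:
                a, b = image[r][c], image[r2][c2]
                counts[a][b] += 1
                counts[b][a] += 1       # count each pair in both directions
    total = sum(map(sum, counts))
    return [[n / total for n in row] for row in counts]


def haralick_subset(p):
    """Contrast, homogeneity, and entropy of a normalized GLCM p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    homogeneity = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)
    entropy = -sum(v * math.log2(v) for _, _, v in cells if v > 0)
    return contrast, homogeneity, entropy


# Image already quantized to two gray levels (hypothetical values).
stripes = [[0, 1], [0, 1]]
features = haralick_subset(glcm(stripes, levels=2))
```

A perfectly uniform image concentrates all mass on the diagonal, so its contrast and entropy are zero and its homogeneity is one; texture moves mass off the diagonal.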

Classification methods
Classification algorithms include naive Bayes, KNN, neural networks, decision trees, SVM, etc. They are used to predict the class labels of unseen data. KNN and SVM are selected in this work to construct the classification models.

K-Nearest Neighbor (KNN):
The KNN algorithm uses the standard Euclidean distance to measure the difference between a training instance and the test instance [19]. "K" indicates the number of closest neighbors that help predict the class of the test pattern [20]. The standard Euclidean distance d (x, y) between two feature vectors of dimension n is determined as follows:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (16)$$
Additionally, KNN assigns the test instance the most common class among its K nearest neighbors, as determined in Eq. (17):
$$c(x) = \arg\max_{c \in C} \sum_{i=1}^{k} \delta(c, c(y_i)) \qquad (17)$$
where y 1 , y 2 , y 3 , …, y k represent the k nearest neighbors of a specific instance x of the test data set, k is the number of neighbors, C represents the finite set of class labels, and δ (c, c(y i )) = 1 if c = c(y i ) and δ(c, c(y i )) = 0 otherwise [21].
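The distance computation and the majority vote can be sketched in a few lines. The function names `euclidean` and `knn_predict`, the toy feature vectors, and the label strings are hypothetical and serve only to illustrate the voting rule.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Standard Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_x, train_y, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy two-dimensional features forming two well-separated groups.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["normal", "normal", "normal", "covid", "covid", "covid"]
label = knn_predict(train_x, train_y, (0.2, 0.1), k=3)
```

Choosing an odd k in a two-class problem avoids tied votes, which is one reason small odd values are common in practice.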
Support Vector Machine (SVM): SVM is a supervised machine learning method based on finding the hyperplane that separates linearly separable classes with the maximum margin. The kernel function maps the data points into a high-dimensional feature space in which they become linearly separable. The selection of the model and of the kernel function parameters directly affects the SVM learning results [22,23]. The parameter σ 2 determines the generalization ability of the kernel function [9]. The Gaussian Kernel (GK) is a widely used kernel function in kernel schemes. Its feature space has infinite dimension, so data that cannot be linearly separated in a low dimension can be separated in a higher dimension, which the GK specifies as in (18):
$$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right) \qquad (18)$$
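The kernel itself is simple to evaluate. The sketch below assumes the usual Gaussian form exp(-||x - y||² / (2σ²)); the names `gaussian_kernel` and `gram_matrix` and the default σ = 1 are illustrative. Training the SVM on top of the kernel is outside the scope of this sketch.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel value for two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def gram_matrix(samples, sigma=1.0):
    """Pairwise kernel values, the input a kernel SVM solver works with."""
    return [[gaussian_kernel(a, b, sigma) for b in samples] for a in samples]


samples = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
K = gram_matrix(samples, sigma=1.0)
```

A small σ makes the kernel value drop quickly with distance, so each sample influences only its close neighbors; a large σ gives a smoother decision boundary.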

Implementation and results
The proposed method relies on image processing, performing a set of procedures that give a preliminary diagnosis of COVID-19 patients from X-ray images [11,12,24,25]. Combining the feature operators LBP, HOG, and Haralick with the SVM and KNN classifiers yields six models: LBP-KNN, HOG-KNN, Haralick-KNN, LBP-SVM, HOG-SVM, and Haralick-SVM. Fig. 4 shows the overall proposed anomaly detection and classification model of COVID-19 based on chest X-ray images. The implementation applies several steps to the whole image to specify the ROI and then extracts multiple features from the chest X-ray images. Subsequently, two classification models are built for detecting abnormal COVID-19 cases. The classification models consist of training and testing stages, and the trained models can be used to handle new cases.
The key steps of the proposed COVID-19 diagnosis method are shown in Fig. 4.
The preprocessing of the images consists of several steps for finding the region of interest (ROI). The first step converts the color image into a grayscale image and applies a median filter to all dataset images to remove the noise present in the image. Then, the grayscale images are converted into binary images using Otsu's thresholding, an efficient and common method that separates the foreground from the background by minimizing the intra-class intensity variance and maximizing the inter-class intensity variance. The last preprocessing step is the morphological opening operation, which consists of erosion followed by dilation. It refines the binary image from the previous step by removing small regions, which are considered unimportant areas or noise, while keeping the useful areas for processing in the next steps.
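The noise-removal step can be sketched with a small median filter. The window radius, the border handling (the window is simply clipped at the image edge), and the name `median_filter` are assumptions for illustration; the text does not state the filter size used.

```python
from statistics import median_low


def median_filter(image, radius=1):
    """Replace each pixel by the median of its (2*radius+1)^2 neighborhood."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(h):
        for c in range(w):
            # Clip the window at the borders instead of padding the image.
            window = [image[rr][cc]
                      for rr in range(max(0, r - radius), min(h, r + radius + 1))
                      for cc in range(max(0, c - radius), min(w, c + radius + 1))]
            # median_low always returns an existing pixel value.
            out[r][c] = median_low(window)
    return out


# A flat patch with one impulse-noise pixel (hypothetical values).
patch = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
denoised = median_filter(patch)
```

Unlike a mean filter, the median discards isolated outliers entirely instead of spreading them over the neighborhood, which preserves the edges that the later thresholding step depends on.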
The method draws an ellipse using the midpoint ellipse algorithm to crop the ROI. The process traces the two end points of the line in the lower area of the image: the right point (X1, Y1) and the left point (X2, Y2). The middle point (X, Y) between them is the center of the ellipse, and the distance between (X1, Y1) and (X, Y), or equivalently between (X2, Y2) and (X, Y), gives the first radius rx. Then, candidate values of ry from 0 to the height of the image are evaluated, and ry is chosen according to the proportion of white to black pixels, which is calculated from the pixel values of the points as in Eq. (16,17). This process ensures that the shape contains the lungs for testing, as shown in Fig. 5. The cropped region represents the lung area and is used later for feature extraction.
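The geometry of the cropping step can be sketched as follows. The sketch derives the center and rx from the two baseline points as described, but takes ry as a given input because the selection rule for ry is not specified precisely enough to reproduce. The names `ellipse_from_baseline`, `ellipse_mask`, and `apply_mask` are hypothetical, and a filled mask computed from the ellipse equation stands in for the traced outline.

```python
import math


def ellipse_from_baseline(x1, y1, x2, y2):
    """Center (xc, yc) and horizontal radius rx from the two baseline points."""
    xc, yc = (x1 + x2) / 2, (y1 + y2) / 2
    rx = math.hypot(x1 - xc, y1 - yc)
    return xc, yc, rx


def ellipse_mask(height, width, xc, yc, rx, ry):
    """Binary mask that is 1 inside or on the ellipse and 0 outside."""
    return [[1 if ((x - xc) / rx) ** 2 + ((y - yc) / ry) ** 2 <= 1 else 0
             for x in range(width)]
            for y in range(height)]


def apply_mask(image, mask):
    """Keep the pixels under the mask and set everything else to 0."""
    return [[p if m else 0 for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


# Baseline from (0, 4) to (8, 4) in a 9 x 9 image; ry = 2 is an arbitrary choice.
xc, yc, rx = ellipse_from_baseline(0, 4, 8, 4)
mask = ellipse_mask(9, 9, xc, yc, rx, ry=2)
roi = apply_mask([[7] * 9 for _ in range(9)], mask)
```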
In the next step, the bag-of-words process is used to standardize the length of the vectors generated for the classification process. From the previous step, all 5,000 images are processed by a morphological operation (closing) to produce enhanced binary images. These are input to the midpoint ellipse cropping algorithm, in which the midpoints are extracted from the left and right coordinates. Then the tracing direction changes towards the top, as shown in the red area of Fig. 5. Finally, a black-and-white mask (binary image) is applied to the original image to extract the ROI.
In the proposed method, three types of feature extraction operators extract the discriminative properties of the ROI, in which the size of the cell is 128 × 128. LBP produces 59 features, HOG produces 104 features, and Haralick produces an unspecified number of important points for each image, as shown in Fig. 6(a-c). The features extracted in this step are used to build the classification model for COVID-19 disease. All the features are used for training and testing the KNN and SVM classifiers [26,27,28].
Ultimately, the combinations of the classifiers and feature extraction operators produce six models, namely LBP-KNN, HOG-KNN, Haralick-KNN, LBP-SVM, HOG-SVM, and Haralick-SVM. To produce robust results, the tests use 5-fold cross-validation under multiple training percentages (50%, 60%, 70%, 80%, and 90%). The evaluation considers comprehensive criteria derived from the confusion matrix, including accuracy, sensitivity, specificity, precision, prevalence, error rate, and false-positive rate, as shown in Table 1 and Table 2. Table 1 shows the 5-fold cross-validation evaluation results of the three KNN-based classification models, while Table 2 shows the evaluation results of the three SVM-based classification models.
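The evaluation criteria follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions and treats the COVID-19 class as positive; the function names and the example counts are hypothetical and do not reproduce the values in Table 1 or Table 2. The fold splitter uses contiguous folds without shuffling, which is a simplification.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard criteria derived from true/false positive/negative counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),          # true-positive rate
        "specificity": tn / (tn + fp),          # true-negative rate
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
        "prevalence": (tp + fn) / total,        # share of actual positives
    }


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Hypothetical counts for one fold.
metrics = confusion_metrics(tp=40, fp=0, tn=50, fn=10)
folds = list(k_fold_indices(10, k=5))
```

Note that a specificity of 100% and a precision of 100% both follow from zero false positives, so these two figures always move together in that case.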
Table 1 and Fig. 7 present the classification results of the KNN-based models. Subsequently, in the classification results of the SVM for the LBP, HOG, and Haralick features, the Haralick-SVM model outperforms the other two models, with an average 5-fold accuracy of 94.88%, as shown in Fig. 7(b). Moreover, it has the highest sensitivity of 99.96%, the highest specificity of 89.23%, the highest precision of 91.2%, the lowest error rate of 5.13%, and the lowest false-positive rate of 10.77%. The HOG-SVM comes second, slightly lower than the Haralick-SVM, with an average 5-fold accuracy of 94.26%, and the HOG-KNN performs relatively lower than both of them, with an average 5-fold accuracy of 89.2%. The average prevalence of Haralick-SVM is slightly better than that of the other five models.
The results validate the ability of the proposed method to detect and classify COVID-19 early through image processing of X-ray images [2,3,8]. The combinations of the feature extraction operators and classifiers yield six models, namely LBP-KNN, HOG-KNN, Haralick-KNN, LBP-SVM, HOG-SVM, and Haralick-SVM, evaluated on 5,000 X-ray images. The results further show that chest X-ray images are among the best means for the detection of COVID-19 [7,11,12,13]. However, a limitation of this work is the limited number of available X-ray image samples, so the models have not been tested on big data for further verification of the research findings [29,30,31].

Conclusion
The test results of the proposed method, which classifies X-ray images of COVID-19 patients, show that the disease can be detected from X-ray images by training machine learning algorithms on an image dataset. A set of about 5,000 X-ray images of normal and abnormal cases, taken from the Kaggle website, is tested using different groups of random samples drawn from the total images over a number of iterations with different training sizes, as explained before.
The development methodology of this work includes preprocessing, segmentation, feature extraction, and classification. The preprocessing includes image noise removal, image thresholding, and morphological operations. The segmentation is performed by Region of Interest (ROI) detection. The feature extraction uses multiple operators: Local Binary Pattern (LBP), Histogram of Oriented Gradients (HOG), and Haralick features. Finally, classification is performed by K-Nearest Neighbor (KNN) and Support Vector Machine (SVM). Subsequently, a combination of six models, LBP-KNN, HOG-KNN, Haralick-KNN, LBP-SVM, HOG-SVM, and Haralick-SVM, is proposed. The six models are tested, and their accuracy, error rate, sensitivity, false-positive rate, specificity, precision, and prevalence are calculated. The obtained results are relatively high, with average diagnosis accuracies of all tested cases between 89.2% and 98.66%. The LBP-KNN model outperforms the other models, achieving an average accuracy of 98.66%, a sensitivity of 97.76%, a specificity of 100%, a precision of 100%, an error rate of 1.34%, and zero false positives. The use of more than one method of feature extraction and classification confirms and validates the results. Future work includes using other combinations of feature extraction and classification operators, such as the Gabor filter and random forest, and designing and testing the proposed system on real devices such as the radiographic thorax.
A set of procedures is applied in constructing the COVID-19 detection model, including preprocessing (image noise removal, image thresholding, and morphological operations), Region of Interest (ROI) detection, and feature extraction using multiple methods, namely Local Binary Pattern (LBP), Histogram of Oriented Gradients (HOG), and Haralick texture features. In the classification stage, K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) are used with 5-fold cross-validation for the region of interest. The contributions of our study are summarized as follows:

Fig. 6. Feature extraction samples of the three methods.

Table 1. The evaluation results of the KNN-based classification models.