1 Introduction

COVID-19 is the infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease first appeared in December 2019 in the Chinese city of Wuhan, Hubei Province [1], and the virus then spread rapidly outside of China.

Chest radiography is an important imaging technique that earlier research has shown to be useful in predicting COVID-19 infection in patients [2]. Deep learning is one of the most effective tools in medical science; such approaches can deliver fast and accurate results for the diagnosis and prognosis of many diseases [3]. Trained models can categorize inputs into whatever classes a task requires. Deep learning models are frequently employed in medicine for detecting cardiac abnormalities, localizing tumours through image analysis, diagnosing cancer, and many other applications [4, 5]. A number of review studies examine and compare several models to identify which is the most accurate and sensitive for COVID-19. According to these publications, feature-extracting models with various network designs provide the highest overall accuracy on chest CT scan data across a range of datasets. Such models are frequently used, for example, to classify CT scan images of individuals as COVID-19 positive, negative, or non-infected. Several studies in the literature use artificial intelligence approaches, particularly deep learning models, to diagnose COVID-19 from X-ray and CT scans [6,7,8,9,10,11].

Artificial neural network-based techniques, in which discriminative features of COVID-19 are learned directly from imaging data, have demonstrated satisfactory performance in many medical image processing and computer vision applications, helped by the rapid growth of deep neural networks [12,13,14]. Researchers have adopted deep learning methods for tasks such as detection [15, 16] and segmentation [2, 17], providing more accurate results for COVID-19 diagnosis and classification. Among the studies on COVID-19 detection in the literature, Zheng et al. (2020) used 3D CT images for the diagnosis of COVID-19 [18]. They developed an application called DeCoVNet using the DR architecture; 499 CT images were used for training and 131 CT images for testing, and they predicted COVID-19 with an accuracy of 0.90. Ucar and Korkmaz (2020) developed an application they called COVIDiagnosis-Net, based on SqueezeNet tuned with Bayesian optimization, to diagnose COVID-19 [19]. Using a chest X-ray dataset obtained from an open database, they reached a test accuracy of 0.983 and reported that their study outperformed other COVID-19 detection studies in this field.

Jaiswal et al. (2020) used DenseNet201-based deep transfer learning to classify CT images of patients infected with COVID-19 [14]. They reported that the CNN performed well, reaching an accuracy rate of 97%, and stated that COVID-19 could be detected with this model.

Ozyurt et al. (2021) conducted a study applying ANN and CNN algorithms to CT images [20]. They obtained an accuracy rate of 94.10% with the CNN algorithm and 95.84% with the ANN algorithm; thus, the higher accuracy rate was obtained with the ANN algorithm.

Mask R-CNN [21], an improved object detection model based on faster R-CNN, outperforms other object detection and segmentation models [22]. Mask R-CNN is also an important AI-based approach that has been employed in a variety of applications, including the identification and segmentation of lung nodules [23], liver segmentation [24], and multiorgan segmentation [25]. The approach has also been used for face detection and segmentation [22], hand segmentation [26], early gastric cancer segmentation [27, 28], and breast tumour detection and classification [29]. Very few studies in the literature use the mask R-CNN and faster R-CNN methods for detecting COVID-19. Among them, Ter-Sarkisov presents the COVID-CT-Mask-Net model, which predicts the presence of COVID-19 on chest CT scans [17]. Three thousand images are used to train the model; on test data containing 21,192 images, it achieves a COVID-19 sensitivity of 90.80% and an F1-score of 91.50%. Importantly, the study establishes that the regional predictions detected by mask R-CNN can be used to classify entire images. Podder et al. propose an application that can diagnose COVID-19 using X-ray images and deep learning techniques [2]. To classify COVID-19-infected versus uninfected patients, they trained and tested the mask R-CNN method on their dataset; using 668 chest X-ray images, the proposed model achieved a sensitivity of 97.36% and an accuracy of 96.98%.

This study builds on the benefits of mask R-CNN and automated image segmentation for establishing the COVID-19 region bounding box and drawing the contour of the COVID-19 lesion. The goal of this paper is to develop a model for automatically detecting, segmenting, and classifying the disease from COVID-19 CT scans. The findings of this study are compared with those of radiologists who specialize in COVID-19 diagnosis. Automatic image segmentation and disease classification from CT scans can be a useful tool in the diagnosis and treatment of COVID-19. The mask R-CNN algorithm is a popular choice for image segmentation tasks because it can accurately identify and locate objects within an image; using it to identify the region bounding box and draw the contour of COVID-19 lesions in CT scans could improve the efficiency and accuracy of COVID-19 diagnosis, particularly when the automated findings are validated against those of specialist radiologists. It is important to note that, while automated approaches have the potential to improve diagnostic accuracy and efficiency, they should not replace the expertise and judgement of trained medical professionals.

The article’s structure is as follows. Sect. 2 provides a basic overview of the dataset, data pre-processing and object detection, together with a detailed explanation of the faster R-CNN and mask R-CNN models and details of the evaluation parameters for the deep learning models used. Sect. 3 presents the experiments and outcomes of the faster R-CNN and mask R-CNN models, along with evaluation results such as accuracy, recall, precision and F1-scores, as well as cross-validation results. The discussion and conclusion are presented in Sects. 4 and 5, respectively. An overview of the study is shown in Fig. 1.

Fig. 1 An overview of the study

2 Material and methods

2.1 Dataset

There are numerous datasets developed by data scientists and machine learning practitioners for the diagnosis of COVID-19. In this study, however, a unique dataset was created from retrospective CT images of patients with COVID-19, obtained from the Yozgat Bozok University Faculty of Medicine after the necessary ethical approvals. All image labelling was overseen by a radiologist blinded to the clinical status of the patients (Table 1).

Table 1 Cases and image number information of the dataset

2.2 Data pre-processing

Pre-processing is a common step in computer vision applications. Pre-processing techniques can aid the training phase of deep learning by, among other things, reducing unwanted noise and highlighting image regions that help with the recognition task.

This procedure uses the MakeSense data labelling tool, a free online application for image labelling in which polygons are drawn to completely outline the objects in the images. For this study, labels are saved as “.json” files for mask R-CNN and in the “.vgg” format for faster R-CNN. In total, 4000 images are labelled: 3200 training images and 800 validation images.
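For mask R-CNN training, the polygon labels must be converted into per-instance binary masks. The following is a minimal sketch assuming a VGG Image Annotator-style JSON export (a filename plus regions with all_points_x/all_points_y); the exact JSON layout produced by MakeSense may differ.

```python
import json
import numpy as np
import skimage.draw

def load_masks(annotation_path, height=512, width=512):
    # Convert polygon annotations into boolean masks, one channel per region.
    with open(annotation_path) as f:
        annotations = json.load(f)

    masks = {}
    for entry in annotations.values():
        polygons = [r["shape_attributes"] for r in entry["regions"]]
        mask = np.zeros((height, width, len(polygons)), dtype=bool)
        for i, p in enumerate(polygons):
            # Fill the pixels inside each labelled polygon
            rr, cc = skimage.draw.polygon(p["all_points_y"], p["all_points_x"])
            mask[rr, cc, i] = True
        masks[entry["filename"]] = mask
    return masks
```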

2.3 Object detection

Inspired by the success of deep learning algorithms, particularly region-based convolutional neural networks, for object detection in images, numerous deep learning-based object detection methods for remote sensing data have been developed. Thanks to their superior capacity for learning high-level semantic representations, these methods outperform conventional ones [30]. General object detection is the process of localizing an object in an image with a rectangular bounding box and labelling the object. Object detectors fall into two categories: region-based detectors and regression/classification-based detectors. Region-based detectors follow the conventional approach to object detection: first, object region proposals are generated, and then the regions are classified. Region-based systems include R-CNN [31], SPP-net [32], fast R-CNN [33], and mask R-CNN [21].

2.3.1 Faster R-CNN

Faster R-CNN is more successful and appealing than its predecessors because it introduces a mechanism, the region proposal network (RPN), for estimating the regions of an image where an object is likely to be located [33]. The RPN resembles a fully convolutional network; its primary function is to generate object detection proposals. It takes an image as input and produces classification scores and bounding rectangles for the object proposals. A sliding window is applied to the feature map to generate proposed regions, and several region proposals are evaluated at each sliding-window position. The accuracy of the proposals is checked using intersection over union (IoU), as sketched below. Thanks to the region proposal network, the cost of estimating the regions where objects are likely to be located is significantly lower than in its predecessors, which gives the algorithm an edge over its competitors in speed and efficiency. Some sample labelled dataset images are shown in Fig. 2.
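To make the proposal-checking step concrete, the following is a minimal sketch of IoU between a proposal and a ground-truth box; the (x1, y1, x2, y2) corner format is an assumption.

```python
def box_iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2); corner ordering is assumed.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Proposals overlapping a ground-truth box above a threshold
# (commonly 0.5) are treated as positive examples.
print(box_iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14
```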

Fig. 2 a COVID-19 original image, b COVID-19 image marked with a polygon, c COVID-19 image marked with a rectangle

2.3.2 Mask R-CNN

Mask region-based convolutional neural network (mask R-CNN) is a neural network model derived from the faster R-CNN model [21]. Faster R-CNN extracts a feature map from the input images, and the region proposal network (RPN) identifies image regions that may contain an object. The region proposals are then pooled and presented as input to the fully connected MLP, which estimates the object among the predetermined class or classes. While fast R-CNN performs these processes more slowly, faster R-CNN yields faster results [34].

Mask R-CNN was born out of faster R-CNN. During object detection, faster R-CNN determines the bounding box and class label of each object; mask R-CNN, in addition, creates a mask for each object. To do so, mask R-CNN employs a two-step method. In the first stage, the region proposal network (RPN) is used to determine the bounding boxes of the objects. In the second stage, the mask of each object is determined using ROIAlign. The mask R-CNN architecture is shown in Fig. 3.

Fig. 3 Mask R-CNN architecture

In mask R-CNN, the loss function to be minimized is:

$$L = L_{{{\text{class}}}} + L_{{{\text{box}}}} + L_{{{\text{mask}}}}$$
(1)

where \({L}_{\mathrm{class}}\) and \({L}_{\mathrm{box}}\) are the same terms as in faster R-CNN and are defined as:

$$L_{{{\text{class}}}} + L_{{{\text{box}}}} = \frac{1}{{N_{{{\text{cls}}}} }}\mathop \sum \limits_{i} L_{{{\text{cls}}}} \left( {p_{i} ,p_{i}^{*} } \right) + \frac{1}{{N_{{{\text{box}}}} }}\mathop \sum \limits_{i} p_{i}^{*} L_{1}^{{{\text{smooth}}}} \left( {t_{i} - t_{i}^{*} } \right)$$
(2)
$$L_{{{\text{cls}}}} \left( {p_{i} ,p_{i}^{*} } \right) = - p_{i}^{*} \log p_{i} - \left( {1 - p_{i}^{*} } \right)\log \left( {1 - p_{i} } \right)$$
(3)

The average binary cross-entropy loss \({L}_{\mathrm{mask}}\) is defined as:

$$L_{{{\text{mask}}}} = - \frac{1}{{m^{2} }}\mathop \sum \limits_{0 \le i,j \le m} \left[ {y_{ij} \log \hat{y}_{ij}^{k} + \left( {1 - y_{ij} } \right)\log \left( {1 - \hat{y}_{ij}^{k} } \right)} \right]$$
(4)
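As a hedged illustration of Eq. (4), the average binary cross-entropy over an m × m mask can be computed as follows; this NumPy version is a sketch of the definition, not the library’s internal implementation.

```python
import numpy as np

def mask_loss(y_true, y_pred, eps=1e-7):
    # y_true: m x m binary ground-truth mask for the ROI's class k
    # y_pred: m x m predicted mask probabilities for the same class
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    bce = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return -bce.mean()  # average binary cross-entropy, as in Eq. (4)
```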

In this study, the first stage is the backbone network (ResNet-50 or ResNet-101), which generates feature maps from the input images. In the second stage, the RPN receives the feature maps from the backbone network and generates ROIs. In the third stage, the shared feature maps are mapped to extract the target features corresponding to the generated ROIs; these features are then sent to the fully connected (FC) layer for classification and to the fully convolutional network (FCN) for segmentation of the objects. In the final, fourth stage, the positive regions from the ROI classifier are extracted, and masks are created for them.

2.4 Evaluation methods

During the testing stage, the results of the algorithms are compared with the actual labels in the annotated data, and the success and performance of the model are measured. These performance criteria are computed using the confusion matrix.

For the actual and predicted values in this dataset:

  • True Positive (TP) Positive data correctly predicted as positive.

  • True Negative (TN) Negative data correctly predicted as negative.

  • False Positive (FP) Negative data incorrectly predicted as positive.

  • False Negative (FN) Positive data incorrectly predicted as negative.

With the deep learning algorithms, the success of the model is determined from the confusion matrix results, and the diagnostic success rate of each algorithm for COVID-19 is determined. The definitions and calculation formulas of these criteria are provided below.

Average precision (AP) is the area bounded by the precision–recall curve and the coordinate axes, and mAP is obtained by averaging the AP values of all categories. The metrics are given in Eqs. (5) through (8):

$${\text{Accuracy}} = \frac{{T_{N} + T_{P} }}{{T_{N} + T_{P} + F_{N} + F_{P} }}$$
(5)
$${\text{Precision}} = \frac{{T_{P} }}{{T_{P} + F_{P} }} \times 100\%$$
(6)
$${\text{Recall}} = \frac{{T_{P} }}{{T_{P} + F_{N} }} \times 100\%$$
(7)
$$F1{\text{-score}} = \frac{{2 \times P \times R}}{{P + R}}$$
(8)
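As a concrete sketch of Eqs. (5)–(8), the metrics can be computed directly from the confusion-matrix counts; the variable names and example counts below are illustrative only.

```python
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (5)
    precision = tp / (tp + fp)                          # Eq. (6)
    recall = tp / (tp + fn)                             # Eq. (7)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (8)
    return accuracy, precision, recall, f1

# Example with made-up counts:
print(classification_metrics(tp=90, tn=85, fp=10, fn=15))
```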

The AP (average precision) value, calculated for each class separately as the area under the precision–recall curve, is presented in Eq. (9).

$${\text{AP}} = \mathop \int \limits_{0}^{1} P\left( R \right){\text{d}}R$$
(9)

The mean average precision (mAP), presented in Eq. (10), is the arithmetic average of the AP values calculated for each class.

$${\text{mAP}} = \frac{1}{{N_{T} }}\mathop \sum \limits_{i = 1}^{{N_{T} }} \frac{{N_{i}^{{{\text{DR}}}} }}{{N_{i}^{D} }}$$
(10)

where \({N}_{T}\) is the number of images, \({N}_{i}^{\mathrm{DR}}\) is the area of overlap between the lesion regions detected by the model and the true lesion regions in image \(i\), and \({N}_{i}^{D}\) is the true lesion area in image \(i\). In the related IoU measure of Eq. (11), the segmentation result of the model is denoted \(A\) and the consistent lesion contour defined by an experienced radiologist is denoted \(B\).
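As a hedged illustration of Eqs. (9) and (10), AP can be approximated by numerically integrating precision over recall, and mAP by averaging per-class AP values; the trapezoidal rule below is one common convention, not necessarily the exact one used in this study.

```python
import numpy as np

def average_precision(recalls, precisions):
    # Approximate Eq. (9): integrate precision over recall.
    order = np.argsort(recalls)
    return np.trapz(np.asarray(precisions)[order],
                    np.asarray(recalls)[order])

def mean_average_precision(ap_per_class):
    # Eq. (10): arithmetic mean of the per-class AP values.
    return float(np.mean(ap_per_class))
```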

3 Experimental results

In this study, a unique dataset of CT scan images was prepared to classify COVID-19 patients, and the faster R-CNN and mask R-CNN methods are applied to it; the results are given in this section. The procedure of the proposed method is shown in Fig. 4.

Fig. 4 Procedure of the proposed method

3.1 Faster R-CNN

In this study, the VGG-16 network, which uses convolutional and subsampling blocks in its layers, is preferred as the backbone feature extractor of faster R-CNN. From the dataset, 3200 images of 512 × 512 size are used for model training, and 800 images of 512 × 512 size are used for validation.

There are many hyperparameters to adjust when training a faster R-CNN model, and exploring all configurations is practically impossible in terms of time and computational resources. Therefore, default and well-established configurations are adhered to. Specifically:

  • num_workers and test_num_workers (number of workers to load data): 3

  • rpn_sigma (sigma parameter for the RPN localization loss): 3

  • roi_sigma (sigma parameter for the ROI localization loss): 3

  • learning rate: 0.0001

  • learning decay: 0.01

  • num_classes (number of classes + background): 2 + 1

All other configurations, including the optimizer, are kept at their faster R-CNN defaults. The configurations of the study are given in Table 2, and a hedged sketch of this configuration is given after Fig. 5. The predictions and colour masks produced by the trained model are shown in Fig. 5.

Table 2 Faster R-CNN training configuration
Fig. 5 Faster R-CNN test results
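As a reference for reproduction, the listed hyperparameters can be collected in a plain Python mapping. This is a minimal sketch mirroring the run described above; the key names are illustrative, and any resemblance to a specific framework’s configuration object is an assumption.

```python
# Hedged sketch of the faster R-CNN training configuration described above;
# key names are illustrative, not a specific library API.
faster_rcnn_config = {
    "num_workers": 3,        # data-loading workers
    "test_num_workers": 3,   # data-loading workers at test time
    "rpn_sigma": 3.0,        # sigma for the RPN localization (smooth-L1) loss
    "roi_sigma": 3.0,        # sigma for the ROI localization (smooth-L1) loss
    "lr": 1e-4,              # learning rate
    "lr_decay": 0.01,        # learning-rate decay
    "num_classes": 2 + 1,    # two foreground classes + background (assumed split)
}
```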

Faster R-CNN first determines whether a box contains an object and then classifies the object in the box. As a result of the training, a mAP value of 93.86% is obtained. The faster R-CNN test mAP plot is shown in Fig. 6.

Fig. 6 Test mAP value for 40 epochs

The ROI classification loss is 0.0612 per ROI. This “class loss” (loss for box classification) measures how successfully the model labels a predicted box with the appropriate class. The faster R-CNN ROI classification loss plot is shown in Fig. 7. The classification loss in the RPN (region proposal network) is 0.0324. This “objectness” loss measures how well the RPN labels anchor boxes as foreground (object) or background. The RPN classification loss graph for faster R-CNN is given in Fig. 8. The total training loss is 0.4885; this is the weighted sum of the individual losses calculated during each iteration, and its graph is shown in Fig. 9. The localization loss for the ROI head is 0.3304, and the localization loss for the RPN is 0.0491. Localization losses for ROI and RPN are shown in Figs. 10 and 11, respectively.

Fig. 7 ROI loss graph of faster R-CNN

Fig. 8 RPN classification loss graph of faster R-CNN

Fig. 9 Total loss graph of faster R-CNN

Fig. 10 ROI localization loss

Fig. 11 RPN localization loss

3.2 Mask R-CNN

The mask R-CNN implementation used in this study is the MIT-licensed version produced by Matterport Inc. [35]. This version is built on open-source libraries such as Keras and TensorFlow. Both the ResNet-50 and ResNet-101 feature pyramid network models are used as the backbone. Three thousand two hundred images of 512 × 512 size are used for model training, and 800 images of 512 × 512 size are used for validation.

The number of training steps per epoch is set to 100. This does not need to match the size of the training set. TensorBoard updates are saved at the end of each epoch, so setting this value lower results in more frequent TensorBoard updates. Validation statistics are also calculated at the end of each epoch and can take some time, so this value should not be chosen too small, to avoid spending too much time on validation. The number of images to train on each GPU is set to 1. A GPU with about 12 GB of memory can typically process two images of 1024 × 1024 pixels; this setting should be adjusted to the available memory and image sizes, and for best performance the highest number the GPU can handle should be selected.

The number of validation steps run at the end of each training epoch is set to 250. A larger number improves the accuracy of the validation statistics but slows down training. The minimum probability required to accept a detected instance is 0.85; ROIs below this threshold are skipped. The learning rate and momentum are set to 0.0001 and 0.9, respectively. The configurations of the study are given in Table 3, and a hedged configuration sketch follows the table.

Table 3 Mask R-CNN parameter configuration
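Under the Matterport implementation cited above, these settings are typically expressed by subclassing mrcnn.config.Config. The class and attribute names below are standard in that code base; the experiment name and class count are assumptions reflecting this study’s setup.

```python
from mrcnn.config import Config

class CovidConfig(Config):
    """Hedged sketch of the mask R-CNN configuration described above."""
    NAME = "covid"                   # assumed experiment name
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1               # one 512 x 512 CT image per GPU
    NUM_CLASSES = 1 + 2              # background + COVID-19 + pneumonia (assumed)
    STEPS_PER_EPOCH = 100            # training steps per epoch
    VALIDATION_STEPS = 250           # validation steps after each epoch
    DETECTION_MIN_CONFIDENCE = 0.85  # skip detections below this probability
    LEARNING_RATE = 0.0001
    LEARNING_MOMENTUM = 0.9
    BACKBONE = "resnet50"            # "resnet101" for the second run
```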

The training is repeated twice using different backbone architectures, ResNet-50 and ResNet-101. Instead of training the network from scratch, pre-training is performed for 150 epochs starting from MSCOCO weights. Using the weights obtained from this preliminary training, the final training is completed with 250 epochs; in this way, the final training achieves more successful results (a sketch of this two-stage training is given after Fig. 14). The results of the training using the fivefold cross-validation strategy are given in Table 4. Cross-validation is a statistical resampling method used to evaluate the performance of a network model on data as objectively and accurately as possible [36]. The predictions and colour masks produced by the trained model are shown in Fig. 12. Loss graphs for the ResNet-50 (fivefold) and ResNet-101 (fivefold) results are given in Figs. 13 and 14, respectively.

Table 4 Mask R-CNN test results
Fig. 12 Mask R-CNN test results

Fig. 13 Loss graphs for ResNet-50 (fivefold)

Fig. 14 Loss graphs for ResNet-101 (fivefold)
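With the Matterport API referenced above, the two-stage training (initializing from MSCOCO weights, then fine-tuning on the CT dataset) might look roughly as follows. This is a sketch under stated assumptions: the file paths are placeholders, and dataset_train/dataset_val stand for mrcnn.utils.Dataset subclasses prepared from the labelled CT images (not shown).

```python
import mrcnn.model as modellib

# Sketch: initialize mask R-CNN from MSCOCO weights, then fine-tune.
# Paths and dataset objects are placeholders.
model = modellib.MaskRCNN(mode="training", config=CovidConfig(),
                          model_dir="logs/")
# The head layers differ in class count, so the COCO head weights are excluded.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])
model.train(dataset_train, dataset_val,
            learning_rate=CovidConfig().LEARNING_RATE,
            epochs=250, layers="all")
```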

The segmentation quality is assessed using the intersection over union (IoU) metric, defined in Eq. (11). The IoU, also known as the Jaccard index (or Jaccard similarity coefficient), is one of the most widely used methods for measuring similarity between finite sample sets. For two finite sample sets A and B, IoU is expressed as the intersection (A ∩ B) divided by the union (A ∪ B) [37]. In this study, IoU is illustrated in Fig. 15, and the IoU rate is 0.8694.

$${\text{IoU}}\left( {A,B} \right) = \frac{{A \cap B}}{{A \cup B}}$$
(11)
Fig. 15 An example of IoU results from the used data
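A hedged sketch of Eq. (11) for binary segmentation masks follows; the boolean-array layout is an assumption.

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    # mask_a, mask_b: boolean arrays of identical shape (H x W)
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union if union else 0.0
```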

In the mask R-CNN and faster R-CNN training process, the algorithm losses and the segmentation errors should be minimized. Tests with different numbers of iterations show that, for faster R-CNN, the errors cannot be reduced further after 4100 iterations. For mask R-CNN, the loss rates are minimized after 250 epochs.

In this study, region-based segmentation of COVID-19 and pneumonia infection in computed tomography images is performed using the mask R-CNN algorithm, a deep learning model based on the CNN architecture. Thanks to the instance segmentation feature of mask R-CNN, the COVID-19 and pneumonia segmentations obtained from the images are displayed in different colours, and the error rates are minimized as a result of the tests.

As a result of training the mask R-CNN algorithm, mAP values of 97.72% and 95.65% are obtained with the ResNet-50 and ResNet-101 backbones, respectively. The model with the ResNet-50 backbone has a slightly lower computational load than the one with the ResNet-101 backbone and, at the same time, produces more successful results. When the faster R-CNN algorithm is trained, a mAP of 93.86% is reached. Test results of the models are given in Table 5.

Table 5 Test results of models

The dataset used in this study has also been applied to the classification task with the VGG-16 architecture, and the classification report is given in Table 6. With this model, an accuracy of 94.87% and an F1-score of 95% are reached, and the ROC graph of the model is shown in Fig. 16.

Table 6 Classification report of VGG-16
Fig. 16 ROC graph of VGG-16
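A hedged Keras sketch of VGG-16-based classification as described above follows; the head layers and training settings are assumptions, since they are not listed here, and only the backbone choice comes from the study.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Sketch: VGG-16 backbone with a small classification head
# (head architecture and hyperparameters are assumptions).
base = VGG16(weights="imagenet", include_top=False,
             input_shape=(512, 512, 3))
base.trainable = False  # transfer learning: freeze the convolutional base

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # COVID-19 vs. non-COVID
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```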

Patients with COVID-19 can be diagnosed using CT scan images. CT can reveal internal abnormalities such as lesions and tumours, along with their dimensions, and provides a thorough view of the region being examined. A CT scan appears to be a reliable alternative to the current RT-PCR method, works well for categorizing COVID-19 patient images, and delivers results promptly and accurately; early detection frequently enables early treatment of the disease or virus. When the studies on the diagnosis of COVID-19 in the literature are examined, the faster R-CNN and mask R-CNN models are not widely used, so the present work is important in this respect. Compared with the various studies given in Table 7, the results of the proposed models are satisfactory. In actuality, the models are not directly comparable due to variation in sample size and COVID-19 prevalence.

Table 7 Comparison with other studies

4 Discussion

This study’s primary objective is to develop a cost-effective diagnostic method capable of rapidly identifying COVID-19 patients from CT images. Because no larger dataset was available to date, this study was conducted by preparing a dataset of COVID-19 chest CT images and labelling these images; this can be considered a limitation of our study. Plans for the future include validating the proposed framework against a larger dataset containing additional COVID-19 CT images. In addition, we plan to train our framework on a dataset containing chest X-ray images of COVID-19 patients and evaluate the performance of the method when trained on X-ray images. The lungs can be precisely identified using our method without false alarms. To improve accuracy, numerous experiments are run on high-quality images from the dataset.

5 Conclusion

In this study, a method for the automatic detection, segmentation, and classification of COVID-19 lesions in CT images is proposed. The aim is to diagnose COVID-19 accurately using CT images and deep learning techniques. The presented method starts with the creation of an initial dataset of 4000 CT images obtained from Yozgat Bozok University. COVID-19 detection is first performed with the faster R-CNN model, which reaches a mAP of 93.86% using the VGG-16 backbone. In addition to this score, the classification loss per ROI (region of interest), the classification loss in the RPN (region proposal network) and the total training loss are calculated and given in the results section. The mask R-CNN method is trained using the ResNet-50 and ResNet-101 architectures, for which mAP values of 97.72% and 95.65% are obtained, respectively. Results for the classification methods used in this study are obtained with fivefold cross-validation, and comparisons with other studies are also presented.

A deep learning model based on VGG-16 is also used to classify COVID-19 patients. Analysis of its performance suggests that the VGG-16 model can extract hidden information from CT scan images to identify COVID-19 patients; it achieves an accuracy of 94.87%, and its classification report and ROC graph are given in the study.

One of the current limitations in this field is the scarcity of publicly available datasets of COVID-19 CT images and labelled data. In future studies, new architectural solutions, pre-processing algorithms and higher-accuracy results can be pursued with a more comprehensive dataset and reduced losses. This study will enable researchers to develop more reliable methods and to compare methods. The interpretability of artificial intelligence methods should be demonstrated and verified with the help of expert radiologists. It is highly recommended that clinicians, radiologists, and artificial intelligence engineers work together to develop interpretable and reliable deep learning solutions that can be easily implemented in hospitals. Otherwise, despite numerous efforts on a global scale, it will take time for these technologies to be used in hospitals to help humanity.