Ensemble Deep Learning Derived from Transfer Learning for Classification of COVID-19 Patients on Hybrid Deep-Learning-Based Lung Segmentation: A Data Augmentation and Balancing Framework

Background and motivation: Lung computed tomography (CT) techniques are high-resolution and well adopted in the intensive care unit (ICU) for COVID-19 disease control classification. Most artificial intelligence (AI) systems do not generalize and are typically overfitted. Such trained AI systems are not practical for clinical settings and therefore do not give accurate results when executed on unseen data sets. We hypothesize that ensemble deep learning (EDL) is superior to deep transfer learning (TL) in both non-augmented and augmented frameworks. Methodology: The system consists of a cascade of quality control, ResNet–UNet-based hybrid deep learning for lung segmentation, and seven TL-based classification models followed by five types of EDLs. To prove our hypothesis, five different kinds of data combinations (DC) were designed using two multicenter cohorts, Croatia (80 COVID) and Italy (72 COVID and 30 controls), leading to 12,000 CT slices. As part of the generalization check, the system was tested on unseen data and statistically tested for reliability/stability. Results: Using the K5 (80:20) cross-validation protocol on the balanced and augmented dataset, the five DC datasets improved TL mean accuracy by 3.32%, 6.56%, 12.96%, 47.1%, and 2.78%, respectively. The five EDL systems showed accuracy improvements of 2.12%, 5.78%, 6.72%, 32.05%, and 2.40%, thus validating our hypothesis. All statistical tests proved positive for reliability and stability. Conclusion: EDL showed superior performance to TL systems for both (a) unbalanced and unaugmented and (b) balanced and augmented datasets, for both (i) seen and (ii) unseen paradigms, validating both our hypotheses.


Introduction
The COVID-19 pandemic has caused significant disruptions and health concerns worldwide and has worsened the burden of existing diseases since its emergence in late 2019. Efforts to control its spread have included non-pharmaceutical interventions, such as social distancing, mask-wearing, and quarantine measures, as well as the development and administration of vaccines, which have been effective in reducing the severity of the disease and preventing hospitalization and death [1–6].
Ongoing research and analysis are needed to better understand the effectiveness of various control measures and their impact on reducing the spread of COVID-19. There are several motivations for researching COVID-19 and its control measures. First, COVID-19 is a novel virus, and there is still much to learn about its transmission, symptoms, and long-term effects [1,7]. Second, research can help to fill these knowledge gaps and inform public health strategies. Third, COVID-19 has highlighted existing health disparities and inequities, and research can help to identify and address these issues in the context of the response to the pandemic [8,9]. Lastly, the COVID-19 pandemic has spurred artificial intelligence innovation and collaboration in fields such as medicine, epidemiology, and public health. Research can help to build on these developments and inform future responses to similar global health crises [3,6,10–12].
Supercomputers and graphical processing units (GPUs) ease the burden on researchers in detecting diseases in medical imaging [5,13–15], e.g., pneumonia [5,16]. Transfer learning (TL), ensemble deep learning (EDL), and hybrid deep learning (HDL) are novel methods that achieve better accuracy faster than traditional methods [17–20]. Hospitals, labs, institutes, professors, and doctors are not only adopting these new paradigms but also collaborating to help patients. There is variability in the design of studies looking at COVID-19 and its control measures [1,15,21–24], which can make it challenging to compare and draw conclusions from different studies. Some studies may have limited generalizability, as they may be conducted in specific populations and may not apply to other populations. The emergence of new variants of the virus may affect the effectiveness of existing control measures. Despite these limitations, ongoing research is critical for understanding and mitigating the impacts of COVID-19 and developing effective control measures.
Researchers face challenges in obtaining a COVID-19 image dataset of adequate volume. X-ray images are noisy and cannot delineate the infected lung areas as clearly as CT images. The CroMED and NovMED datasets have helped this research to detect infected COVID sections, but they must be processed with the right models. Several machine learning (ML) and deep learning (DL) models have been published. ML models are mostly used for classification, while DL models handle both feature extraction and classification, which makes DL models more suitable than ML models for the COVID CT image dataset [7,25]. Current DL models are already trained and tested on the ImageNet dataset with good accuracy. These models could be trained from scratch on CT images, but this would be a very slow and non-novel process. This challenge leads us to use TLs and EDLs: TLs are faster than traditional DL methodologies, and EDLs are stronger than TLs. Still, the appropriate data size for DLs remains in doubt. The previous state of the art has shown that data augmentation and data balancing play a significant role in achieving better accuracy. Most AI systems are overfitted or never generalized; such systems have memorized rather than generalized, are not practical for clinical settings, and therefore do not give accurate results when tried on unseen data sets.
Diagnostics 2023, 13, 1954

This is the fundamental motivation of this study. We specifically address a novel system design: a cascade of three major AI systems for a multicenter data set design, aimed squarely at unseen-data analysis towards generalization. To our knowledge, no previous system combining HDL, seven TL, and five EDL components has been designed and tested on five special types of multicenter COVID + control data combinations, with the design applied to "unseen analysis" to establish generalization over memorization.
Based on the limitations in current research, we hypothesize two points to improve detection accuracy. First, the mean accuracy of EDLs is better than the mean accuracy of TLs. Second, balanced and augmented data give better results compared to data without augmentation. We studied 275 published journal and conference papers at IEEE, ScienceDirect, Springer, and MDPI. After this, we finalized EfficientNetV2M, InceptionV3, MobileNetV2, ResNet152, ResNet50, VGG16, VGG19, and Xception for our research work. These TL combinations generated EDL models that could improve COVID-19 detection accuracy [7,25]. Five EDL models and seven TLs are consistently used over dataset combinations (DC) taken from Croatia and Italy.
The layout of this study is as follows: Section 2 presents the related literature, discussing recent research and its accuracy on currently available datasets. Section 3 presents the methodology, covering the architecture and approach of the research. The results and performance evaluation based on the methodology and different performance metrics are presented in Section 4. Section 5 presents the system reliability and explainability. The critical discussion is presented in Section 6, and finally, the study concludes in Section 7.

Background Literature
The COVID-19 pandemic has led to an unprecedented global health crisis, with a significant impact on public health and on social and economic aspects of life [9,25]. One of the primary challenges faced by healthcare professionals during the pandemic is the early and accurate diagnosis of COVID-19 patients. CT scans are one of the most reliable and widely used methods for the diagnosis of COVID-19 owing to their high sensitivity and specificity. With the advent of DL-based AI models, researchers have been able to develop automated diagnostic tools that can help healthcare professionals to diagnose COVID-19 patients more accurately and efficiently. Several studies have been conducted to develop and evaluate AI-based models for the diagnosis of COVID-19 using CT scans. For instance, in a study conducted by Gozes et al. [26], a DL-based model was developed and evaluated using a dataset of ~1500 CT scans. The study reported an overall sensitivity of 98% and a specificity of 92%, indicating that the model could accurately distinguish COVID-19 patients from non-COVID-19 patients. In a more recent study by Li et al. [27], a DL-based model was developed and evaluated using a dataset of 1684 CT scans obtained from 468 COVID-19 patients and 1216 non-COVID-19 patients. The model achieved an overall accuracy of 91.4%, indicating that it could accurately distinguish COVID-19 patients from non-COVID-19 patients.
Other studies have also explored the use of AI-based models for COVID-19 diagnosis using CT scans. Alshazly et al. [28] used DenseNet169 and DenseNet201 to evaluate 746 CT scan images. The authors achieved an accuracy of 91.2%, an F1-score of 90.8%, and an AUC of 0.91 with DenseNet169, and an accuracy of 92.9%, an F1-score of 92.5%, and an AUC of 0.93 with DenseNet201. Cruz et al. [29] conducted another study using 746 CT scans; the best model, DenseNet161, achieved an accuracy of 82.76%, a precision of 85.39%, and an AUC of 0.89, while the second-best model, VGG16, achieved an accuracy of 81.77%, a precision of 79.05%, and an AUC of 0.9. Shaik et al. [30], Huang et al. [31], and Xu et al. [32] used TL-based MobileNetV2 on the COVID-CT dataset, TL-based MobileNetV2 on SARS-CoV2, and TL-based EfficientNetV2M on COVID-CT, achieving accuracies of 97.38%, 88.67%, and 95.66%, respectively. EDL has a major role in improving detection accuracy, and there are some popular EDL paradigms on CT datasets: Pathan et al. [33], Kundu et al. [34], and Tao et al. [35] used EDL models to achieve better accuracy in comparison to TL models, with all three studies reporting accuracies of more than 97%.

The second data set included 72 NovMED COVID-19-positive individuals (Figure 2), of whom 47 were male and the remainder female. An RT-PCR test was conducted to confirm the presence of COVID-19 in the selected cohort, with an average value of approximately 2.4 for GGO, consolidation, and other opacities. Of the 72 NovMED patients, 61% had a cough, 9% had a sore throat, 54% had dyspnea, 42% had hypertension, 12% were diabetic, 11% had COPD, and 11% were smokers. In total, 10 patients died due to COVID-19 infection in this cohort. Figure 3 shows the NovMED (control) dataset from Italy.
The COVID (Croatia) dataset had dimensions of 512 × 512 and 5396 raw images, COVID (Italy) had dimensions of 768 × 768 and 5797 raw images, and control (Italy) had dimensions of 768 × 768 and 1855 raw images. The CroMED CT dataset was acquired using a 64-detector FCT Speedia HD scanner (Fujifilm Corporation, Tokyo, Japan, 2017). The NovMED dataset, consisting of 72 COVID-19-positive individuals, was obtained from the Department of Radiology at Novara Hospital, Italy. The CT scans were performed using a 128-slice multidetector row CT scanner (Philips Ingenuity Core, Philips Healthcare). The patients were required to have a positive RT-PCR test for COVID-19 as well as symptoms such as fever, cough, and shortness of breath. No contrast agent was administered during the acquisition, and a lung kernel with a 768 × 768 matrix together with a soft-tissue kernel was utilized to obtain 1 mm thick slices. The CT scans were performed at 120 kV and 226 mAs/slice using Philips's automated tube current modulation (Z-DOM) with a spiral pitch factor of 1.08, a 0.5 s gantry rotation time, and a 64 × 0.625 detector configuration [80]. Appendix A has more samples of the CroMED (COVID), NovMED (COVID), and NovMED (control) datasets.
Data exclusion criteria for both the CroMED and NovMED datasets consisted of selecting CT scan regions based on the absence of metallic items and on high scan quality, free of external artefacts or blur caused by patient movement during the scanning procedure. In this group, the average patient's CT volume had about 300 slices. During slice selection, the slices with the greatest lung area were selected; slice selection was performed by one of the senior radiologists (K.V.).
Balancing rationale: CroMED (COVID) consisted of 5396 images, NovMED (COVID) consisted of 5797 images, and NovMED (control) consisted of 1855 images. Note that there were comparatively few control data points. The augmentation procedure consisted of increasing the COVID data two times and the control data six times. Thus, the total numbers of images became 10,792 (5396 × 2), 11,594 (5797 × 2), and 11,130 (1855 × 6), respectively. This balanced the COVID and control data sets, making the control set nearly the same size as the COVID sets.
Folding rationale: The COVID data were augmented two-fold. This was based on the sample size computation (so-called power analysis, discussed in the methodology section), whose objective was to improve accuracy. For the best accuracy, at least 8100 images were needed for COVID. Thus, we increased the COVID data sets two-fold, i.e., to 10,792 (5396 × 2) and 11,594 (5797 × 2). Subsequently, the control set was balanced by increasing it six-fold, i.e., to 11,130 (1855 × 6). Table 1 depicts the distribution of the dataset.
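The balancing and folding arithmetic above can be expressed as a small helper. This is a minimal sketch: the raw slice counts are taken from the text, while the function and dictionary names are ours.

```python
# Raw slice counts per cohort, as reported in the text.
RAW_COUNTS = {
    "CroMED (COVID)": 5396,
    "NovMED (COVID)": 5797,
    "NovMED (control)": 1855,
}

def balanced_counts(raw, covid_factor=2, control_factor=6):
    """Apply the augmentation factors used for class balancing:
    COVID cohorts are doubled, the control cohort is increased six-fold."""
    out = {}
    for name, n in raw.items():
        factor = control_factor if "control" in name else covid_factor
        out[name] = n * factor
    return out

print(balanced_counts(RAW_COUNTS))
# COVID sets become 10,792 and 11,594, and the control set becomes 11,130,
# so the classes are nearly balanced.
```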

Overall Pipeline of the System
The proposed overall architecture is portrayed in Figure 4. In this architecture, the CT machine operator and the doctor contributed to the storage of raw lung images for research purposes. These raw images were subjected to HDL segmentation to produce segmented data, resulting in a clear and distinct image of the lung; the latest advancements in segmentation have yielded better results compared to raw images. We utilized both TLs and EDLs to detect COVID-19 and control cases with high accuracy, as shown in Figure 4. In this study, we hypothesized that the mean accuracy of EDLs is superior to that of TLs. Additionally, we hypothesized that the mean accuracy of models with input data balanced by augmentation would be greater than that of models without augmentation, for both TLs and EDLs. We conducted scientific validation and statistical analysis and evaluated the models using precision, recall, F1-score, and AUC.

Hybrid Deep Learning Architecture of CT Lung Segmentation
After the data acquisition, the raw input images were passed to the HDL model for segmentation. Image segmentation breaks an image up into segments, each of which corresponds to a desired class in the image; the approach utilized depends on the specific application and the characteristics of the image being segmented. The study by Suri et al. [80] in the literature review has shown that HDLs are better than solo segmentation models. In the same spirit, ResNet-UNet was exclusively adopted for lung segmentation after pre-processing and quality control [81–86]. The ResNet-UNet-based HDL model is composed of 165 layers with ~16.5 million parameters, and the final trained model size was 188 megabytes. Using a cutoff of 80%, the model had Dice and Jaccard scores of 0.83 and 0.71, respectively.
These segmented images are the inputs for seven types of TL models and five EDL models over five different input data combinations (DC): (i) without data augmentation and (ii) with balanced and augmented data, in predicting the presence of COVID-19 across three datasets: CroMED (COVID), NovMED (COVID), and NovMED (control). ResNet helps to solve the vanishing gradient problem of earlier models using skip connections. The convolutional neural network (CNN) layers in ResNet downsample features using a stride of two. The UNet-based architecture addresses the semantic segmentation problem. We therefore combined the ResNet and UNet architectures to build HDL-based segmentation, as shown in Figure 5. This amalgamated paradigm effectively segmented the lungs in COVID-19 and control CT scans. To balance the control and COVID classes, 3× augmentation of the control class was carried out using vertical and horizontal flips and a 45-degree rotation. After class balancing in all five DC scenarios, the data were further augmented twofold using vertical and horizontal flips and a 30-degree rotation. The augmented data were analyzed over seven TLs and five EDLs in all five DCs.
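The flip-and-rotate augmentation described above can be sketched with NumPy. This is a minimal illustration, not the authors' pipeline: a real pipeline would typically use a library routine (e.g., `scipy.ndimage.rotate` or Keras preprocessing) for arbitrary-angle rotations, and the nearest-neighbour rotation below is only a stand-in.

```python
import numpy as np

def flip_augment(img):
    """Return vertical and horizontal flips of a 2-D image slice."""
    return [np.flip(img, axis=0), np.flip(img, axis=1)]

def rotate_nn(img, degrees):
    """Rotate a 2-D image about its center by `degrees`
    (nearest-neighbour, zero-padded outside the frame)."""
    theta = np.deg2rad(degrees)
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse rotation: map each output pixel back to its source pixel.
    sy = cy + (ys - cy) * np.cos(theta) - (xs - cx) * np.sin(theta)
    sx = cx + (ys - cy) * np.sin(theta) + (xs - cx) * np.cos(theta)
    sy = np.rint(sy).astype(int)
    sx = np.rint(sx).astype(int)
    valid = (sy >= 0) & (sy < h) & (sx >= 0) & (sx < w)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out

def augment_3x(img):
    """One control-class slice -> three augmented slices
    (vertical flip, horizontal flip, 45-degree rotation)."""
    return flip_augment(img) + [rotate_nn(img, 45)]
```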

Transfer-Learning-Based Architecture for Classification
Transfer learning is one of the premier methods for classification and offers several advantages compared to DL-based classification [27,87,88]. The seven TL models adopted were EfficientNetV2M, InceptionV3, MobileNetV2, ResNet152, ResNet50, VGG16, and VGG19, all pre-trained on the ImageNet dataset. Utilizing these TL models, we set the top layers of all models to false (i.e., removed the original classification heads) and added a flatten layer, a dense layer, a dropout layer, and an L2 regularizer. The flatten layer [89] converts the multidimensional output of the previous layer into a 1-dimensional vector and passes the values to a dense layer with a ReLU activation function and L2 regularization with a strength of 0.001; this regularizer prevents overfitting. Dropout is another method that helps to reduce overfitting. Finally, a fully connected layer with two classes and a sigmoid activation function is applied to the output of the previous layer. The output of the last layer represents the predicted probability for the two classes in the classification problem, COVID vs. control. All the architectures used in this work are shown in Appendix B. We used these TL models for their ability to bypass the long training times of scratch-based network designs [13,90].
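At inference time, the custom head described above reduces to a few dense operations. The following NumPy sketch of its forward pass is illustrative only: the weights and shapes are random placeholders, dropout is the identity at inference, and the L2 penalty affects only the training loss, not the forward computation.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def head_forward(features, w_dense, b_dense, w_out, b_out):
    """Flatten -> Dense(ReLU) -> (Dropout = identity at inference)
    -> Dense(2, sigmoid). Returns per-class scores in (0, 1)."""
    x = features.reshape(-1)            # flatten layer
    x = relu(x @ w_dense + b_dense)     # dense + ReLU (L2 applies in training)
    return sigmoid(x @ w_out + b_out)   # 2-class sigmoid output

# Placeholder shapes: a 6x6x4 TL feature map -> 32 hidden units -> 2 classes.
feat = rng.standard_normal((6, 6, 4))
w1 = rng.standard_normal((144, 32)) * 0.1
b1 = np.zeros(32)
w2 = rng.standard_normal((32, 2)) * 0.1
b2 = np.zeros(2)
probs = head_forward(feat, w1, b1, w2, b2)
```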

Ensemble Deep Learning Architectures for Classification
Ensembling is an approach, widely used in medical imaging, that combines weak learners to make them stronger. We used the softmax (soft) voting ensemble method, in which the sum of the predicted scores determines the class of the ensemble prediction. We also propose a novel Algorithm 1 to generate five EDLs from the TLs: the EDL generator applies a combination method over the TL prediction scores to create five EDLs from seven TLs.

After obtaining the segmented images, the TLs and EDLs performed the task of accurate detection of COVID and control cases (Figure 6). First, the segmented data were preprocessed over all five data combinations. Parallel execution of the models on the original data size caused a core dump (memory issue) on our GPU, so the input data size was reduced to 180 × 180. After that, the COVID and control classes were balanced by augmenting the control class 3×. Once the data were balanced, we augmented them 2× to increase the dataset size; this second augmentation also allowed us to check the augmentation effect. The seven TLs were used for feature extraction, with a sigmoid function for binary classification. TL combinations generate the EDLs, and each EDL uses softmax voting on the predicted scores to detect COVID vs. control.
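The soft voting step can be sketched as follows. This is a minimal NumPy illustration of summing member scores and taking the argmax; the per-model scores shown are hypothetical, not results from the paper.

```python
import numpy as np

def soft_vote(scores):
    """Soft voting over ensemble members.

    `scores` has shape (n_models, n_classes): each row holds one TL's
    predicted probability per class.  The ensemble sums (equivalently,
    averages) the scores and picks the argmax class."""
    scores = np.asarray(scores, dtype=float)
    summed = scores.sum(axis=0)
    return int(np.argmax(summed)), summed / len(scores)

# Three hypothetical TL members scoring [P(control), P(COVID)]:
members = [
    [0.40, 0.60],
    [0.55, 0.45],
    [0.20, 0.80],
]
label, mean_score = soft_vote(members)  # majority of probability mass -> COVID
```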

Loss Function

Cross-entropy (CE) loss functions are frequently used for DL models with two or more classes. The CE loss ∝CE depends on the probability p_i predicted by the AI model and on the gold-standard labels 1 and 0, encoded by g_i and (1 − g_i), respectively, as shown in Equation (1):

∝CE = −(1/N) Σ_{i=1}^{N} [ g_i log(p_i) + (1 − g_i) log(1 − p_i) ]    (1)
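Equation (1) can be written directly in code. This is a minimal pure-Python sketch: `p` holds the model's probabilities for the positive class, `g` the binary gold-standard labels, and the `eps` clipping is our addition for numerical safety.

```python
import math

def cross_entropy_loss(p, g, eps=1e-12):
    """Binary cross-entropy, Equation (1): the mean over all samples of
    -[g_i*log(p_i) + (1 - g_i)*log(1 - p_i)].
    `eps` keeps probabilities away from exactly 0 or 1."""
    total = 0.0
    for pi, gi in zip(p, g):
        pi = min(max(pi, eps), 1.0 - eps)
        total += -(gi * math.log(pi) + (1 - gi) * math.log(1 - pi))
    return total / len(p)

# Confident, correct predictions give a small loss:
loss = cross_entropy_loss([0.9, 0.1], [1, 0])
```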

Performance Metric

We used true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) to estimate the various performance evaluation metrics: accuracy (ɳ) (Equation (2)), recall (Ɍ) (Equation (3)), precision (Ƥ) (Equation (4)), and F1-score (Ƒ) (Equation (5)). After calculating the accuracy of the TL and EDL models, we calculated the mean accuracy of the TLs (Equation (6)) and the mean accuracy of the EDLs (Equations (7) and (8)); in these equations, "n" is the number of TLs and "N" is the number of EDLs. The Dice and Jaccard coefficients were also calculated based on Equations (9) and (10), where ƶ is the set of wanted items and ƴ the set of found items. The receiver operating characteristic (ROC) probability curve and the area under the curve (AUC), a measure of the degree of separability, were also calculated for each model. In the standard deviation (σ), each value from the population is denoted by x_i, µ is the population mean, and N is the size of the population.
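The metrics of Equations (2)–(5), (9), and (10) can be computed directly from confusion-matrix counts and sets. This is a minimal pure-Python illustration with our own function names.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts
    (Equations (2)-(5))."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

def dice_jaccard(wanted, found):
    """Dice and Jaccard overlap between the set of wanted items and the
    set of found items (Equations (9) and (10))."""
    wanted, found = set(wanted), set(found)
    inter = len(wanted & found)
    dice = 2 * inter / (len(wanted) + len(found))
    jaccard = inter / len(wanted | found)
    return dice, jaccard
```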

Five Data Combinations
For the robust design of the classification system, we designed five types of data combination (DC) scenarios, based on training and testing data taken from two countries, namely, Croatia and Italy.
 DC5: Training, validation, and testing on mixed data in which the COVID CT scans from Croatia and Italy are mixed; the controls from Italy were used.

Experiment 1: Transfer Learning Models using Lung Segmented Data
This experiment consists of running seven types of TL models, namely, EfficientNetV2M, InceptionV3, MobileNetV2, ResNet152, ResNet50, VGG16, and VGG19, all pretrained on the ImageNet dataset, for the classification of segmented lung data into COVID vs. controls. The lung segmentation was conducted using ResNet-UNet, and the segmented images were the input for the TLs. The experiment highlights the effectiveness of using TL for improving the accuracy of models on HDL-segmented data. The lung segmented data were split with a ratio of 80, 10, and 10 for training, validation, and testing, respectively.
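The 80/10/10 split can be sketched as below. This is a minimal pure-Python illustration under stated assumptions: a real pipeline would typically also stratify by class and keep slices from one patient within a single partition.

```python
import random

def train_val_test_split(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle a list of samples with a fixed seed and split it into
    train/validation/test partitions with the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = train_val_test_split(range(1000))
```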

Loss Function
Cross-entropy (CE)-loss functions are frequently used for two or more than two DL models. CE-loss, ∝ , is dependent on the probability of the AI model and the gold standard label 1 and 0 by and (1 − ), respectively, as shown in Equation (1).

Performance Metric
We have used true positive (TP), true negative (TN), false positive (FP), and false negative (FN) to estimate the various performance evaluation metrics. These are accuracy (ɳ) (Equation (2)), recall (Ɍ) (Equation (3)), precision (Ƥ) (Equation (4)), and F1-score (Ƒ) (Equation (5)). After calculating the accuracy of TL and EDL models, we calculated the mean accuracy of TL (ɳ ) in Equation (6) and the mean accuracy of EDL (ɳ ) in Equations (7) and (8). In these equations, "n" is the number of TLs, and "N" is the number of EDLs. Dice and Jaccard are also calculated based on Equations (9) and (10), where ƶ is a set of wanted items and ƴ set of found items. The probability curve ROC (receiver operating characteristics) and degree of separability AUC (area under the curve) have also been calculated for each model. In the standard deviation (σ), each value from the population is denoted by x¡ and µ, population mean. N is the size of the population.

Loss Function
Cross-entropy (CE)-loss functions are frequently used for two or more than two DL models. CE-loss, ∝ , is dependent on the probability of the AI model and the gold standard label 1 and 0 by and (1 − ), respectively, as shown in Equation (1).

Performance Metric
We have used true positive (TP), true negative (TN), false positive (FP), and false negative (FN) to estimate the various performance evaluation metrics. These are accuracy (ɳ) (Equation (2)), recall (Ɍ) (Equation (3)), precision (Ƥ) (Equation (4)), and F1-score (Ƒ) (Equation (5)). After calculating the accuracy of TL and EDL models, we calculated the mean accuracy of TL (ɳ ) in Equation (6) and the mean accuracy of EDL (ɳ ) in Equations (7) and (8). In these equations, "n" is the number of TLs, and "N" is the number of EDLs. Dice and Jaccard are also calculated based on Equations (9) and (10), where ƶ is a set of wanted items and ƴ set of found items. The probability curve ROC (receiver operating characteristics) and degree of separability AUC (area under the curve) have also been calculated for each model. In the standard deviation (σ), each value from the population is denoted by x¡ and µ, population mean. N is the size of the population.

Five Data Combinations
For the robust design of the classification system, we designed five types of data combination scenarios. This is based on training and testing data using taken from two countries-namely, Croatia and Italy.
 DC5: Training validation and testing on mixed data in which COVID CT scans from Croatia and Italy are mixed; the control of Italy was used.

Experiment 1: Transfer Learning Models using Lung Segmented Data
This experiment consists of running seven types of TL models-namely, Efficient-NetV2M, InceptionV3, MobileNetV2, ResNet152, ResNet50, VGG16, and VGG19-all pretrained on the ImageNet dataset for classification of segmented lung data into COVID vs. controls. The lung segmentation was conducted using ResNet-UNet, and segmented images were input for TLs. The experiment highlights the effectiveness of using TL for improving the accuracy of models on HDL segmented data. The lung segmented data were split with a ratio of 80, 10, and 10 for training, validation, and testing, respectively. Models ) (Equation (5)). After calculating the accuracy of TL and EDL models, we calculated the mean accuracy of TL (

Loss Function
Cross-entropy (CE)-loss functions are frequently used for two or more than two DL models. CE-loss, ∝ , is dependent on the probability of the AI model and the gold standard label 1 and 0 by and (1 − ), respectively, as shown in Equation (1).

Performance Metric
We have used true positive (TP), true negative (TN), false positive (FP), and false negative (FN) to estimate the various performance evaluation metrics. These are accuracy (ɳ) (Equation (2)), recall (Ɍ) (Equation (3)), precision (Ƥ) (Equation (4)), and F1-score (Ƒ) (Equation (5)). After calculating the accuracy of TL and EDL models, we calculated the mean accuracy of TL (ɳ ) in Equation (6) and the mean accuracy of EDL (ɳ ) in Equations (7) and (8). In these equations, "n" is the number of TLs, and "N" is the number of EDLs. Dice and Jaccard are also calculated based on Equations (9) and (10), where ƶ is a set of wanted items and ƴ set of found items. The probability curve ROC (receiver operating characteristics) and degree of separability AUC (area under the curve) have also been calculated for each model. In the standard deviation (σ), each value from the population is denoted by x¡ and µ, population mean. N is the size of the population.
) in Equation (6) and the mean accuracy of EDL (

Loss Function
Cross-entropy (CE)-loss functions are frequently used for two or more than two DL models. CE-loss, ∝ , is dependent on the probability of the AI model and the gold standard label 1 and 0 by and (1 − ), respectively, as shown in Equation (1).

Performance Metric
We have used true positive (TP), true negative (TN), false positive (FP), and false negative (FN) to estimate the various performance evaluation metrics. These are accuracy (ɳ) (Equation (2)), recall (Ɍ) (Equation (3)), precision (Ƥ) (Equation (4)), and F1-score (Ƒ) (Equation (5)). After calculating the accuracy of TL and EDL models, we calculated the mean accuracy of TL (ɳ ) in Equation (6) and the mean accuracy of EDL (ɳ ) in Equations (7) and (8). In these equations, "n" is the number of TLs, and "N" is the number of EDLs. Dice and Jaccard are also calculated based on Equations (9) and (10), where ƶ is a set of wanted items and ƴ set of found items. The probability curve ROC (receiver operating characteristics) and degree of separability AUC (area under the curve) have also been calculated for each model. In the standard deviation (σ), each value from the population is denoted by x¡ and µ, population mean. N is the size of the population.

Five Data Combinations
For the robust design of the classification system, we designed five types of data combination (DC) scenarios, based on training and testing data taken from two countries, namely Croatia (CroMED) and Italy (NovMED).
• DC1: Training, validation, and testing using both CroMED (COVID) and NovMED (control).
• DC2: Training, validation, and testing on both NovMED (COVID) and NovMED (control).
• DC3: Training and validation using CroMED (COVID) and NovMED (control) and testing on NovMED (COVID) and NovMED (control).
• DC4: Training and validation using NovMED (COVID) and NovMED (control) and testing on CroMED (COVID) and NovMED (control).
The ensemble stage uses soft-max voting on the predicted scores for the detection of COVID vs. control. We have also performed balancing and augmentation on the data.
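The fully specified combinations above can be encoded as a small configuration table; this is a sketch in which the cohort tags are illustrative placeholders for the actual CT-slice sets, and DC3/DC4 fall out as the unseen-data scenarios used later in Experiment 4:

```python
# Hypothetical string tags standing in for the actual CT-slice cohorts.
CRO_COVID, NOV_COVID, NOV_CONTROL = "CroMED(COVID)", "NovMED(COVID)", "NovMED(control)"

DATA_COMBINATIONS = {
    "DC1": {"train_val": {CRO_COVID, NOV_CONTROL}, "test": {CRO_COVID, NOV_CONTROL}},
    "DC2": {"train_val": {NOV_COVID, NOV_CONTROL}, "test": {NOV_COVID, NOV_CONTROL}},
    "DC3": {"train_val": {CRO_COVID, NOV_CONTROL}, "test": {NOV_COVID, NOV_CONTROL}},
    "DC4": {"train_val": {NOV_COVID, NOV_CONTROL}, "test": {CRO_COVID, NOV_CONTROL}},
}

def is_unseen(dc):
    """True when testing involves a cohort the models never trained on."""
    spec = DATA_COMBINATIONS[dc]
    return not spec["test"] <= spec["train_val"]
```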


Diagnostics 2023, 13, x FOR PEER REVIEW


Experiment 1: Transfer Learning Models using Lung Segmented Data
This experiment consists of running seven types of TL models, namely EfficientNetV2M, InceptionV3, MobileNetV2, ResNet152, ResNet50, VGG16, and VGG19, all pretrained on the ImageNet dataset, for the classification of segmented lung data into COVID vs. controls. The lung segmentation was conducted using ResNet-UNet, and the segmented images were input to the TLs. The experiment highlights the effectiveness of TL for improving model accuracy on HDL-segmented data. The lung segmented data were split with a ratio of 80:10:10 for training, validation, and testing, respectively. Models were saved after training and validation and later tested on the 10% test partition under the five input data combinations. These TLs then predict COVID vs. control.
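The 80:10:10 split described above can be sketched as a generic shuffle-and-slice helper (the actual pipeline operates on CT slice files; the seed makes the split reproducible):

```python
import random

def split_80_10_10(items, seed=0):
    """Shuffle and split a list of slices into train/validation/test (80:10:10)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # reproducible shuffle
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```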

Experiment 3: Effect of EDL Classification over TL Classification with Augmentation
This experiment shows the effect of EDL classification over TL classification on unaugmented and augmented data [99][100][101][102][103]. The mean EDL accuracy and mean TL accuracy were compared after balancing and augmentation.
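The paper's exact augmentation operators are not enumerated in this section; as an illustration, two operations commonly applied to CT slices (horizontal flip and 90-degree rotation) can triple the slice count per class:

```python
def hflip(img):
    """Horizontal flip of a 2-D slice represented as a list of rows."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate a 2-D slice 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]

def augment(dataset):
    """Add flipped and rotated copies of every (slice, label) pair."""
    out = []
    for img, label in dataset:
        out += [(img, label), (hflip(img), label), (rot90(img), label)]
    return out
```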

Experiment 4: Unseen Data Analysis
Here, we trained on one combination of data and tested on another. We analyzed the models' performance on unseen data to evaluate their generalizability [104][105][106][107][108][109][110][111]. The results showed that the models performed well on unseen data, indicating their potential for real-world applications. Input data scenarios DC3 and DC4 are examples of unseen data analysis.

Power Analysis
We calculated the sample size using the conventional method [118][119][120]. The formula for the sample size n is as follows:

n = (z*)² × p̂ × (1 − p̂) / MoE²  (11)

where z* is the z-score corresponding to the desired level of confidence, MoE is the margin of error (half the width of the confidence interval), and p̂ is the estimated proportion of the characteristic in the population. Using MedCalc software, we calculated the required values and substituted them into Equation (11). We need a sample size of at least 8100 to estimate the proportion of the characteristic in the population with a 95% confidence interval of 0.963 to 0.978 and an MoE of 0.0075.
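Equation (11) can be checked in a few lines; the z*, p̂, and MoE values below are illustrative textbook values, not the MedCalc run reported above:

```python
import math

def sample_size(z_star, p_hat, moe):
    """Eq. (11): n = (z*)^2 * p(1 - p) / MoE^2, rounded up to a whole subject."""
    return math.ceil(z_star ** 2 * p_hat * (1 - p_hat) / moe ** 2)

# Illustrative inputs: 95% confidence (z* = 1.96), conservative p = 0.5,
# 2% margin of error -> roughly 2401 samples.
n = sample_size(1.96, 0.5, 0.02)
```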

Results and Performance Evaluation
To verify both hypotheses, we conducted four experiments on five DC scenarios. ResNet-UNet, a hybrid deep learning model, was used to segment the raw data. CroMED (COVID), NovMED (COVID), and NovMED (control) raw images are shown along with their segmented images. We randomly selected four sample images from CroMED (COVID) and passed them through ResNet-UNet; the segmented outputs are placed below the raw images in Figure 7. Using the same approach, NovMED (COVID) and NovMED (control) are shown in the same figure. All five DCs utilized the seven transfer learning models and five ensemble deep learning models over CroMED (COVID), NovMED (COVID), and NovMED (control). The seven TLs are EfficientNetV2M, InceptionV3, MobileNetV2, ResNet152, ResNet50, VGG16, and VGG19; combining TL models with soft-voting ensemble methods generates the EDL models. Training accuracy and loss plots for the ResNet-UNet on each epoch are shown in Figure 8.

As shown in Figures 10 and 11, the correlation coefficient graph depicts the strength of the relationship between the ResNet-UNet output and the doctors' views, and the Bland-Altman (BA) plot shows the agreement between ResNet and UNet. After segmentation, the TL and EDL models use the segmented images for classification. We selected five scenarios for classification to prove our hypothesis. The evaluation metrics used to compare the models include mean accuracy (Mean ACC), standard deviation (SD), mean predicted score (Mean PR), area under the curve (AUC), p-value, precision, recall, and F1-score.

Results of Experiment 1: Transfer Learning Models using Lung Segmented Data
In Experiment 1, we performed the TLs operations using ResNet-UNet segmented data. Following are the detailed results for all five DC scenarios.

• DC1 results: Table 2 and Figure 12 show that the best accuracy of 97.93% without augmentation and 99.93% with augmentation is achieved by MobileNetV2. The mean accuracy of all seven TLs is 93.91% without augmentation and 97.03% with augmentation. For TL6 (VGG16), the accuracy improves from 90.20% (before augmentation) to 95.61% (after augmentation) using the DC1 data combination, an improvement of 5.41%. TL2 (InceptionV3) had accuracies of 93.60% (before) and 93.97% (after), an improvement of 0.37%. Therefore, augmentation has different effects on TL-based classifiers: it is more pronounced in TL6 than in TL2. Table 3 shows that the COVID precision is significantly increased or comparable after balancing the data.
• DC2 results: Figure 13 shows that the best accuracy without augmentation, 90.84%, is achieved by InceptionV3, and the best with augmentation, 93.92%, by EfficientNetV2M. The mean accuracy of all seven TLs is 84.41% without augmentation and 89.85% with augmentation. For TL4 (ResNet152), the accuracy improves from 78.16% (before) to 87.40% (after) using the DC2 data combination, an improvement of 11.82%. TL6 (VGG16) had accuracies of 85.6% (before) and 84.05% (after), so there was no improvement. Augmentation is thus more pronounced in TL4 than in TL6. Table 5 shows the effect of augmentation on COVID precision, recall, and F1-score; these are significantly increased or comparable after balancing the data.
• DC3 results: Table 6 and Figure 14 show that the best accuracies of 85.40% without augmentation and 91.41% with augmentation are achieved by EfficientNetV2M. The mean accuracy of all seven TLs is 72.90% without augmentation and 82.355% with augmentation. For TL5 (ResNet50), the accuracy improves from 67.17% (before) to 80% (after) using the DC3 data combination, an improvement of 19.10%. TL2 (InceptionV3) had accuracies of 67.58% (before) and 76.43% (after), an improvement of 13.09%. Augmentation is thus more pronounced in TL5 than in TL2. The augmentation and balancing effects are visible in Table 7, which shows that better results can be achieved after balancing the data.
• DC4 results: Figure 15 shows that the best accuracies of 69.40% without augmentation and 81.05% with augmentation are achieved by VGG19. The mean accuracy of all seven TLs is 47.85% without augmentation and 70.39% with augmentation. The augmentation effect was also visible with TL3 (MobileNetV2), which had the lowest accuracy of 27.97% before augmentation and 52.68% after augmentation, an improvement of 92.5%. Table 9 shows the augmentation effect on precision, recall, and F1-score.
• DC5 results: Table 10 and Figure 16 show that the best accuracy of 95.10% without augmentation is achieved by InceptionV3, and 95.28% with augmentation by VGG16. The mean accuracy of all seven TLs is 91.22% without augmentation and 93.76% with augmentation. For TL6 (VGG16), the accuracy improves from 86.81% (before) to 95.28% (after) using the DC5 data combination, an improvement of 9.75%. TL3 (MobileNetV2) had accuracies of 92.95% (before) and 89.07% (after), so there was no improvement. Augmentation is thus more pronounced in TL6 than in TL3. Table 11 shows that, in most TL models, precision, recall, and F1-score improve after balancing and augmenting the data.
Tables 3, 5, 7, 9 and 11 also show the precision comparisons for Experiment 1, verifying Hypothesis 2 that data augmentation improves the performance of the TL models. The p-value based on the Mann-Whitney test was used for all data combinations.

Results of Experiment 2: Ensemble Deep Learning for Classification
In Experiment 2, we performed the EDL operations for accurate classification of COVID and control. These EDLs are created using TL models. Following are the detailed results for all five DC scenarios.
• DC1 results: Table 12 and Figure 17 show that the mean accuracy of all EDLs is 95.05% without augmentation and 97.07% with augmentation.
• DC2 results: Table 13 and Figure 18 show that the mean accuracy of all EDLs is 87.63% without augmentation and 92.70% with augmentation.
• DC3 results: Table 14 and Figure 19 show that the mean accuracy of all EDLs is 75.88% without augmentation and 80.98% with augmentation.
• DC4 results: Table 15 and Figure 20 show that the mean accuracy of all EDLs is 59.99% without augmentation and 79.22% with augmentation.
• DC5 results: Table 16 and Figure 21 show that the mean accuracy of all EDLs is 93.39% without augmentation and 95.64% with augmentation.
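The soft voting that generates the EDLs from the TL outputs reduces to averaging the models' predicted probabilities per slice and thresholding the mean; a minimal sketch:

```python
def soft_vote(model_probs):
    """Soft voting: average the TL models' predicted COVID probabilities
    per slice and threshold the mean at 0.5."""
    n_models = len(model_probs)
    n_slices = len(model_probs[0])
    mean_p = [sum(m[i] for m in model_probs) / n_models for i in range(n_slices)]
    labels = ["COVID" if p >= 0.5 else "control" for p in mean_p]
    return labels, mean_p
```

For example, three models scoring a slice 0.9, 0.7, and 0.2 yield a mean of 0.6 and an ensemble label of COVID, even though one model disagrees.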

Results of Experiment 3: EDL vs. TL Classification with Augmentation
In Experiment 3, we verified the effect of augmentation in EDLs over TLs in all five DC scenarios. Figure 22 shows the results on unaugmented data, where we observed an accuracy improvement of 5.54% in EDLs over TLs. Similarly, Figure 23 shows an accuracy improvement of 2.82% in EDLs over TLs with balanced and augmented data. This verifies Hypothesis 1.

Results of Experiment 4: Unseen Data Analysis
In Experiment 4, we performed unseen data analysis. In the DC3 scenario, training was performed on CroMED (COVID) and testing on NovMED (COVID). Similarly, in the DC4 scenario, training was performed on NovMED (COVID) and testing on CroMED (COVID). As shown in Figures 22 and 23, we observed that even in unseen data analysis, both of our hypotheses hold.

The comparative graph of mean TL accuracy and mean EDL accuracy proves both of our hypotheses. First, the mean accuracy of EDLs is better than the mean accuracy of TLs. Second, balanced and augmented data give better results compared to those without augmentation. We have also presented the standard deviation, mean predicted score, AUC, and p-value for all input data scenarios. DC1, DC2, DC3, DC4, and DC5 TL models with data augmentation and balance improved mean accuracy by 3.32%, 6.56%, 12.96%, 47.1%, and 2.78%, respectively. Similarly, the five EDLs' accuracies increased by 2.12%, 5.78%, 6.72%, 32.05%, and 2.40%, respectively.


Receiver Operating Characteristics
We calculated the AUC from the ROC graphs of our models to check explainability.

Overall, the results show that deep learning models based on transfer learning and ensemble methods achieve high accuracy in detecting COVID-19. Among the transfer learning models, MobileNetV2 outperforms the other models in terms of accuracy and AUC in all five cases. In addition, the ensemble models show better performance than the individual transfer learning models. Similar to the TL ROCs, the EDL ROCs can also be generated; the AUC-ROC values of all EDLs for all five data combinations are reported in the tables of Section 4.2. The ROC for one of the data combinations, DC1 with augmentation, is depicted in Figure 29: at most operating points, the AUCs of the EDLs are better than or equal to those of their constituent TL models. The ROCs for data combinations two to five are given in Appendix C.
Figure 29. ROC of five EDLs using DC1 with augmentation.
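The AUC values reported throughout can be read off the ROC curve, or equivalently computed from the rank (Mann-Whitney U) statistic; a dependency-free sketch with made-up predicted probabilities (not the study's data):

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen COVID slice scores
    higher than a randomly chosen control slice (ties count as half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Illustrative predicted COVID probabilities (hypothetical numbers)
covid   = [0.95, 0.80, 0.70, 0.55]
control = [0.40, 0.30, 0.60, 0.10]
print(auc_from_scores(covid, control))  # 0.9375
```

An AUC of 1.0, as reported for the best ensemble, corresponds to every COVID slice scoring above every control slice.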

System Reliability Statistical Test
Paired t-test, Mann-Whitney, and Wilcoxon tests were performed to check the reliability of the system in all five input data scenarios (IDS). The p-value was less than 0.0001 in all five cases, showing that our proposed system is highly reliable for real-world applications. The tests were performed using Python and MedCalc software. The tables in the Results section also report the Mann-Whitney p-values for all TLs and EDLs using DC1, DC2, DC3, DC4, and DC5. Similarly, the paired t-test and Wilcoxon tests were also performed, and their summaries are given in Tables 17 and 18. Table 17 shows the three statistical tests (paired t-test, Mann-Whitney, and Wilcoxon) for the seven TL models (EfficientNetV2M, InceptionV3, MobileNetV2, ResNet152, ResNet50, VGG16, and VGG19). As seen in Table 17, all the TL models (TL1-TL7) exhibit p-values < 0.0001, clearly demonstrating the TL models' reliability and stability under the defined null hypothesis. Table 18 presents the three statistical tests (paired t-test, Mann-Whitney, and Wilcoxon) for the five EDL models (EDL1-EDL5). As seen in Table 18, all the EDL models exhibit p-values < 0.0001, clearly demonstrating the EDL models' reliability and stability under the defined null hypothesis. Note that our results are consistent with our previous studies [80,121-126].
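Assuming the tests compare paired per-fold scores of two systems, the three tests named above can be reproduced in Python with SciPy; the fold scores below are illustrative numbers, not the study's data:

```python
from scipy import stats

# Hypothetical paired per-fold accuracy scores for two systems
baseline = [0.71, 0.69, 0.73, 0.70, 0.68, 0.72, 0.74, 0.69, 0.71, 0.70]
proposed = [0.82, 0.80, 0.85, 0.81, 0.79, 0.83, 0.86, 0.80, 0.82, 0.81]

t_stat, p_t = stats.ttest_rel(proposed, baseline)     # paired t-test
u_stat, p_u = stats.mannwhitneyu(proposed, baseline)  # Mann-Whitney U
w_stat, p_w = stats.wilcoxon(proposed, baseline)      # Wilcoxon signed-rank

# With a consistent gap across folds, all three tests reject the null
print(p_t < 0.05, p_u < 0.05, p_w < 0.05)
```

With only ten folds, the exact Wilcoxon p-value cannot go below about 0.002, so the paper's p < 0.0001 figures imply larger per-slice or per-run sample sizes.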

Discussion
The proposed system has been trained on multicenter data using different acquisition machines and incorporated superior quality control techniques, class balancing using augmentation, and ResNet-UNet HDL segmentation. It uses seven types of TL classifiers and five types of EDL-based fusion to make accurate predictions. The model employed uniquely designed data systems, a generalized cross-validation protocol, and performance evaluation of HDL segmentation, TL classification, and EDL systems. It was also tested for reliability analysis and stability analysis and benchmarked against previous TL and EDL research work.

Principal Findings
Explainable transfer learning (TL) and ensemble deep learning (EDL) models accurately predicted the presence of COVID-19 in the Croatian and Italian datasets and justified both hypotheses. This architecture demonstrated the behavior of the models under data augmentation and balancing: TL accuracy with augmentation and balancing outperformed that without augmentation. Some TL and EDL models outperformed the benchmark in most cases in accuracy, precision, recall, F1-score, and AUC. The proposed method, which uses ResNet-UNet for segmentation and TL and EDL models for classification, is a promising approach for identifying COVID-19 in CroMED (COVID), NovMED (COVID), and NovMED (control). It is a novel approach that uses HDL segmentation for ensemble-based classification. Overall, these findings suggest that ensemble deep learning models can be useful tools for identifying COVID-19 and controlling its spread. The unseen-data analysis in data combinations three and four shows that this infrastructure could be used for real-world applications. These TL and EDL results have proven that, and the novelties can be summarized as (i) implementation of ResNet-UNet-based HDL segmentation; (ii) execution of seven types of TL classifiers and design of five types of EDL-based fusion; (iii) design of five types of data systems; (iv) generalized COVLIAS system design using unseen data; (v) testing for reliability; and (vi) testing for stability. The methods applied in this study have created an effective and robust system with better performance metrics than existing published models.

Benchmarking
We studied several papers and sorted some recent papers for benchmarking. These papers include the COVID-CT dataset and the SARS-CoV-2 dataset [127][128][129][130][131][132][133][134][135][136][137]. Our proposed models have used the CroMED (COVID), NovMED (COVID), and NovMED (control) datasets. Seven state-of-the-art transfer learning models, including DenseNet201, DenseNet169, DenseNet161, DenseNet121, VGG16, MobileNetV2, and EfficientNetV2M, have been used on the COVID dataset and compared with our best proposed model on MobileNetV2. We evaluated the models based on their accuracy, precision, recall, F1 score, p-value, and AUC and compared the results to the previous benchmark studies. Our experimental results showed that our proposed method, which used MobileNetV2 on Dataset 1 (CroMED (COVID) and NovMED (control)), outperformed all other models, with an accuracy of 99.99%, precision and recall of 100%, F1 score of 100%, and AUC of 1.0. The second-best model was DenseNet121 by Xu et al. [32], which achieved an accuracy of 99.44% on the COVIDx-CT 2A dataset. It is presented in Table 19. We have also compared our best TL with other existing models proposed by Alshazly et al. [28], Cruz et al. [45], Shaik et al. [30], and Huang et al. [31], who achieved accuracies of 92.9%, 82.76%, 97.38%, and 95.66%, respectively. Our results demonstrate the effectiveness of TL in developing accurate and efficient models for COVID-19 diagnosis using CT images. Our findings highlight the importance of using larger and more diverse datasets for training DL models for medical image analysis. Like the TL model comparison, we have also compared our proposed EDL models to state-of-the-art EDL models. We evaluated the models based on their accuracy, precision, recall, F1-score, and AUC and compared the results to the previous benchmark studies. 
The ensemble model, a combination of ResNet152 + MobileNetV2, outperformed all other models, with an accuracy of 99.99%, precision and recall of 100%, F1 score of 100%, AUC of 1.0, and a p-value of less than 0.0001. The second-best model, with an accuracy of 99.05% and an F1 score of 98.59%, was proposed by Toa et al. [35]. Other ensemble models, also shown in Table 20, perform considerably worse than our proposed model; they were proposed by Pathan et al. [33], Kundu et al. [34], Cruz et al. [29], Shaik et al. [30], Khanibadi et al. [138], Lu et al. [139], and Huang et al. [31]. We also performed scientific validation, which is missing in the other models. These TL and EDL results have proven that the novelties applied in this study (ResNet-UNet HDL segmentation, seven types of TL classifiers, five types of EDL-based fusion, five types of data systems, a generalized COVLIAS system design using unseen data, and reliability and stability testing) have created an effective and robust system with better performance metrics than existing published models.

A Special Note on EDL
Ensemble-based models can be effective in addressing some of the limitations and weaknesses in current published research work on COVID-19 and its control measures. Ensemble-deep-learning-based models combine multiple deep learning models to make more accurate predictions than any single model alone. This approach can improve the generalizability and robustness of predictions, which can be particularly useful in the context of COVID-19 research. Ensemble models succeed when the amalgamation of features or predicted scores improves accuracy; if there is bias in the data, however, good EDL performance is difficult to sustain.

Strengths, Weaknesses, and Extensions
The study compares seven transfer learning and five ensemble deep learning models in predicting the presence of COVID-19, providing a comprehensive evaluation of different approaches. This work uses data augmentation and balanced data to improve the performance of the models, which can be a valuable technique for improving model accuracy. Our research outperforms the benchmark results in most cases, indicating that the proposed models are effective in predicting the presence of COVID-19. The study only uses three datasets, CroMED (COVID), NovMED (COVID), and NovMED (control), which limits the generalizability of the results. It does not compare the proposed models to other COVID-19 prediction models that may have been developed outside of the benchmark studies. Future work could investigate the impact of other segmentation methods on the accuracy of the models. Transformers could also be added for segmentation and detection of COVID-19 [140][141][142][143][144]. While the system is generalized, it lacks the explainability of so-called explainable AI (XAI) models. The system also lacks the superposition of heatmaps on the lung CT images, which could indicate where COVID-19 lesions are present, especially using these TL models applied to the HDL-segmented lung outputs. Previous systems have used heatmaps [5,121,122], but not in the cascaded framework of HDL + TL + EDL in the multicenter paradigm. Since the field of immunology brings discussions on lung damage causing different kinds of pneumonia, the current paradigm of COVID/control binary classification can be extended to a multiclass framework. Our group has several studies which followed multiclass classification using an AI framework [145][146][147][148][149]. Our system can therefore be extended as we acquire clinical data for different kinds of pneumonia.

Conclusions
In this research work, we had two hypotheses. First, that mean TL accuracy with augmented data is better than without, which was proven in all five input data scenarios: the DC1, DC2, DC3, DC4, and DC5 TL models with data augmentation and balancing improved mean accuracy by 3.32%, 6.56%, 12.96%, 47.1%, and 2.78%, respectively. Second, that weaker learners would become stronger in the ensemble process and that mean EDL accuracy would exceed mean TL accuracy, which is visible in the performance evaluation. The explainable transfer learning models generated ROCs, which are useful for identifying better models. Three statistical tests showed p-values of less than 0.0001 for all models, indicating that the system is highly reliable. We also compared our results to the benchmark results on the COVID dataset: the ensemble model, a combination of ResNet152 and MobileNetV2, outperformed all other models, with an accuracy of 99.99%, precision and recall of 100%, F1 score of 100%, AUC of 1.0, and a p-value of less than 0.0001, while the second-best benchmark model has 99.05% accuracy and a 98.59% F1 score. Our findings have not only supported both hypotheses, but the proposed methodology also outperforms the benchmark performance indicators.
Several directions for future work remain. We used a soft-max voting method in the ensemble process; fusion of features before prediction is also an option. Further statistical tests would confirm system reliability, and a heatmap of the ensemble model could also be generated.

Conflicts of Interest:
The authors declare no conflict of interest. GBTI deals in lung image analysis and Jasjit S. Suri is affiliated with GBTI.
EfficientNetV2M, Figure A4, has 54.1 million parameters. By default, its input size is 480 × 480, and it is trained on the ImageNet dataset. It has a softmax activation function to classify one thousand classes. We removed the top layer; flattened the model output; and added three dense layers, two dropout layers, and L2 regularizers to avoid overfitting. The sigmoid activation function helps us to classify the COVID and control classes.
InceptionV3, Figure A5, has 23.83 million parameters. By default, its input size is 299 × 299, and it is trained on the ImageNet dataset. It has a softmax activation function to classify one thousand classes. We removed the top layer; flattened the model output; and added three dense layers, two dropout layers, and L2 regularizers to avoid overfitting. The sigmoid activation function helps us to classify the COVID and control classes.
MobileNetV2, Figure A6, has 4.3 million parameters. By default, its input size is 224 × 224, and it is trained on the ImageNet dataset. It has a softmax activation function to classify one thousand classes. We removed the top layer; flattened the model output; and added three dense layers, two dropout layers, and L2 regularizers to avoid overfitting. The sigmoid activation function helps us to classify the COVID and control classes.
ResNet50, Figure A7, has 25.56 million parameters. By default, its input size is 224 × 224, and it is trained on the ImageNet dataset. It has a softmax activation function to classify one thousand classes. We removed the top layer; flattened the model output; and added three dense layers, two dropout layers, and L2 regularizers to avoid overfitting. The sigmoid activation function helps us to classify the COVID and control classes.
ResNet152, Figure A8, has 60.19 million parameters. By default, its input size is 224 × 224, and it is trained on the ImageNet dataset. It has a softmax activation function to classify one thousand classes. We removed the top layer; flattened the model output; and added three dense layers, two dropout layers, and L2 regularizers to avoid overfitting. The sigmoid activation function helps us to classify the COVID and control classes.
VGG16, Figure A9, has 138.3 million parameters. By default, its input size is 224 × 224, and it is trained on the ImageNet dataset. It has a softmax activation function to classify one thousand classes. We removed the top layer; flattened the model output; and added three dense layers, two dropout layers, and L2 regularizers to avoid overfitting. The sigmoid activation function helps us to classify the COVID and control classes.
VGG19, Figure A10, has 143.7 million parameters. By default, its input size is 224 × 224, and it is trained on the ImageNet dataset. It has a softmax activation function to classify one thousand classes. We removed the top layer; flattened the model output; and added three dense layers, two dropout layers, and L2 regularizers to avoid overfitting. The sigmoid activation function helped us to classify the COVID and control classes.
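A rough NumPy sketch of the appended classification head described above (flattened backbone output → three dense layers → sigmoid), with hypothetical layer sizes and random weights; dropout and L2 regularization act only at training time and are therefore omitted from this inference-time sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical weights; in the study these layers are trained on top of
# an ImageNet backbone whose top (1000-class softmax) layer was removed.
W1, b1 = rng.normal(size=(128, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.normal(size=(64, 16)) * 0.1, np.zeros(16)
W3, b3 = rng.normal(size=(16, 1)) * 0.1, np.zeros(1)

def head(features: np.ndarray) -> float:
    """Flattened backbone features -> COVID probability in [0, 1]."""
    h = relu(features @ W1 + b1)   # dense layer 1
    h = relu(h @ W2 + b2)          # dense layer 2
    return float(sigmoid(h @ W3 + b3)[0])  # dense layer 3 + sigmoid

p = head(rng.normal(size=128))
assert 0.0 <= p <= 1.0  # sigmoid output is a valid probability
```

Thresholding this probability at 0.5 yields the binary COVID/control decision used by all seven TL models.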

Appendix C
In this section, Figures A11-A14 depict the ROCs of the EDLs using DC2, DC3, DC4, and DC5, respectively. The EDL ROCs using DC1 were already discussed in the ROC part of the Results section. These ROCs indicate that the mean AUC of the EDLs is better than that of the TL models.

Figure A10. VGG19 transfer learning model.
