Decoders configurations based on Unet family and feature pyramid network for COVID-19 segmentation on CT images

Coronavirus Disease 2019 (COVID-19) pandemic has been ferociously destroying global health and economics. According to World Health Organisation (WHO), until May 2021, more than one hundred million infected cases and 3.2 million deaths have been reported in over 200 countries. Unfortunately, the numbers are still on the rise. Therefore, scientists are making a significant effort in researching accurate, efficient diagnoses. Several studies advocating artificial intelligence proposed COVID diagnosis methods on lung images with high accuracy. Furthermore, some affected areas in the lung images can be detected accurately by segmentation methods. This work has considered state-of-the-art Convolutional Neural Network architectures, combined with the Unet family and Feature Pyramid Network (FPN) for COVID segmentation tasks on Computed Tomography (CT) scanner samples from the Italian Society of Medical and Interventional Radiology dataset. The experiments show that the decoder-based Unet family has reached the best (a mean Intersection Over Union (mIoU) of 0.9234, 0.9032 in dice score, and a recall of 0.9349) with a combination between SE ResNeXt and Unet++. The decoder with the Unet family obtained better COVID segmentation performance in comparison with Feature Pyramid Network. Furthermore, the proposed method outperforms recent segmentation state-of-the-art approaches such as the SegNet-based network, ADID-UNET, and A-SegNet + FTL. Therefore, it is expected to provide good segmentation visualizations of medical images.


INTRODUCTION
Since the end of 2019, the COVID-19 pandemic has affected human lives heavily. In addition to a considerable number of deaths, the economic losses caused by the COVID-19 pandemic are hard to estimate. Based on the report published by the World Bank, the world economy is estimated to decline by 4.3% in 2020 (https://www.worldbank.org/en/ news/press-release/2021/01/05/global-economy-to-expand-by-4-percent-in-2021-vaccine-deployment-and-investment-key-to-sustaining-the-recovery). Many observers commented that this significant decline only occurred during the Great Depression and two world wars. Since a pandemic is so damaging to health and society, it is worthwhile to gather great resources to fight it. According to experts, to combat the COVID-19 epidemic, the COVID-19 vaccine is the best solution. COVID-19 has also infected over 157 million cases and caused more than 3.2 million deaths according to World Health Organisation (WHO) (https://www.who.int/emergencies/diseases/novel-coronavirus-2019) across 223 countries, areas, or territories with cases. The number of deaths still increases thousands every day. Therefore, COVID-19 diagnosis improvement is essential not only for treatment but also for preventing its infection. Fortunately, we are in an era where science and technology have been walking a decisive step forward with tremendous achievements. This development has changed the world, and it also promotes and speeds up the enhancement of medicine and other fields.
Recent medical imaging analysis studies have revealed huge benefits with the current advancements in technology and have been considered an essential role in determining patients' condition and status. Achievements information technologies have greatly supported doctors during the diagnosis and treatment of patients. Also, there is a global trend in applying information technology to health care. Hospital information systems, clinical decision-making support systems, telemedicine, virtual reality, and health information highway have been developing and appearing globally so that public health care gets better and better. The government in various regions has developed national health information technology programs to computerize and digitize health records. Many research and application programs are implemented in hospitals and health facilities with hospital information systems (Ferdousi et al., 2020), communication systems (Nayak & Patgiri, 2021;Belasen et al., 2020), robot-based surgeon systems (Lee et al., 2021), and nursing care information systems (Booth et al., 2021). Medical records, images of x-ray, ultrasound, magnetic resonance imaging, positron-emission tomography become rich and diverse.
Modern medicine with information technology applications can make disease diagnosis faster based on various clinical symptoms and subclinical symptoms (subclinical diagnosis). In subclinical diagnostics cases, doctors usually evaluate and examine images generated and screened from medical imaging devices and equipment. Modern and high-tech medical machines with computer support software make the image clearer and more accurate with a very high resolution. The diagnostic imaging methods are diversified, such as radiological diagnosis, ultrasound imaging, ultrasound-color Doppler, endoscopic images (commonly used as gastrointestinal endoscopy and urinary endoscopy), Computed Tomography (CT) Scanner, Magnetic Resonance Imaging (MRI) and so on.
Image segmentation is to divide a digital image into various parts, which can be the collections of pixels or superpixels (Shapiro, 2001). The goal of image segmentation aims to simplify and or represent an image into something more meaningful and easier to analyze. In recent years, deep learning algorithms have provided great tools for medical segmentation, which plays an essential role in disease diagnosis and is one of the most crucial tasks in medical image processing and analysis. Diagnostic based on segmented medical images holds an essential contribution to improving accuracy, timeliness, and disease diagnosis efficiency (Saood & Hatem, 2021;Budak et al., 2021;Zhou, Canu & Ruan, 2020;Raj et al., 2021;Yakubovskiy, 2020). For example, for ultrasound images, the physician and doctors can accurately detect and measure the size of the solid organs in the abdomen and abnormal masses on the segmented areas in the images (Ouahabi & Taleb-Ahmed, 2021;Zhou et al., 2021). Another type of medical image is Chest x-ray (Rahman et al., 2020), where cancer tumors can be detected and segmented for surgeons and efficient treatment monitors. The CT scanner images marked abnormal regions also help physicians identify signs of brain diseases, especially identifying intracranial hematoma, brain tumors (Ramesh, Sasikala & Paramanandham, 2021;Munir, Frezza & Rizzi, 2020). Signs of the disease can be revealed via such segmented images, but sometimes these signs can be too small to be observed by humans. Moreover, in a short time, doctors may have many patients for diagnosis simultaneously. Besides, it takes so much time to train a doctor to perform medical image analysis. Artificial intelligence algorithms can outperform human ability in image classification and provide techniques to interpret the results (Geirhos et al., 2018). Therefore, leveraging artificial intelligence's development with segmentation image techniques is crucial to accelerate medicine advancement and improve human health. With an outbreak of the COVID-19 pandemic, it requires significant efforts from all citizens worldwide to stop the pandemic, but human resources seem insufficient. Technology-based medical approaches are necessary and urgent to support humans to reduce and prevent the pandemic, so algorithms on image processing for diagnosis of COVID-19 have attracted the attention of numerous scientists. The efficiency of the Unet family in image segmentation and outstanding image classification performance of well-known convolutional neural network architectures revealed in a vast of previous studies has brought a significant research question on the benefits of their combinations to enhance the performance in COVID-19 lung CT image segmentation. This study has leveraged well-known deep learning architectures as encoders and Unet family, Feature Pyramid Network techniques as decoders to produce segmentation on chest slices for supporting COVID-19 diagnosis. The principal contributions include as follows: Numerous configurations generated by combining five well-known deep learning architectures (ResNet,ResNeSt,SE ResNeXt,Res2Net, and Unet family are evaluated and compared to the state-of-the-art to reveal the efficiency in the COVID-19 lung CT image segmentation. Moreover, we also include Feature Pyramid Network (FPN), a famous architecture for segmentation tasks in configurations for comparison. The visualizations including segmented regions in lungs are examined with various metric performances and exhibit similar infected areas compared to the ground truths. Augmentation on the COVID-19 lung images dataset is performed with mirror, contrast, and brightness transforms. In addition, gamma and Gaussian noise are manipulated before adding spatial transforms. Such techniques help to enhance the segmentation performance. Besides, we evaluate the performance of COVID-19 segmentation without augmentation techniques. The training time and inference time are measured and compared among the considered configurations. From the obtained results, we found that the integration between SE ResNeXt and Unet ++ has revealed the best performance in COVID-19 segmentation tasks.
The rest of this study is organized as follows. 'Related Work' introduces the main related works. 'Method' presents a brief introduction of the segmentation network based on encoder-decoder architectures, the dedicated loss function for segmentation tasks, and augmentation techniques. Afterward, we present our settings, the public COVID-19 dataset, and the evaluation metrics in 'Evaluation'. 'Experimental Results' exhibits and analyzes the obtained results. We conclude the study and discuss future work in 'Conclusion'.

RELATED WORK
Machine learning for medical imaging analysis has gained popularity in recent years. Advancements in computer techniques have also been proposed with an increase in quality and quantity. To support doctors better, researchers have focused on model explanation and segmentation methods for medical images. 'Applications of deep learning in healthcare' examines the robust studies of deep learning in the healthcare sector. 'Applications of deep learning in COVID-19 detection' reviews the main related approaches in the domain of COVID-19 detection from CT images. Ravi et al. (2017) presented several robust applications of deep learning to health informatics. We have obtained benefits from rapid improvements in computational power, fast data storage, and parallelization, and so there are more and more efficient proposed models for health services. Furthermore, Srivastava et al. (2017) gave us an overview of recent trends and future directions in health informatics using deep learning. In another study, Huynh et al. (2020) introduced a shallow convolutional neural network (CNN) architecture with only a few convolution layers to perform the skin lesions classification, but the performance is considerable. The authors conducted the experiments on a dataset including 25,331 samples. For discriminating between melanoma and vascular lesion, the proposed model obtained an accuracy of 0.961 and an Area Under the Curve (AUC) of 0.874. Several studies deal with abnormality bone detection (Chetoui & Akhloufi, 2020;Varma et al., 2019) also revealed a promising result. For instance, Chetoui & Akhloufi (2020) developed a deep learning architecture based on EfficientNet  to detect referable diabetic and diabetic retinopathy on two public datasets, namely EyePACS and APTOS 2019. The proposed method achieved the highest AUC of 0.990 and 0.998 on EyePACS and APTOS 2019, respectively. Similarly, an approach to detect abnormalities on musculoskeletal images has been proposed by using CNN architecture. The authors collected a massive dataset of 93,455 radiographs.

Applications of deep learning in healthcare
The obtained AUC is recorded of 0.880, sensitivity and specificity of 0.714 and 0.961, respectively.
To recognize COVID-19 from chest CT images, a deep architecture named COVNet has been proposed by Li et al. (2020). Community-acquired pneumonia and healthy control are utilized in the testing phase and collected from six hospitals. AUC, specificity, and sensitivity report the performance. For detecting COVID-19, the proposed method obtained an AUC of 0.96, a specificity of 0.96, and a sensitivity of 0.90. Shi et al. (2021) presented a review on emerging artificial intelligence technologies to support medical specialists. The authors also stated that "image segmentation plays an essential role in COVID-19 applications".
Applications of deep learning in COVID-19 detection Saood & Hatem (2021) proposed an approach for image tissue classification by leveraging segmentation networks, namely SegNet and U-Net. The purpose of using both models is to distinguish the infected and healthy lung tissue. The networks are trained on 72 and validated on ten images. The proposed method obtained an accuracy of 0.95 with SegNet and 0.91 with U-Net. Empirically, the authors also stated that the mini-batch size affects the performance negatively. Budak et al. (2021) presented a new procedure for automatic segmentation of COVID-19 in CT images using SegNet and attention gate mechanism. A dataset with 473 CT images has been utilized as the evaluation data. The performance of the proposed method is judged based on Dice, Tversky, and focal Tversky loss functions. The authors reported that the obtained sensitivity, specificity, and dice scores are 92.73%, 99.51%, and 89.61%, respectively. Zhou, Canu & Ruan (2020) proposed an effective model to segment COVID-19 from CT images. In comparison to other existing studies (Budak et al., 2021), the model obtained comparable results. For each CT slice, the proposed method takes 0.29 s to generate the segmented results and obtained a Dice of 83.1%, Hausdorff of 18.8. However, the method is conceived to segment the single class and on a small dataset. A recent approach (Raj et al., 2021) leverages a depth network, namely ADID-UNET, to enhance the COVID-19 segmentation performance on CT images. The proposed method is evaluated on public datasets and achieved a 97.01% accuracy, a precision of 87.76%, and an F 1 score of 82.00%.

METHOD
This section contains five parts. 'Segmentation network architecture' describes systematically the complete architecture of segmentation models. We present the explanation of the encoders and decoder for a general segmentation architecture are presented in 'Efficient deep learning model-based encoder' and 'Decoders for segmentation network' respectively. Afterwards, the description of loss function and several data augmentation methods are explained in 'Loss function for segmentation task' and 'Data augmentation' respectively.

Segmentation network architecture
The overall system architecture is visualized in Fig. 1. We leveraged the segmentation model with architecture to discriminate the COVID-19 infections from the lung on the medical images, including the encoder and decoder. The trained weights are validated on a separated dataset to find the optimal weights during the training section. To segment lung and COVID-19 infection regions from medical images, we utilized the encoder built based on ResNet, ResNeSt, SE ResNeXt, and Res2Net, with the structures are presented in Table 1. Furthermore, the encoder's generated feature maps are handled by the Unet  family and FPN. Each encoder has a distinct feature map dimension. The input of the decoder needs to be adapted based on this dependence. Moreover, the encoders leverage the Imagenet pre-trained weights to reduce the computation cost during training. The medical images are pass into the input layer with the dimension of 512 × 512. It is forwarded through five different layers with complex structures to extract the most meaningful features. Afterward, the decoders control the critical stages by generating the masks, including the important regions. The final result obtained a 3 × 512 × 512 array, including three masks. The first mask consists of the non-lung nor COVID-19 regions. The second mask includes the lung's pixels, whereas the final mask reveals COVID-19 infection regions' information. Besides, the combination between the encoder and decoder architectures is presented in Table 2. We considered to use 20 experimental configurations, i.e., C i = 1, …, 20. In this study, we also present the process diagram of the segmentation model in Fig. 2. The input takes medical images where the output generates two masks consisting of lung and infection regions. Ronneberger, Fischer & Brox (2015) and Long, Shelhamer & Darrell (2015) demonstrated that the improvement of segmentation tasks recently relied more on the encoder-decoder than other architectures. Furthermore, this architecture was inspired by the Convolutional Neural Network (LeCun et al., 1989) and added the decoder network, which effectively tackled the pixel-wise prediction.

Efficient deep learning model-based encoder
In an attempt to improve the performance of image segmentation task on medical images, we leveraged the advantages of numerous pre-trained models as the encoder, which are modern architectures and have impressive performances on classification tasks, such as EfficientNet , ResNet (He et al., 2016), or ResNeSt . The responsibilities of the encoder are learning features and providing the initial low-resolution representations. The segmentation architecture will refine the encoder's outputs, referred to as the decoder network. To avoid the vanishing gradient problem (Kolen & Kremer, 2009) and retrieve the fine-grained information from the previous layer. The skip connections are utilized between the encoder and decoder network or between the encoder and decoder network layers. Assuming x denotes the input, the expected underlying mapping obtaining by training is f(x), the block within the dotted-line box demands to apply the residual mapping f(x) − x. In the case of f(x) = x, the identity mapping is the desired underlying mapping. The weights and biases of the block within the dotted-line box need to be set at 0.
In the scope of this study, we leveraged ResNet, ResNeSt, SE ResNeXt, and Res2Net as the encoders to generate the features map and transfer the features to the decoder network segmenting the COVID-19 regions on medical images. ResNeSt is a new network inherited from ResNet with an attention mechanism, performed promising results on image classification and segmentation tasks. In this architecture, the feature maps are split into G groups where G = #Cardinality × #Radix. The introduction of Cardinality is presented from Resnext (Xie et al., 2017), which repeats the bottle-neck blocks and breaks channel information into smaller groups, whereas Radix represents the block of Squeeze-and-Excitation Networks (SENet) (Hu, Shen & Sun, 2018). In summary, this architecture combines the Cardinality of Resnext and Attention of Squeeze and Excitation Networks to formulate the Split Attention. In other words, Split Attention is the modification of the gating mechanism. Recently, Gao et al. (2021) proposed a novel building block for CNN constructs hierarchical residual-like connections within one single residual block, namely Res2Net. In other words, the bottle-neck block of the ResNet architecture is re-designed and contributes to increasing the range of receptive fields for each network layer by representing multi-scale features at a granular level. By leveraging kernel size of 7 × 7 instead of 3 × 3, the computation of multi-scale feature extraction ability is enhanced but achieved a similar cost.

Decoders for segmentation network
The emergence of artificial intelligence and especially Convolutional Neural Network architectures in computer vision brings the field of image processing to light. Once considered untouchable, several image processing tasks now present promising results like image classification, image recognition, or image segmentation. The image segmentation task's primary purpose is to divide the image into different segment regions, representing the discriminate entity. Compared with classification tasks, segmentation tasks require the feature maps and reconstructing the feature maps' images. In this study, we leveraged the advantages of several CNN-based architectures as the decoder of segmentation models.

Unet family
We use the implementations of Unet (Ronneberger, Fischer & Brox, 2015) with the architecture includes three sections, namely the contraction, the bottle-neck, and the expansion section. Generally, the contraction section is the combination of several contraction blocks. Each block consists of 3 × 3 convolution and 2 × 2 max-pooling layers. The number of feature maps gets double at each max-pooling layer. It helps the architecture learn complex patterns effectively. Furthermore, the kernels of size 3 × 3 are widely used as filters for the widespread deep neural networks (Chollet, 2017;. Besides, the model's performance also depends on the size of kernels and improves the efficiency of capturing high-resolution images. Similar to contraction blocks, the bottle-neck layers also consist of the 3 × 3 convolution but followed by 2 × 2 up convolution layers. The most crucial section of Unet architecture is the expansion section. This section consists of several expansion blocks, and the number of expansion blocks should be equal to the number of contractions. Each block also contains 3 × 3 convolution and 2 × 2 up convolution layers, half of the feature maps after each block are leveraged to maintain symmetry. Moreover, the feature maps corresponding with the contraction layers also include the input. Hence, the image will be reconstructed based on the learned features while contracting the image. To produce the output, a 1 × 1 convolution layer is utilized to generate the feature maps with the number equal to the desired segments. Also, Zhou et al. (2018), Zhou et al. (2019) proposed an Unet-like architecture, namely Unet++. The advantages of Unet++ can be considered as capturing various levels of features, integrating features, and leveraging a shallow Unet structure. The discriminations between Unet and Unet++ are the skip connection associating two sub-networks and utilizing the deep supervision. The segmentation results are available at numerous nodes in the structure of Unet++ by training with deep supervision. Another Unet-like architecture has been proposed by Guan et al. (2020) for detaching artifacts from 2D photoacoustic tomography images. We leverage the Unit, Unet++, and Unet2d as the decoder architectures to conduct the experiments. When compared with original Unet architecture, Unet++ and Unet2d have been built to reduce the semantic gap between the feature maps and efficiently learn the global and local features.

Feature pyramid network
Like Unet, FPN (Lin et al., 2017) is also a famous architecture appropriate to segmentation tasks. FPN is a feature extractor with a single-scale image of a stochastic dimension and generates the feature maps with proportional size. The primary purpose of FPN is to build feature pyramids inside convolutional neural networks to be used in segmentation or object recognition tasks. The architecture of the FPN involves a bottom-up pathway and a top-down pathway. The bottom-up pathway defines a convolutional neural network with feature extraction, and it composes several convolution blocks. Each block consists of convolution layers. The last layer's output is leveraged as the reference set of feature maps for enriching the top-down pathway by lateral connection. Each lateral connection merges feature maps of the same spatial size from the bottom-up and top-down pathways. Thus, FPN architecture consists of a top-down pathway to construct higher resolution layers from a semantic-rich layer. To improve predicting locations' performance, we deploy the lateral to connect between reconstructed layers, and the corresponding feature maps are utilized due to the reconstructed layers are semantic strong, whereas the locations of objects are not precise. It works similarly to skip connections of ResNet.
Furthermore, the spatial resolution decreases as going up and detecting more high-level structures, the semantic value for each layer can potentially increase. Nevertheless, the bottom layers are in high resolution but can not be utilized for detection due to the semantic value is unsuitable for justifying the slow-down training computation based on it. By applying the 1 × 1 convolution layer at the top-down pathway, the channel dimensions of feature maps from the bottom-up pathway can be reduced and become the top-down pathway's first feature map. Furthermore, element-wise addition is applied to merge the feature maps, the bottom-up pathway, and the top-down pathway.

Loss function for segmentation task
In deep learning and computer vision, the boundary detection definition comes from extracting features to produce significant representations of the objects. More specifically, boundary detection aims to identify object boundaries from images. Therefore, we can consider boundary detection as segmentation problems and target the boundaries to 1 and the rest of the image to 0 as the label. Thus, the loss function can be formulated with the classical function, namely Cross-entropy or Hinge loss. However, in terms of segmentation tasks, the classical loss function models can work imperfectly due to the highly unbalanced label distribution of each class and the per-pixel intrinsic of the classical loss function.
To enhance segmentation models' performance, we considered using Dice loss (Milletari, Navab & Ahmadi, 2016) originates from the Dice coefficient. We will introduce more details about the Dice coefficient in 'Metrics for comparison'. The ground truth and predicted pixels can be considered as two sets. By leveraging Dice loss, the two sets are trained to overlap little by little, and the reduction of Dice loss can be obtained when the predicted pixels overlap only the ground-truth pixels. Furthermore, with Dice loss, the total number of pixels at the global scale is investigated as the denominator, whereas the numerator pays attention to the overlap between two sets at the local scale. Hence, the loss of information globally and locally is utilized by Dice loss and critically improves accuracy. Moreover, in thin boundaries, the model utilizing Dice loss can achieve better performance than others.

Data augmentation
Deep learning requires more data to improve classification and regression tasks, although it is not easy to collect needed data. Augmentation techniques was introduced in the work of Van Dyk & Meng (2001) More specific, data augmentation is the technique to abound the number of samples in the dataset by modifying the existing samples or generating newly synthetic data. Leveraging the advantages of the augmentation approach can help reduce over-fitting during the training section. In terms of image segmentation, the most general techniques for data augmentation are adjusting brightness or contrast, zoom in/out, cropping, shearing, rotation, noise, or flipping.
In this study, we adapted several techniques as follows. Firstly, the outlier will be cut off from the original image, and we set 10% and 90% for the cut-off lower and upper percentiles, respectively. Then, the low resolution of the cut-off outlier image is simulated. We also applied the mirror, contrast, and brightness transform. Furthermore, gamma and Gaussian noise are utilized before adding spatial transforms. Figure 3 visualizes the original medical image, and the image leveraged data augmentation techniques.

EVALUATION
In this section, we present our experimental configurations in 'Settings for experiment'. Furthermore, the research information referred as the COVID-19 dataset and the considered evaluation metrics are explained in 'Dataset' and 'Metrics for comparison' respectively.

Settings for experiment
We implemented the trained models built based on Pytorch framework (https://pytorch. org). We also accelerated the training section by utilizing the weights trained on the ImageNet dataset (Russakovsky et al., 2015). The processing includes three main phases: training and validation, as described in Fig. 1. For the first stage, the medical images dataset is split into training and validation sections. Afterward, we evaluated several segmentation architectures and stored internal parameters with the highest performance. A robust computational resource is required for the segmentation model. Thus, we used a server with configurations listed in Table 3 to conduct our experiments.

Dataset
We investigated our approach's performance on the COVID-19 segmentation dataset from the Italian Society of Medical and Interventional Radiology (http://medicalsegmentation.com/covid19/). The dataset includes 829 slices belonging to nine axial volumetric CTs. Furthermore, the experienced radiologist has evaluated, segmented, and marked as COVID-19 on 373 out of the total of 829 slices. The medical images have been transformed to greyscaled with 630 × 630 and in NIFTI file format. The segmented labels include the infection masks but also lung masks. Therefore, it could be more attractive for performing segmentation tasks on this dataset. Figure 4 visualizes a sample of the dataset; the left CT slide presents the original image of a COVID-19 patient, the right image includes the lung, and the infection region visualizing by blue and orange color, respectively.
The further information of the dataset is detailed in Table 4. The dataset includes 829 total samples (infection masks, lung masks, and images). However, the number of

Metrics for comparison
In this study, we considered using two metrics to evaluate the current approach's performance, namely dice (Sørensen, 1948;Dice, 1945) and Jaccard coefficients. Intuitively, the segmentation performance is measured by evaluating the overlap between the predictions and the ground-truth object. The results with more overlap regions with the ground truth reveal better performance than those with fewer overlap regions. Both Dice and Jaccard indices are in the range between 0 and 1. We assume that A and B are prediction and ground-truth masks, respectively, for a given class. If A and B match perfectly, the value of both Dice and Jaccard indices is equal to one. Otherwise, A and B are no overlap, and the value is equal to 0. Thus, the Dice coefficient can be evaluated by 2 × the area of overlap divided by the total number of pixels in both masks as in Eq. (1).  Like the Dice coefficient, the Jaccard index is also one of the most common metrics in image segmentation tasks. Jaccard index can be referred to as Intersection-Over-Union (IoU). Generally, the IoU is the overlap regions between the predicted and the groundtruth mask divided by the union region between the predicted and the ground truth mask. In terms of binary or multi-class segmentation, we need to calculate the IoU of each class and average them to achieve the mean IoU. The Jaccard index or IoU can be calculated as in Eq. (2). Besides, the Precision, Recall, and F 1 score are also utilized to evaluate the segmentation performance.
We also consider the following definitions: True Positive (TP) reveals the number of positives. In other words, the predictions match with the ground-truth label. True Negative (TN) indicates the predictions do not belong to the ground-truth and are not segmented. False Positive (FP) demonstrates the predicted masks unmatch with the ground-truth masks. False Negative (FN) expresses the predictions belong to the ground-truth, but it is not segmented correctly.
Furthermore, in terms of the confusion matrix, the Dice and IoU equation can be rephrased as in Eqs. (3) and (4). We also present the measuring of the segmentation errors in Fig. 5. Dice The formula of computing Dice is relevant to the F 1 score. In other words, the Dice and F 1 achieve the same value in comparison with each other. Moreover, Fig. 6 visualizes in detail the differences between Dice and Jaccard/Iou indices. In Figure 6, the left image represents the Dice coefficient, whereas the right image exhibits the Intersection over Union between the predicted mask and the ground-truth mask.

EXPERIMENTAL RESULTS
In this section, we present in detail our experimental results. 'Segmentation performance on the medical image dataset' presents the segmentation performance of the configurations introduced in Table 1. All the models are trained over 100 epochs, and the model that achieved the best performance will be stored for inferring purposes. Afterward, the discussion of the proposed methods and the other systems is presented in 'Benchmark'.

Segmentation performance on the medical image dataset
We describe the results of both cases, non-augmented and augmented data in Tables 5 and 6, respectively. More specifically, the IoU, mean IoU (mIoU), Precision, Recall, and F 1 score of our experimental configurations are reported to express the performance when applying data augmentation techniques and vice versa. Table 5 reveals the results of inferring the trained model without using data augmentation techniques. The architectures built based on the Unet decoder family obtained better results compared to FPN. Among the proposed configurations, C 9 acquires the best mIoU of 0.9262, whereas C 7 gets 0.9259 in the second place. In terms of the Dice coefficient, C 9 achieves 0.9016 for the COVID-19 category, being the best architecture. Considering Precision and Recall, C 9 earns 0.9129, 0.8906, and gives a comparable performance to the others. Table 6 presents the performance of our experimental configurations. The proposed networks are trained under data augmentation techniques. Overall, almost the configuration with decoder-based Unet family (C 1 to C 15 ) achieved mIoU over 0.9, whereas the architecture-based FPN obtained approximately at 0.8. Between the configurations, C 6 achieves the best performance with the mIoU obtained of 0.9283. Configuration C 6 represents the combination of ResNet and Unet2d. Furthermore, the second place gets the mIoU of 0.9234 with configuration C 13 trained with SE ResNeXt and Unet++ model as the encoder and decoder.
As for lung and COVID-19 segmentation, the average IoU achieves 0.8 on overall configurations for COVID-19 infection regions and 0.94 for lung masks. The configurations of C 6 and C 13 exhibit promising performance on the COVID-19 segmentation task with obtained IoU of 0.8241 and 0.8234. Meanwhile, the configurations Table 5 The experimental results in details of various configurations described in Table 2. The data augmentation techniques are not utilized to perform the experiment.

Configuration
C 1 C 2 C 3 C 4 C 5 C 6 C 7 C 8 C 9 C 10 of C 13 and C 14 get the maximum F 1 score for the COVID-19 category with an obtained value of 0.9032 and 0.9031. By examining Precision and Recall, C 6 acquires the best precision performance, whereas C 13 gains the maximum Recall. Figure 7 depicts the confusion matrix for configuration C 13 . We also normalize the confusion matrix over rows for analyzing purposes. The confusion matrix values reveal that most of the misjudgments of COVID-19 infection regions are categorized as lung and vice versa. The comparison of not using data augmentation and using data augmentation reveals the architectures with FPN-based tend to be more effective when not utilizing augmentation techniques. It can be demonstrated by the results of C 16 -C 18 . By the Unet family decoder, the outcomes depict that the models with nearly equivalent performance, i.e., C 2 , C 4 , C 8 , C 10 , and C 12 . The most discriminate configurations are C 11 and C 14 . The performance of C 11 and C 14 are strongly boosted by applying data augmentation techniques. Furthermore, C 11 and C 14 also obtain better performance. In Table 6 Experimental results in details with various configurations of the architectures mentioned and described in Table 2 Table 7. The training presents the total time needs for 100 epochs, whereas the inference expresses the average time to segment each slice. With the Unet and FPN decoder, the architectures are trained with lower computation costs than the rest. The least and most time-consuming configurations for training/inference are C 16 and C 13 , respectively. As observed from the results, with the same encoders, architectures FPN decoder-based segment fastest for each slice.
We also visualize several samples with ground truth and the prediction masks in and Fig. 8. The lung is colored by slate blue and orange for COVID-19. As observed from the results, our predictions express promising segmentation performance. In terms of complex COVID-19 infection regions as in Figs. 8B, 8C, or 8E, the boundary of COVID-19 segmentation is equivalent to the corresponding ground-truth. For the different view as in Figs. 8A and 8D, the interesting regions are segmented quite correctly. Meanwhile, the lungs are also produced identically in comparison with the ground truth. Table 7 The report (in second(s)) of training time (1) and inference time (2)   Benchmark

Lung segmentation
The comparison of lung segmentation is presented in Table 8. As observed from the results, our configurations with different decoders have outperformed the approach by Saood & Hatem (2021). More specifically, the work of Saood & Hatem (2021) obtained a Dice of 0.7490 and Sensitivity of 0.9560 with the SegNet method, while our best configuration, C 13 , achieved a maximum Dice of 0.9748 and Sensitivity of 0.9755. In particular, by leveraging SE ResNeXt and Unet++ architecture, we get the maximum score compared to the others. Table 9 reveals the the compared performance of the proposed approach with the state-ofthe-art methods including SegNet (Budak et al., 2021), Unet (Zhou, Canu & Ruan, 2020) with attention mechanism, and ADID-UNET (Raj et al., 2021). We compared our configurations with the others on Dice, Sensitivity, and Precision values. Our approach achieves a better performance in terms of Dice, Sensitivity, and Precision concerning three other methods. Specifically, the study of  Dice of 0.9032, Sensitivity of 0.9349, and Precision of 0.8711. Table 9, the architecture based on Unet++ decoder is the best segmentation model among the others. Our configurations get a promising performance compared with the others.

CONCLUSION
In this study, we systematically presented a viable solution for lung and COVID-19 segmentation from CT images. The proposed model is implemented based on the convolutional neural networks, i.e., ResNet, ResNeSt, SE ResNeXt, Res2Net, or EfficientNet decoder-based Unet family and Feature Pyramid Network as the encoders and decoders of segmentation models. We evaluated the proposed method by the open segmentation dataset with numerous model structures. The experimental results reveal that the model with the decoder-based Unet family obtained better performance than FPN. Furthermore, the segmentation results are compared with the ground truth annotated by an experienced radiologist and exhibit promising performance. More specifically, the best architecture obtained a mIoU of 0.9234, 0.9032 of F 1 -score, 0.8735, and 0.9349 of Precision and Recall, respectively. Also, segmenting the minimal infection regions still challenges us due to their size and ambiguous regions.

Grant Disclosures
The following grant information was disclosed by the authors:

Data Availability
The following information was supplied regarding data availability: This study's experimental scripts and code are available at GitHub: https://github.com/ nthai-cit/covid-seg.
The data that support the findings of this study are available at https://medicalsegmentation.com/covid19/.