Robust weakly supervised learning for COVID-19 recognition using multi-center CT images

The world is currently experiencing an ongoing pandemic of an infectious disease named coronavirus disease 2019 (COVID-19), which is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Computed tomography (CT) plays an important role in assessing the severity of the infection and can also be used to identify symptomatic and asymptomatic COVID-19 carriers. With the surge in the cumulative number of COVID-19 patients, radiologists are under increasing pressure to examine CT scans manually. An automated 3D CT scan recognition tool is therefore in high demand, since manual analysis is time-consuming and radiologist fatigue can lead to misjudgment. However, because the technical specifications of CT scanners differ between hospitals, the appearance of CT images can vary significantly, which causes many automated image recognition approaches to fail. The multi-domain shift problem in multi-center and multi-scanner studies is therefore nontrivial, and solving it is crucial for dependable recognition and for reproducible and objective diagnosis and prognosis. In this paper, we propose a COVID-19 CT scan recognition model, the coronavirus information fusion and diagnosis network (CIFD-Net), which handles the multi-domain shift problem efficiently via a new robust weakly supervised learning paradigm. Our model resolves the problem of differing appearance in CT scan images reliably and efficiently while attaining higher accuracy than other state-of-the-art methods.


Introduction
The pandemic of coronavirus disease 2019 (COVID-19) is spreading all over the world rapidly. The number of infections is growing exponentially in different regions, which has triggered great health concerns in the international communities. One of the [...] learning based methods have been developed efficiently for the chest X-ray/CT data analysis and classification [2][3][4], and these approaches can achieve state-of-the-art performance on X-ray/CT image diagnosis and prognosis.

Fig. 1. [...] The histograms of these CT images. Comparing the images from Hospital A and Hospital D, it is clear that their brightness levels differ. Moreover, the contrast of the data collected from the China Consortium of Chest CT Image Investigation (CC-CCII) dataset is considerably different from that of CT images acquired from other hospitals. The bottom-right panel shows the distribution of the images from different hospitals after normalization; however, these distributions still differ. It is of note that there are no visually distinctive features across CT scan images, and it is easy for human radiologists to classify them correctly despite CT scanner changes. On the contrary, deep learning based automated methods may fail to generalize across CT images acquired from different hospitals.
Nevertheless, most CT scan datasets for COVID-19 only contain CT volumes, each a set of CT slices, with only patient-level annotations provided (i.e., patient-level class labels) indicating whether the patient is infected or not. Per-slice labels are lacking because annotating each slice is labor-intensive and time-consuming for radiologists. It has been reported that it could take an experienced radiologist about 21.5 min [5] to analyze and label one whole CT volume. Consequently, convolutional neural network (CNN) based deep learning models trained on CT slices with only the patient-level label can perform poorly, because some annotations of these CT slices are incorrect (e.g., non-lesion slices of the lung are falsely labeled as positive), which makes the training data noisy.
Yet another challenge when applying deep learning methods to medical image recognition is data distribution shift (a.k.a. multi-domain shift). Data distribution shift refers to the phenomenon that a common object or organ imaged under various scenarios (e.g., different machine vendors and sequence parameters) can result in vastly different data distributions. Therefore, models trained under empirical risk minimization (ERM) [6] may fail to generalize, because ERM assumes that training and testing data are sampled from the same or similar distributions and domains. In the data distribution shift scenario, this assumption is violated.
When a neural network is trained with images from one domain and tested on another domain (i.e., distinct imaging scenarios), the recognition performance often degrades dramatically. Fig. 1 shows CT images collected from different hospitals. In the figure, it can be observed that CT data obtained from different hospitals are visually different although they all present image slices of the lung. This is because every hospital uses different protocols and parameters for its CT scanners when acquiring patient images. Therefore, the multi-domain shift problem of multi-center and multi-scanner studies is nontrivial. Solving the multi-domain shift problem is crucial to achieving dependable recognition, which is critical for reproducible diagnosis and prognosis.
To cope with the issues above, in this work we train our model at both the patient level and the image level with multiple domain information. In particular, we consider the sequential information within the CT volume when predicting whether a patient is COVID-19 positive or not. To preserve the sequential information, we divide a lung CT volume into individual sections from the upper lobe all the way to the inferior lobe. As illustrated in Fig. 2, our method aggregates these sections as the representation of a patient. When aggregating these sections, we utilize the multiple instance learning method with the k-max selection strategy for images in each section. With the help of the k-max selection, our model can filter out uncertain and noisy images, which helps it make an accurate prediction. Moreover, the multiple instance learning method forces our model to mine confident candidates for training and testing [7] while modeling the joint distribution of sections from the patient rather than a single image, which is rewarding for unseen center prediction.
In summary, our contributions are mainly three-fold:
• We propose a weakly supervised learning based multi-domain information fusion framework for automated COVID-19 diagnosis from multi-center and multi-scanner CT scans that only requires patient-level annotations for training.
• We propose a novel noisy label correction technique that propagates the patient-level predictions to individual slices and identifies the COVID-19 infected slices accurately.
• We develop a slice aggregation module to alleviate the data distribution shift problem, which is essential for the deployment of the developed model in the real-world scenario.
Validated on the China Consortium of Chest CT Image Investigation (CC-CCII) [8] benchmark dataset, our proposed coronavirus information fusion and diagnosis network achieves superior performance compared to state-of-the-art models at both the patient level and the image level.

Related work
Before the COVID-19 pandemic, a large number of deep learning based methods had been proposed for lung cancer CT image analysis. In this research area, there have been great achievements, culminating in the development of many end-to-end pipelines for lung cancer diagnosis, classification, treatment planning, and prognostic evaluation [10][11][12][13][14][15][16]. In the treatment of interstitial lung disease (ILD), deep learning approaches have also been developed [17][18][19][20]. In CT scans of COVID-19 patients, image characteristics, e.g., ground glass opacity and/or consolidation, are akin to those observed in CT scans of lung cancer and ILD patients. Therefore, in the design of COVID-19 detection algorithms using CT images, insights from research on both lung cancer and ILD are significant and there is a clear translatability to the COVID-19 studies.

Fig. 2. The architecture of our proposed CIFD-Net. $P(c \mid S_i)$ denotes the probability of the section $S_i$, and $P(c \mid P)$ represents the probability that the patient is COVID-19 positive or not. $Q \in \mathbb{R}^{2 \times 2 \times C}$ indicates the noise transaction from the probability of the true label $P(y_c \mid I)$ to the probability of the noisy label $P(z_c \mid I)$. In addition, $\phi(\cdot)$ is a feature embedding function, and ResNet-50 [9] is adopted as the backbone network.
CNNs for visual recognition. Convolutional neural networks (CNNs) have been widely used in medical diagnosis systems [3,21,22]. Recently, many COVID-19 recognition algorithms based on artificial intelligence, and on CNNs in particular, have been proposed. Comprehensive reviews of artificial intelligence assisted COVID-19 detection and diagnosis can be found elsewhere [23][24][25][26][27]; here we only summarize the most relevant studies.
Jin et al. [28] developed a combined segmentation-classification model for COVID-19 diagnosis. A few pre-trained models were tested, e.g., fully convolutional network (FCN-8s), U-Net, V-Net, and 3D U-Net++, as well as classification models like dual path network (DPN-92), Inception-v3, residual network (ResNet-50), and attention ResNet-50, from which the 3D U-Net++ and ResNet-50 combination achieved the best performance. However, it was unclear which layers were pre-trained and which were re-trained, so the reproducibility of this study is uncertain. Wang and Wong [3] proposed COVID-Net, which stacked multiple convolutional blocks with dilated convolution to recognize chest X-ray images. Li et al. [2] explored the patient label and used a max-pooling strategy over features extracted by the CNN from a set of slices to make the prediction. In addition, Ouyang et al. [4] deployed a 3D CNN and used the residual learning mechanism to build the network, which incorporated the depth information of the CT volumes. Shan et al. [29] proposed a human-in-the-loop strategy for infection region quantification, in which a modified V-Net was developed incorporating bottleneck building blocks to reduce training costs. The human-in-the-loop training procedure output a segmentation for subsequent manual correction by radiologists, and the corrected data were then fed back to re-train the network iteratively. More recently, Hu et al. [30] proposed a weakly supervised multi-scale learning framework for COVID-19 classification and lesion detection, which demonstrated promising results, but its performance may be hindered by the use of patient-level labels that contain noisy labels.
For automatic prognostication of COVID-19 patients, Huang et al. [31] developed a two-step segmentation model that extracted the lung and lobe regions, followed by pneumonia segmentation. Both steps used separate U-Nets, and at least two follow-up scans were analyzed for each patient. The authors found significant differences in lung opacification percentage between the initial and the first follow-up scans, but not between the first and the second follow-up scans. Although the study findings are intriguing, the study has been criticized for lacking important information essential to reproducibility [32].
Multiple instance learning. Multiple instance learning (MIL) is a weakly supervised learning problem that has been studied in several settings, including weakly supervised object localization [7], video anomaly detection [44], weakly supervised image segmentation [45] and others. In the MIL framework, a bag can be defined as a set of instances or image slices. Positive bags are assumed to contain at least one instance from a certain category, and negative bags do not contain any instances from that category. It is intuitive to consider the classification of CT volumes that contain multiple CT slices as a MIL problem. A few methods have been proposed to solve the MIL problem. For example, Oquab et al. [46] trained a CNN using the max-pooling MIL strategy to classify the object. However, some of the MIL pooling strategies, such as max-pooling and mean-pooling, very often lead to insufficient and unstable training because of gradient vanishing. To fix this problem, Ilse et al. [47] combined the gated attention mechanism with the MIL strategy to solve the medical image classification problem, but it could not predict the instance label accurately. Chen et al. [48] developed a stylized generative method to transfer knowledge from MRI to CT in an unsupervised manner. Xia et al. [49] utilized uncertainties along different volume angles to measure the importance of predicted labels. Chen et al. [50] modeled intra-consistency between two domains to align the feature distributions. However, these methods require training the model on both the source domain and the target domain, and therefore cannot handle unseen domain scenarios. Our method provides solutions to these limitations.
Domain adaptation. Domain adaptation refers to techniques aimed at improving the performance of machine learning tasks, e.g., classification, detection, and segmentation, when the classifier is trained on data only from the source domain but tested on related samples from a shifted target domain. Some approaches also use domain adaptation to help learn the feature representation. Hoffman et al. [51] proposed a method that learned the difference between classification and detection tasks, and transferred this knowledge from the classifier to detectors using weakly annotated data. In addition, MIL was incorporated for learning the feature representation and the classifier [52]. Besides, Mahmood et al. [53] utilized transformations such as hue, saturation, contrast, and brightness for RGB images to change the color and texture of the images in the source domain. Existing domain adaptation methods tend to use strongly annotated data in the source domain in order to improve the recognition performance, while our method works in a weakly supervised manner. In other words, our method requires neither instance-level annotation nor auxiliary strongly annotated data for recognition.

Proposed method
In this section, we introduce the proposed coronavirus information fusion and diagnosis network (CIFD-Net) and its key modules: an explainable classification module (ECM), a slice aggregation module (SAM), and a slice noisy correction module (SNCM), as illustrated in Fig. 2.
The proposed ECM integrates the generation of class activation mapping into the forward propagation of the CIFD-Net, enabling CAMs generation during training and testing, which provides explainable results for the prediction of our model.
Besides, instead of training on image-level (slice-wise) labels, which require a significant amount of manual labeling, we propose the SAM to train on patient-level labels. We model the joint probability of the slices of each patient by dividing the slices into several consecutive sections of equal length. We then compute the probability of each section by adopting a k-max selection strategy, which can ignore slices with large uncertainty and thus reduce the noise when modeling the joint probability at the patient level. By modeling the joint probability, our model pays more attention to the distribution of affected sections, which leads to better generalization across multiple domains.
Moreover, in order to improve the accuracy at the image level, we further propose the SNCM, which models the transaction between the true label and the noisy label, since the patient-level labels are considered to be noisy with respect to slice-wise labels.

Problem formulation
The ultimate goal of our model is to diagnose whether a patient is positive or negative given a 3D volumetric CT lung scan. Let $P = [I_1, I_2, \dots, I_n]$ denote the lung CT volume of a patient with $n$ CT slices, where $I_i$ is a 2D CT slice image. Let $Y \in \{0, 1\}$ denote whether a patient is COVID-19 positive or not: $Y = 1$ when the patient has COVID-19, while $Y = 0$ means that the patient is not infected. During the training stage, we only have patient-level labels, and the number of CT lung slices can vary significantly between patients.

Explainable classification module
The prediction process of a CNN is a black box, and several techniques [54,55] have been proposed to shed light on how a CNN makes its prediction and obtains its remarkable localization ability without any supervision of localization maps. As an explainable auxiliary diagnosis tool for radiologists, we employ the class activation mapping (CAM) [54], which can generate the localization maps for the prediction from the output of the backbone networks, e.g., ResNet [9], VGG [56], GoogLeNet [57], etc. However, generating the CAM is a two-step process: the backbone network is first trained on the dataset, and the weights of the final fully connected layer are then used to compute the weighted sum of the feature maps of the last convolutional layer. Suppose $F_k \in \mathbb{R}^{H \times W}$ is the $k$th feature map with height $H$ and width $W$ from the last convolutional layer, and $W^{fc} \in \mathbb{R}^{K \times C}$ is the weight of the last fully connected layer, where $C$ is the number of classes and $K$ is the number of feature maps from the last convolutional layer. The class score $s_c$ of the class $c$ can then be calculated by

$$s_c = \sum_{k=1}^{K} W^{fc}_{k,c} \left( \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} F_k(i,j) \right). \tag{1}$$

Therefore, the localization map for the class $c$ proposed in [54] is defined by

$$A^{fc}_c = \sum_{k=1}^{K} W^{fc}_{k,c} F_k, \tag{2}$$

and we can visualize the object localization maps via $A^{fc}_c$. Although CAM is a useful way to locate the region, it requires a post-processing procedure to generate. In our method, we plug the generation of CAM into the network with only one forward pass. Instead of directly applying global average pooling after the last convolutional layer, we replace the fully connected layer with a 1 × 1 convolutional layer with a stride of 1 placed before the global average pooling operation. Suppose the weight of the convolutional layer is $W^{conv} \in \mathbb{R}^{K \times C}$, which has the same mathematical form as the weight of the fully connected layer, i.e., $W^{fc}$. We rewrite Eq. (1) as

$$s_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{K} W^{conv}_{k,c} F_k(i,j), \tag{3}$$

which results in the same output as Eq. (1).
Thus, the modified CAM for the class $c$ is computed as

$$A^{conv}_c = \sum_{k=1}^{K} W^{conv}_{k,c} F_k. \tag{4}$$

The modified activation mapping can accurately indicate the importance of the activation from CT images and locate the infected areas of the COVID-19 patients, providing explainable and reliable results for prediction. A region with a higher activation score contributes more to the prediction. The modified activation mapping can also offer auxiliary diagnostic information for radiologists. The differences between the original CAM and our ECM strategy are demonstrated in Fig. 3.
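The equivalence between the two orderings described above (global average pooling followed by a fully connected layer, versus a 1 × 1 convolution followed by global average pooling) can be checked numerically. The following is a minimal pure-Python sketch rather than the authors' implementation; the function names, the toy feature maps, and the weights are illustrative assumptions.

```python
# Illustrative sketch only: names and toy values are assumptions, not the paper's code.

def gap(fmap):
    """Global average pooling of one H x W map (a list of rows)."""
    h, w = len(fmap), len(fmap[0])
    return sum(sum(row) for row in fmap) / (h * w)

def score_fc(features, weights, c):
    """Class score: pool each feature map first, then apply the FC weights."""
    return sum(weights[k][c] * gap(features[k]) for k in range(len(features)))

def activation_map(features, weights, c):
    """Weighted sum of feature maps for class c (a 1 x 1 convolution, stride 1)."""
    h, w = len(features[0]), len(features[0][0])
    return [[sum(weights[k][c] * features[k][i][j] for k in range(len(features)))
             for j in range(w)] for i in range(h)]

def score_conv(features, weights, c):
    """Class score: 1 x 1 convolution first, then pool the resulting map."""
    return gap(activation_map(features, weights, c))

# K = 2 feature maps of size 2 x 2 and C = 2 classes (toy values).
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.0, 1.0], [0.0, 1.0]]]
weights = [[0.5, -1.0],   # weights[k][c]
           [2.0, 0.25]]

# Both orderings give the same class score, so the map is available in one pass.
for c in range(2):
    assert abs(score_fc(features, weights, c) - score_conv(features, weights, c)) < 1e-12
```

Because both operations are linear, swapping their order leaves the class score unchanged, while the intermediate output of `activation_map` is already the localization map for class `c`.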

Slice aggregation module
In some mild COVID-19 cases, only part of the CT volume may show an infection, and very often the lesions are quite small. If we simply treat all slices as COVID-19 positive and train a classifier with the image-level label, the learning becomes noisy and the results poor. To overcome this problem, we propose the SAM and use the joint distribution to model the probability that a patient is COVID-19 positive or negative. We assume that lesions are consecutive and only affect adjacent slices; consequently, we use a section based strategy to tackle the problem. The intuition behind the section based strategy is that it can be directly mapped to the problem of multiple instance learning (MIL) [58]. In MIL, samples are divided into positive and negative bags. A positive bag contains at least one positive instance, and a negative bag only has negative instances. In our problem, only bag labels (patient annotations) are provided, and sections can be treated as instances in the corresponding bags.
Given a patient $P = [I_1, I_2, \dots, I_n]$ with $n$ CT slices, we divide these slices into disjoint sections, each of which contains an equal number of consecutive CT slices, i.e., $P = \{S_i\}_{i=1}^{|S|}$, where $|S|$ is the number of sections for patient $P$, determined by $n$ and $l_s$, and $l_s$ is an empirically chosen parameter called the section length.
Then the probability of patient $P$ belonging to the class $c$ is represented in terms of the section probabilities, where $P(c \mid S_i)$ is the probability that the $i$th section $S_i$ belongs to the class $c$. Instead of taking the average of the slice probabilities in that section, we take the k-max probability for each class to compute the section probability. This is because some slices may contain few infection regions, which can confound the prediction. To alleviate this problem, we adopt the k-max selection method. We then use the patient-level annotations $y$ as the ground truth during training to define the classification loss (Eq. (8)).
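The section-based aggregation can be illustrated with a small sketch. It is not the paper's formulation: the handling of leftover slices, the use of the mean of the top-k slice probabilities as the section probability, and the plain average over sections as the patient probability are all assumptions made for illustration, and the function names are hypothetical.

```python
# Illustrative sketch only. The remainder handling, the top-k averaging, and the
# patient-level averaging are assumptions; the paper's equations may differ.

def split_sections(slice_probs, section_len):
    """Split per-slice scores into consecutive, disjoint sections of equal length.

    Assumption: trailing slices that do not fill a section are dropped, and a
    volume no longer than section_len forms a single section.
    """
    n = len(slice_probs)
    if n <= section_len:
        return [list(slice_probs)]
    count = n // section_len
    return [list(slice_probs[i * section_len:(i + 1) * section_len])
            for i in range(count)]

def k_max_probability(section, k):
    """Average of the k largest slice probabilities in one section (assumed form)."""
    top = sorted(section, reverse=True)[:k]
    return sum(top) / len(top)

def patient_probability(slice_probs, section_len, k):
    """Aggregate section probabilities into one patient-level value.

    Assumption: a plain mean over sections stands in for the paper's aggregation.
    """
    sections = split_sections(slice_probs, section_len)
    section_probs = [k_max_probability(s, k) for s in sections]
    return sum(section_probs) / len(section_probs), section_probs

# Toy volume of 8 slices in which lesions appear only in the second half.
probs = [0.1, 0.2, 0.1, 0.0, 0.9, 0.8, 0.2, 0.1]
patient_p, section_p = patient_probability(probs, section_len=4, k=2)
```

In this toy case the second section scores about 0.85 even though half of its slices have low probabilities, which is the effect the k-max selection is meant to achieve: low-scoring slices in an infected section do not dilute the section-level evidence.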

Slice noisy correction module
To further alleviate the negative impact of the image-level noise, we propose the SNCM, which is loosely inspired by [59], to model the hidden distribution $P(z_c = i \mid y_c = j, I)$ between the noisy label and the true label. Let $P(y_c \mid I)$ denote the true posterior distribution given an image $I$. The distribution of the noisy label, $P(z_c \mid I)$, can be modeled as

$$P(z_c = i \mid I) = \sum_{j \in \{0,1\}} P(z_c = i \mid y_c = j, I)\, P(y_c = j \mid I). \tag{9}$$

We estimate the noise transaction $Q^c_{ij} = P(z_c = i \mid y_c = j, I)$ for the class $c$ from the transaction score $T^c_{ij} = w^c_{ij} \phi(I) + b^c_{ij}$, where $i, j \in \{0, 1\}$; $\phi(\cdot)$ is a nonlinear mapping function; and $w^c_{ij}$ and $b^c_{ij}$ are trainable parameters for the class $c$ between the statuses $i$ and $j$. The transaction score $T^c_{ij}$ can be regarded as the score of the transaction from the true label $j$ to the noisy label $i$ with respect to the class $c$. As a result, the estimated probability of the noisy label for the class $c$ is represented as

$$\hat{P}(z_c = i \mid I) = \sum_{j \in \{0,1\}} Q^c_{ij}\, P(y_c = j \mid I). \tag{11}$$

Algorithm 1. Training procedure of CIFD-Net
Output: Learned model parameters of CIFD-Net: Θ.
1: Initialize all parameters Θ using Xavier initialization.
2: repeat
3:   for m = 1 to M do
4:     Use the backbone network to compute slice features for CT volume $P_m$ with its slices $\{I_i\}_{i=1}^{n}$.
       ...
11:  end for
12: until convergence
13: return Θ.

Table 1. The number of CT samples used for training for each class, collected from four different hospitals A, B, C, and D. Details of the CC-CCII dataset, which was used in the independent testing stage, are also listed. The ratio of positive to negative samples is approximately 1:1 in the training set and 2:1 in the test dataset.

Finally, with the help of the estimated noisy probability, the noisy classification loss for the patient $P$ is computed (Eq. (12)). By combining Eqs. (8) and (12), we obtain the total loss function that we optimize for our model, where λ is a hyper-parameter that balances the loss terms.
During the model training, the above loss functions are optimized iteratively. By incorporating the SAM, we can build a unified end-to-end deep neural network architecture for the COVID-19 diagnosis. The whole training procedure is summarized in Algorithm 1.
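To make the noisy-label modeling of the SNCM more concrete, the sketch below builds a 2 × 2 transaction matrix for one class and applies it to a true-label posterior. It is a hedged illustration only: the softmax normalization over the noisy label is an assumption in the spirit of noise adaptation layers, the scalar feature stands in for φ(I), and all parameter values are hypothetical.

```python
import math

# Illustrative sketch only. The softmax normalisation over the noisy label i and
# the scalar feature are assumptions; the paper's exact estimator may differ.

def transaction_matrix(feature, w, b):
    """Return Q with Q[i][j] approximating P(z = i | y = j, I) for i, j in {0, 1}.

    w[i][j] and b[i][j] play the role of trainable parameters, and feature
    stands in for phi(I). Each column j is normalised over i so it sums to 1.
    """
    scores = [[w[i][j] * feature + b[i][j] for j in range(2)] for i in range(2)]
    q = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        m = max(scores[0][j], scores[1][j])      # subtract the max for stability
        exps = [math.exp(scores[i][j] - m) for i in range(2)]
        total = sum(exps)
        for i in range(2):
            q[i][j] = exps[i] / total
    return q

def noisy_posterior(q, true_posterior):
    """P(z = i | I) = sum_j Q[i][j] * P(y = j | I)."""
    return [sum(q[i][j] * true_posterior[j] for j in range(2)) for i in range(2)]

# Hypothetical parameters: a truly negative slice (y = 0) is often labelled
# positive (z = 1) because it inherits the patient-level label.
w = [[0.0, 0.0], [0.0, 0.0]]
b = [[0.0, -2.0], [1.0, 2.0]]
q = transaction_matrix(feature=0.5, w=w, b=b)
p_z = noisy_posterior(q, true_posterior=[0.7, 0.3])
```

With these toy values the slice is most likely negative according to the true posterior, yet the noisy posterior favors the positive label, which mirrors how slices inherit a positive patient-level label.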

Experiments and discussions
In this section, the effectiveness of our method is validated and the results are quantified. First, we provide some statistics of the datasets and describe the implementation details as well as the experimental settings, which are followed by the reported results, the ablation studies, and further discussions on the qualitative and quantitative results.

Datasets
In order to verify the effectiveness of the proposed model on data from an independent hospital, we train on data from several hospitals and then test the model on an independent dataset. The datasets used in our study are summarized in Table 1. We collect CT datasets from four different local hospitals and anonymize the data by removing all patient identity information. In total, there are 804 CT scan volumes with 45,167 CT images, 380 of which are COVID-19 positive and the other 424 are negative cases. All COVID-19 positive cases are confirmed by RT-PCR tests. We train on the cross-domain datasets collected from hospitals A, B, C, and D and test on the public CC-CCII dataset [8]. The CC-CCII dataset consists of 2034 3D CT volumes with 130,511 CT images, which were acquired by a CT scanner from a different manufacturer and therefore represent another image domain.

Data standardization, pre-processing
Following the protocol used in [8], we first normalized the images with z-score normalization, and then used the U-Net segmentation network [60] to segment the CT images. After that, we randomly cropped a rectangular region whose aspect ratio was randomly sampled in [3/4, 4/3] and whose area was randomly sampled in [90%, 100%] of the image, and then resized the region to 224 × 224. Meanwhile, we randomly flipped the input volumes horizontally with a probability of 0.5. The input data is a set of CT volumes, each composed of consecutive CT slice images.
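The normalization and cropping steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the z-score statistics are computed per image, the aspect ratio is drawn uniformly, and the final resize to 224 × 224 is omitted.

```python
import random

# Illustrative sketch only: function names are hypothetical, statistics are
# computed per image, and the aspect ratio is sampled uniformly (assumptions).

def z_score(pixels):
    """Normalise a flat list of intensities to zero mean and unit variance."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    std = var ** 0.5 or 1.0                  # avoid division by zero on flat images
    return [(p - mean) / std for p in pixels]

def sample_crop(height, width, rng, scale=(0.9, 1.0), ratio=(3 / 4, 4 / 3), tries=10):
    """Sample a crop box (top, left, h, w) within the stated area and aspect ranges."""
    area = height * width
    for _ in range(tries):
        target = area * rng.uniform(*scale)
        aspect = rng.uniform(*ratio)
        w = int(round((target * aspect) ** 0.5))
        h = int(round((target / aspect) ** 0.5))
        if 0 < w <= width and 0 < h <= height:
            top = rng.randint(0, height - h)
            left = rng.randint(0, width - w)
            return top, left, h, w
    return 0, 0, height, width               # fall back to the full image

rng = random.Random(0)
top, left, h, w = sample_crop(512, 512, rng)
normalised = z_score([0.0, 50.0, 100.0])
```

Because the crop must cover at least 90% of the image area, many sampled aspect ratios do not fit inside a square image; the sketch then retries and eventually falls back to the full image.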

Implementation details
We use ResNet-50 [9] pre-trained on ImageNet [61] as the backbone network. For data augmentation, we use random horizontal flipping for the input CT volume in the spatial dimension: each image in a CT volume is randomly flipped horizontally with a probability of 0.5. Then, we resize the images to 224 × 224. In addition, brightness and contrast are randomly changed within the range [0.9, 1.1]. The dropout rate is set to 0.7, λ is set to 0.0001, and the $L_2$ weight decay coefficient is set to $10^{-5}$. During the training and testing stages, we set $l_s = 16$ and $k = 8$ to compute the patient probability. We train our model using the Adam optimizer [62] with an initial learning rate of $\eta = 1 \times 10^{-3}$, and training is terminated after 4000 iterations with a batch size of 10. All experiments were conducted on a workstation with 4 NVIDIA Tesla V100 GPUs using PyTorch.

Quantitative results
We reproduce and compare with four state-of-the-art methods [2][3][4][9] on COVID-19 CT classification. The results are shown in Table 2. For image-level supervision, COVID-Net [3] and ResNet-50 [9] employ the patient-level annotations as image annotations. Different from the methods proposed by Wang and Wong [3] and He et al. [9], VBNet [4] adopts a 3D residual convolutional neural network (3D-ResNet) to train on CT volumes with patient labels. Moreover, COVNet [2] also trains on the patient-level label: it feeds a patient-specific set of CT images into a 2D ResNet and simply aggregates the image-level feature descriptors with a max-pooling operator.
From Table 2, several interesting observations can be summarized as follows. In Table 2, * indicates a p-value < 0.05, and ** represents a p-value < 0.01.
• The CIFD-Net outperforms most of the competing models by a large margin on the independent testing dataset, which can be attributed to our model's successful handling of the multi-domain shift problem. For the patient-level classification, our model performs better than the other compared methods by at least 12.5% in accuracy. Moreover, our model also yields the best performance on the image-level classification, outperforming COVNet [2] by 14.1%. In addition, receiver operating characteristic (ROC) analysis and area under the curve (AUC) results are obtained to quantify the classification performance. Our CIFD-Net achieves a higher AUC value at both the patient level and the image level compared to other state-of-the-art methods. Meanwhile, it is worth noticing that at the patient level our method significantly outperforms the other methods by at least 16.3% with respect to sensitivity, which is an important indicator for diagnosing COVID-19 positive cases.
• Models trained at the patient level, such as [2,4] and ours, achieve significantly better performance than those trained at the image level, i.e., [3,9], especially in patient-level accuracy. This reflects that the image-level noise is non-trivial and can cause these models to overfit. Moreover, the models trained at the image level may rely on learning the image textures [63], which are highly discriminative between domains. As a consequence, the models are prone to overfitting and to being biased toward different textures while predicting, which may explain why these methods, e.g., the methods proposed in [3,9], generalize poorly to unseen domains.
• Although [2,4] are also trained on the patient-level labels, our proposed CIFD-Net is superior to these methods, especially on the patient-level classification. The method proposed by Li et al. [2] performed the worst, and this may be because it was trained on randomly selected CT images extracted from each 3D volume, which may impede the encoding of lesions (often appearing in adjacent slices). In contrast, Ouyang et al. [4] preserved the sequential information among the CT slices because their method was trained on the whole CT volumes. Compared with [2,4], we take the full 3D volume into account and preserve the sequential information by dividing the volume into sections. Besides, VB-Net achieves better performance than COVNet because VB-Net is trained with stronger supervision in addition to the image-level classifier: it also employs an auxiliary pixel-wise classifier trained with pixel-level infection annotation (i.e., infection segmentation masks). In comparison, our proposed model achieves better overall classification performance than VB-Net with weak supervision only.
We carried out the ROC analysis and used the AUC results to quantify the classification performance, as shown in Fig. 4. From Fig. 4(a), we can observe that the models trained only on image-level annotations (i.e., ResNet-50 and COVID-Net) are not reliable since their AUCs are less than 50%. In addition, we found that overall our CIFD-Net remains the best-performing algorithm with an AUC of 93.22%. It is of note that the overall results at the patient level are higher than those at the image level. This could be correlated with our findings in the classification that some CT slices with few lesion parts are hard to diagnose and classify.
To examine the influence of different loss terms, we conduct ablation studies on the proposed model and the results are reported in Table 3. As seen in the table, the model with the SNCM slightly outperforms the model without the SNCM at the patient level, whereas at the image level the SNCM improves the image accuracy significantly, by 6.2%. However, when only the SNCM is used, the model would still be biased toward predicting CT images as negative, because we only require our model to correct those CT images wrongly labeled as COVID-19 positive, which provides strong prior information to the training procedure.
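For the ROC analysis in this subsection, it may help to recall that the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pair counting on toy scores; it is an illustration with made-up data, not the evaluation code behind the reported numbers.

```python
# Illustrative sketch only: a pair-counting AUC on toy scores, not the paper's
# evaluation code.

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.7, 0.2, 0.8, 0.1]
auc = roc_auc(labels, scores)                          # 8 of 9 pairs are ordered correctly
flipped = roc_auc(labels, [-s for s in scores])        # reversing the ranking gives 1 - auc
```

Under this reading, an AUC below 50% means that the model ranks negatives above positives more often than not, which is why such models are described as unreliable.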
Furthermore, we have examined the sensitivity of our model to the choice of the hyper-parameters λ and k. For the selection of the hyper-parameter k, we can observe that when k is too large or too small, the performance degrades dramatically. This is because, if k is too large, the uncertainty of the section increases and causes noisy predictions. On the contrary, if k is too small (e.g., k = 1), some important slice information is neglected, which leads to inaccurate results (see Fig. 6).

Qualitative results
For qualitative studies, we use the trained models (e.g., ResNet-50, COVNet, and others) to visualize the CAMs and bounding boxes on the test set. Fig. 7 presents the visualization of CAMs using our ECM. We can clearly see that the model trained at the slice level (ResNet-50) tends to discard the lesions and focus on non-infected regions, which also explains why it makes inaccurate and unreliable diagnosis decisions that would be troublesome for radiologists. On the contrary, models trained at the patient level, COVNet for instance, are able to detect some of the lesions occasionally but mostly fail to estimate the extent of the lesions reliably. In contrast, our model is precise not only in lesion localization but also in estimating the extent of the infected areas.
Moreover, based on the CAMs, we extracted bounding boxes for each method. Our CIFD-Net yields more accurate bounding boxes on the salient parts of the CAMs (Fig. 7) compared with the other methods, which indicates that our method is more applicable to auxiliary diagnosis. For instance, in diffuse cases (Fig. 7, rows 1 to 4), our CIFD-Net produces more accurate saliency maps than ResNet-50 and COVNet, with fewer false positives and false negatives, and therefore generates more precise localization (bounding boxes). For lesions distributed peripherally and subpleurally, both our CIFD-Net and COVNet perform better than ResNet-50 (Fig. 7, rows 5 and 6). However, our CIFD-Net is more sensitive to infected regions that are not obvious in the images (Fig. 7, row 7).
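One common way to derive a bounding box from a CAM is to threshold the map at a fraction of its peak activation and enclose the surviving cells. The sketch below follows that idea; the relative threshold of 0.5, the function name `cam_to_bbox`, and the toy map are assumptions for illustration, since the exact extraction procedure is not specified in this section. Note that this simplified version returns a single box around all salient cells rather than one box per connected region.

```python
def cam_to_bbox(cam, rel_threshold=0.5):
    """Return (row_min, col_min, row_max, col_max) enclosing all CAM cells
    whose activation is at least rel_threshold * max(cam).

    cam: 2D list of activations (rows of equal length).
    Returns None if the map is empty or has no positive activation.
    """
    if not cam or not cam[0]:
        return None
    peak = max(max(row) for row in cam)
    if peak <= 0:
        return None
    cutoff = rel_threshold * peak
    rows = [r for r, row in enumerate(cam) if any(v >= cutoff for v in row)]
    cols = [c for c in range(len(cam[0]))
            if any(row[c] >= cutoff for row in cam)]
    return (rows[0], cols[0], rows[-1], cols[-1])


# Toy 4x4 activation map (made up for illustration).
cam = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.9, 0.0],
    [0.0, 0.6, 1.0, 0.1],
    [0.0, 0.0, 0.2, 0.0],
]
box = cam_to_bbox(cam)                       # tight box around the hot region
loose_box = cam_to_bbox(cam, rel_threshold=0.15)  # lower cutoff, larger box
```

A lower threshold enlarges the box and tends to add false-positive area, while a higher threshold shrinks it toward the peak, so the quality of the box depends directly on how well the CAM covers the true extent of the lesion.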
In addition, we visualize the infection probability of the lung sections for individual patients and sample CT slices from the corresponding sections. As illustrated in Fig. 8, the red curve depicts the infection probability along the lung sections, and the blue curve, conversely, depicts the non-infection probability for each section. Overall, it can be seen that the infected lung sections are adjacent to each other and that the transition between sections is smooth. Besides, we found that our model is capable of robustly localizing the infected lung sections, regardless of the scale and type of the lesions. For example, for Section 2 of patient A, although there is only a very small peripheral lesion (GGO), our model is still sensitive enough to identify the infected section. The predicted probability stays around the midpoint, i.e., 0.5, when no apparent lesions are detected, for instance, in Section 1 for patient B and Section 2 for patient C.
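The reading of the section-wise curves can be sketched as follows. The sketch assumes a two-class output, so that the non-infection probability is the complement of the infection probability, and it flags sections whose probability lies within a margin of 0.5 as uncertain. The margin of 0.1, the function name `label_sections`, and the toy probabilities are illustrative assumptions and are not taken from the paper.

```python
def label_sections(infection_probs, margin=0.1):
    """Label each lung section from its infection probability p.

    Assumption: the non-infection probability is 1 - p (two-class output).
    Sections with |p - 0.5| <= margin are reported as 'uncertain'.
    """
    labels = []
    for p in infection_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        if abs(p - 0.5) <= margin:
            labels.append("uncertain")
        elif p > 0.5:
            labels.append("infected")
        else:
            labels.append("non-infected")
    return labels


# Toy per-section infection probabilities (made up for illustration).
probs = [0.52, 0.91, 0.88, 0.30]
non_infection = [1.0 - p for p in probs]  # the complementary curve
labels = label_sections(probs)
```

Here the first section sits near 0.5 and is reported as uncertain, the two adjacent middle sections are labeled infected, and the last one is labeled non-infected.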

Discussions
Our proposed CIFD-Net sequentially aggregates image-level features within a CT volume to alleviate the multi-domain shift problem. This turns out to be very effective, and we have demonstrated that our CIFD-Net generalizes better to unseen data domains than other state-of-the-art works. This may be attributed to two factors: (1) the k-max selection strategy: when optimizing the joint probability, only the top-k probabilities within each section are considered; in this way, confounded images are not considered, which results in a more robust prediction; and (2) our loss function, which is designed to model the joint probability of the patient instead of the individual image slice. Compared with naive models, e.g., a plain ResNet-50 trained on single image slices, our model is less likely to overfit to varied image styles and appearances, e.g., the assorted textures and contrasts of the images, because it takes into account the relationship between sections and the correlation between images in each section.
In addition, we integrated a novel slice noise correction module, i.e., the SNCM, into the proposed CIFD-Net, which adds further regularization to the optimization. We argue that this not only boosts the classification performance of the image-level prediction but also leads to more precise localization of lesions. However, since we trained the CIFD-Net under the assumption that CT slices are consecutive and lung segments (sections) are ordered, it may be difficult to handle disordered CT slices using the slice aggregation, i.e., the SAM, and as a consequence, the classification may be less accurate.

Conclusion
In this study, we have proposed a robust COVID-19 recognition model named CIFD-Net, which exploits the ECM to assist radiologists in auxiliary diagnosis. To handle the volume information, the model adopts the SAM to combine different sections in order to model the joint probability that the patient is COVID-19 positive. In addition, we extend our CIFD-Net by incorporating the SNCM to predict a single CT slice without any image-level annotations. To investigate the prediction performance of the proposed model, we conducted comprehensive experiments on publicly available CT datasets. The experimental results have verified the superiority of our model, which can solve the multi-domain shift problem efficiently and effectively, compared with other state-of-the-art methods.

Declaration of competing interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: QY, YJ, and MW are employed by Hangzhou Ocean's Smart Boya Co., Ltd., China. YG, ZN, and WS are employed by Aladdin Healthcare Technologies, Ltd., UK. YJ and MW are employed by Mind Rank Ltd, Hong Kong. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.