Improving COVID-19 CT classification of CNNs by learning parameter-efficient representation

The COVID-19 pandemic continues to spread rapidly across the world and causes a tremendous crisis in global human health and the economy. Its early detection and diagnosis are crucial for controlling further spread. Many deep learning-based methods have been proposed to assist clinicians in automatic COVID-19 diagnosis based on computed tomography imaging. However, challenges remain, including low data diversity in existing datasets and unsatisfactory detection resulting from insufficient accuracy and sensitivity of deep learning models. To enhance the data diversity, we design augmentation techniques of incremental levels and apply them to the largest open-access benchmark dataset, COVIDx CT-2A. Meanwhile, similarity regularization (SR) derived from contrastive learning is proposed in this study to enable CNNs to learn more parameter-efficient representations, thus improving the accuracy and sensitivity of CNNs. The results on seven commonly used CNNs demonstrate that CNN performance can be improved stably by applying the designed augmentation and SR techniques. In particular, DenseNet121 with SR achieves an average test accuracy of 99.44% over three trials for three-category classification, including normal, non-COVID-19 pneumonia, and COVID-19 pneumonia. The achieved precision, sensitivity, and specificity for the COVID-19 pneumonia category are 98.40%, 99.59%, and 99.50%, respectively. These statistics suggest that our method has surpassed the existing state-of-the-art methods on the COVIDx CT-2A dataset. Source code is available at https://github.com/YujiaKCL/COVID-CT-Similarity-Regularization.


Introduction
The Coronavirus Disease 2019 (COVID-19) has become a worldwide pandemic and has infected over 493 million people as of April 2022 [1]. Its increasingly high infectivity and fatality rate due to strain variation are threatening human health and damaging the global economy [2][3][4]. The effective reproduction number of the virus in many countries remains high, as reported in [5], indicating that COVID-19 continues to spread quickly around the world. Therefore, a timely and efficient diagnosis is crucial for the treatment of COVID-19 positive patients and the control of further disease spread.
In the early diagnosis of COVID-19 infection, real-time reverse transcription polymerase chain reaction (RT-PCR) is the primary choice due to its convenience and high specificity. However, research results [6][7][8] have suggested that RT-PCR is not sufficiently sensitive: some infected patients turned out to be positive only after several negative tests. These false-negative cases might continue to infect their close contacts without isolation, or develop into severe illness. Chest computed tomography (CT) is a supplementary screening tool to RT-PCR since CT has higher sensitivity in detecting infection, as reported by several institutes [9][10][11]. However, the high cost and long scanning time of CT are not affordable for all institutes. Thus, CT is more suitable in scenarios where patients have suspicious negative RT-PCR tests, patients are in need of a timely diagnosis, or RT-PCR test kits are undersupplied.
Since the pandemic started, researchers have been exploring the potential of convolutional neural networks (CNNs) for COVID-19 CT diagnosis. Gunraj et al. [12] introduced a large-scale open-access COVID-19 CT dataset (COVIDx CT-1) and trained a COVID-19-specific tailored CNN on it. Panwar et al. [13] utilized transfer learning to inherit cross-domain knowledge to improve model performance. All these studies reveal that CNNs have the potential to serve as assistants to help clinicians in COVID-19 CT diagnosis. Although CNNs have achieved remarkable results in CT diagnosis, challenges remain before they can be put into practical use. Deep learning methods often require large-scale standard datasets, while the existing COVID-19 CT datasets are insufficient. Also, CT scans collected from different institutes have inconsistent characteristics like orientation, brightness, etc. Trained models might be more sensitive to this irrelevant information than to the pneumonic pathologies that really matter. Furthermore, the increasingly great capability of CNN-based models may not be fully exploited given the limited data sources. Hence, methods for learning more parameter-efficient representations are crucial for mitigating the data insufficiency issue and improving classification performance.
By addressing the problems above, a more reliable COVID-19 CT classification system can reduce the workload of clinicians and provide more accurate and sensitive computer-aided diagnoses. Motivated by these factors, this study aims to use deep learning techniques to improve the COVID-19 CT classification performance of commonly used CNNs. Particularly, to alleviate data insufficiency and enhance data diversity, we design and apply augmentations of incremental levels on the currently largest COVID-19 CT benchmark dataset (COVIDx CT-2A) [14]. Meanwhile, to find the optimal selection of CNN architectures and augmentation combinations, we explore seven commonly used CNN architectures under seven augmentation settings. The CNNs include SqueezeNet1.1 [15], MobileNetV2 [16], DenseNet121 [17], ResNet-18/34/50 [18], and InceptionV3 [19]. Meanwhile, contrastive learning is a promising self-supervised method for enabling deep learning models to learn more parameter-efficient features. We propose similarity regularization (SR), derived from contrastive learning, to learn more parameter-efficient representations and improve CNN classification. The experimental results demonstrate that SR can improve the classification performance of CNNs stably and surpass conventional contrastive learning. Our main contributions are summarized as follows:

COVID-19-Related Research
CNNs are increasingly improving the COVID-19 CT classification with advanced algorithms and enhanced datasets. Numerous CNN-based methods achieving high accuracy have been proposed, indicating the potential of CNNs in assisting practical diagnosis. Some representative methods on four benchmark datasets are listed in Table 1.
In COVID-19 CT classification, no gold-standard datasets exist so far. The four widely employed open-access datasets [12,14,20,27] in Table 1 differ in many aspects, including patient/scan distribution, collection sources, dataset size, class numbers, labelling quality, etc. Particularly, COVID-CT [27] and SARS-CoV-2 [20] are two small binary-classification datasets containing 812 and 2,482 CT scans for the COVID-19 positive and non-COVID classes, respectively. Gunraj et al. released a larger dataset, COVIDx CT-1 [12], consisting of 104,009 scans for the normal, NCP, and CP classes, upon which the authors later built COVIDx CT-2 [14]. COVIDx CT-2 is the largest existing dataset, containing 194,922 CT scans combined from multiple data sources. Generally, data-driven methods like CNNs depend heavily on dataset size. This can be seen from the classification metrics in Table 1: the methods trained on larger datasets roughly achieve higher performance. To ensure both data diversity and satisfactory results, our study employs COVIDx CT-2A [14] as the target dataset.
Drawing from the research works reviewed above, deep learning models can achieve higher performance in COVID-19 CT classification by: 1) training models on data of higher diversity; 2) designing neural networks finely; 3) ensembling the decisions of multiple models; 4) inheriting out-of-domain classification knowledge. Although models can benefit from these aspects, the expensive computational cost of neural architecture search and large-scale pre-training, and the long execution time caused by over-parameterization, should be considered as well.

Contrastive Learning
In recent years, supervised deep learning models of increasing complexity and depth have shown great progress in many large-scale applications like ImageNet classification [18,19]. However, directly applying these models to COVID-19 datasets of smaller scales might cause over-parameterization, meaning that model capacities cannot be fully exploited and the extracted representations are not parameter-efficient. One promising approach to addressing this issue is contrastive learning.
In the deep learning field, it is widely recognized that model performance depends on the quality of the learned representations. Contrastive learning, also known as contrastive self-supervised representation learning, is a framework aiming at learning efficient representations without human-specified labels. In general, the main idea of contrastive learning is to project inputs into an embedding space where the embedded vectors of similar samples are closer while dissimilar ones are apart. More formally, for visual tasks, a pair of views augmented from one image is considered a positive pair, while pairs of views from different images are considered negative pairs. Hence, contrastive learning models aim to maximize the representation similarity between positive pairs and minimize that between negative pairs. In practical tasks, contrastive learning often pre-trains the front representation extractors of deep learning models in a self-supervised manner, and then fine-tunes the pre-trained weights in a conventional supervised manner. The state-of-the-art contrastive learning frameworks include MoCo [35,36], SimCLR [37,38], SimSiam [39], SwAV [40], BYOL [41], etc. These frameworks mainly differ in terms of loss function, representation projection, and negative pair formation [39]. These differences further determine their requirements on the complexity of augmentation policies and batch size. Normally, to obtain satisfactory results, contrastive methods depend on a large batch size to cover enough negative pairs [35][36][37][38]. Among these models, BYOL, SwAV, and SimSiam are the contrastive frameworks requiring no negative pairs. In ImageNet linear classification experiments [39], BYOL achieves relatively better performance. This explains why we select BYOL as the basic framework for SR calculation, as described in Section 3.2.
The success of contrastive learning has inspired applications in COVID-19 CT diagnosis [28,29,42]. He et al. [29] employed a MoCo-like [35] framework to enhance the CT scan representations extracted by DenseNet169 and fine-tuned the network, achieving 86% accuracy on COVID-CT [27]. Similarly, Chen et al. employed a MoCo-v2-like [36] framework on the same dataset and reached 88.5% accuracy within six shots. Li et al. [42] used the contrastive loss as a regularization term and trained their CMT-CNN in an end-to-end manner, obtaining 93.46% accuracy. These studies suggest that contrastive learning can boost classification performance by learning more efficient representations.

Augmentation of Incremental Levels
Data augmentation is vital for improving the performance of deep learning models, especially for contrastive learning [36,37]. However, the optimal selection for COVID-19 CT augmentation has not been studied. Inspired by the literature in Section 2, we design and evaluate a series of augmentation operations of incremental levels as follows, where "+" denotes the augmentation appended on top of the previous level:

Level 0: No augmentation;

Level 1 + RandomResizedCrop: Randomly obtain an image crop whose area is in the range [0.08, 1] of the original size 256 × 256, and randomly scale the crop according to an aspect ratio in the range [3∕4, 4∕3]. The scaled crop is finally resized to the original size;

Level 2 + Horizontal flip: Randomly flip the input image horizontally with 50% probability;

Level 3 + RandAugment [43]: Randomly apply RandAugment twice with magnitude 9 and magnitude standard deviation 0.5;

Level 4 + Random Erasing [44]: Select a rectangular region of the input image and erase it pixel-wise with 25% probability. The size of the selected region is randomly picked in the range [0.02, 1∕3] of the image size;

Level 5 + Mixup [45]: Mix up two in-batch images with a ratio λ subject to a beta distribution, λ ∼ Beta(1, 1). The mixup of images x_A and x_B can be formatted as x̃(i, j) = λ x_A(i, j) + (1 − λ) x_B(i, j), where (i, j) denotes the pixel coordinate;

Level 6 + CutMix [46]: Switch from Mixup to CutMix with 50% probability. Randomly replace a square region in the original image with a region from another in-batch image. The region size is randomly determined, subject to the square root of a beta distribution Beta(1, 1).
The visualization of the augmented scans is demonstrated in Fig. 4 in Appendix A. Specifically, RandomResizedCrop and horizontal flip are two commonly used augmentation operations in both supervised [18,19] and self-supervised learning [36][37][38]41]. Since contrastive learning requires more complicated augmentation [37], two stronger augmentations, RandAugment and Random Erasing, are further introduced in levels 3 and 4. Their implementations and parameters follow [47,48]. In levels 5 and 6, Mixup and CutMix are two augmentations enabling higher data diversity by fusing in-batch images. In these two levels, we mainly examine whether such sample-fusing augmentations can improve COVID-19 classification. By comparing the performance of models under these incremental augmentation levels, an appropriate augmentation strategy for COVID-19 CT scans can be established.
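The sample-fusing operations of levels 5 and 6 can be illustrated with a minimal PyTorch sketch. The function names and tensor shapes below are ours, not from the released code; the actual implementation follows the timm library [48].

```python
import torch

def mixup(x_a, x_b, alpha=1.0):
    """Level 5: blend two in-batch images with a ratio lam ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * x_a + (1.0 - lam) * x_b, lam

def cutmix(x_a, x_b, alpha=1.0):
    """Level 6: replace a random square region of x_a with the same region of x_b.
    The region side is proportional to sqrt(1 - lam), lam ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    _, h, w = x_a.shape
    ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(h * ratio), int(w * ratio)
    cy = torch.randint(0, h, (1,)).item()
    cx = torch.randint(0, w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed = x_a.clone()
    mixed[:, y1:y2, x1:x2] = x_b[:, y1:y2, x1:x2]
    # correct lam by the actual replaced area, for mixing the paired labels
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    return mixed, lam
```

The returned ratio lam is reused to mix the one-hot labels of the two images in the same proportion, as in [45,46].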

Similarity Regularization
Most mainstream conventional CNNs contain two parts, a representation extractor f and a following fully connected layer FC. The extractor aims to extract distinguishable representations of given inputs, and FC predicts the class probability distribution by summarizing the extracted representations. This forward propagation is demonstrated as the top branch in Fig. 1. More formally, the input image x is first transformed to a view v by a random on-the-fly augmentation operation t ∼ T, where T denotes an infinite collection of augmentation operations. Subsequently, the representation extractor converts the input view to a representation embedding vector h = f(v). The following FC predicts the class probability distribution based on the obtained representation, ŷ = FC(h). The training target of such a classifier is to minimize the class probability distribution distance between the prediction ŷ and the ground truth y according to the cross-entropy loss in Eq. (1), where c ∈ {0, 1, 2} denotes the class index.
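The supervised top branch described above can be sketched as follows. Here f, fc, and augment are placeholders for the extractor, the FC head, and a random augmentation t ∼ T; this is an illustrative sketch, not the released implementation.

```python
import torch
import torch.nn.functional as F

def supervised_step(f, fc, x, augment, labels):
    """Conventional supervised branch: augment -> extract -> classify -> CE loss."""
    v = augment(x)            # view v = t(x), t ~ T
    h = f(v)                  # representation h = f(v)
    logits = fc(h)            # class scores; y_hat = softmax(logits)
    return F.cross_entropy(logits, labels)
```
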
In this conventional fully supervised scenario, the trained representations aim at better projecting inputs to the human-specified class distribution. However, this manner limits data efficiency, robustness, and generalization [49]. Instead, contrastive learning enables learning more parameter-efficient representations from the inputs themselves instead of from the specified annotations. We thus incorporate it in common CNNs to improve their representation learning ability.
The overall structure of our method is illustrated in Fig. 1. We keep the conventional supervised classifier unchanged in the top branch while introducing a contrastive learning framework in the bottom branch. As in Section 2.2, contrastive learning aims to maximize the representation similarity between positive pairs. We penalize the positive-pair representation distance as a regularization term besides the cross-entropy loss, naming this term similarity regularization (SR).
Particularly, the contrastive framework is a siamese network like most mainstream frameworks [37][38][39]41], consisting of an online network and a target network. The target network can be seen as a moving average of the online one. Given two views v_1 and v_2 augmented from the same input image x, the representation extractors f_1 and f_2 in the two networks extract the corresponding latent representation vectors, h_1 = f_1(v_1) and h_2 = f_2(v_2). To avoid the representations being heavily affected by SR, the representation vectors are then projected to another embedding space, where z_1 = g_1(h_1) and z_2 = g_2(h_2), as in [38,41]. Since the projectors g_1 and g_2 share slightly different feature spaces, the online projection z_1 is further mapped to q(z_1) of the same dimension via the online predictor q. The cosine representation similarity s, with value in the range [−1, 1], can be measured according to Eq. (2),
where ⟨⋅, ⋅⟩ and ‖⋅‖_2 denote the inner product and the ℓ2 norm, respectively. A higher s indicates that the two vectors are more similar. To penalize a low cosine similarity between positive pairs and scale the penalty into the range [0, 1], SR can be calculated as in Eq. (3).
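Assuming the [0, 1] scaling in Eq. (3) is the linear map (1 − s)∕2, which is one natural choice consistent with s ∈ [−1, 1], the SR term for a positive pair can be sketched in PyTorch as:

```python
import torch
import torch.nn.functional as F

def similarity_regularization(p1, z2):
    """SR for one positive pair: penalize low cosine similarity between the
    online prediction p1 = q(g1(h1)) and the target projection z2 = g2(h2).
    s = <p1, z2> / (||p1||_2 * ||z2||_2) lies in [-1, 1]; (1 - s) / 2 maps
    the penalty into [0, 1]."""
    s = F.cosine_similarity(p1, z2.detach(), dim=-1)  # stop-gradient on target
    return (1.0 - s).mean() / 2.0
```

The detach on the target projection follows BYOL-style frameworks [39,41], where gradients only flow through the online branch.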
Hence, for a positive pair (v_1, v_2), the total loss containing both the cross-entropy loss and SR is written as in Eq. (4),
where α is a scale factor balancing the conventional cross-entropy loss and the introduced SR.
(v_2, v_1) is the symmetric positive pair with respect to (v_1, v_2). We calculate the losses for both symmetric pairs and take their mean as the final loss for fast convergence.
SR as a regularization term may raise the concern that it could dominate the combined loss and thus degrade the classification. To remove this concern and find an appropriate scheduler for α, we design the three strategies listed below, where t denotes the current training iteration number.

Constant (default): α is set to a constant value during all training iterations, 0.5 by default.
Linear Decay: α decays linearly to a minimum value α_min = 0.01 along the training iterations according to Eq. (5).
Cosine Decay: α decays to a minimum value α_min = 0.01 along the training iterations according to a cosine annealing scheduler, as in Eq. (6).
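The three schedulers can be sketched as follows. The exact forms of Eqs. (5) and (6) are not reproduced here; standard linear-decay and cosine-annealing formulas matching the stated endpoints (α from 0.5 to α_min = 0.01 over T iterations) are assumed.

```python
import math

ALPHA0, ALPHA_MIN = 0.5, 0.01  # initial and minimum values of alpha

def alpha_constant(t, T):
    """Constant strategy: alpha stays at its default value."""
    return ALPHA0

def alpha_linear(t, T):
    """Linear decay from ALPHA0 to ALPHA_MIN over T training iterations."""
    return ALPHA_MIN + (ALPHA0 - ALPHA_MIN) * (1.0 - t / T)

def alpha_cosine(t, T):
    """Cosine annealing from ALPHA0 to ALPHA_MIN over T training iterations."""
    return ALPHA_MIN + 0.5 * (ALPHA0 - ALPHA_MIN) * (1.0 + math.cos(math.pi * t / T))
```
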
Besides, after training, we discard all components except the online representation extractor f_1 and the fully connected layer FC. Hence, introducing SR in training does not slow down inference. The training pseudocode of models with SR is demonstrated in Algorithm 1.
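A compact sketch of the training objective of Algorithm 1 is given below, with toy linear layers standing in for the extractors, projectors, and predictor; all names and dimensions are illustrative, and the SR scaling (1 − s)∕2 is assumed rather than taken from the released code.

```python
import copy
import torch
import torch.nn.functional as F

class SRClassifier(torch.nn.Module):
    """Minimal sketch of SR training: online extractor f1 + FC head for
    classification; projector g1 and predictor q for SR; target branch
    (f2, g2) is an exponential-moving-average copy of the online one."""
    def __init__(self, feat_dim=32, num_classes=3, proj_dim=8, momentum=0.99):
        super().__init__()
        self.f1 = torch.nn.Linear(64, feat_dim)        # stand-in extractor
        self.fc = torch.nn.Linear(feat_dim, num_classes)
        self.g1 = torch.nn.Linear(feat_dim, proj_dim)  # online projector
        self.q = torch.nn.Linear(proj_dim, proj_dim)   # online predictor
        self.f2, self.g2 = copy.deepcopy(self.f1), copy.deepcopy(self.g1)
        for p in list(self.f2.parameters()) + list(self.g2.parameters()):
            p.requires_grad = False                    # target: no gradients
        self.m = momentum

    @torch.no_grad()
    def ema_update(self):
        """Update target weights as a moving average of the online weights."""
        for o, t in zip(list(self.f1.parameters()) + list(self.g1.parameters()),
                        list(self.f2.parameters()) + list(self.g2.parameters())):
            t.mul_(self.m).add_(o, alpha=1.0 - self.m)

    def loss(self, v1, v2, labels, alpha=0.5):
        h1, h2 = self.f1(v1), self.f1(v2)
        ce = F.cross_entropy(self.fc(h1), labels) + F.cross_entropy(self.fc(h2), labels)
        with torch.no_grad():                          # stop-gradient target branch
            z2t, z1t = self.g2(self.f2(v2)), self.g2(self.f2(v1))
        sr = (1 - F.cosine_similarity(self.q(self.g1(h1)), z2t, dim=-1)).mean() / 2 \
           + (1 - F.cosine_similarity(self.q(self.g1(h2)), z1t, dim=-1)).mean() / 2
        return (ce + alpha * sr) / 2  # mean over the two symmetric pairs
```

At test time, only f1 and fc would be kept, matching the paper's statement that SR adds no inference cost.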

Dataset Description
In this paper, we mainly train and evaluate our proposed method using the largest existing open-access COVID-19 CT dataset, COVIDx CT-2A. Specifically, the dataset contains three classes: normal, non-COVID-19 pneumonia (NCP), and COVID-19 pneumonia (CP). Its class distribution is summarized in Table 2. The dataset is of high diversity, containing scans of 3,745 patients from eight open-access sources. It should be noted that all scans from the same patient are placed in one subset, preventing information leakage from training to validation or testing.

Experimental Setting
In this paper, we keep the hyper-parameters consistent across all experiments for fair comparisons. The code is implemented in PyTorch. We implement the CNN backbones and image augmentation with the torchvision and timm [48] libraries, respectively. For acceleration, we train models with PyTorch distributed data parallelism on four Nvidia V100 GPUs with Apex mixed precision (opt level O1). Besides, to alleviate randomness concerns, we obtain the experimental statistics by averaging the measurements of three distinct trials.
During training, CT scans are resized to 256 × 256 with 3 channels using bicubic interpolation and normalized by the ImageNet mean and standard deviation. At test time, 256 × 256 CT scans are cropped from the center of the resized 293 × 293 original images. This is empirically good, as the center crop preserves the main lung regions. To prevent models from being overconfident in one-class predictions, label smoothing [19] with a smoothing factor of 0.1 is applied to the cross-entropy loss in augmentation levels 0−4. In augmentation levels 5 and 6, in-batch paired labels are mixed up based on the mixed inputs (see [45,46] for more details).
The optimizer is Adam with 10^−6 weight decay. After a 5-epoch linear warmup [50] from 5 × 10^−7, we use a cosine annealing scheduler to decay the learning rate from 5 × 10^−4 to 5 × 10^−7 over the later 45 epochs. The batch size is set to 64 in each process. Besides, gradients are clipped to be no larger than 5.0 to avoid overflow.
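The warmup-plus-cosine schedule above can be written down explicitly. This is a sketch of the stated hyper-parameters, not the exact scheduler implementation used in the code.

```python
import math

BASE_LR, MIN_LR = 5e-4, 5e-7        # peak and floor learning rates
WARMUP_EPOCHS, TOTAL_EPOCHS = 5, 50  # 5-epoch warmup, 45-epoch cosine decay

def learning_rate(epoch):
    """Linear warmup from MIN_LR to BASE_LR over the first 5 epochs,
    then cosine annealing back to MIN_LR over the remaining 45."""
    if epoch < WARMUP_EPOCHS:
        return MIN_LR + (BASE_LR - MIN_LR) * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```
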
In the SR calculation, the projectors g_1, g_2 and the predictor q share the same multi-layer perceptron (MLP) architecture, consisting of two linear layers connected by a batch normalization layer and a ReLU activation layer. The front linear layer projects the inputs to 512-D embedding vectors, and the latter linear layer outputs 128-D vectors. The analysis of this dimension setting is in Appendix B. The momentum rate for updating f_2 and g_2 is 0.99, a median value among contrastive frameworks [35,[39][40][41]].
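The shared projector/predictor MLP can be sketched as below; the helper name is ours.

```python
import torch

def make_mlp(in_dim, hidden_dim=512, out_dim=128):
    """Projector/predictor MLP used in the SR calculation:
    Linear -> BatchNorm -> ReLU -> Linear, with a 512-D hidden layer
    and a 128-D output by default."""
    return torch.nn.Sequential(
        torch.nn.Linear(in_dim, hidden_dim),
        torch.nn.BatchNorm1d(hidden_dim),
        torch.nn.ReLU(inplace=True),
        torch.nn.Linear(hidden_dim, out_dim),
    )
```
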

Results of ResNets under Incremental Augmentation Levels
We first compare the performance of ResNets with and without SR under the incremental augmentation levels designed in Section 3.1 to determine an appropriate augmentation policy for the subsequent experiments. The averaged test accuracies are listed in Table 3. Since SR requires calculating the similarity between two augmented views, models with SR cannot be implemented under augmentation level 0. Table 3 shows that the original ResNets achieve their highest accuracy at level 2, cannot be improved in the following levels, and are heavily degraded in levels 5 and 6. The degradation may result from the fact that sample-fusing augmentation sometimes transfers the pneumonic pathologies from CP/NCP cases to normal cases. We thus do not perform SR in levels 5 and 6. Different from the original ResNets, ResNets with SR continue to improve after level 2 and achieve their highest accuracy at level 4. This is consistent with the findings of many contrastive learning research works that contrastive learning requires stronger augmentation than supervised models [36,37,51]. Hence, we select level 4 as the basic augmentation level for the following experiments. Overall, it is observed that SR stably improves the classification performance of ResNets under all augmentation levels from 1 to 4.
From the results in Table 4, it can be seen that all models with SR surpass the original models in terms of averaged test accuracy. The best model, DenseNet121 with SR, achieves 99.44% accuracy with 7.33 M parameters. Note that the extra parameters are discarded after training, so the number of parameters at test time is identical for a model with or without SR. Another observation is that, in COVID-19 CT classification, model performance is not strictly proportional to capacity, regardless of model architecture. This suggests that fine design of the model architecture, rather than simply expanding depth or width, is more valuable in this task, as supported in [12,14,30]. Fig. 2 shows the confusion matrices of DenseNet121-SR in three training trials. Based on the matrices, we measure the performance of the model in terms of averaged accuracy, precision, sensitivity, and specificity, as listed in Table 5. The results show that our DenseNet121-SR outperforms the state-of-the-art models in nearly all measurements. Specifically, DenseNet121-SR achieves a high sensitivity of 99.59% for the COVID-19 positive class, indicating that the model has the potential to efficiently prevent COVID-19 positive patients from being wrongly diagnosed.
Besides, to better understand the classification principles of the model, its attention visualized by Grad-CAM [52] is demonstrated in Fig. 6 in Appendix C.

Results on Other Datasets
On other COVID-19 CT datasets. Based on the experimental results above, we extend our method to two other COVID-19 CT datasets, i.e., SARS-CoV-2 and COVIDx CT-1. It should be noted that, for SARS-CoV-2, we train DenseNet121-SR for 200 epochs with weights pre-trained on ImageNet, because SARS-CoV-2 contains much fewer CT scans than the others. The results listed in Table 6 show that our method generalizes to other datasets and achieves high classification performance. Compared with the methods listed in Table 1, our DenseNet121-SR with only 6.63 M parameters is more parameter-efficient and outperforms the reviewed methods.

On classic natural datasets. Besides, extensive experiments are conducted on seven natural datasets to further evaluate the generalization ability of our method. To evaluate the effect of SR fairly, we keep the setting unchanged as in Section 4.2 and initialize the model weights as pre-trained on ImageNet. Table 7 demonstrates the classification accuracy of DenseNet121 with and without SR on the seven datasets, including FGVC Aircraft [53], CIFAR10/100 [54], Describable Textures Dataset (DTD) [55], Oxford 102 Flowers [56], Oxford-IIIT Pets [57], and Stanford Cars [58].
The table shows that DenseNet121-SR is superior to the original model on all the tasks, indicating that our proposed SR can be generalized to general classification problems.

Ablation Study
The following ablation studies are conducted to better investigate the effects of our proposed SR.
Fully self-supervised learning. Contrastive learning is widely adopted for pre-training CNNs that are later fine-tuned for downstream tasks. In our method, we turn this process into an end-to-end manner by regularizing CNNs with the proposed SR derived from contrastive learning. Hence, a comparison between SR and conventional contrastive learning is necessary. Specifically, we design and measure the following methods for comparison. a) Linear evaluation: first pre-train the representation extractor, whose weights are frozen during the later FC fine-tuning. The pre-training process is equivalent to setting α = 1 in all training epochs as in Algorithm 1; afterwards, only the linear layer FC is fine-tuned as usual. The fine-tuning hyper-parameters include a batch size of 256 and a learning rate decaying from 40 to 4 × 10^−6 according to a cosine decay scheduler [50]; the optimizer is SGD. Linear evaluation is conducted simply to verify the effect of contrastive learning in this task. b) Two-stage training: self-supervised contrastive learning followed by conventional supervised learning, i.e., pre-training the representation extractor and then training the entire CNN with the pre-trained weights. The hyper-parameters are consistent with the others, as in Section 4.2. c) SR: apply SR to ResNets with the default constant α = 0.5. The results for the designs above are listed in Table 8. It can be observed that contrastive learning can learn efficient representations: even simple linear evaluation on the pre-trained representation extractor achieves over 92% test accuracy. For the second method, two-stage contrastive learning, the representations pre-trained in the extractor might be hard to maintain in the later training phase. Our introduced SR maintains the representations by explicitly penalizing the representation difference between positive pairs. The results in Table 8 verify that ResNets with SR surpass the two-stage contrastive learning method in most experiments.
Besides, it is worth noting that end-to-end training with SR does not require pre-training and thus saves computational resources.
Decay strategy for α. The two-stage contrastive learning method can be approximated by running Algorithm 1 with α = 1 during pre-training and α = 0 during fine-tuning. The sharp fall of α may destroy the representation space obtained in contrastive pre-training. To avoid this potential negative impact, we designed two mild decay strategies in Section 3.2 in addition to the constant strategy. From the results demonstrated in Fig. 3, we can conclude that SR with all designed strategies stably improves the classification accuracy. Moreover, SR is insensitive to the strategy setting, since all strategies have comparable performance. Due to the simplicity of the constant strategy (α = 0.5 in all iterations) and its slight superiority under level 4 augmentation, we select it as the default strategy in our experiments.

α value in the constant strategy
The ablation studies above find that the constant strategy with α = 0.5 achieves the highest performance among the three strategies under level 4 augmentation. We further vary the constant α value to evaluate its robustness. See Appendix D for the detailed results.

Discussion
Since COVID-19 spreads rapidly worldwide, designing efficient and accurate classification systems is essential. Although some methods [12,14,21,22,32] have claimed high classification accuracy (≈ 99%) on multiple datasets, we argue that even a slight improvement can mitigate further infection. Meanwhile, some high-performance methods require considerable computational resources, making them hard to deploy in practical healthcare systems. Hence, designing more efficient models with an affordable number of training parameters should also be considered.
In this paper, we propose an incremental augmentation strategy and SR to improve the CNN classification performance on three COVID-19 CT datasets. The results illustrate that appropriate augmentation can significantly alleviate the data limitation problem in COVID-19 CT classification. Meanwhile, our proposed SR further improves the classification performance of seven CNNs by enhancing their representation learning ability. Specifically, on the largest dataset, COVIDx CT-2A, our model DenseNet121-SR achieves 99.44% accuracy and 99.59% sensitivity with only 6.63 M parameters at test time, outperforming all the reviewed state-of-the-art methods. Besides, we evaluate DenseNet121-SR on the other two datasets, achieving 99.78% and 99.20% accuracy on COVIDx CT-1 and SARS-CoV-2, respectively. To further justify the effect of SR, we extend DenseNet121-SR to seven classic natural datasets, illustrating that SR can be generalized to general classification tasks. Furthermore, since SR derives from contrastive learning, we compare traditional contrastive learning and end-to-end training with SR in ablation studies. The comparison demonstrates that SR is superior in classification accuracy and training efficiency and is robust to its hyper-parameter setting.
Despite the promising performance achieved, our method has limitations. It requires either large amounts of training data or pre-training on other large-scale datasets. The high performance of our models with SR partly owes to the efforts of the workers who collected numerous CT scans. For smaller-scale datasets like SARS-CoV-2, the backbone of our method requires pre-training; pre-training on ImageNet helps improve the accuracy from around 98% to 99.20% in the DenseNet121-SR case. Besides, due to the lack of computational resources, we could hardly evaluate other contrastive frameworks. Meanwhile, we could not redesign the CNN backbones to better balance computational efficiency and classification performance because of the substantial computational loads of neural architecture search and pre-training. For future work, we will explore redesigning the network backbone, pre-training the redesigned backbones on large-scale datasets, and making networks more explainable in COVID-19 CT diagnosis.

Conclusion
This paper aims to improve CNN performance for COVID-19 CT classification by enabling CNNs to learn parameter-efficient representations from CT scans. We propose the SR technique derived from contrastive learning and apply it to seven commonly used CNNs. The experimental results show that SR stably improves CNN classification performance. Together with a well-designed augmentation strategy, our model DenseNet121-SR with 6.63 M parameters outperforms the existing methods on three COVID-19 CT datasets, including SARS-CoV-2, COVIDx CT-1, and COVIDx CT-2A. Specifically, on the largest available dataset, COVIDx CT-2A, DenseNet121-SR achieves 99.44% accuracy, with 98.40% precision, 99.59% sensitivity, and 99.50% specificity for the COVID-19 pneumonia category. Furthermore, the extensive experiments on seven classic natural datasets demonstrate that SR can be generalized to common classification problems.

A. Augmentation Visualization
The augmented CT scans under the designed augmentation levels from 1 to 6 are visualized in Fig. 4. These augmentation operations preserve the main pneumonic pathologies while enhancing dataset diversity.

B. Effect of Projection Size
The output dimension, also called the projection size, of both the projector and predictor in the SR calculation is set to 128 by default. We keep the hidden dimension of 512 unchanged to avoid redundant computation while varying the projection size to analyze its effect on classification accuracy. As visualized in Fig. 5, the differences in classification accuracy for all models except SqueezeNet are small (≤ 0.2%). This indicates that the hyper-parameter setting in our proposed SR is robust.

C. Attention Visualization
To understand the behavior of our model, we visualize the attention of DenseNet121-SR on three CT scans of different classes, as in Fig. 6.

D. Effect of Constant Strategy
The constant α value still requires study to determine its effect on model performance. We thus vary the α value in the constant strategy from 0.1 to 0.9 with an interval of 0.2 and repeat the experiments for CNNs with SR under level 4 augmentation. As shown in Fig. 7, SR improves CNN classification performance when the α value is in an appropriate range, near [0.5, 0.7]. In particular, a smaller α cannot fully exploit the advantage of SR and sometimes even degrades the model capacity, as in the SqueezeNet1.1 case. Meanwhile, setting α to a large value like 0.9 is also risky, since SR then dominates the total loss while the primary cross-entropy for classification is underweighted.