CTMLP: Can MLPs replace CNNs or transformers for COVID-19 diagnosis?

Background Convolutional Neural Networks (CNNs) and hybrids of CNNs and Vision Transformers (VITs) are the current mainstream methods for COVID-19 medical image diagnosis. However, pure CNNs lack global modeling ability, and hybrid CNN-VIT models suffer from large parameter counts and high computational complexity, which makes them difficult to use effectively for medical diagnosis in time-critical applications. Methods We therefore propose CTMLP, a lightweight medical diagnosis network based on convolutions and multi-layer perceptrons (MLPs), for the diagnosis of COVID-19. Previous self-supervised algorithms were built on CNNs and VITs, and their effectiveness for MLPs is not yet known; at the same time, the medical image domain lacks ImageNet-scale datasets for model pre-training. We therefore construct TL-DeCo, a pre-training scheme based on transfer learning and self-supervised learning. Because running TL-DeCo from scratch for every new model is tedious and resource-consuming, we additionally construct a guided self-supervised pre-training scheme for pre-training new lightweight models. Results The proposed CTMLP achieves an accuracy of 97.51%, an f1-score of 97.43%, and a recall of 98.91% without pre-training, with only 48% of the parameters of ResNet50. Furthermore, the proposed guided self-supervised learning scheme improves the baseline of simple self-supervised learning by 1%–1.27%. Conclusion The results show that the proposed CTMLP can replace CNNs or Transformers for a more efficient diagnosis of COVID-19, and the additional pre-training framework makes it more promising in clinical practice.


Background
According to the World Health Organization (WHO), as of April 27, 2022, COVID-19 had caused 6,234,476 deaths globally, and the cumulative number of confirmed cases had reached 509,478,794 [1]. At this stage, effective isolation of confirmed patients through accurate diagnosis of suspected patients remains an effective means of stopping the spread of this epidemic. The most commonly used diagnostic tool for COVID-19 is real-time reverse transcription polymerase chain reaction (RT-PCR) on nasopharyngeal swab samples [2]. In addition, researchers found that chest CT images of patients with confirmed COVID-19 show typical features such as ground-glass opacities, which opened up another path for the diagnosis of COVID-19 [3]. However, manual diagnosis is limited by the number of experts and by individual subjective judgment. In recent years, computer-aided diagnostic systems based on deep learning have made remarkable progress in medical imaging. Compared with manual diagnosis, computer-aided diagnosis offers higher accuracy, and the diagnostic process is not influenced by individual subjective judgment.
As a result, researchers have developed various deep learning-based artificial intelligence-assisted systems [4][5][6] for the diagnosis of COVID-19. Although previous studies have achieved excellent performance in diagnosing COVID-19, some questions remain unresolved. 1) Most of the previous research is based on

Transformer-based vision models
VIT [19] is the first purely Transformer-based vision model, trained by dividing the image into non-overlapping patches. VIT shows that the Transformer performs better than CNNs when there is enough training data. However, due to the lack of inductive bias and of a multi-stage structural design, VIT performs poorly when the amount of data is grossly inadequate. CvT [20] introduced convolution into the Transformer architecture, giving the model translational invariance and thus improving overall performance. PVT [21] and PiT [22] introduced a multi-level pyramid structure into the Transformer to overcome the difficulties of applying it to downstream tasks. Swin Transformer [23] introduced a sliding-window mechanism and a multi-level structural design that models only local relationships at each layer while continuously shrinking the feature map and expanding the receptive field.
The computational complexity of Transformer-based models such as VIT grows quadratically with the sequence length S, whereas the computational complexity of MLP-based models grows linearly with S since they contain only MLP layers. MLPs are therefore simpler than both convolutional neural networks and Transformers, and we can significantly reduce the number of parameters and the amount of computation while maintaining good performance.

MLP-based vision models
MLP is not a new concept in computer vision. Unlike traditional MLP architectures, MLP-Mixer [24] keeps only the MLP layers on top of the Transformer architecture and exchanges spatial information through a token-mixing MLP; this simple architecture yields surprisingly strong results. Touvron et al. [25] proposed ResMLP to address the tendency of visual MLP models to rely on large datasets; ResMLP shows superior performance on the ImageNet dataset with only skip connections and a distillation operation. The above models uniformly use skip connections and normalization to ensure stable training, which we keep in this study while exploring more diverse model designs.
Compared with CNNs and VITs, MLPs have a relatively concise architecture, which allows such models to be deployed in time-critical medical applications. In addition, the architecture can easily handle irregular data such as point clouds, which opens the possibility of unifying future models. Moreover, since all computations in this kind of model are matrix multiplications, resource consumption is greatly reduced, which facilitates deep optimization in deep learning frameworks. Compared with traditional MLPs, the new architecture exchanges information between the spatial and channel dimensions through a token-mixing MLP and a channel-mixing MLP instead of directly flattening both dimensions into one long vector, as sketched below. Thus, the new architecture is not simply a revival of traditional MLPs, but a significant improvement.
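To make the token-mixing/channel-mixing split concrete, the following is a minimal PyTorch sketch of a Mixer-style block in the spirit of MLP-Mixer [24]; the layer sizes and names are illustrative assumptions, not the exact CTMLP implementation.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Mixer-style block: a token-mixing MLP over the spatial axis,
    then a channel-mixing MLP over the feature axis (illustrative sizes)."""
    def __init__(self, num_tokens, dim, token_hidden=256, channel_hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(            # mixes information across tokens
            nn.Linear(num_tokens, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(          # mixes information across channels
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim))

    def forward(self, x):                          # x: (batch, num_tokens, dim)
        y = self.norm1(x).transpose(1, 2)          # (batch, dim, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)  # skip connection, token mixing
        x = x + self.channel_mlp(self.norm2(x))    # skip connection, channel mixing
        return x

# usage: 196 tokens (14 x 14 patches) with 128-dimensional embeddings
block = MixerBlock(num_tokens=196, dim=128)
out = block(torch.randn(2, 196, 128))
```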
However, most of the above models use explicit expressions in the token-mixing module, making the model complicated and difficult to understand and increasing the number of parameters and the amount of computation. Furthermore, the multi-stage structure is standard in CNN models, but its effectiveness has been largely ignored in MLP-based models. Like the Transformer, the models above fail to fully exploit local information in their structural design. However, not all pixels require long-range modeling: low-level semantic information relies more on local modeling, while high-level semantic information tends toward long-range dependencies. We therefore use a convolutional tokenizer at the bottom of our model to improve its local modeling ability while reducing the amount of computation.

Transfer learning
To improve the final performance of a model while shortening training time, it is common to pre-train on a large dataset and then fine-tune on the target dataset. Pre-trained CNNs usually end up performing better than CNNs trained from scratch. Furthermore, it has been shown that the Vision Transformer relies more heavily on large datasets than CNNs do [26]; after pre-training on ImageNet, the Transformer performs on par with CNNs on medical image datasets. These studies show that large-scale pre-training remedies the Transformer's otherwise poor performance on small medical image datasets. MLPs suffer from the same problem as Transformers, but it is still unknown whether MLPs benefit from the same techniques on medical images; we explore this in this study.
However, traditional pre-training solutions usually have the following problems. 1) The ImageNet dataset used for pre-training contains millions of images, whereas medical image datasets usually contain only a few thousand to a few hundred thousand images. 2) ImageNet is a large natural-image dataset containing objects such as flowers, birds, fish, and insects, while medical image datasets contain images of patients' organs or lesions, which differ greatly from natural images. 3) ImageNet has 1000 categories, but medical image datasets have far fewer. 4) The source-domain dataset used for pre-training is labeled, so the information in the many unlabeled datasets is difficult to exploit effectively.

Self-supervised learning
Self-supervised learning means that the model learns generic representational information for downstream tasks only from the structure or properties of the data itself, without relying on human annotation. Self-supervised algorithms can be divided into two main categories: generative and contrastive. The core of contrastive learning is to compute distances between sample representations and to distinguish positive from negative samples; when positive and negative examples can be correctly distinguished, sufficient representational information has been obtained. MoCo [27] uses a queue as the memory bank, effectively reducing memory consumption, and updates the encoder parameters with momentum to solve the inconsistency between old and new encodings. SimCLR [28] uses diverse data augmentation and adds a nonlinear mapping after the encoder so that the representation returns to the essence of the data, obtaining surprising results. The above algorithms are built on a contrastive loss. The first part of the contrastive loss is alignment, which drives samples and their positive examples as close together as possible. The second part is uniformity, which encourages the features of all points to be uniformly distributed on the unit sphere. With alignment alone, the model collapses into a degenerate solution [29]. BYOL [30] avoids collapse in the absence of negative samples through an implicit negative-sample comparison mechanism.
Compared with natural images, medical image data require professional personnel and equipment for acquisition. In fact, annotating medical images is an even bigger problem than acquiring them. The radiology department of a hospital generates a large amount of image data every year, but because of patient privacy, medical image data are generally released only after de-identification. Unlike natural images, medical images need to be annotated by doctors and domain experts, and many subjective judgments are introduced during labeling, causing interference.
Self-supervised learning is therefore of great importance to medical imaging and has received widespread attention. A previous study [26] showed that CNNs outperform Transformers on target-domain datasets when model parameters are randomly initialized, that a Transformer pre-trained on ImageNet performs on par with CNNs on the target-domain dataset, and that a Transformer with self-supervised pre-training outperforms the corresponding CNN on downstream tasks. These techniques for dealing with data scarcity in medical imaging have been shown to work for both CNNs and Transformers, but it is unclear whether they work for MLPs.

Our contributions
Based on the above findings, in this paper we attempt to construct a novel and effective computer-aided diagnostic system that assists physicians in diagnosing subjects, with COVID-19 as the subject of our study. The medical image field is currently dominated by CNNs and hybrid CNN-Transformer models. However, CNNs lack global perception and positional priors, and hybrid CNN-Transformer models are too computationally complex and have too many parameters, so such models are difficult to optimize and deploy in time-critical medical applications. We therefore introduce MLPs in this study: thanks to their global modeling ability, MLPs can outperform CNNs on medium-sized datasets with only a small number of parameters.
Since most current MLPs use explicit expressions in the token-mixing module, they bring a large number of parameters and considerable model complexity. We propose TNMLP, a module with an implicit expression in the token-mixing module, which has a simple structure and effectively reduces the number of parameters. Based on TNMLP, we propose CTMLP, a network with a pyramidal design that eliminates the input-dimension limitation and can be applied directly to various downstream tasks. Due to the specificity of the medical image field, it is difficult to effectively utilize the information in large numbers of unlabeled images. Self-supervised learning enables a model to train on the information in the unlabeled data itself, pointing the way toward solving this problem. However, previous self-supervised algorithms were built on CNNs and Transformers, and their effectiveness for MLPs is not yet known. In addition, previous contrastive self-supervised algorithms based on InfoNCE suffer from the negative-positive-coupling (NPC) effect, which makes them more dependent on a large batch size and more resource-consuming. At the same time, we hope that future research is not limited to CNNs or Transformers but combines the advantages of each model to design more suitable architectures.
Our main contributions to this paper are as follows.
1) A new network CTMLP with a pyramidal design, co-designed from convolutions and MLPs, that maintains relatively good performance while significantly reducing the number of parameters.
2) Injecting knowledge priors into MLPs through transfer learning, enabling MLPs to perform well not only on medium-sized datasets but also on small-scale datasets.
3) A basic self-supervised framework for MLPs that improves the self-supervised learning baseline while significantly reducing resource consumption.
4) A new pre-training framework, TL-DeCo, based on transfer learning and self-supervised learning.
5) A new guided self-supervised learning scheme that further enhances the feature-extraction capability of the lightweight model while simplifying TL-DeCo.

Dataset
To test the performance of our proposed model and algorithm, we conduct experiments on two public datasets, SARS-COV2 Ct-Scan [31] and Large COVID-19 CT scan slice [32]. In addition, we use the ImageNet [33] dataset as the source-domain dataset for pre-training; specific experimental details are provided in subsequent subsections. The SARS-COV2 Ct-Scan dataset is a small dataset containing 1252 COVID-19-positive and 1229 COVID-19-negative CT images. The Large COVID-19 CT scan slice dataset contains 7593 COVID-19 images from 466 positive patients and 6893 normal images from 604 negative patients. We divide both datasets into training, validation, and test sets at a ratio of 0.6 : 0.2 : 0.2; the details of the datasets are shown in Table 1. Because of different scanner models and operator variation during data collection, the format of the data is rather inconsistent, so we uniformly resize the images in both datasets to 224 × 224. Furthermore, to improve the generalization and robustness of the network, we apply different data augmentation operations to the two datasets. Example images and the specific augmentation operations are shown in Fig. 1, and a minimal preprocessing sketch follows below.
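The following is a minimal preprocessing sketch consistent with the description above (resize to 224 × 224 and a 0.6 : 0.2 : 0.2 split); the augmentation choices, folder path, and layout are assumptions for illustration, not the exact pipeline used in the paper.

```python
import torch
from torchvision import datasets, transforms

# training-time transform; the augmentation choices here are illustrative
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# assumes an ImageFolder layout such as root/COVID/xxx.png and root/non-COVID/yyy.png
full = datasets.ImageFolder("SARS-COV2-Ct-Scan", transform=train_tf)
n = len(full)
n_train, n_val = int(0.6 * n), int(0.2 * n)
train_set, val_set, test_set = torch.utils.data.random_split(
    full, [n_train, n_val, n - n_train - n_val],          # 0.6 : 0.2 : 0.2 split
    generator=torch.Generator().manual_seed(0))
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
# in practice the validation/test subsets would use a deterministic (augmentation-free) transform
```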

Methodology
In Subsection 3.1, we first introduce the general framework and structural design of CTMLP, then elaborate on the design of the convolutional tokenizer and the pyramid structure. In Subsection 3.2, we present the core component TNMLP in detail. In Subsection 3.3, we first improve the self-supervised learning baseline while significantly reducing its resource consumption, and then propose TL-DeCo, a new algorithmic framework for CTMLP pre-training that combines traditional pre-training with self-supervised learning. In Subsection 3.4, we explore three guided self-supervised learning schemes to seek the best performance the model can achieve.

Convolution with triple normalization MLP (CTMLP)
The overall structure of CTMLP is shown in Fig. 2. When we train the model, the bottom layers learn low-level semantic information such as edges and textures, and the top layers learn more abstract, high-level semantic information. Compared with MLPs, CNNs focus more on local information and are more adept at handling low-level semantics. To reduce the computational cost of the model, and unlike other MLP structures, we do not directly divide the image into patches; instead, we use a convolutional tokenizer to extract the initial feature map. We then label the four similar stages stage1, stage2, stage3, and stage4, obtaining one feature map after each stage, i.e., four feature maps in total. For downsampling, we follow the same idea as the Swin Transformer, but to increase the spatial interaction capability of the model and to simplify it, we do not use the patch-merging-based downsampling of the Swin Transformer; instead, we downsample the feature map in each stage using 3 × 3 convolutional layers with stride 2.
As described above, we use a convolutional tokenizer to replace the patch tokenizer, improving the model's ability to extract low-level semantic information while reducing computation. The convolutional tokenizer consists of three identical convolutional blocks and a max-pooling layer. Each convolutional block consists of a 3 × 3 convolutional layer, a ReLU activation function, and a batch normalization layer. A possible implementation is sketched below.
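A possible PyTorch rendering of the convolutional tokenizer just described (three 3 × 3 conv blocks with batch normalization and ReLU, followed by max pooling); the channel widths, pooling parameters, and exact layer order are assumptions.

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3 convolution with batch normalization and ReLU (layer order assumed)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True))

class ConvTokenizer(nn.Module):
    """Three identical conv blocks followed by max pooling; widths are illustrative."""
    def __init__(self, in_ch=3, embed_dim=64):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(in_ch, embed_dim),
            conv_block(embed_dim, embed_dim),
            conv_block(embed_dim, embed_dim))
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        return self.pool(self.blocks(x))        # (B, embed_dim, 112, 112)
```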
In TNMLP, we follow the generic MLP architecture design, which consists of a token-mixing MLP and a channel-mixing MLP. We introduce them in detail in Subsection 3.2.
The multi-stage design follows the setting of the Swin Transformer, and we set the ratio of blocks to 1 : 1 : 3 : 1, a choice also supported by ConvNeXt [34]: the ratio in ResNet is 3 : 4 : 6 : 3, and ConvNeXt modifies the block ratio to 3 : 3 : 9 : 3 on the basis of ResNet, which improves model performance.
In this study, we provide two versions of CTMLP: the number of blocks in CTMLP is 2 : 2 : 6 : 2, and the number of blocks in CTMLP_B is set to 3 : 3 : 9 : 3. The stage layout is sketched below.
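The pyramid structure described above can be sketched as follows, with 3 × 3 stride-2 convolutions between stages and the 2 : 2 : 6 : 2 block counts of the smaller CTMLP; `make_block` stands in for the TNMLP block of the next subsection, and all channel widths are assumptions.

```python
import torch.nn as nn

class CTMLPBackbone(nn.Module):
    """Four-stage pyramid with 3x3 stride-2 conv downsampling between stages.
    `make_block` is a placeholder for the TNMLP block of Subsection 3.2;
    depths follow the 2:2:6:2 setting of CTMLP (3:3:9:3 for CTMLP_B)."""
    def __init__(self, make_block, dims=(64, 128, 256, 512), depths=(2, 2, 6, 2)):
        super().__init__()
        stages, in_dim = [], dims[0]
        for dim, depth in zip(dims, depths):
            down = (nn.Identity() if dim == in_dim else
                    nn.Conv2d(in_dim, dim, kernel_size=3, stride=2, padding=1))
            stages.append(nn.Sequential(down, *[make_block(dim) for _ in range(depth)]))
            in_dim = dim
        self.stages = nn.ModuleList(stages)

    def forward(self, x):              # x: tokenizer output, (B, dims[0], H, W)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)            # four feature maps, one per stage
        return feats

# placeholder usage with an identity block, e.g. CTMLPBackbone(lambda d: nn.Identity())
```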

Triple normalization MLP (TNMLP)
In this subsection, we first analyze the self-attention mechanism. Suppose the input feature map is $I_{in} \in \mathbb{R}^{E \times n}$, where $E$ is the number of elements in the feature map and $n$ is the feature dimension. The query, key, and value matrices of the self-attention mechanism can then be denoted as $Q \in \mathbb{R}^{E \times n}$, $K \in \mathbb{R}^{E \times n}$, and $V \in \mathbb{R}^{E \times n}$, and the output of self-attention is $A V$ with the similarity matrix
$$A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{n}}\right) \in \mathbb{R}^{E \times E},$$
where $A_{i,j}$ denotes the similarity between the $i$-th element and the $j$-th element. After that, we set the query, key, and value matrices equal to the input $I_{in}$, so the similarity matrix simplifies to
$$A = \mathrm{softmax}\!\left(\frac{I_{in} I_{in}^{\top}}{\sqrt{n}}\right).$$
In the Transformer [19], multi-head attention first randomly initializes projections of the input into different subspaces; knowledge from multiple attention heads is then aggregated through representations of different subspaces derived from the same queries, keys, and values, improving on single-head attention. We simplify multi-head attention in the same spirit, as illustrated in Fig. 3, so that each head can be expressed as
$$\mathrm{head}_i = \mathrm{Norm}\!\left(I_{in} U_k^{\top}\right) U_v, \quad i = 1, \dots, H,$$
where $\mathrm{Norm}(\cdot)$ denotes the double normalization described below.
In the above equation, $\mathrm{head}_i$ denotes the $i$-th head in multi-head attention, $H$ denotes the total number of heads, and $U_k \in \mathbb{R}^{E \times n}$ and $U_v \in \mathbb{R}^{E \times n}$ denote the two parameter-sharing cells. In the shared cells, we balance the number of heads $H$ against the number of elements $E$.
As shown in Fig. 4(a), the attention map in self-attention is normalized row by row using softmax. Therefore, when the values in one column of the attention map are too large, the attention map becomes biased, destroying its physical meaning. As shown in Fig. 4(b), we therefore first apply a softmax over the columns (this step only removes the effect of scale and does not normalize the rows) and then perform row-by-row normalization of the attention map with softmax.
After that, we imitate this paradigm with two linear layers. To increase the spatial interaction capability of the model, we add a Batch Normalization layer between the two linear layers,
$$\mathrm{BN}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta,$$
where $\gamma$ and $\beta$ are learnable parameters; the token-mixing module of TNMLP thus combines the double normalization above with the two linear layers and this Batch Normalization. The overall algorithm of TNMLP is shown in Table 2. We first use Layer Normalization and pass the token-mixing information to the channel-mixing module for channel information exchange. We use GELU [35] as the activation function in the channel-mixing module; compared with ReLU, GELU is smoother and performs better. A hedged sketch of the token-mixing module follows below.
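The following PyTorch sketch is one possible reading of the token-mixing module described above (two linear layers acting as the shared cells, double normalization of the attention-like map, and a Batch Normalization in between); the layer sizes, ordering, and details are assumptions, not the exact TNMLP implementation.

```python
import torch
import torch.nn as nn

class TokenMixing(nn.Module):
    """One reading of the TNMLP token-mixing module: two linear layers as shared
    units, double normalization (column-wise then row-wise softmax), and a
    BatchNorm between the two linear layers. All sizes are illustrative."""
    def __init__(self, dim, num_units=64):
        super().__init__()
        self.to_attn = nn.Linear(dim, num_units, bias=False)   # analogous to U_k
        self.bn = nn.BatchNorm1d(num_units)
        self.proj = nn.Linear(num_units, dim, bias=False)      # analogous to U_v

    def forward(self, x):                      # x: (B, N, dim) tokens
        attn = self.to_attn(x)                 # (B, N, num_units)
        attn = attn.softmax(dim=1)             # column-wise: removes the effect of scale
        attn = attn.softmax(dim=-1)            # row-wise normalization
        attn = self.bn(attn.transpose(1, 2)).transpose(1, 2)   # BN between the two linears
        return self.proj(attn)                 # (B, N, dim)

out = TokenMixing(dim=128)(torch.randn(2, 196, 128))
```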

Table 2
The pseudocode for TNMLP.

Transfer learning with DeCo (TL-DeCo)
In medical image analysis, transfer learning is usually adopted to improve a model's performance on the target dataset when data are severely lacking. However, the source-domain data in traditional transfer learning schemes are all labeled images. Compared with natural images, medical image datasets must be labeled by experts in related fields, which requires substantial human and financial resources; as a result, a large amount of information in unlabeled images cannot be used effectively. With the aid of self-supervised learning, models can learn more generic representational information for downstream tasks from the properties of the data itself, without relying on manual labeling. Furthermore, previous self-supervised algorithms were constructed based on CNNs and Transformers, and the effectiveness of this class of algorithms for MLPs is not yet known.
In addition, previous self-supervised algorithms relied on a large batch size, which is very resource-intensive. Therefore, we first raise the baseline of the simple self-supervised algorithm and eliminate its reliance on a large batch size. Unlike traditional InfoNCE-based contrastive learning algorithms, we base the design of our self-supervised algorithm on decoupled contrastive learning and design the Teacher-Student framework as an asymmetric structure.
The loss function of previous InfoNCE-based contrastive self-supervised algorithms usually takes the form shown in Equation (15), where $\langle z_n^{(1)}, z_n^{(2)} \rangle$ is the similarity between a sample in the batch and its positive sample, and $\langle z_n^{(1)}, z_f^{(e)} \rangle$ is the similarity between a sample in the batch and all samples. Taking the gradients with respect to $z_n^{(1)}$, $z_n^{(2)}$, and $z_f^{(e)}$ gives (16), (17), and (18), respectively. The same coefficient $\alpha_n^{(1)}$ appears in (16), (17), and (18); $\alpha_n^{(1)}$ reflects the strong coupling between positive and negative samples.
When the batch size is small, the denominator of $\alpha_n^{(1)}$ is limited by the batch size and $\alpha_n^{(1)}$ takes a smaller value. There is therefore a significant negative-positive-coupling (NPC) effect in this type of algorithm, which makes it depend heavily on a larger batch size and, in turn, on hardware resources.
Therefore, we introduce the decoupled contrastive learning (DCL) [36] objective into the contrastive learning algorithm. DCL obtains the loss function (20) by removing the NPC coefficient $\alpha_n^{(1)}$. Taking the gradients with respect to $z_n^{(1)}$, $z_n^{(2)}$, and $z_f^{(e)}$ gives (21), (22), and (23), respectively. Removing the NPC coefficient effectively reduces the dependence of this type of algorithm on a large batch size. At the same time, previous self-supervised algorithms were mainly built with CNNs and Transformers, so the effectiveness of this class of algorithms for MLPs is still unknown.
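As a concrete illustration, here is a hedged PyTorch sketch of a decoupled contrastive loss in the spirit of DCL [36]: the positive pair is excluded from the denominator, which removes the NPC coefficient. The temperature, the restriction to cross-view negatives, and the one-directional form are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def dcl_loss(z1, z2, temperature=0.1):
    """Decoupled contrastive loss sketch (one direction, cross-view negatives only).
    z1, z2: (B, D) embeddings of two augmented views of the same batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    pos = (z1 * z2).sum(dim=1) / temperature          # positive-pair similarities
    sim = z1 @ z2.t() / temperature                   # (B, B) cross-view similarities
    eye = torch.eye(z1.size(0), dtype=torch.bool, device=z1.device)
    # decoupling: the positive pair is removed from the denominator (no NPC coefficient)
    neg_term = torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1)
    return (-pos + neg_term).mean()
```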
Unlike supervised learning, unsupervised learning has a greater degree of freedom and thus places more emphasis on global structure. We know from [37] that models emphasizing global structure tend to capture low-frequency information first, and that a larger batch size better extracts low-frequency information, whereas reducing the batch size can improve the generalization ability of the model. Therefore, to further reduce the dependence of the self-supervised algorithm on batch size, we propose the self-supervised framework TL-DeCo for pre-training MLPs.
Our TL-DeCo algorithm is divided into three stages. We first pre-train the model on the labeled ImageNet dataset, which gives the model certain representational and low-frequency information. Afterward, we perform unlabeled self-supervised pre-training on a dataset similar to the target-domain dataset. Finally, we hide the labels of the target-domain dataset for self-supervised pre-training. This allows us to use smaller batch sizes for self-supervised pre-training while increasing the buffer space between the source and target domains. This pre-training method effectively resolves the excessive difference between the source and target domains during pre-training and, at the same time, makes full use of the information in unlabeled images. In addition, the smaller batch size in the self-supervised algorithm further improves the generalization of the representations obtained by pre-training. The overall structure of the TL-DeCo algorithm is shown in Fig. 5.
In self-supervised pre-training, we use the classical Teacher-Student structure. The Teacher and Student branches use Encoder networks with the same initialization parameters for feature extraction. The difference is that, in addition to the Encoder, the Teacher branch contains a Projector with a two-layer MLP structure for feature mapping, as well as a first-in-first-out queue S for storing all sample pairs. The Student branch contains not only the Encoder for feature extraction but also Projector and Predictor structures for feature mapping. The Projector and Predictor have the same structure, as shown in (24), so the Student and Teacher branches are asymmetric.
We first apply random cropping, Gaussian blurring, random horizontal flipping, and other data augmentations to the original image to obtain two images, x_q and x_k, which we mark as the query and the key. We then obtain the representations Q = Student(x_q) and K = Teacher(x_k) through the Student and Teacher branches with their initialized parameters. Q and K from the same image are marked as a positive pair, and Q and K from different images as negative pairs. In addition, we use the first-in-first-out queue to store the K of different samples; during training, old K are replaced by new ones, which avoids inconsistent sampling of negatives. During training, the Student branch is updated by backpropagation, while the Teacher branch is updated by Equation (26) and does not involve backpropagation. Although α = 0.999 makes the Teacher parameters update slowly, the parameters of both branches are updated at every step, which solves the consistency problem caused by unsynchronized sampling in traditional self-supervised algorithms. The pseudocode of TL-DeCo is shown in Table 3, and a sketch of the momentum update and the queue follows below.
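The momentum update of Equation (26) and the first-in-first-out queue can be sketched as follows (MoCo-style); the class names, queue size, and feature dimension are illustrative assumptions.

```python
import torch

@torch.no_grad()
def momentum_update(teacher, student, alpha=0.999):
    # Teacher parameters follow the Student as an exponential moving average;
    # the Teacher branch is never updated by backpropagation.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(alpha).add_(p_s.data, alpha=1.0 - alpha)

class KeyQueue:
    """First-in-first-out buffer of Teacher keys K used as negatives."""
    def __init__(self, dim=128, size=4096):
        self.buffer = torch.nn.functional.normalize(torch.randn(size, dim), dim=1)
        self.ptr, self.size = 0, size

    @torch.no_grad()
    def enqueue(self, keys):                    # keys: (B, dim); oldest entries are overwritten
        b = keys.size(0)
        idx = torch.arange(self.ptr, self.ptr + b) % self.size
        self.buffer[idx] = keys.detach()
        self.ptr = (self.ptr + b) % self.size
```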

Guided learning with TL-DeCo
The TL-DeCo framework shortens the distance between the source-domain and target-domain data in the pre-training procedure and, at the same time, enables the information in unlabeled images to be used effectively. However, the framework must be re-run every time a new model is designed, which is cumbersome and resource-consuming. In addition, the limited number of parameters of a lightweight model restricts its feature-extraction ability. Therefore, we first pre-train the Teacher model with the TL-DeCo framework and then guide the Student model to imitate the Teacher's output, further improving the Student's performance on the target dataset. Moreover, since the model's feature space tends to be smooth after self-supervised pre-training, whereas guided learning makes the feature space tend to be sharp, self-supervised pre-training can improve supervised learning. In this paper, we construct three guided self-supervised pre-training schemes to find the best performance of the model.
Scheme1: The Teacher model is pre-trained with the full TL-DeCo framework, while the parameters of the Student model are randomly initialized. The Teacher model then guides the Student model to learn on the target dataset.
Scheme2: Both the Teacher model and the Student model are pretrained with the full TL-DeCo framework, and then the Student model is guided by the Teacher model to learn on the target dataset.
Scheme3: Unlike Scheme1 and Scheme2, the Teacher and Student models in Scheme3 first perform supervised pre-training on the ImageNet dataset and then self-supervised pre-training on a large dataset similar to the target dataset. After these steps, the Student model is guided by the Teacher model on the target dataset with the labels hidden, and finally performs supervised learning on the target dataset. The overall framework of guided learning with TL-DeCo is shown in Fig. 6.
In the above three schemes, the guided learning in Scheme1 and Scheme2 is performed on the labeled target dataset; we therefore train the Student model by minimizing the divergence between the Teacher output p_Teacher and the Student output together with the supervised loss on the labels. In Scheme3, guided learning is performed on the target dataset with the labels hidden, so the labels are unknown and the Student is trained only by imitating the Teacher's output. In addition, to keep the overall experimental scheme concise, our Teacher model adopts offline pre-training; that is, when the Student model undergoes guided learning, the parameters of the Teacher model are frozen. The pseudocode of the guided self-supervised scheme is shown in Table 4, and a sketch of the guided-learning objective follows below.
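A hedged sketch of the guided-learning objective: on the labeled target set (Scheme1 and Scheme2) the Student minimizes a divergence between its output and the Teacher's output plus the supervised cross-entropy; on the unlabeled set (Scheme3) only the divergence term is used. The use of KL divergence, the temperature, and the weighting are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def guided_loss(student_logits, teacher_logits, labels=None, T=2.0, alpha=0.5):
    """KL divergence between softened Teacher and Student outputs,
    optionally combined with cross-entropy when labels are available."""
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=1)   # Teacher is frozen
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
    if labels is None:                  # Scheme3: labels hidden, imitate the Teacher only
        return kd
    ce = F.cross_entropy(student_logits, labels)                # Scheme1/2: labeled target set
    return alpha * kd + (1.0 - alpha) * ce
```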

Experimental settings
All experiments in this study are conducted on a server with 64 GB RAM, an Intel Xeon Silver 4214 CPU, and an NVIDIA Quadro RTX 8000 GPU. The network models and algorithmic frameworks in this study are built on the PyTorch framework, and the specific experimental hyperparameter settings are detailed in subsequent subsections. Finally, to validate the effectiveness of our proposed network model and algorithmic framework, we evaluate it using Accuracy, F1-score, Recall, Precision, and AUC. TP, FP, TN, and FN represent the true positive, false positive, true negative, and false negative counts, respectively. Accuracy, the main evaluation metric, is the ratio of correct predictions to the size of the test set. Precision and Recall represent, respectively, the model's ability to distinguish true from false positives and the proportion of true positives identified among all positive cases. F1-score is an overall evaluation metric, and AUC is the area under the ROC curve. Higher values of all these metrics indicate better model performance.
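For reference, these quantities follow the standard definitions in terms of TP, FP, TN, and FN (stated here for completeness):
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$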

Evaluation of CTMLP
In this subsection, we test the performance of our proposed model CTMLP and its extended version CTMLP_B. Table 5 gives the detailed computation and parameter counts of our proposed models and other models. Our proposed CTMLP achieves a better balance of computation and parameter volume than comparable MLP-architecture models as well as CNN- and Transformer-based models. Furthermore, our extended version CTMLP_B remains comparable to the lightweight MLP architecture ResMLP in parameter count and computation.
We test on two public COVID-19 datasets, SARS-COV2 Ct-Scan and Large COVID-19 CT scan slice. Table 6 shows the detailed hyperparameter settings for the experiments on the Large COVID-19 CT scan slice dataset, which contains 14486 chest CT images, and Table 7 shows the detailed results on this dataset. Thanks to the global modeling capability of MLP, our proposed CTMLP achieves even better performance with only 48% of the parameters of ResNet50. In addition, compared with lightweight models such as ResNet18 and DenseNet121, CTMLP achieves a better balance between performance and computation. This shows that when there is enough data, we can explore other, more efficient model architectures instead of sticking to CNNs.
Compared with similar MLP-based model architectures, our proposed CTMLP achieves superior performance while significantly reducing the model computation and parameter count. Our proposed extended version of CTMLP_B has a similar number of parameters to ResMLP while improving its accuracy by 1.01%. In addition, compared with the classic CNNs and similar models, CTMLP_B has the best precision and f1-score of 98.24% and 97.81%, respectively. This indicates that the CTMLP_B not only effectively distinguishes true positive cases from false positive cases but also has excellent comprehensive performance, which is essential for effectively controlling the spread of COVID-19.
The SARS-COV2 Ct-Scan dataset contains 2481 chest CT images. Table 8 shows the detailed hyperparameters of the experiments on this dataset, and Table 9 the corresponding results. We find that when the amount of data is too small, Transformer- and MLP-based models lag behind CNNs because of their lack of inductive bias; this deficiency is particularly prominent in the initial architectures VIT and MLP-Mixer.
However, this does not mean that MLP-based models have no merit when data are severely scarce. Our follow-up experiments show that after injecting prior knowledge into both CNNs and MLPs, this type of model can exceed CNNs even when the amount of data is severely lacking. At the same time, the relatively simple structure of these models makes them easier to optimize and deploy at the edge, and their ability to handle irregular data such as point clouds offers the possibility of unifying model architectures.
In summary, by using the TNMLP module, our proposed CTMLP avoids the heavy parameter and computation cost of the explicit token-mixing representations used by models of the same type. It has a clear advantage over comparable MLP-based architectures and unique advantages over CNNs and Transformers, and it provides a new perspective on visual model design in medical image analysis. In addition, CTMLP can effectively distinguish true positive from false positive cases, helping healthcare professionals to treat true positive cases promptly and stop the spread of the epidemic.

Evaluation of transfer learning
As shown in Subsection 4.2, when data are severely lacking, CNN-based models outperform MLP- and Transformer-based models because the latter lack inductive bias. Therefore, in this study we explore whether there is a general solution that allows MLP-based models to achieve good results even when the amount of data is severely lacking. Table 10 shows the detailed hyperparameter settings for the experiments conducted on SARS-COV2 Ct-Scan. Table 12 shows that ResNet50 performs much better than CTMLP when the model parameters are initialized randomly, again due to the lack of inductive bias. In this subsection, we design three different transfer learning schemes to inject knowledge priors into MLPs so that MLP-based models still perform well when data are severely scarce; Table 11 shows the detailed setup of the schemes. In Scheme II, we adopt the standard practice of the medical image field and fine-tune the model on the target dataset after pre-training on ImageNet-1k [41]; the hyperparameter settings for pre-training on ImageNet-1k are detailed in Table 10. In addition, for a fair comparison, we do not use the official ImageNet pre-training weights for ResNet50 but use the same pre-training experimental settings as CTMLP. Table 12 shows that both CNN and MLP clearly benefit from ImageNet initialization, with MLP apparently benefiting more than CNN, which shows that when data are severely lacking, introducing knowledge priors into the model can effectively compensate for the lack of inductive bias.
However, whether this is the best pre-training solution is still debatable because of the large difference in classes and size between ImageNet and the target-domain dataset. The Large COVID-19 CT scan slice dataset has a quantitative advantage over the SARS-COV2 Ct-Scan dataset, and both are chest CT datasets of COVID-19 subjects, so the two are much more similar. In Scheme III, we therefore use the Large COVID-19 CT scan slice dataset as the source-domain data to pre-train the model and then fine-tune it on the SARS-COV2 Ct-Scan dataset. The results in Table 12 show that, compared with ImageNet as the source-domain data, using a dataset similar to the target dataset is a more reasonable pre-training scheme and improves the performance of the model on the target dataset.
However, compared with ImageNet, the Large COVID-19 CT scan slice dataset used as the source domain in Scheme III is still small, so using it alone as the source-domain data may limit the model's performance. In addition, MLPs have superior global modeling capability compared with CNNs and may therefore perform better as the data volume grows. To further explore the upper limit of model performance, we propose Scheme IV: first pre-train the model on ImageNet, then transfer it to the Large COVID-19 CT scan slice dataset for further pre-training, and finally fine-tune on the target dataset. Table 12 shows that CTMLP improves on ResNet50 by 1.62% and 0.03% in recall and f1-score, respectively. This indicates that as the amount of source data used for pre-training increases, the performance of both CNN and MLP improves further, with CTMLP benefiting more thanks to the global modeling capability of MLP.
This shows that introducing knowledge priors into MLP-based models solves the problem of their poor performance caused by the lack of inductive bias when data are severely scarce. As a result, MLP-based models are promising on both small and large datasets, and unifying them is possible.

Evaluation of TL-DeCo
ImageNet-scale datasets for model pre-training are often lacking in medical image analysis. We found in Subsection 4.3 that pre-training the model on ImageNet first and then on a large dataset similar to the target dataset not only compensates for the lack of large-scale pre-training datasets but also brings the source-domain data used for pre-training closer to the target-domain data.
We then designed a new pre-training framework, TL-DeCo, based on transfer learning and self-supervised learning. We first pre-train the model on ImageNet, then perform self-supervised pre-training on Large COVID-19 CT scan slice with the labels hidden, and finally perform self-supervised pre-training on the SARS-COV2 Ct-Scan dataset with the labels hidden. After pre-training is complete, fine-tuning is performed on the SARS-COV2 Ct-Scan dataset. We use a traditional pre-training scheme (i.e., ImageNet as the source-domain data) as a baseline to verify the effectiveness of our framework. The data in Table 13 show that our TL-DeCo framework improves the accuracy of CTMLP by 1.41%, while ResNet50 improves by 1.01%, demonstrating the effectiveness of the proposed pre-training framework. We then replace the self-supervised algorithm in the pre-training framework with the simple self-supervised baseline MoCo V2; Table 14 shows the specific hyperparameter settings of this experiment. Whereas TL-MoCo V2 requires a batch size of 256, our pre-training framework requires a batch size of only 32. The data in Table 13 show that, compared with TL-MoCo V2, our framework significantly reduces resource consumption while improving on the TL-MoCo V2 baseline. This opens up the possibility of running self-supervised algorithms on personal computers without relying on servers.
Since the SARS-COV2 Ct-Scan dataset is very small, models easily overfit on it. Even though the feature-extraction ability of CTMLP_B is excellent, its accuracy is still only 97.98%. Therefore, to alleviate overfitting, we replace the final fully connected layer of the pre-trained CTMLP_B with an extreme learning machine (ELM) [42], sketched below. The resulting hybrid model achieves an accuracy of 99.80%.
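A minimal sketch of an ELM head replacing the fully connected classifier: a random, fixed hidden layer followed by output weights solved in closed form with the pseudo-inverse. The hidden size, activation, and solver are assumptions for illustration, not the exact configuration of [42].

```python
import torch

class ELMHead:
    """Extreme learning machine head: random fixed hidden weights,
    output weights fit by least squares on frozen backbone features."""
    def __init__(self, in_dim, hidden_dim=1024, num_classes=2):
        self.W = torch.randn(in_dim, hidden_dim) / in_dim ** 0.5   # fixed, never trained
        self.b = torch.randn(hidden_dim)
        self.beta = None                                           # solved in closed form
        self.num_classes = num_classes

    def _hidden(self, feats):
        return torch.sigmoid(feats @ self.W + self.b)

    def fit(self, feats, labels):
        H = self._hidden(feats)                                    # (N, hidden_dim)
        T = torch.nn.functional.one_hot(labels, self.num_classes).float()
        self.beta = torch.linalg.pinv(H) @ T                       # least-squares solution

    def predict(self, feats):
        return (self._hidden(feats) @ self.beta).argmax(dim=1)
```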

Evaluation of guided learning with TL-DeCo
In Subsection 4.4, we proposed the TL-DeCo framework for model pre-training. This framework closes the distance between the source-domain and target-domain data in the pre-training scheme and, at the same time, enables the information in unlabeled images to be utilized effectively. However, the scheme requires pre-training from scratch every time a new model is created, which consumes many resources and is tedious. In addition, the limited parameter budget of lightweight models (e.g., CTMLP) results in weaker feature extraction than larger models (e.g., CTMLP_B). Therefore, we propose guided self-supervised pre-training schemes that make the pre-training process simple and efficient while improving the feature-extraction ability of the lightweight model. Compared with guided learning, the model's feature space after self-supervised pre-training tends to be smooth, so combining the two may remedy the overly sharp feature space of guided learning. In this study, we design three different schemes to find the optimal performance of the model.
In Scheme1, we first pre-train the Teacher model on ImageNet, then perform self-supervised pre-training on the unlabeled Large COVID-19 CT scan slice dataset, and finally complete the Teacher's pre-training with self-supervised pre-training on the SARS-COV2 Ct-Scan dataset with the labels hidden. The Student model, with randomly initialized parameters, is then guided by the Teacher model on the SARS-COV2 Ct-Scan dataset.
We use the performance of CTMLP under the TL-DeCo scheme in Subsection 4.4 as baseline I and the traditional pre-training scheme (i.e., ImageNet as the source-domain data) as baseline II. The data in Table 15 show that when we randomly initialize the Student parameters and conduct guided learning through the Teacher model, the accuracy of the Student model is 97.38%, slightly lower than baseline I (97.58% accuracy). However, compared with baseline II (96.17% accuracy), the accuracy of the Student in Scheme1 increases by 1.21%. Although the Student performs better under baseline I than in Scheme1, Scheme1 still achieves good results without requiring a tedious pre-training process when designing a lightweight model.
In Scheme2, the Teacher and Student models are pre-trained in the same way as the Teacher model in Scheme1, and the Teacher model then guides the Student model on the SARS-COV2 Ct-Scan dataset. The data in Table 15 show that, compared with the randomly initialized Student of Scheme1, a Student model that has undergone full TL-DeCo pre-training achieves better performance from guided learning on the target dataset.
In Scheme3, we first conduct supervised pre-training of the Teacher and Student models on the ImageNet dataset and then conduct self-supervised pre-training of both on the unlabeled SARS-COV2 Ct-Scan dataset. Finally, guided self-supervised pre-training of the Student model is performed on the unlabeled COVID-19 dataset. After pre-training is complete, we fine-tune the Student model on the downstream task.
Compared with Scheme1 and Scheme2, the Teacher model in Scheme3 does not undergo the complete pre-training process. In addition, guided learning in Scheme1 and Scheme2 is performed on the labeled SARS-COV2 Ct-Scan dataset, whereas in Scheme3 it is performed on the unlabeled SARS-COV2 Ct-Scan dataset. The data in Table 15 show that, compared with labeled guided learning, guided learning on an unlabeled dataset helps to remedy the overly sharp model feature space of guided learning.

Comparison with state-of-the-art approaches
To verify the validity of the proposed model and algorithmic framework for the diagnosis of COVID-19, we compared them with five models (NAGNN [43], Deep-COVID [6], DarkCOVIDNet [44], Patch-based CNN [45], and COVID-ResNet [46]). We conducted the corresponding experiments on the SARS-COV2 Ct-Scan dataset; detailed results are shown in Table 16. Compared with the five existing SOTA methods, our proposed method achieves the best f1-score and recall, 97.98% and 98.37%, respectively. This indicates that our method can accurately identify true positive cases, which is important for stopping the spread of the epidemic. In addition, as shown in Fig. 7, the overall performance of our proposed method remains excellent.
Unlike previous CNN-based SOTA methods, our proposed method relies on a new MLP structure to diagnose COVID-19 and still performs well while significantly reducing the number of model parameters. Moreover, the above methods are all supervised learning methods; due to the particularities of medical image analysis, a large amount of information in unlabeled images is difficult to exploit effectively. Our proposed guided self-supervised learning framework significantly improves the feature-extraction ability of lightweight models while exploiting information in both unlabeled and labeled images to diagnose suspected COVID-19 patients. This demonstrates the unique advantages of our proposed model and algorithmic framework.

Discussions
In this study, we first analyze the problems of the existing mainstream methods for diagnosing COVID-19. To address them, we propose CTMLP, a model based on CNNs and MLPs. The model adopts an implicit expression in the token-mixing module, making it more concise and lighter than previous models of the same type.
We then validated the model's performance on the SARS-COV2 Ct-Scan and Large COVID-19 CT scan slice datasets. The results show that, benefiting from the global modeling ability of MLP, our proposed model outperforms ResNet50 on the Large COVID-19 CT scan slice dataset with only 48% of its parameters. On the much smaller SARS-COV2 Ct-Scan dataset, MLP- and Transformer-based models lack inductive bias, so CNNs outperform these two types of models; nevertheless, our CTMLP remains strongly competitive with far fewer parameters and much less computation than comparable models.
When the amount of data is seriously lacking, the standard solution in the medical image field is to fine-tune an ImageNet pre-trained model on the target dataset. Therefore, to eliminate the performance gap between CTMLP and ResNet50 in this setting, we employ a transfer learning scheme to inject knowledge priors into CTMLP. At the same time, because the source- and target-domain data in traditional pre-training differ greatly in category and quantity, we designed three different pre-training schemes to explore the transferability of the MLP-based architecture. The final results show that, once knowledge priors are introduced, MLP-based and CNN-based models achieve comparable performance even when the dataset is severely deficient. However, in the above pre-training schemes, the source-domain data are labeled images. Furthermore, previous self-supervised algorithms based on contrastive losses suffer from the negative-positive-coupling (NPC) effect and often rely on a large batch size, and the medical image field lacks ImageNet-scale datasets for model pre-training. Therefore, we improve the baseline of self-supervised learning algorithms while eliminating their dependence on large batch sizes, and then build TL-DeCo, a new pre-training framework combining the traditional transfer learning scheme with self-supervised learning. Our proposed CTMLP_B, pre-trained with the TL-DeCo framework and fine-tuned on the target dataset SARS-COV2 Ct-Scan, achieves 97.98% accuracy.
Since the model's feature space after self-supervised pre-training is relatively smooth, it can effectively remedy the overly sharp feature space produced by guided learning. In addition, the above pre-training scheme requires re-pre-training every time a new lightweight model is built, which consumes many resources. Based on this, we construct three different guided self-supervised schemes to explore the upper bound of lightweight-model performance while finding the most concise and effective pre-training scheme.
Recently, Transformer-based models have spread across various fields of medical image analysis thanks to the global modeling capability of self-attention. However, such models suffer from high computational complexity and large parameter counts. A more efficient diagnosis would significantly reduce the time spent reviewing scans, gaining valuable time for patients. In this paper, we explore the role of MLP-based models in medical image analysis and propose CTMLP, a lightweight model for the diagnosis of COVID-19. Compared with previous CNN- and Transformer-based models, CTMLP offers strong performance and is easier to optimize and deploy on edge devices. In addition, its ability to process irregular data such as point clouds offers the potential for model unification in medical image analysis. To explore the upper bound of the performance of this class of models and to establish a suitable self-supervised pre-training scheme for them, we propose a guided self-supervised pre-training scheme, which significantly improves the performance of lightweight models while allowing COVID-19 to be diagnosed with information from both labeled and unlabeled images.

Conclusions and future works
In this work, we first propose CTMLP, a novel network architecture based on convolutions and MLPs, for the timely diagnosis of COVID-19. CTMLP has fewer parameters and less computational overhead than previous deep network models while maintaining good performance. Then, to effectively exploit the information in large numbers of unlabeled images, we propose TL-DeCo, a pre-training framework based on transfer learning and self-supervised learning. To simplify the pre-training process and improve the feature-extraction ability of the lightweight model, we propose a guided self-supervised learning scheme. In summary, our proposed model and algorithmic framework enable efficient diagnosis of COVID-19 with the help of information from labeled and unlabeled images. In future work, we will extend the model and algorithmic framework to other medical tasks.