Adversarial training collaborating hybrid convolution-transformer network for automatic identification of reactive lymphocytes in peripheral blood

Reactive lymphocytes may indicate diseases such as viral infections. Identifying these abnormal lymphocytes is crucial for disease diagnosis. Currently, reactive lymphocytes are mainly identified manually by pathology experts using microscopes and morphological knowledge, which is time-consuming and laborious. Some studies have used convolutional neural networks (CNNs) to identify peripheral blood leukocytes, but the small receptive field of these models is a limitation. Our model introduces a Transformer on top of a CNN, expanding the receptive field and enabling the model to extract global features more efficiently. We also enhance the generalization ability of the model through virtual adversarial training (VAT) without changing the model's parameters. Finally, our model achieves an overall accuracy of 93.66% on the test set, and the accuracy for reactive lymphocytes reaches 88.03%. This work takes another step toward the efficient identification of reactive lymphocytes.


Introduction
Reactive lymphocytes, also known as atypical lymphocytes, are benign proliferations of lymphocytes with morphological changes in peripheral blood, usually associated with viral infections and other diseases [1,2]. Compared to normal lymphocytes, reactive lymphocytes show some subtle differences, including increased cell size, abundant and basophilic cytoplasm, an uneven nuclear chromatin pattern, and a large nucleus [3,4]. These morphological changes provide essential visual clues for identifying and distinguishing reactive lymphocytes from their normal counterparts under the microscope. However, these changes also make reactive lymphocytes easily confused with other leukocytes in peripheral blood, especially monocytes and blasts [5]. Thus, trained human experts are typically responsible for examining the morphology of these white blood cells, which is time-consuming and laborious [6,7], and the results are easily influenced by the observer's subjective judgment. Reliable automated identification of reactive lymphocytes and other leukocytes is therefore essential for rapid disease diagnosis [8,9], allowing specific treatment approaches to be developed earlier.
Recently, some researchers have focused on peripheral blood cell recognition, using machine learning techniques represented by support vector machines (SVMs) and deep learning methods represented by convolutional neural networks (CNNs). Among machine learning methods, Dong et al. [10] successfully eliminate redundant and irrelevant features from the initial feature set using a CART-based feature selection algorithm. The main features are then fed into a PSO-SVM classifier to recognize leukocytes. Tavakoli et al. [11] develop an algorithm for cell nucleus segmentation, design and extract three morphological features and four novel color features, and finally use an SVM classifier to identify leukocytes. The former relies on the effectiveness of feature selection, while the latter uses manually designed features, which may inevitably introduce biases. Among deep learning methods, Fırat [12] develops a CNN model based on the Inception module, a pyramid pooling module, and a depthwise squeeze-and-excitation block to classify peripheral blood cell images. Tseng et al. [13] build a model integrating multiple CNNs to identify neutrophils in peripheral blood. All of these methods use CNNs for cell classification. However, several studies [14,15] suggest that CNNs are ineffective at extracting global information from an image due to their small receptive fields.
On the one hand, although CNNs are not adept at extracting global information, their excellent ability to extract local information still makes them perform well in medical image processing tasks, such as diagnosing COVID-19 [16] and multiple myeloma [17], classifying acute leukemia [18], and detecting pancreatic cancer [19] and skin cancer [20]. The convolution's powerful local feature extraction lets the model perceive details within cells, such as granules, vesicles, and heavily stained regions. On the other hand, Transformer architectures are widely used in many vision tasks [21][22][23][24][25][26] due to their excellent ability to extract global information. Specifically, the Transformer architecture fuses all tokens in a sequence through multi-head attention, allowing it to observe the entire image at once, unlike CNNs, which require multiple convolutional layers to expand their receptive fields. Based on this analysis, Transformer and convolution naturally complement each other. From this perspective, we believe combining convolution and Transformer can overcome the weaknesses of both while enhancing their strengths. We apply Next-ViT [27], a model that combines convolution and Transformer, to reactive lymphocyte recognition. In Next-ViT, the convolution and Transformer blocks are redesigned to enhance accuracy while reducing complexity and latency.
When constructing a peripheral blood cell classification dataset, as in object detection, an expert is required to precisely annotate the location of cells in the peripheral blood images. This process can be quite time-consuming. Additionally, the distribution of cell types within the dataset is imbalanced, which can lower the model's generalization ability on classes with few samples. We take corresponding measures to address these two issues. First, we introduce a cell detection model to automatically crop and label cells in the images, significantly improving data preparation efficiency. Second, we use data augmentation techniques to mitigate model overfitting. Specifically, we adopt two augmentation strategies: mix-up and virtual adversarial training (VAT) [28]. Mix-up mitigates overfitting by linearly combining different cell images, leading to less deterministic predictions by the model. VAT encourages the model to learn deep features, making it less sensitive to minor perturbations in the image and mitigating overfitting, thereby enhancing the model's robustness. Furthermore, we alleviate the bias introduced by the single staining method through color and contrast transformations [29]. Ultimately, our model achieves an impressive overall accuracy of 93.66% on the test set. Moreover, it excels at identifying reactive lymphocytes, with an accuracy of 88.03%. These results demonstrate the effectiveness of our approach in recognizing reactive lymphocytes and other leukocytes.
The main contributions of this paper are: (1) We include reactive lymphocytes in the identification of peripheral blood cells, a class of cells that are easily missed in related work but are critical for disease diagnosis. (2) We present a novel peripheral blood cell dataset with higher resolution and richer cell types, which may provide data support for subsequent related work. (3) We use virtual adversarial training during model training to improve the robustness and generalization ability of the model without changing the model parameters and complexity. (4) Compared with one of the SOTA models, Internimage [30], our model has higher accuracy and better interpretability.

Sample processing and data preparation
Peripheral blood smears from the Laboratory of the Hematology Department at Zhongnan Hospital of Wuhan University are used in this study. As shown in Fig. 1(A), these smears are stained using the May-Grünwald-Giemsa method. Digital images of peripheral blood leukocytes are obtained using an OLYMPUS BX53 microscope at 100× magnification, with a resolution of 2,736 × 1,824 pixels. Since manual annotation of peripheral blood cells is time-consuming and labor-intensive, we introduce YOLOv5 for automatic cell annotation. YOLOv5 is a standard regression-based single-stage detection model that can assign a precise bounding box to each cell in an image in one pass, significantly improving efficiency compared to manual labeling. Specifically, we select the regions of interest from the digitally scanned images and input them into YOLOv5 to generate the corresponding bounding box for each cell. We then crop the cells based on the bounding boxes and classify them according to the labels in the clinical reports to obtain the final cell-level dataset, as shown in Fig. 1(B).
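As a rough illustration of this step, the sketch below loads a YOLOv5 detector through torch.hub and crops every detected cell; the weight file name, paths, and confidence threshold are hypothetical and do not reproduce the exact pipeline used in this work.

```python
import torch
from PIL import Image

# Minimal sketch: detect leukocytes with a YOLOv5 detector and crop them.
# "leukocyte_yolov5s.pt", the paths, and the threshold are placeholders.
model = torch.hub.load("ultralytics/yolov5", "custom", path="leukocyte_yolov5s.pt")

def crop_cells(image_path, out_dir, conf_threshold=0.5):
    image = Image.open(image_path).convert("RGB")
    results = model(image)                    # one forward pass per region of interest
    boxes = results.xyxy[0].cpu().numpy()     # rows: x1, y1, x2, y2, confidence, class
    for i, (x1, y1, x2, y2, conf, _cls) in enumerate(boxes):
        if conf < conf_threshold:
            continue
        cell = image.crop((int(x1), int(y1), int(x2), int(y2)))
        # The class label is assigned afterwards from the clinical report.
        cell.save(f"{out_dir}/cell_{i:03d}.png")

crop_cells("smear_roi.png", "cropped_cells")
```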
Our dataset comprises 12,245 cells from 6 categories, cropped from 6,573 peripheral blood smears of 3,772 patients. The categories and respective quantities of leukocytes are as follows: 267 eosinophils, 2,650 lymphocytes, 1,193 monocytes, 4,785 neutrophils, 1,308 blasts, and 2,042 reactive lymphocytes. The morphological characteristics of each category of peripheral blood leukocytes are shown in Fig. 2(B). In this study, to ensure proper training and evaluation, we divide the dataset into training, validation, and test sets in a 3:1:1 ratio. During data partitioning, we follow a patient-level stratification procedure to ensure that no patient appears in more than one of the training, validation, and test sets. The training set updates the model's parameters and strengthens its recognition ability, the validation set is used to tune the hyperparameters of the model, and the test set objectively assesses the model's performance in real scenarios. Figure 2(A) illustrates the distribution of all classes.
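A patient-level split of this kind can be implemented with grouped splitting so that all cells from one patient land in a single subset. The sketch below uses scikit-learn's GroupShuffleSplit; the metadata file and column names are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical metadata: one row per cropped cell with its label and patient ID.
meta = pd.read_csv("cells.csv")   # columns: image_path, label, patient_id

# 3:1:1 split by patient: first take ~3/5 of patients for training...
gss = GroupShuffleSplit(n_splits=1, train_size=0.6, random_state=42)
train_idx, rest_idx = next(gss.split(meta, groups=meta["patient_id"]))
train, rest = meta.iloc[train_idx], meta.iloc[rest_idx]

# ...then split the remaining patients evenly into validation and test.
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=42)
val_idx, test_idx = next(gss2.split(rest, groups=rest["patient_id"]))
val, test = rest.iloc[val_idx], rest.iloc[test_idx]

# No patient may appear in more than one subset.
assert set(train["patient_id"]).isdisjoint(val["patient_id"])
assert set(train["patient_id"]).isdisjoint(test["patient_id"])
assert set(val["patient_id"]).isdisjoint(test["patient_id"])
```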

Reactive lymphocytes classification network
Due to the natural presence of similar local features between peripheral blood cells, such as the lobular shape of the nuclei of neutrophils and eosinophils, the scant cytoplasm of blasts and lymphocytes, and granules in reactive lymphocytes and neutrophils, traditional CNNs are often susceptible to interference from these similar local features when identifying cells, and their limited receptive field leads to misclassifications. Therefore, the model should excel at extracting not only local information but also global information from the cell. Taking this into account, our model uses an architecture that combines the Fast Convolution Block (FCB) and the Efficient Transformer Block (ETB). The FCB helps the model clearly perceive regional details within the cell, such as granules in the cytoplasm and bubbles in the nucleus. At the same time, the ETB enables the model to perceive larger regions, such as the morphology of the cell body and the nucleus, fusing the local information extracted by the FCB to aid the model's holistic recognition of cells and thus preventing confusion caused by similar local information between cells. As shown in Fig. 3, the model accepts 224 × 224 × 3 images as inputs. With each passing stage, the spatial resolution of the feature maps decreases while the number of channels increases. To explain the model more clearly, we introduce the FCB and ETB in turn.

Fig. 1. The process of building the dataset. (A) illustrates the procedure for staining peripheral blood samples with the May-Grünwald-Giemsa method and scanning them into images with the BX53. (B) shows how the dataset was constructed. After staining the smears and scanning them as electronic images, the regions of interest were selected and input into YOLOv5, and the cells were detected and cropped. They were then categorized according to the labels in the clinical report to build the dataset.

Fast convolution block
To improve the speed of reactive lymphocyte identification and enable deployment on low-performance devices in clinical settings, Convolutional Attention (CA) focuses on reducing model complexity and accelerating inference. According to actual testing, our model achieves an identification speed of 711 images per second with a batch size of 32 on an RTX 3070 Ti, greatly exceeding the speed of manual recognition. Inspired by Multi-Head Self Attention (MHSA), Multi-Head Convolutional Attention (MHCA) incorporates a multi-head design on top of CA. The essence of MHCA can be succinctly formulated as:

MHCA(z) = Concat(CA_1(z_1), CA_2(z_2), ..., CA_m(z_m)) W^P

The multi-head design enables MHCA to capture information from m parallel representation subspaces, where z = [z_1, z_2, ..., z_m] denotes the division of the input feature z into a multi-head form. To enhance information aggregation across the multiple heads, MHCA is also equipped with a projection layer W^P. Specifically, the input features are divided into m groups in the channel dimension for grouped convolution, which reduces the number of parameters by a factor of m compared to standard convolution. The model then concatenates the convolution results of each group and fuses the information of all channels through pointwise convolution. As a result, the model is better at extracting local features from cells. CA refers to single-head convolutional attention and is formulated as:

CA(T_m, T_n) = O(θ, (T_m, T_n)), with T_m, T_n ∈ z

where T_m and T_n represent adjacent tokens in the input feature, and O denotes the inner product operation involving the trainable parameter θ. CA learns correlations between tokens in the local receptive field by updating θ. This enables the model to extract richer local information, allowing for a clearer visualization of details such as granules, bubbles, and dark blue regions within the cell. The architecture of FCB is shown in Fig. 4(A) and can be formulated as:

ẑ = MHCA(z) + z
FCB(z) = MLP(ẑ) + ẑ
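As a rough PyTorch sketch of this design (a simplified illustration, not the exact Next-ViT implementation; the normalization, activation, and expansion choices are assumptions), MHCA can be written as a grouped 3×3 convolution whose groups play the role of heads, followed by a pointwise projection W^P, and FCB as the residual composition of MHCA and an MLP:

```python
import torch
import torch.nn as nn

class MHCA(nn.Module):
    """Multi-Head Convolutional Attention sketch: a grouped 3x3 convolution acts as
    per-head convolutional attention, and a pointwise convolution plays the role of
    the projection layer W^P that fuses information across heads."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.group_conv = nn.Conv2d(channels, channels, kernel_size=3,
                                    padding=1, groups=heads, bias=False)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # W^P

    def forward(self, z):
        return self.proj(self.act(self.norm(self.group_conv(z))))

class FCB(nn.Module):
    """Fast Convolution Block sketch: MHCA followed by an MLP, both with residuals."""
    def __init__(self, channels, heads=8, expansion=3):
        super().__init__()
        self.mhca = MHCA(channels, heads)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels * expansion, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels * expansion, channels, 1),
        )

    def forward(self, z):
        z = z + self.mhca(z)
        return z + self.mlp(z)

x = torch.randn(2, 96, 56, 56)
print(FCB(96)(x).shape)  # torch.Size([2, 96, 56, 56])
```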

Efficient transformer block
Even though FCB effectively extracts local information of the cells, the model may still be confused by similar local features among peripheral blood cells. The Transformer demonstrates a robust capacity to capture low-frequency signals, such as the shape of the cytoplasm and nucleus, offering valuable global information. However, it has been noted that the Transformer may partially degrade high-frequency information, such as granules in the cytoplasm. A human visual system must combine signals from various frequency segments to acquire more vital and distinctive information [31].
Taking inspiration from these studies, ETB is designed to capture both local and global information within a lightweight mechanism. Moreover, it functions as an effective mixer of local and global information, thereby enhancing the overall feature extraction capability. As shown in Fig. 4(B), ETB employs an Efficient Multi-Head Self Attention (E-MHSA) mechanism to fuse local information and capture global information. This process can be described as:

E-MHSA(z) = Concat(SA_1(z_1), SA_2(z_2), ..., SA_m(z_m)) W^P

where z = [z_1, z_2, ..., z_m] stands for splitting the input feature z into m heads in the channel dimension. SA refers to a self-attention calculation aimed at spatial reduction, which draws inspiration from Linear SRA [32]. It can be formulated as:

SA(z) = Attention(z W^Q, P_S(z W^K), P_S(z W^V))

where W^Q, W^K and W^V represent the dense layers used for context encoding, and P_S is a downsampling operation that uses average pooling with a stride S to reduce the spatial dimension before the attention operation, effectively lowering the computational cost. Attention denotes a standard attention calculation:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where a = [a_1, a_2, ..., a_m] represents a sequence of tokens, such as a sentence (with each word treated as a token) or an image (with each pixel treated as one). Q (Query), K (Key) and V (Value) are representations of the original tokens at three levels. QK^T calculates the similarity between the tokens; we then scale and normalize the similarity and finally multiply by V to obtain a new sequence of tokens in which, compared to the original sequence, each token incorporates information from the other tokens. This allows ETB to fuse information from local regions of the cell extracted by FCB when identifying reactive lymphocytes. Moreover, a shrinking ratio r and pointwise convolution are introduced to perform channel reduction, further improving the efficiency of E-MHSA. Batch normalization is also used to ensure highly efficient deployment. ETB is also equipped with MHCA, which collaborates with E-MHSA to extract both local and global information. Subsequently, the output features from E-MHSA and MHCA are combined to blend local and global information. Last, an MLP module is added at the end to capture crucial and unique information. In summary, ETB can be formulated as:

ẑ = Proj(z)
z_1 = E-MHSA(ẑ) + ẑ
z_2 = MHCA(Proj(z_1)) + Proj(z_1)
ETB(z) = MLP(Concat(z_1, z_2)) + Concat(z_1, z_2)
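The following simplified PyTorch sketch illustrates E-MHSA (assumptions: the channel-reduction projection and batch normalization are omitted, feature maps are square, and the head count and pooling stride are illustrative):

```python
import torch
import torch.nn as nn

class EMHSA(nn.Module):
    """Efficient Multi-Head Self Attention sketch: keys and values are spatially
    downsampled by average pooling with stride S before standard attention."""
    def __init__(self, dim, heads=8, stride=2):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.pool = nn.AvgPool2d(kernel_size=stride, stride=stride)  # P_S
        self.proj = nn.Linear(dim, dim)                              # W^P

    def forward(self, z):                                 # z: (B, C, H, W)
        b, c, h, w = z.shape
        q = self.q(z.flatten(2).transpose(1, 2))          # (B, HW, C)
        kv = self.pool(z).flatten(2).transpose(1, 2)      # (B, HW/S^2, C)
        k, v = self.kv(kv).chunk(2, dim=-1)

        def split(t):                                     # (B, N, C) -> (B, heads, N, C/heads)
            return t.view(b, -1, self.heads, c // self.heads).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) * self.scale     # similarity QK^T / sqrt(d_k)
        out = attn.softmax(dim=-1) @ v                    # weighted sum over values
        out = out.transpose(1, 2).reshape(b, h * w, c)
        return self.proj(out).transpose(1, 2).view(b, c, h, w)

x = torch.randn(2, 256, 14, 14)
print(EMHSA(256)(x).shape)  # torch.Size([2, 256, 14, 14])
```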

Loss function
Mix-up data augmentation has been proven to effectively alleviate overfitting during training. As a result, we incorporate it into our training process and correspondingly adopt the soft target cross-entropy loss, which can be formulated as:

L_CE = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y_ij log(p_ij)

where N is the number of samples, C represents the number of classes, p_ij represents the predicted confidence given by the model for the i-th sample belonging to the j-th class, and y_ij represents the corresponding ground-truth value. Furthermore, to enhance the model's generalization capability, we introduce VAT. We generate perturbations based on the confidence of the model's predictions on the raw samples and then add them to the raw samples to generate their corresponding adversarial samples. During training, through backpropagation, we encourage the model to make predictions on adversarial samples with the same confidence as on the original samples. Specifically, given a tensor x representing an input image, the initial perturbation vector d has the same shape as x and follows a standard normal distribution. Then, we compute the perturbation direction r by L2 normalization and update the perturbation vector as follows:

r = d / ||d||_2,  d = ξ · r

where ξ denotes the perturbation exploration coefficient. This updated perturbation is added to the input x, resulting in x_adv, which is forwarded through the model to obtain the perturbed logit z_adv. Additionally, we also need the original logit z. These can be formulated as:

x_adv = x + d,  z_adv = model(x_adv),  z = model(x)

Then, the Kullback-Leibler (KL) divergence D_KL(z||z_adv) between z and z_adv is computed. By calculating the gradient of the KL divergence with respect to the perturbation d and detaching the gradient, we refine the perturbation vector d to be used in the next iteration. The KL divergence is computed as:

D_KL(z||z_adv) = Σ_{j=1}^{C} p_j log(p_j / p_adv,j)
where p_j and p_adv,j are the confidences of the j-th class, obtained from the original logit z and the perturbed logit z_adv respectively, using the softmax function. After iterating n times, the final virtual adversarial perturbation d is obtained by scaling d with magnitude ε and normalizing it with respect to its L2 norm:

d = ε · d / ||d||_2

Then, we can compute the final perturbed logit z̃ and the VAT loss over all samples as follows:

x̃ = x + d,  z̃ = model(x̃),  L_VAT = (1/N) Σ_{i=1}^{N} D_KL(z_i || z̃_i)

The final expression of the loss function is:

L = L_CE + η · L_VAT

where η is the weight coefficient of the VAT regularization term.
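Putting the pieces together, a simplified PyTorch sketch of the soft target cross-entropy and the VAT regularizer (single power iteration by default) is given below; the function names are illustrative and this is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def soft_target_ce(logits, soft_targets):
    """Soft target cross-entropy, used with mix-up (soft) labels."""
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def vat_loss(model, x, xi=1e-6, eps=4.0, n_iters=1):
    """Virtual adversarial training loss: KL between predictions on x and x + d."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)                  # original distribution (target)

    d = torch.randn_like(x)                             # initial random perturbation
    for _ in range(n_iters):
        d = xi * F.normalize(d.view(x.size(0), -1), dim=1).view_as(x)
        d.requires_grad_(True)
        log_p_adv = F.log_softmax(model(x + d), dim=1)
        adv_kl = F.kl_div(log_p_adv, p, reduction="batchmean")
        d = torch.autograd.grad(adv_kl, d)[0].detach()  # refine perturbation direction

    # Scale the final perturbation to magnitude eps and recompute the KL term.
    d = eps * F.normalize(d.view(x.size(0), -1), dim=1).view_as(x)
    log_p_tilde = F.log_softmax(model(x + d), dim=1)
    return F.kl_div(log_p_tilde, p, reduction="batchmean")

# Total loss for one batch: L = L_CE + eta * L_VAT, e.g.
# loss = soft_target_ce(model(images), mixed_labels) + eta * vat_loss(model, images)
```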

Implementation details
The model is pretrained on the ILSVRC2012 ImageNet-1K dataset. During the training phase, soft target cross-entropy with the VAT regularization term is adopted. The weight coefficient for the regularization term η is set to 1e-8. The hyperparameters n, ξ and ε are set to 1, 1e-6 and 4, respectively. The AdamW [33] optimizer is adopted with a learning rate of 3e-4 and a weight decay of 1e-8. The batch size is set to 32, and a total of 100 epochs are trained. We select PyTorch as the deep learning framework, and all experiments are run on a computer with an 11th-generation Core i5 processor and an NVIDIA RTX 3070 Ti. The loss and accuracy during the training process are shown in Fig. 5(C) and Fig. 5(D). The best accuracy on the validation set is observed after 76 epochs, with a validation accuracy of 93.66% achieved by the model. Additionally, we train the model for 150 epochs without pre-training. Figures 5(A) and (B) show the changes in loss and accuracy under this condition, respectively. It can be observed that pre-training significantly accelerates the convergence of the model and improves its accuracy.
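As a compact illustration of this configuration (a sketch reusing the soft_target_ce and vat_loss helpers sketched above; build_model and the data loader are placeholders, not the authors' released training script):

```python
import torch
from torch.optim import AdamW

# Hyperparameters reported in this section.
LR, WEIGHT_DECAY, BATCH_SIZE, EPOCHS = 3e-4, 1e-8, 32, 100
ETA, N_ITERS, XI, EPS = 1e-8, 1, 1e-6, 4.0           # VAT settings

device = "cuda" if torch.cuda.is_available() else "cpu"
model = build_model(pretrained=True).to(device)       # ImageNet-1K pretrained backbone (placeholder)
optimizer = AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

for epoch in range(EPOCHS):
    model.train()
    for images, soft_labels in train_loader:          # mix-up applied when batching (placeholder loader)
        images, soft_labels = images.to(device), soft_labels.to(device)
        loss = soft_target_ce(model(images), soft_labels) \
             + ETA * vat_loss(model, images, xi=XI, eps=EPS, n_iters=N_ITERS)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Evaluate on the validation set and keep the best checkpoint (omitted).
```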

Evaluation metrics
Due to the imbalanced distribution of cell types in our dataset, relying solely on accuracy cannot objectively reflect the model's performance. Therefore, we also introduce the confusion matrix, correlation matrix, F1 Score, precision-recall (PR) curve, receiver operating characteristic (ROC) curve, and area under the curve (AUC) as additional evaluation metrics. The confusion matrix, a key tabular representation, provides a comprehensive overview of the model's accuracy by systematically breaking down predicted and actual class labels. The correlation matrix reflects the similarity between samples of each category. The F1 Score, a widely recognized metric in both statistical analysis and machine learning, is crucial in assessing the model's effectiveness. It is particularly beneficial in scenarios with imbalanced class distributions, as it quantifies the model's performance by striking a balance between precision and recall. Specifically, it ranges from 0 to 1, with a higher value indicating a better equilibrium between precision and recall. The F1 Score can be formulated as:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F1 = 2 × Precision × Recall / (Precision + Recall)

where true positive (TP) represents the number of instances correctly predicted as positive, false positive (FP) indicates the number of instances incorrectly predicted as positive, and false negative (FN) refers to positive samples that the model fails to classify correctly, predicting them as negative instead. The PR curve is a crucial visualization tool for classification tasks, visually demonstrating the relationship between precision and recall as the classification threshold is adjusted. The curve is created by varying the threshold and calculating the precision and recall at each step, making it particularly useful for imbalanced datasets. A PR curve that trends toward the upper-right corner typically indicates strong model performance.
Similarly, the ROC curve is an essential visualization tool for evaluating classification models. It illustrates how the true positive rate (TPR), also known as sensitivity, and the false positive rate (FPR) vary with the classification threshold. To generate the ROC curve, we vary the threshold and calculate the TPR and FPR at each point. An ideal classifier's ROC curve reaches the top-left corner (TPR = 1 and FPR = 0), indicating high sensitivity. The TPR and FPR can be calculated as follows:

TPR = TP / (TP + FN),  FPR = FP / (FP + TN)

where true negative (TN) is the number of negative samples correctly classified as negative by the model. The AUC is a metric for evaluating a classification model's overall performance and is calculated as the area under the ROC curve. The greater the AUC value, the better the model's discrimination and the more effectively it can distinguish positive from negative instances.
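For reference, these metrics can be computed from the model's softmax scores with scikit-learn; the arrays below are random placeholders standing in for the real test-set labels and scores.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

# y_true: (N,) integer labels for the 6 classes; y_score: (N, 6) softmax outputs.
y_true = np.repeat(np.arange(6), 30)                       # placeholder labels
y_score = np.random.rand(len(y_true), 6)
y_score /= y_score.sum(axis=1, keepdims=True)
y_pred = y_score.argmax(axis=1)

cm = confusion_matrix(y_true, y_pred, labels=list(range(6)))   # rows: true, cols: predicted
macro_f1 = f1_score(y_true, y_pred, average="macro")
auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")

# PR and ROC curves are drawn one-vs-rest, e.g. for reactive lymphocytes (class 5).
precision, recall, _ = precision_recall_curve(y_true == 5, y_score[:, 5])
fpr, tpr, _ = roc_curve(y_true == 5, y_score[:, 5])
```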

Quantitative analysis
Once training is complete, we load the best weights on the validation set into the model and comprehensively evaluate it on the test set using the previously mentioned metrics. Ultimately, our model performs well, with an accuracy of 93.66%, an F1 Score of 93.68%, and an AUC of 99.09%. Then, we provide a detailed breakdown of the model's precision, recall, and F1 Score for each class, as shown in Table 1. Apart from eosinophils and reactive lymphocytes, the model performs similarly on the validation and test sets for the remaining classes. It is worth noting that the model's performance in recognizing reactive lymphocytes is weaker on the test set than on the validation set. This may be because the morphology of reactive lymphocytes is more complex, with three subtypes in the Downey classification, and the distribution of subtypes differs significantly between the validation and test sets. In addition, the hyperparameters of the model are chosen based on validation-set performance, and hyperparameters that suit the validation set may not suit the test set. Furthermore, the model's performance in recognizing neutrophils is impressive on both the validation and test sets. This may be attributed to the distinct characteristics of the segmented nucleus in neutrophils, making them easier to distinguish from other peripheral blood cells.
In addition, Fig. 6 shows the remaining metrics of the model on the test set, including the confusion matrix, correlation matrix, PR curve, and ROC curve. As shown in Fig. 6(A), the confusion matrix reveals that the model performs well across all classes, consistent with the results in Table 1. Specifically, the model achieves an accuracy of 99.14% for neutrophils. This high accuracy is attributed to the unique multi-lobed nuclei and cytoplasmic granules of neutrophils, making them easily distinguishable from other peripheral blood cells. Furthermore, the model tends to misclassify lymphocytes, monocytes, and blasts as reactive lymphocytes, possibly due to shared characteristics such as larger nuclei (lymphocytes), irregularly shaped nuclei (monocytes), and deeper staining or reduced cytoplasm (blasts). Additionally, the model tends to misclassify reactive lymphocytes as lymphocytes, indicating that, in some cases, reactive lymphocytes exhibit minimal morphological differences from lymphocytes and are challenging to distinguish. These misclassifications provide valuable insights for future improvements of the model. Next, we calculate the correlations between some samples and plot them as a matrix, as shown in Fig. 6(B). Specifically, we first select 10 samples each from eosinophils, lymphocytes, monocytes, neutrophils, blasts, and reactive lymphocytes, resulting in 60 samples. We then input these samples into the model to obtain the corresponding feature vectors and compute the correlations between these vectors. Similar to the confusion matrix, the correlation between the feature vectors of neutrophil samples (numbered 30 to 39) and those of other leukocyte categories is relatively small, indicating that the model recognizes the distinct characteristics of neutrophils. Reactive lymphocyte samples (numbered 50 to 59) exhibit a high correlation with lymphocyte samples (numbered 10 to 19), indicating that the model may confuse the two categories. Subsequently, we plot the PR curves for the model, where high precision indicates the model's capability to make accurate predictions, while high recall represents its ability to identify the majority of relevant samples. In practice, high precision often implies low recall and vice versa. As shown in Fig. 6(C), the PR curves for all classes closely hug the upper-right corner, indicating that the model achieves both high precision and high recall; it can make accurate predictions while identifying many relevant instances. Finally, we draw the ROC curves for the model across all classes. TPR is equivalent to recall, measuring the model's ability to correctly identify positive samples. In contrast, FPR quantifies the model's false alarm rate among negative samples, indicating the percentage of negative samples incorrectly classified as positive. Ideally, a lower FPR is desirable, as it indicates that the model is less likely to misclassify negative samples as positive, thus minimizing false positives. As depicted in Fig. 6(D), all curves closely hug the top-left corner, indicating that the model achieves both a high TPR and a low FPR. This suggests that the model effectively identifies various cell types while demonstrating robustness against interfering factors, such as uneven staining in peripheral blood sample preparation.
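The correlation matrix in Fig. 6(B) can be reproduced directly from the encoder's outputs; a minimal sketch, where the 60 stacked 1024-dimensional feature vectors are assumed to be already extracted (the random array is a placeholder):

```python
import numpy as np

# 10 samples per class x 6 classes = 60 feature vectors, stacked in class order,
# so rows 30-39 are neutrophils and rows 50-59 are reactive lymphocytes.
features = np.random.randn(60, 1024)       # placeholder for the real embeddings
corr = np.corrcoef(features)               # (60, 60) Pearson correlation matrix
```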
Fig. 6. Quantitative analysis results of the model's performance on the test set. We used the confusion matrix (A), correlation matrix (B), PR curve (C), and ROC curve (D) for a comprehensive evaluation of the model's performance. The small figures in the center of (C) and (D) are zoomed-in views of the top-right corner of (C) and the top-left corner of (D), respectively, for a more detailed analysis.

Next, we investigate other works [34][35][36] on classifying peripheral blood cells and compare them with our model. These works use the same public dataset from Kaggle, blood cell images (BLC). However, BLC cannot be used to objectively evaluate our model. Table 2 summarizes the differences between our dataset and BLC. First, our work focuses on identifying reactive lymphocytes, a cell type absent from BLC. Second, compared with the resolution of the blood images collected by our clinical equipment, the resolution of the blood images in BLC is extremely low. We do not use a pure vision Transformer (ViT) as the model because most blood images captured by clinical devices are high-resolution, and ViT tends to emphasize low-resolution features due to repeated downsampling, which leads to the loss or lack of detailed positioning information, making it very unsuitable for low-level image recognition [37]. Furthermore, although there are 12,444 cell images in BLC, most of them are obtained through data augmentation, and there are only 349 real cell samples. Small datasets are extremely challenging for ViT training [38].
Table 3 shows a performance comparison of our method against related methods. It is important to note that the results of our method are from the test set, whereas the results presented for the other methods originate from their validation sets, which is less reflective of true model performance. For neutrophils, our model achieves the best recognition performance, which may be attributed to the large number of neutrophils in the dataset, allowing our model to learn their features sufficiently. For lymphocytes and monocytes, our recognition performance lags slightly behind other models. This is because there are natural similarities between lymphocytes and monocytes, and during data collection we deliberately included samples that are prone to confusion, aiming to improve the robustness of our model; this also poses a greater challenge to the model's classification capabilities. Additionally, our model is capable of identifying blasts and reactive lymphocytes, which makes it more suitable for the needs of clinical disease diagnosis. Then, we analyze the effect of VAT. As shown in Table 4, after enabling VAT, the model's accuracy and F1 Score increase by 0.26% and 0.25%, respectively, while the AUC decreases by 0.06%. First, further improvement becomes particularly difficult in classification tasks when the model's performance has already reached a high level; even a slight increase in accuracy may represent an enhanced ability to deal with complex or ambiguous cases. An improvement of 0.26% in accuracy may mean that in some crucial scenarios the model can more accurately identify white blood cells, thus avoiding potential errors or risks. Second, when new data participate in model training in the future, a large amount of data may not receive sufficient expert annotation, requiring semi-supervised learning, where VAT can significantly improve the model's performance [28]. Finally, VAT introduces no additional parameters or inference costs. Therefore, VAT remains an important option during model training.

Interpretability analysis
In this section, we utilize Grad-CAM [39] to visualize the model's regions of attention as heatmaps on the images. To better demonstrate the performance of the proposed model, we train an advanced CNN model, Internimage, as a comparison. Internimage utilizes dynamic sparse kernels to achieve long-range dependency capture, adaptive spatial aggregation, and efficient computation. During the training of Internimage, we also utilize transfer learning techniques and train it for 100 epochs. Table 5 compares Internimage with the proposed model in various aspects.
The number of parameters of the two models is close, but Internimage has lower computational complexity, while the proposed model has advantages in accuracy and speed. Figure 7 visualizes the attention of the two models on six types of cells. Overall, the attention regions of the proposed model are generally larger and more distributed than those of Internimage. For the eosinophil image, the attention region of Internimage is relatively small and fails to focus on the coarse granules in the cytoplasm, leading to an incorrect prediction. In contrast, the attention region of the proposed model overlaps with the cytoplasm, demonstrating excellent interpretability. Moreover, for the lymphocyte image, the attention region of Internimage is mainly focused on the nucleus, while the proposed model pays more attention to the cytoplasm and the fine granules in it, and then makes the correct prediction with higher confidence, showing that our model is better at focusing on distinctive features. Furthermore, for the monocyte image, one of the distinguishing features is the kidney-shaped nucleus. Internimage successfully focuses its attention on the nucleus and observes this characteristic. With the reinforcement of the Transformer, the proposed model pays attention not only to the nucleus but also to the cytoplasm, making more accurate predictions supported by richer global information. Continuing with the neutrophil image, Internimage's attention region covers nearly the entire cell, but its key focus regions are small. In contrast, the proposed model's key focus region almost encompasses the entire cell, leading to more reliable predictions. Next, for the blast image, since the cell nucleus occupies almost the entire cell, it is extremely similar to some lymphocytes. Although both models make correct predictions, their confidence is relatively low. Regarding the attention regions, Internimage still focuses relatively narrowly, whereas the proposed model demonstrates a more distributed focus. Finally, for the reactive lymphocyte image, both models perform well. Based on the comprehensive analysis above, Internimage tends to distinguish peripheral blood leukocytes based on the cell nucleus, while the proposed model focuses its attention on both the cytoplasm and the cell nucleus. It is the Transformer that enables the model to attend to a broader region within the cell, thus making more accurate predictions.
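Heatmaps like those in Fig. 7 can be produced with a standard hook-based Grad-CAM; the sketch below is a generic implementation, and the choice of target_layer (typically the last convolutional stage) is left as an assumption.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Generic Grad-CAM sketch for a single image tensor of shape (1, 3, H, W)."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(value=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(value=go[0]))

    model.eval()
    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    # Weight each feature map by its average gradient, keep positive evidence only.
    weights = grads["value"].mean(dim=(2, 3), keepdim=True)            # (1, C, 1, 1)
    cam = F.relu((weights * feats["value"]).sum(dim=1, keepdim=True))  # (1, 1, h, w)
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)           # normalize to [0, 1]
    return cam[0, 0], class_idx
```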
In addition, we utilize t-SNE to demonstrate the encoding capabilities of the model. t-SNE is a dimensionality reduction algorithm that is particularly suitable for mapping high-dimensional data into a lower-dimensional space (usually two or three dimensions) for visualization or feature analysis. When the model processes a cell image, it converts it into a 1024-dimensional vector through the network. We then adopt t-SNE to reduce this vector to two dimensions, so that each sample corresponds to a point in the two-dimensional coordinate system. As shown in Fig. 8, most samples are accurately classified. The clusters representing eosinophils and neutrophils are distinctly separated from the other clusters, indicating that the model is highly effective at identifying these two cell types. However, there is some confusion in the classification of the remaining cell types. On the one hand, the model mistakenly identifies a small number of lymphocytes and monocytes as blasts. On the other hand, there is a significant overlap between the cluster of reactive lymphocytes and the clusters of lymphocytes and monocytes. These confusions are similar to those found in manual methods and demonstrate that some samples of these three cell types exhibit high similarity in certain characteristics. In particular, the significant overlap between the clusters of reactive lymphocytes and lymphocytes indicates that the confusable features of these two cell types pose a great challenge to the model. This provides valuable guidance for further improving the model in the future.
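A minimal sketch of the t-SNE visualization is given below, assuming the 1024-dimensional test-set features and their labels have already been extracted into NumPy arrays (the random arrays are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# features: (N, 1024) encoder outputs for the test set; labels: (N,) class indices.
features = np.random.randn(500, 1024)            # placeholder for the real features
labels = np.random.randint(0, 6, size=500)       # placeholder for the real labels
class_names = ["eosinophil", "lymphocyte", "monocyte",
               "neutrophil", "blast", "reactive lymphocyte"]

points = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(features)

for c, name in enumerate(class_names):
    mask = labels == c
    plt.scatter(points[mask, 0], points[mask, 1], s=5, label=name)
plt.legend()
plt.savefig("tsne_test_set.png", dpi=200)
```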

Discussion and conclusion
The identification of reactive lymphocytes is critical for the diagnosis of several diseases. Currently, it relies mainly on observation by pathologists using microscopes, which requires extensive morphological knowledge and experience and is time-consuming. Recently, CNNs have been applied to peripheral blood cell identification in several works. However, the datasets in most of these works omitted reactive lymphocytes. Most importantly, according to our interpretability analysis experiments, the CNN architecture has significant shortcomings relative to the hybrid CNN-Transformer architecture in recognizing certain cells. The receptive field of a CNN is small, and some similar features naturally exist between certain cells; focusing only on a confusing local feature of the cell while ignoring global features can easily lead a CNN to wrong predictions. In contrast to previous works, our dataset included reactive lymphocytes, increasing the clinical utility of the model. In addition, we introduced FCB and ETB into peripheral blood cell identification. FCB can quickly capture localized features in the cell, such as granules, vacuoles, and deeply stained areas. Moreover, the ability of ETB to fuse the local features extracted by FCB makes our model outperform traditional CNN models. Specifically, conventional CNNs, even though they can extract multiple local features within a cell, may focus on only one of them; if this feature is shared between different classes of cells, it may lead to misclassification. ETB enabled the model to focus not only on features shared between different classes of cells but also on the unique features of these cells. The interpretability analysis experiments also showed that ETB gave the model a larger receptive field, capturing large-area features that a CNN cannot capture, such as the morphology of the nucleus and large areas of the cytoplasm. Lastly, we introduced VAT in model training to improve the performance of the model without changing the number of parameters.
In general, the results of our study are encouraging. We presented a novel dataset of peripheral blood cells, and our model achieved an overall accuracy of 93.66% on the test set, with a recognition accuracy of 88.03% specifically for reactive lymphocytes. This demonstrates the model's high reliability, particularly in identifying reactive lymphocytes. However, although our model correctly predicts certain perplexing cells, there is still room for improvement in its confidence. Therefore, in future work, we will consider knowledge distillation, using large models as teacher models to further improve our model's performance.

Fig. 2. The quantity distribution and morphological characteristics of various peripheral blood leukocytes in the dataset. (A) demonstrates the number of cells in each category within the training, validation, and test sets. (B) illustrates the morphological characteristics of cells in each category.

Fig. 3. The model architecture and virtual adversarial training. (A) shows our model architecture in detail. First, we augment the inputs and send them to the Stem for downsampling. Over the following four stages, the model uses FCB to extract local information and ETB to extract global information. Eventually, each sample will be encoded into a 1024-dimensional feature vector, which the linear layer will then process to determine the corresponding class. (B) demonstrates VAT. Specifically, we input the raw and adversarial samples into the model to obtain their respective probability distributions. Then, the KL divergence between the two distributions will be calculated and reduced during backpropagation. This helps the model resist perturbations.

Fig. 4. The architecture of the modules in the network. (A) describes the details of FCB. Group convolution significantly reduces the parameters of the model, while pointwise convolution facilitates the fusion of information between individual channels. (B) shows the structure of the ETB. Compared to the standard MHSA, which maps the query, key, and value of a single token to multi-heads through multiple matrix multiplications, E-MHSA directly divides the query, key, and value into multi-heads in the channel dimension. This reduction in computational cost improves the overall efficiency of the model.

Fig. 5. The training curves of the proposed model. (A) and (C) show the loss curves, which roughly describe the deviation between the model's predictions and the true labels on the training set. (B) and (D) show the accuracy curves, which reflect the model's accuracy on the validation set after each epoch of parameter updates during training.

Fig. 7. Comparison of Internimage's and the proposed model's regions of attention. Highlighted regions represent where the model focused its attention. Each heatmap is accompanied by information on the right side, indicating the model, predicted category, and prediction confidence.

Fig. 8. Visualization of model encoding. We input all the samples in the test set into the model to obtain feature vectors, then used t-SNE to reduce them from 1024 dimensions to 2 dimensions, and finally plotted all the points in the coordinate system.