Automatic recognition of white blood cell images with memory efficient superpixel metric GNN: SMGNN

Abstract: An automatic recognition system for white blood cells can assist hematologists in the diagnosis of many diseases, where accuracy and efficiency are paramount for computer-based systems. In this paper, we presented a new image processing system to recognize the five types of white blood cells in peripheral blood with a marked improvement in efficiency over mainstream methods. The prevailing deep learning segmentation solutions often utilize millions of parameters to extract high-level image features and neglect the incorporation of prior domain knowledge, which consumes substantial computational resources and increases the risk of overfitting, especially when limited medical image samples are available for training. To address these challenges, we proposed a novel memory-efficient strategy that exploits graph structures derived from the images. Specifically, we introduced a lightweight superpixel-based graph neural network (GNN) and introduced superpixel metric learning to segment the nucleus and cytoplasm. Our proposed segmentation model, the superpixel metric graph neural network (SMGNN), achieved state-of-the-art segmentation performance while using up to 10,000 times fewer parameters than existing approaches. The subsequent segmentation-based cell type classification showed satisfactory results, indicating that such automatic recognition algorithms are accurate and efficient to execute in hematological laboratories.


Introduction
White blood cells (WBCs), also known as leukocytes, play a pivotal role in the immune system's defense against various infections. Accurate quantification and classification of WBCs can provide valuable insights for diagnosing a wide range of diseases, including infections, leukemia [1], and AIDS [2]. In the laboratory environment, the traditional examination of blood smears was a laborious and time-consuming manual task. However, with the advent of computer-aided automatic cell analysis systems, rapid and high-throughput image analysis tasks can now be accomplished [3]. An automatic WBC recognition system typically entails three major steps: image acquisition, cell segmentation, and cell type classification. Among these steps, cell segmentation is widely recognized as the most crucial and challenging one, as it significantly influences the accuracy and computational complexity of subsequent processes [4]. Some segmentation-free methods instead take the whole image as input to the classifier without extracting a region of interest (ROI) [5][6][7].
Accurately segmenting WBCs in cell images, thereby distinguishing between lymphocytes, monocytes, eosinophils, basophils, and neutrophils as shown in Figure 1, provides a wealth of crucial information for hematological diagnostics [8,9]. However, achieving high-quality images requires careful consideration of various factors, including image resolution, exposure duration, illumination levels, and the proper utilization of optical filters. Inappropriate choices in the imaging process can adversely affect image quality, thereby posing challenges for analyzing WBC images. Deep learning, particularly convolutional neural networks (CNNs), has revolutionized medical image segmentation [10]. The U-shaped network (U-Net) [11], a symmetrical encoder-decoder convolutional network featuring skip connections, stands as a prime example. The U-Net has gained significant popularity in medical image processing, especially for datasets with limited samples. Extensive research has demonstrated the effectiveness of this architecture in extracting multi-scale image features [12].
Subsequent iterations, such as U-Net++ [13] and U-Net3+ [14], have been proposed to further enhance performance. U-Net++ introduces nested and dense skip connections to address the semantic gap and incorporates deep supervision to improve segmentation performance [13]. Regularized U-Net (RU-Net) [15] proposed a new activation function with a piecewise smooth effect to solve the under-segmentation problem. There have been various other advancements in CNN-based architectures for image segmentation. One notable example is the multiscale information fusion network (MIF-Net), which incorporates a boundary splitter and information fusion mechanisms using strided convolutions [16]. These techniques contribute to improved segmentation accuracy. In parallel, Transformer-based approaches, such as Vision Transformer (ViT) [17] and Swin Transformer [18], have emerged, leveraging self-attention mechanisms for better feature extraction. Specifically, the ViT partitions images into nonoverlapping patches and treats the patches as sequence data, where the self-attention mechanism is subsequently used to extract long-range information among patches. Furthermore, the Swin Transformer applies a shifted window to make ViT more computationally efficient. Though the original ViT model exhibits significantly better performance on large objects, it performs worse on small objects [19]. This limitation might arise from the fixed scale of patches generated by transformer-based methods. A potential way to enhance the small object detection (SOD) capability is to use finer patch sizes, which might increase the computational cost. Notably, U-Net architectures have also incorporated transformers, yielding U-Net Transformer [20], Medical Transformer [21], and Swin-Unet [22], all of which have set new performance benchmarks in medical image segmentation. However, these architectures, rooted in pixel-based learning, demand substantial memory resources, leading to inefficiencies, especially when available training samples are scarce [23]. In this regime, the expressivity gap between large models and parameter-efficient counterparts may narrow. To mitigate this, embedding prior knowledge can reduce the computational burden.
Graph-structured data provides an elegant way of describing the geometry of data, which contains abundant relational information. For example, diverse types of relational systems or structured entities can be described by graphs that encode their interior connections; typical examples include particle system analysis [24,25], social networks [26], and molecular property prediction [27]. Correspondingly, graph neural networks (GNNs) are specifically designed to process graph data [28][29][30][31][32], where researchers have developed graph convolutional networks (GCNs) and many variants to update node features by aggregating information from neighboring nodes.
The transformation of image data, particularly data without an inherent geometric structure, into graph data represents a substantial challenge. This challenge is twofold: encoding Euclidean space data into graph representations and decoding them back to their original image domain. A prevalent approach is the patch graph method, where image patches are treated as graph nodes. For example, the Graph-FCN [33] applies a fully convolutional network (FCN) to extract image features, and the graph structure is constructed with the k nearest neighbor (kNN) method, where the weighted adjacency matrix is generated with the Gaussian kernel function. In [34], the dual graph convolution network (DGCNet) constructs the graph structure not only on the spatial domain but also on the feature domain. In the semantic segmentation task, bilinear interpolation upsampling is applied to the downsampled output of the DGCNet to recover the same image size as the label. There has also been recent work aiming to combine the local feature extraction ability of CNNs with the long-range interaction ability of GNNs. The vision graph U-Net (VGU-Net) model was proposed to construct multi-scale graph structures, enhancing the model's learning capacity [35].
However, while the patch-based method offers convenience in graph construction, it has its limitations. The fixed structure of image patches can lead to the omission of critical boundary details. An alternative lies in the superpixel approach. Since superpixels can significantly reduce the computational and memory overhead of image processing tasks, superpixel methods are commonly implemented as a preprocessing step before deep reasoning models [36][37][38][39][40][41][42][43]. Superpixel methods over-segment the image into multiple nonoverlapping regions based on pixel features, so that homogeneous pixels are grouped inside a single superpixel. Traditional superpixel generation methods can be roughly divided into graph-based [44][45][46] and clustering-based [47][48][49][50] methods. These methods generate high-quality superpixels quickly, require no human labels, and use little memory. Recently, deep learning-based approaches have been employed in superpixel sampling [51][52][53][54][55]. These methods are accurate but not memory efficient, since learning high-level image features requires a relatively large number of convolution kernel parameters. Based on a pre-computed superpixel graph and a GNN, [56] captures global feature interactions for brain tumor segmentation. In [57], superpixel-based graph data and an edge-labeling graph neural network (EGNN) [58] are implemented for biological image segmentation.
Medical image segmentation has long depended on the precision achieved by supervised learning methods. Yet, the perennial issue remains, namely the paucity of richly labeled datasets in clinical contexts. This limitation has driven the pivot to metric learning, which challenges the entrenched belief that robust intelligence is the sole preserve of abundant labeled data [59][60][61][62]. Metric learning approaches, such as contrastive methods, learn representations in a discriminative manner by contrasting positive sample pairs against negative pairs. By tapping into vast reservoirs of unlabeled samples, they set the stage for pretraining deep learning models. The subsequent phase involves fine-tuning with just a fraction of labeled samples. The resulting model performance is on par with traditional supervised strategies.
Notably, there is growing interest in supervised metric learning methods, specifically tailored to capture cross-image relations. These techniques use sample labels to categorize samples into positive and negative sets [63]. By bridging labeled and unlabeled data, metric learning methods enable deep learning models to deliver strong results even with scarcely labeled samples.
In this work, we propose a novel approach for WBC segmentation, namely the superpixel metric graph neural network (SMGNN). The core strength of SMGNN lies in its dual promise: delivering high accuracy while simultaneously optimizing memory efficiency. The foundation of our technique is a superpixel graph constructed from image data. This restructuring drastically diminishes the problem's dimensionality and serves as a conduit for infusing abundant prior information into the graph data. In addition to leveraging prior knowledge on a single training sample, our proposed approach introduces superpixel metric learning to capture "global" context across the training samples. In clinical image segmentation scenarios with limited training samples, we believe incorporating this "global" context can enhance the expressivity of deep learning models. Our proposed metric learning operates on the superpixel embeddings rather than the vast number of pixel embeddings, which saves memory.
The contributions of this paper can be summarized as follows.
• Our proposed lightweight SMGNN reduces the number of learnable parameters by up to 10,000 times compared with mainstream segmentation models.
• Our proposed superpixel-based model reduces the problem size and imposes rich prior knowledge on the rarely considered graph-structured data, which helps SMGNN achieve state-of-the-art (SOTA) performance on WBC images.
• We innovatively propose superpixel metric learning according to the definition of superpixel metric score, which is more efficient than pixel-level metric learning.
• The whole deep learning-based nucleus and cytoplasm segmentation and cell type classification system is accurate and efficient to execute in hematological laboratories.
The remaining sections of this article are structured as follows. Section 2 describes our methodology in depth. Section 3 depicts the workflow and architecture of our proposed model. In Section 4, through extended segmentation experiments, the SMGNN model achieves SOTA segmentation performance in terms of both accuracy and memory efficiency. The cell type classification task is conducted with a lightweight residual network (ResNet) based on the segmentation result. The whole procedure of the proposed automatic recognition system is shown in Figure 3.
Figure 2. The main idea underlying our approach is to learn the distance between superpixel embeddings using the superpixel metric score, which is the ratio of the majority class inside the superpixel. Given the anchor embedding, similar embeddings with approximate metric scores will be pulled close and dissimilar embeddings will be pushed away. With the help of metric learning, the cross-image global context can be captured and a better embedding space will be learned.

Methodology of superpixel metric
Deep learning methods for medical image processing have predominantly concentrated on discerning the local context, which refers to inter-pixel dependencies within individual images [63]. However, there is a missed opportunity: capturing the "global" context that exists between training samples. While pixel-level contrast or metric learning provides a way to bridge this gap, the computational and memory overheads (due to contrast or metric computations spanning every pixel pair) render them less feasible. We propose an efficient superpixel-level metric learning scheme based on a metric loss, which captures the desired global context while drastically cutting computational and memory costs.

Compression ratio on image data
The utility of superpixel methods in image data preprocessing is well acknowledged, particularly for their ability to condense data and reduce computational demands. Consequently, this study capitalizes on these advantages, transforming the image data into a more compact graph representation. Within this framework, each superpixel becomes a graph node. The connections between these nodes, whether driven by spatial positioning or feature similarity, determine the graph's topology. The node features are not fixed; their definition can range from basic five-dimensional attributes encompassing color (three dimensions) and location (two dimensions) to more intricate descriptors such as histograms, positional variance and variations in pixel values.
Given that the adjacency matrix is sparse, it is predominantly the node features that dictate memory consumption. Opting for the more straightforward features facilitates significant data compression. Specifically, for three-channel red-green-blue (RGB) images with $N$ pixels and $K$ superpixels, the compression ratio is roughly $c = \frac{\text{graph data}}{\text{image data}} \approx \frac{5K}{3N}$. Importantly, this efficient compression does not compromise on quality. Our subsequent numerical experiments demonstrate that this ratio is compatible with optimal segmentation outcomes.
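As a rough worked instance of this ratio, the sketch below evaluates c for a 120 × 120 RGB image, assuming five features per node and a mean superpixel scale of 16 pixels per superpixel (the value used later in the experiments). The function name `compression_ratio` is hypothetical and only serves the illustration.

```python
def compression_ratio(height, width, num_superpixels, node_dim=5, channels=3):
    """Ratio of graph feature entries (K * node_dim) to image entries (N * channels)."""
    num_pixels = height * width
    return (num_superpixels * node_dim) / (num_pixels * channels)


# 120 x 120 RGB image over-segmented at a mean scale of 16 pixels per superpixel.
height, width = 120, 120
num_superpixels = (height * width) // 16  # K = 900
ratio = compression_ratio(height, width, num_superpixels)
print(f"K = {num_superpixels}, c = {ratio:.4f}")
```

Under these assumptions the node features occupy roughly one tenth of the storage of the raw image, since c reduces to 5/(3 · 16).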

Quality of superpixel and reconstruction score
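The segmentation task relies on the quality of the generated superpixels, and good superpixels cohere with the boundaries of the labeled images. Suppose $X \in \mathbb{R}^{H \times W \times 3}$ is the input image. Let $V$ be the set of all superpixels, $N = H \times W$ be the number of pixels, $K = |V|$ be the number of superpixels, and $Q \in \mathbb{R}^{N \times K}$ be the association matrix between pixels and superpixels. Then we have

$$Q_{ij} = \begin{cases} 1, & \text{if } \mathrm{Flatten}(X)_i \in V_j, \\ 0, & \text{otherwise,} \end{cases}$$

where $V_j$ is the $j$th superpixel. The Flatten operation converts a two-dimensional image matrix into a one-dimensional vector. The association matrix builds the bridge between image space and graph space. For computational convenience, we define the column-normalized association matrix as

$$\bar{Q}_{ij} = \frac{Q_{ij}}{\sum_{i'=1}^{N} Q_{i'j}}.$$

Let $Y \in \mathbb{R}^{N}$ be the label of the image pixels and $\tilde{Y} \in \mathbb{R}^{K}$ be the superpixel metric, or metric score, of the graph data. Using the pixel labels, we can formulate the metric score as

$$\tilde{Y}_j = \frac{1}{|V_j|} \sum_{i \in V_j} Y_i \tag{2.1}$$

for the $j$th superpixel. $\tilde{Y}$ can also be efficiently computed with the column-normalized association matrix, i.e., $\tilde{Y} = \bar{Q}^{T} Y$. We can back-project the superpixel label to image space by the association matrix, i.e., $\hat{Y} = Q \tilde{Y}$. We define the intersection over union (IoU) reconstruction score to evaluate the quality, which reads

$$\mathrm{IoU} = \frac{1}{c} \sum_{i} \frac{\left|\{x \mid Y_x = i\} \cap \{x \mid \hat{Y}_x = i\}\right|}{\left|\{x \mid Y_x = i\} \cup \{x \mid \hat{Y}_x = i\}\right|},$$

where $\{x \mid Y_x = i\}$ denotes the set of pixels whose class label equals $i$, the sum runs over all classes, and $c$ is the number of classes.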
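The quantities in this subsection can be sketched on a toy example with NumPy. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, the superpixel index map is given by hand instead of coming from a superpixel algorithm, and rounding the back-projected score to the nearest class is an assumption that only makes sense for the binary labels used here.

```python
import numpy as np


def association_matrix(sp_index, num_superpixels):
    """Binary N x K matrix Q with Q[i, j] = 1 iff pixel i lies in superpixel j."""
    flat = sp_index.reshape(-1)
    q = np.zeros((flat.size, num_superpixels))
    q[np.arange(flat.size), flat] = 1.0
    return q


def iou_reconstruction(labels, recon, num_classes):
    """Mean over classes of |A ∩ B| / |A ∪ B| for true set A and reconstructed set B."""
    scores = []
    for c in range(num_classes):
        a, b = labels == c, recon == c
        union = np.logical_or(a, b).sum()
        if union > 0:  # skip classes absent from both maps
            scores.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(scores))


# Toy 2 x 4 image: hand-written superpixel index map and binary pixel labels.
sp_index = np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1]])
pixel_labels = np.array([[0, 0, 1, 1],
                         [0, 1, 1, 1]])

Q = association_matrix(sp_index, num_superpixels=2)
Q_bar = Q / Q.sum(axis=0, keepdims=True)           # column-normalized association matrix
metric_score = Q_bar.T @ pixel_labels.reshape(-1)  # mean label per superpixel
recon = np.rint(Q @ metric_score).astype(int)      # back-projection, rounded (assumption)
score = iou_reconstruction(pixel_labels.reshape(-1), recon, num_classes=2)
```

Here one pixel of the first superpixel carries a minority label, so the back-projection cannot recover it and the reconstruction score falls below 1; finer superpixels reduce this kind of loss.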

Lightweight GNNs for superpixel embedding
GNNs have the advantage of being lightweight compared to other deep learning models that often require deeper and larger networks to achieve higher performance. GNNs leverage relational information and utilize shallow layers to achieve satisfactory results.
Given an undirected attributed graph $G = (V, E, S)$, $G$ consists of a nonempty finite set of $K = |V|$ nodes $V$ and a set of edges $E$ between node pairs. Denote by $A \in \mathbb{R}^{K \times K}$ the graph adjacency matrix and by $S \in \mathbb{R}^{K \times d}$ the node attributes. A graph convolution learns a matrix representation $H$ that embeds the structure $A$ and the feature matrix $S = \{S_j\}_{j=1}^{K}$, with $S_j$ the feature of node $j$. Most graph convolutions follow the message passing [64] update scheme, which finds a central node's smooth representation by aggregating its 1-hop neighbor information. At layer $\ell$, the propagation for the $i$th node reads

$$H_i^{(\ell)} = \gamma\Big(H_i^{(\ell-1)}, \; \square_{j \in N(i)} \, \phi\big(H_i^{(\ell-1)}, H_j^{(\ell-1)}\big)\Big),$$

where $\square(\cdot)$ is a differentiable and permutation invariant aggregation function, such as summation, average, or maximization. The set $N(i)$ includes $V_i$ and its 1-hop neighbors. Both $\gamma(\cdot)$ and $\phi(\cdot)$ are differentiable functions, such as multilayer perceptrons (MLPs).
In our approach, we construct graph data using superpixels and their predefined relationships. GNNs learn features from the graph space to enhance the segmentation capability. We employ the graph isomorphism network (GIN) [32] as the backbone graph representation network. At each layer $\ell$, the GIN model updates the $i$th node representation as follows:

$$H_i^{(\ell)} = \mathrm{MLP}^{(\ell)}\Big((1 + w) \, H_i^{(\ell-1)} + \sum_{j \in N(i) \setminus \{i\}} H_j^{(\ell-1)}\Big),$$

where $w$ is a learnable weight. GIN has been proven to possess expressive power equivalent to the 2-Weisfeiler-Lehman test [65]. By utilizing GIN, we can effectively extract informative features from the superpixel graph, enabling accurate and efficient segmentation performance.
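A single GIN update can be sketched in plain NumPy as follows. This is an illustrative forward pass only, under stated assumptions: the two-layer MLP with ReLU, the random weights, and the toy path graph are all made up for the example, and no training is performed. The final lines check the permutation equivariance that makes the sum aggregation suitable for graphs.

```python
import numpy as np


def gin_layer(h, adj, w, weight1, weight2):
    """One GIN update: MLP((1 + w) * h_i + sum of neighbour features)."""
    aggregated = (1.0 + w) * h + adj @ h          # adj has a zero diagonal
    hidden = np.maximum(aggregated @ weight1, 0)  # ReLU
    return hidden @ weight2


rng = np.random.default_rng(0)
num_nodes, in_dim, hid_dim = 4, 5, 8
# Path graph 0-1-2-3 as a symmetric adjacency matrix without self-loops.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h = rng.normal(size=(num_nodes, in_dim))      # five-dimensional node features
weight1 = rng.normal(size=(in_dim, hid_dim))
weight2 = rng.normal(size=(hid_dim, hid_dim))

out = gin_layer(h, adj, w=0.0, weight1=weight1, weight2=weight2)

# Relabelling the nodes permutes the output rows in the same way.
perm = np.array([2, 0, 3, 1])
out_perm = gin_layer(h[perm], adj[np.ix_(perm, perm)], 0.0, weight1, weight2)
```

In a trained model the scalar `w` and both weight matrices would be learnable parameters; the small hidden width is what keeps the parameter count low.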

Memory efficient metric learning
Although pixel-wise contrast can learn the global context to form a good segmentation embedding space [63], computing the contrastive loss requires using training image pixels, which leads to a significant computation and memory cost. In this study, we propose superpixel-based methods that can significantly reduce the number of data samples from N to K, and we introduce a memory-efficient distance-based metric loss function.
The fundamental concept of metric learning is to bring similar samples closer together in the embedding space while pushing dissimilar samples further apart. However, pixel-wise contrast methods, which set a large number of anchor pixels and use tensor multiplication to compute positive similarity, incur high memory costs. To address this, we define the superpixel metric loss of Eq (2.5) using the mean square error (MSE) between the similarity and the metric score of the embeddings. In Eq (2.5), $n_1$ and $n_2$ represent the number of anchor samples $A \in \mathbb{R}^{n_1 \times d}$ and reference samples $R \in \mathbb{R}^{n_2 \times d}$, respectively. Here, $d$ denotes the dimension of the embedding, and $a$ and $b$ are learnable parameters. Additionally, $Y(x)$ represents the superpixel label of node $x$, which is defined by Eq (2.1). It is important to note that the anchor and reference samples are not restricted to being from the same image. The objective of Eq (2.5) is to bring the embeddings of similar superpixel samples closer together and push dissimilar ones apart.
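To make the idea concrete, the sketch below shows one possible MSE-style loss over an n1 × n2 similarity matrix. It should be read as an assumption-laden illustration rather than the loss of Eq (2.5): the use of cosine similarity, the affine rescaling by `a` and `b` (plain floats here, learnable in a real model), and the target 1 − |Y(A_i) − Y(R_j)| with scores scaled to [0, 1] are all choices made for this example.

```python
import numpy as np


def superpixel_metric_loss(anchors, refs, y_anchor, y_ref, a=1.0, b=0.0):
    """MSE between an affine-rescaled cosine similarity and a score-based target.

    Assumed form (not taken from the paper): target_ij = 1 - |y_i - y_j|,
    with metric scores scaled to [0, 1].
    """
    an = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    rn = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    similarity = a * (an @ rn.T) + b                       # shape (n1, n2)
    target = 1.0 - np.abs(y_anchor[:, None] - y_ref[None, :])
    return float(np.mean((similarity - target) ** 2))


anchors = np.array([[1.0, 0.0], [0.0, 1.0]])               # n1 = 2 embeddings
refs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])      # n2 = 3 embeddings
y_anchor = np.array([1.0, 0.0])
y_ref = np.array([1.0, 0.0, 0.5])

loss = superpixel_metric_loss(anchors, refs, y_anchor, y_ref)
```

The memory argument is visible in the shapes: the pairwise matrix has n1 × n2 entries over superpixels, whereas a pixel-level counterpart would scale with the number of pixel pairs.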

The work-flow and architecture of SMGNN
We use parameter-free methods to generate superpixels [47,66], on which we construct the graph structure with two alternative strategies. Each superpixel is treated as a node, and its mean RGB values and mean position form a five-dimensional node feature. Let the mean scale of superpixels be $S = \frac{H \times W}{K}$. We define the adjacency between nodes according to their positional relation, where $(x_i, y_i)$ and $(x_j, y_j)$ are the positions of superpixel $i$ and superpixel $j$, respectively, and $\alpha \geq 2$ is a hyperparameter that controls the number of neighboring nodes. This definition imposes strong local connectivity on the graph data. For batched images, we can use parallel computing to accelerate the graph generation process and combine multiple subgraphs into one large graph, where only connected nodes can perform message passing with the GNNs.
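The graph construction can be sketched as follows. The exact adjacency rule is not reproduced here, so the sketch assumes one plausible reading: two superpixels are connected when the Euclidean distance between their centroids is at most α·√S. The helper names and the toy centroids are hypothetical. The second helper shows how per-image graphs can be stacked into one block-diagonal graph so that no messages cross image boundaries.

```python
import numpy as np


def positional_adjacency(centroids, mean_scale, alpha=2.0):
    """Connect superpixels whose centroids lie within alpha * sqrt(mean_scale).

    The distance rule is an assumption made for illustration.
    """
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    adj = (dist <= alpha * np.sqrt(mean_scale)).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj


def batch_graphs(adjs):
    """Stack per-image adjacency matrices into one block-diagonal matrix."""
    total = sum(a.shape[0] for a in adjs)
    big = np.zeros((total, total))
    offset = 0
    for a in adjs:
        k = a.shape[0]
        big[offset:offset + k, offset:offset + k] = a
        offset += k
    return big


# Toy centroids (x, y) of four superpixels with mean scale S = 16.
centroids = np.array([[2.0, 2.0], [6.0, 2.0], [12.0, 2.0], [2.0, 6.0]])
adj = positional_adjacency(centroids, mean_scale=16, alpha=2.0)
batched = batch_graphs([adj, adj])
```

A larger α widens the neighbourhood radius and therefore increases the number of edges per node.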
Regarding the model architecture, we utilize three layers of GIN to generate the embeddings of superpixels, upon which metric learning is performed. In addition to the transformed graph data derived from superpixels, we retain the original image data. We concatenate the features of superpixels and pixels and pass them through a lightweight CNN to smooth out small pixel groups in the output of the GNNs. This process helps enhance the segmentation accuracy by incorporating nondegraded image information. The overall architecture of our proposed model, named SMGNN, is illustrated in Figure 5.
To tackle clinical image segmentation, we employ the Dice loss [67], which is a structure-aware and widely used loss function for medical image segmentation. This loss function is designed to measure the similarity between predicted and ground truth segmentation masks. We also use the Dice coefficient to evaluate the performance of different models [67], which is a widely used metric in image segmentation.
To be more specific, given a set $G$, we define its characteristic (label) function by $\iota_G(x) = 1$ if $x \in G$ and $\iota_G(x) = 0$ otherwise. The Dice coefficient of two sets $G$ and $\hat{G}$ is defined as

$$\mathrm{Dice}(G, \hat{G}) = \frac{2 \sum_{x \in \Omega} \iota_G(x) \, \iota_{\hat{G}}(x)}{\sum_{x \in \Omega} \iota_G(x) + \sum_{x \in \Omega} \iota_{\hat{G}}(x)},$$

where $\Omega$ indicates the domain containing the two sets. The Dice metric is also directly used as a loss function to train a supervised segmentation task. The Dice loss function is formulated as

$$\mathcal{L}_{\mathrm{Dice}} = 1 - \mathrm{Dice}(G, \hat{G}).$$

The joint loss function for both the superpixel metric learning and segmentation tasks is defined as a combination of the Dice loss and the superpixel metric loss. This joint loss function enables us to optimize the model parameters simultaneously for both tasks, effectively leveraging the benefits of both superpixel-based metric learning and pixel-wise segmentation. The joint loss function is formulated as

$$\mathcal{L} = \mathcal{L}_{\mathrm{Dice}} + \lambda \, \mathcal{L}_{\mathrm{metric}},$$

where $\lambda \geq 0$ is a hyperparameter to trade off the learning on image space and graph space. Empirically, as shown in Figure 4, we set $\lambda = 0.1$ in the numerical experiments for better performance, based on extended experiments over the search space $\{0, 0.1, 0.5, 1, 10\}$.
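A minimal NumPy sketch of the Dice coefficient and a joint objective is given below. It works on hard binary masks for readability, whereas training would use soft probabilities; the small `eps` term guarding empty masks, the helper names, and the placement of λ on the metric term are assumptions of this example.

```python
import numpy as np


def dice_coefficient(pred, target, eps=1e-8):
    """Dice = 2 |G ∩ Ĝ| / (|G| + |Ĝ|) for binary masks (eps guards empty masks)."""
    pred = pred.astype(float)
    target = target.astype(float)
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


def joint_loss(pred, target, metric_loss, lam=0.1):
    """Dice loss plus lam times a precomputed superpixel metric loss."""
    return (1.0 - dice_coefficient(pred, target)) + lam * metric_loss


target = np.array([[0, 1, 1],
                   [0, 1, 0]])
pred = np.array([[0, 1, 0],
                 [0, 1, 1]])
dice = dice_coefficient(pred, target)             # two of three foreground pixels overlap
total = joint_loss(pred, target, metric_loss=0.2)
```

With `lam=0` the objective reduces to the plain Dice loss, which corresponds to the first point of the search space for λ.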

Numerical experiments
We employed widely used WBC datasets to evaluate the effectiveness of our proposed recognition system. We used the Adam optimizer [68] for model training. The implementations are programmed with PyTorch-Geometric (version 2.2.0) and PyTorch (version 1.12.1) and executed on an NVIDIA Tesla A100 GPU with 6,912 CUDA cores and 80 GB of HBM2 memory installed on an HPC cluster.

Dataset description
We verify the robustness of our methods on two WBC image datasets. The first dataset (dataset-1) originates from Jiangxi Tecom Science Corporation of China [69] and contains 300 color images of size 120 × 120 (176 neutrophils, 22 eosinophils, 1 basophil, 48 monocytes and 53 lymphocytes). The second dataset (dataset-2) is publicly available on the CellaVision blog and is widely used in leukocyte research; it contains 100 color images of size 300 × 300 (30 neutrophils, 12 eosinophils, 3 basophils, 18 monocytes and 37 lymphocytes). Both datasets consist of three-channel RGB images, which are processed by neural networks in an end-to-end training regimen. Each WBC image is manually labeled, marking three primary regions: nuclei (represented in white), cytoplasm (depicted in gray), and the surrounding peripheral blood (captured in black). The training/validation/testing split is 80/10/10% of each dataset, and the Dice loss function is applied to train the segmentation model. The datasets cover five cell types under varying staining effects and illumination conditions, which causes large variations in the sample distribution.

Evaluation of superpixel scale
To efficiently over-segment our input images, we adopted the simple non-iterative clustering (SNIC) superpixel generation method [66]. A distinctive feature of SNIC is that it visits each pixel just once, with the sole exception of pixels situated on superpixel boundaries. The number of pixel visits is therefore the total number of pixels, N, plus a term that depends on the desired superpixel count, K. Such a design makes SNIC computationally cheaper than alternatives like simple linear iterative clustering (SLIC) [47].
We also examine the superpixel quality and its effect on segmentation by varying the number of superpixels. For instance, the results on the WBC datasets (illustrated by the red trajectory in Figure 6) show that as the superpixels become finer, the mean intersection over union (mIoU) score rises correspondingly. Balancing segmentation quality against computational practicality, we set the mean scale of superpixels (S) to 16 for all ensuing experiments.

Comparison with mainstream deep learning segmentation methods
A particularly intricate aspect of WBC dataset segmentation is the differentiation between the cytoplasm and the nuclei. Several images from dataset-2 present nuclei that are suboptimally stained, leading to a coloration reminiscent of the cytoplasm. Ideally, accurate segmentation demands that the representation of the nucleus be a cohesive, uninterrupted region. This color overlap often misleads traditional CNN-based segmentation models, like U-Net, Attention U-Net and MIF-Net, resulting in predicted segmentations marred by interruptions or holes. The RU-Net demonstrates the ability to preserve piecewise constant nuclei regions; however, it tends to over-segment the cytoplasm region, which hinders overall segmentation performance. On the other hand, the Swin U-Net leverages the self-attention mechanism of the Transformer model and shows promising segmentation results. Nevertheless, the Swin U-Net's large number of parameters hampers training efficiency and requires significant computational resources. In comparison, the VGU-Net achieves a balance between efficiency and effectiveness, although it still utilizes a considerable number of parameters compared to the SMGNN. The proposed SMGNN model only utilizes about 7,000 parameters and strengthens the connectivity of adjacent superpixels. Therefore, it is efficient in learnable parameters and proficient in preserving the integrity of the nucleus region, as showcased in Figure 8. Compared to those end-to-end segmentation models, the SMGNN model takes a preprocessing step to cluster homogeneous pixels into superpixels and construct graph-structured data. We use the SNIC algorithm, which is non-iterative, simple, fast, and memory efficient. This step costs some computational time, but the SMGNN segmentation model itself is very efficient. We compare the computational time of different methods in Table 3. In Figure 9, we show the Dice performance and number of learnable parameters of different deep learning models. Our SMGNN model can reach SOTA performance while using far fewer parameters. In Tables 1 and 2, we show the quantitative comparison of these mainstream baseline segmentation models using the Dice coefficient, Hausdorff distance, positive predictive value (PPV), accuracy and sensitivity as the metrics. The proposed SMGNN model can achieve SOTA segmentation performance while using remarkably fewer parameters.

Ablation study
To optimize node embeddings within our methodology, we selected the GIN model for the GNN module due to its superior discriminative capacity. Comparative tests with popular GNN models like GCN and graph attention network (GAT) revealed that a configuration using three layers of GIN performed better, significantly improving node classification accuracy.
Beyond the conventional setup detailed in Figure 5, which incorporates pixel-level embedding, we also explored an approach that solely leverages a GNN, transitioning from image-level segmentation components to a strictly node-based classification method, as illustrated in Figure 10. The training for this node classification is driven by a cross-entropy loss function, in which the majority voting rule defines the superpixel label that guides the supervised learning in graph space. In Eq (4.2), ⌊•⌋ is the round down function, Ŝ is the predicted probability of a superpixel, and C is the number of semantic classes. Though such a rule may mislabel pixels whose label is not the majority within their superpixel, the pure GNN methods may achieve good performance when the scale of the superpixel is small. Our ablation studies, as depicted in Figure 11, offer insights into the performance of pure GNN-based segmentation models. These models produce commendable segmentation outcomes with small-scale superpixels. However, as the superpixel scale grows, the model becomes more sensitive to superpixel quality, leading to notable performance drops. Introducing convolutional filters via CNN feature embedding for image-level segmentation does add parameters to the model. Nevertheless, these filters confer stability upon the model, especially across varying superpixel scales.
This investigation underpins a critical takeaway: the scale of superpixels and a model's sensitivity to their quality must be calibrated together. Relying exclusively on GNN-driven segmentation models may prove suboptimal with larger superpixels.
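The majority voting rule and its inherent error can be sketched as below. This is an illustrative implementation with hypothetical names, not the formula of Eq (4.2): it counts the classes inside each superpixel and takes the most frequent one, with ties resolved toward the smallest class index by `argmax`.

```python
import numpy as np


def majority_labels(sp_index, pixel_labels, num_superpixels, num_classes):
    """Label each superpixel with the most frequent class among its pixels."""
    counts = np.zeros((num_superpixels, num_classes), dtype=int)
    np.add.at(counts, (sp_index.reshape(-1), pixel_labels.reshape(-1)), 1)
    return counts.argmax(axis=1), counts


# Toy 2 x 4 image with three superpixels and three classes
# (0: background, 1: cytoplasm, 2: nucleus).
sp_index = np.array([[0, 0, 1, 1],
                     [0, 0, 2, 2]])
pixel_labels = np.array([[0, 0, 2, 2],
                         [0, 1, 1, 2]])
sp_labels, counts = majority_labels(sp_index, pixel_labels, 3, 3)

# Pixels whose own label differs from their superpixel's label cannot be
# recovered by a pure node-level classifier.
mismatch = (sp_labels[sp_index] != pixel_labels).mean()
```

The mismatch rate is a lower bound on the pixel error of any pure node classifier for a given over-segmentation, which is one way to see why coarse superpixels hurt the GNN-only variant.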

Effectiveness of metric learning on embedding space
To understand the impact of metric learning on the embedding space, we visually represent the spatial relationships of superpixels. We assign different colors to these superpixels based on their labels, as determined by Eq (4.2). Figure 12, created using uniform manifold approximation and projection (UMAP) [70], demonstrates the distances among embeddings derived from the GNN. A notable observation from this visualization is the pronounced separation between superpixels with distinct labels, which supports the efficacy of incorporating metric learning. Furthermore, the cosine similarity between samples that are alike is higher, while distinct samples exhibit reduced similarity. This distinction underscores the model's ability to effectively differentiate and group superpixels in the embedding space.

WBCs type classification results
Though the segmentation of cell salient regions, such as the nucleus and cytoplasm, is fundamental and challenging, there are various off-the-shelf methods available for subsequent cell type classification [71][72][73]. Segmentation may provide a direct means to obtain distinguishing characteristics for cell type classification, as cell morphology is closely related to cell type.

In this part, we employ a lightweight ResNet neural network [74,75] to train a classifier based on the outputs of segmentation networks, as shown in Figure 13. The overall recognition algorithm is shown in Algorithm 1. We sample about one third of the images of each class in dataset-2 for training and leave the remainder for testing. We train the segmentation and classification models separately. The classification result is shown in Table 4, and our classification method achieves about 96.72% overall accuracy. In addition to our proposed segmentation-based cell type recognition system, we also implemented a baseline method that predicts cell types without using segmented cell regions. The corresponding results are presented in Table 5: without extraction of salient cell regions, this baseline reaches only 72.13% overall accuracy. There are also traditional methods that employ handcrafted features extracted from segmented regions, combined with machine learning classifiers such as the support vector machine (SVM) [71]; such methods reach overall accuracies ranging from 89.69% to 96%. Our proposed deep learning-based automatic recognition system demonstrates high efficiency, and its accuracy can be further improved as more training samples become available. To compare the overall accuracy of the cell type classification task, we take the segmented regions produced by different models as input and compare the classification accuracy, as shown in Table 6. Our proposed method achieves state-of-the-art performance in the WBC recognition workflow.
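For readers who want to reproduce the reported metrics, per-class and overall accuracy can be derived from a confusion matrix as sketched below. The row-equals-true-class convention, the function name, and the toy two-class matrix are assumptions for illustration; the numbers are not taken from Table 4.

```python
def accuracy_from_confusion(cm):
    """Return (per-class accuracy, overall accuracy) from a confusion matrix.

    cm[i][j] counts samples of true class i predicted as class j
    (rows as true classes is an assumption of this sketch).
    """
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    per_class = [
        cm[i][i] / sum(cm[i]) if sum(cm[i]) else 0.0
        for i in range(len(cm))
    ]
    return per_class, correct / total


# Made-up two-class example: 17 of 20 samples are classified correctly.
toy_cm = [[8, 2],
          [1, 9]]
per_class, overall = accuracy_from_confusion(toy_cm)
```

Overall accuracy weights every test sample equally, so classes with more test images have more influence on it than the per-class values suggest.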

Conclusions
In this paper, we proposed a deep learning-based automatic recognition system for the challenging WBC image recognition task. In the first part, we proposed the SMGNN segmentation model, which combines superpixel methods and a lightweight GIN to significantly reduce memory usage while preserving segmentation capabilities. We introduced superpixel metric learning to capture cross-image global context information, which makes the model well suited to medical images with limited training samples. Compared with other mainstream deep learning models, our model achieved comparable segmentation performance while using up to 10000 times fewer parameters. Through extended numerical experiments, we further investigated the effectiveness of metric learning and the quality of superpixels. In the second part, the segmentation-based cell type classification processes exhibited satisfactory results, indicating that the overall automatic recognition algorithms are accurate and efficient for execution in hematological laboratories. We have made our code publicly available at https://github.com/jyh6681/SPXL-GNN, and we encourage its widespread use on hematologists' portable devices and in remote rural areas.

Figure 1. Sample images of five different types of WBCs. The colors of different images exhibit significant variations, and the boundaries of the cytoplasm are often ambiguous, posing a considerable challenge in accurately recognizing the shape of WBCs.

Figure 3. The overall workflow of the proposed automatic WBC recognition system.

Figure 4. Comparison study of the parameter choice on λ = 0.1.

Figure 5. The framework of our proposed SMGNN for medical image segmentation consists of three main stages: 1) Create Superpixel Graph: The input images are initially oversegmented, generating multiple superpixels. Subsequently, a superpixel graph is constructed based on these segments. 2) Graph Representation Network: The superpixel embeddings are learned using a combination of GNN and metric learning techniques. This stage focuses on capturing the relationships and representations within the superpixel graph. 3) Convert to Image Space Feature: To facilitate segmentation, the superpixel graph is projected back to the image domain using the association matrix. The CNN layer is utilized to perform the segmentation task in the image space. The training process is supervised by the superpixel metric loss function L_SM in the graph space and the Dice loss function L_Dice in the image space.

Figure 6. (a) The IoU reconstruction score versus the number of superpixels. (b) Ground truth label image. (c)(d)(e)(f) Superpixel images and reconstructed label images (RLIs) of two different superpixel numbers. With an increase in the number of superpixels, the RLI tends to converge toward the ground truth label image. This trend indicates that the boundaries of the superpixels become more consistent with the boundaries of the cells, leading to improved quality of the superpixel segmentation.

Figure 7. The segmentation results on dataset-1. Some CNN-based models exhibit over-segmentation. The proposed SMGNN model exhibits good segmentation performance.

Figure 8. The segmentation results on dataset-2. The ground truth annotation image contains a connected nuclei region without holes inside. Most CNN-based methods tend to over-segment the nuclei region because of model bias. The SMGNN model preserves local connectivity well and achieves comparable performance.

Figure 9. The comparison of model performance and network parameter size across different models on the WBCs dataset. The center of the circle indicates the Dice score of the model. The radius of the circle indicates the number of learnable parameters. The SMGNN model, utilizing approximately 7,000 parameters, achieves comparable performance to models with millions of parameters.

Figure 10. The pure GNN method recasts the segmentation task as a superpixel classification task, without involving learning in image space. The classification task is trained with the cross-entropy loss function L_CE. The classified superpixels are projected back to the image domain through an association matrix.

Figure 11. Ablation study of the impact of metric learning and convolutional filtering within the image domain. Segmentation experiments were conducted on the WBC datasets.

Figure 12. (a) Without superpixel metric learning, the embeddings are hardly separable. (b) With superpixel metric learning, the well-learned embeddings form three groups corresponding to nuclei, cytoplasm and background.

Figure 13. The segmentation-based cell type classification workflow. The segmented salient region implicitly provides important cell features, such as shape, perimeter, mean and variance of the nucleus boundaries. The lightweight ResNet extracts region-level embeddings to classify five cell types.

Table 1. WBC segmentation results on dataset-1. SMGNN has the fewest trainable parameters (in millions) and achieves good performance on the following metrics. For metrics marked with ↑, higher values indicate better performance, and vice versa.

Table 2. WBC segmentation results on dataset-2. SMGNN has the fewest trainable parameters (in millions) and achieves good performance on the following metrics. For metrics marked with ↑, higher values indicate better performance, and vice versa.

Table 3. Comparison of the time cost. We sample eight WBC images from dataset-2 and measure the inference time of different segmentation models.

Table 4. Confusion matrix, accuracy, and overall accuracy with the ResNet classification network using the segmentation results of SMGNN.