1 Million Segmented Red Blood Cells With 240 K Classified in 9 Shapes and 47 K Patches of 25 Manual Blood Smears

Around 20% of complete blood count samples necessitate visual review using light microscopes or digital pathology scanners. There is currently no technological alternative to the visual examination of red blood cells (RBCs) morphology/shapes. True/non-artifact teardrop-shaped RBCs and schistocytes/fragmented RBCs are commonly associated with serious medical conditions that could be fatal, while increased ovalocytes are associated with almost all types of anemia. 25 distinct blood smears, each from a different patient, were manually prepared, stained, and then sorted into four groups. Each group was imaged using a different camera integrated into a light microscope with a 40X objective lens, resulting in a total of 47 K+ field images/patches. Two hematologists processed the images cell by cell to provide one million+ segmented RBCs with their XYWH coordinates and classified 240 K+ RBCs into nine shapes. This dataset (Elsafty_RBCs_for_AI) enables the development/testing of deep learning-based (DL) automation of RBCs morphology/shapes examination, including specific normalization of blood smear stains (different from histopathology stains), detection/counting, segmentation, and classification. Two codes are provided (Elsafty_Codes_for_AI), one for semi-automated image processing and another for training/testing of a DL-based image classifier.


Background & Summary
The complete blood count (CBC) is a frequently used laboratory test that ranks among the top four tests in terms of both volume and revenue in various countries, such as the U.S., Malaysia, India, Kenya, and Nigeria [1]. The findings of a CBC test are useful in most medical and surgical specialties, including cardiology and psychiatry [2,3]. Furthermore, CBC test results need interpretation and correlation with other medical tests and clinical findings in up to 75% of cases. Hematologists or pathologists perform a manual/visual examination of blood smears for around 20% of CBC tests. This process starts with spreading a thin layer of blood (10-50 µL) on a glass slide, staining it to highlight different intracellular structures, and then using light microscopes or digital pathology systems to review and examine red blood cells (RBCs), white blood cells (WBCs), and platelets.
In most labs, the commonly used manual preparation of smears can leave regions of the smear unsuitable for examination. Choosing the appropriate areas relies on assessing the balance between individual and overlapping RBCs, preferring fields with fewer overlapping cells for precise examination and counting. Staining is a complicated process that is influenced by technical, sample-related, and medical factors, resulting in variations in the context of the image [4,5]. Whole slide images (WSIs) produced by digital pathology scanners are becoming increasingly popular among pathologists, pathology departments, and researchers. The variability in staining poses a challenge for both pathologists and deep learning-based (DL) automated systems, and optical scanning introduces its own set of variations and distortions [6-8].
The aim of the dataset provided in this work (Elsafty_RBCs_for_AI) [9] and the codes (Elsafty_Codes_for_AI) [10], which are freely accessible at the Figshare data repository, is to facilitate the development and testing of a DL-based application for automated examination and reporting of RBCs morphology/shapes in percentages. Such an application is supposed to be capable of working with commonly used manually prepared and stained blood smears without necessitating prior standardization of the staining or smearing procedures. The provided 47 K+ field images/patches from 25 different slides/patients are useful for developing and testing DL-based normalizers specific to blood smear stains. Such normalizers are lacking, and prior solutions for histopathology stain normalization are not applicable because the stains used differ in nature and results. Furthermore, the provided one million+ cropped images of 80 × 80 pixels, taken from the field images/patches and containing segmented RBCs at their centers, along with the segmentation masks and the XYWH coordinates of the RBCs contours, support the development of DL-based segmenters and detectors. Moreover, the classified 240 K+ images of RBCs enable the development of DL-based classifiers working on the real RBCs size, which is critical, without resizing. The provided RBCs classes are normal/rounded RBCs, ovalocytes (oval or egg-shaped), borderline ovalocytes (between rounded and frank oval), burr cells (crenated), schistocytes/fragmented RBCs, teardrop-shaped RBCs, two-overlapped RBCs, three-overlapped RBCs, and angled cells that contain false/artifact teardrops, schistocytes/fragmented RBCs and ovalocytes. Please note that RBCs shapes/classes which have alternative technological or laboratory tests for identification or confirmation, such as sickle cells and bite cells, were not included in this study. However, examples of cells with similar features to them were included in our class "angled cells."
Examples of cropped RBCs images are shown in Fig. 1.
The presence of schistocytes/fragmented RBCs or teardrop-shaped RBCs is medically significant as it is commonly associated with serious medical conditions. Schistocytes/fragmented RBCs are defined as RBCs that are smaller than half the average normal/rounded RBCs size and/or irregularly shaped fragments with sharp, angular, or jagged edges. Identifying these cells is the most reliable indicator to confirm the diagnosis of diseases such as hemolytic anemias, thrombotic thrombocytopenic purpura (TTP), and disseminated intravascular coagulation (DIC). However, reporting schistocytes/fragmented RBCs in TTP and DIC can be a challenge due to their infrequency in hematology labs; furthermore, the cutoff for significant presence in these two serious diseases is just above 1.0-1.5% of the total RBCs, increasing the risk of overlooking them [11,12]. Crucially, in cases of critical thrombocytopenia where the platelet count is less than 20 K/µL, platelet transfusion may be necessary, but this intervention can be life-threatening in TTP and DIC [13]. Therefore, identifying and counting schistocytes/fragmented RBCs is critical for the accurate diagnosis and management of patients with associated medical conditions.
Increased teardrop-shaped RBCs above 2.0-4.0% in adults can be indicative of bone marrow fibrosis caused by bone marrow cancers; in non-cancerous conditions, the differential diagnosis is rushed erythropoiesis/production of blood to compensate for severe anemia. In normal persons, true teardrop-shaped RBCs are less than 0.5%. Currently, manual or DL-based visual examination is the only way to identify teardrop-shaped RBCs [14]. It is essential to differentiate between true teardrop-shaped RBCs, which have a single blunt protrusion, and false ones that have sharp surface projections without necks or have more than one blunt protrusion. Mechanical stress during blood smear preparation often leads to the formation of false teardrop shapes, primarily at the outer edges of the blood film [15].
Ovalocytes are a type of RBCs that have an abnormal oval shape. The presence of ovalocytes exceeding 5.0-10.0% of the total RBCs is associated with almost all types of anemia or erythrocytosis. They may display elongation and/or a pear shape, but without any blunt or sharp surface protrusions. Occasionally, they can also appear in normal blood smears due to mechanical deformation during preparation, though at a low frequency.
Burr cells have uneven surfaces with several small notches and protrusions. Likewise, no technological substitute currently exists for the visual recognition of burr cells, which tend to increase under conditions of dehydration, such as in cases of renal failure or dehydrated neonates. Alternatively, in situations lacking medical justification, the presence of burr cells may arise from extended drying of smears during the manual staining procedure.
In comparison with prior works, including DL-based approaches and publicly available or locally used RBCs datasets [16-22], none has been created and reviewed directly by senior hematologists at the cell-by-cell level, nor provided comprehensive work to differentiate between true and false/artifact schistocytes/fragmented RBCs, ovalocytes, and teardrop-shaped RBCs, which are classes that lack a technological assistance/confirmation alternative to visual examination. Additionally, none has utilized or created more than 24 K annotated RBCs, which is a fraction of the annotated/labelled cells provided in this work. Furthermore, no study has utilized four integrated cameras alongside microscopes to enrich diversity. Moreover, none has been designed to enable end-to-end automated examination of such clinically significant RBCs morphology/shape classes.

Methods
Sample preparation and imaging. Blood smears were collected with written informed consent, and the participants consented to the open publication of the data. This study was conducted with approval from the independent Research Ethics Committee of the Faculty of Medicine at Zagazig University, independent from the authors of this work, under ZU-IRB#:11225-24-10-2023. The samples were collected and smears were manually prepared and stained using Wright staining within the typical framework of clinical care. The inclusion criteria comprised patients suspected to have primary myelofibrosis (PMF) of the bone marrow, with confirmation based on a blood smear review revealing the presence of true teardrop-shaped RBCs. To ensure classification under the same conditions and collection of samples for every RBCs class from each patient, smears not containing all nine predefined classes were excluded. Based on these inclusion and exclusion criteria, 25 blood smears, each obtained from a different patient, were found eligible for selection.
The smears were categorized into four groups to enable the use of distinct digital cameras integrated with separate standard light microscopes for capturing field images/patches of the smears/slides. The four cameras used were of type LCMOS02000KPB, with a resolution of 1600 × 1200 pixels and a pixel size of 3.2 × 3.2 pixels/μm, manufactured by Nanjing Amada Instruments Co., Ltd in China. Utilizing the 40X microscopic objective lenses across all microscopes, in addition to the fixed 10X visual lenses, resulted in a total magnification power of 400X. Each microscopic field image was captured and used to crop a central rectangular image/patch with a consistent size of 1076 × 535 pixels. This specific size was chosen to align with the dimensions of the large touchscreen displays utilized for data processing. If a cropped image was incomplete due to any mistake, the remaining area was filled with a white background to complete the image without overlapping with the adjacent fields.
The dataset summary for each slide/patient, RBCs class, and camera-microscope source is presented in Table 1. The first camera-microscope was used on slides/patients numbered 1, 5, 6, 8 and 25. The second encompassed slides/patients numbered 2, 3, 4, 7, 9 and 11. The third comprised slides/patients numbered 14, 15, 19, 20, 22, 23 and 24. The fourth contained slides/patients numbered 10, 12, 13, 16, 17, 18 and 21. A simple motorized control unit was used for systematic smear navigation without any field repetition or overlap. The field images/patches obtained from the first and second cameras were found to have the best resolution and staining quality, whereas those obtained from the third exhibited relatively lower staining quality, and those from the fourth showed relatively lower resolution or focus quality. There was a total of 47 K+ field images/patches from the 25 different slides/patients, comprising both suitable and non-suitable patches for RBCs examination. The determining factor for suitability was the presence of 100-300 individual RBCs among a few overlapping cells. Examples of field images/patches from different sources are shown in Fig. 2.

Images segmentation.
The hematologists developed their own semi-automated algorithm for image segmentation, drawing on their combined Hematology and Software Engineering experience (please see Elsafty_Code_1; segment & localize using a pen [10]). This algorithm relied on manually tracing the borders of each cell using a digital pen tool on a large touchscreen display showing field images/patches. This process generated a ground-truth binary semantic segmentation mask and determined the bounding box coordinates (XYWH) for each cell. The cell contours were padded to ensure perfect centering within each image, maintaining a consistent size of 80 × 80 pixels for cropping. This fixed size is crucial to prevent the need for image resizing, as resizing could lead to misclassification of schistocytes/fragmented RBCs. If there was not enough space on the patch for an 80 × 80 pixels image due to the proximity of the cell to the borders, the remaining area was filled with a white background to complete the image. Cells situated along the borders that were truncated by the edge of the patch were excluded, because the hidden section of a cell could prevent its precise identification and lead to misclassification. The algorithm produced three 80 × 80 pixels images for each cell: the generated mask, the cropped image, and the segmented image. Each of these images adheres to a standardized naming convention, starting with the slide/patient number, followed by the patch number, and concluding with the (XYWH) coordinates. By utilizing this semi-automated approach, the hematologists were able to eliminate background noise attached to the cells, caused by staining precipitates that closely resemble the cellular colors, as well as remove any attached WBCs or platelets. Moreover, it allowed for accurate segmentation of cells displaying empty areas due to mechanical stress during the spreading/smearing process or complications during imaging. This prevented the erroneous segmentation of a single cell into two separate schistocytes/fragmented RBCs. Examples of cells with their masks are shown in Fig. 3.
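The cropping geometry described above can be sketched as follows. This is a minimal illustration, not the authors' Elsafty_Code_1: the function name `crop_window`, the exact border test, and the rule that a bounding box larger than 80 × 80 pixels is skipped are assumptions made for the example. It only computes where the 80 × 80 window sits and how much of it must be filled with white.

```python
PATCH_W, PATCH_H = 1076, 535   # field image/patch size reported in the text
CROP = 80                      # fixed crop size reported in the text


def crop_window(x, y, w, h, patch_w=PATCH_W, patch_h=PATCH_H, crop=CROP):
    """Return the crop window centred on a cell, or None if the cell is excluded.

    (x, y, w, h) is the cell's bounding box in patch pixels:
    top-left corner, width, height.
    """
    # Assumption: a box touching the patch edge marks a truncated cell.
    if x <= 0 or y <= 0 or x + w >= patch_w or y + h >= patch_h:
        return None
    # Assumption: a box larger than the crop cannot be centred inside it.
    if w > crop or h > crop:
        return None
    # Top-left corner of the crop window that centres the bounding box.
    left = x + w // 2 - crop // 2
    top = y + h // 2 - crop // 2
    # Width of the strips that fall outside the patch and need white fill.
    pad = {
        "left": max(0, -left),
        "top": max(0, -top),
        "right": max(0, left + crop - patch_w),
        "bottom": max(0, top + crop - patch_h),
    }
    return {"left": left, "top": top, "pad": pad}
```

A cell in the middle of a patch needs no padding, while a cell close to a corner reports how many white pixels complete the 80 × 80 image on each side.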

Images review and labelling.
Each cropped image in the dataset for classification, along with its segmented image, underwent a comprehensive visual assessment by the two certified senior specialists in Hematology. Multiple rounds of comprehensive reviews and corrections of labels and segmentations were conducted until the expected level of high quality was attained. The labelling criteria were crafted to emphasize clinically significant RBCs classes, where visual examination is currently considered exclusive and unassisted by other technological solutions. Identifying ovalocytes through current manual/visual methods is subjective. This inherent subjectivity and absence of automated measures might account for the broad cutoff range (above 5-10%) observed in cases of anemia or erythrocytosis. To address this issue, aspect ratio calculations were utilized for preliminary classification. To calculate the long axes, three different methods were used, and the maximum result was considered. The first method involved rotating each cell mask image to an upright position by applying the rotation angle of the fitted ellipse in the opposite direction; then, the longer dimension of the corresponding upright bounding box was calculated. The second method involved applying a minimum enclosing rectangle to calculate its longer dimension, and the third method involved applying a minimum enclosing circle for the same purpose. For calculating the short axes, the shorter dimensions of the minimum enclosing rectangles were calculated and used. This combined approach was observed to yield more consistent performance when compared with individual methods. Measuring the distance between the two farthest points on the surface could result in an overestimation of the long axis, while relying solely on the minimum enclosing rectangle may lead to an underestimation of the long axis, especially in the case of cells rotated at 45 degrees. This measurement helped to preliminarily distinguish between normal/rounded RBCs, borderline ovalocytes, and ovalocytes (1.0, 1.2, and 1.4 aspect ratios, respectively). The determination of these aspect ratio cutoffs was inspired by a blog discussing diamond measurements and shapes, with a focus on fine details and precision [23]. There is a separate class named "angled cells." This class contained numerous RBCs that exhibited similarities to schistocytes/fragmented RBCs, ovalocytes, and teardrop-shaped RBCs but were in fact false representations of these classes. Identification of overlapping RBCs can be challenging given that upper cells might mask crucial parts of the overlapped cells, leading to potential misclassification of the overlapped cells. Therefore, there is no need to assume or predict the actual types of the overlapping RBCs. They were included in separate classes just to enable the classifiers to differentiate individual cells from them (junk classes).
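The way the three long-axis estimates combine into a preliminary label can be sketched as below. The sketch rests on several assumptions: the function names are hypothetical, the quoted ratios 1.2 and 1.4 are read as lower cutoffs for borderline ovalocytes and ovalocytes, and the measurements themselves are taken as inputs. They could come, for example, from OpenCV's `cv2.fitEllipse`, `cv2.minAreaRect`, and `cv2.minEnclosingCircle`; the text does not state which library was used. Only the first method is written out, given an already fitted ellipse angle.

```python
import math


def upright_long_side(points, angle_deg):
    """Method 1: undo the fitted-ellipse rotation, then measure the upright box.

    points is a list of (x, y) contour points; angle_deg is the rotation angle
    of the fitted ellipse (its sign convention is an assumption here).
    """
    t = math.radians(angle_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    # Rotate every point by -angle so the cell stands upright.
    xs = [x * cos_t + y * sin_t for x, y in points]
    ys = [-x * sin_t + y * cos_t for x, y in points]
    # Longer dimension of the axis-aligned bounding box of the upright cell.
    return max(max(xs) - min(xs), max(ys) - min(ys))


def preliminary_shape(long_axis_estimates, short_axis,
                      borderline_cutoff=1.2, ovalocyte_cutoff=1.4):
    """Combine the long-axis estimates and return (aspect_ratio, label)."""
    # The text keeps the maximum of the three long-axis estimates.
    ratio = max(long_axis_estimates) / short_axis
    if ratio >= ovalocyte_cutoff:
        label = "ovalocyte"
    elif ratio >= borderline_cutoff:
        label = "borderline ovalocyte"
    else:
        label = "normal/rounded"
    return ratio, label
```

The label is only a starting point for the manual review described above; it does not replace the hematologists' decision.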

Data Records
The (Elsafty_RBCs_for_AI) dataset [9] is freely accessible at the Figshare data repository and is systematically structured into 51 root directories. The first root directory (Elsafty_RBCs_for_Classification) consists of three primary folders: "Cropped images," "Masks," and "Segmented images." Within each of these primary folders, there are nine subfolders, one dedicated to each RBCs class, encompassing the following counts of cells: "Angled cells: 24,187", "Borderline ovalocytes: 35,540", "Burr cells: 8,948", "Fragmented RBCs: 7,186", "Ovalocytes: 55,073", "Rounded RBCs: 46,338", "Teardrops: 16,298", "Three-overlapping RBCs: 15,577", and "Two-overlapping RBCs: 31,360". Each of the total 240,507 cells is represented by its own cropped image, mask, and segmented image. Samples for every class were collected from each slide/patient.
Table 1. The total segmented cells in each slide/patient and the tally of each RBCs class within every slide/patient across each camera-microscope source. Samples for every class were collected from each slide/patient.
Each one of the next 25 root directories (Elsafty_RBCs_for_Segmentation_and_Detection_Slide_1-25) also consists of three primary folders: "Cropped images," "Masks," and "Segmented images" corresponding to each slide/patient. There is a total of 1,003,813 segmented cells along with their masks and cropped images; the counts of segmented cells per slide/patient are presented in Table 1. The naming scheme for the cropped image, mask, and segmented image of every cell adheres to a consistent format, starting with the slide/patient number, followed by the unique patch/field number, and concluding with the (XYWH) coordinates on the patch. All these images are stored in the lossless ".PNG" format. Each one of the remaining 25 root directories (Elsafty_RBCs_Slide_1-25) contains the field images/patches of a specific slide/patient. The names of the patches start with the respective slide/patient number, followed by the unique patch/field number. There is a total of 47,363 patches with 1076 × 535 pixels size from the 25 slides/patients; the counts of patches per slide/patient, sorted in ascending order, are as follows:

Technical Validation
The hematologists have developed and used their code to train DL-based image classification models using TensorFlow/Keras (please see Elsafty_Code_2; train & test a DL-based image classifier using Google Colab [10]).
During the training process utilizing a trainable EfficientNetB0 for transfer learning, all the segmented images for each class in their respective folders, sourced from the 25 slides/patients, were divided into six separate parts. One part was allocated for testing, the second for validation, and the remaining four parts for training. There were two options in the code: whether to shuffle the images randomly with a fixed seed before splitting or not. Shuffling ensured that the code split the dataset without allocating images from certain slides/patients to specific subsets. While useful for exploring data consistency, this approach was not reliable for generalizing performance. Conversely, without shuffling, the splitting resulted in better performance generalization because validation and testing were conducted on different cases. The initial learning rate and batch size used during training were 4e-6 and 32, respectively. The Adam optimizer, SparseCategoricalCrossentropy loss, and SparseCategoricalAccuracy metric were used. To prevent overfitting, common augmentation techniques including full rotation range (up to 360 degrees), vertical flipping and horizontal flipping were employed. Color manipulation, rescaling, shearing, shifting, zooming, and resizing were avoided. After developing a model with no shuffling of the dataset before splitting, the evaluation revealed the following results for overall specificity, F1 score, and accuracy: 0.9986, 0.9884, and 0.9974, respectively, indicating data consistency and quality. Subsequently, new synthetic images were generated from the real-world images using extensive color manipulation by randomly overlaying six main colors ((255,0,0), (0,255,0), (0,0,255), (0,255,255), (255,0,255), (255,255,0)) with varying degrees of transparency (alpha) and intensity (beta), ranging from 0.5 to 1.1, before utilizing the masks again to restore the white background. The same model was then evaluated on the new synthetic images. This revealed the following results for overall specificity, F1 score, and accuracy: 0.9833, 0.8667, and 0.9704, respectively. These results indicated the potential usefulness of stain normalizers to reduce performance fluctuations and improve generalizability. Please see Tables 2 and 3 for evaluation details, including the confusion matrix, individual class metrics, and overall performance metrics, where the top portions of these tables correspond to evaluation on real-world images and the bottom portions correspond to evaluation on synthetic color-manipulated images.
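The six-part split with its optional seeded shuffle can be sketched as below. This is an illustration rather than the contents of Elsafty_Code_2. Two assumptions are made: the function name `split_six_parts` is hypothetical, and the first part is taken for testing and the second for validation, following the order in which the text lists them. The split would be applied separately to the file list of each class folder, and the unshuffled variant keeps cases apart only if that list is ordered by slide/patient number.

```python
import random


def split_six_parts(items, shuffle=False, seed=0):
    """Split one class's image list into train, validation, and test subsets.

    One sixth goes to testing, one sixth to validation, and the remaining
    four sixths (plus any remainder) to training.
    """
    items = list(items)
    if shuffle:
        # Fixed seed so that the shuffled split is reproducible.
        random.Random(seed).shuffle(items)
    part = len(items) // 6
    test = items[:part]            # assumption: first part is the test set
    val = items[part:2 * part]     # assumption: second part is validation
    train = items[2 * part:]
    return train, val, test
```

With `shuffle=False` and a slide-ordered list, the test and validation images come from different slides than most of the training images, which matches the generalization argument made above.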
To further investigate the effect of lacking normalization, the entire dataset for segmentation was classified using the same classifier on both the original images and their synthetic color-manipulated versions. Without normalization, there is a potential risk of a false increase in schistocytes/fragmented RBCs and angled cells with a false decrease in teardrop-shaped RBCs, especially when the staining is weak and faint. Please refer to Table 4, where within each result box, the left side corresponds to the evaluation on the real-world images, while the right side corresponds to the evaluation on the synthetic color-manipulated images.
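One possible reading of the synthetic color manipulation used in these experiments is sketched below. The exact blending formula is not given in the text, so the weighted sum `alpha * pixel + beta * color`, clipped to 255, is an assumption, as are the function names. Images are represented as nested lists of RGB tuples to keep the sketch dependency-free.

```python
import random

# The six overlay colors listed in the text.
OVERLAY_COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255),
                  (0, 255, 255), (255, 0, 255), (255, 255, 0)]
WHITE = (255, 255, 255)


def blend(pixel, color, alpha, beta):
    """Weighted sum of a pixel and an overlay color, clipped to 255 (assumed formula)."""
    return tuple(int(min(255.0, alpha * p + beta * c))
                 for p, c in zip(pixel, color))


def recolor_cell(image, mask, rng):
    """Tint a segmented cell image, then restore the white background via the mask.

    image: rows of (R, G, B) tuples; mask: rows of 0/1 values, 1 inside the cell.
    """
    color = rng.choice(OVERLAY_COLORS)
    alpha = rng.uniform(0.5, 1.1)   # range quoted in the text
    beta = rng.uniform(0.5, 1.1)    # range quoted in the text
    return [[blend(p, color, alpha, beta) if m else WHITE
             for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]
```

Passing a seeded `random.Random` instance as `rng` makes the generated variants reproducible.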
To compare the quality of the segmented images and labelling from each of the four camera-microscope sources, four rotating leave-one-out classification experiments were conducted. In these experiments, all images from slides of a rotating source were excluded during training, and the trained model was then tested on these excluded images. For details and results, please see Table 5. Additionally, another 12 classification experiments were conducted using one rotating source for training and one of the remaining sources for testing. The details and results are displayed in Table 6. The findings of these experiments indicated inter-source variations with overall high labelling and segmentation quality.
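The two experiment designs can be enumerated from the slide-to-source assignment given in the Methods. The sketch below only builds the lists of slides for each experiment; the function names are hypothetical, and the training itself is left out.

```python
from itertools import permutations

# Slide/patient numbers per camera-microscope source, as listed in the Methods.
SOURCE_SLIDES = {
    1: [1, 5, 6, 8, 25],
    2: [2, 3, 4, 7, 9, 11],
    3: [14, 15, 19, 20, 22, 23, 24],
    4: [10, 12, 13, 16, 17, 18, 21],
}


def leave_one_source_out(source_slides):
    """Yield (held_out_source, train_slides, test_slides) for each source."""
    for held_out, test_slides in source_slides.items():
        train_slides = sorted(
            slide
            for source, slides in source_slides.items()
            if source != held_out
            for slide in slides
        )
        yield held_out, train_slides, sorted(test_slides)


def single_source_pairs(source_slides):
    """Return every ordered (train_source, test_source) pair of distinct sources."""
    return list(permutations(sorted(source_slides), 2))
```

Four sources give four leave-one-out experiments and 4 × 3 = 12 ordered train/test pairs, matching the counts reported above.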

Usage Notes
Regarding stain normalization, please note that in contrast to histopathology stains, which typically distinguish structures into two colors (red and blue) or three (red, black, and blue), stained blood smears exhibit at least seven significant colors for RBCs, WBCs, and platelets (red, orange, grey, deep purple, violet, light blue, and blue). Additionally, unlike the distinct shapes and sizes of RBCs, the color of RBCs is influenced by the context of the field image/patch and encompasses a range from red, pink, brown, yellow, orange, light violet, to even near bluish hues depending on the staining and imaging quality. However, the RBCs color is the closest to red and furthest from blue among the overall staining colors, while WBCs nuclei tend to exhibit the opposite pattern. While specific normalizers for blood smear stains could be useful, manipulation of RBCs shape and size should be avoided. Normalizers for blood smear stains should be comprehensively assessed in two or more ways. Firstly, through quantitative evaluation of their contribution to the classification and detection performances. Secondly, through qualitative visual inspection for potential artificial errors, such as the discoloration of small blue platelets into red, resulting in misclassification as schistocytes/fragmented RBCs. Conversely, discoloring red schistocytes/fragmented RBCs into blue could also lead to their misclassification as platelets.
In terms of RBCs detection, please note that for all field images/patches selected to generate the dataset for segmentation, the overlapping RBCs occupying bounding boxes larger than 80 × 80 pixels and the RBCs touching the borders of the field images/patches were excluded, because cells truncated by the borders could be misclassified as schistocytes/fragmented RBCs. Please also note that the identification/classification of overlapped cells may not be accurate, as assuming the covered parts is not appropriate. Therefore, please count the individual cells and exclude the overlapping ones. The overlapping cells could be utilized to identify the appropriate field image/patch for examination. The one thousand individual cells required to calculate the percentage of each RBCs class could be collected from three to ten appropriate/suitable field images/patches. Additionally, the maximum percentages of schistocytes/fragmented RBCs and teardrop-shaped RBCs present in each of the appropriate/suitable field images/patches used should be highlighted. Furthermore, applying techniques such as non-maximum size suppression could be essential to avoid misclassification of cellular parts as schistocytes/fragmented RBCs. Moreover, please collect false positive images from the detector to create an "excluded junk class" to be used in classification training. Otherwise, falsely cropped non-RBCs images may be misclassified as RBCs.
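The counting advice above can be sketched as a small reporting helper. The class label strings, the function names, and the choice of the lower bound of each quoted cutoff range as the flagging threshold are assumptions made for this example; angled cells are kept in the denominator because they are individual cells.

```python
from collections import Counter

# Assumed label strings for the two overlapping (junk) classes.
OVERLAPPING = {"two-overlapping", "three-overlapping"}

# Lower bounds of the cutoff ranges quoted in the text, in percent.
CUTOFFS = {"schistocyte": 1.0, "teardrop": 2.0, "ovalocyte": 5.0}


def class_percentages(labels):
    """Percentage of each class among individual (non-overlapping) cells."""
    individual = [label for label in labels if label not in OVERLAPPING]
    total = len(individual)
    if total == 0:
        return {}
    counts = Counter(individual)
    return {label: 100.0 * n / total for label, n in counts.items()}


def flagged_classes(percentages, cutoffs=CUTOFFS):
    """Classes whose percentage exceeds the assumed cutoff."""
    return sorted(label for label, cutoff in cutoffs.items()
                  if percentages.get(label, 0.0) > cutoff)


def max_patch_percentage(labels_per_patch, target):
    """Highest per-patch percentage of one class across the patches used."""
    return max((class_percentages(labels).get(target, 0.0)
                for labels in labels_per_patch), default=0.0)
```

For example, 20 schistocytes among 1,000 individual cells give 2.0%, which exceeds the assumed 1.0% threshold and would be flagged, while any overlapping cells in the same patches are ignored.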

Fig. 1 Examples of cropped RBCs images with perfect cellular centralization within the frame. (a) Fragmented RBCs, (b) Teardrop-shaped RBCs, (c) False Teardrop-shaped RBCs/Angled cells, (d) Ovalocytes, (e) Normal/rounded RBCs, (f) Borderline Ovalocytes, (g) Burr cells, (h) Three-overlapping RBCs, and (i) Two-overlapping RBCs. Samples in each row were obtained from each of the four provided imaging sources, with the top derived from source 1 and the bottom from source 4.

Fig. 2 Examples of field images/patches from the four different imaging sources. (a) from source number one, (b) from source number two, (c) from source number three, and (d) from source number four. The patches obtained from the first and second sources were found to have the best imaging and staining quality, whereas those obtained from the third source exhibited relatively lower staining quality, and those from the fourth source showed relatively lower imaging or focus quality.

Fig. 3 Examples of cells with their corresponding segmentation masks. (a) the cropped RBCs images, (b) the corresponding binary semantic segmentation ground-truth masks, and (c) the segmented RBCs images.

Table 2. The confusion matrix of the model developed using the full dataset for classification. The top portion corresponds to evaluation on real-world images and the bottom portion corresponds to evaluation on synthetic color-manipulated images.

Table 3. The evaluation details include the individual class metrics and overall performance metrics. The top portion corresponds to evaluation on real-world images and the bottom portion corresponds to evaluation on synthetic color-manipulated images.

Table 4. The results of classifying the entire dataset for segmentation. Within each result box, the left side corresponds to the evaluation on the real images, while the right side corresponds to the evaluation on the synthetic color-manipulated images.

Table 5. The details and results of the four leave-one-out classification experiments.

Table 6. The details and results of the 12 one-source-only classification experiments.