Spectral-spatial feature-based neural network method for acute lymphoblastic leukemia cell identification via microscopic hyperspectral imaging technology

Microscopic examination is one of the most common methods for acute lymphoblastic leukemia (ALL) diagnosis. Most traditional methods of automated blood cell identification are based on RGB color or gray images captured by light microscopes. This paper presents an identification method combining both spectral and spatial features to distinguish lymphoblasts from lymphocytes in hyperspectral images. A normalization and encoding method is applied for spectral feature extraction, and the support vector machine-recursive feature elimination (SVM-RFE) algorithm is used for spatial feature determination. A marker-based learning vector quantization (MLVQ) neural network is proposed to perform identification with the integrated features. Experimental results show that this algorithm yields identification accuracy, sensitivity, and specificity of 92.9%, 93.3%, and 92.5%, respectively. Hyperspectral microscopic blood imaging combined with neural network identification has the potential to provide a feasible tool for ALL pre-diagnosis.


Introduction
According to a report from the American Cancer Society, more than 54,000 individuals are diagnosed with leukemia and nearly 24,000 die of it each year in the US [1]. Leukemia is one of the five leading types of cancer in children, young adults, and people over the age of 80. Generally speaking, leukemia is a type of blood cancer that begins in the bone marrow and lymphoma, usually due to uncontrolled growth of hematopoietic cells with genetic mutations [2,3]; a large number of immature leucocytes produced by neoplastic proliferation then spread into the bloodstream. Leukemia is either "acute" or "chronic" based on the pathogeny and disease progression. Acute leukemia, which is more serious, presents with over 20% of blasts in the peripheral blood or bone marrow [3,4]. The French-American-British (FAB) classification of acute leukemia contains two subtypes: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) [5,6]. ALL is characterized by the overproduction and continuous multiplication of malignant lymphoblasts, and its incidence peaks between 2 and 5 years of age [7]. Survival in pediatric acute lymphoblastic leukemia has improved to nearly 90% in trials derived from lymphocyte biological feature detection and pharmacodynamics treatment, as well as improved supportive care [8]. Survival could be further improved, however, and prognoses remain generally poor in infants and adults. Early diagnosis of ALL is of vital importance for timely treatment and recovery.
Microscopy examination of a peripheral blood smear is a common initial diagnostic procedure which involves discriminating mature lymphocytes from immature lymphocytes (lymphoblasts) [9]. Innovative approaches such as flow cytometry, immunophenotyping, and molecular probing can yield precise results with diagnostic accuracy above 90% on a per-patient basis [10], but with regard to cost and capacity, the morphological identification of lymphoblasts in blood smears is still the optimal choice for initial ALL detection [11]. Traditionally, this method is performed manually by a skilled hematologist, which is time-consuming and costly because it requires considerable training and experience. It is also susceptible to non-standard precision due to unavoidable intra-observer variations and sample imperfections [12,13].
Researchers are currently working towards stable substitutes to reduce the heavy workload and costly labor of this diagnosis process. Advancements in hardware and software technology have brought about a number of automated leukocyte identification methods that are low in cost and reliably accurate. Current analyzers show high classification accuracy for normal leukocytes and differential blood count, but this accuracy declines sharply when the system encounters abnormal or malignant leukocytes [14,15]. Automatic abnormal leukocyte (e.g., lymphoblast, promyelocyte, and promonocyte) detection has been proposed to acquire morphological information and to assist hematologists in pre-diagnosis of leukemia. These methods may be threshold-based [16]. Neoh et al. reported a novel clustering algorithm with stimulating discriminant measures (SDM) of both within- and between-cluster scatter variances to produce robust segmentation for the nucleus and cytoplasm of lymphocytes and lymphoblasts [22]. These researchers have demonstrated the feasibility and objectivity of lymphoblast detection from microscopic images using morphology-based methods, but these studies were not without limitations. The 2D images captured by traditional light microscopes only contain spatial information, making the feature extraction of leukocytes complicated and potentially inaccurate. Further, uneven staining and smear thickness induce luminance variances which may lead to changes in the smear images' color or texture, making leukocytes even more difficult to discriminate. There is still demand for new technologies and methods of lymphoblast identification from microscopic images.
As an emerging imaging modality, microscopic hyperspectral imaging technology may provide a new solution to automatic lymphoblast identification. Hyperspectral imaging (HSI) originates from remote sensing and provides an advantageous combination of spectroscopy and 2D imaging which yields images across a wide range of the electromagnetic spectrum [23]. When light is delivered into biological tissue, the scattered, reflected, and transmitted light captured by HSI can be ascribed to inhomogeneity in biological structures of tissues [24]. To this effect, hyperspectral images containing both spectral and spatial information can be applied for blood cell identification and hematology disease diagnosis. For example, G. Sacco Verebes et al. analyzed the spectral signatures of blood cell components with enhanced darkfield microscopy and aimed at building up spectral libraries to distinguish active from inactive cells [25]. Q. Li et al. proposed an algorithm to identify red blood cells by integrating active contour models and automated two-dimensional k-means with a spectral angle mapper algorithm [26]. These studies demonstrated the potential effectiveness of combining spectral and spatial information provided by hyperspectral imaging systems for blood cell analysis. However, there have been few studies on the automatic identification of lymphoblasts from hyperspectral images for ALL pre-diagnosis.
The purpose of this study was to establish a new method of confirming the presence or absence of lymphoblasts in blood samples to assist early ALL diagnosis. First, hyperspectral lymphocyte images of peripheral blood smear (PBS) samples were captured by a homemade acousto-optic tunable filter (AOTF) based molecular hyperspectral imaging (MHSI) system. These hyperspectral images, containing both spectral and spatial information, can provide significant features for the discrimination of lymphoblasts and lymphocytes. Normalization and binary coded decimal (BCD) coding were then applied for spectral analysis. The SVM-RFE algorithm was established to determine the most significant spatial features. Finally, a marker-based LVQ (MLVQ) neural network was designed as the classifier to integrate the spectral and spatial information efficiently and complete the identification procedure.

Hyperspectral blood image data
Hyperspectral imaging was originally defined in the remote sensing field as a combination of conventional imaging and spectroscopy methods to obtain both the spatial and spectral information of targets. To adapt the microscopic HSI system to blood smear detection, previous researchers have built homemade staring imaging mode molecular hyperspectral imaging (MHSI) systems [26]. Our MHSI system operates in the spectral range of 550-1000 nm with 2-5 nm spectral resolution. When a blood smear is prepared on the stage, the software embedded in the matched computer monitors and captures the hyperspectral blood images. Each band of the hyperspectral blood image consists of 1280 × 1024 pixels × 12 bits/pixel, which is stored in the band sequential (BSQ) file format.
As shown in Fig. 1, the hyperspectral image cube contains three dimensions: the line dimension, sample dimension, and wavelength dimension. As opposed to pixels in 2D images with single gray values, each pixel in the hyperspectral cube is presented as an N-dimensional spectrum vector reflected in the wavelength dimension. This spectrum vector contains rich pixel information that can be viewed as the spectrum feature of the specific material in the pixel. The vector shows increased homogeneity within the same material and increased heterogeneity among different materials, making various materials highly distinguishable. Hyperspectral blood images containing both the spatial and spectral features of blood cells represent a promising technique for specific blood cell analysis.
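The cube layout described above can be illustrated with a short indexing sketch. It assumes the 1280 × 1024 image is organized as 1280 samples by 1024 lines and uses the 70 bands reported in the data acquisition section; the function names `bsq_index` and `spectrum_indices` are hypothetical and only show how a band sequential (BSQ) file orders its values.

```python
# Assumed cube geometry: 1280 samples (columns) x 1024 lines (rows), 70 bands.
SAMPLES, LINES, BANDS = 1280, 1024, 70


def bsq_index(band, line, sample, lines=LINES, samples=SAMPLES):
    """Flat index of one value in a band sequential (BSQ) cube.

    In BSQ order, every pixel of band 0 is stored first, then band 1, and so on.
    """
    return (band * lines + line) * samples + sample


def spectrum_indices(line, sample, bands=BANDS):
    """Flat indices of the spectrum vector for a single pixel."""
    return [bsq_index(b, line, sample) for b in range(bands)]


if __name__ == "__main__":
    idx = spectrum_indices(line=2, sample=3)
    print(len(idx), idx[:3])
```

Consecutive values of one pixel's spectrum are a full band apart in the file, which is why per-pixel spectral analysis on BSQ data usually reads the cube band by band.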

Image preprocessing
A hyperspectral image directly obtained from an MHSI system is generally referred to as a "raw image". These images are captured under the influence of the emission spectra of the illumination sources, the transmission of the optics in the microscope, and the detection sensitivity of the charge coupled device (CCD) camera. To eliminate these effects and ensure that the real characteristic spectra of blood cells are acquired, a calibration process is needed prior to spectral analysis.
In the proposed setup, a white reference image is first captured by the MHSI system, which records the reflectance of a blank thin glass slide dyed by Giemsa-stain as the control sample. The raw hyperspectral blood images are then captured under the same conditions. Finally, the calibration coefficient is calculated from the white reference image by Eq. (1), and the calibrated blood image is retrieved using this coefficient, where $K_n$ is the calibration coefficient of each band $n$.
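The calibration step can be sketched as follows. The exact form of Eq. (1) is not reproduced here, so this sketch assumes a common flat-field style correction in which the coefficient of each band is the reciprocal of the mean white-reference intensity in that band; the function names are hypothetical.

```python
def band_coefficients(white_bands):
    """One coefficient per band, assumed here to be 1 / mean(white reference)."""
    coeffs = []
    for band in white_bands:
        pixels = [value for row in band for value in row]
        coeffs.append(1.0 / (sum(pixels) / len(pixels)))
    return coeffs


def calibrate(raw_bands, coeffs):
    """Scale every pixel of band n by that band's coefficient."""
    return [
        [[value * k for value in row] for row in band]
        for band, k in zip(raw_bands, coeffs)
    ]


if __name__ == "__main__":
    # Two tiny 2x2 bands: the second band is illuminated half as strongly.
    white = [[[2000, 2000], [2000, 2000]], [[1000, 1000], [1000, 1000]]]
    raw = [[[1000, 2000], [500, 0]], [[500, 1000], [250, 0]]]
    print(calibrate(raw, band_coefficients(white)))
```

In this toy input the two bands differ only in illumination strength, so after calibration they carry the same relative values, which is the effect the white reference is meant to achieve.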

Spectral analysis: normalization and encoding
The purpose of the spectral analysis is to explore methods for representation and storage of the extracted informative spectral features for further identification. As described in Fig. 1, every pixel in the hyperspectral blood image contains an N-dimensional spectrum vector, representing the spectrum features of the material in this pixel. The spectrum is so large, however, that at least 12 bits are needed to store one pixel. The computational cost of comparison among various spectra is very high. For the sake of computational simplicity, we normalized values measured on different scales to a notionally common 0-1 scale for spectral analysis. Lymphocytes, lymphoblasts, and red blood cells (RBCs) are all blood cells, so their common molecular elements give their spectra similar distributions; this means their features remained distinct after normalization without loss of any important information.
Encoding was applied to further reduce the computational burden in terms of storage. The natural binary-coded decimal (NBCD) is a class of binary encodings of decimal numbers where each 0 to 9 decimal digit is represented by four bits. It allows for the accurate representation and rounding of decimal quantities as well as simple binary operation rules. After normalization and encoding, the value of each pixel per band was reduced from 12 bits to four.
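A minimal sketch of the normalization and encoding step follows. It assumes min-max scaling to the 0-1 range and keeps a single decimal digit per band, which is consistent with the reduction from 12 bits to four but is not confirmed as the authors' exact procedure; the function names are hypothetical.

```python
def normalize(spectrum):
    """Min-max scale a spectrum to the 0-1 range (assumed normalization)."""
    lo, hi = min(spectrum), max(spectrum)
    if hi == lo:
        return [0.0] * len(spectrum)
    return [(value - lo) / (hi - lo) for value in spectrum]


def to_bcd_digit(x):
    """Quantize a value in [0, 1] to one decimal digit and return its 4-bit BCD code."""
    digit = min(9, int(x * 10))  # 1.0 maps to 9 so the digit fits in one BCD nibble
    return format(digit, "04b")


def encode_spectrum(spectrum):
    """Normalize a raw 12-bit spectrum and encode each band as a 4-bit BCD digit."""
    return [to_bcd_digit(x) for x in normalize(spectrum)]


if __name__ == "__main__":
    print(encode_spectrum([0, 2048, 4095]))
```

Each band then occupies one nibble, so two bands fit in a byte instead of one band occupying a 12-bit value.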

Spatial analysis: feature selection
In traditional lymphocyte and lymphoblast identification methods, dozens or even hundreds of features must be considered to ensure sufficient identification accuracy [22]. Both spectral and spatial features can be extracted from a hyperspectral blood image, so the dimensions of the spatial features can be reduced substantially. The goal of spatial analysis is to select the most characteristic spatial features integrated with the spectral features to facilitate accurate blood cell identification.
S. Mohapatra gave a detailed description of 44 shape, color, and texture features for lymphocyte and lymphoblast detection [20], 30 of which were incorporated into the identification process. We would assert that these 30 spatial features still contain redundancies and are not all suitable for hyperspectral images, as some of them are based on RGB images. Moreover, from a hematologist's perspective, compound features are more useful than single features in lymphoblast identification: the nucleus/cytoplasm ratio is superior to nucleus area and cytoplasm area, for example, because a single feature is less stable and less robust.
In this study, we built a recursive feature elimination (RFE) algorithm as a greedy optimization for identifying the best-performing subset of features [27]. The RFE was designed to repeatedly construct a selection model and choose either the best- or worst-performing feature, set the feature aside, then repeat the process with the remaining features. The most popular version of this algorithm uses a support vector machine (SVM-RFE) as the selection model to eliminate features. SVM is embedded to determine the weights of features in the training stage, whereas for this nonlinear feature selection problem, the radial basis function (RBF) kernel trains and tests low-degree polynomial data mappings via linear SVM [28]. Cross validation serves as the evaluation function to rank features in each iteration. A total of five spatial features were selected by the SVM-RFE algorithm in this study: mean, variance, nucleus perimeter, nucleus/cytoplasm ratio, and entropy. These features fell into three intrinsically different measures: descriptive statistics measures, contrast measures, and orderliness measures. For spatial feature extraction, principal component analysis (PCA) is first used to map the blood images onto a vector space, reducing the dimension while retaining most of the spatial information. After the PCA transform, a single-band map containing spatial information is generated. Meanwhile, the marker-competitive layer of the proposed method outputs a marker map containing the segmented lymphoblast or lymphocyte. Then, the Otsu [3] algorithm combines these two maps to segment the cells into nucleus and cytoplasm. Finally, the five spatial features are calculated from these results.
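The elimination loop can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes a linear SVM trained by subgradient descent on the hinge loss and ranks features by squared weight, whereas the study describes an RBF kernel with cross validation as the evaluation function. All function names, hyperparameters, and the toy data are hypothetical.

```python
def train_linear_svm(X, y, epochs=100, lam=0.01, lr=0.1):
    """Primal linear SVM via subgradient descent on the hinge loss; y in {-1, +1}."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # margin violated: regularize and move toward the sample
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:  # margin satisfied: regularization shrinkage only
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def svm_rfe(X, y, n_keep):
    """Repeatedly drop the feature with the smallest squared SVM weight."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        subset = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(subset, y)
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


# Toy data: feature 0 is strongly informative, feature 1 is uninformative,
# feature 2 is weakly informative.
y_toy = [1 if i % 2 == 0 else -1 for i in range(8)]
X_toy = [
    [2.0 * label, 0.01 if i % 4 < 2 else -0.01, 0.5 * label]
    for i, label in enumerate(y_toy)
]

if __name__ == "__main__":
    print(svm_rfe(X_toy, y_toy, n_keep=1))
```

Ranking by squared weight is the classic linear SVM-RFE criterion; with a nonlinear kernel the ranking criterion has to be computed from the kernel matrix instead, which this sketch does not attempt.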

Marker-based neural network classification
Artificial neural networks (ANNs) are commonly used in image classification with various structures including back-propagation, Hopfield, radial-basis function, and adaptive resonance theory. In view of the large scale of hyperspectral blood images, the learning vector quantization (LVQ) classifier performs better with fewer parameters and a simpler structure. It also combines the advantages of supervised learning and competitive learning systems, ensuring fast convergence and high fault-tolerance. Nevertheless, when typical LVQ is applied to spectral-spatial based blood cell identification, its accuracy may be restricted because it works under the assumption that spectral and spatial information have independent contributions to the classification results. This assumption makes the compounded spectral and spatial classifier a simple linear superposition, which may lead to inadequate learning and low accuracy. It is necessary to modify the formulation of the original LVQ to explore the inner connection between spectral and spatial information in hyperspectral blood images. However, existing techniques for doing so mainly focus on faster convergence, input dimension scaling, and decision mechanism adaption [29]. A marker-based LVQ (MLVQ) neural network is proposed in this study which defines a marker regulation for the determination of competitive layers to make full use of the spectral and spatial information.
The topological structure of MLVQ includes three layers: an input layer, a marker-competitive layer, and an output layer. The number of input neurons equals the number of input spectral and spatial features. The input layer is fully connected to the marker-competitive layer by alterable weights, whereas the marker-competitive and output layers are only partially connected by fixed weights. The number of output neurons equals the number of desired blood cell types. The MLVQ classifier determines the number of neurons in the marker-competitive layer based on the number of selected markers. The MLVQ learning process has three parts: connection establishment, marker-competitive neuron determination, and weight updating.

Connection establishment
In the proposed technique, an unsupervised clustering algorithm, the self-organizing map (SOM), is used to form a preliminary clustering map to analyze the characteristics of input spectral and spatial features among different blood cells. The SOM uses a "winner-take-all" strategy to integrate inputs into the robust cluster [30]; the "winner" is the neuron whose weight vector has the minimum distance from the input vector. In the MLVQ classifier, the winner is assigned to the maximal weight. The weight is updated in each iteration through the competitive learning rule (i.e., weight updating). This process establishes the connection between the input layer and the marker-competitive layer for the subsequent marker selection. For an N-dimensional input vector $X = [X_1, X_2, \ldots, X_N]$, the winner neuron $C_m$ in the marker-competitive layer is determined by Eq. (2), where $W_k$ is the alterable weight between the input vector $X$ and the $k$th neuron in the marker-competitive layer, $M$ is the number of clusters created by the SOM, and the Euclidean distance is used for similarity calculation.
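The winner selection described above can be sketched as a nearest-weight search. The sketch assumes the standard winner-take-all form, $C_m = \arg\min_{1 \le k \le M} \lVert X - W_k \rVert$, which matches the verbal description but is not copied from Eq. (2); the function names are hypothetical.

```python
import math


def euclidean(x, w):
    """Euclidean distance between an input vector and a weight vector."""
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))


def winner(x, weights):
    """Index of the competitive neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda k: euclidean(x, weights[k]))


if __name__ == "__main__":
    weights = [[0.0, 0.0], [0.9, 0.1], [5.0, 5.0]]
    print(winner([1.0, 0.0], weights))
```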

Marker-competitive neuron determination
The clustering map is generated by the compound features after the spectral and spatial information are input for unsupervised clustering. If a cluster contains a large set of spatially connected pixels, the cluster is integrated with strongly reliable and relevant information and must contain a marker. Conversely, a cluster containing a small number of pixels is assumed to have weaker information and to exclude the marker. In the MLVQ algorithm, the clusters are first separated by class, where $C_k^j(h)$ refers to the $k$th class map containing the $j$th cluster, $L$ is the total number of clusters, and $M(k)$ is the selected marker of the $k$th class map. After the erosion process, the small-scale cluster is eliminated with no marker selected, whereas the marker is chosen from the remaining cluster. The non-marker cluster is merged into the adjacent cluster, as its characteristic information is insignificant in the final classification. The number of marker-competitive neurons is determined once the merging converges on a fixed number.
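The role of erosion in marker selection can be sketched on a binary cluster mask. This is a minimal illustration assuming a 3 × 3 structuring element and one erosion pass; the structuring element, the number of passes, and the function names `erode` and `has_marker` are assumptions rather than details taken from the study.

```python
def erode(mask):
    """3x3 binary erosion; pixels outside the image count as background."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            if all(mask[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)):
                out[r][c] = 1
    return out


def has_marker(mask, iterations=1):
    """A cluster keeps a marker only if some pixel survives the erosion."""
    for _ in range(iterations):
        mask = erode(mask)
    return any(any(row) for row in mask)


def blank(size=7):
    """Helper: empty square mask."""
    return [[0] * size for _ in range(size)]


if __name__ == "__main__":
    large = blank()
    for r in range(2, 5):
        for c in range(2, 5):
            large[r][c] = 1  # 3x3 block of connected pixels
    small = blank()
    small[3][3] = 1  # isolated pixel
    print(has_marker(large), has_marker(small))
```

A cluster that fails this check would then be merged into its neighbor, as described above.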

Weight updating
As the MLVQ algorithm is a supervised classifier, the alterable weight $W_k$ is updated iteratively by supervised learning rules. If the output class differs from the training data, the weight $W_k$ is weakened by the rule described in Eq. (5); otherwise the weight $W_k$ is strengthened by Eq. (6), where $t$ is the iteration time and $\mu(t)$ is the learning rate.
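One update step can be sketched as follows, assuming the standard LVQ1 rule $W_k(t+1) = W_k(t) \pm \mu(t)\,[X - W_k(t)]$, with the plus sign when the winner's class matches the training label and the minus sign otherwise. This matches the description above but is not copied from Eqs. (5) and (6); the function name and the fixed learning rate are hypothetical.

```python
def lvq_update(x, weights, neuron_labels, target, lr):
    """Apply one LVQ1 step in place and return the index of the winning neuron."""
    # Winner: competitive neuron with the smallest squared Euclidean distance to x.
    k = min(
        range(len(weights)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, weights[i])),
    )
    # Strengthen (move toward x) on a correct match, weaken (move away) otherwise.
    sign = 1.0 if neuron_labels[k] == target else -1.0
    weights[k] = [w + sign * lr * (a - w) for a, w in zip(x, weights[k])]
    return k


if __name__ == "__main__":
    weights = [[0.0, 0.0], [1.0, 1.0]]
    labels = ["lymphocyte", "lymphoblast"]
    lvq_update([0.2, 0.0], weights, labels, target="lymphocyte", lr=0.5)
    print(weights)
```

In practice the learning rate would decay with the iteration count, as the notation $\mu(t)$ suggests.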

Data acquisition and preprocessing
Clinically, ALL is pre-diagnosed based on the presence or absence of lymphoblasts in PBS samples. Lymphoblasts should be distinguished from lymphocytes as accurately as possible in blood samples to provide a credible diagnostic basis for hematologists. For the purposes of this study, peripheral blood was collected from ALL patients and healthy subjects; patients included children, adolescents, and adults between 7 and 65 years of age who had been clinically examined at the Department of Hematology, Ruijin Hospital, Shanghai, China. A total of 16 patients who were advised to undergo peripheral blood and/or bone marrow examinations were clinically diagnosed with ALL. As a control, a total of 24 samples (16 out of 27 ALL patients and 8 normal samples without clinical history of leukemia) were also obtained from patients undergoing routine differential blood counts. PBS were prepared from these samples accordingly. Anticoagulant was first added to the samples to keep them from congealing, then a drop of blood approximately 2 mm in diameter was used for each PBS preparation. A good PBS is one in which the blood spreads evenly with no breakage or overlapping. The PBS was stained with Giemsa (10% Giemsa stain and 90% phosphate buffer saline) from Baso Diagnostics, Inc., Zhuhai, in a Sysmex SP-10 machine provided by the Department of Hematology, Ruijin Hospital, Shanghai, China. When the prepared PBS was placed on the stage, the homemade MHSI system was used for hyperspectral blood image acquisition. One hundred and thirty-five stained lymphoblast images from 27 patients diagnosed with ALL and 120 stained lymphocyte images from 24 control subjects were obtained by the hyperspectral imaging system. The captured image data contained 70 bands with 1280 × 1024 pixels × 12 bits/pixel per band stored in BSQ format. The data were calibrated using the calibration coefficient presented in Section 2.
The typical spectra of average transmittance extracted from ROIs of lymphoblasts, lymphocytes, and RBCs are shown in Fig. 2(a) in the wavelength range of 550-1000 nm. Distinct spectral signatures are evident among the different cell types in these spectra. Figure 2(b) shows the BCD coding of the three blood cells' spectra, where the most informative characteristics were retained and stored in only 4 bits/pixel per band instead of 12 bits. Because the hyperspectral image contains the reflectance spectrum for each kind of material, 15 spectra were extracted from 135 lymphoblast cells as shown in Fig. 2(c); different cells from the same kind of lymphoblast showed the same spectral distribution. Similarly, 15 spectra from 120 lymphocyte cells and RBCs are shown in Fig. 2(d) and 2(e). Taken together, Figure 2 indicates that in the collected spectra, different types of cells have different spectral signatures and that cells of the same type have similar spectral distributions.

Identification results based on different data sets
After the preprocessing of hyperspectral blood images, spectral and spatial features were extracted accordingly and applied to the proposed MLVQ identification measure. Several tests were conducted based on the confusion matrix shown in Table 1 to evaluate the performance of different feature sets. We compared the identification results with the criteria provided by a hematologist. Generally, true positive (TP) and true negative (TN) indicate correct identification of a lymphoblast and a lymphocyte, respectively; false positive (FP) indicates that a lymphocyte was identified as a lymphoblast, and false negative (FN) that a lymphoblast was identified as a lymphocyte. Accuracy, specificity, and sensitivity performance measures were calculated as follows:

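$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$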
where accuracy is defined as the ratio of the number of cells that are identified correctly to the total number of cells irrespective of the cell type. Sensitivity and specificity describe the proportion of correctly identified lymphoblasts and lymphocytes, respectively. Theoretically, the two measures, sensitivity and specificity, seem equally important. In practice, however, hematologists tend to be more concerned with sensitivity in the identification of ALL. In a healthy human's peripheral blood smear, the number of lymphoblasts is no more than one or two, so even a slight increase may be serious. If an identification method has low sensitivity (i.e., some lymphoblasts are not identified promptly), it is possible that ALL will expand rapidly into the bloodstream and vital organs if left untreated. Therefore, sensitivity in the identification method is of crucial importance for the early diagnosis of ALL. We first input the hyperspectral data with BCD-encoded spectral wavelengths (70 bands) for identification. The resulting accuracy, sensitivity, and specificity were 87.1%, 88.9%, and 85%, respectively (Table 2). The reasonable accuracy suggests a strong correlation between the blood cell spectra and lymphocyte identification. In other words, the proposed technique seems promising for the pre-diagnosis of ALL via MHSI.
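As a worked check of these definitions, the sketch below computes the three measures from a confusion matrix. The counts are not taken from the paper's tables: they are inferred, as an assumption, from the reported spectral-only percentages together with the 135 lymphoblast and 120 lymphocyte images.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # share of lymphoblasts correctly identified
    specificity = tn / (tn + fp)  # share of lymphocytes correctly identified
    return accuracy, sensitivity, specificity


# Assumed counts: 120 of 135 lymphoblasts and 102 of 120 lymphocytes correct.
acc, sen, spe = metrics(tp=120, tn=102, fp=18, fn=15)

if __name__ == "__main__":
    print(f"accuracy={acc:.1%} sensitivity={sen:.1%} specificity={spe:.1%}")
```

With these assumed counts the rounded values agree with the 87.1%, 88.9%, and 85% reported for the spectral features alone.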
We ran a second experiment based on the five spatial features selected by the SVM-RFE algorithm for the sake of comparison against traditional identification methods, which consider image features. Table 2 shows that the performance was inferior to that of the spectral bands. There was lower accuracy (82.4%) and lower sensitivity (82.2%) but markedly higher specificity (85%), suggesting that spatial features contain important information for lymphocyte identification.
Both the spectral and spatial features performed well, so we ran a third experiment based on a combination thereof which we expected to produce optimal identification results. The original hyperspectral blood cell images were processed by calibration and normalization to generate spectral features and by SVM-RFE algorithm to select spatial features. The BCD coded spectral features and five spatial features comprised the input layer of the MLVQ network. After a 100-fold iterative training, the optimal performance was obtained as recorded in Table 2, with the accuracy, sensitivity, and specificity of 92.9%, 93.3%, and 92.5%, respectively. These results indicated that combined spectral and spatial features convey highly useful information for lymphoblast and lymphocyte identification.

Visualization of ALL pre-diagnosis
Per the evaluation results of the various data sets discussed above, integrating optimal spectral signatures with selected spatial signatures as the input layer of the MLVQ network yields optimal identification accuracy. The MHSI system can be used to visualize the lymphoblast and lymphocyte identification results (Fig. 3) to assist hematologists in pre-diagnosing ALL reliably. Hematologists tend to be well accustomed to light microscopy images through experience, so we ensured that light microscopy hyperspectral images with a 100 × immersion oil objective lens were captured by the MHSI system; these images can be easily and intuitively reviewed by hematologists. Traditional identification results generated by applying an unsupervised K-means method to traditional light microscopy images were also obtained for comparison. Specifically, two target clusters were set before running K-means, which then used Euclidean distance to group similar pixels into two classes. In Figs. 3(a) and 3(c), there is one lymphocyte in the upper part and one lymphoblast in the center of the image. An identification map was generated where the lymphocyte is colored in green and the lymphoblast in red. As a control, Figs. 3(e) and 3(g) contain one lymphoblast and the corresponding mapping is marked in red (Fig. 3(h)). Figures 3(m) and 3(o) contain one lymphocyte and the corresponding mapping is marked in green (Fig. 3(p)).
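The K-means baseline mentioned above can be sketched as a two-class clustering of pixel color vectors by Euclidean distance. The initialization (smallest and largest pixel in tuple order), the iteration count, and the function names are assumptions made for a deterministic illustration, not details from the study.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two pixel vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_two_classes(pixels, iterations=20):
    """Cluster pixel tuples into two classes; returns (labels, centroids)."""
    # Deterministic start (assumption): smallest and largest pixel in tuple order.
    centroids = [list(min(pixels)), list(max(pixels))]
    labels = [0] * len(pixels)
    for _ in range(iterations):
        # Assignment step: each pixel joins its nearest centroid.
        labels = [
            0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            for p in pixels
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in (0, 1):
            members = [p for p, lab in zip(pixels, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    rgb = [(10, 12, 11), (12, 11, 10), (200, 190, 210), (205, 195, 200)]
    print(kmeans_two_classes(rgb))
```

Because the clustering sees only color values, pixels from different cell types with similar staining fall into the same class, which is consistent with the misidentifications discussed for the traditional light images.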
Traditional light images do not allow the viewer to readily distinguish lymphocytes from lymphoblasts, and even allow some red blood cells to be misidentified (Figs. 3(b), 3(f)). The traditional method requires that several parameters be calculated to identify different types of blood cells; these tend to yield poor identification results, as the spatial features provided by traditional light images are not sufficient for discrimination between lymphoblasts and lymphocytes. The proposed method, as described above, inputs a combination of spectral and spatial features into the neural network system for training. This combination yields more accurate results compared to the traditional separation of all types of blood cells.

Conclusion
Early diagnosis of ALL is of vital importance for timely treatment and recovery. Microscopy examination of PBS is one of the most commonly used pre-diagnostic procedures and involves discrimination between lymphoblasts and lymphocytes. Morphological information is the most important criterion for lymphoblast identification. Existing automatic identification methods based on blood images captured by traditional light microscopes typically take spatial features as inputs, but inhomogeneous staining and non-uniform sample thickness tend to yield poor identification results. This paper proposes an MHSI system for lymphoblast and lymphocyte identification based on a combination of spectral and spatial information. In the proposed setup, spatial features are first determined by the support vector machine-recursive feature elimination (SVM-RFE) algorithm. A marker-based LVQ (MLVQ) neural network is then used to define a marker regulation to determine the competitive layer, making full use of both spectral and spatial information. The encoded spectral features and five spatial features comprise the input layer. Experimental results showed that the combined spectral and spatial features yield optimal performance, with accuracy, sensitivity, and specificity up to 92.9%, 93.3%, and 92.5%, respectively. Although the performance of the proposed system is reasonable, we concentrated only on the per-cell identification of lymphoblasts in this study; this relates solely to the feasibility of hyperspectral imaging on this one issue. In the future, we plan to explore the system's accuracy on a per-patient basis to provide more reliable evidence for ALL pre-diagnosis. This will also allow us to conduct a comparison with molecular biology-based methods and to investigate the diagnostic and clinical efficacy of hyperspectral imaging technology. It is also worth noting that because our samples were Giemsa-stained blood smears, additional control samples are needed for comparison.
Hyperspectral imaging technology may be applicable for capturing unstained cells or tissues and identifying them according to their specific spectral features. In the future, we plan to explore new methods to identify unstained leukocytes. Moreover, ALL has three subtypes, L1, L2, and L3, which may be classifiable according to lymphoblast type. We also plan to attempt classification of lymphoblasts into these three subtypes via nucleus and cytoplasm segmentation to provide even more accurate diagnosis information.

Funding
National Natural Science Foundation of China (61377107); Science and Technology Commission of Shanghai Municipality (14DZ2260800).