Blood cancer diagnosis using ensemble learning based on a random subspace method in laser-induced breakdown spectroscopy

Abstract: There are two main challenges in the diagnosis of blood cancer. The first is to distinguish cancer patients from healthy controls, and the second is to identify the type of blood cancer. Chemometric methods combined with laser-induced breakdown spectroscopy (LIBS) can be used for cancer detection. However, because of their simple structure, chemometric methods are easily influenced by spectral feature redundancy and noise, resulting in a low accuracy rate. We propose an approach using LIBS combined with ensemble learning based on the random subspace method (RSM). The serum samples were dripped onto a boric acid substrate for LIBS spectrum collection. The complete blood cancer sample set includes leukemia [acute myeloid leukemia (AML) and chronic myelogenous leukemia (CML)], multiple myeloma (MM), and lymphoma. The results showed that the accuracy rates using k nearest neighbors (kNN) alone and linear discriminant analysis (LDA) alone were 88.14% and 94.45%, respectively, while using RSM with LDA (RSM-LDA) improved the average accuracy rate from 94.45% to 98.34%. Furthermore, the variable importance of the spectral lines (Na, K, Mg, Ca, H, O, N, C-N) was evaluated by the RSM-LDA model, which can improve the recognition of blood cancer types. Compared with LDA alone, the RSM-LDA model improved the average accuracy rate for cancer type identification from 80.4% to 91.0%. These results demonstrate that LIBS combined with the RSM-LDA model can discriminate blood cancer from healthy controls and can also recognize the types of blood cancer.


Introduction
Blood cancer is a form of cancer that attacks the blood, bone marrow, or lymphatic system. Blood cancer detection remains a significant challenge in clinical medicine [1][2][3]. The three most common types of blood cancer are leukemia, multiple myeloma (MM), and malignant lymphoma [4]. Leukemia is the most common hematological malignancy [5]; acute myeloid leukemia (AML) and chronic myelogenous leukemia (CML) are two of its types. Lymphoma is a cancer that mainly affects the human lymphatic and hematopoietic systems [6]. MM is a malignant disease in which clonal plasma cells proliferate abnormally [7]. The standard diagnostic method is pathological biopsy of the bone marrow [8][9][10]. Although the accuracy rate (the ratio of the number of correctly classified samples to the total number of samples) of pathological biopsy is high, its disadvantages, such as complex operation, high cost, and the need for trained professionals, restrict its application. Other blood cancer diagnostic methods include positron emission computed tomography [11], magnetic resonance imaging [12,13], and tumor marker detection [14,15]. However, imaging techniques are not suitable for detecting blood cancer at an early stage because the image features are not yet obvious. Hence, it is important to find a fast, economical, and robust technology for blood cancer identification.
Laser-induced breakdown spectroscopy (LIBS), a proven optical emission spectroscopy (OES) analytical technique, has advantages that include rapid detection, simple pretreatment, and real-time analysis [6][16][17][18]. LIBS has been applied to bacterial identification [19,20], geological exploration [21], and industrial monitoring [22,23]. In recent years, LIBS has been employed in tissue imaging and cancer detection for clinical medicine [24][25][26], and it has also been investigated for the diagnosis of tumors. The most common method is to diagnose tumors based on the difference in element content between normal and tumor tissue. Han et al. discriminated cutaneous melanoma from the surrounding skin using LIBS combined with principal component analysis (PCA) and linear discriminant analysis (LDA) [27]. The sensitivity and specificity of the diagnosis were over 95%, but a drawback of tissue detection is the need for surgical sampling. The other method is to diagnose cancer by comparing the element differences between the sera of cancer patients and healthy controls. Chen et al. distinguished lymphoma and multiple myeloma patients from healthy controls by analyzing their sera using LIBS combined with chemometric methods [28]. Their work indicated that the k nearest neighbors (kNN) model had the best performance, with a discrimination accuracy rate of 96%. Tameze et al. calculated the differences between LIBS images of blood from mice with ovarian cancer and from healthy controls [29]. Their work showed that in the positive specimens the plasma state is richer and the intensity is augmented. The studies above have made significant contributions to tumor detection using LIBS. However, few studies [6,28] on the diagnosis of blood cancer types using LIBS have been reported.
In particular, accuracy rates are limited when a single chemometric method such as LDA or kNN is combined with LIBS, because identification models with a simple structure are easily influenced by spectral feature redundancy and noise. Therefore, developing a more robust identification model is an important step toward rapid, efficient, and stable blood cancer detection.
In this study, to overcome the drawbacks of a single model combined with LIBS, we propose an approach that combines LIBS with ensemble learning based on the random subspace method (RSM), such as RSM-LDA. The complete blood cancer sample set includes leukemia (AML and CML), MM, and lymphoma. The variable importance (VI) of the selected lines was calculated to evaluate the importance of each element. The RSM-LDA model was then used to identify the type of blood cancer. A more stringent evaluation index was also proposed for model evaluation.

Experimental setup
In the experimental setup, a laser beam from a Q-switched Nd:YAG laser (Quantel, France, Brilliant B; wavelength: 532 nm; pulse energy: 30 mJ; repetition rate: 10 Hz; pulse width: 8 ns) was directed via a reflector and a focusing mirror (focal length: 150 mm) onto the sample surface. The plasma emission entered the spectrometer via a light collector (Ocean Optics, 84-UV-25; wavelength range: 200-2000 nm) and a UV-enhanced optical fiber with a 50 µm core. The LIBS signal was recorded by an echelle spectrometer (Andor Tech., United Kingdom, Mechelle 5000; resolution: λ/∆λ = 5000; spectral range: 200-950 nm) coupled with an intensified charge-coupled device (ICCD) (Andor Tech., United Kingdom, iStar DH-334T). To obtain high spectral intensity and a high signal-to-noise ratio (SNR), the gate delay and gate width of the ICCD were set to 1 and 9 µs, respectively. The diameter of the laser spot was about 100 µm, and the experiment was conducted in air.

Sample pretreatment
Serum samples were collected from six healthy controls, six AML patients, six CML patients, six MM patients, and eight lymphoma patients at the Institute of Hematology and Blood Diseases Hospital. The blood cancer patients had previously been diagnosed by bone marrow biopsy and pathological examination of the biopsies, which are the gold standard for the diagnosis of blood cancer. The liquid serum sample was transformed into solid form, which prevents the liquid splashing that leads to spectral instability. The pretreatment takes about 15 minutes, and the detailed steps are described below:
(1) Serum samples were obtained by centrifugation of the blood samples, which lasted less than 10 minutes. The centrifugation can be performed on a batch of serum samples.
(2) Pellets with a diameter of 40 mm were obtained by pressing boric acid powder (purchased from Sinopharm Chemical Reagent Co., Ltd) at 20 MPa. Then 50 µL of serum was dripped onto the pellet using a pipette.
(3) Each serum-loaded pellet was air-dried for less than 5 minutes.
The fluctuation of the spectral lines caused by the coffee ring effect leads to a low accuracy rate for blood cancer diagnosis. To improve the stability of the spectra, the laser beam scanned the sample surface over a rectangular area that covered the whole "coffee ring". All pulses in the rectangular area were then accumulated to obtain one LIBS spectrum.

Algorithm description
The LDA model can be used to find a linear combination of features that characterizes or separates two or more classes of objects or events [30]. Because it is simple and fast, LDA is widely applied in LIBS-based tumor detection [6,27,28].
The RSM-LDA model, which combines LDA models through the RSM, is an ensemble learning method [31]. An ensemble learning model can obtain better predictive performance than any of its constituent weak classifiers alone. The commonly used weak classifiers are LDA and kNN, which are also widely used in LIBS for cancer diagnosis [6,27,28]. In our work, we adopted the RSM to improve the accuracy rates of LDA and kNN. The RSM performs the following steps [31]:
(1) Choose m spectral lines randomly from the whole set of spectral lines.
(2) Train a weak learner using the m spectral lines.
(3) Repeat steps (1) and (2) until there are n weak learners. To predict, average the score predictions of the weak learners and assign the category with the highest average score.
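The three RSM steps can be illustrated with a minimal Python sketch. It is not the implementation used in this work: to stay dependency-free, the weak learner is a nearest-centroid classifier standing in for LDA, the function names (`fit_rsm`, `predict_rsm`, and the helpers) are hypothetical, and the toy intensities are synthetic rather than measured spectra.

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def train_weak_learner(X, y, feature_idx):
    """Step (2): nearest-centroid learner on the chosen features.
    Assumption: this is a simple stand-in for LDA, not LDA itself."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(r[j] for r in rows) for j in feature_idx]
    return feature_idx, centroids

def learner_scores(learner, x):
    """Score each class as the negative squared distance to its centroid."""
    feature_idx, centroids = learner
    return {
        label: -sum((x[j] - c) ** 2 for j, c in zip(feature_idx, centroid))
        for label, centroid in centroids.items()
    }

def fit_rsm(X, y, m, n, seed=0):
    """Steps (1) and (3): draw m random features and train, n times over."""
    rng = random.Random(seed)
    n_features = len(X[0])
    return [
        train_weak_learner(X, y, rng.sample(range(n_features), m))
        for _ in range(n)
    ]

def predict_rsm(ensemble, x):
    """Average the class scores over all learners; return the top class."""
    all_scores = [learner_scores(learner, x) for learner in ensemble]
    labels = all_scores[0].keys()
    avg = {label: mean(s[label] for s in all_scores) for label in labels}
    return max(avg, key=avg.get)

# Synthetic toy data: 6 "line intensities" per spectrum (not from the paper).
X = [[1.0, 2.0, 1.0, 0.5, 3.0, 1.0],
     [1.1, 2.1, 0.9, 0.6, 3.1, 1.1],
     [3.0, 0.5, 2.5, 2.0, 1.0, 3.0],
     [3.1, 0.4, 2.6, 2.1, 0.9, 3.1]]
y = ["HC", "HC", "cancer", "cancer"]

ensemble = fit_rsm(X, y, m=3, n=25)
print(predict_rsm(ensemble, [2.9, 0.6, 2.4, 1.9, 1.1, 2.9]))  # cancer
```

Because each learner sees only m of the available lines, a noisy or redundant line affects only the learners that happened to draw it, which is the intuition behind the robustness of the ensemble.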

Spectral analysis
With the experimental setup described above and the optimal spectrum acquisition parameters, the observed elements included magnesium (Mg), sodium (Na), potassium (K), oxygen (O), nitrogen (N), and calcium (Ca), together with C-N molecular bands. The LIBS spectra of the five kinds of serum samples and of the substrate, in the range from 200 to 850 nm, are shown in Fig. 1.
The difference spectra between the sample and the substrate were used to eliminate the effect of the substrate. From the whole spectrum, 23 lines were selected for the discrimination analysis, including lines of Ca, Na, K, Mg, N, and O and C-N molecular bands. The details of the selected lines are listed in Table 1. For each serum sample, ten spectra were obtained, so 320 spectra in total were collected for discrimination analysis. For each kind of blood cancer, the cancer spectra and the healthy control spectra were randomly divided into a training set and a test set with the ratio of 1:
The accuracy rates of the kNN model are listed in Table 2. There are two main reasons for the poor performance of kNN classification. One is that the kNN model is easily influenced by spectral feature redundancy and noise [16,32]. The other is that its dependence on the training set samples leads to a low accuracy rate on the test set. In addition, the kNN model is affected by sample imbalance: healthy control samples are easier to collect than tumor samples, which can cause sample imbalance. Therefore, a more powerful identification model is needed for the discriminant analysis of blood cancer.

Random subspace method for LDA and kNN promotion
The RSM can improve the performance of a classifier and yield a higher classification accuracy rate. RSM-LDA and RSM with kNN (RSM-KNN) are based on LDA and kNN classifiers, respectively. The specificity and sensitivity of the RSM-LDA model and the RSM-KNN model for AML, CML, MM, and lymphoma are shown in Figs. 3(a)-3(b), respectively. The RSM-LDA model had the best performance among the four kinds of classifiers (LDA, kNN, RSM-LDA, and RSM-KNN). For AML, CML, MM, and lymphoma, the accuracy rates of the RSM-LDA model are listed in Table 2.
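For reference, sensitivity, specificity, and accuracy rate for a cancer vs healthy control task can be computed from predicted labels as in the following sketch. The function name `binary_metrics` and the label strings are assumptions made for illustration, and the labels below are synthetic, not results from this work.

```python
def binary_metrics(y_true, y_pred, positive="cancer"):
    """Sensitivity, specificity, and accuracy for a two-class task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),    # cancer spectra correctly flagged
        "specificity": tn / (tn + fp),    # healthy spectra correctly cleared
        "accuracy": (tp + tn) / len(pairs),
    }

# Synthetic labels: one of five cancer spectra is missed.
y_true = ["cancer"] * 5 + ["HC"] * 5
y_pred = ["cancer"] * 4 + ["HC"] + ["HC"] * 5

metrics = binary_metrics(y_true, y_pred)
print(metrics)  # {'sensitivity': 0.8, 'specificity': 1.0, 'accuracy': 0.9}
```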

Detectable rate evaluation
In medical diagnosis, misdiagnosis has serious consequences. When the determination coefficient is close to the threshold, misidentification happens easily. We therefore propose a more effective and suitable evaluation index for ensemble learning methods. When more than 80% of the weak learners give the correct result, the classification is considered valid for that sample. The ratio of the number of validly classified samples to the total number of samples is used as the evaluation index, which is called the detectable rate. The detectable rate can be calculated by the following formula:

$$\text{Detectable rate} = \frac{1}{S}\sum_{i=1}^{S} \mathbf{1}\!\left(\frac{R_i}{N} > 80\%\right) \times 100\% \qquad (1)$$

where S is the number of samples, N is the number of weak learners, R_i is the number of weak learners that classify sample i correctly, and 1(·) equals 1 when its condition holds and 0 otherwise. The threshold was set as 80% in this work. For the RSM-LDA model, the optimal number of weak learners was 100, and the optimal number of features for each learner was 12. The voting results for AML vs healthy control, CML vs healthy control, MM vs healthy control, and lymphoma vs healthy control using RSM-LDA with random sampling are shown in Fig. 5. When the randomly sampled votes sit farther away from the center, . The average detectable rates of AML, CML, MM, and lymphoma were 93.33%, 89.17%, 92.94%, and 89.86%, respectively. For the RSM-LDA model, the detectable rate is about 90% for the four kinds of blood cancer, which also indicates strong generalization performance.
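The detectable rate described above can be sketched in a few lines of Python. The function name `detectable_rate` and the vote counts are hypothetical; the only inputs taken from the text are the definition itself (a sample counts as validly classified when more than 80% of the weak learners are correct) and the use of 100 weak learners.

```python
def detectable_rate(correct_votes, n_learners, threshold=0.8):
    """Percentage of samples for which MORE than `threshold` of the
    weak learners voted for the correct class."""
    valid = sum(1 for r in correct_votes if r / n_learners > threshold)
    return 100.0 * valid / len(correct_votes)

# Hypothetical counts of correct votes (R_i) for five samples, N = 100.
votes = [100, 95, 81, 80, 60]
print(detectable_rate(votes, n_learners=100))  # 60.0
```

Note that the comparison is strict: the sample with exactly 80 correct votes is not counted as valid, so three of the five samples pass. A sample that would still be counted as correct under majority voting (60 of 100 votes) lowers the detectable rate, which is what makes this index stricter than the accuracy rate.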

Variable importance (VI) analysis for selected element
To analyze the differences and commonalities among the selected spectral lines, the VI of each selected line was calculated to evaluate the importance of the corresponding element. The VI of all selected lines in the RSM-LDA models used to classify AML vs healthy control, CML vs healthy control, MM vs healthy control, and lymphoma vs healthy control is shown in Figs. 6(a)-6(c), 6(d)-6(f), 6(g)-6(i), and 6(j)-6(l), respectively. For these four types of blood cancer, the metal elements are more important than the non-metal elements and molecular bands. The selected lines whose VI is higher than the average VI are ranked in Table 3 for the four kinds of blood cancer. The VI analysis indicates that the spectral lines have different VI values in the four classification tasks, which further suggests that the four types of blood cancer samples can be distinguished from one another.
In summary, the RSM-LDA model has the highest average accuracy rate and AUC, which means it has the best classification performance. Because the spectral lines have different VI values for different cancers, the RSM-LDA model can be further used to identify blood cancer types. Blood cancer type identification is a multi-class classification problem, and the accuracy rate and the detectable rate can also be used to evaluate multi-class models.

Cancer serum type identification
The types of blood cancer were further classified after discrimination between cancer sera and healthy controls. LDA, kNN, RSM-LDA, and RSM-KNN were used to classify the type of blood cancer. The accuracy rates for the four kinds of blood cancer are shown in Fig. 7(a). The average accuracy rates of LDA, kNN, RSM-LDA, and RSM-KNN for blood cancer identification were 80.4%, 75.0%, 91.0%, and 70.8%, respectively. The RSM-LDA model has the best classification performance. The confusion matrix of RSM-LDA for AML, CML, MM, and lymphoma is shown in Fig. 7(b). As can be seen from the figure, the MM samples have the lowest recognition rate: 5% and 8% of the MM samples were incorrectly identified as AML and CML, respectively. The results demonstrate that only LIBS combined with RSM-LDA can identify the four types of blood cancer with an average accuracy rate over 90%. For the identification of the four kinds of blood cancer spectra, the accuracy rate of the kNN model is lower than 80%.
To analyze the similarity of the blood cancer spectra and calculate the average detectable rates, the votes of each weak classifier were counted. The voting results for AML, CML, MM, and lymphoma are shown in Figs. 8(a)-8(d). For the AML result shown in Fig. 8(a), the X axis represents the AML sample index, and the Y axis represents the number of votes for AML, CML, MM, and lymphoma. AML has the largest area in Fig. 8(a), which means good classification performance for AML samples. The area of MM is larger than those of CML and lymphoma, which means the AML and MM samples are highly similar at the spectral level, so an AML sample is easily misrecognized as MM. Based on the voting results and Eq. (1), when the detection threshold was set as 80%, the average detectable rates of AML, CML, MM, and lymphoma were 93.33%, 86.67%, 90.00%, and 93.75%, respectively. The results indicate that the RSM-LDA model has great potential to identify blood cancer types.
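A row-normalized confusion matrix of the kind shown in Fig. 7(b) can be built from true and predicted labels as sketched below. The function name `confusion_percentages` is hypothetical, and the labels are synthetic examples chosen for easy arithmetic; they do not reproduce the percentages reported in this work.

```python
from collections import Counter

def confusion_percentages(y_true, y_pred, labels):
    """For each true class, the percentage of its samples assigned
    to every predicted class (each row sums to 100)."""
    counts = Counter(zip(y_true, y_pred))
    table = {}
    for t in labels:
        total = sum(counts[(t, p)] for p in labels)
        table[t] = {p: 100.0 * counts[(t, p)] / total for p in labels}
    return table

labels = ["AML", "CML", "MM", "lymphoma"]

# Synthetic example: four spectra per class.
y_true = ["AML"] * 4 + ["CML"] * 4 + ["MM"] * 4 + ["lymphoma"] * 4
y_pred = (["AML", "AML", "AML", "MM"]
          + ["CML"] * 4
          + ["MM", "MM", "AML", "CML"]
          + ["lymphoma"] * 4)

table = confusion_percentages(y_true, y_pred, labels)
print(table["MM"])
# {'AML': 25.0, 'CML': 25.0, 'MM': 50.0, 'lymphoma': 0.0}
```

Reading along a row shows where the spectra of one cancer type end up, which is how statements such as "a given share of MM samples were identified as CML" are obtained.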

Conclusions
The aim of this work was to diagnose blood cancer from serum and identify the cancer types using the RSM-LDA model combined with LIBS. The complete blood cancer sample set includes leukemia (AML and CML), MM, and lymphoma. Compared with LDA and kNN, the RSM-LDA model has the highest average accuracy rate and AUC, which means the RSM-LDA model has the best classification performance. With the RSM-LDA model, the average accuracy rates for AML vs healthy control (HC), CML vs HC, MM vs HC, and lymphoma vs HC improved from 94.33%, 94.49%, 94.61%, and 94.38% to 98.77%, 96.54%, 98.78%, and 96.62%, respectively. For cancerous sample classification using the RSM-LDA model, the detectable rate was proposed to evaluate the classification performance, and the average detectable rates of AML vs HC, CML vs HC, MM vs HC, and lymphoma vs HC were 93.33%, 89.17%, 92.94%, and 89.86%, respectively. Furthermore, the variable importance of the selected lines was calculated by the RSM-LDA model. For blood cancer type identification, the average accuracy rate was improved to 91.00%, with 8% of MM spectra and 6% of lymphoma spectra misidentified as CML, and the detectable rates of AML, CML, MM, and lymphoma were 93.33%, 86.67%, 90.00%, and 93.75%, respectively, which means the RSM-LDA model can improve the diagnostic performance. The results showed that the proposed method can be a practical tool for rapid preliminary screening of blood cancer. Therefore, the RSM-LDA model is an effective pattern recognition method for LIBS analysis in blood cancer discrimination.

Funding
Huazhong University of Science and Technology (2020kfyXGYJ105).

Disclosures
The authors declare that there are no conflicts of interest related to this article.