Importance evaluation of spectral lines in laser-induced breakdown spectroscopy for classification of pathogenic bacteria

Abstract: The correct classification of pathogenic bacteria is important for clinical diagnosis and treatment. Compared with using whole spectral data, using feature lines as the inputs of the classification model can improve the correct classification rate (CCR) and reduce the analysis time. To select feature lines, the contribution of each spectral line to the CCR must be investigated. In this paper, two algorithms, importance weights based on principal component analysis (IW-PCA) and random forests (RF), are proposed to evaluate the importance of spectral lines. The laser-induced breakdown spectroscopy (LIBS) spectra of six common clinical pathogenic bacteria were measured, and a support vector machine (SVM) classifier was used to classify them. In the proposed IW-PCA algorithm, the product of the loading of each line and the variance of the corresponding principal component (PC) was calculated. The maximum product of each line calculated from the first three PCs was used to represent the line's importance weight. In the RF algorithm, the Gini index reduction value of each line was taken as the line's importance weight. The experimental results demonstrated that the lines with high importance were more suitable for classification and can be chosen as feature lines. The optimal number of feature lines used in the SVM classifier can be determined by comparing the CCRs obtained with different numbers of feature lines. Importance weights evaluated by RF are more suitable than those evaluated by IW-PCA for extracting feature lines when LIBS is combined with an SVM classification mechanism. Furthermore, the two methods mutually verified the importance of the selected lines, and the lines evaluated as important by both IW-PCA and RF contributed more to the CCR.


Introduction
In the clinical field, the diagnosis of many diseases and the determination of their development stages depend on the detection of the corresponding bacteria and microorganisms [1]. Bacterial resistance has become increasingly prevalent because specific pathogens cannot be identified in time and treated with the corresponding specific antibiotics [2][3][4]. Meanwhile, rapid and reliable analysis of pathogen specimens in hospital settings can also help prevent cross-infection among patients [5,6]. Therefore, the rapid and accurate classification and identification of bacteria is important for choosing appropriate preventive measures and targeted medicine in a timely manner.
Existing traditional identification methods have some limitations. For instance, morphological identification is time-consuming and labor-intensive, and suffers from unstable phenotypes and low sensitivity [7]. Immunodiagnostic technology and DNA-based detection methods cannot identify a pathogen without the corresponding antibody or molecular chain. Meanwhile, cross-reactions with unrelated species are common, and identification based on sequencing is laborious, time-consuming and costly [8,9]. Some new techniques such as matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS) [10], rapid antimicrobial susceptibility testing (AST) [11], multiplex polymerase chain reaction (multiplex PCR) [12] and fluorescent indicator technology [13] have also been used in clinical settings to rapidly determine the type of bacteria and other microbial pathogens. However, because these instruments are expensive, the number of suitably equipped hospitals is limited, so these techniques are not available to many patients. Moreover, although these non-in situ testing methods may generate results faster, the results still need time to be delivered from the laboratory to patients and doctors. It therefore remains a challenge to develop a cost-effective, accurate, rapid and easy-to-use method for bacterial discrimination.
As a new elemental analysis technology, LIBS has been used to identify medical and biological samples [14,15]. Combined with chemometric algorithms, it can reach high accuracy in the classification of clinical samples [16]. LIBS is a rapid, real-time, in situ technique that detects multiple elements simultaneously without the need for sample preparation [17]. In LIBS analysis, a laser pulse is locally coupled into the sample material, and a plasma is generated as the material evaporates. As the plasma cools, element-specific radiation is emitted and detected by a spectrometer [18]. The wavelengths and intensities of these spectral lines represent the types and concentrations of the corresponding elements [19][20][21].
In the field of bacteria identification in particular, R. A. Multari et al. concluded that LIBS, in combination with appropriately constructed chemometric models, could be used to classify Escherichia coli and Staphylococcus aureus [22]. D. Marcos-Martinez et al. used LIBS combined with neural networks (NNs) to identify Pseudomonas aeruginosa, Escherichia coli and Salmonella typhimurium and reached a certainty of over 95% [23]. Recently, D. Prochazka et al. combined laser-induced breakdown spectroscopy and Raman spectroscopy for multivariate classification of bacteria [24]. Although all six kinds of bacteria could be classified correctly with the merged data, only three kinds could be classified with LIBS data alone.
In the above experiments, the whole spectrum or a broad spectral range was selected in order to cover all spectral characteristics of the samples. However, although the whole spectrum contains the most abundant spectral information, much of that information is irrelevant for classification [25,26]. Meanwhile, the complexity of data processing is closely related to the amount of spectral data [27]. Therefore, it is necessary to extract feature lines from the whole spectrum.
Usually, spectral ranges or lines of interest are selected manually based on prior knowledge and the theoretical composition of the sample [28,29]. Using the intensities of 13 emission lines from 5 different elements (P, C, Mg, Ca, and Na), S. J. Rehse et al. characterized a mixture of two bacteria. Mixed samples with a mixing ratio higher than 80:20 could be identified accurately based on discriminant function analysis (DFA) [30]. However, manual selection requires operators with a wealth of relevant knowledge and experience, and there is no guarantee that lines corresponding to the theoretical composition reflect the differences among samples.
Recently, machine learning methods have been proposed to extract spectral features from both LIBS and other spectra objectively and efficiently [31][32][33]. W. Li and J. Du used a decision tree algorithm to choose features from hyperspectral data as candidate attributes for vegetable classification [34]. Evelyn Vor et al. used the PCA algorithm for LIBS feature extraction in the identification of alloys [35]. However, these extraction methods only select feature lines; they cannot evaluate whether the selected lines are appropriate for classification.
In this paper, we define an importance weight for each line to evaluate the contribution of that line to the classification result, and propose two methods, importance weights based on principal component analysis (IW-PCA) and random forests (RF), to evaluate the importance weights of lines. We selected the lines with high importance weights as feature lines. Furthermore, the effect of the number of feature lines on the classification result was analyzed. Six kinds of common pathogenic bacteria were chosen as samples. The LIBS spectra of these samples were measured and divided into a training set and a testing set. According to the evaluated importance weights, different numbers of lines were extracted from the training set as features. Using these features as input variables and the bacteria type labels as output variables, an SVM classifier was established to describe the mapping between them. The classifier was then used to classify the testing set spectra. We investigated which evaluation method performed better and how many feature lines are suitable in LIBS-SVM by comparing their influence on the final classification accuracy. The results demonstrate that evaluating the importance weights of lines is of practical importance for extracting features for a LIBS-SVM classifier.

LIBS experimental measuring setup
A schematic of the experimental LIBS setup is illustrated in Fig. 1. A flash-pumped Q-switched Nd:YAG laser (λ = 1064 nm, repetition frequency 1 Hz, pulse duration 5 ns, beam diameter ∅6 mm, energy 64 mJ/pulse) was used to excite the sample surface. The laser beam was redirected by three plane mirrors and finally focused on the sample surface by a convex lens with a focal length of 100 mm. The plasma radiation was focused into an optical fiber (∅600 μm) through a lens with a focal length of 36 mm. The outlet of the optical fiber was connected to a two-channel spectrometer (AvaSpec 2048-2-USB2, Avantes). The spectral data collected by the spectrometer covered a range of 190 nm to 1100 nm with a resolution of 0.2 to 0.3 nm. The external trigger used in the system consisted of a photodetector and a digital delay generator (SRS-DG535, Stanford Research Systems). When the photodetector detected the plasma radiation signal, the spectrometer was triggered by the DG535 after a preset delay time. The spectral acquisition delay time was set to 1.28 μs to reduce Bremsstrahlung radiation. The integration time of the CCD was 2 ms.

Bacteria sample preparation
In this work, six kinds of common pathogenic bacteria were chosen as samples, including two kinds of Staphylococcus (Staphylococcus aureus 26068, Staphylococcus aureus 26003), three kinds of Escherichia coli (Escherichia coli TG1, Escherichia coli JM109, Escherichia coli 44113), and one kind of Bacillus (Bacillus cereus 63301), provided by the Research Institute of Chemical Defense, 102205, Beijing, China. The cultured bacteria samples were smeared evenly on slides, forming 20 × 40 mm² thin layers with a thickness of 20 μm. A three-dimensional motorized stage was used to adjust the focus position of the laser on the samples. For every bacterial sample, 400 spectra were collected, each at a fresh position.

Spectral data preprocessing
The 400 spectra collected for each type of sample were divided into a training part and a testing part, containing 300 and 100 spectra, respectively. Because of fluctuations in laser energy and variations in the flatness and uniformity of the sample surfaces, the collected spectral data also fluctuated. Therefore, the spectra in each part were averaged separately. In each part, every spectrum was treated as a vector in a multidimensional space, and the cosine of the angle between each spectrum and the average spectrum was calculated. The larger the cosine value, the more similar the spectrum is to the average spectrum. Ranking the cosine values from high to low, we kept the 75% of spectra most similar to the average in each part, that is, 225 in the training part and 75 in the testing part. In this way, outliers were removed from the data. Then, in each part, every three spectra were averaged to further reduce the data fluctuation. Finally, for each type of sample, the training set has 75 spectra and the testing set has 25 spectra.
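The outlier filtering and averaging described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the function name `filter_and_average`, the 2048-channel spectra, and the choice to average groups of three in similarity-rank order are assumptions, since the text does not state how the groups of three were formed.

```python
import numpy as np

def filter_and_average(spectra, keep_fraction=0.75, group_size=3):
    """Keep the spectra most similar to the mean, then average them in groups.

    spectra: array of shape (n_spectra, n_channels) for one sample and one part.
    """
    spectra = np.asarray(spectra, dtype=float)
    mean_spectrum = spectra.mean(axis=0)

    # Cosine of the angle between each spectrum and the mean spectrum.
    norms = np.linalg.norm(spectra, axis=1) * np.linalg.norm(mean_spectrum)
    cosines = spectra @ mean_spectrum / norms

    # Keep the top `keep_fraction` of spectra, ranked by cosine from high to low.
    n_keep = int(round(keep_fraction * len(spectra)))
    kept = spectra[np.argsort(cosines)[::-1][:n_keep]]

    # Average consecutive groups of `group_size` spectra.
    # Assumption: groups are taken in similarity-rank order.
    n_groups = n_keep // group_size
    kept = kept[: n_groups * group_size]
    return kept.reshape(n_groups, group_size, -1).mean(axis=1)


# Synthetic stand-in for the 300 training spectra of one bacterium.
rng = np.random.default_rng(0)
base = rng.random(2048)
train = base * rng.normal(1.0, 0.05, size=(300, 1)) + rng.normal(0, 0.02, size=(300, 2048))

train_clean = filter_and_average(train)
print(train_clean.shape)  # (75, 2048)
```

With 300 input spectra the function keeps 225 and returns 75 averaged spectra; with 100 it returns 25, matching the set sizes given in the text.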
The spectra obtained from an empty slide and the six samples are shown in Fig. 2. In the spectrum of the empty slide, many lines related to 8 elements (Si, Ca, Na, Fe, Mg, O, H, N) can be seen. For the six samples, several obvious spectral lines related to the CN band and 8 elements (C, Ca, Fe, Na, H, N, K, and O) can be found. Compared with the spectrum of the empty slide, the spectra of the samples were obviously different. Among the sample slides, the spectra of Staphylococcus aureus 26068 and 26003 look very similar. In contrast, although TG1, JM109 and 44113 all belong to Escherichia coli, their spectra differ at some lines, such as the potassium and calcium lines. Despite these differences in the intensities of specific lines, the spectra were too similar to distinguish by eye. All the obvious lines (85 lines in this case) were selected, and their areas were calculated to represent the line intensities, as listed in Table 1. We then extracted feature lines from the training set using IW-PCA and random forests and established SVM classification models for each. Finally, the correct classification rate (CCR, the ratio of the number of correct classifications to the total number) of the models was tested using the testing set.

Importance evaluation using importance weights based on principal component analysis (IW-PCA)
Principal component analysis (PCA) is an unsupervised learning method. By projecting the data into a lower-dimensional subspace through a linear transformation, it transforms the raw data into a set of linearly uncorrelated components [36,37], and it is commonly used to reduce the dimensionality of high-dimensional data [32,35]. In PCA, the variance of each principal component (PC) represents the proportion of the original information it retains. The first principal component describes the spatial direction with the largest variance [37]. Every spectral line has a corresponding loading in each principal component. PCA has been used to select LIBS feature lines [35]: in each chosen PC, the lines with high loadings were selected as features. We propose that PCA can also be used to evaluate the importance of spectral lines, on the premise that a line with a high loading is more important. Normally, the first PC alone cannot reflect enough spectral information, so further PCs are selected sequentially to increase the cumulative variance. We used the loadings from the first several PCs, with a cumulative variance over 95%, to evaluate the importance of lines. However, because the variance described by each PC is different and the loadings of the same line in different PCs are also different, loadings alone are not enough to evaluate the importance weights of lines. Therefore, the variances of the principal components and the loadings of the lines should be combined to evaluate the importance of each line.
In the proposed IW-PCA evaluation method, after projecting the spectral data, the first several PCs whose cumulative variance exceeded 95% were used in the analysis. In each PC, every line has a loading representing its importance in that PC; a higher loading indicates a more important line. Since the variance represents the importance of a PC, for each line the product of its loading and the variance of the corresponding PC was calculated. The maximum product calculated over the first several PCs was defined as the importance weight of that line.
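A minimal sketch of the IW-PCA weight defined above, under stated assumptions, is given here. The text does not specify whether signed or absolute loadings were used, whether the data were scaled, or whether "variance" means the raw eigenvalue or the explained-variance ratio. This sketch assumes absolute values of unit-norm loadings, mean-centering only, and the explained-variance ratio; the function name `iw_pca` and the synthetic data are hypothetical.

```python
import numpy as np

def iw_pca(X, cum_var_threshold=0.95):
    """Importance weight per line: max over retained PCs of |loading| * variance ratio.

    X: array of shape (n_spectra, n_lines) holding line intensities.
    Returns (weights, n_pcs), where n_pcs is the number of PCs retained.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each line intensity

    # PCA via SVD: each row of Vt is one PC direction (unit-norm loadings).
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = s**2 / np.sum(s**2)              # explained-variance ratio per PC

    # Smallest number of leading PCs whose cumulative variance exceeds the threshold.
    cum = np.cumsum(var_ratio)
    n_pcs = min(int(np.searchsorted(cum, cum_var_threshold, side="right")) + 1, len(s))

    # Product of |loading| and PC variance, then the maximum over retained PCs.
    products = np.abs(Vt[:n_pcs]) * var_ratio[:n_pcs, None]
    return products.max(axis=0), n_pcs


# Synthetic stand-in: 450 training spectra, 85 lines, one line made dominant.
rng = np.random.default_rng(0)
X = rng.normal(size=(450, 85))
X[:, 10] *= 20.0

weights, n_pcs = iw_pca(X)
ranking = np.argsort(weights)[::-1]              # line indices, most important first
print(n_pcs, ranking[:5])
```

Because the ranking depends only on the relative size of the products, using raw eigenvalues instead of variance ratios would give the same order of lines.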
The training set data was used to build the PCA model. The variances of the first five PCs are shown in Fig. 3. For each PC, the cumulative variance is labeled above the point.

Importance evaluation using random forests (RF)
Random forests (RF) is based on an ensemble of decision tree classifiers [38]. For a given input, each decision tree classifier votes to determine the final classification result. As in any classification model based on decision trees, each node is split according to the difference between samples at a certain spectral line during the growth of each tree [39,40]. If two bacterial samples differ greatly in intensity at a certain line, or if one of the bacteria lacks this line altogether, the two bacteria can be distinguished directly at this node. The number of training spectra (denoted N) was 75 for each sample and 450 in total. One spectrum was drawn at random, with replacement, from each sample, and this was repeated 75 times, so that 450 spectra were chosen to build a new data set. Based on this new set, a CART binary decision tree model was established. This process is called bootstrap sampling (self-boosting) [41] and was used to reduce the correlation between the training sets of the decision trees. All trees were built in this way, so each tree classifier used its own training set constructed by bootstrap sampling. Each tree was grown to its maximum size, until no further splits were possible. When N is large enough, it follows from Eq. (1),

$$\lim_{N \to \infty} \left( 1 - \frac{1}{N} \right)^{N} = \frac{1}{e} \approx 0.368, \qquad (1)$$

that about a third of the initial spectra are not chosen even once by bootstrap sampling. This part of the spectral data is called out-of-bag data and can be used as an unknown set to test the model. In practice, because the spectral data set is not large enough, the size of the out-of-bag data remains uncertain. Nevertheless, bootstrap sampling also reduces the possibility of over-fitting.
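The sampling with replacement described above (called self-boosting in the text, more commonly known as bootstrap sampling) and the resulting out-of-bag fraction can be illustrated with a short sketch. The function names and the per-class resampling scheme below are one reading of the description, namely 75 draws with replacement from each of the six samples, and should be taken as an assumption.

```python
import numpy as np

def oob_fraction(n):
    """Probability that a given spectrum is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n

def stratified_bootstrap(labels, rng):
    """For each class, draw with replacement as many spectra as the class contains."""
    labels = np.asarray(labels)
    picks = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        picks.append(rng.choice(members, size=members.size, replace=True))
    return np.concatenate(picks)


labels = np.repeat(np.arange(6), 75)             # 6 bacteria x 75 training spectra
rng = np.random.default_rng(0)

in_bag = stratified_bootstrap(labels, rng)       # indices used to grow one tree
out_of_bag = np.setdiff1d(np.arange(labels.size), in_bag)

print(round(oob_fraction(75), 4))                # 0.3654
print(out_of_bag.size / labels.size)             # observed fraction for this draw
```

For N = 75 the expected out-of-bag fraction is already close to the large-N limit 1/e, while the fraction observed for any single tree varies from draw to draw.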
In the classification problem, assuming there are K classes and the probability that a sample point belongs to class k is $p_k$, the Gini index of the probability distribution is defined as

$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2 .$$

For a given sample set D, the Gini index is calculated by

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2 ,$$

where D is the whole collection of the six kinds of spectral data in this case, K is six, $C_k$ is the subset of samples in D belonging to class k, and $|D|$ and $|C_k|$ are the sizes of D and $C_k$, respectively. If the sample collection D is divided into two parts, $D_1$ and $D_2$, according to the classification basis A, then under the condition of classification basis A the Gini index is defined as

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2),$$

where Gini(D) describes the uncertainty of set D, and Gini(D, A) describes the uncertainty after division by the classification basis A. For each tree, when the classification basis A is used to divide the whole set, the Gini index reduction is calculated as

$$\Delta\mathrm{Gini}_j(A) = \mathrm{Gini}(D) - \mathrm{Gini}_j(D, A),$$

where j indexes the results calculated from different splits on the same basis. A classification basis with a large Gini index reduction value produces a large decrease in the uncertainty of the sample set and can be considered an important basis. In this case, each spectral line can be regarded as a classification basis A.

Therefore, the random forests classification model has […] der by the decrea […] to low. When […] in order to get Δ […] Δ […] calcu […] forests model […] specific lines completely w […] According […] appropriate l […] language, RF […] RandomFores […] (1) The impo […] (2) The p […] equa […] (3) The d […] (4) The m […] The most […] in Table 3.

As shown in Fig. 4 and Fig. 5, the two evaluation methods assigned different importance weights to the lines, and consequently the importance order of the lines differed. It can be concluded that the top 5 feature lines all belong to the main emission lines of Na and K. For Na, the empty slide also showed clear lines, but their intensities differed from those of the bacteria samples.
For K, there was a significant difference between the slides and the samples. This indicates that the Na and K lines are characteristic of pathogenic bacteria and are more effective for bacterial classification than the lines of other elements. This coincides with the biological prior knowledge that Na⁺ and K⁺ are two important cations in living cells and play an important role in controlling the intracellular and extracellular balance. Although the bacteria contained large amounts of C, H, O, and N, the environment also contained large amounts of these elements, because the LIBS experiment was carried out in air at standard atmospheric pressure. When the laser interacted with the samples, the surrounding air particles were also excited; therefore, the contribution of C, H, O, and N to the classification was smaller than that of the metal elements.
Among the 20 most important lines evaluated by IW-PCA and by RF, which are listed in Table 4, 10 lines were selected by both algorithms (related to Ca, Na, K, N, and H). In addition, lines related to CN were selected only by RF, and oxygen lines were selected only by IW-PCA. Both methods gave high importance weights to the potassium lines, and both selected the two potassium lines at 766.5 nm and 769.9 nm. Some elements, such as Fe, Ca, Na, N and H, were related to the extracted lines in both methods, but the lines related to each element were not exactly the same. In general, RF selected fewer lines related to elements present in both the samples and the ambient gas.
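The Gini index reduction used in this section as the RF importance weight can be illustrated for a single node with a short sketch. It is a simplified, hypothetical example with two classes and two lines, not the authors' implementation: a real forest accumulates the reductions over all nodes and all trees, and the function names and toy intensities here are assumptions.

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def gini_reduction(intensity, labels, threshold):
    """Gini(D) - Gini(D, A) for the binary split 'intensity <= threshold'."""
    left = intensity <= threshold
    n, n_left = labels.size, int(left.sum())
    if n_left == 0 or n_left == n:
        return 0.0                                # no actual split
    gini_split = (n_left / n) * gini(labels[left]) \
        + ((n - n_left) / n) * gini(labels[~left])
    return gini(labels) - gini_split

def best_reduction(intensity, labels):
    """Largest reduction over candidate thresholds (midpoints of sorted values)."""
    values = np.unique(intensity)
    thresholds = (values[:-1] + values[1:]) / 2.0
    return max((gini_reduction(intensity, labels, t) for t in thresholds), default=0.0)


labels = np.array([0, 0, 0, 1, 1, 1])
line_a = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])   # separates the two classes
line_b = np.array([1.0, 5.0, 1.1, 5.2, 0.9, 4.8])   # mixes the two classes

print(gini(labels))                       # 0.5
print(best_reduction(line_a, labels))     # 0.5
print(best_reduction(line_b, labels))     # much smaller than for line_a
```

A line whose intensity cleanly separates the classes removes all of the node's uncertainty, which is why such lines receive high importance weights.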

Classification results based on an SVM classifier
An SVM classifier was chosen to classify the spectral data in this paper. As a supervised learning model used for data classification and regression, it can classify linearly separable data directly [42,43]. Moreover, by using a kernel function to map non-linearly separable data from a low-dimensional space to a high-dimensional feature space, an SVM can classify non-linearly separable data without increasing the computational complexity [44,45].
With this advantage, SVM has been widely used to classify LIBS spectral data [42,44-47]. When usin […] and it cost ab […] evaluated by […] extracted by t […] in Fig. 6 and […] from 1 to 20 s […]
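The comparison of CCRs for different numbers of feature lines could be sketched as follows with scikit-learn. This is an illustration on synthetic data only. The RBF kernel, the hyperparameters `C=1.0` and `gamma="scale"`, the function name `ccr_by_feature_count`, and the pre-sorted `ranking` are all assumptions, since the kernel and settings actually used are not given in the surviving text.

```python
import numpy as np
from sklearn.svm import SVC

def ccr_by_feature_count(X_train, y_train, X_test, y_test, ranking, max_lines=20):
    """Test-set CCR when the SVM sees only the top-n ranked lines, n = 1..max_lines."""
    ccrs = []
    for n in range(1, max_lines + 1):
        cols = ranking[:n]
        model = SVC(kernel="rbf", C=1.0, gamma="scale")   # assumed kernel and settings
        model.fit(X_train[:, cols], y_train)
        ccrs.append(model.score(X_test[:, cols], y_test))  # fraction classified correctly
    return np.array(ccrs)


# Synthetic stand-in: 6 classes, 85 lines, only the first 5 lines carry class information.
rng = np.random.default_rng(0)
n_classes, n_lines = 6, 85
centers = np.zeros((n_classes, n_lines))
centers[:, :5] = rng.normal(0, 3, size=(n_classes, 5))

def make_set(n_per_class):
    y = np.repeat(np.arange(n_classes), n_per_class)
    X = centers[y] + rng.normal(size=(y.size, n_lines))
    return X, y

X_train, y_train = make_set(75)                  # 75 training spectra per class
X_test, y_test = make_set(25)                    # 25 testing spectra per class

# Hypothetical ranking: pretend the lines are already sorted by importance weight.
ranking = np.arange(n_lines)

ccrs = ccr_by_feature_count(X_train, y_train, X_test, y_test, ranking)
best_n = int(np.argmax(ccrs)) + 1
print(best_n, ccrs[best_n - 1])
```

In practice `ranking` would come from the importance weights evaluated by IW-PCA or RF, and the number of lines giving the highest CCR would be taken as the optimal number of feature lines.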

Conclusion
The complexity and slowness of identifying pathogenic bacteria make it one of the most urgent problems in clinical hospital settings. In view of this, LIBS combined with SVM was used to identify and classify 6 kinds of typical pathogens in this paper. To improve the performance of the technique, IW-PCA and RF were proposed to evaluate the importance of spectral lines and extract optimal lines as classifier inputs. The importance weight of each line can be evaluated, and appropriate feature lines extracted, by using these two algorithms. Using all 85 lines to build the model, the accuracy reached 95.33% with an average time of about 143.71 s. Using lines extracted by IW-PCA and RF, the average accuracy reached 95.79% and 96.51%, respectively, and the analysis time was reduced to between about 21.86 s and 71.83 s. Because the CCR was higher than when all lines were used in the classifier, the importance of feature lines was verified. Using lines extracted by RF as inputs, both the average and the highest CCRs were higher than those obtained using lines selected by IW-PCA. Therefore, the RF algorithm is more suitable than IW-PCA for evaluating the importance of spectral lines in the LIBS-SVM classification mechanism. Furthermore, the two methods mutually verified the importance of the selected lines, and the lines evaluated as important by both IW-PCA and RF contributed more to the CCR. Using the feature lines selected by both algorithms, the highest classification accuracy was 98%, which demonstrates that LIBS is a potentially feasible technique for identifying pathogenic bacteria.

Disclosures
The authors declare that there are no conflicts of interest related to this article.