Classification of Geological Samples Based on Soft Independent Modeling of Class Analogy Using Laser-Induced Breakdown Spectroscopy

In this study, laser-induced breakdown spectroscopy combined with soft independent modeling of class analogy was used to identify a large number of unprocessed geological samples with similar components. Considering the variety of data from different samples, representative spectral regions corresponding to the major components were extracted. In addition, principal component analysis was applied to remove noninformative variables from the spectrum. The unclassification rate, misclassification rate, and average correct classification rate for 25 types of geological samples were 1.2%, 4.7%, and 94.1%, respectively. These results suggest that laser-induced breakdown spectroscopy using soft independent modeling of class analogy can be used to identify a wide variety of geological samples. Furthermore, we found that this approach can identify spectral differences among similar sample types that arise from matrix effects and trace element impurities.


Introduction
Laser-induced breakdown spectroscopy (LIBS) [1,2] is a simple multielement atomic emission technique that provides a semidestructive and efficient analysis, particularly in harsh and dangerous environments. Thus, LIBS has been widely used for various applications, such as industry-oriented analysis [3], archaeological investigation [4], geological and environmental studies [5][6][7], and jewelry characterization [8]. Geological materials including rocks and minerals convey important information about particular geological environments, and this information can be extremely useful for studies such as determining mineral provenance, reservoir description, prospecting, and geochemical mapping [9][10][11]. Nevertheless, there is a wide variety of geological materials with overlapping characteristics, which compromises their proper discrimination. The conventional geological survey depends on the geologist's assessment and subsequent laboratory analyses, which can be time-consuming and complicated.
To simplify the analysis, LIBS applications on geological materials have been proposed over the last two decades. Furthermore, multivariate preprocessing methods have been increasingly studied, including approaches based on principal component analysis (PCA) [12,13], partial least squares discriminant analysis (PLS-DA) [14], graph theory (GT) [15], independent component analysis (ICA) [16], and artificial neural networks (ANNs) [17,18]. Such methods reduce the effect of redundant information, thereby increasing the efficiency of data analysis and suppressing minor fluctuations resulting from experimental conditions and instrumental instability [19]. Specifically, soft independent modeling of class analogy (SIMCA) is widely used to classify high-dimensional data because it incorporates PCA for dimensionality reduction [20]. It was originally developed to increase the accuracy and speed of classification in near-infrared spectroscopy [21][22][23][24][25] and was subsequently applied to the classification of LIBS spectra [26,27].
Although suitable results have been reported in the classification of some geological materials, it is still challenging to provide a method that suitably classifies a large number of materials, especially when they present similar major elements. In this study, we applied SIMCA and PCA to LIBS data with the aim of classifying a wide variety of geological samples.

Experiment and Methods
2.1. LIBS Instrument. Figure 1 illustrates the complete experimental system used in this study. The Brilliant B Nd:YAG laser (Quantel SA, Les Ulis Cedex, France) was operated at a fundamental wavelength of 1064 nm, a repetition rate of 10 Hz, and a pulse width of 10 ns. The laser energy was optimized to maximize the peak intensity without saturating the intensified charge-coupled device (ICCD) camera. The laser beam was focused on the target with a long-focus (f1 = 100 mm) lens to prevent contamination from spatter particles generated by the laser shots. Light emitted from the plasma was collected by a pair of matching coaxial fused silica planoconvex lenses (f2 = f3 = 38.1 mm) and guided into a 230 μm diameter optical fiber coupled to a Mechelle 5000 spectrometer (Andor Technology Ltd., Belfast, UK).
Then, the dispersed light from the spectrometer was recorded with an iStar DH734i-18F-03 ICCD camera (Andor Technology Ltd., Belfast, UK) having a wide spectral range (212-1032 nm) with 0.1 nm resolution.
The angle between the collection direction and the sample stage surface was approximately 45°. The samples were fixed on a rotating platform and mechanically rotated to different positions following laser ablation. The crater effects were minimized, and the inhomogeneity among samples was partially compensated by collection from different positions.

Samples and Measurements.
This study involved the analysis of 25 types of geological samples representing a mixture of minerals and rocks (carbuncle), which are listed in Table 1. Figure 2 shows photographs of six types of samples used in this study. These samples were obtained from the China Institute of Geology in Qingdao City, Shandong Province. Five different blocks of each sample were collected at different but nearby geographical locations. In addition, several geological samples with a similar chemical composition were purposely included in this study to verify the robustness of the proposed model. In fact, compositionally similar minerals can exhibit a very high spectral correlation, thus posing a challenge to intersample discrimination. For instance, samples No. 20, 21, and 22 can be considered types of gypsum, which is mostly composed of calcium sulfate (CaSO4), whereas samples No. 7, 10, and 17 basically consist of iron oxides (Fe2O3), and samples No. 8, 19, and 25 were also considered to be of the same type. Samples were measured using LIBS without pretreatment to obtain raw data from in-field measurements. The five separate blocks of each geological sample were measured; three of them were assigned for determining the method parameters, whereas the remaining two were used to test the method performance. To partially balance spectral heterogeneity, each spectrum was accumulated from 5 laser shots, and 20 spectra were acquired per block at different points on the sample surface. The integration time and delay were 15 µs and 200 ns, respectively, to eliminate continuum emission.
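The block-wise split described above can be sketched as follows. This is a minimal illustration, assuming that the three parameter-setting blocks are chosen at random for each sample type (the text does not state how they were chosen); the function name `split_blocks` and the seeds are hypothetical.

```python
import random


def split_blocks(n_blocks=5, n_train_blocks=3, seed=0):
    """Assign whole blocks of one sample type to training or test.

    Assumption: blocks are picked at random; the source only states
    that three blocks set the parameters and two test the method.
    """
    blocks = list(range(n_blocks))
    random.Random(seed).shuffle(blocks)
    return sorted(blocks[:n_train_blocks]), sorted(blocks[n_train_blocks:])


N_TYPES = 25            # sample types, as stated in the text
SPECTRA_PER_BLOCK = 20  # spectra acquired per block, as stated in the text

train_total = 0
test_total = 0
for sample_type in range(1, N_TYPES + 1):
    train_blocks, test_blocks = split_blocks(seed=sample_type)
    # No block may contribute to both sets.
    assert not set(train_blocks) & set(test_blocks)
    train_total += len(train_blocks) * SPECTRA_PER_BLOCK
    test_total += len(test_blocks) * SPECTRA_PER_BLOCK

print(train_total, test_total)
```

Splitting by block rather than by individual spectrum keeps spectra from the same physical block out of both sets, so the test measures how well the model generalizes to a new block of the same sample type.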

Model.
Multivariate analysis can be applied to reduce or compress spectral data while retaining important spectral information of the samples [28]. In particular, SIMCA is a widely used supervised pattern recognition method for classifying sample spectra into specific categories. It consists of a collection of PCA models and can provide independent classification for each category, as detailed in [29,30]. Specifically, PCA calculations were carried out to reduce the dimensionality of the data set, allowing an overview of the samples. The results from PCA are typically analyzed by score and loading plots. Score plots reveal similarities among samples as well as outliers and clusters. Loading plots identify the variables that have the greatest influence on the sample positions in the score plots. The optimal number of principal components (PCs) to characterize the data set was based on the total variance retained by the principal components [30,31]. In addition, we used a toolbox for SIMCA that was developed by the Milano Chemometrics and QSAR Research Group at the University of Milano-Bicocca in Italy [32,33]. Moreover, we considered a statistical confidence level of 95% (α = 0.05) and implemented the calculations using MATLAB version 7.2. A total of 1000 spectra (40 spectra × 25 samples) from known samples were used to build the SIMCA recognition model, and 500 spectra (20 spectra × 25 samples) were used to optimize the parameters based on cross validation. The remaining 1000 spectra (40 spectra × 25 samples) from unknown samples were used as a test set to determine the classification performance.
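The idea of SIMCA as a collection of per-class PCA models can be sketched as follows. This is not the MATLAB toolbox used in this study; it is a simplified illustration that assumes NumPy is available, uses only the residual distance to each class subspace, and takes an empirical 95% quantile of the training distances as the acceptance limit (SIMCA implementations commonly use statistically derived critical limits instead). All function names and the synthetic data are hypothetical.

```python
import numpy as np


def fit_class_model(X, n_components, quantile=0.95):
    """Fit a PCA model to the spectra of one class (rows are spectra)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions; keep the leading ones.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T
    # Distance of each training spectrum from the class PCA subspace.
    residuals = Xc - Xc @ P @ P.T
    distances = np.sqrt((residuals ** 2).sum(axis=1))
    # Assumption: empirical quantile used as the acceptance limit.
    limit = float(np.quantile(distances, quantile))
    return {"mean": mean, "P": P, "limit": limit}


def residual_distance(model, x):
    """Distance of one spectrum from the subspace of a class model."""
    xc = x - model["mean"]
    r = xc - model["P"] @ (model["P"].T @ xc)
    return float(np.sqrt(r @ r))


def simca_predict(models, x):
    """Return the best accepting class, or None if no class accepts x."""
    ratios = {label: residual_distance(m, x) / m["limit"]
              for label, m in models.items()}
    accepted = {label: r for label, r in ratios.items() if r <= 1.0}
    if not accepted:
        return None  # counted as "unclassified"
    return min(accepted, key=accepted.get)


# Synthetic demonstration data (not LIBS spectra): two classes in 5-D.
rng = np.random.default_rng(0)
class_a = (rng.normal(size=(40, 1)) * np.array([1.0, 0, 0, 0, 0])
           + 0.01 * rng.normal(size=(40, 5)))
class_b = (np.array([0, 5.0, 0, 0, 0])
           + rng.normal(size=(40, 1)) * np.array([0, 0, 1.0, 0, 0])
           + 0.01 * rng.normal(size=(40, 5)))
models = {"A": fit_class_model(class_a, 1), "B": fit_class_model(class_b, 1)}

label_near_a = simca_predict(models, np.array([1.0, 0, 0, 0, 0]))
label_far = simca_predict(models, np.array([0, 0, 0, 0, 10.0]))
print(label_near_a, label_far)
```

Because each class has its own model and acceptance limit, a spectrum can be rejected by every class, which is what produces the "unclassified" outcome reported alongside misclassification in this study.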

Emission Line Selection.
The analytical spectral line of an element in the plasma is related to the ejected sample mass and depends on the laser radiation parameters, that is, energy and focusing. Either random or systematic changes of these parameters can strongly affect the analytical precision and accuracy and may introduce nonlinearity in the classification [34]. Moreover, the roughness of the sample surface further increases nonlinearity through the interaction between the laser and the samples. However, normalization can compensate for some of these shortcomings and for signal variations resulting from experimental conditions and instrumental instability [35]. In this study, the signals corresponding to various elements were normalized to the total spectrum intensity. Figure 3 illustrates the normalized spectral lines of five mineral samples, where the highlighted elements were identified using the NIST Atomic Spectra Database. The emission lines in the spectra of the five samples show both strong similarities and differences among the distinct LIBS fingerprints.
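Normalization to the total spectrum intensity can be sketched as follows. The function name and the toy intensity values are hypothetical; the example only illustrates that two spectra with the same line ratios but different overall signal levels become identical after normalization.

```python
def normalize_total_intensity(spectrum):
    """Divide every channel by the summed intensity of the spectrum."""
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum has no positive total intensity")
    return [value / total for value in spectrum]


# Toy spectra: same line ratios, but the second has twice the signal,
# as could happen when more mass is ablated by a laser shot.
weak = [10.0, 40.0, 50.0]
strong = [20.0, 80.0, 100.0]

norm_weak = normalize_total_intensity(weak)
norm_strong = normalize_total_intensity(strong)
print(norm_weak == norm_strong)
```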
The recorded spectra consist of more than 20,000 pixels spanning a wide wavelength region from ultraviolet to near-infrared. However, a substantial portion of the feature space may not be relevant for classification. Hence, feature selection must be applied to eliminate spurious correlations, especially when interclass differences are subtle. Feature selection removes regions of the LIBS data that do not convey useful classification information. We found that the most important features correspond to wavelengths of the elemental emission lines of K, Li, Na, Ba, Mn, Ca, Al, Ti, Si, Mg, and Fe. In fact, these elements are the main components of the continental crust and determine unique chemical fingerprints, which are useful for geological study. For classification, a set of spectral regions from the major elements commonly used in spectral analysis was selected [36], totaling 1107 variables. The selected spectral variables are listed in Table 2 and illustrated in Figure 4.
Journal of Spectroscopy
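Selecting spectral regions around elemental emission lines can be sketched as follows. The window bounds below are hypothetical and chosen only to bracket the well-known Ca II lines near 393.4 nm and 396.8 nm and the Na I doublet near 589 nm; they are not the values of Table 2, and the function name is an assumption.

```python
def select_windows(wavelengths, intensities, windows):
    """Keep only channels whose wavelength (nm) lies inside any window."""
    kept = [
        (w, i)
        for w, i in zip(wavelengths, intensities)
        if any(low <= w <= high for low, high in windows)
    ]
    return [w for w, _ in kept], [i for _, i in kept]


# Hypothetical windows in nm; NOT the regions listed in Table 2.
WINDOWS = [(393.0, 397.5), (588.5, 590.0)]

# Toy spectrum: wavelength axis (nm) and matching intensities.
wavelengths = [392.0, 393.4, 396.8, 500.0, 589.0, 589.6, 600.0]
intensities = [1.0, 9.0, 8.0, 0.5, 7.0, 6.0, 0.2]

sel_w, sel_i = select_windows(wavelengths, intensities, WINDOWS)
print(sel_w, sel_i)
```

Applied to the full spectra, the same masking step would reduce more than 20,000 pixels to the selected variables before PCA.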

Results and Discussion
3.1. PCA Optimization. Problems of collinearity and high computational cost persisted after selecting the spectral regions. Hence, the selected LIBS regions were projected into lower-dimensional independent variables using PCA, and those having the maximal interclass variance and minimal intraclass variance were iteratively determined.
The selection of principal components greatly affects the classification capability, and a suitable selection can prevent both underfitting and overfitting. The scores and loadings of PCA for the 25 types of samples are shown in Figure 4, where the principal components PC1, PC2, and PC3 contribute 28.84%, 22.17%, and 14.34% of the overall variance, respectively. In addition, a high loading at a given wavelength indicates that the corresponding element strongly affects the principal component [37] and suggests a high concentration of that element in the samples.
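The cumulative variance carried by the leading principal components follows directly from the reported contributions, as the short sketch below shows. The helper `components_for` and the 60% target are hypothetical; in this study the final number of components was also guided by cross validation rather than by a variance target alone.

```python
from itertools import accumulate

# Percent of overall variance for PC1, PC2, and PC3, as reported.
explained = [28.84, 22.17, 14.34]
cumulative = [round(c, 2) for c in accumulate(explained)]


def components_for(target_percent, cumulative_percent):
    """Smallest number of PCs reaching the target, or None if not reached."""
    for count, value in enumerate(cumulative_percent, start=1):
        if value >= target_percent:
            return count
    return None


print(cumulative)
print(components_for(60.0, cumulative))
print(components_for(90.0, cumulative))
```

With only the first three contributions available, a 90% target cannot be reached, which is consistent with the need to retain more than three components.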
The loading and score of PC1 suggest a high correlation with the concentrations of Ca and Al. Similarly, PC2 is clearly related to Ca, Al, Si, and Mg; PC3 is more relevant to Ba, Mn, Ca, and Al. Considering that the first three principal components convey only 65.35% of the original information, additional principal components were also introduced into SIMCA. Figure 5 shows that an increasing number of principal components reduces the root mean square error of cross validation (RMSECV), thus indicating a more accurate selection. The RMSECV converges to a stable minimum after considering approximately 15 principal components. By selecting the most relevant principal components based on the RMSECV results, the classification for the validation data set was as follows: unclassification of 0.4%, misclassification of 1.4%, and correct classification of 98.2%.

SIMCA Evaluation.
Table 3 lists the SIMCA classification results for the test data. The results also demonstrate that misclassification mostly occurs among similar samples, suggesting the difficulty of classifying minerals that have the same cations as major constituents, such as gypsum (samples No. 20 (selenite), 21 (alabaster), and 22 (anhydrite)) and hematite (samples No. 7 (oolitic hematite), 10 (black hematite), and 17 (reniform hematite)). Nevertheless, the correct classification rate among similar samples remained acceptable.
The correct classification rate among similar samples can be related to the slight differences resulting from physical matrix effects including hardness, structure, and texture. These factors can result in different amounts of ablated mass and in a consequent variation of the spectral lines, even when the geological samples have a similar chemical composition [38]. Therefore, these influences can be useful in studies for recognizing similar geological samples. Another reason for the ability to distinguish similar samples can be attributed to minor impurities, whose trace distribution in natural minerals provides different spectral features among similar matrices.
The loadings were compared to the recombined spectrum in order to observe the variables in the spectrum that contribute to classification.
The correct classification rate of 60% for carbuncle is lower than that of the other minerals. This may be ascribed to the fact that carbuncle, as a type of rock, is a mixed body of mineral grains, pores, and cement. There are also subtle differences between the blocks of carbuncle because they come from different locations. The measurements on different points of the carbuncle block surface do not exhibit consistent proportions. In contrast, minerals are a single substance or a compound formed by geological action, presenting a relatively fixed chemical composition.
The remaining unclassified and misclassified samples can be attributed to the target heterogeneity. To obtain representative spectra for calibration and validation of the model from heterogeneous samples, it is necessary to collect LIBS information from a sufficiently large number of analysis spots, and the number used in this study may not have been sufficient to obtain representative analyses for all minerals.
Overall, a high performance is achieved for all the samples, with an average correct classification rate of 94.1% and low rates of unclassification and misclassification.
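The three rates reported here partition the test spectra, which is why 1.2%, 4.7%, and 94.1% sum to 100%. A minimal sketch of how such rates can be computed is given below; the function name and the five toy predictions are hypothetical, and `None` is assumed to mark a spectrum that no class model accepted.

```python
def classification_rates(true_labels, predicted_labels):
    """Return percent correct, misclassified, and unclassified."""
    n = len(true_labels)
    if n == 0 or n != len(predicted_labels):
        raise ValueError("label lists must be non-empty and equally long")
    unclassified = sum(p is None for p in predicted_labels)
    correct = sum(p == t for t, p in zip(true_labels, predicted_labels))
    misclassified = n - unclassified - correct
    return {
        "correct": 100.0 * correct / n,
        "misclassified": 100.0 * misclassified / n,
        "unclassified": 100.0 * unclassified / n,
    }


# Toy example with compositionally similar gypsum varieties.
true_labels = ["selenite", "selenite", "alabaster", "anhydrite", "anhydrite"]
predicted = ["selenite", "alabaster", "alabaster", None, "anhydrite"]

rates = classification_rates(true_labels, predicted)
print(rates)
```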
Therefore, the proposed SIMCA with LIBS was capable of correctly classifying samples in multiple categories, even when they present similar compositions. Furthermore, the selected wavelength ranges, which reduce the amount of analyzed data to the major elements of the geological minerals, enabled a successful and more efficient classification.

Conclusions
In this paper, LIBS combined with SIMCA was evaluated for the robust classification of a wide range of mineral types, which can be suitable for real-world applications and is an improvement over many previous studies that were limited to several minerals. The LIBS data of different minerals can be used to identify samples based on major constituents. In fact, PCA makes it possible to evaluate the discriminating ability of the different elements present in geological samples through the corresponding loadings and scores of the principal components. In this study, PCA-optimized SIMCA was employed to classify 25 types of geological samples. Although the correct identification was compromised for some varieties of rock such as carbuncle, which is composed of a variety of minerals having a high degree of trace element variability, the overall correct classification rate of 94.1% was high, and the unclassification rate of 1.2% and the misclassification rate of 4.7% were acceptable.

Data Availability
The data used to support the findings of this study are available in the Microsoft Excel format from the corresponding author upon request.

Figure 1: Diagram of a typical LIBS experimental setup for studying geological samples.

Figure 5: Estimated root mean square error of cross validation using SIMCA according to the number of principal components for 25 validation samples.

Table 1: Description of geological samples used in this study.

Table 2: Emission lines for classification based on major elements from geological samples.

Table 3: Classification performance using SIMCA with PCA for 25 geological samples.