Label-free detection of nasopharyngeal and liver cancer using surface-enhanced Raman spectroscopy and partial least squares combined with support vector machine.

In this paper, we investigated the feasibility of using surface-enhanced Raman spectroscopy (SERS) and multivariate analysis methods to discriminate liver cancer and nasopharyngeal cancer patients from healthy volunteers. SERS measurements were performed on serum protein samples from 104 liver cancer patients, 100 nasopharyngeal cancer patients, and 95 healthy volunteers. Two dimensionality reduction methods, principal component analysis (PCA) and partial least squares (PLS), were compared, and the results indicated that the performance of PLS is superior to that of PCA. After the number of components was compressed to 3 by PLS, a support vector machine (SVM) with a Gaussian radial basis function (RBF) kernel was employed to classify the cancers simultaneously. Based on the PLS-SVM algorithm, high diagnostic accuracies of 95.09% and 90.67% were achieved for the training set and the unknown testing set, respectively. The results of this exploratory work demonstrate that serum protein SERS technology combined with the PLS-SVM diagnostic algorithm has great potential for the noninvasive screening of cancer.

cancer detection. The results show that gastric cancer samples [11], nasopharyngeal cancer samples [12] and colorectal cancer samples [13] can each be distinguished well from those of healthy volunteers. Moreover, SERS analysis of serum has also been used for tumor stage detection [6]. The above studies demonstrate that the noninvasive serum SERS analysis technique has great potential for cancer screening.
Usually, the differences in SERS spectra between cancer samples and normal samples are tiny, and it is difficult to differentiate them by direct observation. Therefore, robust and effective statistical methods for spectral data are needed to extract the diagnostic information. Principal component analysis (PCA) is the most common statistical method for simplifying a spectral data set and determining the key components that best explain the differences in the spectra [14]. Briefly, the main objective of PCA is to reduce the high dimension of the spectra to a few principal components (PCs) while retaining the most diagnostically significant information for classification. However, there are usually many PCs after PCA processing (more than 10 components), which makes it difficult to understand the key differences between cancer samples and normal samples. Moreover, in PCA, the relationship between input and output variables is not considered [15]. To solve this problem, partial least squares (PLS) is employed as a useful method that can detect the input variables related to the output variables [16]. It has been demonstrated that PLS analysis can be better than PCA for dimension reduction and spectroscopic diagnostics since it uses group affinity information (class membership) to maximize the variations between groups [17]. In addition, the support vector machine (SVM), introduced by Vapnik and Burges [18], has attracted great attention due to its ability to reveal non-linear relationships and produce models that achieve better classification results than traditional methods [19,20]. The combination of PCA (or PLS) and SVM has been successfully applied in the fields of cancer screening, disease prediction, gene selection, etc. [7,19-22].
Traditional analysis pays more attention to the classification ability of algorithms. By optimizing the statistical method, high diagnostic sensitivity, specificity and accuracy can be easily obtained in the classification of known samples [10]. However, the diagnostic capability of a statistical algorithm should be assessed by its prediction accuracy on unknown testing samples. In this study, to evaluate the diagnostic capabilities of our statistical methods, a quarter of the spectral data were set aside as an unknown testing set. Furthermore, simultaneous screening of various cancers in a single SERS assay is a requirement of clinical application. To meet this demand, three groups of serum samples obtained from liver cancer patients, nasopharyngeal carcinoma patients and normal volunteers were introduced.
In this paper, we explored a data analysis method for the simultaneous screening of two different types of cancer. SERS spectra of serum proteins from 104 liver cancer (LC) patients, 100 nasopharyngeal cancer (NC) patients and 95 normal volunteers were recorded using our previous method [23]. PLS and PCA were employed to extract features from the SERS spectra, and SVM was then used to form a diagnostic algorithm and classify the cancers simultaneously. To the best of our knowledge, this is the first report on serum protein-based SERS for simultaneous screening of multiple types of cancer. This exploratory work may further promote the translation of the serum SERS analysis technique into clinical applications.

Preparation of Ag nanoparticles
Ag nanoparticles (NPs) were prepared by the aqueous reduction of silver nitrate with hydroxylamine hydrochloride using the method developed by Leopold and Lendl [24]. Briefly, 4.5 mL sodium hydroxide (10⁻¹ mol/L) was added to 5 mL hydroxylamine hydrochloride (6 × 10⁻² mol/L) and then the mixture was added to 90 mL silver nitrate (1.11 × 10⁻³ mol/L). The mixture was kept stirring until a homogeneous solution with a milky gray color was obtained. Figure 1 shows the transmission electron microscope (TEM) image and the UV-Vis-NIR absorption spectrum of the Ag NPs. The average size of the Ag NPs is 45 ± 6 nm. The absorption maximum was located at 417 nm.

Preparation of human serum samples
Ethical approval was obtained in order to study the human blood samples. Three groups of blood samples were provided by the Fujian Provincial Cancer Hospital, including 95 blood samples from healthy volunteers as the control group, 104 blood samples from LC patients and 100 blood samples from NC patients. Table 1 lists the detailed clinical diagnostic information of these patients (e.g. age, gender, and histopathological stage). After 12 hours of overnight fasting, 3 mL blood samples were collected from the subjects between 7:00 and 8:00 A.M. The blood samples were left to stand at room temperature (27°C) for 30 min until the blood clotted. The supernatant (including some blood cells and serum) was then centrifuged (1000 rpm, 10 min) to separate the blood cells from the serum, yielding the serum samples.

Experiment and SERS measurements
Figure 2(a) shows the schematic of membrane electrophoresis and SERS measurement. Briefly, 2.5 μL of serum sample was blotted onto the cellulose acetate (CA) membrane for electrophoresis. After electrophoresis, the CA membrane was divided equally into two parts along a vertical line. Half of the CA membrane was stained to label the location of the proteins for reference. The serum proteins in the remaining half of the membrane were cut out according to the labeled position. The isolated band of protein was collected in a test tube. Acetic acid was added to dissolve the membrane, and Ag NPs were subsequently added and mixed to enhance the Raman signal of the proteins. The mixture was incubated at 37°C with continuous stirring for 5 min. Then SERS measurements were performed, and the raw spectra were obtained. The pH value of the final protein-Ag NPs mixture was 2.9. The average concentration of proteins in the final solution was 368 ± 45 μg/mL (measured with the Bradford Protein Assay Kit, Order no. C503021, Sangon Biotech, Shanghai, China). More details about the process of membrane electrophoresis can be found in our previous study [23].
The SERS spectra were acquired in the range of 500-1700 cm⁻¹ with a 10 s integration time using a Renishaw confocal Raman micro-spectrometer (inVia System). A 785 nm diode laser was focused through a Leica 50× objective (NA: 0.75) to excite the samples. The incident laser power was about 0.1 mW. The WIRE 3.4 software package (Renishaw) was employed for the spectral acquisition.

Data analysis
The schematic diagram of data analysis is shown in Fig. 2(b). The analysis of SERS spectra was performed in three steps: (1) data preprocessing; (2) dimensionality reduction; (3) classification and prediction. The raw spectra represented a composition of the SERS signal and the autofluorescence background signal. The autofluorescence background was removed from the raw spectra by an automated algorithm [25]. All background-removed SERS spectra were further normalized to the integrated area under the curve. This normalization enabled a better comparison of the spectral characteristics among the three groups [26]. The entire data set of serum protein SERS spectra was divided into two parts: the training set and the testing set. The training set was composed of 224 randomly selected spectra (N Liver = 78, N Nasopharyngeal = 75, and N Normal = 71) and the testing set was composed of the remaining 75 spectra (N Liver = 26, N Nasopharyngeal = 25, and N Normal = 24).
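The preprocessing and data-splitting steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the fixed random seed, and the choice to draw a quarter of each class separately (a stratified split) are assumptions, although the resulting class counts match those reported above. Area normalization is approximated by the sum of intensities, which assumes evenly spaced wavenumbers.

```python
import random

def normalize_area(spectrum):
    """Scale a background-removed spectrum so its intensities sum to 1.

    With evenly spaced wavenumbers the sum is proportional to the
    integrated area under the curve (an assumption of this sketch).
    """
    area = sum(spectrum)
    if area == 0:
        raise ValueError("spectrum has zero area")
    return [v / area for v in spectrum]

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), drawing test_fraction of each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(test_fraction * len(idx))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Group sizes taken from the text: 104 LC, 100 NC, 95 normal.
labels = ["LC"] * 104 + ["NC"] * 100 + ["Normal"] * 95
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 224 75
```

Splitting within each class keeps the class proportions of the testing set close to those of the full data set, which matters when the three groups are of slightly different sizes.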
PLS was first performed to reduce the spectral dimension by extracting a set of components (latent variables). The SVM algorithm was then applied to these components to distinguish the cancer samples from the normal samples. To assess the performance of the PLS-SVM approach, the traditional multivariate statistical analysis method of principal component analysis-linear discriminant analysis (PCA-LDA) was also applied to classify the same SERS data set.

Partial least squares
PLS can be used as a dimension reduction technique similar to PCA [27]. In this study, $X_{N \times M}$ (N is the number of samples in the training set and M is the number of wavenumbers) is the input variable matrix and $Y_{N \times 1}$ (the grouping variable) is the output variable matrix. The PLS algorithm establishes the relationship between X and Y through score vectors. For a single response variable (grouping information), the PLS model is described as

$$X_{N \times M} = S_{N \times A} P_{M \times A}^{T} + E_{N \times M}$$

$$y_{N \times 1} = U_{N \times A} q^{T} + F_{N \times 1}$$

where $S_{N \times A}$ and $U_{N \times A}$ are the PLS score matrices (A is the number of PLS components); $P_{M \times A}$ is the loading matrix of $X_{N \times M}$; $E_{N \times M}$ is the residual matrix of $X_{N \times M}$; $q$ is the loading matrix of $y$; and $F_{N \times 1}$ is the residual vector of $y$. In this study, the PLS score matrices and loading matrices were calculated using the SIMPLS algorithm [28]. The mean squared error of prediction (MSEP) estimated by 10-fold cross-validation was used to determine the number of PLS components [29,30].
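To make the score and loading decomposition concrete, the following is a minimal pure-Python sketch of PLS with a single response, written as NIPALS-style deflation rather than the SIMPLS routine used in this study. The function names `pls1_fit` and `pls1_transform`, the numeric class coding in `y`, and the toy data are all assumptions made for illustration. In this single-response form the same scores describe both X and y.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pls1_fit(X, y, n_components):
    """Fit single-response PLS by successive deflation of X and y.

    X: list of N rows (length M each); y: list of N numbers.
    Returns means, weights W, loadings P, y-loadings q and scores S.
    """
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                 # centered y
    W, P, q = [], [], []
    S = [[] for _ in range(n)]
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with y.
        w = [dot([E[i][j] for i in range(n)], f) for j in range(m)]
        norm = dot(w, w) ** 0.5
        w = [v / norm for v in w]
        t = [dot(E[i], w) for i in range(n)]                    # scores
        tt = dot(t, t)
        p = [dot([E[i][j] for i in range(n)], t) / tt for j in range(m)]
        qa = dot(f, t) / tt
        # Remove the part of X and y explained by this component.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - qa * t[i] for i in range(n)]
        W.append(w)
        P.append(p)
        q.append(qa)
        for i in range(n):
            S[i].append(t[i])
    return {"x_mean": x_mean, "y_mean": y_mean, "W": W, "P": P, "q": q, "S": S}

def pls1_transform(model, X_new):
    """Project new spectra with the weights and loadings of the training set."""
    scores = []
    for row in X_new:
        e = [v - mu for v, mu in zip(row, model["x_mean"])]
        s = []
        for w, p in zip(model["W"], model["P"]):
            t = dot(e, w)
            s.append(t)
            e = [ev - t * pv for ev, pv in zip(e, p)]
        scores.append(s)
    return scores

# Toy data (hypothetical): 6 "spectra" with 4 variables, 3 coded groups.
X = [[1.0, 2.0, 0.5, 0.1], [0.9, 2.1, 0.4, 0.2], [1.5, 1.0, 0.7, 0.3],
     [1.6, 0.9, 0.8, 0.2], [0.2, 0.3, 1.9, 1.0], [0.1, 0.4, 2.0, 1.1]]
y = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
model = pls1_fit(X, y, n_components=2)
S_new = pls1_transform(model, X)
```

Applying `pls1_transform` to the training spectra reproduces the training scores, which is the property that allows unseen spectra to be mapped into the same feature space before classification.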

Support vector machine
The support vector machine (SVM), based on the foundations of Statistical Learning Theory [18], is a powerful supervised learning algorithm for classifying complex groups. As a classifier, SVM is considered superior to traditional linear approaches because it can handle classification problems with nonlinear boundaries by mapping the sample data set into a higher dimensional space [31].
To obtain an SVM classifier with good classification ability, the choice of an appropriate kernel function, which projects the data into the feature space, is critical [22]. The most frequently used kernel function is the Gaussian radial basis function (RBF):

$$K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right)$$

where $x_i$ and $x_j$ are two generic sample data vectors, and $\sigma$ is the Gaussian radial width that should be optimized. In addition, once the spectra are mapped to the feature space, there are countless separating hyperplanes, leading to the risk of over-fitting [19]. To avoid this problem, a penalty factor C is introduced to allow some training data to be misclassified. In this study, the penalty factor C and the parameter $1/(2\sigma^{2})$ were optimized by grid search [22].
In addition, the SVM diagnostic algorithm was evaluated by 10-fold cross-validation. All SVM analyses were performed in MATLAB using the LIBSVM toolbox 3.23 developed by Chang and Lin [32].
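As a small illustration of the quantities being tuned, the sketch below evaluates the Gaussian RBF kernel and builds a candidate parameter grid. It is not the LIBSVM code used in this study. The helper names are hypothetical, the range of C (2⁻¹⁰ to 2¹⁰) is the one this paper reports for its grid search, and the range of the kernel parameter is an assumption. LIBSVM parameterizes the RBF kernel as exp(-γ‖u - v‖²), so γ corresponds to 1/(2σ²).

```python
import math

def rbf_kernel(x_i, x_j, sigma):
    """Gaussian RBF kernel exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigma_to_gamma(sigma):
    """Convert the radial width sigma to gamma = 1 / (2 sigma^2)."""
    return 1.0 / (2.0 * sigma ** 2)

# Candidate grid on a base-2 logarithmic scale. The C range follows the
# paper; the gamma range is an assumption of this sketch.
C_grid = [2.0 ** k for k in range(-10, 11)]
gamma_grid = [2.0 ** k for k in range(-10, 11)]
param_pairs = [(C, g) for C in C_grid for g in gamma_grid]

print(len(param_pairs))                         # 441
print(rbf_kernel([0.0, 0.0], [1.0, 0.0], 1.0))  # exp(-0.5), about 0.6065
```

In a grid search, each pair in `param_pairs` would be scored by cross-validated accuracy on the training set, and the best-scoring pair kept for the final model.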

Testing
To assess the diagnostic capabilities of the PLS-SVM model, the model was applied to a set of testing data. First, the testing spectral data $T_{B \times M}$ (B is the number of samples in the testing set and M is the number of wavenumbers) were mapped to the feature space using the same linear transformation as for the training set:

$$S_{B \times A} = T_{B \times M} P_{M \times A}$$

where $P_{M \times A}$ is the PLS loading matrix calculated from the training set and $S_{B \times A}$ is the PLS score matrix of the testing set. $S_{B \times A}$ was then used as the input for the SVM model, and the diagnostic results were obtained. The accuracy, sensitivity and specificity of the diagnosis were also calculated.
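The text does not spell out how sensitivity and specificity were computed for three classes. One common convention, assumed here, is to treat each class in turn as "positive" against the other two (one-vs-rest). The sketch below applies that convention to a confusion matrix; the function name and the counts are hypothetical and are not the results of this study.

```python
def one_vs_rest_metrics(confusion, classes):
    """Accuracy plus per-class sensitivity and specificity.

    confusion[i][j] is the number of samples whose true class is
    classes[i] and whose predicted class is classes[j].
    """
    k = len(classes)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(k))
    results = {"accuracy": correct / total}
    for c, name in enumerate(classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # missed positives
        fp = sum(confusion[i][c] for i in range(k)) - tp   # false alarms
        tn = total - tp - fn - fp
        results[name] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        }
    return results

# Hypothetical counts for illustration only (not the paper's results).
classes = ["LC", "NC", "Normal"]
confusion = [
    [18, 1, 1],   # true LC
    [2, 17, 1],   # true NC
    [0, 2, 18],   # true Normal
]
metrics = one_vs_rest_metrics(confusion, classes)
print(round(metrics["accuracy"], 4))   # 0.8833
print(metrics["LC"])                   # sensitivity 0.9, specificity 0.95
```

Under this convention, the specificity for a cancer type counts every sample from the other two groups that was not assigned to that type, so it depends on both the normal group and the other cancer group.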

Membrane electrophoresis SERS
The membrane electrophoresis method was used to extract serum proteins from serum samples for cancer screening. The mean SERS spectra and standard deviations (overlaid as shaded color fill) of serum proteins for each group are shown in Fig. 3. Table 2 lists tentative assignments for the SERS peaks, according to the literature [11,12,23,33,34].
[Table 2. Tentative assignments of the SERS peaks (recoverable entries: tryptophan, C=C stretching mode; 1685 cm⁻¹, amide I).]
All three groups have similar SERS spectral profiles, such as Raman peak positions and bandwidths. Primary Raman peaks at 620, 643, 760, 828, 854, 1004, 1207, 1260, 1446 and 1685 cm⁻¹ can be observed in both the cancer and normal groups. However, there are still subtle differences between the groups, which provide the possibility of constructing diagnostic models for cancer detection and screening.

Dimensionality reduction of SERS spectra
To assess the performance of PLS in the dimensionality reduction of SERS spectra, the standard multivariate analysis method of PCA was also applied to the same spectral data set. Simply using a large number of components will lead to over-fitting in the diagnostic model. The mean squared error of prediction (MSEP) estimated by 10-fold cross-validation is a more statistically sound method for choosing the number of components in either PCA or PLS [29]. In this study, dimensionality reduction of SERS spectra is the main objective of PLS and PCA. Therefore, the adjusted Wold's R criterion is an appropriate choice for determining the number of components [29]; this criterion states that an additional component will not be included in the model unless it provides significantly better predictions. As shown in Fig. 4, the MSEP curve of PLS shows two different phases of behavior. In the first phase, the MSEP decreases rapidly, whilst in the second phase the rate of decrease becomes quite slow. [29,35]. Figure 5 shows the PLS loadings of the first three PLS components.
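The stopping rule described above can be sketched as a ratio test on successive cross-validated errors: a component is added only while it reduces the MSEP by more than a set margin. The function name, the threshold of 0.95 and the MSEP values below are assumptions chosen for illustration, not values taken from this study.

```python
def select_components(msep, threshold=0.95):
    """Choose the number of components with an adjusted Wold's R style rule.

    msep[a - 1] is the cross-validated MSEP of the model with a components.
    Component a + 1 is accepted only if MSEP(a + 1) / MSEP(a) < threshold.
    """
    n_components = 1
    for a in range(1, len(msep)):
        if msep[a] / msep[a - 1] < threshold:
            n_components = a + 1
        else:
            break
    return n_components

# Hypothetical MSEP curve with a fast phase followed by a slow phase.
msep = [0.40, 0.22, 0.12, 0.118, 0.117, 0.1165]
print(select_components(msep))  # 3
```

With a curve of this shape, the rule stops at the transition between the two phases, since later components improve the prediction error by only a few percent.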

Model training and testing
In this study, the RBF kernel SVM algorithm was used to classify serum protein SERS spectra in the feature space. To find the best classifier, the penalty factor C and the Gaussian radial width σ were optimized by the grid search method [19,22], which searches exhaustively for the optimal parameters by trying various pairs of values. The search range for the penalty factor C was from 2⁻¹⁰ to 2¹⁰. With the optimized parameters, the classification of serum protein SERS spectra from the LC, NC and normal groups in the training set achieved a diagnostic accuracy of 95.09%. Figure 7(a) shows the classification results of the RBF kernel SVM model in the feature space. Circles represent the support vectors, and the serum protein samples from the LC, NC and normal groups are marked with crosses, asterisks, and triangles, respectively. A light red separating hyperplane is created in the feature space to distinguish LC samples from other samples. Similarly, a light green hyperplane and a light blue hyperplane, corresponding to the NC samples and the normal samples, respectively, are also created. Figure 7(b) shows the results of classifying the SERS spectra in the testing set using the diagnostic model shown in Fig. 7(a).
To evaluate the performance of the PLS-SVM method, the PCA-LDA and PCA-SVM algorithms were also applied. The classification and prediction results of the PLS-SVM, PCA-LDA and PCA-SVM methods are summarized in Table 3. For PCA-LDA, the first 24 principal components, which accounted for 95.1% of the total variance, were used to classify the SERS spectra in the training set, and a classification accuracy of 98.21% was obtained with 10-fold cross-validation. However, the prediction accuracy for the SERS spectra in the unknown testing set using the PCA-LDA algorithm is only 85.33%, which is lower than that of the PLS-SVM algorithm. This result suggests that including too many components in the diagnostic model may lead to over-fitting. By comparison, PCA-SVM with 6 components performs worse: the classification accuracies of the training set and the testing set are 91.96% and 80%, respectively. With a minimum number of components (A = 3), the PLS-SVM algorithm performs well not only in the classification of the training set but also in the prediction of the testing set. As shown in Table 3, high diagnostic sensitivities of 92.31% and 96%, and specificities of 100% and 88%, respectively, were achieved for screening LC and NC simultaneously. These results indicate that SERS combined with PLS-SVM has great potential for cancer screening.
Moreover, analysis of different tumor (T) stages and early detection coupled with timely and standard treatment (e.g. chemotherapy and/or radiotherapy) is critical to improving patients' survival. Three groups of SERS spectra from the T1-T2 stage LC (or NC) group, the T3-T4 stage LC (or NC) group, and the normal group were fed into the PLS-SVM model for analysis (using 10-fold cross-validation), and Table 4 summarizes the diagnostic results.
For LC samples, the accuracy of the classification is 91.82%; the sensitivities of the two different cancer stage groups (T1-T2 stage and T3-T4 stage) are 83.33% and 94.12%, respectively; and the specificity is 93.68%. For NC samples, the accuracy of the classification is 90.22%; the sensitivities of the T1-T2 stage group and the T3-T4 stage group are 83.78% and 92.31%, respectively; and the specificity is 91.58%. Compared with the early stage (T1-T2) samples, a higher diagnostic sensitivity is obtained for the advanced T stage (T3-T4) samples. This result is consistent with a previous study of blood plasma SERS [6]. In advanced T stage cancer (T3-T4), the abnormal metabolism is more serious than in the early stage (T1-T2). In addition, advanced T stage cancer is probably accompanied by distant metastasis, which results in complex changes in serum proteins compared with normal subjects.

Discussion
The main objective of this paper is to develop a robust SERS spectra analysis method for the simultaneous screening of two or more different types of cancer. For this, the membrane electrophoresis method was used for the purification of serum proteins from two types of cancer subjects (liver cancer and nasopharyngeal cancer) and normal subjects. The serum proteins were then mixed with Ag NPs for SERS measurement, and the PLS-SVM algorithm was employed to build the diagnostic model for SERS spectra classification and prediction. Traditional analysis of serum protein SERS is more concerned with the classification of cancer subjects versus normal subjects. This study pays more attention to the diagnostic ability of the PLS-SVM model on the unknown testing set. Moreover, in previous studies, each type of cancer was discriminated from normal separately (e.g., liver cancer vs. normal; colorectal cancer vs. normal; gastric cancer vs. normal) [11,23]. However, simultaneous detection of various cancers in a single test is a practical requirement for clinical application. In this study, three groups of serum SERS spectra belonging to LC, NC, and normal subjects were simultaneously introduced into the PLS-SVM model as input data for analysis. The results demonstrated that the membrane electrophoresis based SERS technique in conjunction with the PLS-SVM diagnostic algorithm has great potential for simultaneous screening of different types of cancer, which is more convenient for clinical analysis and applications. PLS and PCA methods were used for dimensionality reduction of the SERS spectral data. Both of these methods map the SERS spectra to the feature space and extract a few components as combinations of the original spectral data, but they yield the components in different ways. PCA extracts a set of orthogonal principal components in the multidimensional SERS spectral data set that best explain the significant differences in the spectra.
In PCA, the relationship between input and output variables is not considered, and all input variables are given the same weight in the process of normalization (the input spectral data set is often scaled to zero mean and unit variance) [36]. In contrast, PLS pays more attention to the relationship between input and output variables and performs better in finding the input variables that have the closest relationship with the output variables. PLS yields components (latent variables) that maximize the group separation. Therefore, the PLS components explain the diagnostically relevant variations rather than the most significant differences in the spectra. Kettaneh et al. have demonstrated in simulations that PLS can achieve its minimum mean square error with fewer components than the PCA approach [37], and our findings (as shown in Fig. 4) are consistent with this report. Moreover, in Fig. 4, the second component in PCA increases the prediction error of the model, indicating that the combination of predictor variables contained in this component is not strongly correlated with the response variables. This is because PCA constructs components to explain variation in the process variables, not the response variables [16].
Furthermore, as summarized in Table 3, the diagnostic performance of PLS-SVM is superior to that of the PCA-LDA algorithm. There may be two reasons: on the one hand, the PCA technique missed some important diagnostic information during the data analysis, such as the relationship between input and output variables; on the other hand, the boundary between cancer and normal serum SERS spectra is nonlinear and cannot be easily handled by linear algorithms such as LDA [7]. In addition, the analysis results show that the diagnostic accuracy of the traditional methods (PCA-LDA and PCA-SVM) on the unknown testing set is between 80% and 85.33%, while the diagnostic accuracy of PLS-SVM is 90.67%. This result indicates that the PLS-SVM method has great potential for the diagnostic screening of new testing subjects.

Conclusion
In this study, the serum membrane electrophoresis based SERS technology combined with PLS-SVM was successfully implemented for the classification and prediction of subjects from normal volunteers, LC patients and NC patients. The RBF kernel SVM diagnostic model based on the PLS components classified the SERS spectra of normal and two types of cancer simultaneously with high accuracy (95.09%). In addition, a diagnostic accuracy of 90.67% was also achieved by PLS-SVM in the unknown testing set. PCA-LDA and PCA-SVM algorithms were also applied to classify the same data set for assessing the performance of PLS-SVM, and the results demonstrated that the diagnostic performance of PLS-SVM is superior to that of PCA-LDA and PCA-SVM algorithms. This exploratory study demonstrates that the membrane electrophoresis based SERS combined with PLS-SVM has great potential for non-invasive screening of cancer.
In the future, we will collect more samples with different cancer stages to verify the reliability of this method and develop more powerful algorithms to improve this SERS analysis method for accurate cancer diagnosis.