Raman spectroscopy for rapid and inexpensive diagnosis of echinococcosis using the adaptive iteratively reweighted penalized least squares-Kennard–Stone-back propagation neural network

A rapid and inexpensive method for the screening and diagnosis of echinococcosis is proposed, based on Raman spectroscopy combined with an improved neural network. We use the adaptive iteratively reweighted penalized least squares (airPLS) algorithm to deduct the fluorescence background from the Raman spectra of healthy people and echinococcosis patients. The processed data were compressed into principal components by the partial least squares (PLS) method, and the Kennard–Stone (KS) algorithm was used to divide them into a training set and a testing set. Finally, the data were put into the back propagation (BP) neural network for modeling and prediction. The results show that the true positive rate of echinococcosis diagnosis is (94.2857 ± 4.0721)%, the true negative rate is (95.2381 ± 0)% and the overall accuracy rate is (94.6939 ± 2.3269)%. A comparison with three other algorithms shows the superiority of the proposed algorithm. Raman spectroscopy combined with the airPLS-KS-BP algorithm can achieve fast and accurate diagnosis of echinococcosis.

Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Introduction
Echinococcosis is widely distributed in many developing countries, and Xinjiang, China, is one of the high-incidence areas [1]. Studies have shown that, in the absence of treatment, about 95% of echinococcosis patients will die within a decade [2]. Early diagnosis and intervention is an effective approach to reducing the morbidity and mortality of echinococcosis. At present, conventional examination methods include diagnosis from clinical symptoms, medical imaging and immunological diagnosis [3][4][5]. However, these methods have drawbacks, such as expensive instruments, cumbersome procedures and the need for trained operators [6]. Therefore, it is very important to develop a fast, accurate and low-cost detection method for the diagnosis of echinococcosis.
Raman spectroscopy is suitable for the analysis of biological samples [7,8], and it is finding more applications in medical diagnosis and research [9][10][11]. At present, one of the main challenges for Raman spectroscopy in clinical applications is that the auto-fluorescence of biological tissue, which is excited by the laser light source, is superimposed on the Raman bands, while the Raman signal intensity is only about 10^-8 times the original excitation intensity [12]. As a result, the original spectrum cannot reflect the essential information of the cells. Therefore, it is necessary to reduce and deduct the fluorescence background.
Currently, the fluorescence background problem is addressed mainly through surface-enhanced Raman scattering (SERS) or through data processing. SERS has high sensitivity and specificity, so it has shown good application prospects in clinical detection [13]. However, some SERS materials depend on the specific sample and are not universal [14]; the SERS-active substrates commonly used in the medical field, gold and silver nanoparticles, suffer from instability and uneven enhancement [15]; and SERS chips are very expensive. Post-processing of the Raman spectrum is also an effective way to deduct the fluorescence background. At present, polynomial fitting, wavelet transforms and derivatives are the three most popular algorithms for reducing the fluorescence background of Raman spectra [16][17][18]. However, the accuracy of manual polynomial fitting depends on the user's experience, and automatic polynomial fitting performs poorly at a low signal-to-noise ratio. The wavelet transform may lose some useful spectral information when reconstructing the waveform, causing spectral distortion. The derivative algorithm changes the shape of the original peaks. To address the problems of these three methods, Zhang et al proposed the adaptive iteratively reweighted penalized least squares (airPLS) algorithm [19], which has achieved good results in deducting the fluorescence background and has already been widely applied in various research fields [20,21].
The choice of disease prediction and analysis method also affects the efficiency and accuracy of the prediction results. At present, the main prediction methods are regression analysis and artificial neural networks (ANNs). Among the latter, the back propagation neural network (BPNN) is characterized by high precision, non-linearity, self-organization, self-learning and self-adaptation, and it has been applied well to cell recognition. In our previous study, Raman spectroscopy combined with the PCA-BPNN model was used to diagnose echinococcosis patients and it achieved good results [22]. The over-fitting of the BPNN was successfully solved by using the principal component analysis (PCA) method. However, the PCA method does not consider the output variables when selecting variables. To address this issue, Jia et al [23] used the PLS algorithm to compress the data into principal components. Compared with the PCA algorithm, the independent variables extracted by the PLS algorithm are more strongly related to the dependent variable, and when dealing with multi-collinear, highly redundant and noisy data, the PLS algorithm has higher accuracy and stability.
This article uses the airPLS algorithm to deduct the fluorescence background and the PLS algorithm to compress the data into principal components. On this basis, because the training set must contain representative data [24], we use the KS-BP algorithm to select the training set and to classify the Raman spectral data of echinococcosis patients.

Instrument
The Raman spectra were acquired with a laser Raman spectrometer (LabRAM HR Evolution RAMAN SPECTROMETER, HORIBA Scientific Ltd.) with a 10× objective at ambient temperature. To guarantee a good signal-to-noise ratio and present fluorescence emission, the samples were excited by the 532 nm green light from a Spectra Physics Ar ion laser.

Serum sampling
The blood serum samples were randomly selected from the echinococcosis key laboratory of the First Affiliated Hospital of Xinjiang Medical University. All samples had a clear diagnosis and complete data, and they comprised 55 cases of healthy people and 68 cases of echinococcosis patients.

Pre-treatment of data
The noise in Raman spectra falls mainly into two categories: one is the thermal noise of the instrument, the other is environmental interference noise. The background interference is mainly the baseline drift caused by the fluorescence background of the Raman spectrum [25]. In this study, the self-normalization algorithm is used to eliminate the noise of the data after removing the fluorescence background. In addition, to reduce the influence of harmful factors such as electronic noise, light scattering and instability of the output laser [26], each sample was scanned twice and the average spectrum was taken as its final spectrum.

Model evaluation
The true positive rate (TPR), the true negative rate (TNR) and the overall accuracy rate (ACC) were used as the evaluation indexes of the ANN model. The three indicators are defined as follows:

$$\mathrm{TPR} = \frac{A}{A + C}, \qquad \mathrm{TNR} = \frac{D}{B + D}, \qquad \mathrm{ACC} = \frac{A + D}{A + B + C + D}$$

where A, B, C and D represent the numbers of true-positive, false-positive, false-negative and true-negative samples, respectively; 'positive' denotes a person suffering from echinococcosis, while 'negative' denotes a normal case.

Figure 1 shows representative Raman spectra from healthy people and echinococcosis patients. Figure 2 shows Raman spectra of different samples with echinococcosis. It can be seen from figure 1 that the shapes of the two spectral curves are similar, but the characteristic peaks are different, which is the basis of qualitative judgment. It can be seen from figure 2 that the Raman spectra of different patients suffering from echinococcosis are affected by the fluorescence background. This causes a baseline drift in the Raman spectrum, and the fluorescence background varies over a very large range, which adversely affects qualitative judgment. Therefore, it is necessary to deduct the background of the spectrum.
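The three evaluation indexes used in this section follow directly from the confusion counts A, B, C and D. The following is a minimal sketch; the function name `evaluation_indexes` and the example counts are hypothetical and are not taken from the study.

```python
def evaluation_indexes(a, b, c, d):
    """Return (true positive rate, true negative rate, overall accuracy).

    a: true positives  (patients classified as patients)
    b: false positives (healthy people classified as patients)
    c: false negatives (patients classified as healthy)
    d: true negatives  (healthy people classified as healthy)
    """
    tpr = a / (a + c)
    tnr = d / (b + d)
    acc = (a + d) / (a + b + c + d)
    return tpr, tnr, acc


# Hypothetical counts, for illustration only.
tpr, tnr, acc = evaluation_indexes(a=9, b=1, c=1, d=9)
print(f"TPR={tpr:.2%}  TNR={tnr:.2%}  ACC={acc:.2%}")
```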

Fluorescence background deduction
The airPLS algorithm is a method proposed in recent years to correct spectral baseline drift. It can effectively deduct the fluorescence background in the Raman spectrum, and its baseline estimation is fast and flexible [27]. Therefore, this paper adopts the airPLS algorithm to eliminate the Raman fluorescence background. Figure 3 shows the original and corrected spectra of a representative sample. It can be seen that the algorithm effectively deducts the Raman fluorescence background while maintaining the original shape of the spectral peaks.
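The idea behind airPLS can be illustrated with a short sketch: a penalized least squares (Whittaker) smoother is fitted repeatedly, and after each fit the points lying above the current baseline are given zero weight, so that the next fit is pulled towards the points below it. The code below is a simplified dense-matrix illustration and not the implementation used in this study; the parameter values (`lam`, `order`, `max_iter`, `tol`) and the synthetic spectrum are assumptions chosen for the demonstration.

```python
import numpy as np


def whittaker_smooth(x, w, lam, order):
    """Weighted penalized least squares: solve (W + lam * D'D) z = W x."""
    m = x.size
    # Finite-difference matrix of the requested order, shape (m - order, m).
    D = np.diff(np.eye(m), n=order, axis=0)
    A = np.diag(w) + lam * (D.T @ D)
    return np.linalg.solve(A, w * x)


def airpls(x, lam=1e5, order=2, max_iter=15, tol=1e-3):
    """Estimate a baseline by adaptive iteratively reweighted penalized least squares."""
    x = np.asarray(x, dtype=float)
    w = np.ones(x.size)
    for i in range(1, max_iter + 1):
        z = whittaker_smooth(x, w, lam, order)
        d = x - z
        neg = d < 0
        dssn = np.abs(d[neg]).sum()
        # Stop once the signal lying below the baseline is negligible.
        if dssn < tol * np.abs(x).sum():
            break
        # Points above the baseline are treated as peaks (weight 0);
        # points below it get weights that grow with the iteration number.
        w = np.zeros(x.size)
        w[neg] = np.exp(i * np.abs(d[neg]) / dssn)
        w[0] = w[-1] = np.exp(i * d[neg].max() / dssn)
    return z


# Synthetic spectrum (assumed for illustration): linear background plus two peaks.
t = np.arange(300)
baseline = 5 + 0.02 * t
peaks = 10 * np.exp(-0.5 * ((t - 100) / 3) ** 2) + 6 * np.exp(-0.5 * ((t - 200) / 4) ** 2)
spectrum = baseline + peaks

estimated = airpls(spectrum)
corrected = spectrum - estimated
```

For real spectra with thousands of points, a sparse banded solver would be preferable to the dense `np.linalg.solve` call used here, and `lam` would need to be tuned to the width of the Raman bands.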

Data compression
If the full-spectrum data are put directly into the neural network as variables for modeling, the amount of computation will be too large, and not all of the data will be useful for modeling. Therefore, the spectral data which are weak and not related to the sample should be eliminated. In this paper, the PLS algorithm is used to compress the data into principal components, and the optimal number of principal components is determined by the 10-fold cross validation method.
The spectral samples with and without background deduction are processed by the PLS algorithm. The first, second and third principal components of the score matrix are extracted for plotting and analysis. As can be clearly seen from figure 4(a), without background deduction, the samples of healthy people and echinococcosis patients cannot be separated. After background deduction, the samples of healthy people and echinococcosis patients can be well separated, and the clustering of each class is significantly improved.
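The compression step described above can be sketched with the NIPALS form of PLS for a single response variable, which extracts score vectors that have maximal covariance with the class label. This is an illustrative sketch only: the function name `pls1_scores`, the random stand-in data and the fixed choice of three components are assumptions, and the 10-fold cross validation used to choose the number of components is omitted.

```python
import numpy as np


def pls1_scores(X, y, n_components=3):
    """Return the first n_components PLS score vectors (NIPALS, single response)."""
    Xk = np.asarray(X, dtype=float)
    yk = np.asarray(y, dtype=float)
    Xk = Xk - Xk.mean(axis=0)   # centre each spectral variable
    yk = yk - yk.mean()         # centre the response
    scores = []
    for _ in range(n_components):
        w = Xk.T @ yk                 # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # score vector for this component
        tt = t @ t
        p = Xk.T @ t / tt             # X loading
        q = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - q * t               # deflate y
        scores.append(t)
    return np.column_stack(scores)


# Random stand-in data (not real spectra): 40 samples, 200 spectral variables.
rng = np.random.default_rng(0)
labels = np.repeat([1.0, 2.0], 20)
spectra = rng.normal(size=(40, 200))
spectra[:, 50:60] += labels[:, None]  # a band whose intensity depends on the class

T = pls1_scores(spectra, labels, n_components=3)
```

Each row of `T` holds the three scores that would replace the full spectrum of one sample as input to the classifier.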

Sample division
According to the 10-fold cross validation method, the number of PLS principal components is determined to be three, so the first three principal components are taken as the new variables to construct the sample matrix, and the BPNN is used for classification. To enable the BPNN to predict new samples correctly, the variation of the samples to be analyzed must be included in the training set as far as possible, so an effective algorithm must be used to select representative samples which cover the whole range of variation in the bio-information components as far as possible. In this study, the KS algorithm is adopted to select 74 samples (60%) as the training set and 49 samples (40%) as the testing set from the 123 samples.
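The Kennard–Stone selection can be sketched as follows: it starts from the two samples that are farthest apart and then repeatedly adds the sample whose distance to its nearest already-selected sample is the largest. The sketch assumes Euclidean distance in the space of the three PLS scores; the function name `kennard_stone` and the random stand-in data are hypothetical.

```python
import numpy as np


def kennard_stone(X, n_select):
    """Split sample indices into (selected, remaining) by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # For each remaining sample, distance to its nearest selected sample.
        min_d = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining


# Tiny worked example on a line: points 0, 1, 2, 10, 5.
small_sel, small_rest = kennard_stone([[0.0], [1.0], [2.0], [10.0], [5.0]], 3)

# Stand-in score matrix with the sample counts used in this study (123 -> 74 + 49).
rng = np.random.default_rng(1)
score_matrix = rng.normal(size=(123, 3))
train_idx, test_idx = kennard_stone(score_matrix, 74)
```

In the small example, the two extreme points are chosen first and the third pick is the point midway between them, which is how the method spreads the training set over the whole range of the data.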

Modeling and results comparison
The input layer and output layer structures of the BPNN are determined by the practical application, and the key to the design is the structure of the hidden layer. In this paper, a three-layer BPNN structure is adopted: the input layer takes the first three principal components, the number of output layer nodes is one, and the number of hidden nodes is calculated according to

$$m = \sqrt{n + l} + a$$

where m is the number of hidden layer nodes, n is the number of output layer nodes, l is the number of input layer nodes and a is a constant between 1 and 10.
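As a worked illustration of the hidden-layer sizing, the sketch below assumes the commonly used empirical rule m = sqrt(n + l) + a and lists the candidate values of m for the network described here (l = 3 inputs, n = 1 output, a from 1 to 10). The function name `hidden_node_candidates` is hypothetical, and the rule itself is an assumption about the formula intended in the text.

```python
import math


def hidden_node_candidates(n_inputs, n_outputs, a_values=range(1, 11)):
    """Candidate hidden-layer sizes from m = sqrt(n + l) + a (assumed empirical rule)."""
    base = math.sqrt(n_inputs + n_outputs)
    return [round(base + a) for a in a_values]


# Three inputs (PLS scores) and one output node.
candidates = hidden_node_candidates(n_inputs=3, n_outputs=1)
print(candidates)
```

Under this assumption the candidates run from 3 to 12, and each would be tried experimentally to pick the best-performing size.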
After many rounds of experiments, the optimal number of hidden layer nodes is taken as three. The transfer function of the hidden layer is the Sigmoid function, the transfer function of the output layer is the Purelin function and the training function is the Trainlm function. The maximum number of iterations is 2000, the error is displayed once every 500 iterations, the target error is 0.000 000 01, the learning rate is 0.05 and the remaining training parameters keep their default values. The target output values for healthy persons and echinococcosis patients are set to one and two, respectively. When the output value is less than or equal to 1.5, the sample is judged to be a healthy person; otherwise it is judged to be an echinococcosis patient.
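The structure and decision rule described above can be sketched as a forward pass through a 3-3-1 network followed by thresholding at 1.5. This is a sketch under stated assumptions: the logistic function stands in for the Sigmoid transfer function, the weights are random and untrained, and the Levenberg–Marquardt training performed by Trainlm is not reproduced. All names in the code are hypothetical.

```python
import numpy as np


def sigmoid(v):
    """Logistic transfer function (assumed form of the hidden-layer Sigmoid)."""
    return 1.0 / (1.0 + np.exp(-v))


def bpnn_forward(X, W1, b1, W2, b2):
    """Forward pass: sigmoid hidden layer, linear (Purelin-style) output node."""
    hidden = sigmoid(X @ W1 + b1)
    return hidden @ W2 + b2


def classify(outputs, threshold=1.5):
    """Outputs <= threshold are judged healthy, otherwise echinococcosis."""
    return np.where(np.asarray(outputs) <= threshold, "healthy", "echinococcosis")


# Untrained random weights for a 3-3-1 network, for illustration only.
rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=3), 1.5

inputs = rng.normal(size=(5, 3))        # five samples, three PLS scores each
outputs = bpnn_forward(inputs, W1, b1, W2, b2)
decisions = classify(outputs)

# The decision rule applied to hand-picked output values.
rule_demo = classify([1.0, 1.5, 1.51, 2.2])
```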
The initial weights of the BPNN were randomly initialized by the network [28], so in order to eliminate the random …

It can be seen that the true positive rate, the true negative rate and the overall accuracy rate are all improved by the airPLS-PLS-KS-BPNN model. Deducting the background with the airPLS algorithm effectively eliminates the spectral drift and noise, compressing the data with the PLS algorithm gives the model higher precision and stability, and dividing the training set with the KS algorithm improves the representativeness of the samples. This demonstrates that the airPLS-PLS-KS-BPNN model can achieve a rapid and accurate prediction of echinococcosis.

Conclusion
Healthy people and echinococcosis patients are distinguished by Raman spectroscopy combined with the airPLS-PLS-KS-BPNN model. The experimental results show that the airPLS-PLS-KS-BPNN model can better diagnose echinococcosis. Compared with traditional diagnostic methods, it has the advantages of low cost, simple operation and fast analysis, so it is suitable for rapid and accurate diagnosis of echinococcosis. In the next step, we will conduct more clinical trials and analyze their experimental data to improve the accuracy and reliability of the diagnosis.