Analysis of hepatitis C infection using Raman spectroscopy and proximity based classification in the transformed domain.

This work presents a diagnostic system for hepatitis C infection using Raman spectroscopy and proximity based classification. The proposed method, denoted RS-PCA-Prox, applies a proximity based machine learning technique to transformed Raman spectra. First, Raman spectral data is baseline corrected by subtracting noise and low intensity background. A feature transformation of the Raman spectra is then applied, not only to reduce the dimensionality of the features but also to learn different deviations in Raman shifts. The proposed RS-PCA-Prox shows significant diagnostic power, with accuracy, sensitivity, and specificity of 95%, 0.97 and 0.94, respectively, in the PCA based transformed domain. A comparison of RS-PCA-Prox with linear and ensemble based classifiers shows that proximity based classification performs better at discriminating HCV infected individuals and is able to differentiate infected individuals from normal ones on the basis of molecular spectral information. Furthermore, it is observed that the characteristic spectral changes are due to variation in the intensity of lectin, chitin, lipids, ammonia and viral protein as a consequence of the HCV infection.


Introduction
Hepatitis is a serious health problem worldwide and a leading cause of morbidity and mortality. According to the World Health Organization (WHO), hepatitis C virus (HCV) infection has become a global health issue: it infects nearly 3-4 million people annually, and 350,000 people die as a result of HCV related diseases. It is a blood-borne infectious disease associated with the development of liver cirrhosis, liver failure and hepatocellular cancer [1]. Symptoms of the disease depend on the type of infection, ranging from asymptomatic behavior in acutely infected individuals to the appearance of jaundice, abdominal pain and decreased appetite in chronic patients [2].
There is currently no vaccine for hepatitis C on the market; however, many are in the development phase. Treatment therefore depends on early and efficient diagnosis of the infection. The established methods used for diagnosis, medication, and determination of the response to antiviral treatment are serological and molecular tests [3,4]. Serological assays use the Enzyme Linked Immunosorbent Assay (ELISA) to screen for antiviral antibodies, whereas molecular tests detect and quantify viral RNA by employing an RT-PCR protocol [5]. Although wet-lab methods are useful in diagnosing the disease, they are expensive, time-consuming and error-prone. High specificity and high sensitivity are both needed for the diagnosis of HCV infection, but the two are rarely achieved together. Thus, there is a pressing need for a robust, sensitive, specific and economical laboratory test that accurately detects the disease by differentiating HCV infection from false positive HCV antibody results [4].
In the past two decades, the use of Raman spectroscopy has increased in the detection and diagnosis of infectious and genetic diseases [6]. Raman spectroscopy is based on an inelastic scattering of light photons after an interaction with intra-molecular bonds of a sample being probed. The shift produced in the frequency of scattered light photons provides information about the molecular composition of the given sample. Diseases are linked to change in molecular morphology and composition that results in deviation from the normal molecular vibrational pattern. This distinction thus serves as a phenotypic marker for disease detection in Raman spectroscopy [7].
Due to the complex structure of Raman spectra of biological samples, one cannot differentiate the spectra of pathological samples from normal ones with the naked eye. This shortcoming of Raman spectroscopy is usually overcome by combining it with multivariate statistical analysis and classification methods such as principal component analysis (PCA), linear discriminant analysis (LDA), and support vector machines (SVM) [8]. Different studies have utilized Raman spectroscopy based blood serum analysis for the diagnosis and detection of biochemical alterations in various diseases such as dengue, cancer, malaria and hepatitis B [6,9-12]. In the current study, Raman spectroscopy together with machine learning approaches is exploited for the detection of disease related spectral changes in HCV infected individuals. Raman spectra of the blood serum of hepatitis C positive and negative individuals are used to develop a diagnostic system that applies a proximity based machine learning technique to the transformed Raman spectrum and is denoted as RS-PCA-Prox. A Raman spectrum generates feature rich data that includes redundant and non-discriminative features. Therefore, the proposed study is augmented with different feature transformation techniques in order to analyse their potential for learning the alterations in Raman shifts. Distinct information in the spectrum (features comprising maximum variance) is extracted through orthogonal transformation and through maximization of the margin between instances of opposite classes, by performing PCA, factor analysis (FA), and large margin nearest neighbor (LMNN). The performance of the proposed system is compared with linear and ensemble based classifiers. It is observed that proximity based classification is quite promising for learning the deviation in the Raman shifts once the spectra are transformed using PCA.
Moreover, PCA based feature transformation coupled with machine learning techniques can enhance the predictive power of Raman spectra.

Sample collection and preparation
Blood serum of 227 individuals, both male and female and of different age groups, is collected from Holy Family Hospital Rawalpindi, Pakistan. Out of 227 samples, 105 are from healthy individuals, whereas 122 samples are from HCV infected individuals. None of the HCV infected patients considered for this study had any other infection or disease. Non-heparinized peripheral blood is used for the study; a quantity of 3 ml is drawn with a syringe and stored in clot activator tubes (HebeiXinle, Sci&Tech Co. Ltd., China). Samples from these subjects have been collected at different times and brought from the collection point to the laboratory in a specially designed container with ice packs. Serum from all these samples has been extracted on the same day. This study is conducted after obtaining written permission from each patient and from the ethical commission of Rawalpindi Medical College, Pakistan.

Acquiring Raman spectra
For the recording of Raman spectra, about 10 μl of blood serum from each sample is placed on an aluminum substrate. A Raman system (PeakSeeker Pro, Agiltron USA) has been used for recording the spectra. This system consists of a laser source, emitting at 785 nm, coupled with a microscope. Before the acquisition of Raman spectra of the blood serum samples, the system is calibrated to the 520 cm⁻¹ Raman peak of a silicon wafer. The spectra of all samples have been acquired in the spectral range of 300-1800 cm⁻¹. Laser light is focused on the sample surface through a microscope objective with 10x magnification and a numerical aperture (NA) of 0.25. The Raman scattered light has been collected with the same objective in the back scattering configuration. An acquisition time of 5 seconds with a laser power of 30 mW is used for the recording of each spectrum. For each sample, three spectra are recorded and their average is used as the representative spectrum.

Data pre-processing
A strong fluorescence signal, intrinsically emitted by biological compounds, is always present in the Raman spectrum of biological samples [13]. Before spectral analysis, it is necessary to filter out background noise from the spectrum [13-17]. Smoothing, de-noising, baseline correction and normalization methods have been used for background cleaning. Initially, the spectra are smoothed by applying a moving average Savitzky-Golay filter over a span of 5 points with a polynomial of order 3. The baseline of the peaks has been corrected by employing the "msbackadj" function, which uses spline approximation [18,19]. After this, all spectra have been vector-normalized.
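The smoothing and normalization steps can be illustrated with a minimal Python sketch. It is not the authors' Matlab code: the function names are hypothetical, the fixed weights are the standard Savitzky-Golay coefficients for a 5-point window and a cubic polynomial, the edge points are simply left unchanged, and the spline baseline correction performed by "msbackadj" is not reproduced here.

```python
import math

# Standard Savitzky-Golay smoothing weights for a 5-point window and a
# polynomial of order 3 (the same weights hold for order 2).
SG_5_3 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol_5_3(intensities):
    """Smooth a spectrum; the two points at each edge are left unchanged
    (a simplification made for this sketch)."""
    smoothed = list(intensities)
    for i in range(2, len(intensities) - 2):
        window = intensities[i - 2:i + 3]
        smoothed[i] = sum(w * v for w, v in zip(SG_5_3, window))
    return smoothed


def vector_normalize(intensities):
    """Scale a spectrum to unit Euclidean (L2) norm."""
    norm = math.sqrt(sum(v * v for v in intensities))
    if norm == 0:
        return list(intensities)
    return [v / norm for v in intensities]


# Toy spectrum (made-up values): a linear trend with one spike at index 3.
spectrum = [1.0, 2.0, 3.0, 9.0, 5.0, 6.0, 7.0]
processed = vector_normalize(savgol_5_3(spectrum))
```

In practice a library routine such as `scipy.signal.savgol_filter` with `window_length=5` and `polyorder=3` would usually replace the hand-written filter; the explicit weights are used here only to keep the sketch dependency-free.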

Classification system
In order to develop the classification system, data samples are randomly permuted to avoid bias towards any specific sample during the training phase. Preprocessing of the data for model development involves dimensionality reduction, which is attained by employing PCA, FA and LMNN implemented in Matlab 2016a [20]. The discrimination performance of the classifiers is compared using receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC).
In the ensemble based classifier (RF), an unpruned tree is grown for each sample, and the node split α is defined by selecting the best predictor $x_t$ from a subset of randomly chosen predictors at each node $t$. For a new data instance, the class label $j$ is assigned based on the majority of the labels $w_i$ predicted by the base classifiers, as in Eq. (1) [22,23].

kNN based classification
kNN is a proximity based classification technique in which each instance is classified on the intuition that it belongs to the class of its k nearest neighbors. In the k nearest neighbors rule, the distance $d$ of test instance $q$ is calculated from its k nearest neighbors $x_i$, each belonging to a class $y_i$, and $q$ is assigned the class $y_j$ held by the majority of these k neighbors [24]. If the vote assigned by $x_i$ to $q$ is the same as the class label $y_j$, the indicator function of $(y_j, y_i)$ returns 1.
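The voting rule can be sketched in a few lines of Python. This is only an illustration: the function name and toy points are hypothetical, and the Euclidean distance is an assumption, since the distance metric is not stated in the text.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Assign `query` the majority label among its k nearest training points."""
    # Euclidean distance (assumed metric) from the query to every instance.
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    ]
    # Keep the k closest neighbours; each one casts a single vote.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy two-dimensional features (made-up values, not Raman data).
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["normal", "normal", "normal", "HCV", "HCV", "HCV"]
label = knn_predict(train_x, train_y, (0.95, 1.0), k=3)
```

With two classes, an odd value of k avoids tied votes, which is one common reason for choosing values such as 3 or 5.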

LDA based classification
LDA is a linear classifier. It separates the two classes by projecting onto a hyperplane that maximizes the distance between the means of the two classes while minimizing the within-class variance. It therefore requires the inter-class and intra-class scatter matrices, and the projection plane is defined through $J(x)$, known as the Fisher linear projection criterion. The input vector $x$ that maximizes this criterion is termed the Fisher optimal projection axis, $x_{opt}$, which is defined as in Eq. (4) [25].
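For reference, the Fisher criterion in its standard textbook form is sketched below. The symbols $S_b$ (inter-class scatter matrix) and $S_w$ (intra-class scatter matrix) are notation assumed here and may differ from the notation used in the authors' equations.

```latex
J(x) = \frac{x^{T} S_{b}\, x}{x^{T} S_{w}\, x},
\qquad
x_{opt} = \arg\max_{x} J(x)
```

The numerator grows with the separation of the projected class means and the denominator with the spread of each class around its own mean, so maximizing the ratio rewards projections that do both.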

SVM based maximum margin classification
SVM is a linear supervised machine learning algorithm. SVM takes input samples $X$ in the form of pairs $(x_i, y_i)$, where $x_i$ is a sample in a feature space $S$ and $y_i$ is its label. For a binary classification problem, SVM linearly separates the patterns by defining a hyperplane (Eq. (5)) that maximizes the distance between the closest training instances of opposite classes. The training samples nearest to the hyperplane are known as support vectors. If the two classes are linearly separable, the hyperplane divides them in such a way that all samples of $S$ that belong to the same class lie on one side. The optimal hyperplane is defined by imposing the constraints given in Eqs. (6) and (7) [26]. In Eq. (5), $w$ is a weight vector that is orthogonal to the hyperplane, whereas $c$ is the bias.
In PCA, the linear transformation function $L$ is used to maximize the variance of the projected inputs, subject to the constraint that the transformation function defines a projection matrix. The variance of the projected inputs is expressed in terms of the covariance matrix $C$. Each transformed component is orthogonal to the successive component and is based on finding the variance in the direction of its eigenvector [27].
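The hyperplane and the separability constraints are commonly written as follows. This is the standard hard-margin form, given here as a sketch; the label coding $y_i \in \{+1, -1\}$ is an assumption, and the authors' Eqs. (5)-(7) may use different notation.

```latex
\begin{aligned}
w^{T} x + c &= 0 \\
w^{T} x_i + c &\ge +1 \quad \text{if } y_i = +1 \\
w^{T} x_i + c &\le -1 \quad \text{if } y_i = -1
\end{aligned}
```

Under these constraints the margin between the two classes equals $2 / \lVert w \rVert$, so maximizing the margin amounts to minimizing $\lVert w \rVert$.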
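The variance-maximization problem described above can be written compactly as below. This is a sketch of the usual formulation; the sample mean $\mu$ and the sample count $n$ are symbols introduced here for illustration.

```latex
\max_{L} \; \operatorname{Tr}\!\left(L\, C\, L^{T}\right)
\quad \text{subject to} \quad L L^{T} = I,
\qquad
C = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^{T}
```

The maximizing rows of $L$ are the leading eigenvectors of $C$, and the corresponding eigenvalues give the variance captured by each component.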

Results and discussion
Raman spectroscopy generates multiple spectra for each individual; manual spectral analysis therefore becomes time-consuming, error-prone and liable to human subjectivity. Machine learning based diagnostic systems have paved the way for the simultaneous analysis of multiple samples in a short time and for highly accurate diagnosis of disease with a low error rate. In order to screen HCV infected individuals, the proposed RS-PCA-Prox analyses Raman spectrum data and discriminates between infected and normal individuals by employing kNN in the transformed domain. The performance of the proposed RS-PCA-Prox is compared with linear and ensemble based classifiers.

Raman spectral data analysis
Blood serum of HCV positive and negative individuals is used for Raman spectroscopic analysis. Human blood serum contains different biomolecular components such as lipids, fats, vitamins, minerals, hormones, glucose and immunoglobulins (IgMs) [30]. Raman peaks in the spectrum correspond to different biomolecules. Molecular information is assigned to each spectral peak based on vibrational bond information [31]. Mean Raman spectra of normal and HCV infected blood serum are shown in Fig. 2. For clarity, the normalized mean spectrum of the control group is shown in blue (dashed line), whereas the mean spectrum of the HCV infected sera samples is shown in red (solid line). The diseased and normal samples have an almost similar pattern, but there is a distinct difference in the intensity of the Raman shifts. The variations in Raman intensity are found at 712, 800, 875, 911, 1004, 1166, 1220, 1250-1300, 1340, 1393-1430, 1449 and 1675 cm⁻¹.
Fig. 2. Plot of the mean spectral difference between HCV infected (red) and normal blood sera (blue).

Raman spectra transformation using PCA
The analysed Raman spectra contain information on the components that profile the blood serum composition of normal and infected patients (Fig. 2). Compared with normal individuals, the Raman spectra of HCV infected individuals show discernible changes in intensity for various biomolecular components (Fig. 2). These distinctions between HCV positive and negative individuals thus serve as phenotypic markers for disease detection. Therefore, the spectra of biomolecular components such as IgMs, lipids and carbohydrates are used as the feature space for classification system development. However, in this work it is observed that the performance of the proposed RS-PCA-Prox can be enhanced by transforming the Raman feature space. In order to derive the transformed and reduced feature set, the entire spectrum is considered to identify the spectral features that contribute the maximum variance. A PCA based transformation is applied, which reduces the original high dimensional data set to 181 components, of which the first 35 components represent about 80% of the variance and the first 15 components represent about 75% of the spectral variance (Fig. 3). However, the remaining 145 components contribute less than 1% of the variance. In order to find the transformation that best describes the behavior of the Raman data, FA and LMNN based transformations are also analysed in addition to PCA.
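Reading a component count off a cumulative variance plot such as Fig. 3 amounts to the small calculation sketched below. The function name and the eigenvalues are hypothetical toy values, not the variances obtained in this study.

```python
def components_for_variance(eigenvalues, target):
    """Return the smallest number of leading components whose cumulative
    share of the total variance reaches `target` (a fraction in 0..1)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return count
    return len(ordered)


# Toy eigenvalues of a covariance matrix (made-up values).
toy_eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
n_components = components_for_variance(toy_eigenvalues, 0.75)
```

A threshold chosen this way only reflects retained variance; as the following section notes, it does not by itself guarantee the best discrimination.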

Proximity based classification of PCA transformed data
In order to develop the classification system, blood samples are divided in an 80:20 ratio into training and test sets, respectively. The choice of the number of features is critical for the performance of the classifier. The number of features used by the proposed RS-PCA-Prox is selected by performing both a grid search and an analysis of the variance plot. The first few components, which capture the maximum variance, do not necessarily provide the best discrimination power [10]. The distribution of the first two transformed features of PCA, FA and LMNN (Fig. 4) shows that the two classes overlap and cannot be separated by considering only the first two dimensions of the data. The number and choice of features have considerable influence on the accuracy and training time of the classifier. It is observed that training on 15 PCs (contributing ~75% of the variance) (Fig. 3) produces a classifier whose accuracy is similar to that obtained with 35 PCs (contributing 80% of the variance) (Fig. 3); therefore, in order to reduce computational complexity and to avoid overfitting, 15 PCs are used for training. The number of transformed features used for training the classification model is given in Table 1, Table 2 and Table 4. The proximity based classification system RS-PCA-Prox is developed by training on principal components. Once the model is trained, the performance of the classification system is assessed on test data through accuracy, sensitivity and specificity. kNN shows satisfactory diagnostic power on the PCA based transformed data set for 15 PCs, with an average accuracy, sensitivity, and specificity of 95%, 0.97 and 0.94, respectively (Table 1).
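The three performance measures follow directly from the confusion counts of the test set. The sketch below uses the usual definitions, treating HCV infected as the positive class; the function name is hypothetical and the counts are made-up toy values, not those reported in Table 1.

```python
def diagnostic_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion counts."""
    sensitivity = tp / (tp + fn)   # share of infected cases detected
    specificity = tn / (tn + fp)   # share of healthy cases cleared
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return accuracy, sensitivity, specificity


# Made-up counts for illustration only.
accuracy, sensitivity, specificity = diagnostic_metrics(tp=8, tn=9, fp=1, fn=2)
```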

Transformed vs original domain analysis
The performance of kNN in the PCA transformed domain is compared with its performance in the FA and LMNN transformed feature spaces. The accuracy of the diagnostic system decreases by 6.52% and 18.7% on LMNN and FA based transformed features, respectively (Table 2). In order to analyse the importance of feature transformation in learning the deviation of Raman shifts, the classification model is also trained on all Raman shifts in the original domain. The spectral data in the original space consists of 782 features. In the original domain, despite the high dimensional feature space and the resulting higher complexity and computational time, the accuracy drops by 0.3% compared with the PCA transformed feature space (Table 3). This performance drop is mainly due to the presence of redundant and correlated features in the high dimensional Raman spectrum, which cause overfitting of the classification model [32].

RS-PCA-Prox comparison with ensemble classifier
The performance of the proximity based classifier is compared with that of the ensemble based classifier RF, which uses a collection of decision trees. RF gives its best result for an ensemble of 120 decision trees (base learners), with 91% accuracy, 0.88 sensitivity and 0.95 specificity on the PCA reduced test data set (Table 4). Like kNN, RF achieves its best result on the PCA based transformed data set rather than on the LMNN and FA based transformations. Compared with PCA, the accuracy of RF on FA and LMNN is degraded by 1.63% and 12.94%, respectively (Table 4). The comparison of classifiers trained on PCA, FA and LMNN based transformed feature spaces shows that both kNN and RF perform best when the PCA based feature space is used (Table 1, Table 2 and Table 4).

RS-PCA-Prox comparison with linear classifier in PCA domain
In order to assess the performance of linear classifiers on Raman data in comparison with kNN, which is a non-linear classifier, SVM and LDA are applied to the PCA transformed Raman spectrum. The results of LDA and SVM are shown in Table 5. The LDA based classification results are comparable with the kNN results. This suggests that LDA can be utilized for the classification of HCV infected individuals based on biomolecular content with an accuracy of 95%. However, LDA is less sensitive than kNN in detecting HCV positive individuals. Like LDA, SVM is a linear classifier, but its learning method is different. The experimental results of SVM also show acceptable performance on PCA transformed Raman spectrum data, with an average accuracy of 94%; however, its accuracy is 0.88% lower than that of kNN.

Decision surface based analysis of Raman shifts and its biochemical significance
In order to characterize the potential of the identified peaks as biomarkers for the development of a machine learning based diagnostic system, the decision surfaces of kNN and RF are drawn and shown in Fig. 5 and Fig. 6. The decision surface of kNN for the 1000 and 1225 cm⁻¹ Raman shifts with k = 5, shown in Fig. 5, exhibits good separation between the two classes. In order to pinpoint the molecules that are used as discriminating factors by RF, the splitting criterion of one of the decision trees (base classifiers) of the ensemble is shown in Fig. 6. In the decision tree, the Raman shifts at 718, 872, 1004, 1169 and 1250 cm⁻¹ are the characteristic peaks used for classification by RF. The identified peaks can be used as biomarkers for detection of the disease. The Raman shift at 718 cm⁻¹ corresponds to a vibrational band of chitin, which stimulates the type-I and type-II dependent innate immune response to viral infection [33]. Due to the infection, remnants of viral protein are also present in human blood serum; these correspond to 872 cm⁻¹ [34]. The Raman shift at 1004 cm⁻¹ is observed for lectin, which is a carbohydrate binding protein. Mannose-binding lectin is one such molecule which, on attachment to the virus, mediates the lectin complement pathway that kills the virus [35]. The 1169 cm⁻¹ peak corresponds to lipids, whereas 1250 cm⁻¹ is assigned to ammonia, whose level increases in infected individuals because a liver damaged by HCV infection fails to detoxify ammonia [36]. Lipids are associated with proteins that stimulate the destruction of hepatocytes. Due to this destruction, the concentrations of various biomolecules such as lipids, enzymes and proteins change in infected individuals. These identified peaks correlate with the diagnostically significant region of the Raman spectrum (Fig. 2, section 3.1) and also agree with [34,37] and with the findings of Bilal et al., who reported the correlation of 718, 872, 1004, 1169 and 1250 cm⁻¹ with HCV infection [36].

ROC based analysis
The diagnostic power of the RS-PCA-Prox classification system in separating HCV infected from healthy individuals is apparent from the ROC curves shown in Fig. 7. A ROC curve describes the tradeoff between sensitivity and 1-specificity by plotting the true positive rate against the false positive rate for the different possible thresholds of a diagnostic test. The closer the AUC is to one, the more reliable the diagnostic test; as the curve approaches the 45-degree diagonal, the test behaves increasingly like random guessing and becomes less accurate [38]. Fig. 7A shows that the area under the ROC curve of kNN on the PCA based transformed feature space is 0.96, which is better than the performance of kNN on FA and LMNN based transformed features. Similarly, RF also shows its best results on the PCA based transformed feature space (Fig. 7B).
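The AUC can be computed without tracing the curve, because it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below uses this rank-based formulation; the function name, scores and labels are hypothetical toy values, and it is not claimed to be the routine used to produce Fig. 7.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores; tied pairs count as half a win."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy classifier scores (made-up values); label 1 marks the positive class.
auc = roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

A value of 0.5 corresponds to the 45-degree diagonal, and a value of 1.0 to perfect separation of the two classes.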

Conclusion
This work presents the use of Raman spectroscopy in combination with different dimensionality reduction techniques (PCA, FA, and LMNN) and classifiers (kNN, RF, LDA, SVM) for the detection of HCV infection. It is shown that the biomolecular information provided by the Raman spectrum can be exploited by machine learning algorithms to enhance the diagnostic abilities of Raman spectroscopy. Major spectral deviations are caused by variation in the intensity level of lectin, chitin, lipids, ammonia and viral protein as a consequence of hepatitis C infection. Variation in the ratio of phospholipid and lipid concentrations is noted in HCV infected individuals, as these are associated with enzymes that are activated during the destruction of hepatocytes in hepatitis [37]. When the virus attacks the host, the human immune system is activated against the virus. In this response, chitin is released, which enhances the type-I and type-II dependent immune response to the virus [33]. Similarly, mannose-binding lectin proteins are produced by the liver in HCV infected individuals and, upon attachment to HCV, initiate the complement cascade pathway. Some of the biomolecular changes are disease associated, such as liver damage due to hepatitis that raises the level of ammonia in patients [40]. The biochemical characterization of the concentration of these biological molecules is used as an important marker during the development of the machine learning system for detection of HCV infection. The application of the proximity based approach to Raman spectral data is a valuable and promising tool, with an overall accuracy of 95% for the proposed RS-PCA-Prox detection system. The corresponding sensitivity and specificity are 0.97 and 0.94, respectively (Table 1).
The performance measures show that the proximity based approach (kNN) optimizes the design of the RS-PCA-Prox system for spectrum pattern recognition and discrimination of infected individuals as compared to RF and SVM (Table 1, Tables 3-5). This work suggests that, in comparison to FA and LMNN, PCA based transformation of Raman spectrum data is more appropriate for training a machine learning based classification system. PCA projects the data from a higher dimensional to a lower dimensional space by using a linear transformation function and retains the maximum variation of the features, whereas FA assumes a common factor for all instances that models the variance of the features and removes correlated features, due to which it loses the variance of some features [27,39]. LMNN based feature transformation affects the classifier's performance by causing overfitting of the model on the training data, and it decreases the performance of the classifier on test data by 17% in terms of accuracy (accuracy: 77.83%) (Table 2) and in terms of discrimination power (AUC: 0.90) (Fig. 7A). The performance of RF on test data is evaluated by five-fold cross-validation. TP, TN, FP, and FN values are reported for one fold only, whereas average values of sensitivity, specificity, and accuracy are reported over the five folds.
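The partitioning behind five-fold cross-validation can be sketched as follows. This is a generic illustration, not the authors' Matlab procedure: the function name and the fixed random seed are assumptions, and only the sample count of 227 is taken from the study.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffle once so that each fold mixes samples from the whole data set.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 227 samples, as in this study; every sample is tested exactly once.
splits = list(k_fold_indices(227, k=5))
```

Sensitivity, specificity and accuracy would then be computed on each test fold and averaged over the five folds.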