Optimizing algorithm development for tissue classification in colorectal cancer based on diffuse reflectance spectra.

Diffuse reflectance spectroscopy can be used in colorectal cancer surgery for tissue classification. The main challenge in this classification task is to separate healthy colorectal wall from tumor tissue. In this study, four normalization techniques, four feature extraction methods and five classifiers are applied to nine datasets to determine the optimal method for separating spectra measured on healthy colorectal wall from spectra measured on tumor tissue. All results are compared to the use of the entire non-normalized spectra. The optimal classification approach is found to be applying a feature extraction method to non-normalized spectra, combined with a support vector machine or neural network classifier.


Introduction
In both men and women, colorectal cancer is the third most common cancer worldwide [1]. The standard care for patients with colorectal cancer is surgery [2]. During colorectal cancer surgery, the surgeon must rely on visual and tactile feedback to recognize the tumor boundaries. As a result, there are two main issues: complete removal of the tumor and prevention of complications due to overly extensive surgery. To obtain an optimal balance between these two issues, intra-operative tissue recognition would be of great benefit, providing the surgeon with information about the tissue he or she encounters.
Diffuse reflectance spectroscopy (DRS) is an optical technique that has been used for tissue recognition in cancer diagnostics [3][4][5][6][7][8][9][10]. Light over a broad wavelength range is sent into the tissue, where the light interacts by scattering and absorption. Part of this light will be scattered back to the surface of the tissue, where it can be detected. In fiberoptic DRS, light is transported through fibers, both from source to tissue and from the tissue surface to the spectrometers [11]. In the fiberoptic geometry only point measurements can be performed. Using hyperspectral imaging (HSI), the tissue samples are directly illuminated by a broadband light source and the reflected light over the entire illuminated surface is collected by a camera. Therefore, a complete specimen can be imaged at once. With HSI a 3D data-cube is collected, in which two dimensions contain the spatial information about the tissue sample and the third dimension represents the wavelength information for each spatial pixel [10].
Both fiberoptic DRS and diffuse reflection HSI have been used for tissue classification in colorectal cancer before. For HSI, the main applications in colorectal cancer so far have been imaging of pathology slides for automatic pathology classification [12][13][14][15][16] and colonoscopy, in which tumor was distinguished from healthy colorectal mucosa imaged from within the lumen [17][18][19]. The work presented in the current paper aims to detect cancer during colorectal surgery, during which the tumor is approached from outside the lumen. To the best of our knowledge, only one study has focused on the use of HSI in colorectal cancer surgery, in which the tumor is assessed from outside the lumen and must be distinguished from fat and healthy colorectal wall [20]. In this work, performed by our group, tissue samples from colorectal cancer patients were obtained after resection and imaged by two hyperspectral cameras operating in the visual (400-1000 nm) and near-infrared (900-1700 nm) wavelength ranges. The mean accuracy to distinguish fat, healthy colorectal wall and tumor was 0.88 [20].
For fiberoptic DRS, the main focus in colorectal cancer has also been on application in colonoscopy, assessing the mucosa from within the colorectal lumen [21][22][23][24][25]. Only a few papers have been published so far on application during colorectal cancer surgery. In an ex vivo study by Schols et al., tumor could be discriminated from healthy surrounding tissue [7]. In an ex vivo study performed by our own group, the accuracy to distinguish tumor tissue from fat and healthy colorectal wall was 0.95 [26].
An important step in developing a diagnostic algorithm based on spectral information is feature extraction: reducing the data used for classification to the part that contains the crucial information. In previous studies, many different feature extraction and classification techniques have been used to analyze the spectra obtained with fiberoptic DRS or diffuse reflection HSI. In fiberoptic DRS, for instance, a fit algorithm based on photon diffusion theory has been used to extract absorption and scattering parameters from the measured spectra and to translate these into a limited set of parameters describing tissue composition and scattering behavior [27]. Based on these parameters, tissue types could be distinguished [4,5,11,28]. Schols et al. used several amplitudes and gradients for classification [29]. In our group, spectral bands were used for both HSI and DRS data to reduce the number of features [20,26]. Classification of tissue types has been done in several ways. First of all, direct comparison of single features, or simple combinations of features, has been performed to distinguish different tissues [4,[30][31][32][33][34]. In addition, several machine learning techniques have been used for tissue classification, such as classification and regression tree algorithms [5,6,[35][36][37], linear discriminant analysis (LDA) [24,[38][39][40] and support vector machines (SVM) [3,20,26,37,38,[41][42][43].
Previous studies used different analysis techniques to classify tissue during colorectal cancer surgery [6,20,26]. The current study will systematically examine several options to optimize the classification of healthy colorectal wall versus tumor tissue in colorectal cancer. Several different normalization and feature extraction techniques will be used in combination with a variety of classifiers to determine the optimal classification strategy.

Data
In this study, the datasets included were obtained using two different colorectal tissue sample sets. The measurements on tissue sample set 1 were done ex vivo only and were all performed using the same fiberoptic probe, which includes four fiber distances. The measurements on tissue sample set 2 were done both in vivo and ex vivo and were performed with disposable probes, so for each new patient a different probe with an identical design was used. When measurements were performed using fiberoptic DRS, data of two fiber distances were acquired at the same time. Only one diffuse reflection HSI dataset was included, which was obtained on tissue sample set 1 (Fig. 1). This dataset was a combination of data obtained with a visual hyperspectral camera and data obtained with a near-infrared hyperspectral camera. In total, nine datasets were included in the current analysis (Fig. 1). These datasets are separated based on the sample set, ex vivo or in vivo acquisition, fiberoptic DRS or diffuse reflection HSI, and, for fiberoptic DRS, the fiber distance used. To avoid confusion, in the rest of the paper the datasets will be referred to by the assigned numbers shown in Fig. 1.

Fig. 1. Overview of all datasets used in this study. The first column shows the different sample sets; the second column, what type of fiberoptic probe was used (reusable or disposable); the third column, the difference in acquisition time (ex vivo/in vivo) and method (DRS/HSI); and the fourth, the different fiber distances and all datasets with assigned numbers. The light and dark gray areas indicate datasets obtained on the same sample set and datasets obtained with the same DRS probe type, respectively.
In Table 1, an overview is given of the number of patients and the number of spectra in each dataset; for dataset #5 the number of spectra is equal to the number of pixels. Classification in the current study was only performed on colon and tumor tissue, because previous studies have shown that discriminating fat from other tissue types is an easy and already solved task [20,26]. For example, using fiberoptic DRS measurements on ex vivo samples (dataset #4), an accuracy of 1.00 ± 0.00 was obtained for the classification of fat [26]. When using diffuse reflection HSI (dataset #5), the same accuracy of 1.00 was achieved for fat classification [20]. Based on these results, it was decided not to include fat measurements in the current analysis and to focus on the more challenging task of tumor/colon classification.

Preprocessing
Before analysing the spectra, all spectra were calibrated using a white reference, obtained on a Spectralon sample, and a dark reference. In the spectra obtained, intensity differences might have been present that were not due to tissue-specific differences but due to, for instance, a difference between the illumination-source-to-sample distance and the illumination-source-to-Spectralon distance used for calibration. Calibration is performed to standardize all measurements; for this to work, the distance between the illumination source and the measured sample should be the same as the distance between the illumination source and the Spectralon sample. When the tissue thickness varies over the sample, the distance between the illumination source and the tissue will not be constant over the entire sample, causing intensity differences. Moreover, the coupling between the fibers and the spectrometer might cause a variation in the detection efficiency of the signal. To compensate for these intensity differences, spectra can be normalized before classification or feature extraction. In this study, three different normalization techniques were compared to each other and to no normalization. First, spectra were normalized by their intensity at 800 nm [32]. This wavelength was chosen because no absorption was assumed to be present at this wavelength. After normalization, the intensity of all reflectance spectra should be equal to 1 at 800 nm. As a second option, the spectra were normalized by their area under the curve (AUC), setting all AUCs to 1. Finally, a standard normal variate (SNV) normalization was used [44,45]. Using SNV normalization, the mean of each spectrum was set to zero and the standard deviation was set to one. SNV is often used for diffuse reflection HSI to exclude the influence of glare on the spectra [46,47]. However, SNV normalization also removes information on the scattering. As shown in Fig. 2, all four normalization options were evaluated for both the fiberoptic DRS and diffuse reflection HSI datasets.
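The three normalization options described above can be illustrated with a minimal Python sketch (the analysis itself was performed in Matlab); the function name and array layout (one spectrum per row) are our own illustrative choices, not part of the original pipeline.

```python
import numpy as np

def normalize_spectra(spectra, wavelengths, method="none"):
    """Apply one of the three normalization options to each row (spectrum)."""
    spectra = np.asarray(spectra, dtype=float)
    if method == "none":
        return spectra.copy()
    if method == "800nm":
        # divide each spectrum by its intensity at the wavelength closest to
        # 800 nm, so that all spectra equal 1 there
        idx = np.argmin(np.abs(np.asarray(wavelengths) - 800.0))
        return spectra / spectra[:, idx:idx + 1]
    if method == "auc":
        # scale each spectrum so its area under the curve becomes 1; on an
        # evenly sampled wavelength axis the sum is proportional to the integral
        return spectra / spectra.sum(axis=1, keepdims=True)
    if method == "snv":
        # standard normal variate: zero mean, unit standard deviation per spectrum
        mean = spectra.mean(axis=1, keepdims=True)
        std = spectra.std(axis=1, keepdims=True)
        return (spectra - mean) / std
    raise ValueError(f"unknown normalization: {method}")
```

Note that each option rescales every spectrum individually, so any tissue-specific intensity differences between spectra are removed along with the unwanted ones.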

Feature extraction
When the number of features is very large relative to the number of samples in a dataset, some classifiers struggle to train effective models; this is called the curse of dimensionality [48]. It is especially relevant for algorithms that rely on distance calculations, like kNN [49]. Feature extraction and feature selection techniques can be used to avoid the curse of dimensionality. Feature extraction creates a new, smaller set of features that still captures most of the useful information.
Besides using the entire spectrum (1151 features) for classification, four feature extraction methods were evaluated. First, the seven most prominent peaks of a spectrum were selected; from these peaks, the intensity values, the locations and the right-side gradients were used as features (Fig. 3(a)). In the second feature extraction technique, 23 features were selected manually to describe the coarse shape of the spectra while emphasizing the differences between tissue classes. The features included were: locations of maxima (4), locations of minima (2), gradients (9), prominence values (4), widths (2) and an intensity value (1) (Fig. 3(b)). The prominence values were defined as the intrinsic height of a peak relative to neighboring dips. As a third feature set, spectral bands were selected using k-means clustering (Fig. 3(c)) [20,26]. Starting from random start points, the wavelengths were divided into 10 groups using k-means clustering based on the intensity values. If the wavelengths within one group were not contiguous, the group was divided into subgroups containing only contiguous wavelengths, so that the spectra were divided into at least 10 bands. In the final feature extraction method, tissue constituents were determined per spectrum using a fit algorithm based on photon diffusion theory (Fig. 3(d)) [11,50,51]. With this algorithm, reflectance spectra of selected chromophores were fitted to the measured reflectance spectra to obtain the composition of the measured tissue in a number of parameters [27]. For these tissue types, 12 parameters were used in the fit, and these were used as features for the classification. The fit algorithm is only valid for the fiberoptic geometry, and only for measurements in which the distance between the emitting and collecting fiber was at least 1 mm [50].
Therefore, features based on fit results were only extracted for datasets #2, #4, #7 and #9, which are all fiberoptic DRS datasets with fiber distances of at least 1 mm. No normalization was used in the case of tissue optical feature extraction, because the analytical model is based on the reflectance signal, which should not be normalized. In Fig. 2, an overview is given of which datasets were subjected to which combinations of feature extraction and normalization techniques.
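The spectral band selection can be sketched as follows. This is an illustrative Python reimplementation, assuming the wavelengths are clustered on their mean intensity over the (training) spectra, with a small hand-rolled 1-D k-means standing in for Matlab's kmeans; all names are our own.

```python
import numpy as np

def spectral_band_features(spectra, n_clusters=10, n_iter=50, seed=0):
    """Cluster wavelengths by mean intensity, split non-contiguous clusters
    into contiguous bands, and use the mean intensity per band as features."""
    spectra = np.asarray(spectra, dtype=float)
    profile = spectra.mean(axis=0)            # mean intensity per wavelength
    rng = np.random.default_rng(seed)
    # simple 1-D k-means on the intensity profile (random start points)
    centers = rng.choice(profile, n_clusters, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(profile[:, None] - centers[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = profile[labels == k].mean()
    # split clusters into runs of contiguous wavelength indices (the bands)
    bands, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            bands.append(slice(start, i))
            start = i
    # one feature per band: the mean intensity of each spectrum in that band
    features = np.column_stack([spectra[:, b].mean(axis=1) for b in bands])
    return features, bands
```

Because non-contiguous clusters are split, the number of bands (and thus features) is at least the number of clusters, which is why the text speaks of "at least 10 bands".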

Classifiers
In machine learning, the "No Free Lunch" theorem states that there is no one model that works best for every problem [52]. The assumptions of a good model for one problem may not hold for another, so it is common in machine learning to try multiple classifiers and choose the one that works best for the particular application. In this study, five different supervised classification algorithms were used to distinguish colon from tumor: k-nearest neighbor (kNN), a decision tree classifier, LDA, SVM, and a neural network (NN). All classifiers were trained and tested using Matlab 2018a (MathWorks Inc., Natick, Massachusetts, US). For the kNN classifier, five neighbors were used to determine the class. The decision tree was a binary decision tree with Gini's diversity index as the split criterion; the maximal number of branch node splits and the minimum parent size were set to 3 and 10, respectively. For the SVM, a linear SVM was selected. For the NN classifier, a feedforward network was used with a single layer of 20 hidden neurons and a scaled conjugate gradient backpropagation training function. For all classifiers, the features were standardized to zero mean and unit standard deviation per parameter before being used in the classification. The standardization was determined based on the training dataset and thereafter applied to the test dataset.
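The key point of the standardization step, fitting the statistics on the training set only and then applying them to the test set, can be sketched in Python as below; the kNN classifier (k = 5, as in the paper) is shown as the simplest of the five classifiers, and all function names are illustrative rather than the Matlab routines actually used.

```python
import numpy as np

def standardize(train, test):
    """Per-feature zero-mean, unit-variance scaling; statistics are computed
    on the training set only and then applied unchanged to the test set."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    sigma[sigma == 0] = 1.0                   # guard against constant features
    return (train - mu) / sigma, (test - mu) / sigma

def knn_predict(train_X, train_y, test_X, k=5):
    """Minimal kNN (k = 5) by majority vote over Euclidean distance."""
    preds = []
    for x in test_X:
        d = np.linalg.norm(train_X - x, axis=1)
        nearest = train_y[np.argsort(d)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)
```

Reusing the training-set statistics on the test set avoids information leaking from the test data into the trained model.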

Data analysis
The sample sets were partitioned once into a training and a test set: 80% of the patients were used to train and 20% to test the classifiers. The proportion of healthy versus tumor measurements was kept the same in the training and test datasets. The classifiers were evaluated using the accuracy, the area under the ROC curve (AUC) and the Matthews correlation coefficient (MCC). The accuracy and AUC are commonly used performance measures. However, these metrics do not take into account unbalanced classes in a dataset. Therefore, the MCC, which is less sensitive to unbalanced datasets, was added as a performance measure [53]. The MCC ranges from −1 to 1, with 0 indicating a classification no better than random, −1 total disagreement and +1 a perfectly correct classification.
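For reference, the MCC can be computed directly from the confusion-matrix counts. The sketch below assumes binary labels coded 0/1 and follows the common convention of returning 0 when the denominator vanishes.

```python
import numpy as np

def matthews_cc(y_true, y_pred):
    """Matthews correlation coefficient for a binary classification."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    if denom == 0:
        return 0.0        # undefined MCC treated as 0 (no better than random)
    return (tp * tn - fp * fn) / denom
```

Unlike accuracy, the numerator rewards agreement in both classes, so a classifier that always predicts the majority class scores near 0 rather than near the class prevalence.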
In supervised learning, overfitting happens when classifiers are trained on the noise along with the underlying pattern in the data, for instance when the training data are noisy or unrepresentative. Complex models like decision trees are more prone to overfitting. Overfitting can be recognized by examining the classifier variance, i.e. the variability of the model prediction for given data samples. Classifiers with high variance perform well on the training dataset but are not able to generalize to data they have not seen before, e.g. the test dataset. Therefore, overfitting shows up as a large drop from training accuracy to test accuracy.
The classifications, done with all classifiers on all combinations shown in Fig. 2, were first tested for overfitting. This was done by subtracting the test accuracy from the training accuracy for each combination of normalization, feature extraction and classifier, for each dataset. Since there is no general rule defining overfitting, we set the threshold at a decrease in accuracy from train to test of 0.15. If the decrease in accuracy was above 0.15, the classifier was considered to be overfitting during training.
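This criterion amounts to a simple filter over the per-combination train and test accuracies; a hypothetical sketch, with the dictionary keys standing in for the normalization/feature-extraction/classifier combinations:

```python
def flag_overfitting(results, threshold=0.15):
    """Given {combination: (train_accuracy, test_accuracy)}, return the
    combinations whose train-to-test accuracy drop exceeds the threshold."""
    return {name for name, (tr, te) in results.items() if tr - te > threshold}
```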
Thereafter, the three normalization techniques were compared to the use of no normalization prior to classification. The accuracy and MCC values of the non-normalized data were considered as a baseline and were subtracted from the accuracy and MCC values obtained on the normalized data. In this analysis feature extraction methods were not included, so only entire spectra were used. The results were averaged per normalization technique and per dataset to assess if the normalization might have a different influence on different datasets.
A similar analysis was done for the feature extraction methods. These were all compared to the use of the entire spectrum during classification (baseline performance). Again, the difference in MCC and accuracy values compared to baseline were determined, for the selected classifiers and all datasets. However, this time only the selected normalization technique from the previous analysis was used to limit the amount of analyses. A mean difference in MCC and accuracy was obtained per feature extraction method and also the influence of feature extraction techniques for each dataset was obtained.
The combination of normalization and feature extraction was also examined using the difference in accuracy and MCC. The combination of all feature extraction methods and all normalization methods were compared to no normalization and the use of the entire spectrum.
Finally, for each dataset the combination of normalization, feature extraction and classifier with the maximum MCC and accuracy values were extracted from the data. These results were compared to the MCC and accuracy values of each dataset, obtained using the overall most optimal combination of normalization, feature extraction and classifier, as determined by all previous analysis.

Overfitting
To evaluate the classifiers' overfitting, the test accuracy of each classifier was subtracted from the train accuracy. A maximum decrease in accuracy of 0.15 from train to test was chosen as the threshold for overfitting. The histograms of Fig. 4 show the number of times (y-axis) a certain difference between train and test accuracy (x-axis) was obtained, colored per classifier. In Fig. 4(a), the results using no normalization and no feature extraction are shown. These clearly show that the chance of overfitting was largely classifier dependent. For instance, for the NN classifier (green) most results were below the overfitting threshold, and the same holds for the SVM. For the decision tree (orange), on the other hand, all train minus test accuracies were above the threshold, indicating overfitting of the classifier, and the same holds for kNN and LDA. Figure 4(b) shows the normalized data; the distributions for all classifiers are similar to those in Fig. 4(a), so normalization had no influence on the overfitting of the classifiers. Figure 4(c) includes all feature extraction methods, but only on non-normalized data. Feature extraction did improve the overfitting for the kNN and LDA classifiers, as both distributions moved to the left of the graph (Fig. 4(c)) compared to Fig. 4(a). With the use of feature extraction, more train minus test accuracies stayed below the overfitting threshold. Overall, the SVM and NN classifiers showed the least overfitting on all datasets. Therefore, further analysis focused only on the SVM and NN.

Normalization
The added value of normalization for the classification outcome was examined. To this end, the increase in accuracy and in MCC for each normalization compared to no normalization was analyzed. This analysis was performed on the entire spectrum and only with the SVM and NN classifiers selected in the previous section. The average increase in MCC and accuracy is shown in Table 2. Table 2 shows that there was no normalization technique that significantly improved the outcome for all datasets using the SVM or NN classifier. Similar results were obtained for the kNN, decision tree and LDA classifiers, which are not included in Table 2. Because of the large standard deviations, the increase in MCC and accuracy was also evaluated per dataset. An increase was seen for all datasets obtained using the disposable fiberoptic DRS probes (datasets #6 to #9) with the SNV and 800 nm normalization techniques, whereas mainly a decrease was seen for the datasets obtained with the reusable fiberoptic DRS probe (datasets #1 to #4) (Table 3). Normalization by AUC showed a decrease in MCC and accuracy for datasets #8 and #9 as well. For the diffuse reflection HSI dataset (dataset #5), normalization did not seem to affect the results. Because SNV normalization and normalization at 800 nm seemed useful only for some datasets and AUC normalization did not seem useful at all, it was decided to continue without normalization for the comparison of the different feature extraction techniques in the next section.

Feature extraction
Based on the results from the normalization, the first analysis on feature extraction was performed on non-normalized data. Table 4 shows the mean increase in MCC and accuracy for all feature extraction methods compared to no feature extraction, i.e. the entire spectrum. Here, all datasets were used, no normalization was applied and only the results from the SVM and NN classifiers were taken into account. From Table 4, it can be concluded that, overall, using tissue optical features improved classification results compared to the use of the entire spectrum. It should be noted that tissue optical features were only available for the fiberoptic DRS datasets with fiber distances of at least 1 mm (i.e. datasets #2, #4, #7 and #9). For the shape-based features and spectral bands, almost no difference compared to using the entire spectrum was seen. Peak data showed a drop in both MCC and accuracy compared to using the entire spectrum. Because of the large standard deviations in the increase in MCC and accuracy, the analysis was also performed per dataset (Table 5). A large increase in MCC values was seen for the use of tissue optical features in the datasets of the disposable fiberoptic DRS probes, both in vivo and ex vivo (datasets #7 and #9). If the tissue optical features were not taken into account, the shape-based features showed the best overall results, closely followed by the spectral bands. The use of peak data did not show improvement in any of the datasets. Since feature extraction decreased the overfitting problem for the kNN and LDA classifiers, as shown in Section 3.1, the results on feature extraction were also calculated for all classifiers. For this analysis, slightly better results were obtained overall. The MCC and accuracy for the tissue optical features showed a large increase (0.29 and 0.09, respectively) and the shape-based features showed a smaller increase in MCC and accuracy (0.13 and 0.05, respectively). The results for the peak data and spectral bands also improved, with almost no decrease for the peak data (MCC = −0.06 and accuracy = −0.05) and a small increase in MCC and accuracy for the spectral bands (0.03 and 0.02, respectively). The analysis per dataset showed similar results. In short, where possible given the fiber distance and acquisition method (fiberoptic DRS or diffuse reflection HSI), tissue optical features were the best option for feature extraction. If tissue optical features could not be obtained, shape-based features were the best option.

Normalization combined with feature extraction
So far, normalization and feature extraction were only examined separately. In Table 6 the increase in MCC and accuracy values is shown for each combination of normalization and feature extraction compared to no feature extraction and no normalization. In the analysis of Table 6 all classifiers were included because using feature extraction methods decreased the risk of overfitting. From Table 6, it can be concluded that independent of the normalization technique, shape-based feature extraction showed the best results. Furthermore, SNV normalization showed the worst results for peak data and shape-based feature extraction, but the best results for the spectral bands feature extraction. In Table 6, tissue optical features are not included because no normalization was applied for this feature extraction method.
In Fig. 5, Bland-Altman plots are shown for all classifiers and for all datasets from which tissue optical features could be obtained. Separate plots were made for the combination of shape-based feature extraction with normalization at 800 nm (left), spectral band feature extraction with normalization at 800 nm (center), and the tissue optical features (right). In the Bland-Altman plots, the difference in MCC value between the combination of feature extraction and normalization, and no feature extraction and no normalization, is shown. The center line indicates the mean of all values in the plot and the dotted lines indicate the confidence intervals. Fig. 5 shows that the results from the combination of shape-based feature extraction and normalization at 800 nm actually approached the results from the tissue optical features when only the datasets from which tissue optical features could be obtained were included.

Overall results per dataset
Based on the previous analyses on overfitting, normalization and feature extraction, the best combination over all datasets consisted of tissue optical features and an SVM or NN classifier. For small fiber distances or the hyperspectral camera, tissue optical features were not available. Therefore, the shape-based features with normalization at 800 nm were defined as most optimal for datasets #1, #3, #5, #6 and #8.
For each dataset separately, the best combination of normalization, feature extraction and classifier was evaluated based on the MCC and accuracy values. For all datasets, the best combination selected based on the MCC value was the same as that selected based on accuracy. Table 7 gives an overview per dataset of the maximum MCC value, the maximum accuracy, and the normalization, feature extraction and classifier used to obtain these values. For both datasets obtained with the disposable fiberoptic DRS probes used ex vivo (datasets #8 and #9), multiple combinations resulted in the same maximum MCC and accuracy values. For dataset #8 (short fiber distance), two combinations with the same accuracy and MCC value were found; based on the AUC, the SVM-based classification was best, with an AUC of 0.81 for the SVM compared to 0.73 for the LDA classifier. For dataset #9 (disposable fiberoptic DRS probe used ex vivo, long fiber distance), three combinations of normalization, feature extraction and classifier provided the same maximum accuracy and MCC values, as shown in Table 7. Based on the AUC values obtained with these combinations, the combination of no normalization, tissue optical features and an LDA classifier performed slightly better than the SNV-normalized spectrum in a neural network, and better than the combination of no normalization and tissue optical features in a neural network (AUC of 0.94, 0.90 and 0.84, respectively).
The results found in the previous sections were mostly supported by the results in Table 7. Most classifiers found in the analysis of Table 7 were either SVM or NN classifiers. The two LDA classifiers that were selected both used some kind of feature reduction. For datasets #8 and #9, as shown in Table 7, the results from the LDA classification were similar to the SVM and NN classifications, respectively. The same holds for the results of dataset #3, for which classification using shape-based features normalized by the AUC gave an MCC value of 0.81 and an accuracy of 0.90 when the SVM classifier was used.
The normalizations found in Table 7 also correspond with the previous finding that normalization in general did not seem to improve the outcome. As found before, datasets #6 to #9, obtained using the disposable fiberoptic probes, might benefit from normalization. This can be seen in Table 7 as well, where normalization was applied for datasets #6 to #9 in all cases where no tissue optical features were used.
For datasets #6 to #9, tissue optical features were selected as the feature extraction technique where possible (Table 7), just as expected from the previous analysis. In all datasets, feature extraction methods provided the best MCC values and accuracies. As predicted by the analysis on feature extraction, none of the optimal combinations used peak data as a feature extraction method.
In Table 8, the results of the optimal solution per dataset are compared to the global optimum, using SNV normalization, spectral bands as feature extraction and an SVM classifier. For the first five datasets, the results of the global optimum were only slightly worse than the optimum of each dataset. For the final four datasets (#6-#9), a larger difference was seen. To confirm the results obtained with the holdout test performed in this research, the holdout test results were compared to a 20-fold cross-validation. The results of this comparison are shown in Table 9 for the combination of SNV normalization, spectral band feature extraction and an SVM classifier. As can be seen in Table 9, only minor differences were found between the holdout test results and the 20-fold cross-validation.
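The 20-fold cross-validation check can be sketched with a simple fold generator. Note that this per-sample illustration is a simplification: in the study itself the partitioning was done per patient, so that spectra from one patient never appear in both the train and test set.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=20, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split
    (k = 20 here, matching the cross-validation used as a check)."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx
```

Averaging a performance metric over the 20 folds gives an estimate that is less dependent on a single lucky or unlucky holdout split.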

Discussion
In the current study, the optimal analysis technique to distinguish healthy colorectal wall from tumor tissue was examined. The analysis was done on nine different datasets, using five different classifiers, three normalization techniques and four feature extraction methods, compared to the use of the entire spectrum without normalization. Based on these analyses, it was found that the best overall combination is to use no normalization, an SVM or NN as classifier, and the tissue optical features or shape-based features as feature extraction method.

Overfitting
The results on overfitting show a large chance of overfitting for the kNN, decision tree and LDA classifiers if the entire spectrum is used. This was expected, because the number of features used for the classification is large compared to the number of samples in each dataset. As a rule of thumb, the number of slightly correlated features should not exceed the square root of the number of samples [54]. When the entire spectrum is used, this rule of thumb is not met, which increases the chance of overfitting. It was shown that decreasing the number of features used for the classification resulted in a decrease in overfitting (Fig. 4(c)). No improvement in overfitting was seen when datasets were normalized (Fig. 4(b)). This was expected, because normalization changes neither the number of features nor the presence of noise in the data. For the SVM and NN classifiers, overfitting was less present overall. Therefore, it was decided to continue with these two classifiers for further analysis.

Normalization
Overall, normalization does not seem to improve classification outcomes (Table 2). However, if datasets are evaluated separately, a trend appears when datasets obtained with the reusable fiberoptic DRS probe (#1 to #4) are compared to datasets obtained with the disposable fiberoptic DRS probe (#6 to #9). For the first four datasets, normalization does not seem to improve the results, whereas for the latter four it does. This might be explained by the use of disposable needles in the second group (datasets #6 to #9). Even though manufacturing of these probes was automated, small differences in fiber distances were present across the set of probes. Furthermore, for the white reference of the final four datasets a separate probe was used, which was not the case for the first four datasets. The use of a new probe for each patient and of a separate white reference probe, which required changing the fibers between calibration and measurement, might have caused intensity differences in the final four datasets that were not tissue specific. The intensity differences caused by both issues will be eliminated by normalization of the spectra. However, normalization might also eliminate tissue-specific intensity differences. This is probably why normalization does not improve the results for the first four datasets, in which the measurement set-up is more constant. For diffuse reflection HSI, SNV normalization is often used because it removes intensity differences due to glare [47]. However, as shown in Table 3, SNV normalization does not improve the results of the diffuse reflection HSI data, and neither did the other two normalization techniques. Apparently, glare does not affect the features vital to the classification.
Even though in general SNV does not seem to improve the analysis of diffuse reflection HSI significantly, the combination resulting in the best accuracy and MCC for the diffuse reflection HSI dataset did use SNV normalization (Table 7).
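SNV normalization itself is a simple per-spectrum operation: each spectrum is centred on its own mean and scaled by its own standard deviation, which removes additive offsets and multiplicative intensity differences such as glare. A minimal NumPy sketch:

```python
# Standard normal variate (SNV) normalization, applied per spectrum
# (rows = spectra, columns = wavelength bands).
import numpy as np

def snv(spectra: np.ndarray) -> np.ndarray:
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

# Example: a spectrum and a scaled, offset copy of it become identical
s = np.linspace(0.2, 0.8, 1151)            # a toy spectrum of 1151 bands
spectra = np.vstack([s, 2.0 * s + 0.1])    # same shape, different intensity
normalized = snv(spectra)
print(np.allclose(normalized[0], normalized[1]))  # True
```

Note that this removes all per-spectrum intensity information, which is exactly why it can also discard tissue-specific intensity differences.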

Feature extraction
Because of the chance of overfitting when using the entire spectrum for classification, different feature extraction methods were examined. The use of information from the seven most prominent peaks of the spectra does not yield useful information: as shown in Table 4 and Table 5, the MCC and accuracy values decrease when using the peak data compared to the entire spectrum. An advantage of the peak data over the entire spectrum is that fewer features are used in the classification: 21 features for the peak data compared to 1151 features for the entire spectrum. So even though peak data is not the most optimal feature extraction method, the reduction in the number of features compared to the entire spectrum does show an advantage.
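Peak-based feature extraction can be sketched as follows. The choice of three descriptors per peak (position, height, width; 7 peaks x 3 = 21 features) is our reading of the paper's feature count, and the `find_peaks` parameters are illustrative, not the study's exact settings:

```python
# Sketch: reduce a 1151-band spectrum to 21 features taken from its
# seven most prominent peaks (position, height, width per peak).
import numpy as np
from scipy.signal import find_peaks

def peak_features(spectrum: np.ndarray, n_peaks: int = 7) -> np.ndarray:
    peaks, props = find_peaks(spectrum, prominence=0, width=1)
    # Keep only the n_peaks most prominent peaks
    order = np.argsort(props["prominences"])[::-1][:n_peaks]
    feats = np.column_stack([peaks[order],             # position (band index)
                             spectrum[peaks[order]],   # height
                             props["widths"][order]])  # width (in bands)
    return feats.ravel()  # 7 peaks x 3 descriptors = 21 values

# Toy spectrum: a sum of Gaussian peaks plus a little noise
x = np.linspace(0, 1, 1151)
rng = np.random.default_rng(0)
spectrum = sum(np.exp(-((x - c) / 0.03) ** 2) for c in np.linspace(0.1, 0.9, 9))
spectrum += 0.01 * rng.standard_normal(x.size)
print(peak_features(spectrum).shape)  # (21,)
```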
The best feature extraction method is the use of tissue optical features. However, the use of tissue optical features has some restrictions compared to the other feature extraction methods. First of all, the outcome of the fit algorithm depends largely on the tissue constituents that are given as input to the fit algorithm; if not all tissue constituents are known, the results will be unreliable. Furthermore, several assumptions are made in the fit model. The model used in this paper is designed only for fiber optics with fiber distances larger than 1 mm. The algorithm therefore does not work properly on hyperspectral data, nor on spectra from fibers less than 2 mm apart. To the best of our knowledge, there is no generalized fit algorithm that works properly for all datasets. The use of shape-based features seems the second-best option, and can be used for all datasets. The results of the shape-based features could be improved by adding more features that better describe the shape of the spectra and, more importantly, the difference between the two tissue types.
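Shape-based features of the kind that could be added are sketched below. The specific descriptors here (overall slope, curvature, mean reflectance, a band ratio) are hypothetical examples of shape descriptors, not the paper's actual feature set:

```python
# Illustrative shape-based features for a reflectance spectrum.
import numpy as np

def shape_features(spectrum: np.ndarray) -> np.ndarray:
    x = np.arange(spectrum.size)
    slope = np.polyfit(x, spectrum, 1)[0]       # overall tilt of the spectrum
    curvature = np.polyfit(x, spectrum, 2)[0]   # quadratic (bending) term
    level = spectrum.mean()                     # mean reflectance level
    half = spectrum.size // 2
    ratio = spectrum[:half].mean() / spectrum[half:].mean()  # short/long band ratio
    return np.array([slope, curvature, level, ratio])

# Toy spectrum: rising baseline with a broad oscillation
spectrum = np.linspace(0.3, 0.7, 1151) + 0.05 * np.sin(np.linspace(0, 6, 1151))
feats = shape_features(spectrum)
print(feats)
```

Such features work on any of the datasets because, unlike the tissue optical fit, they make no assumptions about probe geometry or tissue constituents.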

General optimum
As stated before, there is not one best model that works for every problem. However, Table 8 shows that when a global optimum, consisting of SNV normalization, spectral bands and an SVM classifier, was compared to the optimum of each dataset, only small differences were seen for the first five datasets. Datasets #6-#9 did show large differences between the per-dataset optimum and the global optimum. This can be explained by the use of the disposable needles, which has less influence on the extraction of the tissue optical features than on the other feature extraction methods. Furthermore, when the conclusions of the separate analyses are compared to the optimal solution for each dataset (Table 7), there is a large resemblance. For instance, in the final four datasets normalization is used whenever tissue optical features were not used as feature extraction method, whereas for the other fiberoptic DRS datasets no normalization was used, except for dataset #3 (Table 7). Furthermore, feature extraction was used for all datasets. Overall, in almost all datasets the optimal classifier is either an SVM or an NN. So even though there is no single best model, some rules can be set to optimize the classification of DRS and HSI data in colorectal cancer. However, these rules cannot be generalized to other tissue types or optical technologies without further examination.

Future work
The current research covered only five basic classifiers. Future research should also include more state-of-the-art classifiers, such as a random forest classifier and a convolutional NN. Furthermore, the option of boosting should be investigated to further optimize the classification. For the hyperspectral images, classification could be further optimized by taking the spatial information into account; however, the dependency between neighboring pixels should then also be considered. In the current hyperspectral dataset this dependency was not taken into account. Furthermore, to avoid the influence of multiple tissue types on one pixel, pixels at the border between two tissue types were removed from the dataset.
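One of the suggested extensions, a random forest classifier, can be sketched as follows. Data and hyperparameters are placeholders; a side benefit of random forests is their built-in feature-importance estimate, which could hint at informative wavelength bands:

```python
# Sketch of a random forest classifier evaluated with cross-validation,
# as suggested for future work. Synthetic stand-in data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(rf, X, y, cv=5).mean()
print(f"random forest CV accuracy: {acc:.3f}")

# Feature importances hint at which bands drive the classification
rf.fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("most informative feature indices:", top)
```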

Conclusion
From the present study it can be concluded that normalization does not seem to add value, except to remove differences originating from the use of different DRS probes within one dataset. Furthermore, in a clinical setting, where the number of samples is often limited, feature extraction is needed to prevent overfitting in high-dimensional data classification. In addition, the results of feature extraction could help in designing simpler devices, such as multispectral cameras, focused on measuring only the features of interest rather than the full spectrum. Finally, SVM and NN are the best-performing classifiers, with the least chance of overfitting on these datasets.