Drusen diagnosis comparison between hyperspectral and color retinal images

: Age-related macular degeneration (AMD) is a degenerative aging disorder, which can lead to irreversible vision loss in older individuals. The emergence of clinical applications of retinal hyper-spectral imaging provides a unique opportunity to capture important spectral signatures, with the potential to enhance the molecular diagnosis of retinal diseases. In this study, we use a machine learning classification approach to explore whether hyper-spectral images offer an improved outcome compared to standard RGB images. Our results show that the classifier performs better on hyper-spectral images with improved accuracy and sensitivity for drusen classification compared to standard imaging. By examining the most important features in the classification task, our data suggest that drusen are highly heterogeneous. Our work provides further evidence that hyper-spectral retinal image data are uniquely suited for computer-aided diagnosis and detection techniques.

In the current study, in addition to standard RGB fundus images, we investigated another image modality, hyper-spectral retinal imaging. Using a prototype device, we acquired fundus reflectance images at 16 different wavelengths, as detailed by Li et al. [8]. The benefit of hyper-spectral images are their ability to capture, non-invasively, a large spectral data set, with the potential to identify important biomarkers for diagnosis of AMD [9]. Lee et al. analyzed the hyper-spectral signatures of drusen in hyper-spectral fundus images using nonnegative matrix factorization (NMF) [10]. Kaluzny et al. and Fawzi et al. further investigated hyper-spectral mapping of macular pigment [11,12]. Other researchers focused on detecting the characteristics of drusen and retinal pigment epithelium using hyper-spectral autofluorescence images [13][14][15]. In this study, we extracted Haralick texture features [16] for each individual wavelength of a hyper-spectral data set and adopted a classification approach to investigate different feature characteristics comparing drusen and non-drusen regions of interest.
Previous research has mostly focused on exploring different methods to detect drusen. These methods include image processing and computer vision techniques alone [17][18][19][20][21][22][23] or in combination with machine learning algorithms [24][25][26][27][28][29]. For example, Mittal and Kumari implemented a combination of gradient-based segmentation and edge linking-procedure and achieved 98.55% accuracy for detecting intermediate drusen [23]. García-Floriano et al. adopted a Support Vector Machine (SVM) algorithm to classify images with or without drusen and achieved an accuracy of 92.16% [28]. The limitations of previous studies include considering only green channel images, which might suffer from loss of important information. In addition, the validation data sets were small; for example, García-Floriano et al. only used 33 drusen images and 37 healthy tissue images [28]. In our study, we address these limitations by considering all image channels as well as generating larger validation data sets by cropping all the regions of interest.
While many low-level image features such as SIFT and SURF [30], wavelets [31], and extracted image spatial information based on histogram of orthogonal vectors or triangular regions [32,33] have been used for general image classification studies, only a few have been explored for drusen diagnosis. For example, although Haralick texture features have been widely used in the computer-aided diagnosis field, such as for lung nodules [34], liver diseases [35], and parotid-gland injury [36], their use in retinal imaging and specifically for drusen diagnosis has been limited. Prasath and Ramya used drusen texture features to segment the drusen in RGB retinal images, using only the green channel because of its higher contrast compared with the other two channels [37]. In our study, instead of setting a threshold value for drusen segmentation, we employed a classification approach to classify the drusen and non-drusen images using all 12 Haralick texture features and all 16 hyperspectral wavelength channels. To our best knowledge, this is the first study that investigates the role of texture in drusen diagnosis using machine learning techniques and hyper-spectral retinal images.
Newer machine learning approaches based on deep learning have been recently proposed to learn directly from the raw image data rather than from extracted low-level image features. Lee et al. [38] implemented the deep learning method to distinguish normal OCT images from images of patients with AMD. Burlina et al. [39] used transfer learning and universal features derived from deep convolutional neural networks (DCNN) to classify different stages of AMD images. More recently, Schmidt Erfurth et al. used a deep learning approach to predict AMD progression [40]. However, training and testing deep learning algorithms require a large number of images, which makes these algorithms not applicable to settings with limited image data sets.
In summary, we aim to study the effects of texture as a biomarker for drusen and to compare the classification performance between hyper-spectral retinal images and RGB retinal images. Since we focus on lesion classification rather than lesion detection, we manually cropped drusen and healthy retinal tissue region of interest in hyper-spectral images and RGB im quantify textu that the classi inverse differ feature in dis spectral imag offer distinct

Data and
Our methodo imaging data wavelength c AMD related cropped the r from the cro classification model, we ide   12 Haralick texture features that were averaged with respect to the angle. The texture features are angular-second moment (energy), contrast, correlation, variance, inverse difference moment, sum average, sum variance, entropy, sum entropy, difference variance, difference entropy, information measure of correlation and maximal correlation coefficient. This set of texture features is chosen to quantify second-order gray level properties such as local uniformity, variance, and homogeneity. The Appendix Table 15 summarizes the definitions of these features [43]. We also provided the link to download the data sets in this study (Dataset 1 [44]).

Drusen vs. non-drusen classification
To determine whether hyper-spectral images offer an improved outcome compared to standard RGB images based on intensity and texture features, we implemented four binary classifiers to differentiate between healthy (non-drusen) and non-healthy (drusen) tissues: decision trees, naïve Bayes, AdaBoost with stump trees, and random forests. First, we examined different split ratios between the training and testing sets (80%-20%, 70%-30%, 60%-40% and 50%-50%). For each classifier with a certain training vs. testing split ratio, we repeated the process 30 times and calculated the mean accuracy, sensitivity (drusen is the positive case) and specificity (non-drusen is the negative case) under 95% confidence interval. Second, we compared the classification results for hyper-spectral and RGB data sets by testing the mean accuracies and mean sensitivities (calculated across the 30 trials) between different combinations of training vs. testing split ratio, classifier type, and image modality using Welch's t-test [45]. Since we had more hyper-spectral ROIs, we under-sampled the hyper-spectral ROI image data set to balance it with the number of ROIs present in the RGB images. In the rest of this section we provide the mathematical details for each one of the four classifiers. Decision Tree is a greedy algorithm that constructs a classification tree in a top-down, recursive, divide-and-conquer manner [46]. A decision tree can be represented as a flowchart-like tree structure, where the root node represents all the samples S , each internal node represents a test on a feature A , the outcome of the test is represented by a branch, and the leaf nodes are target class distributions for m distinct classes ( 1, , ) . The algorithm first starts with the root node; if the samples belong to the same class, then the node becomes a leaf node and is labeled with the class. Otherwise, the algorithm uses information gain 1 2 ( , , , ) m I s s s  to select a feature A that becomes the test feature at that node and divides the samples into different groups: If the feature A has v different values, 1 2 , then the feature A can be used to partition S into v subsets, 1 2 { . The entropy is defined as where ij s represents the number of samples that have value j a for feature A and belong to class i C and s represents the number of samples at the partition node. The information gain by branching on feature A is (4) The algorithm chooses the attribute with the highest information gain to separate the samples and uses the same process recursively for the samples at each partition node. The recursive partition stops when all the samples in a node belong to the same class or there are no more features or samples to split the node.
In this study, we implemented 10-fold cross validation method [47] on a training set to find the optimal configuration of the decision tree that leads to the minimum average crossvalidation error.
Naïve Bayes is a probabilistic classifier that implements Bayes' theorem with the assumption that all the features are independent. However, Pedro et al. [48] found that even in the situation where features are dependent, Naïve Bayes can have a better classification performance. Suppose we have a new instance x with n features 1 2 ( , , , ) AdaBoost is an ensemble learning classifier that combines weak learners and assigns weights to training instances and weak learners t h , 1 t T =  , where T is the total number of learners. The algorithm assigns higher weights to most likely misclassified cases [49]. In this study, we choose stump trees as our weak learners. In the first iteration, the algorithm gives equal weight D to all the training instances 1 1 ( , ) ( , ) s s x y x y  where i y belongs to class i C : The weight for a weak classifier t h is defined as where t ε is the classification error at iteration t . The updated weight at iteration 1 t + is defined as where t z is the normalization factor. The output of the final hypothesis is We also implemented 10-fold cross validation on training set to find the optimal number of iterations that leads to the minimal average cross validation error.
Random Forest is an ensemble of classifiers that creates a group of decision trees from the original data by bootstrapping and then randomly choosing features to build the trees [50]. Because of its randomness, random forest is robust to outliers and overfitting problems. The algorithm classifies in instance by a majority vote across all the classification outputs of the individual decision trees [50,51].
To determine the optimal number of features and trees, we examined the "Out-of-Bag" (OOB) error as a measurement of classification performance. Figure 5 illustrates the calculation process. OOB error is the average error of all the instances in the data set while each instance error is the average error of all the trees that do not select the instance.

Results
We present a compare the c in Section 3.2 features in Se

Classific
For each com process of sh  We notice that all four classifiers achieved the highest mean sensitivity (Table 3) for classifying drusen regions under the split ratio 80%-20%. Random forest and AdaBoost also have the highest mean accuracy with the split ratio 80%-20% (Table 2). We further implemented Welch's t-test to statistically determine whether the difference of the mean sensitivity between 80%-20% with other split ratios is significant (Table 5). From Table 5, we can conclude that when we use random forest algorithm in hyperspectral image data, there is a significant difference of the mean sensitivity between the 80%-20% split ratio with other split ratios. Random forest algorithm can achieve the highest mean sensitivity with 80%-20% ratio. When we use decision tree algorithm, there is a different of mean sensitivity between 80%-20% and 50%-50%. However, there is no difference of mean sensitivity between different split ratios for other combinations.
We repeated the same classification process for RGB image data. Tables 6, 7 and 8 show the mean accuracy, sensitivity and specificity respectively under the 95% confidence interval when using the RGB image.  From Table 6, we can see that random forest, decision tree and AdaBoost achieved the highest mean accuracy with the split ratio 80%-20%. When considering mean sensitivity and mean specificity, random forest and AdaBoost also performed the best with the split ratio 80%-20% (Tables 7 and 8). Similarly, we implemented Welch's t-test to statistically determine whether the difference of the mean accuracy between 80%-20% with other split ratios is significant (Table 9). From Table 9, we can see that in RGB image, there is no significant difference of mean accuracy between 80%-20% and 70%-30% for both classifiers under the 95% confidence interval. However, for random forest and AdaBoost classifiers, the difference of mean accuracy between the split ratio 80%-20% and 60%-40% and the difference between 80%-20% and 50%-50% are significant. This result indicates that we can choose either 80%-20% or 70%-30% as the spit ratio for RGB image data.

Classification results using different classifiers
Based on the split ratio results, we analyzed the classification performance across the four classifiers using the 80%-20% as the split ratio. Table 10 summarizes the results across classifiers using a 80%-20% ratio and shows that the random forest classifier achieved the highest accuracy, sensitivity and specificity for the hyper-spectral image data set based on the Welch's t-test at significance level of 0.05.

Low-level image feature importance analysis
We further investigated the random forest classification performance by the type of features. We focus the analysis of the results on the hyper-spectral image data given the higher performance for drusen classification. Table 13 compares the classification results when using intensity-based features, texture-based features, and a combination of intensity and texture features. The results show that the highest performance is obtained using a combination of texture and intensity features, followed in performance by the texture features. The mean differences for all accuracy, sensitivity, and specificity values are all significant based on the Welch's t-test (Tables 13 and 14). To understand the relevance of the individual low-level image features that distinguished drusen ROI from non-drusen ROI, we used the Gini index criterion (Eq. (11) to rank the feature importance when building the random forest on all features (both texture and intensity). Figure 7 shows the most important low-level image features with the 'inverse difference moment", a feature describing the local homogeneity in a region, being the most important (it has the largest value for the mean decrease in the Gini index).
As a result, we analyzed the differences in the inverse difference moment features for the drusen versus non-drusen ROIs. Based on the definition of the feature, a low inverse difference moment value indicates the image is heterogeneous while a higher value indicates the region is more homogeneous. Figure 8 shows that, on average, drusen images are more heterogeneous confirmed by the statistically significant Welch t-test at a p-value smaller than 2.2e-16.

Discussio
Our results de in the diagno classifiers and performance performance t the RGB dat properties tha 7 we determined mathematically that the texture heterogeneity is an important local image characteristic that has higher values for the drusen images. Furthermore, when comparing the random forest classifiers for the hyper-spectral images and the RGB images (Fig. 6), we found that the classification model for the RGB data needed a higher number of features (15) per split and more trees (427) to achieve the optimal combination of parameters than hyper-spectral images that required only 7 features and 212 trees. These findings indicate that classification models for the hyper-spectral data are not only superior in performance but also have a lower complexity with only few image characteristics needed to distinguish between drusen and non-drusen.
In the context of previous studies, our work validates and extends the work by Prasath and Ramya [37] that showed that thresholding certain texture features can help segment drusen regions in RGB images. By using a robust Haralick set of features (averaged across different displacements and angles) and a machine learning algorithm, we determined the most important texture features and their combinations with intensity features for drusen diagnosis. Finally, instead of using only the green channel as in [37] and [53] where local binary patterns (LBP) features computed in green channel where reported to be the most important features in distinguishing drusen from non-drusen images, we showed that hyper-spectral imaging has the potential to provide the optimal combination of texture and intensity features for drusen ROIs characterization.

Conclusions
Using hyper-spectral retinal images containing 16 different wavelength channels generated by a compact, snapshot hyper-spectral fundus camera [8], we showed the potential advantages of hyper-spectral imaging for retinal disease diagnosis. We discovered that drusen ROIs are more heterogeneous than the surrounding retinal tissue, a property that can be quantified mathematically through one of the Haralick texture feature, the inverse difference moment.
As future work, we plan to investigate the effect of the location and size of drusen on the classification. For example, can we answer questions like 'Is there any difference between drusen centrally located and those near the arcades using texture descriptors'? Augmenting the approach presented in this paper with a patch-based segmentation approach as proposed in [54] will allow the extension of this work to automatic segmentation of ROIs and eliminate the need for manual cropping. This would then allow us to perform automatic drusen classification as well as detection.
Furthermore, newer techniques such as deep learning have been recently explored in retinal imaging [38,39] and resulted in promising results. Since these deep learning approaches require large image training sets, we plan to acquire a larger hyper-spectral image data set and compare the performance of the feature-based random forest classifier with the deep learning classification approaches.