Machine learning based detection of age-related macular degeneration (AMD) and diabetic macular edema (DME) from optical coherence tomography (OCT) images

: Non-lethal macular diseases greatly impact patients’ life quality, and will cause vision loss at the late stages. Visual inspection of the optical coherence tomography (OCT) images by the experienced clinicians is the main diagnosis technique. We proposed a computer-aided diagnosis (CAD) model to discriminate age-related macular degeneration (AMD), diabetic macular edema (DME) and healthy macula. The linear configuration pattern (LCP) based features of the OCT images were screened by the Correlation-based Feature Subset (CFS) selection algorithm. And the best model based on the sequential minimal optimization (SMO) algorithm achieved 99.3% in the overall accuracy for the three classes of samples. sensitivity and specificity were averaged over the three class labels. The sensitivity of a class label is also the prediction accuracy. The overall accuracy is defined as the ratio (number of correctly predicted samples)/(total number of samples).


Introduction
The established three-dimensional imaging technology, i.e. Optical coherence tomography (OCT), has been extensively used to capture the subtle changes in the retina, and serves as one of the standard procedures in the clinical ophthalmology examination [1,2]. The retinal structures may be clearly captured by the OCT's micrometer-level resolution. Some ophthalmological diseases like age-related macular degeneration (AMD) and diabetic macular edema (DME) are diagnosed based on the OCT images [3,4]. The existing guideline diagnoses these retinal diseases based on the visual inspection of the OCT images by the welltrained ophthalmologists. The recent development of the OCT imaging technology has enabled its much wider clinical deployments and the OCT image data has accumulated rapidly. Considering that an ophthalmologist needs to handle dozens or even more patients in Asia, this also introduced a major challenge for the clinical diagnosis, with a heavier visual inspection workload for smaller lesions and larger data volume. Effective computer-aided diagnosis models provide quantitative and objective measurements to facilitate the clinical decisions. A sensitive machine learning model may facilitate the initial screening of the OCT images, and clinicians may pay special attentions to the automatically detected suspect lesions that were missed by human screening.
A number of computer algorithms were proposed to facilitate the diagnosis process of retinal disorders. Lee et al. extracted 15 imaging features to discriminate the retinal surface segments from rim, optic disc cup or background, which provide important information to evaluate the development stages of glaucoma [5]. Retina consists of multiple layers, of which each may respond differently by retinal disorders. The automatic segmentation of the OCTbased retina image into these intra-retinal layers has been solved by the Minimum-cost closed set algorithm [6] and Dynamic Programming algorithm [7] and Directional Graph Search algorithm [8]. A few studies tried to investigate the binary classification of age-related macular degeneration (AMD) [9] or drusen [10] based on the OCT images.
Human eye has a pigmented area in the retinal center, the macula, of which lesion is one of the major causes of vision weakening, eye shadow and blindness [11]. Two major macular lesions receive intensive research interests, i.e. age-related macular degeneration (AMD) and diabetic macular edema (DME). AMD is an irreversible macular lesion, with the symptoms of blurred or completely no vision in the visual field center [12]. This degeneration gets worsen with age, and has two sub-types, i.e. dry and wet AMDs [13]. Dry type AMD accounts for up to 90% of the AMD patients in the US, but a significantly different ratio was observed for the Japanese population [13]. No cure is known for the lost sight, but proper supplements and living habits may slow the degeneration process if being diagnosed in the early stages [14].
Diabetic macular edema (DME) is another common macular lesion, and it is one of the leading causes of complete blindness [10,15]. Most of the diabetes patients develop DME after 20 years of diabetes [16] but the symptoms may be slowed down with proper monitoring and treatments [17]. Although the early-stage symptoms of DME are difficult to be noticed by the patient, experienced clinicians may see the narrowed or completely blocked retinal blood vessels using the fundus photography. The disordered blood vessels will start to burst, bleed and blur the vision. Eventually the sight may be completely gone. These macular diseases may develop at all ages, and their early diagnosis will greatly improve the treatment effect and life quality.
Computer aided diagnosis technologies based on image texture and spatial features have been attempted for the early detection of the macular diseases. Farsiu et al. assessed the thickness of human retinal layers to discriminate AMD patients from the normal controls [9]. The distributions of grayscale features in the digital fundus images may also be integrated with the entropies and higher order spectra to detect the AMD samples [18]. By using five features analyzed from thickness profiles and cyst fluids, Hassan et al. built a model to classify macular edema and central serous retinopathy (CSR). The Support Vector Machine (SVM) classify trained on 90 images and realized a 97.77% accuracy [19]. T-test was utilized to rank all the features and the top 54 ranked features were chosen to train a Support Vector Machine (SVM) model. The model achieved a recognition rate of 95.07%. Albarrak et al. demonstrated that 3D-OCT based images may be used to discriminate AMD from the normal controls using the local description factors, e.g. local binary pattern from three orthogonal panels (LBP-TOP) and histograms of gradient features [20]. The Bayesian network trained with these features achieved an accuracy of 91.4%. Liu et al. employed the multi-scale spatial pyramid features and reduced the dimensions of the local binary patterns (LBPs). The optimized model used multiple binary SVM classifiers, and achieved at least 93% in the area under the receiver operator characteristic curve (AUC) for the four classes of samples, i.e. normal, macular edema, macular hole and age-related macular degeneration [21].
This work proposed a multi-class model of detecting age-related macular degeneration (AMD), diabetic macular edema (DME) and normal controls using the linear configuration patterns (LCPs). Only 23 were optimized from the 5,000 + LCP pyramid features, and an overall accuracy of 98.0% was achieved based on the individual OCT images. The patient level accuracy of 100% was achieved for both the DME and normal samples. The AMD samples received a slightly smaller accuracy of 93.33%. This work is organized in the following sections. Section 2 introduced the rationale of the proposed algorithm. Section 3 described the experimental results. The last section concludes the whole manuscript.

Material and methods
This work investigated the classification problem of three retinal OCT images, i.e. age-related macular degeneration (AMD), diabetic macular edema (DME) and the normal macula. The experiment procedure has four steps, i.e. OCT image preprocessing, feature extraction and selection, building the classification model, and predicting each pathology group, as shown in Fig. 1.

Data set
This study evaluated the proposed procedure using the publically available OCT data set provided by the joint efforts from Duke University, Harvard University and University of Michigan [22]. This data set is denoted as SD-OCT, and consists of over 3,000 OCT images from 45 participants. There are 15 patients with dry AMD, 15 patients with DME, and 15 healthy human controls.
The 2D horizontal cross-sectional retinal slice from the 3D macular imaging data was used in the proposed procedure, as shown in Fig. 2. The ophthalmological diagnoses of both AMD and DME are based on the patterns in such 2D horizontal cross-sectional OCT slices according to the clinical guidelines [23]. AMD retina begins with accumulating drusen between the retinal pigment epithelium (RPE) and the underlying choroid. The usually domeshaped retina atrophy and scar will be developed and cause severe damage to the sight gradually, as shown in Fig. 2(a). DME-induced micro vascular changes in retina will cause the thickening of the basement membrane. The mal-functioning and liquid accumulating vascular walls will show black blobs in the OCT images, as shown in Fig. 2(b). Normal macula shows clear boundaries between layers and regularly shaped vessels at the center, as in Fig. 2(c).

Linear configuration patterns (LCP)
This study utilized the linear configuration pattern (LCP) features to represent a retina image. The topological LCP features of image texture are widely utilized by biomedical researchers to classify disease patterns from the control tissues. Mookiah et al. utilized the linear configuration pattern method to automatically detect age-related macular degeneration in the fundus images [24]. They ranked the features individually and achieved 97.8% in accuracy for the binary classifier, which focused on disease and control samples. The LCP-based features were also utilized on image-based biomedical problems, such as mammogram-based breast cancer diagnosis [25], CT-based lung nodule segmentation and identification [26], and ultrasound-based myocardial infarction staging [27], etc. LCP was demonstrated to work well on describing both the microscopic configurations and the local structural features [28]. The local structural information is calculated by the local binary patterns (LBPs) [29]. The rotation invariant uniform patterns (riu) are extended from the original version of LBP operator.

Multi-scale feature extraction
LCP features were calculated on multiple scales of the OCT images in this study. Depending on the confusing structure appearance between shadow and macular edema, the context which shows the overall appearance of retina area are required being correctly interpreted. Figure 3 shows the shadowing effects which caused by uneven lighting or light block from the opaque media. Although the LBP(riu) features are apply to rotation invariant patterns, the extracted feature cannot present the retina characteristics properly. We demonstrate that in the experiment of Section 3.3. Besides, previous studies demonstrated that micro-and macroscale images contribute complementary discriminating characteristics for an image-based classification problem [30]. Features of different image scales may be formulated using a pyramid strategy, as shown in Fig. 4. A 3 level spatial pyramid strategy was utilized in this study [31]. For each level m, we divided the image length and width by 2 m . The overlapped blocks are also taken into consideration. The LCP features on the each block of different level were calculated and combined as a global image descriptor. The spatial pyramid calculation of linear configuration patterns was abbreviated as SP-LCP.

Feature optimization and classification
It is hypothesized that not all the features calculated from the OCT images may contribute to the classification performance of AMD, DME and healthy controls. A feature subset was chosen by the following feature selection algorithms, and evaluated for their classification performances by the 10-fold cross validation strategy. Attribute/feature evaluation algorithms are divided into two major classes, i.e. filters and wrappers. Filters evaluate the association of each feature with the class labels, with the assumption that this association is independent between features. The top-ranked features will be chosen as the feature subset for classification modeling. Wrappers screen for a subset of features with an optimal performance measurement, which is usually the classification accuracy or error rate. Wrappers are usually slower but more accurate than filters [32]. So this study chose two wrapper feature selection algorithms CFS (Correlation-based Feature Subset) and CSE (Classifier Subset Evaluator) to find the feature subset significantly associated with the phenotypes [33,34]. Previous studies focused on the discrimination power of the Linear Configuration Pattern (LCP) features and did not try to find a better and smaller feature subset using the feature selection strategies [35,36]. This study proposed a hypothesis that a subset of the LCP features may generate a better classification performance as well as a simpler classification model. CFS with the best first searching heuristic rule evaluates the worth of a subset of features by considering the individual predictive ability of each feature as well as the redundancies degree between them. Subsets of features that are highly correlated with the class while having low inter correlation are chosen which shrink the feature subset from 5000 + into 493. The experimental data supports this hypothesis by achieving an averaged 3.0% improvement in classification accuracy with only 9.4% of the total feature set. CSE with the best fist searching heuristic rule evaluates attribute subsets on training data or a separate hold out testing set. The classification algorithm SMO is embedded to estimate the 'merit' of a set of features. CSE reduced the numbers of features from 493 to only 23 features. Previous comparative studies demonstrated that the above feature selection algorithms usually performed best on selection a feature subset with good classification performances, while feature selection algorithms based on the method of filter like t-test optimize the phenotype association significances [32].
Classification algorithms are also essential to train a good model [32,37]. We tested our data set using one representative from each classification algorithm group, i.e. the quadratic programming based algorithm sequential minimal optimization (SMO) [38], the neural network algorithm multi-layer perceptron based on back propagation (BP) [39], the kernel based algorithm Support Vector Machine with polynomial kernel (SVM) [40], the linear regression based classification algorithm Logistic Regression (LR) [41], the Bayesian algorithm Naïve Bayes (NBayes) [42], the tree based algorithm J48 decision tree (J48) [43], and the ensemble forest algorithm Random Forest (RF) [44]. The experimental data in the later section suggested that classification algorithm SMO tends to achieve better performances, compared with the other algorithms.
This study employed the implementations of these feature selection and classification algorithms in the data mining program Weka version 3.7.12 [45] and libSVM version 3.21 [46].

Experimental settings
There were 15 participants for each of the three groups, AMD, DME and normal control. Each of the participants has multiple scans of OCT images, and the normal tissue slices of AMD or DME patients were excluded [22]. Due to the technical reasons, including irregular lighting and motion blurring, some images didn't provide a clear view of retina and were excluded. The final OCT image data set consists of 453 AMD images, 511 DME images and 1,403 normal images.
10 fold cross validation strategy was utilized to evaluate the classification performance of a given feature subset. Ten repeated runs with different random seeds were conducted to avoid the data set splitting bias. The accuracy, specificity, sensitivity and area under the receiver operator characteristic curve (AUC) were also employed to evaluate the parameterindependent classification performances. The mean value and standard deviation of each index are calculated. The ROC curve is an intuitive and parameter-independent way to illustrate a classification model, but it's difficult to generate the ROC plots of each algorithm due to the multiple runs of 10-fold cross validations. So the following sections do not provide the ROC plots.

Experimental results and discussion
The proposed automatic macular disease detection algorithm was evaluated from the following aspects. Different pyramid frameworks were evaluated for their classification performances in the section 3.1. Different feature extraction and selection algorithms were compared to find a best feature subset in the section 3.2. Section 3.3 compared different classification algorithms, so that a best classification model may be trained. The comparison with the existing studies was also conducted in the section 3.4.

Evaluation of pyramid settings
A number of pyramid settings were investigated for their classification performances, as similar in [21]. The image pyramid (IP) setting extracted the LCP features from the three scaling levels of the original images, as shown in Fig. 5(a). The Spatial Pyramid (SP) setting calculated the LCP features from the three levels of images scaled to the same size 80 × 248 in pixels, as in Fig. 5(b). The Multi-scale spatial pyramid setting (MSSP) carried out both the scaling and sub-image splitting before calculating the LCP features, as shown in Fig. 5(c). The overlapping pyramid setting is to calculate the LCP features from the fixed-size subimages, as illustrated as a rectangle with a dashed frame in Fig. 5.
The performance measurement sensitivity (TPR, true positive rate) is defined as (number of true positives)/(number of true positives + number of false negatives) = TP/(TP + FN). Specificity is usually defined as (number of true negatives)/(number of negatives) for a binary classification problem. In our case of three-class classification problem, specificity is defined in the same way for each class label, where the negative samples are the samples not in the current class label. So the overall sensitivity and specificity were averaged over the three class labels. The sensitivity of a class label is also the prediction accuracy. The overall accuracy is defined as the ratio (number of correctly predicted samples)/(total number of samples). The first round of feature selection was conducted to reduce the features for training a classifier, as shown in Table 1. All the pyramid settings generate at least 500 features, and there is a possibility of classifier over-fitting. So the CFS with the best first search strategy was carried out to find a feature subset with reasonable classification performances. The setting IP extracted the minimum number of features (507) from the data set, whereas SP:O extracted many more features (5,239). After the screening of CFS on the feature sets, at least 84% decreasing in the feature number was achieved for all the five pyramid settings. Only 8.9% of the features were left for the setting MSSP:O. Table 1. Feature summary of the five pyramid settings. The tag ":O" means overlapping for this pyramid setting. The row "Original" is the number of features generated by each pyramid setting, and the row "Selected" is the number of features selected by the CFS algorithm. The row "S/O" gives the ratio between "Selected" and "Original". The five pyramid settings were evaluated for their classification performances using the sequential minimal optimization (SMO) [38], which proved to be best classifier in the later section. Firstly, all the five pyramid settings demonstrated good classification performances, as shown in Table 2. The worst single-class accuracy 89.2% was observed for the DME samples based on the IP features. The IP features also achieved the worst overall accuracy 96.4%. All the other four pyramid settings achieved at least 99% in the overall accuracy, and the setting SP:O achieved 99.3% in the overall accuracy. The SP:O features performed slightly worse than the MSSP:O features, by 0.2% decreased accuracy for the sample class AMD. So the following sections utilized the pyramid setting SP:O as the default feature subset.

Evaluation of feature extraction and selection algorithms
The above section demonstrated the importance of having a good feature set for training a classifier. The clinical diagnosis of AMD and DME relies on the local macular structure abnormalities. So we compared the aforementioned features with the widely used local binary pattern (LBP) features [29] and the local configuration pattern (LCP) features [28].   Table 3 showed that local description features from different scaling levels were complementary to each other and their integrations increased the macular disease classification accuracy. The data suggested that LBP features didn't separate the AMD and DME samples well, and the three class classification problem only got an overall accuracy of 78.6%. The LCP features improved the classification performance by 17.7% in the overall accuracy. But another 3.0% improvement was achieved by the 493 features of the SP:O setting. Another round of feature selection was carried out on the 493 features using the CSE strategy with SMO as the embedded classifier. Only 23 features were selected for further analysis, and this small feature subset achieved 98.0% in the overall accuracy, a slight decrease (1.3%) compared with the feature subset SP:O in Table 3.
A smaller feature subset takes less time to build a classification model, and avoids the possibility of over-fitting. Table 4 showed that it took only 0.10 second to build a SMO classification model based on the 23 features of SP:O + CSE, while the 493 features of SP:O took 3.08 seconds.

Evaluation of classification algorithms
A comprehensive comparative study was conducted to evaluate how different classification algorithms perform on this data set, as shown in Table 5. SMO performed the best on the AMD samples with an overall accuracy 97.8%, and was ranked 2nd on the DME samples with 0.3% decreased accuracy compared with LR. LR performed the best on detecting the DME samples, with an overall accuracy 94.3%. And SMO performed the best on both Normal samples and the overall data set. In Table 6, SMO performed the best on the accuracy, specificity and sensitivity, and was ranked 2nd on the AUC performance with 1.3% decreased area under roc curve compared with BP and LR. The decision tree J48 performed the worst while RF performed much better. This suggests that the OCT LCP features harbor complicate associations and the simple decision rules trained by the J48 algorithm does not fit well with the inner patterns of this data set. So SMO was proposed as the default classification algorithm in this study.

Comparison with the existing studies
The best model achieved in this study was compared with the existing model proposed by Liu et al. [21], as shown in Table 7. The 10-fold cross validations were repeated 10 times with different random seeds. The same classification performance measurement was calculated for the comparison. The measurement Area Under the receiver operator characteristic Curve (AUC) was calculated for the model proposed in this study. AUC evaluates a classification problem independently of different cutoff parameters. The proposed model in this study outperformed the study Liu et al. in all the three sample classes, and achieved 0.064 improvement in the measure AUC. Another study evaluated the classification problem by individual subjects based on the same data set [22]. Srinivasan et al. extracted the multi-scale Histogram Of Oriented descriptors (HOG) as the feature vector of an OCT image, and the model was trained using the Support Vector Machine (SVM) algorithm [22]. 45 experiments were carried out using the leave-three-subjects-out cross validation strategy. That is to say, for an experiment, one subject was randomly selected from each of the three sample classes, and the rest 42 subjects were used as the training data set. . Srinivasan et al. achieved 100% in accuracies for the two disease classes AMD and DME, but incorrectly predicted two normal samples, as shown in Table 8. A rule of majority was employed to generate the patient level prediction. That is to say, a patient is defined to belong to a class label, which is the predicted class label of most OCT images of this patient. The best model achieved in this study accurately detected all the DME samples and Normal samples, but missed one AMD sample.

Conclusion and future scopes
This study demonstrated that a carefully selected set of features from different scaling levels were essential for the classification performances of two macular diseases compared against the normal controls using the SD-OCT images. The proposed model also outperformed the existing models on the levels of both individual OCT images and patients. Firstly, the LCP features of different pyramid settings worked very well for the investigated classification problem. The original version of non-overlapping LCP blocks may be improved by allowing additional overlapping blocks between the neighboring ones. Secondly, the multi-scaling pyramid setting introduced a major improvement for both LBP and LCP features. And two rounds of feature selection procedures significantly reduced the feature numbers, while retaining similar accuracies. Lastly, the classification algorithms SMO and LR achieved very good performances, and SMO performed the best overall performance. The best model in this study also outperformed the existing studies.
In conclusion, the proposed model may represent an automatic macular disease detection system based on the SD-OCT images, and provided useful strategies for other imaging-based clinical diagnosis modeling. Further investigations of how to optimize the scaling and aligning steps, and the parameters of the machine learning algorithms are planned. Confirmation in a larger population with more diversity and a user-friendly program interface will also facilitate the clinical application of the proposed model.

Funding
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13040400) and the startup grant from the Jilin University. Thanks for the help from the major research projects of Xi'an Siyuan University (XASY-B1601).