Transfer learning based classification of optical coherence tomography images with diabetic macular edema and dry age-related macular degeneration

: We present an algorithm for identifying retinal pathologies given retinal optical coherence tomography (OCT) images. Our approach fine-tunes a pre-trained convolutional neural network (CNN), GoogLeNet, to improve its prediction capability (compared to random initialization training) and identifies salient responses during prediction to understand learned filter characteristics. We considered a data set containing subjects with diabetic macular edema, or dry age-related macular degeneration, or no pathology. The fine-tuned CNN could effectively identify pathologies in comparison to classical learning. Our algorithm aims to demonstrate that models trained on non-medical images can be fine-tuned for classifying OCT images with limited training data.


Introduction
Spectral domain optical coherence tomography (SD-OCT) has been an integral imaging instrument in ophthalmology for the early diagnosis of pathologies through textural and morphological variations [1,2]. The current application is aimed at automated classification (with limited training data) of retinal pathologies diabetic macular edema (DME) and agerelated macular degeneration (AMD) as they account for the majority of irreversible vision loss subjects in developed and developing countries [3][4][5]. In the case of DME fluid is accumulated [6] and in the case of dry AMD drusen is deposited resulting in geographic atrophy, thereby incurring the deformation of retinal layers [7,8]. From the onset of the disease as it progresses, retinal layers deform and detach from subsequent layers resulting in vision loss. Early diagnosis through inspection for layer deformations, the presence of fluids, geographic atrophy, and drusen, is crucial for subduing vision loss. Retinal images of subjects with dry AMD in Fig. 1(b) and DME in Fig. 1(c) with geographic atrophy and fluid, respectively, are shown in Fig. 1 in comparison to an image of the retina of a normal subject in Fig. 1(a). Such visual inspection process has been fragmented into solvable image analysis blocks to reduce the workload of the ophthalmologist and for remote clinical applications. These blocks are aimed at noise removal, layer segmentation, feature quantification (heuristically and intuitive) and classification. The noise removal block commonly employs techniques such as wavelets [9] and BM3D [10]. Retinal segmentation algorithms vary from traditional image processing to machinelearning [11][12][13][14][15][16][17][18] of which graph-based approaches implemented over machine-learning predictions have been successful [12,15,18]. Information about the segmented layer is utilized to quantify layer deformations to identify the pathology [2,7,19,20]. Another stream of algorithms includes the quantification of textural and morphological features to train a classifier to identify the pathology [21][22][23][24][25][26]. The evolution of salient feature quantification has reduced the dependency of image classification approaches on retinal segmentation; for example, in recent OCT image classification techniques [24] utilizes only segmented retinal pigment epithelium (RPE) layer information for retinal flattening and this has also been considered as a common practice. The proposed approach also utilizes only filtered and flattened images for labeling an unknown retinal OCT image as AMD or DME or normal.
In machine learning the introduction of an additional module of feature extraction [27] between feature quantification and the classifier transforms quantified features to improve the separation of the feature space. Principal Component Analysis (PCA) and neural networks (auto-encoders) are frequently used as feature extractors for enriching the quantified features. With the impact brought by such modules, more effort has been devoted to learning (identifying) data-driven feature quantifiers (from raw data) [28] rather than feature quantifiers that are limited to heuristics (clinical knowledge) and intuition. When multiple of these modules i.e., learnable feature quantifiers (filters) are stacked between raw data (OCT image) and class data (normal, AMD, or DME) it leads to deep learning [29,30]. Deep learning has had a phenomenal impact on artificial intelligence and machine-learning communities particularly for image processing, language processing, and voice processing applications [30]. Deep learning has emerged as deep neural networks (DNNs), deep belief networks (DBNs), and convolutional neural networks (CNNs), out of which CNNs are widely employed in image analysis applications [31]. In recent years, variants of CNN architectures namely LeNet [32], AlexNet [33], and GoogLeNet [34] have been developed. We selected GoogLeNet for our current application as it established state of the art performance.
A common assumption in machine learning and deep learning is that a large amount of data is required to increase the generalizability of the learnt model. The construction of a model with data-driven feature quantifiers (filters), i.e., a higher order model, on an inadequate amount of data leads to overfitting, which has a negative impact on performance during testing. As this is also the case with classical machine learning, transfer learning [35] has been introduced in which the model is trained on a relative task for which ample data is available and the model parameters are fine-tuned for the required target task [36][37][38]. Finetuning a model trained on general images for retinal optical coherence tomography image classification has not been explored. The proposed approach involves fine-tuning GoogLeNet, which is pre-trained on an ImageNet database, to identify OCT images with pathology. It also includes a preprocessing step for filtering and flattening OCT images, which is not general practice in the deep learning community. The proposed approach has been compared with GoogLeNet trained with random initialization and classical machine learning approaches. Recent training approaches induce randomness or noise to avoid overfitting and improve generalization. Upon introduction of noise the model performance changes from experiment to experiment. Therefore, a proper understanding of the model performance requires repeated experimentation to illustrate the bounds of performance of the trained model.
Collectively proposed approach illustrates the adaptation of GoogLeNet (trained on object classification) for OCT image classification, the employment of less data for training, and finally, automated identification of the potential filter response at each layer.

Preamble to GoogLeNet
Classical machine-learning algorithms construct a function (h), which transforms given quantified features (X) to a target space (y) where features representing different classes (DME, AMD, and normal) are easily distinguishable and a rule or boundary is identified to distinguish the classes in the target space. An alternative community has been working on automated feature learning where the feature quantification and transformation function is more data-driven instead of using heuristics (feature engineering) to accommodate more complex functions to minimize intra-class variance and maximize inter-class variance. Rather than learning a complex function in one transformation (layer), deep learning has taken the approach of concatenating multiple transformations in a sequential process such that more abstract features (edges in the first layer, corners and curves in the second layer and so on) are learned as information is processed (learning progresses) from layer to layer (learning progresses). In other words, instead of one complex transformation t(D) (where D represents raw data), deep learning adopts multiple transformations in sequence (t(D) = g(f(k(D)))) and decision boundaries (a single boundary in the case of a binary class problem) as the final layer. Many network architectures have been proposed, e.g., DNN, DBN, and CNN, of which CNN is widely applied in computer vision and image processing applications where the filters at each layer are responsible for transformation at each layer.
Convolutional neural networks (CNN) are more suitable for image applications due to their receptive field property (transformation at any particular pixel is based on information from neighboring pixels). A typical CNN has two phases: forward pass and backward pass. Forward pass involves predicting the output and computing loss (error between the predicted output and actual output). Backward pass involves correcting the learnable filters and weights based on the computed error through a back-propagation algorithm. Backward pass is only employed during the training phase.
CNN architectures for image classification applications involve combinations of the convolution block with learnable filters [39], activation block [33], crossmap local response normalization (LRN) block [33], pooling block [32], fully connected (traditional neural network) block [32] with learnable weights, dropout [40], and loss block [32,33]. Given an image or the responses, the convolution block convolves with a set of filters (learnable), the activation block alters each element in the input based on a user-defined function, the local response normalization acts as regularizer for unbounded activation functions, the pooling block replaces each element based on a statistical operation (maximum or mean) of neighboring elements (image dilation or mean filtering), the fully connected layer is a traditional artificial neural network (ANN), and the loss block computes the error between the predicted class and actual class. The error computed in the loss layer is utilized to correct the filter weights of each convolution block with gradients computed through error backpropagation. Traditional CNN architectures comprise a repetition of convolution blocks (Conv1, Conv2_1, etc.), activation blocks, and the maxpooling blocks (Maxpool1, Maxpool2, etc.), which are appended with multilayer perceptron architecture (ANN) at the end. The salient property of GoogLeNet is an inception layer (Appendix) that introduces sparsity and multiscale (convolving with different filter sizes) information in one block. Functionally it is equivalent to a small network inside a large network. The filters and weights of GoogLeNet are identified iteratively through error back-propagation. The error back-propagation approach converts a loss into gradients to correct the last layer and computes "correcting gradients" of any layer based on (multiplication) 'correcting gradients' of subsequent layers. In the case of deeper architectures, the filters of the initial layers (near the input layer) are not corrected as effectively as the last layer (near the loss layer) and this is termed as vanishing gradient. One additional novelty of GoogLeNet is pacifying the vanishing gradient through multi-headed loss, in contrast to the single loss block in traditional CNN image classification architectures. Thus, auxiliary branches also contribute to the correction of filters in the initial layers. For the current architecture of GoogLeNet, the input image should be of size 224 × 224 with three input channels, as shown in Fig. 2, and on forward pass results in an array of 1000 (1000 classes) after each softmax block. Here it should be noted that all convolution blocks are appended with a Rectified Linear Unit (ReLU) activation block and the dropout layer is appended after full connected, full connection a1 , and full connection b1 , but this is not represented in Fig. 2.

Formal definition to the solution
based on OCT images from training data and subtracted from the individual image (for data normalization). The GoogLeNet architecture is modified such that the final fully connected layer of both the main and auxiliary branches has three outputs. Before training there are various approaches for initialization of the filter weights: Random initialization, selective random initialization [41], auto-encoder, and transfer learning (initialization based on the weights of a CNN trained on another data set). The first approaches need a large amount of supervised training data (>10,000 images). The second approach needs a large amount of image data (unsupervised) but not labels. The third approach, which needs GoogLeNet trained on an image data set, is best suited to small training data sets (the current case), faster convergence, and high accuracy. To develop an understanding of information processing at each layer, a potential response is identified. This visualization assists a non-expert to understand information processing at each layer and an expert to identify layers that need to be modified in the case of critical errors.

Preprocessing
Initially, the saturated pixels (i.e., those with an intensity value of 255) in Fig. 3(a) are replaced with an intensity of 10 as shown in Fig. 3(b). The RPE is estimated as proposed in [42], the estimated RPE is smoothened, retinal flattening is performed, and the RPE lower contour is shifted to a fixed position (70% of the height) as shown in Fig. 3(c). The image is resized (bicubic kernel) to 224 × 224 as shown in Fig. 3(d) and filtered through BM3D filter as shown in Fig. 3(e). Figure 3 shows the intermediate results obtained for the preprocessing step with the BM3D result as the output. The BM3D result is replicated three times and each result is treated as channel information.

Modifications to GoogLeNet trained on ImageNet
• Fully connected layer, fully connected layer a2 , fully connected layer b2 have three outputs • Weights of all fully connected layers and convolutional layers in auxiliary branches are randomly initialized to learn the feature space shifts between ImageNet data and OCT data • All the last layers have been changed from soft-max to logsoft-max as the soft-max probabilities are low and avoid underflow of gradients.
• The modified GoogLeNet is trained with the adagrad [43] optimizer rather than adam or stochastic gradient descent (SGD), as adam results in impulsive step lengths and SGD needs expertly introduced learning rates at different epochs.

Identification of potential response
The capability of the fine-tuned CNN is evaluated by considering a test set and computing the prediction error. In a few cases, an expert interprets the filters and responses to identify repetitive filters, missing patterns, etc. This response visualization also helps a non-expert to understand information processing at each layer. Conventional practice involves supervised identification of the potential responses but each layer of a fine-tuned CNN has numerous filters resulting in a large number of responses. The current approach is automated and draws more inspiration from the basic definition of pathology, i.e., deviation from normality. Thus, given a test image and fine-tuned GoogLeNet on prediction of the test image being abnormal (AMD or DME) • The Pseudo-ground truth for test image is considered normal.
• Error back-propagation computes gradients for correcting each filter in every convolution block.
• In a convolution block, the magnitude of the 'correcting gradients' is proportional to the filter contribution to the prediction.
• In every convolution block all filters are ranked according to the magnitude, and the response of the top filter is treated as the potential response Based on the capability of human experts, the top-k potential responses can be identified. For simplicity only one response per convolutional layer is identified.

Experimentation and results
The Duke OCT data set [24] for classification of ophthalmic images consists of data from 45 subjects (15 AMD, 15 DME, and 15 normal) and multiple images exist for each subject. MATLAB 2015 b is employed to code the benchmark [24] and Torch (http://torch.ch/) is used for training and testing the CNN architectures. Training was conducted on a workstation with 48 GB RAM and a Nvidia tesla k40 GPU. The impact of transfer learning is illustrated by establishing a baseline where weights are randomly initialized. The code for the benchmark, proposed method, and base lines was released (https://github.com/ultrai/Chap_3) for reproducibility.
The training phase involves the following: • Select 24 subjects (first eight subjects from each class i.e., AMD, DME, and normal) and corresponding images are treated as training images.
• Each training image is preprocessed, self-replicated three times, and concatenated to generate the image to be 3 × 224 × 224.
• Assign images from the same class (AMD, DME, or normal) a single unique label.
• Compute the mean image (3 × 224 × 224) of the training set and subtract the mean image from each image in the training set.
• Divide the training set into batches where each batch contains ≤ 65 images.
• For each batch on a forward pass, CNN predicts 65 labels (one per image) and the negative log-likelihood computes the loss • On a backward pass, loss is translated to gradients at each layer through a chain rule in error back-propagation to correct the filters and weights.
• The amount of correction required is determined by the AdaGrad optimizer.
• A single iterative cycle involves selecting a subsequent batch, forward pass, backward pass, and correction of filters and weights. The current setup is trained by dividing the training set into 28 batches, and a collection of 28 iterative cycles is treated as an epoch and training is performed for 50 epochs, i.e., the training data proceeds though the CNN 50 times.
The testing set involves: • Select the remaining 21 subjects (seven subjects from each class i.e., AMD, DME, and normal), followed by preprocessing each image • Each image is self-replicated three times and concatenated, thus the resulting test image becomes 3 × 224 × 224.
• Subtract the mean image (3 × 224 × 224) of the training set from each test image. Fine-tuning GoogLeNet on a defined training set for 50 epochs and computing prediction accuracy on a defined test set (training set and testing sets have no overlap) at every epoch is treated as an experiment. Prediction (test) accuracy computation involves GoogLeNet predicting labels of all test set images and ratio between number of correctly predicted labels and total number of test images. For an experiment, CNN convergence and predictions are subjective to initialization. As fully connected layers are randomly initialized the performance changes from experiment to experiment. Similarly, the dropout block also randomly nullifies (setting to zero) the percentage of responses. This also results in the training model performance changing from experiment to experiment. The randomness that is introduced during model fine-tuning produces variable results for each experimentation; thus, in practice, multiple (10) experiments are repeated and the best model (96%) is stored. The test accuracy of each experiment for every epoch has been illustrated in Fig. 4. Keeping the repeatability in mind, the model with 94% accuracy is considered for evaluations (Table 1). Fig. 4. Repeatability of experimentation in terms of test accuracy also illustrates test accuracy divergence.
The impact of the transfer learning is illustrated by comparison with GoogLeNet trained with random initialization (rather than selective initialization) at all filters. The plots in Fig. 5 present the prediction accuracy on the complete test set (remaining seven subjects) images and illustrates faster convergence with improved accuracy for transfer-learning approach. The benchmark method [24] involves: image resizing, BM3D filtering, RPE estimation, retinal flattening, cropping to remove the background and choroid, and image pyramid construction. The training feature vector is constructed by concatenating the histogram of oriented gradients on each plane of the pyramid and three support vector machines (binary classifier) are trained in a one vs. one approach to accommodate multiclass classification through a binary classifier (SVM). However, to maintain an equal ground for comparison, the resizing, filtering, and RPE flattening are swapped according to the proposed preprocessing protocol.
Given a test set, images of each subject are separated and fine-tuned GoogLeNet predicts the label of each image. Majority voting of all predicted labels for each subject is treated as decision for the corresponding subject [24]. If the predicted decision class and actual patient class are the same then the decision pooling for the subject is '1'. Decision pooling per subject for the proposed method and benchmark is tabulated in Table 1. As the initial eight subjects are considered for training, the test set contains subjects 9 to 15; hence, the corresponding decision poolings are tabulated. It can be observed from the decision pooling that subject decisions are correct but the model is not 100% accurate. The benchmark employs an SVM, which has improved generalization capability with a smaller number of training samples. This has resulted in moderate performance. The fine-tuned model performs more accurately, particularly in comparison with the decisions relating to subjects 9 and 10. "0" and "1" imply incorrect and correct decisions, respectively, during testing.
Cross validation involves division of a data set into 15 folds, with each fold containing three subjects (one from each class). Each validation involves eight folds for training and seven folds for testing. Folds are sequentially chosen instead of randomly. The first validation includes folds 1-8 as training set and folds 9-15 as the testing test, the second validation includes fold 2-9 as the training set and folds 10-15 along with fold 1 as the test set and so on. This results in 15 validations. The mean of decision pooling across all validations for normal, AMD, and DME are 0.99, 0.89, and 0.86, respectively, for the proposed method.

Discussion
Deep learning has been successful in automatically crafting the feature quantifiers subjective to an image analysis application. As such, algorithms require an ample amount of data and higher number of epochs for convergence. Barriers such as these are avoided by fine-tuning models trained on similar data sets for various applications and this approach has proven to be successful. Transfer-learning-based image classification or segmentation for medical images mainly involves microscopic and CT images of which the underlying imaging distributions are related to natural images. A current application thereof involves fine-tuning models for OCT images of which the imaging physics incorporates speckles that are not similar to natural images.
In terms of space complexity there is no difference between fine-tuning an architecture and training an architecture from scratch (random initialization). However, the constraint for fine-tuning an architecture is on changing the filter size or number of filters of a layer in which case subsequent layers need to be trained from scratch. In pre-trained models, the n th layer filters might be responsible for a transformation (h) and subsequent layers are intended for transforming the information transformed by h. Changing the layer configuration changes the transformation (i.e., k instead of h), which enforces subsequent filters to learn a new transformation that can utilize k transformed information. In terms of decision pooling, finetuned architectures have outperformed their corresponding architectures trained from scratch. A common rule of thumb exists in deep learning that the deeper the architecture the more accurate the abstractions it can learn; however, no rule has been established regarding the depth of a CNN. Thus, the choice of depth is always user defined and is subjective to accuracy. The depth not only negatively impacts upon space complexity but also on time complexity. Our evaluation clearly illustrated that transfer learning has a definite advantage over learning from scratch. Because CNN is subjective to initialization, the performance varies with different protocols of selective initialization.
Future scope includes incubating the decision pooling inside the architecture; thus, given a stack of images from a subject rather than predicting each image class, the algorithm is expected to predict the subject class. This approach is commonly termed multiple instance learning.
It is common practice to identify a potential filter and corresponding response to visually interpret information processing at each layer. Even though an individual filter cannot quantify the pathology completely, it provides an approximation of a highly contributing filter. This facilitates identification of the areas of congestion of information processing in CNN. Given a 3 × 224 × 224 OCT image with AMD, the first convolution layer alone produces 64 responses; therefore, the potential response at each layer is identified autonomously and are shown in Fig. 6. The Conv1 layer (potential filter) identifies the filter that quantizes curves, which incidentally happens to identify the deformations in the RPE layer along with other subtle information. The ReLU only retains positive responses. Maxpooling induces spatial invariance (local) by allotting the maximum response to all neighboring pixels and stride results to reduce the image size. The LRN1 layer is renormalized to boost the responses. The potential responses of the last layers are illustrated as they are crucial for classification. The potential responses of inceptions 7, 8, and 9 include abstract patterns, which are difficult to interpret visually but distinct from the DME case. Given a 3 × 224 × 224 OCT image with DME, the autonomously identified potential response at each layer is shown in Fig. 7. In contrast to those of the AMD subjects, the conv1 potential response cannot illustrate salient markers to identify fluid deposits in the retina. However, as it moves on to the final layers, the potential responses of inception 7, 8, and 9 encode the responses of the previous layers to an abstract pattern that is distinguishable from AMD. One reason for identifying prominent retinal layer deformation instead of fluid deposits by CNN could be that "Layer deformation (Geographical Atrophy)" is a common factor in both pathologies. In the case of AMD, the drusen deposits beneath the RPE, which is hyperreflective, and Conv1 is responsive to RPE, but not at drusen. In the case of DME, the contrast between the intraretinal layers is low; hence, the top retinal layer is more responsive due to its sharper contrast. Inception7 of DME illustrates the prominence as salient feature, which could be fluid deposition but needs further validation by experts. All the potential responses have different dynamic ranges and are depicted in the corresponding sub-figures. All the responses are not transformed into common dynamic responses as few of the responses have a low dynamic range and transformation leads to the loss of visualization contrast. Future work involves the combination of responses into one response for visualization rather than identifying one single representative response.

Conclusion
An OCT-based retinal pathological image identification model has been learned for prescreening and remote clinical applications. The conventional approach requires heuristic feature quantifiers and large amounts of data, but the proposed approach learns application subjective feature quantifiers from data. Additionally, another advantage of the proposed approach is the employment of pre-trained models for faster convergence with less data. It has been illustrated that models trained on general images can be fine-tuned to classify OCT images. The method also has the ability to identify the potential response at each layer to understand information processing at each layer.

Appendix building blocks of GoogLeNet
A convolutional block consists of a set of filters and corresponding biases (scalar values) whose values are learned during training. Each filter ( ) is an unbounded activation function, which means LRN acts as regularizer. On a 2D response plane LRN aims at identifying high responses in comparison to neighboring responses and making it more sensitive. It also pacifies uniform or multiple large responses within a neighborhood. The neighborhood need not be limited to 2D and can be extended to depth (across response planes). A pooling block replaces each element based on a statistical operation (maximum or mean) within a neighborhood for each 2D response plane (image dilation or mean filtering). A fully connected layer is a traditional artificial neural network (ANN) where weights and bias are learned during training. Dropout is an approach where a percentage of responses are set to zero to improve the generalization of the model. The loss block computes the error between the predicted class and actual class. The error computed in the loss layer is utilized to correct each of the convolutional block filter values and fully connected layer weights with gradients computed through error back-propagation.
Stride or shift(s) is a user-defined factor in the convolution and pooling blocks, i.e., after computing convolution or pooling at the (x,y) element, the next convolution or pooling computation is at (x+s,y) or (x,y+s). Thus, given an image or response of size m,n a stride s operation results in approximately m/s,n/s. The inception layer processes the given information in four independent streams of information processing and concatenates along the depth at the output. The first stream involves compression through a filter with size 1×1 and where the number of filters are less than (data compression) or equivalent to the depth of input. This situation results in a reduced or equal depth of output compared to the input depth but no change along the height and width of the response. The second stream involves two convolution blocks of filter height × width, the size being 1×1 and 3×3, respectively. The initial block performs data compression and the subsequent block is responsible for quantifying the features within a neighborhood with a depth of 3×3. The third stream involves two convolution blocks of filter height × width, the size being 1×1 and 5×5, respectively. As appending convolution blocks of the second and third streams have different neighborhoods they act in a way similar to multi-scale signal processing. The fourth stream involves a maxpooling block and convolution block of filter height × width, the size being 3×3 and 1×1, respectively. A high response at a location on a 2D plane at the convolution block output implies the presence of an important feature at the same location in the input within a neighborhood of convolutional block filters with height and width (other than data compressing convolutional blocks). This implies that the high response is contributed by neighboring elements of the location corresponding to the input. To incorporate this intuition, maxpooling is performed in the fourth stream and this is followed by the convolution block for data compression.

Funding
This study did not receive any funding.