Quadratic divergence regularized SVM for optic disc segmentation

Abstract: Machine learning has been used in many retinal image processing applications such as optic disc segmentation. It assumes that the training and testing data sets have the same feature distribution. However, retinal images are often collected under different conditions and may have different feature distributions. Therefore, the models trained from one data set may not work well for another data set. However, it is often too expensive and time consuming to label the needed training data and rebuild the models for all different data sets. In this paper, we propose a novel quadratic divergence regularized support vector machine (QDSVM) to transfer the knowledge from domains with sufficient training data to domains with limited or even no training data. The proposed method minimizes the distribution difference between the source domain and the target domain while training the classifier. Experimental results show that the proposed transfer learning based method reduces the classification error at the superpixel level from 14.2% without transfer learning to 2.4% with transfer learning. The proposed method effectively transfers the label knowledge from the source to the target domain, which enables it to be used for optic disc segmentation on data sets with different feature distributions.


Introduction
The optic disc (OD) is the location where ganglion cell axons exit the eye to form the optic nerve. The segmentation of the OD has been identified as an important step in the computer aided diagnosis/detection/screening of diseases such as glaucoma, diabetic retinopathy, and age-related macular degeneration [1]. Figure 1 shows the boundary of the OD in a retinal color fundus image. In recent years, OD segmentation has received much attention and many algorithms have been proposed. Existing methods for OD segmentation are mainly based on template matching [2-7], deformable models [8-13] or learning [14-17].
Template based approaches often model the disc as a circle [3] or an ellipse [4], as discs are approximately circular or elliptical. In these approaches, the disc is often detected by a Hough transform applied to the edges in the image, while different authors propose different methods to detect the edges of the discs. Some authors eliminate edges from vessels by first segmenting the vessels [5]. Deformable model based approaches usually start with an initial disc contour and deform it toward the disc edges based on various energy terms such as boundary energy, shape energy, and region energy [13]. These energy terms are often defined using image intensity, image gradient, and/or boundary smoothness. Learning approaches often learn/train classifiers from existing labelled data to classify pixels as disc or non-disc pixels. Abràmoff et al. [14] propose the pixel classification approach and explore various features, including intensity, gradient, Gaussian blurred feature, and prior probability, for disc and cup segmentation. The limitation of pixel classification is that the high number of pixels makes the optimization difficult. Therefore, superpixel classification [16] is proposed. By first over-segmenting the images into superpixels, the approach significantly reduces the amount of data in both training and testing. Recently, an ensemble sampling based convolutional neural network has also been used to learn the model for disc segmentation [17], where an entropy based sampling technique is proposed to identify the most informative samples for training.
This paper focuses on learning based methods. The learning based methods generally work under a common assumption that the training and testing data are drawn from the same feature space with the same distributions. When the feature distribution changes, the models need to be rebuilt from scratch using newly collected training data. Retinal color fundus images are often collected under different conditions, e.g., camera models, image resolutions, races, dilated or un-dilated eyes, etc. Therefore, the feature distributions from different data sets are often different, and the models trained from one data set are not optimized for another data set. However, it is often too expensive and time consuming to re-collect or label the needed training data and rebuild the models for all different data sets. Meanwhile, classifiers trained from only a few labelled data are usually not robust. It would therefore be desirable to reduce the need and effort to re-collect the training data. In such cases, knowledge transfer or transfer learning is attractive. Transfer learning is a popular learning scheme which transfers the knowledge from domains with sufficient training data to domains with few or no training data. It learns more robust classifiers by leveraging a relatively large amount of labelled training data from other domains, referred to as source domains. Efforts have been made in transfer learning for many applications, including text classification, video concept detection, and face recognition. Pan et al. [18] propose transfer learning via dimensionality reduction to minimize the maximum mean discrepancy (MMD). Duan et al. use a similar concept in the domain transfer support vector machine (SVM) [19]. Si et al. propose a framework for transfer subspace learning via dimensionality reduction [20]. In [21], Cheng et al. use the domain transfer SVM for MCI conversion prediction.
In this paper, we propose a new transfer learning method, referred to as quadratic divergence regularized SVM (QDSVM), for OD segmentation. Since the objective of the transfer is classification using an SVM, the discriminative information should be maximally kept and transferred when training the SVM classifier. The proposed QDSVM therefore integrates the objective of minimizing the distribution difference between the source and target domains, measured by the quadratic divergence, with the objective function of the SVM, while learning a mapping from high dimensional data to low dimensional data. The main contributions of this paper are: (1) it proposes QDSVM, which minimizes the distribution difference measured by the quadratic divergence while training the SVM classifier; (2) it applies QDSVM to OD segmentation and improves the segmentation results. To the best of our knowledge, the proposed method is the first to apply transfer learning for OD segmentation when the testing data has a different feature distribution.
The rest of the paper is organized as follows. Section 2 reviews existing learning based disc segmentation algorithms. Section 3 introduces the proposed QDSVM method and its application to OD segmentation. Section 4 shows the experimental results, followed by the discussion and conclusions in the last section.

Learning based disc segmentation
In this section, we first give a short review of learning based disc segmentation methods. Machine learning has been used for optic disc segmentation. In [14], pixel classification is first proposed. It is used to assign a class to the pixels in an image, in this case disc or background. Pixel classification uses multiple pixel features, which are numeric properties of a pixel and its surroundings. One advantage of the pixel classification method is that it avoids the potential bias of the algorithm developer, which is common in unsupervised approaches such as edge-based, deformable model or template-based algorithms [14]. Previously, we have also proposed a learning based method to segment the disc [16]. Different from [14], our approach applies the learning at the superpixel level instead of the pixel level. Fig. 2 shows the flow chart of the method. The input image is first over-segmented into superpixels using the simple linear iterative clustering (SLIC) [22] method. Then, various features such as histograms and center surround statistics [16] are computed from each superpixel. An SVM classifier is then trained to obtain an initial estimation of the disc boundary. In our implementation, the output value of the SVM decision function for each superpixel is used as the decision value for all pixels in that superpixel. A smoothing filter is then applied to obtain smoothed decision values, which are compared with a threshold to obtain the binary decisions for all pixels. A morphological operation is applied to obtain the largest connected component, whose boundary is used as the raw estimation of the disc boundary. Finally, the active shape model (ASM) is applied to obtain the final disc boundary. Here we give a brief introduction to the ASM algorithm; more details can be found in [12].
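The post-classification steps above (broadcasting each superpixel's SVM decision value to its pixels, smoothing, thresholding, and keeping the largest connected component) can be sketched as follows. The function name, the choice of a uniform smoothing filter, and the default threshold of zero are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy import ndimage

def decisions_to_mask(superpixel_labels, decision_values,
                      smooth_size=5, threshold=0.0):
    """Sketch of the post-classification pipeline steps: broadcast each
    superpixel's SVM decision value to its pixels, smooth, threshold,
    and keep the largest connected component."""
    # Every pixel inherits the decision value of its superpixel.
    pixel_scores = decision_values[superpixel_labels]
    # Smoothing filter on the pixel-level decision map.
    smoothed = ndimage.uniform_filter(pixel_scores, size=smooth_size)
    # Binary decision per pixel.
    binary = smoothed > threshold
    # Morphological step: keep only the largest connected component.
    labelled, n = ndimage.label(binary)
    if n == 0:
        return binary
    sizes = ndimage.sum(binary, labelled, range(1, n + 1))
    return labelled == (np.argmax(sizes) + 1)
```

The boundary of the returned mask then serves as the raw disc estimate that initializes the ASM step.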
In ASM, we represent a 2D disc shape by a vector p = (m_1, n_1, · · · , m_Z, n_Z) with Z points, where (m_i, n_i) denotes the coordinates of the i-th point. To build the model, a data set with manual disc points is used and the disc shapes are aligned based on their centers. The mean and the covariance of the shapes are computed as

p̄ = (1/N) Σ_{i=1}^{N} p_i,   S = (1/(N−1)) Σ_{i=1}^{N} (p_i − p̄)(p_i − p̄)^T,

where N is the number of training shapes. By applying a principal component analysis, the first K components φ_1, · · · , φ_K are obtained. The shape is then approximated as

p ≈ p̄ + φb,

where b = φ^T(p − p̄) denotes the vector of K elements containing the model parameters and φ = (φ_1 | · · · | φ_K) stores the K eigenvectors of the principal component analysis. In our algorithm, we use Z = 24 landmarks evenly sampled along the disc boundary and K is set to keep 99% of the variance in the principal component analysis. The profile model is built using the normalized first derivatives of pixel profiles from points around each landmark point, in the direction perpendicular to the line that connects the neighboring landmark points. Denoting the normalized derivative profiles as g_1, g_2, · · ·, the mean profile ḡ and the covariance matrix S_g are computed for each landmark. The new landmark position is obtained by minimizing the following Mahalanobis distance:

f(g) = (g − ḡ)^T S_g^{−1} (g − ḡ).

In our implementation, we first obtain the initial disc boundary based on the output of the SVM classification. We then choose the 24 landmark points around the initial disc boundary, with each pair of adjacent points forming an angle of 15 degrees with the disc center. The landmark points are then iteratively updated by minimizing the Mahalanobis distance in the direction perpendicular to the line that connects the neighboring landmark points, until convergence or until the maximum number of iterations is reached.
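The point-distribution part of the ASM above (mean shape, PCA basis φ keeping a given fraction of variance, and the parameterization b = φ^T(p − p̄)) can be sketched in a few lines. The helper names and the SVD route to the PCA are illustrative choices, not the paper's code.

```python
import numpy as np

def build_shape_model(shapes, variance_kept=0.99):
    """Point-distribution model: mean shape and the PCA basis phi keeping
    the requested fraction of variance. `shapes` is (N, 2Z), each row
    being (m1, n1, ..., mZ, nZ)."""
    mean = shapes.mean(axis=0)
    # PCA via SVD of the centered shape matrix.
    _, s, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(ratio, variance_kept) + 1)
    return mean, vt[:k].T           # phi has shape (2Z, K)

def project(p, mean, phi):
    """Model parameters b = phi^T (p - p_bar)."""
    return phi.T @ (p - mean)

def reconstruct(b, mean, phi):
    """Approximated shape p ~ p_bar + phi b."""
    return mean + phi @ b
```

Projecting a shape and reconstructing it recovers the shape up to the variance discarded by the truncated basis.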

Quadratic divergence regularized SVM
For the classification task, given n_s samples X_s from the source domain and n_t samples X_t from the target domain, all drawn from a high-dimensional space R^D, a subspace learning algorithm finds a low-dimensional subspace R^d in which samples from different classes can be well separated. Let us denote the sets of labelled and unlabelled samples from the target domain as X_t^l and X_t^u respectively, with X_t = X_t^l ∪ X_t^u, and represent the labelled training data set as X = X_s ∪ X_t^l. The proposed QDSVM computes a mapping P from X and X_t^u in the high-dimensional space R^D to Y and Y_t^u in the low-dimensional space R^d, where P ∈ R^{D×d} and d < D, i.e., Y = XP and Y_t^u = X_t^u P. The mapping P is computed such that it minimizes both the SVM objective function and the distribution difference measured by the quadratic divergence.
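To make the shapes concrete, the sketch below builds the projections Y = XP and Y_t^u = X_t^u P with a random orthonormal P; in QDSVM the mapping P is learned jointly with the classifier, so the random basis here only illustrates the dimensions involved, not the method.

```python
import numpy as np

# Samples live in R^D and are projected to R^d with d < D.
D, d = 10, 3
rng = np.random.default_rng(0)

# Stand-in for the learned mapping: an orthonormal D x d basis.
P, _ = np.linalg.qr(rng.standard_normal((D, d)))   # P in R^{D x d}

X = rng.standard_normal((5, D))      # labelled training data X = X_s u X_t^l
X_tu = rng.standard_normal((4, D))   # unlabelled target data X_t^u

Y = X @ P          # low-dimensional representation of X
Y_tu = X_tu @ P    # low-dimensional representation of X_t^u
```

Both the SVM objective and the divergence regularizer in the following subsections are evaluated on these projected representations.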

Support vector machine
In SVM, the focus is to find the maximal separating hyperplane between classes. Given a set of l training instances Y_i = X_i P, i = 1, · · · , l, with labels z_i ∈ {+1, −1}, where X_i is the original feature vector and Y_i the corresponding feature vector in the projected space P, the reformulated primal problem of classical SVM [23] is given by

min_w (1/2)‖w‖² + Σ_{i=1}^{l} ℓ(z_i Y_i, w),    (5)

where ℓ(z_i Y_i, w) = max{0, 1 − z_i Y_i w} denotes the hinge loss and w the classification hyperplane.
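A minimal sketch of this objective, assuming an unweighted sum of hinge losses over the l training instances (the exact trade-off weighting of [23] is not reproduced here):

```python
import numpy as np

def svm_primal_objective(w, Y, z):
    """Primal objective 0.5*||w||^2 + sum_i l(z_i Y_i, w) with hinge loss
    l(z_i Y_i, w) = max(0, 1 - z_i * (Y_i . w))."""
    margins = z * (Y @ w)                  # z_i * (Y_i . w) per instance
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * (w @ w) + hinge.sum()
```

For an instance classified correctly with margin at least 1, the hinge term vanishes and only the regularizer 0.5‖w‖² contributes.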

Quadratic divergence-based regularization
There are many criteria to estimate the distance between different distributions. A well-known example is the quadratic divergence. Denoting the probability distributions of the labelled training data and the unlabelled testing data as Q_l and Q_u respectively, the quadratic divergence measures the difference between the probability distributions Q_l and Q_u. Specifically, the divergence of Q_u from Q_l, denoted D(Q_l‖Q_u), measures the information lost when Q_u is used to approximate Q_l.
In this paper, we present a quadratic divergence-based regularization D_P(Q_l‖Q_u), which measures the distribution difference of samples drawn from different domains in a projected subspace P. By integrating this regularization into (5), we obtain the objective function of QDSVM:

min_{P,w} (1/2)‖w‖² + Σ_{i=1}^{l} ℓ(z_i Y_i, w) + λ D_P(Q_l‖Q_u),    (6)

where λ is a parameter that controls the weight of the quadratic divergence regularization. Denote the probability densities of the training and testing samples in the projected subspace P as q_l and q_u, where y = P^T x is the low-dimensional representation of the sample x. The difference is given by

D_P(Q_l‖Q_u) = ∫ (q_l(y) − q_u(y))² dy.    (7)

To estimate the distributions Q_l and Q_u in the projected space P, we apply the kernel density estimation (KDE) technique, which estimates the density at y as a sum of kernels between y and each of the other samples, i.e., q(y) = (1/n) Σ_{i=1}^{n} G_Σ(y − y_i). Here, n is the number of samples and G_Σ(y) is the d-dimensional Gaussian kernel with covariance matrix Σ. By introducing the estimated distributions based on KDE into (7), we have

D_P(Q_l‖Q_u) = ∫ ( (1/n_l) Σ_{i=1}^{n_l} G_{Σ_l}(y − y_i^l) − (1/n_u) Σ_{j=1}^{n_u} G_{Σ_u}(y − y_j^u) )² dy.    (8)

Since ∫ G_{Σ_1}(y − y^s) G_{Σ_2}(y − y^t) dy = G_{Σ_1+Σ_2}(y^s − y^t), we can eliminate the integrations in (8).
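The elimination of the integral can be verified numerically: expanding the square gives three groups of pairwise kernel terms, each reducible by the identity above to a Gaussian with summed covariance, so no numerical integration remains. The sketch below assumes isotropic bandwidths Σ_l = Σ_u = σ²I for brevity.

```python
import numpy as np

def gauss(diff, var):
    """Isotropic d-dimensional Gaussian kernel G_Sigma with Sigma = var * I,
    evaluated at each pairwise difference in `diff` (shape (..., d))."""
    d = diff.shape[-1]
    norm = (2.0 * np.pi * var) ** (-d / 2.0)
    return norm * np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / var)

def quadratic_divergence(Yl, Yu, sigma2=1.0):
    """Plug-in estimate of int (q_l(y) - q_u(y))^2 dy under KDE with
    bandwidth sigma2*I on both domains. Expanding the square and applying
    int G_S1(y - a) G_S2(y - b) dy = G_{S1+S2}(a - b) leaves only pairwise
    kernels with doubled bandwidth."""
    def mean_kernel(A, B):
        diff = A[:, None, :] - B[None, :, :]
        return gauss(diff, 2.0 * sigma2).mean()
    return mean_kernel(Yl, Yl) - 2.0 * mean_kernel(Yl, Yu) + mean_kernel(Yu, Yu)
```

The estimate is zero when the two sample sets coincide and grows as the projected distributions drift apart, which is exactly the behaviour the regularizer penalizes.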

Solution
As both P and w are unknown in (6), we solve for them iteratively: in each iteration we first solve for P and then for w, and the iterations continue until convergence. Denoting the objective function in (6) as F(P, w), we need to smooth F(P, w), as the hinge loss ℓ(z_i Y_i, w) in (6) is not differentiable. It can be smoothed by subtracting a prox-function d_1(u) = Σ_{i=1}^{n} ‖X_i‖_∞ u_i² [23], where μ is the smoothing parameter and each u_i can be obtained in closed form by setting the gradient of the smoothed objective with respect to u_i to zero. We thus obtain the smoothed objective function F_μ(P, w) for QDSVM. We apply Nesterov's method [24] to minimize the smoothed F_μ(P, w) iteratively. It is a gradient method with the optimal convergence rate O(1/k²). In the k-th iteration, two auxiliary solutions P_1^k and P_2^k are constructed from gradient steps with step size determined by the Lipschitz constant L_μ, and their weighted sum determines the solution of P in the next iteration. The SVM solution w can be solved by the same Nesterov's method.
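The w-step can be sketched with a Nesterov-smoothed hinge loss and the standard accelerated gradient updates. The uniform smoothing weight here (instead of the ‖X_i‖_∞-weighted prox-function of [23]) and the Lipschitz bound are simplifying assumptions, and the alternating P-step is omitted.

```python
import numpy as np

def smoothed_hinge(margins, mu):
    """Nesterov-smoothed hinge: max over 0 <= u <= 1 of
    u*(1 - m) - (mu/2)*u^2, with maximizer u = clip((1 - m)/mu, 0, 1).
    Returns (total loss, d(loss)/d(margin) per instance)."""
    u = np.clip((1.0 - margins) / mu, 0.0, 1.0)
    loss = (u * (1.0 - margins) - 0.5 * mu * u ** 2).sum()
    return loss, -u

def nesterov_svm(Y, z, mu=0.1, n_iter=300):
    """Accelerated gradient descent on 0.5*||w||^2 + sum_i l_mu(z_i Y_i . w),
    i.e. the w-step only; the full QDSVM alternates this with a P-step."""
    n, d = Y.shape
    # Lipschitz bound of the gradient: 1 from the regularizer,
    # ||Y||_2^2 / mu from the smoothed losses.
    L = 1.0 + np.linalg.norm(Y, 2) ** 2 / mu
    w = np.zeros(d)
    v = w.copy()
    t = 1.0
    for _ in range(n_iter):
        margins = z * (Y @ v)
        _, dloss_dm = smoothed_hinge(margins, mu)
        grad = v + Y.T @ (z * dloss_dm)
        w_next = v - grad / L                              # gradient step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        v = w_next + ((t - 1.0) / t_next) * (w_next - w)   # momentum
        w, t = w_next, t_next
    return w
```

On the toy problem Y = [[1], [−1]], z = [1, −1] with μ = 0.1, the stationarity condition w = 2(1 − w)/μ gives w* = 20/21, which the iterates approach at the O(1/k²) rate.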

Experimental results
We use 325 images with dimension 3072 × 2048 from the Singapore Malay Eye Study (SiMES) [16] as the source domain in our experiments. The disc boundaries of these images have been manually marked by trained professionals. The SiMES images were acquired using a Canon CR-DGi digital retinal camera. The target domain contains 1676 images from the Singapore Chinese Eye Study (SCES) with image dimensions 3504 × 2336 and 3888 × 2592, acquired using a Canon CR-1 Mark II camera. As we can see, the two data sets are from different races and were captured by different cameras. In our experiments, the SLIC [22] algorithm is first applied to generate superpixels. Based on the manual disc boundary, superpixels with more than 50% of their pixels within the disc boundary are labelled as disc superpixels and the rest as non-disc superpixels.
To evaluate the performance of the method, the accuracy of the superpixel classification is computed. In addition, the overlapping error E is also computed to examine the difference between the automated and manual discs:

E = 1 − |S ∩ M| / |S ∪ M|,

where S and M denote the segmented and the manual disc respectively. To evaluate how transfer learning improves the performance, we conducted tests in two scenarios depending on the availability of labelled data from the target domain for training.

No labelled data from target domain is available for training

In the first scenario, there is no labelled data from the target domain, i.e., X_t^l = ∅. This is a very common scenario in clinical applications, as it is often difficult and troublesome to manually mark the boundaries of the optic disc for a new data set. The training data X = X_s includes all 325 images in the SiMES data set and X_t = X_t^u includes all 1676 images in the SCES data set. We use the proposed QDSVM to compute the mapping P and the SVM classifier simultaneously. In the testing, the obtained P and classifier are used to classify each superpixel as disc or non-disc. To illustrate the benefit of transfer learning, we compare the results with and without transfer learning. In addition, we have also applied the transferred Fisher's linear discriminant analysis (TFLDA) [20] and the domain transfer SVM (DTSVM) [19] to obtain different classifiers for comparison. These classifiers are also used to obtain disc segmentation results following the same flow in Fig. 2. Besides that, we have also included the simple ASM method [12], which applies a Hough transform followed by an ASM deformation, for comparison. Table 1 shows the overlapping errors between the manual and automated discs. It shows that the proposed QDSVM reduces the error by 8.3% relatively, from 12.0% to 11.0%, compared with the case without transfer learning.
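The overlapping error on binary masks can be computed directly; `overlap_error` is a hypothetical helper name for the metric defined above.

```python
import numpy as np

def overlap_error(segmented, manual):
    """Overlapping error E = 1 - |S intersect M| / |S union M| between the
    segmented disc S and the manual disc M, given as binary masks."""
    S = np.asarray(segmented, dtype=bool)
    M = np.asarray(manual, dtype=bool)
    union = np.logical_or(S, M).sum()
    if union == 0:
        return 0.0   # both masks empty: define the error as zero
    return 1.0 - np.logical_and(S, M).sum() / union
```

A perfect segmentation gives E = 0, while disjoint discs give E = 1.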
Since the transfer learning only improves the classification step at the superpixel level, it is also important to compare the results before the ASM deformation. Fig. 3 shows three sample results. In the figure, we show the disc contours both before and after applying ASM. As we can see, QDSVM achieves better initial disc contours for ASM and often a more accurate final disc boundary after ASM. It is noted that the error reduction before ASM is much larger than that after ASM. This is reasonable, as the ASM may deform to the same contour from two different initial contours in many cases, which neutralizes part of the improvement from QDSVM. Nevertheless, a Student's t-test shows p < 0.01, which indicates that the error reduction after ASM is still significant.

Some labelled data from target domain is available for training
In the second scenario, the doctors are able to label a few images, i.e., X_t^l ≠ ∅ and the training data X = X_s ∪ X_t^l. In our experiments, we evaluate the performance of the proposed method when different numbers of labelled images are used. For convenience, we divide the SCES data set into two subsets A and B. Subset A consists of 1351 images randomly selected from the SCES data set; we use this fixed subset for testing. Subset B consists of the remaining 325 SCES images, from which we randomly select n images to form X_t^l. To evaluate how the number n affects the results, we vary n from 25 to 175 with a step of 25, and from 175 to 325 with a step of 50. We consider the following scenarios to justify the effectiveness of transfer learning: 1) we train the standard SVM using labelled data from the target domain only, i.e., X = X_t^l; 2) we train the standard SVM using labelled data from both the source and target domains, i.e., X = X_t^l ∪ X_s; 3) we apply QDSVM transfer learning with X = X_t^l ∪ X_s. In addition, TFLDA and DTSVM are also tested for comparison. All tests are done on subset A. Fig. 4 shows the overlapping error as n changes. The results show that QDSVM achieves the lowest overlapping error. The improvement is large when n is small and drops as n increases. This is reasonable, as a larger n helps the other algorithms build a more robust classifier in the target domain.

Performance on MESSIDOR
Besides the local data sets, we have also conducted experiments on the public MESSIDOR data set [25,26], which has been widely used since it was published online [27]. Since our objective is to transfer the knowledge from another domain, we assume the label information in the MESSIDOR data set is unknown in the training stage. In the experiments, we use the proposed QDSVM to compute the mapping P and the SVM classifier simultaneously, as in Section 4.1, and achieve a mean overlapping error of 11.1%. To justify the benefit of transfer learning, we also run the superpixel classification based method [16] without transfer learning, training the SVM classifier on the local SiMES data set and testing on MESSIDOR, which gives an average overlapping error of 12.5%. In addition, we also include the results from [5,6,17] for comparison. Table 2 summarizes the results. Our result is better than the 12.0% of Giachetti's method [5] and comparable to the 11.0% of Dashtbozorg's method [6]. Zilly's method [17] achieves the best result of 10.0%, but it uses labelled information from the target domain in the training, whereas our method does not use such information.

Conclusions
In this paper, we propose QDSVM for transfer learning in OD segmentation. QDSVM simultaneously minimizes the distribution difference between data from the source and target domains and learns an SVM classifier for classification. Experimental results show that QDSVM reduces the error in OD segmentation significantly on both local and public data sets. The technology can be used to improve the robustness of learning based OD segmentation algorithms on data sets with different feature distributions.