FundusPosNet: A deep learning driven heatmap regression model for the joint localization of optic disc and fovea centers in color fundus images

The localization of the optic disc and fovea is crucial in the automated diagnosis of various retinal diseases. We propose a novel deep learning driven heatmap regression model based on the encoder-decoder architecture for the joint detection of optic disc and fovea centers in color fundus images. To train the regression model, we transform the ground-truth center coordinates of optic disc and fovea of the IDRiD dataset to heatmaps using a 2D-Gaussian equation. The model is capable of pinpointing any single pixel in a vast 2D image space. The model is tested on IDRiD test dataset, Messidor, and G1020 datasets. The model outperforms the state-of-the-art methods on these datasets. The model is very robust and generic, which can be trained and used for the simultaneous localization of multiple landmarks in different medical image datasets. The full implementation code and the trained model with weights (based on Keras) are available for reuse at https://github.com/bhargav-jb/FundusPosNet.


I. INTRODUCTION
Automated diagnosis of retinal diseases saves a considerable amount of human labor and other resources involved in contrast to manual diagnosis. It also increases the accuracy and efficiency of the screening process. The automated diagnosis of most of the diseases in fundus images requires the precise localization of optic disc (OD) and fovea. These two landmarks define the reference points for detecting other crucial anatomical and pathological structures in the retina [1]. The important anatomical structures and their relative distances in a color fundus image is shown in Fig. 1.
The precise detection of OD and fovea center coordinates is challenging because of the vast fundus image space. The problem of finding the centers of OD and fovea can be related to human pose estimation tasks where a trained convolutional neural network (CNN) generates N heatmaps corresponding to different key-joints in the human body [2] [3] [4]. Further,these key-joints are used to predict the actions performed by human beings. The heatmap regression technique is applied to locate the specific coordinates in Xray images [5], which can also be extended to determine the landmarks' coordinates in fundus images. The initial models designed for the landmark localization using the regression technique were directly regressing the coordinates using fully-connected layers on top of a CNN feature extractor [6]. This method primarily suffered from localization error as the task of precisely mapping the coordinates in a very large image space was highly non-linear in the context. The recent advancements in human pose estimations have shown remarkable accuracy in landmark's detection using a heatmap based coordinate regression technique. The heatmap based regression technique achieves a very low localization error as it also considers spatial context during regression.
Image augmentation techniques are commonly used to artificially increase the size of the training and validation datasets in deep learning [7]. We propose a novel deep learning regression model for jointly regressing the fovea and optic disc centers' coordinates using encoder-decoder architecture. We name our model, the FundusPosNet. A unique regression technique is proposed to automatically generate the heatmap labels for the OD and fovea regions in fundus images. The proposed network is trained on these labels to generate heatmaps for OD and fovea landmarks. The proposed model achieves state-of-art results on IDRiD [8], Messidor [9], and G1020 [10] datasets.

II. STATE-OF-THE-ART
Fundus image analysis is an active research area, and recently there have been many significant works related to the joint detection of the optic disc and fovea region centers.In [11] the authors presented a milestone in the joint detection and segmentation of crucial retinal structures. Their method jointly detected the three major anatomical structures, the macula, the OD, and the vascular arch, in the retinal images. The algorithm outputs 16 distinct points in the retinal image representing the three major anatomical structures by fitting a single point-distribution-model. The method uses a cost function to find the correct positions of anatomical structures based on a combination of local and global cues fetched from the reference images. This method has a limitation that both macula and the optic disc regions should be at the center of the fundus images. In their next work, the same authors proposed a regression method to localize the fovea and optic disc using a kNN regressor [12]. The proposed technique uses two templates to extract the features to localize the optic disc and fovea. It requires the availability of vessel extraction as a prerequisite for the input images. The method used training images manually marked for optic disc and fovea center to train the regressor. The regressor first selects the point with the lowest predicted distance to the optic disc as the optic disc center and based on this, the search area for the fovea is defined. The regressor selects the location with the least predicted distance to the fovea as the fovea center location from this search area.
A faster method for the optic disc and fovea localization was proposed using template matching and directional matched filters in CIElab color space [13]. The method also uses the vessel characteristics in the optic disc to avoid false positives. A search area for locating the fovea is defined based on the optic disc location and its diameter. The point of lowest matched filter response within the search area is selected as the fovea center.
The usage of a saliency region detection algorithm to detect the optic disc in CIElab color space was proposed in [14]. A saliency map may identify multiple image regions as the possible optic disc due to pathological symptoms. For validating the optic disc detection, an unsupervised, probabilistic latent semantic analysis classification algorithm was used, which uses the specific structure of vasculature in the detected region. After detecting the optic disc, the estimation of possible fovea region was done using the prior knowledge about the distance of the fovea center from the optic disc center along the axis of symmetry of a parabola whose vertex is at the center of the optic disc. The proposed method fails if the OD region is damaged or lacks saliency concerning vessel structure, color, and illuminance in the image.
The semi-elliptical convex shapes like the OD in the fundus image can be detected using super-elliptical filters [SEF] [15]. The authors have also proposed a setup for the simultaneous OD and fovea detection using two individual SEF filters located at a fixed distance from each other according to the vertical and horizontal distances between the OD and fovea mentioned in [16].
A unified approach for detecting OD and fovea based on normalized cross-correlation [NCC] technique was presented in [17]. The method performs NCC on fundus images using the OD and fovea templates obtained from cropping the specific regions in the sample fundus images by the experts. To optimize the traditional NCC technique, the authors have replaced the conventional mean and variance operations with vector inner products and norms. To further increase the detection speed, they have performed NCC on downsampled templates and down-sampled fundus images, and finally mapping the region-of-interest (ROI) obtained as a result back to the original fundus image.
The signal and intensity domain information from the fundus images are used to detect OD and fovea locations. The method proposed in [18] uses 1-D projections of the image feature set in which 19 scanned lines were used to identify the landmarks' precise locations. For the detection of OD, the method uses the intensity variation information of the central optic nerve and retinal vessels emerging from OD. This variation in the intensity is very different from any other variations in the intensities resulting from other image pathologies. The peak-valley analysis is performed on the scanned intensity lines to select OD's center coordinates, followed by choosing the reduced search space to detect the fovea. The signal-valley analysis is then performed on the reduced search space to precisely detect the fovea center.
Convolutional neural networks (CNN) are recently gaining much popularity in medical image analysis. A 7-layer CNN to jointly segment OD, fovea, retinal vasculature, and background regions was proposed by [19]. In this method, the three channels of the input are passed to the CNN for every pixel location (x, y) in the ROI for the classification. The first channel of input is a 7 × 7 neighborhood of the pixel (x, y) scaled to a size of 33 × 33. The second channel of input is the 33 × 33 neighborhood of the pixels. The third channel of input is a 165 × 165 neighborhood with the pixel (x, y) as its center but scaled down to a size of 33 × 33. The CNN used for classification has five hidden layers and an output layer with four neurons for 4-class classification.
A multi-stage faster-RCNN network for the OD and fovea detection is given in [20]. In the first stage, OD detection is performed using the traditional faster-RCNN, followed by OD segmentation using SVM. The second stage uses an RPIbased faster-RCNN to segment the fovea, followed by its center regression.
In most fundus images, the relative spatial positions and optic disc and fovea size are constant. The authors in [21] exploited this constant relative geometry to jointly detect the centers of optic disc and fovea in their two-stage proposed method. In the first stage, a relation network draws bounding boxes (ROI) around the OD and fovea. The relation network uses a Faster-RCNN [22] with Resnet-101 [23]. In the second stage, a simple regressor implemented as a two-layer CNN was used to jointly regress the center of OD and fovea inside the bounding boxes.
The first regression model to perform a pixel-wise regression task to jointly detect OD and fovea was proposed by [9]. The method employs a fully convolutional deep neural network for jointly regressing the centers of OD and fovea. The network learns on the entire image to assess the global features for predicting two minimal distances as OD and fovea centers instead of learning on specific cropped ROI representing them.
A deep multi-scale sequential CNN was used to regress OD and fovea centers in [1]. The proposed method is fast and robust, which does not depend on the relative geometry information between the landmarks in the fundus image. The network has two stages of CNNs. In the first stage, the ROIs of both OD and fovea are extracted from the input image. The second stage takes these ROIs as the input and performs regression to detect both fovea and OD centers.

III. PROPOSED MODEL
In this section, we emphasize the details of the FundusPos-Net design and its implementation. We explain the network architecture, dataset preparation, setup used for training the network, and the details of the new method used for extracting the landmark's center from the predicted heatmaps.

A. NETWORK ARCHITECTURE
FundusPosNet uses encoder-decoder architecture as its backbone inspired by the revolutionary U-Net model [24] designed for biomedical image segmentation. The network takes 128 × 128 × 3 fundus image as its input and outputs two 128 × 128 heatmaps having pixel values in the range between 0-1. These two heatmaps represent the two crucial landmarks OD and fovea in the fundus image.
During network training, the encoder path learns to map the input fundus image to a vector in the latent space, and the decoder path learns to map this latent space vector to heatmaps representing the OD and fovea regions. Table 1 lists the detailed layer-wise information of FundusPosNet architecture. The following subsections explain the design details of each type of layer used in FundusPosNet.

1) Conv-BN-LeakyReLU
In this layer, we first perform convolution using a K × K kernel with a stride of (1, 1) and a dilation rate of D. The convolution is followed by the Batch Normalization (BN) operation applied to the feature maps along the channel-axis. Finally, the Leaky ReLU activation function is applied to the output after the BN.

2) Conv-BN-Sigmoid
In this layer, we first perform convolution using a K × K kernel with a stride of (1, 1) and a dilation rate of D. The convolution is followed by the BN operation applied to the feature maps along the channel-axis. Finally, the Sigmoid activation function is applied to the output after the BN.

3) TransposeConv
This is a 2 × 2 transpose convolution layer where we perform the up-convolutions on the input vector with a stride of 2 to up-sample its height and width by a factor of 2.

4) MaxPool
This is a Ps × Ps pooling layer with a stride S of 2 × 2 used to down-sample height and width of input by a factor of 2.

5) Concatenate
This layer performs a channel-wise concatenation of encoder and decoder output. Skip connections are used in the encoder-decoder networks to avoid the problem of vanishing gradients. Skip connections concatenate the up-sampled vector in the decoder path with the symmetrically opposite output vector in the encoder path along the channel-axis.
The output of the decoder path is subjected to a 1 × 1 convolution operation, followed by batch normalization. Finally, a sigmoid activation function is applied to this normalized output of the decoder to predict the two heatmaps.
In some fundus images, the OD and fovea centers are tough to locate due to the lack of adequate image quality or disease pathology in the images. Retinal diseases often produce dark and bright patches in the retinal layer, which may be mistaken for the macula and OD regions, as shown in Fig. 2. To avoid interference from the pathological anomalies during OD and fovea centers' regression, we consider the unique geometrical distance relationship between OD and VOLUME 00, 2021  fovea centers. The center of the fovea is located in the darkest region of the fundus image, approximately 2.5 times the OD diameter from the OD region [1]. The usage of dilated convolutions [25] in the deeper layers of the architecture and the application of different kernel sizes increase the receptive field. This larger receptive field further aids the network to learn more parameters based on the OD and fovea center distance relation.

B. PREPARING DATASET FOR NETWORK TRAINING
We have used the IDRiD grand challenge fundus dataset is used for network training. This dataset has 413 images for training and 103 images for testing. All the images in the image is varied by randomly selecting a factor from -0.5 to +0.5 range.

C. THE HEATMAP LABEL GENERATION
Although the pixel coordinates regression is similar to image segmentation, we do not want the output of the network to be a binary mask; instead, we need the output to be a heatmap with continuous pixel values in the range between 0 -1. If we use ground truth as binary mask with only a single bright pixel to label target location without any spatial context, in that case, the task of predicting such heatmaps becomes extremely difficult as there can be multiple locations in fundus image which have pixel values similar to that of fovea or OD pixel values. A 2D Gaussian kernel has the ability to focus on target location as well as provide spatial context. The mean of 2D Gaussian kernel is centered around the ground truth coordinate, and the remaining part of the kernel provides the spatial context. The amount of spatial context can be controlled by varying the σ value. The spatial context will allow the model to learn the important features more effectively. We generate the heatmap using the Gaussian function given by Eq. (1) in the paper [26].
Where, (α, β) is the actual (annotated) center of the landmark, (x, y) is a coordinate in 128×128 image vector representing the heatmap label, and σ is a constant used to control the size of the Gaussian blob in the generated heatmap label. The following steps are performed to generate the heatmap labels for each landmark (OD and fovea) in the fundus image.
Step 1: Initialize a 128×128 image vector with all pixel values set to zero. This image vector will be the heatmap label produced for the input fundus image. The pair (x, y) in (1)  In the generated heatmap labels, the pixel value around the center of the landmark will be high, and it smoothly decreases as we go away from the center. The constant σ in (1) controls the size of the Gaussian blobs in heatmap labels. The higher the value of σ, the wider will be the Gaussian blob and viceversa. The value of σ chosen for the localization of OD and fovea is different as the size of the OD region is smaller compared to the macular region. It is essential to ensure that the Gaussian blob covers the entire landmark and its related surrounding region to prevent false localization. Suppose we choose a very small value for σ. In that case, the size of the Gaussian blob will be relatively small, and there is a minimal global and spatial context present in it, making it vulnerable to localization errors. After thorough experimentation with the network's different parameter values, we have selected the σ value of 2.0 and 2.5 for OD and fovea landmarks, respectively. Fig. 3 shows the samples of generated heatmap labels for OD and fovea landmarks.

D. NETWORK TRAINING
The network is trained using 413 fundus images from the IDRiD training set and the augmentation and respective heatmap labels. The network learns to regress every pixel in the heatmap for the given input fundus image. Since the network generates heatmaps for the two landmarks and all the pixels in these heatmaps are always between the ranges 0 -1, we use the binary cross-entropy loss function given in Eq. (2) to train the network towards an optimum state. The binary cross-entropy loss function is also best suited for the logistic regression that we use. Since we are trying to reduce the error in heatmap generation and not the Euclidean distance, binary cross-entropy is best suited for our purpose. Glorot uniform initializer is used for the weight initialization for all the kernels, and all the bias variables are initially set to zero.
Where,ŷ is the ground-truth heatmap and y is the predicted pixel value in the generated heatmap. For optimization, Adam optimizer is used with an initial learning rate of 0.001. A batch size of 12 is selected, and the network is trained for 800 epochs. We save the weights having the least validation loss on the IDRiD validation set. Design of architecture and the training is carried out using Keras framework on Tesla V80 GPU provided by the Google colaboraroty platform.

E. EXTRACTING THE LANDMARKS' CENTER COORDINATES FROM THE GENERATED HEATMAPS
Ideally, the center pixel of the landmark should be the brightest pixel in the heatmap generated by the network. But sometimes, because of severe pathologies, there can be more than one brightest pixel detected in the generated heatmaps. If we choose the criterion for selecting the brightest pixel in the heatmap as the center of the landmark, it can lead to localization errors. We introduce a method to accurately approximate the center coordinates of generated heatmaps. First, we determine the contour in the generated heatmap using the method given in [27]. The center of this contour determines the center of the landmark. The center (x,ȳ) of the contour is found by first computing the moment of the contour using Eq. (3) followed by the detection of the centroid of this moment using Eq. (4). Finally, the coordinates of the landmark's detected center are mapped back from 128 × 128 image space to the original dimension of the input image using Eq. (5).
Where, I (x, y) represents pixel intensity, W and H represent the width and height of the image respectively. The complete process of OD and fovea center detection using the proposed heatmap regression technique is shown in Fig. 4.

IV. RESULTS AND COMPARISON
To perform a fair comparison with state-of-the-art methods on the Messidor dataset, we use the R-criterion evaluation metric given in [28] to determine the accuracy of the proposed model. Accordingly, we use the value of R=68, R=103, and R=109 for the Messidor images of resolution 1440 × 960, 2240 × 1488, and 2304 × 1536, respectively. R is the radius of the optic disc, and we compare the results in terms of (1/8)R, (1/4)R, (1/2)R, and 1R Euclidean distances between the predicted centers and the actual centers (annotated) of the OD and fovea landmarks. Following [9], we also compute the Mean Euclidean Distance (D R ) between the predicted and the actual centers normalized by the OD radius as given in Eq. (6).D R = (D(P p , P r )/R.100) Where D is the Euclidean distance and P p and P r are the predicted and actual pixels. We have also considered the Euclidean Distance (ED) between the predicted center and the actual center as a metric for assessing our proposed model's performance with other state-of-the-art methods on Messidor and IDRiD test dataset. The Euclidean Distance is computed using Eq. (7) is also known as the L2 norm.
Where, (X g , Y g ) is the ground-truth center and (X p , Y p ) is the predicted center, respectively. Table 2 shows the accuracy of FundusPosNet compared to other state-of-the-art methods on the Messidor dataset [9]. Table 3 and Table 4 show the accuracy of FundusPosNet compared to other models on IDRiD grand challenge test dataset [8] for the OD and fovea center detection, respectively.
Furthermore, we test FundusPosNet on G1020 dataset [10] for the optic disc localization. There are 1024 images in this dataset and each image is of 2423 × 3003 resolution. Fun-dusPosNet detects OD centers for 979 images with a mean Euclidean distance of 54.59. The model failed to predict OD heatmap for remaining 41 images due to the poor visuality of OD in these images.

V. DISCUSSION AND CONCLUSION
In this paper, we show that our proposed deep learning regression model performs exceptionally well on IDRiD and Messidor datasets for the OD and fovea landmarks localization and their center detection. The model is very generic, and with ground-truth data, it can be efficiently trained to localize multiple landmarks in different medical image datasets simultaneously. Every layer in a robust deep learning model should contribute effectively while generating the desired output. Table VOLUME 00, 2021 5 shows the encoder-decoder blocks' attention in their last two channels during OD and fovea landmarks prediction. It is evident from the output that the regression model is optimally regressing the pixels of landmark regions with proper attention given in each encoder and decoder blocks. We skip the demonstration of channel outputs for the last two encoder blocks and the first decoder block as they represent the deepest layers in the encoder-decoder path of the network, and their output is hard to interpret as well as significantly less intuitive.
In very minimal cases, we have observed that if the fundus images have severe anatomical pathologies, FundusPosNet fails to detect the OD and fovea centers precisely. It is because the network is mistakenly identifying the pathological deformations as the region of interest. Table 6 shows the result of FundusPosNet for different visual quality fundus images like good quality fundus image, inadequate quality fundus image, fundus image with not clearly visible OD and macular regions, fundus images having pathology regions brighter than the OD region, and fundus images having pathologies with darker patches than the macular region.
Although the proposed model is inspired by the revolutionary U-Net model, there are significant architectural differences between them. Table 7 lists the fundamental differences between FundusPosNet and U-Net in terms of design and implementation. He has a Diploma degree, Bachelor's degree, a Master's degree in Computer Science and Engineering, and a PhD degree in Computer Science discipline. His research area is Computer Vision, and he has a specific interest in Computer-aided diagnosis and network/Data security. He is a machine learning enthusiast and has developed efficient deep learning models for the automated detection and staging of retinal diseases. He has several publications in SCOPUS and Web of Science indexed reviewed journals. He started his career as a software developer, and later, he opted for the teaching profession. He has more than 13 years of teaching experience for UG/PG students at the university level.   Unlike U-Net, which does not use BN, FundusPos-Net uses BN after the convolutional layer but before applying the activation function. Activation function In FundusPosNet, every convolutional layer is followed by a BN layer, which is followed by Leaky ReLU activation except for the last convolutional layer where the sigmoid activation function is used after the BN layer. In U-Net, ReLU is used as a standard activation function after every convolutional layer. Number of parameters FundusPosNet has 12,865,562 parameters, whereas U-Net has about 7,759,521 [29].

Copy and Crop
In U-Net, while concatenating vectors in the decoding path, the encoding path's output is cropped and copied. In FundusPosNet, we don't crop the vector because the network architecture is symmetrical, i.e., the shape of vectors along the encoding and decoding path is symmetrical. Kernel size U-Net uses 3 x 3 kernels for all encoder and decoder blocks, except for the last decoder block, which uses 1 x 1 convolutions. In FundusPosNet, we use different kernel sizes at different levels of encoding and decoding, as shown in Table 1.

Dilation rate
In U-Net, dilated convolution is not used, whereas in FundusPosNet, we use dilated convolutions in the last two encoder block and first decoder block, with dilation rate = 2. Number of filters In FundusPosNet, within each block, the number of filters follows a bottleneck scheme shown in Table 1. U-Net doesn't follow this bottleneck scheme within each block.

Loss function
The U-Net uses the soft-max loss function, whereas FundusPosNet uses a binary cross-entropy loss function. Optimization method U-Net is trained using SGD optimizer, whereas Fun-dusPosNet is trained using Adam optimizer with learning rate = 0.001. VOLUME 00, 2021