Retinal Mosaicking with Vascular Bifurcations Detected on Vessel Mask by a Convolutional Network

Mosaicking of retinal images is potentially useful for ophthalmologists and computer-aided diagnostic schemes. Vascular bifurcations can be used as features for matching and stitching of retinal images. A fully convolutional network model is employed to segment vascular structures in retinal images to detect vascular bifurcations. Then, bifurcations are extracted as feature points on the vascular mask by a robust and efficient approach. Transformation parameters for stitching can be estimated from the correspondence of vascular bifurcations. The proposed feature detection and mosaic method is evaluated on retinal images of 14 different eyes, 62 retinal images. The proposed method achieves a considerably higher average recall rate of matching for paired images compared with speeded-up robust features and scale-invariant feature transform. The running time of our method was also lower than other methods. Results produced by the proposed method superior to that of AutoStitch, photomerge function in Photoshop cs6 and ICE, demonstrate that accurate matching of detected vascular bifurcations could lead to high-quality mosaic of retinal images.


Introduction
Retinal images are crucial for ophthalmologists to diagnose a series of diseases. Fundus lesions caused by fundus and systemic diseases, such as diabetes, hypertension, macular lesions, fundus arteriosclerosis, and retinopathy, can appear in the retina [1]. Patients can receive timely and appropriate treatment if specialists examine and diagnose these diseases early. At present, mosaic images have been extensively used to provide a comprehensive view of the retina to assist ophthalmologists during laser surgery or other procedures. However, the retinal image captured by a fundus camera or scanning laser ophthalmoscope can only cover a local area of the eye. e retinal images captured in different field areas must be stitched together to form a mosaic image that ultimately meets the needs of the analysis of the entire area of the fundus in research and clinical diagnosis.
Numerous retinal image registration and stitching methods have been proposed in the literature. Can et al. [2][3][4] introduced a layered mosaicking algorithm that uses bifurcations as features of the vascular structure. However, one disadvantage of this algorithm is the occasionally indistinguishable landmarks. Ryan et al. [5] also studied the registration of retinal images using landmark correspondence. However, this approach obtains remarkably few matching point pairs. Landmark matching formulation [1] is based on retinal image alignment by enforcing sparsity in correspondence matrix. However, such approach needs a complicated computational process and high cost.
In 2010, Kwon and Ha [6] introduced panoramic video technology based on extracting important information from the image. e scale-invariant feature transform (SIFT) is one of the most robust and widely used methods. However, identifying corresponding points becomes difficult in the case of changing illumination or two surfaces with a similar intensity, as SIFT extract features using only gray information. Brown and Lowe [7] used the invariant feature for automatic mosaicking of natural images, adopting the multiresolution image fusion. Alomran and Chai [8] presented an image stitching algorithm based on speeded-up robust features (SURF), but the algorithm is restricted to an image set without exposure differences and extremely high lens distortion. e disadvantages of methods based on SIFT or SURF include clustered detection points on the edge, uneven distribution, and insufficient effective information to describe the image. Matching of bifurcations is more appealing compared with image registration. ese previous methods are limited only to images containing clearly visible vascular structures. However, the vascular structures in retinal images are often blurred when the fundus bleeds or becomes tumorous. us, the detection of robust features in retinal images for feature matching presents a difficulty. Simultaneously, indistinguishable features may result in matching ambiguities.
is work aims to improve the robustness of feature detection for indistinct vascular structures in retinal images. Numerous publications focus on using the convolutional neural network (CNN) to segment vascular structures. Jiang and Tan [9] proposed the use of conditional deep convolutional generative adversarial networks to segment the retinal vessels , by using skip connection to connect the output of the convolutional layer with the output of the deconvolution layer to avoid low-level information sharing. However, a few number of images exist in the data set to train the network, and the accuracy of vessel segmentation and the robustness of the network cannot be verified well. Alom et al. [10] proposed a Recurrent Convolutional Neural Network (RCNN) based on U-Net as well as a Recurrent Residual Convolutional Neural Network (RRCNN) based on U-Net models. 190,000 patches are randomly selected from 20 of the images as the input of networks which was a complicated computational process. Hu et al. [11] introduced a retinal vessel segmentation method based on CNN and fully connected conditional random fields. In this paper, we first employ a CNN segmentation model [12] to segment vascular structures in retinal images. en, an effective approach is proposed to detect bifurcations as features for matching and stitching by morphology operations in the vascular structures. Figure 1 shows an example of a mosaic image produced by the proposed method, our method for feature detection and matching benefits from previous works on retinal images. e rest of the paper is organized as follows. First, Section 2 describes the framework and details of our proposed method. Section 3 provides experimental results. Section 4 discusses the results.

Overview.
is work aims to develop a practical and useful method for detecting robust and sufficient features with indistinct vascular structures and constructing an automatic mosaic of multiple retinal images. e image stitching process can be summarized by the steps shown in Figure 2. e novelty of the paper include the following: (1) e binary image of vascular structures instead of the original image was used for bifurcation detection. e correspondences between retinal images are established using second and nearest neighbor ratio matching [13] by calculating the Euclidean distance between feature descriptors. (4) Parameters of transformation are estimated to warp images into a uniform coordinate system using the homography matrix, which is calculated by the correspondences between image pairs. (5) e retinal images and masks are warped to uniform coordination, and vertex coordinates of images are calculated after a uniform coordinate transformation. (6) Multibland blending is used to seamlessly fuse retinal images to generate a mosaic image.

Segmentation of Vascular Structures.
Classic methods for addressing vascular structure segmentation involve hand-crafted filters, such as line detectors and vessel enhancement techniques [12,[14][15][16]. Recent techniques [17,18] are available for segmenting vascular structures of retinal images, but these techniques exhibit poor performance.
Deep CNN architectures are designed to solve the initial classification of natural images; they are also very effective for vascular structure segmentation according to the literature [12]. Maninis proposed a CNN architecture based on visual geometry group (VGG) network for retinal vascular segmentation.
A CNN model for vascular structure segmentation is already trained on the DRIVE [19] and STARE [20] datasets (40 and 20 images, respectively). e trained CNN model is publically available (All the resources of this paper, including code and pretrained models, are available at http://www. vision.ee.ethz.ch/∼cvlsegmentation/driu/). Our experiment is installed and configured Caffe on Ubuntu 14.04 with NVIDIA GPU (TitanX). Download the original DRIU model and run the model to segment the vascular structures. e precision-recall between the obtained mask and gold standard (0.822) is higher than that of the precision-recall between human-annotated result and gold standard (0.791) on the DRIVE data set. is model is employed to segment retinal vascular structures in retinal images in our experiments.
Given that the size of the retinal image is inadequate for the operation of the CNN architecture, the image must be sampled before retinal vessel segmentation; the size of downsampling image is 700 × 605 pixels. Figure 3(a) shows a retinal image, and Figure 3(b) is the segmented retinal vascular structures obtained using the CNN model. We will further research how to improve the model to obtain the resulting output without boundary.
Subsequently, a fast parallel algorithm [21] is applied for thinning images to extract the center lines of retinal vessels segmented by the CNN model. is algorithm extracts the center lines of an image through the removal of image contour points, except for points belonging to the skeleton. Iteration over several times will obtain the final skeletons as the vascular centerlines (e.g., Figure 3(c)).

Bifurcation Detection and Descriptors.
Bifurcations of vascular centerlines are detected as matching features. Analysis of geometrical structures of bifurcations is important to design a scheme to solve this problem. From the observations, the vascular branch and crossover are generally presented as Y-or T-form, respectively. Hence, we use morphological operations to locate bifurcation points with a set of structural elements. Multiple Y-or T-form structural elements are applied to erode the vascular centerlines mask. Journal of Healthcare Engineering e structure element of size 3 × 3 is selected according to the literature [22]. e last four Y-form structural elements shown in Figure 4 include most vascular bifurcations by comparing the various structural elements to erode the same binary image in our experiments. Finally, 14 structure elements are selected from a set of structural elements to erode vascular centerlines images to produce the highest number of correctly detected bifurcations and no falsely detected features. More criteria will be considered for selecting structural elements in future work.
One of the structural elements ( Figure 4) is used to erode the vascular centerlines image, which will result in an eroded image containing numerous discrete points.
A total of 14 images are produced during image erosion by 14 structural elements. An existing problem is the addition of the 14 images. We add up all eroded images in a certain weight. e function is defined as follows: where B i denotes the 3 × 3 structural element, B 1 , B 2 , . . ., B 14 contains all 14 structural elements. For example, C 1 represents the first image that adds the image eroded by B 1 and that by B 2 together with 1/2 weight. Similarly, C 2 represents the image that adds image C 1 and the image eroded by B 3 .  Table 1, SIFT algorithm spends the longest time, and SURF contains a number of features. Although SURF acquires the maximum number of features, this algorithm is time consuming and mainly focuses on one area, which is easily noticed in feature matching in the next section. Trading off between the running time and number of features, our feature detection algorithm is superior to other feature detection algorithms.
Numerous bifurcations will be produced after feature detection. A descriptor is then constructed to describe the distribution of intensity content within the feature neighborhood. Similar to the gradient information extracted by SIFT [23] and its variants, SURF [24] builds on the distribution of first-order Haar wavelet responses in x and y directions rather than the gradient and uses only 64 dimensions. SURF first calculated the Haar wavelet responses in x and y directions within a circular neighbourhood of radius 6s around the interest point, with s the scale at which the interest point was detected. e dominant orientation is estimated by calculating the sum of all responses within a sliding orientation window of size π/3. e horizontal and vertical responses within the window are summed. e two summed responses then yield a local orientation vector. e longest such vector over all windows defines the orientation of the interest point. A square window centered around a feature was constructed and oriented along the orientation we already got above. e size of the window is set to 20s. e region is split up regularly into small 4 × 4 square subregions.  responses d x and d y are summed up over each subregion as d x and d y , respectively. e sum of the absolute values of responses (|d x | and |d y |) is also extracted to obtain information regarding the polarity of intensity changes. Hence, each subregion has a 4D descriptor vector v for its underlying intensity structure v � ( d x , d y , Σ|d x |, Σ|d y |). Concatenation for all 4 × 4 subregions results in a descriptor vector of length 64. Once the feature descriptors have been calculated, the correspondences between all image pairs can be established by calculating the Euclidean distance of those descriptors.

Bifurcation Matching.
e matched candidate of each feature is determined by identifying its nearest neighbor in the features from the target image [23]. e nearest neighbor refers to the feature with minimum Euclidean distance of the descriptors. We obtain the descriptor of a bifurcation in image I i and calculate its Euclidean distances to the descriptors of all features in another image I j . An effective measure is obtained by comparing the distance of the closest neighbor to that of the second-closest neighbor to determine correct matching. To suppress matches that could be regarded as possibly ambiguous, matches with a distance ratio larger than 0.3 will be rejected [25]. One of the bifurcation matching parameters was customized.
is parameter is the confidence that indicates two images come from the same mosaics. e default value is 1.0. In our user interface, the confidence was set to an option to choose a number between 0.5 and 1.5 freely to generate different mosaics.
Each pair of potentially matching images includes a set of feature matches that are geometrically consistent (inliers) and another set of inconsistent features (outliers) inside the area of overlap. Random sample consensus (RANSAC) [7,26] can be used to select a set of inliers that are compatible with a homography between images. Let X be a point on the retina with projections x i and x j in two images taken from different angles. A homography matrix H describes the transformation that connects x i and x j for any point X on the retina, x j � H x i . e inliers or outliers are classified by the distance threshold t. When ‖x j − Hx i ‖ > t, the matching point is considered as outliers, t � 3 and the number of iterations is 2000. en, it can get a pure sample with probability p � 0.99. e optimal homology matrix H is obtained. Figure 6 shows the correspondence between different features of an image pair. e first, second, third, and fourth rows indicate our algorithm, ORB, SIFT, and SURF, respectively. e number of matching points of ORB and SIFT is less than that in our algorithm. Otherwise, although SURF contains more matching points, it mainly concentrates on certain areas drawn by a blue circle. ese points contain little information as they occupy a small number of retinal vessels.

Image Warping.
e next step estimates transformation through the matching of detected bifurcations for retinal image pairs. Assuming that the camera rotates around the center of the optical axis, the transformation model of the image is estimated. is section first introduces how to use the homography matrix to calculate the parameters of the transformation model and then uses these parameters to transform image coordinates to the same coordinate system and perform image stitching.
Assuming that the camera rotates around its central axis, the transformation model of the image is defined as follows: Each image has a camera intrinsic parameter matrix K and an extrinsic matrix R. K is a 3 × 3 intrinsic parameter Estimation of focal length f [27]is as follows: where h 11 , h 12 , . . . are elements of the homography matrix H. Calculate f of two images from images set and take the median of f of all image as the focal length of all images. R is a 3 × 3 extrinsic matrix. e matrix R of the reference image is the identity matrix. R can be roughly estimated according to the intrinsic parameter matrix and the homography matrix, Given the matrix R of an image, intrinsic parameter matrix K i K j of the two images, and the homography matrix H of the two images, the matrix R of another image can be obtained. All image camera parameters are revised by beam adjustment. e matching points of pairs of images are mapped into the three-dimensional space, calculating the camera parameters with the minimum sum of squares of matching errors. e function is defined as follows: where Let the matching points of pairs of images be denoted as P and Q. An initial correspondence set (p i , q i ) ∈ C 0 containing all matching points is established. Levenberg-Marquardt [28] algorithm is used to solve the function, and the optimal parameters R and K are obtained. (1) Suppose the point (x, y, z � 1) of the initial image T is projected to the same coordinate (x′, y′, z′). According to K, R, there is a relation as follows: (2) Mapping the coordinate in (1) to the spherical coordinate system (u, w, v). e relationship is as (3) e coordinate is backprojected onto the warped image by means of backprojection. First, the coordinates (x′, y′, z′) are obtained by the inverse transform for spherical coordinates in (2): en the coordinates of the warped image are obtained by backprojection: Finally, the coordinates of the initial image T are transformed to the same coordinate system.

Image
Blending. When all images are warped into the same coordinate system, the images are blended to construct a mosaic image with invisible image boundaries [29]. Each sample (pixel) at the same location should exhibit the same intensity in the images for mosaics. However, no such case is observed in actual situations. After gain compensation, several image edges remain visible due to several causes, such as vignetting (decreasing intensity toward the edge of the image), misregistration of the mosaicking procedure, and radial distortion [7]. In 2D, the work of Burt [30] and Adelson on multiband blending has proven particularly effective for image mosaicking without blurring and ghosting artifacts. Suppose image A and image B are images to be blended. e method includes the following steps: (1) Gaussian pyramid is constructed according to images A and B, respectively:  e second row shows mosaic images after the application of the multiband blending. e boundary of the overlapping areas of the image transitions naturally, and the quality of the mosaic image has been greatly improved.

Evaluation Metrics.
e root mean square error (RMSE) [31] has been used as a standard statistical metric to measure the matching performance of our algorithm. Suppose n samples are used to calculate matching errors (e i , i � 1, 2, . . ., n). RMSE is calculated for the data set as shown in the following: Following a certain order, the errors e k , e k � R i K − 1 i p k − R j K − 1 j q k , R i , K i , R j , K j , which can be known from equation (4), k � 1, 2, . . ., and n can be written into an n-dimensional vector, where n denotes the total number of matching points from two images. For each matching point (p k , q k ) ∈ C 0 , according to equation (7), the RMSE of the matching points from two images is calculated. A low value of RMSE indicates desirable matching. e feature matching evaluation also refers to the concept of recall rate [13,32], which is a well-known performance metric. Using the recall rate as the metric to evaluate the matching performance is mainly from the inspiration of this reference [32]. e recall rate is defined as follows: Recall � retrieved correct matches all correct matches , (8) where retrieved_correct_matches denote the number of correct matching points obtained using our matching algorithm. All_correct_matches are manually marked by experts in the reference and target images.

Image Data.
We implemented the proposed method in C++ with OpenCV library. e experiments were performed on a PC with 8 GB RAM and an Intel i5-6500 CPU (3.20 GHz). e data set, which includes 62 retinal images, was collected from the hospital in Guangdong, China. Images are obtained from 14 different eyes. A total of 2-9 images are acquired for each eye. e image format is JPG format. e original size of images in the data set is inconsistent, including 2304 × 1728 and 1600 × 1200. A total of 190 image pairs with overlaps are used to evaluate the performance of feature matching. Table 2 provides a detailed description of the overlaps between image pairs. In the experiment, several stitching software were compared with the proposed method. e stitching software included AutoStitch, photomerge function in Photoshop cs6 and ICE (Image Composite Editor).

Comparison of Feature Matching Methods.
e experiment is conducted to investigate the effect of the overlap ratio of image pairs on the matching error. Table 2 lists 10 pairs of retinal images from the data set and shows matching errors with different overlap ratios. In general, desirable matching can generally be achieved when the overlap ratio of image pairs is high. e number and spatial distribution of matched features are important factors that affect registration results. From experiment results, if an image pair exhibits a high overlap ratio but contains several detected vascular bifurcations on the images or unevenly distributed detected bifurcations, then a high matching error would be obtained. e average root mean square error of all image pairs is 1.82 pixels. To measure the performance of the matching algorithm more accurately, we plan to compare the registered image coordinates with the ground truth coordinates on more manually annotated samples. Table 3 lists the quantitative matching performances of different features. e numbers of matched points for ORB, SIFT, SURF, and our method total 29, 17, 42, and 62, respectively. Although the number of matched points obtained by SURF is close to that of our matching method, SURF spends much more time (0.689 s) than our proposed method (0.249 s). ORB and SIFT consume less time but have fewer matched points than our algorithms. Our method found more correct matches than approaches such as SIFT or SURF. Vascular bifurcations obtained by our method are more evenly distributed than features obtained by SIFT as shown in Figure 5. In the meanwhile, matching points obtained by our method are more evenly distributed than SIFT or SURF according to the matching results. In this case, having more matches leads to better mosaic. Figure 8 compares the recall rate of different features for image pairs. is figure lists five pairs of retinal images from the data sets and shows the quantitative results of the recall rate. As shown in Figure 8, an optimal recall rate is obtained using our registration method. Although the number of matching points obtained by SURF is close to that of our matching method, the average recall rate obtained by our matching method is considerably higher than that of SURF (0.70 to 0.35). e matching recall rate produced by ORB is lower than that of SIFT (0.47 to 0.57). erefore, the method proposed in this paper performs better than other algorithms according to the statistics of the aforementioned indicators. Figure 9 shows the mosaic image stitched by four methods. a∼d are the results of    Figure 9(c) and the image transition is unnatural. In contrast, the mosaic image produced by our method possesses the highest similarity with the original image. Figure 9 roughly compares the performance of several stitching software by mosaics. In order to evaluate the quality of the mosaic image more accurately, the mosaic image is magnified for comparison, as shown in Figures 10  and 11. Figures 10 and 11 show two examples of mosaics produced by our method. In the figures, the left images refer to the original images captured from one retina at different angles.

Qualitative Evaluation.
e proposed method provides visually appealing Our method can produce mosaic images with less blur, as shown in the zoomed-in regions in Figure 11(b). By contrast, evident blurring around the vessels can be observed in the mosaic image produced by AutoStitch (Figure 11(c)). No evident blur is observed around the vessels in the mosaic image produced by our method. e contents of the mosaic image produced by our method are more consistent with the original retinal images compared with that produced by AutoStitch.

Discussion and Conclusion
We construct mosaics from a series of images of the human retina. Previous methods were limited to work with images containing clearly visible vascular structures. However, the vascular structure in the retinal image will often become blurred when the fundus bleeds or becomes tumorous. us, for all previous methods, detecting robust features in such a retinal image presents a difficulty. is paper addresses the problem of robust feature detection based on obscure vascular structures of retinal images.
Our paper employs a CNN model [12], which effectively obtains clear vascular structures, to segment vascular structures of retinal images. en, bifurcations in vascular structure are extracted as features, and different transformation estimations are utilized to warp images into a uniform coordinate system for mosaic image blending. Experimentally, we show the mosaic examples and list a summary of numerical results, such as RMSE, recall rate. e proposed feature detection method is superior to other methods, such as ORB and SURF. e quality of mosaic images produced by our method is superior to that of AutoStitch, photomerge function in Photoshop cs6 and ICE, which demonstrate uneven vascular mismatching. e quality of mosaic images produced by our method is  especially valuable for retina change detection. Although visually appealing results can be produced with the proposed method, its clinical usefulness should be further confirmed. We plan to improve the efficiency of the algorithm to realize real-time stitching algorithm. In addition, we plan to improve the algorithm for precise registration on large images with different overlaps.

Data Availability
e data used to support the findings of this study are included within the article and also available from the corresponding author upon request.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper.