Multi-modal and multi-vendor retina image registration

: Multi-modal retinal image registration is often required to utilize the complementary information from diﬀerent retinal imaging modalities. However, a robust and accurate registration is still a challenge due to the modality-varied resolution, contrast, and luminosity. In this paper, a two step registration method is proposed to address this problem. Descriptor matching on mean phase images is used to globally register images in the ﬁrst step. Deformable registration based on modality independent neighbourhood descriptor (MIND) method is followed to locally reﬁne the registration result in the second step. The proposed method is extensively evaluated on color fundus images and scanning laser ophthalmoscope (SLO) images. Both qualitative and quantitative tests demonstrate improved registration using the proposed method compared to the state-of-the-art. The proposed method produces signiﬁcantly and substantially larger mean Dice coeﬃcients compared to other methods (p < 0.001). It may facilitate the measurement of corresponding features from diﬀerent retinal images, which can aid in assessing certain retinal diseases.


Introduction
Retinal images reveal biological information about the retina, which are examined by ophthalmologists to diagnose and monitor the progression of a variety of diseases, including diabetic retinopathy, age-related macular degeneration, and glaucoma [1][2][3].Retinal images are often acquired by different imaging modalities in order to have multiple representations for the eye [4].For example, color fundus photography and Scanning Laser Ophthalmoscope (SLO) [5] are two commonly used techniques for retinal image acquisition in ophthalmology [6], where the formal one uses the imaging light of red, green and blue wavebands and the latter one uses single wavelength laser light (see Fig. 1 (a) and (b)).Therefore, the retinal blood vessels on SLO images have higher contrast to the background than the ones on color fundus images, while the color images have a larger field-of-view and better discrimination between artery and vein [7].In a screening setting, ophthalmologists or computer-aided diagnosis system may also evaluate exams from different screen rounds.In the clinical review of disease progression, longitudinal retinal image registration is crucial which establishes a pixel-to-pixel correspondence among the images of different modalities.It provides extra assistance for observers to find the early and subtle signs of abnormalities on one type of images and confirm their findings in the same area of other images.
To make temporal analysis possible or to investigate the same findings from different retinal images, an image registration technique is necessary.However multi-modal retinal image registration is still a challenge due to the modality-varied resolution, contrast and luminosity between different modality images (i.e.color and SLO retinal images).In the last few years, registration approaches proposed in the literature for retinal images can be summarized into two categories: area-based methods and feature-based method.An area-based method aims to extract intensity-based features from the overlapped area of two images, then optimizes some similarity measurements such as cross correlation [8,9], mutual information [10][11][12] and/or the combination of both of them [13] to obtain the best alignment.A feature-based approach tires to extract descriptive features for finding the correspondence between the images.Commonly-used descriptors include the position of retinal landmarks like vascular bifurcation points [14], the optic disc center [15], and the fovea center [16].Moreover, high-level feature descriptors such as scale invariant feature transformation (SIFT) [17], speeded-up robust features (SURF) [18], histograms of oriented gradients (HOG) [19] and local binary patterns (LBP) [20] are proposed for registration.An objective function is used based on the correspondence between the extracted feature descriptors to find the best transformation parameters.One of the recent methods [21] was proposed to detect landmarks (corner points) as a preprocessing step.Afterwards, HOG was calculated from the neighborhood of each corner point.During the registration, random sample consensus (RANSAC) was used to remove the incorrect correspondences to achieve the best affine transformation between images.
There are several limitations when using the previous techniques for the registration of color images and SLO images.First of all, the representation of retinal landmarks on both types of image is different.The optic disc is bright on color images but dark on SLO image.On color images, the central reflection of blood vessels is visible on arteries rather than on veins, while on SLO images it can be clearly seen on both types of vessels.Therefore, area-based methods which rely on measuring the similarity between two images are difficult because the similarity metric is not easy to define.In addition, tiny blood vessels on SLO images are depicted more clearly than those on RGB images and the vascular structures on both images might have a quite large differences.Thus feature-based methods which use vascular bifurcation points [22,23] and vessel edge map [24] are not ideally applicable In this paper, we propose a new framework for multi-modal retina image registration (specifically of color and SLO images), which addresses the issues of aligning images of different modalities.The methodology consists of two main steps: 1) descriptor matching and 2) deformable image registration.Descriptor matching is used on mean phase images to estimate the global transformation (e.g affine) between between images.The descriptor is densely calculated on mean phase images so that the matching step is invariant to contrast differences.The modality independent neighborhood descriptor (MIND) based deformable registration method proposed by Heinrich et al. [25] is then followed to refine the registration result locally as the deformable registration could achieve better estimation of the transformation than the affine registration.

Modality-invariant feature descriptors
The transformation parameters for image registration are derived by matching the feature descriptors extracted from the images.In order to address the effect of contrast and illumination difference between color and SLO images, we extract descriptors from mean phase images, which are independent on pixel intensity [26].Firstly, an image f (x) is converted to a phase image ϕ(x) [26] via the followed transformation: where f R (x) is the Riesz transform proposed by Felsberg et al. [26] of f (x) that represents the odd component of f (x), and f e (x) is the even component of f (x) [27].In the proposed framework, f R (x) and f e (x) are measured by a set of log-Gabor filters [28], which have zero DC component with tunable bandwidth in the Fourier domain so that the filter responses are independent of image intensity.In addition, instead of calculating local phase at a certain empirical scale, we use log-Gabor filters with multiple scales σ to derive ϕ σ (x) [27].Then we compute the average of the local phase images, named mean phase image φ(x), for the feature descriptors extraction.φ(x) averages phases over all scales, and serves as an identifier.For example, a step corresponds to ϕ = 0 and a peak to ϕ = π 2 .Essentially, it provides contrast-invariant measurement for descriptor extraction (see Fig. 1 (c) and (d)).
On the mean phase images, we densely compute the histogram of oriented gradients (HOG-MP) as the feature descriptors for image matching [19].A square block of size MN×MN around each selected pixel is drawn for gradient calculation.The distance between each selected points is 5 pixels in all dimensions.Every block contains M×M cells which have the size of N×N.For the pixels in each cell, the histogram of the gradient in twelve directions (from 0 to 360 with step size 30) is obtained to extract the local structural information inside the block, which yields 12×N×N feature descriptors for each selected point.We choose M=11 and N= 3 in our experiments.Miri et al. [21] applied the HOG calculation on detected interesting points from the original intensity images.However, the HOG-based descriptor matching step may suffer from the large contrast differences between multi-modal images.The detection accuracy of the interesting points may also influence the matching results.

Matching method
We first resize the moving images into the same size of fixed image (see Fig. 2(a) and (c)).One can see that the size of the same vessel and disc from different images are not the same.Since the HOG- MP is densely calculated, we use the approximate nearest neighbor search method [29] to efficiently match descriptors.Similar to [30], this method is used twice (searching from the fixed image to the moving image and vice versa) to exclude outliers.Only the correspondences that exists in the two matching results is left.We used the their implmentation and parameter settings for approximate nearest neighbor search method (see https://people.eecs.berkeley.edu/katef/LDOF.html) .Then random sample consensus is applied to further remove the incorrect correspondences.An affine transformation is estimated and updated during the RANSAC process [21].The final affine transformation is then applied to the fundus images.
The descriptor matching results based on HOG-MP and HOG are compared in Fig. 2. It can be observed that the HOG-MP based matching shows good correspondence between color and SLO images, especially around the positions with structures (e.g.vessel bifurcation and crossing points), as expected.

Deformable image registration
An affine transformation, as a first-order transformation, is not flexible enough to model the deformation between color images and SLO images.A second order quadratic transformation was proposed to improve the deformation modeling [31][32][33].However, local deformations caused by the eye movements and breathing during acquisition can not be modeled by the quadratic transformation effectively.Therefore, a more sophisticated model is required in this scenario.
A deformable transformation or spatially varying deformation model is able to model the local deformation and therefore used in many medical image registration tasks (see the recent review paper [34]).The deformable transformation W is defined as follow: where each position x in the image is assigned a displacement u.
Besides the transformation model, a proper similarity metric and optimization scheme are essential parts in the image registration pipeline [34].We adopt the pipeline from [25], in which the similarity metrics is defined based on modality independent neighborhood descriptor (MIND) as follow: where Ω defines the registration region (the entire image in our case).The optimization (minimization in our case) of similarity metrics for the high-dimensional deformable transformation W is ill-posed, and regularization is generally necessary.Same as [25], we add a local regularization term to define the objective function: This objective function is then optimized by the Gaussian-Newton method.The MIND based method is reported to be invariant to contrast differences and suitable for multi-model registration problems [25].We empircally set weighting parameters α = 0.2 (same as [35]) for all following experiments.

Material
The proposed method is validated on a set of multi-modal and multi-vendor images.The dataset contains a total of 600 retinal images acquired by 5 fundus cameras (see Fig. 3), including (1) Canon Cr-1 Mark II (Canon); (2) Topcon NW300 (Topcon); (3) Nidek AFC-230 (Nidek); ( 4) EasyScan (i-Optics) and ( 5) Spectralis HRA OCT (Heidelberg) (the camera specifications are shown in Table 1).The first three are of the color fundus cameras and the last two are of the SLO cameras.The acquisition by each camera was done on the right eye of 12 healthy subjects with 5-times successive acquisitions on both the center of the fovea and optic-disc (OD), which produces 120 images (half of them fovea-centered and half of them OD centered) for each camera.The images are varied with each other slightly in terms of luminosity difference, translation and rotation.In addition, the exact region of the retina captured by each image is different.For example, for some of the fovea-centered images, the optic disc might not be completely captured.Since the Spectralis images have the smallest field-of-view (30 • ), they are considered as the fixed image for registration.The images of other four cameras (the moving images) are assigned to the corresponding Spectralis images (the same subject with fovea/OD centered).To be more specific, for one subject, the i-th acquired images using other cameras are only registered to the i-th acquired Spectralis image.Since the images were taken independently, it is equivalent to randomly pair these image yielding 600 (120×5) image pairs.
For processing, we removed the black region of the Canon and Topcon images (see Fig. 3) to make them into squared images as the fixed images are (see the red dashed lines).It is an automatic process and we used the image center as the square center.After that, all moving images are resized to the same size of the fixed image.These images are then used for descriptor matching and deformable registration.The cubic B-spline interpolation is used during the resize process.Because of the high contrast of the blood vessels, only the green channel of the fundus images is used for the registration where the same idea was adopted by [21].The flowchart of our method that adopted in the experiments can be seen in Fig. 4.

Qualitative comparison
We compared our proposed method (method-p) with the method proposed by Miri et al. [21]( method-1).In [21], all images are brought into the same resolution before the registration.A circular Hough transformation is first used to localize the optic discs.It results in the center and radius of the optic discs in the moving (c f , r f ) and the fixed (c o , r o ) images.The moving  images are then scaled to the fixed images by the ratio of R = r o /r f .However, this method is not applicable if the optic disc is not at the center of the image (see Fig. 2(a)).Alternatively, we calculated the rescale ratio R based on the affine transformation matrix A that we derived from the descriptor matching step of method-p: where A 1,1 and A 2,1 are the elements of the matrix A.
We followed the implementation from [21] for the method-1 (except for the disc localization).Their method was optimized for registering fundus images (Nidex and Topcon) to SD-OCT images.We used their parameter settings for our registration tasks.We also tried different parameter settings of method-1 for our dataset but did not find any significant improvment.For comparison we also evaluated the registration performance only based on our intermediate step, the descriptor matching (method-2).
The registrations between Spectralis and the other four types of images are shown in Fig. 5, Fig. 6, Fig. 7, and Fig. 8.One can clearly see the misalignment (pointed by the yellow arrows) by using method-1 and method-2.It proves that only using descriptor matching with an affine transformation is not enough.The results showed that using a deformable transformation successfully registered these images without any noticeable misalignment.

Quantitative comparison
In retina imaging, many systemic diseases including diabetes and hypertension are reflected by blood vessels changes such as being tortuous, narrowing and showing leakage.Blood vessels are key landmarks for inspection.To objectively evaluate the registration, we measure the matching effect between blood vessels from pairs of images.
To obtain the blood vessel segmentation, we applied the method proposed by Zhang et al. [36].This technique employs a set of multi-scale Gaussian derivative filters rotated to different orientations in so-called "orientation scores".An orientation score is a 3-D space with axis: the   spatial coordinates x, y and the orientation θ, in which vessels with different orientations lay in different planes.The benefit of this construction is that difficult cases like vessel crossings are now solved because they are disentangled.The multi-scale nature of the Gaussian derivative filters ensures that disentangled vessels with various sizes are equally enhanced.Afterwards, the 3D structure is projected onto the spatial plane by taking the maximum filter response over all orientations per position.After we obtain this 2D enhanced vessel map, a proper threshold value is applied on the enhanced image to obtain a binary vascular map.Subsequently, vessels within the optic disc region are eliminated by the optic-disc mask.An iterative thinning algorithm is used to obtain the centerline of the vasculature.Junction points like vessel branchings and crossings are also removed, thus pixels connected to each other represent an individual vessel segment.In this study, we used the vessel segmentation tool which was trained on a different retinal image dataset with blood vessels manually annotated.To apply it to our 6 camera image database, we rescaled all images to the same pixel size as the training data using the size of optic disc as the reference.
To evaluate the performance of the registration methods, we calculate the Dice coefficient: where X and Y are the vessel binary maps of fixed and moving images after registration, |.| represents the number of vascular pixels and |X ∩ Y | is the number of overlapped pixels between X and Y .Table 2 summarizes the Dice coefficient measures from three different methods on registering four manufacturer datasets to the Spectralis dataset.Our proposed method-p significantly outperforms the compared methods with higher means and smaller standard-deviations (std) in
We also computed failure rates for different methods.As visually the large vessels on cases with a Dice less than 0.5 barely match, therefore we regards the registration is a failure.The failure rates for method-1, method-2, and method-p are 19.5%,10.6% and 1% respectively.Our takes about 40s for each registration using a PC with Windows 10 64-bit OS, 32 GB RAM, and Intel(R) Core(R) CPU 4.20 GHz.For evaluation, the vessel segmentation takes about 30s per image.

Discussion and conclusion
In this paper, a robust and effective two-step framework is proposed to register multi-modal retinal images.The method of descriptor matching was used to register images globally in the first step.After that, a deformable registration was applied to locally refine the registration result in the second step.
In the descriptor matching process, in order to avoid the the intensity difference between different modalities, we first transferred the intensity images into mean phase images.Mean phase images are invariant to the intensity difference and often used to represent structure information of images.Since blood vessels share the same structure across different modalities, we densely calculate the HOG measure on mean phase (so called HOG-MP) image rather than only on interesting points of intensity image [21].Points detection errors are eliminated by our

Fig. 1 .
Fig. 1.The retina of the same subject acquired by (a) color fundus camera (Canon Cr-1 Mark II) and (b) Scanning Laser Ophthalmoscope (Spectralis HRA OCT).Retinal landmarks such as blood vessel and the optic disc have different representations on both modalities.(c) and (d) are mean phase images derived from (a) and (b), respectively.

Fig. 2 .
Fig. 2. Descriptor matching result between a color image (moving image) and an SLO image (fixed image, e.g.acquired by a Spectralis HRA OCT camera).(a) and (c) show matching pairs based on HOG and HOG-MP with the RANSAC process; (b) and (d) are the matching result using an affine transformation based on HOG and HOG-MP, respectively. ϕ

Fig. 3 .
Fig. 3. Five types of eye images need to be registered.The images of Spectralis fundus camera are the fixed image with size of 1536×1536 pixels; images from Canon, Topcon, EasyScan and Nidek are moving images (registered to Spectralis) with the size of 3456×2403, 2408×1536, 1024×1024 and 3744×3744 pixels, respectively.The images are shown in relative pixel size.

Fig. 5 .
Fig. 5. Registration results between a SLO and a Canon image.(a)-(c) are from method-1, method-2 and method-p; (d)-(f) are sub-regions from the red box of (a)-(c).The yellow arrows point out the misalignment.

Fig. 9 .
Fig. 9.The box-plot of the Dice coefficients of our proposed method and the state-of-art method (method-1).

Table 2 .
Dice coefficient from three different registration methods