Precision targeting for retinal motion extraction using cross-correlation with a high speed line scanning ophthalmoscope

Evaluations to quantify the precise targeting of template features and to select template sizes for retinal motion extraction were carried out using cross-correlation with a high speed line scanning ophthalmoscope (LSO) capable of 160 frames per second. The optimal template target was a retinal vessel pattern, preferably a vessel bifurcation or vessel features occupying approximately eighty percent of the template area. The optimal template size for this LSO system was 80 × 80 pixels, which allowed retinal motion of up to 300 deg s−1 to be extracted at a speed of 30 Hz. Although the optimized template size was a compromise between having enough image data on the retinal features to match reliably and having good temporal resolution, the optimal targeting of the template location and size described here was appropriate and effective in extracting retinal motion. In addition, the determination of cross-correlation templates could be applied to other images having similar properties; i.e., relatively small features of distinct gray levels on an otherwise fairly uniform background.


Introduction
The human eye continually moves even when we steadily fix our gaze on an object [1]. These motions include high frequency tremors, rapid microsaccades, and slower drifts with an amplitude of several arc seconds to several arc minutes and a frequency of 10-100 Hz [2][3][4]. These movements are too small to be seen with the naked eye or mid-level eye movement monitors, but they introduce significant inter-frame distortions in retinal images captured with a fundus camera [5][6][7]. Worse, they produce both intra-frame and inter-frame distortions, and can warp images during fundus examinations with high resolution systems such as optical coherence tomography [8][9][10] and confocal scanning laser ophthalmoscopy (SLO) [11][12][13]. Therefore, image artifacts introduced by these movements severely impair accurate visualization of the retina and hinder quantitative measurements of retinal microstructures [14]. For example, retinal motion lowers image quality, making it difficult to quantify measurements obtained by high resolution retinal imaging systems, such as retinal thickness mapping (retinal topography) [15] and retinal cell density mapping [16].
For many decades, cross-correlation has been the most widely used method for retinal motion extraction. The earliest ophthalmoscopic system for retinal motion measurement was implemented by Cornsweet, who reported a resolution of ten arc seconds with a system that tracked a blood vessel using an oscilloscope and photomultiplier tube [17]. O'Connor applied a frame-to-frame cross-correlation to determine inter-frame retinal movements in SLO images [18], and patch-based cross-correlation methods were then proposed to estimate retinal movements in SLO data; these required only a sequence of frames of scan data and estimated intra-frame motions [19]. Vogel et al applied a map-seeking circuit algorithm to compute correlations to estimate retinal motion vectors and account for more general motions [20]. These approaches have been shown to accurately estimate intra-frame retinal motions at rates close to 1 kHz in offline analysis; however, only slow drift motions could be estimated, while ultra-high frequency inter-frame distortions such as tremors or rapid microsaccades could not be estimated clearly [21]. Since the frame rate could not be increased to overcome intra-frame retinal motions, these approaches failed in their estimations because of displacements of the correlation templates between sequential frames [22]. Therefore, the most efficient approach to estimating ultra-high frequency retinal movements is to increase the imaging speed. Increasing the image acquisition speed using a high speed line scanning ophthalmoscope (LSO) capable of 160 frames per second (fps) [23,24] eliminates motion artifacts within frames and reduces feature displacement between frames, and the absence of intra-frame motion facilitates the extraction of retinal motion from inter-frame distortions through cross-correlation [25].
However, cross-correlation computes image intensities to maximize the similarity between a pair of discrete rectangular images, so the choice of template affects detection accuracy. In brief, the template location and size, which determine the retinal features covered, influence the precision of cross-correlation calculations. Depending on which retinal features the template covers, the contrast and gray level difference between the template features and the retinal background may be distinct or nearly uniform, which leads to a reliable or an erroneous cross-correlation, respectively [26]. Once the template contains optimal retinal features, the template size governs the efficiency of the cross-correlation. As the template size increases, the precision of cross-correlation computations increases, but only up to a relatively stable point. Beyond this point, the precision increases only slowly. Moreover, the maximum anticipated velocity of retinal motion that can be tracked is proportional to the template size [27]. However, although the precision of cross-correlation increases with the template size to a certain extent, one cannot neglect the computational cost of a large template. The cross-correlation technique is computationally intensive and requires a lengthy calculation period despite the use of fast mainframe computers [28], and a smaller template size makes the cross-correlation process faster. Consequently, the optimal template size will be a compromise between having enough retinal feature image data to match reliably and having good temporal resolution [29].
Design of a precise template targeting methodology, including template location and size, is critical to carry out cross-correlations to extract retinal motion. In this study, we extended previous studies by rigorously evaluating the performance of cross-correlations to extract retinal motion. Correlation templates of different retinal features were used to determine the optimal template for retinal structures, and several criteria were compared to determine the optimum template size of the cross-correlation as it was applied to extract retinal motion. Evaluations quantified the precise targeting of template features and template sizes for retinal motion extraction using cross-correlations.

Experimental configuration
A detailed description of the LSO system can be found in Yi He et al [30] and the optical layout can be seen in figure 1. A laser diode (LD) with a central wavelength of 780 nm was collimated through a convex lens (L1) and spread in one dimension with a cylindrical lens (CL). The line beam was focused onto the retina with a scanning lens (L2), and the retina was scanned with a galvanometer-driven mirror (G, 6230H, Cambridge Corporation). The CL and the galvanometer, as shown in figure 1, were used as a line beam scanner, which ultimately converted the point light source into a focused line beam. The back-scattered light from the retina passed back through the same lens, was descanned by the galvanometer, passed through the beam splitter (BS), and was focused onto the linear array sensor (LAS, AVIVA M2 CL 1014, ATMEL) with a detector objective lens (L3). Meanwhile, a slit aperture (SA) conjugate to the retinal plane was placed close to the LAS to reject the majority of the back-scattered light from the adjacent voxels along the scanned line.
As the line acquisition rate of the line-CCD was 53 kHz, the image acquisition speed could reach up to 160 fps for 800 × 400 pixels, or 2400 × 1200 μm on a retinal scale. The linear array sensor has 1024 elements in total, but the curved shape of the retina produces dark fields at the margins, so only the middle 800 elements collect useful light. According to the theoretical optical performance characterized with ZEMAX (ZEMAX Development Corporation, USA), the LSO system achieves diffraction-limited performance through the entire scanning angle (3 mm nonmydriatic pupil, 8° on the retinal plane), and the transverse Airy disk diameter referenced to the retina is 10 μm. The retinal conjugate was magnified by 4.8 on the linear array (also the confocal SA), and a 50 μm SA (∼10.05 μm on the retina) was chosen, which means it was tightly confocal (∼1.05 times the Airy disk). Accordingly, the estimated optical resolution was approximately 10 μm, which was limited by the aberrations of the human eye.
Prior to examination, informed consent was obtained from all of the subjects, who were also informed of the power of the light beam. The incident power at the cornea for the extended imaging beam was 500 μW. This exposure level is considered safe according to ANSI standards [ANSI Z136.  for several hours of continuous intra-beam viewing. For our experiments, the line illumination would allow for even higher safe exposure levels, since the power was roughly 50 times below the ANSI maximum permissible exposure levels for the human eye.
The LSO system was tested on human volunteers aged 20 to 35 years with healthy eyes. In order to minimize retinal movements, the chosen subjects were typically highly practiced observers with extensive knowledge of controlling eye movement behavior. Therefore, it can be assumed that they made great efforts to maintain a steady gaze. In addition, a chin rest and forehead rest were used to stabilize head movements, and the subjects were asked to fixate on a bright green target with the fellow eye.

Preliminary human subject test plan
For the high speed imaging sequences of our LSO system, as discussed above, there were almost no remnant intra-frame distortions in the single frame and the image sequences barely differed in magnification and rotation. In addition, fixation made the rotation of retinal motion negligible [31]. Consequently, the determination of similarity between sequential images yielded the best translational fit. The most widely used method for similarity detection is cross-correlation.
The details of retinal motion recovery with this system have been previously reported [32]. Briefly, an image is used as a fixed reference frame (usually taken to be the first frame), and a template of image features is selected on the reference image. As illustrated in figure 2, the whole window of each sequential image is the search area. The template window is then shifted across the search area, and a similarity measure is calculated at each position. The point of maximal similarity is designated as the match position, where the features in the template are aligned with the corresponding features in the search area of the image sequence, and this position determines the translational motion.
An exhaustive search in a cross-correlation technique is defined as a pixel-by-pixel scan of sequential images to find the match coordinates, which is computationally intensive and may require a lengthy calculation period. For this reason, instead of a standard cross-correlation, some novel techniques have employed an enhanced cross-correlation measure, and a normalized cross-correlation was subsequently used to correct for differences in illumination. In its standard form, it is defined as:

$$\mathrm{corr}(I, I'_{k,l}) = \frac{E\left[\left(I - E(I)\right)\left(I'_{k,l} - E(I'_{k,l})\right)\right]}{\sqrt{E\left[\left(I - E(I)\right)^{2}\right]\,E\left[\left(I'_{k,l} - E(I'_{k,l})\right)^{2}\right]}} \qquad (1)$$

where $I$ and $I'$ are the reference image and sequential image, respectively, $E(\cdot)$ is the expectation of the image, and $k$ and $l$ denote the position of the moving window. By finding the indices $k$ and $l$ that maximize the cross-correlation $\mathrm{corr}(I, I'_{k,l})$, the $(k, l)$ displacements of the new frame with respect to the reference frame were determined as a measure of the relative motion of the retina within the specific area. In order to improve the reliability and accuracy of the cross-correlation approach, we added a time-saving procedure, and a modified cross-correlation was developed and programmed. The main modification was a 'limited exhaustive search' routine, in which the search area was restricted to pixels within the radius of the anticipated maximum retinal movement between frames, using a 'look-ahead' strategy based on the motion predicted from the previous frame.
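The matching procedure described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes NumPy, grayscale frames stored as 2D arrays, a zero-mean normalized cross-correlation as the similarity measure, and a square search window of hypothetical half-width `radius` centered on the position predicted from the previous frame. The names `ncc` and `limited_search` and the synthetic test frame are illustrative only.

```python
import numpy as np

def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return 0.0 if denom == 0 else float((a * b).sum() / denom)

def limited_search(frame, template, top_left, predicted_shift=(0, 0), radius=40):
    """Search only within `radius` pixels of the predicted template position.

    top_left: (row, col) of the template in the reference frame.
    predicted_shift: (d_row, d_col) carried over from the previous frame.
    Returns (best_shift, best_coefficient) relative to the reference frame.
    """
    h, w = template.shape
    r0 = top_left[0] + predicted_shift[0]
    c0 = top_left[1] + predicted_shift[1]
    best_shift, best_coef = (0, 0), -1.0
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = r0 + dr, c0 + dc
            # Skip candidate windows that fall outside the frame.
            if r < 0 or c < 0 or r + h > frame.shape[0] or c + w > frame.shape[1]:
                continue
            coef = ncc(template, frame[r:r + h, c:c + w])
            if coef > best_coef:
                best_coef = coef
                best_shift = (r - top_left[0], c - top_left[1])
    return best_shift, best_coef

# Synthetic check: a random "retina" displaced by 3 rows and -5 columns.
rng = np.random.default_rng(0)
reference = rng.random((120, 160))
frame = np.roll(reference, shift=(3, -5), axis=(0, 1))
top_left = (40, 60)
template = reference[40:60, 60:80]
shift, coef = limited_search(frame, template, top_left, radius=8)
```

In a tracking loop, the shift found for one frame would be passed as `predicted_shift` for the next, so that `radius` only has to cover the largest expected frame-to-frame displacement.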
The cross-correlation approach is a computationally efficient alternative to other motion estimation methods, allowing estimates of retinal motion that allow comparisons of imaging frame rates. In addition, it is not sensitive to image details, so image features of retinal structures including photoreceptors, feeder vessels, nerve fibers, and other cells can be extracted to measure retinal motion.

Results and discussion
Retinal motion was extracted from the raw LSO videos using a cross-correlation procedure. All frames in the video were compared to a single reference frame (taken as the first frame) in order to determine how the retina had moved during the course of imaging. Note that there are many features of the retina to be selected as a cross-correlation template to extract motions, and marking of different features, such as blood vessels, photoreceptors, and other cells may affect the precision of motion retrieval.
In order to determine precision targeting for retinal motion extraction, motion estimations based on manual marking of different retinal features were performed, and we present the results first. Once precision targeting of retinal features was established, the other important consideration was the size of the template for motion extraction in the cross-correlation approach. When the template size was too small, the accuracy of motion estimation decreased severely, whereas a much larger cross-correlation template slowed the motion extraction procedure. In section 3.2, we present the experimental results for different template sizes, which determined the effectiveness and high-speed characteristics of the motion extraction scheme.

Tracking templates with different features
The video clip in figure 3 shows an 877 frame sequence (5.48 s) of LSO data. Feature displacements occurred at all times due to inter-frame motions. One single frame is shown in figure 3 to demonstrate retinal features that are potential template targets for motion extraction using the cross-correlation method.
The particular features in the image included blood vessels of different scales, parallel vessels and bifurcations, avascular areas, and the optic nerve head. An arbitrary retinal feature can be selected as a template in the cross-correlation tracking method, and templates of different retinal features do not affect the computation speed. Recall that the cross-correlation approach employs computing image intensities to maximize similarities between a pair of discrete rectangular images, so that templates of different features are related to cross-correlation detection accuracy.
We chose seven distinct locations in the retina image as templates (labeled in figure 3), and retinal motion extraction experiments were executed using the cross-correlation scheme. Note that retinal motion with large velocities diminished the visibility of retinal features in the search templates. Accordingly, the template had variable size flexibility if the predicted template location was missing in the field of view due to large motions. In this case, the location and size of the template was preserved, and the missing portion of the template was cut off; only the remnant template was used to cross-correlate for extracting retinal motion. This variety of templates assured the cross-correlation procedure was successful and reduced false extractions.
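The handling of a template that partly leaves the field of view can be illustrated as follows. This is a sketch under stated assumptions, not the authors' code: it assumes NumPy arrays, a frame of 400 rows by 800 columns, and a hypothetical helper `clip_template` that keeps only the part of the template still inside the frame, together with the position at which that remnant should be matched.

```python
import numpy as np

def clip_template(template, top_left, frame_shape):
    """Return the part of the template inside the frame and its position.

    top_left: predicted (row, col) of the template in the new frame; it may
    be negative or extend past the frame edge after a large motion.
    Returns (remnant, (row, col)), or (None, None) when nothing is left.
    """
    h, w = template.shape
    r0, c0 = top_left
    r_start, c_start = max(r0, 0), max(c0, 0)
    r_end = min(r0 + h, frame_shape[0])
    c_end = min(c0 + w, frame_shape[1])
    if r_end <= r_start or c_end <= c_start:
        return None, None
    remnant = template[r_start - r0:r_end - r0, c_start - c0:c_end - c0]
    return remnant, (r_start, c_start)

# An 80 x 80 template predicted 10 rows above and 30 columns beyond the frame.
template = np.arange(80 * 80).reshape(80, 80)
remnant, pos = clip_template(template, top_left=(-10, 750), frame_shape=(400, 800))
```

Only the remnant is then correlated against the frame, so the nominal template location and size stay unchanged for later frames in which the full template is visible again.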
Note that the cross-correlation coefficient $\mathrm{corr}(I, I'_{k,l})$ presented in equation (1) defined the precision of retinal motion extraction. The closer $\mathrm{corr}(I, I'_{k,l})$ was to 1, the more similar the sequence images became. Figure 4 shows the cross-correlation coefficient $\mathrm{corr}(I, I'_{k,l})$ obtained from different templates. The positive direction in figure 4 corresponds to a higher precision of retinal motion estimates. The color labels in figure 4 correspond to template locations of distinct features in figure 3. Note that we have used simple linear interpolation to fill in the missing coefficient data due to the discrete scan mode and due to a lack of recorded data during large motions in which the template location left the field of view.
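The gap filling mentioned above can be illustrated with a short sketch. It assumes NumPy, that missing coefficients are stored as NaN, and that at least one valid value exists; the helper name `fill_missing` and the sample values are made up for illustration.

```python
import numpy as np

def fill_missing(coefficients):
    """Linearly interpolate NaN entries from the nearest valid neighbours."""
    values = np.asarray(coefficients, dtype=float)
    frames = np.arange(values.size)
    valid = ~np.isnan(values)
    # np.interp holds the first/last valid value constant beyond the ends.
    return np.interp(frames, frames[valid], values[valid])

# Made-up coefficient trace with two gaps.
trace = [0.95, np.nan, np.nan, 0.92, 0.96, np.nan, 0.94]
filled = fill_missing(trace)
```

Interpolated values only make the plotted trace continuous; they carry no new information about the match quality in the missing frames.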
In figure 4, clear differences between the cross-correlation coefficients of the templates can be seen. Cross-correlation coefficients for templates 2 and 3 were smaller than 0.8, with a relatively flat peak of cross-correlation values. Templates 1, 5, and 6 gave mid-level accuracy, with coefficients between 0.8 and 0.9. Due to the comparatively distinct vessel features, much higher coefficients were observed for templates 4 and 7. Occasionally, the cross-correlation of the large parallel vessels in template 4 was even more accurate than that of the vessel bifurcation in template 7. Overall, however, the vessel bifurcation performed significantly better than the parallel vessels. The templates therefore differed in the reliability of the cross-correlation according to the retinal features they contained.
To show the cross-correlation results more clearly, the mean values of the cross-correlation coefficients for the different templates are shown in figure 5, together with the standard error of the mean (SEM). In the plot, each bar carries two numbers; i.e., an average (plotted histogram) and a SEM (error bar). Template 2 had the lowest accuracy, with a coefficient of 0.4638. This was because template 2 had no particular features and a fairly uniform background, and its cross-correlation fluctuated the most, giving the largest SEM. Template 3 contained a small vessel with an indistinct contour, so its coefficient was slightly higher than that of template 2, but was still below 0.7. Templates 1, 5, and 6 covered a large vascular pattern of the retina, and their cross-correlation coefficients reached 0.8. The highest cross-correlation precision occurred in templates 4 and 7, with coefficients greater than 0.9 and much smaller SEM values. Taking into account the stability of the cross-correlation procedure indicated by a small SEM value, the features in template 7 were the more efficient cues for extracting retinal motion with the cross-correlation method.
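For reference, the two numbers attached to each bar can be computed as in the following sketch. It assumes the SEM is the sample standard deviation divided by the square root of the number of frames, which the text does not state explicitly; the coefficient values are made up.

```python
import math
import statistics

def mean_and_sem(values):
    """Return (mean, SEM), with SEM = sample standard deviation / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem

# Made-up per-frame coefficients for one template.
coefficients = [0.97, 0.98, 0.96, 0.99, 0.98]
mean, sem = mean_and_sem(coefficients)
```

A template with a high mean but a large SEM matches well on average yet inconsistently from frame to frame, which is why both numbers are considered together.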
Normally, one would expect better detection with higher cross-correlation coefficients, because a higher coefficient indicates a closer match. However, unacceptable, erroneous correlations occurred many times even when high cross-correlation coefficients were calculated. Since the cross-correlation computes the coefficient over the search area of each sequential image and always returns the position of maximal similarity, a high coefficient does not by itself signify a true correlation detection, especially when the SEM value is large. Therefore, it was necessary to identify the incorrect similarity operations in order to obtain an overall assessment of the cross-correlation extraction.
All the template experiments were examined to count the number of incorrect cross-correlation operations. As a general criterion, whenever the shift position of the template differed by more than one pixel from the previous measurement, the cross-correlation estimate was counted as one incorrect cross-correlation operation. Figure 6 shows the number of incorrect cross-correlations for all template experiments. With the smallest coefficient and the largest SEM, template 2 had the largest number of incorrect operations, i.e., 56, implying the lowest precision of correlation computation. The incorrect number for template 3, i.e., 44, was a little smaller than that for template 2, and still implied an unacceptable cross-correlation experiment. Although the mean coefficients for templates 1, 5 and 6 were nearly equivalent (shown in figure 5), their incorrect numbers were not the same (shown in figure 6). The large SEM value for template 6 implied an unstable cross-correlation, and produced an incorrect number of 31. The incorrect numbers for templates 1 and 5 were much lower, in accordance with the SEM values shown in figures 4 and 5. Only templates 4 and 7 produced no incorrect cross-correlation operations, in accordance with their large coefficient values and small variances shown in figures 4 and 5, and these two templates of retinal features are potentially the best targets for retinal motion extraction with the cross-correlation procedure. Therefore, template 7, which had the largest cross-correlation coefficient and the smallest SEM value, is the optimal targeting location for the cross-correlation method.
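The counting criterion can be written down as a short sketch. It assumes that each measurement is a (row, column) shift in pixels and that the one-pixel limit is applied to each axis separately; the text does not say whether the limit is per axis or on the total displacement, so this choice, the function name `count_incorrect`, and the sample shifts are assumptions.

```python
def count_incorrect(shifts, max_jump=1):
    """Count measurements whose shift jumps more than max_jump pixels
    (in either axis) relative to the previous measurement."""
    incorrect = 0
    for (r_prev, c_prev), (r, c) in zip(shifts, shifts[1:]):
        if abs(r - r_prev) > max_jump or abs(c - c_prev) > max_jump:
            incorrect += 1
    return incorrect

# Made-up shift sequence with one false match at the fourth frame.
shifts = [(0, 0), (1, 0), (1, 1), (9, -6), (2, 1), (2, 2)]
n_bad = count_incorrect(shifts)
```

With this per-frame comparison a single false match is counted twice, once for the jump away and once for the jump back, so the count reflects discontinuities rather than the number of distinct bad frames.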
The extraction of retinal motion using template 4 had a relatively high coefficient value, but the large SEM value indicated that the cross-correlation procedure fluctuated, and the correlation matching calculation had weak stability. Once the image quality decreased, features of template 4 had reduced correlation accuracy for the estimation of retinal motion.
Consequently, we can draw a clear conclusion. The template employing distinct retinal vessels had the highest accuracy and stability, and characteristics of blood features are the best potential templates in the cross-correlation algorithm. The vascular pattern of the retina should be used as the main template cue for cross-correlation extraction of retinal motion. The essential reason is that the retinal background is fairly uniform, and vessels are shown very distinctly. Ordinarily, a variety of vessel bifurcation structures appear in retinal imaging sessions. As a result, vessel bifurcation or vessel features occupying approximately eighty percent of the template area, which simultaneously has a non-uniform intensity distribution, are the optimal targets in cross-correlation templates for retinal motion estimations.

Determination of the template size
Cross-correlation has been shown to be reliable for extracting retinal motion, but it is computationally intensive. Modifications were made to produce a faster algorithm, and we made some new attempts to speed the calculations, as shown in section 2.2.
Considering computer capabilities today, there was no need for a special design and allocation of computer memory, and cross-correlation matching was quickly realized on a personal computer. While the cross-correlation was used for extraction of retinal high-frequency motion, several conditions needed to be met: (1) an execution speed for correlation computations is not much lower than the imaging speed, at least in real time scales; and (2) tracking of all retinal motion components, especially for motions at high speeds and large amplitudes that are hard to extract. All the requirements were related to the template size of the correlation similarity.
Normally, a smaller template size reduces the correlation computations, and is computationally more effective; on the other hand, a smaller template includes fewer features in the calculation of correlation similarity, leading to a decreased accuracy of detection and, what is worse, special retinal motion components cannot be extracted completely. In our study, we determined how the template size affects the extraction of retinal motion using a cross-correlation method.
Note that we chose vessel bifurcation as the template location, labeled as 7 in figure 3, and we continued to execute the cross-correlation for the same LSO data as in figure 3. Each test was carried out at a number of template size settings: 40 × 40 pixels, 60 × 60 pixels, 80 × 80 pixels, 100 × 100 pixels, and 120 × 120 pixels. For each experiment, we recorded the performance of cross-correlations and completion times.
In the same manner, we plotted the cross-correlation coefficients obtained using different template sizes in figure 7, where clear differences between the coefficient distributions can be seen. When the template size was 40 × 40 pixels, the cross-correlation coefficients were smaller than 0.8 and the cross-correlation procedure fluctuated considerably. The correlation was more accurate with a template size of 60 × 60 pixels, and the coefficient reached 0.9. When the template size increased to 80 × 80 pixels and larger, the coefficients were close to 1, and it was difficult to distinguish the results of these three sets in figure 7. As the template size increased up to 80 × 80 pixels, the correlation coefficients increased, mainly because the template contained enough features to produce a highly accurate cross-correlation similarity measure.
The template located on the vessel bifurcation was labeled as No.7 in figure 3.
A statistical analysis of the coefficients shown in figure 7 was performed, and the mean and SEM values are shown in figure 8. The smallest template size produced a cross-correlation coefficient of 0.74 with a SEM of 0.38. A larger template of 60 × 60 pixels increased the precision of the cross-correlation, with a coefficient of 0.89, but the SEM was 0.39, which indicates a lack of stability for this template size. When the template size increased to 80 × 80 pixels, the coefficient increased to 0.977 with a small SEM of 0.010, signifying an improvement in stability. When the template size was larger than 80 × 80 pixels, the coefficients remained at 0.98 with a SEM slightly less than 0.01.
Using the same criterion for incorrect cross-correlations discussed above, the numbers of incorrect correlations are summarized in figure 9. As shown in figure 9, the larger the template size, the lower the number of incorrect cross-correlations. There were 28 incorrect cross-correlation operations when the template size was 40 × 40 pixels, signifying an unreliable correlation experiment due to a lack of matching data. When the template size was 60 × 60 pixels, more retinal features with matching data were used to calculate the correlation, and the incorrect correlation number decreased to 14, but this still indicated a relatively inaccurate correlation. As the template size increased further, the template included more features in the calculation of correlations, and there were no incorrect correlations when the template size was 80 × 80 pixels.
Considering figures 7-9, a small template size led to an inaccurate cross-correlation, and the threshold value for template size was larger than 60 × 60 pixels in these experiments. A larger template contributed to cross-correlation precision as expected, but at the cost of computational efficiency. We summarize the correlation times for different template sizes in table 1. Note that the video had 877 frames recorded at 160 Hz, which is much faster than a real-time record. Table 1 shows that when the template size was smaller than 100 × 100 pixels, the correlation calculations ran at 30 Hz.
There are three main types of retinal motion: tremor, drift and microsaccades [33]. Microsaccadic movements have a velocity of up to 300 deg s−1 and are the hardest motion to extract due to their ultra-high frequency and large amplitudes. The template size (corresponding to the required minimum search area in the cross-correlation method) was determined by the spatial dimensions of the image pixels, the maximum anticipated retinal motion velocity, and the image frame rate [34]. In an LSO retinal imaging session, the square pixel spatial dimension was 14 μm per side on the retina, the maximum retinal motion velocity was 300 deg s−1 [35], and the frame rate was 160 Hz. Given that 1 deg is equivalent to approximately 291.5 μm on the human retina at the posterior pole [36], the template size in each direction, $S$, in pixels was defined as:

$$S = \frac{v_{\max} \times 291.5\ \mu\mathrm{m\,deg^{-1}}}{f \times p} = \frac{300 \times 291.5}{160 \times 14} \approx 39\ \text{pixels},$$

where $v_{\max}$ is the maximum retinal motion velocity, $f$ is the frame rate and $p$ is the pixel dimension. Since $S$ is the distance the retina can travel in any direction between consecutive frames, the template window would have to be at least $2S$, or approximately 80 pixels, across. The maximum anticipated retinal motion velocity of 300 deg s−1 therefore meant that the template size had to be at least approximately 80 × 80 pixels. Considering the experimental results plotted in figures 8-9, a template size larger than 60 × 60 pixels was needed to realize high accuracy cross-correlation calculations; on the other hand, the computational cost of the cross-correlation operations required a template size smaller than 100 × 100 pixels. Therefore, a decisive conclusion for the template size was obtained: the template size was a compromise between having enough image data on the retinal features to match reliably and having good temporal resolution. For this LSO system, the optimal template size was 80 × 80 pixels, which was suitable for a high speed and highly accurate cross-correlation to extract retinal motion.
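The arithmetic behind this size estimate can be checked numerically. The sketch below assumes that S is the largest retinal displacement between two consecutive frames expressed in pixels, computed from the values quoted in the text; the function name `half_width_pixels` and the rounding of 2S up to the next multiple of ten pixels are assumptions made for illustration.

```python
import math

UM_PER_DEG = 291.5    # retinal distance per degree at the posterior pole (from the text)
MAX_VELOCITY = 300.0  # maximum retinal motion velocity, deg/s (from the text)
FRAME_RATE = 160.0    # frames per second (from the text)
PIXEL_SIZE = 14.0     # pixel dimension on the retina, micrometres (from the text)

def half_width_pixels(max_velocity, frame_rate, pixel_size, um_per_deg=UM_PER_DEG):
    """Largest displacement between consecutive frames, in pixels."""
    return max_velocity * um_per_deg / (frame_rate * pixel_size)

s = half_width_pixels(MAX_VELOCITY, FRAME_RATE, PIXEL_SIZE)
# The window must span the displacement in both directions: 2S.
# Rounding up to a multiple of 10 pixels is an illustrative choice.
width = math.ceil(2 * s / 10) * 10
print(f"S = {s:.2f} pixels, 2S = {2 * s:.2f} pixels, template width = {width} pixels")
```

Because S scales inversely with the frame rate, halving the frame rate would double the required window width for the same maximum velocity.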

Discussion and conclusions
This paper reported optimal targeting for retinal motion extraction using cross-correlation with a high speed LSO. Precision analysis of the cross-correlation matching experiments with LSO data was carried out to assess the performance of retinal motion extraction and to optimize the template target for the cross-correlation, including two key template parameters in the extraction of retinal motion; i.e., template location of retinal features and template size.
In this study, we found that cross-correlation performance was highly sensitive to the retinal features in the template, and that the best-targeted correlation templates were located on major retinal structures, such as vessel shadow patterns. A vessel can be darker than the retinal background in normal fundus photographs, and the pattern of retinal vessels renders each frame unique, providing the most relevant information for the cross-correlation matching process. In particular, the optimal template target was a vessel bifurcation or vessel features occupying approximately eighty percent of the template area, because the gray level difference between the vessel pattern and background was usually much larger than the difference between two non-corresponding areas in the retinal background.
In conjunction with determining that the optimal target for a cross-correlation template was a retinal vessel bifurcation, the cross-correlation performance was quantified to determine the template size parameters. We found that the optimized template size for our LSO system was 80 × 80 pixels, and that it was possible to extract retinal motions of up to 300 deg s−1 at a speed of 30 Hz.
Actually, there is no universal technique for solving precision targeting problems in a cross-correlation as it applies to extract retinal motion. The optimal template location and template size are highly dependent on the nature of the retinal images and system requirements. Nevertheless, the optimal template location and template size described here are appropriate and effective to extract retinal motion. In addition, the determination of cross-correlation templates can also be applied to other images having similar properties; i.e., relatively small features of distinct gray levels in an otherwise fairly uniform background.