Correction of circumferential and longitudinal motion distortion in high-speed catheter/endoscope-based optical coherence tomography

: Catheter/endoscope-based optical coherence tomography (OCT) is a powerful modality that visualizes structural information in luminal organs. Increases in OCT speed have reduced motion artifacts by enabling acquisition faster than or comparable to the time scales of physiological motion. However motion distortion remains a challenge because catheter/endoscope OCT imaging involves both circumferential and longitudinal scanning of tissue. This paper presents a novel image processing method to estimate and correct motion distortion in both the circumferential and longitudinal directions using a single en face image from a volumetric data set. The circumferential motion distortion is estimated and corrected using the en face image. Then longitudinal motion distortion is estimated and corrected using diversity of image features along the catheter pullback direction. Finally, the OCT volume is resampled and motion corrected. Results are presented on synthetic images and clinical OCT images of the human esophagus.

There are different types of motion distortion in side-viewing OCT probes including: radial, circumferential, and longitudinal distortion. Radial distortion is often caused by changes in the distance between the tissue lumen and scanning probe, leading to variation in the axial position of the tissue and also the spatial sampling pitch. However, in catheter/endoscopic OCT, this motion is usually small, because the probe is in contact with the tissue. The second type of distortion, circumferential distortion, is sometimes referred to as non-uniform rotational distortion (NURD) and is caused by variations in the circumferential scan position or speed. In proximal scanning systems, NURD is typically caused by the mechanical friction between the catheter torque coil and surrounding sheath [21]. It is further exacerbated by physiological motion of the tissue lumen and bending of the imaging catheter. The use of distal scanning micro-motor OCT probes dramatically reduces NURD associated with mechanical friction in the probe. NURD is reduced, but still occurs in micro-motor probes, because the motors have bearing friction and are usually driven open loop [22]. Circumferential motion distortion can be corrected by non-rigid registration using motion estimated from fiducial markers on the probe housing [22], speckle statistics between consecutive A-scans from a dynamic light scattering model [23], directly from OCT en face images by assuming there are slowly varying structures that are oriented roughly parallel to the pullback axis [24], or by maximizing the similarity between consecutive cross-sectional frames after alignment [25]. There are also micro-motor engineering solutions to reduce NURD, such as using synchronous micro-motors to reduce the speed variation with respect changing loads or temperature [26,27]. These motors have been used for intravascular OCT applications which require high rotational speed, small dimensions, and a very low maximum torque (∼a few µNm). Conversely, micro-motors used in endoscopic OCT scan a larger diameter lumen and require large beam steering optics. This necessitates larger micro-motors which also have higher moment of inertia armatures.
Circumferential motion distortion can also be caused by lateral movement of the tissue on the time scale of the sequential circumferential scans. The data acquisition in side-viewing OCT scanning is similar to a rolling shutter in commercial complementary metal-oxide semiconductor (CMOS) cameras where sequential rows of the camera sensor are exposed at different times [28]. Therefore, there is image distortion caused by moving objects in both methods, because every row of the CMOS camera or every B-scan in OCT samples the object at a different time. For OCT, even if the rotational beam scanning is uniform, circumferential distortion can still occur if there is relative displacement between the tissue and catheter during sequential rotational scans. This relative displacement can be caused by peristalsis, respiration, or cardiac motion, as well as coupled into the catheter from an external source.
The third type of motion distortion is longitudinal distortion, which is manifest as stretched or compressed areas in en face OCT images. This distortion is often caused by pullback speed variations arising from mechanical friction of the micro-motor and optical assembly with the probe sheath or by relative longitudinal motion (sliding) between the tissue and catheter from physiological motion. This type of distortion poses a challenge to interpreting en face OCT images in clinical applications. For example, if en face OCT images are used to detect dysplasia in patients with Barrett's esophagus, longitudinal motion distortion may obscure the mucosal features or cause them to appear distorted.
Software and hardware solutions have been developed to reduce longitudinal motion distortion. Lee et al. [29] suggested scanning the same region with two spatially-separated beams, registering common features between the two images, and estimating longitudinal motion from the registered results. Although effective, this method doubles the acquisition bandwidth needed. Another approach is increasing the A-scan rate and pullback speed to reduce the impact of physiological motion. Wang et al. [30] reported an intravascular OCT technique called "Heartbeat OCT", which operated at 2.88 MHz A-scan rate, 4,000 frames/s, and 100 mm/s pullback speed. An entire vessel segment could be scanned within 1 cardiac cycle, reducing the effects of cardiac motion.
However, the circumference of intravascular OCT catheters is much smaller than catheters used in other organs, such as the gastrointestinal tract. Implementing an analogous approach to "Heartbeat OCT" in larger luminal organs would require excessive increases in A-scan rate, which is technically challenging [31,32].
In this paper, we introduce a novel method that uses only a single en face OCT image to estimate and correct for circumferential and longitudinal motion distortion in catheter-based OCT volumes. Circumferential distortion estimation/compensation uses NURD-corrected data [22] to generate an en face OCT image and estimate the circumferential motion distortion. Using NURD-corrected data allows us to minimize the distortion from non-uniform micro-motor rotation, leaving residual distortion from the relative motion between the catheter and tissue. We developed a modified optical flow algorithm to estimate the catheter/tissue displacement from a selected en face image. In contrast to the traditional optical flow [33] method that uses a sequence of images for registration and estimation of the displacement vector, our method only requires a single en face image to estimate the circumferential displacement. It jointly estimates both the displacement and partial derivative of image intensity simultaneously. The en face image is then resampled to correct all remaining circumferential motion distortion. This image is subsequently used for longitudinal distortion correction. The longitudinal distortion correction quantifies the standard deviation of en face image features within a fixed size window sliding along the pullback direction to estimate the longitudinal sampling interval. Then, the en face image is resampled to correct the longitudinal motion distortion. The process is repeated to correct all en face images in the OCT volume. We demonstrate our method for distortion correction on synthetic en face images and OCT images of the esophagus from patients undergoing surveillance endoscopy for Barrett's esophagus.

OCT imaging instrument
We acquired data using a prototype ultrahigh speed, MEMS-VCSEL SS-OCT system operating at 600 kHz A-scan rate. A detailed description of this system has been previously published [18]. The instrument has 8 µm axial and 20 µm transverse resolution (full width at half maximum), and a 2.4 mm imaging range (in tissue). Volumetric OCT scanning was performed by a micromotor at 24,000 rotations per minute, 400 Hz frame rate, with longitudinal pullback speed of 2 mm/second for 8 seconds, imaging a ∼10 mm × 16 mm (circumferential × longitudinal) area. A strut on the motor housing blocked ∼40% of the circumferential field of view, leaving an imaging area of ∼6 mm × 16 mm. Each B-scan consisted of 1,500 A-scans and each volume had 3,200 B-scans.
Imaging was performed using a side-viewing, micro-motor OCT catheter ( Fig. 1(B)). A 2 mm diameter micro-motor (DBL02, Namiki Precision, CA) scanned the OCT beam circumferentially. An optical fiber and graded index lens assembly focused the beam ∼200 µm outside the outer wall of the probe sheath. The fiber / graded index lens was aligned to the scanning micro-motor mirror / prism by a metal housing and attached to a stainless-steel torque coil. The imaging assembly was contained within a high lubricity transparent PTFE sheath (Zeus Inc.) to facilitate smooth longitudinal pullback.

Clinical setting
The study was performed under an IRB protocol approved by VA Boston Healthcare System (VABHS), Harvard Medical School, and Massachusetts Institute of Technology. Written informed consent was obtained before imaging. Data was obtained from patients with Barrett's esophagus (14 patients with history of dysplasia and 30 with non-dysplasia Barrett's esophagus (NDBE)), who were undergoing endoscopic surveillance for dysplasia at the VABHS between September 2013 and March 2017. A dual-channel endoscope was used to introduce the OCT probe through the endoscope accessory channel and position it on regions of interest. Multiple OCT volumes were acquired from the squamous region, gastroesophageal junction (GEJ), squamocolumnar junction (SCJ), areas with nodularity and/or irregular mucosal/vascular patterns on white-light endoscopy and narrow-band imaging. Additional details about the study setting and patient demographics can be found in Ref. [34]. Figure 2 shows a block diagram of the processing pipeline with a breakdown for each block shown in Fig. 3. The pipeline consists of three steps. In Step 1, we process raw OCT fringes to calculate the B-scan OCT intensity image. Next, we correct for NURD using edges of the probe housing as fiducial marks, as described in Ref. [22]. An OCT volume is generated by concatenating the sequential NURD-corrected B-scans. Next, a mean depth projection is performed over ± 50 µm depth (100 µm range) at multiple depths to generate depth-resolved en face images with improved contrast and reduced noise. Finally, the algorithm computes the dynamic ranges of all en face images and chooses one with the largest dynamic range, called image I. This image is displayed to the user to confirm that it has representative features and there are no data acquisition errors. In Step 2, we estimate the remaining circumferential motion caused by relative motion between the tissue and catheter. We then use the estimated motion to resample I and generate a second motion-corrected en face image, called I 1 , which should have only longitudinal motion distortion. Finally, in Step 3, we use I 1 to estimate and correct the longitudinal motion. The estimated motion is used to resample all en face images at different depths (without depth projection) to correct the entire volume. We describe the details of Steps 2 & 3 in the next sections.

Estimate and correct circumferential motion distortion
The estimation and correction of circumferential motion distortion is performed iteratively until a predetermined stopping criterion is satisfied (either a maximum number of iterations is reached, or minor improvement is observed with each iteration). Each iteration consists of estimating the circumferential motion and resampling the en face image I using the estimated motion. We use a maximum of 30 iterations because we found that only minor improvements were obtained with further iterations. We describe details of the sub-steps below and show example correction results using different numbers of iterations.

Estimate circumferential motion from the selected en face image
In the 2D coordinate system of the optimal en face image, I, let c[m, n] and l[m, n] be the circumferential and longitudinal sampling coordinates of a pixel with a circumferential index of m (0 ≤ m ≤ C − 1) at longitudinal index n (0 ≤ n ≤ L − 1), where C and L are the number of A-lines in the circumferential dimension and the number of B-frames in the longitudinal dimension, respectively. Because the sampling frequency in the circumferential dimension was much faster (600,000 Hz) than that in longitudinal dimension (400 Hz), we ignore the dependence of l on the circumferential index m and refer to it as l [ . This process is repeated until convergence. The method is summarized below (Algorithm 1). In our implementation, convolutions with Sobel kernels [35] was used to calculate the derivatives ∂I/∂c and ∂I/∂l. The Sobel kernel effectively computes the differential in one dimension while performing a weighted average in another dimension normal to it, allowing a more robust estimation. Additionally, because the circumferential motion of neighboring pixels is likely very similar, we estimate ∆c[m, n] in a coarse grid where only 20 values of the circumferential coordinate m are used to sample the entire circumference (18 degrees angular sampling pitch). This achieves a significant improvement in algorithm throughput. Next, in order to calculate the circumferential coordinates c[m, n] from ∆c[m, n], one could simply integrate along the time dimension. However, we found that this integration is sensitive to the estimation error in ∆c[m, n], which sometimes led to unreasonable results for large n. To avoid this, we enforce additional global constraints on the circumferential coordinates:1. Top left circumferential coordinate constraint c[0, 0] = 0. 2. Within one cross-section, the coordinate difference between c. Check the stopping condition by computing the relative update ratio of neighboring indices m should be within a feasible range i.e.
where d is the average circumferential distance between two neighboring A-scans. δ defines the tightness of the constraint (smaller is more constrained). To enforce a smooth change in c, we chose a small value, δ = 0.01. These constraints were cast into a convex quadratic programming problem with constraints to solve for c[m, n] from the estimated ∆c[m, n], namely Here, we use the bold font to denote vectors (e.g. c) while lower case, italicized font is used for a coordinate of the vector (e.g. c[m, n]). The optimization problems were solved using the Optimization Toolbox of MATLAB, Mathworks Inc. As mentioned above, we use δ = 0.01 in Eq. (4), which is a relatively small number, and perform the correction over multiple iterations that alternate between estimating the circumferential motion and resampling the en face image. Our experimental results show that it is possible to relax this constraint with a larger value for δ and reduce the number of iterations. However, we found the convergence was smoother if we clipped the estimated displacements to underestimate the actual displacements and performed "partial" correction over multiple iterations (most converged in 5 to 10 iterations), rather than performing a "full" correction in only 1 or 2 iterations. This is because the Taylor expansion in Eq. (1) is only valid for small displacements ∆c. Therefore, it is better to distribute the correction for large ∆c over multiple iterations and perform a small correction on each iteration.

Resample the en face image to correct circumferential distortion
After solving for the circumferential coordinates c * of all pixels, we compute the maximum value of the circumferential coordinate across all estimated coordinates c * as c max = ⌊max c * ⌋.From this, we compute new a new vector of circumferential coordinates c 1 that uniformly sample the range [0, c max ] with unit increments, namely  show how the en face image evolves with different number of iterations. Only a fraction of the circumferential motion distortion is corrected ( Fig. 4(B)) after 5 iterations, while most of the distortion is corrected after 30 iterations. The insets show 2x zoomed images of a region from the en face images with almost no remaining circumferential distortion after 30 iterations. The motion-corrected image after 30 iterations I 1 was passed to the next step in the algorithm, to estimate and correct longitudinal motion distortion.

Estimate and correct longitudinal motion distortion
The estimation and correction of longitudinal distortion was performed iteratively. In each iteration, an estimation of the longitudinal sampling interval were performed, followed by a resampling step to correct for the longitudinal distortion. A checking step is performed after each iteration to determine if convergence has been reached and the iteration can be terminated. To quantify ∆l, we begin with the observation that there is a strong correlation between the interval ∆l and en face image feature distortion. A shorter value of ∆l can be caused by the optical assembly not pulling back with uniform speed, or sticking in the sheath, as well as by the probe sliding in the direction opposite the pullback. This causes the en face image features to appear stretched, because the same features are sampled for a longer time that they should be. Conversely, a longer value of ∆l can be caused by the optical assembly rapidly pulling back within the sheath, as well as the probe sliding in the direction of the pullback. This distortion causes the tissue to be more sparsely sampled, resulting in longitudinal compression of features in the en face image. Based on this observation, we estimated the sampling interval by measuring the diversity of the image features over a fixed-sized window sliding along the longitudinal dimension. The diversity is quantified by first performing a one-dimensional, short-time Fourier transform along the longitudinal dimension, with a window of 5 pixels around every pixel of interest (∼50 µm long). The window size was chosen as a compromise between localization and feature diversity for longitudinal motion estimation. A small window provides better localization. However, the feature diversity may be insufficient to ensure reliable estimation. A large window will capture sufficient feature diversity at the cost of less localization. Next, we compute the power spectral density by squaring the magnitude spectrum. Then, we compute a deviation σ[m 1 , n] from the power spectral density to characterize feature diversity at each pixel [m 1 , n]. A smaller value of σ means a narrower power spectra or less variation in the image features. Regions where the probe is out-of-contact with the tissue do not contain any image features, therefore the corresponding σ values do not necessarily represent the longitudinal velocity and pixels from these regions are excluded from the longitudinal motion estimation. The out-of-contact regions are identified by a simple thresholding step, where every pixel with intensity less than a threshold is classified as out-of-contact. To estimate ∆l[n], we used the median value of the deviation σ med [n] across the circumferential dimension after excluding all the σ values from out-of-contact regions, namely The choice of median deviation provides a good measure of image feature variation across the circumferential dimension as well as robustness to estimation errors of σ from imaging artifacts (e.g. variations in intensity from polarization changes, cross-sections with more than one type of tissue, etc.). More importantly, this method minimizes the estimation error due to naturally elongated tissue structures. These structures are expected to be local instead of global in the circumferential direction. To estimate ∆l[n], we hypothesize that there exists a increasing function f , such that ∆l[n] = f {σ med [n]}. When ∆l is small, there is less variation in the image features, hence, σ is small and vice versa. Furthermore, f (0) = 0 because when the pullback velocity is zero, i.e. ∆l = 0, the image feature remains unchanged. Differences in tissue types had a strong influence on the median deviation σ med . For example, squamous mucosa (which is devoid of mucosal surface patterns) was typically much smoother than gastric mucosa or regions of Barrett's esophagus (which have mucosal surface patterns). Hence, squamous mucosa often had smaller σ med than other tissues. To account for the difference in the tissue type, we chose the function f (.) to be of the following form where ⟨σ med [n]⟩ n denotes averaging of σ med [n] over a window larger than the scale of the longitudinal distortion. In our implementation, the window size was chosen to be 100 samples. From the definition in Eq. (7)      and 90% percentiles is computed and compared to a predetermined threshold (selected to be 0.03). If the difference is less than the threshold, the iteration is terminated. Otherwise, we continue iteration until the stopping condition is satisfied or the maximum number of iterations (set to 300) is reached. Convergence is usually achieved within the first 100 iterations. Our source code for image distortion correction is at: https://gitlab.com/huutan86/distortion_correction.git.

Motion distortion correction of the entire OCT volume
To correct the entire OCT volume, we first sliced it into multiple en face images (without depth projection). For each en face image, we correct the circumferential distortion, followed by longitudinal distortion. The correction for circumferential motion distortion was performed iteratively, following the procedure in Section 2.4. However, here we use the estimated displacements from Section 2.4.1 for resampling and do not re-estimate the circumferential motion for each en face image. The correction for longitudinal motion distortion is similar to that described in Section 2.5, except that we use the estimated sampling interval in Section 2.5.1 for resampling and do not re-estimate the sampling interval.

Demonstration of distortion correction on synthesized test images
It is difficult to evaluate the algorithm performance for endoscopic OCT because ground truth data is not available. Therefore, we tested performance by using an input volume with minimal motion distortion (Fig. 6(A)) and synthesized two motion distorted test volumes to mimic distortion commonly observed in clinical data. The circumferential distortion was generated using the formula  To simulate the longitudinal motion distortion, we used the equation where n i (i = 1, .., 5) were five random positions along the pullback. These locations were randomly sampled, such that ∆l was positive in both cases. This mimics a situation where the longitudinal pullback of the optical assembly had sticking 5 times. When the pullback speed was reduced by sticking, the longitudinal sampling slows down from the maximum value of 1.0 to 0.2(= 1.0 − 0.8). Using this ∆l[n], we resampled the circumferentially distorted image generated in the previous step to add longitudinal distortion. The distorted test images are shown in Fig. 6(B). Then, the motion correction was applied to these test images to remove the distortion. Corrected images are shown in Fig. 6(C). Note that the two motion corrected images are very similar to the input image, even though the distorted test images are synthesized using very different motion profiles. This demonstrates the convergence and stability of the algorithm. To quantify the similarity between correction outputs, we extracted corresponding test patches from the input image and the two motion corrected test images as shown in Fig. 6(D). The patches were compared using structure similarity index measures (SSIMs) [36], which is a number from 0 to 1 telling the similarity between the patches using the luminance, contrast, and structures. A higher value of SSIM corresponds to greater similarity. Note that the SSIM between the two motion corrected patches is high (0.75), suggesting the convergence of the motion corrected output to the same result, irrespective of the motion profile that was applied to the input image. This value is larger than the SSIM between each motion corrected patch and the input image, suggesting that there may have been a small amount of motion in the input image that was corrected by the algorithm.

Motion distortion correction of OCT images of human esophagus
Next, we validated the motion distortion correction algorithm on a selection of en face images from patients with different pathologies. Figure 7(A) shows an en face OCT image (at ∼60 µm below the mucosal surface) from a NDBE patient under surveillance for dysplasia with in a region affected by motion distortion. The motion-corrected image is shown in Fig. 7(B). Figure 7(C) shows selected 1.5x zoomed regions of the NDBE mucosa (red) and squamous mucosa (blue). The motion distortion of the columnar epithelium mucosal patterns (top-left inset) was corrected (top-right inset). The bottom left image has a region of homogeneous squamous epithelium with darker areas caused by insufficient or out-of-contact tissue. There is a high degree of waviness in the boundaries of these areas (bottom-left inset) caused by rapid circumferential motion distortion. This distortion was almost completely corrected (bottom-right inset), facilitating a more accurate evaluation of mucosal features. We then assessed the algorithm performance on images from patients with dysplastic BE. Figure 8(A) shows an en face image (at ∼40 µm below the mucosal surface) with low-grade dysplasia. Blue arrows indicate hypo-scattering atypical glands, characteristic OCT features of dysplasia [34]. The mucosal surface pattern in the dashed red region shows distorted mucosal surface patterns associated with dysplastic BE. Both circumferential and longitudinal motion distortion is visible in this region. Longitudinal motion distortion causes the mucosal features to appear stretched (∼1/3 from the top) and compressed (about half of the region, right of hypo-scattering structure). Figure 8(B) shows the motion corrected image. Figure 8(C) shows regions of interest enlarged by 1.5x. After motion correction, the mucosal surface pattern is more homogeneous with no noticeable compression or stretching. Also, glandular structures appear less distorted after motion correction. Before motion correction, the gland boundary has a discontinuity with sharp edges, which was likely caused by rapid circumferential motion distortion and does not represent the true shape. Distortion and irregularity in mucosal surface patterns and glandular structures are possible OCT markers of dysplasia, therefore motion-corrected images can reduce the risk of interpretation errors. Figure 9 shows several correction examples different regions extracted from the clinical images. These regions represent different features observed in our dataset including: small pit pattern of columnar epithelium (Fig. 9(A)), small Barrett's esophagus (BE) island ( Fig. 9(B)), large pit pattern with large ( Fig. 9(C)) and small ( Fig. 9(D)) amounts of motion, squamous epithelium ( Fig. 9(E)), and the SCJ function ( Fig. 9(F)). Consistent reduction in the motion distortion is observed for all cases.

Quantification of the performance for distortion correction using clinical data
To benchmark the performance of the algorithm across all datasets, we use two different metrics to evaluate the amount of residual circumferential and longitudinal motion distortion, respectively. The first metric is the average magnitude of the circumferential median med m ∆c[m, n] along the longitudinal dimension.
The circumferential median med m ∆c[m, n] was used to quantify the amount of circumferential distortion for each longitudinal index n. The magnitude was used because this quantity can be negative or positive. A small value of d ∆c corresponds to a lower amount of circumferential distortion and vice versa. The second metric is the standard deviation of ∆l[n] along the longitudinal dimension.
A small value of d ∆l corresponds to a more uniform sampling in the longitudinal dimension i.e. lesser amount of residual longitudinal motion and vice versa. Figure 10 shows two scatter plots for d ∆c and d ∆l from 54 datasets obtained from all 44 patients. One data point corresponds to 1 dataset with the color indicating the diagnosis. A dashed black line corresponding to the We notice that the amount of circumferential distortion varied appreciably among these datasets, but there were no large differences in motion corrected results between subgroups. While some datasets had a very small amount of circumferential distortion d ∆c ≈ 0.05 pixel, others had a much larger amount of d ∆c ≈ 0.2 pixel. Circumferential motion distortion was reduced in most of the datasets, except for one (arrowhead). Figures 11(A)-11(B) shows the en face images of this dataset before and after motion distortion correction. The 1.5x zoomed regions ( Fig. 11(C)) of these en face images show very complicated features with a longitudinal compression coupled with possible circumferential displacement. The algorithm attempted to correct distortion by stretching the image in the longitudinal dimension and correcting for circumferential displacement. The corrected image did not appear to improve, likely due to improper sampling in the raw en face image. In contrast to circumferential motion distortion, a reduction in longitudinal motion distortion was observed in all the datasets, with a saturated performance at d ∆l ≤ 0.02 pixel for all cases. Table 2 shows processing time for different steps of the algorithm to process an OCT volume of 3,200 B-scans with 1,500 A-scans per B-scan and 240 points per A-scan. Pre-processing to time required to process raw OCT fringes to calculate the B-scan OCT and correct for NURD is not included.

Processing time
The algorithm requires ∼13.5 minutes to process a volumetric OCT data (1,500 × 3,200 = 4.8 million A-scans). In many cases, the total processing time is less than 13.5 minutes because the stopping condition is reached before the maximum number of iterations. A large portion   11. A-B) En face images from a HGD patient where our method failed to reduce the motion distortion. C) 1.5x zoomed images of a region specified in A) and B) show improper sampling due to fast jumping in the pullback and a potentially coupled circumferential distortion. (∼50%) of the total time is required to estimate ∆c from the input en face image I. The code is in prototype form and for easy of development. Therefore, this step is implemented on CPUs. The estimation is sequentially performed from the most distal to most proximal position of the longitudinal pullback. However, since there is no dependency between estimation of ∆c from I at different longitudinal index n, this estimation step can be performed in parallel processing using Graphic Processing Units (GPUs). Similarly, estimation of the longitudinal sampling interval ∆l[n] can be performed in parallel for different longitudinal index n. Volumetric correction with different images at different en face depths, currently requires ∼12% of the total time and could be performed in parallel processing. The I/O time to write processed images cannot be easily speeded up by parallel processing. However, this step is only required for off-line data investigation and storage.

Discussion
Although the algorithm uses several parameters, many are determined by the instrument and scanning protocol, so the number of arbitrary parameters is not excessive. Algorithm performance depends on a small set of parameters (δ, L features , L tissue−normalizing ) related to the size of the image feature being evaluated. Other parameters such as ∆l, the stopping condition, and the maximum number of iterations only affect the convergence speed rather than the final output. Therefore, the algorithm can be applied to other datasets with a different sampling pitch or different features by appropriately scaling these feature size parameters, while retaining the default values for other parameters. The last column of Table 1 summarizes the effects of parameters on algorithm performance.
Our algorithm utilizes a single en face image to estimate motion and differs from other correction methods that only use cross-sectional images [23,25]. There are two types of motion correction methods using cross-sectional images. The first type estimates the motion using fiducials on the catheter [22]. This approach corrects NURD due to scanning non-uniformity but does not account for lateral tissue motion. While the motion corrected data has minimal rotational distortion, it may still suffer from distortion due to tissue motion. The second type of approach estimates and corrects motion using cross-sectional images of tissue instead of fiducials [23,25]. This approach accounts for tissue motion in the circumferential direction. However, it does not account for longitudinal motion distortion which causes variation in the longitudinal sampling interval. Our method adds an extra layer of correction on top of NURD correction to enforce consistency in both the circumferential and longitudinal dimensions, which is critical for reliable image interpretation. It was also possible to perform motion distortion correction starting from the en face image generated directly from non-NURD-corrected OCT data. However, we found that using the NURD-corrected data was superior because the estimated circumferential displacement was smaller than if the non-NURD-corrected data was used. This facilitates a smoother convergence of the optical flow algorithm, because the Taylor expansion in the optical flow algorithm performs better for smaller displacement values.
One disadvantage of our algorithm is that it relies on en face image features for motion estimation. We selected en face images with the highest dynamic range, but they typically only contain a subset of features from the OCT volume. Also, the depth selection of the en face image may be optimal for some regions and not others. Images from homogenous regions, such as those with squamous epithelium, will have less contrast than other regions (e.g. Barrett's epithelium), making accurate estimation challenging. While we accounted for this phenomenon using the normalization scheme in Eq. (7), we envision that estimation accuracy can be improved if an optimal depth is selected at each longitudinal (pullback) location instead of using a single selection for all locations. Future increases in A-scan rates will increase the sampling density and the visibility of features used for motion estimation, improving algorithm performance.
Improving tissue coverage will likely increase the robustness and the accuracy of the motion estimation. In our study, we used an endoscope-based catheter of 3.4 mm diameter, which was much smaller than the esophageal lumen. The en face image used for motion estimation only covered a portion of the esophagus, while other regions were out-of-contact. Therefore, the estimated motion was only local to the tissue area imaged by the probe. Improving the tissue coverage will allow a larger portion of the esophagus to be imaged, increasing robustness of the motion estimation. We emphasize that our algorithm should be compatible with modalities which have larger tissue coverage such as OCT capsules [37][38][39] or OCT balloons [40,41]. However, caution is required to ensure that the en face image has sufficient sampling density to visualize features for motion estimation.
It is important to note that the dominant motion distortion that requires correction is in the transverse direction. Axial motion where tissue is in contact with the probe is relatively small. Images from out of contact regions have poor quality and are usually not interpretable. The OCT probe sheath acts analogously to a microscope slide cover glass, reducing tissue surface roughness which produces aberrations, enabling subsurface imaging.
The problem of motion correction is similar to the wobbling correction problem in computer vision caused by rolling shutters [28,[42][43][44], although the image distortion here was more severe. In endoscopic OCT, sequential cross-sectional B-scans were obtained at different times, causing circumferential motion distortion. In the wobbling correction problem, different rows of the camera sensor are exposed at different times, causing skew and curvature distortion. The similarity between these two problems allows adaptation of wobbling correction methods to correct circumferential distortion. Recently, Rengarajan et al. [28] suggested using convolutional neural networks to estimate the row-wise camera motion from image features contained in a single rolling shutter image. Although simple and powerful, this method required a large training dataset that contains hundreds of thousands of synthetic images generated from undistorted images. This requirement is challenging in clinical OCT applications such as endoscopy where the amount of data is limited patient enrollment and procedural complexity. Therefore, a method that requires less training data is preferred.
There are also important differences between the catheter-based OCT problem and the wobbling correction problem. First, the B-scan frame rate for catheter-based OCT is relatively slow (a few hundred Hertz in endoscopic OCT) compared to the row reading rate in commercial cameras (∼65 kHz for a 60 fps 1080p camera). Therefore, circumferential motion distortion in catheter-based OCT is more severe than wobbling from the rolling shutter. Furthermore, the catheter-based OCT problem is confounded by the coupling of longitudinal motion distortion and non-uniform longitudinal sampling.
Our motion distortion correction algorithm can be applied to other catheter/endoscope-based OCT methods such as intravascular OCT, unsedated capsule imaging for Barrett's esophagus screening [39,[45][46][47], or balloon-based OCT esophageal imaging for large circumferential coverage [41,48]. It could also be applied to correct for motion artifacts in lower GI imaging for anal and rectal imaging of mucosal crypt architecture [49].
Machine learning algorithms typically rely on large amounts of data with labels obtained from multiple sources to reduce inter-observer variation. Training these algorithms to account for motion distortion can be costly, because it requires a large amount of clinical OCT images. Furthermore different readers can classify the same features as regular or irregular because of varying tolerance to motion distortion and previous training experience. Motion distortion correction can be used to normalize the data, hence, reduce the amount of data needed to train machine learning algorithms. This is because mucosal features will be less distorted after correction compared to original data. For human readers, evaluation of features such as irregular mucosal surface patterns or atypical glands which are potential markers of BE with dysplasia should be less prone to inter-observer variation [34].

Conclusions
This paper presents a new method to estimate and correct motion distortion in catheter/endoscopebased OCT. The algorithm begins with OCT pre-processing to generate the OCT volume from raw fringes and correct for NURD using fiducial markers on the probe housing. Then, we select an en face image with high dynamic range and representative features to estimate the motion distortion. This en face image is analyzed using a modified optical flow algorithm to estimate the relative circumferential motion between the tissue and catheter, then resampled to correct circumferential motion distortion. The corrected image is then used to estimate the longitudinal motion distortion from the standard deviation of the image features in a sliding window along the longitudinal direction. After estimating both circumferential and longitudinal motion, all en face images of the NURD corrected volume are resampled to correct the entire volume. We demonstrate the performance of the algorithm on synthetic images and clinical OCT images of the human esophagus from patients with Barrett's esophagus undergoing surveillance for dysplasia. The algorithm can be used in multiple OCT applications which may have circumferential and longitudinal motion distortion.