Cross-attention learning enables real-time nonuniform rotational distortion correction in OCT

Nonuniform rotational distortion (NURD) correction is vital for endoscopic optical coherence tomography (OCT) imaging and its functional extensions, such as angiography and elastography. Current NURD correction methods require time-consuming feature tracking or cross-correlation calculations and thus sacrifice temporal resolution. Here we propose a cross-attention learning method for the NURD correction in OCT. Our method is inspired by the recent success of the self-attention mechanism in natural language processing and computer vision. By leveraging its ability to model long-range dependencies, we can directly obtain the correlation between OCT A-lines at any distance, thus accelerating the NURD correction. We develop an end-to-end stacked cross-attention network and design three types of optimization constraints. We compare our method with two traditional feature-based methods and a CNN-based method, on two publicly-available endoscopic OCT datasets and a private dataset collected on our home-built endoscopic OCT system. Our method achieved a $\sim3\times$ speedup to real time ($26\pm 3$ fps), and superior correction performance.


Introduction
Optical coherence tomography (OCT) [1] uses temporal coherence gating to resolve depth information at the micrometer scale. It enables non-invasive tomographic imaging of biological tissues with near-cellular spatial resolution and high sensitivity [2]. OCT has become a routine diagnostic instrument in ophthalmology [3]. Through fiber-optic endoscopic probes, its application is expanding to other medical fields for in situ label-free biopsy, including cardiovascular, respiratory, gastrointestinal, and cervical sites [4][5][6].
For such applications, an endoscopic probe with point-by-point scanning capability is usually required. Typically, the scanning is controlled externally and implemented mechanically to achieve axial movement and circumferential rotation of the probe (referred to as proximal scanning). In recent years, with the development of technologies such as MEMS and piezoelectric devices, point-by-point scanning can also be achieved by shifting the beam at the output end of the probe (referred to as distal scanning). However, distal scanning is currently rarely used clinically because its probes are significantly more expensive and larger than proximal scanning probes [7].
Due to the irregular shape of vessels and other lumen structures, friction, and torque transmission losses, the rotation of a proximal scanning probe becomes non-uniform, resulting in a distortion of the intracanal OCT images known as non-uniform rotational distortion (NURD) [8]. NURD can introduce errors in the morphological representation of tissues and make it difficult to perform functional imaging of tissue, such as elasticity, birefringence, angiography, and treatment processes [9][10][11]. Effective NURD correction is needed to deal with such problems. For many application scenarios of OCT that require real-time operation or fast evaluation, such as surgical robot navigation, online monitoring of treatment, and in situ diagnosis, the time cost of the NURD correction must also be considered.
Existing methods for the NURD correction are primarily based on feature tracking/registration and dynamic programming [11][12][13][14]. William et al. used the speeded-up robust feature (SURF) operator to extract feature points in OCT B-frames and then tracked them across adjacent frames for A-line alignment [11]. Cao et al. proposed an improved feature extraction algorithm and used it in a coarse and fine registration process [12]. These methods rely on extracting a large number of feature points to improve the correction accuracy. Therefore, there is a trade-off between the time cost of feature extraction and accuracy. Soest et al. utilized the dynamic programming method to find a continuous path through a spatial cross-correlation matrix that measures the region similarity between adjacent frames [13]. However, the construction of the cross-correlation matrix is time-consuming. Qi et al. used a graph-based dynamic programming algorithm to find an optimal path that represents the initial rotation angle error drifting along the pull-back direction [14], which significantly speeds up the processing but neglects the A-line level distortion.
Other methods utilized hardware and prior knowledge specific to the endoscopic probe or imaging target [9,15], thus lacking generality. Abouei et al. presented a motion artifact correction method based on azimuthal en face image registration [16]. However, this method needs to collect the complete image sequence first and thus cannot correct the distortion in real time. Uribe-Patarroyo and Bouma developed a method based on speckle decorrelation [17], which can perform NURD correction in real time, but the decorrelation is vulnerable to environmental variations, such as motion and temperature [18].
Recently, Liao et al. proposed a convolutional neural network (CNN)-based learning method for the NURD correction [19]. They developed a new A-line level shifting error vector estimation network to extract the optimal path from a spatial correlation matrix. Another CNN branch was introduced to suppress the accumulative error. Their method outperformed previous ones in correction performance and achieved a processing rate of around 7 fps (frames per second). However, CNNs have limitations in modeling long-range dependencies due to the constraints of local receptive fields and fixed convolutional kernel sizes. A spatial correlation matrix must therefore be pre-built as the network input, which limits how far the processing efficiency can be scaled up.
In this work, we propose a cross-attention learning method to address the limitations of the existing NURD correction methods above. Our method is inspired by the recent success of the self-attention mechanism [20] in natural language processing (NLP) and computer vision (CV), which has played a crucial role in the development of cutting-edge tools like ChatGPT [21]. Our key finding here is that the self-attention mechanism enables the direct establishment of global spatial correlations within OCT A-line sequences, without the necessity of correlation calculation in advance. Because the self-attention mechanism is used between different A-lines, we refer to it here as cross-attention. To achieve a high correction efficiency, we develop an end-to-end stacked cross-attention network and design three types of optimization constraints.

Overall framework
Figure 1 illustrates the overall framework of our proposed method; (a) and (b) show its training and inference phases, respectively. We use a self-supervised generative learning approach for training, i.e., we distort the original B-scans and then use the network to predict their distortions. Specifically, in Fig. 2, we use a distortion vector, which serves as the ground truth (GT) of the A-line shifts due to the NURD, to perform the transform $T_{\rightarrow}$ from the original frame to the distorted frame (the generation of the GT distortion vectors follows the method described in Section 3.2.1 of [19]). These two frames are then fed into the stacked cross-attention network, which is employed to correct the NURD. It predicts two distortion vectors: the first one (distortion vector 1) is the distortion applied to the original frame to form the distorted frame; the second one (distortion vector 2) is the distortion applied to the distorted frame to form the original frame. This bi-directional design is inspired by the notion of cycle consistency in generative learning [22]. Using these predicted vectors, we can apply the transforms $T'_{\rightarrow}$ and $T'_{\leftarrow}$ to the original and distorted frames to form the new distorted and new original frames, respectively.

We use three types of optimization constraints in the training: (1) a mean absolute error loss (L1 loss) $L_1$ between distortion vector 1 and the GT vector, (2) a smoothness loss $L_{s}$ of the predicted distortion vectors, and (3) a similarity loss $L_{sim}$ between the original/distorted frames and the new original/distorted frames at the A-line level. The L1 loss is

$$L_1 = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{d}_i - d_i\right|,$$

where $\hat{d}_i$ and $d_i$ are the elements of the predicted distortion vector and the ground truth, respectively, and $N$ is the length of the vector (also the number of A-lines in each frame). The smoothness loss is computed over the elements $\hat{d}_i$ of each predicted distortion vector. The similarity loss is computed over $\hat{p}_{i,j}$ and $p_{i,j}$, the pixel values of data point $j$ in A-line $i$ from the predicted new frame and the corresponding input image, respectively, where $M$ is the number of data points in each A-line. The smoothness loss and the similarity loss are both applied to the prediction of the two distortion vectors, whereas the L1 loss is applied only to the prediction of distortion vector 1, because only the GT vector that distorts the original frame is known. The final loss of the network combines these three terms.

In the inference phase, two successively acquired OCT B-scans (the raw $(t-1)$-th and $t$-th frames, where $t$ refers to time points) are fed into the trained stacked cross-attention network. The network outputs only distortion vector 1, which is used to correct the NURD of the newest $t$-th frame. We generate the cumulative distortion vector from the raw $t$-th frame to the initial first frame using the method described in [19]. Specifically, the $t$-th frame is composed of $N$ A-lines $A_i^t$ ($i \in [1, N]$). Due to the NURD between adjacent frames, $A_i^t$ deviates from its correct position, which is supposed to be aligned with $A_j^{t-1}$. The position error $d_i^t = j - i$ of A-line $A_i^t$ constitutes one element of the distortion vector $D_{t\sim t-1} = [d_1^t, \ldots, d_i^t, \ldots, d_N^t]$ (its elements take integer values only). Given the predicted A-line level distortion vector $\hat{D}_{t\sim t-1}$ between the $t$-th and $(t-1)$-th frames and the cumulative vector $D_{t-1\sim 1}$ between the $(t-1)$-th and the initial first frames, the latest distortion vector $D_{t\sim 1}$ can be obtained by the cumulative transform operation $\Psi$:

$$D_{t\sim 1} = \Psi\left(\hat{D}_{t\sim t-1},\, D_{t-1\sim 1}\right),$$

with which we can cumulatively transform the $t$-th frame to the initial first frame at the A-line level and finally generate the NURD-corrected $t$-th frame.

Figure 2 illustrates the stacked cross-attention network, and its lower dashed box details the multi-head cross-attention module. Instead of the 2D operations employed in CNNs, we use each A-line (1D) of the OCT B-scans as a token. The tokens are then used to calculate the query (Q), key (K), and value (V) vectors in the self-attention mechanism [20].
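The training constraints and the cumulative transform can be illustrated with a minimal, dependency-free Python sketch. Only the L1 loss follows directly from its definition as a mean absolute error; the forms chosen here for the smoothness loss (neighbouring-shift differences), the similarity loss (mean absolute pixel difference), the loss weights `w_smooth` and `w_sim`, and the composition rule used for $\Psi$ are assumptions made for illustration and may differ from the actual implementation.

```python
def l1_loss(pred, gt):
    """Mean absolute error between predicted and ground-truth shifts."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def smoothness_loss(pred):
    """Assumed form: mean absolute difference between neighbouring shifts."""
    return sum(abs(b - a) for a, b in zip(pred, pred[1:])) / (len(pred) - 1)


def similarity_loss(new_frame, ref_frame):
    """Assumed form: mean absolute pixel difference over N A-lines x M points."""
    n, m = len(ref_frame), len(ref_frame[0])
    total = sum(abs(p - q)
                for line_new, line_ref in zip(new_frame, ref_frame)
                for p, q in zip(line_new, line_ref))
    return total / (n * m)


def total_loss(pred1, pred2, gt1, new_distorted, distorted,
               new_original, original, w_smooth=1.0, w_sim=1.0):
    """L1 on vector 1 only; smoothness and similarity on both directions.

    The weights w_smooth and w_sim are hypothetical.
    """
    loss = l1_loss(pred1, gt1)
    loss += w_smooth * (smoothness_loss(pred1) + smoothness_loss(pred2))
    loss += w_sim * (similarity_loss(new_distorted, distorted)
                     + similarity_loss(new_original, original))
    return loss


def compose(d_step, d_cum):
    """Assumed Psi: chain the frame-to-frame shifts with the cumulative ones.

    A-line i of frame t maps to j = i + d_step[i] in frame t-1 (wrapping
    around the circumference), and line j maps on to frame 1 by d_cum[j].
    """
    n = len(d_step)
    return [d_step[i] + d_cum[(i + d_step[i]) % n] for i in range(n)]
```

For example, `compose([1, 1, 1, 1], [0, 2, 0, 0])` chains a uniform one-line shift with a cumulative vector whose only nonzero entry sits at line 1, so only A-line 0 (which lands on line 1) inherits the extra shift.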

Stacked cross-attention network
The Q, K, and V matrices are obtained from the input tokens by linear projection:

$$Q = XW_Q + b_Q,\quad K = XW_K + b_K,\quad V = XW_V + b_V,$$

where $X \in \mathbb{R}^{N\times M}$ denotes the input tokens, with $N$ A-lines and $M$ data points in each A-line, $Q, K, V \in \mathbb{R}^{N\times D}$ ($D$ is the embedding dimension, and $D > M$), $W_Q$, $W_K$, $W_V$ are weight matrices, and $b_Q$, $b_K$, $b_V$ are bias terms. This linear projection step allows the model to capture different aspects of the input sequence. The query vectors Q represent the current token and are responsible for computing attention weights. The key vectors K capture the contextual information of each element, enabling the model to assess the relevance between different elements. The value vectors V carry the actual content information associated with each token. These vectors are then fed into 5 consecutive multi-head cross-attention blocks (×5). Each block includes a multi-head cross-attention module and a multi-layer perceptron (MLP) module [20]. In each block, we apply layer normalization (Norm) before each module and use residual connections. Finally, we perform averaging and linear operations to obtain the distortion vectors.
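The projection of A-line tokens into Q, K, and V, followed by the scaled dot-product attention of [20], can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: the sizes are tiny (the paper uses 1024 A-lines, 512 points, and an embedding dimension of 1024), the weights are random rather than trained, and a single token matrix `X` is used because the way the two input frames are combined is not specified here. The helper names are hypothetical.

```python
import math
import random


def linear(x, w, b):
    """Affine projection X W + b for X (N x M), W (M x D) and bias b (D)."""
    cols = list(zip(*w))
    return [[sum(xi * wi for xi, wi in zip(row, col)) + bj
             for col, bj in zip(cols, b)] for row in x]


def softmax(values):
    """Numerically stable softmax of a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(D)) V."""
    scale = math.sqrt(len(q[0]))
    weights = [softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
               for qi in q]
    out = [[sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
           for row in weights]
    return out, weights


def rand_matrix(rows, cols):
    """Random stand-in for trained weights (illustration only)."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
N, M, D = 4, 3, 5                      # toy sizes with D > M
X = rand_matrix(N, M)                  # each row is one A-line token
Q = linear(X, rand_matrix(M, D), [0.0] * D)
K = linear(X, rand_matrix(M, D), [0.0] * D)
V = linear(X, rand_matrix(M, D), [0.0] * D)
out, weights = attention(Q, K, V)      # weights[i][j]: A-line i attends to j
```

Each row of `weights` relates one A-line to every other A-line regardless of their distance, which is the property that removes the need for a precomputed correlation matrix.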
The multi-head cross-attention module in the lower dashed box of Fig. 2 allows the model to attend to different parts of the input sequence and capture diverse dependencies, enhancing its representation and predictive capabilities. Given a sequence of input embeddings $X = [x_1, x_2, \ldots, x_N]$, the output is computed as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\,W^O,$$

where $\mathrm{head}_h = \mathrm{Attention}(QW_h^Q, KW_h^K, VW_h^V)$ represents the attention mechanism applied to the projected queries $QW_h^Q$, keys $KW_h^K$, and values $VW_h^V$ of the $h$-th attention head, and $H$ is the number of heads. Here, $W_h^Q$, $W_h^K$, and $W_h^V$ are learnable linear projection matrices specific to each attention head. The concatenated outputs are then linearly transformed by the matrix $W^O$ to produce the final output.
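A compact plain-Python sketch of the multi-head computation is given below: each head projects Q, K, and V with its own matrices, the head outputs are concatenated per token, and the result is multiplied by the output matrix. The function names and the list-of-lists matrix representation are illustrative choices, not the authors' code, and layer normalization, the MLP, and residual connections are omitted.

```python
import math


def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(values):
    """Numerically stable softmax of a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(q, k, v, heads, w_o):
    """Concat(head_1, ..., head_H) W_O.

    heads is a list of (W_q, W_k, W_v) projection matrices, one per head.
    """
    outs = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
            for wq, wk, wv in heads]
    concat = [[x for head_out in outs for x in head_out[i]]
              for i in range(len(q))]
    return matmul(concat, w_o)


# Toy check: two heads that each select half of a 4-dimensional embedding.
first_half = [[1, 0], [0, 1], [0, 0], [0, 0]]
second_half = [[0, 0], [0, 0], [1, 0], [0, 1]]
heads = [(first_half,) * 3, (second_half,) * 3]
identity = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
values = [[1, 2, 3, 4], [3, 4, 5, 6]]
queries = [[0, 0, 0, 0], [0, 0, 0, 0]]   # zero queries give uniform attention
result = multi_head(queries, values, values, heads, identity)
```

With zero queries every head attends uniformly, so each output row is the average of the two value rows; this makes the concatenation order easy to verify by hand.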

Datasets and implementations
We collect a total of 7,731 endoscopic OCT B-scans from publicly-available datasets [10,15,17,23-27] to train our model. As mentioned above, we use these data to generate the GT distortion vectors using the method in [19]. By applying these vectors to the B-scans, we create 20,000 original-distorted image pairs for the training. Because most of them are from clinical acquisition, the temporal and spatial characteristics of the distortion vectors are consistent with real application scenarios. We then use two synthetic endoscopic datasets and two real publicly-available endoscopic datasets [28,29] to evaluate our trained model. Note that we train our model in one go and evaluate it on external test datasets. Compared to the commonly used division of the same dataset into training and test sets, this protocol better demonstrates the accuracy and robustness of our approach and the generalization ability of the model.
The synthetic endoscopic OCT sequences are employed because we cannot obtain the GT of NURD from real endoscopic OCT data. We follow the method described in [19] to generate the synthetic sequences. First, a motion (NURD)-free OCT sequence is created by repeating an OCT B-frame 500 times. Then we apply 499 random distortion vectors to all frames except for the first one. We employ a pig bronchus OCT B-scan [25] and a human nasopharynx OCT B-scan [15] (as shown in Fig. 5 below) to generate the two synthetic sequences for testing. The two real OCT sequences for testing include a gastrointestinal tract sequence (648 images) [28] and a sponge surface sequence (240 images) [29].
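The construction of a synthetic sequence (one B-frame repeated, with a random distortion vector applied to every frame except the first) might be sketched as follows. The bounded random-walk generator `random_distortion` is a hypothetical stand-in for the distortion model of [19], and resampling A-lines with circular wrap-around is likewise an assumption.

```python
import random


def random_distortion(n_lines, rng, max_shift=3):
    """Hypothetical generator: a bounded integer random walk of A-line shifts."""
    shifts, current = [], 0
    for _ in range(n_lines):
        current += rng.choice((-1, 0, 1))
        current = max(-max_shift, min(max_shift, current))
        shifts.append(current)
    return shifts


def apply_distortion(frame, shifts):
    """Output A-line i is taken from input A-line (i + shifts[i]) mod N."""
    n = len(frame)
    return [frame[(i + shifts[i]) % n] for i in range(n)]


def synthetic_sequence(frame, n_frames=500, seed=0):
    """Repeat one frame n_frames times and distort all but the first copy.

    Returns the frames and the n_frames - 1 ground-truth distortion vectors.
    """
    rng = random.Random(seed)
    frames, vectors = [frame], []
    for _ in range(n_frames - 1):
        shifts = random_distortion(len(frame), rng)
        vectors.append(shifts)
        frames.append(apply_distortion(frame, shifts))
    return frames, vectors


# Toy frame with 8 A-lines of one data point each.
toy_frame = [[i] for i in range(8)]
frames, vectors = synthetic_sequence(toy_frame, n_frames=5)
```

Because the applied vectors are returned alongside the frames, they can serve directly as the GT when computing the MAE of a correction method.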
In addition, we evaluate the NURD correction performance using our home-built endoscopic SD-OCT system. Our system has a central wavelength of ∼840 nm and a bandwidth of ∼50 nm, which corresponds to an axial resolution of ∼5 µm. Its A-line rate is 80 kHz. A homemade capillary tube-based fiber optic rotary joint [30] driven by a commercial motor (34 rps rotation speed) is applied to perform circumferential scanning. As shown in Fig. 3(a), an assembled proximal scanning micro-probe with a length of 1.2 m offers a lateral resolution of 25 µm and a working distance of 2 mm. The micro-probe with a transparent glass tube is 0.37 mm in diameter, as shown in the enlarged view of Fig. 3(b). In the experiment, a 30 mm long intravascular stent with a 4 mm diameter was used for imaging, as shown in Fig. 3(c).

We implement the code of our proposed method using PyTorch. Our model is trained on a personal computer with an Nvidia 3090 GPU (24 GB onboard memory). We convert an endoscopic OCT B-scan into an input format where each frame consists of 1024 A-lines, and each A-line contains 512 data points. We employ the multi-head cross-attention with an embedding dimension of 1024 and 4 heads. We use the stochastic gradient descent (SGD) [31] optimizer with a learning rate of $5\times10^{-4}$. We set a batch size of 24 and train our model for 200 epochs. The training time is about 33 hours. It should be noted that the model is trained once and for all.
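For reference, the hyperparameters stated above can be gathered into a small configuration object. The class and field names are hypothetical; the values come from the text, with the learning rate read as 5e-4. The derived quantities assume the usual convention that the embedding is split evenly across heads and that a final partial batch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the text (field names are illustrative)."""
    n_alines: int = 1024          # A-lines (tokens) per frame
    n_points: int = 512           # data points per A-line
    embed_dim: int = 1024         # embedding dimension D
    n_heads: int = 4
    n_blocks: int = 5             # stacked cross-attention blocks
    optimizer: str = "SGD"
    learning_rate: float = 5e-4
    batch_size: int = 24
    epochs: int = 200

    @property
    def head_dim(self) -> int:
        """Per-head width, assuming an even split of the embedding."""
        if self.embed_dim % self.n_heads:
            raise ValueError("embed_dim must be divisible by n_heads")
        return self.embed_dim // self.n_heads

    def steps_per_epoch(self, n_pairs: int) -> int:
        """Optimizer steps per epoch, assuming the last partial batch is kept."""
        return math.ceil(n_pairs / self.batch_size)


cfg = TrainConfig()
```

With the 20,000 training pairs mentioned earlier, `cfg.steps_per_epoch(20000)` gives 834 steps per epoch under these assumptions.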
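We use the mean absolute error (MAE) to quantitatively evaluate the correction performance on the synthetic sequences:

$$\mathrm{MAE}_n = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{d}_i^{\,n} - d_i^{\,n}\right|,$$

where $\hat{d}_i^{\,n}$ and $d_i^{\,n}$ are the predicted and the GT shifts of the $i$-th A-line of the distortion vectors within the $n$-th frame, respectively, and $N$ is the number of A-lines per frame. For the real publicly-available sequences, because the GT of the distortion vector is unknown, we use the mean standard deviation (mean-STD) to quantitatively evaluate the correction performance, which was commonly adopted in previous NURD correction works [13,19]:

$$\overline{\sigma}_n = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\sigma_5\left(p_{i,j}^{\,n}\right),$$

where $\sigma_5(p_{i,j}^{\,n})$ is the STD of pixel $p_{i,j}$ over the 5 adjacent frames centered on the $n$-th frame, and $M$ is the number of data points per A-line.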
Precise correction can reduce the mean-STD to nearly 0, but it will never be exactly 0 due to variations in scanning locations and speckle/decorrelation noise.
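Both metrics are straightforward to compute; a minimal sketch using only the Python standard library is shown below. The use of the population standard deviation and the truncation of the 5-frame window at the ends of a sequence are assumptions, since neither detail is stated in the text.

```python
from statistics import pstdev


def frame_mae(pred, gt):
    """MAE between predicted and GT A-line shifts of one frame."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def mean_std(frames, n, window=5):
    """Mean over all pixels of the temporal STD in a window centred on frame n.

    frames[t][i][j] is data point j of A-line i in frame t. The window is
    truncated at the sequence boundaries (an assumption).
    """
    half = window // 2
    stack = frames[max(0, n - half):min(len(frames), n + half + 1)]
    n_lines, n_points = len(stack[0]), len(stack[0][0])
    total = sum(pstdev([frame[i][j] for frame in stack])
                for i in range(n_lines) for j in range(n_points))
    return total / (n_lines * n_points)
```

A perfectly static sequence yields a mean-STD of zero, whereas any residual A-line jitter between neighbouring frames raises the per-pixel temporal spread and therefore the metric.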

Accuracy assessment of NURD correction
Using the synthetic endoscopic OCT sequences that have the GT, we perform a quantitative comparison of our proposed method with three other representative approaches, including a feature tracking (FT) method [11], a dynamic programming (DP) method [13], and the CNN-based method in [19] (referred to as De-NURD). The results are shown in Table 1 and Fig. 4. Our method achieves the smallest MAE values compared with the other three NURD correction methods on both synthetic sequences. Specifically, as shown in Fig. 4 (a) and (b), our method corrects the NURD with high accuracy and superior correction stability across the frames in each sequence. Other methods, in contrast, lack either correction accuracy or stability.

Figure 5 demonstrates the results before and after the NURD correction using our method, on (a) the pig bronchus data and (b) the human nasopharynx data. The left column gives the original B-frames used to create the synthetic sequences. The middle column shows the axial maximum value projection of the synthetic sequences, which gives better views of the applied NURD. The right column gives the NURD-corrected synthetic sequences using our method. As shown in the figure, our method alleviates the shift and jitter caused by the NURD while the original structure information is maintained. In addition, the NURD is effectively corrected on both the synthetic porcine bronchial sequence (a), which has rich feature information, and the human nasopharyngeal sequence (b), which has less feature information, suggesting that our method has superior robustness. To verify that the NURD correction is performed on the features of biological tissues, we manually remove tissue-unrelated features (sheath, wire, etc.) in the data as shown in Fig. 5 (c) and (d). Under this condition, our method is still able to correct the NURD in both the human nasopharynx and the pig bronchus data.

Robustness assessment of NURD correction
The results of the two real publicly-available testing datasets are shown in Table 2 and Figure 6. Our proposed method achieves the smallest mean-STD values compared with the other three NURD correction methods [11,13,19]. Figure 6 (a) and (b) are the results of the gastrointestinal tract and the sponge surface data, respectively. The results of our method are plotted in green, demonstrating consistent minimum mean-STD values over the image sequences. Figure 7 shows the qualitative comparison of different NURD correction methods on the gastrointestinal tract volume data. (a) is the 3D view of a volumetric scan of the gastrointestinal tract. The red and blue boxes refer to the zoom-in areas in (b) and (c), respectively. In Fig. 7(b), to illustrate the NURD instability, we use RGB channels to encode three consecutive frames, with each frame mapped to an individual channel. Structures that do not overlap are rendered in color, whereas overlapping structures appear in greyscale. We can see that our method achieves the best spatial consistency. In Fig. 7(c), we use mean value projection to obtain local en face images. It can be seen that our method minimizes the distortion caused by the NURD.

On the sponge surface data shown in Fig. 8, the correction methods are able to reduce the precession angle (a precession angle of 0° represents the real state of the sponge). Our method outperforms the others and achieves the minimum precession angle of 5.7°. Our method also preserves the morphological features of the sponge, while the DP method causes structural stretch, as pointed out by the white arrows. The second row of Fig. 8 is the 3D rendering of the sponge, which further illustrates the performance advantages of our method. In addition, we give the first and last frames of the sequence in the last two rows.
To provide a comprehensive comparison of processing speed and correction accuracy, we combine the two and plot a histogram of the results of our proposed method against three other representative methods. The public gastrointestinal tract data is used in this evaluation. The results are shown in Fig. 9. Orange bars represent their mean-STD (smaller means better). Blue bars represent the processing speed (ms/frame) and the corresponding frame rate in fps. It can be observed that our method achieves the best correction performance (statistically significant) while also improving processing speed by about three times, reaching real-time performance.

Correction performance on 3D stent imaging
As a practical application, we conduct a pull-back endoscopic OCT scan of an intravascular stent and correct the distortion of the raw sequence to verify our correction performance for the inherent NURD of the endoscopic OCT system. In vascular interventional procedures, endoscopic OCT imaging is commonly used to produce high-resolution in vivo images of blood vessels and deployed stents, providing accurate measurements of luminal architecture and insights regarding stent apposition [32,33]. In this experiment, a 30 mm long intravascular stent with a 4 mm diameter was used for imaging, as shown in Fig. 3 (c). We wrapped the stent with printer paper to simulate a lumen. We pulled back the mini-probe at a speed of 1.5 mm/s and collected about 640 images over an axial length of ∼25 mm.
Figure 10 (a) and (b) show the 3D views of direct imaging and after correction, respectively. For an intuitive comparison, we unfold the 3D views into 2D en face maps by mean value projection, as shown in Fig. 10 (c) and (d). Due to friction and the speed of the motor, shift distortion and irregular stretching and shrinking of the stent structure occur in the original en face projection. After correction by our proposed method, the imaging appearance of the stent is closer to its real structure. In addition, we show a cross-section example (e) before and (g) after correction at the same frame location, with three consecutive frames mapped to three channels separately. The corresponding enlarged views are displayed in (f) and (h), respectively. From these views, it can be observed that the proposed method alleviates the artifacts caused by NURD.
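The three-channel encoding of consecutive frames used here and in Fig. 7(b) can be reproduced with a few lines of plain Python. The sketch below is illustrative only: the function names are hypothetical, frames are nested lists of intensities, and the fraction of non-grey pixels is offered as one possible summary of the misalignment rather than a metric used in this work.

```python
def rgb_composite(frame_r, frame_g, frame_b):
    """Map three consecutive frames to the R, G and B channels of one image."""
    return [[(r, g, b) for r, g, b in zip(row_r, row_g, row_b)]
            for row_r, row_g, row_b in zip(frame_r, frame_g, frame_b)]


def colour_fraction(image, tol=0):
    """Fraction of pixels whose channels differ by more than tol.

    Aligned structures give equal channels (grey); misaligned ones show colour.
    """
    pixels = [px for row in image for px in row]
    coloured = sum(1 for px in pixels if max(px) - min(px) > tol)
    return coloured / len(pixels)


aligned = [[1, 2], [3, 4]]
shifted = [[3, 4], [1, 2]]          # same frame with its two A-lines swapped
stable = rgb_composite(aligned, aligned, aligned)
jittery = rgb_composite(aligned, shifted, aligned)
```

Here `stable` is entirely grey, while every pixel of `jittery` carries colour because the middle frame is displaced by one A-line.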

Influence of training data
In the training of our NURD correction model, we use publicly-available endoscopic OCT datasets, which may influence the correction performance due to their intrinsic NURD. To address this issue, we also employ ophthalmic OCT data for training, which was acquired via raster scanning and is thus inherently NURD-free. We employ 11,206 retinal OCT B-scans from a publicly-available dataset [34] with the same distortion vectors extracted from endoscopic OCT data to generate 20,000 original-distorted training pairs.
The NURD correction results using these two types of training data are demonstrated in Fig. 11, Fig. 12, and Table 3. As shown in Fig. 11, the NURD on the synthetic human nasopharynx and pig bronchus data can be effectively corrected with either the endoscopic or the ophthalmic OCT data used for training. The corresponding quantitative results (Fig. 12) indicate that better performance is achieved when using the endoscopic OCT data for training. We further deploy the trained models on the real gastrointestinal tract data. Their quantitative results are listed in Table 3. Consistent with the results on the synthetic data, the model trained on endoscopic OCT data performs better. Due to the domain discrepancy between the endoscopic and ophthalmic OCT data, we can only suspect that the influence of the inherent NURD is negligible. From the perspective of model training, our method (as illustrated in Fig. 1) aims to align the deliberately distorted frame with the original one. Whether or not the original frame is inherently distorted, the model only has to predict the artificially created distortion vectors, so the influence of the inherent NURD should be minimal.

Ablation studies
To evaluate the effectiveness of the bi-directional prediction of the two distortion vectors between the original and distorted frames, which our method uses in the training phase, we perform ablation studies on the gastrointestinal tract test data.
The results, evaluated with the mean-STD metric, are shown in Table 4. For predicting distortion vector 1, which transforms the original frame into the distorted frame, using only the L1 loss already improves the correction performance significantly, with a reduction in mean-STD value of ∼19. When the smoothness loss and similarity loss are added, the mean-STD value is slightly reduced further. With the additional auxiliary prediction of distortion vector 2, which transforms the distorted frame into the original frame, the results achieve the best performance among all settings, which demonstrates the effectiveness of the bi-directional prediction. Furthermore, this setting alleviates the NURD in the inference phase without adding much computation time, and it adds no labeling burden in the training phase.

Evaluation of processing speed
Finally, we compare the processing speed of the two learning-based approaches in further detail. As shown in Table 5, our method enables significant time savings during pre/post-processing compared with the CNN-based method, because our approach does not require the pre-construction of a spatial correlation matrix. Note that we achieve real-time NURD correction at 26±3 fps while maintaining good accuracy.

Discussion
Self-attention, a groundbreaking mechanism for deep learning, has ushered in transformative advancements in NLP and CV [35]. In NLP, large language models like BERT and GPT-4, built on self-attention, have excelled in language tasks due to their ability to capture context and dependencies in text [36]. In CV, the vision transformer architecture and its variants leverage self-attention to process images by dividing them into patches and applying this mechanism to them [37]. They have achieved remarkable success in many tasks, such as classification, object detection, and semantic segmentation [38]. In addition, downstream applications such as medical image analysis also benefit from the paradigm shift from CNN to transformer [39].
In this work, we employ the self-attention mechanism to address the NURD problem in endoscopic OCT. We found that its capability of learning long-range dependencies and spatial correlations is useful in improving the efficiency of NURD correction. We designed the stacked cross-attention network specifically for this application (described in Section 2.2). Compared with existing NURD correction methods, our method achieves a ∼3× speedup to real time (26 ± 3 fps). We further design an overall framework for learning the NURD correction (described in Section 2.1) by leveraging three types of optimization constraints, including the L1, smoothness, and similarity losses. We also introduce a bi-directional design in the architecture of the framework. Their effectiveness in improving the NURD correction performance is verified through the ablation studies in Section 3.3. These new designs allow our method to outperform existing NURD correction methods in terms of both efficiency and correction performance.
To verify the generalization performance of our method, we test it on the data from several different OCT systems that cover the mainstream engines for endoscopic OCT imaging, including: (1) A tethered capsule endomicroscope for imaging the gastrointestinal tract using a swept-source OCT system [28]. It uses near-infrared wavelengths sweeping from 1,250 nm to 1,380 nm. It acquires circumferential, cross-sectional images at 20 frames s$^{-1}$ using a total of 2,048 axial (depth) scans per image. (2) A volumetric scanning OCT system for general luminal organ diagnosis [29]. It was built around the Axsun swept-source engine, with a 1310 nm center wavelength swept-source laser and a 100 kHz A-line rate. The OCT probe has an outer diameter of 3.5 mm. It is terminated at the distal end with a transparent sheath on the tip, which allows three-dimensional OCT imaging using an internal rotating side-focusing optical probe with two proximal external scanning actuators. (3) A home-built endoscopic OCT system for intravascular imaging, which uses a spectral-domain OCT system for collecting the interference fringes. It has a central wavelength of 840 nm and a line rate of 80 kHz. The fiber-optic probe has an outer diameter of 0.46 mm. A homemade capillary tube-based fiber optic rotary joint driven by a commercial motor (34 rps rotation speed) is applied to perform circumferential scan imaging. For the data from the above systems, our method achieves superior accuracy and efficiency in the NURD correction.
Our method can be beneficial to many application scenarios of OCT: (1) Surgical navigation and surveillance using OCT have revolutionized the field of minimally invasive procedures [40][41][42]. With its high-resolution imaging capabilities, OCT allows surgeons to navigate with unprecedented precision within complex anatomical structures. During surgery, real-time OCT imaging provides dynamic feedback, enabling surgeons to visualize tissue layers, assess boundaries, and confirm instrument placement. This real-time guidance enhances surgical accuracy, reduces the risk of complications, and minimizes the need for extensive tissue dissection. (2) Functional OCT imaging techniques that require capturing temporal dynamics (repeated scanning of a specific position), such as angiography, elastography, and thermometry. Bouma et al. developed a microscopic image guidance platform for radiofrequency ablation (RFA) using a clinical balloon-catheter-based OCT system [11]. They have shown that the computational correction of NURD could be used to improve the calculation of complex differential variance, which was then used to visualize the therapeutic thermal field. (3) The high spatial resolution of OCT enables its application in rapid in situ diagnosis. The presence of NURD increases the probability of misdiagnosis. Especially now that AI diagnostic models have been integrated with imaging instruments, the impact of imaging distortions will be further amplified [43].
Despite the above merits, the NURD correction method based on the proposed cross-attention learning has some limitations: (1) Learning-based methods require a large number of labels (supervision) for training. As mentioned above, we follow the approach in [19] by extracting the pseudo-GT distortion vectors using a feature-tracking method. Then we apply these distortion vectors randomly to the OCT images used in training. The stacked cross-attention network is trained to learn the mapping from manually distorted images to distortion-free ones. However, such a method is data-hungry and time-consuming. To address this issue, different supervision generation methods should be developed. (2) Our method is still in the category of image-based NURD correction and thus has the inherent drawbacks of such methods. This type of approach assumes that adjacent frames show a high degree of morphological coherence, i.e., that rotational artifacts result in faster changes in appearance than the structural changes inherent in the tissue. This assumption usually holds in general clinical endoscopic imaging, except in a few cases, such as abrupt structural changes and microscopic lesions at tissue junctions.

Conclusions
Here we have addressed the efficiency issue of NURD correction in endoscopic OCT and its functional extensions. Inspired by the self-attention mechanism, we have developed a cross-attention learning method to establish spatial correlations between OCT A-lines efficiently. We have designed and implemented an end-to-end stacked cross-attention network with optimization constraints. Compared to existing methods, we have achieved a substantial ∼3× speedup to real-time processing (26 ± 3 frames per second) and superior NURD correction performance. Our approach will contribute to the further development of endoscopic OCT technology and its multi-organ, multi-functional, multi-clinical scenario applications, as well as other rotational scanning imaging techniques such as intravascular ultrasound.

Fig. 1. Overall framework of our proposed method. (a) and (b) illustrate its training and inference phases, respectively.

Fig. 2. Illustration of the stacked cross-attention network. The upper panel shows the overall architecture and the lower dashed box shows the details of the multi-head cross-attention module.

Figure 2 illustrates the stacked cross-attention network. (a) is the overall architecture and (b) is the details of the multi-head cross-attention module. Instead of the 2D operations employed in CNNs, here we use each A-line (1D) of the OCT B-scans as a token. These tokens are then used to calculate the query (Q), key (K), and value (V) vectors in the self-attention mechanism [20]:
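A minimal sketch of this token-wise attention is given below. It assumes the standard single-head scaled dot-product form, with queries computed from the A-lines of one B-scan and keys and values from the A-lines of another; the function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all sizes are illustrative assumptions rather than the actual network configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(frame_q, frame_kv, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with A-lines as tokens.

    frame_q, frame_kv: (depth, n_alines) B-scans; each column is one token.
    w_q, w_k, w_v: (depth, d_model) projections (learned in a real network).
    Returns the attended tokens (n_alines, d_model) and the attention map
    (n_alines, n_alines) relating every A-line of frame_q to every A-line
    of frame_kv, regardless of their angular distance.
    """
    tokens_q = frame_q.T                      # (n_alines, depth)
    tokens_kv = frame_kv.T
    q = tokens_q @ w_q                        # queries
    k = tokens_kv @ w_k                       # keys
    v = tokens_kv @ w_v                       # values
    scores = q @ k.T / np.sqrt(q.shape[-1])   # pairwise A-line similarity
    attn = softmax(scores, axis=-1)           # each row sums to 1
    return attn @ v, attn

# Usage with random data and random (untrained) projections.
rng = np.random.default_rng(0)
depth, n_alines, d_model = 64, 32, 16
ref = rng.standard_normal((depth, n_alines))
cur = np.roll(ref, 3, axis=1)                 # rotated copy of the reference
w_q, w_k, w_v = (rng.standard_normal((depth, d_model)) / np.sqrt(depth)
                 for _ in range(3))
out, attn = cross_attention(cur, ref, w_q, w_k, w_v)
print(out.shape, attn.shape)                  # (32, 16) (32, 32)
```

Because the attention map is computed for all A-line pairs in a single matrix product, the correlation between A-lines at any distance is available without an explicit search or sliding cross-correlation.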

Fig. 3. (a) Photograph of our assembled proximal scanning micro-probe used for endoscopic OCT imaging. (b) Enlarged view of the black box in (a). (c) Photograph of the intravascular stent used in OCT imaging.

Fig. 4. Quantitative comparison of different NURD correction methods using the two synthetic sequences. (a) is the result of the pig bronchus data. (b) is the result of the human nasopharynx data.

Fig. 5. The NURD correction performance on the two synthetic sequences. (a) is the result of the pig bronchus data. (b) is the result of the human nasopharynx data. (c) and (d) are the results after removing the tissue-unrelated features, such as the sheath. The left column gives the original B-frames used to create the synthetic sequences. The middle column shows the axial maximum value projection of the synthetic sequences, which gives better views of the applied NURD. The right column gives the NURD-corrected synthetic sequences using our method.

Fig. 6. Quantitative comparison of different NURD correction methods using two publicly available datasets. (a) is the result of the gastrointestinal tract data. (b) is the result of the flat sponge surface data.

Fig. 7. Qualitative comparison of different NURD correction methods on gastrointestinal tract test data. (a) is the 3D view of a volumetric scan of the gastrointestinal tract. The red and blue boxes mark the zoomed-in areas shown in (b) and (c), respectively. (b) Local regions of the OCT images, each composed of three consecutive frames separately mapped to the R, G, and B color channels. (c) Local en face images obtained by mean value projection.

Figure 8 presents the qualitative results of different NURD correction methods on pull-back scans of a flat surface of a sponge. The en face images obtained by the mean value projection of the original and corrected results are shown in the first row, and the numbers at their bottom represent the NURD-induced precession angle of the flat surface. This angle is obtained by (1) connecting the center positions of the first and last frames in the sponge sequence (blue dashed line with arrow) and (2) measuring the deviation angle between the blue dashed line and the flat reference (black dashed line). The original sequence is gradually distorted by NURD during synchronous rotation and pull-back scanning, causing a maximum precession angle of 79.5°. All the NURD

Fig. 9. Comparison between the results of our proposed method and three other representative approaches. Orange bars represent their mean-STD (smaller means better). Blue bars represent the processing speed (ms/frame) and the corresponding frame rate in fps.

Fig. 10. The NURD correction performance of 3D intravascular stent imaging. (a) and (b) are 3D views of the original data and corrected data, respectively. (c) and (d) are 2D en face projections of (a) and (b). (e) and (g) are original and corrected cross-section images composed of three consecutive frames separately mapped in R, G, and B channels, respectively. (f) and (h) are enlarged views of the blue box in (e) and (g), respectively.

Fig. 11. Qualitative comparison of NURD correction performance using endoscopic and ophthalmic OCT data for training. (a) The results of synthetic pig bronchus data. (b) The results of human nasopharynx data.

Fig. 12. Quantitative comparison of NURD correction performance using endoscopic and ophthalmic OCT data for training. (a) The results of synthetic pig bronchus data. (b) The results of human nasopharynx data.

Table 1. Quantitative comparison of different NURD correction methods using the two synthetic OCT sequences. The data format in the table is mean (standard deviation).

Table 2. Quantitative comparison of different NURD correction methods using two publicly available datasets. The data format in the table is mean (standard deviation).

Table 3. Comparison of the NURD correction on the real gastrointestinal tract data using different types of training data. The data format in the table is mean (standard deviation).

Table 5. Comparison of the processing speed of the two learning-based methods.