Real-time retinal layer segmentation of OCT volumes with GPU accelerated inferencing using a compressed, low-latency neural network

Segmentation of retinal layers in optical coherence tomography (OCT) is an essential step in OCT image analysis for screening, diagnosis, and assessment of retinal disease progression. Real-time segmentation together with high-speed OCT volume acquisition allows rendering of en face OCT of arbitrary retinal layers, which can be used to increase the yield rate of high-quality scans, provide real-time feedback during image-guided surgeries, and compensate for aberrations in adaptive optics (AO) OCT without using wavefront sensors. We demonstrate here unprecedented real-time OCT segmentation of eight retinal layer boundaries achieved by three levels of optimization: 1) a modified, low-complexity neural network structure, 2) an innovative scheme of neural network compression with TensorRT, and 3) specialized GPU hardware to accelerate computation. Inferencing with the compressed network U-NetRT took 3.5 ms, 21 times faster than conventional U-Net inference, without reducing the accuracy. The latency of the entire pipeline from data acquisition to inferencing was only 41 ms, enabled by parallelized batch processing. The system and method allow real-time updating of en face OCT and OCTA visualizations of arbitrary retinal layers and plexuses in continuous mode scanning. To the best of our knowledge, our work is the first demonstration of an ophthalmic imager with embedded artificial intelligence (AI) providing real-time feedback.


Introduction
Ophthalmology is at the forefront of the AI revolution in medicine. Image-based retinal disease screening with deep learning has been a hot topic in recent ophthalmology research. During the past 5 years, the group at Google's DeepMind made significant contributions to the diagnosis and referral of diabetic retinopathy (DR) using fundus photography [1] and optical coherence tomography (OCT) [2], the two most common imaging modalities found in ophthalmic clinics worldwide. Following these groundbreaking works, significant research efforts have been dedicated to AI-assisted diagnosis solutions for several vision-threatening retinal diseases such as age-related macular degeneration (AMD) [3][4][5][6], DR [7,8], glaucoma [9], and retinopathy of prematurity [10], among others. In addition to becoming a potential screening tool, AI-assisted segmentation technologies can help in the quantification of biomarkers associated with disease progression [11][12][13][14][15][16][17]. However, all of the above-mentioned research only trained and applied AI methods in post-processing, after the images had been acquired. There is an opportunity to apply deep learning during image acquisition to assist the operator's judgment of image quality prior to acquisition, identify potential areas of disease, and provide real-time feedback. These capabilities could significantly improve the imaging success rate and potentially the diagnostic accuracy.
OCT is a non-invasive, three-dimensional medical imaging technology that has become the standard of care for the evaluation of macular disease. OCT is increasingly being used not just for cross-sectional visualization of tissue but also for depth resolved en face imaging of arbitrary retinal depths [18][19][20]. OCT-angiography (OCTA) particularly exploits the benefits of en face retinal visualization to assess the integrity of retinal capillary networks. OCT-OCTA is now also used to detect subclinical disease [21,22] and obtain quantitative metrics of disease severity in structure and blood flow simultaneously, encompassing layer thickness [23], ellipsoid zone defects [24], drusen area [25], vessel density [26], among others. The accurate en face visualization of separate retinal layers with OCT and retinal plexuses with OCTA relies on the correct segmentation of the cross-sectional B-scans composing OCT volumes.
Because accurate layer segmentation is essential for appropriate clinical interpretation, this has been an active area in OCT research. Many conventional image processing methods have been proposed in the literature to segment retinal layer boundaries, including active contour [27], support vector machine [28], and graph-based methods [29]. Machine learning, and specifically deep neural networks, have also been used to perform retinal segmentation in OCT images. Roy et al., [14] used ReLayNet, a fully-convolutional neural network, to segment retinal layers and fluid in macular OCT images. Fang et al., [30] used a combination of deep learning and graph search to segment nine retinal layer boundaries in AMD patients. Most of these retinal layer segmentation methods are too slow to segment retinal layers in real-time, either due to algorithmic or implementation complexities.
With increasing OCT acquisition speed, real-time volumetric OCT has become a reality and real-time segmentation of retinal layers has become more desirable. In addition to speeding up the OCT volume data processing workflow and providing prompt diagnosis feedback, real-time OCT data processing and visualization of en face retinal layers is also a key requirement of successful imaging in the clinical setting. For example, real-time retinal layer segmentation can provide effective focus control and the direct feedback of aberration correction performance with image-guided adaptive-optics (AO) techniques. It could also be used in ultra-wide field of view OCT systems to dynamically control the reference arm length to match with the retinal curvature. For surgical applications, real-time segmentation of OCT features with deep learning could assist intraocular surgery and antiangiogenic therapy in a similar way deep-learning based real-time segmentation has aided prostate biopsy [31] and fetal standard scanning [32] with ultrasound, as well as cardiac [33] and fetal brain [34] volume extraction with MRI.
The first barrier for the real time retinal layer segmentation is the Fourier-domain (FD) OCT image processing itself, which is an inherently computationally intensive process. Transformation of interferometric fringes into B-scan images from high-speed FD-OCT systems far exceeds the computation capability of the state-of-the-art central processing unit (CPU). In order to keep up with the ultra-high speed OCT acquisition speed, we developed a parallel computing OCT image processing platform enabled by the graphics processing unit (GPU) that allows real-time OCT processing [35]. Our custom GPU pipeline can perform OCT processing at 2.24 MHz axial scan rate and has accomplished streaming of flow contrast en face images extracted from the selected region on speckle variance OCTA (svOCTA) in real-time [36].
With real-time OCT processing capability, some implementations have demonstrated real-time axial tracking based on OCT cross-sectional images [37][38][39]. The axial position of the retina in an OCT B-scan was determined by the brightness of layers, a method that allows real-time focus optimization and visualization of OCT en face views. However, real-time axial tracking does not provide retinal layer specific information that is based on anatomical structures. Recently, Janpongsri et al., [40] demonstrated pseudo-real-time retinal layer segmentation for high-resolution sensorless adaptive-optics optical coherence tomography angiography (SAO-OCTA) with a segmentation method based on Dijkstra's algorithm. However, the segmentation time of this method scales strongly with the number of retinal layers to be segmented and the size of the image. For instance, segmentation of 7 retinal layers requires more than 20 ms, which is too slow for high-speed OCT that usually acquires B-scans at rates of a few hundred frames per second. Moreover, the thickness and delineation of retinal layers can change drastically across retinal landmarks and diseases. Artificial intelligence has more potential than classical image processing methods to capture the diversity of features that define retinal boundaries.
In this work, a modified U-Net architecture was trained for segmentation of eight retinal layer boundaries with data annotated by an offline and manually corrected graph-cut segmentation. For real-time inferencing, we generated a low complexity network (U-NetRT) by compressing the trained network's graph with specific layer pruning and fusions that reduced the computational load and maintained the original network's performance. We also used dedicated hardware (Tensor Cores) operating in parallel and designed to further accelerate the massive multiplication of matrices involved in deep learning, which maximized the performance of the GPU. We demonstrate high accuracy of layer boundary segmentation and en face visualization of arbitrary layer/slab simultaneously with OCT image acquisition.

Materials and methods
The real-time deep learning segmentation workflow is summarized in Fig. 1. First, ground truth layer segmentations of the B-scans in the training set were generated by an established graph-search method and manually corrected where segmentation errors occurred. The available data was augmented by applying random contrast and intensity to each input B-scan, and a neural network was trained to obtain the probability of pixels belonging to retinal layers, the vitreous humor and the choroid. The trained network was then compressed by TensorRT and the precision of its weights was reduced by a scaling factor determined in a calibration process, reducing the computational burden in order to perform real-time inferencing. This enabled real-time rendering of en face projections of several arbitrary layers simultaneously (see the ganglion cell layer and outer plexiform layer in Fig. 1). These steps are explained in detail in the following sections.
A total of ten volumes from two participants were used in this work. To improve signal-to-noise ratio and reduce speckle noise, a rolling average was applied to all groups of 10 adjacent B-scans. Averaged B-scans with blurred layers due to eye motion were excluded, resulting in a dataset of 5234 averaged B-scans.
The study participants were recruited and imaged at the Casey Eye Institute at the Oregon Health & Science University (OHSU). The imaging protocol was approved by the Institutional Review Board/Ethics Committee. The research adhered to the tenets of the Declaration of Helsinki.

System description
High-resolution OCT/OCTA volumetric images were acquired from the macula of healthy subjects with a prototype adaptive-optics spectral-domain OCT system built at the Casey Eye Institute [41]. The system operated at 250 kHz A-line scan rate (600 Hz B-scan rate), with an axial resolution of 2.6 µm in tissue (assuming a 1.33 index of refraction) and an optical lateral resolution of 5.6 µm. Cross-sectional images of size 400×1024 pixels were acquired at 400 adjacent lateral positions sampled over a field of view of 3×3 mm. The computer workstation used in this work consisted of an Intel i9 9900k CPU, an NVIDIA GeForce RTX 2080 Ti GPU, and 32 GB of RAM.

Network architecture
We selected the popular U-Net architecture developed by Ronneberger et al., [42], which has proven to produce highly accurate segmentations when training data is limited. U-Net is an end-to-end fully convolutional neural network with two symmetric paths: the contracting path to capture the context of an image, and the expansive path to precisely localize image features. In the original implementation of U-Net, each stage between pooling operations consisted of two consecutive convolutional layers. Here, aiming to reduce network size and thus inferencing time, we used one convolutional layer per stage (Fig. 2) for our modified U-Net architecture. The contracting path consisted of four repeating blocks, each containing a 3 × 3 convolutional layer followed by a batch normalization layer, a rectified linear unit activation function (ReLU), and a 2 × 2 max pooling layer with stride 2. After the first convolution, the number of feature channels is 64. This number then doubles with each consecutive convolution. The four blocks in the expansive path perform upsampling using 2 × 2 deconvolution layers, then concatenate the result with the corresponding feature map from the contracting path, and perform a 3 × 3 convolution followed by batch normalization and ReLU.
Fig. 1. Summary of deep learning segmentation workflow. After generating the ground truth and pre-processing the cross-sectional B-scans (including augmentation of B-scans available), a neural network was trained to segment seven retinal layers. An optimization process cropped the network size and a calibration that took into account the dynamic range of activations was used to reduce the precision of network weights in order to perform inference of averaged groups of 10 OCT B-scans simultaneously with acquisition. After layer boundaries had been defined, en face projections of the volumetric data could be produced.
The final layer in the network is a 3 × 3 convolution layer with ReLU activation function, which outputs the probability of a pixel to belong to each of the 8 output classes.
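As a rough illustration of the contracting path described above, the sketch below traces how the feature-map size and channel count evolve through the four blocks. The input orientation (1024 depth pixels by 400 A-lines), the use of 'same' padding for the 3 × 3 convolutions, and the function name are assumptions made for illustration; the text does not state them.

```python
def contracting_path_shapes(height, width, first_channels=64, n_blocks=4):
    """Trace (name, channels, height, width) after each conv and each 2x2 pool.

    Assumes 'same' padding for the 3x3 convolutions (not stated in the text),
    so only the stride-2 pooling changes the spatial size.
    """
    shapes = []
    channels = first_channels
    for block in range(1, n_blocks + 1):
        # 3x3 conv + batch norm + ReLU: sets the channel count, keeps the size
        shapes.append((f"conv{block}", channels, height, width))
        # 2x2 max pooling with stride 2: halves both spatial dimensions
        height, width = height // 2, width // 2
        shapes.append((f"pool{block}", channels, height, width))
        # the channel count doubles at the next convolution
        channels *= 2
    return shapes


# Hypothetical input orientation: 1024 depth pixels x 400 A-lines
shapes = contracting_path_shapes(1024, 400)
for name, c, h, w in shapes:
    print(f"{name}: {c} channels, {h} x {w}")
```

Under these assumptions the deepest pooled feature map has 512 channels at 64 × 25, which gives a sense of how little spatial data the innermost layers have to process.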

Training
Ground truth was generated separately for each B-scan by an automated graph cut algorithm and manually corrected in ITK-Snap [43]. We modified the shortest-path graph cut retinal layer segmentation method of Chiu et al., [44] and its open source implementation in the Caserel software. The method used Dijkstra's algorithm to iteratively search for the lowest-weight path connecting the two end nodes of a layer boundary. The modified Dijkstra's algorithm was implemented as a CPU routine with the Boost C++ Libraries. In both training and inferencing stages, we performed two pre-processing steps to homogenize the training and test data (Fig. 3). Firstly, we took the running average of all groups of 10 adjacent B-scans, processed and presented in linear scale, to improve the signal-to-noise ratio. B-scans with layers significantly blurred after averaging due to eye motion were excluded from the training and test datasets. Secondly, we increased the contrast of the averaged image by saturating the top and bottom 1% of pixel intensities and performed contrast stretching for intensities in between. Both steps were implemented using CUDA to increase throughput by taking advantage of parallelization. To ensure that no B-scans in the training and testing datasets belonged to the same volumetric acquisition, two out of ten available volumes were selected for testing, and eight were used for training. After eliminating poor-quality B-scans due to eye motion, the resulting training and testing datasets contained 4072 and 1162 images, respectively. The sizes of training and testing datasets were chosen empirically. The network was trained for 20 epochs using the Adam optimizer with a learning rate of 10⁻⁵. To reduce dependencies on brightness and contrast of a B-scan, each training image at every epoch was augmented with a random contrast/brightness coefficient.
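The two pre-processing steps and the augmentation described above can be sketched in NumPy as follows. This is a CPU illustration only, not the CUDA implementation used in this work. The function names, the augmentation range (0.5 to 2.0), and the `full_scale` saturation level are hypothetical choices, since the text does not give them.

```python
import numpy as np


def rolling_average(volume, window=10):
    """Mean of every `window` adjacent B-scans along axis 0 (linear-scale data)."""
    padded = np.concatenate([np.zeros((1,) + volume.shape[1:]), volume], axis=0)
    csum = np.cumsum(padded, axis=0)
    return (csum[window:] - csum[:-window]) / window


def random_brightness(image, rng, low=0.5, high=2.0, full_scale=1.0):
    """Scale by a random coefficient and saturate at `full_scale`.

    The coefficient range and the saturation level are assumed values.
    """
    coeff = rng.uniform(low, high)
    return np.clip(image * coeff, 0.0, full_scale)


def stretch_contrast(image, low_pct=1.0, high_pct=99.0):
    """Saturate the bottom and top 1% of intensities and stretch the rest to [0, 1]."""
    lo, hi = np.percentile(image, [low_pct, high_pct])
    clipped = np.clip(image, lo, hi)
    return (clipped - lo) / max(hi - lo, 1e-12)


# Synthetic stand-in for a small stack of 12 B-scans
rng = np.random.default_rng(0)
volume = rng.uniform(0.0, 1.0, size=(12, 64, 40))
averaged = rolling_average(volume, window=10)          # 3 averaged B-scans
augmented = random_brightness(averaged[0], rng)        # training only
network_input = stretch_contrast(augmented)
```

The order (average, augment, then adjust contrast) follows the Fig. 3 description. Note that without the saturation step a pure multiplicative brightness change would be undone by the percentile stretch, which is why the clip is included in this sketch.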
After segmentation, each pixel was assigned to one of 8 possible labels corresponding to vitreous, inner limiting membrane and nerve fiber layer (ILM/NFL), ganglion cell layer and inner plexiform layer (GCL/IPL), inner nuclear layer (INL), outer plexiform layer (OPL), outer nuclear layer (ONL), photoreceptors and retinal pigment epithelium (PR/RPE), and the choroid.

Loss function
The network was trained using a composite loss function consisting of weighted logistic loss and dice loss, as Roy et al. have demonstrated that using the joint dice and weighted logistic loss results in superior performance compared to using only dice or logistic loss [14]. The overall loss function was defined in Eq. (1), where L_dice is the dice loss (Eq. (2)) and L_log is the weighted logistic loss.
The weighted loss function was used in order to make the network more sensitive to the layer boundaries and compensate for imbalanced classes. The weighted logistic loss is defined in Eq. (3), where w(x) is the weight associated with a particular pixel, g_l(x) is a vector containing 1 for the correct label and 0 for others, and p_l(x) is the estimated probability for each label of this pixel.
Fig. 3. Pre-processing framework of B-scans prior to training of the neural network. Ten adjacent B-scans are first averaged to increase the signal-to-noise ratio. Data was augmented by introducing random brightness adjustments in order to account for all different signal intensities. Contrast was then adjusted for original and augmented data saturating the bottom and top 1% of pixel intensities. Then, the data and ground truth obtained by graph-cut (GC) segmentation were fed to a U-Net type neural network for training.
To encourage the network to learn layer boundaries, the regions of transition between layers are set to have a greater contribution. In addition, background regions above and below the retina weigh less than the pixels belonging to the retinal layer classes. The final weighting scheme is implemented in Eq. (4), where I is an indicator function and ∇ is the gradient operator.
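The equation bodies are not reproduced in this text, so the NumPy sketch below shows one form of the composite loss that is consistent with the definitions given here and with the formulation of Roy et al. [14]. The unweighted sum of the two terms, the specific dice expression, the weight values (10 for boundary pixels, 5 for retinal classes), and all function names are assumptions for illustration, not the values used in this work.

```python
import numpy as np


def boundary_weight_map(labels, retina_classes, w_boundary=10.0, w_retina=5.0):
    """w(x) = 1 + w_boundary * I(|grad l(x)| > 0) + w_retina * I(l(x) in retina).

    The two weights are placeholder values.
    """
    grad_rows, grad_cols = np.gradient(labels.astype(float))
    at_boundary = (np.abs(grad_rows) + np.abs(grad_cols)) > 0
    in_retina = np.isin(labels, retina_classes)
    return 1.0 + w_boundary * at_boundary + w_retina * in_retina


def one_hot(labels, n_classes):
    """g_l(x): 1 for the correct label, 0 otherwise; shape (H, W, n_classes)."""
    return np.eye(n_classes)[labels]


def weighted_log_loss(probs, onehot, weights, eps=1e-7):
    """Mean over pixels of -w(x) * sum_l g_l(x) * log(p_l(x))."""
    per_pixel = -(onehot * np.log(probs + eps)).sum(axis=-1)
    return float((weights * per_pixel).mean())


def dice_loss(probs, onehot, eps=1e-7):
    """1 - mean over classes of 2 * sum(p * g) / (sum(p^2) + sum(g^2))."""
    inter = (probs * onehot).sum(axis=(0, 1))
    denom = (probs ** 2).sum(axis=(0, 1)) + (onehot ** 2).sum(axis=(0, 1))
    return float(1.0 - (2.0 * inter / (denom + eps)).mean())


def combined_loss(probs, labels, retina_classes, n_classes=8):
    """Assumed form of the overall loss: L = L_dice + L_log."""
    g = one_hot(labels, n_classes)
    w = boundary_weight_map(labels, retina_classes)
    return dice_loss(probs, g) + weighted_log_loss(probs, g, w)


# Toy label map: 8 horizontal bands (vitreous = 0, six retinal classes, choroid = 7)
labels = np.repeat(np.arange(8), 4)[:, None] * np.ones((1, 6), dtype=int)
retina = [1, 2, 3, 4, 5, 6]
weights = boundary_weight_map(labels, retina)
loss_perfect = combined_loss(one_hot(labels, 8), labels, retina)
loss_uniform = combined_loss(np.full((32, 6, 8), 1.0 / 8), labels, retina)
```

In this toy example a pixel inside the vitreous gets weight 1, a pixel inside a retinal layer gets weight 6, and a retinal pixel on a layer transition gets weight 16, which mirrors the intent that boundaries contribute most and background contributes least.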

Real-time inference
Once a neural network is trained, it is possible to draw inference from the network in a matter of milliseconds using GPU acceleration. To achieve significant reduction in inferencing time, we took advantage of NVIDIA's TensorRT, a platform for high-performance deep learning inference on NVIDIA GPUs. TensorRT allows acceleration of networks trained in all popular frameworks, such as TensorFlow, Caffe, or MATLAB. It performs various optimizations on a network to improve speed performance during inference by optimizing the network graph, reducing inference precision mode, and selecting kernels specific to the GPU to maximize performance. The purpose of graph optimization is to make the network smaller while preserving its performance. Our implementation of a modified U-Net contained 163 layers. A first optimization step pruned the layers that had no impact on the performance, which reduced the network size down to 59 layers. Then, layer fusion was performed, further simplifying the network. The layer fusion step detected layer patterns suitable for optimization in the graph and then fused them into one. For our U-Net implementation, the following layer fusions took place:
1. Convolution and scale layers (bias addition and batch normalization) by adjusting convolution weights to result in the same computation.
2. Convolution and ReLU activation by computing ReLU function in the same kernel as the convolution.
3. Scale and activation (bias addition and ReLU).
After the layer fusion step, 32 layers were removed, resulting in a graph with 27 layers. The final graph optimization step consisted of removing concatenation layers, which further reduced the final number of layers to 23. TensorRT is capable of performing other types of graph optimizations, but we have only discussed here the ones applied to our network. Owing to TensorRT graph optimizations, the network size was reduced from 163 to 23 layers.
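The first fusion listed above (folding batch normalization into the convolution weights) can be checked numerically. The sketch below is an illustration of the arithmetic only, not of TensorRT's internals. For brevity it uses a pointwise (1 × 1) convolution as a stand-in for the 3 × 3 convolutions of the network, and all names are hypothetical.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """Pointwise convolution: x is (H, W, C_in), weight is (C_in, C_out)."""
    return x @ weight + bias


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed statistics."""
    return gamma * (y - mean) / np.sqrt(var + eps) + beta


def fold_batch_norm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Adjust conv weights and bias so that conv alone equals conv followed by BN."""
    scale = gamma / np.sqrt(var + eps)          # one factor per output channel
    return weight * scale, (bias - mean) * scale + beta


def fused_conv_relu(x, weight, bias):
    """Convolution and ReLU evaluated in one step, as a fused kernel would."""
    return np.maximum(x @ weight + bias, 0.0)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5, 3))
weight, bias = rng.normal(size=(3, 6)), rng.normal(size=6)
gamma, beta = rng.normal(size=6), rng.normal(size=6)
mean, var = rng.normal(size=6), rng.uniform(0.5, 2.0, size=6)

# Unfused reference: three separate layers
reference = np.maximum(batch_norm(conv1x1(x, weight, bias), gamma, beta, mean, var), 0.0)

# Fused version: a single layer with adjusted parameters
w_folded, b_folded = fold_batch_norm(weight, bias, gamma, beta, mean, var)
fused = fused_conv_relu(x, w_folded, b_folded)
max_abs_diff = float(np.abs(reference - fused).max())
```

Because batch normalization at inference time is an affine map per output channel, the fused layer reproduces the three-layer result up to floating-point rounding, which is why this kind of fusion removes layers without changing the network's output.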
In addition, neural network parameters were represented as 32-bit floating point (FP32) during training, but TensorRT supports different precision modes: FP32, FP16, and INT8 (8-bit integer). Choosing a lower precision mode can dramatically reduce inferencing time, so we used the lowest precision mode available, 8-bit integer, to achieve the fastest possible inference. The INT8 type has a much lower dynamic range compared to FP32, so to avoid accuracy loss from precision reduction, FP32 values need to be correctly mapped to INT8 with the use of calibration data to determine the dynamic range of each tensor in the network. Calibration data contains samples of representative input to the network, and needs to come from the same distribution as the test/inference data. In this work, 10 B-scans were selected from the training data to cover various sections of the retina and scanning positions.
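To make the role of the calibration data concrete, the sketch below maps FP32 values to INT8 with a single symmetric scale derived from the dynamic range seen in calibration samples. The max-absolute-value rule and the function names are simplifying assumptions; TensorRT's own calibrators select the range with more elaborate criteria, and this is not a description of its implementation.

```python
import numpy as np


def calibrate_scale(calibration_batches):
    """Symmetric scale from the largest magnitude seen in the calibration data."""
    max_abs = max(float(np.abs(batch).max()) for batch in calibration_batches)
    return max_abs / 127.0


def quantize_int8(x, scale):
    """Map FP32 values to INT8; values beyond the calibrated range saturate."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)


def dequantize(q, scale):
    """Recover approximate FP32 values from INT8 codes."""
    return q.astype(np.float32) * scale


# Ten synthetic tensors standing in for activations from ten calibration B-scans
rng = np.random.default_rng(0)
calibration = [rng.normal(size=(64, 64)).astype(np.float32) for _ in range(10)]
scale = calibrate_scale(calibration)

sample = calibration[0]
codes = quantize_int8(sample, scale)
max_error = float(np.abs(dequantize(codes, scale) - sample).max())
```

For values inside the calibrated range the round-trip error is bounded by half a quantization step (`scale / 2`), whereas values outside it are clipped. This is why calibration samples need to come from the same distribution as the inference data.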
After TensorRT completed all the above-mentioned optimizations, a runtime engine allowed us to perform high-speed inference.

Tensor cores
Deep learning is a very computationally intensive task, which makes selecting suitable hardware essential for high-performance projects. Training the same network on a multi-GPU machine could reduce training time from days to hours, allowing developers to iterate through network architectures towards the best implementation relatively quickly. With the recent growth of artificial intelligence, a number of technologies have been introduced to accelerate all stages of deep learning. One of the recent innovations greatly improving machine-learning performance is NVIDIA's Tensor Cores. Tensor Cores perform multiply-accumulate calculations on a 4 × 4 matrix in a single operation. As convolutional neural networks commonly contain large numbers of multiply-accumulate instructions, accelerating this operation leads to significant improvement in network performance. The theoretical speedup is 8 times, with an average of 4 for a typical user. In this project, we used an NVIDIA 2080 Ti GPU that has 544 Tensor Cores to further accelerate inferencing of the retinal layer segmentation.
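The 4 × 4 multiply-accumulate primitive can be emulated in NumPy to show how a larger matrix product decomposes into such operations. This is a CPU emulation for illustration only. The choice of FP16 inputs with FP32 accumulation is an assumption about one common Tensor Core mode and not necessarily the mode exercised by the INT8 engine in this work; the function names are hypothetical.

```python
import numpy as np


def tensor_core_mma(a, b, c):
    """Emulate one operation D = A x B + C on 4x4 tiles.

    Inputs are rounded to FP16 and accumulated in FP32 (assumed mode).
    """
    assert a.shape == b.shape == c.shape == (4, 4)
    a16 = a.astype(np.float16).astype(np.float32)
    b16 = b.astype(np.float16).astype(np.float32)
    return a16 @ b16 + c.astype(np.float32)


def tiled_matmul(a, b, tile=4):
    """Compute a @ b as a sequence of 4x4 multiply-accumulate operations."""
    n, k = a.shape
    _, m = b.shape
    out = np.zeros((n, m), dtype=np.float32)
    n_ops = 0
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                out[i:i + tile, j:j + tile] = tensor_core_mma(
                    a[i:i + tile, p:p + tile],
                    b[p:p + tile, j:j + tile],
                    out[i:i + tile, j:j + tile],
                )
                n_ops += 1
    return out, n_ops


# Small integer matrices are exactly representable in FP16, so the result is exact
rng = np.random.default_rng(0)
a = rng.integers(-4, 5, size=(8, 12)).astype(np.float32)
b = rng.integers(-4, 5, size=(12, 8)).astype(np.float32)
product, n_ops = tiled_matmul(a, b)
```

Here an 8 × 12 by 12 × 8 product takes 12 tile operations. On the GPU many such operations run in parallel across the Tensor Cores, which is where the acceleration of convolutional layers comes from.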

Post-processing
The output of the neural network is a semantically segmented image, with each pixel belonging to one of the 8 possible classes. Layer boundaries were then extracted from the segmentation map and used to generate the en face projection between the desired layers. As neural network semantic segmentation is anticipated to have an associated amount of error, a small number of pixels is expected to be mislabeled. The A-lines containing these errors could be detected automatically because the order of the labels assigned along the depth direction did not follow the proper order of retinal anatomy. This is particularly common close to the fovea, as the layers between ILM and OPL come together in this region. If a valid layer boundary could not be detected in a section of a layer, we applied linear interpolation to fill the gap and provide a continuous boundary. Figure 4 demonstrates the steps of detecting and correcting missing layer boundaries, showing segmentation with and without the correction of retinal layers.
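The post-processing steps described above can be sketched as follows: boundaries are read from the label map, A-lines whose labels violate the anatomical order are discarded, the gaps are filled by linear interpolation, and an en face projection is taken between two boundaries. This is a NumPy illustration under assumptions: the validity rule (labels must not decrease with depth), the mean-intensity projection, and the function names are hypothetical and not taken from the OCTViewer implementation.

```python
import numpy as np


def extract_boundaries(label_map, n_classes=8):
    """Return (n_classes - 1, width) boundary depths, NaN for invalid A-lines.

    label_map is (depth, width); classes are ordered from vitreous (0)
    to choroid (n_classes - 1).
    """
    depth, width = label_map.shape
    boundaries = np.full((n_classes - 1, width), np.nan)
    for col in range(width):
        a_line = label_map[:, col]
        # Assumed validity rule: labels never decrease along depth
        if np.all(np.diff(a_line) >= 0):
            # Boundary k is the first pixel whose label is at least k + 1
            boundaries[:, col] = np.searchsorted(a_line, np.arange(1, n_classes))
    return boundaries


def fill_gaps(boundaries):
    """Linearly interpolate each boundary across invalid A-lines."""
    cols = np.arange(boundaries.shape[1])
    filled = boundaries.copy()
    for k in range(boundaries.shape[0]):
        valid = ~np.isnan(boundaries[k])
        if valid.any():
            filled[k] = np.interp(cols, cols[valid], boundaries[k, valid])
    return filled


def en_face(volume, top, bottom):
    """Mean intensity between two boundaries.

    volume is (n_bscans, depth, width); top and bottom are (n_bscans, width).
    """
    z = np.arange(volume.shape[1])[None, :, None]
    mask = (z >= top[:, None, :]) & (z < bottom[:, None, :])
    return (volume * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1)


# Toy label map: 8 classes, 2 pixels each, with one corrupted A-line (column 2)
label_map = np.tile(np.repeat(np.arange(8), 2)[:, None], (1, 6))
label_map[5, 2] = 0                       # a vitreous label below the retina surface
raw = extract_boundaries(label_map)
fixed = fill_gaps(raw)

# Toy volume whose intensity equals the depth index
volume = np.broadcast_to(np.arange(16.0)[None, :, None], (3, 16, 6))
projection = en_face(volume, np.full((3, 6), 2.0), np.full((3, 6), 4.0))
```

In the toy example the corrupted A-line is rejected for all boundaries and then recovered from its neighbours, and the projection between depths 2 and 4 averages the two enclosed pixels.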

Network performance
We evaluated the performance of the original U-Net and the modified U-Net, which has only half of the original convolutional layers. Both networks showed good agreement with ground truth (Fig. 5) and their performances were evaluated using the Dice coefficient for each B-scan (Table 1). The accuracy was comparable to the results reported by other teams using neural networks for retinal layer segmentation [14]. Reducing the number of convolutional layers of the original U-Net led to a relatively modest deterioration in Dice coefficient for our dataset, which was 0.005 in the worst-case scenario. After TensorRT optimization and reduction of precision to INT8, accuracy loss was negligible for U-NetRT compared to the modified U-Net, with the largest Dice coefficient difference (0.0023) occurring at the OPL. Dice coefficients of U-NetRT were identical to those of the modified U-Net prior to TensorRT for FP32 and FP16 precisions. During the acquisition process, adjusting brightness and contrast of the B-scans is often desirable to improve visibility of certain features. Significantly increasing brightness saturates the brightest pixels, resulting in an increased number of labelling errors. In order to improve the robustness of the network when brightness and contrast are significantly different from the settings used during the acquisition of training data, we added a random augmentation step to the training process. Table 2 compares the Dice coefficients of two networks: the first trained on B-scans with only contrast adjustment applied, and the second trained on B-scans with a random brightness/contrast coefficient applied followed by contrast adjustment. The network trained on randomly augmented images performed significantly better when B-scans were augmented with a high brightness/contrast coefficient.
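For reference, the per-class Dice coefficient used for this evaluation can be computed from a predicted and a ground-truth label map as sketched below. The function name and the toy data are illustrative and are not the evaluation code used in this work.

```python
import numpy as np


def dice_per_class(pred, truth, n_classes=8):
    """Dice = 2 |P and T| / (|P| + |T|) for each class; NaN if the class is absent."""
    scores = np.full(n_classes, np.nan)
    for k in range(n_classes):
        p, t = pred == k, truth == k
        total = p.sum() + t.sum()
        if total > 0:
            scores[k] = 2.0 * np.logical_and(p, t).sum() / total
    return scores


# Toy ground truth: 8 bands of 4 pixels; prediction shifted down by one pixel
truth = np.repeat(np.arange(8), 4)[:, None] * np.ones((1, 5), dtype=int)
pred = np.roll(truth, 1, axis=0)
dice_identical = dice_per_class(truth, truth)
dice_shifted = dice_per_class(pred, truth)
```

In this toy case a one-pixel axial shift of 4-pixel-thick layers lowers every Dice coefficient from 1 to 0.75, which illustrates why thin layers are the most sensitive to small boundary errors.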

Table 2. Comparison of dice coefficients between two networks trained on different data when tested on images with a high brightness/contrast coefficient. Network 1 was trained on images with default brightness, while Network 2 was trained on images with random brightness and contrast applied.

U-NetRT real-time performance
As inference time can vary for the same network and input size due to the non-deterministic nature of GPUs and the non-real-time operating system, we tested the inferencing speed by running 1162 test B-scans for 100 iterations using TensorFlow on a GPU and TensorRT for each of the three available precision modes and two network architectures (Table 3). The inferencing time of the modified U-Net was reduced by 18 times after introducing TensorRT and reducing the precision of network parameters (Table 3). We combined the retinal layer segmentation, pre- and post-processing steps into our real-time GPU accelerated OCT acquisition and processing software OCTViewer. In order to maintain high throughput and minimize latency, the acquisition of raw OCT B-scans, transfer from host to GPU memory, and processing were synchronized with a parallelized batch scheme as shown in Fig. 6(a). Batches contained 10 cross-sectional B-scans, each formed by 400 A-lines acquired in the fast scanning direction and corresponding to 10 adjacent positions in the slow scanning direction. Batch sizes of only 10 frames allowed high throughput data transfer and processing, relatively low latency, and ensured good signal-to-noise ratio (SNR) when averaged for segmentation. The acquisition of batches containing 10 frames took 16.6 ms, data transfer from frame grabber 2.8 ms, OCT data processing 3.5 ms, pre-processing 0.2 ms, inference 3.5 ms and post-processing 0.11 ms (110.1 µs for layer detection and 2.4 µs for linear interpolation correction), for an overall latency of 41 ms (Fig. 6). This latency was considerably smaller than reported in previous literature [45].
Fig. 6. Profiler for B-scan segmentation. (a) After data was acquired in batches of 10 frames (blue segments), it was transferred from host to GPU memory in synchronization with the acquisition of the next batch (orange segments). Then, OCT signal processing, image pre-processing, inference, and post-processing were performed during the acquisition of the following batch (cyan segments). (b) Processing encompasses OCT data processing, OCT image pre-processing, inference with U-NetRT and post-processing, yielding an approximate total latency of 41 ms.
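The latency figure can be reproduced from the stage timings reported above. The short calculation below uses one reading of the batch scheme in Fig. 6, namely that the transfer stage occupies a full acquisition slot because it is synchronized with the next batch. That reading, and the variable names, are assumptions made for illustration.

```python
BSCAN_RATE_HZ = 600.0      # B-scan rate of the system
BATCH_SIZE = 10            # B-scans per batch

# Time to acquire one batch: 10 frames at 600 Hz, about 16.7 ms
acquisition_ms = BATCH_SIZE / BSCAN_RATE_HZ * 1e3

# Stage timings reported in the text (milliseconds)
transfer_ms = 2.8
processing_ms = {
    "oct_processing": 3.5,
    "pre_processing": 0.2,
    "inference": 3.5,
    "post_processing": 0.11,
}

# Assumed reading of Fig. 6: the transfer is hidden inside the acquisition
# slot of the next batch, so this stage costs a full slot in latency.
transfer_slot_ms = max(transfer_ms, acquisition_ms)
processing_total_ms = sum(processing_ms.values())

latency_ms = acquisition_ms + transfer_slot_ms + processing_total_ms
print(f"processing per batch: {processing_total_ms:.2f} ms")
print(f"end-to-end latency:   {latency_ms:.1f} ms")
```

Under this reading the total comes to roughly 41 ms, and the processing time per batch (about 7.3 ms) stays well below the 16.7 ms acquisition slot, which is the condition for the pipeline to keep up with continuous scanning.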

Table 3. Comparison of inferencing time (mean ± standard deviation) for the two network architectures (modified U-Net and U-NetRT after architecture optimization) using TensorFlow and TensorRT with network parameters of different precision modes (floating point 32 and 16, and integer 8).
Our U-NetRT exhibited a high segmentation accuracy for all layer boundaries at different eccentricities within the parafovea, including the foveal dip (Fig. 7). The pixel classification showed excellent agreement with the anatomical distribution of layers, without 'islands' of misclassified pixels, yielding layer boundaries that were continuous and smooth. Figure 8 is a screen capture of our OCTViewer software during an in vivo imaging session with real-time retinal layer segmentation enabled (Visualization 1). Layer boundaries and the OCT projections of arbitrary retinal depths could be visualized simultaneously with a pupil view used to guide alignment. Due to screen size limitations, although all 6 retinal layers were segmented in real time, OCTViewer only displays four en face projections at a time; layer boundaries for those layers are shown on the B-scan. Real-time segmentation allowed the operator to exert better judgement of the quality of images prior to saving the OCT data. It also carries potential to represent in real time any depth-specific retinal pathology. The segmentation was correct for various SNR and fields of view (FOV) found during in vivo imaging sessions (see Visualization 2 and Visualization 3). Errors were observed during head and eye movements causing pupil misalignment and partial vignetting, but the segmentation was accurate again after the eye re-fixated or the alignment was corrected by the operator after the blink (Visualization 1).

Discussion and conclusions
We have demonstrated a real-time retinal layer segmentation method for OCT using state of the art GPU hardware features and TensorRT to compress the neural network and reduce the computational burden while maintaining the same performance. The original neural network's architecture was compressed from 163 to 23 layers in order to reduce U-Net's inferencing time while preserving the performance above 94% for all layers. The U-NetRT inferencing time (3.5 ms) was 21 times shorter than that of the original U-Net (75.6 ms), not accounting for 0.2 ms and 0.11 ms of pre- and post-processing. The pre-processing averaging of ten adjacent B-scans was performed anticipating inaccuracies due to SNR of single B-scans, but did not incur significant additional latency. The tests done in real time demonstrated robustness after instances of blinking and microsaccadic motion (Visualization 1), as well as in various sizes of FOV and retinal eccentricities (Visualization 2 and Visualization 3).
While the training stage of neural networks typically requires large datasets and time-consuming manual segmentation of ground truth, this network was trained with only eight volumetric acquisitions from healthy eyes, each containing 400 B-scan positions within the parafovea. Data augmentation techniques allowed producing new B-scans to improve the network's accuracy without requiring new labeling; rolling averaging was applied for more efficient utilization of the available training dataset. The use of a graph cut algorithm in the generation of ground truth itself was also a factor that facilitated training, since the operator only needed to intervene for manual corrections of the deficient frames, excluding the ones affected by eye motion. Still, the network cannot be applied yet on acquisitions encompassing the optic disc, significantly larger fields of view, or pathological characteristics such as macular edema, drusen, or layer dystrophy. We tested a retinal landmark and FOV approaching the optic disc, not used for training (Fig. 9). Here, segmentation was accurate until the boundaries approached the peripapillary area, where drastic changes in the boundary gradient owing to reduced contrast and large vessel shadows caused segmentation errors. Nevertheless, based on the previous literature we are confident that with additional training the neural network should be able to generalize successfully to new patterns characteristic of different retinal landmarks and diseases.
There are some limitations of the method. Because we average 10 adjacent B-scans to improve the image quality and ensure the success of segmentations, single frame segmentations were not successful (Fig. 10) due to insufficient SNR. The method also fails when large ocular motion or blinks occur in between the averaged B-scans (Fig. 10). In addition, overall signal attenuation due to defocus or anterior segment opacities (e.g. cataracts) could yield an SNR too low to guarantee successful segmentation. In spite of this, the algorithm was successful across macular vessel shadows, suggesting it should also be robust to localized shadows from vitreous floaters of similar caliber. We also showed that segmentation errors appear during microsaccadic transients in which the eye drifts away from the fixation target, and that they are rapidly corrected once fixation is recovered.
Fig. 10. Representative examples of neural network segmentation errors. Single-frame B-scans (upper row) without frame averaging did not guarantee the signal-to-noise ratio required to delineate relatively dark boundaries such as those of the inner nuclear layer with the inner and outer plexiform layers, yielding 'broken' layer boundary graphs. Although averaging of ten B-scans over five adjacent B-scan positions increased the signal-to-noise ratio, the frames found in the proximity of positions of large bulk motion artifacts presented distinct layers overlapping at the same axial depths, interfering with the performance of the neural network.
It can be noticed in Fig. 6 that the inference process does not occur until after acquisition of the two following batches, incurring additional latency due to the parallelized intermediate data transfer step. This configuration, as mentioned before, was preferred because it allows the software to increase overall processing speed by effectively hiding the data transfer time at the expense of slightly longer latency. With the 250 kHz A-scan rate, we could reduce the latency by 16 ms if we either serialized the data transfer and processing or reduced the batch size by half; however, we chose the batch size of 10 B-scans to avoid any potential frame skipping and to ensure sufficient frame averaging for segmentation. It is also possible that in the near future the PCIe 4.0 standard will allow data transfer between host and GPU at a much faster speed, which would allow serialized data transfer and processing. Furthermore, device-to-device (D2D) data transfer is also an available option to reduce the overhead. For example, with AlazarTech digitizers or BitFlow frame grabbers it is feasible to transfer the acquired interferogram from the digitizer/frame grabber card directly to the GPU without copying to host RAM first. However, this feature requires NVIDIA Quadro or Tesla GPUs, which can be much more expensive than the consumer ones used in this manuscript.
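The latency trade-off described above can be checked with a short back-of-the-envelope calculation. The sketch below is only an illustrative model, not the acquisition software: it assumes 400 A-scans per B-scan (a value not stated in this section, chosen so that a 10 B-scan batch lasts 16 ms at 250 kHz) and it assumes that serializing transfer and processing removes exactly one batch period of waiting. The function names are hypothetical.

```python
def batch_period_ms(a_scan_rate_hz: float, a_scans_per_bscan: int, bscans_per_batch: int) -> float:
    """Time needed to acquire one batch of B-scans, in milliseconds."""
    return 1e3 * bscans_per_batch * a_scans_per_bscan / a_scan_rate_hz


def wait_before_inference_ms(batch_ms: float, batches_waited: int) -> float:
    """Acquisition time that elapses before inference on a batch can start."""
    return batches_waited * batch_ms


A_SCAN_RATE_HZ = 250_000   # from the text
BSCANS_PER_BATCH = 10      # from the text
A_SCANS_PER_BSCAN = 400    # ASSUMPTION: not stated in this section

full_batch = batch_period_ms(A_SCAN_RATE_HZ, A_SCANS_PER_BSCAN, BSCANS_PER_BATCH)
half_batch = batch_period_ms(A_SCAN_RATE_HZ, A_SCANS_PER_BSCAN, BSCANS_PER_BATCH // 2)

# Pipelined configuration: inference starts after the two following batches.
pipelined_wait = wait_before_inference_ms(full_batch, batches_waited=2)
# ASSUMPTION: serializing transfer and processing removes one batch of waiting.
serialized_wait = wait_before_inference_ms(full_batch, batches_waited=1)
# Alternative: keep the pipeline but halve the batch size.
half_batch_wait = wait_before_inference_ms(half_batch, batches_waited=2)

print(f"batch period: {full_batch:.1f} ms")
print(f"saving if serialized: {pipelined_wait - serialized_wait:.1f} ms")
print(f"saving with half batch: {pipelined_wait - half_batch_wait:.1f} ms")
```

Under these assumptions both alternatives save one full batch period, which matches the 16 ms figure quoted above. Halving the batch would, however, also halve the number of frames available for averaging.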
Dramatic acceleration of neural network-based segmentation is possible with little investment from researchers and developers by optimizing the neural network using existing software solutions (TensorRT) and a high-end consumer GPU (GeForce RTX 2080 Ti). We were able to achieve an average inference time of around 3.5 ms per B-scan, which allowed for real-time segmentation of the retinal layers during an imaging session. To further reduce the total latency after frame batch acquisition, two separate GPUs could be used: one for real-time OCT B-scan processing and one for deep learning inference of the layer boundaries. Peer-to-peer (P2P) data transfer between separate GPU units is available and fast using the NVLink interconnect, which supports transfer bandwidths as high as 300 GB/s, up to 70% higher than PCIe transfer bandwidth.
Simultaneous B-scan and en face rendering of the OCT volume in real time allows the operator to: (1) ensure prior to acquisition that the position of the pupil is at an optimal plane to reduce pupil vignetting and B-scan cropping; (2) avoid saving scans containing large motion artifacts and blinks, thus improving the acquisition success rate; and (3) perform layer-specific optimization of defocus in large numerical-aperture instruments using adaptive optics [46][47][48]. This work can also be framed in the context of OCT being on the verge of transitioning from purely diagnostic applications to image-guided surgery. The capabilities shown here could provide real-time feedback to eye surgeons during common procedures such as retinal laser photocoagulation, corneal grafting, or anti-angiogenic injection treatment.
This method also holds great potential for OCTA, which is typically visualized en face and is very sensitive to motion and shadow artifacts [49]. Because adaptive optics was incorporated in this instrument, the en face images produced in real time could also be utilized to compensate aberrations such as low-order defocus, astigmatism, and coma prior to acquisition. By extending the method with a parallel software module performing vessel segmentation in real time, we could produce instantaneous quantification of the vessel density or the avascular area in diseases like DR, yielding real-time visualization and quantification of imaging biomarkers of disease. In addition, real-time visualization of en face OCTA in tandem with sparse widefield scanning could assist the operator in navigating to regions of interest (e.g. locations of peripheral vitreous or choroidal neovascularization) and in adapting the eccentricity, field of view, and sampling density to visualize them with higher definition.
In summary, to the best of our knowledge, we have demonstrated real-time retinal layer segmentation by incorporating a highly optimized U-Net in an OCT retinal scanner for the first time in the literature. Embedding artificial intelligence in ophthalmic imaging systems could potentially enable the next generation of smart imagers capable of active image acquisition.

Funding
Michael Smith Foundation for Health Research; Natural Sciences and Engineering Research Council of Canada; Canadian Institutes of Health Research; Research to Prevent Blindness; National Eye Institute (P30 EY010572) (R01 EY031331).

Disclosures
Acner Camino: Optovue, Inc (P). Marinko V. Sarunic: Seymour Vision (I). Yifan Jian: Seymour Vision (I). These potential conflicts of interest have been reviewed and managed by OHSU as well as Simon Fraser University. Other authors declare that there are no conflicts of interest related to this article.