Illumination angle correction during image acquisition in light-sheet fluorescence microscopy using deep learning

Light-sheet fluorescence microscopy (LSFM) is a high-speed imaging technique that provides optical sectioning with reduced photodamage. LSFM is routinely used in life sciences for live cell imaging and for capturing large volumes of cleared tissues. LSFM has a unique configuration, in which the illumination and detection paths are separated and perpendicular to each other. As such, the image quality, especially at high resolution, largely depends on the degree of overlap between the detection focal plane and the illuminating beam. However, spatial heterogeneity within the sample, curved specimen boundaries, and mismatch of refractive index between tissues and immersion media can refract the well-aligned illumination beam. This refraction can cause extensive blur and non-uniform image quality over the imaged field-of-view. To address these issues, we tested a deep learning-based approach to estimate the angular error of the illumination beam relative to the detection focal plane. The illumination beam was then corrected using a pair of galvo scanners, and the correction significantly improved the image quality across the entire field-of-view. The angular estimation was based on calculating the defocus level on a pixel level within the image using two defocused images. Overall, our study provides a framework that can correct the angle of the light-sheet and improve the overall image quality in high-resolution LSFM 3D image acquisition.


Introduction
Light-sheet fluorescence microscopy (LSFM) has become an indispensable tool for discovery in the life science community for its high throughput capacity in imaging of live or cleared fixed tissues [1-12]. In LSFM, the sample is illuminated by a thin sheet of light, and fluorophores inside the confined excitation volume are excited. The resulting emitted photons are then detected using a wide-field detection system oriented perpendicularly to the light-sheet illumination plane [1,13-16]. This unique imaging configuration provides optical sectioning and ensures that the 2D section inside the excitation volume is imaged in a single shot, dramatically reducing phototoxicity, photobleaching and acquisition times [1,2,15].
A critical condition for achieving high-quality imaging using LSFM is that the illuminating light-sheet plane and the detection focal plane must be co-planar (i.e., parallel and overlapping). However, due to the irregular sample shape, complex inner structures, mismatches of refractive index (RI) and imperfect optical alignment, the co-planar requirement is often violated, causing degradations in image quality [1,2]. We previously proposed a deep learning based method to rapidly and accurately perform autofocus during LSFM imaging [17]. This approach enhances the image quality when the illumination plane and the detection plane are parallel (Fig. 1(a)). However, refocusing fails to address cases where the illumination beam and the detection focal plane are not parallel. In this case the illumination beam has residual roll (Fig. 1(b)) and yaw (Fig. 1(c)) angles that cause different regions of the field-of-view to experience different defocus levels. Consequently, focusing on a single uniform value will not enhance the image quality in the entire field-of-view. These sample-induced angular errors were previously reported in live imaging of intact embryos [1,2], and we routinely observe these angular errors when imaging tissue cleared bones (e.g., porcine cochlea, and femur) with high resolution (> 0.5 numerical aperture). Bones contain irregular curved components that can refract the illumination beam. Fortunately, if the error in the yaw and roll angles is known, a correction to the illumination beam can be applied in real-time using a spatial-light-modulator or a galvanometric scanner [2].
Fig. 1. (b) and (c) Using two dual galvo scanning systems, the light sheet setup can control both the roll and yaw angles of the light-sheet illumination beam, respectively. Sample induced roll or yaw angles in the illumination beam result in variable image quality within the field of view; in focus regions are marked by a green rectangle, while out of focus regions are marked by a red rectangle. Roll (b) and yaw (c) angles will result in aberrations that are vertically and horizontally oriented, respectively. (d) Overview of the illumination angle correction pipeline. During image acquisition, two defocused images are acquired and sent to a deep learning network to predict the defocus distance on a pixel level. From the defocus map, the yaw and roll angles of the light-sheet relative to the detection objective focal plane can be calculated and corrected. After angle correction, the image is refocused again using the network and the entire field-of-view is brought in focus. In the color bar, the red and purple colors represent the extreme defocus values of -36 µm and 36 µm, respectively. Black color in the processed image represents detected background or pixels with low certainty. Scale bars, 100 µm.
Manually observing and correcting for these angular errors is highly time-consuming and laborious. For example, an adult pig cochlea consists of dozens of tiles (∼30-60), and each tile can be more than 5 mm deep, making the manual estimation and correction of sample-induced angular errors for so many sections in advance impractical. To address this issue, autofocus methods [2,18-26] that estimate the defocus level in various regions in the image could be used. From the spatially varying defocus values, the residual yaw and roll angles of the illumination beam can be estimated and corrected. Traditional autofocus methods that acquire and evaluate a stack of images (∼10-20) are not ideal for angular error estimation since they are slow. Moreover, traditional autofocus methods provide suboptimal results in the presence of spherical or other non-defocus aberrations that are common in tissue clearing applications [26], and their performance sharply degrades when analyzing relatively small fields-of-view. The latter limitation is especially restrictive, since the defocus level in multiple small regions in the field-of-view is required to estimate the angular error.
To overcome this limitation, we utilize deep learning methods recently employed in biomedical imaging [27-34]. These tools are particularly useful in our case since they only require one or two images to estimate the defocus distance, and they perform well on relatively small and aberrated image patches [35]. In general, several studies applied deep learning-based methods to solve the autofocus problem, using one [19,36-38] or two images [39,40]. These fast autofocus methods were applied to tissue slides and to reflection microscopy of semiconductors.
Herein, we extend our previous work [9,12,17] on autofocus in LSFM and devise a deep learning-based algorithm to estimate the angle of the light-sheet relative to the detection objective focal plane (Fig. 1(d)). This algorithm is integrated in our LSFM setup and captures two defocused images. The images are fed into a deep learning network to generate a defocus map, which is then used to estimate and correct the angular and defocus aberrations by utilizing two galvo scanners and a linear stage (Fig. 1(d)). We show that our angular correction approach performs well on multiple tissue types (e.g., brain, cochlea and lungs), by successfully overlapping the light-sheet illumination plane with the detection objective focal plane. This new angular correction approach may be used to correct sample induced angular aberrations without human intervention, and greatly improve the image quality in high resolution 3D imaging.

Sample preparation
In this study, the cochlea and lung samples were extracted from euthanized pigs, while the brain samples were extracted from Mosaic Analysis for Double Marker (MADM) mice [34]. The cochlea samples were cleared and labeled using a modified BoneClear protocol [9,41], while the rest of the samples were cleared and labeled using the iDISCO protocol [42-44]. The cochlea samples were labeled using Myosin VIIa (CY3 as secondary), and the brain samples were labeled against GFP (Alexa Fluor 647 as secondary) and RFP (CY3 as secondary). All animals were harvested under the regulation and approval of the Institutional Animal Care and Use Committee (IACUC) at North Carolina State University.

Light sheet setup
All the images were captured using a custom-built LSFM that was designed based on previous protocols [1,12,17]. The setup was described in detail in [17], and here we briefly describe the important components that control the illumination beam angle. The light-sheet plane illumination was generated by a continuous-wave laser (Coherent OBIS LS 561-50). The laser beam was expanded and focused on the sample to create a static Gaussian beam. To generate the light-sheet, the Gaussian beam was dithered at a high frequency (500 Hz) using a 2D scanning galvo ( Fig. 2(a); Cambridge Technology; 6215H) and a dual-channel arbitrary function generator (Tektronix; AFG31022A). The 2D scanning galvo could also control the roll angle, by synchronizing the frequencies and phases of the voltages, which drive the vertical and horizontal mirrors inside the galvo system. One mirror was dedicated for scanning the light-sheet beam up and down (x-axis), and the other mirror was responsible for the position of the light sheet on the z-axis (see Fig. S1). By adjusting the amplitude of the galvo mirror (in z-axis) the roll angle was determined and controlled in real time. Fig. S1 shows the voltage diagram, which was produced by the arbitrary function generator in order to create a certain roll angle. Again, the phase and frequency of the two channels needed to be locked to produce a stable light-sheet with a well-defined roll angle. To control the yaw and pitch angles, a pivot galvo was used (Thorlabs; GVS202). The pivot angle moved the illumination spot location over the 2D scanning galvo, and therefore on the conjugated back aperture of the illumination lens. The changes in the spot location resulted in angular changes on the sample plane. For the detection path, the perpendicular detection objective lens (10×, 0.6 numerical aperture, Olympus, XLPLN10XSVMP-2) together with a tube lens and a CMOS camera (Hamamatsu, C13440-20CU) were used to acquire a wide field image of the sample. 
The schematic diagram of our LSFM is shown in Fig. 2(a), and other hardware information about our microscope can be found in our recent publications [9,17].
Fig. 2. (b) The linear relationship between the galvo control voltage and the pitch, yaw, and roll angles. For the roll angle, the ratio of scanning galvo means the ratio of the voltage amplitudes applied to the two scanning mirrors. Please note the small coupling between the yaw and pitch angles.
To correct the illumination angle, the yaw, pitch and roll angles of the light sheet were measured as a function of the galvo voltages. The angles were measured without any specimen in the imaging chamber. First, the emission filter was removed. Second, the 2D scanning galvo voltage was set to a constant, and a Gaussian beam profile was observed on the camera. Third, the beam waist location and approximate propagation direction in the volume were recorded. Fourth, the voltage that controls one of the yaw, pitch, or roll angles was changed to a new constant value. Last, the new position of the beam waist and its new propagation direction in the volume were recorded. To find the new position of the beam, the detection objective had to be refocused. From the change in the waist location and propagation direction, the roll, pitch and yaw angles were estimated. By repeating this procedure for different voltages, we found a linear relationship between the galvo control voltage and the resulting angles, as shown in Fig. 2(b). The linear gradients between the galvo voltage and light sheet angle (Grad yaw and Grad roll) were later used for online angular correction.
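The calibration gradients described above could be obtained with an ordinary least-squares line fit. The following is a minimal sketch of that idea only; the function name `fit_gradient` and the voltage and angle readings are hypothetical placeholders, not measurements from the study.

```python
import numpy as np

def fit_gradient(voltages, angles_deg):
    """Fit angle = grad * voltage + offset by least squares.

    Returns (grad, offset), with grad in degrees per volt.
    """
    v = np.asarray(voltages, dtype=float)
    a = np.asarray(angles_deg, dtype=float)
    grad, offset = np.polyfit(v, a, deg=1)
    return float(grad), float(offset)

# Hypothetical calibration readings (illustrative values only).
yaw_voltages = [-0.2, -0.1, 0.0, 0.1, 0.2]   # galvo control voltage, V
yaw_angles = [-4.0, -2.0, 0.0, 2.0, 4.0]     # measured yaw angle, degrees

grad_yaw, yaw_offset = fit_gradient(yaw_voltages, yaw_angles)
print(f"Grad yaw = {grad_yaw:.2f} deg/V, offset = {yaw_offset:.2f} deg")
```

The same fit would be repeated for the roll channel, where the independent variable is the ratio of the two scanning-mirror voltage amplitudes rather than a single voltage.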

Datasets acquisition for training and testing
To train and test the deep learning models, we acquired images from our custom-built LSFM. To this end, the cleared tissue samples were placed in the chamber filled with dibenzyl-ether (DBE). For the training dataset, we first focused the images and set the illumination angles manually in order to achieve a uniform image quality in the field-of-view. Then the detection objective automatically moved and captured a stack of defocused images with defocus values ranging from -36 µm to 36 µm with a 6 µm step size (13 images) [17]. The acquired images were uniformly defocused without any observable angle. Although the models were trained on image stacks that exhibit only parallel but defocused illumination, they generalized well when angular errors were present. This greatly simplified the training process, as we did not have to train the models on angled illumination, whose absolute angle value is also affected by the specimen in an unpredictable way. A detailed description of how to train the model based on these uniformly acquired stacks of images can be found in [17]. The test set, which included cases with tilted illumination, required substantial manual labeling.
To test the network's ability to estimate the yaw and roll angles, image stacks with random yaw and roll angles were collected (∼60 stacks). The images in each stack had a non-uniform image quality because of the introduced angular aberration. Test image stacks were captured using the same method as the training dataset, i.e., 13 defocused images per stack. The angle in these stacks was evaluated manually, since we could not rely on the galvo voltage to determine the angular error in the presence of a sample that itself induced angular errors. To this end, we generated a graphical user interface that facilitated the labeling process (Fig. S2), as the regional defocus levels must be estimated manually at multiple sub regions in the image.

Network Implementation and training process
To accurately estimate the illumination angle of the light-sheet, it was beneficial to estimate the defocus values in the image at the pixel level, rather than at the patch level. Therefore, we used a U-net architecture [30] for multi-class classification of the defocus levels at the pixel level (Fig. 3). We decided to use a classification network, rather than training the network to infer the defocus value directly, as the dataset generation process was easier. In light sheet microscopy, when focusing an image manually, it was easier to assign the defocus value to a 6 µm bin than to estimate it precisely, because the image remained sharp as long as the objective focal plane was approximately within the full width at half maximum (FWHM) of the light sheet. Here, we extended the U-net to support two defocused images a fixed distance apart, which improved the accuracy of the defocus classification. We also added a certainty measure [17,35,45] to the U-net, to reject background and low certainty pixels. The network classified each pixel into 13 classes, whereby each class represented a different defocus distance between the light sheet and the focal plane of the detection objective (∆z est). The ∆z est values in the 13 classes were equal to -36, -30, -24, . . . , 24, 30, 36 µm. The 6 µm spacing between the classes was determined empirically (see [17]), and we believe it is related to the FWHM of the light sheet: if the objective focal plane was approximately within the light sheet FWHM (on average ∼14 µm across the entire field of view), the image remained sharp.
Fig. 3. The architecture of the network. In the example, a tissue cleared brain image with a true label of ∆z = 0 µm is shown. The two images from the training set have a 6 µm defocus distance between them, and they serve as input to the U-net. The output of the network has the same size as the input but with 13 channels. The value in each channel represents the probability of the defocus distance being in one of the 13 classes. Each of the 13 classes represents ∆z est from -36 µm to 36 µm with a 6 µm step size. The ground truth label is a binary stack of 13 channels; in this case the class that represents ∆z = 0 µm is equal to 1, and the rest are zero. BN: batch normalization; RELU: rectified linear unit; Conv: convolutional layer.
The input to the U-net was two defocused images 6 µm apart (I ∆z and I ∆z + 6 µm; 512×512 pixels each; ∆z is the real and unknown defocus distance), and we previously found that a 6 µm step size between the defocused images provided the optimum results in terms of classification accuracy [17]. The U-net output is a matrix (512 × 512 × 13) with the same number of rows and columns as in the original input images, plus 13 channels. For a given pixel, the values in the 13 channels represented the probability distribution vector of ∆z est belonging to the 13 defocus classes. To train the network, a dataset containing 379 image stacks was used. The label for each pair of input images was a matrix (512 × 512 × 13) with the value 1 in the channel corresponding to the true defocus value of the images (∆z). Even in the channel that corresponded to the true defocus value, pixels that corresponded to background (i.e., very low intensity) were set to zero. All other channels in the matrix were set to zero.
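The label layout described above can be illustrated with a short sketch. It assumes a simple intensity threshold to mark background; the name `make_label` and the parameter `background_threshold` are hypothetical, since the text only states that very low intensity pixels were set to zero.

```python
import numpy as np

# The 13 defocus classes: -36, -30, ..., 30, 36 µm.
DEFOCUS_CLASSES_UM = np.arange(-36, 37, 6)

def make_label(image, true_defocus_um, background_threshold):
    """Build an (H, W, 13) binary label for one input pair.

    The channel of the true defocus class is 1 for foreground pixels;
    background pixels and all other channels stay 0.
    background_threshold is an assumed, user-chosen intensity cutoff.
    """
    class_idx = int(np.flatnonzero(DEFOCUS_CLASSES_UM == true_defocus_um)[0])
    label = np.zeros(image.shape + (DEFOCUS_CLASSES_UM.size,), dtype=np.float32)
    label[..., class_idx] = (image > background_threshold).astype(np.float32)
    return label

# Toy 4 x 4 image: the left half is bright tissue, the right half is background.
toy_image = np.array([[200, 180, 5, 3]] * 4, dtype=float)
toy_label = make_label(toy_image, true_defocus_um=0, background_threshold=20)
print(toy_label.shape, toy_label[..., 6])
```

In this layout a background pixel has an all-zero label vector, which is consistent with the network being asked to reject such pixels rather than assign them a defocus class.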
From the network's output (512×512×13), a defocus distance prediction map was generated (512 × 512 pixels) by assigning the defocus value with the highest probability for each pixel.
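As an illustration of this step, the sketch below converts a probability array into a defocus map by taking the most probable class per pixel. The rejection rule used here (a cutoff `min_prob` on the highest class probability) is only an assumed stand-in for the certainty measure of [17,35,45], which is not specified in this text.

```python
import numpy as np

# The 13 defocus classes: -36, -30, ..., 30, 36 µm.
DEFOCUS_CLASSES_UM = np.arange(-36, 37, 6)

def defocus_map(probabilities, min_prob=0.5):
    """Convert an (H, W, 13) probability array into an (H, W) map in µm.

    Each pixel takes the defocus value of its most probable class.
    Pixels whose best probability is below min_prob are set to NaN;
    this threshold is a hypothetical placeholder for a certainty measure.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    best_class = np.argmax(probabilities, axis=-1)
    dz = DEFOCUS_CLASSES_UM[best_class].astype(float)
    dz[np.max(probabilities, axis=-1) < min_prob] = np.nan
    return dz

# Toy 2 x 2 network output.
probs = np.zeros((2, 2, 13))
probs[0, 0, 6] = 0.9    # class 6  ->   0 µm
probs[0, 1, 0] = 0.8    # class 0  -> -36 µm
probs[1, 0, 12] = 0.7   # class 12 ->  36 µm
probs[1, 1, 3] = 0.2    # low probability, rejected
dz_map = defocus_map(probs)
print(dz_map)
```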

Illumination angle correction in LSFM
In the illumination angle correction workflow, the network was accessed twice: once to correct for the angle, and a second time to correct for defocus (Fig. 1(d)). Two consecutive images I ∆z and I ∆z+6 µm (6 µm defocus distance apart) were captured and fed into the U-net model. From the network output, a defocus distance prediction map was generated (∆z est (x,y)), and a linear regression was used to fit a plane to the prediction map. The fitted plane parameters provided estimates for the roll and yaw angles (roll predict and yaw predict) using the following process: from n random ∆z est values, the illumination plane equation can be derived using a least-squares solver for β0, β1 and β2 (plane parameters), using the following equation: Using the estimated values of β0, β1 and β2, the yaw and roll angles can be calculated with the knowledge of the pixel size (6.5 × 6.5 µm²) and the size of the image. The following equations were used, under the assumption that the yaw and roll angles are small (i.e., less than 7°). For larger angles, the accuracy will improve if the transformation is derived from the plane parameters using, for instance, Rodrigues' rotation formula.
Atan is the arctangent function; left and right refer to the horizontal edge coordinates of the image (e.g., 1 and 2048, respectively). Top and bot refer to the vertical edge coordinates of the image (e.g., 1 and 2048, respectively), and mid refers to the middle coordinate of the image (e.g., 1024). For correcting the angles, the changes to the galvo voltage were calculated as follows: Note that the linear gradients Grad yaw and Grad roll (Fig. 2(b)) between the galvo voltage and the light sheet yaw and roll angles were found in section 2.2. Hence, the updated galvo voltages for the galvo scanners were galvoy updated = galvoy current + ∆galvoy voltage and roll updated = roll current + ∆roll ratio. After the angle correction, the light sheet plane was parallel to the focal plane of the detection objective. Then, an additional two consecutive images were captured for refocusing and fed into the network again. Based on all the pixel level predictions, the most abundant class was considered as the defocus distance (∆z predict). Accordingly, the detection objective was translated using a motorized linear stage (Newport; CONEX-TRB12CC with SMC100CC motion controller) to refocus the image. All in all, the workflow required four images to correct for the illumination angle and defocus distance. The control software and graphical user interface (App Designer in MATLAB) of the custom-built LSFM were implemented in the MATLAB R2019b environment. The trained network can also be accessed from the MATLAB environment.
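The workflow above (plane fit, small-angle conversion, voltage update, and most abundant class) can be sketched as follows. The equations themselves are not reproduced in this text, so everything here is an assumption: the plane form ∆z = β0 + β1·x + β2·y with x as the column index and y as the row index, the sign of the voltage correction, and the sample-space pixel size of 0.65 µm (inferred from 128 pixels ≈ 83 µm). Which of the two tilts corresponds to yaw and which to roll depends on the setup geometry and is deliberately left open. All function names are hypothetical.

```python
import numpy as np

def fit_plane(dz_map_um, n_samples=2000, seed=0):
    """Least-squares fit of dz = b0 + b1*x + b2*y to random valid pixels.

    x is the column index and y the row index (an assumed convention).
    NaN pixels (rejected by the certainty step) are ignored.
    """
    rows, cols = np.nonzero(np.isfinite(dz_map_um))
    rng = np.random.default_rng(seed)
    pick = rng.choice(rows.size, size=min(n_samples, rows.size), replace=False)
    x = cols[pick].astype(float)
    y = rows[pick].astype(float)
    design = np.column_stack([np.ones_like(x), x, y])
    beta, *_ = np.linalg.lstsq(design, dz_map_um[rows[pick], cols[pick]], rcond=None)
    return beta

def plane_tilts_deg(beta, pixel_size_um):
    """Tilt along the horizontal and vertical image axes, in degrees.

    beta[1] and beta[2] are slopes in µm of defocus per pixel, so dividing
    by the pixel size gives a dimensionless slope for the arctangent.
    """
    tilt_horizontal = np.degrees(np.arctan(beta[1] / pixel_size_um))
    tilt_vertical = np.degrees(np.arctan(beta[2] / pixel_size_um))
    return float(tilt_horizontal), float(tilt_vertical)

def voltage_change(angle_deg, grad_deg_per_volt):
    """Voltage step that cancels the estimated angle (sign is assumed)."""
    return -angle_deg / grad_deg_per_volt

def most_abundant_defocus(dz_map_um):
    """Most frequent defocus value among the valid (non-NaN) pixels."""
    values = dz_map_um[np.isfinite(dz_map_um)]
    classes, counts = np.unique(values, return_counts=True)
    return float(classes[np.argmax(counts)])

# Synthetic tilted defocus map (illustrative values only).
yy, xx = np.mgrid[0:64, 0:64]
tilted = 1.0 + 0.01 * xx - 0.02 * yy
beta = fit_plane(tilted)
tilt_h, tilt_v = plane_tilts_deg(beta, pixel_size_um=0.65)
print(beta, tilt_h, tilt_v)
```

A real prediction map is quantized to 6 µm classes, so the fitted slopes would be noisier than in this synthetic example; sampling many pixels, as the text describes, helps average out the quantization.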

Prediction of the Illumination angle and defocus
The network was implemented in Python 3.5 with the PyTorch 1.4.0 deep learning framework and trained on one NVIDIA Tesla V100-32GB GPU on Amazon Web Services for approximately 12 hours. A binary cross entropy loss function was used between the provided mask and the output of the network. The learning rate was set to 1e-5 with the Adam optimizer. Data augmentation methods such as normalization, random crop, saturation and flip were also used.
We compared the autofocus performance of our U-net model with that of our previous deep neural network (DNN) model [17] and of traditional autofocus methods (DCTS: Shannon entropy of the normalized discrete cosine transform, and TENV: Tenengrad variance). In contrast to the U-net model, the DNN model provided regional defocus values, but not pixel-based predictions. Consequently, the DNN model was less suitable for predicting errors in the illumination angles [17]. The test dataset included 55 image stacks with uniform defocus values across the field of view (Fig. 4(a)). Image patches with the size of either 128 × 128 pixels (83 × 83 µm²) or 512 × 512 pixels (333 × 333 µm²) were randomly cropped, and 13 random defocus positions were used from each stack. For the DNN and U-net models, only two defocused images were required, while the traditional autofocus methods (TENV and DCTS) were fed with the full image stack as input (13 images). The average absolute distance between the ground truth and the prediction ∆z predict was calculated for each of the autofocus methods (Fig. 4(a)). Since the neural network methods (U-net and DNN) provided either pixel or regional level data, ∆z predict for the entire image was defined as the most abundant defocus prediction.
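To make the contrast with stack-based methods concrete, the sketch below scores every frame of a defocus stack with a Tenengrad-variance focus measure and picks the sharpest one. It uses one common formulation (the variance of the squared Sobel gradient magnitude); the exact TENV variant used in the comparison is not specified in this text. The helper `box_blur` and the synthetic stack are illustrative assumptions, not the study's data.

```python
import numpy as np

def tenengrad_variance(img):
    """Variance of the squared Sobel gradient magnitude (interior pixels)."""
    i = np.asarray(img, dtype=float)
    gx = (i[:-2, 2:] + 2 * i[1:-1, 2:] + i[2:, 2:]) \
        - (i[:-2, :-2] + 2 * i[1:-1, :-2] + i[2:, :-2])
    gy = (i[2:, :-2] + 2 * i[2:, 1:-1] + i[2:, 2:]) \
        - (i[:-2, :-2] + 2 * i[:-2, 1:-1] + i[:-2, 2:])
    return float(np.var(gx ** 2 + gy ** 2))

def stack_autofocus(stack, positions_um):
    """Return the stack position with the highest focus score, and all scores."""
    scores = [tenengrad_variance(frame) for frame in stack]
    return float(positions_um[int(np.argmax(scores))]), scores

def box_blur(img, passes):
    """Crude defocus stand-in: repeated 5-point averaging (hypothetical helper)."""
    out = np.asarray(img, dtype=float)
    for _ in range(passes):
        p = np.pad(out, 1, mode="edge")
        out = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2]
               + p[1:-1, 2:] + p[1:-1, 1:-1]) / 5.0
    return out

# Synthetic 13-frame stack whose sharpest frame sits at +12 µm.
rng = np.random.default_rng(0)
sharp = rng.random((64, 64))
positions_um = np.arange(-36, 37, 6)
true_focus_um = 12
stack = [box_blur(sharp, abs(int(p) - true_focus_um) // 6) for p in positions_um]

best_um, scores = stack_autofocus(stack, positions_um)
print(best_um)
```

Unlike the two-image network input, this approach needs the whole stack before it can return an answer, which is the speed cost discussed in the text.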
For small image patches (128×128 pixels), U-net, DNN, and DCTS achieved average distance errors of 6.82, 7.14, and 8.56 µm, respectively, and for larger image patches (512×512 pixels), the average distance errors were 5.62, 6.02, and 5.86 µm, respectively (Fig. 4(a)). Overall, our U-net model performed better than or comparably to the DNN and traditional methods on different image sizes. For a detailed confusion matrix of the U-net model see Fig. S3. Interestingly, traditional autofocus methods like DCTS had a larger average distance error when applied to small image patches. Hence, they were less suitable for angular predictions, which require the division of the image into small image patches. Therefore, deep learning-based methods that provide pixel-based defocus measurements were preferable for illumination angle estimation.
To test the angular prediction of the workflow (Fig. 4(b)), we manually observed and labeled 60 image stacks with different yaw and roll angles. To calculate the angles manually, the defocus distance in various positions in the image was calculated, and a plane was fitted to these measurements. From the plane equation, the ground truth yaw and roll angles were derived. Then, the test images were fed to the network, and the difference between the ground truth angles and the U-net output results was calculated (Fig. 4(b)). The relative average angle errors of the U-net for the roll and yaw angles were 0.53° and 0.63°, respectively, while for DCTS, the relative average angle errors for roll and yaw were 0.59° and 1.30°, respectively. For a side by side comparison between DCTS and U-net in terms of performance and speed see Table S1. We believe that the asymmetry in the error (i.e., roll versus yaw) was related to the fact that the light sheet broadened at the edges of the field-of-view, which made it difficult to accurately determine the in-focus position at the edges (see discussion). Figure 4(c) shows the U-net prediction maps of four representative test images with relatively uniform defocus. The color bar indicates the defocus distance from the focal plane. The U-net average output (∆z predict) matched the ground truth defocus distance well. Figure 4(d) shows the prediction maps in cases where the light-sheet illumination was tilted relative to the detection objective. The prediction maps showed vertical and horizontal color gradients. These gradients demonstrated the U-net's ability to generalize well to non-uniformly defocused images, although it was trained on uniformly defocused images. The angular prediction made it possible to correct for sample induced angular errors in the illumination beam.

Online deep learning-based illumination correction pipeline improved image quality in the porcine cochlea and mouse brain
We integrated the illumination correction pipeline with our custom-built LSFM and then performed perturbation experiments on tissue cleared pig cochlea (Fig. 5(a)) and mouse brain (Fig. 5(b)). In these experiments, the light sheet galvo control variables were randomly chosen, and the distorted images are presented in Fig. 5(a1) and 5(b1). The original image size was 2048 × 2048 pixels, and to generate the defocus prediction maps, the middle part was cropped (1024 × 1024 pixels). The gradients in the prediction maps were evident, and the yaw and roll angles were calculated accordingly. After the illumination angle was corrected, the gradients in the prediction maps were less noticeable, and most areas in the prediction maps shared the same color (Fig. 5(a2) and 5(b2)). This indicated that the illumination and detection planes were approximately parallel. Then the angle corrected images were sent through the same network again to refocus the image after angle correction. The most abundant color in the prediction map was extracted and the associated defocus distance ∆z predict was inferred. Then the detection lens was refocused on the illumination plane (∆z predict distance) and the image quality improved dramatically in the field-of-view (Fig. 5(a3) and 5(b3)). After the correction, we ran the network again to verify that the illumination plane and the detection focal plane were indeed co-planar. This can be observed as the prediction maps in Fig. 5(a3) and 5(b3) have a greenish color (0 µm defocus).

Fig. 5. Online perturbation experiments in LSFM. (a1 and b1) Light-sheet images of a cochlea and brain tissue, respectively. In these images, the illumination plane and the detection plane are not parallel or focused. (a2 and b2) Images of the same area after illumination angle correction. Notice how the color gradients are less observable in the prediction maps. (a3 and b3) Images of the same area after defocus and illumination angle correction. The blue, green and yellow boxes mark the position of the zoom-in images, and the gray box marks the area that was used to generate the prediction maps. The improved image quality in a3 and b3 shows the pipeline's ability to make the light sheet plane and the detection focal plane co-planar. Note that the image content may differ because the light sheet spatial position changed when the angle was corrected, and therefore slightly different areas in the sample are illuminated and recorded. Scale bars, 400 µm.

Performance of online deep learning-based illumination correction pipeline on unseen-tissue
We then tested our workflow on an unseen tissue type, i.e., a tissue cleared lung sample extracted from a pig. Please note that the U-net was not trained on any lung tissue, and this test was performed to evaluate the network's ability to generalize to untrained tissue types. In general, the lung tissue had very different morphological features from the cochlea and the brain, as can be seen in Fig. 6(a1) and 6(b1), which were acquired with tilted illumination. Figure 6(a2) and 6(b2) show the same regions as Fig. 6(a1) and 6(b1) after illumination angle correction. After correcting for the defocus (Fig. 6(a3) and 6(b3)), the overall image quality improved. As expected, we noticed a performance drop on the unseen tissue, since the lung features were different from those of the brain and cochlea. Also, the pig lung tissue was not filled with agarose during the tissue processing; therefore, it collapsed and was denser than a mouse lung, which can be agarose inflated [46].

Discussion and conclusions
For high resolution LSFM imaging, image quality is greatly degraded if the light-sheet illumination is not parallel to the objective detection plane. This angular component in the light-sheet illumination can result either from inadequate alignment of the optical system, or from refraction of the illumination beam inside the specimen. Here, we develop an online method to correct for both the angle and defocus in LSFM using a deep learning model. Our model estimates the defocus distance on a pixel level using only two defocused images at a fixed step size. Based on the pixel level information on the defocus distances in the image, the angular error of the light sheet illumination is estimated. The degrees of freedom in our adaptive light-sheet microscope are then used to correct for the uneven defocus over the entire image. We describe a procedure to calibrate the light-sheet angle as a function of the galvo scanner voltage, so users can replicate the method on independent platforms. In comparison with traditional autofocus methods, our new approach is faster, since only two images are required to estimate the defocus level versus a full image stack. Additionally, the new approach is more accurate, as traditional autofocus methods yield less precise outputs when working on the small image patches that are required for estimation of the angles. We tested the utility of our method and observed significant improvement in image quality in mouse brain, as well as in porcine cochlea and lung samples. This approach can lighten the burden of human intervention in acquiring 3D datasets by removing manual calibration steps, and it can improve image quality overall.
Simple alternative approaches to mitigate the effects of non-uniform illumination are either to increase the diameter of the illumination beam used to generate the light sheet, or to use a low numerical aperture objective lens. A larger diameter beam increases the illuminated volume in the sample, which increases the chances for the detection focal plane to overlap with the illuminated volume in the sample, even in the presence of an angle between the two. However, increasing the illumination beam diameter diminishes the optical sectioning, which results in low quality images. Alternatively, one can use low numerical aperture objectives (e.g., 1-2×, numerical aperture <0.1), whose depth of field will be large enough to capture moderately tilted light-sheet illumination beams. Again, this solution comes at the expense of low axial and spatial resolution.
Although successful in its original goal, our new approach has a few limitations. First, we find that the yaw angle of the light-sheet is more difficult to estimate than the roll angle. We believe that the primary reason for this limitation is that the Gaussian beam diverges at the edges relative to the center of the field-of-view, where it is tightly focused. In this case, the optical sectioning is compromised, making it more difficult to observe significant changes in the image quality across multiple defocus levels at the edge of the field-of-view. As such, the estimation of the yaw angle is more difficult. In the future, we can incorporate "non-diffractive" beams (i.e., beams that spread out less while propagating away from the focal plane, e.g., a Bessel beam). Such an approach may improve the estimation of the yaw angle. Second, the main current computational bottleneck that restricts real-time correction is the processing time of the certainty measure. In general, the camera captures an image with 2048 × 2048 pixels; however, we only use the middle part (1024 × 1024 pixels) as an input to the network for two reasons: to reduce the processing time, and because the FWHM of the light sheet is broader at the edges, where the network's performance degrades. Therefore, we only train the network on the middle part. Given a patch of 1024 × 1024 pixels, both the focus correction and the angle correction take approximately 4 seconds, without including the acquisition time. Approximately 2 seconds are required to generate the prediction map for 1024 × 1024 pixels in Python. While it takes the network less than 0.1 seconds to generate the defocus distance per pixel, the calculation of the certainty requires ∼1 s. In future efforts we will generate a less computationally intensive certainty measure, which may provide similar results to our current approach. The reduction in the processing time should bring the proposed method from online to real-time.