Deep learning-based autofocus method enhances image quality in light-sheet fluorescence microscopy

Abstract: Light-sheet fluorescence microscopy (LSFM) is a minimally invasive, high-throughput imaging technique ideal for capturing large volumes of tissue with sub-cellular resolution. A fundamental requirement for LSFM is a seamless overlap of the light-sheet, which excites a selective plane in the specimen, with the focal plane of the objective lens. However, spatial heterogeneity in the refractive index of the specimen often violates this requirement when imaging deep in the tissue. To address this issue, autofocus methods are commonly used to refocus the focal plane of the objective lens on the light-sheet. Yet, autofocus techniques are slow, since they require capturing a stack of images, and they tend to fail in the presence of the spherical aberrations that dominate volume imaging. To address these issues, we present a deep learning-based autofocus framework that can estimate the position of the objective-lens focal plane relative to the light-sheet from two defocused images. This approach outperforms the best traditional autofocus method on small image patches and provides comparable results on large image patches. When the trained network is integrated with a custom-built LSFM, a certainty measure is used to further refine the network's prediction. The network's performance is demonstrated in real-time on cleared genetically labeled mouse forebrain and pig cochlea samples. Our study provides a framework that could improve light-sheet microscopy and its application toward imaging large 3D specimens with high spatial resolution.

The working principle of LSFM is to generate a thin layer of illumination (a light-sheet) that excites the fluorophores in a selective plane of the prepared sample (Fig. 1(a)), while detecting the emitted signal along an orthogonal detection path [7,21]. This orthogonal excitation-detection scheme makes LSFM fast and non-destructive, but it also dictates a strict requirement: the thin sheet of excitation light must overlap with the focal plane of the objective lens. Any deviation from this requirement severely degrades LSFM image quality and resolution (Fig. 1(b)). However, this restriction is often violated when imaging deep within cleared tissues due to the specimen's structure and composition. The heterogeneous composition of tissues often leads to refractive index (RI) mismatches that cause: (i) spherical aberrations, and (ii) minute changes in the objective lens focal distance [22]. Consequently, the relative position of the light-sheet and the objective focal plane (∆z) constantly shifts during volume imaging, and in our implementation, the detection objective often needs to be translated to compensate for this shift.

Fig. 1 (caption fragment): … neurons (left) and hair cells (right). The images were captured from a whole mouse brain and a pig cochlea that were tissue cleared for 3D volume imaging. The red boxes mark the locations of the zoomed-in images at the bottom. The degradation in the quality of the out-of-focus images can be observed. (c) Overview of the integration of the deep learning-based autofocus method with a custom-built LSFM. During image acquisition, two defocused images are collected and sent to a classification network to estimate the defocus level. The border color of each individual patch in the right image indicates the predicted defocus distance. In the color bar, red and purple represent the extreme cases, in which the defocus distance is −36 µm and 36 µm, respectively. The borders' dominant color is green, which indicates that this image is in focus.
Determining the objective-lens position that best overlaps with the light-sheet, and thus provides superior image quality, can be accomplished by eye. However, this is highly time-consuming and laborious, especially in high-throughput platforms that image large numbers of specimens. To solve this problem, autofocus methods have been implemented whereby the microscope captures a stack of images (10-20) at different defocus positions. Each image in the stack is then evaluated with an image quality measure, and the position with the highest score is considered the in-focus position [23]. Previous studies have extensively evaluated the performance of image quality measures, and for LSFM, the Shannon entropy of the normalized discrete cosine transform (DCTS) shows superior results [5,24]. Nevertheless, the requirement to capture 10-20 images slows the acquisition process and can lead to photo-bleaching in sensitive samples. Additionally, spherical aberrations are more likely in tissue clearing applications, which use a diverse range of immersion media (RI 1.38-1.58). In the presence of spherical aberrations, the performance of traditional image quality measures degrades [25]; however, even in this case DCTS still shows superior results (see Fig. S1).
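The DCTS measure can be sketched in a few lines: compute the 2D DCT of the image, L2-normalize the coefficients, and take the Shannon entropy of their magnitudes; the defocus position maximizing the score is taken as in focus. The sketch below follows the published definition only loosely (the optional band-limit radius `r0` and all parameter names are ours) and is meant as an illustration, not the reference implementation:

```python
import numpy as np
from scipy.fft import dctn  # SciPy >= 1.4


def dcts(image, r0=None):
    """Shannon entropy of the normalized 2D DCT of `image` (higher = sharper).

    A blurred image concentrates its DCT energy in a few low-frequency
    coefficients (low entropy); a sharp image spreads energy across many
    coefficients (high entropy).
    """
    d = dctn(np.asarray(image, dtype=np.float64), norm="ortho")
    d = d / np.linalg.norm(d)  # scale-invariant normalization
    if r0 is not None:
        # Optionally restrict to the low-frequency support |k| < r0
        ky, kx = np.ogrid[: d.shape[0], : d.shape[1]]
        d = np.where(np.hypot(kx, ky) < r0, d, 0.0)
    a = np.abs(d)
    a = a[a > 1e-12]  # drop zeros to avoid log(0)
    return float(-np.sum(a * np.log(a)))
```

In a stack-based autofocus routine, `dcts` would be evaluated on each of the 10-20 defocused frames and the frame with the maximal score selected.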
Deep learning has recently been used to solve numerous computer vision problems (e.g., segmentation and classification) [26-28] and to enhance the quality of biomedical images [29-34]. Several studies have used deep learning to perform autofocus from a single frame, mostly on histopathology slides acquired with a bright-field microscope [35-37]. Yang et al. proposed using a classification network to perform single-shot autofocus in thin fluorescence samples, and their results outperformed traditional image metrics [33]. A certainty measure was also introduced to determine whether the viewed patch contains an object of interest or background. However, this approach had yet to be extended to the challenging 3D samples acquired using LSFM, which are dominated by aberrations that confound traditional autofocus measures.
Here, building on previous work on a custom-built LSFM design [11,13], we introduce a deep learning-based autofocus algorithm that uses two defocused images to improve image quality during acquisition (Fig. 1(c)). The use of multiple images accelerates the network's training and provides more accurate results. We tested the effectiveness of our integrated framework using cleared whole mouse brain, pig cochlea, and lung samples. We show that our real-time autofocus framework performs well in thick cleared tissues with inherent scattering and spherical aberrations that are difficult to handle using traditional autofocus methods.

Sample preparation
In this study, three wild-type (WT) pig cochlea samples were tissue cleared and labeled using a modified BoneClear protocol [13,38], while one WT lung and three mouse brains were labeled using the iDISCO protocol [39-41]. The cochlea samples were labeled for Myosin VIIa (CY3 as secondary), while the brain samples were labeled for GFP (Alexa Fluor 647 as secondary) and RFP (CY3 as secondary). The mice for the brain samples were generated using Mosaic Analysis with Double Markers on chromosome 11 (MADM), as previously described in [42-45]. All animals were harvested under the regulation and approval of the Institutional Animal Care and Use Committee (IACUC) at North Carolina State University.

Image acquisition
Samples were imaged using a custom-built LSFM [13]. After tissue clearing, specimens were placed in an imaging chamber made from aluminum and filled with 100% dibenzyl ether (DBE). Samples were mounted on a compact 4D stage (ASI; stage-4D-50), which incorporated three linear translation stages and a motorized rotation stage. The stage scanned the sample across the static light-sheet to acquire a 3D image. The light-sheet was generated by a 561 nm laser beam (Coherent OBIS LS 561-50; FWHM = 8.5 µm) that was dithered at a high frequency (600 Hz) by an arbitrary function generator (Tektronix; AFG31022A) to create a virtual light-sheet. The detection objective lens (10×/numerical aperture (NA) 0.6, Olympus; XLPLN10XSVMP-2) was placed on a motorized linear translation stage (Newport; CONEX-TRB12CC with SMC100CC motion controller), which provides a 12 mm travel range with ±0.75 µm bi-directional repeatability.
For each defocused image stack used in the training and testing stages (∼420 stacks), the objective lens was first translated (CONEX-TRB12CC motor) by the user to find the optimal focal plane. Once the optimal position was found by eye, the control software automatically collected a stack of 51 defocused images with 2 µm spacing between consecutive images. The optimal focal point, determined by the microscope's operator, was at the middle of the stack. All stacks were acquired at random depths and spatial locations along the specimens, with a pixel size of 0.65 × 0.65 µm² and a 10 ms exposure time. Figure 2(a) shows representative defocused stacks used for the network's training. From Fig. 2(a), we observed that images taken above and below the focal plane have distinct features, i.e., an asymmetrical point spread function (PSF), which suggested that the network could determine whether ∆z was negative or positive.
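The ground-truth ∆z label for each frame follows directly from the stack geometry described above (51 frames, 2 µm spacing, operator-chosen focus at the center frame). A minimal sketch (function name ours):

```python
def frame_defocus(index, n_frames=51, spacing_um=2.0):
    """Ground-truth defocus (in um) for frame `index` of an acquired stack.

    The center frame, (n_frames - 1) // 2, is the operator-chosen focal
    plane (dz = 0); frames before/after it carry negative/positive dz in
    `spacing_um` steps, so a 51-frame, 2-um stack spans -50 ... +50 um.
    """
    center = (n_frames - 1) // 2
    return (index - center) * spacing_um
```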

Architecture of the network and its training process
The classification network's architecture is presented in Fig. 2(b). The aim of the network was to classify an unseen image into one of 13 classes. Each class represented a different range of ∆z; for instance, if the bin size (∆b) was equal to 6 µm, the in-focus class corresponded to ∆z values in the −3 to 3 µm range, and the class center points were −36, −30, −24, −18, −12, −6, 0, 6, 12, 18, 24, 30, and 36 µm. The network's architecture was previously presented in [33]. Here, we modified the network to accept multiple defocused images as input, instead of only one image. To train the network, 421 defocused image stacks were acquired: 337, 42, and 42 datasets were dedicated to training, validation, and testing, respectively. The network was implemented in Python 3.6 with the PyTorch 1.4.0 deep learning library. The network was trained on an Nvidia Tesla V100-32GB GPU on Amazon Web Services for ∼35 hours. The cross-entropy loss function was selected, the learning rate was 1e-5, and the Adam optimizer was used. Data augmentation techniques, including normalization, saturation, random crop, and horizontal and vertical flips, were applied during the training process. Figure 2(a) illustrates the network's training process. Two defocused images with ∆s spacing (e.g., ∆s = 6 µm) and known defocus distance ∆z were randomly selected from the defocused stack (I∆z and I∆z+∆s). Then a random region of interest (128 × 128 pixels) was selected and cropped from the two images. The two cropped image patches were fed into the network for training, while the known defocus distance ∆z served as the ground truth. The output of the model was a probability distribution {p_i, i = 1, 2, …, N} over N = 13 classes (or defocus levels), and the predicted defocus level (∆z_predict) was the one with the highest probability. The spacing (∆b) between the classes (or defocus levels) in the output of the network was given by ∆b = 72 µm / (N − 1).
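The class grid can be sketched in a few lines; only the numbers (N = 13, the ±36 µm range, and hence ∆b = 6 µm) come from the text, while the constant and function names are ours:

```python
N_CLASSES = 13
DZ_RANGE_UM = 72.0                         # classes span -36 ... +36 um
DELTA_B = DZ_RANGE_UM / (N_CLASSES - 1)    # bin size: 72 / 12 = 6 um


def dz_to_class(dz_um):
    """Map a defocus distance (um) to its class index (0 -> -36, 12 -> +36).

    The in-focus class (index 6) thus covers dz in the -3 ... +3 um range.
    During training, the two defocused patches (I_dz and I_{dz+ds}) would
    be stacked along the channel axis of the network input.
    """
    idx = round((dz_um + DZ_RANGE_UM / 2) / DELTA_B)
    return int(min(max(idx, 0), N_CLASSES - 1))


def class_to_dz(idx):
    """Inverse mapping: class index back to its center defocus (um)."""
    return idx * DELTA_B - DZ_RANGE_UM / 2
```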
Please note that the number of defocus levels (N) was determined empirically, and it set how fine the correction was. A larger N (e.g., N = 19 and ∆b = 4 µm) could theoretically provide a better overlap between the objective focal plane and the light-sheet. Nevertheless, in our case, it was difficult to observe large differences in image quality between two images separated by a distance smaller than 6 µm (Fig. S2). Therefore, a ∆b smaller than 6 µm would not necessarily provide better image quality after the correction. This is because, as long as the objective focal plane remained approximately within the light-sheet full width at half maximum (FWHM; on average ∼14 µm across the entire field of view), the image remained sharp.

Measure of certainty
A valuable measure to calculate from the probability distribution (p_i) was the measure of certainty (cert), with the range [0, 1]. Cert was calculated as described in [33,46].

Fig. 2 (caption fragment): (a) Spherical aberrations lead to an asymmetrical point spread function (PSF) for defocused images above (∆z > 0) and below (∆z < 0) the focal plane. The network uses this PSF asymmetry to estimate whether ∆z is positive or negative. In the training process, two defocused images with a constant distance between them (∆s; e.g., ∆s = 6 µm) and a known ∆z are randomly selected from the stacks. The images are then randomly cropped into smaller image patches (128 × 128), and these patches are fed into the network. (b) The architecture of the network. The output of the network is a probability distribution function over N = 13 different values of ∆z with a constant bin size (∆b), which equals 6 µm. The value of N was determined empirically.
A low cert value corresponded to a probability distribution (p_i) close to a uniform distribution, which translated to low confidence in ∆z_predict; therefore, predictions with cert below 0.35 were discarded. In contrast, a high cert value corresponded to the case where the network was more certain of its prediction, for example, when the maximum p_i was much higher than the remaining probabilities.
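The exact cert formula is given in refs. [33,46] and is not reproduced here. As a placeholder that matches the behavior described (0 for a uniform distribution, 1 for a one-hot prediction), one minus the normalized Shannon entropy can be used; this is an assumption, not necessarily the published definition:

```python
import numpy as np

CERT_THRESHOLD = 0.35  # predictions below this are discarded (as in the text)


def certainty(p, eps=1e-12):
    """Certainty in [0, 1] from a class probability vector `p`.

    NOTE: placeholder definition (1 - normalized Shannon entropy); the
    published cert measure is defined in refs. [33,46]. This variant is
    ~0 for a uniform distribution (low confidence in dz_predict) and ~1
    for a one-hot prediction (high confidence).
    """
    p = np.asarray(p, dtype=np.float64)
    p = p / p.sum()
    h = -np.sum(p * np.log(p + eps))   # Shannon entropy (nats)
    return float(1.0 - h / np.log(p.size))
```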

Integration with a custom-built light-sheet microscopy
The control software and graphical user interface (GUI) of the LSFM were implemented in the MATLAB R2019b environment, which can integrate directly with a deep learning model trained in Python.

Performance of the network with various input configurations
We investigated various training configurations and their influence on the network's performance. First, we tested how the number of defocused images fed into the deep neural network (DNN) influenced the classification accuracy (Fig. 3(a)). Figure 3(a) shows the training loss and the classification accuracy as a function of the number of epochs for 1, 2, and 3 defocused images (blue, yellow, and green graphs, respectively). In these experiments, the DNN output was a probability distribution over 13 classes (defocus levels) with ∆z values ranging from −36 µm to 36 µm, with ∆b = 6 µm and ∆s = 6 µm (see Methods section). We found that when the DNN received two or three defocused images as input, it performed better in terms of classification accuracy than with only one defocused image (Fig. 3(a)). The resulting confusion matrices of the three models, trained on 1, 2, and 3 defocused images, are shown in Fig. 3(b). When observing the confusion matrix for a single defocused image, we noted the following: (i) the distribution around the main diagonal (upper left to bottom right) was not as tight as in the other two confusion matrices, consistent with its lower classification accuracy; (ii) the values on the second diagonal (upper right to bottom left) were higher in comparison with the other two confusion matrices, indicating that the predicted |∆z_predict| was correct, but not its sign. To balance the tradeoff between the network's performance and the acquisition time of additional defocused images, we decided to proceed with 2 defocused images rather than 3, although the DNN trained with 3 defocused images showed slightly higher classification accuracy. Next, we tested the performance of the network with different ∆s values (2, 4, 6, 8, and 10 µm). Figure 3(c) shows the training loss and classification accuracy as a function of the number of epochs (N = 13, ∆b = 6 µm).
The graphs show that when ∆s equaled 6 or 10 µm, the classification accuracy was higher than for other values of ∆s. Therefore, ∆s was set to 6 µm henceforth.

DNN performs better or comparable to traditional autofocus quality measures
To determine the prediction accuracy of the proposed model, it was compared with traditional autofocus methods (Fig. 4(a); 2 defocused images, ∆s = ∆b = 6 µm). For the test cases, image patches with a size of 83 × 83 µm² (single patch) and 250 × 250 µm² (3 × 3 patches) were randomly cropped from the 42 defocus stacks dedicated to testing. While the DNN used only two defocused images, the full defocused stack (i.e., 13 images) was provided to the traditional autofocus measures, which included: Shannon entropy of the normalized discrete cosine transform (DCTS), Tenengrad variance (TENV), steerable filters (STFL), Brenner's measure (BREN), variance of wavelet coefficients (WAVV), image variance (VARS), and variance of Laplacian (LAPV). The average absolute distance between ∆z_predict and the ground truth was calculated to compare the DNN and the traditional metrics. Please note that for the larger patches, ∆z_predict was calculated as the average result of 9 (3 × 3) patches with a predefined certainty threshold of 0.35, i.e., any tile with a certainty score below 0.35 was discarded from the average. For a single patch, the DNN and DCTS, the best of the traditional methods, achieved average distance errors of 4.84 and 5.36 µm on the test dataset, respectively (Fig. 4(a)). When larger images (250 × 250 µm²) were tested, the DNN and DCTS both achieved an average distance error of 3.80 µm. Our DNN model presented better or comparable results relative to the tested autofocus metrics and required only two images, whereas the other metrics required a full stack of images. Table S1 compares the performance of the DNN and DCTS under various conditions, such as providing the DCTS with 9 (3 × 3) smaller patches and averaging the results with or without certainty. Overall, the DNN performed better or comparably under all conditions. Figure 4(b) shows representative single patches from the test set; their ∆z_predict values are indicated with a color-coded border. If the certainty was smaller than the threshold (0.35), the prediction was discarded, and the border is not shown. The inference time for a single patch on our modest computer (Intel Xeon W-2102 CPU, 2.90 GHz) running MATLAB was ∼0.18 sec.

Fig. 3. Training configurations that influence the classification accuracy. (a) Training loss and classification accuracy as a function of the number of epochs when the network is trained with one, two, or three defocused images as input (N = 13, ∆s = 6 µm). Two (I∆z and I∆z+6µm) and three (I∆z−6µm, I∆z, and I∆z+6µm) defocused images yield higher classification accuracy than a single defocused image (I∆z). (b) Confusion matrices for different numbers of defocused input images. Training with only one defocused image shows inferior performance. (c) Training loss and classification accuracy as a function of the number of epochs using 2 defocused images as input, but with variable spacing (∆s) between the images. The highest classification accuracy corresponds to ∆s values of 6 and 10 µm.
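The certainty-gated patch averaging and the average-absolute-distance comparison can be sketched as follows (function names ours; the 0.35 threshold comes from the text):

```python
import numpy as np


def gated_prediction(dz_preds, certs, threshold=0.35):
    """Average per-patch defocus predictions, ignoring low-certainty patches.

    dz_preds: predicted dz (um) for each of the 3x3 patches
    certs:    the network's certainty score for each patch
    Returns None when every patch falls below the threshold.
    """
    dz_preds = np.asarray(dz_preds, dtype=np.float64)
    keep = np.asarray(certs, dtype=np.float64) >= threshold
    if not keep.any():
        return None
    return float(dz_preds[keep].mean())


def mean_abs_error(predictions, ground_truth):
    """Average absolute distance (um) between dz_predict and ground truth,
    skipping test cases whose prediction was discarded entirely."""
    pairs = [(p, g) for p, g in zip(predictions, ground_truth) if p is not None]
    return float(np.mean([abs(p - g) for p, g in pairs]))
```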

Real-time integration of the deep learning-based autofocus method with LSFM
Based on the performance of the defocus prediction model, we integrated the model with our custom-built LSFM. To test the model, we performed perturbation experiments on tissue cleared mouse brain (Fig. 5(a)) and cochlea (Fig. 5(b)) samples. In the experiments, the objective lens was displaced by 30 and −30 µm for the brain and cochlea, respectively. Then, two defocused images were captured and fed into the trained model. In these real-time cases, the network used 64 (8 × 8) patches per image, and the averaged ∆z_predict was calculated as follows: ∆z_predict = (∆z1 · S1 + ∆z2 · S2)/(S1 + S2), where ∆z1 and ∆z2 were the two most abundant classes in the whole image, and S1 and S2 were the corresponding numbers of patches with the predicted labels ∆z1 and ∆z2, respectively. This approach removed outliers. The model's per-patch ∆z_predict values are presented in Fig. 5(a4 and b4), and the corresponding averaged ∆z_predict values were 26.9 and −35.3 µm for the brain and cochlea, respectively. According to the averaged ∆z_predict value, the detection focal plane was adjusted. Figure 5(a3 and b3) shows the improvement in image quality after the applied corrections, which the color-coded line profiles in Fig. 5(a4 and b4) further illustrate.
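The weighted average over the two most abundant classes described above can be sketched as follows (function name ours; patches below the certainty threshold are assumed to have been filtered out beforehand):

```python
from collections import Counter


def fused_defocus(patch_dz):
    """Fuse per-patch class predictions into one dz, as in the text:
    take the two most abundant predicted classes (dz1, dz2) with patch
    counts S1, S2 and return (dz1*S1 + dz2*S2) / (S1 + S2). Patches
    voting for any other class are ignored, which removes outliers.
    """
    top_two = Counter(patch_dz).most_common(2)
    if len(top_two) == 1:          # all patches agree on one class
        return float(top_two[0][0])
    (dz1, s1), (dz2, s2) = top_two
    return (dz1 * s1 + dz2 * s2) / (s1 + s2)
```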

Performance of the deep learning model on unseen tissue
Finally, to evaluate the deep learning model's ability to generalize to unseen tissue types, we used the same model on a tissue cleared mouse lung sample. By and large, the lungs exhibited different morphology than the brain and cochlea samples that the model was trained on. The lung's tissue structure can be seen in Fig. 6(a and b). We again performed the real-time perturbation experiment, and the objective lens was translated by −30 and −20 µm in Fig. 6(a2 and b2), respectively. The model's per-patch ∆z_predict values are presented in Fig. 6(a4 and b4), and the averaged ∆z_predict values were −34.2 and −33.57 µm, respectively. We corrected the position of the objective based on the averaged ∆z_predict values, as seen in Fig. 6(a3 and b3). Although several patches in Fig. 6(b4) had questionable predictions, the overall image quality was improved after the network's correction.

Fig. 6 (caption fragment): (a1 and b1) The in-focus auto-fluorescence images of tissue cleared mouse lung samples, which are highly scattering. These samples exhibit different morphology than the brain and the cochlea, and the network was not trained on such samples/morphology. (a2 and b2) Images showing the same field of view as in a1 and b1 after the objective lens is displaced by −30 µm and −20 µm, respectively. (a3 and b3) Images of the same field of view after the objective is moved according to the network correction shown in a4 and b4. The improved image quality in a3 and b3 indicates that the network can correctly estimate the defocus level and adjust the detection focal plane to improve image quality. Although further refinement might be required, the network can still generalize to unseen tissue types. Please note that in tissue cleared lung samples the auto-fluorescence is easily photo-bleached, making them especially suitable for autofocus methods that require as few defocused images as possible.

Discussion and conclusion
Here, we build upon previous work on thin 2D slides that used a DNN to measure image focus quality from a single frame [33]. We extend the use of the DNN to 3D samples and demonstrate the advantages of using two or three defocused images rather than a single image: first, the network performs better in terms of classification accuracy and convergence speed (Fig. 3(a)); second, the network minimizes sign errors, i.e., the DNN can determine whether the light-sheet is above or below the objective focal plane. Using two defocused images, and after optimizing the spacing between them, we find that on small image patches (∼83 × 83 µm²) the network outperforms DCTS, which requires a full stack of defocused images (∼13 images). Therefore, using only two images can significantly increase imaging speed and reduce photo-bleaching in sensitive samples (e.g., single-molecule fluorescence in situ hybridization). On large image patches (∼250 × 250 µm²), the network provides comparable results to DCTS. Another advantage of the proposed method is that it inherently provides a measure of certainty in its prediction; consequently, one can exclude image patches that may contain background or low-contrast objects. In fact, when we exclude low-certainty cases, accuracy improves (Table S1).
As a proof-of-concept experiment, the network is integrated with a custom-built LSFM. We demonstrate that the network performs reasonably well not only on tissue cleared mouse brain and cochlea but also on unseen tissue. The proposed approach can facilitate the effort to characterize large volumes of tissue in 3D, without a tedious and manual calibration stage that is performed by the user prior to imaging the sample.
A major limitation of the presented approach is the drop in its performance on unseen samples (specimen types outside the training set). This limitation is expected, as new specimens likely exhibit unique morphologies and distinct features, such as the lung samples in Fig. 6. There are several approaches to mitigate this challenge: (i) given that acquiring and labeling the dataset is relatively straightforward, one can train a network per specimen type; this approach is reasonable for experiments that require imaging many instances of the same specimen. (ii) Diversifying the training set with a plethora of specimens under multiple imaging conditions, such as multiple exposure levels. (iii) Synthesizing data for the training set instead of physically capturing it. This could be achieved by using publicly available 3D confocal microscopy datasets, which do not require synchronization between the light-sheet and the objective focal plane. From the confocal datasets, defocused images could then be synthesized either by employing a physical model to defocus the image [47] or by utilizing generative adversarial networks (GANs). Utilizing GANs for data augmentation would make it possible to learn the LSFM distortion and synthesize artificial training sets [48,49].