Deep learning-based single-shot autofocus method for digital microscopy

Digital pathology is being transformed by artificial intelligence (AI)-based pathological diagnosis. One major challenge for correct AI diagnoses is to ensure the focus quality of captured images. Here, we propose a deep learning-based single-shot autofocus method for microscopy. We use a modified MobileNetV3, a lightweight network, to predict the defocus distance from a single microscopy image acquired at an arbitrary image plane, without a secondary camera or additional optics. The defocus prediction takes only 9 ms with a focusing error of only ∼1/15 of the depth of field. We also provide implementation examples for the augmented reality microscope and the whole slide imaging (WSI) system. Our proposed technique can perform real-time and accurate autofocus, which will not only support pathologists in their daily work but also provide potential applications in the life sciences, material research, and industrial automatic detection.


Introduction
In 2018, Google announced an augmented reality microscope (ARM) with real-time artificial intelligence integration for cancer diagnosis [1]. As the microscope is the most important tool for pathological diagnosis, the ARM has the potential to decrease the variability of pathological assessments and to alleviate the shortage of trained pathologists in regions such as rural areas [2]. However, defocus blur can greatly deteriorate the image quality and introduce tissue detail loss, thereby decreasing the reliability of the ARM or related AI-based microscopes. For the ARM, defocus blur can occur due to the optical path length difference between the eyepiece ports and the camera port [3]. Pathologists are not trained to adjust the parfocality of the microscope, and keeping the camera focused while simultaneously reviewing the slides through the eyepiece is difficult in practice.
The US Food and Drug Administration (FDA) announced the approval of the first whole slide imaging (WSI) system for primary diagnosis in surgical pathology [4] in 2017. The WSI system has undergone a period of exponential growth for quantitative and streamlined slide reviewing [5]. We can regard the WSI system as a motorized high-capacity microscope with autofocus and automatic slide loading functions. Although robust and high-throughput WSI systems are commercially available, their scanning speed is slow and their acquisition of well-focused digital slides remains inconsistent [6,7]. Pre-scanning a sample to acquire a focus map is the most widely adopted autofocus method for current WSI systems [8]. The focus map surveying requires z-stack acquisitions for focus plane estimation. Yet, the axial scanning of multiple images is time-consuming. Skipping tiles can shorten the focus map surveying time, but at the cost of focus map accuracy [8].
In addition to the conventional time-consuming focus searching method through axial scanning, a variety of new autofocus methods for microscopy have emerged in recent years, which can be divided into three categories. The first category introduces additional illumination sources [9][10][11][12][13] and cameras [14][15][16] to the original microscope light path for defocus estimation. For example, the Nikon Perfect Focus System introduces an additional infrared LED and a linear sensor to track the position of the slides [9]. This method adds cost and complexity to the system and only works for 2D thin slides. The second category is image-based and requires no additional optics but multiple shots for defocus estimation [17][18][19]. For example, Dastidar et al. use the difference of two shots at different focal planes and deep learning for defocus estimation [17]. The third category generates a virtual in-focus image from the input blurry image using deep learning instead of estimating the defocus distance [20][21][22][23][24]. These methods require no additional hardware or multiple shots to refocus the image. For example, Wu et al. trained a network to virtually refocus a two-dimensional fluorescence image onto a user-defined focal plane within the sample [20]. Luo et al. use a deep learning-based offline autofocusing method that efficiently and blindly autofocuses a single-shot microscopy image of a specimen captured at an arbitrary out-of-focus plane [24]. An image-generating approach becomes more time-consuming as the image size increases. Also, there are always some doubts about virtually generated images, and they may not be accepted for critical tasks.
In this paper, we demonstrate a deep learning-based single-shot autofocus method for focal plane estimation that requires no modifications to the original microscopy system. This approach only requires one image captured at an arbitrary plane by the inherent camera of the microscope to determine the focal plane. One can freely choose a motorized Z-stage, a piezoelectric stage, or a tunable lens to perform the focus adjustment. Our method shows that a neural network can be trained to predict how far out of focus a microscope is, based on a single image taken at an arbitrary defocus distance. In the network's training phase, a motorized stage is used to collect z-stack images to train a modified MobileNetV3_small [25], a lightweight neural network, which achieves a focusing error of 1/15 of the depth of field (DOF) for each 672*672 image patch. By measuring several image patches in one high-resolution image, the final focusing accuracy and robustness can be further improved.
We provide a specific autofocus implementation scheme on the ARM as an application example. Furthermore, we demonstrate that our autofocus method can also be applied to focus map surveying for WSI systems. We believe our work can help make AI the technological 'right hand' of pathologists. Our single-shot autofocus method is universal and can be applied to other imaging fields such as time-lapse live-cell imaging and material research. Not limited to microscopy, it may also find future applications in industrial depth estimation and autopilot software.

Single-shot autofocus for augmented reality microscopy
ARM overlays AI-generated information onto the current view of the sample in real-time, enabling seamless integration of AI into routine pathological workflows [1]. One of the serious problems encountered by this system in clinical trials is that defocused images lead to unreliable AI-diagnosis results, so an autofocus function is essential for the ARM. Another complaint is that an ARM can be too high for a comfortable sitting posture since the two parallel light paths (i.e., the image acquisition layer and AR projection layer) add to the height of the benchtop microscope as shown in Fig. 1(a). Therefore, we aim to make improvements to the ARM to create a more practical system. First, we remove the image acquisition layer in the parallel light path and place the camera in the standard camera port on top of the microscope as shown in Fig. 1(b). This reduces the original height increase of the ARM by half. We also use two crossed polarizers to block all light from the ARM screen from entering the camera. Second, we add an autofocus function using a deep learning network to estimate the defocus distance from a single image captured at an arbitrary image plane. Then, we use the liquid tunable lens to adjust the focus rapidly to complete the autofocus process. Our modified ARM is based on the Olympus BX43 microscope and the Lumenera Lt425 camera with a resolution of 2048*2048 pixels. The model of the tunable lens is the Optotune EL-16-40-TC with a 16 mm aperture. A customized 0.4X adaptor connects the tunable lens with the camera port. The augmented reality screen is the Sony ECX335S microdisplay. Figure 1(c) and (d) compare the workflow between the original ARM and our autofocus ARM. An immunohistochemistry (IHC) image captured by the original ARM at an arbitrary axial plane under a 10X/NA0.3 objective lens is directly sent for IHC AI processing. On the other hand, for our autofocus ARM, the captured image is first cropped to seven patches sized 672*672. We then predict the defocus distance of the seven patches respectively using a pre-trained defocus distance estimation network, a lightweight deep learning network modified from the MobileNetV3_small [25]. The tunable lens is responsible for focus adjustment according to the averaged predicted defocus distances. In Fig. 1(c) and (d), we also compare the IHC AI-detection [2] results of the view indicated by the red box for the image before autofocus and the autofocused image. In the raw image, only 170 positive tumor cells and five negative tumor cells are detected. In the autofocused image, 13,370 positive tumor cells and 952 negative tumor cells are detected. We present a partially enlarged image for a better visual perception. Please refer to Appendix B for more detail about our IHC AI algorithm.
Fig. 1. (a) The image acquisition layer and AR projection layer are added to a conventional microscope. First, the raw image captured by the camera is sent as input for AI processing. The outcome AR contents, for example, contours and text, are sent to the AR display. The user can observe the raw image overlaid with AR contents at the eyepiece port thanks to the beam splitters. (b) The setup scheme of our autofocus ARM. We install the camera and the tunable lens at the standard camera port of the conventional microscope. Only the AR projection layer is inserted into the infinity space. We use a pair of crossed polarizers to block the light of AR display from entering the camera. (c) Workflow of the original ARM: pathology AI algorithms are applied to the raw image without autofocus. (d) Workflow of our autofocus ARM: we first use a modified MobileNetV3 to estimate the defocus distance of the raw image. Then, we adjust the image focus through the liquid tunable lens. The in-focus image is then captured as the new input for pathology AI algorithms.
As shown in Fig. 1(d), the defocus prediction network is based on MobileNetV3_small [25], a lightweight deep learning network, which is suitable for the real-time autofocus requirement of the ARM. We make the following modifications to the original MobileNetV3_small: First, we change the input size from 224*224 to 672*672 to cover a larger field of view for reliable prediction. Second, we change the classification output to a regression output. In the training phase, taking the autofocus under a 10X objective lens as an example, we capture focal stacks of in-focus and defocused images using the HeidStar HDS-BFS-BX43-PRO-1, a motorized Olympus BX43 microscope equipped with a 10X/NA0.3 objective lens. The testing instrument, i.e., the autofocus ARM, has the same setup as the training data collection instrument except that the stage of the latter is motorized. We capture 1500 z-stacks of pathological images in total, including IHC, Thinprep Cytology Test (TCT), and Hematoxylin and Eosin (H&E) slides. The same number of focus stacks is taken for each of the three pathological image types (IHC, TCT, H&E). We divide the data into training, validation, and prediction sets at a ratio of 8:1:1, respectively. Each z-stack contains 25 images ranging from -36 µm to +36 µm with a step size of 3 µm. The "-" is facing away from the objective lens and "+" is facing the objective lens. Each raw image is cropped to seven 672*672 patches, which is the input size of the network. The axial step size, 3 µm, is not small enough compared to the 10 µm depth of field of the 10X/NA0.3 objective. To achieve better continuity of the defocus level when capturing z-stack images, we alternately capture stacks ranging from -37 µm to +35 µm, -36 µm to +36 µm, and -35 µm to +37 µm. To label each image with its defocus distance, we use a Brenner Gradient [26,27] to locate the focal plane with subpixel resolution. The label value is the ground truth of the defocus distance for the image.
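The labeling step described above can be illustrated with a minimal sketch. It computes the Brenner gradient of each image in a z-stack and refines the sharpest plane with a three-point parabolic fit. The parabolic refinement, the label sign convention (image z minus focal z), and the function names are assumptions made for illustration; the text only states that a Brenner Gradient is used to locate the focal plane with subpixel resolution.

```python
def brenner_gradient(image):
    """Brenner focus measure: sum of squared differences between
    pixels two columns apart. `image` is a list of rows of gray values."""
    total = 0
    for row in image:
        for x in range(len(row) - 2):
            diff = row[x + 2] - row[x]
            total += diff * diff
    return total


def locate_focal_plane(z_positions, sharpness):
    """Return the z of peak sharpness, refined by fitting a parabola
    through the peak and its two neighbours (assumed refinement step)."""
    k = max(range(len(sharpness)), key=lambda i: sharpness[i])
    if k == 0 or k == len(sharpness) - 1:
        return float(z_positions[k])  # peak at the border: no refinement
    s_prev, s_peak, s_next = sharpness[k - 1], sharpness[k], sharpness[k + 1]
    denom = s_prev - 2 * s_peak + s_next
    if denom == 0:
        return float(z_positions[k])  # flat neighbourhood: keep the sample
    step = z_positions[k + 1] - z_positions[k]
    offset = 0.5 * (s_prev - s_next) / denom  # vertex offset in units of step
    return z_positions[k] + offset * step


def defocus_labels(z_positions, sharpness):
    """Label every image in the stack with its signed distance to the
    estimated focal plane (sign convention is an assumption)."""
    focal_z = locate_focal_plane(z_positions, sharpness)
    return [z - focal_z for z in z_positions]
```

In use, `sharpness` would hold `brenner_gradient(img)` for each of the 25 images in a stack, and `z_positions` the corresponding stage positions in µm.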
We train four autofocus networks in total: one model each for IHC, TCT, and H&E, and one mixed model trained with all data combined. As shown in Fig. 2(a)-(c), the focusing errors of the IHC, TCT, and H&E models are 0.82 ± 0.61 µm, 0.73 ± 0.75 µm, and 0.73 ± 0.61 µm, respectively. The mixed model provides the best autofocus performance with a focus error of 0.68 ± 0.58 µm, as shown in Fig. 2(d). The depth of field of the 10X/NA0.3 objective lens is 10 µm. Figure 2(a-d) indicates that we can distinguish not only different degrees of defocus but also positive defocus from negative defocus. We will further discuss the mechanism of distinguishing positive and negative defocus with a single shot in the Discussion section.
The deep learning networks we compare are among the most cited networks for classification, object detection, and semantic segmentation in recent years. We can divide these networks into two types: heavy networks such as the ResNet50 and DenseNet121, and lightweight networks such as the MobileNetV3 and ShuffleNetV2. The ghost module and FFT module are integrated into the architecture of the ResNet50 and MobileNetV3 with verified performance improvement [29,31]. The same training data (100 z-stacks of H&E images) and the same training conditions (100 training epochs, Adam optimizer, multistep learning-rate strategy, smooth L1 loss function, etc.) are configured to guarantee a relatively fair comparison. Figure 2(e) shows that most of the tested networks demonstrate good defocus estimation ability after training (mean error < half DOF). MobileNetV3_small performs best at defocus estimation. We believe that this benefits from the advanced design of MobileNetV3: the optimal number of convolution kernels and channels obtained using NetAdapt, the depthwise separable convolution and the residual structure with linear bottleneck inherited from V1 and V2, and the newly introduced hard-swish activation function, which has been verified to effectively improve the accuracy of the network [25].
Figure 2(f) shows that a larger input image size gives a more accurate output. However, the inference time increases exponentially as the input image size increases, as shown in Fig. 2(g). We choose an input size of 672*672 as a tradeoff between accuracy and efficiency.
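For readers unfamiliar with the smooth L1 loss named in the training conditions, the following sketch shows its standard form applied to defocus regression. The transition point `beta` is an assumed parameter (the value used in training is not stated), and the function names are illustrative only.

```python
def smooth_l1(predicted, target, beta=1.0):
    """Standard smooth L1 loss for one sample: quadratic for small errors,
    linear for large ones. `beta` (assumed here to be 1.0) is the error
    magnitude at which the loss switches from quadratic to linear."""
    diff = abs(predicted - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def mean_smooth_l1(predictions, targets, beta=1.0):
    """Average smooth L1 loss over a batch of predicted defocus distances."""
    losses = [smooth_l1(p, t, beta) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)
```

The quadratic region keeps gradients small near the correct defocus distance, while the linear region limits the influence of occasional large errors, which is a common reason to prefer it over a plain L2 loss for regression.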
The autofocus procedure on an ARM takes 59 ms in total: 25 ms for capturing an image and removing noise with a 3*3 median filter, 9 ms for defocus estimation, and 25 ms for focus adjustment with the tunable lens. The testing computer runs on Linux and has an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40 GHz. One Tesla P40 GPU (24G memory) is assigned to the autofocus module.

WSI focus map surveying using our single-shot autofocus method
WSI systems automatically image the whole slide, turning a physical slide into a digital one. This enables doctors to conduct remote pathological diagnoses and consultations away from the microscope. The most widely adopted autofocus method for WSI systems is to acquire a focus map at the beginning. As shown in Fig. 3(a), an external camera captures a scout image of the slide. Then, focus points spaced over the entire sample, excluding the background, are selected automatically. At each focus point, the system conventionally captures a z-stack to find the best focal plane. By interpolating the coarse focus map, a full focus map that guides the scanning for sharp WSI output can be obtained. The proposed single-shot autofocus network can be implemented on the WSI system as well, making the autofocus process of WSI much more efficient. Since a WSI system already has a built-in focus adjustment module, such as a motorized Z-stage or piezoelectric stage, no liquid tunable lens is required (unlike the previously mentioned autofocus ARM). Hence, we only need to introduce the defocus prediction network for WSI autofocus.
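As an illustration of how a coarse focus map can be turned into a focal height for any tile, the sketch below bilinearly interpolates between focus points on a regular grid. Bilinear interpolation and a regular grid are assumptions; the text does not specify which interpolation scheme is used, and the function name is hypothetical.

```python
from bisect import bisect_right


def interpolate_focus_map(xs, ys, z_grid, x, y):
    """Bilinearly interpolate the focal height at stage position (x, y).

    xs, ys : sorted coordinates of the focus points along each axis
             (at least two per axis).
    z_grid : z_grid[j][i] is the focal height measured at (xs[i], ys[j]).
    Positions outside the surveyed area are clamped to the nearest edge.
    """
    def bracket(coords, v):
        # Find the cell index and the fractional position inside that cell.
        i = bisect_right(coords, v) - 1
        i = max(0, min(i, len(coords) - 2))
        t = (v - coords[i]) / (coords[i + 1] - coords[i])
        return i, max(0.0, min(1.0, t))

    i, tx = bracket(xs, x)
    j, ty = bracket(ys, y)
    low = z_grid[j][i] * (1 - tx) + z_grid[j][i + 1] * tx
    high = z_grid[j + 1][i] * (1 - tx) + z_grid[j + 1][i + 1] * tx
    return low * (1 - ty) + high * ty
```

With the single-shot method, each entry of `z_grid` would come from one image and one network prediction at that focus point rather than from a full z-stack.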
The conventional autofocus method, as shown in Fig. 3(b), locates the focal plane within a z-stack by evaluating image sharpness. However, scanning the z-stack images is very time-consuming since it requires multiple (e.g. 10) axial mechanical moves and image acquisitions. We use the proposed single-shot autofocus method for focus map surveying. High accuracy of defocus distance estimation can be achieved in just 34 ms for each focus point, at least 25 times faster than conventional methods. We retrain the autofocus network for a WSI system equipped with a 20X/NA0.75 objective lens by capturing 1,000 H&E z-stacks for model training. Each z-stack ranges from -10 µm to 10 µm with a step size of 0.5 µm. We crop out nine 672*672 patches from each raw image and use image rotation plus image flipping for data augmentation. The model structure is the same as the modified version for our autofocus ARM. We use another 300 z-stacks for testing, as shown in Fig. 3(c), and get a focus error of 0.14 ± 0.12 µm. The depth of field of the objective is 1 µm. Figure 3(d) compares the time consumed to obtain the focus map for different numbers of focus points between the conventional z-stack autofocus method and ours. It indicates that as the number of focus points increases, the sampling of the focus map becomes more refined, but the time taken increases significantly. Our method needs less than 19% of the conventional focus map surveying time and spends most of the time (∼85%) on X-Y movement.
Fig. 4. Actual focus map surveying example using our single-shot autofocus approach.
Figure 4 shows a real example of a focus map acquired by our autofocus method. Figure 4(a) is the thumbnail image of a lymph node H&E sample. We acquire the focus map of a 10 mm*10 mm area as indicated by the red box in Fig. 4(a). For comparison, we survey two 16*16-point focus maps using our method and the conventional method, respectively. The results are shown in Fig. 4(b), with grey color indicating our method and blue color indicating the conventional method or ground truth. We use the scanning trajectory indicated by red arrows in Fig. 4(b) to ensure that every two consecutive scanning points are also adjacent in actual spatial position during the scanning process. Since the difference in defocus distance between adjacent points will not be too large, our trajectory, compared to the Zigzag scanning trajectory, can prevent the defocus of the next focus point from exceeding the prediction range of the defocus prediction network when changing lines. Figure 4(c) shows the error map, which is the difference between the focus map obtained by our method and the ground truth. The mean error of the focus map acquired with our approach is 0.28 ± 0.32 µm. The depth of field of the 20X/NA0.75 objective lens used is 1 µm. From the error map, most focus points have a focus error within 0.5 µm, such as the point indicated by the black arrow (x=8, y=15). The point indicated by the red arrow (x=13, y=13) in Fig. 4(c) has a focus error of -1.18 µm, which is larger than the depth of field. According to our focus map, we scan the entire slide to acquire the whole slide image. We show the scanned images of the regions indicated by the black and red arrows in Fig. 4(c) in Figs. 4(d) and (e), respectively. Figure 4(d) shows a typical successful case of autofocus with a focus prediction error of -0.02 µm. We can observe that the cropped images are successfully focused. In Fig. 4(e), we show a typical "failure" case with a focusing error of -1.18 µm. However, we can find that this image contains many thick areas (indicated by red arrows in Crop1) and folded areas (indicated by red arrows in Crop2) in Fig. 4(e). Crop1 and Crop2 both have a size of 672*672. We argue that in this case, the ground truth calculated by the conventional z-stack searching does not serve as a universal standard solution.
We recommend an axial scan near this plane to get a composite all-in-focus image. The advantage of our patch sampling approach is that the predicted defocus distances of the seven sub-fields of view (672*672) can tell us the focus distribution and variation of the original image (2K*2K). We have two strategies for handling different samples, including flat, uneven, and tilted ones. First, if the variation of the predicted defocus distances is smaller than the depth of field of the objective lens, we consider the field of view to be even and suggest using the averaged defocus distance to guide the focus adjustment. Second, if the variation of the predicted defocus distances is larger than the depth of field of the objective lens, we consider the field of view to be uneven and suggest an axial scan over the variation range of the predicted defocus distances.
We compare our autofocus method with state-of-the-art autofocus methods for microscopy in Table 1, including deep learning-based and non-deep learning-based methods. DOF stands for depth of field of the objective lens. Since these approaches do not have exactly the same experimental setup (e.g. magnification, NA, DOF), we use the ratio of focusing error to DOF for a relatively fair comparison. From Table 1, our single-shot autofocus method has the highest focusing accuracy among the deep-learning-based methods and requires the least modification to the conventional microscope.
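The two strategies for even and uneven fields of view can be sketched as a small decision function. Measuring the "variation" as the max-minus-min range of the patch predictions is an assumption (the text does not define the statistic), and the function name and returned dictionary layout are hypothetical.

```python
def plan_focus_adjustment(patch_defocus_um, depth_of_field_um):
    """Choose between a single focus move and an axial scan.

    patch_defocus_um  : predicted defocus distance of each patch, in µm.
    depth_of_field_um : depth of field of the objective lens, in µm.
    The spread is taken as max - min of the predictions (assumed metric).
    """
    lowest = min(patch_defocus_um)
    highest = max(patch_defocus_um)
    spread = highest - lowest
    if spread < depth_of_field_um:
        # Even field of view: one adjustment by the averaged defocus distance.
        mean = sum(patch_defocus_um) / len(patch_defocus_um)
        return {"mode": "single", "defocus_um": mean}
    # Uneven field of view: scan axially over the predicted range.
    return {"mode": "axial_scan", "scan_range_um": (lowest, highest)}
```

For example, seven patch predictions between 1.0 µm and 2.0 µm under a 10 µm depth of field would yield a single adjustment by their mean, whereas predictions spanning more than the depth of field would trigger an axial scan over that span.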

Discussion
In the scatter plots shown in Fig. 2 and Fig. 3, our single-shot deep learning autofocus method distinguishes the positive defocus and the negative defocus of the sample very well. However, when we use blur kernels (e.g. Gaussian blur) to simulate an out-of-focus image, there is no difference between the images on both sides of the focal plane, making it impossible to distinguish the defocus direction. The real-world defocus, which contains axial asymmetric spherical aberration and chromatic aberration, is more complicated than these common simulation methods. Based on our results, we deduce that the asymmetry allows distinguishing the focus direction of the real-world defocus image. To evaluate the asymmetry, we simulated the point spread functions (PSF) of different wavelengths at different focal planes in Fig. 5(a) with Zemax software by ray tracing. The objective lens is 10X/NA0.3.
As shown in Fig. 5(a), the PSFs are asymmetric on both sides of the focal plane. On the other hand, at the same focal plane, the PSFs of different wavelengths are also different. And this is where chromatic aberration comes from. We believe the asymmetric spherical and chromatic aberration is detectable by regular cameras (e.g. 5.5 µm pixel size), hence making the defocus directions distinguishable. In Fig. 5(b), we show the H&E images at different focal planes under a 10X/NA0.3 objective lens. We can observe color differences, especially at the white blank areas, at -20 µm and 20 µm defocus planes.
There are many autofocus techniques for microscopy, such as using additional autofocus illumination optics for defocus distance estimation [9][10][11][12]. However, many autofocus methods are not suitable for the ARM. For example, Pinkard et al. used one or a few off-axis LEDs to guide defocus distance prediction from a single shot using deep learning [10]. This is suitable for high-NA objective lenses, such as autofocus for WSI systems. As the NA shrinks for low-magnification lenses such as 10X and 4X, the room left for off-axis LEDs is very small. Also, the focusing accuracy will decrease as the NA shrinks. However, for the ARM, due to the depth-of-focus difference between the eyepiece and camera port, 4X and 10X are the applications that need autofocus most. In addition, additional illumination sources are not readily available as plug-and-play modules for current microscopes used in pathology. Dastidar et al. [17] proposed using the difference image of two shots for defocus distance prediction using deep learning. The required multiple shots at different focal planes are not efficient for the ARM, which has a strong requirement for real-time output. In this paper, we propose a single-shot autofocus method for the ARM using a lightweight deep learning network without introducing additional illumination sources or cameras. We install the liquid lens before the camera to solve the parfocal problem. We do not choose the motorized z stage or install focus adjustment hardware connected with the objective lens [33,34] to adjust the lens focus, since this does not decouple the camera's autofocus from the user's view through the eyepiece. The liquid lens is not the only option for rapid focus adjustment. Alternatively, a piezoelectric stage in front of the camera or an autofocus camera with a built-in electric stage for sensor axial movement is also an efficient choice for fast autofocus. Shortening the optical stack of the original ARM does not impact the autofocus performance. 
This design change helps make more room for the installation of the liquid lens or other autofocus tools. Another advantage is that shortening the stack is more user-friendly for pathologists by reducing the height of the microscope.
For focus map surveying of WSI system, the conventional method conducts axial scanning to find the focus, which is not efficient. However, our deep learning method requires only a single shot to predict the defocus distance of the current field of view. One can crop more than seven images (the default setup in this study) from a raw image to get more robust prediction results, and parallel computing will make the total prediction time almost unchanged.
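For focus map surveying, the visiting order of the focus points matters because each prediction starts from the focal height of the previous point. The sketch below generates one visiting order in which consecutive focus points are always spatial neighbours, by reversing direction on every other row. This serpentine order is an assumed example of a trajectory with the adjacency property described for Fig. 4(b), not necessarily the exact trajectory used there.

```python
def serpentine_order(n_cols, n_rows):
    """Return grid indices (x, y) in an order where every consecutive
    pair of points is adjacent: left-to-right on even rows,
    right-to-left on odd rows."""
    order = []
    for y in range(n_rows):
        if y % 2 == 0:
            cols = range(n_cols)
        else:
            cols = range(n_cols - 1, -1, -1)
        order.extend((x, y) for x in cols)
    return order
```

For a 16*16-point focus map, this gives 256 points with a stage move of exactly one grid spacing between any two consecutive points, so there is no long fly-back move at line changes.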
Compared with the state-of-the-art autofocus methods (Table 1), our autofocus method demonstrates a new idea for estimating the defocus distance: using a data-driven method to decode the defocus information from the captured raw image itself. In contrast, conventional methods use additional hardware such as extra illumination, or multiple shots, to modulate the axial defocus information onto image planes. The advantages of our autofocus method come from at least four aspects. First, we collect large training data (500 z-stacks each for H&E, IHC, and TCT). Second, instead of using the original image for focus prediction, we use the patch sampling approach to divide the field of view of the captured image, hence getting a finer and more accurate focus (e.g. for a tilted sample). We also update the defocus label for each patch to compensate for focus variations at different patches. Third, we use two strategies (single focus adjustment or axial scan) when handling uneven focus. Fourth, we benefit from the efficient network structure of the MobileNetV3_small.
Our autofocus method still shows room for improvement due to its novelty. One limitation is that we do not yet consider motion blur, which is caused by fast stage movement. The ability to predict the defocus distance of images with motion blur can avoid interrupting pathologists' slide reading process on the ARM. For WSI systems, the ability to predict the defocus distance of images with motion blur can also improve the focus map surveying efficiency, hence shortening the time of the pathological diagnosis cycle.
The single-shot autofocus method is universal and should not be limited to the ARM and WSI systems. For live-cell imaging or time-lapse imaging, a popular autofocus platform is the Nikon Perfect Focus System [9], which performs autofocus with a reference infrared beam to track the fluctuation of the slide surface. The drawback of the Nikon Perfect Focus System and related autofocus techniques is that if the live cell or other moving target grows or moves above or below the reference plane, the autofocus will fail. Since our autofocus method is image-based, we can perform consistent autofocus on the target of interest, such as a live cell, which is easy to locate and separate from the background.
In summary, we report a single-shot autofocus method using a lightweight deep learning network for microscopy. We also propose a specific scheme of autofocus ARM using the autofocus network and the liquid tunable lens, and we incorporate this autofocus method into a WSI system for focus map surveying. Compared with the state-of-the-art deep learning-based autofocus method, our approach is significantly more accurate and easier to deploy. Hence, our work will allow the ARM and related AI products to enter the pathology department to support the limited pathologist workforce. We believe our paper provides a new idea for autofocus that is not limited to ARM and WSI systems but can also find use in life science imaging, photography, and industrial machine vision.
Funding. Tencent AI Lab.

Disclosures. The authors declare no conflicts of interest.
Data availability. Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.