3D positioning and autofocus of the particle field based on the depth-from-defocus method and the deep networks

Accurate three-dimensional positioning of particles is a critical task in microscopic particle research, with one of the main challenges being the measurement of particle depths. In this paper, we propose a method for detecting particle depths from their blurred images using the depth-from-defocus technique and a deep neural network-based object detection framework called you-only-look-once. Our method simultaneously provides the lateral positions of the particles and has been tested and evaluated on various samples, including synthetic particles, polystyrene particles, blood cells, and plankton, even in a noisy environment. We achieved autofocus for target particles at different depths using generative adversarial networks, obtaining sharply focused images. Our algorithm can process a single multi-target image in 0.008 s, allowing real-time application. Our proposed method provides new opportunities for particle field research.


Introduction
Particle field positioning is a crucial task in various fields, such as biomedicine, materials science, and environmental engineering [1,2]. Despite its importance, determining the three-dimensional (3D) position of each particle in a field remains challenging. While the lateral position can be obtained using centroid localization or object segmentation algorithms [3,4], obtaining the longitudinal position is more difficult. Several methods have been proposed to address this problem, such as multiple particle-based imaging methods [5], precise calibration methods for vision measurement [6], and digital holography [7][8][9]. However, these methods have their limitations and are not always efficient or accurate. The depth-from-defocus (DfD) method has also been used to determine the longitudinal positions of the particle field. As early as 1987, Pentland used this method to investigate the depth of field [10]. The method can, however, suffer from defocus ambiguities. Zhou et al used the DfD method with a single-lens dual-camera system to achieve the 3D positioning of moving particles, which solved the defocus ambiguity problem, but the accuracy for fuzzy particles was low [11]. Barnkob et al applied the defocus-particle-tracking (DPT) method to derive the depth coordinates of the particle images from different defocusing patterns. The two most widely used DPT approaches are based on model functions (MFs) and cross-correlation (CC), respectively. MF-based methods work very well in low-concentration cases. CC-based methods are more robust when the particle concentration is high and particle image overlap is significant [12]. Rossi and Barnkob applied this method to identify particles and estimate their 3D position. However, the iterative steps used might reduce the processing speed [13]. Further research is needed on how to determine the 3D position of the particle field efficiently.
In recent years, deep learning (DL) has been applied to positioning. Franchini and Krevor applied a convolutional neural network (CNN) to the blurred images obtained by an astigmatic system to improve the detection accuracy [14]. Nehme et al used a CNN to extract the 3D localization of particles in fluorescence microscopy [15]. Dreisbach et al have recently utilized neural networks to increase the particle detection rate and reduce false positives, surpassing the capabilities of traditional detection algorithms [16]. Sachs et al have presented deterministic algorithms and deep neural networks that can recognize the size of up to four particle species simultaneously, with particle diameters ranging from 1.14 µm to 5.03 µm [17]. Leroy et al used both soft-assignment encoding and the DfD method to determine the intermediate depth of a single object from defocus-blurred images [18]. We previously combined the DfD method with an efficient CNN called EfficientNet, which has demonstrated exceptional performance in image classification and object detection tasks [19]. However, a manual process was sometimes needed for lateral positioning. In this paper, we overcome this limitation by integrating you-only-look-once (YOLO), a state-of-the-art object detection framework that enables accurate and efficient lateral positioning of particles [20,21]. Autofocus is another crucial aspect of particle field research [22][23][24], and we have incorporated generative adversarial networks (GANs) into our method to obtain clear and focused images of particles at different depths [25]. By combining these techniques, our proposed method enables accurate, efficient, and noise-robust 3D particle field positioning and autofocus. We believe that our method has significant potential to advance microscopic particle research and related fields.

Proposed method workflow and components
Our proposed approach for 3D particle field positioning, as illustrated in figure 1, combines the DfD method with a deep neural network-based object detection framework, YOLOv5. To prepare for training, we pre-processed the training images using labelImg [26], a widely used image annotation tool in DL. With labelImg, we annotated the category name and location of each object in the training images. Here, we treated depth as the category name and manually drew a bounding box around each sample with a known depth. The lateral position was derived automatically from the box, and the corresponding information was saved as XML-format files, ready for YOLOv5 network training.
Simultaneously, we trained a GAN to obtain clear, focused images of the particles. We designated the defocused images as domain A and the focused images as domain B and used both domains to train the GAN. The trained YOLOv5 network outputs the 3D positions of the particles, while the trained GAN generates focused images of the particle field. By integrating YOLOv5 and the GAN, our proposed method achieves accurate, efficient, and noise-robust 3D particle field positioning and autofocus.
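The annotation step described above can be illustrated with a minimal sketch that reads a Pascal VOC-style XML annotation of the kind labelImg can export and recovers one (x, y, depth) triple per box. The file content, the depth values, and the function name `parse_particles` are hypothetical and chosen only for illustration; this is not the authors' conversion code.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the Pascal VOC layout that labelImg can export.
# The class names (depths in micrometres) and all coordinates are made up.
EXAMPLE_XML = """<annotation>
  <filename>example.png</filename>
  <object>
    <name>-30</name>
    <bndbox><xmin>100</xmin><ymin>40</ymin><xmax>140</xmax><ymax>80</ymax></bndbox>
  </object>
  <object>
    <name>10</name>
    <bndbox><xmin>200</xmin><ymin>150</ymin><xmax>260</xmax><ymax>210</ymax></bndbox>
  </object>
</annotation>"""


def parse_particles(xml_text):
    """Return a list of (x_center, y_center, depth) tuples, one per labelled box."""
    root = ET.fromstring(xml_text)
    particles = []
    for obj in root.findall("object"):
        # Depth is stored as the category name of the box.
        depth = float(obj.findtext("name"))
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # The lateral position is taken as the centre of the box, in pixels.
        particles.append(((xmin + xmax) / 2, (ymin + ymax) / 2, depth))
    return particles


particles = parse_particles(EXAMPLE_XML)
for x, y, z in particles:
    print(f"x = {x:.1f} px, y = {y:.1f} px, depth = {z:.1f} um")
```

Taking the box centre as the lateral position is an assumption of this sketch; any other reference point of the box could be used in the same way.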

Experimental set-up
The experimental set-up for our proposed method involved obtaining micrographs of polystyrene particles as training input under a commercial microscope (XSP-37XF, Shanghai Opt. Inst. Fty., China). The particles had a mean diameter of approximately 10 µm and a refractive index of 1.587. A 40× objective lens was used to capture their micrographs, and images of particles at different depths were obtained by carefully adjusting the longitudinal translation stage. The in-focus depth was defined as 0, while the depths for over-focus and under-focus conditions were greater and less than 0, respectively. The particles were placed under Olympus objective lens oil during imaging, and some typical images can be seen in figure 2. We have also tested our method on other particle fields, including plankton and red blood cells; further details on the data collection process can be found in [19].

YOLO deep neural network and GANs
The YOLO network is a real-time object detection system that has shown great promise in accurately detecting object positions [27]. YOLOv5 was the latest version of YOLO when we performed our analysis and offers fast detection and high accuracy [28]. It uses a one-stage neural network to detect and position objects directly. The model used here was modified from the YOLOv5s network, which has the smallest depth and feature-map width in the YOLO family [29]; the detailed network structure can be found in [30]. Two types of GANs were utilized in our method for different conditions: Cycle-GAN and Pix2Pix-GAN. Their structures and corresponding parameters can be found in [31].
The entire process was conducted on an Ubuntu 18.04 system. We trained our modified YOLOv5s network for 300 epochs using the Adam optimizer with a learning rate of 0.1 and a batch size of 2. The GANs were trained for 300 epochs with a batch size of 1 and a learning rate of 0.0001. The code for both networks was written in the PyTorch framework. We further processed the 3D images of the particles using MATLAB 2018a to present the information more clearly. Our analysis was conducted on a desktop computer with an Intel Core i7-9700 CPU (3 GHz), a GeForce GTX2060 GPU, and 8 GB of video memory. The training data set and network code are available on GitHub: https://github.com/xiaolei0828/particle-field.

Performance evaluation
The trained model can be evaluated using several parameters to assess its target detection ability. These parameters include precision (P), recall (R), average precision (AP), and mean average precision (mAP) [32]. The formulas for these metrics are defined as follows:
$$P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$
$$\mathrm{AP} = \int_0^1 P(R)\,\mathrm{d}R, \qquad \mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i.$$
Here, true positive (TP) represents the number of correctly detected target particles in the image, false positive (FP) represents the number of false detections, false negative (FN) represents the number of particles in the image that were not detected by the network, and N is the number of classes. Precision is the ratio of correctly detected particles to all detected particles and indicates how accurate the model is in detecting targets. Recall is the ratio of correctly detected particles to all particles in the sample and shows how well the model can detect all instances of a target. AP averages the precision over different recall levels and is calculated as the area under the precision-recall curve. mAP is the mean of the AP values for all classes and gives an overall measure of the model's performance.
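The verbal definitions of these metrics can be made concrete with a short sketch. The counts and the precision-recall points are made-up numbers, and the plain trapezoidal integration is an assumption of this example; detection toolkits often use interpolated variants of AP.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Area under a precision-recall curve by the trapezoidal rule.

    `recalls` must be sorted in increasing order.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values (one class per depth here)."""
    return sum(ap_per_class) / len(ap_per_class)


# Made-up example: 90 correct detections, 10 false detections, 5 missed particles.
p, r = precision_recall(tp=90, fp=10, fn=5)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
m_ap = mean_average_precision([ap, 1.0])
print(f"P = {p:.3f}, R = {r:.3f}, AP = {ap:.3f}, mAP = {m_ap:.4f}")
```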

Loss function of YOLOv5s
The YOLOv5s model employs a loss function consisting of three components: bounding box regression loss (box-loss), classification loss (cls-loss), and objectness loss (obj-loss) [33]. During the training process, monitoring the loss curves can indicate whether the network model is converging stably as the number of iterations increases. As shown in figure 3, the loss values decrease as the number of iterations increases when training and validating the model on polystyrene particles. The goal of this study is to estimate the depth class of the particles. The cls-loss curves become stabilized after 300 epochs, indicating that 300 epochs are sufficient for training the model to achieve stable convergence.
To determine the optimal number of epochs and the sample size for training, several tests were conducted using polystyrene particles as an example. The study compared the training time and evaluation indices of YOLOv5s for different amounts of data and different numbers of epochs. As shown in table 1, the model achieved excellent performance after completing 300 epochs using a training set of 5955 samples. The P, R, and mAP were 89.3%, 93.1%, and 95.6%, respectively. Considering computational costs, the results suggest that 300 epochs and approximately 6000 particles are sufficient for effective training of the model.
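As a rough orientation, the three loss components named in this section are commonly combined into a single training objective as a weighted sum. The weights $\lambda_{\mathrm{box}}$, $\lambda_{\mathrm{obj}}$, and $\lambda_{\mathrm{cls}}$ are symbols assumed for this sketch; their values are not stated in this paper.

```latex
\mathcal{L}_{\mathrm{total}}
  = \lambda_{\mathrm{box}}\,\mathcal{L}_{\mathrm{box}}
  + \lambda_{\mathrm{obj}}\,\mathcal{L}_{\mathrm{obj}}
  + \lambda_{\mathrm{cls}}\,\mathcal{L}_{\mathrm{cls}}
```

Because depth is encoded as the class label, the cls-loss term is the one most directly tied to depth estimation, which is consistent with the emphasis placed on its convergence.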

Validation on the synthetic dataset
The proposed method was validated using a synthetic dataset created by MicroSIG, a synthetic image generator based on a 3D ray-tracing approach proposed by Rossi [34]. Details about the synthetic dataset are available in the supplementary material. Figure 4 also shows the 3D spatial distributions of the particle fields in figures 4(b) and (h). The blue particles indicate the spatial positions of the particle field set by MicroSIG, which are considered ground truths. The red particles indicate the positions predicted by the YOLOv5s network. The accuracy rate is approximately 99.9%, with most positions predicted correctly. However, the network mispredicts a very small number of particles. As shown in figures 4(e) and (k), the particles in the red dotted circle should have the same depth, but the network gives different depths. The correct depth of the particle is −30 µm. Mutual interference between overlapping particles may cause the errors in figure 4(k). Nonetheless, this method can be used successfully in most situations, especially in sparse particle fields. The results on the synthetic dataset confirm that the proposed method is feasible and effective in estimating particle depth.

Applications in the static field
The proposed method was applied to static particle fields using polystyrene particles and red blood cells as input examples. A total of 504 images were used as the training input for the polystyrene particle field, and 5955 particles were extracted to train the networks. The images were randomly divided into a training set and a validation set in an 8:2 ratio. To test the trained network, 1332 unlabeled particles with known depths were used, and the accuracy rate was approximately 99%. A similar process was performed for red blood cells using 308 images, with approximately 3447 cells for training and 230 for testing. The total accuracy rate was approximately 97.8%. Figure 5 shows typical results. For both particles and cells, including those in noisy environments, the method successfully detected and positioned them, demonstrating the robustness of the positioning method.
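The random 8:2 division mentioned above could be sketched as follows. The file names, the fixed seed, and the helper `split_dataset` are assumptions made for illustration; the original splitting procedure is not described in more detail.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle `items` reproducibly and split them into two lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file names standing in for the 504 polystyrene images.
image_names = [f"img_{i:03d}.png" for i in range(504)]
train_set, val_set = split_dataset(image_names)
print(len(train_set), len(val_set))
```

Splitting at the image level, rather than at the particle level, keeps all particles from one micrograph in the same subset.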

Applied to the dynamic field and planktons
The proposed method has also been applied to track the 3D motion of particles in a dynamic field. As an example, motion videos of polystyrene particles were captured and processed. The trained YOLOv5s network was able to output the 3D positions of the target particles automatically in real time. A single frame containing multiple particles can be processed in about 0.008 s. Furthermore, the proposed method was also applied to track the 3D motion of swimming plankton. Due to the difficulty of collecting moving underwater samples, the total number of samples was limited, and they were placed in a petri dish. Using a method similar to that introduced in section 2.2, microscopic images were captured using a 20×/0.40NA objective lens. Typical images of the plankton at different depths are shown in figure 7, while the results for two typical plankton types (named A and B) are presented in supplements 2 and 3, respectively. Part of these results can also be seen in figures 8 and 9. All samples were successfully detected and precisely positioned in the test process, and their 3D dynamic trajectories were obtained, as shown in figures 8(d) and 9(d). The swimming plankton moved irregularly, and this method offers a new approach to investigating their behavior. It should be noted that in some cases, pollutants with similar color and morphology might be mistaken for plankton, as shown in figure 9(c). This problem will be further discussed in the next section. Nonetheless, the results demonstrate the capability of this method to track the 3D motion of particles in a dynamic field, which opens up new avenues for fluid and biological studies.
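One simple way to assemble per-frame detections into a 3D trajectory is nearest-neighbour linking between consecutive frames. The paper does not state how its trajectories were linked, so the approach, the function name `link_trajectory`, the `max_jump` threshold, and all coordinates below are assumptions of this sketch.

```python
import math


def link_trajectory(frames, start, max_jump=50.0):
    """Follow one particle through a video by nearest-neighbour matching.

    frames: list of per-frame detection lists, each detection an (x, y, z) tuple
            with x, y in pixels and z the predicted depth class.
    start:  (x, y, z) position of the particle before the first frame.
    """
    track = [start]
    for detections in frames:
        if not detections:
            continue  # no detection in this frame; keep the last known position
        last = track[-1]
        nearest = min(detections, key=lambda d: math.dist(d[:2], last[:2]))
        # Reject matches that moved implausibly far in the image plane.
        if math.dist(nearest[:2], last[:2]) <= max_jump:
            track.append(nearest)
    return track


# Made-up detections for four frames containing up to two particles.
frames = [
    [(12, 10, 0), (200, 200, -10)],
    [(15, 11, 10), (198, 205, -10)],
    [],
    [(19, 13, 20)],
]
track = link_trajectory(frames, start=(10, 10, 0))
print(track)
```

Matching uses only the lateral coordinates here because the depth is a discrete class that can change abruptly between frames.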

The influences of the noises, overlaps and the discrete depths
As previously mentioned, noise and overlaps from adjacent particles can introduce errors into the proposed method. To investigate the effects of these factors, polystyrene particles were used as an example, and the results can be seen in figure 10. The enlarged ROI 1 and ROI 2 in figures 10(b) and (c), respectively, show that noise does not affect the detection of target particles. The enlarged ROI 3 and ROI 4 in figures 10(d) and (e), respectively, demonstrate that most overlapped particles can be successfully detected and positioned. The exception is the particle in the center of the group of six overlapped particles, whose appearance was affected by the morphology of the adjacent particles, leading to an error. However, as particles are usually dispersed before use, the proposed method is still suitable for most experimental conditions. The depths determined by the proposed method are discrete, while the actual depth is a continuous variable. The network compares the input with the images obtained at all set depths and calculates their corresponding probabilities, choosing the depth with the highest probability as the output. Continuous depths can instead be calculated as the probability-weighted average of all possible depths. While Leroy et al proposed a regression method to address the discretization of depth values [18], overlaps between adjacent particles may still affect the probability distribution, and achieving accurate and continuous depth estimation remains an ongoing research challenge.
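The difference between the discrete output and the probability-weighted continuous estimate can be seen in a small numerical sketch. The depth classes and probabilities are made-up values, and the function names are hypothetical.

```python
def most_likely_depth(depths, probabilities):
    """Discrete estimate: the depth class with the highest probability."""
    best_index = max(range(len(depths)), key=lambda i: probabilities[i])
    return depths[best_index]


def continuous_depth(depths, probabilities):
    """Continuous estimate: probability-weighted average of all depth classes."""
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(d * p for d, p in zip(depths, probabilities)) / total


# Made-up class depths (in micrometres) and class probabilities for one particle.
depths = [-10, 0, 10, 20]
probabilities = [0.05, 0.15, 0.60, 0.20]

z_discrete = most_likely_depth(depths, probabilities)
z_continuous = continuous_depth(depths, probabilities)
print(z_discrete, round(z_continuous, 3))
```

In this example the discrete output is 10 µm, while the weighted average is pulled slightly toward the neighbouring classes according to their probabilities.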

Autofocus
In situations where pollutants can lead to false detections in defocused images, autofocus is necessary. The dataset prepared for 3D positioning can also be utilized for autofocus. However, DL-based autofocus methods require more data than 3D positioning methods and may require paired data. To address these issues, this section proposes the use of GANs, which eliminates the need for additional data.

GANs selection for autofocus
We have employed two types of GANs for autofocus, one that uses paired data and another that uses unpaired data. For the latter type, we used the widely used Cycle-GAN [25]. It was used to complete autofocus for different fields, such as polystyrene particles and blood cells [35]. Since a large amount of data is available for polystyrene particles and blood cells, Cycle-GAN works well in this scenario.
However, as the amount of plankton data is small, Cycle-GAN is unable to work effectively for this type of sample. Therefore, we opted for Pix2Pix-GAN to perform the autofocus. Pix2Pix-GAN requires paired data [36], where the input defocused image and the focused image (used as ground-truth) for training must be pixel-aligned. The network can learn the connections with relatively less data compared to Cycle-GAN.

Data augmentation for autofocus of plankton
Because plankton samples are often in motion, obtaining paired data for autofocus is challenging. To address this issue, we extended our data augmentation method (as shown in [19]) using Cycle-GAN. High-frequency information, such as fine details, is difficult to recover from a defocused image, but it is relatively easy to remove from a focused image. The workflow for this method can be seen in figure 12. We used Cycle-GAN, trained on a small amount of data, to generate defocused images at different depths from focused ones. These generated images and their corresponding focused counterparts were considered pixel-aligned and used to train the Pix2Pix-GAN. The trained Pix2Pix-GAN can then be used for autofocus. This method proved effective for plankton samples, where obtaining paired data is difficult due to their motion. The structural similarity index measure (SSIM) [37] has been used here to evaluate the accuracy of the images generated by the GANs. SSIM is defined as:
$$\mathrm{SSIM}(a, b) = \frac{(2\mu_a\mu_b + c_1)(2\sigma_{ab} + c_2)}{(\mu_a^2 + \mu_b^2 + c_1)(\sigma_a^2 + \sigma_b^2 + c_2)}$$
Here, $\mu_a$ and $\mu_b$ are the mean values of images a and b, respectively; $\sigma_a^2$ and $\sigma_b^2$ are the variances, while $\sigma_{ab}$ is the covariance of a and b; $c_1$ and $c_2$ are regularization parameters. A larger SSIM means a stronger structural similarity between the two images. For example, if SSIM is 1, the two images are identical. The average SSIM value between the autofocused images and the corresponding ground truths is about 0.95, which indicates that this autofocus method is feasible. Three typical samples can be seen in figure 12.
The SSIM values between the generated images (a1, b1, c1) and the corresponding experimental ones (a2, b2, c2) are about 0.9493, 0.9652, and 0.9567, respectively. As shown in figure 9(c), the pollutant was mistakenly recognized as plankton. The plankton and pollutant were focused using the proposed method, as shown in figures 12(d1) and (e1), respectively. They can be distinguished easily because the details can be observed clearly in the focused images. This data augmentation method is suitable for the autofocus of moving plankton. It offers a chance for further investigation of the plankton and the pollutants.
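The SSIM used in this section can be sketched as a single-window computation over whole images. This is a simplification: SSIM is usually evaluated over local windows and averaged. The constants c1 = (0.01 L)^2 and c2 = (0.03 L)^2, with L the dynamic range, follow the common convention and are an assumption here, since the paper does not state the values it used. The function name and the test images are hypothetical.

```python
def global_ssim(a, b, dynamic_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally sized grayscale images.

    a, b: 2D lists (rows of pixel values).
    """
    xs = [float(v) for row in a for v in row]
    ys = [float(v) for row in b for v in row]
    if not xs or len(xs) != len(ys):
        raise ValueError("images must be non-empty and of equal size")
    n = len(xs)
    mu_a = sum(xs) / n
    mu_b = sum(ys) / n
    var_a = sum((x - mu_a) ** 2 for x in xs) / n
    var_b = sum((y - mu_b) ** 2 for y in ys) / n
    cov_ab = sum((x - mu_a) * (y - mu_b) for x, y in zip(xs, ys)) / n
    # Regularization constants (assumed convention, see lead-in).
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


# Made-up 2x2 images: the second is the first with a uniform brightness offset.
image = [[0, 50], [100, 150]]
brighter = [[10, 60], [110, 160]]

ssim_same = global_ssim(image, image)
ssim_shifted = global_ssim(image, brighter)
print(round(ssim_same, 4), round(ssim_shifted, 4))
```

A uniform brightness offset changes only the mean term, so the second value stays close to, but below, 1.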

Conclusions
The proposed method for 3D positioning and autofocus of the particle field is a novel and effective approach that combines the use of the DfD and YOLO network. The depth of the particles can be determined by their blurred defocused images, and then the complete 3D position information can be obtained based on the trained YOLO model. The proposed method can process a single image containing multiple particles in about 0.008 s, making it suitable for real-time detection and 3D positioning. Furthermore, GANs were introduced to perform autofocus on particles at different depths simultaneously. The proposed positioning-autofocus method was validated on various samples and has demonstrated its robustness to the overlaps of adjacent particles and noise. With this method, the 3D positions of particles in the field can be determined in real-time, and their focused images can be generated. The results of this study suggest that the proposed method has great potential for application in the particle field, microbiology, environmental science, and fluid investigation, among other related areas. Additionally, the method's ability to detect and autofocus particles in real-time makes it highly practical and useful for researchers and scientists working in these fields. In conclusion, the proposed method offers a highly effective and efficient approach for 3D positioning and autofocus of particles, providing valuable insights for further research in the field of particle analysis and investigation.

Data availability statement
The training data set and network code are available on GitHub: https://github.com/xiaolei0828/particlefield.