Calibration-free quantitative phase imaging using data-driven aberration modeling

We present a data-driven approach to compensating for optical aberrations in calibration-free quantitative phase imaging (QPI). Unlike existing methods that require additional measurements or a background region to correct aberrations, we exploit deep learning techniques to model the physics of aberrations in an imaging system. We demonstrate the generation of single-shot aberration-corrected field images by using a U-net-based deep neural network that learns a translation between an optical field with aberrations and the corresponding aberration-corrected field. The high fidelity of our method is demonstrated on 2D and 3D QPI measurements of various confluent eukaryotic cells, benchmarked against the conventional background subtraction method.


Calibration-free 2D Quantitative Phase Imaging
For experimental verification, we tested our method with eukaryotic cells (HeLa, NIH-3T3, HEK-293T, MDA-MB-231, and COS7 cells). Representative results for NIH-3T3 and HEK-293T are shown in Fig. 2. The input, output, and ground truth (GT) complex field images of an NIH-3T3 cell and confluent HEK-293T cells are presented in Figs. 2(a) and 2(b), respectively. The aberrations in the input optical fields are effectively suppressed in the network outputs, and the results are compatible with those obtained using the conventional background subtraction method (GT in Fig. 2).
Parasitic fringe patterns caused by multiple reflections (red arrows in Fig. 2) are suppressed. Intriguingly, the phase discontinuity caused by phase wrapping in the input phase images (white arrows in Fig. 2) is also clearly removed in the network output [Fig. 2(a)]. This implies that, through training on a paired dataset (input and GT images), the algorithm automatically learns an ideal imaging transfer condition and suppresses any deviation from it. The capability of our method was further validated with HEK-293T cells that densely occupy the entire field of view (FoV) [Fig. 2(b) and Appendix D]. As in the NIH-3T3 case, the fringe noise and phase ambiguities were removed successfully, generating images sufficiently clean for probing biological information. This result indicates that our method can be effectively applied to confluent samples for which no background region is available. In such a scenario, it is difficult or even impossible to use aberration correction methods that rely on a background. We compared the difference between the GT and the network output image [Fig. 2(c)]. The cross-correlation and root mean square error (RMSE) were measured as 0.982 and 0.127 rad [NIH-3T3 cell, Fig. 2(a)], and 0.950 and 0.317 rad [HEK-293T cells, Fig. 2(b)], respectively. While negligibly blurred, subcellular features such as nucleoli and vesicles are well preserved in the network output.
To quantitatively validate the proposed method, the network error level was compared with the coherent noise level of the optical system. The absolute phase error map E between the present method and the conventional background subtraction method was calculated as E = φ_GT − φ_Output, where φ_GT and φ_Output are the phase images of the GT and the network output, respectively.
The phase error map for the NIH-3T3 cell is shown in Fig. 2(c). To compare the phase error with the coherent noise level, the background region was obtained from the measured FoV by manually masking out the cell [Fig. 2(d)].
In Fig. 2(e), the fractional histograms of the phase error E (blue) and the coherent noise Ncoh in the background region (red) are presented with their standard deviations. Both are centered at zero, and their standard deviations (0.13 rad and 0.12 rad) are comparable, indicating that the performance of our approach reaches the fundamental limit set by the coherent noise. Similar quantifications for the entire dataset can be found in Appendix E: the mean cross-correlation is 0.93 and the mean RMSE is 0.23 rad.
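For reference, the error map and the two summary metrics used above can be computed in a few lines. This is a minimal NumPy sketch; the function name and the use of a zero-mean normalized cross-correlation are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def phase_error_metrics(phi_gt, phi_out):
    """Phase error map E = phi_GT - phi_Output, its RMSE, and the
    normalized cross-correlation between the two phase images."""
    E = phi_gt - phi_out
    rmse = np.sqrt(np.mean(E ** 2))
    # zero-mean normalized cross-correlation (Pearson-style)
    a = phi_gt - phi_gt.mean()
    b = phi_out - phi_out.mean()
    xcorr = np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    return E, rmse, xcorr
```

For identical images the RMSE is zero and the cross-correlation is one; a constant phase offset shifts the RMSE but leaves the cross-correlation unchanged.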

Calibration-free 3D Quantitative Phase Imaging
The proposed method can be further utilized in 3D imaging. A three-dimensional refractive index (RI) tomogram is reconstructed from multiple 2D optical fields measured at various illumination angles using the principle of inverse light scattering, known as optical diffraction tomography (ODT) [Fig. 3(a)]. ODT is a 3D QPI technique that reconstructs the 3D RI tomogram of a sample from multiple 2D optical field images by inversely solving the Helmholtz equation [25].
In ODT, it is essential to correct the aberration of each optical field, because varying aberrations across the angle-scanned optical fields, which frequently occur in practice, deteriorate the quality of the reconstructed tomograms [Fig. 3(b)]. The green dashed arrow indicates significant noise caused by aberrations. The subcellular structures of the NIH-3T3 cell were clearly restored in our network output. The solid black arrows indicate that the morphology of organelles, including nucleoli (1), the nuclear membrane (2), and lipid droplets (3), was successfully reconstructed, as in the tomogram reconstructed using the background subtraction method (GT).
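The gridding step at the heart of this reconstruction can be sketched as follows. This is a heavily simplified illustration of the Fourier-diffraction-theorem mapping used in ODT, not the authors' implementation: the function name `odt_backmap`, its default parameters (wavelength, medium index, pixel size), and the nearest-neighbor gridding are all our own assumptions, and the Rytov field conversion and missing-cone regularization are omitted.

```python
import numpy as np

def odt_backmap(fields, illum_dirs, wavelength=0.532, n_m=1.336, dx=0.1):
    """Map angle-scanned 2D complex fields (M, N, N) onto a 3D Fourier
    grid of the scattering potential, given unit illumination direction
    vectors illum_dirs (M, 3). Simplified nearest-neighbor gridding."""
    M, N, _ = fields.shape
    k0 = 2 * np.pi * n_m / wavelength          # wavenumber in the medium
    dk = 2 * np.pi / (N * dx)                  # Fourier grid spacing
    kxy = 2 * np.pi * np.fft.fftfreq(N, d=dx)  # lateral spatial frequencies
    KX, KY = np.meshgrid(kxy, kxy, indexing="ij")
    KZ2 = k0 ** 2 - KX ** 2 - KY ** 2
    prop = KZ2 > 0                             # propagating components only
    KZ = np.sqrt(np.where(prop, KZ2, 0.0))
    F = np.zeros((N, N, N), complex)           # accumulated spectrum
    C = np.zeros((N, N, N))                    # fill counts for averaging
    for m in range(M):
        S = np.fft.fft2(fields[m])             # angular spectrum of field m
        ki = k0 * illum_dirs[m]
        # object frequency K = k_scattered - k_illumination (Ewald cap)
        ix = (np.round((KX - ki[0]) / dk).astype(int)) % N
        iy = (np.round((KY - ki[1]) / dk).astype(int)) % N
        iz = (np.round((KZ - ki[2]) / dk).astype(int)) % N
        np.add.at(F, (ix[prop], iy[prop], iz[prop]), S[prop])
        np.add.at(C, (ix[prop], iy[prop], iz[prop]), 1)
    filled = C > 0
    F[filled] /= C[filled]                     # average multiply-hit voxels
    return F, filled
```

An inverse 3D FFT of the filled grid would then yield the scattering potential, from which the RI is obtained; in practice the unfilled "missing cone" must be handled by regularization, which this sketch does not attempt.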
The results show that this method improves the applicability of QPI in two ways. First, the proposed method does not require any additional measurements, enabling single-shot QPI. This offers robust phase measurement without sample restrictions such as being static or sparse. Second, our network performs not only aberration correction but also robust phase unwrapping, which is difficult to conduct owing to its ill-posedness. The presented results reconfirm the use of deep learning for phase unwrapping, which has previously been reported in Refs. [26,27].

Conclusion
In sum, we proposed and experimentally demonstrated a deep-learning approach that performs aberration correction. The deep-learning model learned a mapping from an 'input' optical field with aberrations to an aberration-corrected 'output' optical field obtained using the background subtraction method. The model, trained with this paired dataset, generates aberration-corrected optical fields without any additional measurements, enabling single-shot QPI. We verified the robust performance of our method by successfully performing 2D QPI on various eukaryotic cells and optical diffraction tomography, in which the 3D refractive index of a biological sample is reconstructed from the 2D aberration-corrected optical fields. Recently, deep learning has been applied to segment the background region in the measured FoV [28]; however, that approach requires object-free background areas in the image to model aberrations based on Zernike functions, which is fundamentally different from our method. We believe our approach, capable of imaging dynamic and confluent samples, can be used to investigate perplexing biomedical phenomena in diverse disciplines.
Several subsequent studies could enhance the proposed method. First, advanced network architectures, such as Dual-CNN or LS-DNN [29,30], may mitigate the blurring and pixelization artifacts that appear in our results. Next, a more diverse dataset could further generalize our method. Because we used a single QPI instrument and six types of eukaryotic cells in this work, our framework may generalize poorly to unseen data distributions or different imaging systems. The data preparation process could also be simplified and automated by manufacturing permanent reference samples with various RI distributions [31]. In addition, the present approach could be used sequentially with, or combined with, a deep-learning-based phase retrieval algorithm [32-34], which could potentially expedite the whole imaging and reconstruction process. Finally, artificial-intelligence-aided QPI approaches can be further combined with the proposed method owing to its high throughput and automation [35].

Appendix A: Network architecture for aberration compensation
The encoder extracts features of the input image at multiple scales through successive down-sampling convolutional blocks. The input image is first convolved with 4 × 4 filters with stride 2 and pad 1 to create 64 feature maps. Down-sampling blocks are then applied successively, extracting feature maps at different levels. Each block consists of a leaky rectified linear unit (Leaky ReLU) [36], a 4 × 4 convolution with stride 2 and pad 1, and batch normalization [37], which is typically used to reduce internal covariate shift and training time. The number of convolutional filters doubles with each layer until it reaches 512, while the image size halves. Eight layers are used in this case (the input image size is 2^8 = 256), so that the features at the end of the encoder have dimension 1 × 1. The decoder reconstructs an output image of the same dimensions as the input from the features acquired by the encoder. The up-sampling blocks consist of a ReLU, bilinear interpolation (factor of 2), a 3 × 3 convolution with stride 1 and pad 1, and batch normalization. Up-sampling by interpolation followed by convolution was used instead of transposed convolution to avoid the checkerboard artifact [38]. The block dimensions are symmetric with respect to the encoder, except that features of the same dimension in the encoding path are concatenated to preserve spatial information. The last blocks of the encoder and decoder do not contain batch normalization.
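The layer-by-layer bookkeeping described above can be sanity-checked with a short, framework-agnostic sketch. The function `unet_shapes` and its signature are our own illustrative names; it only reproduces the channel/size schedule of the encoder, not the network itself.

```python
def unet_shapes(input_size=256, base_ch=64, max_ch=512):
    """Encoder feature shapes: each 4x4 stride-2 convolution halves the
    spatial size, and the channel count doubles until it saturates at
    max_ch. Returns a list of (channels, size) per encoder level."""
    shapes = []
    size, ch = input_size, base_ch
    size //= 2                      # first stride-2 conv: 256 -> 128, 64 ch
    shapes.append((ch, size))
    while size > 1:                 # down-sample until features are 1 x 1
        ch = min(ch * 2, max_ch)
        size //= 2
        shapes.append((ch, size))
    return shapes
```

With a 256-pixel input this yields eight levels, ending in 512 feature maps of size 1 × 1, matching the description above.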

Appendix B: Loss function
Among the various metrics for comparing two images, we utilized the l1 norm [Eq. (1)],

L_l1 = E[ ||x_GT − x_Output||_1 ],    (1)

where E[X] is the expectation of the random variable X, ||x||_1 is the l1 norm of a vector x, and σ² denotes the local variance. Because minimization of the l1 norm alone generated blurry images, we combined it with the contrast component of SSIM [Eq. (2)],

c(x, y) = (2 σ_x σ_y + C_2) / (σ_x² + σ_y² + C_2),    (2)

where σ_x and σ_y are the local standard deviations of the two images and C_2 is a small stabilizing constant.
When the loss function consisted of SSIM only, the optimization process was highly unstable, failing frequently. The performance of each loss combination was tested by calculating the average field cross-correlation error (FCE) [Eq. (4)] and the phase RMSE [Eq. (5)]. For a network output and GT complex electric field E = A e^{iφ},

FCE = 1 − |⟨E_Output, E_GT⟩| / (||E_Output|| ||E_GT||),    (4)

phase RMSE = sqrt( E[(φ_Output − φ_GT)²] ).    (5)

The linear combination of the l1 norm and SSIM achieved the lowest error and was chosen for the present work.
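A minimal NumPy sketch of such a combined loss is shown below. Note that, unlike standard SSIM, the contrast term here is computed globally over the whole image rather than over local windows, and the function names, weights, and the value of `C2` are our own illustrative choices.

```python
import numpy as np

def l1_loss(x, y):
    """Mean absolute difference between two images."""
    return np.mean(np.abs(x - y))

def ssim_contrast(x, y, C2=1e-4):
    """Contrast component of SSIM, c(x, y) = (2*sx*sy + C2)/(sx^2 + sy^2 + C2),
    computed globally (single window) as a simplification."""
    sx, sy = x.std(), y.std()
    return (2 * sx * sy + C2) / (sx ** 2 + sy ** 2 + C2)

def combined_loss(x, y, alpha=1.0, beta=1.0):
    """Linear combination of the l1 term and (1 - contrast)."""
    return alpha * l1_loss(x, y) + beta * (1.0 - ssim_contrast(x, y))
```

For identical images both terms vanish; an intensity offset is penalized by the l1 term while leaving the contrast term unaffected, which is one reason the combination behaves better than either term alone.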

Appendix C: Dataset and training
To obtain the dataset, optical field images of biological cells from various cell lines (HeLa, NIH-3T3, HEK-293T, MDA-MB-231, and COS7 cells) were measured using a commercial 3D QPI instrument (HT-2H, Tomocube Inc., Republic of Korea). The dataset was prepared by pairing the optical fields with aberrations and the corresponding fields corrected using the background subtraction method. To exploit the amplitude-phase correlation, the network was designed to handle 2-channel (amplitude and phase) images for both input and output. For the ground-truth training dataset, phase images were retrieved using a phase retrieval algorithm [40,41], and the 2π ambiguities in the phase maps were resolved using a phase unwrapping algorithm, namely Goldstein's algorithm [42].
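The background subtraction step that produces the ground truth can be sketched as a complex division of the sample field by a sample-free background field, so that aberrations common to both cancel. This is a simplified illustration; the names `background_subtract` and `eps` are our own, and the paper's exact processing pipeline may differ.

```python
import numpy as np

def background_subtract(sample_field, background_field, eps=1e-12):
    """Cancel system aberrations by dividing the complex sample field by
    a background field measured without the sample; phase aberrations
    shared by both fields subtract out in the ratio."""
    return sample_field / (background_field + eps)
```

Since the fields are complex exponentials of their phases, the division subtracts the common aberration phase, leaving (ideally) only the sample-induced phase delay.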
During training, the network parameters were updated by gradient-descent optimization, minimizing a loss function consisting of the structural similarity index map and the L1 norm [43]. We used a desktop computer with a CPU (Xeon E5-2630, Intel) and one graphics processing unit (GeForce GTX 1080Ti, 11 GB internal memory, NVIDIA) for training and testing. Once training is complete, converting an image with aberrations into an aberration-corrected image takes 0.8 ms per image (with a batch size of 100) and 15 ms with a single-image batch. All optical field images were cropped to 256 × 256 pixels before entering the network.
The learning curve of our network is depicted in Fig. 4. The average loss was calculated on the training set (11809 images) and the validation set (11858 images) at each epoch during training. Although our network was nearly optimized within 24 hours of training and validation, we trained the network for a longer time (approximately 240 hours) to fully learn the image-to-image translation. We stopped the training when the validation loss reached 28.53 at epoch 2223.

Appendix D: Demonstration on microbeads and various eukaryotic cells
To validate that the proposed network did not learn the specific sample distributions in the training dataset, but rather the ensemble of aberration patterns of the optical system used, we also imaged a known polystyrene (PS) microsphere (Fig. 5). The phase images of the PS microsphere are presented in Fig. 5(a): the input phase image containing aberrations and phase wrapping (left), the network output (middle), and the GT (right) obtained from the background subtraction method (BGSM). The phase delays along the red dashed lines in the insets of Fig. 5(a) are plotted in Fig. 5(b). We also compared our method with the BGSM on the performance of optical diffraction tomography in Fig. 5(c). After optical fields at 49 illumination angles were obtained, each set of optical fields was processed with our method and the BGSM, respectively. The bead tomograms reconstructed from the two different sets of optical fields are shown: our method (left) and the BGSM (right). The RI values along the blue dashed lines in Fig. 5(c) are plotted in Fig. 5(d). Although the network was trained on a dataset consisting only of eukaryotic cells, our method closely matches the BGSM in both 2D QPI and tomographic reconstruction. We also applied the present algorithm to various cancer cell lines [Fig. 5(f)]. The QPI images of HeLa, COS-7, and MDA-MB-231 cells were clearly retrieved with the present method.

Appendix E: Statistical analysis of results
The two types of errors (FCE and phase RMSE) were calculated for the whole dataset (Fig. 6). For images obtained under normal illumination only, 85% of the data in the test set have an FCE and a phase RMSE smaller than 0.028 and 0.15 rad, respectively. For the whole dataset, which includes images obtained under angled illumination, 85% of the data in the test set have an FCE and a phase RMSE smaller than 0.10 and 0.33 rad, respectively.

Fig. 6. The performance of the proposed method is quantitatively analyzed. The entire dataset, consisting of the training set and the test set, is assessed by the errors between the generated fields and the ground-truth fields (fields obtained from background subtraction). The green lines indicate the threshold that contains 85% of the data in the test set.
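The 85% thresholds quoted above correspond to simple percentiles of the per-image error distributions; a short sketch of that computation (the function name and signature are our own illustrative choices):

```python
import numpy as np

def error_thresholds(fce_errors, rmse_errors, pct=85):
    """Thresholds below which pct% of the per-image test-set errors
    fall, one for the FCE values and one for the phase RMSE values."""
    return np.percentile(fce_errors, pct), np.percentile(rmse_errors, pct)
```

Applied to the per-image FCE and RMSE arrays of the test set, this yields the green threshold lines drawn in Fig. 6.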