Towards ultrafast quantitative phase imaging via differentiable microscopy [Invited]

With applications ranging from metabolomics to histopathology, quantitative phase microscopy (QPM) is a powerful label-free imaging modality. Despite significant advances in fast multiplexed imaging sensors and deep-learning-based inverse solvers, the throughput of QPM is currently limited by the pixel rate of the image sensors. As a complementary route to higher throughput, here we propose to acquire images in a compressed form so that more information can be transferred beyond the existing hardware bottleneck of the image sensor. To this end, we present a numerical simulation of a learnable optical compression-decompression framework that learns content-specific features. The proposed differentiable quantitative phase microscopy (∂-QPM) first uses learnable optical processors as image compressors. The intensity representations produced by these optical processors are then captured by the imaging sensor. Finally, a reconstruction network running on a computer decompresses the QPM images post-acquisition. In numerical experiments, the proposed system achieves ×64 compression while maintaining an SSIM of ∼0.90 and a PSNR of ∼30 dB on cells. The results demonstrated by our numerical experiments open up a new pathway to QPM systems that may provide unprecedented throughput improvements.


Introduction
Among the label-free imaging modalities, quantitative phase microscopy (QPM) is a simple but powerful approach, providing important biophysical information by quantifying optical phase differences [1,2]. From the phase map, one can further derive quantitative information about the morphology and dynamics of the examined specimens [3,4]. In addition to morphology, the measured phase maps can be converted to the dry mass of the cells with an accuracy of the order of femtograms per square micron [5,6]. QPM has found many important applications in biomedicine [7], including pathogen screening [8], cancer cell classification [9], and label-free analysis of histopathology specimens [10,11]. Moreover, quantitative phase imaging has recently been extended to image the structures of thick biological systems such as zebrafish larvae [12].
The first phase imaging mechanism was introduced by Zernike in his phase contrast microscopy [13]. Here, the phase shifts due to the refractive index and depth differences in the specimen are converted into detectable intensity variations. Zernike's original design consisted of a phase filter that directly displays phase information by interfering the scattered portion of light from an image with its unscattered portion. Even though the technique was improved through several extensions [14,15], direct phase contrast techniques are incapable of quantitative phase measurements because of the non-linear dependency between phase and intensity. QPM techniques overcome this problem by computational inverse reconstruction [7]. A typical quantitative phase microscope consists of an optical system (forward model) and a computational phase retrieval algorithm (inverse model) [16]. The forward optical system converts undetectable phase information into detectable interferometric fringe patterns; from the fringe patterns, the inverse reconstruction algorithm retrieves phase and intensity maps of the specimen. Recent developments in QPM have mostly focused on improving the inverse reconstruction using GPU acceleration [17–19], deep-learning-based inverse solvers [20–25], and illumination pattern optimization [26,27].
These advancements have placed QPM in a unique position to measure large cell populations for applications in cytometry, a field currently dominated by flow cytometers. Most commercial flow cytometers can easily analyze hundreds of thousands of cells per second, but QPM-based image cytometers are currently orders of magnitude slower. The main bottleneck of QPM is the image acquisition speed, which is fundamentally governed by the pixel rate of image sensors. Currently, the pixel rate of a state-of-the-art camera is around 1 × 10^10 pixels/s. However, the pixel throughput of the front-end optics is virtually unlimited. An image passes through optics at the speed of light, which has been the rationale for developing optical signal processors [28].
Here we propose to exploit this property to optically compress an image and measure its compressed form using a high-speed light detector (such as a high-speed camera). Thus, the pixel throughput of the original image would increase in proportion to the degree of compression.
Compressive imaging of biological specimens, using random sampling of the linearly projected image space, has been demonstrated before [29]. Better compression, however, may be achieved by learning dataset-specific features of images. To this end, here we propose to use differentiable microscopy (∂µ) [30,31] to identify important image features for compression through machine learning. Our method consists of an optical processor, a camera sensor, and a deep neural network. The optical processor encodes phase information of an input light field onto the sensor. The sensor compressively measures the intensity of this output field. The measured intensity map is then used by the neural network to reconstruct the phase map of the original input light field. We use machine learning to co-design the optical processor and the decoding neural network end-to-end. We call this measurement scheme differentiable quantitative phase microscopy (∂-QPM). In numerical simulations, we show that our proposed approach can image phase information of in-vitro cells at ×64 to ×256 compression, accelerating image acquisition by the same factor. We thus demonstrate that orders of magnitude faster QPM is feasible through ∂-QPM. Of note, this work only presents a simulation of the optical processor, leaving the implementation to future work.
In the following sections, we first introduce the proposed ∂-QPM (section 2.1). Second, we assess the feasibility of using optical processors as image compressors, despite them being linear operators (section 2.2). Third, we demonstrate ∂-QPM (in simulations) for in-vitro cells at ×64 to ×256 compression. Last, we discuss multiple aspects of the proposed measurement paradigm, including potential avenues to implement the optical processors.

Differentiable quantitative phase microscopy (∂-QPM)
Figure 1 shows the schematic of the proposed ∂-QPM scheme, which consists of an optical processor, a camera sensor, and a neural network. The optical processor maps an input light field (at the image plane of the microscope) to an output light field. We design the optical processor such that the low-frequency intensity components of the output field encode information about the phase of the input field. The output field is then imaged at low resolution using a camera sensor.
The sensor is smaller than what is required to measure the original input field at the Nyquist sampling rate, thereby performing a "compressive measurement". The measured intensity map is then "decompressed" and decoded using the neural network to reconstruct the phase map of the original input field. Notably, in ∂-QPM, each sensor pixel codes for multiple pixels of the original input light field. We call this number the "compression". For any given camera, the compression is directly proportional to the improvement of imaging speed. Below, we write the mathematical model of the above process.
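As a small worked example of the "compression" defined above, the sketch below computes the sensor size needed for the 256 × 256 input grid used later in the paper. It assumes equal downsampling along both axes, which the text does not state explicitly; `sensor_side` is a hypothetical helper name.

```python
import math

def sensor_side(input_side: int, compression: int) -> int:
    """Side length of a square sensor when each sensor pixel codes for
    `compression` input pixels (assumes equal downsampling on both axes)."""
    per_axis = math.isqrt(compression)
    if per_axis * per_axis != compression:
        raise ValueError("this sketch needs a perfect-square compression")
    if input_side % per_axis:
        raise ValueError("input side must be divisible by the per-axis factor")
    return input_side // per_axis

for c in (64, 256):
    side = sensor_side(256, c)
    print(f"x{c}: {side} x {side} sensor pixels, each coding for {c} input pixels")
```

Under this assumption, ×64 compression of a 256 × 256 field corresponds to a 32 × 32 sensor region and ×256 to a 16 × 16 region.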
Consider the electric field x_in = A_in e^{jφ_in} at the image plane of the microscope. x_in propagates through the optical processor H_O(.) such that

$$ I = D\left(\left|H_O(x_{in})\right|^2\right), \tag{1} $$

$$ \hat{\phi} = H_E(I). \tag{2} $$

Here D(.) represents the low-resolution detection of the output light field, I represents the detected intensity map, and H_E(.) represents the neural network that reconstructs the phase map, φ̂, of x_in at its original resolution.
Parameters of both the optical processor and the neural network are optimized using machine learning methods. Specifically, we first parameterize the entire end-to-end model in a differentiable manner. The parameters are then optimized by minimizing a loss function that represents the reconstruction quality (see section 5.3). Equation (3) shows the simplified representation of the overall problem.

$$ H_O^{*}, H_E^{*} = \arg\min_{H_O, H_E} \mathbb{E}_{x_{in}}\left[ L\left(\phi_{in}, \hat{\phi}\right) \right]. \tag{3} $$
Here L represents the composite loss function. Components of L are explained in the methods section (section 5.3). Motivated by our previous work on all-optical phase retrieval [30], for H_O(.), we consider two types of optical processors: Learnable Fourier Filters (Fig. 1.B1) and PhaseD2NNs (Fig. 1.B2). These models are discussed in detail in the methods section 5.2.
The main limitation of our previous work [30] was the lack of non-linearity of the optical processor. Phase retrieval is a non-linear image translation problem, and our linear optical processor could only find an approximation. In contrast, here we use the optical processor as a feature extractor. The optical model only has to learn a faithful representation that contains sufficient information to computationally reconstruct the original phase map. However, when compressive measurements are used, the reconstruction problem is highly ill-posed. Therefore, we next investigate the capacity of our linear optical processor to encode phase information sufficient for inverse reconstruction.

Linear encoding does not degrade compressibility
With respect to the phase of the input field, our ∂-QPM scheme is similar to an autoencoder. The optical processor acts as the encoder; the neural network acts as the decoder; the output field of the optical processor acts as the bottleneck. Conventional autoencoders [32] have non-linear encoders that can learn compressed representations at their bottleneck. But here our encoder, i.e., the optical processor, is a linear system. We therefore first established the feasibility of linear compression in comparison to nonlinear compressor models.
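To make the constraint concrete: any cascade of linear layers collapses into a single linear map, so a linear encoder gains no expressive power from depth, and all non-linearity in the autoencoder has to come from the decoder. The sketch below checks this numerically. The matrices `W1` and `W2` and the 64-element input are hypothetical stand-ins, not the networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer linear encoder acting on a flattened 8 x 8 complex field.
W1 = rng.normal(size=(32, 64)) + 1j * rng.normal(size=(32, 64))
W2 = rng.normal(size=(16, 32)) + 1j * rng.normal(size=(16, 32))
x = np.exp(1j * rng.uniform(0, 2 * np.pi, size=64))  # phase-only input field

def encode_two_layers(field):
    """Apply the two linear layers one after the other."""
    return W2 @ (W1 @ field)

# The same encoder written as a single 16 x 64 matrix.
W_single = W2 @ W1

print(np.allclose(encode_two_layers(x), W_single @ x))
```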
Linear Encoding and Non-linear Decoding Allow Compression. First, we experimented on an autoencoder (AE) network with a linear encoder followed by a non-linear decoder. The reconstruction results obtained from this network were compared with those of a fully linear autoencoder and a fully nonlinear autoencoder. The qualitative results for the MNIST dataset [33] in Fig. 2 show that the autoencoder network with a linear encoder and non-linear decoder performs on par with the fully nonlinear autoencoder. Fig. 3 shows that a complex-valued linear encoder with a nonlinear decoder achieves qualitative performance similar to that of a complex-valued nonlinear encoder with a nonlinear decoder. These results suggest the feasibility of a linear optical processor (encoder) followed by a nonlinear neural network (decoder) to compress and reconstruct information in the phase of the light field.

Optical encoding and electronic decoding enable compressed QPM
Our results in section 2.2 show that an autoencoder with a linear encoder and a non-linear decoder (AE:LE+NLD) can reconstruct images as well as a fully nonlinear model. In this section, we numerically test our ∂-QPM scheme with two types of optical processors (H_O(.) in Eq. (3)) as encoders. For the decoder neural network (H_E(.) in Eq. (3)), we use a state-of-the-art super-resolution model, SwinIR [34]. We evaluate ∂-QPM on an experimentally collected HeLa cell dataset (see section 5.4).
Learnable Fourier Filter (LFF) + SwinIR. Based on previous work [30], we first used a Learnable Fourier Filter (LFF) as the optical processor. The LFF contained an optical 4-f system with a learnable circular Fourier filter. Similar to previous work [30], the transmission coefficients of the circular Fourier filter were treated as learnable parameters. The input and output fields were 256 × 256 square-aperture grids. The circular Fourier filter had a radius of 128 grid points. The coefficients of the filter were randomly initialized. We used SwinIR [34], a state-of-the-art super-resolution network, as the decoder neural network. We observed that directly training the end-to-end model (optical processor and SwinIR) was not ideal, as the gradient flow between the two models was weak. Therefore, we employed the 3-stage procedure for the optimization of the end-to-end model (as discussed in section 5.3). We tested compression levels of ×64 and ×256 for the compressed optical output intensity field in our experiments.
Table 1 shows the performance at the ×64 and ×256 compression levels for the tested HeLa dataset (section 5.4). For each compression level, performance is reported with and without the fine-tuning step. The corresponding qualitative results are shown in Figs. 4 and 5. All proposed methods outperformed the all-optical phase-to-intensity conversion baselines (B1, B2) [30] by a significant margin in terms of SSIM (structural similarity index) and PSNR (peak signal-to-noise ratio) [35]. Note that the all-optical baselines use only the optical processor, and the output intensity measured by the camera sensor is considered the final output phase map. Also, for the baselines, the output is at the same resolution as the original input, so no compression is employed. End-to-end fine-tuning showed a considerable improvement in performance in all cases. Our best method achieved PSNR = 29.76 dB and SSIM = 0.90 at ×64 compression, indicating that the proposed method is suitable for high-throughput QPM. Even at ×256 compression, the proposed method outperformed the all-optical baselines by a considerable margin, with PSNR = 27.61 dB and SSIM = 0.83.
We further tested our approach by including a noise model with Poisson noise and read noise [31]. We fine-tuned the best model (C2) with noise. A read noise with a standard deviation of 6.0 and a detector maximum photon count of 10000 were used. The proposed method with detector noise (E1) performed on par with the best model, indicating that our LFF + SwinIR based ∂-QPM is robust to real-world noise conditions. We further discuss the effect of the detector noise in the discussion (see section 3).
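A minimal sketch of a detector noise model of this kind is given below. It is not the noise model of [31], whose details are not given here. It assumes that the intensity is scaled so that its peak equals the maximum photon count, that shot noise is Poisson, and that read noise is zero-mean Gaussian expressed in the same count units. The function name `add_detector_noise` is hypothetical.

```python
import numpy as np

def add_detector_noise(intensity, max_photons=10000, read_std=6.0, rng=None):
    """Simulate shot noise and read noise on a non-negative intensity map.

    Assumptions (not from the paper): the brightest pixel maps to
    `max_photons`, and read noise is Gaussian in photon-count units.
    Returns the noisy image rescaled to the peak-normalized range.
    """
    rng = np.random.default_rng() if rng is None else rng
    intensity = np.clip(np.asarray(intensity, dtype=float), 0.0, None)
    peak = intensity.max()
    if peak == 0:
        expected = np.zeros_like(intensity)
    else:
        expected = intensity / peak * max_photons     # expected photon counts
    counts = rng.poisson(expected).astype(float)      # Poisson (shot) noise
    counts += rng.normal(0.0, read_std, counts.shape) # additive read noise
    return counts / max_photons

rng = np.random.default_rng(0)
clean = np.full((64, 64), 0.5)
clean[0, 0] = 1.0  # sets the peak used for scaling
for photons in (100, 10000):
    noisy = add_detector_noise(clean, max_photons=photons, rng=rng)
    print(photons, float(np.std(noisy - clean)))
```

In this sketch, lowering the maximum photon count from 10000 to 100 raises the relative shot noise, which matches the qualitative trend discussed for the low-light condition.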
PhaseD2NN + SwinIR. Second, we tested a PhaseD2NN [30] as the optical processor in the proposed end-to-end framework. Similar to the previous section, the SwinIR super-resolution network was used for reconstruction. We selected an operating wavelength for the PhaseD2NN in the visible range (λ = 632.8 nm) and evaluated it on the same HeLa cell dataset (see section 5.4).
Since the pixel size matched the PhaseD2NN neuron size (316.4 nm × 316.4 nm), we could train the end-to-end network directly on the patch FoVs from the dataset. We followed the optimization procedure presented in section 5.3 for the end-to-end training. Notably, we observed that in step 1, PhaseD2NN training was not stable due to the large number of physical parameters with a larger grid size (e.g., 256 × 256). To increase the stability and gradient flow of this optimization step, we used a sub-optimization-schedule (shown in Algorithm S1 of Supplement 1). We compressed the output intensity from the optical processor by ×64 to obtain a higher throughput. Table 1 shows the performance for the ×64 compression level. We report the performance while selecting different layers of the PhaseD2NN as the output layer. The final model with 5 layers was fine-tuned according to the proposed optimization steps. Similar to section 2.3, fine-tuning improved the performance. We explored different numbers of diffractive layers for the PhaseD2NN without the fine-tuning step, and the results are presented in Table 1.
Figure caption (fragment). The compared segments: ground-truth phase of the input field (A2) of a full FoV from the test set; all-optical results using LFF (B1) and PhaseD2NN (B2); phase reconstructions from our approach 1, LFF + SwinIR with ×64 compression with fine-tuning (C2) and LFF with ×256 compression with fine-tuning (C4); and phase reconstructions from our approach 2, PhaseD2NN + SwinIR with ×64 compression with fine-tuning and 5 optical layers (C8).
We performed further experiments with the 5-layer PhaseD2NN (C8 and E2). Our method achieved its best performance of PSNR = 27.24 dB and SSIM = 0.86 at ×64 compression, which was considerably higher than the all-optical baselines. Similar to the previous section, we tested our model for detector noise with similar specifications (a maximum photon count of 10000 and a detector read noise standard deviation of 6.0). The resultant performance with the detector noise (E2) was on par with the best model without the noise (C8). This indicates that our PhaseD2NN + SwinIR based ∂-QPM is robust to real-world noise conditions.

Discussion
Overall Comparison. Figure 5 presents the qualitative results for the best-performing models. Figure 5(d) shows that the proposed ∂-QPM systems have a higher resolving capability than the all-optical baselines. The SSIM maps in Figure 5(c) show how our methods perform for different regions of the full field-of-view (FoV). Low SSIM at the patch edges indicates that there is room to improve the proposed QPM just by refining the edges of the generated patches. We also observed that the LFF-based method outperformed the PhaseD2NN-based one. Further studies are needed to investigate the reason for this behavior.
Stability of PhaseD2NN Training. We observed that optimization step 1, i.e., all-optical reconstruction (see section 5.3), was unstable for the PhaseD2NN. We suspect that the reason for this instability is the large FoV (256 × 256), which results in a large number of learnable parameters.
To overcome this, we used a sub-optimization-schedule for the PhaseD2NN training, motivated by progressive growing learning principles [36] (see Algorithm S1 in Supplement 1). Instead of training the PhaseD2NN in an end-to-end fashion, here we optimize the PhaseD2NN progressively, layer by layer, with the phase reconstruction loss. With this schedule, we could efficiently train the optical processor. Even though one can argue that the proposed schedule leads to a sub-optimal solution, we achieved sufficient performance for QPM [25] with this schedule. Nevertheless, an interesting future direction is to explore more efficient methods to train large D2NNs.
Effect of Photodetector Noise. To further evaluate the behavior of the proposed method with detector noise, we evaluated the method with maximum photon counts of 100 and 10000, and read noise standard deviations of 4.0 and 6.0. Changing the photon counts changes the Poisson noise in detection. Table 2 shows that ∂-QPM is robust to noise when the maximum photon count is 10000 (for most QPM applications such high light conditions can be used). We saw a significant reduction in performance when the maximum photon count was 100, i.e., under high Poisson noise conditions. Interestingly, the effect on the D2NN-based model was more severe than that on the LFF-based model. Thus, an interesting future direction is to investigate better noise-aware training strategies for optical processors. Last, we did not see a significant effect from read noise.
Compressibility Limitations. Last, we tested our LFF-based approach on a QPM dataset of tissue with much more complex features (see section 5.4). The goal of this experiment was to investigate the limitations of our approach for complex features. We observed that our method failed to reconstruct high-resolution features at both ×64 and ×256 (see Supplement 1). There could be two potential reasons for the subpar performance. It could be the case that the optical processor cannot efficiently convert phase information to the latent intensity field at the detector. Alternatively, it could be the case that the reconstruction network is not capable of reconstructing highly compressed information from images with complex features. To investigate the latter, we tested our reconstruction network on a simple resolution enhancement task on the same tissue dataset. As shown in Supplement 1, here too the reconstruction network failed. Thus, we conclude that in our method, the compressibility is limited in the presence of complex features. Further studies are required to establish the compressibility bounds for data distributions of interest. Nevertheless, our successful demonstration on cell data opens doors to a number of applications in cell biology and medicine, such as pathogen screening [8]; stain-free quantification of chromosomal dry mass in living cells [37]; quantification of different growth phases of chondrocytes [38]; and identification of biophysical markers of sickle cell drug responses [39].
Realization of the Optical Processors. In this work, we only consider numerical simulations of optical processors. We now discuss the feasibility of their realization. The LFF setup can be implemented as an optical 4-f system with a transmissive spatial light modulator (SLM) (as shown in Fig. 1.B1), or with a reflective SLM (as experimentally demonstrated in our previous work [30]). The PhaseD2NN simulated in this work consists of submicron-sized "optical neurons" distributed in 3D in a micron-sized optical element. Fabricating such custom-designed 3D optics to a desired specification is extremely challenging. Nevertheless, D2NNs have been experimentally demonstrated at terahertz wavelengths, and translating these models to visible wavelengths is an active area of research. For instance, two-photon lithography [40,41] is a promising avenue to fabricate D2NNs in 3D. The required fabrication precision may also be relaxed by incorporating details about the fabrication imperfections during the design stage itself [42]. We leave the robustness improvement, realization, and experimental validation of the optical processors to future work.

Conclusion
Quantitative phase microscopy (QPM) is an emerging label-free imaging modality with a wide range of biological and clinical applications. Recent advances in QPM are focused on developing fast instruments through better detectors and fast deep-learning-based inverse solvers. However, the QPM throughput is currently fundamentally limited by the pixel throughput of the imaging detectors. Orthogonal to current advances, to improve QPM throughput beyond the hardware bottleneck, here we propose to use content-aware compressive data acquisition. Specifically, we utilize learnable optical processors to extract compressed phase features. A state-of-the-art transformer deep network then decodes the captured information to quantitatively reconstruct the phase image. The proposed pipeline inherently improves the imaging speed while achieving high-quality reconstructions. Moreover, the advances presented in this work can lead to similar developments in a wide range of label-free coherent imaging modalities such as photothermal, coherent anti-Stokes Raman scattering (CARS), and stimulated Raman scattering (SRS).

Networks for linear compression feasibility studies
To establish the feasibility of using a linear system to compress an image or optical field, we conducted the analysis presented in section 2.2. For the analysis, we implemented simple autoencoder networks with linear or nonlinear decoders and linear, nonlinear, complex linear, or complex nonlinear encoders. All the autoencoders we discussed have the following general format.

$$ h = \mathrm{Enc}(y), \qquad \hat{y} = \mathrm{Dec}(h). $$

Here, y is the real or complex input image (depending on the experiment), h is the latent code, and ŷ is the reconstructed image. Enc(.) and Dec(.) are the functions that encode the input and decode the latent code.
To train real-valued autoencoders, we considered the following objective function.

$$ \mathrm{Enc}^{*}, \mathrm{Dec}^{*} = \arg\min_{\mathrm{Enc},\,\mathrm{Dec}} \mathbb{E}_{y}\left[ L\left(y, \hat{y}\right) \right]. $$

Enc* and Dec* are the optimal encoder and decoder found through Adam optimization [43]. L denotes the mean squared distance. E[.] denotes the expected value over the dataset.
Complex-valued autoencoders were trained with the following objective function.

$$ \mathrm{Enc}^{*}, \mathrm{Dec}^{*} = \arg\min_{\mathrm{Enc},\,\mathrm{Dec}} \mathbb{E}_{y}\left[ L\left(\angle y, \hat{y}\right) \right]. $$

Notably, here we consider ∠y (the input phase) as the ground truth. The goal is to extract information from the input phase and reconstruct it in the output (refer to sections 2.2 and 5.4).
The compression factor shown in Figs. 2 and 3 is defined as the ratio between the total number of pixels in y and h. We implemented the encoders using convolution layers, each with its kernel size, padding, and stride chosen to downscale by ×2. Decoders are implemented by cascading transpose convolution, ReLU activation, and batch normalization layers. Complex-valued autoencoders allow complex values in the inputs, y.

Optical processors
We consider two types of optical processors based on previous work [30]: Learnable Fourier Filter and PhaseD2NN.This section gives a brief description of these optical processors and the mathematical modeling of light propagation through them.
Learnable Fourier Filter. The LFF is an optical 4-f system with a filter placed in the Fourier plane. The transmission coefficients of this filter are optimized during the optimization process. The overall system is modeled using the following equation.

$$ x_{out} = \mathcal{F}^{-1}\left( \mathcal{F}(x_{in}) \circ T \right). $$

Here, x_in is the input light field to the LFF, x_out is the output light field, and T is the Fourier filter. F and F^{-1} denote the Fourier transform and the inverse Fourier transform, and ∘ denotes the Hadamard product.
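The filtering step described above (a Fourier transform, an element-wise multiplication with T, and an inverse Fourier transform) can be sketched numerically as follows. This is an illustrative sketch, not the authors' implementation: it assumes a discrete FFT on the 256 × 256 grid, a centered circular support of radius 128 grid points, and complex-valued filter coefficients. The names `circular_mask` and `lff_forward` are hypothetical.

```python
import numpy as np

def circular_mask(n, radius):
    """Boolean mask that is True inside a centered circle of the given radius."""
    yy, xx = np.indices((n, n))
    c = n // 2
    return (yy - c) ** 2 + (xx - c) ** 2 <= radius ** 2

def lff_forward(x_in, T):
    """x_out = F^-1( F(x_in) o T ), with the spectrum centered before filtering."""
    spectrum = np.fft.fftshift(np.fft.fft2(x_in))
    return np.fft.ifft2(np.fft.ifftshift(spectrum * T))

n = 256
rng = np.random.default_rng(0)

# Phase-only input field with unit amplitude (A_in = 1 is an assumption).
phi_in = rng.uniform(0, 2 * np.pi, (n, n))
x_in = np.exp(1j * phi_in)

# Randomly initialized filter, zero outside the circular aperture.
T = circular_mask(n, 128) * rng.uniform(0, 1, (n, n)) \
    * np.exp(1j * rng.uniform(0, 2 * np.pi, (n, n)))

x_out = lff_forward(x_in, T)
intensity = np.abs(x_out) ** 2  # what a full-resolution detector would record
print(intensity.shape)
```

In a trainable version, T would be the parameter updated by gradient descent; an automatic differentiation framework would replace NumPy for that purpose.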
PhaseD2NN. PhaseD2NN is a diffractive deep neural network [44] in which only the phase of the transmission coefficients is optimized. The amplitude of the transmission coefficients is set to 1 in each layer. We modeled light propagation through the D2NN using the Rayleigh-Sommerfeld diffraction theory [45, ch. 3.5]. After light propagates through a D2NN layer, the input field to the next layer is given by

$$ x_{in}^{(p+1)} = \mathrm{RS}\left( x_{in}^{(p)} \circ T^{(p)},\ \Delta z^{(p)} \right). $$

Here, x_in^{(p)} denotes the input field to the p-th layer, T^{(p)} is the complex transmission coefficient matrix of the p-th layer, and Δz^{(p)} is the distance between the p-th and the (p + 1)-th layers. RS(.) denotes the Rayleigh-Sommerfeld diffraction operator. The output field of the PhaseD2NN is given by

$$ x_{out} = \mathrm{RS}\left( x_{in}^{(M)} \circ T^{(M)},\ \Delta z^{(M)} \right), $$

where M is the number of layers in the PhaseD2NN and Δz^{(M)} is taken as the distance from the last layer to the detector plane. The PhaseD2NN simulated in this work consisted of 5 diffractive layers, each with a 256 × 256 grid of optical neurons. The size of each neuron was λ/2 × λ/2 (316.4 nm × 316.4 nm). Therefore, the size of each optical layer was 80.9984 µm × 80.9984 µm. Adjacent optical layers were separated by 3.373 µm. The distance between the input plane and the first optical layer was 3.373 µm, while the distance between the last optical layer and the detector plane was 5.904 µm.
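The layer-by-layer propagation can be sketched as below. The text specifies Rayleigh-Sommerfeld diffraction but not its numerical implementation. This sketch assumes the angular spectrum method, a common FFT-based way to evaluate free-space propagation, with evanescent components decaying exponentially. The geometry constants come from the description above; the function names and the random phase layers are hypothetical.

```python
import numpy as np

WAVELENGTH = 632.8e-9     # m
PITCH = WAVELENGTH / 2    # neuron size, 316.4 nm
D_IN = 3.373e-6           # input plane to first layer
D_LAYER = 3.373e-6        # between adjacent layers
D_OUT = 5.904e-6          # last layer to detector

def propagate(field, wavelength, pitch, dz):
    """Free-space propagation over dz via the angular spectrum method
    (an assumed stand-in for the RS(.) operator)."""
    n = field.shape[0]
    f = np.fft.fftfreq(n, d=pitch)
    fxx, fyy = np.meshgrid(f, f, indexing="ij")
    arg = 1.0 / wavelength ** 2 - fxx ** 2 - fyy ** 2
    kz = 2 * np.pi * np.sqrt(arg.astype(complex))  # imaginary when evanescent
    return np.fft.ifft2(np.fft.fft2(field) * np.exp(1j * kz * dz))

def phase_d2nn_forward(x_in, phases):
    """Propagate x_in through phase-only layers (unit amplitude transmission)."""
    field = propagate(x_in, WAVELENGTH, PITCH, D_IN)
    for p, phase in enumerate(phases):
        field = field * np.exp(1j * phase)  # T^(p) = exp(j * phase), |T| = 1
        dz = D_OUT if p == len(phases) - 1 else D_LAYER
        field = propagate(field, WAVELENGTH, PITCH, dz)
    return field

rng = np.random.default_rng(0)
n = 256
x_in = np.exp(1j * rng.uniform(0, 2 * np.pi, (n, n)))
layers = [rng.uniform(0, 2 * np.pi, (n, n)) for _ in range(5)]
x_out = phase_d2nn_forward(x_in, layers)
print(x_out.shape)
```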
For both LFF and PhaseD2NN, the final intensity captured by the detector is given by

$$ I = D\left( \left| x_{out} \right|^2 \right), $$

where D(.) denotes the low-resolution detection. We use the LFF and PhaseD2NN as all-optical baselines for result comparison. In the baselines, the output is detected at the original resolution of the input optical field following the Nyquist sampling theorem (i.e., without the D(.) operator). All-optical baselines are optimized in such a way that |x_out|^2 gives an approximation of the input phase.
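The optimization details describe the low-resolution detection D(.) as a stack of 2 × 2 average pooling operations. The sketch below shows one way to realize this, assuming that each pooling stage reduces the pixel count by ×4, so that ×64 and ×256 compression correspond to three and four stages respectively. The function names are hypothetical.

```python
import numpy as np

def avg_pool_2x2(img):
    """One 2 x 2 average pooling stage (halves each side)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def demagnify(intensity, compression):
    """Apply as many 2 x 2 pooling stages as needed for the given compression.

    Assumes the compression is a power of 4 (each stage gives x4).
    """
    out = np.asarray(intensity, dtype=float)
    remaining = compression
    while remaining > 1:
        if remaining % 4:
            raise ValueError("compression must be a power of 4 in this sketch")
        out = avg_pool_2x2(out)
        remaining //= 4
    return out

rng = np.random.default_rng(0)
intensity = rng.uniform(0, 1, (256, 256))
for c in (64, 256):
    print(c, demagnify(intensity, c).shape)
```

Average pooling preserves the mean intensity, which is consistent with treating D(.) as an optical demagnification onto larger effective pixels.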

Optimization details
We follow a 3-stage optimization procedure for improved stability of the end-to-end optimization: 1) optimize the optical processor; 2) optimize the decoder neural network; 3) fine-tune end-to-end.
Optimize the optical processor. Here the optical processor is optimized to reconstruct the phase at its output intensity. For an input optical field x_in = A_in e^{jφ_in}, we train an optical model H_O through which the input field is propagated to produce the output field x_out = A_out e^{jφ_out} = H_O(x_in). We use the phase reconstruction loss, L_φ, introduced in previous work [30]: the expected L1 loss, L1(.), between the output intensity |x_out|^2 and the input phase, taken over the probability distribution of phase objects, P_X.
Optimize the decoder neural network. At this stage, we consider the end-to-end network; however, only the weights of the neural network are optimized. The pretrained optical processor discussed in the previous step is utilized to encode the input phase. We demagnify the output field of the optical processor to compress the intensity representation. The super-resolution neural network reconstructs the input phase from the compressed intensity representation. The reconstructed phase information is given by $\hat{\phi} = H_E(D(|A_{out}|^2))$. Here, $H_E(\cdot)$ represents the decoder neural network and $D(\cdot)$ is the optical demagnification layer, which is simulated through a stack of 2 × 2 average pooling operations [46]. Similar to previous work [34], we consider $L_{swin}$, a combination of loss functions, for this optimization, where $L_{perceptual}$ and $L_{adversarial}$ represent the perceptual loss [34] and the adversarial loss [34], respectively.
End-to-end fine-tuning. As the final stage, we fine-tune the end-to-end ∂-QPM pipeline to reconstruct the phase at the output of the network. To improve the reconstruction in terms of capturing fine cell structures, we incorporate the negative structural similarity index measure (SSIM) [47] as the loss function,

$$L_{SSIM} = -\mathbb{E}_{x_{in} \sim P_X}\left[\frac{1}{M}\sum_{j=1}^{M}\frac{\left(2\mu_{X_j}\mu_{Y_j} + C_1\right)\left(2\sigma_{X_j Y_j} + C_2\right)}{\left(\mu_{X_j}^2 + \mu_{Y_j}^2 + C_1\right)\left(\sigma_{X_j}^2 + \sigma_{Y_j}^2 + C_2\right)}\right].$$

Here, $X_j$ and $Y_j$ represent equal-sized windows from a normalized input phase image ($\phi_{in}/2\pi$) and the corresponding reconstructed phase output ($\hat{\phi}$), respectively, with $M$ windows per image. $P_X$ represents the probability distribution of input phase objects. $\mu_{X_j}$ and $\mu_{Y_j}$ are the means, $\sigma_{X_j}^2$ and $\sigma_{Y_j}^2$ the variances, and $\sigma_{X_j Y_j}$ the covariance of the $X_j$ and $Y_j$ windows. $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$ are regularization parameters with $L = 1.0$, $k_1 = 0.01$, and $k_2 = 0.03$.
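The windowed SSIM loss described above can be sketched in NumPy as follows. This is an illustrative simplification, not the training code: it uses non-overlapping uniform windows, whereas common SSIM implementations use sliding Gaussian windows, and the function name `negative_ssim` and the window size of 8 are assumptions introduced here. The constants $L$, $k_1$, and $k_2$ follow the values stated in the text.

```python
import numpy as np

def negative_ssim(x, y, win=8, L=1.0, k1=0.01, k2=0.03):
    """Negative mean SSIM over non-overlapping win x win windows.

    x: normalized target phase (phi_in / 2*pi); y: reconstruction; both in [0, 1].
    Non-overlapping uniform windows are a simplification made for this sketch.
    """
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    h, w = x.shape
    h, w = h - h % win, w - w % win  # drop edge pixels that do not fill a window

    def windows(img):
        # Rearrange into shape (M, win*win): one row per window.
        blocks = img[:h, :w].reshape(h // win, win, w // win, win)
        return blocks.transpose(0, 2, 1, 3).reshape(-1, win * win)

    xw, yw = windows(x), windows(y)
    mu_x, mu_y = xw.mean(axis=1), yw.mean(axis=1)
    var_x, var_y = xw.var(axis=1), yw.var(axis=1)
    cov_xy = ((xw - mu_x[:, None]) * (yw - mu_y[:, None])).mean(axis=1)

    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return -ssim.mean()

rng = np.random.default_rng(0)
target = rng.uniform(0.0, 1.0, size=(64, 64))
noisy = np.clip(target + 0.2 * rng.standard_normal(target.shape), 0.0, 1.0)
loss_same = negative_ssim(target, target)   # perfect reconstruction
loss_noisy = negative_ssim(target, noisy)   # degraded reconstruction
```

A perfect reconstruction attains the minimum loss of −1, and any degradation raises the loss, which is why minimizing the negative SSIM rewards structural agreement within each window.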

Datasets
In our numerical experiments, we used three datasets.
PhaseMNIST Dataset. We developed PhaseMNIST, a complex-valued dataset, for the evaluations in Section 2.2. Each complex image of the dataset was obtained according to Eq. (15).
Here, I is the complex image and ψ is an image from the original MNIST dataset [33] scaled into [0, 1].
HeLa Cell Dataset. We used a HeLa cell dataset [30] as the primary dataset for our experiments. We followed the sample preparation procedure explained in previous work [30]. Briefly, the data were acquired using a low spatially coherent quantitative phase microscopy system. The details of the experimental setup can be found in [48]. First, multiple phase-shifted interferograms of both HeLa cells and tissue samples are recorded. The phase recovery is then performed by employing the advanced iterative algorithm (AIA), which can retrieve phase maps using random phase-shifted interferograms. The details of the algorithm can be found in [49]. The initial dataset contained 501 complex-valued images (i.e., detected FoVs). Each detected FoV was obtained by a camera with a 2304 × 2304 pixel grid, where the pixel size was 6.5 µm × 6.5 µm. The light field from the specimen was magnified ×60 before imaging onto the detector.
To pre-process the dataset, we first calculated the side length of the light fields before the magnification (2304 pixels × 6.5 µm/pixel / 60 = 249.6 µm, i.e., 789 pixels). We then resized the detected FoVs (i.e., 2304 × 2304 pixel grids) into 789 × 789 pixel grids. This resulted in the light field before the magnification with a pixel size of 316.4 nm × 316.4 nm. We refer to these FoVs as full FoVs. We obtained the train and test sets by dividing the full FoV dataset into 401 and 100 FoVs, respectively. For the training of the proposed networks, we used 256 × 256 cropped patches (i.e., patch FoVs) from the full FoVs. Phase values of the dataset were clipped into [0, 2π).
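The grid-size arithmetic above can be checked with a few lines of Python. The helper name `object_plane_side_um` is introduced only for this sketch, and treating the 789-pixel grid as the side length divided by the 316.4 nm pixel size (rounded to the nearest integer) is our reading of the pre-processing step.

```python
def object_plane_side_um(n_pixels, pixel_um, magnification):
    """Side length of the light field before magnification, in micrometres."""
    return n_pixels * pixel_um / magnification

# HeLa acquisition: 2304-pixel camera side, 6.5 um pixels, x60 magnification.
hela_side_um = object_plane_side_um(2304, 6.5, 60)

# Object-plane pixel size stated in the text (316.4 nm).
target_pixel_um = 0.3164
hela_grid = round(hela_side_um / target_pixel_um)

print(f"side length: {hela_side_um:.1f} um, resized grid: {hela_grid} x {hela_grid}")
```

The same helper applies to other acquisitions by changing the magnification argument.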
Tissue Dataset. We also acquired a tissue dataset to investigate the limitations of our method further. We utilized a 4-micron-thick tissue sample, which was prepared on a reflecting substrate (a Si wafer in our case). The sample was illuminated from above by a light beam, which traverses the sample and is subsequently reflected off the Si substrate. We followed acquisition and preprocessing procedures similar to those for the HeLa cells, with a magnification of ×20. There were 470 detected FoVs. The camera had a 2304 × 2304 pixel grid, where the pixel size was 6.5 µm × 6.5 µm. The side length of the light fields before the magnification was 2304 pixels × 6.5 µm/pixel

Simulation details
We numerically simulated and trained the proposed ∂-QPM pipeline using Python version 3.6.13. The simulation was done according to Eqs. (8), (9), and (10). We used automatic differentiation in the PyTorch [50] framework (version 1.8.0) for the joint optimization/training of the proposed pipeline. All experiments were conducted on a server with 12 Intel Xeon Platinum 8358 (2.60 GHz) CPU cores and an NVIDIA A100 graphics processing unit with 40 GB of memory, running the CentOS operating system. We used a batch size of 32 and learning rates of 0.1 and 0.001 for LFF and PhaseD2NN, respectively, in optimization stage 1. LFF was trained for 1500 epochs with a multi-step learning rate scheduler [50] (milestones: [50, 400, 650, 1000, 1400], γ = 0.1). PhaseD2NN was trained for 1500 epochs after each optimizer initialization step in Algorithm S1. For joint multi-layer optimizations in Algorithm S1, a learning rate of 0.00005 was used for better stability. For optimization stage 2, we followed the same training configurations used for real-world image SR in SwinIR (Sections 4.1 and 4.3 of [34]), with a channel size of 1 and pixel-shuffle upsampling. Lastly, for the final optimization stage (i.e., end-to-end fine-tuning), we fine-tuned LFF + SwinIR and PhaseD2NN + SwinIR for 24000 and 3000 epochs, respectively, with a learning rate of 5 × 10⁻⁶. We used Adam [43] as the optimizer for all optimizations.
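To make the LFF learning rate schedule concrete, the sketch below evaluates a multi-step decay rule with the milestones and γ stated above. It is a plain-Python illustration of the rule (decay by γ each time a milestone is reached), which to our understanding matches the behaviour of PyTorch's `torch.optim.lr_scheduler.MultiStepLR`; the function name `multistep_lr` is hypothetical and this is not the training script itself.

```python
from bisect import bisect_right

def multistep_lr(epoch, base_lr=0.1, milestones=(50, 400, 650, 1000, 1400), gamma=0.1):
    """Learning rate at `epoch`: base_lr is multiplied by gamma per milestone reached."""
    return base_lr * gamma ** bisect_right(milestones, epoch)

# Learning rate just before and at selected milestones, and at the last epoch.
schedule = {epoch: multistep_lr(epoch) for epoch in (0, 49, 50, 400, 1000, 1499)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch:4d}: lr = {lr:.0e}")
```

With these settings the LFF learning rate starts at 0.1 and has decayed by five orders of magnitude by the end of the 1500 epochs.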
Funding. National Institute of Mental Health (R21-MH130067); The Research Council of Norway.

Fig. 1. Overview of differentiable quantitative phase microscopy (∂-QPM): (A) End-to-end pipeline of ∂-QPM. The input light field propagates through an optical processor to produce the output light field. The output field is imaged using a smaller camera sensor at low resolution. The intensity map, imaged by the camera, is fed to the neural network to reconstruct a high-resolution phase map of the original input light field. (B1) A potential design of the optical processor using a Fourier filter with learnable transmission coefficients. All lenses ($f_1$, $f_2$, and $f_3$) are placed at 4f configurations. (B2) Another potential design of the optical processor using a diffractive neural network (PhaseD2NN). Here, $f_1$, $f_2$, and $f_3$ are lenses; the two $\mathrm{OBJ}_R$ elements are objective lenses. The first $\mathrm{OBJ}_R$ forms a remote focus. The PhaseD2NN is placed with respect to the remote focal plane. The output plane of the PhaseD2NN is imaged to the camera sensor using the second $\mathrm{OBJ}_R$ and downstream lenses $f_1$, $f_2$, and $f_3$. All lenses are placed at 4f configurations.

Fig. 3. Phase-to-intensity conversion and compressibility on the PhaseMNIST dataset using linear (L) and nonlinear (NL) encoders (E) and decoders (D). Both encoders are complex-valued and are hence denoted as CLE and CNLE.

Fig. 5. Performance comparison of the best methods using side-by-side comparisons of phase reconstructions (a), compressed intensity fields (b), and SSIM maps of reconstructions (c). Subfigure (d) shows the resolving power of the phase reconstructions. The compared segments: ground-truth phase of the input field (A2) of a full FoV from the test set; all-optical results using LFF (B1) and PhaseD2NN (B2); phase reconstructions from our approach 1, LFF + SwinIR with ×64 compression with fine-tuning (C2) and LFF with ×256 compression with fine-tuning (C4); and phase reconstructions from our approach 2, PhaseD2NN + SwinIR with ×64 compression with fine-tuning + 5 optical layers (C8).

Table 1. Performance comparison. The best results for the optical feature extraction networks, LFF and PhaseD2NN, are highlighted. These best models are further fine-tuned end-to-end with detector noise simulation (noise specifications of the detector: read noise standard deviation = 6.0, maximum photon count = 10000) to make the simulation more realistic. We calculate the patch and full FoV metrics on the test patch FoVs and full FoVs, respectively. We reconstruct the full FoVs by tiling the reconstructed patch FoVs.
1, PhaseD2NN training was not stable due to the large number of physical parameters with a larger grid size (e.g., 256 × 256).