Towards Practical Single-shot Phase Retrieval with Physics-Driven Deep Neural Network

Phase retrieval (PR), a long-established challenge for recovering a complex-valued signal from its Fourier intensity-only measurements, has attracted considerable attention due to its widespread applications in digital imaging. Recently, deep learning-based approaches were developed that achieved some success in single-shot PR. These approaches require only a single Fourier intensity measurement, without the need to impose any additional constraints on the measured data. Nevertheless, vanilla deep neural networks (DNNs) do not perform well due to the substantial disparity between the input and output domains of PR problems. Physics-informed approaches try to incorporate the Fourier intensity measurements into an iterative procedure to increase the reconstruction accuracy. They, however, require a lengthy computation process, and the accuracy still cannot be guaranteed. Besides, many of these approaches work on simulation data that ignore common problems in practical optical PR systems, such as saturation and quantization errors. In this paper, a novel physics-driven multi-scale DNN structure dubbed PPRNet is proposed. Similar to other deep learning-based PR methods, PPRNet requires only a single Fourier intensity measurement. It is physics-driven in that the network is guided to follow the Fourier intensity measurement at different scales to enhance the reconstruction accuracy. PPRNet has a feedforward structure and can be trained end-to-end. Thus, it is much faster and more accurate than the traditional physics-driven PR approaches. Extensive simulations and experiments on a practical optical platform were conducted. The results demonstrate the superiority and practicality of the proposed PPRNet over the traditional learning-based PR methods.


I. Introduction
Phase retrieval (PR) aims to reconstruct a complex-valued signal from its Fourier intensity measurements alone. It is a key problem in crystallography [1], [2], optical imaging [3], astronomical imaging [4], diffraction imaging [5], etc. PR is also a crucial component of holographic imaging [6]. The investigation of PR methods was initiated in the 1970s. Numerous reconstruction approaches were developed by the optics research community [1], [3], [7]. Recently, developments in modern optimization theories [8]-[10] and computational imaging [11], [12] have provided further understanding of the problem. From the mathematical perspective, the PR problem can be expressed as follows [9]:
$$\text{find } x \quad \text{s.t.} \quad X = |\mathcal{F}(h \circ x)|^2, \tag{1}$$
where $x \in \mathbb{C}^N$ is the complex-valued signal of interest; $X$ denotes the Fourier intensity measurements; and $\circ$ and $\mathcal{F}$ stand for elementwise multiplication and the Fourier transform, respectively. The pre-determined optical masks $h$ are optional; they provide constraints that lessen the ill-posedness of the problem. There are various methods for implementing the masks. For example, early PR approaches considered the non-zero signal support as the optical mask [3]. However, these traditional methods can guarantee neither a globally optimal solution (uniqueness condition) nor the convergence of the optimization procedure. In recent years, random masks were employed as powerful constraints for the optimization process [8], [13]. They can be implemented using a digital micromirror device (DMD) or a spatial light modulator (SLM) [14], [15]. Although the use of random masks can improve the reconstruction performance, the high cost and inaccuracy of DMD and SLM devices deter the general application of the method. Besides, it is empirically shown in [8], [9] that around 4 to 6 measurements are needed for accurate reconstruction with random binary masks. This increases the data acquisition time and is thus undesirable for dynamic applications. To solve (1), iterative optimization methods, such as ADMM [16], Wirtinger Flow [8], etc., are generally used. These approaches are extremely time-consuming. Thus, the resulting PR methods are not suitable for real-time applications.
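As a rough illustration of the measurement model in (1), the following NumPy sketch computes a zero-padded (oversampled) Fourier intensity from a complex-valued image. The helper name `fourier_intensity`, the toy sizes, and the squared-modulus convention are assumptions made for illustration only; the mask `h` is optional, as in the text.

```python
import numpy as np

def fourier_intensity(x, h=None, M=None):
    """Hypothetical helper: |F(h * x)|^2, zero-padded to M x M if M is given."""
    x = np.asarray(x, dtype=complex)
    if h is not None:
        x = h * x  # optional element-wise mask
    shape = None if M is None else (M, M)
    return np.abs(np.fft.fft2(x, s=shape)) ** 2

rng = np.random.default_rng(0)
N, M = 8, 16  # toy sizes (the paper uses N = 128 and M = 762)
x = rng.random((N, N)) * np.exp(2j * np.pi * rng.random((N, N)))
X = fourier_intensity(x, M=M)  # the phase of F(x) is discarded here
```

Only the non-negative array `X` is available to a PR method; recovering `x` from it is the inverse problem discussed in this paper.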
In the last three decades, deep neural networks (DNNs) have been widely studied and successfully applied to different applications [17]. They were also used in PR problems [18]-[27]. Different from the traditional optimization-based approaches, deep learning-based PR methods can work with only a single Fourier intensity measurement. These methods can be built on feedforward DNN structures and achieve real-time performance when run on a GPU. Nevertheless, the accuracy of these methods still has much room for improvement. It remains a challenge to directly infer a complex-valued signal from its Fourier intensity due to the enormous discrepancy between a signal in the Fourier and spatial domains. To improve the accuracy, researchers also suggested a physics-driven method [22] in which the intensity measurement was used to inform a plug-and-play optimization process. However, the method is iterative and as time-consuming as the traditional optimization-based methods. Besides using the plug-and-play structure, researchers also introduced physics information to the PR process by directly plugging in the HIO algorithm [3], running it alternately with a DNN in an iterative procedure [27]. These physics-driven approaches share a common characteristic: they apply the physics information to a traditional optimization method and let it work iteratively with a network model. However, in this case, the system cannot be trained end-to-end. The estimation error of the optimization algorithm may have a particular distribution that is unknown to the network model. The iterative process can thus be trapped at a local minimum and fail to give the best solution.
One common problem of these deep learning-based PR methods is that most of them are trained and tested with simulation data. Apart from noise, they often ignore the other artifacts found in practical Fourier intensity measurements, which makes their reported results unreliable. Most PR applications involve structured images, which have energy concentrated at low frequencies (in particular, the d.c. component). The dynamic range of the data in an intensity measurement is thus extremely large. Most general imaging devices nowadays have a dynamic range of only 12 to 16 bits. As a result, the low-frequency part of the intensity image is severely saturated. As shown in Section IV-A, the saturation problem can significantly affect the performance of PR methods.
To rectify the abovementioned problems, we propose in this paper a novel physics-driven deep learning-based PR method dubbed PPRNet. Similar to other deep learning-based PR methods, PPRNet requires only one Fourier intensity measurement for each PR reconstruction. It, however, gives a much higher accuracy by using a physics-driven method that guides the network to follow the Fourier intensity measurement at different scales when reconstructing the images. The resulting network structure is still of the feedforward type. Thus, it is much faster than the iterative physics-driven approaches. The whole network can be trained end-to-end to give the best result. PPRNet is enabled by a novel Hybrid Unwinding Block (HUB) embedded in a multi-scale convolutional neural network (CNN) structure. HUB directs the input feature map into two paths such that the global and local information of the feature map at different scales are separately processed in a physics-informed manner. The data are then combined with a channel attention method so that the significant features are collected for reconstructing the images. To evaluate the generality of the proposed PPRNet, we conducted a series of simulations with 2 different datasets that contain complex-valued images with linearly correlated and uncorrelated magnitude and phase. All data in the intensity images were capped and represented by 12-bit integers to simulate the saturation problem with quantization errors. We then adopted the defocusing method [28] to mitigate the saturation problem in the intensity images. As shown in the simulation results, the proposed PPRNet significantly outperforms the state-of-the-art deep learning-based PR methods. To understand the performance of the proposed PPRNet in practical applications, we constructed an optical setup for generating the intensity measurements of phase-only images obtained from 3 datasets. Naturally, all intensity measurements had a serious saturation problem in their low-frequency data. We again used the defocusing method to mitigate the effect of the problem. The measurements were then used for the training and testing of the proposed PPRNet. Based on these experimental data, we compared the proposed PPRNet with the state-of-the-art deep learning-based PR methods. A significant improvement in accuracy is noted. PPRNet also has lower complexity than other physics-driven PR methods.
To summarize, the contribution of this work is threefold:
1. A physics-driven deep learning-based PR method, namely PPRNet, is developed. It requires only a single Fourier intensity measurement for each PR reconstruction without the need for any additional masking scheme to impose constraints on the measurement. It allows the Fourier intensity measurement to inform the training and inference processes so that the model is guided to give the right solution. Experimental results have demonstrated the effectiveness of this approach and the improvement it brings over the traditional deep learning-based PR methods. PPRNet has a non-iterative feedforward structure and is trained end-to-end. It has lower complexity than the existing physics-driven approaches while having higher accuracy.
2. A novel Hybrid Unwinding Block (HUB) is proposed. Since the proposed PPRNet has a multi-scale structure, HUBs are embedded at different scales of the network to facilitate the use of the physics information in guiding the training and inference of the network. HUB separately processes the global and local information of the feature maps with the aid of the Fourier intensity measurement and combines them with a channel attention method. Our ablation study has shown the importance of HUB in PPRNet.
3. Different from the traditional deep learning-based PR approaches that are trained and tested with simulation data, we construct an optical platform to evaluate the proposed PPRNet and compare it with the state-of-the-art approaches. The results thus reflect the true performance in practical applications more reliably.
This paper is organized as follows. Section II provides a comprehensive review of the traditional optimization-based algorithms and deep learning-based PR approaches. Section III introduces the proposed PPRNet. The simulation results and ablation studies are shown in Section IV. The experimental results and comparisons with the existing methods are shown in Section V, followed by the conclusion in Section VI.

A. Optimization-based PR Algorithms
In the early days, the traditional phase retrieval methods were based on the iterative alternating minimization (AM) framework [1], [3], [7]. The estimated image $x \in \mathbb{C}^{N \times N}$ is updated iteratively between the spatial and Fourier domains. Although the AM framework offers a portable solution, AM-based algorithms are prone to stagnation and slow to converge (usually, more than 1000 iterations are needed). Besides, they are sensitive to initialization.
In recent years, the Wirtinger flow (WF) PR algorithm was developed with the advancement of modern optimization theories [8]. Different from the AM framework, WF solves the phase retrieval problem through gradient descent:
$$x_{k+1} = x_k - \mu_{k+1} \nabla f(x_k), \tag{2}$$
where $\nabla f(x_k)$ represents the first-order gradient of the loss function $f$ (i.e., the MSE loss), and $\mu_{k+1}$ is the step size at the current iteration. Empirically, 4 to 8 measurements are needed for a globally optimal solution [8], [9]. Although WF provides a theoretical guarantee for convergence to the globally minimal solution, it often fails to converge to a satisfactory result if only one intensity measurement is given.
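To make the gradient-descent update concrete, the sketch below implements one WF-style step for a single zero-padded Fourier measurement. It is only an illustration under stated assumptions: the quartic intensity loss, the function names (`A`, `AH`, `loss`, `wirtinger_grad`, `wf_step`), and the toy sizes are not taken from [8] or from this paper, and no spectral initialization or step-size schedule is included.

```python
import numpy as np

N, M = 8, 16  # toy image / measurement sizes (assumed)

def A(z):
    """Zero-padded 2-D DFT: the linear part of the measurement."""
    return np.fft.fft2(z, s=(M, M))

def AH(w):
    """Adjoint of A: unnormalised inverse DFT followed by cropping."""
    return (M * M) * np.fft.ifft2(w)[:N, :N]

def loss(z, y):
    """MSE-type loss between |A z|^2 and the measured intensity y."""
    return np.sum((np.abs(A(z)) ** 2 - y) ** 2) / (2 * y.size)

def wirtinger_grad(z, y):
    """Wirtinger gradient of the loss above with respect to z."""
    b = A(z)
    return AH((np.abs(b) ** 2 - y) * b) / y.size

def wf_step(z, y, mu):
    """One gradient-descent update: z_{k+1} = z_k - mu * grad f(z_k)."""
    return z - mu * wirtinger_grad(z, y)

rng = np.random.default_rng(0)
x_true = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
y = np.abs(A(x_true)) ** 2  # single intensity measurement
```

The gradient vanishes at the true signal, but also at every trivially ambiguous copy of it and at other stationary points, which is consistent with the observation that a single measurement is usually not enough for WF.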

B. Deep Learning-based PR Methods
In recent years, deep learning-based PR approaches have been widely studied since they provide non-iterative inference, in contrast to the time-consuming optimization-based algorithms. Most of these approaches can work with only one Fourier intensity measurement for each PR reconstruction without the need for any additional constraints on the measurement. This seemingly impossible task, in fact, has a theoretical basis [29]. It is known that if the Fourier intensity measurement is oversampled by two or more times in each dimension, the original complex-valued signal can be uniquely reconstructed, except for trivial ambiguities. Although such a reconstruction problem is highly non-convex (which is why the traditional optimization methods fail to perform), it is particularly suitable for deep learning-based methods due to their non-linear nature. Besides, they can also make use of the statistics acquired from a huge dataset to infer the solution. In general, deep learning-based PR methods can be split into two categories depending on whether the underlying physics is adopted in the networks.
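The trivial ambiguities mentioned above are the global phase shift, the spatial (circular) shift, and the conjugate inversion of the signal; none of them changes the Fourier intensity. The short NumPy check below illustrates this on a toy signal. The sizes and variable names are chosen for illustration only.

```python
import numpy as np

def intensity(x, M):
    """Fourier intensity of x, zero-padded to M x M."""
    return np.abs(np.fft.fft2(x, s=(M, M))) ** 2

rng = np.random.default_rng(1)
N, M = 8, 16  # 2x oversampling in each dimension
x = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))

# Three trivially ambiguous versions of x
global_phase = np.exp(1j * 0.7) * x
shifted = np.roll(np.pad(x, ((0, M - N), (0, M - N))), (3, 2), axis=(0, 1))
flipped = np.conj(x[::-1, ::-1])

same = [np.allclose(intensity(v, M), intensity(x, M))
        for v in (global_phase, shifted, flipped)]
```

All three entries of `same` are true, so a PR method can at best recover the signal up to these transformations.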
For the first category, a feedforward network is used to estimate the target images directly from a Fourier intensity measurement [18], [23]-[25], [28], [30]-[32]. Specifically, [25] proposed a two-branch CNN to reconstruct the magnitude and phase parts from an oversampled Fourier intensity measurement, and [30] applied this network to the 3D crystal PR problem. The approach shows reasonably good performance for simple images with few details. Its performance when dealing with complex images is unknown. [23] applied the conditional generative adversarial network to reconstruct the images. The network is extremely large since multiple multilayer perceptrons (MLPs) are used for all stages of the network. [31] adopted the ResNet [33] structure for Fourier phase retrieval tasks. Only a simple dataset (MNIST) was used for testing, and the performance was only barely satisfactory. The error in the details was still rather large. [18] implemented a UNet structure [34] to reconstruct phase-only images from Fresnel diffraction patterns. Its performance with Fraunhofer diffraction patterns (Fourier intensity) is unknown. As for [32], the authors adopted a multiple-resolution UNet structure and connected the hidden layers in the decoder to additional convolution layers to produce coarse outputs in an attempt to match the low-frequency components. Only the result of using 2 measurements is shown in that paper, and the blurring effect in the result is quite significant. Recently, our team also developed a feedforward DNN structure to tackle the PR problem [28]. It has an MLP front end for feature extraction and a residual attention-based reconstruction unit to generate the phase images. Although it outperformed most of the existing state-of-the-art methods, its performance when dealing with more complex images still had room for further improvement.
To improve the quality of the reconstructed images, the second category of deep learning-based PR methods implicitly or explicitly utilizes the underlying physics in the models [21], [22], [26], [27], [35], [36]. For instance, [35] made use of the physics information to perform a learnable spectral initialization [8]. It is followed by a double-branch UNet for reconstruction. The approach requires an additional masking scheme to impose constraints on the measurement. The reconstructed images are rather noisy, even for simple images. [24] proposed to use MLPs of different sizes in a cascaded network. The intensity measurement is applied to each MLP to assist the training and inference. The network size is very large due to the use of multiple MLPs, and the approach fails to reconstruct the details in the images. There are other iterative approaches similar to the traditional optimization-based methods. For instance, [27] suggested an iterative method with a 3-step structure: HIO initialization, iterative update between UNet and HIO, and final refinement by UNet. [22] proposed to embed a pre-trained DnCNN [37] into a plug-and-play iterative algorithm for refining the estimated images at each iteration. These iterative physics-driven approaches are usually quite time-consuming. Besides, they cannot be trained end-to-end, which often affects the overall performance.

III. The Proposed Approach
We present in this section the proposed Physics-Driven Phase Retrieval Network, PPRNet. Fig. 1 illustrates the overall architecture, which is essentially a multi-scale structure consisting of five parts: Initialization (Init), Hybrid Unwinding Block (HUB), Downsampling (DS), Upsampling (US), and Post-Processing (PP). The multi-scale structure has been widely applied to deep learning-based image restoration tasks [34]. It effectively extracts the essential features of the input image while redundant and insignificant components, such as noise and outliers, are ignored. The essential features are then used to reconstruct the target image gradually in scale. In Fourier PR problems, however, the input (Fourier intensity measurement) and the required output (complex-valued spatial image) have a large domain discrepancy. Special add-on structures are required to accomplish the task. The details are explained below.

A. The Overall Architecture
The proposed PPRNet takes the Fourier intensity measurement $X \in \mathbb{R}^{M \times M \times 1}$ as the input. It is fed to the Init Block to give the initial guess $x \in \mathbb{R}^{N \times N \times C}$ in the spatial domain, where $N < M/2$ and $C$ is the number of channels. In our experiment, $M$, $N$, and $C$ are set as 762, 128, and 64, respectively. In the Init Block, a random image is first generated and refined by a HUB. As will be explained later,

B. Initialization
Initialization is a crucial step in traditional optimization-based algorithms as it provides an appropriate starting point in the spatial domain for the optimization process. Proper initialization can help the optimization algorithms avoid getting trapped in undesirable saddle points [8]. Deep learning models, which are data-driven systems, are usually sensitive to the quality of the inputs. Improper initialization brings additional noise that harms the performance of the deep learning model. Instead of using conventional initialization approaches like [3], [8], we propose to include the initialization in the learnable network structure. First, we randomly generate the spatial image $x_{init} \sim \mathcal{N}(0, 1) \in \mathbb{R}^{N \times N}$. $x_{init}$ is then fed into a network that contains a 1×1 convolutional layer to convert the data into $C$ channels. They are then sent to a HUB to generate the initial guess for the subsequent multi-scale network. Since HUB is physics-informed and trained end-to-end with the other parts of the network, it gives a better initial estimation of the target image than the traditional approaches that rely only on the available physics information. An example is shown in Fig. 1. It can be seen that the shape of the object is roughly constructed by the Init Block. It is left to the subsequent multi-scale structure to further enhance the image.

C. Hybrid Unwinding Block (HUB)
HUB is the crucial processing block of PPRNet. It is used in the Init Block to refine the initial guess and also in the expanding path to guide the reconstruction process. Its structure is shown in Fig. 2. HUB first splits the input feature maps $x \in \mathbb{R}^{H \times W \times C}$ into two branches for processing (where $H$ and $W$ are set as $N$ in our experiments). $u \in \mathbb{R}^{H \times W \times 2}$ are the first two channels of the feature maps. They are fed to the Physics-driven Unwinding Block (PUB). The remaining channels $v \in \mathbb{R}^{H \times W \times (C-2)}$ are processed by the Feature Refinement Block (FRB). We denote the outputs of PUB and FRB as $\tilde{u}$ and $\tilde{v}$, respectively. They are concatenated with the input feature maps $x$ and sent to the Feature Fusion Block (FFB), which makes use of the channel attention method to extract the significant features to be sent to the next stage. The detailed operations of these functional blocks are described below.
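The channel bookkeeping of HUB can be sketched as follows. This is a shape-only illustration: the PUB and FRB branches are replaced by identity stand-ins, and the variable names are assumptions. It confirms that concatenating the two branch outputs with the input yields $2 + (C-2) + C = 2C$ channels for FFB.

```python
import numpy as np

H = W = 128
C = 64
x = np.zeros((H, W, C), dtype=np.float32)  # input feature maps

u = x[..., :2]   # first two channels -> PUB (real and imaginary parts)
v = x[..., 2:]   # remaining C - 2 channels -> FRB

u_tilde, v_tilde = u, v  # identity stand-ins for the PUB and FRB outputs
ffb_input = np.concatenate([u_tilde, v_tilde, x], axis=-1)  # 2C channels
```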

Fig. 2: Structure of the Hybrid Unwinding Block (HUB).

1) Physics-driven Unwinding Block (PUB):
As shown in the traditional physics-driven PR methods, applying prior domain knowledge to the optimization process can improve the estimation accuracy. A typical approach is to use the intensity measurement as prior information to constrain the optimization. It, however, often ends up with an iterative process since the constraint needs to be applied repeatedly to take effect. To solve the problem, we propose PUB, which unwinds the iterative process into a feedforward network operation. Fig. 3 shows the structure of PUB. It can be seen that a PUB contains $K$ unwinding layers cascaded in series. In our experiment, $K = 5$. Let us take the $(k+1)$th unwinding layer for illustration. $u_k \in \mathbb{R}^{H \times W \times 2}$, which forms the real and imaginary parts of the input feature map, is first transformed to the Fourier domain through:
$$U_k = \mathcal{F}(u_k). \tag{3}$$
We then update the magnitude with the intensity measurement to constrain the estimation. Note that HUB is applied to different scales in the expanding path. We need to convert the intensity measurement to the respective scale before applying it to PUB. The updated $U_k$ is then mapped back to the spatial domain through the inverse Fourier transform and produces an updated image $u_k$. The whole operation can be expressed as: where $S(X)$ refers to the filtering and folding of $X$ for converting it to the required scale without aliasing. It forms the magnitude constraint that brings domain-specific prior information to the reconstruction process during the training and inference of the network. It informs the network of the reconstruction target at every scale of the expanding path. $u_k$ is then fed to a shallow CNN structure $g_k(\cdot)$ (empirically, we stacked 8 convolutional layers) for learning the missing phase information. Instead of generating the refined image directly, $g_k$ is trained to give the residue of the desired output from the input feature map. This reduces the training difficulty. The operation can be expressed as follows: where $\beta_k$ is a learnable parameter. The resulting image $u_{k+1}$ then acts as the input to the next layer of PUB. Both $g_k(\cdot)$ and $\beta_k$ can be trained adaptively to control the amount of magnitude constraint to be included in $u_{k+1}$. Together with FFB (which will be described later), they give the network the flexibility to determine, through end-to-end training, how much the physics information should be utilized in the image reconstruction process.
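One possible reading of a single unwinding layer is sketched below in NumPy. Several details are assumptions rather than the exact operations of PUB: the target magnitude `mag` is taken to be already converted to the $H \times W$ grid (standing in for the scaled measurement $S(X)$), the magnitude is replaced while the current phase is kept, the residual update has the form `u_bar + beta * g(u_bar)`, and `g` is a zero-output placeholder for the shallow CNN.

```python
import numpy as np

def pub_layer(u, mag, g, beta, eps=1e-8):
    """One hypothetical unwinding layer on u of shape (H, W, 2) = (real, imag)."""
    z = u[..., 0] + 1j * u[..., 1]
    U = np.fft.fft2(z)                      # to the Fourier domain
    U = mag * U / (np.abs(U) + eps)         # keep the phase, impose the magnitude
    z_bar = np.fft.ifft2(U)                 # back to the spatial domain
    u_bar = np.stack([z_bar.real, z_bar.imag], axis=-1)
    return u_bar + beta * g(u_bar)          # assumed residual refinement

rng = np.random.default_rng(0)
H = W = 16
x_true = rng.standard_normal((H, W)) + 1j * rng.standard_normal((H, W))
mag = np.abs(np.fft.fft2(x_true))           # stand-in for the scaled measurement

g = lambda u: np.zeros_like(u)              # placeholder for the shallow CNN g_k
u = rng.standard_normal((H, W, 2))
K = 5
for _ in range(K):                          # K cascaded unwinding layers
    u = pub_layer(u, mag, g, beta=0.5)
```

With the placeholder `g`, the output simply satisfies the magnitude constraint; in a trained network, `g` and `beta` would decide how far the result departs from it.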
2) Feature Refinement Block (FRB): Images contain both global and local features. The PUB illustrated above aims to constrain the image reconstruction through the prior Fourier intensity measurement. It is well-known that the Fourier transform can only give the global frequency information of an image. The magnitude constraint alone cannot effectively inform the local features in the image. Therefore, in parallel with PUB, we propose a Feature Refinement Block (FRB) to enrich the detailed structures corresponding to the local features. FRB consists of three ConvBs. The shallow FRB structure can learn detailed representations from the input feature maps, for example, high-frequency details like complex shapes and edges. With the global information from PUB and local textures from FRB, the concatenated feature maps (concat. in Fig. 2) can provide comprehensive representations for the reconstruction process.
3) Feature Fusion Block (FFB): As shown in Fig. 2, the output feature maps of PUB and FRB are concatenated with the input feature maps to form the resulting feature maps with size $(H, W, 2C)$. Simple concatenation cannot effectively utilize the representations of the feature maps, since these $2C$ channels have different importance to the reconstruction. We propose to use a Feature Fusion Block (FFB), which is essentially a channel attention network [38], to adaptively re-weight these channels based on their contents. They are then combined to form the output of HUB. More specifically, the structure of FFB is shown in Fig. 4. The input feature maps are first sent to an average pooling layer (Ave. Pool in Fig. 4) to find the global representation of different channels. A single value extracted by the average pooling layer represents the global information of each channel. In other words, a total of $2C$ values are used to represent the input feature maps with $2C$ channels. Then, two fully-connected layers (FC layers in Fig. 4) work together to investigate the channel-wise dependency and produce $2C$ weights to indicate the importance of these channels. The key features have large weights since they are essential for the reconstruction. The weight of each channel is expanded $H \times W$ times to have the same spatial dimension as the input feature map, i.e., the Expand block in Fig. 4. The weights are then elementwise multiplied with the input feature maps to give different levels of attention. The re-weighted feature maps are fused together by a ConvB to have the size of $H \times W \times C$. They become the output of HUB and also the reconstructed images for a particular scale. Note that the input of FFB contains the feature maps generated by PUB (with the physics information), FRB, and those from the contracting path. The feature maps of PUB do not necessarily play a key role in the final output, although they are physics-informed. The physics information is adopted only when it helps the final reconstruction, as determined by FFB through the end-to-end training. Fig. 1 and Fig. 9 show some examples of the reconstructed images at different scales. Note that only the most significant feature map of each scale is shown. It is seen that the quality of the reconstructed image gradually improves as the scale increases.
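The re-weighting steps described above (average pooling, two FC layers, expansion, elementwise multiplication, and fusion to $C$ channels) can be sketched in NumPy as follows. The ReLU and sigmoid activations, the bottleneck width `r`, and the use of a plain 1×1 linear mix in place of the ConvB are assumptions borrowed from common channel attention designs, not details confirmed by the text.

```python
import numpy as np

def ffb(feat, W1, W2, Wout):
    """Hypothetical channel-attention fusion of feat with shape (H, W, 2C)."""
    s = feat.mean(axis=(0, 1))               # average pooling: one value per channel
    a = np.maximum(W1 @ s, 0.0)              # first FC layer + ReLU (assumed)
    w = 1.0 / (1.0 + np.exp(-(W2 @ a)))      # second FC layer + sigmoid (assumed)
    scaled = feat * w                        # weights broadcast over H x W ("Expand")
    return scaled @ Wout, w                  # 1x1 mix to C channels (ConvB stand-in)

rng = np.random.default_rng(0)
H, W, C, r = 4, 4, 3, 2                      # toy sizes; r is a hypothetical bottleneck width
feat = rng.standard_normal((H, W, 2 * C))
W1 = rng.standard_normal((r, 2 * C))
W2 = rng.standard_normal((2 * C, r))
Wout = rng.standard_normal((2 * C, C))
out, weights = ffb(feat, W1, W2, Wout)
```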

D. Loss Function
We regard the phase retrieval task as a supervised learning problem. Given a Fourier intensity measurement $X$, there is a corresponding complex-valued image $x$ as the ground-truth reference. Thus, we can measure the difference between the ground-truth image $x$ and the image $\hat{x}$ estimated by PPRNet. The loss function $\mathcal{L}$ we used contains two parts: the pixel-wise loss, $\mathcal{L}_{pixel}$, minimizes the pixel-wise difference between the estimated and ground-truth images; and the total variation (TV) loss, $\mathcal{L}_{TV}$, uses a smoothness prior to regularize the estimated image while maintaining its edges and textures. The total loss function $\mathcal{L}$ is formulated as:
$$\mathcal{L} = \mathcal{L}_{pixel} + \gamma \mathcal{L}_{TV}, \tag{6}$$
where $\gamma \in \mathbb{R}$ denotes the coefficient that balances the two loss terms. $\mathcal{L}_{pixel}$ is defined as the $\ell_1$-norm distance between the estimated and ground-truth images. It promotes the fidelity of the images estimated by PPRNet while preserving the details of the original image. $\mathcal{L}_{pixel}$ is expressed as:
$$\mathcal{L}_{pixel} = \|\hat{x}_{Re} - x_{Re}\|_1 + \|\hat{x}_{Im} - x_{Im}\|_1. \tag{7}$$
The terms $(\cdot)_{Re}$ and $(\cdot)_{Im}$ represent the real and imaginary parts of the image, respectively.
The TV norm sums all the gradients along the horizontal and vertical directions. Using TV regularization encourages spatial smoothness in the estimated image, which assists in noise reduction and increases the consistency of the reconstructed image. $\mathcal{L}_{TV}$ is defined as:
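A minimal NumPy sketch of a loss of this kind is given below. It should be read as an illustration under assumptions: the unnormalised sums, the anisotropic (absolute-difference) form of the TV term, and its application to the real and imaginary parts separately are choices made here, and the function names are hypothetical. The default `gamma=0.1` follows the value reported in the training details.

```python
import numpy as np

def pixel_loss(x_hat, x):
    """l1 distance on the real and imaginary parts."""
    return (np.sum(np.abs(x_hat.real - x.real))
            + np.sum(np.abs(x_hat.imag - x.imag)))

def tv_loss(x_hat):
    """Anisotropic TV: absolute vertical and horizontal differences (assumed form)."""
    total = 0.0
    for part in (x_hat.real, x_hat.imag):
        total += np.abs(np.diff(part, axis=0)).sum()   # vertical gradients
        total += np.abs(np.diff(part, axis=1)).sum()   # horizontal gradients
    return total

def total_loss(x_hat, x, gamma=0.1):
    """Pixel-wise loss plus gamma-weighted TV regularisation."""
    return pixel_loss(x_hat, x) + gamma * tv_loss(x_hat)
```

A constant estimate has zero TV, so its loss reduces to the pixel term, whereas a noisy estimate is penalised by both terms.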

A. Defocus-Based Fourier Phase Retrieval System
A Fourier phase retrieval system reconstructs the image $x \in \mathbb{C}^{N \times N}$ from its Fourier intensity measurement $X \in \mathbb{R}^{M \times M}$. However, as mentioned in Section II, the saturation problem often happens when Fourier intensity images are captured directly using standard imaging devices. It is due to the large dynamic range of Fourier intensity data. An example is given in Fig. 5. Fig. 5(a) shows a typical structured image (complex-valued). Its Fourier intensity measurement obtained using a camera with a 12-bit dynamic range is shown in Fig. 5(c), with the central region magnified for better visualization. The profile along the blue line is shown in Fig. 5(e) (blue line). It has a flat top due to the saturation problem. We also show the histogram of the intensity image in Fig. 5(f). It can be seen that many pixels have the maximum value due to the saturation problem. Besides, we can also find many pixels having zero values. Most of them come from the high-frequency parts of the measurement. They have very small values as compared with the low-frequency data and are thus quantized to zero, resulting in the so-called dead pixels. Consequently, the image contains many errors in both the low-frequency and high-frequency regions. Researchers have suggested a few solutions to the problem. One of them is the defocusing method [28]. Specifically, we can reduce the dynamic range of the intensity measurement by convolving it with a defocus kernel $H$. The convolution operation can be easily implemented by moving the camera beyond the Fourier plane. Fig. 6 shows a typical optical path of a defocus-based PR system. In the figure, the object of interest is illuminated by coherent light generated by a laser. The camera is placed beyond the focal plane such that a defocused Fourier intensity is captured. Mathematically, we denote the original image and its Fourier transform as $x(p, q)$ and $X(u, v)$, respectively. Then, the optical field $X_L(u', v')$ on the defocus plane can be formulated as: where $\lambda$ and $k = 2\pi/\lambda$ denote the wavelength and wave number, respectively. The symbol $L$ represents the distance between the camera and the Fourier plane. The term $H$ is the Fourier transform of the defocus kernel $h$. $C$ is a constant and $h(p, q)$ is defined as $e^{j\frac{\pi}{\lambda L}(p^2 + q^2)}$ [39]. As shown in (9), the defocusing is equivalent to the element-wise multiplication of the original image and the defocus kernel $h$ in the spatial domain with the scaling factor $\lambda L$. An example of the defocus kernel is shown in Fig. 5(c). Thus, in our experiment, we directly implement $h$ together with the testing images on an SLM. More details will be provided in Section V. With the defocusing method, the saturation problem and dead pixels are greatly reduced, as shown in Fig. 5(e) and (f). Although the defocused intensity measurement is not the exact intensity measurement of the image, PR methods using the defocused intensity measurements usually perform much better than those using the original measurements with saturated and dead pixels. Some examples are given in Fig. 7. In the figure, we first show the performance of the HIO [3] and PrDeep [22] methods, which represent the optimization-based and deep learning-based PR methods, respectively. They were implemented following the settings in their original papers (i.e., without saturation and dead pixels). The results are similar to those reported in [22]. Then, we introduced the saturation problem and dead pixels to the intensity measurements by capping the image data as 12-bit integers. It can be seen that the performance of these methods drops substantially when the saturation problem and dead pixels in the measurements are taken into account. This shows that the results of traditional approaches obtained using only simulation data, without considering these problems, are unreliable for practical applications. Finally, we used the defocusing method mentioned above to mitigate the problem. As shown in Fig. 7, both approaches benefit significantly from the defocusing method. An improvement of up to 15 dB can be achieved. For this reason, in this work, all compared methods are trained and tested with the defocused intensity measurements.
Fig. 7: Performance of HIO [3] and PrDeep [22] with and without using the defocus kernel. The quantitative results are the average of 6 natural images used in [22].
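The effect of the defocus kernel on the dynamic range can be illustrated with a small NumPy experiment. The sketch multiplies a non-negative toy image by a quadratic-phase kernel of the form $e^{j\pi a (p^2+q^2)}$, where the parameter `a` stands in for $1/(\lambda L)$ in pixel units; the value of `a`, the image, and the sizes are all assumptions chosen for illustration.

```python
import numpy as np

def defocus_kernel(N, a):
    """Quadratic-phase kernel exp(j*pi*a*(p^2 + q^2)); `a` stands in for 1/(lambda*L)."""
    p = np.arange(N) - N // 2
    P, Q = np.meshgrid(p, p, indexing="ij")
    return np.exp(1j * np.pi * a * (P ** 2 + Q ** 2))

N, M = 32, 128
rng = np.random.default_rng(0)
x = 0.5 + 0.5 * rng.random((N, N))        # non-negative toy image with a strong d.c. term
h = defocus_kernel(N, a=0.01)

plain = np.abs(np.fft.fft2(x, s=(M, M)))          # in-focus Fourier magnitude
defocused = np.abs(np.fft.fft2(x * h, s=(M, M)))  # defocused Fourier magnitude
peak_ratio = defocused.max() / plain.max()        # below 1: the peak is spread out
```

Because the kernel has unit modulus, the total energy of the spectrum is unchanged while its peak is lowered, which is why fewer pixels saturate for a given camera gain.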

B. Simulation
1) Datasets: To verify the effectiveness of the proposed PPRNet, we conducted comprehensive simulations with complex-valued images. To prepare these images for training and testing, we first collected images from two publicly available datasets: the Real-world Affective Faces (RAF) dataset [40] and the Fashion-MNIST dataset [41]. The images were converted to grayscale and resized to 128 × 128 pixels. We then constructed two datasets by combining these images differently. The first dataset has linearly correlated magnitude and phase parts. It was constructed using the images in the Fashion-MNIST dataset. Each image $x_{raw}$ obtained from the dataset was scaled to [0, 1] and used as the magnitude part $x_{mag}$ of the final image $x$. For the phase part, we applied an exponential function, i.e., $x_{phase} = \exp(2\pi i x_{raw})$. Finally, we combined the magnitude and phase parts by $x = x_{mag} \circ x_{phase}$, where $\circ$ denotes the element-wise multiplication. We used the first 25000 images of Fashion-MNIST's training dataset to create our training set and the first 1000 images of its testing dataset to create our testing set. The second dataset we constructed contains images with uncorrelated magnitude and phase parts. We applied the same approach mentioned above to re-scale the data but used images from two different datasets, namely, RAF and Fashion-MNIST, to generate the magnitude and phase parts, respectively. There were 12771 images in the training set and 1000 images in the testing set.
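The construction of the two kinds of complex-valued images can be sketched as follows. Random arrays stand in for the Fashion-MNIST and RAF images, the helper names are hypothetical, and the sketch assumes that the phase is computed from the image after it has been scaled to [0, 1].

```python
import numpy as np

def scale01(img):
    """Scale an image to the range [0, 1]."""
    img = np.asarray(img, dtype=float)
    span = img.max() - img.min()
    return (img - img.min()) / span if span > 0 else np.zeros_like(img)

def make_complex(mag_raw, phase_raw):
    """Combine a magnitude image and a phase image into x = x_mag * x_phase."""
    x_mag = scale01(mag_raw)
    x_phase = np.exp(2j * np.pi * scale01(phase_raw))
    return x_mag * x_phase

rng = np.random.default_rng(0)
img_a = rng.integers(0, 256, size=(128, 128))  # stand-in for a Fashion-MNIST image
img_b = rng.integers(0, 256, size=(128, 128))  # stand-in for an RAF image

x_corr = make_complex(img_a, img_a)     # dataset 1: magnitude and phase from the same image
x_uncorr = make_complex(img_b, img_a)   # dataset 2: RAF magnitude, Fashion-MNIST phase
```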
To simulate the defocusing effect discussed in Section IV-A, we multiplied all images by a defocus kernel, which was generated by a built-in function of the Holoeye SLM control software for a defocusing distance of 30 mm. The same defocus kernel was used in the experiments on the optical platform, as will be discussed later. Then, a 2-D FFT was performed on these images, and the magnitudes of the resulting images were extracted as the intensity measurements. The resolution of the measurements is 762 × 762 pixels. To simulate the saturation and quantization errors, we capped the measurements at the maximum limit of 4095 (12-bit) and converted the values to integer format.
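The measurement simulation described above can be sketched as follows. Several details are assumptions made only for illustration: the actual kernel comes from the Holoeye software, so the quadratic-phase `defocus_kernel` and its `strength` parameter are hypothetical stand-ins; zero-padding to 762 × 762 before the FFT, the `gain` factor, and truncation (rather than rounding) to integers are also assumed, since the text does not specify them.

```python
import numpy as np

def defocus_kernel(n, strength):
    """Hypothetical quadratic-phase kernel exp(i * strength * (u^2 + v^2))."""
    coords = np.linspace(-1.0, 1.0, n)
    u, v = np.meshgrid(coords, coords)
    return np.exp(1j * strength * (u ** 2 + v ** 2))

def simulate_measurement(x, kernel, out_size=762, gain=1.0, max_val=4095):
    """Defocus, 2-D FFT, take the magnitude, then saturate and quantize."""
    field = x * kernel                                # apply the defocus kernel
    spectrum = np.fft.fft2(field, s=(out_size, out_size))  # zero-padded FFT (assumed)
    magnitude = gain * np.abs(np.fft.fftshift(spectrum))
    capped = np.clip(magnitude, 0, max_val)           # saturation at the 12-bit limit
    return capped.astype(np.uint16)                   # quantization by truncation (assumed)

# A flat 128 x 128 image concentrates its energy in the DC term and saturates.
flat = np.ones((128, 128), dtype=np.complex128)
y_plain = simulate_measurement(flat, np.ones((128, 128)))
y_defocus = simulate_measurement(flat, defocus_kernel(128, strength=40.0))
```

Without the kernel, the DC term of the flat image (128 × 128 = 16384) exceeds 4095 and is clipped, which is the saturation effect that the defocusing method is intended to lessen.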
2) Training Details: We trained our network with the Adam optimizer [42]. The learning rate was set to $10^{-4}$, and the batch size was set to 24. The network was trained for 160 epochs on the PyTorch deep learning platform on a PC with two NVIDIA RTX 3090 GPUs. The weighting factor γ in the loss function (6) was empirically set to 0.1. The number of layers in PUB was set to 5 as a speed-accuracy trade-off. More discussion on our choices can be found in the ablation studies.
3) Metrics and Evaluation: We used the Peak Signal-to-Noise Ratio (PSNR, higher is better), Structural Similarity (SSIM, higher is better), and Mean Absolute Error (MAE, lower is better) as the performance criteria to measure the discrepancy between the reconstructed images $\hat{x}$ and the ground-truth images $x$. They are widely used in the image restoration field to measure the estimation quality. The phase parts of all images are shifted by π to ensure that no negative values are used in the computation of PSNR. The average PSNR, SSIM, and MAE over all 1000 testing images of both datasets are used for comparison.
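As a minimal sketch of two of these metrics, the snippet below computes PSNR and MAE on phase images shifted by π. The standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ is used; taking the peak value MAX as 2π for the shifted phase is an assumption, since the text does not state the data range, and the function names are hypothetical. SSIM is omitted because it needs a windowed implementation from an image-processing library.

```python
import numpy as np

def psnr(x_hat, x, data_range):
    """Peak signal-to-noise ratio in dB for real-valued arrays."""
    diff = np.asarray(x_hat, dtype=np.float64) - np.asarray(x, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

def mae(x_hat, x):
    """Mean absolute error for real-valued arrays."""
    diff = np.asarray(x_hat, dtype=np.float64) - np.asarray(x, dtype=np.float64)
    return float(np.mean(np.abs(diff)))

def phase_metrics(x_hat, x):
    """PSNR and MAE of the phase parts after shifting them by pi.

    np.angle returns values in (-pi, pi], so the shifted phase lies in (0, 2*pi].
    The peak value 2*pi is an assumed data range.
    """
    p_hat = np.angle(x_hat) + np.pi
    p = np.angle(x) + np.pi
    return psnr(p_hat, p, 2.0 * np.pi), mae(p_hat, p)

# Small check: a constant error of 0.1 on a unit data range.
example_psnr = psnr(np.full((4, 4), 0.1), np.zeros((4, 4)), data_range=1.0)
example_mae = mae(np.full((4, 4), 0.1), np.zeros((4, 4)))
```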
4) Compared Methods: To evaluate the effectiveness of the proposed PPRNet, we compare its performance with several state-of-the-art methods, namely HIO [3], which represents the traditional optimization-based methods; and PRCGAN [23], HIO-UNet [27], LenslessNet [18], NNPhase [25], and MCNN [32], which represent the deep learning-based approaches. Note that some of these approaches were designed under much simpler input and output requirements than those in this paper. For instance, NNPhase and PRCGAN assume the input intensity measurement is very small (only 64 × 64 and 28 × 28 pixels, respectively, as compared to 762 × 762 in the simulation environment). PRCGAN also assumes the target image has only real-valued data (as compared to having both real and imaginary data in the simulation environment). We therefore modified these networks slightly to adapt them to the simulation environment. Specifically, for PRCGAN and NNPhase, we first converted the networks to accept 128 × 128 pixel inputs. Then, we put a pre-processing block that contains two convolutional layers with a 5 × 5 kernel and strides 3 and 4, respectively, in front of the original network to convert the dimension of the input data from 762 × 762 to 128 × 128 pixels. Besides, we modified the output of PRCGAN to have 2 channels for the magnitude-phase representation, and those of HIO-UNet and LenslessNet to have 2 channels for the real-imaginary representation. These modifications allow the networks to operate in the simulation environment.
5) Simulation Results: The qualitative and quantitative comparison results are shown in Fig. 8. To save space, we do not include the qualitative results of LenslessNet in Fig. 8 due to its inferior performance. All approaches perform better on the dataset with linearly correlated magnitude and phase components, and the proposed PPRNet outperforms all compared approaches. As shown in Fig. 8(b), only PPRNet can reconstruct the details in the images. For instance, it is clearly seen that only PPRNet can successfully recover the "Lee" characters and the plaid pattern on the shirt. The images given by the other approaches have poor quality. On the other hand, the dataset with uncorrelated magnitude and phase components is challenging for all competing methods, as shown in Fig. 8(a). This is particularly the case for PRCGAN and HIO-UNet since they were originally designed to produce only real-valued outputs. The distortion in their magnitude images, as shown in Fig. 8(a), is quite obvious. For approaches like NNPhase and MCNN, the output images are rather blurry and dissimilar from the target images. These approaches also cannot reconstruct the details in the phase images. It is worth noting that the distortion of the HIO algorithm is quite severe. In general, the reconstruction of complex-valued objects is considerably more difficult than that of real-valued, non-negative objects for the HIO algorithm [43]. This is particularly the case for our dataset, where the images have uncorrelated magnitude and phase such that the object supports are difficult to define. Compared with the above approaches, the proposed PPRNet gives the best performance. The reconstructed magnitude and phase images closely follow the target images, particularly the phase images. Although some artifacts can occasionally be found in the magnitude images, they are not serious.
The quantitative comparison is shown in Fig. 8(c). The proposed PPRNet achieves much better performance than all compared methods on all metrics. For the magnitude part, the proposed PPRNet achieves average PSNR and SSIM gains of at least 5.818 dB and 0.146, respectively. For the phase part, the PSNR and SSIM gains reach 9.192 dB and 0.106, respectively. The above simulation results show that the proposed PPRNet outperforms the state-of-the-art methods. We will show in Section V that the same conclusion can be drawn when testing these methods in a practical environment.

C. Ablation Analysis
For selecting the hyperparameters and better understanding the roles of different components of the proposed PPRNet, a series of ablation analyses were conducted. In these analyses, the RAF dataset was used to generate the required training and testing images. To allow the analyses to reflect the performance when testing with the optical setup (which will be described in Section V), we set the target images to have a constant magnitude but varied phase components (i.e., they are all phase-only images). So, when evaluating the network, only the phases of the reconstructed images are considered.
1) Effect of Multi-Scale Network: As introduced in Section III-A, we proposed to use a multi-scale structure that includes a contracting path and an expanding path. Multiple scales can help improve the representation power of the feature maps and thus better exploit the image information. Besides, having more scales lets us include more HUBs in the expanding path. This is equivalent to having more iterations in the iterative deep learning PR methods, which generally gives a better result. To investigate the effects of the multi-scale features, we conducted a set of quantitative comparisons. The results are shown in TABLE I. It can be seen that the performance improves as the number of scales grows from one to three, which verifies the expectations mentioned above. However, the performance deteriorates when the number of scales is larger than three since the resolutions of the lower-scale feature maps are too small to represent the image appropriately. For instance, when the number of scales is four, the resolution of the lowest-scale feature map is just 8 × 8 (256/32), which is too small to represent the image.
Besides, the number of parameters and operations will also increase when there are more scales. Thus, we chose to have 3 scales in the proposed PPRNet.
2) Effect of Hybrid Unwinding Block: As introduced in Section III-C, HUB contains three components: PUB, FRB, and FFB. To gain a deeper insight into the operation of HUB, we visualized in Fig. 9 the feature maps generated by different components of the HUBs of the first two scales when inferencing an example image. The ground truth and its final estimated image are shown at the top. At Scale 2, the input feature maps are the upsampled FFB output of Scale 3. The feature maps contain 256 channels, with the first two fed to PUB and the rest fed to FRB. The outputs of PUB and FRB are shown. For better visualization of the FRB branch, the average feature maps of all channels are presented. It can be seen that PUB tends to give the global structure of the image, and FRB tends to give the details. These feature maps are then combined by FFB using the channel attention method. For FFB, we show the two output channels with the largest weights (denoted as Max) and the two with the smallest weights (denoted as Min). The feature maps given by the channels with the largest weights already contain the basic structure of the ground truth. They are sent to Scale 1 for further enhancement. At Scale 1, it can be seen that the feature maps given by PUB are already very close to the ground truth. They are further processed to give the final output. To summarize, it can be seen in Fig. 9 that the FFB outputs continuously improve from the lower scale to the upper scale.
3) Effect of Unwinding Layers: As presented in Section III-C, PUB is formed by K unwinding layers. The feature maps are updated with the magnitude constraint K times in the frequency domain. In order to determine this parameter K, we analyzed the performance with different numbers of unwinding layers in PUB. We considered seven different values of K, i.e., zero to six, in each PUB. TABLE II presents the simulation results. It can be seen that using more layers (from zero to five) leads to better reconstruction performance (higher PSNR and SSIM, lower MAE). However, when the number of layers is larger than five, the performance tends to saturate. This is not surprising since, in deep learning, deeper networks do not necessarily give better performance. Although using more layers allows more learnable parameters, it also makes the network more difficult to train to its optimal performance. It can end up with overfitting, where unrelated details are introduced to fill the enlarged parameter space. Besides, the processing time also increases when more layers are used. Hence, we set the number of unwinding layers in PUB to five to obtain the best performance.
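To make the idea of updating with the magnitude constraint K times in the frequency domain more concrete, the sketch below shows the classical Fourier magnitude projection applied K times, interleaved with a spatial-domain step. It is not the PUB itself: the learnable layers of PUB are replaced by a hypothetical placeholder function `refine`, and the measured magnitude is assumed to have the same size as the signal.

```python
import numpy as np

def magnitude_projection(x, measured_mag):
    """Replace the Fourier magnitude of x with the measured one, keeping its phase."""
    spectrum = np.fft.fft2(x)
    projected = measured_mag * np.exp(1j * np.angle(spectrum))
    return np.fft.ifft2(projected)

def unrolled_updates(x0, measured_mag, refine, num_layers=5):
    """Apply the magnitude constraint num_layers times (K in the text).

    refine is a placeholder for the spatial-domain processing between
    projections; in a learned network it would be a trainable layer.
    """
    x = x0
    for _ in range(num_layers):
        x = refine(magnitude_projection(x, measured_mag))
    return x

# Toy example: a random complex signal and its noiseless Fourier magnitude.
rng = np.random.default_rng(1)
target = rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))
measured = np.abs(np.fft.fft2(target))
start = rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))

one_step = magnitude_projection(start, measured)
estimate = unrolled_updates(start, measured, refine=lambda z: z, num_layers=5)
```

With the identity `refine` used here, repeated projections change nothing after the first one; the benefit of more layers in an unrolled network comes from the processing placed between the projections.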

V. Experiment On Optical Platform
To understand the performance of the proposed PPRNet in practical applications, we constructed an optical setup for collecting realistic intensity measurements for training and testing the network. The setup follows the optical path in Fig. 6. A snapshot of it is shown in Fig. 10. It comprises a Thorlabs 10 mW HeNe laser with wavelength λ = 632.8 nm and a 12-bit 1920 × 1200 Kiralux CMOS camera with a pixel pitch of 5.86 µm. We utilized a 1920 × 1080 Holoeye Pluto phase-only SLM with pixel pitch $\delta_{SLM}$ = 8 µm to generate the target images. The SLM can impose different phase shifts on the coherent light that passes through it. It effectively synthesizes the objects in phase imaging applications [13], [18]. Since the ground truth is known, we can easily evaluate the accuracy of the reconstructed phase images given by different approaches. Another advantage of using the SLM is that, similar to the simulation, we can directly multiply the images by the defocus kernel to generate the defocusing effect. For the training and testing datasets, we used the same RAF and Fashion-MNIST datasets as in the simulation. Besides, we added the COCO dataset [44] (using the first 50000 and 2000 images of the training and testing datasets, respectively), which contains more complex images for challenging the PR approaches. We used the same approach as in the simulation to generate the phase part of the images based on the images collected from these datasets. The magnitude part was assumed to be constant. These phase images were then multiplied by the defocus kernel and loaded onto the SLM. Following the optical path, the images captured by the camera were the defocused Fourier intensity measurements of these phase images. They were used for the training and testing of different PR approaches. The resolutions of these intensity measurements and phase images were the same as in the simulation, i.e., 762 × 762 and 128 × 128 pixels, respectively. When training the models, we first initialized them with the ones used in the simulation, as discussed in Section IV-B. Then we fine-tuned all the pre-trained models by re-training them for 120 epochs with the experimental measurements as the new training inputs.
We compared the proposed PPRNet with four traditional optimization-based algorithms: GS [1], HIO [3], WF [8], and RAAR [7]; and eleven deep learning-based methods: ResNet [31], [33], PRCGAN [23], LearnInitNet [35], 3-scale UNet [34], LenslessNet [18], CPR-FS [24], SiSPRNet [28], NNPhase [25], MCNN [32], prDeep [22], and HIO-UNet [27]. Similar to the simulation, we modified the input of some of the deep learning-based approaches since they originally could only handle much smaller intensity measurements. Specifically, for CPR-FS, SiSPRNet, ResNet, NNPhase, MCNN, and PRCGAN, we inserted a pre-processing block before the original network structures. It contains two convolutional layers with a 5 × 5 kernel and strides 3 and 4, respectively, to extract the features and compress the dimension of the inputs from 762 × 762 to 128 × 128. Besides, we also modified the networks that were originally designed to give real-valued images so that they reconstruct complex-valued images. Specifically, we set the number of output channels of LenslessNet, HIO-UNet, and ResNet to 2 to give the real and imaginary parts of the images, and we also set the number of output channels of PRCGAN to 2 for the magnitude-phase representation.
We evaluated the performance of all competing PR methods using the same metrics as in the simulation, i.e., PSNR, SSIM, and MAE. For the traditional optimization-based algorithms, we fine-tuned the hyper-parameters and fixed them when evaluating with the testing images. We ran three evaluation trials for each testing image, where each trial had 1500 iterations. The reconstructed image with the highest PSNR among the three trials was chosen for comparison. Besides, traditional optimization-based algorithms can have trivial ambiguities in the reconstructed images, such as a global phase shift. These were removed before the comparisons were conducted.
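One common way to remove a global phase shift and to pick the best of several trials is sketched below. This is an illustrative sketch rather than the evaluation code used in the experiments: the least-squares alignment via the inner product is a standard technique assumed here, and the function names are hypothetical. Selecting the lowest mean squared error is equivalent to selecting the highest PSNR when the data range is fixed.

```python
import numpy as np

def remove_global_phase(x_hat, x_ref):
    """Rotate x_hat by the constant phase that best matches x_ref (least squares)."""
    inner = np.vdot(x_hat, x_ref)  # sum(conj(x_hat) * x_ref) over all pixels
    return x_hat * np.exp(1j * np.angle(inner))

def best_of_trials(reconstructions, x_ref):
    """Align each trial to x_ref and return the index and image with the lowest MSE."""
    aligned = [remove_global_phase(r, x_ref) for r in reconstructions]
    errors = [np.mean(np.abs(a - x_ref) ** 2) for a in aligned]
    best = int(np.argmin(errors))
    return best, aligned[best]

# Toy example: the second trial equals the ground truth up to a global phase.
rng = np.random.default_rng(2)
truth = np.exp(2j * np.pi * rng.random((8, 8)))  # phase-only image
trials = [
    truth * np.exp(0.7j) + 0.5,  # shifted and biased
    truth * np.exp(1.2j),        # shifted only
    truth * np.exp(-2.0j) + 0.1 * rng.standard_normal((8, 8)),
]
best_index, best_image = best_of_trials(trials, truth)
```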
The quantitative and qualitative evaluation results are shown in TABLE III and Fig. 11, respectively. To save space, we do not include the qualitative results of GS, WF, RAAR, ResNet, UNet, LearnInitNet, and CPR-FS in Fig. 11 due to their inferior performance. As presented in TABLE III, the reconstruction performance of the proposed PPRNet is much better than that of the compared approaches. Compared with the traditional optimization-based algorithms (GS, HIO, WF, and RAAR), PPRNet has average PSNR and SSIM gains of at least 4.857 dB and 0.266, respectively, on the challenging COCO dataset, which contains a wide range of content-rich natural images. For the facial RAF dataset, the PSNR and SSIM gains reach 12.219 dB and 0.52, respectively. The PSNR and SSIM gains even reach 27.312 dB and 0.704, respectively, on the Fashion-MNIST dataset. In general, traditional optimization-based methods have difficulty converging to satisfactory results with only one intensity measurement. In particular, they perform worse on the Fashion-MNIST dataset. This is because the meaningful contents of the images (i.e., clothes) are often located at the center while the values of the outer regions are close to zero. Thus, the algorithms fail to estimate the correct spatial support of the images. Besides, the saturated intensity measurement further complicates the problem, although it has already been lessened by using the defocusing method, as mentioned in Section IV-A. As shown in Fig. 11, both the HIO and RAAR methods can only recover the contours of the original images (third and fourth columns), and many defects remain.
Compared with the state-of-the-art deep learning-based PR methods, our PPRNet achieves PSNR and SSIM gains of at least 7.45 dB and 0.08, respectively, on the Fashion-MNIST dataset. It also performs the best on the RAF and COCO datasets. It can be seen in Fig. 11 that the images reconstructed by the proposed PPRNet provide more details than those of other methods, with fewer artifacts. This is particularly the case for the COCO dataset. Since the images are more complex, other approaches cannot even reconstruct the contours of the objects in the images. PPRNet not only recovers the contours but also provides more textures of the original images. This benefits from the physics information of the underlying image, which guides the reconstruction process. Compared with other physics-driven methods, such as HIO-UNet and prDeep, PPRNet not only provides better-quality images but also has a lower complexity. This is expected since PPRNet has a feedforward architecture that does not require iterative estimation. Besides, PPRNet is end-to-end trained. It is different from HIO-UNet and prDeep, which have an optimizer operated outside the network model. Thus, any mismatch between them can cause the iterative process to be trapped at a local minimum. Besides, prDeep is only successful for real-valued images, as claimed by the authors [22]. The result is unsatisfactory when it is used to retrieve phase images. The above experimental results have verified the performance of the proposed PPRNet when used in a practical phase retrieval system. It consistently outperforms the state-of-the-art deep learning-based methods both quantitatively and qualitatively.

VI. Conclusion
In this paper, we proposed a deep learning-based phase retrieval (PR) method, namely PPRNet. Similar to other deep learning-based approaches, PPRNet requires only a single Fourier intensity measurement as its input and does not need an extra masking scheme for the PR system to constrain the measurement data. The main novelty of PPRNet is the introduction of physics information to the PR process. Unlike the traditional physics-driven PR methods that often end up in a time-consuming iterative procedure, the proposed PPRNet has a non-iterative feedforward structure but can still effectively utilize the intensity measurement to guide the image reconstruction process. This is enabled by the novel Hybrid Unwinding Blocks (HUB) embedded in the network's input and expanding path. Each HUB separately processes the global and local information of the feature maps with the aid of the intensity measurement and combines them with a channel attention method. Our simulation and experimental results have verified the effectiveness of PPRNet. In particular, our experimental results were obtained from an optical platform designed for this research. They demonstrate the performance of PPRNet when applied to practical phase retrieval applications. From the simulation and experimental results, it can be concluded that the proposed PPRNet consistently outperforms the state-of-the-art deep learning-based PR methods, making it a promising solution for practical PR applications. Nevertheless, there is still room for PPRNet to improve when handling images with complex scenes. This is one of the ongoing research directions in our group.

Fig. 1: Architecture of the proposed Physics-Driven Phase Retrieval Network (PPRNet). The most significant output feature maps after each HUB are shown at the bottom of the figure.

Fig. 5: (a) A sample complex-valued image (real and imaginary parts). (b) Visualization of the defocus kernel (phase part). The Fourier intensity measurement (collected by a scientific camera) of the sample image (c) without the defocus kernel and (d) with the defocus kernel. (e) Profile of the colored lines in (c) and (d) (with and without the defocus kernel). (f) Histogram of intensity measurements in (c) and (d). Zoom in for a better view.

Fig. 6: The optical path of the defocus-based PR system.

Fig. 7: (a) Qualitative and (b) quantitative evaluation results of HIO [3] and prDeep [22] with and without using the defocus kernel. The quantitative results are the average of 6 natural images used in [22].

Fig. 8: Qualitative simulation results of different phase retrieval methods on two datasets with images having (a) uncorrelated and (b) linearly correlated magnitude and phase components. The pixel values of the Fourier intensity measurements (first column) range from 0 to 4095. The magnitude part of each target image and the corresponding colormaps are shown in the second column. The third to sixth columns show the magnitude parts reconstructed by different methods. The seventh column presents the phase part of each target image and the corresponding colormaps. The phase parts reconstructed by different approaches are provided from the eighth to the last columns. Zoom in for a better view. (c) Quantitative comparisons (average PSNR/SSIM/MAE) on the two datasets. Best performances are denoted in bold font.

Fig. 9: The FFB outputs continuously improve from the lower scale to the upper scale. The physics information applied to the PUBs at different scales plays an important role in this enhancement process. It guides the network to reconstruct the image in the right direction not only at the training stage but also at the inference stage.

Fig. 10: The hardware implementation of the optical path in Fig. 6.

Fig. 11: Experimental results of different phase retrieval methods on (a) the RAF dataset [40], (b) the Fashion-MNIST dataset [41], and (c) the COCO dataset [44]. The first and second columns show the Fourier intensity measurements (pixel values: 0 to 4095) and the corresponding phase parts of the ground-truth images, with scale bars at the bottom left corners, respectively. The values of the Fourier measurements are scaled for better visualization. The other columns show the images reconstructed by different methods. Except for the Fourier intensity measurements (the first column), the colormap of the remaining columns ranges from 0 to 2π, with color bars in the second column.

TABLE I: Comparison of different numbers of scales for the proposed PPRNet on the RAF dataset (simulation). The best performances are marked in bold font.

TABLE II: Comparison of different numbers of unwinding layers on the RAF dataset (simulation). The best performances are marked in bold font.

TABLE III: Quantitative comparison (average MAE/PSNR/SSIM/Parameters/Complexity) with the state-of-the-art methods for phase retrieval on three datasets. The training and testing samples were collected with the optical system shown in Fig. 10. Best performances are marked in bold font. Second and third best performances are colored in blue and green, respectively.