On the interplay between physical and content priors in deep learning for computational imaging

Deep learning (DL) has been applied extensively to many computational imaging problems, often leading to superior performance over traditional iterative approaches. However, two important questions remain largely unanswered: first, how well can a trained neural network generalize to objects very different from those seen in training? This is particularly important in practice, since large-scale annotated examples similar to those of interest are often unavailable during training. Second, has the trained neural network learnt the underlying (inverse) physics model, or has it merely done something trivial, such as memorizing the examples or point-wise pattern matching? This pertains to the interpretability of machine-learning-based algorithms. In this work, we use the Phase Extraction Neural Network (PhENN), a deep neural network (DNN) for quantitative phase retrieval in a lensless phase imaging system, as the standard platform and show that the two questions are related and share a common crux: the choice of the training examples. Moreover, we connect the strength of the regularization effect that a training set imposes on the training process with the Shannon entropy of the images in the dataset: the higher the entropy of the training images, the weaker the regularization effect that can be imposed. We also find that a weaker regularization effect leads to better learning of the underlying propagation model, i.e. the weak object transfer function, applicable to weakly scattering objects under the weak object approximation. Finally, simulation and experimental results show that better cross-domain generalization performance is achieved if the DNN is trained on a higher-entropy database, e.g. ImageNet, than if the same DNN is trained on a lower-entropy database, e.g. MNIST, as the former allows the underlying physics model to be learned better than the latter.


Two unanswered fundamental questions of deep learning in computational imaging
Deep learning (DL) has proven versatile and efficient in solving many computational inverse problems, including image super-resolution [1][2][3][4][5][6][7], phase retrieval [8][9][10][11][12][13][14][15], imaging through scattering media [16][17][18], optical tomography [19][20][21], and so on; see [22,23] for more detailed reviews. Besides their superior performance over classical approaches in many cases, DNN-based methods enjoy the advantage of extremely fast inference after the completion of the training stage. It is during the latter that the DNN weights are optimized, as the training loss function between ground truth objects and estimated objects is reduced.
Despite these great successes, two important questions remain largely unanswered: first, how well does a model trained on one dataset generalize directly to objects in disjoint classes? Second, how well does the trained neural network learn the underlying physics model? These questions are well-motivated: access to a large number of training examples in the same category as those in the test set is not always possible, and it would reduce the practicality of deep learning if a model trained on one set could not generalize reasonably well to another. Moreover, one major skepticism against deep learning is: has the algorithm actually learnt anything about the underlying physics, or is it merely doing some trivial point-wise denoising or pattern matching, or, worse, just memorizing and reproducing examples from the training set?
In this paper, we recognize that these two questions are directly related: if the trained DNN were able to perfectly learn the underlying (inverse) physics law, which is satisfied unconditionally by all classes of objects, its prediction performance would not deteriorate on object classes disjoint from the training set. We use PhENN [8] for lensless phase retrieval as an example. In fact, we discover that a trained DNN captures the physics law better, and therefore cross-domain generalizes better, when the training set is more generic, e.g. ImageNet [24], than when it is more constrained, e.g. MNIST [25]. Therefore, when encountering insufficient training data in the same class as the test data, which is very common, the best compromise is to train the neural network on a less constrained, publicly available standardized dataset such as ImageNet, with reasonable confidence that it will produce accurate reconstructions in the domain of interest.

Phase retrieval and the weak object transfer function (WOTF) for lensless phase imaging
In the lensless phase imaging system (Fig. 1), the phase object is illuminated by collimated monochromatic light. The light transmitted through the phase object propagates in free space and forms an intensity pattern on the detector placed at a distance z away. Assuming that the illumination plane wave has unit amplitude, the forward model can be described as

g(x, y) = |exp{i f (x, y)} ∗ h(x, y)|². (1)

Here, f (x, y) is the phase distribution of the object, g(x, y) is the intensity image captured by the detector, h(x, y) = exp{iπ(x² + y²)/(λz)}/(iλz) is the Fresnel propagation kernel over the distance z, λ is the wavelength of the illumination light, i is the imaginary unit and ∗ denotes the convolution operation.
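As an illustration, the forward model above can be simulated numerically by applying the Fresnel transfer function in the Fourier domain. The following sketch uses our own function name and grid conventions (not from the paper) and assumes a square pixel grid in the paraxial regime:

```python
import numpy as np

def fresnel_intensity(phase, pixel_size, wavelength, z):
    """Simulate the lensless measurement g = |exp(i f) * h|^2 by applying
    the Fresnel transfer function in the Fourier domain (paraxial regime)."""
    M, N = phase.shape
    u = np.fft.fftfreq(M, d=pixel_size)
    v = np.fft.fftfreq(N, d=pixel_size)
    U, V = np.meshgrid(u, v, indexing="ij")
    # Fresnel transfer function for propagation over distance z
    H = np.exp(-1j * np.pi * wavelength * z * (U**2 + V**2))
    field = np.fft.ifft2(np.fft.fft2(np.exp(1j * phase)) * H)
    return np.abs(field) ** 2
```

A useful sanity check is that a flat phase produces a flat unit-intensity measurement, and that the unitary transfer function conserves total energy.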
Eq. (1) is non-linear. However, when the weak object approximation holds, exp{i f (x, y)} ≈ 1 + i f (x, y) [26], the forward imaging model may be linearized as

G(u, v) = δ(u, v) + 2 sin[πλz(u² + v²)] F(u, v). (2)

Here, G(u, v) and F(u, v) are the Fourier transforms of the intensity measurement g(x, y) and the phase distribution of the object f (x, y), respectively; sin[πλz(u² + v²)] is the weak object transfer function (WOTF) for lensless phase imaging. The nulls of the WOTF are of particular significance, as the sign transitions surrounding these nulls in the frequency domain cause a π phase shift in the measurement. We return to this effect in Sections 3.2 and 4.3.
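The WOTF can be tabulated on the same FFT frequency grid as the measurement, and the linearization can be checked numerically: for a weak sinusoidal phase, the ratio of the measurement spectrum to the object spectrum at the fringe frequency should approach 2 sin[πλz(u² + v²)] (the sign and the factor of 2 depend on the propagation-kernel convention; the sketch below uses our own convention and is not taken from the paper):

```python
import numpy as np

def wotf(shape, pixel_size, wavelength, z):
    """Tabulate the weak object transfer function sin(pi*lambda*z*(u^2+v^2))
    on the FFT frequency grid matching a measurement of the given shape."""
    u = np.fft.fftfreq(shape[0], d=pixel_size)
    v = np.fft.fftfreq(shape[1], d=pixel_size)
    U, V = np.meshgrid(u, v, indexing="ij")
    return np.sin(np.pi * wavelength * z * (U**2 + V**2))
```

With a phase depth of 0.05 rad the non-linear terms are of second order, so the linear prediction holds to within a few percent away from the WOTF nulls.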

Phase Extraction Neural Networks (PhENN)
The Phase Extraction Neural Network (PhENN) [8] is a deep learning architecture that can be trained to recover an unknown phase object from the raw intensity measurement obtained through a lensless phase imaging system. Since PhENN was proposed, three general types of strategies have been followed to further enhance its performance. The first category focuses on optimizing the network architecture or training specifics of PhENN, including the network depth, the training loss functions, etc. [27] The second category focuses on optimizing the spatial frequency content of the training data to compensate for imbalanced fidelity between the high and low frequency bands in the reconstructions; the same rationale applies not only to phase retrieval but to many other applications as well. Li et al. proposed a spectral pre-modulation approach [28] to amplify the high frequency content of the training data and experimentally demonstrated that a PhENN trained in this fashion achieves better spatial resolution, albeit at the cost of overall reconstruction quality. Subsequently, Deng et al. proposed the learning synthesis by deep neural network (LS-DNN) method [7], which achieves full-band, high quality reconstructions in phase retrieval and other computational imaging applications by splitting, separately processing and recombining the low and high spatial frequencies. The third category makes the learning scheme more physics-informed: unlike in the original PhENN, the forward model is incorporated via model-based pre-processing [9,14,15], etc., a strategy that has proven particularly useful in severely ill-posed conditions. However, since all these efforts are secondary to the main objectives of this paper, we choose not to implement them.
The PhENN architecture used in this paper is shown in Fig. 2. It uses an encoder-decoder structure with skip connections, a structure proven efficient and versatile in many applications of interest. The encoder consists of 4 Down-Residual Blocks (DRBs) that gradually extract compressed representations of the input signals to preserve high-level features; subsequently, the decoder comprises 4 Up-Residual Blocks (URBs) and two (constant-size) Residual Blocks (RBs) that expand the size of the feature maps and form the final reconstruction. To best preserve the high spatial frequencies, skip connections pass the feature maps from the encoder to the corresponding layers of the same size in the decoder. More details about the architecture of each functional block are available in Appendix A. Compared to the original PhENN [8], we implemented two modifications in this paper. First, we start from intensity patterns of size 256 × 256, the same size as the objects, since only with the same pixel size and number of pixels is the computation of the weak object transfer function (WOTF) (see Section 1.2) physically sound. Second, PhENN is trained with the negative Pearson correlation coefficient (NPCC), a loss proven more beneficial for restoring fine details of the objects [17], instead of the mean absolute error (MAE), a pixel-wise loss known to suffer from oversmoothing. The exact form of the NPCC is given later in Eq. (5).

Generalization error in machine learning
Previous works have addressed different aspects of the generalization errors of DNNs. Some [29,30] aimed at tightening bounds on the capacity of a model, i.e. the number of training examples necessary to guarantee generalization. Ref. [31] provided a finer generalization bound for a subclass of neural networks in terms of the product of the spectral norm of their layers and the Frobenius norm of their weights. Yet, elegant as these works are, the model capacity bounds and the generalization error bounds are not yet tight enough to provide practical insights.
In [32], the authors provide insights into the explicit and implicit roles of various regularization methods (not to be confused with the regularization effect imposed by the training data, as discussed in this paper) in reducing generalization error, including dropout, data augmentation, weight decay, etc., practices that are commonly applied in deep learning implementations, including in PhENN. Other works examined the generalization dynamics of large DNNs trained with SGD [33], and the generalization ability of DNNs with respect to their robustness [34], i.e. the ability of the DNN to cope with small perturbations of its input. As [35], a nice review of this general topic, points out, one of the open problems in this area is to understand the interplay between memorization and generalization. Unlike previous works, where generalization typically refers to unseen examples similar to the training dataset, this work attempts to provide, through an empirical investigation, some insights into how the choice of training dataset affects the cross-domain generalization performance.

Entropy as a metric of the strength of the regularization effect imposed by a training set
When a dataset is selected as the training dataset, its influence on the training is reflected in the regularization effect it imposes. Intuitively, ImageNet [24], which contains natural images of a broad collection of contents and scenes, should present a more generic prior; on the other hand, MNIST [25], a database restricted to handwritten digits, would be expected to impose stronger regularization. However, quantifying the regularization effect imposed by a particular training dataset is not as straightforward. Here, we employ the Shannon entropy [36,37] of the images in the dataset for that purpose: the higher the entropy, the weaker the regularization effect the dataset imposes on the training process.
Shannon entropy is a measure of the uncertainty of a random variable. Let X denote a random variable with distribution p(X). Without loss of generality, we assume that X is discrete and takes values in a finite alphabet X = {x₁, · · · , x_K}. The entropy (in bits) of X under the distribution p(·), denoted H_p(X), is defined as

H_p(X) = − Σ_{k=1}^{K} p(x_k) log₂ p(x_k), (3)

where p(x_k) := Pr{X = x_k}. It is a standard result [36,37] that, under the constraint Σ_k p(x_k) = 1, the entropy is maximized in the equiprobable case, i.e. p(x_k) = 1/K for all k, and the maximum entropy is log₂ K = log₂ |X|. The higher the entropy, the higher the level of uncertainty in the source. Conversely, if the distribution is completely deterministic (i.e. p(x_i) = 1 for some i and p(x_j) = 0 for all j ≠ i), the entropy of the source is 0, the lowest extreme.
For an image f of size M × N defined on the alphabet X^{M×N}, the empirical distribution of the pixel value X ∈ X is defined as

p̂(x) = |{(m, n) : f (m, n) = x}| / (MN), (4)

and the entropy of the image f can be approximated by the entropy H_p̂(·) computed from this empirical distribution p̂(·) according to Eq. (3). In Fig. 3, we show histograms of the computed Shannon entropy of images in ImageNet and MNIST, two representative standardized databases. 10000 8-bit images were selected from each set, and a histogram of the image entropies was computed over 1000 bins between 0 and 8 bits. The ImageNet images generally have high entropy (mean 7.190, standard deviation 0.600), with most between 7 and 8 bits, while the MNIST images have low entropy (mean 1.072, standard deviation 0.185). Interestingly, this matches our expectation that the MNIST images are approximately binary: the entropy of a perfect binary image with equal densities, i.e. p(x_i) = p(x_j) = 0.5 for some i ≠ j and p(x_k) = 0 for all other k, is exactly 1 bit. The deviation of the observed entropy of MNIST images from this ideal value arises because the two nonzero densities are not equal in the empirical distribution of MNIST images. (Later we will see that the narrow distribution of Shannon entropy found in MNIST makes learning the WOTF challenging, as it offers very limited information about out-of-focus images.)
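The per-image entropy computation described above reduces to a pixel-value histogram; a minimal sketch (8-bit images assumed, function name ours):

```python
import numpy as np

def image_entropy(img, n_levels=256):
    """Shannon entropy (bits) of the empirical pixel-value distribution
    of an 8-bit image, per Eqs. (3)-(4)."""
    counts = np.bincount(img.ravel(), minlength=n_levels)
    p = counts / counts.sum()
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return -(p * np.log2(p)).sum()
```

The three extremes discussed in the text follow directly: a uniform-valued image has entropy 0, a balanced binary image has entropy 1 bit, and an image using all 256 gray levels equally often attains the maximum of 8 bits.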
For a database, the connection between the entropy of its images and the strength of the regularization effect can be argued as follows: the higher the entropy of the images in a database, the closer their empirical distributions are to the uniform distribution, the extreme case in which the weakest regularization effect is imposed. Consider the extreme case where the pixels of an image admit an i.i.d. uniform distribution on X, so that the empirical distribution in Eq. (4) is uniform. In this case, no regularization effect can possibly be imposed by the training examples, since if the DNN were to "memorize" such training examples, nothing but a totally random distribution of pixel values could be remembered. At the other extreme, if the training dataset contains only zero-entropy images (all pixels identical within each image), the training loss can be minimized merely by the DNN learning to produce such uniform-valued images, totally neglecting the underlying physics model associated with the stochastic processes involved in the image formation. In the next section we show that the stark difference in entropy between ImageNet and MNIST shown in Fig. 3 strongly implies a corresponding difference between the two datasets in terms of the generalization ability they confer.

Performance of cross-domain generalization under different training datasets
In this section, we compare the cross-domain generalization performance of PhENN, trained on ImageNet or MNIST, when tested on ImageNet, MNIST and two other datasets: the IC layout dataset [9] and Face-LFW. The choice of the IC layout dataset and Face-LFW is well-motivated, as both are rather distinct from either training database; yet the former shares with MNIST the feature of piecewise constancy, whereas the latter does not.
The intensity measurements are synthetically generated by the standard phase retrieval optical apparatus of Fig. 1, according to Eq. (1). All training and testing objects are of size 256 × 256, with the pixel size of the objects and of the detector being 20 µm. The propagation distance is set to z = 100 mm. In order for the weak object approximation to hold, so that the learned WOTF can be explicitly computed from the measurements and the corresponding phase objects, the maximum phase depth of the objects is kept below 0.1π rad. In simulation, the weak objects are obtained by applying a one-to-one calibration curve, here a linear one, that maps the 8-bit alphabet {0, 1, · · · , 255} to phase values between 0 and 0.1π rad. Since the entropy of an image is invariant under one-to-one mappings, the weak object approximation and the entropy distributions above hold concurrently.
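The calibration step can be sketched as a linear one-to-one map from 8-bit gray levels to weak phase values; because the map is one-to-one, the number of distinct pixel values, and hence the empirical entropy, is preserved (the function name and exact normalization are our illustrative choices):

```python
import numpy as np

def to_weak_phase(img8, max_phase=0.1 * np.pi):
    """Linear one-to-one calibration from 8-bit gray levels {0,...,255}
    to weak phase values in [0, max_phase] rad."""
    return img8.astype(np.float64) / 255.0 * max_phase
```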
During the training stage, pairs of intensity measurements and weak phase objects are used as the input and output of PhENN, respectively. The disparity between the estimate produced by the current weights of PhENN and the ground truth is used to update the weights via backpropagation. After PhENN is trained, in the test stage, the measurements corresponding to unseen examples are fed into the trained PhENN to predict the estimated phase objects. The loss function used in training is the negative Pearson correlation coefficient (NPCC), which has proven helpful for better image quality in reconstructions [9,14,17,21,27]. For a phase object f and the corresponding estimate f̂, the NPCC is defined as

NPCC(f̂, f) = − Σ_{x,y} (f̂(x, y) − ⟨f̂⟩)(f (x, y) − ⟨f⟩) / { [Σ_{x,y} (f̂(x, y) − ⟨f̂⟩)²]^{1/2} [Σ_{x,y} (f (x, y) − ⟨f⟩)²]^{1/2} }, (5)

where ⟨f̂⟩ and ⟨f⟩ are the spatial averages of the reconstruction f̂ and the true object f, respectively. If the reconstruction is perfect, NPCC(f̂, f) = −1. One caveat of using NPCC is that a good NPCC value cannot guarantee that the reconstruction is on the correct quantitative scale, as NPCC is invariant under affine transformations, i.e. NPCC(f̂, f) = NPCC(a f̂ + b, f) for scalars a > 0 and b. Therefore, to correct the quantitative scale of the reconstructions without altering the PCC value, a linear fitting step is carried out on the validation set and the learned a, b values are used to correct the test sets. The quantitative performance of the reconstructions on various datasets by ImageNet-trained PhENN and MNIST-trained PhENN is compared in Table 1. The chosen metrics consider both pixel-wise accuracy, through MAE, and structural similarity, through PCC. We make the following observations:
• When access to training data in the same database as the test data is not possible, a model trained on ImageNet generalizes satisfactorily to the more constrained MNIST; however, performance is catastrophic in the opposite case, when the model is trained on the more constrained MNIST but tested on the more generic ImageNet.
• Likewise, when tested on the IC layout dataset and the Face-LFW dataset, cross-domain generalization is always better if PhENN is trained on ImageNet rather than on MNIST, despite the fact that the IC layout dataset shares with MNIST, but not with ImageNet, the feature of piecewise constancy. The performance gap (the difference in cross-domain generalization between MNIST-trained PhENN and ImageNet-trained PhENN) does, however, depend on the (dis)similarity between the test set and the training set: it is much larger when the test set is Face-LFW than when it is the IC layout dataset.
Further interesting observations are available in Fig. 4.
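The NPCC loss of Eq. (5) and the subsequent linear-fitting scale correction can be sketched as follows (a plain least-squares fit is an assumption on our part, as the paper does not specify the fitting routine):

```python
import numpy as np

def npcc(f_hat, f):
    """Negative Pearson correlation coefficient, Eq. (5)."""
    a = f_hat - f_hat.mean()
    b = f - f.mean()
    return -(a * b).sum() / np.sqrt((a**2).sum() * (b**2).sum())

def affine_correct(f_test_hat, f_val_hat, f_val):
    """Fit f_val ~ a * f_val_hat + b on a validation pair by least squares,
    then apply the learned (a, b) to rescale a test reconstruction."""
    a, b = np.polyfit(f_val_hat.ravel(), f_val.ravel(), 1)
    return a * f_test_hat + b
```

Note that NPCC is indeed invariant under positive affine maps of the estimate, which is why the fitting step is needed to recover the quantitative scale.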

How well has PhENN learned the physics model?
To quantitatively verify that ImageNet-trained PhENN learns the underlying physics law better than MNIST-trained PhENN, we compare the learned WOTF (LWOTF) of the ImageNet-trained PhENN and MNIST-trained PhENN.
Once the network has been trained, the LWOTF is computed from a test set of K images as

LWOTF(u, v) = (1/K) Σ_{k=1}^{K} G_k(u, v) / F̂_k(u, v), (6)

where G_k(u, v) and F̂_k(u, v) are the Fourier transforms of the intensity measurement g_k(x, y) and the network's estimated phase f̂_k(x, y) for the kth test object, respectively. For better generality, we split the test set of K = 100 images into four equally large subsets: 25 test images each from ImageNet, MNIST, Face-LFW and the IC layouts. We denote the LWOTFs of the ImageNet-trained PhENN and the MNIST-trained PhENN as LWOTF-ImageNet and LWOTF-MNIST, respectively.
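A plausible implementation of this estimator averages, over the test pairs, the ratio of the measurement spectrum to the estimated-phase spectrum; the handling of the DC (delta) term and of near-zero spectral bins below is our assumption, not detailed in the paper:

```python
import numpy as np

def learned_wotf(measurements, phase_estimates):
    """Estimate the learned WOTF by averaging, over test pairs, the ratio
    of the measurement spectrum G_k to the estimated-phase spectrum F_k."""
    acc = None
    for g, f_hat in zip(measurements, phase_estimates):
        G = np.fft.fft2(g)
        F = np.fft.fft2(f_hat)
        ratio = np.zeros_like(G)
        mask = np.abs(F) > 1e-12      # skip (near-)zero spectral bins
        ratio[mask] = G[mask] / F[mask]
        acc = ratio if acc is None else acc + ratio
    lwotf = acc / len(measurements)
    lwotf[0, 0] = 0.0                 # drop the delta(u, v) background term
    return lwotf
```

On synthetic pairs constructed to obey G = WOTF · F exactly, the estimator recovers the transfer function it was built from.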
In Fig. 5, we show the 1D cross-sections along the diagonal direction of the WOTF computed from the ground truth examples, LWOTF-ImageNet and LWOTF-MNIST, respectively. We also plot the theoretical WOTF sin[πλz(u² + v²)], denoted WOTF-theory, under the same sampling rate as the detection process. WOTF-theory is indistinguishable from the WOTF computed from the ground truth examples, indicating that the weak object approximation holds well. For better visualization, values are cropped to the range [−3, 3] (the values cropped out are outliers, all from LWOTF-MNIST). We find that ImageNet-trained PhENN indeed learned the WOTF better than MNIST-trained PhENN. We also note that the mismatch between LWOTF-ImageNet and WOTF-theory becomes larger at higher spatial frequencies, due to the under-representation of high spatial frequencies in the reconstructions, which has been discussed extensively in [14,28]. In this paper, we choose not to overcome this limitation with the LS-DNN technique [7,14], so that the choice of training dataset remains the only difference between ImageNet-trained PhENN and MNIST-trained PhENN.

Besides the direct WOTF comparison, we propose an alternative study to verify that ImageNet-trained PhENN learns the propagation model better than MNIST-trained PhENN. From Eq. (2), there exist several nulls (locations where the value of the transfer function equals zero) in the weak object transfer function, and at those nulls the sign of the transfer function switches, introducing a phase delay of π rad in the spatial frequency domain. As a result, at those frequencies the measured pattern at the detector plane shifts by half a period in the spatial domain. We refer to this phenomenon as the "phase shift effect". Because of it, when we image a star-like binary weak phase object with P periods, the fringes in the measurement become discontinuous (see Fig. 9 later).
In particular, for a defocus distance z, the radii of discontinuity r_k, k = 1, 2, · · · , and the associated spatial frequencies (u_k, v_k) jointly satisfy Eq. (7). If PhENN were doing something trivial, e.g. edge sharpening, it would fail to catch the phase shifts dictated by Eq. (7). Therefore, the star-pattern test also provides a way to check whether the physical model has been correctly incorporated. In Section 4.3 we show experimental results verifying that ImageNet-trained PhENN indeed incorporates the physics, whereas MNIST-trained PhENN does not.
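For illustration, the discontinuity radii can be computed by intersecting the star pattern's local fringe frequency P/(2πr) with the transition frequencies of the transfer function. The transition rule ρ_k = sqrt(k/(2λz)) used below is our inference, reverse-engineered from the numerical values quoted in Section 4.3, not an equation stated explicitly here:

```python
import numpy as np

def discontinuity_radii(wavelength, z, P, ks):
    """Radii (m) at which a star pattern with P periods crosses the k-th
    sign transition of the transfer function. The transition frequency
    rho_k = sqrt(k / (2*lambda*z)) is an assumption inferred from the
    values quoted in the text; the fringe frequency at radius r is
    P / (2*pi*r)."""
    rho = np.sqrt(np.asarray(ks, dtype=np.float64) / (2.0 * wavelength * z))
    return P / (2.0 * np.pi * rho)
```

Plugging in the experimental parameters of Section 4.3 (z = 150 mm, λ = 633 nm, P = 50) reproduces the quoted transition frequencies of 0.0032 µm⁻¹ (k = 2) and 0.0040 µm⁻¹ (k = 3).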

Optical Apparatus
The optical configuration of the experiments, using a He-Ne laser, is shown in Fig. 6. The polarization angles of the two linear polarizers (POL1 and POL2) were carefully chosen to reduce the maximum phase modulation depth of the SLM (Holoeye LC-R 720) to ∼ 0.1π (see Fig. 7), as verified with a Michelson-Young interferometer. A 4f telescope of two lenses (f = 100 mm and f = 60 mm) was used to relay the image plane of the SLM while matching the different pixel pitches of the SLM and the CMOS camera; a defocus of z = 150 mm was then applied to capture the diffraction patterns with the camera. Each experimental diffraction pattern was iteratively registered to its simulated counterpart by applying an affine transformation to the experimental pattern so as to maximize the NMI (Normalized Mutual Information) between the two, using the Nelder-Mead method [38,39]. More details on the pre-processing are available in Appendix D.

Comparisons of the cross-domain generalization performance
Cross-domain generalization performance based on experimental data (z = 150 mm) for ImageNet-trained PhENN and MNIST-trained PhENN is shown in Fig. 8 and compared quantitatively in Table 2, from which we make observations similar to those from the synthetic data: the cross-domain generalization performance of ImageNet-trained PhENN is generally good, whereas that of MNIST-trained PhENN is poor. Additional examples, including the distorted phase estimates produced by MNIST-trained PhENN, are shown in Fig. 12 in Appendix C.

Star-pattern experiment to demonstrate the learning of the propagation model
From Eq. (7), for our experimental parameters z = 150 mm, λ = 633 nm, P = 50, the 2nd (k = 2) and 3rd (k = 3) discontinuities are at 0.0032 µm⁻¹ and 0.0040 µm⁻¹, respectively. From the measurement in Fig. 9a, the two discontinuities are located at r₂ ≈ 2.44 mm (red) and r₃ ≈ 2.16 mm (blue), corresponding to spatial frequencies of 0.0033 µm⁻¹ and 0.0037 µm⁻¹, matching the theoretical values well and indicating that the weak object approximation holds well. After PhENN was trained on ImageNet, a dataset drastically different in appearance from the star pattern, these discontinuities were corrected perfectly in the reconstruction (Fig. 9c), indicating that ImageNet-trained PhENN has learned the underlying physics (i.e. the WOTF), while MNIST-trained PhENN apparently failed to do so (Fig. 9d). It is also noteworthy that there is still a significant deficiency of high frequencies in the reconstruction by ImageNet-trained PhENN, corroborating our earlier observation in Fig. 5 that even ImageNet-trained PhENN cannot restore high frequencies very well. The learning-to-synthesize by DNN (LS-DNN) method [7,14] has proven very effective at tackling this issue [40], but we choose not to pursue it here as it would deviate from the main emphasis of this paper.

Conclusions
In this paper, we used PhENN for lensless phase imaging as a standard platform to address the important question of DNN generalization performance when training cannot be performed on the intended class of objects. This is motivated by the problem of insufficient training data, especially when such data must be collected experimentally. We anticipate that this work will offer practitioners a way to train their machine learning architectures efficiently, by choosing a publicly available standardized high-entropy dataset with reasonable confidence in the cross-domain generalization performance.
Our work suggests certain interesting directions for future investigation. A particularly intriguing one is to refine the bound on the (cross-domain) generalization error by incorporating a distance metric between the empirical distributions of the training and test sets, along with other factors currently considered in the literature (recall Section 1.4). Moreover, though the chain of logic presented in this paper was centered on phase retrieval, it should be applicable to other domains of computational imaging, subject to further study and verification.
Appendix A: More details of PhENN and the training specifics.
In Section 1.3, we introduced the high-level architecture of PhENN (Fig. 2). In this section, we provide in Fig. 10 details of the layer-wise architecture of each functional block in PhENN.
The simulation is conducted on an Nvidia GTX 1080 GPU using the open-source machine learning platform TensorFlow. The Adam optimizer [41] is used, with a learning rate of 0.001 and exponential decay rates for the first and second moment estimates of β₁ = 0.9 and β₂ = 0.999. The batch size is 5. Training PhENN for 50 epochs, which is sufficient for it to converge, takes about 2 hours.
Appendix B: More results on synthetic data.
In this section, we show reconstruction examples with synthetic data at z = 150 mm in Fig. 11, and the quantitative metrics in Table 3, where we see that MNIST-trained PhENN produced reconstructions that are generally sparsified, as MNIST imposes too strong a regularization effect on the training, which is passed on to the reconstructions. The superior cross-domain generalization performance of ImageNet-trained PhENN is again verified.

Appendix C: More results on experimental data.

In Fig. 12, we provide additional reconstructions of various classes of objects by ImageNet-trained PhENN and MNIST-trained PhENN (z = 150 mm), based on experimental data. Consistent with previous observations, we see significant distortions in the reconstructions of non-MNIST objects by MNIST-trained PhENN, whereas no significant distortions are seen in the reconstructions of non-ImageNet objects produced by ImageNet-trained PhENN.

Appendix D: Details on experimental data preprocessing.
Two linear polarizers were used to achieve a maximum phase depth of the reflective SLM of ∼ 0.1π, which however brings forth spurious effects on the shape of the exit beam. Thus, the training process was preceded by image registration of the raw intensity measurements.
Using the calibration curve between 8-bit grayscale values and phase modulation depth in radians, input phase objects were simulated, followed by the generation of simulated intensity measurements without any deformation. In the ideal case, the experimental measurements should match the simulated ones. The image registration process finds an optimal affine transformation that brings an experimental measurement to its corresponding simulated one by minimizing the negative NMI (Normalized Mutual Information) between the two with the Nelder-Mead method. An optimal affine transformation matrix was found for each dataset and applied accordingly.
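The registration step can be sketched as follows. For brevity the affine map is restricted to an integer translation found by brute force rather than a full affine map optimized with Nelder-Mead, and the NMI normalization (H(A) + H(B))/H(A, B) is our assumption, as the paper does not spell it out:

```python
import numpy as np

def nmi(a, b, bins=64):
    """Normalized mutual information (H(A)+H(B))/H(A,B) of two images."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    h = lambda p: -(p[p > 0] * np.log(p[p > 0])).sum()
    return (h(px) + h(py)) / h(pxy)

def register_shift(experimental, simulated, max_shift=5):
    """Brute-force the integer translation that maximizes NMI with the
    simulated measurement (a simplified stand-in for the paper's
    Nelder-Mead affine registration)."""
    best, best_shift = -np.inf, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = nmi(np.roll(experimental, (dy, dx), axis=(0, 1)), simulated)
            if score > best:
                best, best_shift = score, (dy, dx)
    return best_shift
```

NMI attains its maximum of 2 for identical images, so the search recovers a known translation exactly.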
Then, only the center 256 × 256 pixels of each preprocessed intensity measurement were cropped to generate the training, validation, and testing datasets, paired with the corresponding ground truth images as labels. Thanks to the 4f system in the optical apparatus, which matches the pixel size of the reflective SLM to that of the CMOS camera, the cropped measurements have the same pixel dimensions as the images displayed on the SLM.

Funding Information
Intelligence Advanced Research Projects Activity (IARPA), RAVEN Program (FA8650-17-C-9113); Singapore National Research Foundation, the SMART (Singapore-MIT Alliance for Research and Technology) program (015824). I. Kang was supported in part by the KFAS (Korea Foundation for Advanced Studies) scholarship.