Spectral pre-modulation of training examples enhances the spatial resolution of the Phase Extraction Neural Network (PhENN)

The Phase Extraction Neural Network (PhENN) is a computational architecture, based on deep machine learning, for lens-less quantitative phase retrieval from raw intensity data. PhENN is a deep convolutional neural network trained through examples consisting of pairs of true phase objects and their corresponding intensity diffraction patterns; thereafter, given a test raw intensity pattern, PhENN is capable of reconstructing the original phase object robustly, in many cases even for objects outside the database from which the training examples were drawn. Here, we show that the spatial frequency content of the training examples is an important factor limiting PhENN's spatial frequency response. For example, if the training database is relatively sparse in high spatial frequencies, as most natural scenes are, PhENN's ability to resolve fine spatial features in test patterns will be correspondingly limited. To combat this issue, we propose "flattening" the power spectral density of the training examples before presenting them to PhENN. For phase objects following the statistics of natural scenes, we demonstrate experimentally that the spectral pre-modulation method enhances the spatial resolution of PhENN by a factor of 2.


Introduction
The use of machine learning architectures is a relatively new trend in computational imaging and is rapidly gaining popularity. Originally it was proposed for imaging through scatter using the older machine learning format of support vector machines [2]. Subsequently, contemporary Deep Neural Networks (DNNs) have been applied successfully to the same problem of imaging through scatter [3,4], as well as tomography [5], lensless quantitative phase retrieval [1,6], microscopy [7], ghost imaging [8], imaging through fiber bundles [9], and imaging at extremely low light levels [10].
The main motivation for the use of machine learning is to overcome certain deficits of traditional computational imaging approaches. The latter are based on convex optimization, structured so that the optimal solution is as close as possible to the true object. The functional to be minimized is specified by the physical model H of the imaging process, also referred to as forward operator; and by prior knowledge Φ(f) about the class of objects being imaged, also known as regularizer. The inverse (estimate) $\hat{f}$ of an object f is obtained from a measurement g as
$$\hat{f} = \underset{f}{\operatorname{argmin}} \left\{ \| H f - g \|^2 + \alpha\, \Phi(f) \right\}. \tag{1}$$
The regularization parameter α expresses the imaging system designer's relative belief in the measurement vs. belief in the available prior knowledge about the object class.
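To make the role of α concrete, the following is a minimal sketch of a regularized inversion of the same general form, under assumptions that are ours and not the paper's: a one-dimensional object, a hypothetical 3-tap blur as forward operator H, and the quadratic regularizer Φ(f) = ||f||², for which the minimizer has a closed form. The names `tikhonov_inverse`, `f_small_alpha` and `f_large_alpha` are illustrative only.

```python
import numpy as np

def tikhonov_inverse(H, g, alpha):
    """Minimize ||H f - g||^2 + alpha * ||f||^2 via the normal equations.

    Assumption: Phi(f) = ||f||^2 is chosen only because it has a closed-form
    solution; the paper does not commit to a specific regularizer.
    """
    n = H.shape[1]
    return np.linalg.solve(H.T @ H + alpha * np.eye(n), H.T @ g)

rng = np.random.default_rng(0)
n = 32

# Hypothetical forward operator: circulant 3-tap moving-average blur.
H = sum(np.roll(np.eye(n), s, axis=1) for s in (-1, 0, 1)) / 3.0

# Hypothetical object (a box) and a noisy measurement of it.
f_true = np.zeros(n)
f_true[10:20] = 1.0
g = H @ f_true + 0.01 * rng.standard_normal(n)

# Small alpha trusts the measurement; large alpha trusts the prior.
f_small_alpha = tikhonov_inverse(H, g, 1e-3)
f_large_alpha = tikhonov_inverse(H, g, 1e+1)
```

With a large α the estimate is pulled toward the prior (here, toward small norm) at the cost of a larger data residual, which is the trade-off that the regularization parameter encodes.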
Clearly, the performance of (1) in terms of producing acceptable inverses is crucially dependent upon correct and explicit knowledge of both H and Φ, and judicious selection of the parameter α [11]. In situations where this knowledge is questionable or not explicitly available, deep machine learning approaches become appealing as an effort to learn the missing knowledge implicitly through examples. Instead of (1), the object estimate is then obtained as
$$\hat{f} = \mathrm{DNN}(g), \tag{2}$$
where DNN(·) denotes the output of the trained deep neural network. Notation (2) may be used for other, non-deep machine learning structures even though they are generally less effective. However, strictly applied, (2) is limited to the special "end-to-end" design where the measurement g from the camera is fed directly to the DNN. In some cases g first goes through a physical pre-processor, and the pre-processor's output is fed into the DNN [6,10]; whereas in other cases g is fed multiple times into a cascade of generator DNNs [12] to assess the outputs at each step. Developing notation for and carrying out a full debate on the relative merits of these different approaches is beyond the scope of the present paper, where, in any case, we used the end-to-end method (2) only.
Just as the performance of minimization principle (1) depends upon knowledge of the operators H and Φ, performance of the DNN principle (2) depends on the specific DNN architecture chosen (number of layers, connectivity, etc.) and the quality of the training examples. It is the latter aspect of DNN design that we focus on in the present paper. More specifically, we are concerned with the spatial resolution that the DNN can achieve, depending on the spatial frequency content of the examples the DNN is trained with.
The end-to-end residual convolutional DNN solution to lens-less quantitative phase retrieval is PhENN [1], shown to be robust to errors in propagation distance and fairly well able to generalize to test objects from outside the databases used for training. In the present paper, we implemented PhENN in a slightly different optical hardware configuration, described in Section 2.1. The computational architecture, described in Section 2.2, was similar to the original PhENN except here we used the Negative Pearson Correlation Coefficient (NPCC) as training loss function. This has a small beneficial effect in the reconstructions, but necessitates a histogram calibration procedure, described in Section 2.3, to remove linear amplification and bias in the reconstructed phase images.
From the point of view of the original inverse problem formulation (1), PhENN in effect has to learn both the forward operator H and the prior Φ at the entire range of spatial frequencies of interest. The examples presented to PhENN during training establish the spatial frequency content that is stored in the network weights contributing to the retrieval operation (2). In principle, this should be sufficient because, if the training examples are representative enough of the object class, then retrieval of each spatial frequency should be learnt proportionally to that spatial frequency's presence in the database. In practice, however, we found that spatial frequencies with relatively low representation in the database tend to be overshadowed by the more popular spatial frequencies, perhaps due to the nonlinearities in the network training process and operation.
Invariably, high spatial frequencies tend to be less popular in most available databases. ImageNet, in particular, exhibits the well-known inverse-square power spectral density of natural images, as we verify in Fig. 6. This means that high spatial frequencies are inherently underrepresented in PhENN training. Compounded by the nonlinear suppression of the less popular spatial frequencies due to PhENN nonlinearities, as mentioned above, this results in low-pass filtering of the estimates and loss of fine detail. Detailed analysis of this effect is presented in Section 3.
To better recover high spatial frequencies in natural objects then, one should emphasize high spatial frequencies more during training; this may be achieved, for example, by flattening the power spectral density of the training examples before they are presented to the neural network. It would appear that this spectral intervention violates the object class priors: PhENN does not learn the priors of ImageNet itself, it rather learns an edge-enhanced version of the priors. Yet, in practice, again probably because of nonlinear PhENN behavior, we found this spectral pre-modulation strategy to work quite well. The detailed approach and results are found in Section 4.
It is worth mentioning here that the first, to our knowledge, explicit experimental analysis of a DNN's spatial resolution was conducted on IDiffNet in the context of imaging through diffuse media [4]. We chose to pursue the issue further in the present paper but on a different optical problem because spatial resolution in quantitative phase retrieval, in addition to being worthwhile in its own right, is not impacted by the extreme ill-posedness of diffuse media. Even though we have not tried extensively beyond phase retrieval, pre-processing of training examples by spectral manipulation might have merit for several other challenging imaging problems.

Optical configuration
Our optical configuration is shown in Fig. 1. Unlike [1], a transmissive spatial light modulator (SLM) (Holoeye, LC2012, pixel size 36 µm) is used in this system as a programmable phase object f representing the ground truth. The transmissive SLM is coherently illuminated by a He-Ne laser light source (Research Electro-Optics, Model 30995, 633 nm). The light is transmitted through a spatial filter consisting of a microscope objective (Newport, M-60X, 0.85 NA) and a pinhole aperture (D = 5 µm) and then collimated by a lens (focal length 200 mm) before illuminating the SLM. A telescope consisting of two plano-convex lenses L1 and L2 is placed between the SLM and a CMOS camera (Basler, A504k, pixel size 12 µm). The CMOS camera captures the intensity g of the diffraction pattern produced by the SLM at a defocus distance ∆z = 50 mm. The focal lengths of L1 and L2 are set to f1 = 150 mm and f2 = 50 mm, respectively. As a result, this telescope demagnifies the object by a factor of 3, consistent with the ratio between SLM and CMOS camera pixel sizes. An iris with diameter 5 mm is placed at the pupil plane of the telescope to keep the 0th diffracted order of the SLM and filter out all the other orders.
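For readers who wish to experiment numerically, the following is an idealized sketch of the forward model implied by this configuration: a unit-amplitude phase object propagated through free space by the defocus distance, with the camera recording intensity. It uses the standard angular spectrum method, which is our choice and not something the paper specifies. The function name `simulate_intensity` is hypothetical, and the sketch assumes the object is sampled at the 12 µm camera-side pitch; it ignores the iris band-limit, SLM amplitude modulation, and noise.

```python
import numpy as np

def simulate_intensity(phase, wavelength=633e-9, pixel=12e-6, dz=50e-3):
    """Intensity of a pure-phase object after free-space propagation by dz.

    Assumptions (not from the paper): angular spectrum propagation, object
    sampled at the demagnified pitch `pixel`, ideal plane-wave illumination.
    All lengths are in meters.
    """
    phase = np.asarray(phase, dtype=float)
    field = np.exp(1j * phase)                      # unit-amplitude phase object
    fy = np.fft.fftfreq(phase.shape[0], d=pixel)    # spatial frequencies, 1/m
    fx = np.fft.fftfreq(phase.shape[1], d=pixel)
    fsq = fx[None, :] ** 2 + fy[:, None] ** 2
    # Axial wavenumber; the clamp only guards against evanescent components,
    # none of which occur at this sampling (1/(2*pixel) << 1/wavelength).
    kz = 2.0 * np.pi * np.sqrt(np.maximum(1.0 / wavelength**2 - fsq, 0.0))
    propagated = np.fft.ifft2(np.fft.fft2(field) * np.exp(1j * kz * dz))
    return np.abs(propagated) ** 2                  # what the camera records
```

A flat phase produces a uniform intensity, which is why a pure-phase object is invisible in focus and the defocus distance ∆z is needed to encode phase into measurable intensity variations.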
The modulation performance of the SLM depends on the input and output polarizations, which are controlled by the polarizer P and the analyzer A, respectively. In order to realize phase-mostly modulation, we set the incident beam to be linearly polarized at 310° with respect to the vertical direction and also set the analyzer to be oriented at 5° with respect to the vertical direction. The specific calibration curves for the SLM's modulation performance can be found in [33]. In the present paper, all the training and testing objects are of size 256 × 256. They are zero-padded to the size 1024 × 768 before being uploaded to the SLM. For the diffraction patterns captured by the CMOS camera, we crop the central 256 × 256 region for processing.
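The padding and cropping steps can be sketched as follows. This is a minimal illustration that assumes the object is centered in the SLM frame (the text states only that it is zero-padded); the function names `pad_to_slm` and `crop_center` are hypothetical.

```python
import numpy as np

def pad_to_slm(obj, slm_shape=(768, 1024)):
    """Embed a small object in a zero-filled SLM frame of shape (rows, cols).

    Assumption: the object is centered; the paper does not state the offset.
    """
    out = np.zeros(slm_shape, dtype=obj.dtype)
    r0 = (slm_shape[0] - obj.shape[0]) // 2
    c0 = (slm_shape[1] - obj.shape[1]) // 2
    out[r0:r0 + obj.shape[0], c0:c0 + obj.shape[1]] = obj
    return out

def crop_center(img, size=256):
    """Return the central size x size region of a captured frame."""
    r0 = (img.shape[0] - size) // 2
    c0 = (img.shape[1] - size) // 2
    return img[r0:r0 + size, c0:c0 + size]
```

Because the telescope maps one SLM pixel onto one camera pixel, the central 256 × 256 crop of the camera frame corresponds to the region occupied by the object, provided the two are aligned.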

Neural network architecture and training
Similar to [1], the phase extraction neural network (PhENN) that we implement in this paper follows the U-net architecture [34] and utilizes residuals to facilitate learning (ResNet [35]). The detailed architecture is shown in Fig. 2. The input to PhENN is the intensity g, which successively passes through 4 down-residual blocks (DRBs) for feature extraction. The extracted feature map then successively passes through 4 up-residual blocks (URBs) and 2 residual blocks (RBs) for pixel-wise regression, and the last layer outputs the estimate $\hat{f}$ of the object phase.
Unlike [1], here we use the Negative Pearson Correlation Coefficient (NPCC) as loss function [4] to train PhENN. The NPCC loss function is defined as
$$\mathcal{E}_{\mathrm{NPCC}} = \sum_k \frac{(-1) \sum_{i,j} \left( f_k(i,j) - \langle f_k \rangle \right) \left( \hat{f}_k(i,j) - \langle \hat{f}_k \rangle \right)}{\sqrt{\sum_{i,j} \left( f_k(i,j) - \langle f_k \rangle \right)^2}\; \sqrt{\sum_{i,j} \left( \hat{f}_k(i,j) - \langle \hat{f}_k \rangle \right)^2}}, \tag{4}$$
where f and $\hat{f}$ are the true object and the object estimate according to (2), respectively; the summations take place over all pixels (i, j) and training example labels k; and ⟨·⟩ denotes spatial averaging.
We have found the NPCC to generally result in better DNN training in the problems that we examined, especially for objects that are spatially sparse [4]. However, some care needs to be taken when the quantity being estimated is not affine-invariant, i.e., when the absolute scale and offset of the estimate $\hat{f}$ matter, as in quantitative phase imaging; we discuss this immediately below.
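As an illustration, the NPCC can be written in a few lines of NumPy. This is a sketch for evaluation on arrays, not the training code used for PhENN; the function name `npcc_loss` and the small constant `eps` (added to avoid division by zero for constant images) are our own assumptions.

```python
import numpy as np

def npcc_loss(f_true, f_est, eps=1e-12):
    """Negative Pearson correlation coefficient, summed over the batch axis.

    Inputs have shape (K, H, W): K examples, each H x W pixels. The means
    and sums are taken over pixels, then the per-example terms are summed.
    """
    f_true = np.asarray(f_true, dtype=float)
    f_est = np.asarray(f_est, dtype=float)
    axes = tuple(range(1, f_true.ndim))
    dt = f_true - f_true.mean(axis=axes, keepdims=True)   # f_k - <f_k>
    de = f_est - f_est.mean(axis=axes, keepdims=True)     # estimate minus its mean
    num = (dt * de).sum(axis=axes)
    den = np.sqrt((dt ** 2).sum(axis=axes) * (de ** 2).sum(axis=axes)) + eps
    return float(-(num / den).sum())
```

Each example contributes a value between −1 (perfect positive correlation) and +1 (perfect anti-correlation), so a batch of K perfectly reconstructed examples gives a loss of −K.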

Calibration of PhENN output trained with NPCC
From the definition (4) it follows that for any function ψ and real constants a > 0 and b representing linear amplification and bias, respectively,
$$\mathcal{E}_{\mathrm{NPCC}}(\psi, a\psi + b) = \mathcal{E}_{\mathrm{NPCC}}(\psi, \psi),$$
that is, an affine-transformed estimate attains the same minimal loss as the exact estimate. In other words, a DNN trained with NPCC as loss function can only produce affine transformed estimates; there is no way to enforce the requirement a = 1, b = 0, which would guarantee linear amplification- and bias-free reconstruction and is especially important for quantitative phase imaging. Neither does there exist a way that we know of to predetermine the values of a and b through specific choices in DNN training. Therefore, after DNN training a calibration step is required to determine the values of a and b that have resulted so that they can be compensated. This is realized by histogram matching according to the process shown in Fig. 3. Given a set of calibration data, we compute the cumulative distribution functions (CDFs) for the ground truth values as well as the PhENN output values, as shown in Fig. 3(a) and (b). For an arbitrary value f in the ground truth, we find its corresponding PhENN output value $\hat{f}$ that is at the same CDF level, and repeat the process for several (f, $\hat{f}$) samples. Subsequently, the values of a and b are determined by linear fitting of the form $\hat{f} = a f + b$, as shown in Fig. 3(c).
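The calibration procedure can be sketched as below. This is one possible reading of the histogram-matching steps, assuming a > 0 so that equal CDF levels correspond to each other; the function names and the choice of 19 CDF levels between 5% and 95% are hypothetical and not taken from the paper.

```python
import numpy as np

def calibrate_affine(f_true, f_out, levels=None):
    """Estimate a, b in f_out ≈ a * f_true + b by matching CDF levels.

    Assumptions: a > 0, and 19 evenly spaced CDF levels unless given.
    """
    if levels is None:
        levels = np.linspace(0.05, 0.95, 19)
    q_true = np.quantile(np.ravel(f_true), levels)  # ground truth at each CDF level
    q_out = np.quantile(np.ravel(f_out), levels)    # network output at the same level
    a, b = np.polyfit(q_true, q_out, deg=1)         # linear fit: slope, intercept
    return a, b

def apply_calibration(f_out, a, b):
    """Undo the affine transform so the output is on the ground-truth scale."""
    return (np.asarray(f_out, dtype=float) - b) / a
```

In use, a and b would be estimated once from a calibration set and then applied to every subsequent network output.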

Resolution analysis of ImageNet-trained PhENN
In [1], we trained separate PhENNs using the databases Faces-LFW [36] and ImageNet [37] and found that both PhENNs generalize to test objects both within and outside these two databases. In the present paper, we restrict our analysis to the ImageNet database only.
In the PhENN training phase, a total of 10,000 images selected from the ImageNet database are uploaded to the SLM and the respective diffraction patterns are captured by the CMOS. For testing, we use a total of 471 images selected from several different databases: 50 Characters, 40 Faces-ATT [38], 60 CIFAR [39], 100 MNIST [40], 100 Faces-LFW, 100 ImageNet, 20 resolution test patterns [4], and 1 all-zero (dark) image. The diffraction pattern corresponding to the all-zero image is used as the background. For every test diffraction pattern that we capture, we first subtract the background and then normalize, before feeding into the neural network.
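The pre-processing of a captured frame can be sketched as follows. The text does not state which normalization is used; the sketch assumes scaling by the peak absolute value, purely for illustration, and the function name `preprocess` is hypothetical.

```python
import numpy as np

def preprocess(raw, background, eps=1e-12):
    """Subtract the background frame, then normalize.

    Assumption: normalization divides by the peak absolute value; the paper
    does not specify the normalization actually used.
    """
    diff = np.asarray(raw, dtype=float) - np.asarray(background, dtype=float)
    peak = np.abs(diff).max()
    return diff / max(peak, eps)   # eps guards the all-background case
```

Here `background` stands for the diffraction pattern recorded with the all-zero image on the SLM.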

Reconstruction results
The phase reconstruction results are shown in Fig. 4. Here, we use 100 ImageNet test images as calibration data to compensate for the unknown affine transform effected by the NPCC-trained PhENN (Section 2.3). As expected, PhENN is not only able to quantitatively reconstruct the phase objects within the same category as its training database (ImageNet), but also able to retrieve the phase for those test objects from other databases. This indicates that PhENN has indeed learned a model of the underlying physics of the imaging system or at the very least a generalizable mapping of low-level textures between the phase objects and their respective diffraction patterns.
Fig. 4. Reconstruction results; the test images are drawn from (i) Faces-LFW [36], (ii) ImageNet [37], (iii) Characters, (iv) MNIST Digits [40], (v) Faces-ATT [38], or (vi) CIFAR [39], respectively.

Resolution test
In order to test the spatial resolution of our trained PhENN, we use dot patterns as test objects [4], shown in Fig. 5(a). Altogether 20 dot patterns are tested, with spacing D between dots gradually increasing from 2 pixels to 21 pixels. From the resolution test results shown in Fig. 5 it can be observed that the PhENN trained with ImageNet is able to resolve two dots down to D = 6 pixels but fails to distinguish two dots with spacing D ≤ 5 pixels. Thus, D ≈ 6 pixels can be considered as the Rayleigh resolution limit of this PhENN for point-like phase objects.
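One way to turn the visual judgement "two dots are resolved" into a number is to inspect the dip between the two peaks in a 1D cross-section. The sketch below is an assumption-laden illustration and not the procedure used in the paper: the Gaussian blur model, the function names, and the 0.8 dip threshold are all hypothetical.

```python
import numpy as np

def two_dot_profile(spacing, sigma, length=32, first=10):
    """1D cross-section through two dots blurred by a Gaussian (toy model).

    Assumption: the Gaussian blur stands in for whatever smoothing a
    reconstruction introduces; it does not model PhENN itself.
    """
    x = np.arange(length)
    second = first + spacing
    profile = (np.exp(-(x - first) ** 2 / (2 * sigma ** 2))
               + np.exp(-(x - second) ** 2 / (2 * sigma ** 2)))
    return profile, first, second

def is_resolved(profile, i1, i2, max_dip_ratio=0.8):
    """Resolved if the valley between the dots falls below a fraction of the
    lower peak. The threshold 0.8 is a hypothetical choice."""
    valley = profile[i1:i2 + 1].min()
    lower_peak = min(profile[i1], profile[i2])
    return bool(valley < max_dip_ratio * lower_peak)
```

With a blur of sigma = 1.5 pixels in this toy model, two dots at D = 6 pixels show a clear dip and count as resolved, whereas at D = 2 pixels the two dots merge into a single lobe.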

Resolution enhancement by spectral pre-modulation
In our imaging system, the SLM pixel size limits the spatial resolution of the trained PhENN since the minimum sampling distance in all the training and testing objects displayed on the SLM equals one pixel, d_p = 36 µm, corresponding to a maximum spatial frequency of 13.9 mm⁻¹. However, as we saw in Section 3.2, the resolution achieved by our PhENN trained with the ImageNet database is merely 6 pixels (216 µm), much worse than the theoretical value.
The additional factor limiting the spatial resolution of the trained PhENN is the spatial frequency content of the training database. Generally, databases of natural objects, such as natural images, faces, hand-written characters, etc., do not cover the entire spectrum up to 1/(2d_p). For example, below we analyze the ImageNet database and show that it is dominated by low spatial frequency components, with the prevalence of higher spatial frequencies decreasing quadratically.
During training, the neural network learns the particular prevalence of spatial frequencies in the training examples as prior Φ, in addition to learning the physical forward operator H. What this implies is that the less prevalent spatial frequencies are actually learnt against, meaning that by presenting them less frequently we may be teaching PhENN to suppress or ignore them. In the rest of this section, we present evidence to corroborate this interpretation, and suggest as solution a pre-processing step that edge enhances the training examples as a way to impress their importance better upon PhENN.
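For reference, the two numbers quoted above follow directly from the SLM pixel pitch, taking the maximum spatial frequency to be the Nyquist frequency of the pixel grid:

```latex
u_{\max} = \frac{1}{2 d_p} = \frac{1}{2 \times 0.036\ \mathrm{mm}} \approx 13.9\ \mathrm{mm}^{-1},
\qquad
6\, d_p = 6 \times 36\ \mu\mathrm{m} = 216\ \mu\mathrm{m}.
```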

Spectral pre-modulation
The 2D power spectral density (PSD) S(u, v) for the 10,000 images in the ImageNet database is shown in Fig. 6(a, b) in linear and logarithmic scales, respectively; and in cross-section along the spatial frequency u in Fig. 6(c, d). Not surprisingly [41], the cross-sectional power spectral density follows a power law of the form |u|^p with p ≈ −2.
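A minimal sketch of how such a PSD and its power-law exponent might be estimated is given below. It is not the analysis code behind Fig. 6: the function names are hypothetical, the mean of each image is removed before the transform, and the exponent is fitted by least squares in log-log coordinates along the v = 0 cross-section.

```python
import numpy as np

def mean_psd(images):
    """Average |FFT|^2 over a stack of images (N, H, W), zero frequency centered."""
    imgs = np.asarray(images, dtype=float)
    imgs = imgs - imgs.mean(axis=(1, 2), keepdims=True)    # drop the DC term
    spec = np.fft.fftshift(np.fft.fft2(imgs), axes=(1, 2))
    return (np.abs(spec) ** 2).mean(axis=0)

def fit_power_law(psd):
    """Fit log S = p * log|u| + c along the v = 0 row and return p.

    Assumption: frequencies are in cycles per pixel and only u > 0 is used.
    """
    h, w = psd.shape
    u = np.fft.fftshift(np.fft.fftfreq(w))
    row = psd[h // 2]                  # v = 0 cross-section after fftshift
    keep = u > 0
    p, _ = np.polyfit(np.log(u[keep]), np.log(row[keep]), deg=1)
    return p
```

Applied to a stack of natural images, `fit_power_law(mean_psd(images))` would be expected to return a value near −2 if the images follow the inverse-square law described above.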
Therefore, we may approximately represent the 2D PSD of the ImageNet database as
$$S(u, v) \propto \frac{1}{u^2 + v^2}. \tag{5}$$
This is flattened by the inverse filter
$$G(u, v) = \sqrt{u^2 + v^2}. \tag{6}$$
As expected, the high spatial frequency components in the image are amplified after the modulation, as can be seen, for example, in Fig. 7.
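The flattening operation amounts to multiplying the Fourier transform of each image by the radial frequency, so that a PSD falling as the inverse square of frequency becomes flat. The sketch below is illustrative only: the function name `premodulate` is hypothetical, frequencies are expressed in cycles per pixel, and any rescaling of the result to the SLM's phase range (not specified here) is left out.

```python
import numpy as np

def premodulate(f):
    """Multiply the spectrum of f by the radial frequency sqrt(u^2 + v^2).

    Assumptions: frequencies in cycles per pixel; no rescaling to the
    displayable phase range is performed. The filter is zero at (0, 0),
    so the output has zero mean.
    """
    f = np.asarray(f, dtype=float)
    u = np.fft.fftfreq(f.shape[1])
    v = np.fft.fftfreq(f.shape[0])
    G = np.hypot(u[None, :], v[:, None])           # radial frequency filter
    return np.real(np.fft.ifft2(G * np.fft.fft2(f)))
```

Since the amplitude spectrum is multiplied by the radial frequency, the PSD is multiplied by u² + v², which cancels the inverse-square roll-off of natural images.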

Resolution enhancement
We trained a new PhENN using training examples that were spectrally pre-modulated according to (6). That is, we replaced every training example f(i, j) with f_e(i, j), where
$$F_e(u, v) = G(u, v)\, F(u, v), \tag{7}$$
and F, F_e are the Fourier transforms of f, f_e, respectively. We also collected the corresponding diffraction patterns g_e(i, j). The test examples were left without modulation, i.e. the same as in the original use of PhENN described in Section 3. All the training parameters were also kept the same. Both dot pattern and ImageNet test images were used to demonstrate the resolution enhancement, shown in Figs. 8 and 9, respectively. From Fig. 8, we find that with spectral pre-modulation of the training examples according to (7), PhENN is able to resolve two dots with spacing D = 3 pixels. Compared with the resolution test results shown in Fig. 5, it can be said that the spatial resolution of PhENN has been enhanced by a factor of 2 with the spectral pre-modulation technique. In Fig. 9, for the same test image selected from the ImageNet database, more details are recovered by the PhENN that was trained with spectrally pre-modulated ImageNet, albeit at the cost of amplifying some noisy features of the object, near edges most notably.
We also investigated the effect of spectral post-modulation in the original PhENN; that is, if we use a PhENN trained without spectral pre-modulation, and modulate the PhENN output $\hat{f}(i, j)$ to obtain $\hat{f}_e(i, j)$ according to
$$\hat{F}_e(u, v) = G(u, v)\, \hat{F}(u, v),$$
where $\hat{F}$, $\hat{F}_e$ are the Fourier transforms of $\hat{f}$, $\hat{f}_e$, respectively, do we obtain a similar resolution enhancement? The answer is no, as can be clearly verified from the results of Fig. 10. This negative result illustrates that in the original training scheme (without spectral pre-modulation) the fine details are indeed lost and not recoverable by simple means, e.g. linear post-processing. It also highlights the effect of the nonlinearity in PhENN's operation and

Conclusions
Spectral flattening (7) as pre-modulation is a simple approach that we found to be effective in enhancing PhENN's resolution by a factor of 2 when trained and tested on ImageNet examples. We have not investigated the performance of other (non-flattening) filters; indeed, it would be an interesting theoretical question to ask: given a particular form of the PSD in the training examples, what is the optimal spectral pre-modulation for improving spatial resolution?
It is also worth repeating the concern about the priors that PhENN is learning from the spectrally pre-modulated examples, which we pointed out in Section 1. The amplification of certain noise artifacts, clearly seen in the result of Fig. 9(d), shows that, in addition to learning how to resolve fine details in the artifact, PhENN has learnt, somewhat undesirably, to edge enhance (since all the examples it was trained with were also edge enhanced). These observations should present fertile ground for further improvements upon the work presented here.

Fig. 5. Resolution test for PhENN trained with ImageNet. (a) Dot pattern for resolution test. (b) PhENN reconstructions for dot pattern with D = 3 pixels. (c) PhENN reconstructions for dot pattern with D = 5 pixels. (d) PhENN reconstructions for dot pattern with D = 6 pixels. (e) 1D cross-sections along the lines indicated by red arrows in (b)-(d).

Fig. 8. Resolution test for PhENN trained with examples from the ImageNet database with spectral pre-modulation according to (7). (a) Dot pattern for resolution test. (b) PhENN reconstructions for dot pattern with D = 2 pixels. (c) PhENN reconstructions for dot pattern with D = 3 pixels. (d) PhENN reconstructions for dot pattern with D = 6 pixels. (e) 1D cross-sections along the lines indicated by red arrows in (b)-(d).

Fig. 9. Resolution enhancement demonstration. (a) Ground truth for a phase object [37]. (b) Diffraction pattern captured by the CMOS (after background subtraction and normalization). (c) Phase reconstruction by PhENN trained with ImageNet examples. (d) Phase reconstruction by PhENN trained with ImageNet examples that were spectrally pre-modulated according to (7).