Optimal Physical Preprocessing for Example-Based Super-Resolution

In example-based super-resolution, the function relating low-resolution images to their high-resolution counterparts is learned from a given dataset. This data-driven approach to solving the inverse problem of increasing image resolution has been implemented with deep learning algorithms. In this work, we explore modifying the imaging hardware in order to collect more informative low-resolution images for better ultimate high-resolution image reconstruction. We show that this "physical preprocessing" allows for improved image reconstruction with deep learning in Fourier ptychographic microscopy, a technique that achieves both high resolution and high field-of-view at the cost of temporal resolution. In Fourier ptychographic microscopy, variable illumination patterns are used to collect multiple low-resolution images, which are then computationally combined to create an image with resolution exceeding that of any single image from the microscope. We use deep learning to jointly optimize the illumination pattern with the post-processing reconstruction algorithm for a given sample type, allowing for single-shot imaging with both high resolution and high field-of-view. We demonstrate, with simulated data, that this joint optimization yields improved image reconstruction as compared with sole optimization of the post-processing reconstruction algorithm.


Introduction
Deep learning, a subset of machine learning, has shown remarkable results in the interpretation of data. By computationally finding complex, higher-order patterns in labeled datasets, deep learning has yielded state-of-the-art results in areas such as image classification [1], language translation [2], and speech recognition [3]. Deep learning is powerful because the mathematical form of the relationship between the input and output need not be specified beforehand; the deep learning algorithm can computationally find arbitrary, non-linear relationships in datasets.
The "super-resolution" problem of converting low-resolution images to higher resolution has been attempted with machine learning [4][5][6][7][8][9][10][11][12]. The super-resolution problem is an ill-posed inverse problem, with many high-resolution images possible for a given low-resolution image. In example-based super-resolution, machine learning attempts to make the problem well-posed by applying prior information.
In example-based super-resolution, the training inputs are low-resolution image patches, and the training outputs are the corresponding high-resolution image patches. Machine learning then attempts to find the function that takes the low-resolution patch as the input, and then outputs the high-resolution patch. Deep convolutional neural networks have been promising in tackling the super-resolution problem [5,7,8,10,11], and the addition of adversarial networks has resulted in improved image quality of fine details [9]. Example-based super-resolution with deep learning has also been applied to microscopy images [13][14][15]. The success of these super-resolution deep neural networks can be attributed to the fact that images are not random collections of pixels; images contain some structure. The more we constrain the category or type of images we use, the more structure the deep learning algorithm should be able to find.
arXiv:1807.04813v1 [eess.IV] 12 Jul 2018

The desirability of obtaining a high-resolution image from low resolution stems from the trade-offs inherent in designing imaging systems, due to constraints in physics and cost. The minimum distance at which two points remain distinguishable from each other determines the resolution of an image; the lower this minimum distance, the higher the resolution. In the design of an imaging system, resolution must be traded off against the total area imaged, the field-of-view.
We can obtain a low-resolution image with high field-of-view or vice versa. The tradeoff between resolution and field-of-view of an imaging system can be quantified by its space-bandwidth product [16]. Example-based super-resolution is a pathway to obtaining high resolution and high field-of-view at the same time, i.e. a high space-bandwidth product. Another way to obtain a high space-bandwidth product is to trade off temporal resolution by taking multiple images and stitching them together. This approach is undertaken in some computational imaging modalities, where multiple image measurements are collected and algorithmically combined.
In the computational imaging paradigm, the imaging hardware is co-designed with the post-processing software. In computational imaging, we do not attempt to collect an image directly. Instead, we collect measurements that are used to create the image in post-processing. The individual measurements do not necessarily resemble the final image, but they contain the information necessary for image reconstruction. The data collected by the image sensor may be a garbled version of the desired final image, requiring considerable computational post-processing, sometimes with an iterative algorithm. The image reconstruction process from the collected noisy sensor data is an inverse problem and may be ill-posed. Computational imaging modalities include fluorescence imaging techniques such as stochastic optical reconstruction microscopy [17], photoactivated localization microscopy [18,19] and structured light microscopy [20,21], where multiple lower-resolution images are computationally combined into a higher-resolution image without loss of field-of-view. In computational imaging, we can also consider the problem of imaging the phase of light, which cannot be measured directly. Fourier ptychographic microscopy is a modality where multiple lower-resolution images are computationally combined for a higher-resolution reconstruction of a complex object with both phase and amplitude [22].
In Fourier ptychographic microscopy, the illumination source is a 2-dimensional matrix of light-emitting diodes (LEDs). Low-resolution images with high field-of-view are collected by illuminating the sample with different patterns of LEDs. The different illumination patterns cause different spatial frequencies to be modulated into the pass-band of the optical system. These low-resolution images are then computationally combined to create a high-resolution reconstruction of the phase and amplitude of the sample, preserving the field-of-view of the low-resolution images. An iterative algorithm is generally used for the reconstruction, but can be computationally intensive. Recently, deep learning has been used to solve the inverse problem in several computational imaging modalities, replacing iterative methods [23][24][25][26][27][28][29][30][31][32][33][34][35][36][37][38], including in Fourier ptychographic microscopy [39].
Though Fourier ptychographic microscopy achieves a high space-bandwidth product, temporal resolution is sacrificed: many low-resolution images need to be collected for reconstruction. This makes imaging of dynamic processes, such as those in live cells, difficult. Imaging of such processes requires sophisticated hardware control to rapidly acquire the low-resolution image stack [40].
An emerging application for deep learning is to learn how to both optimally collect and process data for some end goal, rather than just process data [41][42][43][44][45]. With this "physical preprocessing" procedure for deep learning, data acquisition time or hardware requirements can be reduced. In this work, we use deep learning to jointly optimize the illumination pattern in Fourier ptychographic microscopy with the reconstruction algorithm for a given sample type, allowing for single-shot imaging with a high space-bandwidth product. Our deep learning algorithm looks for the illumination pattern that will embed the most information into a single low-resolution image, along with a direct reconstruction method to non-iteratively reconstruct the high-resolution complex field from the single image measurement.

Background on Fourier Ptychographic Microscopy
In Fourier ptychographic microscopy, a sample is illuminated by a light-emitting diode (LED), as in Fig. 1. We assume that the LED is sufficiently far away that the light entering the sample can be approximated as a plane wave. The sample scatters the entering plane wave, and the exiting wave is imaged through a microscope objective with a given numerical aperture (NA). Consider a thin sample that lies in the xy-plane and is centered at the origin of the Cartesian coordinate system shown in Fig. 2. The illuminating LED is placed at the point $(x_l, y_l, z_l)$. The spatial profile of the plane wave from the LED entering the sample is described by $e^{i2\pi \vec{u}_l \cdot \vec{r}}$, where $\vec{u}_l \cdot \hat{x} = u_{l,x}$ is the spatial frequency in the x-direction and $\vec{u}_l \cdot \hat{y} = u_{l,y}$ is the spatial frequency in the y-direction. The vector $\vec{u}_l$ points in the direction of $-x_l \hat{x} - y_l \hat{y} - z_l \hat{z}$, and its magnitude is $|\vec{u}_l| = 1/\lambda$, where $\lambda$ is the wavelength of the light. From Fig. 2, the illumination spatial frequencies are

$$u_{l,x} = \frac{-x_l}{\lambda \sqrt{x_l^2 + y_l^2 + z_l^2}}, \qquad u_{l,y} = \frac{-y_l}{\lambda \sqrt{x_l^2 + y_l^2 + z_l^2}}.$$

We use the thin transparency approximation to describe the sample as a complex transmission function $o(x, y)$. The field immediately after the sample is the object transmission function multiplied by the illumination plane wave at $z = 0$:

$$o(x, y)\, e^{i2\pi(u_{l,x} x + u_{l,y} y)}.$$

If the 2D Fourier transform of $o(x, y)$ is $O(u_x, u_y)$, this field in Fourier space is simply a shifted spectrum, $O(u_x - u_{l,x}, u_y - u_{l,y})$. The field is multiplied in Fourier space by the pupil function of the objective, $P(u_x, u_y)$, a low-pass filter. In the case of no aberrations, $P(u_x, u_y)$ is unity within a circle of radius $\mathrm{NA}/\lambda$ and zero elsewhere. The image sensor records the intensity $I$ of this field in real space:

$$I(x, y) = \left| \mathcal{F}^{-1}\left\{ O(u_x - u_{l,x}, u_y - u_{l,y})\, P(u_x, u_y) \right\} \right|^2,$$

where $\mathcal{F}^{-1}$ denotes the inverse 2D Fourier transform [46,47]. Due to the low-pass filtering operation, the intensity image contains information from only a portion of the spatial frequencies of the original object.
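The single-LED forward model above (spectrum shift, pupil low-pass filter, intensity detection) can be sketched in a few lines of NumPy. This is an illustrative implementation under simplifying assumptions (square image, illumination shift quantized to whole Fourier-space pixels), not the authors' code:

```python
import numpy as np

def lowres_intensity(obj, u_shift, na, wavelength, du):
    """Simulate the low-resolution intensity image for one tilted plane-wave
    illumination, following the forward model described above.

    obj        : 2D complex transmission function o(x, y)
    u_shift    : (row, col) shift in Fourier-space pixels, set by (u_l,x, u_l,y)
    na         : numerical aperture of the objective
    wavelength : illumination wavelength (same units as 1/du)
    du         : spatial-frequency step per Fourier-space pixel
    """
    # Tilted illumination shifts the object spectrum: O(u - u_l)
    spectrum = np.fft.fftshift(np.fft.fft2(obj))
    spectrum = np.roll(spectrum, shift=u_shift, axis=(0, 1))

    # Aberration-free circular pupil: unity inside radius NA/lambda, zero outside
    n = obj.shape[0]
    freqs = (np.arange(n) - n // 2) * du
    ux, uy = np.meshgrid(freqs, freqs)
    pupil = (ux**2 + uy**2) <= (na / wavelength)**2

    # The sensor records the intensity of the low-pass-filtered field
    field = np.fft.ifft2(np.fft.ifftshift(spectrum * pupil))
    return np.abs(field)**2
```

For a uniform (empty) object and on-axis illumination, the returned image is uniformly unity, as expected from the model.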
The previous discussion assumes illumination from a single LED, where it is assumed that the LED emits a coherent wave. In an array of LEDs with multiple LEDs illuminated, we assume that each LED emits a coherent wave that is mutually incoherent with the waves emitted from the other LEDs. Thus, if we have a matrix of LEDs in the xy-plane at height $z_l$, the intensity image recorded is

$$I(x, y) = \sum_{l=1}^{n} c_l \left| \mathcal{F}^{-1}\left\{ O(u_x - u_{l,x}, u_y - u_{l,y})\, P(u_x, u_y) \right\} \right|^2,$$

where n is the total number of illuminated LEDs and $c_l$ is the intensity of LED l [46]. In the original instantiation of Fourier ptychographic microscopy, the LEDs are turned on and off one at a time, generating n intensity images [22]. Other work uses a series of patterns of illuminated LEDs, reducing the total number of images needed [40,46]. Adaptive schemes can also reduce the required number of LED images [48].
The goal of Fourier ptychographic microscopy is to reconstruct the sample's complex transmission function, o(x, y), from these intensity images. A variety of iterative approaches have been applied to the reconstruction process [47]. The creation of each intensity image involves frequency modulation and a low-pass filtering operation. Each intensity image has limited spatial bandwidth, but reflects information from a different portion of frequency space. Thus, the reconstructed o(x, y) in Fourier ptychographic microscopy can have higher spatial bandwidth than that specified by the NA of the objective.
Though Fourier ptychographic microscopy creates images with a high space-bandwidth product, it requires acquisition of multiple images. In this work, we aim to enable Fourier ptychographic microscopy with single image acquisition, using a deep learning approach to jointly find the optimal illumination pattern and reconstruction procedure for a given sample type. Our reconstruction procedure replaces previously used iterative methods with a deep neural network.

A Communications Approach
Claude Shannon's fundamental work in the theory of communication forms a framework to circumvent the loss of temporal resolution in Fourier ptychographic microscopy. We consider the problem of imaging a live biological specimen and model the Fourier ptychographic microscope as a communication system, as shown in Fig. 3. In a general communication system, a message from an information source is encoded, transmitted over a noisy channel, and decoded by a receiver. If we know the statistics of the source and the noise characteristics of the channel, we can optimize the method of encoding and decoding for information transmission [49]. This communications approach to imaging has been previously explored in [50]. Fig. 3 diagrams a general communication system [49] applied to imaging a thin biological specimen.
In the case of Fourier ptychographic microscopy, a biological sample is the information source, and the message is the sample's phase and amplitude (i.e. its complex transmission function) at a point in time. The message is encoded as an optical signal by the pattern of illuminated LEDs. This optical signal is then transmitted through the optics of the microscope. The image sensor of the microscope camera receives the signal as intensity measurements over an array of sensor pixels. The received signal may be corrupted by noise, such as Poisson noise. Finally, this image sensor measurement must be decoded with an algorithm to try to reconstruct the original phase and amplitude of the sample.
In Fourier ptychographic microscopy, as in most imaging modalities, generally no assumptions are made about the sample being imaged. Except in some cases [47], the imaging procedure of acquiring low-resolution intensity images and solving for the high-resolution complex object remains invariant with regard to the sample. However, the communication framework in Fig. 3 yields insight into how to reduce the time needed to obtain high space-bandwidth complex objects. If we can determine the statistics of the sample's phase and amplitude "messages," it might be possible to compress all the information needed for reconstruction of the message into a single image sensor measurement. This physical compression may be achieved by choosing the optimal LED illumination pattern. We think of the sample as having a range of different geometrical configurations, and of optimizing a communication system for transmitting and receiving information about the current configuration.
Consider the following simple thought experiment. Imagine a sample having 2 possible states that lead to 2 different phase and amplitude images. With a standard Fourier ptychographic microscope, we would have to take multiple intensity images and complete the iterative reconstruction procedure. However, only 1 bit of information needs to be transmitted and received in this hypothetical system. Intuitively, the standard procedure is wasteful in this simple case; we should easily be able to design an imaging system that only needs a single-shot measurement to find the current state.
There are two major difficulties in the implementation of a communications approach to biological imaging. First, we need to determine the probability distribution of our "messages," which we need to discover through observations of the biological sample. Second, we need to encode our messages for transmission, but we cannot choose an arbitrary encoding algorithm. We are constrained to what is physically possible with optics.
To solve these problems, we turn to deep learning to optimize the end-to-end communications system, as in [51,52], simulating the entire end-to-end imaging pipeline as a deep neural network. During training of the network, the hardware parameters for encoding and the software parameters for decoding are optimized. In the training process, the neural network takes high-resolution complex objects obtained from a standard Fourier ptychographic microscope as input. From the input complex object, the low-resolution image collected with a single LED illumination pattern is emulated. This single low-resolution image then passes through post-processing neural network layers, outputting a prediction of the high-resolution complex object. The illumination pattern and the parameters of the post-processing layers are optimized during training, using a dataset of high-resolution complex objects from a given sample type, as diagrammed in Fig. 4. In the evaluation phase, we implement the optimized illumination in the actual Fourier ptychographic microscope. Each collected single low-resolution image is then fed directly into trained post-processing layers to reconstruct the high-resolution complex object (see Fig. 4). With this procedure, we could use fixed biological samples to collect the training dataset, and live samples in the evaluation step, potentially enabling unprecedented imaging of dynamic biological processes.

Methods
We implement the structure shown in Fig. 4 as a computational graph, and optimize its parameters with TensorFlow [53], an open-source Python package for machine learning. We utilized computational resources from the Extreme Science and Engineering Discovery Environment (XSEDE) [54] for training of the computational graph. The first layers of the deep neural network graph simulate the optical function that maps the high-resolution complex object to the low-resolution collected image. The latter layers aim to transform the simulated low-resolution image back into the high-resolution complex field, replacing an iterative algorithm. The first "physical preprocessing" layers emulate the function of the optics, and allow the parameters of the actual physical system to be optimized at the same time as the reconstruction algorithm. An overview of our computational graph is shown in Fig. 5.

We utilize computationally simulated datasets for training and testing of the deep neural network. Each dataset is composed of high-resolution, high field-of-view complex objects, emulating those reconstructed from a standard Fourier ptychographic microscope setup. To create the complex phase and amplitude images, we take datasets of intensity images and turn them into complex objects. Our first example dataset takes the MNIST dataset of handwritten digits [55] and turns it into a complex object dataset by applying a fixed pixel-wise mapping (Eqn. 6) to each pixel p_0 of the original image. We then low-pass filter the complex field with the synthetic NA associated with the Fourier ptychographic setup. The resulting complex objects are the inputs to the computational graph in Fig. 5. The low-resolution intensity image corresponding to a single LED illumination pattern, I_low, is computationally emulated by the procedure outlined in [46], summarized in Section 2.
We assume a square grid LED matrix where the intensity of each LED can be tuned continuously from 0 to 1, i.e. 0 ≤ c_l ≤ 1. We emphasize that these physical preprocessing layers are included in the TensorFlow computational graph so that c_l is optimized during the training process.
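As a toy illustration of optimizing the LED intensities c_l by gradient descent while enforcing 0 ≤ c_l ≤ 1: in the paper this happens inside the TensorFlow graph, jointly with the reconstruction network, but the mechanics can be sketched with analytic gradients on a made-up linear measurement model (all data below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical precomputed single-LED low-resolution intensities I_l
n_leds, h, w = 9, 8, 8
I_l = rng.random((n_leds, h, w))
target = rng.random((h, w))      # stand-in for a desired measurement

c = np.full(n_leds, 0.5)         # LED intensities: the trainable weights

for _ in range(200):
    I_low = np.tensordot(c, I_l, axes=1)              # I_low = sum_l c_l * I_l
    residual = I_low - target
    grad = 2 * np.mean(residual[None] * I_l, axis=(1, 2))
    c -= 0.1 * grad                                   # gradient step on c_l
    c = np.clip(c, 0.0, 1.0)                          # enforce 0 <= c_l <= 1
```

The clip after each step is the projected-gradient analogue of the constraint 0 ≤ c_l ≤ 1 enforced in the computational graph.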
The next layers of the computational graph add a Gaussian approximation of Poisson noise to I_low. Every pixel of I_low is processed by the noise model of Eqn. 7, where m is a multiplicative factor chosen to fit the noise of a particular setup and g is drawn from a normal random distribution. A higher m corresponds to a higher signal-to-noise ratio. We redraw the value of g at every evaluation of the TensorFlow computational graph.

The noisy I_low is the input to 2 separate post-processing networks. One network aims to reproduce the real part of the high-resolution object and the other aims to reproduce the imaginary part, as shown in Fig. 5. The architecture for the networks is inspired by [56], and shown in Fig. 6; each convolutional layer in Fig. 6 is diagrammed in Fig. 7. The networks make use of convolutional layers [57], maxout neurons [58], batch normalization [59], residual layers [1], and dropout [60]. The outputs of the two neural networks, I_high^real and I_high^imag, are combined to output the high-resolution complex field reconstruction, I_high = I_high^real + i I_high^imag.

Fig. 6: Diagram of the architecture inside the "ConvNet" blocks in Fig. 5. Each numbered square represents a convolutional layer with batch normalization and maxout neurons, as shown in Fig. 7. The numbered squares correspond to kernel lengths of 10, 20, 30, 40, 50, 60, 70, and 80, respectively. During training, 20% of the nodes in the layers shown in red are dropped out. Also shown are the residual connections.

Generative adversarial networks have had success in mimicking the probability distribution of a dataset [61], and the inclusion of a discriminative network has shown success in improving image quality in super-resolution deep neural networks [9].
We include a discriminative network that takes as input both I high and the actual high-resolution field I actual and classifies whether the input is an output of the reconstruction network (a "fake") or an actual high-resolution image, as shown in Fig. 5. The objective function of this discriminative network is the cross-entropy loss of correct classification. We alternate training the parameters of the discriminative network to improve the cross-entropy loss with training of the rest of the computational graph to improve the overall objective function.
Our overall objective function consists of 3 parts, as in [62]: M, the mean-squared error between I_high and I_actual; G, the mean-squared error between the first gradients of I_high and I_actual; and C, the cross-entropy loss representing how well the discriminative network is "fooled," combined with a weighting factor α = 1,000. In calculating G, we take the sum of the mean-squared error of the gradient in the vertical direction and the gradient in the horizontal direction; the gradient is calculated by taking the difference between I and I shifted by one place. The metric C measures how well the discriminative network is fooled by calculating the cross-entropy loss of incorrect classification of I_high. The parameters of the computational graph are trained with the Adam optimizer [63]. A randomly selected batch of 4 high-resolution inputs I_actual from the training dataset is used in every iteration of training. For each example in this work, we train for 100,000 iterations, with a training rate of 1 × 10^-2. The training rate is decayed by a factor of 0.99 every 1,000 iterations. The variable parameters of the computational graph are initialized with a truncated normal distribution with σ = 0.1, with values limited to within 2 standard deviations of µ = 0. An exponential running average of the variable parameters, with decay rate 0.999, is saved for testing.
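The training schedule described above (base rate 1 × 10^-2, decayed by 0.99 every 1,000 iterations, with a 0.999 exponential running average of parameters kept for testing) can be written down directly; the function names below are our own:

```python
def learning_rate(step, base=1e-2, decay=0.99, every=1000):
    """Staircase decay: the rate is multiplied by 0.99 every 1,000 steps."""
    return base * decay ** (step // every)

def ema_update(avg, value, decay=0.999):
    """One step of the exponential running average saved for testing."""
    return decay * avg + (1 - decay) * value
```

For example, the rate is unchanged through step 999 and drops to 0.0099 at step 1,000.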

Results and Discussion
We obtain image reconstruction results when we optimize the entire computational graph in Fig. 5, and compare them to the results obtained when only the post-processing layers are optimized, keeping the LED illumination pattern constant. As described in the previous section, the MNIST dataset of intensity images is processed by Eqn. 6 and then low-pass filtered with the synthetic NA of the Fourier ptychographic microscope. The optical parameters of the emulated Fourier ptychographic microscope are given in Table 1; the original MNIST images are padded with zeros to create 32 × 32 pixel images. We utilize the full MNIST dataset, consisting of 55,000 training images, 5,000 validation images, and 10,000 test images.
We train the following 4 cases of deep neural networks: 1. The LED matrix is fixed at constant, uniform illumination of c l = 1 for all l during training.
2. The LED matrix is initialized at constant, uniform illumination of c l = 1 for all l, but allowed to change during training.
3. The LED matrix is fixed at a random initialization of c l , picked from a uniform random distribution.
4. The LED matrix is initialized at the same random initialization of c_l as in case (3), but is allowed to vary during training.

Fig. 8 shows the results of the trained network for a randomly selected test example for cases (1) and (2), with noise factor m = 0.25. From this single test example, we see that by allowing the illumination pattern to vary, we obtain better results for both M, the mean-squared error, and G, the mean-squared error of the image gradient. In Fig. 9, we vary m and calculate M and G averaged over the entire test dataset. Allowing the illumination pattern to vary yields lower M and G at all noise levels, with the difference growing greater for higher values of 1/m, corresponding to higher levels of noise. It appears that physical encoding of the data by the illumination pattern allows for more robustness to noise. Fig. 10 shows the results for cases (3) and (4) for the same test example as in Fig. 8, again with noise factor m = 0.25. We plot M and G averaged over the entire test dataset for different levels of noise in Fig. 11 for cases (3) and (4). We note trends similar to cases (1) and (2): allowing the LED pattern to vary yields lower error. The final LED patterns in cases (2) and (4) are similar, even though the initial conditions were different.
Though the MNIST dataset is useful for fast prototyping, its images do not resemble biological samples. In order to show proof-of-concept of this method for biologically relevant data, we utilized breast cancer cell images from the UCSB Bio-Segmentation Benchmark dataset [64], and converted the images to a high-resolution complex object dataset. Each image was cropped to 512 × 512 pixels and the three 8-bit color channels were summed. Each pixel p_0 of the resulting image was then mapped to a complex value and the object low-pass filtered by the synthetic NA of the Fourier ptychographic microscope. We used 34 images from the UCSB Bio-Segmentation Benchmark dataset for training and 12 images each for validation and testing. The optical parameters used for this dataset of images are summarized in Table 2.
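The image preparation described above (crop to 512 × 512, sum the three 8-bit color channels) can be sketched as follows; the choice of a centered crop is our assumption, as the text does not specify where the crop is taken, and the subsequent complex-value mapping is not reproduced:

```python
import numpy as np

def crop_and_sum_channels(img_rgb, size=512):
    """Center-crop an H x W x 3 uint8 image to size x size and sum the
    three 8-bit color channels into a single intensity channel."""
    h, w, _ = img_rgb.shape
    top, left = (h - size) // 2, (w - size) // 2
    patch = img_rgb[top:top + size, left:left + size, :]
    # Promote to a wider integer type so the channel sum cannot overflow
    return patch.astype(np.int32).sum(axis=2)
```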
Table 2 gives the optical parameters used for the UCSB Bio-Segmentation Benchmark complex object dataset (size of high-resolution object: 332.8). Fig. 12 shows a test example after training a neural network with noise factor m = 1, following the considerations of case (4). We see reconstruction of many of the image details, and future work will focus on reducing the phase artifacts. Fig. 13b illustrates how the illumination pattern changed during training. The corresponding low-resolution images are also shown, with and without noise, for the single validation example shown in Fig. 13a. Visually, we see enhanced image contrast in the low-resolution image corresponding to the final illumination pattern. In the initial low-resolution image, the image contrast is almost totally obscured with the addition of noise.
When we compare the low-resolution images with noise in Fig. 13b, it appears that more information about the original object is preserved in the final image, corresponding to the optimized illumination pattern. We hypothesize that optimizing the illumination pattern increases the mutual information between the low-resolution image and the high-resolution complex object. We perform a simple experiment to determine whether the mutual information increases as we train the deep neural network. We create a high-resolution dataset of 16 distinct images of 4 × 4 pixels. Each image is a pattern of ones and zeros, and each pixel is processed by Eqn. 6 and filtered by the synthetic NA of the Fourier ptychographic microscope. An example complex object from this dataset is shown in Fig. 14. The optical parameters used for this dataset are given in Table 3. Given these optical parameters, each 4 × 4 pixel high-resolution complex object becomes a 1 × 1 pixel low-resolution intensity image.
We trained 2 neural networks with no dropout, a batch size of 16, and noise level m = 1, following the considerations of case (2) and case (4). The high-resolution dataset contains log_2 16 = 4 bits of information. For the low-resolution image to capture all the information about the high-resolution dataset, the mutual information needs to be 4 bits. We calculate the mutual information for the initial and final illumination patterns of both neural networks, as shown in Fig. 15. The mutual information is calculated as

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)},$$

where p(x) is the marginal probability distribution of high-resolution objects, p(y) is the marginal probability distribution of low-resolution images, and p(x, y) is their joint probability distribution.
For every x ∈ X, we use Eqn. 7 to generate 1,000,000 samples of y and approximate the distribution p(y | X = x). From Fig. 15, we see that in our joint optimization procedure of the illumination pattern and the post-processing parameters, the mutual information of the low-resolution image and high-resolution object increases. We also note that the final LED illumination patterns are similar for cases (2) and (4). It appears that during training, the LED illumination pattern changes so that more information about the high-resolution object is physically encoded in the low-resolution image. The physical preprocessing step is crucial to improved final complex object reconstruction.
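The mutual information computation described above can be implemented directly from a joint probability table. For a hypothetical noiseless, one-to-one mapping over 16 equiprobable objects, it returns the full log_2 16 = 4 bits, matching the information content of the dataset:

```python
import numpy as np

def mutual_information_bits(p_xy):
    """Mutual information I(X;Y) in bits from a joint distribution table,
    where p_xy[i, j] = p(X = x_i, Y = y_j)."""
    px = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
    py = p_xy.sum(axis=0, keepdims=True)    # marginal p(y)
    mask = p_xy > 0                         # 0 * log 0 = 0 by convention
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (px * py)[mask])))
```

In practice the joint table would be estimated from sampled low-resolution measurements, as with the 1,000,000 samples per object described above.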

Conclusions
The success of deep learning presents an opportunity in the design of imaging systems: by finding generalizable structure in past data, it is possible to figure out how to optimally collect future data. This work uses deep learning to jointly optimize the physical parameters of the imaging system together with the parameters of the image post-processing algorithm. We consider the optimization of Fourier ptychographic microscopy and demonstrate that we can eliminate the tradeoff among spatial resolution, field-of-view, and temporal resolution for a given sample type. This work frames the task of imaging a high-resolution complex field as a communications problem, where image information is: (1) encoded as an optical signal with the programmable  illumination source of a Fourier ptychographic microscope, (2) collected by an image sensor, and (3) computationally decoded. With a training dataset of high-resolution phase images of fixed cells, the parameters of both the microscope and the decoding algorithm can be co-optimized to maximize imaging speed. The optimized microscope can then be implemented to image the live sample, and the collected measurements fed directly into the post-processing layers of the computational graph, allowing for a high single-shot space-bandwidth product. With this method, instead of collecting images in the traditional sense, we optimally collect information about the spatial structure of the live sample. By using prior knowledge gained from static images, we avoid redundancy in information collection by physically preprocessing the data.
We present results from simulated data, in order to show the feasibility of this method. Future work is needed to validate these methods with real experimental data. Additionally, our work assumes that the probability distribution of the training data matches the probability distribution of the test data. It is possible that the structure of the fixed samples may not be the same as the live samples. A possible method to alleviate this problem is to feed the output of the deep neural network into an iterative solver to ensure a valid solution to the inverse problem. Finally, we do not consider temporal relationships between images in this work. We consider each image frame as independent, however there is clearly a relationship between sequential frames. With a training set of video sequences, these temporal relationships could be accounted for.
The framework outlined in this paper is not specific to Fourier ptychographic microscopy; these ideas can be extended to other imaging modalities with programmable parameters. Though we have focused on the benefits of this method for imaging of live biological samples, other benefits include reduced data storage needs and faster imaging for high-throughput studies.