Lensless computational imaging through deep learning

Deep learning has been proven to yield reliably generalizable answers to numerous classification and decision tasks. Here, we demonstrate for the first time, to our knowledge, that deep neural networks (DNNs) can be trained to solve inverse problems in computational imaging. We experimentally demonstrate a lens-less imaging system where a DNN was trained to recover a phase object given a raw intensity image recorded some distance away.


INTRODUCTION
Neural network training can be thought of as generic function approximation, as follows: given a training set (i.e., examples of matched input and output data obtained from a hitherto-unknown model), generate the computational architecture that most accurately maps all inputs in a test set (distinct from the training set) to their corresponding outputs. In this paper, we propose that deep neural networks may "learn" to approximate solutions to inverse problems in computational imaging.
A general computational imaging system consists of a physical part where light propagates through one or more objects of interest as well as optical elements such as lenses, prisms, etc., finally producing a raw intensity image on a digital camera. The raw intensity image is then computationally processed to yield object attributes, e.g. a spatial map of light attenuation and/or phase delay through the object (what we traditionally call the "intensity image" and "quantitative phase image," respectively). The computational part of the system is then said to solve the inverse problem.
The study of inverse problems can be traced back at least a century to Tikhonov [1] and Wiener [2]. A good introductory book with rigorous but not overwhelming discussion of the underlying mathematical concepts, especially regularization, is [3]. During the past decade, the field experienced a renaissance due to the almost simultaneous maturation of two related mathematical disciplines: convex optimization and harmonic analysis, especially sparse representations. A light technical introduction to these fascinating developments is in [4].
Neural networks have their own history of legendary ups-and-downs [5], culminating in an even more recent renaissance. This was driven by Hinton's insight that multi-layer architectures with numerous layers, dubbed "deep networks" (DNNs), can generalize better than had been previously thought after some simple but ingenious changes in the nonlinearity and training algorithms [6]. Even more recently developed architectures [7][8][9] have enabled neural networks to "learn deeper," and modern DNNs have shown spectacular success at solving "hard" computational problems, such as playing complex games like Atari [17] and Go [18], object detection [19], and image restoration (e.g., colorization [20], deblurring [21][22][23], in-painting [24]).
The idea of using neural networks to clean up images isn't exactly new: for example, Hopfield's associative memory network [25] was capable of retrieving entire faces from partially obscured inputs, and was implemented in an all-optical architecture [26] when computers weren't nearly as powerful as they are now. Recently, Horisaki et al. [27] used support-vector machines, a form of bi-layer neural network with nonlinear discriminant functions, also to recover face images when the obscuration is caused by scattering media.
The hypothesis that we set out to test in this paper is whether a neural network can be trained by being presented pairs of known objects and their raw intensity image representations on the digital camera of a computational imaging system; and then be used to produce object estimates given raw intensity images from hitherto unknown test objects, thus solving the inverse problem. This is a rather general question and may take several flavors, depending on the nature of the object, the physical design of the imaging system, etc. We chose to test the hypothesis in a very specific "heavy" computational imaging scenario: a lens-less physical architecture trying to image pure phase objects with coherent illumination.
Our experimental arrangement, described in more detail in Section 2, falls in-between two categories of imaging systems that could be traditionally called "digital holographic imaging" [28] and "transport-of-intensity imaging" [12,15]. It is neither, because it violates the necessary assumptions of sparse objects leaving most of the incoming light unscattered to serve as reference beam for the digital hologram; and of sparse object gradients that avoid singularities in the transport-of-intensity equation. Hence, either technique would be expected to require significant fine-tuning of regularization parameters to yield satisfactory results.
Our results demonstrate that the DNN computational architecture is capable of "learning" the inverse mapping between raw intensity image and object directly from the experimental data. This implies that the neural network "learns" the underlying governing equations of the system, including its forward operator and its possible deviations from underlying idealizations and assumptions, and the regularizer. This lack of a prior model is also notable because it removes the difficulty of correctly specifying the forward operator; many optimization approaches are sensitive to errors due to inaccurate or incomplete forward models.
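For concreteness, the terms "forward operator" and "regularizer" can be read in the conventional variational form of an inverse problem, sketched below. The symbols are assumed here for illustration and are not defined elsewhere in this paper: f is the unknown object, g the raw intensity image, H the forward operator, Φ the regularizer with weight α, and DNN_θ a network with weights θ trained on K example pairs (f_k, g_k).

```latex
% Conventional model-based inversion (assumed notation):
g = H f, \qquad
\hat{f} = \arg\min_{f} \; \lVert H f - g \rVert^{2} + \alpha \, \Phi(f)

% Learned inversion: no explicit H or \Phi is specified;
% both are absorbed into the trained weights \theta.
\hat{f} = \mathrm{DNN}_{\theta}(g), \qquad
\theta = \arg\min_{\theta'} \sum_{k=1}^{K}
  \lVert \mathrm{DNN}_{\theta'}(g_k) - f_k \rVert^{2}
```

In the first form, errors in H or a poorly tuned α propagate directly into the estimate; in the second, the only inputs are the measured example pairs.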
Neural network approaches often come under criticism because the quality of training depends on the quality of the examples given to the network during the training phase. In our case, "phase objects," generally speaking, constitute a rather large class of objects. It would be unrealistic, perhaps even counter-productive, to attempt to train a network by sampling from across all possible objects in this large class. Instead, we trained on a specific "training class," for which we selected phase objects in the form of handwritten digits, because a database of these is readily available [29] and widely used in the study of various machine learning problems.
As expected, our network did well when presented with unknown phase objects in the form of handwritten digits, the class on which it had been trained. Notably, the network also performed well when presented with objects outside of this "training class," including letters of the English alphabet and characters from different languages (Arabic, Mandarin). Additionally, the trained network yielded accurate results even when the object-to-sensor distance(s) in the training set slightly differed from that of the testing set, suggesting that the network is not merely pattern-matching but instead has actually "learned" a generalizable model approximating the underlying system.
The details of our experiment, including the physical system and the computational training and testing results, are described in Section 2. The neural network itself, as we produced it, is analyzed in Section 3. Concluding thoughts are in Section 4. Our trained neural network, and the databases of images that we used to train and test it are all available online at [31] for interested readers to examine and experiment further.

EXPERIMENT
Our experimental arrangement is shown in Figure 1. A HeNe laser source (Newport Corporation, 633.3 nm), after spatial filtering and collimation, is incident on a liquid crystal spatial light modulator (SLM) extracted from an ST7565 liquid crystal display (Adafruit, $16). Our display had a resolution of 128 × 64 pixels/cells, with a pixel pitch of 0.475 × 0.515 mm and dot size of 0.45 × 0.49 mm. It was controlled using a Raspberry Pi 3 (Adafruit, $40) and was configured to work as a phase SLM as follows: First, the backlight and attached polarizing films were removed, leaving just the bare liquid crystal (LC) cells. A polarizer was then placed in front of the LCD to linearly polarize the incident (elliptically polarized) laser light. In such a configuration, light passing through "off" pixels/cells of the LCD will experience a maximal amount of rotation (typically π/2 if the incident polarization is parallel to cell molecules at the entrance facet). Light passing through "on" pixels/cells, i.e., those with an applied voltage above some minimum threshold, will experience a reduced amount of rotation (down to 0 as molecule angle in the LCD approaches 0 and the cell becomes isotropic). Due to the anisotropic nature of LC pixels (with applied voltage below that required to align all of the molecules), light passing through "on" and "off" pixels of the LCD will exit with different polarizations as well as phase shifts. Because of the inter-pixel difference in polarization, resulting diffraction patterns will show reduced interference (e.g., in the case where the polarizations are orthogonal, no interference will occur between light from these pixels).
To address this issue, we restricted ourselves to binary phase objects and placed a second polarizer in the system behind the display to "split the difference" in polarization between the on and off pixels (such that output polarization at every pixel was the same, with equal loss of brightness due to change in polarization). After the second polarizer, a CCD detector (Basler A504k) was placed at a free-space propagation distance d, which ranged from approximately 10 to 50 cm, to record diffraction patterns. Recorded images were then processed on an Intel i7 CPU, with neural network computations performed on a GTX1080 graphics card ($700, NVIDIA).
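To give a feel for the kind of raw intensity image such an arrangement records, the following minimal sketch propagates an idealized binary phase object through free space with the standard angular spectrum method. This is a textbook numerical model swapped in purely for illustration; it is not part of our reconstruction pipeline, which uses no explicit forward model. The function names, the π phase step, and the 20 µm sampling pitch are assumptions; the wavelength and the 13.5 cm distance are taken from the text. Polarization effects and the true (non-square) SLM pixel geometry are ignored.

```python
import numpy as np

def propagate_angular_spectrum(field, wavelength, pitch, distance):
    """Propagate a sampled complex field over `distance` in free space."""
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, d=pitch)  # spatial frequencies along x (1/m)
    fy = np.fft.fftfreq(ny, d=pitch)  # spatial frequencies along y (1/m)
    FX, FY = np.meshgrid(fx, fy)
    # Squared longitudinal frequency; non-positive values are evanescent.
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    kz = 2.0 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    # Free-space transfer function, with evanescent components dropped.
    transfer = np.where(arg > 0, np.exp(1j * kz * distance), 0)
    return np.fft.ifft2(np.fft.fft2(field) * transfer)

def simulate_raw_intensity(binary_object, phase_step=np.pi,
                           wavelength=633.3e-9, pitch=20e-6, distance=0.135):
    """Intensity recorded a given distance behind a binary phase object.

    phase_step and pitch are hypothetical values chosen for illustration.
    """
    # Unit-amplitude field carrying a phase of 0 or phase_step per pixel.
    field = np.exp(1j * phase_step * np.asarray(binary_object, dtype=float))
    propagated = propagate_angular_spectrum(field, wavelength, pitch, distance)
    return np.abs(propagated) ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obj = (rng.random((64, 128)) > 0.5).astype(float)
    intensity = simulate_raw_intensity(obj)
    print(intensity.shape, float(intensity.min()), float(intensity.max()))
```

A uniform object produces a featureless intensity image in this model, which illustrates why a pure phase object is invisible at zero propagation distance and why the diffraction pattern, rather than the object itself, is what the camera records.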
Our experiment consists of two phases: training and testing. During the training phase, we modulate the phase SLM according to samples randomly selected from the MNIST handwritten digit database. We resize, pad, and binarize (via thresholding) selected images before displaying them on our SLM. A typical example of a handwritten digit, as it is sent to the SLM, and the raw intensity image (diffraction pattern) it generates on the CCD are shown in Figures 2(a) and (b), respectively. Our training set consisted of 10,000 such digits. The raw intensity images from all these training examples are used to train the weights in our DNN. We used a Zaber T-LSQ450D stage with a repeatability of 3 µm to translate the camera in order to analyze the robustness of the learnt network to perturbations.
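The resize, pad, and binarize steps can be sketched as follows. This is an illustrative reconstruction rather than our actual preprocessing code: the function name, the integer scale factor of 2, the nearest-neighbor resizing, the centered zero padding, and the 0.5 threshold are all assumptions. Only the 128 × 64 SLM resolution comes from the text, and 28 × 28 is the standard MNIST image size.

```python
import numpy as np

def prepare_slm_frame(digit, scale=2, slm_shape=(64, 128), threshold=0.5):
    """Turn a grayscale digit image into a binary frame for the SLM.

    scale, threshold, and the centering are hypothetical choices.
    """
    digit = np.asarray(digit, dtype=float)
    peak = digit.max()
    if peak > 0:
        digit = digit / peak  # normalize gray levels to [0, 1]

    # Resize: nearest-neighbor upsampling by an integer factor.
    resized = np.repeat(np.repeat(digit, scale, axis=0), scale, axis=1)

    # Pad: center the digit on a zero background of the SLM size.
    rows, cols = slm_shape
    h, w = resized.shape
    if h > rows or w > cols:
        raise ValueError("resized digit does not fit on the SLM")
    top, left = (rows - h) // 2, (cols - w) // 2
    padded = np.pad(resized, ((top, rows - h - top), (left, cols - w - left)))

    # Binarize: threshold to the two states of a binary phase object.
    return (padded > threshold).astype(np.uint8)

if __name__ == "__main__":
    fake_digit = np.zeros((28, 28))
    fake_digit[10:18, 12:16] = 255  # stand-in for a handwritten stroke
    frame = prepare_slm_frame(fake_digit)
    print(frame.shape, int(frame.sum()))
```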
Our DNN uses a convolutional residual neural network (ResNet) architecture. In a convolutional neural network (CNN), inputs are passed from nodes of each layer to the next, with adjacent layers connected by convolution, pooling, or non-linearity-generating operations. Convolutional ResNets extend CNNs by adding short-term memory to each layer of the network. The intuition behind ResNets is that a new layer should be added only if it contributes something beyond what the existing layers provide. ResNets ensure that the (N + 1)th layer learns something new about the network by also providing the original input (i.e., without any transformation performed) to the output of the (N + 1)th layer and performing calculations on the residual of the two. This forces the new layer to learn something different from what the input has already encoded/learned [8].
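The skip connection described above can be illustrated with a minimal single-channel sketch in NumPy. It follows the commonly used formulation output = ReLU(x + F(x)), where F is a small stack of convolutions; this is an assumption for illustration and not a description of the exact blocks in our network, which also include downsampling or upsampling. All function names are hypothetical.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Single-channel 2D cross-correlation with zero padding ('same' size).

    Assumes an odd-sized kernel, as with the 3 x 3 kernels used here.
    """
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, kernel1, kernel2):
    """Return ReLU(x + F(x)), with F = conv -> ReLU -> conv."""
    branch = conv2d_same(relu(conv2d_same(x, kernel1)), kernel2)
    # The untransformed input is added back, so the convolutional
    # branch only has to model the residual between output and input.
    return relu(x + branch)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((8, 8))
    zeros = np.zeros((3, 3))
    # With an all-zero branch the block passes its input through unchanged.
    print(np.allclose(residual_block(x, zeros, zeros), x))
```

With zero weights in the branch the block reduces to the identity on non-negative inputs, which is the sense in which an added layer can do no harm and must learn something new to change the output.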
A diagram of our specific DNN architecture is shown in Fig. 3. The input layer is the image captured by the CCD. It is then successively decimated first by a bilinear downsampling layer [30], then by a single convolution block of size 3 × 3 of stride 2, followed by 5 residual blocks of convolution + downsampling and finally by 4 residual blocks of deconvolution + upsampling. At the very last layer of our CNN, the values represent an estimate of our input signal. The connection weights are trained using backpropagation (not to be confused with optical backpropagation) on the quadratic error between the network output and the nominal appearance of the handwritten digits. Fig. 2(c) shows the network output at the beginning of the training phase (i.e. with randomly initialized weights); while Fig. 2(d) shows the network output after training, for the same digit and raw intensity image of Fig. 2(a-b). Training our network took ≈ 2 hours using MatConvNet. We provide analysis of the trained DNN in Section 3. We also experimented by replacing the first downsampling layer with residual blocks. Training this network took ≈ 48 hours without significant improvement in performance.
Table 1. Quantitative analysis of our trained deep neural networks for 3 object-to-sensor distances.
The testing phase consists of sampling more digit examples from the same database, exclusive of digits used during the training phase; using these test digits to modulate the SLM and produce raw intensity images on the CCD; using the intensity images as input to the trained DNN; and observing the output. To validate the generality of our hypothesis, we trained a network on data gathered at each of three distances between the CCD and SLM: 13.5 cm (Distance-1), 33.6 cm (Distance-2) and 54.6 cm (Distance-3). One randomly selected test example and its reconstructions by the network are shown in [figure reference missing].
[Figure caption] (a-d) show the mean square error along with standard deviations for each test set at different relative distances to the baseline (≈ 13.5 cm). Negative distances indicate that the CCD is closer to the SLM and positive distances indicate a larger object-to-sensor distance.
[...] on the SLM and reconstructed by the neural network, are summarized in Table 1.
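The mean square error and standard deviation reported for each test set can be computed along the following lines. This is a minimal sketch under stated assumptions: the function name is hypothetical, the error is averaged over pixels for each image and then summarized across the test set, and any normalization used for the published values is not reproduced here.

```python
import numpy as np

def reconstruction_error_stats(reconstructions, ground_truth):
    """Mean and standard deviation of the per-image mean square error.

    Both inputs are arrays of shape (num_images, height, width).
    """
    rec = np.asarray(reconstructions, dtype=float)
    ref = np.asarray(ground_truth, dtype=float)
    if rec.shape != ref.shape:
        raise ValueError("reconstructions and ground truth must match in shape")
    # One MSE value per test image, averaged over all of its pixels.
    per_image_mse = ((rec - ref) ** 2).reshape(rec.shape[0], -1).mean(axis=1)
    return float(per_image_mse.mean()), float(per_image_mse.std())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = (rng.random((5, 64, 128)) > 0.5).astype(float)
    noisy = truth + 0.1 * rng.standard_normal(truth.shape)
    mean_mse, std_mse = reconstruction_error_stats(noisy, truth)
    print(mean_mse, std_mse)
```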

NETWORK ANALYSIS
The standard method of characterizing neural network training is by plotting the progression of training and test error through the training epochs (iterations in the backpropagation algorithm over all examples). These curves are shown in Figure 6: (a) for training and testing using the handwritten digits database, and (b) for testing with alternate types of images, for the experiments at 3 different distances between the CCD and SLM. The test error for the handwritten digits is lower than the training set error, indicating that the network generalizes well to unseen digits. The test error for alternate characters runs close to the training set error and converges to a low value, indicating that the network does not overfit.
We checked the robustness of the network learnt at Distance-1 (13.5 cm object-to-sensor distance) by feeding it raw intensity images recorded at slightly different distances from that of the training set images. The results for 8 different values of relative distance to this baseline are shown in Figure 5. We also show the neural network reconstructions for the M, I and T characters in the test set for the different values of relative distance in Figure 7. We see that our trained network yielded accurate results even when the object-to-sensor distance in the testing set differed from that of the training set, with the performance deteriorating as the relative distance increases. This suggests that the network is not purely pattern-matching but instead has actually "learned" a generalizable model approximating the underlying system. Our trained network's early filters behave differently for raw intensity images acquired at different object distances, but filters at later levels tend to normalize these intermediate inputs to produce similar outputs.
It is also instructive to "look into" the neural network, e.g. by plotting the outputs of layers in response to randomly selected inputs. This is done in Figure 8 for the first layer after the input layer. Filters at initial levels appear to discriminate different extents of "shear" via regional high-pass filtering and decimation. High-pass filtering is present in transport-of-intensity type approaches to phase-retrieval [32] and the neural network seems to have "learnt" something similar or related. At deeper layers, the patterns observed seem to be more complicated, defying simple explanation. We are not reproducing any of them here, but the interested reader may investigate them at our online repository [31].

CONCLUSIONS AND DISCUSSION
The architecture presented here was deliberately well controlled, with an SLM creating the phase object inputs to the neural network for both training and testing. This allowed us to quantitatively and precisely analyze the behavior of the learning process. We judged more practical architectures, e.g. replacing the SLM with physical phase objects, to be beyond the scope of the present work. Other obvious and useful extensions would be to include optics, e.g. a microscope objective for microscopic imaging in the same mode; and to attempt to reconstruct complex objects, i.e. objects imparting both attenuation and phase delay to the incident light. The significant anticipated benefit in the former case is that it would be unnecessary to characterize the optics for the formulation of the forward operator; the neural network should "learn" this automatically as well. We intend to undertake such studies in future work.

FUNDING INFORMATION
This research was funded by the Singapore National Research Foundation through the SMART program (Singapore-MIT Alliance for Research and Technology) and by the Intelligence Advanced Research Projects Activity (IARPA) through the RAVEN Program. Justin Lee acknowledges funding from the U.S. Department of Energy Computational Science Graduate Fellowship (CSGF) (DE-FG02-97ER25308).