Learned reconstructions for practical mask-based lensless imaging

Mask-based lensless imagers are smaller and lighter than traditional lensed cameras. In these imagers, the sensor does not directly record an image of the scene; rather, a computational algorithm reconstructs it. Typically, mask-based lensless imagers use a model-based reconstruction approach that suffers from long compute times and a heavy reliance on both system calibration and heuristically chosen denoisers. In this work, we address these limitations using a bounded-compute, trainable neural network to reconstruct the image. We leverage our knowledge of the physical system by unrolling a traditional model-based optimization algorithm, whose parameters we optimize using experimentally gathered ground-truth data. Optionally, images produced by the unrolled network are then fed into a jointly-trained denoiser. As compared to traditional methods, our architecture achieves better perceptual image quality and runs 20x faster, enabling interactive previewing of the scene. We explore a spectrum between model-based and deep learning methods, showing the benefits of using an intermediate approach. Finally, we test our network on images taken in the wild with a prototype mask-based camera, demonstrating that our network generalizes to natural images.


Introduction
Mask-based lensless imagers (lensless imagers) are a class of computational cameras in which the lens is replaced with a phase or amplitude mask placed a short distance in front of the sensor (Fig. 1). Unlike conventional (lensed) cameras, which directly record an image, lensless cameras map each point in the scene to many sensor pixels, indirectly encoding scene information into the sensor measurement. A reconstruction algorithm is then used to recover the final image. This architecture enables small, cheap, and light-weight designs which can be used for portable or in vivo imaging [1][2][3][4][5][6]. Additionally, the inherent multiplexing of lensless cameras can make them amenable to compressive measurement of higher-dimensional signals, such as 3D volumetric [3,7] or video [8] data, from a single 2D measurement. Lensless cameras have been used for 3D fluorescence microscopy [4,9], thermal imaging [10], and refocusable photography [11].
Image reconstruction methods for lensless cameras fall into two general categories: single-step and iterative reconstructions. Single-step reconstructions can be fast, but often require custom fabricated masks that must be carefully aligned to the sensor [1,2,10,11]. In addition, it is difficult to incorporate priors and leverage compressed sensing in single-step reconstructions. Iterative reconstructions are much slower, but do not impose stringent restrictions on the mask itself, generally produce better results, and allow priors to be used [12,13]. However, due to imperfect system modeling, these methods may still give significant reconstruction artifacts. Additionally, the high complexity of the computation precludes interactive previewing of the scene and requires expensive, bulky compute hardware. In this work, we focus on iterative methods, improving both the image quality and speed with a new reconstruction framework that incorporates the advantages of both deep learning and physical models, making lensless cameras more practical for everyday imaging.

Fig. 1. Overview of our imaging pipeline. During training, images are displayed on a computer screen and captured simultaneously with both a lensed and a lensless camera to form training pairs, with the lensed images serving as ground truth labels. The lensless measurements are fed into a model-based network which incorporates knowledge about the physics of the imager. The output of the network is compared with the labels using a loss function and the network parameters are updated through backpropagation. During operation, the lensless imager takes measurements and the trained model-based network is used to reconstruct the images, providing a large speedup in reconstruction time and an improvement in image quality.
The classical approach to image recovery is to use convex optimization to iteratively minimize a loss function [14,15] consisting of a data-fidelity term and an optional hand-picked regularization term. The data-fidelity term enforces that the recovered image, with the known imaging model applied to it, matches the measurement. The regularization term enforces prior knowledge of image statistics (e.g. non-negative, sparse gradients) and serves to regularize ill-conditioned problems. Iterative approaches are interpretable, but are sensitive to model mismatch, calibration errors, hand-tuned parameters, and hand-picked regularizers that are not necessarily representative of the data. Each of these contributes to reconstruction artifacts and degrades image quality. Furthermore, these methods can take hundreds to thousands of iterations to converge, which is often too slow for real-time imaging.
Recently, deep learning-based methods for image reconstruction have risen in popularity. In deep methods, a convolutional neural network (CNN) is used for image reconstruction [16][17][18]. Networks have hundreds of thousands of parameters which are updated using large datasets of image pairs. These networks are able to learn complex scene statistics, but do not incorporate any prior knowledge about the image formation process. Compared to iterative methods, deep learning-based methods are hard to interpret, do not have convergence guarantees, and have no structured way to incorporate knowledge of the imaging system physics.
Unrolled optimization represents a middle ground between classic and deep methods. In unrolled optimization, a fixed number of iterations from a classic algorithm is interpreted as a deep network, with each iteration serving as a layer in the network. In each layer, if the output is differentiable with respect to the parameters of the algorithm, they can be optimized for a given loss function through backpropagation. In this framework, the sparsifying filters, hyperparameters, or shrinkage function can be learned from the training examples [19,20].
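As a minimal sketch of this idea (not the network used in this work), the snippet below unrolls plain gradient descent on a least-squares problem. Gradient descent is substituted for ADMM here purely to keep the example short, and the function name is hypothetical. Each entry of `step_sizes` plays the role of one layer's parameter; in a learned setting these values would be optimized by backpropagation rather than fixed by hand.

```python
import numpy as np

def unrolled_gradient_descent(A, b, step_sizes):
    """Run len(step_sizes) gradient steps on 0.5 * ||A x - b||^2.

    Illustrative assumption: each step size is the single (learnable)
    parameter of one unrolled layer.
    """
    x = np.zeros(A.shape[1])
    for step in step_sizes:               # one "layer" per iteration
        gradient = A.T @ (A @ x - b)      # gradient of the data term
        x = x - step * gradient
    return x

# Five layers, i.e. a fixed compute budget of five iterations.
A = np.diag([1.0, 2.0])
b = np.array([1.0, 2.0])
x_hat = unrolled_gradient_descent(A, b, step_sizes=[0.2] * 5)
```

The compute cost is fixed by the number of layers, independent of whether the iterations have converged.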

Fig. 2. Networks on a scale from classic to deep. We will present several networks specifically designed for lensless imaging (Le-ADMM, Le-ADMM*, and Le-ADMM-U). We compare these to classic approaches, which have no learnable parameters, and to purely deep methods which do not include any knowledge of the imaging model. We will show the utility of using an algorithm in this middle range compared to a purely classic or deep method. Θ summarizes the parameters that are learned for each network as discussed in Section 4.
Here, we unroll the iterative alternating direction method of multipliers (ADMM) algorithm with a variable splitting specific for lensless imaging [3,15]. This allows us to incorporate knowledge of the image formation process into the neural network as well as learn the network parameters based on the data. To train our network, we experimentally capture a large dataset of lensed and lensless images (Fig. 1). We train our network on a perceptual similarity metric in order to produce images that are visually similar to those from our ground truth lensed camera. We present several variations of networks along the spectrum between classic methods and deep methods, by varying the number of trainable parameters (Fig. 2). Specifically, we introduce three architectures, Le-ADMM, Le-ADMM*, and Le-ADMM-U, each with increasing numbers of trainable parameters, explained in detail in Sec. 4. All of our networks have a bounded compute that can be adjusted according to the application. The networks trade off data fidelity and perceptual image quality, producing more visually appealing images at the price of decreased data fidelity.
We test our network using DiffuserCam [12] as our prototypical lensless camera, built with off-the-shelf components and a low-end camera sensor. Although our network is trained using images from a computer monitor, we demonstrate the generalization of our network to measurements of natural objects taken in the wild. We believe that this exploratory work shows the promise of using unrolled neural networks for lensless imaging, and our results suggest the utility of combining knowledge of the physics together with deep learning for the best performance.
Our contributions include:
1. A bounded-time trainable network architecture that incorporates knowledge of the physical model for lensless imaging.
2. An experimental dataset of 25,000 aligned lensed and lensless image pairs taken using a beamsplitter and computer screen.
3. A demonstration of 20× speedup and 3× improvement in perceptual similarity for lensless imaging reconstructions on an experimental system.
4. Generalization of the network to images taken in the wild on a prototype lensless camera.

Lensless imaging forward model
First, we describe our lensless imaging forward model for DiffuserCam. Based on this, we formulate our traditional model-based reconstruction (Sec. 3), before moving on to modifications that span the spectrum from model-based to deep learning-based algorithms (Fig. 2) in Sec. 4.
DiffuserCam [3,12] is a compact, easy-to-build imaging system that consists only of a diffuser (a transparent phase mask with pseudo-random slowly varying thickness) placed a few millimeters in front of a standard image sensor (see Fig. 1). Light from a point source in the scene is refracted by the diffuser to create a high-contrast caustic pattern on the sensor, which is the point spread function (PSF) of the system (Fig. 1). Since the diffuser is thin, the PSF can be modeled as shift-invariant: a lateral shift of the point source in the scene causes a translation of the PSF in the opposite direction. We model the scene as a collection of point sources with varying color and intensity. Assuming all points are incoherent with each other, the sensor measurement, $b$, can be described as:

$$ b(x, y) = \mathrm{crop}\big[\, h(x, y) * x(x, y) \,\big] = CHx, \tag{1} $$

where $h$ is the system PSF, $x$ represents the scene, and $(x, y)$ are the sensor coordinates. Here, $*$ denotes 2D discrete linear convolution, which returns an array that is larger than both the scene and the PSF. Therefore, a crop operation restricts the output to the physical sensor size. This relation is represented compactly in matrix-vector notation with crop denoted as $C$ and convolution with the PSF denoted as $H$. Equation (1) is computed separately for each color channel.
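The forward model described above (linear convolution with the PSF followed by a crop to the sensor size) can be sketched in NumPy as follows. This is an illustrative sketch for a single color channel, not the authors' implementation: the function name `forward_model` and the choice of a centered crop window are assumptions.

```python
import numpy as np

def forward_model(scene, psf, sensor_shape):
    """Simulate b = crop(h * x) for one color channel.

    Assumption: the crop keeps the central sensor-sized window of the
    full linear convolution.
    """
    full_shape = (scene.shape[0] + psf.shape[0] - 1,
                  scene.shape[1] + psf.shape[1] - 1)
    # Zero-padding both arrays to the full output size makes the
    # FFT-based (circular) product equal to a linear convolution.
    spectrum = np.fft.rfft2(scene, s=full_shape) * np.fft.rfft2(psf, s=full_shape)
    full = np.fft.irfft2(spectrum, s=full_shape)
    # Crop operator C: restrict the output to the physical sensor size.
    top = (full_shape[0] - sensor_shape[0]) // 2
    left = (full_shape[1] - sensor_shape[1]) // 2
    return full[top:top + sensor_shape[0], left:left + sensor_shape[1]]

# A single point source at the scene center reproduces the PSF on the sensor.
rng = np.random.default_rng(0)
psf = rng.random((5, 5))
point_source = np.zeros((5, 5))
point_source[2, 2] = 1.0
measurement = forward_model(point_source, psf, sensor_shape=(5, 5))
```

For a color image, the same function would be applied to each channel separately, matching the per-channel statement in the text.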
Our goal is to recover the scene, $x$, from the measurement $b$. We assume the PSF is known, as it can easily be measured experimentally with an LED point source [3]. Traditional model-based methods for recovering $x$ solve a regularized optimization problem of the following form:

$$ \hat{x} = \arg\min_{x \ge 0} \; \tfrac{1}{2} \| b - CHx \|_2^2 + \tau \| \Psi x \|_1, \tag{2} $$

where $\Psi$ is a sparsifying transform, such as finite differences for total variation (TV) denoising, and $\tau$ is a tuning parameter that adjusts the sparsity level.

Model-based inverse algorithm
The traditional model-based inverse solver relies on the known physics of the forward model to solve Eq. (2), minimizing the difference between the actual and predicted measurements, while satisfying any additional constraints. This problem can be solved efficiently by ADMM [15] with a variable splitting that leverages the structure of the problem [3]. In ADMM, the problem is reformulated as:

$$ \hat{x} = \arg\min_{w \ge 0,\, u,\, v} \; \tfrac{1}{2} \| b - Cv \|_2^2 + \tau \| u \|_1 \quad \text{s.t.} \quad v = Hx, \; u = \Psi x, \; w = x. \tag{3} $$

This variable splitting allows closed-form updates for each step, as derived in [3]. The update equations in each iteration become:

$$
\begin{aligned}
u^{k+1} &\leftarrow T_{\tau/\mu_2}\!\left( \Psi x^{k} + \alpha_2^{k}/\mu_2 \right) \\
v^{k+1} &\leftarrow \left( C^{T}C + \mu_1 I \right)^{-1} \left( \alpha_1^{k} + \mu_1 H x^{k} + C^{T} b \right) \\
w^{k+1} &\leftarrow \max\!\left( \alpha_3^{k}/\mu_3 + x^{k},\; 0 \right) \\
x^{k+1} &\leftarrow \left( \mu_1 H^{T}H + \mu_2 \Psi^{T}\Psi + \mu_3 I \right)^{-1} r^{k} \\
\alpha_1^{k+1} &\leftarrow \alpha_1^{k} + \mu_1 \left( H x^{k+1} - v^{k+1} \right) \\
\alpha_2^{k+1} &\leftarrow \alpha_2^{k} + \mu_2 \left( \Psi x^{k+1} - u^{k+1} \right) \\
\alpha_3^{k+1} &\leftarrow \alpha_3^{k} + \mu_3 \left( x^{k+1} - w^{k+1} \right)
\end{aligned}
\tag{4}
$$

where $r^{k} = \left( \mu_3 w^{k+1} - \alpha_3^{k} \right) + \Psi^{T}\!\left( \mu_2 u^{k+1} - \alpha_2^{k} \right) + H^{T}\!\left( \mu_1 v^{k+1} - \alpha_1^{k} \right)$.

Here, $\alpha_1$, $\alpha_2$, and $\alpha_3$ are the Lagrange multipliers, or dual variables, associated with $v$, $u$, and $w$, respectively, and $\mu_1$, $\mu_2$, and $\mu_3$ are scalar penalty parameters. $T_{\tau/\mu_2}$ denotes vectorial soft-thresholding with parameter $\tau/\mu_2$. This traditional method is based on the physical model of the imaging system (Eq. (1)) and requires no additional calibration data beyond the PSF. However, it depends heavily on hand-chosen values, such as the sparsifying transform $\Psi$ and its associated parameter, $\tau$. The optimization parameters, $\mu_1$, $\mu_2$, and $\mu_3$, are either hand-tuned or auto-tuned based on the primal and dual residuals at each iteration [15]. The method performs well under correctly chosen sparsifying transforms and with the proper hand-tuned parameters. However, in practice ADMM takes hundreds of iterations to converge and produces images with reconstruction artifacts. In the next section, we will outline how we unroll ADMM into a neural network in order to learn the hyper-parameters from the data and seamlessly interface with existing deep learning pipelines.
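To make the structure of one ADMM iteration concrete, the following NumPy sketch implements the standard updates for the splitting v = Hx, u = Ψx, w = x (soft-threshold for u, element-wise least squares for v, non-negativity for w, a Fourier-domain solve for x, then the dual updates). It is a single-channel illustration under several assumptions that are not taken from the text: H is a circular convolution on a grid twice the sensor size, Ψ is circular finite differences, the penalty parameters are fixed scalars, and all function names and default values are hypothetical.

```python
import numpy as np

def soft_threshold(g, t):
    """Vectorial (isotropic) soft-thresholding of a stacked gradient field."""
    mag = np.sqrt(np.sum(g ** 2, axis=0, keepdims=True))
    return g * np.maximum(mag - t, 0.0) / np.maximum(mag, 1e-12)

def admm_reconstruct(b, psf, n_iters=100, mu1=1.0, mu2=1.0, mu3=1.0, tau=1e-3):
    """ADMM for 0.5*||b - CHx||^2 + tau*||Psi x||_1 with x >= 0 (sketch)."""
    h, w = b.shape
    full = (2 * h, 2 * w)          # assumed reconstruction grid
    top, left = h // 2, w // 2

    def pad(a):                    # C^T: embed a sensor-sized array
        out = np.zeros(full)
        out[top:top + h, left:left + w] = a
        return out

    def crop(a):                   # C: keep the sensor-sized window
        return a[top:top + h, left:left + w]

    H_f = np.fft.fft2(np.fft.ifftshift(pad(psf)))

    def H(x):                      # circular convolution with the PSF
        return np.real(np.fft.ifft2(np.fft.fft2(x) * H_f))

    def Ht(x):                     # adjoint of H
        return np.real(np.fft.ifft2(np.fft.fft2(x) * np.conj(H_f)))

    def psi(x):                    # circular finite differences
        return np.stack([np.roll(x, 1, 0) - x, np.roll(x, 1, 1) - x])

    def psi_t(g):                  # adjoint of psi
        return (np.roll(g[0], -1, 0) - g[0]) + (np.roll(g[1], -1, 1) - g[1])

    # Psi^T Psi and H^T H are circular convolutions, so the x-update
    # matrix is diagonal in the Fourier domain.
    impulse = np.zeros(full)
    impulse[0, 0] = 1.0
    psi_gram_f = np.real(np.fft.fft2(psi_t(psi(impulse))))
    x_denom = mu1 * np.abs(H_f) ** 2 + mu2 * psi_gram_f + mu3
    v_denom = pad(np.ones((h, w))) + mu1   # C^T C is a diagonal 0/1 mask
    ctb = pad(b)

    x = np.zeros(full)
    a1, a3 = np.zeros(full), np.zeros(full)
    a2 = np.zeros((2,) + full)
    for _ in range(n_iters):
        u = soft_threshold(psi(x) + a2 / mu2, tau / mu2)
        v = (a1 + mu1 * H(x) + ctb) / v_denom
        w_var = np.maximum(a3 / mu3 + x, 0.0)
        r = (mu3 * w_var - a3) + psi_t(mu2 * u - a2) + Ht(mu1 * v - a1)
        x = np.real(np.fft.ifft2(np.fft.fft2(r) / x_denom))
        a1 = a1 + mu1 * (H(x) - v)
        a2 = a2 + mu2 * (psi(x) - u)
        a3 = a3 + mu3 * (x - w_var)
    return crop(x)
```

Calling `admm_reconstruct(b, psf, n_iters=5)` corresponds to the bounded five-iteration setting discussed later, while a larger `n_iters` approximates the converged solver; the PSF array is assumed to have the same shape as the measurement.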

Learned reconstruction networks
Next, we present several variations of neural networks that jointly incorporate known physical models and deep learning principles. Each network is based on unrolling the iterative ADMM algorithm, such that each iteration comprises a layer of the network, with the tunable parameters learned from the training data. Thus, the physical model is inherently built into the network architecture, making it more efficient.
We present three variations of networks, each having a different number of learned parameters. Learned ADMM (Le-ADMM) has trainable tuning and hyper-parameters. Le-ADMM* extends Le-ADMM by adding a trainable CNN instead of a hand-tuned sparsifying transform. Finally, Le-ADMM-U adds a trainable deep denoiser based on a CNN as the last layer of the Le-ADMM network, learning both the hyper-parameters of Le-ADMM as well as the denoiser. Figure 2 summarizes these methods and where they fall on a scale from classic to deep, and the following sections describe them in detail. Each method has progressively more trainable parameters, and therefore needs a larger training dataset. All networks use 5 iterations of unrolled ADMM, in order to target a 20× speed improvement, which would speed up each reconstruction from 1.5 s to 75 ms, giving a practical speed for real-world imaging.

Learned ADMM (Le-ADMM)
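In the simplest of our unrolled networks, Le-ADMM (learned ADMM), we model each $k$th iteration of ADMM as a layer in a neural network, outlined in Fig. 3. In Le-ADMM, the optional denoiser step depicted in Fig. 3 is omitted. We denote the collection of update equations at the $k$th step of ADMM as $S_k$. These update equations are given by:

$$
\begin{aligned}
u^{k+1} &\leftarrow T_{\tau^{k}/\mu_2^{k}}\!\left( \Psi x^{k} + \alpha_2^{k}/\mu_2^{k} \right) \\
v^{k+1} &\leftarrow \left( C^{T}C + \mu_1^{k} I \right)^{-1} \left( \alpha_1^{k} + \mu_1^{k} H x^{k} + C^{T} b \right) \\
w^{k+1} &\leftarrow \max\!\left( \alpha_3^{k}/\mu_3^{k} + x^{k},\; 0 \right) \\
x^{k+1} &\leftarrow \left( \mu_1^{k} H^{T}H + \mu_2^{k} \Psi^{T}\Psi + \mu_3^{k} I \right)^{-1} r^{k} \\
\alpha_1^{k+1} &\leftarrow \alpha_1^{k} + \mu_1^{k} \left( H x^{k+1} - v^{k+1} \right) \\
\alpha_2^{k+1} &\leftarrow \alpha_2^{k} + \mu_2^{k} \left( \Psi x^{k+1} - u^{k+1} \right) \\
\alpha_3^{k+1} &\leftarrow \alpha_3^{k} + \mu_3^{k} \left( x^{k+1} - w^{k+1} \right)
\end{aligned}
\tag{5}
$$

where $r^{k} = \left( \mu_3^{k} w^{k+1} - \alpha_3^{k} \right) + \Psi^{T}\!\left( \mu_2^{k} u^{k+1} - \alpha_2^{k} \right) + H^{T}\!\left( \mu_1^{k} v^{k+1} - \alpha_1^{k} \right)$. The trainable parameters are the per-layer quantities and can be summarized by

$$ \Theta = \{ \mu_1^{k}, \mu_2^{k}, \mu_3^{k}, \tau^{k} \}, $$

where $k$ represents the iteration number. For 5 unrolled layers, we have a total of 20 learned parameters. After a fixed number of ADMM iterations, the reconstruction is compared to the ground truth (lensed) image using the loss function described in Section 4.5. The trainable parameters are updated using backpropagation to minimize this loss across multiple training examples. Le-ADMM can be interpreted as a data-tuned ADMM where the parameters that are typically hand-tuned or auto-tuned are now updated based on the data in order to minimize a data-driven loss function.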

Le-ADMM*, with learned regularizer
Le-ADMM* has the same overall structure as Le-ADMM, but also includes a learnable regularizer based on a CNN. The new update steps are summarized below:

Le-ADMM-U
Our third variation of unrolled networks is Le-ADMM followed by a learned denoiser, as shown in Fig. 3. Here, a U-Net is used as the denoiser [24]. This method has the most learnable parameters, having a total of 10,605,927 learned parameters, all but 20 of which are from the U-Net. The parameters of Le-ADMM-U, given by $\Theta = \{ \mu_1^{k}, \mu_2^{k}, \mu_3^{k}, \tau^{k}, U \}$, are jointly updated throughout training. The Le-ADMM portion of the network performs the bulk of the deconvolution and includes knowledge of the forward model, while the U-Net denoises the final image, is able to correct model mismatch errors, and makes the images look more visually appealing. Our denoiser network architecture is described in the Appendix.

U-Net
For completeness, we also compare to a purely deep method with no knowledge of the system physics in the reconstruction. For this, we directly use the U-Net architecture from [24], resulting in 10,605,907 learned parameters. We summarize this network architecture in the Appendix.
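The parameter counts quoted for the three configurations are mutually consistent, as the short arithmetic check below shows. The variable names are illustrative only; the numbers are the ones stated in the text (four scalar parameters per unrolled layer, 5 layers, and the quoted U-Net size).

```python
# Per-layer scalars learned by Le-ADMM: mu1, mu2, mu3, and tau.
PER_LAYER_PARAMS = ("mu1", "mu2", "mu3", "tau")
N_LAYERS = 5

le_admm_params = N_LAYERS * len(PER_LAYER_PARAMS)   # 5 layers x 4 scalars
unet_params = 10_605_907                            # U-Net count quoted in the text
le_admm_u_params = le_admm_params + unet_params     # Le-ADMM followed by the U-Net

print(le_admm_params)      # 20
print(le_admm_u_params)    # 10605927
```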

Loss functions
The loss function must be carefully selected because it dictates the parameter updates throughout the training process. In classic methods, ground truth is unavailable, so the loss is a function of the consistency of the final image, $x$, with the measurement model and any image priors. With the inclusion of ground truth training data pairs, we now have access to another class of loss functions that directly compare a given reconstructed image to its associated ground truth image, $x_{gt}$. One common loss is the mean-squared error (MSE) loss with respect to the ground truth, $\| x_{gt} - x \|_2^2$. However, MSE favors low frequencies and generally results in learned reconstructions that are blurry and lack detail [25]. Here, we will use the Learned Perceptual Image Patch Similarity metric (LPIPS) that uses deep features and aims to quantify a perceptual distance between two images, as introduced in [25]. During training, we use a combination of both MSE and LPIPS, as outlined in Section 5. These loss functions are summarized in Table 1.
Table 1. Loss functions. We use a combination of MSE and LPIPS during training of the learned methods. In the classical methods, there is no ground truth data, $x_{gt}$, so data fidelity is used with total variation for regularization.

Loss | Expression | Description
Data fidelity | $\| CHx - b \|_2^2$ | Consistency of the measurement with our knowledge of the imaging system
MSE | $\| x_{gt} - x \|_2^2$ | Pixel-wise difference between reconstruction and ground truth
LPIPS | LPIPS($x_{gt}$, $x$) | Perceptual distance between reconstruction and ground truth [25]
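The two closed-form losses can be written directly in NumPy, as in the sketch below. LPIPS is not implemented here because it requires a pretrained feature network; it enters only as a precomputed number. The `forward` callable stands in for the operator CH, and the convex weighting in `combined_loss` is a hypothetical way of mixing MSE and LPIPS, since the text does not give the exact weights.

```python
import numpy as np

def data_fidelity(b, x, forward):
    """Squared error between the predicted measurement forward(x) and b."""
    return float(np.sum((forward(x) - b) ** 2))

def mse_loss(x_gt, x):
    """Squared L2 distance to the ground truth.

    Dividing by x.size would give the per-pixel mean instead.
    """
    return float(np.sum((x_gt - x) ** 2))

def combined_loss(mse_value, lpips_value, lpips_weight):
    """Hypothetical convex combination; lpips_weight lies in [0, 1]."""
    return (1.0 - lpips_weight) * mse_value + lpips_weight * lpips_value

x_gt = np.array([1.0, 2.0])
x = np.array([0.0, 0.0])
print(mse_loss(x_gt, x))               # 5.0
print(combined_loss(4.0, 2.0, 0.25))   # 3.5
```

Raising `lpips_weight` over the course of training would mimic a schedule that emphasizes MSE early and LPIPS later.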

Implementation
For training, we simultaneously collect a set of lensless and ground truth image pairs using an experimental setup consisting of a lensed camera, a DiffuserCam, a beamsplitter, and a computer monitor (Fig. 1). The cameras and computer monitor are simultaneously triggered, which allows us to display and capture all the training pairs in the dataset overnight. Our DiffuserCam prototype consists of an off-the-shelf diffuser (Luminit 0.5°) with a laser-cut paper aperture placed approximately 9 mm from a CMOS sensor. The lensed camera is focused at the plane of the computer screen, approximately 10 cm away. We capture a calibration PSF (see Fig. 1) using an LED point source placed at the distance of the computer screen, which sets the focal plane of the DiffuserCam. For both DiffuserCam and the ground truth camera, we use Basler Dart (daA1920-30uc) sensors. We use a 6 mm S-mount lens for the ground truth camera and calibrate the lens distortion using OpenCV's undistort camera calibration procedure [26]. To achieve pixel-wise alignment between the image pairs, we first optically align the two cameras, then further calibrate by displaying a series of points on the computer monitor that span the field-of-view. We reconstruct these point images and compute the homography transform needed to co-align both cameras' coordinate systems. This transform is applied to all subsequent images.
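The homography step can be illustrated with a direct linear transform (DLT) fit to matched point locations. This is a generic NumPy sketch, not the calibration code used here: the function names and the synthetic point grid are assumptions, and in practice a library routine such as OpenCV's `cv2.findHomography` could serve the same purpose.

```python
import numpy as np

def estimate_homography(src, dst):
    """Fit a 3x3 homography mapping src points to dst points (N >= 4)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear constraints per correspondence on the 9 entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Synthetic check: a 3x3 grid of calibration points spanning the field of view.
grid = np.array([[x, y] for x in (0.0, 240.0, 480.0) for y in (0.0, 135.0, 270.0)])
H_true = np.array([[1.02, 0.01, 3.0], [-0.02, 0.98, -2.0], [1e-4, 2e-4, 1.0]])
observed = apply_homography(H_true, grid)
H_est = estimate_homography(grid, observed)
```

With noisy point detections, more points than the minimum of four and coordinate normalization before the SVD would typically improve the fit.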
Our dataset consists of 25,000 images from the MirFlickr dataset [27]. The raw data from each sensor is 1920×1080 pixels, but is down-sampled by a factor of 4 in each direction, to 480×270. This is necessary due to moiré fringes from the screen which degrade our lensed image quality. We split the dataset up into 24,000 training images and 1,000 test images. Our networks are implemented in PyTorch and trained on a Titan X GPU, using the Adam optimizer throughout training [28]. We find that using a combined loss based on MSE and LPIPS works best in practice. We weight MSE more heavily during earlier epochs and weight LPIPS more heavily during later epochs for further refinement. Source code is available at [29]. When displaying the final images, we crop to 380×210 pixels to avoid displaying areas beyond the borders of the computer monitor.
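The image sizes quoted above can be reproduced with a few lines of NumPy. The sketch below assumes 4×4 block averaging for the down-sampling and a centered window for the display crop; neither choice is specified in the text, and the function names are hypothetical.

```python
import numpy as np

def downsample_by_4(img):
    """Average non-overlapping 4x4 blocks of an (H, W, C) image.

    Assumes H and W are multiples of 4, as for 1080 x 1920 frames.
    """
    h, w, c = img.shape
    return img.reshape(h // 4, 4, w // 4, 4, c).mean(axis=(1, 3))

def center_crop(img, out_h, out_w):
    """Keep the central out_h x out_w window (assumed crop location)."""
    h, w = img.shape[:2]
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return img[top:top + out_h, left:left + out_w]

raw = np.zeros((1080, 1920, 3))          # one raw RGB sensor frame
small = downsample_by_4(raw)             # shape (270, 480, 3)
display = center_crop(small, 210, 380)   # shape (210, 380, 3)
```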

Results
After training, we compare the performance of our unrolled networks against both classic ADMM and the fully deep U-Net. Since the number of iterations of ADMM affects both speed and quality of the result, we compare against both ADMM run until convergence (100 iterations) as well as ADMM bounded to 5 iterations. Bounded ADMM takes a similar time to run as our unrolled networks, and converged ADMM sets a baseline for the best performance classic algorithms can achieve. On the deep side, we compare against a U-Net which is trained using our raw DiffuserCam measurements and ground truth labels.
The reconstruction results on our test set (images captured with the monitor setup, but not used during training) show that our fastest learned networks produce images similar to or better than those of converged ADMM in the same amount of time as bounded ADMM (5 iterations), a 20× speedup. Furthermore, we show reconstructions of natural images in the wild (not from a computer monitor), demonstrating that our networks are able to generalize to 3D objects with variable lighting conditions.

Test set results
Table 2 summarizes the reconstruction performance and speed of our learned networks on the test set. Here we can see that our fastest networks (Le-ADMM and Le-ADMM-U) are 20× faster than classic reconstruction algorithms (ADMM converged) and have similar or better average MSE and LPIPS scores. Le-ADMM* is slightly slower due to its inclusion of a CNN on the uncropped image in each unrolled layer, but it is still an order of magnitude faster than converged ADMM.
As we move on the scale from classic to deep (Le-ADMM → Le-ADMM* → Le-ADMM-U), our networks have better MSE and LPIPS scores, but have worse data fidelity. Figure 4 shows several sample images from our test set reconstructions. Here we can see that our networks (Le-ADMM, Le-ADMM*, Le-ADMM-U) produce images that are of equal or better quality than converged ADMM. We can see that bounded ADMM has streaky artifacts, but our learned networks do not. Le-ADMM-U has the best reconstruction performance overall and produces images that are visually similar to the ground truth images. Overall, Le-ADMM-U has 3× better image quality than converged ADMM as measured by the LPIPS metric. The U-Net does not perform as well as Le-ADMM-U, having inconsistent colors and missing higher frequencies. This shows the utility in combining model-based and deep methods.

Table 2. Network performance. We summarize the average data fidelity, MSE, and LPIPS metrics for each network on the test set (1,000 images). Le-ADMM and Le-ADMM-U are both 20× faster than converged ADMM with comparable or better performance in terms of MSE and LPIPS. Le-ADMM-U has the best performance in terms of MSE and LPIPS, outperforming the U-Net which has no knowledge of the system physics.

Figure 5(a) plots the distribution of MSE, LPIPS, and data fidelity scores for the test set. We can see that Le-ADMM-U has the best LPIPS and MSE scores and outperforms converged ADMM, whereas Le-ADMM has similar LPIPS and MSE scores to converged ADMM with many fewer training pairs. Here we can clearly see the trend of the data fidelity error increasing as MSE and LPIPS decrease, showing that there is a trade-off between image quality and matching the imaging model. We interpret this as our system model being imperfect, which prevents purely model-based algorithms from achieving the best image quality. As we increase the number of learned parameters, we are able to correct artifacts introduced by model mismatch, producing more visually appealing images that better match the lensed camera. Figure 5(b) analyzes what happens to the reconstruction throughout the layers of the learned network. The MSE and LPIPS scores tend to decrease with iterations, while the data fidelity error increases. For Le-ADMM-U, the U-Net greatly improves the LPIPS and MSE values, at the cost of data fidelity.

Generalization to images in the wild
Next, we remove the computer monitor and capture DiffuserCam images of natural objects. Figure 6 shows some example reconstructions using our learned networks. Again, we see that our networks produce images of similar or higher visual quality than converged ADMM. In particular, Le-ADMM-U again produces the most visually appealing images and has better image quality than converged ADMM. This shows that our learned networks are able to generalize beyond imaging a computer monitor to situations with dramatically different lighting conditions.

Discussion
Our work presents a preliminary analysis of using unrolled, model-based neural networks on a real experimental lensless imaging system. We show that it is favorable to choose a network that combines classic and deep methods. We can perform comparably to classic algorithms in a fraction of the time using only a few learned parameters, and can greatly improve image quality by increasing the number of learned parameters. The number of learned parameters in the network can be chosen according to the application. For instance, scientific imaging applications might choose to have fewer learned parameters to prevent overfitting to the training data. Meanwhile, photography applications may prefer a deeper method with more parameters, potentially producing more visually appealing images at the expense of possibly hallucinating details not present in the scene.
The quality and resolution of our reconstructions are bounded by those of our training dataset, including any potential imperfections in the physical system. For instance, any aberrations introduced by our lensed camera or beamsplitter will affect the learned reconstructions, since the lensed images are used as the ground truth when updating the network parameters. However, in practice we correct for aberrations such as distortion before training; other effects (e.g. chromatic aberration, field curvature) are negligible at our reconstruction grid size. Possible future work includes training on scenes with larger depth content to yield reconstructions with desirable defocus blurs, such as seen in a lensed camera.

Conclusion
We presented several unrolled, model-based neural networks for lensless imaging with a varying number of trainable parameters. Our networks jointly incorporate the physics of the imaging model as well as learned parameters in order to use both the known physics and the power of deep learning. We presented an experimental system with a prototype lensless camera that was used to rapidly acquire a dataset of aligned lensless and lensed images for training. Each of our networks is able to produce similar or better image quality compared to standard algorithms, with the fastest offering a 20× speed improvement. In addition, our deeper method, Le-ADMM-U, has 3× better image quality than standard algorithms on the LPIPS perceptual similarity scale. Our learned network is fast enough for interactive previewing of the scene and also produces visually appealing images, addressing two of the big limitations of lensless imagers. Our work suggests that using such model-based neural networks could greatly improve imaging speed and quality for lensless imaging at the cost of a training step before camera operation.

Next, we outline our smaller U-Net that is used for Le-ADMM*. The network architecture is described as follows:

Table 4. Network architecture for smaller U-Net that is used in Le-ADMM*. The encoding and decoding steps are the same as described in Table 3. Finally, we include a skip connection, adding the input of the network to the output.

layer | k | s | channels in/out | input
enc1 | 3 | 1 | 3/24 | input
pool1 | 2 | 2 | 24/24 | enc1
conv1 | 3 | 1 | 24/24 | pool1
dec1 | 3 | 1 | 24/24 | up(conv1), enc1
conv2 | 1 | 1 | 24/3 | dec1

Effect of training size
In Fig. 7 we study the effect of the number of training images on the network performance. We show that our model-based network, Le-ADMM-U, is able to perform much better than the deep method (U-Net) with fewer training images because it incorporates knowledge of the imaging system into the network.

Fig. 3. Model-based Network Architecture. The input measurement and the calibration PSF are first fed into N layers of unrolled Le-ADMM. At each layer, the updates corresponding to $S_{k+1}$ in Eq. (4) are applied. The output of this can be fed into an optional denoiser network. The network parameters are updated based on a loss function comparing the output image to the lensed image. Red arrows represent backpropagation through the network parameters.

Fig. 4. Test set results, with the raw DiffuserCam measurement (contrast stretched) and the ground truth images from the lensed camera for reference. Le-ADMM (71 ms) has similar image quality to converged ADMM (1.5 s) and better image quality than bounded ADMM (71 ms). Le-ADMM* and Le-ADMM-U have noticeably better visual image quality. The U-Net by itself is unable to reconstruct the appropriate colors and lacks detail.

Fig. 5. Network Performance on Test Set. (a) Here we plot the MSE, LPIPS, and data fidelity values for all image pairs in our test set. On average, our learned networks (green) are more similar to the ground truth lensed images (lower MSE and LPIPS) than 5 iterations of ADMM. Furthermore, our networks have comparable performance to ADMM (100), which takes 20× longer than Le-ADMM and Le-ADMM-U. However, the data fidelity term is higher for the learned methods, indicating that these reconstructions are less consistent with the image formation model. (b) Here we plot performance after each layer (or equivalently, each ADMM iteration) in our network, showing that MSE and LPIPS generally decrease throughout the layers. The U-Net denoiser layer in Le-ADMM-U significantly decreases the LPIPS and MSE values, at the cost of data fidelity.

Fig. 6. Network performance on objects in the wild (toys and a plant) captured with our lensless camera. We show the raw measurement (contrast stretched) on the top row, followed by converged ADMM, ADMM bounded to 5 iterations, our learned networks, and U-Net for comparison. Our learned networks have image quality similar to or better than converged ADMM, and Le-ADMM-U has the best image quality. For instance, Le-ADMM-U is able to capture the details in the sideways plant (second column from left) and the eye of the toy duck (right). The U-Net alone has good image quality, but is missing some colors and details (e.g. the first image is washed out and the nose of the alligator toy is miscolored).

Fig. 7. Effect of Training Size. Here we vary the number of images in the training set and plot the LPIPS score after 5 epochs. We see that Le-ADMM-U performs better and converges faster than a U-Net alone. Le-ADMM does not improve as the number of training images increases, since it has so few parameters.