DiffuserCam: Lensless Single-exposure 3D Imaging

We demonstrate a compact and easy-to-build computational camera for single-shot 3D imaging. Our lensless system consists solely of a diffuser placed in front of a standard image sensor. Every point within the volumetric field-of-view projects a unique pseudorandom pattern of caustics on the sensor. By using a physical approximation and simple calibration scheme, we solve the large-scale inverse problem in a computationally efficient way. The caustic patterns enable compressed sensing, which exploits sparsity in the sample to solve for more 3D voxels than pixels on the 2D sensor. Our 3D voxel grid is chosen to match the experimentally measured two-point optical resolution across the field-of-view, resulting in 100 million voxels being reconstructed from a single 1.3 megapixel image. However, the effective resolution varies significantly with scene content. Because this effect is common to a wide range of computational cameras, we provide new theory for analyzing resolution in such systems.


INTRODUCTION
Because optical sensors are 2D, capturing 3D information requires projection onto a 2D sensor in such a way that the 3D data can be recovered. Scanning and multi-shot methods can achieve high spatial resolution, but sacrifice capture speed and require complex hardware. In contrast, existing single-shot methods are fast but have low resolution. Most 3D imagers require bulky hardware, such as large bench-top microscopes. In this work, we introduce a compact and inexpensive single-shot lensless system capable of volumetric imaging, and show that it improves upon the sampling limit of existing single-shot systems by leveraging compressed sensing.
We present DiffuserCam, a lensless imager that uses a diffuser to encode the 3D intensity of volumetric objects into a single 2D image. The diffuser, a thin phase mask, is placed a few millimeters in front of a traditional 2D image sensor. Each point source in 3D space creates a unique pseudorandom caustic pattern that covers a large portion of the sensor. Due to these properties, compressed sensing algorithms can be used to reconstruct more voxels than pixels captured, provided that the 3D sample is sparse in some domain. We recover this 3D information by solving a sparsity-constrained optimization problem, using a physical model and simple calibration scheme to make the computation and calibration scalable. This allows us to reconstruct on a grid of 100 million voxels, several orders of magnitude more than previous work.
We demonstrate a prototype DiffuserCam system built entirely from commodity hardware. It is simple to calibrate, does not require precise alignment during construction and is light efficient (as compared to amplitude masks). We reconstruct 3D objects on a grid of 100 million voxels (non-uniformly spaced) from a single 1.3 megapixel image, which provides high resolution across a large volume. Our reconstructions show true depth sectioning, allowing us to generate 3D renderings of the sample.
Our system uses a nonlinear reconstruction algorithm, which results in object-dependent performance. We quantify this by experimentally measuring the resolution of our prototype using a variety of test targets, and we show that the standard two-point resolution criterion is misleading and should be considered a best-case scenario. Additionally, we propose a new local condition number analysis that explains the variable resolving power and show that the theory is consistent with our experiments.
DiffuserCam uses concepts from lensless camera technology and imaging through complex media, integrated together via computational imaging design principles. This new device could enable high-resolution lensless 3D imaging of large and dynamic To review related previous work, we start with lensless cameras for 2D photography, which have shown great promise because of their small form factors. Unlike traditional cameras, in which a point in the scene maps to a pixel on the sensor, lensless cameras map a point in the scene to many points on the sensor, requiring computational reconstruction. A typical lensless architecture replaces the lens with an encoding element placed directly in front of the sensor. Incoherent 2D lensless cameras have been demonstrated using amplitude masks [1], diffractive masks [2,3], random reflective surfaces [4,5], and modified microlens arrays [6] as the encoding element. 2D lensless imaging with coherent illumination has been demonstrated in [7,8], and extended to 3D [9][10][11][12], but these methods require active or coherent illumination. Our system uses a similar lensless architecture but extends both the design and image reconstruction to enable 3D capture without the need for coherent lighting.
Light field cameras, also called integral imagers, passively capture 4D space-angle information in a single-shot [13], which can be used for 3D reconstructions. This concept can be built into a thin form factor with microlens arrays [14] or Fresnel zone plates [15]. Lenslet array-based 3D capture schemes have also been used in microscopy [16], where improved results are obtained when wave-optical effects [17,18] or in vivo tissue scattering [18,19] are accounted for. All of these systems must trade resolution or field-of-view for single-shot capture. DiffuserCam, in contrast, uses compressed sensing to capture large 3D volumes at high resolution in a single exposure.
Diffusers are often used as a proxy for general scattering media in the context of developing methods for imaging through scattering [20][21][22]. These works have similar mathematical models to our system, but instead of trying to mitigate the effects of unwanted scattering, here we use the diffuser as an optical element in our system design. Therefore, we choose a thin, optically smooth diffuser that refracts pseudorandomly (as opposed to true random scattering). Such diffusers have been shown to produce high contrast patterns under incoherent illumination, enabling light field imaging [23], and have also been used to record coherent holograms [10,24]. Multiple scattering with coherent illumination has been demonstrated as an encoding mechanism for 2D compressed sensing [25], but necessitates an inefficient transmission matrix calibration approach, limiting reconstructions to a few thousand pixels. We achieve similar benefits without needing coherent illumination, and, unlike previous work, we use compressed sensing to add depth information. Finally, our system is designed to enable simpler calibration and more efficient computation, allowing for 3D reconstruction at megavoxel scales with superior image quality.

A. System Overview
DiffuserCam is part of the class of mask-based lensless imagers in which a phase or amplitude mask is placed a small distance in front of a sensor, with no main lens. Our mask (the diffuser) is a thin transparent phase object with smoothly varying thickness (see Fig. 1). When illuminated by an incoherent source sufficiently far from the sensor, the convex bumps concentrate light into high-frequency pseudorandom caustic patterns which are captured by the sensor. The caustic patterns, termed Point Spread Functions (PSFs), vary with the 3D position of the source, thereby capturing 3D information.
To illustrate how the caustics encode 3D information, Fig. 2 shows simulations of caustic PSFs as a function of point source location in object space. A lateral shift of the point source causes a lateral translation of the PSF. An axial shift of the point source causes (approximately) a scaling of the PSF. Hence, each 3D position in the volume generates a unique PSF. The resolution of our camera depends on the structure and spatial frequencies present in the caustic patterns. Because the caustics retain high spatial frequencies over a large range of depths, DiffuserCam attains good lateral resolution for objects at any depth within the volumetric field-of-view (FoV).
By assuming that all points in the scene are incoherent with each other, the measurement can be modeled as a linear combination of PSFs from different 3D positions. We represent this as matrix-vector multiplication: where b is a vector representing the 2D sensor measurement and v is a vector representing the intensity of the object at every point in the 3D FoV, sampled on a user-chosen grid (discussed in image sensor, but the number of columns in H is set by the choice of reconstruction grid. Note that this model does not account for partial occlusion of sources. In order to reconstruct the 3D image, v, from the measured 2D image, b, we must solve Eq. (1) for v. However, if we use the full optical resolution of our system (see Sec. 3B), v will exist on a grid that contains more voxels than there are pixels on the sensor. In this case, H has many more columns than rows, so the problem is underdetermined and we cannot uniquely recover v simply by inverting Eq. (1). To remedy this, we rely on sparsity-based computation principles [26]. We exploit the fact that the object can be represented in a domain in which it is sparse (few non-zero elements) and solve the 1 regularized inverse problem: Here, Ψ is a linear transform that sparsifies the 3D object, v ≥ 0 is a nonnegativity constraint, and λ is a tuning parameter. For objects that are sparse in native 3D space, such as fluorescent samples, we choose Ψ to be the identity matrix. For objects that are not natively sparse, but whose gradient is sparse, we choose Ψ to be the finite difference operator, so Ψv 1 is the 3D Total Variation (TV) semi-norm [27]. Equation (2) is the basis pursuit problem in compressed sensing [26]. For this optimization procedure to succeed, H must have distributed, uncorrelated columns. The diffuser creates caustics that spread across many pixels in a pseudorandom fashion and contain high spatial frequencies. Therefore, any shift or magnification of the caustics leads to a new pattern that is uncorrelated with the original one. As discussed in Sections 2B and 2C, these two properties lead to a matrix that allows us to reconstruct 3D images v from the measurement b = Hv.

A. System Architecture
The hardware setup for DiffuserCam consists of a diffuser placed at a fixed distance in front a sensor (see Fig. 3a). The convex bumps on the diffuser surface can be thought of as a set of randomly-spaced microlenses that have statistically varying focal lengths and f-numbers. The f-number determines the minimum feature size of the caustics, which sets our optical resolution. The average focal length determines the distance at which the caustics have highest contrast (the caustic plane), which is where we place the sensor [23].
Our prototype is built using a PCO.edge 5.5 Color camera (6.5µm pixels). The diffuser is an off-the-shelf engineered diffuser (Luminit 0.5 • ) with a flat input surface and an output surface that is described statistically as Gaussian lowpass filtered white noise with an average spatial feature size of 140µm and average slope magnitude of 0.7 • (details in Supplementary  Fig. S1). This corresponds to an average focal length of approximately 8 mm and average f-number of 50. Due to the high f-number, the caustics maintain high contrast over a large range of propagation distances. Therefore, the diffuser need not be placed precisely at the caustic plane; in our prototype, d = 8.9 mm from our sensor. Additionally, we affix a 5.5 × 7.5 mm aperture directly on the textured side of the diffuser to limit the support of the caustics.
Similar to a traditional camera, the pixel pitch should Nyquist sample the minimum features of the caustics. The smallest features generated by the caustic patterns are roughly twice the pixel pitch on our sensor, so we perform 2x2 binning on the data, yielding 1.3 megapixel images, before applying our 3D reconstruction algorithm.

B. Convolutional Forward Model
Recovering a 3D image requires knowing the system transmission matrix, H, which is extremely large. Measuring or storing the full H would be impractical, requiring millions of calibration images and operating on multi-Terabyte matrices. The convolution model outlined below drastically reduces complexity of both calibration and computation.
We describe the object, v, as a set of point sources located at (x, y, z) on a non-Cartesian grid and represent the relative radiant power collected by the aperture from each source as v(x, y, z). The caustic pattern at pixel (x , y ) on the sensor due to a unit-powered point source at (x, y, z) is the PSF, h(x , y ; x, y, z). Thus, b(x , y ) represents the 2D sensor measurement recorded after the light from every point in v propagates through the diffuser and onto the sensor. This lets us explicitly write the matrix-vector multiplication Hv by summing over all voxels in the FoV: The convolution model amounts to a shift invariance (or infinite memory effect [20,21]) assumption, which greatly simplifies the evaluation of Eq. (3). Consider the caustics created by point sources at a fixed distance, z, from the diffuser. Because the diffuser surface is slowly varying and smooth, the paraxial approximation holds. This implies that a lateral translation of the source by (∆x, ∆y) leads to a lateral shift of the caustics on the sensor by (∆x , ∆y ) = (m∆x, m∆y), where m is the paraxial magnification. We validate this behavior in both simulations (see Fig. 2) and experiments (see Section 3D). For notational convenience, we define the on-axis caustic pattern at depth z as h(x , y ; z) := h(x , y ; 0, 0, z). Thus, the off-axis caustic pattern is given by h(x , y ; x, y, z) = h(x + mx, y + my; z). Plugging Here, * represents 2D discrete convolution over (x , y ), which returns arrays that are larger than the originals. Hence, we crop to the original sensor size, denoted by the linear operator C. For an object discretized into N z depth slices, the number of columns of H is N z times larger than the number of elements in b (i.e. the number of sensor pixels), so our system is underdetermined. The cropped convolution model provides three benefits. First, it allows us to compute Hv as a linear operator in terms of N z images, rather than instantiating H explicitly (which would require petabytes of memory to store). In practice, we evaluate the sum of 2D cropped convolutions using a single circular 3D convolution, enabling the use of 3D FFTs, which scale well to large array sizes (for details, see Supplementary Sec. 2). Second, it provides us with a theoretical justification of our system's capability for sparse reconstruction. Derivations in [28] show that translated copies of a random pattern provide close-to-optimal compressed sensing performance.
The third benefit of our convolution model is that it enables simple calibration. Rather than measuring the system response from every voxel (millions of images per depth), we only need to capture a single calibration image of the caustic pattern from an on-axis point source at each depth plane. 1 A typical calibration 1 While the scaling effect described in Sec. 1A suggests that we could use only thus consists of capturing images as a single point source is moved axially. This takes minutes, but need only be performed once. The added aperture on DiffuserCam ensures that a point source at the minimum z distance generates caustics that just fill the sensor, so that the entire PSF is captured in each image (see Supplementary Fig. S2).

C. Inverse Algorithm
Our inverse problem is extremely large in scale, with millions of inputs and outputs. Even with the convolution model described above, using projected gradient techniques is extremely slow due to the time required to compute the proximal operator of 3D total variation [29]. To alleviate this, we use the Alternating Direction Method of Multipliers (ADMM) [30]. We derive a variable splitting that leverages the specific structure of our problem.
Our algorithm uses the fact that Ψ can be written as a circular convolution for both the 3D TV and native sparsity cases. Additionally, we factor the forward model in Eq. (4) into a diagonal component, D, and a 3D convolution matrix, M, such that H = DM (details in Sec. ). Thus, both the forward operator and the regularizer can be computed in 3D Fouier space. This enables us to use variable-splitting [31][32][33] to formulate the constrained counterpart of Eq. (2): one image for calibration and scale it to predict PSFs at different depths, we find that there are subtle changes in the caustic structure with depth. Hence, we obtain better results when we calibrate with PSFs measured at each depth.
where v, u, and w are auxiliary variables. We solve Eq. (5) by following the commonly-used augmented Lagrangian arguments [34]. Using ADMM, this results in the following scheme at iteration k: Note that T ν is a vectorial soft-thresholding operator with a threshold value of ν [35], and ξ, η and ρ are the Lagrange multipliers associated with v, u, and w, respectively. The scalars µ 1 , µ 2 and µ 3 are penalty parameters which we compute automatically using the tuning strategy in [30].
Although our algorithm involves two large-scale matrix inversions, both can be computed efficiently and in closed form. Since D is diagonal, (D D + µ 1 I) is itself diagonal, requiring complexity O(n) to invert using point-wise multiplication. Additionally, all three matrices in (µ 1 M M + µ 2 Ψ Ψ + µ 3 I) are diagonalized by the 3D discrete Fourier transform (DFT) matrix, so inversion of the entire term can be done using point-wise division in 3D frequency space. Therefore, its inversion has good computational complexity, O(n 3 log n), since it is dominated by two 3D FFTs being applied to n 3 total voxels. We parallelize our algorithm on the CPU using C++ and Halide [36], a high performance programming language specialized for image processing (A comparison of regularizers and runtimes is shown in Supplementary Sec. 2c).

SYSTEM ANALYSIS
Unlike traditional cameras, the performance of computational cameras depends on properties of the scene being imaged (e.g. the number of sources). As a consequence, standard two-point resolution metrics may be misleading, as they do not predict resolving power for more complex objects. To address this, we propose a new local condition number metric that better predicts performance. We analyze resolution, FoV and the validity of the convolution model, then combine these analyses to determine the appropriate sampling grid for faithfully reconstructing realworld objects.

A. Field-of-View
At every depth in the volume, the angular half-FoV is determined by the most extreme lateral position that contributes to the measurement. There are two possible limiting factors. The first is the geometric angular cutoff, α, set by the aperture size, w, the sensor size, l, and the distance from the diffuser to the sensor, d (see Fig. 3a). Since the diffuser bends light, we also take into account the diffuser's maximum deflection angle, β. This gives a geometric angular half-FoV at every depth of l + w = 2d tan(α − β). The second limiting factor is the angular response of the sensor pixels. Real-world sensor pixels may not accept light at the high angles of incidence that our lensless camera accepts, so the sensor angular response (shown in Fig. 3b) may limit the FoV. Defining the angular cutoff of the sensor, α c , as the angle at which the camera response falls to 20% of its on-axis value, we can write the overall FoV equation as: Since we image in 3D, we must also consider the axial FoV. In practice, the axial FoV is limited by the range of calibrated depths. However, the system geometry creates bounds on possible calibration locations. Calibration cannot be arbitrarily close to the sensor since the caustics would exceed the sensor size. To account for this, we impose a minimum object distance such that an on-axis point source creates caustics that fill the sensor. Theoretically, our system can capture sources infinitely far away, but the axial resolution degrades with depth. The hyperfocal plane represents the axial distance after which no more axial resolution is available, establishing an upper bound. Objects beyond the hyperfocal focal plane have no depth information, but can be reconstructed to create 2D images for photographic applications [37], without any hardware modifications.
In our prototype, the axial FoV ranges from the minimum calibration distance (7.3 mm) to the hyperfocal plane (2.3 m). The angular FoV is limited by the pixel angular acceptance (α c = 41.5 • in x, α c = 30 • in y). Combined with our diffuser's maximum deflection angle (β = 0.5 • ) this yields an angular FoV of ±42 • in x and ±30.5 • in y. We validate the FoV experimentally by capturing a scene at optical infinity and measuring the angular extent of the result (see Supplementary Fig. S3).

B. Resolution
Investigating optical resolution is critical for both quantifying system performance and choosing our reconstruction grid. Although the raw data is collected on a fixed sensor grid, we can choose the non-uniform 3D reconstruction grid arbitrarily. The choice of reconstruction grid is important. When the grid is chosen with voxels that are too large, resolution is lost, and when they are too small, extra computation is performed without resolution gain. In this section we explain how to choose the grid of voxels for our reconstructions, with the aim of Nyquist sampling the two-point optical resolution limit.

B.1. Two-point resolution
A common metric for resolution analysis in traditional cameras is two-point distinguishablity. We measure our system's twopoint resolution by imaging scenes containing two point sources at different separation distances, built by summing together images of a single point source (1µm pinhole, wavelength 532nm) at two different locations.
We reconstruct the scene using our algorithm, with λ = 0 to remove the influence of the regularizer. To ensure best-case resolution, we use the full 5 MP sensor data (no 2 × 2 binning). The point sources are considered distinguishable if the reconstruction has a dip of at least 20% between the sources, as in the Rayleigh criterion. Figure 3c shows reconstructions with point sources separated both laterally and axially.
Our system has highly non-isotropic resolution (Fig. 3d), but we can use our model to predict the two-point distinguishability over the entire volume from localized experiments. Due to the shift invariance assumption, the lateral resolution is constant within a single depth plane and the paraxial magnification causes the lateral resolution to vary linearly with depth. For axial resolution, the main difference between two point sources is the size of their PSF supports. We find pairs of depths such that the difference in their support widths is constant: Here, z 1 and z 2 are neighboring depths and c is a constant determined experimentally. Based on this model, we set the voxel spacing in our grid to Nyquist sample the 3D two-point resolution. Figure 3d shows a to-scale map of the resulting voxel grid. Axial resolution degrades with distance until it reaches the hyperfocal plane (∼2.3 m from the camera), beyond which no depth information is recoverable. Due to the non-telecentric nature of the system, the voxel sizes are a function of depth, with the densest sampling occurring close to the camera. Objects within 5 cm of the camera can be reconstructed with somewhat isotropic resolution; this is where we place objects in practice.

B.2. Multi-point resolution
In a traditional camera, resolution is a function of the system and is independent of the scene. In contrast, computational cameras that use nonlinear reconstruction algorithms may incur degradation of the effective resolution as the scene complexity increases. To demonstrate this in our system, we consider a more complex scene consisting of 16 point sources. Figure 4 shows experiments using 16 point sources arranged in a 4×4 grid in the (x, z) plane at two different spacings. The first spacing is set to match the measured two-point resolution limit (∆x=45µm, ∆z=336µm). Despite being able to separate two points at this spacing, we cannot resolve all 16 sources. However, if we increase the source separation to (∆x=75µm, ∆z=448µm), all 16 points are distinguishable (Fig. 4d). In this example, the usable lateral resolution of the system degrades by approximately 1.7× due to the increased scene complexity. As we show in Section 3C, the resolution loss does not become arbitrarily worse as the scene complexity increases.
This experiment demonstrates that existing resolution metrics cannot be blindly used to determine performance of computational cameras like ours. How can we then analyze resolution if it depends on object properties? In the next section, we introduce a general theoretical framework for assessing resolution in computational cameras.

C. Local condition number theory
Our goal is to provide new theory that describes how the effective reconstruction resolution of computational cameras changes with object complexity. To do so, we introduce a numerical analysis of our forward model to determine how well it can be inverted.
First, note that recovering the image v from the measurement b = Hv comprises simultaneous estimation of the locations of all nonzeros within our image reconstruction, v, as well as the values at each nonzero location. To simplify the problem, suppose an oracle tells us the exact locations of every source within the 3D scene. This corresponds to knowing a priori the support of v, so we then need only determine the values of the nonzero elements in v. This can be accomplished by solving a least squares problem using a sub-matrix consisting of only the columns of H that correspond to the indices of the nonzero voxels. If this problem fails, then the more difficult problem of simultaneously determining the nonzero locations and their values will certainly fail.
In practice, the measurement is corrupted by noise. The maximal effect this noise can have on the least-squares estimate of the nonzero values is determined by the condition number of the sub-matrix described above. We therefore say that the reconstruction problem is ill-posed if any sub-matrices of H are very ill-conditioned. In practice, ill-conditioned matrices result in increased noise sensitivity and longer reconstruction times, as more iterations are needed to converge to a solution.
The worst-case scenario in our system is when multiple sources are in a contiguous block, since nearby measurements are always most similar. Therefore, we compute the condition number of sub-matrices of H corresponding to a group of point sources with separation varying by integer numbers of voxels. We repeat this calculation for different numbers of sources. The results are shown in Fig. 5. As expected, the conditioning is worse when sources are closer together. In this case, increased Our local condition number theory shows how resolution varies with object complexity. (a) Virtual point sources are simulated on a fixed grid and moved by integer numbers of voxels to change the separation distance. (b) Local condition numbers are plotted for sub-matrices corresponding to grids of neighboring point sources with varying separation (at depth 20 mm from the sensor). As the number of sources increases, the condition number approaches a limit, indicating that resolution for complex objects can be approximated by a limited number (but more than two) sources.
noise sensitivity means that even small amounts of noise could prevent us from resolving the sources. This trend matches what we saw experimentally in Figs. 3 and 4. Figure 5 shows that the local condition number increases with the number of sources in the scene, as expected. This means that resolution will degrade as more and more sources are added. We can see in Fig. 5 that, as the number of sources increases linearly, the conditioning approaches a limiting case. Hence, the resolution does not become arbitrarily worse with increased number of sources. Therefore we can estimate the system resolution for complex objects from distiguishability measurements of a limited number of point sources. This is experimentally validated in Section 4, where we find that the experimental 16point resolution is a good predictor of the resolution for a USAF target.
Unlike traditional resolution metrics, our new local condition number theory explains the resolution loss we observe experimentally. We believe that it is sufficiently general to be applicable to other computational cameras, which likely exhibit similar performance loss.

D. Validity of the Convolution Model
In Section 2B, we modeled the caustic pattern as shift invariant at every depth, leading to simple calibration and efficient computation. Since our convolution model is an approximation, we wish to quantify its applicability. Figure 6a-c shows registered close-ups of experimentally measured PSFs from plane waves incident at 0 • , 15 • and 30 • . The convolution model assumes that these registered PSFs are all exactly the same, though, in reality, they have subtle differences. To quantify the similarity across the FoV, we plot the inner product between each off-axis PSF and the on-axis PSF (see Fig. 6d). The inner product is greater than 75% across the entire FoV and particularly good within ±15 • of the optical axis, indicating that the convolution model holds relatively well.
To investigate how the spatial variance of the PSF impacts system performance, we use the peak width of the cross-correlation between the on-axis and off-axis PSFs to approximate the spot size off-axis. Figure 6e (solid) shows that we retain the on-axis resolution up to ±15 • . Beyond that, the resolution gradually degrades. To avoid model mismatch, one could replace the convolution model with exhaustive calibration over all positions in the FoV. This procedure would yield higher resolution at the edges of the FoV, as shown by the dashed line in Fig. 6e. The gap between these lines is what we sacrifice in resolution by using the convolution model. However, in return, we gain simplified calibration and efficient computation, which makes the large-scale problem feasible.

EXPERIMENTAL RESULTS
Images of two objects are presented in Fig. 7. Both were illuminated using white LEDs and reconstructed with a 3D TV regularizer. We choose a reconstruction grid that approximately Nyquist samples the two-point resolution (by 2 × 2 binning the sensor pixels to yield a 1.3 megapixel measurement). Calibration images are taken at 128 different z-planes, ranging from z = 10.86mm to z = 36.26mm (from the diffuser), with spacing set according to conditions outlined in Sec. 3B. The 3D images are reconstructed on a 2048×2048×128 grid, but the angular FoV restricts the usable portion of this grid to the center 100 million voxels. Note that the resolvable feature size on this reconstruction grid can still vary based on object complexity. The first object is a negative USAF 1951 fluorescence test target, tilted 45 • about the y-axis (Fig. 7a). Slices of the reconstructed volume at different z planes are shown to highlight the system's depth sectioning capabilities. As described in Sec. 3B, the spatial scale changes with depth. Analyzing the resolution in the vertical direction (Fig. 7a inset), we can easily resolve group 2 element 4 and barely resolve group 2 element 5 at z = 24  Fig. 7. Experimental 3D reconstructions. (a) Tilted resolution target, which was reconstructed on a 4.2 MP lateral grid with 128 z-planes and cropped to 640×640×50 voxels. The large panel shows the max projection over z. Note that the spatial scale is not isotropic. Inset is a magnification of group 2 with an intensity cutline, showing that we resolve element 5 at a distance of 24 mm, which corresponds to a feature size of 79 µm (approximately twice the lateral voxel size of 35µm at this depth). The degraded resolution matches our 16-point distinguishability (75 µm at 20 mm depth). Lower panels show depth slices from the recovered volume. (b) Reconstruction of a small plant, cropped to 480×320×128 voxels, rendered from multiple angles.
mm. This corresponds to resolving features 79µm apart on the resolution target. This resolution is significantly worse than the two-point resolution at this depth (50µm), but similar to the 16-point resolution (75µm). Hence, we reinforce our claim that two-point resolution is a misleading metric for computational cameras, but multi-point distinguishability can be extended to more complex objects.
Finally, we demonstrate the ability of DiffuserCam to image natural objects by reconstructing a small plant (Fig. 7b). Multiple angles are rendered to demonstrate the ability to capture the 3D structure of the leaves.

CONCLUSION
We demonstrated a simple optical system, with only a diffuser in front of a sensor, that is capable of single-shot 3D imaging. The diffuser encodes the 3D location of point sources in caustic patterns, which allow us to apply compressed sensing to reconstruct more voxels than we have measurements. By using a convolution model that assumes that the caustic pattern is shift invariant at every depth, we developed an efficient ADMM algorithm for image recovery and simple calibration scheme. We characterized the FoV and two-point resolution of our system, and showed how resolution varies with object complexity. This motivated the introduction of a new condition number analysis, which we used to analyze how inverse problem conditioning changes with object complexity.