End-to-end metasurface inverse design for single-shot multi-channel imaging

We introduce end-to-end metaoptics inverse design for multi-channel imaging: reconstruction of depth, spectral and polarization channels from a single-shot monochrome image. The proposed technique integrates a single-layer metasurface frontend with an efficient Tikhonov reconstruction backend, without any additional optics except a grayscale sensor. Our method yields multi-channel imaging by spontaneous demultiplexing: the metaoptics front-end separates different channels into distinct spatial domains whose locations on the sensor are optimally discovered by the inverse-design algorithm. We present large-area metasurface designs, compatible with standard lithography, for multi-spectral imaging, depth-spectral imaging, and ``all-in-one'' spectro-polarimetric-depth imaging with robust reconstruction performance ($\lesssim 10\%$ error with 1\% detector noise). In contrast to neural networks, our framework is physically interpretable and does not require large training sets. It can be used to reconstruct arbitrary three-dimensional scenes with full multi-wavelength spectra and polarization textures.


Introduction
Metasurfaces have been heralded as a revolutionary platform for realizing complex functionalities and compact form factors inaccessible to conventional refractive or diffractive optics [1][2][3][4]. Meanwhile, an emerging "end-to-end" paradigm in computational imaging, in which an optical frontend is optimized in conjunction with an image-processing backend, has received increasing attention due to successful applications in diffractive optics [5,6]. More recently, the end-to-end paradigm has been introduced to full-wave vectorial nanophotonic and metasurface frontends [7][8][9], demonstrating an enhanced capability for physical data acquisition and manipulation. So far, these early endeavors have been limited to two-dimensional (2D) RGB imaging or classification problems. In this paper, we present end-to-end metaoptics inverse design for single-shot multi-channel imaging beyond 2D RGB information: reconstruction of several depth, spectral and polarization channels simultaneously from a single monochrome image (Section 2). As a key result, we show that, even though demultiplexing is not a designated/prescribed objective, inverse design automatically leads to spatial demultiplexing of the multiple channels into spontaneous domains-distinct regions in the detected image for different channels, whose locations are not pre-designated but are optimally discovered during the course of optimization (Sections 3 and 4). In contrast to data-driven approaches such as neural networks [6], our framework is physically interpretable, does not overfit despite a small generic training set, and is fully validated against ground truths vastly different from those of the training set. Specifically, we present metasurface designs for 16-color imagers with 5-12% reconstruction error (under 1% image noise), a 4-color/4-depth imager with 5% error, and a 2-color/2-depth/4-polarization imager with 2% error (Section 3 and Appendix B). All the presented designs take into account fabrication constraints and are compatible with large-scale metasurface lithography [10]. In practice, our method only requires a single calibration step (via measurement or calculation of the point spread function) and is amenable to arbitrary material platforms and differentiable reconstruction algorithms. Our results highlight the power of full-wave optics design with subwavelength components, whereas scalar diffractive optics could struggle to distinguish different wavelengths and polarizations due to limited dispersion and polarization sensitivity [4].
A major aspiration of metasurface technology has been to realize aberrationfree focusing via an ultra-thin interface, directly replacing traditional bulky lenses [2]. While there has been significant progress towards this goal [3], most metalenses suffer from fundamental space-bandwidth limits on wave focusing [11]. Although nanophotonic inverse design has introduced several innovations to metaoptics architectures [12][13][14][15][16][17], further disruptive improvements await the advent of mature three-dimensional (3D) nanofabrication [18][19][20]. In contrast, recent studies in end-to-end inverse design [7][8][9] have unveiled "computationally aware" nanostructures that bear little semblance to a lens and offer capabilities beyond optics-only or computation-only designs. On the other hand, several computational techniques have been developed for retrieving depth, spec-tral and polarization information from a scene [21][22][23][24][25][26][27][28]. Such techniques operate by combining multiple bulky refractive, diffractive and/or absorptive elements, often involve time-domain multiplexing (for example, scanning a scene to accumulate different shots), and typically enable the reconstruction of a single additional dimension (e.g. depth, color, or polarization). A universal framework is still lacking, by which a single-piece nanophotonic structure can be optimally designed to extract any and all channels simultaneously from a single filter-free monochrome exposure. Our proposed end-to-end framework enables inverse design of an ultra-thin single-layer metasurface in conjunction with a simple Tikhonov-regularized reconstruction algorithm. In particular, the Tikhonov regularization is agnostic to the nature of the information channels under consideration and is thus capable of extracting any and all channels (whether they be depth, spectral, polarization, or any combination thereof).

Image formation model
In conventional imaging, the optical frontend is usually modeled by an elementary phase-shift function e i2πh(x,y)/λ , where λ is the free-space wavelength and h(x, y) the surface profile of the diffractive optical element [29]. In nanophotonics and metaoptics, involving sub-wavelength scatterers, more detailed electromagnetic simulations are required, which must take into account richer wave effects such as multiple scattering [1, 14-16, 30, 31]. In this work, we use a Chebyshev-interpolated surrogate model (T) under a locally periodic approximation (LPA) to efficiently simulate the transmitted electric field through a large-area metasurface [16]. Specifically, a metasurface is defined by a vector g characterizing the geometry of meta-atoms (such as width, height and orientation of nanopillars) while the surrogate model maps each parameter g in a periodic unit cell to complex transmission coefficients. The transmitted electric field is then given by E transmitted = T(g) · E incident .
In general, any ground-truth object u can be numerically discretized into a tensor of five dimensions including three-dimensional real space as well as color and polarization dimensions. For convenience, we denote u as a set of 2D (x, y) intensity arrays: u ≡ {u z,λ,p }, where each 2D array u is indexed by depth (z), wavelength (λ) and polarization (p) channels (see Fig. 1a,b). Such a "multi-channel" representation is naturally made for a multi-channel imageformation model, in which a single 2D monochrome image v is formed by the sum of convolutions of the object channels with the corresponding point spread Figure 1: (a,b) A multi-channel ground truth object consists of depth, spectral and polarization channels and can be represented by a set of two-dimensional (2D) intensity arrays indexed by (z, λ, p): a three-dimensional (3D) object can be naturally sectioned into a collection of 2D "depth" slices; each depth slice can be further decomposed into different "color" slices; each color slice is, in turn, decomposed into different "polarization" slices. (c) End-to-end inverse design: a metasurface frontend is optimized in conjunction with a computational reconstruction backend to minimize the reconstruction error evaluated at the end of the full pipeline.
where η is a generic noise term (typically modeled by zero-mean Gaussian white noise with standard deviation σ: η ∼ N (0, σ 2 ) [5]). Note that in our model, we assume shift-invariant PSFs (valid in the paraxial regime) and only consider object intensities under incoherent illumination [29]. The PSFs are computed from the surrogate model followed by near-to-far-field propagation, given a specific metasurface geometry g. While there is no limit to the number of depths (n z ) or wavelength channels (n λ ), four polarization channels are sufficient to reconstruct the full Stokes vector [32] (n p ≤ 4). Those components can be understood as follows: x-polarized intensity channel (p x ), y-polarized intensity channel (p y ), the real part of the correlation between x and y polarizations (p R xy ) and the imaginary part (p I xy ).

Inverse scattering and end-to-end optimization
Given the multi-channel image formation model, the corresponding reconstruction problem (also called inverse scattering problem) is posed as: The reconstructed object, denoted byû = {û z,λ,p }, is the solution that minimizes the problem (2). Here, a regularization term R(·) is usually needed to make the inverse problem well-posed and well-conditioned as well as to impose any prior information such as sparsity or smoothness. A simplest choice (with minimal prior information) is the so-called Tikhonov regularization or L 2 norm [33] where R(·) = α · 2 , leading to: where, for convenience, the convolutions have been recast into a matrix notation, Typically, the matrix G is large and dense, easily reaching over 10 5 × 10 5 in dimension. We use matrix-free FFT-based convolutions [29,34] in both forward and inverse scattering models to efficiently compute the action of G or G T on arbitrary vectors without storing G explicitly. In particular, in Eq. 3,û can be obtained by the iterative conjugate-gradient method [35], within ∼ 100 iterations, instead of directly computing a matrix inverse. Our end-to-end inverse design considers the entire pipeline (see Fig. 1c) and can be formulated as minimizing the average reconstruction error: Here, · u,η denotes averaging over training data as well as image noise; the training dataset consists of a few randomly-generated ground truths (e.g., random patterns drawn from a uniform distribution). FF denotes the near-to-far-field propagation to the detector plane-a convolution of the transmitted electric fields with the free-space Green's function [29]. In our end-to-end framework, the gradients are back-propagated through the entire pipeline all the way to the metasurface parameters, and are efficiently handled by an in-house implementation of the adjoint method [35] (see Appendix A) instead of solely relying on popular automatic differentiation libraries [36] which perform poorly for differentiating through iterative algorithms such as the conjugate-gradient method. The reconstruction accuracy of an optimized design is validated over vastly different ground truths (distinct from training objects).

Results
We now show how our framework can be utilized to inverse-design metaoptics with multi-channel reconstruction capability (depth, spectral, and polarization). We denote the dimensions of a ground truth object as n ch × m × m-a set of n ch arrays each with m × m pixels (note that n ch = n z n λ n p ), while the monochrome image is a single 2D array of n × n pixels. In this work, we choose n 2 ≥ n ch m 2 , that is, there are at least as many image pixels as the total size of the objectan over-determined inverse problem, suitable for Tikhonov regularization which harbors minimal assumptions about the nature of the object. First, we design a 16-color metasurface imager made up of 600 nm-tall TiO 2 pillars on silica (Fig. 2a,b)-a design platform compatible with large-area lithographic fabrication as recently demonstrated in millimeter-scale achromatic metasurfaces [10]. A Chebyshev-interpolated surrogate model maps the width of each pillar inside a unit cell (465 nm period) to transmission coefficients at 16 different wavelengths across the visible spectrum (450 − 660 nm). The singleshot monochrome image shows spatial demultiplexing of the wavelength channels (Fig. 2c). Interestingly, the imager does not solely rely on the demultiplexing effect; for example, there is channel replication, e.g. λ 2 (460 nm) channel, and a small degree of hybridization, e.g. between λ 1 (450 nm) and λ 2 (460 nm) channels. While the human eye is not equipped to recover all the information encoded The wavelengths are spatially demultiplexed onto distinct domains on the single-shot monochrome image, captured 1 mm away from the metasurface by a CCD array of 400 × 400 pixels (each pixel has an area of 1.4×1.4 µm 2 ). (d) Reconstruction of a synthetic ground truth-a multi-spectral picture of letters 'a' to 'p', situated 2 cm away from the metasurface and each letter emitting a different wavelength. Note that the letters in the ground truth cannot be distinguished by the naked eye (inset on the left of the ground truth row). Computationally, the ground truth is represented by a set of 16 intensity arrays, each of which is a 50 × 50-pixel image of a letter (with 25 µm resolution). The ground truth and the reconstruction are color-coded for a visual interpretation of the wavelengths. The reconstruction error is 12.5% under 1% image noise.
in hybridization and redundancy, these apparent "imperfections" do not necessarily represent information loss. In particular, the apparent mixing between different channels does not preclude information recovery since the channels can be readily reconstructed by computation (as long as the corresponding PSFs are distinctly non-degenerate), leading to a reconstruction error of 12.5% under 1% Gaussian image noise (Fig. 2e). We note that a signal-to-noise ratio of ∼ 100 (1% noise) can be readily achieved by modern electronic sensors [5]. Furthermore, the apparent residual fine-grain noise in the reconstruction of non-random objects can be easily removed by simple de-noising routines.
In this example, we considered a simple geometry (a square pillar) suited for photo-lithographic mass production; utilizing a more complex geometry, such as a holey pillar, allowing for more degrees of freedom to manipulate incident wavefronts, leads to even better performance (5% error with 1% noise, see Appendix B). Our methods are also amenable to inverse-design techniques allowing for freeform geometries, involving domain decomposition methods with larger unit cells and full topology optimization [12,31]. We emphasize that our framework does not seek a "heavily-processed imitation" of the ground truth; it looks for a faithful reconstruction which is stable under moderate noise, and should be applicable for imaging any object, including random ones (see Appendix B). If desired, additional processing may be used, such as convolutional neural networks, which can be trained to "interpolate" a particular distribution of objects, enhance the reconstruction of that class of objects, perform image segmentation, or classification on the reconstructed objects.
Apart from spectral imagers, our framework is powerful in that it is straightforward to extract any and all kinds of channels. For example, we design a depth-spectral imager (Fig. 3) that can reconstruct 4 depth channels × 4 wavelength channels. Additionally, as a proof of concept, we also design an "all-in-one" imager ( Fig. 4) that can reconstruct 2 depth channels × 2 wavelength channels × 4 polarization channels. In that case, spontaneous spatial demultiplexing discovered via inverse design is observed for channels that are a combination of a given depth and polarization (i.e. channels sharing the same depth but having different polarizations are also demultiplexed, and vice-versa). On the other hand, a greater degree of hybridization is seen to arise in between the depth channels. This originates from the limited geometric control of the local metasurface design we have chosen, which cannot provide sufficiently strong spatial dispersion to fully separate the depth channels. In future works, we will engineer larger unit cells, higher diffraction orders, and cascaded metamaterials to induce strongly non-local, spatially-dispersive effects [14,18].

Discussion and Outlook
The central result of this work is the realization of metaimaging based on spontaneous demultiplexing of multi-channel information into distinct spatial domains, whose locations appear irregular but are optimally determined by end-to-end inverse design. This is in contrast to the situation where such domains would be The monochrome image has n × n pixels (n = 400) and is corrupted by 1% noise, leading to (c) a reconstruction error of 5.3%. Note that z i ∈ {2, 4, 6, 8} cm, λ j ∈ {470, 520, 582, 660} nm. The monochrome image has n × n pixels (n = 400) and is corrupted by 1% noise, leading to (c) a reconstruction error of 2%. Note that z i ∈ {1.7, 3.4} cm, λ j ∈ {532, 488} nm.
dictated by a user, as would be the case for conventional optics-only designs such as color splitters [11], or even hybrid systems [28]. The end-to-end automated discovery naturally leads to an optimal demultiplexing scheme with minimal hybridization between channels; in contrast, a human-designated scheme, such as a regular lattice of focal spots, may not be permitted by the available degrees of freedom in the metasurface, resulting in noisy crosstalk, sub-optimal PSFs, and poor resolution. Intuitively, the demultiplexing effect is enabled by the Tikhonov reconstruction backend, which does not attempt to learn from the specific training data, but "judiciously" nudges the optical frontend to separate the incoming channels in order to reduce the reconstruction error. Therefore, our method is physically interpretable and data-efficient. It only requires a small training set, for example, as few as 30 training data drawn from a uniform distribution in the case of the 16-color imager (Fig. 2). At the same time, the final optimized designs achieve robust reconstruction performance with consistent accuracy over vastly different sets of ground truths (whether they are pictures like letters or patterns like random dots, see Appendix B). This is in contrast to recent works [6] using phase masks and neural networks, which require large, diverse and carefully curated training sets and do not lead to spatial demultiplexing. One conceivable limitation of our multi-channel imagers is that the transverse dimensions of the object must be significantly smaller than those of the detector; therefore, the device is not suitable for reconstructing the entire natural field of view corresponding to the size of the detector. In practice, a narrower operational field of view may be realized by an appropriate aperture, a directed flash, or by selective illumination (a common technique in microscopy) [37,38]. The field of view can be enlarged by designing larger-area metasurfaces or by taking into account out-of-field-of-view light in the image-formation model. On the other hand, the inverse problem is under-determined if we choose the same transverse dimensions for the object and the detector. Such a problem requires additional priors on the object, and a regularization scheme like Tikhonov may not be sufficient. One powerful prior in image processing is sparsity, and a theoretically rigorous technique for reconstructing sparse objects is called compressed sensing [39]. In another manuscript under preparation, we will present a fully end-to-end inverse design framework with a compressed-sensing backend. Ultimately, future backends may be realized by new architectures that combine classical algorithms (such as Tikhonov and CS, which are theoretically rigorous and physically interpretable) and deep neural networks (which are best suited for learning deep data priors). Moreover, the performance of ultra-compact nanophotonic devices (such as depth and spectral sensitivities) can be further enhanced if we transcend the limitations imposed by LPA and expand the available degrees of freedom to encompass the full Maxwell physics. In future work, we hope to explore end-to-end inverse design using more sophisticated domain decomposition methods [31] and scattering-matrix formulations [40], or by cascading non-local metamaterials and 3D photonic crystals with local metasurfaces. The trickier issue is to find ∂G T G ∂g . If one carries out the algebra faithfully, one may find that the inner product sandwiching the tricky derivative boils down to a cross-correlation between Λ and G (u −û). However, we can exploit the autograd automatic-differentiation (AD) package [36] in Python to compute this derivative effortlessly as follows (pseudo-code): def innerdv(x,a,b): def aGTGb(x): ... # compute and return aGTGb given design parameters x ... # by ''autograd-able'' convolutions g = autograd.grad(aGTGb) return g(x) The desired product Λ · ∂G T G ∂g (u −û) is then simply given by innerdv(g,Lambda,u-uhat).