MIT Open Access Articles High-performance hybrid time/frequency-domain topology optimization for large-scale photonics inverse design

Abstract: We present a photonics topology optimization (TO) package capable of addressing a wide range of practical photonics design problems, incorporating robustness and manufacturing constraints, which can scale to large devices and massive parallelism. We employ a hybrid algorithm that builds on a mature time-domain (FDTD) package, Meep, to simultaneously solve multiple frequency-domain TO problems over a broad bandwidth. This time/frequency-domain approach is enhanced by new filter-design sources for the gradient calculation and new material-interpolation methods for optimizing dispersive media, as well as by multiple forms of computational parallelism. The package is available as free/open-source software with extensive tutorials and multi-platform support.


Introduction
Inverse design is a powerful class of methodologies for photonics design in which large-scale optimization is applied to maximize performance over thousands or even millions of degrees of freedom (DOF), often yielding non-intuitive geometries with unprecedented performance [1]. A particularly flexible form of inverse design is density-based topology optimization (TO), which allows each "pixel" to continuously evolve between two (or more) materials [2][3][4][5][6][7][8][9][10]. The full potential of TO is only unlocked, however, when it is applied to larger and more complex devices integrating multiple simultaneous functionalities, such as broad-bandwidth and/or large-area semiconductor-foundry devices [11][12][13][14], that are beyond the reach of intuitive design. Hence, there is a need for algorithms and software tools that expose the full flexibility of TO to scalable parallel-computing platforms while retaining a high-level user interface.
We present a free/open-source photonics TO package that addresses this need, consolidating many TO features individually explored throughout the literature, with enhancements that allow our methodology to address a wide range of practical photonics design problems and scale to arbitrarily large devices. Our approach employs a hybrid time/frequency-domain adjoint-variable method [15, Sec. 8.7] to efficiently and simultaneously compute gradients at multiple frequencies for any number of figures of merit (FOM) and/or design fields. We solve the forward and adjoint problems in the time domain using Meep, a popular, mature, and multi-featured free/open-source FDTD package that supports distributed-memory parallelism [16], combined with automatic differentiation (AD) [17][18][19] of arbitrary user-defined photonics-optimization objectives in a flexible Python interface. Since storage requirements for pure time-domain adjoint methods typically do not scale well with problem size [20], we leverage a hybrid adjoint formulation [21][22][23][24] that computes the forward and adjoint fields in the frequency domain, over an arbitrary bandwidth, using a discrete-time Fourier transform (DTFT). This methodology merges the broad-bandwidth parallelism, reliability, and speed of time-domain solvers with the flexibility and simplicity of frequency-domain adjoint methods.
Our work brings together several different features from other photonics inverse-design packages including techniques for reducing the computational cost of large design regions (Sec. 4.2), arbitrary geometric parameterizations (Sec. 3), and a broad range of material definitions including dispersion and anisotropy (Sec. 2.1). For example, we formulate the adjoint problem for near-to-far field transformations, mirror and rotational symmetries, as well as cylindrical coordinates, combining them within a single unified software package. Optimization figures of merit (objective/loss functions) are easy to define as their gradient is computed using AD (Sec. 4). This same pipeline allows for arbitrary sequences of filtering and projection steps, which then map the design parameters to multiple, distinct "design regions" (e.g. for different foundry-fabrication layers). Similarly, common design constraints like minimum area, enclosed area, linewidth, linespacing, and curvature are all supported (Sec. 3.1) [25], while support for connectivity constraints [26] is forthcoming.
In addition, we present novel signal-processing and optimization techniques to consolidate many of the above features within a single environment. For example, broadband adjoint sources computed in the frequency domain must be efficiently transformed to an equivalent time-limited/band-limited time-domain source, a task for which a straightforward inverse-Fourier-transform approach is restricted by expensive storage requirements (proportional to the number of timesteps multiplied by the number of spatial points) [24]. To overcome this, we introduce a fitting routine using standard digital signal processing (DSP) window functions [27], yielding low storage requirements without sacrificing bandwidth or figure-of-merit (FOM) complexity (Sec. 5.2), a solution closely related to filter-design methods that fit a desired frequency response to a finite impulse-response (FIR) filter [24,28,29]. Similarly, continuous material interpolation (a fundamental part of density-based TO methods [30,31]) quickly becomes problematic for dispersive and lossy materials (esp. metals) as arbitrary shifting of their responses (e.g. refractive index "zero crossings" [32]) may induce field instabilities or problems with optimization. We present a material-interpolation scheme that adds broadband gradient support for anisotropic and dispersive materials while ensuring simulation stability (Sec. 2.1), inspired by a related technique for frequency-domain TO [32].
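As a rough illustration of this filter-design idea (our own sketch, not the package's implementation; all function names here are ours), one can fit a short, windowed sum of complex exponentials whose DTFT matches a desired adjoint-source spectrum at a discrete set of frequencies:

```python
import numpy as np
from scipy.signal.windows import nuttall


def fit_windowed_source(freqs, desired, dt, ntaps):
    """Fit s(n) = w(n) * sum_j a_j exp(-2*pi*i*f_j*n*dt) so that the DTFT of s,
    evaluated at each frequency in `freqs`, equals `desired`.  Storage is
    O(ntaps) per source, independent of the number of spatial points."""
    n = np.arange(ntaps)
    w = nuttall(ntaps)                   # a standard low-sidelobe DSP window
    phases = 2j * np.pi * np.outer(freqs, n) * dt
    basis = w * np.exp(-phases)          # windowed exponentials, shape (nfreq, ntaps)
    dtft = np.exp(phases) * dt           # DTFT kernel at the target frequencies
    M = dtft @ basis.T                   # cross-talk matrix between basis functions
    a = np.linalg.solve(M, desired)      # amplitudes that reproduce `desired`
    return a @ basis                     # time-domain samples of the fitted source
```

By construction, evaluating the DTFT of the returned samples at `freqs` reproduces `desired` (up to conditioning), while the window keeps the source time-limited with low spectral leakage.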
Finally, our software package combines three individually common (but rarely combined for TO) forms of parallelism that allow for predictive scaling as the problem size increases: (i) spatial parallelism, where the Maxwell solver divides the computational domain among hundreds or even thousands of compute nodes using a static load-balancing algorithm (Sec. 5.1); (ii) frequency parallelism, such that gradients for even complicated objective functions that involve multiple field quantities (e.g. eigenmode overlaps, scattering parameters, Poynting flux, etc.), corresponding to multiple adjoint sources, are simultaneously obtained across multiple frequencies of interest with minimal additional computational cost (Sec. 5.2); and (iii) simulation parallelism, where multiple design fields or objective functions can be simulated in tandem, but their results are globally gathered by the optimization algorithm at each iteration, enabling large-scale robust [33][34][35][36] and multi-objective [37] optimization (Sec. 5.3).
We illustrate our approach using two practical example problems: (i) a 3D silicon-photonics polarization splitter (Sec. 4.1) and (ii) a high numerical-aperture (NA) cylindrical metalens (Sec. 4.2). Some of these devices are large in comparison to the current state of the art, and they involve multiple TO features to demonstrate the methodology's flexibility and diverse range of applications.

Photonics density-based topology optimization
We first review the fundamentals of photonics density-based TO, highlighting important features of our approach (e.g. automatic differentiation and hybrid time/frequency-domain adjoint methods) that enable large-scale inverse design. We then present a novel material-interpolation scheme necessary for broadband TO over dispersive and/or anisotropic materials.
Density-based TO parameterizes the design at each pixel such that each pixel is either "solid" (a high-index material), "void" (a low-index material), or some non-physical interpolation between the two. A sequence of transformations involving filters and projections is used to gradually binarize the design so that it is solid or void almost everywhere [31,38], typically while also imposing various optimization constraints that enforce design rules (e.g. minimum lengthscale, area, and curvature) [25,39,40]. Since this approach allows pixels to adopt a fictitiously interpolated material "density" during the optimization process, the resulting cost function (and its gradient) is smooth and continuously differentiable, an important requirement for many gradient-based optimization algorithms [30]. (In contrast, level-set TO ensures that the material is binarized almost everywhere, but can make differentiation and optimization more difficult, especially for changes in topology [30].) The FOMs used to characterize the design's performance are typically some function of the time- or frequency-dependent fields (Sec. 4), as dictated by the Maxwell equations. To accurately evaluate each FOM, one typically performs a "forward run" using a full-wave Maxwell solver (e.g. FDTD, FDFD, etc.). The gradient of each FOM is then efficiently calculated by performing an "adjoint run" using the same (or slightly modified) Maxwell solver with source terms determined from the forward run; in particular, the "adjoint currents" at a particular frequency are essentially the derivative of the FOM with respect to the electromagnetic fields from the forward run at that frequency [1,23]. This process is repeated as the user gradually tunes a projection hyperparameter to increase binarization (Eq. (4)), yielding performant and oftentimes non-intuitive designs. Fig. 1 illustrates the basic design cycle for photonic TO driven by adjoint-variable methods, along with an example design evolution for a silicon-photonic three-way power splitter.
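The filter-project-evaluate-update cycle can be sketched on a toy problem (our own illustrative code: the Maxwell solves are replaced by a trivial quadratic "FOM" and the AD backpropagation by hand-written chain-rule derivatives):

```python
import numpy as np


def filt(rho, k=5):
    """Simple moving-average filter (stand-in for a conic filter kernel)."""
    return np.convolve(rho, np.ones(k) / k, mode="same")


def project(rho_t, beta, eta=0.5):
    """Modified tanh projection that binarizes the design as beta grows."""
    den = np.tanh(beta * eta) + np.tanh(beta * (1 - eta))
    return (np.tanh(beta * eta) + np.tanh(beta * (rho_t - eta))) / den


def optimize(target, n=64, eta=0.5):
    rho = 0.5 * np.ones(n)                      # start from a uniform gray design
    for beta in (1, 2, 4, 8, 16, 32):           # gradual binarization schedule
        for _ in range(100):
            rho_t = filt(rho)
            rho_bar = project(rho_t, beta, eta)
            # "forward/adjoint runs": FOM = -sum((rho_bar - target)^2)
            dF_drho_bar = -2.0 * (rho_bar - target)
            # backpropagate through projection, then through the filter
            # (a symmetric kernel is self-adjoint up to boundary effects)
            den = np.tanh(beta * eta) + np.tanh(beta * (1 - eta))
            dproj = beta * (1 - np.tanh(beta * (rho_t - eta)) ** 2) / den
            grad = filt(dF_drho_bar * dproj)
            rho = np.clip(rho + (0.5 / beta) * grad, 0.0, 1.0)
    return project(filt(rho), beta=32)
```

Running `optimize` on a binary target pattern returns a nearly binarized design matching the target away from the filtered edges, mirroring the hyperparameter-continuation loop of Fig. 1.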
For example, one could formulate a broadband, multi-objective minimax optimization problem [41], based on performance over discrete frequency points ($\omega_k$) and figures of merit ($f_{j,k}$), with

$$\min_{\boldsymbol{\rho}} \max_{j,k} f_{j,k}(\mathbf{x}_k) \quad \text{subject to} \quad A(\omega_k)\,\mathbf{x}_k = \mathbf{b}(\omega_k), \quad 0 \le \rho \le 1, \quad g_i(\boldsymbol{\rho}) \le 0, \tag{1}$$

where $\mathbf{x}_k$ describes the (discretized) electric and magnetic fields at $\omega_k$, $\boldsymbol{\rho}$ are the design variables (densities) used to parameterize the design at each pixel (constrained pointwise to $0 \le \rho \le 1$), $g_i$ are inequality constraints implementing geometry constraints (Sec. 3.1), $A(\omega)$ is the Maxwell operator in the frequency domain, and $\mathbf{b}$ is a vector describing the forward problem's current sources.
We emphasize that Eq. (1) and $A(\omega)$ are only used conceptually in this presentation: we never form the frequency-domain operator or solve $A\mathbf{x} = \mathbf{b}$ explicitly. Instead, we Fourier transform the time-domain fields of the Meep FDTD simulation in order to obtain the frequency-domain solution indirectly, as reviewed in Appendix A and Ref. 23. By using broadband sources for the forward and adjoint solves, this hybrid adjoint method calculates the gradients at multiple frequency points simultaneously (Sec. 5.2), meaning that each gradient calculation requires only two FDTD simulations for each figure of merit, regardless of the number of design parameters or frequencies of interest.
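The idea of obtaining frequency-domain fields from a time-domain run can be sketched as a running DTFT accumulation (a simplified scalar stand-in of our own devising, not Meep's API):

```python
import numpy as np


def accumulate_dtft(field_at_step, nsteps, dt, omegas):
    """Accumulate F(omega) = sum_n f(n) exp(i*omega*n*dt) * dt on the fly,
    for several frequencies at once, without storing the time history."""
    omegas = np.asarray(omegas)
    F = np.zeros(omegas.shape, dtype=complex)
    for n in range(nsteps):
        F += field_at_step(n) * np.exp(1j * omegas * n * dt) * dt
    return F
```

Each additional frequency costs only one extra accumulator per monitored point, which is why broadband gradients come at minimal additional cost (Sec. 5.2).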
Fig. 1 (caption): The basic design cycle [30] for photonic inverse design via adjoint-variable methods, demonstrated using a three-way silicon-photonic power splitter. First, the design variables are filtered and projected. Second, a forward run is executed to evaluate the objective function. Third, an adjoint run (one for each figure of merit) is executed using adjoint sources determined by the results of the forward run. The gradient of the objective function with respect to the design variables is computed from the Fourier-transformed fields in the design region from both the forward and adjoint runs. The optimization algorithm updates the design variables using the gradient information. A hyperparameter is gradually increased to "binarize" the design as optimization progresses.

The minimax problem of Eq. (1) is nondifferentiable, but it can be recast into a differentiable form using an epigraph formulation [41,42] by introducing a dummy parameter $t$:

$$\min_{\boldsymbol{\rho},\,t}\; t \quad \text{subject to} \quad f_{j,k}(\mathbf{x}_k) - t \le 0 \;\; \forall\, j,k. \tag{2}$$

The process for mapping the raw design parameters ($\boldsymbol{\rho}$) to a near-binary permittivity distribution typically begins with a linear-convolution filter step,

$$\tilde{\boldsymbol{\rho}} = \mathbf{w} * \boldsymbol{\rho}, \tag{3}$$

where $\tilde{\boldsymbol{\rho}}$ is the filtered design field, $\mathbf{w}$ is the filter kernel, and $*$ is a problem-dependent multidimensional convolution [31,43]. Once the design variables are filtered, the resulting field is then projected onto a nearly binary value using a differentiable nonlinear function. One common choice, for example, is the modified hyperbolic tangent (tanh) function [31],

$$\bar{\rho} = \frac{\tanh(\beta\eta) + \tanh\!\big(\beta(\tilde{\rho} - \eta)\big)}{\tanh(\beta\eta) + \tanh\!\big(\beta(1 - \eta)\big)}, \tag{4}$$

where $\beta$ and $\eta$ are thresholding parameters and $\bar{\rho}$ is the projected design field. A unique feature of our software package is the ability to quickly and easily prototype different combinations of filter kernels, projection functions, and their corresponding hyperparameters. This is useful because different filter kernels and projection functions are used to realize certain topological features, such as angled sidewalls [44] or minimum lengthscales [40]. Generally, deciding which kernel or projection function to use is a difficult and time-consuming process, mainly because each new functional form requires a new gradient derivation. Our approach leverages JAX [17], a high-performance AD package that quickly and efficiently computes gradients of arbitrary filters. Since AD implements the chain rule in reverse via vector-Jacobian products, the resulting filter-project backpropagation step remains significantly cheaper than the forward and adjoint Maxwell solves, making it easy to employ multiple filter steps [45]. Enabling multiple cascaded steps is especially important for certain types of robust optimization [46].
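The epigraph trick can be sketched on a toy minimax problem (using SciPy's SLSQP in place of the CCSA/MMA-style algorithms typically used for TO; this is an illustration of the reformulation, not the package's optimizer):

```python
import numpy as np
from scipy.optimize import minimize

# Two competing "FOMs" of a single design variable x: the minimax optimum
# of max(f1, f2) sits at x = 0, where both functions equal 1.
f1 = lambda x: (x - 1.0) ** 2
f2 = lambda x: (x + 1.0) ** 2

# Epigraph form: minimize the dummy variable t subject to f_j(x) - t <= 0.
res = minimize(
    lambda z: z[1],                      # objective: t, where z = (x, t)
    x0=[0.7, 5.0],
    jac=lambda z: np.array([0.0, 1.0]),
    constraints=[
        {"type": "ineq", "fun": lambda z: z[1] - f1(z[0])},
        {"type": "ineq", "fun": lambda z: z[1] - f2(z[0])},
    ],
    method="SLSQP",
)
x_opt, t_opt = res.x
```

At the solution, both inequality constraints are active, which is exactly the worst-case-balancing behavior desired of the minimax objective in Eq. (2).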
While our package already provides many built-in projections (Heaviside, hyperbolic tangent, etc.), kernels (dilation, erosion, etc.), and even nonlinear filters (conic, cylindrical, Gaussian, linear, etc.), it is easy to add new functions to support new geometries and constraints.

Material interpolation
Once the design parameters are projected onto a binary design field, the next challenge is to map these scalar values to physical materials over a broad bandwidth such that the FDTD algorithm remains stable. In the simple case involving dispersionless dielectrics, linear-interpolation schemes tend to work well [47]:

$$\varepsilon(\bar{\rho}) = \varepsilon_{\min} + \bar{\rho}\,(\varepsilon_{\max} - \varepsilon_{\min}), \tag{5}$$

where the relative permittivity varies linearly between a low-index material ($\varepsilon_{\min}$) and a high-index material ($\varepsilon_{\max}$), and $\bar{\rho} \in [0, 1]$ are the projected design parameters [47]. More complicated design problems (e.g. those involving metals, whose relative permittivity is complex and may have a negative real part) require a more sophisticated interpolation scheme. For example, intermediate values of $\bar{\rho}$ can produce zero crossings in the permittivity profile that induce non-physical resonances and inhibit optimization convergence [32]. One possible solution when dealing with single-frequency applications involves nonlinearly interpolating the refractive index $n(\bar{\rho})$ and the extinction coefficient $\kappa(\bar{\rho})$ [32]:

$$\varepsilon(\bar{\rho}) = \big(n(\bar{\rho}) + i\kappa(\bar{\rho})\big)^2. \tag{6}$$

In this case, $n$ and $\kappa$ are both linearly interpolated as in Eq. (5). The final permittivity itself, however, is nonlinearly interpolated, preventing any possible zero crossings for intermediate values of $\bar{\rho}$. While this approach works well for frequency-domain Maxwell solvers, the hybrid time/frequency-domain implementation employed in our work requires a more holistic interpolation scheme that works across the entire band of interest with the same set of design parameters and is numerically compatible with the FDTD algorithm. For example, linear-dispersive materials (including metals) are typically modeled within FDTD codes using Lorentz-Drude and conductivity models such that the position- ($\mathbf{r}$) and frequency- ($\omega$) dependent relative permittivity is described by [48]

$$\varepsilon(\mathbf{r}, \omega) = \varepsilon_\infty(\mathbf{r}) + \frac{i\,\sigma_D(\mathbf{r})}{\omega} + \sum_n \frac{\sigma_n(\mathbf{r})\,\omega_n^2}{\omega_n^2 - \omega^2 - i\,\omega\gamma_n}, \tag{7}$$

where $\varepsilon_\infty$ is the instantaneous permittivity, $\sigma_D$ is the position-dependent electric conductivity, $\sigma_n$ is a position-dependent coupling strength, $\omega_n$ is a harmonic-oscillator frequency, and $\gamma_n$ is a damping coefficient.
Interpolating between two arbitrary material models described by Eq. (7) is more difficult, as one must minimize the number of parameters within the model that are interpolated (to also minimize storage requirements) while simultaneously ensuring simulation stability for each combination of material types.
We propose a new interpolation scheme that only requires interpolation of the conductivity terms ($\sigma_D$ and $\sigma_n$) and the instantaneous permittivity ($\varepsilon_\infty$). First, we linearly interpolate the instantaneous permittivity, the conductivity term, and each Lorentz-Drude component:

$$\varepsilon_\infty(\bar{\rho}) = \varepsilon_{\infty,0} + \bar{\rho}\,(\varepsilon_{\infty,1} - \varepsilon_{\infty,0}), \quad \sigma_D(\bar{\rho}) = \sigma_{D,0} + \bar{\rho}\,(\sigma_{D,1} - \sigma_{D,0}), \quad \varepsilon_{\mathrm{sus}}(\bar{\rho},\omega) = (1-\bar{\rho})\,\varepsilon_{\mathrm{sus},0}(\omega) + \bar{\rho}\,\varepsilon_{\mathrm{sus},1}(\omega), \tag{8}$$

where 0 and 1 subscripts correspond to the first and second materials respectively, $\varepsilon_\infty(\bar{\rho})$ is the new material's instantaneous permittivity, and $\varepsilon_{\mathrm{sus}}(\bar{\rho},\omega)$ describes the new material's dispersive susceptibilities (interpolated through their coupling strengths $\sigma_n$, leaving each $\omega_n$ and $\gamma_n$ untouched). Next, we introduce an artificial damping term, scaled by $\bar{\omega}$ (the mean resonance frequency of all the susceptibilities) and vanishing for binarized designs, which is needed to eliminate spurious bulk resonances from accidental "zero crossings" where $\varepsilon \approx 0$. The final material response is then described by Eq. (7) with these interpolated parameters (and the artificial damping) substituted in. This new material-interpolation scheme requires no modification to existing FDTD codes, provided the conductivity and instantaneous permittivity can be parameterized over the simulation domain (or, in particular, the design regions of interest). Since the FDTD method uses a Yee grid in which different field components are stored on different grids [49], one must account for the interpolation from the design grid onto the different staggered grids, especially in the recombination step of the adjoint method used to compute the gradient of the objective function (Appendix A).
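A minimal sketch of this interpolation (our own illustrative code, following an Eq. (7)-style Lorentz-Drude model; the artificial damping term is omitted for brevity):

```python
import numpy as np


def lorentz_drude_eps(omega, eps_inf, sigma_D, poles):
    """Evaluate eps(omega) = eps_inf + i*sigma_D/omega
    + sum_n sigma_n * omega_n^2 / (omega_n^2 - omega^2 - i*omega*gamma_n)."""
    eps = eps_inf + 1j * sigma_D / omega
    for sigma_n, omega_n, gamma_n in poles:
        eps = eps + sigma_n * omega_n**2 / (omega_n**2 - omega**2 - 1j * omega * gamma_n)
    return eps


def interpolate_material(rho_bar, mat0, mat1):
    """Linearly interpolate eps_inf, sigma_D, and the pole coupling strengths
    sigma_n between two (eps_inf, sigma_D, poles) material models; the pole
    frequencies and damping rates themselves are never interpolated."""
    eps_inf = (1 - rho_bar) * mat0[0] + rho_bar * mat1[0]
    sigma_D = (1 - rho_bar) * mat0[1] + rho_bar * mat1[1]
    poles = [((1 - rho_bar) * s, w0, g) for s, w0, g in mat0[2]] \
          + [(rho_bar * s, w0, g) for s, w0, g in mat1[2]]
    return eps_inf, sigma_D, poles
```

At $\bar{\rho} = 0$ or $1$ this reproduces the endpoint materials exactly, and intermediate values introduce no new resonance frequencies into the FDTD update equations.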

Geometric parameterization and the FDTD Yee grid
In this section, we discuss the geometric-parameterization scheme used to enable density-based photonics TO over the FDTD Yee grid. In contrast to many density-based TO implementations (e.g. involving finite-element meshes) which use the same grid/mesh for both the design variables and the electromagnetic field unknowns [50], our approach allows the user to flexibly define arbitrary "material grids" [51] or design meshes in 1D, 2D, or 3D that have simulation-independent resolutions/orientations and are linearly interpolated onto the Yee grid (e.g. the simulation mesh) [16,48]. This generality, adapted from our Ref. 51, provides three key benefits: (i) a separation between the design and simulation mesh choices, where proper placement (and gradient calculation) with respect to the Yee grid is handled automatically; (ii) implicit dimensionality constraints (e.g. lithographic 2D designs that are projected onto a 3D polygonal prism via a 2D material grid) are supported without any additional optimization constraints; and (iii) complicated symmetries and periodic patterns are easily enforced by transforming and layering multiple material grids [51]. Fig. 2 illustrates the basic functionality of the material grid, along with its corresponding features.
Fig. 2 (caption): Density-based geometric parameterization scheme using material grids. A material grid takes evenly spaced values in 1D, 2D, or 3D and interpolates each grid value onto the FDTD Yee grid (a). The material grid can have any arbitrary resolution or dimensionality, and can project lower-dimensional design problems onto higher-dimensional simulation spaces (e.g. for restricting the DOF to planar lithographic planes or periodic grating structures) (b). Multiple material grids (and even multiple copies of the same material grid) can be rotated (e), mirrored (d), translated, and overlaid such that the mean value of the resulting layer structure produces a net parameterization scheme (c), which can be used to enforce complicated symmetries without optimization constraints.

Since the Yee grid involves staggered placement of a material's permittivity-tensor components and field components, even simple density-based parameterization methods can be difficult to formulate. The initial simulation setup requires the user to properly interpolate the design DOFs onto each field-component location. The adjoint recombination step after the forward and adjoint runs have been executed (Appendix A) must also interpolate the field components from the Yee grid back onto the corresponding design grid using the transpose of the forward pass's coefficients (a process referred to as restriction [16]). Careful interpolation and bookkeeping are therefore required. As such, we generalized the user-specified design grid to accept any arbitrary resolution or dimensionality, such that the design grid correctly interpolates/backpropagates to/from the Yee grid. This is consistent with Meep's overall philosophy of "pervasive interpolation" [16], in which the user is presented with an illusion of continuous spatial coordinates (transparently interpolated to/from the Yee grid internally). This new data structure, which we call a "material grid," fills user-specified geometric shapes, such as blocks, so that the user can apply the same (or different) material grid(s) to multiple discrete geometric objects (covering any portions of the simulation domain). The objects containing each grid can be arbitrarily rotated/flipped/translated. Different (or identical) material grids can occupy multiple geometric objects to represent distinct fabrication layers, design planes, or periodic structures within a supercell. Similarly, the user can adapt a two-dimensional material grid to a three-dimensional geometric primitive (e.g. a rectangular prism): our approach extrudes the design into the third dimension, simplifying the optimization pipeline for integrated photonics and other lithography-based designs.
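The interpolation/restriction pair can be illustrated in 1D: interpolating design values onto a half-shifted (Yee-like) grid, and applying the transpose operation in the backpropagation step (a sketch with our own helper names, not the package's internals):

```python
import numpy as np


def interp_to_staggered(rho):
    """Linearly interpolate design-grid values onto the half-shifted grid."""
    return 0.5 * (rho[:-1] + rho[1:])


def restrict_to_design(g):
    """Transpose of interp_to_staggered: scatter staggered-grid gradient
    contributions back onto the design grid (the 'restriction' step)."""
    out = np.zeros(g.size + 1)
    out[:-1] += 0.5 * g
    out[1:] += 0.5 * g
    return out
```

The adjoint identity `<interp(x), y> = <x, restrict(y)>` is exactly what guarantees correct gradients through the Yee-grid placement.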
Geometric design symmetries are easy to impose by layering multiple material grids on top of one another, because the net structure is obtained by averaging any overlapping grids [51]. For example, by layering a grid with a mirror image of itself, the resulting design must have mirror symmetry. While symmetries could alternatively be enforced as explicit constraints on the design variables (e.g. before the filtering and projection stages) using an AD package, that approach is only straightforward for symmetry operations that map the material grid exactly onto itself. In contrast, the overlapping-grid approach allows arbitrary rotational and translational symmetries, such as threefold symmetries, glide-plane symmetries, and crystallographic symmetry groups [51].
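Overlay-averaging is easy to sketch with plain arrays (our own illustrative helpers, not the package API):

```python
import numpy as np


def mirror_symmetrize(grid, axis=0):
    """Overlay a material grid with its own mirror image; the net structure
    is the mean of the overlapping grids."""
    return 0.5 * (grid + np.flip(grid, axis=axis))


def c4_symmetrize(grid):
    """Overlay four 90-degree-rotated copies of the same square grid."""
    return sum(np.rot90(grid, k) for k in range(4)) / 4.0
```

The resulting structures are exactly invariant under the corresponding symmetry operations, with no constraints imposed on the raw design variables themselves.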
In short, the material-grid approach maximizes flexibility and expressivity, without requiring complicated optimization constraints or non-intuitive design parameterizations. With this increased level of flexibility, however, one must explicitly constrain the resulting topological evolution to comply with fabrication design rules in a manner that is compatible with existing TO algorithms.

Fabrication constraints
In order to constrain the vast design freedom typical of most density-based TO methods, our package readily supports differentiable constraint functions commonly used to prescribe a design's minimum lengthscale [39], minimum area/enclosed area [25], curvature [40], connectivity [3], or any other topological feature. Performing optimization sweeps over different design "rulebooks" is often a simple matter of parameterizing each designated constraint function, as we described in detail in Ref. 25. For example, Fig. 3 compares four broadband, three-way splitters designed to comply with different minimum-lengthscale constraints. Each design performs with no more than 6% insertion loss and ≤ 2% variation in splitting ratio across the band.
Often, constraints are enforced implicitly using density filters [39,40] or even cascades of filters [46]. Many techniques, however, require explicit constraint functions that may work in tandem with the density filters to enforce a particular topological feature. Using the same AD approach as for the density filters and projection steps (Sec. 1), our package supports AD for arbitrary constraint functions. This feature is essential for prototyping and "inventing" new constraint functions and methodologies. Indeed, choosing the proper constraints and filtering steps often seems to "regularize" the optimization problem to avoid nonphysical local minima [52]. When combined with the proper objective functions, the fabrication constraints help ensure the optimizer quickly yields a suitable design after a limited number of iterations.

Fig. 3 (caption): The transmission across all wavelengths is depicted for the top/bottom ports (the same due to the induced symmetry) and the right (output) port. Each device shows good performance, with ≤ 6% insertion loss and ≤ 2% variation in splitting ratio averaged across the entire bandwidth.
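As a simple (non-differentiable) illustration of a minimum-linewidth rule, a morphological opening flags any solid feature narrower than the structuring element; the differentiable constraint functions of Ref. 25 are built on related erosion/dilation ideas:

```python
import numpy as np
from scipy.ndimage import binary_opening


def linewidth_violations(design, min_width):
    """Return a mask of solid pixels belonging to features narrower than
    min_width: opening removes them, so (design AND NOT opened) flags them."""
    opened = binary_opening(design, structure=np.ones(min_width, dtype=bool))
    return design & ~opened
```

A design passes the rule when the violation mask is empty; a sweep over `min_width` values corresponds to a sweep over design "rulebooks."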

Objective functions
In this section, we describe the process for formulating objective functions and highlight some unique features that enable rapid prototyping and sophisticated optimization. Our package supports several "differentiable measurements" (DMs) for which the adjoint source can be automatically generated. These DMs, which are derived from the Fourier-transformed fields, include: (1) eigenmode-decomposition coefficients (or scattering parameters), (2) Poynting flux, (3) near-to-far field transformations, and (4) the Fourier-transformed fields themselves. By using the same AD package (JAX) described in Sec. 2, the user can programmatically define any arbitrary function of these DMs, such that the adjoint source for each objective function is scaled and placed appropriately.
For example, a typical objective function $f$ may seek to maximize the squared magnitude of the $S_{21}$ parameter of a device's scattering matrix ($S$ matrix), equivalent to the power transmitted into Port 2 given an input source in Port 1, where the $S_{21}$ parameter is a ratio of the outgoing and incoming eigenmode coefficients (each being a DM) such that

$$S_{21} = \frac{\alpha^{+}_{m,2}}{\alpha^{+}_{m,1}},$$

where $\alpha^{\pm}_{m,\ell}$ is the mode coefficient (complex amplitude) of the $m$th mode for either the forward (+) or backward (−) direction in the $\ell$th port. The mode coefficient itself is determined by an overlap integral over a port cross-section [53]:

$$\alpha^{\pm}_{m,\ell}(\omega) = C_{m,\ell} \int \left[ \mathbf{E}^{\pm *}_{m,\ell}(\omega) \times \mathbf{H}(\omega) + \mathbf{E}(\omega) \times \mathbf{H}^{\pm *}_{m,\ell}(\omega) \right] \cdot d\mathbf{A},$$

where $\mathbf{E}(\omega)$ and $\mathbf{H}(\omega)$ are the Fourier-transformed monitor fields at a particular frequency, $\mathbf{E}^{\pm}_{m,\ell}(\omega)$ and $\mathbf{H}^{\pm}_{m,\ell}(\omega)$ are the mode profiles corresponding to the $m$th mode at the $\ell$th port at the same frequency for the forward- (+) and backward- (−) propagating modes (obtained by an eigenmode solver [54]), and $C_{m,\ell}$ is a normalization constant chosen such that

$$\left|\alpha^{\pm}_{m,\ell}\right|^2 = P^{\pm}_{m,\ell},$$

where $P^{\pm}_{m,\ell}$ is the total power propagating in that particular mode. Since the adjoint source for a particular mode-overlap coefficient corresponds to another eigenmode source pointing in the opposite direction [22], we can choose a time-domain profile that scales the eigenmode source according to the results of the forward run (Sec. 5.2).
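A discretized scalar sketch of this decomposition (our own simplified form of the cross-section overlap, with the normalization chosen so the mode itself has unit amplitude):

```python
import numpy as np


def mode_coefficient(E, H, Em, Hm):
    """Complex amplitude of the mode (Em, Hm) contained in the monitor fields
    (E, H) on a 1D cross-section, via the discrete overlap sum."""
    norm = 2.0 * np.real(np.sum(np.conj(Em) * Hm))
    return np.sum(np.conj(Em) * H + E * np.conj(Hm)) / norm


def s21_squared(E1, H1, E2, H2, Em, Hm):
    """|S21|^2 from the incoming coefficient at Port 1 and the outgoing one at Port 2."""
    a1 = mode_coefficient(E1, H1, Em, Hm)
    a2 = mode_coefficient(E2, H2, Em, Hm)
    return np.abs(a2 / a1) ** 2
```

Because the overlap is linear in the monitor fields, its derivative with respect to those fields (the adjoint current) is simply the mode profile itself, scaled and reversed in direction.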
The user can further customize the objective function to additionally depend on other Fourier-transformed field quantities, like a far-field profile (corresponding to multiple broadband adjoint sources). Furthermore, different normalization functions, DOF filters, and FOM arrangements are easy to prototype, due to the automated AD process used to place each adjoint source. This flexibility is essential for selecting hyperparameters and identifying which objective function is best for a particular optimization problem. We describe two examples that highlight the flexibility and simplicity of our approach.

Example I: Silicon-photonics polarization splitter
As an initial example, we design a 3D silicon-photonics polarization splitter compatible with a commercial CMOS foundry using a stack of two independent design layers (an SOI layer and a polysilicon layer [55]). To quantify the performance of each output port, we construct two objective functions (to be evaluated in parallel as separate simulations), each consisting of four overlap coefficients over a 100 nm bandwidth. Fig. 4 illustrates the simulation configuration and final device design. The FOMs are designed to maximize transmission of the quasi-TE (Q-TE) and quasi-TM (Q-TM) modes in Ports 2 and 3, respectively, while minimizing return loss in Port 1 as well as channel crosstalk between Ports 2 and 3. The FOM for the Q-TE mode ($f_{\text{Q-TE}}$) is defined in terms of the mode coefficients, where $\alpha^{+}_{1,2}$ corresponds to the forward (+) propagating coefficient of the first mode (Q-TE) at the second port and the backward coefficient at Port 1 corresponds to the $S_{11}$ entry. The functions $f_1$, $f_2$, and $f_3$ are mapping functions used to ensure the units are consistent and of the same scale (and optimization direction) as one another, such that the transmission is maximized while the return loss and channel crosstalk are minimized. In this example, the mapping we used is a differentiable spline interpolator based on user-specified thresholds that maps each individual "sub-objective" to a compatible global FOM [56].
Similarly, the FOM for the Q-TM mode, $f_{\text{Q-TM}}$, is defined analogously, with the backward Q-TM coefficient at Port 1 corresponding to the $S_{11}$ entry. Both FOMs are defined over ten frequency points spanning a 100 nm bandwidth covering the C-band, from 1.5 μm to 1.6 μm, resulting in twenty unique FOMs that each produce four broadband adjoint sources. The full 3D structure was optimized using the minimax formulation outlined in Eq. (2) to maximize the minimum (worst-case) FOM out of this set. While the gradient for each frequency point was calculated using the same forward and adjoint run, each objective function ($f_{\text{Q-TE}}$ and $f_{\text{Q-TM}}$) required its own forward and adjoint runs, such that two simulations were always run in parallel (Sec. 5.3). Fig. 5 illustrates the final results of the polarization splitter for all six scattering parameters of interest across the entire band, along with a steady-state field profile for both polarization states at 1.55 μm.
We ran the optimization on two Intel Xeon Gold 6226 2.7 GHz CPU nodes using resources provided by the Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology. Each node contained 24 cores (48 total) and 192 GB of RAM (384 GB total). The optimization procedure consisted of 250 iterations (500 total Maxwell solves) and took approximately three days to complete at an FDTD resolution of 30 pixels/μm. As expected, the polarization-splitter device performs well across the entire band. We enforced standard semiconductor-foundry design-rule constraints during the final "epoch" of optimization (Sec. 3.1). The effect of the constraints on the optimization evolution is demonstrated in Fig. 6, which tracks the evolution of the minimax objective, along with the actual scattering parameters, as a function of optimization iteration. The performance worsens significantly once the constraints are applied, as the optimizer adjusts the topology to remove features and areas that are too small. Once these features satisfy the constraint, the optimization algorithm continues to improve the broadband performance for all objective functions.

Example II: High numerical-aperture cylindrical metalens
The second example involves a planar metalens [57][58][59][60] with high numerical aperture (NA) that uses a near-to-far field transformation and cylindrical symmetry to significantly reduce the computational cost at each optimization iteration, similar to techniques we have previously demonstrated for frequency-domain TO [3]. We design a metalens that has a diameter of 30 μm, is 2 μm thick, and has a desired focal length of 30 μm (corresponding to a numerical aperture of 0.5) and a multi-wavelength operation regime spanning 1.54-1.56 μm. Fig. 7 describes the steady-state performance of the optimized device at 1.55 μm and compares the reduced cell used in the optimization to the full cell without any domain-decomposition techniques. The final device focuses an incoming planewave at the desired focal length across the entire band.
In order to inverse-design a metalens, we simply maximize the intensity at the desired focal point [3,61], so that the objective function is

$$f(\omega) = \big|\mathbf{E}(r_0, z_0, \omega)\big|^2,$$

where $\mathbf{E}$ is the Fourier-transformed electric field at frequency $\omega$ and focal position $(r_0, z_0)$. Naively, this requires a simulation domain that includes the metalens, the desired focus, and the uniform propagation region in between (which could be rather large relative to the size of the lens). The simulation cost would then scale poorly with the focal length, not just because of the number of voxels that require updating at each timestep, but also because the electromagnetic fields must propagate through the entire cell (affecting the total number of timesteps needed for convergence), a typical challenge with time-domain methods. Since the hybrid time/frequency-domain method used throughout this work already operates on frequency-domain quantities (Sec. 2), we employed a near-to-far transformation to significantly reduce the size of the computational cell. The near-to-far transformation is a well-known technique, based on Huygens' principle, where the Fourier-transformed tangential fields in the "near region" can be treated as currents that are convolved with the corresponding Green's function (the fields produced by time-harmonic point sources) to yield the Fourier-transformed fields at arbitrary user-defined far-field points [28,48]. Using near-to-far transformations, the new computational cell need only include the actual metalens structure and the fields just beyond its surface [61]. In this example, we use a cylindrical-coordinates Green's function, but our package also supports 2- and 3-dimensional near-to-far transformations in Cartesian coordinates (with both distributed-memory and thread-level parallelism).
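The far-field objective can be sketched in 2D with the scalar free-space Green's function (a Hankel function); this is a toy Huygens sum of our own, not the package's cylindrical-coordinates implementation:

```python
import numpy as np
from scipy.special import hankel1


def far_field(currents, xs, k, x0, z0):
    """Sum equivalent near-field line currents against the 2D scalar Green's
    function G = (i/4) H0^(1)(k r) to get the field at the point (x0, z0)."""
    r = np.hypot(xs - x0, z0)
    dx = xs[1] - xs[0]
    return np.sum(0.25j * hankel1(0, k * r) * currents) * dx


def focal_intensity(currents, xs, k, x0, z0):
    return np.abs(far_field(currents, xs, k, x0, z0)) ** 2
```

Phase-conjugating the Green's function on the aperture (an idealized lens-like current distribution) maximizes the focal intensity for fixed current magnitude, which is what the optimizer approximates with a physical structure.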
The backpropagation step of the near-to-far transformation is just another (transposed) Green's-function convolution. This extra step is efficiently implemented using the same machinery as the rest of the FDTD solver, such that the entire adjoint pipeline can be used with far-field transformations (or even combinations of far-field transformations and other fundamental differentiable quantities). This method could be extended to other Green's functions, such as those modeling propagation through inhomogeneous regions [62], which is useful for integrating multilayer films [14].
[Figure caption fragment: The FDTD algorithm is used to calculate the gradient at multiple frequency points in parallel, even with complicated objective functions that have several adjoint sources or with simulation domains involving lossy and dispersive materials. (c) Multiple objective functions and design fields are run using "in-the-loop" parallelism, allowing the results of each run to be communicated to the optimizer after each iteration. This is essential for multi-objective optimization (i.e. optimizing along a Pareto front) and robust optimization (i.e. optimizing for fabrication variability).]
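The "transposed convolution" claim can be verified on a toy linear far-field map: if the far fields are $y = Gx$, then backpropagating a sensitivity through the map is multiplication by the conjugate-transposed matrix $G^\dagger$. The sketch below (a dense random stand-in for the Green's-function matrix, not Meep's code) checks the resulting gradient of a total-intensity FOM against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n_near, n_far = 40, 7
# dense stand-in for the Green's-function matrix (near fields -> far fields)
G = rng.normal(size=(n_far, n_near)) + 1j * rng.normal(size=(n_far, n_near))

def fom(x):
    return np.sum(np.abs(G @ x) ** 2)   # e.g. total far-field intensity

x = rng.normal(size=n_near) + 1j * rng.normal(size=n_near)
# backpropagation through the linear far-field map applies the
# conjugate-transposed Green's-function matrix to the output sensitivity
grad = 2 * (G.conj().T @ (G @ x))       # dF = Re[grad^H dx] for perturbation dx

# central finite-difference check of one component (real and imaginary parts)
h, i = 1e-6, 3
e = np.zeros(n_near); e[i] = h
fd_re = (fom(x + e) - fom(x - e)) / (2 * h)
fd_im = (fom(x + 1j * e) - fom(x - 1j * e)) / (2 * h)
```

The same pattern composes with any other differentiable post-processing, which is why far-field transformations slot directly into the adjoint pipeline.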

Computational parallelism
We now describe three forms of computational parallelism implemented within our package that enable large-scale photonics topology optimization: (1) spatial parallelism, which involves dividing the cell into load-balanced workloads across multiple processor cores of an MPI cluster; (2) frequency parallelism, which involves computing gradients for multiple frequencies simultaneously; and (3) simulation parallelism, which involves distributing gradient computations for different design fields or objective functions among different processor groups. Fig. 8 illustrates each of these forms of parallelism.

Data-driven load balancing for massively parallel simulations
Simulating large devices (i.e. tens to hundreds of wavelengths in physical dimensions) in which the total memory requirements exceed those of a single multicore machine is made possible in Meep by dividing the simulation region into multiple subregions or "chunks" [16]. Each chunk is assigned to a different processor core which, at each timestep, updates the fields at every voxel and communicates the fields on its exterior surfaces to neighboring chunks. Chunks are mapped to processors so that adjacent chunks reside on the same cluster node whenever possible, exploiting fast intra-node communication via shared memory for the chunk boundaries. A key challenge for performance optimization of massively parallel simulations involving hundreds or thousands of chunks/processes is the heterogeneous physics: different phenomena or calculations are modeled at different points in the computational domain.
Because different materials or data processing often require vastly disparate computational resources, load balancing the simulation involves partitioning the cell into unevenly sized chunks which have (nearly) equivalent computational work. To enable this feature, we developed a data-driven cell-partitioning strategy based on estimating the computational work required by a given subvolume [63,Sec. 3].
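To make the partitioning idea concrete, here is a minimal, self-contained sketch (not Meep's actual partitioning algorithm): given per-slice cost estimates for a 1-D cut direction, a greedy sweep places chunk boundaries so each contiguous chunk carries roughly equal work. The cost model (expensive PML slices flanking a cheap interior) and the `partition_1d` helper are hypothetical.

```python
import numpy as np

def partition_1d(costs, nprocs):
    """Split a 1-D array of per-slice cost estimates into `nprocs`
    contiguous chunks with roughly equal total work (greedy sketch)."""
    total = costs.sum()
    target = total / nprocs
    cuts, acc, need = [], 0.0, 1
    for i, c in enumerate(costs):
        acc += c
        if acc >= need * target and need < nprocs:
            cuts.append(i + 1)          # place a chunk boundary after slice i
            need += 1
    return np.split(np.arange(len(costs)), cuts)

# toy cost model: PML slices (~3x work) flanking a cheap vacuum interior
costs = np.array([3.0] * 10 + [1.0] * 80 + [3.0] * 10)
chunks = partition_1d(costs, 4)
loads = [costs[c].sum() for c in chunks]
```

Even-sized chunks would give the PML-owning processors ~3× the work of the interior ones; the cost-weighted cuts equalize the loads instead.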
Because the forward and adjoint simulations involve computing the Fourier-transformed fields over the entire design region, which can be a large fraction of the total cell, the cost of the DTFT field updates can often dominate the overall timestepping. We therefore implemented two performance-related features: (i) reducing the memory bandwidth by storing the DTFT fields (as well as the time-domain fields) using single- rather than double-precision floating point, since the error in the simulation is almost always dominated by discretization error of the discontinuous material boundaries rather than floating-point rounding errors, and (ii) decimating the DTFT time-series updates (i.e. performing the DTFT updates at every $n$th timestep rather than every timestep, without any loss of accuracy due to aliasing thanks to the band-limited nature of the sources we employ) [27]. The degree of decimation is chosen automatically at runtime based on the Nyquist rate set by the bandwidth of the sources and monitors. We also implement a recursive "cache-oblivious" loop-tiling [64] approach for the field updates to improve memory locality for updating different field components. Taken together, we have demonstrated through benchmarking studies using Intel Xeon processors that these features can provide more than a 5× speedup for the timestepping.

Frequency parallelism, broadband optimization, and convergence criteria
To generate the correct frequency response over a broad bandwidth, a key challenge is the construction of the time-domain adjoint sources. For a given frequency-dependent FOM, the frequency-domain adjoint method prescribes a corresponding frequency-dependent current density $\mathbf{J}(\mathbf{x}, \omega)$ at each of the design frequencies [1], as reviewed in Sec. 2. In a time-domain simulation, one must then construct a time-dependent current $\mathbf{J}(\mathbf{x}, t)$ whose discrete-time Fourier transform (DTFT) matches $\mathbf{J}(\mathbf{x}, \omega_k)$ for $k = 1, \ldots, K$. By Fourier-transforming the "adjoint fields" resulting from this current, the gradient of the FOM can be computed at all $K$ frequencies using one additional time-domain simulation. Because there are infinitely many functions $\mathbf{J}(\mathbf{x}, t)$ whose Fourier transform would be suitable, for efficiency we desire a current source that is highly localized in time (time-limited), enabling a short-duration time-domain adjoint simulation, as well as being well-localized in frequency (approximately bandlimited) to enable decimation without aliasing as described in Sec. 5.1. In this section, we describe a novel fitting approach to efficiently construct a time-domain adjoint current with these properties.
Our method fits a series of band-limited filter windows (acting as basis functions) in the frequency domain to the desired adjoint-source frequency response. The weight factors are then used to modulate the corresponding time-domain profiles, which are analytically derived using the inverse DTFT. The resulting algorithm only requires $O(KM)$ storage for the basis weights, where $K$ is the number of frequency points and $M$ is the number of spatial points spanning the source. This approach is independent of the number of timesteps and works with any number of adjoint sources and any number of frequency points with arbitrary spacing. Fig. 9 illustrates the fitting process for an eigenmode adjoint source, along with its corresponding time-domain profile.
The fitting routine solves the least-squares problem
$$\min_{w} \sum_{k=1}^{K} \Big\| \mathbf{J}_m(\mathbf{x}, \omega_k) - \sum_{j=1}^{K} w_j(\mathbf{x})\, \phi_j(\omega_k) \Big\|^2, \qquad (20)$$
where $\mathbf{J}_m(\mathbf{x}, \omega)$ is a frequency- and position-dependent adjoint source corresponding to the $m$th adjoint quantity of interest (eigenmode overlap, far-field profile, etc.) and $w_j(\mathbf{x})$ is a basis weight at position $\mathbf{x}$ corresponding to the $j$th frequency basis function $\phi_j(\omega)$. Since the adjoint quantities are only evaluated at the discrete frequencies of interest, the fitted function can take any value at all other frequency points. Given these weights $w_j(\mathbf{x})$, the adjoint simulation then uses a time-domain current source
$$\mathbf{J}(\mathbf{x}, t) = \sum_{j} w_j(\mathbf{x})\, \phi_j(t),$$
where $\phi_j(t)$ is the inverse DTFT of $\phi_j(\omega)$. It is desirable to have a basis function that is localized in both the time and frequency domains, with an analytically tractable form in both domains. (For example, a Gaussian window is somewhat inconvenient because it does not have a simple analytical DTFT formula, although it could still be used as a basis function by, e.g., numerically integrating the corresponding DTFT at each timestep [24].) We chose to employ a Nuttall window [65] as our basis function, since its functional forms are simple and because it demonstrates a balance between resolution and dynamic range (an important trade-off when fitting several basis functions at closely spaced frequencies). The Nuttall window is defined by
$$w[n] = \sum_{m=0}^{3} (-1)^m a_m \cos\!\left(\frac{2\pi m n}{N}\right), \qquad n = 0, \ldots, N-1,$$
where $N$ is the total length of the sequence, $\Delta t$ is the timestep, $\omega_j$ is the basis-function center frequency, the spacing between adjacent sinc functions in the window's DTFT is $\Delta\omega = 2\pi/(N\Delta t)$, and the Nuttall coefficients are
$$a_0 = 0.355768, \quad a_1 = 0.487396, \quad a_2 = 0.144232, \quad a_3 = 0.012604.$$
This approach centers a basis function at each frequency of interest $\omega_j$. The corresponding (complex) time-domain sequence is defined by
$$\phi_j[n] = w[n]\, e^{i\omega_j n \Delta t},$$
where $n$ is the current (integer) timestep, such that the current time is $t = n\Delta t$. When necessary, we cache the current evaluation of each respective basis function and weight it with the corresponding spatial profile.
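The fitting procedure can be sketched end-to-end in a few lines of NumPy: build Nuttall-windowed complex-exponential basis functions, evaluate their DTFTs at the design frequencies, solve the least-squares system for the weights, and confirm (by linearity of the DTFT) that the synthesized time-domain source reproduces the target spectrum at those frequencies. The window coefficients are the standard four-term Nuttall values, and all frequencies, durations, and target values here are illustrative.

```python
import numpy as np

dt, N = 0.1, 2000                      # timestep and source duration (timesteps)
n = np.arange(N)
# four-term Nuttall window (time-limited, approximately bandlimited)
a = [0.355768, 0.487396, 0.144232, 0.012604]
win = sum(((-1) ** m) * a[m] * np.cos(2 * np.pi * m * n / N) for m in range(4))

freqs = np.array([0.9, 1.0, 1.1])      # design frequencies (arbitrary spacing)
target = np.array([1.0 + 0.5j, -0.3 + 1j, 0.2 - 0.7j])  # desired adjoint spectrum

# basis functions: window modulated to each center frequency
basis_t = [win * np.exp(1j * 2 * np.pi * fk * n * dt) for fk in freqs]
# DTFT of each basis function evaluated at every design frequency
B = np.array([[np.sum(bt * np.exp(-1j * 2 * np.pi * f * n * dt)) * dt
               for bt in basis_t] for f in freqs])
w, *_ = np.linalg.lstsq(B, target, rcond=None)    # fitted basis weights
src = sum(wk * bt for wk, bt in zip(w, basis_t))  # time-domain adjoint source
# check: DTFT of the synthesized source at the design frequencies
check = np.array([np.sum(src * np.exp(-1j * 2 * np.pi * f * n * dt)) * dt
                  for f in freqs])
```

Because the basis functions' main lobes are narrow compared with the frequency spacing, the system matrix `B` is nearly diagonal and the fit is well conditioned.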
We note that if the simulation allows for purely real fields, we take the real part of the above complex current source; the resulting additional frequency content at negative frequencies does not affect our analysis at purely positive frequencies.
The frequency bandwidth of a basis function is inversely proportional to its temporal width, so that narrower-bandwidth basis functions require longer adjoint runs. The tradeoff is that wide bandwidths cause the basis functions in the fit (20) to overlap, resulting in an ill-conditioned fit and potentially inaccurate results. In consequence, we choose the bandwidth of the basis functions to be proportional to the spacing of the user-defined frequency points $\omega_k$, such that the time duration of each (broadband) adjoint source is $N = 2\pi/(\Delta\omega_{\min}\,\Delta t)$ timesteps, where $\Delta\omega_{\min}$ corresponds to the minimum frequency spacing in the objective functions (which support arbitrary nonuniform frequency sampling). Requesting closely spaced frequencies therefore leads to long-running adjoint simulations. However, the main reason to request such finely spaced frequencies is if the spectrum exhibits fine features (e.g. sharp peaks), which correspond to long-lived resonant modes that would require a long-running simulation anyway.

Simulation parallelism
Finally, our approach supports simulation parallelism, where multiple design fields or objective functions can be simulated in tandem, but their results are globally gathered by the optimization algorithm at each iteration, enabling large-scale robust and multi-objective TO. Similar to Sec. 5.1, our package automatically distributes different simulations to different groups of processors. Each group load-balances its respective simulation, and each simulation runs in parallel. After a gradient is calculated for each group, each group sends its corresponding gradient calculation to every other group (and every processor within each group) using an "all-to-all" operation [66]. This way, the same deterministic optimization algorithm can run on each individual processor and converge to the same result, even though each processor only propagates part of the fields for a single group.
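The replicated-optimizer pattern can be emulated without MPI: each "group" computes only its own objective's gradient, an allgather stand-in distributes every gradient to every replica, and each replica applies the same deterministic update, so all replicas stay bit-identical while converging. The two quadratic objectives and the `allgather` helper below are illustrative stand-ins for the quasi-TE/quasi-TM FOMs and `MPI_Allgather`.

```python
import numpy as np

# each "group" of processors owns one objective's gradient computation
def grad_te(p): return 2 * (p - 1.0)     # stand-in quasi-TE gradient, d/dp (p-1)^2
def grad_tm(p): return 2 * (p + 1.0)     # stand-in quasi-TM gradient, d/dp (p+1)^2

def allgather(pieces):
    """Stand-in for MPI_Allgather: every replica receives every group's result."""
    return np.concatenate(pieces)

params = [np.array([0.5]), np.array([0.5])]  # identical parameter copies per group
for _ in range(50):
    pieces = [grad_te(params[0]), grad_tm(params[1])]  # computed in tandem
    for r in range(2):                                 # same deterministic step...
        g = allgather(pieces)                          # ...from the same data
        params[r] = params[r] - 0.1 * g.sum()          # on every replica
```

Because every replica sees the same gathered gradients and runs the same update rule, no central coordinator is needed to keep the designs synchronized.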
For example, the polarization rotator described in Sec. 4.1 assigned an independent objective function to two independent groups of processors, where each group was allocated an independent Intel Xeon Gold 6248 node with 40 cores (for a total of 80 cores). One group computed the gradient corresponding to the quasi-TE objective function, while the other group computed the gradient corresponding to the quasi-TM objective function. Again, we note that each group is able to compute gradients at multiple frequency points simultaneously, such that 20 unique gradients are broadcast during each optimization iteration (2 objective functions × 10 frequency points). After broadcasting, the optimization algorithm running on each core chooses the same set of parameters for the next iteration.
A more efficient approach might also distribute the work (and storage) of the actual optimization algorithm, such that each core only stores a subset of the design parameters and other arrays used while computing the next optimization step. Indeed, such an approach is especially important for large-scale designs where the DOF span three independent dimensions within the design region (unlike many lithographic applications, where a 2D space is projected onto a 3D thin film). To meet this challenge, we have implemented hybrid thread-level parallelism within the fundamental timestepping algorithm, such that a single process is allocated on each compute node, but multiple shared-memory threads (corresponding to each core) propagate the fields in parallel. Such a hybrid approach [67] combines the benefits of the distributed-memory and multithreaded approaches to parallelism. Since only one process is allocated on each node, only one instance of the optimization algorithm (and all its accompanying data structures) is stored on the node. Rather than assign one process to each core, each core spawns a thread to distribute common tasks, like timestepping and near-to-far transformations. If more memory and/or cores are needed (especially when running multiple simulations in parallel), then more nodes (and processes) can be allocated as needed.
Another common multi-objective task is robust optimization, which performs a worst-case (e.g. minimax) optimization over various design field perturbations [68]. For example, one might optimize over an ideal design and systematic over/under-etched variants (corresponding to eroded and dilated versions of its geometry, respectively). Each gradient calculation is performed in tandem by assigning one group to each design field. This approach seamlessly scales to any number of design variations (an appealing feature if one wishes to impose robustness over an ensemble of random variations [33]).
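A minimal sketch of the eroded/nominal/dilated construction: applying the standard smoothed threshold projection to a filtered design field at three thresholds ($\eta$ = 0.3, 0.5, 0.7) yields dilated, nominal, and eroded variants, and the worst-case (minimax) objective is the minimum FOM over the variants. The fill-fraction "FOM" and all parameter values here are illustrative stand-ins for a real electromagnetic objective.

```python
import numpy as np

def project(rho, eta, beta=8.0):
    """Smoothed threshold projection; eta < 0.5 dilates, eta > 0.5 erodes."""
    num = np.tanh(beta * eta) + np.tanh(beta * (rho - eta))
    return num / (np.tanh(beta * eta) + np.tanh(beta * (1 - eta)))

rho = np.linspace(0, 1, 11)           # a toy (already filtered) design field
variants = {eta: project(rho, eta) for eta in (0.3, 0.5, 0.7)}

def objective(x):                      # stand-in FOM: material "fill fraction"
    return x.mean()

# worst-case (minimax) robust objective over the three geometric variants
worst = min(objective(v) for v in variants.values())
```

In the package, each variant's gradient is computed by a separate processor group in tandem, and the optimizer maximizes `worst` (typically via an epigraph formulation).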

Conclusion
We present a free/open-source photonics topology-optimization software package that builds on the popular FDTD solver Meep and is capable of scaling to arbitrarily large design problems by exploiting massive computational parallelism. By combining powerful techniques scattered throughout several previous works, such as automatic differentiation and a hybrid time/frequency-domain adjoint-variable method, along with novel features such as new filter-design sources for the gradient calculation and material-interpolation methods for dispersive media, our package can flexibly inverse design complicated, broad-bandwidth devices and subsystems for a wide range of photonic applications.
Leveraging a hybrid time-/frequency-domain adjoint-variable formulation brings several important benefits to the inverse-design community not seen with formulations strictly implemented in one domain or the other. For example, our approach allows for broadband optimization without the prohibitive scaling costs of time-domain adjoint formulations. Similarly, our approach enjoys the flexibility and generalizability of frequency-domain adjoint solvers, without the subtle convergence issues of iterative solvers. Despite these advantages, no single approach is a panacea for all photonics problems, and certain design criteria are better addressed using alternate formulations. Plasmonic design problems, for example, tend to favor non-uniform meshes and finite-element or boundary-element methods (FEM or BEM) that are most easily implemented in the frequency domain [4,[69][70][71][72][73]. Another interesting case is the optimization of nonlinear optical devices. The steady-state time-harmonic response of a nonlinear system cannot be computed by Fourier-transforming a pulse input (as in our hybrid approach), whereas nonlinear frequency-domain methods are applicable [74][75][76]. Non-steady-state transient nonlinear effects, or perhaps self-pulsing or chaotic phenomena, might conversely require a purely time-domain approach. On the other hand, there are a number of nonlinear phenomena that can be modeled using cascaded linear calculations, which could utilize the hybrid time-/frequency-domain approach; examples include optimizing second-harmonic generation (SHG) [77], Raman scattering [4], or other devices where the nonlinearity can be treated perturbatively.

Appendix A. Time-harmonic adjoint formulation
This appendix derives the hybrid time/frequency-domain adjoint-variable method used throughout this work. Our implementation solves the Maxwell equations in the time domain using the FDTD software package Meep. Since the objective functions themselves, however, are defined in the frequency domain, the frequency-domain fields (and the corresponding adjoint sources) must be transformed from time-domain quantities and vice versa. While this hybrid approach significantly reduces the storage requirements of time-domain adjoint methods without compromising the flexibility of frequency-domain adjoint methods, there are several nuances that must be considered when constructing the proper adjoint formulation.
First, we examine the Maxwell equations in the time domain,
$$\frac{\partial \mathbf{B}}{\partial t} = -\nabla \times \mathbf{E} - \mathbf{K}, \qquad (25)$$
$$\frac{\partial \mathbf{D}}{\partial t} = \nabla \times \mathbf{H} - \mathbf{J}, \qquad (26)$$
where E and H are the macroscopic electric and magnetic fields, D and B are the electric displacement and magnetic induction fields, J is the electric current density, and K is a fictitious magnetic current density. Throughout our work, we assume linear time-invariant materials that may be dispersive and/or anisotropic (Eq. (7)). The FDTD algorithm discretizes Eq. (25) and Eq. (26) using the Yee grid [49], formulating an implicit "timestep operator" [16], $\hat{A}_0$, that iteratively updates the fields after every timestep $n$, spaced by $\Delta t$, such that
$$\mathbf{x}_{n+1} = \hat{A}_0\, \mathbf{x}_n + \mathbf{s}_n,$$
where $\mathbf{x}_n$ represents all the fields (E, D, H, and B) at timestep $n$ and $\mathbf{s}_n$ are the source terms from that timestep. While the FDTD algorithm computes the fields in the time domain, the equivalent frequency-domain response $\hat{\mathbf{x}}(\omega)$ can be obtained by accumulating a discrete-time Fourier transform (DTFT) at a frequency $\omega$ after each timestep:
$$\hat{\mathbf{x}}(\omega) = \Delta t \sum_{n} \mathbf{x}_n\, e^{i\omega n \Delta t}.$$
Accumulating the DTFT only at the desired frequencies during timestepping allows one to avoid the expensive storage that would be required to save the fields at every monitor point and at every timestep and then Fourier transform these time-domain fields in post-processing [16]. Two criteria must be met in order for the results of the DTFT to accurately approximate the results of an explicit frequency-domain solver. First, the Fourier-transformed fields must converge after timestepping. As the DTFT is updated after every timestep, any particular Fourier component will continue to update so long as its time-domain equivalent has not decayed to a negligible value due to absorption (by materials or by absorbing boundaries [48]). Sufficient runtime is especially important when designing highly resonant devices or explicitly specifying adjoint sources with long time profiles (due to dense frequency spacing, Sec. 5.2).
Second, since the FDTD algorithm approximates the continuous time-derivative operator $\frac{\partial}{\partial t}$ using a finite-difference operator $\hat{\partial}_t$, the corresponding frequency term $\omega$ requires correction to retain second-order accuracy in $\Delta t$. Specifically, the time discretization is described by
$$\hat{\partial}_t\, f(t) = \frac{f(t + \Delta t/2) - f(t - \Delta t/2)}{\Delta t}.$$
While a continuous time-harmonic field $\mathbf{x}(t) = e^{-i\omega t}\, \mathbf{x}(0)$ would have the following DTFT relationship:
$$\frac{\partial}{\partial t}\, \mathbf{x}(t) = -i\omega\, \mathbf{x}(t),$$
the DTFT relationship for the finite-difference operator is described by
$$\hat{\partial}_t\, \mathbf{x}(t) = -i\tilde{\omega}\, \mathbf{x}(t),$$
where $\tilde{\omega}$ is an adjusted frequency term,
$$\tilde{\omega} = \frac{\sin(\omega \Delta t / 2)}{\Delta t / 2},$$
which can be equivalently expressed as $\tilde{\omega} = \omega\, \operatorname{sinc}(\omega \Delta t / 2)$. As expected, the two transforms agree as $\Delta t \to 0$. We now describe the operator $\hat{A}(\tilde{\omega}, \varepsilon)$ that corresponds to the steady-state field evolution calculated using the FDTD algorithm and the DTFT:
$$\hat{A}(\tilde{\omega}, \varepsilon)\, \hat{\mathbf{x}}(\omega) = \hat{\mathbf{b}}(\omega).$$
To compute the gradient of an objective function $F(\hat{\mathbf{x}})$ satisfying the constraint $\hat{A}(\tilde{\omega}, \varepsilon)\, \hat{\mathbf{x}} = \hat{\mathbf{b}}$, an adjoint-variable formulation is used such that
$$\frac{\partial F}{\partial \varepsilon} = -\Re\!\left[ \boldsymbol{\lambda}^\dagger\, \frac{\partial \hat{A}}{\partial \varepsilon}\, \hat{\mathbf{x}} \right],$$
where $\hat{\mathbf{x}}$ are the fields stored after the forward simulation and $\boldsymbol{\lambda}$ are the fields stored after the adjoint simulation, with sources determined by both the objective function and results of the forward run [15, Sec. 8.7].
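The adjusted-frequency relation is an exact identity, which is easy to verify numerically: applying the centered finite difference to $e^{-i\omega t}$ yields $-i\tilde{\omega}\, e^{-i\omega t}$ with $\tilde{\omega} = \sin(\omega \Delta t/2)/(\Delta t/2)$, and $\tilde{\omega} \to \omega$ as $\Delta t \to 0$. A short check (values illustrative):

```python
import numpy as np

dt = 0.01
omega = 2 * np.pi * 1.3
x = lambda t: np.exp(-1j * omega * t)            # continuous time-harmonic field
t0 = 1.0
fd = (x(t0 + dt / 2) - x(t0 - dt / 2)) / dt      # centered difference (FDTD-style)
omega_tilde = np.sin(omega * dt / 2) / (dt / 2)  # adjusted frequency
approx = -1j * omega_tilde * x(t0)               # predicted finite-difference value
```

The discrepancy between $\tilde{\omega}$ and $\omega$ is $O(\Delta t^2)$, which is why the correction preserves the scheme's second-order accuracy.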
Since only the permittivity $\varepsilon$ (and not the permeability) varies with the design parameters, only the electric fields need to be stored during the forward ($\hat{\mathbf{x}}_E$) and adjoint ($\boldsymbol{\lambda}_E$) solves. The above inner product becomes a matrix-tensor-matrix product yielding the gradient at each point in space:
$$\frac{\partial F}{\partial \varepsilon(\mathbf{x})} = \tilde{\omega}^2\, \Re\!\left[ \boldsymbol{\lambda}_E^\dagger(\mathbf{x})\, \hat{\mathbf{x}}_E(\mathbf{x}) \right].$$
This analysis produces sufficiently accurate gradients provided the interpolation and restriction steps (due to the Yee grid) are properly implemented (Sec. 3). That being said, we note that improvements could still be made by implementing a true "discretize-then-differentiate" scheme [78]. Specifically, the adjoint method should explicitly account for the timestepping operator, $\hat{A}_0$, and not just the effects of the discrete-time operator. The approach taken here mixes the two methodologies (the other being "differentiate-then-discretize") in order to simplify the implementation. Future efforts could consolidate the approach while simultaneously integrating important features, like subpixel smoothing [21,79], further reducing inaccuracies and artifacts that stem from the discretization.
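The adjoint recipe above can be demonstrated on a small generic linear system rather than the Maxwell operator itself: for $A(\varepsilon)\,x = b$ with diagonal dependence on $\varepsilon$ and FOM $|c^\dagger x|^2$, one forward solve plus one adjoint (transposed) solve yields the gradient with respect to every design variable at once, which we verify against finite differences. All matrices, sizes, and values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A0 = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)) + 5 * np.eye(n)
b = rng.normal(size=n) + 1j * rng.normal(size=n)
c = rng.normal(size=n) + 1j * rng.normal(size=n)
eps = rng.uniform(1.0, 2.0, size=n)            # "permittivity" at each voxel

def solve(e):
    return np.linalg.solve(A0 + np.diag(e), b)  # forward problem A(eps) x = b

def fom(e):
    return abs(np.vdot(c, solve(e))) ** 2       # |c^H x|^2, a frequency-domain FOM

x = solve(eps)                                             # forward fields
lam = np.linalg.solve((A0 + np.diag(eps)).T, np.conj(c))   # one adjoint solve
grad = -2 * np.real(np.conj(np.vdot(c, x)) * lam * x)      # gradient per voxel

# central finite-difference check of one design variable
h = 1e-6
ep, em = eps.copy(), eps.copy()
ep[2] += h; em[2] -= h
fd = (fom(ep) - fom(em)) / (2 * h)
```

The cost is two solves regardless of the number of design variables, which is the property that makes adjoint-based topology optimization tractable at millions of DOF.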