Momentum-Assisted Adjoint Method for Highly Efficient Inverse Design of Large-Scale Digital Nanophotonic Devices

Gradient-descent-based digitized adjoint method offers a way to realize the high-efficiency inverse design of digital nanophotonic devices with diverse functions. However, the vanishing gradient problem encountered in the design of high-dimension devices may lead to significant inefficiencies, making it difficult to integrate novel functions on a single chip. Here, we propose a highly efficient digitized adjoint method for large-scale inverse design, called adaptive gradient-descent with momentum. It uses the first- and second-order momentum, instead of the gradient, to update the device pattern during adjoint optimization. To demonstrate the efficiency of the proposed method, we design a coarse wavelength division multiplexer and a three-mode power divider with design dimensions of 800 and 1360, respectively, which are approximately 2-4 times that of conventional digital nanophotonic devices. The simulation results show that, compared with the conventional gradient descent method, the momentum-assisted adjoint method has about 4-6 times higher efficiency and obtains better optimization performance, which provides a powerful tool for the inverse design of novel digital nanophotonic devices.

In the adjoint method with the GD algorithm, we update the device pattern against the gradient direction at each iteration by ε new = ε old − α · g t . Here, g t is the gradient at the t th iteration and α is a fixed variable, called step size, to control the update speed of optimization. Unfortunately, optimizations often get trapped in poor results during the inverse design of devices with high dimensions and complex functions. This problem was first posed in the field of deep learning [27]. First, the number of saddle points would increase exponentially with the dimension of the objective function [28]. In mathematics, a saddle point is a point where g t = 0 but not a local extremum in the objective function, as shown in Fig. 1(a). In theory, when a saddle point is encountered in the optimization, its vanishing gradient will cause the GD-based optimization to slow down significantly, or even stop the optimization at the saddle point [29]. Second, one usually predefines a step size α and keeps it constant during optimization. As shown in Fig. 1 may lead to painfully slow optimization, while a step size that is too large may cause the optimization to oscillate or even diverge [27]. There are some manual intervention-based step size adjustment strategies, such as the WOLFE conditions, the GOLDSTEIN conditions [30], the iteration-based scheduling [31], [32], and the performance-based scheduling [33], which may require a lot of computation or need to be carefully tuned for each different design task. These strategies attempt to enforce the increase of the performance at each iteration and to secure the convergence of the training algorithm.
In this article, we propose the adaptive gradient-descent with the momentum (A-GDM) algorithm based on the digital adjoint method. In this algorithm, we introduce the first-and second-order momentum from the deep learning field to help efficiently escape saddle points and adaptively adjust the step size in the large-scale inverse design. First, instead of using only the current gradient g, we update the device pattern with the first-order momentum representing the sum of the recent gradients. Its optimization process can be compared to a ball rolling down a hill, where the gradient represents acceleration, and the first-order momentum represents the velocity. By analogy, the first-order momentum/velocity represents the accumulation of current and past gradients/accelerations, respectively. The optimization/ball will not stop at the saddle point where the current gradient/acceleration is zero, because the first-order momentum/velocity at the saddle point is usually not zero. Then, we introduce the second-order momentum to adaptively adjust the step size α. The second-order momentum represents the average gradient magnitude at recent iterations. We use the second-order momentum to increase the step size where the gradient is small and decrease the step size where the gradient is large, which may help to maintain a good balance between optimization efficiency and stability.
We design an 800-dimension coarse wavelength division multiplexer (CWDM) and a 1360-dimension three-mode 3 dB power divider to compare the performance and efficiency of the A-GDM algorithm. The fabrication-robust PhC-like subwavelength structure is used as the digital nanostructure for both devices, and its pixel is a silicon cuboid with a central cylinder filled with silica or air [34]. We employ the same inverse design framework as [26], but replace its GD-based adjoint method with the momentum-assisted adjoint method proposed in the article. In addition, we propose a new form of the objective function, which not only optimizes the device performance, but also biases the device's relative permittivity pattern ε to the binary distribution for CMOS-compatible fabrication. We also discussed the impact of the hyperparameters of the A-GDM algorithm on the efficiency and performance of inverse design, and give the optimal selection scheme. The simulation results show that the momentum-assisted adjoint method is about 4-6 times more efficient than the conventional GD method and may achieve better optimization performance, which provides a powerful tool for the inverse design of novel digital nanophotonic devices.

II. INVERSE DESIGN OF CWDM
First, we design an on-chip CWDM with two 1550 nm and 1570 nm wavelength channels to demonstrate the momentumassisted adjoint method. Many WDM devices based on inverse design have been reported before [11], [12], [35], [36], [37], but how to achieve smaller channel spacing is still a challenge. To achieve a narrow channel spacing of 20 nm in a small footprint, we need a higher design dimension to endow the device pattern with a more sensitive wavelength-dependent performance. For comparison, we choose the same device layout as [38]. Specifically, as shown in Fig. 2, the CWDM has a compact footprint of 4.98 μm × 2.56μm and is discretized into 40 × 20 pixels for inverse design. The device is designed on the 220 nm thick top silicon layer of the SOI platform. Each pixel of the PhC-like subwavelength structure is in the shape of a silicon cuboid (120 nm × 120 nm × 133 nm) with a central cylinder with a radius of 40 nm. The width of the input and the output waveguide is 450 nm, and the gap between the centers of the two output ones is 1 μm. The relative permittivity of air and silicon is set to 1.0 and 12.08 in simulations, respectively. The expression of the objective function F obj is where F perf = (T 1550,1 − 1) 2 + T 2 1550,2 + T 2 1570,1 where, F obj contains the performance indicator (F perf ) and the binarization penalty term (F bina ). F perf contains the insertion loss and crosstalk of two channels. T 1550,1 and T 1550,2 respectively represent the transmittance of the 1550 nm wavelength light on the upper output waveguide and the crosstalk on the lower output waveguide. T 1570,2 and T 1570,1 respectively represent the transmittance of the 1570 nm wavelength light on the lower output waveguide and the crosstalk on the upper output waveguide. F bina is the binarization penalty function, and when it is the minimum value of 0, the device pattern has been completely binarized. i, j represent the index number of the row and column of each pixel, respectively. γ is the weighting factor of the binary penalty, which may need to be roughly tuned. Usually, we may choose an appropriate γ through the following strategies. First, we can set the initial value of γ by γ = 10 * F perf contains two insertion loss indicators and two crosstalk indicators. N pixel = 800 is the number of pixels in the device. ε core and ε etch are the relative permittivity of core material and cladding material. The scaling factor of 10 is an empirical parameter. Next, we try to optimize with the initial value of γ. We can estimate the number of digitized pixels after convergence from the curve of the number of digitized pixels at the early stage of optimization. If the binarization ratio (the number of binarized pixels / the number of all pixels) is greater than 80%, we may choose this current value γ. If the binarization ratio is between 50% and 80%, we may increase γ to 1.2 γ. If the binarization ratio is less than 50%, we may increase γ to 2γ. Usually, no more than 3 adjustments are needed to obtain a suitable γ, which may ensure a good enough performance while making the digitized radio more than 80% after the 2nd step. Here we set γ to 4.1 × 10 −4 . Then, the optimization process can be divided into three steps as shown in Fig. 3(a). The transmittance T is calculated by overlapping integrals. The expression is given by where S is the cross-section of each output waveguide. E m and H * m represent the electric and magnetic fields of the TE m mode. E in and H in denote the actual electric and magnetic fields at S, respectively.
In the first step, all cylinders are uniformly filled with an intermediate material with a relative permittivity of ε ij = 6.5, and the other area is filled with silicon in the initial pattern, as shown in Fig. 5(a). Then we calculate the gradient of F obj for the current pattern by where, the g is the gradient of F obj to the ε of each pixel, which is a 40 × 20 dimension vector. g perf is the gradient of F perf , which is derived by the adjoint method. As shown in Fig. 3(b), the adjoint method needs a forward simulation and an adjoint simulation at each wavelength channel [39]. The forward simulation uses the incident modal source with normalized complex amplitude 1e i0 . The adjoint simulation uses the reciprocal modal source with complex amplitude ∂F perf /∂T [40]. E old (p) and E A (p) are the electric field response of the forward and adjoint simulation, respectively. χ ij is the space of each cylinder and p is all the simulation mesh points in χ ij . g bina is the gradient of the function F perf , which is computed by derivation of (3).
In the second step, we update the ε with the A-GDM algorithm. The A-GDM algorithm uses the exponential moving average of g t and |g t | 2 as the first-and second-order momentum, respectively [41]. The expression is given by where g t is the gradient at the t th iteration. v t and r t are the firstand second-order momentum, respectively. Their initial value v 0 = 0 and r 0 = 0. β 1 and (1 − β 1 ) are the weights of the previous first-order momentum v t−1 and the current gradient g t , respectively. β 2 and (1 − β 2 ) are the weights of the previous second-order momentum r t−1 and the absolute square of the current gradient |g t | 2 , respectively. Usually, β 1 < 1 and β 2 < 1.
Here we set β 1 = 0.9 and β 2 = 0.96. In the early iterations, the v t and r t may increase rapidly from 0. To reduce oscillations  caused by the drastic change of v t and r t , we correct them by where v t and r t are the bias-corrected estimates of v t and r t , respectively. In specific, v 1 = g 1 and r 1 = |g 1 | 2 at the first iteration. When t is small, the v t is close to the average value of g t and the r t is close to the average value of |g t | 2 , instead of starting from 0 and increasing rapidly [41]. When t is large, v t ≈ v t and r t ≈ r t because (β 1 ) t ≈ 0 and (β 2 ) t ≈ 0. As the r t may reflects the value of |g t | 2 in recent iterations, we adjust the adaptive step size α t by where α is a fixed hyperparameter of the step size. In practice, we usually add a smoothing term to the r t to avoid division by zero. The is usually set to 1e −8 . Then, the A-GDM algorithm updates the pattern by where we replace the "gradient" in the GD with the v t and replace the fixed step size with an adaptive one. The first-order momentum v t can utilize the past gradient to help the optimization to escape the saddle points. As stated in [41], the r t can be regarded as the estimate of |g t | 2 as E ( r t ) = E(|g t | 2 ). Informally, r t represents the average |g t | 2 at the recent iterations, especially the last 1/(1 − β 2 ) iterations. The larger the value of β 2 , the more dependent r t is on past gradients. When g t → 0, the step size α t may increase as the r t decreases, thus efficiently helping to escape the saddle points. When g t → ∞, the step size α t may decrease as the r t increases, thus avoiding the optimization divergence caused by the large update speed in the steep slopes. By introducing the adaptive step size α t , the optimization may achieve a balance between efficiency and stability. The A-GDM algorithm also naturally performs a form of step size annealing, thus the optimization may still guarantee convergence even if the α is too large [41].
To compare the effects of the first-order momentum and second-order momentum, we also use the GD and the GD with the first-order momentum (GDM) algorithms to update the device pattern, respectively. The GDM algorithm is also widely used in the field of deep learning. According to [42], the GDM algorithm calculates the first-order momentum by where v t is the first-order momentum and its initial value v 0 = 0. β is the weight of the previous first-order momentum v t−1 and is normally set to 0.9 [42]. Then the GDM algorithm updates the pattern by where α is the constant step size. For comparison, we set the step size hyperparameter α to 1.0825 in the GD and GDM and 1.0 in the A-GDM. In this way, for the three different algorithms, the |Δε| max among all pixels in the first iteration is the same value of 1.0. The three optimization processes based on the three algorithms run 2000 iterations respectively, and we set the convergence criterion as: the absolute value of the average ΔF obj of the next 30 iterations is less than 0.001 (| t+30 t ΔF obj |/30 < 0.001) and the optimization still satisfies this requirement for the next 100 iterations. Under this criterion, the GD, GDM, and A-GDM converge at the 1221st, 512th, and 300th iteration, respectively. We can find that the A-GDM achieves the best efficiency. In specific, the optimization efficiency of the A-GDM is 4.1 and 1.7 times that of GD and GDM, respectively. The detailed results are given in Fig. 4 and Table II. We also find that for all three algorithms, the performance indicator F perf converges much faster than the binarization penalty term F bina . We can speculate that in the large-scale adjoint optimization of digital nanophotonic devices, "gray" patterns with good performance are easier to obtain, but it is difficult to find a "quasi" digital pattern with high performance and ε close to binary distribution. In addition, the GD-based optimization oscillates back and forth after the 566th iteration, which may be caused by the gradient or the step size being too large. We also find in Fig. 4(c) that the A-GDM-based optimization is jittering very slightly even after convergence. We assume that as |g t | ≈ 0 after convergence, the step size α can be much large because the r t is getting closer and closer to 0, which may help escape the convergence position to find a better result if we do not stop the optimization. The histogram of relative permittivity of all the 800 pixels at the 300th iteration (where the A-GDM converges) is shown in Fig. 4(d)-(f). We can find that the GD and GDM algorithm have 423 and 70, respectively, more "gray" pixels than A-GDM. After 2000 iterations, the GD and GDM algorithm still have 275 and 23, respectively, more "gray" pixels than A-GDM. Since the A-GDM algorithm has the fastest optimization speed and the best performance, the "quasi-digital" pattern with this algorithm is shown in Fig. 5(b). We also tried to set β 2 to 0.999, 0.99, 0.96, and 0.9. Table I shows the optimized convergence iterations and oscillation degrees for each β 2 value. Here we evaluate the degree of jitter by the maximum value of the average ΔF obj in the next 10 rounds before convergence (max( t t−10 ΔF obj /10)). This criterion may describe the speed of the objective function rising when there is an oscillation. We can find that β 2 = 0.96 may obtain great efficiency and less oscillation degree. Although β 2 = 0.9 and β 2 = 0.96 have the same efficiency, the oscillation degree of β 2 = 0.9 is larger. Compared with the recommended β 2 = 0.999 in [41], the efficiency of β 2 = 0.96 is increased by 1.7 times. So, we finally choose β 2 = 0.96.
In the third and final step, we transform the "quasi-digital" pattern into an N-ary digital pattern based on effective medium theory. As in [26], we use a ternary pattern (N = 3) based on a three-level threshold for demonstration. As shown in Fig. 3(c), the range of relative permittivity of the continuous pattern after the second-order step is divided into three segments. Cylinders with relative permittivity smaller than 4.0 or larger than 9.0 in the "quasi-digital" pattern will be simply filled with silicon or an air pattern, respectively. Meanwhile, cylinders with relative permittivity between 4.0 and 9.0 are replaced simultaneously with smaller air cylinders with the same radius based on a simple brute-force method. We search the radius from 30 nm to 40 nm with a step of 1 nm by 11 rounds of 3D FDTD simulation. Then we choose the radius corresponding to the minimum F perf as the optimal small radius. Here, the final optimal small radii for the GD, GDM, and A-GDM are 33 nm, 31nm, and 30nm, respectively. The performance with the optimal small radius is shown in Table III. The final ternary pattern for the A-GDM-based optimization is shown in Fig. 5(c) and its simulated performances are shown in Fig. 5(d)-(f).
In specific, the optimization efficiency of the A-GDM is 4.0 and 1.7 times that of GD and the GDM, respectively. The insertion loss of the ternary pattern for A-GDM is -0.93 dB and -0.97 dB. The final performance of the three optimization algorithms is not much different. For the GD algorithm, although many pixels are not fully binarized, the performance penalty is not so severe after the 3rd step since the relative permittivity are already very close to air and silicon but not fully binarized. We find that the degradation of the crosstalk T 1570,1 is severe in the 3rd step for GDM and A-GDM. It is may because resonators are more sensitive to the relative refractive index at the intrinsic  II  OPTIMIZATION RESULTS OF CWDM AFTER 2000 ITERATIONS AT THE 2 ND STEP   TABLE III  PERFORMANCE OF CWDM WITH THE OPTIMAL RADIUS AFTER THE  wavelength, thus unpredictable performance losses may occur during the ternarization process. Compared to the optimization with the DBS method [38], the A-GDM algorithm improves design efficiency by about 3.9 times. Compared to the simulated performance in [38], the insertion losses (IL) are better while the crosstalks (XT) are slightly worse.

III. INVERSE DESIGN OF THE THREE-MODE 3 DB POWER DIVIDER
We also design a broad spectrum on-chip three-mode 3 dB power divider to demonstrate the momentum-assisted adjoint method in the inverse design with multiple optimization objects. Since the device requires high transmittances for all three modes in the 1530-1570 nm range, we divide the device into 1360 pixels. As shown in Fig. 6, the power divider has a compact footprint of 4.8 μm × 4.08 μm and is discretized into 40 × 34 pixels. The device is designed on the 220 nm thick top silicon layer of the SOI platform. Each pixel of the PhC-like subwavelength structure is in the shape of a silicon cuboid (120 nm × 120 nm × 220 nm) with a central cylinder with a radius of 45 nm, where the height of each cylinder is different from the CWDM. The width of input and output waveguides is 1.2 μm, and the gap between the centers of the two output ones is 2.68 μm. The relative permittivity of silicon oxide (SiO 2 , the background material) and silicon is set to 2.085 and 12.075 in Fig. 7. Taking the incidence of the modal source TE0 at the wavelength of 1530 nm as an example, this figure shows the simulation scheme for solving the device gradient by the adjoint method. In the 2 nd step, we need to simulate the response fields E old mode and E A mode of modal sources TE0, TE1, and TE2 at the wavelength of 1.53 μm, 1.55 μm, and 1.57 μm.
simulations, respectively. The center wavelength of the design object is 1550 nm, and the performance of 1530 nm and 1570 nm wavelengths are optimized at the same time. The expression of the objective function F obj is where  where λ i represents the wavelength sampling point. Here we optimize the performance at three wavelengths, where λ 1 = 1.53 μm, λ 2 = 1.55 μm and λ 3 = 1.57 μm. out 1,2 represents the upper and lower output waveguide of the device. T m,n represents the transmittance of TE n mode in the output waveguide under TE m mode input. m = n (T mn − 0.5) 2 represents the IL performance while m =n (T mn ) 2 represents the XT performance. Therefore, there are 54 performance indicators included in the objective function. γ is the weighting factor of the binary penalty and here we set it to 3.6 × 10 −3 .
In the first step, all cylinders are uniformly filled with an intermediate material with a relative permittivity of ε ij = 7.04, and the other area is filled with silicon in the initial pattern, as shown in Fig. 9(a). Then we calculate the gradient of F obj for the current pattern by where where g is the gradient of F obj to the ε of each pixel, which is a 40 × 34 dimension vector. g perf is derived by the adjoint method which needs the simulations with the three independent modal sources at three wavelengths. Taking the incidence of TE0 mode light at a wavelength of 1530 nm as an example, the simulation scheme is presented in Fig. 7. E old (p) and E A (p) are the electric field response of the forward and adjoint simulation, respectively. χ ij is the space of each cylinder and p is all the simulation mesh points in χ ij . g bina is computed by derivation of the (18).
In the second step, we adopt the same three algorithms to optimize the pattern. For comparison, we set the step size hyperparameter a to 1.580 in the GD and GDM and 1.0 in the A-GDM. In this way, for the three different algorithms, the |Δε| max among all pixels at the first iteration is 1.0. The three optimization processes based on the three algorithms run 1000 iterations respectively, and we set the convergence criterion as: the absolute value of the average ΔF obj of the next 20 iterations is less than 0.01 (| t+20 t ΔF obj |/20 < 0.01) and the optimization still satisfies this requirement for the next 100 iterations. Other hyperparameters are the same as in the CWDM. Under this criterion, the GD, GDM, and A-GDM converge at the 137th, 237th, and 884th  iterations, respectively. We can find that the A-GDM achieves the best efficiency. In specific, the optimization efficiency of the A-GDM is 6.5 and 1.7 times that of GD and GDM, respectively. The detailed results are given in Fig. 8 and Table IV. We also find that for all three algorithms, the performance indicator F perf converges much faster than the binarization penalty term F bina . The histograms of relative permittivity of all the 1360 pixels at the 137th iteration (where the A-GDM converges) are shown in Fig. 8(d)-(f). We can find that the GD and GDM algorithm has 700 and 26, respectively, more "gray" pixels than the A-GDM. After 1000 iterations, the GD algorithm still has 32 more "gray" pixels than A-GDM. The GDM algorithm has 8 more "gray" pixels than A-GDM. In this work, the GDM and A-GDM show no significant difference in device performance. The GD algorithm takes about 6 times the time to achieve a relatively close result. Since the A-GDM algorithm has the fastest optimization speed, the "quasi-digital" pattern with this algorithm is shown in Fig. 9(b).
In the third and final step, we transform the "quasi-digital" pattern into a ternary digital pattern based on effective medium theory. The range of relative permittivity of the continuous pattern after the second-order step is divided into three segments, too. Cylinders with relative permittivity smaller than 4.25 or larger than 9.25 in the "quasi-digital" pattern will be simply filled with silicon or a silica pattern, respectively. Meanwhile, cylinders with relative permittivity between 4.25 and 9.25 are replaced with smaller air cylinders with the same radius. We then search the radius from 30 to 45 nm with a step of 1 nm, and the final optimal small radii for the GD, GDM, and A-GDM are 30 nm, 34 nm, and 31 nm, respectively. The performance of the ternary patterns for the three algorithms is shown in Table V. In specific, the optimization efficiency of the A-GDM is 5.9 and 1.6 times that of GD and GDM, respectively. The performance of the final optimal pattern is at the same level. Since the A-GDM algorithm has the fastest optimization speed and the same level of performance, the final ternary pattern with this algorithm is shown in Fig. 9(c). And its simulated spectra transmission is shown in Fig. 10(a)-(f). We can see that the excess loss (EL) for TE0, TE1, and TE2 is less than 1.05 dB, and the XT is lower than -21.90 dB during the wavelength from 1530 nm to 1570 nm.

IV. CONCLUSION
The first-and second-order momentum may assist the adjoint method to gain high efficiency during the optimization of high-dimension digital nanophotonic devices. The vanishing gradient problem brought by saddle points may cause the GD-based optimization to slow down significantly, or even stop the optimization at the saddle point. The first-order momentum may help to efficiently escape saddle points by accumulating past gradients. And the second-order momentum can adaptively adjust the step size according to the recent gradient, which may improve the efficiency where the gradients are small and keep the stability where gradients are large. For comparison, we optimize a CWDM and a three-mode 3 dB PS with the GD, GDM, and A-GDM algorithms, respectively. In the optimization of the 800-dimension CWDM, the efficiency of A-GDM is 4.0 and 1.7 times faster than that of GD and GDM, respectively. In the optimization of the 1360-dimension three-mode 3 dB PS, the efficiency of A-GDM is 5.9 and 1.6 times faster than that of GD and GDM, respectively. From this result, we can guess that as the dimension continues to increase, the improvement of the optimization efficiency may also be greater. Compared with the GD, GDM, and DBS algorithms, the optimization results obtained by the A-GDM algorithm can also achieve the same or even better performance. By breaking the efficiency and capacity bottleneck of the conventional GD algorithm, we expect that the A-GDM-based digitized adjoint method could be applied to the inverse design of higher-dimension digital nanophotonic devices with novel integrated functions.