Automated Multi-layer Optical Design via Deep Reinforcement Learning

Optical multi-layer thin films are widely used in optical and energy applications requiring photonic designs. Engineers often design such structures based on their physical intuition. However, solely relying on human experts can be time-consuming and may lead to sub-optimal designs, especially when the design space is large. In this work, we frame the multi-layer optical design task as a sequence generation problem. A deep sequence generation network is proposed for efficiently generating optical layer sequences. We train the deep sequence generation network with proximal policy optimization to generate multi-layer structures with desired properties. The proposed method is applied to two energy applications. Our algorithm successfully discovered high-performance designs, outperforming structures designed by human experts in task 1, and a state-of-the-art memetic algorithm in task 2.


Introduction
Optical multi-layer films have been widely used in many applications, such as broadband filtering [1], photovoltaics [2], radiative cooling [3], and structural colors [4]. The design of optical multi-layer films is a combinatorial optimization problem that requires one to choose the best combination of candidate materials and layer thicknesses to form a multi-layer structure. Researchers and engineers often make such designs based on their physical intuition. However, a completely human-based design process is slow and often leads to sub-optimal designs, especially when the design space is enormous. Thus, computational methods for designing optical multi-layer structures, including evolutionary algorithms [5,6,7], needle optimization [8], and particle swarm optimization [9], have been proposed to tackle this problem. All of these previous methods frame the optical design task as an optimization problem and aim to synthesize a structure that meets user-specified design criteria. However, these methods are based entirely on hand-crafted heuristics, i.e., they do not learn a model to solve the design problems. When the heuristics are sub-optimal for a task, the search process may fail to identify a high-performance design.
In contrast, deep reinforcement learning (DRL) is a learning framework that learns to solve complex tasks through a trial-and-error process. It has proven to be highly scalable for solving large-scale and complicated tasks [10,11]. Researchers have successfully applied DRL to various combinatorial optimization problems [12,13,14,15]. Unlike heuristic-based search, reinforcement learning methods learn a model using the reward signal [16] and do not depend on hand-crafted heuristics.
On some combinatorial optimization tasks, DRL has been shown to outperform classic heuristic search methods [17]. Recently, researchers applied DRL to designing optical devices with a structure template [18,19], where the number of layers is fixed. However, when designing optical multi-layer films, we often do not know the optimal structure template. Thus, the previous DRL approaches are not suitable for multi-layer designs.
Noticing that the multi-layer optical design task is equivalent to a sequence generation problem, we propose a DRL method called Optical Multi-layer Proximal Policy Optimization (OML-PPO) that can generate near-optimal multi-layer structures. We applied the proposed method to two optical design tasks that are relevant to energy applications (Figure 1a): 1) ultra-wideband absorbers that can enhance light-harvesting efficiency, e.g., for thermal photovoltaics and photothermal energy conversion, and 2) incandescent light bulb filters that can improve light bulb efficiency in emitting visible light. On these two tasks, we show that OML-PPO outperforms existing human expert designs and a state-of-the-art memetic algorithm.

Proposed Approach
Multi-layer films can be represented as sequences. Each layer is written as $(m_l, d_l)$, where $m_l$ and $d_l$ denote the material and the thickness of the $l$-th layer (counting from the top), respectively. When designing optical multi-layer films, we hope to synthesize a sequence that has the desired target spectral response $T$. Toward this end, we train a sequence generation network based on a GRU [20] with reinforcement learning [16]. The process for generating an optical layer sequence is illustrated in Figure 1b.
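The sequence view can be made concrete with a minimal sketch. It assumes a uniform random sampler in place of the GRU policy, and the names `sample_structure` and `EOS` are hypothetical; only the stopping rule (an end-of-sequence token or a maximum length) follows the text and Figure 1b.

```python
import random

EOS = "EOS"  # hypothetical name for the end-of-sequence token


def sample_structure(materials, thicknesses, max_len, rng):
    """Draw a layer sequence [(m_1, d_1), ..., (m_l, d_l)], top layer first.

    Uniform random choices stand in for the learned policy (an assumption).
    Generation stops when EOS is drawn or when max_len layers exist.
    """
    layers = []
    while len(layers) < max_len:
        material = rng.choice(materials + [EOS])
        if material == EOS:
            break
        layers.append((material, rng.choice(thicknesses)))
    return layers


rng = random.Random(0)
materials = ["SiO2", "Ti", "MgF2"]
thicknesses = list(range(15, 201, 5))  # nm
structure = sample_structure(materials, thicknesses, max_len=6, rng=rng)
print(structure)
```

In the trained model, both choices would instead be drawn from the distributions produced by the sequence generation network.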

Non-repetitive gating
To explore the design space more efficiently, we introduce a non-repetitive gating function that removes the logit element corresponding to the most recently sampled material, which prevents the sequence generator from generating the same material twice in a row. This gating function is a matrix $G$ formed by removing the row corresponding to the most recently sampled material from an identity matrix. When multiplied with the logits vector $\sigma_{m_l}$, the element corresponding to that material is removed, i.e., $\sigma'_{m_l} = G \sigma_{m_l}$. Then, we pass the transformed logit vector $\sigma'_{m_l}$ to the softmax layer to obtain the sampling probability. By doing so, we set the sampling probability for the recurring material to 0. With the non-repetitive gating, the generated material sequence is guaranteed to have different materials for adjacent layers.
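The gating step can be illustrated in plain Python. This is a sketch, not the authors' PyTorch code: the helper names are hypothetical, and scattering the reduced probabilities back to full length is added here only to make the zero probability visible.

```python
import math


def gating_matrix(n, last):
    """Identity matrix of size n with row `last` removed: shape (n - 1) x n."""
    return [[1.0 if c == r else 0.0 for c in range(n)]
            for r in range(n) if r != last]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gated_probabilities(logits, last):
    """Sampling probabilities over all entries, with zero mass on `last`."""
    reduced = softmax(matvec(gating_matrix(len(logits), last), logits))
    kept = [i for i in range(len(logits)) if i != last]
    probs = [0.0] * len(logits)
    for index, p in zip(kept, reduced):
        probs[index] = p
    return probs


# Material 0 was sampled for the previous layer, so it cannot be drawn again.
probs = gated_probabilities([2.0, 0.5, -1.0, 0.0], last=0)
print(probs)
```

Even though index 0 carries the largest logit in this example, it receives no probability mass, and the remaining entries are renormalized by the softmax.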
Auto-regressive generation of material and thickness
Because the proper thickness of a layer should depend on the material, we input the sampled material $m_l$ to the thickness MLP in addition to the hidden state $h_l$. A similar approach has been applied in RL problems where the actions are dependent on each other [11]. Instead of using a one-hot vector to represent the material, we train a material embedding matrix $\mathrm{emb} \in \mathbb{R}^{|M| \times d}$ together with the sequence generator network.
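A rough sketch of this auto-regressive connection follows. Several details are assumptions rather than taken from the text: the embedding is concatenated with the hidden state, a single linear layer stands in for the thickness MLP, the weights are random rather than trained, and the hidden size of 8 is arbitrary. The counts of 16 materials, 38 thicknesses, and $d = 5$ match the Task 1 setup.

```python
import random


def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def thickness_logits(hidden, material_index, emb, weights, bias):
    """Thickness head fed with the hidden state plus the material embedding.

    Concatenation is an assumed way of combining the two inputs.
    """
    features = hidden + emb[material_index]  # list concatenation
    return linear(weights, bias, features)


rng = random.Random(0)
num_materials, d, hidden_size, num_thicknesses = 16, 5, 8, 38

# Embedding matrix emb in R^{|M| x d}; learned in practice, random here.
emb = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_materials)]
weights = [[rng.gauss(0, 0.1) for _ in range(hidden_size + d)]
           for _ in range(num_thicknesses)]
bias = [0.0] * num_thicknesses
hidden = [rng.gauss(0, 1) for _ in range(hidden_size)]

logits_a = thickness_logits(hidden, 0, emb, weights, bias)
logits_b = thickness_logits(hidden, 1, emb, weights, bias)
```

With the same hidden state, different sampled materials produce different thickness logits, which is the dependency the auto-regressive connection is meant to capture.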

Reinforcement learning training
We train the sequence generation network with reinforcement learning. We set the reward to be 0 for all generation steps except the final step. At the final step (i.e., when the structure $S$ has been completely generated), we compute the spectrum of the generated structure with the optical spectrum calculation package TMM [21] and assign the final reward based on how well the structure spectrum matches the target spectrum. Here, $T_S(\lambda_j, \delta_k)$ is the spectrum of the generated structure $S$ at wavelength $\lambda_j$ under incidence angle $\delta_k$. Because $T \in [0, 1]$, the cumulative reward is always non-negative. The reward value becomes higher as the spectrum $T_S$ gets closer to the target spectrum $T$, until it reaches 1 when the structure spectrum perfectly matches the target spectrum. We use a state-of-the-art policy gradient algorithm, PPO [22], to train the sequence generation network. Similar to the active search approach in Bello et al. [12], we output the best structure discovered throughout the entire training process (Figure 2b). Our model is implemented using PyTorch [23] and Spinning Up [24]. The data used in this study and our code will be publicly available.
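One reward with the properties stated above (non-negative for spectra in [0, 1], increasing as the spectra get closer, and equal to 1 on a perfect match) is one minus the mean absolute error over all sampled wavelengths and incidence angles. The sketch below uses that form as an assumption; the exact formula is not reproduced here, and the function name `spectrum_reward` is hypothetical.

```python
def spectrum_reward(structure_spectrum, target_spectrum):
    """Return 1 minus the mean absolute error between two spectra.

    Both arguments are nested lists indexed as [wavelength j][angle k],
    with values in [0, 1]. The mean-absolute-error form is an assumption.
    """
    errors = [
        abs(s - t)
        for s_row, t_row in zip(structure_spectrum, target_spectrum)
        for s, t in zip(s_row, t_row)
    ]
    return 1.0 - sum(errors) / len(errors)


# Three wavelengths, two incidence angles, ideal absorber target.
target = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
perfect = spectrum_reward(target, target)
partial = spectrum_reward([[0.9, 1.0], [0.8, 1.0], [1.0, 0.7]], target)
print(perfect, partial)
```

In training, this value would be assigned only at the final generation step, with every earlier step receiving a reward of 0.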

Experiment
We applied the proposed method to two optical design tasks that are relevant to energy applications, i.e., 1) designing ultra-wideband absorbers and 2) designing incandescent light bulb filters. In task 1 (ultra-wideband absorber design), we measure the quality of the designed structure by its average absorption. In task 2 (incandescent light bulb filter design), we calculate the visible light enhancement factor to measure the performance of the designed structures. We also conducted an ablation study to understand the effect of non-repetitive gating and auto-regressive material/thickness sampling.
Task 1: ultra-wideband absorber
First, we apply our algorithm to the task of designing an ultra-wideband absorber for the wavelength range [400, 2000] nm. We choose the target spectrum as a constant 100% absorption under normal light incidence (i.e., the light strikes the absorber at a right angle) to represent an ideal broadband absorber. This task has been previously studied by Yang et al. [1] based on physical models, where the broadband absorption is achieved by overlapping multiple absorption resonances and by using an overall graded-index structure to minimize reflection. The authors designed a 5-layer structure using MgF2, TiO2, Si, Ge, and Cr. The simulated average absorption of their structure over the wavelength range is 95.37% under normal incidence. Unless otherwise specified, we assume normal incidence when reporting average absorption.
We hypothesize that, when choosing from a larger set of materials than used in the previous work [1], it is possible to design a structure with higher average absorption than the human-designed structure. Thus, we expanded the original material set [1] to allow 11 more materials: Ag, Al, Al2O3, Fe2O3, HfO2, Ni, SiO2, Ti, ZnO, ZnS, and ZnSe. We set the available discrete thicknesses $D$ to be {15, 20, 25, ..., 200} nm, a total of 38 different values. When training the sequence generator, we set the learning rate to $5 \times 10^{-5}$ and the maximum length to $L = 6$. The material embedding size $d$ is set to 5, i.e., $\mathrm{emb}_m \in \mathbb{R}^5$. The generator is trained for a total of 3,000 epochs with the batch size set to 1,000 generation steps. We repeat the training for 10 runs with different random seeds. The best structure discovered by the algorithm, exhibiting an average absorption of 97.64%, is {(SiO2, 115 nm), (Fe2O3, 70 nm), (Ti, 15 nm), (MgF2, 124 nm), (Ti, 148 nm)}. Its average absorption is 2.27 percentage points higher than that of the expert-designed structure. The spectrum under normal incidence is plotted in Figure 3a.
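The size of the Task 1 design space can be checked with a few lines of arithmetic. The count below is an upper bound under two simplifications introduced here: it ignores the non-repetition constraint between adjacent layers, and it counts every sequence length from 1 to $L = 6$.

```python
# Discrete thickness grid: 15, 20, ..., 200 nm.
thicknesses = list(range(15, 201, 5))

# The five materials used by Yang et al. [1] plus the eleven added ones.
original_materials = ["MgF2", "TiO2", "Si", "Ge", "Cr"]
added_materials = ["Ag", "Al", "Al2O3", "Fe2O3", "HfO2", "Ni",
                   "SiO2", "Ti", "ZnO", "ZnS", "ZnSe"]
materials = original_materials + added_materials

max_layers = 6
choices_per_layer = len(materials) * len(thicknesses)

# Upper bound on the number of candidate structures with 1 to 6 layers.
upper_bound = sum(choices_per_layer ** n for n in range(1, max_layers + 1))

print(len(thicknesses), len(materials), choices_per_layer)
print(f"{upper_bound:.2e}")
```

The grid has 38 thicknesses and 16 materials, giving 608 choices per layer and a bound on the order of $10^{16}$ candidate structures, which is far too many to enumerate exhaustively.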
Task 2: incandescent light bulb filter
We further applied the proposed method to design a filter that can enhance the luminous efficiency of incandescent light bulbs [25,26]. The goal is to reflect the infrared light emitted by the light bulb filament so that its energy can be recycled. To this end, we set the target reflectivity to be 0% in the range [480, 700] nm, and 100% outside this range (Figure 3b).
In this way, the infrared light, which cannot contribute to lighting, will be reflected back to heat up the emitter. A similar design has been previously studied [26,6]. We choose the same seven dielectric materials as the available materials: Al2O3, HfO2, MgF2, SiC, SiO2, and TiO2 [6]. In Figure 3b, we compare the average reflectivity normalized over all incidence angles (0-90 degrees) of the 42-layer structure designed with our algorithm and the 41-layer structure designed by a memetic algorithm [6]. Our structure has a higher average reflectivity in the infrared range (> 780 nm) than the 41-layer structure. Following a previous work [6], we calculated the enhancement factor for the visible range (400-780 nm) under a fixed operating power. Our designed structure achieved an enhancement factor of 16.60, compared with 15.30 for a previously reported structure [6].
Ablation study
On the ultra-wideband absorber design task, we conducted an ablation study to understand the effect of non-repetitive gating and auto-regressive generation of materials and thicknesses. We trained four different models:
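The Task 2 target reflectivity is a simple band-stop profile and can be written as a small function. The function name and the wavelength sampling grid below are hypothetical; only the [480, 700] nm band and the 0%/100% levels come from the text.

```python
def target_reflectivity(wavelength_nm, band=(480.0, 700.0)):
    """Ideal filter target: 0 inside the transmission band, 1 outside it."""
    low, high = band
    return 0.0 if low <= wavelength_nm <= high else 1.0


# Hypothetical sampling grid in nm, chosen only for illustration.
wavelengths = list(range(300, 2501, 10))
target = [target_reflectivity(w) for w in wavelengths]

print(target_reflectivity(550), target_reflectivity(1000))
```

A profile like this would serve as the target spectrum $T$ against which the spectrum of each generated structure is scored.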

Conclusion
We introduced a novel sequence generation architecture and a deep reinforcement learning pipeline to automatically design optical multi-layer films. To the best of our knowledge, our work is the first to apply deep reinforcement learning to design multi-layer optical structures without fixing the number of layers beforehand. On two real-life applications, we show that our proposed algorithm achieves better performance than human experts and a state-of-the-art memetic algorithm. In the future, we plan to apply the proposed algorithm to more complicated designs that involve micro-nano structures [27].

Broader Impact
Optical multi-layer thin films are widely used in imaging and energy applications. With our proposed algorithm, researchers can design multi-layer thin films automatically with higher performance than existing approaches. In addition, we believe that our algorithm can be applied to many other optical design tasks based on multi-layer structures.

Figure 1: Illustration of two energy applications of optical multi-layer films (a) and the sequence generation process for optical multi-layer thin film designs (b). (a) For solar thermal panels, we can use multi-layer films as ultra-wideband absorbers to enhance light absorption efficiency. For incandescent light bulbs, we can coat multi-layer films on them to improve luminous efficiency by reflecting infrared light while transmitting visible light. (b) The generation process stops when either the EOS token is sampled or the length of the sequence reaches the maximum allowed length L.

Figure 2: Neural network architectures for generating optical multi-layer films (a) and design generation pipeline (b). (a) Built upon the baseline architecture (left), our proposed model (right) adds a non-repetitive gating function and an auto-regressive connection between the sampled material and the thickness MLP. (b) Pipeline of the sequence generator training process.

Figure 3: Spectrum of the designed perfect absorber (a), incandescent light bulb filter (b), and emissive power spectrum of the light bulb filter (c). R: reflection, T: transmission, A: absorption. (a) The designed structure achieves almost perfect absorption between 500 and 1,600 nm. (b) The structure designed by our proposed algorithm achieves higher reflectivity in the infrared range than the structure designed by a memetic algorithm. We compute the spectrum under two view factors f = 1 and f = 0.95 (the ratio of light emitted by the light bulb filament that can reach the wall of the light bulb). (c) The structure designed by OML-PPO achieves higher emissive power than the previously reported structure.

1) OML-PPO with both non-repetitive gating and auto-regressive generation, 2) non-repetitive gating only, 3) auto-regressive generation only, and 4) neither non-repetitive gating nor auto-regressive generation. For each model, we repeated the training ten times. The maximum absorption values discovered by each model before finetuning are reported in Table 1. Both non-repetitive gating and auto-regressive material/thickness generation improve the performance of the baseline model.

Table 1: Highest absorption values discovered by each algorithm across 10 runs. The mean average absorption values and standard deviations of the 10 runs are reported.