OptoGPT: A Foundation Model for Inverse Design in Optical Multilayer Thin Film Structures

Optical multilayer thin film structures have been widely used in numerous photonic applications. However, existing inverse design methods suffer from notable drawbacks: they either fail to adapt quickly to different design targets or are difficult to apply to different types of structures, e.g., designs with different materials at each layer. These methods also cannot accommodate versatile design situations under different angles and polarizations. In addition, their potential to benefit practical fabrication and manufacturing has not been extensively considered. In this work, we introduce OptoGPT (Opto Generative Pretrained Transformer), a decoder-only transformer, to address all of these issues simultaneously.


Introduction
Optical multilayer thin film structures are among the most vital photonic structures and, owing to their ease of fabrication, are widely used in many applications, including structural color 1,2 , filters 3 , absorbers 4 , distributed Bragg reflectors 5,6 (DBR), Fabry-Pérot 7 (FP) resonators, photovoltaics 8 , and radiative cooling 9,10 , among others.
The inverse design of multilayer structures requires identifying the best material arrangement and thickness combination to achieve user-desired optical targets. Previously, researchers and engineers specified material arrangements based on their domain expertise and simplified inverse design to iterative thickness optimization, using methods such as particle swarm optimization 11 , needle optimization 12 , and genetic algorithms 13 . These optimization-based methods usually rely on numerical simulations and iterative searches to minimize the difference between simulated and targeted optical properties. However, a high-performance design may not be discovered if the human-specified material arrangement is not optimal. A global design method that can determine the total number of layers and the material arrangement would broaden the design space significantly and lead to better performance.
The main difficulty of global design originates from the fact that materials are usually dispersive, making design problems more intractable for broadband applications. Recently, several methods have been proposed to tackle this challenge. For example, Shi et al. 14 proposed a memetic algorithm that incorporates material selection into the design process and designed a high-performance radiative cooling device. Wang et al. 15 framed inverse design as sequence generation and developed a reinforcement-learning-based algorithm, OML-PPO, to determine the number of layers and material combinations automatically. Finally, Zhang et al. 16 combined GLOnet 17 with the reconciled level set method to optimize both material and thickness for a target optical transfer function. However, all these methods are inefficient because they require iterative evaluations and time-consuming simulations, which is problematic when the target changes, since a new design process must be started from scratch.
To make the design process more efficient, many deep-learning-based methods have been proposed and explored, including tandem networks 18 , GANs 19 , and MDNs 20,21 . Because of the strong generalization ability of neural networks, these methods can learn a general mapping from the space of optical targets to the space of optical multilayer thin film structures. After training on a dataset, they can generate designs instantaneously for different targets without iterative and time-consuming evaluations and simulations.
Unfortunately, although efficient, these methods fail to accommodate global design. This is because such neural networks have a fixed output size, which precludes material selection and structures with a varying number of layers.
So far, no algorithm can simultaneously address the challenges of global design and efficient design. In this work, we seek new solutions by turning our attention to the transformer 22 , a family of neural network models initially proposed for natural language processing (NLP). Different from other neural networks, the transformer uses the attention mechanism to weigh the importance of different words in a sentence relative to each other, empowering the parallel learning of long-term dependencies and highly complex relationships among inputs and outputs. Because of this, the transformer has been widely used as the backbone of foundation models 23 in NLP 24 , computer vision 25 , and reinforcement learning 26 . Recently, researchers have started using transformers to inverse design multilayer thin film structures. For example, Chen et al. 27 proposed the Metamaterial Spectrum Transformer (MST) to design arbitrary absorbers using a six-layer MgF2/SiO2/Al2O3/TiO2/Si/Ge structure.
However, MST uses an encoder-only transformer that maps absorption spectra to a fixed material combination. Therefore, MST is a thickness-only design method that still cannot tackle the challenge of global design. In addition, the small dataset used for training (0.15M samples, compared with the millions or billions used in NLP) limits its capability outside absorber applications. In this work, we propose the Opto Generative Pretrained Transformer (OptoGPT) to resolve the conflict between global design and efficient design. Similar to other GPT models such as GPT-3 24 and ChatGPT 28 , our OptoGPT is a decoder-only transformer that generates multilayer structures layer by layer in an auto-regressive way. To incorporate materials into the design process, we introduce structure serialization, which uses a single token to represent both the material and thickness of a layer. We also apply spectrum embedding to facilitate the learning of the complex relationship between spectra and structures. In this way, our model can design in the global space and determine the total number of layers (up to 20), materials (up to 18 types), and thicknesses simultaneously. In addition, we use a large dataset comprising 10 million designs for training. This large-scale dataset enables our model to capture complicated relationships and expands its design ability towards diverse applications, including structural colors, filters, absorbers, DBRs, and FP resonators. Our model is also efficient in all these design applications. On average, OptoGPT can complete each design within 0.1 seconds while consistently achieving design results better than those of baseline methods. Benefiting from this design efficiency, our model can also output multiple designs with minimal effort and exhibits high design flexibility under different design constraints, which is beneficial for fabrication and design considerations. For example, given a constraint on the material arrangement at each layer, OptoGPT functions as a direct thickness optimizer that circumvents the iterative optimization process.
Based on the empirical results obtained in this study, we believe that OptoGPT can serve as a foundation model for the design of optical multilayer thin films across a diverse array of applications. By offering highly performant initial optical designs with computational efficiency, our model has the potential to streamline the design process, reduce the need for extensive manual iterations, and accelerate the development of advanced optical systems. Furthermore, by serving as a reliable starting point for researchers and engineers, our model facilitates the exploration of novel designs and the optimization of optical performance within the constraints of material properties and fabrication techniques. As such, OptoGPT will contribute to the field of photonics by enhancing the accessibility and effectiveness of optical design methodologies.

Implementation of OptoGPT
Our OptoGPT works similarly to other GPT models. In NLP, GPT is an auto-regressive language model that produces text output given initial text as a prompt (see Fig. 1(a)). The initial prompt can be a question, a task description, or anything necessary for the model to understand what to expect as outputs.
During training, pairs of text prompts and expected answers are fed together to the GPT, with the training goal of recovering the expected answers from the model's probability output. At testing time, GPT takes in only the text prompt; answers are generated auto-regressively. This process also applies to our OptoGPT (see Fig. 1(b)). Specifically, OptoGPT takes optical targets as the input prompt and outputs the corresponding multilayer structure design. Fig. 1(c) demonstrates several examples of input prompts. To unify the space of design targets across various applications, we convert all design targets to reflection and transmission spectra under normal incidence. The considered wavelengths cover the visible and near-infrared (NIR) region, spanning from 400 nm to 1100 nm with a 10 nm step.
Because GPT requires sequential data as input, we propose and apply structure serialization to convert a multilayer structure into a sequence of tokens. Fig. 1(d) gives one example of serializing a twenty-layer structure on a glass substrate using a sequence of twenty-one tokens. The first twenty tokens describe the material and thickness at each layer, and their relative positions inside the sequence denote the relative order of the layers in the multilayer structure. Here, the material is selected from a database with 18 different materials (see SI 1.1 for refractive index data), and the thickness is selected from a discretized range of [10, 500] nm with a step size of 10 nm. The last of the twenty-one tokens is a special 'EoS' token that denotes the end of the sequence. Therefore, for each layer in the multilayer structure, there are 18 × 50 + 1 = 901 possible tokens, corresponding to 900 different combinations of material and thickness plus one special 'EoS' token. During inverse design, OptoGPT outputs a probability distribution over all 901 tokens. Sampling from this distribution gives the design of each layer. If 'EoS' is sampled, OptoGPT terminates the design process and outputs the existing sequence as the designed structure. We set the maximum number of layers to 20, making the total number of multilayer structures under design consideration (901)^20 ≈ 10^59. Utilizing this approach, our model can determine the total number of layers required for a given design target, as well as select the appropriate material and thickness for each individual layer.
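As a concrete illustration, the serialization scheme above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the material list below is an illustrative subset (the paper's database has 18 materials, see SI 1.1), and the token numbering (EoS as token 0, layer tokens starting at 1) is our own convention, not necessarily the one used inside OptoGPT.

```python
# Illustrative material subset; the paper's database has 18 materials (SI 1.1).
MATERIALS = ["Ag", "Al", "SiO2", "TiO2", "Si3N4", "MgF2"]
THICKNESSES = list(range(10, 501, 10))  # 50 choices, 10 nm steps

EOS = 0  # our convention: the special 'End of Sequence' token

def token_id(material, thickness):
    """Map a (material, thickness_nm) layer to an integer token id."""
    m = MATERIALS.index(material)
    t = THICKNESSES.index(thickness)
    return 1 + m * len(THICKNESSES) + t

def serialize(structure):
    """Structure = [(material, thickness_nm), ...], ordered top layer first."""
    return [token_id(m, t) for m, t in structure] + [EOS]

def deserialize(tokens):
    """Invert serialize(); stop at the first EoS token."""
    layers = []
    for tok in tokens:
        if tok == EOS:
            break
        m, t = divmod(tok - 1, len(THICKNESSES))
        layers.append((MATERIALS[m], THICKNESSES[t]))
    return layers
```

For example, `serialize([("Ag", 20), ("SiO2", 150), ("Ag", 50)])` maps a three-layer stack to four tokens, the last being EoS, and `deserialize` recovers the original layer list.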
We also apply spectrum embedding to facilitate the learning of complex relationships between spectra and multilayer structures. To understand how it works, we illustrate the detailed architecture of OptoGPT in Fig. 2(a). Similar to GPT, our OptoGPT stacks N identical decoder blocks (we set N = 6). Only the first decoder block takes in the hidden representation of the input sequence of tokens, which is obtained after passing through a learnable physical embedding and a positional embedding. All other decoder blocks take in the output of the feed-forward layer in the previous decoder block. There are also two consecutive attention layers inside each decoder block. The first, a self-attention layer, learns the relationships among layered structures, while the second, a cross-attention layer, captures the relationship between the input spectra and the multilayer structure. Because spectra and multilayer structures live in two different spaces, we introduce spectrum embedding to bridge this gap and map the space of spectra into the same hidden space as the multilayer structures, so that our model can recognize them and perform the attention operation (see the visualization of the hidden space in a later section). A more detailed description of the model architecture can be found in SI 1.3.

Inverse design and performance
We illustrate the auto-regressive design process with probability sampling 30 in Fig. 3(a). Because this auto-regressive design process does not require any iterative simulations or evaluations, it finishes within 0.1 seconds, enabling inverse design to adapt to different design tasks quickly while still running as fast as a TMM simulation (see the time-consumption comparison in Fig. 3(d)). In addition, running the sampling process multiple times leads to different outputs, which yields multiple designs with minimal effort. As a comparison, to obtain a different design, most optimization-based methods require restarting the iteration from different initial points.
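The auto-regressive sampling loop described above can be sketched as follows. Here `model` is a stand-in for the trained decoder, assumed to return one logit per token given the target spectrum and the partial sequence; the function names and the 901-token vocabulary layout are illustrative assumptions, not the paper's actual API.

```python
import math
import random

EOS, VOCAB, MAX_LAYERS = 0, 901, 20  # EoS token, vocabulary size, layer cap

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def design(model, target_spectrum, rng=random):
    """Auto-regressive design: sample one layer token at a time until
    'EoS' is drawn or the 20-layer maximum is reached (Fig. 3(a))."""
    seq = []
    for _ in range(MAX_LAYERS):
        probs = softmax(model(target_spectrum, seq))  # distribution over 901 tokens
        tok = rng.choices(range(VOCAB), weights=probs)[0]
        if tok == EOS:
            break
        seq.append(tok)
    return seq
```

Running `design` several times with different random draws yields the multiple candidate structures mentioned above.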
First, we evaluate the performance of inverse design on 1000 spectra targets randomly selected from the validation dataset. The Mean Absolute Error (MAE) between the target spectrum and the spectrum of the designed structure is used to quantify design performance. In Fig. 3(b), we compare the MAEs of the closest structures in the training dataset (orange dots), designed structures (blue dots), and finetuned structures (red dots). The closest structure in the training dataset is treated as the performance baseline. We iterate through the training dataset and select the structure with the lowest spectrum MAE as the closest structure. On average, the MAE of the designed structures is 0.0258, which is lower than the MAE of the closest structures (0.0296); finetuning the thickness can further reduce the MAE to 0.0192 (~24% reduction).
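The evaluation metric is straightforward to reproduce. A minimal sketch (function names are ours): the MAE averages the absolute difference over all sampled points of the reflection and transmission spectra, and picking the best of several sampled designs mirrors the best-of-five protocol described in Fig. 3(b).

```python
def spectrum_mae(target, designed):
    """Mean absolute error over all sampled points of the R/T spectra."""
    assert len(target) == len(designed)
    return sum(abs(a - b) for a, b in zip(target, designed)) / len(target)

def best_design(simulate, target, candidates):
    """Of several sampled designs, keep the one whose simulated spectrum
    is closest to the target (the best-of-five rule in Fig. 3(b))."""
    return min(candidates, key=lambda s: spectrum_mae(target, simulate(s)))
```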
We detail one such inverse design example in Fig. 3(e). In addition, in Fig. 3(c), we compare the number of layers in the target structure (the structure corresponding to the target spectrum in the validation dataset) vs. the number of layers in the designed structure. The zero entries above the diagonal imply that our OptoGPT learns to solve design tasks using simplified structures with fewer layers (~6 layers on average), which can facilitate the fabrication process, as structures with fewer layers are easier to make.
Next, we use our model to design structures for different applications, including structural color, absorbers, filters, DBR, and FP resonators (discussed later). We use an artificial spectrum for each task as the design target (see Section 2 in SI). In Fig. 4(a), we describe each design task in detail, summarize the designed structures, and compare their spectrum performance. We also illustrate the performance of some design tasks in Fig. 4(b-e). Fig. 4(b-d) shows the detailed spectrum comparison for the tasks of "perfect absorption in 400-1100 nm", "band-notch filter at 550 nm", and "high reflection in NIR", respectively. In Fig. 4(e), we give some examples of designing reflective and transmissive structural colors. A color-to-spectrum conversion is done ahead of the design process (see SI 2.3). More design examples for each type of application are provided in Section 2 in SI. Notably, our model can reproduce structures similar to those reported in refs. 27 and 31 when designing absorbers. Because our dataset is randomly generated, it is almost impossible for the training dataset to contain a spectrum identical to the artificial target. However, our model can still complete these designs, demonstrating strong learning and generalization ability from the dataset.

High design flexibility
Design flexibility adds extra freedom to the design process because researchers can arbitrarily impose restrictions on the material selection and thickness range at any layer, specific to fabrication or design needs. We apply a simple but general method based on probability sampling to impose restrictions during the design process. As illustrated in Fig. 5(a), this is done by removing tokens that do not satisfy the constraints from the probability sampling. Since this method is independent of the spectrum target, it can be used when designing any application. As an example, we use OptoGPT to inverse design an FP resonator. Here, the target spectrum has a resonance absorption at 610 nm (see Fig. 5(c)) and corresponds to a three-layer 20 nm Ag/150 nm SiO2/50 nm Ag resonator on a glass substrate. Without any constraint, OptoGPT outputs multiple designs with a low spectrum MAE (see Fig. 5(c)). However, each structure varies significantly from the others (see Fig. 5(b)).
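The constrained-sampling rule of Fig. 5(a) amounts to masking the model's output distribution and renormalizing it. A minimal sketch (representing the constraint as a set of permitted token indices is our own framing):

```python
def constrained_probs(probs, allowed):
    """Zero out tokens that violate the constraint and renormalize the rest,
    as in Fig. 5(a). `allowed` is the set of permitted token indices."""
    masked = [p if i in allowed else 0.0 for i, p in enumerate(probs)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("constraint removes all probability mass")
    return [p / total for p in masked]
```

For example, the constraint "remove Ag from the third layer" would exclude every token whose material component is Ag when sampling that layer, leaving the relative probabilities of the remaining tokens unchanged.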

Now we consider adding four different constraints separately:
1. Fix the first layer to be 100 nm SiO2.
2. Remove Ag from the third layer.
3. Limit the thickness of the first layer to [10, 150] nm and remove Ag/Al from the first layer.
4. Specify the material arrangement to be a three-layer Ag/Si3N4/Ag structure and design the thickness only.
The first constraint can be used when a dielectric layer is required at the air interface for protection, while the second constraint is practical when looking for an alternative to silver, which is a precious metal. The third constraint serves as a general example of adding thickness and material restrictions simultaneously. We list the designed structures in Fig. 5(b) and plot their spectra in Fig. 5(d), demonstrating that OptoGPT can complete designs that satisfy the desired constraints while still guaranteeing spectrum performance. After examining these designed structures, we find that our model uses two strategies to satisfy the second constraint. The first strategy uses aluminum in the third layer as the reflective layer (see structures C2.1 and C2.2 in Fig. 5(b)), while the second strategy adds one extra thick dielectric layer on top of the FP resonator (see structures C2.3, C2.4, and C2.5 in Fig. 5(b)). Meanwhile, to satisfy the thin thickness range in the third constraint, our model selects dielectric materials with a relatively high refractive index for the first layer.
In particular, the fourth constraint specifies the material at each layer and requires only thickness design. This is a traditional design strategy widely used by human experts and in many optimization-based methods. Our results show that, given the spectrum target and material arrangement, OptoGPT can be used as a direct thickness designer without iterative optimization. Since this feature does not depend on the target, researchers can quickly examine whether a certain material combination can achieve the target spectrum and, if so, obtain the corresponding thicknesses. We provide more examples of design flexibility in Section 3 in SI.

Visualization of learned hidden representations
To gain insight into why the OptoGPT model shows high efficiency in multilayer structure inverse design, we would like to understand whether our model has learned to sort out the intricate relationships between spectra, materials, and thicknesses. To do this, we use t-distributed stochastic neighbor embedding 32 (t-SNE) to reduce the high-dimensional hidden representations to two dimensions for visualization. The results are shown in Fig. 6, obtained by taking all 900 structure tokens used in training and 1000 spectra randomly selected from the validation dataset. Several interesting features are immediately observed. First, the physical structures (colored traces consisting of individual dots representing the structure tokens) and optical spectral responses (encircled cluster of green crosses) are well separated in this 2-D representation, even though they are fed into training on an equal footing. This demonstrates that our model has learned to distinguish the attributes of material structures and optical spectra while mapping them into the same hidden representation space. Second, the 900 structure tokens are easily distinguishable, either as colored curves (whose starting and ending points correspond to thicknesses of 500 nm and 10 nm, respectively) or as clusters of dots, with no overlap between different materials. Upon close examination, OptoGPT has intelligently separated the low-refractive-index dielectrics from the high-refractive-index dielectrics (zoomed views in (i) and (ii)). Within these two groups, all curves converge to the center region representing the lowest thickness of 10 nm. This makes perfect sense: when the dielectric layer thickness is reduced to the minimum, all materials behave similarly, as they contribute negligible optical phase change or optical absorption (in the case of high-index materials). In other words, OptoGPT has learned that thin dielectric layers of different materials all have a similar effect on light propagation in multilayer thin films. Equally interesting is that all the metals cluster into their own territories in this 2-D map. This can be easily explained: when the metal layer thickness is greater than the optical penetration depth, its contribution to the optical response (i.e., the spectra) has little dependence on the thickness. These observations demonstrate that, even though our model does not directly take in any refractive index or thickness, it is capable of capturing this information and learning hidden representations from a large dataset, validating the use of structure serialization and spectrum embedding. This also aligns well with the strong representation capabilities demonstrated in many other foundation models such as Galactica 33 , GaTo 34 , and PaLM-E 35 .

Discussion
For a long time, photonic inverse design has been regarded as a much more difficult problem than simulation and has presented high entry barriers for those interested in searching for optical designs for their specific applications. We hope the development of OptoGPT will make inverse design of multilayer thin film structures as accessible as traditional optical simulation of such structures. By treating inverse design as sequence generation conditioned on the optical targets and using a large-scale dataset for training, our OptoGPT outperforms existing inverse design methods in four aspects (see Table S2 in SI). This result shows that OptoGPT can act as a foundation model for multilayer thin film inverse design, making inverse design as easy, fast, and straightforward as running a simulation.
Our findings regarding the hidden representations of OptoGPT suggest that it has acquired domain-specific knowledge pertaining to optical multilayer structures through the training process. Furthermore, the model has demonstrated the capacity to apply this acquired knowledge effectively in the inverse design process. However, the current framework lacks explainability and does not allow users to directly understand the physical principles involved in its designs. For example, is there a general principle for designing absorbers and DBRs? How can one design high-saturation structural colors step by step? We hope future work can find a way to extract, formulate, and apply such design principles from the model to guide inverse design.
Finally, our method has some limitations, as it requires a large dataset for training, which is also a common criticism of many large language models. For example, ChatGPT is trained on billions of tokens using ~10,000 GPUs, costing ~$10M for a single training run. In this work, because of constraints on computation resources, we had to simplify our design problems, including using a limited set of materials, a limited spectrum range, thickness discretization, and a cap on the maximum number of layers, all of which can be extended with more computation resources. Despite using a large-scale dataset with 10 million samples for training, it is important to recognize that this dataset covers only a small fraction of the expansive and complex design space associated with optical multilayer thin film structures. Due to this limitation of the training dataset, OptoGPT may fail to generate performant designs when tasked with targets that lie outside the boundaries of the sampled design space (see SI 4.2). Close collaboration across multiple research groups is needed to obtain a better model for photonic inverse design.

Fig. 2(b) illustrates the training procedure of our OptoGPT model. We generate a large training dataset with 10 million samples and a validation dataset with 1 million samples (see SI 1.2). Each sample is a pair comprising a randomly sampled multilayer thin film structure sitting on a glass substrate and its corresponding spectra simulated using the Transfer Matrix Method 29 (TMM). The training dataset is used for training and the validation dataset is used for model selection. Once trained, our model can be used to design for different spectral targets under different constraints. Furthermore, considering that the 10 nm discretization of thickness may lead to sub-optimal performance, we run a thickness finetuning to improve the design performance (see SI 2.1). Results of designed structures and finetuned structures are reported separately.

Figure 1. The schematic of using the Opto-Generative Pretrained Transformer (OptoGPT) to design multilayer thin film structures. (a) and (b) show the diagrams of GPT and our OptoGPT, respectively. GPT is trained on pairs of input prompts and output answers. At testing time, the model takes in only the input prompts and generates answers via probability sampling in an auto-regressive way. In OptoGPT, the input prompts are the optical targets, while the outputs are the designed multilayer structures. (c) An illustration of different design applications, including structural color, absorbers, filters, distributed Bragg reflectors (DBR), Fabry-Pérot (FP) resonators, and other arbitrary spectrum targets. All design targets are converted to reflection and transmission spectra from 400 nm to 1100 nm under normal incidence. (d) An example of "structure serialization" for a twenty-layer structure on a glass substrate. This twenty-layer structure is serialized as 20 + 1 tokens. The first twenty tokens are formed by concatenating the material and thickness at each layer. The last token is 'EoS', short for End of Sequence. At each layer, there are 18 possible materials and 50 thickness choices (10 nm discretization in [10, 500] nm). The refractive index of each material is given in SI 1.1.

Figure 2. Details of the OptoGPT model. (a) The detailed model architecture of our OptoGPT, which is a decoder-only transformer. We use a fully-connected layer as the spectrum embedding to map the spectrum into the space of hidden representations. The decoder has N (we set N = 6) decoder blocks stacked together. Each decoder block uses a self-attention layer and a cross-attention layer to process the structure and spectrum representations, and a feed-forward layer to connect to the next decoder block. Residual connections (not shown here) are used around both the attention layers and the feed-forward layers. Before passing the sequence of tokens (SoT) to the first decoder block, we use physical embeddings to obtain the hidden representation of all structure words in the SoT, and positional embeddings to encode the relative position of each structure word inside the SoT. In total, our model has 59M parameters. Details of training and model architecture can be found in SI 1.3. (b) The working diagram of our OptoGPT model. We generate a training dataset of 10M randomly generated samples and train our model for about two weeks on an NVIDIA 3090 GPU. After training, our model can be used for inverse design directly, regardless of the spectrum targets and design constraints.

Figure 3. Results of inverse design performance on the validation dataset. (a) Illustration of the design process. Given a specific spectrum as the target, our model completes the design layer by layer in an auto-regressive way. When designing the k-th layer, our model takes in the target spectrum together with the Sequence of Tokens (SoT) designed for the previous k − 1 layers, and outputs a probability distribution over all 900 + 1 tokens. Sampling from this distribution gives the design of the k-th layer. Then, we append this sampled token to the end of the SoT. This new SoT is used as the input to OptoGPT when designing the (k + 1)-th layer. This design process continues until the maximum of 20 layers is reached or 'EoS' is sampled. (b) The Mean Absolute Error (MAE) on 1K random spectrum targets from the validation dataset. For each spectrum, we design five different structures by running the design process five times, and the one with the lowest MAE is treated as the final designed structure. The orange, blue, and red dots correspond to the closest structures in the training dataset, designed structures, and finetuned structures, with averaged MAEs of 0.0296, 0.0258, and 0.0192, respectively.

Figure 4. Examples of inverse designing artificial spectra in different applications. (a) summarizes each task description as well as the designed structures in detail. The spectrum MAEs are reported in the last column. The numbers in black, underlined, and in italic blue are the spectrum MAEs of the designed structures, finetuned structures, and closest structures in the training dataset, respectively. (b-d) illustrate the design performance of "perfect absorption in 400-1100 nm", "band-notch filter at 550 nm", and "high reflection in NIR", respectively. In each figure, solid lines, dotted lines, and square-marked lines correspond to the spectrum of the artificial target, the spectrum of the closest structure in the training dataset, and the spectrum of the structure designed by our model, respectively. Our model outperforms the baseline in all tasks. (e) shows examples of designing transmissive and reflective structural colors. We use the color difference ΔE to evaluate the design performance. For each color, the first, second, and third bricks correspond to the target color, designed color, and finetuned color, respectively. More details and examples can be found in Section 2 in SI.

Figure 5. Illustration of design flexibility. (a) A visualization of the design process when adding a design constraint, using the example of "remove Ag from the material selection at the k-th layer". When designing the k-th layer, we remove the tokens that do not satisfy the constraint from the probability distribution and sample only from the renormalized probability over the remaining tokens. (b) Summary of the designed structures as well as their spectrum performance when adding different constraints to design a FP resonator. For each constraint, we give five designed structures and highlight in red the best structure with the lowest MAE. (c) Comparison of the spectrum performance without any design constraint. (d) Comparison of the spectrum performance under the four different constraints. We plot only the spectrum of the best structure in each category. More examples of design flexibility can be found in Section 3 in SI.

Figure 6. 2D visualization of the hidden space using t-SNE for dimensionality reduction. This figure includes 900 structure tokens and 1000 spectra randomly selected from the validation dataset. Spectra are marked as green crosses and structure tokens as colored dots, where different colors correspond to different materials. The green dashed circle illustrates the approximate boundary between spectra and structures. Inside this boundary are the spectra, with zoomed views of two example spectra (marked as red crosses) given in (iii) and (iv). Outside the green boundary are structure tokens corresponding to different material and thickness combinations. Tokens with the same material either form a line or cluster together. Along each line, the dot size decreases monotonically from one end to the other, corresponding to the thickness decreasing from 500 nm to 10 nm. Most lines converge into two regions, with zoomed-in details given in (i) and (ii) corresponding to the low and high refractive index regions, respectively. Our model demonstrates the ability to learn material and thickness properties from a large dataset without their explicit input.