Inferring the solution space of microscope objective lenses using deep learning

Lens design extrapolation (LDE) is a data-driven approach to optical design that aims to generate new optical systems inspired by reference designs. Here, we build on a deep learning-enabled LDE framework with the aim of generating a significant variety of microscope objective lenses (MOLs) that are similar in structure to the reference MOLs but have varied sequences, defined as a particular arrangement of glass elements, air gaps, and aperture stop placement. We first formulate LDE as a one-to-many problem: generating varied lenses for any set of specifications and lens sequence. Next, by quantifying the structure of a MOL from the slopes of its marginal ray, we improve the training objective to capture the structures of the reference MOLs (e.g., Double-Gauss, Lister, retrofocus). From only 34 reference MOLs, we generate designs across 7432 lens sequences and show that the inferred designs accurately capture the structural diversity and performance of the dataset. Our contribution answers two current challenges of the LDE framework: incorporating a meaningful one-to-many mapping, and successfully extrapolating to lens sequences unseen in the dataset, a problem much harder than extrapolating to new specifications.


Inferring the solution space of microscope objective lenses using deep learning: supplemental document
This document provides additional methodology details and results related to the article "Inferring the solution space of microscope objective lenses using deep learning."

MICROSCOPE OBJECTIVE LENS SPECIFICATIONS
Table S1 lists the complete specifications for the nominal design case. These specifications were used as a basis for our unified MOL framework, in particular in the choice of the OLF and training domain.

Table S1. Specifications for the nominal design case (columns: Parameter, Original, Converted).

COMPLETE OPTICAL LOSS FUNCTION FORMULATION
Here, we detail the optical loss function (OLF) used in this work. The core component of the OLF is the weighted spot size radius r, which is sampled at {0, 0.7, 1} times the HFOV with weights {2, 1, 1}. As in [1], we add penalty terms that target various constraints. The ray behavior penalty p_behavior is used to avoid ray failures and overlapping surfaces. The glass penalty p_glass targets the distance between each inferred set of intermediate glass variables and the closest catalog glass from the Schott and Ohara catalogs. The shape penalty p_shape targets the diameter-to-thickness ratio of the glass elements and constrains it between a lower threshold, set to 3, and an upper threshold, set to 7. We refer readers to [1] for more details on the ray-tracing and ray-aiming process, as well as the computation of the RMS spot size radius and the three penalty terms. To adapt our OLF to MOLs and fulfill the aforementioned specifications, we add new penalty terms that target the working distance (WD) and total track length (TTL).
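As a concrete illustration of the threshold-based penalties described above, the sketch below implements a shape penalty on the diameter-to-thickness ratio. The two-ramp form is an assumption (the paper defers the exact expression to [1]); the thresholds 3 and 7 come from the text.

```python
def ramp(x):
    """Ramp function R(x) = max(x, 0)."""
    return max(x, 0.0)

def p_shape(diameter, thickness, lo=3.0, hi=7.0):
    """Illustrative shape penalty: penalize diameter-to-thickness ratios
    outside the [lo, hi] interval using two ramp terms. The exact expression
    used in the paper may differ."""
    ratio = diameter / thickness
    return ramp(lo - ratio) + ramp(ratio - hi)
```

A ratio inside [3, 7] incurs no penalty; the penalty grows linearly as the element becomes too thin or too thick relative to its diameter.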
The working distance penalty p_WD encourages the model to increase the WD of the output designs. We design this penalty term so that its slope diminishes monotonically with the WD, which allows the model to strike a balance between optical performance and WD; the parameter WD_0, which controls the shape of the penalty, is set to 0.5. The total track length penalty p_TTL enforces an upper threshold TTL_max using the ramp function R, that is, p_TTL = R(TTL − TTL_max). We set TTL_max = 6, which corresponds to a maximum total track length of 60 mm in the nominal case with EFL = 10 mm.
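The two new penalty terms can be sketched as follows. The TTL penalty follows the ramp form stated above; for the WD penalty, the exact expression is not reproduced here, so the hyperbolic form `wd_0 / (wd_0 + wd)` is only one assumed example of a decreasing function whose slope diminishes monotonically in magnitude as WD grows.

```python
def ramp(x):
    """Ramp function R(x) = max(x, 0)."""
    return max(x, 0.0)

def p_ttl(ttl, ttl_max=6.0):
    """Total track length penalty: p_TTL = R(TTL - TTL_max).
    Zero below the threshold, growing linearly above it."""
    return ramp(ttl - ttl_max)

def p_wd(wd, wd_0=0.5):
    """Working distance penalty (illustrative form only): decreasing in WD,
    with a slope whose magnitude diminishes monotonically, so large WD gains
    bring diminishing returns against optical performance."""
    return wd_0 / (wd_0 + wd)
```

With TTL_max = 6 (i.e., 60 mm at EFL = 10 mm in the nominal case), designs under the threshold pay no TTL penalty at all.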
To combine the weighted spot size radius r and penalty terms into our OLF, as in [1], we use multiplicative rather than additive factors, which makes it easier to scale different penalty terms across multiple designs. Each penalty term p_i is scaled by a factor λ_i before entering the product. We use λ_behavior = 10^4, λ_glass = 32, λ_shape = 10, λ_WD = 6, and λ_TTL = 1.
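A minimal sketch of the multiplicative combination is shown below. The exact product form `r · Π(1 + λ_i p_i)` is an assumption consistent with the description (scaled penalties combined multiplicatively with the spot size); see [1] for the precise expression.

```python
# Scaling factors from the text; keys are shorthand names for the penalties.
WEIGHTS = {"behavior": 1e4, "glass": 32.0, "shape": 10.0, "wd": 6.0, "ttl": 1.0}

def olf(r, penalties, weights=WEIGHTS):
    """Combine the weighted spot size radius r with penalty terms
    multiplicatively: each term scales the loss by (1 + lambda_i * p_i).
    A zero penalty leaves the loss unchanged, which makes the terms easy
    to scale consistently across many designs."""
    loss = r
    for name, p in penalties.items():
        loss *= 1.0 + weights[name] * p
    return loss
```

With all penalties at zero, the OLF reduces to the spot size r alone, so well-constrained designs are ranked purely by optical performance.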

DETAILED MODEL ARCHITECTURE
The dynamic model architecture used in this work (see Fig. S1) is similar to the one in [1], and functionally equivalent except for the additional output branches.
Dynamic network. As in [1], we use an off-the-shelf dynamic network that can operate on sequences of different lengths. However, we replaced the recurrent neural network (RNN) with a transformer encoder [2], with 6 layers, a dimensionality d_model = 256 in the inputs and 512 in the fully connected layers, 8 attention heads, and no dropout. We use the "Pre-LN" configuration for the normalization layers [3]. The dynamic network expects inputs of dimensionality L × d_model, where L is the length of the sequence, and returns outputs of the same shape. We use learned embeddings and a projection layer to transform our inputs into the required format.
Learned embeddings. As in [1], we use the extended sequence, given as input, to dynamically add, remove, or rearrange the components of the model; this mechanism allows the model to adapt its predictions depending on the desired lens sequence. Each element of the extended sequence is one of four types of basic structure: an interface "i", a glass element "g", an air gap "a", or the aperture stop "s". The extended input sequence achieves the same results as in [1] while being more compact. Each basic structure is associated with an embedding vector e of size d_model that is learned along with the other model components.
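The embedding lookup can be sketched in a few lines. The random initialization here is only a stand-in for the trained embeddings; in the model these vectors are learned jointly with the other parameters.

```python
import random

STRUCTURE_TYPES = ("i", "g", "a", "s")  # interface, glass, air gap, aperture stop

def build_embeddings(d_model=256, seed=0):
    """One vector of size d_model per basic structure type
    (randomly initialized stand-ins for the learned embeddings)."""
    rng = random.Random(seed)
    return {t: [rng.gauss(0.0, 1.0) for _ in range(d_model)]
            for t in STRUCTURE_TYPES}

def embed_sequence(extended_sequence, embeddings):
    """Map an extended sequence such as ['i','g','s','i','a'] to the
    L x d_model input expected by the dynamic network."""
    return [embeddings[t] for t in extended_sequence]
```

Because the lookup is driven entirely by the input sequence, the same four embedding vectors cover any lens sequence length, which is what lets the model extrapolate to sequences unseen during training.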
Positional encoding. Unlike RNNs, transformer models have no inherent mechanism to infer the order of the sequence of inputs, e.g., to distinguish the sequence "i-g-s-i-a" from "s-i-g-i-a". As in most transformer approaches, we add to the learned embeddings a "positional encoding" of size d_model that encodes the index of each element of the sequence. In contrast to common practice, the positional encoding depends not only on the index in forward order, but also on the index in backward order (from the last to the first optical surface), which can help the model locate the last elements of the lens design more easily.
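One way to realize a forward-and-backward positional encoding is to devote half of d_model to sinusoids of the forward index and half to sinusoids of the backward index. This even split is an assumption for illustration; the paper does not specify the exact layout.

```python
import math

def positional_encoding(length, d_model):
    """Sinusoidal positional encoding over both forward and backward indices.
    The first d_model/2 dimensions encode the forward index, the last
    d_model/2 the backward index (assumed split; requires d_model % 4 == 0)."""
    half = d_model // 2
    pe = []
    for pos in range(length):
        fwd, bwd = pos, length - 1 - pos
        row = []
        for index in (fwd, bwd):
            for i in range(half // 2):
                freq = 1.0 / (10000 ** (2 * i / half))
                row += [math.sin(index * freq), math.cos(index * freq)]
        pe.append(row)
    return pe
```

The last element of the sequence then always carries the distinctive backward-index-zero pattern, regardless of sequence length, which is the cue that helps locate the final surfaces of the lens.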
Specifications. The specifications (EPD, HFOV) form a 2-dimensional vector that is linearly projected into a vector of size d_model using the learned linear layer m_p, and passed to the dynamic network as an additional element in the sequence of inputs. This formulation is more natural for transformer architectures than the one used with the previous RNN model [1], which passed the specifications through a linear layer for each type of basic structure before feeding those embeddings to the dynamic network.
Output layers. The model produces no outputs for the specifications; they are only fed as inputs so that the model can adapt its predictions. Likewise, there are no outputs for the last curvature, since it is computed directly to enforce a unit EFL. The number of outputs of these layers is proportional to the number of output lens branches K (not shown in Fig. S1).
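The specification token amounts to a single linear projection. The plain-Python sketch below stands in for a framework linear layer; the weight and bias values shown in the usage are arbitrary, chosen only to make the arithmetic checkable.

```python
def project_specs(specs, weight, bias):
    """Learned linear layer m_p: project the 2-D (EPD, HFOV) specification
    vector into a d_model-sized token that joins the input sequence.
    weight is a d_model x 2 matrix, bias a d_model vector."""
    return [sum(w * s for w, s in zip(row, specs)) + b
            for row, b in zip(weight, bias)]
```

The resulting token is simply appended to the embedded lens sequence, so the attention layers can condition every prediction on the requested specifications.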

TRAINING DETAILS
In unsupervised training, we empirically found it helpful to sample the lens sequences of the reference designs more often, so that the model quickly learns to extrapolate from those reference designs. Therefore, every time we generate an unsupervised batch, we sample the lens sequences that are common to both unsupervised and supervised training 100 times more often than the others. To train the model, we use the Adam optimizer with parameters β = (0.9, 0.99) and gradient clipping with an upper threshold of 0.1. We train over 500 000 training steps. The learning rate is linearly increased from 10^−6 to 10^−3 during the first 10 000 steps, then progressively decayed to 0 over the remaining steps using a cosine half-cycle.
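The warmup-then-cosine schedule described above can be written as a small function of the step count, using the values stated in the text (10^−6 to 10^−3 over 10 000 warmup steps, then a cosine half-cycle to 0 over the remaining 490 000 steps).

```python
import math

def lr_schedule(step, total_steps=500_000, warmup=10_000,
                lr_min=1e-6, lr_max=1e-3):
    """Linear warmup from lr_min to lr_max over the first `warmup` steps,
    then cosine half-cycle decay from lr_max to 0 over the remaining steps."""
    if step < warmup:
        return lr_min + (lr_max - lr_min) * step / warmup
    t = (step - warmup) / (total_steps - warmup)  # 0 at end of warmup, 1 at end
    return lr_max * 0.5 * (1.0 + math.cos(math.pi * t))
```

In practice this would be handed to the optimizer via a per-step scheduler hook (e.g., a lambda-based scheduler in a deep learning framework).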

ABLATION STUDY
In Table S2, we present an ablation study to justify some of our methodology choices. For each experiment, we train the model using the exact procedure explained herein except that we exclude one aspect of our method. Here, we consider methodology choices that aim to help the training process increase the overall optical performance of the designs: the paraxial image solve (PIM)-used to output the defocus rather than the thickness between the lens and image plane-and the supervised loss. We report the geometric average of the OLF across all lens sequences and branches for the nominal specifications, for a total of 59 456 designs. Our results show that excluding the PIM leads to significantly worse designs as the average OLF goes up by 1.6x. Despite the structural losses providing a supervision signal on the lens structures, supervised training is still mandatory as its exclusion brings the average OLF up by 6.8x.

ADDITIONAL RESULTS
In the following, as previously, we query the model using the nominal specifications (NA = 0.40, HFOV = 3.57 deg) and scale the designs to EFL = 10 mm. Fig. S2 shows additional layouts of designs inferred using the nominal specifications. Here, instead of selecting the lens sequence that minimizes the OLF for a given branch and number of glass elements, we randomly select a lens sequence out of all candidates with the proper number of glass elements. Fig. S4 shows the glass materials of the designs inferred from all 7432 lens sequences and 8 branches; for reference, we include the catalog glasses that were used during training.
For the nominal case, 95.2 % of the inferred designs completely fulfill the NA and HFOV, meaning that all rays were traced successfully without encountering ray failures (either missed surfaces or total internal reflection) or traveling backwards (overlapping surfaces).