Rethinking materials simulations: Blending direct numerical simulations with neural operators

Direct numerical simulations (DNS) are accurate but computationally expensive for predicting materials evolution across timescales, due to the complexity of the underlying evolution equations, the multiscale nature of the spatio-temporal interactions, and the need for long-time integration. We develop a new method that blends numerical solvers with neural operators to accelerate such simulations. This methodology is based on the integration of a community numerical solver with a U-Net neural operator, enhanced by a temporal-conditioning mechanism that enables accurate extrapolation and efficient time-to-solution predictions of the dynamics. We demonstrate the effectiveness of this framework on simulations of microstructure evolution during physical vapor deposition modeled via the phase-field method. Such simulations exhibit high spatial gradients due to the co-evolution of different material phases with simultaneous slow and fast materials dynamics. We establish accurate extrapolation of the coupled solver with up to 16.5$\times$ speed-up compared to DNS. This methodology is generalizable to a broad range of evolutionary models, from solid mechanics to fluid dynamics, geophysics, climate, and more.


Introduction
Materials simulations are omnipresent across diverse scientific domains including the physical, chemical, biological, and materials sciences. Existing state-of-the-art direct numerical solvers used for these simulations, whether they are based on finite-element [1], finite-difference [2], finite-volume [3], or spectral methods [4], are inherently computationally expensive because they solve a system of coupled, often nonlinear, partial differential equations (PDEs), requiring a large number of degrees of freedom and advanced numerical schemes to obtain spatially and temporally accurate solutions.
As an emerging alternative, neural networks are rapidly gaining research interest in the field of scientific computing, especially for learning the solutions of PDEs, because of their universal function approximation capabilities [5]. Amongst them, the physics-informed neural network (PINN) [6] is a specialized and efficient approach for approximating PDEs in forward and inverse problems [7,8,9,10,11]. PINNs incorporate the governing physical laws as soft constraints while training the network parameters, enabling them to act as accurate surrogates in sparse and high-dimensional regimes. For example, consider the task of simulating a time-dependent material process by solving a system of non-linear equations over a large computational domain. Although direct numerical solvers are the state of the art in terms of accuracy, several applications, such as simulating long time scales or data-hungry optimization problems, demand that direct numerical solvers be accelerated using faster neural operators that can accurately approximate the underlying mathematical operators over time. Achieving such acceleration can present a significant challenge, since the surrogate operator needs to identify and learn the inherent patterns of the mathematical operator across different spatiotemporal scales from the training dataset, while simultaneously exhibiting strong generalization capabilities when exposed to new and previously unseen initial conditions. U-Net [30], a U-shaped fully convolutional neural network, is an interesting choice for modeling such multi-scale systems because of its inherent ability to discretely extract and learn correlations across different scales. U-shaped neural operators (U-NO) [31], U-FNO [32], and the diffusion-inspired temporal transformer operator (DiTTO) [33] are some of the recently developed U-Net-based surrogate operators for non-linear time-dependent PDEs. For instance, Gupta et al. [34] extended the concept of positional encoding, originally introduced for transformers [35], to U-Net-based architectures and performed a detailed systematic ablation study to analyze the significance of Fourier layers [13], attention blocks [35], and parameter conditioning [34]. The solutions of all the two-dimensional test cases considered in these studies were smooth functions in space. Whether U-Net-based architectures can be effective in approximating PDEs whose solutions involve high spatiotemporal gradients remains, however, an open research question.
Problems with high spatiotemporal gradients that evolve in time are commonplace in studies of phase transitions and microstructure evolution of materials [36,37,38,39,40,41]. Many research groups [42,43,44,45,8,46,47,10,48,49,50,51,52,53] have investigated the effectiveness of machine-learning-based algorithms for modeling microstructures, with a subset of researchers [54,55,56,57,58,59,60] proposing to learn the dynamics of such systems in latent spaces. In this manuscript, we build on these recent advances and develop a method that blends direct numerical solvers with neural operators to accelerate materials simulations. Specifically, this method is based on the integration of a commonly used phase-field solver [61,37] with a U-Net neural operator augmented with a temporal-conditioning mechanism. The idea behind using the U-Net operator is to learn dynamics across multiple length scales; the incorporation of time conditioning is meant to accurately capture the characteristic time scales associated with the various length scales and the dynamics across those scales. We compare this method with two other types of operator networks: a previously proposed autoencoder combined with DeepONet [56] and a hierarchical autoencoder-based model, which is meant to investigate whether hierarchical splitting of the latent space using proper orthogonal decomposition (POD) modes can further improve the accuracy of the autoencoder-based surrogate operator. This comparison enables us to conclude that the U-Net with the temporal-conditioning mechanism resolves the multiple spatiotemporal scales better than the other two options. However, for the long-term evolution of trajectories from new initial conditions, the errors in the predictions provided by the surrogate operator accumulate over time and eventually exceed an acceptable error threshold. We address this issue by integrating our pre-trained time-conditioned U-Net operator with a direct numerical solver to yield an efficient hybrid solver, which accelerates time-to-solution predictions and extrapolates the solution in time. We demonstrate the comparison of the various operator networks and the effectiveness of the hybrid solver on simulations of the microstructure evolution of thin films grown by physical vapor deposition (PVD) modeled via the phase-field method [61,37]. We chose this problem because it exhibits high spatial gradients that evolve in time due to the co-evolution of material phases in the growing film, and because it displays slow (microstructure evolution in the solid film) and fast (moving boundary of the growing film) dynamics simultaneously. Furthermore, vapor deposition processes are used in manufacturing thin films, which have widespread engineering applications [62,63].
The manuscript is structured as follows. The comparative study of the different surrogate operators, the implementation of the accelerated hybrid solver, and its performance are provided in Section 2. We discuss the challenges encountered in data-driven approaches, the capabilities and limitations of the different surrogate operator networks, the accuracy-speed trade-off to be considered while designing hybrid solvers, and future research directions in Section 3. The phase-field model, the dataset considered in this study, the description of the training, and the architecture of the U-Net conditioned on time are provided in Section 4. In the appendices, we provide analyses of the simulations by visualizing the evolution of trajectories in the lower-dimensional latent space spanned by the autoencoder-based models.

Results
We trained our surrogate operator network to learn the spatiotemporal evolution of a materials system that exhibits both slow and fast dynamics. Specifically, the operator network is trained to learn the mapping of the composition field c from a set of previous look-back (lb) timesteps to ∆t timesteps in the future in a continuous sense,
$$c(t + \Delta t) \approx \mathcal{G}\big([c(t - lb + 1), \ldots, c(t)]\big)(\Delta t), \qquad (1)$$
where $\mathcal{G}$ represents the operator network trained to approximate the operator that governs the dynamic evolution of c(t). Note that t is interpreted as a non-dimensional measure of time throughout the manuscript unless specified explicitly.
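Concretely, the interface of such an operator can be sketched as follows. This is a minimal sketch assuming a two-input Keras model; the function and variable names are illustrative, not taken from our implementation:

```python
import numpy as np

# Hypothetical sketch of the surrogate interface in Equation 1: `model` is
# assumed to be a Keras model taking a stack of lb previous composition fields
# (channels-last) and a scalar time offset dt, and returning c(t + dt).
def surrogate_step(model, history, dt):
    """history: (H, W, lb) stack of the lb most recent snapshots of c."""
    x = history[np.newaxis, ...]                   # add batch dim -> (1, H, W, lb)
    t = np.array([[float(dt)]], dtype=np.float32)  # (1, 1) scalar conditioning input
    c_next = model.predict([x, t], verbose=0)[0, ..., 0]
    return c_next                                  # predicted c(t + dt), shape (H, W)
```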

Comparing and analyzing the different surrogate operator networks
In this subsection, we quantitatively compare the ability of different surrogate operator networks to accurately learn the evolution dynamics of the PVD problem. We consider three operator network architectures: 1) a previously proposed autoencoder-DeepONet model [56], which serves as a baseline, 2) a hierarchical autoencoder-based model, and 3) a U-Net with temporal conditioning (see Figure 1). In the first two categories of architectures, we first trained an autoencoder to learn a non-linear mapping of the microstructure evolution to a low-dimensional latent manifold and then trained one or more history-dependent surrogate networks to learn the evolution dynamics in that latent space. These autoencoder-based models consist of a convolutional encoder and a transpose-convolutional decoder paired with one of the following history-dependent neural networks: a Gated Recurrent Unit (GRU) [64], a Transformer [65], or a DeepONet [56].
The autoencoder-based architectures first learn spatial reconstruction using the encoder and the decoder. Information on the effect of the size of the dimensionality reduction is provided in Appendix A. After that, the weights of the encoder and decoder are frozen and the surrogate network is trained to learn the evolution dynamics in the low-dimensional latent manifold. The autoencoder-based operator networks have one or multiple surrogate dynamical models in a single latent manifold, whereas the time-conditioned U-Net encodes a representation of time, $\vec{f}(\Delta t)$, across multiple latent manifolds, $L_1$, $L_2$, $L_3$, and $L_4$, each representing a different scale associated with the dynamics of the materials evolution. A detailed schematic of the three types of architectures is portrayed in Figure 1. One may draw an analogy between the V-cycle of the multigrid method [66] and the temporally conditioned U-Net, which learns the underlying global and local dynamics at different resolutions.
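As a concrete illustration of this two-stage procedure, the following sketch first trains an autoencoder for reconstruction and then freezes it before fitting a latent dynamics model. The layer counts and sizes are placeholders, not the configurations used in this study:

```python
import tensorflow as tf

# Stage 1: train the autoencoder for spatial reconstruction of c.
encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, strides=2, activation="relu", padding="same"),
    tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu", padding="same"),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Conv2DTranspose(32, 3, strides=2, activation="relu", padding="same"),
    tf.keras.layers.Conv2DTranspose(1, 3, strides=2, activation="sigmoid", padding="same"),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(c_fields, c_fields, ...)   # reconstruction targets = inputs

# Stage 2: freeze the spatial mapping; a latent surrogate (GRU, Transformer,
# or DeepONet) is then trained to advance encoder(c(t)) in time.
encoder.trainable = False
decoder.trainable = False
```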
Figure 2 compares the prediction performance of the operator network architectures in terms of the relative mean square error (MSE) (panel a)) and error boxplots (panel b)) at different forecast times ∆t. Specifically, we considered 1) an autoencoder with a DeepONet as the latent surrogate model [56], 2) an autoencoder-GRU with hierarchical partitioning of the latent space, 3) a hierarchical autoencoder-Transformer, 4) a hierarchical autoencoder-DeepONet, and 5) a temporally conditioned U-Net. Detailed descriptions of these models are provided in Appendices B and C. All the models were trained using the trajectory-splitting technique described in Appendix D. From Figure 2 a) and b), we observe that hierarchical partitioning of the latent microstructure embedding with respect to the POD modes outperforms the baseline autoencoder-DeepONet model. Amongst the hierarchical autoencoder-based models, the Transformer predictions are better than those of the GRU because of the self-attention mechanism. However, both the GRU and Transformer models predict future states autoregressively, leading to an accumulation of errors over time. The hierarchical DeepONet, on the other hand, is trained to learn a direct mapping to future states and avoids autoregressive error accumulation. This attribute enables the DeepONet to predict future states with lower error than the GRU and Transformer, which infer autoregressively. In comparison, the predictions obtained from the U-Net conditioned on time are more accurate than those of all the autoencoder-based models considered in this study. This improvement in performance can be ascribed to the architecture of the network, which simultaneously learns the dynamics from low- to high-dimensional representations of the system. Despite these improvements, surrogate operator networks trained in a data-driven setting, although faster at inference, may not necessarily be suitable for forecasting over long time scales. We address this difficulty in the following subsection by blending the phase-field numerical solver (Mesoscale Multiphysics Phase Field Simulator, or MEMPHIS for short, see Refs. [37,61]) with our best surrogate operator, the pre-trained U-Net conditioned on time, as a hybrid solver.

Coupling the high-fidelity direct numerical solver with an operator network-based surrogate
Using an operator network-based surrogate for forecasting longer in time is a challenge that has been extensively investigated in prior work [54,56,67]. This predicament is especially relevant in the PVD exemplar, where the interface of the growing film evolves rapidly compared to the slower diffusion-driven evolution of species in the bulk of the growing film. Therefore, surrogate operator networks trained within a data-driven framework, despite their fast inference, might not be appropriate for making extended forecasts due to the gradual accumulation of errors over time. We address this problem via a hybrid formulation that integrates the accurate but slower finite-difference-based direct numerical solver with the pre-trained U-Net conditioned on time. The purpose of this hybrid approach is to accelerate the direct numerical simulations with the surrogate operator network by leaping ahead in time, periodically intervening with the direct numerical solver to keep the numerical errors and physical solutions bounded. Specifically, we run our phase-field solver (MEMPHIS) for the first few timesteps. The output from the direct numerical simulation is then fed as an input to the pre-trained U-Net conditioned on time. The fast inference capability of the conditioned U-Net enables the hybrid formulation to leap in time over a prescribed time interval. The output of the surrogate operator is fed back as input to the phase-field solver, which is again executed for a certain number of timesteps before leaping forward through the U-Net. In this manner, the U-Net hybrid model orchestrates the evolution trajectory of the microstructure's composition field through back-and-forth transitions between the phase-field solver and the temporally conditioned U-Net. For example, in Figure 3, the orange shading along the time axis t corresponds to the timesteps evolved by the direct numerical phase-field solver, and the green shading corresponds to the timesteps accelerated by the pre-trained surrogate U-Net conditioned on time. The wall time associated with running the U-Net hybrid model on an 8-core CPU is also provided. Without trying to optimize the partitioning between time spent on the direct numerical solver and time spent on the operator network-based surrogate, we observe that the hybrid solver achieves a speed-up of 30%. Furthermore, from the comparison between the evolution of the true and hybrid trajectories in Figure 3, we observe that the hybrid solver predicts the dynamics of the composition field c with good accuracy over time and space for an unseen test sample.
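The orchestration logic can be summarized by the following schematic loop. Here `memphis_run` and `unet_leap` are hypothetical wrappers around the phase-field solver and the pre-trained conditioned U-Net (neither name comes from the MEMPHIS code base), and the segment lengths `n_dns` and `n_leap` encode the accuracy-speed partitioning discussed above:

```python
# Schematic of the hybrid rollout: alternate DNS segments of n_dns steps with
# surrogate leaps of n_leap steps. Assumes memphis_run(c0, n_steps) returns a
# list of n_steps snapshots (with n_dns >= lb) and unet_leap(history, dt)
# returns the field dt steps ahead of the last snapshot in `history`.
def hybrid_rollout(c0, t_final, n_dns, n_leap, lb=3):
    traj = memphis_run(c0, n_steps=n_dns)              # DNS warm-up segment
    while len(traj) < t_final:
        history = traj[-lb:]                           # lb most recent snapshots
        for dt in range(1, n_leap + 1):                # leap ahead with the U-Net
            traj.append(unet_leap(history, dt))
        # hand the surrogate state back to the DNS solver to bound the error
        traj += memphis_run(traj[-1], n_steps=n_dns)
    return traj[:t_final]
```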

Analyzing the local prediction errors of the hybrid trajectories
We now evaluate the local prediction errors of the microstructure evolution obtained from the hybrid solver. By computing the autocorrelations of the composition field on the solid phase of the growing film, we can extract and visualize the prominent features and patterns in the microstructure. The spatial autocorrelation, $S_2^{(A,A)}(r, t)$, of the composition field with two species A and B can be interpreted as the probability of finding the same solid phase A at two random spatial locations $x_1$ and $x_2$, where $r = x_2 - x_1$. In Figure 4, we show the relative $L_2$ error with respect to $S_2^{(A,A)}(r, t = 50)$ for all the trajectories in the test dataset at the last timestep, since the maximum error accumulation is present at the final time. We observe that the mean relative $L_2$ error of the autocorrelation of the U-Net hybrid model's composition field at t = 50 is 0.0188. Furthermore, we compare and visualize the true and predicted $S_2^{(A,A)}(r, t = 50)$, and their point-wise errors, for the test sample with the maximum error of 0.0542, representing the worst case, and for the test sample with the minimum error of 0.0038. Next, we compute and compare the radial averages of the autocorrelations, S(r, t = 50), in Figure 5 for the worst and best test samples identified from the scatter plot in Figure 4. Here again the errors are small, with relative $L_2$ errors of 0.0057 and 0.0018, respectively. We observe that the predicted statistics capture all the correlation lengths present in the microstructure with great accuracy, even for the 'worst' test sample, demonstrating the generalization capability of the hybrid solver. Overall, we observe very good agreement between the predicted and true autocorrelations, demonstrating that the hybrid solver is a fast solver with excellent spatial and temporal accuracy.
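For reference, a minimal sketch of how such statistics can be computed is given below. It assumes periodic statistics (FFT-based circular autocorrelation) and a simple threshold to obtain the phase-A indicator; both are assumptions of this sketch, not statements about the paper's post-processing:

```python
import numpy as np

def autocorrelation(c, threshold=0.5):
    """Two-point autocorrelation S2^{(A,A)}(r) of a 2D composition field."""
    ind = (c > threshold).astype(float)            # indicator of phase A
    F = np.fft.fft2(ind)
    s2 = np.fft.ifft2(F * np.conj(F)).real / ind.size
    return np.fft.fftshift(s2)                     # zero lag at the array center

def radial_average(s2):
    """Radially averaged autocorrelation S(r)."""
    h, w = s2.shape
    y, x = np.indices((h, w))
    r = np.hypot(x - w // 2, y - h // 2).astype(int)  # integer radius per pixel
    sums = np.bincount(r.ravel(), weights=s2.ravel())
    counts = np.bincount(r.ravel())
    return sums / counts                           # mean of S2 over each radius
```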

The need for hybrid solvers
In the previous subsections, we investigated the errors in the autocorrelations and the radial averages of the autocorrelations of the microstructure predicted by the U-Net hybrid model at the last timestep. We now investigate how the error in the autocorrelations of the microstructure predicted by the surrogate U-Net conditioned on time propagates in time. This is especially important for understanding how quickly the predicted solution starts to deviate from the ground-truth solution and for evaluating the trade-off between speed-up and accuracy. To accomplish this evaluation, we tested several configurations. For the first set of predictions, we ran our direct numerical phase-field solver only until t = 3 and then used the trained U-Net conditioned on time from t = 3 onward. The second set of predictions ran the direct numerical phase-field solver until t = 10 and then used the U-Net conditioned on time to roll out the remaining part of the trajectory. Finally, the last set of predictions consisted of the hybrid solver presented in Section 2.2. The evolution of the mean error, as well as the 25th and 75th quartiles of the error, with respect to time for these three sets of predictions is plotted in Figure 6. For all three configurations, we note first that the error accumulates as a function of time. We also observe that the two configurations in which the U-Net conditioned on time replaces the direct numerical solver (red and green curves in Figure 6) have similar rates of error growth as the U-Net marches in time. This result suggests that running the direct numerical solver for more timesteps helps control the accumulation of errors in forecasting, albeit at the cost of reducing the overall speedup achieved. When contrasting these results with the hybrid solver implementation (in blue in Figure 6), we observe that the back-and-forth transitions between the solver and the surrogate bring down the error, the rate of error accumulation in time, and the variance of the error. In general, the results shown in Figure 6 suggest that the accumulation of prediction error can be reduced by running the solver for more timesteps, but at the expense of higher computational costs and therefore reduced acceleration performance. Indeed, in the first configuration, for which the U-Net conditioned on time replaces the direct numerical solver early on, we achieve the highest speedup, 16.5×, but with the largest error. The similar configuration in which the direct numerical solver runs for a longer time before being replaced achieves a speedup of 5×. Finally, our hybrid solver achieves a 1.3× speedup; however, we did not attempt to optimize the partitioning between the DNS and the operator network.

Discussion
The results presented above demonstrate the effectiveness of using a U-Net with temporal conditioning as a fast and accurate surrogate model for problems with high spatial gradients and simultaneous fast and slow dynamics. However, training a surrogate neural network model for this type of problem is challenging for two main reasons: the high-gradient regions evolve in time, and both the slow and fast dynamics must be captured accurately. In the present case of the PVD simulations, the microstructure evolution in the growing film represents the slow dynamics and the moving interface of the growing solid thin film represents the fast dynamics. This means that the surrogate network must accurately learn the evolution of the moving solid-vapor interface with time. Any error in modeling the interfacial dynamics introduces errors in the predicted microstructure that only grow worse during the remaining course of the evolution. At the same time, because of the duality of the dynamics, the surrogate network must also be capable of learning and accurately modeling the multi-scale non-linear dynamics, rendering the training process delicate.
The first challenge, learning evolution dynamics involving several high-spatial-gradient regions, can be addressed by training an autoencoder that learns a suitable mapping to a lower-dimensional latent space and then training a surrogate latent dynamical model to learn the evolution dynamics of the composition field in the latent space of the autoencoder [56]. We extended this approach to learn the latent dynamics with greater accuracy by hierarchically partitioning the latent space with respect to the POD modes. More details regarding the implementation and experiments can be found in the Appendix. However, in our PVD exemplar problem, such autoencoder-based models may not be a suitable choice for modeling the evolution of the solid-vapor interface because of the boundary condition describing that interface. Indeed, in the PVD problem, the boundary condition at the interface is a random draw from a truncated Gaussian distribution and therefore behaves like noise. The encoder part of the autoencoder, comprising two-dimensional convolutional layers, smooths out this noise-like interface boundary condition, resulting in a loss of information. Therefore, autoencoder-based models that learn the evolution dynamics in a single latent manifold might not be a good choice for modeling problems with similar challenges. On the contrary, the U-Net conditioned on time learns the inherent dynamics of the system in multiple latent manifolds, namely $L_1$ to $L_4$. The temporally conditioned U-Net is therefore able to take advantage of the information at the solid-vapor interface and overcome the over-smoothing of the interface boundary condition encountered in the autoencoder-based models. The results in Section 2 show that the U-Net conditioned on time can effectively learn both the fast-evolving interfacial dynamics and the slower diffusion-driven dynamics, compared to the autoencoder-based models. The surrogate operator network thus alleviates the excessive computational burden placed on numerical solvers, which are forced to fix a sufficiently small time scale to accurately resolve the fast dynamics during time integration. Nevertheless, longer forecasts in time remain a challenge for fast data-driven surrogate networks in general, due to the gradual but consistent accumulation of errors.
For extrapolating in time, direct numerical solvers like MEMPHIS are the best choice on the basis of accuracy. However, the finite-difference-based solver is computationally expensive when required to generate a trajectory from a new initial condition. Conversely, the surrogate network trained offline offers faster solutions at inference but suffers in prediction accuracy due to the accumulation of errors. To circumvent the predicament associated with either end of this accuracy-speed spectrum, we coupled the numerical solver with the trained surrogate operator network. Specifically, we developed a hybrid solver that segues between the direct numerical solver and the trained U-Net conditioned on time during the course of the trajectory. From our results, we infer that running the direct numerical solver for more timesteps helps decrease the rate of error accumulation in time. The surrogate conditioned U-Net by itself is able to keep the mean relative $L_2$ error of the spatial autocorrelations of the microstructure within 6% for forecasts up to t = 50, with a speed-up of 16.5×. Integrating the surrogate U-Net conditioned on time with the numerical solver MEMPHIS brings the mean relative $L_2$ error below 2%, while retaining a speed-up of 1.3×.
The hybrid formulation proposed here is generic and can be easily extended to diverse applications. Based on their performance and accuracy, such hybrid solvers can be utilized for solving data-hungry inverse design problems, i.e., for a given observed final state of a materials simulation, finding the corresponding input parameters that lead to that state. For instance, prior work [68] relied on iterative methods comprising a genetic algorithm as the inverse model and the MEMPHIS solver as the forward model for learning novel physical vapor deposition protocols to design thin films. The bottleneck in this framework is the excessive computational cost associated with executing the forward run many times during the iterative process. Our hybrid formulation would substantially speed up the overall inverse control process by significantly alleviating the associated computational costs. Another interesting avenue would be to use such a hybrid formulation in the context of multi-fidelity modeling frameworks to integrate numerical and experimental data, where the U-Net hybrid model would serve as the rapid low-fidelity model fused with high-fidelity experimental data. While we demonstrated the applicability of the hybrid solver on a specific problem, this model can be applied to many other scientific domains, such as high-speed flows, climate modeling, reaction-diffusion modeling, and many more.

Phase-field modeling of physical vapor deposition
The phase-field model used to simulate PVD is described in detail elsewhere [37]. This model accounts for the various mechanisms occurring during the physical vapor deposition and growth of thin films (surface and bulk diffusion, transport of the vapor phase) and has been validated experimentally. In this model, the deposition of thin films is simulated as the evolution of two field variables c(x, t) and ϕ(x, t), where c(x, t) represents the concentration of one of the atomic species in the alloy and ϕ(x, t) represents the location of the solid-vapor moving interface. The solid phase corresponds to ϕ = 1 and the vapor phase to ϕ = −1. The evolution of the composition field (slow dynamics in this study) follows the Cahn-Hilliard equation,
$$\frac{\partial c}{\partial t} = \nabla \cdot \left[ M_c(\phi, c) \, \nabla \frac{\delta F}{\delta c} \right],$$
where F is the free-energy functional defined in Ref. [37] and $M_c(\phi, c)$ is the structurally and compositionally dependent mobility function, which captures the relative difference between bulk and surface mobilities through a smooth switching function h(c) (see Ref. [37] for the full expressions).
The growth of the solid thin film is described by the Cahn-Hilliard equation with a source term that describes the deposition of new species on the surface of the growing film. As such, the evolution of the vapor-solid interface (fast dynamics in this study) is given by
$$\frac{\partial \phi}{\partial t} = \nabla \cdot \left[ M_\phi \, \nabla \frac{\delta F}{\delta \phi} \right] + S,$$
with $M_\phi$ the corresponding interface mobility, where the source term S couples the local vapor-phase density field ρ to the velocity field $\vec{v}$ of the incident vapor flux projected along the interface normal direction $\vec{n} = \nabla\phi/|\nabla\phi|$ (see Ref. [37] for its full expression). The transport and evolution of the vapor phase is modeled using a convection-diffusion equation,
$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \vec{v}) = \nabla \cdot (D_\rho \nabla \rho),$$
where $D_\rho$ is the mass diffusivity of the vapor.

Dataset
The phase-field model described above has been implemented, validated, and verified in MEMPHIS, a finite-difference-based solver written in modern Fortran, with second-order central-difference stencils for all spatial derivatives and an explicit midpoint method for time integration.
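To make the numerical scheme concrete, a minimal sketch of this discretization pattern is shown below for a simplified constant-mobility Cahn-Hilliard model on a periodic grid. It illustrates the second-order central-difference Laplacian and the explicit midpoint step, not the full MEMPHIS physics:

```python
import numpy as np

def laplacian(u, dx):
    """Second-order central-difference Laplacian on a periodic 2D grid."""
    return (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
            np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u) / dx**2

def cahn_hilliard_rhs(c, dx, M=1.0, kappa=1.0):
    """Simplified constant-mobility CH right-hand side with double-well f'(c)."""
    mu = c**3 - c - kappa * laplacian(c, dx)       # chemical potential δF/δc
    return M * laplacian(mu, dx)                   # ∂c/∂t = ∇·(M ∇μ)

def midpoint_step(c, dt, dx):
    """Explicit midpoint method: evaluate the slope at t + dt/2."""
    k1 = cahn_hilliard_rhs(c, dx)
    return c + dt * cahn_hilliard_rhs(c + 0.5 * dt * k1, dx)
```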
We generated 2200 microstructure evolution trajectories by varying several simulation parameters: the deposition rate (v) from 0.3 to 1.0 (in simulation units), the angle of deposition from 50° to 90°, and the species mobilities ($M_A$, $M_B$). The initial phase fraction ($c_A$) was kept constant at 50%. All simulations in this work are performed on a uniform 2D mesh of 256 × 256 grid points with dimensionless numerical and physical parameters. Each microstructure evolution trajectory consists of 50 equally spaced time snapshots taken over the entire course of the simulation. The dataset is thus of shape 2200 × (50, 256, 256). The 2200 trajectories were split into training, validation, and testing datasets in an 80:10:10 ratio.
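A trajectory-level split of this kind can be sketched as follows (the file name and the random seed are illustrative assumptions):

```python
import numpy as np

# `data` holds the 2200 trajectories of 50 snapshots each: (2200, 50, 256, 256).
data = np.load("pvd_trajectories.npy")        # hypothetical file name
rng = np.random.default_rng(seed=0)
idx = rng.permutation(len(data))              # shuffle whole trajectories
n_train, n_val = int(0.8 * len(data)), int(0.1 * len(data))
train = data[idx[:n_train]]                   # 80% for training
val = data[idx[n_train:n_train + n_val]]      # 10% for validation
test = data[idx[n_train + n_val:]]            # 10% for testing
```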

U-Net with temporal conditioning
The U-Net conditioned on time consists of two networks: 1) a multi-layer perceptron (MLP) that learns a suitable basis $\vec{f}(\Delta t)$ with respect to the scalar parameter, time ($\Delta t$), and 2) a U-Net with element-wise product operations at the residual connections in order to condition the U-Net prediction on the time ∆t.
The MLP used in this study comprises 2 hidden layers with 128 neurons each and uses the sine activation function [69]. The non-linear basis of time projected by the MLP can be expressed as
$$\vec{f}(\Delta t) = \mathrm{MLP}(\Delta t; \theta_{\mathrm{MLP}}),$$
where $\theta_{\mathrm{MLP}}$ represents the parameters learned during the training of the operator. For the U-Net itself, each convolution block is composed of a two-dimensional convolutional layer [70], a group normalization layer [71], and a Gaussian Error Linear Unit (GELU) activation layer [72]. Before we formulate the forward pass of the U-Net with temporal conditioning, we define the convolution and group normalization operations for an arbitrary three-dimensional tensor u with channel size C, width W, and height H.
The two-dimensional convolution of u with a four-dimensional weight tensor $w^{\mathrm{conv}}$ (input channel size C, output channel size C′, kernel width W′, and kernel height H′) is
$$\mathrm{conv}(u)_{k', i, j} = \sum_{k=1}^{C} \sum_{m=1}^{W'} \sum_{n=1}^{H'} w^{\mathrm{conv}}_{k, k', m, n} \, u_{k, i+m, j+n},$$
where $w^{\mathrm{conv}}$ is the weight tensor associated with the convolution operation, learned during the course of training the operator network.
For the group normalization operation on a three-dimensional tensor u, we separate the C channels into G groups of $C_g$ channels ($C = G \times C_g$) and compute G means and standard deviations separately as
$$\mu_g = \frac{1}{C_g W H} \sum_{k \in g} \sum_{i=1}^{W} \sum_{j=1}^{H} u_{k,i,j}, \qquad \sigma_g^2 = \frac{1}{C_g W H} \sum_{k \in g} \sum_{i=1}^{W} \sum_{j=1}^{H} \left(u_{k,i,j} - \mu_g\right)^2.$$
We then normalize u as
$$\hat{u}_{k,i,j} = \frac{u_{k,i,j} - \mu_{g(k)}}{\sqrt{\sigma_{g(k)}^2 + \epsilon}},$$
where ϵ is a small positive constant. The output of the group normalization operation is
$$\mathrm{GN}(u)_{k,i,j} = \gamma_k \hat{u}_{k,i,j} + \beta_k.$$
Here, $\gamma_k$ and $\beta_k$ are trainable C-dimensional parameters that learn the ideal shift and scale.
conv(u) and GN(u) are linear transformations of u. We introduce non-linearity using the GELU activation function, which can be approximated as
$$\mathrm{GELU}(x) \approx 0.5\,x \left(1 + \tanh\!\left[\sqrt{2/\pi}\,\big(x + 0.044715\,x^3\big)\right]\right).$$
The convolutional block that non-linearly transforms u can then be represented as
$$\mathrm{convblock}(u) = \mathrm{GELU}\big(\mathrm{GN}(\mathrm{conv}(u))\big).$$
The downsampling operation is performed by a two-dimensional max-pooling layer [73],
$$\mathrm{down}(u)_{k,i,j} = \max_{1 \le m \le W',\; 1 \le n \le H'} u_{k,\, S_w (i-1) + m,\, S_h (j-1) + n},$$
where $S_w$ and $S_h$ represent the strides along the width and height. To obtain a scale-down factor of 2 during max-pooling, we set $S_w = S_h = W' = H' = 2$. The up-sampling is performed by a two-dimensional transpose convolution [74] with a trainable weight tensor $w^{\mathrm{tconv}}_{m,n}$ of width W′ and height H′; a scale-up factor of 2 is likewise achieved with a stride of 2. Next, we mathematically represent the output c(t + ∆t) of our U-Net with temporal conditioning ($\mathcal{G}$), which takes $[c(t - lb + 1), \ldots, c(t)]$ stacked along lb channels and ∆t as inputs. The latent variables in the latent space $L_p$ (p = 1, 2, 3, 4) before and after temporal conditioning are denoted $\vec{z}^{L_p}_t$ and $\vec{z}^{L_p}_{t+\Delta t}$, respectively, where $\vec{z}^{L_p}_t$ is obtained through successive convolution blocks and downsampling operations. The temporal-conditioning operation that makes the latent variable $\vec{z}$ a function of ∆t is an element-wise product (⊙),
$$\vec{z}^{L_p}_{t+\Delta t} = \vec{z}^{L_p}_t \odot \left(w^{L_p} \vec{f}(\Delta t)\right).$$
In the above expression, $w^{L_p}$ linearly projects the shared non-linear basis $\vec{f}(\Delta t)$ to the number of channels at the p-th latent level before conditioning the latent representation. After conditioning the latent vector on time, the decoder path up-samples and concatenates (⊕) the tensors along the channel dimension through the residual connections, and the output of the model is computed from the final decoder level. The illustration of the architecture used in this study is provided in Figure 1, part c.
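A minimal TensorFlow sketch of this conditioning mechanism at a single latent level is given below. The spatial resolution, channel count, and basis width are placeholders; the actual model applies this operation at all four levels $L_1$ to $L_4$:

```python
import tensorflow as tf

# MLP with sine activations maps the scalar dt to a shared basis f(dt).
dt_in = tf.keras.Input(shape=(1,))
f_dt = tf.keras.layers.Dense(128, activation=tf.sin)(dt_in)
f_dt = tf.keras.layers.Dense(128, activation=tf.sin)(f_dt)

# Latent tensor z at one level (placeholder shape), modulated channel-wise.
z_in = tf.keras.Input(shape=(32, 32, 64))
scale = tf.keras.layers.Dense(64)(f_dt)            # w_Lp · f(dt), one per channel
scale = tf.keras.layers.Reshape((1, 1, 64))(scale) # broadcast over H and W
z_cond = z_in * scale                              # element-wise product z ⊙ (w_Lp f(dt))

conditioner = tf.keras.Model([z_in, dt_in], z_cond)
```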
We implemented and trained the surrogate operator in TensorFlow [75]. Conditioning the U-Net on the scalar parameter time is motivated by Gupta et al. [34]. Additionally, our implementation learns the evolution dynamics by taking advantage of the information contained in the microstructure's composition field at lb previous timesteps, as shown in Equation 1. We demonstrate the significance of incorporating this historical information empirically in the error heatmaps in Appendix Figure D.1, and on that basis provide the composition field states at lb = 3 previous timesteps.

Training of the network
We used mini-batch training [76] with a batch size of 5 to train the U-Net conditioned on time, using the Adam optimizer [77] with a learning rate of $10^{-4}$. The best model was chosen on the basis of the relative mean squared error of the model prediction with respect to the trajectories in the validation dataset. The model was trained across eight 24-GB NVIDIA GeForce RTX 3090 GPUs. Since the batch size is much smaller than the total number of samples in the training dataset, the model converges by around the 200th epoch; the model was therefore trained for only 300 epochs, with a wall time of 14 hours.
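In Keras terms, this configuration corresponds to a setup along the following lines (the model and dataset objects are assumed to be defined elsewhere, and the checkpoint path is illustrative):

```python
import tensorflow as tf

# Adam with the stated learning rate; relative_mse is the loss sketched in
# the Error metrics section below.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
model.compile(optimizer=optimizer, loss=relative_mse)
model.fit(train_ds.batch(5),                       # mini-batches of 5 samples
          validation_data=val_ds.batch(5),
          epochs=300,
          callbacks=[tf.keras.callbacks.ModelCheckpoint(
              "best_unet.keras",                   # keep the best validation model
              monitor="val_loss", save_best_only=True)])
```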

Error metrics
The operator networks used in this study were trained to minimize the relative mean squared error (Rel. MSE) between the true and predicted composition fields (c). The Rel. MSE is computed as
$$\mathrm{Rel.\ MSE} = \frac{1}{N} \sum_{n=1}^{N} \frac{\left\| c^{(n)}_{\mathrm{pred}} - c^{(n)}_{\mathrm{true}} \right\|_2^2}{\left\| c^{(n)}_{\mathrm{true}} \right\|_2^2},$$
where n corresponds to the n-th sample in the dataset.
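One plausible TensorFlow implementation of this metric, assuming channels-last 4D batches, is:

```python
import tensorflow as tf

def relative_mse(c_true, c_pred):
    """Relative MSE averaged over the batch; tensors of shape (N, H, W, C)."""
    num = tf.reduce_sum(tf.square(c_pred - c_true), axis=[1, 2, 3])
    den = tf.reduce_sum(tf.square(c_true), axis=[1, 2, 3])
    return tf.reduce_mean(num / den)
```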
The spatial autocorrelation of the composition field is a better statistical representation of the underlying microstructure, as explained in Section 2.3 and elsewhere [42]. Therefore, we first compute the spatial autocorrelations of the true and predicted composition fields and then quantify the statistical error as the relative $L_2$ error between the true and predicted autocorrelations,
$$\mathrm{Rel.}\ L_2 = \frac{\left\| S_2^{(A,A)}(r)_{\mathrm{pred}} - S_2^{(A,A)}(r)_{\mathrm{true}} \right\|_2}{\left\| S_2^{(A,A)}(r)_{\mathrm{true}} \right\|_2}.$$

A Analyzing the latent dimensionality for spatial reconstruction

In this supplementary note, we investigate the effect of the dimensionality of the latent space of the encoder in the autoencoder-based models on the reconstruction of the microstructures. We train seven different autoencoders with latent dimensions of ld = 64, 100, 196, 256, 400, 625, and 900. The autoencoder consists of an encoder that learns to compress the microstructure composition field (c(x, y)) to a lower-dimensional embedding ($\bar{c}$) and a decoder that learns to reconstruct the composition field from the embedding. The encoder can be interpreted as a nonlinear dimensionality-reduction algorithm. Note that we are not trying to learn the evolution dynamics in this experiment, but rather attempting to identify a sufficiently large latent space that can reconstruct the microstructure from its encoded state with an acceptable level of accuracy.

B Visualizing the evolution dynamics of the encoded embeddings
We visualize and analyze the evolution dynamics of the microstructure embeddings ($\bar{c}(t)$) in the latent space learned by the encoder. The latent space considered in this study has a dimensionality of ld = 3137 ($\bar{c}(t) \in \mathbb{R}^{3137}$). We decided to move ahead with a large ld to ensure accurate reconstruction of the microstructure from its embedding, as discussed in Appendix A. We perform Singular Value Decomposition (SVD) on $\bar{c}$ using the truncated SVD algorithm [78],
$$\bar{c} = U \Sigma V^{T},$$
where V represents the ld − 1 proper orthogonal decomposition (POD) basis/modes that can span all the $\bar{c}(t)$ encountered in the training dataset. $U\Sigma(t) \in \mathbb{R}^{ld-1}$ represents the vector of coefficients at time t, such that a linear combination of the POD modes in V with the coefficients $U\Sigma(t)$ approximates $\bar{c}(t)$ in the least-squares sense. The latent dynamics can thus be interpreted as the linear projection of $\bar{c}(t)$ onto $V^{T}$.

C Effect of partitioning of the latent space

In Figure C.1 d) and Table 1, we investigate the variation in the relative MSE of $\bar{c}(t)$ with respect to all the samples in the test dataset, across all time steps, as the number of latent partitions increases. Both the mean and standard deviation of the relative test MSE decrease with the number of latent partitions, implying more accurate and more confident predictions. Note that in this experiment we only consider equal-sized latent partitions. There could be better ways of partitioning the latent space, and this could be a possible direction for future research.
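As a sketch, the decomposition and the equal-sized partitioning of the POD coefficients could be implemented as follows (the embedding matrix `C` and the number of partitions are placeholders):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# `C` stacks the encoded embeddings c̄(t) row-wise, shape (n_snapshots, ld).
svd = TruncatedSVD(n_components=C.shape[1] - 1)    # up to ld - 1 POD modes
coeffs = svd.fit_transform(C)                       # U Σ: coefficients per snapshot
modes = svd.components_                             # V^T: POD basis of the latent space

n_partitions = 8                                    # equal-sized partitions over modes
partitions = np.array_split(coeffs, n_partitions, axis=1)
# each partition's coefficient trajectory can then be learned by its own
# latent surrogate model, as in the hierarchical architectures above
```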

D Trajectory splitting for supervised learning
Our PVD exemplar is a two-dimensional dynamical system. In a data-driven learning setting, the dataset consists of two-dimensional trajectories evolving from different initial conditions but governed by the same PDEs. The PDEs that govern the PVD process and the numerical methods used for generating the simulation data are provided in Section 4.1. In the context of data-driven deep learning of two-dimensional dynamical systems, each sample trajectory can be further split into sub-trajectories because the PDEs remain unchanged. Therefore, it is reasonable to train the surrogate operator network ($\mathcal{G}$) to learn the mapping from a look-back time window of size lb time steps to a look-forward window of size lf time steps,
$$[c_{t-lb+1}, \ldots, c_{t}] \rightarrow [c_{t+1}, \ldots, c_{t+lf}], \qquad (26)$$
$$c(t + \Delta t) \approx \mathcal{G}\big([c_{t-lb+1}, \ldots, c_{t}]\big)(\Delta t), \quad \Delta t \in \{1, 2, \ldots, lf\}. \qquad (27)$$
This splitting increases the number of samples available for training the surrogate network. To investigate the optimal choice of lb and lf, we perform an analysis varying both the look-back and look-forward time windows with lb = 1, 3, 5, 7, 9 and lf = 3, 5, 7, 9. From Figure D.1 a), we observe that for a given lf, the error is similar for lb = 3, 5, 7, and 9. However, there is a sudden increase in error when lb = 1. Figures D.1 b), c), d), and e) show the mean relative MSE on the test dataset at $t_i + 3$, $t_i + 5$, $t_i + 7$, and $t_i + 9$, respectively. Intuitively, these results suggest an increasing trend in error when forecasting further into the future. The higher error for lb = 1 suggests that the physical vapor deposition behaves like a non-Markovian process at the larger step size ∆t used for training the surrogate operator. Note that the error remains essentially unchanged for lb ≥ 3, justifying lb = 3 as the number of previous states of the composition field incorporated in training.
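A minimal sketch of this trajectory-splitting scheme is given below (the array layout is an assumption of the sketch, not prescribed by the paper):

```python
import numpy as np

def split_trajectory(traj, lb=3, lf=5):
    """Split one trajectory (T, H, W) into sliding-window training samples.

    Returns inputs of shape (N, lb, H, W) and targets of shape (N, lf, H, W),
    with N = T - lb - lf + 1 samples per trajectory.
    """
    T = traj.shape[0]
    inputs, targets = [], []
    for i in range(T - lb - lf + 1):
        inputs.append(traj[i:i + lb])              # look-back window
        targets.append(traj[i + lb:i + lb + lf])   # look-forward window
    return np.stack(inputs), np.stack(targets)
```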

Fig. 1 :
Fig. 1: Architectures of the autoencoder-based models and the U-Net with temporal conditioning. The three types of architectures used in this study predict the future states, c(t + ∆t), from a history of previous states, [c(t − lb + 1), ..., c(t)]. Panels a) and b) represent the baseline and hierarchical autoencoder-DeepONet models, respectively. In c), a multi-layer perceptron (MLP) learns a set of basis functions $\vec{f}(\Delta t)$ and conditions the latent representations of c at multiple levels of coarseness ($L_1$, $L_2$, $L_3$, and $L_4$) inside the U-Net.

Fig. 2 :
Fig. 2: Comparing the errors in the forecast. Relative mean square error (MSE) at different forecast timesteps, computed across all the samples in the test dataset. We compare the prediction errors of the autoencoder-based models and the U-Net conditioned on time. In a) we observe the general trend in the accumulation of the mean test relative MSE, and in b) the error boxplots give further insight into the distribution of the test relative MSE of each model at each forecast timestep.

Fig. 3 :
Fig. 3: Hybrid trajectory with 1.3× speed-up. The left column of images represents a true trajectory of the normalized composition field c generated by our direct numerical solver MEMPHIS. The middle column shows the trajectory from the same initial condition predicted by the U-Net hybrid model consisting of a pre-trained U-Net conditioned on time. Note that this is a sample from the test dataset, not seen by the surrogate conditioned U-Net during the training stage. The details of the back-and-forth transitions and the computational time on an 8-core CPU are shown on the time axis on the left side of the figure. The right column represents the evolution of the absolute point-wise error. (See the git repository for an animation.)

Fig. 4 :
Fig. 4: Errors in the autocorrelations of the microstructure. Scatter plot visualizing the relative $L_2$ error of $S_2^{(A,A)}(r, t = 50)$ for all the samples from the test dataset. From the scatter plot, we identify the worst test sample with the largest error and the best test sample with the lowest error, and plot the true and predicted autocorrelations at t = 50, along with their absolute point-wise errors.

Fig. 5 :
Fig. 5: Error in the radial average of autocorrelations. Radial average of the autocorrelations for the best and the worst test samples identified from the scatter plot in Figure 4.

Fig. 6 :
Fig. 6: Time evolution of the autocorrelation error in hybrid trajectories. The plot shows how the relative $L_2$ error of the autocorrelation of the microstructure predicted by the conditioned U-Net hybrid model accumulates over time. The solid line is the mean error and the shaded regions represent the 25th to 75th error quartiles computed across all the samples in the test dataset. The hybrid formulation involves multiple back-and-forth transitions between the numerical solver and the surrogate, as shown in Figure 3.

Fig. A. 1 :
Fig. A.1: Effect of the latent dimension on the reconstruction error. We investigate the variation of the spatial reconstruction error of the autoencoder against the dimensionality (ld) of the bottleneck layer. As expected, we observe a decreasing trend.

Fig. B. 1 :
Fig. B.1: Visualizing the evolution dynamics of the encoded microstructure embeddings. From visual inspection, we observe that the dynamics contained in the higher-energy modes evolve smoothly compared to the lower-energy modes. This figure motivated us to partition and scale the latent dynamics based on the POD modes.

Fig. C. 1 :
Fig. C.1: Effect of partitioning of the latent space. Panels a)-c) compare the true and predicted energy spectra of the latent trajectories for different numbers of latent partitions, n = 1, 8, 16, respectively. Panel d) provides a statistical representation of the relative test MSE against the number of latent partitions.

Table 1 :
Table 1: Partitioning of the latent space. Mean 232e-02 1.351e-02

From Figure C.1 a) to c), we observe that increasing the number of latent partitions brings the eigenvalues of the predicted embeddings closer to those of the true embeddings, suggesting more accurate model predictions. The effect is prominent in the lower eigenmodes, which carry more energy. This observation is intuitive because Figure B.1 suggests a decay in energy across the modes and portrays smoother, easier-to-learn dynamics for the lower modes.

Fig. D. 1 :
Fig. D.1: Error heatmaps. Error heatmaps of the relative MSE for the different look-back (lb) and look-forward (lf) windows considered in this study.