GPyro: uncertainty-aware temperature predictions for additive manufacturing

In additive manufacturing, process-induced temperature profiles are directly linked to part properties, and their prediction is crucial for achieving high-quality products. Temperature predictions require an accurate process model, which is usually either a physics-based or a data-driven simulator. Although many high-performance models have been developed, they all suffer from disadvantages such as long execution times, the need for large datasets, and error accumulation over long prediction horizons. These drawbacks undermine the utility of such modeling approaches and hinder their integration within iterative optimization and closed-loop control schemes. In this work, we introduce GPyro, a generative model family specifically designed to address these issues and enable fast probabilistic temperature predictions. GPyro combines physics-informed and parametric regressors with a set of smooth attention mechanisms and learns the evolution of the dynamics inherent to a system by employing Gaussian processes. The model predictions are equipped with confidence intervals quantifying the uncertainty at each timestep. We applied GPyro to wire-arc additive manufacturing and learned an accurate model from a single experiment on a real welding cell, almost in real-time. Our model can be easily integrated within existing loop-shaping and optimization frameworks.


Introduction
Wire-arc additive manufacturing (WAAM) is an additive manufacturing technique for incrementally building parts by sequentially depositing weld beads on a substrate. This process uses well-established arc-welding technology and is characterized by high deposition rates, reaching several kg/h (Priarone et al., 2020). At deposition time, the feed material in the form of a wire is melted by a high-efficiency heat source usually mounted on a robot or a Computer numerically controlled (CNC) machine. The material transfer is controlled to minimize spatter and achieve the required geometry of the welded tracks.
WAAM is a compelling alternative to conventional manufacturing techniques such as milling for several reasons, including high material utilization, the use of wire raw material, the low cost of the required equipment, the design freedom, and the ability to process high-strength materials. Ideally, this process would allow the production of parts with complicated 3D geometries manufactured from durable alloys with negligible material waste. As a result, customized parts could be fabricated sustainably even in small volumes, offering significant financial benefits to various industries.
However, in practice, distortions and residual stresses (Montevecchi et al., 2016), as well as many defects, such as cracking (Gu et al., 2020), limit the capabilities of WAAM, reducing its utility and undermining the quality of the produced parts. As a result, the process parameters have to be tuned in a trial-and-error manner to achieve an in-specification build. Nonetheless, as trial-and-error requires printing multiple parts to produce a functional one, the material efficiency of the process drops, and thus, its financial advantages are not as attractive as in theory.

Fig. 1 a Typical WAAM part with 9 adjacent pockets; the welding motion for pockets 1-9 is the same as for pocket 5; red crosses indicate welding start/stop; pocket 10 indicates the contouring motion. b Measurement of the permanent distortion of a cooled 5 mm thick plate after the material is deposited in the sequence 1-2-3-4-5-6-7-8-9-10
These defects originate from many process factors, with the most prominent one being the temperature fields developed during the deposition process (Taylor, 2001). As the heat source deposits molten material on the workpiece, heat is transferred to the part. Then, this heat spreads throughout the workpiece's volume and dissipates into the surroundings. The energy dissipation rate, though, is not equal to the input power from the welding torch; consequently, heat accumulates in the workpiece and the temperature rises over time. The resulting temperature fields are inhomogeneous, and their evolution is strongly coupled with the energy deposition and thus with the printing sequence that the heat source follows.
The heterogeneity of the temperature profiles induces thermal gradients in the produced part. Gradients are strongly related to the formation of defects, and hence, the printing sequence can play an essential role in the final quality of the part (Bambach et al., 2020). For instance, Fig. 1 illustrates how an unsuccessful choice of material deposition path can lead to a permanent distortion of up to 7 mm within the first 5 printed layers. The severity of this example highlights the impact of the deposition path plan on the final result. However, the magnitude of this impact also indicates that by appropriately designing and optimizing the deposition sequence, one can obtain control over the evolution of the thermal fields.
Designing a path plan that achieves a specific goal, e.g., the minimization of thermal gradients in a part, requires a model mapping the trajectory of the heat source to the corresponding thermal fields. This model must be fast enough to make simulation-based process optimization a more financially viable solution than the current trial-and-error approach.
There are many ways to approach thermal modeling. The most common approach is using white-box simulations, for example, Finite element models (FEM). White-box models are helpful for better understanding the effects of the underlying dynamics and, in this way, improving the processes (Xiong et al., 2018). As shown in Nguyen et al. (2021) and Graf et al. (2018), which focused on the effects of the thermal fields on the resulting part distortions, these approaches also allow modeling the interactions between coupled physical phenomena. However, finite element methods in additive manufacturing are very computationally expensive and require many hours, even days, to execute. Several approaches, such as adaptive meshing (Hejripour et al., 2019) or domain decomposition (Neiva et al., 2019), aim to ease this burden, but their computational costs remain high. This problem undermines the efficacy of these approaches, as it results in simulations that last longer than the corresponding experiments and precludes iterative process optimization methods. Furthermore, these models contain a large set of parameters, such as the physical properties of the printed material. These properties have an essential effect on the simulation predictions, but their values are usually challenging to determine. This increases the uncertainty of the simulation outputs and can lead to a significant modeling effort.
An alternative to white-box models is offered by their data-driven counterparts, such as Machine Learning models. Machine Learning has offered novel solutions for modeling processes with temporal dependencies and has been successfully applied in various applications similar to the thermal modeling of additive manufacturing. These methods achieve comparable accuracy but are significantly faster than white-box techniques. Moreover, they require fewer assumptions, thus strengthening the robustness of the predictions and rendering process automation feasible.
In particular, Mozaffar et al. (2018) used recurrent models, and more precisely Gated recurrent unit (GRU) networks, for studying the temperature evolution of specific part locations in additive manufacturing applications. These architectures have been extensively used for modeling time-dependent data in many applications such as predictive maintenance (de Bruin et al., 2017), natural language processing (Ranzato et al., 2015), etc. They form a broad and powerful family of models with high predictive power and generalization capabilities. Be that as it may, Pascanu et al. (2012) show how recurrent models are notoriously hard to train, with problems like vanishing gradients and lack of memory retention weakening their performance. Such models also usually require a vast number of training examples and lengthy optimization times to generalize. This is the case in Mozaffar et al. (2018) for the prediction of temperature profiles in Directed energy deposition (DED) processes as well. This necessity comes in stark contrast to the needs of applications similar to WAAM, where obtaining a model should require as little data as possible, ideally even online.
To alleviate the high demand for data, these models are trained and validated on synthetic data produced by simulations. Many high-performance Machine Learning models have been proposed in the literature following this artificial data generation approach, extending from multilayer perceptrons (Le et al., 2022; Roy & Wodo, 2020), graph neural networks (Mozaffar et al., 2021), and extra trees (Ness et al., 2022) to new architectures specialized in the task of thermal predictions (Zhou et al., 2021). However, these solutions usually suffer from a simulation-to-reality gap. The simulated data carry biases and errors introduced by the dynamics incorporated in the simulation, and there is no proof that these models would retain their performance under real conditions. Moreover, synthesizing these datasets and training on a large amount of data requires many days, even weeks, of effort. In Ren et al. (2020) and Pham et al. (2021), the fidelity of the simulations producing the dataset was increased by validating the simulation predictions against measurements from real experiments; nonetheless, this is a complicated task that requires extra time and work.
Another issue with the approaches mentioned above is that the majority of them are validated on short time windows of relatively simple and hand-picked geometries with only a few researchers, such as Stathatos and Vosniakos (2019), trying to extend their results for longer prediction horizons. This is usually the case due to the error accumulation that is observed when the models are asked to extrapolate over long periods of time. Long extrapolation times require integrating over many model predictions, leading to even the smallest errors having a substantial cumulative effect.
Furthermore, existing models lack a statistical perspective in their results. Uncertainty-aware models can be very useful, especially in optimization and control, as they allow expressing a notion of confidence about different scenarios and guaranteeing robustness (Bertsimas et al., 2010). In addition, most of the proposed deep architectures are not well suited for control purposes. While their representation power is very high, they usually result in non-linear and non-smooth functions, making optimization difficult. In fact, Schubnel et al. (2020) emphasize that it can be more efficient to optimize processes described as linear systems based on nonlinear regressors rather than using a more precise recurrent network.
In this work, we introduce GPyro (from the Greek word for fire, πυρ), a model that explicitly targets these gaps in the current literature. GPyro is designed to minimize the need for training data and thus allow optimization with real experiments without compromising the ability to extrapolate over long integration horizons. Moreover, GPyro keeps the structure of the model in a form that is easy to control, is computationally cheap, and is uncertainty-aware, meaning that we can predict the bounds within which we believe the thermal fields are contained with high probability. To the best of the authors' knowledge, this work is the first uncertainty-aware model for thermal predictions in additive manufacturing processes that is trained on a single experiment. One can find similar intuition to our approach in Ness et al. (2022), where the authors designed features that would allow the models to generalize without an excessive amount of data and used models suitable for control. However, the authors of that study addressed the problem differently from us. GPyro combines physics-informed and parametric regressors with a set of smooth attention mechanisms in a linear time-variant system and employs Gaussian Processes to predict the evolution of the model parameters through the various stages of the process. In this way, the model is structured in an auto-regressive form, facilitating the formulation of linear control strategies, while learned features capture the non-linearity of the system dynamics.
This work is organized as follows. First, an introduction to the heat equation and to Gaussian Processes can be found in the "Heat equation" and "Gaussian processes" sections, and the experimental setup used is found in the "Experimental setup" section. Then, a description of the proposed model and its application to WAAM is given in the "Model class" and "Model for WAAM" sections and, finally, the results, discussion, and conclusions in the "Results", "Discussion", and "Conclusions" sections. The project's source code can be found at https://github.com/iss1995/gpyro.git.

Heat equation
Modelling the thermal evolution of workpieces in additive manufacturing processes requires solving the heat equation

ρ(T) c_p(T) ∂T/∂t = ∇ · (λ(T) ∇T) + Q̇_hs,   (1)

where T = T(x, y, h, t) is the temperature field, with x, y, h being the coordinates of the solution domain and t being time; λ(T), ρ(T), c_p(T) are the thermal conductivity, the density, and the specific heat of the material, respectively; and Q̇_hs is a term modeling heat sinks and sources. For additive manufacturing applications, Q̇_hs is usually composed of two components,

Q̇_hs = Q̇_loss + Q̇_source,   (2)

where Q̇_loss are the thermal losses and Q̇_source is the thermal input coming from the heat source. Following Gh Ghanbari et al. (2020), the loss part Q̇_loss lumps convection and radiation losses into a single term,

Q̇_loss = -(ĥ_c + ĥ_r(T)) A (T - T_a),  with ĥ_c = h_c and ĥ_r(T) = εσ (T + T_a)(T² + T_a²),   (3)

where ĥ_c and ĥ_r(T) are the effective coefficients of heat transfer, h_c is the convection coefficient, ε is the emissivity, σ is the Stefan-Boltzmann constant, A is the free surface of the workpiece, and T_a is the ambient temperature. For the term Q̇_source, volumetric models are usually employed. One example of such models is the Goldak double-ellipsoid model,

Q̇_source(χ, ξ, η) = (6√3 f_{f,r} P) / (a b c_{f,r} π√π) exp(-3χ²/c_{f,r}²) exp(-3ξ²/a²) exp(-3η²/b²),   (4)

where P is the heat-source power; a, b, c_f, c_r, f_f, f_r are parameters defining the shape of the volume; and χ, ξ, η is the local coordinate system of the moving heat source. For more details on this heat-source model, the interested reader may refer to Goldak et al. (1984). Eq. 1 excludes the heat flow due to fluid dynamics within the melt pool and hence constitutes an incomplete view of the process. Nonetheless, for the scope of this work, this level of approximation suffices. It is important to note that the model parameters are not constant, which means that the process dynamics evolve over time.
For solving the equation, the domain is usually discretized in space, as illustrated in Fig. 3, into several elements defined by a number of nodes on which the thermal field is explicitly calculated. After applying constitutive equations and boundary conditions to all free surfaces, Eq. 1 is transformed into a set of Ordinary differential equations (ODE), following the same procedure as Stockman et al. (2019):

Ṫ_j = A_T T_j + B_T Q_j,   (5)

where j indicates the timestep index, A_T, B_T are the system matrices, and T_j, Q_j are vectors containing the temperatures and the heat flux at each node during timestep j. A set of ansatz functions is used to interpolate the nodal values and infer the temperatures at locations that do not correspond to the nodes. Eq. 5 shows that the thermal evolution on a grid of points can be approximated by an ODE system whose parameters depend on the material, the boundary conditions, and the temperature field itself. In this work, we leverage this observation and develop a model that learns, in a data-driven way, an update rule for Ṫ_j.
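To make the discretized dynamics of Eq. 5 concrete, the sketch below propagates a 2D nodal temperature field by one explicit-Euler step, combining a 5-point conduction stencil with a lumped Newtonian loss term and a source. This is a minimal illustration under assumed constant material parameters and crude (periodic) boundary handling, not the solver used in this work.

```python
import numpy as np

def heat_step(T, dt, dx, alpha, h_eff, T_a, q_src):
    """One explicit-Euler step of a discretized 2D heat equation.

    T      : (ny, nx) array of nodal temperatures
    alpha  : thermal diffusivity (assumed constant here)
    h_eff  : lumped convection/radiation coefficient, cf. Eq. 3
    q_src  : nodal source term, cf. Eq. 4
    Note: np.roll wraps the stencil around the edges, i.e. boundary
    conditions are handled only crudely for brevity."""
    lap = (np.roll(T, 1, 0) + np.roll(T, -1, 0) +
           np.roll(T, 1, 1) + np.roll(T, -1, 1) - 4.0 * T) / dx**2
    dT = alpha * lap - h_eff * (T - T_a) + q_src   # right-hand side of Eq. 5
    return T + dt * dT
```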

Gaussian processes
Gaussian processes (GP) are widely used in the Machine Learning field to perform non-parametric regression. The goal of non-parametric regression is to find a mapping, f(θ): R^dim(θ) → R, where θ is a vector and f(θ) is a scalar. Non-parametric regression in the Gaussian framework considers each value f(θ) as a random variable, with every finite set of f(θ) values having a joint Gaussian distribution related to the values of θ. In the literature, there is a wide variety of applications of GP regression in modelling, ranging from drug discovery (Sahli Costabal et al., 2019), materials and molecules modelling (Deringer et al., 2021) to timeseries analysis (Roberts et al., 2013), robotics, control, system identification (Deisenroth et al., 2015; Hewing et al., 2020; Jain et al., 2018) and safe Reinforcement Learning (Berkenkamp et al., 2016, 2017). In the context of WAAM, GPs have been used for weld bead shape optimization (Lee, 2020), but never for tackling the thermal evolution of produced parts. The main attributes that make these models so widespread are the closed-form equations for calculating uncertainty intervals, the easy assimilation of new data, and the ability to include prior knowledge in the models.
To use GP regression, two priors have to be defined for the statistical model: one for the mean and another for the covariance. The former, μ, is used to reflect a known trend of the process, whereas the latter is the covariance kernel function, k, defining the smoothness and the rate of change of f(θ). With no loss of generality, μ can be considered zero, meaning that no assumption is made regarding the overall trend of f, and the choice of k strongly depends on the modeling problem. Typical kernel choices are the Radial basis function (RBF), the Laplacian, and the Matérn kernels. A broader collection of covariance functions and their properties can be found in Rasmussen and Williams (2005).
Each kernel function comes with a set of hyperparameters that tune its behavior. As an example, an RBF kernel function between two vectors, θ and θ', is of the form

k(θ, θ') = σ_n² exp(-½ (θ - θ')ᵀ M⁻¹ (θ - θ')) + σ_ω² δ_{θ,θ'},

where σ_n is the spread, M is a diagonal matrix containing the (squared) lengthscales of the kernel, δ_{θ,θ'} is the Kronecker δ, and σ_ω accounts for measurement noise. In the presence of data D = {Θ, f̄}, where Θ = {θ_1, …, θ_N} and f̄ are the corresponding noisy observations, the values a of the hyperparameters can be optimized by applying maximum likelihood type II to maximize the marginal log-likelihood of the data,

log p(f̄ | Θ, a) = -½ f̄ᵀ (K(Θ, Θ) + σ_ω² I)⁻¹ f̄ - ½ log |K(Θ, Θ) + σ_ω² I| - ½ N log 2π,   (6)

where K(Θ, Θ) is the covariance matrix with entries [K]_{jl} = k(θ_j, θ_l) and N is the number of datapoints. The optimization of the marginal log-likelihood is performed with gradient methods by using the partial derivatives with respect to a. Eq. 6 is composed of 3 parts. The first, ½ f̄ᵀ (K(Θ, Θ) + σ_ω² I)⁻¹ f̄, quantifies the goodness of fit to D. The second part, ½ log |K(Θ, Θ) + σ_ω² I|, puts a penalty on complex models and acts as a regularizer, while the last term, ½ N log 2π, is a normalization constant. Having the priors defined, a GP can be used to predict a value f(θ*) based on a set of noisy observations. The mean and the variance of f(θ*) are given by the equations of GP posterior inference,

μ(θ*) = k_N(θ*)ᵀ (K(Θ, Θ) + σ_ω² I)⁻¹ f̄,   (7)

σ²(θ*) = k(θ*, θ*) - k_N(θ*)ᵀ (K(Θ, Θ) + σ_ω² I)⁻¹ k_N(θ*),   (8)

where k_N(θ*) = [k(θ*, θ_1), …, k(θ*, θ_N)]ᵀ and I is the identity matrix. After observing a value f(θ') associated with θ', the new data can be combined with the rest of the measurements, forming D_{N+1}. Then, for inferring the value of f(θ*) for a new vector θ*, Eqs. 7 and 8 are conditioned on D_{N+1}, assimilating in this way the new data into the model with no extra training. This trait makes GPs very suitable for online learning. For more details on Gaussian processes, we urge the interested reader to refer to Rasmussen and Williams (2005).
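As an illustration of Eqs. 6-8, the following sketch implements zero-mean GP posterior inference with an RBF kernel in plain numpy. The function names and fixed hyperparameter values are ours, chosen for brevity; in practice, the hyperparameters would be fitted by maximizing the marginal log-likelihood of Eq. 6.

```python
import numpy as np

def rbf(A, B, sigma_n=1.0, ls=1.0):
    """RBF kernel matrix between the row-vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_n**2 * np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X, f, Xs, sigma_w=1e-2):
    """Closed-form GP posterior mean and variance (Eqs. 7 and 8).

    X : (N, d) training inputs, f : (N,) noisy targets, Xs : (m, d) queries."""
    K = rbf(X, X) + sigma_w**2 * np.eye(len(X))    # K(Theta, Theta) + noise
    Ks = rbf(Xs, X)                                # cross-covariances k_N(theta*)
    mean = Ks @ np.linalg.solve(K, f)              # Eq. 7
    var = rbf(Xs, Xs).diagonal() - np.einsum(
        'ij,ji->i', Ks, np.linalg.solve(K, Ks.T))  # Eq. 8
    return mean, var
```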

Experimental setup
The setup used for performing the training and validation experiments is illustrated in Fig. 2. In each experiment, the deposition of the part shown in Fig. 1a was carried out using different path plans. A mild steel plate of dimensions 250 × 250 × 5 mm was clamped on the positioner. The gas metal arc welding process was carried out with a wire of 1 mm diameter, made from mild steel (ER70S-6). While the heat source inputs heat into the workpiece along its welding path, the thermal response is recorded on the underside of the support plate. A thermal camera with a resolution of 1024 × 768 pixels at 10 Hz is used to monitor this surface. Monitoring the back of the plate allows logging the temperature profiles on the observed plane without disturbances coming from the brightness of the electric arc.
For depositing the parts, a continuous welding motion is chosen for every pocket (1-9) and the contour (10), as indicated in Fig. 1a, always starting and finishing at the same point. This deposition strategy keeps the number of welding starts low and enables a total of 10! = 3,628,800 path possibilities per layer. Every permutation essentially consists of a different ordering of pocket deposition sequences, e.g., 9-1-2-8-3-7-4-6-5-10. We performed 27 different experiments, each with a different permutation, depositing 6 layers every time. We used 1 of these experiments for training and the rest for validation. The link to the data can be found in the project's repository.

Model class
To tackle the problem of predicting the evolution of thermal fields in additive manufacturing, which are generated by different paths x_t, a model is needed that can adapt to the time-dependent dynamics of the process. Furthermore, the model is designed for optimization and control purposes, which constrains the computational costs available for its execution and the complexity of its architecture. To address these issues, a novel generative class of models is introduced, combining physics-informed and simple parametric elements through a smooth attention mechanism. The physics-informed part leverages the prior knowledge relevant to the system, such as Eq. 1, while the parametric parts are used as surrogates to approximate system responses that are not captured by the physics-informed elements. Additionally, the attention mechanism aims to select which parts of the model are activated at each point. To address the variability in the dynamics, the model parameters depend smoothly on a system state, i.e., a set of variables that controls the evolution of the dynamics themselves. Firstly, the generic model class is introduced, and then it is adapted to solve thermal problems.
A generic representation of the target model class is

ż = m(z; θ) + Σ_{i=1}^{M} f_i(z, s; θ) g_i(z, s; θ),  θ ∼ N(μ(s), Σ(s)),   (9)

where m() is a (possibly non-linear) physics-informed regressor, constructed based on the balance and constitutive laws describing the physical problem the model addresses, e.g., the energy conservation law; g_i(), i = 1, 2, …, M are a collection of parametric models acting as a supplement to m(); and f_i(), i = 1, 2, …, M are attention mechanisms that activate the correct part of the model at the appropriate time. The model depends on 2 states, z and s, and a set of parameters θ, which are drawn from a Gaussian distribution parameterized by a mean function μ(s) and a covariance matrix Σ(s). Different θ values can thus be sampled for different values of the state s. More specifically:

• z defines the state of the constitutive and balance equations. For instance, in the case of thermal problems, z would be the temperature distribution in the domain of interest.

• s defines the state of the system itself. Often, the parameters of the constitutive equations driving a problem change with time. For this reason, the model is treated as a dynamical system whose state depends on a set of variables. Considering again the example of thermal problems, this change in the dynamics could capture how the material parameters, such as the thermal conductivity λ(T), the density ρ(T), and the heat capacity c_p(T), are altered with temperature, or how the heat dissipation is affected by the boundary conditions on different parts of the solution domain.

• θ defines the set of parameters used by the models m(), g_i(), and f_i(). θ also depends on the model state s, as it is a part of the model. In this framework, we define θ as a GP that models the change in the system's dynamics. GPs are an ideal choice for describing the evolution of the model dynamics, as they can effortlessly assimilate new data, such as the parameters θ learned for a new state s found after optimizing Eq. 9 on a set of new data. Moreover, they provide an easily interpretable statistical perspective to the model class, which can help in building robust optimization schemes.
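A minimal sketch of how Eq. 9 composes its parts is given below. The callables m, gs, and fs stand in for the physics-informed regressor, the parametric supplements, and the attention mechanisms; all signatures and names here are illustrative assumptions rather than the paper's API.

```python
import numpy as np

def model_step(z, s, theta, m, gs, fs):
    """One evaluation of the generic model class (the structure of Eq. 9).

    m  : physics-informed regressor, m(z, theta)
    gs : list of parametric supplements g_i(z, s, theta)
    fs : list of attention mechanisms f_i(z, s, theta) in [0, 1]."""
    dz = m(z, theta)
    for f_i, g_i in zip(fs, gs):
        dz = dz + f_i(z, s, theta) * g_i(z, s, theta)  # attention-gated supplements
    return dz

def sample_theta(s, mu, Sigma, rng=np.random.default_rng()):
    """Draw model parameters from the state-dependent Gaussian N(mu(s), Sigma(s))."""
    return rng.multivariate_normal(mu(s), Sigma(s))
```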

Intuition behind the model:
The model is composed of 2 main parts: a physics-informed one, m(), and data-driven supplements, g_i(), that target what the physical part fails to capture. These two parts are connected by an attention mechanism, f_i(), defined as a smooth function whose M components sum to values between 0 and 1. This attention mechanism acts like an artificial switch, shifting the model's focus to the right constituents. The switch allows the supplementary models g_i() to be accurate only on the limited range of the state z where they are needed, making their training easier. Moreover, it decouples the performance of m() and g_i(), so that each component does not have to compensate for the inaccuracies of the other.

Model for WAAM
Equation 9 presents the generic framework of the model class, which can be adapted to the constitutive equations of the problem to be solved. In this work, the model should describe the dynamics of Eq. 1 for parts produced in the welding cell of Fig. 2. To achieve this, the model is adapted to predict the temperature increments anticipated at the different nodes n_i shown in Fig. 3a. As the thermal camera in the setup described in the "Experimental setup" section observes the temperatures of the base plate, T(x, y, h = -0.005, t), the learned temperature increments correspond to the observed layer. Confining the problem to this layer allows using the temperature measurements directly as targets for the model. Extending the model to unobserved layers would introduce inductive biases, as no information is given about their state. As a result, the thermal problem we tackle is reduced to a 2D problem. The temperature profiles of the discretized plate, T_j = {T_{i,j}, i = 0 … n}, containing the temperature values of the nodes n_i for a timestep t_j, are driven by the dynamics in Eq. 5. They are determined mainly by 3 phenomena, i.e., thermal conduction, heat transfer to the environment, and energy input by the torch. For the 2D case, it is possible to describe the contribution of the former two phenomena based on the plate temperature values, following Eq. 1, since they mainly depend on the temperatures of the observed plane itself. The input heat, though, is not directly observed by the camera. Its effects are only visible through the temperature increase of the base plate after the heat has flown from the deposition layer to the bottom of the base plate. Therefore, heat conduction and dissipation on the plate are incorporated into the non-linear regressor m(), and a parametric model g() describes the effect of the heat source on the observed layer.
Physics-informed regressor m(): The heat conduction between neighboring nodes is estimated through a discrete Laplacian feature,

L_j = {L_{i,j}, i = 0 … n},  L_{i,j} = T_{η+1,ξ,j} + T_{η-1,ξ,j} + T_{η,ξ+1,j} + T_{η,ξ-1,j} - 4 T_{η,ξ,j},   (10)

where η and ξ are the grid coordinates of each node n_i, as illustrated in Fig. 3, and x_i is the corresponding location given a frame of reference. Thus, L_j fuses the temperature values of neighboring nodes at each timestep to estimate the heat conduction. Secondly, the energy dissipation towards the environment and through the boundaries is captured by terms proportional to the nodal temperature excess over ambient, so that m() takes the form

m(T_j) = θ_0 (T_j - T_a 1_n) + θ_1 L_j + θ_2 δ_1 ∘ (T_j - T_a 1_n).   (11)

Each element δ_{1,i} of the vector δ_1 encodes spatial information into the model and acts as a Kronecker δ on the last term of Eq. 11, activating and deactivating it according to the location of each node. Since each regressor term aims to model a physical phenomenon, we can also infer the sign of the corresponding weights. From the law of energy conservation, we can deduce that θ_0, θ_2 ≤ 0 and θ_1 ≥ 0. These constraints can be used during optimization to accelerate convergence and guarantee that the energy balance is conserved.

Fig. 3 a Selected discretization for the 2D plate. Each node n_i is characterized by i = η + 9ξ. The distance between neighboring nodes is 27 mm, yielding a sparse grid. Our algorithm allows for sparse grids and a linear reduction in computational burden. b Example of the locations of n_i and p_i, h_{p_i,j}, h^in_{p_i,j}, shown on a section of the printed geometry

Surrogate model g(): This part of the model captures the effect of the moving heat source on the temperature profile observed at the bottom surface of the base plate. This surrogate could be any parametric model, e.g., a neural network. In practice though, since the data and time needed for optimizing the model parameters are to be minimized, a simple linear regression scheme is chosen (Eq. 12), whose output is linear in its input features: h_{p,j} = {h_{p_i,j}, i = 0 … n}, the vector of deposition heights, i.e., the distance of the deposited layer from the observed plane as seen in Fig. 3; 1_n, a vector of ones; and Θ_g = {θ_{u,i}, u = 3 … 5, i = 0 … n}, the relevant parameters for each variable. The simplicity of the surrogate model facilitates the optimization by reducing the number of learned parameters as much as possible.

Attention mechanism f(): This part of the model selects which component is activated. Considering that g() describes the thermal contribution of the torch, its output should be taken into account when the heat of the deposited material in the neighborhood of a node has traveled through the part. Thus, f() is designed to choose model components according to the distance of the torch from each node and to a heat transport delay. To calculate both the delay and the distance, the projection, p_i, of each node, n_i, is defined on the deposition plane. The projections p_i are named as the counterparts of n_i, and from now on, they are referred to as such. Then, the distances d_{i,j} between p_i and the position of the torch at timestep t_j, x_{t_j}, and the number of timesteps t^D_{i,j} required for the heat to arrive at the observed plane are evaluated (Eq. 13). With these 2 values, f is calculated as a smooth, bounded activation of the delayed torch distance, parameterized by θ_f (Eq. 14). A visualization of the different model components can be found in Fig. 4a, while the shape of the activation function for different values of θ_f is found in Fig. 4b. Fig. 4a illustrates the contributions of, and the coupling between, the components. After the term g is activated, the physics-informed part m, responsible for heat conduction and dissipation, obtains significantly higher values, and the temperature profiles achieved resemble Newtonian cooling. Moreover, it is also interesting to observe the effect of the attention mechanism on the activation and deactivation of the components.
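The sketch below only mimics the behavior described for the attention mechanism; the paper's exact Eq. 14 is not reproduced here, and the gate shape, the distance scale, and all names are our assumptions. It illustrates the intended switching: the gate opens smoothly once the delayed heat wave from a torch pass reaches a node, with sharpness set by θ_f, and its output stays bounded in (0, 1).

```python
import numpy as np

def attention(d, j, t_cross, t_delay, theta_f, dist_scale=50.0):
    """Illustrative smooth gate for the surrogate g() (not the paper's Eq. 14).

    d       : current torch-to-counterpart distance of the node
    j       : current timestep index
    t_cross : timestep at which the torch crossed the node's counterpart p_i
    t_delay : heat-transport delay t^D in timesteps
    theta_f : sharpness of the temporal onset (cf. Fig. 4b)."""
    lag = j - (t_cross + t_delay)                       # steps since heat arrival
    gate_time = 1.0 / (1.0 + np.exp(-theta_f * lag))    # smooth temporal onset
    gate_dist = np.exp(-(d / dist_scale) ** 2)          # fade with torch distance (assumed)
    return gate_time * gate_dist                        # bounded in (0, 1)
```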
To model the evolution of the system dynamics, states s_{i,j} should be assigned to each node n_i, reflecting the different stages of the process. To facilitate training, the states are expressed as s_{i,j} = h_{p_i,j}, where h_{p_i,j} is the current height of n_i's counterpart. Moreover, different states are assigned to nodes close to boundaries, to learn the effect of boundary conditions on the system dynamics. Thus, the nodes n_i that do not get excited and lie in the central plate remain at s_i = 0 for every timestep t_j, while the rest evolve. To simplify the approach and keep the state of the system dependent on a single variable, the states of the nodes on the boundaries are encoded as s_{i,j} = -h^in_{p_i,j}, where h^in_{p_i,j} is the current height of the counterpart of the closest node to n_i that does not lie on the boundary, as illustrated in Fig. 3b. Thus,

s_{i,j} = h_{p_i,j} for excited internal nodes, s_{i,j} = 0 for non-excited central nodes, and s_{i,j} = -h^in_{p_i,j} for boundary nodes.   (15)

Finally, the dependency of the model parameters θ̄ = {θ_u, u = 0 … 6} on the system states s_j is modelled with Gaussian processes with a bilinear prior mean and a Matérn 3/2 kernel. An important note here is that a bilinear prior mean does not imply a linear posterior mean, as the value of the latter function is defined by Eq. 7 and the calculated weights. Every component of θ̄ is assigned a separate GP regression model. This decouples the weights and allows each model to grow independently.
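For concreteness, a sketch of the GP prior assigned to each parameter θ_u follows: a Matérn 3/2 kernel over the scalar states, with a "bilinear" prior mean that we read here as two linear pieces, one for the negative (boundary) states and one for the non-negative (excited) states; this reading of "bilinear", and the hyperparameter names, are our assumptions.

```python
import numpy as np

def matern32(s1, s2, sigma=1.0, ls=1.0):
    """Matern 3/2 kernel between two 1D arrays of scalar states."""
    r = np.abs(s1[:, None] - s2[None, :]) / ls
    return sigma**2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def bilinear_mean(s, a_neg=0.0, a_pos=0.0, b=0.0):
    """Two-piece linear prior mean: one slope for boundary states (s < 0),
    one for excited states (s >= 0). Slopes and offset are hyperparameters
    that would be fitted together with the kernel via Eq. 6."""
    return np.where(s < 0, a_neg * s + b, a_pos * s + b)
```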

Training
Optimizing Eq. 9 jointly on the data of a single experiment is not a straightforward task. The interconnection between f() and the rest of the components makes the optimization hard, as it can turn unstable. Moreover, even if the scheme is stable, optimizing the product of functions yields a highly complex optimization problem, which is challenging to solve.
In this work, the goal is to achieve learning in almost real-time, and a complicated optimization scheme would not be a good fit. Thus, to reduce the optimizer's computational burden and remove the solution's dependency on the initialization point, the optimization scheme should be as simple as possible without compromising the quality of the solution found. This introduces the need to simplify our optimization approach.

Algorithm 1 Training algorithm: after every deposited layer, the new data are incorporated into the training algorithm, split into their corresponding states, and used to retrain (if needed) each state-model θ_s separately. Then, the models' point estimates are used as samples for a Gaussian Process that acts as a generative model for the parameters θ.
The necessary simplification can be achieved by breaking the joint optimization problem into its constituents. Instead of learning all the parameters θ̄(s) for every s simultaneously, the optimization focuses on independent sets θ̄_s for each discrete value or group of s. Then, all the available estimations of θ̄_s are jointly combined with a Gaussian Process and predicted at inference time. Therefore, each θ̄_s is considered a sample from the Gaussian random process θ̄(s), with mean μ(s) and covariance Σ(s). The GP θ̄(s) can then be sampled to provide the values of the parameters θ for any value of s.
To learn each θ̄_s, 4 sequential steps are executed. Firstly, the thermal data observed at the bottom of the base plate are synchronized with the movements of the heat source. In particular, the synchronization finds and matches the high-temperature peaks observed at n_i with the corresponding movement of the heat source that caused them. In practice, the output of the synchronization step is a pair of timesteps comprising the time that the torch crossed p_i and the time needed for the heat to reach n_i. This allows decoupling the performance of the delay model found in Eq. 14 from the rest of the system. Secondly, the attention mechanism is trained to select the constituents of the model. Since the role of the attention mechanism f() is to activate the surrogate g() when the heat of the welding torch arrives at a location n_i, θ_{s,f} is trained to keep f() active long enough to capture the whole duration of the heat wave.
After defining f(), g() can be optimized on the time segments where f_j > 0 to capture only the increase in the temperature due to the incoming heat wave. Finally, m() is optimized by taking the other two system components into account. For fusing and scaling the predictions of g() and m(), an extra parameter, θ_p, is introduced and optimized together with m().
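A sketch of the first (synchronization) step is given below, pairing each torch crossing of p_i with the first subsequent temperature peak at n_i via peak detection. The prominence threshold and the matching rule are our assumptions; they only illustrate the peak-matching logic described above.

```python
import numpy as np
from scipy.signal import find_peaks

def synchronize(T_node, torch_cross_steps, min_rise=5.0):
    """Match torch passes over a node's counterpart p_i to the temperature
    peaks they cause at n_i (an illustrative sketch, not the paper's code).

    T_node            : 1D temperature history of one node
    torch_cross_steps : timesteps at which the torch crossed p_i
    Returns (cross_step, measured_delay) pairs for fitting the delay model."""
    peaks, _ = find_peaks(T_node, prominence=min_rise)
    pairs = []
    for tc in torch_cross_steps:
        later = peaks[peaks >= tc]          # first peak after this crossing
        if len(later):
            pairs.append((tc, later[0] - tc))
    return pairs
```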
By substituting Eqs. 11-14 into Eq. 9 and by scaling g with θ_p(s_j), the final form of the thermal model for WAAM is derived,

Ṫ_j = m(T_j; θ_{s,m}) + θ_p(s_j) f_j ∘ g_j,   (16)

for every t_j that a node n_i is in state s. Eq. 16 shows the operations needed to calculate the temperature increment of each node at each timestep. Following this formula, the remaining parameters θ_{s,m}, θ_{s,p} are optimized to describe the temperature increment of each node given the previous temperatures, the torch path, and the optimized surrogate and attention mechanisms. The final set of parameters considered by the model is θ_s = θ̄_s ∪ θ_{s,p}, and θ_p is also assigned a GP regression model. This learning scheme is repeated every time a new batch of data is gathered, for every state value s acquiring new measurements. The parameters are updated until the model's predictive performance for s does not change significantly with new data. The loss function used for the optimization is the mean of the squared relative error,

L(θ) = (1/N) Σ_{j=1}^{N} e_j² + l ‖θ‖²,

where N is the prediction horizon used in every optimization step, l is the regularization coefficient, and e is the relative error of the model. For g() and f(), a prediction horizon of N = 100 timesteps of size Δt = 0.5 s was used, while m() was optimized on N = 1. Every optimization step can be performed either with auto-differentiation libraries like PyTorch or, due to the low dimensionality of the optimization problem, with off-the-shelf optimizers like l-BFGS-b (Byrd et al., 1995). In this work, we choose the latter approach to show that even with a mid-range PC, one can train the model states in a few seconds. More specifically, our training scheme needs approximately 2.14 s to obtain an estimate of θ_s for every new state s, running on a single core of an AMD Ryzen 9 5900X with a clock frequency of 3.7 GHz. For every optimization step, 75% of the available data is used for training and 25% for validation. A synopsis of the training algorithm is shown in Alg. 1.
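The per-state optimization step can be reproduced with an off-the-shelf optimizer, as in the sketch below, where the sign constraints on the weights of m() are encoded as box bounds for l-BFGS-b. The residual callable and the exact loss shape are written under our notation assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def fit_state_model(residual, theta0, bounds, l2=1e-3):
    """Fit one theta_s with l-BFGS-b (a sketch of the per-state training step).

    residual(theta) : returns the vector of relative errors e over the horizon
    bounds          : box constraints, e.g. [(-np.inf, 0), (0, np.inf), (-np.inf, 0), ...]
                      encoding theta_0, theta_2 <= 0 and theta_1 >= 0."""
    def loss(theta):
        e = residual(theta)
        return np.mean(e**2) + l2 * np.sum(theta**2)   # regularized mean squared relative error
    res = minimize(loss, theta0, method="L-BFGS-B", bounds=bounds)
    return res.x
```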
Finally, the obtained parameters are included in the GPs after the optimization is finished. For each parameter, a GP regression model is maintained and updated with the new parameters. The hyperparameters of the GPs can be updated as well with the addition of new states, by maximizing Eq. 6.

Model propagation
Once the GP regression models for θ(s) are obtained, the weight values Θ(s_j) = {θ(s_{i,j}), i = 0 … n} can be sampled for any timestep t_j, thus determining the model in Eq. 16. The states of the system, as defined in Eq. 15, only depend on the path plan. As a result, given an arbitrary predefined torch movement x_t, s_j is defined, and the values Θ(s_j) can be directly sampled from a Gaussian distribution parametrized by the mean and covariance defined in Eqs. 7 and 8. One reasonable choice of samples Θ(s_j) is the mean of the GP, as this represents the most probable weight values learned.
With the estimations of the parameters Θ(s_j), Eq. 16 can be used to approximate the temperature increment of each node on a discretized domain. The thermal fields can be reproduced step by step by integrating the thermal increments over time for all nodes. The node-wise operations for each timestep introduced in Eq. 16 can be executed in parallel to save computational time. Therefore, the time integration is performed in batches, where each batch contains the updates of each node on the grid. The integration formula is given as

T_{j+1} = T_j + Δt (m(T_j; θ_m(s_j)) + θ_p(s_j) ∘ f_j ∘ g_j),   (17)

where ∘ is the Hadamard product for element-wise multiplication. For updating each node, the model considers the temperatures surrounding the node, the boundary conditions applicable, and any thermal contribution stemming from the torch. The only elements necessary to fully define a temperature extrapolation are an initial temperature field T_0 and a torch path plan x_t for a horizon N.
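A rollout of Eq. 17 then reduces to the short loop below, where each step performs one vectorized (Hadamard) update over the whole grid. The component signatures are illustrative assumptions.

```python
import numpy as np

def propagate(T0, steps, dt, theta_p, m, f, g):
    """Integrate Eq. 17 over a horizon (an illustrative sketch).

    T0      : (n,) initial nodal temperature field
    theta_p : (steps, n) sampled fusion weights per timestep
    m, f, g : fitted model components evaluated per timestep."""
    T = [T0]
    for j in range(steps):
        dT = m(T[-1], j) + theta_p[j] * f(j) * g(T[-1], j)  # element-wise products
        T.append(T[-1] + dt * dT)
    return np.stack(T)   # (steps + 1, n) thermal history of all nodes
```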

Results
Incremental training of Eq. 16 yields different system representations at every stage of the process. This can be seen in Fig. 5, where the evolution of 3 parameters is illustrated after printing 4 and 6 layers. The model predicts the bounds within which the weights θ are contained with high probability. The most probable weights are found by calculating the mean of the Gaussian process. An important observation is that the most probable weights are not always identical to the weights stemming from the state-wise optimization, as the GP evaluates the joint probability of all samples for inferring the mean.
With the estimations of the parameters θ, Eq. 16 can be used to approximate the solution of the heat equation. By using the mean of the GP to define the most probable θ and integrating as described in the "Model propagation" section, the thermal fields of the validation set are reproduced. For instance, Fig. 6 illustrates the thermal fields derived after integrating 1650 steps (825 s) (left), the corresponding measured fields at the same timestep (middle), and the prediction errors (right). The errors' magnitude remains small over the plate's surface, indicating that the predictions' accuracy does not depend on the location.
The thermal history of each node on the computational grid can be extracted from the acquired temperature fields. The thermal histories of a selection of nodes resulting from 4000 steps (2000 s) of integration can be found in Fig. 7. The trajectories are uncertainty-aware, meaning that each prediction consists of a mean, which is derived from the most probable set of weights μ(s), and a confidence interval, which is approximated with Monte Carlo sampling. In this case, 100 samples were used for the approximation, although in practice, even fewer were necessary. In the plot, the measured histories are annotated in blue, the most probable trajectory that the model predicts is indicated with a dark orange dotted line, the orange shadow indicates the region within which the temperature profile is confined with probability 95% (a ±2σ band), and the black dashed line is the absolute error calculated after aligning the signals with Dynamic time warping (DTW) (Giorgino, 2009). It is important to note, firstly, that the thermal predictions of the most probable trajectory are consistently precise for every node; secondly, that there is no sign of error accumulation over the extrapolation horizon; and lastly, that the measured temperature profiles are confined in the confidence region.
For evaluating the precision of the algorithm, after training on the thermal data stemming from one experiment, the model is validated on the remaining n_val = 26 experiments. To quantify the model's performance, we compare our predictions with measured data on the internal nodes of the plate in Fig. 3a. As a result, for every experiment, there is a total of n_in = 49 time-series to compare, yielding a total of 1274 validation time-series. For every experiment, a total of N = 4000 integration steps (T = 2000 s) is calculated. We compare the calculated results with the measured fields by using DTW to align the 2 signals, and then we calculate their mean absolute relative error,

ê = (1/N) Σ_{j=1}^{N} |T_j - T_j^meas| / T_j^meas,

where T_j^meas is the measured temperature field at timestep t_j, a method commonly used for comparing time-dependent data with small phase differences (Laperre et al., 2020; Ding et al., 2008). The DTW relative distance corresponds to the thermal errors the model would produce given an optimal time alignment of the predicted and measured signals. This operation allows decoupling the model performance from the delay. The errors in Fig. 7 are calculated similarly, using the DTW alignment between the illustrated mean prediction and the measured data. The interested reader can find more details on DTW in Appendix A. The distribution of all the relative distances in every validation experiment, E = {ê_{i,ex}, i = 0 … n_in - 1, ex = 0 … n_val}, where ê_{i,ex} is the mean relative absolute temperature error of node i in validation experiment ex, can be found in Fig. 8a. The results show a mean relative distance of 2.42% and a variance of 1.38%, proving the consistency of the model's performance.
The distribution of the delay model errors can also be found in Fig. 8b. These errors correspond to the time offset between a predicted temperature peak and the corresponding measured one. While the vast majority of errors lie within ±20 s, which is a small error compared to the 2000 s of integration, a few outliers are found. Outliers can be formed due to synchronization errors, problems in the interpolation of the torch path plan x_t, or missing dynamics in the simple linear delay model.

Integrating Eq. 17 requires evaluating and adding the derivative on every single node for every timestep. This results in a linear computational complexity both in the number of nodes and in the time horizon. Consequently, using a sparse node grid and a large timestep can significantly reduce the computational burden of the integration scheme. Since the m() component resembles the discretized time-domain thermal transfer function of the 2D plate, there is no need to maintain a dense grid for calculating the temperature updates, as an FEM solution would require. Instead, the inter-node distances and the timestep are kept as large as possible, to maintain the observability of the process and save unnecessary calculations. We empirically found that we could keep a sparse grid with inter-node distances of 27 mm and use a timestep of 0.5 s. As a result, our computational scheme could execute almost 14,000 times faster than the real process (0.14 s instead of 2000 s) on a single core of an AMD Ryzen 9 5900X with a clock frequency of 3.7 GHz.

Discussion
In the present work, a new generative model family is introduced that treats the model parameters as dynamical systems, evolving in time according to a learned law. As a result, the models can adapt to multiple scenarios and learn to distinguish which system representation fits best to each case. It is straightforward to adapt the introduced model family to diverse physical domains, as illustrated in the "Model for WAAM" section, where the model was adapted to solve the heat equation.
Moreover, the performance of the adapted model demonstrates that in the setting of WAAM it is feasible to learn thermal models from single experiments, even during process execution time. Previous approaches in this setting used models that required large amounts of data. For instance, to develop a GRU network, 250,000 examples were used for training in Mozaffar et al. (2018), while in Pham et al. (2021) the authors achieve high performance after creating a dataset comprising 20 million examples. Gathering such datasets is impractical and expensive; therefore, synthetic data were employed, obtained after lengthy simulation runs. However, the resulting models were not tested with data directly measured from an experiment and could be biased due to an extensive simulation-to-reality gap. In contrast, our model needs significantly less data to acquire an accurate representation of the system dynamics because of the simplicity of the architecture, the physics-informed component, and the expressive power of the Gaussian processes, which capture the evolution of the parameters in Eq. 16.
The reduction in the training data removes the need for synthetic datasets and only uses measurements stemming from a single experiment executed on the setup in Fig. 2. In this study, we developed GPyro based on the experimental results available. A comparison with other data-driven models has not been included in this work. After studying the performance of Long short-term memory (LSTM) and GRU networks on the available dataset of 27 experiments, we reached the conclusion that those models are not competitive with GPyro. Due to its architecture, our model is able to generalize with data from just a single experiment, whereas purely data-driven models seem to require very large datasets to generalize. Other Physics informed neural networks (PINN) could present a performance similar to GPyro, especially soft-constraint PINNs (Raissi et al., 2019). These models incorporate physical laws and boundary conditions as terms in the loss function and, in this way, manage to fit a fully connected neural network that learns an accurate representation of the system dynamics from only a few training examples (Cai et al., 2021). However, uncertainty quantification has not been greatly explored for these models; optimization is usually performed with l-BFGS-b, which for large sets of parameters can be challenging to initialize (Lou et al., 2021); and they also present various convergence problems, even for simple dynamical systems, such as those in convection problems. In fact, Krishnapriyan et al. (2021) show that for a wide range of physical parameters, learning a PINN for such simple problems can be very challenging because of the resulting optimization landscape. Although these disadvantages discourage the choice of a PINN for a general-purpose tool like GPyro, a comprehensive comparison between the model introduced and other PINNs is beyond the scope of the current work.
One of the direct consequences of the high accuracy of the model predictions presented is the long integration intervals achieved without error accumulation. Although many accurate models have been proposed, few works have evaluated their results over long prediction horizons. The integration horizon is usually kept short because even the slightest prediction errors can have a substantial effect when many predictions are taken into account. However, with GPyro, we can integrate over periods of thousands of seconds, corresponding to multiple deposited layers. This is due to the linking of the thermal evolution of all nodes and the physics-informed regressor, which help maintain the balance between lost and acquired heat. As a result, the integration intervals are kept long without any substantial error accumulation. In fact, the only integration limit present when validating the models was the length of the executed experiments.

Fig. 8 a Illustrates the Dynamically time warped (DTW) relative absolute temperature errors between the measured and the predicted temperature profiles of the validation set. The low mean and variance of the distances prove that the model performs consistently well over all 26 experiments of the validation set. b Shows the delay model errors in seconds. Negative delay errors refer to prediction lags, and positive ones to leads. Delay errors are mostly within a range of 20 s, which is about 1% of the total extrapolation length. The outliers are caused by x_t interpolation errors
Another major advantage of GPyro is that it is ideal for process optimization and control purposes. Several model properties support this trait. One of the main strong points of GPyro is that it can be interpreted as a linear time-variant system, given its autoregressive form, the linearity of the operations between weights and features, and the smooth transitions from m() to g(). This interpretation indicates that such a model can be used directly to shape control loops with classical control theory techniques (Middleton & Goodwin, 1988). The control-loop shaping capabilities of the model are further strengthened by the tractability in the quantification of the parameters' uncertainty that the Gaussian processes offer. The uncertainty can be leveraged for designing robust advanced control strategies such as H ∞ and robust Model predictive control (MPC) controllers. Furthermore, integrating the system evolution on sparse grids with large timesteps results in a linear decrease in computational burden and solution times that are orders of magnitude lower than the corresponding process time. Hence, the present work is well suited for closed-loop control and iterative probabilistic optimization.
The present model shows high performance in predicting the thermal fields over time horizons extending to several layers. Both validation and training data are gathered from welding the geometry illustrated in Fig. 1a with different welding paths. A valuable extension of the current framework could be an automated grid generation algorithm capable of dealing with the sparsity-retained information trade-off. Moreover, extending the current model to non-regular grids would increase the flexibility of our algorithm. This can be achieved by conditioning the parameters on the nodal distances. Furthermore, future work should focus on improving the delay estimations, for example by designing a more robust experiment, logging the robot trajectory in high resolution, and finding a more sophisticated model to describe the delay dynamics. Last but not least, jointly optimizing all the parameters θ for every s would be an important addition to GPyro. This could be achieved either by using the reparametrization trick, a technique commonly used in deep Gaussian processes (Salimbeni & Deisenroth, 2017), or by forming a custom-made likelihood function.

Conclusions
This paper introduces a novel generative model family, namely GPyro, and its application for predicting the temperature profiles induced by different path plans in WAAM. The following conclusions can be drawn from the previous analysis:

• GPyro is capable of learning probabilistic dynamical models from a single experiment, even in an online manner.

• Mixing simple physics-informed models with parametric surrogates wherever needed can drastically decrease the amount of data necessary for training.

• The training time of the model can be reduced to a few seconds by breaking down the optimization problem into pieces and then treating the learned parameters as samples from a joint distribution.

• Gaussian Processes can be employed to capture the time-variant nature of the heat equation. Moreover, these statistical models offer a confidence region bounding the predicted temperature evolutions.

• Modeling the interactions between the nodes of a grid allows the creation of sparse grids, the use of large timesteps, and a linear decrease in the computational complexity of integration schemes, rendering them many orders of magnitude faster than reality.
Funding Open access funding provided by Swiss Federal Institute of Technology Zurich

Declarations
Competing interest The authors have no competing interests to declare that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A: Dynamic time warping
Fig. 9 DTW alignment between measured and predicted temperature profiles. The two profiles are separated by an artificial offset for reasons of clarity. DTW finds and compares similar features, decoupling the error from time shifts. This allows modeling the absolute temperature error separately from the delays. Note that the dotted lines never cross each other due to the monotonicity of the alignment

Dynamic Time Warping (DTW) is a family of algorithms for comparing data with sequential dependencies, like timeseries. The main goal of DTW is to find the optimal alignment between two sequences by matching the indexes of similar features. The alignment is constrained to be monotonic, meaning that the order of indexes is maintained and that the warped index sequences never decrease. The output of DTW algorithms can be interpreted as the distance between the pairs of the most similar points. Figure 9 illustrates the alignment between a predicted and a measured temperature signal, with an additional offset introduced between the signals for increased visibility of the alignment. In this plot, the correspondence between the points of the two signals is annotated with black dotted lines. Regardless of the time lead present in the response, DTW manages to match the predicted and measured peak temperatures, as well as the corresponding values for the cooling and valley phases. This allows forming the relative temperature errors between similar points. In the present work, we use the absolute relative thermal errors after aligning the signals with DTW because this enables decoupling the thermal model's performance from the simple and weak delay model.
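For reference, a self-contained sketch of dynamic-programming DTW is given below, together with the mean absolute relative error computed on the aligned pairs, mirroring the evaluation metric used in the "Results" section. It is a didactic O(nm) implementation, not the package of Giorgino (2009) used in this work.

```python
import numpy as np

def dtw_alignment(x, y):
    """Classic dynamic-programming DTW between two 1D sequences.
    Returns the monotonic list of index pairs aligning x to y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to (0, 0) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin((D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def dtw_relative_error(pred, meas):
    """Mean absolute relative error over the DTW-aligned point pairs."""
    idx = dtw_alignment(pred, meas)
    p = np.array([pred[i] for i, _ in idx])
    m = np.array([meas[j] for _, j in idx])
    return np.mean(np.abs(p - m) / m)
```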